<a href="https://colab.research.google.com/github/SOOB2NHO/KU_COSE471_SUMALSA/blob/torchroh/sentiment_youtube.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from transformers import pipeline, AutoTokenizer
import pandas as pd
from tqdm import tqdm
import torch

# 1) Device setup: use the first GPU if available, otherwise the CPU
device = 0 if torch.cuda.is_available() else -1
print("사용 디바이스:", "GPU" if device == 0 else "CPU")

# 2) Load the tokenizer once and reuse it in the sentiment-analysis pipeline
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentiment_pipeline = pipeline("sentiment-analysis", model=model_name, tokenizer=tokenizer, device=device)

# 3) Load the data (adjust the CSV path to your Colab or local environment)
df = pd.read_csv("/content/cluster_topic_summary_2.csv")  # path of the file uploaded to Colab

# 4) Preprocessing: drop rows with a missing 'comment_text' and cast the column to str
df = df.dropna(subset=["comment_text"]).reset_index(drop=True)
df["comment_text"] = df["comment_text"].astype(str)

# 5) Sentiment analysis per cluster
results = []
for cluster in df["candidate_cluster"].unique():
    subset = df[df["candidate_cluster"] == cluster]
    pos = neg = neu = 0

    for comment in tqdm(subset["comment_text"], desc=f"{cluster} 처리 중"):
        if not isinstance(comment, str):
            continue

        # Truncate by token count (at most 512 tokens), then decode back to text
        encoded = tokenizer.encode(comment, truncation=True, max_length=512)
        text = tokenizer.decode(encoded, skip_special_tokens=True)

        if not text.strip():
            continue

        try:
            out = sentiment_pipeline(text)[0]["label"]
            # Labels look like "4 stars": 4-5 count as positive, 1-2 as negative, 3 as neutral
            stars = int(out.split()[0])
            if stars >= 4:
                pos += 1
            elif stars <= 2:
                neg += 1
            else:
                neu += 1
        except Exception as e:
            print("오류 발생:", e)
            continue

    total = pos + neg + neu
    results.append({
        "클러스터":         cluster,
        "처리된 댓글 수":   total,
        "긍정 비율 (%)":    round(pos / total * 100, 2) if total else 0.0,
        "부정 비율 (%)":    round(neg / total * 100, 2) if total else 0.0,
        "중립 비율 (%)":    round(neu / total * 100, 2) if total else 0.0,
    })

# 6) Save the results
res_df = pd.DataFrame(results)
res_df.to_csv("sentiment_by_cluster.csv", index=False, encoding="utf-8-sig")
print("CSV 파일 저장 완료: sentiment_by_cluster.csv")


사용 디바이스: GPU


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0
윤석열 처리 중:   0%|          | 10/10063 [00:00<08:10, 20.50it/s]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
윤석열 처리 중: 100%|██████████| 10063/10063 [01:45<00:00, 95.34it/s]
이재명 처리 중: 100%|██████████| 5666/5666 [00:57<00:00, 99.15it/s] 
안철수 처리 중: 100%|██████████| 4271/4271 [00:42<00:00, 100.96it/s]

CSV 파일 저장 완료: sentiment_by_cluster.csv
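
The log above hints that calling the pipeline once per comment is not the most efficient way to use a GPU. One possible way to act on that hint is sketched below: pass the whole list of comments in a single call and let the pipeline handle batching and truncation. This is a minimal sketch, not the notebook's method. The names `stars_to_bucket`, `count_sentiments`, and `to_percentages` are hypothetical helpers, `batch_size=32` is an arbitrary illustrative value, and `classifier` is assumed to behave like the `sentiment_pipeline` object from the cell above (called with a list of strings, it returns a list of dicts with a `label` key).

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def stars_to_bucket(label: str) -> str:
    """Map a label such as '4 stars' to 'pos', 'neg', or 'neu'.

    Uses the same thresholds as the cell above: >= 4 positive, <= 2 negative.
    """
    stars = int(label.split()[0])
    if stars >= 4:
        return "pos"
    if stars <= 2:
        return "neg"
    return "neu"


def count_sentiments(classifier: Callable, comments: Iterable, batch_size: int = 32) -> Counter:
    """Classify all non-empty string comments in one call and count the buckets.

    Assumption: `classifier` accepts a list of strings plus the keyword
    arguments below, as a Hugging Face text-classification pipeline does.
    """
    texts = [c for c in comments if isinstance(c, str) and c.strip()]
    if not texts:
        return Counter()
    # truncation/max_length are forwarded to the tokenizer, which stands in
    # for the manual encode/decode round trip used in the cell above.
    outputs = classifier(texts, batch_size=batch_size, truncation=True, max_length=512)
    return Counter(stars_to_bucket(o["label"]) for o in outputs)


def to_percentages(counts: Counter) -> Dict[str, float]:
    """Convert bucket counts to percentages rounded to two decimals."""
    total = sum(counts.values())
    return {
        key: round(counts[key] / total * 100, 2) if total else 0.0
        for key in ("pos", "neg", "neu")
    }
```

Under these assumptions, `count_sentiments(sentiment_pipeline, subset["comment_text"])` would stand in for the inner loop of step 5, and `to_percentages` for the ratio computation. Unlike the original loop, this sketch shows no per-cluster progress bar and does not catch exceptions per comment, so a single failing input aborts the whole call.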



