<a href="https://colab.research.google.com/github/Tiabet/BaekJoon/blob/main/DACON_Baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
%%capture
%pip install sentence-transformers

In [4]:
import re
import pandas as pd
import numpy as np
import random
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

In [5]:
SEED = 0

np.random.seed(SEED)
random.seed(SEED)

In [6]:
df = pd.read_csv('/content/drive/MyDrive/news.csv')
df.head()

Unnamed: 0,id,title,contents
0,NEWS_00000,Spanish coach facing action in race row,MADRID (AFP) - Spanish national team coach Lui...
1,NEWS_00001,Bruce Lee statue for divided city,"In Bosnia, where one man #39;s hero is often a..."
2,NEWS_00002,Only Lovers Left Alive's Tilda Swinton Talks A...,Yasmine Hamdan performs 'Hal' which she also s...
3,NEWS_00003,Macromedia contributes to eBay Stores,Macromedia has announced a special version of ...
4,NEWS_00004,Qualcomm plans to phone it in on cellular repairs,Over-the-air fixes for cell phones comes to Qu...


In [7]:
# Title + contents
df['text'] = df['title'] + ' : ' + df['contents']
df['text']

0        Spanish coach facing action in race row : MADR...
1        Bruce Lee statue for divided city : In Bosnia,...
2        Only Lovers Left Alive's Tilda Swinton Talks A...
3        Macromedia contributes to eBay Stores : Macrom...
4        Qualcomm plans to phone it in on cellular repa...
                               ...                        
59995    Dolphins Break Through, Rip Rams For First Win...
59996    After Steep Drop, Price of Oil Rises : The fre...
59997    Pro football: Culpepper puts on a show : To sa...
59998    Albertsons on the Rebound : The No. 2 grocer r...
59999    Cassini Craft Spies Saturn Moon Dione (AP) : A...
Name: text, Length: 60000, dtype: object

In [41]:
def preprocess_text(text):
    if not isinstance(text, str):
        return text  # If the input is not a string, return it as is
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Remove hashtags
    text = re.sub(r'#\w+', '', text)

    # Remove mentions
    text = re.sub(r'@\w+', '', text)

    # Remove emoji (this drops every non-ASCII character)
    text = text.encode('ascii', 'ignore').decode('ascii')

    # Collapse runs of whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove digits, then drop everything from a ':' followed by '//' to the end
    text = re.sub(r'\d+', '', text)
    text = re.sub(r':\s*//.*$', '', text)

    return text.lower()
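
The sample rows contain HTML-entity residue such as `#39;s`, which the hashtag rule above reduces to `;s`, and removing digits after the whitespace step can leave double or trailing spaces. A minimal sketch of one possible variant follows. `preprocess_text_v2` is a hypothetical name, and treating `#39;` as a damaged `&#39;` entity is an assumption about the dataset, not something the notebook confirms. Switching to it would change the embeddings and clusters, so the outputs recorded in this notebook correspond to the original function only.

```python
import html
import re


def preprocess_text_v2(text):
    """Hypothetical variant of preprocess_text; not used for the recorded results."""
    if not isinstance(text, str):
        return text  # leave non-string values untouched

    # Assumption: "#39;" is a damaged "&#39;" entity, so restore the "&" first.
    text = re.sub(r'(?<!&)#(\d+);', r'&#\1;', text)
    text = html.unescape(text)

    text = re.sub(r'https?://\S+|www\.\S+', '', text)       # URLs
    text = re.sub(r'[#@]\w+', '', text)                     # hashtags and mentions
    text = text.encode('ascii', 'ignore').decode('ascii')   # drop non-ASCII characters
    text = re.sub(r'\d+', '', text)                         # digits

    # Collapse whitespace last so earlier removals cannot leave gaps.
    return re.sub(r'\s+', ' ', text).strip().lower()


print(preprocess_text_v2("one man #39;s hero, seen 3 times at www.example.com"))
```

Running the whitespace step last is the main design choice here: every earlier rule can only delete characters, so a single final pass is enough to tidy the result.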

In [42]:
df['processed_text'] = df['text'].apply(preprocess_text)

In [44]:
df['processed_text'][18]

'a fair way to choose candidates for republican debate '

In [45]:
# Load the Sentence-BERT model
model = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Extract text features
sentence_embeddings = model.encode(df['processed_text'].tolist())

# Store the extracted features in a DataFrame
df_embeddings = pd.DataFrame(sentence_embeddings)

In [12]:
df_embeddings.to_csv("/content/drive/MyDrive/embedding_file.csv", index=False)

In [46]:
# Cluster the articles using the Sentence-BERT embeddings
kmeans = KMeans(n_clusters=6, random_state=SEED)

df['kmeans_cluster'] = kmeans.fit_predict(sentence_embeddings)



In [59]:
df[df['kmeans_cluster'] == 5]['text'].head(5)

0     Spanish coach facing action in race row : MADR...
13    GAME DAY PREVIEW Game time: 6:00 PM : CHARLOTT...
21    Blake Leeper Wants to Be the First American Pa...
22    College Basketball: Georgia Tech, UConn Win : ...
26    Doping case was flawed, report finds : MONTREA...
Name: text, dtype: object
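
KMeans cluster ids are arbitrary, so each cluster has to be read before it can be assigned a category. As a rough sketch, the helper below gathers the first few texts of every cluster in one pass instead of filtering one cluster at a time. `sample_per_cluster` is a hypothetical name, and the toy lists stand in for `df['text'].tolist()` and `df['kmeans_cluster'].tolist()`.

```python
from collections import defaultdict


def sample_per_cluster(texts, labels, k=5):
    """Return {cluster_id: first k texts} for parallel lists of texts and labels."""
    samples = defaultdict(list)
    for text, label in zip(texts, labels):
        if len(samples[label]) < k:
            samples[label].append(text)
    return dict(samples)


# Toy data standing in for the notebook's columns (illustrative only).
texts = ["coach row", "chip launch", "game preview", "oil price", "doping case"]
labels = [5, 2, 5, 0, 5]

for cluster_id, examples in sorted(sample_per_cluster(texts, labels, k=2).items()):
    print(cluster_id, examples)
```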

In [58]:
print(df['text'][1])
print(df['text'][8])
print(df['text'][16])
print(df['text'][29])
print(df['text'][34])

Bruce Lee statue for divided city : In Bosnia, where one man #39;s hero is often another man #39;s villain, some citizens have decided to honour one whom Serbs, Croats and Muslims can all look up to - the kung fu great Bruce Lee.
Obama Marks Anniversary Of 9/11 Attacks With Moment Of Silence authors : We stand as strong as ever.
Fischer's Fiancee: Marriage Plans Genuine (AP) : AP - Former chess champion Bobby Fischer's announcement thathe is engaged to a Japanese woman could win him sympathy among Japanese officials and help him avoid deportation to the United States, his fiancee and one of his supporters said Tuesday.
Israel Kills 3 Palestinians in Big Gaza Incursion (Reuters) : Reuters - Israeli forces killed three\Palestinians, including two teenagers, on Wednesday after\storming into the northern Gaza Strip for the third time in as\many months to quell Palestinian rocket fire into Israel.
The Folly of the Sole Superpower Writ Small authors : Think of this as a little imperial folly

In [61]:
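# KMeans cluster id -> submission category id
mapping_dict = {
    0: 0,  # Business
    1: 5,  # World
    2: 4,  # Tech
    3: 2,  # Politics
    4: 1,  # Entertainment
    5: 3,  # Sports (the cluster 5 samples printed earlier are sports stories)
}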
df['mapping'] = df['kmeans_cluster'].apply(lambda x: mapping_dict[x])
sample = pd.read_csv('/content/drive/MyDrive/sample_submission.csv')
sample['category'] = df['mapping'].values
sample.to_csv('/content/drive/MyDrive/baseline_submit.csv', index=False)
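
Because `mapping_dict` is written by hand after reading cluster samples, a typo can silently merge two clusters into one category. A minimal sketch of a sanity check follows, assuming the six categories are numbered 0 to 5 as the mapping suggests. `check_mapping` is a hypothetical helper and not part of the notebook.

```python
def check_mapping(mapping, n_clusters=6):
    """Raise ValueError unless mapping is one-to-one over 0..n_clusters-1."""
    expected = set(range(n_clusters))
    if set(mapping) != expected:
        raise ValueError(f"cluster ids should be {sorted(expected)}, got {sorted(mapping)}")
    if set(mapping.values()) != expected:
        raise ValueError(
            f"category ids should be {sorted(expected)}, got {sorted(mapping.values())}"
        )
    return True


mapping_dict = {0: 0, 1: 5, 2: 4, 3: 2, 4: 1, 5: 3}  # values copied from the cell above
check_mapping(mapping_dict)
```

Calling the check before writing the submission file turns a mislabelled cluster into an immediate error rather than a lower score.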