# Monthly Dacon Shorts - News Article Label Recovery Hackathon


*   Competition link: https://dacon.io/competitions/official/236159/overview/description

### [Topic]
Emergency label recovery: classifying news data into 6 categories

### [Competition Overview]
Algorithm | Language | Classification | Clustering | Labeling | Macro F1 Score
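
Since submissions are scored with Macro F1, a small worked example may help show what the metric rewards. The sketch below uses made-up labels for three classes (not competition data) and checks that the unweighted mean of per-class F1 scores matches scikit-learn's `average="macro"`.

```python
from sklearn.metrics import f1_score

# Toy labels for three classes (illustrative values, not competition data)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# F1 for each class separately, then the unweighted mean over classes
per_class = f1_score(y_true, y_pred, average=None)
macro_manual = per_class.mean()

# scikit-learn's built-in macro average should agree with the manual mean
macro_sklearn = f1_score(y_true, y_pred, average="macro")

print(per_class, macro_manual, macro_sklearn)
```

Because every class carries the same weight, a small cluster that is mapped to the wrong category costs as much as a large one.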

### [Dataset]
* 'id': unique id of the news article
* 'title': title of the news article
* 'contents': body of the news article

## Import

* Install sentence-transformers
* Load the basic libraries

In [None]:
!pip install sentence-transformers


In [None]:
import re
import pandas as pd
import numpy as np
import random
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

## Random Seed

* Fix the random seeds

In [None]:
SEED = 0

np.random.seed(SEED)
random.seed(SEED)

## Load Data

* Load the data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/교육/새싹3기/데이터공유/news.csv')
df.head()

Unnamed: 0,id,title,contents
0,NEWS_00000,Spanish coach facing action in race row,MADRID (AFP) - Spanish national team coach Lui...
1,NEWS_00001,Bruce Lee statue for divided city,"In Bosnia, where one man #39;s hero is often a..."
2,NEWS_00002,Only Lovers Left Alive's Tilda Swinton Talks A...,Yasmine Hamdan performs 'Hal' which she also s...
3,NEWS_00003,Macromedia contributes to eBay Stores,Macromedia has announced a special version of ...
4,NEWS_00004,Qualcomm plans to phone it in on cellular repairs,Over-the-air fixes for cell phones comes to Qu...


In [None]:
df.shape

(60000, 3)

In [None]:
# Title + contents
df['text'] = df['title'] + ' : ' + df['contents']
df.head()

Unnamed: 0,id,title,contents,text
0,NEWS_00000,Spanish coach facing action in race row,MADRID (AFP) - Spanish national team coach Lui...,Spanish coach facing action in race row : MADR...
1,NEWS_00001,Bruce Lee statue for divided city,"In Bosnia, where one man #39;s hero is often a...","Bruce Lee statue for divided city : In Bosnia,..."
2,NEWS_00002,Only Lovers Left Alive's Tilda Swinton Talks A...,Yasmine Hamdan performs 'Hal' which she also s...,Only Lovers Left Alive's Tilda Swinton Talks A...
3,NEWS_00003,Macromedia contributes to eBay Stores,Macromedia has announced a special version of ...,Macromedia contributes to eBay Stores : Macrom...
4,NEWS_00004,Qualcomm plans to phone it in on cellular repairs,Over-the-air fixes for cell phones comes to Qu...,Qualcomm plans to phone it in on cellular repa...


## Pre-processing

* Remove unnecessary characters from the news articles

In [None]:
def preprocess_text(text):
    # Remove URLs (`http\S+` also covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove hashtags
    text = re.sub(r'#\w+', '', text)

    # Remove mentions
    text = re.sub(r'@\w+', '', text)

    # Drop every non-ASCII character (this removes emoji, among others)
    text = text.encode('ascii', 'ignore').decode('ascii')

    # Remove digits
    text = re.sub(r'\d+', '', text)

    # Collapse runs of whitespace last, so the removals above leave no double spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text.lower()

In [None]:
df['processed_text'] = df['text'].apply(preprocess_text)
df.head()

Unnamed: 0,id,title,contents,text,processed_text
0,NEWS_00000,Spanish coach facing action in race row,MADRID (AFP) - Spanish national team coach Lui...,Spanish coach facing action in race row : MADR...,spanish coach facing action in race row : madr...
1,NEWS_00001,Bruce Lee statue for divided city,"In Bosnia, where one man #39;s hero is often a...","Bruce Lee statue for divided city : In Bosnia,...","bruce lee statue for divided city : in bosnia,..."
2,NEWS_00002,Only Lovers Left Alive's Tilda Swinton Talks A...,Yasmine Hamdan performs 'Hal' which she also s...,Only Lovers Left Alive's Tilda Swinton Talks A...,only lovers left alive's tilda swinton talks a...
3,NEWS_00003,Macromedia contributes to eBay Stores,Macromedia has announced a special version of ...,Macromedia contributes to eBay Stores : Macrom...,macromedia contributes to ebay stores : macrom...
4,NEWS_00004,Qualcomm plans to phone it in on cellular repairs,Over-the-air fixes for cell phones comes to Qu...,Qualcomm plans to phone it in on cellular repa...,qualcomm plans to phone it in on cellular repa...
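
The sample rows contain fragments such as `#39;s` and `quot;`, which look like HTML entities (`&#39;`, `&quot;`) that lost their leading `&`. That reading is an assumption, not something the dataset description confirms. If it holds, the hashtag pattern `#\w+` turns `man #39;s` into `man ;s`. The helper below, `repair_entities`, is a hypothetical addition that restores those characters and would need to run before the hashtag step.

```python
import re

def repair_entities(text: str) -> str:
    """Restore characters from entity fragments (hypothetical helper)."""
    # "#39;" is assumed to be what is left of "&#39;" (an apostrophe);
    # the optional leading space is dropped so "man #39;s" becomes "man's"
    text = re.sub(r' ?#(\d+);', lambda m: chr(int(m.group(1))), text)
    # "quot;" is assumed to be what is left of "&quot;"
    text = re.sub(r' ?\bquot;', '"', text)
    return text

sample_text = "In Bosnia, where one man #39;s hero is often another man #39;s villain"
print(repair_entities(sample_text))
```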


## Feature Extraction

* Extract features with the all-distilroberta-v1 model
* Apply dimensionality reduction

In [None]:
# Load the Sentence-BERT model
# model = SentenceTransformer('paraphrase-distilroberta-base-v1')
model = SentenceTransformer('all-distilroberta-v1')

# Extract text features
# Note: the raw `text` column is encoded here, not `processed_text`
sentence_embeddings = model.encode(df['text'].tolist())

# Store the extracted features in a DataFrame
df_embeddings = pd.DataFrame(sentence_embeddings)
df_embeddings.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.066443,-0.009704,-0.0187,0.031596,-0.023231,0.01843,0.018347,-0.002112,-0.017111,-0.043926,...,0.010557,-0.023332,-0.009371,-0.036035,0.025629,0.01495,-0.054599,0.043132,-0.003193,-0.031453
1,0.05028,-0.01476,-0.047576,-0.024381,-0.000225,-0.052771,0.015267,-0.05194,-0.008074,-0.007261,...,0.038694,0.004903,0.012976,0.007488,0.000195,0.024938,0.038839,0.043692,0.009458,0.05111
2,0.01194,-0.038582,0.015936,-0.027933,0.000268,0.073231,-0.034987,0.062235,0.030132,0.066497,...,-0.021217,-0.012434,0.035116,0.023837,-0.051077,-0.028438,0.012292,-0.020587,-0.029713,0.017502
3,-0.018634,-0.014005,0.005451,0.045944,-0.014108,0.011596,-0.027887,-0.01384,0.032961,-0.038505,...,-0.017245,0.047654,-0.014083,-0.048169,-0.00219,-0.024481,0.037928,-0.03271,-0.013198,0.022344
4,0.010623,0.051811,-0.000796,-0.026437,0.003134,-0.028429,-0.027274,-0.037995,0.055954,-0.053387,...,0.013502,-0.011402,-0.015648,0.063833,0.045138,-0.00522,-0.022275,-0.055201,0.007243,0.033584


In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=20)
df_pca = pca.fit_transform(df_embeddings)

df_pca = pd.DataFrame(df_pca)
df_pca.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.261294,-0.050943,0.254612,0.120213,-0.053299,0.330738,-0.036105,-0.124505,0.147887,0.163435,-0.189294,0.08508,0.009385,-0.098761,-0.062439,-0.096182,-0.082206,0.106746,-0.096542,0.053534
1,0.208853,-0.14443,0.019764,0.172033,-0.236044,-0.046719,0.042574,-0.022877,0.025408,0.132083,0.06254,-0.028513,0.076663,-0.113936,-0.049879,-0.071599,-0.073655,-0.042672,-0.034817,0.248059
2,0.096916,-0.087075,0.120438,-0.085133,-0.21371,-0.115029,0.040824,-0.131517,0.003771,-0.149719,0.010515,-0.078722,-0.162357,0.140309,-0.070986,0.007209,0.098009,0.000741,0.076391,-0.143026
3,0.019194,0.188121,-0.081461,-0.114992,-0.058856,0.068025,0.061425,0.040316,-0.083597,-0.01704,-0.001738,0.053039,0.006455,-0.087466,-0.173769,0.064797,0.170874,0.075265,0.018699,-0.048051
4,0.107301,0.285064,-0.054065,0.022615,-0.113391,0.10488,-0.030273,0.176466,-0.003632,-0.161788,-0.031317,-0.047511,0.045271,0.136564,0.108728,-0.113192,-0.000445,-0.134181,0.096187,0.104194
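
The choice of 20 components is not justified in the notebook. One way to sanity-check such a choice is to look at the cumulative explained variance. The sketch below runs on random stand-in data, not the real embeddings, so the printed number says nothing about this dataset. It only shows the calls involved.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the (n_samples, 768) embedding matrix; random data, for illustration only
X = rng.normal(size=(500, 64))

pca = PCA(n_components=20, random_state=0)
X_reduced = pca.fit_transform(X)

# cumulative[k-1] is the fraction of total variance kept by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(X_reduced.shape)
print(cumulative[-1])
```

On the real embeddings, a low value of `cumulative[-1]` would suggest trying more components before clustering.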


## Clustering

* Cluster samples with similar features

In [None]:
# Cluster the PCA-reduced Sentence-BERT embeddings with K-Means
kmeans = KMeans(n_clusters=6, random_state=SEED)

df['kmeans_cluster'] = kmeans.fit_predict(df_pca)
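
After fitting, a quick look at the cluster sizes can reveal a degenerate split, such as one huge cluster and several tiny ones. The sketch below uses synthetic blobs from `make_blobs` as a stand-in for the reduced embeddings. It also sets `n_init` explicitly, because the default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 20-dimensional PCA features (illustration only)
X, _ = make_blobs(n_samples=600, centers=6, n_features=20, random_state=0)

# n_init is given explicitly so results do not depend on the library default
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Number of samples assigned to each of the 6 clusters
sizes = np.bincount(labels, minlength=6)
print(sizes)
```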



In [None]:
# Cluster the Sentence-BERT embeddings with a Gaussian mixture model
# (GaussianMixture takes `n_components`, not `n_clusters`)
gmm = GaussianMixture(n_components=6, random_state=SEED)

df['gmm_cluster'] = gmm.fit_predict(sentence_embeddings)
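
The notebook fits two clusterers but does not compare them. One possible check is the adjusted Rand index, which measures how far two labelings agree regardless of how the cluster ids are numbered. The sketch below is an assumption about how such a comparison could look. It runs on synthetic blobs, not the news embeddings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (illustration only)
X, _ = make_blobs(n_samples=600, centers=6, n_features=20, random_state=0)

kmeans_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=6, random_state=0).fit_predict(X)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(kmeans_labels, gmm_labels)
print(ari)
```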

## Post-processing

* Inspect each cluster and link it to a category so the mapping can be applied
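
Besides reading the first rows of each cluster, another option is to read the texts nearest each centroid, since these tend to be the most typical members. The sketch below is not what the notebook does. It uses synthetic features and placeholder strings to show the idea with `KMeans.transform`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic features and placeholder texts (illustration only)
X, _ = make_blobs(n_samples=300, centers=6, n_features=20, random_state=0)
texts = np.array([f"doc {i}" for i in range(len(X))])

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)

# transform() returns the distance from every sample to every centroid
distances = kmeans.transform(X)

representatives = {}
for cluster_id in range(6):
    # Indices of the 3 samples closest to this centroid
    nearest = np.argsort(distances[:, cluster_id])[:3]
    representatives[cluster_id] = texts[nearest].tolist()
    print(cluster_id, representatives[cluster_id])
```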

### Sports: 0 -> 3

In [None]:
df[df['kmeans_cluster'] == 0]['text'].head(10)

3     Macromedia contributes to eBay Stores : Macrom...
4     Qualcomm plans to phone it in on cellular repa...
5     Thomson to Back Both Blu-ray and HD-DVD : Comp...
23    FTC Files First Lawsuit Against Spyware Concer...
31    Sony PSP Draws Crowds and Lines on First Day (...
35    Is E-Voting Secure? : (CBS) Nearly one third o...
37    Deep Impact Space Probe Aims to Slam Into Come...
40    Out for V-I-C-T-O-R-Y, but Missing Tiles : Mis...
41    Photos from MacExpo 2004 : With over 100 exhib...
50    UN Predicts Boom In Robot Labor : The use of r...
Name: text, dtype: object

In [None]:
print(df['text'][1])
print(df['text'][10])
print(df['text'][16])

Bruce Lee statue for divided city : In Bosnia, where one man #39;s hero is often another man #39;s villain, some citizens have decided to honour one whom Serbs, Croats and Muslims can all look up to - the kung fu great Bruce Lee.
Harry #39;s argy-bargy : PRINCE Charles has asked Scotland Yard for an in-depth report on his son Harry #39;s trip to Argentina after reports of excessive drinking and a kidnap plot.
Fischer's Fiancee: Marriage Plans Genuine (AP) : AP - Former chess champion Bobby Fischer's announcement thathe is engaged to a Japanese woman could win him sympathy among Japanese officials and help him avoid deportation to the United States, his fiancee and one of his supporters said Tuesday.


### Tech: 1 -> 4

In [None]:
df[df['kmeans_cluster'] == 1]['text'].head(10)

3     Macromedia contributes to eBay Stores : Macrom...
4     Qualcomm plans to phone it in on cellular repa...
5     Thomson to Back Both Blu-ray and HD-DVD : Comp...
23    FTC Files First Lawsuit Against Spyware Concer...
31    Sony PSP Draws Crowds and Lines on First Day (...
35    Is E-Voting Secure? : (CBS) Nearly one third o...
37    Deep Impact Space Probe Aims to Slam Into Come...
40    Out for V-I-C-T-O-R-Y, but Missing Tiles : Mis...
41    Photos from MacExpo 2004 : With over 100 exhib...
50    UN Predicts Boom In Robot Labor : The use of r...
Name: text, dtype: object

In [None]:
print(df['text'][0])
print(df['text'][13])
print(df['text'][22])

Spanish coach facing action in race row : MADRID (AFP) - Spanish national team coach Luis Aragones faces a formal investigation after Spain #39;s Football Federation decided to open disciplinary proceedings over racist comments about Thierry Henry of France and Arsenal.
GAME DAY PREVIEW Game time: 6:00 PM : CHARLOTTE, North Carolina (Ticker) -- The Detroit Shock face a critical road test Saturday when they take on the Charlotte Sting at Charlotte Coliseum.
College Basketball: Georgia Tech, UConn Win : ATLANTA (Sports Network) - BJ Elder poured in a game-high 27 points to lead fourth-ranked Georgia Tech to a convincing 99-68 win over Michigan in the ACC-Big Ten Challenge at Alexander Memorial Coliseum.


### Politics: 2 -> 2

In [None]:
df[df['kmeans_cluster'] == 2]['text'].head(10)

8     Obama Marks Anniversary Of 9/11 Attacks With M...
9     Republican Congressman Says Trump Should Apolo...
11    Kerry rolls out tax-cut plan for middle class ...
12    Read Live Updates From The South Carolina Demo...
14    Obama Administration Helps Wall Street Crimina...
15    It's Not As Easy As You Think To Spot A Gerrym...
17    Parents Of School Shooting Victims Decry 'Moro...
18    A Fair Way to Choose Candidates for Republican...
32    Sunday Show Hosts Hit Back On Trump Administra...
33    Memo To EPA Chief Pruitt : //www.huffingtonpos...
Name: text, dtype: object

In [None]:
print(df['text'][2])
print(df['text'][6])
print(df['text'][7])

Only Lovers Left Alive's Tilda Swinton Talks About Almost Quitting Acting and Yasmine Hamdan Performs 'Hal' Live In NYC   (HuffPo Exclusive Videos) authors : Yasmine Hamdan performs 'Hal' which she also sings in the film during a scene when two world-weary vampires begin to heal and find a way to continue living as they remember the power and mystery of creation itself.
Time to Talk Baseball : It's time to talk about the serious risks and potential benefits of building an expensive ballpark in Washington.
Bump Stock Maker Resumes Sales One Month After Las Vegas Mass Shooting authors : Move along nothing to see here.


### World: 3 -> 5

In [None]:
df[df['kmeans_cluster'] == 3]['text'].head(10)

1     Bruce Lee statue for divided city : In Bosnia,...
29    Israel Kills 3 Palestinians in Big Gaza Incurs...
34    The Folly of the Sole Superpower Writ Small au...
49    Bribery Considered, Halliburton Notes Suggest ...
56    Sadr #39;s aide denies entering of Iraqi polic...
57    Former Nazi Guard Loses Canadian Court Ruling ...
59    Afghanistan Death Toll in 2004 Up to 957 : KAN...
60    Portugal PM, Cabinet Submit Resignations : LIS...
61    Typhoon-Like Gusts Hit Japan; 13 Injured : TOK...
63    Family appeals for release of UK hostage : The...
Name: text, dtype: object

In [None]:
print(df['text'][11])
print(df['text'][20])
print(df['text'][50])

Kerry rolls out tax-cut plan for middle class : After two weeks of focusing on Iraq, Democratic presidential challenger John Kerry turned his emphasis to the economy Saturday, delivering what he called a plan for  quot;middle-class families.
Deere's Color Is Green : With big tractors, big sales, and big earnings, Deere's hoeing a profitable row.
UN Predicts Boom In Robot Labor : The use of robots around the home to mow lawns, vacuum floors and manage other chores is set to surge sevenfold by 2007 as more consumers snap up smart machines, the United Nations said.


### Entertainment: 4 -> 1

In [None]:
df[df['kmeans_cluster'] == 4]['text'].head(10)

2     Only Lovers Left Alive's Tilda Swinton Talks A...
10    Harry #39;s argy-bargy : PRINCE Charles has as...
16    Fischer's Fiancee: Marriage Plans Genuine (AP)...
25    Be on TOP : //www.huffingtonpost.com/entry/be-...
28    Cate Blanchett Set To Star As Lucille Ball In ...
45    The Trouble with Broadcasting in a Social Worl...
62    John Waters' Women at the Film Society of Linc...
64    Jon Voight Is 'Concerned' About Daughter Angel...
80    Robert Redford Sidesteps Oscars Controversy Bu...
84    The Man Who Grasped the Heavens' Gravitas : Th...
Name: text, dtype: object

In [None]:
print(df['text'][3])
print(df['text'][4])
print(df['text'][5])

Macromedia contributes to eBay Stores : Macromedia has announced a special version of its Contribute website editing application designed to simplify the creation and customisation of eBay Stores.
Qualcomm plans to phone it in on cellular repairs : Over-the-air fixes for cell phones comes to Qualcomm's CDMA.
Thomson to Back Both Blu-ray and HD-DVD : Company, one of the core backers of Blu-ray, will also support its rival format.


### Business: 5 -> 0

In [None]:
df[df['kmeans_cluster'] == 5]['text'].head(10)

7      Bump Stock Maker Resumes Sales One Month After...
19     Congress Spikes Handout For Private Equity aut...
20     Deere's Color Is Green : With big tractors, bi...
27     Kmart-Sears merger about price, quality : Aver...
51     Oil Falls Below \$49 on Nigeria Cease-Fire : L...
70     ABN Amro Profit Rises, Buoyed by Sale of Asia ...
85     Stocks to Open Higher on Growth Outlook : NEW ...
98     Producer Prices Up 0.1 Pct, Energy Drops (Reut...
99     Rigel, Merck Form Development Partnership : NE...
100    GM, DaimlerChrysler to develop hybrid engines ...
Name: text, dtype: object

In [None]:
print(df['text'][18])
print(df['text'][25])
print(df['text'][33])

A Fair Way to Choose Candidates for Republican Debate : //www.huffingtonpost.com/entry/a-fair-way-to-choose-cand_b_7922194.html short_description
Be on TOP : //www.huffingtonpost.com/entry/be-on-top-amazon-best-sel_b_12508618.html short_description
Memo To EPA Chief Pruitt : //www.huffingtonpost.com/entry/memo-to-epa-chief-pruitt-lets-end-subsidies-for-fossil_us_59ee9567e4b0b8a51417bcc6 short_description


### Mapping

* Apply the mapping

In [None]:
mapping_dict = {
    0: 3,
    1: 4,
    2: 2,
    3: 5,
    4: 1,
    5: 0
}
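
A hand-written mapping is easy to get wrong, for example by assigning two clusters to the same category. The sketch below checks that the dictionary is a one-to-one mapping over the labels 0 to 5 and that no cluster id is left unmapped. The `clusters` series is a toy stand-in for `df['kmeans_cluster']`.

```python
import pandas as pd

mapping_dict = {0: 3, 1: 4, 2: 2, 3: 5, 4: 1, 5: 0}

# Every cluster id appears once as a key, every category once as a value
assert sorted(mapping_dict.keys()) == list(range(6))
assert sorted(mapping_dict.values()) == list(range(6))

# Toy stand-in for df['kmeans_cluster']
clusters = pd.Series([0, 1, 2, 3, 4, 5, 0])

# Series.map yields NaN for ids missing from the dict, so check for that
categories = clusters.map(mapping_dict)
assert categories.notna().all()
print(categories.tolist())
```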

In [None]:
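# `mapping` is not defined in the cells shown; it is assumed here to be the same
# cluster -> category mapping, applied with Series.map
df['mapping'] = df['kmeans_cluster'].map(mapping_dict)
df['mapping2'] = df['kmeans_cluster'].apply(lambda x: mapping_dict[x])

In [None]:
# Rows where the two mappings disagree (none expected)
df[df[['mapping', 'mapping2']].std(axis=1) != 0]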

Unnamed: 0,id,title,contents,text,processed_text,kmeans_cluster,mapping,mapping2


## Submission

In [None]:
sample = pd.read_csv('/content/drive/MyDrive/교육/새싹3기/데이터공유/sample_submission.csv')

In [None]:
sample['category'] = df['mapping'].values
sample['category'].head()

0    3
1    5
2    1
3    4
4    4
Name: category, dtype: int64

In [None]:
sample.to_csv('baseline_submit.csv', index=False)
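
Before uploading, a few cheap checks can catch a misaligned or incomplete submission. The sketch below uses toy frames. The `id` column name and the label values are assumptions for illustration, not taken from the real `sample_submission.csv`.

```python
import pandas as pd

# Toy stand-ins for `sample` and the mapped labels (illustrative values only)
sample = pd.DataFrame({"id": ["NEWS_00000", "NEWS_00001", "NEWS_00002"]})
labels = pd.Series([3, 5, 1])

# Row counts must match, because labels are assigned by position via .values
assert len(sample) == len(labels), "row counts differ"
sample["category"] = labels.values

# Every label must be present and lie in the six-category range 0..5
assert sample["category"].notna().all(), "missing labels"
assert sample["category"].between(0, 5).all(), "label outside 0..5"

print(sample["category"].value_counts().sort_index())
```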