# News Article Analysis
- Uses news article data from BIGKinds (https://www.bigkinds.or.kr/)
- Building document-term vectors with CountVectorizer()
- The TF-IDF concept

In [None]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Download the news data
!curl https://raw.githubusercontent.com/StillWork/data/master/news.xlsx \
    -o news.xlsx

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1789k  100 1789k    0     0  1206k      0  0:00:01  0:00:01 --:--:-- 1206k


In [None]:
news_all = pd.read_excel("news.xlsx")
news_all.columns

Index(['뉴스 식별자', '일자', '언론사', '기고자', '제목', '통합 분류1', '통합 분류2', '통합 분류3',
       '사건/사고 분류1', '사건/사고 분류2', '사건/사고 분류3', '인물', '위치', '기관', '키워드', '특성추출',
       '본문', 'URL', '분석제외 여부'],
      dtype='object')

In [None]:
news_text = news_all['본문']  # '본문' is the article body column
news_text[:5]

0    - 비핵화 수준 상응 조치 놓고\n- 양국 협상팀 막판까지 ‘밀당’\n- 1차 때와...
1    김정은 국무위원장이 27일 시작되는 제2차 북미정상회담 성공을 위해 심혈을 기울이고...
2    북미가 처음으로 정상 간 단독회담과 만찬을 가지며 또다시 새로운 역사 창조에 나섰다...
3    지난해 9월 남북정상회담 당시 리선권 북한 조국평화통일위원장의 '냉면' 발언으로 정...
4    지자체 민간 교류 활성화 대부분 \n여, 부처간 논의 예산 지원 확대 \n야, 사업...
Name: 본문, dtype: object

In [None]:
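cv = CountVectorizer()
dtm = cv.fit_transform(news_text.tolist())
# Document-term matrix: one row per article, one column per token.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())
df[:5]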

Unnamed: 0,00,000원을,001420,001550,002100,005690,01,017800,02,025860,...,힌국당,힘겨루기로,힘겨웠던,힘들다고,힘들어,힘들어지는,힘을,힘이,힘입어,靈山
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
len(cv.vocabulary_)

16199

In [None]:
# Keep only the 2,000 most frequent tokens across the corpus
cv = CountVectorizer(max_features=2000)
dtm = cv.fit_transform(news_text.tolist())
df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())
df[:3]

Unnamed: 0,00,01,02,0px,10,100주년,100주년을,10시,10일,10일까지,...,회동을,회복,회의론을,회의를,회의에서,효과,효과를,후보지로,후속,힘을
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Sum each column to get corpus-wide counts, then pick the most frequent token
top_word = df.sum(axis=0)
top_word.idxmax()

'2차'

In [None]:
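'''
[TfidfVectorizer]
- Earlier notes: https://wiserloner.tistory.com/646
- TF-IDF addresses a weakness of raw counting: words that are frequent in every document
  get large counts even though they say little about any single document.
Basic concepts first.
- TF (Term Frequency): the number of times a term appears in a single document
- DF (Document Frequency): the number of documents in which a term appears
- IDF (Inverse Document Frequency): the inverse of DF (log-scaled in practice)
- TF-IDF: the product of TF and IDF, so the value grows as TF rises and DF falls.
    Put another way, a term that appears often within one document but rarely across the rest of the corpus
    is a discriminative feature for that document.
'''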
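
The definitions above can be checked numerically. The sketch below computes TF-IDF by hand on a made-up corpus and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings, where the IDF is smoothed as `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit L2 norm; this is a variant of the plain "TF times inverse DF" described above, not a different idea.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
toy_docs = ["summit talks summit", "summit dinner", "trade talks"]

# TF: raw counts per document
tf = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = tf.shape[0]

# DF: number of documents containing each term
doc_freq = (tf > 0).sum(axis=0)

# Smoothed IDF as used by scikit-learn's defaults
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF, then L2-normalize each row
manual = tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sk = TfidfVectorizer().fit_transform(toy_docs).toarray()
matches = np.allclose(manual, sk)
print(matches)
```

Because of the row normalization, TF-IDF values from `TfidfVectorizer` lie between 0 and 1 and are comparable across documents of different lengths.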

In [None]:
# tf: term frequency, idf: inverse document frequency
tv = TfidfVectorizer(max_features=2000)
dtm = tv.fit_transform(news_text.tolist())
df = pd.DataFrame(dtm.toarray(), columns=tv.get_feature_names_out())

In [None]:
df[:3]

Unnamed: 0,00,01,02,0px,10,100주년,100주년을,10시,10일,10일까지,...,회동을,회복,회의론을,회의를,회의에서,효과,효과를,후보지로,후속,힘을
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
