<a href="https://colab.research.google.com/github/EUNYUGNYU/SocialTextAnalysis_exmple_02/blob/main/TextAnalysis_class02_preknowledge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TextAnalysis Preknowledge
---
<br>

### Scope of Lecture 2

**Key terms**
 * Vectorization
 * TF-IDF
 * N-gram

---
*TextAnalysis_class02_preknowledge.ipynb*
- e-mail: aswe56@naver.com

## Before we begin...



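### Concepts and text preprocessing techniques covered in the first session
* Units of text (corpus, document, paragraph, etc.)
* Sentence splitting
* Morphological analysis
* Adding a user dictionary
* Stopword removal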

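### What we want to analyze after preprocessing
* Associated-word analysis: how are the words in a **document** related to one another?
* Topic modeling: what topics does a **document or paragraph** contain?
* Sentiment analysis: what sentiment does a **document or sentence** express? Positive or negative?
<br><br>
    **=> In each case the analysis targets a specific unit of text.**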
<br><br>

## Q. In what form should a unit of text (a document or a sentence) be handed to the computer so that it can be analyzed?

![image.png](attachment:image.png)

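### How a computer understands text => vectorization

- A way of representing a unit of text as numbers so that a machine can process it
- Usually expressed as a matrix


### Representing units of text as a matrix => count-based matrix (DTM, Document-Term Matrix)
- In text mining, each document is usually represented by the frequency of each word, or by a similar numeric score
- Rows: one per document, identified by the document's unique index
- Columns: one per distinct word (term); every word from every document is laid out side by side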

### Example
![image-3.png](attachment:image-3.png)
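
To make the row and column layout concrete, here is a minimal sketch that builds a small DTM by hand using only the standard library. The three toy documents are hypothetical and are not taken from the news dataset.

```python
from collections import Counter

# Toy documents, already tokenized into space-separated words (hypothetical data)
docs = [
    "robot industry robot",
    "industry market",
    "robot market market game",
]

# Columns: every distinct word across all documents, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Rows: one count vector per document
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Each row has one entry per vocabulary word, so a word that a document never uses still occupies a column and is stored as 0.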

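## Loading the data

**Dataset**
* Naver News articles crawled in 2017
* Columns include publication date, section, media outlet, title, article body, and link
* The dataset covers many sections, but to save time we use only 100 articles from the IT_science section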



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/B2C 리뷰그룹/news_articles.csv').dropna()
data.columns = ['date','section','media','title','news', 'site']
data.head()

Unnamed: 0,date,section,media,title,news,site
0,20170101,IT_science,디지털데일리,DD 주간브리핑 사상 최대 규모로 열리는 CES 2017 에 관심 고조,2017년의 첫 주가 시작됐다. 새해 벽두 건강한 긴장감이 넘친다. 새해 첫 주부터...,https://news.naver.com/main/read.nhn?mode=LSD&...
1,20170101,IT_science,SBS,최첨단 미래TV 시대 열린다…어떤 서비스 있을까,앵커 올해는 UHD 차세대 초고화질 방송 시대가 열리는 첫해입니다. 시청자들이 어떤...,https://news.naver.com/main/read.nhn?mode=LSD&...
2,20170101,IT_science,세계일보,심층기획 2020년 우주기술 개발 자립국 대열 합류,세계 최초 인공위성 스푸트니크 발사 60주년 한국 우주개발 역사와 현주소 우주개발 ...,https://news.naver.com/main/read.nhn?mode=LSD&...
3,20170101,IT_science,세계일보,심층기획 지구 밖으로 쏘아 올린 인간의 꿈… 우주개발 무한경쟁,세계 최초 인공위성 스푸트니크 발사 60주년 우주 알리는 신호탄 1957년 10월4...,https://news.naver.com/main/read.nhn?mode=LSD&...
4,20170101,IT_science,한국경제,호흡만으로 질병 진단하는 센서 층간소음 없애는 IoT기술 첫선,2017 한국 과학기술자들의 위대한 도전... 로켓 액체엔진 두번째 연소시험 성공 ...,https://news.naver.com/main/read.nhn?mode=LSD&...


## Simple preprocessing
* Take 100 news articles from the IT_science section
* Use the Komoran morphological analyzer to keep only the verbs, adjectives, and nouns of each article

In [6]:
# If konlpy is not installed, see the link below for installation help
# https://velog.io/@soo-im/konlpy-%EC%84%A4%EC%B9%98-%EC%97%90%EB%9F%AC-%ED%95%B4%EA%B2%B0%EC%B1%85-%EC%95%84%EB%82%98%EC%BD%98%EB%8B%A4-JPYPE
!pip install konlpy

Collecting konlpy
  Downloading konlpy-0.5.2-py2.py3-none-any.whl (19.4 MB)
Collecting JPype1>=0.7.0
  Downloading JPype1-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (448 kB)
Collecting beautifulsoup4==4.6.0
  Downloading beautifulsoup4-4.6.0-py3-none-any.whl (86 kB)
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: JPype1, colorama, beautifulsoup4, konlpy
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.6.3
    Uninstalling beautifulsoup4-4.6.3:
      Successfully uninstalled beautifulsoup4-4.6.3
Successfully installed JPype1-1.3.0 beautifulsoup4-4.6.0 colorama-0.4.4 konlpy-0.5.2


In [7]:
from konlpy.tag import Komoran
komoran = Komoran()

In [8]:
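import re

def pos_filter(pos_pattern, tagged_sentence):
    # Keep morphemes whose POS tag matches pos_pattern.
    # re.match anchors only at the start of the tag, so this is a prefix match.
    return [morph for morph, pos in tagged_sentence if re.match(pos_pattern, pos)]

# Note: 'NN*' means "N followed by zero or more N", so it matches every tag
# that starts with N, not only the NN... noun tags.
data_processed = []
for sentence in data['news'][:100]:
    # Join the kept morphemes with spaces so the vectorizers can tokenize them
    data_processed.append(' '.join(pos_filter('NN*|VV|VA', komoran.pos(sentence))))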
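
The effect of the tag pattern can be checked without running Komoran. The sketch below applies the same `re.match` prefix test to a handful of tag names, used here purely as illustrative strings; `matching_tags` and the stricter `NN|VV|VA` pattern are hypothetical and are not part of the notebook.

```python
import re

# A few Komoran-style POS tag names, used only as example strings
tags = ["NNG", "NNP", "NNB", "NP", "NR", "VV", "VA", "VX", "JKS", "SN"]

def matching_tags(pattern, tags):
    # re.match anchors only at the start of the string, so a tag is kept
    # as soon as some prefix of it matches the pattern
    return [tag for tag in tags if re.match(pattern, tag)]

loose = matching_tags("NN*|VV|VA", tags)   # pattern used in the cell above
strict = matching_tags("NN|VV|VA", tags)   # hypothetical stricter alternative

print(loose)
print(strict)
```

The loose pattern also keeps `NP` and `NR`, which the strict one drops. Which behavior is wanted is a design choice; the outputs shown in this notebook were produced with the loose pattern.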

In [9]:
data_processed

['2017년 주가 시작 새해 벽두 건강 긴장감 넘치 새해 주 주목 만 현안 많 세계 최대 가전 전시회 행사 국내 뿐 세계 이목 쏠리 일 현지 시각 일 간 미국 라스베이거스 진행 올해 역대 최다인 여개 업체 참가 뿐 자동차 인공지능 드 등 관련 신제품 선보이',
 '앵커 올해 차세대 초고 화질 방송 시대 열리 첫해 시청자 방송 서비스 즐기 수 있 되 유병수 기자 소개 기자 드라마 스포츠 예능 다큐멘터리 기존 방송 배 이상 초고 화질 방송 뛰어나 압축 기술 바탕 개 이상 채널 가상현실 증강현실 같 차세대 방송 서비스 즐기 수 지상파 방송 처음 인터넷 기반 전송 기술 접목 시청자 참여 첨단 방향 서비스 경험 수 심상민 성신여대 문화 커뮤니케이션 학부 교수 새롭 혁명 변화 차 산업 혁명 미디어 새롭 진입 자체 결합 신호탄 새롭 서비스 하 수 우리나라 올해 차세대 초고 화질 본방송 시작 내년 평창 올림픽 세계 중계 이 위하 지난해 방식 초고 화질 시험 방송 스포츠 중계 성공 마치 차세대 방송 보편 무료 서비스 위하 정부 지원 프로그램 제작 위하 지상파 방송사 앞 10년 간 투자 금액 조 원 육박 것 예상 재원 마련 위하 지상파 방송 부과 정부 규제 풀 하 지적 많 한광석 남 서울대 광고 홍보 학과 교수 제작비 자체 1.5 배 배 정도 소요 때문 중간광고 같 제도 부분 완화 방송 콘텐츠 제작 활성 필요 차세대 초고 화질 본방송 올해 수도권 시작 내년 평창 올림픽 맞추 강원도 전국 광역시 확대',
 '세계 최초 인공위성 스푸트니크 발사 주년 한국 우주 개발 역사 현주소 우주 개발 분야 후발 주자 한국 확보 우주 기술 실용 차세대 산업 육성 하 시점 놓이 러시아 스푸트니크 1호 발사 년 한국 전쟁 직후 인당 달러 넘 가난 나라 한국 우주 기술 개발 논의 착수 것 한참 뒤 1980년대 후반 우주 강국 꿈 안고 영국 유학 길 오르 젊 공학도 피나 노력 끝 년 국내 최초 인공위성 우리별 1호 띄우 한국 우주 개발 신호탄 클릭 하면 크 그림 보 수 제도 년 우주 개발 중장기 기본

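## Approach 1: give the computer the word frequencies of each text unit

* The most basic approach
* Pass the frequency with which each word occurs in the text unit

### 1-1 Computing word frequencies over the whole collection (100 news articles)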

In [10]:
# Import CountVectorizer, which vectorizes text by word counts
from sklearn.feature_extraction.text import CountVectorizer

# Count the words and build the document-term matrix
cv = CountVectorizer()
cv_tdm = cv.fit_transform(data_processed)  # rows = documents, columns = terms
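
One detail matters for Korean morphemes: with default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters (its default `token_pattern` is `(?u)\b\w\w+\b`). The sketch below reproduces that tokenization with the standard `re` module rather than calling scikit-learn. The sample string imitates the preprocessed articles, and `UHD` is a hypothetical uppercase token added for illustration.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Sample in the style of the preprocessed articles ("UHD" is hypothetical)
text = "앵커 올해 차세대 초고 화질 방송 서비스 즐기 수 있 되 UHD"

# Lowercase first, then extract tokens, mirroring the vectorizer's defaults
tokens = re.findall(TOKEN_PATTERN, text.lower())
dropped = [word for word in text.split() if word.lower() not in tokens]

print(tokens)
print(dropped)
```

Single-character morphemes such as `수` or `있` therefore never become columns of the matrix. If they should be kept, a different `token_pattern` or a custom `tokenizer` can be passed to the vectorizer.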

In [11]:
cv_tdm

<100x4600 sparse matrix of type '<class 'numpy.int64'>'
	with 13485 stored elements in Compressed Sparse Row format>
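
The summary above already hints at how sparse the matrix is. A quick calculation with the printed figures, as a sketch with no assumptions beyond those three numbers:

```python
# Figures taken from the sparse-matrix summary printed above
n_docs, n_terms = 100, 4600
stored = 13485  # entries actually stored in the sparse matrix

total_cells = n_docs * n_terms
density = stored / total_cells

print(total_cells)            # 460000
print(f"{density:.2%}")       # share of cells that hold a stored value
```

Only about 3% of the cells hold a value, which is why scikit-learn returns a sparse matrix instead of a dense array.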

In [12]:
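# Collect the total count of each word into a DataFrame
# ('단어' = word, '빈도' = frequency)
word_count=pd.DataFrame({
    '단어': cv.get_feature_names_out(),  # scikit-learn >= 1.0; older versions use get_feature_names()
    '빈도': cv_tdm.sum(axis=0).flat
})
tdm_count = word_count.sort_values(by='빈도', ascending=False).reset_index(drop=True)
tdm_count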



Unnamed: 0,단어,빈도
0,산업,184
1,기술,143
2,기업,136
3,시장,128
4,위하,128
...,...,...
4595,이래,1
4596,이듬해,1
4597,이든,1
4598,이동하,1


### 1-2 Computing word frequencies for a single document (one news article)

In [13]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Count the words of one article and build a one-row document-term matrix
cv2 = CountVectorizer()
cv_tdm1 = cv2.fit_transform([data_processed[1]])  # a single document

In [14]:
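# Collect the count of each word in this article into a DataFrame
# ('단어' = word, '빈도' = frequency)
word_count=pd.DataFrame({
    '단어': cv2.get_feature_names_out(),  # scikit-learn >= 1.0; older versions use get_feature_names()
    '빈도': cv_tdm1.sum(axis=0).flat
})
tdm_count = word_count.sort_values(by='빈도', ascending=False).reset_index(drop=True)
tdm_count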



Unnamed: 0,단어,빈도
0,방송,10
1,서비스,5
2,화질,5
3,차세대,5
4,초고,5
...,...,...
101,산업,1
102,부분,1
103,부과,1
104,보편,1


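### 1-3 Computing word frequencies for several documents and expressing them as a matrix
- Rows: the unique index of each document
- Columns: the words that occur in the documents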

In [18]:
# Convert the sparse matrix to a dense array and wrap it in a DataFrame
tdm_count = pd.DataFrame(cv_tdm.toarray(), columns=cv.get_feature_names_out())  # scikit-learn >= 1.0
tdm_count.head()



Unnamed: 0,0시,100,10년,10대,10분,10억,10월,11월,11호,12월,13,16,18세기,1960년대,1980년대,1년,1월,1일,1호,2003년,2005년,2013년,2015년,2016년,2017년,20대,20세기,21년,21세기,21일,24,27,29일,2ne1,2년,2월,2일,2호,30,30일,...,회피,획기적,획득,횟수,횡설수설,효과,효능,효시,효율,후대,후반,후발,후생,후속작,후예,후쿠시마,후한,훈련,훔치,훗날,휴대,휴대폰,휴머노이드,휴스,휴스턴,흐름,흔들리,흘리,흡연,흥미,흥행,흩어지,희망,희선,흰색,히터,힘들,힘쓰,힘입,힘차
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,9,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


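## Approach 2: give the computer the TF-IDF weight of each word in each text unit

<br> With raw frequencies, unimportant words can end up with high counts (e.g. particles and generic nouns such as "person" or "I", which carry little meaning but inevitably occur often).

<br><br>
### What is TF-IDF?
- Short for Term Frequency-Inverse Document Frequency: a method that uses term frequency and inverse document frequency to give each word in the DTM a weight reflecting its importance
- tf(d,t): the number of times term t occurs in document d<br>
- df(t): the number of documents in which term t occurs<br>
- idf(t): a quantity inversely related to df(t); in other words, it is built from the reciprocal of df(t)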


![image-4.png](attachment:image-4.png)
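
The definitions above can be worked through by hand. The sketch below assumes the common textbook form idf(t) = log(N / df(t)), where N is the number of documents; the exact formula in the figure and the one scikit-learn uses may differ. The toy documents are hypothetical.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical data)
docs = [
    ["industry", "robot", "robot", "robot"],
    ["industry", "game", "game"],
    ["industry", "market"],
]
n_docs = len(docs)

# df(t): number of documents that contain term t
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    tf = Counter(doc)  # tf(d, t): occurrences of t in this document
    # Assumed textbook form: idf(t) = log(N / df(t))
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}

weights = [tf_idf(doc) for doc in docs]
print(weights[0])
```

Here `industry` occurs in every document, so its weight is 0 everywhere, while `robot` occurs three times in a single document and receives the largest weight.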


<br><br>
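- A generic noun tends to occur many times within one document and just as often in the other documents
    <br> => its TF-IDF score drops sharply
- A document's key topic words occur many times in that document but hardly at all in documents that do not cover the topic
    <br> => its TF-IDF score drops only slightly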



![ex.PNG](attachment:ex.PNG)

![tfidf.PNG](attachment:tfidf.PNG)

### 2-1 Finding the words with the highest TF-IDF weights over the whole collection (100 news articles)

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF-based vectorizer
tfidf = TfidfVectorizer()
tfidf_tdm = tfidf.fit_transform(data_processed)
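
For reference, the weights produced here need not match a hand calculation with a textbook formula. As far as the scikit-learn documentation describes it, `TfidfVectorizer` with its default settings (`smooth_idf=True`, `norm='l2'`) uses a smoothed idf, where $n$ is the number of documents, and then rescales each document row to unit length:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(d,t) = \frac{\mathrm{tf}(d,t)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \bigl(\mathrm{tf}(d,t')\,\mathrm{idf}(t')\bigr)^{2}}}
$$

Because of the row normalization, every individual weight lies between 0 and 1.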

In [20]:
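# Sum each word's TF-IDF weights over all documents and collect them in a DataFrame
# ('단어' = word, 'tf-idf가중치' = TF-IDF weight)
word_count=pd.DataFrame({
    '단어': tfidf.get_feature_names_out(),  # scikit-learn >= 1.0; older versions use get_feature_names()
    'tf-idf가중치': tfidf_tdm.sum(axis=0).flat
})

word_tfidf = word_count.sort_values(by='tf-idf가중치', ascending=False).reset_index(drop=True)
word_tfidf.head(10)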



Unnamed: 0,단어,tf-idf가중치
0,산업,3.282655
1,로봇,2.882249
2,서비스,2.72342
3,기업,2.668263
4,데이터,2.503021
5,시장,2.502221
6,게임,2.45926
7,기술,2.420907
8,할인,2.416335
9,혁명,2.308632


### 2-2 Computing TF-IDF weights for several documents and expressing them as a matrix

In [21]:
tdm_count = pd.DataFrame(tfidf_tdm.toarray(), columns=tfidf.get_feature_names_out())  # scikit-learn >= 1.0
tdm_count.head()



Unnamed: 0,0시,100,10년,10대,10분,10억,10월,11월,11호,12월,13,16,18세기,1960년대,1980년대,1년,1월,1일,1호,2003년,2005년,2013년,2015년,2016년,2017년,20대,20세기,21년,21세기,21일,24,27,29일,2ne1,2년,2월,2일,2호,30,30일,...,회피,획기적,획득,횟수,횡설수설,효과,효능,효시,효율,후대,후반,후발,후생,후속작,후예,후쿠시마,후한,훈련,훔치,훗날,휴대,휴대폰,휴머노이드,휴스,휴스턴,흐름,흔들리,흘리,흡연,흥미,흥행,흩어지,희망,희선,흰색,히터,힘들,힘쓰,힘입,힘차
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.189586,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.063651,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043863,0.0,0.0,0.0,0.120749,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043863,0.037686,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037686,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.018151,0.02095,0.026969,0.018728,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.222729,0.026969,0.0,0.0,0.0,0.0,0.0,0.0,0.026969,0.0,0.026969,0.0,0.0,0.0,0.026969,0.0,0.0,0.0,0.0,0.026969,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026969,0.0,0.026969,0.0,0.023171,0.0,0.0,0.0,0.0,0.0,0.024748,0.0,0.026969,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023171,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02605,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033536,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033536,0.0,0.0,0.0,0.023288,0.0,0.0,0.033536,0.0,...,0.0,0.033536,0.0,0.0,0.0,0.019368,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033536,0.0,0.0,0.0,0.0,0.0


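## Approach 3: give the computer N-gram frequencies instead of single-word frequencies


#### What is an N-gram?
- A technique that changes the form of the tokens rather than the numeric values
- A run of consecutive words is treated as a single token (word)
- Depending on how many words make up the run, it is called a unigram (1), bigram (2), or trigram (3)

##### Example
Given sentence: An adorable little boy is spreading smiles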

![image-2.png](attachment:image-2.png)

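<br> A word-based approach cannot reflect the order of words within a sentence.
<br> A single word sometimes fails to convey the full meaning; N-grams help with this (e.g. "policy" => "telecom policy", "economy" => "economic structure").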
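
The example sentence can be split into N-grams with a sliding window. This is a minimal sketch; the helper name `ngrams` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    # Slide a window of n consecutive tokens across the list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "An adorable little boy is spreading smiles"
tokens = sentence.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
```

A sentence of 7 words yields 7 unigrams, 6 bigrams, and 5 trigrams, and each bigram such as `An adorable` keeps the local word order that single words lose.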


### 3-1 Computing bigram frequencies over the whole collection (100 news articles)

In [22]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Count bigrams and build the document-term matrix
bicv = CountVectorizer(ngram_range=(2, 2))  # (min_n, max_n); (2, 2) means bigrams only
bicv_tdm = bicv.fit_transform(data_processed)  # rows = documents, columns = bigrams

In [23]:
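# Collect the total count of each bigram into a DataFrame
# ('단어' = word, '빈도' = frequency)
word_count=pd.DataFrame({
    '단어': bicv.get_feature_names_out(),  # scikit-learn >= 1.0; older versions use get_feature_names()
    '빈도': bicv_tdm.sum(axis=0).flat
})
bi_tdm_count = word_count.sort_values(by='빈도', ascending=False).reset_index(drop=True)
bi_tdm_count.head(10)
# Note: "데이 그램" in the results comes from "올데이 그램". Some expressions cannot be fully captured even with bigrams.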



Unnamed: 0,단어,빈도
0,산업 혁명,85
1,사물 인터넷,24
2,우주 개발,18
3,데이 그램,18
4,제4차 산업,17
5,재생 에너지,17
6,배터리 용량,16
7,이동 통신,15
8,사용 시간,14
9,세계 최대,14


### 3-2 Computing bigram frequencies for several documents and expressing them as a matrix

In [24]:
tdm_count = pd.DataFrame(bicv_tdm.toarray(), columns=bicv.get_feature_names_out())  # scikit-learn >= 1.0
tdm_count.head()



Unnamed: 0,0시 0시,0시 카카오톡,100 수준,10년 모바일,10년 이통,10년 투자,10대 뉴스,10대 주목,10대 트렌드,10분 손의,10분 스페셜,10억 유로,10월 1일,10월 강진,10월 국내,10월 독일,10월 세계,10월 안전,10월 조사,10월 중국,10월 타임워너,11월 개월,11월 딥젠고는,11월 베이징,11월 보강,11월 스푸트니크,11호 처음,12월 21일,12월 29일,12월 광역시,12월 국민연금공단,12월 동계,12월 시험,12월 임원,12월 중순,12월 출시,12월 회원사,13 최종,16 실적,18세기 전기,...,휴대 전화,휴대 커피숍,휴대 편리,휴대 필요,휴대폰 가입자,휴대폰 들리,휴대폰 요금,휴대폰 회선,휴머노이드 로봇,휴스 텔레매틱스,휴스턴 옮기,흐름 감안,흐름 뒤떨어지,흐름 보이,흐름 선제,흐름 투자,흐름 평균,흔들리 국제,흘리 sk텔레콤,흡연 음주,흥미 미국,흥미 시나리오,흥행 넷마블게임즈,흥행 배출,흩어지 경쟁력,희망 기대,희망 기원,희선 기자,흰색 보이,히터 악천후,힘들 속도,힘들 중국인,힘쓰 동시,힘쓰 있다,힘쓰 한국,힘입 기차,힘입 노트북,힘입 탄력,힘차 달리,힘차 열어젖히
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


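### Why it is hard to increase N when building an N-gram matrix

#### The sparsity problem
- The number of distinct tokens across the documents grows very rapidly, which makes computation inefficient
- It is therefore more efficient to start with unigrams and move to bigrams or trigrams only when the results are unsatisfactory
   <br> => In particular, as the number of documents grows, the number of distinct tokens grows with it, and faster for N-grams than for single words
   <br> => Even bigrams (2), the smallest step up, need far more columns than unigrams (1)
    <br> e.g. for the 100 news articles: unigrams give roughly 4,000 distinct tokens, bigrams roughly 18,000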
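
The growth in columns can be seen even on a tiny corpus. The sketch below counts distinct unigrams and bigrams in three hypothetical documents; `distinct_ngrams` is an illustrative helper, not part of scikit-learn.

```python
def distinct_ngrams(docs, n):
    # Collect every distinct run of n consecutive tokens, document by document
    seen = set()
    for doc in docs:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
    return seen

# Toy corpus (hypothetical data)
docs = [
    "industry revolution changes industry structure",
    "robot industry changes market structure",
    "market revolution changes robot market",
]

for n in (1, 2):
    print(n, len(distinct_ngrams(docs, n)))
```

The same six words produce eleven distinct bigrams, because most word pairs occur only once. Each of those pairs becomes its own, mostly empty, column in the matrix.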
    

### Reference
- How to extract the top keywords within each document

#### Reference #1: top 5 keywords per text unit by frequency

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
sp_matrix = vectorizer.fit_transform(data_processed)

# Maps each vocabulary term to its column index in sp_matrix
word2id = vectorizer.vocabulary_

cv_keywords_list=[]
for i, sent in enumerate(data_processed):
    count_dic={}
    for token in sent.split():
        # The vectorizer lowercases tokens and drops some (e.g. single characters);
        # skip tokens missing from its vocabulary instead of reading column 0
        col = word2id.get(token.lower())
        if col is not None:
            count_dic[token]=sp_matrix[i, col]
    # Sort by count (descending) and keep the five highest
    cv_keywords=','.join([word for word, _ in sorted(count_dic.items(), key=lambda x : x[1], reverse=True)[:5]])
    cv_keywords_list.append(cv_keywords)

# cv_keywords_list

#### Reference #2: top 5 keywords per text unit by TF-IDF weight

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
sp_matrix = vectorizer.fit_transform(data_processed)

# Maps each vocabulary term to its column index in sp_matrix
word2id = vectorizer.vocabulary_

tfidf_keywords_list=[]
for i, sent in enumerate(data_processed):
    tfidf_dic={}
    for token in sent.split():
        # The vectorizer lowercases tokens and drops some (e.g. single characters);
        # skip tokens missing from its vocabulary instead of reading column 0
        col = word2id.get(token.lower())
        if col is not None:
            tfidf_dic[token]=sp_matrix[i, col]
    # Sort by TF-IDF weight (descending) and keep the five highest
    tfidf_keywords=','.join([word for word, _ in sorted(tfidf_dic.items(), key=lambda x : x[1], reverse=True)[:5]])
    tfidf_keywords_list.append(tfidf_keywords)

# tfidf_keywords_list

In [None]:
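# Put the two keyword lists side by side
# ('빈도수 기반 키워드' = frequency-based keywords, 'Tfidf 기반 키워드' = TF-IDF-based keywords)
compare_keywords = pd.DataFrame(zip(cv_keywords_list, tfidf_keywords_list), columns=['빈도수 기반 키워드', 'Tfidf 기반 키워드'] )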
compare_keywords

Unnamed: 0,빈도수 기반 키워드,Tfidf 기반 키워드
0,"새해,세계,2017년,주가,시작","긴장감,최다인,새해,건강,넘치"
1,"방송,차세대,초고,화질,서비스","방송,초고,화질,차세대,지상파"
2,"우주,개발,한국,발사체,기술","우주,발사체,개발,발사,위성"
3,"우주,스푸트니크,개발,발사,미국","우주,스푸트니크,발사,러시아,1호"
4,"레이저,연구,엔진,세계,올해","레이저,엔진,연구,연소,우주"
...,...,...
95,"시간,그램,인치,충전,배터리","인치,그램,충전,시간,배터리"
96,"복지,센터,부회장,임직원,봉사","복지,임직원,부회장,봉사,센터"
97,"음향,입체,디지,소닉,개발","음향,입체,디지,소닉,3차원"
98,"시간,사용,그램,인치,배터리","그램,인치,시간,신제품,사용"
