# Learning objectives
Practice count-based (frequency-based) word representations, one family of methods for representing text as vectors.

  
Count-based representation
- Bag of words (BoW): TF weighting
- Bag of words (BoW): TF-IDF weighting
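
As a rough illustration of the two weighting schemes listed above, the sketch below counts terms in a toy corpus using only the standard library. The corpus and variable names are made up for illustration; the rest of this notebook uses scikit-learn instead.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "a cat and a dog",
]

# Term frequency (TF): how often each word occurs within one document
tf = [Counter(doc.split()) for doc in docs]

# Document frequency (DF): in how many documents each word occurs.
# TF-IDF down-weights words that have a high DF.
df = Counter(word for doc in docs for word in set(doc.split()))

print(tf[0]["the"])  # 2
print(df["the"])     # 2
print(df["mat"])     # 1
```

A plain TF representation keeps only the per-document counts, while TF-IDF additionally uses the document frequencies to reduce the weight of words that occur in many documents.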

## sklearn CountVectorizer

### Loading the data

In [63]:
# Import the data, plotting, and spaCy libraries
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import spacy

In [64]:
nlp = spacy.load('en_core_web_sm')

In [65]:
text = """In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.
The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word,
which helps to adjust for the fact that some words appear more frequently in general.
tf–idf is one of the most popular term-weighting schemes today.
A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf."""

### Quick check of the tokenization result

In [66]:
doc = nlp(text)
doc

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.
The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word,
which helps to adjust for the fact that some words appear more frequently in general.
tf–idf is one of the most popular term-weighting schemes today.
A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.

In [67]:
# Keep the lemmas of tokens that are neither stop words nor punctuation
token_list = []
for token in doc:
    if not token.is_stop and not token.is_punct:
        token_list.append(token.lemma_)

token_list[:10]

['information',
 'retrieval',
 'tf',
 'idf',
 'tfidf',
 'short',
 'term',
 'frequency',
 'inverse',
 'document']

### Vectorization

Document-term matrix (DTM) with TF counts

In [68]:
# Split the text on line breaks; each line is treated as one document
sentences_list = text.split('\n')
sentences_list[:10]

['In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.',
 'It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.',
 'The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word,',
 'which helps to adjust for the fact that some words appear more frequently in general.',
 'tf–idf is one of the most popular term-weighting schemes today.',
 'A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.']
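
Note that `split('\n')` splits on line breaks, so the third sentence of the text, which wraps onto a second line after a comma, ends up as two separate documents (six rows for five sentences). If one row per sentence is preferred, a simple regular-expression splitter is one option. The sketch below is a heuristic that assumes sentences end with `.`, `!`, or `?` followed by whitespace; `split_sentences` and `sample` are hypothetical names, not part of this notebook.

```python
import re

def split_sentences(text):
    """Heuristically split text into sentences at ., ! or ? followed by whitespace."""
    # Collapse line breaks first so a sentence wrapped over two lines stays together
    flat = " ".join(text.split())
    return re.split(r"(?<=[.!?])\s+", flat)

sample = """The value increases with the count,
which helps to adjust for common words.
It is a popular scheme."""

print(sample.split("\n"))       # 3 items: the first sentence is cut in two
print(split_sentences(sample))  # 2 items: one per sentence
```

The remaining cells keep the line-based split, so the outputs shown in this notebook have six rows.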

In [69]:
# Create the CountVectorizer object
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

In [70]:
# Build the vocabulary
vectorizer.fit(sentences_list)

CountVectorizer()

In [71]:
# Inspect the token-to-column-index mapping
vectorizer.vocabulary_

{'in': 26,
 'information': 28,
 'retrieval': 49,
 'tf': 60,
 'idf': 24,
 'or': 44,
 'tfidf': 61,
 'short': 52,
 'for': 18,
 'term': 58,
 'frequency': 19,
 'inverse': 30,
 'document': 14,
 'is': 31,
 'numerical': 39,
 'statistic': 55,
 'that': 62,
 'intended': 29,
 'to': 65,
 'reflect': 48,
 'how': 23,
 'important': 25,
 'word': 73,
 'collection': 9,
 'corpus': 12,
 'it': 32,
 'often': 42,
 'used': 68,
 'as': 6,
 'weighting': 71,
 'factor': 17,
 'searches': 51,
 'of': 40,
 'text': 59,
 'mining': 34,
 'and': 3,
 'user': 69,
 'modeling': 35,
 'the': 63,
 'value': 70,
 'increases': 27,
 'proportionally': 46,
 'number': 38,
 'times': 64,
 'appears': 5,
 'offset': 41,
 'by': 8,
 'documents': 15,
 'contain': 11,
 'which': 72,
 'helps': 22,
 'adjust': 2,
 'fact': 16,
 'some': 54,
 'words': 74,
 'appear': 4,
 'more': 36,
 'frequently': 20,
 'general': 21,
 'one': 43,
 'most': 37,
 'popular': 45,
 'schemes': 50,
 'today': 66,
 'survey': 56,
 'conducted': 10,
 '2015': 0,
 'showed': 53,
 '83': 1,
 ...}
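
The mapping above follows from two defaults of `CountVectorizer`: text is lowercased, and tokens are runs of two or more word characters (the default `token_pattern` is `(?u)\b\w\w+\b`). That is why the single-letter word `a` is missing and `tf–idf` is split into `tf` and `idf`. Column indices are assigned in sorted token order. The sketch below imitates that behavior with the standard library only; `build_vocabulary` is a hypothetical helper, not scikit-learn's implementation.

```python
import re

def build_vocabulary(docs):
    """Imitate CountVectorizer's default analysis: lowercase, keep tokens of 2+ word characters."""
    token_pattern = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's documented default pattern
    tokens = set()
    for doc in docs:
        tokens.update(token_pattern.findall(doc.lower()))
    # Column indices follow the sorted order of the tokens
    return {token: index for index, token in enumerate(sorted(tokens))}

vocab = build_vocabulary([
    "tf–idf is a numerical statistic.",
    "A survey conducted in 2015.",
])
print(vocab)
```

Digits sort before letters, which matches `'2015': 0` and `'83': 1` in the notebook output.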


In [72]:
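# List the tokens; these become the columns of the document-term matrix
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
vectorizer.get_feature_names_out()[:10].tolist()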



['2015',
 '83',
 'adjust',
 'and',
 'appear',
 'appears',
 'as',
 'based',
 'by',
 'collection']

In [73]:
# Transform the documents into a document-term matrix (DTM)
dt = vectorizer.transform(sentences_list)
dt

<6x75 sparse matrix of type '<class 'numpy.int64'>'
	with 108 stored elements in Compressed Sparse Row format>
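
The output above reports Compressed Sparse Row (CSR) storage: only the 108 nonzero counts out of 6 x 75 = 450 cells are kept. As a rough sketch of the idea (not SciPy's actual implementation), CSR can be described by three arrays: the nonzero values, their column indices, and row pointers marking where each row starts. The helper `to_csr` below is hypothetical and uses plain Python lists.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR-style (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # nonzero value
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in data/indices
    return data, indices, indptr

dense = [
    [0, 2, 0, 1],
    [0, 0, 0, 0],
    [3, 0, 0, 0],
]
data, indices, indptr = to_csr(dense)
print(data)     # [2, 1, 3]
print(indices)  # [1, 3, 0]
print(indptr)   # [0, 2, 2, 3]
```

Row `i` of the matrix is recovered from `data[indptr[i]:indptr[i + 1]]` and the matching slice of `indices`, so an all-zero row costs no storage beyond its pointer.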

In [74]:
# Convert the DTM from a sparse matrix to a dense NumPy matrix to inspect it
dt.todense()

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1, 2, 0,
         0, 0, 1, 1, 1, 2, 0, 1, 1, 1, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 2, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1,
         0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
         1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 1,
         0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
         6, 1, 1, 0, 0, 0, 0, 1, 0, 0, 2, 0],
        [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
         1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1
         ...

In [75]:
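# Convert the dense DTM into a DataFrame
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
dt_df = pd.DataFrame(dt.todense(), columns=vectorizer.get_feature_names_out())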
dt_df

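| | 2015 | 83 | adjust | and | appear | appears | as | based | by | collection | ... | to | today | use | used | user | value | weighting | which | word | words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |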


### TF-IDF vectorizer

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [91]:
# Create the TF-IDF vectorizer object (English stop words removed, 15 features kept)
tfidf = TfidfVectorizer(stop_words='english', max_features=15)

In [105]:
# tf-idf matrix
tfidf.fit(sentences_list)
tfidf_sentences_list = tfidf.transform(sentences_list)

In [106]:
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
tfidf_matrix = pd.DataFrame(tfidf_sentences_list.todense(), columns=tfidf.get_feature_names_out())
tfidf_matrix



| | corpus | document | frequency | idf | information | number | recommender | reflect | retrieval | searches | term | text | tf | weighting | word |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.239165 | 0.478329 | 0.583318 | 0.173029 | 0.239165 | 0.0 | 0.0 | 0.291659 | 0.239165 | 0.0 | 0.239165 | 0.0 | 0.173029 | 0.0 | 0.239165 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4269 | 0.0 | 0.0 | 0.0 | 0.4269 | 0.520601 | 0.0 | 0.4269 | 0.0 | 0.4269 | 0.0 |
| 2 | 0.277399 | 0.277399 | 0.0 | 0.200691 | 0.0 | 0.67657 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.200691 | 0.0 | 0.554797 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.414476 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.572896 | 0.0 | 0.414476 | 0.572896 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.384849 | 0.0 | 0.0 | 0.648703 | 0.0 | 0.0 | 0.0 | 0.0 | 0.531946 | 0.384849 | 0.0 | 0.0 |
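
To make the weights above less of a black box, here is a minimal pure-Python sketch of the computation, assuming scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, raw counts as TF): the IDF is `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit L2 length. The function `tfidf_rows` and the toy documents are hypothetical; they only mirror the document frequencies of four terms from the table.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized rows)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # A document with no known terms keeps an all-zero (empty) row
        rows.append({term: w / norm for term, w in raw.items()} if norm else {})
    return rows

# Toy documents restricted to four terms, with the same document frequencies
# as in the notebook: tf and idf in 4 of 6 rows, term and weighting in 2 of 6
docs = [
    ["tf", "idf", "term"],
    ["weighting"],
    ["tf", "idf"],
    [],
    ["tf", "idf", "term", "weighting"],
    ["tf", "idf"],
]
row = tfidf_rows(docs)[4]
print({term: round(weight, 6) for term, weight in row.items()})
```

Under these assumptions the weights for the fifth document should come out close to the 0.414476 (`tf`, `idf`) and 0.572896 (`term`, `weighting`) shown in row 4 of the table: the rarer terms receive the larger weights.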
