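# Characteristics of TfidfVectorizer

TfidfVectorizer is well suited to computing the similarity between texts.  
It takes the characteristics of each individual text into account when weighting words.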

## test-data
1: BANANA APPLE BANANA DOG  
BANANA x 2, APPLE x 1, DOG x 1  

BANANA appears most often, so BANANA gets the highest score.

2: BOOK BANANA APPLE CAT  
BANANA x 1, BOOK x 1, APPLE x 1, CAT x 1  

Every word appears only once, so the score is spread across the words.

3: BANANA BANANA BANANA DOG  
BANANA x 3, DOG x 1  

BANANA appears most often, so BANANA gets the highest score.

4: DOG BANANA BANANA BANANA  
BANANA x 3, DOG x 1  

Same words as 3; only the order differs.
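
The word counts listed above can be checked with a few lines of standard-library Python. This is only a sketch: it lowercases and splits on whitespace, which is a simplification of TfidfVectorizer's own tokenizer, and the names `texts` and `counts` are made up for illustration.

```python
from collections import Counter

# The four texts described above (hypothetical variable names).
texts = [
    'BANANA APPLE BANANA DOG',
    'BOOK BANANA APPLE CAT',
    'BANANA BANANA BANANA DOG',
    'DOG BANANA BANANA BANANA',
]

# Lowercase and split on whitespace: a simplified stand-in for the vectorizer's tokenizer.
counts = [Counter(text.lower().split()) for text in texts]

for number, count in enumerate(counts, start=1):
    print(number, dict(count))
```

Texts 3 and 4 produce identical counts, which is why they end up with identical TF-IDF rows later on.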



In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

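# Drop common English stop words; none of the test words are affected.
tfids = TfidfVectorizer(stop_words="english")

test_ids = [1, 2, 3, 4, 5, 6]
# Texts 3 to 6 contain the same words in a different order.
test_words = [
    'BANANA APPLE BANANA DOG',
    'BOOK BANANA APPLE CAT',
    'BANANA BANANA BANANA DOG',
    'DOG BANANA BANANA BANANA',
    'BANANA BANANA DOG BANANA',
    'BANANA DOG BANANA BANANA',
]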

df = pd.DataFrame({ 'id': test_ids, 'word': test_words})
df.head()

Unnamed: 0,id,word
0,1,BANANA APPLE BANANA DOG
1,2,BOOK BANANA APPLE CAT
2,3,BANANA BANANA BANANA DOG
3,4,DOG BANANA BANANA BANANA
4,5,BANANA BANANA DOG BANANA


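## Characteristics

- The more often a word appears in a text, the higher its score in that text
- The texts influence one another (for example, if cat also appears in another text, the cat score in row 1, BOOK BANANA APPLE CAT, goes down and its book score goes up)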
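
The second point can be reproduced by hand. The sketch below assumes scikit-learn's default settings for TfidfVectorizer (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization) and uses a whitespace tokenizer as a simplification. `tfidf_row` is a hypothetical helper, not part of scikit-learn.

```python
import math
from collections import Counter

def tfidf_row(corpus, row):
    """TF-IDF weights for one text, assuming scikit-learn's default formula."""
    docs = [Counter(text.lower().split()) for text in corpus]
    n = len(docs)
    # Document frequency: number of texts that contain each word.
    df = Counter(word for doc in docs for word in doc)
    # Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1.
    raw = {w: tf * (math.log((1 + n) / (1 + df[w])) + 1) for w, tf in docs[row].items()}
    # L2-normalize so the weights of one text have unit length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

base = ['BANANA APPLE BANANA DOG', 'BOOK BANANA APPLE CAT',
        'BANANA BANANA BANANA DOG', 'DOG BANANA BANANA BANANA',
        'BANANA BANANA DOG BANANA', 'BANANA DOG BANANA BANANA']
# Same corpus, but CAT now also appears in the third text.
with_cat = base[:2] + ['BANANA BANANA BANANA CAT'] + base[3:]

before = tfidf_row(base, 1)
after = tfidf_row(with_cat, 1)
print('cat :', round(before['cat'], 6), '->', round(after['cat'], 6))
print('book:', round(before['book'], 6), '->', round(after['book'], 6))
```

Once cat appears in a second text its IDF drops, so it takes a smaller share of row 1 after normalization and book takes a larger one.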

In [2]:
from sklearn.metrics.pairwise import linear_kernel

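tfidf_matrix = tfids.fit_transform(df['word'])
# Rows are L2-normalized by default, so the linear kernel (dot product) equals cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
pd.DataFrame(tfidf_matrix.toarray(), columns=tfids.get_feature_names_out())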

Unnamed: 0,apple,banana,book,cat,dog
0,0.624694,0.676333,0.0,0.0,0.390295
1,0.484084,0.26205,0.590336,0.590336,0.0
2,0.0,0.933314,0.0,0.0,0.359062
3,0.0,0.933314,0.0,0.0,0.359062
4,0.0,0.933314,0.0,0.0,0.359062
5,0.0,0.933314,0.0,0.0,0.359062
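
`linear_kernel` computes plain dot products. Because each row above has unit length (up to rounding), the dot product of two rows is already their cosine similarity. A small check using rows 0 and 2 copied from the table follows; the helper names `dot` and `cosine` are illustrative only.

```python
import math

# Rows 0 and 2 of the matrix above (columns: apple, banana, book, cat, dog).
row0 = [0.624694, 0.676333, 0.0, 0.0, 0.390295]
row2 = [0.0, 0.933314, 0.0, 0.0, 0.359062]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Each row has length 1 up to rounding error, so both numbers agree closely.
print(dot(row0, row2))
print(cosine(row0, row2))
```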


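The next cell outputs the TF-IDF scores for a version of the data in which CAT appears in one more text: row 2 becomes BANANA BANANA BANANA CAT.  
In row 1, the book score rises and the cat score falls.  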

In [3]:
test_ids2 = [1, 2, 3, 4, 5, 6]
# Same as test_words, except that the third text ends in CAT instead of DOG.
test_words2 = ['BANANA APPLE BANANA DOG', 'BOOK BANANA APPLE CAT', 'BANANA BANANA BANANA CAT',  'DOG BANANA BANANA BANANA',
              'BANANA BANANA DOG BANANA',  'BANANA DOG BANANA BANANA', ]

df2 = pd.DataFrame({ 'id': test_ids2, 'word': test_words2})

tfidf_matrix2 = tfids.fit_transform(df2['word'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
pd.DataFrame(tfidf_matrix2.toarray(), columns=tfids.get_feature_names_out())

Unnamed: 0,apple,banana,book,cat,dog
0,0.60908,0.659428,0.0,0.0,0.440654
1,0.514331,0.278423,0.627222,0.514331,0.0
2,0.0,0.851513,0.0,0.524333,0.0
3,0.0,0.913456,0.0,0.0,0.406936
4,0.0,0.913456,0.0,0.0,0.406936
5,0.0,0.913456,0.0,0.0,0.406936
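
The higher dog weights in rows 3 to 5 follow from the IDF term: dog now appears in 4 of the 6 texts instead of 5. Assuming scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, the change can be sketched as follows (`smoothed_idf` and `dog_weight` are hypothetical helpers):

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Assumed default formula: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_before = smoothed_idf(6, 5)  # dog appears in 5 of 6 texts
idf_after = smoothed_idf(6, 4)   # dog appears in 4 of 6 texts

def dog_weight(idf_dog):
    # Weight of dog in 'DOG BANANA BANANA BANANA' after L2 normalization.
    # banana occurs 3 times and has IDF 1 because it appears in every text.
    return idf_dog / math.sqrt(3 ** 2 + idf_dog ** 2)

print(round(dog_weight(idf_before), 6), round(dog_weight(idf_after), 6))
```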


In [4]:
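# Map each text to its row index in df.
indices = pd.Series(df.index, index=df['word']).drop_duplicates()

def recommend(word, cosine_sim=cosine_sim, df=df, indices=indices):
    idx = indices[word]

    # Pair every row index with its similarity to the query, leaving out the query itself.
    # (Dropping only the first sorted entry can keep the query when other texts tie at 1.0.)
    sim_scores = [(i, score) for i, score in enumerate(cosine_sim[idx]) if i != idx]
    # Sort by similarity, highest first, and keep the top three.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[:3]

    doc_indices = [i[0] for i in sim_scores]

    return df['word'].iloc[doc_indices]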

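- The weight of DOG becomes larger, so the texts that contain it are pulled toward each other
- Judging from texts 4, 5 and 6, word order does not seem to affect the similarity calculation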
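
That is what one would expect from a bag-of-words model with the default unigram setting: only the word counts enter the vector, not the positions. A minimal check with a whitespace tokenizer (a simplification of the real one; `orderings` and `bags` are illustrative names):

```python
from collections import Counter

orderings = [
    'BANANA BANANA BANANA DOG',
    'DOG BANANA BANANA BANANA',
    'BANANA BANANA DOG BANANA',
    'BANANA DOG BANANA BANANA',
]

# Count words per text; the order of the words is discarded.
bags = [Counter(text.lower().split()) for text in orderings]

# Check whether all four texts reduce to the same bag of words.
same_bag = all(bag == bags[0] for bag in bags)
print(same_bag)
```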

In [5]:
recommend('BANANA APPLE BANANA DOG')

2    BANANA BANANA BANANA DOG
3    DOG BANANA BANANA BANANA
4    BANANA BANANA DOG BANANA
Name: word, dtype: object

- If word order mattered, BANANA BANANA DOG BANANA should come out on top

In [6]:
recommend('BANANA BANANA BANANA DOG')

3    DOG BANANA BANANA BANANA
4    BANANA BANANA DOG BANANA
5    BANANA DOG BANANA BANANA
Name: word, dtype: object

In [7]:
recommend('BOOK BANANA APPLE CAT')

0     BANANA APPLE BANANA DOG
2    BANANA BANANA BANANA DOG
3    DOG BANANA BANANA BANANA
Name: word, dtype: object