# Notes on the characteristics of TfidfVectorizer

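TfidfVectorizer is well suited to computing the similarity between texts.  
It takes the characteristics of each individual text into account when scoring terms.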

## test-data
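1: BANANA APPLE BANANA DOG  
BANANA x 2, APPLE x 1, DOG x 1  

BANANA occurs more often than the other words, so BANANA gets the highest score.
  
2: BOOK BANANA APPLE CAT  
BANANA x 1, BOOK x 1, APPLE x 1, CAT x 1  
  
Every word occurs exactly once, so the score is spread across the words.
  
3: BANANA BANANA BANANA DOG  
BANANA x 3, DOG x 1  

BANANA occurs three times, so BANANA again gets the highest score.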
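
The per-text word counts listed above can be reproduced with a few lines of standard-library Python. This is only a sketch: the helper name `count_terms` is made up for illustration, and the regular expression is meant to mirror the vectorizer's default behaviour (lowercasing, tokens of two or more word characters) rather than call into scikit-learn itself.

```python
import re
from collections import Counter

texts = [
    "BANANA APPLE BANANA DOG",
    "BOOK BANANA APPLE CAT",
    "BANANA BANANA BANANA DOG",
]

def count_terms(text):
    # Hypothetical helper: lowercase the text, then keep tokens made of
    # two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

counts = [count_terms(t) for t in texts]
for text_id, c in enumerate(counts, start=1):
    print(text_id, dict(c))
```

These raw counts are the term-frequency part of the score. The weighting by how many texts contain each word is applied on top of them.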

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

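# stop_words="english" drops common English words; none of the test words are affected
tfids = TfidfVectorizer(stop_words="english")

# Three short test texts, each with an id
test_ids = [1, 2, 3]
test_words = ['BANANA APPLE BANANA DOG', 'BOOK BANANA APPLE CAT', 'BANANA BANANA BANANA DOG']

df = pd.DataFrame({'id': test_ids, 'word': test_words})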
df.head()

|   | id | word |
|---|----|------|
| 0 | 1 | BANANA APPLE BANANA DOG |
| 1 | 2 | BOOK BANANA APPLE CAT |
| 2 | 3 | BANANA BANANA BANANA DOG |


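## Characteristics

- The more often a word occurs in a text, the higher its score in that text
- The texts influence one another (for example, if cat also appears in another text, the cat score in row 1 goes down and the book score goes up)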

In [2]:
from sklearn.metrics.pairwise import linear_kernel

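# Fit on the texts and build the TF-IDF matrix (rows are L2-normalized by default)
tfidf_matrix = tfids.fit_transform(df['word'])
# For L2-normalized rows the dot product (linear kernel) equals the cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(tfidf_matrix.toarray(), columns=tfids.get_feature_names_out())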

|   | apple | banana | book | cat | dog |
|---|-------|--------|------|-----|-----|
| 0 | 0.476063 | 0.739411 | 0.0 | 0.0 | 0.476063 |
| 1 | 0.444514 | 0.345205 | 0.584483 | 0.584483 | 0.0 |
| 2 | 0.0 | 0.918927 | 0.0 | 0.0 | 0.394428 |
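
The matrix above can be recomputed by hand, which makes both characteristics easier to see. The sketch below assumes the vectorizer's default settings: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_rows` is hypothetical, and plain whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

texts = [
    "BANANA APPLE BANANA DOG",
    "BOOK BANANA APPLE CAT",
    "BANANA BANANA BANANA DOG",
]

def tfidf_rows(texts):
    # Hypothetical helper that follows the assumed default formula.
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    n = len(texts)
    # Smoothed idf: words found in fewer texts get a larger weight.
    idf = {
        w: math.log((1 + n) / (1 + sum(w in c for c in counts))) + 1
        for w in vocab
    }
    rows = []
    for c in counts:
        raw = [c[w] * idf[w] for w in vocab]        # term count times idf
        norm = math.sqrt(sum(v * v for v in raw))   # L2 norm of the row
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(texts)
print(vocab)
for row in rows:
    print([round(v, 6) for v in row])
```

Under these assumptions the printed rows should agree with the matrix above. The term count drives the first characteristic, and the shared `idf` and per-row norm are what tie the texts and words together.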


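Recompute the TF-IDF matrix for a version in which CAT appears in the third text (row 2) in place of DOG.  
In row 1, the score of book goes up and the score of cat goes down.  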

In [3]:
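test_ids2 = [1, 2, 3]
# Same texts as before, except that the third text now ends with CAT instead of DOG
test_words2 = ['BANANA APPLE BANANA DOG', 'BOOK BANANA APPLE CAT', 'BANANA BANANA BANANA CAT']

df2 = pd.DataFrame({'id': test_ids2, 'word': test_words2})

# Refit the vectorizer on the modified texts
tfidf_matrix2 = tfids.fit_transform(df2['word'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(tfidf_matrix2.toarray(), columns=tfids.get_feature_names_out())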

|   | apple | banana | book | cat | dog |
|---|-------|--------|------|-----|-----|
| 0 | 0.441027 | 0.684993 | 0.0 | 0.0 | 0.579897 |
| 1 | 0.480458 | 0.373119 | 0.631745 | 0.480458 | 0.0 |
| 2 | 0.0 | 0.918927 | 0.0 | 0.394428 | 0.0 |
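
The change in row 1 can be traced to the idf weights. The sketch below again assumes the default smoothed idf `ln((1 + n) / (1 + df)) + 1` and L2 normalization. The names `smoothed_idf` and `normalized` are made up for illustration, and the document frequencies are read off the two sets of test texts.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Assumed default weighting: fewer texts containing the word -> larger idf.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3
# Number of texts containing each word of 'BOOK BANANA APPLE CAT'
df_before = {"apple": 2, "banana": 3, "book": 1, "cat": 1}  # CAT only in this text
df_after = {"apple": 2, "banana": 3, "book": 1, "cat": 2}   # CAT also in the third text

def normalized(doc_freqs):
    # Every word occurs once in this text, so the raw score is just the idf.
    raw = {w: smoothed_idf(n_docs, d) for w, d in doc_freqs.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}, norm

before, norm_before = normalized(df_before)
after, norm_after = normalized(df_after)
for w in df_before:
    print(w, round(before[w], 6), "->", round(after[w], 6))
```

Only the idf of cat changes directly. Because the row is rescaled to unit length, the smaller cat value shrinks the norm, and book, apple and banana all rise slightly as a result.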


In [4]:
# Map each text to its row position in df
indices = pd.Series(df.index, index=df['word']).drop_duplicates()

def recommend(word, cosine_sim=cosine_sim, df=df, indices=indices):
    idx = indices[word]

    # Pair every row with its similarity to the query text, most similar first
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the query text itself) and keep the next two
    sim_scores = sim_scores[1:3]

    top_indices = [i[0] for i in sim_scores]

    return df['word'].iloc[top_indices]

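DOG appears in fewer texts than BANANA, so it gets a larger weight per occurrence, and the texts that share DOG are pulled toward each other.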

In [5]:
recommend('BANANA APPLE BANANA DOG')

2    BANANA BANANA BANANA DOG
1       BOOK BANANA APPLE CAT
Name: word, dtype: object

In [6]:
recommend('BANANA BANANA BANANA DOG')

0    BANANA APPLE BANANA DOG
1      BOOK BANANA APPLE CAT
Name: word, dtype: object

In [7]:
recommend('BOOK BANANA APPLE CAT')

0     BANANA APPLE BANANA DOG
2    BANANA BANANA BANANA DOG
Name: word, dtype: object
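
The rankings above can be checked by hand from the first TF-IDF matrix. The sketch below copies the printed rows (rounded to six decimals) and takes plain dot products, which equal the cosine similarity as long as the rows are L2-normalized. The per-word breakdown is an illustration added here and not part of the original notebook.

```python
terms = ["apple", "banana", "book", "cat", "dog"]
# Rows copied from the first TF-IDF matrix (rounded values)
rows = [
    [0.476063, 0.739411, 0.0, 0.0, 0.476063],       # BANANA APPLE BANANA DOG
    [0.444514, 0.345205, 0.584483, 0.584483, 0.0],  # BOOK BANANA APPLE CAT
    [0.0, 0.918927, 0.0, 0.0, 0.394428],            # BANANA BANANA BANANA DOG
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Pairwise similarity matrix; the diagonal is approximately 1
sim = [[dot(a, b) for b in rows] for a in rows]
for row in sim:
    print([round(v, 4) for v in row])

# How much each word contributes to the similarity of texts 1 and 3
contrib = {t: rows[0][i] * rows[2][i] for i, t in enumerate(terms)}
print({t: round(v, 4) for t, v in contrib.items() if v > 0})
```

Texts 1 and 3 come out as the closest pair, with the shared BANANA and DOG terms both contributing. Text 2 overlaps with text 3 only through BANANA, so that pair is the least similar.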