TF is term frequency, i.e. how many times a word appears in a user's description.
IDF is inverse document frequency, i.e. how rare that word is across all users' descriptions.

TF-IDF: a word that is common in one user's description but rare in the others' gets a high score.
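
To make the two definitions concrete, here is a minimal sketch using the textbook formulas (raw count for TF, log(N / df) for IDF). The toy corpus and the function names are made up for illustration, and scikit-learn's TfidfVectorizer uses a smoothed, normalised variant, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one tokenised description per user.
docs = [
    ["dragon", "dragon", "story"],
    ["space", "story"],
    ["ghost", "story"],
]

def tf(term, doc):
    """Raw term frequency: how many times `term` occurs in `doc`."""
    return Counter(doc)[term]

def idf(term, docs):
    """Textbook IDF: log(N / df). Assumes `term` occurs in at least one doc."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Product of the two: high when frequent here and rare elsewhere."""
    return tf(term, doc) * idf(term, docs)

dragon_score = tf_idf("dragon", docs[0], docs)  # twice here, nowhere else
story_score = tf_idf("story", docs[0], docs)    # appears in every description
print(dragon_score, story_score)
```

"dragon" scores high for the first user, while "story" scores 0 because a word that everyone uses tells us nothing about one user's taste.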

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
nlp = spacy.load('en_core_web_sm')
df = pd.read_csv('../data/books.csv')

In [4]:
def clean_description(text):
    """Lowercase the text, then return lemmas of tokens that are not punctuation or stop words."""
    doc = nlp(text.lower())
    return [token.lemma_ for token in doc if not token.is_punct and not token.is_stop]

df['cleaned_tokens'] = df['description'].apply(clean_description)

In [5]:
print(df['cleaned_tokens'])

0             [love, dragon, mysterious, land]
1    [enjoy, explore, space, futuristic, tech]
2              [fascinate, ghost, dark, story]
Name: cleaned_tokens, dtype: object


By default, TfidfVectorizer expects each document as a single string rather than a list of tokens, so join the tokens back together:

In [10]:
df['cleaned_text'] = df['cleaned_tokens'].apply(lambda tokens: ' '.join(tokens))
print(df[['cleaned_tokens','cleaned_text']])

                              cleaned_tokens  \
0           [love, dragon, mysterious, land]   
1  [enjoy, explore, space, futuristic, tech]   
2            [fascinate, ghost, dark, story]   

                          cleaned_text  
0          love dragon mysterious land  
1  enjoy explore space futuristic tech  
2           fascinate ghost dark story  


In [11]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['cleaned_text'])

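TfidfVectorizer() creates a TF-IDF vectorizer. .fit_transform() learns the vocabulary and IDF weights from the descriptions, then returns the TF-IDF matrix as a SciPy sparse matrix with one row per description.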

Now, let's look at the matrix as a table.

In [12]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_df)

   dark  dragon     enjoy   explore  fascinate  futuristic  ghost  land  love  \
0   0.0     0.5  0.000000  0.000000        0.0    0.000000    0.0   0.5   0.5   
1   0.0     0.0  0.447214  0.447214        0.0    0.447214    0.0   0.0   0.0   
2   0.5     0.0  0.000000  0.000000        0.5    0.000000    0.5   0.0   0.0   

   mysterious     space  story      tech  
0         0.5  0.000000    0.0  0.000000  
1         0.0  0.447214    0.0  0.447214  
2         0.0  0.000000    0.5  0.000000  
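
The 0.5 and 0.447214 values above can be reproduced by hand. This is a minimal sketch, assuming scikit-learn's documented defaults (smooth_idf=True, norm='l2'): idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length. The token lists are retyped from the earlier output, and names such as `tfidf_row` are illustrative.

```python
import math
from collections import Counter

# Cleaned descriptions, retyped from the cleaned_text column.
docs = [
    "love dragon mysterious land".split(),
    "enjoy explore space futuristic tech".split(),
    "fascinate ghost dark story".split(),
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many descriptions each term appears.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed IDF (assumed to match TfidfVectorizer's default settings).
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """TF * IDF for every vocabulary term, then L2-normalise the row."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

rows = [tfidf_row(doc) for doc in docs]
print(rows[0][vocab.index("dragon")], rows[1][vocab.index("space")])
```

Every word here appears in exactly one description, so all IDF weights are equal and each non-zero entry reduces to 1 / sqrt(number of words in the description): 1 / sqrt(4) = 0.5 and 1 / sqrt(5) ≈ 0.447214.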


In [15]:
# Save the TF-IDF vectors to disk (one column per word, no index column)
tfidf_df.to_csv("../data/tfidf_vectors.csv", index=False)

Now we can check which users have similar tastes by comparing their TF-IDF vectors with cosine similarity.

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
similarity_matrix = cosine_similarity(tfidf_df)

In [18]:
print(similarity_matrix)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
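
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The matrix above is the identity because no two descriptions share a word, so the dot product between any two different rows is 0. A minimal pure-Python sketch follows; the vectors use a shortened, hypothetical 5-word vocabulary, and `dana` is an invented user who shares two words with `alice`.

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors over a 5-word vocabulary (not the real 13 columns).
alice = [0.5, 0.5, 0.5, 0.5, 0.0]
bob = [0.0, 0.0, 0.0, 0.0, 1.0]   # no words in common with alice
dana = [0.5, 0.5, 0.0, 0.0, 0.0]  # invented user sharing two of alice's words

print(cosine(alice, alice))  # identical vectors
print(cosine(alice, bob))    # no overlap
print(cosine(alice, dana))   # partial overlap
```

With real descriptions that share vocabulary, the off-diagonal entries would fall between 0 and 1 like the `alice`/`dana` pair.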


In [19]:
similarity_df = pd.DataFrame(similarity_matrix, index=df['name'], columns=df['name'])
print(similarity_df)

name     Alice  Bob  Charlie
name                        
Alice      1.0  0.0      0.0
Bob        0.0  1.0      0.0
Charlie    0.0  0.0      1.0
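
Once some descriptions overlap, a labelled matrix like this can be used to pick each user's closest match. The sketch below assumes made-up similarity values (the real matrix above has no non-zero off-diagonal entries), and `sim`, `masked` and `nearest` are illustrative names.

```python
import numpy as np
import pandas as pd

names = ["Alice", "Bob", "Charlie"]

# Hypothetical similarity values, for illustration only.
sim = pd.DataFrame(
    [[1.0, 0.2, 0.6],
     [0.2, 1.0, 0.1],
     [0.6, 0.1, 1.0]],
    index=names,
    columns=names,
)

# Hide the diagonal so a user is never matched with themselves.
masked = sim.where(~np.eye(len(sim), dtype=bool))

# For each row, the column label with the highest remaining similarity.
nearest = masked.idxmax(axis=1)
print(nearest)
```

Masking the diagonal first matters because every user has similarity 1.0 with themselves, which would otherwise always win.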
