pandas: for working with tabular data.

TfidfVectorizer: transforms text into numerical vectors based on term frequency and importance.

cosine_similarity: calculates how similar two vectors (movies) are.

In [18]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


Sample data

In [19]:
data = {'title': ['Inception', 'Interstellar', 'The Matrix', 'The Prestige', 'Memento'],
    'overview': [
        'A thief who steals corporate secrets through use of dream-sharing technology.',
        'A team travels through a wormhole in space to ensure humanity\'s survival.',
        'A computer hacker learns about the true nature of his reality.',
        'Two magicians engage in a battle to create the ultimate illusion.',
        'A man with short-term memory loss attempts to track down his wife\'s murderer.'
    ]}

In [20]:
df = pd.DataFrame(data)

In [21]:
print(df)

          title                                           overview
0     Inception  A thief who steals corporate secrets through u...
1  Interstellar  A team travels through a wormhole in space to ...
2    The Matrix  A computer hacker learns about the true nature...
3  The Prestige  Two magicians engage in a battle to create the...
4       Memento  A man with short-term memory loss attempts to ...


In [22]:
tfidf = TfidfVectorizer(stop_words = "english")
tfidf_matrix = tfidf.fit_transform(df["overview"])

TfidfVectorizer transforms each movie's overview into a vector.

stop_words='english' removes common words like "the", "and", etc.

fit_transform(...):

fit: learns the vocabulary and calculates IDF.

transform: converts each overview to a vector of numbers based on TF-IDF.
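To make the two steps concrete, here is a small sketch that calls fit and transform separately and checks that the result matches a single fit_transform call. The three-document corpus and the variable names are made up for illustration; they are not the movie data above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the notebook's movie data.
docs = [
    "space travel through a wormhole",
    "a hacker questions reality",
    "space hacker",
]

vectorizer = TfidfVectorizer(stop_words="english")

# fit: learn the vocabulary and the IDF weights from the corpus.
vectorizer.fit(docs)
vocabulary = sorted(vectorizer.vocabulary_)

# transform: turn documents into TF-IDF rows using what fit learned.
two_step = vectorizer.transform(docs)

# fit_transform does both steps in a single call.
one_step = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Compare the two results element by element.
same = np.allclose(two_step.toarray(), one_step.toarray())

print(vocabulary)
print(two_step.shape)  # (number of documents, vocabulary size)
print(same)
```

Splitting the steps matters mostly when new text arrives later: it can be passed to transform alone so it is scored against the vocabulary learned earlier.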

🧠 Why TF-IDF?

TF = how often a word appears in a document.

IDF = how rare the word is across all documents.

Result: Rare but meaningful words (e.g., "wormhole") get more weight.

In [23]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

Measures how similar every movie is to every other movie.

Output is a matrix:

cosine_sim[i][j] = similarity between movie i and movie j
Values range from 0 (no shared terms) to 1 (perfect match), because TF-IDF weights are never negative.
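For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy and compares the result with cosine_similarity. The vectors a, b, c and the helper cosine are invented stand-ins for TF-IDF rows.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical non-negative vectors standing in for TF-IDF rows.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with a

def cosine(u, v):
    """Dot product of u and v divided by the product of their norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pairwise matrix computed by hand, then by the library.
vectors = (a, b, c)
manual = np.array([[cosine(u, v) for v in vectors] for u in vectors])
library = cosine_similarity(np.vstack(vectors))

print(np.round(manual, 3))
print(np.allclose(manual, library))
```

Vectors pointing the same way score 1 regardless of length, which is why a long and a short overview about the same topic can still match closely.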



In [36]:
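def recommend(title):
    # Row index of the requested movie (raises IndexError if the title is absent)
    idx = df[df["title"] == title].index[0]
    # Pair every movie index with its similarity to the requested movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Most similar first; the movie itself comes out on top
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:4]  # Skip the movie itself, keep the top 3
    for movie_idx, _ in sim_scores:
        print(df.iloc[movie_idx]["title"])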

recommend('Inception')

Interstellar
The Matrix
The Prestige


title is the input movie.

df[df['title'] == title] locates the row with that title.

.index[0] gets the row index of that movie.

cosine_sim[idx] gets the similarity scores for the movie at idx.

enumerate gives both the index and similarity score for each comparison.

sorted(..., reverse=True) sorts the list in descending order of similarity.

The highest similarity is with the movie itself (score = 1.0), so sim_scores[1:4] skips that first entry and keeps the next three.
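One possible variation, sketched under assumptions: return the recommendations with their scores instead of printing them, exclude the movie by index rather than by position, and return an empty list for an unknown title. The mini-catalogue, the name recommend_titles, and the top_n parameter are all hypothetical and only there so the example runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-catalogue so the sketch is self-contained.
movies = pd.DataFrame({
    "title": ["Space One", "Space Two", "Magic Show"],
    "overview": [
        "astronauts cross a wormhole in deep space",
        "a crew explores deep space",
        "rival magicians stage an illusion",
    ],
})

matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["overview"])
similarity = cosine_similarity(matrix)

def recommend_titles(title, top_n=3):
    """Return up to top_n (title, score) pairs, most similar first."""
    matches = movies.index[movies["title"] == title]
    if len(matches) == 0:
        return []  # unknown title: nothing to recommend
    idx = matches[0]
    # Drop the movie itself by index instead of assuming it sorts first.
    scores = [(i, s) for i, s in enumerate(similarity[idx]) if i != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [(movies["title"].iloc[i], float(s)) for i, s in scores[:top_n]]

print(recommend_titles("Space One", top_n=2))
print(recommend_titles("Not In The List"))
```

Returning the scores alongside the titles makes it easy to spot recommendations that only made the list because nothing better was available, for example entries with a score of 0.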