In [1]:
import pandas as pd

In [2]:
movies=pd.read_csv('movies.csv')

In [3]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


In [4]:
import re

The ***re*** library in Python is used for working with regular expressions (regex), which help in pattern matching and text manipulation.

***re.sub() Function***:
re.sub(pattern, replacement, string) replaces all occurrences of the pattern in string with replacement.
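As a quick illustration of `re.sub()`, the sketch below removes every character that is not a letter, a digit, or a space. The pattern is the one used for the titles in the next cell; the two sample strings are taken from the movie list only for demonstration.

```python
import re

# Match any single character that is NOT a letter, a digit, or a space.
pattern = r"[^a-zA-Z0-9 ]"

# Replace every match with the empty string, i.e. delete it.
cleaned_a = re.sub(pattern, "", "Toy Story (1995)")
cleaned_b = re.sub(pattern, "", "Schindler's List (1993)")

print(cleaned_a)  # Toy Story 1995
print(cleaned_b)  # Schindlers List 1993
```

Note that the apostrophe is deleted rather than replaced by a space, so "Schindler's" becomes "Schindlers".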

In [5]:
def clean_title(title):
    return re.sub(r"[^a-zA-Z0-9 ]", "", title)

In [6]:
movies["clean_title"]=movies["title"].apply(clean_title)

In [7]:
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001


# Next step: build the **TF-IDF matrix**

Because a computer can only compute with numbers, we convert the titles into a term-frequency matrix in which each column corresponds to a unique word (or word pair) across all titles.

### **Term Frequency (TF)**
Term Frequency (TF) is a numerical measure of how often a word appears in a document relative to the total number of words in that document. It helps determine the importance of a word in a specific document.

### **Formula:**
\[
TF = \frac{\text{Number of times a term appears in a document}}{\text{Total number of terms in the document}}
\]

### **Example:**
Consider this document:
> "Machine learning is amazing. Machine learning is the future."

- **Term:** `"Machine"`
- Appears **2** times.
- Total words: **9**  
- TF(`Machine`) = **2 / 9 ≈ 0.22**  
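The TF formula can be sketched in a few lines of plain Python. This is only an illustration: the helper name `term_frequency` is made up here, tokens are obtained by a simple whitespace split, and the sample documents are two cleaned movie titles.

```python
from collections import Counter

def term_frequency(term, document):
    """Share of the tokens in `document` that equal `term` (case-insensitive)."""
    tokens = document.lower().split()   # naive whitespace tokenization
    counts = Counter(tokens)            # how often each token occurs
    return counts[term.lower()] / len(tokens)

# "Toy Story 1995" has 3 tokens, one of which is "story".
tf_story = term_frequency("story", "Toy Story 1995")

# "Father of the Bride Part II 1995" has 7 tokens, one of which is "the".
tf_the = term_frequency("the", "Father of the Bride Part II 1995")

print(round(tf_story, 3), round(tf_the, 3))
```

Short titles give every word a relatively high TF, which is one reason TF alone is not enough to tell important words from common ones.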

### **Why is TF Useful?**
TF is commonly used in **TF-IDF (Term Frequency-Inverse Document Frequency)** to rank the importance of words in text processing tasks like search engines and NLP applications.

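We also apply inverse document frequency (IDF): each word's weight is scaled by the logarithm of how rare it is across all titles, so that common words count less. The result is a matrix of vectors, one per title. When we search, the query is converted into a vector in the same way and compared with every title vector; the most similar titles are returned as suggestions.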
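To make the IDF idea concrete, here is a minimal sketch in plain Python. The helper name and the three-title corpus are hypothetical. The formula is the smoothed variant `ln((1 + n) / (1 + df)) + 1`, which is, as far as I know, what scikit-learn uses by default; other definitions of IDF exist.

```python
import math

# Tiny hypothetical corpus of cleaned, lowercased titles.
titles = ["toy story 1995", "toy story 2 1999", "jumanji 1995"]

def inverse_document_frequency(term, documents):
    """Smoothed IDF: rarer terms receive a larger weight."""
    n_docs = len(documents)
    # Number of documents that contain the term at least once.
    doc_freq = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_toy = inverse_document_frequency("toy", titles)          # appears in 2 of 3 titles
idf_jumanji = inverse_document_frequency("jumanji", titles)  # appears in 1 of 3 titles

print(round(idf_toy, 3), round(idf_jumanji, 3))
```

Multiplying a term's TF by its IDF gives the TF-IDF weight, so a word like "jumanji" that occurs in few titles contributes more to a match than a word shared by many titles.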

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
vectorizer=TfidfVectorizer(ngram_range=(1,2))
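The `ngram_range=(1,2)` argument asks the vectorizer to use both single words and pairs of consecutive words as features. The sketch below imitates that in plain Python; `word_ngrams` is a made-up helper that splits on whitespace, whereas the real vectorizer applies its own tokenizer (which, by default, ignores one-character tokens).

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("Toy Story 1995")
print(grams)  # ['toy', 'story', '1995', 'toy story', 'story 1995']
```

Including word pairs lets a query such as "toy story" match the phrase itself, not just the two words separately.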

In [10]:
vectorizer

In [11]:
tfidf=vectorizer.fit_transform(movies["clean_title"])

In [12]:
tfidf

<62423x170073 sparse matrix of type '<class 'numpy.float64'>'
	with 446566 stored elements in Compressed Sparse Row format>

Creating the search function

In [13]:
# Compute the similarity between the search term we enter and every title in our list.
# We use cosine_similarity from sklearn.metrics to compare the query vector with each title vector.

In [14]:
from sklearn.metrics.pairwise import cosine_similarity
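Cosine similarity measures the angle between two vectors: 1 means the same direction, 0 means no shared terms. The following is a plain-Python sketch of the idea for small dense vectors; the function name `cosine` and the sample vectors are illustrative, and the sklearn function used in this notebook works on sparse matrices instead.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

query = [1.0, 1.0, 0.0]
same_direction = [2.0, 2.0, 0.0]   # same terms, larger weights
orthogonal = [0.0, 0.0, 3.0]       # no terms in common

sim_same = cosine(query, same_direction)
sim_orth = cosine(query, orthogonal)
print(round(sim_same, 6), sim_orth)
```

Because only the direction matters, a long title and a short query can still score highly as long as they share the same weighted terms.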

Example with "Harry Potter" before defining the function:

In [15]:
title="Harry Potter"
title=clean_title(title)

In [16]:
query_vec=vectorizer.transform([title])

In [17]:
query_vec
# sparse matrix that represents our input as a set of numbers

<1x170073 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [18]:
similarity=cosine_similarity(query_vec,tfidf).flatten()
similarity

array([0., 0., 0., ..., 0., 0., 0.])

We now look for the titles most similar to the given title using the np.argpartition function.

In [19]:
import numpy as np

The expression `np.argpartition(similarity, -5)[-5:]` finds the indices of the 5 highest values in the similarity array.

Step-by-step breakdown:

- `np.argpartition(similarity, -5)` returns indices arranged so that the 5 highest values of `similarity` occupy the last 5 positions. It does not fully sort them; it only guarantees that the largest values end up at that end.
- `[-5:]` retrieves those last 5 indices.
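Since `np.argpartition` does not order the selected indices among themselves, reversing them afterwards does not necessarily put the best match first. The sketch below, on a small made-up score array, shows the partition step and one possible way to sort just the selected entries; the variable names are illustrative.

```python
import numpy as np

# Hypothetical similarity scores for 7 items.
scores = np.array([0.10, 0.90, 0.30, 0.70, 0.05, 0.50, 0.20])
k = 3

# Indices of the k largest scores, in no guaranteed order.
top_k = np.argpartition(scores, -k)[-k:]

# Sort only those k indices by their score, best first.
ordered = top_k[np.argsort(scores[top_k])[::-1]]

print(sorted(top_k.tolist()))  # [1, 3, 5]
print(ordered.tolist())        # [1, 3, 5]
```

Sorting only `k` entries keeps the cost low even when the full array holds tens of thousands of scores.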

In [20]:
indices=np.argpartition(similarity,-5)[-5:]
indices

array([11700, 10408,  5704,  4790, 13512])

In [21]:
results=movies.iloc[indices][::-1]
results

Unnamed: 0,movieId,title,genres,clean_title
13512,69844,Harry Potter and the Half-Blood Prince (2009),Adventure|Fantasy|Mystery|Romance|IMAX,Harry Potter and the HalfBlood Prince 2009
4790,4896,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Adventure|Children|Fantasy,Harry Potter and the Sorcerers Stone aka Harry...
5704,5816,Harry Potter and the Chamber of Secrets (2002),Adventure|Fantasy,Harry Potter and the Chamber of Secrets 2002
10408,40815,Harry Potter and the Goblet of Fire (2005),Adventure|Fantasy|Thriller|IMAX,Harry Potter and the Goblet of Fire 2005
11700,54001,Harry Potter and the Order of the Phoenix (2007),Adventure|Drama|Fantasy|IMAX,Harry Potter and the Order of the Phoenix 2007


Now we can define the final function:

In [22]:
def search(title):
  title=clean_title(title)
  query_vec=vectorizer.transform([title])
  similarity=cosine_similarity(query_vec,tfidf).flatten()
  indices=np.argpartition(similarity,-5)[-5:]
  results=movies.iloc[indices][::-1]
  return results

Now we create a search box widget for the notebook.

In [23]:
import ipywidgets as widgets
from IPython.display import display

Creating a text input widget with a default value and a description

In [24]:
movie_input=widgets.Text(
    value="Toy Story",
    description="Movie Title:",
    disabled=False
)

It isn't interactive yet; so far we simply have an input widget to type into.

In [25]:
movie_input

Text(value='Toy Story', description='Movie Title:')

In [26]:
movie_list=widgets.Output()

In [27]:
def on_type(data):
  with movie_list:
    movie_list.clear_output()
    title=data["new"]
    if len(title)>3:
      display(search(title))

In [28]:
movie_input.observe(on_type,names="value")
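When a widget's value changes, the observer is called with a dictionary describing the change; the handler above reads the new text from its `"new"` key. The sketch below imitates that with an ordinary dictionary so the length check can be tried without a widget. `should_search` and `fake_change` are hypothetical names, and the dictionary only contains the keys relevant here.

```python
def should_search(change, min_length=3):
    """Mirror the handler's check: search only once the text is long enough."""
    title = change["new"]          # the text currently in the box
    return len(title) > min_length

# Stand-in for the change dictionary a Text widget would pass to the handler.
fake_change = {"name": "value", "old": "Toy", "new": "Toy S"}

print(should_search(fake_change))     # True: "Toy S" has 5 characters
print(should_search({"new": "Toy"}))  # False: 3 characters is not more than 3
```

The threshold avoids running a search on every single keystroke of a very short query.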

In [29]:
display(movie_input,movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()

In [30]:
ratings=pd.read_csv("ratings.csv")

In [31]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1.147880e+09
1,1,306,3.5,1.147869e+09
2,1,307,5.0,1.147869e+09
3,1,665,5.0,1.147879e+09
4,1,899,3.5,1.147869e+09
...,...,...,...,...
1307176,8814,6874,3.5,1.487199e+09
1307177,8814,7153,4.5,1.487198e+09
1307178,8814,7451,3.0,1.487199e+09
1307179,8814,7669,4.5,1.487200e+09


In [32]:
ratings.dtypes

Unnamed: 0,0
userId,int64
movieId,int64
rating,float64
timestamp,float64


To build the recommendation system, we need to find users who liked the same movie I liked and look at what else they liked, since those movies are potential recommendations. So let's find the users who liked the same movie.
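The idea can be tried on a handful of rows before running it on the full ratings table. This is a plain-Python sketch with invented `(userId, movieId, rating)` tuples; the notebook itself does the same filtering with pandas.

```python
from collections import Counter

# Hypothetical ratings: (userId, movieId, rating)
rows = [
    (1, 1, 5.0), (1, 318, 4.5),
    (2, 1, 3.0), (2, 296, 5.0),
    (3, 1, 4.5), (3, 318, 5.0), (3, 260, 2.0),
]
movie_id = 1

# Step 1: users who rated the chosen movie above 4.
similar = {user for user, movie, rating in rows if movie == movie_id and rating > 4}

# Step 2: every movie those users rated above 4.
liked = [movie for user, movie, rating in rows if user in similar and rating > 4]

# Step 3: share of the similar users who liked each movie.
shares = {movie: count / len(similar) for movie, count in Counter(liked).items()}

print(similar)  # {1, 3}
print(shares)   # {1: 1.0, 318: 1.0}
```

User 2 rated the chosen movie only 3.0, so that user's favourite (movie 296) does not appear among the candidates.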

In [33]:
movie_id=1

In [34]:
similar_users=ratings[(ratings["movieId"]==movie_id) & (ratings["rating"]>4)]["userId"].unique()

In [35]:
similar_users

array([  36,   75,   86, ..., 8796, 8798, 8802])

Now let's see the other movies that they liked.

In [36]:
similar_user_recs1 = ratings[
    (ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)
][["userId", "movieId", "rating", "timestamp"]]

In [37]:
similar_user_recs=ratings[(ratings["userId"].isin(similar_users))&(ratings["rating"]>4)]["movieId"]
similar_user_recs

Unnamed: 0,movieId
5101,1
5105,34
5111,110
5114,150
5127,260
...,...
1305558,168252
1305563,170705
1305580,176371
1305601,184399


In [38]:
similar_user_recs1

Unnamed: 0,userId,movieId,rating,timestamp
5101,36,1,5.0,8.571314e+08
5105,36,34,5.0,8.344138e+08
5111,36,110,5.0,8.344130e+08
5114,36,150,5.0,8.399286e+08
5127,36,260,5.0,8.571311e+08
...,...,...,...,...
1305558,8802,168252,4.5,1.549911e+09
1305563,8802,170705,4.5,1.532478e+09
1305580,8802,176371,4.5,1.537467e+09
1305601,8802,184399,4.5,1.533530e+09


In [39]:
# We want to keep only the movies that more than 10% of the similar users liked,
# which narrows down the candidates. First, count how often each movie appears.
similar_user_recs.value_counts()

Unnamed: 0_level_0,count
movieId,Unnamed: 1_level_1
1,1010
318,420
260,384
296,353
356,346
...,...
1824,1
1925,1
1975,1
1977,1


In [40]:
similar_user_recs=similar_user_recs.value_counts() / len(similar_users)

In [41]:
similar_user_recs

Unnamed: 0_level_0,count
movieId,Unnamed: 1_level_1
1,1.000000
318,0.415842
260,0.380198
296,0.349505
356,0.342574
...,...
1824,0.000990
1925,0.000990
1975,0.000990
1977,0.000990


In [42]:
# We keep only the ones above ten percent
similar_user_recs=similar_user_recs[similar_user_recs  > .1]

In [43]:
similar_user_recs

Unnamed: 0_level_0,count
movieId,Unnamed: 1_level_1
1,1.000000
318,0.415842
260,0.380198
296,0.349505
356,0.342574
...,...
380,0.101980
111,0.101980
1387,0.101980
1219,0.100990


In [44]:
# Find how much all users like these movies
all_users=ratings[(ratings["movieId"].isin(similar_user_recs.index))&(ratings["rating"]>4)]

In [45]:
all_users

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1.147880e+09
29,1,4973,4.5,1.147869e+09
48,1,7361,5.0,1.147880e+09
72,2,110,5.0,1.141417e+09
76,2,260,5.0,1.141417e+09
...,...,...,...,...
1307128,8814,904,5.0,1.487198e+09
1307137,8814,1198,4.5,1.487198e+09
1307149,8814,1732,5.0,1.487198e+09
1307171,8814,4993,4.5,1.487198e+09


In [46]:
# Share of all users who liked each of these movies; comparing it with the share among similar users shows which movies the similar users favor disproportionately
all_users_recs=all_users["movieId"].value_counts() / len(all_users["userId"].unique())

In [47]:
all_users_recs

Unnamed: 0_level_0,count
movieId,Unnamed: 1_level_1
318,0.346853
296,0.284169
2571,0.243732
356,0.232424
593,0.226155
...,...
1580,0.043879
1278,0.040069
50872,0.038963
78499,0.035890


In [48]:
#creating a recommendation score
rec_percentages=pd.concat([similar_user_recs,all_users_recs],axis=1)

In [49]:
rec_percentages.columns=["similar","all"]

In [50]:
rec_percentages

Unnamed: 0_level_0,similar,all
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1.000000,0.124140
318,0.415842,0.346853
260,0.380198,0.215339
296,0.349505,0.284169
356,0.342574,0.232424
...,...,...
380,0.101980,0.048427
111,0.101980,0.076573
1387,0.101980,0.046214
1219,0.100990,0.056785


In [51]:
# Here we compute the ratio between the two percentages
rec_percentages["score"]=rec_percentages["similar"]/rec_percentages["all"]
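The score is a ratio: how much more popular a movie is among the similar users than among all users. As a small worked example, the sketch below recomputes it for two rows using the rounded `similar` and `all` values shown in the tables of this notebook; the function name `lift` is just a label chosen here.

```python
def lift(similar_share, overall_share):
    """Ratio of a movie's popularity among similar users to its overall popularity."""
    return similar_share / overall_share

# (similar, all) values copied from the notebook's output tables.
movie_3114 = lift(0.269307, 0.050885)   # liked far more often by similar users
movie_318 = lift(0.415842, 0.346853)    # popular with everyone, so only a small lift

print(round(movie_3114, 2), round(movie_318, 2))
```

A score close to 1 means the movie is liked about equally by everyone, so it says little about the chosen movie; a high score points to a taste that is specific to the similar users.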

In [52]:
rec_percentages

Unnamed: 0_level_0,similar,all,score
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1.000000,0.124140,8.055446
318,0.415842,0.346853,1.198897
260,0.380198,0.215339,1.765577
296,0.349505,0.284169,1.229919
356,0.342574,0.232424,1.473921
...,...,...,...
380,0.101980,0.048427,2.105865
111,0.101980,0.076573,1.331799
1387,0.101980,0.046214,2.206678
1219,0.100990,0.056785,1.778475


In [53]:
rec_percenatges=rec_percentages.sort_values("score",ascending=False)

In [54]:
rec_percenatges

Unnamed: 0_level_0,similar,all,score
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1.000000,0.124140,8.055446
3114,0.269307,0.050885,5.292467
2355,0.111881,0.024336,4.597300
78499,0.147525,0.035890,4.110484
588,0.234653,0.070305,3.337658
...,...,...,...
4973,0.129703,0.107793,1.203265
318,0.415842,0.346853,1.198897
2858,0.195050,0.164208,1.187816
7361,0.118812,0.103982,1.142616


In [55]:
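# Note: the sorted frame above was assigned to the misspelled name rec_percenatges,
# so this call uses the unsorted rec_percentages and the rows below are not ordered by score.
rec_percentages.head(10).merge(movies,left_index=True,right_on="movieId")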

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
0,1.0,0.12414,8.055446,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
314,0.415842,0.346853,1.198897,318,"Shawshank Redemption, The (1994)",Crime|Drama,Shawshank Redemption The 1994
257,0.380198,0.215339,1.765577,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,Star Wars Episode IV A New Hope 1977
292,0.349505,0.284169,1.229919,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,Pulp Fiction 1994
351,0.342574,0.232424,1.473921,356,Forrest Gump (1994),Comedy|Drama|Romance|War,Forrest Gump 1994
2480,0.318812,0.243732,1.308045,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,Matrix The 1999
1166,0.309901,0.183505,1.688784,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,Star Wars Episode V The Empire Strikes Back 1980
1168,0.30495,0.158923,1.918853,1198,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure,Raiders of the Lost Ark Indiana Jones and the ...
585,0.30099,0.226155,1.3309,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,Silence of the Lambs The 1991
522,0.281188,0.213864,1.314797,527,Schindler's List (1993),Drama|War,Schindlers List 1993


In [56]:
#Building the recommendation function

In [66]:
def find_similar_movies(movie_id):
    # Find users who rated the given movie highly
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

    # Find movies these users also rated highly
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]

    # Count occurrences of each movie and normalize by number of similar users
    similar_user_recs = similar_user_recs.groupby("movieId").size() / len(similar_users)

    # Filter recommendations with at least 10% of similar users liking them
    similar_user_recs = similar_user_recs[similar_user_recs > 0.1]

    # Get ratings from all users for these recommended movies
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]

    # Count occurrences of each movie among all users and normalize
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

    # Combine both data sources into a single DataFrame
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]

    # Compute a recommendation score (higher means better)
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]

    # Sort by score in descending order
    rec_percentages = rec_percentages.sort_values("score", ascending=False)

    # Return top 10 recommendations with movie titles and genres
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]


In [68]:
import ipywidgets as widgets
from IPython.display import display, clear_output

movie_input_name = widgets.Text(
    value="Toy Story",
    description="Movie Title:",
    disabled=False
)

recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 3:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_input_name.observe(on_type, names="value")

display(movie_input_name, recommendation_list)


Text(value='Toy Story', description='Movie Title:')

Output()

In [None]:
# We could further improve this code by also using genres rather than titles alone, or by bringing in other data such as actors.