In [None]:
import pandas as pd

movies = pd.read_csv(r"C:\Users\dipak\Downloads\ml-25m\ml-25m\movies.csv")


In [2]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


# Phase 1: Data Loading and Preprocessing

### 1. Initial Data Inspection
* **What I have done:** Loaded the movie dataset and displayed the first few entries to inspect columns like `movieId`, `title`, and `genres`.




The original titles contain a release year in parentheses and sometimes other punctuation (for example, `Toy Story (1995)`). If a user searches for "Toy Story", those extra characters can interfere with matching. By removing everything except letters, digits, and spaces, we create a "normalized" version of the title that is much easier for the computer to match. Note that the digits of the year are kept; only the parentheses and other punctuation are dropped.

In [3]:
import re

def clean_title(title):
    title = re.sub("[^a-zA-Z0-9 ]", "", title)
    return title

I have defined a function called `clean_title` using a regular expression (regex). Next, I apply this function to the `title` column to create a new column, also called `clean_title`.

In [4]:
movies["clean_title"] = movies["title"].apply(clean_title)
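As a quick sanity check, here is a minimal sketch of what the regex does to a few sample titles. The `samples` list is my own illustration (the titles are taken from the dataset output shown in this notebook), and the function is redefined so the snippet runs on its own.

```python
import re

def clean_title(title):
    # Keep only ASCII letters, digits, and spaces; drop everything else
    return re.sub("[^a-zA-Z0-9 ]", "", title)

# Illustrative sample titles (not a separate dataset)
samples = ["Toy Story (1995)", "Avengers, The (2012)", "Ant-Man (2015)"]
cleaned = [clean_title(t) for t in samples]

for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Hyphens are removed rather than replaced with a space, so `Ant-Man` becomes `AntMan`. That may or may not be what you want for search, but it matches the `clean_title` column shown in this notebook.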

In [5]:
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,2))

tfidf = vectorizer.fit_transform(movies["clean_title"])

The `TfidfVectorizer` has analyzed the `clean_title` column and created a sparse matrix with one row per movie.

* It identified all unique words (unigrams) and pairs of consecutive words (bigrams) across all titles.
* It assigned a weight to each term that rises with how often the term appears in a specific title and falls with how common the term is across the entire dataset.

The `ngram_range=(1,2)` setting is important because it lets the system treat "Toy" and "Story" as individual terms while also recognizing "Toy Story" as a term of its own.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(title):
    title = clean_title(title)
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
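    # argpartition finds the 5 best scores but leaves them in no particular order
    indices = np.argpartition(similarity, -5)[-5:]
    # Sort just those 5 by similarity so the best match comes first
    indices = indices[np.argsort(similarity[indices])[::-1]]
    results = movies.iloc[indices]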
    
    return results

I am defining a function called `search` that uses cosine similarity to compare a user's input against the `tfidf` matrix we built earlier.

**Why I am doing this:** Now that our movies are represented as numerical vectors, we need a way to find the "closest" movies to a search term.

* **Cosine similarity:** Measures the cosine of the angle between two vectors. The smaller the angle, the closer the value is to 1 and the more similar the titles are.
* **`np.argpartition`:** A fast way to pick out the 5 highest similarity scores without sorting the entire dataset. It does not guarantee any particular order among those 5.
* **Result:** The function returns the 5 movies whose titles best match the query.
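The two ideas above can be sketched on tiny hand-made arrays. The vectors and the `scores` array below are invented for illustration, and `cosine` is a hypothetical helper written with plain NumPy rather than the scikit-learn function used in `search`.

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 1.0, 0.0])
same_direction = np.array([2.0, 2.0, 0.0])   # same angle, different length
orthogonal = np.array([0.0, 0.0, 3.0])       # shares no terms with the query

print(cosine(query, same_direction))  # close to 1: very similar
print(cosine(query, orthogonal))      # 0: nothing in common

# Picking the top 3 of 6 similarity scores
scores = np.array([0.10, 0.90, 0.30, 0.75, 0.05, 0.60])
top3 = np.argpartition(scores, -3)[-3:]              # the 3 largest, unordered
top3_sorted = top3[np.argsort(scores[top3])[::-1]]   # order only those 3, best first
print(top3_sorted)
```

Sorting only the handful of selected indices keeps the speed benefit of `np.argpartition` while still giving a best-first ranking.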

In [8]:
!pip install ipywidgets




In [9]:
!jupyter nbextension enable --py widgetsnbextension

Jupyter command `jupyter-nbextension` not found.


In [8]:
import ipywidgets as widgets
widgets.IntSlider(value=50, description='Test:')

IntSlider(value=50, description='Test:')

In [23]:
!jupyter --version

Selected Jupyter core packages...
IPython          : 8.21.0
ipykernel        : 6.29.0
ipywidgets       : 7.8.1
jupyter_client   : 8.6.0
jupyter_core     : 5.7.1
jupyter_server   : 2.14.1
jupyterlab       : 4.0.11
nbclient         : 0.8.0
nbconvert        : 7.10.0
nbformat         : 5.9.2
notebook         : 7.0.8
qtconsole        : 5.5.1
traitlets        : 5.14.1


In [24]:
%pip install --upgrade ipywidgets jupyterlab_widgets

Collecting ipywidgets
  Downloading ipywidgets-8.1.8-py3-none-any.whl.metadata (2.4 kB)
Collecting jupyterlab_widgets
  Downloading jupyterlab_widgets-3.0.16-py3-none-any.whl.metadata (20 kB)
Collecting widgetsnbextension~=4.0.14 (from ipywidgets)
  Downloading widgetsnbextension-4.0.15-py3-none-any.whl.metadata (1.6 kB)
Downloading ipywidgets-8.1.8-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.16-py3-none-any.whl (914 kB)

In [13]:
# 1. Upgrade the packages to their latest versions
%pip install --upgrade ipywidgets jupyterlab_widgets

# 2. Confirm that ipywidgets is now version 8.x
import ipywidgets
print(f"New version: {ipywidgets.__version__}")

Note: you may need to restart the kernel to use updated packages.
New version: 8.1.8


In [9]:
import ipywidgets as widgets
from IPython.display import display

movie_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
movie_list = widgets.Output()

def on_type(data):
    # Redraw the results each time the text box value changes
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        # Skip very short inputs to avoid noisy matches
        if len(title) > 5:
            # search() is the title-matching function defined earlier
            display(search(title))

movie_input.observe(on_type, names='value')

display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()

To build a recommendation system that goes beyond just matching titles (collaborative filtering), we need to know how users have rated movies. The ratings dataset will allow us to:

* Find users who watched and liked our target movie (The Avengers).
* See what other movies those specific users rated highly.
* Use those ratings to calculate which movies are most "similar" in terms of audience preference.

In [11]:
movie_id = 89745

#def find_similar_movies(movie_id):
movie = movies[movies["movieId"] == movie_id]

I am selecting a specific movieId (in this case, 89745 for The Avengers) and retrieving its record from my dataset.

In [14]:
ratings = pd.read_csv(r"C:\Users\dipak\Downloads\ml-25m\ml-25m\ratings.csv")


In [15]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

The core logic of a recommendation engine is: "If User A and User B both liked Movie X, then User A might also like other movies that User B enjoyed." By narrowing our list to users who rated this specific movie highly, we are identifying a group of people whose preferences are relevant to our target. Using .unique() ensures we don't count the same user multiple times if there were duplicate entries.

In [16]:
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

I am filtering the ratings dataset to find the unique IDs of users who watched the same movie (`movie_id = 89745`) and gave it a high rating (strictly greater than 4 stars, which on this dataset's half-star scale means 4.5 or 5.0).
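Here is a minimal sketch of that filter on a made-up ratings table. The `toy_ratings` values and the `target` ID are invented for illustration; only the column names match the real `ratings.csv`.

```python
import pandas as pd

# Hypothetical mini ratings table with the same columns as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 10, 20, 20],
    "rating":  [5.0, 4.5, 4.0, 5.0, 4.5, 3.0, 5.0],
})

target = 10  # stand-in for the target movie ID

# Users who rated the target movie strictly above 4
mask = (toy_ratings["movieId"] == target) & (toy_ratings["rating"] > 4)
similar_users = toy_ratings[mask]["userId"].unique()
print(similar_users)
```

User 2 rated the target exactly 4.0 and is excluded, because the comparison is strict. If a 4-star rating should count as "liked", `>= 4` would be the condition to use instead.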

In [17]:
similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]

This is the "Discovery" phase of collaborative filtering. We want to see which movies are popular among the specific sub-group of users we identified in the previous step. If a large percentage of people who liked The Avengers also liked Iron Man, then Iron Man is a strong candidate to recommend.


I am identifying all movies that the "similar users" (those who liked The Avengers) also rated highly (greater than 4 stars).

We now have a huge list of movie IDs, but we don't yet know which ones are actually significant.

* **Normalizing:** By dividing the `value_counts()` by the total number of `similar_users`, we get a score for each movie: the fraction of similar users who liked it.
* **Filtering (`> 0.10`):** We only care about movies that a reasonable portion of this specific group liked. If only 1% of the people who liked The Avengers liked another movie, it's likely just "noise." By setting a 10% threshold, we focus on movies with a stronger consensus.

In [18]:
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

similar_user_recs = similar_user_recs[similar_user_recs > .10]

I am calculating what fraction of the users who liked our target movie (The Avengers) also liked each other movie in the dataset. I am then filtering this list to keep only movies that were liked by more than 10% of those similar users.
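The counting and thresholding step can be sketched with invented numbers. Below, I assume 10 similar users and a hypothetical list of the movie IDs they liked; none of these values come from the real dataset.

```python
import pandas as pd

n_similar_users = 10  # assumed size of the similar-user group

# One entry per (similar user, liked movie) pair: movie 100 was liked by
# all 10 users, movie 200 by 4 of them, and movie 300 by only 1
liked = pd.Series([100] * 10 + [200] * 4 + [300] * 1, name="movieId")

# Fraction of similar users who liked each movie
share = liked.value_counts() / n_similar_users
print(share)

# Keep only movies liked by more than 10% of the group
kept = share[share > 0.10]
print(kept)
```

Movie 300 sits at exactly 10% and is dropped, since the comparison is strictly greater than 0.10.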

To make a good recommendation, we need to find movies that are uniquely liked by people who enjoyed our target movie (The Avengers).

Some movies are "universally" liked (e.g., The Shawshank Redemption or Forrest Gump).

If 30% of "Avengers fans" like a movie, but 30% of everyone likes it too, it's not a specific recommendation for an Avengers fan.

By calculating how much the general population likes these movies, we can later find the "gap" or "ratio" between the two groups to identify niche favorites.

In [19]:
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]

I am filtering the entire ratings dataset to find every high rating (greater than 4), from any user, for the movies that our "similar users" also liked.

We want to find movies that are disproportionately popular among people who liked The Avengers.

**The score:**

$$\text{Score} = \frac{\% \text{ of similar users who liked the movie}}{\% \text{ of all users who liked the movie}}$$

A high score means that people who liked The Avengers are much more likely to like this movie than the average person. This is the secret sauce that filters out "generic" hits (like The Shawshank Redemption) and prioritizes "relevant" hits (like Iron Man or Captain America).

In [20]:
all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

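I am combining `similar_user_recs` and `all_user_recs` into a single table and calculating a final "score" for each movie by dividing the first percentage by the second. Note that `all_user_recs` is measured against the users who gave more than 4 stars to at least one of these candidate movies, not against every user in the dataset.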
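To make the ratio concrete, here is a small sketch with made-up percentages for three hypothetical movies (IDs 1, 2, and 3). The numbers are illustrative assumptions, not values from the ratings data.

```python
import pandas as pd

# Share of similar users who liked each movie (hypothetical values)
similar = pd.Series({1: 0.50, 2: 0.30, 3: 0.30}, name="similar")
# Share of the general audience who liked each movie (hypothetical values)
overall = pd.Series({1: 0.45, 2: 0.30, 3: 0.02}, name="all")

# Align the two series on movie ID and compute the ratio
table = pd.concat([similar, overall], axis=1)
table["score"] = table["similar"] / table["all"]
table = table.sort_values("score", ascending=False)
print(table)
```

Movie 1 is the most popular among similar users, but almost everyone likes it, so its score stays close to 1. Movie 3 is liked by fewer similar users, yet it is rare in the general audience, so it ranks first. This is the behaviour the score is meant to produce.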

In [21]:
rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
rec_percentages.columns = ["similar", "all"]

In [22]:
rec_percentages

movieId,similar,all
89745,1.000000,0.040459
58559,0.573393,0.148256
59315,0.530649,0.054931
79132,0.519715,0.132987
2571,0.496687,0.247010
...,...,...
47610,0.103545,0.022770
780,0.103380,0.054723
88744,0.103048,0.010383
1258,0.101226,0.083887


In [23]:
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]

In [24]:
rec_percentages = rec_percentages.sort_values("score", ascending=False)

In [25]:
rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
17067,1.0,0.040459,24.716368,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX,Avengers The 2012
20513,0.103711,0.005289,19.610199,106072,Thor: The Dark World (2013),Action|Adventure|Fantasy|IMAX,Thor The Dark World 2013
25058,0.241054,0.012367,19.49177,122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi,Avengers Age of Ultron 2015
19678,0.216534,0.012119,17.867419,102125,Iron Man 3 (2013),Action|Sci-Fi|Thriller|IMAX,Iron Man 3 2013
16725,0.215043,0.012052,17.843074,88140,Captain America: The First Avenger (2011),Action|Adventure|Sci-Fi|Thriller|War,Captain America The First Avenger 2011
16312,0.175447,0.010142,17.299824,86332,Thor (2011),Action|Adventure|Drama|Fantasy|IMAX,Thor 2011
21348,0.287608,0.016737,17.183667,110102,Captain America: The Winter Soldier (2014),Action|Adventure|Sci-Fi|IMAX,Captain America The Winter Soldier 2014
25071,0.214049,0.012856,16.649399,122920,Captain America: Civil War (2016),Action|Sci-Fi|Thriller,Captain America Civil War 2016
25061,0.136017,0.008573,15.865628,122900,Ant-Man (2015),Action|Adventure|Sci-Fi,AntMan 2015
14628,0.242876,0.015517,15.651921,77561,Iron Man 2 (2010),Action|Adventure|Sci-Fi|Thriller|IMAX,Iron Man 2 2010


In [26]:
def find_similar_movies(movie_id):
    """Return the 10 movies most specific to users who rated movie_id above 4."""
    # Users who liked the target movie
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    # Every movie those users also rated above 4
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    # Fraction of similar users who liked each movie
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

    # Keep only movies liked by more than 10% of similar users
    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    # How popular those same movies are among users in general
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]

    # A high ratio means the movie is disproportionately liked by similar users
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]

In [27]:
import ipywidgets as widgets
from IPython.display import display

movie_name_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')

display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()