# üì¶ Importing Required Libraries

In this section, we import all necessary Python libraries used throughout the project.

These libraries are responsible for:

- **Data manipulation:** `pandas`, `numpy`
- **Visualization:** `matplotlib`
- **Mathematical operations:** `math`

Keeping all imports at the beginning of the notebook improves readability and reproducibility.


In [20]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt

# Mathematical operations
from math import sqrt

# Optional: Improve plot aesthetics
plt.style.use("seaborn-v0_8")

# üìÇ Loading the Dataset

In this section, we load the datasets required for building the recommendation system.

We use two CSV files:

- `movies.csv` ‚Üí Contains movie metadata (movieId, title, genres)
- `ratings.csv` ‚Üí Contains user ratings for movies

These datasets are provided by GroupLens (IMDB-based dataset).

After loading the data, we will inspect their structure.


In [21]:
# Define dataset paths (recommended for better project structure)
MOVIES_PATH = "data/movies.csv"
RATINGS_PATH = "data/ratings.csv"

# Load datasets into pandas DataFrames
movies_df = pd.read_csv(MOVIES_PATH)
ratings_df = pd.read_csv(RATINGS_PATH)

# Display basic information
print("Movies Dataset Shape:", movies_df.shape)
print("Ratings Dataset Shape:", ratings_df.shape)

Movies Dataset Shape: (9742, 3)
Ratings Dataset Shape: (100836, 4)


# üîé Initial Data Exploration (Movies Dataset)

In this section, we perform an initial exploration of the `movies` dataset to better understand its structure.

We will:

- Preview the first few rows
- Check dataset dimensions
- Inspect data types
- Generate statistical summaries (where applicable)

This step helps us understand the structure and quality of the dataset before preprocessing.

In [22]:
# Initial Exploration - Movies Dataset
print("First 5 Rows:")
display(movies_df.head())

print("\n Dataset Shape:")
print(movies_df.shape)

print("\n Data Types:")
display(movies_df.dtypes)

print("\n Statistical Summary:")
display(movies_df.describe(include='all'))


First 5 Rows:


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy



 Dataset Shape:
(9742, 3)

 Data Types:


movieId     int64
title      object
genres     object
dtype: object


 Statistical Summary:


Unnamed: 0,movieId,title,genres
count,9742.0,9742,9742
unique,,9737,951
top,,Emma (1996),Drama
freq,,2,1053
mean,42200.353623,,
std,52160.494854,,
min,1.0,,
25%,3248.25,,
50%,7300.0,,
75%,76232.0,,


# üßπ Preprocessing Movies Data

In this step, we clean and preprocess the `movies` dataset to make it ready for modeling.

We perform the following transformations:

1. **Extract release year** from the movie title and store it in a new column `year`
2. **Clean movie titles** by removing the year part (e.g., "(1995)") and trimming extra spaces
3. **Parse genres** by splitting the `genres` string into a list of genre labels

These features will be useful for:
- Content-based recommendation (genres)
- Better similarity calculations (clean titles)
- Optional clustering/analysis by release year


In [23]:
# 1) Extract release year from title (4-digit year inside parentheses)
movies_df["year"] = movies_df["title"].str.extract(r"\((\d{4})\)")
movies_df["year"] = pd.to_numeric(movies_df["year"], errors="coerce")

# 2) Clean title by removing the year and extra spaces
movies_df["title"] = movies_df["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

# 3) Convert genres from "Action|Comedy|..." to a list of genres
movies_df["genres"] = movies_df["genres"].str.split("|")

# Quick sanity-check
display(movies_df.head())


Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995.0
4,5,Father of the Bride Part II,[Comedy],1995.0


# üé¨ Genre Encoding (One-Hot Representation)

To use genres in similarity calculations and recommendation models, 
we convert the list of genres into a One-Hot encoded format.

Each genre becomes a separate binary column:

- 1 ‚Üí Movie belongs to that genre
- 0 ‚Üí Movie does not belong to that genre

This transformation is essential for:
- Content-based filtering
- User preference profiling
- Clustering users based on genre interests

In [24]:
# Make a copy of the dataset
movies_with_genres_df = movies_df.copy()

# Explode genres list into separate rows
exploded_df = movies_with_genres_df.explode("genres")

# Apply one-hot encoding
genres_dummies = pd.get_dummies(exploded_df["genres"])

# Combine back with original movie info
movies_with_genres_df = (
    exploded_df[["movieId", "title", "year"]]
    .join(genres_dummies)
    .groupby(["movieId", "title", "year"], as_index=False)
    .sum()
)

# Preview result
display(movies_with_genres_df.head())

# Show genre columns
print("Genre Columns:")
display(movies_with_genres_df.columns[3:])

Unnamed: 0,movieId,title,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,1995.0,0,0,5,5,5,5,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995.0,0,0,3,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,1995.0,0,0,0,0,0,2,0,...,0,0,0,0,0,2,0,0,0,0
3,4,Waiting to Exhale,1995.0,0,0,0,0,0,3,0,...,0,0,0,0,0,3,0,0,0,0
4,5,Father of the Bride Part II,1995.0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


Genre Columns:


Index(['(no genres listed)', 'Action', 'Adventure', 'Animation', 'Children',
       'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
       'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
       'War', 'Western'],
      dtype='object')

# ‚≠ê Preprocessing Ratings Data

In this step, we clean and prepare the `ratings` dataset.

The dataset contains:

- `userId`
- `movieId`
- `rating`
- `timestamp`

Since the timestamp is not required for our current recommendation approach,
we remove it to simplify the dataset.

Later, this dataset will be used to:
- Build user profiles
- Construct the user-item interaction matrix
- Perform collaborative filtering


In [25]:
# Preview first rows
print(" Initial Ratings Data:")
display(ratings_df.head())

# Remove timestamp column (not needed for current analysis)
ratings_df = ratings_df.drop(columns=["timestamp"])

# Check structure after modification
print("\n Ratings Data After Dropping 'timestamp':")
display(ratings_df.head())

print("\n Dataset Shape:")
print(ratings_df.shape)

 Initial Ratings Data:


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931



 Ratings Data After Dropping 'timestamp':


Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0



 Dataset Shape:
(100836, 3)


# üë§ Simulating a New User (Cold-Start Input)

To demonstrate a **content-based recommendation** workflow, we create a small set of movies rated by a hypothetical new user.

This is a typical *cold-start* scenario where:
- The user has no history in the ratings dataset
- We only know a few rated movies from them

Next, we map the input movie titles to their corresponding `movieId` in `movies_df`,
then merge the ratings into a single dataframe for further processing.


In [26]:
user_input = [
    {'title' : 'Brreakfast Club, The', 'rating':5},
    {'title' : 'Toy Story', 'rating':3.5},
    {'title' : 'Jumanji', 'rating':2},
    {'title' : 'Pulp Fiction', 'rating':5},
    {'title' : 'Akira', 'rating':4.5}

]

In [27]:
input_movies = pd.DataFrame(user_input)

print(" User Input (raw):")
display(input_movies)

# Step 2) Map titles to movieId (exact-title match)
matched_movies = movies_df[movies_df["title"].isin(input_movies["title"])][["movieId", "title", "genres", "year"]]

# Step 3) Merge user ratings with movie metadata
input_movies = matched_movies.merge(input_movies, on="title", how="right")

# Step 4) Warn if some titles were not matched
missing_titles = input_movies[input_movies["movieId"].isna()]["title"].tolist()
if missing_titles:
    print(" Warning: These titles were not found in movies_df (check spelling):")
    for t in missing_titles:
        print("-", t)

print("\n User Input After Mapping to movieId:")
display(input_movies)

 User Input (raw):


Unnamed: 0,title,rating
0,"Brreakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


- Brreakfast Club, The

 User Input After Mapping to movieId:


Unnamed: 0,movieId,title,genres,year,rating
0,,"Brreakfast Club, The",,,5.0
1,1.0,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995.0,3.5
2,2.0,Jumanji,"[Adventure, Children, Fantasy]",1995.0,2.0
3,296.0,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994.0,5.0
4,1274.0,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988.0,4.5


# üßπ Cleaning User Input Data

After merging the user ratings with movie metadata, 
we remove unnecessary columns (`genres`, `year`) 
to keep only the essential information required for building the user profile.

Final columns needed:
- `movieId`
- `title`
- `rating`

This cleaned dataframe will be used to calculate the user's genre preference vector.

In [28]:
print(" Columns Before Cleaning:")
display(input_movies.columns)

# Keep only necessary columns for recommendation
input_movies = input_movies.drop(columns=["genres", "year"])

print("\n Cleaned User Input:")
display(input_movies)

 Columns Before Cleaning:


Index(['movieId', 'title', 'genres', 'year', 'rating'], dtype='object')


 Cleaned User Input:


Unnamed: 0,movieId,title,rating
0,,"Brreakfast Club, The",5.0
1,1.0,Toy Story,3.5
2,2.0,Jumanji,2.0
3,296.0,Pulp Fiction,5.0
4,1274.0,Akira,4.5


# üéØ Extracting User-Rated Movies (Genre Matrix)

To build the user profile, we first retrieve the one-hot encoded genre representation 
of the movies rated by the user.

We filter the `movies_with_genres_df` dataset using the selected `movieId`s 
from the user's input.

This dataset will be used to compute the user's genre preference vector.

In [29]:
# Filter movies that the user has rated
user_movies = movies_with_genres_df[
    movies_with_genres_df["movieId"].isin(input_movies["movieId"])
].reset_index(drop=True)

print(" User Rated Movies (One-Hot Genre Representation):")
display(user_movies)

 User Rated Movies (One-Hot Genre Representation):


Unnamed: 0,movieId,title,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,1995.0,0,0,5,5,5,5,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995.0,0,0,3,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0
2,296,Pulp Fiction,1994.0,0,0,0,0,0,4,4,...,0,0,0,0,0,0,0,4,0,0
3,1274,Akira,1988.0,0,4,4,4,0,0,0,...,0,0,0,0,0,0,4,0,0,0


# üßÆ Building the User Genre Matrix

To construct the user profile, we isolate only the genre columns 
from the one-hot encoded dataframe.

We remove non-feature columns such as:
- `movieId`
- `title`
- `year`

The remaining columns represent binary genre features 
that will be weighted by the user's ratings.


In [30]:
# Remove non-genre columns
user_genre_table = user_movies.drop(columns=["movieId", "title", "year"])

print("User Genre Matrix:")
display(user_genre_table)

User Genre Matrix:


Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,5,5,5,5,0,0,0,5,0,0,0,0,0,0,0,0,0,0
1,0,0,3,0,3,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,4,4,0,4,0,0,0,0,0,0,0,0,4,0,0
3,0,4,4,4,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0


# üß† Creating the User Profile Vector

We build a **user profile** by weighting each genre feature by the user's ratings.

Mathematically:
- We take the genre one-hot matrix (movies √ó genres)
- Multiply it by the user's rating vector
- Result: a **genre preference score** for the user

Higher values indicate stronger preference for that genre.


In [32]:
# 1) Align user_movies and input_movies on movieId to guarantee same order/index
rated_movies = (
    user_movies[["movieId"]]
    .merge(input_movies[["movieId", "rating"]], on="movieId", how="inner")
    .set_index("movieId")
)

# 2) Build the genre feature matrix with movieId as index (same as ratings)
user_genre_table = (
    user_movies
    .set_index("movieId")
    .drop(columns=["title", "year"])
)

# 3) Ensure the rows match exactly
user_genre_table = user_genre_table.loc[rated_movies.index]

# 4) Compute user profile (genre preference vector)
user_profile = user_genre_table.T.dot(rated_movies["rating"])

# 5) Nice display
user_profile_df = user_profile.sort_values(ascending=False).to_frame("preference_score")

print(" User Profile computed successfully.")
display(user_profile_df)

 User Profile computed successfully.


Unnamed: 0,preference_score
Adventure,41.5
Comedy,37.5
Animation,35.5
Children,23.5
Fantasy,23.5
Crime,20.0
Drama,20.0
Thriller,20.0
Action,18.0
Sci-Fi,18.0


# üé¨ Creating the Global Genre Feature Matrix

To generate recommendations, we need the genre representation 
of **all movies** in the dataset.

We create a matrix where:

- Rows ‚Üí movieId
- Columns ‚Üí genre features (one-hot encoded)

This matrix will later be multiplied by the user profile vector 
to compute recommendation scores.


In [33]:
# Set movieId as index
genre_table = movies_with_genres_df.set_index("movieId")

# Keep only genre feature columns (remove metadata columns)
genre_table = genre_table.drop(columns=["title", "year"])

print(" Global Genre Feature Matrix:")
display(genre_table.head())


 Global Genre Feature Matrix:


Unnamed: 0_level_0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
1,0,0,5,5,5,5,0,0,0,5,0,0,0,0,0,0,0,0,0,0
2,0,0,3,0,3,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0
4,0,0,0,0,0,3,0,0,3,0,0,0,0,0,0,3,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# ‚úÖ Computing Recommendation Scores (Content-Based)

We compute a recommendation score for each movie by measuring how well its genre vector
matches the user's genre preference profile.

Score formula (normalized):

\[
score(m) = \frac{\sum_{g \in Genres} (movie_g \times user\_pref_g)}{\sum user\_pref_g}
\]

Finally, we:
- Sort movies by score (descending)
- Remove movies already rated by the user
- Return the Top-N recommendations

In [34]:
# Use the corrected variable name if you used user_profile earlier
# (If your variable is still userprofile, rename it once for consistency)
user_profile = user_profile if "user_profile" in globals() else userprofile

# Compute normalized recommendation scores for all movies
recommendation_scores = (genre_table.mul(user_profile, axis=1).sum(axis=1)) / user_profile.sum()

# Convert to DataFrame and sort
recommendation_table = (
    recommendation_scores
    .sort_values(ascending=False)
    .to_frame(name="score")
)

# Remove movies already rated by the user
rated_movie_ids = set(input_movies["movieId"].dropna().astype(int).tolist())
recommendation_table = recommendation_table[~recommendation_table.index.isin(rated_movie_ids)]

# Attach movie titles for readability
recommendation_table = (
    recommendation_table
    .reset_index()
    .merge(movies_df[["movieId", "title", "year"]], on="movieId", how="left")
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)

print(" Top 10 Recommended Movies:")
display(recommendation_table.head(10))


 Top 10 Recommended Movies:


Unnamed: 0,movieId,score,title,year
0,81132,6.097087,Rubber,2010.0
1,2987,4.933981,Who Framed Roger Rabbit?,1988.0
2,32031,4.879612,Robots,2005.0
3,85261,4.730097,Mars Needs Moms,2011.0
4,52462,4.730097,Aqua Teen Hunger Force Colon Movie Film for Th...,2007.0
5,56152,4.390291,Enchanted,2007.0
6,6902,4.363107,Interstate 60,2002.0
7,1907,4.295146,Mulan,1998.0
8,134853,4.229126,Inside Out,2015.0
9,108932,4.182524,The Lego Movie,2014.0


In [None]:
movies_df.loc[movies_df['movieId'].isin(recommendation_table_df.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
478,546,Super Mario Bros.,"[Action, Adventure, Children, Comedy, Fantasy,...",1993
559,673,Space Jam,"[Adventure, Animation, Children, Comedy, Fanta...",1996
2250,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988
4348,6350,Laputa: Castle in the Sky (Tenk√ª no shiro Rapy...,"[Action, Adventure, Animation, Children, Fanta...",1986
4631,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002
5490,26340,"Twelve Tasks of Asterix, The (Les douze travau...","[Action, Adventure, Animation, Children, Comed...",1976
5819,32031,Robots,"[Adventure, Animation, Children, Comedy, Fanta...",2005
6047,40339,Chicken Little,"[Action, Adventure, Animation, Children, Comed...",2005
6448,51939,TMNT (Teenage Mutant Ninja Turtles),"[Action, Adventure, Animation, Children, Comed...",2007
6455,52287,Meet the Robinsons,"[Action, Adventure, Animation, Children, Comed...",2007
