# Recommendation Systems

![Alt text](Images\Becauseyouwatched.jpg)

## BUSINESS UNDERSTANDING.

## Business Overview.

Streaming platforms host thousands of titles, and that abundance creates two linked challenges: helping users discover content they will love, and keeping them engaged over time. Faced with so many options, users can experience decision fatigue, which may lead to lower watch time, increased churn, or both. A well-designed recommendation system can transform this experience by surfacing relevant, personalized content that matches each user's individual tastes and preferences.

This project builds a movie recommendation engine on the MovieLens dataset, with the aim of improving user retention through personalized suggestions. By analyzing user ratings and preferences, we aim to deliver a top-five list of movie recommendations per user that feels intuitive, relevant and engaging.

## Problem statement.

As outlined above, users struggle to find content that matches their preferences, especially as more movies are produced each day. Our question is therefore ***'How can we deliver personalized movie recommendations that will ultimately increase user satisfaction and retention on a streaming platform?'***

## Stakeholders

**Product team** - to improve user engagement and retention through personalization  
**Data scientists** - to build and validate the recommendation engine  
**Marketing team** - to segment users and promote content based on preferences  
**Owners of streaming platforms / executive leadership** - to evaluate the ROI of a well-designed personalized system on platform performance.


## Success Metrics.
- Build a model that generates the top five movie recommendations per user.
- Improve user engagement by tailoring content to user preferences.
- Provide actionable insights for the product and marketing teams.
- Visualize results for non-technical audiences, e.g. executive leadership.


## Type of Recommendation and Model Evaluation Metrics.
This project focuses on personalized recommendations, using collaborative filtering to uncover latent user preferences and suggest movies aligned with each user's taste. Metrics we may use include:

- **RMSE/MAE** - for rating prediction accuracy.
- **Precision, Recall, F1 Score** - for ranking quality.
- **Coverage, Diversity** - to assess recommendation variety and system robustness.
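
The metrics listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project's evaluation code: the function names and the sample numbers are assumptions made for the example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def precision_recall_at_k(recommended, relevant, k=5):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top_k = list(recommended)[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def catalog_coverage(all_recommendations, n_items):
    """Share of the catalog that appears in at least one user's list."""
    recommended_items = set().union(*all_recommendations)
    return len(recommended_items) / n_items

# Hypothetical numbers, only to exercise the functions.
example_rmse = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
example_mae = mae([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
example_pr = precision_recall_at_k([10, 20, 30, 40, 50], {20, 50, 60}, k=5)
example_cov = catalog_coverage([[1, 2], [2, 3]], n_items=10)
```

F1 follows from precision and recall as their harmonic mean; diversity needs an item-to-item distance and is left out of this sketch.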

In [1]:
# Eric lead developing here | Lynn, feel free to support co-creation 
# Address business overview, problem statement, stakeholders, goals/objectives (e.g., product suggestions, movie recommendations, personalized content)
# Determine the type of recommendation (personalized vs. non-personalized)
# Outline some metrics we may want to use in model evaluation

## Data Understanding  

### Dataset Overview  
The provided dataset comprises four CSV files:
- **links.csv**
- **movies.csv**
- **ratings.csv**
- **tags.csv**

Together they provide the foundation for a movie recommendation system: the data is structured to support analysis of user behavior, movie characteristics, and the relationships between them.

##  Importing Libraries 

In [2]:
# Loading the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from datetime import datetime, timezone
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.inspection import permutation_importance

sns.set_theme(style="whitegrid", context="notebook")

---
## Movies dataset.


In [3]:
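# Loading the movies dataset
movies = pd.read_csv('Data/movies.csv')
movies.info  # Without parentheses this displays the bound method's repr (the frame preview below); movies.info() would print the column/dtype summary instead.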

<bound method DataFrame.info of       movieId                                      title  \
0           1                           Toy Story (1995)   
1           2                             Jumanji (1995)   
2           3                    Grumpier Old Men (1995)   
3           4                   Waiting to Exhale (1995)   
4           5         Father of the Bride Part II (1995)   
...       ...                                        ...   
9737   193581  Black Butler: Book of the Atlantic (2017)   
9738   193583               No Game No Life: Zero (2017)   
9739   193585                               Flint (2017)   
9740   193587        Bungo Stray Dogs: Dead Apple (2018)   
9741   193609        Andrew Dice Clay: Dice Rules (1991)   

                                           genres  
0     Adventure|Animation|Children|Comedy|Fantasy  
1                      Adventure|Children|Fantasy  
2                                  Comedy|Romance  
3                            Comedy|Dra

### movies.csv  

This file contains **movie titles** and their corresponding **genres**.  

It serves as the **central movie catalog**, with a unique entry for each of the **9,742 films**.  
#### Columns 
- **movieId**: Primary key that links to other datasets.  
- **title**: Movie title (with release year).  
- **genres**: Pipe-separated list of genres.  

This file is essential for **content-based filtering**, enabling genre-specific recommendations and theme analysis. 
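
Because `genres` is a pipe-separated string, genre-level analysis needs the column split into one row per (movie, genre) pair. The sketch below shows one way to do that with pandas; the three sample rows are copied from the preview above, and the names `movies_sample`, `genre_long` and `genre_counts` are illustrative.

```python
import pandas as pd

# Three rows copied from the movies.csv preview.
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split the pipe-separated string into a list, then explode to one row per genre.
genre_long = (
    movies_sample
    .assign(genre=movies_sample["genres"].str.split("|"))
    .explode("genre")[["movieId", "genre"]]
)

# How many of the sample movies carry each genre.
genre_counts = genre_long["genre"].value_counts()
print(genre_counts)
```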

---
## Ratings Dataset.


In [4]:
# Loading ratings dataset
ratings = pd.read_csv('Data/ratings.csv')
ratings.info

<bound method DataFrame.info of         userId  movieId  rating   timestamp
0            1        1     4.0   964982703
1            1        3     4.0   964981247
2            1        6     4.0   964982224
3            1       47     5.0   964983815
4            1       50     5.0   964982931
...        ...      ...     ...         ...
100831     610   166534     4.0  1493848402
100832     610   168248     5.0  1493850091
100833     610   168250     5.0  1494273047
100834     610   168252     5.0  1493846352
100835     610   170875     3.0  1493846415

[100836 rows x 4 columns]>

### ratings.csv  
The largest file, containing **100,836 explicit user ratings**.  
#### Columns 
- **userId**: Identifier of the user.  
- **movieId**: Identifier of the rated movie.  
- **rating**: Explicit user rating (0.5–5.0 in half-star steps).  
- **timestamp**: Time of rating (Unix format).  

This dataset forms the **backbone of collaborative filtering**. The timestamps also enable **temporal analysis**, helping track evolving preferences and movie popularity trends.  
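
For temporal analysis, the Unix timestamps first need converting to datetimes. A minimal sketch, using three rows copied from the ratings preview above; the derived column names `rated_at` and `rating_year` are assumptions for the example.

```python
import pandas as pd

# Three rows copied from the ratings.csv preview.
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 610],
    "movieId": [1, 3, 170875],
    "rating": [4.0, 4.0, 3.0],
    "timestamp": [964982703, 964981247, 1493846415],
})

# Unix seconds -> timezone-aware datetimes (UTC).
ratings_sample["rated_at"] = pd.to_datetime(
    ratings_sample["timestamp"], unit="s", utc=True
)
ratings_sample["rating_year"] = ratings_sample["rated_at"].dt.year

# Volume and average rating per calendar year.
ratings_per_year = ratings_sample.groupby("rating_year")["rating"].agg(["count", "mean"])
print(ratings_per_year)
```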


---
## Links dataset.


In [5]:
# Loading  the links dataset
links = pd.read_csv('Data/links.csv')
links.info

<bound method DataFrame.info of       movieId   imdbId    tmdbId
0           1   114709     862.0
1           2   113497    8844.0
2           3   113228   15602.0
3           4   114885   31357.0
4           5   113041   11862.0
...       ...      ...       ...
9737   193581  5476944  432131.0
9738   193583  5914996  445030.0
9739   193585  6397426  479308.0
9740   193587  8391976  483455.0
9741   193609   101726   37891.0

[9742 rows x 3 columns]>

### links.csv  

This file serves as a **bridge to external metadata sources**, mapping internal `movieId` values to industry-standard identifiers.  
#### Columns  
- **movieId**: Unique identifier for a movie.
- **imdbId**: IMDb identifier.  
- **tmdbId**: The Movie Database (TMDb) identifier. 

These external links allow for **data enrichment**, such as retrieving cast, plot, and ratings. This enhances the recommendation engine with richer context.  
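
As a sketch of how these identifiers map to external pages: IMDb title URLs use the id zero-padded to at least seven digits after a `tt` prefix, and TMDb movie pages use the numeric id directly. The helper names `imdb_url` and `tmdb_url` are hypothetical, and actually fetching cast or plot data would need each site's API, which is not shown here.

```python
import pandas as pd

def imdb_url(imdb_id):
    """Build the IMDb title URL for an imdbId stored as an integer."""
    return f"https://www.imdb.com/title/tt{int(imdb_id):07d}/"

def tmdb_url(tmdb_id):
    """Build the TMDb movie URL; tmdbId is a float column and may be missing."""
    if pd.isna(tmdb_id):
        return None
    return f"https://www.themoviedb.org/movie/{int(tmdb_id)}"

# First row of links.csv (movieId 1, Toy Story).
toy_story_imdb = imdb_url(114709)
toy_story_tmdb = tmdb_url(862.0)
print(toy_story_imdb, toy_story_tmdb)
```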

---
## Tags Dataset.

In [6]:
# Loading  the tags dataset
tags = pd.read_csv('Data/tags.csv')
tags.info

<bound method DataFrame.info of       userId  movieId               tag   timestamp
0          2    60756             funny  1445714994
1          2    60756   Highly quotable  1445714996
2          2    60756      will ferrell  1445714992
3          2    89774      Boxing story  1445715207
4          2    89774               MMA  1445715200
...      ...      ...               ...         ...
3678     606     7382         for katie  1171234019
3679     606     7936           austere  1173392334
3680     610     3265            gun fu  1493843984
3681     610     3265  heroic bloodshed  1493843978
3682     610   168248  Heroic Bloodshed  1493844270

[3683 rows x 4 columns]>

### tags.csv  
This file contains **3,683 qualitative user-generated tags**, offering descriptive insights beyond numerical ratings.  
#### Columns
- **userId**: Identifier of the user who tagged the movie.  
- **movieId**: Identifier of the tagged movie.  
- **tag**: User-generated keyword(s).  
- **timestamp**: Time the tag was applied.  

Tags capture **nuanced characteristics** that genres miss, enabling **expressive content-based models**. They also reflect how perceptions of movies **shift over time**.  
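
Tags are free text, so the same idea can appear under different spellings; the preview above contains both `heroic bloodshed` and `Heroic Bloodshed`. A minimal normalization sketch, assuming that trimming and lower-casing is enough for a first pass (the `tag_norm` column name is illustrative):

```python
import pandas as pd

# Rows copied from the tags.csv preview.
tags_sample = pd.DataFrame({
    "userId": [2, 2, 610, 610],
    "movieId": [60756, 60756, 3265, 168248],
    "tag": ["funny", "Highly quotable", "heroic bloodshed", "Heroic Bloodshed"],
})

# Trim whitespace and lower-case so case variants collapse into one tag.
tags_sample["tag_norm"] = tags_sample["tag"].str.strip().str.lower()

raw_unique = tags_sample["tag"].nunique()
norm_unique = tags_sample["tag_norm"].nunique()

# One sorted list of distinct tags per movie, usable as content features.
movie_tags = tags_sample.groupby("movieId")["tag_norm"].agg(lambda s: sorted(set(s)))
print(raw_unique, norm_unique)
print(movie_tags)
```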


In [7]:
# Newton work on the Data Understanding part
# Briefly describe each dataset in the CSV format

## Data Preparation

In [8]:
# Joackim kindly lead here : Check need to create a single dataframe with desired variables including justifications as part of notes


In [9]:
# Merge datasets
df = ratings.merge(movies, on='movieId', how='left') \
            .merge(tags, on=['userId', 'movieId'], how='left') \
            .merge(links, on='movieId', how='left')

In [10]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp_x,title,genres,tag,timestamp_y,imdbId,tmdbId
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,,,114709,862.0
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance,,,113228,15602.0
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller,,,113277,949.0
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,,,114369,807.0
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,,,114814,629.0


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102677 entries, 0 to 102676
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   userId       102677 non-null  int64  
 1   movieId      102677 non-null  int64  
 2   rating       102677 non-null  float64
 3   timestamp_x  102677 non-null  int64  
 4   title        102677 non-null  object 
 5   genres       102677 non-null  object 
 6   tag          3476 non-null    object 
 7   timestamp_y  3476 non-null    float64
 8   imdbId       102677 non-null  int64  
 9   tmdbId       102664 non-null  float64
dtypes: float64(3), int64(4), object(3)
memory usage: 7.8+ MB


In [12]:
print("Shape of df:", df.shape)

Shape of df: (102677, 10)
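
The merged frame has 102,677 rows although `ratings.csv` has 100,836. The extra rows most likely come from the left join on `tags`: a user can attach several tags to the same movie, and each tag repeats the rating row. The sketch below reproduces the effect on toy data (the tags are copied from the preview, the ratings are made up) and shows one possible alternative, aggregating tags per (user, movie) before merging.

```python
import pandas as pd

# Hypothetical ratings for two movies by user 2.
ratings_toy = pd.DataFrame({
    "userId": [2, 2],
    "movieId": [60756, 89774],
    "rating": [5.0, 5.0],
})
# Tags copied from the tags.csv preview: three for one movie, two for the other.
tags_toy = pd.DataFrame({
    "userId": [2, 2, 2, 2, 2],
    "movieId": [60756, 60756, 60756, 89774, 89774],
    "tag": ["funny", "Highly quotable", "will ferrell", "Boxing story", "MMA"],
})

# Direct left join: one output row per (rating, tag) pair.
merged_fanout = ratings_toy.merge(tags_toy, on=["userId", "movieId"], how="left")

# Alternative: collapse tags to one row per (userId, movieId) first.
tags_agg = (
    tags_toy.groupby(["userId", "movieId"])["tag"]
    .agg("|".join)
    .reset_index()
)
merged_one_row = ratings_toy.merge(
    tags_agg, on=["userId", "movieId"], how="left", validate="one_to_one"
)

print(len(ratings_toy), len(merged_fanout), len(merged_one_row))
```

The `validate="one_to_one"` argument makes pandas raise if either side still has duplicate keys, which guards against silent row inflation.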


In [13]:
#  filter/validate timestamps in the dataset
TS_MIN = int(datetime(1990, 1, 1, tzinfo=timezone.utc).timestamp())
TS_MAX = int(datetime(2025, 12, 31, tzinfo=timezone.utc).timestamp())

RATING_MIN, RATING_MAX, RATING_STEP = 0.5, 5.0, 0.5 #Range and granularity of ratings

In [14]:
# Replace newline characters with a pipe (|) separator, remove extra spaces and duplicate separators, handle missing or invalid values.
def fix_genres(series: pd.Series) -> pd.Series:
    """
    Normalize genres:
      - convert literal '\\n' and real newlines to '|'
      - collapse repeated separators and trim
    """
    s = series.astype(str)  # Convert all values to strings for consistent processing (NaN becomes "nan").
    # Replace literal "\n" sequences and real newline characters with a pipe (|).
    s = s.str.replace("\\n", "|", regex=False).str.replace("\n", "|", regex=False)
    # Collapse runs of pipes (and any spaces around them) into a single pipe, then trim leading/trailing pipes or spaces.
    s = s.str.replace(r"(?:\s*\|\s*)+", "|", regex=True).str.strip("| ")
    return s.replace({"nan": ""})  # Map stringified NaN back to an empty string.
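
The separator handling can be checked on a few messy strings. This sketch compares a pattern that only trims whitespace around each pipe with one that also collapses runs of pipes; the sample strings are made up for the illustration.

```python
import pandas as pd

# Made-up messy genre strings.
messy = pd.Series(["Comedy | Romance", "Action||Crime", "Drama\nThriller", "|Horror|"])
messy = messy.str.replace("\n", "|", regex=False)

# Trims spaces around each pipe, but leaves doubled pipes in place.
strip_only = messy.str.replace(r"\s*\|\s*", "|", regex=True).str.strip("| ")

# Treats a whole run of pipes (with surrounding spaces) as one separator.
collapsed = messy.str.replace(r"(?:\s*\|\s*)+", "|", regex=True).str.strip("| ")

print(strip_only.tolist())
print(collapsed.tolist())
```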

In [15]:
def extract_year_from_title(title: str):
    """
    Extract (YYYY) at end of title.
    """
    if not isinstance(title, str):  # Non-string input (e.g. NaN) has no year to extract.
        return np.nan
    # \((\d{4})\) matches a year like (1999); \s*$ anchors it at the end of the string, allowing trailing spaces.
    m = re.search(r"\((\d{4})\)\s*$", title.strip())
    # Return the year as a float (e.g., 1999.0) if a match is found; otherwise NaN.
    return float(m.group(1)) if m else np.nan

In [16]:
def to_decade(year):  # Converts a numeric year (like 1999) into a decade label (like "1990s").
    """
    Convert numeric year -> '1980s' style label; 'Unknown' if NaN.
    """
    if pd.isna(year):
        return "Unknown"  # Missing year (NaN).
    # Floor the year to the start of its decade (1999 -> 1990) and append "s".
    return f"{int(year) // 10 * 10}s"
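
The same year and decade logic can also be written in vectorized form with `Series.str.extract`. This is an alternative sketch rather than the notebook's row-wise approach; three titles are taken from the dataset preview and the last two entries are made up to show the fallback.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Usual Suspects, The (1995)",
    "Seven (a.k.a. Se7en) (1995)",  # only the trailing parenthesis is a year
    "Untitled Project",             # hypothetical title without a year
    None,                           # missing title
])

# Capture a 4-digit year in parentheses at the very end of the title.
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False).astype(float)

# Floor to the decade; missing years get the "Unknown" label.
decade = year.map(lambda y: "Unknown" if pd.isna(y) else f"{int(y) // 10 * 10}s")

print(pd.DataFrame({"title": titles, "year": year, "decade": decade}))
```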

In [17]:
# round ratings to the nearest 0.5 and clip them within a valid range, usually between 0.5 and 5.0 stars.
def clip_half_star(x: pd.Series) -> pd.Series:
    """
    Snap ratings to nearest 0.5 and clip to [0.5, 5.0].
    """
    x = pd.to_numeric(x, errors="coerce")  # Non-numeric entries become NaN.
    x = x.clip(RATING_MIN, RATING_MAX)  # Values below 0.5 become 0.5; values above 5.0 become 5.0.
    # Snap to the nearest multiple of RATING_STEP. np.round sends exact ties to the even multiple (1.25 -> 1.0, 1.75 -> 2.0).
    return (np.round(x / RATING_STEP) * RATING_STEP).astype(float)
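
A worked example of the clip-then-snap arithmetic, applied to a handful of made-up inputs. It repeats the same three steps outside the function so each intermediate effect is visible, including NumPy's round-half-to-even behaviour on exact ties.

```python
import numpy as np
import pandas as pd

RATING_MIN, RATING_MAX, RATING_STEP = 0.5, 5.0, 0.5

# Made-up inputs: in-range values, out-of-range values, exact ties, and a non-numeric entry.
raw = pd.Series([3.7, 4.3, 0.0, 6.0, 1.25, 1.75, "n/a"])

x = pd.to_numeric(raw, errors="coerce")   # "n/a" -> NaN
x = x.clip(RATING_MIN, RATING_MAX)        # 0.0 -> 0.5, 6.0 -> 5.0
snapped = (np.round(x / RATING_STEP) * RATING_STEP).astype(float)

print(snapped.tolist())
```

Note that 1.25 snaps down to 1.0 while 1.75 snaps up to 2.0, because `np.round` resolves ties toward the even integer. Real MovieLens ratings are already on the half-star grid, so this only matters for noisy input.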

In [18]:
#  ensure that all timestamps fall within a valid range defined by TS_MIN and TS_MAX
def sanitize_timestamp(ts: pd.Series) -> pd.Series:
    """
    Keep timestamps within [TS_MIN, TS_MAX]; set others to NaN.
    """
    # Coerce to numeric (non-numeric entries become NaN), then cast to pandas' nullable integer type "Int64" so missing values are allowed.
    ts = pd.to_numeric(ts, errors="coerce").astype("Int64")
    # Keep values within [TS_MIN, TS_MAX]; anything outside the range becomes missing.
    return ts.where(ts.between(TS_MIN, TS_MAX))
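
To illustrate the range check, the sketch below applies the same two steps to one timestamp from the ratings preview and two made-up out-of-range values. The bounds are recomputed here so the block runs on its own.

```python
from datetime import datetime, timezone

import pandas as pd

TS_MIN = int(datetime(1990, 1, 1, tzinfo=timezone.utc).timestamp())
TS_MAX = int(datetime(2025, 12, 31, tzinfo=timezone.utc).timestamp())

# One real timestamp from the preview, one far too small, one in the year 2100.
ts = pd.Series([964982703, 10, 4102444800])

ts = pd.to_numeric(ts, errors="coerce").astype("Int64")
clean = ts.where(ts.between(TS_MIN, TS_MAX))

print(clean.tolist())
```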

In [19]:
# Clean up column names by removing leading and trailing whitespace from each column name
df.columns = [c.strip() for c in df.columns]

In [20]:
# If the "genres" column exists, apply fix_genres() to clean and normalize the genre strings
# (e.g., replacing newlines with |, removing extra spaces, collapsing repeated separators); otherwise create it empty.
if "genres" in df.columns:
    df["genres"] = fix_genres(df["genres"])
else:
    df["genres"] = ""

In [21]:
# Remove exact duplicate rows.
df = df.drop_duplicates()

In [22]:
# Loop through the identifier columns for users and movies. For each one that exists in the DataFrame,
# convert it to numeric values (non-numeric entries become missing) and store it as nullable Int64.
for col in ["userId", "movieId", "imdbId", "tmdbId"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")

In [23]:
# Convert ratings to numeric values and clip them to the valid range (0.5 to 5.0).
# Round them to the nearest 0.5 (e.g., 3.7 → 3.5, 4.3 → 4.5).
if "rating" in df.columns:
    df["rating"] = clip_half_star(df["rating"])

In [24]:
# Convert the column to numeric values (non-numeric entries become NaN).
#Keep only timestamps that fall within a valid range defined by TS_MIN and TS_MAX.
if "timestamp_x" in df.columns:
    df["timestamp_x"] = sanitize_timestamp(df["timestamp_x"])

In [25]:
# Parse/repair year & decade from title
if "title" in df.columns:
    # Look for a 4-digit year in parentheses at the end of each title.
    year_from_title = df["title"].apply(extract_year_from_title)
    if "year" not in df.columns:
        # No "year" column yet: create it from the year extracted from the title.
        df["year"] = year_from_title
    else:
        # Coerce existing values to numeric (invalid entries become NaN), then fill gaps from the title.
        df["year"] = pd.to_numeric(df["year"], errors="coerce").fillna(year_from_title)
    # Normalize 'decade' to labeled form; if missing, derive from year
    if "decade" not in df.columns:
        df["decade"] = df["year"].apply(to_decade)
    else:
        # If "decade" exists and is numeric, convert values like 1990.0 into "1990s" format.
        if df["decade"].dtype.kind in "if":
            df["decade"] = df["decade"].apply(lambda x: f"{int(x)}s" if pd.notna(x) else np.nan)
        df["decade"] = df["decade"].fillna(df["year"].apply(to_decade))
else:
    # Ensure both columns exist, with NaN and "Unknown" as default placeholders.
    if "year" not in df.columns:
        df["year"] = np.nan
    if "decade" not in df.columns:
        df["decade"] = "Unknown"

In [26]:
# Drop records missing essentials
essentials = ["userId", "movieId", "rating"]
keep = np.ones(len(df), dtype=bool)
for c in essentials:
    keep &= df[c].notna()
df = df[keep].copy()

# Deduplicate (userId, movieId) → keep most recent by timestamp_x when available
if {"userId", "movieId", "timestamp_x"}.issubset(df.columns):
    df = (df
          .sort_values(["userId", "movieId", "timestamp_x"])
          .drop_duplicates(["userId", "movieId"], keep="last"))
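
The keep-most-recent rule can be seen on a toy frame in which user 1 rated the same movie twice. All values are made up for the illustration.

```python
import pandas as pd

# User 1 rated movie 10 twice; the later rating (4.5) should win.
df_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 10, 10],
    "rating": [3.0, 4.5, 2.0],
    "timestamp_x": [1000, 2000, 1500],
})

latest = (
    df_toy
    .sort_values(["userId", "movieId", "timestamp_x"])
    .drop_duplicates(["userId", "movieId"], keep="last")
)
print(latest)
```

In the merged notebook frame, rows duplicated by the tags join share the same `timestamp_x`, so this step also collapses them to a single row per (user, movie) and keeps only one of the tags.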

In [27]:
# Build a unique movie table
movie_meta = (df[["movieId", "title", "genres", "year", "decade"]]
              .drop_duplicates("movieId")
              .copy())

# 1) Genres → MultiLabelBinarizer (multi-hot)
mlb = MultiLabelBinarizer()
genre_lists = (
    movie_meta["genres"].fillna("")
    .apply(lambda s: [g.strip() for g in s.split("|")
                      if g.strip() and g.strip().lower() != "(no genres listed)"])
)
G = pd.DataFrame(
    mlb.fit_transform(genre_lists),
    index=movie_meta.index,
    columns=[f"genre_{g}" for g in mlb.classes_]
)

# 2) Decade → OneHotEncoder (one-hot)

# Ensure decade is a clean string column (fill missing values before casting, since astype(str) would turn NaN into "nan")
movie_meta["decade"] = movie_meta["decade"].fillna("Unknown").astype(str)

# For scikit-learn < 1.2 use 'sparse=False'; for >= 1.2 use 'sparse_output=False'.
# The code below tries the newer arg first and falls back if needed.
try:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False, dtype=int)
except TypeError:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse=False, dtype=int)

decade_2d = movie_meta[["decade"]]  # 2D input
D_mat = ohe.fit_transform(decade_2d)  # dense ndarray

# Get the correct output column names from the encoder
try:
    dec_cols = ohe.get_feature_names_out(["decade"])
except AttributeError:  # scikit-learn < 1.0
    dec_cols = ohe.get_feature_names(["decade"])

D = pd.DataFrame(D_mat, index=movie_meta.index, columns=dec_cols)


# Final movie features
movie_features_ohe = pd.concat([movie_meta[["movieId", "title", "year"]], G, D], axis=1).reset_index(drop=True)
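
For a quick cross-check of the multi-hot genre matrix, pandas can build an equivalent encoding directly with `Series.str.get_dummies`. This is a pandas-only alternative to `MultiLabelBinarizer`, not the notebook's method; the first three rows come from the dataset preview and the fourth (movieId 999) is hypothetical.

```python
import pandas as pd

movie_meta_toy = pd.DataFrame({
    "movieId": [1, 3, 6, 999],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Comedy|Romance",
        "Action|Crime|Thriller",
        "(no genres listed)",  # hypothetical movie with no genre information
    ],
})

# One 0/1 column per genre, split on the pipe separator.
G_alt = movie_meta_toy["genres"].str.get_dummies(sep="|").add_prefix("genre_")

# The placeholder label is not a genre, so drop its column if present.
G_alt = G_alt.drop(columns=["genre_(no genres listed)"], errors="ignore")

print(G_alt)
```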

In [28]:
# Row-level ratings table for modeling (keep useful columns if present)
keep_cols = ["userId", "movieId", "rating"]
for c in ["timestamp_x", "timestamp", "title", "genres", "imdbId", "tmdbId", "year", "decade"]:
    if c in df.columns:
        keep_cols.append(c)

ratings_clean = df[keep_cols].reset_index(drop=True)

# Display small previews
display(ratings_clean.head(3))
display(movie_features_ohe.head(3))

Unnamed: 0,userId,movieId,rating,timestamp_x,title,genres,imdbId,tmdbId,year,decade
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,1995.0,1990s
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance,113228,15602,1995.0,1990s
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller,113277,949,1995.0,1990s


Unnamed: 0,movieId,title,year,genre_Action,genre_Adventure,genre_Animation,genre_Children,genre_Comedy,genre_Crime,genre_Documentary,...,decade_1930s,decade_1940s,decade_1950s,decade_1960s,decade_1970s,decade_1980s,decade_1990s,decade_2000s,decade_2010s,decade_Unknown
0,1,Toy Story (1995),1995.0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,1,0,0,0
1,3,Grumpier Old Men (1995),1995.0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
2,6,Heat (1995),1995.0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0


In [29]:
# join ratings_clean and movie_features_ohe.

# Drop the title (usually not used by ML models). Note: "year" exists in both tables, so the merges below suffix it as year_x / year_y.
item_feats = movie_features_ohe.drop(columns=["title"])
df_joined = ratings_clean.merge(item_feats, on="movieId", how="left")

# Identify genre columns once
genre_cols = [c for c in movie_features_ohe.columns if c.startswith("genre_")]

# User-genre profile: average (rating-weighted) genre affinity
tmp = (
    ratings_clean[["userId", "rating", "movieId"]]
    .merge(movie_features_ohe[["movieId"] + genre_cols], on="movieId", how="left")
)

# weight genre indicators by rating
tmp[genre_cols] = tmp[genre_cols].mul(tmp["rating"], axis=0)

user_profile = (tmp.groupby("userId")[genre_cols]
                   .mean()
                   .add_prefix("u_")
                   .reset_index())

# Merge user profile + item features + rating label
df_train = (ratings_clean
            .merge(item_feats, on="movieId", how="left")
            .merge(user_profile, on="userId", how="left"))

df_train.head()

Unnamed: 0,userId,movieId,rating,timestamp_x,title,genres,imdbId,tmdbId,year_x,decade,...,u_genre_Film-Noir,u_genre_Horror,u_genre_IMAX,u_genre_Musical,u_genre_Mystery,u_genre_Romance,u_genre_Sci-Fi,u_genre_Thriller,u_genre_War,u_genre_Western
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,1995.0,1990s,...,0.021552,0.25431,0.0,0.443966,0.323276,0.482759,0.728448,0.982759,0.426724,0.12931
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance,113228,15602,1995.0,1990s,...,0.021552,0.25431,0.0,0.443966,0.323276,0.482759,0.728448,0.982759,0.426724,0.12931
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller,113277,949,1995.0,1990s,...,0.021552,0.25431,0.0,0.443966,0.323276,0.482759,0.728448,0.982759,0.426724,0.12931
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,114369,807,1995.0,1990s,...,0.021552,0.25431,0.0,0.443966,0.323276,0.482759,0.728448,0.982759,0.426724,0.12931
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,114814,629,1995.0,1990s,...,0.021552,0.25431,0.0,0.443966,0.323276,0.482759,0.728448,0.982759,0.426724,0.12931
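
A small worked example helps in reading the `u_genre_*` values. Averaging the rating-weighted indicators over all of a user's movies, as above, mixes how often a genre is watched with how well it is rated. The sketch compares that with an alternative, the mean rating within each genre; all data here is made up.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 6, 1],
    "rating": [4.0, 2.0, 5.0],
})
genres_toy = pd.DataFrame({
    "movieId": [1, 6],
    "genre_Action": [0, 1],
    "genre_Comedy": [1, 0],
})
genre_cols = ["genre_Action", "genre_Comedy"]

tmp = ratings_toy.merge(genres_toy, on="movieId", how="left")
weighted = tmp[genre_cols].mul(tmp["rating"], axis=0)

# Approach used above: mean over ALL of the user's rated movies.
profile_mean_all = weighted.groupby(tmp["userId"]).mean()

# Alternative: mean rating among the movies that actually carry the genre.
genre_counts = tmp[genre_cols].groupby(tmp["userId"]).sum()
profile_within_genre = weighted.groupby(tmp["userId"]).sum() / genre_counts

print(profile_mean_all)
print(profile_within_genre)
```

User 1 gave their only comedy 4.0, yet the all-movies mean reports 2.0 for comedy because the action movie contributes a zero. The within-genre version is undefined (NaN) for genres a user never rated, which then needs an explicit fill strategy.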


## Modeling

In [30]:
# Perform data preprocessing  
# Clean the data to handle missing values, duplicates and outliers
# Normalize or scale numerical features if applicable
# One-hot encode categorical data to suitable formats
# Split into training and test sets
# Choose a recommendation approach and apply it
# Build a model using an algorithm of choice - KNN, SVD or deep learning
# Train model using historical interaction data
# Optimize the hyperparameters
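
As one possible starting point for the KNN option in the plan above, here is a minimal item-based collaborative filtering sketch in NumPy. It is an illustration under simplifying assumptions, not the project's model: the ratings are made up, missing ratings are treated as 0 when computing cosine similarity, every item is assumed to have at least one rating, and the `recommend` helper is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up interactions: 3 users, 3 movies.
ratings_toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 20, 30],
    "rating":  [5.0, 4.0, 5.0, 4.0, 5.0, 2.0, 1.0],
})

# User x item matrix; unrated cells are filled with 0 (a simplification).
R = ratings_toy.pivot(index="userId", columns="movieId", values="rating").fillna(0.0)
M = R.to_numpy()

# Item-item cosine similarity; an item is not its own neighbour.
norms = np.linalg.norm(M, axis=0)
sim = (M.T @ M) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

def recommend(user_id, n=5):
    """Score unseen items as a similarity-weighted average of the user's ratings."""
    user_ratings = R.loc[user_id].to_numpy()
    rated = user_ratings > 0
    weights = sim[:, rated]
    scores = (weights @ user_ratings[rated]) / (np.abs(weights).sum(axis=1) + 1e-9)
    scores[rated] = -np.inf  # never recommend what the user already rated
    top = np.argsort(-scores)[:n]
    return [(int(R.columns[i]), float(scores[i])) for i in top if np.isfinite(scores[i])]

recs_user_1 = recommend(1)
print(recs_user_1)
```

On the full dataset the dense pivot would be large and mostly empty, so a sparse matrix or a dedicated library would be the more practical route.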

## Evaluation 

In [31]:
# Use metrics such as: 
# RMSE/MAE for rating predictions
# Precision/Recall/F1 score - For ranking 
# MAP/NDCG for ordered recommendations
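
The ordered-recommendation metrics named above can be sketched as follows. The function names are illustrative, the inputs are made up, and `average_precision` uses the variant that normalizes by the number of relevant items appearing in the list, which is an assumption since other definitions divide by the total number of relevant items.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1) for ranks 1..k
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def average_precision(relevant_flags):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    flags = np.asarray(relevant_flags, dtype=bool)
    if not flags.any():
        return 0.0
    hits = np.cumsum(flags)
    ranks = np.flatnonzero(flags) + 1
    return float(np.mean(hits[flags] / ranks))

# Made-up relevance lists, in the order the items were recommended.
perfect = ndcg_at_k([3, 2, 1], k=3)
reversed_order = ndcg_at_k([1, 2, 3], k=3)
ap = average_precision([1, 0, 1])
print(perfect, reversed_order, ap)
```

MAP is the mean of `average_precision` across users.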

## Findings and Conclusion

In [32]:
# Use the information above to answer the objectives outlined in the introduction

## Recommendations

In [33]:
# Make recommendations based on the findings and interest of the stakeholders