# Recommendation Systems

![Because you watched](Images/Becauseyouwatched.jpg)

## Business Understanding

## Business Overview.

Streaming platforms face two related challenges: helping users discover content they will love, and keeping those users engaged over time. With thousands of titles available, users can experience decision fatigue, which may lead to lower watch time, increased churn, or both. A well-designed recommendation system can transform this experience by surfacing relevant, personalized content that matches each user's tastes and preferences.

This project builds a movie recommendation engine using the MovieLens dataset, with the aim of improving user retention through personalized suggestions. By analyzing user ratings and preferences, we aim to deliver the top five movie recommendations for each user that feel intuitive, relevant, and engaging.

## Problem statement.

As we have seen, users struggle to find content that matches their preferences, especially as more movies are released each day. Our question is therefore: ***'How can we deliver personalized movie recommendations that will ultimately increase user satisfaction and retention on a streaming platform?'***

## Stakeholders

**Product team** - to improve user engagement and retention through personalization  
**Data scientists** - to build and validate the recommendation engine  
**Marketing team** - to segment users and promote content based on preferences  
**Owners of streaming platforms / executive leadership** - to evaluate the ROI of a well-designed personalized system on platform performance.


## Success Metrics.
- Build a model that generates the top five movie recommendations per user.
- Improve user engagement by tailoring content to user preferences.
- Provide actionable insights for the product and marketing teams.
- Visualize results for non-technical audiences, e.g. executive leadership.


## Type of Recommendation and Model Evaluation Metrics.
This project focuses on personalized recommendations, leveraging collaborative filtering to uncover latent user preferences and suggest movies aligned with each user's taste. Metrics we may use include:

- **RMSE/MAE** - for rating prediction accuracy.
- **Precision, Recall, F1 Score** - for ranking quality.
- **Coverage, Diversity** - to assess recommendation variety and system robustness.
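
To make these metrics concrete, here is a minimal sketch of how they could be computed with plain NumPy. The function names (`rmse`, `mae`, `precision_recall_at_k`, `catalog_coverage`) and the toy numbers are illustrative assumptions, not part of the project code yet.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def precision_recall_at_k(recommended, relevant, k=5):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def catalog_coverage(all_recommendations, catalog_size):
    """Share of the catalog that appears in at least one user's list."""
    recommended_items = {item for recs in all_recommendations for item in recs}
    return len(recommended_items) / catalog_size

# Toy values, made up for illustration only
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]
print(rmse(true_ratings, predicted_ratings), mae(true_ratings, predicted_ratings))
print(precision_recall_at_k([10, 20, 30, 40, 50], relevant={20, 50, 60}, k=5))
print(catalog_coverage([[10, 20], [20, 30]], catalog_size=10))
```

F1 can then be derived per user as the harmonic mean of the precision and recall values returned above.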

In [None]:
# Eric leads development here | Lynn, feel free to support co-creation
# Address business overview, problem statement, stakeholders, goals/objectives (e.g., product suggestions, movie recommendations, personalized content)
# Determine the type of recommendation (personalized vs. non-personalized)
# Outline some metrics we may want to use in model evaluation

## Data Understanding  

### Dataset Overview  
The provided dataset, comprising four CSV files (links.csv, movies.csv, ratings.csv, and tags.csv), offers a comprehensive foundation for a robust movie recommendation system. The data is structured to facilitate the analysis of user behavior, movie characteristics, and their interrelationships.

##  Importing Libraries 

In [None]:
# Loading the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

---
## Movies dataset.


In [32]:
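# Load the movies dataset
movies = pd.read_csv('Data/movies.csv')
movies.info  # no parentheses: shows the bound method's repr (which embeds the frame); movies.info() prints a dtype summary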

<bound method DataFrame.info of       movieId                                      title  \
0           1                           Toy Story (1995)   
1           2                             Jumanji (1995)   
2           3                    Grumpier Old Men (1995)   
3           4                   Waiting to Exhale (1995)   
4           5         Father of the Bride Part II (1995)   
...       ...                                        ...   
9737   193581  Black Butler: Book of the Atlantic (2017)   
9738   193583               No Game No Life: Zero (2017)   
9739   193585                               Flint (2017)   
9740   193587        Bungo Stray Dogs: Dead Apple (2018)   
9741   193609        Andrew Dice Clay: Dice Rules (1991)   

                                           genres  
0     Adventure|Animation|Children|Comedy|Fantasy  
1                      Adventure|Children|Fantasy  
2                                  Comedy|Romance  
3                            Comedy|Dra

### movies.csv  

This file serves as the **central movie catalog**: it contains **movie titles** and their corresponding **genres**, with a unique entry for each of the **9,742 films**.  
#### Columns 
- **movieId**: Primary key that links to other datasets.  
- **title**: Movie title (with release year).  
- **genres**: Pipe-separated list of genres.  

This file is essential for **content-based filtering**, enabling genre-specific recommendations and theme analysis. 
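
As a rough sketch of how these columns could be made model-ready, the snippet below extracts the release year from `title` and expands the pipe-separated `genres` into indicator columns. It uses three rows copied from the output above rather than the full file; the names `movies_sample` and `genre_flags` are placeholders chosen for this example.

```python
import pandas as pd

# Three rows copied from the movies.csv preview above
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Pull the four-digit year out of the trailing "(YYYY)" in the title
year_text = movies_sample["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies_sample["year"] = pd.to_numeric(year_text)

# One indicator column per genre, split on the pipe separator
genre_flags = movies_sample["genres"].str.get_dummies(sep="|")

print(movies_sample[["movieId", "title", "year"]])
print(genre_flags)
```

Titles without a trailing year would come out as missing values here, so the full catalog should be checked for such cases before relying on the `year` column.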

---
## Ratings Dataset.


In [31]:
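# Load the ratings dataset
ratings = pd.read_csv('Data/ratings.csv')
ratings.info  # no parentheses: shows the bound method's repr (which embeds the frame); ratings.info() prints a dtype summary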

<bound method DataFrame.info of         userId  movieId  rating   timestamp
0            1        1     4.0   964982703
1            1        3     4.0   964981247
2            1        6     4.0   964982224
3            1       47     5.0   964983815
4            1       50     5.0   964982931
...        ...      ...     ...         ...
100831     610   166534     4.0  1493848402
100832     610   168248     5.0  1493850091
100833     610   168250     5.0  1494273047
100834     610   168252     5.0  1493846352
100835     610   170875     3.0  1493846415

[100836 rows x 4 columns]>

### ratings.csv  
The most **data-heavy file**, containing **100,836 explicit user ratings**.  
#### Columns 
- **userId**: Identifier of the user.  
- **movieId**: Identifier of the rated movie.  
- **rating**: Explicit user rating on a 5-star scale (0.5 to 5.0 in half-star increments).  
- **timestamp**: Time of rating (Unix format, seconds since 1970-01-01 UTC).  

This dataset forms the **backbone of collaborative filtering**. The timestamps also enable **temporal analysis**, helping track evolving preferences and movie popularity trends.  
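
The sketch below illustrates both ideas on five rows copied from the output above: converting the Unix timestamps to datetimes for temporal analysis, and pivoting the ratings into the user-item matrix that collaborative filtering works on. The variable names and the sparsity calculation are our own illustration, not a fixed part of the pipeline.

```python
import pandas as pd

# Five rows copied from the ratings.csv preview above
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 1, 610, 610],
    "movieId": [1, 3, 6, 166534, 168248],
    "rating": [4.0, 4.0, 4.0, 4.0, 5.0],
    "timestamp": [964982703, 964981247, 964982224, 1493848402, 1493850091],
})

# Unix seconds -> pandas datetimes
ratings_sample["rated_at"] = pd.to_datetime(ratings_sample["timestamp"], unit="s")

# Rows are users, columns are movies, cells are ratings (NaN = not rated)
user_item = ratings_sample.pivot_table(index="userId", columns="movieId", values="rating")

# Fraction of user-movie pairs with no rating
sparsity = 1.0 - user_item.notna().sum().sum() / user_item.size

print(ratings_sample[["userId", "movieId", "rated_at"]])
print(user_item)
print(f"Sparsity: {sparsity:.2f}")
```

On the full dataset the matrix will be far sparser than in this toy sample, which is one reason to prefer approaches that handle missing entries well.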


---
## Links dataset.


In [30]:
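# Load the links dataset
links = pd.read_csv('Data/links.csv')
links.info  # no parentheses: shows the bound method's repr (which embeds the frame); links.info() prints a dtype summary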

<bound method DataFrame.info of       movieId   imdbId    tmdbId
0           1   114709     862.0
1           2   113497    8844.0
2           3   113228   15602.0
3           4   114885   31357.0
4           5   113041   11862.0
...       ...      ...       ...
9737   193581  5476944  432131.0
9738   193583  5914996  445030.0
9739   193585  6397426  479308.0
9740   193587  8391976  483455.0
9741   193609   101726   37891.0

[9742 rows x 3 columns]>

### links.csv  

This file serves as a **bridge to external metadata sources**, mapping internal `movieId` values to industry-standard identifiers:  
#### Columns  
- **movieId**: Unique identifier for a movie.
- **imdbId**: IMDb identifier.  
- **tmdbId**: The Movie Database (TMDb) identifier. 

These external links allow for **data enrichment**, such as retrieving cast, plot, and ratings. This enhances the recommendation engine with richer context.  
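
A small sketch of preparing these identifiers for enrichment follows. It assumes the usual IMDb convention of a `tt` prefix followed by the numeric ID zero-padded to seven digits, and it converts `tmdbId` (read as float in the output above) to a nullable integer. The column names `imdb_tt` and `tmdb_int` are made up for this example, and no external API is called.

```python
import pandas as pd

# Three rows copied from the links.csv preview above
links_sample = pd.DataFrame({
    "movieId": [1, 2, 193587],
    "imdbId": [114709, 113497, 8391976],
    "tmdbId": [862.0, 8844.0, 483455.0],
})

# Assumed IMDb title ID format: "tt" + zero-padded 7-digit number
links_sample["imdb_tt"] = "tt" + links_sample["imdbId"].astype(str).str.zfill(7)

# Nullable integer dtype keeps any missing TMDb IDs as <NA> instead of forcing floats
links_sample["tmdb_int"] = links_sample["tmdbId"].astype("Int64")

print(links_sample)
```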

---
## Tags Dataset.

In [29]:
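# Load the tags dataset
tags = pd.read_csv('Data/tags.csv')
tags.info  # no parentheses: shows the bound method's repr (which embeds the frame); tags.info() prints a dtype summary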

<bound method DataFrame.info of       userId  movieId               tag   timestamp
0          2    60756             funny  1445714994
1          2    60756   Highly quotable  1445714996
2          2    60756      will ferrell  1445714992
3          2    89774      Boxing story  1445715207
4          2    89774               MMA  1445715200
...      ...      ...               ...         ...
3678     606     7382         for katie  1171234019
3679     606     7936           austere  1173392334
3680     610     3265            gun fu  1493843984
3681     610     3265  heroic bloodshed  1493843978
3682     610   168248  Heroic Bloodshed  1493844270

[3683 rows x 4 columns]>

### tags.csv  
This file contains **3,683 qualitative user-generated tags**, offering descriptive insights beyond numerical ratings.  
#### Columns
- **userId**: Identifier of the user who tagged the movie.  
- **movieId**: Identifier of the tagged movie.  
- **tag**: User-generated keyword(s).  
- **timestamp**: Time the tag was applied.  

Tags capture **nuanced characteristics** that genres miss, enabling **expressive content-based models**. They also reflect how perceptions of movies **shift over time**.  
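
Because tags are free text, the same idea can appear in different spellings (the output above shows both `heroic bloodshed` and `Heroic Bloodshed`). One possible cleanup, sketched here on rows copied from that output, is to normalize case and whitespace and then collect the distinct tags per movie. The `tag_clean` column and the `" | "` separator are arbitrary choices for illustration.

```python
import pandas as pd

# Six rows copied from the tags.csv preview above
tags_sample = pd.DataFrame({
    "userId": [2, 2, 2, 610, 610, 610],
    "movieId": [60756, 60756, 60756, 3265, 3265, 168248],
    "tag": ["funny", "Highly quotable", "will ferrell",
            "gun fu", "heroic bloodshed", "Heroic Bloodshed"],
})

# Normalize case and surrounding whitespace so spelling variants collapse
tags_sample["tag_clean"] = tags_sample["tag"].str.strip().str.lower()

# One row per movie holding its distinct, alphabetically sorted tags
movie_tags = tags_sample.groupby("movieId")["tag_clean"].agg(
    lambda s: " | ".join(sorted(set(s)))
)

print(tags_sample["tag"].nunique(), "->", tags_sample["tag_clean"].nunique())
print(movie_tags)
```

The per-movie tag strings could later feed a text vectorizer for a content-based model, if we decide to go that route.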


In [None]:
# Newton to work on the Data Understanding part
# Briefly describe each dataset in the CSV format

## Data Preparation

In [None]:
# Joackim, kindly lead here: check whether we need to create a single dataframe with the desired variables, and include the justifications as part of the notes
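
One way the single dataframe could be built is a left join of ratings onto the movie catalog by `movieId`, sketched below on toy rows (the rating values beyond the first two are made up). The `validate="many_to_one"` argument is an optional safeguard that raises an error if `movieId` is not unique in the catalog; the frame names are placeholders.

```python
import pandas as pd

# Toy frames; only the shape of the data mirrors the real files
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [1, 3, 1, 2],
    "rating": [4.0, 4.0, 5.0, 3.0],
})
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy", "Comedy|Romance"],
})

# Keep every rating and attach title/genres; fail loudly if the catalog has duplicate IDs
merged = ratings_sample.merge(movies_sample, on="movieId", how="left", validate="many_to_one")

# Per-movie summary that is often useful for popularity baselines and filtering
movie_stats = (
    merged.groupby(["movieId", "title"])["rating"]
    .agg(n_ratings="count", mean_rating="mean")
    .reset_index()
)

print(merged)
print(movie_stats)
```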


## Modeling

In [None]:
# Perform data preprocessing
# Clean the data to handle missing values, duplicates and outliers
# Normalize or scale numerical features if applicable
# Encode categorical data (e.g., one-hot) into suitable formats
# Split into training and test sets
# Choose a recommendation approach and apply it
# Build a model using an algorithm of choice - KNN, SVD or deep learning
# Train the model using historical interaction data
# Optimize the hyperparameters
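
As a starting point for the SVD option, here is a deliberately simple sketch using only NumPy and pandas: mean-center each user's ratings, fill unrated cells with zero, take a rank-`k` truncated SVD, and recommend the highest-scoring unseen movies. The toy ratings, `k = 2`, and the `recommend` helper are assumptions for illustration. Filling missing cells with zero is a simplification; dedicated recommender libraries typically fit latent factors on observed ratings only.

```python
import numpy as np
import pandas as pd

# Toy ratings: users 1-2 like movies 1-2, users 3-4 like movies 3-4
toy = pd.DataFrame(
    [(1, 1, 5.0), (1, 2, 4.0), (1, 3, 1.0),
     (2, 1, 4.0), (2, 2, 5.0), (2, 4, 1.0), (2, 5, 2.0),
     (3, 3, 5.0), (3, 4, 4.0), (3, 1, 1.0),
     (4, 3, 4.0), (4, 4, 5.0), (4, 5, 4.0), (4, 2, 1.0)],
    columns=["userId", "movieId", "rating"],
)

user_item = toy.pivot_table(index="userId", columns="movieId", values="rating")
user_means = user_item.mean(axis=1)

# Center by each user's mean so an unrated cell (0) means "average for this user"
centered = user_item.sub(user_means, axis=0).fillna(0.0)

# Rank-k approximation of the centered matrix
U, s, Vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
k = 2
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Add the user means back to get predicted ratings on the original scale
predicted = pd.DataFrame(
    approx, index=user_item.index, columns=user_item.columns
).add(user_means, axis=0)

def recommend(user_id, n=5):
    """Top-n movies the user has not rated, ordered by predicted rating."""
    unseen = user_item.loc[user_id].isna()
    scores = predicted.loc[user_id][unseen]
    return scores.sort_values(ascending=False).head(n).index.tolist()

print(recommend(1, n=5))
```

A held-out test split and a search over `k` would still be needed before comparing this against a KNN baseline.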

## Evaluation 

In [None]:
# Use metrics such as:
# RMSE/MAE - for rating predictions
# Precision/Recall/F1 score - for ranking
# MAP/NDCG - for ordered recommendations
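
For the order-aware metrics, a minimal per-user sketch is shown below. It assumes linear gain for DCG (relevance divided by log2 of rank + 1) and normalizes average precision by `min(number of relevant items, k)`; other conventions exist, so these definitions should be agreed on before reporting numbers. Averaging `average_precision_at_k` over all users gives MAP.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(recommended, relevance_by_item, k=5):
    """DCG of the recommended order divided by the DCG of the ideal order."""
    gains = [relevance_by_item.get(item, 0.0) for item in recommended]
    ideal = sorted(relevance_by_item.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def average_precision_at_k(recommended, relevant, k=5):
    """Mean of precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

# Toy example with made-up movie IDs and relevance scores
print(ndcg_at_k([10, 20, 30], {10: 3.0, 20: 2.0, 30: 1.0}, k=3))
print(ndcg_at_k([30, 20, 10], {10: 3.0, 20: 2.0, 30: 1.0}, k=3))
print(average_precision_at_k([10, 20, 30, 40, 50], relevant={10, 30}, k=5))
```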

## Findings and Conclusion

In [None]:
# Use the information above to answer the objectives outlined in the introduction

## Recommendations

In [None]:
# Make recommendations based on the findings and interest of the stakeholders