<a href="https://colab.research.google.com/github/MinChia900110/Moive-and-TV-Reccomendation-System/blob/main/Netflix_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Overview

This project focuses on building a recommendation engine for movies, similar to the logic used by streaming services like Netflix. Our goal is to analyze user viewing patterns and movie metadata to predict what a user would like to watch next.

We are exploring multiple approaches, ranging from simple statistical methods to advanced machine learning techniques:

  1. Exploratory Data Analysis (EDA): Understanding the "Long Tail" of movie ratings.
  2. Content-Based Filtering: Recommending movies based on genre, cast, and plot similarities.
  3. Collaborative Filtering: Using Matrix Factorization (SVD) and Nearest Neighbors (KNN) to learn from patterns in user ratings and find similar users.
  4. Hybrid Approach: Combining both methods to solve the "Cold Start" problem.
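
One simple way to combine the two methods is a weighted blend that falls back to the content-based score when a user has too few ratings for collaborative filtering to be reliable. The sketch below is only an illustration of that idea: the function name `hybrid_score`, the `min_ratings` threshold, and the `alpha` weight are hypothetical and are not taken from the notebook.

```python
def hybrid_score(content_score, collab_score, n_user_ratings,
                 min_ratings=5, alpha=0.5):
    """Blend a content-based and a collaborative score for one (user, movie) pair.

    All parameter names and default values are illustrative assumptions.
    """
    # Cold start: too little rating history (or no collaborative score at all),
    # so rely on the content-based score alone.
    if collab_score is None or n_user_ratings < min_ratings:
        return content_score
    # Otherwise take a weighted average of the two scores.
    return alpha * collab_score + (1 - alpha) * content_score


# A brand-new user gets the content-based score unchanged.
print(hybrid_score(content_score=0.8, collab_score=None, n_user_ratings=0))
# An established user gets a blend of both scores.
print(hybrid_score(content_score=0.8, collab_score=0.4, n_user_ratings=10))
```

In practice both scores would need to be on a comparable scale before blending, and `alpha` would be tuned on held-out data.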


### Tech Stack

This notebook utilizes the following technologies and libraries:

*   **Python**: The primary programming language.
*   **Google Colab**: The cloud-based environment for running the notebook.
*   **Libraries**:
    *   `pandas`: For data manipulation and analysis.
    *   `numpy`: For numerical operations.
    *   `scikit-learn`: For machine learning algorithms.
    *   `seaborn`: For statistical data visualization.
    *   `matplotlib`: For creating static, animated, and interactive visualizations.
    *   `surprise`: A Python scikit for recommender systems.
    *   `nmslib`: For efficient similarity search (non-metric space library).

### How to Run

To successfully execute this notebook, please ensure the necessary datasets are accessible in your Colab environment. You have two primary options:

1.  **Upload Datasets**: Upload the required CSV files (`movies.csv`, `ratings.csv`, `links.csv`, `tags.csv`) directly to your Colab session. You can do this by clicking the 'Files' icon on the left sidebar, then selecting 'Upload to session storage'. Please note that files uploaded this way are temporary and will be deleted once the runtime is recycled.

2.  **Mount Google Drive**: If your datasets are stored in Google Drive, you can mount your Drive to access them persistently. Use the following code snippet in a code cell to mount your Drive:

    ```python
    from google.colab import drive
    drive.mount('/content/drive')
    ```

    After mounting, your files are available under `/content/drive/MyDrive/`; navigate to the folder that contains your datasets.

**Important**: Once your data is accessible, ensure that the file paths used in the notebook (e.g., `pd.read_csv('path/to/your/movies.csv')`) correctly point to the location of your uploaded or mounted datasets.
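
To keep the file paths in one place, it can help to define the data folder once and check that the expected files are present before loading them. The following is a minimal sketch; `DATA_DIR`, `csv_path`, and `missing_files` are hypothetical names introduced here, and `/content` is only an assumed default location.

```python
from pathlib import Path

# Assumed location of the CSV files; change this to the folder you uploaded
# to or to a folder under your mounted Drive.
DATA_DIR = Path("/content")


def csv_path(name, data_dir=DATA_DIR):
    """Return the full path for a dataset file such as 'movies.csv'."""
    return Path(data_dir) / name


def missing_files(names, data_dir=DATA_DIR):
    """Return the dataset files that are not present in data_dir."""
    return [name for name in names if not csv_path(name, data_dir).exists()]


required = ["movies.csv", "ratings.csv", "links.csv", "tags.csv"]
print("Missing files:", missing_files(required))
```

An empty list means every file was found; any name that is printed still needs to be uploaded or the folder needs to be corrected.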


## Phase 1: Data Cleaning & EDA (Visualizing the Long Tail)

This phase focuses on thoroughly understanding the structure and characteristics of our movie dataset. Key objectives include:

1.  **Understanding Dataset Structure**: Examine the columns, data types, and overall organization of the dataset.
2.  **Handling Missing Values**: Identify and appropriately address any missing data to ensure data quality and prevent issues in subsequent analysis.
3.  **Exploratory Data Analysis (EDA)**: Conduct a comprehensive EDA to uncover initial insights, patterns, and anomalies within the movie data.
4.  **Visualizing the 'Long Tail'**: Plot the distribution of ratings per movie to show the 'long tail': a small number of popular movies account for a large share of the ratings, while a large number of niche movies account for a small share.

This initial exploration will lay the groundwork for effective feature engineering and model development.
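
The objectives above can be sketched on a small toy table. The column names `userId`, `movieId`, and `rating` are an assumption about the layout of `ratings.csv`, and the values are made up purely for illustration.

```python
import pandas as pd

# Toy ratings table (assumed column names, made-up values).
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 20, 10, 30, 10, 20, 10, 40, 10, 50],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 3.5, 5.0, 1.0],
})

# Structure and missing values.
print(ratings.dtypes)
missing_per_column = ratings.isna().sum()
print(missing_per_column)

# Number of ratings per movie, most-rated first.
counts = ratings["movieId"].value_counts()

# Share of all ratings covered by the top-k most-rated movies.
cumulative_share = counts.cumsum() / counts.sum()

print(counts)
print(cumulative_share)
```

Plotting `counts.values` against the movie's rank (for example with `plt.plot`) gives the long-tail curve: a steep head of popular titles followed by a long, flat tail of rarely rated ones.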

In [None]:
# IMPORTANT: If you see a ModuleNotFoundError, especially right after installing new packages,
# restart your Colab runtime (Runtime > Restart runtime) and then re-run this cell.
# This ensures all newly installed modules are loaded correctly.
!pip install pandas scikit-learn seaborn matplotlib
!pip install numpy==1.26.4
!pip install scikit-surprise
!pip install nmslib
#Dataframe
import pandas as pd
import numpy as np
#Visualisation
import seaborn as sns
import matplotlib.pyplot as plt
#Machine Learning
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

#Recommender Systems and Similarity Search
from surprise import Dataset, SVD
from surprise.model_selection import cross_validate
import nmslib

#Different package versions can be incompatible. If you run into compatibility issues,
#uncomment the following lines to check the installed versions.
#import surprise
#print(f"Pandas version: {pd.__version__}")
#print(f"NumPy version: {np.__version__}")
#print(f"Scikit-learn version: {sklearn.__version__}")
#print(f"Surprise version: {surprise.__version__}")
# NMSLIB does not always expose a standard __version__ attribute
#print("NMSLIB imported successfully.")

#Configuration for better visualization
%matplotlib inline
sns.set_theme(style="whitegrid")
pd.set_option('display.max_columns', 15)


In [None]:
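#Load the datasets. Upload or download the CSV files first if necessary.
#Paths to the data files
movies_path = '/content/movies.csv'
ratings_path = '/content/ratings.csv'
#Read the CSV files into DataFrames (.head() is a DataFrame method, not a string method)
movies = pd.read_csv(movies_path)
ratings = pd.read_csv(ratings_path)
#Preview the first rows of each table
display(movies.head())
display(ratings.head())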

## Phase 2: Building a Content-Based Recommender (TF-IDF)

This phase will focus on developing a content-based recommender system, primarily utilizing the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique.

Key objectives include:

1.  **Feature Extraction**: Extract relevant features from movie metadata (e.g., genres, plot keywords, cast, director) to create a comprehensive movie profile.
2.  **TF-IDF Vectorization**: Apply TF-IDF to transform the textual movie features into a numerical vector representation.
3.  **Similarity Calculation**: Compute the similarity between movies based on their TF-IDF vectors, typically using cosine similarity.
4.  **Recommendation Generation**: Develop a mechanism to generate movie recommendations for a user based on movies they have previously liked and the calculated similarities.

This phase will demonstrate how to recommend movies based purely on their intrinsic characteristics.
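
The four steps above can be sketched with scikit-learn on toy data. This is an illustration under assumptions, not the notebook's implementation: the `title` and `genres` columns (with pipe-separated genres) are assumed to match `movies.csv`, and `similar_movies` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy metadata (assumed column names, made-up titles).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": ["Action|Adventure", "Action|Thriller",
               "Romance|Drama", "Action|Adventure"],
})

# Feature extraction: turn the pipe-separated genres into a text "document".
documents = movies["genres"].str.replace("|", " ", regex=False)

# TF-IDF vectorization: one row per movie, one column per genre term.
tfidf = TfidfVectorizer().fit_transform(documents)

# Similarity calculation: pairwise cosine similarity between movies.
similarity = cosine_similarity(tfidf)


def similar_movies(title, top_n=2):
    """Return the top_n movies most similar to the given title."""
    idx = movies.index[movies["title"] == title][0]
    scores = pd.Series(similarity[idx], index=movies["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n)


# Recommendation generation for a movie the user liked.
print(similar_movies("Movie A"))
```

With richer metadata such as plot keywords or cast, the same pipeline applies; only the text that goes into `documents` changes.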


## Phase 3: Building a Collaborative Filtering Model (SVD/KNN)

This phase will focus on implementing and evaluating collaborative filtering models, such as Singular Value Decomposition (SVD) or K-Nearest Neighbors (KNN) based algorithms. Key objectives include:

1.  **Data Preparation**: Format the user-item interaction data (ratings) into a suitable structure for collaborative filtering algorithms.
2.  **Model Training**: Train collaborative filtering models (e.g., SVD from `surprise` library, or a KNN-based approach).
3.  **Recommendation Generation**: Generate recommendations for users based on the trained collaborative filtering model.
4.  **Understanding Latent Factors**: For models like SVD, explore the latent factors that represent user preferences and movie characteristics.

This phase will leverage the power of user-item interactions to provide recommendations.
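
As a rough sketch of these steps, the example below uses scikit-learn's `TruncatedSVD` and `NearestNeighbors` on a toy rating matrix. Note the swap: this is a plain truncated SVD of a zero-filled matrix, not the `SVD` algorithm from `surprise`, and it is shown only because it is compact. The column names and rating values are assumptions.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Toy ratings (assumed column names, made-up values).
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 40, 30, 40, 50, 10, 20, 50],
    "rating":  [5.0, 4.5, 1.0, 4.5, 5.0, 1.5, 5.0, 4.5, 4.0, 5.0, 4.0, 1.0],
})

# Data preparation: users as rows, movies as columns.
# Unrated cells are filled with 0, which is a simplification.
user_item = ratings.pivot_table(
    index="userId", columns="movieId", values="rating"
).fillna(0.0)

# Model training: project each movie onto 2 latent factors.
svd = TruncatedSVD(n_components=2, random_state=42)
item_factors = svd.fit_transform(user_item.T.values)

# Nearest neighbours in the latent space, using cosine distance.
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(item_factors)
distances, indices = knn.kneighbors(item_factors)

# Recommendation generation: report each movie's closest other movie.
movie_ids = user_item.columns.to_numpy()
for row, movie_id in enumerate(movie_ids):
    # The first neighbour is normally the movie itself, so take the second.
    print(movie_id, "->", movie_ids[indices[row][1]])
```

The rows of `item_factors` are the latent factors mentioned in step 4; inspecting which movies score high on each factor is one way to explore what the model has learned.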


## Phase 4: Evaluation (RMSE & Top-N Accuracy)

This final phase will focus on rigorously evaluating the performance of the developed recommender systems (Content-Based and Collaborative Filtering).

Key objectives include:

1.  **Define Evaluation Metrics**: Utilize appropriate metrics such as Root Mean Squared Error (RMSE) for rating prediction tasks and Top-N Accuracy (e.g., Precision@N, Recall@N, F1@N, NDCG) for ranking tasks.
2.  **Validation Strategy**: Use a hold-out train-test split or k-fold cross-validation to estimate how well the models generalize to unseen ratings.
3.  **Comparative Analysis**: Compare the performance of the Content-Based and Collaborative Filtering models, identifying their strengths and weaknesses.
4.  **Analyze Top-N Recommendations**: Evaluate the quality of the top-N recommendations generated by each model.

This evaluation will provide insights into which model performs best under different criteria and guide potential future improvements or hybrid approaches.
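
For reference, the two families of metrics named above can be written in a few lines of plain Python. This is a minimal sketch with hypothetical helper names (`rmse`, `precision_recall_at_n`) and made-up inputs; library implementations would normally be preferred in the actual notebook.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def precision_recall_at_n(recommended, relevant, n):
    """Precision@N and Recall@N for one user's ranked recommendation list.

    recommended: movie ids ordered from best to worst.
    relevant: set of movie ids the user actually liked.
    """
    top_n = recommended[:n]
    hits = len(set(top_n) & set(relevant))
    precision = hits / n if n else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Rating prediction: how far are the predicted ratings from the true ones?
print(rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0]))

# Ranking: how many of the top 3 recommendations were actually relevant?
print(precision_recall_at_n([10, 20, 30, 40], {10, 30, 50}, n=3))
```

Lower RMSE is better, while higher precision and recall are better; averaging the per-user ranking metrics over all test users gives a single score per model for the comparison.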


