# Building a Content-Based Movie Recommendation System for Netflix

## Summary
### Business and Data Understanding

The stakeholder for this project is Netflix, a global streaming platform with a vast and diverse movie catalog. Netflix's mission is to provide personalized content to its users, ensuring they stay engaged and satisfied with their viewing experience.

With thousands of movies available on Netflix, users often face difficulty in discovering new content that aligns with their preferences, leading to decision fatigue and potentially lower engagement. Netflix wants to enhance its recommendation engine by suggesting movies similar to those users have already enjoyed, based on the content and genre of the films. 

The dataset includes movie titles, genres, and descriptions, which makes it well suited to a recommendation system based on content similarity. With a content-based recommender, Netflix can suggest relevant films to users, keeping them engaged on the platform and increasing viewing time. The system is intended to help Netflix continue to offer a personalized and enjoyable user experience.

### Data Preparation

Data preparation involved combining three key columns (movie title, genre, and description) into a single text feature, giving one representation of each movie's content. Missing values were handled by removing incomplete entries. We then applied a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to the combined text to convert it into numerical features. TF-IDF was chosen because it turns each document into a weighted term vector that can be compared directly in a similarity analysis. Pandas was used for data manipulation, and Scikit-learn for text vectorization and similarity computation.

### Modeling

We employed cosine similarity as the key metric to measure the closeness between movies based on their textual content. Using the cosine similarity score, we built a function that generates the top 5 movie recommendations for any given movie. The Surprise library and Singular Value Decomposition (SVD) were considered but ultimately not used here, as this model relies on content features rather than user ratings.
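A recommendation function of the kind described above might look like the sketch below. It is an illustration under stated assumptions rather than the project's implementation: the `recommend` name, the `text` column, and the toy rows are all hypothetical, and default `TfidfVectorizer` settings are used.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data invented for illustration; "text" stands in for the combined feature.
movies = pd.DataFrame({
    "title": ["Space Quest", "Ocean Deep", "Space Rescue", "Reef Life"],
    "text": [
        "sci-fi astronauts explore a distant planet",
        "documentary divers film a coral reef",
        "sci-fi astronauts rescue a crew near a distant planet",
        "documentary coral reef fish and divers",
    ],
})

tfidf_matrix = TfidfVectorizer().fit_transform(movies["text"])

# Pairwise cosine similarity between every pair of movies (n_movies x n_movies).
similarity = cosine_similarity(tfidf_matrix)


def recommend(title, top_n=5):
    """Return up to top_n titles most similar to the given title (hypothetical helper)."""
    matches = movies.index[movies["title"] == title]
    if len(matches) == 0:
        raise ValueError(f"Title not found: {title!r}")
    idx = matches[0]
    # Score every movie against the query, excluding the query itself.
    scores = pd.Series(similarity[idx], index=movies.index).drop(idx)
    top = scores.sort_values(ascending=False).head(top_n)
    return movies.loc[top.index, "title"].tolist()


print(recommend("Space Quest", top_n=1))
```

Computing the full similarity matrix up front is convenient for a few thousand movies; for a much larger catalog, computing one row of similarities per query would likely use less memory.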

### Evaluation

The model was tested by retrieving the top 5 recommendations for specific movies. For example, when queried with Casanova (2015), the model suggested films such as Das Casanova-Projekt (1981) and Peur de rien (2015). These results show that the model can recommend movies based on shared content characteristics. Because the model is content-based and does not predict user ratings, a rating-error metric such as RMSE does not apply; instead, the recommendations were manually inspected for relevance.

### Limitations and Recommendations

The current recommendation system relies purely on movie metadata and descriptions. It does not take into account user preferences or ratings, which could limit its effectiveness for personalized recommendations. Future improvements could involve integrating collaborative filtering techniques with user rating data to develop a hybrid system that combines both content and user interactions for more accurate recommendations.

## Data Understanding
In this section, we will load the data, understand its structure, and check for missing values.

```python
import pandas as pd

# Load the dataset
movie_data = pd.read_csv('movie_descriptions.csv')

# Show dataset info
movie_data.info()

# Display basic statistics
print(movie_data.describe())

# Sample of the dataset
movie_data.sample(5)
```


```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      5000 non-null   int64 
 1   title   5000 non-null   object
 2   genre   5000 non-null   object
 3   desc    5000 non-null   object
dtypes: int64(1), object(3)
memory usage: 156.4+ KB
                 id
count   5000.000000
mean   27188.676200
std    15680.129994
min        1.000000
25%    13691.250000
50%    26974.000000
75%    41004.500000
max    54198.000000
```


| | id | title | genre | desc |
|---|---|---|---|---|
| 4435 | 530 | Going Where I've Never Been: The Photography ... | documentary | The work of photographer Diane Arbus as expla... |
| 4730 | 16328 | Der Golem (1915) | horror | In this version of the golem legend, the gole... |
| 2136 | 30089 | My Heart in Kenya (2016) | documentary | After fleeing from civil war in her native Et... |
| 4653 | 740 | Cutthroats (1994) | comedy | The unhappy employees of a company that puts ... |
| 3078 | 50058 | Sanda (2014) | drama | Work, earn a salary, live off that money. Thi... |


The dataset contains the following columns:

- id: A unique identifier for each movie.
- title: The title of the movie.
- genre: The genre of the movie (e.g., documentary, drama, comedy).
- desc: A description or summary of the movie.
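
The `info()` output above already shows that every column has 5000 non-null values. For a more explicit check of missing values and duplicate identifiers, something like the following sketch could be used. It runs on a small invented DataFrame so that it is self-contained; on the real data, `toy` would be replaced by `movie_data`.

```python
import pandas as pd

# Invented rows for illustration; the real dataset has no missing values.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Movie A", "Movie B", None],
    "genre": ["drama", None, "comedy"],
    "desc": ["A quiet story.", "A loud story.", "A funny story."],
})

# Count missing values in each column.
missing_per_column = toy.isna().sum()

# Count ids that repeat an earlier row (0 means every id is unique).
duplicate_ids = int(toy["id"].duplicated().sum())

print(missing_per_column)
print("duplicate ids:", duplicate_ids)
```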