# <center> Netflix Recommendation System using Cosine Similarity</center>

1. <b>Introduction</b>
This document outlines an approach to building a recommendation system for Netflix titles using cosine similarity. Given a movie or TV show that a user has watched or liked, the system recommends other titles whose content is most similar to it.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.filterwarnings("ignore") 

2. <b>Dataset</b>
The recommendation system is built on a dataset of Netflix titles, loaded from `netflixData.csv`. Each row describes one movie or TV show with the fields Show Id, Title, Description, Director, Genres, Cast, Production Country, Release Date, Rating, Duration, Imdb Score, Content Type and Date Added.

In [2]:
data = pd.read_csv("netflixData.csv")
data.head(3)

Unnamed: 0,Show Id,Title,Description,Director,Genres,Cast,Production Country,Release Date,Rating,Duration,Imdb Score,Content Type,Date Added
0,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,2020.0,TV-MA,1 Season,6.6/10,TV Show,
1,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,2020.0,TV-MA,99 min,6.2/10,Movie,"September 8, 2020"
2,b01b73b7-81f6-47a7-86d8-acb63080d525,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Sabina Fedeli, Anna Migotto","Documentaries, International Movies","Helen Mirren, Gengher Gatti",Italy,2019.0,TV-14,95 min,6.4/10,Movie,"July 1, 2020"


3. <b>Data Preprocessing</b>
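Before applying cosine similarity, the dataset needs to be preprocessed. Here that means checking for missing values, keeping only the columns the recommender uses, and dropping incomplete rows. Other datasets may also call for normalizing ratings or converting categorical variables into appropriate representations.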
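
One detail matters when handling missing data: dropping rows leaves gaps in the DataFrame index, while a similarity matrix is addressed by row position. The sketch below shows two common options on a made-up three-row frame. The titles and values are invented for illustration and are not taken from `netflixData.csv`.

```python
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genres": ["Dramas", None, "Comedies"],
})

# Option 1: drop incomplete rows, then renumber so labels match positions.
dropped = toy.dropna(subset=["Genres"]).reset_index(drop=True)

# Option 2: keep every row and substitute an empty string for missing text.
filled = toy.fillna({"Genres": ""})

print(dropped.index.tolist())
print(filled["Genres"].tolist())
```

Which option fits better depends on whether a row with a missing field should still be recommendable.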

In [3]:
data.isnull().sum()

Show Id                  0
Title                    0
Description              0
Director              2064
Genres                   0
Cast                   530
Production Country     559
Release Date             3
Rating                   4
Duration                 3
Imdb Score             608
Content Type             0
Date Added            1335
dtype: int64

In [4]:
data = data[["Title", "Description", "Content Type", "Genres"]]
data.head()

Unnamed: 0,Title,Description,Content Type,Genres
0,(Un)Well,This docuseries takes a deep dive into the luc...,TV Show,Reality TV
1,#Alive,"As a grisly virus rampages a city, a lone man ...",Movie,"Horror Movies, International Movies, Thrillers"
2,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...",Movie,"Documentaries, International Movies"
3,#blackAF,Kenya Barris and his family navigate relations...,TV Show,TV Comedies
4,#cats_the_mewvie,This pawesome documentary explores how our fel...,Movie,"Documentaries, International Movies"


In [5]:
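# Remove rows with null values, then renumber the index so that row labels
# match row positions in the similarity matrix built later
data = data.dropna().reset_index(drop=True)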

4. <b>TF-IDF Representation</b>
To compute cosine similarity, we need to represent the movie and TV show data in a numerical format. We can use the TF-IDF (Term Frequency-Inverse Document Frequency) representation, which converts textual information into a numerical vector space representation. This representation takes into account the importance of words in each document (i.e., movie or TV show) relative to the entire dataset.

In [6]:
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SAGAR\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
def clean(text):
    """Lowercase, strip noise, remove stopwords, and stem each word."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # bracketed text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # words containing digits
    words = [word for word in text.split(' ') if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

# Titles are stored in cleaned form, so later lookups must use the cleaned title
data["Title"] = data["Title"].apply(clean)

5. <b>Cosine Similarity Calculation</b>
After obtaining the TF-IDF representation, we can calculate the cosine similarity between different movies and TV shows. Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them. Higher cosine similarity values indicate greater similarity.
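
To make the measure concrete, the sketch below computes the cosine of the angle between small hand-written vectors as the dot product divided by the product of the vector lengths, and compares it with scikit-learn's `cosine_similarity`. The helper name `cosine` and the vectors are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors: (u . v) / (|u| |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 2.0])

print(cosine(a, a))  # same direction
print(cosine(a, b))  # partial overlap
print(cosine(a, c))  # no shared components

# The library call returns the full pairwise matrix in one step.
pairwise = cosine_similarity(np.vstack([a, b, c]))
print(pairwise.round(3))
```

Because TF-IDF weights are non-negative, similarities between titles fall between 0 (no shared terms) and 1 (same direction).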

In [8]:
feature = data["Genres"].tolist()

# Create a TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Compute the TF-IDF matrix for the genre strings
tfidf_matrix = tfidf.fit_transform(feature)

similarity = cosine_similarity(tfidf_matrix)

In [9]:
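# Map each cleaned title to its row index; keep the first row when a title repeats
indices = pd.Series(data.index, index=data['Title'])
indices = indices[~indices.index.duplicated(keep='first')]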

6. <b>Recommendation Generation</b>
To generate recommendations, we need to identify movies or TV shows that are most similar to the ones a user has already watched or expressed interest in. This can be done by calculating the cosine similarity between the user's preferred items and all other items in the dataset. The top N items with the highest cosine similarity scores can be recommended to the user.
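
When a user has several preferred items, one possible way to combine them is to average their similarity rows and rank the remaining titles by that average. The sketch below does this on an invented five-title catalogue. The helper `recommend_for_user`, the titles, and the averaging rule are all assumptions made for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini-catalogue; titles and genres are placeholders.
catalog = pd.DataFrame({
    "Title": ["alpha", "beta", "gamma", "delta", "epsilon"],
    "Genres": ["TV Comedies", "TV Comedies, TV Dramas", "Horror Movies",
               "Horror Movies, Thrillers", "Documentaries"],
})

sim = cosine_similarity(TfidfVectorizer().fit_transform(catalog["Genres"]))

def recommend_for_user(liked_titles, top_n=2):
    """Hypothetical helper: rank unseen items by mean similarity to liked ones."""
    liked_pos = catalog.index[catalog["Title"].isin(liked_titles)].to_numpy()
    scores = sim[liked_pos].mean(axis=0)   # average over the liked items
    scores[liked_pos] = -np.inf            # never recommend an already liked item
    best = np.argsort(-scores, kind="stable")[:top_n]
    return catalog["Title"].iloc[best].tolist()

print(recommend_for_user(["alpha"]))
print(recommend_for_user(["gamma"], top_n=1))
```

With a single liked title this reduces to ranking by that title's similarity row, which is the case the notebook handles.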

In [10]:
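def netFlix_recommendation(title, similarity = similarity):
    """Return the 10 titles most similar to `title` (given in cleaned form)."""
    index = indices[title]
    similarity_scores = list(enumerate(similarity[index]))
    # Drop the query title itself so it cannot appear in its own recommendations
    similarity_scores = [s for s in similarity_scores if s[0] != index]
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[0:10]
    movieindices = [i[0] for i in similarity_scores]
    return data['Title'].iloc[movieindices]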

In [11]:
netFlix_recommendation("girlfriend")

3                          blackaf
285                     washington
417                 arrest develop
434     astronomi club sketch show
451    aunti donna big ol hous fun
656                      big mouth
752                bojack horseman
805                   brew brother
935                       champion
937                  chappell show
Name: Title, dtype: object

7. <b>Conclusion</b>
This document has outlined an approach to building a recommendation system for Netflix titles using cosine similarity. By combining a TF-IDF representation of each title's genres with the cosine similarity measure, the system recommends the titles whose content is closest to one the user already likes. Recommendations of this kind are intended to improve the user experience and engagement on the Netflix platform.