# How the Netflix Recommendation System Works
Netflix's recommendation system suggests movies and TV shows that match your interests. Because of its large user base, Netflix has a lot of data to learn from. The system builds a personalised catalogue for you from factors such as:

- your viewing history
- the viewing history of other users whose tastes and preferences are similar to yours
- the genres, categories, descriptions, and other information about the content you have watched

The genre of the content is one of the most valuable factors, because it helps Netflix recommend content even to new users.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
data = pd.read_csv("E:/DS/Datasets/netflixData.csv")
print(data.head())

                                Show Id                          Title  \
0  cc1b6ed9-cf9e-4057-8303-34577fb54477                       (Un)Well   
1  e2ef4e91-fb25-42ab-b485-be8e3b23dedb                         #Alive   
2  b01b73b7-81f6-47a7-86d8-acb63080d525  #AnneFrank - Parallel Stories   
3  b6611af0-f53c-4a08-9ffa-9716dc57eb9c                       #blackAF   
4  7f2d4170-bab8-4d75-adc2-197f7124c070               #cats_the_mewvie   

                                         Description  \
0  This docuseries takes a deep dive into the luc...   
1  As a grisly virus rampages a city, a lone man ...   
2  Through her diary, Anne Frank's story is retol...   
3  Kenya Barris and his family navigate relations...   
4  This pawesome documentary explores how our fel...   

                      Director  \
0                          NaN   
1                       Cho Il   
2  Sabina Fedeli, Anna Migotto   
3                          NaN   
4             Michael Margolis   

             

In [3]:
print(data.isnull().sum())

Show Id                  0
Title                    0
Description              0
Director              2064
Genres                   0
Cast                   530
Production Country     559
Release Date             3
Rating                   4
Duration                 3
Imdb Score             608
Content Type             0
Date Added            1335
dtype: int64


In [4]:
data = data[["Title", "Description", "Content Type", "Genres"]]  # keep only the columns of interest
print(data.head())

                           Title  \
0                       (Un)Well   
1                         #Alive   
2  #AnneFrank - Parallel Stories   
3                       #blackAF   
4               #cats_the_mewvie   

                                         Description Content Type  \
0  This docuseries takes a deep dive into the luc...      TV Show   
1  As a grisly virus rampages a city, a lone man ...        Movie   
2  Through her diary, Anne Frank's story is retol...        Movie   
3  Kenya Barris and his family navigate relations...      TV Show   
4  This pawesome documentary explores how our fel...        Movie   

                                           Genres  
0                                      Reality TV  
1  Horror Movies, International Movies, Thrillers  
2             Documentaries, International Movies  
3                                     TV Comedies  
4             Documentaries, International Movies  


In [5]:
print(data.isnull().sum())

Title           0
Description     0
Content Type    0
Genres          0
dtype: int64


In [6]:
data = data.dropna()  # not needed here, as the counts above show, but kept for completeness

In [7]:
#!pip install nltk

In [8]:
import nltk
import re

In [9]:
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

In [10]:
stemmer = nltk.SnowballStemmer("english")

In [11]:
from nltk.corpus import stopwords
import string

In [12]:
stopword=set(stopwords.words('english'))

In [13]:
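def clean(text):
    """Normalise a string: lowercase, strip noise, drop stop words, stem."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # tokens containing digits
    # Drop English stop words, then stem each remaining word
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text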
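
To see what the regular-expression stages of `clean` do without needing the NLTK data, the sketch below applies only those stages to a title from the dataset and to a made-up string. `strip_noise` is a hypothetical helper name introduced for illustration, and stop-word removal and stemming are deliberately left out.

```python
import re
import string

def strip_noise(raw):
    """Regex-only part of the cleaning step (no stop words, no stemming)."""
    cleaned = str(raw).lower()
    cleaned = re.sub(r'\[.*?\]', '', cleaned)                # text in square brackets
    cleaned = re.sub(r'https?://\S+|www\.\S+', '', cleaned)  # URLs
    cleaned = re.sub(r'<.*?>+', '', cleaned)                 # HTML-like tags
    cleaned = re.sub('[%s]' % re.escape(string.punctuation), '', cleaned)
    cleaned = re.sub(r'\n', '', cleaned)                     # newlines
    cleaned = re.sub(r'\w*\d\w*', '', cleaned)               # tokens containing digits
    return cleaned

# A title from the dataset: '#' and '-' are removed, leaving a double space
print(repr(strip_noise("#AnneFrank - Parallel Stories")))

# A made-up string with brackets, tags and a number
print(strip_noise("Top 10 [HD] <b>Shows</b>").split())
```

Removing characters can leave consecutive spaces behind, so splitting on a single space (`split(' ')`) produces empty tokens, while `split()` with no argument discards them.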

In [14]:
data["Title"] = data["Title"].apply(clean)

In [15]:
print(data.Title.sample(10))

104     famili reunion christma
3969                   riverdal
5031                    mansion
2740            littl miss sumo
1166               dabb possess
2104                 homunculus
4179    seven soul skull castl 
919                 cathedr sea
5396                  time danc
2509     keith richard influenc
Name: Title, dtype: object


# Cosine Similarity in Machine Learning
Cosine similarity measures how similar two documents are. Each document is represented as a vector, and the score is the cosine of the angle between the two vectors. For vectors with non-negative entries, such as the TF-IDF vectors built below, the score ranges from 0 to 1. A score of 1 means the two vectors point in the same direction, so the documents are as similar as this measure can express.

A score of 0, on the other hand, means the two vectors have nothing in common. When the score is 1, the angle between the two vectors is 0 degrees, and when the score is 0, the angle is 90 degrees.
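
In symbols, the standard definition can be written as follows. The names $A$, $B$ and $\theta$ are introduced here for illustration: $A$ and $B$ are the two document vectors and $\theta$ is the angle between them.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}} \, \sqrt{\sum_{i} B_i^{2}}}
```

Setting $\theta = 0^\circ$ gives a score of 1, and $\theta = 90^\circ$ gives a score of 0, which are the two cases described above.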

In machine learning, this technique is widely used in recommendation systems to compare the descriptions of two products, so that the most similar products can be recommended to the user.
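
As a small worked example, the sketch below computes the score by hand for a few toy vectors. The four-term genre vocabulary and the vectors are hypothetical and chosen only to illustrate the three cases: identical direction, no overlap, and partial overlap. Returning 0.0 for an all-zero vector is a convention picked for this sketch.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary: [documentaries, international, movies, comedies]
doc_a = [1, 1, 1, 0]   # "Documentaries, International Movies"
doc_b = [1, 1, 1, 0]   # same genres as doc_a
doc_c = [0, 0, 0, 1]   # "Comedies"
doc_d = [1, 0, 0, 0]   # "Documentaries"

print(cosine(doc_a, doc_b))  # same direction: approximately 1
print(cosine(doc_a, doc_c))  # no shared terms: 0
print(cosine(doc_a, doc_d))  # partial overlap: between 0 and 1
```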

In [16]:
feature = data["Genres"].tolist()
feature

['Reality TV',
 'Horror Movies, International Movies, Thrillers',
 'Documentaries, International Movies',
 'TV Comedies',
 'Documentaries, International Movies',
 'Dramas, International Movies, Romantic Movies',
 'Dramas, International Movies, Romantic Movies',
 'Comedies',
 'Documentaries, Sports Movies',
 'Comedies, Dramas, International Movies',
 'Comedies, Dramas, International Movies',
 'Comedies, International Movies, Romantic Movies',
 'Comedies, Dramas, International Movies',
 'International TV Shows, Romantic TV Shows, TV Dramas',
 'Docuseries, Science & Nature TV',
 'Dramas, International Movies, Sports Movies',
 'Movies',
 'Dramas, International Movies',
 'Dramas, International Movies',
 'Horror Movies, International Movies',
 'Crime TV Shows, TV Dramas, TV Mysteries',
 'Crime TV Shows, Docuseries',
 'Documentaries',
 'Documentaries',
 'Comedies, Dramas, Independent Movies',
 'Dramas, Independent Movies, International Movies',
 'Dramas, International Movies',
 'Dramas, Thril

In [17]:
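# The genre strings are passed to fit_transform(); the `input` parameter only
# selects how documents are read and accepts 'filename', 'file' or 'content'.
tfidf = text.TfidfVectorizer(stop_words="english")
# Convert a collection of raw documents to a matrix of TF-IDF features.
tfidf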

In [18]:
tfidf_matrix = tfidf.fit_transform(feature)
tfidf_matrix

<5967x44 sparse matrix of type '<class 'numpy.float64'>'
	with 22096 stored elements in Compressed Sparse Row format>
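
In the matrix above, each of the 5967 rows is a title and each of the 44 columns is a distinct genre term. To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three genre strings taken from the dataset. It follows the smoothed IDF formula and L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, but it skips stop-word filtering and is not the library's implementation. `tfidf_rows` is a hypothetical helper name.

```python
import math
import re
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    # Lowercase and keep tokens of two or more word characters
    tokenised = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

genres = [
    "Documentaries, International Movies",
    "Horror Movies, International Movies, Thrillers",
    "TV Comedies",
]
vocab, rows = tfidf_rows(genres)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Within the first row, "documentaries" appears in only one document and therefore gets a larger weight than "movies", which appears in two.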

In [19]:
similarity = cosine_similarity(tfidf_matrix)

In [41]:
similarity

array([[1.        , 0.        , 0.        , ..., 0.32075218, 0.        ,
        0.        ],
       [0.        , 1.        , 0.30428612, ..., 0.07587812, 0.68953015,
        0.15936057],
       [0.        , 0.30428612, 1.        , ..., 0.11962968, 0.27899812,
        0.12562419],
       ...,
       [0.32075218, 0.07587812, 0.11962968, ..., 1.        , 0.25478887,
        0.        ],
       [0.        , 0.68953015, 0.27899812, ..., 0.25478887, 1.        ,
        0.110801  ],
       [0.        , 0.15936057, 0.12562419, ..., 0.        , 0.110801  ,
        1.        ]])

In [22]:
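# Map each cleaned title to its row; keep the first row when a title repeats
indices = pd.Series(data.index, index=data['Title'])
indices = indices[~indices.index.duplicated(keep='first')]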

In [39]:
list(enumerate(similarity[2]))

[(0, 0.0),
 (1, 0.30428612047201303),
 (2, 1.0),
 (3, 0.0),
 (4, 1.0),
 (5, 0.38528903587421265),
 (6, 0.38528903587421265),
 (7, 0.0),
 (8, 0.5734281209374412),
 (9, 0.3477494595366168),
 (10, 0.3477494595366168),
 (11, 0.37589586537942343),
 (12, 0.3477494595366168),
 (13, 0.07368644569494472),
 (14, 0.0),
 (15, 0.31241241420582044),
 (16, 0.40424669524898316),
 (17, 0.43342164858467935),
 (18, 0.43342164858467935),
 (19, 0.3644368560777935),
 (20, 0.0),
 (21, 0.0),
 (22, 0.8268475914255569),
 (23, 0.8268475914255569),
 (24, 0.13877042454412414),
 (25, 0.3642011832090597),
 (26, 0.43342164858467935),
 (27, 0.0),
 (28, 0.07176477993782433),
 (29, 0.07599085831013172),
 (30, 0.18729687732121572),
 (31, 0.07190863675978865),
 (32, 0.3477494595366168),
 (33, 0.38528903587421265),
 (34, 0.6921415423383728),
 (35, 0.07176477993782433),
 (36, 0.0),
 (37, 0.0),
 (38, 0.07176477993782433),
 (39, 0.0),
 (40, 0.3644368560777935),
 (41, 0.2462122581562543),
 (42, 0.27899812231420795),
 (43, 0.05

In [52]:
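def netFlix_recommendation(title, similarity=similarity):
    # Row position of the requested (cleaned) title
    index = indices[title]
    print(index)
    # Score every title against the requested one. Reading similarity[0] here
    # instead ranks everything against the first title in the dataset, which
    # is what the output recorded below shows.
    similarity_scores = list(enumerate(similarity[index]))
    # Sort by score, highest first, and keep the top 10
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[0:10]
    movieindices = [i[0] for i in similarity_scores]
    return data['Title'].iloc[movieindices]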

In [53]:
netFlix_recommendation("go live way")

1846


0                         unwel
68                          day
305                        alon
322      america next top model
406                         one
468    awak million dollar game
615            best leftov ever
694     black ink crew new york
720                 bling empir
843                buri bernard
Name: Title, dtype: object
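
Because every title has a similarity of 1.0 with itself, a ranking over a full row of the similarity matrix puts that row's own title at or near the top. The sketch below shows one way to look up the top-k neighbours while skipping the query row. The titles, the hand-written similarity matrix and the function name `top_k_similar` are all hypothetical and are not part of the notebook's data.

```python
def top_k_similar(title, titles, sim, k=3):
    """Return the k titles most similar to `title`, excluding the title itself."""
    query = titles.index(title)
    # Pair each other row index with its score against the query row
    scored = [(i, score) for i, score in enumerate(sim[query]) if i != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in scored[:k]]

# Toy data: hypothetical titles and a symmetric similarity matrix
titles = ["show a", "show b", "show c", "show d"]
sim = [
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.4, 0.7],
    [0.9, 0.4, 1.0, 0.1],
    [0.0, 0.7, 0.1, 1.0],
]

print(top_k_similar("show b", titles, sim, k=2))
```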

# Conclusion
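To work with textual data, first select the features you are interested in and drop rows with null values. Clean the text by removing punctuation such as `#` and `@` with regular expressions, and use **nltk** to remove stop words and stem the remaining words. Finally, use **cosine similarity** to find similar items in the data.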