**Recommendation systems are one of the most interesting applications in data science. Although many libraries, including scikit-learn, exist for implementing machine learning models, scikit-learn does not provide a ready-made recommender, so we have to build one ourselves. In this notebook, I create a content-based recommendation system for Netflix movies only, using the title, director, country, and description.**

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/netflix-shows/netflix_titles.csv


<h2>Let's import the necessary libraries</h2>

In [2]:
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings('ignore')

In [3]:
data=pd.read_csv("/kaggle/input/netflix-shows/netflix_titles.csv")

In [4]:
df=data.copy()
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


<h2>Data Wrangling</h2>

In [5]:
#shape of the dataset
df.shape

(8807, 12)

In [6]:
#check duplicates
df['show_id'].duplicated().sum()

0

In [7]:
#check missing values
df.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

<h4>Six columns have missing values. Since all six contain text data, imputing them is hard, so the simplest option is to drop every row that has a missing value.</h4>

In [8]:
#remove all the missing values
df.dropna(inplace=True)
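
Dropping rows on every column also discards movies whose only gap is in a column the recommender never uses, such as cast. Below is a minimal sketch of a narrower drop, assuming a small made-up frame rather than the Netflix data (the rows and the names `toy`, `used`, `kept`, and `dropped_all` are illustrative only).

```python
import pandas as pd

# Toy frame, invented for illustration only (not rows from the dataset)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, "Z"],
    "cast": [None, "P", "Q"],
    "country": ["US", "IN", "FR"],
    "description": ["d1", "d2", "d3"],
})

# Columns that actually feed the recommender
used = ["title", "director", "country", "description"]

# Drop a row only when one of the used columns is missing
kept = toy.dropna(subset=used)

# For comparison: drop a row when any column is missing
dropped_all = toy.dropna()

print(len(kept), len(dropped_all))
```

Whether the extra rows help depends on the data. The results shown in this notebook come from the full `dropna()`.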

<h4>Since we have removed rows, the index values are no longer consecutive, so it is good practice to reset the index.</h4>

In [9]:
#reset the index values (drop=True discards the old index instead of keeping it as a column)
df.reset_index(drop=True, inplace=True)

<h4>As mentioned earlier, let's extract only the movie data.</h4>

In [10]:
#copy the filtered rows so later in-place changes do not act on a slice of df
df1=df[df['type']=='Movie'].copy()

In [11]:
#reset index (drop=True avoids adding the old index as a new column)
df1.reset_index(drop=True, inplace=True)

<h2>Required Columns</h2>

In [12]:
newdf=df1[['title','director','country','description']].copy()
newdf.head()

Unnamed: 0,title,director,country,description
0,Sankofa,Haile Gerima,"United States, Ghana, Burkina Faso, United Kin...","On a photo shoot in Ghana, an American model s..."
1,The Starling,Theodore Melfi,United States,A woman adjusting to life after a loss contend...
2,Je Suis Karl,Christian Schwochow,"Germany, Czech Republic",After most of her family is murdered in a terr...
3,Jeans,S. Shankar,India,When the father of the man she loves insists t...
4,Grown Ups,Dennis Dugan,United States,Mourning the loss of their beloved junior high...


<h2>Cleaning Text Data</h2>

<h4>As a good practice, we remove non-letter characters, remove stop words, lemmatize (lemmatization reduces inflected words to their base dictionary form, e.g. "states" becomes "state"), and convert all letters to lower case. Lower-casing lets the tokens match NLTK's lower-case stop-word list and ensures a word is not counted twice because of capitalization (CountVectorizer also lower-cases by default).</h4>

In [13]:
features=[]

for i in range(newdf.shape[0]):
    features.append(" ".join(list(newdf.iloc[i].values)))
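
The loop above joins the columns of each row into one string. The same step can be sketched with a row-wise `apply`, shown here on a two-row toy frame built from values in the table above (the name `toy` is illustrative, and the description column is left out for brevity).

```python
import pandas as pd

# Two rows taken from the table above, without the description column
toy = pd.DataFrame({
    "title": ["Jeans", "Grown Ups"],
    "director": ["S. Shankar", "Dennis Dugan"],
    "country": ["India", "United States"],
})

# Join the string values of each row with single spaces
features = toy.apply(lambda row: " ".join(row), axis=1).tolist()

print(features)
```

This assumes every selected column holds strings, which is the case here once the missing values have been dropped.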

In [14]:
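lem=WordNetLemmatizer()
#build the stop-word set once instead of once per word
stop_words=set(stopwords.words('english'))
corpus=[]

for i in range(len(features)):
    #keep letters only, then lower-case and split into tokens
    review=re.sub('[^a-zA-Z]',' ',features[i])
    review=review.lower()
    review=review.split()
    #drop stop words and lemmatize the remaining tokens
    review=[lem.lemmatize(w) for w in review if w not in stop_words]
    review=' '.join(review)
    corpus.append(review)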
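
To see what the cleaning steps do to a single string, here is a minimal standard-library sketch. It assumes a tiny hand-written stop-word set (`STOP_WORDS`, hypothetical) in place of NLTK's English list and leaves out lemmatization, so it only approximates the cell above.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and"}

def clean(text):
    # Replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # Lower-case and split on whitespace
    tokens = letters_only.lower().split()
    # Keep only tokens that are not stop words
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean("The Starling, Theodore Melfi!"))
```

Because lemmatization is omitted, a plural such as "states" would stay as it is in this sketch, whereas the notebook's version turns it into "state".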

In [15]:
newdf['features']=corpus
newdf.head()

Unnamed: 0,title,director,country,description,features
0,Sankofa,Haile Gerima,"United States, Ghana, Burkina Faso, United Kin...","On a photo shoot in Ghana, an American model s...",sankofa haile gerima united state ghana burkin...
1,The Starling,Theodore Melfi,United States,A woman adjusting to life after a loss contend...,starling theodore melfi united state woman adj...
2,Je Suis Karl,Christian Schwochow,"Germany, Czech Republic",After most of her family is murdered in a terr...,je suis karl christian schwochow germany czech...
3,Jeans,S. Shankar,India,When the father of the man she loves insists t...,jean shankar india father man love insists twi...
4,Grown Ups,Dennis Dugan,United States,Mourning the loss of their beloved junior high...,grown ups dennis dugan united state mourning l...


<h2>Feature Count Matrix</h2>

<h4>I'm going to use a count vectorizer for this task. However, you can use TF-IDF as well.</h4>

In [16]:
cv=CountVectorizer()
cvdf=cv.fit_transform(newdf['features'])

In [17]:
#cvdf is a sparse matrix, so call toarray() to display it as a dense array
cvdf.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
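
The matrix above is mostly zeros because each movie uses only a few of the words in the whole vocabulary. A minimal pure-Python sketch of what a count matrix contains, assuming three made-up cleaned strings (this shows the idea, not scikit-learn's implementation):

```python
from collections import Counter

# Made-up cleaned feature strings, for illustration only
docs = ["united state india", "united state woman", "india india father"]

# Vocabulary: every distinct word, in sorted order (one column per word)
vocab = sorted({w for d in docs for w in d.split()})

# One row per document, one count per vocabulary word
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row is the vector that represents one movie. Note that `CountVectorizer` with default settings also ignores one-letter tokens, which this sketch does not do.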

<h2>Let's Calculate Cosine Similarity</h2>

<h4>We use cosine similarity to measure the similarity between movies (a linear kernel gives the same values when the vectors are L2-normalized, as TF-IDF vectors are by default in scikit-learn).</h4>

In [18]:
cs=cosine_similarity(cvdf)
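
For readers who prefer the TF-IDF route mentioned earlier, the sketch below checks that a linear kernel and cosine similarity agree on TF-IDF vectors. It assumes three made-up strings and default `TfidfVectorizer` settings. The names `docs`, `lk`, and `cs_tfidf` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up cleaned feature strings, for illustration only
docs = ["united state india", "united state woman", "india india father"]

# TF-IDF rows are L2-normalized by default
tfidf = TfidfVectorizer().fit_transform(docs)

# On unit-length rows the dot product equals the cosine similarity
lk = linear_kernel(tfidf)
cs_tfidf = cosine_similarity(tfidf)

same = np.allclose(lk, cs_tfidf)
print(same)
```

With raw counts, as used in this notebook, the rows are not unit length, so the two functions would generally give different numbers.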

In [19]:
cs

array([[1.        , 0.12166607, 0.04055536, ..., 0.12166607, 0.11572751,
        0.04287465],
       [0.12166607, 1.        , 0.05263158, ..., 0.15789474, 0.15018785,
        0.05564149],
       [0.04055536, 0.05263158, 1.        , ..., 0.        , 0.        ,
        0.05564149],
       ...,
       [0.12166607, 0.15789474, 0.        , ..., 1.        , 0.10012523,
        0.        ],
       [0.11572751, 0.15018785, 0.        , ..., 0.10012523, 1.        ,
        0.        ],
       [0.04287465, 0.05564149, 0.05564149, ..., 0.        , 0.        ,
        1.        ]])
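
Each entry of the matrix above is the dot product of two count vectors divided by the product of their lengths. A minimal sketch of that formula on two made-up vectors (it assumes neither vector is all zeros):

```python
import math

def cosine(u, v):
    # Dot product of the two vectors
    dot = sum(a * b for a, b in zip(u, v))
    # Euclidean length of each vector (assumed non-zero)
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up count vectors that share one word out of two each
a = [1, 1, 0]
b = [1, 0, 1]

print(cosine(a, a))  # a vector compared with itself: approximately 1
print(cosine(a, b))  # one shared word out of two: approximately 0.5
```

This is why the diagonal of the matrix is 1 and why, for non-negative counts, every value lies between 0 and 1.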

<h2>Recommendations</h2>

In [20]:
#let's write a function to get recommendations for given movie
def movie_rec(title):
    
    #index of the first movie whose title contains the given text (substring match)
    movie_index=newdf[[title in name for name in newdf["title"]]].index[0]
    
    #get similarity score and its index for given movie title
    similarity_score=list(enumerate(cs[movie_index]))
    
    #sorted similarity scores for given movie title (Descending order)
    similarity_score=sorted(similarity_score,key=lambda x:x[1],reverse=True)
    
    #extract top 10 similarity scores, skipping position 0 (normally the movie itself, with similarity 1)
    similarity_score=similarity_score[1:11]
    
    #extract index values of top 10 movies
    movie_indices=[idx[0] for idx in similarity_score]
    
    #return recommended movies with their index values
    return newdf['title'][movie_indices]

<h4>Let's check our function!</h4>

In [21]:
movie_rec('Jeans')

1894               Twins Mission
3424                      Bazaar
3444                     Bewafaa
1857              Kaake Da Viyah
2883    Oh Darling Yeh Hai India
1685                    Mahi NRI
2713                        Jaal
2365                       Dev.D
3436               Beiimaan Love
1016                   PhotoCopy
Name: title, dtype: object
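
`movie_rec` matches titles by substring and raises an error when nothing matches. The sketch below is one possible variant, not the function used above. It assumes an exact, case-insensitive title match, returns an empty list for unknown titles, and excludes the query movie by index rather than by position. The names `recommend`, `titles`, and `sim` and the toy similarity values are hypothetical.

```python
def recommend(title, titles, sim, top_n=10):
    # Map lower-cased titles to row indices (a later duplicate title overwrites an earlier one)
    lookup = {t.lower(): i for i, t in enumerate(titles)}
    idx = lookup.get(title.lower())
    if idx is None:
        return []  # unknown title: nothing to recommend
    # Rank every other movie by similarity to the query, highest first
    ranked = sorted(
        (j for j in range(len(titles)) if j != idx),
        key=lambda j: sim[idx][j],
        reverse=True,
    )
    return [titles[j] for j in ranked[:top_n]]

# Toy data, invented for illustration only
titles = ["Jeans", "Bazaar", "Grown Ups"]
sim = [
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

print(recommend("jeans", titles, sim, top_n=2))
print(recommend("Unknown", titles, sim))
```

With the notebook's objects, `titles` would correspond to `list(newdf['title'])` and `sim` to `cs`.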

<h2>According to the output of movie_rec('Jeans'), we can recommend these 10 movies to people who have watched 'Jeans'.</h2>