<a href="https://colab.research.google.com/github/Faisal-Al-Mamun/-Netflix-Show-Recommendation-System-using-NLP-Techniques/blob/main/Netflix_Show_Recommendation_System_using_NLP_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### <font color='#f78fb3'> Netflix Show Recommendation System || Word2Vec - Google Pre-Trained Model - Cosine Similarity <br> </font>  
# <font color='#3dc1d3'>  
1.  Preprocess data
2.  Transfer learning, using Google's pre-trained word vectors
3.  Create Word2Vec Model
4.  Content-based recommendation system: find what to watch next based on a movie/show you have already watched <br>



### Loading Data on Google Colab from Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp -r "/content/drive/MyDrive/Colab Notebooks/Dataset/netflix_titles.zip" "/content/"

In [3]:
!unzip 'netflix_titles.zip'

Archive:  netflix_titles.zip
  inflating: netflix_titles.csv      


The dataset was downloaded from Kaggle.
Link: https://www.kaggle.com/shivamb/netflix-shows

### Importing Libraries

In [4]:
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
import re
import string
import random
from PIL import Image
import requests
from io import BytesIO
from sklearn.metrics.pairwise import cosine_similarity
!pip install gensim #Install gensim, the NLP library used to load w2v embeddings (the code below uses the gensim 3.x API)
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from matplotlib import pyplot
from gensim.models import KeyedVectors
import warnings  
warnings.filterwarnings(action='ignore',category=UserWarning,module='gensim')  
warnings.filterwarnings(action='ignore',category=FutureWarning,module='gensim')  

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Explore dataset

In [6]:
df = pd.read_csv("netflix_titles.csv")
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [7]:
df['description'].count()

8807

In [8]:
df['title'][1000]

'Wild Dog'

In [9]:
df['description'][1000]

'A brash but brilliant Indian intelligence agent leads a covert operation to nab the mastermind behind a series of attacks threatening national security.'

Only show columns of interest

In [10]:
df2 = df[["title", "description","listed_in"]]
df2.head(10)

Unnamed: 0,title,description,listed_in
0,Dick Johnson Is Dead,"As her father nears the end of his life, filmm...",Documentaries
1,Blood & Water,"After crossing paths at a party, a Cape Town t...","International TV Shows, TV Dramas, TV Mysteries"
2,Ganglands,To protect his family from a powerful drug lor...,"Crime TV Shows, International TV Shows, TV Act..."
3,Jailbirds New Orleans,"Feuds, flirtations and toilet talk go down amo...","Docuseries, Reality TV"
4,Kota Factory,In a city of coaching centers known to train I...,"International TV Shows, Romantic TV Shows, TV ..."
5,Midnight Mass,The arrival of a charismatic young priest brin...,"TV Dramas, TV Horror, TV Mysteries"
6,My Little Pony: A New Generation,Equestria's divided. But a bright-eyed hero be...,Children & Family Movies
7,Sankofa,"On a photo shoot in Ghana, an American model s...","Dramas, Independent Movies, International Movies"
8,The Great British Baking Show,A talented batch of amateur bakers face off in...,"British TV Shows, Reality TV"
9,The Starling,A woman adjusting to life after a loss contend...,"Comedies, Dramas"


In [11]:
df.type.unique()

array(['Movie', 'TV Show'], dtype=object)

Check each column for null values; the description column has none, so no rows need to be dropped.

In [12]:
df.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

Ensure the descriptions contain only strings, not float values (for example NaN), by casting the column with pandas astype(str).

In [13]:
df['description'] = df['description'].astype(str)

### Preprocessing (cleaning) the Descriptions.

In [14]:
def non_ascii(s):
  #keep only ASCII characters
  return "".join(i for i in s if ord(i)<128)

def lower(text):
  return text.lower()

def stop_words(text):
  text = text.split() #split into tokens to find stop words
  stops = set(stopwords.words("english"))
  text = [w for w in text if w not in stops]
  text = " ".join(text) #join back into a string after removing stop words
  return text

def clean_html(text):
  html = re.compile('<.*?>') #regex matching HTML tags
  return html.sub(r'',text)

def punct(text):
  token=RegexpTokenizer(r'\w+') #regex keeping only runs of word characters, which drops punctuation
  text = token.tokenize(text)
  text= " ".join(text)
  return text

A new column is created to store the cleaned, preprocessed descriptions.

In [15]:
df['new_desc'] = df['description'].apply(non_ascii)
df['new_desc'] = df.new_desc.apply(func = lower)
df['new_desc'] = df.new_desc.apply(func = stop_words)
#strip HTML tags before punct, which would otherwise remove the '<' and '>' that clean_html looks for
df['new_desc'] = df.new_desc.apply(func = clean_html)
df['new_desc'] = df.new_desc.apply(func = punct)
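
To see what the cleaning steps do to a single description, here is a minimal, dependency-free sketch. It is not the notebook's code: the `clean` helper, the tiny `STOPS` set (standing in for NLTK's full English stop-word list), and the sample sentence are all assumptions made for illustration, and `re.findall(r"\w+", ...)` stands in for `RegexpTokenizer(r'\w+')`. The sketch strips HTML tags before punctuation is removed.

```python
import re

# Assumption: a tiny illustrative stop-word list; the notebook uses NLTK's English list.
STOPS = {"a", "an", "the", "is", "of", "to", "and", "in", "his", "her"}

def clean(text, stops=STOPS):
    """Hypothetical helper mirroring the notebook's cleaning steps on one string."""
    text = "".join(ch for ch in text if ord(ch) < 128)    # drop non-ASCII characters
    text = text.lower()                                    # lowercase
    text = re.sub(r"<.*?>", "", text)                      # strip HTML tags
    tokens = [w for w in text.split() if w not in stops]   # remove stop words
    return " ".join(re.findall(r"\w+", " ".join(tokens)))  # keep word characters only

sample = "To protect his family, a <b>skilled</b> thief returns."
cleaned = clean(sample)
print(cleaned)  # protect family skilled thief returns
```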

### Start Work on the Word2Vec Model and Transfer Learning

<font color='#3dc1d3'>Each cleaned description is split into words and stored in a list called ‘universe’; universe is the corpus used to train our word2vec model.<br>Word2vec takes a text corpus as input and produces word vectors as output. It first constructs a vocabulary from the training text and then learns a vector representation for each word. The resulting word vectors can be used as features in many natural language processing and machine learning applications.<br><font color='#3dc1d3'>Word tokenization: break each description up into individual words.

In [16]:
universe = []
for words in df['new_desc']:
  universe.append(words.split())
  #appends split-word element to the end of the list - universe 
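
The vocabulary-construction step can be sketched roughly as follows. This is a simplified illustration, not gensim's implementation: the toy `corpus` is made up, and `min_count = 2` is taken from the model settings used later in this notebook. Words that appear fewer than `min_count` times are left out of the vocabulary.

```python
from collections import Counter

# Assumption: a toy tokenized corpus in the same shape as `universe` (a list of word lists).
corpus = [
    ["father", "nears", "end", "life"],
    ["teen", "sets", "prove", "life"],
    ["protect", "family", "father"],
]
min_count = 2  # words seen fewer times than this are dropped

# Count every word across all descriptions, then keep the frequent ones.
counts = Counter(word for sentence in corpus for word in sentence)
vocab = sorted(w for w, c in counts.items() if c >= min_count)
print(vocab)  # ['father', 'life']
```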

In [17]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,new_desc
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",father nears end life filmmaker kirsten johnso...
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",crossing paths party cape town teen sets prove...
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,protect family powerful drug lord skilled thie...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",feuds flirtations toilet talk go among incarce...
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,city coaching centers known train indias fines...


<font color='#3dc1d3'>Use the pre-trained word2vec Google News model (GoogleNews-vectors-negative300) with the gensim Python library.<br>First download the pre-trained GoogleNews vectors (a 1.5 GB archive).<br> Loading them in gensim can take some time.

In [18]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2021-12-27 18:12:05--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.136.56
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.136.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘/root/input/GoogleNews-vectors-negative300.bin.gz’


2021-12-27 18:13:47 (15.5 MB/s) - ‘/root/input/GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [19]:
EMBEDDING_FILE = '/root/input/GoogleNews-vectors-negative300.bin.gz' #path to the downloaded GoogleNews-vectors-negative300 archive

### Training on the Corpus with the Google Pre-trained Model

In [20]:
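#gensim 3.x API: in gensim >= 4.0 the `size` argument is named `vector_size`
pretrained_model = Word2Vec(size = 300, window=5, min_count = 2, workers=-1)
pretrained_model.build_vocab(universe)
#words shared with the GoogleNews file adopt its vectors; other words keep their initial vectors
pretrained_model.intersect_word2vec_format(EMBEDDING_FILE, lockf=1.0, binary = True)
#note: workers=-1 starts no worker threads, so train() processes nothing (it returns (0, 0));
#use a positive integer such as workers=4 to actually fine-tune the vectors on the corpus
pretrained_model.train(universe, total_examples=pretrained_model.corpus_count, epochs = 7)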

(0, 0)
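
Conceptually, the intersection step keeps the corpus vocabulary fixed and copies in pre-trained vectors only for words that appear in both places. The sketch below illustrates that idea with plain dictionaries; the `intersect` function, the three-dimensional toy vectors, and the word lists are hypothetical and are not gensim's implementation.

```python
import random

def intersect(model_vectors, pretrained_vectors):
    """Overwrite vectors for shared words; leave the other words untouched."""
    for word in model_vectors:
        if word in pretrained_vectors:
            model_vectors[word] = list(pretrained_vectors[word])
    return model_vectors

random.seed(0)
# Assumption: the corpus vocabulary starts with small random vectors.
model_vectors = {
    w: [random.uniform(-0.5, 0.5) for _ in range(3)] for w in ["prince", "netflix"]
}
# Assumption: a toy stand-in for the pre-trained file.
pretrained = {"prince": [0.1, 0.2, 0.3], "banana": [0.9, 0.9, 0.9]}

before_netflix = list(model_vectors["netflix"])
intersect(model_vectors, pretrained)

print(model_vectors["prince"])     # adopted from the pre-trained vectors
print("banana" in model_vectors)   # False: no new words are added
```

Words that exist only in the pre-trained file ("banana" here) are ignored, and words that exist only in the corpus ("netflix" here) keep whatever vector they already had.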

Try out the similarity between words after intersecting the pre-trained vectors with the vocabulary of our corpus (universe).

In [21]:
pretrained_model.wv.most_similar(positive=["prince"])
#parameter positive: a list of keys that contribute positively to the similarity

[('princess', 0.6986509561538696),
 ('monarch', 0.668681263923645),
 ('royal', 0.6433806419372559),
 ('king', 0.6159993410110474),
 ('throne', 0.5817439556121826),
 ('palace', 0.5728127956390381),
 ('queen', 0.5534094572067261),
 ('nobleman', 0.5447623133659363),
 ('sultan', 0.5442555546760559),
 ('knight', 0.5390364527702332)]

In [22]:
pretrained_model.wv.most_similar(positive=["hand"])

[('hands', 0.6113166809082031),
 ('arm', 0.4354873597621918),
 ('thumb', 0.4265793561935425),
 ('handed', 0.4130321741104126),
 ('paw', 0.40222683548927307),
 ('fist', 0.395877480506897),
 ('side', 0.37118443846702576),
 ('cheek', 0.3695729672908783),
 ('chest', 0.3638487458229065),
 ('nose', 0.3625941276550293)]

In [23]:
pretrained_model.wv.similarity("king","queen")

0.6510957

In [24]:
pretrained_model.wv.similarity("hand","queen")

0.05110423
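
The scores above are cosine similarities between word vectors. As a small worked sketch (the `cosine` helper and the two-dimensional vectors are illustrative assumptions, not part of the notebook), cosine similarity is the dot product of two vectors divided by the product of their lengths:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine([1.0, 0.0], [1.0, 0.0])        # identical direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # unrelated direction
partial = cosine([1.0, 1.0], [1.0, 0.0])     # 45 degrees apart

print(same, orthogonal, round(partial, 4))  # 1.0 0.0 0.7071
```

Values near 1 mean the vectors point the same way, and values near 0 mean they are unrelated, which matches the gap between the "king"/"queen" and "hand"/"queen" scores.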

### Netflix Show Recommendation Model


The function vectorize() computes the average word2vec vector for each Netflix description.

In [25]:
def vectorize(x):
  #builds the global list `embeddings`: one averaged word2vec vector per cleaned description in x
  global embeddings
  embeddings = []
  for line in x['new_desc']: #for each cleaned description
    #keep only words in the model vocabulary (gensim 3.x; use wv.key_to_index in gensim >= 4.0)
    words = [word for word in line.split() if word in pretrained_model.wv.vocab]
    if words:
      w2v = np.mean([pretrained_model.wv[word] for word in words], axis=0)
    else:
      #no known words: use a zero vector so embeddings stay aligned with the DataFrame rows
      w2v = np.zeros(pretrained_model.wv.vector_size)
    embeddings.append(w2v)
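
The averaging idea can be checked by hand on a toy example. The sketch below is written in plain Python lists so it runs without the model; `average_vector`, `toy_vectors`, and the zero-vector fallback for descriptions with no known words are assumptions made for illustration.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of the known tokens; fall back to zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]

# Assumption: two-dimensional toy vectors standing in for 300-dimensional word2vec vectors.
toy_vectors = {"king": [1.0, 0.0], "queen": [0.0, 1.0]}

mixed = average_vector(["king", "queen", "zzz"], toy_vectors, 2)  # "zzz" is unknown and skipped
empty = average_vector(["zzz"], toy_vectors, 2)

print(mixed)  # [0.5, 0.5]
print(empty)  # [0.0, 0.0]
```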

<font color='#3dc1d3'>Define the function 'netflix_because_you_watched' to find the top 5 most similar Netflix shows to recommend, based on one you previously watched.<br>Reverse index: the embeddings are stored by row position, so we build a Series that maps each title to its row index and use it to look up a show's embedding by title.<br> The embeddings come from the descriptions of the Netflix shows, but we want to search and match by title. 

In [26]:
def netflix_because_you_watched(title):
  vectorize(df) #fills the global `embeddings` list from df['new_desc']
  cosine_similarities = cosine_similarity(embeddings,embeddings)
  netflix_shows = df[['title']] #titles only, in the same row order as df
  #reverse index: map each title to its row position (first occurrence if a title repeats)
  indices = pd.Series(df.index, index = df['title'])
  indices = indices[~indices.index.duplicated(keep='first')]
  ix = indices[title]
  cosine_sim = list(enumerate(cosine_similarities[ix]))
  #enumerate pairs each similarity score with its row position
  cosine_sim = sorted(cosine_sim, key = lambda x: x[1], reverse = True)
  cosine_sim = cosine_sim[1:6] #skip the first entry (the show itself) and keep the top 5
  netflix_index = [i[0] for i in cosine_sim]
  watch_next = netflix_shows.iloc[netflix_index]
  for index, row in watch_next.iterrows():
    print(row['title'])
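
The ranking logic can be followed on a toy example. This is a minimal sketch under stated assumptions: the titles "A" to "D", their two-dimensional embeddings, and the `top_n` helper are hypothetical, and the sketch compares one row against all others instead of building the full similarity matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def top_n(title, titles, embeddings, n=2):
    """Return the n titles whose embeddings are most similar to the given title's."""
    ix = titles.index(title)  # reverse lookup: title -> row position
    scores = [(i, cosine(embeddings[ix], e))
              for i, e in enumerate(embeddings) if i != ix]  # exclude the query itself
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in scores[:n]]

# Assumption: toy titles and description embeddings.
titles = ["A", "B", "C", "D"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]

recommended = top_n("A", titles, embeddings)
print(recommended)  # ['B', 'D']
```

Excluding the query by its row position, as done here, does not depend on the watched show sorting first, which could matter if two shows had identical descriptions.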

### Exploring Netflix Recommendations

In [27]:
netflix_because_you_watched("Apaches")

Longmire
Hide and Seek
In Family I Trust
Notes for My Son
Imperial Dreams


In [28]:
netflix_because_you_watched("Friends")

Big Mouth
LEGO Friends: The Power of Friendship
Why Are You Like This
A Haunted House
The Bachelorette


In [29]:
netflix_because_you_watched("Transformers Prime")

Satria Heroes: Revenge of the Darkness
Transformers: Robots in Disguise
The Shannara Chronicles
Trollhunters: Rise of the Titans
Transformers: War For Cybertron Trilogy
