## Netflix Recommender System

In [1]:
!pip install streamlit

Collecting streamlit
  Using cached streamlit-1.10.0-py2.py3-none-any.whl (9.1 MB)
Collecting pyarrow
  Downloading pyarrow-8.0.0-cp39-cp39-win_amd64.whl (17.9 MB)
Collecting pympler>=0.9
  Using cached Pympler-1.0.1-py3-none-any.whl (164 kB)
Collecting pydeck>=0.1.dev5
  Using cached pydeck-0.7.1-py2.py3-none-any.whl (4.3 MB)
Collecting rich
  Using cached rich-12.5.1-py3-none-any.whl (235 kB)
Collecting tzlocal
  Using cached tzlocal-4.2-py3-none-any.whl (19 kB)
Collecting blinker
  Using cached blinker-1.4.tar.gz (111 kB)
Collecting semver
  Using cached semver-2.13.0-py2.py3-none-any.whl (12 kB)
Collecting validators
  Using cached validators-0.20.0.tar.gz (30 kB)
Collecting gitpython!=3.1.19
  Using cached GitPython-3.1.27-py3-none-any.whl (181 kB)
Collecting altair>=3.2.0
  Using cached altair-4.2.0-py3-none-any.whl (812 kB)
Collecting gitdb<5,>=4.0.1
  Using cached gitdb-4.0.9-py3-none-any.whl (63 kB)
Collecting smmap<6,>=3.0.1
  Using cached smmap-5.0.0-py3-none-any.whl (24 kB)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
credits_df=pd.read_csv("credits.csv")
titles_df=pd.read_csv("titles.csv")

print(f"The credits data has {credits_df.shape[0]} rows and {credits_df.shape[1]} columns")
print(f"The titles data has {titles_df.shape[0]} rows and {titles_df.shape[1]} columns")

The credits data has 77213 rows and 5 columns
The titles data has 5806 rows and 15 columns


In [3]:
credits_df.sample(5)

Unnamed: 0,person_id,id,name,character,role
21452,419159,tm171891,Anil Dhawan,Savitri's Husband,ACTOR
30448,1176985,tm244174,Daehyun Kim,Bulk Employee,ACTOR
1355,454597,tm147829,Gary Frank,Neil Harrison,ACTOR
50367,1116280,tm469193,Abdou Balde,Cheikh,ACTOR
19126,6957,tm178211,Isla Fisher,Melanie Ralston,ACTOR


In [4]:
titles_df.sample(5)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
3275,ts88577,Hollywood,SHOW,A group of aspiring actors and filmmakers in p...,2020,TV-MA,50,"['drama', 'history']",['US'],1.0,tt9827854,7.5,35067.0,15.085,7.6
2944,ts57281,All Hail King Julien: Exiled,SHOW,"Julien's been dethroned, but loyal friends and...",2017,TV-Y7,23,"['animation', 'comedy', 'family', 'fantasy']",['US'],1.0,tt6865906,7.5,559.0,6.454,8.5
1776,tm424879,Road To High & Low,MOVIE,"Cobra, Yamato and Noboru have been friends sin...",2016,,95,"['action', 'drama']",['JP'],,tt5659164,6.7,177.0,4.283,6.0
3932,ts89583,Trailer Park Boys: The Animated Series,SHOW,Nova Scotia’s favorite miscreants have always ...,2019,TV-MA,25,"['animation', 'comedy']",['CA'],2.0,tt9814900,7.6,2854.0,17.146,7.3
3969,tm518467,Dragon Rider,MOVIE,A young silver dragon teams up with a mountain...,2020,,100,"['family', 'fantasy', 'animation', 'comedy', '...","['ES', 'BE', 'DE']",,tt7080422,5.6,1522.0,15.265,7.3


## Data Wrangling

A visual inspection of the credits data shows that each title `id` appears on many rows, one for every actor or director credited on that title.

To make the merge with the titles data one-to-one, I will drop duplicate rows on `id`, which keeps only the first credited name for each title.
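
Dropping duplicates on `id` keeps a single name per title and discards the rest of the cast and crew. As a minimal sketch on made-up rows (the `toy_credits` frame and its values are hypothetical, not taken from the dataset), the snippet below contrasts that behaviour with one possible alternative, joining every name per `id`, which this notebook does not do:

```python
import pandas as pd

# Hypothetical toy data standing in for credits.csv
toy_credits = pd.DataFrame({
    "id": ["tm1", "tm1", "tm2"],
    "name": ["Actor A", "Director B", "Actor C"],
})

# Approach used in this notebook: keep the first name seen for each id
first_name = toy_credits.drop_duplicates(subset="id")

# Alternative (an assumption, not the notebook's method): keep every name
all_names = toy_credits.groupby("id", as_index=False)["name"].agg(", ".join)

print(first_name)
print(all_names)
```

The second form still yields one row per `id`, so it would merge the same way, at the cost of longer `name` strings.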

In [5]:
credits_df['id'].duplicated().sum()

71779

In [6]:
len(credits_df.drop_duplicates(subset='id'))

5434

In [7]:
new_credits_df=credits_df[['id','name']].drop_duplicates(subset='id')

In [8]:
new_credits_df.sample(5)

Unnamed: 0,id,name
51372,ts256309,Mark Bonanno
16008,tm28171,Bill Burr
75075,tm1102226,Alessandro Preziosi
19931,tm152843,Mike Smith
24373,ts41948,Afrika Bambaataa


In [9]:
#merging the titles and new credits data together
final_df=pd.merge(titles_df,new_credits_df ,on='id')
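
`pd.merge` performs an inner join by default, so any title without a matching credits row is dropped. That is consistent with the counts in this notebook: 5806 titles go in, and the merged frame has 5434 rows, the number of unique ids in the credits data. The sketch below uses hypothetical toy frames (`toy_titles`, `toy_names`) to show how a left join with `indicator=True` could be used to list the titles an inner join would lose:

```python
import pandas as pd

# Hypothetical toy frames; "tm3" has no credits row
toy_titles = pd.DataFrame({"id": ["tm1", "tm2", "tm3"], "title": ["A", "B", "C"]})
toy_names = pd.DataFrame({"id": ["tm1", "tm2"], "name": ["Actor A", "Actor C"]})

# Default inner join: only ids present in both frames survive
inner = pd.merge(toy_titles, toy_names, on="id")

# Left join keeps every title; the _merge column flags rows with no match
left = pd.merge(toy_titles, toy_names, on="id", how="left", indicator=True)
unmatched = left.loc[left["_merge"] == "left_only", "id"].tolist()

print(len(inner), len(left), unmatched)
```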

In [10]:
final_df.sample(5)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,name
2329,tm408890,Yucatán,MOVIE,Two white collar thieves compete fiercely agai...,2018,PG,129,"['comedy', 'romance', 'european']",['ES'],,tt6502956,5.5,2065.0,5.943,5.6,Luis Tosar
1554,tm230426,"Liar, Liar, Vampire",MOVIE,When ordinary boy Davis suddenly becomes famou...,2015,G,66,"['comedy', 'family']",['US'],,tt4448304,5.8,940.0,9.114,7.1,Rahart Adams
4178,tm918026,Sam Jay: 3 in the Morning,MOVIE,"Comedian and ""Saturday Night Live"" writer Sam ...",2020,,64,['comedy'],[],,tt12689876,6.4,504.0,3.111,6.5,Sam Jay
2999,tm460198,Catch.er,MOVIE,When an ambitious career woman is found murder...,2017,,81,"['crime', 'drama']",['NG'],,tt8607728,4.5,30.0,0.84,6.0,Beverly Naya
4204,tm828318,Axone,MOVIE,This bittersweet comedy follows immigrants in ...,2019,,96,"['drama', 'comedy']",['IN'],,tt8747548,6.9,2182.0,2.268,5.6,Sayani Gupta


In [11]:
print(f"The final merged data has {final_df.shape[0]} rows and {final_df.shape[1]} columns")

The final merged data has 5434 rows and 16 columns


In [12]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5434 entries, 0 to 5433
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5434 non-null   object 
 1   title                 5433 non-null   object 
 2   type                  5434 non-null   object 
 3   description           5424 non-null   object 
 4   release_year          5434 non-null   int64  
 5   age_certification     2951 non-null   object 
 6   runtime               5434 non-null   int64  
 7   genres                5434 non-null   object 
 8   production_countries  5434 non-null   object 
 9   seasons               1776 non-null   float64
 10  imdb_id               5024 non-null   object 
 11  imdb_score            4963 non-null   float64
 12  imdb_votes            4949 non-null   float64
 13  tmdb_popularity       5432 non-null   float64
 14  tmdb_score            5259 non-null   float64
 15  name                 

In [13]:

# Summarise the columns that contain missing values
null_counts = final_df.isnull().sum()
null_counts = null_counts[null_counts > 0]

null_df = pd.DataFrame({
    'column': null_counts.index,
    'null_count': null_counts.values,
    # share of rows that are null, rounded to the nearest whole percent
    'null_percent(%)': (null_counts * 100 / len(final_df)).round().astype(int).values,
})

null_df

Unnamed: 0,column,null_count,null_percent(%)
0,title,1,0
1,description,10,0
2,age_certification,2483,46
3,seasons,3658,67
4,imdb_id,410,8
5,imdb_score,471,9
6,imdb_votes,485,9
7,tmdb_popularity,2,0
8,tmdb_score,175,3


In this section, we review the columns that contain missing values, with the goal of removing every null from the dataset.


In [14]:
final_df.query("type == 'MOVIE'").shape[0]

3658

In [15]:
# Check the source of the nulls: percentage of movies with a missing seasons value
final_df.query("type == 'MOVIE'")['seasons'].isnull().mean() * 100

100.0

Every movie in the dataset has a null `seasons` value, which is expected because movies are not split into seasons. The 3658 movies account for all 3658 nulls in that column.

We will fill these missing values with 0 instead of dropping the rows.

In [16]:
final_df['seasons']=final_df['seasons'].fillna(0)

In [17]:
#filling nulls in the age certification column with NR which means Not Rated
final_df['age_certification']=final_df['age_certification'].fillna("NR")

In [18]:
# dropping imdb_id, which is not used in the rest of the notebook
final_df.drop(columns=['imdb_id'],inplace=True)

Filling missing values in the numerical columns with the column mean

In [19]:
cols=['imdb_score','imdb_votes','tmdb_popularity','tmdb_score']
for col in cols:
    mean=final_df[col].mean().round(2)
    final_df[col]=final_df[col].fillna(mean)
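
The loop above can also be written without iterating, because `DataFrame.fillna` accepts a Series keyed by column name. The following is a small sketch on a hypothetical toy frame (`toy` and its values are invented for illustration), not a change to the notebook's data:

```python
import pandas as pd

# Hypothetical toy frame with one gap in each numeric column
toy = pd.DataFrame({
    "imdb_score": [6.0, None, 8.0],
    "tmdb_score": [7.0, 5.0, None],
})
num_cols = ["imdb_score", "tmdb_score"]

# One mean per column, rounded to two decimals as in the loop above
col_means = toy[num_cols].mean().round(2)

# fillna matches the Series index against the column names
toy[num_cols] = toy[num_cols].fillna(col_means)

print(toy)
```

In the `describe()` output further down, `imdb_votes` has a mean of about 24,676 but a median of 3,320.5, so the column is heavily skewed. A median fill might be more representative for that column; that is a judgment call and not what this notebook does.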

Filling the remaining missing values with an empty string

In [20]:
final_df=final_df.fillna('')

In [21]:
final_df.duplicated().sum()

0

In [22]:
final_df=final_df.drop_duplicates()

In [23]:
print(f"The final cleaned data has {final_df.shape[0]} rows and {final_df.shape[1]} columns")

The final cleaned data has 5434 rows and 15 columns


In [24]:
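def clean(col):
    """Flatten stringified lists in final_df[col], e.g. "['drama', 'history']" -> "drama history"."""
    new_column=[]

    for line in final_df[col]:
        # strip the brackets, quotes and commas; an empty list "[]" is left unchanged
        cols=str(line).replace("['",'').replace("']",'').replace("'",'').replace(',','').strip()
        new_column.append(cols)

    final_df[col]=new_column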

In [25]:
clean('genres')
clean('production_countries')
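
The chained `replace` calls work on the string form of each list, which is why an empty list survives as the literal text `[]`. One alternative, sketched here under the assumption that every cell holds a Python-style list literal, is to parse the string with `ast.literal_eval` and join the items. The helper name `list_string_to_text` is hypothetical and is not used elsewhere in this notebook:

```python
import ast

def list_string_to_text(value):
    """Turn a stringified list such as "['drama', 'history']" into "drama history"."""
    try:
        items = ast.literal_eval(str(value))
    except (ValueError, SyntaxError):
        # Not a valid Python literal: return the text unchanged
        return str(value).strip()
    if not isinstance(items, (list, tuple)):
        return str(value).strip()
    # An empty list becomes an empty string rather than "[]"
    return " ".join(str(item) for item in items)

print(list_string_to_text("['ES', 'BE', 'DE']"))
print(repr(list_string_to_text("[]")))
```

It could be applied with something like `final_df['genres'].apply(list_string_to_text)` in place of `clean('genres')`, although that would change the `[]` values shown in the sample output.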

In [26]:
final_df.sample(5)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,name
4309,tm475638,Luccas Neto em: Uma Babá Muito Esquisita,MOVIE,Luccas and Gi forgot about Mother's Day and no...,2019,NR,75,comedy family,BR,0.0,3.0,36.0,1.771,7.1,Luccas Neto
4208,tm845594,Irandam Ulagaporin Kadaisi Gundu,MOVIE,A hardworking lorry driver who works at an old...,2019,NR,141,drama thriller,IN,0.0,7.2,960.0,2.243,6.3,Dinesh Ravi
2588,tm283559,The Art of Loving: Story of Michalina Wislocka,MOVIE,"Michalina Wislocka, the most famous and recogn...",2017,NR,121,drama comedy romance european,PL,0.0,7.1,3690.0,6.817,6.6,Magdalena Boczarska
2877,tm353791,Seven Sundays,MOVIE,The Bonifacio siblings reunite when they find ...,2017,NR,128,drama comedy,PH,0.0,7.7,456.0,3.045,8.5,Aga Muhlach
4262,ts314529,캐치! 티니핑,SHOW,,2020,TV-Y,12,[],[],2.0,6.53,24675.55,1.497,6.79,이지현


In [27]:
final_df.describe()

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,5434.0,5434.0,5434.0,5434.0,5434.0,5434.0,5434.0
mean,2015.929886,79.869157,0.730217,6.528051,24675.55,23.344594,6.792427
std,7.410901,38.979547,1.894051,1.101597,85309.78,70.400573,1.119633
min,1953.0,0.0,0.0,1.5,5.0,0.6,1.0
25%,2015.0,46.0,0.0,5.9,716.0,3.33125,6.1
50%,2018.0,87.0,0.0,6.53,3320.5,7.861,6.8
75%,2020.0,106.0,1.0,7.3,21508.5,18.57125,7.5
max,2022.0,251.0,42.0,9.5,2268288.0,1823.374,10.0


## Data Visualization

In [28]:
"""import seaborn as sns
for col in final_df.columns:
    if final_df[col].dtype=='int64'or final_df[col].dtype=='float64':
        plt.figure(figsize=[12,6])
        sns.histplot(final_df[col])
        plt.show()"""

"import seaborn as sns\nfor col in final_df.columns:\n    if final_df[col].dtype=='int64'or final_df[col].dtype=='float64':\n        plt.figure(figsize=[12,6])\n        sns.histplot(final_df[col])\n        plt.show()"

In [29]:
model_df=final_df.copy()

# combine the text features into a single 'tags' string per title
model_df['tags']=model_df['genres'] + " " +model_df['description']+ ' '+ model_df['production_countries'] + " " +model_df['type']+" " +model_df['age_certification']+ ' '+ model_df['name']

In [30]:
model_df.sample(5)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,name,tags
4882,ts317078,The Journalist,SHOW,A journalist known as the maverick of news med...,2022,TV-14,51,drama thriller,JP,1.0,7.0,542.0,10.647,8.1,Ryoko Yonekura,drama thriller A journalist known as the maver...
2590,tm233345,Love Beats Rhymes,MOVIE,A young woman dreams of making it big in the w...,2017,R,105,drama,US,0.0,5.4,1057.0,9.665,5.7,Azealia Banks,drama A young woman dreams of making it big in...
302,ts21197,Merlin,SHOW,"The unlikely friendship between Merlin, a youn...",2008,TV-PG,44,action scifi drama fantasy european,GB,5.0,7.9,80138.0,123.15,7.9,Colin Morgan,action scifi drama fantasy european The unlike...
4096,tm469966,The Unknown Saint,MOVIE,"Moments before his capture by police, a thief ...",2020,NR,100,drama comedy crime,FR QA MA,0.0,6.5,1092.0,2.829,6.4,Younes Bouab,drama comedy crime Moments before his capture ...
26,tm94651,Dostana,MOVIE,Vijay and Ravi are best friends (hence the nam...,1980,NR,161,drama comedy romance action crime,IN,0.0,2.1,25.0,3.46,4.9,Amitabh Bachchan,drama comedy romance action crime Vijay and Ra...


In [31]:
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem(text):
    """Return text with every whitespace-separated word reduced to its Porter stem."""
    return ' '.join(ps.stem(word) for word in text.split())

In [32]:
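# .copy() avoids pandas' SettingWithCopyWarning when assigning columns on a slice
new_df = model_df[['id','title','tags']].copy()
new_df['tags'] = new_df['tags'].str.lower()
new_df['title'] = new_df['title'].str.lower()
# preview of the stemmed tags; the result is displayed but not assigned back,
# so new_df['tags'] itself stays unstemmed
new_df['tags'].apply(stem)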

0       crime drama a mental unstabl vietnam war veter...
1       comedi fantasi king arthur, accompani by hi sq...
2       comedi brian cohen is an averag young jewish m...
3       horror 12-year-old regan macneil begin to adap...
4       comedi european a british sketch comedi seri w...
                              ...                        
5429    comedi three women with total differ live acci...
5430    romanc drama a beauti love stori that can happ...
5431    music document rise star edis' career journey ...
5432    famili drama a man from nigeria return to hi f...
5433    action thriller a famili face destruct in a lo...
Name: tags, Length: 5434, dtype: object

In [33]:
new_df.sample()

Unnamed: 0,id,title,tags
5186,tm1027297,time,"drama comedy once famous for his quick blade, ..."


In [65]:
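from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# bag-of-words counts over the 5000 most frequent terms, English stop words removed
cvect = CountVectorizer(max_features = 5000, stop_words = 'english')
vectors = cvect.fit_transform(new_df['tags']).toarray()

# cosine similarity between every pair of titles (one row and one column per title)
similarity = cosine_similarity(vectors)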
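
To make the similarity score concrete, here is a minimal pure-Python sketch of cosine similarity over word counts. The `cosine` helper and the `toy_tags` strings are hypothetical and only illustrate the idea. Unlike `CountVectorizer`, the sketch splits on whitespace only: it does not lowercase, remove stop words, ignore one-character tokens, or cap the vocabulary.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the words of text_a (a Counter returns 0 for missing words)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tag strings in the same spirit as the 'tags' column
toy_tags = {
    "show a": "drama crime us show",
    "show b": "drama crime gb show",
    "movie c": "comedy family br movie",
}

print(cosine(toy_tags["show a"], toy_tags["show b"]))   # three of four words shared
print(cosine(toy_tags["show a"], toy_tags["movie c"]))  # no words shared
```

Two titles score close to 1 when their tags use the same words in similar proportions, and 0 when they share no words at all.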

In [66]:
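def recommend(movie):
    """Print the 20 titles whose tags are most similar to `movie`."""
    movie = movie.lower()
    matches = np.flatnonzero((new_df['title'] == movie).to_numpy())
    if len(matches) == 0:
        print(f"'{movie}' was not found in the dataset")
        return
    movie_index = matches[0]  # positional row index into the similarity matrix
    distances = similarity[movie_index]
    # sort by similarity, highest first, and skip the first entry (the title itself)
    movies_list = sorted(enumerate(distances), reverse = True, key = lambda x:x[1])[1:21]
    for i, _ in movies_list:
        print(new_df.iloc[i].title)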

In [79]:
recommend('the walking dead')

daybreak
kipo and the age of wonderbeasts
the haunting of hill house
chilling adventures of sabrina
sleepless society nyctophobia
bard of blood
all of us are dead
record of ragnarok
the witcher
the outsider
the dark crystal: age of resistance
hemlock grove
biohackers
7seeds
larva
day of the dead: bloodline
ghoul
alice in borderland
black summer
elves


In [84]:
recommend('breaking bad')

ozark
the billion dollar code
clickbait
queen sono
dare me
who killed sara?
bordertown
apaches
american vandal
ghost in the shell: sac_2045
black spot
spotless
sleepless society nyctophobia
my name
the sinner
nightcrawler
narcos: mexico
pieces of her
making a murderer
mindhunter
