## Importing the libraries

In [1]:
import numpy as np 
import pandas as pd 

## Importing the dataset

In [2]:
dataset = pd.read_csv('dataset/netflix_dataset.csv')
dataset

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


The dataset has many columns, but we only need `type`, `title` and `description`, so we keep a copy of just those three.

In [3]:
df = dataset[['type', 'title', 'description']].copy()
df

Unnamed: 0,type,title,description
0,Movie,Dick Johnson Is Dead,"As her father nears the end of his life, filmm..."
1,TV Show,Blood & Water,"After crossing paths at a party, a Cape Town t..."
2,TV Show,Ganglands,To protect his family from a powerful drug lor...
3,TV Show,Jailbirds New Orleans,"Feuds, flirtations and toilet talk go down amo..."
4,TV Show,Kota Factory,In a city of coaching centers known to train I...
...,...,...,...
8802,Movie,Zodiac,"A political cartoonist, a crime reporter and a..."
8803,TV Show,Zombie Dumb,"While living alone in a spooky town, a young g..."
8804,Movie,Zombieland,Looking to survive in a world taken over by zo...
8805,Movie,Zoom,"Dragged from civilian life, a former superhero..."


Check whether any of the selected columns contain NaN values.

In [4]:
check_type = df['type'].isnull().values.any()
check_title = df['title'].isnull().values.any()
check_description = df['description'].isnull().values.any()
print(check_type, ' ', check_title, ' ', check_description)    # a True here would mean that column has missing values

False   False   False
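
The same check can be done for all columns at once. The snippet below is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not part of the Netflix dataset); it counts missing values per column with `isnull().sum()`.

```python
import pandas as pd

# Hypothetical toy frame with a few deliberately missing values
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", None],
    "title": ["A", "B", "C"],
    "description": ["first text", None, None],
})

# isnull() marks each missing cell; sum() counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Applied to the notebook's data, `df.isnull().sum()` would report one count per column, which also shows how many values are missing rather than only whether any are.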


We use scikit-learn's `TfidfVectorizer` class to compute the TF-IDF values for the `description` column, ignoring common English stop words.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
description_matrix = vectorizer.fit_transform(df['description'])
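
To see what the vectorizer produces, here is a small sketch on three made-up descriptions (the `toy_*` names and sentences are assumptions for illustration). Each row of the result is one document, each column is one term of the learned vocabulary, and with the default settings every row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for df['description']
toy_descriptions = [
    "a detective hunts a killer",
    "a chef opens a restaurant",
    "a detective opens a restaurant",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_descriptions)

# The vocabulary maps each remaining term to a column index
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)  # (number of documents, number of terms)

# With the default norm='l2', each row vector has length 1
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```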

In [13]:
description_matrix.toarray()  # dense copy of the sparse TF-IDF matrix (mostly zeros)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
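
`fit_transform` returns a SciPy sparse matrix, and `toarray()` materialises every zero. As a rough sketch of how to inspect such a matrix without converting it, the example below builds a small hand-made sparse matrix (the `toy_dense` values are assumptions for illustration) and reads its shape, number of stored entries, and density.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small matrix standing in for a TF-IDF result
toy_dense = np.array([
    [0.0, 0.7, 0.0, 0.7],
    [0.0, 0.0, 1.0, 0.0],
    [0.6, 0.0, 0.0, 0.8],
])
toy_sparse = csr_matrix(toy_dense)

n_rows, n_cols = toy_sparse.shape
stored = toy_sparse.nnz                 # number of explicitly stored entries
density = stored / (n_rows * n_cols)    # fraction of cells that are stored

print(n_rows, n_cols, stored, round(density, 3))
```

The same attributes (`shape`, `nnz`) are available on `description_matrix`, so its size and sparsity can be checked without building the dense array.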

In [18]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(description_matrix)
cosine_similarities

array([[1.        , 0.        , 0.        , ..., 0.        , 0.01538292,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.02230089],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.01538292, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.02230089, ..., 0.        , 0.        ,
        1.        ]])
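
Entry `[i, j]` of this matrix is the cosine similarity between the descriptions of rows `i` and `j`, so the matrix is symmetric with ones on the diagonal. One possible way to read it is to sort a row and take the highest-scoring other titles. The sketch below does this on made-up data; the `toy_*` values and the helper `most_similar` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and descriptions for illustration only
toy_titles = ["Case Files", "Kitchen Dreams", "Dinner Detective"]
toy_descriptions = [
    "detective hunts killer",
    "chef opens restaurant",
    "detective opens restaurant",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_descriptions)
toy_sim = cosine_similarity(toy_matrix)


def most_similar(title, titles, sim, k=2):
    """Return up to k titles ranked by similarity to `title` (hypothetical helper)."""
    idx = titles.index(title)
    # argsort is ascending, so reverse it to get the most similar first
    order = np.argsort(sim[idx])[::-1]
    # Skip the title itself, which always has similarity 1 with itself
    return [titles[i] for i in order if i != idx][:k]


print(most_similar("Kitchen Dreams", toy_titles, toy_sim))
```

On the real data the same idea would index into `cosine_similarities` using the row position of a title in `df`, assuming titles are unique.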