# Netflix Movie Data

This dataset contains more than 8,500 Netflix movies and TV shows, including cast members, duration, and genre. It contains titles added as recently as late September 2021.

Not sure where to begin? Scroll to the bottom to find challenges!

In [1]:
import pandas as pd
import numpy as np
!pip install rake-nltk
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer


Collecting rake-nltk
  Downloading rake_nltk-1.0.6-py3-none-any.whl (9.1 kB)
Collecting nltk<4.0.0,>=3.6.2
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2023.6.3-cp37-cp37m-win_amd64.whl (268 kB)
Installing collected packages: regex, nltk, rake-nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.4.5
    Uninstalling nltk-3.4.5:
      Successfully uninstalled nltk-3.4.5
Successfully installed nltk-3.8.1 rake-nltk-1.0.6 regex-2023.6.3


In [2]:
df = pd.read_csv('netflix_dataset.csv', index_col=0)
df.head()

Unnamed: 0_level_0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
show_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [3]:
new = df[['title', 'description', 'listed_in','director','cast']]
new.head()

Unnamed: 0_level_0,title,description,listed_in,director,cast
show_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
s1,Dick Johnson Is Dead,"As her father nears the end of his life, filmm...",Documentaries,Kirsten Johnson,
s2,Blood & Water,"After crossing paths at a party, a Cape Town t...","International TV Shows, TV Dramas, TV Mysteries",,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban..."
s3,Ganglands,To protect his family from a powerful drug lor...,"Crime TV Shows, International TV Shows, TV Act...",Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi..."
s4,Jailbirds New Orleans,"Feuds, flirtations and toilet talk go down amo...","Docuseries, Reality TV",,
s5,Kota Factory,In a city of coaching centers known to train I...,"International TV Shows, Romantic TV Shows, TV ...",,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K..."


In [5]:
new['cast'].astype(str)

new['listed_in'].astype(str)
            
new['director'].astype(str)

show_id
s1       Kirsten Johnson
s2                   nan
s3       Julien Leclercq
s4                   nan
s5                   nan
              ...       
s8803      David Fincher
s8804                nan
s8805    Ruben Fleischer
s8806       Peter Hewitt
s8807        Mozez Singh
Name: director, Length: 8807, dtype: object

In [6]:
new.dtypes

title          object
description    object
listed_in      object
director       object
cast           object
dtype: object

In [10]:
new['cast'] = new['cast'].map(lambda x: x.split(',') [:])

new['listed_in'] = new['listed_in'].map(lambda x: x.split('') [:])

new['director'] = new['director'].map(lambda x: x.lower().split(' '))

for index, row in new.iterrows():
    row['cast'] = [actor.lower() for actor in row['cast']]
    row['listed_in'] = [genre.lower() for genre in row['listed_in']]
    row['director'] = ''.join(row['director']).lower()
new.head(10)

AttributeError: 'list' object has no attribute 'split'

In [7]:
import nltk
nltk.download("stopwords")
nltk.download("punkt")

from rake_nltk import Rake

r = Rake()

new['keywords'] = new['description'].apply(lambda x: r.extract_keywords_from_text(x))
new['keywords'] = new['description'].apply(lambda x: list(r.get_word_degrees().keys()))

new.head(10)

[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0_level_0,title,description,listed_in,director,cast,keywords
show_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
s1,Dick Johnson Is Dead,"As her father nears the end of his life, filmm...","[d, o, c, u, m, e, n, t, a, r, i, e, s]",kirstenjohnson,"[n, a, n]","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
s2,Blood & Water,"After crossing paths at a party, a Cape Town t...","[i, n, t, e, r, n, a, t, i, o, n, a, l, , t, ...",,"[a, m, a, , q, a, m, a, t, a, ,, , k, h, o, ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
s3,Ganglands,To protect his family from a powerful drug lor...,"[c, r, i, m, e, , t, v, , s, h, o, w, s, ,, ...",julienleclercq,"[s, a, m, i, , b, o, u, a, j, i, l, a, ,, , ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
s4,Jailbirds New Orleans,"Feuds, flirtations and toilet talk go down amo...","[d, o, c, u, s, e, r, i, e, s, ,, , r, e, a, ...",,"[n, a, n]","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
s5,Kota Factory,In a city of coaching centers known to train I...,"[i, n, t, e, r, n, a, t, i, o, n, a, l, , t, ...",,"[m, a, y, u, r, , m, o, r, e, ,, , j, i, t, ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
s6,Midnight Mass,The arrival of a charismatic young priest brin...,"[t, v, , d, r, a, m, a, s, ,, , t, v, , h, ...",mikeflanagan,"[k, a, t, e, , s, i, e, g, e, l, ,, , z, a, ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
s7,My Little Pony: A New Generation,Equestria's divided. But a bright-eyed hero be...,"[c, h, i, l, d, r, e, n, , &, , f, a, m, i, ...","robertcullen,joséluisucha","[v, a, n, e, s, s, a, , h, u, d, g, e, n, s, ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
s8,Sankofa,"On a photo shoot in Ghana, an American model s...","[d, r, a, m, a, s, ,, , i, n, d, e, p, e, n, ...",hailegerima,"[k, o, f, i, , g, h, a, n, a, b, a, ,, , o, ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
s9,The Great British Baking Show,A talented batch of amateur bakers face off in...,"[b, r, i, t, i, s, h, , t, v, , s, h, o, w, ...",andydevonshire,"[m, e, l, , g, i, e, d, r, o, y, c, ,, , s, ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
s10,The Starling,A woman adjusting to life after a loss contend...,"[c, o, m, e, d, i, e, s, ,, , d, r, a, m, a, s]",theodoremelfi,"[m, e, l, i, s, s, a, , m, c, c, a, r, t, h, ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."


In [8]:
new.set_index('title', inplace = True)
new.drop(columns= ['description'], inplace= True)
new.head()

Unnamed: 0_level_0,listed_in,director,cast,keywords
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dick Johnson Is Dead,"[d, o, c, u, m, e, n, t, a, r, i, e, s]",kirstenjohnson,"[n, a, n]","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
Blood & Water,"[i, n, t, e, r, n, a, t, i, o, n, a, l, , t, ...",,"[a, m, a, , q, a, m, a, t, a, ,, , k, h, o, ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
Ganglands,"[c, r, i, m, e, , t, v, , s, h, o, w, s, ,, ...",julienleclercq,"[s, a, m, i, , b, o, u, a, j, i, l, a, ,, , ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
Jailbirds New Orleans,"[d, o, c, u, s, e, r, i, e, s, ,, , r, e, a, ...",,"[n, a, n]","[scrappy, poor, boy, worms, way, tycoon, dysfu..."
Kota Factory,"[i, n, t, e, r, n, a, t, i, o, n, a, l, , t, ...",,"[m, a, y, u, r, , m, o, r, e, ,, , j, i, t, ...","[scrappy, poor, boy, worms, way, tycoon, dysfu..."


In [9]:
new['bag_of_words'] = ''
columns = new.columns
for index, row in new.iterrows():
    words = ''
    for col in columns:
        if col != 'director':
            words = words + ' '.join(str(row[col])) + ' '
        else:
            words = words + ' '.join(str(row[col])) + ' '
    new.at[index, 'bag_of_words'] = words

new.drop(columns=[col for col in new.columns if col != 'bag_of_words'], inplace=True)
new.head()

Unnamed: 0_level_0,bag_of_words
title,Unnamed: 1_level_1
Dick Johnson Is Dead,"[ ' d ' , ' o ' , ' c ' , ' u ' , ' m ..."
Blood & Water,"[ ' i ' , ' n ' , ' t ' , ' e ' , ' r ..."
Ganglands,"[ ' c ' , ' r ' , ' i ' , ' m ' , ' e ..."
Jailbirds New Orleans,"[ ' d ' , ' o ' , ' c ' , ' u ' , ' s ..."
Kota Factory,"[ ' i ' , ' n ' , ' t ' , ' e ' , ' r ..."


In [34]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')  # Add stop_words parameter

count_matrix = count.fit_transform(new['bag_of_words'])

indices = pd.Series(new.index)  # Change df to new
indices[:5]

ValueError: empty vocabulary; perhaps the documents only contain stop words

In [None]:
cosine_sim= cosine_similarity(count_matrix, count_matrix)
cosine_sim

In [31]:


def recommendations(title, cosine_sim = cosine_sim):
    
    recommended_movies = []
    
    idx= indices[indices == title].index[1]
    
    score_series= pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
        
    return recommended_movies

In [None]:
def recommendations(title, cosine_sim=cosine_sim):
    recommended_movies = []
    
    # Find the index of the movie title in the indices list
    idx = indices[indices == title].index[0]
    
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # Get the titles of the recommended movies using the indices
    for i in top_10_indexes:
        recommended_movies.append(indices[i])
        
    return recommended_movies

recommendations('Zoom')

IndexError: index 0 is out of bounds for axis 0 with size 0

[Source](https://www.kaggle.com/shivamb/netflix-shows) of dataset.

## Don't know where to start? 

**Challenges are brief tasks designed to help you practice specific skills:**

- 🗺️ **Explore**: How much variety exists in Netflix's offering? Base this on three variables: `type`, `country`, and `listed_in`.
- 📊 **Visualize**: Build a word cloud from the movie and TV shows descriptions. Make sure to remove stop words!
- 🔎 **Analyze**: Has Netflix invested more in certain genres (see `listed_in`) in recent years? What about certain age groups (see `ratings`)?

**Scenarios are broader questions to help you develop an end-to-end project for your portfolio:**

A talent agency has hired you to analyze patterns in the professional relationships of cast members and directors. The key deliverable is a network graph where each node represents a cast member or director. An edge represents a movie or TV show worked on by both nodes in this undirected graph. You can limit the actors to the first four names listed in `cast`. The client is interested in any insights you can derive from your Netflix network analysis, such as actor/actor and actor/director pairs that work most closely together, most popular actors and directors to work with, and graph differences over time.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.