<a href="https://colab.research.google.com/github/Ajay0110/Mobile-Price-Range-Prediction/blob/main/NETFLIX_MOVIES_AND_TV_SHOWS_CLUSTERING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes can also provide many interesting findings.

## <b>In this project, you are required to do the following</b>
1. Exploratory Data Analysis

2. Understanding what type of content is available in different countries

3. Determining whether Netflix has increasingly focused on TV shows rather than movies in recent years

4. Clustering similar content by matching text-based features



# **Attribute Information**

1. show_id : Unique ID for every Movie / TV Show

2. type : Identifier - A Movie or TV Show

3. title : Title of the Movie / TV Show

4. director : Director of the Movie

5. cast : Actors involved in the movie / show

6. country : Country where the movie / show was produced

7. date_added : Date it was added on Netflix

8. release_year : Actual release year of the movie / show

9. rating : TV Rating of the movie / show

10. duration : Total Duration - in minutes or number of seasons

11. listed_in : Genre

12. description: The Summary description

# Importing libraries

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import warnings; warnings.simplefilter('ignore')

# Connecting Google Drive

In [4]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Dataset path, shape, head and tail

In [2]:
# Setting up path
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/EDA/Unsupervised EDA Capstone Project/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")

In [3]:
df.shape # To check the shape

(7787, 12)

In [4]:
df.head() # To get the first 5 rows

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [5]:
df.tail() # To get the last 5 rows

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
7782,s7783,Movie,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,s7785,Movie,Zulu Man in Japan,,Nasty C,,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."
7785,s7786,TV Show,Zumbo's Just Desserts,,"Adriano Zumbo, Rachel Khoo",Australia,"October 31, 2020",2019,TV-PG,1 Season,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...
7786,s7787,Movie,ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS,Sam Dunn,,"United Kingdom, Canada, United States","March 1, 2020",2019,TV-MA,90 min,"Documentaries, Music & Musicals",This documentary delves into the mystique behi...


# Data Cleaning

In [6]:
#checking null values
df.isnull().sum()

show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64

The director column has 2389 null values, so dropping every row with a missing director would remove a large share of the dataset. The director column is therefore removed instead, in order to retain as many rows as possible.
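One way to quantify that impact is the share of missing values per column: 2389 of 7787 rows is roughly 31% for director. The sketch below shows the calculation on a small made-up frame; `toy` and its values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up for illustration)
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "director": [np.nan, "x", np.nan, np.nan],
    "rating": ["TV-MA", np.nan, "R", "PG"],
})

# isnull().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Running the same two lines on `df` before the column is dropped would show which columns are cheap to clean row-wise and which are not.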

In [7]:
# Removing director column
df.drop(['director'],axis=1,inplace=True)

Dropping the remaining rows that contain null values

In [8]:
# Dropping rows that contain null values
df.dropna(axis=0,inplace=True)

In [9]:
# Checking for null values, if any
df.isnull().sum()

show_id         0
type            0
title           0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

In [10]:
# Dataset shape after removing NULL values
df.shape

(6643, 11)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6643 entries, 0 to 7785
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6643 non-null   object
 1   type          6643 non-null   object
 2   title         6643 non-null   object
 3   cast          6643 non-null   object
 4   country       6643 non-null   object
 5   date_added    6643 non-null   object
 6   release_year  6643 non-null   int64 
 7   rating        6643 non-null   object
 8   duration      6643 non-null   object
 9   listed_in     6643 non-null   object
 10  description   6643 non-null   object
dtypes: int64(1), object(10)
memory usage: 622.8+ KB


In [16]:
# Path of IMDB files
basic_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/EDA/Unsupervised EDA Capstone Project/data.tsv",sep='\t')
rating_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/EDA/Unsupervised EDA Capstone Project/Ratings/data.tsv",sep='\t')

In [17]:
basic_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [18]:
rating_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1905
1,tt0000002,5.9,256
2,tt0000003,6.5,1702
3,tt0000004,5.7,168
4,tt0000005,6.2,2517


In [19]:
df3 = pd.merge(basic_df.set_index('tconst'), rating_df.set_index('tconst'), left_index=True, right_index=True)

In [20]:
df3.head()

tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",5.7,1905
tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",5.9,256
tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",6.5,1702
tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short",5.7,168
tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short",6.2,2517


In [21]:
df3.isnull().sum()

titleType         0
primaryTitle      0
originalTitle     0
isAdult           0
startYear         0
endYear           0
runtimeMinutes    0
genres            2
averageRating     0
numVotes          0
dtype: int64
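The zero counts above may be misleading: the IMDb files shown earlier use the literal string `\N` for missing values (see the endYear column), which `isnull()` does not treat as null. A minimal sketch, assuming the files can be re-read, is to pass `na_values` to `pd.read_csv`. The inline `sample_tsv` string below is a made-up stand-in for the real files.

```python
import io
import pandas as pd

# Made-up two-row sample in the same tab-separated layout as the IMDb files
sample_tsv = (
    "tconst\tstartYear\tendYear\n"
    "tt0000001\t1894\t\\N\n"
    "tt0000002\t\\N\t\\N\n"
)

# Default parsing keeps "\N" as an ordinary string, so nothing counts as null
plain = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(plain.isnull().sum())

# Declaring "\N" as a missing-value marker turns those cells into NaN
marked = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
print(marked.isnull().sum())
```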

In [22]:
# Lower-casing titles so that Netflix and IMDb titles can be matched
df['title']= df['title'].str.lower()
df3['primaryTitle'] = df3['primaryTitle'].str.lower()
df3['originalTitle'] = df3['originalTitle'].str.lower()

In [23]:
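# Keeping only rows whose startYear is numeric (IMDb marks missing years as \N);
# .copy() avoids a SettingWithCopyWarning on the assignment below
df3 = df3[df3.startYear.apply(lambda x: str(x).isnumeric())].copy()
df3['startYear'] = df3['startYear'].astype(int)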
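As an alternative sketch of the same filtering step, `pd.to_numeric` with `errors="coerce"` converts non-numeric years to NaN in one vectorized call, after which the NaN rows can be dropped. The `toy` frame is a hypothetical stand-in for `df3`.

```python
import pandas as pd

# Hypothetical stand-in for df3: one year is the IMDb missing marker "\N"
toy = pd.DataFrame({
    "primaryTitle": ["carmencita", "unknown year", "7:19"],
    "startYear": ["1894", "\\N", "2016"],
})

# Non-numeric strings become NaN instead of raising an error
years = pd.to_numeric(toy["startYear"], errors="coerce")

# Keep rows with a valid year and store the year as an integer
clean = toy.loc[years.notna()].assign(startYear=years.dropna().astype(int))
print(clean)
```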

In [25]:
final_df = pd.merge(df, df3, left_on=['title','release_year'], right_on=['primaryTitle','startYear'])

In [26]:
final_df.shape

(5658, 21)

In [27]:
final_df.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,...,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",...,movie,7:19,7:19,0,2016,\N,94,"Drama,History",5.8,655
1,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies",...,movie,23:59,23:59,0,2011,\N,78,Horror,4.5,978
2,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...",...,movie,9,9,0,2009,\N,79,"Action,Adventure,Animation",7.0,139858
3,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...",...,short,9,9,0,2009,\N,9,"Comedy,Drama,Short",5.5,93
4,s5,Movie,21,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,...,movie,21,21,0,2008,\N,123,"Crime,Drama,History",6.8,250120


In [28]:
final_df.columns

Index(['show_id', 'type', 'title', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear',
       'endYear', 'runtimeMinutes', 'genres', 'averageRating', 'numVotes'],
      dtype='object')
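The head of `final_df` shows that show `s4` matched two IMDb rows (a movie and a short that share the title "9" and the year 2009), so the title/year join can duplicate Netflix entries. A minimal sketch of one possible fix, assuming the IMDb row with the most votes is the intended match, is shown below on toy frames; `netflix` and `imdb` are hypothetical names, and the tie-breaking rule is an assumption rather than the notebook's method.

```python
import pandas as pd

# Toy frames reusing the values visible in the output above
netflix = pd.DataFrame({
    "show_id": ["s3", "s4"],
    "title": ["23:59", "9"],
    "release_year": [2011, 2009],
})
imdb = pd.DataFrame({
    "primaryTitle": ["23:59", "9", "9"],
    "startYear": [2011, 2009, 2009],
    "titleType": ["movie", "movie", "short"],
    "numVotes": [978, 139858, 93],
})

merged = netflix.merge(
    imdb, left_on=["title", "release_year"], right_on=["primaryTitle", "startYear"]
)

# Keep, for each show_id, the candidate with the highest vote count
deduped = (
    merged.sort_values("numVotes", ascending=False)
    .drop_duplicates("show_id")
    .sort_index()
)
print(deduped[["show_id", "titleType", "numVotes"]])
```

Comparing `final_df['show_id'].nunique()` with `len(final_df)` would show how many rows are affected in the real data.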

In [None]:
# Splitting the country feature
df['country']=df['country'].apply(lambda x:x.split(","))

In [None]:
# Exploding country feature
df = df.explode('country')

In [None]:
df.head()

### Removing left and right spaces

In [None]:
# Trimming left and right spaces
df['country'] = df['country'].str.strip()
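The split, explode, and trim steps above can be sketched end to end on a small frame. `toy` and its values are illustrative assumptions, chosen so the effect on the row count is easy to see.

```python
import pandas as pd

# Toy frame: the first title lists three countries, the second lists one
toy = pd.DataFrame({
    "title": ["zozo", "zubaan"],
    "country": ["Sweden, Czech Republic, United Kingdom", "India"],
})

# Split the comma-separated string into a list, then emit one row per country
exploded = toy.assign(country=toy["country"].str.split(",")).explode("country")

# Splitting on "," leaves a leading space on every country after the first
exploded["country"] = exploded["country"].str.strip()
print(exploded)
```

Note that the exploded frame repeats each title once per country, which matters for any later count that is meant to be per title.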

In [None]:
# Checking datatypes of features
df.dtypes

In [None]:
# Converting date_added feature object to datetime datatype
df['date_added'] = pd.to_datetime(df['date_added'])
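Once `date_added` is a datetime, the year can be extracted to compare how many movies and TV shows were added per year, which relates to the third project question. The sketch below uses a made-up `toy` frame; stripping whitespace first is a precaution that assumes some values may carry stray spaces.

```python
import pandas as pd

# Toy frame in the same "Month day, year" format as date_added
toy = pd.DataFrame({
    "type": ["TV Show", "Movie", "Movie"],
    "date_added": ["August 14, 2020", " December 23, 2016", "January 1, 2020"],
})

# Strip stray whitespace, then parse to datetime
toy["date_added"] = pd.to_datetime(toy["date_added"].str.strip())
toy["year_added"] = toy["date_added"].dt.year

# Rows are years, columns are content types, cells are counts
per_year = toy.groupby(["year_added", "type"]).size().unstack(fill_value=0)
print(per_year)
```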

In [None]:
df.head()

# Exploratory Data Analysis

### Number of movies and TV shows visualization

In [None]:
# Countplot for feature type
plt.figure(figsize = (10,8))
sns.countplot(x='type', data=df)

The number of movies is high, whereas the number of TV shows is significantly lower.
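Because `df` was exploded by country earlier, the plot counts a title once per listed country. A minimal sketch of counting each title once, assuming `show_id` uniquely identifies a title, is shown on a made-up `toy` frame.

```python
import pandas as pd

# Toy exploded frame: show s2 appears twice because it lists two countries
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "type": ["TV Show", "Movie", "Movie", "Movie"],
    "country": ["Brazil", "Mexico", "Spain", "Singapore"],
})

# Counting rows directly inflates multi-country titles
row_counts = toy["type"].value_counts()

# Dropping repeated show_id values first counts each title once
title_counts = toy.drop_duplicates("show_id")["type"].value_counts()
print(row_counts, title_counts, sep="\n")
```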

### Top 5 Countries by Number of Titles

In [None]:
# Getting Top 5 countries
plt.figure(figsize = (10,8))
df['country'].value_counts().head().plot(kind='bar')

Here we can see the top 5 countries by number of titles, based on the country where each title was produced. The United States is at the top, India is second, the United Kingdom third, Canada fourth, and France fifth.



### Overall Top 5 programmes

In [None]:
# Getting top 5 titles
plt.figure(figsize = (10,8))
df['title'].value_counts().head().plot(kind='bar')

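Here are the 5 titles that appear most often in the dataset. Since the data was exploded by country, a title's count largely reflects the number of countries listed for it.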

# For TV Shows

In [None]:
# TV Shows Dataframe
df1 = df.loc[df['type']=='TV Show',:].copy()

In [None]:
# Top 5 countries featuring the most TV Shows
df1['country'].value_counts().head().plot(kind='bar')

The United States has the most TV shows, around 700.
Among the top 5 countries, Canada has the fewest TV shows.

In [None]:
# Getting top 5 TV Shows
plt.figure(figsize = (10,8))
df1['title'].value_counts().head().plot(kind='bar')

The 5 most frequent TV show titles are Frozen Planet, The Making of the Frozen Planet, Ultimate Beastmaster, Black Crows and Shaka Zulu.

In [None]:
df1['cast']=df1['cast'].apply(lambda x:x.split(","))

In [None]:
df1 = df1.explode('cast')

In [None]:
# Trimming left and right spaces
df1['cast'] = df1['cast'].str.strip()

In [None]:
# Getting top 5 casts for TV Shows
plt.figure(figsize = (10,8))
df1['cast'].value_counts().head().plot(kind='bar')

In [None]:
df1_us = df.loc[df['type']=='TV Show',:]