<a href="https://colab.research.google.com/github/RanojoyBiswas/Netflix-Movies-TV-Shows-Clustering---Ranajay-Biswas/blob/main/Netflix_Movies_%26_TV_Shows_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project**    - Netflix Movies & TV Shows Clustering



### **Contributor :** Ranajay Biswas

### **Github Link :**

### **Project Type**    - Unsupervised

## **Project Summary -**

Netflix is a streaming service that offers a wide variety of award-winning TV shows, movies, anime, documentaries and more – on thousands of internet-connected devices. New TV shows and movies are added every week!

Netflix has an extensive library of feature films, documentaries, shows, award-winning Netflix originals, and plenty of other content.

In this unsupervised machine learning project, we look for patterns and insights across the movies and shows in the catalogue.

Our main goal is to cluster similar content and to develop a content-based recommender system that can recommend new movies and shows to users according to their interests.

Through exploratory data analysis, we can draw insights that help us better understand the production of, and demand for, different content in different countries. By evaluating genres and content ratings for different age groups, we can find out which kind of audience each part of the catalogue mainly serves.

Analyzing audiences' taste and finding similar content will help to build intelligent recommendation systems that will be able to satisfy the audience's appetite for content consumption.

## **Problem Statement**

This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released an interesting report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings and Rotten Tomatoes could also provide many interesting findings.

### <b>In this project, we are required to do the following:</b>
1. Exploratory Data Analysis

2. Understanding what type of content is available in different countries

3. Determining whether Netflix has been increasingly focusing on TV shows rather than movies in recent years

4. Clustering similar content by matching text-based features



## **Attribute Information**

1. show_id : Unique ID for every Movie / TV Show

2. type : Identifier - A Movie or TV Show

3. title : Title of the Movie / TV Show

4. director : Director of the Movie

5. cast : Actors involved in the movie / show

6. country : Country where the movie / show was produced

7. date_added : Date it was added on Netflix

8. release_year : Actual release year of the movie / show

9. rating : TV Rating of the movie / show

10. duration : Total Duration - in minutes or number of seasons

11. listed_in : Genre

12. description: The Summary description
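
Two of the attributes above, `date_added` and `duration`, are stored as text, and `duration` mixes two units (minutes for movies, seasons for TV shows). The following is a minimal sketch of how they might be parsed, using a few rows that mimic the formats in this dataset. The frame `sample` and the derived column names `duration_value`, `duration_unit` and `year_added` are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Toy rows that mimic the column formats described above (not the full dataset).
sample = pd.DataFrame({
    "type": ["TV Show", "Movie", "TV Show"],
    "date_added": ["August 14, 2020", "December 23, 2016", "October 31, 2020"],
    "duration": ["4 Seasons", "93 min", "1 Season"],
})

# Split strings such as "93 min" or "4 Seasons" into a number and a unit.
parts = sample["duration"].str.extract(
    r"(?P<duration_value>\d+)\s*(?P<duration_unit>\w+)"
)
sample["duration_value"] = parts["duration_value"].astype(int)
# Normalise "Season" / "Seasons" to a single label so the unit is consistent.
sample["duration_unit"] = parts["duration_unit"].str.lower().str.rstrip("s")

# Parse the human-readable date into a datetime and derive the year.
sample["date_added"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y"
)
sample["year_added"] = sample["date_added"].dt.year

print(sample[["type", "duration_value", "duration_unit", "year_added"]])
```

Keeping the unit in its own column avoids comparing minutes with seasons by accident when the two content types are analysed together.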

## **Approach:**

This project depends heavily on analyzing the data and finding significant patterns and trends in order to identify movies and shows that are similar to each other. This way, Netflix can be more confident when making recommendations to its audience. It is also helpful to know what type of content is likely to be more popular among platform users, so that Netflix can produce or add such shows and movies.

Our approach to solving this problem is as follows:

1. Understanding the dataset, its rows and columns.

2. During the EDA, we will try to find popular actors and directors and the top content-producing countries. We will also get an idea of the genres and runtimes of different content. Statistical methods and visualizations are going to be very helpful in this EDA process.

3. In the pre-processing step, we shall select the most important features, make the necessary transformations, and create or omit features as needed. Depending on the features we choose, we will find the best approaches for text pre-processing. The data will then be ready for clustering algorithms or for building a recommender system.

4. We shall try different clustering algorithms to see which fits the data best. The best performing algorithm will be selected and then we shall analyze different clusters made by the algorithm to find patterns and insights.

5. Then we shall conclude the project with an overview and discussion about the observations that we make and how that can prove to be useful for Netflix's streaming services.
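
Steps 3 and 4 above can be sketched on a tiny scale as follows. This is only an illustration under stated assumptions: the four titles and descriptions are made up, `recommend` is a hypothetical helper, and the choice of TF-IDF with K-Means and two clusters is one possible configuration rather than the final setup chosen for the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-catalogue; the real project would use the dataset's text columns.
titles = ["Space Wars", "Galaxy Quest", "Baking Battle", "Dessert Masters"]
texts = [
    "space robots fight an alien war in a distant galaxy",
    "a crew of space explorers meets alien robots in a galaxy far away",
    "home bakers compete to bake the best cake and dessert",
    "pastry chefs compete with cake and dessert creations",
]

# Pre-processing: turn each description into a TF-IDF vector.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(texts)

# Clustering: fit K-Means and score the partition with the silhouette coefficient.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)

# Content-based recommendation: rank other titles by cosine similarity.
similarity = cosine_similarity(X)

def recommend(title, top_n=1):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    idx = titles.index(title)
    ranked = np.argsort(similarity[idx])[::-1]
    return [titles[i] for i in ranked if i != idx][:top_n]

print(labels, round(score, 3))
print(recommend("Space Wars"))
```

On real data the number of clusters would be chosen by comparing scores such as the silhouette coefficient across several values instead of being fixed in advance.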

## **Data Collection & Summary:**

The very first step is to import the libraries for the task. We shall start off by importing the absolutely necessary packages, and as we continue working on the data, we will add more to the list.

In [24]:
#invite people for the party
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns
import numpy as np
import missingno as msno
from scipy.stats import norm
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import pearsonr # Pearson's r

from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc 

from sklearn.metrics.pairwise import cosine_similarity  # used to calculate distance or similarity
from sklearn.feature_extraction.text import TfidfVectorizer   # importing TfidfVectorizer from sklearn library
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from datetime import datetime, timezone, timedelta

from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score  #davies_bouldin_score of clusters 
from scipy.stats import zscore
from sklearn import metrics

In [25]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [26]:
# importing Natural Language Toolkit
import nltk
import re

In [27]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [28]:
from nltk.corpus import stopwords  # Stopwords
from nltk.stem import PorterStemmer  # stemmer package
from nltk.stem import WordNetLemmatizer  # Lemmatizer

In [29]:
sns.set(font_scale = 1.5)

In [30]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [31]:
# reading data set
dataset = pd.read_csv('/content/drive/MyDrive/ML Unsupervised Projects/Netflix Movies & TV Shows Clustering - Ranajay Biswas/Datasets/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv',  encoding= 'unicode_escape')

In [32]:
pd.set_option('display.max_columns', None)

Checking the data:

In [33]:
# top 5 rows of the data
dataset.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"JoÃ£o Miguel, Bianca Comparato, Michel Gomes, ...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"DemiÃ¡n Bichir, HÃ©ctor Bonilla, Oscar Serrano...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
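
Some cast names in the output above appear garbled (for example `JoÃ£o` instead of `João`). A likely cause, assuming the CSV file is UTF-8 encoded, is that the `unicode_escape` codec reads each byte as a Latin-1 character, so two-byte UTF-8 sequences turn into two separate characters. The sketch below reproduces the effect on a single string and shows one way to reverse it. It does not touch the real file.

```python
# Assumption: the source file is UTF-8 encoded.
original = "João Miguel"

# Decoding UTF-8 bytes with 'unicode_escape' treats every byte as Latin-1,
# which splits the two-byte sequence for "ã" into two separate characters.
garbled = original.encode("utf-8").decode("unicode_escape")
print(garbled)

# Reversing the mistake: re-encode as Latin-1, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

If the assumption holds, passing `encoding='utf-8'` to `pd.read_csv` when loading the file would avoid the problem at the source.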


In [34]:
# last 5 rows of the data
dataset.tail()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
7782,s7783,Movie,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,s7785,Movie,Zulu Man in Japan,,Nasty C,,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."
7785,s7786,TV Show,Zumbo's Just Desserts,,"Adriano Zumbo, Rachel Khoo",Australia,"October 31, 2020",2019,TV-PG,1 Season,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...
7786,s7787,Movie,ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS,Sam Dunn,,"United Kingdom, Canada, United States","March 1, 2020",2019,TV-MA,90 min,"Documentaries, Music & Musicals",This documentary delves into the mystique behi...


In [35]:
# checking the shape of our data
dataset.shape

(7787, 12)

In [36]:
# checking columns
dataset.columns.tolist()

['show_id',
 'type',
 'title',
 'director',
 'cast',
 'country',
 'date_added',
 'release_year',
 'rating',
 'duration',
 'listed_in',
 'description']

In [37]:
# datatypes and null values overview
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB


### *The dataset has 7787 rows and 12 columns, and it contains object and integer data types.*
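
The non-null counts reported by `dataset.info()` also tell us how much data is missing in each column. As a small sketch, the counts below are copied from that output (only the columns with gaps) and turned into a missing-value summary. The names `non_null` and `summary` are illustrative.

```python
import pandas as pd

total_rows = 7787

# Non-null counts copied from the dataset.info() output (columns with gaps only).
non_null = pd.Series({
    "director": 5398,
    "cast": 7069,
    "country": 7280,
    "date_added": 7777,
    "rating": 7780,
})

# Number and percentage of missing values per column.
missing = total_rows - non_null
missing_pct = (missing / total_rows * 100).round(2)

summary = (
    pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
    .sort_values("missing", ascending=False)
)
print(summary)
```

On the loaded DataFrame itself, `dataset.isna().sum()` gives the same missing counts directly.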

### <b>Descriptive Stats</b>

In [38]:
# descriptive statistics
dataset.describe(percentiles=[.1, .9])

,release_year
count,7787.0
mean,2013.93258
std,8.757395
min,1925.0
10%,2006.0
50%,2017.0
90%,2020.0
max,2021.0


Since most of the columns are of object data type, descriptive statistics do not give us any important insight into the data.
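
For object columns, `describe` can still report the count, the number of unique values, the most frequent value and its frequency when the numeric columns are excluded. A minimal sketch on a toy frame built from the first five rows shown earlier (the name `sample` is illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame using the type, rating and release_year values of the first five rows.
sample = pd.DataFrame({
    "type": ["TV Show", "Movie", "Movie", "Movie", "Movie"],
    "rating": ["TV-MA", "TV-MA", "R", "PG-13", "PG-13"],
    "release_year": [2020, 2016, 2011, 2009, 2008],
})

# Excluding numeric columns leaves count / unique / top / freq for the rest.
object_stats = sample.describe(exclude=[np.number])
print(object_stats)

# Frequency of each category in a single column.
print(sample["type"].value_counts())
```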