<a href="https://colab.research.google.com/github/AnshikaPrajapati/PROJECT/blob/main/UNSUPERVISED_ON_NETFLIX_MOVIE_AND_TV_SHOW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - NETFLIX MOVIES AND TV SHOWS CLUSTERING



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1**  - ANSHIKA PRAJAPATI

# **Project Summary -**

Streaming platforms such as Netflix can use data analysis and machine learning techniques such as clustering to group content into similar categories, with the aim of improving the user experience through personalized recommendations based on viewing history and preferences. This involves analyzing characteristics of each title, such as genre, cast, and plot, and using algorithms to identify patterns and similarities.

Clustering techniques such as k-means and hierarchical clustering, often preceded by a dimensionality-reduction step such as principal component analysis (PCA), group movies and TV shows with similar features into distinct clusters. Each cluster can then be interpreted as a genre-like category. Recommendations built on such groupings can increase user engagement and satisfaction.

Clustering also supports data-driven decisions about content production and licensing by revealing trends and patterns in the catalog, helping the platform optimize its library and offer titles that are more likely to succeed with its user base.

In conclusion, this project applies unsupervised machine learning to group Netflix's library of movies and TV shows into similar categories. Such groupings can support personalized recommendations, better content discovery, and more informed decisions about content production and licensing, which in turn can improve customer retention and company revenue.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Netflix, one of the largest streaming platforms globally, boasts a vast library of movies and TV shows. However, the abundance of options makes it difficult for users to locate content that suits their preferences. Thus, this project's objective is to use unsupervised learning techniques to cluster similar titles on Netflix. By grouping movies and TV shows with similar attributes, the project aims to provide users with more targeted recommendations, facilitating the discovery of new content that aligns with their interests.

To achieve this goal, the project will analyze a Netflix title dataset, incorporating features like genre, cast, release year, plot summary, among others. Utilizing clustering algorithms like K-Means or Hierarchical clustering, the project intends to categorize movies and TV shows with comparable attributes.

Ultimately, the project aims to develop a reliable clustering model that can group Netflix titles accurately based on their characteristics. This model can subsequently be used to enhance Netflix's content discovery algorithms or offer recommendations to users.
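
As a rough illustration of the approach described above, the following minimal sketch clusters a few made-up descriptions using TF-IDF features and K-Means. The toy texts, the choice of two clusters, and the variable names are assumptions for illustration only, not the project's actual data or final model.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy descriptions standing in for the dataset's text features.
toy_descriptions = [
    "A detective investigates a murder in a small town",
    "A police detective hunts a serial killer",
    "A stand-up comedian jokes about family life",
    "A comedian performs a stand-up special about marriage",
]

# Turn each description into a TF-IDF vector, ignoring common English words.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(toy_descriptions)

# Group the vectors into two clusters; k=2 is chosen only for this toy data.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(features)

# Show which cluster each description was assigned to.
for text, label in zip(toy_descriptions, labels):
    print(label, text)
```

On real data the number of clusters would need to be chosen with a diagnostic such as the silhouette score rather than fixed in advance.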

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
## Data Manipulation Libraries
import numpy as np
import pandas as pd
import datetime as dt

## Data Visualisation Libraries
import matplotlib.pyplot as plt
import missingno as msno
import matplotlib.cm as cm
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns
%matplotlib inline

# Libraries used to process textual data
import string
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# libraries used to implement clusters
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.metrics import silhouette_samples
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import linkage

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mount Google Drive (Colab only) so the dataset file is accessible
from google.colab import drive
drive.mount('/content/drive')

# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/DATA/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')


### Dataset First View

In [None]:

# Dataset First Look
df.head()


In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Dataset Size")
print("Rows = {} and  Columns = {}".format(df.shape[0], df.shape[1]))


### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar = False)

### What did you know about your dataset?

The dataset used for clustering Netflix movies and TV shows describes each title through features such as genre, rating, release year, duration, director, cast, and type. It comprises 7787 rows and 12 columns. The director, cast, country, date_added, and rating columns contain null values, which need to be handled before the analysis.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

**show_id**: Unique ID for every Movie / TV Show

**type**: Identifier - A Movie or TV Show

**title** : Title of the Movie / TV Show

**director** : Director of the movie / show

**cast** : Actors involved in the movie / show

**country** : Country where the movie / show was produced

**date_added** : Date it was added on Netflix

**release_year** : Actual release year of the movie / show

**rating** : TV Rating of the movie / show

**duration** : Total Duration - in minutes or number of seasons

**listed_in**: Genre

**description** : The Summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ", i , "is" , df[i].nunique(), ".")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Summing null values
print('Missing Data Count')
df.isna().sum()[df.isna().sum() > 0].sort_values(ascending=False)

In [None]:
print('Missing Data Percentage')
print(round(df.isna().sum()[df.isna().sum() > 0].sort_values(ascending=False)/len(df)*100,2))

The missing values in the 'director', 'cast', and 'country' columns can be replaced with the label 'Unknown'.
     

In [None]:
# Fill missing values in the three text columns with the placeholder label 'Unknown'
text_cols = ['director', 'cast', 'country']
df[text_cols] = df[text_cols].fillna('Unknown')


We cannot meaningfully impute missing values in the 'date_added' column. Since these rows constitute a small portion of the data, we will exclude them from our analysis.

In [None]:
df.dropna(subset=['date_added'], inplace=True)

In [None]:
df.shape

For the missing values in the 'rating' column, we can impute them with the mode since this attribute is categorical.
     

In [None]:
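# Assign the result back; calling fillna(inplace=True) on a column selection is unreliable in recent pandas
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])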

In [None]:
df.isnull().sum()

To simplify the analysis, we keep only the primary (first-listed) country and primary genre for each entry in the 'country' and 'listed_in' columns.
     


In [None]:
df['country'] = df['country'].apply(lambda x: x.split(',')[0])
df['listed_in'] = df['listed_in'].apply(lambda x: x.split(',')[0])
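
Keeping only the first entry discards any additional genres or countries. The sketch below, on a hypothetical two-row frame, contrasts that approach with an alternative that keeps every genre by creating one row per title and genre. The toy data and the `genre` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows mimicking the comma-separated 'listed_in' column.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "listed_in": ["Dramas, International Movies", "Comedies"],
})

# Approach used in the notebook: keep only the first listed genre.
primary_genre = toy["listed_in"].str.split(",").str[0].str.strip()

# Alternative: one row per (title, genre) pair, so no genre is discarded.
all_genres = toy.assign(genre=toy["listed_in"].str.split(",")).explode("genre")
all_genres["genre"] = all_genres["genre"].str.strip()

print(primary_genre.tolist())
print(all_genres[["title", "genre"]])
```

The exploded form suits genre frequency counts, while the primary-genre form keeps one row per title, which is simpler for per-title charts.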

We will transform the 'duration' column by splitting each string value on whitespace and converting the leading number to an integer. Note that the resulting number is in minutes for movies and in seasons for TV shows.
     


In [None]:
df['duration'] = df['duration'].apply(lambda x: int(x.split()[0]))

In [None]:
# datatype of duration
df.duration.dtype
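
Because the integer in 'duration' means minutes for movies and seasons for TV shows, it can help to keep the unit alongside the number. The following is a minimal sketch on hypothetical rows; the `duration_value` and `duration_unit` column names are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical rows mimicking the raw 'duration' strings.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "125 min"],
})

# Split "<number> <unit>" into a numeric part and a unit part.
parts = toy["duration"].str.extract(r"^(?P<value>\d+)\s+(?P<unit>\w+)$")
toy["duration_value"] = parts["value"].astype(int)
# Normalise the unit: lower-case and drop a plural "s" ("Seasons" -> "season").
toy["duration_unit"] = parts["unit"].str.lower().str.rstrip("s")

# Summaries are only comparable within one unit, so group by unit first.
print(toy.groupby("duration_unit")["duration_value"].mean())
```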

In [None]:
print("Columns and data types")
pd.DataFrame(df.dtypes).rename(columns = {0:'dtype'})

In [None]:
# Convert the date strings to datetime format so that year and month can be extracted
df["date_added"] = pd.to_datetime(df['date_added'])
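
Date parsing can fail or silently misparse when strings carry stray whitespace or unexpected values. The sketch below shows a more defensive variant on hypothetical input; the sample strings and the assumed "Month day, year" format are illustrative and should be checked against the actual column.

```python
import pandas as pd

# Hypothetical raw values; the leading space and the bad entry are assumptions
# made to illustrate the failure modes.
raw = pd.Series(["January 1, 2020", " August 4, 2017", "not a date"])

# Strip whitespace, state the expected format, and turn failures into NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print("Unparseable rows:", int(parsed.isna().sum()))
```

Counting the `NaT` values afterwards makes it visible how many rows did not match the expected format.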


In [None]:
# Add new columns 'year_added' and 'month_added' to the dataframe to gain more insights from the data
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month


In [None]:
# Changing the values in the rating column
# Create a dictionary to map the current ratings to new ratings
rating_map = {'TV-MA':'Adults',
'R':'Adults',
'PG-13':'Teens',
'TV-14':'Young Adults',
'TV-PG':'Older Kids',
'NR':'Adults',
'TV-G':'Kids',
'TV-Y':'Kids',
'TV-Y7':'Older Kids',
'PG':'Older Kids',
'G':'Kids',
'NC-17':'Adults',
'TV-Y7-FV':'Older Kids',
'UR':'Adults'}
# Replace the current ratings with the new ratings using the mapping dictionary
# (assign back; replace(inplace=True) on a column selection is unreliable in recent pandas)
df['rating'] = df['rating'].replace(rating_map)
# Print the unique values in the 'rating' column to verify that the changes have been made
print(df['rating'].unique())
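
With `replace`, any rating missing from the dictionary passes through unchanged and is easy to overlook. As a hedged alternative, `Series.map` returns `NaN` for unmapped keys, which makes gaps explicit. The shortened dictionary and sample ratings below are hypothetical.

```python
import pandas as pd

# Shortened, hypothetical version of the mapping, for illustration only.
toy_map = {"TV-MA": "Adults", "TV-14": "Young Adults", "TV-Y": "Kids"}
ratings = pd.Series(["TV-MA", "TV-14", "TV-Y", "TV-MA", "PG"])

# Unlike replace(), map() returns NaN for keys missing from the dictionary,
# which makes unmapped ratings easy to find.
mapped = ratings.map(toy_map)
unmapped = sorted(ratings[mapped.isna()].unique())
print("Ratings without a target group:", unmapped)

# Share of each target group among the mapped rows.
print(mapped.value_counts(normalize=True).round(2))
```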


In [None]:
df.head()


### What all manipulations have you done and insights you found?

1) The label 'Unknown' was used to replace missing values in the 'director', 'cast', and 'country' columns.

2) Rows with a missing 'date_added' value were dropped, and the mode was used to impute missing values in the 'rating' column.

3) To simplify the analysis, the primary country and primary genre were selected for each entry in the dataframe.

4) The 'duration' column was transformed by splitting the string value on whitespace and converting the leading number to an integer datatype.

5) To extract additional details, the 'date_added' column was converted to datetime format, and the new columns 'month_added' and 'year_added' were added to the dataframe.

6) A dictionary was created to map the current ratings to new rating groups, which then replaced the values in the 'rating' column.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

type_counts = df['type'].value_counts()        # Count the occurrences of each unique value in the 'type' column
plt.bar(type_counts.index, type_counts.values) # Create a bar chart of the type counts
plt.xlabel('Content Type')                     # Add labels and a title to the chart
plt.ylabel('Number of Titles')
plt.title('Distribution of Movies and TV Shows in Netflix Dataset')
plt.show()                                     # Show the chart


##### 1. Why did you pick the specific chart?

For displaying the distribution of categorical data, such as the number of movies and TV shows in the Netflix dataset, a bar chart is a suitable option. It facilitates simple comparison between categories and offers a clear visualization of the overall content type distribution in the dataset, making it a fitting choice for this particular dataset and research inquiry.

##### 2. What is/are the insight(s) found from the chart?

The Netflix dataset contains a higher count of movies compared to TV shows, indicating a preference towards movies on the platform. However, it is important to note that TV shows still make up a significant portion of the dataset, suggesting that Netflix also invests in producing and acquiring TV shows for its platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The bar chart provides valuable insights for Netflix's business decisions. Understanding that movies are the majority of the content in the dataset and that Netflix has a preference towards them can inform decisions related to content production and acquisition. This information may lead to a strategic allocation of resources towards producing and acquiring more movies, potentially attracting more viewers and subscribers.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Filter out the rows where the director is unknown, count the number of shows for each director, and plot the top 10
top_directors = df.loc[df['director'] != 'Unknown', 'director'].value_counts().nlargest(10)
plt.figure(figsize=(15,6))
colors = sns.color_palette('pastel', n_colors=10)
plt.bar(top_directors.index, top_directors.values, color=colors)
plt.title('Top 10 Directors by Number of Shows Directed')
plt.xlabel('Director')
plt.ylabel('Number of Shows')
plt.xticks(rotation=15)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is an effective way to visualize the top 10 directors in the Netflix dataset, ranked by the number of shows they have directed. This chart is useful for understanding the relationship between directors and the amount of content they have contributed to Netflix, and it provides insights into the most prominent directors on the platform. Additionally, the use of color in the chart enhances its visual appeal and readability.

##### 2. What is/are the insight(s) found from the chart?

1) Raul Campos and Jan Suter, credited together, are the top directors in the Netflix dataset, with the highest number of titles at 18.

2) Marcus Raboy is the second most prolific director, having directed 16 titles.

3) The majority of the top 10 directors have directed between 7 and 11 titles on Netflix.

4) Many of the top 10 directors are from the US, but not all: David Dhawan, for example, is from India. The dataset itself has no column for director nationality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The valuable insights gained from analyzing the top directors and their past work on Netflix may guide decisions related to content production and acquisition.

#### Chart - 3 CAST (Univariate Analysis)

In [None]:
# Chart - 3 visualization code
filtered_df = df[~(df['cast']=='Unknown')]                       # Filtering out unknown cast members
split_cast = filtered_df['cast'].str.split(', ', expand=True)    # split remaining cast into separate values
cast_values = split_cast.stack().reset_index(level=1, drop=True)
top_10_actors = cast_values.value_counts().nlargest(10)          #the top 10 actors by number of shows
top_10_actors.plot(kind='bar', figsize=(13,5))                   # Create a bar chart
plt.title('Top 10 Actors by Number of Shows', fontsize=14)       # Set chart title
plt.ylabel('Number of Shows', fontsize=12)                       #y-axis label
plt.xticks(rotation=15)                                          #x-axis label with rotation
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is used to display the top 10 actors in the Netflix dataset, based on the number of shows they appeared in. This chart is useful in examining the association between actors and the number of shows they have been part of on Netflix, and it can offer significant insights into the most in-demand actors on the platform.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals valuable insights about the popularity of actors on Netflix, including:

Anupam Kher is the most frequently appearing actor on the platform, having appeared in 42 shows in the dataset.

Shah Rukh Khan follows closely behind, having appeared in 35 shows.

The majority of the top 10 actors have appeared in 25-30 shows on Netflix.

With the exception of Takahiro Sakurai and Yuki Kaji from Japan, the top 10 actors are primarily from India.

These insights can inform decisions related to content production and acquisition by providing information about the most popular actors on Netflix and their past work.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1) The knowledge of the most frequently appearing actors on the platform can aid Netflix in acquiring or producing content that showcases these actors, potentially resulting in increased viewership and engagement on their platform.

2) These insights may assist in identifying the target audience for various titles, as different actors may appeal to different demographics, enabling Netflix to cater to their audience's preferences better.

3) The data can also help Netflix recognize patterns in its catalog, which can inform decisions regarding content acquisition and production.

#### Chart - 4 LISTED IN (Univariate Analysis)

In [None]:
# Chart - 4 visualization code
top_genres = df["listed_in"].value_counts().head(10)
plt.figure(figsize=(8,8))
plt.pie(top_genres, labels=top_genres.index, autopct='%1.1f%%', pctdistance=0.8, labeldistance=1.1,
        radius=1.2, wedgeprops=dict(width=0.5), startangle=90,
        textprops=dict(color="black", fontsize=12), counterclock=False)
plt.title('Top 10 Genres', fontsize=16)
plt.show()



##### 1. Why did you pick the specific chart?

I selected the pie chart because it effectively illustrates the distribution of the top 10 genres in the Netflix dataset. Pie charts are ideal for displaying relative proportions or percentages of categorical data, making them a suitable choice for this analysis.

##### 2. What is/are the insight(s) found from the chart?

The pie chart effectively illustrates the proportion of each genre among the top 10 genres in the Netflix dataset. By analyzing this chart, we can conclude that dramas are the most prevalent genre, followed by comedies and documentaries. The chart also highlights that the top 10 genres constitute a considerable portion of the dataset, emphasizing the importance of these genres in the streaming industry.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By using the insights gained from the chart, Netflix can gain a deeper understanding of their audience's content preferences. This information can be used to make data-driven decisions about which types of content to acquire and produce, ultimately leading to increased viewership and revenue. The pie chart provides valuable information about the most popular genres in the Netflix dataset, allowing Netflix to tailor their content strategy to better align with their audience's preferences.

#### Chart - 5 RATING (Univariate Analysis)

In [None]:
# Chart - 5 visualization code
df_rating = df['rating'].value_counts()
plt.figure(figsize=(8,8))
plt.pie(df_rating.values, labels=df_rating.index,
        autopct='%1.1f%%',startangle=90, counterclock=False)
plt.title('Distribution of Ratings')
plt.show()

##### 1. Why did you pick the specific chart?

I chose to use a pie chart to illustrate the distribution of content ratings in the Netflix dataset as it is an efficient way to depict the proportion of data in each category. The chart clearly displays the percentage of titles in each rating group, making it easy to observe that the largest share of titles falls in the Adults group, followed by Young Adults and Older Kids.

##### 2. What is/are the insight(s) found from the chart?

The chart illustrates the distribution of content ratings in the Netflix dataset. As per the chart, the largest share of titles falls in the Adults group (which, under the mapping applied earlier, covers TV-MA, R, NC-17, NR, and UR), at nearly 47% of all titles. Young Adults (TV-14) and Older Kids (TV-PG, PG, TV-Y7, and TV-Y7-FV) are the next most common groups, at approximately 25% and 17% of titles, respectively.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By gaining insights into the distribution of content ratings in the Netflix dataset, businesses can make informed decisions about their content acquisition and creation strategies. The fact that the Adults group is the most common suggests a strong supply of mature content on the platform, which could guide decisions about what types of content to acquire or produce for this audience. Additionally, understanding the distribution of ratings can help businesses tailor their marketing efforts to different demographics and promote content that is likely to be popular with specific age groups.

#### Chart - 6 COUNTRY (Univariate Analysis)

In [None]:
# Chart - 6 visualization code
# Get the top 10 countries with the highest number of movies and TV shows in the dataset
top_countries = df.loc[df['country'] != 'Unknown', 'country'].value_counts().nlargest(10)
plt.figure(figsize=(15,5))
colors = sns.color_palette('deep', n_colors=10)
plt.barh(top_countries.index, top_countries.values, color=colors) # Plot a horizontal bar chart
plt.title('Top 10 Countries with the Highest Number of Shows')
plt.xlabel('Number of Shows')
plt.ylabel('Country')
plt.show()

In [None]:
# Calculate the percentage share of shows by the top 3 and top 10 countries
top_3_share = top_countries.nlargest(3).sum() / len(df) * 100
top_10_share = top_countries.sum() / len(df) * 100

# Print the percentage shares
print(f"The top 3 countries account for {top_3_share:.2f}% of shows in the dataset.")
print(f"The top 10 countries account for {top_10_share:.2f}% of shows in the dataset.")


##### 1. Why did you pick the specific chart?

I chose this chart because it displays the top 10 countries with the highest number of movies and TV shows in the Netflix dataset, which provides valuable information for companies seeking to enter or expand in the global streaming market.

##### 2. What is/are the insight(s) found from the chart?

According to the chart, the United States dominates the production of movies and TV shows in the Netflix dataset with over 2,500 titles. India, the United Kingdom, and Canada are the next highest producing countries, each with around 500-1000 titles. The top 10 countries, which also include France, Japan, and Spain, account for 73.19% of shows, while the top 3 countries (USA, India, UK) contribute 56.69%. This information is valuable for businesses looking to understand the global streaming market and identify potential areas for expansion or investment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can have a significant impact on businesses in the streaming industry. Understanding that the US is the largest producer of movies and TV shows can help companies plan their content acquisition and marketing strategies. Furthermore, the knowledge that the top 3 countries account for more than half of the shows in the dataset can assist companies in targeting these markets to increase their viewership and expand their operations.


#### Chart - 7 MONTH ADDED (Univariate Analysis)



In [None]:
# Chart - 7 visualization code
# Plotting the Countplot
plt.figure(figsize=(10,8))
ax = sns.countplot(x='month_added', data=df)
plt.show()


##### 1. Why did you pick the specific chart?

This count plot, created using the seaborn library, displays the number of TV shows and movies added to Netflix for each month in the dataset. The plot provides a clear and concise visualization of the trends in content additions over time.

##### 2. What is/are the insight(s) found from the chart?

The count plot shows how many TV shows and movies were added to Netflix in each calendar month. From it we can identify the months with the highest and lowest number of additions to the platform, which can help inform decisions related to content acquisition and release schedule. Specifically, the chart indicates that the highest number of titles were added between October and January, while the number of additions was relatively low during the rest of the year. This information could be useful for Netflix to plan their content acquisition strategy and ensure a consistent flow of new releases throughout the year.
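
The same monthly comparison can be read off as numbers. A minimal sketch follows, using hypothetical month values in place of the `month_added` column; the sample data is an assumption made for illustration.

```python
import calendar
import pandas as pd

# Hypothetical month numbers standing in for the month_added column.
month_added = pd.Series([1, 12, 10, 12, 1, 7, 12])

# Count titles per month and index by all 12 months so empty months show as 0.
counts = month_added.value_counts().reindex(range(1, 13), fill_value=0)
counts.index = [calendar.month_abbr[m] for m in counts.index]

print(counts)
print("Busiest month:", counts.idxmax())
```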

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart provides valuable insights that could help Netflix identify areas for improvement. For instance, if there are months with low numbers of additions, Netflix may consider acquiring more content during those months or releasing more original content to fill the gaps. On the other hand, if there are months with a high number of additions, Netflix may need to adjust its release schedule to avoid overcrowding the platform with too much content at once. These actions could help ensure a consistent flow of content and prevent user fatigue, ultimately improving user engagement and satisfaction.

#### Chart - 8 TV SHOW VS NO OF SEASONS (Bivariate Analysis)

In [None]:
# Chart - 8 visualization code
tv_shows = df[df['type'] == 'TV Show']                    # Filter the dataframe to only include TV shows
plt.figure(figsize=(15, 7))
plt.hist(tv_shows['duration'], bins=20, edgecolor='black') # Histogram of the number of seasons per TV show
plt.xlabel('Number of Seasons', fontsize=12)
plt.ylabel('Number of Shows', fontsize=12)
plt.title('Number of Seasons per TV Show Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

The chart represents a histogram that displays how many TV shows in the dataset have a particular number of seasons. This information can be used to identify the most frequent number of seasons in TV shows and the spread of data.

##### 2. What is/are the insight(s) found from the chart?

1) Most TV shows have one to three seasons.

2) Few TV shows have more than 10 seasons.
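
Observations like these can be quantified with a cumulative share rather than read off the histogram by eye. The sketch below uses hypothetical season counts in place of `tv_shows['duration']`, so the printed numbers describe the toy data only.

```python
import pandas as pd

# Hypothetical season counts standing in for tv_shows['duration'].
seasons = pd.Series([1, 1, 1, 2, 2, 3, 4, 1, 11, 2])

# Share of shows per season count, then the running total in season order.
share = seasons.value_counts(normalize=True).sort_index()
cumulative = share.cumsum()

print(cumulative.round(2))
print("Share with at most 3 seasons:", round(float(cumulative.loc[3]), 2))
```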

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1) Knowing that most shows in the catalog run for only a few seasons could guide Netflix's decisions on the number of seasons to order for new shows, although the chart alone does not show whether shorter-running shows are more popular or successful.

2) Knowing the distribution of the number of seasons per TV show can be valuable in negotiating the length of a show with production companies.

#### Chart - 9 RELEASE YEAR VS TYPE (Bivariate Analysis)

In [None]:
# Chart - 9 visualization code
df_release_year = df.groupby(['release_year', 'type'])['show_id'].count().reset_index()
plt.figure(figsize=(12, 8))
sns.lineplot(data=df_release_year, x='release_year', y='show_id', hue='type')
plt.title('Number of Movies and TV Shows Released Each Year')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()



##### 1. Why did you pick the specific chart?


I selected this chart because it shows how the number of movies and TV shows in the dataset varies by release year. By using a line plot, this chart makes it simple to compare the trends in the release of movies and TV shows each year. Additionally, the chart's color coding allows for a quick visual comparison of the two types of content. Overall, this chart can be useful for understanding the relationship between the year of release and the number of movies and TV shows available on Netflix.

##### 2. What is/are the insight(s) found from the chart?

1) This chart provides insights into the trend of media content production over the years by showing the number of movies and TV shows released each year.

2) The line plot demonstrates a significant increase in the number of movies produced from the mid-2000s to 2020, while the number of TV shows produced has also increased but not as much as movies.

3) A dip in movie production in 2020 is also visible in the chart, which may be attributed to the COVID-19 pandemic.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the chart can have a significant impact on the business. It can provide valuable information for content creators, streaming platforms, and investors. For instance, the growing number of movies may reflect a shift in audience preferences towards movies, which can guide content creation and platform offerings. However, the dip in movie releases in 2020, possibly related to the COVID-19 pandemic, can have a negative impact if it leads to a shortage of new content for streaming platforms and reduced revenue for content creators.

#### Chart - 10 TYPE VS RELEASE YEAR (Bivariate Analysis)

In [None]:
# Chart - 10 visualization code
df.release_year.value_counts()

We can observe from the counts above that the number of shows released on Netflix has increased significantly in recent years, indicating that Netflix has gained more popularity in recent times.

In [None]:
# Chart - 10 visualization code
filtered_df = df[df['release_year'] >= 2008]
plt.figure(figsize=(15, 7))
sns.countplot(x='release_year', data=filtered_df, hue='type', order=range(2008, 2022))
plt.xlabel('Release Year')
plt.ylabel('Number of Shows')
plt.title('Number of Shows Released Each Year Since 2008 on Netflix')
plt.show()


##### 1. Why did you pick the specific chart?

The reason for selecting this chart is that it displays the trend in the yearly release of TV shows and movies since 2008, with a clear comparison of the two types of content.

##### 2. What is/are the insight(s) found from the chart?

1. This chart shows an upward trend in the yearly release of TV shows since 2008.

2. TV show releases have been steady over the years, with slight fluctuations.

3. The chart suggests a shift towards producing more original movie content on Netflix, based on the difference in the number of movies and TV shows released each year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Insights from this chart can positively impact Netflix's business by providing information on content production trends, allowing them to make strategic decisions regarding future content type and quantity. This chart does not provide any insights that could lead to negative growth.

#### Chart - 11 COUNTRY VS SHOW ID (Bivariate Analysis)

In [None]:
# Chart - 11 visualization code
# group the data by country and type, and count the number of shows
df_country = df.groupby(['country', 'type'])['show_id'].count().reset_index()
df_country = df_country.sort_values(by='show_id', ascending=False)            # sort the data in descending order
plt.figure(figsize=(15, 6))
sns.barplot(data=df_country[:20], x='country', y='show_id', hue='type')       # plot the 20 largest (country, type) groups
plt.xticks(rotation=90)
plt.legend(loc='upper right')
plt.title('Top 20 Countries by Number of Shows on Netflix')
plt.xlabel('Country')
plt.ylabel('Number of Shows')
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart for its ability to provide valuable insights into the top 20 countries with the highest number of shows on Netflix, categorized by country and show type. These insights can be leveraged to inform content acquisition and localization strategies for Netflix.

##### 2. What is/are the insight(s) found from the chart?

The chart displays the top 20 countries with the highest number of shows on Netflix, providing valuable information for content acquisition and localization strategies. The United States has the largest amount of content available on Netflix with over 2,000 shows, followed by India with over 800 shows and the United Kingdom with over 300 shows. In terms of show types, most of the content in these countries is movies, while the United Kingdom has a roughly equal split between movies and TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Analyzing the chart can lead to a positive business impact by guiding content acquisition and localization strategies. For instance, companies can prioritize acquiring content from countries with the highest number of shows on Netflix to expand their content library in a specific region. Additionally, localization strategies such as dubbing or subtitling can be developed for these regions to increase viewership and revenue.

#### Chart - 12 CAST VS TV SHOW (Bivariate Analysis)

In [None]:
# Chart - 12 visualization code
# Selecting TV shows with known cast information
tv_shows = df[(df['type'] == 'TV Show') & ~(df['cast'] == 'Unknown')]
# Counting the number of TV shows each actor has appeared in
actor_counts = tv_shows['cast'].str.split(', ').explode().value_counts()
top_actors = actor_counts.head(10)             # Selecting the top 10 actors with the most TV show appearances
plt.figure(figsize=(15, 6))                    # Creating a horizontal bar plot of the top actors
plt.barh(top_actors.index, top_actors.values, color='purple')
plt.xlabel('Number of TV Shows', fontsize=12)
plt.ylabel('Actor Name for TV Shows', fontsize=12)
plt.title('Actors with the Most TV Show Appearances', fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

The horizontal bar plot represents the top 10 actors with the highest number of appearances in TV shows, providing insights into the most frequently cast actors in this medium. This chart could be valuable for individuals in the entertainment industry or those interested in popular culture who want to stay updated on which actors are in demand for TV show roles.

##### 2. What is/are the insight(s) found from the chart?

The chart depicts the top 10 actors with the highest number of TV show appearances on Netflix, indicating that Takahiro Sakurai has the most appearances, followed by Yuki Kaji and Daisuke Ono. This information could be leveraged to identify popular actors that could attract audiences for new TV show releases, making the insights valuable for content creators and streaming platforms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights gained from this chart could positively impact the selection of actors for TV shows. Casting popular actors could potentially increase viewership and have a positive impact on business. However, it is important to note that an actor's popularity alone is not a guarantee of success, as the quality of the TV show is also a significant factor.

#### Chart - 13 WORDCLOUD

In [None]:
# Chart - 13 visualization code
# Join all the movie descriptions together into a single string
comment_words = ' '.join(df['description'].astype(str).str.lower())
wc_stopwords = set(STOPWORDS)                                # Define the stopwords (named so it does not shadow nltk's `stopwords`)
wordcloud = WordCloud(width=800, height=800,
                      background_color='black',
                      stopwords=wc_stopwords,
                      min_font_size=10).generate(comment_words)
plt.figure(figsize=(10,5), facecolor=None)                   # Plot the word cloud
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

##### 1. Why did you pick the specific chart?

This code generates a word cloud using the descriptions of shows and movies in the Netflix dataset, providing a quick and easy way to visualize the most frequent words and themes in the content. This insight can be useful for understanding the overall genre and content themes available on Netflix.

##### 2. What is/are the insight(s) found from the chart?

The word cloud generated from the Netflix dataset provides valuable insights into the frequently occurring words and phrases in the descriptions of shows and movies. This information can be used to identify the popular themes and genres among Netflix users, and also to discover unique keywords and phrases for marketing purposes. The most common words in the word cloud include "life", "family", "friend", "love", and others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights gained from the word cloud can benefit Netflix by enabling them to understand their users' interests better and personalize their content to meet those interests. This can result in a positive business impact by allowing Netflix to create more effective marketing campaigns and enhance the overall user experience.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,8))
correlation = df.corr(numeric_only=True)   # restrict to numeric columns (required by pandas >= 2.0)
sns.heatmap(correlation, annot=True, cmap=sns.color_palette("mako", as_cmap=True))
plt.show()


##### 1. Why did you pick the specific chart?

This heatmap provides a visual representation of the correlation coefficients among the numerical columns in the Netflix dataset. Positive correlation is represented by lighter colors, while negative correlation is represented by darker colors. The values of the correlation coefficients are also displayed within each cell of the heatmap, thanks to the annotation parameter being set to True.

##### 2. What is/are the insight(s) found from the chart?

This heatmap is a useful tool for identifying the relationships between different variables in the Netflix dataset. By analyzing the heatmap, we can easily see which variables have a strong positive or negative correlation with each other, which can help in making predictions and building machine learning models.

Overall, this heatmap can provide valuable insights into the relationships between different variables in the Netflix dataset.

1.   Duration and release year are negatively correlated, with a coefficient of about -0.24.

2.   Year added and release year are positively correlated, with a coefficient of about 0.10.
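
Each cell of the heatmap is a Pearson correlation coefficient, a value between -1 and 1 rather than a percentage. The sketch below recomputes one such coefficient by hand and compares it with `DataFrame.corr`. The toy values are made up for illustration; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are not taken from the Netflix dataset
toy = pd.DataFrame({
    "release_year": [2010, 2012, 2015, 2018, 2020],
    "duration":     [120, 110, 100, 95, 90],
})

# Pearson r = sum of co-deviations / sqrt(product of the sums of squared deviations)
xs = toy["release_year"].to_numpy(dtype=float)
ys = toy["duration"].to_numpy(dtype=float)
r_manual = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sqrt(
    np.sum((xs - xs.mean()) ** 2) * np.sum((ys - ys.mean()) ** 2)
)

# The same quantity as pandas computes it (both toy columns are numeric)
r_pandas = toy.corr().loc["release_year", "duration"]
print(round(r_manual, 4), round(r_pandas, 4))
```

A negative value here means that, in the toy data, newer titles tend to be shorter; the sign is read the same way in the heatmap.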


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df)
plt.show()

##### 1. Why did you pick the specific chart?

The pair plot provides a comprehensive view of the numerical variables in the dataset, allowing for a deeper exploration of the relationships and patterns between each pair of variables. With scatter plots and histograms, it can reveal insights into the distribution and correlations of the data, which can aid in making informed decisions and identifying trends in the dataset.

##### 2. What is/are the insight(s) found from the chart?

1. The diagonal plots in the pair plot show the distribution and range of each variable. The duration of movies and TV shows appears to be concentrated in certain ranges.

2. The off-diagonal panels show a scatter plot for each pair of variables; the pair plot itself does not print correlation coefficients. Any apparent trend between release year and duration should be checked against the correlation heatmap (Chart 14), which reports a negative coefficient of about -0.24 for this pair.

3. Outliers in the data can be identified from the scatter plots in the pair plot. One movie in the dataset appears to have an unusually long duration compared to the rest.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothesis 1**: How does the average duration of movies compare to TV shows on Netflix

**Hypothesis 2**: How does the average number of seasons for TV shows on Netflix vary between those produced in the United States and those produced outside of the United States

**Hypothesis 3**: The quantity of TV shows added to Netflix has grown progressively over time.

### Hypothetical Statement - 1 How does the average duration of movies compare to TV shows on Netflix

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis(H0) - There is no significant difference in the average duration of movies and TV shows on Netflix.

Alternative Hypothesis(H1) - There is a significant difference in the average duration of movies and TV shows on Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
from scipy.stats import ttest_ind

# Extract the durations of movies and TV shows from the dataset
movie_durations = df[df['type'] == 'Movie']['duration']
tv_show_durations = df[df['type'] == 'TV Show']['duration']

# Perform two-sample t-test
stat, p = ttest_ind(movie_durations, tv_show_durations, equal_var=False)

# Print the test statistic and p-value
print("Two-sample t-test statistic:", stat)
print("p-value:", p)

# Interpret the result
alpha = 0.05
if p > alpha:
    print("Failed to reject null hypothesis.")
else:
    print("Reject null hypothesis.")



There is a significant difference in the average duration of movies and TV shows on Netflix.


##### Which statistical test have you done to obtain P-Value?

We chose the two-sample t-test as our statistical test since we are comparing the means of two independent samples (movie durations and TV show durations) and want to determine if the difference between the sample means is statistically significant or simply due to chance. The test's p-value provides insight into the significance of the difference.

##### Why did you choose the specific statistical test?

The two-sample t-test was used to compare the means of movie and TV show durations, assuming that the samples are independent and approximately normally distributed. Because `equal_var=False` is passed, the test is Welch's t-test, which does not assume that the two samples have equal variances; this is appropriate given the differences in content between movies and TV shows.
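
To make the role of `equal_var=False` concrete, here is a minimal sketch of Welch's statistic computed by hand and compared with `scipy.stats.ttest_ind`. The two samples are synthetic and their names (`a`, `b`) are assumptions for illustration; they are not the notebook's duration columns.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical samples with different means, spreads, and sizes
a = rng.normal(loc=100, scale=25, size=200)
b = rng.normal(loc=90, scale=5, size=80)

# Welch's statistic: difference of means over the unpooled standard error
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

# equal_var=False asks SciPy for the same unpooled (Welch) version
t_scipy, p_value = ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy, p_value)
```

The pooled-variance version (`equal_var=True`) would use a single shared variance estimate instead, which can mislead when the groups differ in spread and size.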

### Hypothetical Statement - 2 How does the average number of seasons for TV shows on Netflix vary between those produced in the United States and those produced outside of the United States

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis : There is no significant difference in the average number of seasons for TV shows on Netflix between those produced in the United States and those produced outside of the United States.

Alternate hypothesis : There is a significant difference in the average number of seasons for TV shows on Netflix between those produced in the United States and those produced outside of the United States.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
from scipy.stats import ttest_ind

# Extract the number of seasons for TV shows produced in the US and outside the US.
# The leading integer is parsed whether duration is stored as a number or as text such as "2 Seasons".
us_shows = df[(df['type'] == 'TV Show') & (df['country'] == 'United States')]
us_shows_seasons = pd.to_numeric(us_shows['duration'].astype(str).str.extract(r'(\d+)', expand=False)).dropna()

non_us_shows = df[(df['type'] == 'TV Show') & (df['country'] != 'United States')]
non_us_shows_seasons = pd.to_numeric(non_us_shows['duration'].astype(str).str.extract(r'(\d+)', expand=False)).dropna()

# Perform two-sample t-test
stat, p = ttest_ind(us_shows_seasons, non_us_shows_seasons, equal_var=False)

# Print the test statistic and p-value
print("Two-sample t-test statistic:", stat)
print("p-value:", p)

# Interpret the result
alpha = 0.05
if p > alpha:
    print("Failed to reject null hypothesis.")
else:
    print("Reject null hypothesis.")

There is a significant difference in the average number of seasons for TV shows on Netflix between those produced in the United States and those produced outside of the United States.
     

##### Which statistical test have you done to obtain P-Value?

This analysis involves using a two-sample t-test, which is a statistical test used to compare the means of two independent samples and determine whether they are significantly different from each other.

##### Why did you choose the specific statistical test?

We utilized a two-sample t-test to compare the mean number of seasons between TV shows produced in the US and those produced outside the US. The choice of this test was based on our goal of identifying if there is a significant difference between the two groups' means. We also took into consideration the potential inequality of variances between the two groups by setting the equal_var parameter to False in the ttest_ind() function.
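
Before the t-test can compare season counts, the counts have to be numeric. The sketch below shows one way to parse them, assuming the raw values follow an "N Season(s)" pattern (the sample strings are hypothetical). Note that a case-sensitive check such as `'season' in value` does not match "2 Seasons", so the match is made case-insensitive here.

```python
import pandas as pd

# Hypothetical raw values in the "N Season(s)" style, plus one movie-style entry
raw = pd.Series(["1 Season", "2 Seasons", "10 Seasons", "90 min"])

# Keep only rows that mention seasons (case-insensitive), then pull the leading integer
is_season = raw.str.contains("season", case=False)
seasons = raw[is_season].str.extract(r"(\d+)", expand=False).astype(int)
print(seasons.tolist())
```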

### Hypothetical Statement - 3 The quantity of TV shows added to Netflix has grown progressively over time.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis: The mean number of TV shows added to Netflix per year has not changed over time.

Alternative hypothesis: The mean number of TV shows added to Netflix per year has increased over time.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
from scipy import stats
from scipy.stats import ttest_ind

# Extract the year from the date_added column
df['year_added'] = pd.DatetimeIndex(df['date_added']).year

# Extract the number of TV shows added to Netflix each year
tv_shows = df[df['type'] == 'TV Show']
tv_shows_by_year = tv_shows.groupby('year_added').size()

# Perform a linear regression; linregress returns a two-sided p-value for H0: slope == 0
slope, intercept, r_value, p_value, std_err = stats.linregress(tv_shows_by_year.index, tv_shows_by_year)

# Print the slope and the p-value
print("slope:", slope)
print("p-value:", p_value)

# Interpret the result: an increase over time needs a significant and positive slope
alpha = 0.05
if p_value > alpha or slope <= 0:
    print("Failed to reject null hypothesis.")
else:
    print("Reject null hypothesis.")

##### Which statistical test have you done to obtain P-Value?

The code uses the stats.linregress function to perform a linear regression that tests for a positive slope in the number of TV shows added to Netflix each year. The resulting p-value measures the strength of evidence against the null hypothesis that the slope is zero, indicating no increase over time, and in favor of the alternative hypothesis that there is a positive slope indicating an increase over time.

##### Why did you choose the specific statistical test?

This code utilizes a linear regression with a hypothesis test on the slope coefficient to test for a trend over time. This is suitable because the objective is to model the relationship between the year and the number of TV shows added to Netflix, and a linear regression is well-suited for this purpose. A small p-value provides evidence against the null hypothesis of a zero slope; combined with a positive slope estimate, it supports the alternative hypothesis that there is a positive trend.
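
The slope test can be sketched on synthetic data as follows. The yearly counts are invented for illustration. Since `linregress` reports a two-sided p-value by default, the sketch also derives a one-sided value for the alternative "slope > 0" by halving it when the estimated slope is positive.

```python
import numpy as np
from scipy import stats

# Hypothetical yearly counts with an upward trend
years = np.arange(2010, 2021)
counts = np.array([5, 8, 7, 12, 15, 20, 40, 90, 150, 200, 260])

result = stats.linregress(years, counts)

# Two-sided p-value tests H0: slope == 0.
# For the one-sided alternative "slope > 0", halve it when the slope is positive.
if result.slope > 0:
    p_one_sided = result.pvalue / 2
else:
    p_one_sided = 1 - result.pvalue / 2
print(result.slope, result.pvalue, p_one_sided)
```

A straight line is only a rough model for counts that grow quickly, so the p-value is best read as evidence of a trend rather than as a description of its shape.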

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

I have already handled all the missing values in the data wrangling section.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
sns.boxplot(data=df)

##### What all outlier treatment techniques have you used and why did you use those techniques?

No need to handle the outliers.

### 3. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Combine the text columns into a single field (no contraction expansion is applied in this step)
df['organized'] =(df['description'].astype(str) + ' ' +
                  df['listed_in'].astype(str)   + ' ' +
                  df['rating'].astype(str)      + ' ' +
                  df['cast'].astype(str)        + ' ' +
                  df['country'].astype(str)     + ' ' +
                  df['director'].astype(str))


In [None]:

df.organized[0]

#### 2. Lower Casing

In [None]:
# Lower Casing
df['Lower_casing']= df['organized'].str.lower()

In [None]:
df.Lower_casing[0]

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
[punc for punc in string.punctuation]

In [None]:

def remove_punctuation(text):
    # remove punctuation from text
    return text.translate(str.maketrans('', '', string.punctuation))

In [None]:

df['cleaned_text'] = df['Lower_casing'].apply(remove_punctuation)

In [None]:
df.cleaned_text[0]

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words that contain digits

In [None]:

import re

In [None]:
def cleaned(x):
    return re.sub(r"[^a-zA-Z ]", "", str(x))

def remove_urls(text):
    cleaned_text = re.sub(r'http\S+', '', text)
    return cleaned_text

def remove_digits(text):
    cleaned_text = re.sub(r'\w*\d\w*', '', text)
    return cleaned_text

In [None]:
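# URLs and digit-bearing words are stripped first; the letters-only filter runs last,
# because it would otherwise delete the digits that remove_digits looks for
df['removed_url']    = df['cleaned_text'].apply(remove_urls)
df['removed_words']  = df['removed_url'].apply(remove_digits)
df['removed_digits'] = df['removed_words'].apply(cleaned)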
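
The order in which these helpers run changes the result. The sketch below applies the same three regular expressions to a made-up sentence in two different orders; `keep_letters` is a hypothetical name standing in for `cleaned`.

```python
import re

def keep_letters(text):
    # Same pattern as cleaned(): drop everything except ASCII letters and spaces
    return re.sub(r"[^a-zA-Z ]", "", str(text))

def remove_urls(text):
    return re.sub(r"http\S+", "", text)

def remove_digits(text):
    # Drops every token that contains at least one digit
    return re.sub(r"\w*\d\w*", "", text)

sample = "watch season2 at http://example.com now"

# Letters-only filter first: the digit vanishes, so "season2" survives as "season"
letters_first = remove_digits(remove_urls(keep_letters(sample)))
# URL and digit filters first: "season2" is removed as a whole token
tokens_first = keep_letters(remove_digits(remove_urls(sample)))

print(letters_first.split())
print(tokens_first.split())
```

Which behaviour is preferable depends on whether fragments such as "season" from "season2" are wanted in the vocabulary.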

#### 5. Removing Stopwords & Removing White spaces

In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Build the stop-word set once instead of once per row
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    words = [word for word in words if word not in stop_words]
    cleaned_text = ' '.join(words)
    return cleaned_text

df['removed_stopwords'] = df['removed_digits'].apply(remove_stopwords)
print(df.removed_stopwords[0])


In [None]:
# Remove White spaces
def remove_whitespaces(text):
    cleaned_text = text.strip()
    return cleaned_text

In [None]:
df['removed_whitespaces']=df['removed_stopwords'].apply(remove_whitespaces)
df['removed_whitespaces'].head()

#### 6. Tokenization

In [None]:
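# Tokenization
nltk.download('punkt')   # tokenizer data used by word_tokenize (recent NLTK releases name it 'punkt_tab')

def tokenize_text(text):
    tokens = nltk.word_tokenize(text)
    return tokens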


In [None]:
df['tokenized'] = df['removed_whitespaces'].apply(tokenize_text)

In [None]:
df['tokenized'].head()

#### 7. Text Normalization

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
def normalize_text(tokens):
    stemmer = SnowballStemmer('english')          # apply stemming to tokens
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    lemmatizer = WordNetLemmatizer()              # apply lemmatization
    normalized_tokens = [lemmatizer.lemmatize(token) for token in stemmed_tokens]
    normalized_text = ' '.join(normalized_tokens) # join normalized tokens
    return normalized_text

In [None]:
df['normalized'] = df['tokenized'].apply(normalize_text)

In [None]:
df['normalized'].head()

##### Which text normalization technique have you used and why?

Stemming and lemmatization are two techniques used in NLP to normalize text by converting words to their base or canonical form. They are used to reduce the complexity of data, improve search results, decrease the vocabulary size, and enhance model accuracy. Stemming is an aggressive technique that chops off word endings, while lemmatization is a more sophisticated approach that considers the morphology of words to bring them to their base form.
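
As a small illustration of how aggressive stemming is, the sketch below runs NLTK's `SnowballStemmer` on a few made-up words. Only the stemmer is shown because it is rule-based and needs no corpus download; the word list is an assumption for illustration.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")  # rule-based; needs no corpus download

words = ["running", "studies", "families", "movies"]
stems = [stemmer.stem(word) for word in words]
print(dict(zip(words, stems)))
```

Stems such as "studi" are not dictionary words. Because `normalize_text` lemmatizes the stems rather than the original tokens, the lemmatizer will often have nothing left to look up, so most of the normalization in this pipeline is likely done by the stemmer.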

#### 8. Text Vectorization

In [None]:
# Vectorizing Text

new_df = df[['title', 'normalized']]
new_df.head()

In [None]:
#using tfidf
from sklearn.feature_extraction.text import TfidfVectorizer

t_vectorizer = TfidfVectorizer(max_features=20000)
x= t_vectorizer.fit_transform(new_df['normalized'])

x.shape

##### Which text vectorization technique have you used and why?

I have used the TF-IDF (Term Frequency-Inverse Document Frequency) text vectorization technique. This technique is commonly used for text classification and information retrieval tasks. It assigns weights to each word in the document based on its frequency and rarity across the corpus. This helps to highlight the most important words in the document and down-weight the common words that do not provide much useful information for the analysis.
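
The weighting can be seen on a toy corpus. In the sketch below (three invented documents), the word "family" appears in every document and "heist" in only one, so the learned inverse-document-frequency weight of "heist" is higher.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "family" is in every document, "heist" in only one
docs = [
    "family drama love",
    "family comedy friend",
    "family crime heist",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# idf_ holds the inverse-document-frequency weight learned for each term
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}
print(tfidf.shape, idf["family"], idf["heist"])
```

With the default smoothing, a term that occurs in every document gets the minimum weight of 1.0, which is why very common words contribute little to the distance between titles.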

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

Not Needed

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

Not Needed

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?


YES

As the number of features (words in this case) is high, it is useful to apply dimensionality reduction to simplify the dataset and improve computational efficiency.

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(x.toarray())


In [None]:
# Calculate the cumulative explained variance ratio
cumulative_var_ratio = np.cumsum(pca.explained_variance_ratio_)

# Plot the cumulative explained variance ratio versus the number of components
plt.figure(figsize=(5, 5), dpi=120)
plt.plot(cumulative_var_ratio)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.show()

In [None]:
pca_tuned = PCA(n_components=0.95)
x_dense = x.toarray()
pca_tuned.fit(x_dense)
x = pca_tuned.transform(x_dense)
print(x.shape)

In [None]:
x

Which dimensionality reduction technique have you used and why?
(If dimensionality reduction done on dataset.)

This code utilizes PCA (Principal Component Analysis) for the purpose of dimensionality reduction. Initially, a PCA model is fitted to the data without specifying the number of components, allowing us to obtain the explained variance ratio for each component. This information is used to determine the appropriate number of components to retain.

Once the number of components is determined, a new PCA model is created with n_components set to 0.95, indicating that we want to retain enough components to explain 95% of the variance in the data. Finally, the original data is transformed using the transform method to produce a reduced-dimensionality dataset.

The ultimate goal of this code is to effectively reduce the dimensionality of the text data while preserving important information, thereby improving the efficiency of subsequent analysis.
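
The effect of `n_components=0.95` can be checked on synthetic data. The sketch below builds a hypothetical matrix driven by three latent factors, lets PCA pick the number of components, and confirms that this is the smallest number whose cumulative explained variance reaches 95%. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical data: 200 samples, 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# A float in (0, 1) asks PCA to keep enough components to reach that variance share
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
retained = pca.explained_variance_ratio_.sum()

# Same answer from the cumulative curve of a full PCA
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
k_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(X_reduced.shape, round(retained, 4), k_needed)
```

For a large sparse TF-IDF matrix, converting to a dense array first can be memory-hungry; the notebook's approach works when the matrix fits in memory.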

## ***7. ML Model Implementation***

In [None]:
from tabulate import tabulate
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

def evaluate_clustering_model(model, X, y_predict):
    """
    Evaluate a clustering model and print the results.

    Returns
    -------
    dict with the number of clusters and the three evaluation scores
    """
    # Calculate the number of clusters and evaluation metrics
    n_clusters = len(set(y_predict))
    S_score = silhouette_score(X, y_predict)
    CH_score = calinski_harabasz_score(X, y_predict)
    DB_score = davies_bouldin_score(X, y_predict)

    # Print the evaluation results
    print(f"Number of clusters: {n_clusters}")
    print(f"Silhouette score: {S_score:.4f}")
    print(f"Calinski-Harabasz score: {CH_score:.4f}")
    print(f"Davies-Bouldin score: {DB_score:.4f}")

    # Create a dictionary to store the evaluation scores
    scores_dict = {"silhouette_score": S_score,
                   "calinski_harabasz_score": CH_score,
                   "davies_bouldin_score": DB_score}

    # Create a dataframe to display the evaluation results
    df_eval = pd.DataFrame({"Evaluation Metric": ["Silhouette Score",
                                                  "Calinski-Harabasz Score",
                                                  "Davies-Bouldin Score"],
                                     "Score": [S_score, CH_score, DB_score]})

    # Print the dataframe
    print(tabulate(df_eval, headers="keys", tablefmt="grid"))

    # Return the evaluation results
    return {"n_clusters": n_clusters,
            "silhouette_score": S_score,
            "calinski_harabasz_score": CH_score,
            "davies_bouldin_score": DB_score}

def plot_clustering_scores(scores_dict):
    """
    Plot the clustering evaluation scores using a bar chart.
    """
    # Extract the scores from the dictionary
    scores = [scores_dict["silhouette_score"], scores_dict["calinski_harabasz_score"], scores_dict["davies_bouldin_score"]]
    labels = ["Silhouette", "Calinski-Harabasz", "Davies-Bouldin"]

    # Plot the scores as a bar chart
    fig, ax = plt.subplots()
    ax.bar(labels, scores, color=["tab:blue", "tab:orange", "tab:green"])

    # Add labels and titles
    ax.set_xlabel("Evaluation Metric")
    ax.set_ylabel("Score")
    ax.set_title("Clustering Evaluation Scores")

    # Set the y-axis limits to the range of the scores
    ax.set_ylim([np.min(scores) - 0.1, np.max(scores) + 0.1])

    # Display the plot
    plt.show()

### **ML Model - 1) K-MEANS CLUSTERING**

In [None]:
# Just guessing and checking by k=3

In [None]:
kmeans = KMeans(n_clusters = 3, max_iter = 50)
kmeans.fit(x)



In [None]:
# checking labels
kmeans.labels_

In [None]:
np.array(x)[:, 0]

In [None]:

import seaborn as sns

In [None]:

sns.scatterplot(x=np.array(x)[:, 0], y=np.array(x)[:, 2], hue=kmeans.labels_)
#sns.scatterplot(np.array(x)[:, 0], np.array(x)[:, 2], c = kmeans.labels_)

To Find Optimum Numbers of Clusters


*   Elbow Method
*   Silhouette Score

In [None]:

# Determine the optimal number of clusters

In [None]:
# Create a list to store the sum of squared errors for each K value
Sum_of_Squared_Errors = []

# Iterate over range of K values and compute SSE for each value
for k in range(1, 10):
    # Initialize the k-means model with the current value of K
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    # Fit the model to the data
    kmeans.fit(x)
    # Compute the sum of squared errors for the model
    Sum_of_Squared_Errors.append(kmeans.inertia_)

# Plot the SSE values against the range of K values
plt.plot(range(1, 10), Sum_of_Squared_Errors)
plt.title('Elbow Method for K-Means Clustering')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.show()

I have narrowed down the possible number of clusters to between 4 and 7, because the slope of the elbow plot changes most noticeably in this range. To determine the optimal number of clusters, I will check the silhouette score for each candidate value and choose the one with the highest score.

In [None]:
def silhouette_score_analysis(n):
  for k in range(2, n):
    # Initialize the k-means model with the current value of k
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    # Fit the model and get the cluster label of each point in one step
    preds = kmeans.fit_predict(x)
    # Compute the mean silhouette score for this k
    score = silhouette_score(x, preds, metric='euclidean')
    print("For n_clusters = {}, silhouette score is {}".format(k, score))

    visualizer = SilhouetteVisualizer(kmeans)

    visualizer.fit(x)   # Fit the data to the visualizer
    visualizer.show()   # Draw the silhouette plot (show() replaces the deprecated poof())


In [None]:
silhouette_score_analysis(10)

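In the above silhouette visualizations, the average silhouette score is positive for every k. Since the score ranges from -1 to 1, a positive value means that points lie, on average, closer to their own cluster than to the nearest neighbouring one; the closer the score is to 1, the better defined and separated the clusters are.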

**Silhouette score method to find the optimal value of k**

In [None]:
# Silhouette score method to find the optimal value of k
# Initialize a list to store the silhouette score for each value of k
silhouette_avg = []

# Define a list of possible number of clusters
range_n_clusters = [2, 3, 4, 5, 6, 7]

# Loop through each value of k
for n_clusters in range_n_clusters:
    # Initialize the k-means model with the current value of k
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42)
    # Fit the model to the data
    kmeans.fit(x)
    # Predict the cluster labels for each point in the data
    labels = kmeans.labels_
    # Compute the silhouette score for the model
    score = silhouette_score(x, labels)
    # Append the silhouette score to the list of scores
    silhouette_avg.append(score)
    # Print the silhouette score for the current value of k
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

# Plot the Silhouette analysis
plt.plot(range_n_clusters, silhouette_avg)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k - KMeans clustering')
plt.show()


Since k = 7 has the highest silhouette score among the candidates, we set the number of clusters to 7.
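
The same choice can be made programmatically instead of reading it off the plot. This is a small sketch; `pick_best_k` is a hypothetical helper, and the scores below are placeholders for illustration, not the values produced by this notebook.

```python
import numpy as np

def pick_best_k(candidate_ks, scores):
    """Return the candidate k with the highest silhouette score (first one wins on ties)."""
    return candidate_ks[int(np.argmax(scores))]

# Placeholder scores for illustration only; these are NOT the notebook's results
example_ks = [2, 3, 4, 5, 6, 7]
example_scores = [0.10, 0.12, 0.11, 0.15, 0.14, 0.18]
best_k = pick_best_k(example_ks, example_scores)
print("Best k:", best_k)
```

In the notebook itself, the call would presumably be `pick_best_k(range_n_clusters, silhouette_avg)`.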

In [None]:
# ML Model - 1 Implementation
# Initialize the KMeans model with the chosen number of clusters
kmeans_model = KMeans(n_clusters=7, random_state=42)
kmeans_model.fit(x)               # Fit the Algorithm
y_kmeans = kmeans_model.predict(x)# Predict on the model
labels = kmeans_model.labels_     # Get the cluster labels for each point in the data
unique_labels = np.unique(labels) # Get the unique cluster labels


In [None]:
# Adding a k-means cluster number attribute
df['kmeans_cluster'] = labels


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
scores_dict_kmeans = evaluate_clustering_model(kmeans_model, x, y_kmeans)

In [None]:
plot_clustering_scores(scores_dict_kmeans)

In [None]:
# Create a scatter plot of the data colored by cluster label
plt.figure(figsize=(8, 6), dpi=120)
for i in unique_labels:
    plt.scatter(x[labels == i, 0], x[labels == i, 1], s=20, label='Cluster {}'.format(i))
plt.scatter(kmeans_model.cluster_centers_[:, 0], kmeans_model.cluster_centers_[:, 1], s=100, marker='x', c='black', label='Cluster centers')
plt.title('KMeans clustering with {} clusters'.format(len(unique_labels)))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

In [None]:
def kmeans_wordcloud(cluster_number, column_name):
    '''function for Building a wordcloud for the movie/shows'''

    # Filter the data by the specified cluster number and column name
    df_wordcloud = df[['kmeans_cluster', column_name]].dropna()
    df_wordcloud = df_wordcloud[df_wordcloud['kmeans_cluster'] == cluster_number]

    # Combine all text documents into a single string
    text = " ".join(word for word in df_wordcloud[column_name])

    # Create the word cloud
    wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="white").generate(text)

    # Convert the wordcloud to a numpy array
    image_array = wordcloud.to_array()

    # Return the numpy array
    return image_array

In [None]:
fig, axs = plt.subplots(nrows=6, ncols=7, figsize=(20, 15))

for i in range(7):
    for j, col in enumerate(['description', 'cast', 'director', 'listed_in', 'country', 'title']):
        axs[j][i].imshow(kmeans_wordcloud(i, col))
        axs[j][i].axis('off')
        axs[j][i].set_title(f'Cluster {i}, {col}')

plt.tight_layout()
plt.show()

### ML Model - 2) HIERARCHICAL CLUSTERING

In [None]:
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(16, 7))
dend = shc.dendrogram(shc.linkage(x, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Netflix Shows')
plt.ylabel('Distance')
plt.axhline(y= 5, color='r', linestyle='--')
plt.show()

The horizontal line at a distance of 5 cuts five branches of the dendrogram, so I chose 5 as the number of clusters.
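
Counting the branches by eye can be cross-checked in code. The sketch below uses SciPy's `fcluster` on synthetic blobs; the toy data and the cut height of 10 are assumptions for illustration and are unrelated to the cut at 5 used above.

```python
import numpy as np
import scipy.cluster.hierarchy as shc
from sklearn.datasets import make_blobs

# Toy data: three compact blobs (illustrative only, not the Netflix features)
X_toy, _ = make_blobs(n_samples=60,
                      centers=[[0, 0], [10, 0], [0, 10]],
                      cluster_std=0.5,
                      random_state=0)

# Same linkage as the dendrogram above
Z = shc.linkage(X_toy, method='ward')

# Cut the tree at a chosen height and count the resulting flat clusters
cut_height = 10.0  # hypothetical threshold for the toy data
flat_labels = shc.fcluster(Z, t=cut_height, criterion='distance')
n_clusters_at_cut = len(np.unique(flat_labels))
print("Clusters below the cut:", n_clusters_at_cut)
```

Applied to the notebook's data, the equivalent call would be `shc.fcluster(shc.linkage(x, method='ward'), t=5, criterion='distance')`.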

**AgglomerativeClustering**

In [None]:
# ML Model - 2  Implementation
# Initialize the hierarchical model with the chosen number of clusters
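# Ward linkage only supports Euclidean distance, so no distance argument is passed
# (the `affinity` parameter was renamed to `metric` and removed in newer scikit-learn releases)
hierarchical_model = AgglomerativeClustering(n_clusters=5, linkage='ward')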
y_hierarchical = hierarchical_model.fit_predict(x)# Fit and predict on the model
hierarchical_labels = hierarchical_model.labels_  # Get the cluster labels for each point in the data
unique_labels_h = np.unique(hierarchical_labels)  # Get the unique cluster labels
silhouette_avg = silhouette_score(x, hierarchical_labels)   # Calculate the silhouette score
print("The average silhouette_score is :", silhouette_avg)


In [None]:
df['hierarchical_cluster'] = hierarchical_labels


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
scores_dict_hierarchical = evaluate_clustering_model(hierarchical_model, x, y_hierarchical)

In [None]:
plot_clustering_scores(scores_dict_hierarchical)

In [None]:
# Create a scatter plot of the data colored by cluster label
plt.figure(figsize=(8, 6), dpi=120)
for i in unique_labels_h:
    plt.scatter(x[hierarchical_labels == i, 0], x[hierarchical_labels == i, 1], s=20, label='Cluster {}'.format(i))
# AgglomerativeClustering does not expose cluster_centers_, so no centroids are plotted here
plt.title('Hierarchical clustering with {} clusters'.format(len(unique_labels_h)))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

In [None]:
def hierarchical_wordcloud(cluster_number, column_name):

  '''function for Building a wordcloud for the movie/shows'''

  # Filter the data by the specified cluster number and column name
  df_wordcloud = df[['hierarchical_cluster', column_name]].dropna()
  df_wordcloud = df_wordcloud[df_wordcloud['hierarchical_cluster'] == cluster_number]

  # Combine all text documents into a single string
  text = " ".join(word for word in df_wordcloud[column_name])

  # Create the word cloud
  wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="white").generate(text)

  # Return the word cloud object
  return wordcloud


In [None]:
fig, axs = plt.subplots(nrows=6, ncols=5, figsize=(15, 15))

for i in range(5):
    for j, col in enumerate(['description', 'cast', 'director', 'listed_in', 'country', 'title']):
        axs[j][i].imshow(hierarchical_wordcloud(i, col))
        axs[j][i].axis('off')
        axs[j][i].set_title(f'Cluster {i}, {col}')

plt.tight_layout()
plt.show()

### ML Model - 3) RECOMMENDATION SYSTEM

In [None]:
# ML Model - 3 Implementation
# Create a TF-IDF vectorizer object and transform the text data
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(new_df['normalized'])

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix)

def generate_recommendations(title, cosine_sim=cosine_sim, data=new_df):
    # Get the index of the input title in the programme_list
    programme_list = data['title'].to_list()
    index = programme_list.index(title)

    # Create a list of tuples containing the similarity score and index
    # between the input title and all other programmes in the dataset
    sim_scores = list(enumerate(cosine_sim[index]))

    # Sort the list of tuples by similarity score in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]

    # Get the recommended movie titles and their similarity scores
    recommend_index = [i[0] for i in sim_scores]
    rec_movie = data['title'].iloc[recommend_index]
    rec_score = [round(i[1], 4) for i in sim_scores]

    # Create a pandas DataFrame to display the recommendations
    rec_table = pd.DataFrame(list(zip(rec_movie, rec_score)),
                             columns=['Recommended movie', 'Similarity score (0-1)'])

    return rec_table

In [None]:
generate_recommendations('Stranger Things')

In [None]:
generate_recommendations('Phir Hera Pheri')

In [None]:
generate_recommendations('Black Panther')
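
To show what the similarity ranking does on a scale small enough to inspect, here is a toy sketch. The three titles and their descriptions are invented, and `most_similar` is a hypothetical helper. Unlike `generate_recommendations`, it returns `None` for an unknown title instead of raising a `ValueError`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented titles and descriptions, for illustration only
toy_titles = ['Space Quest', 'Galaxy Wars', 'Baking Duel']
toy_text = [
    'astronaut crew explores distant galaxy spaceship',
    'rebel pilots battle empire across galaxy spaceship',
    'amateur bakers compete cake pastry challenge',
]

toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_text)
toy_sim = cosine_similarity(toy_tfidf)

def most_similar(title, titles=toy_titles, sim=toy_sim):
    """Return the title most similar to `title`, or None if the title is unknown."""
    if title not in titles:
        return None
    idx = titles.index(title)
    # Rank every other title by its cosine similarity to the query title
    ranked = sorted(((score, i) for i, score in enumerate(sim[idx]) if i != idx),
                    reverse=True)
    return titles[ranked[0][1]]

print(most_similar('Space Quest'))
print(most_similar('Not In Catalogue'))
```

The two space-themed descriptions share the terms "galaxy" and "spaceship", so they rank as each other's nearest neighbor, while the baking title shares no terms with them.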

### **ML Model - 4) DBSCAN Clustering**

In [None]:
# ML Model - 4 Implementation
# Create an instance of DBSCAN with specified hyperparameters
dbscan_model = DBSCAN(eps=0.7, min_samples=3)
dbscan_model.fit(x) # Fit the model to the input data
y_dbscan = dbscan_model.labels_ # Get the predicted cluster labels for the input data
dbscan_labels = dbscan_model.labels_
unique_labels_dbscan = np.unique(dbscan_labels)
print(y_dbscan)   # Print the predicted labels
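
DBSCAN marks points that belong to no dense region with the label -1, so the number of unique labels can overstate the number of clusters by one. The sketch below shows this on synthetic data; the blobs and the two outliers are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus two isolated points (illustrative only)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.1, size=(30, 2))
outliers = np.array([[20.0, 20.0], [-20.0, 15.0]])
X_toy = np.vstack([blob_a, blob_b, outliers])

toy_labels = DBSCAN(eps=0.7, min_samples=3).fit(X_toy).labels_

# Noise points carry the label -1 and should not be counted as a cluster
n_noise = int(np.sum(toy_labels == -1))
n_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
print("Clusters:", n_clusters, "| noise points:", n_noise)
```

The same two expressions can be applied to `dbscan_labels` to separate the cluster count from the amount of noise.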


In [None]:
df['dbscan_cluster'] = dbscan_labels


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
scores_dict_dbscan = evaluate_clustering_model(dbscan_model, x, y_dbscan)

In [None]:
plot_clustering_scores(scores_dict_dbscan)

In [None]:
# Create a scatter plot of the data colored by cluster label
plt.figure(figsize=(8, 6), dpi=120)
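for i in unique_labels_dbscan:
    # DBSCAN labels noise points as -1; show them as 'Noise' rather than as a cluster
    name = 'Noise' if i == -1 else 'Cluster {}'.format(i)
    plt.scatter(x[dbscan_labels == i, 0], x[dbscan_labels == i, 1], s=20, label=name)
# Count the clusters without the noise label
n_clusters_dbscan = len(unique_labels_dbscan) - (1 if -1 in unique_labels_dbscan else 0)
plt.title('DBSCAN clustering with {} clusters'.format(n_clusters_dbscan))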
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()


In [None]:
def dbscan_wordcloud(cluster_number, column_name):

  '''function for Building a wordcloud for the movie/shows'''

  # Filter the data by the specified cluster number and column name
  df_wordcloud = df[['dbscan_cluster', column_name]].dropna()
  df_wordcloud = df_wordcloud[df_wordcloud['dbscan_cluster'] == cluster_number]

  # Combine all text documents into a single string
  text = " ".join(word for word in df_wordcloud[column_name])

  # Create the word cloud
  wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="white").generate(text)

  # Return the word cloud object
  return wordcloud

In [None]:
fig, axs = plt.subplots(nrows=6, ncols=5, figsize=(15, 15))

for i in range(5):
    for j, col in enumerate(['description', 'cast', 'director', 'listed_in', 'country', 'title']):
        axs[j][i].imshow(dbscan_wordcloud(i, col))
        axs[j][i].axis('off')
        axs[j][i].set_title(f'Cluster {i}, {col}')

plt.tight_layout()
plt.show()


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

1) Silhouette score

Silhouette score is a popular evaluation metric for clustering algorithms. It measures how well each data point fits into its assigned cluster compared to other clusters. The score ranges from -1 to 1, with a higher score indicating better-defined clusters.

Silhouette score is a useful metric for a positive business impact because it can help identify the optimal number of clusters for a dataset. This, in turn, can help companies make data-driven decisions and allocate resources more efficiently based on the distinct patterns and characteristics of each cluster.
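
The per-point definition behind this score can be written out by hand: for a point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the silhouette value is `(b - a) / max(a, b)`. The sketch below checks this against scikit-learn on four toy points; `silhouette_by_hand` is a hypothetical helper and assumes every cluster has at least two members.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four toy points in two clusters (illustrative only)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Silhouette value of point i; assumes its cluster has at least two members."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    a = dists[same].mean()  # mean distance to the rest of its own cluster
    # Mean distance to the nearest other cluster
    b = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)

manual = silhouette_by_hand(X_toy, labels_toy, 0)
reference = silhouette_samples(X_toy, labels_toy)[0]
print(round(manual, 4), round(reference, 4))
```

The overall silhouette score reported in this notebook is the mean of these per-point values.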

2) Calinski-Harabasz score

The Calinski-Harabasz score, also known as the variance ratio criterion, is the ratio of between-cluster dispersion to within-cluster dispersion. It is calculated by dividing the between-cluster sum of squares by the within-cluster sum of squares and multiplying the result by (n - k) / (k - 1), where n is the number of observations and k is the number of clusters.

In other words, the Calinski-Harabasz score measures how well separated the clusters are in the data and how compact the clusters are internally. A higher score indicates that the clusters are well separated and compact, while a lower score indicates that the clusters are not well separated or are not compact.
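
As a worked check of the calculation, the sketch below computes the score by hand on four toy points and compares it with scikit-learn. The function name `calinski_harabasz_by_hand` and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

def calinski_harabasz_by_hand(X, labels):
    """Between-cluster over within-cluster dispersion, scaled by (n - k) / (k - 1)."""
    n_samples = X.shape[0]
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in cluster_ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Size-weighted squared distance of the centroid from the overall mean
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Squared distances of the members from their own centroid
        within += np.sum((members - centroid) ** 2)
    return (between / within) * ((n_samples - k) / (k - 1))

# Four toy points in two clusters (illustrative only)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])

manual_ch = calinski_harabasz_by_hand(X_toy, labels_toy)
print(manual_ch, calinski_harabasz_score(X_toy, labels_toy))
```

For these points the between-cluster sum of squares is 25 and the within-cluster sum of squares is 1, so the score is 25 × (4 - 2) / (2 - 1) = 50.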

3) Davies-Bouldin score

The Davies-Bouldin score measures the average similarity between each cluster and the cluster most similar to it. For each pair of clusters, similarity is the ratio of the sum of their within-cluster scatters to the distance between their centroids; the score takes the largest such ratio for each cluster and averages these values over all clusters.

In other words, the Davies-Bouldin score measures how well separated the clusters are in the data and how distinct they are from each other. A lower score indicates that the clusters are well separated and distinct, while a higher score indicates that the clusters are not well separated or are not distinct from each other.
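
The Davies-Bouldin score can be checked by hand as well. In this sketch, scatter is the mean distance of a cluster's members to its centroid, which matches how scikit-learn defines it; `davies_bouldin_by_hand` and the toy points are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def davies_bouldin_by_hand(X, labels):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    cluster_ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])
    # Scatter: mean distance of the members of a cluster to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[idx], axis=1).mean()
        for idx, c in enumerate(cluster_ids)
    ])
    worst_ratios = []
    for i in range(len(cluster_ids)):
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(cluster_ids)) if j != i]
        worst_ratios.append(max(ratios))  # the most similar (worst-separated) neighbor
    return float(np.mean(worst_ratios))

# Six toy points in three clusters (illustrative only)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0],
                  [5.0, 1.0], [0.0, 10.0], [1.0, 10.0]])
labels_toy = np.array([0, 0, 1, 1, 2, 2])

manual_db = davies_bouldin_by_hand(X_toy, labels_toy)
print(manual_db, davies_bouldin_score(X_toy, labels_toy))
```

Because each cluster contributes its worst ratio, one badly separated pair of clusters is enough to raise the score.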

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:

# Storing metrics in order to make dataframe of metrics
Model          = ['K-Means Clustering', 'Hierarchical Clustering', 'DBSCAN Clustering']
S_score  = [0.0051, 0.0005, -0.0148]
CH_score = [22.0021, 18.1425, 2.8595]
DB_score = [10.7600, 12.1666, 1.4252]
No_of_cluster = [7, 5, 17]
# Create dataframe from the lists
data = {'Model' : Model,
        'Number of clusters': No_of_cluster,
        'silhouette_score'  : S_score,
        'calinski_harabasz_score': CH_score,
        'davies_bouldin_score': DB_score}
Metric_df = pd.DataFrame(data)

# Printing dataframe
Metric_df

After evaluating several machine learning models, including K-Means Clustering, Hierarchical Clustering - Agglomerative, DBSCAN Clustering, and Recommender System, we concluded that K-Means Clustering was the most suitable prediction model for our project.

We selected K-Means Clustering as our final model because it achieved the best silhouette and Calinski-Harabasz scores of the three clustering models and was computationally efficient on our dataset. It groups movies and TV shows by their shared attributes, which supports content recommendations. K-Means is also relatively simple to implement and maintain, making it a practical choice for our project.

Hierarchical (agglomerative) clustering and DBSCAN showed promise but required more processing power and time to execute. The recommender system is a content-based similarity ranking rather than a clustering model: it does not assign titles to groups, so it is not compared on the cluster evaluation metrics above.

In the metrics table, K-Means has the highest silhouette score (0.0051) and the highest Calinski-Harabasz score (22.0021). Its Davies-Bouldin score (10.7600) is lower, and therefore better, than that of hierarchical clustering (12.1666); DBSCAN has the lowest Davies-Bouldin score (1.4252) but a negative silhouette score (-0.0148). All three silhouette scores are close to 0, which indicates that the clusters overlap substantially, so K-Means is the best of the models compared rather than a strongly separated clustering.

In summary, we chose K-Means Clustering as our final prediction model for its comparatively better evaluation scores, efficiency, and practicality in supporting recommendations.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import pickle
filename='NETFLIX MOVIES AND TV SHOWS CLUSTERING.pkl'

# Serialize the model (wb = write bytes); the with-block closes the file afterwards
with open(filename, 'wb') as f:
    pickle.dump(kmeans_model, f)

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the file and predict again as a sanity check.

# Deserialize the model (rb = read bytes)
with open(filename, 'rb') as f:
    kmeans_model = pickle.load(f)

# Predict on x (the data the model was fitted on) to confirm the loaded model behaves the same
kmeans_model.predict(x)

In [None]:
# Checking if we are getting the same predicted values
y_kmeans
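
Comparing the two printed arrays by eye is error-prone, so the check can be made explicit. This is a minimal sketch on synthetic data that round-trips a fitted model through `pickle` in memory; `X_toy`, `toy_model`, and `restored` are illustrative names.

```python
import pickle
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data and model (illustrative only)
X_toy, _ = make_blobs(n_samples=100, centers=3, random_state=42)
toy_model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_toy)

# Serialize to bytes and load back, mimicking the save/load steps above
restored = pickle.loads(pickle.dumps(toy_model))

# The restored model should assign every point to the same cluster
same_predictions = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print("Predictions identical after reload:", same_predictions)
```

In the notebook, the equivalent check would be `np.array_equal(kmeans_model.predict(x), y_kmeans)`.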


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

**CONCLUSION FROM EDA:**

1. On the Netflix platform, there are more movies available than TV shows. Netflix primarily features content for mature audiences, with a majority of it having a TV-MA rating.

2. The countries with the highest number of productions available on Netflix are the United States, India, and the United Kingdom.

3. The number of titles added to Netflix's content library has increased consistently since 2008.

4. The most common genres of content on Netflix are Dramas, Comedies, and Documentaries.

5. According to the Wordcloud visualization of movie descriptions, some of the most frequently used words in Netflix movie descriptions are love, family, young, life, and world.

6. The correlation heatmap indicates a moderate positive correlation between a movie's release year and its duration.

7. The pairplot reveals several interesting patterns between variables, such as a strong positive correlation between the number of reviews and the year of release, and a negative correlation between a movie's rating and its duration.

**CONCLUSION FROM MODEL IMPLEMENTATION:**

1. The data was clustered based on the attributes director, cast, country, genre, rating, and description.

2. A TF-IDF vectorizer was used to tokenize, preprocess, and vectorize the values in these attributes, resulting in a total of 10,000 features.

3. Principal Component Analysis (PCA) was used to reduce the dimensionality of the data while capturing more than 95% of the variance.

4. Using the elbow method and silhouette score analysis, the optimal number of clusters for the K-Means Clustering algorithm was determined to be 7.

5. The optimal number of clusters for the Agglomerative clustering algorithm was determined to be 5 based on the dendrogram visualization.

6. A content-based recommender system was built using cosine similarity; it returns the 10 titles most similar to a given title.

7. DBSCAN clustering was also applied; it produced 17 clusters with low silhouette and Calinski-Harabasz scores.
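
The PCA step in point 3 can be sketched with scikit-learn, which accepts a variance fraction as `n_components` and keeps the smallest number of components that explains at least that share. The synthetic matrix below is an assumption for illustration and stands in for the TF-IDF features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 20 observed features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

# A float n_components keeps just enough components to reach that variance share
pca = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca.fit_transform(X_toy)
explained = pca.explained_variance_ratio_.sum()

print("Components kept:", X_reduced.shape[1])
print("Variance explained:", round(float(explained), 4))
```

Because the toy features are built from only three latent factors, at most three components are needed to pass the 95% threshold.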

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***