<a href="https://colab.research.google.com/github/Poonam-github3011/Module6-Netflix-movies-and-Tv-shows/blob/main/final_Clustering_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Unsupervised ML - Netflix Movies and TV Shows Clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Team
##### **Team Member 1 -** Poonam Khairnar
##### **Team Member 2 -** Shreya Thorat


# **Project Summary -**

This project aims to analyze a Netflix dataset of movies and TV shows up to 2019 from Flixable, a third-party search engine. The objective is to group content using NLP techniques for a better user experience through a recommendation system, helping prevent Netflix subscriber churn among its 220 million users.

> **This project started with** :
 * Handling null values in the dataset.
 * Managing nested columns for better visualization.
 * Categorizing ratings into groups (adult, children's, family-friendly, not rated).
 * Conducting Exploratory Data Analysis (EDA) to better understand the dataset.
 * Creating clusters using attributes like director, cast, country, genre, rating, and description, processed through TF-IDF vectorization.
 * Reducing dimensionality using PCA for improved performance.
 * Employing K-Means and Hierarchical Clustering algorithms, determining optimal clusters through various evaluation methods.
 * Developing a content-based recommender system with a cosine similarity matrix for personalized recommendations and reducing subscriber churn.
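
The steps listed above can be sketched end to end on a toy corpus. This is only an illustration under stated assumptions: the four short descriptions are made up, and the parameter values (`n_components=2`, `n_clusters=2`, `random_state=0`) are placeholders rather than the settings used later in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up descriptions standing in for the combined text attributes of four titles
docs = [
    "space crew explores distant planet",
    "astronauts reach distant planet space station",
    "chef cooks pasta restaurant kitchen",
    "young chef cooks pasta kitchen contest",
]

# 1. Turn the text into TF-IDF vectors
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# 2. Reduce dimensionality with PCA (PCA needs a dense array)
reduced = PCA(n_components=2, random_state=0).fit_transform(tfidf.toarray())

# 3. Cluster the reduced vectors with K-Means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

# 4. Content-based recommendation: rank titles by cosine similarity
similarity = cosine_similarity(tfidf)
# Index 0 of the descending ranking is the title itself, so take index 1
nearest_to_first = int(similarity[0].argsort()[::-1][1])

print(labels, nearest_to_first)
```

On this toy input the two space-themed descriptions should share one cluster and the two cooking-themed descriptions the other, and the closest match for the first title is the second one.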

# **GitHub Link -**

https://github.com/Poonam-github3011/Module6-Netflix-movies-and-Tv-shows

# **Problem Statement**


As one of the world's largest streaming platforms, Netflix offers a vast library of movies and TV shows. However, with so many options to choose from, it can be challenging for users to find content that matches their preferences.

To address this challenge, this project aims to use unsupervised learning techniques to cluster similar movies and TV shows on Netflix. By grouping titles with similar attributes, we can provide users with more targeted recommendations, and help them find new content they will enjoy.

Specifically, this project will involve analyzing a dataset of Netflix titles, including features such as genre, release year, cast, and plot summary, among others. By applying clustering algorithms such as K-Means or Hierarchical clustering, we aim to identify groups of movies and TV shows with similar attributes.

Ultimately, the project aims to create a clustering model that can accurately group Netflix titles based on their characteristics. This model can then be used to make recommendations to users or to help Netflix improve its content discovery algorithms.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
## Data Manipulation Libraries
import numpy as np
import pandas as pd
import datetime as dt

## Data Visualisation Libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns
%matplotlib inline
import plotly.graph_objects as go
import plotly.express as px

# Libraries used to process textual data
import string
import nltk
nltk.download('punkt')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# libraries used to implement clusters
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram
import pickle

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive')

In [None]:
df=pd.read_csv('/content/drive/MyDrive/Module6/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


In [None]:
print(f"Number of Rows: {df.shape[0]} \nNumber of Columns: {df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"Duplicate Value Counts: {df.duplicated().sum()}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
# Each highlighted cell marks a missing value in that row and column
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='mako')
plt.title('Missing Values by Column')
plt.xlabel('Column')
plt.show()

### What did you know about your dataset?

> * The Netflix Movies and TV Shows Clustering dataset comprises information on TV shows and movies available on Netflix as of 2019. With 7787 entries and 12 columns, the dataset includes a mix of categorical and numerical variables.
> * Some variables such as director, cast, country, date added, and rating contain null values.
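
To put numbers on the missing values mentioned above, a small helper can report the count and percentage of nulls per column. The sketch below runs on a made-up toy frame; `null_summary` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "Japan", np.nan, "Spain"],
})

def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that contain nulls, most affected first
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)

summary = null_summary(toy)
print(summary)
```

Calling `null_summary(df)` on the real dataset would give the same kind of table for the columns listed above.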

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all').T

### Variables Description

In [None]:
df.dtypes.value_counts()



| Column         |              Description                                                           |
|----------------|-----------------------------------------------------------------------|
| **show_id**    | A unique identifier for each movie or TV show in the dataset.         |
| **type**       | Indicates whether the entry is a movie or a TV show.                   |
| **title**      | The title of the movie or TV show.                                    |
| **director**   | The name of the director(s) associated with the content.              |
| **cast**       | The names of the main cast members in the movie or TV show.           |
| **country**    | The country or countries where the content was produced or originated.|
| **date_added** | The date when the movie or TV show was added to Netflix.              |
| **release_year**| The year when the movie or TV show was originally released.           |
| **rating**     | The content rating assigned to the movie or TV show (e.g., PG, TV-MA).|
| **duration**   | The duration of the movie or TV show (e.g., "90 min" for a movie or "2 Seasons" for a TV show).|
| **listed_in**  | The categories or genres in which the content is listed.              |
| **description**| A brief summary or description of the movie or TV show.               |



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
def unique_values(data_frame):
    """Print the number of unique values in each column."""
    for column in data_frame.columns:
        n_unique = data_frame[column].nunique()
        print(f"Column '{column}' has {n_unique} unique value(s)")

# Call the function with your DataFrame
unique_values(df)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Shape of the dataframe before Data Wrangling
print(f"Dataset size before dropping values : {df.shape}")

In [None]:
new_df = df.copy()

In [None]:
# Fill missing values for 'director', 'cast', and 'country' columns with 'Unknown'
new_df[['director','cast','country']] = new_df[['director','cast','country']].fillna('Unknown')

In [None]:
# Fill missing values for 'rating' with the mode value
new_df['rating']= new_df['rating'].fillna(new_df['rating'].mode()[0])

In [None]:
# Drop rows with any remaining missing values
new_df.dropna(axis=0, inplace=True)

In [None]:
# Keep only the leading integer of 'duration' (the text before the first space)
new_df['duration'] = new_df['duration'].apply(lambda x: int(x.split()[0]))

In [None]:
# Shape of the dataframe after Data Wrangling
print(f"Dataset size after dropping values : {new_df.shape}")

## Handling nested columns

In [None]:
temp_df = new_df.copy()

In [None]:
def unnest_column(df, column_name):
    # Split the column and unnest
    unnested_df = df[column_name].apply(lambda x: str(x).split(', ')).tolist()
    df = pd.DataFrame(unnested_df, index=df['title']).stack()

    # Create a DataFrame, reset the index, and set the column names
    df = df.reset_index(level=1, drop=True).reset_index(name=column_name)

    return df

# Applying the function for 'director', 'cast', 'listed_in', and 'country'
dt1 = unnest_column(temp_df, 'director')
dt2 = unnest_column(temp_df, 'cast')
dt3 = unnest_column(temp_df, 'listed_in')
dt4 = unnest_column(temp_df, 'country')
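
As a point of comparison, the same unnesting can be sketched with pandas' built-in `str.split` and `explode`. This is an alternative illustration on a made-up two-row frame, not the function used in this notebook; `unnest_with_explode` is a hypothetical name.

```python
import pandas as pd

# Toy frame with a comma-separated column (values are made up)
toy = pd.DataFrame({
    "title": ["Show A", "Show B"],
    "listed_in": ["Dramas, Comedies", "Documentaries"],
})

def unnest_with_explode(frame, column_name):
    """Return one row per (title, value) pair for a comma-separated column."""
    out = frame[["title", column_name]].copy()
    # Split each string into a list, then give every list element its own row
    out[column_name] = out[column_name].astype(str).str.split(", ")
    return out.explode(column_name).reset_index(drop=True)

genres = unnest_with_explode(toy, "listed_in")
print(genres)
```

Both approaches produce a long table with one value per row, which is what the merge step that follows expects.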

In [None]:
df.columns

## Merging the unnested dataframes

In [None]:
dfs = (
    dt2.merge(dt1, on='title', how='inner')
       .merge(dt3, on='title', how='inner')
       .merge(dt4, on='title', how='inner')
)

# Merging with the original DataFrame
temp_df = dfs.merge(new_df[['type', 'title', 'date_added', 'release_year', 'rating', 'duration', 'description']],
              on='title', how='left')

In [None]:
temp_df.head()

In [None]:
# Stripping leading and trailing white spaces from 'date_added' column
temp_df['date_added'] = temp_df['date_added'].str.strip()

# Typecasting string object to datetime object of date_added column
temp_df['date_added'] = pd.to_datetime(temp_df['date_added'], format='%B %d, %Y')

# Extracting day, month, and year from date_added column
temp_df["day_added"] = temp_df["date_added"].dt.day
temp_df["month_added"] = temp_df["date_added"].dt.month
temp_df["year_added"] = temp_df["date_added"].dt.year

# Dropping date_added
temp_df.drop('date_added', axis=1, inplace=True)


In [None]:
new_df.info()

## Remapping the ratings column


* **Adult Content** : TV-MA, NC-17, R
* **Children Content** : TV-PG, PG, TV-G, G
* **Teen Content** : PG-13, TV-14
* **Family-friendly Content** : TV-Y, TV-Y7, TV-Y7-FV
* **Not Rated** : NR, UR
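
The grouping above can be checked on a handful of codes before it is applied to the full column. The sketch below uses `Series.map`, which leaves codes that are missing from the dictionary as NaN so that gaps in the mapping become visible. The name `rating_groups` and the code `"XYZ"` are made up for illustration.

```python
import pandas as pd

# The five groups listed above, written as a code -> group dictionary
rating_groups = {
    "TV-MA": "Adult Content", "NC-17": "Adult Content", "R": "Adult Content",
    "TV-PG": "Children Content", "PG": "Children Content",
    "TV-G": "Children Content", "G": "Children Content",
    "PG-13": "Teen Content", "TV-14": "Teen Content",
    "TV-Y": "Family-friendly Content", "TV-Y7": "Family-friendly Content",
    "TV-Y7-FV": "Family-friendly Content",
    "NR": "Not Rated", "UR": "Not Rated",
}

# "XYZ" is a made-up code used to show what an unmapped value looks like
ratings = pd.Series(["TV-MA", "PG", "TV-Y7", "UR", "XYZ"])

# map() returns NaN for any code that is not a key of the dictionary
binned = ratings.map(rating_groups)
unmapped = ratings[binned.isna()].unique().tolist()

print(binned.tolist())
print("Unmapped codes:", unmapped)
```

An empty `unmapped` list on the real column would confirm that every rating code falls into one of the five groups.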

In [None]:
# Binning the values in the rating column
rating_map = {'TV-MA':'Adult Content',
              'R':'Adult Content',
              'PG-13':'Teen Content',
              'TV-14':'Teen Content',
              'TV-PG':'Children Content',
              'NR':'Not Rated',
              'TV-G':'Children Content',
              'TV-Y':'Family-friendly Content',
              'TV-Y7':'Family-friendly Content',
              'PG':'Children Content',
              'G':'Children Content',
              'NC-17':'Adult Content',
              'TV-Y7-FV':'Family-friendly Content',
              'UR':'Not Rated'}

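# Assign the result back: replace(..., inplace=True) on a selected column may not update new_df in recent pandas versions
new_df['rating'] = new_df['rating'].replace(rating_map)
new_df['rating'].unique()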

### What all manipulations have you done and insights you found?

> * Null values in the **'director'**, **'cast'**, and **'country'** columns have been filled with the string 'Unknown' using the fillna method.
> * Null values in the **'rating'** column have been filled with the mode (most frequent value) of the column using the fillna method.
> * Rows containing any remaining missing values have been dropped using the dropna method.
> * The **'date_added'** column has been converted to datetime format using the pd.to_datetime method, and the following features have been extracted from it:
>   * 'day_added'
>   * 'month_added'
>   * 'year_added'
> * The 'rating' column contains various coded categories, so we created 5 bins and distributed the values accordingly:
>   * **Adult Content** : TV-MA, NC-17, R
>   * **Children Content** : TV-PG, PG, TV-G, G
>   * **Teen Content** : PG-13, TV-14
>   * **Family-friendly Content** : TV-Y, TV-Y7, TV-Y7-FV
>   * **Not Rated** : NR, UR
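
The `date_added` conversion summarized above can be tried on a couple of hand-written strings. This is a minimal sketch: the two dates are made up, and the format string `'%B %d, %Y'` is the one used in the wrangling code.

```python
import pandas as pd

# Made-up raw values, including the stray whitespace that strip() removes
raw = pd.Series([" August 14, 2020", "December 1, 2019 "])

# Strip whitespace first, then parse with an explicit format
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y")

# Derive the same three features as the wrangling step
parts = pd.DataFrame({
    "day_added": parsed.dt.day,
    "month_added": parsed.dt.month,
    "year_added": parsed.dt.year,
})
print(parts)
```

Stripping before parsing matters because an explicit format does not tolerate leading or trailing spaces.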

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: count of Movies vs TV Shows on Netflix.

In [None]:
# Chart - 1 visualization code
# count of Movies vs TV Shows on Netflix.


# Creating the countplot to visualize the data
plt.figure(figsize = (10,8))

type_countplot = sns.countplot(data = new_df, x='type', palette='hot_r')

# Adding  a title to the plot
plt.title('Count of Movies vs TV Shows on Netflix', fontsize=15, color='black')

# Adding count annotations on top of the bars (cast to int so labels read 5377, not 5377.0)
for p in type_countplot.patches:
    type_countplot.annotate(f'{int(p.get_height())}',
                            (p.get_x() + p.get_width() / 2., p.get_height()),
                            ha='center', va='center', xytext=(0, 10),
                            textcoords='offset points', fontsize=10, color='black')

# Adding labels for the x and y axes
plt.xlabel('Type', fontsize=12, color='black')
plt.ylabel('Count', fontsize=12, color='black')

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

A countplot suits categorical data: it shows the number of titles of each type side by side, so the counts of movies and TV shows can be compared directly.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the dataset contains more than twice as many movies (5377) as TV shows (2400).
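
The two counts reported above can be turned into shares with a few lines of plain Python. The snippet only restates the arithmetic and introduces no new data.

```python
# Counts reported in the insight above
counts = {"Movie": 5377, "TV Show": 2400}

total = sum(counts.values())

# Percentage share of each type, rounded to one decimal place
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

# How many movies there are for every TV show
ratio = round(counts["Movie"] / counts["TV Show"], 2)

print(total, shares, ratio)
```

Movies therefore make up roughly 69% of the titles, or about 2.24 movies per TV show.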

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

> * **Positive Impact**: Insights guide smart content decisions, optimizing Netflix's content strategy and tailoring user experiences for increased engagement.
> * **Negative Impact**: Imbalances in content types may lead to dissatisfaction, highlighting the importance of diversification. Missing trend opportunities can result in a competitive disadvantage.

#### Chart - 2: Distribution of release years of Netflix shows.

In [None]:
# Chart - 2 visualization code
# Distribution of release years of Netflix shows.

# Creating the histogram with Plotly
release_year_hist = px.histogram(new_df, x='release_year', nbins=30, title='Distribution of Release Years of Netflix Shows',
                                 labels={'release_year': 'Release Year'}, color_discrete_sequence=['red'],width=800, height=600,text_auto=True)

# Updating layout
release_year_hist.update_layout(
    # Adding  a title to the plot
    title=dict(text='Distribution of Release Years of Netflix Shows', x=0.5, y=0.95, xanchor='center', yanchor='top'),
    # Adding labels for the x and y axes
    xaxis=dict(title='Release Year', showgrid=True,title_font=dict(size=18)),
    yaxis=dict(title='Count', showgrid=True,title_font=dict(size=18)),
    showlegend=False
)

# Displaying the plot
release_year_hist.show()


##### 1. Why did you pick the specific chart?

A histogram was chosen to show how Netflix titles are distributed by release year. It makes trends, peaks, and gaps in the release history easy to see, giving a clear picture of Netflix's content distribution.

##### 2. What is/are the insight(s) found from the chart?

> * The chart highlights a spike in titles released in 2015-2019, showcasing a recent emphasis on new releases.
> * The number of titles increases with release year, which suggests that Netflix has been expanding its content library consistently.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The histogram of release years can contribute to a positive business impact. Knowing when its titles were released helps Netflix plan better: it can focus on the popular years, offer a variety of shows, and make decisions that match what viewers like. This can make users happier and more engaged with Netflix.

#### Chart - 3:  Top 5 Countries in Netflix Shows

In [None]:
# Chart - 3 visualization code
# Creating a pie chart with Plotly
# Compute the top 5 countries once and pass them as a small DataFrame
top5_countries = (new_df['country'].value_counts().head()
                  .rename_axis('country').reset_index(name='count'))

fig = px.pie(top5_countries, names='country', values='count',
             title='Top 5 Countries in Netflix Shows',
             color_discrete_sequence=px.colors.qualitative.Set1, width=800, height=600)

# Centering the title and showing the percentage and label inside each slice
fig.update_layout(title=dict(x=0.5, y=0.95, xanchor='center', yanchor='top'))
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(legend_title='Country')

# Displaying the plot
fig.show()

##### 1. Why did you pick the specific chart?

This interactive pie chart shows how shows are distributed among the five countries with the most shows. Hovering over a slice reveals the country name, count, and percentage, giving a clear view of each country's relative share.

##### 2. What is/are the insight(s) found from the chart?

> * The largest slice indicates that the United States has the highest number of shows, highlighting its significant contribution to Netflix's content library.
> * Understanding the geographical distribution helps Netflix strategize content acquisition and production efforts to cater to diverse viewer preferences across different regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

> * **Strategic Decision-making**: Insights on dominant countries guide content strategies, enhancing viewer satisfaction and potential for increased viewership.
> * **Competitor Analysis**: Assessing market share relative to other countries aids in understanding the competitive landscape, supporting strategic positioning.

#### Chart - 4: Top 10 countries with the most Netflix shows

In [None]:
# Chart - 4 visualization code

# Names of the 10 countries with the most titles
top_countries = new_df['country'].value_counts().head(10).index

# Filter data for the top 10 countries
top_countries_data = new_df[new_df['country'].isin(top_countries)]

# Create a bar plot
plt.figure(figsize=(12, 8))
sns.countplot(x='country', hue='type', order=top_countries, data=top_countries_data, palette='magma')

# Add count annotations on top of the bars
for p in plt.gca().patches:
    plt.gca().annotate(f'{int(p.get_height())}',
                 (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='center', xytext=(0, 10),
                 textcoords='offset points', fontsize=10, color='black')

# Adding a title to the plot
plt.title('Top 10 Countries with Type Count on Netflix')

# Adding labels for the x and y axes
plt.xlabel('Country', fontsize=12, color='black')
plt.ylabel('Count', fontsize=12, color='black')
plt.legend(title='Type', loc='upper right')

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

The countplot places the countries on the x-axis and the number of titles on the y-axis, split by type. This makes it easy to compare movie and TV show counts across countries in a clear and concise manner.

##### 2. What is/are the insight(s) found from the chart?

> * The **United States** is a major contributor to Netflix, leading in both movies (1850) and TV shows (705), underscoring its significant influence on the platform's content.
> * **India** has a substantial movie count (852) but few TV shows (71), indicating a strong presence in movies and room for growth in TV shows.
> * The **United Kingdom** maintains a balanced content output with comparable counts in both movies (193) and TV shows (204), reflecting a diverse content landscape.
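
The per-country counts behind insights like these can be tabulated with `pd.crosstab`. The frame below is a made-up miniature, so its numbers do not match the chart; the `movie_share` column is an illustrative addition.

```python
import pandas as pd

# Made-up rows: one per title, with its country and type
toy = pd.DataFrame({
    "country": ["United States", "United States", "India", "India", "India", "United Kingdom"],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "TV Show"],
})

# Rows are countries, columns are types, cells are title counts
table = pd.crosstab(toy["country"], toy["type"])

# Fraction of each country's titles that are movies
table["movie_share"] = (table["Movie"] / (table["Movie"] + table["TV Show"])).round(2)

print(table)
```

Applied to the real data, a table like this makes the movie/TV balance of each country explicit instead of reading it off bar heights.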

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

> * **Targeted Investment**: Insights guide strategic content investments, tailoring production to the influential U.S. market and recognizing growth potential in India.
> * **Market expansion**: Identifying countries with a higher count of movies and TV shows can provide insights into potential markets for expansion. Netflix can prioritize expanding its presence in countries where there is already a substantial demand for their content.

#### Chart - 5 : Distribution of Content Ratings on Netflix

In [None]:
# Chart - 5 visualization code
# Distribution of Content Ratings on Netflix

# Creating the countplot to visualize the data
plt.figure(figsize=(10, 8))

rating_countplot = sns.countplot(data=new_df, x='rating', order=new_df['rating'].value_counts().index, color='skyblue')

# Adding a title to the plot
plt.title('Distribution of Content Ratings on Netflix', fontsize=15, color='black')

# Adding labels for the x and y axes
plt.xlabel('Rating', fontsize=12, color='black')
plt.ylabel('Count', fontsize=12, color='black')
plt.xticks(rotation=45)

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

A countplot is a suitable choice for showing how content ratings are distributed on Netflix. It places the rating groups on the x-axis and the number of titles in each group on the y-axis, which makes it easy to see which ratings are most common and gives a quick overview of Netflix's content.

##### 2. What is/are the insight(s) found from the chart?

> * Adult and Teen content have the highest counts, indicating a significant presence of content aimed at mature audiences.
> * Netflix also offers Children, Family-friendly, and Not Rated content, showcasing a diverse content library catering to various audience preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

> * **Viewer Segmentation**: Knowledge of specific ratings' popularity helps in better understanding viewer segments. This segmentation can be leveraged for personalized recommendations and targeted marketing campaigns, improving user engagement.
> * **Content Diversification**: Recognizing the diversity in ratings allows Netflix to continue offering a wide range of content suitable for different audiences. This diversification can attract a broader user base and enhance customer satisfaction.

#### Chart - 6: Popular genres on Netflix

In [None]:
# Chart - 6 visualization code
# Creating a horizontal bar plot of the 10 genres with the most unique titles
plt.figure(figsize=(14, 8))

df_genre = temp_df.groupby(['listed_in']).agg({'title': 'nunique'}).reset_index().sort_values(by=['title'], ascending=False)[:10]
ax = sns.barplot(y="listed_in", x='title', data=df_genre , palette = 'rocket')

# Adding count annotations at the end of the bars
for p in ax.patches:
    ax.annotate(f'{int(p.get_width())}',
                (p.get_width(), p.get_y() + p.get_height() / 2.),
                ha='center', va='center', xytext=(18, 0), textcoords='offset points', fontsize=10, color='black')

# Adding a title to the plot
plt.title('Most Popular Genre on Netflix', fontsize=15, color='black')

# Adding labels for the x and y axes
plt.xlabel('Title Count', fontsize=12, color='black')
plt.ylabel('Genre', fontsize=12, color='black')

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is suitable for visualizing the count of unique titles in each genre on Netflix because it provides a clear representation of the distribution of titles across different genres.

##### 2. What is/are the insight(s) found from the chart?

> * **"International Movies"** and **"Dramas"** are the most prevalent genres on Netflix, with a significantly higher count of unique titles.
> * Genres like **"Documentaries," "Action & Adventure," and "TV Dramas"** have a notable presence, showcasing a diverse range of content available on Netflix.
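
Genre counts of this kind come from splitting each comma-separated `listed_in` entry and counting every genre once per title. A standard-library sketch on made-up entries:

```python
from collections import Counter

# Made-up 'listed_in' values, one string per title
listed_in = [
    "International Movies, Dramas",
    "Dramas, Comedies",
    "Documentaries",
    "International Movies, Dramas",
    "Dramas",
]

# Split each entry on commas and count every genre once per title
genre_counts = Counter(
    genre.strip() for entry in listed_in for genre in entry.split(",")
)

top = genre_counts.most_common(2)
print(top)
```

Because a title can carry several genres, the counts add up to more than the number of titles.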

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

> * **Targeted Marketing**: Utilizing insights on popular genres enables Netflix to customize marketing campaigns, recommendations, and ads for specific viewer preferences, boosting the attraction and retention of subscribers.
> * **Personalized Recommendations**: Genre insights enhance recommendation algorithms, allowing Netflix to offer more precise and personalized content suggestions, elevating the overall user experience.

#### Chart - 7 : Content Added to Netflix over the Past 20 Years

In [None]:
# Chart - 7 visualization code
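# Counting the rows for each year in 'year_added' (top 20 years by count, shown in chronological order)
# Note: temp_df is the unnested frame, so a title is counted once per cast/director/genre/country combination
year_counts = temp_df['year_added'].value_counts().head(20).sort_index()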

# Creating a line plot to visualize the data
plt.figure(figsize=(10, 8))
sns.lineplot(x=year_counts.index, y=year_counts.values, marker='o')

# Adding a title to the plot
plt.title('Content Added to Netflix over the Past 20 Years', fontsize=15, color='black')

# Adding grid and labels for the x and y axes
plt.grid(True)
plt.xlabel('Year Added', fontsize=12, color='black')
plt.ylabel('Count', fontsize=12, color='black')

# Displaying the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The line chart is suitable for visualizing how much content was added to Netflix in each year. It effectively shows the trend and variation in additions over time.

##### 2. What is/are the insight(s) found from the chart?

> * A noticeable surge in added content is observed from 2016 onwards, reaching its peak in 2018. This period aligns with Netflix's strategic focus on original content creation and global expansion.
> * The chart shows an increasing trend in content added over the years, which suggests that Netflix has been expanding its content library consistently.
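
Because the chart is built from the unnested frame, it is worth knowing how row counts differ from unique-title counts. The toy frame below (made-up values) shows the difference: a title with three cast members contributes three rows but only one title.

```python
import pandas as pd

# Made-up unnested frame: title "A" appears three times, once per cast member
exploded = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C"],
    "cast": ["x", "y", "z", "x", "y"],
    "year_added": [2018, 2018, 2018, 2019, 2019],
})

# Counting rows: inflated by the number of cast entries per title
row_counts = exploded["year_added"].value_counts().sort_index()

# Counting unique titles: each title contributes once per year
title_counts = exploded.groupby("year_added")["title"].nunique()

print(row_counts.to_dict())
print(title_counts.to_dict())
```

In this toy case the row count ranks 2018 first while the title count ranks 2019 first, so which year looks like the peak can depend on the counting method.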

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

> * **Content Strategy and Planning**: The insights from the line chart can inform Netflix's content strategy and planning. Understanding the historical trends in content production allows for more informed decisions on resource allocation, budgeting, and focus areas for future content creation.
> * **Audience Engagement**: By analyzing the growth trends, Netflix can align content releases with periods of increased audience engagement. This strategic timing can maximize viewership, subscriber retention, and overall customer satisfaction.

#### Chart - 8 : Top actors performing in Movies and TV Shows

In [None]:
# Chart - 8 visualization code
df_movies = temp_df[temp_df['type'] == 'Movie']
df_tvshows = temp_df[temp_df['type'] == 'TV Show']

plt.style.use('default')
plt.figure(figsize=(18, 8))

# Plotting Movies Actors
plt.subplot(1, 2, 1)
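# Rank actors by number of unique titles; [1:10] skips the first row (expected to be the 'Unknown' placeholder) and keeps the next 9
df_movies_actor = df_movies.groupby(['cast']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'], ascending=False)[1:10]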
plot_movies = sns.barplot(y="cast", x='title', data=df_movies_actor, palette='magma')
plt.title('Top Actors in Movies')
plt.xlabel('Number of Movies')


# Adding count labels on the bars for Movies
# Adjusting count value position for Movies
for index, value in enumerate(df_movies_actor['title']):
    plot_movies.text(value + 0.1, index, str(value), color='black', ha="left", va="center")


# Plotting TV Shows Actors
plt.subplot(1, 2, 2)
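# Same ranking for TV shows; [1:10] again skips the first row (expected to be the 'Unknown' placeholder)
df_tvshows_actor = df_tvshows.groupby(['cast']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'], ascending=False)[1:10]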
plot_tvshows = sns.barplot(y="cast", x='title', data=df_tvshows_actor, palette='magma')
plt.title('Top Actors in TV Shows')
plt.xlabel('Number of TV Shows')


# Adding count labels on the bars for TV Shows
for index, value in enumerate(df_tvshows_actor['title']):
    plot_tvshows.text(value + 0.1, index, str(value), color='black', ha="left", va="center")

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This bar plot is suitable for visualizing the top actors in movies and TV shows because it effectively compares the number of appearances for each actor across the two categories.

##### 2. What is/are the insight(s) found from the chart?

> * Anupam Kher leads with 41 appearances, establishing himself as the most prominent movie actor on Netflix.
> * Bollywood influence is evident, with actors like Shah Rukh Khan, Naseeruddin Shah, and Akshay Kumar featuring prominently.
> * Takahiro Sakurai dominates TV shows with 22 appearances, indicating a significant presence in this category.
> * Japanese anime influence is observed through actors like Yuki Kaji, Ai Kayano, Daisuke Ono, and Junichi Suwabe.
> * The contrast in top actors between movies and TV shows suggests diverse industry dynamics.
> * Bollywood actors dominate movies, while Japanese anime-related content holds sway in TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

> * Netflix can strategically invest in movies or TV shows featuring highly popular actors, aligning with user preferences.
> * Users are more likely to receive personalized recommendations based on their preferred actors, leading to increased user satisfaction.
> * Popular actors can be leveraged in marketing and promotional campaigns to attract and retain subscribers.

#### Chart - 9 : What is the distribution of content ratings in the highest content-creating countries?

In [None]:
# Assign the result back instead of using inplace=True on a column selection
temp_df['rating'] = temp_df['rating'].replace(rating_map)

In [None]:
temp_df['count'] = 1
data = temp_df.groupby('country')['count'].sum().sort_values(ascending=False).reset_index()[:10]
top_countries = data['country']
df_heatmap = temp_df[temp_df['country'].isin(top_countries)]
df_heatmap = pd.crosstab(df_heatmap['country'], df_heatmap['rating'], normalize="index").T

# Plotting the heatmap
fig, ax = plt.subplots(1, 1, figsize=(10, 8))

# Defining order of representation
country_order = df_heatmap.columns
rating_order = df_heatmap.index

# Calling and plotting heatmap
sns.heatmap(df_heatmap.loc[rating_order, country_order], square=True, linewidth=2.5, cbar=False, annot=True, fmt='1.0%',
            vmax=.6, vmin=0.05, ax=ax, annot_kws={"fontsize": 12})
plt.show()


##### 1. Why did you pick the specific chart?

The heatmap was chosen for its ability to visually represent the distribution of content ratings across different countries.

##### 2. What is/are the insight(s) found from the chart?

> * We found that **most of the countries produce mainly Adult and Teen content.**
> * Among all the countries, **India has less Adult content than Teen content.**
> * **85% of the content from Spain is Adult content.**
> * **Canada produces more Children and Family-Friendly content.**

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

> *Hypothetical Statement 1:*
> * **Null Hypothesis**: There is no significant difference in the proportion ratings of drama movies and comedy movies on Netflix.
> * **Alternative Hypothesis**: There is a significant difference in the proportion ratings of drama movies and comedy movies on Netflix.

> *Hypothetical Statement 2:*
> * **Null Hypothesis**: The average duration of TV shows added in the year 2020 on Netflix is not significantly different from the average duration of TV shows added in the year 2021.
> * **Alternative Hypothesis**: The average duration of TV shows added in the year 2020 on Netflix is significantly different from the average duration of TV shows added in the year 2021.

> *Hypothetical Statement 3:*
> * **Null Hypothesis**: The proportion of TV shows added on Netflix that are produced in the United States is not significantly different from the proportion of movies added on Netflix that are produced in the United States.
> * **Alternative Hypothesis**: The proportion of TV shows added on Netflix that are produced in the United States is significantly different from the proportion of movies added on Netflix that are produced in the United States.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

> **Null Hypothesis**: There is no significant difference in the proportion ratings of drama movies and comedy movies on Netflix.

> **Alternative Hypothesis**: There is a significant difference in the proportion ratings of drama movies and comedy movies on Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Import necessary libraries
from statsmodels.stats.proportion import proportions_ztest

# Subset the data to only include drama and comedy movies
is_drama = temp_df['listed_in'].str.contains('Dramas')
is_comedy = temp_df['listed_in'].str.contains('Comedies')
subset = temp_df[is_drama | is_comedy]

# Calculate the proportion of drama and comedy movies
drama_count = int(is_drama.sum())
comedy_count = int(is_comedy.sum())
drama_prop = drama_count / len(subset)
comedy_prop = comedy_count / len(subset)

# Set up the parameters for the z-test
# (use the raw counts; int(prop * n) can drop a title through float truncation)
count = [drama_count, comedy_count]
nobs = [len(subset), len(subset)]
alternative = 'two-sided'

# Perform the z-test
z_stat, p_value = proportions_ztest(count=count, nobs=nobs, alternative=alternative)
print('z-statistic:', z_stat)
print('p-value:', p_value)

# Set the significance level
alpha = 0.05

# Print the results of the z-test
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

We conclude that there is a significant difference in the proportion ratings of drama movies and comedy movies on Netflix.

##### Which statistical test have you done to obtain P-Value?

The statistical test we have used to obtain the P-value is the z-test for proportions.

##### Why did you choose the specific statistical test?

We used a z-test for proportions because we wanted to compare the proportions of two types of movies (drama and comedy) in a sample. The test helps us figure out if the difference we see in the proportions is likely due to a real distinction or just random chance. It's like checking if the observed difference is big enough to be considered meaningful. The test looks at the probability of seeing such a difference if there was actually no difference in the entire population.
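
To make the reasoning above concrete, here is a minimal sketch of the pooled two-proportion z-test written with only the standard library. The function name `two_proportion_ztest` and the counts are hypothetical and are not taken from the Netflix data; the pooled standard error is, to our understanding, what `proportions_ztest` uses by default, so treat this as an illustration rather than a replacement for the statsmodels call.

```python
import math

def two_proportion_ztest(count1, n1, count2, n2):
    """Two-sided pooled z-test for the difference between two proportions."""
    p1, p2 = count1 / n1, count2 / n2
    # Under the null hypothesis both groups share one proportion, so pool them
    pooled = (count1 + count2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only
z_stat, p_value = two_proportion_ztest(count1=60, n1=100, count2=45, n2=100)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```

A small p-value means a gap this large would be unlikely if both groups really shared the same proportion.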

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

> **Null Hypothesis**: The average duration of TV shows added in the year 2020 on Netflix is not significantly different from the average duration of TV shows added in the year 2021.

> **Alternative Hypothesis**: The average duration of TV shows added in the year 2020 on Netflix is significantly different from the average duration of TV shows added in the year 2021.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Import necessary libraries
from scipy.stats import ttest_ind

# Create separate dataframes for TV shows from 2020 and 2021
# Note: the filter uses release_year, while the hypothesis is worded in terms of the year added
tv_2020 = temp_df[(temp_df['type'] == 'TV Show') & (temp_df['release_year'] == 2020)]
tv_2021 = temp_df[(temp_df['type'] == 'TV Show') & (temp_df['release_year'] == 2021)]

# Perform two-sample t-test
t, p = ttest_ind(tv_2020['duration'].astype(int),
                 tv_2021['duration'].astype(int), equal_var=False)

# Print the results
print('t-value:', t)
print('p-value:', p)

# Set the significance level
alpha = 0.05

# Print the interpretation of the results
if p < alpha:
    print("Reject the null hypothesis.")

else:
    print("Fail to reject the null hypothesis.")

We conclude that the average duration of TV shows added in 2020 on Netflix is significantly different from the average duration of TV shows added in 2021.

##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain the P-Value is a two-sample t-test.

##### Why did you choose the specific statistical test?

We used a two-sample t-test because we wanted to compare the average durations of two groups of TV shows: those added in 2020 and those added in 2021. The test helps us figure out if the difference in average durations is likely due to a real distinction or just random chance. We also assumed that the variability in durations between the two groups might not be the same, so we used a version of the t-test that doesn't assume equal variability.
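
The unequal-variance version mentioned above is usually called Welch's t-test. As a rough sketch of what it computes, the snippet below derives the t statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. The helper `welch_t` and the season counts are hypothetical; the p-value itself still comes from `ttest_ind(..., equal_var=False)` in the cell above.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Variance of each group's mean: sample variance divided by group size
    v_a = statistics.variance(sample_a) / n_a
    v_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(v_a + v_b)
    # Degrees of freedom shrink when the two variances or sizes are unbalanced
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical season counts, for illustration only
seasons_2020 = [1, 1, 2, 1, 3, 1, 2]
seasons_2021 = [1, 2, 2, 4, 1, 3, 5, 2]
t_stat, dof = welch_t(seasons_2020, seasons_2021)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Because the variances are not pooled, the degrees of freedom are generally not a whole number and fall between the smaller group size minus one and the combined size minus two.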

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

> **Null Hypothesis**: The proportion of TV shows added on Netflix that are produced in the United States is not significantly different from the proportion of movies added on Netflix that are produced in the United States.

> **Alternative Hypothesis**: The proportion of TV shows added on Netflix that are produced in the United States is significantly different from the proportion of movies added on Netflix that are produced in the United States.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Separate data into movies and TV shows
df_movies = temp_df[temp_df['type'] == 'Movie']
df_tvshows = temp_df[temp_df['type'] == 'TV Show']

# Count TV shows and movies from the United States, then derive the proportions
tv_us_count = int(df_tvshows['country'].str.contains('United States').sum())
movie_us_count = int(df_movies['country'].str.contains('United States').sum())
tv_proportion = tv_us_count / len(df_tvshows)
movie_proportion = movie_us_count / len(df_movies)

# Set up the parameters for the z-test
# (use the raw counts; int(prop * n) can drop a title through float truncation)
count = [tv_us_count, movie_us_count]
nobs = [len(df_tvshows), len(df_movies)]
alternative = 'two-sided'

# Perform the z-test
z_stat, p_value = proportions_ztest(count=count, nobs=nobs, alternative=alternative)
print('z-statistic:', z_stat)
print('p-value:', p_value)

# Set the significance level
alpha = 0.05

# Print the results of the z-test
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

We conclude that the proportion of TV shows added on Netflix that are produced in the United States is significantly different from the proportion of movies added on Netflix that are produced in the United States.

##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain P-Value is a two-sample proportion test.

##### Why did you choose the specific statistical test?

We used this statistical test because it's good for comparing two proportions. It helps us figure out if the difference between the proportions of TV shows and movies from the United States is likely just random chance or if there's a real distinction between them.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
temp_df.isna().sum()

Let's move ahead, as we have already dealt with the null/missing values in our dataset.

#### What all missing value imputation techniques have you used and why did you use those techniques?

No missing values remain at this stage, so no further imputation is needed.

### 2. Handling Outliers

In [None]:
# @markdown # Outlier detector
def Outlier_detector(data, feature, figsize=(10, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (10, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None, which lets seaborn choose)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # For histogram; "auto" is seaborn's default binning
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# Handling Outliers & Outlier treatments
# Collect the numeric columns so each one can be inspected for outliers
numeric_columns = []
for column in temp_df.columns:
    if temp_df[column].dtype in ['float64', 'int64']:
        numeric_columns.append(column)

In [None]:
for variable in numeric_columns[:-1]:
  Outlier_detector(temp_df,variable)
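
The boxplots drawn by `Outlier_detector` mark points beyond the whiskers, which seaborn places at 1.5 times the interquartile range by default. As a minimal sketch of that rule, assuming one simply wants to list the flagged values, the hypothetical helper `iqr_outliers` below applies the same fences to a plain list; the durations are made up for illustration and no treatment (capping or removal) is implied.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # 'inclusive' interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical movie durations in minutes, for illustration only
durations = [88, 92, 95, 97, 101, 104, 110, 115, 121, 312]
print(iqr_outliers(durations))
```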

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits containing digits

In [None]:
# Remove URLs & Remove words and digits containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***