<a href="https://colab.research.google.com/github/Radhika190/Netflix-Movies-And-Tv-Shows-Clustering-Unsupervised-ML-Project/blob/main/Individual_Notebook_%7C_Unsupervised_Ml_Project_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies And Tv Shows Clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name**            - Radhika Dwivedi


# **Project Summary -**

This project aims to conduct an in-depth analysis and clustering of Netflix movies and TV shows based on their various attributes, with the goal of uncovering valuable insights into content categories and user preferences. Clustering methods will be employed to group similar content together, facilitating improved content recommendations, content management, and a deeper understanding of audience preferences.

Project Approach:

* Data Collection: We will assemble a comprehensive dataset encompassing Netflix titles, including attributes such as title, genre, release year, duration, cast, director, and more.

* Data Preprocessing: Thorough data cleaning will be performed to handle missing values. Categorical variables will be transformed into numerical representations, and numerical features will be normalized or scaled.

* Feature Extraction: Depending on the data at hand, relevant features will be extracted from the dataset. This may involve embedding genre information, cast details, and more.

* Clustering Algorithm Selection: We will choose appropriate clustering algorithms, such as K-Means, Hierarchical Clustering, or DBSCAN, based on the dataset's characteristics and the desired outcomes.

* Clustering: The chosen clustering algorithm will be applied to the dataset to form distinct clusters of movies and TV shows based on their attributes. Each cluster will represent content with similar characteristics.

* Evaluation: Cluster quality will be assessed using metrics like silhouette score, Davies-Bouldin index, or domain-specific metrics if available.

* Visualization: Visualization techniques like dimensionality reduction (e.g., t-SNE) will be employed to depict the relationships among different content items within the clusters.

* Interpretation: The clusters will be analyzed to identify common themes, genres, or patterns that emerge within each cluster. This analysis will offer insights into viewer preferences and assist in content curation.

* Recommendations: Leveraging the clusters, we will enhance content recommendations for users. If a user enjoys content from one cluster, similar content from the same cluster can be suggested.
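
The approach outlined above can be sketched end to end on a tiny example. This is a minimal illustration, not the final pipeline of this project: the `docs` list is made-up toy text standing in for the combined text attributes of each title, and the choices of two SVD components and two clusters are assumptions made only so the example runs quickly.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Toy stand-in for the text attributes of each title (illustrative data only)
docs = [
    "crime drama detective murder investigation",
    "detective crime thriller police investigation",
    "murder mystery crime police drama",
    "romantic comedy love wedding family",
    "comedy romance love friends wedding",
    "family comedy love holiday romance",
]

# Preprocessing and feature extraction: turn text into TF-IDF vectors
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Dimensionality reduction before clustering
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)

# Clustering: group similar titles together
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_reduced)

# Evaluation: higher silhouette and lower Davies-Bouldin indicate better-separated clusters
sil = silhouette_score(X_reduced, labels)
dbi = davies_bouldin_score(X_reduced, labels)
print("Cluster labels:", labels)
print("Silhouette score:", round(sil, 3))
print("Davies-Bouldin index:", round(dbi, 3))
```

On the real dataset, the number of components and clusters would need to be chosen from the data (for example by comparing silhouette scores across several values) rather than fixed in advance.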

Project Benefits:

* Content Curation: Content managers can efficiently categorize and manage the extensive Netflix library, simplifying content organization.

* Personalized Recommendations: Improved clustering can enhance the accuracy of content recommendations, resulting in more personalized viewing experiences for users.

* Insights for Production: Identifying popular clusters can guide content creators and managers in making informed decisions about future content production and acquisition.

By applying clustering techniques to Netflix movies and TV shows, we aim to provide a more structured and personalized viewing experience for users while delivering valuable insights for content stakeholders.


# **GitHub Link -**

https://github.com/Radhika190/Netflix-Movies-And-Tv-Shows-Clustering-Unsupervised-ML-Project

# **Problem Statement**


Netflix, a global streaming leader, faces a challenge: as its content library expands, users struggle to find what they like. Our proposal: apply clustering analysis to categorize similar content.

Key Issues:

**Content Organization:** Netflix's vast library needs efficient organization across genres, release years, and styles.

**User Engagement:** Personalized recommendations are vital for user satisfaction. Clustering helps understand user preferences for tailored suggestions.

**Content Curation:** To cater to diverse audiences, understanding their preferences is crucial for content creators.

This project offers insights into Netflix's evolving content landscape and audience preferences. These insights can drive better content curation, enhance user satisfaction, and inform strategic decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following points are answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Import Pandas for data manipulation and analysis
import pandas as pd

# Import NumPy for numerical computing and array operations
import numpy as np

# Import Matplotlib and Seaborn for data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Import Missingno for visualizing missing data patterns
import missingno as msno

# Import Matplotlib's colormap module for color map manipulations
import matplotlib.cm as cm
from matplotlib.colors import LinearSegmentedColormap
# Import Geopandas for working with geospatial data
import geopandas as gpd

# Install or upgrade the 'country_converter' package using pip
!pip install country_converter --upgrade

# Import country_converter for converting country names and codes
import country_converter as coco

# Import Plotly Express for interactive visualizations
import plotly.express as px

# Import the 'string' module for working with strings and punctuation
import string

# Import statistics module from SciPy for statistical computations
from scipy import stats

# Import CountVectorizer and TfidfVectorizer for text feature extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Import WordCloud, STOPWORDS, and ImageColorGenerator for text visualization
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Import NLTK modules for natural language processing
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

# Import cosine_similarity from scikit-learn for text similarity
from sklearn.metrics.pairwise import cosine_similarity

# Import PCA for Principal Component Analysis
from sklearn.decomposition import PCA

# Import silhouette_score and silhouette_samples for cluster evaluation
from sklearn.metrics import silhouette_score, silhouette_samples

# Import KMeans for k-means clustering
from sklearn.cluster import KMeans

# Import hierarchy module from SciPy for hierarchical clustering
import scipy.cluster.hierarchy as shc

# Import AgglomerativeClustering for agglomerative hierarchical clustering
from sklearn.cluster import AgglomerativeClustering

# Import SilhouetteVisualizer for visualizing silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer

# Import warnings for managing warning messages
import warnings

# Ignore warning messages to improve code readability
warnings.filterwarnings('ignore')

# Import Spacy and other libraries for preprocessing and analysis
import spacy
import en_core_web_sm
import nltk
from sklearn.decomposition import TruncatedSVD
from nltk import ne_chunk
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MultiLabelBinarizer, OneHotEncoder

# Additional libraries for clustering and similarity
from sklearn import metrics
from sklearn.metrics.pairwise import linear_kernel

# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Load Spacy language model
nlp = en_core_web_sm.load()

### Dataset Loading

In [None]:
# Load Dataset
## mounting drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
netflix_df = pd.read_csv('/content/drive/MyDrive/ML PROJECT/Unsupervised Ml Project/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
netflix_df.head()       ## shows the top 5 rows

In [None]:
netflix_df.tail()       ## shows the bottom 5 rows

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# netflix_df.shape[0] - returns the number of rows
# netflix_df.shape[1] - returns the number of columns
print(f'There are {netflix_df.shape[0]} rows and {netflix_df.shape[1]} columns')

### Dataset Information

In [None]:
# Dataset Info
netflix_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
netflix_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
netflix_df.isnull().sum()

In [None]:
# Visualizing the missing values
msno.bar(netflix_df,
         fontsize=10,
         figsize=(7,4),
         color='purple')
plt.title('Missing values')
plt.show()

### What did you know about your dataset?

1. The dataset consists of 7787 rows and 12 columns.

2. Five columns in the dataset have missing values: director, cast, country, date_added, and rating.

3. There are a total of 3631 missing values in the table.
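
The figures in this summary can be derived directly from the per-column null counts. The sketch below shows the idea on a small made-up DataFrame (`toy` is illustrative data, not the Netflix dataset); the same three lines can be applied to `netflix_df`.

```python
import numpy as np
import pandas as pd

# Illustrative data only: two of the three columns contain missing values
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "United States", np.nan, "India"],
})

# Missing values per column
missing_per_column = toy.isnull().sum()

# Keep only the columns that actually have missing values
columns_with_missing = missing_per_column[missing_per_column > 0]

# Total number of missing cells in the table
total_missing = int(missing_per_column.sum())

print(columns_with_missing)
print("Columns with missing values:", len(columns_with_missing))
print("Total missing values:", total_missing)
```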

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
netflix_df.columns

In [None]:
# Dataset Describe
# Transpose of Data Description for better visibility and analysis
netflix_df.describe(include='all').T

### Variables Description

1. show_id: A unique identifier for each show or movie. There are 7787 entries.

2. type: Indicates whether the entry is a "Movie" or a "TV Show". There are 2 unique values, and "Movie" appears 5377 times.

3. title: The title of the show or movie. There are 7787 unique titles.

4. director: The directors of the content. There are 5398 entries, and some entries have multiple directors separated by commas.

5. cast: The cast members of the content. There are 7069 entries, and some entries have multiple cast members separated by commas.

6. country: The country of origin for the content. There are 7280 entries, and the most common country is the "United States" with 2555 occurrences.

7. date_added: The date when the content was added to Netflix. There are 7777 entries with dates like "January 1, 2020".

8. release_year: The year when the content was released. The range of release years is between 1925 and 2021, with an average around 2013.

9. rating: The content rating, such as "TV-MA" (Mature Audience). There are 7780 entries with 14 unique ratings.

10. duration: The duration of the content. For TV shows, it's indicated in terms of seasons (e.g., "1 Season"). There are 7787 entries with various durations.

11. listed_in: Categories or genres in which the content is listed, such as "Documentaries". There are 7787 entries with 492 unique categories.

12. description: A brief description or summary of the content. There are 7787 entries with descriptions.
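
Several of these columns (director, cast, country, listed_in) pack multiple values into one comma-separated string, so counting raw strings counts combinations rather than individual values. A minimal sketch of one way to count individual values, using a made-up `toy` DataFrame rather than the real data:

```python
import pandas as pd

# Illustrative data only: 'listed_in' holds comma-separated genres
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": ["Dramas, International Movies", "Dramas", "Comedies, Dramas"],
})

# Split each string into a list, put each genre on its own row,
# strip the leftover spaces, then count occurrences
genre_counts = (
    toy["listed_in"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(genre_counts)
```

The same pattern should work for the cast, director, and country columns once their missing values have been handled.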

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
def unique_check(netflix_df):
  unique = netflix_df.nunique().reset_index()
  unique.columns = ['Column', 'Unique_Values']
  return unique

# Calling the function
unique_check(netflix_df)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Convert date_added to datetime format (strip stray whitespace around the date strings first)
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'].str.strip())

In [None]:
# Handling the Null Values
# Assign the filled columns back instead of using inplace=True on a column selection
netflix_df['cast'] = netflix_df['cast'].fillna('No cast')
netflix_df['country'] = netflix_df['country'].fillna(netflix_df['country'].mode()[0])
netflix_df['director'] = netflix_df['director'].fillna('NA')

In [None]:
netflix_df.isnull().sum()

### What all manipulations have you done and insights you found?

 * I converted the 'date_added' column to a datetime format using pd.to_datetime(). This allows for easier analysis and manipulation of date-related information.

 * Replaced missing values in the 'cast' column with the placeholder 'No cast' and in the 'director' column with the placeholder 'NA'.

 * Imputed the missing values in the 'country' column with the most frequently occurring value in the column (the mode).

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

**UNIVARIATE ANALYSIS :-**

#### Chart - 1 Bar Plot- Distribution of Ratings for Netflix Shows

In [None]:
# Chart - 1 visualization code
# Grouping the data by 'rating' and calculating the count
rating_counts = netflix_df['rating'].value_counts()
rating_counts

In [None]:
# Define custom colors
custom_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']

# Creating a bar plot to visualize the rating distribution
plt.figure(figsize=(15, 6))
rating_counts.plot(kind='bar', color=custom_colors)
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Distribution of Ratings for Netflix Shows')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

Categorical Data & Bar Plots: "Rating" data, like "TV-PG" and "TV-MA," is best visualized using bar plots. They display category counts for quick comparisons.

Count Focus: Bar plots emphasize counting occurrences, revealing popular and rare ratings.

Comparative Insight: They visually highlight count differences between categories.

Simplicity: Bar plots are straightforward for displaying categorical data.

##### 2. What is/are the insight(s) found from the chart?

Most Common Ratings: The bar plot displays show counts for each rating category, identifying the most prevalent ratings on Netflix, like "TV-MA" and "TV-14."

Dominant Audience: It helps infer the target audience; higher-rated categories imply mature content, while lower-rated ones suggest family-friendly options.

Imbalance in Ratings: The chart reveals rating distribution imbalances, indicating content preference and Netflix's strategic focus.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Enhanced Content Strategy: Netflix can optimize its content strategy by aligning with prevalent ratings, attracting subscribers who prefer tailored content.

Informed Content Decisions: Analyzing rating distribution guides content acquisition and production choices, catering to viewer demand.

Personalized Recommendations: This analysis refines recommendation algorithms, providing subscribers with more accurate and engaging content suggestions.

Challenges: Potential drawbacks include limited audience reach, regulatory constraints, and reduced content diversity, which could impact growth and inclusivity.

#### Chart - 2 Visualization of number of movies and tv shows

In [None]:
# Chart - 2 visualization code
# Visualization of number of movies and tv shows
# Defining custom colors
color_palette = ['#ff9999', '#66b3ff']

# Setting the style
sns.set_style('whitegrid')

# Creating a new figure and axis
fig, ax = plt.subplots(figsize=(8, 6))

# Plot the count using seaborn's countplot with custom palette
sns.countplot(x='type', data=netflix_df, palette=color_palette, ax=ax)

# Labeling of values
ax.set_title('Number of Movies and TV Shows', fontsize=14)
ax.set_xlabel('Type', fontsize=12)
ax.set_ylabel('Count', fontsize=12)

# Visualization of number of movies and TV shows
plt.show()

##### 1. Why did you pick the specific chart?

Using a countplot, a type of bar chart, is an excellent option for visually representing categorical data like the number of movies and TV shows on Netflix. This method effectively illustrates the frequency of each category in a straightforward and easily understandable manner.

##### 2. What is/are the insight(s) found from the chart?

Netflix's library comprises 5377 movies and 2410 TV shows (7787 titles in total), so movies outnumber TV shows by more than two to one.
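
The counts behind this chart, and the share each type represents, can be read off with `value_counts`. A small sketch on made-up data (`toy` is illustrative, not the Netflix dataset):

```python
import pandas as pd

# Illustrative data only: three movies and one TV show
toy = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})

# Absolute counts per type
counts = toy["type"].value_counts()

# Percentage share per type
shares = toy["type"].value_counts(normalize=True).mul(100).round(1)

print(counts)
print(shares)
```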

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix's Movie-TV Show Ratio: While the balance between movies and TV shows may not be a sole business determinant, it informs strategic decisions. If subscribers favor TV shows, Netflix may prioritize acquiring more. Conversely, successful original movies may lead to resource allocation in that direction.

Potential Negatives: The ratio itself may not harm, but neglecting preferences and persistently emphasizing movies may risk subscriber attrition. Rivals expanding TV show content might erode Netflix's market share if it doesn't respond.

#### Chart - 3 Top 15 Genres in movies

In [None]:
# Chart - 3 visualization code
# Top 15 genre listings in movies
# Defining custom colors
color_palette = sns.color_palette("husl", 15)  # You can choose a different color palette if desired

# Creating a figure and axis
plt.figure(figsize=(15, 6))

# Count each 'listed_in' value among movies only and keep the 15 most frequent
top_genres = netflix_df.loc[netflix_df['type'] == 'Movie', 'listed_in'].value_counts().head(15)

# Plot the top 15 genres in movies
sns.barplot(x=top_genres.index, y=top_genres.values, palette=color_palette)
plt.xticks(rotation=45, ha="right")  # Adjust rotation for better readability
plt.title("Top 15 Genres in Movies", size=16, fontweight="bold", color="red")
plt.xlabel("Genre", size=12)
plt.ylabel("Count", size=12)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart for this analysis as it efficiently displays the top 15 movie genres. Vertical bars enable straightforward genre comparisons, with labels on the x-axis. A well-chosen color palette enhances clarity, offering a concise view of popular movie genres.

##### 2. What is/are the insight(s) found from the chart?

The chart showcases Netflix's top 15 movie genre listings, showing which genre combinations dominate the catalogue and thereby guiding content strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gained insights benefit Netflix by aligning its content with viewer preferences, potentially boosting engagement and attracting new subscribers. No negative growth indicators are evident; the focus is on enhancing user satisfaction and engagement through informed content and marketing strategies.

#### Chart - 4 Percentage of Originals vs Others in Movies

In [None]:
# Work on a copy of the movie subset so new columns can be added safely
movies = netflix_df[netflix_df['type'] == 'Movie'].copy()
movies['year_added'] = movies['date_added'].dt.year

# ASSUMPTION: the 'originals' column is not defined anywhere in this notebook.
# As a rough heuristic, a movie is flagged as an original when it was added to
# Netflix in the same year it was released.
movies['originals'] = np.where(movies['release_year'] == movies['year_added'], 'Yes', 'No')

In [None]:
# Chart - 4 visualization code
# Create a pie chart to visualize the percentage of original movies and non-original movies in the dataset
fig, ax = plt.subplots(figsize=(5, 5), facecolor="#363336")
ax.patch.set_facecolor('#363336')

# Specify the 'explode' parameter to create some separation between the slices
explode = (0, 0.1)

# Count the number of movies in each category (originals and others) using 'value_counts()'
# Reindex so the counts follow the same order as the labels ('No' -> Others, 'Yes' -> Originals)
colors = ['#ff9999', '#66b3ff']
labels = ['Others', 'Originals']
sizes = movies['originals'].value_counts().reindex(['No', 'Yes'], fill_value=0)

# Plot a pie chart with labels, percentages, and shadows using the 'ax.pie()' method
ax.pie(sizes, explode=explode, autopct='%.2f%%', labels=labels,
       shadow=True, startangle=90, textprops={'color': "black", 'fontsize': 20},
       colors=colors)

# Set the title for the plot
ax.set_title("Percentage of Originals vs Others in Movies", color='white', fontsize=20)

# Display the pie chart
plt.show()


##### 1. Why did you pick the specific chart?

The pie chart was used in the provided code to visually represent the percentage distribution of original movies and non-original movies in the dataset. Pie charts are commonly used to display how a single whole (100%) is divided into various parts, showing the proportion of each part relative to the whole.

##### 2. What is/are the insight(s) found from the chart?

While Netflix is known for its original content, only about 30% of the movies in this dataset are flagged as originals. The remaining 70% are titles acquired from other sources. This points to a library built largely on licensed content from around the world, from classic Hollywood to foreign cinema.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Netflix's ability to acquire 70% of its movie content from other sources underscores its cost-effective strategy in offering a diverse range of successful films without the need for extensive original production.

2. The presence of 30% original movies showcases Netflix's investment in exclusive content, setting it apart from competitors and attracting new viewers.

However, if Netflix fails to match the appeal of acquired content with its originals, it could face subscriber attrition. Over-reliance on acquisitions may also lead to costly licensing negotiations and impact profitability.

**BIVARIATE ANALYSIS :-**

#### Chart - 5  Analysis on release year of movies and tv shows

In [None]:
# Chart - 5 visualization code
# Creating two subsets: one for TV shows and one for movies
tv_shows = netflix_df[netflix_df['type'] == 'TV Show']
movies = netflix_df[netflix_df['type'] == 'Movie']

# Creating a figure with subplots
plt.figure(figsize=(12, 10))

# Defining colors for the plots
color_palette = sns.color_palette("coolwarm", 15)

# Analysis on release year of movie show
plt.subplot(2, 1, 1)
sns.countplot(y="release_year", data=movies, palette=color_palette,
              order=movies['release_year'].value_counts().index[0:15])
plt.title('ANALYSIS ON RELEASE YEAR OF MOVIES', fontsize=15, fontweight='bold', color='red')

# Analysis release year of TV show
plt.subplot(2, 1, 2)
sns.countplot(y="release_year", data=tv_shows, palette=color_palette,
              order=tv_shows['release_year'].value_counts().index[0:15])
plt.title('ANALYSIS ON RELEASE YEAR OF TV SHOWS', fontsize=15, fontweight='bold', color='red')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose the "countplot" because it's great for visualizing categorical data like release years for movies and TV shows. It uses horizontal bars to show category counts, making comparisons easy. Subplots and ordered categories help highlight the most common release years for both types effectively.

##### 2. What is/are the insight(s) found from the chart?

The chart unveils release year trends for movies and TV shows, emphasizing popular periods and production patterns. It facilitates medium comparisons and illuminates content growth and viewer preferences. Insights into production cycles and audience engagement trends are valuable for content creation decisions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact: Gained insights can positively influence content planning, targeted marketing, revenue generation, resource allocation, and audience engagement.

Potential for Negative Growth: Overreliance on trends, neglecting creativity, cyclical downturns, and disregarding evolving audience preferences could lead to negative growth despite insights.

#### Chart - 6 Analysing the top 15 countries with the most content

In [None]:
# Chart - 6 visualization code
# Analysing the top 15 countries with the most content
plt.figure(figsize=(10, 6))  # Set the figure size for the plot

# Create a custom color palette for the graph
color_palette = ["#ff9999", "#66b3ff"]

# Create a count plot using seaborn to visualize the top 15 countries with the most content
sns.countplot(x=netflix_df['country'], palette=color_palette,
              order=netflix_df['country'].value_counts().index[:15], hue=netflix_df['type'])

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right', fontsize=10)

# Set the title, title color in red, x label color in blue, and y label color in blue
plt.title('Top 15 Countries with Most Content on Netflix', fontsize=16, fontweight='bold', color='red')
plt.xlabel('Country', fontsize=12, color='blue')
plt.ylabel('Content Count', fontsize=12, color='blue')

# Add a legend to differentiate between 'Movie' and 'TV Show' types
plt.legend(title='Type', title_fontsize=10, fontsize=8)

# Add a grid for easier data reading
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Adjust plot layout for better spacing
plt.tight_layout()

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

The selected chart, a bar plot with hue differentiation, effectively presents the distribution of content across the top 15 countries while highlighting the types of content in an easily comprehensible manner.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the content distribution across the leading 15 countries, showcasing the content type proportions within each country, which can inform content acquisition strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Certainly, these insights empower informed decisions, potentially leading to targeted content investments and content acquisition agreements for improved viewer engagement and business growth.

#### Chart - 7 Ratings for movies and tv shows.

In [None]:
# Chart - 7 visualization code
# Define the order of ratings
order = ['G', 'TV-Y', 'TV-G', 'PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG', 'PG-13', 'TV-14', 'R', 'NC-17', 'TV-MA']

# Create subplots with custom colors
fig, axes = plt.subplots(1, 2, figsize=(19, 5))

# Countplot for Movies with a custom color palette (movies subset only)
sns.countplot(data=movies, x='rating', order=order, palette="Blues", ax=axes[0])
axes[0].set_title("Ratings for Movies", color='red')
axes[0].set_xlabel("Rating", color='blue')
axes[0].set_ylabel("Total Count", color='blue')

# Countplot for TV Shows with a custom color palette (TV shows subset only)
sns.countplot(data=tv_shows, x='rating', order=order, palette="Greens", ax=axes[1])
axes[1].set(yticks=np.arange(0, 1600, 200))
axes[1].set_title("Ratings for TV Shows", color='red')
axes[1].set_xlabel("Rating", color='blue')
axes[1].set_ylabel("Total Count", color='blue')

# Adjust layout
plt.tight_layout()

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I used side-by-side countplots to compare content ratings for movies and TV shows. This format simplifies the comparison, and the ordered x-axis makes it easy to see rating distributions across categories.

##### 2. What is/are the insight(s) found from the chart?

The chart shows Netflix's content ratings distribution for movies and TV shows. "TV-MA" is the most common rating for both, followed by "TV-14" and "TV-PG" for TV shows, and "R" for movies. This suggests a focus on mature audiences, reflecting diverse content preferences.
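
Claims like these are easy to verify numerically with a cross-tabulation of rating against type. The sketch below uses a small made-up DataFrame (`toy` is illustrative data, not the Netflix dataset); on the real data the same calls would take `netflix_df['rating']` and `netflix_df['type']`.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "rating": ["TV-MA", "R", "TV-MA", "TV-MA", "TV-14", "TV-14"],
})

# Rows are ratings, columns are content types, cells are title counts
rating_by_type = pd.crosstab(toy["rating"], toy["type"])

# Most common rating for each content type
top_rating = rating_by_type.idxmax()

print(rating_by_type)
print(top_rating)
```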

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can benefit Netflix by tailoring content to viewer preferences and demographics, potentially boosting engagement, satisfaction, and user retention. While the chart itself doesn't suggest negative growth, an imbalance favoring mature content over others could alienate some viewers. Maintaining a balanced content library across ratings can mitigate this and foster growth by appealing to a wider audience.

#### Chart - 8 Analysis of Tv shows and Movies genres.

In [None]:
# Chart - 8 visualization code
# Set the figure size for the side-by-side plots.
plt.figure(figsize=(16, 6))

# Define custom colors for the plots
custom_colors = ["#1f77b4", "#ff7f0e"]

# Analysis of TV shows listed in different categories
plt.subplot(1, 2, 1)  # Create the left subplot for TV shows.
sns.countplot(y="listed_in", data=tv_shows, palette=custom_colors,
              order=tv_shows['listed_in'].value_counts().index[0:15])  # Displaying top 15 categories.
plt.title('ANALYSIS OF TV SHOW GENRES', fontsize=15, fontweight='bold', color='red')  # Set title and font.
plt.xlabel("Count")  # Set x-axis label.

# Analysis of movies listed in different categories
plt.subplot(1, 2, 2)  # Create the right subplot for movies.
sns.countplot(y="listed_in", data=movies, palette=custom_colors,
              order=movies['listed_in'].value_counts().index[0:15])  # Displaying top 15 categories.
plt.title('ANALYSIS OF MOVIE GENRES', fontsize=15, fontweight='bold', color='red')  # Set title and font.
plt.xlabel("Count")  # Set x-axis label.

plt.tight_layout()  # Adjust the spacing between subplots.
plt.show()  # Display the subplots.


##### 1. Why did you pick the specific chart?

Subplots were chosen to analyze TV show and movie categories side by side, aiding direct comparison and identification of genre preferences.

##### 2. What is/are the insight(s) found from the chart?

Insights from the subplots: The subplots provide a clear overview of the most popular genres for TV shows and movies. They highlight viewer preferences and help in strategic programming decisions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact: Yes, insights can drive content strategy. Recognizing popular genres can lead to curated content, attracting and retaining a more engaged audience.

Negative growth insights: If specific genres show disproportionately low interest, it might indicate a lack of engagement in those areas. Adjusting content offerings can address potential negative impact.

**MULTIVARIATE ANALYSIS :-**

#### Chart - 9  The distribution of content across different countries.

In [None]:
# Chart - 9 visualization code
country = netflix_df.country.value_counts()

coun = {}

# Split multi-country entries and accumulate the title count for each country
for idx, val in country.items():
    for name in idx.split(','):
        name = name.strip()
        coun[name] = coun.get(name, 0) + val

# Create lists for nations and counts
nation, count = [], []
for idx, val in coun.items():
    nation.append(idx)
    count.append(val)

# Create a DataFrame
temp = (pd.DataFrame({'country': nation, 'count': count})
        .sort_values('count', ascending=False))

# Highlight countries whose count exceeds the 31st-largest value (roughly the top 30); grey out the rest
threshold = temp['count'].values[30]
temp['color'] = temp['count'].apply(lambda x: '#b20710' if x > threshold else 'grey')

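# Load geospatial data.
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0, so this call requires geopandas < 1.0.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))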

# Convert country names to ISO codes
temp['iso_code'] = coco.convert(names=temp['country'], to='ISO3')
temp = temp[temp['iso_code'] != 'not found']

# Merge geospatial data with DataFrame
temp_map = world.merge(temp, left_on='iso_a3', right_on='iso_code')

# Drop unnecessary columns
temp_map.drop(columns=['continent', 'gdp_md_est', 'pop_est', 'name'], inplace=True)
temp_map = temp_map.sort_values(by='count', ascending=False)

# Visualization setup
colors = ['#FF1493', 'grey', '#FF1493']

fig, ax = plt.subplots(figsize=(15, 7.5), dpi=80)
fig.patch.set_facecolor('#FFF8DC')
ax.set_facecolor('#FFF8DC')

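# Draw the map. Only 'color' is passed: when both 'column' and 'color' are given,
# GeoPandas uses 'color' and ignores 'column', so the result is the same two-color map.
plot_df = temp_map.dropna()
plot_df.plot(color=plot_df['color'],
             legend=False,
             ax=ax)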

# Enhance the visualization
for loc in ['left', 'right', 'top', 'bottom']:
    ax.spines[loc].set_visible(False)
ax.axes.get_xaxis().set_visible(False)
ax.axes.get_yaxis().set_visible(False)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A choropleth map was chosen to show how content is distributed across countries. It highlights the countries with the highest content counts in red and the rest in grey, which makes geographical disparities easy to spot on a global scale.
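
The per-country tally that feeds the map can also be written with pandas string methods instead of an explicit loop. The sketch below is illustrative only: `toy_df` is a hypothetical stand-in for `netflix_df`, and it assumes the `country` column holds comma-separated country names with missing values as `None`/`NaN`.

```python
import pandas as pd

# Hypothetical stand-in for netflix_df; the real notebook uses the full dataset.
toy_df = pd.DataFrame({
    "country": ["United States", "India, United States", "United Kingdom", None, "India"],
})

# Split multi-country entries, put each country on its own row, and count.
country_counts = (
    toy_df["country"]
    .dropna()            # value_counts() in the notebook also ignores missing values
    .str.split(",")      # "India, United States" -> ["India", " United States"]
    .explode()           # one row per country
    .str.strip()         # remove the leading space left by the split
    .value_counts()
)
print(country_counts)
```

A title co-produced by several countries is counted once for each of them, which matches what the loop in the chart code does.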

##### 2. What is/are the insight(s) found from the chart?

The choropleth map provides insights into the distribution of content across different countries. Some of the insights gained from the chart include:

Content Concentration: The map highlights the countries with the highest content counts, such as the United States, India, and the United Kingdom, which appear in red. These countries account for a significant amount of the content on the platform.

Global Variability: The split between red and grey countries shows how unevenly content is distributed. Some countries have limited content, while others have a far more extensive range.

Content Deserts: Grey regions have relatively low content counts and might be underserved in terms of available content options.

Regional Patterns: Neighboring countries often share the same color, which suggests similar content levels within a region.

Content Gaps: A high count does not by itself imply variety; a large share of a country's content may cater to a specific audience or language group.

Platform Penetration: Content counts may indirectly reflect the platform's popularity in different regions, although the map does not measure viewership.

Potential Growth Areas: Grey countries might represent potential growth areas for the platform to expand its content library and user base.

Overall, the chart provides a visual summary of content distribution, aiding in identifying content-rich regions, potential growth opportunities, and areas where content diversity could be improved.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The choropleth map insights can drive both positive and negative impacts for the streaming platform:

Positive Impact:

* Targeted Expansion: Identifying high-demand regions suggests where expansion can yield positive results.

* Localized Content: Knowing regional preferences guides tailored content, boosting engagement.

* Localization Efforts: Translation and subtitles can attract diverse audiences in content-limited regions.

* Advertising Revenue: Content-rich regions inform targeted ad campaigns, maximizing revenue.

Negative Growth Potential:

* Saturation Concerns: Overloading saturated markets may lead to content fatigue and reduced engagement.

* Neglected Markets: Focusing on content-rich areas might miss growth opportunities in untapped regions.

* Language & Culture: Limiting content to specific languages can hinder engagement in diverse regions.

* Market Competition: Rivals may intensify efforts in content-rich regions, impacting market share.

#### Chart - 10  WordCloud of the Title

In [None]:
# Chart - 10 visualization code

# Combine all the titles into a single text (cast to str so non-string values do not break the join)
text = " ".join(netflix_df['title'].astype(str))

# Create the WordCloud object
wordcloud = WordCloud(width=800, height=500, background_color='white',
                      stopwords=STOPWORDS, min_font_size=10, colormap='viridis')

# Generate the word cloud from the text
wordcloud.generate(text)

# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Display the word cloud using imshow
ax.imshow(wordcloud, interpolation='bilinear')
ax.set_axis_off()  # Turn off axis

# Set a title for the plot
ax.set_title("Word Cloud of Titles", fontsize=16, fontweight='bold', color='red')

# Show the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose the word cloud chart to visualize the most frequent words in content titles. It offers a quick, visually engaging way to grasp the prominent keywords and themes, aiding in understanding content focus without exhaustive analysis. It's an effective, at-a-glance summary tool.

##### 2. What is/are the insight(s) found from the chart?

The word cloud highlights prominent words in content titles, offering insights:

* Genres: "Love," "Story," "Life," and "World" suggest a focus on emotional storytelling.
* Action & Adventure: "Action," "Adventure," "Battle," and "War" signify action-packed content.
* Fantasy: "Magic," "Fantasy," and "Kingdom" point to a strong presence of fantasy themes.
* Comedy: The term "Comedy" hints at humorous content.
* Mystery & Thriller: "Mystery," "Crime," "Murder," and "Suspense" imply mystery and suspense elements.
* Cultural & Historical: "History," "Cultural," "Documentary," and "Ancient" suggest historical and cultural themes.
* Education & Learning: Words like "Learn," "Educational," and "School" indicate educational content.

These insights provide a broad overview of prevalent content themes, although quantitative analysis is required for precise theme distribution.
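
One way to make the word cloud quantitative is to count title words directly. The following is a minimal sketch, not the notebook's method: the `titles` list and the small `stopwords` set are hypothetical placeholders, and in the notebook the titles would come from `netflix_df['title']` and the stopwords could come from `wordcloud.STOPWORDS`.

```python
import re
from collections import Counter

# Hypothetical sample titles; replace with netflix_df['title'] in the notebook.
titles = ["Love Story", "A Love Like This", "The Story of Life", "War of the Worlds"]

# Small illustrative stopword set (assumption, not the list used by WordCloud).
stopwords = {"a", "the", "of", "this", "like"}


def title_word_counts(titles, stopwords):
    """Count lowercase word occurrences across titles, skipping stopwords."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", str(title).lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


counts = title_word_counts(titles, stopwords)
print(counts.most_common(3))
```

Sorting the resulting counts gives an explicit ranking of title words, which can be checked against the visual impression from the cloud.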

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The word cloud insights can positively impact content strategies but don't inherently indicate negative growth.

Positive Impact: Valuable for tailoring offerings to match popular themes and genres.

No Negative Growth Insights: The word cloud doesn't highlight declining trends or negative impacts; it focuses on frequently occurring terms in titles.

#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Create separate DataFrames for TV shows and movies
tv_shows = netflix_df[netflix_df['type'] == 'TV Show']  # Filter rows with 'TV Show' type into a new DataFrame 'tv_shows'.
movies = netflix_df[netflix_df['type'] == 'Movie']  # Filter rows with 'Movie' type into a new DataFrame 'movies'.

# Assigning the Ratings into grouped categories
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
netflix_df['target_ages'] = netflix_df['rating'].replace(ratings)  # Create a new column 'target_ages' by mapping ratings to grouped categories.
# Count titles per country and keep the 10 countries with the most content.
# Only 'count' is selected so the sum stays numeric and reset_index() does not
# collide with a second 'country' column in recent pandas versions.
netflix_df['count'] = 1
data = netflix_df.groupby('country')['count'].sum().sort_values(ascending=False).reset_index()[:10]
data = data['country']

# Filter the original DataFrame based on the top 10 countries with the highest content count.
df_heatmap = netflix_df.loc[netflix_df['country'].isin(data)]

# Create a cross-tabulation of 'target_ages' and 'country' columns to get the normalized percentage of each target age group for each country.
df_heatmap = pd.crosstab(df_heatmap['country'], df_heatmap['target_ages'], normalize="index").T

# Display the resulting DataFrame that will be used for the heatmap.
df_heatmap

In [None]:
# Correlation Heatmap visualization code
# Set the figure size for the plot
plt.figure(figsize=(15, 6))

# Create a heatmap using Seaborn
sns.heatmap(df_heatmap, cmap='RdGy', annot=True, fmt=".1%", cbar=False)

# Set the title and font properties
plt.title('Normalized Content Distribution by Target Age Group for Top 10 Countries', fontsize=15, fontweight='bold', color='red')

# Set x and y axis labels and their font properties
plt.xlabel('Country', fontsize=12, color='blue')
plt.ylabel('Age Group', fontsize=12, color='blue')

# Display the heatmap with the updated color map
plt.show()

##### 1. Why did you pick the specific chart?

I chose a heatmap because it effectively visualizes the normalized content distribution across different target age groups for the top 10 countries. The heatmap provides a clear and concise overview of how content is distributed among various age groups for each country. The color scale allows for quick identification of patterns and trends in content distribution.

##### 2. What is/are the insight(s) found from the chart?

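The heatmap shows, for each of the top 10 countries, the share of titles aimed at each target age group; the shares in each country's column sum to 100%. With the `RdGy` colormap, dark grey cells mark the largest shares, deep red cells the smallest, and pale cells the mid-range values. Comparing columns shows where a country's content is concentrated in one age group and where it is spread more evenly, which can guide age-based content strategies by region.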
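
The normalization behind the heatmap can be checked on a small example. This sketch uses a hypothetical `toy_df` with made-up rows, not the Netflix data; it only illustrates how `pd.crosstab(..., normalize="index")` followed by `.T` yields one column per country whose shares sum to 1.

```python
import pandas as pd

# Hypothetical rows; the notebook applies the same call to the filtered netflix_df.
toy_df = pd.DataFrame({
    "country": ["India", "India", "India", "India", "Japan", "Japan"],
    "target_ages": ["Adults", "Adults", "Teens", "Kids", "Teens", "Teens"],
})

# normalize="index" divides each country's row by its total, so every row sums to 1;
# transposing then puts countries on the columns, as in the heatmap.
shares = pd.crosstab(toy_df["country"], toy_df["target_ages"], normalize="index").T
print(shares)
```

Because each column is normalized separately, the heatmap compares the age mix within countries rather than the absolute number of titles between them.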

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### Which missing value imputation techniques have you used, and why did you use them?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### Which outlier treatment techniques have you used, and why did you use them?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### Which categorical encoding techniques have you used, and why did you use them?

Answer Here.

### 4. Textual Data Preprocessing
(Mandatory for textual datasets, e.g., NLP, Sentiment Analysis, Text Clustering, etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used? Explain why.

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***