
# **Project Name**    - Netflix Show and Movies Clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**



This project analyzes Netflix's catalog of TV shows and movies and integrates external datasets such as IMDb ratings and Rotten Tomatoes scores to uncover trends in the streaming industry. It explores the growth of TV shows and the decline of movies on Netflix since 2010, giving a quantitative view of how the platform's content has evolved. Combining the Flixable dataset with these additional sources lets us examine viewer preferences, critical reception, and genre popularity, shedding light on the factors that shape Netflix's content strategy.

The Flixable dataset lists the TV shows and movies available on Netflix as of 2019. With it, we can check the findings Flixable reported in 2018: a substantial increase in the number of TV shows and a decline in the number of movies since 2010. Data analysis and visualization let us quantify these changes and give an accurate picture of Netflix's content landscape over time.
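As a minimal sketch of how that check could be quantified, the snippet below counts titles per year added, split by type. It runs on a few hand-made rows rather than the Flixable file; the column names `type` and `date_added` follow the dataset's variable list, while the sample values and the date format are assumptions made for illustration.

```python
import pandas as pd

# Illustrative rows only; the notebook itself would use the Flixable dataset.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie", "TV Show"],
    "date_added": ["January 1, 2016", "March 5, 2016", "July 9, 2017",
                   "August 2, 2017", "December 30, 2017"],
})

# Parse the text dates and keep only the year each title was added.
sample["year_added"] = pd.to_datetime(
    sample["date_added"], format="%B %d, %Y"
).dt.year

# Rows are years, columns are content types, cells are title counts.
counts_by_year = pd.crosstab(sample["year_added"], sample["type"])
print(counts_by_year)
```

Comparing the `Movie` and `TV Show` columns year by year is one simple way to see whether the two content types are moving in opposite directions.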

# **GitHub Link -**

https://github.com/MishraVikas01/Netflix-Show-and-Movies-Clustering

# **Problem Statement**


"Investigate the changing content landscape on Netflix over the years by analyzing a dataset of TV shows and movies available on the platform as of 2019. Explore the trend highlighted in Flixable's report, which states that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has significantly decreased. Additionally, leverage external datasets such as IMDb ratings and Rotten Tomatoes scores to uncover correlations between content availability, ratings, and audience preferences. The objective is to identify key insights, patterns, and potential factors influencing Netflix's content strategy and user engagement, ultimately contributing to a comprehensive understanding of the streaming service's evolution and success."

#### **Define Your Business Objective?**

The business objective for our project is to analyze the Netflix dataset and integrated external datasets like IMDb ratings and Rotten Tomatoes scores to:

Understand Netflix's content strategy: Analyze changes in the number of movies and TV shows over time to uncover patterns and insights into Netflix's content acquisition decisions.

Evaluate content quality: Assess the quality and popularity of movies and TV shows on Netflix using IMDb ratings and Rotten Tomatoes scores, identifying correlations between ratings and the number of titles in each category.

Identify popular genres and trends: Analyze the dataset to determine the most popular genres and identify any significant growth or decline over time, helping understand audience preferences and content gaps.

Develop a recommendation system: Utilize the dataset and external ratings to create a recommendation system for suggesting movies and TV shows based on user preferences and historical data. Evaluate different recommendation algorithms and their impact on user engagement.

Support content acquisition decisions: Provide insights to Netflix's content acquisition teams by identifying successful genres, directors, or actors associated with highly-rated content, aiding their decision-making when acquiring new content.
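The genre-popularity objective above can be sketched with a few lines of pandas. This is an illustration on made-up rows, and it assumes that `listed_in` stores several genre labels separated by commas; the titles and labels shown are hypothetical.

```python
import pandas as pd

# Illustrative rows only; `listed_in` is assumed to hold comma-separated genres.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": ["Dramas, International Movies", "Comedies", "Dramas, Comedies"],
})

# Split each label string into a list, then give every genre its own row.
genres = sample["listed_in"].str.split(",").explode().str.strip()

# Count how many titles carry each genre.
genre_counts = genres.value_counts()
print(genre_counts)
```

Counting exploded labels, rather than the raw `listed_in` strings, avoids treating every distinct combination of genres as its own category.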

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv("/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING (1).csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:

# Visualizing the missing values
import missingno as msno

# Create a missing value bar chart
msno.bar(df)

# Show the plot
plt.show()


### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Handling missing values in the dataset
# Drop rows with a null "date_added"; copy so later in-place edits do not touch df
data = df.dropna(subset=['date_added']).copy()

# Replace null values in the remaining columns with the placeholder string "NULL"
data.fillna("NULL", inplace=True)

In [None]:
# Count missing values
data.isnull().sum()

In [None]:
# New shape of the dataset
data.shape

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe().transpose()

### Variables Description 

show_id : Unique ID for every Movie / TV Show

type : Identifier - A Movie or TV Show

title : Title of the Movie / TV Show

director : Director of the Movie

cast : Actors involved in the movie / show

country : Country where the movie / show was produced

date_added : Date it was added on Netflix

release_year : Actual release year of the movie / show

rating : TV Rating of the movie / show

duration : Total Duration - in minutes or number of seasons

listed_in : Genre

description : The Summary description
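Because `duration` mixes two units, it usually has to be split before it can be treated as a number. The sketch below shows one way to do that; the sample strings ("90 min", "2 Seasons") are assumed formats based on the description above, not values confirmed from the file.

```python
import pandas as pd

# Illustrative values in the two forms the description implies.
duration = pd.Series(["90 min", "2 Seasons", "1 Season", "125 min"])

# Capture the leading number and the unit text that follows it.
parts = duration.str.extract(r"(?P<value>\d+)\s*(?P<unit>\D+)")
parts["value"] = parts["value"].astype(int)

# Normalise "Season" and "Seasons" to a single label.
parts["unit"] = (
    parts["unit"].str.strip().str.lower()
    .str.replace(r"seasons?", "season", regex=True)
)
print(parts)
```

Keeping the unit in its own column makes it easy to analyse movie runtimes and season counts separately instead of mixing them on one axis.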

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:

# Write your code to make your dataset analysis ready.
# Print the first five rows of the dataset
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Fill missing values in numeric columns with the column mean
# (numeric_only=True skips text columns, which have no mean)
df = data.fillna(data.mean(numeric_only=True))

# Check for duplicate rows
print(data.duplicated().sum())


# Print the shape of the cleaned dataset
print(data.shape)

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Line Plot: Display the count of shows/movies over time
plt.figure(figsize=(10, 6))
# Strip stray whitespace first so every date string parses with the same format
df['date_added'] = pd.to_datetime(df['date_added'].str.strip())
count_by_year = df['date_added'].dt.year.value_counts().sort_index()
count_by_year.plot(kind='line', marker='o')
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('Count of Shows/Movies Over Time')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Stacked Bar Plot: Visualize the count of shows/movies by type and rating
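rating_counts = df.groupby(['type', 'rating']).size().unstack()
# Pass figsize to plot(): DataFrame.plot opens its own figure,
# so a preceding plt.figure() call would only leave an empty figure behind
rating_counts.plot(kind='bar', stacked=True, figsize=(10, 6))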
plt.xlabel('Type')
plt.ylabel('Count')
plt.title('Count of Shows/Movies by Type and Rating')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Bar chart of show types
show_types = data['type'].value_counts()
plt.bar(show_types.index, show_types.values)
plt.xlabel('Show Type')
plt.ylabel('Count')
plt.title('Distribution of Show Types')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Pie chart of show ratings
show_ratings = data['rating'].value_counts()
plt.pie(show_ratings.values, labels=show_ratings.index, autopct='%1.1f%%')
plt.title('Distribution of Show Ratings')
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
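# Histogram of movie durations in minutes
# 'duration' is text such as "90 min"; TV shows are measured in seasons, so only rows
# with type 'Movie' are used (assumes that is how movies are labelled in 'type')
movie_minutes = pd.to_numeric(
    data.loc[data['type'] == 'Movie', 'duration'].str.extract(r'(\d+)')[0],
    errors='coerce').dropna()
plt.hist(movie_minutes, bins=10)
plt.xlabel('Duration (minutes)')
plt.ylabel('Count')
plt.title('Distribution of Movie Durations')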
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Bar chart of top 10 countries with the most shows
top_countries = data['country'].value_counts().head(10)
plt.bar(top_countries.index, top_countries.values)
plt.xlabel('Country')
plt.ylabel('Count')
plt.title('Top 10 Countries with the Most Shows')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Line chart of show releases over the years
show_releases = df['release_year'].value_counts().sort_index()
plt.plot(show_releases.index, show_releases.values)
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.title('Show Releases Over the Years')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:

# Chart - 8 visualization code

plt.figure(figsize=(8, 8))
df['type'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Distribution of Show Types')
plt.ylabel('')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Stacked Bar Plot: Compare the distribution of show genres by release year
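pivot_genre = df.groupby(['release_year', 'listed_in']).size().unstack(fill_value=0)
# Pass figsize to plot(): DataFrame.plot opens its own figure,
# so a preceding plt.figure() call would only leave an empty figure behind
pivot_genre.plot(kind='bar', stacked=True, figsize=(12, 6))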
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.title('Distribution of Show Genres by Release Year')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

In [None]:
columns = ['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description']

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
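# 'rating' and 'duration' are stored as text, so convert them to numbers first
corr_data = pd.DataFrame({
    'release_year': df['release_year'],
    # Integer code per rating label; the codes are nominal, so read these correlations with care
    'rating': df['rating'].astype('category').cat.codes,
    # Leading number of strings such as "90 min" or "2 Seasons" (units are mixed)
    'duration': pd.to_numeric(df['duration'].str.extract(r'(\d+)')[0], errors='coerce'),
})

# Create a correlation matrix
correlation_matrix = corr_data.corr()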

# Generate a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code
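# Select the relevant columns for clustering
columns = ['release_year', 'rating', 'duration']

# Subset the data; copy so the conversions below do not alter `data`
subset_data = data[columns].copy()

# pairplot only draws numeric columns, so encode the text columns first
subset_data['rating'] = subset_data['rating'].astype('category').cat.codes
subset_data['duration'] = pd.to_numeric(
    subset_data['duration'].str.extract(r'(\d+)')[0], errors='coerce')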

# Create the pair plot chart
sns.pairplot(subset_data)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective? 
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***