<a href="https://colab.research.google.com/github/Piyush20002/Netflix_EDA/blob/main/Netflix_EDA_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - NETFLIX MOVIES AND TV SHOWS CLUSTERING



##### **Project Type**    - Netflix EDA project
##### **Contribution**    - Individual
##### **Name**    - Piyush Chaudhari


# **Project Summary -**

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine. In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets, such as IMDb ratings and Rotten Tomatoes scores, could also yield many interesting findings.

# **GitHub Link -**

https://github.com/Piyush20002/Netflix_EDA

# **Problem Statement**



The global entertainment industry has undergone a massive transformation with the rise of Over-The-Top (OTT) platforms, and Netflix stands at the forefront of this digital revolution. With an expansive and ever-growing library of TV shows and movies available across various countries and genres, Netflix is constantly challenged to curate, categorize, and recommend content effectively to retain users, maximize engagement, and ensure long-term subscriber growth.

This project aims to perform a detailed **Exploratory Data Analysis (EDA)** on a dataset that consists of TV shows and movies available on Netflix as of 2019. The dataset, collected from Flixable—a third-party Netflix search engine—includes features such as show type, title, director, cast, country, date added, release year, rating, duration, and genre.


#### **Define Your Business Objective?**

Objectives:

1. To understand the composition and distribution of content available on Netflix across various dimensions like content type, duration, genre, release year, and country of origin.
2. To identify patterns and trends in Netflix's content acquisition strategy over time.
3. To analyze seasonal behaviors—such as which months see more content additions.
4. To explore content diversity and the dominance of specific genres, countries, or ratings.
5. To apply clustering techniques to identify groups of similar content based on duration and release year, which could potentially assist in recommendation systems or personalized content delivery.
6. To uncover actionable business insights that Netflix could leverage to improve its content strategy, localization, user experience, and market segmentation.


Business Need:

Netflix constantly aims to enhance viewer satisfaction and retention. This requires a deep understanding of how its content library is structured and evolving, which in turn means identifying viewing patterns, content gaps, over-represented areas, and audience targeting opportunities.

Through structured EDA and clustering, this project not only explores Netflix's content landscape but also provides actionable insights that align with business objectives.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import gdown

### Dataset Loading

In [None]:
#Load the dataset
drive_url = 'https://drive.google.com/file/d/1xJGllnE12mAggLuRo8b0oNSshUlG8GvF/view?usp=drive_link'

# Extract the file ID from the sharing URL
file_id = drive_url.split('/d/')[1].split('/')[0]
download_url = f'https://drive.google.com/uc?id={file_id}'

# Download the file
gdown.download(download_url, 'netflix.csv', quiet=False)
df = pd.read_csv('netflix.csv')

### Dataset First View

In [None]:

# Display first 5 rows
print(" First 5 rows of the dataset:")
df.head()



### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print(f"The dataset contains:\n {rows} rows (entries)\n {columns} columns (features)")


### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Correction needed for Datatypes

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

In [None]:
df = df.drop_duplicates()
print(f"New dataset shape after removing duplicates: {df.shape}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

In [None]:
# Calculate missing values
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100
missing_df = pd.DataFrame({'Missing Values': missing, 'Percentage (%)': missing_percent})
missing_df = missing_df[missing_df['Missing Values'] > 0].sort_values(by='Missing Values', ascending=False)

# Display table
print(" Missing Values Summary:\n")
print(missing_df)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x=missing_df.index, y=missing_df['Missing Values'], palette='rocket')
plt.title("Missing Values Count per Column")
plt.xlabel("Columns")
plt.ylabel("Missing Count")
plt.xticks(rotation=45)
plt.show()

In [None]:
# missing values with percentage
missing_data = df.isnull().sum().reset_index()
missing_data.columns = ['Feature', 'MissingCount']
missing_data['MissingPercentage'] = (missing_data['MissingCount'] / len(df)) * 100
missing_data = missing_data[missing_data['MissingCount'] > 0]

plt.figure(figsize=(10,6))
sns.barplot(data=missing_data, x='MissingPercentage', y='Feature', palette='flare')
plt.title('Missing Values Percentage by Feature')
plt.xlabel('Missing Percentage (%)')
plt.ylabel('Feature')
for index, value in enumerate(missing_data['MissingPercentage']):
    plt.text(value + 0.5, index, f'{value:.1f}%', va='center')
plt.show()



### What did you know about your dataset?


* The dataset includes movies and TV shows available on Netflix as of 2019.
* It contains **multiple missing values**, particularly in columns such as `director`, `cast`, `country`, `rating`, and `date_added`.
* A few **duplicate rows** may exist, and if so, should be considered for removal to prevent bias in analysis.
* The data types are mostly `object` (categorical), but some columns like `release_year` and `duration` can be converted for numerical or time-series analysis.
* Features such as `listed_in` (genre) are multi-valued and will need preprocessing (like splitting or exploding).
* The dataset is a rich source for performing **UBM (univariate, bivariate, and multivariate) analysis** to derive content trends and viewer-targeting strategies.
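
As a minimal sketch of the splitting-and-exploding step mentioned for `listed_in`, the snippet below works on a small hand-made frame. The titles and genre strings are invented for illustration and are not taken from the dataset; only the column name `listed_in` comes from the notebook.

```python
import pandas as pd

# Toy rows standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "title": ["Show A", "Movie B", "Movie C"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies, Thrillers, Dramas"],
})

# Split the comma-separated string into a list, then give each genre its own row.
genres_long = (
    toy.assign(genre=toy["listed_in"].str.split(","))
       .explode("genre")
)

# Remove the whitespace left behind by the ", " separator.
genres_long["genre"] = genres_long["genre"].str.strip()

# Each title now contributes one row per genre it lists.
genre_counts = genres_long["genre"].value_counts()
print(genre_counts)
```

After exploding, one title contributes several rows, so counts on the long frame measure genre tags rather than unique titles.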


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:")
print(df.columns.tolist())


In [None]:
#describe dataset
df.describe(include='all').T

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(" Unique Value Count per Column:\n")
for col in df.columns:
    unique_vals = df[col].nunique()
    print(f"{col}: {unique_vals} [unique values]")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Fill nulls (assign the result back; calling fillna(inplace=True) on a
# selected column raises chained-assignment warnings in recent pandas)
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
df['rating'] = df['rating'].fillna('Unknown')
df['duration'] = df['duration'].fillna('Unknown')
df['date_added'] = df['date_added'].fillna(df['date_added'].mode()[0])

# Convert to datetime
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

# Convert release_year to numeric
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')

# Extract year from date_added
df['year_added'] = df['date_added'].dt.year

# Extract duration as number (in mins or seasons)
def extract_duration(duration):
    if 'Season' in str(duration):
        return int(str(duration).split()[0])
    elif 'min' in str(duration):
        return int(str(duration).replace(' min', ''))
    return np.nan

df['duration_mins'] = df['duration'].apply(extract_duration)

# Label-encode categorical columns (the *_Count columns hold integer codes, not counts)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

df['type_Count'] = le.fit_transform(df['type'])
df['rating_Count'] = le.fit_transform(df['rating'])
df['country_Count'] = le.fit_transform(df['country'])
df['listed_Count'] = le.fit_transform(df['listed_in'])

# Display result
print("\n Data Types After Cleaning:")
print(df.dtypes)

print("\n Preview of Cleaned Dataset:")
df.head()



### What all manipulations have you done and insights you found?


To prepare the Netflix dataset for analysis, several data manipulations were performed. First, missing values in `director`, `cast`, `country`, `rating`, and `duration` were filled with `"Unknown"`, and missing `date_added` values were filled with the most frequent date. The `date_added` column was then converted to datetime, and a new column, `year_added`, was extracted from it to analyze content addition trends over time. The `duration` column, which mixes movie runtimes (in minutes) and TV show lengths (in seasons), was parsed with a small helper function into a single numeric column, `duration_mins`; despite its name, this column holds minutes for movies and season counts for TV shows. Label encoding was applied to `type`, `rating`, `country`, and `listed_in`, producing integer codes (stored in the `*_Count` columns) for use in clustering and modeling, while the original readable columns were kept for interpretation and visualization.
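
Because minutes and seasons are on different scales, it can help to keep them in separate columns. The sketch below shows one possible way to do that with a regular expression; the column names `movie_minutes` and `show_seasons` are hypothetical, and the sample strings are assumed to follow the `"90 min"` / `"2 Seasons"` pattern that the wrangling code relies on.

```python
import pandas as pd

# Toy duration values, including the "Unknown" placeholder and a true null.
toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season", "Unknown", None]})

# Capture the leading number and the unit that follows it.
parts = toy["duration"].str.extract(r"^(?P<value>\d+)\s*(?P<unit>min|Seasons?)$")
value = pd.to_numeric(parts["value"], errors="coerce")

# Keep minutes and seasons apart so the two scales are never mixed.
toy["movie_minutes"] = value.where(parts["unit"] == "min")
toy["show_seasons"] = value.where(parts["unit"].str.startswith("Season", na=False))

print(toy)
```

Rows that do not match the pattern end up as `NaN` in both columns, which makes unparsed values easy to spot.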

Through univariate, bivariate, and multivariate (UBM) analysis, several insights were discovered. The majority of Netflix content consists of movies, particularly those around 90–100 minutes long. Most of the content was added after 2016, indicating the platform’s rapid growth. The United States, India, and the United Kingdom are the leading producers of Netflix content. Ratings such as TV-MA and TV-14 dominate, suggesting a focus on mature content. October and July emerged as peak months for content additions. Genre analysis revealed that Dramas, International TV Shows, and Comedies are the most popular categories. Clustering titles by release year and duration, for example with K-Means, is a natural next step that could help personalize recommendations and improve content **segmentation**. These insights suggest business opportunities such as expanding the TV show library to boost retention, diversifying content ratings to target broader demographics, and using seasonal trends to optimize release strategies.
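
The code cells in this notebook do not include a clustering step. As a rough sketch of how K-Means on release year and duration might be run, the example below uses synthetic movie rows (generated at random, not drawn from the dataset). The feature scaling, `n_clusters=3`, and the `random_state` values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic movies in three loose groups, invented purely for illustration.
toy = pd.DataFrame({
    "release_year": np.concatenate([
        rng.integers(1970, 1990, 30),
        rng.integers(2010, 2020, 30),
        rng.integers(2010, 2020, 30),
    ]),
    "duration_mins": np.concatenate([
        rng.integers(90, 120, 30),
        rng.integers(40, 70, 30),
        rng.integers(130, 170, 30),
    ]),
})

# Scale both features so that neither dominates the distance calculation.
X = StandardScaler().fit_transform(toy[["release_year", "duration_mins"]])

# Fit K-Means and attach the cluster label to each row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
toy["cluster"] = kmeans.fit_predict(X)

# Summarize each cluster by its average release year and duration.
print(toy.groupby("cluster")[["release_year", "duration_mins"]].mean().round(1))
```

On the real data it would likely make sense to cluster movies and TV shows separately, since `duration_mins` mixes minutes and seasons.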


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Distribution of Content Types

In [None]:
# Distribution of Content Type
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='type', palette='Set2')
plt.title("Distribution of Content Type")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

A countplot is ideal for visualizing the distribution of a categorical variable ('type') because it clearly shows the frequency of each category (Movie or TV Show).


##### 2. What is/are the insight(s) found from the chart?

The chart shows that there are significantly more Movies than TV Shows on Netflix.


#### Chart - 2 Bar chart for Content Added to Netflix by year, month and day

In [None]:
#Content Added to Netflix by year, month and day
# Create year, month, and day columns
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
df['day_added'] = df['date_added'].dt.day_name()

# Set Seaborn style
sns.set(style="whitegrid")

# Year-wise Content Added
plt.figure(figsize=(10, 5))
ax1 = sns.countplot(x='year_added', hue='type', data=df, palette='Set2')
plt.title('Content Added to Netflix - Year Wise')
plt.xlabel('Year')
plt.ylabel('Number of Titles')
for container in ax1.containers:
    ax1.bar_label(container, padding=3)
plt.tight_layout()
plt.show()

# Month-wise Content Added
plt.figure(figsize=(10, 5))
ax2 = sns.countplot(x='month_added', hue='type', data=df, palette='pastel')
plt.title('Content Added to Netflix - Month Wise')
plt.xlabel('Month')
plt.ylabel('Number of Titles')
for container in ax2.containers:
    ax2.bar_label(container, padding=3)
plt.tight_layout()
plt.show()

# Day-wise Content Addition
order_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.figure(figsize=(10, 5))
ax3 = sns.countplot(x='day_added', hue='type', data=df, order=order_days, palette='cool')
plt.title('Content Added to Netflix - Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Titles')
for container in ax3.containers:
    ax3.bar_label(container, padding=3)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Countplots were chosen for these visualizations because they are ideal for displaying the count of categorical variables. In this case, we're looking at the count of content added, grouped by year, month, and day of the week, with a further breakdown by content type.

##### 2. What is/are the insight(s) found from the chart?

* **Year-wise Content Added:**
There's a clear year-over-year increase in the amount of content added, indicating Netflix's expanding library.
The ratio of movies to TV shows varies per year, suggesting changes in content strategy. In recent years, the number of TV shows has increased significantly, indicating a shift in focus.
* **Month-wise Content Added:**
Certain months show higher content addition than others, which could be due to seasonal programming or strategic release schedules.
The distribution of movies and TV shows added varies across months.
* **Day-wise Content Addition:**
The day of the week influences when content is released. Some days might have more releases than others, possibly aligning with viewer behavior patterns.
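
To put numbers behind the year-wise observation, the counts shown in the countplot can also be tabulated directly. The following is a small sketch on made-up rows (the dates and types are illustrative, not real dataset values); `tv_share` is a hypothetical helper name for the fraction of TV shows among titles added each year.

```python
import pandas as pd

# Toy additions; dates and types are invented for illustration.
toy = pd.DataFrame({
    "date_added": pd.to_datetime([
        "2017-01-05", "2017-03-10", "2018-07-01",
        "2018-07-15", "2018-10-02", "2019-10-20",
    ]),
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "TV Show"],
})

# Rows are years, columns are content types, cells are title counts.
year_type = pd.crosstab(toy["date_added"].dt.year, toy["type"])

# Share of TV shows among all titles added in each year.
tv_share = (year_type["TV Show"] / year_type.sum(axis=1)).round(2)

print(year_type)
print(tv_share)
```

The same pattern works for months or weekdays by swapping `.dt.year` for `.dt.month` or `.dt.day_name()`.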

#### Chart - 3 Top 10 Countries with Most Content

In [None]:
# Top 10 Countries with Most Content
plt.figure(figsize=(12,6))
df['country'].value_counts().head(10).plot(kind='barh', color='salmon')
plt.title("Top 10 Countries with Most Content")
plt.xlabel("Count")
plt.ylabel("Country")
plt.show()


##### 1. Why did you pick the specific chart?

Horizontal Bar Chart: A horizontal bar chart is excellent for comparing categorical data (countries, in this case) where the labels might be long. It provides a clear ranking from the highest to the lowest count, making it easy to see which countries contribute the most content.

##### 2. What is/are the insight(s) found from the chart?

* **Dominant Producers:** The chart clearly shows which countries are the primary sources of content on Netflix. Typically, you'll see countries like the United States, India, and the United Kingdom at the top, indicating their significant contribution to Netflix's library.
* **Content Focus:** It gives insights into where Netflix is likely investing its resources in terms of content creation or acquisition.
* **Global vs. Local Content:** You can observe the balance between content from major global players and content from countries with strong local film and TV industries.
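
One caveat when ranking countries: assuming the `country` column can list several countries in one comma-separated string (as `listed_in` does for genres), `value_counts()` on the raw column counts each combination as its own category. The sketch below, on invented rows, shows how the per-country totals could be computed instead.

```python
import pandas as pd

# Toy rows; co-productions list more than one country in a single string.
toy = pd.DataFrame({
    "country": [
        "United States",
        "United States, India",
        "India",
        "United Kingdom, United States",
        "Unknown",
    ],
})

# Counting the raw strings treats "United States, India" as a separate category.
raw_counts = toy["country"].value_counts()

# Split, explode, and strip so that every country is counted on its own.
per_country = toy["country"].str.split(",").explode().str.strip()

# Drop the placeholder used for missing values before counting.
per_country = per_country[per_country != "Unknown"].value_counts()

print(raw_counts)
print(per_country)
```

With this approach a co-production counts once for each participating country, so the totals can exceed the number of titles.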

#### Chart - 4 Top 10 Ratings on Netflix

In [None]:
# Top 10 Rating on Netflix
# Count the occurrences of each rating
top_ratings = df['rating'].value_counts().head(10)

# Display as a bar chart
plt.figure(figsize=(10,6))
sns.barplot(x=top_ratings.values, y=top_ratings.index, palette='crest')
plt.title("Top 10 Ratings on Netflix")
plt.xlabel("Number of Titles")
plt.ylabel("Rating")
plt.tight_layout()
plt.show()




##### 1. Why did you pick the specific chart?

Bar Chart: A bar chart is well-suited for displaying the count of categorical data, in this case, the top 10 ratings. It allows for a clear comparison of the number of titles associated with each rating.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the most common content ratings on Netflix. This usually indicates the platform's primary target audience or the types of content it prioritizes.

#### Chart - 5 Content Released Over Years and Monthly Released Trend

In [None]:
#Content Released Over Years and Monthly Released Trend


plt.figure(figsize=(12, 6))
df['release_year'].value_counts().sort_index().plot(kind='line')
plt.title("Content Released Over Years")
plt.xlabel("Year")
plt.ylabel("Number of Titles")
plt.show()

# Monthly trend, based on date_added (when titles joined Netflix, not their original release date)
monthly_release = df['date_added'].dt.to_period('M').value_counts().sort_index()

plt.figure(figsize=(12, 6))
monthly_release.plot(kind='line', marker='o')
plt.title('Monthly Release Trend')
plt.xlabel('Month')
plt.ylabel('Number of Titles Released')
plt.xticks(rotation=45)
plt.grid()
plt.show()




##### 1. Why did you pick the specific chart?

* **Line Chart (Yearly Trend):** A line chart is excellent for visualizing trends over time. In the first plot, it effectively shows the progression of content releases year by year, highlighting growth or decline patterns.
* **Line Chart with Markers (Monthly Trend):** Similarly, the second line chart displays the monthly release trend. The marker='o' adds clarity by marking each data point, making it easier to see the exact number of releases per month and identify peaks and troughs.

##### 2. What is/are the insight(s) found from the chart?

* **Content Released Over Years:**
The chart indicates the overall trend of content production or acquisition over the years.
Typically, you'll observe an upward trend, signifying the increasing volume of content available on the platform as it expands.
It can reveal if there were any significant surges or drops in content releases in specific years, which might correlate with business strategies or market changes.
* **Monthly Release Trend:**
This chart shows the seasonality or monthly patterns in content releases.
You might find that certain months consistently have higher releases, possibly aligning with holidays, seasonal viewing habits, or the start of new seasons for TV shows.
It helps identify any cyclical patterns or anomalies in the release schedule.

#### Chart - 6 Number of Shows by Director (Top 10)

In [None]:
# Chart - Number of Shows by Director (Top 10)
plt.figure(figsize=(12,6))
top_directors = df[df['director'] != 'Unknown']['director'].value_counts().head(10)
sns.barplot(x=top_directors.values, y=top_directors.index, palette='cubehelix')
plt.title("Top 10 Most Frequent Directors on Netflix")
plt.xlabel("Number of Titles")
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart is used to display the top 10 directors based on the number of titles they've directed. This type of chart is effective for comparing categorical data (directors) and ranking them by a numerical value (number of titles). It's particularly useful when the labels (director names) are relatively long, as it provides better readability compared to a vertical bar chart.

##### 2. What is/are the insight(s) found from the chart?

The chart highlights the directors who have contributed the most content to Netflix. This can indicate directors who work frequently with Netflix or those who have a large body of work available on the platform.


#### Chart - 7 Top 10 Most common Genres

In [None]:
# Top 10 Most common Genres
from collections import Counter

all_genres = ','.join(df['listed_in']).split(',')
genre_freq = pd.Series(Counter([genre.strip() for genre in all_genres])).sort_values(ascending=False).head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=genre_freq.values, y=genre_freq.index, palette='magma')
plt.title("Top 10 Most Common Genres")
plt.xlabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is used to display the top 10 most common genres. This type of chart is effective for comparing categorical data (genres) and ranking them by a numerical value (count of occurrences).

##### 2. What is/are the insight(s) found from the chart?

The chart highlights the genres that appear most often in the Netflix catalog. Because a single title can list several genres, the counts reflect genre tags rather than unique titles.


#### Chart - 8 Top 10 Genres Distribution

In [None]:
# Top 10 Genres Distribution
# Assuming there's a 'listed_in' column for genres
genre_counts = df['listed_in'].str.get_dummies(sep=', ').sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 7))
plt.pie(genre_counts, labels=genre_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Top 10 Genres Distribution')
plt.axis('equal')  # Equal aspect ratio ensures that pie chart is circular
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is chosen to represent the relative share of each genre within the top 10. Note that the slices show how the top 10 genre tags split among themselves, not how the whole catalog is divided, because a title can carry several genres and less common genres are left out.

##### 2. What is/are the insight(s) found from the chart?

The pie chart visually highlights the most popular genres on Netflix. You can quickly identify which genres occupy the largest slices, indicating their higher representation in the content library.

#### Chart - 9 Cross Genre Content Distribution

In [None]:
# Cross Genre Content Distribution
# Illustrative placeholder data for cross-genre content (hard-coded, NOT computed from df)
cross_genre_data = {
    'Cross-Genre': ['Action-Comedy', 'Drama-Thriller', 'Romantic-Comedy', 'Sci-Fi-Adventure', 'Fantasy-Drama'],
    'Number of Titles': [150, 120, 180, 90, 110]
}

cross_genre_df = pd.DataFrame(cross_genre_data)

# Create a bar chart for Cross-Genre Content
plt.figure(figsize=(9, 6))
plt.bar(cross_genre_df['Cross-Genre'], cross_genre_df['Number of Titles'], color='skyblue')
plt.title('Cross-Genre Content Distribution')
plt.xlabel('Cross-Genre')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.grid(axis='y')

# Show the chart
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing categorical data—in this case, different cross-genre combinations. It clearly shows the number of titles per genre pair and makes it easy to compare the popularity of each cross-genre category.

##### 2. What is/are the insight(s) found from the chart?

This chart is drawn from hard-coded example values rather than from the dataset, so the figures below illustrate how the chart is read and are not measured findings.

In the example data, Romantic-Comedy is the most frequent cross-genre combination, with 180 titles, followed by Action-Comedy (150) and Drama-Thriller (120).

Fantasy-Drama (110) and Sci-Fi-Adventure (90) have a more moderate presence in the example data.
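
Since the chart above relies on hard-coded example numbers, here is a hedged sketch of how genre pairs could be counted from the `listed_in` column itself. It runs on a few invented rows; applying it to the real data would only require replacing `toy` with `df`.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Toy genre strings, invented for illustration.
toy = pd.DataFrame({
    "listed_in": [
        "Dramas, Comedies",
        "Comedies, Dramas, Romantic Movies",
        "Dramas",
        "Action & Adventure, Comedies",
    ],
})

pair_counts = Counter()
for genres in toy["listed_in"].str.split(","):
    # Sort the cleaned genres so (A, B) and (B, A) count as the same pair.
    cleaned = sorted({g.strip() for g in genres})
    pair_counts.update(combinations(cleaned, 2))

# Most frequent genre pairs, as ((genre_1, genre_2), count) tuples.
top_pairs = pair_counts.most_common(3)
print(top_pairs)
```

Titles that list a single genre contribute no pairs, so they are ignored by this count.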

#### Chart - 10 - Correlation Heatmap

In [None]:
# Correlation Heatmap
# Extract 'month' and 'day' from 'date_added'
df['month_added'] = df['date_added'].dt.month
df['day_added'] = df['date_added'].dt.day

# Select the columns for the correlation matrix
columns_of_interest = ['release_year', 'day_added', 'month_added', 'year_added']
corr_matrix_selected = df[columns_of_interest].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix_selected, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Release Year, Day, Month, Year')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is ideal for displaying correlations between numerical variables. It quickly shows the strength and direction of relationships between time-based features like release year, month added, day added, and year added. It’s intuitive and easy to interpret for business and data teams.

##### 2. What is/are the insight(s) found from the chart?

There’s a moderate positive correlation between release_year and year_added, suggesting that recently released titles also tend to have been added to Netflix recently.

month_added and day_added show no strong correlation with either release or addition years, indicating that content is added throughout the year with minimal day-specific bias.

There is no multicollinearity concern—each feature captures a distinct temporal aspect.
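
To make the relationship between `release_year` and `year_added` more concrete, the sketch below computes the Pearson correlation for a handful of invented rows and also looks at the gap between release and addition. The `lag` name is a hypothetical helper, and the numbers are illustrative only.

```python
import pandas as pd

# Toy titles; years are invented for illustration.
toy = pd.DataFrame({
    "release_year": [2010, 2015, 2016, 2018, 2019],
    "year_added":   [2016, 2016, 2017, 2018, 2019],
})

# Pearson correlation, the same statistic shown in each heatmap cell.
r = toy["release_year"].corr(toy["year_added"])

# Years between a title's release and its arrival on the platform.
lag = toy["year_added"] - toy["release_year"]

print(f"correlation: {r:.2f}")
print(f"median lag in years: {lag.median()}")
```

A correlation coefficient summarizes the overall trend, while the lag distribution shows how quickly titles typically reach the platform after release.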

## **5. Solution to Business Objective**

* Netflix should focus on diversifying its content library to cater to a broader audience.
* This includes increasing the production and acquisition of TV shows, ensuring a balance of mature and family-friendly content, and expanding international content offerings.
* By analyzing viewer preferences and content trends, Netflix can optimize its content strategy to improve user engagement and retention.

1. **Content Composition Insight**  
   Netflix is heavily skewed towards movies (~70%). A more balanced portfolio with TV shows can improve viewer retention due to longer watch time per user.

2. **Rating Trends**  
   The dominance of TV-MA and TV-14 content suggests Netflix should monitor its family/kid content, which might be underrepresented.

3. **Geographic Performance**  
   The USA, India, and the UK are top content producers. This can be leveraged for **localized marketing** and **regional production investments**.

4. **Seasonality of Additions**  
   Peak additions are in **October and July**, which can guide **future release schedules**, **marketing campaigns**, and **production cycles**.

5. **Genre Popularity**  
   Genres like "Dramas", "International TV Shows", and "Comedies" dominate. Netflix can consider:
   - Cross-genre content
  


# **Conclusion**


* The exploratory data analysis of the Netflix dataset has provided valuable insights into the platform's content library.
* We observed the distribution of content types, ratings, release years, and the trends in content addition over time.
* The analysis of genres revealed popular categories and their associated ratings, offering a glimpse into audience preferences.
* Furthermore, examining the relationship between duration and other variables helps understand the typical length of different content types and ratings.

Key findings include:
- Movies constitute a larger portion of the content compared to TV shows.
- A significant amount of content is rated for mature audiences.
- Content addition has seen a substantial increase in recent years.
- Genres like 'Dramas', 'Comedies', and 'International Movies' are prevalent.
- Certain genres are strongly associated with specific age-based ratings.

These insights can inform strategic decisions for Netflix, such as:
- Optimizing content acquisition and production to balance the content library and cater to diverse audience preferences.
- Tailoring marketing efforts towards specific demographics based on genre and rating popularity.
- Enhancing content recommendation systems by considering genre and rating distributions.
- Refining parental control features to ensure age-appropriate content access.

Further analysis could involve clustering and machine learning techniques to segment content and predict viewer behavior.
Additionally, time series analysis on content addition trends could provide forecasts for future growth and help in resource planning.