<a href="https://colab.research.google.com/github/Med-Lingii/Netflix-EDA/blob/main/Netflix_EDA_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Clustering and Exploratory Analysis of Netflix Movies and TV Shows



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**  - Lingeshwaran S

# **Project Summary -**

In the era of digital entertainment, streaming services like Netflix have transformed how audiences consume media. With a vast and ever-expanding library of movies and TV shows spanning various genres, languages, and countries, organizing and recommending content effectively poses a significant challenge. Netflix’s success partially relies on its ability to deliver personalized and relevant content to users, which in turn depends on understanding the structure and patterns in its content library.

This project undertakes a comprehensive Exploratory Data Analysis (EDA) on a dataset of Netflix titles to uncover insights about the content types, trends over time, genre distributions, and geographic origins. In addition to the exploratory analysis, clustering techniques are employed to group similar shows and movies together, paving the way for better categorization and potential improvements in content recommendation systems.



# **GitHub Link -**

🔗 **GitHub Repository**: Netflix EDA Github Link

# **Problem Statement**


“Netflix has a large, diverse catalog of media content. However, identifying patterns across different titles and grouping similar content together is essential to improve discoverability and enhance user experience. The goal of this project is to explore Netflix’s content library, analyze key features, and apply clustering algorithms to group similar movies and TV shows based on metadata such as genre, duration, description, and other features.”



#### **Define Your Business Objective?**

The primary business objective of this project is to enhance Netflix’s content recommendation system and content strategy through data-driven insights. With a vast and diverse media library, users often face difficulty discovering content they truly enjoy. By performing exploratory data analysis (EDA) and applying clustering techniques, this project aims to identify patterns, group similar content, and reveal hidden relationships within the catalog.

These insights help Netflix:

*   Improve content discoverability, reducing user drop-off from decision fatigue.
*   Increase engagement and retention through smarter, more relevant recommendations.
*   Inform acquisition and production strategies by identifying trending genres, underrepresented content, and geographic content gaps.
*   Segment the content library to support targeted marketing, personalization, and UI optimization.

Ultimately, this project empowers Netflix to deliver a more personalized viewing experience, strengthen its competitive edge, and make strategic business decisions rooted in audience and content intelligence.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/netflix.csv')
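
The guidelines above ask for exception handling, so the load step could be guarded along these lines. This is a minimal sketch, not the notebook's own code: `load_netflix_csv` is a hypothetical helper name, and the path is assumed to be the same one used in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_netflix_csv(path):
    """Read the titles CSV, failing with a clear message if it is missing or empty.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at {csv_path}")
    try:
        return pd.read_csv(csv_path)
    except pd.errors.EmptyDataError as exc:
        # pandas raises EmptyDataError when the file has no columns to parse
        raise ValueError(f"Dataset at {csv_path} is empty") from exc
```

It would be called as `df = load_netflix_csv('/content/netflix.csv')`, so a wrong path stops the notebook at the first cell with a readable message rather than a long traceback later on.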

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Count missing values per column
print(df.isnull().sum())

# Ensure release_year is stored as an integer
df['release_year'] = df['release_year'].astype('int')
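
Raw null counts are easier to judge next to the share of rows they represent. The sketch below builds that summary on a small toy frame; the column names match ones used elsewhere in this notebook, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "United States", np.nan, "Japan"],
})

# Count and percentage of missing values per column, worst first
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing)
```

Replacing `toy` with `df` gives the same table for the full dataset.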

### What did you know about your dataset?

After exploring the Netflix Movies and TV Shows dataset, several key observations and insights were uncovered:

*   **Content Type Distribution:** The dataset includes two main types of content: Movies and TV Shows. Movies make up the majority, indicating Netflix's strong focus on film content, although TV Shows still represent a significant portion.
*   **Genre Variety:** The dataset shows a wide range of genres, including Drama, Comedy, Action, Romance, and Documentaries. Drama is the most common genre, which aligns with global viewing trends.
*   **Country Representation:** Content originates from many countries, but the United States, India, and the United Kingdom dominate. However, there's a growing presence of international content, reflecting Netflix’s push for global expansion.
*   **Date and Release Trends:** Most content was added to Netflix between 2017 and 2021, showing aggressive catalog growth during those years. Older release years are common, but recent titles are prioritized.
*   **Duration Differences:** Movies list durations in minutes, while TV Shows use the number of seasons. This structural difference is important for clustering and modeling.
*   **Ratings and Audience Targeting:** The dataset includes ratings like TV-MA, PG, and R, suggesting Netflix targets a broad range of age groups, with a strong tilt toward mature audiences.
*   **Missing Values:** Some fields (like cast, country, or date added) have missing data, which would require imputation or cleaning for accurate analysis.
*   **Clustering Column Present:** A column labeled “cluster” or similar suggests clustering has already been applied, possibly based on features like description, genre, or duration.
*   **Textual Data:** Descriptions provide valuable information for NLP-based analysis and clustering. Using techniques like TF-IDF, these can reveal thematic similarities between titles.
*   **Skew in Data:** A noticeable skew exists toward certain countries, genres, and rating types. This suggests potential content bias or strategic acquisition preferences.
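
The TF-IDF idea mentioned under textual data can be sketched as follows. This is an illustrative example under stated assumptions, not the clustering used in this project: it relies on scikit-learn (not imported elsewhere in the notebook), the four descriptions are invented, and `k=2` is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy descriptions (invented for illustration, not taken from the dataset)
descriptions = [
    "A detective investigates a murder in a small town",
    "A detective hunts a killer after a brutal murder",
    "A chef enters a baking competition with family recipes",
    "Amateur bakers face off in a baking competition",
]

# Turn each description into a TF-IDF vector, ignoring common English words
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(descriptions)

# Pairwise cosine similarity: higher means more shared (and rarer) vocabulary
similarity = cosine_similarity(tfidf)

# Group the descriptions into k clusters (k=2 is an arbitrary choice here)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(tfidf)

print(similarity.round(2))
print(labels)
```

On the real data, `descriptions` would be `df['description']`, and the number of clusters would need to be chosen with a method such as the elbow curve or silhouette score rather than fixed in advance.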



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

*   **Title** – Name of the show or movie
*   **Type** – Movie or TV Show
*   **Genre/Category** – Classification of the content (e.g., Drama, Comedy)
*   **Description** – Textual synopsis of the title
*   **Cast** – Main actors
*   **Country** – Country of origin
*   **Date Added** – When it was added to Netflix
*   **Release Year** – Year it was produced or released
*   **Duration** – Length of the content (in minutes or seasons)
*   **Rating** – Content rating (e.g., TV-MA, PG-13)
*   **Cluster Label** – If clustering has already been applied (optional)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_counts = df.nunique().sort_values(ascending=False)
print("Unique values per column:\n")
print(unique_counts)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Fill missing 'country' with 'Unknown' and missing 'rating' with 'Not Rated'
df['country'] = df['country'].fillna('Unknown')
df['rating'] = df['rating'].fillna('Not Rated')

# Drop rows with missing 'title' or 'description'
df = df.dropna(subset=['title', 'description'])

# Split duration into a numeric value and its unit (minutes or seasons)
df['duration_int'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)
df['duration_type'] = df['duration'].str.extract(r'([a-zA-Z]+)', expand=False)

# Parse date_added and derive the year and month each title was added
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month_name()
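
Because `errors='coerce'` silently turns unparseable dates into `NaT`, it is worth checking how many rows are lost. The sketch below assumes the raw `date_added` strings follow a "Month day, year" layout and may carry stray whitespace; both are assumptions about the file, and the three sample values are invented.

```python
import pandas as pd

# Toy values mimicking a "Month day, year" layout; the leading space and the
# missing entry are assumptions about how the raw column might look
raw_dates = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip stray whitespace, then parse with an explicit format;
# anything that does not match becomes NaT instead of raising
parsed = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")

year_added = parsed.dt.year
month_added = parsed.dt.month_name()
unparsed = int(parsed.isna().sum())

print(parsed)
print("Rows that could not be parsed:", unparsed)
```

If the count of unparsed rows is much larger than the number of originally missing dates, the format assumption is wrong and should be revisited before trusting the yearly and monthly charts.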

### What all manipulations have you done and insights you found?

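Missing `country` values were filled with 'Unknown' and missing `rating` values with 'Not Rated', and rows without a `title` or `description` were dropped. The `duration` column was split into a numeric part (`duration_int`) and a unit (`duration_type`), because movies are measured in minutes and TV shows in seasons. The `date_added` column was converted to a datetime, with unparseable values set to `NaT`, and `year_added` and `month_added` were derived from it for the time-based charts.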

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(6, 4))
sns.countplot(x='type', data=df, palette='Set2')
plt.title('Distribution of Content Type')
plt.xlabel('Type')
plt.ylabel('Count')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A bar chart is suitable for showing the frequency of each category in a categorical variable, here the content type.

##### 2. What is/are the insight(s) found from the chart?

The dataset contains more Movies than TV Shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, knowing what type dominates the platform helps content strategy.
If TV Shows are underrepresented, it might indicate an area to invest in or improve.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
top_countries = df['country'].value_counts().head(10)
plt.figure(figsize=(8, 5))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='coolwarm')
plt.title('Top 10 Countries by Content Count')
plt.xlabel('Count')
plt.ylabel('Country')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is effective for displaying categorical frequencies with long labels such as country names.

##### 2. What is/are the insight(s) found from the chart?

The United States dominates the Netflix content library, followed by India and the UK.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight guides regional content licensing.
A gap in countries with growing user bases (e.g., Korea) may reflect negative growth or missed opportunities.
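
One caveat for this chart: `value_counts()` on the raw column treats each distinct string as its own category, and the 'Unknown' placeholder from the wrangling step is counted like a country. The sketch below shows one way to count each country once per title, assuming co-productions are stored as comma-separated strings (an assumption about the data; the toy rows are invented).

```python
import pandas as pd

# Toy rows; co-productions list several countries in one string (illustrative)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": [
        "United States, India",
        "India",
        "Unknown",
        "United States, United Kingdom",
    ],
})

# One entry per (title, country) pair, with surrounding whitespace removed
countries = toy["country"].str.split(",").explode().str.strip()

# Drop the placeholder and any empty fragments before counting
countries = countries[(countries != "Unknown") & (countries != "")]

top_countries = countries.value_counts().head(10)
print(top_countries)
```

Swapping `toy` for `df` yields a `top_countries` series that can be passed to the same `sns.barplot` call as above.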

#### Chart - 3

In [None]:
# Chart - 3 visualization code
content_year = df['year_added'].value_counts().sort_index()
plt.figure(figsize=(10, 5))
sns.lineplot(x=content_year.index, y=content_year.values, marker='o')
plt.title('Content Added Per Year')
plt.xlabel('Year Added')
plt.ylabel('Number of Titles')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A line chart shows temporal trends in content additions effectively.

##### 2. What is/are the insight(s) found from the chart?

There is a sharp increase in content additions around 2018–2020, with a slight drop in 2021.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding peak periods helps plan release strategies.
A post-2020 drop may indicate negative trends, possibly due to pandemic-related delays.
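
The size of each yearly rise or fall can be quantified with a year-over-year percentage change. The sketch below uses invented counts purely to show the calculation. Note that if a dataset was collected partway through its final year, that year will appear as a drop regardless of the underlying trend.

```python
import pandas as pd

# Toy year_added values (illustrative counts, not the real data)
year_added = pd.Series([2018] * 4 + [2019] * 6 + [2020] * 6 + [2021] * 3)

# Titles per year in chronological order
per_year = year_added.value_counts().sort_index()

# Percentage change versus the previous year; the first year has no baseline
yoy_pct = (per_year.pct_change() * 100).round(1)

summary = pd.DataFrame({"titles": per_year, "yoy_pct": yoy_pct})
print(summary)
```

With the real data, `year_added` would be `df['year_added'].dropna()`.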



#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Split the comma-separated genre strings and count how often each genre appears
genre_counts = df['listed_in'].dropna().str.split(', ').explode().value_counts().head(10)
plt.figure(figsize=(8, 5))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='viridis')
plt.title('Top 10 Most Frequent Genres')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for showing the most frequent genres from a categorical breakdown.

##### 2. What is/are the insight(s) found from the chart?

The most frequent genres are Dramas, International Movies, and Comedies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in understanding which genres dominate the catalog. This reflects what Netflix offers rather than direct evidence of what users prefer.
Underrepresented genres may signal content gaps or lower viewer interest.



#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='rating', hue='type', order=df['rating'].value_counts().index, palette='Set3')
plt.xticks(rotation=45)
plt.title('Rating Distribution by Type')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A grouped bar chart helps compare categorical distributions across different groups (Movie vs TV Show).

##### 2. What is/are the insight(s) found from the chart?

Movies dominate most rating categories. TV Shows are more frequent in TV-MA and TV-14 ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps target specific age-rated content to audience segments. If one type is limited in a rating group, it may affect content reach.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(8, 5))
sns.boxplot(data=df[df['duration_int'].notna()], x='type', y='duration_int', palette='pastel')
plt.title('Duration Distribution by Type')
plt.ylabel('Duration (Minutes / Seasons)')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A box plot is effective for showing the spread and outliers of numerical data across categories.

##### 2. What is/are the insight(s) found from the chart?

Movies have a wider spread in duration while TV Shows mostly cluster around fewer seasons.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Useful to evaluate content length strategy. Shorter content might cater to mobile users or casual viewers.
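
The box plot above puts minutes (movies) and seasons (TV shows) on one axis, so the two boxes are not directly comparable. A minimal sketch of summarising each type on its own scale, using invented rows shaped like the output of the wrangling step:

```python
import pandas as pd

# Toy rows in the same shape the wrangling step produces (illustrative values)
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "120 min", "150 min", "1 Season", "3 Seasons"],
})
toy["duration_int"] = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)

# Summarise each type on its own scale: minutes for movies, seasons for shows
movie_minutes = toy.loc[toy["type"] == "Movie", "duration_int"]
show_seasons = toy.loc[toy["type"] == "TV Show", "duration_int"]

print("Median movie length (min):", movie_minutes.median())
print("Median number of seasons:", show_seasons.median())
```

The same two filtered series could feed two separate box plots, each with its own y-axis label.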

#### Chart - 7

In [None]:
# Chart - 7 visualization code
monthly_order = ['January','February','March','April','May','June','July','August','September','October','November','December']
month_data = df['month_added'].value_counts().reindex(monthly_order)
plt.figure(figsize=(10, 5))
sns.barplot(x=month_data.index, y=month_data.values, palette='Blues')
plt.title('Monthly Content Addition Trend')
plt.xlabel('Month')
plt.ylabel('Number of Titles Added')
plt.xticks(rotation=45)
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A bar chart helps track monthly variations in new content additions.

##### 2. What is/are the insight(s) found from the chart?

Most content is added during July and October, possibly aligning with global holidays or strategic quarters.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps marketing teams schedule campaigns. A lack of consistency might signal operational or seasonal delays.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
top_directors = df['director'].value_counts().drop('', errors='ignore').head(10)
plt.figure(figsize=(8, 5))
sns.barplot(x=top_directors.values, y=top_directors.index, palette='magma')
plt.title('Top 10 Directors by Content Count')
plt.xlabel('Count')
plt.ylabel('Director')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

A bar chart is effective for identifying key contributors, such as the directors with the most content.

##### 2. What is/are the insight(s) found from the chart?

The director at the top of the chart (`top_directors.index[0]` in the code above) is the most featured director on the platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing popular or high-output directors helps maintain partnerships. A lack of diversity among creators may limit innovation.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
heatmap_data = df.pivot_table(index='rating', columns='type', aggfunc='size', fill_value=0)
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Content Distribution by Rating and Type')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

Heatmaps are great for summarizing three variables at once: rating, type, and count.

##### 2. What is/are the insight(s) found from the chart?

TV-MA rating has high frequency in both Movies and TV Shows. Some ratings are dominated by Movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Allows balancing content rating types. Over-concentration in mature content might alienate family or youth audiences.
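
Raw counts in the heatmap are dominated by whichever type has more titles. Converting each column to shares makes the rating mix of Movies and TV Shows easier to compare. The sketch below uses `pd.crosstab` with `normalize="columns"` on invented rows.

```python
import pandas as pd

# Toy rows (illustrative, not the real distribution)
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["TV-MA", "TV-MA", "PG", "R", "TV-MA", "TV-14"],
})

# Share of each rating within each type; every column sums to 1
rating_share = pd.crosstab(toy["rating"], toy["type"], normalize="columns").round(2)
print(rating_share)
```

The resulting table can be passed to `sns.heatmap` with `fmt='.2f'` in place of `fmt='d'`.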

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?

Firstly, it is important to optimize the content mix in terms of type and genre. The dataset indicates a strong dominance of movies over TV shows, and a concentration of genres such as Drama and International Movies. To attract and retain a broader audience, Netflix should consider increasing investment in TV shows and exploring underrepresented or emerging genres like documentaries and anime. This would help maintain viewer engagement and cater to diverse content preferences across regions.

Secondly, there is a clear concentration of content originating from the United States, with India and the UK following. This suggests an opportunity to localize content further, particularly in growing markets like South Korea, Latin America, and Africa. Collaborating with regional creators and producing culturally relevant content can strengthen Netflix’s market position and drive subscriber growth in these areas.

In terms of content release strategy, the data shows that most content is added in July and October. These periods likely align with strategic marketing quarters or global holidays. Netflix should continue analyzing these temporal patterns and plan high-impact content drops around key viewing periods to maximize user engagement.

Audience segmentation by content rating reveals that mature-rated content (TV-MA and TV-14) is predominant. While this caters to adult viewers, there is a potential gap in family-friendly and youth-targeted content. By expanding offerings in the TV-G, PG, or PG-13 categories, Netflix can better serve households and younger audiences, increasing its appeal as a family streaming platform.

In terms of content duration, movies show a wide range in length, whereas TV shows tend to have fewer seasons. This suggests that shorter series or limited-run formats could be effective, especially for mobile-first or casual viewers. Netflix can leverage these insights to produce content that matches modern consumption patterns.

Additionally, the data indicates that a few directors contribute a significant volume of content. While maintaining strong relationships with high-output creators is beneficial, there is value in diversifying the pool of directors to introduce new voices and innovative storytelling styles.

Finally, the dataset highlights a decline in content additions after 2020. This may be due to pandemic-related production delays or a shift in strategic priorities. It is important to monitor this trend closely and ensure that the platform continues to offer a steady stream of fresh, diverse content to prevent subscriber attrition.

Overall, by making data-informed decisions on content strategy, regional focus, release timing, and audience segmentation, Netflix can enhance user satisfaction, improve global reach, and support sustainable business growth.










# **Conclusion**

This exploratory data analysis (EDA) of the Netflix Movies and TV Shows dataset provided valuable insights into the platform’s content distribution, trends, and potential areas for strategic improvement. The analysis revealed that Movies significantly outnumber TV Shows, with Drama, International Movies, and Comedies being the most frequent genres. Content is predominantly produced in the United States, with India and the UK also contributing notably, indicating a need for greater regional diversity to meet global audience demand.

Temporal trends showed a surge in content additions during 2018–2020, followed by a slight decline post-2020, possibly due to external disruptions like the COVID-19 pandemic. Monthly addition patterns revealed strategic content releases in July and October, likely aligned with global viewing behavior and marketing strategies. Ratings analysis showed a high concentration in mature categories (TV-MA and TV-14), suggesting an opportunity to expand family-friendly content.

Furthermore, content duration analysis highlighted variability in movie lengths and the consistent structure of TV Shows, pointing to a growing preference for shorter, binge-worthy content formats. The presence of a few highly prolific directors also emphasized the value of both maintaining strong creative partnerships and encouraging diversity among content creators.

Overall, this EDA project demonstrates the power of data in understanding content strategy and user engagement patterns. These insights can support Netflix in enhancing its content portfolio, optimizing regional offerings, and aligning content planning with audience needs, ultimately contributing to improved user satisfaction and business growth.