# **Project Name**    - Exploratory Data Analysis on Netflix Movies and TV Shows



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The project titled “Exploratory Data Analysis on Netflix Movies and TV Shows” aims to uncover meaningful insights from Netflix's catalog of entertainment content by analyzing and visualizing key attributes such as content type, genre, release year, duration, country of origin, and audience ratings. This project was undertaken as part of the Business Analyst internship at Labmentix with a focus on building strong analytical reasoning, storytelling through data, and generating business-relevant insights.

Netflix, being a global leader in streaming services, provides a diverse range of movies and TV shows catering to different age groups, countries, and genres. The dataset used in this project comprises 7,787 rows and 12 columns, offering details about the content title, director, cast, country, rating, duration, genres (listed_in), and more. This data served as a valuable resource to understand how Netflix curates its platform, which countries dominate the content list, and what type of content has been favored over the years.

The project followed a structured Exploratory Data Analysis (EDA) approach, beginning with understanding the structure of the dataset, identifying missing values, and performing data wrangling such as converting date columns to datetime formats, separating duration into numeric and categorical fields, and filling missing values logically. After preprocessing, visual and statistical analysis was conducted to uncover patterns and trends.

Key analyses include:

- Distribution of Movies vs TV Shows
- Frequency of content added over time
- Top contributing countries to the Netflix catalog
- Commonly assigned maturity ratings
- Dominant genres in the platform
- Duration analysis for both movies and shows
- Trends in content addition over years

Various tools and libraries like pandas, matplotlib, and seaborn were utilized for data manipulation and visualization. The project also included generating clean and interpretable charts, such as bar plots, count plots, boxplots, and heatmaps, to support findings. In addition, text-based insights using word clouds and genre decomposition were considered to understand viewer trends and platform direction.

The findings provide strong support for Netflix's strategy of global content delivery with a heavy skew towards Movies. The majority of content originates from the United States and India, with a clear preference for mature-rated content (e.g., TV-MA, R). Content addition saw a steep rise post-2016, likely correlating with the company's aggressive expansion into original content and regional markets.

In conclusion, this project not only enhanced hands-on EDA skills but also developed the ability to derive business recommendations from data trends. As a Business Analyst, learning how to convert raw data into actionable strategies is crucial. This project demonstrates that skill by identifying content gaps, regional concentration, and audience preferences. These insights could be used by Netflix (or similar platforms) to diversify their content portfolio, optimize recommendations, and improve regional engagement.

This EDA serves as a foundational analysis that could be extended into predictive modeling, user segmentation, or clustering to identify potential user behaviors or content success trends. The experience gained here is directly applicable to real-world business analytics challenges.

# **GitHub Link -**

https://github.com/BrijKumbhani/EDA-on-Netflix-Movies-TV-Shows

# **Problem Statement**


Netflix has a vast and growing library of movies and TV shows catering to a global audience. However, understanding patterns in content type, genre, duration, release trends, and geographic distribution is essential for making data-driven business decisions. This project aims to explore and analyze Netflix’s content dataset to uncover insights into what kind of content dominates the platform, how content trends have evolved over time, and which countries and genres contribute most. The goal is to identify meaningful patterns that can support strategic recommendations for content planning, user engagement, and market expansion.

#### **Business Objective**

The primary business objective of this project is to extract meaningful insights from Netflix's content data to support strategic decision-making around content planning and audience targeting. By performing Exploratory Data Analysis (EDA), the goal is to:

- Understand how content is distributed by type (Movie vs. TV Show), rating, genre, country, and release year.
- Identify content trends over time and across regions.
- Recognize gaps or imbalances in content offerings (e.g., over/under-representation of certain genres or countries).
- Provide recommendations that can help Netflix (or similar platforms) improve content diversity, optimize user satisfaction, and expand viewership globally.

The insights gained will serve as a foundation for further business analysis, such as customer segmentation, content recommendations, and regional strategy development.



# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
## Step 1: Import Required Libraries

# We begin by importing the core Python libraries that will be used for data manipulation and visualization throughout the notebook.

# Data handling and manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set fallback style
plt.style.use('ggplot')  # use a built-in, reliable style
sns.set_palette('pastel')

# Display settings
pd.set_option('display.max_columns', None)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
## Step 2: Load and Preview the Dataset

# In this step, we load the dataset into a pandas DataFrame and preview the first few records to understand the structure and types of information available.

# Mount Google Drive (this cell assumes a Google Colab runtime; pandas is already imported in Step 1)
from google.colab import drive

drive.mount('/content/drive')

# Load the Netflix dataset (replace the path if needed)
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

# Display first 5 rows
df.head()

### Dataset First View

In [None]:
## Step 3: Dataset First View

# This step helps us understand the structure and quality of the dataset. We review column names, data types, null values, and get a statistical overview to identify potential cleaning or preprocessing tasks.

# Shape of the dataset
print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

# Column names
print("\nColumns in the dataset:")
print(df.columns.tolist())

# Dataset info: data types & non-null values
print("\nDataset Info:")
df.info()

# Check for missing values
print("\nMissing Values in Each Column:")
print(df.isnull().sum())

# Quick summary of all columns (numeric and categorical, because include='all')
print("\nStatistical Summary:")
df.describe(include='all').T  # Transposed for readability


### Dataset Rows & Columns count

In [None]:
## Step 4: Dataset Rows & Columns Count

# This step records the size of the raw dataset so that any data loss during later preprocessing can be tracked.


# Get shape of the dataset
rows, cols = df.shape

print(f"Total Rows: {rows}")
print(f"Total Columns: {cols}")


### Dataset Information

In [None]:
# Dataset Info
print("Basic Dataset Info:\n")
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Total Duplicate Rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values / Null Values Count
print("Missing Values Per Column:\n")
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12,6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

### What did you know about your dataset?

- The dataset contains **7787 rows and 12 columns**.
- It includes metadata about Netflix Movies and TV Shows, such as title, type, cast, director, release year, duration, and genre.
- The dataset has **missing values** in important fields like `director`, `cast`, `country`, `date_added`, and `rating`.
- Fully **duplicated rows** are counted with `df.duplicated().sum()`; the wrangling step keeps every original row, so no duplicates are dropped.
- The date information is available and usable for time-based analysis after conversion.
- Columns like `duration`, `listed_in`, and `rating` will be helpful in EDA and clustering.
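
To put numbers on the missing-value observation above, a per-column count and percentage can be tabulated. The following is a minimal sketch: the helper name `missing_summary` and the sample rows are hypothetical and not part of the original notebook, where the function would be called on `df` instead.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing_count": counts, "missing_pct": percent})
    # Keep only the columns that actually contain gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Made-up sample rows for illustration only (not taken from the real dataset)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, None],
    "country": ["India", None, "United States", "Canada"],
})

print(missing_summary(sample))
```

A percentage view makes it easier to decide between placeholders, imputation, or dropping a column than raw counts alone.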


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:\n")
print(df.columns.tolist())

In [None]:
# Dataset Describe
print("Statistical Summary:\n")
print(df.describe(include='all'))


### Variables Description

| Variable        | Description |
|-----------------|-------------|
| `show_id`       | Unique identifier for each show (not useful for analysis). |
| `type`          | Indicates whether the content is a Movie or TV Show. |
| `title`         | Name of the show or movie. |
| `director`      | Name(s) of director(s); useful for grouping. |
| `cast`          | Leading actors; can be used for text analysis or popular actor trends. |
| `country`       | Country of origin for the content. |
| `date_added`    | Date when the content was added to Netflix. |
| `release_year`  | The original release year of the movie/show. |
| `rating`        | Content rating (e.g., TV-MA, PG); useful for age-based analysis. |
| `duration`      | Length of movie or number of seasons. |
| `listed_in`     | Genre(s) the content is listed under. |
| `description`   | Short summary of the content. |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
print("Unique Values Count Per Column:\n")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

## 3. ***Data Wrangling***

In [None]:
# Make a copy of the original dataset
df_clean = df.copy()
# Strip stray whitespace from 'date_added', then convert it to datetime
# (values that cannot be parsed become NaT instead of raising an error)
df_clean['date_added'] = df_clean['date_added'].str.strip()
df_clean['date_added'] = pd.to_datetime(df_clean['date_added'], errors='coerce')

# Split duration into numeric and unit (for separating Movies and TV Shows)
df_clean['duration_int'] = df_clean['duration'].str.extract(r'(\d+)', expand=False).astype(float)
df_clean['duration_type'] = df_clean['duration'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()

# Fill missing values (plain assignment instead of chained inplace=True,
# which is deprecated in recent pandas versions and may not modify df_clean)
df_clean['director'] = df_clean['director'].fillna('Not Available')
df_clean['cast'] = df_clean['cast'].fillna('Not Available')
df_clean['country'] = df_clean['country'].fillna('Not Specified')
df_clean['rating'] = df_clean['rating'].fillna(df_clean['rating'].mode()[0])
df_clean['date_added'] = df_clean['date_added'].fillna(df_clean['date_added'].mode()[0])

# Dropping unnecessary columns (optional)
# df_clean.drop(columns=['show_id', 'description'], inplace=True)

# Confirm changes
df_clean.info()

### All manipulations I have done and insights I found

## Data Wrangling Summary

1. **Converted `date_added`** from string to datetime format to allow time-based analysis (e.g., trends over years).
2. **Split `duration`** into two columns:
   - `duration_int`: Numeric value (e.g., 90, 2)
   - `duration_type`: Unit (e.g., `min`, `Season`, `Seasons`)
   This allows clearer separation between **Movies and TV Shows**.
3. **Filled missing values**:
   - `director`, `cast`, and `country` with placeholders (`"Not Available"` / `"Not Specified"`)
   - `rating` and `date_added` with their **mode** (most frequent value).
4. **Kept all original rows** — no rows dropped, preserving dataset size and integrity.

### Insights Gained:
- Large number of missing values in `director` and `cast` might indicate that many titles are documentaries or non-mainstream.
- By splitting the duration, we can now analyze content length separately for Movies vs. TV Shows.
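- Filling missing `rating` and `country` values keeps every row available for categorical analysis. The `"Not Specified"` placeholder should be excluded when ranking countries, and mode imputation slightly inflates the most frequent rating.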
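
As a rough illustration of how the split duration columns support a per-type comparison, the sketch below applies the same regular expressions to a handful of made-up rows and summarises `duration_int` by `type`. The sample data is hypothetical; in the notebook the same `groupby` would run on `df_clean`.

```python
import pandas as pd

# Made-up sample rows for illustration only (not taken from the real dataset)
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "125 min", "1 Season", "3 Seasons", None],
})

# Same split as in the wrangling cell: leading number and trailing unit
sample["duration_int"] = (
    sample["duration"].str.extract(r"(\d+)", expand=False).astype(float)
)
sample["duration_type"] = (
    sample["duration"].str.extract(r"([a-zA-Z ]+)", expand=False).str.strip()
)

# Summarise separately per content type, because the units differ
# (minutes for Movies, seasons for TV Shows)
duration_by_type = sample.groupby("type")["duration_int"].agg(
    ["count", "median", "min", "max"]
)
print(duration_by_type)
```

Rows with a missing `duration` stay as `NaN` and are simply skipped by the aggregation.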



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Distribution of Content Type (Movies vs TV Shows)

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 4))
sns.countplot(data=df_clean, x='type', palette='Set2')
plt.title('Distribution of Content Type')
plt.xlabel('Type')
plt.ylabel('Count')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This is a univariate categorical variable. A count plot is the most effective way to visualize how many entries fall under each category (Movies or TV Shows).



##### 2. What is/are the insight(s) found from the chart?

From the chart, we observe that Movies dominate the content on Netflix, significantly outnumbering TV Shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight helps stakeholders understand content investment trends and audience preferences. If data shows more user engagement with TV Shows, this might suggest increasing production in that category. No negative growth is inferred at this point—just current content ratios.



#### Chart - 2: Top 10 Countries Producing Netflix Content

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 6))
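# Exclude the 'Not Specified' placeholder added during wrangling so it is not ranked as a country
top_countries = df_clean.loc[df_clean['country'] != 'Not Specified', 'country'].value_counts().head(10)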
sns.barplot(x=top_countries.values, y=top_countries.index, palette='Set3')
plt.title('Top 10 Countries by Number of Netflix Titles')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.grid(True, axis='x')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart is ideal for visualizing frequency counts of a categorical variable like 'country', allowing us to easily identify the countries contributing most to Netflix content.

##### 2. What is/are the insight(s) found from the chart?

The United States is the leading contributor of Netflix content, followed by India, the United Kingdom, and Canada. This suggests heavy content sourcing from English-speaking nations.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely. Knowing the top contributing countries helps in making decisions on regional licensing, marketing strategies, and localized content production. This can directly impact viewership and subscriptions positively. No negative growth is indicated—though underrepresented regions could be growth opportunities.



#### Chart - 3: Distribution of Content Type on Netflix (Pie Chart)

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(6, 6))
type_counts = df_clean['type'].value_counts()
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', colors=['#66b3ff','#ff9999'], startangle=90, wedgeprops={'edgecolor':'black'})
plt.title('Distribution of Content Type on Netflix')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is effective here because we’re analyzing a simple binary categorical split — "Movie" vs "TV Show" — and want to easily grasp the proportion of each.

##### 2. What is/are the insight(s) found from the chart?

Netflix has a significantly higher proportion of Movies (around 69%) compared to TV Shows (around 31%). This shows Netflix's content library is movie-dominant.
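
The percentages read from the pie chart can also be computed directly, which avoids estimating them by eye. The sketch below is illustrative only: `type_share` is a hypothetical helper and the sample column is made up; in the notebook it would be applied to `df_clean`.

```python
import pandas as pd


def type_share(frame: pd.DataFrame, column: str = "type") -> pd.Series:
    """Percentage of rows per category in `column`, rounded to one decimal."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)


# Made-up sample for illustration only (not the real catalog)
sample = pd.DataFrame({"type": ["Movie"] * 7 + ["TV Show"] * 3})

shares = type_share(sample)
print(shares)
```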



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. If Netflix aims to balance content offerings or attract binge-watchers, this insight can inform production/investment strategies in TV Shows. A skewed balance might also influence content recommendation systems or marketing efforts.




#### Chart - 4: Top 10 Countries by Content Production

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
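# Skip the 'Not Specified' placeholder so only real countries are ranked
top_countries = df_clean.loc[df_clean['country'] != 'Not Specified', 'country'].value_counts().head(10)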
sns.barplot(x=top_countries.values, y=top_countries.index, palette='viridis')
plt.title('Top 10 Countries with Most Netflix Content')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart works well for ranking and comparing quantities across different countries. It keeps the country names readable and highlights disparities clearly.



##### 2. What is/are the insight(s) found from the chart?

The United States overwhelmingly dominates the content library, followed by India, the United Kingdom, Canada, and others. This suggests that a majority of content originates from a few key regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding content origin helps in:

- Targeting regional markets more effectively.

- Making investment decisions in local content production.

- Expanding international offerings by identifying underrepresented regions for growth opportunities.



#### Chart - 5: Top 10 Most Frequent Genres on Netflix

In [None]:
# Chart - 5 visualization code
# Splitting genres by comma and counting frequency
from collections import Counter

genre_series = df_clean['listed_in'].str.split(', ')
genre_counts = Counter([genre for sublist in genre_series.dropna() for genre in sublist])
top_genres = dict(genre_counts.most_common(10))

# Plotting
plt.figure(figsize=(10, 6))
sns.barplot(x=list(top_genres.values()), y=list(top_genres.keys()), palette='rocket')
plt.title('Top 10 Most Frequent Genres on Netflix')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We chose a horizontal bar chart to clearly represent the top genres and allow easy comparison of frequency across genre types.

##### 2. What is/are the insight(s) found from the chart?

The top genres include Dramas, Comedies, Documentaries, and Action & Adventure, indicating what content types Netflix users are exposed to the most.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, definitely. It helps Netflix and competitors:

- Refine content strategy by doubling down on popular genres.

- Understand user preferences and recommend trending genres more effectively.

- Spot gaps in genre diversity and explore niche opportunities.

#### Chart - 6:  Distribution of Ratings for Netflix Content

In [None]:
# Chart - 6 visualization code
# Countplot for ratings
plt.figure(figsize=(12, 6))
order = df_clean['rating'].value_counts().index
sns.countplot(data=df_clean, y='rating', order=order, palette='mako')
plt.title('Distribution of Content Ratings on Netflix')
plt.xlabel('Number of Titles')
plt.ylabel('Rating')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal countplot is ideal for showing the distribution of categorical variables like content ratings. It makes it easier to read long rating names and identify trends at a glance.

##### 2. What is/are the insight(s) found from the chart?

- TV-MA (Mature Audience) dominates, followed by TV-14, TV-PG, and R.

- Netflix hosts a large amount of content targeted at teenagers and adults.

- Kids-friendly content like TV-Y, TV-G appears much less frequently.
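
To make the audience split behind these bullets explicit, rating codes can be grouped into broad audience bands. The mapping below is an assumption made for illustration (the dataset itself defines no such grouping), and `audience_share` and the sample ratings are hypothetical.

```python
import pandas as pd

# Assumed grouping of rating codes into audience bands (illustrative only;
# this mapping is not part of the dataset or the original notebook)
AUDIENCE_GROUPS = {
    "TV-Y": "Kids", "TV-Y7": "Kids", "TV-Y7-FV": "Kids", "TV-G": "Kids", "G": "Kids",
    "TV-PG": "Family", "PG": "Family",
    "TV-14": "Teens", "PG-13": "Teens",
    "TV-MA": "Adults", "R": "Adults", "NC-17": "Adults",
}


def audience_share(ratings: pd.Series) -> pd.Series:
    """Percentage of titles per audience band; unmapped codes go to 'Unrated/Other'."""
    groups = ratings.map(AUDIENCE_GROUPS).fillna("Unrated/Other")
    return (groups.value_counts(normalize=True) * 100).round(1)


# Made-up sample ratings for illustration only
sample_ratings = pd.Series(
    ["TV-MA", "TV-MA", "TV-14", "R", "TV-PG", "TV-Y", "NR", "TV-MA", "TV-14", "PG"]
)

audience = audience_share(sample_ratings)
print(audience)
```

In the notebook, `audience_share(df_clean['rating'])` would give the equivalent breakdown for the full catalog.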



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

- Helps align marketing and recommendation strategies based on age group consumption.

- Informs content acquisition or creation for underrepresented categories (e.g., family content).

- Helps maintain compliance and user trust in parental control settings.

#### Chart - 7:  Top 10 Countries Producing Netflix Content

In [None]:
# Chart - 7 visualization code
# Top 10 countries with the most content
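# (the 'Not Specified' placeholder for missing values is left out of the ranking)
top_countries = df_clean.loc[df_clean['country'] != 'Not Specified', 'country'].value_counts().head(10)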

plt.figure(figsize=(10, 6))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='Spectral')
plt.title('Top 10 Content-Producing Countries on Netflix')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart makes it easy to compare the top content-producing countries and visualize relative volume clearly.

##### 2. What is/are the insight(s) found from the chart?

- The United States is by far the leading contributor of content on Netflix.

- India, the United Kingdom, Canada, and France also contribute a significant number of shows/movies.

- This suggests strong content partnerships or licensing in these countries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

- Netflix can focus on strengthening its market and production in top-contributing countries.

- Helps with regional recommendation engines, tailoring content discovery to viewers by country.

- Shows where Netflix may be over- or under-represented, guiding future investment decisions.



#### Chart - 8: Trend of Netflix Content Added Over the Years

In [None]:
# Chart - 8 visualization code
# Extract year from 'date_added'
df_clean['year_added'] = df_clean['date_added'].dt.year

# Count titles added per year
titles_by_year = df_clean['year_added'].value_counts().sort_index()

plt.figure(figsize=(10, 6))
sns.lineplot(x=titles_by_year.index, y=titles_by_year.values, marker='o', color='coral')
plt.title('Netflix Content Added Over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Titles Added')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A line plot is ideal to visualize how the number of titles added to Netflix has changed over time, showing clear growth trends or drops year-over-year.

##### 2. What is/are the insight(s) found from the chart?

- There is a steady increase in content additions from around 2015 to 2019.

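- A slight drop or stagnation is visible from 2020 onward, possibly due to the pandemic impacting production schedules. The most recent year in the dataset may also be only partially covered, which would by itself produce an apparent drop.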
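
The growth and slowdown described above can be quantified as a year-over-year percentage change. This is a minimal sketch on made-up dates; in the notebook, `titles_by_year` from the chart cell could be used in place of the sample series.

```python
import pandas as pd

# Made-up 'date_added' values for illustration only
sample_dates = pd.to_datetime(pd.Series([
    "2016-03-01",
    "2017-05-10", "2017-08-22",
    "2018-01-15", "2018-06-30", "2018-11-02", "2018-12-25",
    "2019-02-14", "2019-07-04",
]))

# Titles added per calendar year, in chronological order
titles_by_year = sample_dates.dt.year.value_counts().sort_index()

# Year-over-year change in percent (the first year has no predecessor, so it is NaN)
trend = pd.DataFrame({
    "titles_added": titles_by_year,
    "yoy_change_pct": (titles_by_year.pct_change() * 100).round(1),
})
print(trend)
```

A partially covered final year would show up here as a sharp negative change, so it is worth checking the latest `date_added` value before interpreting the last row.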

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

- It reflects growth strategy and content expansion timeline.

- Netflix can analyze which years performed better in terms of viewership vs. additions.

- Post-2020 data can trigger risk analysis for future disruptions and improve resilience planning.



#### Chart - 9: Top 10 Countries Producing Netflix Content (Multi-Country Entries Split)

In [None]:
# Chart - 9 visualization code
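# Count the most frequent countries (some entries list multiple countries separated by commas),
# skipping the 'Not Specified' placeholder used for missing values
country_counts = (
    df_clean.loc[df_clean['country'] != 'Not Specified', 'country']
    .str.split(',').explode().str.strip()
    .value_counts().head(10)
)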

# Plot the bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=country_counts.values, y=country_counts.index, palette='viridis')
plt.title('Top 10 Countries Producing Netflix Content')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart helps in clearly comparing the number of titles across the top 10 countries, especially since country names can be long.

##### 2. What is/are the insight(s) found from the chart?

- The United States is the leading content contributor by a large margin.

- Other countries like India, the UK, Canada, and France also contribute significantly.

- There is a diverse international presence, showing Netflix’s global content strategy.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

- This can help focus regional strategies—localize marketing or invest more in high-performing regions.

- Emerging contributors like India and South Korea can be seen as growing content hubs.

- A good global mix reduces reliance on a single content source, aiding business stability.

#### Chart - 10: Distribution of Content Rating on Netflix



In [None]:
# Chart - 10 visualization code
# Count the rating occurrences
rating_counts = df_clean['rating'].value_counts().head(10)

# Plotting the bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=rating_counts.index, y=rating_counts.values, palette='cubehelix')
plt.title('Top 10 Most Common Content Ratings on Netflix')
plt.xlabel('Rating')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing the frequency of categorical values such as content ratings. It quickly highlights which ratings dominate the platform.

##### 2. What is/are the insight(s) found from the chart?

- The majority of content is rated TV-MA (Mature Audience), followed by TV-14, TV-PG, and R.

- This indicates that Netflix heavily features mature content, catering to adult audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

- Understanding content rating distribution can help Netflix align content recommendations with audience preferences.

- It may reveal gaps in family/kid-friendly content, which can be addressed by investing in more PG/G-rated titles to expand family viewership.

- Strategically balancing mature and general content can broaden target demographics.

#### Chart - 11: Top 10 Most Common Directors on Netflix

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(12,6))
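# Exclude the 'Not Available' placeholder, which would otherwise rank as the top "director"
top_directors = df_clean.loc[df_clean['director'] != 'Not Available', 'director'].value_counts().head(10)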
sns.barplot(x=top_directors.values, y=top_directors.index, palette='coolwarm')
plt.title('Top 10 Directors with Most Titles on Netflix', fontsize=14)
plt.xlabel('Number of Titles')
plt.ylabel('Director')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This chart reveals which directors have the most content on Netflix, highlighting production trends and valuable creative partnerships.

##### 2. What is/are the insight(s) found from the chart?

- Certain directors (e.g., Raúl Campos & Jan Suter) appear frequently, often in regional or documentary content.

- Some directors dominate niche categories, which can shape curation and recommendation algorithms.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.
- Understanding popular or prolific directors can guide content acquisition, marketing strategies, and even talent partnerships.

#### Chart - 12: Top 10 Most Common Ratings on Netflix

In [None]:
# Chart - 12 visualization code
# Top 10 most frequent ratings
top_ratings = df_clean['rating'].value_counts().head(10)

# Plotting
plt.figure(figsize=(10, 6))
sns.barplot(x=top_ratings.values, y=top_ratings.index, palette="coolwarm")
plt.title('Top 10 Most Common Ratings on Netflix')
plt.xlabel('Number of Titles')
plt.ylabel('Rating')
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart effectively compares categorical values like content ratings and allows longer rating names to be displayed cleanly.

##### 2. What is/are the insight(s) found from the chart?

- TV-MA and TV-14 are the most frequently used ratings.

- These ratings suggest a large portion of Netflix's library is targeted at mature audiences and teens.

- Very few titles are marked with kid-friendly ratings like TV-Y7 or TV-G.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

- Netflix can analyze audience demographics and understand if they’re leaning too heavily into mature content.

- If Netflix wants to grow in family or kids' markets, it may consider increasing content with lower ratings.

- It can help with personalized marketing or content curation strategies by aligning new content with popular rating trends.

#### Chart - 13: Content Type Trend Over the Years (Movies vs. TV Shows)

In [None]:
# Chart - 13 visualization code
# Count of Movies and TV Shows over the years
type_trend = df_clean.groupby(['release_year', 'type']).size().unstack().fillna(0)

# Plotting (draw on one explicit Axes so no empty extra figure is created)
fig, ax = plt.subplots(figsize=(12, 6))
type_trend.plot(kind='line', marker='o', linewidth=2, ax=ax)
plt.title('Content Type Trend Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.grid(True, linestyle='--', alpha=0.5)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is ideal for showing temporal trends, such as how the number of Movies and TV Shows released has changed over time.

##### 2. What is/are the insight(s) found from the chart?

- The number of Movies rises steadily by release year, peaking around 2018–2019.

- TV Shows also rise, especially for release years after 2016, consistent with Netflix's strategic focus on original series.

- A dip after 2020 may reflect the impact of the COVID-19 pandemic on content production, although the latest release years may simply be incompletely represented in the dataset.
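
Raw counts can hide a shift in the mix between Movies and TV Shows, so a share-per-year view is a useful complement to the line chart. The sketch below uses `pd.crosstab` on made-up rows; in the notebook the same call would take `df_clean['release_year']` and `df_clean['type']`.

```python
import pandas as pd

# Made-up sample rows for illustration only
sample = pd.DataFrame({
    "release_year": [2015, 2015, 2015, 2015, 2019, 2019, 2019, 2019],
    "type": ["Movie", "Movie", "Movie", "TV Show",
             "Movie", "Movie", "TV Show", "TV Show"],
})

# Row-normalised cross-tabulation: each release year sums to 100%
type_share_by_year = (
    pd.crosstab(sample["release_year"], sample["type"], normalize="index") * 100
).round(1)
print(type_share_by_year)
```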



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes:

- This analysis can inform content acquisition strategies based on historical demand.

- Identifying growth areas (e.g., surge in TV Shows) helps in deciding resource allocation for future production.

- Helps stakeholders understand how Netflix’s content mix has evolved, aiding in forecasting future trends and subscriber expectations.

#### Chart - 14 - Correlation Heatmap (Numerical Features)

In [None]:
# Correlation Heatmap visualization code
# Selecting only numerical columns
numerical_df = df_clean.select_dtypes(include=['int64', 'float64'])

# Correlation matrix
corr_matrix = numerical_df.corr()

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True, linewidths=.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is a standard tool for visualizing the strength and direction of linear relationships between multiple numerical variables.

##### 2. What is/are the insight(s) found from the chart?

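The numerical features (`release_year`, `duration_int`) show very weak or negligible correlation, meaning:

- Release year does not significantly impact content duration.

- Duration isn’t driven by time trends; it’s likely genre- or format-specific.

- Caveat: `duration_int` mixes minutes (Movies) and seasons (TV Shows), so a pooled correlation is hard to interpret. Computing it separately per content type is more meaningful.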
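
Because `duration_int` holds minutes for Movies and seasons for TV Shows, one way to check the relationship with `release_year` is to compute the correlation within each content type rather than across the pooled data. The following sketch uses made-up numbers purely to show the mechanics; it does not reproduce results from the real dataset.

```python
import pandas as pd

# Made-up sample rows for illustration only
sample = pd.DataFrame({
    "type": ["Movie"] * 4 + ["TV Show"] * 4,
    "release_year": [2010, 2012, 2014, 2016, 2015, 2017, 2018, 2020],
    "duration_int": [120.0, 110.0, 100.0, 90.0, 1.0, 2.0, 2.0, 4.0],
})

# Pearson correlation computed separately for each content type
corr_by_type = {
    content_type: group["release_year"].corr(group["duration_int"])
    for content_type, group in sample.groupby("type")
}

# Pooled correlation across both types, shown only for comparison
pooled_corr = sample["release_year"].corr(sample["duration_int"])

print(corr_by_type)
print(pooled_corr)
```

If the per-type and pooled values differ strongly, the pooled figure is mostly reflecting the difference in units rather than a real trend.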



#### Chart - 15 - Pair Plot (Multivariate Exploration)

In [None]:
# Pair Plot visualization code

import seaborn as sns
import matplotlib.pyplot as plt

# Selecting a subset of the dataset for better visualization
pairplot_df = df_clean[['release_year', 'duration_int', 'type']]

# Create the pair plot
sns.pairplot(pairplot_df, hue='type', diag_kind='kde', palette='Set1')
plt.suptitle("Pair Plot of Release Year, Duration & Type", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is perfect for understanding multivariate relationships and distribution overlaps across categories—especially useful when comparing across classes like Movie vs TV Show.

##### 2. What is/are the insight(s) found from the chart?

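- TV Shows cluster at low `duration_int` values (mostly 1-2 seasons), while Movie runtimes in minutes vary more widely. The two are measured in different units and are not directly comparable.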

- The release years for both types largely overlap, but recent years show a surge in TV Shows.

- A few outliers exist in duration, indicating extremely long movies or series.
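
The duration outliers mentioned above can be listed explicitly instead of being read off the plot. The sketch below applies the common 1.5 × IQR rule per content type; the rule, the helper name `iqr_outliers`, and the sample values are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd


def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]


# Made-up sample rows for illustration only
sample = pd.DataFrame({
    "type": ["Movie"] * 7 + ["TV Show"] * 7,
    "duration_int": [88, 92, 95, 99, 101, 105, 310, 1, 1, 1, 2, 2, 3, 15],
})

# Apply the rule separately per type, since the units differ
outliers = {
    content_type: iqr_outliers(group["duration_int"]).tolist()
    for content_type, group in sample.groupby("type")
}
print(outliers)
```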

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To help the client (a streaming platform like Netflix) achieve their business objective of maximizing viewer engagement and satisfaction, I suggest the following:

1. **Content Strategy Based on Audience Trends:** Focus more on producing or acquiring content in the genres with the highest presence in the dataset (e.g., Documentaries, Dramas, and International TV).

2. **Enhance Regional Personalization:** Since a significant volume of content originates from specific countries (like the U.S. and India), the platform can expand into region-specific content production and local language dubbing to capture more regional audiences.

3. **Optimize Release Timing:** Leverage insights from peak release years and, with a further month-level breakdown of `date_added`, seasonal patterns (e.g., spikes around specific months) to plan strategic content drops and marketing campaigns that align with high-viewership periods.

4. **Segmented Recommendations:** Use duration and content type (TV vs Movie) to improve recommendation systems:
   - Short-duration content can be recommended during weekdays or to casual viewers.
   - Long-duration content and series can be recommended on weekends or to binge-watchers.

5. **Filling Content Gaps:** Produce more content in underrepresented genres or countries to expand diversity and explore untapped viewer segments.
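
The release-timing recommendation depends on a month-level view of `date_added`, which the charts in this notebook do not cover. A minimal sketch of how such a breakdown could be produced is given here, using made-up dates; in the notebook the same steps would run on `df_clean['date_added']`.

```python
import pandas as pd

# Made-up 'date_added' values for illustration only
sample = pd.DataFrame({
    "date_added": pd.to_datetime([
        "2019-01-05", "2019-01-20", "2019-07-01",
        "2020-01-11", "2020-12-24", "2020-12-31",
    ])
})

# Count titles per month name, then put the months in calendar order
month_counts = sample["date_added"].dt.month_name().value_counts()
month_order = pd.date_range("2000-01-01", periods=12, freq="MS").month_name()
monthly_additions = month_counts.reindex(month_order, fill_value=0)

print(monthly_additions)
```

Months with consistently high counts across several years would be candidates for the "spikes" mentioned in point 3.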

✅ These actions, guided by EDA findings, will help the platform:

- Boost user retention and acquisition.

- Enhance customer satisfaction through personalized experiences.

- Improve the ROI on content investments.



# **Conclusion**

Through this comprehensive Exploratory Data Analysis (EDA) of the Netflix dataset, we derived several valuable insights:

1. Content Dominance: Movies significantly outnumber TV Shows, indicating a strong focus on one-off content rather than long-term series.

2. Top Contributing Countries: The U.S., India, and the U.K. are major contributors to Netflix’s content library. This insight highlights geographic strengths and possible regional expansion opportunities.

3. Trending Content: Dramas, Documentaries, and International content dominate the catalog's genre mix, providing guidance for future content investment.

4. Content Duration Patterns: TV Shows usually span over one or multiple seasons, while Movies vary widely in duration—ideal for different viewer segments and time availability.

5. Release & Addition Trends: Recent years (especially 2018–2020) show a significant increase in content additions, reflecting Netflix's growth and evolving content strategy.

These insights empower data-driven decisions in areas such as content acquisition, regional strategy, user engagement, and marketing timing.

Final Note:
- This EDA demonstrates how data storytelling and structured visualization can uncover hidden trends and offer actionable business value. Future steps could include building a recommendation engine or running predictive models to further optimize content strategies.



