<a href="https://colab.research.google.com/github/SahilAgarwal03/Python_Projects/blob/main/Project_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Content Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member**     - Sahil Agarwal

# **Project Summary -**

The primary objective of this project is to explore the global content of Netflix through data-driven methods, uncover trends and insights, and provide actionable recommendations to help optimize content strategy, audience engagement, and platform growth. This project primarily focuses on identifying patterns across content types, genres, countries, durations, and creator information (actors and directors).



# **GitHub Link -**

https://github.com/SahilAgarwal03

# **Problem Statement**


Netflix hosts a vast library of global content, but to maintain user engagement and strategic growth, it must understand patterns in content types, genres, regions, durations, and creator involvement. The challenge is to analyze this data to uncover actionable insights that can guide content strategy, optimize releases, and align offerings with viewer preferences.



#### **Define Your Business Objective?**

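Identify patterns in Netflix's content types, genres, countries, durations, and creators, and turn them into recommendations for content strategy, audience engagement, and platform growth.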

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/Netflix Project/Netflix Project.csv'
df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows, number of columns:", df.shape)

### Dataset Information

In [None]:
# Dataset Info(Data types and non-null values)
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print('Columns:', list(df.columns))

In [None]:
# Dataset Describe
df.describe()

### Variables Description

1. **show_id:** Unique identifier for each show or movie in the dataset.
2. **type:** Indicates whether the content is a Movie or TV Show.
3. **title:** Title of the content.
4. **director:** Names of the directors associated with the content.
5. **cast:** List of actors/actresses involved.
6. **country:** Country or countries where the content was produced.
7. **date_added:** The date when the content was added to Netflix.
8. **release_year:** The year when the content was originally released.
9. **rating:** Content rating (e.g., TV-MA, PG-13) based on viewing suitability.
10. **duration:** Duration of the content (in minutes for movies, or seasons for TV shows).
11. **listed_in:** Comma-separated genres/categories describing the content.
12. **description:** Short synopsis or summary of the content.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Eliminating Null Values
df['director'] = df['director'].fillna("Not Available") # For director column which has 2389 Null values
df['cast'] = df['cast'].fillna("Not Available") # For cast column which has 718 Null values
df['country'] = df['country'].fillna("Unknown") # For country column which has 507 Null values
df['date_added'] = df['date_added'].ffill() # Forward fill for date_added column which has 10 Null values
df['rating'] = df['rating'].fillna(df['rating'].mode()[0]) # Most frequent rating for the 7 Null values
df.isnull().sum() # Confirm that no Null values remain

### What all manipulations have you done and insights you found?

1. **No duplicate entries:** All records are unique.
2. **Missing values handled:** Missing values were eliminated using placeholder replacement, the most frequent value, or forward fill.
3. **date_added prepared for trend analysis:** The column is converted to datetime later (Chart 9) so that the month of addition can be extracted.
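
Forward fill copies the last valid value into each gap, whereas passing the text `"ffill"` to `fillna` simply writes that text into the column. The following is a minimal sketch on a made-up series (the dates are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy series standing in for df['date_added']; the values are made up.
dates = pd.Series(["January 1, 2020", None, "March 5, 2020"])

# fillna with a string inserts that literal text into the gap.
literal = dates.fillna("ffill")

# ffill() propagates the previous valid value forward.
forward = dates.ffill()

print(literal.tolist())
print(forward.tolist())
```

Only the second form leaves the column parseable by `pd.to_datetime` afterwards.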

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 Content Rating Distribution: TV Shows vs Movies
rating_count = df.groupby('type')['rating'].value_counts().unstack(fill_value=0)
rating_count.plot(kind='barh', figsize=(10, 8), stacked=True, color=plt.cm.Paired.colors)
plt.title('Rating Distribution: TV Shows vs Movies')
plt.xlabel('Count')
plt.ylabel('Type')  # rows of rating_count are content types
plt.legend(title='Rating')  # stacked segments are ratings
plt.show()
print(rating_count)


##### 1. Why did you pick the specific chart?

I chose a **Bar Chart** because it's ideal for comparing the frequency of different categories of ratings (e.g., TV-MA, PG-13, R, etc.). It clearly shows the distribution of ratings in the dataset, which helps in understanding the nature of the content and the target audience.



##### 2. What is/are the insight(s) found from the analysis?


The rating distribution shows that most content on **Netflix** targets mature audiences, with TV-MA and TV-14 being the most frequent ratings. This suggests a strong focus on adult viewers, particularly in TV shows. Movies, on the other hand, tend to be spread more evenly across ratings like PG, PG-13, and R, showing a slightly broader target audience. Additionally, ratings such as NC-17, TV-Y7-FV, and UR appear very infrequently, indicating limited content for very young children and in highly explicit categories. Overall, the platform's content strategy seems to prioritize mature and teen-friendly material, especially through television series.
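
Because Movies and TV Shows differ in total count, comparing shares within each type can make the contrast easier to read than raw counts. A minimal sketch, assuming a toy frame with the same `type` and `rating` columns (the rows are made up):

```python
import pandas as pd

# Toy stand-in for the Netflix frame; only the two columns used here.
toy = pd.DataFrame({
    "type":   ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["TV-MA", "PG-13", "R", "TV-MA", "TV-MA", "TV-14"],
})

# Row-normalised crosstab: each row (content type) sums to 1,
# so the values are the share of each rating within that type.
rating_share = pd.crosstab(toy["type"], toy["rating"], normalize="index")

print(rating_share.round(2))
```

On the real data, `toy` would be replaced by `df`, and the result could be plotted with the same stacked `barh` call used above.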

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help in creating a positive business impact.

1. Targeted Content Strategy:

Knowing the dominant content ratings allows Netflix to continue investing in content that resonates with its most active viewer segments. Producing more mature content such as dramas, thrillers, or crime series rated TV-MA can lead to higher engagement.

2. Improved Marketing and Advertising:

By understanding that the primary audience is mature, Netflix can tailor its advertising campaigns to that audience.

3. Recommendation System:

Insights from rating distributions can help in content suggestion. Recommending content aligned with viewers' preferences can improve user experience, boost viewing time, and reduce customer churn.

#### Chart - 2

In [None]:
# Chart - 2 Distribution of All Genres in the Dataset
df['genres'] = df['listed_in'].str.split(',')
df['genres'] = df['genres'].apply(lambda x: [genre.strip() for genre in x])
genres_series = df['genres'].explode()
genre_counts = genres_series.value_counts()
genre_counts.plot(kind='barh', figsize=(10, 11), color='skyblue')
plt.title('Distribution of All Genres in the Dataset')
plt.xlabel('Count')
plt.ylabel('Genres')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart works well when dealing with many categories of genres. It avoids clutter and helps compare counts effectively.

##### 2. What is/are the insight(s) found from the chart?

1. A few genres like International Movies, Dramas, and Comedies have very high counts, while many others like LGBTQ Movies, Cult TV, and Science & Nature TV have very low presence.

2. Netflix seems to rely heavily on a few core genres, like Dramas, Comedies, and Documentaries, probably because they attract broad audiences; they are likely key revenue drivers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The dominance of genres like International Movies, Dramas, and Comedies demonstrates viewer preference.

**Positive Business Impact:** Netflix can focus on top-performing genres, create region-specific subgenres, and fine-tune its recommendation engine. This will enhance user satisfaction and time spent on the platform.

**Risk Area:** Underrepresented genres such as Faith & Spirituality or Classic TV may indicate neglected audiences. Ignoring these could lead to market segments being underserved and viewers migrating to competitors who fill those gaps.





#### Chart - 3

In [None]:
# Chart - 3 Yearly Trend of Netflix Content Additions
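
A minimal sketch of how such a yearly trend line could be drawn. It assumes the chart counts titles per `release_year` (the insights below mention years before 2000, which points to release year rather than `date_added`); the toy frame and its values are made up and stand in for `df`:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for df; the years are illustrative only.
toy = pd.DataFrame({
    "release_year": [1998, 2015, 2016, 2016, 2018, 2018, 2018, 2019, 2019, 2020]
})

# Count titles per year and sort chronologically for the line plot.
yearly_counts = toy["release_year"].value_counts().sort_index()

ax = yearly_counts.plot(kind="line", marker="o", figsize=(10, 5))
ax.set_title("Yearly Trend of Netflix Content")
ax.set_xlabel("Release Year")
ax.set_ylabel("Number of Titles")
ax.grid(True)
plt.tight_layout()
plt.show()
```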


##### 1. Why did you pick the specific chart?

A line chart effectively shows how the number of releases changes over time. It's ideal for observing trends, peaks, and drops in content production.

##### 2. What is/are the insight(s) found from the chart?

1. There's a massive spike in content from around 2015 to 2020, which coincides with Netflix’s international expansion.

2. The sharp drop after 2020 could reflect incomplete data for the most recent period or the impact of COVID-19.

3. Before 2000, very little content was added, meaning that most content is recent.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart reveals growth in content additions after 2015, with a peak in 2019. It is consistent with Netflix’s aggressive content strategy and streaming dominance in recent years.

**Positive Business Impact:** The surge in content around 2018–2019 coincides with global subscriber growth, suggesting that content volume drives user acquisition. Insights from this can guide future content ramp-ups and platform scheduling.

**Negative Growth Concern:** The sharp drop after 2019, especially in 2020, may signal impacts from the COVID-19 pandemic or slowed content investment. If not addressed, this could reduce engagement over time, requiring strategic planning to maintain content flow.

#### Chart - 4

In [None]:
# Chart - 4 Top 10 Actors/Actresses on Netflix
df_cast = df[['cast']].copy()
df_cast['cast']= df_cast['cast'].str.split(',').apply(lambda x: [actor.strip() for actor in x])
df_exploded = df_cast.explode('cast')
df_exploded = df_exploded[df_exploded['cast'] != 'Not Available']
top_actors = df_exploded['cast'].value_counts().head(10)
top_actors.plot(kind='bar', figsize=(10,5), color='skyblue')
plt.title("Top 10 Actors/Actresses on Netflix")
plt.xlabel("Actor/Actress")
plt.ylabel("Number of Appearances")
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing discrete categories (in this case, actors) based on a count metric. It's easy to see who appears most frequently, making it the best choice for this type of ranking.

##### 2. What is/are the insight(s) found from the chart?

1. Anupam Kher leads significantly in Netflix appearances, followed by Shah Rukh Khan.

2. The chart shows a strong representation of Indian actors, indicating Netflix has a large library of Indian content.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart highlights the most featured actors, which hints at talent availability and viewer preferences. Frequent appearances by actors like **Anupam Kher** and **Shah Rukh Khan** reflect Netflix’s deep focus on Indian cinema.

**Positive Business Impact:** Recognizing which actors consistently draw viewers allows Netflix to pursue exclusive contracts, boosting subscriber retention in specific regions.

**Negative Growth Risk:** If the focus is too heavy on a few familiar faces, it may limit experimentation and innovation in casting and could put off users looking for fresh talent or diverse representation.

#### Chart - 5

In [None]:
# Chart -5 Top 10 Countries producing Content
df['countries_count'] = df['country'].str.split(',').apply(lambda x: [country.strip() for country in x])
countries_counts = df['countries_count'].explode()
countries_counts = countries_counts.value_counts()
top_countries = countries_counts.head(10)
top_countries.plot(kind='pie',figsize=(10, 10), autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired.colors)
plt.title('Top 10 Countries Producing Content')
plt.show()

##### 1. Why did you pick the specific chart?

1. Pie chart was chosen for a clear, percentage-based view.

2. Easy to explain to non-technical audiences.

##### 2. What is/are the insight(s) found from the chart?

1. The US dominates the platform with 45.8% of the titles credited to the top 10 countries.

2. India and the UK follow as significant contributors with 13.8% and 10.1% respectively.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that the United States leads Netflix content production, followed by India and the United Kingdom, confirms the platform's focus on English-speaking and high-population markets.

 **Positive Business Impact:** It helps Netflix allocate budgets strategically—invest more in content creation in countries with proven performance, and expand original productions in emerging markets like India.

**Potential Negative Insight:** Over-reliance on U.S. content (45.8%) might limit cultural diversity, leading to viewer fatigue in international markets. A more balanced regional content mix could improve global engagement.
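
The pie chart expresses each country as a share of the top 10 only, and the `"Unknown"` placeholder is counted like a country. A minimal sketch of an alternative, assuming a toy frame with the same comma-separated `country` column (the rows are made up), computes each country's share of all known country credits:

```python
import pandas as pd

# Toy stand-in for df['country'] after missing values were filled with "Unknown".
toy = pd.DataFrame({
    "country": [
        "United States",
        "United States, India",
        "India",
        "Unknown",
        "United Kingdom, United States",
    ]
})

# One row per (title, country) credit, with surrounding whitespace removed.
countries = toy["country"].str.split(",").explode().str.strip()

# Drop the placeholder so it does not take a slice of the chart.
countries = countries[countries != "Unknown"]

# Percentage share of all known country credits.
country_share = countries.value_counts(normalize=True) * 100

print(country_share.round(1))
```

Whether to report shares of the top 10 or of all credits is a presentation choice; the second is less sensitive to where the cut-off is placed.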

#### Chart - 6

In [None]:
# Chart - 6 visualization code
df['countries'] = df['country'].str.split(',').apply(lambda x: [country.strip() for country in x])
df['genres'] = df['listed_in'].str.split(',').apply(lambda x: [listed_in.strip() for listed_in in x])
df_exploded = df.explode('countries').explode('genres')
top_countries = df_exploded['countries'].value_counts().head(10).index
top_genres = df_exploded['genres'].value_counts().head(10).index
df_filtered = df_exploded[df_exploded['countries'].isin(top_countries) & df_exploded['genres'].isin(top_genres)]
heatmap_data = df_filtered.groupby(['genres', 'countries']).size().unstack()
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt='g')
plt.title('Genre Distribution by Top 10 Countries')
plt.xlabel('Country')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

1. Identify country-specific genre preferences.

2. Heatmap gives quick visual of quantity.

##### 2. What is/are the insight(s) found from the chart?

1. The US produces most of the content on the platform, and its dominant genres are Dramas, Comedies, Documentaries, etc.

2. India also produces a large amount of content, and its most popular genres are International Movies, Comedies, and Dramas.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This heatmap supports regional content planning. Knowing that Indian titles are concentrated in Dramas and International Movies while US titles lean toward Comedies and Documentaries helps Netflix tailor regional libraries, assuming that production volume reflects local demand. It directly supports global growth and localized user engagement. If such data is ignored, Netflix risks a content mismatch that leads to user churn; this chart helps prevent that.

#### Chart - 7

In [None]:
# Chart - 7 Duration distribution per genre(Movies).
movies_df = df[df['type'] == 'Movie'].copy()
movies_df['duration'] = movies_df['duration'].str.extract(r'(\d+)', expand=False).astype(int)  # minutes as integers
movies_df['genres'] = movies_df['listed_in'].str.split(',').apply(lambda x: [genre.strip() for genre in x])
movies_explode = movies_df.explode('genres')
top_genres= movies_explode['genres'].value_counts().head(10).index
movies_explode = movies_explode[movies_explode['genres'].isin(top_genres)]
sns.boxplot(x='genres', y='duration', data=movies_explode)
plt.xticks(rotation=45)
plt.title('Duration by Genre (Movies)')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

1. Box plots show spread, median, and outliers, which are very useful for duration analysis.

2. Best for comparing genres side-by-side.

##### 2. What is/are the insight(s) found from the chart?

1. Dramas and International Movies have longer durations on average.

2. Comedy and similar genres are shorter and tighter in range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart helps refine user recommendations and content packaging. For example, users preferring quick entertainment can be directed toward Comedies, while those looking for deeper stories may enjoy Dramas or Documentaries. This promotes user satisfaction and increases watch time. No negative trend is identified, but genres with extremely high duration variance may confuse audiences unless clearly labeled or segmented.

#### Chart - 8

In [None]:
# Chart 8 Top 10 Directors on Netflix
df_directors = df[['director']].copy()
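df_directors['director'] = df_directors['director'].str.split(',').apply(lambda x: [name.strip() for name in x])  # strip spaces so co-directors are not counted as separate names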
df_exploded = df_directors.explode('director')
df_exploded = df_exploded[df_exploded['director'] != 'Not Available']
top_directors = df_exploded['director'].value_counts().head(10)
top_directors.plot(kind='barh', figsize=(10,6), color='orange')
plt.title("Top 10 Directors on Netflix")
plt.xlabel("Number of Movies or Shows")
plt.ylabel("Director")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

1. To spotlight key directors and content creators on the platform.

2. Horizontal bars are easier to read for names.

##### 2. What is/are the insight(s) found from the chart?

1. Certain directors have a significantly higher presence.

2. Repeated collaborations are visible, such as Jan Suter and Raúl Campos.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying top-performing directors helps Netflix strengthen existing collaborations or seek exclusive deals. Directors like Jan Suter and Raúl Campos having frequent entries may point toward content styles or genres that perform well. This supports a positive business impact by optimizing partnerships and maintaining consistent quality. However, over-reliance on a narrow set of creators could limit diversity and global appeal, so maintaining balance is key.

#### Chart - 9

In [None]:
# Chart - 9 Monthly Trend of Content Addition
# Strip stray whitespace and coerce unparseable entries to NaT before extracting the month
df['date_added'] = pd.to_datetime(df['date_added'].astype(str).str.strip(), errors='coerce')
df['month_added'] = df['date_added'].dt.month_name()
monthly_counts = df['month_added'].value_counts().reindex([
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'])
plt.figure(figsize=(10,5))
monthly_counts.plot(kind='line', marker='o', color='green')
plt.title("Monthly Trend of Content Addition")
plt.xlabel("Month")
plt.ylabel("Number of TV Shows or Movies Added")
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

1. Reveals seasonal patterns in content addition.

2. This chart type presents sequential values to help you identify trends.

3. Simple line chart to highlight monthly behavior.

##### 2. What is/are the insight(s) found from the chart?

1. Netflix adds more content in the last few months of the year (Oct–Dec).

2. There is a large dip in February, possibly because it is a shorter month.
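
One way to check whether month length explains the February dip is to divide each month's total by its number of days. A minimal sketch, with hypothetical counts standing in for `monthly_counts` and a non-leap year (2019) assumed for the day counts:

```python
import calendar
import pandas as pd

# Hypothetical monthly totals; on the real data this would be monthly_counts.
monthly_counts = pd.Series({"January": 310, "February": 280, "March": 310})

# Days per month for a non-leap year (2019 is an assumption for illustration).
days_in_month = pd.Series({
    calendar.month_name[i]: calendar.monthrange(2019, i)[1] for i in range(1, 13)
})

# Additions per day; aligning on month name keeps the arithmetic explicit.
per_day = monthly_counts / days_in_month.reindex(monthly_counts.index)

print(per_day.round(2))
```

If February still stands out after this normalisation, month length alone does not account for the dip.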

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The monthly trend analysis offers operational value in aligning promotions and new releases with high-engagement months like October–December. This insight enables better resource allocation in marketing and platform updates. The dip in February may indicate a missed opportunity or planned seasonal downtime, but it’s not a negative indicator unless it consistently shows underperformance. Strategically using this insight can drive monthly user spikes, especially during holiday seasons.

#### Chart - 10

In [None]:
# Chart - 10 Content Release Year vs Type(Movies vs TV Shows)
plt.figure(figsize=(10, 6))
df_filtered = df[df['release_year'] >= 1980]
sns.countplot(data=df_filtered, x='release_year', hue='type', palette='Set1')
plt.title("Content Release Year vs Type (Movies vs TV Shows)")
plt.xlabel("Release Year")
plt.ylabel("Number of TV Shows And Movies")
plt.xticks(rotation=45)
plt.legend(title="Type")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

1. It clearly shows the volume of the content types through the years.

2.  It is ideal for showcasing Netflix's production trends.

##### 2. What is/are the insight(s) found from the chart?

1. There is a massive spike in Movies after 2015 and a large dip after 2019, possibly because of COVID-19.

2. TV Shows grew gradually, but there is also a large dip after 2018.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This visualization clearly showcases Netflix’s heavy focus on Movie production post-2015, with TV Shows catching up steadily. This data can guide content strategy — for example, investing more in original series if growth in TV Show consumption continues. This trend supports business growth by validating Netflix’s strategic shift. No negative impact is inferred, but overproduction of movies without considering audience saturation could be a risk in the long term.



#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Make sure numeric features exist
df['duration_int'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)
corr_df = df[['release_year', 'duration_int']]
corr = corr_df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Release Year vs Duration')
plt.show()

##### 1. Why did you pick the specific chart?

1. Correlation heatmaps are a quick way to see how strongly the numerical features of a dataset are related.

2. Chosen to verify whether newer content is shorter or longer.

##### 2. What is/are the insight(s) found from the chart?

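1. A negative correlation (-0.24), which suggests that newer releases tend to be shorter in duration.
2. This may point to a modern shift toward quicker content, although `duration_int` mixes minutes (Movies) with season counts (TV Shows), so part of the effect may simply reflect the growing number of TV Shows among recent releases.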
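
Since `duration_int` holds minutes for Movies and season counts for TV Shows, the correlation can also be computed for Movies alone. A minimal sketch on a toy frame with the same column names (all values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for df; durations follow the "<n> min" / "<n> Season(s)" pattern.
toy = pd.DataFrame({
    "type": ["Movie"] * 4 + ["TV Show"] * 3,
    "release_year": [2000, 2010, 2015, 2020, 2016, 2018, 2020],
    "duration": ["120 min", "110 min", "100 min", "90 min",
                 "1 Season", "2 Seasons", "3 Seasons"],
})
toy["duration_int"] = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)

# Correlation over all rows mixes minutes with season counts.
mixed_corr = toy["release_year"].corr(toy["duration_int"])

# Restricting to Movies keeps a single unit (minutes).
movies = toy[toy["type"] == "Movie"]
movie_corr = movies["release_year"].corr(movies["duration_int"])

print("All titles:", round(mixed_corr, 2))
print("Movies only:", round(movie_corr, 2))
```

On the real data, comparing the two values would show how much of the overall correlation comes from the Movie/TV Show split rather than from movies getting shorter.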

#### Chart - 12 - Pair Plot

In [None]:
# Pair Plot visualization code
df['duration_int'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)
df['month_added'] = pd.to_datetime(df['date_added']).dt.month
pair = df[['release_year', 'month_added', 'duration_int', 'type']]
sns.pairplot(pair, hue='type', palette='Set2')
plt.suptitle("Netflix Pair Plot: Release, Added Month & Duration", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

1.  Because it shows multiple relationships in one view: duration, release year, and month of addition.

2. Chosen to explore  patterns between Movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

1. Most content has been released after the year 2000.

2. Most of the content is added during the last month of the year.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
1. **Prioritize Shorter, High-Impact Content:** With modern content trending shorter, Netflix can adapt to audience preferences by focusing on high-quality, time-efficient formats, especially for mobile and younger users.

2. **Release Strategy Optimization:** Capitalize on high-performing months (Oct–Dec) for major releases and marketing campaigns.

3. Continue investment in TV Shows as their growth trend suggests strong future engagement.

4. **Use Popular Talent and Genres:** Prioritize projects featuring frequently featured actors and directors, especially in drama, comedy, and international movie genres, which are likely to drive strong engagement.

5. Develop and promote niche genres (e.g., classic TV, stand-up comedy, spiritual content) to cater to micro-audiences, enhancing viewer diversity and long-tail value.


# **Conclusion**


In this project, I conducted an exploratory data analysis (EDA) on Netflix's global content dataset to uncover meaningful business insights. The dataset initially required data wrangling, including handling missing values (e.g., in director, cast, date_added), correcting data types (e.g., converting duration and date_added), and filtering relevant information for trend analysis. Once cleaned, the dataset was prepared for structured, meaningful visual exploration.

Through 12 diverse and insightful visualizations, I analyzed content trends by type, genre, duration, country, director, actor, and time, helping to uncover key patterns:

*   Movies dominate Netflix’s content, while TV Shows have grown gradually over the years.
*   Certain months (Oct–Dec) show higher content additions, suggesting seasonal patterns.
*   Shorter content formats are becoming more common in newer releases, which may suit mobile-first audiences.
*   The US, India, and the UK are the top producers, each with distinct genre preferences (e.g., Dramas in India, Comedies in the US).
*   Frequent appearances by specific actors and directors point to strong creative partnerships that can be leveraged further.
*   Top genres vary by country, enabling region-specific content strategies.

By integrating these insights, Netflix can optimize:

*   Content planning and release timing
*   Localized marketing
*   Genre-specific production decisions
*   Collaborations with high-performing talent

The analysis has successfully transformed raw, unstructured data into business-ready intelligence — giving Netflix a clearer picture of what works, where, and for whom. These insights are not only aligned with the business objective but also scalable across content, marketing, and strategy teams.

