<a href="https://colab.research.google.com/github/IshanArcane/VOIS_AICTE_Oct2025_NETFLIX_IshanMamidpalliwar/blob/main/Netflix_Major_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
df = pd.read_csv('/content/Netflix Dataset.csv')
display(df.head())

Unnamed: 0,Show_Id,Category,Title,Director,Cast,Country,Release_Date,Rating,Duration,Type,Description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,07:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",PG-13,123 min,Dramas,A brilliant group of students become card-coun...


# Task
Analyze Netflix content trends from "Netflix Dataset.csv" to provide strategic recommendations for content acquisition and production, focusing on content type, genre, and geographical distribution. Optionally, analyze talent impact.

## Load and inspect data

### Subtask:
Load the Netflix dataset and perform initial inspection to understand the data structure, identify missing values, and data types.


**Reasoning**:
The dataset is already loaded as `df` and the first 5 rows are displayed. The next step is to print the concise summary and the number of missing values as requested in the instructions.



In [None]:
df.info()
print("\nMissing values per column:")
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7789 entries, 0 to 7788
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Show_Id       7789 non-null   object
 1   Category      7789 non-null   object
 2   Title         7789 non-null   object
 3   Director      5401 non-null   object
 4   Cast          7071 non-null   object
 5   Country       7282 non-null   object
 6   Release_Date  7779 non-null   object
 7   Rating        7782 non-null   object
 8   Duration      7789 non-null   object
 9   Type          7789 non-null   object
 10  Description   7789 non-null   object
dtypes: object(11)
memory usage: 669.5+ KB

Missing values per column:
Show_Id            0
Category           0
Title              0
Director        2388
Cast             718
Country          507
Release_Date      10
Rating             7
Duration           0
Type               0
Description        0
dtype: int64
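
Raw missing-value counts are easier to compare as percentages of all rows. The following is a minimal sketch of that summary. It uses a small toy DataFrame (illustrative values only, not the notebook's `df`), and the names `toy` and `missing_summary` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the values are illustrative only.
toy = pd.DataFrame({
    "Director": ["Shane Acker", None, None, "Gilbert Chan"],
    "Country": ["United States", "Mexico", None, "Singapore"],
})

# Count and percentage of missing values per column, worst column first.
missing_summary = pd.DataFrame({
    "Missing": toy.isnull().sum(),
    "Percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("Percent", ascending=False)

print(missing_summary)
```

Applied to the real data, the same expression would put `Director` first, since it has the most missing entries (2,388 of 7,789 rows).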


## Data cleaning and preparation

### Subtask:
Handle missing values, convert 'Release_Date' to datetime objects, and extract year and month for time series analysis.


**Reasoning**:
Handle missing values in specified columns and convert 'Release_Date' to datetime, then extract year and month.



In [None]:
# Fill missing values
df['Director'] = df['Director'].fillna('Unknown')
df['Cast'] = df['Cast'].fillna('Unknown')
df['Country'] = df['Country'].fillna('Unknown')
df['Rating'] = df['Rating'].fillna(df['Rating'].mode()[0]) # Use mode for Rating

# Convert 'Release_Date' to datetime with error handling and specified format
df['Release_Date'] = pd.to_datetime(df['Release_Date'], errors='coerce', format='%B %d, %Y')

# Extract year and month
df['Release_Year'] = df['Release_Date'].dt.year
df['Release_Month'] = df['Release_Date'].dt.month

display(df.head())
display(df.isnull().sum())

Unnamed: 0,Show_Id,Category,Title,Director,Cast,Country,Release_Date,Rating,Duration,Type,Description,Release_Year,Release_Month
0,s1,TV Show,3%,Unknown,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020-08-14,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...,2020.0,8.0
1,s2,Movie,07:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...,2016.0,12.0
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018-12-20,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow...",2018.0,12.0
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017-11-16,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi...",2017.0,11.0
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020-01-01,PG-13,123 min,Dramas,A brilliant group of students become card-coun...,2020.0,1.0


Unnamed: 0,0
Show_Id,0
Category,0
Title,0
Director,0
Cast,0
Country,0
Release_Date,98
Rating,0
Duration,0
Type,0


**Reasoning**:
Not every 'Release_Date' value matches the `%B %d, %Y` format. With `errors='coerce'`, any value that cannot be parsed becomes NaT (Not a Time) instead of raising an error, which is why the number of missing dates rises from 10 before the conversion to 98 after it.
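
One possible cause of unparsed dates is stray leading or trailing whitespace in the date strings. This is an assumption and has not been verified against this dataset. The sketch below, on a toy Series, shows how stripping whitespace before parsing could be compared with the strict parse; `raw`, `strict`, and `cleaned` are hypothetical names.

```python
import pandas as pd

# Toy values: one date with a leading space (assumed failure mode), one clean, one missing.
raw = pd.Series([" August 14, 2020", "December 23, 2016", None])

# Strict parse, as in the notebook: values that do not match the format become NaT.
strict = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")

# Same parse after stripping surrounding whitespace.
cleaned = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print("NaT without strip:", strict.isna().sum())
print("NaT with strip:   ", cleaned.isna().sum())
```

If the two counts differ on the real column, whitespace accounts for at least part of the gap; any remaining NaT values would need a separate look.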



## Content Type Analysis

### Subtask:
Analyze the trend of movies and TV shows added over the years and visualize the distribution.

In [None]:
# Analyze content type distribution
content_distribution = df['Category'].value_counts().reset_index()
content_distribution.columns = ['Category', 'Count']
display(content_distribution)

# Visualize content type distribution
fig = px.bar(content_distribution, x='Category', y='Count', title='Distribution of Movies and TV Shows')
fig.show()

Unnamed: 0,Category,Count
0,Movie,5379
1,TV Show,2410


**Reasoning**:
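Next, we analyze how content volume changes over the years. We group the data by `Release_Year` and `Category` to count the movies and TV shows for each year, then visualize the trend with a line plot. Rows whose date could not be parsed have a missing `Release_Year` and are dropped by `groupby`, so the year can safely be cast to an integer afterwards.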

In [None]:
# Analyze content added over the years
content_over_time = df.groupby(['Release_Year', 'Category']).size().reset_index(name='Count')
content_over_time['Release_Year'] = content_over_time['Release_Year'].astype(int)
display(content_over_time.head())

# Visualize content added over the years
fig = px.line(content_over_time, x='Release_Year', y='Count', color='Category', title='Content Added Over the Years (Movies vs. TV Shows)')
fig.show()

Unnamed: 0,Release_Year,Category,Count
0,2008,Movie,1
1,2008,TV Show,1
2,2009,Movie,2
3,2010,Movie,1
4,2011,Movie,13


## Genre Trend Analysis

### Subtask:
Analyze the evolution of genres over time.

**Reasoning**:
To analyze the genre trends, we need to process the 'Type' column, which contains multiple genres per title. We will split the genres, create a new DataFrame with each genre as a separate row, and then group by year and genre to count the occurrences. Finally, we will visualize the top genres over time.

In [None]:
# Prepare data for genre analysis
genres_df = df.assign(Type=df['Type'].str.split(', ')).explode('Type')

# Analyze genre trends over time
genre_over_time = genres_df.groupby(['Release_Year', 'Type']).size().reset_index(name='Count')
genre_over_time['Release_Year'] = genre_over_time['Release_Year'].astype(int)

# Get top genres for visualization (e.g., top 10)
top_genres = genres_df['Type'].value_counts().head(10).index.tolist()
genre_over_time_top = genre_over_time[genre_over_time['Type'].isin(top_genres)]

display(genre_over_time_top.head())

# Visualize genre trends over time
fig = px.line(genre_over_time_top, x='Release_Year', y='Count', color='Type', title='Top 10 Genre Trends Over the Years')
fig.show()

Unnamed: 0,Release_Year,Type,Count
0,2008,Dramas,1
1,2008,Independent Movies,1
4,2009,Dramas,1
6,2009,International Movies,1
9,2011,Children & Family Movies,1
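
The split-and-explode step is easiest to follow on a tiny example. The sketch below uses two made-up titles to show how one row with several genres becomes one row per genre; `toy` and `exploded` are hypothetical names.

```python
import pandas as pd

# Two made-up titles; the first carries two genres in a single string.
toy = pd.DataFrame({
    "Title": ["A", "B"],
    "Type": ["Dramas, International Movies", "Dramas"],
})

# Split the genre string into a list, then give each list element its own row.
exploded = toy.assign(Type=toy["Type"].str.split(", ")).explode("Type")

genre_counts = exploded["Type"].value_counts()
print(exploded)
print(genre_counts)
```

Because a title is counted once per genre, the genre counts sum to more than the number of titles.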


## Talent Impact Analysis (Optional Enhancement)

### Subtask:
Analyze directors or cast members associated with popular content.

**Reasoning**:
To analyze talent impact, we will first identify the top directors and cast members based on the number of titles they have contributed to. We will then display the top contributors in each category.

In [None]:
# Analyze top directors
top_directors = df['Director'].value_counts().reset_index()
top_directors.columns = ['Director', 'Count']

# Exclude 'Unknown' director for visualization
top_directors = top_directors[top_directors['Director'] != 'Unknown']

display("Top 10 Directors:")
display(top_directors.head(10))

# Analyze top cast members
# Split the 'Cast' column and count occurrences of each cast member
cast_df = df.assign(Cast=df['Cast'].str.split(', ')).explode('Cast')
top_cast = cast_df['Cast'].value_counts().reset_index()
top_cast.columns = ['Cast', 'Count']

# Exclude 'Unknown' cast member for visualization
top_cast = top_cast[top_cast['Cast'] != 'Unknown']

display("Top 10 Cast Members:")
display(top_cast.head(10))

'Top 10 Directors:'

Unnamed: 0,Director,Count
1,"Raúl Campos, Jan Suter",18
2,Marcus Raboy,16
3,Jay Karas,14
4,Cathy Garcia-Molina,13
5,Youssef Chahine,12
6,Jay Chapman,12
7,Martin Scorsese,12
8,Steven Spielberg,10
9,David Dhawan,9
10,Ryan Polito,8


'Top 10 Cast Members:'

Unnamed: 0,Cast,Count
1,Anupam Kher,42
2,Shah Rukh Khan,35
3,Naseeruddin Shah,30
4,Om Puri,30
5,Akshay Kumar,29
6,Takahiro Sakurai,29
7,Boman Irani,27
8,Paresh Rawal,27
9,Amitabh Bachchan,27
10,Yuki Kaji,27
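
The director counts above treat a co-directing pair such as "Raúl Campos, Jan Suter" as a single entry, whereas the cast counts are split per person. A minimal sketch of counting directors individually follows, using a toy Series with illustrative values; `directors` and `individual_counts` are hypothetical names.

```python
import pandas as pd

# Toy director column: one co-directed title, one solo title, one placeholder.
directors = pd.Series(["Raúl Campos, Jan Suter", "Jan Suter", "Unknown"])

# Split co-directors into separate rows and drop the 'Unknown' placeholder.
individual = directors.str.split(", ").explode()
individual = individual[individual != "Unknown"]

individual_counts = individual.value_counts()
print(individual_counts)
```

Whether to count pairs or individuals depends on the question being asked; the per-person view is consistent with how the cast members are counted.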


## Geographical Analysis

### Subtask:
Analyze content contributions from different countries.

**Reasoning**:
To analyze geographical distribution, we will count the number of titles per country. Since a title can be associated with multiple countries, we will split the 'Country' column and then count the occurrences of each country. Finally, we will visualize the top contributing countries.

In [None]:
# Prepare data for geographical analysis
countries_df = df.assign(Country=df['Country'].str.split(', ')).explode('Country')

# Analyze content contribution by country
country_contribution = countries_df['Country'].value_counts().reset_index()
country_contribution.columns = ['Country', 'Count']

# Exclude 'Unknown' country for visualization
country_contribution = country_contribution[country_contribution['Country'] != 'Unknown']

display(country_contribution.head())

# Visualize top contributing countries (e.g., top 10)
fig = px.bar(country_contribution.head(10), x='Country', y='Count', title='Top 10 Content Contributing Countries')
fig.show()

Unnamed: 0,Country,Count
0,United States,3297
1,India,990
2,United Kingdom,722
4,Canada,412
5,France,349


## Generate Strategic Recommendations

### Subtask:
Based on the analysis, formulate strategic recommendations for content acquisition and production.

**Recommendations based on the analysis:**

Based on the analysis of the Netflix dataset, here are some strategic recommendations for content acquisition and production:

1.  **Focus on TV Shows:** The trend analysis shows a significant increase in the addition of TV shows over the years, which may reflect growing demand for serial content (title counts alone do not measure viewership). Movies still make up most of the catalog (5,379 titles versus 2,410 TV shows), so Netflix should continue to invest in TV show production and acquisition to follow this trend.
2.  **Diversify Genre Investment:** Dramas, international movies, and stand-up comedy are among the most common genres. Beyond these, consider expanding content in emerging or less saturated genres identified in the genre trend analysis to capture niche audiences.
3.  **Strengthen Local Content Production in Top Contributing Countries:** The geographical analysis shows that the United States (3,297 titles), India (990), and the United Kingdom (722) contribute the most content. Investing in local content production within these countries can further boost viewership and appeal to regional preferences.
4.  **Leverage Top Talent:** The talent impact analysis identified the directors and cast members with the most titles on the platform, such as Anupam Kher with 42 titles. Collaborating with these prolific individuals, or others with similar profiles, may attract a larger audience.
5.  **Explore Content from Emerging Markets:** While top contributing countries are important, analyze the growth trends from other countries to identify emerging markets with potential for content acquisition or co-production.
6.  **Analyze Content Duration and Rating Trends:** (Further analysis needed) Investigate if there are trends in the duration of popular movies or TV shows and how ratings correlate with viewership to inform production decisions.
7.  **Regularly Monitor Trends:** The streaming landscape is dynamic. Implement a system for regularly monitoring content, genre, and geographical trends to adapt the strategy accordingly.

These recommendations provide a starting point for refining Netflix's content strategy to remain competitive in the evolving streaming market.
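
The duration analysis suggested in recommendation 6 first requires turning the text in `Duration` (for example "93 min" or "4 Seasons") into numbers. The sketch below shows one way to do that on toy rows copied from the sample output; the new column names `Duration_Value` and `Duration_Unit` are hypothetical.

```python
import pandas as pd

# Toy rows using duration strings seen in the sample output above.
toy = pd.DataFrame({
    "Category": ["TV Show", "Movie", "Movie"],
    "Duration": ["4 Seasons", "93 min", "78 min"],
})

# Pull out the leading number and the unit word that follows it.
parts = toy["Duration"].str.extract(r"(?P<value>\d+)\s*(?P<unit>\w+)")
toy["Duration_Value"] = parts["value"].astype(int)

# Normalize the unit: lower-case and drop a trailing "s" ("Seasons" -> "season").
toy["Duration_Unit"] = parts["unit"].str.lower().str.rstrip("s")

# Median movie length in minutes.
is_minutes = toy["Duration_Unit"] == "min"
median_minutes = toy.loc[is_minutes, "Duration_Value"].median()
print(toy)
print("Median movie length (min):", median_minutes)
```

Movies and TV shows use different units, so they should be summarized separately rather than averaged together.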

## Finish task

### Subtask:
Summarize the findings and recommendations.

**Summary of Findings and Recommendations:**

This analysis of the Netflix dataset revealed several key trends and insights crucial for strategic content planning:

*   **Content Type:** There is a clear upward trend in the addition of both movies and TV shows, with a notable increase in TV shows in recent years.
*   **Genre Trends:** Dramas, International Movies, and Stand-Up Comedy consistently appear among the top genres, while other genres show varying growth patterns.
*   **Geographical Distribution:** The United States, India, and the United Kingdom are the top content-contributing countries, indicating strong production or acquisition efforts in these regions.
*   **Talent Impact:** The analysis identified the directors and cast members with the most titles on the platform, a measure of catalog volume rather than audience popularity.

Based on these findings, the following strategic recommendations are proposed:

*   Continue to prioritize investment in TV shows.
*   Diversify genre focus beyond traditionally popular categories.
*   Strengthen local content production in top contributing countries.
*   Leverage popular directors and cast members for future projects.
*   Explore content opportunities in emerging markets.
*   Conduct further analysis on content duration and rating correlations.
*   Establish a regular process for monitoring content trends to ensure the strategy remains agile.

By implementing these recommendations, Netflix can make data-driven decisions to optimize its content portfolio, attract and retain subscribers, and maintain a competitive edge in the global streaming market.