# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Rishabh Kumar Chauhan


# **Project Summary -**

In the highly competitive streaming entertainment industry, platforms like Amazon Prime Video must constantly evolve their content libraries to retain subscribers and attract new users. With thousands of titles added regularly, data-driven insights are essential for understanding audience preferences and optimizing content strategy.

This project focuses on performing Exploratory Data Analysis (EDA) on the Amazon Prime dataset to uncover trends in content production, regional availability, and viewer reception.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The core problem is to understand the composition and quality of the Amazon Prime Video library to identify gaps and opportunities. Specifically, the project aims to analyze the dataset of over 9,000 titles to answer four key questions:

Content Diversity: What genres and categories (Movies vs. TV Shows) dominate the platform?

Regional Availability: How does content distribution vary across different production countries?

Trends Over Time: How has Amazon Prime’s content library size and focus evolved over recent years?

Ratings & Popularity: What characterizes the highest-rated and most popular content on the platform?


#### **Define Your Business Objective?**

The primary objective is to derive data-driven insights that help stakeholders (content strategists and business leaders) optimize the platform's offerings to compete in the streaming market.

By analyzing this data, the goal is to uncover trends that directly influence:

Subscription Growth: Identifying what content drives new sign-ups.

User Engagement: Understanding what keeps current users watching.

Content Strategy: Making informed decisions on whether to invest in more movies, specific genres, or international content.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
credits_data = pd.read_csv("/content/credits.csv")
titles_data = pd.read_csv("/content/titles.csv")

### Dataset First View

In [None]:
# Dataset First Look
credits_data.head()

In [None]:
titles_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Titles Shape:", titles_data.shape)
print("Credits Shape:", credits_data.shape)

### Dataset Information

In [None]:
# Dataset Info
credits_data.info()

In [None]:
titles_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Titles duplicate rows ", titles_data.duplicated().sum())
print("Credits duplicate rows ", credits_data.duplicated().sum())

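# Drop duplicates; .copy() makes independent DataFrames so later column assignments do not raise SettingWithCopyWarning
titles_df = titles_data.drop_duplicates(subset=['id']).copy()
credits_df = credits_data.drop_duplicates().copy()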
print(titles_df.shape)
print(credits_df.shape)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Titles null count \n",titles_df.isnull().sum())
print("Credits null count \n",credits_df.isnull().sum())

In [None]:
# Visualizing the missing values

col = titles_df.columns
nullCount = titles_df.isnull().sum()
plt.figure(figsize=(12,6))
plt.bar(col, nullCount)
plt.xticks(rotation=90)
plt.title("Missing values  bar plot")
plt.show()

In [None]:
credits_df.isnull().sum().sort_values(ascending=False).plot(
    kind='bar',
    figsize=(12,6),
    title="Missing Values Count per Column"
)
plt.show()

### What did you know about your dataset?

Before performing Exploratory Data Analysis (EDA), we must understand the structure and granularity of our raw data. The dataset consists of two relational files representing the Amazon Prime Video catalog in the United States.

1. The Catalog (titles.csv): This dataset serves as the master list of content. It contains more than 9,000 unique titles. Each row represents a distinct Movie or TV Show.

Key Identifiers: id (Unique ID), title.

Content Features: type (Movie/TV Show), release_year, runtime, genres, production_countries.

Performance Metrics: imdb_score, imdb_votes, tmdb_popularity, tmdb_score.

Target Audience: age_certification (e.g., R, PG-13).

2. The Cast & Crew (credits.csv): This dataset details the talent behind the content. It contains more than 124,000 records, representing a "one-to-many" relationship where a single title is linked to multiple actors and directors.

Key Columns: person_id, name, character.

Role Type: The role column distinguishes between ACTOR and DIRECTOR.

Linkage: This file connects to the Titles dataset via the id column.

Initial Observations:

Data Types: The dataset is a mix of categorical (e.g., type), numerical (e.g., imdb_score), and text data.

Complex Columns: Columns like genres and production_countries appear to contain lists stored as strings (e.g., ['Drama', 'Action']), which will require parsing during the cleaning phase.

Missing Values: We anticipate null values in conditional columns such as seasons (which does not apply to movies) and character (which does not apply to directors).
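
The parsing step mentioned under "Complex Columns" can be sketched as follows. This is a minimal illustration on toy rows, assuming the list columns really are stored as strings like "['drama', 'comedy']"; the helper name parse_list_column and the sample values are hypothetical and not part of the original notebook.

```python
import ast

import pandas as pd

# Toy rows standing in for titles.csv; the values are illustrative only.
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "genres": ["['drama', 'comedy']", "['drama']", "[]"],
})


def parse_list_column(value):
    """Turn a string such as "['drama', 'comedy']" into a Python list."""
    if isinstance(value, list):
        return value  # already parsed
    if pd.isna(value):
        return []  # treat missing values as "no entries"
    return ast.literal_eval(value)


toy_titles["genres"] = toy_titles["genres"].apply(parse_list_column)

# One row per (title, genre); titles with an empty list become NaN and are dropped.
genres_long = toy_titles.explode("genres").dropna(subset=["genres"])
print(genres_long["genres"].value_counts())
```

Calling explode on the raw string column would leave every row unchanged, so the parsing has to happen first. The same helper could be applied to production_countries.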

## ***2. Understanding Your Variables***

### Variables Description

1. The Titles Dataset (titles.csv)
This dataset contains 15 columns capturing the metadata and performance metrics of the content.

  1. Identifiers & Text Data
  id: The unique identifier for the title on JustWatch. This is the Primary Key used to merge with the credits file.

  title: The official name of the movie or TV show.

  description: A short text synopsis or plot summary of the content.

  imdb_id: The unique identifier used on the IMDb website (e.g., "tt0111161").


  2. Categorical Variables (Groups)
  type: A binary classification distinguishing whether the title is a movie or a TV show.

  age_certification: The maturity rating (e.g., PG-13, R, TV-MA). This variable is critical for demographic analysis.

  genres: A list of genres associated with the title (e.g., ['Drama', 'Comedy']). Note: This is stored as a string representation of a list.

  production_countries: A list of country codes indicating where the content was produced (e.g., ['US', 'IN']).


  3. Numerical Variables (Metrics)
  release_year: The year the content was first released. Used for time-series trend analysis.

  runtime: The duration of the content in minutes. For movies, it is the total length; for TV shows, it is the episode length.

  seasons: The number of seasons available. Note: This is only applicable to TV Shows and is Null (NaN) for Movies.

  imdb_score: The user rating on IMDb (scale 1-10).

  imdb_votes: The total count of user votes on IMDb. This acts as a proxy for "viewer engagement".

  tmdb_score: The user rating on The Movie Database (TMDB).

  tmdb_popularity: A proprietary metric from TMDB measuring the current trending status of the title.

2. The Credits Dataset (credits.csv)
This dataset contains 5 columns representing the cast and crew.

person_id: The unique identifier for the person on JustWatch.

id: The Foreign Key that links this person to a specific title in the titles.csv file.

name: The real name of the actor or director.

character: The fictional name of the character played. Note: This is applicable only to actors and is Null for directors.

role: A categorical variable specifying the person's job: 'ACTOR' or 'DIRECTOR'.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
credits_df.nunique()

In [None]:
titles_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#### Datatype conversion

In [None]:
titles_df.info()

In [None]:
# Fill null values in the description column of titles
titles_df['description'] = titles_df['description'].fillna("Unknown")

# Fill null values in age_certification
titles_df['age_certification'] = titles_df['age_certification'].fillna("Unknown")

In [None]:
# Inspect the distinct values in age_certification
print(titles_df['age_certification'].unique())

# seasons should be an integer, but NaN values force it to float; fill with 0 (movies have no seasons) and cast
titles_df['seasons'] = titles_df['seasons'].fillna(0).astype(int)


# A missing imdb_id cannot be imputed, so mark it as "Unknown"
titles_df['imdb_id'] = titles_df['imdb_id'].fillna("Unknown")

# Treat missing imdb_votes as 0 votes
titles_df['imdb_votes'] = titles_df['imdb_votes'].fillna(0)

titles_df.info()

In [None]:
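# Directors do not play a character, so mark their character as '-'
credits_df.loc[credits_df['role'] == 'DIRECTOR', 'character'] = '-'
# Assign the result back instead of calling fillna(inplace=True) on a column selection, which newer pandas versions may not apply to the parent DataFrame
credits_df['character'] = credits_df['character'].fillna('Unknown')
credits_df.info()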

#### Merging both tables

In [None]:
# 1. Ensure the Key column 'id' is the same data type in both
titles_df['id'] = titles_df['id'].astype(str)
credits_df['id'] = credits_df['id'].astype(str)

# 2. Perform the Left Merge
merged_df = pd.merge(
    titles_df,
    credits_df,
    on='id',
    how='left'
)

# The number of rows should roughly match your CREDITS dataframe (approx 124k), because one movie matches multiple actors.
print(f"Titles Rows: {titles_df.shape[0]}")
print(f"Credits Rows: {credits_df.shape[0]}")
print(f"Merged Rows: {merged_df.shape[0]}")

merged_df.head()

### What all manipulations have you done and insights you found?

1. id is the key used to merge the two datasets, so it must have the same data type in both. The id column was therefore cast to string in both the titles and credits dataframes before merging.

2. Several columns of the titles dataset contain null values (seasons, description, imdb_votes, age_certification, imdb_id). Rather than imputing numerical values, seasons and imdb_votes were filled with 0, while description, age_certification, and imdb_id were filled with "Unknown".

3. In the credits dataset, character was set to "-" for directors and to "Unknown" for the remaining nulls.
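
The merge described above can be checked more strictly. The sketch below uses two toy tables (illustrative values, not the real data) and the validate and indicator parameters of pd.merge to confirm the one-to-many relationship and to list titles that have no credits.

```python
import pandas as pd

# Toy tables standing in for titles_df and credits_df (illustrative values).
toy_titles = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Actor One", "Director One", "Actor Two"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

merged = pd.merge(
    toy_titles,
    toy_credits,
    on="id",
    how="left",
    validate="one_to_many",  # raises MergeError if a title id is duplicated
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Titles with no credits survive the left merge with NaN in the credit columns.
titles_without_credits = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(merged.shape)
print(titles_without_credits)
```

Because a left merge keeps titles without any credits, the merged row count can be slightly higher than the credits row count.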

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Data prep
type_counts = titles_df['type'].value_counts()

# Plot
plt.figure(figsize=(6,6))
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff'])
plt.title('Content Distribution: Movies vs. TV Shows')
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.show()

##### 1. Why did you pick the specific chart?

To establish the baseline composition of the Amazon Prime library. A donut chart shows the split between the two content types (Movies and TV Shows) at a glance.

##### 2. What is/are the insight(s) found from the chart?

Movies significantly outnumber TV Shows (87/13 split).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here:-

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10,5))
sns.histplot(titles_df['imdb_score'].dropna(), bins=20, kde=True, color='purple')
plt.title('Distribution of IMDb Scores')
plt.xlabel('Score')
plt.show()

##### 1. Why did you pick the specific chart?

To understand the overall quality of the library. A histogram with a KDE overlay shows how IMDb scores are distributed across all titles.

##### 2. What is/are the insight(s) found from the chart?

IMDb scores follow a roughly normal distribution (bell curve) centered around 6.0 - 6.5.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: A left-skewed distribution (peak around 7-8, with the tail toward low scores) indicates a high-quality "prestige" library.

Negative Growth Risk: A right-skewed distribution, or a peak around 4-5, indicates the platform is adding low-quality "filler" content just to increase title counts. This damages brand reputation and trust.
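
The direction of skew can be checked numerically rather than by eye. In the usual statistical convention, a distribution whose peak sits at high scores with a tail toward low scores is left-skewed (negative skewness), and the opposite shape is right-skewed (positive skewness). A minimal sketch with toy scores (illustrative only, not taken from the dataset), using pandas Series.skew():

```python
import pandas as pd

# Toy score lists (illustrative only).
peak_high = pd.Series([3.0, 5.0, 6.5, 7.0, 7.5, 7.5, 8.0, 8.0, 8.5])  # tail toward low scores
peak_low = pd.Series([4.0, 4.0, 4.5, 4.5, 5.0, 5.0, 5.5, 7.0, 9.0])   # tail toward high scores

print(round(peak_high.skew(), 2))  # negative value: left-skewed
print(round(peak_low.skew(), 2))   # positive value: right-skewed
```

On the real data, the same call on the imdb_score column (after dropping nulls) would give a single number to report alongside the histogram.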

#### Chart - 3

In [None]:
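# Chart - 3 visualization code
import ast

# genres is stored as a string such as "['drama', 'comedy']"; parse it into a real list before exploding
genres_parsed = titles_df['genres'].apply(lambda g: ast.literal_eval(g) if isinstance(g, str) else g)
genres_exploded = titles_df.assign(genres=genres_parsed).explode('genres').dropna(subset=['genres']).reset_index(drop=True)

plt.figure(figsize=(12,6))
sns.countplot(y='genres', data=genres_exploded, order=genres_exploded['genres'].value_counts().index[:10], palette='viridis')
plt.title('Top 10 Genres on Amazon Prime')
plt.show()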

##### 1. Why did you pick the specific chart?

To identify the most common genres on the platform. A horizontal bar chart makes the ranked categories easy to compare.



##### 2. What is/are the insight(s) found from the chart?

Drama, Documentary, and Comedy usually dominate, while Romance or Horror may rank lower than expected.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: High volume in popular genres (Drama/Comedy) ensures mass appeal.

Negative Growth Risk: Oversaturation. If 60% of the library is Drama, users looking for Horror or Sci-Fi will leave for Netflix or Shudder. Insight: The business needs to diversify investment into underrepresented top-tier genres.
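
Whether a genre really approaches a threshold such as 60% of the library can be measured directly. The sketch below computes the share of titles carrying each genre on toy data (illustrative values only); it assumes genres has already been parsed into real lists.

```python
import pandas as pd

# Toy data: one list of genres per title (illustrative values only).
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4", "t5"],
    "genres": [["drama"], ["drama", "comedy"], ["horror"], ["drama"], ["comedy"]],
})

n_titles = toy_titles["id"].nunique()
genres_long = toy_titles.explode("genres")

# Share of titles carrying each genre; shares can sum to more than 1
# because a title may have several genres.
genre_share = (
    genres_long.groupby("genres")["id"]
    .nunique()
    .div(n_titles)
    .sort_values(ascending=False)
)
print(genre_share)
```

Dividing by the number of titles, not the number of exploded rows, keeps the figure interpretable as "percent of the library".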

#### Chart - 4

In [None]:
# Chart - 4 visualization code
content_quantity_by_time = titles_df.groupby('release_year')['id'].count()
print(content_quantity_by_time)
plt.figure(figsize=(12,6))
sns.lineplot(x=content_quantity_by_time.index, y=content_quantity_by_time )
plt.title('Content Quantity Over Time')
plt.xlabel('Year')
plt.ylabel('Content Quantity')
plt.xlim(1990, 2025)
plt.show()

##### 1. Why did you pick the specific chart?

To visualize the "recency" of the library, i.e. how the number of titles varies by release year. A line chart suits a trend over time.



##### 2. What is/are the insight(s) found from the chart?

The chart shows whether Amazon Prime focuses on classic content (older release years) or modern originals (a sharp spike after 2015).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: A sharp upward trend in recent years indicates an active, healthy platform acquiring new content.

Negative Growth Risk: If the line flattens or drops in the last 2 years, it signals a "content drought," which is a primary driver of subscription cancellations.
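
A flattening or drop in the line can be flagged with a year-over-year difference. This is a minimal sketch on toy release years (illustrative only). Note the assumptions: release_year reflects when a title was released, not when it joined the platform, and the most recent year in a dataset snapshot may be incomplete.

```python
import pandas as pd

# Toy release years (illustrative only).
toy_titles = pd.DataFrame({
    "id": [f"t{i}" for i in range(10)],
    "release_year": [2018, 2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021, 2022],
})

titles_per_year = toy_titles.groupby("release_year")["id"].count()

# Year-over-year change in title counts; a negative value marks a drop.
yoy_change = titles_per_year.diff()
declining_years = yoy_change[yoy_change < 0].index.tolist()
print(titles_per_year.to_dict())
print(declining_years)
```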

#### Chart - 5

In [None]:
# Chart - 5 visualization code
pivot_table = titles_df.pivot_table(index='release_year', columns='type', aggfunc='size', fill_value=0)
pivot_table = pivot_table.loc[2000:] # Filter for recent trends

pivot_table.plot(kind='area', stacked=True, figsize=(12,6), alpha=0.5)
plt.title('Evolution of Movies vs TV Shows (2000-Present)')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

To see whether the content strategy is shifting from movies toward TV series. A stacked area chart shows both the total volume and the mix over time.

##### 2. What is/are the insight(s) found from the chart?

The chart shows whether the platform has pivoted toward TV shows in recent years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Growing the "TV Show" layer (orange/blue area) suggests a strategy shift toward user retention (series engagement).

Negative Growth Risk: Stagnation in TV show production suggests the platform is relying too heavily on licensing old movies rather than creating new IP (Intellectual Property).

#### Chart - 6

In [None]:
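# Chart - 6 visualization code
import ast

# production_countries is stored as a string such as "['US', 'IN']"; parse it into a real list before exploding
countries_parsed = titles_df['production_countries'].apply(lambda c: ast.literal_eval(c) if isinstance(c, str) else c)
countries_exploded = titles_df.assign(production_countries=countries_parsed).explode('production_countries').dropna(subset=['production_countries']).reset_index(drop=True)

plt.figure(figsize=(10,6))
sns.countplot(y='production_countries', data=countries_exploded, order=countries_exploded['production_countries'].value_counts().index[:10], palette='magma')
plt.title('Top Content Producing Countries')
plt.show()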

##### 1. Why did you pick the specific chart?

To analyze the regional availability of content and the globalization of the catalog.

##### 2. What is/are the insight(s) found from the chart?

The US and India are the top two producing countries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: A strong showing from India (Bollywood) indicates successful penetration into high-population Asian markets.

Negative Growth Risk: If most of the content is US-only, the platform loses relevance in global markets (Europe/Asia/LatAm), limiting international revenue growth.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
order = ['G', 'PG', 'PG-13', 'R', 'NC-17', 'TV-Y', 'TV-G', 'TV-PG', 'TV-14', 'TV-MA']
plt.figure(figsize=(12,5))
sns.countplot(x='age_certification', data=titles_df, order=order)
plt.title('Content Distribution by Age Rating')
plt.show()

##### 1. Why did you pick the specific chart?

To identify which age groups the library targets.

##### 2. What is/are the insight(s) found from the chart?

Movie Ratings (MPA, Motion Picture Association)

* G (General Audiences): Suitable for all ages. No content that would offend parents.
* PG (Parental Guidance Suggested): Some material may not be suitable for children. Mild language or thematic elements.
* PG-13 (Parents Strongly Cautioned): Some material inappropriate for children under 13. Stronger language, violence, or themes.
* R (Restricted): Under 17 requires accompanying adult. Strong violence, language, sexuality, or drug use.
* NC-17 (Adults Only): No one 17 and under admitted. Explicit adult content (not necessarily pornographic).

TV Ratings (TV Parental Guidelines)

* TV-Y (All Children): Suitable for young children ages 2–6.
* TV-G (General Audience): Suitable for all ages; minimal or no objectionable content.
* TV-PG (Parental Guidance Suggested): Some material may be unsuitable for younger children. May include mild violence, language, or suggestive content.
* TV-14 (Parents Strongly Cautioned): Content inappropriate for children under 14. May include stronger violence or language.
* TV-MA (Mature Audience Only): For adults; not suitable for people under 17. Strong violence, sexual content, or explicit language.

The chart shows whether the platform is family-friendly (PG/TV-Y) or adult-oriented (R/TV-MA): most of the content is rated R, PG-13, or PG.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The number of R-rated (Restricted) movies is very high.

Positive: A balanced distribution attracts whole households (parents pay, kids watch).

Negative Growth Risk: Too much "R/TV-MA" content makes the platform unsuitable for families, limiting the "household" subscription model.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
target_ratings = ['G', 'PG', 'PG-13', 'R', 'TV-MA']
filtered_df = titles_df[titles_df['age_certification'].isin(target_ratings)]

plt.figure(figsize=(12,6))
sns.violinplot(x='age_certification', y='imdb_score', data=filtered_df, palette='muted')
plt.title('Quality Distribution (IMDb Score) by Age Certification')
plt.show()

##### 1. Why did you pick the specific chart?

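To check which audience segment (by age certification) gets the highest-quality content. A violin plot shows both the spread and the density of IMDb scores per rating.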

##### 2. What is/are the insight(s) found from the chart?

Often, "R" and "TV-MA" content has a wider spread but higher peaks in quality compared to "PG-13".

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive: Finding that "TV-MA" has high scores justifies investment in mature originals (like The Boys).

Negative Growth Risk: If "PG" content consistently has low scores, parents will unsubscribe because the kids' content is low quality.
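
The visual comparison across certifications can be backed by a small summary table. The sketch below uses toy rows (illustrative values only) and reports the count, median, and range of IMDb scores per age certification.

```python
import pandas as pd

# Toy titles (illustrative values only).
toy_titles = pd.DataFrame({
    "age_certification": ["PG", "PG", "R", "R", "R", "TV-MA", "TV-MA"],
    "imdb_score": [5.0, 6.0, 4.0, 7.0, 8.0, 7.5, 8.5],
})

# Count, median, and range of scores for each certification.
summary = (
    toy_titles.groupby("age_certification")["imdb_score"]
    .agg(["count", "median", "min", "max"])
)
print(summary)
```

Reporting the count next to the median helps avoid over-reading a certification that has only a handful of titles.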

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10,6))
sns.scatterplot(x='imdb_score', y='tmdb_popularity', data=titles_df, alpha=0.5)
plt.title('Relationship: Quality (IMDb) vs. Popularity (TMDB)')
plt.axhline(y=titles_df['tmdb_popularity'].mean(), color='r', linestyle='--') # Avg popularity
plt.show()

##### 1. Why did you pick the specific chart?

To test the assumption that popular titles are actually good, by plotting TMDB popularity against IMDb rating.



##### 2. What is/are the insight(s) found from the chart?

The chart shows that higher popularity does not mean a higher IMDb rating. The most popular titles sit in the middle of the score range, with only average IMDb ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Identifying "High Score / Low Popularity" gems allows the business to build recommendation engines that surface them to users.

Negative Growth Risk: Identifying "High Popularity / Low Score" titles indicates marketing money is being wasted on bad content that generates buzz but leaves users disappointed.
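
The two groups named above can be extracted with simple boolean masks. This is a minimal sketch on toy titles (illustrative values only); using the median of each metric as the cut-off is an assumption, and other thresholds may suit the real data better.

```python
import pandas as pd

# Toy titles (illustrative values only).
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [8.5, 4.0, 8.0, 5.0],
    "tmdb_popularity": [2.0, 90.0, 80.0, 1.0],
})

# Thresholds are assumptions: the median of each metric.
score_cut = toy_titles["imdb_score"].median()
pop_cut = toy_titles["tmdb_popularity"].median()

high_score = toy_titles["imdb_score"] > score_cut
high_pop = toy_titles["tmdb_popularity"] > pop_cut

# High score but low popularity: candidates to surface in recommendations.
hidden_gems = toy_titles.loc[high_score & ~high_pop, "title"].tolist()
# High popularity but low score: buzz that may disappoint viewers.
overhyped = toy_titles.loc[~high_score & high_pop, "title"].tolist()
print(hidden_gems)
print(overhyped)
```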

#### Chart - 10

In [None]:
# Chart - 10 visualization code
actors_df = merged_df[merged_df['role'] == 'ACTOR']
top_actors = actors_df['name'].value_counts().head(10)
print(top_actors)
plt.figure(figsize=(10,6))
sns.barplot(x=top_actors.values, y=top_actors.index, palette='copper')
plt.title('Top 10 Actors by Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

To analyze the "star power" available on the platform.

##### 2. What is/are the insight(s) found from the chart?

The actor list is diverse: no single actor has appeared in a dramatically larger number of titles than the other top actors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive: A diverse list of actors implies a broad library.

Negative Growth Risk: If the list is dominated by obscure actors, it implies the platform lacks "Blockbuster" power, which is the primary driver for new user acquisition.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# 1. Filter for Directors
directors_only = merged_df[merged_df['role'] == 'DIRECTOR']

# 2. Group by Name and calculate Mean Score + Count
director_stats = directors_only.groupby('name')['imdb_score'].agg(['mean', 'count'])

# 3. Filter: Keep only directors with at least 5 titles (to ensure consistency)
consistent_directors = director_stats[director_stats['count'] >= 5]

# 4. Sort and take top 10
top_directors = consistent_directors.sort_values(by='mean', ascending=False).head(10)

# 5. Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=top_directors['mean'], y=top_directors.index, palette='magma')
plt.title('Top 10 Directors (Min. 5 Titles) by Average IMDb Score')
plt.xlabel('Average IMDb Score')
plt.xlim(0, 10) # Fix x-axis 0-10 for fairness
plt.show()

##### 1. Why did you pick the specific chart?

To rank directors by the average IMDb score of their titles, restricted to directors with at least 5 titles.

##### 2. What is/are the insight(s) found from the chart?

The chart identifies the directors who consistently deliver high-quality content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These are the directors Amazon Prime should sign for exclusive multi-year deals, as they are proven "quality guarantees."

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# 1. Get the names of the Top 5 Actors
actors_only = merged_df[merged_df['role'] == 'ACTOR']

top_5_actor_names = actors_only['name'].value_counts().head(5).index

# 2. Filter the main dataframe to only include these 5 people
top_5_data = actors_only[actors_only['name'].isin(top_5_actor_names)]
print(top_5_data)

# 3. Plot Boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(data=top_5_data, x='name', y='imdb_score', palette='coolwarm')
plt.title('Quality Consistency of Top 5 Actors')
plt.ylabel('IMDb Score of their Movies')
plt.show()

##### 1. Why did you pick the specific chart?

To compare how consistent the IMDb ratings are across the titles of the five most prolific actors.

##### 2. What is/are the insight(s) found from the chart?

Small Box: The actor is very consistent (always 6.0 - 7.0).

Large Box / Long Whiskers: The actor is "hit or miss" (some movies are 9.0, some are 3.0).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Consistent actors are safer bets for marketing campaigns. Inconsistent actors are high-risk/high-reward.
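
The "small box" versus "large box" reading can be quantified with the interquartile range (the height of the box) and the standard deviation. A minimal sketch on toy actor/score pairs (illustrative values only); the iqr helper is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy actor/score pairs (illustrative values only).
toy_credits = pd.DataFrame({
    "name": ["Actor A"] * 4 + ["Actor B"] * 4,
    "imdb_score": [6.0, 6.5, 6.5, 7.0, 3.0, 5.0, 8.0, 9.0],
})


def iqr(scores):
    """Interquartile range: the height of the box in a boxplot."""
    return scores.quantile(0.75) - scores.quantile(0.25)


# Lower iqr/std means more consistent ratings across an actor's titles.
consistency = (
    toy_credits.groupby("name")["imdb_score"]
    .agg(["median", "std", iqr])
    .sort_values("iqr")
)
print(consistency)
```

Sorting by iqr puts the most consistent actor first, which mirrors picking the smallest box in the plot.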



## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. Shift Strategy from "Volume" to "Retention" (TV Shows)
Observation: The library is heavily skewed toward Movies (roughly an 87/13 split in Chart 1). While movies attract new users, they are "one-off" experiences.

Suggestion: Drastically increase investment in TV Shows (Series). Series create "habitual viewing" (binge-watching), which is the strongest driver of monthly retention. A user watching a 5-season show is a guaranteed subscriber for multiple months, unlike a user watching one movie.

2. Diversify Beyond "Drama" and "Comedy"
Observation: If the genre analysis (Chart 3) shows an over-saturation of Drama/Comedy, the platform risks becoming generic.

Suggestion: Invest in high-engagement "niche" genres like Sci-Fi, Fantasy, or True Crime. These genres have the most dedicated fanbases who will subscribe specifically for that content (e.g., The Boys or Lord of the Rings). Filling these gaps attracts underserved audiences that Netflix or Disney+ might be ignoring.

3. "Glocal" Content Strategy (Global + Local)
Observation: The country analysis (Chart 6) shows US and Indian content as the top two; if they dominate the library, the platform is missing large markets in Europe and Latin America.

Suggestion: Adopt a "glocal" approach. Don't just license Hollywood movies; produce Local Originals in regions like South Korea, Spain, and Brazil. This strategy (proven by Netflix's Squid Game) captures local markets while providing distinctive content for global viewers.

4. Quality Control over Quantity (The "Cleaning" Strategy)
Observation: If the ratings distribution (Chart 2) shows a "long tail" of low-rated content (IMDb scores < 5.0), it dilutes the brand.

Suggestion: Purge the library of low-quality "filler" content. A smaller, high-quality library is more valuable than a massive library full of bad movies. This improves the "Signal-to-Noise" ratio, helping users find good things to watch faster and reducing "choice paralysis."

5. Leverage "Hidden Gem" Directors
Observation: The director chart (Chart 11) identifies directors who consistently score high; some of them may not be A-list names.

Suggestion: Sign exclusive First-Look Deals with these specific directors. It is cheaper than hiring Steven Spielberg but yields statistically reliable high-quality content. This is a "Moneyball" approach to content acquisition—using data to find undervalued talent.