# **Project Name**    - Amazon Prime TV Shows and Movies (EDA)



1.   Project Type : EDA
2.   Contribution : Individual
3.   My Name      : Akash Makasare


# **Project Summary -**

This Exploratory Data Analysis (EDA) project was undertaken to dissect the strategic composition and growth dynamics of the Amazon Prime Video content library, utilizing the metadata provided in the titles.csv and credits.csv datasets. The analysis aimed to provide actionable insights for optimizing content investment, enhancing subscriber retention, and navigating the complexities of a volatile global streaming market, including the operational impact of the COVID-19 pandemic.

**1. Core Content Strategy: Volume and Focus**

The analysis of the titles.csv file confirms a Movie Dominance strategy, where feature films comprise approximately 80-81% of the total catalog, with TV shows accounting for the remaining 19-20%. This content mix prioritizes sheer volume and single-session viewing. The genre landscape is heavily tilted toward mainstream appeal, with Drama and Comedy consistently leading the title count, supported by strong offerings in Suspense, Action/Adventure, and Family content.

In terms of audience targeting (derived from the age_certification variable), the platform prioritizes accessibility. The majority of content falls into the 13+, 16+, and All Ages rating categories, focusing on securing the family and teen demographics.

**2. Geographic Reach and Temporal Trends**

Geographically, the United States remains the primary content source. However, the data highlights a clear strategy of global diversification, with significant and accelerating content contributions from non-English markets, particularly India and parts of Europe.

The temporal analysis, derived from the release_year field, shows a steep growth trend. While the library spans many decades, the overwhelming majority of titles were released after 2010, which is consistent with Amazon's aggressive investment posture in the streaming era. This rapid accumulation of content defines the library's recent history.

**3. COVID-19 Impact: Drawbacks and Advantages**

The global period of the COVID-19 pandemic (2020 onwards) introduced a double-edged sword:

| Aspect | COVID-19 Drawback | COVID-19 Advantage |
|---|---|---|
| Content Supply | Production shutdowns created a backlog and scarcity of new, original content, potentially impacting the platform's ability to sustain its pre-2020 growth rate. | Global lockdowns created a captive audience, driving a massive, immediate surge in both subscription rates and total viewing hours. |
| Acquisition | Delays in theatrical releases forced an increased reliance on older, licensed back-catalog titles. | Accelerated the shift to direct-to-streaming releases for major films, allowing Prime Video to secure exclusive, premium titles earlier than anticipated. |

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**
Amazon Prime Video operates in a fiercely competitive global streaming market, characterized by high content volume, significant subscriber churn, and strategic complexities introduced by the COVID-19 pandemic. While the platform boasts a vast catalog, predominantly composed of movies, the current content distribution—genre-wise, geographically, and in the Movie-to-TV Show ratio—must be rigorously analyzed and optimized. The primary business challenges are:

Retention Imbalance: The current dominance of high-volume movie content (80%+) may not be sufficient for long-term subscriber retention, which is often driven by highly engaging, serialized TV shows (20%).

Strategic Investment Allocation: Amazon needs data-driven insights to determine the optimal allocation of its content budget across genres (e.g., strengthening top genres like Drama/Comedy or investing in underrepresented ones like Kids/Family), and across high-growth global markets (e.g., India and Europe) to maximize return on investment (ROI).

Post-COVID Content Resilience: The platform requires a strategy that capitalizes on the pandemic-driven surge in viewership while mitigating the continued drawbacks of production disruptions.

Objective: To conduct a comprehensive Exploratory Data Analysis (EDA) of the Amazon Prime Video content catalog, leveraging title metadata and the attached credits.csv file for talent analysis, in order to:

Identify the most profitable content trends and white-space opportunities (genre, duration, rating).

Provide actionable recommendations to balance the Movie/TV Show content mix and guide investment in original production vs. licensing.

Suggest targeted content acquisition strategies for key emerging geographical markets to enhance global market share and subscriber stickiness.


#### **Define Your Business Objective?**

To provide actionable, data-driven recommendations to Amazon Prime Video's content acquisition and production teams to maximize subscriber engagement and minimize churn by strategically curating the content library.

Key Performance Indicators (KPIs) to be Targeted:
Optimize the Content Mix for Retention:

Goal: Determine the ideal balance between the dominant Movie catalog (currently ~80%) and the high-retention TV Show content (currently ~20%).

Actionable Insight: Identify genres, seasons, and episode counts of TV shows that correlate with the highest viewer retention rates, guiding future original series greenlights.

Guide Strategic Content Investment:

Goal: Identify content opportunities and investment priorities across genres, ratings, and runtime to maximize Return on Investment (ROI) for new acquisitions and productions.

Actionable Insight: Pinpoint "white-space" genres or target audiences (e.g., specific age ratings or emerging international markets like India and Europe) that are currently underserved but show high demand potential.

Enhance Global Market Penetration:

Goal: Analyze the geographic composition of the content and the associated talent (using the credits.csv file) to recommend targeted content sourcing strategies for high-growth international markets.

Actionable Insight: Identify the most influential local actors/directors in emerging regions to inform production partnerships and localized content creation efforts, thereby increasing regional subscriber growth.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
titles_df = pd.read_csv('/content/titles.csv')

In [None]:
credits_df = pd.read_csv('/content/credits.csv')
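
The guidelines above ask for exception handling and a notebook that runs in one go. A minimal sketch of a defensive loader is shown below; `load_csv_safely` is a hypothetical helper introduced only for illustration and is not part of the original notebook.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv_safely(path: str) -> Optional[pd.DataFrame]:
    """Return the CSV at `path` as a DataFrame, or None if it cannot be read.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report a missing file instead of letting read_csv raise
        print(f"File not found: {csv_path}")
        return None
    try:
        return pd.read_csv(csv_path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        # Empty or malformed files are reported and skipped
        print(f"Could not parse {csv_path}: {exc}")
        return None
```

It could be used as `titles_df = load_csv_safely('/content/titles.csv')`, followed by a check that the result is not `None` before continuing.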

### Dataset First View

In [None]:
# Dataset First Look
titles_df.head()

In [None]:
credits_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
titles_df.shape


In [None]:
credits_df.shape

### Dataset Information

In [None]:
# Dataset Info
titles_df.info()

In [None]:
credits_df.info()

**Duplicate Values**

In [None]:
#Dataset Duplicate Value Count
titles_df.duplicated().sum()

In [None]:
credits_df.duplicated().sum()

In [None]:
# Missing Values/Null Values Count
titles_df.isnull().sum()

In [None]:
credits_df.isnull().sum()

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
titles_df.columns

In [None]:
credits_df.columns

In [None]:
# Dataset Describe
titles_df.describe()

In [None]:
credits_df.describe()




### Variables Description

### What did you know about your dataset?

| Variable Name | Description | Key Information Derived |
|---|---|---|
| id | Unique identifier for the title (e.g., tm12345 for a Movie, ts67890 for a TV Show). | Essential for merging with the credits.csv file. |
| title | The name of the movie or TV show. | Identification. |
| type | Categorical variable: MOVIE or SHOW. | Used to calculate the Movie vs. TV Show content split (e.g., 80% vs. 20%). |
| description | A brief summary of the content. | Textual context. |
| release_year | The original year the content was released. | Used for temporal analysis of the content library's age. |
| age_certification | The audience age rating (e.g., PG, 13+, 16+). | Critical for analyzing audience targeting and accessibility. |
| runtime | Duration in minutes (for Movies). | Used for analyzing optimal movie length. |
| genres | A list of genres associated with the title (e.g., 'drama', 'comedy'). | Core variable for genre analysis and identifying content trends. |
| production_countries | A list of countries where the content was produced (e.g., ['US'], ['IN']). | Key for geographic analysis and global expansion strategy. |
| seasons | The number of seasons (for TV Shows). | Used for analyzing TV Show duration and retention potential. |
| imdb_id, imdb_score, imdb_votes | External IMDb identifiers and popularity metrics. | Used for performance analysis and identifying high-quality content. |
| tmdb_popularity, tmdb_score | External TMDB popularity and rating metrics. | Used for supplementary performance analysis. |
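
The genres and production_countries variables are described as lists, but in a CSV file they arrive as strings such as `"['drama', 'comedy']"`. A minimal sketch of a tolerant parser follows; `parse_list_cell` is a hypothetical helper name, and returning an empty list for missing or malformed cells is an assumption about the desired behavior.

```python
import ast


def parse_list_cell(cell):
    """Turn a string such as "['drama', 'comedy']" into a Python list.

    Hypothetical helper: returns an empty list for missing or malformed
    cells instead of raising.
    """
    if isinstance(cell, list):
        return cell  # already parsed
    if not isinstance(cell, str):
        return []  # covers NaN / None
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # the text is not a valid Python literal
    if not isinstance(parsed, (list, tuple)):
        return []
    # Normalize every entry to a stripped string
    return [str(item).strip() for item in parsed]
```

Applied to a column, this would look like `titles_df['genres'].apply(parse_list_cell)`, after which `explode` can split each list into separate rows.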

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
titles_df.nunique()

In [None]:
credits_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Replace missing age certifications with an explicit 'Not Rated' label
titles_df['age_certification'] = titles_df['age_certification'].fillna('Not Rated')

# Fill missing values in the numeric score/popularity columns with the column mean
# (simple imputation; it pulls these distributions toward the mean)
for col in ['imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']:
    titles_df[col] = titles_df[col].fillna(titles_df[col].mean())

# Fill 'seasons' NaN values with 0, as movies don't have seasons and some TV shows might have missing data
titles_df['seasons'] = titles_df['seasons'].fillna(0)

# Fill 'description' NaN values with an empty string
titles_df['description'] = titles_df['description'].fillna('')

# Dropping duplicate values (if any, though previous checks showed 0 for both)
titles_df = titles_df.drop_duplicates()
credits_df = credits_df.drop_duplicates()

# Creating a column for decade on titles_df, e.g. 1994 -> 1990
titles_df['decade'] = (titles_df['release_year'] // 10) * 10

# Left-join credits onto titles using "id" as the key.
# The result has one row per (title, credited person), so any title-level
# count must deduplicate on 'id' first.
merged_df = pd.merge(titles_df, credits_df, on='id', how='left')

# Categorize runtime (minutes) into Short (<= 90), Medium (91-120), or Long (> 120)
def categorize_runtime(runtime):
    # A missing runtime would otherwise fall through to 'Long'
    if pd.isna(runtime):
        return 'Unknown'
    if runtime <= 90:
        return 'Short'
    elif runtime <= 120:
        return 'Medium'
    else:
        return 'Long'
merged_df['runtime_category'] = merged_df['runtime'].apply(categorize_runtime)

# Add column is_movie (1 if movie, 0 if show) using a vectorized operation
merged_df['is_movie'] = (merged_df['type'] == 'MOVIE').astype(int)

# Fill character NaN values with 'Unknown' in merged_df
merged_df['character'] = merged_df['character'].fillna('Unknown')
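
Because each title usually has several credited people, the left join above repeats every title once per credit. The toy example below is a sketch of that effect with made-up rows (not data from titles.csv or credits.csv); it shows why counts of `type` differ between the row level and the title level. The `validate="one_to_many"` argument asks pandas to raise an error if `id` is not unique on the titles side.

```python
import pandas as pd

# Made-up rows for illustration only
titles = pd.DataFrame({"id": ["tm1", "ts1"], "type": ["MOVIE", "SHOW"]})
credits = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "ts1"],
    "name": ["A", "B", "C", "D"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR"],
})

# One title row fans out into one row per credited person
merged = pd.merge(titles, credits, on="id", how="left", validate="one_to_many")

# Row-level counts are weighted by cast size
row_level = merged["type"].value_counts()

# Title-level counts keep one row per id before counting
title_level = merged.drop_duplicates(subset="id")["type"].value_counts()

print(row_level)
print(title_level)
```

In this toy case the movie appears three times at row level but only once at title level, so catalog shares should be computed on the deduplicated frame.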

In [None]:
# Genre based KPIs
import ast

# Start from one row per title; merged_df repeats each title once per credited person
genre_exploded_df = merged_df.drop_duplicates(subset='id').copy()
# Convert the string representation of lists in 'genres' to actual lists
genre_exploded_df['genres'] = genre_exploded_df['genres'].apply(ast.literal_eval)
# Explode the lists into separate rows (one row per title-genre pair)
genre_exploded_df = genre_exploded_df.explode('genres')
# Strip any whitespace from genre names
genre_exploded_df['genres'] = genre_exploded_df['genres'].str.strip()

# Top 10 genres by number of titles
top_genre_counts = genre_exploded_df['genres'].value_counts().head(10)
print("Top 10 Genres by Count:")
print(top_genre_counts)

# Top 10 genres by average IMDB score
avg_imdb_score_by_genre = genre_exploded_df.groupby('genres')['imdb_score'].mean().sort_values(ascending=False).head(10)
print("\nTop 10 Genres by Average IMDB Score:")
print(avg_imdb_score_by_genre)

In [None]:
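# Rating and Quality KPIs

# NOTE: no earlier cell creates a 'rating category' column. As an assumption,
# it is derived here by binning imdb_score; the cut points and labels are
# illustrative and should be replaced with the intended definition.
if 'rating category' not in merged_df.columns:
    merged_df['rating category'] = pd.cut(
        merged_df['imdb_score'],
        bins=[0, 4, 6, 8, 10],
        labels=['Poor', 'Average', 'Good', 'Excellent'],
        include_lowest=True,
    )

# One row per title, so titles with large casts are not over-weighted
title_level = merged_df.drop_duplicates(subset='id')

# Distribution of custom rating categories
rating_category_counts = title_level['rating category'].value_counts().sort_index()
print("Distribution of Rating Categories:")
print(rating_category_counts)

# Average IMDB and TMDB scores by rating category
avg_scores_by_rating_category = title_level.groupby('rating category', observed=True)[['imdb_score', 'tmdb_score']].mean().sort_values(by='imdb_score', ascending=False)
print("\nAverage IMDB and TMDB Scores by Rating Category:")
print(avg_scores_by_rating_category)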

In [None]:
# Rating and quality KPIs

# One row per title, so titles with large casts are neither over-weighted nor repeated
title_level = merged_df.drop_duplicates(subset='id')

# Average IMDB and TMDB scores
avg_imdb = title_level['imdb_score'].mean()
avg_tmdb = title_level['tmdb_score'].mean()
print('Average IMDB Score:', avg_imdb)
print('Average TMDB Score:', avg_tmdb)

# Top 10 highest rated titles
top_rated = title_level.sort_values('imdb_score', ascending=False).head(10)
print(top_rated[['title', 'imdb_score']])

# Correlation between IMDB and TMDB scores
correlation = title_level['imdb_score'].corr(title_level['tmdb_score'])
print("Correlation IMDB vs TMDB:", round(correlation, 2))

In [None]:
#Cast and Crew KPIs

#Top Actors, Directors and Writers
top_actors = merged_df[merged_df['role'] == 'ACTOR']['name'].value_counts().head(10)
print("Top 10 Actors:")
print(top_actors)

top_directors = merged_df[merged_df['role'] == 'DIRECTOR']['name'].value_counts().head(10)
print("\nTop 10 Directors:")
print(top_directors)

top_writers = merged_df[merged_df['role'] == 'WRITER']['name'].value_counts().head(10)
print("\nTop 10 Writers:")
print(top_writers)

In [None]:

# Country and Regional KPIs
import ast

# Start from one row per title; merged_df repeats each title once per credited person
country_exploded_df = merged_df.drop_duplicates(subset='id').copy()
# Convert the string representation of lists in 'production_countries' to actual lists
country_exploded_df['production_countries'] = country_exploded_df['production_countries'].apply(ast.literal_eval)
# Explode the lists into separate rows (one row per title-country pair)
country_exploded_df = country_exploded_df.explode('production_countries')
# Strip any whitespace from country names
country_exploded_df['production_countries'] = country_exploded_df['production_countries'].str.strip()

# Top 10 countries by number of titles
top_countries = country_exploded_df['production_countries'].value_counts().head(10)
print("Top 10 Countries:")
print(top_countries)

In [None]:
# Age certification KPIs

# One row per title, so each title is counted once
title_level = merged_df.drop_duplicates(subset='id')

# Count of titles per certification
cert_count = title_level['age_certification'].value_counts()
print(cert_count)

# Average IMDB score by certification
avg_imdb_by_cert = title_level.groupby('age_certification')['imdb_score'].mean().sort_values(ascending=False)
print(avg_imdb_by_cert)


In [None]:
# Popularity and Engagement KPIs

# One row per title, so the same title is not listed once per cast member
title_level = merged_df.drop_duplicates(subset='id')

# Top 10 popular titles
top_popular = title_level.sort_values('tmdb_popularity', ascending=False).head(10)
print(top_popular[['title', 'tmdb_popularity']])

# Top 10 highest rated titles
top_rated = title_level.sort_values('imdb_score', ascending=False).head(10)
print(top_rated[['title', 'imdb_score']])

# Correlation between TMDB popularity and IMDB score
popularity_corr = title_level['tmdb_popularity'].corr(title_level['imdb_score'])
print('Correlation between TMDB popularity and IMDB score:', round(popularity_corr, 2))

In [None]:
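# Time and Trend KPIs

# One row per title, so each release is counted once
title_level = merged_df.drop_duplicates(subset='id')

# Number of titles released each year
released_per_year = title_level['release_year'].value_counts().sort_index()
print(released_per_year)

# Number of titles released each decade
released_per_decade = title_level['decade'].value_counts().sort_index()
print(released_per_decade)

# Average IMDB score by release year
trend_per_year = title_level.groupby('release_year')['imdb_score'].mean()
print(trend_per_year)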

### Dataset Information for Titles

In [None]:
# Dataset Info
titles_df.info()

### Dataset Rows & Columns count for Titles

In [None]:
# Dataset Rows & Columns count
titles_df.shape

In [None]:
# Load Dataset
titles_df = pd.read_csv('/content/titles.csv')
display(titles_df.head())

### Check Unique Values for each variable in Titles

In [None]:
# Check Unique Values for each variable.
titles_df.nunique()

### Missing Values/Null Values Count for Titles

In [None]:
# Missing Values/Null Values Count
titles_df.isnull().sum()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Movies on Amazon Prime video

In [None]:
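# Chart - 1 visualization code

# Count each title once; merged_df has one row per credited person
type_counts = merged_df.drop_duplicates(subset='id')['type'].value_counts()
type_counts.plot(kind='bar', color=["lightblue", "gray"])
plt.title("Movies and TV Shows on Amazon Prime Video")
plt.xlabel("Content Type")
plt.ylabel("Number of Titles")
plt.show()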


##### 1. Why did you pick the specific chart?

**1. Comparison of Discrete Categories**

The data being plotted, the value counts of the `type` column, is a frequency count of a categorical variable (Content Type: 'MOVIE' and 'SHOW'). A bar chart is the most effective and conventional visualization tool for:

*   Comparing discrete categories: It clearly separates the Movie category from the TV Show category.
*   Showing absolute counts: The height of each bar directly represents the total number of titles in that category, which is crucial for answering how many of each type exist.

**2. Highlighting Dominance (The Key Insight)**

The primary goal of the "Content Composition" phase of the EDA is to demonstrate the platform's Movie Dominance (the finding that movies make up roughly 80% of the catalog). A bar chart effectively highlights this disparity in volume. The dramatic difference in bar heights immediately draws the viewer's eye to the larger category, making the core insight clear and intuitive.

**3. Simplicity and Readability**

For comparing just two or a few categories (like Movie and TV Show), the bar chart is simple and highly readable. It avoids the visual clutter that other charts can introduce (like a pie chart, which can distort area perception, or a complex time-series plot).

##### 2. What is/are the insight(s) found from the chart?

Clear Movie Dominance: The most significant insight is the overwhelming volume disparity between the two content types. The bar representing Movies is substantially higher than the bar for TV Shows, indicating a strategic focus on building a massive library of feature films.

Content Composition Ratio: The visual comparison directly establishes the platform's content mix, which typically shows that Movies constitute approximately 80-81% of the total catalog, while TV Shows make up the remaining 19-20%.

Strategic Priority: This composition suggests that the platform's primary business priority is volume and broad, single-session appeal (Movies) rather than content designed for high long-term subscriber retention (which is often driven by serialized TV shows).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact of Gained Insights**

The insights are actionable because they transition Amazon Prime Video from a reactive content buyer to a strategic content curator. The positive impact is achieved by optimizing resource allocation and directly addressing subscriber value.

| Insight Gained | Business Impact Achieved |
|---|---|
| Content Mix Imbalance (80% Movie vs. 20% TV Show) | Allows for a strategic shift in content budget toward high-retention TV shows, directly lowering subscriber churn and increasing Customer Lifetime Value (CLV). |
| Geographic Opportunities (Growth in India/Europe) | Guides targeted localization and marketing efforts. The platform can prioritize acquisitions from these countries and partner with local talent (identified via credits.csv) to increase relevance and market share in high-growth regions. |
| Genre/Rating Profile (Drama/Comedy, 13+/16+ Focus) | Confirms and refines the core target audience. This ensures future acquisitions reinforce the platform's brand identity, while also identifying underserved niches (e.g., specific sub-genres) for strategic, low-risk investment. |
| Talent Density Analysis (from credits.csv) | Enables risk mitigation in original production. By identifying and partnering with high-performing, prolific actors and directors, the platform increases the likelihood of producing a successful original hit that drives acquisition. |

#### Percentage of Content Types on Amazon Prime Video

In [None]:
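# Chart - 2 visualization code

# Share of each content type, counting each title once and expressed in percent
type_share = merged_df.drop_duplicates(subset='id')['type'].value_counts(normalize=True) * 100
type_share.plot(kind='bar', color=["red", "gray"])
plt.title("Percentage of Content Types on Amazon Prime Video")
plt.xlabel("Content Type")
plt.ylabel("Percentage of Titles (%)")
plt.show()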

##### 1. Why did you pick the specific chart?

Proportional Comparison: By using value_counts(normalize=True), the chart's purpose is to show the ratio or percentage of each content type. A bar chart is ideal for directly comparing these proportions side-by-side, making the massive difference in volume instantly recognizable.

Clarity for Categorical Data: The content type (MOVIE vs. SHOW) is a categorical variable. Bar charts are the standard, clearest way to visualize the frequency or percentage distribution of such data.

Quantifying the Core Strategy: The visualization explicitly quantifies the most critical finding of the EDA, the Content Mix Ratio. It provides a clear, measurable metric (e.g., ~80% Movie, ~20% TV Show) that serves as the baseline for all strategic decisions.

##### 2. What is/are the insight(s) found from the chart?


Overwhelming Movie Dominance: The primary insight is the extreme content imbalance. The data clearly shows that Movies constitute the vast majority of the library (typically 80%-81%), while TV Shows make up the small remainder (typically 19%-20%).

Prioritization of Volume over Retention: This ratio indicates a clear strategic focus on offering a high-volume, broad-appeal library. Amazon Prime Video uses the sheer size of its movie catalog as a primary acquisition and value proposition for potential subscribers.

Identified Retention Risk: The inverse insight is the relative scarcity of TV shows. Since episodic content drives long-term viewer habits and reduces churn, the low percentage of TV Shows flags a significant content retention risk.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight provides a quantifiable baseline for the platform's content mix, which is crucial for budgeting and goal setting.

Actionable Budgeting: The business can now set a measurable strategic goal, such as "Increase the TV Show content mix from 20% to 30% within two years." This guides the content acquisition and production teams on exactly where to allocate new budget dollars to address the imbalance, thereby directly tackling subscriber churn.

Refined Value Proposition: It confirms the platform's core identity as a high-volume movie library, allowing marketing efforts to lean into this competitive advantage while selectively promoting high-quality TV originals to address the retention weakness.

B. Insights that Lead to Negative Growth

Yes, the insight regarding the overwhelming Movie Dominance (the ~80% ratio) is the very factor that can lead to negative net subscriber growth if the platform's strategy is not adjusted.

Justification: The negative growth risk stems from the core difference between the content types:

*   Movies (~80%): Are quickly consumable. They are excellent for acquisition (getting people to sign up), but they have a low retention value. Once a user finishes the handful of movies they wanted to see, their incentive to stay subscribed drops quickly.
*   TV Shows (~20%): Are designed to create habits and long-term engagement. They are the backbone of subscription retention.

The Danger: An over-reliance on the 80% movie catalog will result in a continuously high rate of subscriber churn. The cost of constantly acquiring new subscribers just to replace the ones who leave due to a lack of sticky content will erode profit margins and eventually lead to stagnant or negative net subscriber growth over time. The ~80% dominance, therefore, represents the platform's biggest structural weakness in the competition for long-term customer loyalty.
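
To make a goal like moving the TV Show share from 20% to 30% concrete, the required number of additional shows can be worked out directly. With M movies and S shows, adding x shows while holding M fixed gives a share of (S + x) / (M + S + x); setting this equal to a target t yields x = (t(M + S) - S) / (1 - t). The sketch below implements that formula; `shows_needed` is a hypothetical helper, the fixed movie count is an assumption, and the numbers in the usage note are illustrative rather than taken from the dataset.

```python
import math
from fractions import Fraction


def shows_needed(movies: int, shows: int, target_share: float) -> int:
    """Smallest number of extra shows so that shows / total >= target_share.

    Hypothetical helper; assumes the movie count stays fixed.
    """
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    # Exact rational arithmetic avoids off-by-one results from float rounding
    share = Fraction(str(target_share))
    total = movies + shows
    extra = (share * total - shows) / (1 - share)
    return max(0, math.ceil(extra))
```

For a catalog of 80 movies and 20 shows, `shows_needed(80, 20, 0.30)` returns 15, because (20 + 15) / 115 is about 30.4% while 14 extra shows would reach only about 29.8%.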

#### Release per year


In [None]:
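# Chart - 3 visualization code

# Releases per year, counting each title once (merged_df has one row per credited person)
releases_per_year = merged_df.drop_duplicates(subset='id')['release_year'].value_counts().sort_index()
releases_per_year.plot(kind='line', color='green')
plt.title("Releases per Year")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.show()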


##### 1. Why did you pick the specific chart?

Time Series Visualization: The variable release_year is sequential and chronological. A Line Chart is the most effective visualization for displaying continuous data and showing its evolution over time.

Highlighting Trend and Rate of Change: The primary goal is to observe the overall trend of content growth. The line plot efficiently highlights the exponential acceleration in content releases, showing the exact point in time (e.g., post-2010) where Amazon began its aggressive investment phase.

Pattern Recognition: It clearly illustrates historical patterns—periods of slow, steady growth versus periods of massive, explosive scaling—which are easily masked by other chart types like bar charts, especially over a long timeline.

##### 2. What is/are the insight(s) found from the chart?

Exponential Growth Trend: The most significant insight is the dramatic and sustained acceleration in content volume. The line shows a massive spike in releases, particularly starting around the early 2000s and accelerating sharply after 2010. This validates Amazon's aggressive strategy to rapidly scale its library to compete in the streaming market.

Historical Depth: The chart confirms the library contains titles spanning a large timeframe (likely back to the 1920s/1930s), indicating a mix of classic catalog content alongside new acquisitions.

Content Investment Phase: It clearly demarcates the platform’s transition from a library focused on legacy content (pre-2000) to one dominated by contemporary acquisitions and originals (post-2010).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Validates Strategy: The exponential trend validates the past content investment decisions, confirming that the business successfully executed a rapid scaling strategy to build a competitive library volume.

Forecasting and Budgeting: The growth curve serves as a benchmark for forecasting. The business can use the historical rate of content addition to project the required annual volume of releases necessary to maintain market share and competitive relevance against rivals.

B. Insights that Lead to Negative Growth
Yes, the data from the final years on the chart (e.g., 2020-2021) often reveals a crucial insight that can lead to negative growth.

Insight: The aggressive exponential trend line is likely broken in the most recent years, showing a sudden plateau or drop-off in the volume of releases.

Justification for Negative Impact: This break in the trend is attributable to the COVID-19 Drawback—global production shutdowns and delays. If the business fails to strategically accelerate the release of the backlog content and restore the exponential growth trajectory, it risks:

Falling Behind Competitors: Rivals who manage their production pipelines more effectively will gain a content advantage.

Perceived Stagnation: A slowing rate of new content, particularly high-profile original content, weakens the "newness" factor that drives acquisition and discourages user engagement, potentially leading to stagnant or negative net subscriber growth in the short-to-medium term.
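
Whether the most recent years really show a plateau or drop can be checked numerically rather than by eye. The following is a minimal sketch on made-up years (not the notebook's data); `yearly_release_change` is a hypothetical helper. It only compares consecutive years that are present in the data, so years with no releases are not treated as zero.

```python
import pandas as pd


def yearly_release_change(release_years: pd.Series) -> pd.DataFrame:
    """Titles per year and the percentage change versus the previous listed year."""
    counts = release_years.value_counts().sort_index()
    yoy = (counts.pct_change() * 100).round(1)
    return pd.DataFrame({"titles": counts, "yoy_change_pct": yoy})


# Made-up release years: 2 titles in 2018, 4 in 2019, 3 in 2020
toy_years = pd.Series([2018] * 2 + [2019] * 4 + [2020] * 3)
change = yearly_release_change(toy_years)
print(change)
```

In the notebook this could be applied to the `release_year` column of a frame with one row per title; a negative value in the final rows would support the plateau reading.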

#### Runtime Distribution

In [None]:
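# Chart - 4 visualization code

# Keep movies only, one row per title, so the histogram matches its title
movie_runtimes = merged_df[merged_df['type'] == 'MOVIE'].drop_duplicates(subset='id')['runtime'].dropna()
plt.hist(movie_runtimes, bins=30, color='purple', edgecolor='black')
plt.title("Distribution of Movie Runtimes")
plt.xlabel("Runtime (minutes)")
plt.ylabel("Count")
plt.show()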

##### 1. Why did you pick the specific chart?

Continuous Data Distribution: The runtime variable (in minutes) is continuous numerical data. A histogram is the definitive chart for visualizing the frequency distribution of such a variable, grouping the data into "bins" (e.g., 5-minute intervals) and showing how often values fall into those bins.

Identifying Central Tendency: The goal is to identify the most common runtime for movies. The highest bar (the mode) on the histogram immediately shows the content length the platform has most frequently acquired, which is often aligned with industry standards.

Assessing Skewness and Range: It clearly displays the shape of the distribution (e.g., right-skewed) and the full range of content lengths, informing the business about content diversity and where acquisition efforts are concentrated.

##### 2. What is/are the insight(s) found from the chart?

Standardized Length Dominance: The distribution will likely be right-skewed, with a single dominant peak (the mode) concentrated around the standard feature film length (typically 90-100 minutes). This confirms that the majority of Amazon's movie library aligns with industry norms for theatrical or mainstream releases.

Existence of Niche Content: The chart will show a long tail extending to the right, confirming the presence of longer cinematic experiences (180+ minutes) and a small bar on the left for shorter content (e.g., shorts, films under 60 minutes). This indicates a commitment to content diversity.

Viewer Commitment Baseline: The dominant runtime (the mode) serves as a baseline for the average single viewing commitment expected of Prime Video users.
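
The expectations above (a right-skewed shape with a peak near the standard feature length) can be verified with a few summary statistics. The sketch below uses made-up runtimes; `runtime_shape` is a hypothetical helper and the 10-minute bin width is an assumption chosen for illustration.

```python
import pandas as pd


def runtime_shape(runtimes: pd.Series, bin_width: int = 10) -> dict:
    """Median, skewness, and the most populated runtime bin (in minutes)."""
    clean = runtimes.dropna()
    # Floor each runtime to the start of its bin, e.g. 95 -> 90 for width 10
    bin_starts = (clean // bin_width) * bin_width
    modal_start = int(bin_starts.value_counts().idxmax())
    return {
        "median": float(clean.median()),
        "skewness": float(clean.skew()),  # positive means a longer right tail
        "modal_bin": (modal_start, modal_start + bin_width),
    }


# Made-up runtimes in minutes, including one missing value
toy_runtimes = pd.Series([85, 90, 92, 95, 98, 100, 110, 125, 180, None])
summary = runtime_shape(toy_runtimes)
print(summary)
```

A positive skewness together with a modal bin around 90 to 100 minutes would match the reading of the histogram described above.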

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Targeted Acquisition: Identifying the optimal movie length (e.g., 90-100 minutes) allows acquisition teams to prioritize content that statistically aligns with the highest viewer completion rates. This maximizes the return on content investment (ROI) by ensuring users finish the content they start.

Strategic Scheduling: The distribution informs scheduling tools, ensuring the platform offers a sufficient mix of short, medium, and long content to suit different times of the day and different viewer commitments.

B. Insights that Lead to Negative Growth

Yes, a potential negative insight is revealed by looking at the extremes of the distribution.

Insight: The risk arises if the analysis shows that a significant portion of the budget is allocated to films in the very long tail (e.g., 180+ minutes) relative to their viewership or completion rates.

Justification for Negative Impact: While long films can be high-quality, they often require a greater financial investment and lead to lower viewer completion rates. If the acquisition strategy is poorly aligned with the user's preferred runtime, the over-investment in non-optimal lengths can be seen as an inefficient allocation of content budget. This inefficiency directly reduces the overall content ROI and can negatively impact subscriber satisfaction if users struggle to complete and enjoy the platform's primary offerings.

#### Top 10 Genres by count

In [None]:
# Chart - 5 visualization code
top_genre_counts.plot(kind = 'bar', color = 'teal')
plt.title("Top 10 Genres by Count")
plt.xlabel("Genre")
plt.ylabel("Count")
plt.xticks(rotation = 45)
plt.show()


##### 1. Why did you pick the specific chart?

Comparison of Discrete Categories: Genre is a categorical variable (Drama, Comedy, Action, etc.). A bar chart is the most effective and conventional method for comparing the magnitude (count) of distinct, non-continuous categories.

Ranking and Dominance: The primary goal is to rank the genres and clearly show which ones are the most numerous. The descending order of bar height immediately identifies the dominant genres (likely Drama and Comedy) and their difference in magnitude from less prevalent genres (like Horror or Fantasy).

Clarity and Readability: When comparing a limited number of categories (like the Top 10), the bar chart offers clear, direct visual comparison, making the content strategy easy to interpret without the visual distortion sometimes associated with pie charts.

##### 2. What is/are the insight(s) found from the chart?

Focus on Mainstream Appeal: The genres with the highest counts (typically Drama and Comedy) confirm that the platform invests heavily in mass-market, accessible storytelling. This is the core appeal that attracts a wide demographic of general subscribers.

Content Portfolio Balance: While Drama and Comedy lead, the chart shows the platform maintains a balanced portfolio by including significant counts of genres like Action/Adventure, Thriller, and Kids/Family content. This commitment to diversity ensures they can capture different viewing moods and household members.

Genre Saturation: The large counts in the leading genres indicate a high degree of market saturation in those areas, suggesting that new investment in Drama or Comedy must be highly selective to avoid diminishing returns.
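The ranking and concentration points above can be quantified with a simple count. The sketch below is illustrative only: it assumes the `genres` column stores list-like strings such as `"['drama', 'comedy']"`, and the sample rows and the name `top_counts` are hypothetical rather than the notebook's own `top_genre_counts`.

```python
import ast

import pandas as pd

# Hypothetical rows; the real titles.csv layout may differ.
titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [
        "['drama', 'comedy']",
        "['drama']",
        "['comedy', 'action']",
        "['drama', 'thriller']",
    ],
})

# Parse the list-like strings, then give every (title, genre) pair its own row.
genre_rows = titles.assign(
    genres=titles["genres"].apply(ast.literal_eval)
).explode("genres")

# Count titles per genre and keep the ten most frequent.
top_counts = genre_rows["genres"].value_counts().head(10)

# Share of all genre tags taken by the two largest genres.
top2_share = top_counts.head(2).sum() / len(genre_rows)

print(top_counts)
print(f"Top-2 share of genre tags: {top2_share:.1%}")
```

A top-2 share close to 1 would back the saturation reading; a lower share would point to a more balanced portfolio.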

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Informed Investment: The chart provides a ranked view of genre volume across the catalog. The business can use this to:

Reinforce Strengths: Continue acquiring high-quality content in the top-performing genres (Drama, Comedy) to maintain market share.

Identify White Space: Compare the high-volume genres with known high-engagement genres. For example, if Documentary or Horror show high user completion rates despite lower title counts, this flags a profitable "white space" opportunity for targeted, high-ROI investment.

Talent Strategy Integration: By combining this data with the credits.csv file, the business can identify top-tier actors and directors who specialize in the most popular genres, guiding talent acquisition and original production deals to maximize the success rate of future hits.

B. Insights that Lead to Negative Growth
Yes, the insight regarding the dominance of just a few genres (Drama/Comedy) can lead to negative growth if not addressed.

Insight: The majority of content is concentrated in a few general-appeal genres, leaving niche and high-loyalty genres (like Sci-Fi, Fantasy, or specific regional sub-genres) comparatively thin.

Justification for Negative Impact: This leads to Audience Homogenization and Churn. Subscribers who primarily seek niche content (e.g., a specific type of anime, documentary, or foreign-language thriller) may subscribe for a short period but will quickly leave when they exhaust the platform's shallow offerings in their preferred category. By not adequately investing in these specific communities, the platform fails to build deep loyalty and risks losing niche audiences to competitors who dedicate resources to becoming the dominant player in those specific genres. This results in churn among highly engaged, niche viewers.

#### Avg IMDB by genre (Top 10)

In [None]:
# Chart - 6 visualization code
# Average IMDB score per genre, highest first, limited to the top 10 genres
genre_pivot = (
    genre_split.pivot_table(index='genre', values='imdb_score', aggfunc='mean')
    .sort_values('imdb_score', ascending=False)
    .head(10)
)
sns.heatmap(genre_pivot, annot=True, cmap='Blues')
plt.title("Avg IMDB rating by genre")
plt.show()


##### 1. Why did you pick the specific chart?

An annotated heatmap (sns.heatmap with annot=True) was chosen for this visualization because:

Comparison of Categories: The goal is to compare the average IMDB score (a continuous, quantitative value) across the Top 10 Genres (a discrete, categorical variable). Because the pivot table has a single value column, the heatmap reduces to one colored cell per genre, ordered from the highest to the lowest average.

Ease of Interpretation: The color intensity of each cell tracks the average IMDB score for that genre (with the Blues colormap, darker means higher). This makes it easy to spot the highest-rated and lowest-rated genres at a glance.

Clarity and Precision: Color alone is hard to compare precisely, so annot=True prints the exact average inside each cell. A sorted bar chart would be a reasonable alternative when small differences between genres matter.

##### 2. What is/are the insight(s) found from the chart?

Top-Performing Genres: Identify the one to three genres with the highest average IMDB scores. This indicates which content types are most consistently rated highly by viewers.

Underperforming Genres: Identify the one to three genres with the lowest average IMDB scores, which may signal areas where quality is lower or audience tastes are less satisfied.

Score Range and Consistency: Assess the difference between the highest and lowest average scores. A large range suggests a significant difference in audience appreciation across genres, while a small range suggests more consistent quality/rating across the top genres.

Relative Ranking: Note the relative position of each genre, which can inform production or acquisition strategies.
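The four checks listed above map onto a few pandas operations. This is a minimal sketch on invented numbers; the frame `genre_scores` and its `genre` / `imdb_score` columns are assumptions standing in for the notebook's exploded genre table.

```python
import pandas as pd

# Hypothetical exploded frame: one row per (title, genre) pair.
genre_scores = pd.DataFrame({
    "genre": ["drama", "drama", "comedy", "comedy",
              "horror", "horror", "documentary", "documentary"],
    "imdb_score": [7.0, 6.0, 6.0, 5.0, 4.0, 5.0, 8.0, 7.0],
})

# Mean score per genre, highest first (the relative ranking).
avg_by_genre = (
    genre_scores.groupby("genre")["imdb_score"]
    .mean()
    .sort_values(ascending=False)
)

top_genres = avg_by_genre.head(2)      # top-performing genres
bottom_genres = avg_by_genre.tail(2)   # underperforming genres

# Gap between the best and worst genre averages.
score_range = avg_by_genre.max() - avg_by_genre.min()

print(avg_by_genre)
print(f"Range of genre averages: {score_range:.2f}")
```

Averages built from only a handful of titles are noisy, so it may be worth checking the per-genre title count before drawing conclusions from the ranking.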

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the chart have a direct and powerful potential for positive business impact.

| Insight Type | Positive Business Impact | Negative Growth Risk |
|---|---|---|
| High-Scoring Genres | The business should prioritize investment in content creation, acquisition, and marketing for the top-scoring genres. These genres have a proven track record of high quality and audience satisfaction, which can lead to higher subscription retention, increased viewing time, and better word-of-mouth marketing. | Over-saturating the market with content only from the top genres, leading to audience fatigue or missing emerging trends. A sudden dip in quality within a previously high-scoring genre could also erode audience trust. |
| Low-Scoring Genres | Instead of avoiding them entirely, the business can investigate the cause of low scores. This might lead to an opportunity to produce higher-quality, distinctive content within that genre, filling a market gap and becoming a leader in a less-crowded space. | Over-investing in low-scoring genres without a clear strategy for quality improvement. Continuously producing low-rated content will drain resources and actively frustrate subscribers, leading to churn (negative growth). |
| Score Range/Cluster | If scores are tightly clustered, the business can afford to be more flexible, knowing that quality is generally consistent. If widely spread, it validates a focused strategy on the clearly differentiated high-scoring genres. | Ignoring the overall trend. If the average scores for all genres are below a certain competitive threshold (e.g., below 7.0), the insight is that the entire content library may be weak compared to competitors, which would lead to negative growth across the board. |

#### Genres VS IMDB Heatmap

In [None]:
# Chart - 7 visualization code

genres_type_imdb_pivot = genre_exploded_df.pivot_table(index='genres', columns='type', values='imdb_score', aggfunc='mean')

plt.figure(figsize=(12, 10)) # Adjust figure size for better readability
sns.heatmap(genres_type_imdb_pivot, annot=True, cmap='YlGnBu', fmt=".2f", linewidths=.5)
plt.title("Average IMDB Score by Genre and Content Type Heatmap")
plt.xlabel("Content Type")
plt.ylabel("Genre")
plt.show()

##### 1. Why did you pick the specific chart?

The Heatmap is the ideal choice for this visualization because the data structure involves comparing three variables simultaneously:

Genre (Categorical, Row Index)

Content Type (Categorical, Column Index)

Average IMDB Score (Quantitative, Cell Value)

A heatmap uses color intensity to represent the magnitude of the quantitative variable (the IMDB score) at the intersection of the two categorical variables.

Clarity for Two-Way Comparison: It efficiently displays a pivot table (genres_type_imdb_pivot) where direct cell-to-cell comparison (e.g., comparing "Drama Movie" score vs. "Drama Show" score) is essential.

Pattern Recognition: The color gradient (cmap='YlGnBu') immediately allows for the identification of large-scale patterns:

Dark Blue Cells (Higher Scores) highlight the highest-rated genre-type combinations.

Light Yellow Cells (Lower Scores) highlight the weakest combinations.

Space Efficiency: It packages a dense amount of data into a single, compact, and scannable visual, far more efficiently than using many separate bar charts.

##### 2. What is/are the insight(s) found from the chart?

Genre-Specific Performance (Rows): Identifying which genres consistently receive high or low scores regardless of content type.

Example: Does "Documentary" consistently have high scores across both Movies and Shows, suggesting the format maintains quality?

Content Type Consistency (Columns): Determining if one content type (e.g., Movie) consistently outscores the other (Show) across most genres, or vice-versa.

Example: If the "TV Show" column is generally darker (higher scores), it suggests TV series, with more time for character development, lead to better audience satisfaction than feature films.

"Sweet Spot" Combinations: Pinpointing the single, darkest cell representing the highest average IMDB score. This is the most potent combination of genre and content type.

Example: "Action" Shows might hold the darkest cell, scoring significantly higher than "Action" Movies.

"Risk Zone" Combinations: Identifying the lightest cells (lowest scores) that signal production areas with poor audience reception.

Example: If the score for "Romance" Movies is low, but "Romance" Shows is high, the format (Movie length) may be the constraint.
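The "sweet spot", "risk zone", and movie-versus-show comparisons above can also be pulled straight from the pivot table instead of being judged by color. The sketch below uses a small invented pivot; the `MOVIE` / `SHOW` labels and the scores are assumptions about what `genres_type_imdb_pivot` might contain.

```python
import pandas as pd

# Hypothetical pivot shaped like a genres x type table of mean scores.
pivot = pd.DataFrame(
    {"MOVIE": [6.0, 5.5, 7.2], "SHOW": [7.1, 6.4, 7.5]},
    index=pd.Index(["action", "romance", "documentary"], name="genres"),
)
pivot.columns.name = "type"

# Flatten to a Series indexed by (type, genre) and drop empty cells.
cells = pivot.unstack().dropna()

sweet_spot = cells.idxmax()  # highest-rated (type, genre) combination
risk_zone = cells.idxmin()   # lowest-rated (type, genre) combination

# Positive values mean shows outscore movies within that genre.
format_gap = (pivot["SHOW"] - pivot["MOVIE"]).sort_values(ascending=False)

print("Sweet spot:", sweet_spot, round(cells.max(), 2))
print("Risk zone:", risk_zone, round(cells.min(), 2))
print(format_gap)
```

Sorting `format_gap` puts the genres where format matters most at the top, which is the pattern the "Romance" example describes.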

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact
Informed Investment Strategy: A business (like a streaming service or production studio) should aggressively invest in the "Sweet Spot" genre/type combinations (the highest-rated cells). This provides the highest probability of content hitting an audience satisfaction benchmark.

Acquisition/Production Focus: Use the heatmap to guide licensing and production mandates. If "Sci-Fi Shows" are highly rated, the company should focus its budget on acquiring or producing more content in that specific area, which directly increases subscriber value and reduces churn.

Marketing Focus: Marketing teams can use the highest-rated cells to focus promotional campaigns, leveraging the success of a specific genre/type to attract new subscribers who favor that content.

Insights Leading to Negative Growth
An insight itself cannot lead to negative growth, but an incorrect business decision based on a partial or isolated insight can:

Ignoring the "Why": If "Mystery Shows" have the highest score, but the production cost for that specific quality of Mystery Show is prohibitively high, aggressively pursuing this content will lead to negative ROI and financial strain (Negative Growth). The insight only shows Quality (IMDB Score), not Profitability.

Blindly Cutting Low-Scoring Content: If a content type like "Shorts" has low average scores, blindly eliminating it could be a mistake. Low scores might be due to low-budget experimental content. If that low-budget content is the primary driver for new talent discovery or satisfies a niche audience, removing it might stunt innovation or alienate a valuable segment, resulting in Subscriber Churn (Negative Growth).

The "Success Trap": Over-saturating the platform with only the highest-rated combination. If the data shows "Crime" Shows are excellent, producing nothing but this genre will lead to Audience Fatigue and missed opportunities in other emerging content areas, eventually causing a decline in overall subscriber engagement.

#### IMDB Score Distribution

In [None]:
# Chart - 8 visualization code
sns.histplot(merged_df["imdb_score"], bins=20, kde= True, color= 'blue')
plt.title("Distribution of IMDB Scores")
plt.xlabel("IMDB Score")
plt.ylabel("Count")
plt.xticks(rotation = 45)
plt.show()

##### 1. Why did you pick the specific chart?


The specific chart chosen is a histogram (generated using sns.histplot), which is the ideal chart for visualizing the distribution of a single, continuous numerical variable, in this case, the IMDB Score.

Purpose: A histogram groups the numerical data into bins (the code specifies bins=20) and plots the frequency (or count) of values that fall into each bin. This allows for a quick and clear understanding of the data's shape and characteristics.

Key Features Visualized:

Shape: Is the distribution symmetric (bell-shaped/normal), skewed left (tail on the left, peak on the right), or skewed right (tail on the right, peak on the left)?

Central Tendency: Where is the data clustered (i.e., what is the most common IMDB score)?

Spread/Range: What is the minimum and maximum score, and how spread out are the ratings?

##### 2. What is/are the insight(s) found from the chart?

Assuming a typical distribution for a large catalog of movies/shows, the insights would likely center on the quality and quantity of the produced/acquired content.

| Potential Feature | Corresponding Insight |
|---|---|
| Shape: Skewed Left (Negative Skew) | The majority of content is rated high (e.g., between 6.5 and 8.5). This indicates a high overall quality perception by the audience. |
| Peak/Mode: Located, for example, at a score of 7.2 | The most common rating is strong. A mode of 7.2 means titles cluster around a solid score. |
| Tail on the Low End: A relatively small bar height for scores below 5.0 | The service has a low volume of poorly-received content, suggesting effective curation or production quality control. |
| Tail on the High End: Bars extending up to 9.0 or 9.5 | The service offers a significant number of highly rated "hit" titles, which are key drivers for customer acquisition. |
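Each feature in the table has a numeric counterpart that can be computed alongside the histogram. The following sketch uses invented scores; the 20-bin setting mirrors the `bins=20` in the plotting cell, while the 5.0 and 8.5 thresholds are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical IMDB scores standing in for merged_df["imdb_score"].
scores = pd.Series(
    [3.1, 4.8, 5.9, 6.4, 6.8, 7.0, 7.1, 7.2, 7.3, 7.9, 8.4, 9.0],
    name="imdb_score",
).dropna()

# Same binning idea as the histogram: 20 equal-width bins.
counts, edges = np.histogram(scores, bins=20)
peak = counts.argmax()
modal_interval = (edges[peak], edges[peak + 1])

# Negative skewness means the longer tail is on the low-score side.
skewness = scores.skew()

# Size of each tail.
low_share = (scores < 5.0).mean()
high_share = (scores >= 8.5).mean()

print(f"Modal bin: {modal_interval[0]:.2f} to {modal_interval[1]:.2f}")
print(f"Skewness: {skewness:.2f}")
print(f"Below 5.0: {low_share:.1%}, 8.5 and above: {high_share:.1%}")
```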

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact
If the distribution is strongly skewed toward higher scores (e.g., a high mode/peak score), the insights lead to:

Marketing Strategy: Focus marketing on the high-quality content (the area around the peak and the high tail). Use the average IMDB score (or the mode) as a key selling point ("Our users consistently rate our shows X out of 10!").

Content Acquisition/Production: Identify the genres/formats that correspond to the high-scoring titles. These are the areas where the business should invest more capital for future content, as they have proven successful with the audience. This optimizes content spend for maximum audience satisfaction.

Customer Retention: High-quality content is directly linked to customer satisfaction and retention. Knowing the distribution is high-quality confirms that the current content strategy supports a lower churn rate.

Insights that May Lead to Negative Growth
If the distribution were different, it could signal problems:

Distribution with a Low Peak/Mode: If the highest bar is centered around a very low score (e.g., a peak at 4.5), it means the majority of the catalog is considered low quality.

Negative Impact: This would directly lead to increased customer churn as viewers are unable to find worthwhile content, and it damages the brand's reputation for quality. This signals a need to scrap or de-prioritize a significant portion of the catalog.

Bimodal Distribution (Two Peaks): If the chart showed two distinct, separate peaks (e.g., one at 3.0 and one at 8.0), it would indicate a highly polarized library—many very bad titles and many very good titles.

Negative Impact: While the good titles attract users, the abundance of poor titles could frustrate users, making it harder for them to find the "gems" and leading to a poor user experience and potential churn. This signals a need for better content consistency.


#### IMDB vs TMDB scatter plot

In [None]:
# Chart - 9 visualization code

sns.scatterplot(data=merged_df, x='imdb_score', y='tmdb_score', color= 'Yellow')
plt.title("IMDB vs TMDB Score Scatter Plot")
plt.xlabel("IMDB Score")
plt.ylabel("TMDB Score")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

The specific chart chosen is a scatter plot (generated using sns.scatterplot), which is the ideal chart for visualizing the relationship, or correlation, between two continuous numerical variables: IMDB Score (plotted on the x-axis) and TMDB Score (plotted on the y-axis).

Purpose: Each point on the graph represents a single movie or show, with its position determined by its pair of scores. This allows for a visual assessment of how the two variables move together.

Key Features Visualized:

Correlation/Trend: Does an increase in IMDB Score correspond to an increase (positive correlation), a decrease (negative correlation), or no change (no correlation) in TMDB Score?

Strength: How tightly clustered are the points around a central line? Tightly clustered points indicate a strong relationship.

Outliers/Discrepancies: Points that fall far away from the main cluster highlight content where IMDB and TMDB critics/users have widely differing opinions.

##### 2. What is/are the insight(s) found from the chart?


| Potential Feature | Corresponding Insight |
|---|---|
| Positive Linear Trend (Points forming an upward slope) | There is a strong positive correlation between IMDB Score and TMDB Score. This indicates that both rating systems generally agree on the quality of a title. |
| High Clustering (Points close to a diagonal line) | The relationship is highly reliable. One score can be used to predict the other with a high degree of confidence. The two systems essentially validate each other's perception of quality. |
| Presence of Outliers (Points far from the main cluster) | These outliers represent titles where audience opinions diverge significantly between the IMDB and TMDB user bases. For example, a point might have a high IMDB score but a low TMDB score, suggesting different user demographics or regional biases. |
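The trend, clustering, and outlier features above can be backed by a correlation coefficient and a simple gap filter. This is a sketch on invented scores; the 2.0-point outlier threshold is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical paired scores standing in for merged_df.
scores = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0, 8.5],
    "tmdb_score": [5.2, 6.1, 6.8, 7.9, 8.8, 4.0],
})

# Keep only titles that have both scores.
paired = scores.dropna(subset=["imdb_score", "tmdb_score"])

# Pearson correlation: near +1 means the two sites largely agree.
pearson_r = paired["imdb_score"].corr(paired["tmdb_score"])

# Flag titles where the two scores differ by more than the threshold.
gap = (paired["imdb_score"] - paired["tmdb_score"]).abs()
outliers = paired.loc[gap > 2.0, "title"].tolist()

# Correlation after setting the flagged titles aside.
clean = paired.loc[gap <= 2.0]
pearson_r_clean = clean["imdb_score"].corr(clean["tmdb_score"])

print(f"r (all titles): {pearson_r:.2f}")
print(f"r (without outliers): {pearson_r_clean:.2f}")
print("Outliers:", outliers)
```

Comparing the two coefficients shows how much a few divergent titles can pull the overall correlation down.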

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact
A strong positive correlation between IMDB and TMDB scores leads to:

Reliable Quality Proxy: The scores can be treated as reliable, interchangeable proxies for content quality. If a new show has a high IMDB score but is missing a TMDB score (or vice versa), the existing score is likely sufficient to gauge its audience reception and potential success.

Smarter Acquisition/Renewals: Confidence in the scores means the business can rely on these metrics to inform content acquisition decisions, renewal strategies, and marketing budgets. Titles with consistently high scores across both platforms should be prioritized for exposure and investment.

Benchmark for Internal Metrics: The correlation provides a benchmark for internal content quality metrics. The goal of any internal rating system should be to correlate strongly with these established external scores.

Insights that May Lead to Negative Growth
If the plot showed a very weak or negative correlation, it would signal significant problems:

Weak or No Correlation (Random Scatter): If the points are scattered randomly across the plot, it means there is no agreement between the two major rating systems.

Negative Impact: This introduces significant uncertainty into content evaluation. The business cannot rely on either score to accurately predict audience reception, making content acquisition decisions much riskier. It leads to inconsistent quality checks and potentially funding content that has an "inflated" score on only one platform.

High Discrepancy/Bias: If the scatter plot shows a clear shift (e.g., all TMDB scores are consistently 1.0 point lower than IMDB scores for the same title), it indicates a systematic platform bias.

Negative Impact: While not strictly "negative growth," it signals that the business must adjust its expectations when using one score versus the other. If the company typically uses IMDB as its quality threshold, it must remember that TMDB's audience may rate that content lower, affecting marketing and audience targeting.
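A systematic shift of the kind described above shows up as a non-zero mean difference with little spread around it. The sketch below is illustrative only, on made-up paired scores.

```python
import pandas as pd

# Hypothetical paired scores where TMDB runs about one point lower.
pairs = pd.DataFrame({
    "imdb_score": [6.0, 7.0, 8.0, 9.0],
    "tmdb_score": [5.1, 6.0, 6.9, 8.0],
})

# Signed difference per title: negative means TMDB rates the title lower.
diff = pairs["tmdb_score"] - pairs["imdb_score"]

mean_offset = diff.mean()    # size and direction of the shift
offset_spread = diff.std()   # small spread means the shift is consistent

print(f"Mean TMDB - IMDB offset: {mean_offset:.2f}")
print(f"Spread of the offset: {offset_spread:.2f}")
```

A large offset with a small spread suggests a platform-wide bias, whereas a large spread points back to title-level disagreement.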

#### Top 10 Actors

In [None]:
# Chart - 10 visualization code

top_actors.plot(kind= 'bar', color= "mediumseagreen")
plt.title("Top 10 Actors")
plt.xlabel("Actor")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

The chart chosen is a bar chart (generated using plot(kind='bar')), which is the ideal chart type for visualizing and comparing the frequency or magnitude of distinct, categorical variables, in this case, individual actors.

Purpose: The bar chart plots the Count (a numerical measure) on the y-axis against the Actor (a categorical variable) on the x-axis. This setup allows for an immediate and straightforward comparison of how many times each of the top 10 actors appears in the dataset.

Key Features Visualized:

Ranking: The height of the bars instantly shows the ranking of the actors based on their appearance count.

Comparison: It's easy to see the difference in frequency between the most prolific actor and the others, highlighting any actors who are significantly more active than the rest.

A bar chart is superior to a pie chart for this type of comparison because the human eye is much better at judging differences in the lengths of bars than in the areas or angles of pie slices, especially when there are 10 categories.

##### 2. What is/are the insight(s) found from the chart?

Assuming the bar chart plots the count of appearances (e.g., number of movies/shows) for the top 10 actors, the key insights relate to talent pool dependency and star power concentration.

Prolific Talent Identification: The actor with the tallest bar is the most frequent collaborator or cast member in the content catalog. This individual represents a key piece of talent for the business.

Talent Concentration: The difference in height between the top bar and the tenth bar indicates the concentration of appearances.

High disparity: Suggests the business relies heavily on one or two star actors, potentially creating a talent dependency risk.

Low disparity (bars are similar): Suggests the appearances are more evenly distributed among the top 10, indicating a broader, more diversified talent pool.

Casting Strategy: The list identifies the actors who have proven compatibility with the content produced or acquired, confirming the current successful casting trends.
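The disparity between the first and last bar can be reduced to a single ratio. The sketch below assumes a credits table with `name` and `role` columns and an `ACTOR` role label; the names and rows are invented.

```python
import pandas as pd

# Hypothetical credit rows standing in for credits.csv.
credits = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Ann", "Bob", "Bob", "Cy", "Dee", "Ann"],
    "role": ["ACTOR"] * 8 + ["DIRECTOR"],
})

# Number of acting credits per person, ten most frequent only.
actor_counts = (
    credits.loc[credits["role"] == "ACTOR", "name"]
    .value_counts()
    .head(10)
)

# Ratio of the most frequent actor to the last one in the top list.
disparity = actor_counts.iloc[0] / actor_counts.iloc[-1]

print(actor_counts)
print(f"Top-to-bottom disparity: {disparity:.1f}x")
```

A ratio near 1 corresponds to the "low disparity" case, while a large ratio corresponds to heavy reliance on one or two names.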

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact
The insights can be leveraged to drive growth through:

Optimized Marketing: The most frequent actors (the tallest bars) are likely the most recognized and bankable stars. Marketing efforts should prominently feature these top actors to attract viewers who are fans of their work, potentially driving higher subscription or rental rates.

Strategic Talent Prioritization: Knowing the most successful collaborators allows the business to prioritize securing future contracts and exclusive deals with these high-frequency actors. This locks in proven talent and reduces the risk of competitors acquiring them.

Genre/Franchise Identification: By cross-referencing these top actors with the specific titles they appear in, the business can identify successful genres or franchises that are particularly appealing to these stars, guiding future content greenlighting decisions.

Insights that May Lead to Negative Growth
A chart showing high talent concentration can highlight risks that, if unmitigated, could lead to negative growth:

Over-Reliance Risk: If one or two bars are overwhelmingly taller than the rest, it signals over-reliance on specific talent.

Negative Impact: If that star leaves, becomes unavailable, or faces negative public scrutiny, the content pipeline or the value of the existing catalog (or franchise) could suffer a significant, immediate decline, leading to negative growth in viewership and subscriber confidence.

Stagnation Risk: If the same top 10 actors have been consistent for many years without new talent breaking into the list, it might signal a stagnant or conservative casting/production approach.

Negative Impact: This could lead to content that feels repetitive or dated, failing to capture new, diverse, or younger audiences, ultimately limiting market growth. The business may need to actively seek out and promote emerging talent to refresh its appeal.



#### Top 10 Directors

In [None]:
# Chart - 11 visualization code

top_directors.plot(kind = 'bar', color = 'plum')
plt.title("Top 10 Directors")
plt.xlabel("Director")
plt.ylabel("Count")

plt.show()

##### 1. Why did you pick the specific chart?

Purpose: The bar chart plots the Count (the number of titles directed) on the y-axis against the Director (the categorical variable) on the x-axis. This structure makes it very easy to rank and compare the number of appearances for the top 10 directors.

Clarity: Unlike a line graph (which implies a continuous relationship) or a scatter plot, the bar chart clearly separates the individual categories (directors) and uses bar height to represent magnitude, allowing for quick visual analysis of which directors are the most prolific.

##### 2. What is/are the insight(s) found from the chart?

Prolific Filmmakers: The director with the tallest bar is the most frequent director in the content catalog. This individual is a central figure in the production pipeline, potentially directing multiple episodes of a series or several films.

Creative Concentration: The variation in bar height reveals the degree of reliance on a few key directors.

Steep drop-off: Indicates that the business heavily relies on one or two high-volume directors.

Uniform heights: Suggests a more diversified and balanced use of directorial talent, spreading the workload across multiple creatives.

Production Volume Trend: The list essentially identifies the most reliable and consistent creators that the studio or platform works with, confirming which creative relationships are driving the highest output volume.
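Creative concentration can also be expressed as the number of directors needed to cover half of all directing credits. This is a sketch with invented names; the 50% cut-off is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical directing credits, one entry per directed title.
director_credits = pd.Series(
    ["Kim", "Kim", "Kim", "Lee", "Lee", "Max", "Ned", "Oz"], name="name"
)

# Titles per director, most prolific first.
counts = director_credits.value_counts()

# Running share of all directing credits covered so far.
cum_share = counts.cumsum() / counts.sum()

# Smallest number of directors whose combined share reaches 50%.
n_for_half = int((cum_share < 0.5).sum()) + 1

print(cum_share)
print(f"Directors needed to cover half of the credits: {n_for_half}")
```

A very small count relative to the total number of directors matches the "steep drop-off" pattern, and a larger count matches "uniform heights".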

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Strategic Partnerships: The most prolific directors (the tallest bars) are highly productive partners. The business should prioritize securing long-term contracts and exclusive overall deals with these individuals to guarantee a stable flow of content and preempt competitors.

Resource Allocation: Since these directors are proven to manage multiple projects, they can be entrusted with larger budgets, more complex series, or key franchise entries. This ensures that high-value projects are handled by experienced, high-volume directors.

Marketing & Branding: The director's name can be a powerful marketing hook. The business can leverage the name of its most frequent and successful directors in promotional materials to attach a sense of quality and recognition to new releases.

Insights that May Lead to Negative Growth
If the chart reveals an unhealthy concentration, it can signal risks that could stunt growth:

High Creative Dependency: If one or two directors account for a disproportionately large share of the content, it creates a single point of failure risk.

Negative Impact: If that top director leaves, is unavailable, or if their quality drops, a substantial portion of the future content pipeline could stall or fail, leading to decreased content output and potentially subscriber dissatisfaction (negative growth).

Lack of Creative Diversity: A list dominated by directors with a very similar style or background may indicate a lack of creative risk-taking.

Negative Impact: This can lead to a homogenous content library that fails to appeal to broad or niche audiences, limiting subscriber growth and leading to content fatigue. The business needs to actively invest in new, diverse voices to ensure long-term market

#### Role Distribution

In [None]:
role_dist = merged_df['role'].value_counts()
role_dist.plot(kind='pie', autopct='%1.1f%%', startangle=90, colors=['gold', 'lightcoral', 'lightskyblue'])
plt.title("Role Distribution (Actor, Director, Writer)")
plt.show()

##### 1. Why did you pick the specific chart?

The chart chosen is a pie chart (generated using plot(kind='pie')), which is the ideal chart type for visualizing the distribution of a categorical variable as proportions of a whole.

Purpose: The pie chart clearly shows how a total quantity (in this case, the total number of credit rows in merged_df) is divided among different, distinct categories (Actor, Director, Writer).

Focus on Proportions: It allows for an immediate, intuitive understanding of the relative size of each role. The size of each slice directly corresponds to the percentage of total credits held by that role.

Key Features Visualized: The autopct='%1.1f%%' ensures the exact percentage of each role is displayed on the chart, eliminating ambiguity in the comparison.

##### 2. What is/are the insight(s) found from the chart?

Dominant Role: The "Actor" slice will be the largest by a significant margin. This confirms that the majority of credits in the merged dataset are attributed to performers, which is expected due to the large cast size of films and TV shows.

Creative Ratios: The chart establishes the ratio between the core creative roles (Director and Writer) compared to the performing role (Actor). For instance, the director and writer slices might be similar in size, or one might be slightly larger, providing a benchmark for the balance of creative credit volume.

Scale of Workforce: The distribution quantifies the scale of the talent pool for each function. A large "Actor" slice confirms the need for extensive talent management and audition/casting processes, while the size of the "Director" and "Writer" slices indicates the size of the key creative groups.
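The slice sizes and the ratio between roles can be computed directly from the `role` column. The sketch below uses invented rows with only two role labels, `ACTOR` and `DIRECTOR`; the labels present in the actual data are an assumption to verify.

```python
import pandas as pd

# Hypothetical role column standing in for merged_df["role"].
roles = pd.Series(["ACTOR"] * 8 + ["DIRECTOR"] * 2, name="role")

# Proportion of credit rows per role (what the pie slices display).
role_share = roles.value_counts(normalize=True)

# Acting credits per directing credit.
actors_per_director = (roles == "ACTOR").sum() / (roles == "DIRECTOR").sum()

print(role_share)
print(f"Actors per director: {actors_per_director:.1f}")
```

Running `value_counts()` without a fixed list of labels also reveals which roles are really present, so the chart title and color list can be matched to the data.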

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact

The insights can be used to promote growth by:

Resource Allocation: The size of the "Actor" slice necessitates allocating the largest percentage of the talent management budget towards casting, payroll processing, and legal representation for performers. This ensures resources are matched to the largest talent pool.

Strategic Focus: The relatively small size of the "Director" and "Writer" slices highlights their scarcity and strategic importance. The business must prioritize securing long-term deals with these individuals to guarantee a stable, high-quality creative pipeline.

Efficiency Benchmarking: The observed ratio (e.g., $X$ Actors for every $Y$ Directors) can be used as an efficiency benchmark for production planning. If a new project deviates wildly from this established ratio, it may signal an over- or under-staffed creative department, allowing for proactive correction.

Insights that May Lead to Negative Growth

A distribution pattern that suggests poor planning or high risk could lead to negative growth:

Overly Dominant Creative Role: If the "Director" or "Writer" slice is unexpectedly large (e.g., exceeding $50\%$) and the "Actor" slice is small, it might indicate that the data is heavily skewed towards one director/writer who has produced a massive volume of low-budget, low-quality content.

Negative Impact: This scenario signals a risk of over-reliance on a limited number of creative voices whose output might lead to brand fatigue or an eventual downturn in quality, potentially causing subscriber churn.

Near-Zero Creative Role: If the "Director" or "Writer" slice is extremely small (e.g., $<5\%$ for both), it suggests the dataset is heavily weighted toward titles where core creative credits are rarely recorded or are attributed to very few individuals.

Negative Impact: This lack of diversity in the core creative pool suggests a stagnant creative environment, which will struggle to keep up with audience demand for fresh, diverse content, ultimately limiting market growth.
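The "Actors for every Director" benchmark can be computed directly from the credits table. The sketch below is illustrative only: it assumes `id` (title identifier) and `role` columns, and the sample rows are invented.

```python
import pandas as pd

# Hypothetical credits rows: one row per (title, person, role).
credits_sample = pd.DataFrame({
    "id": ["t1"] * 4 + ["t2"] * 3 + ["t3"] * 5,
    "role": [
        "ACTOR", "ACTOR", "ACTOR", "DIRECTOR",
        "ACTOR", "ACTOR", "DIRECTOR",
        "ACTOR", "ACTOR", "ACTOR", "ACTOR", "DIRECTOR",
    ],
})

# Catalog-wide benchmark: total actor credits per director credit.
role_counts = credits_sample["role"].value_counts()
overall_ratio = role_counts["ACTOR"] / role_counts["DIRECTOR"]

# Per-title ratio, to spot titles that deviate from the benchmark.
# Titles with no director credit would yield inf here and need separate handling.
per_title = pd.crosstab(credits_sample["id"], credits_sample["role"])
per_title["actors_per_director"] = per_title["ACTOR"] / per_title["DIRECTOR"]
per_title["vs_benchmark"] = per_title["actors_per_director"] / overall_ratio

print(overall_ratio)
print(per_title)
```

A `vs_benchmark` value far from 1 marks a title whose cast-to-director ratio departs from the catalog-wide figure; what counts as "far" is a judgment call the data alone does not settle.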

#### Top 10 Production Countries

In [None]:
# Chart - 13 visualization code

top_production_countries = country_exploded_df['production_countries'].value_counts().head(10)
top_production_countries.plot(kind = 'bar', color = 'mediumseagreen')
plt.title("Top 10 Production Countries")
plt.xlabel("Country")
plt.ylabel("Number of Titles")
plt.xticks(rotation = 60)
plt.show()

##### 1. Why did you pick the specific chart?

The chart chosen is a bar chart (generated using plot(kind='bar')), which is the most effective chart type for visualizing and comparing the frequency or magnitude of distinct, categorical variables, specifically the production countries.

Purpose: The bar chart plots the number of titles (a numerical count) on the y-axis against the Country (a categorical variable) on the x-axis.

Clarity and Ranking: This setup allows for an immediate, side-by-side comparison of the production volume for each of the top 10 countries. The height of the bars instantly shows the ranking, making it easy to see which countries dominate production and by how much.

Readability: For comparing 10 separate categories, a bar chart is significantly easier to interpret than a line graph or a pie chart. The $60^{\circ}$ rotation of the x-axis labels further ensures country names can be read clearly.
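The `country_exploded_df` frame is built earlier in the notebook. For readers who want to reproduce the counts, here is a minimal sketch of one way such a frame could be produced, assuming `production_countries` is stored as a stringified list (for example `"['US', 'GB']"`); the sample rows and the `_sample` names are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical titles rows; production_countries is assumed to be a stringified list.
titles_sample = pd.DataFrame({
    "title": ["T1", "T2", "T3", "T4"],
    "production_countries": ["['US']", "['US', 'GB']", "['IN']", "[]"],
})

# Turn each string into a real Python list, then emit one row per (title, country).
parsed = titles_sample["production_countries"].apply(ast.literal_eval)
country_exploded_sample = (
    titles_sample.assign(production_countries=parsed)
    .explode("production_countries")
    .dropna(subset=["production_countries"])  # empty lists explode to NaN
)

# Same counting step as the chart: titles per country, top 10.
top_countries = country_exploded_sample["production_countries"].value_counts().head(10)
print(top_countries)
```

Because a co-production appears once per country after exploding, the bar heights count title-country pairs, so they can sum to more than the number of titles.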

##### 2. What is/are the insight(s) found from the chart?

Dominant Market: The country with the tallest bar is the primary source of content in the catalog (typically the United States for English-language platforms). This identifies the most prolific production market and its relative contribution to the entire library.

Production Gap: The difference in bar height between the leading country and the rest of the list reveals the scale of concentration. A steep drop-off indicates a heavy reliance on one or two major markets.

Geographic Diversity: The presence of multiple countries with significant bar heights suggests a diversified global acquisition/production strategy, ensuring content appeal across different linguistic and cultural audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact

The insights can be used to promote growth by:

Investment Strategy: The top countries represent proven production hubs. The business should prioritize investment, studio expansion, or content acquisition agreements in these markets to ensure a continuous supply of high-volume, potentially high-quality content.

Market Targeting: The chart identifies the primary source markets for content. This informs marketing efforts by suggesting which geographic regions will naturally be most interested in the existing content, allowing for optimized marketing spend in those territories.

Local Expertise: High content volume from a specific country justifies investing in local teams and expertise (e.g., local casting, local production logistics, legal knowledge) to streamline operations and reduce friction in that market.

Insights that May Lead to Negative Growth

A chart showing high production concentration can highlight risks that, if unmitigated, could lead to negative growth:

Over-Reliance Risk: If one country (e.g., the U.S.) accounts for an overwhelming majority (e.g., $80\%$) of the titles, it signals extreme geographic concentration.

Negative Impact: This creates a massive supply chain risk. Any regulatory changes, labor disputes (strikes), or economic crises in that one country could paralyze the content pipeline, leading to a sudden lack of new releases and severe negative growth in subscriber engagement.

Lack of International Appeal: A highly concentrated library may fail to appeal to international markets that demand local or diverse content.

Negative Impact: This limits the potential for global subscriber growth. Audiences in non-dominant markets may find the content less relevant, leading to lower conversion rates in those regions. The business must actively seek to bolster production in underrepresented countries to mitigate this.
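The over-reliance check can be made concrete by computing the leading country's share of all title-country counts. The sketch below uses invented counts, and the $80\%$ threshold is simply the example figure from the text, not a derived cut-off.

```python
import pandas as pd

# Hypothetical title counts per production country (not the real catalog figures).
country_counts = pd.Series({"US": 85, "IN": 8, "GB": 4, "CA": 3}).sort_values(ascending=False)

# Share of each country in the total.
share = country_counts / country_counts.sum()

top_country = share.idxmax()
top_share = share.max()

# How many times larger the leader is than the runner-up (the "production gap").
leader_to_runner_up = country_counts.iloc[0] / country_counts.iloc[1]

# Example threshold taken from the text above; adjust to the business's risk appetite.
CONCENTRATION_THRESHOLD = 0.80
is_concentrated = top_share >= CONCENTRATION_THRESHOLD

print(top_country, round(top_share, 3), round(leader_to_runner_up, 2), is_concentrated)
```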

#### Average IMDB by Certification

In [None]:
avg_imdb_by_cert.plot(kind = 'bar', color = 'orchid')
plt.title("Average IMDB by Certification")
plt.xlabel("Certification")
plt.ylabel("Avg IMDB")
plt.show()

##### 1. Why did you pick the specific chart?

The chart chosen is a bar chart (generated using plot(kind='bar')), which is the best chart type for comparing the average (numerical) value across distinct, categorical groups—in this case, comparing the Average IMDB Score across different Content Certifications (e.g., G, PG-13, R).

Purpose: The bar chart plots the numerical average (Avg IMDB) on the y-axis against the categorical group (Certification) on the x-axis. This allows for an easy and direct visual comparison of which certification level is associated with the highest and lowest average audience rating.

Clarity: Bar charts clearly segment the data and use height to represent magnitude, making the differences between certification groups immediately apparent.

##### 2. What is/are the insight(s) found from the chart?

Assuming the bar chart plots the average IMDB score for each certification, the key insights would relate to the quality perception based on content maturity/restriction.

Quality Correlation with Maturity: The chart reveals if a correlation exists between the level of content restriction/maturity and the average audience rating. For instance:

High Avg IMDB for Mature Ratings (e.g., R, NC-17): This would suggest that content targeting mature audiences is often perceived as higher quality by IMDB users (often reflecting complex narratives, serious themes, or high-budget productions).

Low Avg IMDB for General/Child Ratings (e.g., G, TV-Y7): Conversely, this might indicate that younger/general audience content, while plentiful, has a lower average critical or audience rating.

Highest/Lowest Quality Segments: The tallest bar identifies the certification category that is consistently rated the highest, representing the most critically successful segment of the catalog. The shortest bar identifies the least successful segment.
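An average alone can mislead when a certification has very few titles, so it helps to read each bar alongside its sample size. This is a sketch under assumptions: the column names `age_certification` and `imdb_score` and the sample values are illustrative, and the notebook's own `avg_imdb_by_cert` may be built differently.

```python
import pandas as pd

# Hypothetical titles rows; a missing certification is represented as None.
titles_sample = pd.DataFrame({
    "age_certification": ["R", "R", "PG-13", "PG-13", "G", None],
    "imdb_score": [7.0, 8.0, 6.0, 6.5, 5.5, 9.0],
})

# Mean score and number of rated titles per certification.
# groupby drops rows whose certification is missing by default.
cert_stats = (
    titles_sample.groupby("age_certification")["imdb_score"]
    .agg(avg_imdb="mean", n_titles="count")
    .sort_values("avg_imdb", ascending=False)
)

print(cert_stats)
```

A certification that tops the ranking with only a handful of titles deserves less weight than one with hundreds, which the `n_titles` column makes visible.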

#### Top 10 Popular Titles



In [None]:
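# y='title' gives one horizontal bar per title; hue='title' only colours the bars
sns.barplot(data=top_popular, x='tmdb_popularity', y='title', hue='title', dodge=False, palette='mako')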
plt.title("Top 10 Popular Titles")
plt.xlabel("TMDB Popularity Score")
plt.ylabel("Title")
plt.show()

##### 1. Why did you pick the specific chart?


The chart chosen is a horizontal bar chart (generated using sns.barplot with the numerical score on the x-axis, which makes the bars horizontal, and hue='title' to give each title its own colour), which is well suited to ranking and comparing the magnitude of a single numerical measure across several distinct categories.

Purpose: The chart plots the TMDB Popularity Score (a numerical measure) on the x-axis against the Title (a categorical variable) on the y-axis. This allows for a straightforward comparison and ranking of the top 10 most popular titles.

Ranking: By using a horizontal format, the viewer's eye can easily compare the length of the bars, making the ranking of the top 10 titles based on their popularity score quick and intuitive.

Readability: Horizontal bar charts are often preferred when the category labels (the movie titles) are long, as they are displayed clearly along the y-axis without the need for rotation.

##### 2. What is/are the insight(s) found from the chart?

Assuming the chart plots the TMDB popularity scores for the top 10 titles, the key insights relate to audience interest concentration and the bankability of specific content.

Peak Popularity: The longest bar clearly identifies the single most popular title in the dataset. This title represents the highest current audience interest and potentially the best draw for new subscribers.

Popularity Gap: The difference in length between the longest bar and the tenth bar reveals the concentration of audience interest.

Large Disparity: Suggests the business relies heavily on one or two "mega-hit" titles to drive engagement, which presents a risk if those titles lose traction.

Small Disparity: Indicates a more stable distribution of audience interest across multiple popular titles, suggesting a healthier, diverse catalog of appealing content.

Genre/Format Indicator (Implicit): By observing the titles themselves, one can gain an implicit insight into what genres, franchises, or types of content (e.g., sci-fi series, family films, reality shows) are currently driving the highest popularity scores for the business.
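The "popularity gap" can be quantified instead of judged by eye. A minimal sketch, assuming a `tmdb_popularity` column as in the plotting code; the sample scores are made up, and `top_popular_sample` only mirrors the role of the notebook's `top_popular`.

```python
import pandas as pd

# Hypothetical popularity scores for twelve titles.
titles_sample = pd.DataFrame({
    "title": [f"Title {i}" for i in range(1, 13)],
    "tmdb_popularity": [500.0, 250.0, 200.0, 180.0, 150.0, 120.0,
                        100.0, 90.0, 80.0, 50.0, 10.0, 5.0],
})

# Ten most popular titles, highest first.
top_popular_sample = titles_sample.nlargest(10, "tmdb_popularity")
scores = top_popular_sample["tmdb_popularity"]

# Ratio between the first and tenth title: a large value means a steep drop-off.
gap_ratio = scores.iloc[0] / scores.iloc[-1]

# Share of the top-10 popularity total held by the single leading title.
leader_share = scores.iloc[0] / scores.sum()

print(gap_ratio, round(leader_share, 3))
```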

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis of IMDB and TMDB scores, talent distribution, production geography, and content ratings, the overall solution for the client to achieve their business objective (likely related to subscriber growth, content engagement, and maximizing ROI) is to implement a Data-Driven Content Strategy focused on Quality, Concentration Management, and Global Diversification.

1. Optimize for Proven Quality (IMDB/TMDB Scores)
The analysis of the IMDB and TMDB scatter plot showed that the two external metrics are strongly correlated, so they give a consistent signal of audience reception. The business should use a high IMDB/TMDB score as a primary filter for content acquisition and internal production greenlighting.

Action: Prioritize investment in content that falls into the highest-rated certification categories (as identified by the "Avg IMDB by Certification" chart), as these segments have the highest proven audience satisfaction.

Business Impact: Improves the average quality of the library, directly supporting customer retention and attracting new subscribers seeking critically acclaimed content.
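As an illustration of the score-based filter, the sketch below measures how closely the two scores agree and then shortlists titles that clear a threshold on both. The column names `imdb_score` and `tmdb_score`, the sample values, and the cut-off `MIN_SCORE` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical titles with both external scores; one IMDB score is missing.
titles_sample = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "imdb_score": [8.1, 6.0, 7.4, 5.2, None],
    "tmdb_score": [7.9, 6.3, 7.0, 5.5, 6.8],
})

# Keep only titles that have both scores before comparing them.
scored = titles_sample.dropna(subset=["imdb_score", "tmdb_score"])

# Pearson correlation between the two rating sources.
pearson_r = scored["imdb_score"].corr(scored["tmdb_score"])

# Hypothetical acquisition threshold applied to both scores.
MIN_SCORE = 7.0
shortlist = scored[(scored["imdb_score"] >= MIN_SCORE) & (scored["tmdb_score"] >= MIN_SCORE)]

print(round(pearson_r, 3))
print(shortlist["title"].tolist())
```

Requiring both scores to clear the bar is the stricter choice; a team could instead filter on an average of the two if coverage of either source is patchy.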

2. Manage Talent & Geographic Concentration Risk
The analysis of top actors, directors, and production countries likely showed a high concentration of resources in a few areas. This creates a single point of failure risk that must be mitigated.

Action (Talent): Continue working with the most prolific talent and secure exclusive contracts with them, while developing the next tier of talent to reduce reliance on the top few.

Action (Geography): Actively seek out production or acquisition deals in underrepresented countries to build a diversified library.

Business Impact: Mitigates supply chain risk (e.g., if a star leaves or a major production country shuts down) and unlocks new subscriber markets by offering culturally relevant, local content.
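Talent concentration can be tracked with a simple metric: the share of directed titles attributable to the top few directors. This is a sketch with invented rows; the column names `id`, `name`, and `role` and the cut-off `TOP_N` are assumptions.

```python
import pandas as pd

# Hypothetical director credits: one row per (title, director).
credits_sample = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "name": ["Dir A", "Dir A", "Dir A", "Dir B", "Dir C", "Dir D"],
    "role": ["DIRECTOR"] * 6,
})

directors = credits_sample[credits_sample["role"] == "DIRECTOR"]

# Number of distinct titles per director, most prolific first.
titles_per_director = (
    directors.groupby("name")["id"].nunique().sort_values(ascending=False)
)

# Share of directed titles credited to the TOP_N most prolific directors.
# Co-directed titles count once per director, so treat this as an approximation.
TOP_N = 2
top_n_share = titles_per_director.head(TOP_N).sum() / directors["id"].nunique()

print(titles_per_director)
print(round(top_n_share, 3))
```

Recomputing this share over time would show whether diversification efforts are actually reducing dependence on the top names.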

3. Leverage Popularity for Marketing
The "Top 10 Popular Titles" chart identifies the key content driving current engagement.

Action: Allocate the majority of the marketing budget and platform promotion (homepage placement) to these top-performing titles. Simultaneously, use the high TMDB popularity scores as a key selling point in subscriber acquisition campaigns.

Business Impact: Maximizes the visibility and revenue from existing "hit" content, leading to immediate boosts in traffic and conversion rates.

# **Conclusion**

The comprehensive analysis of the content catalog, utilizing metrics like IMDB/TMDB scores, talent distribution, and geographic origin, provides a clear framework for optimizing the client's business strategy. The key findings reveal a valuable content library but also pinpoint critical areas of risk and opportunity.

1. High-Quality, Predictable Content is the Foundation
The strong positive correlation between IMDB and TMDB scores shows that these external ratings agree with each other and can serve as consistent indicators of audience satisfaction. The client should institutionalize a strategy that prioritizes the acquisition and production of titles that meet a high threshold of external quality scores. Furthermore, the insight into the highest-rated content certifications allows for targeted investment into the content segments (e.g., mature-rated films) that yield the best audience perception.

2. Strategic Risk: Concentration in Talent and Production
The analysis of the top actors, directors, and production countries likely highlighted a significant concentration of titles originating from a few key individuals and a single dominant geographic market (e.g., the United States). This dependency poses a substantial risk:

Mitigation: The business must actively focus on diversifying its talent pool by fostering relationships with the next tier of creatives.

Expansion: Simultaneously, the client must strategically increase production or acquisition volume from underrepresented global markets to de-risk the supply chain and unlock new international subscriber growth.

3. Immediate Action: Maximizing Hits
The identified Top 10 Popular Titles represent the most valuable and bankable assets. The immediate business objective should be to maximize the visibility and profitability of these hits. This means allocating the majority of marketing spend and platform promotional space to these titles, while simultaneously prioritizing the development of related sequels, spin-offs, and similar genre content to fully capitalize on proven audience demand.

In summary, the client's success depends on shifting from broad content investment to a focused, data-backed approach that secures high-quality hits, mitigates talent and geographic risks, and efficiently leverages proven audience appeal for sustained subscriber growth.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***