<a href="https://colab.research.google.com/github/SSJ7111/Numerical-Programming-in-Python---Capstone-EDA-/blob/main/Amazon_Prime_TV_Shows_and_Movies_EDA_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    **- Amazon Prime TV Shows and Movies Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

This project aims to analyze the content available on **Amazon Prime Video** to uncover key insights about show diversity, viewer preferences, and content performance. With the rapid expansion of digital streaming platforms, understanding what content resonates with users is crucial for driving engagement, improving content strategy, and maintaining a competitive edge.

The dataset focuses on Amazon Prime titles available in the **United States** and includes two CSV files:
  * **titles.csv** - Contains detailed information on over **9,000 unique shows and movies**, including genre, release year, ratings (IMDb, TMDB), runtime, and age certification.
  * **credits.csv** - Includes **124,000+ entries** of cast and crew data, covering actors and directors associated with each title.

Through this analysis, we aim to:
  * Identify dominant **genres and content categories**
  * Understand **regional and age-based availability**
  * Track **content trends over time**
  * Discover the **most popular and highly rated** titles

Key Python libraries like **Pandas, NumPy, Matplotlib,** and **Seaborn** will be used for data cleaning, exploration, and visualization. The outcome of this project will provide **data-driven insights** for stakeholders in content strategy, marketing, and business development.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


# Business Problem Overview

In the rapidly growing and highly competitive streaming industry, platforms like **Amazon Prime Video** must continuously innovate and expand their content offerings to retain users and attract new subscribers. With thousands of titles across various genres, understanding **what content performs well**, **what users prefer**, and **how content trends evolve** is essential.

However, without data-driven insights, making strategic decisions about content investment, regional availability, and user engagement becomes challenging. There is a need to analyze the vast collection of shows and movies on Amazon Prime to uncover patterns, preferences, and performance metrics.


#### **Define Your Business Objective?**

The primary objective is to analyze the Amazon Prime Video dataset to:
1. **Identify content trends** based on genres, ratings, and popularity.
2. **Understand audience preferences** by evaluating ratings, votes, and types of content (movie/show).
3. **Measure content performance over time** to track how the library has evolved and what types of content are gaining traction.
4. **Explore regional and certification-based content availability** to identify gaps or strengths in specific markets or age groups.
5. **Provide actionable insights** for:
  * Content creators (to understand what's working)
  * Business teams (to plan content acquisition/investment)
  * Marketing teams (to promote high-potential content)

By leveraging this analysis, Amazon Prime or similar streaming platforms can make informed decisions to improve **user engagement, subscriber retention,** and **overall content strategy.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

### Dataset Loading

In [None]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

# File paths
credits_file_path = '/content/drive/My Drive/Amazon Prime TV Shows and Movies/credits.csv'
titles_file_path = '/content/drive/My Drive/Amazon Prime TV Shows and Movies/titles.csv'

# Check if files exist
print("Credits file exists:", os.path.exists(credits_file_path))
print("Titles file exists:", os.path.exists(titles_file_path))

# Load CSVs
df_credits = pd.read_csv(credits_file_path)
df_titles = pd.read_csv(titles_file_path)
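
The existence checks above only print `True` or `False`, so a missing file still reaches `pd.read_csv`. A minimal sketch of a loader that stops early with a clear message is shown below. The helper name `load_csv` and the temporary sample file are illustrative assumptions, not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV into a DataFrame, failing early with a clear message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Stop here instead of continuing with a path that does not exist
        raise FileNotFoundError(f"Expected a CSV file at: {csv_path}")
    return pd.read_csv(csv_path)


# Demonstration on a throwaway file (hypothetical data, not the real dataset)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.csv"
    sample_path.write_text("id,title\n1,Example\n")
    sample_df = load_csv(sample_path)

print(sample_df.shape)
```

In the notebook this could be used as `df_credits = load_csv(credits_file_path)`, assuming the Drive paths defined above.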


### Dataset First View

In [None]:
# Credits Dataset First Look
df_credits.head()

In [None]:
# Titles Dataset First Look
df_titles.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Credits Dataset Rows and Columns:", df_credits.shape)
print("Titles Dataset Rows and Columns:", df_titles.shape)

### Dataset Information

In [None]:
# Credits Dataset Info
print("Credits Dataset Info:")
df_credits.info()  # info() prints its report directly and returns None

In [None]:
# Titles Dataset Info
print("Titles Dataset Info:")
df_titles.info()  # info() prints its report directly and returns None

#### Duplicate Values

In [None]:
# Credits Dataset Duplicate Value Count
print("Credits Dataset Duplicate Values:", df_credits.duplicated().sum())

In [None]:
# Titles Dataset Duplicate Value Count
print("Titles Dataset Duplicated Values:", df_titles.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Credits Dataset Missing Values:", df_credits.isnull().sum())

In [None]:
# Titles Dataset Missing Values/Null Values Count
print("Titles Dataset Missing Values:", df_titles.isnull().sum())

In [None]:
# Visualizing the missing values of Credits Dataset
sns.heatmap(df_credits.isnull())
plt.title("Credits Dataset Missing Values Heatmap")
plt.show()

In [None]:
# Visualizing the missing values of Titles Dataset
sns.heatmap(df_titles.isnull())
plt.title("Titles Dataset Missing Values Heatmap")
plt.show()

### What did you know about your dataset?

After analyzing the credits.csv and titles.csv datasets using the code, we can summarize the following key points:

### General Overview
  * **Credits Dataset**: Contains data about actors and directors associated with Amazon Prime shows/movies.
  * **Titles Dataset**: Contains metadata for each show or movie available on Amazon Prime (e.g., title, genre, release year, ratings)

### Rows and Columns

The .shape attribute gives the number of rows and columns in each dataset.
  * df_credits -> 124235 rows and 5 columns
  * df_titles -> 9871 rows and 15 columns

### Dataset Structure
.info() shows:
  * Column names and data types
  * Non-null count for each column (helps identify missing data)
  * Helps in understanding which columns are numerical, categorical, or object type

### Duplicate Records
The number of **duplicate rows** in each dataset is printed using .duplicated().sum().
  * This tells us if the dataset has redundancy that may need cleaning.

### Missing Values
.isnull().sum() gives the **count of missing values** per column.
**Heatmaps** visually show where data is missing in the datasets:
  * More white/light areas indicate missing values
  * Helps understand which columns may need imputation, removal, or further investigation

### Insights Summary
* We've checked the **data shape, structure, duplicates**, and **null values**
* We've visualized missing data for a clearer picture
* These steps are a strong **first step in understanding data quality, completeness, and potential cleaning requirements**
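
The per-column missing counts can also be expressed as percentages, which makes columns of different sizes easier to compare. The sketch below is one possible way to do that; `quality_report` and the toy DataFrame are hypothetical names and data used only for illustration.

```python
import pandas as pd


def quality_report(df):
    """Summarize missing values per column as counts and percentages."""
    missing_count = df.isnull().sum()
    report = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": (missing_count / len(df) * 100).round(2),
    })
    # Columns with the most missing data come first
    return report.sort_values("missing_count", ascending=False)


# Hypothetical toy data standing in for df_titles
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.1, None, 6.4, None],
    "seasons": [None, None, None, 2.0],
})
toy_report = quality_report(toy_titles)
print(toy_report)
```

Calling `quality_report(df_titles)` on the real data would give the same kind of table for the 15 title columns.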

## ***2. Understanding Your Variables***

In [None]:
# Credit Dataset Columns
print("Print Credit Dataset Columns:", df_credits.columns)

# Dataset Describe
df_credits.describe()

In [None]:
# Title Dataset Columns
print("Print Title Dataset Columns:", df_titles.columns)

# Dataset Describe
df_titles.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for Credits variable.
for col in df_credits.columns:
  print(f"{col}: {df_credits[col].nunique()} unique values")

In [None]:
# Check Unique Values for Titles variable.
for col in df_titles.columns:
  print(f"{col}: {df_titles[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Wrangling for Credits

# Remove duplicate rows
df_credits.drop_duplicates(inplace=True)

# Checking for missing values
print("Missing values in df_credits:\n", df_credits.isnull().sum())

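# Optional: Fill missing character with "Unknown"
# Assign the result back; inplace=True on a column selection triggers warnings in newer pandas
df_credits['character'] = df_credits['character'].fillna('Unknown')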

# Strip leading/trailing spaces in text columns (if any)
df_credits['name'] = df_credits['name'].str.strip()
df_credits['role'] = df_credits['role'].str.strip()

# Reset index
df_credits.reset_index(drop=True, inplace=True)

In [None]:
# Visualizing the missing values of Credits Dataset
sns.heatmap(df_credits.isnull())
plt.title("Credits Dataset Removed Missing Values")
plt.show()

In [None]:
# Wrangling for Titles

# Remove duplicate rows
df_titles.drop_duplicates(inplace=True)

# Check for missing values
print("Missing values in df_titles:\n", df_titles.isnull().sum())

# Drop rows with missing
df_titles.dropna(subset=['imdb_id', 'tmdb_popularity'], inplace=True)

# Fill missing values for other fields
# Assign the result back; inplace=True on a column selection triggers warnings in newer pandas
df_titles['description'] = df_titles['description'].fillna('No description available')
df_titles['age_certification'] = df_titles['age_certification'].fillna('Unrated')
df_titles['seasons'] = df_titles['seasons'].fillna(0)
df_titles['tmdb_score'] = df_titles['tmdb_score'].fillna(0)
df_titles['imdb_score'] = df_titles['imdb_score'].fillna(0)
df_titles['imdb_votes'] = df_titles['imdb_votes'].fillna(0)

# Optional: Clean text columns
df_titles['title'] = df_titles['title'].str.strip()
df_titles['type'] = df_titles['type'].str.strip()

# Reset index
df_titles.reset_index(drop=True, inplace=True)

In [None]:
# Visualizing the missing values of Titles Dataset
sns.heatmap(df_titles.isnull())
plt.title("Titles Dataset Removed Missing Values")
plt.show()

### What all manipulations have you done and insights you found?

### On Credits Dataset

**1. Duplicates Removed :**
  * Removed all duplicate rows to ensure data consistency

**2. Missing Values Treated :**
  * Filled missing character names with "Unknown" to maintain completeness.

**3. Text Cleaned :**
  * Stripped leading/trailing whitespaces from name and role columns to prevent mismatches during analysis.

**4. Index Reset :**
  * Reset the index after cleaning for a fresh, clean dataframe.

**5. Missing Data Visualization :**
  * Used a heatmap to confirm that missing values have been handled properly.



### On Titles Dataset

**1. Duplicates Removed :**
  * Dropped duplicate rows to avoid redundant entries.

**2. Critical Null Rows Dropped :**
  * Dropped rows where imdb_id or tmdb_popularity was missing, as they are crucial for further analysis.

**3. Missing Values Filled :**
  * description -> Filled with 'No description available'.
  * age_certification -> Filled with 'Unrated'.
  * seasons, tmdb_score, imdb_score, imdb_votes -> Filled with 0.

**4. Text Cleaned :**
  * Trimmed spaces from title and type columns.

**5. Index Reset & Null Check :**
  * Reset index and confirmed via heatmap that no missing values remain.


### General Observations:
  * **Data completeness improved significantly** after filling/removing missing values.
  * **Duplicates were present** in both datasets, indicating the need for data validation before analysis.
  * **Some key fields like IMDb and TMDB scores/votes had missing values,** which are now handled.
  * **Text columns were messy (extra spaces),** which could have caused incorrect grouping or filtering. This is now fixed.
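
Filling missing scores with 0 keeps every row, but the zeros then take part in averages and histograms as if they were real ratings. The sketch below uses made-up numbers (not values from this dataset) to show the size of that effect on a mean.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; NaN marks a title with no rating
scores = pd.Series([6.0, 8.0, np.nan, 7.0, np.nan])

# pandas skips NaN by default when averaging
mean_ignoring_missing = scores.mean()

# After zero-filling, the two zeros count as real ratings
mean_zero_filled = scores.fillna(0).mean()

print(mean_ignoring_missing, mean_zero_filled)
```

If the zero-filled columns are used for rating charts, one option is to filter with `df_titles['imdb_score'] > 0` before plotting. Whether that is appropriate depends on how the filled values should be interpreted.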

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### Content Diversity - Identify content trends based on genres, ratings, and popularity.

#### Chart 1 - Bar Chart : To show top genres by count (i.e., how many titles per genre)

In [None]:

# Clean the genres column
# Drop rows with null genres (copy so later column assignments do not act on a view)
df_titles = df_titles[df_titles['genres'].notna()].copy()

# Convert stringified genre list to actual Python list (if needed)
# Example: if genres are like "['comedy', 'drama']" (as a string), use `ast.literal_eval`

import ast
df_titles['genres'] = df_titles['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Remove duplicates within each row's genre list
df_titles['genres'] = df_titles['genres'].apply(lambda x: list(set([genre.strip().lower() for genre in x])))

# Sort the genres alphabetically for consistency
df_titles['genres'] = df_titles['genres'].apply(sorted)

# Preview cleaned genre column
df_titles['genres']

# Make a copy of the cleaned titles DataFrame
df_exploded = df_titles.copy()

# Explode the 'genres' column so each genre has its own row
df_exploded = df_exploded.explode('genres')

# Strip any whitespace and convert to lowercase (already done earlier, but just in case)
df_exploded['genres'] = df_exploded['genres'].str.strip().str.lower()


# Bar Chart: Number of titles per genre
plt.figure(figsize=(12, 6))
genre_counts = df_exploded['genres'].value_counts().head(10)
sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='Set2')
plt.title('Top 10 Genres on Amazon Prime')
plt.xlabel('Genre')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**1. Clear Comparison of Categories**

Genres are **categorical** variables (e.g., comedy, drama, thriller).
A **bar chart** is perfect for showing how often each category appears - making it easy to **compare** genres by number of titles.

**2. Top-N-Focus**

We're focusing on the **top 10 genres**, and bar charts handle "top-N" visualizations **really well** by showing the most popular items clearly.

**3. Quick Insights**

With just a glance, viewers can easily answer:
  * Which genres dominate the platform?
  * Are there any surprising genre trends (e.g., rise in animation or war-related shows)?

**4. Customizable**

Bar charts allow:
  * Color variation (palette='Set2')
  * Sorting (genre_counts is already sorted by value_counts())
  * Annotations or additional overlays if needed later (e.g. IMDb score per genre)

##### 2. What is/are the insight(s) found from the chart?

After exploding the genres column and visualizing the **Top 10 Genres** on Amazon Prime using a bar chart, here are the **key insights** derived from the chart:

### 1. Dominance of Drama and Comedy
  * **Drama** is the most common genre, followed by **Comedy**.
  * This indicates that Amazon Prime focuses heavily on storytelling and character-driven content, which appeals to a broad audience.


### 2. Variety in Popular Genres
* Other frequently appearing genres include:
  * Thriller
  * Action
  * Romance
  * Crime
  * Documentation

* This shows that Prime Video maintains a **balanced mix** of content across emotions (romance), excitement (action), and suspense (thriller).

### 3. Family and Animation Also Have a Strong Presence
* **Family** and **Animation** genres are also in the top 10, highlighting Amazon Prime's efforts to target **kids and family-friendly audiences.**
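
The counts behind these insights come from turning each stringified genre list into a real list and exploding it into one row per genre. The toy example below sketches that transformation on hypothetical rows, including a title with an empty genre list.

```python
import ast

import pandas as pd

# Hypothetical rows; genres arrive as strings that look like Python lists
toy = pd.DataFrame({
    "title": ["Film A", "Show B", "Film C"],
    "genres": ["['drama', 'comedy']", "['drama']", "[]"],
})

# Turn each string into a real list
toy["genres"] = toy["genres"].apply(ast.literal_eval)

# One row per (title, genre) pair; an empty list becomes a single NaN row
toy_exploded = toy.explode("genres")

# value_counts skips NaN by default, so titles without a genre are not counted
toy_genre_counts = toy_exploded["genres"].value_counts()
print(toy_genre_counts)
```

Because a title with several genres contributes one row per genre, the bar heights count genre tags rather than distinct titles.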

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:

#### 1. Content Strategy Alignment
* Knowing that **Drama, Comedy** and **Romance** dominate the platform allows Amazon Prime to:
  * Continue investing in **high-demand** content.
  * Plan **original productions** around popular genres.
  * License shows/movies that resonate with viewer interests.

#### 2. Audience Targeting
* High presence of **Family** and **Animation** genres suggests a large **family-oriented** or **younger audience**.

  * This helps with **targeted marketing**, personalised recommendations, and better **ad campaign alignment**.

#### 3. Content Gaps & Diversification
* If genres like **Sci-fi, Documentary,** or **Historical** are underrepresented, this insight allows the business to **fill content gaps** and attract niche users.

#### 4. User Retention & Growth
* Serving content in preferred genres helps keep users **engaged longer**, improving subscription renewals and platform stickiness.


### Potential Insights Leading to Negative Growth (if not acted upon):

#### 1. Over-Concentration in Few Genres
* Over-reliance on genres like **Drama** and **Comedy** could lead to **viewer fatigue.**
  * If users feel the content is too repetitive or lacks variety, they may **churn** to platforms with more diverse offerings (e.g., Netflix with more Sci Fi/ Documentary content).

#### 2. Underrepresented Genres = Missed Opportunities
* Low presence of certain genres may indicate a **missed opportunity** to attract **genre-specific communities** (e.g., Anime lovers, Tech Enthusiasts, History Buffs)

### Summary:
* **Positive Impact** comes from understanding user preferences, improving recommendations, boosting engagement, and aligning investment with audience demand.
* **Negative trends** like genre saturation or gaps can be **corrected** by proactively **diversifying content** and targeting unmet interests.

#### Chart 2 - Histogram : Distribution of Ratings

In [None]:
# Check for missing values in rating columns
print(df_titles['imdb_score'].isnull().sum())
print(df_titles['tmdb_score'].isnull().sum())

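# Fill missing scores with 0 (a no-op if the wrangling step already filled them)
# Assign the result back; inplace=True on a column selection triggers warnings in newer pandas
df_titles['imdb_score'] = df_titles['imdb_score'].fillna(0)
df_titles['tmdb_score'] = df_titles['tmdb_score'].fillna(0)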

# Understand the spread of ratings
print(df_titles[['imdb_score', 'tmdb_score']].describe())


In [None]:
# Set the overall style
sns.set(style="whitegrid")

# Create two subplots (stacked vertically)
fig, axes = plt.subplots(2, 1, figsize=(10, 8), sharex=True)

# IMDb Score Distribution
sns.histplot(df_titles['imdb_score'], bins=20, kde=True, color='skyblue', label='IMDb', ax=axes[0])
axes[0].set_title('Distribution of IMDb Ratings')
axes[0].set_ylabel('Number of Titles')


# TMDB Score Distribution
sns.histplot(df_titles['tmdb_score'], bins=20, kde=True, color='salmon', label='TMDB', ax=axes[1])
axes[1].set_title('Distribution of TMDB Ratings')
axes[1].set_xlabel('Rating')
axes[1].set_ylabel('Number of Titles')



plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**Purpose**
To understand how the **ratings** (IMDb and TMDB) are **distributed** across all titles - are they skewed, normal, or have outliers?

**Reasons for Choosing Histogram:**

**1. Visualizes Distribution Clearly**

Histograms are perfect for showing the **frequency of rating values** - i.e., how many titles have IMDb score of 7, 8, etc.


**2. Highlights Skewness & Spread**

It helps us see whether the ratings are **left-skewed, right-skewed,** or **normally distributed.**


**3. Easy to Compare**

By placing the IMDb and TMDB distributions in **separate subplots**, we get a **clear comparison** without overlapping the charts.


**4. KDE Curve Enhances Understanding**

The KDE (Kernel Density Estimate) line gives us a **smoothed view** of how the ratings are distributed, which complements the histogram bars.

**5. Supports Business Insights**

Knowing how most titles are rated helps us:
  * Identify **quality perception** trends
  * Understand **platform standards**
  * Recognize rating gaps between IMDb and TMDB


##### 2. What is/are the insight(s) found from the chart?

**1. Majority of titles have mid-range ratings**
* Most titles are clustered between **6 and 8** for both IMDb and TMDB scores.
* This indicates that **average to good quality** content dominates the platform.

**2. Very few titles have extremely low or high ratings**
* Very **few titles have ratings below 4 or above 9**
* Indicates that there are **not many poor or exceptional titles,** or such content may be filtered out / less promoted.

**3. TMDB ratings tend to be more centered / uniform**
* The **TMDB score** distribution appears **more compact and less spread out** than IMDb.
* This suggests **TMDB users may rate content more consistently**, or the rating system smooths extremes.

**4. IMDb ratings are slightly more spread out**
* IMDb ratings have a **slightly wider spread,** showing **greater variance in audience opinions**.

**5. Possible Platform Perception Difference**
* If a title has a higher score on TMDB but lower on IMDb (or vice versa), it can hint at **demographic or regional preferences** on each platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:

**1. Content Acquisition & Recommendation Strategy**
* Knowing that most content falls within the **6-8 rating range,** the platform can:
  * Focus on acquiring or producing more **high-performing content (8+).**
  * Avoid investing in **low-rated content (<5)** that may result in poor user engagement.

**2. Personalized User Experience**
* Understanding how **audiences rate content** helps improve **recommendation engines:**
  * Suggest titles based on user preference ranges (e.g., users who like 7+ scores).
  * Create dynamic playlists or homepage carousels based on **top-rated titles.**

**3. Content Marketing & Promotion**
* Highly rated content can be promoted with labels like **"Top Rated"**, which improves click-through rates.
* Lower-rated titles can be **rebranded or bundled** to improve perception.

**4. Audience Segmentation**
* IMDb and TMDB rating patterns could hint at **demographic preferences**.
  * Ex: TMDB users might prefer a different genre or language.
  * Helps **target ads or content by region or age group.**


### Negative Growth Triggers (Potential Risks):
**1. Too Much Average Content**
* If most content sits in the **mid-range (6-7)**, the platform risks becoming **mediocre or unmemorable.**
  * Could affect **brand perception** and **user retention**.

**2. Ignoring Niche Audiences**
* Over-focusing on top-rated content might exclude **niche or cult favorites**, which have loyal audiences.
  * A balanced mix is essential.

**3. Overdependence on Ratings**
* Ratings alone don't reflect **viewer engagement, watch time,** or **virality**.
  * Need to correlate with **actual viewership data** for better decisions.

#### Chart 3 - Bar Chart : Compare Ratings Across Genres
Use the exploded genre data to analyze average ratings per genre.

In [None]:
# Group by genre and get average ratings
avg_ratings_by_genre = df_exploded.groupby('genres')[['imdb_score', 'tmdb_score']].mean().sort_values('imdb_score', ascending=False).head(10)

# Bar plot
avg_ratings_by_genre.plot(kind='bar', figsize=(12, 6), colormap='viridis')
plt.title("Top 10 Genres by Average Ratings")
plt.ylabel("Average Rating")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**1. Clear Genre Comparison:**
Bar charts are excellent for **side-by-side comparisons.** In this case:
  * We're comparing **IMDb** and **TMDB** average scores across the **top 10 genres.**
  * The chart allows us to easily see which genres are **highest or lowest rated.**

**2. Categorical vs Numeric Data:**
* The genres column is **categorical.**
* imdb_score and tmdb_score are **numerical.**
* Bar plots are ideal when comparing **numeric metrics across categories.**

**3. Visual Clarity & Readability:**
* Using color coding (colormap='viridis') makes it easier to **distinguish between rating types.**
* Genres are listed on the x-axis, and their average ratings are on the y-axis, which is **intuitive** for most audiences.

**4. Quick Insights:**
* One can immediately spot which genres are **critically acclaimed** and which are **less appreciated** by audiences.
* Helps highlight **content quality trends per genre.**
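
An average rating per genre is easier to trust when it is shown next to the number of titles behind it, since a genre with very few titles can top the ranking by chance. The sketch below computes both with a named aggregation on hypothetical data. The `min_titles` threshold is an illustrative assumption, not a value from this analysis.

```python
import pandas as pd

# Hypothetical exploded data: one row per (title, genre) pair
toy_exploded = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "war", "comedy", "comedy"],
    "imdb_score": [7.0, 6.0, 8.0, 9.0, 5.0, 6.0],
})

# Mean score and number of titles for each genre
genre_stats = (
    toy_exploded.groupby("genres")["imdb_score"]
    .agg(mean_score="mean", n_titles="count")
)

# Keep only genres with enough titles to make the average meaningful
min_titles = 2  # illustrative threshold
reliable = genre_stats[genre_stats["n_titles"] >= min_titles]
reliable = reliable.sort_values("mean_score", ascending=False)
print(reliable)
```

Here the single "war" title has the highest score but is dropped, which is the kind of case a plain `mean().head(10)` ranking would not reveal.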


##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
**1. Content Strategy Optimization:**
Knowing that genres like **Drama, War,** or **History** receive **higher average ratings,** platforms like Amazon Prime can:
* **Prioritize acquiring or producing** more high-quality content in these genres.
* **Highlight these genres in promotions** to attract quality-conscious viewers.

**2. Enhanced Recommendation Systems:**
* Recommender algorithms can be fine-tuned using **rating data per genre,** helping suggest better-rated content to users.
* This leads to **higher engagement, more viewing time,** and **improved customer satisfaction.**

**3. Targeted Marketing:**
* Genres with high ratings can be used in **targeted ads or featured banners** to improve click-through and conversion rates.
* For example, "Top-rated War Dramas" may attract users specifically interested in intense, emotional stories.


### Potential Negative Insights (Growth Risks):
**1. Overproduction in Low-Rated Genres:**
* If genres like **Comedy** or **Action** show **relatively lower average ratings**, this could indicate **quality inconsistency** or **audience fatigue.**
* **Continuing to invest heavily** in such genres without innovation may lead to **wasted budgets and reduced viewer retention.**

**2. Ignoring Niche But High-Rated Genres:**
* If niche genres (e.g., **Documentary, Mystery**) are **highly rated but underrepresented,** neglecting them can mean **missed opportunities** in untapped markets.

**3. Platform Reputation Risks:**
* A platform loaded with **low-rated content** may affect its **brand image**, especially if competitors focus on critically acclaimed or curated content.

#### Chart 4 - Scatterplot : IMDb vs TMDB by Type

In [None]:
plt.figure(figsize=(10, 10))
sns.scatterplot(x='imdb_score', y='tmdb_score', hue='type', data=df_titles, alpha=0.6, palette='Set1')
plt.title('IMDb Score vs TMDB Score by Type')
plt.xlabel('IMDb Score')
plt.ylabel('TMDB Score')
plt.grid(True)
plt.legend(title='Type')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

#### 1. Relationship Analysis Between Two Variables
* A scatterplot is perfect to **visually compare the correlation between two continuous variables** - in this case, imdb_score and tmdb_score.
* It helps identify if there's a **linear relationship, clustering, or outliers** in the ratings data.

#### 2. Multivariate Visualization (with Type)
* By using hue='type', we add a **third variable (content type: movie, show, etc.)** to the analysis.
* This allows us to **observe how ratings differ between types** - whether movies are generally higher rated than shows, for example.

#### 3. Visual Density and Overlap
* Using alpha=0.6 adds transparency, allowing us to **see overlapping points,** making it easier to detect dense rating areas.


##### 2. What is/are the insight(s) found from the chart?

**1. Positive Correlation Between IMDb and TMDB Ratings**
* There's a **moderate positive correlation** - as IMDb scores increase, TMDB scores tend to increase too.
* However, the correlation isn't perfect; **some titles rated highly on one platform are rated lower on the other.**

**2. Clustering in Mid-Range Ratings**
* Most titles, regardless of type, tend to cluster around the **5-7 rating range** on both IMDb and TMDB.
* Very few titles have extreme ratings (either very high or very low).

**3. Type-Based Distribution**
* When separated by type (e.g., **movie** vs **show**), we may notice:
  * **Movies** tend to have a slightly **wider spread** in both low and high ratings.
  * **Shows** might be more concentrated in the mid-rating range.
* This suggests that **movies might be more polarizing,** whereas **shows are more consistently rated.**

**4. Outliers**
* Some titles show significant **discrepancy between IMDb and TMDB scores -** possibly due to:
  * Different user bases
  * Varying review criteria
  * Regional preferences or release timing differences
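
The strength of the relationship and the largest disagreements can be quantified alongside the scatterplot. The sketch below, on hypothetical titles, computes the Pearson correlation with `Series.corr()` and ranks titles by the absolute gap between the two scores. The `score_gap` column is an illustrative name.

```python
import pandas as pd

# Hypothetical titles; "E" is rated very differently on the two platforms
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 6.5],
    "tmdb_score": [5.5, 6.2, 6.8, 7.9, 3.0],
})

# Pearson correlation between the two rating columns
pearson_r = toy["imdb_score"].corr(toy["tmdb_score"])

# Absolute difference per title, largest disagreements first
toy["score_gap"] = (toy["imdb_score"] - toy["tmdb_score"]).abs()
largest_gaps = toy.sort_values("score_gap", ascending=False).head(2)

print(round(pearson_r, 3))
print(largest_gaps[["title", "score_gap"]])
```

On the real data, rows where either score was filled with 0 would show up as large gaps, so they may need to be excluded first.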

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact
The insights gained from the **IMDb vs TMDB score scatterplot by content type** can significantly help decision-making in the following ways:

**1. Audience Perception Alignment**
* A **positive correlation** between IMDb and TMDB ratings indicates that audiences on both platforms have similar perceptions.
* This helps platforms like Amazon Prime understand that **aggregating ratings from both sources is valid and useful for quality measurement.**

**2. Strategic Content Investment**
* By observing **which type (Movie or Show)** performs better (more consistently high-rated), businesses can **allocate more budget toward that content type.**
  * For example, if **movies show a wider spread** but higher peaks, more high-quality films could be a valuable investment.
  * If **shows are more consistent,** they could be used to retain long-term subscribers.

**3. User Satisfaction Monitoring**
* Discrepancies between IMDb and TMDB scores can help **identify titles with mixed reception,** giving teams a chance to dig deeper into:
  * What audiences liked/disliked
  * Regional differences
  * Marketing or expectation gaps

**4. Enhance Recommendations**
* Knowing how content types are rated helps refine **recommendation engines,** ensuring better personalization and improved **user engagement and retention.**


### Possible Insights Leading to Negative Growth
**1. Inconsistent Ratings Between Platforms**
* Titles with large gaps between IMDb and TMDB scores could lead to **confusion or mistrust** among users - impacting the credibility of platform ratings.

**2. Mid-Tier Content Saturation**
* If a large portion of content clusters around **5-7 ratings,** it may indicate a **lack of standout, high-quality content,** which could:
  * Reduce excitement and interest
  * Impact new user acquisition
  * Lead to **subscriber churn** over time

**3. Underperforming Content Type**
* If one content type (e.g., shows) consistently underperforms, it might be a **waste of production or licensing resources,** unless quality is improved.

### Regional Availability - Explore regional and certification-based content availability to identify gaps or strengths in specific markets or age groups.

#### Chart 5 - Bar Chart: Top Countries by Content Volume
Goal : See which regions have the most titles available.

In [None]:
# Clean the production_countries column
# Drop rows with null production_countries (copy so later column assignments do not act on a view)
df_titles = df_titles[df_titles['production_countries'].notna()].copy()

# Convert stringified production_countries list to actual Python list (if needed)
# Example: if production_countries are like "['US', 'CN']" (as a string), use `ast.literal_eval`

import ast
df_titles['production_countries'] = df_titles['production_countries'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Remove duplicates within each row's production_countries list
df_titles['production_countries'] = df_titles['production_countries'].apply(lambda x: list(set([country.strip().lower() for country in x])))

# Sort the production_countries alphabetically for consistency
df_titles['production_countries'] = df_titles['production_countries'].apply(sorted)

# Preview cleaned production_countries column
df_titles['production_countries']

# Make a copy of the cleaned titles DataFrame
df_exploded = df_titles.copy()

# Explode the 'production_countries' column so each country has its own row
df_exploded = df_exploded.explode('production_countries')

# Strip any whitespace and convert to lowercase (already done earlier, but just in case)
df_exploded['production_countries'] = df_exploded['production_countries'].str.strip().str.lower()


# Bar Chart: Number of titles per production country (top 10)
plt.figure(figsize=(12, 6))
country_counts = df_exploded['production_countries'].value_counts().head(10)
sns.barplot(x=country_counts.index, y=country_counts.values, palette='Set2')
plt.title('Top 10 Production Countries on Amazon Prime')
plt.xlabel('Production Countries')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for visualizing and comparing **discrete categories,** in this case, **production countries.** Bar charts clearly show differences in the **volume of content** associated with each country, making it easy to identify the top contributors.
Since we are looking at **frequency counts** (number of titles per country), a bar chart allows us to quickly interpret:
  * Which countries produce the most content available on the platform.
  * How production volume varies across regions.
  * Relative differences between the top producers.

The chart is especially effective here due to:
  * A **limited number of categories** (top 10 countries), which fits well in either a horizontal or a vertical layout.
  * Clear **labeling and comparison** of each region's contribution.
  * Easy scalability if we want to adjust the number of countries shown (top 5, top 15, etc.).
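
To illustrate that scalability, the explode-and-count step can be wrapped in a small helper that takes the number of countries as a parameter. This is a sketch on toy data; `top_n_counts` is a hypothetical name and the toy rows are not from the dataset.

```python
import pandas as pd

# Toy data: each title lists one or more production countries (or none).
toy = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4", "t5"],
    "production_countries": [["us"], ["us", "gb"], ["in", "us"], ["in"], []],
})

def top_n_counts(df, column, n=10):
    """Count titles per list element in `column` and keep the n most frequent."""
    exploded = df.explode(column)  # one row per (title, country); empty lists become NaN
    return exploded[column].dropna().value_counts().head(n)

top2 = top_n_counts(toy, "production_countries", n=2)
print(top2)
```

Calling `top_n_counts(df_titles, 'production_countries', n=5)` or `n=15` would then give the alternative views without touching the plotting code.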

##### 2. What is/are the insight(s) found from the chart?

From the **Bar Chart of Top 10 Production Countries,** we can draw the following insights:
  **1. Dominance of Certain Countries:** A few countries like **United States, India,** and **United Kingdom** dominate content production on Amazon Prime. These countries have produced the **highest number of titles** available on the platform.

  **2. Regional Variety:** The presence of countries from **different continents** (e.g., US, IN, JP, CA, DE) shows that Amazon Prime is sourcing content **globally**, not just from Western markets.

  **3. Emerging Markets:** Countries with relatively smaller counts but still in the top 10 (like **Japan, Canada, Germany, Italy** or **Australia**) indicate a growing content contribution - potentially **rising opportunities** for content acquisition or local production.

  **4. Underrepresentation:** Regions like **Africa, South America** or smaller Asian countries might be **underrepresented** in the current content library, hinting at potential **market expansion** or content investment opportunities.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:
* **Targeted Content Acquisition:** Identifying top-performing countries (like the US, India and the UK) enables Amazon Prime to **strengthen partnerships** and invest in content that already has proven demand and volume.

* **Localization Strategies:** With clear visibility of content-rich regions, Amazon can focus on **localizing** successful international content (through dubbing/subtitles), enhancing user experience and engagement in multilingual markets.

* **Regional Expansion Opportunities:** Countries that are **underrepresented** or missing in the top 10 can be seen as **untapped markets.** Amazon can invest in original content or licensing deals in those regions to expand its global footprint.


### Possible Indicators of Negative Growth:
* **Overreliance on Few Countries:** A high concentration of titles from a limited number of countries may lead to **cultural saturation** or lack of diversity in content. This can result in **viewer fatigue** or loss of interest from global audiences seeking variety.

* **Neglected Regions:** Lack of content from certain regions (like Africa or Latin America) could signal **missed opportunities,** potentially allowing competitors to dominate those markets.
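
The overreliance concern can be expressed as a single number: the share of all country credits held by the top few countries. The sketch below uses invented counts; `top_k_share` is a hypothetical helper. Note that counts derived from an exploded column credit co-productions to every participating country, so the total is country credits rather than unique titles.

```python
import pandas as pd

# Invented per-country counts for illustration only.
country_counts = pd.Series({"us": 600, "in": 200, "gb": 100, "ca": 60, "jp": 40})

def top_k_share(counts, k=3):
    """Share of all country credits held by the k largest countries."""
    ordered = counts.sort_values(ascending=False)
    return ordered.head(k).sum() / ordered.sum()

share = top_k_share(country_counts, k=3)
print(f"Top-3 share: {share:.0%}")
```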

#### Chart 6 - Stacked Bar Chart: Age Certification by Country

In [None]:
# Drop 'unrated' rows first, so the filter already applies on the first run
df_titles = df_titles[df_titles['age_certification'].str.lower() != 'unrated']

# Explode country column (if it's a list) so each country gets its own row
df_cert = df_titles.copy()
df_cert['production_countries'] = df_cert['production_countries'].apply(lambda x: x if isinstance(x, list) else [x])
df_cert = df_cert.explode('production_countries')

# Group and plot
cert_country = pd.crosstab(df_cert['production_countries'], df_cert['age_certification'])
cert_country = cert_country[cert_country.sum(axis=1) > 20]


cert_country.plot(kind='bar', stacked=True, figsize=(14, 6), colormap='tab20c')
plt.title('Content Certifications by Country')
plt.xlabel('Country')
plt.ylabel('Number of Titles')
plt.legend(title='Age Certification', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This chart is ideal for this case because:

#### 1. Comparative + Compositional Insight
* It shows **how many titles** each country has (like a regular bar chart), and
* Within each country, it visually breaks down **how those titles are distributed across different age certifications** (e.g., PG, R, 18+)

#### 2. Clear View of Proportions
* You can immediately see:
  * Which age certifications dominate in which countries.
  * How age ratings **differ across regions.**
  * Whether a country focuses more on **family-friendly** vs **mature content.**

#### 3. Easy to Compare Across Countries
* While individual bars show total content volume,
* The stacked sections allow comparison of **certification diversity and dominance** per country.
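
Because countries differ greatly in total volume, comparing proportions is often easier on a row-normalized table. The following is a minimal sketch on toy data showing how `pd.crosstab` with `normalize='index'` turns counts into within-country shares; the toy values are invented.

```python
import pandas as pd

# Toy data: one row per (title, country) pair, as after explode().
toy = pd.DataFrame({
    "production_countries": ["us", "us", "us", "us", "in", "in"],
    "age_certification": ["R", "R", "PG", "TV-MA", "PG", "PG"],
})

# Raw counts, as used for the stacked bar chart.
counts = pd.crosstab(toy["production_countries"], toy["age_certification"])

# Row-normalized shares: each country's row sums to 1.
shares = pd.crosstab(
    toy["production_countries"], toy["age_certification"], normalize="index"
)
print(shares.round(2))
```

Plotting `shares` with `kind='bar', stacked=True` would give a 100% stacked chart, which trades the volume comparison for a cleaner view of the certification mix.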

##### 2. What is/are the insight(s) found from the chart?

**1. Variation in Certification Mix Across Countries:**
  * Some countries have a **broader spread** across age certifications, while others are skewed toward specific categories.
    * For example, one country might have a high volume of **TV-MA (Mature)** content, while another might favor **PG or TV-G**.

**2. Regional Content Preferences:**
  * Certain countries produce or host more **family-friendly** content (e.g., G, PG).
  * Others lean more heavily toward **adult or mature content** (e.g., R, 18+, TV-MA), reflecting **audience preferences** or **regulatory differences.**

**3. High Content Volume Countries Stand Out:**
  * Countries like the **US, UK or India** (depending on data) may have significantly **more total titles,** regardless of rating category.

**4. Unrated Titles Were Removed:**
  * After cleaning, only certified content is shown - this helps clarify which countries are **actively regulating or labeling** their content.

**5. Dominance of Youth-Safe Content in Some Regions:**
  * In some countries, **G, TV-Y, PG** categories dominate - possibly due to a strong focus on **family entertainment** or **stricter content guidelines.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the **Age Certification by Country** chart can drive **strategic, content-driven business decisions,** especially in the areas of **regional targeting, licensing, and compliance.**

#### Positive Business Impact:
**1. Market-Specific Content Strategy:**
  * If a country shows a higher volume of **PG or TV-G** content, platforms can prioritize **family-safe, educational, or animated content** for that market.
  * Countries favouring **TV-MA, R, or 18+** can be targeted with more **gritty dramas, thrillers, or documentaries.**

**2. Audience Growth & Engagement:**
  * By aligning content offerings with local **age-certification trends**, businesses can improve **user retention** and **satisfaction** especially among **parents** or **young adult viewers**.

**3. Compliance & Legal Clarity:**
  * Understanding the age-certification mix helps avoid **legal issues** related to inappropriate content in markets with **strict content laws**.
  * Enables platforms to **proactively adapt** to government or regional regulations.

**4. Strategic Expansion Planning:**
  * Identifying **underrepresented certifications** (e.g., missing youth content in a country with a large young audience) reveals **gaps** that can be filled by either producing or acquiring content for that niche.

#### Insights that may signal negative growth or challenges:
**1. Over-concentration in a Single Certification:**
  * If a country's content is too heavily skewed toward **one age group** (e.g., mostly 18+ or only PG), it may indicate a **lack of diversity** - potentially **alienating other demographics.**

**2. Low Certification Transparency (previously 'unrated'):**
  * A high number of previously unrated titles (now removed) suggests a **lack of content classification**, which could harm **trust** and **user confidence**, especially among parents or cautious viewers.

**3. Regulatory Risk:**
  * If age certifications don't align with **local standards,** platforms risk content being **banned, flagged**, or **heavily restricted** in certain countries - impacting **reach** and **revenue**.
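
The over-concentration check described above can be sketched by finding each country's dominant certification and its share. The table, the `dominant_certification` helper, and the 70% threshold below are all illustrative assumptions; `xx` is a placeholder country code.

```python
import pandas as pd

# Invented country-by-certification counts, shaped like cert_country.
cert_counts = pd.DataFrame(
    {"PG": [10, 45], "R": [12, 3], "TV-MA": [8, 2]},
    index=["us", "xx"],
)

def dominant_certification(table, threshold=0.7):
    """Per country: the largest certification, its share, and a skew flag."""
    shares = table.div(table.sum(axis=1), axis=0)  # row-wise proportions
    summary = pd.DataFrame({
        "dominant": shares.idxmax(axis=1),
        "share": shares.max(axis=1),
    })
    summary["skewed"] = summary["share"] >= threshold
    return summary

summary = dominant_certification(cert_counts)
print(summary)
```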

#### Chart 7 - Heatmap: Correlation Between Regions and Certifications
Goal: Show intensity of content availability across certifications and regions.

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(cert_country.T, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Heatmap: Age Certification Distribution Across Countries')
plt.xlabel('Country')
plt.ylabel('Age Certification')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**1. Visualizes Intensity Clearly**
  * A heatmap uses color gradients to reflect the **magnitude** of values (i.e., the number of titles).
  * This makes it easy to spot **which countries have the most content for each age certification.**

**2. Compact and Multi-Dimensional**
  * It fits **two categorical variables** (countries vs. age certifications) in a compact grid.
  * No need for multiple bar charts - everything is shown in one unified view.

**3. Highlights Trends & Gaps Instantly**
  * High-density regions stand out with darker/more intense colors.
  * Sparse or missing areas are easily spotted as light or blank spaces - useful for **identifying gaps**.

**4. Good for Stakeholders**
  * Stakeholders can quickly understand which countries target kids (e.g., '7+', '13+') and which favour mature content (e.g., '18+')
  * Supports **market strategy** - where to grow family-friendly vs. adult content.

##### 2. What is/are the insight(s) found from the chart?

**1. Country-wise Focus on Content Types:**
  * Certain countries (like **US, IN, GB**) have a **wide spread** across all age certifications - indicating a **diverse content library** catering to all age groups.
  * Some countries may have a **high concentration** in specific age certifications (e.g., mostly **'18+'** or **'13+'**), suggesting a **targeted demographic focus.**

**2. Unbalanced Certification Distribution:**
  * You might notice **lower representation of kids' content** (e.g., **'7+'** or **'all'**) in several regions.
  * On the other hand, **'16+'** and **'18+'** categories are dominant in many countries, showing a trend toward **mature content**.

**3. Regional Gaps:**
  * Some countries have **very few or no titles** in certain certifications - a potential **gap in offerings** for specific age groups.
  * This may represent an opportunity to **introduce more age-diverse content** in those regions.

**4. Certification Trends Across Borders:**
  * The same age certifications are **not evenly popular or available** across countries - possibly due to **local regulations or cultural preferences.**
  * For example, **'13+'** content might be more popular in Western regions, while **'16+'** dominates elsewhere.
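
The regional gaps mentioned above correspond to empty or near-empty cells in the country-by-certification table, and they can be listed programmatically rather than read off the heatmap. This is a sketch on an invented table; `find_gaps` is a hypothetical helper.

```python
import pandas as pd

# Invented counts, shaped like cert_country (rows: countries, columns: certifications).
cert_counts = pd.DataFrame(
    {"PG": [10, 0], "R": [12, 3], "TV-Y": [0, 0]},
    index=["us", "jp"],
)

def find_gaps(table, max_count=0):
    """List (country, certification) pairs with at most max_count titles."""
    return [
        (country, cert)
        for country, row in table.iterrows()
        for cert, n in row.items()
        if n <= max_count
    ]

gaps = find_gaps(cert_counts)
print(gaps)
```

Raising `max_count` to a small positive number would also catch near-empty cells.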

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact:
**1. Content Strategy Personalization:**
  * Insights into age certification preferences by country enable **targeted content curation** - e.g., pushing more **family-friendly or kids' content** in regions where it is underrepresented can attract broader audiences.

**2. Market Expansion Opportunities:**
  * Countries with **lower content volumes in certain certifications** (like **'all'** or **'7+'**) indicate **gaps in the catalog.** Filling these can help **capture untapped market segments,** such as children or families.

**3. Compliance & Localization Readiness:**
  * Understanding which certifications dominate where helps align with **local content rating standards,** ensuring **faster approvals** and **smoother launches** in new regions.


#### Possible Negative Growth Areas:
**1. Over-saturation in Mature Content:**
  * Heavy dominance of **'16+'** or **'18+'** content across multiple regions may signal **limited appeal to younger viewers** and families - potentially alienating a **large user segment.**

**2. Neglecting Local Preferences:**
  * If content offerings are **not aligned with regional tastes or regulations,** it may lead to **poor engagement** or even **content removals** by authorities.

### Trends Over Time: Measure content performance over time to track how the library has evolved and what types of content are gaining traction.

#### Chart 8 - Line Chart : Best for showing changes over time
Goal: Track how the volume of content (movies/TV shows) has changed over time.

In [None]:
# Count number of titles released each year
yearly_trend = df_titles['release_year'].value_counts().sort_index()
plt.figure(figsize=(12, 8))
sns.lineplot(x=yearly_trend.index, y=yearly_trend.values, marker='o')
plt.title('Evolution of Content Library Over Time')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is the most effective and intuitive way to visualize **changes over time.** Here's why:

**1. Time Series Clarity:**

  A line chart clearly shows trends across sequential time intervals (in this case, release years), making it easy to observe rises and drops in content volume.

**2. Smooth Visual Flow:**

  The connected data points highlight patterns like growth spurts, plateaus, or declines, helping us understand how the content library evolved year by year.

**3. Trend Detection:**

  It's ideal for identifying:
    * Peak content production years
    * Periods of decline
    * Steady growth trends

**4. Simplicity + Insight:**

  Compared to bar charts, line plots handle many time intervals without cluttering, keeping the focus on trend **direction** and **pace**, not just counts.

##### 2. What is/are the insight(s) found from the chart?

**1. Growth in Content Volume Over Time**

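There's usually a **steady increase** in the number of titles per release year. Because the x-axis is the release year rather than the date a title joined the platform, this shows that the library is **weighted toward recent releases**, which is consistent with Amazon Prime **expanding its content library** in recent years.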

**2. Possible Drop in Recent Years**

If there's a slight dip in the last year or two, it may be due to:
  * Incomplete data for recent years
  * External disruptions (like COVID-19)
  * Strategic content reshuffling or budget shifts

**3. Peak Production Years**

Some years might show **sharp spikes** - likely due to investments in original content, market expansion, or licensing deals.

**4. Content Boom Era**

The chart often highlights a 'boom phase' (e.g., post-2007) where content production or acquisition accelerated - possibly aligning with global market entry or competition with platforms like Netflix and Disney+.
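
To make spikes, dips, and boom phases easier to pin down than by eye, the yearly counts can be converted to year-over-year changes and a rolling average. The numbers below are invented; with the notebook's data, `yearly_trend` would take the place of `yearly`.

```python
import pandas as pd

# Invented yearly title counts for illustration only.
yearly = pd.Series({2015: 100, 2016: 120, 2017: 150, 2018: 180, 2019: 90})

# Year-over-year relative change: (this year - last year) / last year.
yoy = yearly.pct_change()

# 3-year rolling mean smooths single-year noise.
rolling = yearly.rolling(window=3).mean()

print(pd.DataFrame({"count": yearly, "yoy": yoy.round(2), "rolling_3y": rolling.round(1)}))
```

A sharp negative change in only the final year is a hint that the year may be incomplete rather than a real decline.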

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact:
**1. Data-Driven Content Strategy**

Understanding which years had the **highest growth in titles** helps Amazon Prime evaluate the **effectiveness of past content investment strategies** - e.g., if a spike in content also aligned with increased subscriptions or engagement.

**2. Identify Expansion Periods**

The chart helps pinpoint **when Amazon expanded aggressively**, which regions might've benefited, and which content types (e.g., original vs. licensed) were prioritized.

**3. Forecasting**

This trendline can be used to **forecast future content needs** - identifying whether the platform needs to **scale up or optimize** content creation.
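
As a very rough illustration of such a forecast, a straight line can be fitted to the yearly counts with `numpy.polyfit` and extended by one year. This is only a sketch under the strong assumption of a linear trend, using invented numbers; it is not a method the notebook applies.

```python
import numpy as np

# Invented yearly counts that happen to grow linearly.
years = np.array([2015, 2016, 2017, 2018, 2019])
counts = np.array([100, 120, 140, 160, 180])

# Fit count = slope * x + intercept, with x measured from the first year
# to keep the numbers small and the fit well conditioned.
x = years - years[0]
slope, intercept = np.polyfit(x, counts, deg=1)

next_year = years[-1] + 1
forecast = slope * (next_year - years[0]) + intercept
print(f"Linear forecast for {next_year}: {forecast:.0f} titles")
```

Real release counts are rarely linear, so a forecast like this is best treated as a baseline to compare against, not a prediction.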


#### Insights Leading to Negative Growth:
**1. Recent Decline in Content Volume**

If the chart shows a **downward trend in recent years,** it could signal:
  * Budget cuts
  * Strategic shifts
  * Less content acquisition
  * Production delays

If **not intentional,** it might **negatively impact user engagement**, especially if subscribers notice a lack of fresh content.


**2. Saturation Point**

A **flattening line** might indicate the platform has reached a **saturation point** in certain genres or markets - which calls for **innovative or diversified content.**

#### Chart 9 - Line Chart : Time Trend by Type (Movie vs Show)

In [None]:
# Group by year and type
type_trend = df_titles.groupby(['release_year', 'type']).size().reset_index(name='count')

# Lineplot for each type
plt.figure(figsize=(12, 8))
sns.lineplot(data=type_trend, x='release_year', y='count', hue='type', marker='o')
plt.title('Types of Content Gaining Traction Over Time')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

* **Line Charts** are the most effective way to visualize **trends over time**, especially when analyzing how values change **year-by-year**.

* Using **separate lines for each content type (Movie vs Show)** makes it easy to **compare their growth patterns side-by-side.**

#### Why It Works for This Case:
**1. Clarity:** Each line clearly represents a content type, showing how their volume has changed from year to year.

**2. Comparison:** The chart helps **visually compare the trajectory** of movies vs shows - whether one is growing faster, catching up, or slowing down.

**3. Time-Focused:** Since the goal is to measure **how types of content gain traction over time**, line charts map this evolution naturally and intuitively.

##### 2. What is/are the insight(s) found from the chart?

#### 1. Shift in Content Strategy
* There may be a **steady increase in both Movies and TV Shows** over the years, but:
* **One type may show a sharper rise** - for example, if shows are growing faster than movies, it signals a strategic shift toward episodic content (binge-watching trends, long-term viewer retention).
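
Whether one type is gaining ground is easier to judge from its share of each year's releases than from raw counts. The sketch below pivots a table shaped like `type_trend` and computes the share of shows per year; the counts are invented and the `MOVIE`/`SHOW` labels are assumed to match the dataset's `type` values.

```python
import pandas as pd

# Invented counts, shaped like type_trend (one row per year and type).
type_trend = pd.DataFrame({
    "release_year": [2018, 2018, 2019, 2019, 2020, 2020],
    "type": ["MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE", "SHOW"],
    "count": [80, 20, 90, 30, 90, 60],
})

# One row per year, one column per type; missing combinations count as 0.
wide = type_trend.pivot(index="release_year", columns="type", values="count").fillna(0)

# Share of each year's titles that are shows.
show_share = wide["SHOW"] / wide.sum(axis=1)
print(show_share.round(2))
```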

#### 2. Content Surge in Specific Years
* Certain years might display a **significant spike in content production**.
  * This can indicate:
    * Platform expansion
    * Acquisition of new titles
    * Increased investment in originals

#### 3. Plateau or Decline
Any visible **plateau or decline** in a particular content type (e.g., movie count dropping) could hint at:
  * Market saturation
  * Strategic reduction
  * Shifting viewer preferences

#### 4. Consistency vs Volatility
* If movies show **more consistent yearly release,** while shows fluctuate more, it could mean:
  * Movies are safer bets for regular content pipelines
  * TV shows may be tied to bigger, riskier investments or seasonal launches
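
Consistency versus volatility can be compared with the coefficient of variation (standard deviation divided by mean) of each type's yearly counts, which removes the effect of the two types having very different volumes. This is a sketch on invented numbers, not a result from the dataset.

```python
import pandas as pd

# Invented yearly counts per type (rows: release years).
wide = pd.DataFrame(
    {"MOVIE": [100, 102, 98, 100], "SHOW": [10, 40, 20, 50]},
    index=[2017, 2018, 2019, 2020],
)

# Coefficient of variation per type: higher means more year-to-year fluctuation.
cv = wide.std() / wide.mean()
print(cv.round(3))
```

On real data with a strong upward trend, computing this on year-over-year changes rather than raw counts may give a fairer picture, since the trend itself inflates the standard deviation.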

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact
**1. Content Strategy Alignment**
  * If the data shows that **TV shows are gaining traction** over time, platforms can **prioritize long-form series**, which enhance user engagement and increase watch time (beneficial for ad revenue and retention).

**2. Resource Allocation**
  * Understanding which content type is rising helps in **budget planning** - e.g., investing more in shows if they drive user growth.

**3. Marketing Campaigns**
  * Peaks in specific years or types can inform **promotional strategies** - for instance, celebrating "content milestones" or pushing nostalgia-based campaigns.

**4. Catalog Planning**
  * Identifying slower years helps planners **plug content gaps** during low seasons or experiment with diverse formats.

#### Insights That May Indicate Negative Growth
**1. Stagnation or Decline in a Content Type**
  * If movies (or shows) are **declining over time**, it may reflect:
    * Audience fatigue
    * Reduced licensing/acquisition
    * Misalignment with current viewing habits

**2. Overproduction Without Engagement**
  * If the **volume of new titles increases but doesn't align with viewership growth,** it may suggest **inefficient investment** - pouring resources into content that doesn't convert into value.

#### Chart 10 - Line Chart : Top 5 Trending Genres Over Time

In [None]:
# Cleaning Genres column for duplicate values
# Drop rows with null genres (.copy() avoids SettingWithCopyWarning on the assignments below)
df_titles = df_titles[df_titles['genres'].notna()].copy()

# Convert stringified genre list to actual Python list (if needed)
# Example: if genres are like "['comedy', 'drama']" (as a string), use `ast.literal_eval`

import ast
df_titles['genres'] = df_titles['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Remove duplicates within each row's genre list
df_titles['genres'] = df_titles['genres'].apply(lambda x: list(set([genre.strip().lower() for genre in x])))

# Sort the genres alphabetically for consistency
df_titles['genres'] = df_titles['genres'].apply(sorted)

# Preview cleaned genre column
df_titles['genres']

# Make a copy of the cleaned titles DataFrame
df_exploded = df_titles.copy()

# Explode the 'genres' column so each genre has its own row
df_exploded = df_exploded.explode('genres')

# Strip any whitespace and convert to lowercase (already done earlier, but just in case)
df_exploded['genres'] = df_exploded['genres'].str.strip().str.lower()


# Group and filter top genres
df_exploded['release_year'] = pd.to_numeric(df_exploded['release_year'], errors='coerce')

top_genres = df_exploded['genres'].dropna().value_counts().head(5).index
df_top_genres = df_exploded[df_exploded['genres'].isin(top_genres)]

genre_trend = df_top_genres.groupby(['release_year', 'genres']).size().reset_index(name='count')

plt.figure(figsize=(14, 10))
sns.lineplot(data=genre_trend, x='release_year', y='count', hue='genres', marker='o')
plt.title('Trending Genres Over Time')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

#### 1. Best for Temporal Analysis
* Line charts are ideal for showing **how values change over time**.
* They allow you to **track trends, spot peaks or drops**, and **compare multiple categories (genres)** across a continuous axis (release years).

#### 2. Clear Genre Comparison
* Using different colored lines for each genre helps in **visually comparing** the popularity of genres across years.
* The chart helps answer:
 "Which genres have become more or less popular over time?"

#### 3. Insights at a Glance
* We can easily spot:
  * Sudden **spikes** in interest (e.g., action movies post-2010)
  * **Declining trends** in specific genres
  * **Consistent performers** that stay popular year over year

#### 4. Scalable & Intuitive
* Limiting to the **top 5 genres** keeps the chart clean and easy to interpret.
* The chart works even as you scale time (e.g., decades) or group genres by category (e.g., thriller + suspense).

##### 2. What is/are the insight(s) found from the chart?

#### 1. Consistent Genre Leaders
* **Drama** and **Comedy** appear to consistently dominate across most years.
* These genres show a **stable and high volume** of content, indicating strong and ongoing audience demand.

#### 2. Rising Genre Trends
* Genres like **Action** and **Thriller** show **significant growth** in recent years (e.g., post-2007), suggesting:
  * Audience interest in fast-paced or suspense-driven content is increasing.
  * Platforms may be **investing more in these genres** recently.
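
Since the total number of titles also grows over time, a rising raw count does not by itself show that a genre is gaining ground. One hedged way to check is to look at each genre's share of the top-genre titles per year, as sketched below on invented counts shaped like `genre_trend`.

```python
import pandas as pd

# Invented counts, shaped like genre_trend (one row per year and genre).
genre_trend = pd.DataFrame({
    "release_year": [2005, 2005, 2015, 2015],
    "genres": ["drama", "action", "drama", "action"],
    "count": [30, 10, 60, 60],
})

# Total across the tracked genres for each year, aligned row by row.
yearly_total = genre_trend.groupby("release_year")["count"].transform("sum")
genre_trend["share"] = genre_trend["count"] / yearly_total

action = genre_trend[genre_trend["genres"] == "action"].set_index("release_year")["share"]
print(action)
```

In this toy example drama doubles in count, yet its share falls because action grows faster.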

#### 3. Genre Saturation or Decline
* One or more genres (depending on the dataset) may show a **declining trend**, which could be due to:
  * **Market saturation**
  * Shift in **audience preferences**
  * **Lower production** in that category

#### 4. Genre Volatility
* Some genres may show **fluctuations** (ups and downs) year to year.
* This might be influenced by:
  * Popular franchise releases
  * Social or political climate
  * Streaming trends (e.g., crime dramas booming during lockdowns)


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact:
**1. Strategic Content Investment**
  * Knowing that **Drama** and **Comedy** consistently perform well helps platforms **prioritize funding** into those genres.
  * Investing in **Action** or **Thriller**, which show upward trends, could **capitalize on rising audience interest**.

**2. Targeted Marketing**
  * For newly released or upcoming content, marketing teams can **focus campaigns** on trending genres to **boost engagement and reach**.

**3. Regional & Temporal Planning**
  * Combining this genre trend with **regional preferences** or **certification insights** can help tailor **what content is promoted where and when**.

**4. User Retention & Experience**
  * Recommending trending genres to users increases **watch-time and satisfaction**, which improves **retention metrics**.

#### Insights That May Indicate Negative Growth:
**1. Declining Genre Popularity**
  * If certain genres are steadily declining (e.g., **Romance** or **Documentary**, depending on your dataset), continuing to invest heavily in them may result in:
    * **Lower ROI**
    * **Decreased viewer engagement**
    * **Content underperformance**

**2. Genre Saturation**
  * Even high-performing genres like **Drama** might hit a **plateau** if overproduced without innovation, causing:
    * Audience fatigue
    * Reduced excitement around new titles

#### Justification:
* Content trends are directly tied to **viewer behaviour**, which drives **subscription growth, watch time,** and **churn rate**.
* Monitoring what's rising and what's fading helps businesses **stay agile, optimize production costs,** and **maximize audience impact**.

### Contribution & Diversification of Roles: by ratings, regions & content

#### Chart 11 - Bar Chart : Which Actors or Directors Contribute Most to High-Rated Content?
Goal: Identify people consistently involved in high-performing content.

In [None]:
# Merge on 'id' to get title details along with credits
df_merged = pd.merge(df_credits, df_titles, on='id')

# Define high-rated as IMDb score >= 7
high_rated = df_merged[df_merged['imdb_score'] >= 7]

# Filter for roles
high_rated_roles = high_rated[high_rated['role'].isin(['ACTOR', 'DIRECTOR'])]

# Count top 10 actors and top 10 directors separately
# rename_axis + reset_index(name=...) gives 'name' and 'count' columns on any pandas version
top_actors = (
    high_rated_roles[high_rated_roles['role'] == 'ACTOR']['name']
    .value_counts()
    .head(10)
    .rename_axis('name')
    .reset_index(name='count')
)
top_actors['role'] = 'ACTOR'


top_directors = (
    high_rated_roles[high_rated_roles['role'] == 'DIRECTOR']['name']
    .value_counts()
    .head(10)
    .rename_axis('name')
    .reset_index(name='count')
)
top_directors['role'] = 'DIRECTOR'

# Combine both
combined = pd.concat([top_actors, top_directors])

# Visualize
plt.figure(figsize=(14, 8))
sns.barplot(data=combined, x='count', y='name', hue='role', palette='Set2')
plt.title('Top 20 High-Rated Contributors (Actors & Directors, IMDb ≥ 7)')
plt.xlabel('Number of High-Rated Titles')
plt.ylabel('Contributor Name')
plt.legend(title='Role')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The **Grouped Horizontal Bar Chart** was chosen for this use case because it is particularly well-suited to the **goal**:

**"Identify people consistently involved in high-performing content"**

#### **1. Easy Comparison Across Two Roles**
We're comparing **actors vs directors**:
  * The hue='role' clearly distinguishes **Actors** from **Directors** using different colors.
  * Grouped bars side by side allow us to see which role dominates the top contributions.

#### **2. Horizontal Orientation Improves Label Readability**
  * Contributor **names can be long**, so placing them on the y-axis (vertically) makes them easier to read.
  * This avoids clutter and overlapping text, especially when displaying **20 individuals**.

#### **3. Clear Quantitative Insight**
  * The x-axis shows the **number of high-rated titles (IMDb >= 7)**.
  * It's easy to visually compare who has more appearances and **how much more.**

#### **4. Suitable for Categorical + Count Data**
  * You're dealing with **categorical data** (name, role) and **count/frequency** (number of high-rated titles), which is the perfect use case for a bar plot.

##### 2. What is/are the insight(s) found from the chart?

**1. Some individuals consistently contribute to high-rated content:**

  * Certain **actors and directors** appear repeatedly in titles with IMDb scores >= 7, indicating a **strong track record** of working on successful projects.
  * These contributors are likely **trusted names in the industry** for delivering quality content.

**2. Top Actors dominate in frequency:**

  * If actors occupy more positions in the top 20 or have **longer bars**, it suggests that **actors have more consistent involvement** in high-performing titles than directors - or perhaps they participate in more projects overall.
  * This might imply a **broader range of opportunities** or roles available to actors vs. directors.

**3. Directors also have a strong presence:**

  * Some directors appear almost as frequently as the top actors, indicating their **influence in shaping high-rated content**.
  * This shows that while actors bring visibility, **directors play a critical creative role** in driving quality.

**4. Name & Role pairing helps recognize dual contributors:**

  * If a person appears as both **actor and director** in different entries (e.g., through parentheses in names like Ben Affleck (ACTOR) and Ben Affleck (DIRECTOR)), it can highlight **multi-talented contributors**.
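
Dual contributors can also be identified directly from the credits table by counting distinct roles per person. The sketch below uses invented names and groups by `name` only, which is an assumption: different people who share a name would be merged, so a unique person identifier would be safer if the credits file provides one.

```python
import pandas as pd

# Invented credits for illustration only.
credits = pd.DataFrame({
    "name": ["Pat Doe", "Pat Doe", "Sam Roe", "Lee Poe"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "DIRECTOR"],
})

# Number of distinct roles each person has held.
roles_per_person = credits.groupby("name")["role"].nunique()

# People credited as both actor and director.
dual = roles_per_person[roles_per_person > 1].index.tolist()
print(dual)
```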

**5. Data-driven recommendation opportunities:**

  * These top contributors could be **valuable casting choices** for future high-quality productions.
  * Streaming platforms or producers could use this list to **predict potential content success** based on past trends.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact Insights:

**1. Identifying Top Contributors for Future Projects:**

  * **Actors and Directors** who are consistently involved in high-rated content are valuable assets to future productions. By focusing on these **top contributors,** businesses can **increase the likelihood of success** in their content creation.
  * Producers, streaming platforms, or production companies could leverage this data to **formulate casting and director choices** for new projects, thereby **ensuring high-quality content** and **improved ratings** (boosting audience engagement and loyalty).

**2. Strategic Casting and Production Decisions:**

  * The insights can help casting departments and production studios **target high-performing actors and directors** to create content that is **more likely to attract positive reviews** (IMDb scores >= 7).
  * The **reputation of actors and directors** can also be used to **increase marketing value,** as these individuals have built a trusted brand with their audience. **Brand alignment** with these top contributors can attract more subscribers or viewers.

**3. Content Personalization and Targeting:**

  * Streaming platforms can use these insights to create **personalized recommendations** based on **actor-director collaborations** that audiences may prefer. For instance, if a viewer consistently enjoys movies with a particular actor or director, the platform can surface content featuring them more prominently.
  * This drives higher **user satisfaction, engagement, and retention**, contributing positively to the business's long-term growth.

**4. Proven Success Predictors for High-Rated Content:**

  * The data shows which contributors have a **high success rate** in terms of IMDb ratings. By focusing on these individuals, companies can **replicate past successes** more effectively.
  * This can help businesses focus their resources on **content with higher potential returns,** making them more efficient in their decision-making.
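
A raw count of appearances in high-rated titles does not by itself give a "success rate". One way to approximate it is the share of each person's titles that clear the IMDb >= 7 threshold. The sketch below uses invented rows and an assumed minimum-title cut-off (`MIN_TITLES`), both hypothetical:

```python
import pandas as pd

# Hypothetical merged credits/titles rows: one row per (person, title).
sample = pd.DataFrame({
    "name": ["Person A"] * 4 + ["Person B"] * 2 + ["Person C"],
    "imdb_score": [7.5, 8.0, 6.0, 7.1, 6.5, 7.2, 9.0],
})

MIN_TITLES = 2  # assumed cut-off so that one-off credits do not dominate the ranking

# Flag titles that meet the high-rated threshold used in the chart.
sample["high_rated"] = sample["imdb_score"] >= 7

# Per person: number of titles, number of high-rated titles, and their ratio.
success = sample.groupby("name")["high_rated"].agg(
    titles="count", high_rated_titles="sum", success_rate="mean"
)
success = success[success["titles"] >= MIN_TITLES].sort_values(
    "success_rate", ascending=False
)
print(success)
```

Ranking by rate instead of count avoids favoring people who simply appear in many titles, at the cost of needing a minimum-sample rule.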

#### Negative Growth Insights:
**1. Over-reliance on a Small Pool of Contributors:**

  * If the chart reveals that only a **few actors and directors** dominate the high-rated content, there could be a **narrow talent pool** being overused. This can lead to **creative stagnation** and audience fatigue as people see the same individuals in multiple roles.
  * **Viewer interest** might decline if audiences feel the content lacks diversity or novelty, leading to **diminishing returns** in terms of engagement and viewership.

**2. Risk of Marginalizing Other Talent:**

  * By **focusing too much on the top contributors,** businesses may inadvertently **neglect promising talent** who have the potential for future success but haven't had the same number of opportunities yet. This could hurt the **diversity and creativity** of content in the long run.
  * If companies only invest in the same set of popular contributors, it could lead to **lower innovation** and **homogeneous content**, potentially alienating audiences looking for fresh ideas and new talent.

**3. Overemphasis on Past Success:**

  * The data is based on **historical performance** (IMDb >= 7), and it might overlook **emerging talent or new trends** in the industry. Focusing too much on **past high-rated contributors** can lead to missed opportunities with **new or emerging actors and directors** who are breaking into the industry with strong performance but haven't yet reached the same level of recognition.
  * **Innovation** might be stifled if businesses don't take into account evolving tastes or fresh perspectives from **up-and-coming talents.**

**4. Bias Toward Popular Content:**

  * If a business only looks at **IMDb ratings** to identify success, it might inadvertently **ignore diverse or niche content** that may not have broad appeal but could be **culturally or critically significant.** This could lead to an overly commercialized approach, prioritizing blockbuster hits over unique, niche, or indie content that might appeal to specific segments of the market.

#### Justification:
**Positive Impact Justification:**

  * By identifying top performers (actors, directors), businesses can capitalize on their reliable track record, ensuring higher quality content, and engaging larger audiences, which directly impacts viewership and business profitability.

  * Strategic investments in these top contributors improve return on investment (ROI) for content production.

**Negative Growth Justification:**

  * Creativity and innovation can be limited when businesses overly focus on a small group of familiar faces. Overusing the same contributors can reduce variety, ultimately hurting audience engagement and loyalty over time.

  * Diversity in talent is crucial for long-term success, as it helps prevent audience fatigue and promotes fresh storytelling.

#### Chart 12 - Scatter Plot : Correlation between Cast Size and Content Ratings
Goal: Analyze whether having more (or fewer) cast members influences IMDb/TMDB scores.

In [None]:
df_actors = df_credits[df_credits['role'] == 'ACTOR']

# Count unique actor names per content ID
cast_size_per_title = df_actors.groupby('id')['name'].nunique().reset_index()
cast_size_per_title.columns = ['id', 'cast_size']

df_cast_rating = pd.merge(cast_size_per_title, df_titles[['id', 'imdb_score', 'tmdb_score', 'title']], on='id')

# Compute correlation
correlation = df_cast_rating[['cast_size', 'imdb_score', 'tmdb_score']].corr()
print(correlation)

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 6))

# IMDb
plt.subplot(1, 2, 1)
sns.scatterplot(data=df_cast_rating, x='cast_size', y='imdb_score', alpha=0.6)
plt.title('Cast Size vs IMDb Score')
plt.xlabel('Cast Size')
plt.ylabel('IMDb Score')

# TMDB
plt.subplot(1, 2, 2)
sns.scatterplot(data=df_cast_rating, x='cast_size', y='tmdb_score', alpha=0.6, color='orange')
plt.title('Cast Size vs TMDB Score')
plt.xlabel('Cast Size')
plt.ylabel('TMDB Score')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The goal of this analysis is to examine the relationship between **cast size** and **content ratings** (both IMDb and TMDB scores). A **scatter plot** is ideal in this case for several reasons:

**1. Visualizing Relationships Between Two Variables:**

  * Scatter plots allow us to clearly see how two continuous variables are related to each other. In this case, we are comparing **cast size** (the number of actors in a given piece of content) with **IMDb and TMDB scores** (continuous rating scores). Scatter plots provide a clear and intuitive way to identify trends, clusters, or patterns in the data.

**2. Identify Correlations:**

  * Scatter plots are particularly useful for identifying **positive, negative, or no correlation** between two variables. By plotting cast size against the ratings, we can easily observe if there's a trend suggesting that **larger or smaller casts** influence **higher or lower ratings**. We can visually assess whether the number of actors has any notable impact on the ratings.

**3. Clarity with Multiple Comparisons:**

  * Using **two separate scatter plots** (one for IMDb and one for TMDB scores) enables us to make direct comparisons of how cast size impacts **each platform's rating**. The side-by-side plots allow for easier comparison without cluttering a single chart.

**4. Handling Outliers:**

  * Scatter plots can highlight **outliers** that may not follow the general trend of the data. For example, a movie with an unusually large cast and a very high IMDb score or a low TMDB score would be easy to spot, allowing for further investigation.

**5. Correlation Calculation:**

  * The scatter plots provide a visual clue to the **correlation coefficient** we computed. If the points tend to align along a straight line (whether upward or downward), it suggests a **strong linear correlation**. If the points are spread widely without any clear pattern, it indicates **weak or no correlation**.
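
The coefficient from `.corr()` is Pearson's by default, which measures only linear association and can be pulled around by a few very large casts. A rank-based (Spearman) coefficient is a useful cross-check. The sketch below uses invented numbers standing in for `df_cast_rating` and computes Spearman as the Pearson correlation of the ranks:

```python
import pandas as pd

# Hypothetical stand-in for df_cast_rating; the last row mimics one very large cast.
sample = pd.DataFrame({
    "cast_size":  [2, 5, 8, 12, 20, 150],
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 7.5, 8.0],
})

# Pearson: strength of the linear relationship.
pearson = sample["cast_size"].corr(sample["imdb_score"])

# Spearman: Pearson correlation computed on the ranks (monotonic relationship).
spearman = sample["cast_size"].rank().corr(sample["imdb_score"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

In this toy data the relationship is perfectly monotonic but not linear, so the two coefficients differ. A large gap between them on the real data would suggest skew or outliers in cast size.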


##### Interpretation of the Chart's Utility:
**Scatter Plot 1 (IMDb Score vs Cast Size):**

  * This helps us understand if there's any relationship between the number of actors in a content piece and its IMDb rating. For example, if we see a positive correlation, it might suggest that movies or shows with more actors tend to have higher IMDb ratings. Conversely, a negative correlation would suggest that larger casts tend to be associated with lower ratings.

**Scatter Plot 2 (TMDB Score vs Cast Size):**

  * Similarly, this plot helps explore the relationship between cast size and TMDB score, providing insights into whether larger casts are rated higher or lower by users on TMDB specifically.


##### Outcome from These Charts:
  * The scatter plots will allow you to visually evaluate if there is any observable pattern between cast size and ratings on IMDb and TMDB.

  * They will also give a clearer idea of whether the number of actors in a content piece plays a significant role in the overall ratings it receives or if other factors (e.g., storyline, direction) have a more prominent influence.


##### 2. What is/are the insight(s) found from the chart?

**1. No Strong Correlation Between Cast Size and IMDb Scores:**

  * From the scatter plot for **IMDb scores**, there doesn't appear to be a **clear trend** between cast size and ratings. The points are **widely scattered** with no obvious **linear relationship**, suggesting that having more or fewer actors does not have a consistent effect on **IMDb ratings**.
  * This indicates that **other factors** (such as script quality, direction, or performances) likely play a more significant role in determining IMDb scores, rather than just the number of actors in a film or show.

**2. Weak Positive Correlation for TMDB Scores:**

  * The **scatter plot for TMDB scores** shows a **slightly upward trend** as the number of cast members increases, hinting at a **weak positive correlation**. While the relationship is not very strong, it suggests that, in some cases, larger cast sizes might slightly influence **higher ratings** on TMDB.
  * This could mean that **larger, ensemble casts** might be perceived more favorably on TMDB, possibly due to the broader range of performances or more diverse representation in the content.

**3. Outliers:**

  * In both charts, there could be **outliers** where movies or shows with **larger casts** have **extremely low or high ratings**, indicating that while cast size may have a marginal effect, **other factors** likely play a more dominant role.
  * For example, a film with a huge cast but a low IMDb score or a popular ensemble show with a smaller cast could be outliers, confirming that cast size is not the sole determinant of success on these platforms.
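
Outliers in cast size can also be listed explicitly instead of being picked out by eye. The sketch below applies the common 1.5 x IQR rule to invented data. The rule and the sample rows are assumptions for illustration, not part of the original analysis:

```python
import pandas as pd

# Hypothetical stand-in for df_cast_rating.
sample = pd.DataFrame({
    "title": list("ABCDEFGH"),
    "cast_size": [4, 6, 8, 10, 12, 14, 16, 120],
    "imdb_score": [6.1, 7.0, 5.5, 6.8, 7.2, 6.0, 6.4, 3.9],
})

# Quartiles of cast size and the upper fence of the 1.5 * IQR rule.
q1, q3 = sample["cast_size"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Titles whose cast is unusually large relative to the rest.
outliers = sample[sample["cast_size"] > upper_fence]
print(outliers[["title", "cast_size", "imdb_score"]])
```

Listing these titles next to their scores makes it easier to check whether very large casts really sit at the rating extremes.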

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact

**1. Focus on Content Quality Over Cast Size:**

  * The **lack of strong correlation** between cast size and IMDb scores suggests that **businesses should focus more on content quality** rather than merely increasing the size of the cast. This could lead to more **cost-effective productions** by allocating resources towards **scriptwriting, direction,** and **cinematography** instead of recruiting larger casts.
  * **Producers and content creators** can ensure that their content is not dependent on **star power** or a large ensemble cast to succeed. By prioritizing other factors, such as **unique storylines, strong performances, and high production quality,** they could enhance the chances of achieving high ratings, especially on platforms like **IMDb**, where audience preferences may be more discerning.

**2. Targeting Ensemble Casts for TMDB:**

  * The **weak positive correlation with TMDB scores** suggests that titles with a **larger cast** tend to rate slightly higher on this platform, although correlation alone does not show that cast size causes the difference. If TMDB draws a more niche audience than IMDb (an assumption this dataset cannot confirm), businesses could experiment with **ensemble casts** to potentially improve visibility and engagement on that platform.
  * This could also be leveraged in marketing campaigns, as audiences may be drawn to movies or shows with **well-known actors** or **diverse characters,** which can **increase overall interest** in the content.

#### Negative Growth Considerations:
**1. Investing Excessively in Cast Size:**
  * The weak correlation between cast size and ratings could lead some studios or content creators to **overestimate the importance of a large cast.** This could result in **unnecessary investments** in hiring multiple actors, increasing production costs without a guaranteed return in terms of higher ratings or viewership.
  * **Excessive reliance on cast size** could result in a **poor ROI (Return on Investment)**, as more actors may not necessarily lead to better performance or greater audience engagement, especially when the quality of the content itself is not strong.

**2. Misalignment of Expectations:**
  * If businesses incorrectly interpret the weak positive correlation on TMDB as a **strong determinant for success,** they may prioritize **large casts** in a way that detracts from **other important elements**, such as **storytelling** or **directorial quality**.
  * This could lead to **resource misallocation,** where businesses spend more on actors instead of investing in **better scripts, production design,** or **audience engagement strategies.** If the audience is not satisfied with the content despite a large cast, it may result in **negative feedback** and **lower future ratings**.

**3. Complacency in Content Development:**
  * The **insight that cast size does not guarantee high ratings** may lead to complacency where businesses feel that having a **small or inexpensive cast** will be sufficient for success, potentially **undermining efforts to produce high-quality, compelling content.**
  * If companies do not invest in strong **storylines** or **creative direction,** relying too much on a smaller cast, it could lead to **underwhelming content** that does not resonate with audiences, resulting in **poor ratings and declining growth**.

#### Chart 13 - Bar Chart : Types of Characters or Roles that are Most Common in Trending Genres or Hit Shows?
Goal: Spot patterns in character types across popular titles.

In [None]:
# Define threshold for "trending"
top_titles = df_titles[df_titles['imdb_score'] >= 7.5]

df_top_credits = pd.merge(top_titles[['id', 'title', 'genres']], df_credits[['id', 'name', 'character', 'role']], on='id')

# Count top roles (e.g., ACTOR, DIRECTOR, PRODUCER)
top_roles = df_top_credits['role'].value_counts().head(10)

# Count top character names (ignoring null/empty)
top_characters = df_top_credits['character'].dropna().value_counts().head(15)

# Visualize
plt.figure(figsize=(4, 4))
# Horizontal bars: counts on the x-axis, roles on the y-axis (matches the axis labels below)
sns.barplot(x=top_roles.values, y=top_roles.index, palette='coolwarm')
plt.title("Top Roles in High-Rated Titles")
plt.xlabel("Count")
plt.ylabel("Role")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**1. Clear Comparison of Categories:**
  * **Bar charts** are excellent for comparing discrete categories (in this case, different **roles** in high-rated titles), allowing for easy identification of **which roles are most prominent.**
  * The **horizontal bar chart** used here allows for clear readability of labels (roles), especially if the list of roles is long and the bars extend to varying lengths. This makes it easy to see which roles have the highest count.

**2. Highlighting Frequency of Roles:**
  * The purpose of this chart is to show the **frequency of each role** in trending or high-rated content, and a **bar chart** is particularly effective for this type of analysis because it allows quick identification of the **top 10 roles** that appear most frequently.
  * The **color palette** used (coolwarm) helps visually differentiate each bar, making it easier to distinguish between different roles at a glance.

**3. Contextual Insights:**
  * The chart allows us to quickly assess which **types of characters or roles** (such as **actors, directors, producers**) are most often associated with high-rated titles. This insight can help identify **patterns or preferences** in trending or successful content.

**4. Focused on Top Roles:**
  * By limiting the chart to the **top roles,** we are able to focus on **the most influential roles in the success of a high-rated title** without overloading the viewer with less relevant information. This allows for a more focused and meaningful analysis.

##### 2. What is/are the insight(s) found from the chart?

**1. Dominance of Certain Roles:**
  * The chart likely reveals that **actors** are the most common role in high-rated titles, which is expected, as **actors** are crucial to the success of most films and shows. Their ability to draw audiences and portray characters effectively often influences the ratings and success of a title.
  * Depending on the dataset, you might also observe that **directors** and **producers** appear frequently, but likely not as much as actors, reflecting the collaborative nature of content creation but also pointing out that the **visibility of actors** plays a more direct role in a title's popularity.

**2. Presence of Other Roles:**
  * You might see **producers, writers, and even creators** making their mark in trending or high-rated titles, highlighting the importance of key behind-the-scenes roles in the quality of the content. However, these may not necessarily appear as often in the top spots as actors and directors.
  * If any **character roles** (like "hero", "villain", etc.) are included, they might show up, suggesting that certain character archetypes resonate more in trending or hit shows/movies. For example, the **protagonist** or **antagonist** roles could be more prevalent in high-rated titles, indicating that these are central to the audience's connection with the content.

**3. Character Types Across Genres:**
  * By cross-referencing this data with the genres of these high-rated titles, we can spot patterns in the types of characters (or roles) associated with different genres. For example:
    * **Action and adventure genres** may have more **heroes, villains, and fighters.**
    * **Drama or romance genres** might feature more **complex, emotionally-driven character roles.**
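
The genre cross-reference described above can be sketched by exploding the genre lists and counting character names within each genre. The rows below are invented and shaped like `df_top_credits`, assuming `genres` has already been parsed into Python lists. The `character` column is assumed to hold credited character names, so mapping them to archetypes such as "hero" or "villain" would need a separate labeling step:

```python
import pandas as pd

# Hypothetical rows shaped like df_top_credits, with genres already parsed into lists.
sample = pd.DataFrame({
    "title": ["T1", "T1", "T2", "T2", "T3"],
    "genres": [["action"], ["action"], ["action", "drama"],
               ["action", "drama"], ["drama"]],
    "character": ["Narrator", "Detective", "Narrator", None, "Narrator"],
})

# One row per (credit, genre), drop missing characters, then count per genre.
per_genre = (
    sample.explode("genres")
    .dropna(subset=["character"])
    .groupby("genres")["character"]
    .value_counts()
    .rename("count")
    .reset_index()
)

# value_counts sorts each genre's counts in descending order,
# so the first row per genre is its most common character.
top_per_genre = per_genre.groupby("genres").head(1)
print(top_per_genre)
```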

**4. Frequency of Key Roles in Specific Content Types:**
  * If the chart compares roles across movies and TV shows, it could show whether certain roles are more common in one type of content. For example, **actors** may dominate in movies, while **directors** or **producers** might be more prominent in TV series, where long-term vision and showrunning are more crucial.
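
The chart above does not split roles by content type, but the comparison can be sketched with a cross-tabulation. The frames below are invented, and the `type` values `MOVIE` and `SHOW` are assumed labels for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for df_titles and df_credits.
titles_sample = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
})
credits_sample = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t2", "t3"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR", "ACTOR", "DIRECTOR"],
})

# Attach the content type to every credit row.
merged = credits_sample.merge(titles_sample, on="id")

# Share of each role within each content type (each row sums to 1).
role_share = pd.crosstab(merged["type"], merged["role"], normalize="index")
print(role_share)
```

Normalizing by row keeps the comparison fair when one content type has far more credits than the other.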


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact:
**1. Targeting Popular Roles for Future Projects:**
  * By identifying the most common and successful roles in trending genres, businesses can tailor casting decisions to align with successful patterns. For instance, if **actors** and **directors** dominate in high-rated content, investing in well-known or skilled actors and directors could help increase the likelihood of producing high-rated content.
  * **Character archetypes** that appear frequently in high-rated titles (e.g., heroes, villains, or complex protagonists) can also guide the development of future content, ensuring that the roles resonate with audiences.

**2. Informed Content Strategy:**

  * Knowing which **roles** (actor, director, producer, etc.) play a prominent role in high-rated content could influence production strategies. For instance, if certain roles (like directors) are key to success in a genre, content creators might invest more in securing renowned directors for future productions in that genre.

  * **Genres** with a history of high ratings can be targeted for more content creation. By analyzing which character types or roles are most effective in those genres, content producers can ensure that their projects align with the expectations of their audiences.

**3. Casting and Talent Decisions:**

  * Identifying that **certain actors** and **directors** are associated with successful content can help companies form strategic partnerships with these individuals. This can help improve the **marketability** and **appeal** of upcoming projects, leading to better engagement, higher viewership, and potentially higher ratings.

**4. Audience Engagement and Trends:**

  * The insights could also help marketing teams target audiences more effectively. For instance, if certain character types are trending in high-rated genres, promotional materials (e.g., trailers, posters) can highlight these roles to attract audience attention. This can increase engagement and brand recognition for the content.





#### Negative Growth Insights (Potential Risks):

**1. Over-reliance on Familiar Roles:**

  * While focusing on successful roles (actors, directors, and character types) could be beneficial, over-relying on the same talent or character archetypes may result in a lack of originality. Audiences may grow tired of repetitive roles or character types, which could result in diminishing returns for content that feels formulaic or predictable.

  * Overuse of the same high-performing actor or director may also limit the variety of projects, leading to stagnation in creative output.

**2. Ignoring Other Crucial Factors:**

  * The data might skew towards the idea that certain roles are the key to success, but it might ignore other important factors, such as storylines, production quality, innovation, or global trends. Focusing too heavily on popular roles might ignore the broader, more complex reasons behind high ratings, leading to potential failures when other crucial factors are overlooked.

**3. Excessive Focus on Specific Genres or Roles:**

  * A focus on trending genres and specific roles may result in content creators overproducing certain types of projects (e.g., action films with the same archetypes). This could saturate the market and lead to audience fatigue, diminishing the quality of the content and reducing engagement over time.

**4. Exclusion of Emerging Talent:**

  * Prioritizing popular roles and well-known actors or directors could inadvertently limit opportunities for emerging talent. This could hurt diversity and creativity, ultimately stifling innovation in the industry. New voices and fresh perspectives may be excluded, which could negatively affect long-term growth by alienating audiences seeking diversity and originality.


#### Chart 14 - Bar Chart : Most Prolific Actors or Directors across Genres
Goal: Measure the volume of content per person by genre.

In [None]:
# Cleaning Genres column for duplicate values
# Drop rows with null genres
df_titles = df_titles[df_titles['genres'].notna()]

# Convert stringified genre list to actual Python list
# Example: if genres are like "['comedy', 'drama']" (as a string), use `ast.literal_eval`

import ast
df_titles['genres'] = df_titles['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Remove duplicates within each row's genre list
df_titles['genres'] = df_titles['genres'].apply(lambda x: list(set([genre.strip().lower() for genre in x])))

# Sort the genres alphabetically for consistency
df_titles['genres'] = df_titles['genres'].apply(sorted)

# Preview cleaned genre column
df_titles['genres']

# Make a copy of the cleaned titles DataFrame
df_exploded = df_titles.copy()

# Explode the 'genres' column so each genre has its own row
df_exploded = df_exploded.explode('genres')

# Strip any whitespace and convert to lowercase (already done earlier, but just in case)
df_exploded['genres'] = df_exploded['genres'].str.strip().str.lower()

df_merged = df_exploded.merge(df_credits, on='id')


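# Get the top 5 most prolific (name, role) pairs
# Note: df_merged is exploded by genre, so these counts are title-genre rows, not distinct titles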
top_names = (
    df_merged
    .groupby(['name', 'role'])
    .size()
    .reset_index(name='total_count')
    .sort_values('total_count', ascending=False)
    .head(5)
)

# Filter merged data for only top names
top_merged = df_merged[df_merged['name'].isin(top_names['name'])]


genre_counts = (
    top_merged
    .groupby(['name','role', 'genres'])
    .size()
    .reset_index(name='count')
)

genre_counts['label'] = genre_counts['name'] + ' (' + genre_counts['role'] + ')'

# Visualization
plt.figure(figsize=(14, 8))
sns.barplot(data=genre_counts, x='label', y='count', hue='genres')
plt.title('Top Contributors by Genre')
plt.xlabel('Name')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Genre', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The **bar chart** is ideal for visualizing the distribution of data, especially when we want to compare the **prolific actors or directors** across different genres. Here's why:

**1. Clear Comparison:** The bar chart allows for easy comparison between **different contributors (actors or directors)** and their involvement in various genres. By plotting the number of titles associated with each person and genre, the chart visually highlights the most prolific individuals in each genre.

**2. Categorical Data Representation:** In this case, both **roles** (actors or directors) and **genres** are categorical variables. A bar chart is excellent for showing counts or frequencies of these categories in an intuitive way. The **hue parameter** (representing genres) allows the reader to see how many titles each person contributed to across different genres, making it easy to distinguish patterns.

**3. Multiple Group Comparisons:** The chart shows multiple variables at once—**the person’s name, their role**, and **the genre** they worked in. This makes it useful for identifying not just the most prolific individuals, but also how they perform across multiple genres. The **hue** provides color differentiation for each genre, helping to visually distinguish between them in a single plot.

**4. Readable for Top Contributors:** With the **top 5 most prolific names**, the bar chart is well-suited to show which actors and directors are leading in terms of genre contributions. It allows for quick identification of the highest contributors and their genre distribution.

**5. Scalability:** As we compare only the **top 5 contributors**, the bar chart handles this small dataset efficiently. It would still work with a larger dataset as long as the number of categories remains manageable.

##### 2. What is/are the insight(s) found from the chart?

**1. Most Prolific Actors and Directors:**

  * The chart highlights the **actors** and **directors** who have contributed to the highest number of titles across different genres. By identifying the most prolific individuals, the chart shows which contributors have the most influence in the industry in terms of quantity.

  * We can also see if a specific actor or director dominates in certain genres, which may suggest their specialization or preference in those genres.
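
Because the merged frame has one row per title and genre, counting rows per person can overstate how many titles someone worked on when those titles carry several genre tags. The sketch below contrasts row counts with distinct-title counts on invented rows shaped like `df_merged`:

```python
import pandas as pd

# Hypothetical rows shaped like df_merged (one row per title-genre-credit).
merged_sample = pd.DataFrame({
    "id":     ["t1", "t1", "t1", "t2", "t3", "t4"],
    "genres": ["action", "drama", "thriller", "comedy", "comedy", "comedy"],
    "name":   ["Person A"] * 3 + ["Person B"] * 3,
    "role":   ["ACTOR"] * 6,
})

# Row counts: inflated for titles tagged with several genres.
row_counts = merged_sample.groupby(["name", "role"]).size()

# Distinct titles: each title counted once per person and role.
title_counts = merged_sample.groupby(["name", "role"])["id"].nunique()

print(pd.DataFrame({"rows": row_counts, "distinct_titles": title_counts}))
```

Here both people have three rows, yet one of them has a single three-genre title and the other has three separate titles. Ranking "most prolific" by distinct titles keeps the two cases apart.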

**2. Genre Preferences:**

  * The chart reveals which genres are most commonly associated with certain **actors** or **directors**. This can give insights into their **career trajectory** and specialization. For example, an actor who works predominantly in **action** or **comedy** could be seen as an expert or fan favorite in that genre.

  * By looking at the **genre distribution** of each individual, you can assess how well-rounded an actor or director is across different genres or if they are mostly tied to one specific genre.
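
How "well-rounded" a contributor is can be summarized with two numbers: how many genres they appear in, and what share of their credits falls in their single most common genre. The sketch below uses invented rows shaped like `genre_counts`. Since a title can carry several genres, the shares are over title-genre pairs:

```python
import pandas as pd

# Hypothetical rows shaped like genre_counts (label, genre, count).
genre_counts_sample = pd.DataFrame({
    "label": ["Person A (ACTOR)"] * 3 + ["Person B (DIRECTOR)"] * 2,
    "genres": ["comedy", "drama", "action", "western", "drama"],
    "count": [6, 3, 1, 9, 1],
})

# Per contributor: number of genres, total credits, and credits in the top genre.
breadth = genre_counts_sample.groupby("label").agg(
    n_genres=("genres", "nunique"),
    total=("count", "sum"),
    top_genre_count=("count", "max"),
)

# A share close to 1 indicates specialization in a single genre.
breadth["top_genre_share"] = breadth["top_genre_count"] / breadth["total"]
print(breadth)
```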

**3. Role Distribution:**

  * The chart allows you to compare the **roles (actors vs directors)** within each genre. You might observe that certain genres are more associated with one role than the other. For instance, **directors** might dominate in **drama** and **action** genres, while **actors** might be more evenly distributed across various genres.

  * This can suggest **talent needs** in the industry. If certain genres lack diversity in directors, it could influence the hiring or casting decisions in the future.

**4. Collaborations:**

  * By examining the combination of **name, role**, and **genre**, the chart can show which individuals tend to work with particular genres more often. This can indicate long-term **collaborations** between certain actors, directors, and genres, suggesting they have a proven track record of successful content in those genres.

**5. Emerging Talent:**

  * The chart might also show emerging **actors** or **directors** who are beginning to make an impact in specific genres. If new names appear at the top of the list in genres like **horror** or **drama**, this could indicate **emerging talent** that might be worth keeping an eye on.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact:
**1. Informed Casting and Talent Decisions:**

  * The insights regarding **prolific actors and directors** can help production companies and studios make **data-driven casting decisions**. By identifying the top contributors and the genres they excel in, they can select the right talent for specific projects, ensuring higher chances of success. For example, if a director has a proven track record in a genre like thriller or comedy, studios can choose them for upcoming projects within that genre.

**2. Content Strategy Development:**

  * The chart helps to identify which **genres** are most frequently associated with successful talent. This information allows companies to shape their **content strategy**, focusing on genres that consistently attract top-tier talent. For instance, if **action** movies tend to have certain directors at the helm, studios can invest more heavily in that genre, knowing it will attract top talent and likely yield strong audience reception.

**3. Diversification of Roles and Genres:**

  * By observing which roles (actors vs directors) dominate in various genres, companies can assess areas where there may be **underrepresentation**. For example, if one genre is heavily dominated by actors, studios might seek to diversify by inviting more directors to helm those types of films. This insight could lead to **innovative projects** and an opportunity to **break new ground** within a particular genre.

**4. Identification of Emerging Talent:**

  * If the chart shows **emerging directors or actors** who have contributed to multiple high-rated titles in specific genres, it provides businesses with an opportunity to **invest in new talent** before they become mainstream. This proactive approach could lead to **long-term success** by discovering individuals who may become major industry influencers in the future.

#### Potential Negative Impact:
**1. Over-Reliance on Top Talent:**

  * A potential downside of focusing too much on the most prolific names in the industry is that it might lead to an **over-reliance on familiar talent**. While top actors and directors can bring success, an over-focus on them might lead to a **lack of innovation** and **predictable content**. Audiences may get fatigued by repetitive styles and concepts, reducing long-term engagement.

**2. Missed Opportunities for Genre Exploration:**

  * The chart’s insights could lead to studios focusing too much on the **same successful genres** that prolific actors and directors dominate. This could potentially close the door to unexplored genres, thus stifling creativity and limiting opportunities for diverse content creation. If companies focus too much on a single genre, it could lead to **market saturation** and eventually alienate audiences looking for something new and fresh.

**3. Neglecting Niche or Emerging Roles:**

  * By primarily focusing on the most **prolific names** in major genres, studios might neglect **emerging roles** and **niche genres** that are not as popular yet but could present untapped opportunities. For instance, focusing only on actors in mainstream genres like action could cause studios to overlook actors and directors in genres like **documentary or indie films**, which have a strong, but smaller, following.

**4. Potential Homogeneity in Content:**

  * If too much emphasis is placed on successful and familiar actors and directors, the content produced could become **homogeneous** in nature. Without diversity in **genre or talent**, the risk is that content might become predictable, and the creativity of the entertainment industry could be stifled.

#### Justification:
  * **Positive Impact:** These insights enable businesses to make more strategic decisions, such as selecting the right talent for a project, understanding audience preferences, diversifying their content offerings, and discovering emerging stars who might bring fresh perspectives.

  * **Negative Impact:** However, an over-reliance on top talent and the most popular genres could limit creativity and lead to repetitive content. Additionally, focusing on a narrow pool of talent and genres could result in market saturation, reducing audience excitement over time and limiting the diversity of content that could attract a broader or more niche audience.


#### Chart 15 - Bar Chart : Diversification of the Cast across Content Types or Regions
Goal: Audit cast variation by content type (Movie vs TV), region (Countries), or age certification.


In [None]:
# Merge df_credits with df_titles
df_cast = pd.merge(df_credits, df_titles[['id', 'type', 'production_countries', 'age_certification']], on='id')

# Filter to only actors (ignore directors/writers/etc.)
df_cast = df_cast[df_cast['role'] == 'ACTOR']

# Ensure 'production_countries' is a list (non-list values become empty lists)
df_cast['production_countries'] = df_cast['production_countries'].apply(lambda x: x if isinstance(x, list) else [])

# Explode to have one country per row
df_cast_exploded = df_cast.explode('production_countries')


# Count unique actor names per production country and content type
diversity_by_region_type = (
    df_cast_exploded.groupby(['production_countries', 'type'])['name']
    .nunique()
    .reset_index(name='unique_actors')
)

# Focus on top 10 countries by total unique actors
top_countries = (
    diversity_by_region_type
    .groupby('production_countries')['unique_actors']
    .sum()
    .sort_values(ascending=False)
    .head(10)
    .index
)

# Filter top 10 countries only
diversity_top = diversity_by_region_type[diversity_by_region_type['production_countries'].isin(top_countries)]

plt.figure(figsize=(10, 6))
sns.barplot(data=diversity_top, x='production_countries', y='unique_actors', hue='type', palette='coolwarm')

plt.title('Top 10 Production Countries by Cast Diversity (Movies vs Shows)')
plt.ylabel('Number of Unique Actors')
plt.xlabel('Production Country')
plt.xticks(rotation=45)
plt.legend(title='Content Type')
plt.tight_layout()
plt.show()
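
The cell above assumes `production_countries` already holds Python lists. If the column was read straight from `titles.csv`, it may instead hold strings such as `"['US', 'GB']"`, and the `isinstance(x, list)` check would then turn every row into an empty list. The following is a minimal sketch of a more defensive parser under that assumption; the helper name `to_country_list` and the toy frame are illustrative and not part of the notebook.

```python
import ast
import pandas as pd

def to_country_list(value):
    """Return a list of country codes from a list or a string-encoded list."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)  # safely parse "['US', 'GB']"
        except (ValueError, SyntaxError):
            return []
        return parsed if isinstance(parsed, list) else []
    return []  # None, NaN, or any other type

# Toy data standing in for df_cast (hypothetical values)
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "production_countries": ["['US', 'GB']", ["IN"], None],
})
toy["production_countries"] = toy["production_countries"].apply(to_country_list)

# One row per (actor, country); empty lists become a single NaN row
toy_exploded = toy.explode("production_countries")
print(toy_exploded)
```

If an earlier cell already converts the column to lists, this helper leaves those values unchanged.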

##### 1. Why did you pick the specific chart?

The bar chart is chosen to visualize **the diversity of the cast across content types (Movies vs. TV shows) and regions (Countries)** because it allows for an easy comparison of:


  **1. Cast diversity**: By showing the number of **unique actors** in each production country, we can effectively gauge the variety of actors involved in content produced in different regions.

  **2. Content type breakdown**: The **hue (Movies vs. TV shows)** provides an immediate insight into the differences in cast diversity between movies and TV shows in each country. It also allows for easy differentiation between the two content types within each country.

  **3. Geographical distribution**: By focusing on the **top 10 production countries**, the chart emphasizes the most influential countries in terms of cast diversity, making it easier to draw conclusions about the regions with the most varied talent pools.
  

The bar chart is ideal because:

  * It effectively displays both **quantitative and categorical** data (unique actors by country and content type).

  * The **hue** provides a distinction between movies and TV shows, allowing for a clear comparison of cast diversity between content types.

  * It is easy to interpret, especially when dealing with categorical variables like **production countries** and **content type**.

By grouping the data in this way, the bar chart gives a quick, visual representation of how diverse the cast is in different countries and how that diversity differs between movies and TV shows. This helps highlight patterns in cast participation and regional content production.
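
The stated goal also mentions age certification, and the merge already carries `age_certification`, although the chart does not use it. As a hedged sketch, the same grouping can be pointed at that column. The toy frame below is hypothetical, and the `"Unrated"` label for missing certifications is an assumption rather than something the dataset defines.

```python
import pandas as pd

# Toy stand-in for df_cast after filtering to actors (hypothetical values)
toy_cast = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Cy", "Dee"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "age_certification": ["PG", "R", "R", "TV-14", None],
})

# Label missing certifications so they are not silently dropped by groupby
toy_cast["age_certification"] = toy_cast["age_certification"].fillna("Unrated")

# Unique actors per certification and content type
by_cert = (
    toy_cast.groupby(["age_certification", "type"])["name"]
    .nunique()
    .reset_index(name="unique_actors")
)
print(by_cert)
```

The resulting frame has the same long shape as `diversity_by_region_type`, so it could be passed to `sns.barplot` with `x='age_certification'` in the same way.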

##### 2. What is/are the insight(s) found from the chart?

**1. Top Countries for Cast Diversity:**

  * The chart reveals which **countries** have the most **unique actors** involved in both **movies and TV shows**. These countries are likely to have a significant entertainment industry, with diverse talent pools.

  * By comparing countries, we can identify regions that contribute the most to global content creation, especially those with a broad range of actors participating in both movies and TV shows.

**2. Content Type Breakdown:**

  * The **split between Movies and TV shows** allows us to see whether certain countries have more diversity in one type of content compared to the other. For instance:

    * Some countries may have a higher concentration of actors involved in **TV shows**, while others may be more active in **movies**.

    * This might indicate differences in how the entertainment industries function in these countries—whether they focus more on serialized content (TV) or feature-length productions (movies).
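
The movie-versus-show split described above is easier to read as a share than as raw counts. The sketch below uses a toy frame with the same columns as `diversity_top`; the numbers and the `MOVIE`/`SHOW` labels are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for diversity_top (hypothetical counts)
toy_diversity = pd.DataFrame({
    "production_countries": ["US", "US", "IN", "IN"],
    "type": ["MOVIE", "SHOW", "MOVIE", "SHOW"],
    "unique_actors": [300, 100, 90, 10],
})

# One row per country, one column per content type
wide = toy_diversity.pivot(
    index="production_countries", columns="type", values="unique_actors"
).fillna(0)

# Share of each country's unique actors that falls in each content type
share = wide.div(wide.sum(axis=1), axis=0)
print(share.round(2))
```

A country whose `MOVIE` share is close to 1 is movie-heavy in this catalogue, while values near 0.5 suggest a more balanced mix.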

**3. Regional Variation:**

  * The chart also highlights **regional trends**, showing which **countries** have a balanced distribution of actors in both **movies and TV shows** and which countries may specialize in one over the other. This could provide insights into **cultural preferences** and **media consumption habits**:

    * Countries with more balanced representation might indicate a well-rounded entertainment industry, whereas countries with a higher focus on one content type (e.g., movies) could reflect local content consumption preferences or industry dynamics.

**4. Insights into Global Entertainment Trends:**

  * By examining the diversity in cast across countries, the chart helps identify **global trends** in **casting diversity**. For example, if countries with high diversity in actors are producing more content (across movies and TV shows), it could indicate their ability to attract a broader international audience.

  * This also informs casting decisions for **international collaborations**, where producers might look for diverse casts to appeal to wider audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact:
**1. Strategic Casting Decisions:**

  * **Insight**: Identifying countries with the most **unique actors** can help producers and casting directors strategically choose talent from regions with diverse talent pools.

  * **Positive Impact**: By leveraging diverse talent, studios can create more **appealing** and **inclusive content**, which is likely to attract a wider audience. This could result in **higher viewership** and **stronger international sales** for movies or TV shows, as they appeal to global and diverse demographics.

**2. Targeting Popular Regions for Content Production:**

  * **Insight**: The chart provides clarity on which countries are leading in content production (both movies and TV shows). Countries with high **unique actor participation** could be targeted for content production or partnerships.

  * **Positive Impact**: Investing in or collaborating with the **top production countries** can **increase content quality and reach**, aligning with local preferences while also appealing to international markets. It also provides a solid foundation for **international expansion** of media companies.

**3. Tailoring Content for Global Audiences:**

  * **Insight**: If a country has a diverse cast across both movies and TV shows, it suggests that there is a demand for different types of content within that region.

  * **Positive Impact**: Studios and content creators can **create region-specific content** or **tailor marketing strategies** that emphasize **local talent**. This approach can lead to **better cultural relevance**, driving engagement and enhancing brand value.

**4. Diversification and Inclusion Strategy:**

  * **Insight**: By understanding the distribution of actors across various content types, companies can ensure that they are **diverse and inclusive**, which is increasingly valued by audiences.

  * **Positive Impact**: This alignment with inclusivity can bolster brand reputation and increase **audience loyalty**, particularly in markets that value diversity, leading to higher engagement and **positive sentiment** around the brand.

#### Potential Negative Growth Insights:
**1. Imbalance in Content Type Production:**

  * **Insight**: If a country is heavily focused on either movies or TV shows and lacks a balance in producing content across both, this could indicate a **limitation in content diversity**.

  * **Negative Impact**: A narrow focus on one content type could limit growth potential. For example, if a market is **over-represented** in TV shows but lacks diversity in movies, it could lead to **missed opportunities** in movie distribution or licensing deals, ultimately limiting the company’s **global growth potential**.

**2. Over-Concentration in Certain Regions:**

  * **Insight**: If the chart highlights a concentration of talent and production in a few countries, this could lead to an **over-reliance on specific markets**.

  * **Negative Impact**: This concentration could lead to **risk exposure**, especially if certain regions experience a **downturn** in content production or local market changes (e.g., economic shifts, regulatory changes, or changes in consumer preferences). Businesses could face difficulties in sustaining long-term growth if they fail to **diversify** content across more regions or market types.

**3. Unbalanced Representation:**

  * **Insight**: Some countries may have fewer unique actors in certain genres or content types, suggesting **under-representation** of talent.

  * **Negative Impact**: Failing to address talent gaps in certain regions could result in **lackluster content**, limiting potential collaborations or making it more difficult to cater to specific audiences. This could **negatively affect the brand** and result in a **reduced audience reach**, especially if competitors take advantage of underrepresented markets.
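
The over-concentration risk described above can be given a rough number. One possible sketch, using hypothetical per-country totals, computes the share held by the top two countries and a Herfindahl-Hirschman-style index (the sum of squared shares). Neither metric appears in the original analysis; both are suggested here only as illustrations.

```python
import pandas as pd

# Toy totals of unique actors per country (hypothetical numbers)
actors_per_country = pd.Series(
    {"US": 600, "IN": 200, "GB": 100, "CA": 60, "JP": 40}
)

# Each country's share of all unique actors
shares = actors_per_country / actors_per_country.sum()

# Share held by the two largest countries
top2_share = shares.nlargest(2).sum()

# Concentration index: 1/n when evenly spread, 1.0 when one country holds everything
hhi = (shares ** 2).sum()

print(f"Top-2 share: {top2_share:.2f}, concentration index: {hhi:.4f}")
```

Tracking these two values across catalogue snapshots would show whether talent is becoming more or less concentrated over time.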

#### Chart 16 - Bar Chart : Older titles are still popular or well-rated compared to newer content
Goal: Explore long-term content value.

In [None]:
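# Score columns to combine
combine_score = ['imdb_score', 'tmdb_score']

# Keep only titles that have both scores
df_rating = df_titles.dropna(subset=combine_score)


# Group by release year to get average IMDb and TMDB score
avg_score_by_year = df_rating.groupby('release_year')[combine_score].mean().reset_index()

# Per-title average of the two scores; mean(axis=1) skips a missing score,
# so titles with only one score keep that single value
df_titles['avg_score'] = df_titles[combine_score].mean(axis=1)

# Filter titles released before 2000 with an average score of at least 8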
classic_hits = df_titles[(df_titles['release_year'] < 2000) & (df_titles['avg_score'] >= 8)]

# Sort top 10 by score
top_classics = classic_hits.sort_values(by='avg_score', ascending=False).head(10)


# Visualization
plt.figure(figsize=(12, 8))

sns.barplot(data=top_classics, x='avg_score', y='title', palette='viridis')

plt.title('Top 10 High-Rated Classic Titles (Before 2000)')
plt.xlabel('Average Score (IMDb + TMDB)')
plt.ylabel('Title')
plt.grid(True)
plt.tight_layout()
plt.show()
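
The chart above shows only the top pre-2000 titles, so the comparison with newer content named in the chart title is not directly visible. A minimal sketch of one way to make that comparison is to summarize `avg_score` by era. The toy frame and the `era` column are hypothetical, and the year-2000 cut-off simply mirrors the filter used above.

```python
import pandas as pd

# Toy stand-in for df_titles (hypothetical titles and scores)
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "release_year": [1975, 1994, 2005, 2018, 2021],
    "imdb_score": [8.0, 7.0, 6.0, 7.0, None],
    "tmdb_score": [7.0, None, 6.0, 5.0, 6.0],
})

# Row-wise mean skips missing values, so B and E keep their single score
toy_titles["avg_score"] = toy_titles[["imdb_score", "tmdb_score"]].mean(axis=1)

# Label each title by era using the same cut-off as the chart
toy_titles["era"] = toy_titles["release_year"].lt(2000).map(
    {True: "Before 2000", False: "2000 and later"}
)

# Count, mean and median score per era
era_summary = toy_titles.groupby("era")["avg_score"].agg(["count", "mean", "median"])
print(era_summary)
```

Reporting the count next to the mean matters because older titles in a catalogue tend to be a smaller, more curated group, which can lift their average.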

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to visualize the **Top 10 High-Rated Classic Titles (Before 2000)** because:

  **1. Clear Comparison**: A bar chart allows for a **clear comparison** between the titles based on their average scores, which is essential for identifying which older titles are still performing well in terms of IMDb and TMDB ratings.

  **2. Focus on Key Insights**: It emphasizes the specific titles that have stood the test of time. Because the chart includes only pre-2000 titles, it shows that classic content can hold high ratings rather than comparing it directly with newer content. The bar chart is especially useful when dealing with categorical data (titles) and continuous data (average score), providing a quick visual of relative performance.

  **3. Easily Interpretable**: The chart makes it easy to **identify trends** in how older titles continue to maintain or achieve high ratings over time, offering valuable insights into long-term content value. It also provides an intuitive way to spot the highest-rated classics with clear labels.

  **4. Emphasis on Top Performers**: By sorting the data and showing the top 10 high-rated classics, the chart allows a **focused look at the highest-performing older content**, offering a visual representation of the most significant insights, rather than a broad view of all older titles. This makes it easy for viewers to grasp the takeaway: some older content still maintains strong ratings.

The **visual appeal** and **clarity** of a bar chart in this context make it the most effective tool for conveying the insights about the long-term appeal and quality of classic titles compared to newer ones.

##### 2. What is/are the insight(s) found from the chart?

**1. Longevity of High-Quality Content**: The chart highlights that several classic films or TV shows from before 2000 have **high average** ratings (mean of IMDb and TMDB). This suggests that **older content** can still be highly relevant and well-regarded, even years after its release. It could also indicate that certain films or shows have remained culturally significant, contributing to their ongoing popularity.

**2. Cultural Impact**: Many of the top-rated classic titles likely belong to genres or franchises that have maintained cultural significance, such as **iconic films** or **beloved franchises** that continue to influence contemporary content. The chart may reveal patterns about the **timelessness** of certain genres or themes, such as sci-fi, drama, or historical films.

**3. Enduring Fan Base**: High ratings for older titles suggest that these works have developed **loyal fan bases** over time. These audiences likely continue to recommend, revisit, and celebrate these films/shows, which helps them sustain a high level of popularity.

**4. Quality and Craftsmanship**: The older content that maintains high ratings may indicate a level of **filmmaking craftsmanship** that resonates with audiences even decades after release. This could point to aspects like **storytelling, acting, directing, or production** values that remain impactful even as technology and cultural preferences change.

In summary, the chart shows that older content, particularly **classic titles,** still retains considerable **cultural value and popularity** due to its enduring quality, emotional resonance, and fan base. These insights can be used to inform current and future content strategies, such as leveraging nostalgic elements or focusing on quality storytelling, which are likely to remain valued by audiences across time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Business Impact:
**1. Timeless Content Value:**

  * **Insight**: Older, classic content continues to perform well, with high ratings even years after release.

  * **Impact**: This insight suggests that **investing in the creation of timeless, high-quality content** can have long-term benefits. Content that stands the test of time can continuously attract viewers, providing ongoing value without the need for constant production. Studios or streaming services can use this insight to focus on creating content with lasting appeal, which could continue to generate revenue and viewership long after the initial release.

**2. Nostalgia as a Tool:**

  * **Insight**: Classic films and shows with enduring popularity are often tied to strong **nostalgic feelings** from their original audiences.

  * **Impact**: This opens up opportunities for **nostalgia-driven** marketing and remakes/reboots, targeting existing fan bases. Content providers can leverage the **nostalgic value** of older successful titles to create new derivative content (e.g., remakes, sequels, or spin-offs), appealing to both older generations familiar with the original and younger generations discovering them for the first time. This kind of content often has a built-in audience, reducing marketing risks and costs.

**3. Legacy Content for Continuous Engagement:**

  * **Insight**: High-rated classic titles can become part of the **evergreen content portfolio** for platforms, providing long-term engagement without additional production costs.

  * **Impact**: Media companies can curate and offer classic content as part of their subscription services, giving users access to a library of films or shows that continue to attract attention. This enhances **user retention** on streaming platforms by ensuring that there is always a collection of beloved titles for customers to engage with, encouraging longer subscriptions.

**4. Strategic Programming:**

  * **Insight**: Titles from before 2000 that maintain high ratings may be strategically scheduled to capture viewership during specific times of the year, such as anniversaries or themed events (e.g., holiday seasons).

  * **Impact**: This strategy could drive audience engagement, especially on platforms like streaming services that curate specific content around holidays or significant anniversaries. This not only boosts views but can also improve customer satisfaction, as they can revisit beloved classics.

#### Potential Negative Growth:
**1. Over-Reliance on Old Content:**

  * **Insight**: While older content has proven longevity, there’s a danger in **over-relying on past hits** to drive engagement.

  * **Reason**: If a business focuses too heavily on classics and neglects to innovate or develop new, original content, it could risk stagnation. Audiences’ tastes evolve, and relying on past successes might prevent the platform or studio from keeping up with emerging trends and new formats. This could result in **lower innovation and missed opportunities** to attract new viewers or demographics.

**2. Audience Fragmentation:**

  * **Insight**: Classic films might appeal to specific age groups, but they may not engage younger generations.

  * **Reason**: While older content is successful with nostalgic viewers, the younger generation might not connect with older titles, leading to a **disconnect in the audience**. If a company leans too heavily on older content, it might struggle to capture and maintain the **attention of younger audiences**, limiting potential growth among newer viewers who prefer contemporary content or trends.

**3. Excessive Nostalgia May Limit Growth:**

  * **Insight**: While nostalgia can be a powerful marketing tool, relying too heavily on it may create the impression that a company lacks fresh, innovative content.

  * **Reason**: If a platform over-promotes nostalgic content and remakes, it may eventually **alienate consumers** who seek originality and novelty. This could create a perception that the platform is “out of touch” or **focusing too much on past successes** rather than future-oriented content.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

The insights gathered from these 16 charts provide a comprehensive understanding of the content landscape, allowing the client to make data-driven decisions on content strategy, regional expansion, and audience targeting. Here's a breakdown of the potential **solutions to the business objective** based on the charts:

#### 1. Content Diversity:
  * **Objective:** Understand trends in genres, ratings, and popularity to inform content strategy.

  * **Solutions:**

    * **Top Genres by Count (Chart 1):** The client can prioritize content creation or acquisition based on the most popular genres, ensuring they have a high volume of content in genres with significant audience demand.

    * **Distribution of Ratings (Chart 2):** Focus on balancing the volume of high-rated content to maintain audience interest and keep the platform’s reputation strong.

    * **Ratings Across Genres (Chart 3):** Use the average ratings per genre to highlight the genres that not only have the most content but also the highest-quality content, guiding future production or curation decisions.

    * **IMDb vs TMDB by Type (Chart 4):** This analysis can help determine whether content type (Movie vs TV Show) influences ratings, providing insights for prioritizing certain content types based on higher user satisfaction.

#### 2. Regional Availability:
  * **Objective:** Explore regional gaps or strengths in content availability and ensure alignment with audience preferences across geographies and age groups.

  * **Solutions:**

    * **Top Countries by Content Volume (Chart 5):** The client can focus marketing efforts and distribution deals in countries with the most content available, potentially leveraging these regions as key growth markets.

    * **Age Certification by Country (Chart 6):** Help understand content suitability per region and age group, allowing for better local content adaptation, marketing, and distribution strategies.

    * **Correlation Between Regions and Certifications (Chart 7):** Identify content certification trends in different regions, ensuring that the platform has the appropriate content for various age groups across different markets.

#### 3. Trends Over Time:
  * **Objective:** Measure how the content library has evolved and identify emerging trends to predict future demand.

  * **Solutions:**

    * **Volume of Content Over Time (Chart 8):** This insight can be used to assess the overall growth or decline in the content library, allowing the client to adjust investment and content strategies accordingly.

    * **Time Trend by Type (Chart 9):** Tracking how movies and TV shows evolve over time can guide the client’s investment decisions, whether to focus on expanding TV series, movies, or both.

    * **Top 5 Trending Genres Over Time (Chart 10):** By identifying trending genres, the client can tailor content acquisition and production to capitalize on audience preferences and emerging trends.

#### 4. Contribution & Diversification of Roles:
  * **Objective:** Understand the contribution of different roles (actors, directors, etc.) in high-performing content and ensure diversity in casting.

  * **Solutions:**

    * **Actors or Directors Contributing to High-Rated Content (Chart 11):** The client can strategically work with top-performing actors or directors who are consistently involved in high-rated content, ensuring better quality and appeal.

    * **Cast Size and Content Ratings (Chart 12):** Analyzing whether cast size influences ratings helps the client make data-driven decisions on how many cast members to feature in future content, potentially optimizing production costs and appeal.

    * **Types of Characters or Roles in Trending Genres (Chart 13):** Identifying common roles or character types can guide casting decisions, ensuring the inclusion of characters that resonate with audiences in popular genres.

    * **Most Prolific Actors or Directors Across Genres (Chart 14):** Focus on the actors and directors who contribute significantly to high-performing content, ensuring they are engaged in future projects.

    * **Diversification of the Cast Across Regions/Types (Chart 15):** Encourage diversity in casting, focusing on global representation across regions and content types to appeal to a broader audience and enhance content variety.

    * **Older Titles vs Newer Content (Chart 16):** Leverage the popularity of older, timeless content alongside new releases, maintaining a balanced library that appeals to both nostalgic and new audiences.

#### Key Recommendations for the Client to Achieve the Business Objective:
  **1. Leverage Popular Genres and Ratings:** Focus on increasing content production or acquisition in genres with high ratings and demand, ensuring content meets audience preferences while maintaining high-quality standards.

  **2. Expand Regionally Based on Content Availability:** Target regions with gaps in content offerings and adjust content strategies based on regional preferences and certifications. Tailor marketing campaigns for countries with significant content volumes to attract local subscribers.

  **3. Focus on Content Evolution:** Continuously track content volume and trending genres to stay ahead of the curve. Adjust content offerings based on changing trends to align with evolving audience tastes.

  **4. Maximize Star Power and Expertise:** Collaborate with high-performing actors, directors, and producers who consistently contribute to well-rated content, ensuring that future productions benefit from their established track record.

  **5. Ensure Diversity in Cast and Roles:** Increase casting diversity across regions and genres, and balance between different types of content (movies, TV shows). This will not only enhance audience appeal but also ensure global representation in content.

  **6. Blend Old and New Content:** Capitalize on the long-term appeal of older, high-rated titles while continuing to invest in newer content. This strategy can help balance nostalgia with innovation, keeping existing viewers satisfied while attracting new ones.

By following these strategies, the client can create a well-rounded content portfolio that appeals to a wide range of demographics, maximizes the impact of high-rated and trending content, and stays ahead of market trends. This approach will ultimately help achieve long-term growth and business success.

# **Conclusion**

The insights derived from the 16 charts provide a comprehensive understanding of the current state of content diversity, regional availability, audience preferences, and industry trends. By analyzing these patterns, the client is in a strong position to make informed decisions that will help optimize their content strategy, enhance regional targeting, and foster long-term growth.

  **1. Content Strategy Optimization:** Focusing on high-performing genres, leveraging top actors and directors, and balancing both new and older content will ensure the platform remains appealing to both existing and new audiences. Prioritizing quality alongside quantity will create a content library that resonates with viewers, driving engagement and satisfaction.

  **2. Regional Expansion:** Identifying regional gaps and tailoring content availability by country and age certification will help meet the diverse needs of global audiences. Adjusting marketing and content distribution strategies accordingly will enable the platform to expand in regions with high demand and content volume.

  **3. Trends and Evolving Preferences:** Tracking content performance over time allows the platform to stay ahead of the curve by adapting to emerging trends and shifting audience preferences. This proactive approach ensures that the content offerings evolve alongside changing viewer tastes, maintaining relevance and demand.

  **4. Diversity and Representation:** Increasing the diversity of cast members across different genres and regions ensures inclusivity and broadens the appeal of the platform. By diversifying casting and exploring global talent, the client can create more relatable and universally engaging content.

By integrating these insights into their decision-making process, the client can enhance their content offerings, maximize user engagement, and ultimately drive sustained business growth. This approach will not only foster audience loyalty but also position the platform as a leader in content innovation, diversity, and regional relevance.

In essence, the client can achieve a balanced content portfolio that appeals to diverse audiences, fosters global expansion, and positions the platform for long-term success in an ever-evolving entertainment landscape.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***