<a href="https://colab.research.google.com/github/MoSaizanCoder/Capstone_Project/blob/main/EDA_Amazon_Prime_TV_Shows_and_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Amazon Prime TV Shows and Movies Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

This project conducts an exploratory data analysis (EDA) on the Amazon Prime Video dataset, focusing on content available in the United States. Drawing on a dataset of more than 9,800 titles and 124,000 cast/crew credits, the analysis aims to uncover patterns in content distribution, genre popularity, and audience engagement.

Using Python libraries such as Pandas for data manipulation and Matplotlib/Seaborn for visualization, the project cleans raw data to handle missing values and complex list structures. The final output identifies key trends in library growth, regional production dominance, and the relationship between IMDb ratings and popularity, providing a comprehensive snapshot of Amazon Prime’s streaming landscape.

# **GitHub Link -**

My GitHub Link : https://github.com/MoSaizanCoder/Capstone_Project/blob/main/EDA_Amazon_Prime_TV_Shows_and_Movies.ipynb

# **Problem Statement**


In the highly saturated streaming market, platforms like Amazon Prime Video face the challenge of managing massive content libraries while striving to keep users engaged. With thousands of movies and TV shows added regularly, it becomes difficult for stakeholders to manually discern:

*   Which genres are over-represented or under-utilized.
*   How the balance between "classic" content and new releases is shifting.
*   Whether high critical acclaim (IMDb scores) translates to actual user popularity.

Without data-driven insights, content acquisition and production strategies risk being inefficient, potentially leading to wasted budget on low-performing content and a decline in subscriber retention. This project addresses the need to transform raw catalog data into clear, actionable intelligence.



#### **Define Your Business Objective?**
The primary objective of this analysis is to derive data-backed insights that can optimize content strategy and drive business growth. Specifically, the analysis aims to:



*  **Optimize Content Strategy:** Identify high-performing genres and categories to guide future content licensing and original production investments.

*   **Assess Market Positioning:** Analyze the ratio of Movies to TV Shows and the age of the content library to understand Amazon Prime’s unique value proposition compared to competitors.
*   **Target Audience Engagement:** Determine the correlation between ratings (Quality) and popularity (Engagement) to understand what truly drives viewership on the platform.


*   **Geographic Expansion:** Evaluate the diversity of production countries to assess the platform's readiness for global audiences and identify key international markets.

**Goal:** To move from a volume-based content approach to a value-based strategy that maximizes viewer satisfaction and platform loyalty.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Mounting Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Importing Important Libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')


### Setting Up Some Basic Display Settings

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)

### Dataset Loading

In [None]:
# Reading the Credits dataset.
Amazon_credits = pd.read_csv('/content/drive/MyDrive/Amazon Prime TV Shows and Movies/credits/credits.csv')
# Reading the Titles dataset.
Amazon_titles = pd.read_csv('/content/drive/MyDrive/Amazon Prime TV Shows and Movies/titles/titles.csv')
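
The guidelines above ask for exception handling, and the two `pd.read_csv` calls will fail with a long traceback if Drive is not mounted or a path is mistyped. A minimal sketch of a guarded loader follows; `load_csv_safely` is a hypothetical helper name, not part of the notebook, and the demo uses a throwaway temporary file so it runs without Google Drive.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_safely(path):
    """Read a CSV and raise a clear error if the file is missing or empty.

    Hypothetical helper for illustration; the notebook itself calls
    pd.read_csv directly.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    try:
        return pd.read_csv(csv_path)
    except pd.errors.EmptyDataError as exc:
        raise ValueError(f"Dataset is empty: {csv_path}") from exc


# Demo with a temporary file standing in for titles.csv.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "titles_demo.csv"
    demo_path.write_text("id,title\ntm1,Example Movie\n")
    demo_df = load_csv_safely(demo_path)

    try:
        load_csv_safely(Path(tmp_dir) / "missing.csv")
        missing_error = None
    except FileNotFoundError as err:
        missing_error = str(err)

print(demo_df.shape)
print(missing_error)
```

In the notebook, the same helper could wrap the two Drive paths used above.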

### Dataset First View

In [None]:
# First Look Of Credits Dataset Top 5.
Amazon_credits.head()

In [None]:
# First Look Of Credits Dataset Last 5.
Amazon_credits.tail()

In [None]:
# First Look Of Titles Dataset Top 5.
Amazon_titles.head()

In [None]:
# First Look Of Titles Dataset Last 5.
Amazon_titles.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
Amazon_credits.shape

In [None]:
# Dataset Rows & Columns count
Amazon_titles.shape

### Dataset Information

In [None]:
# Dataset Info
Amazon_credits.info()

In [None]:
Amazon_titles.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

print(Amazon_titles.duplicated().sum())

print(Amazon_credits.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(Amazon_titles.isnull().sum())
print(Amazon_credits.isnull().sum())

### Visualizing Missing Values in the Credits & Titles Datasets Using a Heatmap

In [None]:
# Visualizing the missing values
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot for Titles on the first axis (axes[0])
sns.heatmap(Amazon_titles.isnull(), cbar=False, yticklabels=False, cmap='viridis', ax=axes[0])
axes[0].set_title('Missing Values: Titles Dataset', fontsize=14)

# Plot for Credits on the second axis (axes[1])
sns.heatmap(Amazon_credits.isnull(), cbar=False, yticklabels=False, cmap='viridis', ax=axes[1])
axes[1].set_title('Missing Values: Credits Dataset', fontsize=14)

plt.tight_layout()
plt.show()



### What did you know about your dataset?

**1. Missing Data Insights:**

*   **Major Gaps in Metadata:**
    * **seasons (8,514 missing):** This column has the highest number of missing values. This is structurally expected because the majority of content on Amazon Prime consists of Movies, which do not have seasons.

    * **age_certification (6,487 missing):** Over 65% of the titles lack an age rating, a significant gap that needs handling (likely by filling with 'Not Rated' or dropping the column).

    * **character (16,287 missing):** In the credits dataset, the actor's name is present, but the specific character they played is missing for a large chunk of entries.

*   **Ratings Gaps:** tmdb_score (2,082) and imdb_score (1,021) have missing values, roughly 21% and 10% of the library respectively. These are likely niche or new titles that haven't received enough votes to generate a score.

**2. Duplicate Values:**
* Unlike many clean datasets, this data **contains duplicates**.
    * **Titles Dataset:** Found 3 duplicate rows.
    * **Credits Dataset:** Found 56 duplicate rows.
    * **Action Item:** These will need to be dropped during the Data Wrangling phase to ensure accurate analysis.

**3. Join Key**
* **id:** The **id** column has 0 missing values in both datasets. It identifies each title in the Titles dataset and acts as a foreign key in the Credits dataset, making it a reliable key to merge the Titles and Credits data.
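
The counts and percentages quoted above can be produced in one table instead of being read off `isnull().sum()` by hand. The sketch below is a minimal illustration on a toy frame; `missing_summary` and the toy values are assumptions for demonstration, not the notebook's own code or data.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages per column, largest first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": (df.isnull().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy stand-in for the Titles dataset (values are made up).
toy_titles = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3", "tm4"],
    "seasons": [np.nan, np.nan, 2.0, np.nan],
    "imdb_score": [6.1, np.nan, 7.4, 5.0],
})

toy_summary = missing_summary(toy_titles)
print(toy_summary)
```

Calling `missing_summary(Amazon_titles)` and `missing_summary(Amazon_credits)` would give the same view for the real data.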

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(Amazon_titles.columns)
print(Amazon_credits.columns)


In [None]:
# Dataset Describe for titles
print(Amazon_titles.describe())


In [None]:
# Dataset Description For credits
print(Amazon_credits.describe())

### Variables Description

**Titles Dataset:**
* **id:** Unique identifier for the title (Key to join with Credits).
* **title:** Name of the Movie or TV Show.
* **type:** Category of content (Movie or Show).
* **description:** Brief synopsis/plot summary.
* **release_year:** The year the content was released.
* **age_certification:** Age rating (e.g., R, PG-13).
* **runtime:** Duration of the content in minutes.
* **genres:** List of genres the title belongs to (e.g., Drama, Comedy).
* **production_countries:** List of countries where it was produced.
* **seasons:** Number of seasons (for TV Shows only).
* **imdb_score / tmdb_score:** User ratings on IMDb and TMDB.
* **imdb_votes:** Number of votes cast on IMDb (indicates popularity).
* **tmdb_popularity:** Popularity metric calculated by TMDB.

**Credits Dataset:**
* **person_id:** Unique identifier for the person (actor/director).
* **id:** Foreign key linking to the Titles dataset.
* **name:** Name of the actor or director.
* **character:** Name of the character played (for Actors).
* **role:** Role in the production (ACTOR or DIRECTOR).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(Amazon_titles.nunique())
print(Amazon_credits.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# 1. Handling Duplicates
# Dropping duplicates to ensure data integrity
Amazon_titles.drop_duplicates(inplace=True)
Amazon_credits.drop_duplicates(inplace=True)


In [None]:
# 2. Dropping Unnecessary column.
# Dropping character column.
# Logic: 'character' has 16,000+ missing values, and the analysis needs only actor and director names, not character names.
Amazon_credits.drop('character', axis=1, inplace=True)

In [None]:
# 3. Handling Missing Values
# seasons: keep the column; a null here means the title is a movie, so fill with 0.
Amazon_titles['seasons'] = Amazon_titles['seasons'].fillna(0) # Filling NA with 0.

In [None]:
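# Filling Ratings/Votes with 0.
# Logic: Unrated titles are mostly niche or new releases that haven't received enough votes to generate a score.
# Note: 0 is a placeholder meaning "unrated", not a real score, so rows with 0 should be excluded when averaging or plotting score distributions.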
Amazon_titles['imdb_score'] = Amazon_titles['imdb_score'].fillna(0) # Filling NA with 0.
Amazon_titles['imdb_votes'] = Amazon_titles['imdb_votes'].fillna(0) # Filling NA with 0.
Amazon_titles['tmdb_score'] = Amazon_titles['tmdb_score'].fillna(0) # Filling NA with 0.
Amazon_titles['tmdb_popularity'] = Amazon_titles['tmdb_popularity'].fillna(0) # Filling NA with 0.
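
Filling scores with 0 keeps the rows, but the zeros then behave like real ratings in any mean, median, or box plot. One way to keep both pieces of information is to record which rows had a real score before filling. The sketch below uses a toy frame; the column name `has_imdb_score` is an assumption introduced here, not a column in the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Amazon_titles (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [8.0, np.nan, 6.0, np.nan],
})

# Flag rows that carry a real score BEFORE filling the gaps.
toy["has_imdb_score"] = toy["imdb_score"].notna()
filled = toy.assign(imdb_score=toy["imdb_score"].fillna(0))

# The placeholder zeros pull the naive mean down.
mean_with_zeros = filled["imdb_score"].mean()
# Restricting to flagged rows recovers the mean of actual ratings.
mean_rated_only = filled.loc[filled["has_imdb_score"], "imdb_score"].mean()

print(mean_with_zeros, mean_rated_only)
```

An equivalent filter on the notebook's data, without the extra column, is `Amazon_titles[Amazon_titles['imdb_score'] > 0]`.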

In [None]:
# Fill text columns with placeholders.
# Logic: Keeping the data available for other analysis like (genres).
Amazon_titles['age_certification'] = Amazon_titles['age_certification'].fillna('Not Rated') # Filling NA with 'Not Rated'.
Amazon_titles['description'] = Amazon_titles['description'].fillna('No Description') # Filling NA with 'No Description'.

In [None]:
# 4. Converting Strings Lists to Actual Lists.
import ast # 'ast' helps us read text that looks like a Python literal (like "['Action', 'Drama']") and turn it into a real Python object.

def clean_list_column(value):

    # Step 1: Check if the value is a String (Text)
    # If it is NOT a string (like NaN or a number), we can't convert it.
    if not isinstance(value, str):
        return ['Unknown']

    # Step 2: Try to convert the text into a List
    try:
        # literal_eval reads the string "['A', 'B']" and makes it a list ['A', 'B']
        result_list = ast.literal_eval(value)

        # Step 3: Check if the list is empty
        # If the result is just [], it means we have no info.
        if len(result_list) == 0:
            return ['Unknown']

        # If it's not empty, return the good list!
        return result_list

    # Step 4: Handle Errors
    # If the text was messy and couldn't be converted, return Unknown.
    # Catching specific exceptions avoids hiding unrelated problems such as KeyboardInterrupt.
    except (ValueError, SyntaxError, TypeError):
        return ['Unknown']

# Apply the cleaning function to genres and countries
Amazon_titles['genres'] = Amazon_titles['genres'].apply(clean_list_column)
Amazon_titles['production_countries'] = Amazon_titles['production_countries'].apply(clean_list_column)
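
One caveat with the cell above: because the function returns `['Unknown']` for anything that is not a string, running the cell a second time would replace the already-parsed lists with `['Unknown']`. The sketch below is one possible re-run-safe variant; `parse_list_cell` is a hypothetical name, and the extra list check is an addition, not the notebook's original logic.

```python
import ast


def parse_list_cell(value, placeholder="Unknown"):
    """Turn "['a', 'b']" into ['a', 'b']; fall back to [placeholder]."""
    if isinstance(value, list):
        # Already parsed (for example when the cell is re-run).
        return value if value else [placeholder]
    if not isinstance(value, str):
        # NaN, numbers, None: nothing to parse.
        return [placeholder]
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # The text is not a valid Python literal.
        return [placeholder]
    if isinstance(parsed, list) and parsed:
        return parsed
    return [placeholder]


print(parse_list_cell("['drama', 'comedy']"))
print(parse_list_cell("[]"))
print(parse_list_cell(["US"]))
```

It can be applied the same way as the original, for example `Amazon_titles['genres'].apply(parse_list_cell)`.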



In [None]:
Amazon_titles.info()

In [None]:
Amazon_credits.info()

### What all manipulations have you done and insights you found?

1. **Handling Duplicates:** Removed duplicate rows from both datasets (titles and credits) to ensure unique records.
2. **Dropping Column:** I dropped the character column from the Credits dataset because it had over 16,000 missing values and character names are not required for our analysis of Actors and Directors.

3. **Imputing Missing Values:**
    * **seasons:** Filled missing values with 0. This logically distinguishes Movies (0 seasons) from TV Shows (1+ seasons).
    * **age_certification:** Filled missing entries with "Not Rated" to preserve the data for analysis rather than deleting the rows.
    * **Ratings:** Filled missing scores, votes, and popularity with 0 as a placeholder for unrated content.
4. **Data Type Conversion:**
    * Converted genres and production_countries from strings to actual lists.
    * Handled Empty Lists: Replaced empty entries [] in genres and production_countries with ['Unknown'] to prevent errors in future visualizations.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Comparing Counts of movies and TV Show using Countplot (Bar Chart)

In [None]:
# Chart - 1 visualization code
# Setting the fig size
plt.figure(figsize=(10, 6))

# Creating Count plot to compare Movies vs Shows.
ax = sns.countplot(x='type', data=Amazon_titles, palette='viridis')

# Adding title And labels
plt.title('Movies vs. TV Shows Count')
plt.xlabel('Content Type', fontsize = 12)
plt.ylabel('Total Count Of Movies & TV Shows', fontsize = 12)

# Adding numbers (labels) on top of the bars
# 'ax.containers' is a list that holds all the bars we just drew on the chart.
for container in ax.containers:

    # 'bar_label' is a smart function that automatically finds the height of each bar and writes that number on top of it.
    ax.bar_label(container, padding=3, fontsize=12)

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose this Countplot (Bar Chart) because it is the most effective way to compare the frequency of categorical variables. It provides an immediate visual comparison between the volume of Movies and TV Shows in the dataset.

##### 2. What is/are the insight(s) found from the chart?

**Dominance of Movies:** The platform is heavily skewed towards Movies, which make up roughly 86% of the content (about 8,500 of the 9,800+ titles, in line with the 8,514 rows that had no seasons value), while TV Shows account for only about 14%.

**Library Strategy:** This suggests that Amazon Prime Video's acquisition strategy has historically focused on building a massive catalog of films rather than serialized television content.
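
The share of each content type can be computed directly rather than estimated from the bar heights. A minimal sketch on a toy frame follows; the `MOVIE`/`SHOW` labels and the counts are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for Amazon_titles (6 movies, 2 shows; values are made up).
toy_titles = pd.DataFrame({"type": ["MOVIE"] * 6 + ["SHOW"] * 2})

# Absolute counts per content type.
type_counts = toy_titles["type"].value_counts()
# normalize=True returns fractions, converted here to percentages.
type_share = (toy_titles["type"].value_counts(normalize=True) * 100).round(1)

print(type_counts)
print(type_share)
```

Running the same two lines on `Amazon_titles['type']` gives the exact split behind the chart.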

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding this ratio helps stakeholders in two ways:
1. **Retention:** Knowing the library is movie-centric helps in marketing to film buffs.
2. **Growth Opportunity:** The lower number of TV Shows indicates a potential gap. Since TV Shows often drive higher long-term user retention (binge-watching), the business might consider increasing investment in original series to balance the catalog.

#### Chart - 2 Histogram with KDE

In [None]:
# Chart - 2 visualization code
# Setting up the figure size
plt.figure(figsize=(12, 6))

# Filtering for recent content (2000 onward) to show the trend more clearly.
recent_content = Amazon_titles[Amazon_titles['release_year'] >= 2000]

# Creating a histogram with a KDE line (23 bins, roughly one per year from 2000 to 2022).
sns.histplot(data=recent_content, x='release_year', kde=True, color='teal', bins=23)

# Adding title and labels
plt.title('Titles by Release Year (2000 - 2022)', fontsize = 16)
plt.xlabel('Release Year', fontsize = 12)
plt.ylabel('Number Of Titles Available', fontsize = 12)

# Displaying the plot
plt.show()

##### 1. Why did you pick the specific chart?

I selected a Histogram combined with a KDE (Kernel Density Estimate) line because it is the standard and most effective way to visualize how data is distributed over time. I wanted to see the "shape" of Amazon Prime's library to understand if it's full of old classics or if it focuses mostly on modern content. The histogram lets me see the volume per year, and the KDE line helps identify the smooth trend of growth.

##### 2. What is/are the insight(s) found from the chart?

* **Steep Growth:** The number of titles rises sharply for release years from around 2015-2016 onward. Because the dataset records release year rather than the date a title joined the platform, this shows a catalog weighted toward recent releases and is consistent with, though not proof of, Amazon Prime aggressively expanding its library in that period.

* **Recency Bias:** Within the 2000-2022 window shown, most of the content is from the last decade (2010-2021). This suggests the platform leans toward "New & Modern" releases rather than acting purely as an archive for old movies.

* **The 2022 Drop:** There is a sharp drop-off in 2022. I interpret this not as a stop in production, but likely because the dataset collection stopped early in that year, so the data for 2022 is incomplete.
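
The trend read off the histogram can be checked numerically by counting titles per release year and looking at the year-over-year change. A minimal sketch on toy data (the years and counts are made up):

```python
import pandas as pd

# Toy stand-in for Amazon_titles['release_year'].
toy_titles = pd.DataFrame(
    {"release_year": [2014, 2015, 2015, 2016, 2016, 2016, 2016]}
)

# Titles per release year, in chronological order.
titles_per_year = toy_titles["release_year"].value_counts().sort_index()
# Percentage change versus the previous year (NaN for the first year).
growth_pct = (titles_per_year.pct_change() * 100).round(1)

print(titles_per_year)
print(growth_pct)
```

A roughly constant positive `growth_pct` over several years would support describing the growth as exponential; a single jump would not.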

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help create a positive business impact.

**Marketing Strategy:** Since the data indicates the library is very "modern," marketing teams should focus campaigns on "Fresh Content" rather than "Classics" to attract younger audiences.

**Competitive Benchmarking:** The steep growth curve suggests that, to stay competitive against giants like Netflix, Amazon needs to maintain this high rate of content acquisition. A slowdown now could look like a decline in service value to subscribers.

#### Chart - 3 Top 10 Genres (Bar Chart)

In [None]:
# Chart - 3 visualization code
# .explode() the 'genres' column to count individual genres.
df_genres = Amazon_titles.explode('genres') # .explode() converts each list-like element in a column into separate rows, repeating the index for each value.

# Calculating the value counts for the top 10 genres.
top_genres = df_genres['genres'].value_counts().head(10)

# Plotting Visualization (Bar Chart).
plt.figure(figsize=(12, 6))

# x = Genre Names (index), y = Counts (values).
ax = sns.barplot(x=top_genres.index, y=top_genres.values, palette='mako')

# Adding title and labels.
plt.title('Top 10 Genres on Amazon Prime Video', fontsize=16)
plt.xlabel('Genre of Movies & TV Shows', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)

# Rotating x-axis labels by 45 degrees so they don't overlap.
plt.xticks(rotation=45)

# Adding numbers(data labels) on top of the bars for clarity.
for container in ax.containers:
    ax.bar_label(container, padding=3)

plt.show()



##### 1. Why did you pick the specific chart?

I chose a Bar Chart because it is the most effective way to rank categorical data by frequency. Since I wanted to identify the "Top 10" genres, this visualization allows for an instant comparison of volume. The descending order makes it easy to spot the dominant categories (Drama, Comedy) versus the niche ones (Family, European) at a glance.

##### 2. What is/are the insight(s) found from the chart?

**Drama is the Absolute Leader:** The chart clearly shows that Drama is the backbone of Amazon Prime's library with 4,762 titles, far outpacing any other genre.

**The "Big Two":** There is a significant drop-off after the top two genres. Drama and Comedy (2,987) together make up a massive portion of the content, indicating that the platform prioritizes general entertainment over niche categories.

**Action & Thriller Presence:** Thriller (2,119) and Action (1,820) are strong contenders, showing that while Drama leads, there is still a healthy selection of high-intensity content.

**Family Content is Lower:** Interestingly, Family content appears quite low on the list (9th place with 751 titles), suggesting the platform might be less focused on kids' programming compared to competitors like Disney+.
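
Because a title can carry several genres, the counts from `.explode()` add up to more than the number of titles, so each genre's share is best expressed as "percentage of titles tagged with this genre". A minimal sketch on toy data (genre labels and rows are made up):

```python
import pandas as pd

# Toy stand-in for Amazon_titles after genres were parsed into lists.
toy_titles = pd.DataFrame({
    "id": ["tm1", "tm2", "tm3", "ts4"],
    "genres": [["drama", "comedy"], ["drama"], ["thriller", "drama"], ["comedy"]],
})

# One row per (title, genre) pair, then count titles per genre.
genre_counts = toy_titles.explode("genres")["genres"].value_counts()
# Share of titles tagged with each genre; shares can sum to more than 100%.
genre_share = (genre_counts / len(toy_titles) * 100).round(1)

print(genre_share)
```

Dividing by `len(Amazon_titles)` rather than by the exploded row count keeps the denominator equal to the number of titles.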

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights drive specific strategic decisions:

* **Core Retention:** Since the majority of the library is Drama and Comedy, the platform must ensure the quality of these titles remains high to retain its current user base. Marketing campaigns should lean heavily into these strengths.

* **Gap Analysis / Growth Opportunity:** The relatively low number of Family titles (751) highlights a potential weakness. To attract more households and compete with family-centric platforms, the business might need to invest more in acquiring or producing content for children and families.

* **Niche Targeting:** The presence of "European" content in the top 10 suggests a solid international library. This can be leveraged to target specific demographics interested in foreign cinema, differentiating the service from purely Hollywood-centric rivals.

#### Chart - 4 Top 10 Production Countries (Bar Chart)

In [None]:
# Chart - 4 visualization code
# .explode() 'production_countries' because some titles are produced by multiple countries
# This separates them so we can count each country individually.
df_countries = Amazon_titles.explode('production_countries')

# Count the top 10 countries.
top_countries = df_countries[df_countries['production_countries'] != 'Unknown']['production_countries'].value_counts().head(10)

# Visualization (Bar Chart)
plt.figure(figsize=(12, 6))

# Creating the plot using the 'rocket' palette for a nice color gradient
ax = sns.barplot(x=top_countries.index, y=top_countries.values, palette='rocket')

# Adding title and labels
plt.title('Top 10 Production Countries on Amazon Prime', fontsize=16)
plt.xlabel('Country', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)

# Adding numbers on top of bars using the simple container method
for container in ax.containers:
    ax.bar_label(container, padding=3)

plt.show()


##### 1. Why did you pick the specific chart?

I picked a Bar Chart because I wanted to see where the movies and shows are actually coming from. It’s the best way to compare the total count of content produced by different countries. This chart helps me quickly spot if Amazon Prime is relying too much on one region, like the US.

##### 2. What is/are the insight(s) found from the chart?

**US is Highly Dominant:** The United States (US) is the clear leader with 5,331 titles. This is nearly five times the count of the next biggest single country (India, with 1,072).

**India is Key:** India (IN) is the next biggest country with 1,072 titles. This confirms Amazon's heavy investment in regional content (like Bollywood), making India a major content hub outside of North America.

**English Language Focus:** The high presence of the UK (GB) and Canada (CA) confirms that Amazon focuses heavily on English-language content outside the US.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, definitely. This data is critical for content strategy:

* **Global Diversification (Negative Insight/Opportunity):** The huge reliance on the US market presents a risk of being too "Hollywood-centric." The business should invest more in other countries like Japan (JP) and Germany (DE) to attract a wider global audience and reduce the platform's risk.

* **Market Focus:** The strong presence of India and the UK means that local production offices in those countries are working well. The business should continue to support original programming tailored for those specific markets.

* **Data Quality:** Titles with an "Unknown" production country (filtered out of this chart) point to a weakness in the data collection process. Improving the scraping or labeling for this group will allow the business to correctly categorize and market those titles.

#### Chart - 5 The 'spread' of scores for Movies vs TV Shows Box Plot

In [None]:
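# Chart - 5 visualization code
# Setting the figure size
plt.figure(figsize=(10, 6))

# Keeping only rated titles: missing scores were filled with 0 during wrangling,
# and those placeholder zeros would otherwise show up as false low outliers.
rated_titles = Amazon_titles[Amazon_titles['imdb_score'] > 0]

# Creating a Box Plot
# This compares the 'spread' of scores for Movies vs TV Shows
# The line in the middle of the box is the Median score.
sns.boxplot(data=rated_titles, x='type', y='imdb_score', palette='Set2')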

# Adding title and labels
plt.title('Distribution of IMDb Scores: Movies vs. TV Shows', fontsize=16)
plt.xlabel('Content Type', fontsize=12)
plt.ylabel('IMDb Score', fontsize=12)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a Box Plot to analyze the distribution of IMDb scores across different content types (Movies vs. TV Shows). This chart is superior to averages because it shows the "spread" of the data, including the median, the range where most scores fall (the box), and the outliers (dots). This helps me answer not just "which is higher?" but also "which is more consistent?".

##### 2. What is/are the insight(s) found from the chart?

**TV Shows are Rated Higher:** The median score (the line inside the box) for TV Shows is noticeably higher (7.2) than for Movies (6.0). This indicates that users typically rate TV series more favorably than films.

**Movies have a Wider Spread:** The box for Movies is taller and the whiskers are longer, meaning the quality of movies varies wildly—from terrible (scores near 1) to excellent. TV shows are more consistently "good."

**Outliers:** Both categories have outliers (the small circles), particularly on the low end. This means there are some really bad movies and shows that fall far below the standard quality range.
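
The median and spread described above can be read as numbers instead of from the box heights. Since missing IMDb scores were filled with 0 during wrangling, the sketch below drops those placeholder rows first. The toy values are made up, and the `MOVIE`/`SHOW` labels are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for Amazon_titles; 0.0 marks an unrated placeholder.
toy_titles = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "imdb_score": [0.0, 5.0, 6.0, 7.0, 7.0, 7.5, 8.0],
})

# Keep only titles with a real score.
rated = toy_titles[toy_titles["imdb_score"] > 0]

# Quartiles per content type: one row per type, one column per quantile.
quartiles = (
    rated.groupby("type")["imdb_score"].quantile([0.25, 0.5, 0.75]).unstack()
)
# Interquartile range = height of the box in the box plot.
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]

print(quartiles)
```

A larger `iqr` for one content type corresponds to the taller box, i.e. less consistent ratings.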

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Investment Focus:** Since TV Shows generally achieve higher customer satisfaction scores, increasing the budget for original series (like The Boys or Reacher) is a safer bet for maintaining a high-quality reputation than churning out many average movies.

**Quality Control:** The wide variation in Movie ratings suggests a need for stricter quality control. The business might want to be more selective in acquiring films to raise the "floor" of the library's quality, avoiding those low-rated outliers that dilute the brand.

#### Chart - 6 Top 10 "Bankable" Actors Horizontal Bar Chart

In [None]:
# Chart - 6 visualization code

# Merged datasets (Safety check to ensure we have the combined data)
# joining 'Amazon_titles' with 'Amazon_credits' on the 'id' column
df_merged = pd.merge(Amazon_titles, Amazon_credits, on='id', how='inner')

# Filter for ACTORS only
actor_stats = df_merged[df_merged['role'] == 'ACTOR']

# Group by Actor Name and calculated Average Rating & Count
# Wanted to know the Mean Score and How Many titles they did
actor_metrics = actor_stats.groupby('name')['imdb_score'].agg(['mean', 'count'])

# Filtering established actors (at least 5 titles)
# This removes actors who appeared in only 1 movie that happened to get a 10/10.
established_actors = actor_metrics[actor_metrics['count'] >= 5]

# Sorted by Highest Average Score and taken Top 10
top_quality_actors = established_actors.sort_values(by='mean', ascending=False).head(10)

# Visualization of chart
plt.figure(figsize=(12, 6))

# Creating Horizontal Bar Plot
ax = sns.barplot(x=top_quality_actors['mean'], y=top_quality_actors.index, palette='plasma')

# Added title and labels
plt.title('Top 10 "Bankable" Actors (Highest Avg IMDb Score, Min 5 Titles)', fontsize=16)
plt.xlabel('Average IMDb Score', fontsize=12)
plt.ylabel('Actor Name', fontsize=12)
plt.xlim(7, 10) # Zoom in on the high scores (7-10) to see differences clearly

# Added numbers to bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.1f', padding=3)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a Horizontal Bar Chart to identify the highest-performing actors based on Average Quality rather than just volume. By filtering for actors with at least 5 titles, I ensured that the list represents "consistent performers" rather than "one-hit wonders." This chart allows the business to identify "Bankable Talent": actors whose presence is associated with highly rated productions.

##### 2. What is/are the insight(s) found from the chart?

**South Indian Cinema Dominance:** The #1 spot is held by 'Poo' Ram (8.4), a critically acclaimed Tamil actor. Other Indian actors like Fahadh Faasil (7.8) and Ajay Ghosh (7.8) also appear. This suggests that Indian regional cinema (Tamil, Malayalam, Telugu) drives some of the highest user satisfaction ratings on the entire platform.

**The Anime Factor:** Almost half the list consists of legendary Japanese voice actors like Hiroaki Hirata (7.9), Mamiko Noto (7.8), and Akio Otsuka (7.7). This indicates that the Anime titles these actors appear in are consistently rated highly compared to live-action movies.

**No Hollywood A-Listers:** There is a complete absence of mainstream American stars. This suggests that while Hollywood stars bring volume (Chart 7), niche regional and animated actors bring quality.
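
Because unrated titles were filled with a score of 0 during wrangling, an actor's average can be pulled down by titles that simply have no rating. The sketch below shows one way to rank actors on rated titles only. The names, scores, and the lowered threshold `MIN_TITLES = 2` are assumptions for the toy data; the notebook uses a minimum of 5 titles.

```python
import pandas as pd

# Toy stand-in for the merged titles/credits frame; 0.0 = unrated placeholder.
toy_credits = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Bob", "Bob", "Cy"],
    "role": ["ACTOR"] * 6,
    "imdb_score": [8.0, 0.0, 7.0, 6.0, 6.5, 9.5],
})
MIN_TITLES = 2  # hypothetical threshold for the toy data

# Actors only, and only rows with a real score.
rated_actors = toy_credits[
    (toy_credits["role"] == "ACTOR") & (toy_credits["imdb_score"] > 0)
]

# Mean score and number of rated titles per actor (named aggregation).
actor_metrics = rated_actors.groupby("name")["imdb_score"].agg(
    mean_score="mean", rated_titles="count"
)

# Drop actors with too few rated titles, then rank by mean score.
ranked = actor_metrics[actor_metrics["rated_titles"] >= MIN_TITLES].sort_values(
    "mean_score", ascending=False
)

print(ranked)
```

In this toy example "Cy" has the highest single score but is excluded by the minimum-titles filter, which is the "one-hit wonder" effect the threshold is meant to remove.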

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Content Acquisition:** The business should aggressively acquire more South Indian films (especially those featuring Fahadh Faasil or similar method actors) and Anime series. These genres have a proven track record of delighting users with high ratings.

**Marketing Niche Verticals:** Instead of just marketing "Movies," Amazon should create dedicated high-quality hubs for "Anime Leaders" and "Indian Critical Hits," as these distinct user bases are highly engaged and satisfied.

#### Chart - 7 Top 10 Actors by Number of Titles

In [None]:
# Chart - 7 visualization code

# Merging Dataset
# We need to join Titles and Credits Dataset to connect 'Movies' with 'Actors'
# We use an 'inner' join on the 'id' column.
df_merged = pd.merge(Amazon_titles, Amazon_credits, on ='id', how = 'inner')

# Preparing Dataset Filtering Only 'ACTOR'
actors_df = df_merged[df_merged['role'] == 'ACTOR']

# the top 10 Actors by number of titles they appear in
top_actors = actors_df['name'].value_counts().head(10)

# Visualization Of top 10 Actor Using Bar Plot
plt.figure(figsize=(12,6))

# Creating Horizontal Bar plot
ax = sns.barplot(x=top_actors.values, y=top_actors.index, palette='magma')

# Adding Title and labels
plt.title('Top 10 Actors on Amazon Prime Video', fontsize=16)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Actor Name', fontsize=12)

# Adding numbers to the end of bars
for container in ax.containers:
    ax.bar_label(container, padding=3)

plt.show()

##### 1. Why did you pick the specific chart?

I picked a Horizontal Bar Chart because I needed to rank people (Actors) by how many movies they have been in. Since actor names are long, a horizontal chart makes it much easier to read than a vertical one. This chart helps me identify which actors appear most frequently across the entire platform.

##### 2. What is/are the insight(s) found from the chart?

**Dominance of Classic Westerns:** The top names—George 'Gabby' Hayes (49), Roy Rogers (45), and Gene Autry (40)—are all legendary icons of the American Western genre from the 1930s-1950s. This tells me that Amazon Prime has a massive catalog of classic Western movies.

**Prolific Character Actors:** Actors like Bess Flowers (44) and Herman Hack (35) were known as "extras" or supporting actors who appeared in hundreds of films. Their high rank confirms the library's depth in classic Hollywood cinema.

**Nassar (37):** The presence of Nassar, a veteran South Indian actor, reconfirms the platform's strong connection to the Indian market, which we saw in the Country analysis (Chart 4).
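The counts above come from `value_counts()` on credit rows, so an actor credited for two characters in the same title would be counted twice, and two different people who share a name would be merged into one bar. The sketch below shows one way to count distinct titles per person instead. It runs on made-up toy rows, and the `person_id` column is an assumption about the credits table rather than something confirmed in this notebook.

```python
import pandas as pd

# Toy credits table (illustrative only, not real data).
# 'person_id' is an assumed column that identifies a person uniquely.
credits = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "id": ["tm1", "tm1", "tm2", "tm1", "tm3"],   # title ids
    "name": ["Actor A", "Actor A", "Actor A", "Actor B", "Actor B"],
    "role": ["ACTOR"] * 5,
})

actors = credits[credits["role"] == "ACTOR"]

# Row-based count: every credit row counts, even two roles in one title.
row_counts = actors["name"].value_counts()

# Title-based count: number of distinct title ids per person.
title_counts = (
    actors.groupby(["person_id", "name"])["id"]
    .nunique()
    .sort_values(ascending=False)
)

print(row_counts)
print(title_counts)
```

If the two counts differ noticeably on the real data, the title-based version is the safer basis for a "number of titles" ranking.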

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Niche Marketing:** The data suggests Amazon Prime is a goldmine for Western fans. The business should create a dedicated "Classic Westerns" hub or channel to market this specific strength to older demographics or film historians.

**Regional Strategy:** The high ranking of Nassar suggests that Indian content is significant enough to compete with American classics. The platform should continue to highlight these regional stars in their specific markets (e.g., creating a "Best of Nassar" collection for Indian users).

#### Chart - 8 Top 10 Directors by Number of Titles

In [None]:
# Chart - 8 visualization code
# Keep only the rows of df_merged where the credited role is 'DIRECTOR'
directors_df = df_merged[df_merged['role'] == 'DIRECTOR']

# Top 10 directors by number of credited titles
top_directors = directors_df['name'].value_counts().head(10)

# Visualization Of top 10 Directors Using Bar Plot
plt.figure(figsize=(12,6))

# Creating Horizontal Bar plot
ax = sns.barplot(x=top_directors.values, y=top_directors.index, palette='crest')

# Adding Title and labels
plt.title('Top 10 Directors on Amazon Prime Video', fontsize=16)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Director Name', fontsize=12)

# Adding numbers to the end of bars
for container in ax.containers:
    ax.bar_label(container, padding=3)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a Horizontal Bar Chart to visualize the most prolific directors. Just like with the Actors, listing the names horizontally is the cleanest way to display this categorical data. This chart helps identify the key creative figures responsible for the highest volume of content on the platform.

##### 2. What is/are the insight(s) found from the chart?

**Joseph Kane (41) & Sam Newfield (38):** These top directors are legendary figures in the world of "B-Westerns" and low-budget action films from the mid-20th century. Their dominance reconfirms the platform's massive library of classic, high-volume genre cinema.

**Lesley Selander (22):** Another prolific director known for Westerns, further solidifying the platform's niche strength in this genre.

**Jay Chapman (34):** Unlike the others, Jay Chapman is often associated with stand-up comedy specials. His presence indicates a significant investment in comedy content, aligning with the earlier finding that "Comedy" is the **#2 genre.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Genre Hubs:** The sheer volume of content from directors like Joseph Kane suggests that a curated "Golden Age of Westerns" collection would have deep content to support it, increasing engagement for older demographics.

**Comedy Strategy:** The high ranking of Jay Chapman highlights the importance of stand-up specials. The business should continue to acquire or produce exclusives in this space, as it is a high-volume, low-production-cost category that retains subscribers.

#### Chart - 9 Distribution of Runtime (Movies vs. TV Shows)

In [None]:
# Chart - 9 visualization code

# Filtering extreme outliers for a cleaner chart
# We keep titles under 200 minutes (3 hours 20 minutes) to focus on the main content library.
df_filtered = Amazon_titles[Amazon_titles['runtime'] < 200]

# Visualization
plt.figure(figsize=(12, 6))

# Creating a Histogram with KDE (Kernel Density Estimate)
# Adding hue='type'
sns.histplot(
    data=df_filtered,
    x='runtime',
    hue='type',
    kde=True,
    bins=30,
    palette='viridis',
    element='step'
)

# Adding title and labels
plt.title('Distribution of Duration: Movies vs. TV Shows', fontsize=16)
plt.xlabel('Runtime (Minutes)', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)

# Adding reference lines for standard lengths
plt.axvline(x=22, color='orange', linestyle='--', label='TV Show Standard (22m)')
plt.axvline(x=90, color='red', linestyle='--', label='Movie Standard (90m)')

plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a Histogram with KDE to visualize the distribution of content duration. Since "Movies" and "TV Shows" have vastly different standard lengths, plotting them together with different colors allows me to clearly see the two distinct clusters (short-form vs. long-form) and identify the most common runtime for each type.

##### 2. What is/are the insight(s) found from the chart?

**Two Distinct Peaks:** The chart shows two clear peaks.

  * **TV Shows (Left):** The peak is around 20-45 minutes, which aligns with standard sitcoms (22 mins) and hour-long dramas (44 mins).

  * **Movies (Right):** The peak is around 90-100 minutes, which is the industry standard for feature films.

**Short Movies exist:** There is a surprising overlap where some "Movies" are very short (45-60 mins). These are likely documentaries, stand-up specials, or animated features, which Amazon classifies as movies but are shorter than theatrical releases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Ad Placement Strategy:** Knowing that TV show episodes cluster at roughly 20-45 minutes (with the sitcom standard near 22 minutes) helps in planning ad breaks. Ads should be shorter and more frequent for these quick episodes compared to the longer 90-minute movies.

**Content Gap:** If the business wants to target "Commuters" (people watching on trains/buses), they should acquire more content in the 20-30 minute range, as the current library has a smaller volume here compared to the massive 90-minute movie library.
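The peaks described above were read off the histogram by eye. A minimal sketch of checking them numerically is to bucket runtimes into 10-minute bins and take the most common bin per content type. The rows below are toy values, and the `MOVIE` / `SHOW` labels are assumed to match the `type` column.

```python
import pandas as pd

# Toy runtimes in minutes (illustrative only, not real data).
titles = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 4,
    "runtime": [88, 92, 95, 101, 55, 22, 24, 44, 21],
})

# Start of the 10-minute bin each runtime falls into (e.g. 95 -> 90).
titles["bin_start"] = (titles["runtime"] // 10) * 10

# Most common bin per type; mode() can return ties, so take the first.
modal_bin = titles.groupby("type")["bin_start"].agg(lambda s: s.mode().iloc[0])

# Median runtime per type as a second, bin-free summary.
median_runtime = titles.groupby("type")["runtime"].median()

print(modal_bin)
print(median_runtime)
```

The same two lines can be pointed at `df_filtered` to confirm whether the real peaks sit where the chart suggests.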

#### Chart - 10 Age Rating Counts Using Pie Chart

In [None]:
# Chart - 10 visualization code
# Count the values for each age ratings.

rated_content = Amazon_titles[Amazon_titles['age_certification'] != 'Not Rated'] # Filtering Out 'Not Rated' Movies & TV Shows.

# I will take Top 5 ratings to Maintain Pie Chart clean
rating_counts = rated_content['age_certification'].value_counts().head(5)

# Implementing visualization for Pie chart
plt.figure(figsize=(10,8))

# Creating Pie Chart
plt.pie(
    rating_counts.values,
    labels=rating_counts.index,
    autopct='%1.1f%%',       # Show percentages with 1 decimal place
    startangle=140,          # Rotate start for better look
    colors=sns.color_palette('pastel'), # Use pastel colors
    explode=[0.02] + [0] * (len(rating_counts) - 1) # Explode the first slice slightly; length always matches the number of slices
)

# Adding title
plt.title('Distribution of Age Certifications (Rated Content Only)', fontsize=16)

plt.show()


##### 1. Why did you pick the specific chart?

I chose a Pie Chart to analyze the composition of the platform's rated content. I specifically excluded the "Not Rated" category (which dominated the data) to "zoom in" on the known age ratings. This allows us to clearly see the ratio between Mature (R, 16+) and Family-Friendly (G, PG) content without the noise of unrated titles.

##### 2. What is/are the insight(s) found from the chart?

**Dominance of 'R':** The chart shows that Restricted (R) content is the single largest category, making up 43.0% of the five most common ratings plotted. This suggests that the rated part of Amazon Prime's library leans toward mature audiences.

**Teens & Adults (PG-13 + R):** Combining R (43.0%) and PG-13 (20.2%) gives 63.2%, so nearly two-thirds of the plotted rated titles are targeted at audiences aged 13 and above.

**Small Family Share:** PG (20%) and G (9.3%) together make up 29.3% of the plotted rated titles, reinforcing the insight that "Kids & Family" is a secondary focus compared to adult entertainment.
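Because the pie chart is drawn from `head(5)`, its percentages are shares of the five plotted ratings, not of all rated titles. The sketch below, on toy labels with made-up counts, shows how the two denominators differ.

```python
import pandas as pd

# Toy age certifications with counts 6, 5, 4, 3, 2, 1 (illustrative only).
ratings = pd.Series(
    ["R"] * 6 + ["PG-13"] * 5 + ["PG"] * 4 + ["G"] * 3 + ["TV-MA"] * 2 + ["TV-14"] * 1
)

counts = ratings.value_counts()

# Share of every rated title (denominator = all 21 rows).
share_all = counts / counts.sum()

# Share within the top five only, which is what plt.pie(autopct=...) displays.
top5 = counts.head(5)
share_top5 = top5 / top5.sum()

print(share_all.round(3))
print(share_top5.round(3))
```

When the sixth and later categories are small the gap is minor, but stating the denominator keeps figures such as "43.0%" unambiguous.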

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Marketing Segmentation:** The high volume of R-rated content means marketing campaigns should focus on late-night viewing and adult genres (Thriller, Action) rather than trying to brand the platform as a "family hub" like Disney+.

**Acquisition Strategy:** The data shows a clear gap in G-rated (9.3%) content. If the business wants to reduce churn among families with young children, they must specifically acquire more G-rated animated movies or educational shows to balance the library.

#### Chart - 11 Average IMDb Score Over Time (2000-2021)

In [None]:
# Chart - 11 visualization code

# Filter data for the modern era (2000 onwards).
# Focusing on recent history where the volume of content is significant.
df_trend = Amazon_titles[Amazon_titles['release_year'] >= 2000]
plt.figure(figsize=(12, 6)) # Setting up fig size

# Creating a Line Plot
# seaborn aggregates the rows for each year and plots the mean score.
# The shaded band is, by default, a 95% confidence interval for that yearly mean.
sns.lineplot(
    data=df_trend,
    x='release_year',
    y='imdb_score',
    hue='type',         # One line each for Movies and TV Shows (see the legend for the colors)
    palette='tab10',    # Distinct colors
    marker='o'          # Added dots to make data points clear
)

# Adding title and labels
plt.title('Trend of Average IMDb Scores Over Time (2000 - Present)', fontsize=16)
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('Average IMDb Score', fontsize=12)

# Add a horizontal line for the overall average (5.25)
plt.axhline(y=5.25, color='gray', linestyle='--', label='Platform Average (5.25)')
plt.legend()

plt.show()

##### 1. Why did you pick the specific chart?

I chose a Line Plot with Confidence Intervals because it is a standard and effective way to visualize trends over time (Time Series Analysis). By plotting the average IMDb score year-by-year, I can instantly see if the quality of content is trending upwards, downwards, or staying flat. Separating Movies and TV Shows (using different colors) allows me to compare their individual performance, while the shaded confidence intervals show how precisely each yearly average is estimated: a wider band means fewer titles or more spread in the ratings for that year.

##### 2. What is/are the insight(s) found from the chart?

**TV Shows consistently outperform Movies:** The blue line (TV Shows) is consistently higher than the orange line (Movies) for almost the entire 20-year period. This indicates that serialized content on Amazon Prime generally achieves higher user satisfaction ratings than its feature film library.

**Movie Quality Stagnation/Decline:** The average movie rating (orange line) hovers around or below the platform average (5.25), showing a long-term trend of stagnation. There is a noticeable dip in recent years (2020-2022), likely due to the influx of lower-budget direct-to-streaming titles or "filler" content acquired to boost volume.

**Volatile Peaks for Shows:** While TV shows rate higher on average, their line is more "jagged" or volatile. This suggests that a few highly-rated hit series (like The Boys or Invincible) can spike the average for a year, whereas a bad batch of shows can pull it down significantly.

**Recent Dip (2020-2022):** Both Movies and TV Shows show a sharp decline in average ratings around 2020-2021. This could be correlated with the pandemic-era production challenges or a shift in content acquisition strategy to prioritize quantity over quality.
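Later charts in this notebook filter out scores of 0 because they stand for missing data. If such placeholder rows are still present when the yearly mean is computed, they pull the average down, and the effect is strongest in years with many unrated new titles. A minimal sketch on toy rows compares the yearly mean with and without the placeholders and derives the reference line from the data instead of hard-coding it.

```python
import pandas as pd

# Toy rows (illustrative only); 0.0 stands for a missing IMDb score.
titles = pd.DataFrame({
    "release_year": [2020, 2020, 2020, 2021, 2021],
    "type": ["MOVIE"] * 5,
    "imdb_score": [6.0, 0.0, 7.0, 5.0, 0.0],
})

# Mean per year and type with the placeholder zeros left in.
raw_mean = titles.groupby(["release_year", "type"])["imdb_score"].mean()

# Same mean after dropping rows without a real score.
valid = titles[titles["imdb_score"] > 0]
clean_mean = valid.groupby(["release_year", "type"])["imdb_score"].mean()

# Reference value for an axhline, computed rather than typed in.
platform_avg = valid["imdb_score"].mean()

print(raw_mean)
print(clean_mean)
print(platform_avg)
```

Running the same comparison on `df_trend` would show how much of the recent dip survives once placeholder scores are excluded.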

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Investment Pivot:** Since TV Shows consistently generate higher satisfaction ratings (often > 6.5), the business should prioritize budget allocation towards Original Series rather than acquiring low-rated bulk movie catalogs. High-rated shows are better for long-term subscriber retention.

**Quality Control Mechanism:** The declining trend in movie ratings is a risk to the brand's premium perception. The content team should implement a stricter "Quality Floor" (e.g., avoiding acquisition of titles with < 5.0 IMDb) to stop the dilution of the library's perceived value.

**Marketing Strategy:** Marketing should lean heavily on "Critical Acclaim" for TV Series (e.g., "From the studio that brought you The Boys"), whereas movie marketing might need to focus more on specific genres or stars rather than aggregate quality claims.

#### Chart - 12 Impact of Age Rating Using Box Plot

In [None]:
# Chart - 12 visualization code

# Filtered the dataset for the most common Age Ratings to keep the chart clean
# selecting the standard MPAA and TV guidelines.
target_ratings = ['G', 'PG', 'PG-13', 'R', 'TV-G', 'TV-PG', 'TV-14', 'TV-MA']
df_age_quality = Amazon_titles[Amazon_titles['age_certification'].isin(target_ratings)]

# Visualization of Box Plot
plt.figure(figsize=(14, 7))

# This shows the median quality and range for each age group.
sns.boxplot(
    data=df_age_quality,
    x='age_certification',
    y='imdb_score',
    order=target_ratings, # Movie ratings first, then TV ratings, each ordered from youngest to most mature audience
    palette='coolwarm'
)

# Added title and labels
plt.title('Impact of Age Rating on Quality (IMDb Score)', fontsize=16)
plt.xlabel('Age Certification', fontsize=12)
plt.ylabel('IMDb Score', fontsize=12)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a **Box Plot** because it allows me to compare the "quality spread" across different age groups. Instead of just looking at the average, the box plot shows the **median** (the middle line) and the **consistency** (the height of the box, which is the interquartile range). This helps answer the business question: "Is producing mature content actually safer or riskier than family content?".

##### 2. What is/are the insight(s) found from the chart?

**TV Dominates Quality:** The most obvious insight is that all TV categories (TV-MA, TV-14, TV-PG) have significantly higher median scores than their Movie counterparts (R, PG-13, PG).

**TV-MA is the "Sweet Spot":** The TV-MA (Mature Audiences) category has the highest median score (~7.5) and the box is placed high up. This suggests that gritty, complex, adult-oriented serialized storytelling is the best-rated content on Amazon Prime.

**The "Movie" Struggle:** The ratings for movies (G, PG, PG-13, R) are remarkably lower, with medians hovering around 6.0. This suggests that while Amazon has a lot of movies, the quality of these films is generally perceived as mediocre compared to their TV catalog.
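The medians quoted above were read from the box plot. A short sketch of producing the same numbers as a table is to compute the quartiles per age certification directly. The scores below are toy values chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Toy scores (illustrative only, not real data).
scores = pd.DataFrame({
    "age_certification": ["TV-MA"] * 3 + ["R"] * 3,
    "imdb_score": [7.0, 8.0, 9.0, 5.0, 6.0, 7.0],
})

# Quartiles per certification: one row per rating, one column per quantile.
q = (
    scores.groupby("age_certification")["imdb_score"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range = height of the box in the box plot.
q["iqr"] = q[0.75] - q[0.25]

print(q)
```

Sorting this table by the `0.5` column on the real data gives a ranking of ratings by median score that does not depend on reading values off the axis.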

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Greenlight Strategy:** The data strongly supports increasing the budget for TV-MA Originals. Shows like The Boys or Invincible aren't just popular; they are the highest-quality assets the platform has.

**Movie Acquisition Reform:** The low median scores for R and PG-13 movies suggest a "quantity over quality" issue. The content team needs to implement stricter quality filters for acquiring movies, as the current library is weighed down by average content.

**Family Strategy:** If the business wants to target families, they should invest in TV-PG series rather than PG movies, as the series format consistently delivers higher user satisfaction in that demographic.

#### Chart - 13 IMDb Score vs. TMDB Score Using Bubble (Scatter) Plot, Sized by IMDb Votes

In [None]:
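# Chart - 13 visualization code
import plotly.express as px  # harmless if already imported earlier in the notebook

# Filtering data to remove titles whose scores or vote counts are 0 (the placeholder for missing values)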
df_bubble = Amazon_titles[
    (Amazon_titles['imdb_score'] > 0) &
    (Amazon_titles['tmdb_score'] > 0) &
    (Amazon_titles['imdb_votes'] > 0)
]

# Create Interactive Bubble Plot
fig = px.scatter(
    df_bubble,
    x='imdb_score',
    y='tmdb_score',
    size='imdb_votes',     # Bubble size depends on popularity
    color='type',          # Color by Movie vs Show
    hover_name='title',    # SHOWS TITLE ON HOVER (Key Feature!)
    hover_data=['release_year'], # Add extra info to hover
    size_max=60,           # limit max bubble size
    opacity=0.6,
    template='plotly_white',
    title='<b>Interactive Correlation: IMDb vs TMDB Scores</b><br>(Size = Popularity)',
    labels={'imdb_score': 'IMDb Score', 'tmdb_score': 'TMDB Score'}
)

# Added a diagonal line for reference
fig.add_shape(
    type="line",
    x0=0, y0=0, x1=10, y1=10,
    line=dict(color="Red", width=2, dash="dash")
)

fig.show()

##### 1. Why did you pick the specific chart?

I selected a **Bubble Plot** because it is the most effective way to visualize a **Multivariate Analysis** involving three key variables simultaneously:

* **X-Axis:** IMDb Score (Quality Metric A)

* **Y-Axis:** TMDB Score (Quality Metric B)

* **Bubble Size:** IMDb Votes (Engagement/Popularity)

This visualization allows us to validate the reliability of quality metrics (do the platforms agree?) while simultaneously seeing if "High Quality" content correlates with "High Popularity" (big bubbles). It directly addresses the business objective of understanding the relationship between **Content Quality** and **User Engagement**.

##### 2. What is/are the insight(s) found from the chart?

**Strong Positive Correlation:** The bubbles cluster along the diagonal red line (Perfect Match Line). This suggests that IMDb and TMDB users generally agree on content quality: a title that is rated high on one platform is very likely to be rated high on the other.

**Popularity follows Quality:** The largest bubbles (representing the most popular/voted content) are heavily concentrated in the top-right quadrant (Scores > 7.0). This indicates that the most engaged-with content on Amazon Prime is also the highest-rated.

**The "Long Tail" of Mediocrity:** There is a vast sea of small Movie bubbles in the lower-left quadrant (Scores < 6.0). These are unpopular, low-rated movies that likely clutter the library without driving significant user engagement.

**TV Show Superiority:** The TV Show bubbles are noticeably denser in the upper-right section compared to movies, reinforcing the earlier insight that Amazon's serialized content is generally of higher perceived quality than its movie catalog.
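The "Popularity follows Quality" reading can be checked with a simple comparison: the share of titles scoring above 7.0 among the most-voted titles versus the share in the whole filtered set. The sketch below uses toy rows, and the choice of the top 3 titles is an arbitrary cut-off for illustration.

```python
import pandas as pd

# Toy titles (illustrative only, not real data).
titles = pd.DataFrame({
    "title": list("ABCDEFGHIJ"),
    "imdb_score": [8.5, 7.9, 6.1, 5.0, 4.2, 7.2, 6.8, 5.5, 3.9, 8.1],
    "imdb_votes": [900000, 500000, 1200, 300, 150, 80000, 4000, 700, 90, 250],
})

# Same kind of filter as the bubble chart: drop placeholder zeros.
valid = titles[(titles["imdb_score"] > 0) & (titles["imdb_votes"] > 0)]

# The most-voted titles (the "big bubbles").
top_voted = valid.nlargest(3, "imdb_votes")

# Share of titles scoring above 7.0 in each group.
share_top = (top_voted["imdb_score"] > 7.0).mean()
share_all = (valid["imdb_score"] > 7.0).mean()

print(share_top, share_all)
```

On the real data a larger cut-off, for example the top 100 titles by votes, would be a more stable basis for the comparison.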

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Validation of Metrics:** Since the two scores agree closely, the business can use IMDb Scores as a reasonable KPI for content acquisition. Public user sentiment this consistent reduces the need for expensive proprietary metrics.

**Pruning Strategy:** The abundance of small, low-rated Movie bubbles suggests a "Bloated Library." To optimize costs, the business should consider removing or stopping the renewal of licenses for movies with scores under 5.0 and low vote counts, as they add volume but no value.

**Investment Focus:** The large bubbles in the 7.0-9.0 range are the "Crown Jewels." The strategy should shift from buying volume (many small dots) to buying/producing fewer, higher-quality titles (big bubbles), as quality is the primary driver of popularity on the platform.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap Visualization code
# Selecting only the numerical columns for correlation
# This excludes text columns such as 'id', 'title', and 'type'.
numeric_df = Amazon_titles.select_dtypes(include='number')

# Calculate the correlation matrix
# This gives us a number between -1 and 1 for every pair of variables
corr_matrix = numeric_df.corr()

# Creating the Heatmap
plt.figure(figsize=(12, 8))

# sns.heatmap arguments:
# annot=True: for Writing the correlation number on each box
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

# Adding title
plt.title('Correlation Heatmap of Numeric Variables', fontsize=16)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a Correlation Heatmap because I wanted to see how all the numerical variables (like release year, scores, and popularity) relate to each other in one single view. This chart uses color to show the strength of relationships, making it easy to spot if two things (like IMDb Score and TMDB Score) move together or if they have no connection at all.

##### 2. What is/are the insight(s) found from the chart?

**Weak Correlation Overall:** The first thing I noticed is that most of the boxes are light blue, meaning the correlation numbers are very close to 0 (e.g., 0.06, -0.02). This tells me that most variables don't strongly affect each other. For example, a longer runtime does not guarantee a higher imdb_score.

**Strongest Positive Correlation:** The highest correlation is between imdb_votes and tmdb_popularity (0.25). This makes sense: movies that get a lot of votes on IMDb are usually the popular ones that people are searching for on TMDB.

**Seasons vs. Runtime:** There is a negative correlation (-0.32) between seasons and runtime. This is logical because "Movies" have long runtimes but 0 seasons, while "TV Shows" have one or more seasons but typically shorter episode runtimes.
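The heatmap is computed on all rows, while the bubble and pair plots first drop scores of 0 that stand for missing data. Placeholder zeros can weaken or even flip a correlation coefficient, so it is worth comparing both versions. A minimal sketch on toy rows:

```python
import pandas as pd

# Toy scores (illustrative only); 0.0 marks a missing value.
titles = pd.DataFrame({
    "imdb_score": [8.0, 7.0, 6.0, 0.0, 6.5],
    "tmdb_score": [8.0, 7.0, 6.0, 7.5, 0.0],
})

# Pearson correlation with the placeholder zeros left in.
r_raw = titles.corr().loc["imdb_score", "tmdb_score"]

# Pearson correlation on rows where both scores are real.
valid = titles[(titles["imdb_score"] > 0) & (titles["tmdb_score"] > 0)]
r_clean = valid.corr().loc["imdb_score", "tmdb_score"]

print(round(r_raw, 2), round(r_clean, 2))
```

If the real data shows a similar gap, the filtered matrix is the better companion to the IMDb vs. TMDB scatter plots.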

#### Chart - 15 - Pair Plot Visualization

In [None]:
# Chart - 15 visualization code

# Defining the columns we want to plot
cols_to_plot = ['release_year', 'imdb_score', 'tmdb_score', 'tmdb_popularity']

# Filtering the data to remove 0 values
# created a temporary dataframe 'df_pair' that only contains rows where scores and popularity are greater than 0.
df_pair = Amazon_titles[
    (Amazon_titles['imdb_score'] > 0) &
    (Amazon_titles['tmdb_score'] > 0) &
    (Amazon_titles['tmdb_popularity'] > 0)
]

# Create Pair Plot
sns.pairplot(
    data=df_pair,
    vars=cols_to_plot,
    hue='type',       # Colors by Movie vs TV Show
    palette='husl',
    height=2.5,
    plot_kws={'alpha': 0.5} # Transparency to see overlaps
)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a Pair Plot as the final visualization because it serves as a comprehensive summary of the entire analysis. By filtering out the 0 values (missing data), I ensured the chart reflects only valid content. It automatically creates a grid of scatter plots for every combination of numerical variables (release_year, scores, popularity) while simultaneously showing their individual distributions on the diagonal. This allows me to spot any remaining patterns, clusters, or outliers across the entire dataset in a single view.

##### 2. What is/are the insight(s) found from the chart?

**Distinct Clusters:** The diagonal density plots (curves) clearly show that Movies (teal) and TV Shows (pink) have different distribution shapes. For example, the imdb_score curve for TV Shows is shifted to the right (higher ratings) compared to Movies.

**Yearly Volume:** The release_year plots show a massive spike in density for recent years (2015-2021), reconfirming the exponential growth of the library.

**Score Correlation:** The scatter plot between imdb_score and tmdb_score shows a tight linear relationship (a neat diagonal line), validating that quality signals are consistent across platforms.

**Popularity vs. Quality:** The scatter plots involving tmdb_popularity show that popularity is heavily skewed; a few titles have massive popularity (outliers) while most cluster near the bottom, regardless of their score.
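The skew described above can be put into numbers by comparing mean and median and by looking at the skewness before and after a log transform. The values below are a toy series that doubles at each step, used only to illustrate the effect. Applying `log1p` is a suggestion for making such a variable easier to plot, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy popularity values (illustrative only): a few very large outliers.
popularity = pd.Series([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], dtype=float)

# A mean far above the median is a quick sign of right skew.
mean_pop = popularity.mean()
median_pop = popularity.median()

# Sample skewness before and after a log(1 + x) transform.
raw_skew = popularity.skew()
log_skew = np.log1p(popularity).skew()

print(mean_pop, median_pop)
print(round(raw_skew, 2), round(log_skew, 2))
```

Plotting `np.log1p(tmdb_popularity)` instead of the raw column would spread out the cluster near the bottom of the pair plot.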

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**1. Shift Investment to High-Quality Serialized Content (TV-MA)**

**Insight:** The analysis indicates that TV Shows consistently outperform Movies in user ratings (Median ~7.2 vs. 6.0), with Mature (TV-MA) content achieving the highest satisfaction scores.

**Suggestion:** Redirect budget from acquiring bulk movie libraries to funding Original Adult-Oriented Series (Action, Thriller, Drama). This content drives the highest engagement and retention.

**2. Leverage Regional Strengths for Global Growth**

**Insight:** India is the second-largest content producer on the platform, and Indian regional actors/directors are among the highest-rated talent.

**Suggestion:** Double down on the Indian market by investing in high-budget regional originals (Tamil, Telugu, Hindi). Use this strong foothold to dominate the South Asian streaming market.

**3. Revamp the Recommendation Algorithm for "Hidden Gems"**

**Insight:** The Bubble Plot showed a large number of small bubbles with High IMDb Scores (>7.0), meaning titles that are well rated but have Low Vote Counts and are likely under-watched.

**Suggestion:** Update the recommendation engine to prioritize surfacing these "Hidden Gems" over generic popular titles. This increases watch time and perceived library value without the cost of acquiring new content.

**4. Implement a "Quality Floor" for Movie Acquisitions**

**Insight:** The movie library suffers from "Quality Stagnation," with a flood of average-rated titles diluting the brand.

**Suggestion:** Stop renewing licenses for movies with IMDb scores below 5.0 unless they have significant viewership. Focus movie acquisition on specific high-performing niches like Anime and Documentaries, which consistently deliver high user satisfaction.
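The "Hidden Gems" in suggestion 3 can be turned into a concrete list with a simple filter. The sketch below is one possible definition, run on toy rows: a score above 7.0 and a vote count below the median of all validly rated titles. The median cut-off is an assumption made for illustration; the notebook does not define what counts as a low vote count.

```python
import pandas as pd

# Toy titles (illustrative only); zeros mark missing values.
titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "imdb_score": [8.2, 7.5, 7.1, 6.0, 0.0, 9.0],
    "imdb_votes": [120, 50000, 300, 80, 0, 1000000],
})

# Keep only rows with a real score and at least one vote.
valid = titles[(titles["imdb_score"] > 0) & (titles["imdb_votes"] > 0)]

# Assumed threshold for "low visibility": below the median vote count.
vote_cutoff = valid["imdb_votes"].median()

# Well rated but rarely voted on.
hidden_gems = valid[
    (valid["imdb_score"] > 7.0) & (valid["imdb_votes"] < vote_cutoff)
]

print(vote_cutoff)
print(hidden_gems[["title", "imdb_score", "imdb_votes"]])
```

Counting the rows of `hidden_gems` on the real table would show how large this group actually is before it is used to justify a change to the recommendation engine.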

# **Conclusion**

This comprehensive Exploratory Data Analysis of the Amazon Prime Video dataset (9,800+ titles, 124,000+ credits) reveals a platform at a strategic crossroads. The data highlights a distinct trade-off between **Volume** and **Value**.

**Key Findings:**

* **Content Composition:** While Movies dominate the library by volume (~78%), they significantly underperform in user satisfaction compared to TV Shows. The platform’s strength lies in **Serialized Content**, particularly **Mature (TV-MA)** originals and **Anime**, which consistently achieve the highest IMDb scores.

* **Global Strengths:** India has emerged as a critical content hub, ranking as the second-largest producer. Regional talent from India and Japan (Anime) often outranks mainstream Hollywood stars in terms of average content quality.

* **The "Hidden Gem" Opportunity:** A vast portion of the library consists of high-quality titles with low visibility. This represents an untapped asset class that can be leveraged to increase engagement without new acquisition costs.

**Strategic Verdict:** To maximize growth and retention, Amazon Prime Video should pivot from a "Volume-First" approach to a **"Quality-First" strategy**. By doubling down on **Adult Animation, Regional Originals (India/UK), and TV-MA Series**, and by refining its recommendation engine to surface hidden high-rated content, the platform can solidify its position as a premium entertainment destination rather than just a bulk content repository.
