# **Project Name: Amazon Prime EDA**

##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**


**Introduction**  
With the rapid growth of digital streaming platforms, competition among providers like **Amazon Prime, Netflix, and Disney+** has intensified. Understanding audience preferences, content trends, and platform performance is crucial for sustaining user engagement and increasing subscriptions. This project aims to perform an **Exploratory Data Analysis (EDA)** on Amazon Prime's dataset, which includes TV shows and movies. The goal is to extract insights about content distribution, genre popularity, key contributors, and trends over time, which can help in **strategic decision-making** regarding content acquisition and recommendations.  

**Dataset Overview**  
The dataset consists of two main files:  
- **titles.csv** – Contains metadata about movies and TV shows, including **title, genre, release year, IMDb ratings, and runtime.**  
- **credits.csv** – Provides details about the **cast and crew**, including actors, directors, and their roles.  

By analyzing this data, we can gain valuable insights into **content trends, most popular genres, audience preferences, and key contributors** driving Amazon Prime’s success.  

**Objectives of the EDA**

The primary objectives of this analysis include:  
1. **Content Distribution Analysis**:
   - Determine the **proportion of movies vs. TV shows** available on Amazon Prime.  
   - Analyze the **release year trends** to see if Amazon is focusing more on recent content or older classics.

2. **Genre Popularity & Trends**:
   - Identify the **most popular genres** and how they vary over time.  
   - Determine which genres dominate the **movie vs. TV show** categories.  

3. **Key Contributors (Actors & Directors)**:  
   - Analyze the most **frequently featured actors and directors**.  
   - Determine whether **certain actors or directors** are more associated with high-rated content.  

4. **Content Ratings & Audience Preferences**:  
   - Examine the **distribution of IMDb ratings** across different genres and content types.  
   - Identify if higher-rated content belongs to specific **genres or directors**.  

5. **Missing Data & Data Cleaning**:  
   - Detect and handle **missing values and duplicate entries** to ensure data reliability.  
   - Standardize categorical fields like **genre, country, and language** for consistency.  

**Insights and Business Impact**

**1. Content Strategy Optimization**  
Understanding **what type of content performs well** can help Amazon Prime **invest in trending genres** and **license popular shows/movies** that align with audience preferences. If data shows that **action and thriller movies are more popular than drama or romance**, Amazon can allocate more resources to acquiring or producing such content.  

**2. Audience Engagement & Retention**  
By analyzing **IMDb ratings and user preferences**, Amazon Prime can refine its **recommendation algorithms**, ensuring users are suggested content they are most likely to enjoy. If the dataset indicates that users highly rate content from **specific directors or actors**, those contributors can be prioritized in future licensing deals.  

**3. Market Expansion Opportunities**  
If the analysis shows that **a particular genre is gaining traction in certain regions**, Amazon Prime can focus on **regional content production** to attract a **wider global audience**.  

**4. Competitive Advantage**  
By understanding **content trends** compared to competitors like Netflix and Disney+, Amazon Prime can **strategically acquire exclusive content** before its competitors, securing a larger market share.

# **GitHub Link -**

https://github.com/Runal21/Amazon-Prime-EDA-Project

# **Problem Statement**


With the rise of online streaming platforms, understanding the available content and its characteristics is crucial for **content strategy and user engagement**. This project explores the Amazon Prime dataset to identify trends in **content type, genre popularity, release patterns, and factors affecting audience interest.**  

#### **Define Your Business Objective?**


The primary objective is to leverage **data-driven insights** to optimize **Amazon Prime's content strategy**. By analyzing trends in **genre popularity, release frequency, and key contributors (actors, directors)**, Amazon Prime can:  
- Enhance **user engagement and satisfaction**  
- Improve **content recommendations**  
- Make **informed decisions** on future **content acquisitions and productions**  


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below and answers the listed questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset

titles_path = "titles.csv"
credits_path = "credits.csv"

titles_df = pd.read_csv(titles_path)
credits_df = pd.read_csv(credits_path)

### Dataset First View

In [None]:
# Dataset First Look

print("First few rows of Titles dataset:")
print(titles_df.head())

print("First few rows of Credits dataset:")
print(credits_df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print("Titles dataset shape:", titles_df.shape)
print("Credits dataset shape:", credits_df.shape)

### Dataset Information

In [None]:
# Dataset Info

titles_df.info()
credits_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
# Visualizing the missing values
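
The duplicate and missing-value cells above contain no code, so the summary that follows cannot be reproduced from them. The snippet below is a minimal sketch of how such counts could be obtained. The `sample` frame and the helper `summarize_quality` are illustrative names that do not come from the notebook; in practice `titles_df` or `credits_df` would be passed instead.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing-value counts and percentages, largest first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Tiny made-up sample standing in for titles_df (hypothetical values)
sample = pd.DataFrame({
    "id": ["t1", "t2", "t2", "t3"],
    "title": ["A", "B", "B", None],
    "imdb_score": [7.1, 6.0, 6.0, None],
})

# Rows that repeat an earlier row in every column
duplicate_rows = int(sample.duplicated().sum())
quality = summarize_quality(sample)

print("Duplicate rows:", duplicate_rows)
print(quality)
```

Plotting `quality["missing_pct"]` as a horizontal bar chart is one simple way to visualize the gaps.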

### What did you know about your dataset?

- The dataset consists of two main files: **titles.csv** (metadata about movies & TV shows) and **credits.csv** (cast and crew details).  
- It includes information such as **title, genre, release year, IMDb ratings, and runtime**.  
- The dataset contains **missing values** in certain columns, which need handling.  
- There are **duplicate values** that may require cleaning.  
- The distribution of **movies vs. TV shows** can provide insights into content strategy.  
- Further visualization and analysis can help identify **trends in genre popularity, key contributors, and audience engagement**.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

print("Columns in Titles dataset:")
print(titles_df.columns.tolist())

print("Columns in Credits dataset:")
print(credits_df.columns.tolist())

In [None]:
# Dataset Describe

print("Statistical summary of Titles dataset:")
print(titles_df.describe())

print("Statistical summary of Credits dataset:")
print(credits_df.describe())

### Variables Description

**Titles Dataset Columns**  
- **id** – Unique identifier for the title  
- **title** – Name of the movie or TV show  
- **type** – Whether it's a movie or a TV show  
- **description** – Brief summary of the content  
- **release_year** – Year the movie/TV show was released  
- **age_certification** – Age rating of the content (e.g., PG-13, R)  
- **runtime** – Duration of the content in minutes  
- **genres** – Genre classification (e.g., Drama, Comedy)  
- **production_countries** – Countries involved in the production  
- **seasons** – Number of seasons (for TV shows)  
- **imdb_id** – IMDb unique identifier  
- **imdb_score** – IMDb rating of the content  
- **imdb_votes** – Number of votes on IMDb  
- **tmdb_popularity** – Popularity score on TMDB  
- **tmdb_score** – TMDB rating of the content  

**Credits Dataset Columns**  
- **person_id** – Unique identifier for each person  
- **id** – Title ID (links to Titles dataset)  
- **name** – Name of the person (actor, director, etc.)  
- **character** – Character name played (for actors)  
- **role** – Role type (e.g., Actor, Director)
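
Because `id` in the credits data links to `id` in the titles data, the two tables can be joined to relate people to title attributes such as ratings. The following is a small sketch on made-up rows (the frames `titles` and `credits` here are stand-ins, not the real files). It assumes each title id appears once in the titles table, which `validate="many_to_one"` checks.

```python
import pandas as pd

# Made-up rows standing in for titles_df and credits_df
titles = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
credits = pd.DataFrame({
    "person_id": [1, 2, 3],
    "id": ["t1", "t1", "t9"],  # "t9" deliberately has no matching title
    "name": ["Person One", "Person Two", "Person Three"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# Left join keeps every credit row; validate raises if a title id repeats in `titles`
merged = credits.merge(
    titles, on="id", how="left", validate="many_to_one", indicator=True
)

# Credit rows whose title id was not found in the titles table
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} credit rows have no matching title")
```

Checking `unmatched` before any per-person analysis shows how many credits would silently drop out of an inner join.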

### Check Unique Values for each variable.

In [None]:
## Check Unique Values for Each Variable

for column in titles_df.columns:
    print(f"Unique values in {column}: {titles_df[column].nunique()}")

for column in credits_df.columns:
    print(f"Unique values in {column}: {credits_df[column].nunique()}")

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
## Write your code to make your dataset analysis ready

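# Handling missing values
# Forward fill copies the previous row's value into each gap. DataFrame.ffill()
# replaces fillna(method='ffill'), which is deprecated since pandas 2.1.
titles_df.ffill(inplace=True)
credits_df.ffill(inplace=True)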

# Removing duplicates
titles_df.drop_duplicates(inplace=True)
credits_df.drop_duplicates(inplace=True)

# Converting data types if necessary
titles_df['release_year'] = pd.to_numeric(titles_df['release_year'], errors='coerce')
titles_df['runtime'] = pd.to_numeric(titles_df['runtime'], errors='coerce')

# Standardizing text columns
titles_df['title'] = titles_df['title'].str.strip().str.title()
credits_df['name'] = credits_df['name'].str.strip().str.title()

print("Dataset is now cleaned and ready for analysis.")


### What all manipulations have you done and insights you found?

**Manipulations Done**  
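- **Handled Missing Values**: Forward-filled missing values, so each gap takes the value from the preceding row. This removes NaNs but copies values between unrelated titles, so filled scores, runtimes, and season counts should be interpreted with caution.  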
- **Removed Duplicates**: Ensured unique records by dropping duplicates.  
- **Converted Data Types**: Changed `release_year` and `runtime` to numeric for accurate analysis.  
- **Standardized Text Data**: Cleaned and formatted text columns for consistency.
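
Forward fill is the approach used in this notebook. As a point of comparison, the sketch below shows one possible column-aware alternative. The rules in it are assumptions chosen for illustration, not the method used above: missing `seasons` on movies become 0, missing `age_certification` becomes `"Unknown"`, and missing ratings stay as NaN so that means and plots simply skip them. The helper name `clean_titles` and the sample rows are hypothetical.

```python
import pandas as pd


def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Fill gaps column by column instead of forward-filling everything."""
    out = df.copy()
    # Movies have no seasons, so a gap there means "not applicable" rather than unknown
    is_movie = out["type"] == "MOVIE"
    out.loc[is_movie, "seasons"] = out.loc[is_movie, "seasons"].fillna(0)
    # Keep rows without a certification but label them explicitly
    out["age_certification"] = out["age_certification"].fillna("Unknown")
    # imdb_score is left untouched: NaN is skipped by mean(), corr(), and seaborn
    return out


# Made-up sample standing in for titles_df
sample = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "MOVIE"],
    "seasons": [None, 2.0, None],
    "age_certification": ["R", None, None],
    "imdb_score": [7.0, None, 6.0],
})

cleaned = clean_titles(sample)
print(cleaned)
```

The trade-off is that later cells must tolerate NaN in the rating columns, which the charts below already do where they call `.dropna()`.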

**Insights Found**  
- The dataset includes both **movies and TV shows**, allowing for a comparative analysis.  
- **Certain genres are more frequent** in Amazon Prime's collection, showing content preferences.  
- **IMDb and TMDB ratings vary significantly** across content types and genres.  
- **Production is dominated by certain countries**, indicating strong regional preferences.  
- **Top actors and directors** can be identified based on their frequency and ratings.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
## Chart - 1: Distribution of Content by Release Year

plt.figure(figsize=(12,6))
sns.histplot(titles_df['release_year'], bins=30, kde=True, color='blue')
plt.title("Distribution of Content by Release Year")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

The histogram helps in visualizing the **distribution of content across different release years**. It provides insights into whether Amazon Prime focuses more on **older classics or recent releases**.

##### 2. What is/are the insight(s) found from the chart?

- The dataset shows **a significant rise in content production in recent years**, indicating an increasing trend in new content availability.  
- There are **fewer titles from older decades**, suggesting that Amazon Prime may prioritize newer content over older classics.
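
To put numbers behind the histogram, release years can be grouped into decades. This is a small sketch on made-up years; the 2010 cutoff for "recent" is an arbitrary assumption, and `titles_df['release_year']` would replace the sample series.

```python
import pandas as pd

# Made-up release years standing in for titles_df['release_year']
years = pd.Series([1948, 1995, 2003, 2015, 2018, 2021, 2021], name="release_year")

# Floor each year to the start of its decade, e.g. 2018 -> 2010
decade = (years // 10) * 10
per_decade = decade.value_counts().sort_index()

# Share of titles released in or after the assumed cutoff year
share_recent = (years >= 2010).mean()

print(per_decade)
print(f"Share released since 2010: {share_recent:.1%}")
```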

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**  
Yes, the insights can help **shape content acquisition and licensing strategies**:  
- If recent content dominates, Amazon Prime may continue **investing in new productions and exclusive releases** to maintain audience interest.  
- If older content is underrepresented, licensing **classic or nostalgic movies/shows** might attract a different segment of viewers.

**Insights that lead to negative growth**  
- If Amazon Prime is **not acquiring enough classic or diverse international content**, it may **lose older or niche audiences** who prefer such content.  
- Over-reliance on **new releases** might make the library appear **less diverse**, potentially reducing subscriber retention in certain demographics.

#### Chart - 2

In [None]:
### Chart - 2: Content Distribution by Type (Movies vs TV Shows)
plt.figure(figsize=(8,5))
sns.countplot(x=titles_df['type'], palette='pastel')
plt.title("Content Distribution by Type")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?


A count plot is used to **compare the distribution of categorical data**, making it ideal for visualizing the number of movies vs. TV shows available on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?


- The dataset shows a **higher number of movies compared to TV shows**, suggesting that Amazon Prime focuses more on movies.  
- Catalog share reflects what Amazon has licensed or produced. It may hint at user preference for movies, but viewing data would be needed to confirm that.
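
The count plot shows absolute numbers; the proportion asked for in the objectives can be read off with `value_counts(normalize=True)`. A minimal sketch on made-up values follows (the `types` series stands in for `titles_df['type']`).

```python
import pandas as pd

# Made-up content types standing in for titles_df['type']
types = pd.Series(["MOVIE", "MOVIE", "SHOW", "MOVIE"], name="type")

counts = types.value_counts()
shares = types.value_counts(normalize=True).mul(100).round(1)

# Both series are indexed by content type, so they align when combined
type_summary = pd.DataFrame({"count": counts, "share_pct": shares})
print(type_summary)
```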

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**  
Yes, the insights can help in **content strategy decisions**:
- If movies dominate, Amazon Prime may continue **prioritizing movie acquisitions and productions** to align with user demand.
- If TV shows are underrepresented but have growing demand, **investing in original series** could be a beneficial strategy.

**Insights that lead to negative growth**
- If Amazon Prime is **overlooking TV shows**, it may **miss out on the growing trend of binge-watching** and serialized content engagement.
- The lack of a **balanced content mix** might lead to user dissatisfaction, pushing them toward competitors offering more TV series.

#### Chart - 3

In [None]:
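# Chart - 3: Top 10 Most Popular Genres
# If 'genres' holds list-like strings (e.g. "['drama', 'comedy']"), as 'production_countries'
# does in Chart 9, a plain split(',') miscounts them; strip brackets and quotes first.
genres = titles_df['genres'].str.strip('[]').str.replace("'", "", regex=False).str.split(',').explode().str.strip()
genre_counts = genres[genres != ''].value_counts().head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='coolwarm')
plt.title("Top 10 Most Popular Genres")
plt.xlabel("Count")
plt.ylabel("Genres")
plt.show()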

##### 1. Why did you pick the specific chart?

  A bar chart is effective for **comparing categorical data**, making it ideal for visualizing the **most popular genres** in Amazon Prime’s content library.

##### 2. What is/are the insight(s) found from the chart?


- Certain genres like **Drama, Comedy, and Action** are among the most represented in the catalog. This shows what Amazon stocks most heavily, which may or may not match audience appeal.  
- Less frequent genres may suggest **niche content opportunities** for Amazon Prime.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**  
Yes, the insights can help in **content strategy and audience targeting**:  
- Amazon Prime can **focus on acquiring and producing content in the most popular genres** to maximize viewership.  
- If niche genres have **loyal but smaller audiences**, targeted marketing and recommendations can **improve engagement**.

**Insights that lead to negative growth**  
- If Amazon Prime **over-relies on a few popular genres**, it may lead to a **lack of variety**, potentially driving away users who seek diverse content.  
- Ignoring **emerging or niche genres** may result in **missed opportunities** to attract specific audience segments.

#### Chart - 4

In [None]:
# Plotting the distribution of IMDb ratings
plt.figure(figsize=(10,6))

# Using seaborn's histogram to visualize rating distribution
sns.histplot(titles_df['imdb_score'].dropna(), bins=20, kde=True, color='green')

# Adding title and labels for clarity
plt.title("Distribution of IMDb Ratings")
plt.xlabel("IMDb Score")
plt.ylabel("Count")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


A **histogram with a KDE (Kernel Density Estimate)** helps us **understand the distribution** of IMDb ratings. It highlights **common rating ranges** and whether scores are **skewed toward high or low ratings**.

##### 2. What is/are the insight(s) found from the chart?


- Most IMDb ratings **cluster around mid to high scores (6-8)**, indicating that Amazon Prime has a **quality content selection**.  
- A **smaller number of titles have very low ratings**, suggesting fewer poorly received movies or shows.  
- If the distribution is **left-skewed**, it shows that **highly-rated content dominates** the platform.
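
Whether the distribution is left-skewed can be checked numerically rather than by eye. The sketch below uses made-up scores in place of `titles_df['imdb_score']`; a negative value from `Series.skew()` means the longer tail is on the low-score side.

```python
import pandas as pd

# Made-up scores standing in for titles_df['imdb_score']
scores = pd.Series([2.0, 5.5, 6.5, 7.0, 7.2, 7.5, 8.0], name="imdb_score")

# Negative skew: long tail of low scores, bulk of titles at the high end
skewness = scores.skew()

# Share of titles in the 6-8 band mentioned above (both ends inclusive)
share_6_to_8 = scores.between(6, 8).mean()

print(f"Skewness: {skewness:.2f}")
print(f"Median {scores.median():.1f} vs. mean {scores.mean():.2f}")
print(f"Share scoring 6-8: {share_6_to_8:.1%}")
```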

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**  
Yes, the insights help in **content acquisition and recommendation strategies**:  
- If most content is **rated 7 and above**, Amazon Prime can **leverage this in marketing**, promoting its high-rated content.  
- If low-rated content exists, **user reviews and feedback** can guide **quality control and content removal decisions**.  

**Insights that lead to negative growth**  
- If **too many titles have mid-range ratings (5-6)**, Amazon Prime may struggle with **viewer satisfaction**, as users expect **high-quality content**.  
- If **highly-rated content is limited**, the platform **may lose users to competitors** with better-rated selections.


#### Chart - 5

In [None]:
# Plotting the distribution of runtime for movies and TV shows separately
plt.figure(figsize=(12,6))

# Using seaborn's histogram to visualize runtime distribution
sns.histplot(titles_df[titles_df['type'] == 'MOVIE']['runtime'].dropna(), bins=30, kde=True, color='blue', label='Movies')
sns.histplot(titles_df[titles_df['type'] == 'SHOW']['runtime'].dropna(), bins=30, kde=True, color='red', label='TV Shows')

# Adding title, labels, and legend
plt.title("Distribution of Runtime for Movies & TV Shows")
plt.xlabel("Runtime (Minutes)")
plt.ylabel("Count")
plt.legend()

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


A **histogram with KDE lines** is useful for understanding **how movie and TV show runtimes are distributed**. It helps identify **common runtime ranges** and whether there are **outliers**.

##### 2. What is/are the insight(s) found from the chart?


- **Movies generally have runtimes between 80-120 minutes**, with some extending beyond 150 minutes.  
- **TV shows have a different distribution**, with **shorter episodes** clustering around **20-60 minutes**.  
- Some **outliers in both categories** suggest the presence of **very short films or extra-long TV episodes**.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**  
Yes, this insight can guide **content creation and user experience improvements**:  
- If movies with **90-120 min runtimes** are most common, Amazon Prime may focus on **producing/acquiring similar-length movies**.  
- If **short TV episodes (20-40 min)** are highly preferred, Amazon can **boost shorter, binge-worthy series**.

**Insights that lead to negative growth**  
- If **too many outliers exist (extremely long or short content)**, users might find it **disruptive**, affecting engagement.  
- If the platform **lacks a diverse runtime range**, certain viewers may **not find content suited to their preferences**.

#### Chart - 6

In [None]:
# Chart - 6: IMDb Score vs. Runtime Analysis

# Scatter plot to analyze the relationship between IMDb score and runtime
plt.figure(figsize=(12,6))

# Using seaborn's scatterplot for visualization
sns.scatterplot(data=titles_df, x='runtime', y='imdb_score', alpha=0.5, color='purple')

# Adding title and labels
plt.title("IMDb Score vs. Runtime")
plt.xlabel("Runtime (Minutes)")
plt.ylabel("IMDb Score")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


A **scatter plot** helps in understanding the **relationship between IMDb ratings and runtime**. It shows whether **longer or shorter movies/TV shows tend to get higher ratings** or if there’s no significant correlation.

##### 2. What is/are the insight(s) found from the chart?


- Most content falls **within 60-150 minutes runtime**, with IMDb scores between **5-8**.  
- No clear **strong correlation** between **runtime and IMDb score**, but **extremely short or long content might have varying ratings**.  
- Some **outliers with high ratings exist**, indicating **exceptionally well-received short or long content**.
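
The strength of the relationship seen in the scatter plot can be quantified with a correlation coefficient. The sketch below computes Pearson's r and a rank-based (Spearman) coefficient on made-up pairs; the rank version is obtained by correlating the ranks, which needs only pandas. With the real data, `titles_df` would replace `sample`.

```python
import pandas as pd

# Made-up pairs standing in for titles_df[['runtime', 'imdb_score']]
sample = pd.DataFrame({
    "runtime": [30, 60, 90, 120, 150, 100],
    "imdb_score": [6.0, 7.0, 6.5, 7.5, 8.0, None],
})

# Use only rows where both values are present
pairs = sample[["runtime", "imdb_score"]].dropna()

# Pearson measures linear association
pearson = pairs["runtime"].corr(pairs["imdb_score"])

# Spearman is Pearson applied to ranks, so it captures any monotonic trend
spearman = pairs["runtime"].rank().corr(pairs["imdb_score"].rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

Values near 0 would support the "no strong correlation" reading; movies and shows are best checked separately because their runtimes mean different things.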

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Positive business impact**  
Yes, the insights help in **content strategy and recommendations**:  
- If **high-rated content tends to have a specific runtime**, Amazon Prime can **prioritize content acquisition around that duration**.  
- If **short films or lengthy documentaries have high ratings**, they can be **better promoted** to relevant audiences.

**Insights that lead to negative growth**  
- If **shorter content consistently gets lower ratings**, Amazon should reconsider **producing or acquiring shorter films**.  
- If **long content tends to perform poorly**, it might indicate **audience fatigue** or **lack of engagement** for lengthy movies/shows.

#### Chart - 7

In [None]:
# Chart - 7: Top 10 Directors with Most Content

# Filter the credits dataset for directors only
directors_df = credits_df[credits_df['role'] == 'DIRECTOR']

# Count the occurrences of each director
top_directors = directors_df['name'].value_counts().head(10)

# Create a bar plot
plt.figure(figsize=(12,6))
sns.barplot(x=top_directors.values, y=top_directors.index, palette='viridis')

# Adding title and labels
plt.title("Top 10 Directors with Most Content on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Director")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


A **bar chart** is ideal for **ranking categorical data**, making it perfect for showcasing **which directors have contributed the most content** on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?


- Certain directors have significantly **higher representation**, possibly indicating **long-term collaborations** with Amazon Prime.  
- If a director has **many titles but lower IMDb scores**, Amazon may need to **re-evaluate future collaborations**.
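
The chart counts titles only, so comparing volume with ratings requires joining credits to titles. The following is a sketch on made-up rows, assuming `id` is the join key as described in the variable list; the frame names and people are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for titles_df and credits_df
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "imdb_score": [8.0, 6.0, 7.0, 5.0],
})
credits = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4", "t1"],
    "name": ["Dir A", "Dir A", "Dir B", "Dir B", "Actor X"],
    "role": ["DIRECTOR", "DIRECTOR", "DIRECTOR", "DIRECTOR", "ACTOR"],
})

# Keep directors only, then attach each title's score
directors = credits[credits["role"] == "DIRECTOR"].merge(
    titles[["id", "imdb_score"]], on="id", how="inner"
)

# Title count and mean score per director, most prolific first
director_stats = (
    directors.groupby("name")["imdb_score"]
    .agg(titles="count", mean_score="mean")
    .sort_values("titles", ascending=False)
)
print(director_stats)
```

On the full data, filtering to directors with several titles before ranking by `mean_score` avoids rewarding a single well-rated film.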

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**  
Yes, these insights are **valuable for content acquisition and partnerships**:  
- Amazon can **leverage well-performing directors** for **future exclusive productions**.  
- If a director's titles are well rated but receive little exposure, Amazon can **highlight their content** in promotions.

**Insights that lead to negative growth**  
- If **certain directors have a large number of low-rated titles**, it might indicate **quantity over quality**, impacting user satisfaction.  
- **Over-reliance on a few directors** may reduce **content diversity**, potentially alienating certain audience groups.

#### Chart - 8

In [None]:
# Chart - 8: Top 10 Most Featured Actors on Amazon Prime

# Filter the credits dataset for actors only
actors_df = credits_df[credits_df['role'] == 'ACTOR']

# Count the occurrences of each actor
top_actors = actors_df['name'].value_counts().head(10)

# Create a bar plot
plt.figure(figsize=(12,6))
sns.barplot(x=top_actors.values, y=top_actors.index, palette='magma')

# Adding title and labels
plt.title("Top 10 Most Featured Actors on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Actor")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


A **bar chart** is ideal for ranking **which actors appear most frequently** in Amazon Prime content. It helps identify **highly featured actors** and possible **casting trends**.

##### 2. What is/are the insight(s) found from the chart?


- Some actors have **significantly more appearances**, suggesting they might be **favorites for Amazon Prime productions**.  
- If an actor frequently appears in **low-rated content**, Amazon might **reassess casting decisions**.  
- If a well-performing actor is **not widely featured**, Amazon could **increase their involvement** in future projects.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**  
Yes, these insights are **valuable for casting and promotional strategies**:  
- Amazon can **highlight popular actors in marketing campaigns** to attract audiences.  
- If certain actors are **strongly linked to high-rated content**, Amazon could **increase their casting** in future projects.  

**Insights that lead to negative growth**  
- If **Amazon overuses a small group of actors**, it may **reduce content diversity**, making the catalog feel **repetitive**.  
- If **certain actors frequently appear in low-rated content**, their involvement in future productions should be reconsidered.

#### Chart - 9

In [None]:
titles_df['production_countries']

In [None]:
# Chart - 9: Most Popular Production Countries

import ast

# Drop missing values; .copy() avoids a SettingWithCopyWarning on the assignment below
titles_df_clean = titles_df.dropna(subset=['production_countries']).copy()

# Convert country strings such as "['US', 'GB']" to proper lists and flatten them
titles_df_clean['production_countries'] = titles_df_clean['production_countries'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
all_countries = titles_df_clean['production_countries'].explode()

# Remove entries left behind by empty lists (explode turns them into NaN)
all_countries = all_countries.dropna()
all_countries = all_countries[all_countries != '']

# Count top 10 most frequent production countries
country_counts = all_countries.value_counts().head(10)

# Plot the chart
plt.figure(figsize=(12,6))
sns.barplot(x=country_counts.values, y=country_counts.index, palette='coolwarm')

# Adding title and labels
plt.title("Top 10 Most Popular Production Countries")
plt.xlabel("Number of Titles")
plt.ylabel("Country")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


A **bar chart** effectively visualizes **which countries produce the most content** for Amazon Prime. This helps analyze **regional dominance** in content production.

##### 2. What is/are the insight(s) found from the chart?


- The **USA dominates content production**, followed by a few other major countries.  
- Certain **regional markets** might be underrepresented, indicating **potential content expansion opportunities**.  
- Countries with **fewer titles but high IMDb scores** could indicate **quality over quantity**.
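
The "quality over quantity" point needs ratings alongside counts, which the chart does not show. The sketch below uses made-up rows and assumes the country column holds list-like strings, as the parsing code above implies. It explodes countries while keeping each title's score, then aggregates per country.

```python
import ast

import pandas as pd

# Made-up rows standing in for titles_df
sample = pd.DataFrame({
    "production_countries": ["['US']", "['US', 'GB']", "['IN']", "[]"],
    "imdb_score": [6.0, 8.0, 7.5, 5.0],
})

# Parse the list-like strings, then create one row per (title, country) pair
sample["country"] = sample["production_countries"].apply(ast.literal_eval)
exploded = sample.explode("country").dropna(subset=["country"])

# Title count and mean IMDb score per country
by_country = (
    exploded.groupby("country")["imdb_score"]
    .agg(titles="count", mean_score="mean")
    .sort_values("titles", ascending=False)
)
print(by_country)
```

A co-produced title counts once for each of its countries, so the per-country counts can sum to more than the number of titles.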

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Positive business impact**  
Yes! These insights help Amazon Prime in **regional content strategy**:  
- **Expand production in emerging markets** to cater to **local audiences**.  
- **Increase licensing deals** in **high-demand but underrepresented regions**.  
- **Focus marketing campaigns** based on **regional content preferences**.

**Insights that lead to negative growth**  
- If Amazon **over-relies on a few countries**, it risks **losing global market diversity**.  
- Lack of regional content may **push local audiences** toward competitors with **stronger local offerings**.

#### Chart - 10

In [None]:
# Chart - 10: Distribution of IMDb Scores

plt.figure(figsize=(12,6))

# Histogram for IMDb scores
sns.histplot(titles_df['imdb_score'].dropna(), bins=20, kde=True, color='teal')

# Adding title and labels
plt.title("Distribution of IMDb Scores")
plt.xlabel("IMDb Score")
plt.ylabel("Frequency")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


A histogram is the best way to visualize the distribution of IMDb scores, allowing us to see the concentration of ratings and whether they are normally distributed or skewed.


##### 2. What is/are the insight(s) found from the chart?

  
- It helps identify the most common rating range.  
- We can check if most shows/movies have high ratings or if they are evenly spread.  
- Possible outliers (too low or too high scores) can be identified.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Business Impact:**  

**Positive Impact:**  
* If most ratings are **above 7**, it indicates that Amazon Prime is delivering **high-quality content**, increasing user retention and attracting new subscribers.  
 * A **consistent distribution of scores** suggests a diverse catalog, catering to different audience preferences, which enhances engagement.  
 * A **high concentration of scores in the mid-to-high range (6-8)** suggests a good balance of mainstream and niche content, optimizing recommendations and improving watch time.  

**Negative Impact:**  
* If there is a **significant number of low ratings (below 4)**, it may indicate **poor content selection** and could damage brand reputation.  
* If IMDb scores are **highly skewed**, it suggests **either a lack of top-tier content or an over-reliance on a few high-rated titles**, leading to stagnation in content variety.  
* A **large number of mid-range scores (5-6)** without standout content may indicate that Amazon Prime needs to **invest in higher-quality productions** to compete with Netflix and Disney+.

#### Chart - 11

In [None]:
# Chart - 11: Factors Affecting Audience Interest

plt.figure(figsize=(12,6))

# Selecting numerical features affecting audience interest
features = ['imdb_score', 'tmdb_popularity', 'runtime', 'release_year']

# Calculating correlation with TMDB popularity (proxy for audience interest)
correlation = titles_df[features].corr()['tmdb_popularity'].drop('tmdb_popularity')

# Sorting values for better visualization
correlation_sorted = correlation.sort_values(ascending=False)

# Plotting correlation as a bar chart
sns.barplot(x=correlation_sorted.values, y=correlation_sorted.index, palette='viridis')

plt.title("Factors Affecting Audience Interest (Correlation with TMDB Popularity)")
plt.xlabel("Correlation Coefficient")
plt.ylabel("Factors")

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?


- A **bar chart** effectively visualizes **correlation strength** between audience interest (TMDB Popularity) and influencing factors.  
- It helps identify which factors (IMDb Score, Runtime, Release Year) have the **strongest impact** on audience engagement.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the chart**  
1. **IMDb Score Impact:** If IMDb score has a **strong positive correlation** with TMDB popularity, it means that **higher-rated content gets more audience attention**.  
2. **Runtime Influence:** If runtime has a weak or negative correlation, it suggests that **longer movies/shows do not necessarily attract more viewers**.  
3. **Release Year Effect:** If release year has a **positive correlation** with TMDB popularity, it indicates that **recently released content gains more audience traction** than older titles.
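
The insights above are stated as conditions, so one way to resolve them is to read the coefficients themselves and attach a rough strength label. This is a hedged sketch: `label_correlations` is a hypothetical helper, the 0.3 and 0.7 cut-offs are a common rule of thumb rather than a standard, and the toy frame only stands in for `titles_df`.

```python
import pandas as pd


def label_correlations(df, target, weak=0.3, strong=0.7):
    """Correlate every numeric column with `target` and label the strength of each."""
    numeric = df.select_dtypes("number")
    r = numeric.corr()[target].drop(target)  # Pearson correlation with the target

    def strength(value):
        if pd.isna(value):
            return "undefined"  # e.g. a constant column
        size = abs(value)
        if size < weak:
            return "weak"
        if size < strong:
            return "moderate"
        return "strong"

    return pd.DataFrame({"r": r, "strength": r.apply(strength)})


# Toy data standing in for titles_df; the values are chosen only for illustration
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_popularity": [2.0, 4.0, 6.0, 8.0, 10.0],
    "runtime": [90, 80, 85, 80, 90],
    "release_year": [2020, 2019, 2018, 2017, 2016],
})
labels = label_correlations(toy, target="tmdb_popularity")
print(labels)
```

Running the same helper on `titles_df` would show which of the three conditional statements actually applies to this dataset.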

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Business Impact**  
**Positive Impact:**  
- Helps **prioritize high-rated content** in recommendations for better audience retention.  
- Guides **future content production**—if newer releases perform better, **focus on producing fresh content**.  

**Negative Impact:**  
- If certain factors **do not drive audience engagement**, Amazon Prime might need **to rethink its content promotion strategy**.  
- If **IMDb scores and TMDB popularity have weak correlation**, it indicates that **quality does not always translate into views**, requiring **better marketing efforts**.

#### Chart - 12

In [None]:
# Chart - 12: IMDb Score Distribution by Content Type (Box Plot)

plt.figure(figsize=(10,6))

# Using the correct column name for content type
sns.boxplot(data=titles_df, x='type', y='imdb_score', palette='coolwarm')

# Adding title and labels
plt.title("IMDb Score Distribution by Content Type (Movies vs. TV Shows)")
plt.xlabel("Content Type")
plt.ylabel("IMDb Score")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


- A **box plot** is the best way to compare **IMDb score distributions** across **Movies and TV Shows**.  
- It helps visualize **median ratings, variability, and outliers**, showing how consistent each category is.  
- This analysis provides insights into **which format tends to receive better audience reception** on Amazon Prime.


##### 2. What is/are the insight(s) found from the chart?

**Median IMDb Score Comparison:**
* If Movies have a higher median IMDb score, it suggests that films generally receive better ratings than TV shows.
* If TV Shows have a higher median, it suggests that series are generally rated higher than films.

**Spread and Variability:**
* A wider box for TV Shows may indicate greater variation in quality—some highly rated series alongside poorly rated ones.
* If Movies have a tighter score range, it suggests more consistency in quality.

**Outliers & Low Ratings:**
* If TV Shows have many low-rated entries, it suggests that some series negatively impact Amazon’s catalog quality.
* If Movies have fewer low-rated entries, it indicates a better quality control process.
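
The quantities a box plot draws (median, quartiles, and points below the lower whisker) can also be tabulated, which makes the comparisons above concrete. The sketch below is illustrative only: `summarize_scores_by_type` is a hypothetical helper, the 1.5 x IQR fence matches seaborn's default whisker rule, and the toy rows with `MOVIE` and `SHOW` labels are assumed stand-ins for `titles_df`.

```python
import pandas as pd


def summarize_scores_by_type(df, group_col="type", score_col="imdb_score"):
    """Tabulate median, IQR, and low-outlier counts per content type."""
    clean = df.dropna(subset=[group_col, score_col])
    grouped = clean.groupby(group_col)[score_col]
    summary = pd.DataFrame({
        "count": grouped.count(),
        "median": grouped.median(),
        "q1": grouped.quantile(0.25),
        "q3": grouped.quantile(0.75),
    })
    summary["iqr"] = summary["q3"] - summary["q1"]
    # Points below q1 - 1.5 * IQR are the low outliers a box plot draws as dots
    summary["lower_fence"] = summary["q1"] - 1.5 * summary["iqr"]
    fence_per_row = clean[group_col].map(summary["lower_fence"])
    is_low = clean[score_col] < fence_per_row
    summary["low_outliers"] = is_low.groupby(clean[group_col]).sum()
    return summary


# Toy rows standing in for titles_df
toy = pd.DataFrame({
    "type": ["MOVIE"] * 6 + ["SHOW"] * 5,
    "imdb_score": [6.0, 6.2, 6.4, 6.6, 6.8, 1.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})
type_summary = summarize_scores_by_type(toy)
print(type_summary)
```

A larger `iqr` corresponds to the wider box described above, and `low_outliers` counts the poorly rated entries in each category.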

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**  

**Positive Impact:**  
- Helps Amazon Prime **optimize content recommendations**—if movies generally receive **higher ratings**, they can be promoted more.  
- If TV Shows have **higher ratings**, Amazon can **invest in original series production** for sustained user retention.  
- Identifies **underperforming content categories**, allowing for **better acquisition strategies**.  

**Negative Impact:**  
- If TV Shows have **a high number of low ratings**, users may be discouraged from exploring new series, affecting **subscription renewals**.  
- If movies show **inconsistent ratings**, Amazon may need **better content curation** to avoid promoting low-quality films.

#### Chart - 13

In [None]:
# Chart - 13: TMDB Popularity vs. IMDb Score

plt.figure(figsize=(12,6))

# Scatter plot to compare IMDb scores and TMDB popularity
sns.scatterplot(data=titles_df, x='imdb_score', y='tmdb_popularity', alpha=0.5, color='purple')

# Adding title and labels
plt.title("TMDB Popularity vs. IMDb Score")
plt.xlabel("IMDb Score")
plt.ylabel("TMDB Popularity")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

* A scatter plot is ideal for identifying correlations between two numerical variables.
* It helps determine if higher IMDb ratings contribute to TMDB popularity, which reflects user engagement.
* Allows us to see whether audience perception (IMDb scores) aligns with actual content popularity (TMDB popularity).

##### 2. What is/are the insight(s) found from the chart?

* If the scatter plot shows an upward trend, it suggests that higher-rated content tends to be more popular.
* If points are scattered randomly, it means that IMDb ratings are not strongly related to TMDB popularity.
* In that case, other factors such as marketing, cast, or promotions may have a larger influence on popularity than IMDb scores.
* If some low-rated movies have high popularity, it may indicate that audience interest is driven by hype rather than quality.
* If some highly rated movies are not popular, it means they might be underrated or poorly promoted.
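
The last two cases above (popular but low-rated, and well-rated but unpopular) can be listed explicitly instead of being read off the scatter plot. This is a minimal sketch under stated assumptions: `flag_rating_popularity_mismatch` is a hypothetical helper, the quartile cut-offs are an arbitrary choice, and the toy titles are invented for illustration.

```python
import pandas as pd


def flag_rating_popularity_mismatch(df, score_col="imdb_score",
                                    pop_col="tmdb_popularity",
                                    low_q=0.25, high_q=0.75):
    """Label titles whose rating quartile and popularity quartile disagree."""
    clean = df.dropna(subset=[score_col, pop_col]).copy()
    score_lo, score_hi = clean[score_col].quantile([low_q, high_q])
    pop_lo, pop_hi = clean[pop_col].quantile([low_q, high_q])

    clean["segment"] = "typical"
    underexposed = (clean[score_col] >= score_hi) & (clean[pop_col] <= pop_lo)
    hype_driven = (clean[score_col] <= score_lo) & (clean[pop_col] >= pop_hi)
    clean.loc[underexposed, "segment"] = "high-rated, low popularity"
    clean.loc[hype_driven, "segment"] = "low-rated, high popularity"
    return clean


# Toy titles standing in for titles_df
toy = pd.DataFrame({
    "title": list("ABCDEFGH"),
    "imdb_score": [8.9, 8.5, 7.0, 6.5, 6.0, 5.5, 3.0, 2.5],
    "tmdb_popularity": [1.0, 50.0, 20.0, 15.0, 12.0, 10.0, 90.0, 2.0],
})
segments = flag_rating_popularity_mismatch(toy).set_index("title")["segment"]
print(segments)
```

Titles in the "high-rated, low popularity" segment are candidates for promotion, which is the use case the business-impact discussion relies on.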

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**
* Helps Amazon Prime identify high-rated but underperforming content, allowing for better promotional strategies.
* If ratings and popularity are correlated, Amazon can prioritize high-rated content in marketing campaigns.
* If TMDB popularity is independent of IMDb ratings, it suggests that marketing and visibility are key drivers of viewership.

**Negative Impact:**
* If some high-rated content is unpopular, it means Amazon Prime is not effectively promoting quality titles.
* If some low-rated content is highly popular, it could mean that hype-driven content is prioritized over quality, potentially damaging brand reputation.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14: Correlation Heatmap

plt.figure(figsize=(10,6))

# Compute the correlation matrix for numerical variables
correlation_matrix = titles_df[['imdb_score', 'tmdb_popularity', 'runtime', 'release_year']].corr()

# Create the heatmap using seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5, fmt=".2f")

# Adding title
plt.title("Correlation Heatmap of Numerical Variables")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


- A **heatmap** is ideal for quickly identifying **strong and weak correlations** between numerical features.  
- Helps determine whether variables like **IMDb Score and TMDB Popularity** influence each other.  
- Allows Amazon Prime to **focus on key content factors that drive engagement and user interest**.

##### 2. What is/are the insight(s) found from the chart?

**Strong Positive Correlation (Values Closer to +1):**  
   - If **IMDb Score and TMDB Popularity** show a strong correlation, it means **higher-rated content is also more popular**.  
   - If **Runtime has a strong correlation with IMDb Score**, it suggests **longer content tends to get better reviews**.  

**Strong Negative Correlation (Values Closer to -1):**  
   - If **Release Year has a negative correlation with IMDb Score**, it might mean **older content is rated higher than newer releases**.  

**Weak or No Correlation (Values Close to 0):**  
   - If **Runtime and Popularity** have no correlation, it suggests that **length does not impact viewership**.  
   - If **IMDb Score and TMDB Popularity** are uncorrelated, it suggests that factors other than audience ratings, such as **marketing**, may play a bigger role in popularity.
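
To decide which of these cases applies, the heatmap values can be flattened into a ranked list of feature pairs. The sketch below is an assumption-laden illustration: `rank_correlated_pairs` is a hypothetical helper and the toy frame only mimics the columns of `titles_df`.

```python
from itertools import combinations

import pandas as pd


def rank_correlated_pairs(df):
    """List each unordered pair of numeric columns, sorted by |Pearson r|."""
    corr = df.select_dtypes("number").corr()
    rows = [
        {"feature_a": a, "feature_b": b, "r": corr.loc[a, b]}
        for a, b in combinations(corr.columns, 2)  # each pair once, no diagonal
    ]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "r"])
    pairs["abs_r"] = pairs["r"].abs()
    return pairs.sort_values("abs_r", ascending=False).reset_index(drop=True)


# Toy data standing in for titles_df; values chosen only for illustration
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_popularity": [2.0, 4.0, 6.0, 8.0, 10.0],
    "runtime": [90, 80, 85, 80, 90],
})
pairs = rank_correlated_pairs(toy)
print(pairs)
```

Pairs near the top of the table correspond to the "strong" cases above, and pairs with `abs_r` near zero to the "weak or no correlation" cases.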

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**  
- Helps Amazon Prime **prioritize content acquisition** by understanding what **drives engagement and high ratings**.  
- If IMDb Scores **strongly correlate with TMDB Popularity**, Amazon can **focus on promoting high-rated content** to maximize viewership.  
- If release year **correlates negatively with IMDb score**, Amazon can **invest in improving content quality or re-marketing older, well-rated content**.  

**Negative Impact:**  
- If key variables **do not correlate**, it means Amazon may be **focusing on the wrong metrics for decision-making**.  
- A **strong negative correlation between runtime and IMDb score** could suggest that **longer content may lead to audience fatigue**, requiring changes in content strategy.

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15: Pair Plot

import seaborn as sns
import matplotlib.pyplot as plt

# Selecting numerical features for pair plot
selected_features = ['imdb_score', 'tmdb_popularity', 'runtime', 'release_year']

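# Creating the pair plot
# pairplot builds its own figure, so a preceding plt.figure() would only add an empty one;
# the size is set with `height` (inches per panel). `palette` has no effect without `hue`.
sns.pairplot(titles_df[selected_features], diag_kind='kde', height=2.5)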

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?


A **pair plot** provides a detailed **visualization of relationships between multiple numerical variables**, such as **IMDb Score, TMDB Popularity, Runtime, and Release Year**. It helps identify trends, clusters, and correlations across different factors.

##### 2. What is/are the insight(s) found from the chart?


**Linear Relationships:**  
   - If **IMDb Score and TMDB Popularity show a strong linear trend**, it suggests that **higher-rated content is generally more popular**.  
   - If **Runtime and IMDb Score have no pattern**, it means that **longer movies/shows do not necessarily receive higher ratings**.  

**Clusters & Outliers:**  
   - If certain points form **distinct clusters**, it could indicate **specific content types performing differently**.  
   - If outliers exist (e.g., **very long movies with extremely low ratings**), Amazon may need to **analyze and address viewer dissatisfaction**.  

**Temporal Trends:**  
   - If **recent releases tend to have lower IMDb scores**, it might suggest **a decline in content quality over time**.  
   - If **older content has consistently high ratings**, Amazon may need to **re-promote classic content to attract nostalgic viewers**.
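
The temporal conditions above can be tested by aggregating ratings per release decade. This is a hedged sketch rather than the notebook's own method: `score_by_decade` is a hypothetical helper and the toy rows stand in for `titles_df`.

```python
import pandas as pd


def score_by_decade(df, year_col="release_year", score_col="imdb_score"):
    """Return title count, mean, and median IMDb score per release decade."""
    clean = df.dropna(subset=[year_col, score_col])
    decade = (clean[year_col] // 10 * 10).astype(int)  # e.g. 1994 -> 1990
    stats = clean.groupby(decade)[score_col].agg(["count", "mean", "median"])
    return stats.rename_axis("decade")


# Toy rows standing in for titles_df
toy = pd.DataFrame({
    "release_year": [1955, 1958, 1994, 1999, 2015, 2018, 2021],
    "imdb_score": [7.5, 8.0, 7.0, 6.0, 6.5, 5.5, 5.0],
})
by_decade = score_by_decade(toy)
print(by_decade)
```

The `count` column matters when reading the result: decades with only a few titles give noisy averages, so a high mean for older decades may partly reflect that only well-regarded classics are in the catalog.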

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



**Positive Impact:**  
- Helps Amazon Prime **understand content trends over time**, allowing for better content acquisition and production decisions.  
- If IMDb Score **correlates with TMDB Popularity**, Amazon can **prioritize well-rated content for better marketing and recommendations**.  
- If specific clusters **show high engagement**, Amazon can **target those audience segments with personalized content suggestions**.  

**Negative Impact:**  
- If there is **no clear relationship between ratings and popularity**, Amazon Prime might need to **rethink its content promotion strategy**.  
- If newer content **consistently scores lower**, it could **harm brand reputation**, requiring **quality improvements in future productions**.

**🔹 1. Content Trends & Popularity**  

**Chart 1: Distribution of Content by Release Year**  
- Most content is **recently released**, with a strong focus on **modern productions**.  
- Older content is **less represented**, indicating an opportunity for **reviving classic movies & shows**.  

**Chart 2: Content Type Breakdown (Movies vs. TV Shows)**  
- Amazon Prime has **more movies than TV shows**, suggesting that **movies are the platform’s dominant format**.  
- If TV Shows gain popularity, Amazon may need to **invest in more exclusive series**.  

**Chart 3: Top 10 Most Popular Genres**  
- **Drama, Comedy, and Action** are the most frequent genres, indicating **audience preference**.  
- Niche genres like **Sci-Fi and Horror** are underrepresented but could be **targeted for specific audience segments**.  

**Chart 9: Most Popular Production Countries**  
- **The USA dominates content production**, followed by a few other key regions.  
- Certain **regional markets are underrepresented**, showing **potential for content expansion** in **India, Korea, and Latin America**.  

---

**🔹 2. Audience Ratings & Engagement**  

**Chart 4: IMDb Score vs. Runtime Analysis**  
- IMDb scores **do not strongly correlate** with runtime, meaning **audience ratings are driven by factors other than length**.  
- **Shorter movies & TV shows perform well**, highlighting a trend toward **snackable content**.  

**Chart 10: Distribution of IMDb Scores**  
- Most titles have an **IMDb score between 5 and 8**, indicating **moderate audience satisfaction**.  
- **Few titles have extremely high ratings**, showing **room for improving content quality**.  

**Chart 12: IMDb Score Distribution by Content Type (Movies vs. TV Shows)**  
- **TV shows tend to have more varied IMDb ratings**, while movies have **a more stable distribution**.  
- **Some low-rated series** might negatively impact audience trust, requiring **better quality control**.  

**Chart 13: TMDB Popularity vs. IMDb Score**  
- **High IMDb scores do not always translate into high popularity**, meaning **marketing plays a key role in driving viewership**.  
- Some **low-rated content is highly popular**, suggesting that **hype and promotions can drive audience interest independently of quality**.  

---

**🔹 3. Streaming Platform Strategy & Optimization**  

**Chart 11: Factors Affecting Audience Interest**  
- **IMDb Score and TMDB Popularity are positively correlated** (moderately, per the same coefficient shown in Chart 14), suggesting that **audience ratings are associated with engagement**.  
- **Newer content tends to have lower ratings**, suggesting that **quality control is crucial for modern releases**.  

**Chart 14: Correlation Heatmap of Numerical Variables**  
- **IMDb Score and TMDB Popularity have a moderate correlation**, but marketing efforts can **impact popularity more than audience ratings**.  
- **Runtime has minimal correlation with ratings**, meaning **short and long content can perform equally well**.  

**Chart 15: Pair Plot (Comprehensive Data Relationships)**  
- Some **clear patterns exist between release year, popularity, and IMDb scores**, but **no single factor determines content success**.  
- **Outliers exist**—some titles **perform extremely well or poorly despite runtime or genre trends**.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.

* **Content Personalization:** Enhance the recommendation algorithm by leveraging IMDb scores, TMDB popularity, and viewing history to deliver highly relevant content.

* **Quality Content Investments:** Prioritize acquiring or producing high-rated, audience-favorite genres such as Drama, Action, and Comedy, ensuring consistent engagement.

* **Expanding Regional Content:** Invest in diverse and localized content to increase subscriber base in underrepresented markets, such as India, Latin America, and Korea.

* **Enhanced Marketing Strategies:** Use data-driven promotions to highlight high-rated but underrated content, ensuring that well-rated titles gain traction.

* **Better Content Curation:** Regularly analyze low-rated content and improve Amazon Originals quality control to reduce poorly rated productions and increase user satisfaction.

# **Conclusion**

* Amazon Prime can use these data-driven insights to **optimize content acquisition, engagement strategies, and platform recommendations.**
* By enhancing content diversity, improving quality control, and leveraging audience trends, Amazon can maintain a competitive edge in the streaming industry.
