# **Project Name**    - Exploratory Data Analysis (EDA) on Amazon Prime TV Shows & Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Sakshi Vispute


# **Project Summary -**


This project focuses on performing **Exploratory Data Analysis (EDA)** on the **Amazon Prime TV Shows & Movies dataset** to uncover trends in **content type, genres, IMDb ratings, production countries, and audience engagement**. The objective is to derive insights that can guide **content strategy, audience retention, and platform growth**.  

---

### **📌 Key Findings from EDA:**  

#### **1️⃣ Content Distribution**  
- The dataset contains **more movies than TV shows**, indicating Amazon Prime’s primary focus on films.  
- Most TV shows have **1-3 seasons**, suggesting that limited series are preferred over long-running ones.  
- The number of releases has **increased over time**, aligning with the rise of streaming platforms.  

#### **2️⃣ Genre Popularity & Ratings**  
- **Drama, Thriller, and Comedy** are the most frequent genres.  
- **Drama and Thriller tend to have higher IMDb ratings**, while **Horror and Action show mixed reviews**.  
- **Documentary titles generally receive high ratings**, indicating a dedicated audience for factual content.  

#### **3️⃣ Audience Engagement & Ratings**  
- IMDb scores mostly fall **between 6 and 8**, suggesting average audience reception.  
- Some **highly popular titles have low IMDb votes**, showing that **popularity doesn’t always mean high engagement**.  
- **IMDb votes have increased over time**, reflecting a rise in audience interaction.  

#### **4️⃣ Production Trends & Regional Insights**  
- The **US dominates content production**, followed by other key markets like the UK and India.  
- Some regions **produce very little content**, presenting opportunities for expansion.  
- TV shows with **more seasons tend to have higher engagement**, showing that audiences invest in long-term storytelling.  

---

### **📌 Business Impact & Recommendations**  

✅ **Content Strategy Optimization**  
- **Investing in high-rated genres** like Drama and Thriller can improve audience satisfaction.  
- **Enhancing Action and Horror quality** can make them more appealing to a broader audience.  

✅ **Enhancing Viewer Engagement**  
- **Shorter TV series (1-3 seasons) attract binge-watchers**, improving user retention.  
- **Adding interactive features like audience ratings and reviews** can boost engagement.  

✅ **Platform Growth & Expansion**  
- **US production is dominant**, but **expanding regional content** (Asia, Europe) can help attract a wider audience.  
- Leveraging **IMDb votes and TMDb popularity trends** can help promote high-engagement content.  

✅ **Addressing Negative Growth Areas**  
- **Monitoring poorly rated content** and either removing or improving it can enhance platform quality.  
- **Balancing content production across regions** prevents over-reliance on a single market.  

---

### **📌 Conclusion**  
The **insights from EDA** can help Amazon Prime **refine its content strategy, optimize audience engagement, and expand its global reach**. By focusing on **high-rated genres, improving content diversity, and leveraging audience interaction trends**, Amazon Prime can strengthen its position in the competitive streaming industry. 🚀  


# **GitHub Link -**

https://github.com/SakshiVispute-2002/Project


# **Problem Statement**


Amazon Prime hosts a vast collection of movies and TV shows, but understanding what makes content successful is crucial for platform growth, audience retention, and strategic content investment. This project aims to analyze content type, genres, ratings, production trends, and audience engagement to uncover insights that can improve content strategy and viewer experience.



#### **Define Your Business Objective?**

The primary objective is to analyze Amazon Prime TV Shows and Movies to:
✅ Identify popular content types & high-performing genres.
✅ Analyze IMDb ratings & audience engagement trends.
✅ Understand content release patterns & regional production trends.
✅ Provide data-driven recommendations for improving content quality, engagement, and platform growth.

These insights will help Amazon Prime optimize content strategy, attract a larger audience, and improve retention in the competitive streaming industry. 🚀

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
credits_df = pd.read_csv("credits.csv")
titles_df = pd.read_csv("titles.csv")
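The guidelines ask for exception handling, and a bare `pd.read_csv` stops the notebook if a file is missing. A minimal sketch of a guarded loader follows; `load_csv` is a hypothetical helper name, not part of the original notebook, and it assumes the CSV files sit in the working directory.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv(path: str) -> Optional[pd.DataFrame]:
    """Read a CSV file; print a message and return None if it cannot be loaded."""
    file_path = Path(path)
    if not file_path.is_file():
        print(f"File not found: {file_path}")
        return None
    try:
        return pd.read_csv(file_path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        # Empty or malformed files are reported instead of raising
        print(f"Could not parse {file_path}: {exc}")
        return None
```

It could be used as `titles_df = load_csv("titles.csv")`, followed by a check for `None` before continuing.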

### Dataset First View

In [None]:
# Dataset First Look
# Display first 5 rows of each dataset
print("📌 First 5 rows of Titles Dataset:")
print(titles_df.head())

In [None]:
# Display first 5 rows of each dataset
print("\n📌 First 5 rows of Credits Dataset:")
print(credits_df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Count the number of rows and columns in both datasets
titles_shape = titles_df.shape
credits_shape = credits_df.shape

print(f"📌 Titles Dataset: {titles_shape[0]} rows, {titles_shape[1]} columns")
print(f"📌 Credits Dataset: {credits_shape[0]} rows, {credits_shape[1]} columns")


### Dataset Information

In [None]:
# Dataset Info

print("📌 Titles Dataset Info:")
titles_df.info()

print("\n📌 Credits Dataset Info:")
credits_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

titles_duplicates = titles_df.duplicated().sum()
credits_duplicates = credits_df.duplicated().sum()

print(f"📌 Duplicate Rows in Titles Dataset: {titles_duplicates}")
print(f"📌 Duplicate Rows in Credits Dataset: {credits_duplicates}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

missing_titles = titles_df.isnull().sum()
print("📌 Missing Values in Titles Dataset:\n", missing_titles)



In [None]:
# Missing Values/Null Values Count
missing_credits = credits_df.isnull().sum()

In [None]:
print("\n📌 Missing Values in Credits Dataset:\n", missing_credits)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 5))
sns.heatmap(titles_df.isnull(), cmap="viridis", cbar=False, yticklabels=False)
plt.title("Missing Values in Titles Dataset")
plt.show()
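Raw null counts are easier to compare as a share of all rows. The sketch below shows one way to tabulate both; `missing_summary` is a hypothetical helper and the toy frame only stands in for `titles_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, highest first."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing_count": counts, "missing_percent": percent})
    # Keep only columns that actually have missing values
    return summary[summary["missing_count"] > 0].sort_values("missing_count", ascending=False)


# Toy frame standing in for titles_df (values are made up for illustration)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "seasons": [np.nan, 2.0, np.nan, np.nan],
    "imdb_score": [7.1, np.nan, 6.4, 8.0],
})
print(missing_summary(sample))
```

On the real data this would be called as `missing_summary(titles_df)` and `missing_summary(credits_df)`.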


## ***2. Understanding Your Variables***

In [None]:
# Dataset Describe
# Display column names for Titles dataset
print("📌 Titles Dataset Columns:")
print(titles_df.columns)


In [None]:
# Display column names for Credits dataset
print("\n📌 Credits Dataset Columns:")
print(credits_df.columns)

In [None]:
# Summary statistics for Titles dataset
print("\n📌 Titles Dataset Description:")
print(titles_df.describe())

# Summary statistics for Credits dataset
print("\n📌 Credits Dataset Description:")
print(credits_df.describe())


✅ Interpretation:

Titles dataset contains 9,871 movies/TV shows with 15 attributes.
Credits dataset has 124,235 records linking actors & crew to movies/TV shows with 5 attributes.


In [None]:
# Convert list-type columns to strings before checking unique values
print("\n📌 Unique Values in Titles Dataset:")
for column in titles_df.columns:
    print(f"{column}: {titles_df[column].astype(str).nunique()} unique values")

print("\n📌 Unique Values in Credits Dataset:")
for column in credits_df.columns:
    print(f"{column}: {credits_df[column].astype(str).nunique()} unique values")



### Variables Description

The dataset contains key attributes that help analyze content trends, ratings, and audience engagement.

1. id & title: Unique identifiers and names of movies/TV shows.
2. type: Specifies whether the content is a Movie or TV Show.
3. release_year & runtime: Provide insights into content duration and production trends.
4. genres & production_countries: Help understand the most common genres and regional content distribution.
5. imdb_score, imdb_votes, tmdb_score, tmdb_popularity: Indicate audience ratings, votes, and popularity across platforms.
6. age_certification & seasons: Help analyze content suitability and TV show longevity.

These variables provide insights to optimize content strategy, viewer engagement, and platform growth. 🚀



## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
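# Fill missing values with appropriate replacements.
# Assigning the result back avoids calling fillna(inplace=True) on a selected column,
# which recent pandas versions warn about and may not apply to the DataFrame.
titles_df["description"] = titles_df["description"].fillna("No description available")
titles_df["age_certification"] = titles_df["age_certification"].fillna("Unknown")
titles_df["seasons"] = titles_df["seasons"].fillna(0)  # Movies have 0 seasons
titles_df["imdb_score"] = titles_df["imdb_score"].fillna(titles_df["imdb_score"].mean())
titles_df["tmdb_score"] = titles_df["tmdb_score"].fillna(titles_df["tmdb_score"].median())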

In [None]:
# Plot missing values heatmap again
plt.figure(figsize=(12, 6))
sns.heatmap(titles_df.isnull(), cmap="viridis", cbar=False)
plt.title("Missing Values in Titles Dataset (After Fixes)")
plt.show()

In [None]:
# 2. Dropping Unnecessary Columns
# Dropping 'imdb_id' as it may not be relevant for analysis
titles_df.drop(columns=['imdb_id'], inplace=True)

In [None]:
# Dropping 'character' column from credits_df as it has many missing values
credits_df.drop(columns=['character'], inplace=True)

In [None]:
# 3. Removing Duplicate Rows
titles_df.drop_duplicates(inplace=True)
credits_df.drop_duplicates(inplace=True)

In [None]:
# 4. Converting Data Types
titles_df['release_year'] = titles_df['release_year'].astype(int)
titles_df['runtime'] = titles_df['runtime'].astype(int)

In [None]:
# 5. Creating New Features
# Extracting the decade from the release year
titles_df['decade'] = (titles_df['release_year'] // 10) * 10

In [None]:
# 6. Merging Datasets
# Merging titles and credits datasets on 'id' column
merged_df = pd.merge(titles_df, credits_df, on='id', how='left')
# Display the cleaned dataset
print(merged_df.head())
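A left merge keeps every title but repeats it once per credit, so it helps to check how many rows result and which titles found no credits. The sketch below uses the `indicator` option of `pd.merge` on toy frames; the values are made up and only mirror the `id`, `name`, and `role` columns used in this notebook.

```python
import pandas as pd

# Toy frames standing in for titles_df and credits_df (made-up values)
titles = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
credits = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Actor One", "Director One", "Actor Two"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# indicator=True adds a "_merge" column recording where each row came from
merged = pd.merge(titles, credits, on="id", how="left", indicator=True)

# Titles that found no credits keep a single row with NaN in the credit columns
unmatched = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(f"Rows after merge: {len(merged)}, titles without credits: {unmatched}")
```

Because the merged frame holds one row per title and credit pair, title-level statistics such as rating distributions are safer to compute on `titles_df` than on `merged_df`.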

In [None]:
# Display all rows
pd.set_option('display.max_rows', None)

In [None]:
# Display all columns
pd.set_option('display.max_columns', None)

In [None]:
# Display full content of each column
pd.set_option('display.max_colwidth', None)

### What all manipulations have you done and insights you found?

✅ Data Cleaning & Handling Missing Values:
- Filled missing descriptions with "No description available" to maintain readability.
- Replaced missing age_certification values with "Unknown" since it is categorical.
- Set missing seasons to 0, since movies have no seasons.
- Replaced missing imdb_score values with the column mean and missing tmdb_score values with the column median.
- imdb_votes is not filled by the code above, so any missing vote counts remain NaN.

✅ Data Preprocessing & Optimization:
- Dropped the imdb_id column as not relevant to the analysis, and the character column because of its many missing values.
- Removed duplicate rows to avoid biased results.
- Converted release_year and runtime columns to integer types for consistency.
- Added a "decade" column to analyze trends across different time periods.

✅ Merging & Final Dataset Preparation:
- Merged titles_df and credits_df on the id column (left join) for a consolidated dataset.

### Check Unique Values for each variable.

In [None]:

# Check for missing values
missing_values_titles = titles_df.isnull().sum()
missing_values_credits = credits_df.isnull().sum()

In [None]:
print("Missing Values in Titles Dataset:\n", missing_values_titles)


In [None]:
print("\nMissing Values in Credits Dataset:\n", missing_values_credits)

In [None]:
# Visualizing missing data in Titles dataset
plt.figure(figsize=(10, 5))
sns.heatmap(titles_df.isnull(), cmap="viridis", cbar=False, yticklabels=False)
plt.title("Missing Values in Titles Dataset")
plt.show()

In [None]:

titles_df["type"].value_counts().plot(kind="bar", color=["blue", "orange"])
plt.title("Distribution of Movies vs TV Shows")
plt.xlabel("Type")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.show()


In [None]:

import ast

# Convert genre strings to lists
titles_df["genres"] = titles_df["genres"].apply(ast.literal_eval)
all_genres = [genre for sublist in titles_df["genres"] for genre in sublist]
genre_counts = pd.Series(all_genres).value_counts()

In [None]:
# Plot top 10 genres
genre_counts.head(10).plot(kind="barh", color="green")
plt.title("Top 10 Most Common Genres")
plt.xlabel("Count")
plt.ylabel("Genre")
plt.show()

In [None]:

sns.histplot(titles_df["imdb_score"].dropna(), bins=20, kde=True, color="purple")
plt.title("Distribution of IMDb Scores")
plt.xlabel("IMDb Score")
plt.ylabel("Frequency")
plt.show()


In [None]:

# Top 10 Movies/Shows by IMDb Score
top_imdb = titles_df[['title', 'type', 'imdb_score']].dropna().sort_values(by='imdb_score', ascending=False).head(10)
print("Top 10 Highest-Rated Movies/TV Shows (IMDb):")
print(top_imdb)


In [None]:

# Top 10 by TMDb Popularity
top_tmdb = titles_df[['title', 'type', 'tmdb_popularity']].dropna().sort_values(by='tmdb_popularity', ascending=False).head(10)
print("\nTop 10 Most Popular Movies/TV Shows (TMDb Popularity):")
print(top_tmdb)

In [None]:

# Count top 10 actors
top_actors = credits_df[credits_df['role'] == 'ACTOR']['name'].value_counts().head(10)
print("Top 10 Most Frequent Actors:")
print(top_actors)


In [None]:
# Count top 10 directors
top_directors = credits_df[credits_df['role'] == 'DIRECTOR']['name'].value_counts().head(10)
print("\nTop 10 Most Frequent Directors:")
print(top_directors)

In [None]:

titles_df["release_year"].value_counts().sort_index().plot(kind="line", figsize=(10, 5), marker="o", color="red")
plt.title("Number of Movies & Shows Released Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Releases")
plt.grid(True)
plt.show()


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

🔹 1. Distribution of Movies vs. TV Shows

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.countplot(x='type', data=titles_df, palette='coolwarm')
plt.title("Distribution of Movies vs. TV Shows")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

A count plot effectively shows the distribution of categorical data (type = Movie/TV Show).

##### 2.  What is/are the insight(s) found from the chart?

Movies dominate the dataset compared to TV Shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. More movies indicate a strong preference for one-time content over series.
2. A balanced ratio may be needed to compete with platforms like Netflix.

#### Chart - 2

🔹 2. Top 10 Most Common Genres

In [None]:
import ast

# Function to safely convert string to list
def convert_to_list(value):
    if isinstance(value, str):  # Apply conversion only if it's a string
        try:
            return ast.literal_eval(value)  # Convert string to list
        except (ValueError, SyntaxError):
            return []  # Return an empty list if conversion fails
    return value  # If already a list, return as-is

# Apply function to the genres column
titles_df["genres"] = titles_df["genres"].apply(convert_to_list)

# Verify conversion
print(titles_df["genres"].head())
from collections import Counter

all_genres = [genre for sublist in titles_df["genres"] for genre in sublist]
genre_counts = Counter(all_genres).most_common(10)

print("Top 10 Most Common Genres:", genre_counts)
# Plot
plt.figure(figsize=(10, 5))
sns.barplot(x=[g[0] for g in genre_counts], y=[g[1] for g in genre_counts], palette="viridis")
plt.xticks(rotation=45)
plt.title("Top 10 Most Common Genres")
plt.xlabel("Genre")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal for showing the frequency of top genres.

##### 2. What is/are the insight(s) found from the chart?

Drama & Comedy are the most dominant genres in the dataset.
Action and Thriller also have high counts, indicating demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Investing in Drama and Comedy content is beneficial.
2. A balanced approach with Action and Thriller can expand the audience.

#### Chart - 3

🔹 3. IMDb Score Distribution

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(titles_df["imdb_score"].dropna(), bins=20, kde=True, color="purple")
plt.title("Distribution of IMDb Scores")
plt.xlabel("IMDb Score")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram effectively shows the distribution of IMDb scores.

##### 2. What is/are the insight(s) found from the chart?

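Most IMDb scores fall between 6 and 8, indicating average ratings. Titles whose missing score was filled with the mean during data wrangling also land in this range, so the bar at the mean may be somewhat inflated.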
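The "between 6 and 8" reading can be backed with a number instead of an eyeball estimate. A minimal sketch on a toy series follows; the scores are made up and stand in for `titles_df["imdb_score"]`.

```python
import pandas as pd

# Toy scores standing in for titles_df["imdb_score"] (made-up values)
scores = pd.Series([4.5, 6.0, 6.8, 7.3, 8.0, 8.6, None], name="imdb_score")
valid = scores.dropna()

# between() includes both endpoints by default; the mean of booleans is the share of True
share_6_to_8 = valid.between(6, 8).mean()
print(f"Share of titles scoring 6-8: {share_6_to_8:.0%}")
```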

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. If fewer high-rated titles exist, efforts should be made to improve content quality.
2. Low-rated movies could hurt brand reputation.

#### Chart - 4

🔹 4. IMDb Score vs. Popularity (Correlation)

In [None]:
plt.figure(figsize=(8, 5))
sns.scatterplot(x=titles_df["imdb_score"], y=titles_df["tmdb_popularity"], alpha=0.5, color="blue")
plt.title("IMDb Score vs. TMDb Popularity")
plt.xlabel("IMDb Score")
plt.ylabel("TMDb Popularity")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot helps visualize the relationship between two numerical variables.

##### 2. What is/are the insight(s) found from the chart?

1. Some low-rated movies have high popularity (possibly due to marketing or controversy).
2. Highly-rated movies also show moderate to high popularity.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Marketing can boost popularity even for average-rated titles.
2. High IMDb score does not always guarantee high popularity.
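A scatter plot suggests a relationship but does not quantify it. The sketch below computes a Pearson and a rank-based (Spearman) correlation on toy data; the values are made up, and Spearman is obtained here by applying Pearson to ranks so that no extra library is needed.

```python
import pandas as pd

# Toy data standing in for titles_df (made-up values)
sample = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_popularity": [30.0, 2.0, 4.0, 8.0, 16.0],
})

# Drop rows where either value is missing so both measures use the same titles
pair = sample[["imdb_score", "tmdb_popularity"]].dropna()

# Pearson measures linear association
pearson = pair["imdb_score"].corr(pair["tmdb_popularity"])

# Spearman is Pearson applied to ranks, which is less sensitive to popularity outliers
spearman = pair["imdb_score"].rank().corr(pair["tmdb_popularity"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Values near zero would support the reading that a high IMDb score does not guarantee high popularity.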

#### Chart - 5

🔹 5. Top 10 Most Popular Movies/TV Shows

In [None]:
top_popular = titles_df[['title', 'tmdb_popularity']].dropna().sort_values(by='tmdb_popularity', ascending=False).head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_popular['tmdb_popularity'], y=top_popular['title'], palette="coolwarm")
plt.title("Top 10 Most Popular Titles")
plt.xlabel("TMDb Popularity")
plt.ylabel("Title")
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart clearly shows the top titles.

##### 2. What is/are the insight(s) found from the chart?

1. Some lesser-known movies may have unexpectedly high popularity.
2. Popularity may be driven by trends or recent promotions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Investing in trending titles could increase revenue.
2. Identifying why some movies are highly popular can help optimize marketing.

#### Chart - 6

In [None]:
plt.figure(figsize=(12, 5))
sns.lineplot(data=titles_df['release_year'].value_counts().sort_index())
plt.title("Number of Movies/TV Shows Released Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Releases")
plt.show()


##### 1. Why did you pick the specific chart?

1. A line chart is best for showing trends over time.
2. It helps visualize the growth pattern of movie & TV show releases.

##### 2. What is/are the insight(s) found from the chart?

1. The number of releases has increased over the years, especially after 2000.
2. There may be a dip in certain years due to industry disruptions (e.g., economic crises, pandemics).


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Impact:
Growth trend means increasing demand, making it beneficial to produce more content.
❌ Negative Impact:
A sudden drop in content production in certain years may indicate industry challenges or shifting audience interests.

#### Chart - 7

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Explode the genres column while resetting index
exploded_df = titles_df[['genres', 'imdb_score']].explode('genres').dropna().reset_index(drop=True)

# Plot the boxplot
plt.figure(figsize=(15, 6))
sns.boxplot(x=exploded_df['genres'], y=exploded_df['imdb_score'])
plt.xticks(rotation=90)
plt.title("IMDb Score Distribution per Genre")
plt.xlabel("Genre")
plt.ylabel("IMDb Score")
plt.show()

##### 1. Why did you pick the specific chart?

1. A box plot helps compare IMDb score distributions across different genres.
2. It shows which genres tend to have higher or lower ratings.

##### 2. What is/are the insight(s) found from the chart?

1. Genres like Drama, Documentary, and Thriller tend to have higher ratings.
2. Action and Horror movies have greater variation, with some poorly rated titles.
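The box plot comparison can be complemented with a per-genre summary table, which makes "higher ratings" and "greater variation" directly comparable. A minimal sketch on toy data follows; the genre labels and scores are made up, and it assumes `genres` has already been parsed into lists.

```python
import pandas as pd

# Toy data standing in for titles_df after genres were parsed into lists (made-up values)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["drama", "thriller"], ["drama"], ["horror"], ["horror", "thriller"]],
    "imdb_score": [8.0, 7.0, 4.0, 6.0],
})

# One row per (title, genre) pair, then aggregate scores within each genre
exploded = sample.explode("genres").dropna(subset=["genres", "imdb_score"])
genre_stats = (
    exploded.groupby("genres")["imdb_score"]
    .agg(titles="count", median_score="median", score_std="std")
    .sort_values("median_score", ascending=False)
)
print(genre_stats)
```

The `titles` column matters when reading the result: a genre with only a handful of titles can show a high median by chance.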

#### Chart - 8

In [None]:
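import ast

# production_countries is read from the CSV as text (e.g. "['US']"), so parse it into lists before exploding
countries = titles_df['production_countries'].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)
country_counts = countries.explode().value_counts().head(10)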

plt.figure(figsize=(10, 5))
plt.pie(country_counts, labels=country_counts.index, autopct='%1.1f%%', colors=sns.color_palette('Set2'))
plt.title("Top 10 Production Countries")
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart effectively shows the distribution of content production by country.

##### 2. What is/are the insight(s) found from the chart?

1. The US dominates production, followed by other major film industries (India, UK).
2. Some regions produce little content, highlighting potential growth opportunities.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Impact:
Expanding localized content in underserved regions can attract new subscribers.
❌ Negative Impact:
Over-reliance on a single country (US) could limit global reach and diversity.

#### Chart - 9

In [None]:
plt.figure(figsize=(8, 5))
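# Movies were given seasons = 0 during data wrangling, so keep only titles with at least one season
tv_seasons = titles_df.loc[titles_df['seasons'] > 0, 'seasons']
sns.histplot(tv_seasons, bins=15, kde=True, color="green")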
plt.title("Distribution of Seasons in TV Shows")
plt.xlabel("Number of Seasons")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram shows the frequency of TV shows with different season counts.

##### 2. What is/are the insight(s) found from the chart?

Most TV shows have 1-3 seasons, indicating limited-run series are common.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Impact:
Shorter series (1-3 seasons) attract binge-watchers, increasing engagement.
❌ Negative Impact:
Fewer long-running shows could lead to lower platform retention.

#### Chart - 10

In [None]:
top_directors = credits_df[credits_df['role'] == 'DIRECTOR']['name'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_directors.values, y=top_directors.index, palette="coolwarm")
plt.title("Top 10 Directors with Most Titles")
plt.xlabel("Number of Titles")
plt.ylabel("Director Name")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart highlights the most frequently working directors.

##### 2. What is/are the insight(s) found from the chart?

Some directors have extensive filmographies, contributing to platform success.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Partnering with top directors can lead to consistent high-quality releases.

#### Chart - 11

In [None]:
top_actors = credits_df[credits_df['role'] == 'ACTOR']['name'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_actors.values, y=top_actors.index, palette="coolwarm")
plt.title("Top 10 Most Frequent Actors")
plt.xlabel("Number of Titles")
plt.ylabel("Actor Name")
plt.show()



##### 1. Why did you pick the specific chart?

Highlights which actors are the most featured on the platform.

##### 2. What is/are the insight(s) found from the chart?

Certain actors appear in many productions, making them valuable assets.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Promoting familiar faces can drive viewership.

#### Chart - 12

In [None]:
plt.figure(figsize=(8, 5))
sns.scatterplot(x=titles_df["runtime"], y=titles_df["imdb_score"], alpha=0.5, color="red")
plt.title("Runtime vs IMDb Score")
plt.xlabel("Runtime (minutes)")
plt.ylabel("IMDb Score")
plt.show()


##### 1. Why did you pick the specific chart?

Shows if longer movies tend to get higher/lower ratings.

##### 2. What is/are the insight(s) found from the chart?

1. Moderate-length movies (90-150 min) have higher ratings.
2. Very short/long movies tend to be rated lower.
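The runtime observation can be checked by grouping titles into runtime bands and comparing mean scores. The sketch below uses `pd.cut` on toy data; the values are made up and the bin edges are an assumption chosen to mirror the 90-150 minute range mentioned above.

```python
import pandas as pd

# Toy data standing in for titles_df (made-up values)
sample = pd.DataFrame({
    "runtime": [45, 80, 95, 120, 140, 200],
    "imdb_score": [5.5, 6.0, 7.0, 7.5, 7.2, 6.1],
})

# Assumed band edges; right=False makes each band include its lower edge
bins = [0, 90, 150, float("inf")]
labels = ["under 90", "90-150", "over 150"]
sample["runtime_band"] = pd.cut(sample["runtime"], bins=bins, labels=labels, right=False)

# observed=False keeps bands with no titles in the result
band_stats = sample.groupby("runtime_band", observed=False)["imdb_score"].agg(["count", "mean"])
print(band_stats)
```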

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Impact:
Producing content within the ideal runtime range can improve ratings.

#### Chart - 13

In [None]:
titles_df['decade'] = (titles_df['release_year'] // 10) * 10
decade_type_counts = titles_df.groupby(['decade', 'type']).size().unstack()

decade_type_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap="coolwarm")
plt.title("Movies vs. TV Shows per Decade")
plt.xlabel("Decade")
plt.ylabel("Count")
plt.legend(title="Type")
plt.show()



##### 1. Why did you pick the specific chart?

Helps understand how TV shows and movies have evolved over time.

##### 2. What is/are the insight(s) found from the chart?

TV Shows have increased significantly in recent decades.
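Stacked counts can hide a change in mix when total volume grows, so the share of each type per decade is a useful companion. A minimal sketch with `pd.crosstab` follows; the data is made up and the type labels `MOVIE` and `SHOW` are placeholders that may differ from the values in the real `type` column.

```python
import pandas as pd

# Toy data standing in for titles_df; type labels are placeholders
sample = pd.DataFrame({
    "release_year": [1995, 1998, 2005, 2011, 2015, 2019],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW"],
})
sample["decade"] = (sample["release_year"] // 10) * 10

# normalize="index" turns each decade's counts into shares that sum to 1
share = pd.crosstab(sample["decade"], sample["type"], normalize="index")
print(share.round(2))
```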

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Impact:
Investing in more TV shows aligns with modern trends.

#### Chart - 14

In [None]:
plt.figure(figsize=(12, 5))
sns.lineplot(data=titles_df.groupby('release_year')['imdb_votes'].sum())
plt.title("Growth Trend of IMDb Votes Over the Years")
plt.xlabel("Year")
plt.ylabel("Total IMDb Votes")
plt.show()


##### 1. Why did you pick the specific chart?

1. A line chart helps visualize how audience engagement (IMDb votes) has evolved over time.
2. Helps identify trends in movie/TV popularity across different years.

##### 2. What is/are the insight(s) found from the chart?

1. IMDb votes have increased significantly over the years, showing a rise in audience engagement.
2. There may be spikes in votes around blockbuster releases or specific trends (e.g., streaming era).


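##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Impact: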

Increasing IMDb votes indicate growing audience interaction, proving higher engagement and viewership.
More votes mean more valuable audience feedback for better content decisions.
❌ Negative Impact:

A drop in IMDb votes in specific years could indicate declining audience interest or lack of high-quality content.
Platforms must ensure continuous content quality to maintain audience engagement.

#### Chart - 15

In [None]:
plt.figure(figsize=(8, 5))
sns.scatterplot(x=titles_df["tmdb_popularity"], y=titles_df["imdb_votes"], alpha=0.5, color="green")
plt.title("IMDb Votes vs TMDb Popularity")
plt.xlabel("TMDb Popularity")
plt.ylabel("IMDb Votes")
plt.show()


##### 1. Why did you pick the specific chart?

1. A scatter plot shows the relationship between a movie/show's popularity and audience votes.
2. Helps determine whether highly popular content also receives high audience engagement.

##### 2. What is/are the insight(s) found from the chart?

1. Some highly popular movies/shows have low votes, meaning they might be trending but not engaging.
2. Movies with high votes are not always the most popular, indicating strong fan engagement even for niche titles.
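To list the titles behind the first insight rather than spot them on the scatter plot, one option is to flag rows that rank high on popularity and low on votes. The sketch below uses quantile thresholds on toy data; the values are made up and the 60% and 40% cut-offs are assumptions for illustration, not tuned choices.

```python
import pandas as pd

# Toy data standing in for titles_df (made-up values)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "tmdb_popularity": [90.0, 5.0, 70.0, 10.0, 3.0],
    "imdb_votes": [200.0, 50000.0, 80000.0, 300.0, 100.0],
})

# Assumed thresholds: top 40% by popularity, bottom 40% by votes
pop_cut = sample["tmdb_popularity"].quantile(0.6)
vote_cut = sample["imdb_votes"].quantile(0.4)

# Titles that are popular on TMDb but have drawn few IMDb votes
flagged = sample[(sample["tmdb_popularity"] >= pop_cut) & (sample["imdb_votes"] <= vote_cut)]
print(flagged["title"].tolist())
```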

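##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.
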
✅ Positive Impact:

Understanding this correlation can help strategically market content that is gaining traction.
If low-vote movies are highly popular, adding interactive elements (polls, reviews, social engagement) may help boost retention.
❌ Negative Impact:

If popular titles receive fewer votes, it suggests they lack deep audience engagement, meaning they may not have lasting impact.
Focus should be on increasing audience retention, not just hype-based popularity.

## **5. Solution to Business Objective**

Based on the EDA, the following key solutions are derived:

✅ Content Strategy: Prioritize high-rated genres like Drama and Thriller, while improving Action and Horror quality.
✅ Viewer Engagement: Focus on shorter TV series for binge-watchers and improve audience interaction through ratings and reviews.
✅ Platform Growth: Utilize IMDb votes & TMDb popularity to promote trending content and collaborate with top actors and directors.
✅ Addressing Challenges: Monitor poorly rated content and avoid over-reliance on a single region for production.

# **Conclusion**

The EDA on Amazon Prime Movies & TV Shows provided key insights into genre popularity, IMDb ratings, content trends, and audience engagement.

✅ Drama and Thriller dominate high ratings, while Action and Horror show mixed reviews.
✅ Movies outnumber TV shows, and shorter series are more common.
✅ IMDb votes & TMDb popularity highlight trending content, guiding marketing strategies.
✅ Production is US-dominated, but expanding globally can boost reach.

These findings can help improve content strategy, platform growth, and viewer retention. 🚀