<a href="https://colab.research.google.com/github/Rajgadekar2151/EDA_AmazonPrimeVideo/blob/main/EDA_prime_video.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  EDA on Amazon Prime Video Shows and Movies




##### **Project Type** - Exploratory Data Analysis (EDA)

##### **Contribution**    - Individual
#####  **Member** - Raj Gadekar

# **Project Summary -**

This project analyzed over 9,800 titles from Amazon Prime Video, including Movies and TV Shows, along with information about their actors and directors. The goal was to understand content trends, popularity, ratings, and audience preferences.

We first explored the data, checked for missing values and duplicates, and performed data wrangling to clean the datasets. We filled missing ratings, age certifications, and other key columns to make the data ready for analysis.

Next, we created more than 20 charts to explore the data from different angles:

* Content type distribution – shows how many Movies vs TV Shows are available.

* Release trends over time – shows how content has grown year by year.

* Genre popularity – finds the most common genres.

* Ratings and popularity analysis – identifies top-rated and most-voted titles.

* Actor/Director contribution – highlights who appears most in Amazon Prime content.

* Correlation and pair plots – shows relationships between numerical variables like IMDb score, TMDB score, votes, and popularity.

From these analyses, we found that a few titles and actors dominate the platform, some genres are more common than others, and highly-rated titles often have more votes.

These insights can help Amazon promote top content, plan new productions, and make data-driven decisions to increase user engagement and subscriptions.

# **GitHub Link -**

https://github.com/Rajgadekar2151/EDA_AmazonPrimeVideo

# **Problem Statement**


Amazon Prime Video has many movies and TV shows for different types of users.

Because many streaming platforms are competing with each other, it is important to understand what type of content is available and how it is performing.

This project analyzes Amazon Prime Video data using Exploratory Data Analysis (EDA) to understand content trends, ratings, and popularity.

The goal is to find useful insights that can help in better content planning and business decisions.


#### **Define Your Business Objective?**

• To understand how many Movies and TV Shows are available on Amazon Prime Video

• To find the most common genres on the platform

• To study how the content has grown over the years

• To analyze IMDb and TMDB ratings and popularity of the content

• To provide simple insights that can help improve content strategy

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import os


### Dataset Loading

In [None]:
# Load Dataset
titles = pd.read_csv('titles.csv')
credits = pd.read_csv('credits.csv')
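Since the guidelines ask for exception handling, the loading step could be wrapped in a small helper. This is only a sketch: `load_csv` is a hypothetical helper name, not part of the original notebook, and it assumes the CSV files sit in the working directory.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, raising a clear error if the file is missing.

    Hypothetical helper for illustration; the notebook itself calls
    pd.read_csv directly.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Fail early with the resolved path so the problem is easy to spot
        raise FileNotFoundError(f"Dataset not found: {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

With this helper, `titles = load_csv('titles.csv')` would behave like the cell above but report a missing file with its full path.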

### Dataset First View

In [None]:
# Dataset First Look
titles.head(10)

In [None]:
titles.tail(10)

In [None]:
credits.head(10)

In [None]:
credits.tail(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
titles.shape


In [None]:
credits.shape

### Dataset Information

In [None]:
# Dataset Info
titles.info()

In [None]:
credits.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
titles.duplicated().sum()


In [None]:
credits.duplicated().sum()


In [None]:
dup = titles[titles.duplicated(keep=False)]
dup.sort_values(by=dup.columns.tolist()).head(10)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
titles.isnull().sum()


In [None]:
credits.isnull().sum()


In [None]:
# Visualizing the missing values
titles.isnull().sum().plot(kind='bar',color=['blue', 'red','black'],figsize=(16,4))
plt.title("Missing Values Count by Column")
plt.show()
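Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get the percentage of missing values per column; it uses a small made-up frame (`toy`) in place of `titles`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `titles`; the values are made up for illustration
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.1, np.nan, 6.3, np.nan],
    "seasons": [np.nan, np.nan, np.nan, 2.0],
})

# isnull() gives True/False per cell; the column mean is the missing share
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

The same two lines should work on `titles` and `credits` unchanged.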

### What did you know about your dataset?

* The dataset has more than 9000 titles from Amazon Prime Video.

* There are two types of content: Movies and TV Shows.

* The data has both text (categorical) and numbers (numerical).

* Some columns, like IMDb score or age certification, have missing values.

* Genres are stored as a list of multiple genres in one column.

* The dataset is mainly for content in the United States.

* IMDb and TMDB ratings are available, which can help us find popular content.

* The dataset also contains information about production countries, runtime, and seasons.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
titles.columns

In [None]:
credits.columns

In [None]:
# Dataset Describe
titles.describe()

In [None]:
"""numeric_df = titles.select_dtypes(include=['number'])
corr_matrix = numeric_df.corr()
#corr_matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', annot_kws={"size": 7})
plt.title('Correlation Heatmap')
plt.show()"""

### Variables Description

id: Unique identifier for each title available on Amazon Prime Video.

title: Name of the movie or TV show.

type: Indicates whether the content is a Movie or a TV Show.

description: Short summary describing the content of the title.

release_year: Year in which the title was released.

age_certification: Age rating assigned to the title, indicating suitable audience age.

runtime: Duration of the movie or episode in minutes.

genres: List of genres associated with the title.

production_countries: Countries involved in producing the title.

seasons: Number of seasons for TV shows (not applicable for movies).

imdb_id: Unique identifier of the title on IMDb.

imdb_score: IMDb rating score of the title.

imdb_votes: Number of votes received on IMDb.

tmdb_popularity: Popularity score of the title on TMDB.

tmdb_score: Rating score of the title on TMDB.

***

person_id: Unique identifier for each person (actor or director).

id: Unique identifier of the title associated with the person.

name: Name of the actor or director.

character: Name of the character played by the actor.

role: Role of the person in the title, such as Actor or Director.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in titles.columns:
    print(col, ":", titles[col].nunique())

In [None]:
for col in credits.columns:
    print(col, ":", credits[col].nunique())

For categorical columns, we checked the number of unique values to understand the diversity in the dataset.

Columns like type, genres, age_certification, and production_countries have multiple categories that help analyze content distribution.

Numeric columns have many different values, so we will focus on their summary statistics (min, max, mean, median) instead of listing all unique numbers.

In [None]:
categorical_df = titles.select_dtypes(include=['object'])
for col in categorical_df.columns:
    print(f"{col}: {categorical_df[col].nunique()} unique values")

In [None]:
categorical_df1 = credits.select_dtypes(include=['object'])
for col in categorical_df1.columns:
    print(f"{col}: {categorical_df1[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Remove duplicate rows
titles = titles.drop_duplicates()

# Check again for duplicates
print("Remaining duplicates:", titles.duplicated().sum())

In [None]:
credits = credits.drop_duplicates()
print("Remaining duplicates:", credits.duplicated().sum())
credits['character'] = credits['character'].fillna('Unknown')


In [None]:
# Fill missing values

titles['age_certification'] = titles['age_certification'].fillna('Not Rated')

titles['description'] = titles['description'].fillna('No description')

titles['seasons'] = titles['seasons'].fillna(0)
titles['seasons'] = titles['seasons'].astype(int)




titles['tmdb_score'] = titles.groupby('type')['tmdb_score'] \
    .transform(lambda x: x.fillna(x.mean()))

titles['imdb_votes'] = titles.groupby('type')['imdb_votes'] \
    .transform(lambda x: x.fillna(x.mean()))

titles['tmdb_popularity'] = titles.groupby('type')['tmdb_popularity'] \
    .transform(lambda x: x.fillna(x.mean()))

titles['imdb_score'] = titles.groupby('type')['imdb_score'] \
    .transform(lambda x: x.fillna(x.mean()))


titles['imdb_votes'] = titles['imdb_votes'].astype(int)

In [None]:
'''credits.head()
credits.isnull().sum()'''

In [None]:
'''titles.head()
titles.isnull().sum()'''

### What all manipulations have you done and insights you found?

During the data wrangling process, duplicate rows were identified and removed to avoid repeated records in the analysis.
Missing values in categorical and text columns such as age certification and description were filled with meaningful labels.
The seasons column was filled with zero for movies, as movies do not have seasons, and converted to integer type.
Missing values in rating-related columns such as IMDb score, IMDb votes, TMDB popularity, and TMDB score were handled using group-wise mean based on content type (Movie or TV Show).
Identifier columns like IMDb ID were left unchanged, as they are not used for analytical calculations.

From this process, we saw that many missing values are expected in certain columns, especially seasons and age certification, because of the nature of the content (for example, movies have no seasons).
Proper handling of these values helped in creating a clean dataset that is ready for meaningful visualization and analysis.
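The group-wise mean fill described above can be sketched on a small made-up frame. The flag column `imdb_score_imputed` is an addition for illustration, not something the notebook creates; it keeps track of which scores were filled so they can be excluded later if needed.

```python
import numpy as np
import pandas as pd

# Toy data standing in for `titles`; values are invented for illustration
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [6.0, 8.0, np.nan, 7.5, np.nan],
})

# Remember which rows were missing before filling (hypothetical extra column)
toy["imdb_score_imputed"] = toy["imdb_score"].isna()

# Mean of each content type, aligned back to every row of that type
group_mean = toy.groupby("type")["imdb_score"].transform("mean")
toy["imdb_score"] = toy["imdb_score"].fillna(group_mean)
print(toy)
```

Note that filling many rows with the same mean creates a spike at that value in histograms, and for skewed columns such as vote counts a median fill may be a safer choice.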

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
fig, ax = plt.subplots(figsize=(15, 8))
sns.countplot(data=titles, x='type', palette="Blues_d", ax=ax)
plt.title('Distribution of Content Type on Amazon Prime')
plt.xlabel('Content Type')
plt.ylabel('Number of Titles')

for bar in ax.containers:
    ax.bar_label(bar)

plt.show()


##### 1. Why did you pick the specific chart?

*   A count plot is suitable for categorical variables and helps compare the number of Movies and TV Shows available on the platform

##### 2. What is/are the insight(s) found from the chart?

*   The chart shows that Amazon Prime Video has a higher number of movies compared to TV shows, indicating a stronger focus on movie content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  Yes, this insight helps understand content strategy. A higher number of movies may attract movie-focused users, while increasing TV shows could help retain long-term subscribers.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
'''plt.figure(figsize=(10,5))
titles['release_year'].value_counts().sort_index().plot(kind='line')
plt.title('Content Release Trend Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.show()'''


In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(30,10))
sns.countplot(data=titles, x='release_year', palette="viridis")
plt.title('Number of Titles Released per Year on Amazon Prime')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

*  To see how content on Amazon Prime has increased or changed over the years.

*  A bar chart clearly shows the number of releases per year.

##### 2. What is/are the insight(s) found from the chart?

*  The number of titles has grown steadily in recent years.

*  Older years have fewer titles, either because fewer titles were released then or because less data is available for them.

*  This suggests that the catalog leans toward recent releases and that Amazon Prime keeps adding new content over time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  Positive impact: Shows that Amazon Prime is actively adding new content to attract more subscribers.

*  No negative growth is observed from this chart. The trend is upward, which is good for business.
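With one bar per year the x-axis gets crowded, so grouping by decade can make the trend easier to read. A minimal sketch, using made-up release years instead of the real `release_year` column:

```python
import pandas as pd

# Made-up release years for illustration
years = pd.Series([1955, 1987, 1999, 2004, 2015, 2018, 2019, 2021],
                  name="release_year")

# Integer division drops the last digit, giving the start of each decade
decade = (years // 10) * 10

# Number of titles per decade, in chronological order
per_decade = decade.value_counts().sort_index()
print(per_decade)
```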

In [None]:
titles.head()

#### Chart - 3

In [None]:
# Create dummy variables for each genre
genre_df = titles['genres'].str.get_dummies(sep=',').sum().sort_values(ascending=False).reset_index()
genre_df.columns = ['Genre', 'Count']
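One thing to check here: if the `genres` column stores each list as a string such as `"['drama', 'comedy']"` (this format is an assumption about the CSV, so inspect `titles['genres'].head()` first), splitting on commas keeps the brackets and quotes inside the genre names. The sketch below shows the effect on made-up rows and one possible way to parse the strings into real lists.

```python
import ast

import pandas as pd

# Made-up rows; the string format is an assumption about the real file
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'comedy']", "['drama']", "[]"],
})

# Naive split: column names still contain brackets and quotes
naive = toy["genres"].str.get_dummies(sep=",")
print(list(naive.columns))

# Parse each string into a Python list, then count one row per genre
genre_lists = toy["genres"].apply(ast.literal_eval)
genre_counts = genre_lists.explode().dropna().value_counts()
print(genre_counts)
```

If the real column turns out to hold plain comma-separated text, the original `get_dummies` call is fine as it is.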

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(16,6))
sns.barplot(data=genre_df.head(10), x='Genre', y='Count', palette="viridis")
plt.title('Top 10 Genres on Amazon Prime')
plt.xlabel('Genre')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

*  To see which genres are most popular on Amazon Prime.

*  Helps understand content diversity on the platform.

##### 2. What is/are the insight(s) found from the chart?

*  The most common genres include Drama, Comedy, and Thriller.

*  Amazon Prime focuses more on popular genres to attract viewers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  Positive impact: Understanding popular genres helps in content acquisition and marketing.

*  Negative insight: Less popular genres may need more investment to increase user engagement.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10,5))
sns.histplot(titles['imdb_score'], bins=20, kde=True)

plt.title('Distribution of IMDb Scores on Amazon Prime')
plt.xlabel('IMDb Score')
plt.ylabel('Number of Titles')

# Manually set x-axis ticks
plt.xticks(range(0, 11, 1))   # 0,1,2,3,...10

plt.show()



##### 1. Why did you pick the specific chart?

* A histogram is suitable for understanding the distribution of numerical variables like IMDb score. It helps identify how ratings are spread across titles.

##### 2. What is/are the insight(s) found from the chart?

* Most titles have IMDb scores between 5 and 8, indicating that a large portion of content is moderately to highly rated. Very few titles have extremely low or very high scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this insight helps understand overall content quality on the platform. Knowing that most titles are well-rated supports marketing strategies that highlight content quality to attract users.
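A statement like "most titles score between 5 and 8" can be backed by a number. The sketch below computes the share of scores inside a range on made-up values; on the real data the same call would run on `titles['imdb_score']`.

```python
import pandas as pd

# Made-up scores for illustration
scores = pd.Series([4.2, 5.5, 6.1, 6.8, 7.4, 8.9, 7.9, 3.0])

# between() includes both ends by default; the mean of True/False is the share
in_band = scores.between(5, 8)
share = in_band.mean()
print(f"{share:.1%} of titles score between 5 and 8")
```

Because missing scores were filled with the group mean earlier, part of the real share comes from imputed rows, which is worth keeping in mind when reading the histogram.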

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10,5))
sns.histplot(titles['runtime'], bins=30, kde=True)

plt.title('Distribution of Runtime on Amazon Prime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Number of Titles')

plt.show()


##### 1. Why did you pick the specific chart?

* A histogram is suitable for analyzing the distribution of numerical variables like runtime. It helps understand how content duration is spread across the platform.

##### 2. What is/are the insight(s) found from the chart?

* Most titles have a runtime between 60 and 120 minutes, indicating that Amazon Prime mainly focuses on standard-length movies and episodes. Very long runtimes are less common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, understanding runtime patterns helps in content planning and user experience design. Viewers often prefer content with manageable durations, which can improve engagement and retention.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(6,4))
sns.barplot(data=titles, x='type', y='imdb_score', palette='Set2')

plt.title('Average IMDb Score by Content Type')
plt.xlabel('Content Type')
plt.ylabel('Average IMDb Score')
plt.ylim(0,10)
plt.show()


'''filtered_titles = titles[
    (titles['imdb_score'] >= 7) &
    (titles['imdb_votes'] >= 10000)
]
count_df = filtered_titles['type'].value_counts().reset_index()
count_df.columns = ['Content Type', 'Count']

plt.figure(figsize=(6,4))
sns.barplot(data=count_df, x='Content Type', y='Count', palette='Set2')

plt.title('Count of High-Rated & Highly-Voted Content by Type')
plt.xlabel('Content Type')
plt.ylabel('Number of Titles')

plt.show()'''



##### 1. Why did you pick the specific chart?

* A bar chart is useful for comparing the average IMDb ratings between Movies and TV Shows.


##### 2. What is/are the insight(s) found from the chart?

* TV Shows have a slightly higher average IMDb score compared to Movies, indicating better overall audience ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this insight helps understand audience preferences. Higher-rated TV Shows can improve user engagement and long-term retention.
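The bar chart only shows the mean, so it may help to print the numbers behind it. A small sketch on made-up data, adding the median and the number of titles per type next to the mean:

```python
import pandas as pd

# Made-up scores for illustration
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [5.0, 6.0, 7.0, 7.0, 8.0],
})

# Mean, median, and group size for each content type
summary = toy.groupby("type")["imdb_score"].agg(["mean", "median", "count"])
print(summary)
```

The `count` column matters here: a small difference in means is less convincing when one group is much smaller than the other.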

#### Chart - 7

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(8,5))
sns.violinplot(data=titles, x='type', y='imdb_score', palette='Set2')

plt.title('IMDb Score Distribution by Content Type')
plt.xlabel('Content Type')
plt.ylabel('IMDb Score')
plt.ylim(0,10)
plt.show()



##### 1. Why did you pick the specific chart?

* A violin plot is useful for comparing the distribution of IMDb scores between Movies and TV Shows and understanding rating spread and density.

##### 2. What is/are the insight(s) found from the chart?

* TV Shows generally have a slightly higher and more consistent IMDb score distribution compared to Movies.
Movies show a wider spread of ratings, including more lower-rated titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this insight suggests that TV Shows tend to perform better in terms of audience ratings.
Focusing on quality TV Shows can help improve user engagement and retention.


#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Split production countries and count
country_df = (
    titles['production_countries']
    .str.get_dummies(sep=',')
    .sum()
    .sort_values(ascending=False)
    .head(10)
    .reset_index()
)

country_df.columns = ['Country', 'Count']

plt.figure(figsize=(15,5))
sns.barplot(data=country_df, x='Country', y='Count', palette='viridis')

plt.title('Top 10 Production Countries on Amazon Prime')
plt.xlabel('Country')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

* A bar chart is suitable for comparing the number of titles produced by different countries and identifying the major content-producing regions.

##### 2. What is/are the insight(s) found from the chart?

* The United States contributes the highest number of titles on Amazon Prime, followed by other countries.
This indicates a strong dominance of content produced in a few major regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, understanding production country distribution helps in regional content planning and localization strategies.
Expanding content from diverse regions can help attract a wider global audience.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Split genres into separate columns
genre_dummy = titles['genres'].str.get_dummies(sep=',')

# Combine with IMDb score
genre_rating = genre_dummy.mul(titles['imdb_score'], axis=0)

# Calculate average rating per genre
avg_genre_rating = genre_rating.sum() / genre_dummy.sum()

# Convert to DataFrame and take top 10 genres
avg_genre_rating = avg_genre_rating.sort_values(ascending=False).head(10).reset_index()
avg_genre_rating.columns = ['Genre', 'Average IMDb Score']

plt.figure(figsize=(15,5))
sns.barplot(data=avg_genre_rating, x='Genre', y='Average IMDb Score', palette='viridis')

plt.title('Top 10 Genres by Average IMDb Rating')
plt.xlabel('Genre')
plt.ylabel('Average IMDb Score')
plt.ylim(0,10)
plt.xticks(rotation=45)
plt.show()



##### 1. Why did you pick the specific chart?

* This chart was chosen to compare IMDb ratings across different genres and identify which genres perform better in terms of audience ratings.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that certain genres such as Reality, Sport, and War tend to have higher average IMDb ratings compared to others.
This indicates that audiences generally rate these genres more positively.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this insight helps content platforms focus on genres that consistently receive higher audience ratings.
Investing more in high-performing genres can improve user satisfaction and platform credibility.
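A genre with only a handful of titles can reach the top of this ranking by chance. One way to guard against that is to require a minimum number of titles per genre. The sketch below uses made-up data, and `MIN_TITLES` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Made-up data; genres are already parsed into lists here
toy = pd.DataFrame({
    "genres": [["drama", "war"], ["drama"], ["drama", "comedy"], ["comedy"]],
    "imdb_score": [9.0, 6.0, 7.0, 5.0],
})

# One row per (title, genre) pair
exploded = toy.explode("genres")

# Average score and number of titles for each genre
per_genre = exploded.groupby("genres")["imdb_score"].agg(["mean", "count"])

MIN_TITLES = 2  # hypothetical threshold, tune for the real dataset
reliable = (
    per_genre[per_genre["count"] >= MIN_TITLES]
    .sort_values("mean", ascending=False)
)
print(reliable)
```

In this toy example "war" has the highest average but only one title, so it is dropped from the ranking.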

#### Chart - 10

In [None]:
# Chart - 10 visualization code


# Prepare genre dummy
genre_dummy = titles['genres'].str.get_dummies(sep=',')

# Multiply with TMDB popularity
genre_popularity = genre_dummy.mul(titles['tmdb_popularity'], axis=0)

# Average popularity per genre
avg_genre_pop = (genre_popularity.sum() / genre_dummy.sum()) \
                .sort_values(ascending=False) \
                .head(10) \
                .reset_index()

avg_genre_pop.columns = ['Genre', 'Average Popularity']

plt.figure(figsize=(14,5))
sns.barplot(data=avg_genre_pop, x='Genre', y='Average Popularity', palette='viridis')
plt.title('Top 10 Genres by Average TMDB Popularity')
plt.xlabel('Genre')
plt.ylabel('Average Popularity')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

* This chart was chosen to understand which genres attract higher audience popularity on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?

*  Genres like Animation, Fantasy, and SciFi show higher average popularity, indicating strong viewer interest and engagement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Popular genres can be prioritized for promotion and future content investment to maximize audience reach.

#### Chart - 11

In [None]:
# Chart - 11 visualization code


# Add content type to genre dummy
genre_type_df = genre_dummy.copy()
genre_type_df['type'] = titles['type']

# Group by content type
genre_type_count = genre_type_df.groupby('type').sum().T

# Take top 10 genres overall
top_genres = genre_type_count.sum(axis=1).sort_values(ascending=False).head(10).index
genre_type_count = genre_type_count.loc[top_genres]

# Plot
genre_type_count.plot(kind='bar', figsize=(10,5))
plt.title('Top Genres: Movies vs TV Shows')
plt.xlabel('Genre')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

* This chart compares Movies and TV Shows across genres to understand content distribution by type.

##### 2. What is/are the insight(s) found from the chart?

* Some genres like Drama and Comedy are dominated by TV Shows, while others like Action are more common in Movies.
This shows different content strategies for different genres.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Understanding genre dominance helps plan balanced content strategies and improve audience targeting.
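Raw counts favor whichever content type has more titles overall, so comparing shares can give a fairer picture. A possible sketch with `pd.crosstab` on made-up data, dividing each count by the number of titles of that type:

```python
import pandas as pd

# Made-up data; genres are already parsed into lists here
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW"],
    "genres": [["drama", "action"], ["action"], ["drama"], ["drama", "comedy"]],
})

# One row per (title, genre) pair, with a clean index
exploded = toy.explode("genres").reset_index(drop=True)

# Number of titles per genre and content type
counts = pd.crosstab(exploded["genres"], exploded["type"])

# Share of each type's titles that carry the genre
share = counts.div(toy["type"].value_counts(), axis=1)
print(share)
```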



#### Chart - 12

In [None]:
# Chart - 12 visualization code

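ax = sns.countplot(data=credits, x='role', palette='Set2')
for bar in ax.containers:
    ax.bar_label(bar)
plt.title('Number of Credits by Role on Amazon Prime')
plt.xlabel('Role')
plt.ylabel('Number of Credits')
plt.show()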




##### 1. Why did you pick the specific chart?

* This chart was chosen to compare the number of actors and directors involved in Amazon Prime content.

##### 2. What is/are the insight(s) found from the chart?

* The bar labels give the number of credit records for each role, showing how actor and director entries compare in scale across Amazon Prime content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this insight highlights the scale of talent involvement in content production.
It helps in understanding casting requirements and talent management for future productions.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

high_rated_titles = titles[titles['imdb_score'] >= 7][['id', 'imdb_score']]
actor_titles = credits[credits['role'] == 'ACTOR']
merged_actor = actor_titles.merge(high_rated_titles, on='id')
top_actors = (
    merged_actor['name']
    .value_counts()
    .head(10)
    .reset_index()
)

top_actors.columns = ['Actor', 'High-Rated Title Count']

plt.figure(figsize=(15,5))
sns.barplot(data=top_actors, x='Actor', y='High-Rated Title Count', palette='viridis')

plt.title('Top 10 Actors Appearing in High-Rated Amazon Prime Titles')
plt.xlabel('Actor')
plt.ylabel('Number of High-Rated Titles')
plt.xticks(rotation=45)
plt.show()




##### 1. Why did you pick the specific chart?

* This chart was chosen to identify actors who frequently appear in high-rated Amazon Prime content by combining title ratings and cast information.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that a small group of actors appear repeatedly in highly rated titles.
This suggests that these actors are often associated with well-received content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this insight can help content platforms identify reliable talent.
Collaborating with actors who are consistently part of successful content may improve content performance.
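Counting by `name` can overcount in two ways: different people may share a name, and one person may be credited more than once in the same title. A sketch of a stricter count on made-up data, grouping by `person_id` and counting distinct titles (the toy names and ids are invented):

```python
import pandas as pd

# Made-up titles and credits for illustration
titles_toy = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "imdb_score": [8.1, 7.4, 5.0],
})
credits_toy = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 1],
    "id": ["t1", "t2", "t1", "t3", "t1"],
    "name": ["Alex Doe", "Alex Doe", "Sam Roe", "Sam Roe", "Alex Doe"],
    "role": ["ACTOR"] * 5,
})

high_rated = titles_toy.loc[titles_toy["imdb_score"] >= 7, ["id"]]
actors = credits_toy[credits_toy["role"] == "ACTOR"]
merged = actors.merge(high_rated, on="id")

# Name-based count: repeated credits in one title are counted again
by_name = merged["name"].value_counts()

# Distinct high-rated titles per person
by_person = (
    merged.groupby(["person_id", "name"])["id"]
    .nunique()
    .sort_values(ascending=False)
)
print(by_person)
```

Here "Alex Doe" has three credit rows but appears in only two distinct high-rated titles.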

#### Chart - 14

In [None]:
# Chart - 14 visualization code

high_rated_titles = titles[titles['imdb_score'] >= 7][['id', 'imdb_score']]
director_titles = credits[credits['role'] == 'DIRECTOR']

merged_director = director_titles.merge(high_rated_titles, on='id')
top_directors = (
    merged_director['name']
    .value_counts()
    .head(10)
    .reset_index()
)

top_directors.columns = ['Director', 'High-Rated Title Count']

plt.figure(figsize=(15,5))
sns.barplot(
    data=top_directors,
    x='Director',
    y='High-Rated Title Count',
    palette='viridis'
)

plt.title('Top 10 Directors with High-Rated Amazon Prime Titles')
plt.xlabel('Director')
plt.ylabel('Number of High-Rated Titles')
plt.xticks(rotation=45)
plt.show()



##### 1. Why did you pick the specific chart?

* This chart was chosen to identify directors who frequently work on high-rated Amazon Prime titles by combining rating data and director information.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that a small number of directors are repeatedly associated with highly rated content.
This suggests consistency in quality for certain directors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, this insight helps identify reliable directors who consistently deliver well-received content.
Collaborating with such directors can improve content quality and audience satisfaction.

#### Chart - 15

In [None]:
# Chart - 15 visualization code

# Filter highly rated and popular titles
best_titles = titles[(titles['imdb_score'] >= 8) & (titles['imdb_votes'] >= 50000)]

# Plot count by type
plt.figure(figsize=(8,5))
sns.countplot(data=best_titles, x='type', palette='coolwarm')

plt.title('Best Amazon Prime Titles by Content Type (IMDb ≥ 8 & Votes ≥ 50k)')
plt.xlabel('Content Type')
plt.ylabel('Number of Titles')

# Add value labels on top of bars
ax = plt.gca()
for p in ax.patches:
    ax.annotate(int(p.get_height()), (p.get_x() + p.get_width()/2., p.get_height()),
                ha='center', va='bottom')

plt.show()


##### 1. Why did you pick the specific chart?

* To identify the best-rated and highly popular content on Amazon Prime.

* Focuses on quality content that viewers actually love.

##### 2. What is/are the insight(s) found from the chart?

* We can see how many Movies vs TV Shows meet this high rating and popularity.

* Helps recognize top-performing content in terms of both quality and audience size.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive: Amazon can promote these top titles more, feature them in recommendations, and invest in similar content.

* Negative: If only a few titles meet these criteria, it shows limited high-quality content, suggesting a gap in content investment.

#### Chart - 16

In [None]:
# Filter top titles
best_titles = titles[(titles['imdb_score'] >= 8) & (titles['imdb_votes'] >= 50000)]

# Take top 15 by votes
top_titles = best_titles.sort_values(by='imdb_votes', ascending=False).head(15)

plt.figure(figsize=(14,7))
sns.scatterplot(
    data=top_titles,
    x='imdb_score',
    y='imdb_votes',
    size='imdb_votes',   # bubble size proportional to votes
    hue='type',          # color by Movie/Show
    palette='coolwarm',
    sizes=(100, 1000),
    alpha=0.8,
    legend='full'
)

# Annotate titles
for i, row in top_titles.iterrows():
    plt.text(row['imdb_score']+0.02, row['imdb_votes'], row['title'], fontsize=9)

plt.xlim(7.5, 10)
plt.yscale('log')  # log scale for votes for better readability
plt.xlabel('IMDb Score')
plt.ylabel('Number of Votes (log scale)')
plt.title('Top-Rated & Most Voted Amazon Prime Titles')
plt.show()


##### 1. Why did you pick the specific chart?

* To plot IMDb rating against the number of votes for the top Movies and TV Shows.

* Bubble size encodes the number of IMDb votes, and color encodes the content type.

##### 2. What is/are the insight(s) found from the chart?

* Titles toward the top-right are both highly rated and heavily voted.

* The largest bubbles mark the most-voted titles.

* The colors show which content type is better represented among the top titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive: Promoting these top titles can attract viewers.

* Negative: If only a few titles reach this level, the platform may need more high-quality content.

#### Chart - 17

In [None]:
# Count appearances per person
person_counts = credits['name'].value_counts().head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=person_counts.values, y=person_counts.index, palette='magma')
plt.title('Top 10 Most Frequent Actors/Directors on Amazon Prime')
plt.xlabel('Number of Titles')
plt.ylabel('Actor/Director')
plt.show()


##### 1. Why did you pick the specific chart?

* To see which actors or directors appear most on Amazon Prime.

* Helps identify key contributors to the content library.

##### 2. What is/are the insight(s) found from the chart?

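* The chart ranks the ten names with the most credits in the dataset.

* It shows who contributes most often to Amazon Prime content.

* Frequent appearance alone does not prove viewer interest, so these names are best read alongside ratings and votes.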

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive: Amazon can promote content with popular actors/directors.

* Negative: If a few actors or directors dominate, the content library may lack variety.
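
One caveat with `value_counts()` on `name` is that it counts credit rows, so a person credited in two roles on one title may be counted twice, and two different people who share a name are merged. The sketch below shows one possible way to count distinct titles per person instead. It uses made-up rows, and the column names `person_id`, `id` (title identifier), and `role` are assumptions about the credits table.

```python
import pandas as pd

# Made-up credits for illustration only; column names are assumptions.
toy_credits = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 3],
    "id": ["t1", "t1", "t2", "t1", "t3", "t2"],
    "name": ["Ann Lee", "Ann Lee", "Ann Lee", "Bo Chen", "Bo Chen", "Ann Lee"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR", "ACTOR", "ACTOR"],
})

# Raw row counts by name: merges namesakes and counts a title twice
# when the same person holds two roles on it.
rows_per_name = toy_credits["name"].value_counts()

# Distinct titles per person: drop repeated (person, title) pairs first,
# then count the remaining rows per person.
unique_pairs = toy_credits.drop_duplicates(subset=["person_id", "id"])
titles_per_person = (
    unique_pairs.groupby(["person_id", "name"])
    .size()
    .sort_values(ascending=False)
)

print(rows_per_name)
print(titles_per_person)
```

In this toy data the two people named "Ann Lee" are kept apart by `person_id`, which a count on `name` alone cannot do.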

#### Chart - 18 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,6))

numeric_cols = [
    'release_year',
    'runtime',
    'seasons',
    'imdb_score',
    'imdb_votes',
    'tmdb_popularity',
    'tmdb_score'
]

corr = titles[numeric_cols].corr()

sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

*  A correlation heatmap was chosen to understand the relationship between multiple numerical variables such as ratings, votes, popularity, runtime, and seasons at the same time.

##### 2. What is/are the insight(s) found from the chart?

* The heatmap shows a moderate positive correlation between IMDb score and TMDB score, which means titles rated well on IMDb are usually rated well on TMDB.

* IMDb votes and TMDB popularity also show a positive relationship, indicating that popular titles receive more audience engagement.

* Other variables such as runtime and number of seasons show weak or no correlation with ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes. The agreement between IMDb and TMDB scores, and between IMDb votes and TMDB popularity, suggests that ratings and audience engagement are consistent across the two platforms.

* The weak correlation of runtime and seasons with ratings suggests that content length alone does not predict how well a title is rated, which can guide content strategy decisions.
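
`DataFrame.corr()` computes Pearson correlation by default, which measures linear association only. A "weak" value can therefore hide a relationship that is monotonic but not linear. As a hedged cross-check, the sketch below compares Pearson with Spearman (rank) correlation on synthetic data generated for this example; it is not drawn from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; not drawn from the real dataset.
rng = np.random.default_rng(0)
score = rng.uniform(4, 9, size=200)

# Votes grow exponentially with score: monotonic but strongly non-linear.
votes = np.exp(score) * rng.uniform(0.8, 1.2, size=200)
toy = pd.DataFrame({"imdb_score": score, "imdb_votes": votes})

# Pearson captures linear association; Spearman works on ranks.
pearson = toy.corr(method="pearson").loc["imdb_score", "imdb_votes"]
spearman = toy.corr(method="spearman").loc["imdb_score", "imdb_votes"]
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

On the notebook's data, the equivalent call would be `titles[numeric_cols].corr(method='spearman')`, which can be passed to `sns.heatmap` in the same way as the Pearson matrix.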


#### Chart - 19 - Pair Plot

In [None]:
# Pair Plot visualization code

# Select numeric columns
numeric_cols = ['imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity']

# Pair plot
sns.pairplot(titles[numeric_cols], diag_kind='kde', plot_kws={'alpha':0.5, 's':30})
plt.suptitle('Pair Plot of Numeric Columns', y=1.02)  # place the title just above the grid
plt.show()


##### 1. Why did you pick the specific chart?

* To see relationships between numerical columns like IMDb score, votes, TMDB score, and popularity.

* Helps find patterns or correlations.

##### 2. What is/are the insight(s) found from the chart?

* IMDb score and TMDB score have a moderate positive correlation.

* Titles with many votes often have high popularity as well.

* The scatter panels make clusters of high-rated and low-rated content visible.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive: The plot helps Amazon understand how ratings and popularity are linked.

* This understanding can guide content promotion and investment decisions.
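
Vote counts and popularity scores are often heavily right-skewed, which can squeeze most points into one corner of each pair-plot panel. One common remedy is to log-transform those columns before plotting. The sketch below is a minimal illustration on made-up rows; whether the real columns are skewed enough to need this is an assumption worth checking with a histogram first.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not values from the real dataset.
toy = pd.DataFrame({
    "imdb_score": [6.1, 7.4, 8.2, 5.5],
    "imdb_votes": [120, 4500, 980000, 0],
    "tmdb_popularity": [1.2, 8.7, 140.0, 0.6],
})

# log1p(x) = log(1 + x) compresses the long right tail and stays defined at 0.
plot_df = toy.copy()
for col in ["imdb_votes", "tmdb_popularity"]:
    plot_df[f"log_{col}"] = np.log1p(plot_df[col])
plot_df = plot_df.drop(columns=["imdb_votes", "tmdb_popularity"])

# plot_df could then be passed to sns.pairplot in place of the raw columns.
print(plot_df.round(2))
```

The transformed frame keeps the score column unchanged, so the score panels stay directly comparable with the original pair plot.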

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain briefly.

* Promote top-rated and highly-voted content to attract more viewers.

* Invest in popular genres and active actors/directors to increase engagement.

* Identify gaps in content types or ratings and produce content to fill them.

* Use insights from ratings, votes, and trends over time to make data-driven content decisions.

# **Conclusion**

* Amazon Prime has a large and diverse content library, with both Movies and TV Shows.

* Some genres and actors are more popular, and a few titles drive most of the engagement.

* IMDb and TMDB scores are moderately correlated, and titles with more votes tend to be more popular, which helps identify high-value content.

* The analysis provides actionable insights for content promotion, production strategy, and subscriber growth.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***