# **Project Name**    - Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual/Team Madrihemadevi


#github link


# **Project Summary -**

# **Problem Statement**


This project analyzes the shows and movies available on Amazon Prime Video in the U.S. using a structured dataset. Key insights to extract include:

*   **Content Diversity:** Understanding dominant genres and categories.
*   **Regional Availability:** Identifying content distribution patterns.
*   **Trends Over Time:** Evaluating how the library has evolved over the years.
*   **IMDb Ratings & Popularity:** Finding the highest-rated and most popular titles.

#### **Define Your Business Objective?**

In the highly competitive streaming industry, platforms like Amazon Prime Video are expanding their content libraries to engage diverse audiences. Data-driven insights help in understanding content trends, audience preferences, and platform growth strategies.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First Look
titles_df = pd.read_csv("/content/titles.csv")
credits_df = pd.read_csv("/content/credits.csv")
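
Since the guidelines ask for exception handling, the two `read_csv` calls could be wrapped in a small loader. The sketch below is one possible shape and is not part of the original notebook: `load_csv` is a hypothetical helper, and the temporary file only stands in for `/content/titles.csv` so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, failing with a clear message if it is missing or empty."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    try:
        return pd.read_csv(csv_path)
    except pd.errors.EmptyDataError as exc:
        raise ValueError(f"Dataset is empty: {csv_path}") from exc


# Demo with a throwaway file (made-up rows, standing in for titles.csv)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "titles.csv"
    demo_path.write_text("id,title\nt1,Show One\nt2,Movie Two\n")
    demo_df = load_csv(demo_path)

print(demo_df.shape)
```

In the notebook, the same helper would be called as `titles_df = load_csv("/content/titles.csv")`, so a wrong path stops the run with a readable message instead of a long traceback.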

In [None]:
titles_df.head()


In [None]:
credits_df.head()

### Dataset Rows & Columns count

In [None]:
titles_df.shape

In [None]:
credits_df.shape

### Dataset Information

In [None]:
# Dataset
titles_df.info()

In [None]:
credits_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
titles_df.duplicated().sum()


In [None]:
credits_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
titles_df.isnull().sum()

In [None]:
credits_df.isnull().sum()

### What did you know about your dataset?

1. **Dataset Overview**
    *   The dataset consists of two files:
        *   titles.csv (contains 9K+ unique titles with metadata)
        *   credits.csv (contains 124K+ actor & director records)
    *   The dataset focuses on Amazon Prime Video content available in the U.S.
2. **Column Insights**
    *   titles.csv has 15 columns, including title, genre, runtime, IMDb score, and TMDB popularity.
    *   credits.csv has 5 columns, including actor/director names and roles.
3. **Missing Values Analysis**
    *   Some columns contain significant missing values, including:
        *   description (~1,200 missing values)
        *   age_certification (~2,000 missing values)
        *   imdb_score & imdb_votes (~500 missing values)
        *   seasons (~5,000 missing values, since movies don't have seasons)
    *   credits.csv has missing values in character_name (~500 missing).
4. **Duplicate Data**
    *   We checked for duplicates and found some duplicate rows, which can be removed to clean the data.
5. **Data Types**
    *   Most columns have appropriate data types, but some may require conversions:
        *   release_year should be integer
        *   imdb_score & tmdb_score should be float
        *   genres & production_countries are lists stored as strings, which need further processing.
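
The missing-value counts above are easier to compare as percentages of the table. The following is a minimal sketch of such a summary; `missing_summary` is a hypothetical helper and the toy frame uses made-up values in place of `titles_df`.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for titles_df (values are made up for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "seasons": [None, 2.0, None, None],
    "imdb_score": [7.1, None, 6.5, 8.0],
})
print(missing_summary(toy))
```

Calling `missing_summary(titles_df)` and `missing_summary(credits_df)` in the notebook would give the exact figures behind the approximate counts listed above.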


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
titles_df.columns.tolist()

In [None]:
credits_df.columns.tolist()

In [None]:
# Dataset Describe
titles_df.describe()

In [None]:
credits_df.describe()

### Variables Description

The titles dataset (titles.csv) contains 15 columns that provide detailed information about movies and TV shows available on Amazon Prime Video. Each title has a unique identifier (id), a name (title), and a classification as either a movie or TV show (type, with values such as "SHOW"). Additional metadata includes a brief description (description), the year of release (release_year), and age certification (age_certification) to indicate suitability for different audiences. The runtime (runtime) specifies the duration of a movie or the length of an episode for TV shows. Titles are categorized into different genres (genres), and their production countries (production_countries) indicate where they were produced. For TV shows, the number of seasons (seasons) is also recorded. The dataset includes IMDb-related information, such as the IMDb ID (imdb_id), IMDb rating (imdb_score), and the number of votes (imdb_votes) received. Additionally, it provides popularity and ratings from The Movie Database (TMDB) through tmdb_popularity and tmdb_score.

The credits dataset (credits.csv) contains 5 columns related to the cast and crew of Amazon Prime Video titles. Each individual has a unique identifier (person_id), and their name (name) is recorded alongside the title ID (id), which links them to the titles.csv dataset. For actors, their character name (character_name) is included, while the role (role) column specifies whether the person is an "ACTOR" or "DIRECTOR". This dataset allows for an in-depth analysis of the most featured actors and directors on the platform.
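
Because `id` links the two files, cast and crew can be joined to title metadata with a merge. The sketch below uses made-up rows rather than the real files; `indicator=True` is one way to spot credits whose title is missing from the titles table.

```python
import pandas as pd

# Made-up rows standing in for titles_df and credits_df
titles = pd.DataFrame({
    "id": ["t1", "t2"],
    "title": ["Show One", "Movie Two"],
})
credits = pd.DataFrame({
    "person_id": [10, 11, 12],
    "id": ["t1", "t1", "t3"],
    "name": ["Actor A", "Director B", "Actor C"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# Many credits map to one title; validate raises if title ids are not unique
merged = credits.merge(
    titles, on="id", how="left", validate="many_to_one", indicator=True
)

# Credits that reference a title id absent from the titles table
orphans = merged[merged["_merge"] == "left_only"]
print(orphans[["person_id", "id", "name"]])
```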

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
titles_df.nunique()

In [None]:
credits_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

**handling Missing Values**

In [None]:
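# Write your code to make your dataset analysis ready.
# Fill missing descriptions with "No description available"
# Assign the result back instead of using inplace=True on a selected column:
# a chained inplace fillna may not update titles_df in newer pandas versions.
titles_df["description"] = titles_df["description"].fillna("No description available")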


In [None]:
# Fill missing age certification with "Unknown"
titles_df["age_certification"] = titles_df["age_certification"].fillna("Unknown")

In [None]:
# Replace missing IMDb & TMDB scores with median values
titles_df["imdb_score"] = titles_df["imdb_score"].fillna(titles_df["imdb_score"].median())
titles_df["tmdb_score"] = titles_df["tmdb_score"].fillna(titles_df["tmdb_score"].median())


In [None]:
# Replace missing IMDb votes & TMDB popularity with 0 (assume low popularity)
titles_df["imdb_votes"] = titles_df["imdb_votes"].fillna(0)
titles_df["tmdb_popularity"] = titles_df["tmdb_popularity"].fillna(0)


In [None]:
# Fill missing values in 'seasons' with 0 (assuming it's a movie)
titles_df["seasons"] = titles_df["seasons"].fillna(0)

In [None]:
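# Drop credits rows without a person name (actor/director names are essential).
# A blanket dropna() would also remove rows that only lack a character name,
# which can be the case for directors, and would leave the director chart empty.
credits_df.dropna(subset=["name"], inplace=True)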

**remove duplicates**

In [None]:
# Remove duplicate titles based on 'id' (credits are de-duplicated per title and person below)
titles_df.drop_duplicates(subset="id", keep="first", inplace=True)

In [None]:
credits_df.drop_duplicates(subset=["id", "person_id"], keep="first", inplace=True)

**convert data types**

In [None]:
# Convert 'release_year' and 'seasons' to integers
titles_df["release_year"] = titles_df["release_year"].astype(int)

In [None]:
titles_df["seasons"] = titles_df["seasons"].astype(int)

In [None]:
# Convert 'imdb_score', 'tmdb_score', and 'tmdb_popularity' to float
titles_df["imdb_score"] = titles_df["imdb_score"].astype(float)
titles_df["tmdb_score"] = titles_df["tmdb_score"].astype(float)
titles_df["tmdb_popularity"] = titles_df["tmdb_popularity"].astype(float)

**verifying clean data**

In [None]:
# Display cleaned dataset info
print("Titles Dataset Info:\n")
titles_df.info()


In [None]:
print("\nCredits Dataset Info:\n")
credits_df.info()

In [None]:
# Check the first few rows
titles_df.head()

In [None]:
credits_df.head()

### What all manipulations have you done and insights you found?

During data wrangling, several cleaning and transformation steps prepared the dataset for analysis. Missing values were handled by filling description with "No description available", replacing missing age_certification with "Unknown", imputing imdb_score and tmdb_score with their median values, and setting missing imdb_votes and tmdb_popularity to zero. Because seasons is missing for movies, it was set to 0 for consistency. Duplicates were removed from both datasets, so each title appears once and each person appears once per title. release_year and seasons were cast to integers, and imdb_score, tmdb_score, and tmdb_popularity to floats. The list-like columns (genres and production_countries) are still stored as strings after this step; they are split into individual values later, inside the chart cells that need them.
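
A sturdier way to turn these stringified lists into real Python lists could be `ast.literal_eval`, which parses the text instead of stripping brackets and quotes by hand. This is only a sketch under the assumption that the column holds values shaped like `"['drama', 'comedy']"`; `parse_list_column` is a hypothetical helper and the rows are made up.

```python
import ast

import pandas as pd


def parse_list_column(value):
    """Convert "['drama', 'comedy']" to ['drama', 'comedy']; return [] for bad input."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []  # covers NaN and other non-string values
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


# Made-up rows standing in for titles_df["genres"]
toy = pd.DataFrame({"genres": ["['drama', 'comedy']", "[]", None, "['action']"]})
toy["genres"] = toy["genres"].apply(parse_list_column)

# One row per (title, genre); empty lists become NaN and are dropped
genre_counts = toy.explode("genres")["genres"].dropna().value_counts()
print(genre_counts)
```

Unlike the manual string splitting, this approach never produces an empty-string genre for titles whose list is `[]`.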

Insights from the Cleaned Dataset:

*   **Content Diversity:** The platform offers a wide range of genres, with Drama, Comedy, and Action being dominant.
*   **Trends Over Time:** The number of titles has steadily increased over the years, indicating Amazon Prime's growing content library.
*   **IMDb Ratings & Popularity:** While most titles have average IMDb ratings, a few highly rated shows contribute significantly to overall engagement.
*   **Production Countries:** The majority of titles are from the USA, but there is a notable presence of Indian, UK, and Canadian productions.
*   **Actors & Directors:** Certain actors and directors appear frequently, suggesting collaborations or popular names driving engagement.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

Number of movies/shows released per year

In [None]:

# Chart 1: Number of movies/shows released per year
plt.figure(figsize=(12, 6))
# discrete=True draws one bar per release year, matching the "per year" title
sns.histplot(titles_df["release_year"], discrete=True, kde=True, color="skyblue")
plt.title("Number of Movies/Shows Released Per Year", fontsize=14)
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.show()


##### 2. What is/are the insight(s) found from the chart?

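This chart shows the number of movies and shows in the catalog by release year. Counts rise sharply in recent decades, so the library is weighted toward newer releases, which is consistent with the boom in digital streaming platforms. Note that release_year is the year a title was released, not the year it was added to Prime Video.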
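
The same trend can be checked numerically by grouping release years into decades. A minimal sketch on made-up years (in the notebook the input would be `titles_df["release_year"]`):

```python
import pandas as pd

# Made-up release years standing in for titles_df["release_year"]
toy = pd.DataFrame({"release_year": [1995, 1999, 2004, 2015, 2018, 2019, 2021]})

# Floor each year to the start of its decade, e.g. 2018 -> 2010
decade = (toy["release_year"] // 10) * 10

# Number of titles per decade, in chronological order
per_decade = decade.value_counts().sort_index()
print(per_decade)
```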



#### Chart - 2 Distribution of movie/show runtimes

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12, 6))
sns.histplot(titles_df["runtime"], bins=50, kde=True, color="purple")
plt.title("Distribution of Movie/Show Runtimes", fontsize=14)
plt.xlabel("Runtime (minutes)")
plt.ylabel("Count")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

This histogram shows the distribution of movie/show runtimes. Most content falls between 80-120 minutes, which is typical for movies, while TV shows often have shorter or varied lengths.

#### Chart - 3 IMDB Score Distribution

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12, 6))
sns.histplot(titles_df["imdb_score"].dropna(), bins=30, kde=True, color="green")
plt.title("IMDB Score Distribution", fontsize=14)
plt.xlabel("IMDB Score")
plt.ylabel("Count")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

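This chart shows that IMDb scores are mostly concentrated between 6 and 8, with few titles rated very high or very low. The peak around 7 suggests that most movies and shows receive moderate ratings, although the peak may partly reflect the missing scores that were filled with the median during data wrangling.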

#### Chart - 4 Most common genres in the dataset

In [None]:
# Chart - 4 visualization code
from collections import Counter

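# Extract and count genres (the column stores lists as strings)
all_genres = titles_df["genres"].dropna().str.strip("[]").str.replace("'", "").str.split(", ")
# Skip the empty string produced when a title has an empty genre list ("[]")
genre_counts = Counter(genre for sublist in all_genres for genre in sublist if genre)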

# Convert to DataFrame for plotting
genre_df = pd.DataFrame(genre_counts.items(), columns=["Genre", "Count"]).sort_values(by="Count", ascending=False)

# Chart 4: Most Common Genres
plt.figure(figsize=(12, 6))
sns.barplot(x="Count", y="Genre", data=genre_df, palette="coolwarm")
plt.title("Most Common Genres", fontsize=14)
plt.xlabel("Count")
plt.ylabel("Genre")
plt.show()


##### 2. What is/are the insight(s) found from the chart?

This bar chart shows the most common genres. Popular genres include drama, comedy, action, and thriller, reflecting audience preferences for engaging and high-energy content.

#### Chart - 5 Top Production Countries

In [None]:
# Chart - 5 visualization code
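# Extract and count production countries (the column stores lists as strings)
all_countries = titles_df["production_countries"].dropna().str.strip("[]").str.replace("'", "").str.split(", ")
# Skip the empty string produced when a title has an empty country list ("[]")
country_counts = Counter(country for sublist in all_countries for country in sublist if country)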

# Convert to DataFrame for plotting
country_df = pd.DataFrame(country_counts.items(), columns=["Country", "Count"]).sort_values(by="Count", ascending=False).head(15)

# Chart 5: Top Production Countries
plt.figure(figsize=(12, 6))
sns.barplot(x="Count", y="Country", data=country_df, palette="magma")
plt.title("Top Production Countries", fontsize=14)
plt.xlabel("Count")
plt.ylabel("Country")
plt.show()


##### 2. What is/are the insight(s) found from the chart?

This chart highlights the top production countries, with the US dominating, followed by other major film-producing countries. The presence of diverse countries suggests global content distribution.

#### Chart - 6 Age Certification Distribution




In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12, 6))
sns.countplot(y=titles_df["age_certification"].dropna(), palette="viridis", order=titles_df["age_certification"].value_counts().index)
plt.title("Age Certification Distribution", fontsize=14)
plt.xlabel("Count")
plt.ylabel("Age Certification")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

This chart shows the distribution of age certifications, with categories like TV-MA, PG-13, and R being the most common. This suggests a significant amount of content is targeted toward mature audiences.

#### Chart - 7 Correlation Between IMDB and TMDB Scores

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(8, 6))
sns.scatterplot(x=titles_df["imdb_score"], y=titles_df["tmdb_score"], alpha=0.5, color="blue")
plt.title("IMDB Score vs. TMDB Score", fontsize=14)
plt.xlabel("IMDB Score")
plt.ylabel("TMDB Score")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

This scatter plot shows a strong correlation between IMDb and TMDB scores, meaning movies that score well on one platform generally do well on the other.
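
The strength of this relationship can be quantified with a correlation coefficient. The sketch below uses made-up scores and keeps only rows where both ratings are present, since rows filled with the median would pull the estimate toward zero. Spearman correlation is computed here as Pearson correlation on ranks.

```python
import pandas as pd

# Made-up scores standing in for titles_df[["imdb_score", "tmdb_score"]]
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, None],
    "tmdb_score": [5.5, 6.1, 6.9, 8.2, 7.0],
})

# Use only titles that have both ratings
pair = toy[["imdb_score", "tmdb_score"]].dropna()

# Pearson measures linear association
pearson = pair["imdb_score"].corr(pair["tmdb_score"])

# Spearman measures monotonic association: Pearson on the ranked values
ranks = pair.rank()
spearman = ranks["imdb_score"].corr(ranks["tmdb_score"])

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

To apply this in the notebook, the correlation would ideally be computed before the median imputation step, or on a copy of the raw data.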

#### Chart - 8 Popularity vs. IMDB Score

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(8, 6))
sns.scatterplot(x=titles_df["imdb_score"], y=titles_df["tmdb_popularity"], alpha=0.5, color="red")
plt.title("IMDB Score vs. Popularity", fontsize=14)
plt.xlabel("IMDB Score")
plt.ylabel("TMDB Popularity")
plt.yscale("log")  # Log scale for better visualization
plt.show()

##### 2. What is/are the insight(s) found from the chart?

This scatter plot suggests that high ratings do not always guarantee high popularity—some lower-rated movies still attract significant attention. Popularity can be driven by factors like marketing, star power, and social media trends.

#### Chart - 9 Most Frequent Directors

In [None]:
# Chart - 9 visualization code


top_directors = credits_df[credits_df["role"] == "DIRECTOR"]["name"].value_counts().head(15)

# Chart 9: Most Frequent Directors
plt.figure(figsize=(12, 6))
sns.barplot(x=top_directors.values, y=top_directors.index, palette="Blues_r")
plt.title("Top 15 Most Frequent Directors", fontsize=14)
plt.xlabel("Number of Titles Directed")
plt.ylabel("Director")
plt.show()

#### Chart - 10 Actor Appearances - Who has been in the most titles?

In [None]:
# Chart - 10 visualization code
# Chart 10: Actor Appearances - Who has been in the most titles?
top_actors = credits_df[credits_df["role"] == "ACTOR"]["name"].value_counts().head(15)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_actors.values, y=top_actors.index, palette="coolwarm")
plt.title("Top 15 Actors with Most Appearances", fontsize=14)
plt.xlabel("Number of Titles Appeared In")
plt.ylabel("Actor")
plt.show()


##### 2. What is/are the insight(s) found from the chart?

This chart highlights the 15 actors with the most appearances. The counts cover only titles in this catalog, so they show how often an actor appears on Prime Video rather than the size of their overall career.

#### Chart - 11 Genre-Based Rating Trends

In [None]:
# Chart - 11 visualization code
# Chart 11: Genre-Based Rating Trends
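# Exploding genres into separate rows
# .copy() avoids modifying a slice of titles_df (SettingWithCopyWarning)
genre_ratings = titles_df.dropna(subset=["genres", "imdb_score"]).copy()
genre_ratings["genres"] = genre_ratings["genres"].str.strip("[]").str.replace("'", "").str.split(", ")
genre_ratings = genre_ratings.explode("genres")
# Drop the empty genre produced by titles whose genre list is "[]"
genre_ratings = genre_ratings[genre_ratings["genres"] != ""]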

# Compute average IMDb score per genre
avg_genre_ratings = genre_ratings.groupby("genres")["imdb_score"].mean().sort_values(ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x=avg_genre_ratings.values, y=avg_genre_ratings.index, palette="viridis")
plt.title("Average IMDb Ratings by Genre", fontsize=14)
plt.xlabel("Average IMDb Score")
plt.ylabel("Genre")
plt.show()


##### 2. What is/are the insight(s) found from the chart?

This chart shows which genres tend to receive higher IMDb ratings. Some genres, like documentary and historical films, often receive higher scores, possibly due to their niche audiences and factual storytelling.
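
An average can look high simply because a genre has very few titles, so it may help to report the title count next to the mean. A minimal sketch with made-up rows; the `min_titles` threshold is an arbitrary assumption, not a value from the notebook.

```python
import pandas as pd

# Made-up rows standing in for titles with already-parsed genre lists
toy = pd.DataFrame({
    "genres": [["drama"], ["drama", "comedy"], ["documentary"], ["comedy"], ["drama"]],
    "imdb_score": [7.0, 6.0, 9.0, 5.0, 8.0],
})

# One row per (title, genre), then mean score and number of titles per genre
stats = (
    toy.explode("genres")
    .groupby("genres")["imdb_score"]
    .agg(["mean", "count"])
)

# Hypothetical cut-off: ignore genres backed by fewer than this many titles
min_titles = 2
reliable = stats[stats["count"] >= min_titles].sort_values("mean", ascending=False)
print(reliable)
```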

#### Chart - 12. Seasons Distribution in Shows

In [None]:
# Chart 12: Seasons Distribution in Shows
# Filtering only TV Shows
tv_shows = titles_df[titles_df["type"] == "SHOW"].dropna(subset=["seasons"])

plt.figure(figsize=(12, 6))
sns.histplot(tv_shows["seasons"], bins=20, kde=True, color="purple")
plt.title("Distribution of TV Show Seasons", fontsize=14)
plt.xlabel("Number of Seasons")
plt.ylabel("Count")
plt.show()


##### 2. What is/are the insight(s) found from the chart?

This histogram reveals that most TV shows have only 1-3 seasons, with fewer long-running series. This suggests that many shows don’t get extended for multiple seasons, possibly due to audience reception, production costs, or changing content strategies.
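
The "1-3 seasons" observation can be expressed as a single share. A small sketch on made-up rows, using the same `type == "SHOW"` filter as the chart code:

```python
import pandas as pd

# Made-up rows standing in for titles_df
toy = pd.DataFrame({
    "type": ["SHOW", "SHOW", "MOVIE", "SHOW", "SHOW"],
    "seasons": [1, 2, 0, 5, 3],
})

# Keep TV shows only; movies carry seasons == 0 after the wrangling step
shows = toy[toy["type"] == "SHOW"]

# Fraction of shows with at most three seasons (mean of a boolean column)
share_short = (shows["seasons"] <= 3).mean()
print(f"{share_short:.0%} of shows have 3 seasons or fewer")
```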

#### Chart - 13. IMDb Votes vs. Ratings

In [None]:
# Chart - 13 visualization code
# Chart 13: IMDb Votes vs. Ratings
plt.figure(figsize=(8, 6))
sns.scatterplot(x=titles_df["imdb_score"], y=titles_df["imdb_votes"], alpha=0.5, color="orange")
plt.title("IMDb Votes vs. Ratings", fontsize=14)
plt.xlabel("IMDb Score")
plt.ylabel("IMDb Votes")
plt.yscale("log")  # Log scale for better visibility
plt.show()


##### 2. What is/are the insight(s) found from the chart?

This scatter plot shows that higher-rated movies don't always have more votes. Some low-rated movies still gather a large number of votes, possibly due to controversy or mainstream popularity.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_cols = titles_df.select_dtypes(include=[np.number])
corr_matrix = numeric_cols.corr()

# Set up the heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Numeric Features", fontsize=14)
plt.show()

##### 2. What is/are the insight(s) found from the chart?

*   Strong positive correlation between IMDb and TMDB scores suggests consistency in user ratings across platforms.
*   Popularity and vote count correlation indicates that more popular movies generally receive more votes.
*   Weak correlation between runtime and ratings suggests that movie length doesn't significantly affect audience scores.
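
To read the strongest relationships off the matrix without scanning the heatmap, the pairs can be listed in order of absolute correlation. The sketch below runs on made-up numbers; in the notebook `corr` would be the `corr_matrix` computed above.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns standing in for the notebook's numeric features
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_score": [5.2, 6.1, 6.8, 8.1, 8.9],
    "runtime": [90, 45, 120, 60, 100],
})
corr = toy.corr()

# Keep the upper triangle only: each pair appears once, the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to (feature_a, feature_b) -> correlation, strongest first
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```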


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Selecting relevant numeric columns for visualization
numeric_features = ["imdb_score", "tmdb_score", "runtime", "tmdb_popularity", "imdb_votes"]

# Drop NaN values for better visualization
pairplot_data = titles_df[numeric_features].dropna()

# Generate Pair Plot
sns.pairplot(pairplot_data, diag_kind="kde", plot_kws={'alpha':0.5})
plt.suptitle("Pair Plot of Key Numeric Features", y=1.02, fontsize=14)
plt.show()


##### 2. What is/are the insight(s) found from the chart?

A pair plot (also called a scatterplot matrix) visualizes relationships between several numerical variables at once. It shows a scatter plot for every pair of features and, because diag_kind="kde" is used here, a kernel density estimate of each feature's distribution on the diagonal.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

 Business Recommendations for Amazon Prime Video

Based on our data analysis & visualization, here are key strategic recommendations to help Amazon Prime optimize content strategy, increase engagement, and enhance subscription growth:

🎯 1. Expand High-Performing Genres
🔹 Insight: Drama, Comedy, and Action dominate the platform.
✅ Recommendation: Invest in acquiring or producing more content in these high-engagement genres to retain subscribers.

🎯 2. Strengthen IMDb 7+ Content
🔹 Insight: Viewers prefer titles with higher IMDb ratings (7+).
✅ Recommendation: Prioritize licensing/acquiring highly-rated content to improve customer satisfaction and retention.

🎯 3. Optimize Regional Content Strategy
🔹 Insight: Majority of content is from the USA, India, and UK.
✅ Recommendation: Expand regional content offerings in emerging markets like Latin America & Asia to attract new subscribers.

🎯 4. Target Younger Audiences with Age-Certified Content
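🔹 Insight: PG-13 is among the most common certifications, but much of the catalog carries mature ratings (TV-MA, R), leaving room for more family-friendly content.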
✅ Recommendation: Develop a kid-friendly subscription tier & promote educational or animated series for younger audiences.

🎯 5. Increase Focus on TV Shows with Multiple Seasons
🔹 Insight: Most TV shows have 1-3 seasons only.
✅ Recommendation: Develop longer-running series or invest in hit franchises to encourage binge-watching & long-term retention.

🎯 6. Optimize Runtime for Maximum Engagement
🔹 Insight: Most titles have a runtime between 80 and 120 minutes.
✅ Recommendation: Focus on concise, engaging narratives to maintain viewer retention and avoid drop-off rates.

🎯 7. Strengthen Star Power & Actor-Based Marketing
🔹 Insight: Certain actors appear in multiple hit titles.
✅ Recommendation: Create personalized recommendations and targeted actor-based collections to enhance content discovery.

🎯 8. Boost TMDB Popularity with Better Promotion
🔹 Insight: Higher IMDb scores don’t always translate to high TMDB popularity.
✅ Recommendation: Improve marketing strategies for high-rated but under-watched content to drive more engagement.

🎯 9. Enhance Recommendation System
🔹 Insight: Viewers may not be discovering the best content.
✅ Recommendation: Leverage AI-driven recommendations based on watch history, ratings, and engagement metrics to improve content discoverability.

🎯 10. Increase Exclusive & Original Content
🔹 Insight: Competition from Netflix & Disney+ is growing.
✅ Recommendation: Invest in Amazon Originals to create exclusive, must-watch content that differentiates the platform.
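
Recommendation 8 depends on finding titles that are rated well but attract little attention. One way to shortlist them is sketched below on made-up rows; the score cut-off of 7.5 and the use of the median popularity as the threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows standing in for titles_df
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [8.4, 7.9, 5.1, 8.8],
    "tmdb_popularity": [2.0, 55.0, 80.0, 1.2],
})

# Hypothetical thresholds: "high-rated" and "below typical popularity"
score_cut = 7.5
popularity_cut = toy["tmdb_popularity"].median()

# High score combined with low popularity marks a promotion candidate
under_promoted = toy[
    (toy["imdb_score"] >= score_cut) & (toy["tmdb_popularity"] < popularity_cut)
].sort_values("imdb_score", ascending=False)

print(under_promoted[["title", "imdb_score", "tmdb_popularity"]])
```

When run on the real data, titles whose missing scores were filled with the median should be excluded first, since their ratings are not genuine.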



# **Conclusion**

Our in-depth analysis of Amazon Prime Video's content library has revealed several key insights that can guide strategic decision-making for content acquisition, audience engagement, and business growth.

🔹 Key Findings:
✅ Genre Trends: Drama, Comedy, and Action dominate, indicating high viewer interest.
✅ Content Growth: Significant expansion post-2010, reflecting increased investment in streaming.
✅ IMDb & TMDB Insights: High IMDb ratings don’t always translate to TMDB popularity, suggesting potential promotional gaps.
✅ Regional Influence: The USA, India, and the UK contribute the most content, but untapped markets exist.
✅ Audience Targeting: Mature certifications (TV-MA, R) are common alongside PG-13, leaving room for more kid-focused content.

🔹 Strategic Recommendations:
📌 Invest in high-rated genres and long-running TV shows to improve retention.
📌 Expand regional content to attract emerging market viewers.
📌 Enhance AI-driven recommendations to improve content discovery and personalization.
📌 Strengthen exclusive and original content to differentiate from competitors.
📌 Promote high-quality, under-watched titles to improve engagement.

💡 Final Thought:
Amazon Prime Video has a strong and diverse catalog, but strategic refinements in content selection, regional expansion, and marketing efforts can boost user engagement and subscription growth. By leveraging data-driven insights, the platform can stay ahead in the competitive streaming industry. 🚀🎬



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***