<a href="https://colab.research.google.com/github/Shivu7889/Amazon-Prime-Tv-Shows-and-Movie/blob/main/Amazon_Prime_Tv_Shows_and_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Amazon Prime TV Shows and Movies


##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**            - Shivam Patidar

# **Project Summary -**

In this project, we performed an exploratory data analysis (EDA) on Amazon Prime Video’s content dataset to uncover insights into the platform’s content diversity, popularity trends, and regional representation. The dataset consisted of two CSV files: titles.csv containing metadata of over 9,000 titles (movies and TV shows), and credits.csv containing cast and crew information for more than 120,000 roles.

Using Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn, we conducted data cleaning, transformation, and visual analysis to address several key business questions.

# **GitHub Link -**

https://github.com/Shivu7889/Amazon-Prime-TV-Shows-and-Movies

# **Problem Statement**


This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

* Content Diversity: What genres and categories dominate the platform?
* Regional Availability: How does content distribution vary across different regions?
* Trends Over Time: How has Amazon Prime’s content library evolved?
* IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?


 By analyzing this dataset, businesses, content creators, and data analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.

#### **Define Your Business Objective?**

To uncover data-driven insights from Amazon Prime Video’s catalog that support strategic decisions in content acquisition, user engagement, and platform growth.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ast

# Set visualization style
sns.set(style="whitegrid")
%matplotlib inline

from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
try:
    titles = pd.read_csv("/content/drive/MyDrive/MsProject/Amazon_Prime/titles.csv")
    credits = pd.read_csv("/content/drive/MyDrive/MsProject/Amazon_Prime/credits.csv")
except FileNotFoundError as e:
    print("Error: Dataset file not found.")
    raise e

### Dataset First View

In [None]:
# Dataset First Look - titles dataset
titles.head()

In [None]:
# Dataset First Look - credits dataset
credits.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count - titles
titles.shape


In [None]:
# Dataset Rows & Columns count - credits
credits.shape

### Dataset Information

In [None]:
# Dataset Info - credits
credits.info()


In [None]:
# Dataset Info - titles
titles.info()

#### Duplicate Values

In [None]:

# Dataset Duplicate Value Count - titles
titles.duplicated().sum()

In [None]:
# Dataset Duplicate Value Count - credits
credits.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count - titles
titles.isnull().sum()


In [None]:
# Missing Values/Null Values Count - credits
credits.isnull().sum()

In [None]:
# Visualizing the missing values - credits
plt.figure(figsize=(10, 6))
sns.heatmap(credits.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing Value Heatmap - Credits Dataset')
plt.show()

In [None]:
# Visualizing the missing values - titles
plt.figure(figsize=(10, 6))
sns.heatmap(titles.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing Value Heatmap - Titles Dataset')
plt.show()

### What did you know about your dataset?

My dataset consists of two CSV files: titles.csv and credits.csv. titles.csv has 9871 rows and 15 columns, and it contains duplicate rows and null values.

credits.csv has 124235 rows and 5 columns and it also contains duplicated values and null values.
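The overview above can be put into numbers with a small helper. The following is a minimal sketch, not the notebook's original code: `summarize` is a hypothetical helper name, and the toy frame stands in for `titles` with invented values.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and null percentages, worst first."""
    nulls = frame.isnull().sum()
    return (
        pd.DataFrame({
            "null_count": nulls,
            "null_pct": (nulls / len(frame) * 100).round(1),
        })
        .sort_values("null_count", ascending=False)
    )

# Toy stand-in for `titles`; the values are invented for illustration only.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "c"],
    "imdb_score": [7.1, None, 6.0, 6.0],
    "seasons": [None, None, 2.0, 2.0],
})

n_duplicates = int(toy.duplicated().sum())  # fully identical rows
summary = summarize(toy)

print("shape:", toy.shape)
print("duplicate rows:", n_duplicates)
print(summary)
```

Calling `summarize(titles)` and `summarize(credits)` in the notebook would show which columns drive the missing values seen in the heatmaps.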

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns - titles
titles.columns

In [None]:
# Dataset Columns - credits
credits.columns

In [None]:
# Dataset Describe - credits
credits.describe(include = 'all')

In [None]:
# Dataset Describe - titles
titles.describe(include = 'all')

### Variables Description

**titles dataset**
* **id:** The title ID on JustWatch.
* **title:** The name of the title.
* **type:** TV show (SHOW) or movie (MOVIE).
* **description:** A brief description.
* **release_year:** The release year.
* **age_certification:** The age certification.
* **runtime:** The length of the episode (SHOW) or movie.
* **genres:** A list of genres.
* **production_countries:** A list of countries that produced the title.
* **seasons:** Number of seasons if it's a SHOW.
* **imdb_id:** The title ID on IMDB.
* **imdb_score:** Score on IMDB.
* **imdb_votes:** Votes on IMDB.
* **tmdb_popularity:** Popularity on TMDB.
* **tmdb_score:** Score on TMDB.

**credits dataset**
* **person_ID:** The person ID on JustWatch.
* **id:** The title ID on JustWatch.
* **name:** The actor or director's name.
* **character:** The character name.
* **role:** ACTOR or DIRECTOR.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable. - titles
for i in titles.columns.tolist():
  print("No. of unique values in ",i,"is",titles[i].nunique(),".")

In [None]:
# Check Unique Values for each variable. - credits
for i in credits.columns.tolist():
  print("No. of unique values in ",i,"is",credits[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# ---- titles dataset ----

# Drop duplicate rows
titles = titles.drop_duplicates()

# Drop rows with missing title or type (critical fields)
titles = titles.dropna(subset=['title', 'type']).copy()

# Fill missing numerical values with the median (robust to skewed data).
# Assigning the result back avoids the chained `inplace=True` pattern,
# which newer pandas versions warn about and may silently ignore.
for col in ['imdb_score', 'tmdb_score', 'tmdb_popularity', 'imdb_votes']:
    titles[col] = titles[col].fillna(titles[col].median())

# Fill missing age certification and description with placeholders
titles['age_certification'] = titles['age_certification'].fillna('Unknown')
titles['description'] = titles['description'].fillna('No Description Available')

# Set missing 'seasons' to 0 (assumes seasons is missing only for movies)
titles['seasons'] = titles['seasons'].fillna(0)

# Convert genres and production_countries from stringified lists to Python lists
# (`ast` is imported in the Import Libraries cell)
for col in ['genres', 'production_countries']:
    titles[col] = titles[col].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else [])

# Cast 'release_year', 'runtime', 'seasons' to integer
# (astype(int) raises an error if a column still contains NaN)
titles['release_year'] = titles['release_year'].astype(int)
titles['runtime'] = titles['runtime'].astype(int)
titles['seasons'] = titles['seasons'].astype(int)

# ---- credits dataset ----

# Drop duplicate rows
credits = credits.drop_duplicates().copy()

# Fill missing character names with "Unknown"
credits['character'] = credits['character'].fillna('Unknown')

In [None]:
# Merge into titles
df = pd.merge(titles, credits, on='id', how='left')
df

### What all manipulations have you done and insights you found?

* Removed duplicate rows from both datasets.

* Filled null values with meaningful defaults or medians.

* Converted columns with stringified lists (genres, production_countries) into Python lists.

* Converted data types (e.g., seasons, runtime, release_year) to int.

* Merged credits and titles dataset
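One consequence of the merge is worth checking: a left join of titles onto credits produces one row per credit, so a title with many cast members appears many times. The sketch below uses invented toy frames (`titles_toy`, `credits_toy`) to show how row counts and title counts diverge after the merge.

```python
import pandas as pd

# Toy stand-ins for `titles` and `credits`; values are invented.
titles_toy = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "type": ["MOVIE", "MOVIE", "SHOW"],
})
credits_toy = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t3"],
    "name": ["A", "B", "C", "D"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR"],
})

# validate="one_to_many" raises if `id` is not unique on the titles side.
merged = pd.merge(titles_toy, credits_toy, on="id", how="left",
                  validate="one_to_many")

# Counting rows of the merged frame counts credits, not titles.
rows_per_type = merged["type"].value_counts()

# Deduplicating on `id` restores one row per title.
titles_per_type = merged.drop_duplicates(subset="id")["type"].value_counts()

print(rows_per_type)
print(titles_per_type)
```

For title-level questions (type split, genres, countries, scores), it is usually safer to work from `titles` directly or from `df.drop_duplicates(subset='id')`, and to reserve the merged frame for cast and crew questions.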

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
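# Chart - 1 visualization code
# TV shows vs movies
# Plot the distribution of title types. Use `titles` (one row per title)
# rather than the merged `df`, which has one row per credit.
plt.figure(figsize=(6, 4))
sns.countplot(data=titles, x='type', palette='pastel')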
plt.title("Distribution of Titles by Type (TV vs Movie)")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A countplot (bar chart) is the most effective visualization to compare the frequency of categorical variables—in this case, TV Shows vs Movies. It gives a direct visual cue on which content type is more dominant on the platform.

##### 2. What is/are the insight(s) found from the chart?

From the chart, it's evident that Movies significantly outnumber TV Shows on Amazon Prime Video. This indicates that the platform’s content library is heavily skewed towards movies, possibly catering to users who prefer short-duration entertainment over long-term series commitments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact:

* If user engagement data shows that viewers prefer TV series for long-term subscriptions, then increasing TV show content could boost retention and watch time.

* This analysis allows content acquisition teams to balance their library—acquiring more TV series could strategically fill this gap and enhance user satisfaction.

No insights lead to negative growth.


#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Count the number of titles released each year
release_year = titles['release_year'].value_counts().sort_index()

# Plotting
plt.figure(figsize=(12, 6))
sns.lineplot(x=release_year.index, y=release_year.values, marker='o', color='steelblue')
plt.title("Number of Titles Released Per Year on Amazon Prime")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is ideal for visualizing trends over time. It tracks how many titles in the catalog were released in each year, revealing patterns such as spikes, declines, or steady growth. Because release_year is the year a title was originally released, not the year it joined Prime, the line describes the age profile of the library rather than Amazon's acquisition timeline.

##### 2. What is/are the insight(s) found from the chart?

* There has been a significant increase in title releases from the early 2000s until around 2020.
* Post-2020, there appears to be a decline or plateau, possibly due to the COVID-19 pandemic affecting production pipelines.
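The trend read off the line chart can be checked numerically. This is a minimal sketch on invented release years; in the notebook the input would be `titles['release_year']`. Reindexing over the full year range matters because years with no titles would otherwise be skipped, which distorts year-over-year differences.

```python
import pandas as pd

# Toy release years, invented for illustration.
release_years = pd.Series([1999, 2005, 2018, 2019, 2019,
                           2020, 2020, 2020, 2021, 2021])

# Titles per year, with missing years filled in as zero.
per_year = release_years.value_counts().sort_index()
full_range = range(per_year.index.min(), per_year.index.max() + 1)
per_year = per_year.reindex(full_range, fill_value=0)

peak_year = per_year.idxmax()   # year with the most titles
yoy_change = per_year.diff()    # change versus the previous year

print("peak year:", peak_year)
print(yoy_change.tail(3))
```

A negative `yoy_change` in the final years supports the "decline or plateau" reading, though the last year of a catalog snapshot is often incomplete.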

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can guide strategic planning and forecasting:

* Understanding the peak years of content growth allows Amazon to analyze what strategies worked—e.g., licensing deals, original productions, etc.

* The decline post-2020 may signal the need to re-accelerate content acquisition or original production to retain users.

Using these insights, Amazon can plan how to increase customer adoption.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Use `titles` (one row per title); the merged `df` repeats each title once per credit
plt.figure(figsize=(7, 5))
sns.scatterplot(data=titles, x='imdb_score', y='tmdb_popularity', alpha=0.5)
plt.title("IMDb Score vs TMDb Popularity")
plt.xlabel("IMDb Score")
plt.ylabel("TMDb Popularity")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is well-suited to observe the relationship or correlation between two continuous variables—in this case, IMDb score and TMDb popularity.


##### 2. What is/are the insight(s) found from the chart?

* Most titles cluster around IMDb ratings between 5 and 7, and low to moderate TMDb popularity.

* There are some titles with both high ratings and high popularity, representing successful content that performs well both critically and publicly.
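To put a number on the relationship seen in the scatter plot, a rank correlation is a reasonable companion to the usual Pearson coefficient, since popularity is heavily skewed. The sketch below uses invented values and computes Spearman's coefficient as the Pearson correlation of ranks, which needs only pandas.

```python
import pandas as pd

# Toy stand-in for the two columns; values are invented.
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, None],
    "tmdb_popularity": [1.0, 4.0, 9.0, 400.0, 3.0],
})

# Keep only rows where both values are present.
pair = toy[["imdb_score", "tmdb_popularity"]].dropna()

# Pearson measures linear association and is sensitive to outliers.
pearson = pair["imdb_score"].corr(pair["tmdb_popularity"])

# Spearman is the Pearson correlation of the ranks (monotonic association).
spearman = pair["imdb_score"].rank().corr(pair["tmdb_popularity"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the missing scores were filled with the median during wrangling, those imputed rows cluster at a single IMDb value and pull both coefficients toward zero, so computing this before imputation may give a cleaner picture.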

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

* Helps content strategists identify what kinds of titles manage to achieve both high ratings and popularity.

* Titles that score high on both axes can be used as benchmarks for future content acquisition or creation.

* It also helps in targeted promotion—popular but lower-rated titles may still attract mass audiences, which can be leveraged in marketing.

**Negative Insight:**

* A disconnect between ratings and popularity may signal audience-content mismatch, which, if not addressed, could affect long-term viewer trust and satisfaction.

#### Chart - 4

In [None]:
# Chart - 4 visualization code


# Get top 10 highest rated titles
top_10_rated = titles.sort_values(by='imdb_score', ascending=False).head(10)

# Plotting
plt.figure(figsize=(12, 6))
sns.barplot(y=top_10_rated['title'], x=top_10_rated['imdb_score'], palette='viridis')
plt.title("Top 10 Highest IMDb Rated Titles on Amazon Prime")
plt.xlabel("IMDb Score")
plt.ylabel("Title")
plt.xlim(8, 10)  # Optional: focusing on high score range
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen because it clearly displays the comparison of IMDb scores for the top-rated titles. It allows for easier reading of longer title names and makes it straightforward to identify the best-rated shows or movies.

##### 2. What is/are the insight(s) found from the chart?

* All top 10 titles have IMDb scores above 8.5, showing a high level of critical acclaim.

* Some of the highest-rated titles are lesser-known, indicating that quality content doesn't always gain mainstream visibility.

* The chart showcases diverse genres and formats, reflecting that high-quality content spans across various types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

* These top-rated titles can be strategically promoted as “critically acclaimed picks,” attracting viewers looking for quality.

* Helps identify content patterns (genre, language, year) common among high-rated titles to guide future acquisitions or productions.

* Boosting visibility of high-rated but less-popular titles may improve customer retention and satisfaction.

**Negative Growth Consideration:**

* If such highly-rated titles are not promoted effectively, they may remain underwatched, which is a lost opportunity for engagement.

* Over-reliance on IMDb scores alone could ignore audience preferences or popularity, so a balanced strategy is necessary.
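One way to balance score against audience size is to require a minimum number of votes before ranking. The sketch below is an illustration on invented data; the `MIN_VOTES` threshold of 1,000 is an arbitrary assumption, not a value taken from the analysis.

```python
import pandas as pd

# Toy stand-in for `titles`; titles and numbers are invented.
toy = pd.DataFrame({
    "title": ["Obscure Gem", "Big Hit", "Solid Show", "Tiny Short"],
    "imdb_score": [9.6, 8.7, 8.2, 9.9],
    "imdb_votes": [12, 250_000, 40_000, 5],
})

MIN_VOTES = 1_000  # assumed cutoff; tune to the dataset's vote distribution

# Rank by score only among titles with enough votes to be reliable.
top_rated = (
    toy[toy["imdb_votes"] >= MIN_VOTES]
    .sort_values("imdb_score", ascending=False)
    .head(10)
)

print(top_rated[["title", "imdb_score", "imdb_votes"]])
```

Without the filter, titles rated by a handful of users can top the list, which may explain why some of the highest-rated titles are lesser-known.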

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Count of content types
category_counts = titles['type'].value_counts()

# Pie chart
plt.figure(figsize=(6,6))
plt.pie(category_counts, labels=category_counts.index, autopct='%1.1f%%', colors=['#66b3ff','#99ff99'], startangle=140)
plt.title('Content Type Distribution: Movies vs TV Shows')
plt.axis('equal')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart was chosen because it provides a quick and visually appealing way to show the proportion of two categories—Movies vs TV Shows—in the dataset. It allows viewers to immediately grasp which type dominates the platform.

##### 2. What is/are the insight(s) found from the chart?

* Movies form the majority of the content on Amazon Prime Video.

* TV Shows make up a smaller percentage, indicating a potential area for growth or user interest diversification.

* The distribution is not balanced, highlighting a content skew toward movie-based offerings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact:

* If user engagement data shows that viewers prefer TV series for long-term subscriptions, then increasing TV show content could boost retention and watch time.

* This analysis allows content acquisition teams to balance their library—acquiring more TV series could strategically fill this gap and enhance user satisfaction.

No insights lead to negative growth.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Top 10 genres
top_genres = titles['genres'].explode().value_counts().head(10)

# Bar plot
plt.figure(figsize=(10,6))
sns.barplot(x=top_genres.values, y=top_genres.index, palette='viridis')
plt.title('Top 10 Most Common Genres on Amazon Prime')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.grid(axis='x')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart is ideal for displaying categorical data like genres. It offers clarity and is especially useful when dealing with long genre names, allowing easy comparison of frequencies across genres.

##### 2. What is/are the insight(s) found from the chart?

* The most common genres include Drama, Comedy, Action, Thriller, and Romance, suggesting that Amazon Prime caters to a wide range of mainstream preferences.

* Drama appears as the most dominant genre, reflecting user demand or Amazon's acquisition strategy.

* Niche genres (like Documentary, Animation, or Sci-Fi) are present but less frequent in the top 10.
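The genre counts rely on turning stringified lists into real lists and then exploding them. The following sketch shows that pattern with a defensive parser; `parse_list` is a hypothetical helper, and the toy data is invented.

```python
import ast
import pandas as pd

def parse_list(value):
    """Parse a stringified list such as "['drama', 'comedy']".

    Returns [] for missing or malformed input, and passes real lists through.
    """
    if isinstance(value, list):
        return value
    if pd.isna(value):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []

# Toy stand-in for `titles`; values are invented.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["['drama', 'comedy']", "['drama']", "[]", None],
})
toy["genres"] = toy["genres"].apply(parse_list)

# explode() gives one row per (title, genre); empty lists become NaN,
# which value_counts() drops by default.
genre_counts = toy["genres"].explode().value_counts()

print(genre_counts)
```

Because a title can carry several genres, the counts sum to more than the number of titles, so they are best read as "titles tagged with this genre" rather than as shares of the catalog.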

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* Amazon can tailor its recommendations and curate genre-based collections around top-performing categories like Drama and Comedy to increase viewer engagement.

* The platform may use these insights to invest more in high-demand genres to retain and grow the subscriber base.

* Genre-specific marketing campaigns can be created to target audience segments based on their preferences.

**Negative insights:**
* If Amazon neglects less common genres, customers with niche interests may move to other platforms.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
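# Explode the country lists on `titles` (one row per title) so each title counts
# once per producing country; the merged `df` would count credit rows instead
df_country = titles.explode('production_countries')
top_countries = df_country['production_countries'].value_counts().head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=top_countries.index, y=top_countries.values, palette="Set3")
plt.title("Top 10 Content Producing Countries")
plt.xticks(rotation=45)
plt.xlabel("Production Country")
plt.ylabel("Number of Titles")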
plt.show()


##### 1. Why did you pick the specific chart?

A vertical bar chart is ideal for representing discrete categories such as countries. It provides a clear view of which countries contribute the most content to Amazon Prime, making it easier to compare production volume across regions.

##### 2. What is/are the insight(s) found from the chart?

* The United States leads by a large margin in terms of content production on Amazon Prime.

* Other key contributors include India, United Kingdom, Canada, France, and Japan.

* There's a healthy mix of Western and Asian countries, showing Amazon's global content acquisition strategy.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* Amazon can leverage this insight to expand partnerships or co-productions with high-content-producing countries.

* The company can localize marketing and regional UI/UX strategies for countries like India and the UK where content production is high.

* These findings help identify content hubs and guide decisions for future content investments.

**Potential Negative Impact:**

* Over-reliance on a few countries (especially the US) may cause regional content gaps, alienating international audiences looking for diverse or native-language content.

* Underrepresented countries may reflect missed market opportunities or limited regional production support.
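The concentration risk can be expressed as the share of titles each country is involved in. This is a minimal sketch on invented data, assuming `production_countries` already holds Python lists of country codes.

```python
import pandas as pd

# Toy stand-in for `titles`; values are invented.
toy = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4", "t5"],
    "production_countries": [["US"], ["US", "GB"], ["IN"], ["US"], []],
})

# One row per (title, country); titles with no country drop out of the count.
country_counts = toy["production_countries"].explode().value_counts()

# Percentage of all titles that list each country as a producer.
share_of_titles = (country_counts / len(toy) * 100).round(1)
top_share = share_of_titles.iloc[0]

print(share_of_titles)
```

A single country with a very high share would back the over-reliance concern with a figure rather than a visual impression.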

#### Chart - 8

In [None]:
# Chart - 8 visualization code


# Get top 10 most voted titles
top_10_popular = titles.sort_values(by='imdb_votes', ascending=False).head(10)

# Plotting
plt.figure(figsize=(12, 6))
sns.barplot(y=top_10_popular['title'], x=top_10_popular['imdb_votes'], palette='magma')
plt.title("Top 10 Most Popular Titles by IMDb Votes on Amazon Prime")
plt.xlabel("IMDb Votes")
plt.ylabel("Title")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart was selected to visualize the top 10 titles with the highest IMDb vote count. This format allows for easy reading of long title names and direct comparison of popularity levels based on audience engagement (votes), which is a strong indicator of a title’s success and reach.

##### 2. What is/are the insight(s) found from the chart?

* A few titles have garnered millions of votes, indicating high global popularity and engagement.

* These titles likely represent evergreen or blockbuster content that continues to attract viewers over time.

* High vote counts reflect not just viewership but also viewer interest in rating the content, a signal of strong audience connection.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* These insights help identify benchmark content that draws massive engagement, useful for recommendation systems, re-marketing, or franchise development.

* Amazon can feature or re-promote these titles to maximize viewership or subscription retention.

* Helps in licensing and investment decisions, as similar content genres or formats may be prioritized.

No insights lead to negative growth.

#### Chart - 9 - Correlation Heatmap

In [None]:
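# Correlation Heatmap visualization code
# Compute correlations on `titles` (one row per title); the merged `df`
# would weight each title by its number of credits
plt.figure(figsize=(8,6))
sns.heatmap(titles[['runtime', 'imdb_score', 'imdb_votes', 'seasons']].corr(), annot=True, cmap="coolwarm")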
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is ideal for identifying relationships between multiple numerical features in a dataset. It provides a clear and visual summary of how strongly (or weakly) different variables are related. This helps in feature selection, understanding content performance metrics, and building data-driven strategies.

##### 2. What is/are the insight(s) found from the chart?

* A moderate positive correlation is observed between imdb_votes and imdb_score, suggesting that titles with more votes tend to have slightly higher ratings. Engaged audiences appear to favor quality content.

* Runtime and seasons show low or no correlation with imdb_score and votes, implying that longer content or multi-season shows are not necessarily better rated.

* Seasons and runtime are at most weakly related. Since runtime is the episode length for shows and seasons was set to 0 for movies, this coefficient may largely reflect the difference between movies and shows rather than a relationship within shows.
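Because `seasons` was filled with 0 for movies, correlations involving it mix two different populations. A hedged sketch of one way to separate them, using invented values: compute the matrix once on all titles and once on shows only, then compare.

```python
import pandas as pd

# Toy stand-in for `titles` after wrangling; values are invented.
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "runtime": [95, 120, 45, 30, 60],
    "imdb_score": [6.1, 7.4, 8.0, 6.5, 7.2],
    "seasons": [0, 0, 5, 1, 3],
})
cols = ["runtime", "imdb_score", "seasons"]

# All titles: movies contribute seasons == 0 to every pair involving seasons.
corr_all = toy[cols].corr()

# Shows only: seasons is meaningful for every row.
corr_shows = toy.loc[toy["type"] == "SHOW", cols].corr()

print(corr_all.round(2))
print(corr_shows.round(2))
```

On the real data the two matrices may differ noticeably for the `seasons` row, in which case the shows-only figures are the ones to interpret.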



#### Chart - 10 - Pair Plot

In [None]:
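# Pair Plot visualization code
# Restrict to numeric title-level columns; plotting the full merged `df`
# would include ID columns and repeat each title once per credit
pair_cols = ['runtime', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score', 'seasons']
sns.pairplot(titles[pair_cols])
plt.show()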

##### 1. Why did you pick the specific chart?

A pair plot is chosen because it provides a comprehensive multivariate visualization, displaying pairwise relationships between multiple numerical features. It combines scatter plots, histograms, and density plots in a single matrix, helping in detecting correlations, distributions, trends, and outliers across all combinations of selected variables. This is especially useful in exploratory data analysis (EDA).

##### 2. What is/are the insight(s) found from the chart?

* A positive trend is visible between imdb_votes and imdb_score, similar to what was observed in the correlation heatmap.

* Skewed distributions are visible in variables like imdb_votes and runtime, indicating a few titles dominate in popularity or length.

* Some clusters and outliers can be observed, which may indicate specific genres, series, or standout movies with exceptional characteristics.

* The pair plot confirms weak or no relationship between runtime and imdb_score, meaning longer content doesn’t guarantee higher ratings.
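The skew noted above can be measured, and a log transform is a common way to make such variables easier to read in scatter plots. The sketch below uses invented vote counts; `np.log1p` computes log(1 + x), so zero votes remain valid.

```python
import numpy as np
import pandas as pd

# Toy vote counts with one blockbuster outlier; values are invented.
votes = pd.Series([50, 120, 300, 900, 2_500, 1_200_000], name="imdb_votes")

skew_raw = votes.skew()            # sample skewness of the raw counts
skew_log = np.log1p(votes).skew()  # skewness after the log transform

print(f"skew (raw):   {skew_raw:.2f}")
print(f"skew (log1p): {skew_log:.2f}")
```

Plotting `np.log1p(titles['imdb_votes'])` instead of the raw column would spread out the dense cluster of low-vote titles in the pair plot.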

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective, Amazon Prime should focus on:

* Investing further in the dominant genres, such as Drama and Comedy.

* Expanding regional content, especially from major producing countries like India and the UK.

* Promoting top-rated and most-voted content to increase user engagement.

* Using data insights for personalized recommendations and targeted marketing.

# **Conclusion**

In this Exploratory Data Analysis (EDA) Capstone Project on Amazon Prime Video content, we uncovered valuable insights that can guide strategic decisions for improving content offerings and viewer satisfaction. Through detailed visualizations and data-driven observations, we analyzed trends in content types, genres, regional distribution, popularity metrics (IMDb & TMDb), and user engagement indicators.

Key findings revealed that Drama, Comedy, and Action dominate the platform, while content is largely produced in a few countries such as the US, UK, and India. The most-voted and highest-rated titles are natural candidates for promotion, and runtime shows little relationship with IMDb score. These insights highlight opportunities for content diversification, regional expansion, and data-driven personalization.

Overall, this analysis equips stakeholders with actionable intelligence to boost platform engagement, enhance user experience, and drive sustainable growth for Amazon Prime Video.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***