<a href="https://colab.research.google.com/github/Abdullah-113/Data-Analysis/blob/Amazon/Abdullah's_Capstone_Project_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

In today’s highly competitive streaming ecosystem, platforms like Amazon Prime Video must continuously expand and diversify their content libraries to attract and retain viewers. With thousands of movies and web series being added across various genres and regions, data-driven insights are critical for understanding audience behavior, content performance, and strategic investment.

This project focuses on analyzing a dataset containing all shows available on Amazon Prime Video in the United States. The dataset consists of two CSV files and includes a mix of categorical and numerical variables, covering details such as titles, release year, genre, IMDb ratings, cast, duration, and more.

**Purpose of the Project**

The goal of this analysis is to extract key insights from the content library to support:

- Business decisions
- Content acquisition strategy
- Viewer engagement planning
- Market trend understanding

**Key Analytical Objectives**

- **Content Diversity**
  - Identify which genres and categories dominate the platform.
  - Check the spread between movies vs. TV shows.
- **Regional / Country-Level Distribution**
  - Explore content originating from different regions.
  - Understand cultural representation and global reach.
- **Trends Over Time**
  - Analyze how content volume has evolved across years.
  - Detect growth spikes or shifting focus in genres.
- **Ratings & Popularity**
  - Determine highest-rated shows and movies.
  - Investigate whether certain genres receive better ratings.

**Why This Matters**

Insights extracted from this dataset can help:

- Business analysts measure platform performance.
- Content creators understand what type of shows attract viewers.
- Marketing teams optimize campaigns based on viewer interests.
- Streaming companies benchmark Amazon Prime’s strategy against competitors.

# **Problem Statement**


**PROBLEM OVERVIEW**

With a rapidly growing global streaming market, platforms like Amazon Prime Video need to understand what type of content attracts and retains viewers. Although thousands of shows and movies exist on the platform, there is limited visibility into which genres dominate, how content is distributed across countries, how the library has evolved over time, and which titles perform best in terms of IMDb ratings and popularity.

This project aims to analyze Amazon Prime Video’s U.S. content catalog to answer key business questions such as:

- What types of content (movies, shows, genres) are most common on the platform?
- Which regions contribute the most content?
- How has the available content changed across different years?
- Which shows and movies receive the highest ratings?

By answering these questions through data analysis, we can uncover patterns that help with content acquisition, audience targeting, and strategic investment decisions for streaming platforms.

#### **Define Your Business Objective?**

The primary business objective of this project is to analyze Amazon Prime Video’s content library in the United States to uncover trends that support informed decision-making in areas such as content acquisition, audience targeting, and platform growth strategy. By evaluating content types, genres, regional contributions, and IMDb ratings, the goal is to identify which types of shows and movies perform well and appeal most to viewers.

Through this analysis, Amazon Prime (or any streaming service) can:

- Invest in high-performing genres and regions
- Enhance user engagement by promoting popular or highly-rated content
- Optimize content purchasing decisions based on viewer preferences
- Strengthen competitive advantage in the streaming market

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
# Set plot style, default figure size, and column display width
sns.set_style("darkgrid")
plt.rcParams['figure.figsize'] = (10,5)
pd.set_option('display.max_colwidth', 200)

### Dataset Loading

In [None]:
# Load Dataset
credits = pd.read_csv('/content/credits.csv')
titles = pd.read_csv('/content/titles.csv')
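
The hard-coded `/content/...` paths assume a Colab session in which both files have already been uploaded. As a minimal sketch of the exception handling mentioned in the guidelines, the loader below fails early with a readable message when a file is absent. `load_csv` is a hypothetical helper name, not part of pandas.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, raising a clear error if the file does not exist."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Expected a dataset at {csv_path}; upload the file or adjust the path."
        )
    return pd.read_csv(csv_path)


# Intended usage in this notebook (paths taken from the cell above):
# credits = load_csv('/content/credits.csv')
# titles = load_csv('/content/titles.csv')
```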

### Dataset First View

In [None]:
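# Dataset First Look (a cell shows only its last expression, so display both explicitly)
from IPython.display import display
display(credits.head())
display(titles.head())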

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count for both tables
print("credits:", credits.shape)
print("titles:", titles.shape)

### Dataset Information

In [None]:
# Dataset Info
credits.info()
titles.info()


#### Duplicate Values

In [None]:
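# Dataset Duplicate Value Count for both tables
print("credits duplicates:", credits.duplicated().sum())
print("titles duplicates:", titles.duplicated().sum())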


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
credits.isnull().sum(), titles.isnull().sum()

In [None]:
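# Visualizing the missing values in both tables, side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.heatmap(credits.isnull(), cbar=False, yticklabels=False, ax=axes[0]).set_title("credits")
sns.heatmap(titles.isnull(), cbar=False, yticklabels=False, ax=axes[1]).set_title("titles")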
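
Raw null counts are hard to compare across tables of very different sizes. The sketch below adds a percentage column so the worst-affected columns stand out. `missing_summary` is a hypothetical helper, and the small frame is made-up data used only to show the output shape.

```python
import pandas as pd


def missing_summary(df):
    """Return per-column null counts and percentages, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame for illustration only (not the real dataset)
demo = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "seasons": [None, None, 2.0, None],
})
report = missing_summary(demo)
print(report)
```

On the real data this could be called as `missing_summary(titles)` and `missing_summary(credits)`.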

### What did you know about your dataset?

The data describes the catalog of a streaming service, Amazon Prime Video, and is used here to uncover trends that influence subscription growth, user engagement, and content investment strategies.

There are two datasets: `titles` and `credits`.

`titles` contains information about each film or show: type, release year, genres, and more. `credits` contains the people attached to each title: person, role, and character.
`credits` has 124,235 rows and 5 columns, including about 56 duplicate rows. `titles` has 9,871 rows and 15 columns, including 3 duplicate rows.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
titles.columns

In [None]:
# Dataset Describe
titles.describe()

### Variables Description

release_year: The year the title was released, ranging from 1912 to 2022.

runtime: The length of the episode (SHOW) or movie.

seasons: Number of seasons if it's a SHOW.

imdb_score: Score on IMDB.

imdb_votes: Votes on IMDB.

tmdb_popularity: Popularity on TMDB.

tmdb_score: Score on TMDB.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
titles.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Re-check column types and null counts before cleaning
titles.info()
titles.isnull().sum()

In [None]:
# Remove duplicate rows from both tables
titles.drop_duplicates(inplace=True)
credits.drop_duplicates(inplace=True)

In [None]:
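# Drop rows whose title is missing (calling dropna on the column alone leaves the DataFrame unchanged)
titles = titles.dropna(subset=['title'])

# Convert score and popularity to numeric; values that cannot be parsed become NaN
titles['imdb_score'] = pd.to_numeric(titles['imdb_score'], errors='coerce')
titles['tmdb_popularity'] = pd.to_numeric(titles['tmdb_popularity'], errors='coerce')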


In [None]:
# Converting the Data types
titles['release_year'] = pd.to_numeric(titles['release_year'], errors='coerce')
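
`errors='coerce'` turns anything that cannot be parsed into `NaN` without raising, so it is worth checking how many values were lost in the conversion. A minimal sketch on made-up values (the strings below are not taken from the dataset):

```python
import pandas as pd

# Made-up raw values: two valid numbers, one junk string, one genuinely missing entry
raw = pd.Series(["7.5", "N/A", None, "8"], dtype=object)

scores = pd.to_numeric(raw, errors="coerce")

# Values that were present before conversion but are NaN afterwards
n_failed = int(scores.isna().sum() - raw.isna().sum())
print(scores.tolist(), "failed conversions:", n_failed)
```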

In [None]:
# Fill missing genres with 'Unknown' (splitting into individual genres happens in the next cell)
titles['genres'] = titles['genres'].fillna('Unknown')

In [None]:
# Cleaning up Genre for unique
# Convert all to string
titles["genres"] = titles["genres"].astype(str)

# Remove brackets, quotes, spaces and lower()
titles["genres_clean"] = (
    titles["genres"]
    .str.replace(r"[\[\]'\" ]", "", regex=True)  # remove [, ], ', ", spaces
    .str.lower()
)

# Split by comma
titles["genres_clean"] = titles["genres_clean"].str.split(",")

# Explode rows so each genre is individual
genre_explode = titles.explode("genres_clean")
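
The regex clean-up works because genre labels contain no commas, brackets, or quotes of their own. An alternative sketch, assuming each cell holds a Python-style list literal such as `"['drama', 'comedy']"`, parses the cell with `ast.literal_eval` from the standard library. `parse_list_cell` is a hypothetical helper name and the frame is made-up data.

```python
import ast

import pandas as pd


def parse_list_cell(cell):
    """Turn a cell such as "['drama', 'comedy']" into ['drama', 'comedy']."""
    if isinstance(cell, list):
        return [str(item).strip().lower() for item in cell]
    if pd.isna(cell):
        return []
    text = str(cell).strip()
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a list literal (for example the filler 'Unknown'): keep it as one label
        return [text.lower()] if text else []
    if isinstance(parsed, (list, tuple)):
        return [str(item).strip().lower() for item in parsed]
    return [str(parsed).strip().lower()]


# Toy rows for illustration only
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'comedy']", "[]", None],
})
demo["genres_clean"] = demo["genres"].apply(parse_list_cell)

# explode() turns an empty list into NaN, so drop those rows afterwards
exploded = demo.explode("genres_clean").dropna(subset=["genres_clean"])
print(exploded[["title", "genres_clean"]])
```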


In [None]:
# Clean the country
titles['production_countries'] = titles['production_countries'].astype(str)

# Remove brackets, quotes, spaces and lowercase
titles['country_clean'] = (
    titles['production_countries']
    .str.replace(r"[\[\]'\" ]", "", regex=True)    # remove [ ] ' " and spaces
    .str.lower()
)

# Split into list
titles['country_clean'] = titles['country_clean'].str.split(',')

# Explode for counting
country_ex = titles.explode('country_clean')

# Remove blanks
country_ex = country_ex[country_ex['country_clean'] != ""]


In [None]:
# Convert country codes to full names
country_map = {
    'us': 'United States',
    'gb': 'United Kingdom',
    'uk': 'United Kingdom',
    'in': 'India',
    'ca': 'Canada',
    'de': 'Germany',
    'fr': 'France',
    'jp': 'Japan',
    'au': 'Australia',
    'es': 'Spain'
}

# Replace codes with full country names
country_ex['country_clean'] = country_ex['country_clean'].replace(country_map)
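
`Series.replace` leaves any code that is missing from the dictionary unchanged, so lowercase two-letter codes can still end up as chart labels. The sketch below lists the unmapped codes and upper-cases them as a fallback. The codes are made up and the dictionary is a subset of the one above.

```python
import pandas as pd

# Subset of the mapping used in the notebook
country_map = {"us": "United States", "gb": "United Kingdom", "in": "India"}

# Made-up codes; 'br' is deliberately absent from the mapping
codes = pd.Series(["us", "in", "br", "gb"])

# map() returns NaN for unknown keys, so fall back to the upper-cased code
names = codes.map(country_map).fillna(codes.str.upper())

# Codes that still need an entry in country_map
unmapped = sorted(set(codes) - set(country_map))
print(names.tolist(), unmapped)
```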


In [None]:
# Make sure genres are cleaned
titles['genres'] = titles['genres'].astype(str)
titles['genres_clean'] = (
    titles['genres']
    .str.replace(r"[\[\]'\" ]", "", regex=True)  # remove brackets, quotes, spaces
    .str.lower()
    .str.split(',')                              # split into list
)

# Explode cleaned genres
genre_explode = titles.explode('genres_clean')
genre_explode = genre_explode[genre_explode['genres_clean'] != ""]  # remove blanks

# Compute highest rated unique genres
unique_genre_rating = (
    genre_explode.groupby('genres_clean')['imdb_score']
    .mean()
    .sort_values(ascending=False)
    .head(15)
)
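
A plain mean can put a genre with only a handful of titles at the top of the ranking. One possible refinement, sketched here on made-up rows, is to require a minimum number of titles per genre before ranking. `MIN_TITLES` is a hypothetical threshold, not a value used elsewhere in this notebook.

```python
import pandas as pd

# Toy exploded frame for illustration only
toy = pd.DataFrame({
    "genres_clean": ["drama", "drama", "drama", "war", "comedy", "comedy"],
    "imdb_score": [7.0, 6.0, 8.0, 9.5, 5.0, 6.0],
})

MIN_TITLES = 2  # hypothetical cut-off; tune it to the size of the real catalog

# Mean score and number of rated titles per genre
stats = toy.groupby("genres_clean")["imdb_score"].agg(["mean", "count"])

# Keep only genres with enough titles, then rank by mean score
ranked = stats[stats["count"] >= MIN_TITLES].sort_values("mean", ascending=False)
print(ranked)
```

Here `war` is excluded despite its high score because it rests on a single title.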

In [None]:
# Creating new columns Decade
titles['decade'] = (titles['release_year'] // 10) * 10
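
If `release_year` holds any `NaN` after coercion, pandas stores the column as float, and decades would then appear as `1910.0` on chart axes. A small sketch on made-up years, using the nullable `Int64` dtype to keep whole numbers alongside missing values:

```python
import pandas as pd

# Made-up years stored as float, with one missing value
years = pd.Series([1912, 2019, None, 2022], dtype="float64")

# Floor to the decade, then convert to nullable integers (NaN becomes <NA>)
decade = ((years // 10) * 10).astype("Int64")
print(decade.tolist())
```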

In [None]:
# Add a content_type column with readable labels for SHOW/MOVIE
titles['content_type'] = titles['type'].replace({'SHOW':'TV Show','MOVIE':'Movie'})

In [None]:
# Final Cleaned dataset
titles.head()

### What all manipulations have you done and insights you found?

**What Data Manipulations Were Done?**

1. **Removed duplicate rows** from both `titles` and `credits` datasets using `drop_duplicates()` to avoid repeated records.
2. **Removed missing titles** using `dropna()` so every record has a valid movie/show name.
3. **Converted important columns to numeric**:

   * `imdb_score`,
   * `tmdb_popularity`,
   * `release_year`

       This prevents errors and allows correct calculations, sorting, and plotting.
4. **Filled missing genres with `'Unknown'`** so no blank categories remain.
5. **Cleaned the `genres` column**:

   * Converted to string
   * Removed `[ ]`, quotes and spaces
   * Converted to lowercase
   * Split values by comma
    
     This makes all genres consistent.
6. **Exploded genres** so each movie/show appears once per genre.

     This allowed accurate counting of genre frequency and ratings.
7. **Cleaned `production_countries` column**:

   * Removed `[ ]`, quotes and spaces
   * Converted to lowercase
   * Split into a list
   * Exploded into separate rows
8. **Mapped country codes to full country names** (`us → United States`, `in → India`, etc.) so charts show readable labels.
9. **Created new calculated columns**:

   * `decade` → grouped content by decade
   * `content_type` → replaced SHOW/MOVIE with cleaner labels (`TV Show`, `Movie`)
10. **Removed blanks after cleaning** (`!= ""`) to ensure high-quality data for charts.



**What Insights Were Found?**

1. **Movies are higher in number than TV Shows** → Prime has a movie-heavy library.
2. **Drama, Comedy, and Action are the most common genres** → These drive most content.
3. **Highest-rated genres** include Documentary, Biography, and Historical → These genres deliver strong content quality.
4. **Release trend shows rapid growth after 2010** → Prime expanded aggressively in recent years.
5. **IMDb score distribution is mostly between 6 and 8** → Content quality is average to good.
6. **TMDB popularity has many outliers** → A few titles are extremely popular, even if not the highest rated.
7. **Popularity does not strongly depend on rating** → Marketing, stars, or franchise value affect popularity.
8. **Top content-producing countries** are:
   - United States
   - United Kingdom
   - India
9. **Most movies run between 90–120 minutes** → Standard movie length performs well.
10. **Top rated titles identified** → Useful for marketing and recommendations.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  Movie vs TV Show count

In [None]:
# Chart - 1 visualization code
count_type = titles['content_type'].value_counts()
sns.barplot(x=count_type.index, y=count_type.values)
plt.title("Distribution of Movies vs TV Shows")
plt.xlabel("Content Type")
plt.ylabel("Count")
plt.show()
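
The bar heights show counts, but the split is easier to quote as a percentage. A minimal sketch using `value_counts(normalize=True)` on made-up labels:

```python
import pandas as pd

# Toy labels for illustration only
content_type = pd.Series(["Movie", "Movie", "Movie", "TV Show"], name="content_type")

# Share of each content type as a percentage of all titles
share = (content_type.value_counts(normalize=True) * 100).round(1)
print(share)
```

On the real data the same call would be `titles['content_type'].value_counts(normalize=True)`.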

##### 1. Why did you pick the specific chart?

Bar chart shows category comparison clearly.

##### 2. What is/are the insight(s) found from the chart?

Movies outnumber TV shows in the Prime catalog.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If TV shows bring higher engagement, Amazon should add more series to balance the library.

#### Chart - 2 Top 15 Most Popular Genres

In [None]:
# Chart - 2 visualization code
# Count titles per genre and keep the 15 most frequent
top_genres = genre_explode['genres_clean'].value_counts().head(15)
# Horizontal bar chart of the top 15 genres
sns.barplot(y=top_genres.index, x=top_genres.values)
plt.title("Top 15 Genres on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Genre")
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart ranks genres by title count and keeps the genre labels readable.

##### 2. What is/are the insight(s) found from the chart?

Drama and Comedy are the most common genres in the catalog.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Invest more in popular genres

#### Chart - 3 Trend of Releases Over Time

In [None]:
# Chart - 3 visualization code
year_counts = titles['release_year'].value_counts().sort_index()

plt.plot(year_counts.index, year_counts.values)
plt.title("Content Released Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Titles")
plt.show()



##### 1. Why did you pick the specific chart?

Line chart highlights trend over time—growth, decline, and spikes.

##### 2. What is/are the insight(s) found from the chart?

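The number of titles per release year rises sharply after 2010. Because the data records release year rather than the date a title joined Prime, this shows a catalog weighted toward recent productions.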

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A catalog weighted toward recent releases is likely to support subscriber engagement, although this dataset contains no viewing data to confirm it.

#### Chart - 4 Releases by Decade

In [None]:
# Chart - 4 visualization code
decade_counts = titles['decade'].value_counts().sort_index()
sns.barplot(x=decade_counts.index, y=decade_counts.values)
plt.title("Titles by Release Decade")
plt.xlabel("Decade")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

Bar chart easily compares decades without overcrowding labels.

##### 2. What is/are the insight(s) found from the chart?

Most content belongs to 2010s and 2020s; older movies are fewer.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Users prefer modern content, so Prime should buy or produce fresh releases.

Older content could be selectively curated for nostalgia sections.

#### Chart - 5 Runtime distribution

In [None]:
# Chart - 5 visualization code
# Coerce runtime to numeric so unparseable values become NaN instead of raising
titles['runtime'] = pd.to_numeric(titles['runtime'], errors='coerce')
sns.histplot(titles['runtime'], bins=40, kde=True)
plt.title("Distribution of Runtime in Minutes")
plt.xlabel("Runtime")
plt.show()
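
`runtime` holds episode length for shows and full length for movies, so a single histogram blends two different distributions. A minimal sketch on made-up rows that summarizes runtime separately per content type:

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "content_type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "runtime": [90, 110, 100, 25, 45],
})

# Median runtime in minutes for each content type
median_runtime = toy.groupby("content_type")["runtime"].median()
print(median_runtime)
```

For the plot itself, passing `hue='content_type'` to `sns.histplot` would draw the two groups in different colors.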


##### 1. Why did you pick the specific chart?

A histogram shows which runtime lengths are most common in the catalog.

##### 2. What is/are the insight(s) found from the chart?

Most movies run between 90–120 minutes, which matches standard viewing comfort.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Future movie investments should target 1.5–2 hour formats.

For short-duration viewers, improve short films & series recommendations.

#### Chart - 6 IMDb Score Distribution

In [None]:
# Chart - 6 visualization code
sns.histplot(titles['imdb_score'], bins=30, kde=True)
plt.title("IMDb Score Distribution")
plt.xlabel("Score")
plt.show()


##### 1. Why did you pick the specific chart?

Because a histogram clearly shows how IMDb scores are distributed across all titles.

##### 2. What is/are the insight(s) found from the chart?

Most IMDb scores fall between 6 and 8, meaning most Prime content is average to good quality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Prime can promote higher-rated titles to improve viewer satisfaction and avoid investing in low-rated content.

#### Chart - 7 Top Rated Titles

In [None]:
# Chart - 7 visualization code
# Top Rated Titles
top_rated = titles[['title', 'imdb_score']].dropna().sort_values(by='imdb_score', ascending=False).head(10)
plt.figure(figsize=(10,5))
sns.barplot(x=top_rated['imdb_score'], y=top_rated['title'])
plt.title("Top 10 Highest Rated Titles (IMDb)")
plt.xlabel("IMDb Score")
plt.ylabel("Title")
plt.show()
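
Sorting by `imdb_score` alone can surface titles whose score rests on very few votes. One possible safeguard, sketched on made-up rows, is to require a minimum `imdb_votes` before ranking. `top_rated` and its `min_votes` default are hypothetical choices, not values from this analysis.

```python
import pandas as pd


def top_rated(df, min_votes=1000, n=10):
    """Return the n highest-scored titles that have at least min_votes IMDb votes."""
    eligible = df.dropna(subset=["imdb_score", "imdb_votes"])
    eligible = eligible[eligible["imdb_votes"] >= min_votes]
    return eligible.nlargest(n, "imdb_score")[["title", "imdb_score", "imdb_votes"]]


# Toy rows for illustration only
toy = pd.DataFrame({
    "title": ["Obscure Short", "Well Known Film", "Popular Series"],
    "imdb_score": [9.8, 8.4, 8.9],
    "imdb_votes": [12, 250000, 90000],
})
result = top_rated(toy, min_votes=1000, n=2)
print(result)
```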


##### 1. Why did you pick the specific chart?

A horizontal bar chart clearly compares the highest-rated titles with their exact IMDb scores.

##### 2. What is/are the insight(s) found from the chart?

These titles represent the best-performing content on the platform in terms of quality and viewer reception.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Prime can promote these titles more aggressively and prioritize acquiring similar high-quality content to improve user satisfaction and platform reputation.

#### Chart - 8 Popularity Distribution (TMDB)

In [None]:
# Chart - 8 visualization code
# Drop rows where popularity is missing or non-numeric
titles['tmdb_popularity'] = pd.to_numeric(titles['tmdb_popularity'], errors='coerce')
popularity_clean = titles['tmdb_popularity'].dropna()

# Boxplot for popularity spread
sns.boxplot(x=popularity_clean)
plt.title("TMDB Popularity Spread")
plt.xlabel("TMDB Popularity")
plt.show()
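
By default seaborn draws boxplot whiskers at 1.5 times the interquartile range, and points beyond them appear as outliers. The sketch below applies the same rule numerically so the outlier titles can be counted or listed. `iqr_outlier_mask` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy popularity values for illustration only
toy = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 120.0])
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())
```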


##### 1. Why did you pick the specific chart?

Because a boxplot clearly shows how popularity is distributed and highlights extreme popular titles as outliers.

##### 2. What is/are the insight(s) found from the chart?

Most titles have moderate popularity, but a few shows and movies are extremely popular compared to the rest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Prime can focus marketing and promotion on these highly popular titles to drive more engagement and attract new viewers.

#### Chart - 9 IMDb Score by Genre

In [None]:
# Chart - 9 visualization code
# Plot chart
sns.barplot(y=unique_genre_rating.index, x=unique_genre_rating.values)
plt.title("Highest Rated Unique Genres")
plt.xlabel("Average IMDb Score")
plt.ylabel("Genre")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart clearly compares the average IMDb score of each genre and ranks them from highest to lowest.

##### 2. What is/are the insight(s) found from the chart?

Genres like Documentary, Biography, and Historical tend to have the highest ratings, showing strong audience appreciation and content quality in those categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Prime should promote and invest more in these high-rated genres because they increase platform credibility and user satisfaction.

#### Chart - 10 Correlation Heatmap

In [None]:
# Chart - 10 visualization code
num_cols = titles.select_dtypes(include='number')
sns.heatmap(num_cols.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()
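
`DataFrame.corr()` defaults to Pearson correlation, which is sensitive to extreme values such as the popularity outliers in this dataset. As a cross-check, the rank-based Spearman method can be computed with the same call. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Toy values for illustration only; the last popularity value is an extreme outlier
toy = pd.DataFrame({
    "tmdb_popularity": [1.0, 2.0, 3.0, 4.0, 500.0],
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 7.2],
})

pearson = toy.corr(method="pearson").loc["tmdb_popularity", "imdb_score"]
spearman = toy.corr(method="spearman").loc["tmdb_popularity", "imdb_score"]
print(round(pearson, 3), round(spearman, 3))
```

In this toy case the ordering is perfectly monotonic, so Spearman is close to 1 while the outlier pulls Pearson down. Whether the real data behaves the same way would need to be checked on `titles`.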


##### 1. Why did you pick the specific chart?

A heatmap shows correlations between numeric variables clearly, helping identify relationships in one visual view.

##### 2. What is/are the insight(s) found from the chart?

Most features show weak correlations, meaning popularity and ratings are not strongly linked—viewership may depend on marketing, cast, or genre rather than ratings alone.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Prime can’t rely only on ratings to predict popularity; marketing strategies and promotion play a major role in making content successful.

#### Chart - 11 Country Content Distribution

In [None]:
# Chart - 11 visualization code
top_countries = country_ex['country_clean'].value_counts().head(10)

top_countries.plot(kind='bar')
plt.title("Top 10 Content Producing Countries")
plt.ylabel("Number of Titles")
plt.xlabel("Country")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart clearly compares which countries produce the most content on Prime Video.

##### 2. What is/are the insight(s) found from the chart?

United States leads by a large margin, followed by countries like India and the UK.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Prime should continue strong partnerships in these regions and increase international content to attract global subscribers.

#### Chart - 12 - Popular vs Low-rated Titles

In [None]:
# Chart - 12 visualization code
sns.scatterplot(x=titles['tmdb_popularity'], y=titles['imdb_score'])
plt.title("Popularity vs Rating")
plt.xlabel("TMDB Popularity")
plt.ylabel("IMDb Score")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is the best way to show how two continuous variables relate to each other—here, popularity and IMDb rating.

##### 2. What is/are the insight(s) found from the chart?

Some titles are highly popular even with average or low ratings, showing that popularity doesn’t always depend on IMDb ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Prime can promote trending titles even if ratings are not very high, because audience interest and marketing can drive views.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

 **Suggestions to Achieve the Business Objective**

To meet the goal of improving content strategy, viewer engagement, and platform growth, the following actions are recommended:

1. **Invest more in high-demand genres**
   Since Drama, Comedy, Action, and Thriller are the most common genres in the catalog, Prime should continue acquiring and producing titles in these genres.

2. **Strengthen TV Show catalog**
   The dataset shows Prime has more movies than TV shows. If series drive more watch-time, as binge-viewing habits suggest, expanding TV show content can boost watch-time and retention.

3. **Leverage high-rated titles**
   Titles with strong IMDb scores should be promoted on the homepage and recommended more often, increasing user satisfaction and trust in platform quality.

4. **Use popularity trends for marketing**
   Some titles are highly popular even with average ratings. Prime should push these trending shows through banners, notifications, and social media to maximize engagement.

5. **Increase regional and international content**
   U.S., U.K., and India produce most content. Increasing local-language and international titles will help grow global subscriber base and attract new markets.

6. **Add more niche high-quality content**
   Documentaries, Biographies, and Historical genres have high average ratings. A focused investment here can improve content diversity and platform reputation.

7. **Use data-driven recommendation systems**
   Combining popularity + rating + genre preferences can personalize suggestions and improve user watch time.
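
As one way to combine rating with vote volume for suggestion 7, the sketch below uses a common shrinkage (Bayesian average) formula: each title's score is pulled toward the catalog mean unless it has many votes. This is an illustrative assumption rather than a method derived in this notebook. `weighted_rating` and the `min_votes` value are hypothetical, and the rows are made up.

```python
import pandas as pd


def weighted_rating(df, min_votes):
    """Blend each title's imdb_score with the overall mean, weighted by its vote count."""
    overall_mean = df["imdb_score"].mean()
    votes = df["imdb_votes"]
    weight = votes / (votes + min_votes)
    return weight * df["imdb_score"] + (1 - weight) * overall_mean


# Toy rows for illustration only
toy = pd.DataFrame({
    "title": ["Obscure Short", "Well Known Film", "Weak Blockbuster"],
    "imdb_score": [9.5, 8.5, 5.0],
    "imdb_votes": [10, 50000, 20000],
})
toy["weighted"] = weighted_rating(toy, min_votes=1000)
print(toy.sort_values("weighted", ascending=False)[["title", "weighted"]])
```

With these numbers the well-known film overtakes the obscure short, because ten votes give the short's 9.5 very little weight.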


# **Conclusion**

This project successfully analyzed Amazon Prime Video’s content library to uncover valuable insights into genre diversity, regional distribution, popularity, and ratings. Through detailed data cleaning, wrangling, and visualization, the analysis revealed that **Drama, Comedy, Action, and Thriller** are the most dominant genres, while **Documentary and Biography** genres achieve the highest IMDb ratings.

It was also observed that **Movies vastly outnumber TV Shows**, and most titles have IMDb scores between **6 and 8**, indicating average to good overall quality. The **United States, India, and the United Kingdom** emerged as the top content-producing regions, showing Prime’s strong focus on English and Indian markets.

Based on these insights, it is recommended that Amazon Prime:

* Expands its **TV show collection** to match user viewing trends,
* **Promotes top-rated and trending titles** to improve engagement,
* **Invests in regional and high-rated genres** like Documentary and Biography to diversify the catalog,
* And uses **data-driven recommendations** to enhance user experience and retention.

Overall, by leveraging these insights, Amazon Prime Video can improve its **content strategy, viewer satisfaction, and global market competitiveness**—aligning with its business objective of driving growth through personalized, high-quality entertainment.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***