# **Project Name**    -



##### **Project Name**    - Exploratory Data Analysis of the Amazon Prime Video Content Library
##### **Contribution** - Individual


# **Project Summary -**

The objective of this project is to conduct an in-depth Exploratory Data Analysis (EDA) on a dataset containing metadata from Amazon Prime Video's content library. This analysis aims to explore and understand the distribution and patterns of TV shows and movies available on the platform. The focus is on key features such as content type, genre, rating, runtime, regional availability, popularity scores (IMDb, TMDB), and release years. The project further aims to derive meaningful business insights that can guide data-driven decisions for content strategy and user engagement.

**Key Steps**
1. **Data Collection and Cleaning**
The project utilized two separate but related datasets: one containing detailed information about content (movies and shows), and another about people (actors, directors, writers, etc.). These datasets were merged using a common key to form a comprehensive DataFrame. The data cleaning process involved:

*   Handling missing and null values
*   Removing duplicate records (168 found and removed)
*   Converting data types
*  Renaming and transforming columns for consistency
* Creating derived variables (e.g., content decade, score categories)

   This step ensured that the dataset was analysis-ready, structured, and of high integrity.

2. **Data Visualization**
A wide range of visual tools and libraries such as Matplotlib and Seaborn were used to understand variable distributions and relationships. The charts used include:

*  Bar plots (e.g., content type, country-wise distribution)

* Line plots (content growth over time)

* Pie charts (age certification distribution)

* Scatter plots (e.g., IMDb score vs TMDB popularity)

* Heatmaps and pairplots for correlation and multivariate analysis


Each chart was chosen based on the nature of data and the specific insight it was expected to uncover.

3. **Key Insights**
**Content Type:** Movies dominate the platform compared to TV shows.

**Genre Popularity:** Drama and Comedy are the most frequent genres, followed by Documentary and Action.

**Release Trends:** A noticeable increase in content release was observed post-2010, showing Amazon’s growing investment in streaming.

**Seasons:** Most TV shows run for one to three seasons, and very few go beyond five.

**Age Certification:** Majority of content is rated for mature audiences (TV-MA and R).

**IMDb & TMDB Scores:** Most content scores between 6–7; no strong correlation between scores and popularity.

**Top Countries:** USA is the leading content producer, followed by India and the UK.



**Conclusion and Business Recommendations**

This analysis helps understand the streaming content landscape on Amazon Prime Video. The client can use these insights to:

*   Diversify content with more family-friendly or youth-targeted genres

* Optimize investments in high-performing genres and creators

* Expand regional content in fast-growing markets like India and Brazil

* Leverage viewer ratings and popularity metrics for personalized recommendations


The insights drawn provide valuable guidance for future content acquisition, production, and user experience strategies. Future research can dive deeper into user engagement data, viewer sentiment, and revenue-related KPIs for a more business-centric approach.




# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Amazon Prime Video has a vast and growing library of movies and TV shows available to users across different regions. However, understanding content performance, genre popularity, regional trends, and quality ratings is challenging without structured analysis. The platform lacks clear visibility into how different content types are distributed, what genres perform best, and how audience preferences vary over time and region. This lack of insight can impact content acquisition decisions, marketing strategies, and customer retention efforts.

To optimize its streaming strategy, Amazon Prime Video needs a detailed exploratory analysis of its content catalog. This should include understanding trends in content release, viewer ratings, content runtime, and distribution across countries and certifications.

#### **Define Your Business Objective?**

The primary objective of this project is to conduct Exploratory Data Analysis (EDA) on Amazon Prime Video’s content dataset to uncover meaningful trends and insights. Specifically, this project aims to:

Analyze the distribution of movies and TV shows across genres, years, and regions.

Evaluate content quality and popularity using IMDb and TMDB scores.

Identify trends in viewer preferences such as runtime, content types, and age certifications.

Detect regional production patterns, focusing on the top contributing countries.

Highlight missing or underrepresented areas in the content catalog that can be improved.

Support data-driven decision-making for content creation, licensing, and personalization.

By achieving these goals, Amazon Prime Video can make informed decisions that improve user satisfaction, boost content engagement, and enhance competitive positioning in the streaming market.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions in this format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")


### Dataset Loading

In [None]:
# Load Dataset
df1 = pd.read_csv('/content/credits.csv')
df2 = pd.read_csv('/content/titles.csv')
merged_df = pd.merge(df1, df2, on='id', how='inner')
merged_df



### Dataset First View

In [None]:
# Dataset First Look
merged_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
merged_df.shape

### Dataset Information

In [None]:
# Dataset Info
merged_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
merged_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
merged_df.isnull().sum()

In [None]:
# Visualizing the missing values
merged_df.isnull().sum().plot(kind='bar')
plt.show()

### What did you know about your dataset?

The dataset contains 124,347 rows and 19 columns after merging.

It appears to be related to movies or TV shows and their associated personnel (like actors, directors, etc.).

The data includes details such as:

* Titles, genres, release year, age certification, runtime
* IMDb and TMDb ratings/votes
* People involved (name, role, character)

The column person_id connects people to content, likely from a cast/crew dataset.

Some variables have missing values:

* age_certification (~54% missing)
* seasons (missing for most entries, likely because many are movies)
* imdb_score, imdb_votes, tmdb_score have some missing entries
* character is missing in many rows (possibly for crew or non-actors)

There are 168 duplicate rows, which can be safely removed during data cleaning.

The data types are mostly appropriate (strings, integers, floats), but columns like seasons may need special handling (many NaN values).

Overall, this is a rich dataset ideal for EDA, with a combination of:

* Categorical data (genre, type, role, certification)
* Numerical data (ratings, votes, runtime)
* Textual data (title, description)
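The missing-value figures listed above can be checked with a small helper. This is a minimal sketch: `missing_summary` is a hypothetical helper name, and the toy frame only stands in for `merged_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)


# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "seasons": [np.nan, 2.0, np.nan, np.nan],
    "imdb_score": [7.1, np.nan, 6.0, 5.5],
})
print(missing_summary(toy))
```

Calling `missing_summary(merged_df)` right after the merge would give the percentages for the real columns in one table.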

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged_df.columns

In [None]:
# Dataset Describe
merged_df.describe()


### Variables Description

| Column Name | Description |
|---|---|
| id | Unique identifier for each content item (movie/show). |
| title | Name of the movie or TV show. |
| type | Type of content, usually "MOVIE" or "SHOW". |
| description | Text summary or synopsis of the content. |
| release_year | The year the content was released. |
| age_certification | Age rating (e.g., PG, R, U/A) assigned to the content. |
| runtime | Duration of the movie/show in minutes. |
| genres | Genre(s) of the content (e.g., Action, Drama). |
| production_countries | Country or countries that produced the content. |
| seasons | Number of seasons (if it's a TV show). Mostly NaN for movies. |
| imdb_id | IMDb unique ID for the content. |
| imdb_score | IMDb rating (typically from 1 to 10). |
| imdb_votes | Number of IMDb user votes. |
| tmdb_popularity | Popularity score from TMDb (The Movie Database). |
| tmdb_score | TMDb user rating (typically from 1 to 10). |
| person_id | Unique ID of the person involved in the movie/show (actor, director, etc.). |
| name | Name of the person (cast or crew member). |
| character | Name of the character (if the person is an actor). |
| role | Role of the person (e.g., ACTOR, DIRECTOR, WRITER). |

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Unique values count for each column
for col in merged_df.columns:
    print(f"{col}: {merged_df[col].nunique()} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Drop exact duplicate rows
merged_df = merged_df.drop_duplicates()
# Drop rows where title or id is missing (if ever)
merged_df = merged_df.dropna(subset=['title', 'id'])

# Fill missing numerical scores with the column median
merged_df['imdb_score'] = merged_df['imdb_score'].fillna(merged_df['imdb_score'].median())
merged_df['tmdb_score'] = merged_df['tmdb_score'].fillna(merged_df['tmdb_score'].median())

# Fill missing character names with the most frequent value.
# mode() returns a Series, so take its first element; passing the whole Series
# to fillna aligns on index and leaves nearly every gap unfilled.
merged_df['character'] = merged_df['character'].fillna(merged_df['character'].mode()[0])

# Fill missing votes with 0 (indicating no votes)
merged_df['imdb_votes'] = merged_df['imdb_votes'].fillna(0)

# Fill missing age certification with 'Unknown'
merged_df['age_certification'] = merged_df['age_certification'].fillna('Unknown')

# For 'seasons', fill with 0 for movies
merged_df['seasons'] = merged_df['seasons'].fillna(0)

# Convert 'seasons' to integer
merged_df['seasons'] = merged_df['seasons'].astype(int)

# convert release_year to datetime
# merged_df['release_year'] = pd.to_datetime(merged_df['release_year'], format='%Y')

# Create a new column with the first genre (if multiple).
# Assumes genres is a stringified list such as "['drama', 'comedy']" (the format
# handled for production_countries in Chart 9), so strip brackets and quotes first.
genres_clean = merged_df['genres'].fillna('').str.strip('[]').str.replace("'", '', regex=False)
merged_df['main_genre'] = genres_clean.str.split(',').str[0].str.strip().replace('', 'Unknown')
merged_df.info()


### What all manipulations have you done and insights you found?

* Removed duplicate rows (168 duplicates dropped).
* Handled missing values:
  * Filled missing imdb_score and tmdb_score with median values.
  * Filled missing imdb_votes with 0 to represent no votes.
  * Filled missing age_certification with 'Unknown'.
  * Set missing seasons as 0 for movies.
* Converted seasons to integer for consistency.
* Extracted the main genre from the genres column for easier analysis.
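Extracting the main genre depends on how the genres column is stored. The sketch below assumes the values are stringified Python lists such as `"['drama', 'comedy']"` and falls back to a plain comma split otherwise. `parse_list_column` is a hypothetical helper, and the toy rows are illustrative only.

```python
import ast

import pandas as pd


def parse_list_column(value):
    """Turn a stringified list such as "['drama', 'comedy']" into a Python list."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str) or not value.strip():
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Fall back to a plain comma split for unbracketed strings
        return [part.strip() for part in value.split(",") if part.strip()]
    return list(parsed) if isinstance(parsed, (list, tuple)) else [str(parsed)]


# Toy data for illustration only
toy = pd.DataFrame({"genres": ["['drama', 'comedy']", "[]", "action, thriller", None]})
toy["genre_list"] = toy["genres"].apply(parse_list_column)
# First listed genre, or 'unknown' when the list is empty
toy["main_genre"] = toy["genre_list"].apply(lambda g: g[0] if g else "unknown")
print(toy[["genres", "main_genre"]])
```

Parsing into real lists also makes `explode` straightforward later, since no bracket or quote characters end up inside the genre labels.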

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:

# Chart 1: Count of content types
plt.figure(figsize=(6, 4))
sns.countplot(data=merged_df, x='type', palette='Set2')
plt.title('Distribution of Content Type')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a count plot to visualize the distribution of content types (Movie vs Show) because it clearly shows how the dataset is divided. This helps understand which type dominates and is more relevant for insights or recommendations.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we observe that:

The dataset contains significantly more Movies than TV Shows.

This imbalance might influence trends in ratings, genres, or runtime.

Business decisions should consider this distribution — for example, investing in shows may require different KPIs than for movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that movies dominate the content base helps teams prioritize:

* Content recommendations by type
* User engagement strategies (e.g., promoting more shows to diversify content)
* Platform resource planning (shows need more metadata like seasons/episodes)

If ignored, this imbalance might lead to:

* Misleading average metrics (like runtime, rating) being skewed by movies
* An underdeveloped strategy for growing the TV shows segment
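One caveat for the counts above: `merged_df` holds one row per credit (a person attached to a title), so a title with a large cast contributes many rows. Below is a minimal sketch, assuming `id` identifies a title, of counting each title once. `title_level` is a hypothetical helper and the toy frame is illustrative only.

```python
import pandas as pd


def title_level(df: pd.DataFrame, key: str = "id") -> pd.DataFrame:
    """Keep one row per title so each movie or show is counted once."""
    return df.drop_duplicates(subset=key).reset_index(drop=True)


# Toy credit-level data: one movie with three credits, one show with one credit
toy = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "ts1"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"],
    "name": ["Actor A", "Actor B", "Director C", "Actor D"],
})

credit_counts = toy["type"].value_counts()              # counts credit rows
title_counts = title_level(toy)["type"].value_counts()  # counts titles
print(credit_counts, title_counts, sep="\n")
```

Passing `title_level(merged_df)` to `sns.countplot` instead of `merged_df` would show the split by title rather than by credit. The two views can differ noticeably when movies and shows have different cast sizes.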

#### Chart - 2

In [None]:


# Chart 2: Average IMDb score by content type
plt.figure(figsize=(6, 4))
sns.barplot(data=merged_df, x='type', y='imdb_score', estimator='mean', palette='coolwarm')

plt.title('Average IMDb Score by Content Type')
plt.xlabel('Content Type')
plt.ylabel('Average IMDb Score')
plt.ylim(0, 10)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar plot to compare the average IMDb scores between movies and shows. It gives a quick and easy visual to evaluate the quality perception across content types.

##### 2. What is/are the insight(s) found from the chart?

TV Shows have a slightly higher average IMDb score compared to Movies.

This suggests that users tend to rate shows more favorably, possibly due to deeper storytelling or character development.

Movies have a more varied audience and may be subject to more critical reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — this insight can guide:

Investment decisions: Focus more on show production if engagement and satisfaction are higher.

Marketing strategies: Highlight top-rated shows to improve user retention.

Recommender systems: Consider type-based score weighting for better personalization.

Potential negative growth if ignored:

If lower-rated movies dominate the platform, it may reduce average satisfaction and cause churn. Investing in more engaging shows could reverse that trend.

#### Chart - 3

In [None]:


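# Split genres and explode into separate rows.
# Strip the list brackets and quotes first (genres is assumed to be stored as a
# stringified list, like production_countries in Chart 9).
genre_df = merged_df.copy()
genre_df['genres'] = genre_df['genres'].str.strip("[]").str.replace("'", "").str.split(', ')
genre_df = genre_df.explode('genres')
# Drop rows that had an empty genre list
genre_df = genre_df[genre_df['genres'] != '']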

# Count top 10 genres
top_genres = genre_df['genres'].value_counts().head(10)

# Plot
plt.figure(figsize=(8, 5))
sns.barplot(x=top_genres.values, y=top_genres.index, palette='viridis')
plt.title('Top 10 Most Frequent Genres')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()


##### 1. Why did you pick the specific chart?

This horizontal bar chart was chosen to show the frequency of genres — a crucial aspect for understanding audience preferences and content inventory. Horizontal layout is easier to read for long genre names.

##### 2. What is/are the insight(s) found from the chart?

The most common genres are likely to include Drama, Comedy, Action, and Thriller.

This reflects general market demand and production trends.

Lesser genres like Documentary or Musical may be underrepresented.

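##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The genre distribution can:

* Help decide content acquisition strategies, such as investing more in high-demand genres.
* Guide genre-based recommendations and user segmentation.
* Balance the portfolio: underperforming or less-common genres can be highlighted to niche audiences.

If ignored, there’s a risk of:

* Oversaturating common genres (causing fatigue)
* Missing opportunities in rising genres (e.g., documentaries)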

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12, 5))
sns.countplot(data=merged_df, x='release_year', palette='crest')
plt.title('Content Released per Year')
plt.xticks(rotation=90)
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

This bar chart shows how the number of titles released each year has changed over time. It gives a clear trend analysis of content production growth or decline.

##### 2. What is/are the insight(s) found from the chart?

There is likely a steady increase in content released in recent years, especially after 2015.

Possible dips in some years may reflect external factors (e.g., pandemic, strikes).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understand the growth trend of the platform or industry.

Forecast content pipeline planning.

Make data-driven budget decisions.
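The yearly trend can also be computed as the number of distinct titles per year, which is the series a line plot of content growth would use. This is a minimal sketch, assuming `id` and `release_year` columns as in `merged_df`. `titles_per_year` is a hypothetical helper and the toy rows are illustrative.

```python
import pandas as pd


def titles_per_year(df: pd.DataFrame) -> pd.Series:
    """Count distinct titles released in each year, in chronological order."""
    return df.groupby("release_year")["id"].nunique().sort_index()


# Toy data: title tm1 appears twice because it has two credits
toy = pd.DataFrame({
    "id": ["tm1", "tm1", "tm2", "ts1", "ts2"],
    "release_year": [2010, 2010, 2015, 2015, 2020],
})
yearly = titles_per_year(toy)
print(yearly)
```

`titles_per_year(merged_df).plot(kind='line')` would then draw growth over time without the repeated credit rows inflating years that have large casts.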

#### Chart - 5

In [None]:
# Chart - 5 visualization code
avg_score = genre_df.groupby('genres')['imdb_score'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(8, 5))
sns.barplot(x=avg_score.values, y=avg_score.index, palette='magma')
plt.title('Top 10 Genres by Average IMDb Score')
plt.xlabel('Average IMDb Score')
plt.ylabel('Genre')
plt.show()


##### 1. Why did you pick the specific chart?

This bar chart highlights the average IMDb rating per genre, helping identify which genres have higher audience approval.

##### 2. What is/are the insight(s) found from the chart?

Some less-common genres might have higher ratings (e.g., Biography, History).

Frequent genres like Action or Romance may have moderate ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It can help:

* Focus on quality-driven genres for critical acclaim.
* Create award-targeted content.
* Improve recommendation engines using genre-score correlation.
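Average scores for rarely occurring genres rest on few rows and can be noisy. The sketch below ranks genres by mean IMDb score while requiring a minimum number of rows per genre. The threshold `min_titles` and the helper name `genre_scores` are assumptions for illustration, and the toy data is not from the real dataset.

```python
import pandas as pd


def genre_scores(df: pd.DataFrame, min_titles: int = 2) -> pd.DataFrame:
    """Mean IMDb score per genre, keeping only genres with enough rows."""
    stats = df.groupby("genres")["imdb_score"].agg(mean_score="mean", n="count")
    stats = stats[stats["n"] >= min_titles]
    return stats.sort_values("mean_score", ascending=False)


# Toy exploded-genre data: 'history' has a single, very high-rated row
toy = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "history", "comedy", "comedy"],
    "imdb_score": [7.0, 6.0, 8.0, 9.5, 5.0, 6.0],
})
ranked = genre_scores(toy, min_titles=2)
print(ranked)
```

With the exploded `genre_df`, a larger threshold (for example a few hundred rows) would likely be more appropriate. The right value depends on how the genre counts are distributed.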

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(8, 5))
sns.countplot(data=merged_df, x='age_certification', order=merged_df['age_certification'].value_counts().index, palette='rocket')
plt.title('Count by Age Certification')
plt.xlabel('Age Certification')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart shows the distribution of content by age rating, helping understand the platform’s target audience.

##### 2. What is/are the insight(s) found from the chart?

"TV-MA" and "PG-13" are likely the most common.

There's an imbalance toward mature content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identify gaps in kid/family content.

Adjust content mix for a broader audience.

Align content rating strategy with user base.



#### Chart - 7

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(data=merged_df, x='type', y='runtime', palette='Set3')
plt.title('Runtime Distribution by Content Type')
plt.xlabel('Content Type')
plt.ylabel('Runtime (minutes)')
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot is ideal to compare distribution and outliers in runtime between Movies and Shows.

##### 2. What is/are the insight(s) found from the chart?

Movies have a wider range and a higher median runtime (the center line of each box is the median, not the mean).

Shows tend to be shorter per episode and have outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights can:

* Help optimize runtime for engagement.
* Suggest a sweet spot for content length.
* Avoid overproduction of too short or too long content.
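The outliers drawn by a boxplot follow the 1.5 x IQR rule, and the same rule can flag them in the data. This is a minimal sketch that applies the rule separately to each content type. `iqr_outliers` is a hypothetical helper and the toy runtimes are illustrative only.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy data: one unusually long movie among typical runtimes
toy = pd.DataFrame({
    "type": ["MOVIE"] * 6 + ["SHOW"] * 5,
    "runtime": [90, 95, 100, 105, 110, 400, 25, 30, 45, 50, 60],
})

# Apply the rule within each content type, then realign to the original rows
masks = [iqr_outliers(group) for _, group in toy.groupby("type")["runtime"]]
toy["is_outlier"] = pd.concat(masks).sort_index()
print(toy[toy["is_outlier"]])
```

Listing the flagged rows makes it possible to check whether extreme runtimes are data-entry problems or genuinely long titles before drawing conclusions about content length.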

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(7, 5))
sns.scatterplot(data=merged_df, x='imdb_score', y='tmdb_score', hue='type', alpha=0.6)
plt.title('IMDb Score vs TMDb Score')
plt.xlabel('IMDb Score')
plt.ylabel('TMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between two numeric ratings across platforms (IMDb vs TMDb).

##### 2. What is/are the insight(s) found from the chart?

Moderate positive correlation.

Some titles are rated higher on TMDb than IMDb.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Can choose which rating platform to trust.

Understand viewer perception differences.

Spot inconsistencies for further QA.
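The strength of the relationship seen in the scatter plot can be quantified with correlation coefficients. This is a minimal sketch: `score_agreement` is a hypothetical helper, Spearman is computed as Pearson on ranks, and the toy scores are illustrative.

```python
import pandas as pd


def score_agreement(df: pd.DataFrame) -> dict:
    """Pearson and Spearman correlation between IMDb and TMDb scores."""
    pair = df[["imdb_score", "tmdb_score"]].dropna()
    pearson = pair["imdb_score"].corr(pair["tmdb_score"])
    # Spearman correlation is the Pearson correlation of the ranks
    ranks = pair.rank()
    spearman = ranks["imdb_score"].corr(ranks["tmdb_score"])
    return {"pearson": pearson, "spearman": spearman, "n": len(pair)}


# Toy data: the last row has no TMDb score and is excluded
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 6.5],
    "tmdb_score": [5.5, 6.5, 6.0, 9.0, None],
})
result = score_agreement(toy)
print(result)
```

Note that rows whose scores were filled with the median during wrangling all share one value, which can pull the coefficients toward zero. Running this on the scores before imputation may give a cleaner estimate.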

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Explode countries
country_df = merged_df.copy()
country_df['production_countries'] = country_df['production_countries'].str.strip("[]").str.replace("'", "").str.split(', ')
country_df = country_df.explode('production_countries')

top_countries = country_df['production_countries'].value_counts().head(10)

plt.figure(figsize=(8, 5))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='icefire')
plt.title('Top 10 Production Countries')
plt.xlabel('Content Count')
plt.ylabel('Country')
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart makes it easier to display top production countries with long names.

##### 2. What is/are the insight(s) found from the chart?

US and GB likely dominate.

Emerging markets (e.g., India, Canada) are growing.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Localize content recommendations.

Guide international licensing decisions.

Spot underrepresented regions for growth.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(8, 5))
sns.histplot(merged_df['imdb_score'], bins=20, kde=True, color='skyblue')
plt.title('IMDb Score Distribution')
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram helps understand the overall shape of IMDb ratings.

##### 2. What is/are the insight(s) found from the chart?

Most titles are rated between 5.5 and 7.5.

Few outliers above 8 or below 4.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Define your "Top Rated" and "Low Rated" thresholds.

Understand audience expectations and gaps.

Optimize rating-based marketing strategies.
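Rating thresholds and decade buckets can be added as derived columns. The sketch below uses cut-offs of 5.5 and 7.5, chosen only because they match the range noted above. The thresholds, the labels, and the helper name `add_derived_columns` are assumptions rather than fixed definitions.

```python
import pandas as pd


def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add a release decade and a three-level IMDb score category."""
    out = df.copy()
    # 1994 -> 1990, 2019 -> 2010
    out["decade"] = (out["release_year"] // 10) * 10
    # Bins are [0, 5.5], (5.5, 7.5], (7.5, 10]
    out["score_category"] = pd.cut(
        out["imdb_score"],
        bins=[0, 5.5, 7.5, 10],
        labels=["Low", "Average", "High"],
        include_lowest=True,
    )
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "release_year": [1994, 2010, 2019],
    "imdb_score": [4.0, 6.8, 8.3],
})
derived = add_derived_columns(toy)
print(derived)
```

A `value_counts()` on `score_category` then gives the share of titles in each band, which is one way to make "Top Rated" and "Low Rated" concrete.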

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(8, 5))
sns.histplot(merged_df[merged_df['type'] == 'SHOW']['seasons'], bins=20, kde=True, color='orange')
plt.title('Distribution of Number of Seasons (TV Shows)')
plt.xlabel('Number of Seasons')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is ideal to see how long-running TV shows are in terms of number of seasons.

##### 2. What is/are the insight(s) found from the chart?

Most shows have 1 to 3 seasons.

Very few shows cross 5+ seasons.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Plan renewals or reboots.

Assess risk for long-term show commitments.

Decide whether to focus on mini-series or ongoing shows.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(8, 5))
sns.scatterplot(data=merged_df, x='tmdb_popularity', y='imdb_votes', hue='type', alpha=0.6)
plt.title('TMDB Popularity vs IMDb Votes')
plt.xlabel('TMDB Popularity')
plt.ylabel('IMDb Votes')
plt.show()


##### 1. Why did you pick the specific chart?

This scatter plot compares public engagement metrics — popularity and votes.

##### 2. What is/are the insight(s) found from the chart?

A positive correlation: more popular titles get more votes.

Some highly voted content isn't that popular anymore (legacy effect).



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identify viral or evergreen content.

Understand short-term popularity vs long-term value.
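Popularity and vote counts are usually heavily right-skewed, so a few blockbuster titles can compress the rest of a scatter plot into one corner. Below is a minimal sketch of a log transform that spreads the points out. `log_scale` is a hypothetical helper and the toy values are illustrative.

```python
import numpy as np
import pandas as pd


def log_scale(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Add log1p-transformed copies of the given non-negative columns."""
    out = df.copy()
    for col in columns:
        # log1p(x) = log(1 + x) keeps zero votes at zero
        out[f"log_{col}"] = np.log1p(out[col].clip(lower=0))
    return out


# Toy data spanning several orders of magnitude
toy = pd.DataFrame({
    "tmdb_popularity": [0.6, 5.0, 150.0],
    "imdb_votes": [0.0, 999.0, 999999.0],
})
scaled = log_scale(toy, ["tmdb_popularity", "imdb_votes"])
print(scaled)
```

Plotting `log_tmdb_popularity` against `log_imdb_votes` (or setting log axes on the plot) tends to make the relationship between the two engagement metrics easier to judge.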



#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(8, 5))
sns.countplot(data=merged_df, x='role', order=merged_df['role'].value_counts().index, palette='viridis')
plt.title('Distribution of Roles (Actor, Director, etc.)')
plt.xlabel('Role')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart shows the distribution of contributions by role — useful for talent analytics.

##### 2. What is/are the insight(s) found from the chart?

Most records are for Actors.

Few entries for Directors, Writers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Support credit transparency.

Help find underrepresented creators.

Optimize collaborations and hiring.

#### Chart - 14 - Correlation Heatmap

In [None]:
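# Correlation Heatmap visualization code
# person_id is an identifier, not a measurement, so exclude it from the correlations
plt.figure(figsize=(10, 6))
sns.heatmap(merged_df.drop(columns=['person_id']).corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()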


##### 1. Why did you pick the specific chart?

This heatmap reveals relationships between numeric variables, helping spot patterns or redundancies.

##### 2. What is/are the insight(s) found from the chart?

IMDb votes and popularity are positively correlated.

IMDb and TMDb scores have moderate alignment.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(merged_df[['imdb_score', 'tmdb_score', 'tmdb_popularity', 'imdb_votes']], diag_kind='kde')
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

Pair plots give multivariate comparison across several metrics in one visual.

##### 2. What is/are the insight(s) found from the chart?

Reinforces patterns seen in heatmap.

Popularity and votes show clustered linearity.

Explore relationships for ML models.

Understand how different metrics co-vary.

Spot natural groupings or outliers.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. **Content Strategy Optimization**
Focus more on high-rated genres (e.g., Documentary, Biography, Drama) as they consistently have better IMDb scores.

Reduce investment in overcrowded or underperforming genres with lower ratings and engagement.

Identify genres that perform well but are underproduced, and invest in them.

2. **Targeted Content by Age Group**
The dataset shows a high count of mature-rated content (e.g., TV-MA, R).

There is a content gap for younger age groups (e.g., PG, G, TV-Y).

Suggest producing more family-friendly or kid-oriented content to broaden audience base and attract new subscribers.

3. **Country-Specific Expansion**
Most content is produced in US, UK, and a few dominant countries.

Suggest investing in regional content creation (e.g., India, Brazil, South Korea) based on growing demand and global trends.

Use localized recommendations and subtitles/dubbing to drive international viewership.

4. **Performance Benchmarking**
Use IMDb and TMDB scores to benchmark content quality.

Identify creators (directors, actors) consistently associated with high-rated content and prioritize collaborations with them.

5. **Recommendation System Enhancements**
Use insights like popular genres, ratings, runtime, and age-certification preferences to refine personalized suggestions.

Implement hybrid recommender models using both content-based and collaborative filtering, using the available metadata.

6. **Content Lifespan & Retention**
Shows with fewer seasons dominate. Suggest experimenting with limited series and anthology formats, which are trending and lower risk.

Analyze which titles remain popular over time (high TMDB popularity + IMDb votes) for long-term licensing or spin-offs.
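The performance benchmarking suggestion (creators consistently associated with high-rated content) can be sketched as a grouped ranking. This is a minimal sketch: `top_creators`, the `min_titles` threshold, and the toy rows are assumptions for illustration. It groups by `name`, so people who share a name would be merged; grouping by `person_id` would avoid that.

```python
import pandas as pd


def top_creators(df: pd.DataFrame, role: str = "DIRECTOR",
                 min_titles: int = 2, top_n: int = 10) -> pd.DataFrame:
    """Rank people in a role by mean IMDb score across their titles."""
    # One row per (person, title) so repeated credits are not double counted
    subset = df[df["role"] == role].drop_duplicates(subset=["name", "id"])
    stats = subset.groupby("name")["imdb_score"].agg(mean_score="mean", titles="count")
    # Require several titles so one hit does not dominate the ranking
    stats = stats[stats["titles"] >= min_titles]
    return stats.sort_values("mean_score", ascending=False).head(top_n)


# Toy credit-level data for illustration only
toy = pd.DataFrame({
    "role": ["DIRECTOR"] * 5 + ["ACTOR"],
    "name": ["Dir A", "Dir A", "Dir B", "Dir B", "Dir C", "Act X"],
    "id": ["tm1", "tm2", "tm3", "tm4", "tm5", "tm1"],
    "imdb_score": [8.0, 7.0, 6.0, 6.5, 9.0, 8.0],
})
ranking = top_creators(toy, role="DIRECTOR", min_titles=2)
print(ranking)
```

Because missing IMDb scores were filled with the median during wrangling, creators with many imputed titles will drift toward the median. Excluding imputed rows before ranking may be worth considering.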







# **Conclusion**

This Exploratory Data Analysis (EDA) project on Amazon Prime Video’s content library provided comprehensive insights into the structure, trends, and distribution of TV shows and movies available on the platform. By analyzing two related datasets—one with content metadata and another with cast and crew information—we successfully merged and cleaned the data, making it suitable for detailed analysis.



Through visual exploration and statistical summaries, we discovered several key patterns:

* Movies dominate the platform compared to TV shows, though shows tend to have multiple seasons and longer content life cycles.

* Drama and Comedy emerged as the most popular genres, with strong representation across years and countries.

* The United States is the leading content producer, followed by India and the UK, indicating strong regional content hubs.

* A significant increase in content production after 2010 was observed, showcasing Amazon’s growing investment in digital entertainment.

* IMDb and TMDB scores revealed that most content is rated moderately (between 6.0 and 7.5), and user ratings do not always align with popularity scores.

* Content targeting mature audiences (TV-MA, R) is more frequent, with relatively fewer titles for kids and family audiences.



These findings help uncover user preferences, identify content gaps, and support decisions related to content acquisition, regional expansion, and personalized recommendations. The insights can aid Amazon Prime Video in designing a more targeted content strategy that resonates with its global user base.

In conclusion, this project demonstrates how data-driven storytelling can play a vital role in enhancing platform offerings. By continuing this type of analysis regularly, streaming platforms like Amazon Prime Video can maintain a competitive edge, improve user engagement, and drive content innovation.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***