<a href="https://colab.research.google.com/github/Piyush20002/Amazon_Prime_EDA./blob/main/Amazon_Prime_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -
Amazon EDA


##### **Project Type**    - EDA
##### **Contribution**    - Individual
by Piyush Chaudhari


# **Project Summary -**


 This project aims to conduct an Exploratory Data Analysis (EDA) on the Amazon Prime Video content library, focusing on TV shows and movies available in the United States. The analysis will leverage two datasets: "titles.csv", containing information about the titles (ID, name, type, description, release year, age certification, runtime, genres, production countries, seasons, IMDb/TMDB scores, and popularity), and "credits.csv", which provides details about the cast and crew (person ID, title ID, name, character name, and role).

 The primary objective is to extract valuable insights into Amazon Prime's content strategy, audience preferences, and market trends. Specifically, the EDA will address key problem statements such as:
* Content Diversity: Identifying dominant genres and categories on the platform.
* Regional Availability: Understanding how content distribution varies (though the dataset is US-specific, this might involve looking at production countries).
* Trends Over Time: Analyzing the evolution of Amazon Prime's content library over the years.
* IMDb Ratings & Popularity: Discovering the highest-rated and most popular shows/movies.

Through comprehensive data cleaning, manipulation, and visualization, this analysis will uncover patterns and relationships within the data. Various charts, including univariate, bivariate, and multivariate analyses, will be employed to present findings clearly. Each visualization will be accompanied by an explanation of why it was chosen, the insights derived, and its potential business impact (positive or negative growth). The goal is to provide data-driven recommendations that can influence subscription growth, enhance user engagement, and optimize content investment strategies in the competitive streaming industry. The final output will be a well-structured and commented Python notebook, adhering to best practices for production-grade code.


# **GitHub Link -**

https://github.com/Piyush20002/Amazon_Prime_EDA.

# **Problem Statement**


To analyze the extensive dataset of Amazon Prime Video content (TV shows and movies) and their associated cast/crew to uncover key trends, audience preferences, and content performance metrics. The analysis aims to provide actionable insights for optimizing content strategy, improving user engagement, and driving subscription growth.


#### **Define Your Business Objective?**

The business objective is to gain a deeper understanding of the Amazon Prime Video content ecosystem to inform strategic decisions related to content acquisition, production, and marketing. This includes:
1.  Identifying popular content types and genres: To prioritize future content investments.
2.  Understanding content performance: Analyzing IMDb/TMDB scores and popularity to gauge audience reception.
3.  Tracking content evolution: Observing trends in content release over time to adapt to changing market demands.
4.  Optimizing content library diversity: Ensuring a balanced mix of content to appeal to a broad audience.
5.  Leveraging cast/crew data: Understanding the impact of actors/directors on content popularity.

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import warnings
warnings.filterwarnings('ignore')
import ast # For safely evaluating string literals of lists


In [None]:
# Set plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

### Dataset Loading

In [None]:
# Load Dataset
try:
    titles_df = pd.read_csv('titles.csv')
    credits_df = pd.read_csv('credits.csv')
    print("Datasets loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading file: {e}. Please ensure 'titles.csv' and 'credits.csv' are in the correct directory.")
    raise  # Re-raise so later cells do not run without data (exit() would kill the notebook kernel)

### Dataset First View

In [None]:
# Dataset First Look
titles_df.head()


In [None]:
credits_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Titles Dataset - Rows: {titles_df.shape[0]}, Columns: {titles_df.shape[1]}")
print(f"Credits Dataset - Rows: {credits_df.shape[0]}, Columns: {credits_df.shape[1]}")


### Dataset Information

In [None]:
# Dataset Info
titles_df.info()

credits_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"Titles Dataset - Duplicate Rows: {titles_df.duplicated().sum()}")
print(f"Credits Dataset - Duplicate Rows: {credits_df.duplicated().sum()}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing Values in Titles Dataset:")
print(titles_df.isnull().sum())

print("\nMissing Values in Credits Dataset:")
print(credits_df.isnull().sum())

In [None]:
# Percentage of null values in titles_df
titles_null_percent = (titles_df.isnull().sum() / len(titles_df)) * 100
titles_null_percent = titles_null_percent[titles_null_percent > 0].sort_values(ascending=False)

print("Percentage of Null Values in Titles Dataset:")
print(titles_null_percent)



In [None]:
# Percentage of null values in credits_df
credits_null_percent = (credits_df.isnull().sum() / len(credits_df)) * 100
credits_null_percent = credits_null_percent[credits_null_percent > 0].sort_values(ascending=False)

print("\nPercentage of Null Values in Credits Dataset:")
print(credits_null_percent)


In [None]:
# Visualizing the missing values/ Null Values Count
# Bar plot of null percentage for titles_df
titles_null_percent.plot(kind='barh', color='salmon')
plt.title('Percentage of Missing Values in Titles Dataset')
plt.xlabel('Percentage')
plt.gca().invert_yaxis()
plt.show()
# Bar plot of null percentage for credits_df
credits_null_percent.plot(kind='barh', color='skyblue')
plt.title('Percentage of Missing Values in Credits Dataset')
plt.xlabel('Percentage')
plt.gca().invert_yaxis()
plt.show()


### What did you know about your dataset?


* The `titles.csv` dataset contains information about Amazon Prime Video titles, including their ID, name, type (movie/TV show), description, release year, age certification, runtime, genres, production countries, number of seasons, IMDb score, IMDb votes, TMDB popularity, and TMDB score. It has approximately 9600 entries and 15 columns.
* Key observations from `titles_df`:
  * `description`, `age_certification`, `runtime`, `genres`, `production_countries`, `seasons`, `imdb_id`, `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score` columns have missing values.
  * `seasons` column is specifically for TV shows and has a large number of missing values for movies, which is expected.
  * `imdb_id` is mostly missing, which might limit analysis based on external IMDb data if not handled carefully.
  * `genres` and `production_countries` are stored as string representations of lists, requiring parsing.

* The `credits.csv` dataset contains information about the cast and crew (actors and directors) for each title. It includes person ID, title ID, name, character name, and role. It has over 124,000 entries and 5 columns.
* Key observations from `credits_df`:
  * `character` has a significant number of missing values, which is expected as directors won't have a character name.
  * Both datasets have `id` columns which can be used for merging.
  * There are no duplicate rows in either dataset after initial checks.
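
The note above about list-like strings can be illustrated with a minimal sketch. It assumes the raw values look like `"['drama', 'comedy']"`; the helper name `parse_list_string` and the sample values are hypothetical and not part of the notebook.

```python
import ast

import pandas as pd


def parse_list_string(value):
    """Turn a string such as "['drama', 'comedy']" into a real Python list.

    Hypothetical helper: anything that is missing or cannot be parsed
    becomes an empty list.
    """
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return []
        return list(parsed) if isinstance(parsed, (list, tuple)) else []
    return []


# Made-up sample that mimics the raw `genres` column
sample = pd.Series(["['drama', 'comedy']", "[]", None, "['action']"])
parsed_genres = sample.apply(parse_list_string)
print(parsed_genres.tolist())
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a CSV file.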

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Titles Dataset Columns:")
print(titles_df.columns)

print("\nCredits Dataset Columns:")
print(credits_df.columns)




In [None]:
# Dataset Describe
titles_df.describe(include='all').T


In [None]:
credits_df.describe(include='all').T

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_titles_df = titles_df.nunique()
unique_credits_df = credits_df.nunique()

print("Unique Values in Titles Dataset:")
print(unique_titles_df)

print("\nUnique Values in Credits Dataset:")
print(unique_credits_df)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Merge the two datasets on 'id'
merged_df = pd.merge(titles_df, credits_df, on='id', how='left')

# Check shape and null values
merged_shape = merged_df.shape
missing_values = merged_df.isnull().sum()

merged_shape, missing_values.sort_values(ascending=False).head(10)


In [None]:
pd.set_option('display.max_columns', None)
merged_df.head()

In [None]:
# Convert 'release_year' to datetime and extract 'year_released'
# Since 'release_year' is only a year, we append '-01-01' to make it a full date

merged_df['release_year'] = pd.to_datetime(merged_df['release_year'].astype(str) + '-01-01', errors='coerce')
merged_df['year_released'] = merged_df['release_year'].dt.strftime('%Y')

# Verify the changes
merged_df[['release_year', 'year_released']].head()


In [None]:
# Drop the 'release_year', 'description', and 'imdb_id' columns
merged_df.drop(['release_year', 'description', 'imdb_id'], inplace= True, axis=1)


In [None]:
# dropping imdb votes column

merged_df.drop(columns=["imdb_votes"], inplace=True)

In [None]:
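# The raw values are strings read from CSV, so an empty list appears as the text '[]'; replace it with NaN
merged_df['production_countries'] = merged_df['production_countries'].apply(lambda x: np.nan if x == '[]' else x)

# Impute missing values with mode (assign back instead of using inplace=True on a column selection)
mode_country = merged_df['production_countries'].mode()[0]
merged_df['production_countries'] = merged_df['production_countries'].fillna(mode_country)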

# Check number of unique country lists
merged_df['production_countries'].nunique()

In [None]:
merged_df['production_countries'] = merged_df['production_countries'].astype(str).str.strip("[]").str.replace("'", "").str.split(',').str[0]



In [None]:
# Create a new column 'country' to replace 'production_countries'
merged_df['country'] = merged_df['production_countries'].apply(lambda x: x[0] if isinstance(x, list) else x)



In [None]:
merged_df['country'].nunique()


In [None]:
merged_df.drop(columns=['production_countries'], inplace=True)


In [None]:
# Strip brackets and quotes but keep the commas, so "['drama', 'comedy']" becomes "drama, comedy"
merged_df['genres'] = merged_df['genres'].apply(
    lambda x: str(x).replace('[', '').replace(']', '').replace("'", "").strip()
    if isinstance(x, (str, list)) else x
)
merged_df['genres']

In [None]:
# Fill missing values in age_certification with the mode
merged_df['age_certification'] = merged_df['age_certification'].fillna(merged_df['age_certification'].mode()[0])


In [None]:
# seasons column is filled with data not available

merged_df['seasons']= merged_df['seasons'].fillna('No data reported')

In [None]:
# Data imputation for handling missing values (assign back instead of inplace=True on a column selection)

merged_df['imdb_score'] = merged_df['imdb_score'].fillna(merged_df['imdb_score'].median())
merged_df['tmdb_popularity'] = merged_df['tmdb_popularity'].fillna(merged_df['tmdb_popularity'].mean())
merged_df['tmdb_score'] = merged_df['tmdb_score'].fillna(merged_df['tmdb_score'].median())

merged_df['imdb_score'] = merged_df['imdb_score'].round(1)
merged_df['tmdb_score'] = merged_df['tmdb_score'].round(1)
merged_df['tmdb_popularity'] = merged_df['tmdb_popularity'].round(1)

In [None]:
merged_df.dropna(subset=['person_id'], inplace=True)


In [None]:
# Fill missing values
merged_df['character'] = merged_df['character'].fillna('Unknown')
merged_df['role'] = merged_df['role'].fillna('Unknown')
merged_df['name'] = merged_df['name'].fillna('Unknown')

# Reset index after cleaning
merged_df.reset_index(drop=True, inplace=True)

# Final check on cleaned data
cleaned_shape = merged_df.shape
remaining_nulls = merged_df.isnull().sum().sort_values(ascending=False).head(10)

cleaned_shape, remaining_nulls

In [None]:
merged_df.head()

In [None]:
merged_df.isnull().sum()

### What all manipulations have you done and insights you found?


To prepare the dataset for effective exploratory data analysis, multiple data wrangling steps were performed. First, two separate datasets—`titles.csv` and `credits.csv`—were merged using a left join on the common `id` column. This integration brought together comprehensive metadata and cast/crew details, forming a unified DataFrame ready for in-depth analysis.
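
A small sketch may help show what the left join does to row counts. The two miniature tables below are made up for illustration; only the `id` join key and the `type`, `name`, and `role` column names mirror the notebook. Because each title is repeated once per credit, counts taken directly on the merged frame are counts of credits, and `drop_duplicates(subset="id")` is one way to get back to one row per title.

```python
import pandas as pd

# Hypothetical miniature versions of titles.csv and credits.csv
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["Alpha", "Beta", "Gamma"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
})
credits = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2"],
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR"],
})

# indicator=True adds a `_merge` column showing which side each row came from
merged = pd.merge(titles, credits, on="id", how="left", indicator=True)

# One row per credit: title t1 is counted three times
rows_per_credit = merged["type"].value_counts()

# One row per title: each title is counted once
rows_per_title = merged.drop_duplicates(subset="id")["type"].value_counts()

# Titles that found no matching credit rows
unmatched_titles = int((merged["_merge"] == "left_only").sum())

print(len(merged), rows_per_credit.to_dict(), rows_per_title.to_dict(), unmatched_titles)
```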

Next, missing values were systematically handled. For example, missing values in `age_certification` were filled with the mode (most frequent certification) to retain the categorical balance of age ratings. `person_id` nulls (around 1,000 rows) were dropped, as this field is crucial for uniquely identifying individuals in cast data. Columns such as `description` and `imdb_id` were removed entirely due to their limited analytical value or high proportion of missing data.
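
The imputation described above can be sketched on a toy frame. The data below is invented; the pattern of recording which rows were missing before filling them is an optional extra, shown here as an assumption rather than something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in a categorical and a numeric column
df = pd.DataFrame({
    "age_certification": ["R", "PG", None, "R", None],
    "imdb_score": [6.0, np.nan, 7.0, 9.0, np.nan],
})

# Remember which scores were imputed so they can be excluded later if needed
imdb_was_missing = df["imdb_score"].isna()

# Categorical column: fill with the most frequent value (mode)
df["age_certification"] = df["age_certification"].fillna(df["age_certification"].mode()[0])

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
df["imdb_score"] = df["imdb_score"].fillna(df["imdb_score"].median())

print(df)
print(int(imdb_was_missing.sum()), "scores were imputed")
```

Assigning the filled column back, instead of calling `fillna(..., inplace=True)` on a column selection, avoids the chained-assignment pitfalls of recent pandas versions.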

The `genres` and `production_countries` columns were originally stored as string representations of Python lists. Rather than parsing them into actual lists (for example with `ast.literal_eval`), they were cleaned with string operations. For `production_countries`, the brackets and quotes were stripped and only the first country was kept, which was then stored in a new column `country` to simplify category-based analysis. Similarly, the `genres` column was cleaned of brackets and quotes and transformed into a clean, comma-separated string to enable readable and meaningful genre-level analysis.

Furthermore, the `release_year` column was converted into a datetime format, and a new column `year_released` was created to facilitate time-based trend analysis. Unnecessary columns like `release_year`, `description`, and `imdb_id` were dropped to streamline the dataset.

#### Insights Found During Wrangling:

* The `seasons` column had missing values primarily for movies, confirming that it's only applicable to TV shows.
* A significant number of entries lacked `imdb_score`, `tmdb_score`, or `imdb_votes`, suggesting many titles may not have gained enough visibility or user engagement for rating data.
* The original format of genre and country data emphasized the complexity of multi-label classification in entertainment metadata.
* There was a heavy skew toward U.S.-produced content, which indicates a possible bias in platform content distribution.
* The presence of detailed cast and role data opens up possibilities for actor or role-based trend analysis in future stages.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Distribution of Content Type (Movies vs Shows)

In [None]:
# Distribution of Content Type (Movies vs Shows)
plt.figure(figsize=(8, 5))
# merged_df has one row per credit, so count each title once
type_counts = merged_df.drop_duplicates(subset='id')['type'].value_counts()
sns.barplot(x=type_counts.index, y=type_counts.values, palette='viridis')
plt.title('Distribution of Content Type on Amazon Prime', fontsize=14, pad=20)
plt.xlabel('Content Type')
plt.ylabel('Count')
for i, v in enumerate(type_counts.values):
    plt.text(i, v + 500, str(v), ha='center', va='bottom')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To compare the number of movies and shows on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?

 Shows whether movies or shows are more prevalent.

##### 3. Will the gained insights help creating a positive business impact?
Positive Impact: Helps in making data-driven decisions about future content acquisition and production.

Negative Insight: Over-reliance on one type of content could limit the platform's appeal to a broader audience.


#### Chart - 2 Top 10 Most Frequent Genres on Amazon Prime

In [None]:
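# Chart - 2 Top 10 Most Frequent Genres on Amazon Prime
# Count each title once (merged_df has one row per credit), then split the
# comma-separated genre strings into individual genres
genre_series = merged_df.drop_duplicates(subset='id')['genres'].dropna().str.split(', ')
all_genres = genre_series.explode()
all_genres = all_genres[all_genres != '']  # ignore titles with no genre listed
top_genres = all_genres.value_counts().head(10)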

# Plotting
plt.figure(figsize=(10, 6))
sns.barplot(x=top_genres.values, y=top_genres.index, palette='rocket')
plt.title('Top 10 Most Frequent Genres on Amazon Prime', fontsize=14)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

To identify the most common genres on Amazon Prime. A horizontal bar chart clearly ranks the top genres by frequency.
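
Counting genres when a title can carry several of them is a split-then-explode operation. The following is a minimal sketch on made-up strings, assuming the cleaned `genres` values are separated by a comma and a space.

```python
import pandas as pd

# Hypothetical cleaned genre strings, one per title; the last title has no genre
genres = pd.Series(["drama, comedy", "drama", "action, drama, thriller", ""])

# Split each string into a list, then give every genre its own row
exploded = genres.str.split(", ").explode().reset_index(drop=True)

# Drop the empty entry produced by the title without genres, then count
genre_counts = exploded[exploded != ""].value_counts()

print(genre_counts)
```

Each title contributes once to every genre it lists, so the counts sum to more than the number of titles.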

##### 2. What is/are the insight(s) found from the chart?

Popular Genres: Identifies which genres are most prevalent on the platform.
Content Preferences: Indicates viewer preferences and content trends.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Helps in tailoring content recommendations and acquisition strategies.
Negative Insight: Over-reliance on a few genres could limit the platform's appeal to a broader audience.

#### Chart - 3 Trend of Content Releases by Year

In [None]:
# Trend of Content Releases by Year
plt.figure(figsize=(14, 7))
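# Count each title once (merged_df has one row per credit) and use integer years so the x-axis is numeric
yearly_counts = merged_df.drop_duplicates(subset='id')['year_released'].astype(int).value_counts().sort_index()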
plt.plot(yearly_counts.index, yearly_counts.values, marker='o', linewidth=2, markersize=4)
plt.title('Trend of Content Releases by Year', fontsize=14)
plt.xlabel('Year')
plt.ylabel('Number of Titles')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To visualize the trend in the number of content releases over the years. A line plot clearly shows the temporal trend and helps identify growth phases or declines.


##### 2. What is/are the insight(s) found from the chart?

Identifies years with significant increases in content releases

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4 Top 10 Release Years by Content Count

In [None]:
# Chart - 4 Top 10 Release Years by Content Count
# Convert 'year_released' to integer for sorting
merged_df['year_released'] = merged_df['year_released'].astype(int)

# Count each title once (merged_df has one row per credit), then keep the 10 busiest years
title_level_df = merged_df.drop_duplicates(subset='id')
top_years = title_level_df['year_released'].value_counts().head(10).index
filtered_df = title_level_df[title_level_df['year_released'].isin(top_years)]

plt.figure(figsize=(12, 6))
sns.countplot(data=filtered_df, x='year_released', hue='type', palette='pastel')
plt.title('Top 10 Release Years by Content Count (Movies vs Shows)', fontsize=14)
plt.xlabel('Year Released', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.legend(title='Content Type')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To compare the number of movies and shows released in the top 10 years with the highest content count. A grouped bar chart clearly shows the distribution of content types across the top release years.

##### 2. What is/are the insight(s) found from the chart?

Identifies the years with the highest number of content releases.
Shows the balance between movies and shows in the top release years.

#### Chart - 5 Top 10 Countries by Content

In [None]:
# Chart - 5 Top 10 Countries by Content
plt.figure(figsize=(10, 6))
# Count each title once (merged_df has one row per credit)
country_counts = merged_df.drop_duplicates(subset='id')['country'].value_counts().head(10)
sns.barplot(y=country_counts.index, x=country_counts.values, palette='coolwarm')
plt.title('Top 10 Content-Producing Countries', fontsize=14)
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To identify the top countries contributing the most content to Amazon Prime. A horizontal bar chart clearly ranks the top countries by the number of titles produced.

##### 2. What is/are the insight(s) found from the chart?

Identifies which countries are the primary producers of content on the platform.
 Highlights any significant dominance by specific countries.

#### Chart - 6 IMDb Score Distribution

In [None]:
# Chart - 6 IMDb Score Distribution
plt.figure(figsize=(10, 6))
# Use one row per title (merged_df has one row per credit)
title_scores = merged_df.drop_duplicates(subset='id')['imdb_score']
plt.hist(title_scores, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(title_scores.mean(), color='red', linestyle='--',
            label=f'Mean: {title_scores.mean():.2f}')
plt.title('Distribution of IMDb Scores', fontsize=14)
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To visualize the distribution of IMDb scores across all titles. A histogram provides a clear view of the frequency distribution and helps identify the central tendency and spread of the scores.

##### 2. What is/are the insight(s) found from the chart?

Highlights the mean IMDb score, providing a benchmark for average content quality. Shows the range and concentration of IMDb scores.

#### Chart - 7 Average IMDb Score by Country

In [None]:
# Chart - 7 Average IMDb Score by Country
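# Aggregate at title level (merged_df has one row per credit) so 'count' is the number of titles
country_scores = merged_df.drop_duplicates(subset='id').groupby('country')['imdb_score'].agg(['mean', 'count']).reset_index()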
country_scores = country_scores[country_scores['count'] >= 50].sort_values('mean', ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(data=country_scores, y='country', x='mean', palette='viridis')
plt.title('Average IMDb Score by Country (min 50 titles)', fontsize=14)
plt.xlabel('Average IMDb Score')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To identify countries with the highest average IMDb scores. Only countries with at least 50 titles are included, so that averages based on a handful of titles do not dominate the ranking. A horizontal bar chart clearly ranks the top countries by average IMDb score.
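
The minimum-count filter can be sketched on a toy table. The data and the `MIN_TITLES` name are hypothetical, and the threshold is lowered to 2 so the effect is visible on six rows; the notebook uses 50.

```python
import pandas as pd

# Hypothetical title-level scores
scores = pd.DataFrame({
    "country": ["US", "US", "US", "GB", "GB", "IN"],
    "imdb_score": [6.0, 7.0, 8.0, 7.0, 9.0, 9.5],
})

MIN_TITLES = 2  # illustrative threshold only

# Compute the mean score and the number of titles per country in one pass
summary = scores.groupby("country")["imdb_score"].agg(["mean", "count"]).reset_index()

# Keep countries with enough titles, then rank by average score
ranked = summary[summary["count"] >= MIN_TITLES].sort_values("mean", ascending=False)

print(ranked)
```

Without the filter, the single "IN" title with a score of 9.5 would top the ranking even though one title says little about a country's average.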

##### 2. What is/are the insight(s) found from the chart?

Identifies which countries produce content with higher average IMDb scores. Highlights regions that consistently produce high-quality content.

#### Chart - 8 Movies and Shows count by Age Certification


In [None]:
# Movies and Shows count by Age Certification
plt.figure(figsize=(12, 6))
# Count each title once (merged_df has one row per credit)
sns.countplot(data=merged_df.drop_duplicates(subset='id'), x='age_certification', hue='type', palette='Set2')
plt.title('Count of Movies and Shows by Age Certification', fontsize=14)
plt.xlabel('Age Certification', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.legend(title='Type of Content', loc='upper right')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To compare the number of movies and shows across different age certifications.
A grouped bar chart clearly shows the distribution of content types by age certification.

##### 2. What is/are the insight(s) found from the chart?

Helps in understanding the platform's content strategy and identifying successful age certifications. Identifies any imbalance between content types for different age groups, indicating potential areas for diversification.

#### Chart - 9 Average IMDb Scores Over the Years

In [None]:
# Chart - 9 Average IMDb Scores Over the Years
# Calculate average IMDb scores over the years, counting each title once
average_imdb_by_year = merged_df.drop_duplicates(subset='id').groupby('year_released')['imdb_score'].mean()

plt.figure(figsize=(12, 6))
plt.plot(average_imdb_by_year.index, average_imdb_by_year.values, marker='o', color='green')
plt.title('Time Series of Average IMDb Scores Over the Years', fontsize=14)
plt.xlabel('Year Released', fontsize=12)
plt.ylabel('Average IMDb Score', fontsize=12)
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To visualize how the average IMDb scores have changed over the years.
A line plot clearly shows the temporal trend in average IMDb scores.

##### 2. What is/are the insight(s) found from the chart?

Identifies years with higher or lower average IMDb scores, indicating changes in content quality over time.
Highlights any significant fluctuations or trends in content quality.

#### Chart - 10 Correlation Heatmap

In [None]:
# Chart - 10 Correlation Heatmap
plt.figure(figsize=(10, 8))
numerical_cols = ['year_released', 'runtime', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
available_cols = [col for col in numerical_cols if col in merged_df.columns]
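# Correlate at title level so titles with large casts are not over-weighted
correlation_matrix = merged_df.drop_duplicates(subset='id')[available_cols].corr()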

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": .8})
plt.title('Correlation Heatmap of Numerical Variables', fontsize=14)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To visualize the correlation between numerical variables in the dataset.
A heatmap provides a clear and intuitive way to see the relationships between different numerical attributes.
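
As a quick illustration of what the heatmap cells contain, the sketch below computes a correlation matrix on invented numbers. `DataFrame.corr()` returns Pearson correlation by default; the column names mirror the notebook, but the values are made up so that the expected signs are obvious.

```python
import pandas as pd

# Hypothetical data: score rises with runtime, popularity falls with runtime
df = pd.DataFrame({
    "runtime": [90, 100, 110, 120],
    "imdb_score": [6.0, 6.5, 7.0, 7.5],
    "tmdb_popularity": [8.0, 6.0, 4.0, 2.0],
})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr()

print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship; correlation alone does not show that one variable causes the other.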

##### 2. What is/are the insight(s) found from the chart?

Correlation Analysis: Identifies which variables are positively or negatively correlated.
Strong Relationships: Highlights strong correlations that might indicate important relationships, such as between runtime and IMDb score.

#### Chart - 11 Top 10 Most Prolific Directors on Amazon Prime

In [None]:
# Chart - 11 Top 10 Most Prolific Directors on Amazon Prime
director_data = merged_df[merged_df['role'] == 'DIRECTOR']
if not director_data.empty:
    top_directors = director_data['name'].value_counts().head(10)

    plt.figure(figsize=(12, 6))
    sns.barplot(y=top_directors.index, x=top_directors.values, palette='plasma')
    plt.title('Top 10 Most Prolific Directors on Amazon Prime', fontsize=14)
    plt.xlabel('Number of Titles')
    plt.ylabel('Director Name')
    plt.tight_layout()
    plt.show()
else:
    print("No director data available")

##### 1. Why did you pick the specific chart?

To identify the most prolific directors on Amazon Prime.
 A horizontal bar chart clearly ranks the top directors by the number of titles they have directed.

##### 2. What is/are the insight(s) found from the chart?

Director Activity: Identifies which directors have the most titles on the platform.
Content Strategy: Highlights directors who might be key figures for future content acquisition or collaboration.

# **Conclusion**

In conclusion, this Exploratory Data Analysis (EDA) on the Amazon Prime Video content library has provided valuable insights into the platform's content strategy, audience preferences, and market trends. The analysis revealed that genres like "Drama," "Comedy," and "Action" are the most prevalent, indicating a strong focus on content that appeals to a broad audience. The United States is the primary producer of content, followed by other countries like the United Kingdom and India, highlighting the platform's reliance on major production hubs.

The trend analysis showed a steady increase in content releases over the years, suggesting a strategic expansion to meet growing demand. The average IMDb scores have fluctuated, indicating variability in content quality, but the overall trend suggests a focus on maintaining high standards. The analysis also identified the most prolific directors on the platform. Title count alone does not measure a director's impact, but these directors are natural candidates for a closer look at content popularity and success.

These insights can help Amazon Prime tailor its content acquisition and production strategies to better align with audience preferences and market trends, enhance user engagement, and drive subscription growth. However, the dominance of specific genres and countries could limit the platform's appeal to a broader audience, suggesting a need for diversification. By focusing on quality, leveraging popular directors, and regularly monitoring trends, Amazon Prime can continue to optimize its content strategy and enhance user experience in the competitive streaming industry.