# **Project Name**    -  Amazon Prime EDA






##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name - RASHMI L**


# **Project Summary -**

This project focuses on performing a structured Exploratory Data Analysis (EDA) on the Amazon Prime Video titles and credits dataset with the goal of uncovering patterns, trends, and actionable insights that could support informed decision-making. The analysis follows a systematic approach: understanding the data, cleaning and preprocessing, applying descriptive statistics, visualizing relationships, and interpreting findings in a way that aligns with real-world business objectives.

The process begins with a data familiarization stage, where key attributes, data types, and missing values are examined. This step is crucial to ensure a strong foundation for analysis. Handling missing values, correcting inconsistencies, and addressing outliers are integral to building reliable datasets. These preprocessing techniques not only improve the accuracy of analysis but also simulate real-world scenarios, where raw business data often arrives in incomplete or unstructured form.

Once the dataset is prepared, the analysis is carried out using the UBM framework:

Univariate analysis explores individual variables, providing clarity on distributions, central tendencies, and variability. Charts such as histograms, box plots, and bar charts help identify skewness, spread, and frequency distributions, allowing us to spot anomalies or imbalances in categorical and numerical features.

Bivariate analysis investigates relationships between two variables. Here, scatter plots, correlation heatmaps, and grouped bar charts are particularly insightful. They reveal associations such as positive or negative correlations, categorical impacts on numerical outcomes, and dependencies between features.

Multivariate analysis combines multiple features to identify deeper patterns. Techniques such as pair plots and advanced visualizations demonstrate how several variables interact simultaneously, leading to insights that go beyond simple two-variable comparisons.

An important aspect of this project is not just the creation of visualizations but also the interpretation of insights in a business context. For every chart, the analysis explains why a particular visualization was chosen, what patterns it highlights, and how those insights could translate into business actions. For example, identifying highly correlated features may inform feature selection in predictive modeling, while detecting customer segments with distinct behavior could guide marketing strategies.

Throughout the project, the emphasis is on clarity, reproducibility, and communication. Well-commented code ensures that each step is transparent and can be understood by both technical and non-technical audiences. Structured explanations bridge the gap between data findings and business relevance, showcasing not just technical competency but also storytelling ability—an essential skill for data analysts.

This project also reflects best practices in analytical rigor and professional presentation. By adhering to guidelines such as creating at least 20 meaningful charts, providing justifications for each, and ensuring the notebook runs error-free from start to finish, the work demonstrates a balance of technical excellence and business orientation. Furthermore, the inclusion of exception handling and production-ready practices ensures that the project is more than an academic exercise; it models the standards expected in professional analytics environments.

In conclusion, this EDA project illustrates a strong ability to transform raw data into actionable insights. From meticulous cleaning and preprocessing to advanced visualization and interpretation, it highlights a well-rounded analytical approach. By focusing on both statistical validity and business impact, the project not only uncovers patterns within the dataset but also communicates them in a way that decision-makers can act upon. This balance of technical skill and business insight is the hallmark of effective data analysis and reflects the qualities required for success in a data analyst role.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

- Content Diversity: What genres and categories dominate the platform?

- Regional Availability: How does content distribution vary across different regions?

- Trends Over Time: How has Amazon Prime’s content library evolved?

- IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?

- Cast and Crew: Examine the participation of cast and crew, uncovering the most influential contributors.

By analyzing this dataset, businesses, content creators, and data analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.

By uncovering these patterns, the project provides actionable insights that can guide Amazon Prime in content acquisition, talent partnerships, and regional market expansion strategies. Ultimately, the goal is to demonstrate how structured data analysis of streaming content can support business decisions in content strategy and competitive positioning.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# NumPy is not used explicitly in the cells below; it is imported as a standard
# dependency for numerical operations on arrays and matrices.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  #Perfect for correlation heatmaps, pair plots, distribution plots, and boxplots with minimal code.
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

credits_csv = pd.read_csv('/content/credits.csv.zip')
titles_csv = pd.read_csv('/content/titles.csv.zip')

merged_df = pd.merge(titles_csv, credits_csv, on='id',how = 'left')


### Dataset First View

In [None]:
# Dataset First Look
merged_df.head()

In [None]:
merged_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

merged_df.shape




In [None]:
len(merged_df) #to find number of rows

### Dataset Information

In [None]:
# Dataset Info

merged_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_values = merged_df.duplicated().sum()
print(f"Number of duplicate rows in merged_df: {duplicate_values}")

In [None]:
merged_df[merged_df.duplicated()].head()

In [None]:
merged_df = merged_df.drop_duplicates()

In [None]:
merged_df.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

merged_df.isnull().sum()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

merged_df.columns

In [None]:
# Dataset Describe

merged_df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for column in merged_df.columns:
    unique_values = merged_df[column].unique()
    print(f"Unique values for {column}: {unique_values}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
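# Write your code to make your dataset analysis ready.

# Drop rows that lack essential metadata
merged_df = merged_df.dropna(subset=['description', 'imdb_id'])

# Fill missing age certifications with the most frequent value.
# Results are assigned back to the column because fillna(inplace=True) on a
# column selection is deprecated in recent pandas versions.
mode_age_certi = merged_df['age_certification'].mode()[0]
merged_df['age_certification'] = merged_df['age_certification'].fillna(mode_age_certi)

# Use 0 as a placeholder for missing season and vote counts
merged_df['seasons'] = merged_df['seasons'].fillna(0)
merged_df['imdb_votes'] = merged_df['imdb_votes'].fillna(0)

# Mark missing character names explicitly
merged_df['character'] = merged_df['character'].fillna('unknown')

# Fill missing scores and popularity with the column mean
for col in ['imdb_score', 'tmdb_score', 'tmdb_popularity']:
    merged_df[col] = merged_df[col].fillna(merged_df[col].mean())

# Drop rows where any of the core credit fields is missing
merged_df = merged_df.dropna(subset=['person_id', 'name', 'role'])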





In [None]:
merged_df.isnull().sum()


### What all manipulations have you done and insights you found?



**1. Data Loading & Merging:**

Loaded two datasets: titles.csv (content info) and credits.csv (cast/crew info).

Merged them using a left join on the id column to retain all shows and movies, even if cast/crew data was missing.
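
One way to check that the left join behaves as described is to pass `indicator=True` to `pd.merge` and count the titles that found no credit rows. The sketch below uses small made-up frames (`toy_titles` and `toy_credits` are hypothetical stand-ins, not the real files).

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (hypothetical values).
toy_titles = pd.DataFrame({"id": ["t1", "t2", "t3"],
                           "title": ["A", "B", "C"]})
toy_credits = pd.DataFrame({"id": ["t1", "t1", "t2"],
                            "name": ["Ann", "Bob", "Cy"]})

# A left join keeps every title; indicator=True adds a `_merge` column that
# records whether each row found a match in the credits table.
toy_merged = pd.merge(toy_titles, toy_credits, on="id", how="left", indicator=True)

# Titles without any credit row are flagged as "left_only".
titles_without_credits = (toy_merged["_merge"] == "left_only").sum()

# The row count grows because one title can have several cast/crew entries.
rows_per_title = toy_merged.groupby("id").size()

print(titles_without_credits)
print(rows_per_title.to_dict())
```

Because the merged frame holds one row per title-person pair, any title-level count is probably best computed after deduplicating on `id`.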


**2. Handling Duplicates:**

Removed duplicate rows from the merged dataset using:

merged_df = merged_df.drop_duplicates()

**3. Handling Missing Values:**

 **Dropped nulls:**

Dropped rows with missing description and imdb_id as these are essential metadata fields.

merged_df = merged_df.dropna(subset=['description', 'imdb_id'])

**Imputed Values:**

Replaced missing values in age_certification with its mode:

merged_df['age_certification'] = merged_df['age_certification'].fillna(merged_df['age_certification'].mode()[0])

Replaced missing seasons and imdb_votes with 0 and character with 'unknown' to avoid data loss.

merged_df['seasons'] = merged_df['seasons'].fillna(0)
merged_df['imdb_votes'] = merged_df['imdb_votes'].fillna(0)
merged_df['character'] = merged_df['character'].fillna('unknown')

Replaced missing numerical scores like imdb_score, tmdb_popularity, and tmdb_score with their respective mean values, which keeps each column's mean unchanged.

merged_df['imdb_score'] = merged_df['imdb_score'].fillna(merged_df['imdb_score'].mean())
merged_df['tmdb_popularity'] = merged_df['tmdb_popularity'].fillna(merged_df['tmdb_popularity'].mean())
merged_df['tmdb_score'] = merged_df['tmdb_score'].fillna(merged_df['tmdb_score'].mean())

**Removed Unusable Cast Data:**
Dropped rows where person_id, name, or role was null (dropna with a subset removes a row when any listed column is missing), since such rows lacked usable cast/crew information.
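
The distinction between "any column is null" and "all columns are null" matters here, because `dropna(subset=...)` defaults to `how="any"`. A minimal sketch on a made-up frame (the values in `toy` are hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

# Toy credit columns (hypothetical values).
toy = pd.DataFrame({
    "person_id": [1.0, np.nan, np.nan],
    "name": ["Ann", "Bob", np.nan],
    "role": ["ACTOR", np.nan, np.nan],
})
cols = ["person_id", "name", "role"]

# Default how="any": a row is dropped if at least one listed column is null.
kept_any = toy.dropna(subset=cols)

# how="all": a row is dropped only if every listed column is null.
kept_all = toy.dropna(subset=cols, how="all")

print(len(kept_any), len(kept_all))
```

If the intent is to remove only rows with no credit information at all, `how="all"` would be the closer match.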


**4. Insight from Data Cleaning**

The dataset had missing values across description, certifications, and rating metrics, indicating inconsistent data recording in the source.

Many shows lacked actor-level data, especially older titles, which could skew popularity analysis — handled by carefully dropping only fully unusable entries.

Imputation helped preserve the dataset’s overall structure without introducing strong bias or reducing sample size significantly.
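
Whether imputation leaves a column's shape intact can be checked by comparing summary statistics before and after filling. The sketch below uses a made-up `scores` series rather than the real `imdb_score` column; it illustrates that mean imputation keeps the mean but narrows the spread.

```python
import numpy as np
import pandas as pd

# Toy score column with gaps (hypothetical values).
scores = pd.Series([6.0, 7.0, np.nan, 8.0, np.nan, 5.0], name="imdb_score")

# Fill the gaps with the mean of the observed values.
filled = scores.fillna(scores.mean())

# Compare count, mean, and standard deviation before and after imputation.
summary = pd.DataFrame({
    "before": scores.agg(["count", "mean", "std"]),
    "after": filled.agg(["count", "mean", "std"]),
})
print(summary.round(3))
```

A noticeably smaller standard deviation after filling suggests that many values were imputed, in which case the median or a group-wise fill may be worth comparing.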








## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

**Data Visualization**

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.


# **PROBLEM STATEMENTS:**

# Chart - 1

**Problem: Content Diversity: What genres and categories dominate the platform?**

In [None]:
# Chart - 1 visualization code
# Problem: Content Diversity: What genres and categories dominate the platform?
import ast

# Convert the stringified genre lists to real lists.
# literal_eval is safer than eval, and the isinstance check makes the cell re-runnable.
merged_df['genres'] = merged_df['genres'].apply(
    lambda g: ast.literal_eval(g) if isinstance(g, str) else g
)

# merged_df has one row per title-person pair, so count each title only once
genre_df = merged_df.drop_duplicates(subset='id').explode('genres')

genre_count = genre_df['genres'].value_counts().reset_index()
genre_count.columns = ['Genre', 'Count']

plt.figure(figsize=(12, 6))
sns.barplot(x='Count', y='Genre', data=genre_count, palette='viridis')
plt.title('Distribution of Content by Genre on Amazon Prime')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.tight_layout()  # Adjusts padding and spacing so labels are not overlapped or cut off
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.barplot(x='Genre', y='Count', data=genre_count, palette='viridis')
plt.title('Distribution of Content by Genre on Amazon Prime')
plt.xlabel('Genre')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)  # Rotate x-axis labels to avoid overlap
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it clearly shows categorical data (genres) against numerical counts (number of titles). It is an intuitive way to compare the relative popularity of different genres in the Amazon Prime catalog.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Drama and Comedy dominate the platform, followed by Thriller, Action, and Romance. Niche genres like Reality, Sports, and Animation are underrepresented. This highlights Amazon Prime’s heavy focus on mainstream genres, while more specialized categories receive less attention.
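
The genre counts behind this chart depend on how the stringified lists are parsed and on whether each title is counted once. A minimal sketch of both steps, assuming genres are stored as strings such as `"['drama', 'comedy']"` (the helper `parse_list` and the `toy` frame are hypothetical):

```python
import ast
import pandas as pd

def parse_list(value):
    """Turn "['drama', 'comedy']" into a list; pass real lists through."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return []  # not a Python literal
        return list(parsed) if isinstance(parsed, (list, tuple)) else []
    return []  # None, NaN, or any other type

# Toy merged frame: title t1 appears twice because it has two credit rows.
toy = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "genres": ["['drama', 'comedy']", "['drama', 'comedy']", "['drama']"],
})

genre_counts = (
    toy.drop_duplicates(subset="id")["genres"]  # one row per title
    .apply(parse_list)                            # string -> list
    .explode()                                    # one row per (title, genre)
    .value_counts()
)
print(genre_counts.to_dict())
```

Without the deduplication step, titles with large casts would contribute many more rows to each genre than titles with small casts.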

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



- **Positive:** The dominance of Drama and Comedy ensures mass-market appeal and broad audience engagement.

- **Negative:** Underrepresentation of niche genres may alienate certain customer segments who prefer diverse content. Expanding into areas like Animation or Sports could attract younger viewers or regional markets, strengthening Amazon Prime’s competitive position against rivals like Netflix and Disney+.

# Chart - 2

## **Regional Availability: How does content distribution vary across different regions?**





In [None]:
# Regional Availability: How does content distribution vary across different regions?

# production_countries may hold list-like strings (brackets and quotes around the codes),
# so keep only letters and commas before splitting into one country code per row.
# Each title is counted once (merged_df has one row per title-person pair).
country_series = (
    merged_df.drop_duplicates(subset='id')['production_countries']
    .dropna()
    .astype(str)
    .str.replace(r"[^A-Za-z,]", "", regex=True)
    .str.split(',')
    .explode()
)
country_series = country_series[country_series != '']  # drop entries left by empty lists

# Count how many titles each country appears in
country_counts = country_series.value_counts().head(10)

# Plotting
plt.figure(figsize=(10,6))
sns.barplot(x=country_counts.values, y=country_counts.index, palette='magma')
plt.title('Top 10 Production Countries on Amazon Prime')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it’s the most effective way to visually compare the number of titles across different countries. Since country names can be long and categorical, a horizontal layout makes the chart easier to read and interpret. Using the top 10 countries keeps the chart focused and avoids clutter.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals which countries contribute the most content to Amazon Prime. For example, the U.S. may dominate the catalog, followed by countries like India, UK, or Canada. This highlights regional dominance and gaps in content diversity. If certain countries are underrepresented, it may indicate opportunities for content acquisition or local production.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** A strong U.S. presence ensures alignment with global entertainment demand, while India’s large contribution highlights Amazon Prime’s focus on one of the fastest-growing streaming markets.

- **Negative:** Overreliance on a few countries may limit regional diversity. To attract more global audiences, Amazon Prime could expand partnerships in underrepresented regions like Europe, Latin America, and Africa. Additionally, cleaning inconsistent country codes in metadata could improve analytics accuracy.
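
The country-code cleanup mentioned above could look like the following sketch. It assumes the codes arrive in mixed formats; the `raw` series is made up for illustration and does not come from the dataset.

```python
import pandas as pd

# Toy values in formats that might occur (hypothetical).
raw = pd.Series(["['US']", "['US', 'gb']", "[]", "IN, US", None])

codes = (
    raw.dropna()
    .str.replace(r"[^A-Za-z,]", "", regex=True)  # keep only letters and commas
    .str.upper()                                   # normalise case
    .str.split(",")
    .explode()
)
codes = codes[codes != ""]  # empty lists leave an empty string behind

country_counts = codes.value_counts()
print(country_counts.to_dict())
```

Normalising the codes before counting avoids treating `"['US'"`, `"US"`, and `"us"` as three different countries.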

# Chart - 3

**Trends Over Time: How has Amazon Prime’s content library evolved?**

In [None]:
# Chart - 3 visualization code
# Trends Over Time: How has Amazon Prime’s content library evolved?

# Ensure 'release_year' is numeric
merged_df['release_year'] = pd.to_numeric(merged_df['release_year'], errors='coerce')

# Count each title once per year (merged_df has one row per title-person pair)
titles_by_year = (
    merged_df.drop_duplicates(subset='id')['release_year']
    .value_counts()
    .sort_index()
)

# Plotting the trend
plt.figure(figsize=(14, 6))
sns.lineplot(x=titles_by_year.index, y=titles_by_year.values, marker='o', color='green')
plt.title('Number of Titles Released Over the Years', fontsize=16)
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a line chart because it's ideal for visualizing trends over time. It helps us clearly track how the number of releases has increased, decreased, or remained stable year after year. This is much more effective than a bar chart for detecting growth patterns or sudden drops.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals how the volume of content released has changed annually. For example:

- A sharp rise post-2015 might suggest aggressive content acquisition.

- A dip in 2020 could reflect production delays due to COVID-19.

- Steady growth may reflect consistent platform investment.


These insights help us understand platform strategy and how global events may impact digital content pipelines.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** The rapid increase in content releases in the 2010s reflects Amazon Prime’s aggressive strategy in content acquisition and global expansion, which aligns with the industry-wide streaming boom.

- **Negative:** The sudden decline post-2020 could indicate either a real reduction in content production or dataset limitations. For decision-making, it highlights the need to ensure up-to-date, complete data and to plan for external risks (like pandemics) that may affect production.

# Chart - 4

**For top 10 most popular titles by IMDb votes**

In [None]:
# Top 10 most popular titles by IMDb votes

# Convert imdb_votes to numeric (in case it was read as a string)
merged_df['imdb_votes'] = pd.to_numeric(merged_df['imdb_votes'], errors='coerce')

# Keep one row per title (merged_df repeats a title for every cast/crew member)
popular = merged_df.drop_duplicates(subset=['title'])

# Sort and select top 10 by votes
popular = popular.sort_values(by='imdb_votes', ascending=False).head(10)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x='imdb_votes', y='title', data=popular, palette='magma')
plt.title('Top 10 Most Popular Titles on Amazon Prime (by IMDb Votes)', fontsize=16)
plt.xlabel('IMDb Votes')
plt.ylabel('Title')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

**Top 10 Most Popular Titles on Amazon Prime (IMDb Votes)**

A horizontal bar chart works best to:

- Rank titles by total engagement.

- Let users quickly see which titles had the most community traction.

##### 2. What is/are the insight(s) found from the chart?

**Top 10 Most Popular Titles on Amazon Prime (IMDb Votes)**

- The chart shows the titles with the most IMDb votes, used here as a proxy for audience engagement.

- These are titles that trigger user participation, such as voting, reviewing, and social sharing.

- Even if some of these don’t have the highest scores, they attract a large audience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** Understanding which titles receive the highest audience engagement helps Amazon Prime highlight these films in recommendations, marketing campaigns, and retention strategies.

- **Negative:** If the most popular titles are primarily older classics, this may indicate a reliance on legacy content rather than newer releases. To sustain long-term growth, Amazon Prime may need to invest in fresh, original titles that can replicate this level of audience popularity.

**UNIVARIATE ANALYSIS:**

**BOX PLOT:**

**Box plot of imdb_score:**

Box plots are plotted to visualize the distribution, spread of data and to identify outliers in data.
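
By default, seaborn draws the whiskers at 1.5 times the interquartile range (the `whis` parameter), and points beyond them are shown as outliers. The same limits can be computed directly; the sketch below uses a made-up `toy_scores` series and a hypothetical helper `iqr_fences`.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) outlier limits using the k * IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy scores (hypothetical values), including one very low and one very high score.
toy_scores = pd.Series([5.0, 5.5, 6.0, 6.0, 6.5, 7.0, 7.0, 1.5, 9.8])

low, high = iqr_fences(toy_scores)
outliers = toy_scores[(toy_scores < low) | (toy_scores > high)]
print(low, high, outliers.tolist())
```

Listing the flagged rows alongside the plot makes it easier to judge whether an outlier is a data error or a genuinely exceptional title.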

# Chart - 5

In [None]:
# Box plot of imdb_score
plt.figure(figsize = (8,6))
sns.boxplot(data = merged_df, y = 'imdb_score')
plt.title('Box plot of imdb_score\n', color = 'brown')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot because it effectively shows the spread, central tendency, and variability of IMDb scores. It also helps identify outliers, which are important in understanding exceptional cases in audience ratings.

##### 2. What is/are the insight(s) found from the chart?

The median IMDb score is around 6, with most content falling between 5 and 7. There are a few outliers on the lower end (scores near 1–2) and on the higher end (scores close to 9–10). This suggests that while the majority of Amazon Prime’s titles are moderately rated, a small number of titles perform exceptionally well or poorly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** The concentration of titles around mid-level scores indicates consistency in content quality, ensuring that most titles meet a baseline standard for viewers.

- **Negative:** Outliers with very low scores may harm the platform’s reputation if prominently featured. By leveraging this insight, Amazon Prime can focus marketing on high-scoring titles to boost brand perception and gradually phase out or improve lower-rated content.

# Chart - 6

**Box plot of runtime:**

In [None]:
#Box plot of runtime
plt.figure(figsize = (8,6))
sns.boxplot(data = merged_df, x = 'runtime')
plt.title('Box plot of runtime\n', color = 'brown')
plt.show()

##### 1. Why did you pick the specific chart?

I selected a box plot because it provides a clear view of how movie runtimes are distributed, highlights the median, and identifies outliers. Runtime is an important factor for user engagement, as both very short and very long titles can impact viewer experience differently.

##### 2. What is/are the insight(s) found from the chart?

The median runtime is around 100 minutes, which aligns with the standard length of feature films. Most titles fall within 75 to 150 minutes, but there are several outliers on both sides:

- Short titles under 50 minutes (likely episodes, specials, or short films).

- Extremely long titles over 200 minutes (epics, extended editions, or documentaries).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** The majority of titles fit within the typical viewing expectation (90–120 mins), which is ideal for most audiences.

- **Negative:** Outliers with very short runtimes may frustrate users expecting full-length films, while very long runtimes might lead to lower completion rates.

Amazon Prime could use this insight to:

- Better tag and categorize content (short films, episodes, extended editions).

- Recommend runtimes based on user preferences, e.g., short content for quick viewing windows, standard films for regular viewing.

# Chart - 7

In [None]:
#Box plot of seasons
plt.figure(figsize = (8,6))
sns.boxplot(data = merged_df, x = 'seasons')
plt.title('Box plot of seasons\n', color = 'brown')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen because it clearly shows the spread of seasons across TV shows on Amazon Prime. It helps identify the typical range and outliers — for example, shows with unusually high numbers of seasons compared to the majority.

##### 2. What is/are the insight(s) found from the chart?

The box plot of seasons shows the distribution, typical value, spread, and outliers in the number of seasons for titles on Amazon Prime.

The line inside the box marks the median number of seasons. Because missing seasons values were filled with 0 during data wrangling, rows without a season count pull the box toward zero.

Points plotted beyond the whiskers are outliers: shows with far more seasons than the majority of the content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** Having mostly shorter series works well for binge-watching culture, making it easier for new users to quickly engage with and complete a show.

- **Negative:** Lack of long-running, high-season franchises might limit user loyalty, as multi-season shows are known to keep subscribers engaged over years.

Amazon Prime could leverage this insight by:

- Investing in original shows with potential for multiple seasons to strengthen long-term subscriber retention.

- Categorizing and recommending short vs. long series depending on viewer preference (casual vs. committed watchers).

**HISTOGRAM:**

Histograms represent the distribution of a numerical variable. They provide a visual way to see the shape of the data's distribution.

**Histogram of release_year:**

# Chart - 8

In [None]:
#Histogram of release_year
plt.figure(figsize = (8,6))
sns.histplot(data = merged_df, x = 'release_year', bins = 20, color = 'blue', kde = True)
plt.title('Histogram of release year\n', color = 'blue')
plt.show()



##### 1. Why did you pick the specific chart?

I chose a histogram because it helps visualize the distribution of titles across time. Unlike the line plot (which shows trends), a histogram makes it easier to see concentrations of releases in specific periods, as well as growth over decades.

##### 2. What is/are the insight(s) found from the chart?

The histogram of the release_year column shows the distribution of movies and TV shows across release years.
A large share of the content on Amazon Prime was released in the last decade, so the distribution is left-skewed, with a long tail toward older years.
There is an overall increasing trend in the number of movies and TV shows released over the years, which suggests that Amazon Prime Video's content library has been expanding.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** The large volume of recent content (post-2000) strengthens Amazon Prime’s ability to cater to modern audiences with up-to-date shows and movies.

- **Negative:** Older classics (pre-1980s) are relatively fewer, which may limit the appeal for audiences who enjoy vintage cinema.

Amazon Prime could use this insight to:

- Market its strength in having a rich and growing modern catalog.

- Expand its classic movie collection to diversify content and appeal to nostalgia-driven viewers.

# Chart - 9

**Histogram of runtime:**

In [None]:
# Histogram of runtime
plt.figure(figsize = (8,6))
sns.histplot(data = merged_df, x = 'runtime', bins = 25, color = 'grey', kde = True)
plt.title('Histogram of runtime\n', color = 'brown')
plt.show()

##### 1. Why did you pick the specific chart?

I used a histogram to study the distribution of runtimes across Amazon Prime content. A histogram is the best choice here because it shows how most movies/episodes are clustered and whether there are unusual outliers.

##### 2. What is/are the insight(s) found from the chart?

Distribution: The histogram shows that the runtime of movies and TV shows on Amazon Prime is not normally distributed. Instead, it is right-skewed, with a longer tail on the right side of the peak.
The peak of the histogram is around 70-90 minutes, which suggests that this is the most common runtime for movies and TV shows on Amazon Prime.
Most titles have runtimes between 60 and 100 minutes. This may be the sweet spot for viewers and creators, as it balances storytelling and viewer engagement.
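
The visual impression of skew can be backed by a number. A minimal sketch, using a made-up `toy_runtime` series rather than the real column: a positive `Series.skew()` and a mean above the median both point to a right-skewed distribution.

```python
import pandas as pd

# Toy runtimes in minutes (hypothetical values) with a long right tail.
toy_runtime = pd.Series([20, 45, 70, 80, 85, 90, 90, 95, 100, 120, 150, 240])

skewness = toy_runtime.skew()  # > 0 means right-skewed, < 0 means left-skewed
mean, median = toy_runtime.mean(), toy_runtime.median()

print(round(skewness, 2), mean, median)
```

The same check applied to `release_year` would be expected to return a negative value if the left skew described for that histogram holds.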

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** Amazon Prime caters to mainstream viewing habits, with the majority of content falling in the “comfortable” 60–100 minute range identified above.

- **Negative:** The presence of outliers may confuse viewers or disrupt engagement if not properly categorized (e.g., shorts, specials, extended versions).

Amazon could:

- Improve categorization and filtering of content by runtime to enhance user experience.

- Use this insight to recommend shorts for quick viewing and long films for dedicated watchers, improving personalization.

# Chart - 10

**Histogram of seasons:**

In [None]:
#Histogram of Seasons
plt.figure(figsize = (8,6))
sns.histplot(data = merged_df, x = 'seasons', bins = 20)
plt.title('Histogram of seasons\n', color = 'brown')
plt.show()


##### 1. Why did you pick the specific chart?

I selected a histogram to visualize the distribution of the number of seasons across Amazon Prime series. Since most shows are short-run, the histogram clearly highlights the skewed nature of season counts.

##### 2. What is/are the insight(s) found from the chart?

- The majority of series have just 1 season, indicating that Amazon Prime’s library is dominated by limited or short-run shows.

- A smaller number extend to 2–5 seasons, showing moderate audience success and continued renewals.

- A very tiny fraction of shows extend beyond 10+ seasons, likely representing iconic long-running series.
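
The shares behind these statements can be computed rather than read off the chart. The sketch below is an illustration under two assumptions: that the `type` column uses the value `"SHOW"` for series, and that rows with `seasons` filled as 0 should be excluded. The `toy` frame is made up.

```python
import pandas as pd

# Toy merged frame (hypothetical values); s1 appears twice because it has two credits.
toy = pd.DataFrame({
    "id": ["m1", "s1", "s1", "s2", "s3"],
    "type": ["MOVIE", "SHOW", "SHOW", "SHOW", "SHOW"],
    "seasons": [0, 1, 1, 3, 12],
})

# Keep series only and count each one once.
shows = toy[toy["type"] == "SHOW"].drop_duplicates(subset="id")

# Group season counts into bands: 1, 2-5, and 6 or more.
labels = ["1", "2-5", "6+"]
season_bands = pd.cut(shows["seasons"], bins=[0, 1, 5, float("inf")], labels=labels)

# Share of series in each band.
band_share = season_bands.value_counts(normalize=True).reindex(labels)
print(band_share.round(3).tolist())
```

Restricting the calculation to series keeps the zero placeholders for movies from dominating the result.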



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** The dominance of single-season content helps Amazon experiment with fresh ideas and appeal to a wide audience with diverse tastes.

- **Negative:** The lack of long-running series suggests fewer opportunities for strong franchise-building or sustained audience loyalty.

To act on this insight, Amazon could:

- Identify short-run shows with strong ratings and invest in multi-season renewals to build loyal audiences.

- Strategically balance limited series (for diversity) with franchise-worthy shows (for long-term engagement).

# Chart - 11

**Histogram of imdb score:**

In [None]:
#Histogram of imdb_score
plt.figure(figsize = (8,6))
sns.histplot(data = merged_df, x = 'imdb_score', bins = 30, color = 'blue', kde = True)
plt.title('Histogram of imdb_score\n', color = 'brown')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram to examine the distribution of IMDb ratings for Amazon Prime content. This helps assess whether the majority of titles are well-received or poorly rated, giving insights into overall catalog quality.

##### 2. What is/are the insight(s) found from the chart?

- The distribution is bell-shaped and centered around 6–7, showing that most Amazon Prime titles are average to above-average in quality.

- Very few titles fall into the extremely low (1–3) or exceptionally high (9–10) categories.

- This suggests Prime’s content library generally avoids extremes and maintains moderate audience satisfaction.
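
The shape of the distribution can be checked numerically alongside the histogram. This is a rough sketch on hypothetical scores (`toy_scores`), not values from the dataset; it reports summary statistics, skewness, and the share of titles in the 6–7 band.

```python
import pandas as pd

# Hypothetical IMDb scores standing in for merged_df['imdb_score'].
toy_scores = pd.Series(
    [5.8, 6.1, 6.4, 6.5, 6.7, 6.9, 7.2, 8.4, 3.1], name="imdb_score"
)

summary = toy_scores.describe()   # count, mean, std, quartiles
skewness = toy_scores.skew()      # negative value = longer tail on the low side
share_6_to_7 = toy_scores.between(6, 7).mean() * 100  # % of titles scoring 6-7 (inclusive)

print(summary.round(2))
print(f"skewness: {skewness:.2f}, share in 6-7 band: {share_6_to_7:.1f}%")
```

A skewness close to zero would support the "bell-shaped" reading; a clearly negative value would indicate a tail of poorly rated titles.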

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive:** A strong mid-range score (6–7) indicates that most of Prime’s catalog is reasonably well-received, which helps in customer retention.

- **Negative:** Fewer high-scoring titles (8–10) implies limited “flagship” or critically acclaimed content, which competitors like Netflix or HBO may have.

**Strategic Takeaway:**

Amazon should:

- Maintain its wide catalog that ensures variety and stability (safe mid-range).

- Invest more in high-quality, prestige projects (aim for 8+ scores) to create buzz, win awards, and strengthen brand reputation.

**BAR PLOT/COUNT PLOT:**

A bar plot is a type of chart that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent.

**Bar plot of type(Movie or TV show):**

# Chart - 12

In [None]:
a = merged_df['type'].value_counts()
a


In [None]:
#Bar plot of type
plt.figure(figsize=(8,6))
sns.barplot(x = a.index, y= a.values, width=0.5, color='cyan', edgecolor = 'black')
plt.title('Bar plot of type(Movie or TV show)\n', color = 'brown')
plt.ylabel('count')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart because the variable type (Movie vs. TV Show) is categorical with only two categories. Bar plots are the most effective way to visualize categorical distributions since they clearly highlight differences in frequency counts. Other visualizations like pie charts could be used, but bar plots are more precise in comparing absolute values and easier to interpret for stakeholders.

##### 2. What is/are the insight(s) found from the chart?

- Movies dominate the Amazon Prime catalog, with more than 110,000 entries, while TV shows are far fewer (roughly 7,000–10,000).

- This indicates Amazon Prime is primarily positioned as a movie-centric streaming platform, unlike Netflix or Disney+, which have heavier investments in TV series.
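
One caveat worth checking: if `merged_df` was built by joining titles to credits, each title may appear once per credited person, which would inflate the counts for titles with large casts. The sketch below uses a hypothetical frame (`toy_df`) and assumes a title identifier column named `id`; both are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for merged_df. The 'id' column and the repeated rows
# are assumptions: they mimic a titles table merged with a credits table,
# where each title appears once per credited person.
toy_df = pd.DataFrame({
    "id": ["m1", "m1", "m1", "m2", "m2", "s1"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW"],
})

# Share of rows per type (what a count plot on the raw frame displays).
row_share = toy_df["type"].value_counts(normalize=True) * 100

# Share of unique titles per type, after keeping one row per title.
title_share = (
    toy_df.drop_duplicates(subset="id")["type"].value_counts(normalize=True) * 100
)

print(row_share.round(2))
print(title_share.round(2))
```

If the two shares differ noticeably on the real data, the per-title figure is the better basis for statements about catalog composition.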

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**

- By knowing that the catalog is skewed toward movies, Amazon can differentiate marketing campaigns—emphasizing its extensive movie library as a USP (Unique Selling Proposition).

- Insights also guide content acquisition strategy: If competitors gain traction with TV shows, Amazon can rebalance its content investments to attract and retain binge-watch audiences.

- For data-driven decision-making, this insight could help improve customer segmentation: targeting movie lovers more aggressively, while selectively expanding TV content to capture new markets.


**Negative:**

- The lack of TV show variety could limit user engagement and retention, since TV series generally drive longer watch times, recurring engagement, and subscription stickiness compared to one-time movies.

- In competitive markets, this imbalance may push users toward platforms like Netflix, which have a stronger TV show offering.

So while the chart highlights a strength in movies, it also reveals a strategic gap in TV content that could negatively affect long-term growth if not addressed.

# Chart - 13

**Bar plot of age certification:**

In [None]:
b = merged_df['age_certification'].value_counts()
b

In [None]:
#Bar plot of age certification
plt.figure(figsize = (8,6))
sns.barplot(x = b.index, y = b.values, color= 'yellow', edgecolor = 'black')
plt.title('Bar plot of age certification\n', color = 'brown')
plt.ylabel('count')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart because the variable age_certification is categorical with multiple discrete groups (R, PG-13, PG, G, etc.). Bar plots are the most effective way to visualize categorical distributions since they allow for direct comparison of content counts across certifications. Other charts like pie charts would clutter insights given the large variation in categories.

##### 2. What is/are the insight(s) found from the chart?

- The 'R' certification means that viewers under 17 require an accompanying parent or adult guardian. Titles rated 'R' are the most numerous in the catalog.

- Titles rated PG-13 rank second. These may contain material that parents might find inappropriate for younger children, typically those under 13 years old.

- The large volume of 'R' titles suggests sustained demand for mature content, so creators may find 'R'-rated projects a good fit for the platform. Note that title counts alone do not measure audience response.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact**

- Strong R-rated content library attracts a large adult subscriber base, which aligns with the global majority of paying users.

- This focus allows Amazon to differentiate itself from Disney+ and other family-heavy competitors, carving out a niche for adult entertainment.

- Marketing campaigns can emphasize Amazon Prime as a go-to platform for thrillers, action, drama, and mature storytelling.


**Negative Business Impact**

- Lack of family/kids content makes the platform less attractive for multi-user households, reducing potential long-term subscription retention.

- Over-dependence on R-rated titles could restrict Amazon Prime in markets with strict censorship (e.g., Middle East, some Asian countries).

- Competitors offering strong children’s content may capture the younger generation early, while Amazon risks missing out on future loyal users.

**4. Are there any insights that lead to negative growth? Justify with a specific reason.**

Yes. The underrepresentation of family and child-safe content is a red flag. Households form a significant portion of the streaming market, and children’s content usually drives higher re-watch value and continuous engagement. By focusing too heavily on adult-rated content, Amazon Prime risks slower user growth in the family/household segment, which could negatively impact its long-term global subscriber base.

# Chart - 14

**Bar plot of production countries:**

In [None]:
prod_countries = merged_df['production_countries'].value_counts()
countries = prod_countries[prod_countries >1000]
countries

In [None]:

#Bar plot of production countries
plt.figure(figsize=(8,6))
sns.barplot(x = countries.values, y = countries.index, orient='h', color = 'lightblue', edgecolor = 'black')
plt.title('Bar plot of production countries\n', color = 'brown')
plt.show()

##### 1. Why did you pick the specific chart?

I used a horizontal bar chart because the variable production_countries is categorical with many entries, and horizontal orientation makes it easier to read long category names. A bar chart clearly shows which countries dominate content production and how others compare.

##### 2. What is/are the insight(s) found from the chart?

- The United States dominates production on Amazon Prime, with ~65,000 titles—far higher than any other country.

- India comes next, followed by the UK, Canada, Japan, Australia, and France.

- There is a strong imbalance: US content overshadows all other regions, suggesting Prime’s library is highly Western-centric, particularly US-driven.
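
Counting `production_countries` directly treats each combination of countries as its own category, so co-productions are not credited to the individual countries. The sketch below shows one way to count per country, assuming the column stores stringified lists such as `"['US', 'GB']"`; that storage format and the `toy_df` frame are assumptions, not confirmed by the notebook.

```python
import ast

import pandas as pd

# Hypothetical stand-in for merged_df; the stringified-list format is assumed.
toy_df = pd.DataFrame({
    "production_countries": ["['US']", "['US', 'GB']", "['IN']", "[]", "['US']"],
})

countries = (
    toy_df["production_countries"]
    .apply(ast.literal_eval)  # "['US', 'GB']" -> ['US', 'GB']
    .explode()                # one row per (title, country) pair
    .dropna()                 # empty lists become NaN after explode
)
country_counts = countries.value_counts()
print(country_counts)
```

With this approach a US/UK co-production adds one to both countries instead of forming a separate "US, GB" bar.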

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact**

- Heavy US content strengthens Amazon Prime’s position in the global entertainment market, since US productions are widely popular and often attract large audiences.

- Strong Indian presence highlights Amazon’s localization strategy in one of the fastest-growing streaming markets.

- This insight helps guide marketing focus—emphasizing American blockbusters globally and Indian originals in regional campaigns.

**Negative Business Impact**

- Over-dependence on US content may reduce cultural diversity and alienate audiences in non-Western regions.

- Limited local production in countries like Japan, Australia, France, etc., might restrict Amazon’s ability to penetrate regional markets deeply.

- Competitors like Netflix are investing heavily in local originals, so Amazon risks losing market share where local content consumption is key (e.g., South Korea, Latin America).

**4. Are there any insights that lead to negative growth? Justify with a specific reason.**

Yes. The overconcentration of US content is a potential threat. While globally popular, it may limit adoption in regional markets where audiences increasingly prefer local-language shows and culturally relatable content. Without balancing US dominance with strong local originals, Amazon Prime risks slower growth in international markets, especially Asia-Pacific and Europe, where cultural preferences strongly drive viewership.

# **Bivariate Analysis:**

**SCATTER PLOT:**


A scatter plot is a type of data visualization that uses dots to represent the values of two numerical variables. Scatter plots are used to observe relationships between variables.

# Chart - 15

**Scatter plot of imdb_votes and imdb_score:**

In [None]:
#Scatter plot of imdb_votes and imdb_score
sns.scatterplot(data = merged_df, x = 'imdb_votes', y = 'imdb_score', color = 'blue')
plt.title('Scatter plot of imdb_votes and imdb_score\n', color = 'brown')
plt.show()

**1. Why did you pick the specific chart?**

I chose a scatter plot because both imdb_votes and imdb_score are numerical variables. A scatter plot is the best way to visualize the relationship, clustering, and spread between popularity (votes) and perceived quality (ratings). It also helps identify outliers, such as highly voted but low-rated titles or vice versa.

**2. What is/are the insight(s) found from the chart?**

- Most titles cluster in the low vote range (<200k votes) with IMDb scores between 6 and 8.

- A few titles with very high votes (>500k) tend to also have consistently higher scores (7–9), showing that popular titles usually maintain good ratings.

- Outliers exist: some movies/shows received many votes but only moderate ratings, suggesting hype-driven popularity rather than quality.
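
Because vote counts are heavily right-skewed, a rank-based correlation can complement the visual impression. The sketch below, on hypothetical values (`toy_df`), compares the Pearson correlation with a Spearman correlation computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical votes and scores; the real columns live in merged_df.
toy_df = pd.DataFrame({
    "imdb_votes": [120, 950, 4_300, 18_000, 75_000, 640_000],
    "imdb_score": [5.9, 6.4, 6.1, 7.0, 7.3, 8.6],
})

# Pearson correlation is sensitive to the few very large vote counts.
pearson = toy_df["imdb_votes"].corr(toy_df["imdb_score"])

# Spearman correlation equals the Pearson correlation of the ranks,
# so it only asks whether more votes go with higher scores.
spearman = toy_df["imdb_votes"].rank().corr(toy_df["imdb_score"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Plotting votes on a logarithmic axis is another common way to spread out the dense low-vote cluster.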

**3. Will the gained insights help create a positive business impact?**

**Positive Business Impact**

- Identifying titles with both high ratings and high votes helps Amazon Prime highlight “flagship titles” in marketing campaigns, attracting new users.

- Titles with good ratings but low votes are hidden gems—Amazon can promote them to boost engagement.

- This analysis can help guide content recommendation algorithms by balancing popularity with quality.

**Negative Business Impact**

- A heavy concentration of titles with average ratings (6–7) may suggest Amazon is not acquiring enough critically acclaimed content, which could weaken brand positioning against competitors with premium shows.

- Outliers (high votes but low ratings) could damage user trust if these titles are over-promoted.

**4. Are there any insights that lead to negative growth? Justify with a specific reason.**

Yes. The fact that most titles sit in the mid-range (6–7 score) suggests that Amazon’s catalog may lack a sufficient share of high-quality, award-winning content. This can negatively impact brand perception, especially compared to Netflix, Disney+, or HBO, which actively market critically acclaimed shows. If Prime relies too much on average-quality titles, it risks losing premium subscribers seeking top-tier content.

# Chart - 16

**LINE PLOT:**

A line plot connects data points with a line to show trends over time or across a continuous variable.

**Line plot of runtime v/s release_year:**

In [None]:
plt.figure(figsize = (8,6))
sns.lineplot(data = merged_df, y = 'runtime', x = 'release_year')
plt.title('line plot of runtime v/s release_year\n', color = 'brown')
plt.show()


**1. Why did you pick the specific chart?**

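I chose a line plot because runtime is a continuous variable and release_year is an ordered time axis, so a line chart clearly captures trends over time. With several titles per year, seaborn's lineplot draws the yearly mean with a 95% confidence band by default, so the line shows average runtime rather than individual films. Unlike bar charts or scatter plots, line plots reveal patterns, fluctuations, and shifts in movie runtimes across decades.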

**2. What is/are the insight(s) found from the chart?**

- Early 1900s (pre-1920s) saw very high variability, with some films reaching 180+ minutes.

- Between 1930–1950, runtimes stabilized around 70–90 minutes.

- From 1960 onwards, runtimes gradually increased to ~100 minutes and have remained relatively stable since then.

- Modern films (2000–2020) show a consistent average runtime of ~95–105 minutes.
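
The decade-level pattern can be tabulated directly, which also shows how many titles sit behind each average. This is a minimal sketch on hypothetical rows (`toy_df`); the decade bucketing is an illustrative choice.

```python
import pandas as pd

# Hypothetical release years and runtimes standing in for merged_df.
toy_df = pd.DataFrame({
    "release_year": [1915, 1918, 1935, 1948, 1975, 2005, 2012, 2019],
    "runtime": [180, 60, 75, 85, 100, 98, 104, 101],
})

# Map each year to the start of its decade, e.g. 1918 -> 1910.
toy_df["decade"] = (toy_df["release_year"] // 10) * 10

# Mean, spread, and number of titles per decade.
runtime_by_decade = toy_df.groupby("decade")["runtime"].agg(["mean", "std", "count"])
print(runtime_by_decade.round(1))
```

A small `count` in the early decades would mean that the high variability there reflects a handful of titles rather than a broad trend.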

**3. Will the gained insights help create a positive business impact?**

**Positive Business Impact**

- Understanding that audiences are accustomed to 90–120 minute runtimes helps Amazon Prime optimize content strategy, such as acquiring or producing films within this sweet spot.

- Data validates the predictability of modern runtimes, allowing better scheduling for ads, recommendations, and watch-time predictions.

- This insight can also guide content production teams, ensuring new releases align with viewer attention spans.

**4. Are there any insights that lead to negative growth? Justify with a specific reason.**

Yes. The stabilization of runtimes around ~100 minutes suggests Amazon may be overly focused on conventional formats. In today’s market, platforms like Netflix experiment with short episodes, interactive movies, and limited series, appealing to evolving viewer preferences. If Amazon sticks too rigidly to the standard 90–120 min format, it risks falling behind on innovation and losing engagement to competitors who adapt faster.

# Chart - 17

**Line plot of imdb_score v/s release_year:**

In [None]:
plt.figure(figsize = (8,6))
sns.lineplot(data = merged_df, y = 'imdb_score', x = 'release_year')
plt.title('line plot of imdb_score v/s release_year\n', color = 'brown')
plt.show()

**1. Why did you pick the specific chart?**

I chose a line plot because both imdb_score and release_year are continuous variables. The line chart makes it easy to visualize long-term trends and fluctuations in ratings over time, unlike bar plots which would clutter with too many years.

**2. What is/are the insight(s) found from the chart?**

- Movies from 1915–1930 generally had higher IMDb ratings (6.5–7.5) compared to later decades.

- Post-1930, IMDb scores declined and stabilized around 6.0–6.5, with some fluctuations.

- In recent decades (2000–2020), scores appear more tightly clustered around 6.0, suggesting fewer highly rated outliers compared to early cinema.

- Overall, average ratings have slightly declined over time.

**3. Will the gained insights help create a positive business impact?**

Yes.

- Amazon Prime can position older, classic films as premium/high-quality content in their catalog since historically they scored higher. This can improve viewer engagement and niche audience retention (e.g., cinephiles, critics).

- Identifies that modern films struggle to stand out in ratings, so Prime can focus on quality over quantity when commissioning originals.

- Helps in marketing strategy: for instance, older movies can be marketed as "timeless classics with top IMDb scores."

**4. Are there any insights that lead to negative growth? Justify with a specific reason.**

Yes. The decline in IMDb scores post-1930s and stabilization at ~6.0 suggests that modern content is struggling to achieve the same critical acclaim as older films. If Prime relies too heavily on recent content without focusing on high-quality originals, it risks losing competitive advantage to platforms like Netflix or Disney+ that emphasize critically acclaimed productions.

**PIE CHART:**

A pie chart is used to visualize the proportions of different categories within a whole dataset. It is a circular statistical graph in which each slice represents a category, and the size of the slice is proportional to that category's contribution to the whole.

# Chart - 19

**Pie chart of type(Movie or TV show):**

In [None]:
#Pie chart of type
type_counts = merged_df['type'].value_counts()  # compute the counts once and reuse them
plt.figure(figsize = (8,6))
plt.pie(type_counts, labels = type_counts.index, autopct = '%.2f%%')
plt.title('Pie chart of type(Movie or TV show)\n', color = 'brown')
plt.show()

**1. Why did you pick the specific chart?**

I used a pie chart because the variable type is categorical with only two categories: Movie and Show. A pie chart clearly shows the proportion of each category relative to the whole, making it easy to visualize dominance.

**2. What is/are the insight(s) found from the chart?**

- Movies dominate the platform, making up 93.77% of total content.

- TV Shows are very limited, only about 6.23% of the library.

- This indicates Prime has a movie-heavy catalog compared to TV series.

**3. Will the gained insights help create a positive business impact?**

Yes.

- The high proportion of movies can be marketed as a strength for movie lovers, positioning Prime as a strong movie-first platform.

- Helps in targeted recommendations for movie-centric audiences, improving user engagement.

- Prime can leverage TV shows as exclusive premium content (since they are fewer, they can be marketed as rare/high-value).

**4. Are there any insights that lead to negative growth? Justify with a specific reason.**

Yes. The under-representation of TV shows (6.23%) could hurt Prime’s long-term subscriber retention. TV shows generally drive higher engagement (binge-watching, recurring subscriptions) compared to standalone movies. Without strengthening its TV series catalog, Prime risks losing customers to platforms that offer episodic, long-form storytelling.

# Chart - 20

**Pie chart of age certification:**

In [None]:
a = merged_df['age_certification'].value_counts()
a

In [None]:
#pie chart of age certification
import plotly.express as px
fig = px.pie(values = a, names = a.index, title = 'Pie chart of age certification')
fig.show()

**1. Why did you pick the specific chart?**

I used a pie chart because age_certification is a categorical variable with multiple classes. A pie chart is effective to show the proportional breakdown of certifications, helping to understand which audience segment is most targeted.

**2. What is/are the insight(s) found from the chart?**

- R-rated content dominates massively, with 74.7% of all titles.

- The next largest categories are PG-13 (10.4%) and PG (7.94%).

- Family/kids content (G, TV-Y, TV-Y7, TV-G) is very minimal (<2%).

- TV-specific certifications (like TV-MA, TV-14, TV-PG) form a tiny fraction, consistent with earlier finding that Prime is movie-heavy.
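
With many small categories, pie slices become hard to read. One option is to fold the small ones into a single "Other" slice before plotting. The sketch below uses hypothetical counts (`cert_counts`) and an arbitrary 5% threshold; neither comes from the notebook.

```python
import pandas as pd

# Hypothetical certification counts; real values come from
# merged_df['age_certification'].value_counts().
cert_counts = pd.Series({
    "R": 747, "PG-13": 104, "PG": 79, "TV-MA": 30,
    "G": 15, "TV-14": 13, "TV-Y": 7, "TV-G": 5,
})

share = cert_counts / cert_counts.sum() * 100
threshold = 5.0  # illustrative cut-off, in percent

major = share[share >= threshold]        # slices large enough to label
other = share[share < threshold].sum()   # everything else, combined
grouped = pd.concat([major, pd.Series({"Other": other})])
print(grouped.round(1))
```

The resulting `grouped` series could be passed to the pie chart in place of the raw counts, keeping the dominant categories legible.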

**3. Will the gained insights help create a positive business impact?**

Yes.

- The high share of R-rated content positions Prime strongly for adult and mature audiences, which can be marketed as “premium” or “uncensored entertainment.”

- The strong focus on adult content differentiates Prime from Disney+, which is family/kid oriented.

**4. Are there any insights that lead to negative growth? Justify with a specific reason.**

Yes. The low percentage of kids/family content (<2%) may lead to lower subscription rates among families — one of the largest household customer segments in streaming. This gap can result in churn, as parents may prefer platforms offering safe, family-friendly viewing options.

# Chart - 21

**Violin Plot:**

A Violin Plot is a statistical chart that combines aspects of a box plot and a kernel density plot to visualize the distribution of a numeric variable across different categories or groups.

**Violin Plot of imdb_score by type:**

In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(x='type', y='imdb_score', data=merged_df)
plt.title('Violin Plot of imdb_score by type')
plt.show()

**1. Why did you pick the specific chart?**

- A violin plot is perfect for comparing distributions of IMDb scores between categories (Movie vs Show).

- Unlike bar charts or boxplots, violin plots show the density and spread of data, helping us see where most values are concentrated.

**2. What is/are the insight(s) found from the chart?**

- TV Shows generally have higher IMDb scores than Movies.

- Most shows are concentrated around 7–8, with many above 8.

- Movies have a wider spread, but the majority cluster around 6–7.

- Shows have fewer very low-rated titles, while Movies have more extreme outliers (both very low and very high ratings).
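
The visual comparison can be backed by per-group statistics. This is a rough sketch with hypothetical scores (`toy_df`), showing the median, mean, spread, and range of `imdb_score` for each type.

```python
import pandas as pd

# Hypothetical scores per type standing in for merged_df.
toy_df = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 4,
    "imdb_score": [4.2, 5.9, 6.3, 6.8, 8.9, 6.9, 7.4, 7.8, 8.1],
})

# One row per type; columns summarize location and spread of the scores.
score_stats = toy_df.groupby("type")["imdb_score"].agg(
    ["median", "mean", "std", "min", "max"]
)
print(score_stats.round(2))
```

A higher median for shows and a larger standard deviation for movies would match the reading of the violin plot above.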

**3. Will the gained insights help create a positive business impact?**

Yes.

- Shows are more consistently well-received by audiences than movies.

- This suggests Amazon Prime could benefit from investing more in TV series, since they tend to have higher audience satisfaction and engagement.

- Good ratings improve subscriber retention, as viewers trust platform quality.

**4. Are there any insights that lead to negative growth? Justify with a specific reason.**

Yes. The fact that movies (the majority of content) have lower average ratings could lead to:

- Lower customer satisfaction (viewers not impressed by available titles).

- Higher churn, as audiences may shift to platforms offering better-rated, binge-worthy shows (e.g., Netflix).

- Wasted content investment, since a large movie catalog does not guarantee quality perception.

# **Multivariable Analysis:**

# Chart - 22 - Correlation Heatmap

A heatmap uses color intensity to represent the strength of relationships (correlations) between numeric variables in a dataset.

In [None]:

num_col = merged_df.select_dtypes(include = ['int64','float64'])

In [None]:

#heatmap
plt.figure(figsize = (10,8))
sns.heatmap(num_col.corr(), annot = True, cmap='coolwarm')
plt.show()

##### 1. Why did you pick the specific chart?

- A correlation heatmap is useful to quickly detect relationships between numerical variables in the dataset.

- It shows both the strength and direction (positive/negative) of correlations.

- Helps identify whether certain metrics (like votes, popularity, or runtime) are good predictors of ratings.

##### 2. What is/are the insight(s) found from the chart?

- IMDb Score & TMDb Score have the highest correlation (0.62) → indicates that ratings across platforms are strongly aligned.

- IMDb Votes & IMDb Score show a weak positive correlation (0.26) → popular titles often have better ratings, but not always.

- Runtime & Seasons are negatively correlated (-0.3) → as expected, TV shows with more seasons usually have shorter runtime per episode, while movies are longer.

- Release Year has very weak correlation with ratings → newer content is not necessarily higher rated.

- Popularity (TMDb) is weakly correlated with scores (~0.12–0.24) → popularity does not always mean quality.
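
Reading pairs off a heatmap by eye is error-prone, so the strongest relationships can also be listed programmatically. The sketch below uses a small hypothetical frame (`toy_df`) and ranks each unique pair of columns by the absolute value of its correlation.

```python
import pandas as pd

# Hypothetical numeric columns standing in for num_col.
toy_df = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 6.5],
    "tmdb_score": [5.5, 6.2, 6.8, 8.1, 6.0],
    "runtime": [110, 95, 30, 45, 100],
})

corr = toy_df.corr()

# Collect each unordered pair once (upper triangle, diagonal excluded).
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
]

# Strongest relationships first, regardless of sign.
ranked = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)
for a, b, r in ranked:
    print(f"{a} vs {b}: {r:.2f}")
```

Run on the real numeric columns, this would produce the same figures quoted in the bullets above in a sortable form.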

**3. Will the gained insights help create a positive business impact?**

Yes.

- Strong correlation between IMDb and TMDb scores means Amazon Prime can use one rating system to predict the other, reducing dependency on multiple external sources.

- The positive, if weak, correlation between votes and ratings suggests audience engagement (more votes) can serve as one supporting signal of quality in recommendation systems, though not as a proxy on its own.

**4. Are there any insights that lead to negative growth? Justify with a specific reason.**

Yes. The low correlation between popularity and rating means relying too heavily on popularity-driven recommendations could cause user frustration (popular but low-rated titles being recommended). This could reduce long-term engagement and retention.

# Chart - 23 - Pair Plot

Pair plot visualizes the pairwise relationship between variables. It includes scatter plots for relationships and histogram or density plots for individual distributions.

In [None]:
# Pair Plot visualization code

sns.pairplot(merged_df)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot was chosen because it provides a comprehensive view of the relationships between multiple numerical variables in the dataset at once. This is especially useful for exploratory data analysis (EDA), as it allows us to quickly observe potential correlations, trends, and distributions across several key metrics related to Amazon Prime titles.

##### 2. What is/are the insight(s) found from the chart?

- There is a positive correlation between imdb_score and tmdb_score — as the IMDb score increases, the TMDb score tends to increase as well, which suggests consistency in ratings across platforms.

- The number of IMDb votes (imdb_votes) appears widely spread, but there is a noticeable cluster of titles with lower vote counts, indicating that many titles may not be widely rated.

- Popularity (tmdb_popularity) does not show a strong correlation with either IMDb or TMDb scores, which suggests that a high rating doesn’t necessarily correspond to higher popularity, and vice versa.

- The spread of data points for all variables suggests significant variability in the dataset, with some extreme outliers, especially in imdb_votes and tmdb_popularity.
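
On a large frame, a pair plot can be slow and crowded. One option is to select the columns of interest, drop missing values, and down-sample before plotting. The helper below (`prepare_pairplot_frame`) is hypothetical, as are its defaults and the `toy_df` data; it only prepares the frame and leaves the plotting call as a comment.

```python
import pandas as pd


def prepare_pairplot_frame(df, columns, max_rows=2000, seed=42):
    """Return a NaN-free, down-sampled copy of the selected columns."""
    subset = df[columns].dropna()
    if len(subset) > max_rows:
        # Fixed seed so the sample, and therefore the plot, is reproducible.
        subset = subset.sample(n=max_rows, random_state=seed)
    return subset


# Hypothetical data standing in for merged_df.
toy_df = pd.DataFrame({
    "imdb_score": [6.1, 7.2, None, 5.5, 8.0],
    "tmdb_score": [6.0, 7.5, 6.6, None, 7.9],
    "title": ["a", "b", "c", "d", "e"],
})

plot_df = prepare_pairplot_frame(toy_df, ["imdb_score", "tmdb_score"], max_rows=2)
print(plot_df)
# sns.pairplot(plot_df) would then draw only the selected columns.
```

Sampling trades some detail in the tails for speed, so extreme outliers may not appear in every sample.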

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**
Understanding the correlation between ratings (IMDb and TMDb) helps validate the data and suggests that relying on either source is reasonably justified when evaluating content quality. This can streamline decision-making regarding content promotion and acquisition strategies.

**Negative Growth Insight:**
The lack of a strong correlation between popularity and ratings could indicate that well-rated titles are not necessarily the most watched or popular. This highlights a potential gap in content discoverability or marketing effectiveness. If high-quality content isn’t reaching the audience effectively, it could lead to missed opportunities for user engagement and retention, ultimately impacting revenue growth.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

Based on the analysis, here are some suggestions to achieve your business objectives:

**1. Content Strategy & Acquisition:**

- **Diversify Genres:** While "Drama" and "Comedy" are popular, there's an opportunity to invest in other high-performing genres like "Action," "Thriller," and "Romance" to cater to a wider audience. The data also indicates a growing interest in "European" content.

- **Focus on High-Quality Productions:** The analysis shows that titles with higher IMDb scores tend to attract somewhat more votes (a correlation of about 0.26), indicating that quality contributes to engagement. Prioritize acquiring and producing high-quality content.

- **Expand International Content:** While the US dominates production, content from India and the UK also performs well. Continue to invest in and promote international content to cater to a global audience.

**2. User Engagement & Retention:**

- **Promote High-Performing Content:** Feature the top-rated and most popular titles prominently on the platform to drive engagement.

- **Optimize Content for Different Viewing Habits:** The data shows a trend towards shorter runtimes for movies and longer seasons for TV shows. Cater to these preferences by offering a mix of content lengths.

- **Leverage Age Certification Data:** With a large portion of content rated 'R', there is a clear audience for mature content. However, ensure there is a balanced offering for other age groups to attract a wider subscriber base.

**3. Personalization & Recommendation:**

- **Enhance Recommendation Engine:** Use the insights from genre preferences, production countries, and age certifications to create more personalized recommendations for users.

- **Targeted Marketing Campaigns:** Use the data to create targeted marketing campaigns. For example, promote new "Action" titles to users who have previously watched similar content.

By implementing these suggestions, you can improve content strategy, increase user engagement, and ultimately achieve your business objectives.



# **Conclusion**

This comprehensive analysis of the Amazon Prime Video dataset has revealed several key insights that can be leveraged to drive business growth. The platform's content library is dominated by movies, with a significant amount of content originating from the US, India, and the UK. While "Drama" and "Comedy" are the most prevalent genres, there is a clear opportunity to diversify the content portfolio to cater to a broader range of tastes.

The data indicates a positive, though modest, correlation (about 0.26) between content quality (IMDb score) and user engagement (IMDb votes), which supports investing in high-quality productions. Furthermore, the analysis of content trends over time shows a shift towards shorter movie runtimes and longer TV show seasons, reflecting evolving viewer preferences.

By leveraging these insights, Amazon Prime Video can optimize its content strategy, enhance user engagement, and deliver a more personalized viewing experience. The recommendations provided, including diversifying genres, promoting high-performing content, and refining the recommendation engine, offer a clear roadmap for achieving these goals and strengthening the platform's position in the competitive streaming market.

