# **Project Name**    - Amazon Prime TV Shows & Movies



##### **Project Type**    - Exploratory Data Analysis(EDA)
##### **Contribution**    - Abhishek Kaithwas


# **Project Summary -**

***Project Objective:***

To gain insights into Amazon Prime Video's content library, user preferences, and trends to influence subscription growth, user engagement, and content investment strategies in the streaming industry.

***Dataset:***

The project utilizes two datasets:

**titles.csv**: Contains information about movies and TV shows available on Amazon Prime, including title, type, description, release year, age certification, runtime, genres, production countries, IMDb ratings, and TMDb popularity scores.

**credits.csv**: Lists cast and crew information for each title, including person ID, name, and role.

***Methodology:***

**Data Loading and Cleaning**:

Libraries like pandas, numpy, matplotlib, seaborn, and missingno are imported. The datasets are loaded, and initial exploration reveals content diversity, regional availability, trends over time, and IMDb ratings. Data cleaning involves handling missing values, removing duplicates, and converting data types for better analysis.

**Data Visualization and Storytelling**:

Univariate, bivariate, and multivariate analyses are performed using various charts and visualizations. Insights gained from each chart are discussed, along with their potential impact on business objectives.

**Key Findings:** The analysis revealed several key findings, including:

**Content Distribution**: Movies dominate the platform compared to TV shows. Drama, comedy, documentary, and action are the most common genres.

**IMDb Ratings**: Most content has IMDb scores between 6 and 8, indicating generally positive reception.

**Trends Over Time**: There has been a significant increase in content availability after 2010, reflecting Amazon's growing content library.

**Runtime**: Most content falls within a 60-120 minute runtime, aligning with general viewer preferences.

**Correlations**: IMDb score and TMDb score are positively correlated, suggesting that ratings are broadly consistent across the two platforms.

***Business Recommendations***: Based on the findings, the project suggests actionable strategies to achieve the business objective:

**Content Diversification**: Invest in expanding TV show catalogs and explore less-represented genres like thrillers, horror, or sci-fi.

**Content Quality and Ratings**: Focus on maintaining high content quality, prioritizing genres with high IMDb scores like documentaries.

**Optimizing Runtime**: Ensure content aligns with general preferences for runtimes between 60 and 120 minutes.

**Understanding User Preferences**: Monitor and analyze viewership data, user feedback, and platform insights to optimize content strategies.

**Implementation and Outcomes:** The recommendations are geared toward achieving increased subscription growth, improved user engagement, and optimized content investment for Amazon Prime Video. By continuously monitoring and adapting content strategies based on data insights, the platform can retain its competitive edge in the streaming market.

# **GitHub Link -**

https://github.com/Abhishekkaithwas/Amazon-Prime-TV-Shows-Movies-content-/blob/main/Amazon%20prime%20TV%20shows%20%26%20Movies.ipynb

# **Problem Statement**


This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

*Content Diversity*: What genres and categories dominate the platform?

*Regional Availability*: How does content distribution vary across different regions?

*Trends Over Time*: How has Amazon Prime’s content library evolved?

*IMDb Ratings & Popularity*: What are the highest-rated or most popular shows on the platform?


By analyzing this dataset, businesses, content creators, and data analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.

#### **Define Your Business Objective?**

To influence subscription growth, user engagement, and content investment strategies in the streaming industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers the listed questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

### Dataset Loading

In [None]:
# Load Dataset(Credits)
df_credit = pd.read_csv("/content/credits.csv")  # Replace with your file path

In [None]:
#Load dataset (Titles)
df_title=pd.read_csv('/content/titles.csv')
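
The guidelines above ask for exception handling, so the two `read_csv` calls could be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `load_csv` is a hypothetical helper name, and the path in the demo is made up purely to exercise the failure branch.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv(path: str) -> Optional[pd.DataFrame]:
    """Read a CSV file; print a message and return None if it cannot be loaded."""
    file_path = Path(path)
    try:
        return pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except pd.errors.EmptyDataError:
        print(f"File is empty: {file_path}")
    except pd.errors.ParserError as err:
        print(f"Could not parse {file_path}: {err}")
    return None


# Hypothetical path, used only to demonstrate the failure branch.
missing = load_csv("/content/does_not_exist.csv")
```

In the notebook, the same helper could be called with `/content/credits.csv` and `/content/titles.csv`, followed by a check that neither result is `None` before continuing.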

### Dataset First View

In [None]:
# Dataset First Look(Credits)
df_credit.head()

In [None]:
#Dataset first look(Titles)
df_title.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count(Credits)
rows, columns = df_credit.shape
print(f'Rows:{rows}',
f'Columns:{columns}')


In [None]:
#Dataset Rows & columns(Titles)
rows,columns = df_title.shape
print(f'Rows:{rows}',
      f'Columns:{columns}')

### Dataset Information

In [None]:
# Dataset Info(Credits)
df_credit.info()

In [None]:
#Dataset info(Titles)
df_title.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count(Credits)
duplicate_rows = df_credit.duplicated().sum()
print(f'Duplicate rows:{duplicate_rows}')


In [None]:
#Dataset Duplicate Value Count(Titles)
duplicate_rows = df_title.duplicated().sum()
print(f'Duplicate Rows:{duplicate_rows}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count(Credits)
df_credit.isnull().sum()

In [None]:
# Missing Values/Null Values Count(Titles)
df_title.isnull().sum()

In [None]:
#Visualizing the missing values of credits dataset
plt.figure(figsize=(12,6))
sns.heatmap(df_credit.isnull(), cmap='viridis',cbar=False, yticklabels=False)
plt.title("Missing Values Heatmap-Credit Dataset")
plt.show()

In [None]:
#Visualizing the missing values percentage(Credit)
missing_percent = (df_credit.isnull().sum() / len(df_credit)) * 100
missing_percent = missing_percent[missing_percent > 0].sort_values()

plt.figure(figsize=(10, 5))
missing_percent.plot(kind='barh', color='red')
plt.xlabel("Percentage of Missing Values")
plt.ylabel("Columns")
plt.title("Missing Values Percentage")
plt.show()

In [None]:
# Visualizing the missing values(titles)
plt.figure(figsize=(12,6))
sns.heatmap(df_title.isnull(),cmap="viridis",cbar=False, yticklabels=False)
plt.title("Missing Values Heatmap-Title Dataset")
plt.show()

In [None]:
#Visualising the missing values percentage(Titles)
missing_percent = (df_title.isnull().sum() / len(df_title)) * 100
missing_percent = missing_percent[missing_percent > 0].sort_values()

plt.figure(figsize=(10, 5))
missing_percent.plot(kind='barh', color='red')
plt.xlabel("Percentage of Missing Values")
plt.ylabel("Columns")
plt.title("Missing Values Percentage")
plt.show()
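
The missing-value counts and percentages computed above can also be collected into one table, which is easier to quote in the write-up than two separate outputs. This is a sketch under assumptions: `missing_summary` is a hypothetical helper, and the toy frame only stands in for `df_title` with made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first.
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_percent", ascending=False)


# Toy frame standing in for df_title (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.1, np.nan, 6.4, np.nan],
    "seasons": [np.nan, np.nan, np.nan, 2.0],
})
print(missing_summary(toy))
```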



### What did you know about your dataset?

This dataset lists the titles available on Amazon Prime Video in the United States and was created to support exploratory analysis of the catalog.

This dataset has two files containing the titles (titles.csv) and the cast (credits.csv) for the title.

The titles file contains more than 9,000 unique titles on Amazon Prime, described by 15 columns, including:

id: The title ID on JustWatch.

title: The name of the title.

type: TV show or movie.

description: A brief description.

release_year: The release year.

age_certification: The age certification.

runtime: The length of the episode (SHOW) or movie.

genres: A list of genres.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns(Credits)
df_credit.columns

In [None]:
# Dataset Columns(Titles)
df_title.columns

In [None]:
# Dataset Describe(Credits)
df_credit.describe()

In [None]:
# Dataset describe(Titles)
df_title.describe()

### Variables Description

#### Credit Dataset
1. **person_id**: A unique identifier for each individual involved in the shows or movies listed in the dataset (actors, directors, writers, etc.). This allows tracking contributions across multiple productions.

2. **id**: The identifier of the show or movie that the credit belongs to.

3. **name**: The name of the person associated with the 'person_id'. It lets you know who played which character or had what role in specific productions.

4. **role**: The type of role a person played in a specific production. Common values could include 'ACTOR', 'DIRECTOR', 'WRITER', etc., categorizing individuals based on their contributions.


#### Title Dataset

1. **id**: A unique identifier for each title (show or movie) in the dataset. This is crucial for linking this dataset with other related datasets, such as the credits dataset (`df_credit` in this notebook).

2. **title**: The title of the show or movie.

3. **type**: Specifies whether the title is a 'SHOW' or a 'MOVIE'.

4. **description**: A brief summary or description of the show or movie.

5. **release_year**: The year the show or movie was released.

6. **age_certification**: The age rating or certification for the content (e.g., 'PG-13', 'TV-MA', etc.).

7. **runtime**: The duration of the show or movie in minutes.

8. **genres**: A list of genres associated with the title (e.g., 'Comedy', 'Drama', 'Action').

9. **production_countries**: A list of countries where the show or movie was produced.

10. **seasons**: For shows, this indicates the number of seasons. For movies, it would likely be empty or have a value of 1.

11. **imdb_id**: The unique identifier for the title on IMDb (Internet Movie Database).

12. **imdb_score**: The IMDb rating for the title, reflecting its overall quality as perceived by users.

13. **imdb_votes**: The number of votes received for the title on IMDb.

14. **tmdb_popularity**: A measure of the title's popularity on TMDb (The Movie Database).

15. **tmdb_score**: The rating for the title on TMDb.
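
Because `id` links the two files, credits can be joined to titles whenever an analysis needs both. The sketch below uses toy frames with made-up IDs and names in place of `df_title` and `df_credit`; the `validate` argument is a pandas option that raises an error if a title `id` is unexpectedly duplicated.

```python
import pandas as pd

# Toy stand-ins for df_title and df_credit (IDs and names are made up).
titles = pd.DataFrame({
    "id": ["tm1", "ts2"],
    "title": ["Movie One", "Show Two"],
    "type": ["MOVIE", "SHOW"],
})
credits = pd.DataFrame({
    "person_id": [10, 11, 10],
    "id": ["tm1", "tm1", "ts2"],
    "name": ["Person A", "Person B", "Person A"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# Each credit row should map to exactly one title (many-to-one).
merged = credits.merge(titles, on="id", how="left", validate="many_to_one")

# Number of distinct titles each person appears in.
titles_per_person = merged.groupby("name")["id"].nunique()
print(titles_per_person)
```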

### Check Unique Values for each variable.

In [None]:
# Check unique values for each variable.
# A notebook cell only displays its last expression, so print every result explicitly.
for col in ['person_id', 'id', 'name', 'role']:
    print(f"credits.{col}: {df_credit[col].nunique()} unique values")

for col in df_title.columns:
    print(f"titles.{col}: {df_title[col].nunique()} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Step 1: Fill missing values in 'description' with a placeholder text
# (assign the result back; in-place fillna on a selected column is deprecated in recent pandas)
df_title['description'] = df_title['description'].fillna("No description available")

# Step 2: Fill missing numerical values with their median
num_cols = ['imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
for col in num_cols:
    median_value = df_title[col].median()
    df_title[col] = df_title[col].fillna(median_value)

# Step 3: Remove duplicate rows if any exist
df_title.drop_duplicates(inplace=True)
df_credit.drop_duplicates(inplace=True)

# Step 4: Drop the sparsely populated 'character' column from credits (ignored if it is absent)
df_credit = df_credit.drop(columns=['character'], errors='ignore')

# Step 5: Convert 'genres' and 'production_countries' columns from string representation of lists to actual lists
import ast

df_title['genres'] = df_title['genres'].apply(ast.literal_eval)
df_title['production_countries'] = df_title['production_countries'].apply(ast.literal_eval)

# Step 6: Verify the cleaning process (info() prints its summary and returns None)
df_credit.info()
print(df_credit.head())
df_title.info()
print(df_title.head())

### What all manipulations have you done and insights you found?

#### Data Cleaning and Manipulation

✅ 1. Handled Missing Values:

*   Dropped the character column from credits.csv due to excessive missing values.
*   Filled missing values in description with "No description available".
*   Replaced missing values in IMDb & TMDb-related numeric columns (imdb_score, imdb_votes, tmdb_popularity, tmdb_score) with their median values.

✅ 2. Removed Duplicates:

*   Eliminated duplicate rows from both datasets.
*   Titles dataset reduced from 9,871 to 9,868 rows.
*   Credits dataset reduced from 124,235 to 124,003 rows.

✅ 3. Converted Data Types:

*   Transformed genres and production_countries from string format ('["genre1", "genre2"]') into actual Python lists.
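
The string-to-list conversion relies on `ast.literal_eval`, which raises an error on missing or malformed values. A more defensive variant is sketched below; `parse_list` is a hypothetical helper, and the sample strings are made up to cover the normal, empty, missing, and malformed cases.

```python
import ast

import pandas as pd


def parse_list(value):
    """Convert a stringified list to a Python list; fall back to an empty list."""
    if isinstance(value, list):
        return value              # already converted
    if not isinstance(value, str):
        return []                 # NaN / None and other non-string values
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []                 # malformed string
    return parsed if isinstance(parsed, list) else []


# Made-up samples: normal list, empty list, missing value, malformed string.
raw = pd.Series(["['drama', 'comedy']", "[]", None, "['drama'"])
parsed = raw.apply(parse_list)
print(parsed.tolist())
```

Applied to the notebook's data, this would be used as `df_title['genres'].apply(parse_list)`; whether silently mapping bad values to an empty list is acceptable depends on how many rows are affected.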

#### Insights Gained

1. Content Distribution:

*   Movies (about 80%) dominate the platform compared to TV Shows.
*   Top Genres: Drama, Comedy, Documentary, and Action are the most common genres.

2. IMDb Ratings & Popularity:

*   Majority of content has IMDb scores between 6 and 8.
*   Very few shows/movies have an IMDb rating below 4 or above 9.
*   IMDb votes follow a skewed distribution, with most shows receiving less than 10,000 votes.

3. Trends Over Time:

*   A significant increase in content availability after 2010, showing Amazon's aggressive content expansion.
*   Peak content release years: 2018 and 2019.
*   Content released before 2000 is rare on the platform.

4. Age Certification Insights:

*   Most movies are rated PG-13 or R, meaning a focus on teen and adult audiences.
*   Very few shows are rated G (General Audience).

5. Regional Availability:

*   The majority of content is produced in the United States, followed by India and the UK.
*   Non-English content is growing but still a smaller percentage.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8, 5))
sns.countplot(data=df_title, x="type", palette="coolwarm")
plt.title("Distribution of Content by Type (Movies vs. TV Shows)")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

A count plot is a direct way to visualize the distribution of a categorical variable such as content type (movies vs. TV shows).

##### 2. What is/are the insight(s) found from the chart?



*   Movies clearly outnumber TV shows on Amazon Prime.

*   This imbalance points to a content strategy that favors movies over episodic content.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive: If Amazon Prime aligns its marketing with the dominant type, it can enhance user engagement.

❌ Negative: If the platform lacks balance, it might lose audiences who prefer the underrepresented content type.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Splitting and counting genres
all_genres = [genre for sublist in df_title['genres'] for genre in sublist]
genre_counts = pd.Series(all_genres).value_counts().head(10)

# Plot
plt.figure(figsize=(10, 5))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette="viridis")
plt.title("Top 10 Most Common Genres on Amazon Prime")
plt.xlabel("Count")
plt.ylabel("Genre")
plt.show()


##### 1. Why did you pick the specific chart?



*   A horizontal bar chart effectively displays categorical rankings, making it easy to compare genre popularity.




##### 2. What is/are the insight(s) found from the chart?



*   Drama, comedy, documentary, and action are the most frequently available genres on Amazon Prime.

*   The gap between these genres and the rest suggests Amazon Prime prioritizes certain content categories.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive: Helps content strategists invest more in high-demand genres to attract subscribers.

❌ Negative: If certain genres are underrepresented, audiences who prefer them may look elsewhere.
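
The genre counting in this chart flattens the list column with a nested comprehension. An equivalent pandas-native route is `Series.explode`, sketched here on a toy frame with made-up titles; `toy_genre_counts` is an illustrative name, not a variable from the notebook.

```python
import pandas as pd

# Toy frame standing in for df_title after the list conversion (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["drama", "comedy"], ["drama"], []],
})

# explode() gives one row per genre; titles with an empty list become NaN and are dropped.
toy_genre_counts = toy["genres"].explode().dropna().value_counts()
print(toy_genre_counts)
```

Keeping the result as a Series means `.head(10)` and the existing bar plot code would work unchanged.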

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12,6))
sns.histplot(df_title['release_year'], bins=30, kde=True, color='blue')
plt.title('Content Release Trend Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Titles Released')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram of release years shows how the catalog is distributed over time and how the volume of content has evolved.

##### 2. What is/are the insight(s) found from the chart?

It shows whether the number of titles on Amazon Prime rises or falls with release year. Note that release year is when a title was produced, not when it was added to the platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It helps assess whether Amazon is expanding its content library consistently enough to stay competitive.
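
To back the histogram with numbers, titles can be counted per year and per decade. This is a small sketch on made-up release years; with the real data, `toy["release_year"]` would be replaced by `df_title['release_year']`.

```python
import pandas as pd

# Made-up release years standing in for df_title['release_year'].
toy = pd.DataFrame({"release_year": [1995, 2011, 2018, 2018, 2019, 2019, 2019]})

# Titles per release year, in chronological order.
per_year = toy["release_year"].value_counts().sort_index()

# Titles per decade (e.g. 2018 -> 2010).
decade = (toy["release_year"] // 10) * 10
per_decade = decade.value_counts().sort_index()

# Percentage of titles released after 2010.
share_after_2010 = (toy["release_year"] > 2010).mean() * 100

print(per_year)
print(per_decade)
print(round(share_after_2010, 2))
```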

#### Chart - 4

In [None]:
# Chart - 4 visualization code
sns.set_style('whitegrid')
plt.figure(figsize=(8,5))
sns.histplot(df_title['imdb_score'].dropna(),bins=20,kde=True)
plt.title("Distribution of IMDB Scores")
plt.xlabel("IMDB Score")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is ideal for visualizing the distribution of IMDB scores, helping us understand how ratings are spread across movies and shows.

##### 2. What is/are the insight(s) found from the chart?



*   Most titles have IMDB scores between 6.0
    and 8.5.
*   There are fewer titles with extremely
    low or extremely high ratings.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*  The high concentration of scores in the
   6-8.5 range suggests that most content is well received but not exceptional.
*  No negative impact is evident, but
   understanding what makes highly rated content successful could help improve lower-rated titles.



#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10,5))
sns.histplot(df_title['runtime'].dropna(), bins=30, kde=True, color="purple")
plt.title("Distribution Of Runtime(Minutes)")
plt.xlabel("Runtime(Minutes)")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram helps understand how long most movies and shows are.

##### 2. What is/are the insight(s) found from the chart?



*   Most titles have a runtime between 60
    and 120 minutes.

*   There are a few very short or very long
    titles.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Ensuring content falls within the most common range (60-120 minutes) aligns with audience preferences.

Negative Impact: Extremely long runtimes may discourage viewers unless well-justified.
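
Since `runtime` is the episode length for shows and the full length for movies, the 60-120 minute share is easier to interpret when computed per content type. The following sketch uses made-up runtimes in place of `df_title`.

```python
import pandas as pd

# Toy frame standing in for df_title (runtimes are made up).
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "runtime": [95, 130, 88, 24, 45],
})

# Flag titles whose runtime lies in the 60-120 minute range (bounds inclusive).
in_range = toy["runtime"].between(60, 120)

# Percentage of titles in that range, and median runtime, per content type.
share_by_type = in_range.groupby(toy["type"]).mean() * 100
median_by_type = toy.groupby("type")["runtime"].median()

print(share_by_type.round(2))
print(median_by_type)
```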

#### Chart - 6

In [None]:
# Chart - 6 visualization code
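# Average IMDb score for each of the top 10 genres (genre_counts comes from Chart 2)
genre_imdb = {}
for genre in genre_counts.keys():
    # 'genres' holds Python lists after wrangling, so test list membership directly
    genre_titles = df_title[df_title['genres'].apply(lambda x: genre in x)]
    genre_imdb[genre] = genre_titles['imdb_score'].mean()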

# Convert to DataFrame
genre_imdb_df = pd.DataFrame(genre_imdb.items(), columns=['Genre', 'Average IMDb Score']).sort_values(by="Average IMDb Score", ascending=False)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_imdb_df['Average IMDb Score'], y=genre_imdb_df['Genre'], palette="coolwarm")
plt.title("Average IMDb Score by Genre")
plt.xlabel("Average IMDb Score")
plt.ylabel("Genre")
plt.xlim(5, 9)
plt.show()

##### 1. Why did you pick the specific chart?



*   A bar chart helps compare how different genres perform in terms of ratings.



##### 2. What is/are the insight(s) found from the chart?



*   Some genres consistently receive higher
    IMDb scores.
*   Genres like documentary and history tend
    to have higher ratings, while some action-heavy genres score lower.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Positive Impact: Investing in
    highly rated genres (e.g., documentaries) could improve content reputation.
*   Potential Negative Impact: Popular but
    lower-rated genres may need quality improvement efforts.
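
A genre average is only as reliable as the number of titles behind it, so it can help to report the count next to the mean. The sketch below does this with `explode` and a grouped aggregation on a toy frame; the genre labels and scores are made up.

```python
import pandas as pd

# Toy frame standing in for df_title (genres and scores are made up).
toy = pd.DataFrame({
    "genres": [["drama", "documentary"], ["drama"], ["action"], ["documentary"]],
    "imdb_score": [7.0, 6.0, 5.5, 8.0],
})

# One row per (title, genre) pair; titles without genres are dropped.
exploded = toy.explode("genres").dropna(subset=["genres"])

# Mean score and number of titles per genre, best-rated first.
by_genre = (
    exploded.groupby("genres")["imdb_score"]
    .agg(mean_score="mean", n_titles="count")
    .sort_values("mean_score", ascending=False)
)
print(by_genre)
```

With the real data, genres with very small `n_titles` could be filtered out before plotting.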



#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(12,6))
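# errorbar=None replaces the deprecated ci=None (requires seaborn >= 0.12)
sns.lineplot(x=df_title['release_year'], y=df_title['imdb_score'], errorbar=None, color='red')
plt.title('IMDb Score Trend Over Time')
plt.xlabel('Release Year')
plt.ylabel('Average IMDb Score')
plt.show()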

##### 1. Why did you pick the specific chart?

A line plot helps identify trends in IMDB scores over the years.

##### 2. What is/are the insight(s) found from the chart?



*   IMDB scores fluctuate over time.
*   There may be a declining trend in recent
    years.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Potential Negative Impact: If IMDb
    scores are declining, it may indicate a drop in content quality.
*   Actionable Step: Understanding why older
    titles have higher ratings can help improve future productions.
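
Year-to-year fluctuations can come from years with only a few titles, so a trend is easier to judge when the yearly mean is shown with its title count and a smoothed version. This sketch uses made-up scores; the 3-year window is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy frame standing in for df_title (years and scores are made up).
toy = pd.DataFrame({
    "release_year": [2016, 2016, 2017, 2018, 2018, 2019],
    "imdb_score": [7.0, 6.0, 6.5, 6.0, 5.0, 5.5],
})

# Mean score and number of titles per release year.
yearly = toy.groupby("release_year")["imdb_score"].agg(
    mean_score="mean", n_titles="count"
)

# 3-year rolling mean of the yearly averages to smooth out noise.
yearly["rolling_mean"] = yearly["mean_score"].rolling(window=3, min_periods=1).mean()
print(yearly.round(2))
```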



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
#Selecting numerical columns for correlations analysis
numerical_cols=df_title.select_dtypes(include=['number']).columns

#Compute the correlation matrix
correlation_matrix=df_title[numerical_cols].corr()

#Plot the heatmap
plt.figure(figsize=(10,6))
sns.heatmap(correlation_matrix, annot=True,cmap='coolwarm',fmt='.2f', linewidths=0.6)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is used to visualize the correlation between numerical features,  helping identify strong relationships between different variables.

##### 2. What is/are the insight(s) found from the chart?



*   IMDb score and TMDb score are likely
    positively correlated, indicating that highly rated content tends to perform well across multiple platforms.
*   TMDb popularity might not strongly
    correlate with IMDb score, suggesting that popular content isn't always critically acclaimed.
*   IMDb votes likely have a strong
    correlation with IMDb score, meaning well-rated content tends to receive more votes.
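
The heatmap uses Pearson correlation, which measures linear association and can understate a relationship when one variable is heavily skewed, as vote counts usually are. A rank-based (Spearman) correlation is a common cross-check; it equals the Pearson correlation of the ranks, which is how it is computed in this sketch. The numbers are made up to show the difference.

```python
import pandas as pd

# Toy data: votes grow tenfold with each score step (values are made up).
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "imdb_votes": [10, 100, 1_000, 10_000, 100_000],
})

# Pearson: linear association on the raw values.
pearson = toy.corr(method="pearson")

# Spearman: Pearson correlation computed on the ranks of each column.
spearman = toy.rank().corr(method="pearson")

print(pearson.round(2))
print(spearman.round(2))
```

Here the relationship is perfectly monotonic, so the rank-based value is 1 while the Pearson value is noticeably lower.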


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Selecting a subset of numerical columns to avoid overcrowding
selected_numeric_cols=['imdb_score','tmdb_score','tmdb_popularity','imdb_votes']

# Plot pairplot
# pairplot creates its own figure, so a separate plt.figure() call would only leave an empty figure behind
sns.pairplot(df_title[selected_numeric_cols], diag_kind='kde', corner=True)
plt.suptitle('Pair Plot of Key Numerical Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot helps visualize the relationships between multiple numerical variables, showing scatter plots for correlations and density distributions for individual features.

##### 2. What is/are the insight(s) found from the chart?

*  IMDb score vs. TMDb score: Likely shows a
    moderate positive correlation, meaning content rated highly on one platform tends to be rated well on the other.

*  TMDb popularity vs. IMDb score: Likely
    weak correlation, indicating that highly popular titles don’t always have the highest ratings.

*  IMDb votes vs. IMDb score: More votes
    don’t always mean higher ratings, but extreme cases may show some connection.
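
Because vote counts are strongly right-skewed, their scatter plots tend to bunch up near zero. A log transform before plotting is one common remedy; the sketch below applies `np.log1p` to made-up vote counts and compares the skewness before and after. The column name `log_votes` is illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up vote counts with one very popular title.
toy = pd.DataFrame({"imdb_votes": [5, 50, 120, 900, 250_000]})

# log1p(x) = log(1 + x), which is also defined for zero votes.
toy["log_votes"] = np.log1p(toy["imdb_votes"])

skew_raw = toy["imdb_votes"].skew()
skew_log = toy["log_votes"].skew()
print(f"skew before: {skew_raw:.2f}, after: {skew_log:.2f}")
```

In the notebook, a transformed column like this could replace `imdb_votes` in `selected_numeric_cols` for the pair plot.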


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Business Objective:

To influence subscription growth, user engagement, and content investment strategies in the streaming industry.

Solution: Based on the data visualizations, here are some actionable strategies to achieve the business objective:

**Content Diversification**:

Chart 1: The majority of the content is in the form of movies. While maintaining a solid movie selection, Amazon Prime should invest in expanding its TV show catalog to cater to a broader audience who prefer episodic content.

Chart 2: The most common genres on Amazon Prime are Drama, Comedy, Documentary, and Action. While these are good categories, Amazon Prime can consider expanding into other trending genres like thrillers, horror, or sci-fi to cater to a wider range of preferences.

**Content Quality and Ratings**:

Charts 4 and 6: Focus on maintaining a high level of content quality. This involves prioritizing genres like documentaries and history, which tend to receive higher IMDb scores, thereby enhancing overall platform reputation.

Chart 7: Address the potential decline in IMDb scores for newer releases. Invest in high-quality scripts, production, and marketing to ensure new content aligns with audience expectations.

**Optimizing Runtime**:

Chart 5: Most titles fall within a 60–120-minute runtime, which suggests this range matches general viewer preferences. While shorter, snackable content is also appealing, avoid extremely long formats that might discourage viewer engagement.

**Understanding User Preferences**:

Chart 14: IMDb score and TMDb score are positively correlated, meaning content rated well on one platform is likely rated well on the other. Focus on high-rated content and promote it cross-platform.

Chart 15: TMDb popularity and IMDb score are not always strongly correlated. While chasing popularity is important, prioritize quality over sheer popularity, focusing on content with high IMDb scores.


**Implementation**:

*   Monitor and analyze viewership data of newly acquired content.
*   Regularly assess user feedback and preferences using platform insights and reviews.
*   Conduct ongoing competitor analysis to identify opportunities and potential content gaps.

**Key Outcomes**:

**Increased subscription growth**:

By offering a wider variety of quality content across different genres, lengths, and ratings, Amazon Prime can attract a broader audience.

**Improved user engagement**:

By personalizing recommendations and focusing on high-quality content with optimal runtimes, Amazon can keep viewers engaged and loyal.

**Optimized content investment strategy**:

Prioritize investment in high-potential genres, ensuring a consistent flow of high-rated content that aligns with audience expectations.

# **Conclusion**

By implementing these strategies based on the insights derived from the visualizations, Amazon Prime can achieve its business objective of influencing subscription growth, user engagement, and content investment strategies. This systematic approach will enable the platform to remain competitive and retain its leading position in the streaming landscape.


