<a href="https://colab.research.google.com/github/Arpitamihra/-Amazon-Prime-TV-Shows-and-Movies-Analysis/blob/main/Copy_of_Sample_EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# **Project Name**    -



Project Title: Amazon Prime TV Shows and Movies Analysis

Project Type: EDA (Exploratory Data Analysis)

Contribution: Individual

Team Member 1: Arpita Mishra


# **Project Summary -**

In the rapidly growing digital streaming industry, platforms like Amazon Prime Video are constantly expanding their content libraries to cater to diverse global audiences. This project presents a comprehensive exploratory data analysis (EDA) of the TV shows and movies available on Amazon Prime Video, focusing on content trends, viewer ratings, and production patterns. The dataset analyzed includes over 9,000 unique titles and more than 124,000 actor/director credits, sourced specifically for the United States region.

Objective
The main objective of this project is to uncover data-driven insights from Amazon Prime’s catalog that can assist content strategists, business analysts, and production teams in understanding user preferences, content diversity, and performance metrics. The analysis aims to address key questions such as:

What genres dominate the platform?

How has content distribution evolved over the years?

What are the most common age certifications?

How do IMDb scores and popularity correlate?

Who are the most frequent contributors (actors/directors)?

 Dataset Description
The project uses two CSV files:

titles.csv – Contains metadata of shows and movies such as title name, release year, runtime, age certification, genres, IMDb score, TMDB score, and production countries.

credits.csv – Lists actor and director credits associated with each title, including character names and roles.

 Data Preparation
The data was cleaned and preprocessed to handle missing values and standardize formats. Null values in key columns such as age_certification, runtime, and description were filled with reasonable defaults (e.g., "Not Rated", median runtime, etc.). Columns like genres and production_countries, which were stored as strings, were parsed into Python lists for better handling. Outliers in runtime were addressed using quantile thresholds.

 Key Analyses & Visualizations
Several exploratory analyses and visualizations were performed using Python libraries such as Pandas, Seaborn, and Matplotlib. The major findings include:

Content Type Distribution: The platform is movie-heavy, with approximately 75% of the content being movies and the rest TV shows.

Top Genres: Drama, Comedy, Action, and Thriller are the most dominant genres, suggesting viewer interest in emotionally engaging and high-paced content.

Release Year Trend: There has been a steady increase in the number of releases, especially from 2018 onwards, indicating a growing investment in original and third-party content.

Age Certification: A large number of titles are rated either "Not Rated" or "TV-MA", reflecting a focus on mature audiences.

IMDb Score Distribution: IMDb scores for most titles range between 5.5 and 7.5, suggesting generally average-to-good reception. Higher scores often correlate with increased TMDB popularity.

Production Countries: The United States leads by a significant margin, followed by India, the United Kingdom, and Canada.

Credits Analysis: The actor-director data reveals that a small number of actors and directors appear frequently, indicating recurring collaborations or popular casting choices.

 Business Insights
The insights derived from this analysis have direct business value:

Content Strategy: Emphasis on drama and action content can attract more users.

Target Audience: The dominance of adult-rated content shows the need for diverse age-targeted programming.

Regional Focus: Encouraging more international content could help expand Prime’s global reach.

Talent Management: Recognizing frequently cast actors and directors can help strengthen brand identity through familiar faces.

 Conclusion
This project provides a solid foundation for understanding how Amazon Prime Video curates and distributes its content. By leveraging data, streaming platforms can make informed decisions about what to produce, license, or promote, ultimately improving user engagement and subscription growth. The findings of this analysis can guide not only content development but also marketing and audience targeting strategies in a highly competitive streaming landscape.




# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


#### **Define Your Business Objective?**

Business Objective
The primary business objective of this project is to analyze the content library of Amazon Prime Video in the United States to derive actionable insights that can support strategic decision-making for content planning, audience targeting, and platform optimization.

By performing a structured exploratory data analysis (EDA), this project aims to:

Identify dominant genres and content types to understand viewer preferences and consumption patterns.

Track content trends over time to evaluate how Amazon’s library has evolved and aligned with market demand.

Analyze viewer ratings and popularity metrics (IMDb, TMDB) to determine the quality and impact of content offerings.

Examine content certification and regional production trends to assess audience segmentation and regional diversity.

Spotlight top-performing actors, directors, and content attributes that can influence future investments.

Who Will Benefit?
Content Strategy Teams: Optimize genre mix and content type investments.

Marketing Teams: Tailor campaigns based on audience segments and trends.

Production Teams: Identify high-performing talent and genres.

Business Analysts: Monitor engagement-driving factors and performance indicators.



# **General Guidelines** : -  


1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing the necessary libraries
try:
    import pandas as pd  # For data manipulation
    import numpy as np  # For numerical operations
    import matplotlib.pyplot as plt  # For data visualization (static plots)
    import seaborn as sns  # For advanced data visualization (heatmaps, pair plots, etc.)
    import warnings  # To suppress warnings during execution
    warnings.filterwarnings('ignore')  # Ignoring warnings
    sns.set(style="whitegrid")  # Setting the default style for seaborn
except ImportError as e:
    print(f"Error importing libraries: {e}")



In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Attempt to load the datasets with exception handling
try:
    # Load the Titles dataset
    titles = pd.read_csv('/content/drive/MyDrive/titles.csv')
    print("Titles dataset loaded successfully.")

    # Load the Credits dataset
    credits = pd.read_csv('/content/drive/MyDrive/credits.csv')
    print("Credits dataset loaded successfully.")

except FileNotFoundError as e:
    print(f"File not found. Please check the path.\nDetails: {e}")
except pd.errors.ParserError as e:
    print(f"Error parsing the file. Check for malformed CSV.\nDetails: {e}")
except Exception as e:
    print(f"An unexpected error occurred:\n{e}")

# Check if titles dataset is loaded and display a preview
if 'titles' in locals():
    print("\nPreview of Titles Dataset:")
    print(titles.head())
else:
    print("Titles dataset is not loaded successfully.")

# Check if credits dataset is loaded and display a preview
if 'credits' in locals():
    print("\nPreview of Credits Dataset:")
    print(credits.head())
else:
    print("Credits dataset is not loaded successfully.")




### Dataset First View

In [None]:
merged_df = pd.merge(titles, credits, on='id')
merged_df.head()

In [None]:
# Dataset First Look
# Check if 'titles' and 'credits' exist before proceeding
if 'titles' in locals():
    print("\nTitles Dataset Info:")
    titles.info()
else:
    print("Titles dataset is not loaded.")

if 'credits' in locals():
    print("\nCredits Dataset Info:")
    credits.info()
else:
    print("Credits dataset is not loaded.")






### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Check the number of rows and columns in the Titles dataset
if 'titles' in locals():
    print(f"Titles Dataset → Rows: {titles.shape[0]}, Columns: {titles.shape[1]}")
else:
    print("Titles dataset is not loaded.")

# Check the number of rows and columns in the Credits dataset
if 'credits' in locals():
    print(f"Credits Dataset → Rows: {credits.shape[0]}, Columns: {credits.shape[1]}")
else:
    print("Credits dataset is not loaded.")


### Dataset Information

In [None]:
# Dataset Info
# Titles Dataset Info
if 'titles' in locals():
    print("📄 Titles Dataset Info:\n")
    titles.info()
else:
    print("❌ Titles dataset is not loaded.")

# Credits Dataset Info
if 'credits' in locals():
    print("\n🎭 Credits Dataset Info:\n")
    credits.info()
else:
    print("❌ Credits dataset is not loaded.")


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Check for duplicate rows in Titles dataset
if 'titles' in locals():
    duplicate_titles = titles.duplicated().sum()
    print(f"🔁 Duplicate Rows in Titles Dataset: {duplicate_titles}")
else:
    print("❌ Titles dataset is not loaded.")

# Check for duplicate rows in Credits dataset
if 'credits' in locals():
    duplicate_credits = credits.duplicated().sum()
    print(f"🔁 Duplicate Rows in Credits Dataset: {duplicate_credits}")
else:
    print("❌ Credits dataset is not loaded.")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Check for missing values in Titles dataset
if 'titles' in locals():
    print("🧼 Missing Values in Titles Dataset:\n")
    print(titles.isnull().sum().sort_values(ascending=False))
else:
    print("❌ Titles dataset is not loaded.")

# Check for missing values in Credits dataset
if 'credits' in locals():
    print("\n🧼 Missing Values in Credits Dataset:\n")
    print(credits.isnull().sum().sort_values(ascending=False))
else:
    print("❌ Credits dataset is not loaded.")


In [None]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

# Set a consistent style
sns.set(style="whitegrid")

# Visualize missing data in the Titles dataset
if 'titles' in locals():
    print("📉 Missing Data Visualization - Titles Dataset")
    msno.matrix(titles)
    plt.title("Missing Data Matrix - Titles")
    plt.show()

    msno.heatmap(titles)
    plt.title("Missing Data Heatmap - Titles")
    plt.show()
else:
    print("❌ Titles dataset is not loaded.")

# Visualize missing data in the Credits dataset
if 'credits' in locals():
    print("\n📉 Missing Data Visualization - Credits Dataset")
    msno.matrix(credits)
    plt.title("Missing Data Matrix - Credits")
    plt.show()

    msno.heatmap(credits)
    plt.title("Missing Data Heatmap - Credits")
    plt.show()
else:
    print("❌ Credits dataset is not loaded.")


### What did you know about your dataset?

titles.csv contains information about more than 9,000 unique movies and TV shows.

credits.csv contains more than 124,000 actor and director credits.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Display the columns of Titles dataset
if 'titles' in locals():
    print("📄 Columns in Titles Dataset:")
    print(titles.columns.tolist())
else:
    print("❌ Titles dataset is not loaded.")

# Display the columns of Credits dataset
if 'credits' in locals():
    print("\n📄 Columns in Credits Dataset:")
    print(credits.columns.tolist())
else:
    print("❌ Credits dataset is not loaded.")
# Titles Dataset Columns:
# ['id', 'title', 'show_type', 'description', 'release_year', 'age_certification', 'runtime', 'genres', 'production_countries', 'seasons', 'imdb_id', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
# Credits Dataset Columns:
# ['person_id', 'id', 'name', 'character_name', 'role']



In [None]:
# Dataset Describe
# Describe the numerical columns in Titles dataset
if 'titles' in locals():
    print("📊 Statistical Summary of Titles Dataset:")
    print(titles.describe())
else:
    print("❌ Titles dataset is not loaded.")

# Describe the numerical columns in Credits dataset (if applicable)
# Since Credits dataset mainly contains categorical data, we may not have much for describe()
if 'credits' in locals():
    print("\n📊 Statistical Summary of Credits Dataset:")
    print(credits.describe())
else:
    print("❌ Credits dataset is not loaded.")


### Variables Description

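titles.csv – Titles Dataset
The titles.csv file contains metadata about movies and TV shows available on Amazon Prime Video, including title name, release year, runtime, age certification, genres, IMDb score, TMDB score, and production countries.

credits.csv – Credits Dataset
The credits.csv file lists the actor and director credits associated with each title, including character names and roles.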

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Check unique values in Titles dataset
if 'merged_df' in locals() or 'titles' in locals():
    df = merged_df if 'merged_df' in locals() else titles
    print("🔍 Unique Values in Titles Dataset:")
    print(df.nunique())

    print("\n📌 Sample Unique Values in Titles Dataset:")
    for column in ['show_type', 'age_certification', 'genres']:
        if column in df.columns:
            print(f"{column}: {df[column].unique()[:10]}")
        else:
            print(f"⚠️ Column '{column}' not found in Titles dataset.")

else:
    print("❌ Titles dataset is not loaded.")

# Check unique values in Credits dataset
if 'credits' in locals():
    print("\n🔍 Unique Values in Credits Dataset:")
    print(credits.nunique())

    print("\n📌 Sample Unique Values in Credits Dataset:")
    for column in ['role']:
        if column in credits.columns:
            print(f"{column}: {credits[column].unique()[:10]}")
        else:
            print(f"⚠️ Column '{column}' not found in Credits dataset.")
else:
    print("❌ Credits dataset is not loaded.")



## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Step 1: Data Cleaning

# Step 1: Data Cleaning and Handling Missing Values for merged_df
if 'merged_df' in locals():
    print("💡 Handling Missing Values in merged_df:")
    print(merged_df.isnull().sum())  # Show count of missing values for each column

    # Fill missing 'description'
    if 'description' in merged_df.columns:
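        # Assign the result back; in-place fillna on a selected column may not update merged_df
        merged_df['description'] = merged_df['description'].fillna("No Description")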

    # Drop rows where 'show_type' is missing
    if 'show_type' in merged_df.columns:
        merged_df.dropna(subset=['show_type'], inplace=True)
    else:
        print("⚠️ Column 'show_type' not found in merged_df.")

    # Safe type conversions
    for col in ['release_year', 'runtime']:
        if col in merged_df.columns:
            merged_df[col] = pd.to_numeric(merged_df[col], errors='coerce').fillna(0).astype(int)

    for col in ['imdb_score', 'tmdb_popularity']:
        if col in merged_df.columns:
            # Convert to numeric first, then impute with the mean of the numeric values
            numeric_col = pd.to_numeric(merged_df[col], errors='coerce')
            merged_df[col] = numeric_col.fillna(numeric_col.mean())

    # Fill missing character names (from credits part)
    if 'character_name' in merged_df.columns:
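        # Assign the result back; in-place fillna on a selected column may not update merged_df
        merged_df['character_name'] = merged_df['character_name'].fillna("Unknown")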

    # Remove duplicates
    before = merged_df.shape[0]
    merged_df.drop_duplicates(inplace=True)
    after = merged_df.shape[0]
    print(f"🗑️ Removed {before - after} duplicate rows from merged_df.")

else:
    print("❌ merged_df is not loaded.")




### What all manipulations have you done and insights you found?

Content Distribution by Decade:

We can now analyze how the content library of Amazon Prime Video has evolved over time by examining trends across different decades. This would help understand historical growth in the platform’s offerings.

Business Implication: Knowing which decade produced the most content can help stakeholders in content acquisition strategies. For example, if the 2000s had the most releases, it may indicate a high interest in nostalgic content.
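The decade view described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: `toy_titles` is a made-up stand-in for the titles table, and only the `release_year` column name comes from the dataset.

```python
import pandas as pd

# Toy stand-in for the titles table; the values are made up for illustration.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "release_year": [1994, 2003, 2008, 2019, 2021],
})

# Integer-divide by 10 and multiply back to get the first year of each decade.
toy_titles["decade"] = (toy_titles["release_year"] // 10) * 10

# Count titles per decade, in chronological order.
titles_per_decade = toy_titles["decade"].value_counts().sort_index()
print(titles_per_decade)
```

Applied to the real data, the same two lines would run on `titles` instead of `toy_titles`.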

Impact of Genre on Popularity and Ratings:

After cleaning and transforming the data, we can investigate whether certain genres (e.g., Action, Comedy, Drama) tend to have better IMDb ratings or TMDb popularity. This can be crucial in determining which genres are more successful.

Business Implication: Streaming platforms can use this insight to prioritize acquiring or producing titles in genres that have historically higher ratings or more popularity.
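One way to compare genres by rating is sketched below. It assumes, as the project summary states, that `genres` is stored as a stringified list. The helper `parse_list` and the toy values are hypothetical and are not part of the notebook.

```python
import ast
import pandas as pd

# Toy stand-in for the titles table; the values are made up for illustration.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'comedy']", "['drama']", "[]"],
    "imdb_score": [8.0, 6.0, 5.0],
})

def parse_list(value):
    """Turn a stringified list such as "['drama']" into a Python list (hypothetical helper)."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []  # Missing or malformed entries become an empty list
    return parsed if isinstance(parsed, list) else []

toy_titles["genre_list"] = toy_titles["genres"].apply(parse_list)

# One row per (title, genre); titles with an empty list produce NaN and are dropped.
exploded = toy_titles.explode("genre_list").dropna(subset=["genre_list"])

# Mean IMDb score and number of titles per genre.
genre_scores = exploded.groupby("genre_list")["imdb_score"].agg(["mean", "count"])
print(genre_scores)
```

Reporting the count next to the mean helps avoid over-reading genres that have only a handful of titles.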

TV Shows vs. Movies:

Using the show_type column, we can compare TV Shows and Movies to examine:

Which type tends to have longer runtimes?

Which type receives higher ratings on IMDb or TMDb?

Business Implication: This helps content creators and distributors decide where to focus their investments, whether they should develop more movies or focus on long-running TV shows based on user preferences.
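The two questions above reduce to a grouped summary. The sketch below uses made-up values. The column name `type` is an assumption, since this notebook refers to the content-type column as both `type` and `show_type`.

```python
import pandas as pd

# Toy stand-in for the titles table; the values are made up for illustration.
toy_titles = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW"],
    "runtime": [90, 110, 45, 25],
    "imdb_score": [6.0, 7.0, 8.0, 7.0],
})

# Median runtime and median IMDb score for each content type.
comparison = toy_titles.groupby("type")[["runtime", "imdb_score"]].median()
print(comparison)
```

The median is used rather than the mean so that a few extreme runtimes do not dominate the comparison.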

Handling Missing Data:

Filling missing values lets the analysis proceed without gaps in the data. Imputed values are estimates, however, so findings on ratings, popularity, and other metrics should be read with that in mind.
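The preparation steps named in the project summary ("Not Rated" defaults, median runtime, quantile thresholds for runtime outliers) could look roughly like this. It is a sketch on made-up values. The 1st and 99th percentile cut-offs are an assumption, because the summary does not state which quantiles were used.

```python
import pandas as pd

# Toy stand-in for the titles table; the values are made up for illustration.
toy_titles = pd.DataFrame({
    "age_certification": ["PG", None, "R", None, "TV-MA"],
    "runtime": [90.0, None, 100.0, 110.0, 600.0],
})

cleaned = toy_titles.copy()

# Categorical default for missing certifications.
cleaned["age_certification"] = cleaned["age_certification"].fillna("Not Rated")

# Median imputation for missing runtimes (the median is robust to the 600-minute outlier).
median_runtime = cleaned["runtime"].median()
cleaned["runtime"] = cleaned["runtime"].fillna(median_runtime)

# Cap extreme runtimes at assumed 1st/99th percentile thresholds.
lower, upper = cleaned["runtime"].quantile([0.01, 0.99])
cleaned["runtime"] = cleaned["runtime"].clip(lower=lower, upper=upper)
print(cleaned)
```

Clipping keeps every row while limiting the influence of extreme values. Dropping rows outside the thresholds is an alternative when those rows are suspected data errors.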

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Checking if 'titles' dataset is loaded
if 'titles' in locals():
    # Set the visual style
    sns.set(style="whitegrid")

    # Create the histogram for IMDb scores
    plt.figure(figsize=(10,6))
    sns.histplot(titles['imdb_score'], kde=True, color='blue', bins=30)

    # Adding labels and title
    plt.title("Distribution of IMDb Scores", fontsize=16)
    plt.xlabel("IMDb Score", fontsize=12)
    plt.ylabel("Frequency", fontsize=12)

    # Show the plot
    plt.show()

else:
    print("❌ Titles dataset is not loaded.")


##### 1. Why did you pick the specific chart?

I chose a Histogram with a KDE (Kernel Density Estimation) curve for the following reasons:

Distribution of Continuous Data:

IMDb scores are numerical values on a 1 to 10 scale. A Histogram is well suited to visualizing the distribution of such data, as it groups the scores into intervals (bins) and shows how frequently titles fall into each interval. This helps in identifying patterns like whether scores are concentrated around a particular value or spread across a wide range.

Identification of Skewness:

The histogram allows us to easily see if the IMDb scores are skewed in any direction. For example, are most of the titles rated around the higher end (e.g., 7-10), or do they tend to be clustered at the lower end (e.g., 1-3)? This kind of insight helps identify content quality trends.

Smoothness of Distribution:

Adding the KDE curve helps smooth out the histogram and shows a continuous estimate of the probability density function of the scores. The KDE provides a clearer view of the overall shape of the distribution, helping to spot areas of concentration more easily. This is important because it reveals underlying trends that might be missed with just the histogram alone.

Clarity and Interpretability:

A histogram is intuitive and easy to interpret for most audiences. It can clearly show whether the distribution is normal, skewed, or if there are any outliers or extreme ratings. This allows the stakeholders to quickly assess the quality of content available on the platform in terms of user ratings.
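The shape read off the histogram can also be checked numerically. The sketch below bins a made-up set of scores into unit-wide intervals and computes the sample skewness. `toy_scores` is illustrative only and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration; not taken from the dataset.
toy_scores = pd.Series([2.5, 5.5, 6.0, 6.2, 6.5, 6.8, 7.0, 7.1, 7.4, 8.9])

# Ten unit-wide bins covering 0 to 10, the same grouping a histogram draws as bars.
counts, bin_edges = np.histogram(toy_scores, bins=10, range=(0, 10))

# Sample skewness: negative means a longer tail toward low scores.
skewness = toy_scores.skew()

print(dict(zip(bin_edges[:-1].astype(int), counts)))
print(f"Skewness: {skewness:.2f}")
```

On the real data, `titles['imdb_score'].dropna()` would take the place of `toy_scores`.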

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Checking if 'merged_df' is loaded
if 'merged_df' in locals():
    # The content-type column appears as both 'type' and 'show_type' in this notebook; use whichever exists
    type_col = 'type' if 'type' in merged_df.columns else 'show_type'

    # merged_df has one row per credit, so keep one row per title;
    # otherwise titles with large casts are counted many times
    plot_df = merged_df.drop_duplicates(subset='id').dropna(subset=[type_col, 'imdb_score']).copy()

    # Ensure correct data types
    plot_df['imdb_score'] = pd.to_numeric(plot_df['imdb_score'], errors='coerce')
    plot_df = plot_df.dropna(subset=['imdb_score'])  # In case coercion added NaNs

    # Optional: map an integer-encoded type column to readable labels
    if pd.api.types.is_integer_dtype(plot_df[type_col]):
        plot_df[type_col] = plot_df[type_col].map({0: 'Movie', 1: 'TV Show'})

    # Create the plot
    sns.set(style="whitegrid")
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=type_col, y='imdb_score', data=plot_df, palette="Set2")

    plt.title("IMDb Scores by Show Type (Movies vs TV Shows)", fontsize=16)
    plt.xlabel("Show Type", fontsize=12)
    plt.ylabel("IMDb Score", fontsize=12)
    plt.show()

else:
    print("❌ merged_df is not loaded.")



##### 1. Why did you pick the specific chart?

Comparison of Two Categorical Groups:

A Boxplot is ideal for comparing the distribution of a numerical variable (IMDb score) between two categorical groups (Movies and TV Shows, in this case). It allows us to easily assess if one category generally has higher or lower ratings than the other, which is crucial for understanding content performance across these types.

Identifying Central Tendency and Spread:

The Boxplot displays key summary statistics:

Median (the central line in the box), which helps understand the central tendency of IMDb scores for each type of content.

Interquartile Range (IQR), which shows how spread out the ratings are within the middle 50% of the data.

The whiskers give a sense of the overall range of the scores.

This gives a comprehensive understanding of where most ratings fall, and whether there's a significant difference in ratings between Movies and TV Shows.
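The statistics a boxplot draws can be computed directly, which is useful for quoting exact numbers next to the chart. The sketch below uses made-up scores and the 1.5 × IQR whisker rule, which is the default in Matplotlib and Seaborn.

```python
import pandas as pd

# Made-up scores for illustration; not taken from the dataset.
toy_scores = pd.Series([2.0, 5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 9.9])

q1 = toy_scores.quantile(0.25)   # Lower edge of the box
median = toy_scores.median()     # Line inside the box
q3 = toy_scores.quantile(0.75)   # Upper edge of the box
iqr = q3 - q1                    # Spread of the middle 50% of values

# Points beyond these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy_scores[(toy_scores < lower_fence) | (toy_scores > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
```

To compare Movies and TV Shows, the same calculation can be run once per group, for example inside a `groupby` on the content-type column.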

Outliers Identification:

Boxplots are particularly useful for identifying outliers. Outliers in the IMDb scores (either very high or very low) might indicate exceptional content (either extremely popular or critically panned), which could be of interest for further analysis or targeted content strategies.

Clear and Concise Visualization:

Boxplots are compact and easy to interpret, which makes them very effective when we want to compare multiple groups. Unlike other charts (e.g., bar plots), boxplots show both the spread and central tendency in one view, giving a lot of information with minimal space.

Relevance to Business Decisions:

Understanding whether Movies or TV Shows are rated higher helps inform content strategy. If one group tends to perform better, this could guide decisions on where to focus investment or which content types to prioritize.

The presence of outliers could also indicate areas of extreme user sentiment that might be valuable for content curation or marketing strategies.

##### 2. What is/are the insight(s) found from the chart?

Difference in Median IMDb Scores:

The median IMDb score for Movies and TV Shows may differ significantly. If Movies have a higher median score compared to TV Shows, this would indicate that, on average, Movies receive higher ratings than TV Shows. This could suggest that viewers generally rate movies more favorably compared to TV series, possibly due to the more condensed storytelling format of movies.

Spread of Scores (Interquartile Range):

The interquartile range (IQR), represented by the height of the box, can reveal the spread of IMDb scores for both Movies and TV Shows. If Movies have a narrower IQR, it indicates that the ratings for Movies are more consistent across titles. Conversely, a wider IQR for TV Shows may suggest greater variability in how TV Shows are rated by audiences, with some shows being highly rated and others receiving lower ratings.

A larger spread for TV Shows might suggest that while some TV Shows are highly popular, others might struggle in terms of quality.

Presence of Outliers:

Outliers in the boxplot can indicate titles that are significantly better or worse than the majority. For example:

High outliers might represent a few Movies or TV Shows that received exceptionally high ratings (possibly due to global popularity or critical acclaim).

Low outliers could indicate content that received unusually low ratings, which might be of concern for the platform, as these titles might be negatively impacting overall user satisfaction.

These outliers are important because they can provide actionable insights, such as identifying high-performing content to promote or low-performing content that might need to be reevaluated.

Skewness of Distribution:

Skewness shows up in where the median line sits inside the box and in the relative lengths of the whiskers. For example, if the median for Movies sits near the top of the box and the lower whisker is longer, the distribution is negatively (left) skewed: most Movies have relatively high ratings, with a tail of low-rated titles.

A similar observation for TV Shows could indicate that the distribution of IMDb scores for TV Shows is skewed either positively or negatively, revealing how the content is generally perceived by audiences.

Insight into Content Strategy:

If Movies have higher ratings overall, this could suggest that investing more in high-quality movies might yield better returns for the platform. This insight can influence future content acquisition or production strategies, suggesting the need for more focus on high-quality movies to retain or attract subscribers.

If TV Shows show a larger spread with some outliers on both ends, it may signal that there is potential for creating highly popular TV Shows, but also a need to focus on improving quality control and audience engagement with TV content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Lower Ratings for TV Shows:

If TV Shows have lower median IMDb scores compared to Movies, it suggests that TV Shows on the platform are not resonating well with audiences. This could negatively impact subscriber retention and user engagement.

Negative Impact: If TV Shows are a core part of the platform's offering, low ratings could lead to customer dissatisfaction. Users might feel that the quality of the TV shows is subpar compared to what is available on competing streaming platforms like Netflix or Hulu.

Solution: The platform may need to reassess its TV show strategy, invest more in high-quality TV content, or curate better shows based on audience preferences to improve satisfaction.

Wider Spread for TV Shows (High Variability):

A wide range of IMDb scores for TV Shows, with many outliers, could suggest inconsistent quality across the platform’s TV content. This variability may indicate that while some TV Shows perform exceptionally well, many others fail to meet audience expectations.

Negative Impact: Viewers may be put off by the inconsistent quality, leading to reduced subscription renewals or increased churn. If viewers feel that they can't rely on the platform to provide consistently good content, they may choose to switch to other platforms.

Solution: The platform could focus on quality control for TV Shows, perhaps by filtering out low-performing content or improving the production and curation process for TV shows.

Outliers with Low Ratings:

The presence of low-rated outliers could be detrimental if these titles are poorly received by large audiences. These poor ratings could negatively influence potential customers who check user reviews before subscribing.

Negative Impact: If the platform continues to offer or heavily promote low-rated titles, it may cause a deterioration in brand reputation, as users could perceive the platform as offering low-quality content. Low-rated content might drive potential subscribers away.

Solution: These low-rated outliers could be removed or replaced by better-performing titles, and the platform could consider improving its content vetting process to avoid offering poorly received shows or movies in the future.

Market Perception of TV Shows:

If TV Shows have lower ratings overall compared to Movies, it might indicate that the platform is struggling to offer competitive TV content compared to other major players in the streaming industry.

Negative Impact: This could result in negative growth if viewers start perceiving the platform as a movie-only platform and lose interest in its TV offerings.

Solution: The platform could focus on acquiring popular TV shows or creating exclusive content to strengthen its competitive position in the TV show market.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os


try:
    # Replace with your actual Google Drive path if needed
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # <- UPDATE THIS IF NEEDED

    if 'titles' in locals():
        # Reuse the DataFrame loaded earlier instead of reading the file again
        print("📂 Titles dataset already in memory.")
    elif os.path.exists(file_path):
        titles = pd.read_csv(file_path)
        print("✅ Titles dataset loaded successfully.")
    else:
        print("❌ File path not found. Please check the path.")
except Exception as e:
    print("❌ Error loading titles dataset:", e)

try:
    plt.figure(figsize=(16, 6))
    sns.countplot(data=titles,
                  x='release_year',
                  order=sorted(titles['release_year'].dropna().unique()),
                  palette='viridis')

    plt.xticks(rotation=90)
    plt.title("Number of Titles Released per Year on Amazon Prime")
    plt.xlabel("Release Year")
    plt.ylabel("Number of Titles")
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)



##### 1. Why did you pick the specific chart?

I picked this countplot to visualize the number of titles released per year because it clearly shows content trends over time, helping us understand how Amazon Prime’s content library has evolved.

##### 2. What is/are the insight(s) found from the chart?

The chart highlights high-content release years (such as 2019–2021) and a decline in titles after 2021.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Identifying high-content years (like 2019–2021) can guide future content strategies.

The decline after 2021 could signal production slowdowns or shifting strategies, which gives Amazon a reason to re-evaluate and invest accordingly.
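To back the yearly trend with numbers rather than reading bar heights, the counts behind the countplot can be computed directly. The sketch below uses a small invented frame in place of `titles.csv`; only the `release_year` column name comes from the notebook, and the values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for titles.csv; the values are invented for illustration only.
toy = pd.DataFrame({"release_year": [2019, 2019, 2020, 2021, 2021, 2021, 2022]})

# Count titles per release year and keep the years in chronological order.
per_year = toy["release_year"].value_counts().sort_index()

# Year-over-year change makes rises and declines explicit.
yoy_change = per_year.diff()

print(per_year)
print(yoy_change)
```

Note that `release_year` describes when a title was released, not when it was added to the catalog, and a drop in the most recent year may partly reflect when the data was collected.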

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Load titles dataset if not already loaded
try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # 🔁 Update this if needed

    if 'titles' not in locals():
        titles = pd.read_csv(file_path)
        print("✅ Titles dataset loaded successfully.")
    else:
        print("📂 Titles dataset already in memory.")

except Exception as e:
    print("❌ Error loading titles dataset:", e)

try:
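    plt.figure(figsize=(6, 6))
    # NOTE: 'show_type' must match the column name in your copy of titles.csv;
    # if the column is named differently (for example 'type'), update it here.
    sns.countplot(data=titles, x='show_type', palette='Set2')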

    plt.title("Distribution of Content by Type (Movie vs TV Show)")
    plt.xlabel("Type of Content")
    plt.ylabel("Count")
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)



##### 1. Why did you pick the specific chart?

I chose this chart to compare the number of movies vs TV shows on Amazon Prime using a simple countplot, which is perfect for visualizing frequency of categorical data like show_type.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that movies dominate Amazon Prime's content library compared to TV shows, indicating a strong focus on film-based content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight helps create a positive business impact by showing where Amazon Prime invests most: in movies. This can guide decisions on whether to balance the content mix by adding more TV shows to attract binge-watchers.

Potential negative growth may occur if audiences seeking long-form TV content feel underserved, possibly turning to competitors like Netflix. Adjusting the ratio can improve retention and broaden audience appeal.
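The "movies dominate" observation can be quantified as a share of the catalog instead of a raw count. This is a minimal sketch on invented data; the `show_type` column name follows the chart code, while the category labels `MOVIE` and `SHOW` are assumptions.

```python
import pandas as pd

# Toy stand-in for the titles table; labels and counts are invented.
toy = pd.DataFrame({"show_type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"]})

# normalize=True returns proportions instead of counts.
share = toy["show_type"].value_counts(normalize=True)

print((share * 100).round(1))
```

Reporting the split as a percentage makes it easier to compare against other platforms or against a target content mix.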

#### Chart - 5

In [None]:
# Chart - 5 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # 🔁 Update this path if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading titles dataset:", e)


try:
    # Explode genres for frequency count
    titles_exploded = titles.copy()
    titles_exploded['genres'] = titles_exploded['genres'].apply(
        lambda x: x.strip("[]").replace("'", "").split(", ") if pd.notnull(x) else []
    )
    titles_exploded = titles_exploded.explode('genres')


    top_genres = titles_exploded['genres'].value_counts().nlargest(10)


    plt.figure(figsize=(10, 6))
    sns.barplot(x=top_genres.values, y=top_genres.index, palette='cubehelix')

    plt.title("Top 10 Most Common Genres on Amazon Prime")
    plt.xlabel("Number of Titles")
    plt.ylabel("Genre")
    plt.grid(axis='x')
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during Chart 5 visualization:", e)


##### 1. Why did you pick the specific chart?

I picked this chart to identify the most popular genres on Amazon Prime. A bar chart is ideal for comparing category frequencies like genres.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Drama, Comedy, and Action are the most common genres on Amazon Prime, indicating a strong focus on mainstream entertainment categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help by guiding Amazon to invest more in popular genres for higher engagement.

However, ignoring niche genres may cause audience loss, leading to negative growth due to lack of content diversity.
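The genre charts split the `genres` string with `strip` and `split`. For an empty list string such as `"[]"`, that approach yields one empty-string genre, which then shows up as its own category. A more defensive alternative, sketched here under the assumption that the column stores Python-style list literals, is to parse with `ast.literal_eval`. The helper name `parse_genres` is hypothetical.

```python
import ast


def parse_genres(raw):
    """Turn a string such as "['drama', 'comedy']" into a list of genre names.

    Hypothetical helper: returns an empty list for missing or malformed input.
    """
    if not isinstance(raw, str):
        return []  # covers NaN / None from missing cells
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    # Drop blank entries so they never become a category.
    return [str(g).strip() for g in parsed if str(g).strip()]


print(parse_genres("['drama', 'comedy']"))
print(parse_genres("[]"))
```

It could be applied with `titles['genres'].apply(parse_genres)` before calling `explode('genres')`; rows with an empty list then become missing values after the explode and can be dropped.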

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded.")
except Exception as e:
    print("❌ Error loading dataset:", e)

try:
    titles_exploded = titles.copy()


    titles_exploded['genres'] = titles_exploded['genres'].apply(
        lambda x: x.strip("[]").replace("'", "").split(", ") if pd.notnull(x) else []
    )
    titles_exploded = titles_exploded.explode('genres')


    genre_imdb = titles_exploded.dropna(subset=['imdb_score'])

    avg_imdb_by_genre = genre_imdb.groupby('genres')['imdb_score'].mean().sort_values(ascending=False).head(10)


    plt.figure(figsize=(10, 6))
    sns.barplot(x=avg_imdb_by_genre.values, y=avg_imdb_by_genre.index, palette='crest')

    plt.title("Top 10 Genres with Highest Average IMDb Score")
    plt.xlabel("Average IMDb Score")
    plt.ylabel("Genre")
    plt.grid(axis='x')
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)



##### 1. Why did you pick the specific chart?

I picked this chart to compare IMDb scores across genres and identify which genres consistently receive higher viewer ratings.

##### 2. What is/are the insight(s) found from the chart?

Genres like Documentary, War, and History have the highest average IMDb scores, indicating strong audience appreciation despite possibly lower volume.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, focusing on highly-rated genres like Documentary or War can attract dedicated viewers, boosting engagement and subscriptions.

However, neglecting low-rating genres could alienate niche audiences, potentially leading to negative growth due to content imbalance.
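An average score for a genre with very few titles can be misleading, so it may help to report the title count next to the mean and filter out small genres. The sketch below uses invented data, and the threshold `MIN_TITLES` is an arbitrary assumption that should be tuned for the real dataset.

```python
import pandas as pd

# Toy exploded table (one row per title-genre pair); values are invented.
toy = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "war", "comedy", "comedy"],
    "imdb_score": [7.0, 6.0, 8.0, 9.5, 5.0, 6.0],
})

# Mean score and number of titles per genre.
stats = toy.groupby("genres")["imdb_score"].agg(["mean", "count"])

# Keep only genres with enough titles for the mean to be meaningful.
MIN_TITLES = 2  # assumption: pick a threshold suited to the real data
reliable = stats[stats["count"] >= MIN_TITLES].sort_values("mean", ascending=False)

print(reliable)
```

Here the single `war` title has the highest score but is excluded, which illustrates how a small genre can otherwise top the ranking.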

#### Chart - 7

In [None]:
# Chart - 7 visualization code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading dataset:", e)


try:
    # Explode genres so popularity can be averaged per genre
    titles_exploded = titles.copy()
    titles_exploded['genres'] = titles_exploded['genres'].apply(
        lambda x: x.strip("[]").replace("'", "").split(", ") if pd.notnull(x) else []
    )
    titles_exploded = titles_exploded.explode('genres')

    top_genres = titles_exploded.groupby('genres')['tmdb_popularity'].mean().sort_values(ascending=False).head(10)


    plt.figure(figsize=(10, 6))
    sns.barplot(x=top_genres.values, y=top_genres.index, palette='viridis')

    plt.title("Top 10 Genres by Average Popularity (TMDB)")
    plt.xlabel("Average Popularity Score")
    plt.ylabel("Genre")
    plt.grid(axis='x')
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)


##### 1. Why did you pick the specific chart?

I picked this chart to identify the most popular genres on Amazon Prime based on their average popularity score from TMDB. This helps highlight which genres are currently trending.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that genres like Action, Comedy, and Adventure have the highest average popularity, indicating a strong audience preference for mainstream, high-energy content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, focusing on high-popularity genres like Action and Comedy can drive higher engagement, attracting more viewers and boosting subscriptions.

However, neglecting less popular genres could lead to negative growth by alienating niche audiences, resulting in a loss of diversity in content.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading dataset:", e)


try:

    titles_exploded = titles.copy()
    titles_exploded['genres'] = titles_exploded['genres'].apply(
        lambda x: x.strip("[]").replace("'", "").split(", ") if pd.notnull(x) else []
    )
    titles_exploded = titles_exploded.explode('genres')


    titles_exploded['runtime'] = pd.to_numeric(titles_exploded['runtime'], errors='coerce')
    titles_exploded = titles_exploded.dropna(subset=['runtime'])


    avg_runtime_by_genre = titles_exploded.groupby('genres')['runtime'].mean().sort_values(ascending=False).head(10)


    plt.figure(figsize=(10, 6))
    sns.barplot(x=avg_runtime_by_genre.values, y=avg_runtime_by_genre.index, palette='magma')

    plt.title("Top 10 Genres by Average Runtime")
    plt.xlabel("Average Runtime (in minutes)")
    plt.ylabel("Genre")
    plt.grid(axis='x')
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)


##### 1. Why did you pick the specific chart?

I picked this chart to analyze the average runtime by genre, as it provides insight into which genres tend to have longer content, helping identify trends in content length preferences among audiences.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that genres like Documentary, History, and War tend to have longer average runtimes, suggesting that these genres may offer more in-depth content, potentially appealing to a more dedicated audience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, focusing on genres with longer runtimes like Documentary and History can attract dedicated audiences, boosting engagement and retention.

However, offering too much long-form content might limit appeal to users seeking quicker, more casual viewing, potentially negatively impacting viewer growth.

#### Chart - 9

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load the dataset (make sure to update the file path)
try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading dataset:", e)

# Step 2: Prepare the data for plotting
try:
    # Exploding genres column
    titles_exploded = titles.copy()
    titles_exploded['genres'] = titles_exploded['genres'].apply(
        lambda x: x.strip("[]").replace("'", "").split(", ") if pd.notnull(x) else []
    )
    titles_exploded = titles_exploded.explode('genres')

    # Remove rows where IMDb score is missing
    titles_exploded = titles_exploded.dropna(subset=['imdb_score'])

    # Step 3: Plot the chart
    plt.figure(figsize=(14, 7))
    sns.boxplot(x='genres', y='imdb_score', data=titles_exploded, palette='coolwarm')
    plt.xticks(rotation=90)
    plt.title("Distribution of IMDb Scores by Genre")
    plt.xlabel("Genre")
    plt.ylabel("IMDb Score")
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)


##### 1. Why did you pick the specific chart?

I picked this chart to explore how IMDb scores vary across different genres, providing insights into which genres receive higher or lower ratings on average from viewers, helping to identify content preferences.


##### 2. What is/are the insight(s) found from the chart?

The chart reveals that genres like Documentary and Biography tend to have higher IMDb scores, indicating a positive reception from viewers, while genres like Comedy may have more variance in ratings, suggesting mixed viewer opinions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, focusing on genres with higher IMDb scores like Documentary and Biography can attract positive attention and improve content reputation, driving higher engagement.

However, genres with mixed ratings, like Comedy, could lead to viewer dissatisfaction, potentially hindering growth if not balanced with other highly-rated content.
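The boxplot's two main readings, the typical score and the spread, can also be tabulated as the median and the interquartile range (IQR) per genre. This is a sketch on invented data, assuming an exploded table with one row per title-genre pair.

```python
import pandas as pd

# Toy exploded table; scores are invented for illustration.
toy = pd.DataFrame({
    "genres": ["documentary"] * 4 + ["comedy"] * 4,
    "imdb_score": [7.0, 7.2, 7.4, 7.6, 4.0, 6.0, 7.0, 9.0],
})

grouped = toy.groupby("genres")["imdb_score"]

# Median = line inside the box; IQR = height of the box.
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
})

print(summary)
```

A genre with a larger IQR has more mixed ratings, which is the property the Comedy observation relies on.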

#### Chart - 10

In [None]:
# Chart - 10 visualization code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load the dataset (make sure to update the file path)
try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading dataset:", e)

# Step 2: Prepare the data for plotting
try:
    # Exploding genres column
    titles_exploded = titles.copy()
    titles_exploded['genres'] = titles_exploded['genres'].apply(
        lambda x: x.strip("[]").replace("'", "").split(", ") if pd.notnull(x) else []
    )
    titles_exploded = titles_exploded.explode('genres')

    # Remove rows where IMDb votes are missing
    titles_exploded = titles_exploded.dropna(subset=['imdb_votes'])

    # Step 3: Plot the chart
    plt.figure(figsize=(14, 7))
    sns.boxplot(x='genres', y='imdb_votes', data=titles_exploded, palette='magma')
    plt.xticks(rotation=90)
    plt.title("Distribution of IMDb Votes by Genre")
    plt.xlabel("Genre")
    plt.ylabel("IMDb Votes")
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)


##### 1. Why did you pick the specific chart?

I picked this chart to explore how IMDb votes vary by genre, providing insights into the popularity of different genres based on viewer engagement and the number of votes they receive.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that genres like Action and Adventure tend to have higher IMDb votes, indicating greater popularity and engagement from a larger audience. On the other hand, genres like Documentary may have fewer votes, possibly due to a more niche audience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focusing on genres with higher IMDb votes, like Action and Adventure, can lead to positive business impact by attracting a larger, more engaged audience.

However, genres with fewer votes, like Documentary, could limit growth if not properly targeted to a niche audience, potentially resulting in lower viewer engagement.
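Vote counts are usually heavily skewed: a handful of titles can have far more votes than the rest, which squashes the boxes on a linear axis. A log transform, or comparing medians rather than means, may give a fairer picture. The sketch below uses invented numbers and assumes every vote count is positive.

```python
import numpy as np
import pandas as pd

# Toy exploded table; vote counts are invented for illustration.
toy = pd.DataFrame({
    "genres": ["action", "action", "action",
               "documentary", "documentary", "documentary"],
    "imdb_votes": [100, 1_000, 1_000_000, 50, 500, 5_000],
})

# log10 compresses the long right tail (assumes votes > 0).
toy["log10_votes"] = np.log10(toy["imdb_votes"])

# The mean is pulled up by one blockbuster; the median is not.
summary = toy.groupby("genres")["imdb_votes"].agg(["mean", "median"])

print(summary)
```

For the plot itself, calling `plt.yscale("log")` before `plt.show()` is one option to apply the same idea to the axis.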

#### Chart - 11

In [None]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load the dataset (make sure to update the file path)
try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading dataset:", e)

# Step 2: Prepare the data for plotting
try:
    # Exploding genres column
    titles_exploded = titles.copy()
    titles_exploded['genres'] = titles_exploded['genres'].apply(
        lambda x: x.strip("[]").replace("'", "").split(", ") if pd.notnull(x) else []
    )
    titles_exploded = titles_exploded.explode('genres')

    # Remove rows with missing IMDb score or IMDb votes
    titles_exploded = titles_exploded.dropna(subset=['imdb_score', 'imdb_votes'])

    # Step 3: Plot the chart
    plt.figure(figsize=(14, 7))
    sns.scatterplot(x='imdb_votes', y='imdb_score', hue='genres', data=titles_exploded, palette='Set2', alpha=0.7)
    plt.title("IMDb Scores vs IMDb Votes by Genre")
    plt.xlabel("IMDb Votes")
    plt.ylabel("IMDb Scores")
    plt.legend(title='Genres', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)


##### 1. Why did you pick the specific chart?

I picked this chart to explore the relationship between IMDb scores and IMDb votes, helping us understand if higher voter engagement correlates with better ratings. It also reveals whether certain genres tend to have higher or lower ratings based on the number of votes.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that:

Higher IMDb votes generally correlate with higher IMDb scores, especially for popular genres like Action, Adventure, and Comedy.

Some genres, like Documentary and Drama, tend to have a broader distribution of scores, suggesting varied audience reactions.

A few titles with low IMDb votes still have high IMDb scores, indicating they might have a dedicated but smaller audience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can create a positive business impact by focusing on popular genres like Action and Comedy, which have both high IMDb votes and ratings, likely leading to better engagement and viewer retention.

However, genres with fewer votes, like Documentary, might not drive as much viewership, potentially leading to negative growth unless targeted to a niche audience. The business strategy should balance between high-volume genres and diverse content to cater to all audience segments.
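The strength of the votes-to-score relationship can be checked with a correlation coefficient instead of judged by eye. Because votes are skewed, a rank-based (Spearman) correlation may be more informative than Pearson; it is computed here as the Pearson correlation of the ranks. The data below is invented, and the calculation is best run on one row per title, since exploding genres repeats each title once per genre.

```python
import pandas as pd

# Toy table with one row per title; values are invented.
toy = pd.DataFrame({
    "imdb_votes": [10, 100, 1_000, 10_000, 100_000],
    "imdb_score": [5.0, 6.5, 6.0, 7.5, 8.0],
})

# Pearson measures linear association on the raw values.
pearson = toy["imdb_votes"].corr(toy["imdb_score"])

# Spearman is Pearson applied to ranks, so it is robust to skew.
spearman = toy["imdb_votes"].rank().corr(toy["imdb_score"].rank())

print(round(pearson, 3), round(spearman, 3))
```

A large gap between the two coefficients would suggest that the relationship is monotonic but not linear.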

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load the dataset (make sure to update the file path)
try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading dataset:", e)

# Step 2: Prepare the data for plotting
try:
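    # Explode the genres column so each row carries a single genre;
    # otherwise every unique genre combination becomes its own hue level
    titles_exploded = titles.copy()
    titles_exploded['genres'] = titles_exploded['genres'].apply(
        lambda x: x.strip("[]").replace("'", "").split(", ") if pd.notnull(x) else []
    )
    titles_exploded = titles_exploded.explode('genres')

    # Remove rows with missing IMDb score or runtime
    titles_exploded = titles_exploded.dropna(subset=['imdb_score', 'runtime'])

    # Step 3: Plot the chart
    plt.figure(figsize=(14, 7))
    sns.scatterplot(x='runtime', y='imdb_score', hue='genres', data=titles_exploded, palette='Set2', alpha=0.7)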
    plt.title("IMDb Scores vs Runtime by Genre")
    plt.xlabel("Runtime (Minutes)")
    plt.ylabel("IMDb Score")
    plt.legend(title='Genres', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)


##### 1. Why did you pick the specific chart?

I picked this chart to explore whether longer runtimes are associated with higher IMDb scores, and to see if certain genres tend to have better ratings regardless of runtime. It helps identify patterns and understand how runtime might influence ratings across different genres.

##### 2. What is/are the insight(s) found from the chart?

Longer runtimes do not always correlate with higher IMDb scores, suggesting that a longer duration does not guarantee a better rating.

Genres like Action and Drama tend to have a wider range of IMDb scores, regardless of runtime.

Shorter movies in some genres (e.g., Animation) can have high IMDb scores, indicating that quality is not always tied to length.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can create a positive business impact by highlighting that runtime is not a significant factor in IMDb score. This encourages content creators to focus on quality over length when producing shows and movies, potentially reducing production costs while maintaining high ratings.

On the flip side, longer runtimes may result in audience fatigue, particularly for genres that do not benefit from extended durations. Content that is too long without strong engagement could lead to negative growth, as viewers may abandon the content before completing it, affecting retention rates.
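One way to check the runtime claim numerically is to group titles into runtime bands and compare the average score per band. The band edges and labels below are arbitrary assumptions, and the data is invented for illustration.

```python
import pandas as pd

# Toy table with one row per title; values are invented.
toy = pd.DataFrame({
    "runtime": [20, 45, 80, 95, 110, 150],
    "imdb_score": [7.5, 6.0, 5.5, 6.5, 7.0, 6.8],
})

# Assumed band edges in minutes; intervals are closed on the right.
bins = [0, 60, 120, 240]
labels = ["up to 60", "61 to 120", "over 120"]
toy["runtime_band"] = pd.cut(toy["runtime"], bins=bins, labels=labels)

# Average score within each runtime band.
band_means = toy.groupby("runtime_band", observed=True)["imdb_score"].mean()

print(band_means)
```

If the band means are close to each other, that supports the reading that runtime alone says little about rating.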

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load the dataset (make sure to update the file path)
try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading dataset:", e)

# Step 2: Prepare the data for plotting
try:
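    # Explode the genres column so each box corresponds to a single genre
    titles_exploded = titles.copy()
    titles_exploded['genres'] = titles_exploded['genres'].apply(
        lambda x: x.strip("[]").replace("'", "").split(", ") if pd.notnull(x) else []
    )
    titles_exploded = titles_exploded.explode('genres')

    # Remove rows with missing IMDb score or genre
    titles_exploded = titles_exploded.dropna(subset=['imdb_score', 'genres'])

    # Step 3: Plot the chart (Boxplot to visualize score distribution)
    plt.figure(figsize=(14, 7))
    sns.boxplot(x='genres', y='imdb_score', data=titles_exploded, palette='Set2')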
    plt.title("IMDb Scores Distribution by Genre")
    plt.xlabel("Genres")
    plt.ylabel("IMDb Score")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)


##### 1. Why did you pick the specific chart?

I picked the boxplot because it effectively shows the distribution of IMDb scores for each genre, highlighting the central tendency, spread, and potential outliers. It provides a clear comparison across genres and helps identify which genres generally have higher or lower scores.

##### 2. What is/are the insight(s) found from the chart?

Genres like Animation and Family tend to have higher median IMDb scores compared to others.

Some genres, such as Drama and Action, show a wide spread in scores, indicating a more varied reception from audiences.

There are outliers in certain genres, suggesting that while most movies or shows in those genres perform similarly, some standout films achieve significantly higher ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can have a positive business impact by guiding content creators to focus on genres like Animation and Family, which tend to receive higher IMDb scores, potentially attracting more viewers and improving retention.

However, a wide range of scores in genres like Drama and Action suggests that not all content in these genres resonates equally with audiences. If a platform heavily invests in these genres without ensuring quality, it could result in negative growth, as inconsistent reception may lead to viewer dissatisfaction and lower ratings.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load the dataset (make sure to update the file path)
try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading dataset:", e)

# Step 2: Prepare the data for the heatmap
try:
    # Select numerical columns for correlation analysis
    numerical_columns = ['imdb_score', 'runtime', 'tmdb_popularity', 'tmdb_score', 'imdb_votes']
    titles_numerical = titles[numerical_columns]

    # Step 3: Compute the correlation matrix
    correlation_matrix = titles_numerical.corr()

    # Step 4: Plot the heatmap
    plt.figure(figsize=(10, 6))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
    plt.title("Correlation Heatmap of Numerical Variables")
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)


##### 1. Why did you pick the specific chart?

I picked the correlation heatmap because it visually demonstrates the strength and direction of relationships between numerical variables. It helps quickly identify which features are strongly correlated, guiding decisions for further analysis or model-building.

##### 2. What is/are the insight(s) found from the chart?

IMDb score and TMDB score have a moderate positive correlation, suggesting that higher IMDb ratings tend to align with higher TMDB scores.

IMDb votes and IMDb score show a moderate positive correlation, indicating that more popular titles (with more votes) generally receive higher ratings.

Runtime shows a weak correlation with other variables, implying that the length of a movie or show does not strongly affect its ratings or popularity.
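Reading a heatmap cell by cell is error-prone, so it can help to list the variable pairs ordered by the strength of their correlation. The sketch below uses invented values for three of the notebook's columns; it keeps only the upper triangle of the matrix so each pair appears once.

```python
import numpy as np
import pandas as pd

# Toy numeric table; values are invented for illustration.
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0],
    "tmdb_score": [5.5, 6.0, 7.5, 8.0],
    "runtime": [90, 30, 120, 45],
})

corr = toy.corr()

# Boolean mask for the strict upper triangle (excludes the diagonal).
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per variable pair, sorted by absolute correlation.
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)

print(pairs)
```

Sorting by absolute value keeps strong negative relationships near the top as well, which a color scale can make easy to overlook.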

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Importing necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Load the dataset (make sure to update the file path)
try:
    file_path = '/content/drive/MyDrive/YourDatasetFolder/titles.csv'  # ✅ Update if needed
    titles = pd.read_csv(file_path)
    print("✅ Titles dataset loaded successfully.")
except Exception as e:
    print("❌ Error loading dataset:", e)

# Step 2: Prepare the data for the pair plot
try:
    # Select numerical columns for pair plot analysis
    numerical_columns = ['imdb_score', 'runtime', 'tmdb_popularity', 'tmdb_score', 'imdb_votes']
    titles_numerical = titles[numerical_columns]

    # Step 3: Plot the pair plot
    sns.pairplot(titles_numerical, diag_kind='kde', markers='o', plot_kws={'alpha':0.6})
    plt.suptitle("Pair Plot of Numerical Variables", y=1.02)
    plt.tight_layout()
    plt.show()

except Exception as e:
    print("❌ Error during visualization:", e)


##### 1. Why did you pick the specific chart?

I picked the Pair Plot because it provides a comprehensive view of the relationships between multiple numerical variables at once. It helps identify trends, correlations, and potential outliers across pairs of variables, enabling a quick understanding of the dataset’s structure.

##### 2. What is/are the insight(s) found from the chart?

Positive correlation between IMDb score and TMDB score — higher scores on one platform tend to correlate with higher scores on the other.

High IMDb votes tend to coincide with higher IMDb scores, suggesting that popular titles are often better rated.

Runtime shows weak or no significant correlation with other variables, indicating that the length of a movie or show doesn’t strongly affect its ratings or popularity.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.

To achieve the business objective, I suggest the client focus on:

Content Diversification: Expand the genres that are popular and have high IMDb/TMDB scores, such as Drama, Comedy, and Thriller, to attract a wider audience.

Focus on Popular Content: Invest in producing or acquiring shows and movies that have a high number of IMDb votes, as these are likely to be more popular with viewers.

Target Specific Demographics: Based on age ratings and genres, target specific viewer demographics to create content that matches their preferences, thus increasing engagement and retention.

Data-Driven Decisions: Regularly analyze content performance (ratings, votes, popularity) and adapt the content library to maintain audience interest and grow subscriptions.

This strategy will enhance customer satisfaction, improve user engagement, and foster content investment that leads to positive business growth.

# **Conclusion**

In conclusion, by focusing on content diversification, analyzing popular genres, and targeting specific viewer demographics, the client can enhance audience engagement and satisfaction. Regular data analysis will help optimize content offerings and drive business growth, ultimately increasing subscriptions and viewer retention.