# **Project Name**    - Amazon Prime



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -**  Hemalatha Y


# **Project Summary -**

This project performs an Exploratory Data Analysis (EDA) on an Amazon Prime dataset. The goal is to gain insights into customer behavior, preferences, and trends related to Amazon Prime services. The project leverages Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn for data manipulation, analysis, and visualization. Through various charts and graphs, the project explores univariate, bivariate, and multivariate relationships between variables in the dataset. The findings are then summarized to provide actionable insights that could potentially improve business strategies and enhance customer satisfaction. The project follows a structured approach, including data cleaning, variable understanding, and data visualization, ultimately leading to valuable insights about the Amazon Prime dataset.

# **GitHub Link -**

https://github.com/Hemalathagoutham

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

**Business Objective**

To identify key factors influencing customer satisfaction and retention with Amazon Prime services.

**Reasoning**

EDA can uncover patterns and relationships in the dataset that shed light on customer behavior and preferences, which can then inform business strategy. For example, we might discover that customers who use a certain feature of Amazon Prime (e.g., free shipping) are more likely to renew their subscription. Such an insight could guide targeted marketing or the development of new features.

Other business objectives that could be explored:

*   Identify opportunities for growth and expansion of Amazon Prime services.
*   Understand the impact of different marketing campaigns on customer acquisition and retention.
*   Develop a predictive model to forecast customer churn.

The specific business objective ultimately depends on the focus of the analysis and the goals of the stakeholders.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
import pandas as pd

df = pd.read_csv('titles.csv')
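
The guidelines above ask for a notebook that runs end to end without errors, so the load step can be guarded. The following is a minimal sketch, not the project's actual loader: the helper name `load_titles` is hypothetical, and the demo writes a tiny temporary CSV with made-up rows rather than relying on `titles.csv` being present.

```python
import os
import tempfile

import pandas as pd


def load_titles(path):
    """Read a CSV into a DataFrame, failing with a clear message if it is missing or empty."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Dataset not found: {path}")
    try:
        return pd.read_csv(path)
    except pd.errors.EmptyDataError as exc:
        raise ValueError(f"Dataset is empty: {path}") from exc


# Demo on a throwaway file (illustrative rows, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_titles.csv")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("title,type\nExample A,MOVIE\nExample B,SHOW\n")
    demo_df = load_titles(demo_path)

print(demo_df.shape)
```

In the notebook itself the call would simply be `df = load_titles('titles.csv')`.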

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Get the number of rows and columns
num_rows = df.shape[0]
num_cols = df.shape[1]

# Print the results
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

### Dataset Information

In [None]:
df.info()


#### Duplicate Values

In [None]:
# Calculate the number of duplicate rows
num_duplicates = df.duplicated().sum()

# Print the result
print("Number of duplicate rows:", num_duplicates)

#### Missing Values/Null Values

In [None]:
# Count missing values in each column
missing_values = df.isnull().sum()

# Print the result
print("Missing values per column:\n", missing_values)

# Calculate total number of missing values
total_missing = missing_values.sum()

# Print the result
print("\nTotal number of missing values:", total_missing)

In [None]:
import missingno as msno
import matplotlib.pyplot as plt

# Visualize missing values as a matrix
msno.matrix(df)
plt.show()

# Visualize missing values as a bar chart
msno.bar(df)
plt.show()

# Visualize missing values as a heatmap
msno.heatmap(df)
plt.show()

### What did you know about your dataset?

What we know about the dataset so far:

*   It is loaded into a Pandas DataFrame named df.
*   It likely contains information about Amazon Prime titles (e.g., movies, TV shows). This is inferred from the filename 'titles.csv' passed to pd.read_csv().
*   df.shape gives the number of rows and columns; the exact values depend on the dataset itself.
*   df.info() shows each column's data type and non-null count, which indicates the data structure and potential cleaning needs.
*   df.duplicated().sum() counts duplicate rows, which might need further investigation or removal.
*   df.isnull().sum() and the visualizations from the missingno library show the extent and pattern of missing data.

This gives a preliminary understanding of the dataset's structure, content, and potential challenges such as missing or duplicated data, and sets the stage for further data wrangling, analysis, and visualization.
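
The checks above can be gathered into one compact report. This is a sketch under stated assumptions: `quality_summary` is a hypothetical helper, and the demo frame contains illustrative rows rather than values from the real dataset.

```python
import pandas as pd


def quality_summary(frame):
    """Return per-column dtype, missing count, and missing percentage."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
    })
    summary["missing_pct"] = (summary["missing"] / len(frame) * 100).round(1)
    return summary.sort_values("missing", ascending=False)


# Illustrative rows only, not taken from the real dataset
demo = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "type": ["MOVIE", "SHOW", "SHOW", None],
    "release_year": [2001, 2015, 2015, None],
})

report = quality_summary(demo)
n_duplicates = int(demo.duplicated().sum())
print(report)
print("Duplicate rows:", n_duplicates)
```

Reporting missingness as a percentage alongside the raw count makes it easier to decide which columns to impute and which to drop.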

## ***2. Understanding Your Variables***

In [None]:
# Get the list of columns
columns = df.columns.tolist()

# Print the columns
print("Columns in the dataset:", columns)

In [None]:
df.describe()

### Variables Description

The columns have not been verified against the actual file at this point, so the descriptions below are educated guesses based on the filename "titles.csv" and on columns commonly found in datasets of streaming titles.

Possible variables and descriptions:

*   show_id: A unique identifier for each title (movie or TV show).
*   type: The type of content, such as "Movie" or "TV Show".
*   title: The title of the movie or TV show.
*   director: The name of the director(s).
*   cast: A list of actors appearing in the title.
*   country: The country or countries where the title was produced.
*   date_added: The date when the title was added to Amazon Prime.
*   release_year: The year the title was originally released.
*   rating: The content rating (e.g., PG-13, TV-MA).
*   duration: The length of the movie or TV show (in minutes or seasons).
*   listed_in: Categories or genres associated with the title.
*   description: A brief summary or description of the title's plot.

Important considerations:

*   These are only possible variables based on common patterns; the actual variables in the dataset might be different. Chart 2 below, for instance, reads a tmdb_score column that is not in this list.
*   Refer to the dataset's documentation or source for the most accurate and comprehensive variable descriptions.
*   df.info() lists the columns and their data types, which provides further clues about their meaning.
*   df.head() and similar methods show sample values for each variable.
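
Because the column names above are assumptions, it may help to compare them against the file before relying on them. The sketch below is illustrative: `compare_columns` is a hypothetical helper, and the demo frame uses a made-up mix of columns rather than the real schema.

```python
import pandas as pd

# Column names assumed in the description above; adjust to the real file
EXPECTED_COLUMNS = ["show_id", "type", "title", "director", "cast", "country",
                    "date_added", "release_year", "rating", "duration",
                    "listed_in", "description"]


def compare_columns(frame, expected):
    """Split expected names into present/missing and list unexpected extras."""
    actual = list(frame.columns)
    present = [c for c in expected if c in actual]
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    return present, missing, extra


# Illustrative frame with a hypothetical mix of columns
demo = pd.DataFrame(columns=["title", "type", "release_year", "tmdb_score"])
present, missing, extra = compare_columns(demo, EXPECTED_COLUMNS)
print("Present:", present)
print("Missing:", missing)
print("Not described:", extra)
```

Running the same comparison on `df` would show which descriptions apply and which columns still need documenting.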

### Check Unique Values for each variable.

In [None]:
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values for '{column}':\n{unique_values}\n")
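
Printing every unique value can flood the output for high-cardinality columns such as titles or descriptions. A possible alternative, sketched here with a hypothetical `cardinality_report` helper and illustrative data, is to report the number of distinct values plus a few examples per column.

```python
import pandas as pd


def cardinality_report(frame, max_examples=3):
    """Summarise each column by number of distinct values and a few examples."""
    rows = []
    for column in frame.columns:
        values = frame[column].dropna()
        rows.append({
            "column": column,
            "n_unique": values.nunique(),
            "examples": values.unique()[:max_examples].tolist(),
        })
    return pd.DataFrame(rows).set_index("column")


# Illustrative rows only, not taken from the real dataset
demo = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE"],
    "title": ["A", "B", "C", "D"],
})
report = cardinality_report(demo)
print(report)
```

Columns with only a handful of distinct values are natural candidates for count plots, while columns where almost every value is unique are usually identifiers or free text.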

## 3. ***Data Wrangling***

In [None]:
import pandas as pd

# 1. Handling Missing Values
for col in ['director', 'cast', 'country', 'listed_in']:
    if col in df.columns:  # Check if the column exists
        # Assign back instead of using inplace=True on a column selection,
        # which may not modify df in recent pandas versions
        df[col] = df[col].fillna('Unknown')

for col in ['date_added', 'rating', 'duration']:
    if col in df.columns:  # Check if the column exists
        mode = df[col].mode()
        # Fall back to 'Unknown' when the column has no mode (all values missing)
        df[col] = df[col].fillna(mode[0] if not mode.empty else 'Unknown')

# 2. Converting Data Types
if 'date_added' in df.columns:  # Check if the column exists
    df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')  # Handle potential errors during conversion

if 'release_year' in df.columns:  # Check if the column exists
    df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce').astype('Int64')  # Handle potential errors during conversion

# 3. Creating New Features
if 'date_added' in df.columns:  # Check if the column exists
    df['added_month'] = df['date_added'].dt.month
    df['added_year'] = df['date_added'].dt.year

# 4. Handling Inconsistent Values
if 'country' in df.columns:  # Check if the column exists
    df['country'] = df['country'].str.replace('United States', 'USA', regex=False)  # Explicitly disable regex for safety

# 5. Dropping Irrelevant Columns
if 'show_id' in df.columns:  # Check if the column exists
    df.drop(['show_id'], axis=1, inplace=True)

# 6. Removing Duplicates
df.drop_duplicates(inplace=True)

# 7. Renaming Columns
if 'listed_in' in df.columns:  # Check if the column exists
    df.rename(columns={'listed_in': 'genres'}, inplace=True)

# (Optional) Display the first few rows to check the results
df.head()

### What all manipulations have you done and insights you found?

Manipulations and Insights:

Handling Missing Values:

Manipulation: Filled missing values in 'director', 'cast', 'country', and 'listed_in' with "Unknown". Filled missing values in 'date_added', 'rating', and 'duration' with the mode (most frequent value), or with "Unknown" if the column had no mode.
Insights: Filling missing values lets every row be included in the analysis. Using "Unknown" preserves the information that the value was missing. Mode imputation keeps values within the set already observed, but it increases the share of the most frequent value, so it should be used with care when a column has many missing entries.
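
The two fill strategies can be compared on a small example. This is a sketch with illustrative values, not rows from the dataset; it shows that mode imputation raises the share of the most frequent category.

```python
import pandas as pd

# Illustrative rows only, not taken from the real dataset
demo = pd.DataFrame({
    "director": ["Jane Doe", None, "John Roe", None],
    "rating": ["PG", "PG", None, "R"],
})

# Keep missingness visible for free-text categoricals
demo["director"] = demo["director"].fillna("Unknown")

# Mode imputation: fall back to "Unknown" when the column has no mode
modes = demo["rating"].mode()
fill_value = modes.iloc[0] if not modes.empty else "Unknown"
demo["rating"] = demo["rating"].fillna(fill_value)

print(demo)
print(demo["rating"].value_counts(normalize=True))
```

Before the fill, "PG" accounts for two of the three known ratings; afterwards it accounts for three of four rows.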

Converting Data Types:

Manipulation: Converted the 'date_added' column to datetime objects and the 'release_year' column to integers (with error handling).
Insights: This enables date-based calculations and analysis, such as filtering by date ranges or calculating time differences. It also ensures that the 'release_year' column can be used for numerical comparisons and aggregations.

Creating New Features:

Manipulation: Extracted 'added_month' and 'added_year' from the 'date_added' column.
Insights: These new features allow for a more granular analysis of trends over time. You can now explore patterns based on the month or year when titles were added to Amazon Prime.
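
As a small illustration of the date handling described here, the sketch below parses a few made-up date strings and derives month and year features. Casting to the nullable `Int64` dtype is an optional choice, not something the wrangling code above does; it keeps the features as integers even when some dates fail to parse.

```python
import pandas as pd

# Illustrative values only, including one unparseable string and one missing value
demo = pd.DataFrame({"date_added": ["2021-03-15", "2020-11-01", "not a date", None]})

# Unparseable strings become NaT instead of raising
demo["date_added"] = pd.to_datetime(demo["date_added"], errors="coerce")

# Nullable integers keep missing values as <NA> rather than casting to float
demo["added_month"] = demo["date_added"].dt.month.astype("Int64")
demo["added_year"] = demo["date_added"].dt.year.astype("Int64")

print(demo)
```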

Handling Inconsistent Values:

Manipulation: Standardized country names by replacing "United States" with "USA" (as an example).
Insights: This ensures consistency in categorical data, preventing issues with grouping and analysis. Standardizing values makes it easier to compare and aggregate data across different categories.

Dropping Irrelevant Columns:

Manipulation: Dropped the 'show_id' column (if present).
Insights: This simplifies the dataset by removing unnecessary information that might not be relevant to the analysis. Removing irrelevant columns can improve efficiency and focus the analysis on the most important features.

Removing Duplicates:

Manipulation: Removed duplicate rows from the dataset.
Insights: This ensures data integrity by preventing the same title from being counted multiple times in the analysis. Removing duplicates leads to more accurate results and prevents biases caused by repeated observations.
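
Before dropping duplicates it can be useful to look at them. The sketch below uses illustrative rows to show that `duplicated(keep=False)` flags every member of a duplicated group, and that `drop_duplicates()` removes only exact matches across all columns.

```python
import pandas as pd

# Illustrative rows only, not taken from the real dataset
demo = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "release_year": [2010, 2010, 2019, 2010],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"],
})

# Flag every member of a duplicated group so they can be inspected first
dupe_mask = demo.duplicated(keep=False)
print(demo[dupe_mask])

# Exact duplicates only: the 2019 entry titled "A" is kept
deduped = demo.drop_duplicates().reset_index(drop=True)
print(len(demo), "->", len(deduped))
```

Titles that share a name but differ in another column, such as the release year, are treated as distinct rows and survive the deduplication.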

Renaming Columns:

Manipulation: Renamed the 'listed_in' column to 'genres'.
Insights: This provides a more descriptive and understandable column name, improving the readability of the dataset and making it easier to interpret the analysis results.

By carefully performing these data wrangling steps, we've improved the quality, consistency, and usability of the dataset. These manipulations are crucial for gaining insights that are reliable and actionable.

Remember that data wrangling is often an iterative process. You might need to revisit these steps and make further adjustments based on your specific analysis goals and the insights you uncover along the way.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your DataFrame is named 'df' and has a 'type' column

# Create the bar chart
plt.figure(figsize=(8, 6))  # Adjust figure size if needed
sns.countplot(x='type', data=df)
plt.title('Distribution of Content Types')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart (countplot) was chosen for visualizing the distribution of content types (Movie or TV Show) in the Amazon Prime dataset.

Reasons for Choosing a Bar Chart (Countplot):

Categorical Data: The primary reason is that the variable being visualized ('type') is categorical, representing distinct categories (Movie and TV Show). Bar charts are specifically designed for displaying the distribution of categorical data.

Comparison of Frequencies: The main goal is to compare the frequencies or counts of each content type. Bar charts excel at showing these comparisons visually through the lengths of the bars.

Clarity and Simplicity: Bar charts are easy to understand and interpret. Viewers can quickly grasp the relative proportions of each category by simply looking at the bar lengths.

Effective for Univariate Analysis: This visualization is a form of univariate analysis, focusing on a single variable ('type'). Bar charts are well-suited for this type of analysis as they effectively show the distribution of one variable.

Seaborn's Countplot: Seaborn's countplot() function provides a convenient way to create bar charts specifically for counting the occurrences of different categories in a dataset.

In summary: A bar chart (countplot) was chosen because it is the most appropriate and effective chart type for visualizing and comparing the frequencies of different content types (a categorical variable) in your Amazon Prime dataset. It provides a clear, concise, and easily interpretable representation of the data's distribution, making it ideal for this specific analysis.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

The main insight you can derive from this chart is the relative proportion or distribution of Movies and TV Shows within the Amazon Prime titles dataset.

By observing the bar lengths, you can determine:

Dominant Content Type: Which content type (Movie or TV Show) is more prevalent in the dataset? The bar with the greater height represents the dominant category.
Content Mix: Is the content mix balanced, or is there a significant skew towards one type? This can be assessed by comparing the relative lengths of the two bars. If they are similar in length, it indicates a more balanced mix. If one bar is much taller than the other, it suggests a heavier focus on that content type.

Example Interpretation:

If the bar for "Movie" is significantly taller than the bar for "TV Show," it would indicate that there are more movies than TV shows in the dataset. This could suggest that Amazon Prime has a larger movie library compared to its TV show offerings.
If the bars for "Movie" and "TV Show" are roughly equal in length, it would imply a more balanced content mix, with a similar number of movies and TV shows available on the platform.

Business Relevance:

These insights are valuable for understanding the overall content landscape on Amazon Prime. They can inform decisions related to content acquisition, production, marketing, and recommendations. For example, if the chart reveals a large imbalance towards one content type, Amazon might consider investing in more content of the underrepresented type to cater to a wider audience.
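
The proportions read off the bar chart can also be computed directly, which avoids judging the content mix by eye. This is a minimal sketch on illustrative labels; the actual spelling of the values in the 'type' column is an assumption.

```python
import pandas as pd

# Illustrative labels; the real 'type' values may be spelled differently
types = pd.Series(
    ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE", "MOVIE", "MOVIE"],
    name="type",
)

counts = types.value_counts()
shares = types.value_counts(normalize=True).mul(100).round(1)
summary = pd.DataFrame({"count": counts, "share_pct": shares})
print(summary)
```

Applied to the notebook's data, `df['type'].value_counts(normalize=True)` would give the same kind of table.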

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Content Strategy: The insights into content type distribution can inform content acquisition and production strategies. If there is a significant imbalance (e.g., far more movies than TV shows), Amazon Prime might consider investing in more TV show content to cater to a wider audience and increase subscriber satisfaction.
Marketing and Recommendations: Understanding content preferences can help target marketing efforts and personalize recommendations. If a large portion of users prefer movies, promoting movie-related deals or highlighting new movie releases could be more effective in attracting and retaining subscribers.
User Experience: Ensuring a balanced mix of content types can improve user satisfaction and engagement. This can lead to increased subscription renewals and positive word-of-mouth, ultimately contributing to business growth.

Potential Negative Growth Insights:

Content Imbalance: If the chart reveals a heavy skew towards one content type, it could indicate a lack of diversity in offerings, potentially alienating users who prefer the less represented type. This could lead to decreased engagement or churn among those users, negatively impacting subscriber growth and retention.
Missed Opportunities: If a content type is underrepresented but has potential demand (based on market trends or competitor analysis), Amazon Prime might be missing out on attracting a specific audience segment. This could hinder the platform's growth and market share.

#### Chart - 2

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your DataFrame is named 'df'
# and the column being plotted is named 'tmdb_score'.
# Note: 'tmdb_score' is a numeric score, whereas the discussion below
# describes categorical content ratings (e.g., PG, TV-MA). If the dataset
# has a categorical rating column, set that column name here instead.
rating_col = 'tmdb_score'

# Create the bar chart
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x=rating_col, order=df[rating_col].value_counts().index)
plt.title('Distribution of Content Ratings on Amazon Prime')
plt.xlabel('Content Rating')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent labels from overlapping
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Bar Chart (Countplot):

Categorical Data: The primary reason is that the variable being visualized (rating) is categorical. It represents distinct categories of content ratings (e.g., PG, TV-MA, R). Bar charts are specifically designed for displaying the distribution of categorical data.

Comparison of Frequencies: The main goal is to compare the frequencies or counts of each content rating. Bar charts excel at showing these comparisons visually through the lengths of the bars. This allows viewers to quickly grasp which ratings are most common and how content is distributed across different categories.

Clarity and Simplicity: Bar charts are generally easy to understand and interpret. Viewers can quickly grasp the relative proportions of each rating category by simply looking at the bar lengths. This makes the visualization accessible to a wide audience.

Effective for Univariate Analysis: This visualization is a form of univariate analysis, focusing on a single variable (rating). Bar charts are well-suited for this type of analysis as they effectively show the distribution of one variable.

Seaborn's Countplot: Seaborn's countplot() function provides a convenient and aesthetically pleasing way to create bar charts specifically for counting the occurrences of different categories in a dataset. It handles the counting and plotting efficiently.
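
Since the argument for a count plot rests on the variable being categorical, it may be worth checking the column's dtype and number of distinct values before plotting. The sketch below is a rough heuristic, not an established rule: `suggest_univariate_plot` and its `max_categories` threshold of 20 are assumptions chosen for illustration, and the demo values are made up.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_univariate_plot(series, max_categories=20):
    """Suggest a plot family from the dtype and number of distinct values."""
    if is_numeric_dtype(series) and series.nunique() > max_categories:
        return "histogram/KDE"
    return "countplot"


# Illustrative columns: a categorical rating and a numeric score
demo = pd.DataFrame({
    "rating": ["PG", "R", "PG-13", "PG"] * 10,
    "tmdb_score": [round(0.1 * i, 1) for i in range(40)],
})
for column in demo.columns:
    print(column, "->", suggest_univariate_plot(demo[column]))
```

A numeric column with many distinct values produces one bar per value in a count plot, which is usually harder to read than a histogram.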

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

The bar chart of content ratings provides valuable insights into the types of content available on Amazon Prime and their target audiences. Here are some key insights you can derive from the chart:

Most Frequent Ratings: You can quickly identify the most frequent content ratings by observing the tallest bars in the chart. For instance, if the bars for "TV-MA" and "TV-14" are the highest, it indicates that a large portion of the content on Amazon Prime is targeted towards mature audiences or older teenagers.

Content Distribution across Ratings: You can observe the overall distribution of content across different rating categories. This provides a general understanding of the variety and balance of content offered on the platform. For example, if there's a wide range of ratings with relatively balanced frequencies, it suggests that Amazon Prime caters to diverse audience preferences.

Potential Gaps in Content: If certain rating categories have very short bars or are missing altogether, it might indicate a potential gap in content offerings for specific audience segments. For example, if there are very few titles with a "G" rating, it might suggest a limited selection of content for younger viewers.

Target Audience Insights: The distribution of content ratings provides insights into the primary target audience of Amazon Prime. If a majority of the content falls under ratings like "TV-MA" or "R," it indicates a focus on mature audiences. Conversely, a higher frequency of ratings like "TV-PG" or "PG-13" suggests a broader appeal to families and younger viewers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Targeted Content Acquisition: By understanding the distribution of content ratings, Amazon Prime can make more informed decisions about acquiring new content. For example, if the chart reveals a high demand for mature content (e.g., TV-MA), Amazon Prime can prioritize acquiring titles with those ratings to cater to its audience preferences. This can lead to increased user engagement and satisfaction.

Improved Content Recommendations: The insights from the chart can be used to enhance content recommendation systems. By knowing the preferred content ratings of users, Amazon Prime can recommend similar titles, increasing the likelihood of users finding content they enjoy. This can lead to higher user retention and increased watch time.

Effective Marketing Campaigns: Understanding the target audience through content rating preferences allows Amazon Prime to create more effective marketing campaigns. For example, if a large portion of the audience prefers family-friendly content (e.g., TV-PG), marketing campaigns can be tailored to highlight those titles and attract more families to the platform. This can lead to increased subscriber acquisition and brand awareness.

Diversification of Content: Identifying gaps in content offerings for specific rating categories can help Amazon Prime diversify its content library. By investing in titles with underrepresented ratings, Amazon Prime can attract new audiences and expand its market reach. This can lead to overall business growth and increased revenue.

Potential Negative Growth Insights:

Limited Content Diversity: If the chart reveals a heavy concentration of content within a specific rating category (e.g., TV-MA), it might indicate a lack of diversity in offerings. This could alienate users who prefer other types of content, potentially leading to decreased engagement and churn among those users. This could negatively impact subscriber growth and retention.

Ignoring Niche Audiences: If certain rating categories have very low representation, it could indicate that Amazon Prime is neglecting niche audiences. For example, if there are very few titles with a "G" rating, it might suggest a limited focus on families with young children. Ignoring these niche audiences could limit the platform's growth potential and restrict its market share.

Misaligned Content Strategy: If Amazon Prime's content acquisition strategy is not aligned with the actual content rating preferences of its users, it could lead to a mismatch between supply and demand. This could result in wasted resources on acquiring content that doesn't resonate with the audience, negatively impacting profitability and business performance.

Justification:

The insights gained from the content ratings distribution chart are crucial for understanding audience preferences and making strategic decisions that can positively or negatively impact business growth. By leveraging these insights, Amazon Prime can tailor its content acquisition, recommendations, and marketing efforts to effectively target its audience and cater to their needs. However, neglecting or misinterpreting the insights could lead to limited content diversity, neglected audience segments, and misaligned content strategies, potentially hindering business growth and profitability.

#### Chart - 3

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your DataFrame is named 'df' and has a 'release_year' column

# Create the KDE plot
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x='release_year', fill=True)
plt.title('Distribution of Content Release Years')
plt.xlabel('Release Year')
plt.ylabel('Density')
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a KDE Plot:

Continuous Data: The release_year variable is numeric. Although it takes integer values, it spans a wide enough range to be treated as continuous. KDE plots are designed for visualizing the distribution of such data: they provide a smooth, continuous curve that estimates the underlying probability density function.

Identifying Trends and Patterns: KDE plots are excellent for revealing trends and patterns in the data, such as clusters, gaps, or skewness in the distribution. They can show how the density of release years changes over time, highlighting periods with more or fewer releases.

Smooth Representation: Compared to a histogram, which can be sensitive to bin size and placement, a KDE plot offers a smoother and more visually appealing representation of the data distribution. It avoids the "blocky" appearance of histograms and provides a more continuous view.

Comparing Distributions: KDE plots are also useful for comparing the distributions of different groups or categories. For example, you could create separate KDE plots for movies and TV shows to compare their release year distributions.

Visualizing Density: The y-axis of a KDE plot represents the density of data points, providing a more nuanced understanding of the distribution compared to simply counting frequencies in bins like a histogram.

In summary: The KDE plot was chosen for Chart - 3 because it is well-suited for visualizing the distribution of continuous data like release years. It provides a smooth, visually appealing, and informative representation of the data, allowing for the identification of trends, patterns, and density variations.
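
To make concrete what a KDE curve represents, the sketch below computes a Gaussian kernel density estimate by hand with NumPy. It is an illustration of the general technique, not seaborn's implementation: the fixed bandwidth of 3 years is an arbitrary assumption (seaborn selects a bandwidth automatically), and the release years are made up.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average a Gaussian bump of width `bandwidth` centred on every sample."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Illustrative release years, not taken from the dataset
years = [1995, 2005, 2012, 2015, 2017, 2018, 2019, 2019, 2020, 2021]
grid = np.linspace(1970, 2045, 751)
density = gaussian_kde_1d(years, grid, bandwidth=3.0)

step = grid[1] - grid[0]
print("Area under curve:", round(float(density.sum() * step), 3))
print("Peak near year:", float(grid[density.argmax()]))
```

The area under the curve is approximately 1, which is why the y-axis is labelled "Density" rather than "Count". A larger bandwidth gives a smoother curve and a smaller one a bumpier curve, so apparent dips or peaks should be checked against the bandwidth used.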

##### 2. What is/are the insight(s) found from the chart?

Insights from the KDE Plot:

The KDE plot provides a visual representation of the distribution of content release years, allowing us to identify key trends and patterns in the Amazon Prime catalog. Here are some insights you can derive from the chart:

Recent Content Dominance: The KDE plot typically shows a higher density (peak) towards more recent years, indicating that a significant portion of content on Amazon Prime consists of newer releases. This suggests a focus on providing viewers with access to the latest movies and TV shows.

Gradual Decline in Older Content: As you move towards earlier release years, the density of the KDE plot tends to decrease. This indicates a gradual decline in the number of older titles available on the platform. While there's still a presence of classic and older content, it's likely not as extensive as the newer releases.

Potential Content Gaps: If there are noticeable dips or troughs in the KDE plot for specific periods, it could indicate potential content gaps in the Amazon Prime catalog. For instance, a lower density around the early 2000s might suggest a need to acquire more titles from that era to provide a more comprehensive selection.

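Skewness of Distribution: The KDE plot can reveal the skewness of the release year distribution. If most of the mass sits at recent years with a long tail stretching back toward older years, the distribution is left-skewed (negatively skewed), which confirms the dominance of newer content. A right-skewed distribution, with most of the mass at older years and a tail toward recent years, would indicate a larger proportion of older titles.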
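
The direction of skew can be checked numerically with the sample skewness, whose sign follows the longer tail. The sketch below uses two illustrative samples, not values from the dataset: one concentrated at recent years and one concentrated at older years.

```python
import pandas as pd

# Illustrative samples, not taken from the dataset
mostly_recent = pd.Series([1960, 1985, 2005, 2015, 2018, 2019, 2020, 2021, 2021, 2022])
mostly_old = pd.Series([1950, 1951, 1952, 1953, 1955, 1958, 1965, 1980, 2000, 2020])

for label, years in [("mostly recent", mostly_recent), ("mostly old", mostly_old)]:
    skew = years.skew()
    # Negative skew: longer tail toward older years; positive: toward newer years
    direction = "left-skewed" if skew < 0 else "right-skewed"
    print(f"{label}: skew={skew:.2f}, mean={years.mean():.1f}, "
          f"median={years.median():.1f} -> {direction}")
```

In the sample concentrated at recent years the mean falls below the median, because the few old titles pull it down; that is the usual signature of a left-skewed distribution.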

Overall Content Age: By observing the overall shape and spread of the KDE plot, you can get a sense of the overall age of the content available on Amazon Prime. A wider spread suggests a more diverse range of release years, while a narrower spread indicates a focus on a specific time period.

Business Relevance:

These insights are valuable for understanding the content landscape on Amazon Prime and making strategic decisions regarding content acquisition, production, and marketing. For example, if the KDE plot reveals a lack of content from a specific era, Amazon Prime might consider investing in acquiring more titles from that period to cater to a wider range of viewer preferences. Understanding the distribution of release years can also help in targeting marketing campaigns and personalizing recommendations based on user preferences for newer or older content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Strategic Content Acquisition: By identifying potential content gaps or periods with lower density in the KDE plot, Amazon Prime can strategically acquire titles from those eras to diversify its catalog and cater to a wider range of viewer preferences. This can attract new subscribers and increase engagement among existing users.

Targeted Marketing and Recommendations: Understanding the distribution of release years can help in targeting marketing campaigns and personalizing recommendations. For example, promoting classic films to users who enjoy older content or highlighting new releases to those who prefer recent titles can improve user satisfaction and engagement.

Content Refresh and Optimization: The insights from the KDE plot can inform decisions about content refresh and optimization. If there's a heavy concentration of older content with declining viewership, Amazon Prime might consider removing or replacing some titles to make room for newer, more relevant content. This can improve the overall quality and appeal of the catalog.

Understanding Audience Preferences: The skewness and overall shape of the KDE plot provide insights into audience preferences for content age. This information can guide content acquisition and production strategies, ensuring that Amazon Prime offers a balanced mix of titles that resonate with its target audience.

Potential Negative Growth Insights:

Limited Content Diversity: If the KDE plot reveals a heavy concentration of content within a specific time period, it could indicate a lack of diversity in offerings. This might alienate users who prefer content from other eras, potentially leading to decreased engagement and churn.

Neglecting Niche Audiences: If there are significant gaps or dips in the KDE plot for certain periods, it could indicate that Amazon Prime is neglecting niche audiences who enjoy content from those eras. This could limit the platform's growth potential and restrict its market share.

Overemphasis on Recent Releases: While focusing on recent content is important, overemphasizing it could lead to a decline in viewership for older titles, potentially impacting the overall value proposition of Amazon Prime for users who appreciate classic or less mainstream content.

Justification:

The insights gained from the KDE plot of content release years are valuable for understanding audience preferences and making strategic decisions that can positively or negatively impact business growth. By leveraging these insights, Amazon Prime can tailor its content acquisition, recommendations, and marketing efforts to effectively target its audience and cater to their needs. However, neglecting or misinterpreting the insights could lead to limited content diversity, neglected audience segments, and an unbalanced content catalog, potentially hindering business growth and profitability.

Therefore, it's essential for Amazon Prime to carefully analyze the KDE plot and use the insights to make data-driven decisions that are aligned with user preferences and market trends to ensure positive business impact and sustained growth.
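
The content gaps discussed above can also be checked numerically by tabulating the share of titles per release decade. The following is a minimal sketch, assuming the `release_year` column used in the charts; the helper name `share_by_decade` and the sample years are hypothetical and only for illustration.

```python
import pandas as pd


def share_by_decade(years: pd.Series) -> pd.Series:
    """Return the fraction of titles released in each decade, including empty decades."""
    decades = (years.dropna().astype(int) // 10) * 10
    shares = decades.value_counts(normalize=True).sort_index()
    # Reindex over every decade in the observed range so gaps show up as 0.0
    all_decades = range(int(decades.min()), int(decades.max()) + 10, 10)
    return shares.reindex(all_decades, fill_value=0.0)


# Hypothetical sample; in the notebook this would be df['release_year']
sample_years = pd.Series([1955, 1987, 1999, 2004, 2011, 2015, 2017, 2018, 2019, 2021])
decade_shares = share_by_decade(sample_years)
print(decade_shares)
```

Decades with a share of 0.0 correspond to the dips a KDE plot would show, although the KDE smooths across neighbouring years and may make such gaps look less sharp.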

#### Chart - 4

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your DataFrame is named 'df' and has a 'release_year' column

# Create the box plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='release_year')
plt.title('Distribution of Content Release Years')
plt.ylabel('Release Year')
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Box Plot:

Summarizing Distribution: Box plots are excellent for providing a concise summary of the distribution of a continuous variable, like release years. They display key statistical measures such as the median, quartiles, and potential outliers in a compact visual format. This allows for a quick understanding of the central tendency, spread, and skewness of the data.

Identifying Outliers: One of the primary strengths of box plots is their ability to highlight outliers, which are data points that fall significantly outside the typical range of values. In the context of content release years, outliers could represent very old or very new titles that deviate from the majority of content on Amazon Prime. Identifying these outliers can be valuable for understanding the extremes of the content catalog and potential gaps or areas for improvement.

Comparing Distributions: Box plots are also useful for comparing the distributions of different groups or categories. For example, you could create separate box plots for movies and TV shows to compare their release year distributions and identify any differences in central tendency, spread, or outliers. This can provide insights into how content types vary in terms of their release year patterns.

Handling Skewness: Box plots are built from the median and quartiles, so they do not assume a normal distribution and represent skewed data well. Release years are often skewed: when newer content is more prevalent than older content, most of the data sits at recent years with a long tail toward older years, which is a left-skewed (negatively skewed) distribution. A box plot captures this through the position of the median within the box and the unequal whisker lengths.

Simplicity and Clarity: Box plots are relatively simple and easy to interpret, making them accessible to a wide audience. The visual elements of the plot, such as the box, whiskers, and outliers, are readily understandable and convey key information about the data distribution.

In summary: A box plot is a suitable choice for visualizing the distribution of content release years because it provides a concise summary of the data, highlights outliers, allows for comparisons between groups, handles skewness effectively, and is easy to interpret. These features make it a valuable tool for understanding the overall patterns and potential variations in the release years of content on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Box Plot:

A box plot provides a concise summary of the distribution of content release years, highlighting key statistical measures and potential outliers. Here are some insights you can gain from the chart:

Central Tendency: The median, represented by the line inside the box, shows the central value of the release year distribution. This indicates the typical or middle release year for content on Amazon Prime.

Spread and Variability: The box itself represents the interquartile range (IQR), which contains the middle 50% of the data. The length of the box indicates the spread or variability of release years. A longer box suggests a wider range of release years, while a shorter box indicates a more concentrated distribution.

Skewness: The position of the median within the box and the lengths of the whiskers (lines extending from the box) can provide insights into the skewness of the distribution. If the median is closer to the top of the box and the lower whisker is longer than the upper whisker, it suggests a left-skewed distribution: most titles are recent, with a tail of older ones. The opposite pattern (median near the bottom of the box, longer upper whisker) suggests right skew, which for release years would mean most titles are older.

Outliers: Data points that fall outside the whiskers are considered outliers. In the context of content release years, these outliers could represent very old or very new titles that deviate significantly from the typical range. Identifying outliers can help understand the extremes of the content catalog and potential gaps or areas for improvement.

Comparison: If you create separate box plots for different content categories (e.g., movies and TV shows), you can compare their distributions and identify any differences in central tendency, spread, or outliers. This can provide insights into how content types vary in terms of their release year patterns.
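
The statistics described above can be computed directly, which is useful for checking what the chart shows. This is a minimal sketch, assuming the common Tukey convention (whiskers reach the most extreme points within 1.5 × IQR of the box, which is also seaborn's default `whis` value); the helper name `box_plot_summary` and the sample years are hypothetical.

```python
import pandas as pd


def box_plot_summary(values: pd.Series, whis: float = 1.5) -> dict:
    """Return the quantities a standard box plot draws for one variable."""
    values = values.dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Fences: points beyond them are drawn individually as outliers
    lower_fence, upper_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme observations inside the fences
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": sorted(outliers.tolist()),
    }


# Hypothetical sample; in the notebook this would be df['release_year']
sample_years = pd.Series([1950, 2005, 2010, 2014, 2016, 2017, 2018, 2019, 2020, 2021])
summary = box_plot_summary(sample_years)
print(summary)
```

In this sample the single 1950 title falls below the lower fence, so it would appear as an isolated point under the lower whisker.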

Example Interpretation:

If the median is around 2018 and the box is relatively short, the middle 50% of titles on Amazon Prime were released within a few years of 2018, indicating a focus on newer content.
If there are outliers on the lower end of the plot, it indicates the presence of some very old titles in the catalog.
If the box for movies is longer than the box for TV shows, it suggests a wider range of release years for movies compared to TV shows.

Business Relevance:

These insights are valuable for understanding the content landscape on Amazon Prime and making strategic decisions regarding content acquisition, production, and marketing. For example, if the box plot reveals a lack of recent releases, Amazon Prime might consider acquiring more new titles to cater to user preferences for fresh content. Identifying outliers and gaps in the distribution can also inform decisions about content refresh and optimization.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Targeted Content Acquisition: By understanding the distribution of release years and identifying potential gaps or outliers, Amazon Prime can strategically acquire titles to fill those gaps and cater to a wider range of viewer preferences. This can attract new subscribers and increase engagement among existing users.

Improved Content Recommendations: Insights from the box plot can help personalize content recommendations. For instance, users who prefer older titles can be recommended content from earlier release years, while those who favor recent releases can be shown newer content. This can improve user satisfaction and increase viewership.

Content Refresh and Optimization: If the box plot reveals a heavy concentration of older content with declining viewership, Amazon Prime might consider removing or replacing some titles to make room for newer, more relevant content. This can improve the overall quality and appeal of the catalog.

Understanding Audience Preferences: The box plot provides insights into the distribution and central tendency of release years, which can help understand audience preferences for content age. This information can guide content acquisition and production strategies, ensuring that Amazon Prime offers a balanced mix of titles that resonate with its target audience.

Identifying and Addressing Outliers: Outliers in the box plot might represent very old or very new titles that deviate significantly from the typical range. Addressing these outliers through content acquisition or removal can improve the overall coherence and appeal of the catalog.

Potential Negative Growth Insights:

Limited Content Diversity: If the box plot reveals a narrow IQR and a limited range of release years, it could indicate a lack of content diversity. This might alienate users who prefer content from different eras, potentially leading to decreased engagement and churn.

Neglecting Niche Audiences: A box plot cannot show gaps inside the distribution, but if titles from certain periods appear only as isolated outliers, it could indicate that Amazon Prime is neglecting niche audiences who enjoy content from those periods. This could limit the platform's growth potential and restrict its market share.

Overemphasis on Recent Releases: While focusing on recent content is important, overemphasizing it could lead to a decline in viewership for older titles, potentially impacting the overall value proposition of Amazon Prime for users who appreciate classic or less mainstream content.

Justification:

The insights gained from the box plot of content release years are valuable for understanding audience preferences and making strategic decisions that can positively or negatively impact business growth. By leveraging these insights, Amazon Prime can tailor its content acquisition, recommendations, and marketing efforts to effectively target its audience and cater to their needs. However, neglecting or misinterpreting the insights could lead to limited content diversity, neglected audience segments, and an unbalanced content catalog, potentially hindering business growth and profitability.

Therefore, it's essential for Amazon Prime to carefully analyze the box plot and use the insights to make data-driven decisions that are aligned with user preferences and market trends to ensure positive business impact and sustained growth.

#### Chart - 5

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your DataFrame is named 'df' and has a 'release_year' column

# Create the violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, y='release_year')
plt.title('Distribution of Content Release Years')
plt.ylabel('Release Year')
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Violin Plot:

Combined Information: A violin plot combines the benefits of a box plot and a kernel density estimation (KDE) plot. It displays the distribution's shape through the KDE, while also providing key statistical measures like the median and quartiles through the box plot elements within the violin. This allows for a more comprehensive understanding of the data's distribution compared to using either a box plot or a KDE plot alone.

Detailed Distribution Shape: The violin's shape reveals the density of data points at different release years, highlighting areas where content is concentrated or sparse. This provides a more nuanced view of the distribution's shape compared to a box plot, which primarily focuses on quartiles.

Identifying Multimodal Distributions: Violin plots are particularly useful for identifying multimodal distributions, which have multiple peaks or clusters. This can indicate the presence of distinct groups or trends within the release year data. For instance, you might observe separate peaks for older classic films and newer releases, suggesting different patterns in content availability.

Comparing Distributions: Violin plots are also effective for comparing distributions across different categories. By creating separate violins for different content types (e.g., movies and TV shows), you can easily compare their shapes and identify any differences in central tendency, spread, or the presence of multiple modes.

Visual Appeal and Clarity: While providing detailed information, violin plots are often considered visually appealing and engaging. The smooth curves and mirrored density representation can make the visualization more aesthetically pleasing compared to a box plot, especially when dealing with large datasets.

In summary: A violin plot is a suitable choice for visualizing the distribution of content release years on Amazon Prime because it combines the strengths of a box plot and a KDE plot, providing a detailed and visually engaging representation of the distribution's shape, central tendency, and potential multimodality. This allows for a more comprehensive understanding of the data and can reveal nuanced patterns that might be missed with other visualizations.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Violin Plot:

A violin plot provides a rich and informative visualization of the distribution of content release years.

Distribution Shape and Density: The shape of the violin reveals the density of content at different release years. A wider section of the violin indicates a higher concentration of content released around that year, while a narrower section suggests fewer releases. This allows you to quickly grasp the overall distribution and identify periods with more or fewer titles.

Central Tendency and Spread: With seaborn's default `inner='box'` setting, the white marker within the violin represents the median release year, providing a measure of central tendency. The thick bar inside the violin shows the interquartile range (IQR), indicating the spread of the middle 50% of the data. These elements help understand the typical release year and the variability of content age.

Skewness and Tails: The shape of the violin's tails can reveal skewness in the distribution. A longer tail on one side means the distribution is skewed in that direction, while most of the content sits at the opposite end. For instance, a long tail towards older years with a wide body near recent years is a left-skewed distribution, indicating a greater concentration of newer releases.

Multimodality: Violin plots are particularly useful for identifying multimodal distributions, which have multiple peaks or clusters. If the violin has distinct bulges or peaks, it suggests the presence of different groups or trends within the release year data. This could indicate, for example, separate clusters for older classic films and newer releases.

Comparison across Categories: By creating separate violins for different content categories (e.g., movies and TV shows), you can easily compare their distributions and identify any differences in shape, central tendency, spread, or multimodality. This allows for insights into how content types vary in terms of their release year patterns.

Example Interpretation:

If the violin plot shows a wider section towards recent years and a long, thin tail towards older years, it suggests a concentration of newer content and a left-skewed distribution.
If the violin has two distinct peaks, one for older films and another for recent releases, it indicates a multimodal distribution with potentially different patterns for classic and contemporary content.
Comparing violins for movies and TV shows might reveal that movies have a wider range of release years and a greater concentration of older titles compared to TV shows.
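
The direction of skew read from the violin can be cross-checked with the sample skewness. Below is a minimal sketch using pandas' `Series.skew()`; the helper name `describe_skew`, the 0.5 cut-off (a common rule of thumb, not a value from this project), and the sample data are assumptions for illustration.

```python
import pandas as pd


def describe_skew(values: pd.Series, threshold: float = 0.5) -> str:
    """Label a distribution from the sign and size of its sample skewness."""
    skewness = values.dropna().skew()
    if skewness > threshold:
        return "right-skewed"  # long tail toward larger values
    if skewness < -threshold:
        return "left-skewed"   # long tail toward smaller values
    return "roughly symmetric"


# Hypothetical release years: mostly recent titles plus a few old ones
sample_years = pd.Series([1940, 1975, 1998, 2008, 2014, 2017, 2019, 2020, 2021, 2021])
skew_label = describe_skew(sample_years)
print(skew_label)
```

A catalog dominated by recent titles with a few much older ones has a negative skewness, matching a violin whose thin tail points towards older years.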

Business Relevance:

These insights are valuable for understanding the content landscape on Amazon Prime and making strategic decisions regarding content acquisition, production, and marketing. For example, if the violin plot reveals a lack of content from a specific era or a limited diversity in release years, Amazon Prime might consider acquiring titles to address those gaps and cater to a wider audience. Understanding the distribution's shape and multimodality can also help personalize recommendations and target marketing campaigns based on user preferences for specific content age ranges.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Strategic Content Acquisition: By understanding the distribution shape, density, and potential gaps revealed by the violin plot, Amazon Prime can strategically acquire titles to address those areas and cater to a wider range of viewer preferences. This can attract new subscribers and increase engagement among existing users.

Personalized Recommendations: Insights from the violin plot, such as the central tendency, spread, and multimodality, can help personalize content recommendations. Users with preferences for specific release year ranges can be recommended content that aligns with their tastes, leading to increased user satisfaction and viewership.

Content Refresh and Optimization: If the violin plot shows a heavy concentration of older content with declining viewership or a lack of diversity in release years, Amazon Prime might consider removing or replacing some titles to make room for newer, more relevant content. This can improve the overall quality and appeal of the catalog.

Understanding Audience Preferences: The violin plot provides insights into the distribution and central tendency of release years, which can help understand audience preferences for content age. This information can guide content acquisition and production strategies, ensuring that Amazon Prime offers a balanced mix of titles that resonate with its target audience.

Identifying and Addressing Extremes: A violin plot does not mark outliers individually, but long, thin tails show that a few very old or very new titles lie far from the typical range. Reviewing these titles when acquiring or removing content can improve the overall coherence and appeal of the catalog.

Marketing and Promotion: Understanding the distribution of release years can help target marketing campaigns and promotions. For example, highlighting content from a specific era that is underrepresented in the catalog can attract niche audiences and increase engagement.

Potential Negative Growth Insights:

Limited Content Diversity: If the violin plot reveals a narrow distribution with limited variability in release years, it could indicate a lack of content diversity. This might alienate users who prefer content from different eras, potentially leading to decreased engagement and churn.

Neglecting Niche Audiences: If the violin plot shows gaps or low density in specific release year ranges, it could indicate that Amazon Prime is neglecting niche audiences who enjoy content from those periods. This could limit the platform's growth potential and restrict its market share.

Overemphasis on Recent Releases: While focusing on recent content is important, overemphasizing it at the expense of older titles could lead to a decline in viewership for classic or less mainstream content, potentially impacting the overall value proposition of Amazon Prime for certain user segments.

Justification:

The insights gained from the violin plot of content release years are valuable for understanding audience preferences and making strategic decisions that can positively or negatively impact business growth. By leveraging these insights, Amazon Prime can tailor its content acquisition, recommendations, and marketing efforts to effectively target its audience and cater to their needs. However, neglecting or misinterpreting the insights could lead to limited content diversity, neglected audience segments, and an unbalanced content catalog, potentially hindering business growth and profitability.

Therefore, it's essential for Amazon Prime to carefully analyze the violin plot and use the insights to make data-driven decisions that are aligned with user preferences and market trends to ensure positive business impact and sustained growth.

#### Chart - 6

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your DataFrame is named 'df' and has 'type' and 'imdb_score' columns

plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.boxplot(x='type', y='imdb_score', data=df)
plt.title('Distribution of IMDb Ratings by Content Type')
plt.xlabel('Content Type')
plt.ylabel('IMDb Rating')
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Grouped Box Plot:

Comparing Distributions: The primary goal is to compare the distribution of IMDb ratings for Movies and TV Shows. Grouped box plots are specifically designed for this purpose, allowing for a side-by-side comparison of the central tendency, spread, and potential outliers in ratings for each content type.

Categorical and Continuous Variables: We have one categorical variable (content type: Movie or TV Show) and one continuous variable (IMDb ratings). Box plots are well-suited for visualizing the relationship between a categorical and a continuous variable.

Understanding Key Statistics: Box plots provide a concise summary of key statistical measures for each group, including the median, quartiles (25th and 75th percentiles), and potential outliers. This allows for a quick understanding of the typical rating and the variability within each content type.

Identifying Outliers: Box plots effectively highlight outliers, which are data points that fall significantly outside the typical range of ratings. This can be valuable for identifying titles with exceptionally high or low ratings within each content type.

Visual Clarity and Simplicity: Box plots are relatively easy to understand and interpret. The visual elements, such as the boxes, whiskers, and outliers, are readily understandable and convey key information about the distribution of ratings for each group.

In summary: A grouped box plot was chosen because it is an effective and visually clear way to compare the distribution of IMDb ratings for Movies and TV Shows on Amazon Prime. It allows for a side-by-side comparison of key statistics, identification of outliers, and an overall understanding of how user ratings might differ between the two content types.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

A grouped box plot provides a comparative view of the distribution of IMDb ratings for Movies and TV Shows, allowing for insights into how user ratings might differ between the two content types. Here are some key insights you can gain from the chart:

Central Tendency: Compare the medians (represented by the horizontal lines inside the boxes) for Movies and TV Shows. The median is the middle rating for each content type, which is not necessarily the same as the mean. If the median for TV Shows is higher than for Movies, it suggests that a typical TV Show on Amazon Prime receives a higher IMDb rating than a typical Movie.

Spread and Variability: Observe the lengths of the boxes (representing the interquartile range) for Movies and TV Shows. A longer box indicates a wider spread or greater variability in ratings. If the box for Movies is longer than for TV Shows, it suggests that movie ratings are more dispersed, with a greater range of scores.

Skewness: The position of the median within the box and the lengths of the whiskers (lines extending from the box) can provide insights into the skewness of the distribution for each content type. If the median is closer to the bottom of the box and the lower whisker is shorter than the upper whisker, it suggests a right-skewed distribution: most ratings cluster at the lower end of the range, with a tail of fewer, higher ratings.

Outliers: Look for data points that fall outside the whiskers, represented as individual dots. These are considered outliers, indicating titles with exceptionally high or low ratings within each content type. Identifying outliers can be valuable for further investigation or understanding extreme cases.

Overall Comparison: By comparing the box plots side-by-side, you can get an overall sense of how the distribution of IMDb ratings differs between Movies and TV Shows. For example, if the box plot for TV Shows is generally higher and narrower than for Movies, it suggests that TV Shows tend to receive higher and more consistent ratings compared to Movies.

Example Interpretation:

If the median IMDb rating for TV Shows is 7.5 and for Movies is 7.0, it suggests that TV Shows on Amazon Prime generally receive higher ratings.
If the box for Movies is longer than for TV Shows, it indicates a greater variability in movie ratings, with a wider range of scores.
Outliers on the upper end of the box plot for Movies might represent critically acclaimed or highly popular films with exceptionally high ratings.
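
The side-by-side comparison can be backed by a small table of the same statistics the boxes encode. This is a minimal sketch, assuming the `type` and `imdb_score` columns used in the chart; the helper name `rating_summary_by_type`, the type labels, and the sample scores are hypothetical.

```python
import pandas as pd


def rating_summary_by_type(frame: pd.DataFrame,
                           group_col: str = "type",
                           value_col: str = "imdb_score") -> pd.DataFrame:
    """Return count, quartiles and IQR of a rating column for each content type."""
    grouped = frame.groupby(group_col)[value_col]
    # One row per group, one column per requested quantile
    summary = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    summary.columns = ["q1", "median", "q3"]
    summary["iqr"] = summary["q3"] - summary["q1"]
    summary["count"] = grouped.count()
    return summary


# Hypothetical sample standing in for the notebook's df
titles = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 8.0, 7.0, 7.2, 7.5, 7.8, 8.0],
})
rating_summary = rating_summary_by_type(titles)
print(rating_summary)
```

A higher `median` with a smaller `iqr` for one type corresponds to a box that sits higher and is shorter in the grouped box plot.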

Business Relevance:

These insights are valuable for understanding user preferences and making strategic decisions regarding content acquisition, production, and marketing. For example, if TV Shows consistently receive higher ratings, Amazon Prime might consider investing in more high-quality TV series to attract and retain subscribers. Identifying outliers and understanding the distribution of ratings can also help personalize recommendations and target marketing campaigns based on user preferences for specific content types and rating ranges.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Content Acquisition and Production: By understanding which content type generally receives higher ratings (e.g., TV Shows), Amazon Prime can make more informed decisions about acquiring or producing content that aligns with user preferences. This can attract and retain subscribers who value highly-rated content.

Personalized Recommendations: Insights from the box plot, such as the central tendency and spread of ratings for each content type, can help personalize recommendations. Users who prefer highly-rated TV Shows can be recommended similar content, leading to increased user satisfaction and viewership.

Marketing and Promotion: Understanding the distribution of ratings for different content types can help target marketing campaigns and promotions. For example, highlighting highly-rated Movies or TV Shows in specific genres can attract niche audiences and increase engagement.

Content Refresh and Optimization: If the box plot reveals a content type with consistently lower ratings or greater variability, Amazon Prime might consider removing or replacing some titles to improve the overall quality and appeal of its catalog. This can enhance user satisfaction and reduce churn.

Identifying and Addressing Outliers: Outliers in the box plot might represent exceptionally high or low-rated titles. Further investigation of these outliers can help understand the factors contributing to their extreme ratings and inform content acquisition or production decisions.

Potential Negative Growth Insights:

Content Imbalance: If the box plot reveals a significant difference in ratings between Movies and TV Shows, with one content type consistently receiving lower ratings, it could indicate a content imbalance. This might alienate users who prefer the lower-rated content type, potentially leading to decreased engagement or churn.

Neglecting Niche Audiences: If the box plot shows outliers or a wide spread of ratings for a specific content type, it could indicate that Amazon Prime is neglecting niche audiences who enjoy content with those characteristics. This could limit the platform's growth potential and restrict its market share.

Overemphasis on Ratings: While ratings are an important factor in user preferences, overemphasizing them could lead to a lack of diversity in content offerings. Focusing solely on highly-rated titles might exclude unique or experimental content that could appeal to specific audience segments.

Justification:

The insights gained from the grouped box plot of IMDb ratings by content type are valuable for understanding user preferences and making strategic decisions that can positively or negatively impact business growth. By leveraging these insights, Amazon Prime can tailor its content acquisition, recommendations, and marketing efforts to effectively target its audience and cater to their needs. However, neglecting or misinterpreting the insights could lead to content imbalances, neglected audience segments, and a lack of content diversity, potentially hindering business growth and profitability.

Therefore, it's essential for Amazon Prime to carefully analyze the grouped box plot and use the insights to make data-driven decisions that are aligned with user preferences and market trends to ensure positive business impact and sustained growth.

#### Chart - 7

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter for movies only (case-insensitive, since some exports store the label as 'MOVIE')
movies_df = df[df['type'].str.upper() == 'MOVIE']

# Create the histogram
plt.figure(figsize=(10, 6))
sns.histplot(data=movies_df, x='runtime', bins=20, kde=True)  # Adjust bins as needed
plt.title('Distribution of Movie Durations')
plt.xlabel('Movie Duration (minutes)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Histogram:

Continuous Data: The 'runtime' variable, representing movie durations, is continuous. Histograms are specifically designed for visualizing the distribution of continuous data.

Frequency Distribution: The main goal is to understand how movie durations are distributed, i.e., how many movies fall within different duration ranges. Histograms effectively display this frequency distribution by dividing the data into bins and showing the count or frequency of movies in each bin.

Shape and Spread: Histograms reveal the shape and spread of the data distribution. We can observe if the distribution is skewed, symmetrical, or has multiple peaks. This information is valuable for understanding the typical movie duration and the variability in movie lengths.

Outlier Detection: While not the primary focus, histograms can also help identify potential outliers, i.e., movies with unusually long or short durations. These outliers might be worth investigating further.

KDE for Smoothness: Adding a Kernel Density Estimate (KDE) plot on top of the histogram provides a smoother representation of the distribution, making it easier to identify trends and patterns.

In summary, a histogram is an appropriate choice for visualizing the distribution of movie durations as it effectively displays frequency distribution, shape, spread, and potential outliers of continuous data. The addition of a KDE plot enhances the visualization's smoothness and interpretability.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Histogram:

The histogram of movie durations provides insights into the typical length of movies on Amazon Prime and the variability in movie durations. Here are some key insights you can derive:

Typical Duration: The peak of the histogram marks the most common duration range (the modal bin), which indicates the typical length of movies on the platform; it is not necessarily the mean. For example, if the peak is around 90-120 minutes, more movies on Amazon Prime fall within that range than in any other bin of the same width.

Distribution Shape: The overall shape of the histogram reveals the distribution of movie durations. If it's roughly bell-shaped (symmetrical), movie lengths are approximately normally distributed. A right-skewed distribution (longer tail on the right) means most movies are relatively short, with a smaller number of unusually long ones stretching the tail. A left-skewed distribution (longer tail on the left) would indicate the opposite.

Variability: The spread or width of the histogram indicates the variability in movie durations. A wider histogram suggests a greater range of movie lengths, while a narrower histogram indicates less variability.

Outliers: Isolated bars far from the main cluster, usually at the extreme left or right of the histogram, could indicate potential outliers, i.e., movies with unusually long or short durations. These outliers might be worth investigating further to understand if they represent specific genres or content categories.
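
The shape, spread, and outlier observations above can also be checked numerically. The following is a minimal sketch, not part of the original analysis: `describe_distribution` is a hypothetical helper, the sample runtimes are invented for illustration, and the 1.5 x IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd


def describe_distribution(values, bins=30):
    """Summarize a numeric column the way one would read its histogram."""
    s = pd.Series(values, dtype="float64").dropna()

    # Modal bin: the histogram bar with the most observations
    modal_bin = pd.cut(s, bins=bins).value_counts().idxmax()

    # 1.5 * IQR fences (an assumed convention, not taken from the notebook)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    return {
        "modal_bin": (modal_bin.left, modal_bin.right),
        "median": s.median(),
        "skewness": s.skew(),  # > 0 means a longer right tail
        "outliers": s[(s < low) | (s > high)].tolist(),
    }


# Invented runtimes in minutes; with the real data one would pass df['runtime']
sample_runtimes = [85, 88, 90, 90, 92, 95, 95, 98, 100, 105, 110, 240]
summary = describe_distribution(sample_runtimes, bins=5)
print(summary)
```

A positive skewness here reflects the single very long title pulling the right tail, while the bulk of the values sits at the shorter end.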

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Content Acquisition and Production: Understanding the typical movie duration and viewer preferences can inform content acquisition and production strategies. Amazon Prime can prioritize acquiring or producing movies with durations that align with the most popular range, potentially increasing viewer engagement and satisfaction.

User Recommendations: The insights from the histogram can be used to enhance user recommendations. For example, if a user has shown a preference for shorter movies, the platform can recommend similar titles within that duration range, improving personalization and user experience.

Content Programming and Scheduling: For live TV or streaming channels, understanding movie duration distributions can help optimize programming and scheduling. Amazon Prime can arrange movie blocks or schedules that cater to viewer preferences for specific duration ranges, potentially increasing viewership and engagement.

Potential Negative Growth Insights:

Limited Content Diversity: If the histogram reveals a very narrow range of movie durations, it could indicate a lack of diversity in content offerings. This might alienate viewers who prefer movies with longer or shorter durations, potentially leading to decreased engagement and churn.

Ignoring Niche Preferences: If there are significant clusters of movies with durations outside the typical range, it could indicate niche preferences that are not being adequately addressed. Ignoring these niche preferences could limit the platform's growth potential and restrict its market share.

Misaligned Content Strategy: If Amazon Prime's content acquisition or production strategy is not aligned with viewer preferences for movie durations, it could lead to a mismatch between supply and demand. This could result in wasted resources on content that doesn't resonate with the target audience, negatively impacting profitability and business performance.

#### Chart - 8

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create the scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='release_year', y='tmdb_score')
plt.title('Release Year vs. TMDB Score')
plt.xlabel('Release Year')
plt.ylabel('TMDB Score')
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Scatter Plot:

Relationship between Two Numeric Variables: A scatter plot is an appropriate choice for visualizing the relationship between two numeric variables, in this case, release_year and tmdb_score. It allows us to see how these variables are related and if there are any patterns or trends.

Identifying Correlation: Scatter plots can help identify the presence and direction of a correlation between variables. If the points on the scatter plot tend to cluster along a line, it indicates a correlation. A positive correlation means that as one variable increases, the other also tends to increase. A negative correlation means that as one variable increases, the other tends to decrease. No clear pattern suggests a weak or no correlation.

Outlier Detection: Scatter plots can also help identify outliers, which are data points that fall significantly outside the general pattern of the data. Outliers might be worth investigating further to understand if they represent unusual cases or data errors.

Visual Exploration: Scatter plots provide a visual way to explore the data and see if there are any interesting relationships or patterns. They can help generate hypotheses for further investigation.
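
A scatter plot suggests a correlation visually; a coefficient can put a number on it. The sketch below assumes the `release_year` and `tmdb_score` columns used in the chart. The helper name `year_score_correlation` and the sample values are made up for illustration. Spearman's rank correlation is computed here as the Pearson correlation of the ranks, which keeps the example dependent on pandas only.

```python
import pandas as pd


def year_score_correlation(frame, x="release_year", y="tmdb_score"):
    """Return (Pearson, Spearman) correlation for two numeric columns."""
    pair = frame[[x, y]].dropna()  # ignore titles with a missing year or score
    pearson = pair[x].corr(pair[y])
    # Spearman = Pearson correlation applied to the ranks of each column
    spearman = pair[x].rank().corr(pair[y].rank())
    return pearson, spearman


# Invented sample; one title has no score and is dropped
demo = pd.DataFrame(
    {
        "release_year": [1990, 2000, 2005, 2010, 2015, 2020],
        "tmdb_score": [5.0, 5.5, None, 6.5, 7.0, 9.0],
    }
)
pearson, spearman = year_score_correlation(demo)
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Pearson measures how close the points are to a straight line, while Spearman only asks whether the relationship is monotonic, so comparing the two can hint at a curved trend.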

##### 2. What is/are the insight(s) found from the chart?

Insights from the Scatter Plot:

The scatter plot of release_year vs. tmdb_score can provide insights into how movie ratings (TMDB scores) have changed over time. Here are some key insights you can derive:

Correlation: Observe if there's any clear pattern or trend in the scatter plot. For example, if the points tend to slope upwards from left to right, it suggests a positive correlation, meaning that movies released in more recent years tend to have higher TMDB scores. A downward slope would indicate a negative correlation.

Rating Distribution over Time: See how the distribution of TMDB scores changes across different release years. For example, you might observe that movies released in earlier years had a wider range of scores, while more recent movies tend to have scores clustered within a narrower range.

Outliers: Look for any data points that fall significantly outside the general pattern. These outliers might represent movies with unusually high or low TMDB scores for their release year. Investigating these outliers could reveal interesting insights or potential data anomalies.
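
To compare the range of scores across periods, as described under "Rating Distribution over Time", the scores can be grouped by decade. This is a rough sketch under the same column-name assumptions as the chart; `score_spread_by_decade` is a hypothetical helper and the sample rows are invented.

```python
import pandas as pd


def score_spread_by_decade(frame, year_col="release_year", score_col="tmdb_score"):
    """Count, median, min and max score for each release decade."""
    pair = frame[[year_col, score_col]].dropna()
    decade = (pair[year_col] // 10 * 10).astype(int)  # 1995 -> 1990
    return pair.groupby(decade)[score_col].agg(["count", "median", "min", "max"])


# Invented sample rows
demo = pd.DataFrame(
    {
        "release_year": [1991, 1995, 1999, 2003, 2008, 2012],
        "tmdb_score": [3.0, 6.0, 9.0, 6.5, 7.5, 7.0],
    }
)
spread = score_spread_by_decade(demo)
print(spread)
```

Decades with few titles will show a narrow range simply because there is little data, so the `count` column is worth reading alongside `min` and `max`.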

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Content Acquisition and Production: If the scatter plot reveals a positive correlation between release year and TMDB score, it suggests that acquiring or producing newer movies might lead to higher viewer satisfaction. This insight can guide content acquisition strategies and inform decisions about investing in newer titles.

User Recommendations: Understanding the relationship between release year and TMDB score can help personalize user recommendations. For example, if a user prefers movies with higher ratings, the platform can recommend newer movies that are more likely to have higher scores.

Content Programming and Scheduling: For live TV or streaming channels, insights from the scatter plot can inform content programming decisions. Amazon Prime can prioritize scheduling newer movies during prime time slots, when viewership is typically higher, to maximize engagement.

Potential Negative Growth Insights:

Declining Content Quality: If the scatter plot reveals a negative correlation between release year and TMDB score, it could indicate a decline in the quality of newer content. This insight might warrant further investigation into the reasons for the decline and potential adjustments to content acquisition or production strategies.

Ignoring Classic Titles: If the scatter plot shows a wide range of TMDB scores for older movies, it suggests that there are valuable classic titles that might be overlooked if the platform focuses solely on newer content. This could lead to a missed opportunity to engage viewers who appreciate older films.

Overemphasis on Ratings: Relying solely on TMDB scores for content decisions could lead to neglecting other important factors that contribute to viewer satisfaction, such as genre preferences, cast, or director. A balanced approach is crucial to ensure a diverse and engaging content catalog.

#### Chart - 9

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter data for movies only (case-insensitive, so 'Movie' and 'MOVIE' both match)
movies = df[df['type'].str.upper() == 'MOVIE']

# Create the histogram
plt.figure(figsize=(10, 6))
sns.histplot(data=movies, x='runtime', bins=30, kde=True)  # Adjust bins as needed
plt.title('Distribution of Movie Durations on Amazon Prime Video')
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with a Kernel Density Estimate (KDE) plot for this visualization because:

Continuous Data: Movie durations are continuous numerical data, and histograms are well-suited for visualizing the distribution of such data.

Frequency and Density: The histogram shows the frequency of movies falling within different duration ranges, while the KDE plot provides a smooth estimate of the underlying probability density function, giving a clearer picture of the distribution's shape.

Understanding Duration Patterns: This chart helps us understand typical movie durations, identify common ranges, and detect any unusual patterns or outliers in the data.

##### 2. What is/are the insight(s) found from the chart?

By examining the histogram and KDE plot, you can gain insights like:

Most Common Durations: The peak of the histogram and KDE plot represents the most common movie durations on Amazon Prime Video. This gives an idea of the typical length of movies offered on the platform.

Duration Range: The spread of the histogram indicates the range of movie durations available. A wider spread suggests a greater variety of durations, while a narrower spread indicates a more concentrated range.

Skewness: The shape of the KDE plot can reveal skewness in the distribution. A right-skewed distribution (longer tail on the right) would suggest that most movies cluster at shorter durations, with a few much longer titles forming the tail.

Outliers: Isolated, low-frequency bars far from the main body of the histogram might indicate outliers, i.e., movies with uncommon durations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Content Acquisition: Understanding the distribution of movie durations can help Amazon Prime Video make informed decisions about acquiring new content. They can focus on acquiring movies with durations that are popular among viewers, ensuring a catalog that aligns with audience preferences.

Recommendation System: These insights can be incorporated into recommendation systems to suggest movies with durations that users typically enjoy. This can lead to increased user satisfaction and engagement.

Content Programming: Amazon Prime Video can use this information to schedule movies with appropriate durations for different time slots or viewing habits. For example, they might offer shorter movies for viewers with limited time or longer movies for weekend viewing.

Potential Negative Growth Insights:

Limited Variety: If the histogram shows a very narrow range of durations, it could indicate a lack of variety in content offerings, potentially alienating users who prefer movies with shorter or longer durations.

Ignoring Niche Preferences: If there are specific duration ranges with low frequencies but potential demand, Amazon Prime Video might be missing out on attracting those niche audiences. This could hinder their growth and market share.

#### Chart - 10

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your DataFrame is named 'df' and has 'genres' and 'type' columns

# Create the bar chart
plt.figure(figsize=(12, 6))
sns.countplot(y='genres', hue='type', data=df, order=df['genres'].value_counts().index)
plt.title('Distribution of Genres across Content Types')
plt.xlabel('Count')
plt.ylabel('Genres')
plt.legend(title='Content Type')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart (specifically using Seaborn's countplot) for visualizing the distribution of genres across content types (Movies and TV Shows) for the following reasons:

Categorical Data: The primary reason is that both 'genres' and 'type' are categorical variables. Bar charts are specifically designed to display the distribution and comparison of categorical data.

Comparison of Frequencies: The main goal is to compare the frequencies or counts of each genre within different content types. Bar charts excel at visually representing these comparisons through the lengths of the bars, allowing for easy identification of dominant genres for movies and TV shows.

Clarity and Simplicity: Bar charts are easy to understand and interpret. Viewers can quickly grasp the relative popularity of genres within each content type by simply looking at the bar lengths.

Seaborn's Countplot with Hue: Using Seaborn's countplot with the hue parameter allows us to create clustered bar charts, where each genre bar is further divided based on content type (Movies and TV Shows). This provides a clear visual comparison of genre distributions between the two content types.

##### 2. What is/are the insight(s) found from the chart?

The insights derived from the bar chart of genre distribution across content types are:

Dominant Genres: Identify the most frequent genres for both movies and TV shows. Because the chart is horizontal, the longest bars for each content type represent the dominant genres.

Genre Preferences: Observe any notable differences in genre preferences between movies and TV shows. For example, certain genres might be more prevalent in movies than in TV shows, indicating audience preferences for specific content types.

Content Variety: Assess the overall diversity of genres offered within each content type. A wide range of genres suggests a balanced and varied content catalog.

Potential Gaps: Identify potential gaps in content offerings by looking for underrepresented genres or genres that are heavily skewed towards one content type. This information can inform content acquisition strategies to cater to a wider audience.
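
If a title can carry several genres in one string, counting the raw `genres` values treats each combination as its own category. The sketch below shows one way to split such strings before counting per content type. It assumes values shaped like `"['drama', 'comedy']"` or `"drama, comedy"`, which is a guess about the data format; `genre_counts_by_type` is a hypothetical helper and the sample rows are invented. If every row already holds a single genre, no preprocessing is needed.

```python
import pandas as pd


def genre_counts_by_type(frame, genre_col="genres", type_col="type"):
    """Cross-tabulate individual genres against content type."""
    cleaned = (
        frame[genre_col]
        .fillna("")
        .str.strip("[]")                    # drop list-style brackets, if any
        .str.replace("'", "", regex=False)  # drop quotes around each genre
        .str.split(",")
    )
    exploded = (
        frame[[type_col]]
        .assign(genre=cleaned)
        .explode("genre", ignore_index=True)  # one row per (title, genre)
    )
    exploded["genre"] = exploded["genre"].str.strip()
    exploded = exploded[exploded["genre"] != ""]  # titles without any genre
    return pd.crosstab(exploded["genre"], exploded[type_col])


# Invented sample rows
demo = pd.DataFrame(
    {
        "type": ["MOVIE", "MOVIE", "SHOW", "SHOW"],
        "genres": ["['drama', 'comedy']", "['drama']", "['comedy']", "[]"],
    }
)
table = genre_counts_by_type(demo)
print(table)
```

The resulting table has one row per genre and one column per content type, which makes gaps (zeros) and skew toward one type easy to spot.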

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Content Acquisition: The insights into genre distribution can inform content acquisition strategies. By understanding the preferences for different genres, Amazon Prime can prioritize acquiring titles in popular or underrepresented genres to better cater to its audience. This can attract new subscribers and increase engagement among existing users.

Content Recommendations: The genre preferences revealed in the chart can enhance content recommendation systems. Amazon Prime can recommend similar titles or explore other genres based on user preferences, improving user experience and increasing watch time.

Targeted Marketing: The insights can be used to create more effective marketing campaigns. By highlighting the availability of popular genres or promoting content in genres with potential demand, Amazon Prime can attract specific audience segments and increase subscriber acquisition.

Content Production: The information about genre preferences can guide content production strategies. Amazon Prime might consider investing in creating original content in popular genres or genres with growth potential to offer unique and appealing titles to its audience.

Potential Negative Growth Insights:

Limited Genre Diversity: If the chart reveals a heavy concentration of content within a few specific genres, it could indicate a lack of diversity in offerings. This might alienate users who prefer other genres, potentially leading to decreased engagement and churn.

Ignoring Niche Genres: If certain genres have very low representation or are heavily skewed towards one content type, it could indicate that Amazon Prime is neglecting niche audiences who prefer those genres. This could limit the platform's growth potential and restrict its market share.

Misaligned Content Strategy: If content acquisition and production strategies are not aligned with actual genre preferences, it could lead to a mismatch between supply and demand, potentially hindering business growth and profitability.

#### Chart - 11

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your DataFrame is named 'df'
# and the column containing country information is named 'production_countries'

# Create the horizontal bar chart
plt.figure(figsize=(10, 8))  # Adjust figure size if needed

# Keep only the 10 most frequent values of 'production_countries'
top_countries = df['production_countries'].value_counts().iloc[:10].index
sns.countplot(y='production_countries', hue='type', data=df, order=top_countries)

plt.title('Top 10 Countries with the Most Content on Amazon Prime')
plt.xlabel('Count')
plt.ylabel('Country')
plt.legend(title='Content Type')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart (using Seaborn's countplot) to visualize the top 10 countries with the most content on Amazon Prime, categorized by content type (Movies and TV Shows), for the following reasons:

Categorical Data: Both 'production_countries' and 'type' are categorical variables, representing distinct categories. Bar charts are well-suited for visualizing the distribution and comparison of categorical data.

Comparison of Frequencies: The main goal is to compare the number of titles produced in each country, and further categorize them by content type (Movies and TV Shows). Horizontal bar charts effectively display these comparisons through the lengths of the bars, allowing viewers to easily identify countries with the most content.

Clarity and Readability: Horizontal bar charts are often preferred when dealing with long category labels (country names in this case) as they provide more space for displaying the labels without overlapping. This improves the readability of the chart.

Top 10 Focus: By limiting the chart to the top 10 countries, we can focus on the most significant contributors to Amazon Prime's content library. This makes the visualization more concise and insightful.

Seaborn's Countplot with Hue: Using Seaborn's countplot with the hue parameter allows us to create clustered bar charts, where each country's bar is further divided based on content type (Movies and TV Shows). This provides a clear visual comparison of content distribution by country and type.

##### 2. What is/are the insight(s) found from the chart?

Here are the insights that can be gained from this chart:

Dominant Content Producers: Identify the countries that produce the most content for Amazon Prime. The countries with the longest bars have the highest number of titles available on the platform.

Content Type Distribution by Country: Observe how the distribution of content types (Movies and TV Shows) varies across countries. Some countries might produce more movies, while others might focus on TV shows. This provides insights into the content preferences and production capabilities of different regions.

Regional Content Focus: Identify any regional patterns or trends in content production. For example, certain countries might specialize in specific genres or cater to particular audiences based on cultural preferences.

Content Diversity: Assess the overall diversity of content origins on Amazon Prime. A wide representation of countries suggests a global content library that caters to a broad audience.

Potential Gaps: Identify potential gaps in content offerings by looking for underrepresented countries or regions. This information can inform content acquisition strategies to diversify the catalog and attract new audiences.
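
How concentrated the catalog is can be expressed as the share of titles that the top countries account for. This is a minimal sketch: `top_n_share` is a hypothetical helper, the country codes are invented sample data, and each raw value is treated as one label, the same way the chart treats `production_countries`.

```python
import pandas as pd


def top_n_share(series, n=10):
    """Return the n most frequent values and their share of all non-null rows."""
    counts = series.dropna().value_counts()  # sorted from most to least frequent
    top = counts.head(n)
    return top, top.sum() / counts.sum()


# Invented sample: 10 titles from 3 countries
countries = pd.Series(["US"] * 6 + ["IN"] * 3 + ["GB"])
top, share = top_n_share(countries, n=2)
print(top)
print(f"Top 2 countries hold {share:.0%} of titles")
```

A share close to 1 for a small `n` points to a catalog dominated by a handful of countries; a lower share indicates a broader mix of origins.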

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Targeted Content Acquisition: By understanding the countries that produce the most content, Amazon Prime can focus on acquiring titles from those regions to cater to audience preferences and ensure a diverse catalog.

Content Production Partnerships: Identifying countries with expertise in specific genres or content types can lead to potential partnerships for co-production or licensing opportunities, expanding Amazon Prime's original content library.

Regional Marketing Strategies: Understanding content distribution by country can help in tailoring marketing campaigns to specific regions or target audiences based on their content preferences.

Global Expansion: Identifying underrepresented countries or regions can inform strategic decisions regarding global expansion and content acquisition to reach new markets and attract new subscribers.

Potential Negative Growth Insights:

Limited Content Diversity: If the chart reveals a heavy concentration of content from a few countries, it could indicate a lack of diversity in offerings, potentially alienating users who prefer content from other regions.

Regional Content Bias: Overemphasis on content from specific countries could lead to a regional content bias, limiting the appeal to a broader audience and hindering global growth.

Ignoring Emerging Markets: Neglecting content from emerging markets or underrepresented regions could result in missed opportunities to attract new subscribers and expand into new territories.

Justification:

By carefully analyzing the insights from Chart - 11, Amazon Prime can make data-driven decisions regarding content acquisition, production, marketing, and global expansion strategies. Leveraging these insights can lead to a positive business impact by enhancing content diversity, catering to audience preferences, and attracting new subscribers. However, failing to address potential negative growth insights, such as limited content diversity or regional biases, could hinder the platform's growth and restrict its market share in the long run.

#### Chart - 12

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Assuming your DataFrame is named 'df' and 'description' column contains duration information

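# Filter data for movies only (case-insensitive, so 'Movie' and 'MOVIE' both match);
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added below
movies_df = df[df['type'].str.upper() == 'MOVIE'].copy()

# Function to extract duration from description (in minutes)
def extract_duration(description):
    if not isinstance(description, str):  # Missing descriptions are NaN (a float)
        return None
    match = re.search(r'(\d+)\s*min', description)  # Search for patterns like "120 min" or "90min"
    if match:
        return int(match.group(1))
    return None  # Return None if duration not found

# Extract duration values and create a new column
movies_df['duration_minutes'] = movies_df['description'].apply(extract_duration)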

# Remove rows with missing duration (if any)
movies_df = movies_df.dropna(subset=['duration_minutes'])

# Create the histogram
plt.figure(figsize=(10, 6))
sns.histplot(data=movies_df, x='duration_minutes', bins=20, kde=True)
plt.title('Distribution of Movie Durations on Amazon Prime')
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram (using Seaborn's histplot) to visualize the distribution of movie durations on Amazon Prime for the following reasons:

Continuous Data: Movie duration is a continuous variable, representing a range of numerical values. Histograms are specifically designed for visualizing the distribution of continuous data by dividing the data into bins and showing the frequency of data points within each bin.

Understanding Distribution: The primary goal is to understand the overall distribution of movie durations, such as the typical duration, the range of durations, and the presence of any outliers or unusual patterns. Histograms effectively display these characteristics.

Identifying Central Tendency and Spread: Histograms allow us to visually identify the central tendency (e.g., the average or median duration) and the spread or variability of the data. This provides insights into the typical length of movies on Amazon Prime.

Detecting Skewness: Histograms can reveal the skewness of the distribution. For example, if the histogram is skewed towards shorter durations, it indicates that most movies on Amazon Prime are relatively short.

Seaborn's Histplot with KDE: Using Seaborn's histplot with the kde=True parameter adds a Kernel Density Estimation (KDE) plot on top of the histogram, providing a smoother representation of the data distribution.
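
Because the durations for this chart are pulled out of free text, it is worth checking what the pattern matches before trusting the histogram. The sketch below reuses the same regular expression on a few invented strings; the non-string guard is an addition for missing descriptions.

```python
import re

# Same pattern as in the chart code: digits, optional whitespace, then "min"
DURATION_PATTERN = re.compile(r"(\d+)\s*min")


def extract_duration(text):
    """Return the first '<number> min' value found in text, or None."""
    if not isinstance(text, str):  # e.g. NaN for a missing description
        return None
    match = DURATION_PATTERN.search(text)
    return int(match.group(1)) if match else None


# Invented sample descriptions
samples = ["A 120 min thriller", "Runs 90min.", "No length given", None]
durations = [extract_duration(s) for s in samples]
print(durations)

# The pattern also matches plot text that is not a running time
false_positive = extract_duration("5 minutes into the heist, things go wrong")
print(false_positive)
```

The last call shows a limitation: a phrase such as "5 minutes into the heist" is read as a 5-minute movie. Where a dedicated numeric column is available, such as the `runtime` column used in Chart - 9, it is likely the more reliable source.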

##### 2. What is/are the insight(s) found from the chart?

The insights derived from the histogram of movie durations are:

Typical Duration: Identify the most common movie durations by observing the tallest bars in the histogram. This indicates the typical length of movies on Amazon Prime.

Duration Range: Observe the overall range of movie durations, from the shortest to the longest. This provides insights into the variety of movie lengths available on the platform.

Distribution Shape: Analyze the shape of the histogram to understand the distribution of movie durations. Is it symmetrical, skewed towards shorter durations, or skewed towards longer durations? This describes how the catalog is composed; confirming that it matches audience preferences would require viewing data in addition to the catalog.

Outliers: Look for any unusually long or short movie durations, represented by bars that are far away from the main cluster. These outliers might indicate niche content or specific genres with different duration preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Content Acquisition: By understanding the typical duration and range of movie lengths that resonate with audiences, Amazon Prime can make informed decisions about acquiring new content that aligns with user preferences.

Content Recommendations: The insights into movie duration preferences can enhance content recommendation systems. Amazon Prime can recommend movies with similar durations or suggest titles based on user preferences for shorter or longer movies.

User Experience: Ensuring a variety of movie durations in the catalog caters to different viewer preferences and viewing habits, improving user satisfaction and engagement.

Content Production: Understanding audience preferences for movie lengths can guide original content production strategies. Amazon Prime might consider creating movies with durations that align with popular trends or cater to specific audience segments.

Potential Negative Growth Insights:

Limited Duration Variety: If the histogram reveals a heavy concentration of movies within a narrow duration range, it could indicate a lack of variety in offerings, potentially alienating users who prefer shorter or longer movies.

Ignoring Niche Preferences: Neglecting movies with durations that cater to specific genres or audience segments could limit the platform's appeal and growth potential.

Misaligned Content Strategy: If content acquisition and production strategies are not aligned with actual movie duration preferences, it could lead to a mismatch between supply and demand, potentially hindering business growth and profitability.

Justification:

By carefully analyzing the insights from Chart - 12, Amazon Prime can make data-driven decisions regarding content acquisition, recommendations, and production strategies. Leveraging these insights can lead to a positive business impact by enhancing content diversity, catering to audience preferences, and attracting new subscribers. However, failing to address potential negative growth insights, such as limited duration variety or ignoring niche preferences, could hinder the platform's growth and restrict its market share in the long run.

#### Chart - 13

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your DataFrame is named 'df' and has a 'genres' column

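# 1. Prepare the data
# Strip list-style brackets and quotes (a no-op for plain "Drama, Comedy" strings),
# then split into individual genres and count them
genre_series = (
    df['genres'].str.strip('[]').str.replace("'", '', regex=False)
    .str.split(', ').explode().str.strip()
)
genre_counts = genre_series[genre_series != ''].value_counts()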

# 2. Create the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.index, y=genre_counts.values)
plt.title('Distribution of Genres on Amazon Prime')
plt.xlabel('Genre')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent labels from overlapping
plt.show()

##### 1. Why did you pick the specific chart?

Categorical Data: The genres variable is categorical, representing distinct categories of genres (e.g., Drama, Comedy, Action). Bar charts are specifically designed for displaying the distribution of categorical data.

Comparison of Frequencies: The main goal is to compare the frequencies or counts of each genre. Bar charts excel at showing these comparisons visually through the lengths of the bars. This allows viewers to quickly grasp which genres are most common and how content is distributed across different categories.

Clarity and Simplicity: Bar charts are generally easy to understand and interpret. Viewers can quickly grasp the relative proportions of each genre category by simply looking at the bar lengths. This makes the visualization accessible to a wide audience.

Effective for Univariate Analysis: This visualization is a form of univariate analysis, focusing on a single variable (genres). Bar charts are well-suited for this type of analysis as they effectively show the distribution of one variable.

Seaborn's Barplot: Seaborn's barplot() function provides a convenient and aesthetically pleasing way to draw the bars. Here the counting is done beforehand with pandas' value_counts(), and barplot() plots the resulting totals; countplot() is the Seaborn function that counts category occurrences itself.

##### 2. What is/are the insight(s) found from the chart?

Most Frequent Genres: You can quickly identify the most frequent genres on Amazon Prime by observing the tallest bars in the chart. For instance, if the bars for "Drama," "Comedy," and "Action" are the highest, it indicates that these genres are most prevalent in the platform's content library.

Genre Distribution: The chart provides an overview of how content is distributed across different genres. You can assess the variety and balance of genres offered on Amazon Prime. A more diverse distribution suggests a broader appeal to a wider range of viewer preferences.

Potential Gaps in Content: If certain genres have very short bars or are missing altogether, it might indicate a potential gap in content offerings for specific audience segments. For example, if there's a limited selection of documentaries, it could suggest an opportunity to expand the content library in that area.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential Positive Business Impact:

Targeted Content Acquisition: By understanding the distribution of genres, Amazon Prime can make more informed decisions about acquiring new content. For instance, if there's a high demand for thriller movies, Amazon Prime can prioritize acquiring titles in that genre to cater to audience preferences.

Personalized Recommendations: The insights from the chart can be used to enhance content recommendation systems. By knowing the preferred genres of users, Amazon Prime can recommend similar titles, leading to increased user satisfaction and engagement.

Effective Marketing Campaigns: Understanding genre preferences allows Amazon Prime to create more targeted marketing campaigns. For example, highlighting new releases in popular genres can attract more viewers and drive subscriptions.

Diversification of Content: Identifying gaps in content offerings for specific genres can help Amazon Prime diversify its content library. Investing in titles with underrepresented genres can attract new audiences and expand its market reach.

Potential Negative Growth Insights:

Limited Genre Diversity: If the chart reveals a heavy concentration of content within a few genres, it might indicate a lack of diversity in offerings. This could alienate users who prefer less common genres, potentially leading to decreased engagement or churn.

Ignoring Niche Audiences: If certain genres have very low representation, it could indicate that Amazon Prime is neglecting niche audiences. Ignoring these segments could limit growth potential and market share.

Misaligned Content Strategy: If Amazon Prime's content acquisition strategy is not aligned with the actual genre preferences of its users, it could lead to a mismatch between supply and demand, resulting in wasted resources and missed opportunities.

In short, knowing what genres people like helps Amazon make smarter decisions about what content to offer, how to recommend it, and how to advertise it. This helps them keep existing subscribers happy and attract new ones, which is good for business. However, if they don't pay attention to genre preferences, they risk losing subscribers and missing out on growth opportunities.

#### Chart - 14 - Correlation Heatmap

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming your DataFrame is named 'df'

# 1. Select numerical features for correlation analysis
numerical_features = df.select_dtypes(include=['number'])

# 2. Calculate the correlation matrix
correlation_matrix = numerical_features.corr()

# 3. Create the heatmap
plt.figure(figsize=(10, 8))  # Adjust figure size if needed
# Pin the color scale to [-1, 1] so red always means positive and blue negative
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Correlation Heatmap:

Visualizing Relationships: A correlation heatmap is specifically designed to visualize the relationships between multiple numerical variables. It provides a clear and concise way to see the strength and direction of correlations between all pairs of variables in a dataset, all at once.

Identifying Patterns: Heatmaps are excellent for identifying patterns in data. By observing the color variations in the heatmap, you can quickly spot:

Strong positive correlations (dark red)
Strong negative correlations (dark blue)
Weak or no correlations (lighter colors)

This allows for a quick understanding of the overall relationships within the data without getting lost in individual numbers.

Exploring Multivariate Relationships: This visualization is a form of multivariate analysis, as it considers the relationships between multiple variables simultaneously. Heatmaps are well-suited for this type of analysis as they provide a comprehensive overview of the interdependencies within the dataset. You can see how multiple variables relate to each other in one view.

Seaborn's Heatmap: Seaborn's heatmap() function provides a convenient and aesthetically pleasing way to plot a correlation matrix, with customization options such as cell annotations, color maps, and number formatting. Note that heatmap() only draws the matrix it is given; the correlation matrix itself is computed beforehand with pandas' DataFrame.corr().

In simpler terms:

Imagine you have a lot of numbers representing different things about your movies and shows (like release year, rating, runtime, etc.). A correlation heatmap is like a color-coded table that shows you how these numbers relate to each other. It helps you quickly see which things tend to go together (like maybe newer movies have higher ratings) and which things don't really affect each other. This big-picture view is why it's a good choice for exploring relationships between numerical variables.

##### 2. What is/are the insight(s) found from the chart?

Insights from a Correlation Heatmap:

A correlation heatmap provides a visual representation of the relationships between numerical variables in your dataset, allowing you to identify key insights:

Strength of Correlations:

The color intensity in the heatmap indicates the strength of the correlation between two variables.
Darker colors represent stronger correlations (either positive or negative).
Lighter colors represent weaker correlations, meaning the variables don't have a strong relationship.

Direction of Correlations:

The color itself indicates the direction of the correlation.
Red or warm colors generally represent positive correlations. This means as one variable increases, the other tends to increase as well (e.g., maybe higher movie budgets are associated with higher box office revenue).
Blue or cool colors represent negative correlations. This means as one variable increases, the other tends to decrease (e.g., maybe older movies have lower ratings).

Patterns and Clusters:

By observing the patterns of colors in the heatmap, you can identify clusters of variables that are highly correlated with each other.
This can provide insights into underlying relationships and groupings within the data. For example, you might find a cluster of variables related to customer demographics that are strongly correlated with each other.

Potential Multicollinearity:

If you see very strong correlations (close to +1 or -1) between multiple variables, it might indicate multicollinearity. This means some variables are essentially measuring the same thing.
Multicollinearity can be an issue for certain statistical models, so it's important to be aware of it.

In simpler terms:

Imagine the heatmap as a map of relationships. The colors show you how strongly things are related and whether they move together or in opposite directions. By looking for patterns and clusters of colors, you can:

Find out which things are strongly linked: This can help you understand what factors might be driving certain trends in your data.
Spot potential problems: If things are too strongly related, it might cause issues when you're trying to build predictive models.
Get a better overall picture: The heatmap lets you see all the relationships at once, giving you a broader understanding of how your data is connected.
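With many variables, strongly linked pairs can be hard to pick out by eye. As a rough sketch, the correlation matrix can be flattened into a ranked list of pairs. The data below is synthetic, the column names are hypothetical placeholders, and the 0.9 cutoff is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic numeric data; the column names are hypothetical placeholders.
rng = np.random.default_rng(0)
runtime = rng.normal(100, 15, 200)
demo = pd.DataFrame({
    "runtime": runtime,
    "runtime_hours": runtime / 60,  # exact rescaling, so it duplicates runtime
    "release_year": rng.integers(1990, 2024, 200),
})

corr = demo.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per variable pair, sorted by absolute correlation.
pairs = upper.stack().dropna().rename("r").reset_index()
pairs.columns = ["var_1", "var_2", "r"]
pairs = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)

# 0.9 is an arbitrary cutoff chosen for illustration.
flagged = pairs[pairs["r"].abs() > 0.9]
print(flagged)
```

Pairs that land in `flagged` are candidates for the multicollinearity concern described above. Whether to drop or combine one of the variables depends on the model being built.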

#### Chart - 15 - Pair Plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming your DataFrame is named 'df'

# 1. Select numerical features for the pair plot
numerical_features = df.select_dtypes(include=['number'])

# 2. Create the pair plot
sns.pairplot(numerical_features)
plt.suptitle('Pair Plot of Numerical Features', y=1.02)  # Add a title
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Pair Plot:

Comprehensive View of Relationships: A pair plot provides a comprehensive view of the relationships between all pairs of numerical variables in your dataset. It creates a matrix of scatter plots, where each scatter plot shows the relationship between two specific variables. This allows you to see all the pairwise relationships at a glance, rather than having to create individual scatter plots for each pair.

Identifying Correlations: Pair plots are excellent for identifying correlations between variables. By observing the scatter plots, you can quickly spot:

Positive correlations: Points trending upwards, indicating that as one variable increases, the other tends to increase as well.
Negative correlations: Points trending downwards, indicating that as one variable increases, the other tends to decrease.
No correlations: Points scattered randomly, indicating that there is no clear relationship between the variables.

Exploring Distributions: In addition to scatter plots, pair plots also include histograms or kernel density estimates (KDEs) along the diagonal. These plots show the distribution of each individual variable, providing insights into the:

Range: The minimum and maximum values of the variable.
Central tendency: The typical or average value of the variable (e.g., mean or median).
Spread: How much the data varies around the central tendency (e.g., standard deviation).

Seaborn's Pairplot: Seaborn's pairplot() function provides a convenient and aesthetically pleasing way to create pair plots. It automatically handles the creation of the scatter plots and histograms, making it easy to visualize the relationships between multiple variables without writing a lot of code.

In simpler terms:

Imagine you have information about your movies like release year, rating, runtime, and budget. A pair plot is like creating a grid where each cell shows you how two of these things relate to each other. It's like having a bunch of mini scatter plots in one place. This helps you:

See the big picture: You can quickly see how all the numerical variables in your dataset are related, without having to create separate charts for each pair.
Spot trends: You can easily identify correlations (positive, negative, or none) between variables by looking at the direction of the points in the scatter plots.
Understand individual variables: The histograms or KDEs along the diagonal give you a sense of how each variable is distributed, like whether it's skewed or has outliers.

This comprehensive and visually intuitive approach is why a pair plot is a good choice for exploring relationships between numerical variables.
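The range, central tendency, and spread that the diagonal plots show visually can also be tabulated. This is a small sketch on synthetic data; the `rating` and `runtime` columns are made-up stand-ins, not the dataset's actual fields.

```python
import numpy as np
import pandas as pd

# Synthetic columns standing in for real numeric features (hypothetical names).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "rating": rng.normal(6.5, 1.0, 500),      # roughly symmetric
    "runtime": rng.lognormal(4.5, 0.5, 500),  # right-skewed
})

# One row per variable: range (min/max), central tendency, and spread.
summary = demo.agg(["min", "max", "mean", "median", "std"]).T

# Skewness near 0 suggests a symmetric shape; a clearly positive value
# suggests a long right tail.
summary["skew"] = demo.skew()

print(summary.round(2))
```

A mean well above the median, together with positive skew, is the numeric counterpart of a histogram with a long right tail.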

##### 2. What is/are the insight(s) found from the chart?

Correlations: By looking at the scatter plots, you can identify correlations between variables. Positive correlations show an upward trend, negative correlations show a downward trend, and no correlation shows a random scattering of points.

Distributions: The histograms or KDEs along the diagonal show the distribution of each individual variable. This can help you understand the range, central tendency, and spread of the data.

Outliers: Pair plots can also help you identify outliers, which are data points that are significantly different from the rest of the data. Outliers can be seen as points that are far away from the main cluster of points in a scatter plot.
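Points that look isolated in a scatter plot can be confirmed with a simple rule. The sketch below uses the common 1.5 × IQR convention on a handful of made-up runtimes. The values and the `runtime` name are illustrative assumptions, and other outlier rules may suit the real data better.

```python
import pandas as pd

# Hypothetical runtimes in minutes; 600 is planted as an obvious outlier.
runtime = pd.Series([88, 92, 95, 101, 104, 110, 115, 600], name="runtime")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = runtime.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for inspection, not automatic removal.
outliers = runtime[(runtime < lower) | (runtime > upper)]
print(outliers)
```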

Nonlinear Relationships: Scatter plots are not limited to linear relationships. Curved or otherwise patterned point clouds in a pair plot indicate nonlinear relationships, which a Pearson correlation coefficient on its own can understate.
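One way to check a suspected curve numerically is to compare the Pearson correlation with a rank-based (Spearman-style) correlation, computed here as the Pearson correlation of the ranks. This is a sketch on synthetic data in which `y` grows exponentially with `x`; both names are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic example: y rises monotonically with x, but along a curve.
x = np.linspace(0, 5, 50)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures how well a straight line fits.
pearson = demo["x"].corr(demo["y"])

# Rank correlation (Spearman) measures how consistently y rises with x.
spearman = demo["x"].rank().corr(demo["y"].rank())

print(f"Pearson: {pearson:.2f}, rank-based: {spearman:.2f}")
```

A rank-based value noticeably higher than the Pearson value hints at a monotonic but curved relationship. Non-monotonic shapes such as a U-curve can score low on both, so the scatter plot remains the primary check.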

## **5. Solution to Business Objective**

Suggestions to Achieve Business Objective:

Based on the analysis and visualizations conducted using the Amazon Prime dataset, here are some suggestions for the client:

1. Focus on Content Diversity and Quality:

Diversify Genre Offerings: Ensure a balanced mix of genres to cater to a wider audience and avoid over-reliance on a few popular genres. This can attract new subscribers and keep existing ones engaged.
Invest in High-Rated Content: Prioritize acquiring and producing content with high ratings, as this is a strong indicator of customer satisfaction. Highlight and promote these titles to increase viewership and engagement.
Expand Content for Niche Audiences: Identify underrepresented genres or content categories that have potential demand. Investing in these areas can attract new subscriber segments and increase overall content value.

2. Enhance Content Discovery and Recommendations:

Improve Search Functionality: Make it easier for users to find the content they are looking for by enhancing search algorithms and filtering options. This can reduce frustration and increase content discovery.
Personalize Recommendations: Leverage data on user viewing habits and preferences to provide more relevant and personalized content recommendations. This can increase engagement and watch time.
Highlight New and Trending Content: Showcase new releases and trending titles prominently to capture user attention and encourage exploration. This can help users discover content they might not have found otherwise.

3. Optimize Content Delivery and User Experience:

Ensure Seamless Streaming: Provide a smooth and reliable streaming experience to avoid technical issues that can negatively impact user satisfaction. Invest in infrastructure and technology to support high-quality streaming.
Improve User Interface: Make the platform easy to navigate and use, with intuitive controls and clear information about content. This can enhance the overall user experience and encourage continued use.
Offer Flexible Viewing Options: Provide options for offline viewing, multiple device compatibility, and customizable settings to cater to different user preferences and viewing habits. This can increase convenience and accessibility.

4. Leverage Data for Targeted Marketing and Promotions:

Identify Target Audiences: Use data on demographics, viewing habits, and preferences to identify specific audience segments with high potential for acquisition and retention.
Tailor Marketing Campaigns: Create targeted marketing campaigns that resonate with specific audience segments, highlighting content and features that align with their preferences. This can increase the effectiveness of marketing efforts.
Offer Personalized Incentives: Provide personalized incentives, such as discounts or exclusive content access, to encourage subscription renewals and engagement. This can demonstrate value to users and increase loyalty.

5. Continuously Monitor and Analyze Data:

Track Key Metrics: Regularly monitor key performance indicators (KPIs) such as subscriber growth, churn rate, content engagement, and user feedback to assess the effectiveness of strategies and identify areas for improvement.
Conduct A/B Testing: Experiment with different approaches to content acquisition, recommendations, and user interface elements to determine what resonates best with users and drives desired outcomes.
Stay Informed about Industry Trends: Keep abreast of industry trends and competitor strategies to anticipate changes in user preferences and adapt offerings accordingly. This can help Amazon Prime maintain a competitive edge.

Justification:

These suggestions are justified by the insights gained from the analysis and visualizations conducted using the Amazon Prime dataset. The suggestions aim to address key factors that influence customer satisfaction and retention, such as content diversity, quality, discoverability, user experience, and targeted marketing. By implementing these suggestions, the client can potentially improve customer engagement, reduce churn, and achieve their business objectives.

These suggestions are based on the available data and should be further validated through user research and A/B testing. Continuous monitoring and analysis are crucial for adapting strategies and ensuring ongoing success.

# **Conclusion**

This Exploratory Data Analysis (EDA) of the Amazon Prime dataset has provided valuable insights into customer behavior, preferences, and trends related to Amazon Prime services. Through data wrangling, visualization, and analysis, we've identified key factors that influence customer satisfaction and retention.

Key Findings:

Content Diversity: A balanced mix of genres and content types is crucial for catering to a wide audience and maximizing user engagement.
Content Quality: High-rated content is a strong indicator of customer satisfaction, and investing in such content is essential for retention.
Content Discovery: Enhanced search functionality and personalized recommendations are crucial for helping users find content they enjoy, leading to increased watch time and engagement.
User Experience: A seamless streaming experience and an intuitive user interface contribute to overall customer satisfaction and encourage continued use of the platform.
Targeted Marketing: Data-driven marketing campaigns that focus on specific audience segments and preferences can be more effective in driving subscriptions and engagement.

Recommendations:

Based on these findings, we recommend that Amazon Prime prioritize:

Diversifying content offerings to cater to broader preferences.
Investing in high-quality, highly rated content.
Enhancing search and recommendation features for improved content discovery.
Optimizing content delivery and user interface for a seamless user experience.
Leveraging data for targeted marketing and personalized incentives.
Continuously monitoring and analyzing data to adapt strategies and ensure ongoing success.

By implementing these recommendations, Amazon Prime can potentially improve customer satisfaction, reduce churn, and achieve its business objectives of growth and retention in the competitive streaming market.

Further Considerations:

While this EDA has provided valuable insights, further research and analysis could be conducted to:

Explore user demographics and segmentation in more depth.
Analyze the impact of specific features on customer satisfaction and retention.
Develop predictive models to forecast customer churn and identify at-risk users.

By continuing to leverage data and insights, Amazon Prime can continuously optimize its services and maintain its position as a leading streaming platform.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***