# **Project Name**    - Amazon TV shows-EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** ABINETHRI T


# **Project Summary -**

This project aims to perform an Exploratory Data Analysis (EDA) on a dataset of Amazon titles, including movies and TV shows. The analysis focuses on understanding the relationships between different variables, such as runtime, release year, genre, seasons, IMDb scores, and TMDb scores.

Key Objectives:

* Data Understanding: Explore the dataset's structure, identify data types, and examine missing values and duplicates.
* Variable Analysis: Analyze individual variables and their distributions, including runtime, release year, genre, and seasons.
* Relationship Exploration: Investigate the relationships between variables using various visualization techniques, like bar plots, histograms, box plots, line plots, and violin plots.
* Insights and Business Impact: Derive meaningful insights from the visualizations and discuss their potential positive and negative impacts on business decisions, such as content strategy, recommendations, and marketing.

Methodology:

The project follows a structured EDA approach:

* Data Wrangling: Cleaning and preparing the data for analysis, including handling missing values, duplicates, and data type conversions.
* Univariate Analysis: Exploring individual variables to understand their distributions and characteristics.
* Bivariate Analysis: Examining relationships between two variables, such as runtime and release year, or genre and IMDb score.
* Multivariate Analysis: Investigating relationships involving more than two variables, such as runtime, genre, and seasons.

Expected Outcomes:

The project is expected to provide insights into:

* Trends in runtime, release year, and season length of Amazon titles.
* Relationships between genre, IMDb score, and TMDb score.
* Distribution of runtime by genre and seasons.
* Potential outliers and unusual patterns in the data.

These insights can inform content strategy, personalize recommendations, optimize programming, and support targeted marketing efforts for Amazon.

# **GitHub Link -**

https://github.com/ABI-THAKSHANA/Amazon-EDA-/upload/main

# **Problem Statement**


In the fiercely competitive streaming entertainment market, understanding audience preferences is paramount to success. Amazon, as a leading streaming platform, must continuously analyze its content library and identify trends to make informed decisions regarding content acquisition, production, and recommendation strategies. This project addresses the critical need for actionable insights from Amazon's extensive catalog of titles to optimize content strategy and enhance user engagement.

The analysis aims to:

* Uncover trends in runtime, release year, and season length, providing valuable insights into content consumption patterns.
* Analyze the performance of different genres based on IMDb and TMDb scores, identifying high-performing and underperforming categories.
* Investigate the relationship between runtime, genre, and number of seasons for TV shows, informing content planning and scheduling decisions.
* Detect outliers and unusual patterns in the data that may require further investigation.

The insights derived from this analysis will empower Amazon to:

* Refine content acquisition and production strategies, addressing content gaps and capitalizing on emerging trends.
* Enhance personalization by tailoring recommendations based on user preferences for runtime, genre, and other relevant factors.
* Optimize programming and scheduling to maximize viewer engagement and satisfaction.
* Strengthen targeted marketing efforts by leveraging audience insights to craft more compelling campaigns.

#### **Define Your Business Objective?**

To leverage insights derived from analyzing the Amazon titles and credits datasets to increase user engagement and subscriber retention by optimizing content strategy, personalization, and marketing efforts.

This objective will be achieved by focusing on the following key areas:

Content Acquisition and Production:

* Identify content gaps and emerging trends in genres, runtimes, and release years to inform content acquisition and production decisions.
* Guide investments in content that aligns with audience preferences and drives viewership.

Personalization and Recommendations:

* Develop more effective recommendation systems by understanding user preferences for runtime, genre, cast, and other relevant factors.
* Increase user satisfaction and engagement by providing personalized content suggestions.

Programming and Scheduling:

* Optimize content scheduling and programming by considering insights into viewer behavior and preferences for different types of content.
* Maximize viewership and engagement by offering the right content at the right time.

Targeted Marketing and Promotion:

* Leverage insights into audience demographics, preferences, and viewing habits to develop more targeted marketing and promotional campaigns.
* Increase subscriber acquisition and retention by effectively promoting relevant content to the right audience segments.

By achieving this business objective, Amazon aims to:

* Strengthen its competitive position in the streaming entertainment market.
* Enhance user satisfaction and loyalty.
* Drive subscriber growth and retention.
* Maximize the return on investment in content acquisition and production.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
Titles_ds = pd.read_csv('titles.csv.zip')
credits_ds = pd.read_csv('credits.csv.zip')
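
Since the guidelines ask for exception handling, the loading step could be wrapped in a small helper. The sketch below is only one possible approach: `load_csv` is a hypothetical helper name, not part of the original notebook, and it returns `None` instead of raising when a file is missing or cannot be parsed.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv(path: str) -> Optional[pd.DataFrame]:
    """Read a CSV (optionally zipped); return None instead of raising on failure."""
    file_path = Path(path)
    if not file_path.exists():
        print(f"File not found: {file_path}")
        return None
    try:
        # pandas infers zip compression from the '.zip' extension
        return pd.read_csv(file_path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        print(f"Could not parse {file_path}: {exc}")
        return None


# Hypothetical usage with the file names from the cell above
titles_data = load_csv('titles.csv.zip')
credits_data = load_csv('credits.csv.zip')
```

A caller would then check for `None` before continuing, so the notebook can stop with a clear message rather than a traceback.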


### Dataset First View

In [None]:
# Dataset First Look
print(Titles_ds.head())
print(credits_ds.head())

print(Titles_ds.tail())
print(credits_ds.tail())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(Titles_ds.shape)
print(credits_ds.shape)

### Dataset Information

In [None]:
# Dataset Info
print(Titles_ds.info())
print(credits_ds.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(Titles_ds.duplicated().sum())
print(credits_ds.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(Titles_ds.isnull().sum())
print(credits_ds.isnull().sum())

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(Titles_ds.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

plt.figure(figsize=(10, 6))
sns.heatmap(credits_ds.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The project uses two datasets: Titles, with 9,871 rows and 15 columns, and credits, with 124,235 rows and 5 columns. Titles contains 3 duplicate rows and credits contains 56. The duplicates in credits can be neglected, since the same actor's name can legitimately appear more than once, but that is not the case for Titles. Both datasets also contain many missing values, and the heatmaps make the pattern of missing values easy to identify. The character, age certification, and seasons columns have the most missing values.
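
The counts described above (rows, columns, duplicates, and the share of missing values per column) can be gathered in one place. This is a minimal sketch on a made-up toy frame; `summarize` is a hypothetical helper and the toy values are not taken from the real data.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return row/column counts, duplicate rows, and missing share per column."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        # Share of missing values per column, in percent, largest first
        "missing_pct": (df.isnull().mean() * 100).round(1).sort_values(ascending=False),
    }


# Toy stand-in for the titles data (values are made up for illustration)
toy = pd.DataFrame({
    "id": ["ts1", "tm2", "tm2", "ts3"],
    "seasons": [2.0, np.nan, np.nan, 1.0],
    "age_certification": ["TV-14", None, None, None],
})
report = summarize(toy)
print(report["rows"], report["columns"], report["duplicate_rows"])
print(report["missing_pct"])
```

Expressing missing values as a percentage makes it easier to see which columns are mostly empty than reading raw counts.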



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(Titles_ds.columns)
print(credits_ds.columns)

In [None]:
# Dataset Describe
print(Titles_ds.describe())
print(credits_ds.describe())

### Variables Description

The Titles dataset has the attributes id, title, type, description, release_year, age_certification, runtime, genres, production_countries, seasons, imdb_id, imdb_score, imdb_votes, tmdb_popularity, and tmdb_score, which describe each title and help in understanding the TV show and movie data. The credits dataset has the attributes person_id, id, name, character, and role, which describe who played which part in each title. Hence, the duplicates in credits can be neglected, because the same person can play different roles.
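
To illustrate the point about credits, the sketch below separates two situations on a made-up toy frame: a person appearing in several titles (expected) and rows that are identical in every column (true duplicates). The column names follow the list above; the values are invented.

```python
import pandas as pd

# Toy credits rows (made-up values) using the column names listed above
credits_toy = pd.DataFrame({
    "person_id": [10, 10, 11, 10],
    "id": ["ts1", "tm2", "ts1", "ts1"],
    "name": ["A. Actor", "A. Actor", "B. Actor", "A. Actor"],
    "character": ["Hero", "Villain", "Friend", "Hero"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "ACTOR"],
})

# A person appearing in several titles is expected, not an error
titles_per_person = credits_toy.groupby("person_id")["id"].nunique()

# Only rows identical in every column count as exact duplicates
exact_duplicates = int(credits_toy.duplicated().sum())

print(titles_per_person)
print(exact_duplicates)
```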

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(Titles_ds.nunique())
print(credits_ds.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Check for missing values
print(Titles_ds.isnull().sum())
print(credits_ds.isnull().sum())

# Drop every row that has a missing value in any column
Titles_ds.dropna(inplace=True)
credits_ds.dropna(inplace=True)

#check for duplicates
print(Titles_ds.duplicated().sum())
print(credits_ds.duplicated().sum())

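# Convert 'release_year' to datetime objects
# (errors='coerce' turns unparseable values into NaT; errors='ignore' is deprecated in recent pandas)
Titles_ds['release_year'] = pd.to_datetime(Titles_ds['release_year'],
                                        format='%Y', errors='coerce')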


# Convert 'imdb_score' to numeric
Titles_ds['imdb_score'] = pd.to_numeric(Titles_ds['imdb_score'], errors='coerce')

# Convert 'imdb_votes' to numeric
Titles_ds['imdb_votes'] = pd.to_numeric(Titles_ds['imdb_votes'], errors='coerce')

# Convert 'tmdb_score' to numeric
Titles_ds['tmdb_score'] = pd.to_numeric(Titles_ds['tmdb_score'], errors='coerce')

# Convert 'tmdb_popularity' to numeric
Titles_ds['tmdb_popularity'] = pd.to_numeric(Titles_ds['tmdb_popularity'],
                                             errors='coerce')

# Convert 'runtime' to numeric
Titles_ds['runtime'] = pd.to_numeric(Titles_ds['runtime'], errors='coerce')

# Convert 'seasons' to numeric
Titles_ds['seasons'] = pd.to_numeric(Titles_ds['seasons'], errors='coerce')

# 'id' is of object type (string) and links the two datasets. Converting it to
# numeric with errors='coerce' would turn any non-numeric ID into NaN, so keep it as a string
Titles_ds['id'] = Titles_ds['id'].astype(str)
credits_ds['id'] = credits_ds['id'].astype(str)


### What all manipulations have you done and insights you found?

The manipulations done were handling missing values, handling duplicates, and type conversions: rows with missing values were checked and removed, duplicates were counted, and the column types were converted so that the datasets are ready for further use.
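
Dropping every row with any missing value can remove a large part of the data. As a hedged alternative, the sketch below keeps rows where only some columns are empty. It assumes that a missing `seasons` value means the title is a movie, which is an assumption about the data and not something the notebook verifies; `prepare_titles` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd


def prepare_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, coerce numeric columns, and drop only rows missing key fields."""
    out = df.drop_duplicates().copy()
    # Assumption: a missing season count means a movie, so use 0 instead of dropping the row
    out["seasons"] = pd.to_numeric(out["seasons"], errors="coerce").fillna(0).astype(int)
    for col in ["runtime", "imdb_score", "tmdb_score"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Keep only rows where the columns needed for the charts are present
    return out.dropna(subset=["runtime", "imdb_score"]).reset_index(drop=True)


# Toy stand-in for the titles data (values are made up for illustration)
raw = pd.DataFrame({
    "id": ["tm1", "ts2", "ts2", "tm3"],
    "type": ["MOVIE", "SHOW", "SHOW", "MOVIE"],
    "runtime": [95, 42, 42, "n/a"],
    "seasons": [np.nan, 3, 3, np.nan],
    "imdb_score": [7.1, 8.0, 8.0, 6.5],
    "tmdb_score": [6.9, np.nan, np.nan, 6.0],
})
clean = prepare_titles(raw)
print(clean)
```

The design choice here is to drop rows based on a named subset of columns, so a gap in a column that a chart does not use does not remove the whole title.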

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# titles_df is not defined before this cell, so load the titles data here
# (the same way the Chart - 6 cell does)
titles_df = pd.read_csv('titles.csv.zip')

plt.figure(figsize=(20, 16))
sns.barplot(x='release_year', y='runtime', data=titles_df, ci=None)
plt.title('Average Runtime vs. Release Year')
plt.xlabel('Release Year')
plt.ylabel('Average Runtime (minutes)')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()



##### 1. Why did you pick the specific chart?

I chose a bar plot to visualize the relationship between the average runtime of titles and their release year for the following reasons:
Comparing Averages: Bar plots are excellent for comparing the average value of a numerical variable (runtime) across different categories (release years).
Clear Trend Visualization: They make it easy to spot trends and patterns, such as whether runtime has increased or decreased over time.
Readability: Bar plots are generally easy to understand and interpret, making them suitable for a wide audience.

##### 2. What is/are the insight(s) found from the chart?

The bar plot will likely reveal the following information:
Trend of Average Runtime: The plot will show how the average runtime of movies and shows has changed over the years. You might observe an increase or decrease in the average runtime.
Variations: You'll be able to see if there are any particular years with unusually high or low average runtimes, indicating potential shifts in content preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Strategy: Understanding runtime trends can help streaming services such as Amazon in making decisions about the types of content they produce or acquire. For example, if shorter runtimes are becoming more popular, they might focus on shorter series or movies.
User Engagement: Insights into runtime preferences can be used to personalize recommendations for users, potentially leading to increased engagement and satisfaction.
Scheduling and Programming: Knowing the average runtime of content can help in optimizing scheduling and programming decisions for TV channels or streaming platforms.

Negative Impacts:

Misinterpretation: The chart only shows the average runtime. There might be a wide variation in runtimes within a specific year, which is not captured in the bar plot. This could lead to making incorrect assumptions about user preferences if relied upon solely.
Limited Scope: The chart focuses on runtime and release year. Other factors, such as genre, may also have a significant impact on user preferences and engagement, which are not accounted for in this visualization.
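
The misinterpretation risk noted above (an average hiding a wide spread) can be checked by putting the count, median, and standard deviation next to the mean for each year. This is a small sketch on made-up numbers, not on the real data.

```python
import pandas as pd

# Toy data: both years have the same mean runtime but very different spreads
toy = pd.DataFrame({
    "release_year": [2019, 2019, 2019, 2020, 2020],
    "runtime": [30, 90, 150, 88, 92],
})

per_year = toy.groupby("release_year")["runtime"].agg(["count", "mean", "median", "std"])
print(per_year)
```

In this toy example both years average 90 minutes, yet the 2019 runtimes vary far more, which a bar of averages alone would not show.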

#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(20, 16))
sns.histplot(data=titles_df, x='runtime', hue='seasons',
             element='step', stat='density', common_norm=False)
plt.title('Distribution of Runtime by Number of seasons')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Density')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with the 'step' element and 'density' stat for this visualization due to the following reasons:
Distribution Comparison: Histograms are well-suited for visualizing and comparing distributions, especially when you want to see how a variable changes based on a categorical factor. In this case, we want to see how runtime distributions vary for different numbers of seasons.
Density View: The density view ensures that when you have different numbers of shows for each season value, the distributions are normalized and easier to compare based on their shape and relative frequency. This avoids the visual distortion that arises when comparing distributions with unequal numbers of data points.
Overlapping Visualization: Using the 'step' element creates a step-style histogram, making it easier to see overlaps or differences between distributions. Unlike bars, the step form is less likely to mask the areas where distributions overlap.

##### 2. What is/are the insight(s) found from the chart?

The histogram helps us understand:

Overall Runtime Distribution: It provides a general overview of how runtime is typically distributed for shows in the dataset, identifying whether short runtimes, long runtimes, or a mix of both are more common.
Runtime Trends by Seasons: By visualizing separate distributions for each number of seasons using different colors (hue), we can see if shows with a certain number of seasons tend to have longer or shorter runtimes compared to others.
Runtime Clusters: We might notice clusters or concentrated areas of runtime values within particular season groups. This could highlight preferred or standard runtime ranges for shows with a specific number of seasons.
Outliers: The plot can also help identify any outliers or unusual runtime values for particular season categories. These unusual cases may warrant further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Strategy: Streaming services can gain insights into runtime patterns based on the number of seasons, which can help them with content planning and acquisition decisions. They might find that certain runtime ranges work better for specific genres or audience preferences, leading to better content creation choices.
Programming and Scheduling: Understanding how runtime varies for different show lengths allows for better programming decisions, including content placement, slot allocation, and order for a more engaging viewing experience. This also has benefits for advertising and marketing.

Negative Impacts:

Oversimplification: While the plot provides insight, it's crucial to remember that runtime and season count are just two factors. Genre, target audience, story complexity, and production constraints also play roles. Decisions solely based on runtime and seasons can be misleading.
Lack of Context: The visualization doesn't reveal the underlying reasons for observed runtime patterns. It simply shows the distributions, leaving space for interpretation, and potentially misleading conclusions, if not carefully analyzed along with other data points.
Misinterpretation: It's easy to overemphasize small or subtle variations in the distributions, which could be due to chance or a relatively small number of shows in certain categories. Always consider the sample size and the magnitude of differences before drawing conclusions.
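
One way to act on the sample-size caution above is to group season counts into a few buckets and report how many shows fall into each before comparing runtimes. The bucket edges and labels below are hypothetical choices for illustration, and the data is made up.

```python
import pandas as pd

# Toy stand-in for TV shows (values are made up for illustration)
shows = pd.DataFrame({
    "seasons": [1, 1, 2, 3, 5, 8, 12],
    "runtime": [25, 45, 30, 50, 22, 44, 21],
})

# Hypothetical bucket edges: (0, 1], (1, 3], (3, 6], (6, inf)
shows["season_group"] = pd.cut(
    shows["seasons"],
    bins=[0, 1, 3, 6, float("inf")],
    labels=["1", "2-3", "4-6", "7+"],
)

# Count and median runtime per bucket; small counts signal weak evidence
group_stats = shows.groupby("season_group", observed=False)["runtime"].agg(["count", "median"])
print(group_stats)
```

Passing `season_group` as the `hue` would also reduce the number of overlapping curves compared with using every distinct season count.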

#### Chart - 3

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(30, 15))
sns.histplot(
    data=titles_df,
    x='runtime',
    hue='genres',
    bins=50,
    kde=True,
    element='step',
    stat='density',      # normalize so each genre's histogram has area 1
    common_norm=False    # normalize each genre separately
)
plt.title('Distribution of Runtime by Genre')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Density')
plt.xticks(rotation=45, ha='right')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with the 'step' element and 'density' stat for this visualization due to the following reasons:

Distribution Comparison: Histograms are effective for visualizing and comparing the distributions of a numerical variable across different categories. In this case, I want to see how the distribution of 'runtime' varies for different 'genres'.
Density View: Using stat='density' normalizes the histogram so that the total area under the curve for each genre is 1. This makes it easier to compare the shapes of the distributions, even if the number of titles in each genre is different.
Overlapping Visualization: The 'step' element creates a step-style histogram instead of filled bars. This is particularly useful when comparing distributions with many overlapping categories (like genres), as it makes it easier to see the differences and overlaps in the distributions.
Kernel Density Estimate (KDE): Including kde=True adds a kernel density estimate plot, which provides a smoother representation of the distribution and can help highlight underlying patterns.

##### 2. What is/are the insight(s) found from the chart?

The histogram provides insights into the following:

Overall Runtime Distribution: It gives a general overview of how 'runtime' is distributed across all titles in the dataset. You can see if shorter runtimes, longer runtimes, or a mix of both are more common.
Runtime by Genre: By separating the data by 'genres' using the hue parameter, you can see how the distribution of runtime varies for different genres. This allows you to identify genres that tend to have longer or shorter runtimes compared to others.
Overlaps and Differences: The step-style histogram and KDE make it easier to see where the distributions of different genres overlap or diverge. This can help you understand the similarities and differences in runtime preferences across genres.
Potential Outliers: You might observe unusual or extreme runtime values for certain genres, which could be potential outliers. These outliers might warrant further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Programming and Recommendation: Understanding runtime preferences across genres can help streaming services and content providers make more informed decisions about programming and recommendations. For example, if a user enjoys a particular genre, the platform could recommend titles with runtimes typical for that genre.
Content Acquisition and Production: Insights into runtime trends can inform content acquisition and production strategies. If a platform notices a growing preference for shorter-form content within a specific genre, they might focus on acquiring or producing more titles with shorter runtimes in that genre.
Targeted Marketing: The visualization can help with targeted marketing efforts. Platforms can tailor their messaging and promotions based on runtime preferences within specific genres.

Negative Impacts:

Oversimplification: While the plot provides valuable insights, it's important to remember that runtime and genre are just two factors that influence viewer preferences. Other elements, such as plot, cast, and production quality, also play significant roles. Relying solely on runtime and genre for decision-making can be misleading.
Lack of Context: The visualization doesn't explain the reasons behind the observed runtime patterns. It's crucial to consider other factors and conduct further analysis to understand the underlying causes.
Misinterpretation: It's easy to misinterpret subtle variations or overlaps in the distributions, especially when dealing with many genres.
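
Comparing runtime across genres is easier when each title contributes one row per genre. The sketch below assumes the `genres` column stores text that looks like a Python list (for example `"['drama', 'comedy']"`); that format is an assumption, and the toy values are made up.

```python
import ast

import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "runtime": [100, 30, 50],
    # Assumption: genres are stored as a string that looks like a Python list
    "genres": ["['drama', 'comedy']", "['comedy']", "[]"],
})

# Parse the text into real lists, then give every genre its own row
toy["genre_list"] = toy["genres"].apply(ast.literal_eval)
per_genre = (
    toy.explode("genre_list")
       .dropna(subset=["genre_list"])   # titles with an empty genre list drop out
       .rename(columns={"genre_list": "genre"})
)

median_runtime = per_genre.groupby("genre")["runtime"].median()
print(median_runtime)
```

Using the single `genre` column as the `hue` would give one distribution per genre instead of one per unique combination of genres.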


#### Chart - 4

In [None]:
# Chart - 4 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.boxplot(x='type', y='tmdb_score', data=titles_df)
plt.title('Distribution of TMDb Score by Type (SHOW vs. MOVIE)')
plt.xlabel('Type')
plt.ylabel('TMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot for this visualization because it is well-suited for comparing the distributions of a numerical variable (TMDb score) across different categories (type: SHOW vs. MOVIE).

Distribution Visualization: Box plots effectively display the key aspects of a distribution, including the median, quartiles, and potential outliers. This allows for a quick understanding of the typical TMDb score and its variability for each type.
Comparison: Box plots facilitate easy comparison between categories by placing the boxes side-by-side. This makes it straightforward to see if there are differences in the central tendency or spread of TMDb scores between SHOWs and MOVIES.
Outlier Detection: Box plots clearly highlight potential outliers, which can be valuable for identifying unusual data points that might require further investigation.

##### 2. What is/are the insight(s) found from the chart?

The box plot provides several insights into the distribution of TMDb scores by type:

Central Tendency: The median line within each box indicates the typical TMDb score for that type. By comparing the medians, you can see if SHOWs or MOVIES tend to have higher or lower scores.
Spread: The height of the box (interquartile range) represents the spread of the middle 50% of the data. A taller box indicates greater variability in TMDb scores for that type.
Outliers: Points plotted outside the whiskers represent potential outliers, which are TMDb scores that are unusually high or low compared to the rest of the data for that type.
Overall Distribution: The shape and position of the box and whiskers provide a visual representation of the overall distribution of TMDb scores for each type.
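
The quantities a box plot draws (median, interquartile range, and points beyond 1.5 times the IQR) can also be computed directly. This is a minimal sketch on made-up scores; `iqr_summary` is a hypothetical helper name.

```python
import pandas as pd


def iqr_summary(scores: pd.Series) -> pd.Series:
    """Median, IQR, whisker limits (1.5 * IQR rule), and outlier count."""
    q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((scores < lower) | (scores > upper)).sum())
    return pd.Series({"median": median, "iqr": iqr,
                      "lower": lower, "upper": upper, "outliers": n_outliers})


# Toy scores (made up for illustration)
toy = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "tmdb_score": [6.0, 6.5, 7.0, 7.5, 1.0, 7.0, 7.5, 8.0, 8.5, 9.0],
})

# One summary row per content type
summary = pd.DataFrame(
    {name: iqr_summary(group) for name, group in toy.groupby("type")["tmdb_score"]}
).T
print(summary)
```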

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impacts:

Content Strategy: The insights from the box plot can inform content strategy decisions. For example, if MOVIES tend to have higher TMDb scores, a streaming service might prioritize acquiring or producing more movies to attract viewers.
Content Recommendation: Understanding the typical TMDb scores for different types can help with content recommendation systems. The platform could suggest content to users based on their preferences for SHOWs or MOVIES with certain score ranges.
Marketing and Promotion: The insights can be used to tailor marketing and promotion efforts. For instance, highlighting the high TMDb scores of specific SHOWs or MOVIES can attract more viewers.

Negative Business Impacts:

Oversimplification: The box plot focuses solely on TMDb scores and type. Other factors, such as genre, release year, and cast, also influence viewer preferences. Relying solely on TMDb scores for decision-making could lead to an incomplete picture.
Misinterpretation: Outliers can sometimes distort the interpretation of the box plot. It's essential to investigate the reasons behind outliers before drawing conclusions.
Limited Actionability: While the box plot provides insights into differences in TMDb scores, it doesn't directly suggest specific actions. Further analysis and consideration of other factors are necessary to make informed business decisions.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
titles_df['seasons'].plot(kind='hist', bins=20, title='seasons')
# Hide the top and right spines for a cleaner look
plt.gca().spines[['top', 'right']].set_visible(False)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is a great choice for visualizing the distribution of a single numerical variable, in this case, the number of seasons for TV shows in the titles_df DataFrame.
Distribution: Histograms effectively show the frequency or count of data points within specific ranges (bins). It gives you a sense of the typical number of seasons, the spread of the data, and potential outliers.
Easy Interpretation: Histograms are visually intuitive and easy for most audiences to understand. The x-axis represents the number of seasons, and the y-axis shows how many TV shows fall into each season range.
The code creates a histogram with 20 bins, giving a reasonable level of detail to the distribution. Removing the top and right spines with plt.gca().spines[['top', 'right']].set_visible(False) is a stylistic choice that often makes the chart look cleaner.

##### 2. What is/are the insight(s) found from the chart?


Most Common Season Length: The tallest bar(s) in the histogram will indicate the most frequent number of seasons for TV shows in the dataset.
Distribution Skew: Is the distribution skewed to the right (most shows have few seasons, with a long tail of shows with many seasons) or to the left (most shows have many seasons)?
Outliers: Are there any unusually long or short TV shows in terms of the number of seasons? These might be represented by bars that are far away from the main cluster of data.
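
The three questions above can also be answered numerically. The sketch below uses a made-up series of season counts and computes the most common value, the skewness, and the share of shows with one to three seasons.

```python
import pandas as pd

# Toy season counts (made up for illustration)
seasons = pd.Series([1, 1, 1, 1, 2, 2, 3, 4, 12], name="seasons")

most_common = seasons.mode().iloc[0]                      # tallest bar in the histogram
skewness = seasons.skew()                                 # > 0 means a long tail toward many seasons
share_1_to_3 = ((seasons >= 1) & (seasons <= 3)).mean()   # fraction of shows with 1-3 seasons

print(most_common, skewness, share_1_to_3)
```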

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Strategy: A streaming service could use this information to understand the typical lifespan of TV shows and plan their content acquisition or production accordingly. For example, if most shows have only 1-3 seasons, they might focus on shorter series.
User Engagement: Understanding season length preferences can help personalize recommendations for users. If a user enjoys a show with 5 seasons, the platform might suggest other shows with a similar number of seasons.
Marketing & Promotion: The insights can be used to target marketing efforts. For instance, highlighting the popularity of shows with a specific number of seasons could attract viewers who prefer that format.

Negative Impacts:

Misinterpretation: It's important to avoid oversimplifying the data. The histogram only shows season length; other factors like genre, ratings, and cast also influence a show's success.
Limited Actionability: While the histogram provides valuable information, it doesn't directly suggest specific actions. Further analysis and consideration of other factors are necessary for informed decision-making.
Content Bias: Focusing solely on season length could lead to content bias, neglecting shows with unique or unconventional formats.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
titles_df = pd.read_csv('titles.csv.zip')

def _plot_series(series, series_name, series_index=0):
  # Draw one line of seasons against runtime for a single content type
  palette = list(sns.palettes.mpl_palette('Dark2'))
  xs = series['runtime']
  ys = series['seasons']

  plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

fig, ax = plt.subplots(figsize=(10, 6), layout='constrained')
df_sorted = titles_df.sort_values('runtime', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('type')):
  _plot_series(series, series_name, i)
# Add the legend once, after all lines are drawn
fig.legend(title='type', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('runtime')
plt.ylabel('seasons')
plt.show()

##### 1. Why did you pick the specific chart?

The code creates a line plot visualizing the relationship between runtime and seasons for different types (likely 'SHOW' and 'MOVIE') within the dataset.
Relationship Visualization: Line plots are effective for displaying the relationship between two continuous variables. In this case, it shows how the number of seasons (seasons) might change with the runtime (runtime) of a show or movie.
Comparison across Categories: By using different colors for each type, the plot allows for easy comparison of this relationship between shows and movies. This can help identify patterns or differences in how runtime and seasons are related for different content types.
Trend Identification: Line plots can reveal trends, such as whether there's a positive or negative correlation between runtime and seasons for a specific content type.

##### 2. What is/are the insight(s) found from the chart?

Correlation: Observe the direction of the lines. An upward trend suggests a positive correlation (longer runtime associated with more seasons), while a downward trend indicates a negative correlation.
Differences between Types: Compare the lines for 'SHOW' and 'MOVIE'. This could highlight how runtime and seasons relate differently for these content types.
Clusters or Outliers: Look for any clusters of data points or any points that deviate significantly from the general trend. These could represent interesting patterns or outliers worth further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Planning: Understanding the relationship between runtime and seasons can guide content planning decisions. For example, if longer runtimes are associated with more seasons for shows, a streaming service might consider investing in shows with longer episodes to encourage multi-season engagement.
User Recommendations: The insights could enhance recommendation systems. If a user enjoys a show with a specific runtime and season count, the platform could suggest similar shows with comparable characteristics.
Content Acquisition: Streaming services can use this information to make more informed decisions about acquiring or producing content. They might prioritize shows or movies with runtimes and season lengths that align with viewer preferences.

Negative Impacts:

Oversimplification: The plot focuses on just two variables (runtime and seasons). Other factors like genre, cast, and plot significantly influence viewer engagement. Decisions based solely on this relationship could be misleading.
Misinterpretation: It's crucial to avoid drawing overly strong conclusions about causality. The plot shows a correlation, but it doesn't necessarily imply that runtime directly causes a certain number of seasons.
Limited Actionability: While the plot provides insights, it doesn't offer specific recommendations. Further analysis and consideration of other factors are needed for actionable strategies.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
from matplotlib import pyplot as plt
titles_df['seasons'].plot(kind='line', figsize=(8, 4), title='seasons')
plt.gca().spines[['top', 'right']].set_visible(False)

##### 1. Why did you pick the specific chart?

I chose a line plot for this visualization due to the following reasons:

Data Structure: A line plot is suitable for visualizing the trend or pattern of a single variable (in this case, 'seasons') over its index (likely the show IDs, or a chronological order if 'titles_df' is sorted by time).
Trend Analysis: Line plots are excellent for showing how a variable changes over time or across a sequence. It helps in understanding the general trend, whether the number of seasons is increasing, decreasing, or remaining relatively constant.
Simplicity and Clarity: Line plots are easy to understand and convey information clearly, making it suitable for a quick overview of the data.

##### 2. What is/are the insight(s) found from the chart?

The line plot of 'seasons' would likely reveal the following insights:

Trend of Seasons: It shows how the number of seasons for shows in the dataset is distributed over the index (show IDs or time). You can observe if there's an overall increase, decrease, or a cyclical pattern in the number of seasons.
Outliers: Any unusual spikes or drops in the line could represent shows with significantly more or fewer seasons compared to the general trend. These might be outliers worth further investigation.
Seasonality: If there is a time component to your index (e.g., release year), the line plot might show patterns or cycles in the number of seasons released over time.
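
Because the row index carries no guaranteed ordering, a frequency table and a per-year summary can be easier to read than the raw line. The following is a minimal sketch on a made-up frame `toy` (hypothetical values) that stands in for `titles_df`; it assumes the `release_year` and `seasons` columns.

```python
import pandas as pd

# Hypothetical sample; rows without a season count mimic movies.
toy = pd.DataFrame({
    "release_year": [2018, 2018, 2019, 2019, 2020, 2020],
    "seasons": [1, 3, 2, None, 1, 1],
})

# Keep only rows that actually have a season count.
shows = toy.dropna(subset=["seasons"])

# How many shows have 1, 2, 3, ... seasons.
season_counts = shows["seasons"].astype(int).value_counts().sort_index()

# Typical season count per release year, which gives the index a time meaning.
median_by_year = shows.groupby("release_year")["seasons"].median()

print(season_counts)
print(median_by_year)
```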

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Strategy: Understanding the trend of seasons can inform content strategy decisions. If there's a growing preference for shows with fewer seasons, streaming services might consider acquiring or producing more limited series.
User Engagement: The insights can help personalize recommendations for users. If a user enjoys a show with a specific number of seasons, the platform could suggest other shows with a similar season count.

Negative Impacts:

Oversimplification: The line plot focuses only on the 'seasons' variable. Other factors like genre, ratings, and cast also influence a show's success. Decisions based solely on the number of seasons might not be comprehensive.
Misinterpretation: It's essential to avoid assuming causality. The line plot shows a trend but doesn't necessarily imply that the number of seasons directly influences viewership or success.
Limited Actionability: The insights, while valuable, may not provide specific action steps. Further analysis and consideration of other factors are necessary for making business decisions.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
titles_df['release_year'].plot(kind='line', figsize=(8, 4), title='release_year')
plt.gca().spines[['top', 'right']].set_visible(False)

##### 1. Why did you pick the specific chart?

I chose a line plot for this visualization because it shows how a single variable changes across a sequence. In this case, it visualizes how the release_year of titles changes over the DataFrame index (most likely the implicit row order of titles_df, which is chronological only if the rows are sorted by date).

Trend Visualization: Line plots are great for displaying trends and patterns over time, making it easy to see if the release years are generally increasing, decreasing, or staying consistent.
Data Structure: Line plots are well-suited for visualizing how a numerical variable (release_year) changes across a sequence or over an index.
Simplicity and Clarity: Line plots are visually intuitive and easy to understand, providing a clear overview of the data.

##### 2. What is/are the insight(s) found from the chart?

The line plot of 'release_year' would likely reveal the following insights:

Trend of Release Years: It shows how the release years of titles in the dataset are distributed. You can observe if there's an overall increase (more recent releases), decrease (older releases), or a cyclical pattern in the release years.
Outliers: Any unusual spikes or drops in the line could represent titles with release years that significantly deviate from the general trend. These might be outliers worth further investigation.
Release Patterns: You might see patterns or cycles in the release years, indicating periods with more or fewer title releases. This could be due to various factors like industry trends or historical events.
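
One way to make release patterns explicit is to count titles per year and per decade. This is a minimal sketch on a made-up series (the values are hypothetical); in the notebook the input would be `titles_df['release_year']`.

```python
import pandas as pd

# Hypothetical release years standing in for titles_df['release_year'].
release_year = pd.Series([1999, 2015, 2015, 2020, 2020, 2020], name="release_year")

# Number of titles released in each year, in chronological order.
titles_per_year = release_year.value_counts().sort_index()

# Same count, aggregated by decade (1999 -> 1990, 2015 -> 2010, ...).
titles_per_decade = (release_year // 10 * 10).value_counts().sort_index()

print(titles_per_year)
print(titles_per_decade)
```

Plotting `titles_per_year` as a bar or line chart would then show periods with more or fewer releases directly, independent of the row order.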

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Acquisition: Streaming services can use the trend of release years to make decisions about acquiring older or newer content. If there's a growing demand for classic titles, they might focus on acquiring older films and shows.
Content Recommendation: Understanding release year patterns could help personalize recommendations. If a user enjoys movies from a specific era, the platform could suggest similar titles released around the same time.
Content Strategy: Insights into release trends can inform content strategy. If there's a gap in releases from a certain period, streaming services might consider producing content to fill that niche.

Negative Impacts:

Oversimplification: The line plot focuses solely on the 'release_year' variable. Other factors like genre, popularity, and critical reception also influence a title's success. Decisions based solely on release year might not be comprehensive.
Misinterpretation: It's essential to avoid assuming causality. The line plot shows a trend but doesn't necessarily imply that release year directly influences viewership or success.
Limited Actionability: While the insights are valuable, they may not directly translate into specific actions. Further analysis and consideration of other factors are necessary for making informed business decisions.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
titles_df = pd.read_csv('titles.csv.zip')
figsize = (12, 1.2 * len(titles_df['type'].unique()))
plt.figure(figsize=figsize)
sns.violinplot(titles_df, x='runtime', y='type', inner='box', palette='Dark2')
sns.despine(top=True, right=True, bottom=True, left=True)

##### 1. Why did you pick the specific chart?

A violin plot is chosen here to visualize the distribution of the runtime variable for different type categories (likely "SHOW" and "MOVIE") in the dataset. Here's why it's a suitable choice:

Distribution Visualization: Violin plots are excellent for showing the distribution of a numerical variable (runtime) across different categories (type). They provide a richer view of the data compared to box plots by displaying the density of data points at different values.
Comparison: Violin plots make it easy to compare the distributions of runtime between different types. You can see if one type tends to have longer or shorter runtimes, as well as the overall shape and spread of the distributions.
Density Estimation: The shape of the violin plot represents the kernel density estimate of the data, giving you a sense of where data points are concentrated and where they are sparse.
Box Plot Integration: The inner='box' argument draws a miniature box plot inside each violin, adding the median, the quartiles, and the whiskers to the density shape.


##### 2. What is/are the insight(s) found from the chart?


By examining the violin plot, you can gain insights like:

Runtime Distribution by Type: You can see how the runtime is distributed for each type (SHOW or MOVIE). Are runtimes concentrated around a particular value, or are they more spread out? Are there multiple peaks in the distribution, suggesting different typical runtimes?
Comparison between Types: Compare the shapes and positions of the violins for different types. Does one type tend to have longer or shorter runtimes? Are the distributions similar or different in terms of their spread and central tendency?
Median and Quartiles: The miniature box plot inside the violin marks the median (central marker), the quartiles (box edges), and the whiskers. Unlike a standalone box plot, it does not draw individual outlier points; extreme runtimes show up instead as the long, thin tails of the violin. This gives a quick overview of the typical runtime and its variability for each type.
Density: The width of the violin at different runtime values indicates the density of data points. Wider sections represent higher concentrations of data, while narrower sections indicate lower density.
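
The quantities shown by the inner box can also be tabulated directly. The sketch below uses a made-up frame `toy` (hypothetical values) in place of `titles_df` and assumes the `type` and `runtime` columns.

```python
import pandas as pd

# Hypothetical runtimes in minutes for two content types.
toy = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "runtime": [80, 90, 100, 110, 120, 20, 25, 30, 45, 60],
})

# Quartiles of runtime per type, one row per type.
summary = toy.groupby("type")["runtime"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q1", "median", "q3"]

# Interquartile range: the length of the box drawn inside each violin.
summary["iqr"] = summary["q3"] - summary["q1"]

print(summary)
```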

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Strategy: Understanding how runtime varies for different content types can help streaming services make informed decisions about what types of content to acquire or produce. For example, if movies tend to have longer runtimes and shows have shorter ones, they might adjust their content mix accordingly.
User Recommendations: Violin plots can provide insights that can be used to improve recommendation systems. By understanding the typical runtime preferences for different types of viewers, streaming services can recommend content that better aligns with their preferences.
Marketing and Promotion: The insights from the violin plot can inform marketing strategies. For instance, if a particular type of content has a wide range of runtimes, marketing materials could highlight the variety of options available to viewers.

Negative Impacts:

Oversimplification: Violin plots, like any visualization, can oversimplify complex data. Runtime is just one factor that influences viewer preferences, and other elements like genre, plot, and cast also play significant roles.
Misinterpretation: Without careful interpretation, it's possible to misinterpret the information presented in the violin plot. Understanding the concepts of density estimation, median, quartiles, and outliers is crucial for drawing accurate conclusions.
Limited Actionability: While violin plots provide insights, they don't always offer specific recommendations. Further analysis and consideration of other factors are necessary for making actionable business decisions.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

titles_df = pd.read_csv('titles.csv.zip')

def _plot_series(series, series_name, series_index=0):
  palette = list(sns.palettes.mpl_palette('Dark2'))
  xs = series['runtime']
  ys = series['imdb_score']

  plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

fig, ax = plt.subplots(figsize=(10, 5.2), layout='constrained')
df_sorted = titles_df.sort_values('runtime', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('type')):
  _plot_series(series, series_name, i)
# Draw the legend once, after every series has been plotted
fig.legend(title='type', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('runtime')
_ = plt.ylabel('imdb_score')

##### 1. Why did you pick the specific chart?

The code creates a line plot visualizing the relationship between runtime and IMDb score for different types within the dataset. Here's why it's a suitable choice:

Relationship Visualization: Line plots are effective for displaying the relationship between two continuous variables. In this case, it shows how the IMDb score (imdb_score) might change with the runtime (runtime) of a show or movie.
Comparison across Categories: By using different colors for each type, the plot allows for easy comparison of this relationship between shows and movies. This can help identify patterns or differences in how runtime and IMDb score are related for different content types.
Trend Identification: Line plots can reveal trends, such as whether there's a positive or negative correlation between runtime and IMDb score for a specific content type.

##### 2. What is/are the insight(s) found from the chart?

Correlation: Observe the direction of the lines. An upward trend suggests a positive correlation (longer runtime associated with higher IMDb scores), while a downward trend indicates a negative correlation.
Differences between Types: Compare the lines for 'SHOW' and 'MOVIE'. This could highlight how runtime and IMDb score relate differently for these content types. For example, do longer movies tend to have higher IMDb scores compared to longer shows?
Clusters or Outliers: Look for any clusters of data points or any points that deviate significantly from the general trend. These could represent interesting patterns or outliers worth further investigation.
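
Since a line through individual titles tends to zigzag, averaging the score within runtime bands can make the trend easier to judge. This is a minimal sketch on made-up rows; the `toy` frame, the bin edges, and the bin labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical titles; values are made up.
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "runtime": [85, 95, 130, 140, 25, 50],
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 7.5, 6.5],
})

# Assumed runtime bands in minutes: (0, 60], (60, 120], (120, 180].
toy["runtime_bin"] = pd.cut(
    toy["runtime"], bins=[0, 60, 120, 180], labels=["<=60", "61-120", "121-180"]
)

# Mean IMDb score per runtime band, with one column per content type.
mean_score = (
    toy.groupby(["type", "runtime_bin"], observed=True)["imdb_score"]
    .mean()
    .unstack("type")
)

print(mean_score)
```

Bands with no titles of a given type appear as NaN, which is itself a useful reminder of where the data is thin.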

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Acquisition and Production: Streaming services can use this information to make decisions about acquiring or producing content. If longer runtimes are associated with higher IMDb scores for a particular type, they might prioritize content with those characteristics.
User Recommendations: Insights into the relationship between runtime and IMDb score could improve recommendation systems. If a user enjoys shows or movies with specific runtime and IMDb score ranges, the platform can suggest similar titles.
Content Programming: Understanding how runtime and IMDb score are related for different types can help with content programming decisions. This information could influence how content is scheduled or presented to viewers.

Negative Impacts:

Oversimplification: The plot focuses on just two variables (runtime and IMDb score). Many other factors, like genre, cast, and plot, influence viewer preferences and critical reception.
Misinterpretation: It's important to avoid drawing overly strong conclusions about causality. The plot shows a correlation, but it doesn't necessarily imply that runtime directly causes a certain IMDb score.
Limited Actionability: While the plot provides insights, it doesn't offer specific recommendations. Further analysis and consideration of other factors are needed for actionable strategies.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
titles_df = pd.read_csv('titles.csv.zip')
titles_df['imdb_score'].plot(kind='line', figsize=(8, 4), title='imdb_score')
plt.gca().spines[['top', 'right']].set_visible(False)

##### 1. Why did you pick the specific chart?

A line plot was chosen to visualize the 'imdb_score' because:

Trend Visualization: Line plots are excellent for displaying trends and patterns over time or across a sequence. In this case, it shows how the IMDb scores of titles in the dataset change over the index (which is likely an implicit index representing the order of titles in the DataFrame).
Data Structure: It's suitable for visualizing how a numerical variable (imdb_score) changes across a sequence.
Simplicity and Clarity: Line plots are generally easy to understand, providing a clear overview of the data.

##### 2. What is/are the insight(s) found from the chart?

Trend of IMDb Scores: It shows how the IMDb scores of titles in the dataset are distributed. You might observe an overall increase (higher scores over time), decrease (lower scores over time), or fluctuations in the scores.
Outliers: Any unusual spikes or drops in the line could represent titles with IMDb scores that significantly deviate from the general trend. These might be exceptionally well-received or poorly-received titles.
Score Patterns: You might see patterns or cycles in the IMDb scores, indicating periods with higher or lower average scores. This could reflect changes in content quality or audience preferences over time.
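
The spikes and drops mentioned above can be listed by name rather than read off the line. A minimal sketch, assuming a `title` column alongside `imdb_score`; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical titles and scores.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "imdb_score": [6.1, 9.0, 2.3, 7.4, 5.0],
})

# Drop unrated titles, then pick the extremes at both ends.
rated = toy.dropna(subset=["imdb_score"])
best = rated.nlargest(2, "imdb_score")
worst = rated.nsmallest(2, "imdb_score")

print(best)
print(worst)
```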

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Acquisition: Streaming services could use the trend of IMDb scores to make decisions about acquiring content. If higher scores are associated with specific genres or release years, they might prioritize acquiring titles with those characteristics.
Content Recommendation: The insights could enhance recommendation systems. If a user enjoys titles with high IMDb scores, the platform could suggest similar titles with comparable scores.
Content Strategy: Understanding score patterns can inform content strategy. If there's a decline in average scores, streaming services might investigate the reasons and adjust their content production or acquisition strategies.

Negative Impacts:

Oversimplification: The line plot focuses solely on the 'imdb_score' variable. Other factors like genre, popularity, and critical reception also influence a title's success. Decisions based solely on IMDb scores might not be comprehensive.
Misinterpretation: It's important to avoid assuming causality. The line plot shows a trend but doesn't necessarily imply that IMDb scores directly influence viewership or success.
Limited Actionability: While the insights are valuable, they may not directly translate into specific actions. Further analysis and consideration of other factors are necessary for making informed business decisions.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
titles_df = pd.read_csv('titles.csv.zip')
titles_df.plot(kind='scatter', x='imdb_score', y='imdb_votes', s=32, alpha=.8)
plt.gca().spines[['top', 'right']].set_visible(False)

##### 1. Why did you pick the specific chart?

A scatter plot was chosen to visualize the relationship between 'imdb_score' and 'imdb_votes' because:

Relationship Visualization: Scatter plots are ideal for showing the relationship between two numerical variables. In this case, it helps you see how the IMDb score and the number of IMDb votes for titles in your dataset are related.
Pattern Identification: They allow you to identify patterns, trends, clusters, or outliers in the data. You can observe if there's a correlation (positive, negative, or none) between the two variables.
Data Distribution: Scatter plots provide a visual representation of the distribution of data points across the two variables. You can see how densely or sparsely the points are scattered.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot could reveal the following insights:

Correlation: You can observe if there's a correlation between IMDb score and IMDb votes. A positive correlation would suggest that titles with higher scores tend to have more votes, while a negative correlation would indicate the opposite.
Clusters: You might identify clusters of titles with similar scores and vote counts. These clusters could represent specific genres, release years, or other characteristics.
Outliers: Titles that fall far away from the main cluster of points could be outliers. These outliers might be exceptionally popular or unpopular titles that deserve further investigation.
Data Density: The density of points in different areas of the plot can indicate where most titles fall in terms of score and votes.
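
Vote counts usually span several orders of magnitude, so a plain Pearson correlation can understate a monotonic relationship. The sketch below compares three measures on made-up rows; the `toy` frame is hypothetical, and the rank-based figure is computed by ranking both columns first (equivalent to a Spearman correlation).

```python
import numpy as np
import pandas as pd

# Hypothetical scores and vote counts; the last row has no score.
toy = pd.DataFrame({
    "imdb_score": [4.0, 5.5, 6.0, 7.0, 8.5, None],
    "imdb_votes": [50, 300, 1_000, 20_000, 500_000, 10],
})

# Use only rows where both values are present.
clean = toy.dropna(subset=["imdb_score", "imdb_votes"])

# Pearson correlation on the raw vote counts.
pearson_raw = clean["imdb_score"].corr(clean["imdb_votes"])

# Pearson correlation after a log transform of the votes.
pearson_log = clean["imdb_score"].corr(np.log10(clean["imdb_votes"]))

# Rank-based (Spearman-style) correlation: Pearson on the ranks.
spearman = clean["imdb_score"].rank().corr(clean["imdb_votes"].rank())

print(pearson_raw, pearson_log, spearman)
```

In the same spirit, drawing the scatter plot with a logarithmic y-axis may spread out the dense cluster of low-vote titles.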

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Recommendation: Understanding the relationship between score and votes can enhance recommendation systems. Platforms could suggest titles with similar scores and vote counts to users who have enjoyed a particular title.
Content Acquisition: Streaming services could use this information to make decisions about acquiring content. If higher scores are associated with more votes, they might prioritize acquiring titles with high scores and potential for popularity.
Content Strategy: The insights could inform content strategy. If there's a correlation between score and votes, platforms might focus on producing high-quality content that is likely to attract more viewers and votes.

Negative Impacts:

Oversimplification: The scatter plot focuses only on two variables. Other factors like genre, release year, and critical reception also influence a title's success. Decisions based solely on IMDb score and votes might not be comprehensive.
Misinterpretation: It's crucial to avoid assuming causality. Correlation does not imply causation. A positive correlation between score and votes does not necessarily mean that higher scores directly cause more votes.
Limited Actionability: While the scatter plot provides valuable insights, it doesn't offer specific recommendations. Further analysis and consideration of other factors are needed for actionable strategies.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
titles_df = pd.read_csv('titles.csv.zip')

# Filter for movies and shows with age certifications and runtimes
filtered_df = titles_df[titles_df['age_certification'].notna() & titles_df['runtime'].notna()]

# Create the box plot
plt.figure(figsize=(20, 10))
sns.boxplot(x='age_certification', y='runtime', data=filtered_df)
plt.title('Runtime Distribution by Age Certification')
plt.xlabel('Age Certification')
plt.ylabel('Runtime (minutes)')
plt.xticks(rotation=45, ha='right')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is chosen for this visualization because it effectively displays the distribution of a numerical variable (runtime) across different categories (age certification).
Distribution Comparison: Box plots are excellent for comparing the distributions of a numerical variable across different categories. They provide a clear visual representation of the central tendency, spread, and potential outliers for each category.
Identifying Differences: Box plots make it easy to identify differences in the typical runtime (median) and the variability of runtime (interquartile range) for different age certifications.
Outlier Detection: Box plots clearly highlight potential outliers – runtimes that are unusually long or short for a given age certification.
Easy Interpretation: Box plots are generally easy to understand and interpret, making them suitable for a wide audience.

##### 2. What is/are the insight(s) found from the chart?

The insights you'll gain from the box plot depend on the specific data and the patterns observed in the boxes, whiskers, and outliers.
Median Runtime: The line inside each box represents the median runtime for that age certification. Comparing the medians across different age certifications can reveal differences in typical runtimes.
Runtime Variability: The height of the box (interquartile range) represents the spread or variability of runtimes within each age certification category. A taller box indicates greater variability.
Outliers: Points plotted outside the whiskers represent potential outliers – runtimes that are unusually long or short for a given age certification. These outliers might warrant further investigation.
Overall Distribution: The shape and position of the box and whiskers provide a visual representation of the overall distribution of runtimes for each age certification.

Example Insights (hypothetical, based on typical movie data):

You might observe that movies rated 'R' tend to have longer median runtimes compared to movies rated 'PG' or 'G'.
You might find that movies rated 'PG-13' have a wider range of runtimes (larger interquartile range) compared to movies rated 'G', indicating greater variability in runtime for PG-13 movies.
You might see some outliers in the 'G' category, representing movies that are unusually long for a 'G' rating.
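
The points drawn beyond the whiskers follow the 1.5 × IQR rule that box plots use by default, so they can be reproduced in a table. This is a minimal sketch on made-up rows; the `toy` frame and its values are hypothetical, and it assumes the `age_certification` and `runtime` columns.

```python
import pandas as pd

# Hypothetical runtimes in minutes for two certifications.
toy = pd.DataFrame({
    "age_certification": ["G"] * 5 + ["R"] * 5,
    "runtime": [80, 85, 90, 95, 200, 100, 110, 120, 130, 140],
})

grouped = toy.groupby("age_certification")["runtime"]

# Median runtime per certification (the line inside each box).
medians = grouped.median()

# Per-row quartiles of the row's own certification group.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
fence = 1.5 * (q3 - q1)

# A row is an outlier if it lies beyond the whisker limits of its group.
toy["is_outlier"] = (toy["runtime"] < q1 - fence) | (toy["runtime"] > q3 + fence)
outliers = toy[toy["is_outlier"]]

print(medians)
print(outliers)
```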

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:

Content Strategy: Streaming services can use the insights from the box plot to understand runtime expectations for different age groups and tailor their content acquisition and production strategies accordingly.
Recommendation Systems: The box plot can help improve recommendation systems by providing insights into runtime preferences based on age certifications.
Targeted Marketing: The insights can be used to tailor marketing efforts. For example, promoting the shorter runtimes of 'G' rated movies to families with young children.
Content Understanding: The box plot helps understand how age certifications might influence the length of movies or shows, providing valuable insights for content categorization and user segmentation.

Negative Impacts:

Oversimplification: While the box plot provides valuable insights, it's important to remember that runtime and age certification are just two factors that influence viewer preferences. Other elements, such as genre, plot, and cast, also play significant roles.
Misinterpretation: Without careful interpretation, it's possible to misinterpret the information presented in the box plot. Understanding the concepts of median, interquartile range, and outliers is crucial for drawing accurate conclusions.
Limited Actionability: While the box plot provides insights, it doesn't always offer specific recommendations. Further analysis and consideration of other factors are necessary for making actionable business decisions.


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
titles_df = pd.read_csv('titles.csv.zip')
credits_df = pd.read_csv('credits.csv.zip')
merged_df = pd.merge(titles_df, credits_df, on='id', how='inner')
# Include numeric_only=True to select only numerical features for correlation
correlation_matrix = merged_df.corr(numeric_only=True)
plt.figure(figsize=(12, 10))  # Adjust figure size as needed
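# Pin the color scale to [-1, 1] so the neutral midpoint corresponds to zero correlation
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)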
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is chosen to visualize the relationships between numerical variables in the merged dataset (merged_df). Here's why it's a suitable choice:

Relationship Visualization: Heatmaps are great for displaying the correlation coefficients between multiple variables in a compact and visually intuitive way.
Pattern Identification: The color-coded cells allow for quick identification of strong positive (red), strong negative (blue), and weak or no correlations (lighter colors).
Overall View: It provides a comprehensive overview of the relationships within the data, making it easier to spot potential patterns and dependencies.

##### 2. What is/are the insight(s) found from the chart?

The insights you'll gain from the correlation heatmap depend on the specific data and the resulting correlation coefficients.
Strong Positive Correlation: Cells with dark red colors indicate a strong positive correlation between two variables. This means that as one variable increases, the other tends to increase as well (e.g., runtime and seasons might have a positive correlation).
Strong Negative Correlation: Cells with dark blue colors indicate a strong negative correlation. This means that as one variable increases, the other tends to decrease (e.g., release year and IMDb score might have a negative correlation if older movies tend to have lower scores).
Weak or No Correlation: Cells with lighter colors or white indicate a weak or no correlation. This means there is little to no relationship between the two variables.

Example Insights (hypothetical, based on typical movie data):

You might observe a positive correlation between imdb_score and tmdb_score, suggesting that movies rated highly on IMDb tend to be rated highly on TMDb as well.
You might find a negative correlation between release_year and runtime, indicating that newer movies might have shorter runtimes compared to older ones.
You might see a weak or no correlation between imdb_votes and tmdb_popularity, meaning the number of votes on IMDb may not strongly influence the popularity of a movie on TMDb.
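
One caveat when reading the heatmap: if the credits table holds one row per credited person (an assumption here, since its columns are not shown in this section), the inner merge repeats each title once per credit, and titles with large casts weigh more in every coefficient. The sketch below illustrates the effect with two small made-up tables; `titles`, `credits`, and `person_id` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical title-level table: one row per title.
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "runtime": [30, 90, 120],
    "imdb_score": [8.0, 6.0, 7.0],
})

# Hypothetical credits table: one row per (title, person); t2 has four credits.
credits = pd.DataFrame({
    "id": ["t1", "t2", "t2", "t2", "t2", "t3"],
    "person_id": [1, 2, 3, 4, 5, 6],
})

merged = pd.merge(titles, credits, on="id", how="inner")

# Correlation with one row per title versus one row per credit.
cols = ["runtime", "imdb_score"]
corr_titles = titles[cols].corr().loc["runtime", "imdb_score"]
corr_merged = merged[cols].corr().loc["runtime", "imdb_score"]

print(len(titles), len(merged))
print(corr_titles, corr_merged)
```

If that assumption holds for the real data, computing the matrix on the title-level table alone (or on the merged frame after dropping duplicate `id` rows) would keep each title's weight equal.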

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
titles_df = pd.read_csv('titles.csv.zip')
credits_df = pd.read_csv('credits.csv.zip')

# Merge the datasets
merged_df = pd.merge(titles_df, credits_df, on='id', how='inner')

# Select numerical features for the pair plot
numerical_features = ['release_year', 'runtime', 'seasons', 'imdb_score',
                      'imdb_votes', 'tmdb_popularity', 'tmdb_score']


# Create the pair plot
sns.pairplot(merged_df[numerical_features])
plt.suptitle('Pair Plot of Numerical Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is chosen to visualize the relationships between numerical variables in the merged dataset (merged_df). Here's why it's a suitable choice:

Relationship and Distribution Visualization: Pair plots combine scatter plots and histograms (or kernel density estimates) to provide a comprehensive view of both relationships and distributions of multiple numerical variables.
Pattern Identification: The scatter plots in the off-diagonal cells allow for quick identification of correlations (positive, negative, or no correlation) between pairs of variables.
Distribution Insights: The histograms or kernel density estimates on the diagonal provide insights into the spread, central tendency, and potential outliers of each individual variable.
Overall Exploration: Pair plots are excellent for exploratory data analysis, helping to gain a quick understanding of the relationships and distributions within a dataset.

##### 2. What is/are the insight(s) found from the chart?

The insights you'll gain from the pair plot depend on the specific data and the patterns observed in the scatter plots and histograms.

Scatter Plots (Off-Diagonal):
Positive Correlation: Points tend to cluster in an upward-sloping pattern. This indicates that as one variable increases, the other tends to increase as well.
Negative Correlation: Points tend to cluster in a downward-sloping pattern. This indicates that as one variable increases, the other tends to decrease.
No Correlation: Points appear randomly scattered, suggesting little to no relationship between the variables.
Histograms/KDEs (Diagonal):
Distribution Shape: The shape of the histogram or KDE gives you an idea of the distribution of the variable (e.g., normal, skewed, uniform).
Central Tendency: The peak of the histogram or KDE represents the most frequent or typical value of the variable.
Spread: The width of the histogram or KDE indicates the variability or spread of the data.
Outliers: Unusual or extreme values that are far away from the main cluster of data points might be outliers.
Example Insights (hypothetical, based on typical movie data):

You might observe a positive correlation between imdb_score and tmdb_score in the corresponding scatter plot, suggesting that movies rated highly on IMDb tend to be rated highly on TMDb as well.
You might find a negative correlation between release_year and runtime in their scatter plot, indicating that newer movies might have shorter runtimes compared to older ones.
The histogram for imdb_votes might be right-skewed, indicating that most movies have a relatively low number of votes, while a few movies have a very high number of votes.
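
Skewness of this kind can be checked numerically before reading it off the diagonal panel. A minimal sketch on a made-up series (hypothetical values) that stands in for `titles_df['imdb_votes']`:

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts: many small values and one very large one.
votes = pd.Series([10, 50, 100, 500, 1_000, 5_000, 100_000], name="imdb_votes")

# Positive skewness indicates a long right tail.
raw_skew = votes.skew()

# A log transform (log1p handles zero counts) pulls the tail in.
log_skew = np.log1p(votes).skew()

# In a right-skewed distribution the mean sits above the median.
mean_votes, median_votes = votes.mean(), votes.median()

print(raw_skew, log_skew)
print(mean_votes, median_votes)
```

When the raw skewness is large, plotting the log-transformed column in the pair plot may reveal structure that the raw scale hides.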

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.

Based on the insights derived from analyzing the Amazon titles and credits datasets, we recommend a multi-faceted approach to increase user engagement and subscriber retention.

First, refine content acquisition and production strategies by prioritizing high-performing genres, addressing content gaps, and optimizing content length based on viewer preferences and runtime trends. This will help maintain a diverse and engaging content library that caters to a broad audience.

Second, enhance personalization by incorporating insights into user preferences for genre, runtime, cast, and other relevant factors into recommendation algorithms. This will deliver tailored content suggestions, increasing user satisfaction and driving content discovery.

Third, optimize programming and scheduling by considering viewer behavior and preferences when organizing content releases and promotions. This includes promoting binge-worthy content, offering curated collections, and using A/B testing to identify optimal scheduling strategies.

Fourth, strengthen targeted marketing by segmenting audiences based on preferences and personalizing marketing messages to match individual interests. Use social media and influencers to reach target demographics, and use data-driven insights to optimize campaign performance.

Finally, establish a culture of ongoing monitoring and analysis: continuously track viewer engagement metrics, refresh data insights, and keep decision-making data-driven across all departments.

By implementing these strategies, Amazon can use data insights to optimize content offerings, enhance personalization, and strengthen marketing efforts, leading to increased user engagement, subscriber retention, and a sustained competitive advantage in the streaming entertainment market.

# **Conclusion**

This Exploratory Data Analysis of Amazon titles has revealed valuable insights into content trends, viewer preferences, and areas for potential business optimization. By systematically analyzing the datasets and employing various visualization techniques, we have identified key trends in runtime, release year, and genre preferences, and examined the relationship between these factors and audience engagement. These insights can help Amazon refine its content acquisition and production strategies, enhance personalization through tailored recommendations, optimize programming and scheduling for maximum viewership, and strengthen targeted marketing efforts. By embracing a data-driven approach and implementing the recommendations outlined, Amazon can strengthen its competitive position, enhance user satisfaction, and drive subscriber growth and retention in the evolving streaming entertainment landscape. This analysis serves as a foundation for informed decision-making and continuous improvement, helping Amazon remain a leader in providing captivating and personalized entertainment experiences for its global audience.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***