<a href="https://colab.research.google.com/github/Sagarjain93/Amazon_prime_eda/blob/main/Amazon_prime_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Amazon Prime TV Shows and Movies**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** SAGAR JAIN

# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/Sagarjain93/Amazon_prime_eda

# **Problem Statement**


This project aims to analyze the content catalog of Amazon Prime Video to uncover actionable insights regarding content types, audience targeting, genre preferences, production trends, and user ratings. By performing exploratory data analysis (EDA) on key metadata attributes, the goal is to support strategic decision-making for content acquisition, audience engagement, and platform optimization.

#### **Define Your Business Objective?**

To derive data-driven insights from Amazon Prime Video’s content metadata that can inform strategic decisions around content acquisition, audience segmentation, and platform optimization. By understanding patterns in genres, content types, age certifications, release trends, ratings, and production origins, the objective is to help the business align its offerings with viewer preferences and improve user engagement.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### **Import Libraries**

In [None]:
# Import Libraries

#Data Manipulation Libraries
import pandas as pd
import numpy as np

#Data Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#Date time libraries
import datetime as dt
from datetime import datetime

#Set consistent theme for plots
sns.set_theme(style='whitegrid')

### **Dataset Loading**

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
titles_df =pd.read_csv('/content/drive/MyDrive/colab/capstone projects/MODULE_2/titles.csv/titles.csv')
credits_df =pd.read_csv('/content/drive/MyDrive/colab/capstone projects/MODULE_2/credits.csv/credits.csv')

### **Dataset First View**

In [None]:
# Viewing first few rows of titles_df
titles_df.head()

In [None]:
# Viewing first few rows of credits_df
credits_df.head()

**Merging The Two Datasets**

In [None]:
# Merge on 'id'
df = pd.merge(credits_df, titles_df, on='id', how='left')

In [None]:
# Preview first few rows of the merged data
df.head()

In [None]:
# Preview last few rows of the merged data
df.tail()
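
A left merge keeps every credit row even when its `id` has no match in the titles table, so it can be worth checking the join before relying on it. The sketch below uses small made-up frames (`credits_demo` and `titles_demo` are hypothetical stand-ins, not the project data) together with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for credits_df and titles_df (hypothetical values)
credits_demo = pd.DataFrame({
    "person_id": [1, 2, 3],
    "id": ["tm1", "tm1", "ts9"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})
titles_demo = pd.DataFrame({
    "id": ["tm1", "tm2"],
    "title": ["Example Movie", "Another Movie"],
})

# validate='many_to_one' raises MergeError if 'id' is not unique in titles_demo;
# indicator=True adds a '_merge' column showing where each row came from
merged_demo = pd.merge(
    credits_demo, titles_demo, on="id", how="left",
    validate="many_to_one", indicator=True,
)

# Credits whose 'id' has no matching title are marked 'left_only'
unmatched = (merged_demo["_merge"] == "left_only").sum()
print(merged_demo)
print("Credits without title metadata:", unmatched)
```

Applied to the real frames, a non-zero `unmatched` count would point to credits that carry no title metadata after the merge.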

### **Dataset Rows & Columns count**

In [None]:
# Dataset Rows & Columns count
df.shape

**Interpretation**

After merging, the dataset has 124,347 rows and 19 columns.

### **Dataset Information**

In [None]:
# Dataset Info
df.info()

**Key Observations:**

**Character column** has missing values likely because it applies only to actors.

**Seasons** has major null values since it’s not relevant for movies.

**age_certification, imdb_score, and tmdb_score** also have significant missing data, which will require imputation or filtering depending on analysis.

Data types look clean and appropriate for most columns.



#### **Duplicate Values**

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

Out of a total of 124,347 rows, the dataset contains 168 completely duplicate rows. These rows have identical values across all 19 columns, which indicates:

They may have been unintentionally introduced during the merge of titles.csv and credits.csv.

These duplicates add no value and could distort analysis
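
One way to check where such duplicates originate is to count them in each input before and after the merge. The sketch below uses made-up frames (hypothetical stand-ins for `credits_df` and `titles_df`); it illustrates that when the title `id` is unique, a left merge carries over duplicates that already exist in the credits table rather than creating new ones.

```python
import pandas as pd

# Toy credits table with one fully duplicated row (hypothetical values)
credits_demo = pd.DataFrame({
    "person_id": [1, 1, 2],
    "id": ["tm1", "tm1", "tm1"],
    "name": ["A", "A", "B"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR"],
})
titles_demo = pd.DataFrame({"id": ["tm1"], "title": ["Example Movie"]})

# Fully duplicated rows in the credits input
dupes_in_credits = credits_demo.duplicated().sum()
# Repeated title ids would multiply rows during the merge
dupes_in_titles = titles_demo["id"].duplicated().sum()

merged_demo = credits_demo.merge(titles_demo, on="id", how="left")
dupes_after_merge = merged_demo.duplicated().sum()

print(dupes_in_credits, dupes_in_titles, dupes_after_merge)
```

Running the same three counts on the real inputs would show whether the duplicate rows predate the merge.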

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
(df.isnull().sum() / len(df) * 100).sort_values(ascending=False).plot(
    kind='bar', figsize=(8,5), color='skyblue', edgecolor='black')
plt.title("Percentage of Missing Values by Column")
plt.ylabel("Percentage")
plt.xticks(rotation=90)
plt.show()


**Key Insights:**

No missing values in:

*person_id, id, name, role, title, type, release_year, runtime, genres, production_countries*

Missing values in :

**character** has 16,307 missing values (~13%)
→ Likely missing for crew roles like directors/writers; this is expected and acceptable.

**description** has 91 missing values (<1%)
→ Negligible; rows can be dropped or filled with 'No description available'.

**age_certification** is missing in 67,640 rows (~54%)
→ Major concern; can be labeled as 'Unknown' or 'Unrated' for analysis.

**seasons** is missing in 116,194 rows (~93%)
→ Expected since movies don’t have seasons; no issue for most analysis.

**imdb_id, imdb_score, imdb_votes** are missing in ~4.5–5% of rows
→ Small portion; can be dropped or imputed using genre/type-wise mean.

**tmdb_popularity** has only 15 missing
→ Can be safely dropped without affecting data quality.

**tmdb_score** missing in 10,265 rows (~8.3%)
→ Moderate; handle by imputation or drop depending on analysis importance.
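
The counts and percentages above can be produced in one table. The helper below is a minimal sketch; `missing_summary` is a hypothetical name and `demo` is made-up data, not the project dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of nulls per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with gaps in two columns (hypothetical values)
demo = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "seasons": [np.nan, np.nan, np.nan, 2.0],
    "imdb_score": [7.1, np.nan, 6.4, 8.0],
})
report = missing_summary(demo)
print(report)
```

Calling `missing_summary(df)` on the merged frame would list only the affected columns, ordered by severity.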


### What did you know about your dataset?

1.The dataset is a merged collection from two CSV files: titles.csv and credits.csv, combining metadata of movies/shows with detailed cast and crew information.

2.It contains 124,347 rows and 19 columns, representing individual credit entries (actors, directors, etc.) linked to content titles.

3.Columns include metadata like title, type, release_year, runtime, genres, and performance metrics like imdb_score, tmdb_score, and popularity.

4.Title IDs repeat across rows by design, since each row is one credit; separately, 168 fully duplicated rows were identified (not yet removed).

5.A few columns, especially seasons, age_certification, and tmdb_score, have significant missing values, which will require cleaning or handling.

6.The seasons column has over 93% missing values, expected because it only applies to shows.

7.The dataset offers strong potential for EDA, content-type analysis, actor/director influence studies, and rating-based insights.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().round()

### Variables Description

person_id – Unique identifier for each person (actor, director, etc.) in the credits data.

id – Unique identifier for each title; links the credits and titles datasets.

name – Name of the person involved in the title.

character – Name of the character played by the actor (blank for directors, etc.).

role – The role of the person in the production (e.g., ACTOR, DIRECTOR).

title – The name of the movie or TV show.

type – Type of content: either MOVIE or SHOW.

description – A brief summary or plot description of the content.

release_year – The year the content was released.

age_certification – Age rating classification (e.g., PG-13, R, TV-MA).

runtime – Duration of the content in minutes.

genres – Genres assigned to the title (e.g., action, drama, comedy).

production_countries – Country or countries where the content was produced.

seasons – Number of seasons (only applicable to TV shows).

imdb_id – Unique identifier for the content on IMDb.

imdb_score – IMDb rating of the content (scale of 1 to 10).

imdb_votes – Number of user votes for the title on IMDb.

tmdb_popularity – Popularity metric from The Movie Database (TMDb).

tmdb_score – Viewer rating score from TMDb (similar to IMDb score).



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

**High Cardinality Columns (many unique values):**

person_id: 80,508 → large number of unique individuals.

name: 79,758 → diverse set of people (actors, directors, etc.).

id: 8,861 → number of distinct titles.

title: 8,748 → almost all titles are uniquely named.

character: 71,097 → diverse character names across shows/movies.

imdb_id: 8,267 → mostly unique identifiers per title.

description: 8,833 → most content has a unique description.

tmdb_popularity: 5,267, imdb_votes: 3,623 → lots of variability in popularity and votes.

genres: 1,965 → combinations of genres (e.g., ['drama', 'action']).

**Moderate Cardinality Columns:**

runtime: 204 → wide range of runtimes (from short films to long movies/shows).

release_year: 110 → content spans ~110 years (e.g., 1912–2022).

production_countries: 482 → many country codes, possibly includes co-productions.

**Low Cardinality Columns (useful for categorization/grouping):**

type: 2 → MOVIE, SHOW.

role: 2 → typically ACTOR and DIRECTOR.

seasons: 30 → applicable only to TV shows.

age_certification: 11 → manageable number of age ratings.

imdb_score: 85, tmdb_score: 88 → ratings from 1 to 10, in decimal steps (e.g., 6.5, 7.0).



## 3. ***Data Wrangling***

## **3.1 Handling Duplicates**

### **3.1.1 Checking Duplicates**

In [None]:
# check duplicate rows
df.duplicated().sum()

The output above shows 168 duplicate rows. These need to be removed from the dataset so they do not distort the analysis and visualizations.

### **3.1.2 Removing Duplicate rows**

In [None]:
# Remove duplicated rows
df.drop_duplicates(inplace=True)

We have removed the duplicate rows in the dataset.

## **3.2 Handling Missing/Null Values**

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Fill description with placeholder
# fillna returns a new Series, so assign it back to persist the change
df['description'] = df['description'].fillna("No description available")

In [None]:
# Fill age_certification with 'Unknown'
# fillna returns a new Series, so assign it back to persist the change
df['age_certification'] = df['age_certification'].fillna("Unknown")

In [None]:
# Fill seasons: 0 for movies (since they don't have seasons)
# Vectorized mask instead of a row-wise apply; shows with missing seasons are left as NaN
df.loc[(df['type'] == 'MOVIE') & df['seasons'].isnull(), 'seasons'] = 0

In [None]:
# Optional: Drop rows with missing imdb_id (or keep, based on analysis)
df.dropna(subset=['imdb_id'], inplace=True)

In [None]:
# Fill missing imdb_score, imdb_votes, tmdb_score with median values
# (fillna returns a new Series, so each result is assigned back)
for col in ['imdb_score', 'imdb_votes', 'tmdb_score']:
    df[col] = df[col].fillna(df[col].median())

In [None]:
# Drop 15 rows with missing tmdb_popularity
df.dropna(subset=['tmdb_popularity'], inplace=True)
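
`Series.fillna` returns a new Series rather than modifying the column, so a fill only persists when the result is assigned back. A minimal sketch on made-up data (the `demo` frame is hypothetical):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"age_certification": ["R", np.nan, "PG"]})

# Without assignment the result is discarded and the column is unchanged
demo["age_certification"].fillna("Unknown")
still_missing = demo["age_certification"].isnull().sum()

# Assigning the result back persists the fill
demo["age_certification"] = demo["age_certification"].fillna("Unknown")
missing_after = demo["age_certification"].isnull().sum()

print(still_missing, missing_after)
```

Re-running `df.isnull().sum()` after the fill cells is a quick way to confirm that the handled columns no longer report nulls.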

## **3.3 Handling Datatypes**

### **3.3.1 Checking Datatypes**

In [None]:
#checking datatypes
df.dtypes

### **3.3.2 Converting the required columns datatypes**

In [None]:
# Convert required columns
df['seasons'] = df['seasons'].astype(int)

# Optional: Cast categories
df['type'] = df['type'].astype('category')
df['role'] = df['role'].astype('category')
df['genres'] = df['genres'].astype('category')
df['production_countries'] = df['production_countries'].astype('category')


In [None]:
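# Remove whitespace and unify case
# Note: string methods do not preserve the category dtype, so role is re-cast afterwards
df['name'] = df['name'].str.strip().str.title()
df['title'] = df['title'].str.strip().str.title()
df['role'] = df['role'].str.strip().str.upper().astype('category')
# genres and production_countries still hold stringified lists (e.g. "['drama', 'comedy']"),
# so the casing applies to the whole string; they stay as strings for later parsing
df['genres'] = df['genres'].str.strip().str.title()
df['production_countries'] = df['production_countries'].str.strip().str.upper()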


### What all manipulations have you done and insights you found?

**1. Data Cleaning**

*    Removed duplicate rows (168 duplicate entries were found and dropped).

**2. Handled missing values:**

*   Filled missing description with: "No description available".

*   Replaced nulls in age_certification with "Unknown".

*   Set seasons = 0 for all movies (since seasons are irrelevant).

*   Dropped rows with missing imdb_id .

*   Filled missing imdb_score, imdb_votes, and tmdb_score with respective medians.

*   Dropped 15 rows with missing tmdb_popularity.


**3. Data Transformation**

*   Stringified lists in genres are converted to actual Python lists using ast.literal_eval (done in the Chart 2 cell).

*   genres is exploded into separate rows for genre-level analysis (also done in the Chart 2 cell).

**4. Data Type Corrections**

*   Cast seasons to integer.

*   Converted columns like type, role, genres, and production_countries to categorical types for analysis and memory efficiency.

**5. Text Standardization**

*   Standardized columns like name, title, genres, and production_countries by stripping whitespace and applying title case or uppercase.



In [None]:
df.shape

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## **4.1 Univariate Analysis**

### **4.1.1 Univariate Analysis - Categorical Columns**

#### Chart - 1 **Distribution of content type**

In [None]:
# Chart - 1
# Create count plot
plt.figure(figsize=(4, 5))
ax = sns.countplot(data=df, x='type',color='green')
plt.title('Distribution of Content Type')

# Add value labels
for i in ax.patches:
    count = int(i.get_height())
    ax.annotate(f'{count}',
                (i.get_x() + i.get_width() / 2., i.get_height()),
                ha='center', va='bottom', fontsize=11, color='black')

plt.show()

##### **1. Why did you pick the specific chart?**

We use countplot because it's ideal for visualizing the frequency of categorical variables. It automatically counts and displays the number of entries in each category, making it easy to compare values like Movies vs Shows in a clear, visual format.

##### **2. What is/are the insight(s) found from the chart?**

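The count plot depicts that
1. shows account for 7,399 rows
2. movies account for 111,462 rows

Each row is one credit entry, so these are cast-and-crew counts per content type, not counts of distinct titles.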

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact**

Yes.
Understanding the skew toward movies:

Helps content strategists decide whether to invest more in TV shows to balance the catalog.

Useful for marketing teams to tailor campaigns focused on the dominant content type (movies).

For user engagement analysis, it helps in feature recommendation algorithms by giving more weight to the more available type.

**Negative Business Insight**

Yes, possible negative growth insight:

The low volume of SHOW content (only 7,399 credit rows, against 111,462 for movies) might indicate underrepresentation of episodic/serial content, which is a major engagement driver for many OTT platforms.

Shows tend to keep viewers subscribed longer due to episode-based storytelling (e.g., weekly releases, bingeable seasons). The lack of sufficient show content may lead to:

Lower retention of subscribers looking for series.

Missed revenue opportunities through longer engagement.

Loss of competitive edge compared to platforms with strong original show libraries (e.g., Netflix, Disney+).

✅ Justification: If user behavior leans toward long-form or episodic content, the limited number of shows can push users to switch platforms, especially after watching the limited series on Amazon Prime.
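
Because the merged frame holds one row per credit, a count plot on `type` counts cast-and-crew entries, and titles with large casts weigh more. To compare the number of distinct titles instead, the rows can be deduplicated on `id` first. A minimal sketch on made-up data (the `demo` frame is hypothetical):

```python
import pandas as pd

# One movie with three credits and one show with two credits (hypothetical values)
demo = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "ts1", "ts1"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "name": ["A", "B", "C", "D", "E"],
})

# Counts credit rows per type
rows_per_type = demo["type"].value_counts()

# Counts distinct titles per type by keeping one row per title id
titles_per_type = demo.drop_duplicates(subset="id")["type"].value_counts()

print(rows_per_type)
print(titles_per_type)
```

On the project data the equivalent would be `df.drop_duplicates(subset='id')['type'].value_counts()`.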

#### Chart - 2 **Distribution of Genres**

In [None]:
# Chart - 2 visualization code
# Ensure genres column is in list format
import ast

# Step 1: Convert stringified lists to actual Python lists
df['genres'] = df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Step 2: Remove any non-list rows (optional safety check)
df = df[df['genres'].apply(lambda x: isinstance(x, list))]

# Step 3: Explode the genres
df_exploded = df.explode('genres').reset_index(drop=True)

In [None]:
plt.figure(figsize=(10,6))
ax = sns.countplot(data=df_exploded, y='genres', order=df_exploded['genres'].value_counts().index,color='blue')

# Adding value labels
for p in ax.patches:
    # Label each bar with its integer count, vertically centered on the bar
    ax.annotate(f'{int(p.get_width())}', (p.get_width(), p.get_y() + p.get_height() / 2), va='center')

plt.title('Distribution of Genres')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()

##### **1. Why did you pick the specific chart?**

A horizontal bar plot is ideal when:

We have categorical values with long labels (like genre names).

We want to easily compare frequency/counts of each genre.

We're dealing with many categories — which are better visualized horizontally.

Also, exploding the genre list was necessary to treat each genre individually, as many titles had multiple genres in a single list (e.g., ['comedy', 'drama']).
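
The parse-then-explode step can be seen on a tiny example. This is a sketch on made-up data (the `demo` frame is hypothetical): `ast.literal_eval` turns each stringified list into a real list, and `DataFrame.explode` then yields one row per genre.

```python
import ast
import pandas as pd

demo = pd.DataFrame({
    "title": ["Example A", "Example B"],
    "genres": ["['comedy', 'drama']", "['drama']"],
})

# Parse stringified lists into Python lists; values that are already lists pass through
demo["genres"] = demo["genres"].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)

# One row per (title, genre) pair
exploded = demo.explode("genres").reset_index(drop=True)
genre_counts = exploded["genres"].value_counts()
print(genre_counts)
```

A title with several genres contributes to each of them, so the genre counts sum to more than the number of rows before exploding.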



##### **2. What is/are the insight(s) found from the chart?**

Top 5 genres by count are:

🎭 Drama (68,042)

😂 Comedy (40,575)

🔪 Thriller (31,913)

🔫 Action (29,977)

💕 Romance (28,484)

Genres like Reality, Sport, and Animation are least represented.

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

High counts of Drama and Comedy indicate mass appeal, suggesting platforms should prioritize acquiring or producing these genres.

Identifying popular genres helps guide marketing strategies and recommendation algorithms.

**Negative Growth Insight:**

Underrepresented genres (like Reality or Animation) may indicate untapped opportunities or niche audience segments.

If the platform wants to diversify content, these genres could be explored more actively.



#### Chart - 3 **Distribution of age certifications**

In [None]:
# Chart - 3 visualization code
# Set figure size
plt.figure(figsize=(10, 6))

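# Count plot for age_certification
# Exclude the 'Unknown' placeholder (if present) so real certifications stay comparable
cert_df = df[df['age_certification'] != 'Unknown']
ax = sns.countplot(data=cert_df, x='age_certification', order=cert_df['age_certification'].value_counts().index, color='pink')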

# Add annotations
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height),
                ha='center', va='bottom', fontsize=10)

# Title and labels
plt.title("Distribution of Age Certifications")
plt.xlabel("Age Certification")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### **1. Why did you pick the specific chart?**

A bar plot is ideal for visualizing frequency distribution of categorical variables like age_certification. It allows easy comparison between categories such as R, PG-13, TV-MA, etc.

##### **2. What is/are the insight(s) found from the chart?**

The most common age certification is ‘R’, followed by ‘PG-13’ and ‘PG’.

Less frequent certifications include ‘TV-G’, ‘TV-Y’, and ‘NC-17’.

The data seems to skew toward more mature content, indicating a heavier representation of content for adults and teens.

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Insights**
The dominance of R-rated and PG-13 content may reflect the platform's focus on teen and adult audiences.

This insight can help shape marketing strategies, content recommendation systems, and future content acquisition decisions (e.g., producing more family/kids-friendly shows if the platform wants to expand to younger demographics).

**Negative Insights**

Limited content for kids (TV-Y, TV-G) could represent a missed opportunity. If families with children don’t find enough suitable content, they might avoid subscribing or move to more kid-friendly competitors like Disney+.

If not intentional, this imbalance can signal a need for diversification of content categories to tap into untapped viewer segments.

#### Chart - 4 **Trend of Releases Over Years**

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(14,6))
sns.lineplot(data=df['release_year'].value_counts().sort_index())
plt.title('Trend of Content Releases Over Years')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.grid(True)
plt.show()


##### **1. Why did you pick the specific chart?**

A line chart was chosen because it is ideal for showing how values change over time. Since release_year is a continuous variable and we wanted to observe trends year-by-year, this chart helps visualize the overall growth and dips in content production.

##### **2. What is/are the insight(s) found from the chart?**

There was steady growth in content production after 2000.

A sharp spike occurred between 2015 and 2020, reaching a peak around 2019–2020 with over 8000 titles released.

A noticeable drop is seen after 2020, especially in 2023, likely due to incomplete data for recent years.

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

The rising trend from 2000 to 2020 shows strong demand and investment in content creation.

The surge around 2019–2020 suggests strategic expansion in streaming platforms, possibly due to increased digital consumption and global OTT adoption.

Businesses can use this insight to analyze successful launch windows, and replicate strategies during high-content-growth periods.

**Negative Insights**

Yes, the sharp drop after 2020 may indicate:

Incomplete data for the latest years (2022–2023)

Or actual decline due to COVID-19 disruptions, budget cuts, or changing business models.

This might negatively impact projections if not handled properly — forecasting models should treat post-2020 cautiously.
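
Since each row of the merged data is one credit, `value_counts()` on `release_year` counts credit rows per year. A sketch of a per-title alternative, on made-up data (the `demo` frame is hypothetical), counts distinct `id` values per year instead:

```python
import pandas as pd

# Two titles in 2019 (three credit rows) and one title in 2020 (three credit rows)
demo = pd.DataFrame({
    "id": ["tm1", "tm1", "tm2", "tm3", "tm3", "tm3"],
    "release_year": [2019, 2019, 2019, 2020, 2020, 2020],
})

# Credit rows per year
rows_per_year = demo["release_year"].value_counts().sort_index()

# Distinct titles per year
titles_per_year = demo.groupby("release_year")["id"].nunique()

print(rows_per_year)
print(titles_per_year)
```

Plotting `titles_per_year` rather than `rows_per_year` keeps years with large-cast titles from looking more productive than they were.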



### **4.1.2 Univariate Analysis - Numerical Columns**

#### Chart - 5 **Distribution of IMDb Scores**

In [None]:
# Chart - 5 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(df['imdb_score'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of IMDb Scores')
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


##### **1. Why did you pick the specific chart?**

A histogram with KDE (Kernel Density Estimate) is ideal for visualizing the distribution of a numerical variable.

It helps us understand the spread, central tendency, and skewness of IMDb scores.

##### **2. What is/are the insight(s) found from the chart?**

*   Most titles have IMDb scores between 5.5 and 7.0.

*   The distribution is approximately normal (bell-shaped) but slightly left-skewed: the longer tail sits on the low-score side, so poorly rated titles are more spread out than highly rated ones.

*   Very few titles are rated below 4.0 or above 8.5.
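
The direction of skew read from a histogram can be cross-checked numerically with `Series.skew()`: a negative value indicates a longer tail toward low scores, a positive value a longer tail toward high scores. A minimal sketch on made-up scores (not the project data):

```python
import pandas as pd

# Hypothetical scores with a longer tail toward low values
scores = pd.Series([2.0, 4.5, 6.0, 6.2, 6.5, 6.8, 7.0, 7.1, 7.3, 7.5])

skewness = scores.skew()
mean_score, median_score = scores.mean(), scores.median()

# For a left-skewed sample the mean is pulled below the median
print(f"skew={skewness:.2f}, mean={mean_score:.2f}, median={median_score:.2f}")
```

On the project data, `df['imdb_score'].skew()` gives the corresponding figure for the plotted column.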

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

*   Majority of Amazon Prime content falls in the average to good rating range, which can be used for promotional campaigns: “Most of our titles are rated 6+ on IMDb!”

*   Marketing efforts can highlight top-rated content to drive user engagement.

*   Content below 5 could be reviewed for quality or considered for removal or improvement.

**Negative Insight**

Yes.

*  A significant portion of titles are clustered between 5.0–6.0, which might not attract audiences seeking premium-quality content.

*  If too much average content accumulates without standout ratings, it may reduce perceived platform value over time.



#### Chart - 6 **Distribution of TMDB Scores**

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 5))
sns.histplot(df['tmdb_score'], bins=30, kde=True, color='teal')
plt.title('Distribution of TMDB Scores')
plt.xlabel('TMDB Score')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


##### **1. Why did you pick the specific chart?**

A histogram with KDE (Kernel Density Estimation) is best for showing how a continuous variable like tmdb_score is distributed. It reveals skewness, modality, and central tendency clearly, which is essential for understanding user ratings.

##### **2. What is/are the insight(s) found from the chart?**

* The TMDB scores are roughly normally distributed, centered around 6.

* The majority of titles have a score in the range of 5 to 7.

* Very few titles have extremely low (<3) or high (>8.5) scores.

* There's a slight positive skew in the distribution.

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

* The average TMDB score being around 6 indicates moderate audience satisfaction.

* Since very few titles score extremely high, there's an opportunity to focus on improving content quality.

* Content creators or acquisition teams can target high-scoring genres or creators for better viewer retention.

* The large volume of average-scoring content may indicate content saturation, which could push users to switch platforms if not balanced with high-quality options.

**Negative Insight**

The long tail of low-rated content may:

* Lower the platform’s average perception among users.

* Reduce viewer trust in recommendations.

* Result in user churn if users frequently encounter disappointing content.

Efforts should be made to analyze why some titles receive low scores, and then either improve them, remove them, or promote them less.

#### Chart - 7 **Distribution of TMDB Popularity**

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 2))
sns.boxplot(x='tmdb_popularity', data=df, color='salmon')
plt.title('Box Plot of TMDB Popularity')
plt.xlabel('TMDB Popularity')
plt.show()

##### **1. Why did you pick the specific chart?**

The box plot was selected because it efficiently displays the distribution, central tendency, and outliers of TMDB popularity in a compact format. It's excellent for identifying:

Median value

Interquartile range (IQR)

Extreme values/outliers



##### **2. What is/are the insight(s) found from the chart?**

*  TMDB Popularity is heavily right-skewed.

*  A majority of the data lies in a narrow range on the lower end (close to 0–50).

*  There are numerous extreme outliers, with some content showing popularity scores beyond 1000, which are significantly distant from the rest of the data.
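
The points a box plot draws as outliers are conventionally those beyond 1.5 × IQR from the quartiles, and the same rule can be applied directly to count them. A minimal sketch on made-up popularity values (not the project data):

```python
import pandas as pd

# Hypothetical right-skewed popularity values
popularity = pd.Series([1.2, 2.5, 3.1, 4.0, 5.6, 7.9, 9.3, 12.0, 250.0, 1100.0])

# Quartiles and interquartile range
q1, q3 = popularity.quantile([0.25, 0.75])
iqr = q3 - q1

# Upper whisker limit used by the 1.5 * IQR rule
upper_fence = q3 + 1.5 * iqr
outliers = popularity[popularity > upper_fence]

print(f"upper fence={upper_fence:.3f}, outliers={len(outliers)}")
```

Applying the same steps to `df['tmdb_popularity']` would quantify how many rows sit beyond the upper whisker.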



##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

*   This skewed distribution suggests a small number of titles drive significant popularity, while most content receives relatively low attention.

*   Streaming platforms should focus promotional strategies on underperforming titles or investigate what makes the top-performing titles so successful.

*   Could help in content recommendation systems to balance between popular and lesser-known titles.

**Negative Insights**

*  The long tail of unpopular titles indicates content saturation or poor discoverability, which may waste platform resources (licensing, hosting).

*  If left unchecked, this can negatively affect ROI, especially if a majority of newly launched content continues to perform poorly.


#### Chart - 8 **IMDb Votes**

In [None]:
# Chart - 8 visualization code
# Box Plot to check for outliers
plt.figure(figsize=(12, 2))
sns.boxplot(x=df['imdb_votes'], color='orange')
plt.title('Box Plot of IMDb Votes')
plt.xlabel('IMDb Votes')
plt.show()


##### **1. Why did you pick the specific chart?**

*  A box plot is ideal for spotting outliers and understanding the distribution spread of numerical variables like imdb_votes.

*  It highlights the median, interquartile range (IQR), and extreme values, which is important for vote-based metrics where outliers are common.



##### **2. What is/are the insight(s) found from the chart?**

*  There is a heavy concentration of IMDb votes on the lower side, indicating most titles received fewer votes.

*  A significant number of outliers exist — some titles have votes exceeding 1 million, showing a few highly popular entries dominating user engagement.
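
With a distribution this heavy-tailed, a log transform is a common way to make the bulk of the data readable. The sketch below uses `np.log1p` (log of 1 + x, which also handles zero votes) on made-up vote counts; it is an optional extra view, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts spanning several orders of magnitude
votes = pd.Series([12, 85, 430, 2_100, 15_000, 1_200_000])

# log1p compresses the long right tail while preserving order
log_votes = np.log1p(votes)

print(f"skew before={votes.skew():.2f}, after={log_votes.skew():.2f}")
```

A box plot or histogram of `np.log1p(df['imdb_votes'])` would show the spread of typical titles that the raw scale compresses near zero.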



##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

This skewed vote distribution suggests only a small subset of content receives mass engagement, which can help platforms:

*  Curate top-performing content

*  Focus marketing/promotion on similar content genres

*  Allocate more resources to produce or acquire content with similar engagement potential.

**Negative Insights**

Yes.
*  If a streaming platform relies only on the average vote count, content evaluation can be misled, because a few titles with extremely high vote counts skew the mean.

*  Underrated titles with low votes but high potential might get overlooked, affecting diversity and user satisfaction in the content catalog.

## **4.2 Bivariate Analysis**

### **4.2.1 Bivariate Analysis - Categorical vs Categorical Columns**

#### Chart - 9 **Content type vs Age Certification**

In [None]:
# Chart - 9 visualization code
# Crosstab + Heatmap - type vs age_certification
plt.figure(figsize=(12, 6))
crosstab = pd.crosstab(df['type'], df['age_certification'])
sns.heatmap(crosstab, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Heatmap of Type vs Age Certification')
plt.ylabel('Type')
plt.xlabel('Age Certification')
plt.tight_layout()
plt.show()


##### **1. Why did you pick the specific chart?**

A heatmap was chosen because it gives a clear, color-coded overview of how two categorical variables (type and age_certification) intersect in frequency.

It visually highlights where the majority of content lies in terms of certification for each content type (Movie/Show).

##### **2. What is/are the insight(s) found from the chart?**

* Movies dominate traditional certifications like R, PG-13, PG, and G.

* Shows are exclusive to TV certifications such as TV-MA, TV-14, TV-G, etc.

* There's a complete segregation — no overlap between movie and TV rating categories.
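
Raw counts in the crosstab are dominated by whichever content type has more rows. Passing `normalize='index'` to `pd.crosstab` converts each row to shares, which makes the certification mix comparable across types. A minimal sketch on made-up data (the `demo` frame is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "age_certification": ["R", "R", "PG-13", "PG", "TV-MA", "TV-14"],
})

# Absolute counts per (type, certification)
counts = pd.crosstab(demo["type"], demo["age_certification"])

# Row-wise shares: each type's row sums to 1
shares = pd.crosstab(demo["type"], demo["age_certification"], normalize="index")

print(counts)
print(shares.round(2))
```

The `shares` table can be passed to `sns.heatmap` with `fmt='.2f'` in place of the count table.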

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

* The platform likely classifies and restricts certifications clearly based on content type, helping avoid regulatory issues.

* Can help in designing custom filters or parental control settings — for instance, enabling different default filters for movies vs shows.

**Negative Insights**

* If shows do not explore traditional certifications (like PG-13 or R) and stick only to TV-based certifications, creative boundaries may be limited.

* Likewise, the platform might miss opportunities to adapt or mix certification standards across content types in global markets.

#### Chart - 10 **Content Type vs Genre**

In [None]:
# Chart - 10 visualization code
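# Explode genres if not already exploded
import ast

df_exploded = df.copy()
# Parse stringified lists safely: ast.literal_eval accepts only Python literals, unlike eval
df_exploded['genres'] = df_exploded['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
df_exploded = df_exploded.explode('genres')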

# Create crosstab
genre_type_ct = pd.crosstab(df_exploded['type'], df_exploded['genres'])

# Plot heatmap
plt.figure(figsize=(14, 6))
sns.heatmap(genre_type_ct, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Heatmap of Content Type vs Genre')
plt.ylabel('Type')
plt.xlabel('Genre')
plt.xticks(rotation=90, ha='right')
plt.tight_layout()
plt.show()


##### **1. Why did you pick the specific chart?**

* A heatmap is ideal to visualize the distribution and intensity of counts between two categorical variables.

* Here, type (MOVIE/SHOW) and genres are categorical, and we want to see how frequently each genre appears within each content type.

##### **2. What is/are the insight(s) found from the chart?**

* Movies dominate across all genres.

* Drama, Comedy, and Thriller are the top 3 genres for both types, but overwhelmingly for movies.

* Certain genres like Animation, Family, Fantasy, Sci-fi, History, and Romance also show significant presence in movies.

* Shows are very few in number and most common genres for shows are Drama, Comedy, and Animation.

* Genre Reality is almost negligible across both types.
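
Because movies far outnumber shows, raw counts make every genre look movie-dominated. A row-normalized crosstab compares genre mix within each type instead. This is a minimal sketch on made-up rows (`toy`); in the notebook the same call would presumably be applied to `df_exploded`.

```python
import pandas as pd

# Toy stand-in for df_exploded (one row per title-genre pair); values are invented.
toy = pd.DataFrame({
    "type": ["MOVIE"] * 6 + ["SHOW"] * 2,
    "genres": ["drama", "drama", "drama", "comedy", "comedy", "thriller",
               "drama", "animation"],
})

# normalize="index" turns each row into shares that sum to 1 for that content type.
genre_share = pd.crosstab(toy["type"], toy["genres"], normalize="index")
print(genre_share.round(2))
```

Reading across a row shows which genres make up the largest share of that content type, independent of how many titles the type has overall.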

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

Helps content acquisition teams focus on popular genre-type combinations. For example:

* Investing more in Drama and Comedy Movies can yield better engagement.

* Animation Shows might cater to a niche but engaged audience.

* Useful for personalized recommendation algorithms – knowing what type of content is likely to appeal to certain viewer groups.

**Negative Growth Insights**

* Reality and Music genres show minimal presence, especially in shows. This might indicate lower viewer demand or low content production in these categories.

* These areas can either be avoided or carefully tested before investing more resources.

#### Chart - 11 **Genre vs Age_Certification**

In [None]:
# Chart - 11 visualization code

# Count of titles by age_certification and genre
age_genre_counts = df_exploded.groupby(['age_certification', 'genres']).size().reset_index(name='count')

# Plotting
plt.figure(figsize=(18, 7))
sns.barplot(data=age_genre_counts, x='genres', y='count', hue='age_certification')

plt.title('Distribution of Genres by Age Certification')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Genre')
plt.ylabel('Number of Titles')
plt.legend(title='Age Certification')
plt.tight_layout()
plt.show()


##### **1. Why did you pick the specific chart?**

We selected a grouped bar chart to analyze the distribution of different Age Certifications across various Genres. This visual effectively shows comparative frequency within each genre for different certification levels, making it easy to identify patterns, imbalances, and dominance of certain age ratings per genre.

##### **2. What is/are the insight(s) found from the chart?**

*  Drama, Comedy, and Thriller dominate across almost all age certifications, especially in PG-13 and R categories.

*  R-rated content has the highest count across multiple genres such as Drama, Thriller, and Action, suggesting a strong tilt toward mature content production.

*  Family-oriented genres like Animation and Family primarily fall under G, PG, and TV-Y certifications.

*  Very few titles in genres like Reality and War have any certification, indicating underrepresentation or lack of age classification data.

*  Some genres (like Music, Sport) show a balanced distribution across multiple age groups.



##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

*  The dominance of mature content (R and PG-13) indicates that platforms like Amazon Prime are largely targeting adult audiences, which may limit viewership among families or children.

*  If Amazon Prime wants to expand into kid-safe or family segments, it may need to invest more in G, PG, or TV-Y content, especially in high-engagement genres like Animation, Family, or Fantasy.

*  On the other hand, the current focus on R and PG-13 content could be strategic, aligning with viewer demand for more intense, dramatic, or thrilling content.


Yes, there are two negative insights:

*  Over-dependence on R-rated content: Genres like Drama, Thriller, and Horror heavily rely on mature content, potentially alienating younger viewers or limiting household viewership.

*  Low diversity in certifications for certain genres: Genres such as European, War, or Reality show limited age group diversity, indicating a gap in inclusive or age-diverse storytelling, which can affect global engagement and compliance with regional content guidelines.



#### Chart - 12 **Production Country vs Content Type**

In [None]:
# Chart - 12 visualization code
import ast

# 1️⃣ Ensure production_countries is a Python list (not a string)
df['production_countries'] = df['production_countries'].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)

# 2️⃣ Explode the list so each country gets its own row
df_country = df.explode('production_countries')

# 3️⃣ (Optional) Focus on the top 10 most frequent countries for readability
top_countries = df_country['production_countries'].value_counts().head(10).index
df_top = df_country[df_country['production_countries'].isin(top_countries)]

# 4️⃣ Plot: Countplot with 'type' as hue
plt.figure(figsize=(12, 6))
sns.countplot(
    data=df_top,
    x='production_countries',
    hue='type',
    palette='Set2'
)
plt.title('Distribution of Content Types Across Top Production Countries')
plt.xlabel('Production Country')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Content Type')
plt.tight_layout()
plt.show()


##### **1. Why did you pick the specific chart?**

A grouped countplot was chosen because we are analyzing the relationship between two categorical variables — production_countries and type (MOVIE or SHOW). This plot effectively highlights the distribution of content types across the top-producing countries.



##### **2. What is/are the insight(s) found from the chart?**

*  The United States dominates content production by a massive margin in both categories, especially movies.

*  Other notable contributors include India (IN), United Kingdom (GB), and Canada (CA), but their contributions are significantly lower than the US.

*  In every country shown, movies vastly outnumber shows, indicating a general industry preference or trend toward producing movies over shows.
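
To put a number on "movies vastly outnumber shows", the share of shows per country can be computed from a crosstab. The sketch below runs on invented rows (`toy`); the `show_share` column is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the exploded country frame; rows are invented.
toy = pd.DataFrame({
    "production_countries": ["US", "US", "US", "US", "IN", "IN", "GB", "GB"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE", "MOVIE", "MOVIE", "SHOW"],
})

# Count titles per country and content type.
country_ct = pd.crosstab(toy["production_countries"], toy["type"])

# Fraction of each country's titles that are shows.
country_ct["show_share"] = country_ct["SHOW"] / (country_ct["MOVIE"] + country_ct["SHOW"])
print(country_ct)
```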



##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

* **Content Strategy:** Platforms like Amazon Prime can allocate more licensing, promotion, or acquisition resources toward movies, especially from the US, as they form the core of the catalog.

* **Localized Growth Opportunities:** Countries like India, Canada, and the UK, despite being behind the US, show strong movie output — suggesting potential markets for regional content investments, localization, or original content development.

* **Content Balance:** The relative scarcity of shows across all countries may indicate either an opportunity to expand show production or a deliberate content strategy based on audience demand.

**Negative Insight**

Yes, one potential gap is the underrepresentation of TV shows across even the top 10 countries. This can be a missed opportunity in regions where binge-worthy series consumption is rising. Platforms that fail to meet this demand may lose users to competitors like Netflix or Disney+ that focus heavily on episodic content.




### **4.2.2 Bivariate Analysis - Categorical vs Numerical Columns**

#### Chart - 13 **Content Type vs Imdb score**

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(8, 5))
sns.violinplot(data=df, x='type', y='imdb_score', color='maroon')
plt.title('Distribution of IMDb Scores by Content Type')
plt.xlabel('Content Type')
plt.ylabel('IMDb Score')
plt.tight_layout()
plt.show()


##### **1. Why did you pick the specific chart?**

A violin plot was chosen because it allows us to compare the distribution and density of IMDb scores between the two content types — MOVIE and SHOW. It shows both the range and the concentration of scores, which is more informative than just a boxplot or bar chart.

##### **2. What is/are the insight(s) found from the chart?**

*  Shows tend to have higher average IMDb scores than Movies.

*  The distribution for Movies is wider and more spread out, indicating more variation in ratings.

*  The density for Shows is tightly clustered around the 7.5–8 range, suggesting more consistent quality.
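
A violin plot shows shape but not exact numbers, so a grouped summary is a useful companion when claiming one type scores higher or is more consistent. A minimal sketch on invented scores (`toy`), assuming the same `type` and `imdb_score` columns as the notebook's `df`:

```python
import pandas as pd

# Toy stand-in for df; scores are invented for illustration.
toy = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 4,
    "imdb_score": [4.0, 5.5, 6.0, 7.0, 8.5, 7.0, 7.5, 7.8, 8.0],
})

# Count, centre, and spread per content type.
score_stats = toy.groupby("type")["imdb_score"].agg(["count", "mean", "median", "std"])
print(score_stats.round(2))
```

A higher median together with a smaller standard deviation is what "higher and more consistent" would look like in this table.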



##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

*  Shows are generally rated better by viewers, which can be a cue for platforms like Amazon Prime to invest more in quality show productions.

*  For marketing teams, highlighting high-rated shows can boost platform engagement and retention.

*  For product planning, it helps understand what content segment users are finding more reliable or enjoyable.

**Negative Insight**

Yes — the wider spread and lower density of high ratings in Movies suggests inconsistency in movie quality. If not addressed, this can:

*  Lead to user dissatisfaction, especially among movie lovers.

*  Result in negative reviews, which could affect brand trust over time.

*  A strategy to improve movie quality or curate better-rated titles is necessary to mitigate this.



#### Chart - 14 **Age_certification vs tmdb_score**

In [None]:
# Chart - 14 visualization code
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='age_certification', y='tmdb_score', color='purple')

plt.title('Box Plot of TMDB Scores by Age Certification')
plt.xlabel('Age Certification')
plt.ylabel('TMDB Score')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### **1. Why did you pick the specific chart?**

A box plot is ideal here because it shows:

*  The distribution of TMDB scores for each age certification category.

*  Key summary statistics (median, quartiles).

*  The spread of scores and presence of outliers.

It helps to compare how age certifications relate to perceived quality (TMDB scores).



##### **2. What is/are the insight(s) found from the chart?**

*  TV-G, TV-Y7, TV-14, TV-MA content generally has higher median TMDB scores.

*  Certifications like NC-17 and R have lower medians and higher variability.

*  Outliers are present in every category, but child-safe content (TV-Y, TV-G) tends to show consistently high scores.

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


*  Content with TV certifications (TV-MA, TV-G, TV-Y7) tends to be better received on TMDB.

*  If a platform is deciding what type of content to produce or promote, focusing on high-scoring certification brackets could improve audience satisfaction and ratings.

*  Parental control content (TV-Y, TV-G) is a safe bet for maintaining platform quality ratings.

**Negative Insights**

*  Content rated NC-17 and R has lower median scores and more outliers, suggesting inconsistency or poor reception.

*  Promoting or investing heavily in explicit content might risk negative feedback or reduce platform credibility, especially if quality isn't maintained.

#### Chart - 15 **Type vs runtime**

In [None]:
# Chart - 15 visualization code
plt.figure(figsize=(5, 3))
sns.boxplot(data=df, x='type', y='runtime', color='yellow')

plt.title('Runtime Distribution by Content Type')
plt.xlabel('Content Type')
plt.ylabel('Runtime (minutes)')
plt.tight_layout()
plt.show()


##### **1. Why did you pick the specific chart?**

A box plot is ideal for comparing the distribution of a numerical variable (runtime) across different categories (type: Movie or Show). It clearly shows the median, interquartile range, and outliers, helping us understand runtime variation for each content type.

##### **2. What is/are the insight(s) found from the chart?**

*  Movies tend to have higher and more varied runtimes compared to Shows.

*  Movies have several extreme outliers, some going above 300 minutes.

*  Shows generally have shorter and more consistent runtimes, often clustered below 60 minutes
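
The outlier points in a box plot follow the 1.5 × IQR rule, which is seaborn's default whisker setting. The sketch below counts them per content type on invented runtimes (`toy`); `iqr_outliers` is a helper name introduced here, not something from the notebook.

```python
import pandas as pd

# Toy stand-in for df; runtimes (minutes) are invented.
toy = pd.DataFrame({
    "type": ["MOVIE"] * 6 + ["SHOW"] * 5,
    "runtime": [85, 90, 95, 100, 110, 320, 22, 24, 25, 45, 50],
})

def iqr_outliers(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# One outlier count per content type.
outlier_counts = toy.groupby("type")["runtime"].apply(iqr_outliers)
print(outlier_counts)
```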

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


*  This insight helps content strategy and UX design:

*  Knowing shows are shorter, binge-watch features and auto-play may be prioritized for shows.

*  The longer runtime of movies may require better buffering or download features on mobile devices.

*  Helps in recommendation engine tuning — runtime-based filtering or suggestions can improve engagement.

*  There’s no negative business growth insight here, but content length misalignment with viewer preferences could impact completion rates or user retention, especially on mobile-heavy platforms.



### **4.2.3 Bivariate Analysis - Numerical vs Numerical Columns**

#### Chart - 16 **Imdb score vs Tmdb score**

In [None]:
# Chart - 16 visualization code
plt.figure(figsize=(8, 5))
sns.regplot(data=df, x='imdb_score', y='tmdb_score', scatter_kws={'alpha':0.4}, line_kws={'color':'red'})

plt.title('IMDb Score vs TMDB Score')
plt.xlabel('IMDb Score')
plt.ylabel('TMDB Score')
plt.tight_layout()
plt.show()

##### **1. Why did you pick the specific chart?**

Because both IMDb and TMDB scores are continuous numerical variables, a scatter plot shows how closely they're related. The regression line helps visualize the trend and detect if a strong correlation exists.



##### **2. What is/are the insight(s) found from the chart?**

There is a strong positive correlation between IMDb and TMDB scores.

*  The regression line slopes upward, indicating that as IMDb scores increase, so do TMDB scores.

*  While the relationship is clear, there is some spread around the line, especially in the mid-range (5 to 7 IMDb), where TMDB scores show more variance.

*  The plot also highlights a few outliers, where titles rated low on IMDb received higher TMDB scores or vice versa.
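
"Strong positive correlation" is easier to defend with a coefficient next to the plot. A minimal sketch computing Pearson's r on invented score pairs (`toy`), dropping rows where either score is missing:

```python
import pandas as pd

# Toy stand-in for df; scores are invented, with one missing IMDb value.
toy = pd.DataFrame({
    "imdb_score": [3.0, 5.0, 6.0, 7.0, 8.0, None],
    "tmdb_score": [4.0, 5.5, 6.5, 6.8, 8.2, 7.0],
})

# Keep only titles that have both scores.
pair = toy[["imdb_score", "tmdb_score"]].dropna()

# Series.corr defaults to the Pearson correlation coefficient.
r = pair["imdb_score"].corr(pair["tmdb_score"])
print(round(r, 3))
```

Values near 1 indicate a strong linear relationship; how strong the relationship is on the real data can only be judged by running this on `df`.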

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


Confirms consistency between both rating systems, making it reliable for:

*  Recommendation systems

*  Data imputation (if one score is missing)

*  User trust by displaying one score backed by another

*  Variations in mid-score range suggest differences in audience preferences across platforms.

*  Useful for content quality benchmarking, especially for production and curation teams.
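
As an illustration of the imputation idea above, one simple option is a straight-line fit of TMDB score on IMDb score, used only where TMDB is missing. This is a sketch under assumptions: the data in `toy` is invented, the `tmdb_imputed` column is a name introduced here, and a linear fit is one possible choice rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; one TMDB score is missing.
toy = pd.DataFrame({
    "imdb_score": [4.0, 5.0, 6.0, 7.0, 8.0],
    "tmdb_score": [4.5, 5.5, 6.5, np.nan, 8.5],
})

# Fit tmdb ~ slope * imdb + intercept on rows where both scores are known.
known = toy.dropna(subset=["imdb_score", "tmdb_score"])
slope, intercept = np.polyfit(known["imdb_score"], known["tmdb_score"], deg=1)

# Fill only the missing TMDB values, keeping the original column untouched.
missing = toy["tmdb_score"].isna() & toy["imdb_score"].notna()
toy["tmdb_imputed"] = toy["tmdb_score"]
toy.loc[missing, "tmdb_imputed"] = slope * toy.loc[missing, "imdb_score"] + intercept
print(toy)
```

Writing to a separate column keeps imputed values distinguishable from observed ones in later analysis.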

**Negative Insight**

*  While the overall trend is positively correlated, there’s a noticeable dispersion or inconsistency in the mid-score range:

*  Many titles with an IMDb score between 5 and 7 have TMDB scores that vary widely from 2 to 10.

*  Some titles with low IMDb scores (2–4) still have high TMDB scores (7–9) — this reflects inconsistency between the two rating platforms.

These discrepancies suggest differences in rating algorithms, user base preferences, or voting behavior on both platforms.

#### Chart - 17 **Imdb score vs Imdb votes**

In [None]:
# Chart - 17 visualization
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='imdb_votes', y='imdb_score', alpha=0.5)

plt.xscale('log')  # IMDb votes can have a very wide range
plt.title('IMDb Score vs IMDb Votes')
plt.xlabel('IMDb Votes (Log Scale)')
plt.ylabel('IMDb Score')
plt.grid(True)
plt.tight_layout()
plt.show()

##### **1. Why did you pick the specific chart?**

A scatter plot with a log scale on the x-axis was selected because it effectively shows the relationship between two numerical variables (imdb_votes and imdb_score) that have significantly different ranges. Log scaling handles the skew in vote counts, revealing patterns across low- and high-vote titles.

##### **2. What is/are the insight(s) found from the chart?**

*  There's a very weak or no clear linear correlation between the number of IMDb votes and the score.

*  Many high-vote titles have moderate scores (around 6–7), while low-vote titles span both extremely high and extremely low scores.

*  Some highly voted titles also have low ratings, indicating that popularity doesn’t always imply quality.
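
A rank correlation is a reasonable way to quantify "weak or no clear correlation" here, because it is unaffected by the log scaling of votes. The sketch below computes Spearman's rho as the Pearson correlation of ranks, on invented data (`toy`):

```python
import pandas as pd

# Toy stand-in for df; votes and scores are invented.
toy = pd.DataFrame({
    "imdb_votes": [12, 150, 900, 5_000, 40_000, 250_000],
    "imdb_score": [8.9, 3.1, 6.4, 6.0, 6.8, 6.5],
})

pair = toy.dropna()

# Spearman's rho equals Pearson's r computed on the ranks of each variable.
rho = pair["imdb_votes"].rank().corr(pair["imdb_score"].rank())
print(round(rho, 3))
```

A rho close to 0 would be consistent with popularity and score being largely unrelated.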



##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


*  Helps Amazon Prime differentiate between popularity and content quality.

*  Supports surfacing high-quality but under-voted content as “hidden gems” for niche audiences.

**Negative Insight:**

*  Relying solely on vote counts could lead to overpromoting average content, neglecting lesser-known but high-rated titles.

*  Business Recommendation: Use both score and vote count as weighted inputs in recommendation engines to balance between quality and popularity.
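
One common way to combine score and vote count is a shrinkage-style weighted rating, which pulls titles with few votes toward the catalog mean. The sketch below is one possible formulation, not a method taken from this notebook; the data in `toy`, the threshold `m`, and the `weighted_score` column are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df; titles, scores, and votes are invented.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "imdb_score": [9.0, 7.0, 5.0],
    "imdb_votes": [100, 10_000, 1_000],
})

m = 1_000                      # hypothetical vote count at which score and prior weigh equally
C = toy["imdb_score"].mean()   # catalog-wide mean score, used as the prior

v, R = toy["imdb_votes"], toy["imdb_score"]

# Weighted rating: v/(v+m) of the title's own score plus m/(v+m) of the prior.
toy["weighted_score"] = (v / (v + m)) * R + (m / (v + m)) * C
print(toy.sort_values("weighted_score", ascending=False))
```

Title A has the highest raw score but only 100 votes, so its weighted score lands much closer to the mean than its raw 9.0.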

#### Chart - 18 **Runtime vs tmdb popularity**

In [None]:
# Chart - 18 visualization
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='runtime', y='tmdb_popularity', alpha=0.5)

plt.title('TMDB Popularity vs Runtime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('TMDB Popularity')
plt.grid(True)
plt.tight_layout()
plt.show()

##### **1. Why did you pick the specific chart?**

A scatter plot was chosen because it effectively displays the distribution and relationship between two numerical variables: runtime and tmdb_popularity. This chart type highlights how popularity varies across different content lengths.

##### **2. What is/are the insight(s) found from the chart?**

*  Most popular titles fall within the 60 to 120-minute runtime range.

*  Extremely long content (>200 min) generally shows low popularity.

*  Outliers with very high TMDB popularity exist between 80 to 110 minutes, suggesting that audiences prefer concise, well-paced content.
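
Runtime ranges read off a scatter plot can be checked by binning runtime and summarizing popularity per band. A minimal sketch on invented data (`toy`); the band edges and labels are assumptions chosen to mirror the ranges discussed above.

```python
import pandas as pd

# Toy stand-in for df; runtimes and popularity values are invented.
toy = pd.DataFrame({
    "runtime": [25, 45, 85, 95, 110, 130, 210, 240],
    "tmdb_popularity": [3.0, 5.0, 40.0, 60.0, 20.0, 8.0, 1.0, 2.0],
})

# Hypothetical runtime bands (right-inclusive intervals).
bins = [0, 60, 120, 200, float("inf")]
labels = ["<=60", "61-120", "121-200", ">200"]
toy["runtime_band"] = pd.cut(toy["runtime"], bins=bins, labels=labels)

# Title count and median popularity per band.
band_stats = toy.groupby("runtime_band", observed=False)["tmdb_popularity"].agg(["count", "median"])
print(band_stats)
```

The median is used instead of the mean because popularity is typically skewed by a handful of very popular titles.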

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


Yes.

*  Helps Amazon Prime optimize content acquisition and production by focusing on ideal runtimes (approx. 90–120 mins).

*  Could improve user engagement by highlighting titles within preferred runtime ranges.

**Negative Insight:**

*  Over-prioritizing runtime may exclude long-form content lovers or niche documentaries, leading to reduced diversity.

Recommendation: Promote content around the sweet spot (90–120 mins) but retain long-format content for targeted audiences or special categories.



#### Chart - 19 **Imdb votes vs Imdb score (density)**

In [None]:
# Chart - 19 visualization
plt.figure(figsize=(10, 6))
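# Drop rows without votes or a score; a log axis also needs strictly positive vote counts
votes_scores = df[['imdb_votes', 'imdb_score']].dropna()
votes_scores = votes_scores[votes_scores['imdb_votes'] > 0]
# Pass xscale='log' to hexbin so the hexagons are binned in log space
plt.hexbin(votes_scores['imdb_votes'], votes_scores['imdb_score'], gridsize=50, cmap='Blues', bins='log', xscale='log')
plt.colorbar(label='Count (log scale)')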
plt.xlabel('IMDb Votes (Log Scale)')
plt.ylabel('IMDb Score')
plt.title('Hexbin Plot: IMDb Score vs Votes')
plt.show()


##### **1. Why did you pick the specific chart?**

A hexbin plot (a 2D histogram with hexagonal bins) is used to visualize density between two continuous variables. The log-scaled x-axis helps address the skewed distribution of votes, while color intensity reveals concentration zones of IMDb scores across vote counts.

##### **2. What is/are the insight(s) found from the chart?**

*  The highest density of content lies in the 5–7 IMDb score range, regardless of the number of votes.

*  A significant amount of content with both low and high votes still falls in the moderate rating range.

*  Very few titles score above 8 or below 4, showing clustering around average ratings.

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


Yes:

*  Helps Amazon identify the most common quality tier (average-rated titles) to refine recommendations.

*  Useful for quality filtering: high vote + high score = premium titles.

**Negative Insight:**

*  There’s a risk of over-representing average content, missing out on promoting niche or standout titles.

*  Recommendation: Prioritize content in high-vote & high-score zones for promotion, and improve discoverability of high-score/low-vote hidden gems.
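
The two zones named in the recommendation can be expressed as simple filters. In the sketch below the data (`toy`) is invented and both cutoffs are assumptions: a score of 8.0 for "high score" and the median vote count as the line between low and high votes.

```python
import pandas as pd

# Toy stand-in for df; titles, scores, and votes are invented.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "imdb_score": [8.6, 8.2, 6.1, 5.4, 8.4],
    "imdb_votes": [120, 95_000, 40_000, 300, 800],
})

SCORE_MIN = 8.0                              # hypothetical "high score" cutoff
votes_median = toy["imdb_votes"].median()    # hypothetical low/high vote boundary

high_score = toy["imdb_score"] >= SCORE_MIN

# High score and many votes: candidates for promotion.
premium = toy[high_score & (toy["imdb_votes"] > votes_median)]

# High score but few votes: candidates for better discoverability.
hidden_gems = toy[high_score & (toy["imdb_votes"] <= votes_median)]

print(premium["title"].tolist(), hidden_gems["title"].tolist())
```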

## **4.3 Multivariate Analysis**

#### Chart - 20 **Type vs Genre vs Imdb score**

In [None]:
# Visualization Chart 20
df_exploded_reset = df_exploded.reset_index(drop=True)

# Now create the plot
plt.figure(figsize=(14, 6))
sns.boxplot(data=df_exploded_reset, x='genres', y='imdb_score', hue='type')
plt.xticks(rotation=90)
plt.title('IMDb Score Distribution by Genre and Type')
plt.show()


##### **1. Why did you pick the specific chart?**

We used a box plot with a hue='type' to compare how IMDb scores vary across different genres, while also comparing between MOVIE and SHOW content types. This multivariate approach helps in identifying not just genre-based trends, but also how content types perform within those genres.



##### **2. What is/are the insight(s) found from the chart?**

*  Shows tend to have higher IMDb scores than MOVIEs across most genres (e.g., Drama, Comedy, Family, Animation).

*  Genres like War, History, and Western exhibit a wider spread (variance) in IMDb scores, indicating inconsistent reception.

*  Some genres like Documentary, Sci-Fi, and Reality have overlapping score distributions, but Movies generally have more outliers (both high and low).

*  Animation and Family SHOWs consistently receive higher ratings compared to their movie counterparts.

*  Horror and Thriller genres have lower median IMDb scores, especially for Movies.
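
With many genres and two hues, medians are hard to compare by eye. A pivot table of median IMDb score by genre and type makes the comparison explicit. A minimal sketch on invented rows (`toy`) standing in for `df_exploded`; the `show_minus_movie` column is a name introduced here.

```python
import pandas as pd

# Toy stand-in for df_exploded; scores are invented.
toy = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "drama", "horror", "horror", "horror"],
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW", "MOVIE", "MOVIE", "SHOW"],
    "imdb_score": [6.0, 6.4, 7.4, 7.8, 4.8, 5.2, 6.5],
})

# Median score for each genre (rows) and content type (columns).
median_by_genre_type = toy.pivot_table(
    index="genres", columns="type", values="imdb_score", aggfunc="median"
)

# Positive values mean shows have the higher median in that genre.
median_by_genre_type["show_minus_movie"] = (
    median_by_genre_type["SHOW"] - median_by_genre_type["MOVIE"]
)
print(median_by_genre_type.round(2))
```

On real data it would also be worth checking the number of titles behind each cell, since medians from a handful of shows can be misleading.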



##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


*  Streaming platforms can focus more on producing SHOWs in genres like Family, Animation, and Comedy, as they are generally better rated than movies in those categories.

*  Avoiding Movie-heavy production in genres like Horror or Thriller, where user satisfaction appears lower on average, could reduce poor-performing content.

*  Insights can guide genre-specific budgeting: investing more in SHOWs for high-rating genres and improving script/story quality for low-rating genres.

**Negative Insight**

*  MOVIEs in genres like Horror, Thriller, and Sci-Fi tend to have lower IMDb ratings and more variability, which may negatively impact viewer trust and reduce watch-time.

*  Reality shows, even as SHOWs, have low median scores, suggesting viewers don’t enjoy them as much—this could lead to subscriber churn if content is skewed in that direction.

*  Producing more Movies in underperforming genres without understanding viewer expectations could result in wasted production cost and lower ROI.

#### Chart - 21 **Age certification vs Genre vs Imdb Score**

In [None]:
print(df_exploded.columns[df_exploded.columns.duplicated()])

In [None]:
df_exploded = df_exploded.dropna(subset=['genres', 'imdb_score', 'age_certification'])


In [None]:
# Visualization Chart 21
plt.figure(figsize=(16, 8))
sns.boxplot(data=df_exploded, x='genres', y='imdb_score', hue='age_certification')
plt.xticks(rotation=90)
plt.title('IMDb Score by Genre and Age Certification')
plt.xlabel("Genre")
plt.ylabel("IMDb Score")
plt.tight_layout()
plt.show()



##### **1. Why did you pick the specific chart?**

A box plot with hue was selected because it effectively visualizes how IMDb scores are distributed across multiple genres while simultaneously comparing different age certifications. It allows for multi-dimensional comparison (Genre vs IMDb Score, segmented by Age Certification).



##### **2. What is/are the insight(s) found from the chart?**

*  Consistently Higher Ratings: Genres like Comedy, Family, Animation, and Fantasy with TV-Y7, TV-G, or PG ratings show generally higher median IMDb scores compared to their counterparts in adult ratings.

*  More Variance in R/NC-17: Titles with R or NC-17 certifications show wider variability and more outliers, indicating inconsistency in content quality or audience reception.

*  Certain genres like Horror, Thriller, and War with adult certifications tend to have lower medians and larger spread — suggesting these may be more hit-or-miss with viewers.

*  Genres like Music, Sport, and Documentation also show significant variance in IMDb scores across age groups, hinting at niche preferences.

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


*  Content Targeting: Platforms can prioritize producing content in genres and age certifications with consistently high IMDb scores to improve user satisfaction.

*  Marketing Strategy: Shows with family-friendly certifications and strong genres like Comedy or Animation may perform better with broader audiences, offering better returns on promotional campaigns.

*  Quality Control: R-rated or adult-content producers can use this data to focus on quality improvement, as these titles often show high variance.

**Negative Insights**

*  Adult-rated titles (R, NC-17) in genres like Thriller, Horror, or Crime have lower and more volatile IMDb scores, indicating viewer dissatisfaction or inconsistent production quality.

*  Continuing to produce large volumes in such combinations without quality assurance might negatively impact platform reputation and retention, especially if viewers equate these with low-quality content.



#### Chart - 22 **Production Country vs Type vs Count**

In [None]:
# explode the list of countries into separate rows
df_exploded_country = df.explode('production_countries')

#  plot only the top 10 countries by frequency
top_countries = df_exploded_country['production_countries'].value_counts().head(10).index

plt.figure(figsize=(12, 6))
sns.countplot(data=df_exploded_country[df_exploded_country['production_countries'].isin(top_countries)],
              x='production_countries', hue='type')
plt.title('Content Type Distribution by Top 10 Production Countries')
plt.xticks(rotation=45)
plt.xlabel("Production Countries")
plt.ylabel("Count")
plt.tight_layout()
plt.show()



##### **1. Why did you pick the specific chart?**

A grouped bar chart (countplot with hue) is ideal for comparing distribution of content types (MOVIE vs SHOW) across different production countries. It helps to visualize differences in production preferences by country, revealing content trends globally.

##### **2. What is/are the insight(s) found from the chart?**

*  The United States (US) overwhelmingly dominates production, especially for movies, with over 70,000 titles.

*  Other top countries like India (IN), UK (GB), and Canada (CA) also show high movie production but significantly lower show counts.

*  In all top 10 countries, movies consistently outnumber shows.

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


*  Content strategy: Platforms like Amazon Prime can focus on acquiring more shows from underrepresented countries, especially where movie dominance is high, to diversify offerings.

*  Localization and expansion: Countries like India and UK show strong movie presence, so Prime could invest in regional content or develop original shows targeting those markets.

*  Licensing: Knowing the supply dominance from the US helps prioritize negotiations with US-based studios for blockbuster content.

**Negative Insight**

*  Over-reliance on US movie content could lead to cultural saturation and lack of diversity on the platform, which might not resonate well with global audiences.

*  Very low production of shows in countries like Germany (DE), Japan (JP), Australia (AU) could point to a limited local appeal for series in those markets, risking low user retention unless addressed.

#### Chart - 23 **Age certification vs Genre vs Imdb Score**

##### **1. Why did you pick the specific chart?**

##### **2. What is/are the insight(s) found from the chart?**

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


#### Chart - 24 **Release Year vs Type vs IMDb Score**

##### **1. Why did you pick the specific chart?**

##### **2. What is/are the insight(s) found from the chart?**

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.


#### Chart - 25 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 26 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***