<a href="https://colab.research.google.com/github/SAITEJSRIMAL/Capstone-Project/blob/main/Amazon_EDA_Project_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

In today's competitive streaming industry, platforms like Amazon Prime Video are constantly expanding their content libraries to cater to diverse audiences. With a growing number of shows and movies available on the platform, data-driven insights play a crucial role in understanding trends, audience preferences, and content strategy.

This dataset was created to list all shows and movies available on Amazon Prime Video and to analyze them for interesting patterns. It covers the catalog available in the United States.

**Dataset Description** :

The dataset consists of two files: the titles (titles.csv) and the cast and crew for each title (credits.csv).

The titles file contains more than 9,000 unique titles on Amazon Prime, with 15 columns of information about each one.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

* Content Diversity: What genres and categories dominate the platform?

* Regional Availability: How does content distribution vary across different regions?

* Trends Over Time: How has Amazon Prime’s content library evolved?

* IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?

#### **Define Your Business Objective?**

By analyzing this dataset, businesses, content creators, and analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset

In [None]:
titles = pd.read_csv('titles.csv')        # primary data
credits = pd.read_csv('credits.csv')      # secondary data

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
pd.set_option('display.max_columns', None)     # shows all the columns

In [None]:
titles.head()       # primary data

In [None]:
credits.head()       # secondary data

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
titles.shape     # Rows = 9871, Columns = 15

In [None]:
9871 - 8951

In [None]:
credits.shape     # Rows = 124235, Columns = 5

### Dataset Information

In [None]:
# Dataset Info

In [None]:
titles.info()   # Shows column names, row count, non-null counts, and datatypes

In [None]:
credits.info()     # secondary data

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
titles.duplicated().sum()       # There are 3 duplicate values (whole rows) in the Primary Data

In [None]:
credits.duplicated().sum()      # There are 56 duplicate values (whole rows) in the Secondary Data

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
titles.isnull().sum()

In [None]:
(titles.isnull().sum() * 100 / len(titles)).round(2)   # percentage of missing values per column, rounded to 2 decimal places

In [None]:
(titles.isna().mean()*100).round(2)      # equivalent, shorter form of the cell above

In [None]:
credits.isnull().sum()

In [None]:
(credits.isnull().sum() * 100 / len(credits)).round(2)   # percentage of missing values per column, rounded to 2 decimal places

In [None]:
# Visualizing the missing values

In [None]:
plt.figure()         # Primary data
sns.heatmap(titles.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap For Titles')
plt.show()

In [None]:
# Visualizing missing values as a heatmap

plt.figure()       # secondary data
sns.heatmap(credits.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap For Credits')
plt.show()

### What did you know about your dataset?

**There are 2 DataFrames :**

**1. titles**

**2. credits**

**The 'titles' table holds the main metadata about each show or movie. (Primary Data)**

**The 'credits' table holds the cast and crew information linked to each title. (Secondary Data)**

**Later, these two DataFrames are merged on the shared key `id` to support insights such as content analysis and IMDb & TMDB comparisons.**

**The 'titles' file has 9871 rows and 15 columns, with 3 duplicate rows.**

**The 'credits' file has 124235 rows and 5 columns, with 56 duplicate rows.**
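
The merge described above can be sketched on toy data. This is a minimal illustration, not the notebook's actual merge: `titles_demo` and `credits_demo` are hypothetical stand-ins for the two files, and the `validate` and `indicator` arguments are optional extras that check the key relationship and flag titles without any credits.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (values are illustrative only)
titles_demo = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3"],
    "title": ["Movie A", "Movie B", "Show C"],
})
credits_demo = pd.DataFrame({
    "id": ["tm1", "tm1", "ts3"],
    "name": ["Actor X", "Director Y", "Actor Z"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# Left join keeps every title; validate= raises if 'id' repeats in titles_demo
merged_demo = titles_demo.merge(
    credits_demo, on="id", how="left",
    validate="one_to_many", indicator=True,
)

# Titles with no matching credits rows are flagged 'left_only'
unmatched = merged_demo.loc[merged_demo["_merge"] == "left_only", "title"].tolist()
print(merged_demo.shape)
print(unmatched)
```

With a left join, a title that has several cast members appears once per credit row, which is why the merged table has far more rows than the titles file.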


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
titles.columns

In [None]:
credits.columns

In [None]:
# Dataset Describe

In [None]:
# Dataset Describe
titles.describe(include = 'all')    # It will give the data for all the columns

In [None]:
titles.describe(include = 'object')    # specifically for datatypes == Object

In [None]:
credits.describe(include = 'all')

In [None]:
credits.describe(include = 'object')    # specifically for datatypes == Object

### Variables Description

**titles** : Dataset

* **id**: The title ID on JustWatch.

* **title**: The name of the title.

* **type**: SHOW or MOVIE.

* **description:** A brief description.

* **release_year**: The release year.

* **age_certification**: The age certification.

* **runtime** : The length of the episode (SHOW) or movie.

* **genres** : A list of genres.

* **production_countries** : A list of countries that produced the title.

* **seasons** : Number of seasons if it's a SHOW.

* **imdb_id** : The title ID on IMDB.

* **imdb_score** : Score on IMDB.

* **imdb_votes** : Votes on IMDB.

* **tmdb_popularity** : Popularity on TMDB.

* **tmdb_score** : Score on TMDB.




**credits** : Dataset

* **person_id** : The person ID on JustWatch.

* **id** : The title ID on JustWatch.

* **name** : The actor or director's name.

* **character** : The character name.

* **role** : ACTOR or DIRECTOR.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in titles.columns.tolist():
  print(f"No. of unique values in {i} is {titles[i].nunique()}.")     # loop over every column of the primary data (titles)

In [None]:
# Check Unique Values for each variable.
for i in credits.columns.tolist():
  print(f"No. of unique values in {i} is {credits[i].nunique()}.")     # loop over every column of the secondary data (credits)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#### Dropping the duplicates

In [None]:
titles.drop_duplicates(inplace = True)     # dropping the duplicates (primary data)

In [None]:
titles = titles.reset_index(drop = True)     # reset the index after dropping rows so it is contiguous again

In [None]:
titles.shape       # after dropping the duplicates, rows = 9868, columns = 15

In [None]:
credits.drop_duplicates(inplace = True)     # dropping the duplicates (secondary data)

In [None]:
credits = credits.reset_index(drop = True)    # reset the index after dropping rows so it is contiguous again

In [None]:
credits.shape       # after dropping the duplicates, rows = 124179, columns = 5

In [None]:
titles.info()

In [None]:
(titles.isnull().sum() * 100 / len(titles)).round(2)

In [None]:
titles.drop('seasons', axis = 1, inplace = True)    # dropping the column 'seasons' because it has a high percentage of missing values in the primary data (86.25)

In [None]:
titles.head()

In [None]:
titles.info()

In [None]:
(titles.isnull().sum() * 100 / len(titles)).round(2)

#### Filling the NAN Values

In [None]:
titles['description_new'] = titles['description'].fillna('Unknown')  # Created a new column and replaced the NaN with 'Unknown' (all the data get stored in new column)

In [None]:
titles.head()

In [None]:
titles['age_certification_new'] = titles['age_certification'].fillna('Unknown')
# Created a new column and replaced NaN values with 'Unknown'

In [None]:
titles.head()

In [None]:
# Calculate the median of the imdb_score column
imdb_score_median = titles['imdb_score'].median()

# Fill missing values in imdb_score_new with the calculated median
titles['imdb_score_new'] = titles['imdb_score'].fillna(imdb_score_median)

In [None]:
imdb_score_median

In [None]:
titles.head()

In [None]:
# Calculate the median of the imdb_votes column
imdb_votes_median = titles['imdb_votes'].median()

# Fill missing values in imdb_votes_new with the calculated median
titles['imdb_votes_new'] = titles['imdb_votes'].fillna(imdb_votes_median)

In [None]:
imdb_votes_median    # median

In [None]:
titles.head()

In [None]:
# Calculate the median of the tmdb_popularity column
tmdb_popularity_median = titles['tmdb_popularity'].median()

# Fill missing values in tmdb_popularity_new with the calculated median
# (the source column must be tmdb_popularity, not tmdb_score)
titles['tmdb_popularity_new'] = titles['tmdb_popularity'].fillna(tmdb_popularity_median)

In [None]:
tmdb_popularity_median    # median

In [None]:
titles.head()

In [None]:
# Calculate the median of the tmdb_score column
tmdb_score_median = titles['tmdb_score'].median()

# Fill missing values in tmdb_score_new with the calculated median
titles['tmdb_score_new'] = titles['tmdb_score'].fillna(tmdb_score_median)

In [None]:
tmdb_score_median # Median of the tmdb score

In [None]:
titles.head()

In [None]:
(titles.isnull().sum() * 100 / len(titles)).round(2)

In [None]:
# The column 'imdb_id' is not filled here because it is not useful for the analysis and is dropped later

In [None]:
# Secondary Data filling NAN

In [None]:
(credits.isnull().sum() * 100 / len(credits)).round(2)

In [None]:
credits.info()

In [None]:
credits['character_new'] = credits['character'].fillna('Unknown')
# Created a new column and replaced NaN values with 'Unknown'

In [None]:
credits.head()

#### Merging the Data

In [None]:
final_data = pd.merge(titles, credits, on = 'id', how = 'left')    # Using the left join to merge the 2 data sets

In [None]:
final_data.shape    # rows = 125186, columns = 25

In [None]:
(final_data.isnull().sum() / len(final_data) * 100).round(2)

In [None]:
final_data.drop(['description', 'age_certification', 'imdb_id', 'imdb_score', 'imdb_votes','tmdb_popularity', 'tmdb_score', 'character'], axis = 1, inplace = True)

In [None]:
(final_data.isnull().sum() / len(final_data) * 100).round(2)

In [None]:
final_data = final_data.rename({'description_new' : 'description', 'age_certification_new' : 'age_certification', 'imdb_score_new' : 'imdb_score', 'imdb_votes_new' : 'imdb_votes',
              'tmdb_popularity_new' : 'tmdb_popularity', 'tmdb_score_new' : 'tmdb_score', 'character_new' : 'character'}, axis = 1)

In [None]:
final_data.info()

In [None]:
final_data.head()

In [None]:
(final_data.isnull().sum() / len(final_data) * 100).round(2)

In [None]:
final_data.drop(['person_id'], axis = 1)    # preview only: without inplace = True or reassignment, final_data still contains person_id

In [None]:
final_data['person_id'].value_counts().head(50)

In [None]:
final_data.isnull().sum()

In [None]:
final_data.shape

In [None]:
final_data.head()

In [None]:
final_data[final_data['person_id'].isnull()].head(25)

In [None]:
final_data[['person_id', 'name', 'role', 'character']].isnull().sum()

In [None]:
final_data.info()

In [None]:
final_data.shape

In [None]:
final_data.drop(['person_id'], axis=1, inplace=True)

In [None]:
# Fill missing cast/crew info
final_data[['name', 'role', 'character']] = (
    final_data[['name', 'role', 'character']].fillna('Unknown'))

In [None]:
# Confirm no NaNs remain
final_data.isnull().sum()

#### Finding the Outliers

In [None]:
final_data.describe()

In [None]:
outliers_data = final_data[['runtime', 'imdb_score' , 'imdb_votes', 'tmdb_popularity', 'tmdb_score' ]]

In [None]:
outliers_data.head()

In [None]:
outliers_data.shape

In [None]:
# Define a function to find outliers in one column
def detect_outliers_iqr(df, column):
    # Q1 = 25th percentile, Q3 = 75th percentile
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)

    # IQR = Difference between Q3 and Q1
    IQR = Q3 - Q1

    # Define lower and upper limits for normal data
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR

    # Identify outliers — values less than lower or greater than upper limit
    outliers = df[(df[column] < lower_limit) | (df[column] > upper_limit)]

    # Print the count and range details
    print(f"\n {column.upper()} -> Outliers detected: {outliers.shape[0]}")
    print(f"   Lower limit: {lower_limit:.2f}, Upper limit: {upper_limit:.2f}")

    # Return the rows that are outliers
    return outliers



# Run for each numeric column
for col in outliers_data.columns:      # loop over the numeric columns selected above
    detect_outliers_iqr(final_data, col)

In [None]:
# Derive the outlier range for each numeric column with the same 1.5 * IQR rule used above
outlier_ranges = {}
for col in outliers_data.columns:
    q1, q3 = final_data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outlier_ranges[col] = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Create outlier label columns for all numeric features
for col, (low, high) in outlier_ranges.items():
    final_data[f'{col}_outlier_label'] = final_data[col].apply(
        lambda x: 'Outlier' if (x < low or x > high) else 'Not Outlier'
    )

# Preview results
final_data[[col for col in final_data.columns if 'outlier_label' in col]].head()

In [None]:
final_data.info()
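
The IQR labelling above can also be written without a row-wise `apply`. The following is a minimal sketch on toy data, assuming the same 1.5 × IQR rule; `iqr_bounds` and the `demo` frame are hypothetical names introduced only for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits using the k * IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy runtimes in minutes (illustrative values only)
demo = pd.DataFrame({"runtime": [80, 90, 95, 100, 110, 400]})

low, high = iqr_bounds(demo["runtime"])

# between() is inclusive, so values exactly on a limit are not flagged,
# matching the strict < and > comparisons used in detect_outliers_iqr
demo["runtime_outlier"] = ~demo["runtime"].between(low, high)
print(low, high)
print(demo)
```

Deriving the limits from the data at run time keeps the labels in sync with the DataFrame if an earlier cleaning step changes.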

#### Analysis

In [None]:
# ------------------------------------------------------------
# 🧹 Create a dataset with only NON-OUTLIER rows
# ------------------------------------------------------------

# We'll consider a row as "non-outlier" only if all numeric columns are 'Not Outlier'
final_non_outliers_data = final_data[
    (final_data['runtime_outlier_label'] == 'Not Outlier') &
    (final_data['imdb_score_outlier_label'] == 'Not Outlier') &
    (final_data['imdb_votes_outlier_label'] == 'Not Outlier') &
    (final_data['tmdb_popularity_outlier_label'] == 'Not Outlier') &
    (final_data['tmdb_score_outlier_label'] == 'Not Outlier')
]

print( "Non-outlier rows kept for analysis:", final_non_outliers_data.shape[0])
print("Outlier rows removed:", final_data.shape[0] - final_non_outliers_data.shape[0])

# View a few rows
final_non_outliers_data.head().reset_index(drop=True)


In [None]:
final_non_outliers_data.info()

In [None]:
final_non_outliers_data = final_non_outliers_data.drop(['runtime_outlier_label', 'imdb_score_outlier_label', 'imdb_votes_outlier_label', 'tmdb_popularity_outlier_label',
                        'tmdb_score_outlier_label'], axis = 1).reset_index(drop = True)

In [None]:
final_non_outliers_data.head()

In [None]:
final_non_outliers_data.shape

In [None]:
final_non_outliers_data.isnull().sum()

In [None]:
final_non_outliers_data.head()

In [None]:
final_non_outliers_data.duplicated().sum()   # checking for the duplicated values

In [None]:
final_non_outliers_data = final_non_outliers_data.drop_duplicates().reset_index(drop = True)

In [None]:
final_non_outliers_data.duplicated().sum()

In [None]:
final_non_outliers_data[final_non_outliers_data.duplicated(subset = ['id', 'title'])].shape     # checking for rows that repeat the same id and title

In [None]:
final_non_outliers_data = final_non_outliers_data.drop_duplicates(subset = ['id', 'title'], keep = 'last').reset_index(drop = True)
# dropped the duplicates of id and title and kept the last entry for each

In [None]:
final_non_outliers_data.shape

In [None]:
final_non_outliers_data.head()

In [None]:
final_non_outliers_data[final_non_outliers_data['production_countries'] == '[]'].head()

In [None]:
final_non_outliers_data[final_non_outliers_data['production_countries'] == '[]'].shape

In [None]:
final_non_outliers_data[final_non_outliers_data['genres'] == '[]'].head()

In [None]:
final_non_outliers_data[final_non_outliers_data['genres'] == '[]'].shape

In [None]:
final_non_outliers_data['production_countries'] = final_non_outliers_data['production_countries'].replace({'[]' : 'Unknown'})

In [None]:
final_non_outliers_data[final_non_outliers_data['production_countries'] == '[]'].shape

In [None]:
final_non_outliers_data['genres'] = final_non_outliers_data['genres'].replace({'[]' : 'Unknown'})

In [None]:
final_non_outliers_data[final_non_outliers_data['genres'] == '[]'].shape

In [None]:
final_non_outliers_data['genres'][12]

In [None]:
import ast

def convert_to_list(x):
    if isinstance(x, str):  # if it's a string like "['crime', 'thriller']"
        try:
            return ast.literal_eval(x)  # convert to list
        except (ValueError, SyntaxError):
            return None  # if conversion fails, return None
    return x  # if already list or something else

final_non_outliers_data['genres'] = final_non_outliers_data['genres'].apply(convert_to_list)

In [None]:
final_non_outliers_data['genres'] = final_non_outliers_data['genres'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None)

In [None]:
final_non_outliers_data.head()

In [None]:
(final_non_outliers_data.isnull().mean() * 100).round(2)
# genres has some missing values = 1.68

In [None]:
final_non_outliers_data[final_non_outliers_data['genres'].isnull()].head()

In [None]:
final_non_outliers_data[final_non_outliers_data['genres'].isnull()].shape   # 130 rows have None in genres; the frame has 16 columns

In [None]:
final_non_outliers_data['genres'] = final_non_outliers_data['genres'].apply(lambda x: 'Unknown' if x is None else x)    # replaced None in genres with 'Unknown'

In [None]:
(final_non_outliers_data.isnull().mean() * 100).round(2)

In [None]:
final_non_outliers_data['production_countries'][30]

In [None]:
import ast

def get_first_country(x):
    # 1) Missing values
    if pd.isna(x):
        return 'Unknown'
    # 2) If it's already a list
    if isinstance(x, list):
        return x[0] if len(x) > 0 else 'Unknown'
    # 3) If it's a string
    if isinstance(x, str):
        s = x.strip()
        # common meaningless values
        if s.lower() in ('none', 'nan', 'unknown', "[]", ""):
            return 'Unknown'
        # try safe eval: "['GB', 'CA']" -> list
        try:
            val = ast.literal_eval(s)
            if isinstance(val, list):
                return val[0] if len(val) > 0 else 'Unknown'
            # sometimes literal_eval returns a single string
            if isinstance(val, str) and val.strip():
                return val.strip()
        except Exception:
            # fallback: remove brackets/quotes and split by comma
            s2 = s.strip('[]').replace("'", "").replace('"', '').strip()
            parts = [p.strip() for p in s2.split(',') if p.strip()]
            return parts[0] if parts else 'Unknown'
    # any other unexpected type
    return 'Unknown'

In [None]:
# apply the helper and overwrite production_countries with the first listed country
final_non_outliers_data['production_countries'] = final_non_outliers_data['production_countries'].apply(get_first_country)

In [None]:
final_non_outliers_data['production_countries'][30]

In [None]:
final_non_outliers_data.head()

In [None]:
(final_non_outliers_data == 'Unknown').sum()

In [None]:
((final_non_outliers_data == 'Unknown').mean() * 100).round(2)      # percentage of 'Unknown' values in each column

In [None]:
final_non_outliers_data.head()

In [None]:
final_non_outliers_data.to_csv('Amazon Insights.csv', index=False)    #saving the file

### What all manipulations have you done and insights you found?

1. Data Loading & Basic Checks : For Titles (Primary Data) and Credits (Secondary Data)


* Loaded both datasets using pandas.read_csv().

* Verified the shape → 9871 rows × 15 columns.  (Primary Data)

* Verified the shape → 124235 rows × 5 columns.   (Secondary Data)

* Checked for duplicates and missing values.

* Confirmed that data types were appropriate (numeric columns as float/int, others as object).


----


2.  Column Cleaning and Standardization / Adding Columns :

* Renamed columns where required, created new columns, and dropped columns that were no longer needed.

----

3. Handling Missing Values with “Unknown” (description, age_certification, character, name, role)

----

4. Handling Missing Values with the Median (imdb_score, imdb_votes, tmdb_popularity, tmdb_score)

----

5. Merging the Data with a Left Join : pd.merge(titles, credits, on = 'id', how = 'left')

----

6. Used ast.literal_eval to parse the list-like strings in the genres and production_countries columns, keeping the first entry of each list
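
Steps 3 and 4 can be sketched as one small helper. This is an illustrative sketch rather than the notebook's exact code: `fill_missing` and the `demo` frame are hypothetical, and the helper returns a copy instead of adding `*_new` columns.

```python
import numpy as np
import pandas as pd

def fill_missing(df, text_cols, numeric_cols):
    """Return a copy with text NaNs set to 'Unknown' and numeric NaNs set to the column median."""
    out = df.copy()
    for col in text_cols:
        out[col] = out[col].fillna("Unknown")
    for col in numeric_cols:
        # median() skips NaN by default, so it is computed from observed values only
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy frame with one missing value per column (illustrative values only)
demo = pd.DataFrame({
    "age_certification": ["PG", None, "R"],
    "imdb_score": [6.0, np.nan, 8.0],
})

filled = fill_missing(demo, ["age_certification"], ["imdb_score"])
print(filled)
```

Looping over column names ties each fill to the column it was computed from, which avoids pairing one column's median with a different column.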

**Insights**

* Movies dominate Amazon Prime’s catalog — roughly 70–75% of total titles.

* Most frequent genres: Drama, Comedy, and Thriller.

* Highest-rated genres (average IMDb above 6.5): Documentation, Music, War, and History.

* Content library grows sharply between 2010–2020.

* Majority of content is produced in the United States (US).

* Growing share from India, UK, and Canada — highlighting Amazon’s localization push.
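
Share-style statements such as the movie percentage above can be checked directly from a categorical column. A minimal sketch on toy data, assuming a `type` column with the values MOVIE and SHOW; the `demo` frame is hypothetical and its numbers are not the dataset's.

```python
import pandas as pd

# Toy catalog: three movies and one show (illustrative values only)
demo = pd.DataFrame({"type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"]})

# normalize=True returns fractions instead of counts; scale to percentages
share = demo["type"].value_counts(normalize=True).mul(100).round(1)
print(share)
```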

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Top  Genres

In [None]:
# 0. Setup — imports and basic helpers
import ast
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

%matplotlib inline
sns.set(style='whitegrid', rc={'figure.figsize':(12,6)})

# helper to parse list-like text to Python lists (safe)
def parse_list_cell(x):
    if pd.isna(x): return []
    if isinstance(x, list): return x
    if isinstance(x, str):
        try:
            v = ast.literal_eval(x)
            if isinstance(v, (list,tuple)): return list(v)
            return [v]
        except (ValueError, SyntaxError):
            # fallback: naive split
            s = x.strip()
            if s.startswith('[') and s.endswith(']'): s = s[1:-1]
            parts = [p.strip().strip("'\"") for p in s.split(',') if p.strip()]
            return parts
    return []

# load the merged cleaned dataset
df = pd.read_csv('Amazon Insights.csv', low_memory=False)  # low_memory = False --> "Load the whole file, then decide datatypes for each column."

# parse genres & production_countries to lists (if not already)
if 'genres' in df.columns: df['genres_list'] = df['genres'].apply(parse_list_cell)
else: df['genres_list'] = [[] for _ in range(len(df))]
if 'production_countries' in df.columns: df['countries_list'] = df['production_countries'].apply(parse_list_cell)
else: df['countries_list'] = [[] for _ in range(len(df))]

# create a first_country column (0th index)
df['first_country'] = df['countries_list'].apply(lambda x: x[0] if len(x)>0 else 'Unknown')

# ensure numeric types for common numeric columns
for col in ['imdb_score','tmdb_score','imdb_votes','tmdb_popularity','runtime','release_year','seasons']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

In [None]:
# Chart - 1 visualization code
# Calculate genre counts
genre_count = df['genres'].value_counts().head(10)

ax = sns.barplot(
    x=genre_count.values,
    y=genre_count.index,
    hue=genre_count.index,
    dodge=False,
    legend=False,
    palette="coolwarm"
)
ax.set(
    title="Top 10 Most Common Genres on Amazon Prime",
    xlabel="Number of Titles",
    ylabel="Genre"
)
plt.show()

# Print the counts behind the chart
print("\nTop genres :")
print(genre_count)

##### 1. Why did you pick the specific chart?

* A horizontal bar chart clearly shows which genres dominate, ordered from most to least frequent.

##### 2. What is/are the insights found from the chart?

* Certain genres like Drama, Comedy, and Thriller tend to dominate streaming platforms.

* Indicates content preference trends across viewers.
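
Because the wrangling step keeps only the first genre of each title, the counts in this chart reflect primary genres. The sketch below, on hypothetical toy data, shows how the ranking could differ if every listed genre were counted instead; it assumes the raw `genres` strings look like `"['drama', 'comedy']"`.

```python
import ast
import pandas as pd

# Toy titles with list-like genre strings (illustrative values only)
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'comedy']", "['comedy']", "[]"],
})

# Parse the strings into real Python lists
genre_lists = demo["genres"].apply(ast.literal_eval)

# Primary-genre counts: first entry only, 'Unknown' for empty lists
first_genre_counts = genre_lists.apply(lambda g: g[0] if g else "Unknown").value_counts()

# All-genre counts: one row per (title, genre); empty lists become NaN and are ignored
all_genre_counts = genre_lists.explode().value_counts()

print(first_genre_counts)
print(all_genre_counts)
```

In this toy case comedy ties with drama on primary genre but leads once every listed genre is counted, so the choice of counting rule is worth stating next to the chart.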

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**

* Content Acquisition: The data highlights genres with high existing volume (drama, comedy). Amazon Prime could leverage this to acquire more content in these popular areas, catering to existing demand and potentially increasing subscriber satisfaction and retention.

* Targeted Marketing: The insights can guide marketing campaigns. Promotions can focus on the extensive drama library to attract a broad audience or target niche markets (e.g., sci-fi fans) if data suggests high engagement within that smaller segment.

* Strategic Gaps: The low number of sci-fi titles (174) could be viewed as a market gap. If market research indicates high demand for sci-fi, Amazon Prime could invest heavily in this genre to capture an underserved audience, leading to positive growth in that specific segment.

#### Chart - 2:  Movies Vs Shows

In [None]:
# Chart - 2 visualization code

type_counts = df['type'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(6,4))
type_counts.plot(kind='bar')
plt.title('Count: Movies vs Shows (descending)')
plt.ylabel('Count')
plt.show()

# numeric output in descending order
print(type_counts)

##### 1. Why did you pick the specific chart?

* A bar chart was picked to visually compare the counts of two distinct categorical variables (movies and shows)

##### 2. What is/are the insights found from the chart?

* The primary insight is that the dataset contains significantly more movies than shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Tailor Marketing: Direct marketing efforts primarily towards movie audiences, as this is where the existing strength lies

* Content Acquisition: Focus on acquiring more movie titles to cater to the dominant user preference

* Resource Allocation: Allocate more resources (e.g., budget, promotional staff) to the movie segment, which likely generates more revenue or engagement

#### Chart - 3 : Top Countries Producing the Content

In [None]:
# Chart - 3 visualization code

country_count = df['production_countries'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(
    x=country_count.values,
    y=country_count.index,
    hue=country_count.index,  # add hue to suppress the warning
    dodge=False,
    legend=False,
    palette="mako"
)
plt.title("Top 10 Countries Producing Content on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Country")
plt.show()

# Print the counts behind the chart
print("\nTop countries (descending):")
print(country_count)


##### 1. Why did you pick the specific chart?

* The chart was picked because a horizontal bar chart is an effective way to compare the number of titles produced by different countries and visually represent the ranking and disparities in production volume

##### 2. What is/are the insights found from the chart?

* The primary insight is the overwhelming dominance of the United States (US) in content production for Amazon Prime, with 4135 titles, which is significantly more than any other country

* India (IN) and the United Kingdom (GB) are the next largest producers, but with less than 700 titles each, a massive drop from the US volume.

* Most other listed countries (CA, AU, FR, CN, DE) produce a very small number of titles, all under 350

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**

* Expansion in International Markets: Recognizing the low volume from most non-US countries suggests an opportunity to invest in local content in these regions to attract and retain local subscribers, potentially leading to significant growth in international markets.

* Data Improvement: The large "Unknown" category highlights a need for better data management and labeling. Understanding the origin of this content could reveal hidden market opportunities or data quality issues that need addressing.

#### Chart - 4 : Content per release year (time series) and years ranked

In [None]:
# Chart - 4 visualization code

# Titles per year
titles_per_year = df['release_year'].value_counts().sort_index(ascending=False)

plt.figure(figsize=(14,6))
sns.lineplot(x=titles_per_year.index, y=titles_per_year.values, marker='o')
plt.title("Trend of Content Releases Over Time (Descending by Year)")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.show()
print(titles_per_year.head(22))

##### 1. Why did you pick the specific chart?

* The line chart was chosen because it effectively visualizes the trend and growth rate of content releases over a long period (1920-2022).

##### 2. What is/are the insight(s) found from the chart?

* Historical Stability: From 1920 to approximately 2000, the number of annual releases remained relatively low and stable, generally below 150 titles per year

* Exponential Growth: A sharp, continuous increase in content releases is observed from the early 2000s onwards.

* Peak Release Years: The years 2019 and 2021 show the highest number of releases, with 640 and 645 titles respectively.

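* Recent Dip: There is a notable drop in releases in 2022 (61 titles) compared to the preceding years. This may reflect the dataset having been collected partway through 2022 rather than a real decline.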
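
The peak year and the year-over-year changes mentioned above can be read off the counts directly. A minimal sketch on toy data; the `demo` frame is hypothetical and its numbers are not the dataset's.

```python
import pandas as pd

# Toy release years (illustrative values only)
demo = pd.DataFrame({"release_year": [2019, 2019, 2020, 2021, 2021, 2021, 2022]})

# Titles per year in chronological order
per_year = demo["release_year"].value_counts().sort_index()

# Change in title count versus the previous year (first year has no predecessor)
yoy_change = per_year.diff()

peak_year = per_year.idxmax()
print(per_year)
print(yoy_change)
print(peak_year)
```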

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* Market Saturation & Competition: The exponential growth indicates a highly saturated and competitive market. Businesses can use this insight to focus on niche markets, unique content, or high-quality production to stand out, rather than simply increasing volume.

* Targeted Production: The data highlights specific peak years. Businesses can analyze market conditions during those times to understand successful strategies or identify gaps in years with slightly lower releases.

* Resource Planning: The overall upward trend justifies increased investment in content creation infrastructure, marketing, and distribution channels to meet the ongoing market demand and competition.

#### Chart - 5 : Top Rated Genres By Average IMDB Score

In [None]:
# Chart - 5 visualization code

# Average IMDb score by genre

# Explode the genres list so each genre is counted individually
expl = df.explode('genres_list')

# Group by exploded genre, calculate mean IMDb score and count titles
genre_stats = expl.dropna(subset=['imdb_score']).groupby('genres_list').agg(
    imdb_mean=('imdb_score', 'mean'),
    count=('title', 'count')
)

# Filter genres with at least 30 titles, sort by descending average IMDb score
genre_stats_filtered = genre_stats[genre_stats['count'] >= 30].sort_values('imdb_mean', ascending=False)

# Prepare data for plotting
top_genres = genre_stats_filtered.head(10)
sorted_top_genres = top_genres['imdb_mean'].sort_values(ascending=True)  # For horizontal bar chart (ascending for better visualization)

# Generate color palette using seaborn (matching number of bars)
palette = sns.color_palette("viridis", n_colors=len(sorted_top_genres))

# Plot horizontal bar chart with colors from palette
plt.figure(figsize=(10, 6))
bars = plt.barh(
    sorted_top_genres.index,
    sorted_top_genres.values,
    color=palette
)

plt.title('Top 10 Genres by Average IMDb Score (>= 30 titles)')
plt.xlabel('Avg IMDb Score')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()

print(top_genres)

##### 1. Why did you pick the specific chart?

* A horizontal bar chart was chosen because it effectively compares the average IMDb scores across different genres and makes the ranking clear.

##### 2. What is/are the insights found from the chart?

* Top Performers: Genres like documentation, music, war, and history have the highest average IMDb scores, all above 6.5. This suggests audiences generally rate titles in these specific, often factual or niche, categories higher than others.

* "Unknown" Genre: A significant number of titles (130) fall into an "Unknown" genre with an average score around 6.18, suggesting potential data classification issues that might need addressing.

##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* Content Strategy: A production company could use these insights to focus investment on higher-rated genres (e.g., documentaries or war films) to potentially achieve higher critical acclaim and audience scores, which can lead to awards, better distribution deals, or a positive brand image.

* Genre Specialization: Identifying strong performance in niche genres could lead to specialization, catering to a dedicated audience base.

#### Chart - 6 : Top Genres By Avg Runtime

In [None]:
# Chart - 6 visualization code

avg_runtime = df.groupby('genres')['runtime'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
sns.barplot(
    x=avg_runtime.values,
    y=avg_runtime.index,
    hue=avg_runtime.index,  # add hue to suppress the warning
    dodge=False,
    legend=False,           # hide legend (not needed here)
    palette="cubehelix"
)
plt.title("Top 10 Genres by Average Runtime")
plt.xlabel("Average Runtime (minutes)")
plt.ylabel("Genre")
plt.show()

# Print the avg
print("\nTop Genres with Avg Runtime (descending):")
print(avg_runtime)

##### 1. Why did you pick the specific chart?

* The horizontal bar chart was likely chosen because it effectively compares the average runtimes across different, distinct movie genres and presents the data in a descending order, making it easy to identify the longest and shortest average runtimes at a glance.

##### 2. What is/are the insights found from the chart?

* Genre Runtimes: Among the top 10 genres shown, romance films have the longest average runtime, approximately 98.4 minutes, while horror films have the shortest, approximately 88.1 minutes.

* Runtime Variation: There is a relatively small range in average runtimes across the top 10 genres, with only about a 10-minute difference between the longest and shortest averages.

* Genre Groupings: Genres like romance, European, thriller, and drama tend to have slightly longer runtimes compared to action, fantasy, family, music, crime, and horror films.
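Per-genre averages like these depend on each title being counted once for every genre it belongs to. The sketch below shows one way to do that, assuming a `genres_list` column that holds Python lists (as used in Chart 5); the `toy` data is invented for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's df with list-valued genres
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres_list": [["romance", "drama"], ["horror"], ["drama"], ["horror", "thriller"]],
    "runtime": [100, 80, 110, 90],
})

# One row per (title, genre), then the mean runtime and title count per genre
per_genre = (
    toy.explode("genres_list")
       .groupby("genres_list")["runtime"]
       .agg(avg_runtime="mean", titles="count")
       .sort_values("avg_runtime", ascending=False)
)

print(per_genre)
```

Keeping the `titles` count next to the average makes it easier to spot genres whose average rests on only a handful of titles.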

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* Content Strategy: Studios can use this data to understand audience expectations for runtime within specific genres. For example, knowing that romance films average nearly 100 minutes suggests that audiences expect a more developed narrative.

* Scheduling & Programming: Cinema programmers and streaming services can optimize their schedules. Shorter films (like horror) allow for more screenings per day, potentially increasing ticket sales or viewer turnover.

* Marketing: The data can inform marketing messages, perhaps highlighting a longer runtime for a drama to imply depth, or a shorter runtime for a horror film to emphasize fast-paced action.

#### Chart - 7 :  IMDb Score vs TMDb Score (Correlation)

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(7,5))
sns.scatterplot(x='imdb_score', y='tmdb_score', data=df, alpha=0.6)
plt.title("IMDb Score vs TMDb Score (Correlation)")
plt.xlabel("IMDb Score")
plt.ylabel("TMDb Score")
plt.show()

##### 1. Why did you pick the specific chart?

* A scatter plot is suitable for this purpose because it effectively displays the relationship and correlation between two continuous numerical variables (IMDb Score and TMDb Score).

##### 2. What is/are the insights found from the chart?

* Positive Correlation: Generally, movies with higher IMDb scores also have higher TMDb scores, and vice versa. This indicates that both platforms tend to agree on the relative quality of content.

* Score Distribution: Both platforms seem to have a concentration of scores in the middle range (around 6 to 8), with fewer films receiving very low (below 3) or very high (above 9) scores.
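The visual impression of a positive relationship can be backed by a single number. This is a minimal sketch of computing the Pearson correlation between the two score columns, assuming they are named `imdb_score` and `tmdb_score` as in the plot; the `toy` values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the notebook's df; one row has a missing IMDb score
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, None],
    "tmdb_score": [5.5, 6.0, 7.5, 8.0, 6.0],
})

# Keep only titles where both scores are present
both = toy.dropna(subset=["imdb_score", "tmdb_score"])

# Series.corr uses the Pearson coefficient by default
r = both["imdb_score"].corr(both["tmdb_score"])

print(f"Pearson r = {r:.2f} over {len(both)} titles")
```

Reporting the number of titles alongside the coefficient shows how much data the estimate is based on after missing scores are dropped.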

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Content Acquisition: The general agreement between the two scores suggests that a business can use either metric as a rough indicator of general audience reception when deciding which content to acquire or recommend.

* Marketing Strategy: Films that score high on both platforms are likely universally popular and can be marketed broadly. Films with a significant difference in scores (outliers) might appeal to different audience niches, allowing for targeted marketing campaigns to specific user bases.

* Platform Trust: The overall positive correlation suggests that both platforms are reasonably reliable, which helps a business build trust with its own users if it aggregates or uses these scores.

#### Chart - 8: Most Active Directors

In [None]:
# Chart - 8 visualization code

# Calculate director counts
director_count = df[df['role']=='DIRECTOR']['name'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(
    x=director_count.values,
    y=director_count.index,
    hue=director_count.index,  # add hue to satisfy new API
    dodge=False,
    legend=False,
    palette="crest"            # keep your chosen color palette
)
plt.title("Top 10 Most Active Directors (Descending Order)")
plt.xlabel("Number of Titles Directed")
plt.ylabel("Director")
plt.show()

print(director_count)

##### **1. Why did you pick the specific chart?**

A horizontal bar chart clearly shows which directors are most active and keeps long director names readable.

##### **2. What is/are the insights found from the chart?**

The chart identifies the directors who contribute the most titles to Amazon's catalog.

Top 3 Most Active Directors :
1. Joseph Kane

2. Sam Newfield

3. Jay Chapman

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing which directors are most represented is useful when considering partnerships or future collaborations.

#### Chart - 9: Top Age Certifications

In [None]:
# Chart - 9 visualization code

age_count = df['age_certification'].value_counts().head(10)

plt.figure(figsize=(8, 4))
sns.barplot(
    x=age_count.values,
    y=age_count.index,
    hue=age_count.index,       # link palette to hue
    dodge=False,
    legend=False,              # hides redundant legend
    palette="flare"            # keep your chosen color scheme
)
plt.title("Content Distribution by Age Certification")
plt.xlabel("Number of Titles")
plt.ylabel("Age Certification")
plt.show()

# print in nums
print(age_count)

##### 1. Why did you pick the specific chart?

A horizontal bar chart clearly shows how titles are distributed across age certifications.

##### **2. What is/are the insights found from the chart?**

Many shows and movies have no age certification recorded.
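Because `value_counts()` leaves out missing values by default, the chart itself does not show how many certifications are absent. A minimal sketch of quantifying the gap, assuming the column is named `age_certification` and that missing entries are stored as NaN/None (the `toy` data is made up):

```python
import pandas as pd

# Toy stand-in for the notebook's df
toy = pd.DataFrame({"age_certification": ["R", None, "PG-13", None, "R", None]})

# Absolute number and share of titles without a certification
missing = toy["age_certification"].isna().sum()
missing_share = toy["age_certification"].isna().mean()

# Frequency table that keeps the missing values as their own row
counts_with_missing = toy["age_certification"].value_counts(dropna=False)

print(f"Missing: {missing} titles ({missing_share:.0%})")
print(counts_with_missing)
```

If an earlier cleaning step already replaced missing values with a placeholder such as "Unknown", the placeholder would appear as a regular category instead and `isna()` would report zero.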

##### **3. Will the gained insights help creating a positive business impact?**
Are there any insights that lead to negative growth? Justify with specific reason.

Missing age certifications can lead to financial and legal consequences, loss of audience, and significant damage to the platform's reputation.

#### Chart - 10 : Top titles by IMDb (descending) using vote threshold

In [None]:
# Top titles by IMDb (descending) using vote threshold
# Choose a vote threshold to ensure reliability (e.g., >=1000 votes)
min_votes = 1000
top_imdb = df[df['imdb_votes'].notna() & (df['imdb_votes'] >= min_votes)].sort_values(['imdb_score','imdb_votes'], ascending=[False, False])
# print top 30
display_cols = ['title','type','release_year','imdb_score','imdb_votes','first_country']
print("Top Titles by IMDb (votes >= {}).".format(min_votes))
print(top_imdb[display_cols].head(30).to_string(index=False))

##### 1. What is/are the insight(s) found from the chart?

The table highlights the prominence of specific countries such as India and the US, and the consistently high scores of documentaries and stand-up specials.

* Geographic Diversity and Dominance: While US and Indian titles are frequent, a range of countries are represented (AU, CN, GB, IL, IN, SU, Unknown), indicating a global audience for quality content.

* Genre Performance: Documentaries ("The Test", "NOVA", "Nazi Concentration Camps") and stand-up comedy specials (George Carlin, Bill Hicks, Zakir Khan) consistently achieve high IMDb scores.

* Timeless Appeal: Titles span several decades, from 1945 to 2022, suggesting that quality content can maintain high ratings over long periods ("NOVA" started in 1974, "Time Team" in 1994).

* High Engagement: All listed titles have at least 1000 votes, with some exceeding 14,000, indicating significant audience engagement with these specific titles.

##### 2. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* A business could leverage the insights to invest in genres that consistently perform well, such as documentaries or stand-up comedy, which appear to have a dedicated audience willing to rate them highly.

* Targeted Market Expansion: Recognizing the success of content from various countries (e.g., India, China) can guide efforts to localize content or acquire popular international titles for a global audience.

#### Chart - 11 : Top titles by TMDB popularity (descending)

In [None]:
# Chart - 11 visualization code

# Sort by tmdb_popularity if present
if 'tmdb_popularity' in df.columns:
    top_tmdb = df.dropna(subset=['tmdb_popularity']).sort_values('tmdb_popularity', ascending=False)
    print("Top Titles by TMDB Popularity (descending):")
    print(top_tmdb[['title','type','release_year','tmdb_popularity','tmdb_score']].head(30).to_string(index=False))
else:
    print("tmdb_popularity column not found.")

##### 1. What is/are the insight(s) found from the chart?

* The primary insight is that all top popular titles in this specific dataset have a perfect TMDB score of 10.0.

* The list includes both MOVIE and SHOW types, with release years spanning from 1937 to 2021, indicating that age and format do not prevent a title from being both popular and perfectly scored within this specific dataset's context.

##### 2. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* The list provides a selection of titles that are both popular and have a perfect score. A business could use this list to consider licensing these specific titles, assuming the scores are accurate and reflect genuine audience sentiment, potentially increasing platform engagement and customer satisfaction.

#### Chart - 12 : Top actors by number of appearances (descending)

In [None]:
# Chart - 12 visualization code

if 'role' in df.columns and 'name' in df.columns:
    top_actors = df[df['role']=='ACTOR']['name'].value_counts().sort_values(ascending=False)
    top_20_actors = top_actors.head(20).sort_values(ascending=True)  # ascending for horizontal bar chart

    plt.figure(figsize=(12,8))

    # Use seaborn color palette for bars
    palette = sns.color_palette("magma", n_colors=top_20_actors.shape[0])

    bars = plt.barh(top_20_actors.index, top_20_actors.values, color=palette)

    plt.title('Top 20 Actors by Appearances (descending)', fontsize=16, fontweight='bold')
    plt.xlabel('Number of Appearances', fontsize=14)
    plt.ylabel('Actor Names', fontsize=14)
    plt.grid(axis='x', linestyle='--', alpha=0.7)

    # Add labels on bars
    for bar in bars:
        width = bar.get_width()
        plt.text(width + 0.5, bar.get_y() + bar.get_height()/2, f'{int(width)}', ha='left', va='center', fontsize=11)

    plt.tight_layout()
    plt.show()
    print(top_actors.head(20))
else:
    print("Columns 'role' or 'name' not found in dataset.")

##### 1. Why did you pick the specific chart?

* A horizontal bar chart was likely chosen because it is effective for comparing the frequency of a categorical variable (actor appearances) when the category labels (actor names) are long.

##### 2. What is/are the insight(s) found from the chart?

* Top Actors: Bear Grylls, Ayano Omoto, Gallagher, and Rick Stein have the highest number of appearances, each with 3 appearances.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* Content Strategy: The insights suggest that the top actors (Bear Grylls, Ayano Omoto, Gallagher, Rick Stein) are likely popular or frequently used. A business could leverage this by featuring these actors in more content or marketing campaigns to capitalize on their established presence and potential audience appeal.

#### Chart 13 - IMDb score distribution (histogram)

In [None]:
# Chart - 13 visualization code

plt.figure(figsize=(8,4))
df['imdb_score'].dropna().hist(bins=25)
plt.title('IMDb Score Distribution')
plt.xlabel('IMDb Score')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

* A histogram was chosen because it is an effective visualization tool for displaying the frequency distribution of a continuous numerical variable, in this case the IMDb score.

##### 2. What is/are the insight(s) found from the chart?

* Peak Frequency: The most frequent scores (the mode of the distribution) are found in the bin centered around 6.0-6.1, with over 1200 entries.

* Score Range: The scores in this dataset range from approximately 3.0 to 9.0.
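The modal bin and the score range can also be read from the data rather than from the picture. The sketch below uses `numpy.histogram` on a toy series; the values and the bin count of 5 are chosen for illustration (the chart above uses 25 bins on the real `imdb_score` column).

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['imdb_score'], including one missing value
scores = pd.Series([5.8, 6.0, 6.1, 6.1, 6.2, 7.4, 8.9, 3.1, None])

# Same binning idea as the histogram plot, but returning the numbers
counts, edges = np.histogram(scores.dropna(), bins=5)

# Index of the fullest bin and its lower/upper edges
top = counts.argmax()
modal_bin = (edges[top], edges[top + 1])

print(f"Score range: {scores.min()} to {scores.max()}")
print(f"Modal bin: {modal_bin[0]:.2f} to {modal_bin[1]:.2f} with {counts[top]} titles")
```

Applying this with `bins=25` to the real column would give the exact edges and count of the tallest bar.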

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* Content Strategy: A business (e.g., a streaming service or production house) can use this to understand audience preferences. Since most content falls in the average range, there might be an opportunity to invest in higher-quality content (scores 7.5+) to capture a premium audience or focus on producing "average" content efficiently for mass market appeal.

* Targeted Marketing: Marketing efforts can be tailored. Movies in the 6.0-7.0 range could be marketed as "solid, reliable entertainment," while those in the 8.0+ range can be highlighted for their critical acclaim.

* Inventory Management: The data suggests a large volume of movies in the 5.0-7.0 range. A business can ensure its library is well-stocked with this type of content to meet average demand.

#### Chart 14 : Runtime Distribution

In [None]:
#  visualization code

plt.figure(figsize=(8,4))
df['runtime'].dropna().hist(bins=30)
plt.title('Runtime Distribution (all titles)')
plt.xlabel('Runtime (minutes)')
plt.show()

print("Median runtime by type:")
print(df.groupby('type')['runtime'].median().sort_values(ascending=False))

##### 1. Why did you pick the specific chart?

* A histogram is ideal for displaying the distribution of a continuous numerical variable (runtime in minutes).

* It groups runtimes into bins and shows the count (frequency) of titles within each bin.

##### 2. What is/are the insight(s) found from the chart?

* The majority of titles have runtimes between approximately 80 and 110 minutes, with the highest frequency bin being 90-95 minutes (around 800 titles).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* Content Strategy: The data confirms a strong preference or market saturation in the 90-100 minute movie length and 50-minute show length. A business can use this to optimize production or acquisition of new content that aligns with these popular lengths.

* Scheduling and Curation: Understanding typical runtimes allows for better scheduling of content slots, creating a seamless user experience (e.g., grouping similar-length content).

* Targeted Marketing: The bimodal nature suggests the user base is segmented into those who prefer feature-length content and those who prefer shorter episodes. Marketing efforts can be tailored to these specific preferences.

#### Chart - 15 :  Top country-genre pairs

In [None]:
#  visualization code
pair_df = df.explode('genres_list').explode('countries_list')
pair_counts = pair_df.groupby(['countries_list','genres_list']).size().reset_index(name='count').sort_values('count', ascending=False)
print("Top country-genre pairs (descending):")
print(pair_counts.head(30).to_string(index=False))

##### 2. What is/are the insight(s) found from the chart?

* Dominance of US Content: The US is overwhelmingly the primary source country for content, especially in genres like drama, comedy, thriller, western, and horror. The top three country-genre pairs are all US-based.

* Popular Genres: Drama and comedy are consistently the most frequent genres across multiple countries (US, IN, GB, CA, AU).

* Key Markets: The United States (US), Great Britain (GB), and India (IN) are the most prominent countries in the top 30 list.

* Specific Pairings: Certain countries show strong preferences for specific genres, such as India with romance and comedy, and the US with a wide variety including documentation and westerns.

* "Unknown" Data: A significant amount of data is categorized as "Unknown" for both country and genre, which indicates a data quality issue in the source dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

* The data can inform content acquisition and production strategies. A streaming service, for example, could focus resources on acquiring or producing more US drama and comedy content, as the data suggests high availability or demand in these areas. It helps identify key markets and popular genres to target for specific regions, optimizing investment and potentially increasing subscriber engagement and retention.

Negative Growth Insights:

* The "Unknown" category is a significant issue.

Reason: Without knowing the country or genre for these entries, the business cannot accurately target or leverage this content. This missing information leads to incomplete market understanding, potentially missed opportunities in specific regions, and inefficient resource allocation. Addressing the data quality issue (reducing "Unknowns") is crucial to avoid negative impacts from uninformed decisions.

#### Chart - 16 : Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

num_cols = [c for c in ['runtime','imdb_score','tmdb_score','imdb_votes','tmdb_popularity'] if c in df.columns]
plt.figure(figsize=(8,6))
sns.heatmap(df[num_cols].corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation matrix (numeric columns)')
plt.show()

##### 1. Why did you pick the specific chart?

* A correlation matrix heatmap was picked because it is an efficient and visually intuitive way to display the correlation coefficients between multiple numeric variables simultaneously

##### 2. What is/are the insight(s) found from the chart?

* tmdb_score and tmdb_popularity show a fairly strong positive correlation (0.66), suggesting that higher-rated titles on TMDB tend to be more popular on that platform.

* imdb_score and tmdb_score have a moderate positive correlation (0.44), indicating a general but far from perfect agreement between the two rating systems.

* imdb_votes has only a weak positive correlation with imdb_score (0.13), tmdb_score (0.12), and tmdb_popularity (0.26), meaning titles with more votes tend to have only slightly higher scores and popularity.
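Reading individual cells off a heatmap gets tedious as the number of columns grows. One option, sketched here on toy data, is to flatten the correlation matrix into a ranked list of column pairs; the `toy` values are invented, and only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the numeric columns of the notebook's df
toy = pd.DataFrame({
    "runtime": [90, 100, 95, 120, 45],
    "imdb_score": [6.0, 7.0, 5.5, 8.0, 6.5],
    "tmdb_score": [6.2, 6.8, 5.9, 7.9, 6.4],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: one row per column pair, strongest relationship first
pairs = (
    corr.where(mask)
        .stack()
        .dropna()
        .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a plain descending sort would push to the bottom.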

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* Content Strategy: The correlation between tmdb_score and tmdb_popularity suggests that focusing on producing or acquiring high-quality content (likely to get high scores) may be a viable strategy to drive platform popularity and user engagement, although correlation alone does not establish cause and effect.

* Marketing: Since runtime has little correlation with popularity or scores, marketing efforts can focus on other factors like genre, cast, or ratings, rather than the length of the film.

* Platform Synergy: The moderate correlation between IMDB and TMDB scores indicates that data from one platform can be a useful, though not perfect, predictor for performance on the other platform.

#### Chart - 17 :  Pair Plot

In [None]:
# Pair Plot visualization code

# These columns help show relationships between ratings, year, and duration
num_cols = ['imdb_score', 'tmdb_score', 'runtime', 'release_year']

# Create the pair plot
sns.pairplot(
    df[num_cols],                  # Only include selected numeric columns
    diag_kind='kde',               # Show smooth kernel density plots on diagonal
    corner=True,                   # Only show lower triangle to avoid duplication
    plot_kws={'alpha': 0.6}        # Make points slightly transparent
)

# Add a title and display
plt.suptitle("Pair Plot - Ratings, Runtime, and Release Year Analysis", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

* A pair plot is ideal for exploring relationships between multiple numerical variables simultaneously.

* It shows both the distribution of each variable (using KDE plots on the diagonals) and the relationships/correlations between them (scatterplots off-diagonal).

##### 2. What is/are the insight(s) found from the chart?

* IMDb and TMDB scores have a positive correlation cluster: higher IMDb scores tend to associate with higher TMDB scores.

* The runtime distributions show most content ranges roughly between 60-120 mins, with some peak around 90 mins.

* Release years spread mostly after 1950, with a cluster especially near recent decades (~2000s onward).

* No strong linear correlations are apparent between runtime and release year or scores, indicating potentially complex relationships.

* Scatterplots seem fairly dense with many points, indicating large dataset size.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* Understanding correlation between IMDb and TMDB scores helps in reliable audience rating analysis, influencing content acquisition, production, and recommendations.

* Runtime insights support optimized content length decisions for viewer engagement.

* Release year distribution reveals catalog age and gaps, encouraging targeted acquisition for newer or classic content.

#### Chart - 18 :  IMDB Rating Vs Release Year

In [None]:
import plotly.express as px

fig = px.scatter(
    df,
    x='release_year',
    y='imdb_score',
    color='genres',
    size='tmdb_popularity',
    hover_data=['title', 'first_country'],
    title='Interactive IMDb Ratings vs Release Year'
)
fig.show()

##### 1. Why did you pick the specific chart?

* An interactive scatter plot was chosen to visualize the relationship between IMDb score, release year, and genre in a single view, with marker size encoding TMDB popularity.

##### 2. What is/are the insight(s) found from the chart?

* Production Volume: The density of data points significantly increases from left (earlier years) to right (later years), indicating a substantial increase in the number of movies released annually.

* Rating Distribution: IMDb scores for most movies tend to cluster between approximately 5 and 8 across all release years.

* Genre Diversity: A wide variety of genres are represented throughout the decades, with no single genre consistently dominating the highest or lowest scores.

* High Scores: While high scores (above 8) exist across the entire timeline, they appear more sporadically in recent years compared to the sheer volume of average-scoring films.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact :

* Market Growth: The massive increase in movie production suggests a large and growing market, offering opportunities for new content production and distribution platforms.

* Genre Targeting: By identifying which genres consistently perform well or are underrepresented, businesses can tailor their production efforts to meet audience demand or fill market niches.


* Audience Preference: Analyzing the concentration of scores can help determine the general quality expectations of the audience, guiding production quality standards.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective, the analysis uses data-driven methods to:

**Explore Content Composition** :

* Analyze the mix of Movies vs TV Shows to understand content balance.

* Examine Genre Distribution to find the most dominant and underrepresented genres.

**Study Trends Over Time** :

* Visualize number of releases per year to identify how Amazon Prime’s content library has grown.

* Observe runtime trends to see how show length evolved over the years.

**Assess Quality & Popularity** :

* Compare IMDb scores and TMDb scores to evaluate audience satisfaction.

* Highlight top-rated genres, actors, and directors, showing who and what type of content performs best.

**Regional Insights** :

* Analyze production countries to see which regions contribute most to Prime’s content — especially focusing on dominant markets like the U.S., U.K., and India.

**Deliver Actionable Insights** :

* Provide summarized visuals and metrics to help content strategists choose which genres or creators to invest in.

* Enable marketing teams to promote trending genres or top-rated content.

* Support investors with trends in volume, quality, and audience engagement.

* Expand non-U.S. production to attract diverse audiences globally.

* Monitor user rating trends regularly to refine recommendation and acquisition strategies.

# **Conclusion**

* Movies dominate Amazon Prime’s catalog compared to TV shows — suggesting a movie-first strategy, but TV content is steadily increasing over the years.

* Drama and Comedy are the most common genres, while Thriller and Action maintain consistent viewer interest, making them strong performers in engagement and ratings.

* IMDb scores cluster around 6, with most titles falling between 5.0 and 7.0 (see the IMDb score distribution), so overall reception is average and leaves room for improvement with higher-quality content.

* Recent years (2015–2020) show the largest increase in titles, indicating rapid content expansion before plateauing — aligning with Amazon’s global expansion period.

* Certain actors/directors appear more frequently, suggesting Amazon has preferred collaborations. Identifying top-rated collaborators could strengthen exclusive content deals.

* U.S. dominates content production, but there’s growing representation from India, U.K., and Canada, showing progress toward regional diversification.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***