# **Project Name**    - Exploratory Data Analysis of Amazon Prime TV Shows and Movies




##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Member  -** Keshika Swetha D


# **Project Summary -**

With the rapid growth of online streaming platforms, understanding content performance and audience preferences has become essential for data-driven decision-making. This project presents an exploratory data analysis (EDA) of the Amazon Prime Video movies and TV shows dataset with the objective of extracting meaningful insights related to content distribution, trends over time, genre preferences, audience ratings, and popularity metrics. The dataset consists of two primary files: titles, which contains information about movies and TV shows, and credits, which provides cast and crew details. Together, these datasets offer a comprehensive view of the platform’s content library.

The initial phase of the project involved data understanding and preprocessing. This included loading the datasets, inspecting their structure, handling missing values using context-based logic, removing duplicate records, validating data types, and converting list-like string columns such as genres and production countries into usable formats. Special care was taken when handling rating-related missing values to avoid introducing bias into the analysis. These steps ensured that the datasets were clean, consistent, and suitable for further exploration.

Following data preparation, univariate analysis was performed to understand individual variables. This included analyzing the distribution of movies versus TV shows, release year trends, runtime patterns, genre frequencies, and rating distributions. The analysis revealed that Amazon Prime Video’s catalog is heavily dominated by movies, with TV shows forming a smaller portion. Content production was found to increase significantly after 2010, indicating a strong focus on recent releases. Drama, Comedy, and Action emerged as the most common genres, reflecting mainstream audience preferences.

Bivariate analysis was then conducted to examine relationships between pairs of variables. Visualizations such as box plots, scatter plots, and line charts were used to analyze relationships between content type and ratings, runtime and ratings, and release year trends. These analyses showed that TV shows tend to have more consistent ratings compared to movies, while runtime and release year have minimal influence on audience ratings. This highlighted that content quality is not determined by duration or recency alone.

To gain deeper insights, multivariate analysis techniques were applied, including scatter plots with multiple encodings, correlation heatmaps, and pair plots. The correlation analysis demonstrated strong agreement between IMDb and TMDB ratings, which suggests that the two sources capture audience perception consistently and supports using ratings as an indicator of perceived content quality. However, popularity metrics such as IMDb votes and TMDB popularity showed only moderate correlation with ratings, indicating that high popularity does not always imply high-quality content. The pair plot further confirmed weak relationships among most numerical variables, emphasizing that content success is influenced by multiple interacting factors.

Overall, this project demonstrates the value of exploratory data analysis in transforming raw streaming data into actionable business insights. The findings support informed decision-making related to content acquisition, recommendation strategies, genre prioritization, and regional expansion. By adopting a balanced and data-driven approach, Amazon Prime Video can enhance audience engagement, improve content quality perception, and strengthen its competitive position in the streaming market.

# **GitHub Link -**

https://github.com/DKSwetha/Exploratory-Data-Analysis-of-Amazon-Prime-TV-Shows-and-Movies

# **Problem Statement**


The objective of this project is to perform Exploratory Data Analysis (EDA) on the Amazon Prime Movies and TV Shows dataset to uncover meaningful insights related to content type, release trends, genre popularity, country-wise distribution, and audience ratings. These insights can help stakeholders better understand the platform’s content strategy and identify opportunities for data-driven decision-making.


#### **Define Your Business Objective?**

The key business objectives of this analysis are:

1. Analyze Content Distribution -
To understand the proportion of movies versus TV shows available on Amazon Prime Video.

2. Identify Content Trends Over Time -
To analyze how content production has evolved across different release years, identifying growth patterns and peak periods.

3. Understand Genre Popularity -
To determine the most common and popular genres, helping assess audience preferences.

4. Country-wise Content Analysis -
To analyze the distribution of content associated with different countries and understand regional representation within the platform’s catalog.

5. Analyze Audience Ratings -
To study the distribution of content ratings and understand the target audience demographics.

6. Support Business Decision-Making -
To provide insights that can assist in content acquisition, recommendation strategies, and market expansion planning.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Ignore warnings for clean output
import warnings
warnings.filterwarnings('ignore')

# Apply Seaborn's whitegrid style to all plots
sns.set(style="whitegrid")


The required Python libraries were imported to support data analysis and visualization tasks. Pandas and NumPy were used for efficient data manipulation and numerical operations. Matplotlib and Seaborn were utilized for creating informative visualizations to identify patterns and trends in the dataset. Additionally, the Seaborn whitegrid style was applied to enhance the visual clarity of plots by adding light grid lines on a white background, making comparisons and trends easier to interpret. Warning messages were suppressed to ensure clean and readable outputs during exploratory data analysis.

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

titles_df = pd.read_csv('/content/drive/MyDrive/EDA - Amazon prime dataset/titles.csv')
credits_df = pd.read_csv('/content/drive/MyDrive/EDA - Amazon prime dataset/credits.csv')




The datasets required for this analysis, namely titles.csv and credits.csv, were stored in Google Drive and accessed directly within the Google Colab environment by mounting Google Drive. After establishing the connection, the datasets were imported into the notebook using the Pandas library. This approach ensures persistent access to the data across sessions and avoids the need for repeated manual uploads. The loaded datasets were then stored as Pandas DataFrames, enabling efficient manipulation and further exploratory data analysis.
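Because the paths above only exist in a Colab session with Google Drive mounted, the cell fails with a long traceback anywhere else. The following is a minimal sketch of a loader with basic error handling. The helper name `load_csv` and its `data_dir` argument are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv(data_dir, filename):
    """Read one CSV from data_dir, failing with a clear message if it is missing."""
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset file not found: {path}")
    return pd.read_csv(path)


# Possible usage in this notebook (path copied from the loading cell above):
# DATA_DIR = '/content/drive/MyDrive/EDA - Amazon prime dataset'
# titles_df = load_csv(DATA_DIR, 'titles.csv')
# credits_df = load_csv(DATA_DIR, 'credits.csv')
```

Keeping the folder in a single variable means only one line has to change when the notebook is run outside Colab.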

### Dataset First View

In [None]:
# View first 5 rows of titles dataset
titles_df.head()


In [None]:
# View first 5 rows of credits dataset
credits_df.head()


To gain an initial understanding of the datasets, the first few records of both titles.csv and credits.csv were displayed using the head() function. This step helps in identifying the structure of the data, column names, and sample values, providing a preliminary overview before performing detailed exploratory data analysis.

### Dataset Rows & Columns count

In [None]:
print("Titles Dataset:")
print("Number of rows:", titles_df.shape[0])
print("Number of columns:", titles_df.shape[1])

print("\nCredits Dataset:")
print("Number of rows:", credits_df.shape[0])
print("Number of columns:", credits_df.shape[1])




The number of rows and columns in each dataset was obtained using the shape attribute to understand the dataset size.

### Dataset Information

In [None]:
# Structure of titles dataset
print("Titles Dataset Info:")
titles_df.info()

# Structure of credits dataset
print("\nCredits Dataset Info:")
credits_df.info()


Titles Dataset-

1. Contains 9,871 records and 15 columns.

2. Includes a mix of categorical (object) and numerical (int, float) attributes.

3. Key categorical columns: title, type, genres, production_countries.

4. Key numerical columns: release_year, runtime, imdb_score, tmdb_score.

5. Significant missing values observed in: age_certification, seasons (mostly applicable to TV shows), imdb_score, imdb_votes, and tmdb_score.

6. The seasons column has limited values, indicating it applies mainly to TV shows.

Credits Dataset-

1. Contains 124,235 records and 5 columns.

2. Majority of columns are categorical, with person_id as a numerical identifier.

3. The character column contains missing values, likely for non-acting roles.

4. All other columns (id, name, role) are fully populated.

#### Duplicate Values

In [None]:
# Check duplicate rows in titles dataset
print("Duplicate rows in Titles dataset:", titles_df.duplicated().sum())

# Check duplicate rows in credits dataset
print("Duplicate rows in Credits dataset:", credits_df.duplicated().sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing values in Titles dataset:")
print(titles_df.isnull().sum())

print("\nMissing values in Credits dataset:")
print(credits_df.isnull().sum())


In [None]:
# Visualizing the missing values
plt.figure(figsize=(12,6))
sns.heatmap(
    titles_df.isnull(),
    cmap='YlGnBu',
    cbar=True,
    yticklabels=False
)
plt.title("Missing Values Heatmap – Titles Dataset")
plt.show()

titles_missing_percent = (titles_df.isnull().sum() / len(titles_df)) * 100

titles_missing_percent[titles_missing_percent > 0].plot(
    kind='bar',
    figsize=(10,4),
    title='Percentage of Missing Values per Column – Titles Dataset'
)

plt.ylabel('Missing Values (%)')
plt.show()

In [None]:
plt.figure(figsize=(8,4))
sns.heatmap(
    credits_df.isnull(),
    cmap='YlGnBu',
    cbar=True,
    yticklabels=False
)
plt.title("Missing Values Heatmap – Credits Dataset")
plt.show()

credits_missing_percent = (credits_df.isnull().sum() / len(credits_df)) * 100

credits_missing_percent[credits_missing_percent > 0].plot(
    kind='bar',
    figsize=(6,4),
    title='Percentage of Missing Values per Column – Credits Dataset'
)

plt.ylabel('Missing Values (%)')
plt.show()


Missing values were analyzed using both heatmaps and bar charts. The heatmaps were used to observe the distribution patterns of missing data, while bar charts provided a clear quantitative comparison of missing values across columns for both datasets.
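The same information can also be collected in one table, which is easier to quote than a chart. This is a small sketch using a hypothetical helper `missing_summary` and a toy DataFrame with illustrative values. It does not reproduce the real dataset's numbers.

```python
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_percent': (counts / len(df) * 100).round(2),
    })
    summary = summary[summary['missing_count'] > 0]
    return summary.sort_values('missing_percent', ascending=False)


# Toy frame standing in for titles_df (values are illustrative only)
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'seasons': [None, None, None, 2.0],
    'imdb_score': [6.1, None, 7.0, 5.5],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook, `missing_summary(titles_df)` and `missing_summary(credits_df)` would give the figures behind the bar charts.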

### What did you know about your dataset?

The titles dataset contains 9,871 records and 15 columns, representing movies and TV shows. Most core attributes such as id, title, type, release_year, runtime, genres, and production_countries are almost 100% complete, indicating good data quality for content-level analysis. The description column has very low missing values (≈1%), meaning most titles are well documented. This makes the dataset reliable for analyzing content distribution, trends over time, and genre-based insights.

However, the dataset shows significant missing values in certain columns. The seasons column has the highest missing percentage (≈85%), which is expected because this attribute applies only to TV shows and not movies. The age_certification column has approximately 65% missing values, indicating that age ratings are not consistently available. Rating-related columns such as imdb_score, imdb_votes, and tmdb_score show moderate missingness, suggesting that not all titles have audience ratings or popularity metrics. These findings indicate that while the dataset is strong for descriptive analysis, some features may need to be excluded or handled carefully during further analysis.

The credits dataset contains 124,235 records and 5 columns, providing cast and crew information linked to titles using a common id. Most columns, including person_id, id, name, and role, are 100% complete, indicating high data reliability. This completeness makes the dataset suitable for analyzing cast involvement, role distribution, and participation patterns across titles.

The only column with missing values is character, which has approximately 13% missing data. This missingness is context-based and likely corresponds to non-acting roles such as directors, writers, or producers who do not portray characters. Since the remaining columns are fully populated, the dataset does not suffer from major data quality issues. Overall, the credits dataset complements the titles dataset well and enables extended analysis when merged, such as studying relationships between content type and cast participation.
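The explanation that missing values are structural (seasons missing for movies, character missing for non-acting roles) can be checked directly rather than assumed. Below is a minimal sketch with a hypothetical helper `missing_rate_by_group`. The toy data is illustrative only and does not reflect the real proportions.

```python
import pandas as pd


def missing_rate_by_group(df, column, by):
    """Percentage of missing values in `column` within each level of `by`."""
    return df[column].isnull().groupby(df[by]).mean().mul(100).round(2)


# Toy stand-in for credits_df (illustrative values only)
toy_credits = pd.DataFrame({
    'role': ['ACTOR', 'ACTOR', 'DIRECTOR', 'DIRECTOR'],
    'character': ['Hero', None, None, None],
})
rates = missing_rate_by_group(toy_credits, 'character', 'role')
print(rates)
```

On the real data, `missing_rate_by_group(credits_df, 'character', 'role')` and `missing_rate_by_group(titles_df, 'seasons', 'type')` would show whether the missingness is confined to the expected groups.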

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Titles Dataset Data Types:")
print(titles_df.dtypes)

print("\nCredits Dataset Data Types:")
print(credits_df.dtypes)




In [None]:
# Dataset Describe
print("Titles Dataset Description:")
print(titles_df.describe(include='all'))

print("\nCredits Dataset Description:")
print(credits_df.describe(include='all'))



### Variables Description

Titles Dataset-

1. id: Unique identifier for each movie or TV show. This column is used to link the titles dataset with the credits dataset.

2. title: Name of the movie or TV show. Multiple titles may share the same name.

3. type: Indicates whether the content is a Movie or a TV Show. The dataset is dominated by movies.

4. description: A brief summary of the content. A small number of records contain missing or generic descriptions.

5. release_year: The year in which the title was released. Values range from early 1900s to recent years.

6. age_certification: Age rating assigned to the title (e.g., R, PG-13). This column has a high percentage of missing values.

7. runtime: Duration of the movie or episode in minutes. Most values fall within standard movie lengths.

8. genres: List of genres associated with the title. Drama is the most frequently occurring genre.

9. production_countries: Country or countries where the content was produced. The United States is the most common.

10. seasons: Number of seasons for TV shows. This column is mostly missing for movies.

11. imdb_id: External identifier linking the title to IMDb.

12. imdb_score: Average audience rating from IMDb, ranging approximately from 1 to 10.

13. imdb_votes: Number of votes received on IMDb, indicating popularity.

14. tmdb_popularity: Popularity score from TMDB, reflecting audience interest.

15. tmdb_score: Average rating from TMDB, typically ranging from 0 to 10.


Credits Dataset-

1. person_id: Unique identifier assigned to each individual involved in the title.

2. id: Identifier of the movie or TV show, used to link with the titles dataset.

3. name: Name of the cast or crew member.

4. character: Name of the character portrayed by the actor. This field contains missing values for non-acting roles.

5. role: Role of the individual in the title, such as Actor or Director. The dataset is primarily actor-focused.

### Check Unique Values for each variable.

In [None]:
# Number of unique values in each column (Titles dataset)
titles_unique = titles_df.nunique()

print("Unique values per column – Titles Dataset:")
print(titles_unique)

# Number of unique values in each column (Credits dataset)
credits_unique = credits_df.nunique()

print("\nUnique values per column – Credits Dataset:")
print(credits_unique)


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# STEP 1: Create working copies
titles_wr = titles_df.copy()
credits_wr = credits_df.copy()

print("Initial dataset shapes:")
print("Titles:", titles_wr.shape)
print("Credits:", credits_wr.shape)

# STEP 2: Handle duplicate records
print("\nChecking duplicates BEFORE removal:")
print("Titles duplicates:", titles_wr.duplicated().sum())
print("Credits duplicates:", credits_wr.duplicated().sum())

titles_wr.drop_duplicates(inplace=True)
credits_wr.drop_duplicates(inplace=True)

print("\nDataset shapes AFTER duplicate removal:")
print("Titles:", titles_wr.shape)
print("Credits:", credits_wr.shape)
print("\nChecking duplicates AFTER removal:")
print("Titles duplicates:", titles_wr.duplicated().sum())
print("Credits duplicates:", credits_wr.duplicated().sum())


# STEP 3: Inspect missing values BEFORE wrangling
print("\nMissing values BEFORE wrangling (Titles):")
print(titles_wr.isnull().sum())

print("\nMissing values BEFORE wrangling (Credits):")
print(credits_wr.isnull().sum())



# STEP 4: Context-aware missing value handling
# Seasons:
# - Applicable only for TV shows
# - Movies logically have 0 seasons
titles_wr['seasons'] = titles_wr['seasons'].fillna(0)

# Age certification:
# - Missing for a large portion of data
# - Explicitly mark as 'Not Rated'
titles_wr['age_certification'] = titles_wr['age_certification'].fillna('Not Rated')

# Description:
# - Fill missing descriptions with a meaningful placeholder
titles_wr['description'] = titles_wr['description'].fillna('Description not available')

# Character:
# - Missing mainly for non-acting roles
credits_wr['character'] = credits_wr['character'].fillna('Not Applicable')



# STEP 5: Validate missing values AFTER handling
print("\nMissing values AFTER handling (Titles):")
print(titles_wr.isnull().sum())

print("\nMissing values AFTER handling (Credits):")
print(credits_wr.isnull().sum())


# STEP 6: Data type corrections
# Convert numerical columns to appropriate types
titles_wr['release_year'] = titles_wr['release_year'].astype(int)
titles_wr['seasons'] = titles_wr['seasons'].astype(int)
titles_wr['runtime'] = titles_wr['runtime'].astype(int)

print("\nData types after correction (Titles):")
print(titles_wr[['release_year', 'seasons', 'runtime']].dtypes)

# STEP 7: Standardize categorical text fields
titles_wr['type'] = titles_wr['type'].str.lower().str.strip()
credits_wr['role'] = credits_wr['role'].str.lower().str.strip()

print("\nUnique values in 'type' after standardization:")
print(titles_wr['type'].unique())

print("\nUnique values in 'role' after standardization:")
print(credits_wr['role'].unique())


# STEP 8: Convert list-like strings to actual lists
# This enables proper genre and country analysis later
import ast
def safe_list_parser(value):
    if isinstance(value, str):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return []
    return value

titles_wr['genres'] = titles_wr['genres'].apply(safe_list_parser)
titles_wr['production_countries'] = titles_wr['production_countries'].apply(safe_list_parser)

print("\nSample parsed genres:")
print(titles_wr['genres'].head())


# STEP 9: Merge datasets (if combined analysis is needed)
merged_wr = pd.merge(
    titles_wr,
    credits_wr,
    on='id',
    how='left'
)

print("\nMerged dataset shape:", merged_wr.shape)
print("\nFirst 5 rows of merged dataset:")
print(merged_wr.head())


# STEP 10: Final validation checks
print("\nFinal cleaned Titles dataset info:")
titles_wr.info()

print("\nFinal cleaned Credits dataset info:")
credits_wr.info()


### What all manipulations have you done and insights you found?

Data Manipulations Performed-

1. Two datasets (titles and credits) were loaded and analyzed separately to maintain data structure.

2. Duplicate records were identified and removed to ensure data integrity.

3. Missing values were handled using context-based logic, such as filling season counts with zero for movies and labeling unavailable age certifications as Not Rated.

4. Rating and popularity-related fields from IMDb and TMDB were intentionally left unfilled to avoid introducing analytical bias.

5. Data types were validated and text fields were standardized to maintain consistency.

6. List-like string columns such as genres and production countries were converted into proper list formats for accurate analysis.

7. The datasets were merged using a common identifier when combined analysis was required.
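A left merge of titles onto credits multiplies rows (one per credit), so it can help to let pandas verify the join shape and flag titles without any credits. The sketch below uses the `validate` and `indicator` options of `pd.merge` on toy frames with illustrative values. It assumes `id` is unique on the titles side, which the notebook has not explicitly checked.

```python
import pandas as pd

# Toy stand-ins for titles_wr and credits_wr (illustrative values only)
toy_titles = pd.DataFrame({'id': ['t1', 't2', 't3'], 'type': ['movie', 'show', 'movie']})
toy_credits = pd.DataFrame({
    'id': ['t1', 't1', 't2'],
    'name': ['Person A', 'Person B', 'Person C'],
    'role': ['actor', 'director', 'actor'],
})

merged = pd.merge(
    toy_titles,
    toy_credits,
    on='id',
    how='left',
    validate='one_to_many',  # raises MergeError if a title id repeats on the left
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Titles with no credits stay in the result with NaN credit fields
titles_without_credits = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
print(merged.shape)
print(titles_without_credits)
```

If the `validate` check raises on the real data, that signals repeated title ids that exact-row duplicate removal did not catch.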


Key Insights from the Dataset-

1. The titles dataset contains 9,871 records, while the credits dataset includes 124,235 records, indicating a large and detailed dataset.

2. The platform’s content is predominantly movie-based, with fewer TV shows.

3. High missing values in the seasons column confirm its relevance mainly to TV shows.

4. Age certification information is incomplete for a large portion of titles.

5. Drama is the most frequently occurring genre, and the United States is the leading production country.

6. Audience ratings show moderate average scores with a skewed popularity distribution.

7. Most missing values are structural and context-driven, rather than data quality errors.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart 1 - Univariate Analysis

In [None]:
# Movies vs TV Shows Count Plot
plt.figure(figsize=(6,4))
sns.countplot(x='type', data=titles_wr)
plt.title('Distribution of Movies vs TV Shows')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart (count plot) was selected because the objective was to compare the number of titles across discrete categories—Movies and TV Shows. Bar charts are the most effective visualization for comparing frequencies between categorical variables, making it easy to identify which content type dominates the platform.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that movies significantly outnumber TV shows on the platform. Of approximately 9,800 titles in the dataset, more than 8,500 are movies, so movies account for over 85% of the available content. The catalog is therefore heavily movie-centric, with a comparatively small share of episodic or series-based content.
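The percentage quoted here can be computed rather than read off the bars. This is a minimal sketch with a hypothetical helper `category_share`, shown on a toy Series with illustrative values.

```python
import pandas as pd


def category_share(series):
    """Percentage share of each category, rounded to one decimal place."""
    return series.value_counts(normalize=True).mul(100).round(1)


# Toy stand-in for titles_wr['type'] (illustrative values only)
toy_type = pd.Series(['movie'] * 6 + ['show'] * 2)
share = category_share(toy_type)
print(share)
```

Running `category_share(titles_wr['type'])` in the notebook would give the exact movie and show percentages.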

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

The dominance of movies can be beneficial as movies:

* Require lower long-term production and maintenance costs compared to TV shows.
* Attract casual and one-time viewers, increasing short-term engagement.
* Enable faster content rotation and easier licensing strategies.

This insight helps the business reinforce a movie-focused acquisition strategy if the goal is quick viewer engagement.

Potential negative impact-

The relatively low number of TV shows may limit:

* User retention, as TV shows encourage binge-watching and repeat platform visits.
* Subscription longevity, since episodic content often keeps users engaged over longer periods.

If competitors offer more TV series, this imbalance could lead to reduced user stickiness, potentially affecting long-term growth.

#### Chart 2 - Univariate Analysis

In [None]:
# Distribution of Release Years
plt.figure(figsize=(8,5))
sns.histplot(titles_wr['release_year'], bins=30, kde=False)
plt.title('Distribution of Release Years')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because it effectively displays the distribution of a numerical variable over a continuous range. It helps identify trends, peaks, and periods with higher content production.

##### 2. What is/are the insight(s) found from the chart?

The distribution shows that the majority of titles were released after the year 2000, with a noticeable increase in content in the last decade. Older titles exist but form a much smaller portion of the dataset, indicating a strong focus on recent and modern content.
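Histogram bins do not line up with calendar decades, so a per-decade count can make the trend easier to state precisely. The following sketch uses a hypothetical helper `titles_per_decade` on toy years that are illustrative only.

```python
import pandas as pd


def titles_per_decade(years):
    """Count titles per decade, e.g. 2015 is counted under 2010."""
    decades = (years // 10) * 10
    return decades.value_counts().sort_index()


# Toy stand-in for titles_wr['release_year'] (illustrative values only)
toy_years = pd.Series([1925, 1999, 2001, 2015, 2019, 2021])
per_decade = titles_per_decade(toy_years)
print(per_decade)
```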

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact-

Focus on recent content aligns with current audience preferences.

Newer titles tend to attract higher engagement and subscriptions.

Helps maintain platform relevance.

Potential negative impact-

Limited older or classic content may reduce appeal for niche or classic-movie audiences.

Over-reliance on recent content could reduce catalog diversity.

#### Chart 3 - Univariate Analysis

In [None]:
# Runtime Distribution
plt.figure(figsize=(8,5))
sns.histplot(titles_wr['runtime'], bins=30, kde=False)
plt.title('Distribution of Runtime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is used because runtime is a continuous numerical variable.
This chart helps in understanding the typical content length, the presence of very short or very long titles, and whether the platform focuses on standard-length movies or extended content.

##### 2. What is/are the insight(s) found from the chart?

Most titles have runtimes concentrated between 60 and 120 minutes. The peak is around 80–100 minutes, which aligns with standard movie lengths. A small number of titles have very short runtimes (close to 1–30 minutes). Very long runtimes (above 200 minutes) are rare and appear as outliers. This indicates that the dataset is dominated by standard-length movies, with limited extreme-duration content.
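What counts as an outlier in the runtime distribution can be made explicit with the common 1.5 x IQR rule. This rule is a choice made for this sketch, not something the notebook specifies, and the helper `iqr_bounds` and the toy runtimes are hypothetical.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Lower and upper fences using the k * IQR rule (k=1.5 by convention)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy stand-in for titles_wr['runtime'] (illustrative values only)
toy_runtime = pd.Series([10, 70, 90, 100, 110, 130, 300])
lower, upper = iqr_bounds(toy_runtime)
outliers = toy_runtime[(toy_runtime < lower) | (toy_runtime > upper)]
print(lower, upper)
print(outliers.tolist())
```

Because the runtime column mixes movie lengths with per-episode lengths for shows, it may be more informative to apply the rule separately per content type.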

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

Standard runtimes are suitable for casual viewing and broad audience appeal.

Predictable content length improves viewer satisfaction and watch completion rates.

Easier scheduling and recommendation planning.

Potential negative impact-

Limited short-form content may reduce appeal to users who prefer quick consumption.

Very few long-format titles may not satisfy users interested in extended or epic storytelling.

#### Chart 4 - Univariate Analysis

In [None]:
# Genre Distribution
# Explode genres so each genre gets its own row
genres_exploded = titles_wr.explode('genres')

# Count top genres
top_genres = genres_exploded['genres'].value_counts().head(10)

# Plot
plt.figure(figsize=(10,5))
sns.barplot(x=top_genres.values, y=top_genres.index)
plt.title('Top 10 Genres Distribution')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen because genres are categorical variables with many categories. This chart makes it easy to compare genre frequencies and clearly identify the most dominant genres in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Drama is the most dominant genre in the dataset. Other frequently occurring genres include Comedy, Action, Thriller, and Romance. The distribution shows that the platform focuses heavily on story-driven and mainstream genres, while niche genres appear less frequently. This indicates a content strategy aimed at broad audience appeal.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

Popular genres such as Drama and Comedy attract a wide audience base.

High-frequency genres improve discoverability and recommendation accuracy.

Aligns well with mainstream viewer preferences.

Potential negative impact-

Overrepresentation of certain genres may reduce content diversity.

Users interested in niche genres (e.g., documentaries, experimental films) may find fewer options, potentially affecting engagement for those segments.

#### Chart  5 - Univariate Analysis

In [None]:
# IMDb Score Distribution
# Use only titles with available IMDb scores
imdb_scores = titles_wr['imdb_score'].dropna()

plt.figure(figsize=(8,5))
sns.histplot(imdb_scores, bins=20, kde=True)
plt.title('Distribution of IMDb Scores')
plt.xlabel('IMDb Score')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a density curve was chosen because IMDb score is a continuous numerical variable. This chart helps understand how ratings are distributed across titles and whether most content is low-rated, average, or highly rated.

##### 2. What is/are the insight(s) found from the chart?

Most IMDb scores are concentrated between 5 and 7. The distribution peaks around 6, indicating that the majority of titles receive average to moderately positive ratings. Very highly rated titles (scores above 8) are relatively rare. Extremely low-rated titles (below 3) are also uncommon. This suggests that the platform hosts largely mid-rated content, with fewer exceptional outliers.
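Statements such as "most scores fall between 5 and 7" can be backed by the share of titles in each score band. Below is a minimal sketch using `pd.cut`. The band edges and the helper name `score_band_share` are assumptions chosen to match the ranges discussed above, and the toy scores are illustrative only.

```python
import pandas as pd


def score_band_share(scores, edges=(0, 3, 5, 7, 8, 10)):
    """Percentage of non-missing scores falling in each band."""
    bands = pd.cut(scores.dropna(), bins=list(edges), include_lowest=True)
    return bands.value_counts(normalize=True).sort_index().mul(100).round(1)


# Toy stand-in for titles_wr['imdb_score'] (illustrative values only)
toy_scores = pd.Series([2.5, 4.8, 5.5, 5.8, 6.0, 6.2, 6.4, 6.9, 7.5, 8.6, None])
band_share = score_band_share(toy_scores)
print(band_share)
```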

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

A large volume of moderately rated content suggests a consistent baseline of quality.

Average-rated titles can still perform well when supported by effective recommendations and genre preferences.

Helps set realistic quality benchmarks for content acquisition.

Potential negative impact-

Fewer highly rated titles may reduce the platform’s ability to attract users seeking premium or critically acclaimed content.

Overdependence on average-rated content could impact brand perception if competitors offer more top-rated titles.

#### Chart  6 - Univariate Analysis

In [None]:
# TMDB Score Distribution
# Use only titles with available TMDB scores
tmdb_scores = titles_wr['tmdb_score'].dropna()

plt.figure(figsize=(8,5))
sns.histplot(tmdb_scores, bins=20, kde=True)
plt.title('Distribution of TMDB Scores')
plt.xlabel('TMDB Score')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a density curve was chosen because TMDB score is a continuous numerical variable. This chart helps understand the overall quality perception of titles based on TMDB ratings and how scores are spread across the catalog.

##### 2. What is/are the insight(s) found from the chart?

TMDB scores are mostly concentrated between 5 and 7, similar to IMDb scores. The peak of the distribution lies around 6, indicating that most titles receive average ratings. Very high TMDB scores (above 8) are relatively few. Extremely low scores are uncommon. This suggests that the platform’s content generally maintains moderate quality, with limited highly acclaimed titles.
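The similarity between the two rating sources can be quantified on titles that have both scores. This is a small sketch with a hypothetical helper `compare_scores`. The toy frame is illustrative only, so its perfect correlation says nothing about the real data.

```python
import pandas as pd


def compare_scores(df, a='imdb_score', b='tmdb_score'):
    """Compare two score columns on rows where both are present."""
    paired = df[[a, b]].dropna()
    return {
        'n_paired': len(paired),
        'mean_difference': round((paired[a] - paired[b]).mean(), 2),
        'pearson_r': round(paired[a].corr(paired[b]), 2),
    }


# Toy stand-in for titles_wr (illustrative values only)
toy_ratings = pd.DataFrame({
    'imdb_score': [6.0, 7.0, 5.0, None, 8.0],
    'tmdb_score': [6.5, 7.5, 5.5, 6.0, None],
})
comparison = compare_scores(toy_ratings)
print(comparison)
```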

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

A consistent mid-range score indicates stable content quality.

Average-rated content can still perform well when matched with the right audience through recommendations.

Helps the business benchmark expected performance of new content acquisitions.

Potential negative impact-

A smaller proportion of high-scoring titles may limit appeal for users seeking critically acclaimed content.

Heavy reliance on mid-rated titles could affect brand positioning if competitors emphasize premium-rated content.

#### Chart  7 - Univariate Analysis

In [None]:
# IMDb Votes (Popularity) Distribution
# Use only titles with available IMDb votes
imdb_votes = titles_wr['imdb_votes'].dropna()

plt.figure(figsize=(8,5))
sns.histplot(np.log1p(imdb_votes), bins=30, kde=True)
plt.title('Distribution of IMDb Votes (Log Scale)')
plt.xlabel('Log of IMDb Votes')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with log transformation was chosen to visualize IMDb vote counts because popularity data is heavily right-skewed. Using a logarithmic scale makes it easier to observe the distribution of both low- and high-popularity titles without extreme values dominating the chart.

##### 2. What is/are the insight(s) found from the chart?

Most titles have low to moderate IMDb vote counts, indicating limited audience engagement. A small number of titles have extremely high vote counts, forming a long right tail. This confirms that popularity is unevenly distributed, with a few blockbuster titles attracting most audience attention.
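The skew and concentration described above can be checked numerically. This is a minimal sketch on hypothetical vote counts (the `votes` values and the 10% cutoff are assumptions for illustration); on the real data, `votes` would be `titles_wr['imdb_votes'].dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts: many small titles and one blockbuster
votes = pd.Series([10, 20, 30, 50, 80, 100, 150, 200, 500, 100000])

# Sample skewness before and after the log1p transform used in the chart
raw_skew = votes.skew()
log_skew = np.log1p(votes).skew()

# Share of all votes held by the top 10% most-voted titles (cutoff is illustrative)
top_n = max(1, int(len(votes) * 0.10))
top_share = votes.nlargest(top_n).sum() / votes.sum()

print(f"skew raw={raw_skew:.2f}, log1p={log_skew:.2f}, top-10% share={top_share:.1%}")
```

A lower skewness after the transform is what justifies plotting on the log scale, and the top-share figure puts a number on how concentrated popularity is.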

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

A few highly popular titles can act as traffic drivers, attracting users to the platform.

Understanding popularity concentration helps in promoting high-engagement content.

Useful for prioritizing featured content and recommendations.

Potential negative impact-

Heavy dependence on a small number of popular titles may create engagement imbalance.

Lesser-known titles may receive minimal visibility, reducing their potential value.

If popular titles leave the platform, it may impact overall user engagement.

#### Chart 8 - Bivariate Analysis

In [None]:
# Content Type vs IMDb Score (Movies vs TV Shows)
# Use only rows where IMDb score is available
bivariate_df = titles_wr[['type', 'imdb_score']].dropna()

plt.figure(figsize=(8,5))
sns.boxplot(x='type', y='imdb_score', data=bivariate_df)
plt.title('IMDb Score Distribution by Content Type')
plt.xlabel('Content Type')
plt.ylabel('IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

The goal is to compare IMDb ratings across movies and TV shows. A box plot is well suited because it shows the median rating, the spread of ratings, and outliers side by side for easy comparison.

##### 2. What is/are the insight(s) found from the chart?

TV shows generally have a slightly higher median IMDb score compared to movies. Movies show a wider spread of ratings, indicating more variability in quality.
TV shows tend to have more consistently rated content, with fewer extremely low scores. Both content types have some high-rated outliers, but they are relatively rare.
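The median and spread that the box plot shows visually can be tabulated per content type. A minimal sketch, assuming a small hypothetical frame with the same `type` and `imdb_score` columns as `bivariate_df`; the `MOVIE`/`SHOW` labels and the helper `iqr` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for bivariate_df
df = pd.DataFrame({
    'type': ['MOVIE'] * 5 + ['SHOW'] * 5,
    'imdb_score': [3.0, 5.0, 6.0, 7.0, 9.0, 6.0, 6.5, 7.0, 7.5, 8.0],
})

def iqr(s):
    """Interquartile range: the height of the box in a box plot."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per content type with its median score and IQR
summary = df.groupby('type')['imdb_score'].agg(median='median', iqr=iqr)
print(summary)
```

A higher median with a smaller IQR for one type would back the "higher and more consistent" reading with numbers.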

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact-

Higher and more consistent ratings for TV shows suggest stronger audience approval, which may support viewer engagement and retention.

This supports investing in episodic content to increase long-term platform usage.

Potential negative impact-

The wider variability in movie ratings indicates inconsistent quality, which may affect user satisfaction.

Poorly rated movies could reduce trust in recommendations if not filtered properly.

#### Chart 9 - Bivariate Analysis

In [None]:
# Release Year vs Number of Titles
# Count number of titles per release year
year_counts = titles_wr['release_year'].value_counts().sort_index()

plt.figure(figsize=(10,5))
sns.lineplot(x=year_counts.index, y=year_counts.values)
plt.title('Number of Titles Released per Year')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

A line chart was chosen because it effectively represents trends over time. It helps identify growth patterns, spikes, or declines in content production across different years.

##### 2. What is/are the insight(s) found from the chart?

The number of titles released per year remains very low and stable until around the 1930s, indicating limited historical content in the early decades.

A moderate increase is observed between the 1930s and 1950s, followed by a slight decline and relatively steady production through the 1960s–1980s.

From the early 2000s onward, there is a sharp and sustained increase in the number of titles released each year.

The most significant growth occurs after 2010, where annual releases rise rapidly and peak around 2018–2020, crossing 800 titles per year.

The sharp drop after the peak year is likely due to incomplete or partial data for the most recent year, rather than an actual decline in content production.
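If the most recent year is suspected to be partial, trend statements are safer when that year is set aside. A minimal sketch on hypothetical release years (the `release_year` values are made up; on the real data this would be `titles_wr['release_year']`), which also groups counts by decade:

```python
import pandas as pd

# Hypothetical stand-in for titles_wr['release_year']
release_year = pd.Series(
    [1935, 1948, 1995, 2005, 2012, 2015, 2018, 2018, 2019, 2019, 2019, 2021]
)

year_counts = release_year.value_counts().sort_index()

# Assumption: the latest year in the data may be incomplete, so exclude it
latest_year = year_counts.index.max()
complete_counts = year_counts[year_counts.index < latest_year]
peak_year = complete_counts.idxmax()

# Titles per decade, e.g. 2012 and 2019 both fall under 2010
decade_counts = release_year.groupby((release_year // 10) * 10).size()

print(f"peak year (excluding {latest_year}): {peak_year}")
print(decade_counts)
```

Whether the final year really is incomplete depends on when the dataset was collected, which is not stated here, so the exclusion is a precaution rather than a correction.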

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

Increasing content production in recent years shows a focus on modern and relevant content.

Aligns with current viewer preferences for newer releases.

Supports competitive positioning against other streaming platforms.

Potential negative impact-

Heavy concentration on recent releases may reduce catalog diversity.

Users interested in classic or older titles may find fewer options.

#### Chart 10 - Bivariate Analysis

In [None]:
# Runtime vs IMDb Score
# Use only rows with both runtime and IMDb score
runtime_rating_df = titles_wr[['runtime', 'imdb_score']].dropna()

plt.figure(figsize=(8,5))
sns.scatterplot(
    x='runtime',
    y='imdb_score',
    data=runtime_rating_df,
    alpha=0.5
)
plt.title('Runtime vs IMDb Score')
plt.xlabel('Runtime (minutes)')
plt.ylabel('IMDb Score')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is appropriate because it shows the relationship between two continuous numerical variables. It helps identify correlations, clusters, or trends between runtime and IMDb ratings.

##### 2. What is/are the insight(s) found from the chart?

Most titles are clustered between 60 and 150 minutes, with IMDb scores largely between 5 and 7.

There is no strong linear relationship between runtime and IMDb score.

Highly rated titles (IMDb score above 8) appear across a wide range of runtimes, including both short and long durations.

Very long runtimes (above 200 minutes) are rare and show mixed ratings, suggesting that extended duration does not guarantee better audience reception.

Shorter runtimes also display a wide spread of ratings, indicating that ratings show little linear dependence on duration.
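The visual impression of "no strong linear relationship" can be backed by correlation coefficients. A minimal sketch on hypothetical values, constructed so that mid-length titles score highest; the rank-based coefficient is computed as a Pearson correlation of ranks, which is the definition of Spearman correlation.

```python
import pandas as pd

# Hypothetical stand-in for runtime_rating_df
df = pd.DataFrame({
    'runtime':    [60, 90, 120, 150, 180],
    'imdb_score': [5.0, 7.0, 6.0, 7.0, 5.0],
})

# Pearson: strength of the linear relationship
pearson_r = df['runtime'].corr(df['imdb_score'])

# Spearman: Pearson correlation of the ranks (monotonic relationship)
spearman_r = df['runtime'].rank().corr(df['imdb_score'].rank())

print(f"pearson={pearson_r:.3f}, spearman={spearman_r:.3f}")
```

Both coefficients come out at about zero for this toy data even though the scores rise and then fall with runtime. A near-zero coefficient therefore rules out a linear or monotonic trend, not every possible pattern.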


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

Content quality is not dependent on runtime, allowing flexibility in content length.

Shorter or standard-length content can still achieve strong audience ratings.

Helps optimize production costs without compromising audience satisfaction.

Potential negative impact-

Investing heavily in long-duration content does not ensure higher viewer approval.

Excessively long runtimes may risk lower completion rates without clear rating benefits.

#### Chart 11 - Bivariate Analysis

In [None]:
# Genre vs IMDb Score
# Explode genres and keep IMDb score
genre_rating_df = titles_wr[['genres', 'imdb_score']].dropna()
genre_rating_df = genre_rating_df.explode('genres')

# Select top 8 genres by frequency
top_genres = genre_rating_df['genres'].value_counts().head(8).index
genre_rating_df = genre_rating_df[genre_rating_df['genres'].isin(top_genres)]

plt.figure(figsize=(10,5))
sns.boxplot(x='genres', y='imdb_score', data=genre_rating_df)
plt.title('IMDb Score Distribution by Genre')
plt.xlabel('Genre')
plt.ylabel('IMDb Score')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is used to compare IMDb score distributions across multiple genres. It clearly shows:

Median rating per genre

Rating variability

Presence of outliers

This makes it ideal for comparing audience perception across genres.

##### 2. What is/are the insight(s) found from the chart?

Drama consistently shows a higher median IMDb score, indicating stronger audience appreciation.

Comedy and Action show wider rating spreads, suggesting mixed audience reception.

Genres like Thriller and Romance generally fall within the mid-rating range.

Some genres have fewer extreme low ratings, while others show higher variability.

Overall, genre choice influences ratings, but no genre guarantees universally high scores.
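The per-genre comparison can be summarised in a table alongside the box plot. A minimal sketch, assuming a hypothetical frame in which `genres` already holds Python lists (as in `titles_wr` after preprocessing); the genre names and scores are made up.

```python
import pandas as pd

# Hypothetical stand-in for titles_wr[['genres', 'imdb_score']]
df = pd.DataFrame({
    'genres': [['drama'], ['drama', 'comedy'], ['comedy'],
               ['action', 'comedy'], ['drama', 'action']],
    'imdb_score': [7.5, 6.5, 5.0, 4.0, 8.0],
})

# One row per (title, genre) pair; a multi-genre title counts once per genre
exploded = df.explode('genres')

# Count, median and standard deviation of scores per genre, best median first
genre_stats = (
    exploded.groupby('genres')['imdb_score']
    .agg(['count', 'median', 'std'])
    .sort_values('median', ascending=False)
)
print(genre_stats)
```

Because a title with several genres contributes to each of them, the genre groups overlap and differences between medians should be read as descriptive rather than as independent comparisons.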

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

Investing in genres such as Drama can lead to higher audience satisfaction.

Genre-based insights help optimize content acquisition and recommendation systems.

Supports targeted marketing for genres with strong rating performance.

Potential negative impact-

Genres with high variability may produce inconsistent audience response.

Over-investing in a single genre could limit content diversity and audience reach.

#### Chart 12 - Multivariate Analysis

In [None]:
# Content Type × Runtime × IMDb Score
multi_df = titles_wr[['runtime', 'imdb_score', 'type']].dropna()

plt.figure(figsize=(9,5))
sns.scatterplot(
    data=multi_df,
    x='runtime',
    y='imdb_score',
    hue='type',
    alpha=0.6
)
plt.title('Runtime vs IMDb Score by Content Type')
plt.xlabel('Runtime (minutes)')
plt.ylabel('IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

This scatter plot with color encoding was chosen to analyze the combined effect of three variables simultaneously:

Runtime (x-axis)

IMDb Score (y-axis)

Content Type (color: Movie vs TV Show)

The visualization allows direct comparison of how runtime and audience ratings behave separately for movies and TV shows, which would not be possible using univariate or bivariate plots alone.

##### 2. What is/are the insight(s) found from the chart?

Movies and TV shows form distinct clusters, reflecting differences in typical runtime and rating patterns.

TV shows are concentrated at shorter runtimes (mostly below 60 minutes) and tend to achieve consistently higher IMDb scores, often between 6 and 8.

Movies cover a much wider runtime range (from under 60 minutes to over 300 minutes) and show greater variability in ratings, ranging from very low to very high.

There is no strong relationship between runtime and IMDb score for either content type, indicating that longer duration does not guarantee higher audience ratings.

Extremely long movies appear rarely and show mixed IMDb scores, suggesting uncertain audience reception.
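To check the runtime and rating relationship separately for each content type, the correlation can be computed within each group. A minimal sketch on hypothetical data with the same columns as `multi_df`; the `MOVIE`/`SHOW` labels and all values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for multi_df
df = pd.DataFrame({
    'type': ['MOVIE'] * 4 + ['SHOW'] * 4,
    'runtime': [80, 100, 120, 140, 20, 30, 40, 50],
    'imdb_score': [6.0, 5.0, 7.0, 6.0, 7.0, 7.5, 7.0, 7.5],
})

# Pearson correlation between runtime and score within each content type
per_type_corr = {
    name: group['runtime'].corr(group['imdb_score'])
    for name, group in df.groupby('type')
}

# Typical runtime per type, to confirm the two clusters differ in length
median_runtime = df.groupby('type')['runtime'].median()

print(per_type_corr)
print(median_runtime)
```

Computing the coefficient per group avoids mixing the two clusters, which could otherwise produce a pooled correlation driven mainly by the runtime gap between movies and shows.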

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

TV shows demonstrate more consistent audience approval, supporting their role in long-term user retention and binge-watching behavior.

Since runtime does not strongly influence ratings, content creators can focus on story quality rather than length, enabling cost-efficient production.

The ability to clearly separate performance patterns by content type helps guide targeted investment strategies.

Potential negative impact-

Movies show high variability in audience reception, indicating higher risk in movie content investments.

Producing very long movies may involve higher costs without guaranteed audience appreciation.

Over-reliance on movies alone could reduce engagement compared to a more balanced mix with TV shows.

#### Chart 13 - Multivariate Analysis

In [None]:
# Genre × IMDb Score × Popularity (IMDb Votes)
# Prepare data
genre_multi_df = titles_wr[['genres', 'imdb_score', 'imdb_votes']].dropna()
genre_multi_df = genre_multi_df.explode('genres')

# Keep top genres only
top_genres = genre_multi_df['genres'].value_counts().head(6).index
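# .copy() makes this an independent frame, avoiding a SettingWithCopyWarning when log_votes is added
genre_multi_df = genre_multi_df[genre_multi_df['genres'].isin(top_genres)].copy()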

# Log transform votes for better scaling
genre_multi_df['log_votes'] = np.log1p(genre_multi_df['imdb_votes'])

plt.figure(figsize=(10,6))

sns.scatterplot(
    data=genre_multi_df,
    x='genres',
    y='imdb_score',
    size='log_votes',          # bubble size
    hue='log_votes',           # bubble color
    palette='viridis',         # color gradient
    sizes=(30, 300),
    alpha=0.7,
    legend='brief'
)

plt.title('Genre vs IMDb Score with Popularity (Size & Color)')
plt.xlabel('Genre')
plt.ylabel('IMDb Score')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

This visualization was chosen to analyze three variables together:

Genre (x-axis)

IMDb Score (y-axis)

Popularity (IMDb votes) represented by bubble size and color

It allows us to understand how ratings and popularity interact across different genres, which cannot be captured using univariate or simple bivariate charts.

##### 2. What is/are the insight(s) found from the chart?

Drama shows a wide spread of IMDb scores, with many titles clustered between 6 and 8, indicating generally strong audience reception.

Comedy and Action exhibit high variability in IMDb scores, ranging from very low to very high, suggesting mixed audience response.

Crime and Thriller genres show a relatively concentrated distribution in the mid-to-high rating range, indicating more consistent audience appreciation.

Larger bubbles (higher IMDb votes) appear across all genres, but only a few titles in each genre dominate popularity, confirming that audience engagement is concentrated on select titles.

High popularity does not always align with high IMDb scores, meaning some heavily voted titles are only moderately rated.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact-

Genres like Drama, Crime, and Thriller can be prioritized for quality-driven content acquisition, as they show consistent ratings.

Highly popular titles (large bubbles) can be leveraged for marketing and user acquisition, regardless of genre.

Helps balance investment between critically appreciated content and mass-appeal titles.

Potential negative impact-

Over-investing in highly popular but moderately rated genres may affect brand perception.

Genres with good ratings but lower popularity may be under-promoted, reducing their potential business value.

Relying solely on popularity metrics may overlook long-term audience satisfaction.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Select only numerical columns
numeric_cols = [
    'release_year',
    'runtime',
    'seasons',
    'imdb_score',
    'imdb_votes',
    'tmdb_score',
    'tmdb_popularity'
]

corr_df = titles_wr[numeric_cols]

# Compute correlation matrix
corr_matrix = corr_df.corr()

# Plot heatmap
plt.figure(figsize=(10,6))
sns.heatmap(
    corr_matrix,
    annot=True,
    cmap='coolwarm',
    fmt='.2f',
    linewidths=0.5
)
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is used to-

Measure strength and direction of relationships between numerical variables

Quickly identify strong, weak, or no relationships

Support data-driven decisions instead of visual assumptions

##### 2. What is/are the insight(s) found from the chart?

IMDb score and TMDB score show a moderately strong positive correlation (0.58), indicating broadly consistent audience ratings across both platforms. Titles rated highly on IMDb tend to also receive higher ratings on TMDB.

Runtime has a moderate negative correlation with seasons (-0.32), which is expected since movies typically have longer runtimes, while TV shows have multiple seasons with shorter episode durations.

IMDb votes and TMDB popularity show a weak-to-moderate positive correlation (0.26), suggesting that titles popular on IMDb tend to also gain visibility on TMDB, though popularity is not perfectly aligned across platforms.

IMDb score has only a weak correlation with IMDb votes (0.17), indicating that highly rated titles are not necessarily the most popular ones.

Release year shows very weak correlation with ratings and popularity (≤ 0.13), suggesting that newer content is not automatically better rated or more popular.

Runtime has almost no correlation with IMDb score (-0.10), reinforcing the insight that content length does not significantly influence audience ratings.
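Reading pairs off a heatmap by eye is error-prone, so the strongest relationships can be listed directly from the correlation matrix. A minimal sketch on a hypothetical three-column frame (values are made up); it keeps only the upper triangle so that each pair appears once and the diagonal is skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titles_wr[numeric_cols], including one missing score
df = pd.DataFrame({
    'imdb_score': [5.0, 6.0, 7.0, 8.0, np.nan],
    'tmdb_score': [5.5, 6.0, 7.5, 8.0, 6.0],
    'runtime':    [120, 90, 100, 95, 110],
})

# DataFrame.corr() uses pairwise-complete observations: each coefficient
# is computed from the rows where both columns are present
corr = df.corr()

# Boolean mask for the strict upper triangle (k=1 excludes the diagonal)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format, one row per variable pair, strongest absolute correlation first
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```

Since each coefficient may be based on a different subset of rows, it is worth checking how many non-missing observations each pair has before comparing coefficients across pairs.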

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select relevant numerical columns
pairplot_cols = [
    'release_year',
    'runtime',
    'seasons',
    'imdb_score',
    'imdb_votes',
    'tmdb_score',
    'tmdb_popularity'
]

pairplot_df = titles_wr[pairplot_cols].dropna()

# Create pair plot
sns.pairplot(
    pairplot_df,
    diag_kind='hist',
    plot_kws={'alpha': 0.5, 's': 15}
)

plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is used to:

Visualize pairwise relationships between multiple numerical variables at once

Identify correlations, clusters, and distributions

Validate patterns seen in scatter plots and correlation heatmaps

It combines:

Histograms (diagonal) → distribution of each variable

Scatter plots (off-diagonal) → relationships between variables

This makes it a summary-level multivariate analysis tool.

##### 2. What is/are the insight(s) found from the chart?

Distribution Insights (Diagonal Plots)-

Release Year is heavily skewed toward recent years, confirming that most content was released after the year 2000.

Runtime shows a right-skewed distribution, with most titles clustered between 60 and 150 minutes, and a few very long outliers.

Seasons is highly skewed, with most titles having 1 season, confirming the dominance of movies and short-running shows.

IMDb votes and TMDB popularity are extremely right-skewed, indicating that a small number of titles dominate audience attention.

IMDb score and TMDB score have relatively symmetric distributions centered around 6–7, indicating moderate overall content quality.

Relationship Insights (Off-Diagonal Plots)-

IMDb score and TMDB score show a clear positive relationship, visually confirming the strong correlation seen in the heatmap.

Popularity metrics (IMDb votes and TMDB popularity) show a positive but scattered relationship, indicating partial overlap in audience engagement across platforms.

Runtime shows no visible trend with IMDb or TMDB scores, reinforcing that content length does not strongly affect audience ratings.

Release year has weak visible relationships with ratings and popularity, suggesting newer content is not automatically better rated or more popular.

Seasons show weak and inconsistent relationships with ratings and popularity, indicating that longer-running shows do not necessarily receive higher scores.
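Because the pair plot is drawn after `dropna()` across all selected columns, it only reflects rows that are complete in every column. A minimal sketch for checking how many rows survive, on hypothetical data; it assumes `seasons` is missing for movies, which may or may not be the case in `titles_wr` depending on the earlier preprocessing.

```python
import numpy as np
import pandas as pd

# Hypothetical data; assumption: movies carry no season count
df = pd.DataFrame({
    'runtime':    [100, 95, 45, 30, 120],
    'seasons':    [np.nan, np.nan, 2, 1, np.nan],
    'imdb_score': [6.5, 7.0, 8.0, np.nan, 5.5],
})
cols = ['runtime', 'seasons', 'imdb_score']

# Missing values per column show which column drives the row loss
missing_per_col = df[cols].isna().sum()

# Rows that remain once every column must be present
kept = df[cols].dropna()
retained = len(kept) / len(df)

print(missing_per_col)
print(f"rows kept: {len(kept)} of {len(df)} ({retained:.0%})")
```

If a large share of rows is dropped on the real data, the diagonal histograms and scatter plots describe only that complete subset and should be interpreted accordingly.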

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. Content Distribution Strategy (Movies vs TV Shows)

The analysis shows that movies dominate the platform’s content library, accounting for more than 85% of the total titles, while TV shows form a relatively smaller share. To improve user retention and engagement, it is recommended that Amazon Prime Video increases its investment in high-quality TV shows, particularly limited series and multi-season originals. While movies can continue to be a core offering, a more balanced mix of movies and TV shows can enhance long-term subscriber engagement.

2. Content Trends Over Time

Content production has increased significantly after 2010, with a peak observed between 2015 and 2020. This indicates a catalog weighted toward recent releases. Amazon Prime Video should prioritize acquiring and producing newer content while selectively curating high-performing older titles based on ratings and popularity rather than release year alone. Maintaining consistent content releases over time can help sustain platform growth and viewer interest.

3. Genre-Based Content Strategy

Drama, Comedy, and Action are the most common genres on the platform, while Crime and Thriller genres show relatively consistent audience ratings. It is recommended to continue investing in popular genres such as Drama but with a stronger focus on quality. Additionally, genres like Crime and Thriller can be leveraged to build a reputation for critically well-received content. Improving visibility and promotion of less popular titles within dominant genres can also enhance overall content utilization.

4. Country-wise and Regional Content Expansion

The dataset reveals that content production is dominated by a limited number of countries, indicating lower regional diversity. Amazon Prime Video should expand region-specific and local-language content to cater to diverse audiences. Increasing regional representation can support market expansion, attract new subscribers, and strengthen audience loyalty in underrepresented regions.

5. Audience Ratings and Quality Assessment

IMDb and TMDB scores are positively correlated (0.58), making ratings a reasonably reliable indicator of content quality. However, popularity metrics such as IMDb votes correlate only weakly with ratings (0.17), indicating that a high vote count does not always imply high-quality content. It is recommended that Amazon Prime Video use ratings as a primary quality filter while treating popularity metrics as indicators of engagement rather than quality. Content length (runtime or number of seasons) should not be a deciding factor, as it has minimal impact on ratings.

6. Data-Driven Decision Support

The analysis indicates that no single variable strongly determines content success. Therefore, Amazon Prime Video should adopt a multi-factor decision-making framework that combines ratings, popularity, genre, and regional factors. This approach can improve content acquisition strategies, enhance recommendation systems, and support informed business decisions. Care should also be taken when handling missing data to avoid bias in future analytical models.
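One possible shape for such a multi-factor framework is a weighted composite of normalised rating and popularity. The sketch below is purely illustrative: the titles, the helper `min_max`, and the weights `W_RATING` and `W_POPULARITY` are all assumptions, and genre or regional factors are left out for brevity.

```python
import numpy as np
import pandas as pd

def min_max(s):
    """Rescale a Series to the 0-1 range."""
    return (s - s.min()) / (s.max() - s.min())

# Hypothetical titles: A is well rated but niche, B is popular but mid-rated
titles = pd.DataFrame(
    {'imdb_score': [8.0, 6.0, 7.0], 'imdb_votes': [100, 100000, 10000]},
    index=['A', 'B', 'C'],
)

# Assumed weights; in practice these would be a business decision
W_RATING, W_POPULARITY = 0.6, 0.4

titles['rating_norm'] = min_max(titles['imdb_score'])
# log1p tames the heavy right skew of vote counts before rescaling
titles['popularity_norm'] = min_max(np.log1p(titles['imdb_votes']))
titles['composite'] = (
    W_RATING * titles['rating_norm'] + W_POPULARITY * titles['popularity_norm']
)

ranked = titles.sort_values('composite', ascending=False)
print(ranked[['imdb_score', 'imdb_votes', 'composite']])
```

Changing the weights shifts the ranking between quality-led and popularity-led titles, which makes the trade-off explicit instead of leaving it implicit in a single metric.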

# **Conclusion**

This project performed exploratory data analysis on the Amazon Prime Video movies and TV shows dataset to understand content distribution, trends, audience preferences, and factors influencing content performance. The analysis showed that the platform is largely dominated by movies, with TV shows forming a smaller portion, indicating potential scope for expanding episodic content to improve user engagement.

The study identified a strong growth in content production after 2010, with Drama, Comedy, and Action being the most common genres, while Crime and Thriller demonstrated more consistent audience ratings. Correlation and multivariate analysis confirmed positive agreement between IMDb and TMDB ratings (correlation of 0.58), making them reasonably reliable indicators of content quality. However, popularity metrics showed only weak alignment with ratings, and factors such as runtime, release year, and number of seasons had minimal influence on audience scores.

Overall, the findings highlight the importance of a balanced, data-driven content strategy that considers quality, popularity, genre diversity, and regional representation. The insights from this analysis can support informed decision-making in content acquisition, recommendation strategies, and future platform growth.