#### **Amazon Prime Exploratory Data Analysis Project**

# **Project Name**    - Amazon Prime EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**            - Deepak Kumar


# **Project Summary -**

#### Project Summary: Amazon Prime TV Shows and Movies - Exploratory Data Analysis (EDA) 🎬📊

#### Overview 🌐

This project conducts an Exploratory Data Analysis (EDA) of Amazon Prime Video’s content library to uncover trends, patterns, and insights into content distribution, audience preferences, and platform strategy. The data covers movies 🎥 and TV shows 📺 available in the U.S. and includes both categorical and numerical attributes. It comes as two datasets: one with details about titles and one with details about credits.
The analysis explores both datasets to identify content trends, the most preferred content, areas where content creators might invest, and indicators related to customer engagement and retention.

#### Key Insights 🔍✨

1. Content Distribution 📈
Analyzed the proportion of movies 🎥 vs. TV shows 📺 to understand the platform's content mix.

Identified the dominant type of content and its implications for audience engagement.

2. Genre Trends 🎭
Explored the most popular genres using bar charts 📊 and word clouds ☁️.

Highlighted the genres that dominate the platform and their potential appeal to viewers.

3. Release Year Patterns 📅
Investigated trends in content addition over time, focusing on release years.

Identified periods of peak content production and addition to the platform.

4. IMDb Ratings & Popularity ⭐📈
Analyzed the relationship between IMDb ratings ⭐ and audience engagement (e.g., votes).

Visualized trends using scatter plots 📉 to understand how ratings correlate with popularity.

5. Top Actors & Directors 🎭🎬
Identified the most frequently credited actors 🎭 and filmmakers 🎬.

Created visualizations to highlight key contributors to the platform’s content library.

#### Visualizations 🎨📊

To effectively communicate insights, the following visualizations were employed:

Bar Charts & Histograms 📊: To analyze genre distribution and IMDb rating trends.

Scatter Plots 📉: To explore the relationship between ratings and audience votes.

Heatmaps 🔥: To uncover correlations between key variables.

Network Graphs 🕸️: To visualize connections between actors and their collaborations.

# **GitHub Link -**

#### https://github.com/Deepakkumar7774/Amazon-Prime-EDA-Project

# **Problem Statement**


**Two datasets from Amazon Prime Video are to be analyzed through EDA to draw conclusions for various business problems.**

#### **Define Your Business Objective?**

**The business objectives for the Amazon Prime Video datasets are as follows:**

1) To analyze content trends with respect to diversity and time period.
2) To analyze customer engagement and retention rate based on viewing preferences and content choices.
3) To analyze regional availability and customer engagement with regional content.
4) To analyze how customer preferences change over a specific period of time.
5) To analyze which genres of movies and series customers prefer most.
6) To analyze which strategies and future investments would help the company increase profits and customer engagement.

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import random
from wordcloud import WordCloud
import ast

### Dataset Loading

In [None]:
# Load Dataset

titles = pd.read_csv("/content/titles.csv")
credits = pd.read_csv("/content/credits.csv")

### Dataset Merged

In [None]:
md = pd.merge(titles,credits, on = 'id', how = 'outer')

### Dataset First View

In [None]:
# Dataset First Look
md.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
md.shape

### Dataset Information

In [None]:
# Dataset Info
md.info()

### Data Description

In [None]:
md.describe()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Check for duplicate values
duplicates = md.duplicated(keep=False)

# Count the duplicate values
duplicate_count = duplicates.value_counts()

print(duplicate_count)

### Removing Duplicates

In [None]:
# Keep one row per title id (only the first credit row of each title is retained)
md.drop_duplicates(subset=["id"], keep="first", inplace=True)

### Confirming that the duplicate values are removed

In [None]:
# Check for duplicate values
duplicates = md.duplicated(keep=False)

# Count the duplicate values
duplicate_count = duplicates.value_counts()

print(duplicate_count)

In [None]:
md.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
md.isnull().sum()

In [None]:
# Visualizing the missing values
import missingno as ms
ms.bar(md)
plt.show()

### What did you know about your dataset?

**There are two datasets in total:**
1) titles.csv
2) credits.csv


The titles dataset has 15 columns and the credits dataset has 5 columns; the two share one column, `id`.
The merged dataset has 125130 rows before removing duplicates and 9868 unique rows after, with 19 columns in total.
The seasons and age_certification columns have the highest and second-highest numbers of null values, respectively.
The columns with no null values are id, title, type, release_year, runtime, genres and production_countries.
Among the columns that contain null values, description has the second-fewest.
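
The drop from 125130 merged rows to 9868 rows follows from de-duplicating on `id` after an outer merge. The sketch below reproduces that effect on a tiny invented pair of tables (the values are hypothetical stand-ins, not rows from `titles.csv` or `credits.csv`): the merge produces one row per title-person pair, and keeping the first row per `id` leaves one row per title with only its first credited person.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (all values are invented).
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["A", "B", "C"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
})
credits = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "person_id": [10, 11, 12],
    "name": ["P", "Q", "R"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# The outer merge yields one row per (title, person) pair,
# plus one row for each title that has no credits at all.
merged = pd.merge(titles, credits, on="id", how="outer")

# Keeping the first row per id leaves one row per title, so only
# the first credited person of each title survives.
per_title = merged.drop_duplicates(subset=["id"], keep="first")

print(len(merged), len(per_title), titles["id"].nunique())
```

A one-row-per-title table like this suits title-level charts. Any analysis of actors or directors would need the credit rows that this step discards, for example by working from the credits table directly.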

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
md.columns

In [None]:
# Dataset Describe
md.describe()

### Variables Description

**Title dataset**

1. id - Title ID on JustWatch

2. title - Name of the title

3. type - TV show or movie

4. description - A brief description

5. release_year - Release year

6. age_certification - The age certification

7. runtime - Length of an episode (for a show) or of the movie

8. genres - Genres of the show/movie

9. production_countries - Countries that produced the title

10. seasons - Number of seasons, if it's a show

11. imdb_id - The title ID on IMDB

12. imdb_score - Score on IMDB

13. imdb_votes - Votes on IMDB

14. tmdb_popularity - Popularity on TMDB

15. tmdb_score - Score on TMDB

**Credits dataset**

1. person_id - The person ID on JustWatch

2. id - Title ID on JustWatch

3. name - Actor's or director's name

4. character - The character's name

5. role - Actor or director

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for x in md.columns:
    print(f"{x} - {md[x].nunique()}")

In [None]:
md["type"].value_counts()

In [None]:
md["id"].unique()

In [None]:
md["title"].unique()

In [None]:
md["age_certification"].unique()

In [None]:
md["runtime"].unique()

In [None]:
md["genres"].unique()

In [None]:
md["production_countries"].unique()

In [None]:
md["seasons"].unique()

In [None]:
md["imdb_id"].unique()

In [None]:
md["imdb_votes"].unique()

In [None]:
md["imdb_score"].unique()

In [None]:
md["tmdb_popularity"].unique()

In [None]:
md["tmdb_score"].unique()

In [None]:
md["person_id"].unique()

In [None]:
md["name"].unique()

In [None]:
md["character"].unique()

In [None]:
md["role"].unique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Count missing values in `seasons` among shows
md.loc[md["type"] == "SHOW", ["type", "seasons"]].isna().sum()

In [None]:
# Count missing values in `seasons` among movies
md.loc[md["type"] == "MOVIE", ["type", "seasons"]].isna().sum()

In [None]:
# filling up the seasons column which have null values with value 0
md.fillna({'seasons': 0},inplace=True)

In [None]:
md.fillna({'description': 'Not Available'},inplace=True)

In [None]:
md.fillna({'age_certification': 'Not Available'},inplace=True)

In [None]:
md.fillna({'imdb_id': 'unavailable'},inplace=True)

In [None]:
# Fill every remaining null (scores, votes, popularity and the credit columns) with 0
md.fillna(0, inplace=True)

In [None]:
md.drop_duplicates(inplace=True)

In [None]:
md["genres"]=md["genres"].apply(ast.literal_eval)
md["production_countries"]=md["production_countries"].apply(ast.literal_eval)

In [None]:
mdu=md.explode("genres").explode("production_countries").reset_index(drop=True)

In [None]:
mdu.shape

In [None]:
mdu.fillna("unavailable",inplace=True)

In [None]:
mdu.isna().sum()

In [None]:
mdu.duplicated().sum()

In [None]:
mdu.production_countries=mdu.production_countries.replace("United State of America","US")
mdu.production_countries=mdu.production_countries.replace("XX","unavailable")

In [None]:
# Dictionary mapping country codes to country names

country_mapping = {'SU': 'Soviet Union', 'unavailable': 'Unknown', 'IN': 'India', 'IT': 'Italy', 'JP': 'Japan', 'FR': 'France', 'HK': 'Hong Kong', 'ES': 'Spain', 'IL': 'Israel', 'AU': 'Australia',

'CH': 'Switzerland', 'IE': 'Ireland', 'GR': 'Greece', 'CN': 'China', 'PH': 'Philippines', 'NL': 'Netherlands', 'YU': 'Yugoslavia', 'CI': 'Ivory Coast (Côte d Ivoire)' , 'PR': 'Puerto Rico', 'LI': 'Liechtenstein',

'US': 'United States', 'GB': 'United Kingdom', 'MX': 'Mexico', 'CA': 'Canada', 'DE': 'Germany', 'ZA': 'South Africa', 'PT': 'Portugal', 'CL': 'Chile', 'SE': 'Sweden', 'BR': 'Brazil', 'DK': 'Denmark', 'NZ': 'New Zealand', 'RU': 'Russia', 'LU': 'Luxembourg', 'CZ': 'Czech Republic', 'FI': 'Finland', 'AT': 'Austria', 'SK': 'Slovakia', 'AR': 'Argentina', 'VE': 'Venezuela', 'TH': 'Thailand', 'PL': 'Poland', 'AE': 'United Arab Emirates', 'SI': 'Slovenia', 'BA': 'Bosnia and Herzegovina', 'ID': 'Indonesia', 'NO': 'Norway', 'AF': 'Afghanistan', 'IR': 'Iran', 'IS': 'Iceland', 'BG': 'Bulgaria', 'JM': 'Jamaica', 'RS': 'Serbia', 'SZ': 'Eswatini (Swaziland)', 'LT': 'Lithuania', 'TC': 'Turks and Caicos Islands',

'KR': 'South Korea', 'XC': 'Czechoslovakia', 'HU': 'Hungary', 'TW': 'Taiwan', 'AN': 'Netherlands Antilles', 'MC': 'Monaco', 'CO': 'Colombia', 'RO': 'Romania', 'EG': 'Egypt', 'TR': 'Turkey', 'BE': 'Belgium',

'SG': 'Singapore', 'UY': 'Uruguay', 'BO': 'Bolivia', 'UA': 'Ukraine', 'MY': 'Malaysia', 'TN': 'Tunisia','QA': 'Qatar', 'NG': 'Nigeria', 'KZ': 'Kazakhstan', 'GQ': 'Equatorial Guinea', 'MT': 'Malta', 'SO': 'Somalia', 'KE': 'Kenya', 'MA': 'Morocco', 'VN': 'Vietnam', 'BD': 'Bangladesh', 'FJ': 'Fiji', 'MN': 'Mongolia', 'UG': 'Uganda', 'TT': 'Trinidad and Tobago', 'PK': 'Pakistan', 'XK': 'Kosovo', 'PE': 'Peru', 'DO': 'Dominican Republic', 'SV': 'El Salvador', 'GE': 'Georgia', 'PS': 'Palestine', 'HR': 'Croatia', 'LV': 'Latvia', 'AQ': 'Antarctica', 'LB': 'Lebanon', 'KH': 'Cambodia', 'CR': 'Costa Rica', 'BM': 'Bermuda', '30': 'Jordan', 'PA': "Panama", 'AL' : 'Albania', 'CY' : 'Cyprus', 'CU' : 'Cuba', 'PY' : 'Paraguay', 'EE': 'Estonia', 'ET': 'Ethiopia', 'PF': 'French Polynesia', 'EC': 'Ecuador', '10': 'British Indian Ocean Territory', 'AM': 'Armenia',

'SY': 'Syria', 'CM': 'Cameroon', 'LY': 'Libya' }

mdu.production_countries = mdu.production_countries.replace(country_mapping)

In [None]:
mdu.production_countries.unique()

### What all manipulations have you done and insights you found?

**The seasons column has the most null values, and all of them come from rows of type 'MOVIE'. As a movie has no seasons, these nulls are replaced by 0. The null values in age_certification and description are replaced by 'Not Available', and those in imdb_id by 'unavailable'. The null values in imdb_score, imdb_votes, tmdb_popularity and tmdb_score are replaced by 0. The entries in genres and production_countries are stored as lists, so I split them into multiple rows, one per combination of genre and production country. The modified dataset has 25819 entries, and duplicate entries have been removed.**
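
To make the list-splitting step concrete, here is a minimal sketch on invented data (the three rows are hypothetical, not taken from the dataset). It parses the stringified lists with `ast.literal_eval`, explodes both columns as the notebook does, and shows how many rows each title ends up with.

```python
import ast
import pandas as pd

# Toy rows with list-valued columns stored as strings (invented values).
toy = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "genres": ["['drama', 'comedy']", "['action']", "[]"],
    "production_countries": ["['US', 'GB']", "['IN']", "[]"],
})

# Turn the string "['drama', 'comedy']" into a real Python list.
for col in ["genres", "production_countries"]:
    toy[col] = toy[col].apply(ast.literal_eval)

# Exploding two columns in sequence yields one row per
# (genre, country) combination of each title.
exploded = (
    toy.explode("genres")
       .explode("production_countries")
       .reset_index(drop=True)
)

# An empty list becomes a single row holding NaN.
rows_per_title = exploded.groupby("id").size()
print(rows_per_title.to_dict())
```

A title with two genres and two countries becomes four rows, so counts taken on the exploded frame are counts of combinations rather than of titles.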

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Barplot on genres - shows
genre_plot = mdu[mdu.type=='SHOW'].groupby('genres')['genres'].count()
genre_pl = pd.DataFrame(genre_plot)
# Rename the count column
genre_pl.rename(columns={'genres': 'count'}, inplace=True)
genre_pl.reset_index(inplace=True)
# Create a bar plot
plt.figure(figsize=(10, 8))  # Adjust figure size
sns.barplot(data=genre_pl, x="genres", y="count", hue="genres", palette="deep", legend=False)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha="right")
plt.xlabel("Genres")
plt.ylabel("Count")
plt.title("Genre Distribution of Shows")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

This chart was picked to get an idea of the distribution of genres among the available shows.

##### 2. What is/are the insight(s) found from the chart?

The majority of shows belong to the drama genre, with comedy second. There are more than 300 animated shows. More than 200 shows each belong to the action, documentation, family and scifi genres. There are over 100 crime, european, fantasy, reality and romance shows. The remaining genres have very low counts, with western the lowest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Drama and comedy tend to have broad appeal, so a catalogue weighted toward them could support a higher subscription count and positive growth.
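
One caveat when reading genre counts from a frame exploded by both genre and production country: a title with several countries appears once per country in every genre. The sketch below, on invented rows shaped like `mdu`, contrasts the row-level count with a title-level count that keeps each (id, genre) pair once. The data and the names `row_counts` and `title_counts` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical slice shaped like `mdu`: one row per (title, genre, country).
toy = pd.DataFrame({
    "id":     ["s1", "s1", "s1", "s1", "s2"],
    "type":   ["SHOW"] * 5,
    "genres": ["drama", "drama", "comedy", "comedy", "drama"],
    "production_countries": ["US", "GB", "US", "GB", "US"],
})

shows = toy[toy["type"] == "SHOW"]

# Row-level count: a title with two countries contributes twice per genre.
row_counts = shows["genres"].value_counts()

# Title-level count: keep each (id, genre) pair once before counting.
title_counts = (
    shows.drop_duplicates(subset=["id", "genres"])["genres"].value_counts()
)

print(row_counts.to_dict())
print(title_counts.to_dict())
```

Whether the difference matters depends on how many titles list more than one production country. `value_counts` is also a shorter route to the same table that the group-by, rename and reset-index steps build.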

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Barplot on genres - movies
genre_plot = mdu[mdu.type=='MOVIE'].groupby('genres')['genres'].count()
genre_pl = pd.DataFrame(genre_plot)
# Rename the count column
genre_pl.rename(columns={'genres': 'count'}, inplace=True)
genre_pl.reset_index(inplace=True)
# Create a bar plot
plt.figure(figsize=(10, 8))  # Adjust figure size
sns.barplot(data=genre_pl, x="genres", y="count", hue="genres", palette="deep", legend=False)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha="right")
plt.xlabel("Genres")
plt.ylabel("Count")
plt.title("Genre Distribution of Movies")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

This chart was picked to get an idea of the distribution of genres among the available movies.

##### 2. What is/are the insight(s) found from the chart?

The majority of movies are in the drama genre, with comedy second, as with shows. There are more than 2000 thriller movies and around 2000 each in the action and romance genres. Around 1000 movies are crime-based. The remaining genres have far fewer titles, with reality the fewest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Drama and comedy tend to have broad appeal, and the catalogue also has a fairly high number of action, thriller and romance movies, which could support a higher subscription count and positive growth.

#### Chart - 3

In [None]:
mdup = mdu.drop(['genres','production_countries'],axis=1)
mdup.duplicated().sum()

In [None]:
md2 = md.drop(['genres','production_countries'],axis=1)
md2.duplicated().sum()

In [None]:
# Chart - 3 visualization code
# No. of shows released in each year

md3 = md2[md2.type=='SHOW']
md4 = md3.groupby('release_year')['release_year'].count()
md4 = pd.DataFrame(md4)
# Rename the count column
md4.rename(columns={'release_year': 'count'}, inplace=True)
md4.reset_index(inplace=True)
# Adjust figure size
plt.figure(figsize=(10, 8))
sns.lineplot(data=md4, x="release_year", y="count", marker="o", color="b")
# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha="right")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.title("Release Year Distribution of Shows")
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

To see how the shows available on Amazon Prime are distributed by release year.

##### 2. What is/are the insight(s) found from the chart?

Before 2000, the number of shows per release year is low, in fact fewer than 20. This may reflect the limited technology of that period. After 2000, there is a sharp increase in the number of shows, plausibly helped by advances in technology, the arrival of smartphones and wider internet availability, which made streaming platforms possible. In 2021, however, there is a sharp decrease from 140 to 60.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

One insight points to negative growth: the number of shows dropped sharply from 140 to 60 in 2021. If this decreasing trend continues, it might lead to a lower subscription count.
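
A drop like the one read off the chart can also be quantified directly. The following sketch uses invented release years (not the real counts) to compute titles per year, the absolute change and the percentage change from one year to the next.

```python
import pandas as pd

# Invented release years for a handful of titles.
toy = pd.DataFrame({
    "release_year": [2019, 2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021]
})

# Titles per release year, in chronological order.
per_year = toy["release_year"].value_counts().sort_index()

# Absolute and percentage change relative to the previous year.
change = per_year.diff()
pct = per_year.pct_change() * 100

print(pd.DataFrame({"count": per_year, "change": change, "pct": pct}))
```

If the data was collected partway through its final year (an assumption; the source does not say when the snapshot was taken), the last point would undercount and the drop would be less meaningful than it looks.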

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# No. of movies released in each year
md5 = md2[md2.type=='MOVIE']
md6 = md5.groupby('release_year')['release_year'].count()
md6 = pd.DataFrame(md6)
# Rename the count column
md6.rename(columns={'release_year': 'count'}, inplace=True)
md6.reset_index(inplace=True)
plt.figure(figsize=(10, 8))  # Adjust figure size
sns.lineplot(data=md6, x="release_year", y="count", marker="o", color="r")
# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha="right")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.title("Release Year Distribution of Movies")
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

To see how the movies available on Amazon Prime are distributed by release year.

##### 2. What is/are the insight(s) found from the chart?

The chart follows a similar trend to that of shows. Before 2000, there were fewer than 100 movies per release year. After 2000, there is a steep increase from 100 to 700 in 2018, plausibly due to advances in technology and internet availability that made streaming much easier. From 2018 to 2021 the count alternately decreases and increases, with a steep decrease from 2020 to 2021.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The sharp decrease from 2020 to 2021 signals negative growth. If it continues, the subscription count could fall considerably.

#### Chart - 5

In [None]:
# Scatter plot runtime vs seasons (for shows)
amazon_show1 = md[md.type=='SHOW']
amazon_show2 = amazon_show1.drop(['genres','production_countries'],axis=1)
amazon_show2.duplicated().sum()

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(6,4))
sns.scatterplot(x='runtime',y='seasons',data=amazon_show2,color='r')
plt.xlabel('Runtime (of one episode)')
plt.ylabel('Seasons')
plt.title('Scatter Plot: Runtime vs Seasons')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is used to compare two numerical features. Here it is used to examine the relationship between runtime and seasons.

##### 2. What is/are the insight(s) found from the chart?

The points cluster where seasons are between 0 and 10 and runtime is between 0 and 100, which suggests that most shows have fewer than 10 seasons. Some shows have between 10 and 50 seasons, mostly with runtimes between 0 and 60. One show has 50 seasons with a runtime of 120, and another has one season with a runtime of approximately 158.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Shows with at most 10 seasons and an average episode runtime of about 40 minutes are likely to favor a higher subscription count.
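
Statements such as "most shows have fewer than 10 seasons" are read off the scatter plot, and a couple of summary numbers can back them up. This is a minimal sketch on invented values; the column names match the notebook, but the data and the 10-season cut-off are illustrative.

```python
import pandas as pd

# Invented seasons and per-episode runtimes for eight shows.
shows = pd.DataFrame({
    "seasons": [1, 2, 3, 5, 12, 1, 4, 8],
    "runtime": [22, 45, 60, 30, 25, 158, 40, 50],
})

# Fraction of shows with at most 10 seasons (mean of a boolean mask).
share_le_10 = (shows["seasons"] <= 10).mean()

# The median is less sensitive than the mean to outliers like the 158-minute show.
median_runtime = shows["runtime"].median()

print(f"{share_le_10:.1%} of shows have <= 10 seasons")
print(f"median runtime: {median_runtime} min")
```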

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Distribution of Age certification for shows
amazon_show2 = md2[md2.type=='SHOW'].groupby('age_certification')['age_certification'].count()
amazon_show2 = pd.DataFrame(amazon_show2)
# Rename the count column
amazon_show2.rename(columns={'age_certification': 'count'}, inplace=True)
amazon_show2.reset_index(inplace=True)
# Create a bar plot
plt.figure(figsize=(8, 6))  # Adjust figure size
sns.barplot(data=amazon_show2, x="age_certification", y="count", hue="age_certification", palette="dark", legend=False)

# Rotate x-axis labels for better readability
#plt.xticks(rotation=45, ha="right")
plt.xlabel("Age Certification")
plt.ylabel("Count")
plt.title("Distribution of Age Certification for Shows")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

To understand the age certification distribution for shows.

##### 2. What is/are the insight(s) found from the chart?

For most shows, the age certification isn't available. Among those that are available, the majority fall in the TV-MA category, and the second highest is TV-14. The remaining categories have far fewer shows, with TV-Y7 among the lowest. A decreasing trend is observed from TV-PG to TV-Y7.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No business-impact insight can be drawn, because the age certification is unavailable for the majority of shows.
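
How much of the column is missing, and how the known certifications split, can be computed in a few lines. The sketch below assumes the missing values are still `NaN` (that is, before any `'Not Available'` placeholder is filled in) and uses invented rows.

```python
import pandas as pd

# Invented certifications; None marks a missing value.
shows = pd.DataFrame({
    "age_certification": ["TV-MA", None, "TV-14", None, None, "TV-MA", "TV-PG", None]
})

# Fraction of shows without a certification.
missing_share = shows["age_certification"].isna().mean()

# Distribution among shows that do have a certification.
known_share = shows["age_certification"].dropna().value_counts(normalize=True)

print(f"missing: {missing_share:.0%}")
print(known_share)
```

Reporting the missing share next to the chart makes it clear how much of the catalogue the visible bars actually describe.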

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Distribution of Age certification for movie
amazon_movies2 = md2[md2.type=='MOVIE'].groupby('age_certification')['age_certification'].count()
amazon_movies2 = pd.DataFrame(amazon_movies2)
# Rename the count column
amazon_movies2.rename(columns={'age_certification': 'count'}, inplace=True)
amazon_movies2.reset_index(inplace=True)
# Create a bar plot
plt.figure(figsize=(8, 6))  # Adjust figure size
sns.barplot(data=amazon_movies2, x="age_certification", y="count", hue="age_certification", palette="dark", legend=False)

# Rotate x-axis labels for better readability
#plt.xticks(rotation=45, ha="right")
plt.xlabel("Age Certification")
plt.ylabel("Count")
plt.title("Distribution of Age Certification for Movies")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

To understand the distribution of age certification for movies.

##### 2. What is/are the insight(s) found from the chart?

For the majority of movies, the age certification isn't available. Among those that are available, the R category is the largest and NC-17 the smallest. PG and PG-13 have an equal number of movies, and very few movies fall in the G category.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As the majority of age certifications aren't available, these insights can't tell us anything about business impact.

#### Chart - 8

In [None]:
amazon_prod = mdu.drop('genres',axis=1)
amazon_prod.duplicated().sum()

In [None]:
amazon_prod.drop_duplicates(inplace=True)

In [None]:
amazon_prod.duplicated().sum()

In [None]:
# Chart - 8 visualization code
# Distribution of production countries for shows
amazon_shows3 = amazon_prod[amazon_prod.type=='SHOW'].groupby('production_countries')['production_countries'].count()
amazon_shows3 = pd.DataFrame(amazon_shows3)
amazon_shows3.rename(columns={'production_countries':'count'},inplace=True)
amazon_shows3.reset_index(inplace=True)
# Create a bar plot
plt.figure(figsize=(8, 10))  # Adjust figure size
sns.barplot(data=amazon_shows3, x="count", y="production_countries", hue="production_countries", palette="dark", legend=False)

# Axis labels and title
plt.ylabel("Production Countries")
plt.xlabel("Count")
plt.title("Distribution of Production Countries for Shows")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

To analyse the distribution of production countries for shows.

##### 2. What is/are the insight(s) found from the chart?

The United States has produced the highest number of shows on Amazon Prime. The United Kingdom is second, with nearly 260 shows. For nearly 160 shows, the production country is unknown. Japan has produced more than 100 shows. Australia, Canada and China have produced a fairly good number of shows. The rest of the countries have produced fewer shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In countries where people prefer English-language shows, the chance of a high subscription count is greater, because most shows are produced in the USA and the UK.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Distribution of production countries for movies
amazon_movies3 = amazon_prod[amazon_prod.type=='MOVIE'].groupby('production_countries')['production_countries'].count()
amazon_movies3 = pd.DataFrame(amazon_movies3)
amazon_movies3.rename(columns={'production_countries':'count'},inplace=True)
amazon_movies3.reset_index(inplace=True)
# Create a bar plot
plt.figure(figsize=(10, 22))  # Adjust figure size
sns.barplot(data=amazon_movies3, x="count", y="production_countries", hue="production_countries", palette="dark", legend=False)

# Axis labels and title
plt.ylabel("Production Countries")
plt.xlabel("Count")
plt.title("Distribution of Production Countries for Movies")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

To analyse the distribution of production countries for movies.

##### 2. What is/are the insight(s) found from the chart?

The majority of movies are produced by the United States. However, the second highest is India and not the UK, unlike for shows. For nearly 800 movies, the production countries are unknown. Canada, France and the UK have produced a fairly good number of movies. The rest of the countries have produced fewer movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As the majority of movies are produced by the United States, Amazon Prime would likely attract an audience that prefers English-language movies. As this audience is quite large, the subscription count is high. Also, Bollywood and the South Indian film industry produce quite a large number of films, targeting Indian audiences living in India and overseas, so the chances of Indians subscribing are high. However, negative growth can be expected in other countries, as very few movies are produced there and the chances of subscription are lower.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Count rows per (country, genre) pair and pivot genres into columns
df_counts = mdu.groupby(["production_countries", "genres"]).size().unstack(fill_value=0)
# Convert each country's counts to percentages of its row total
df_relative = df_counts.div(df_counts.sum(axis=1), axis=0) * 100
# Plot horizontal stacked bar chart
fig, ax = plt.subplots(figsize=(12, 30))
df_relative.plot(kind="barh", stacked=True, colormap="plasma", ax=ax)

# Labels and title (horizontal bars: percentages on x, countries on y)
plt.xlabel("Percentage of Genres")
plt.ylabel("Country")
plt.title("Relative Distribution of Genres Across Countries")
plt.legend(title="Genre", bbox_to_anchor=(1.05, 1), loc="upper left")

plt.show()

##### 1. Why did you pick the specific chart?

A relative stacked bar plot is useful for understanding the relative distribution of genres across countries.

##### 2. What is/are the insight(s) found from the chart?

Most countries show a fair distribution across genres, which suggests diverse production of films and shows, audiences with varied tastes, and the resources to produce multiple films or shows in a year. However, for countries like Cameroon, Cambodia, Antarctica, Armenia, Peru, Paraguay, Uganda, Syria, Jordan, Cyprus, Dominican Republic and Kazakhstan, the distribution is not spread across all genres: each bar contains only one or two genres. This indicates that few films or shows from these countries are in the catalogue. It may also mean that fewer people there watch movies and shows, or that they prefer other streaming platforms over Amazon Prime.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chance of a high subscription count is greatest in countries that produce diverse genres, which would have a positive impact on business.
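
A bar with only one or two genres can simply mean the country has very few titles. The sketch below builds the same row-normalized genre shares with `pd.crosstab(..., normalize="index")` and then keeps only countries above a minimum title count. The data is invented and the threshold `MIN_TITLES` is a hypothetical choice, not a value from the notebook.

```python
import pandas as pd

# Invented rows shaped like `mdu` (one row per title, genre and country).
toy = pd.DataFrame({
    "id": ["a", "a", "b", "c", "d", "e"],
    "production_countries": ["United States", "United States", "United States",
                             "United States", "Peru", "United States"],
    "genres": ["drama", "comedy", "drama", "action", "drama", "comedy"],
})

# Number of distinct titles behind each country's bar.
titles_per_country = toy.groupby("production_countries")["id"].nunique()

# Genre shares per country in percent; each row sums to 100.
shares = pd.crosstab(
    toy["production_countries"], toy["genres"], normalize="index"
) * 100

# Hypothetical cut-off: ignore countries with fewer than 3 titles.
MIN_TITLES = 3
keep = titles_per_country[titles_per_country >= MIN_TITLES].index
shares_filtered = shares.loc[shares.index.isin(keep)]

print(titles_per_country.to_dict())
print(shares_filtered.round(1))
```

In this toy example Peru shows 100% drama from a single title, which says little about genre diversity there. Filtering or annotating by title count keeps such bars from being over-read.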

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# scatterplot imdb_score vs tmdb_score
plt.figure(figsize=(6,4))
sns.scatterplot(x='imdb_score',y='tmdb_score',data=md2,color='r')
plt.xlabel('IMDB rating')
plt.ylabel('TMDB rating')
plt.title('Scatter Plot: IMDB vs TMDB rating')
plt.show()

##### 1. Why did you pick the specific chart?

To compare IMDB rating and TMDB rating given for the corresponding shows and movies. Scatter plot is used to compare two numerical features.

##### 2. What is/are the insight(s) found from the chart?

A positive correlation is observed in the chart. However, there are some data points for which the IMDB score is 0 while the TMDB score ranges from 0 to 10, and vice versa. This is because some of the null values were replaced with zero.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Including shows and movies that have high IMDB or TMDB rating increases the chances of subscription.
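
Because the zeros along both axes come from filling nulls with 0, they can distort the measured relationship between the two scores. The sketch below compares the correlation with and without those rows on invented values. Treating 0 as "missing" is an assumption that is reasonable here only because the notebook introduced the zeros itself.

```python
import numpy as np
import pandas as pd

# Invented scores; 0.0 stands for a null that was filled with zero.
toy = pd.DataFrame({
    "imdb_score": [7.0, 8.0, 6.0, 0.0, 5.0, 0.0],
    "tmdb_score": [7.2, 7.9, 6.1, 8.0, 0.0, 6.5],
})

# Pearson correlation including the zero-filled rows.
corr_with_zeros = toy["imdb_score"].corr(toy["tmdb_score"])

# Turn zeros back into NaN and keep only titles rated on both sites.
rated = toy.replace(0, np.nan).dropna()
corr_rated = rated["imdb_score"].corr(rated["tmdb_score"])

print(f"with zeros: {corr_with_zeros:.2f}, rated only: {corr_rated:.2f}")
```

The same masked frame can be passed to the scatter plot so that the bands of points at zero disappear.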

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# scatter plot of imdb_votes and tmdb_popularity
plt.figure(figsize=(6,4))
sns.scatterplot(x='imdb_votes',y='tmdb_popularity',data=md2,color='g')
plt.xlabel('IMDB votes')
plt.ylabel('TMDB popularity')
plt.title('Scatter Plot: IMDB votes vs TMDB popularity')
plt.show()

##### 1. Why did you pick the specific chart?

To compare the relationship between IMDB votes and TMDB popularity.

##### 2. What is/are the insight(s) found from the chart?

Most points cluster where IMDB votes lie between 0 and 0.4 × 1e6 and TMDB popularity lies between 0 and 200. There are 3 points for which IMDB votes are low and TMDB popularity is quite high. This suggests that TMDB popularity stays at or below roughly 200 for most movies and shows, even when IMDB votes are high.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights suggest that viewers who go by IMDB votes may be more likely to subscribe. Conversely, viewers who go by TMDB popularity may be less likely to subscribe or renew, which could mean negative growth.
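
When most points pile up near the origin and a few are very large, a log transform or a rank-based correlation is often easier to read than the raw values. The following is a sketch on invented numbers, not a result from the dataset. It compares the Pearson correlation on raw and `log1p`-transformed values, and computes a rank correlation by ranking each column first.

```python
import numpy as np
import pandas as pd

# Invented, heavily right-skewed values.
toy = pd.DataFrame({
    "imdb_votes": [10, 100, 1_000, 10_000, 1_000_000],
    "tmdb_popularity": [1.0, 2.0, 4.0, 8.0, 16.0],
})

# log1p(x) = log(1 + x), which is defined at zero as well.
logged = np.log1p(toy)

pearson_raw = toy["imdb_votes"].corr(toy["tmdb_popularity"])
pearson_log = logged["imdb_votes"].corr(logged["tmdb_popularity"])

# Rank correlation: Pearson correlation of the ranks.
rank_corr = toy["imdb_votes"].rank().corr(toy["tmdb_popularity"].rank())

print(f"raw: {pearson_raw:.2f}, log: {pearson_log:.2f}, rank: {rank_corr:.2f}")
```

For the plot itself, log-scaled axes (for example `plt.xscale("log")`) spread out the dense cluster in a similar way, provided zero values are handled first.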

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# scatter plot of imdb_score and imdb_votes
plt.figure(figsize=(6,4))
sns.scatterplot(x='imdb_score',y='imdb_votes',data=md2,color='b')
plt.xlabel('IMDB score')
plt.ylabel('IMDB votes')
plt.title('Scatter Plot: IMDB score vs IMDB votes')
plt.show()

##### 1. Why did you pick the specific chart?

To make a comparison between IMDB score and IMDB votes.

##### 2. What is/are the insight(s) found from the chart?

The points cluster in the region where IMDB votes range from 0 to 400000. For some movies with an IMDB score between 6 and 8, the IMDB votes are quite high.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

People usually prefer shows and movies with high ratings (preferably from 7 to 10). As many shows and movies in this range also have quite high IMDB votes, this would increase the subscription count.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
amazon_tit_num = md2.select_dtypes(include='number')
corr_mat = amazon_tit_num.corr()
sns.heatmap(corr_mat,annot=True,cmap="rainbow", fmt=".2f")
plt.show()

##### 1. Why did you pick the specific chart?

A correlation matrix is used to compare the numerical features in the dataset.

##### 2. What is/are the insight(s) found from the chart?

There is a fairly strong negative correlation between runtime and seasons. There is a stronger positive correlation between imdb_score and tmdb_score, and likewise between imdb_votes and tmdb_popularity.
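
The negative correlation between runtime and seasons may partly reflect how the data was prepared: movies have long runtimes and had their missing seasons filled with 0. A quick check is to compute the correlation within shows only. The sketch below uses invented values chosen to make the effect visible, so the sign flip it shows is an illustration, not a finding about the real data.

```python
import pandas as pd

# Invented rows: movies carry seasons = 0 after the fill step.
toy = pd.DataFrame({
    "type":    ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "runtime": [90, 110, 120, 30, 45, 60],
    "seasons": [0, 0, 0, 1, 3, 5],
})

# Correlation over movies and shows together.
overall = toy["runtime"].corr(toy["seasons"])

# Correlation restricted to shows, where seasons is meaningful.
shows_only = toy[toy["type"] == "SHOW"]
within_shows = shows_only["runtime"].corr(shows_only["seasons"])

print(f"overall: {overall:.2f}, shows only: {within_shows:.2f}")
```

Computing the heatmap separately per `type` is one way to tell a relationship between the variables apart from a difference between the two groups.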

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
amazon_tit_num = md2.select_dtypes(include='number')
amazon_type = md2[['type']]
amazon_tit_up = pd.concat([amazon_tit_num,amazon_type],axis=1)
sns.pairplot(amazon_tit_up, hue="type", palette="viridis")
plt.show()

##### 1. Why did you pick the specific chart?

To compare the numerical features based on type 'show' and 'movie'.

##### 2. What is/are the insight(s) found from the chart?

Some features could possibly have linear or non-linear relationships, and the random scatter between some features suggests that there is no correlation. IMDb score and TMDb score have similar distributions, as their KDE plots are quite similar. Runtime, seasons, IMDb votes and TMDb popularity are positively skewed. Some movies seem to dominate in votes and popularity. Some points are far from the main clusters, indicating highly popular movies/shows or exceptions in runtime/seasons. Movies might have higher IMDb votes and popularity, while shows could have wider variability in scores.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1) Most of the catalogue consists of drama and comedy. However, crime, thriller, animation, action and sci-fi shows and movies are comparatively few. Including more titles in these genres could increase the number of subscribers.

2) Including shows with a shorter runtime per episode can cater to a large audience.

3) Based on the data presented in the relative stacked bar plot, one can tailor the catalogue to user preferences by including more shows and movies in the preferred genres.

4) Including shows and films with higher IMDB and TMDB rating would bring in more subscribers.

# **Conclusion**

Most of the shows 🎬 and movies 🍿 available on Amazon Prime belong to the drama 🎭 and comedy 😂 genres. The United States 🇺🇸 is the leading producer of the majority of this content.

Most of the shows have 10 seasons or fewer 📅, with the average runtime of each episode being around 40 minutes ⏳. There is a fair distribution of genres 🌍 across most countries, indicating a diverse content library.

Additionally, IMDb ⭐ and TMDb 🎥 scores are positively correlated 📈. This means that if a movie or show has a high IMDb rating ⭐, it is likely to also have a high TMDb rating 🎥. This correlation suggests consistency in audience and critic evaluations across platforms. 🚀