<a href="https://colab.research.google.com/github/Akshay533kumar/Amazon-prime-EDA-project/blob/main/Amazon_prime_EDA_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Summary -**

This project focuses on performing exploratory data analysis (EDA) on Amazon Prime Video content to uncover insights related to genre distribution, content ratings, popularity, and temporal trends. By leveraging datasets containing show/movie titles and associated credits, we aim to understand the structure, preferences, and evolution of content offered on the platform.

# **GitHub Link -**

Provide your GitHub Link here.




# **Problem Statement**


With the rapid growth of digital streaming platforms, understanding content trends, viewer preferences, and catalog diversity is critical for competitive positioning. Amazon Prime Video hosts a vast library of TV shows and movies, but there is limited publicly available insight into the structure, performance, and evolution of its content.

The objective of this project is to perform an in-depth exploratory data analysis (EDA) of Amazon Prime’s content catalog to answer key business questions such as:

What are the most common genres and how are they distributed?

How has the type and quantity of content evolved over the years?

What is the relationship between content ratings, certifications, and viewer demographics?

Who are the most featured actors and directors on the platform?

Are there any identifiable trends in content duration or IMDb ratings?

By addressing these questions, the project aims to uncover insights that could help content strategists, recommendation system developers, and market analysts make data-informed decisions.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
titles_df = pd.read_csv('/content/titles.csv (1).zip')
credits_df = pd.read_csv('/content/credits.csv.zip')

In [None]:
data = pd.merge(titles_df, credits_df, on='id')
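
Because a credits file normally holds one row per person per title, a merge on `id` yields one row per title-credit pair, so later counts are counts of credit rows rather than of titles. The sketch below uses small made-up frames in place of the real files to show how the merge could be checked and a title-level view recovered. The toy values and the names `titles_demo`, `credits_demo`, `merged_demo`, and `titles_only` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for titles_df and credits_df (made-up values, for illustration only)
titles_demo = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["A", "B", "C"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
})
credits_demo = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Ann", "Bob", "Cy"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# validate="one_to_many" raises a MergeError if an id repeats in the left frame
merged_demo = pd.merge(
    titles_demo, credits_demo, on="id", how="inner", validate="one_to_many"
)

# Inner join: t3 has no credits and is dropped; t1 has two credits and appears twice
print(merged_demo.shape)

# Title-level view: keep one row per id before counting titles
titles_only = merged_demo.drop_duplicates(subset="id")
print(titles_only["type"].value_counts())
```

Counting on `titles_only` instead of the merged frame keeps titles with large casts from dominating title-level charts.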

### Dataset First View

In [None]:
# Dataset First Look
data.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()
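
Raw null counts are easier to judge as a share of all rows. A minimal sketch on a made-up frame (the `demo` data and the `missing_pct` name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the merged data
demo = pd.DataFrame({
    "imdb_score": [7.1, np.nan, 6.4, np.nan],
    "seasons": [np.nan, 2.0, np.nan, np.nan],
    "title": ["A", "B", "C", "D"],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing share
missing_pct = (demo.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Columns with a high missing share may be better dropped or flagged than imputed.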

In [None]:
# Drop rows that have no description
data.dropna(subset=['description'], inplace=True)

# Data Wrangling

In [None]:
data['age_certification'].mode()[0]

In [None]:
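# Assign the result back: fillna(inplace=True) on a selected column is a
# chained assignment, which pandas 2.x deprecates
data['age_certification'] = data['age_certification'].fillna(data['age_certification'].mode()[0])

In [None]:
# Missing season counts become 0
data['seasons'] = data['seasons'].fillna(0)

In [None]:
# Caution: this gives every row without an IMDb ID the same (most frequent) ID
data['imdb_id'] = data['imdb_id'].fillna(data['imdb_id'].mode()[0])

In [None]:
# Fill missing scores with the column mean, rounded to one decimal
data['imdb_score'] = data['imdb_score'].fillna(round(data['imdb_score'].mean(), 1))

In [None]:
data['imdb_votes'] = data['imdb_votes'].fillna(0)

In [None]:
data['tmdb_popularity'] = data['tmdb_popularity'].fillna(round(data['tmdb_popularity'].mean(), 2))

In [None]:
data['tmdb_score'] = data['tmdb_score'].fillna(round(data['tmdb_score'].mean(), 2))

In [None]:
data['character'] = data['character'].fillna('unknown')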

In [None]:
data.isnull().sum()

In [None]:
data.duplicated().sum()

In [None]:
data.drop_duplicates(inplace=True)
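
The cells above fill missing scores with a single global mean. One alternative, sketched here under the assumption that scores differ by content type, fills each gap with the median of its own `type` group. The `demo` frame and the `imdb_score_filled` column are made up for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up scores with one gap per content type
demo = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "imdb_score": [6.0, 7.0, np.nan, 8.0, np.nan, 9.0],
})

# transform("median") returns, for every row, the median of that row's group
# (NaN values are skipped when the median is computed)
group_median = demo.groupby("type")["imdb_score"].transform("median")

# Fill each gap with the median of its own group
demo["imdb_score_filled"] = demo["imdb_score"].fillna(group_median)
print(demo)
```

The median is less sensitive to a few extreme scores than the mean, and writing to a new column keeps the original values available for comparison.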

### What did you know about your dataset?

**Answer** The analysis draws on two datasets.

**1. titles.csv – Titles**

This dataset contains structured information about each piece of content (movie or TV show). Key columns include:

Title: Name of the movie or show.

Type: Whether it’s a movie or a TV show.

Genre: Primary and sometimes secondary genres (e.g., Drama, Comedy).

Release Year: Year the content was released.

IMDb Rating: User rating from IMDb.

Age Certification: Content rating (e.g., TV-MA, PG, R).

Runtime: Duration in minutes.

Description: Brief synopsis of the content.

Production Country: Country of origin.

Seasons: Number of seasons (TV shows only).

**2. credits.csv – Cast and Crew**

This dataset complements the titles dataset by providing details about the people involved:

ID: Matches the title ID from titles.csv.

Name: Actor or director name.

Role: Role type (e.g., actor, director).

Character Name: Character played (for actors).

**What I Learned About the Dataset**

The platform features diverse genres, with Drama, Comedy, and Action being the most frequent.

Content ranges from the early 1900s to recent years, allowing trend analysis over decades.

IMDb ratings range widely, providing an opportunity to assess content quality.

The dataset contains missing and inconsistent values (e.g., missing certifications or ratings), requiring data cleaning.

Some titles are duplicated or tagged with multiple genres, which requires normalization for accurate analysis.

The credits.csv allows analysis of frequently cast actors/directors and their recurring collaborations.
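
The multi-genre normalization mentioned above can be sketched as follows. This assumes the `genres` column stores each title's genres as the string form of a Python list (for example `"['drama', 'comedy']"`); that format, the toy rows, and the names `titles_only`, `genre_list`, and `genre_counts` are assumptions for illustration.

```python
import ast

import pandas as pd

# Made-up rows: t1 appears twice because it has two credit rows after the merge
demo = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3"],
    "genres": ["['drama', 'comedy']", "['drama', 'comedy']", "['drama']", "[]"],
})

# Keep one row per title so large casts do not inflate genre counts
titles_only = demo.drop_duplicates(subset="id").copy()

# Parse the string into a real list, then give each genre its own row
titles_only["genre_list"] = titles_only["genres"].apply(ast.literal_eval)
exploded = titles_only.explode("genre_list").dropna(subset=["genre_list"])

# Count titles per individual genre
genre_counts = exploded["genre_list"].value_counts()
print(genre_counts)
```

Counting on the exploded frame gives per-genre totals, whereas `value_counts()` on the raw column counts each distinct genre combination separately.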

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

## ***4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
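# Note: each distinct value of `genres` is counted as-is, once per row of the merged data
top_genres = data['genres'].value_counts().head(10)
plt.figure(figsize=(10, 6))
sns.barplot(x=top_genres.values, y=top_genres.index)

plt.title("Top 10 Genres on Amazon Prime")
plt.xlabel("Number of Rows")
plt.ylabel("Genre")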
plt.tight_layout()
plt.show()

#### Chart - 2

In [None]:
# Chart - 2 visualization code
genre_imdb = data.groupby('genres')['imdb_score'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=genre_imdb.values, y=genre_imdb.index)

plt.title("Top 10 Genres by Average IMDb Score")
plt.xlabel("Average IMDb Score")
plt.ylabel("Genre")
plt.xlim(0, 10)
plt.tight_layout()
plt.show()

#### Chart - 3

In [None]:
# Chart - 3 visualization code
top_countries = data['production_countries'].value_counts().head(10)
plt.figure(figsize=(10, 6))
sns.barplot(x=top_countries.values, y=top_countries.index)

plt.title("Top 10 Production Countries on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Country")
plt.show()

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(6, 4))
sns.countplot(data=data, x='type')
plt.title("Distribution of Content Type (Movies vs Shows)")
plt.xlabel("Content Type")
plt.ylabel("Count")
plt.show()

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8, 5))
sns.countplot(data=data, x='age_certification', order=data['age_certification'].value_counts().index)
plt.title("Age Certification Distribution")
plt.xlabel("Age Certification")
plt.ylabel("Number of Titles")
plt.show()

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(data['runtime'], kde=True)

plt.title("Runtime Distribution of Amazon Prime Titles")
plt.xlabel("Runtime (in minutes)")
plt.ylabel("Number of Titles")
plt.tight_layout()
plt.show()

#### Chart - 7

In [None]:
# Chart - 7 visualization code
year_counts = data['release_year'].value_counts().sort_index()
plt.figure(figsize=(12, 6))
sns.lineplot(x=year_counts.index, y=year_counts.values, marker='o', linewidth=2.5)

plt.title("Number of Titles Released Each Year on Amazon Prime")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.grid(True)
plt.tight_layout()
plt.show()

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(data['imdb_score'], bins=20, kde=True)

plt.title("IMDb Score Distribution on Amazon Prime")
plt.xlabel("IMDb Score")
plt.ylabel("Number of Titles")
plt.show()

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(data['tmdb_score'], bins=20, kde=True)

plt.title("TMDb Score Distribution on Amazon Prime")
plt.xlabel("TMDb Score")
plt.ylabel("Number of Titles")
plt.show()

#### Chart - 10

In [None]:
# Chart - 10 visualization code
avg_scores = data.groupby('type')['imdb_score'].mean().reset_index()
plt.figure(figsize=(8, 5))
sns.barplot(data=avg_scores, x='type', y='imdb_score')

plt.title("Average IMDb Score by Content Type")
plt.xlabel("Content Type")
plt.ylabel("Average IMDb Score")
plt.show()

#### Chart - 11

In [None]:
# Chart - 11 visualization code
genre_imdb = data.groupby('genres')['imdb_score'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=genre_imdb.values, y=genre_imdb.index)

plt.title("Top 10 Genres by Average IMDb Score")
plt.xlabel("Average IMDb Score")
plt.ylabel("Genre")
plt.xlim(0, 10)
plt.tight_layout()
plt.show()

#### Chart - 12

In [None]:
# Chart - 12 visualization code
actor_df = data[data['role'] == 'ACTOR']

top_actors = actor_df['name'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_actors.values, y=top_actors.index)

plt.title("Top 10 Most Frequent Actors on Amazon Prime")
plt.xlabel("Number of Appearances")
plt.ylabel("Actor")
plt.show()

#### Chart - 13

In [None]:
# Chart - 13 visualization code
directors_df = data[data['role'] == 'DIRECTOR']

top_directors = directors_df['name'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_directors.values, y=top_directors.index)

plt.title("Top 10 Most Frequent Directors on Amazon Prime")
plt.xlabel("Number of Titles Directed")
plt.ylabel("Director")
plt.show()

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_df = data.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_df.corr()

plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

plt.title("Correlation Heatmap: Amazon Prime Dataset")
plt.tight_layout()
plt.show()

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
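# Restrict the plot to the three variables named in the title
sns.pairplot(data, vars=['imdb_score', 'runtime', 'release_year'], diag_kind='kde', corner=True)
plt.suptitle("Pair Plot of IMDb Score, Runtime, and Release Year", y=1.02)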
plt.show()

# **Conclusion**

This Exploratory Data Analysis (EDA) of the Amazon Prime Video dataset has provided valuable insights into the platform's content, trends, and user preferences.

Amazon Prime Video favors movies over TV shows, with a significantly larger number of movies on the platform.

Drama and comedy are the most prevalent genres, reflecting popular viewer choices. However, the platform boasts a diverse range of genres catering to various tastes.

While movie runtimes have slightly decreased over the years, TV shows have seen a rise in the number of seasons, indicating a shift in viewer engagement patterns.

Older titles generally have higher IMDb and TMDb ratings compared to newer releases, potentially due to factors like nostalgia and evolving preferences.
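
Trend statements like the two above can be checked by averaging per decade. The sketch below uses made-up rows purely to show the mechanics; the numbers are not results from the Amazon Prime data, and the names `demo`, `movies`, and `by_decade` are assumptions.

```python
import pandas as pd

# Made-up title-level rows (one row per title); values are illustrative only
demo = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW"],
    "release_year": [1985, 1989, 2012, 2018, 2015],
    "runtime": [110, 100, 95, 85, 45],
    "imdb_score": [7.5, 6.5, 6.0, 6.4, 8.0],
})

# Bucket release years into decades (1985 -> 1980, 2012 -> 2010)
demo["decade"] = (demo["release_year"] // 10) * 10

# Average runtime and IMDb score per decade, for movies only
movies = demo[demo["type"] == "MOVIE"]
by_decade = movies.groupby("decade")[["runtime", "imdb_score"]].mean()
print(by_decade)
```

On the real data this should be run on a title-level frame (one row per `id`) so that titles with large casts are not over-weighted.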

The United States dominates content production, followed by India and the United Kingdom, highlighting their significant contributions to the platform's library.

Overall, this EDA has successfully uncovered patterns and trends in the Amazon Prime Video dataset, offering valuable information for content creators, platform strategists, and viewers alike. Further analysis and deeper dives into specific genres, regions, or content types could provide even richer insights for future decision-making.