In [None]:
from google.colab import drive
drive.mount('/content/drive')

# **Project Name    - Exploratory Data Analysis on Amazon Prime Titles Dataset**

##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 - Prem Anandrao Sawant**

# **Project Summary -**

The dataset provided for analysis contains detailed information about the content available on Amazon Prime Video, categorized mainly into movies and TV shows. Two separate files were given: one titled `titles.csv`, which includes metadata about each title such as the release year, genre, type (movie or show), IMDb score, and other attributes; and another file named `credits.csv`, which holds information about individuals associated with these titles, such as actors and directors.

Upon loading and examining the datasets, the structure and content of the data were observed closely. The titles dataset contained a variety of columns, including unique identifiers, types of content, genre classifications, and ratings. The credits dataset listed multiple appearances of people across roles like actors and directors. Initial steps involved cleaning the data by removing duplicate entries and identifying missing values across both datasets to ensure the accuracy of the analysis.

The first notable observation was the distribution of content type. A clear difference could be seen between the number of movies and TV shows, with movies being slightly more dominant on the platform. This was visualized using a bar chart that clearly highlighted this preference. The genre column revealed Drama as the most frequent genre, followed by Comedy, Thriller, Action, and others. Many titles were found to be multi-genre, often combining drama with romance or comedy with action, showing the diversity in content packaging.

Another important part of the dataset was the IMDb rating, which provided an insight into how audiences have received the available content. The histogram plotted for IMDb scores indicated that most titles scored between 6.0 and 7.5, meaning the general reception was positive. Very few titles received extremely low or extremely high scores, suggesting a standard level of quality maintained across the platform.

The distribution of content across release years was also visualized to understand the trend in content release over time. There was a sharp increase in the number of releases starting around 2018, with the highest peak noticed between 2020 and 2021. This surge might correlate with the boom in online content consumption during the pandemic, where platforms like Amazon Prime became the primary source of entertainment.

In the credits dataset, the names of contributors were analyzed to find the most frequently appearing actors and directors. Actors had a higher count, naturally, since a single title can have multiple actors but usually only one or two directors. A role distribution chart showed that the number of people credited as actors was significantly larger than those listed as directors. Top recurring names included several prominent actors who frequently appear across genres.

While both datasets were analyzed separately in this stage of EDA, they could be merged using the common `id` column to derive even deeper insights. For example, such a merge could allow analysis on which actors often appear in high-rated content, or which directors tend to work more in specific genres.
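The merge described above can be sketched on a small stand-in dataset. The frames `titles_demo` and `credits_demo` and all of their rows are hypothetical placeholders for `titles_df` and `credits_df`; only the `id`, `name`, `role`, and `imdb_score` column names come from the project data.

```python
import pandas as pd

# Hypothetical stand-ins for titles_df and credits_df (not real data)
titles_demo = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["Alpha", "Beta", "Gamma"],
    "imdb_score": [7.8, 5.9, 8.1],
})
credits_demo = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3", "t3"],
    "name": ["Actor A", "Director X", "Actor A", "Actor B", "Director X"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR", "DIRECTOR"],
})

# Left-join title metadata onto each credit row.
# validate="many_to_one" raises an error if a title id is duplicated in titles_demo.
merged_demo = credits_demo.merge(
    titles_demo, on="id", how="left", validate="many_to_one"
)

# Mean IMDb score and number of credited titles per actor
actor_scores = (
    merged_demo[merged_demo["role"] == "ACTOR"]
    .groupby("name")["imdb_score"]
    .agg(mean_score="mean", n_titles="count")
    .sort_values("mean_score", ascending=False)
)
print(actor_scores)
```

Reporting `n_titles` next to the mean is a deliberate choice: an actor with a single highly rated title would otherwise outrank actors with a long, consistent record.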

Throughout the analysis, Python libraries such as pandas, matplotlib, and seaborn were used extensively for loading data, cleaning, and generating visualizations. The step-by-step approach helped uncover patterns and distributions that can be useful for further stages like machine learning modeling, recommendation system development, or business reporting. This EDA offers a strong foundation in understanding the structure, quality, and scope of the Amazon Prime content library.


# **GitHub Link -**

https://github.com/PREMSAWANT/Labmentix-Task-1-16-July-to-23-July-

# **Problem Statement**


With the rapid growth of streaming platforms like Amazon Prime Video, it becomes essential to analyze and understand the structure and composition of the content offered. The dataset provided by LabMentix includes metadata and credits of various titles available on Amazon Prime, such as movies and TV shows.

The objective of this task is to perform Exploratory Data Analysis (EDA) to uncover patterns, trends, and key insights from the dataset. This involves examining content types, popular genres, IMDb ratings, release timelines, and contributor roles (actors/directors). By analyzing this data, we aim to better understand viewer preferences, content strategies, and platform diversity. This EDA will also serve as a foundational step for future machine learning or data-driven decision-making tasks in the media and entertainment domain.

#### **Define Your Business Objective?**


The primary business objective is to analyze the available data from Amazon Prime Video to identify valuable patterns and insights that can support content strategy, platform optimization, and user engagement improvement. By understanding which types of content perform well, what genres are most prevalent, and how content distribution has evolved over time, stakeholders can make more informed decisions regarding future content acquisition, production, and recommendation systems.

This analysis can help drive:
- Better understanding of audience preferences
- Strategic planning for content categories
- Enhanced recommendations based on ratings and trends
- Evaluation of popular actors and directors for future collaborations

The insights derived from this EDA can also serve as a foundation for building predictive models or business intelligence dashboards in later stages of development.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing essential libraries for data handling and visualization

import pandas as pd          # For working with datasets
import numpy as np           # For numerical operations
import matplotlib.pyplot as plt  # For basic plotting
import seaborn as sns        # For advanced visualizations
import warnings              # To ignore warnings
warnings.filterwarnings("ignore")

# Setting a visual style for all charts
sns.set(style='whitegrid', palette='pastel')

# Enabling inline plotting for Jupyter
%matplotlib inline


### Dataset Loading

In [None]:
# Load the datasets into pandas DataFrames

import pandas as pd

titles_df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/titles.csv")
credits_df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/credits.csv")


# Display the first few rows of each dataset
print("Titles Dataset Preview:")
display(titles_df.head())

print("\nCredits Dataset Preview:")
display(credits_df.head())
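Since the guidelines ask for exception handling, the loading step could be wrapped in a small helper. This is only a sketch: `load_csv_safely` is a hypothetical name that is not part of the notebook, and it covers just two common failure cases (a missing file and an empty file).

```python
from pathlib import Path

import pandas as pd


def load_csv_safely(path):
    """Return the CSV at `path` as a DataFrame, or None if it cannot be read."""
    path = Path(path)
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"File not found: {path}")
    except pd.errors.EmptyDataError:
        print(f"File is empty: {path}")
    return None


# A path that does not exist returns None instead of stopping the notebook
missing = load_csv_safely("definitely_missing_file_12345.csv")
print(missing)
```

In the notebook, the same helper could be called with the Google Drive paths used above, followed by a check that neither result is `None` before continuing.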

### Dataset First View

In [None]:
# Checking basic information about the datasets

print("📘 Titles Dataset Info:")
print("-" * 40)
titles_df.info()
print("\nShape of titles_df:", titles_df.shape)

print("\n📙 Credits Dataset Info:")
print("-" * 40)
credits_df.info()
print("\nShape of credits_df:", credits_df.shape)


### Dataset Rows & Columns count

In [None]:
# Getting number of rows and columns for both datasets

print(f"Titles Dataset ➤ Rows: {titles_df.shape[0]}, Columns: {titles_df.shape[1]}")
print(f"Credits Dataset ➤ Rows: {credits_df.shape[0]}, Columns: {credits_df.shape[1]}")


### Dataset Information

In [None]:
# Displaying detailed info for both datasets

print("📘 Titles Dataset Info:")
print("-" * 40)
titles_df.info()

print("\n📙 Credits Dataset Info:")
print("-" * 40)
credits_df.info()


#### Duplicate Values

In [None]:
# Check and count duplicate rows in both datasets

# Titles dataset
titles_duplicates = titles_df.duplicated().sum()
print(f"🔁 Duplicate Rows in titles_df: {titles_duplicates}")

# Credits dataset
credits_duplicates = credits_df.duplicated().sum()
print(f"🔁 Duplicate Rows in credits_df: {credits_duplicates}")


#### Missing Values/Null Values

In [None]:
# Checking for missing (null) values in both datasets

print("📘 Missing Values in titles_df:")
print("-" * 40)
print(titles_df.isnull().sum())

print("\n📙 Missing Values in credits_df:")
print("-" * 40)
print(credits_df.isnull().sum())


In [None]:
# Importing the missing-value visualization library
import missingno as msno

# Nullity matrix for titles_df: gaps in a column mark missing values
# (msno.heatmap shows nullity correlation between columns, not where values are missing)
print("📘 Missing Values Visualization for titles_df")
msno.matrix(titles_df, figsize=(10, 5))
plt.show()

# Nullity matrix for credits_df
print("\n📙 Missing Values Visualization for credits_df")
msno.matrix(credits_df, figsize=(10, 5))
plt.show()


### What did you know about your dataset?

From the initial exploration, it was observed that the dataset consists of two files — one containing title metadata (`titles.csv`) and another containing contributor details (`credits.csv`). The `titles_df` includes essential attributes like title name, type (movie or show), release year, genres, IMDb score, and runtime. The `credits_df` contains information about individuals (actors or directors) involved in the respective titles, along with their roles and names.

Upon loading the data, it was found that:
- The `titles_df` contains both movies and TV shows across a wide variety of genres.
- A few columns have missing values, especially in IMDb scores, runtime, and age certifications.
- Duplicate rows were counted in both datasets; any that exist are dropped in the data wrangling step so they do not skew counts.
- The `credits_df` contains multiple entries per title, especially due to multiple actors or directors being involved per project.
- There are significantly more actor records than director records, as expected.

Overall, the dataset appears rich, well-structured, and suitable for conducting univariate, bivariate, and multivariate analysis to extract useful insights related to Amazon Prime content.
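The observations about multiple credits per title and the actor/director imbalance can be checked numerically. The sketch below uses a hypothetical `credits_demo` frame in place of `credits_df`; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for credits_df (not real data)
credits_demo = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t2"],
    "name": ["Actor A", "Actor B", "Director X", "Actor A", "Director Y"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR", "DIRECTOR"],
})

# Absolute and relative number of credit rows per role
role_counts = credits_demo["role"].value_counts()
role_share = credits_demo["role"].value_counts(normalize=True)

# Number of credit rows attached to each title id
credits_per_title = credits_demo.groupby("id").size()

print(role_counts)
print(role_share)
print(credits_per_title)
```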


## ***2. Understanding Your Variables***

In [None]:
# View all column names in both datasets

print("Columns in titles_df:")
print(titles_df.columns.tolist())

print("\nColumns in credits_df:")
print(credits_df.columns.tolist())


In [None]:
# Descriptive statistics for numerical columns

print("Descriptive Stats for titles_df:")
display(titles_df.describe())

# credits_df has mostly categorical data, so describe it separately if needed


### Variables Description

Below is a description of key columns from both datasets:

#### 📁 titles.csv:
- **id**: Unique identifier for each title  
- **title**: Name of the content (movie/show)  
- **type**: Type of content, either MOVIE or SHOW  
- **description**: Summary or plot overview of the content  
- **release_year**: Year in which the content was released  
- **age_certification**: Age rating assigned to the content (e.g., PG, TV-14)  
- **runtime**: Duration of the content in minutes  
- **genres**: Genre(s) associated with the title (e.g., Drama, Comedy)  
- **production_countries**: Countries involved in content production  
- **seasons**: Number of seasons (only for TV shows)  
- **imdb_id**: IMDb identifier  
- **imdb_score**: IMDb rating score (0–10)  
- **imdb_votes**: Number of IMDb users who rated the title  
- **tmdb_popularity**: Popularity score on TMDB  
- **tmdb_score**: Rating on TMDB

#### 🎭 credits.csv:
- **id**: Foreign key that links to the title's `id`  
- **name**: Name of the contributor (actor or director)  
- **role**: The person's role in the title (e.g., ACTOR or DIRECTOR)

These variables will be explored further during univariate, bivariate, and multivariate analysis.


### Check Unique Values for each variable.

In [None]:
# Unique values for each column in titles_df
print("📘 Unique values in titles_df:")
print("-" * 40)
for col in titles_df.columns:
    print(f"{col}: {titles_df[col].nunique()} unique values")

print("\n📙 Unique values in credits_df:")
print("-" * 40)
for col in credits_df.columns:
    print(f"{col}: {credits_df[col].nunique()} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# -----------------------------
# DATA WRANGLING (CLEANING) CODE
# -----------------------------

# 1. Drop duplicate rows if any
titles_df.drop_duplicates(inplace=True)
credits_df.drop_duplicates(inplace=True)

# 2. Handle missing values: fill or drop based on context
# Example: Fill missing 'seasons' with 0 (for movies)
titles_df['seasons'] = titles_df['seasons'].fillna(0)

# 3. Convert data types (if needed)
# Ensure 'release_year' and 'seasons' are integers
titles_df['release_year'] = titles_df['release_year'].astype('Int64')
titles_df['seasons'] = titles_df['seasons'].astype('Int64')

# 4. Clean genre column - fill missing genres
titles_df['genres'] = titles_df['genres'].fillna('Unknown')

# 5. Fill missing IMDb scores with median (optional strategy)
titles_df['imdb_score'] = titles_df['imdb_score'].fillna(titles_df['imdb_score'].median())

# 6. Handle missing runtime by replacing with mean
titles_df['runtime'] = titles_df['runtime'].fillna(titles_df['runtime'].mean())

# 7. Standardize the 'type' column to uppercase for consistency
titles_df['type'] = titles_df['type'].str.upper()

# ✅ Dataset is now cleaned and ready for analysis


### What all manipulations have you done and insights you found?

To prepare the dataset for analysis, the following data wrangling steps were performed:

1. **Removed Duplicate Records**: Ensured there are no repeated rows in either dataset.
2. **Handled Missing Values**:
   - Filled missing `seasons` with `0` for movies.
   - Replaced missing values in `genres` with `"Unknown"`.
   - Filled `imdb_score` nulls using the median value for a balanced score distribution.
   - Filled missing `runtime` values with the column mean.
3. **Converted Data Types**: Cast `release_year` and `seasons` to the nullable integer type `Int64` to support numeric operations.
4. **Standardized Categorical Data**: Converted values in the `type` column to uppercase for consistency.

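These manipulations made the dataset structured, analysis-ready, and free of inconsistencies. The missing-value check showed that a significant number of records lacked IMDb scores and runtime data, indicating that some content might not have been rated or updated recently. Because those gaps were filled with the median and the mean, the filled values add an artificial concentration near the center of the score and runtime distributions, which should be kept in mind when reading the rating and runtime charts. Filling in missing seasons with `0` also helped clearly differentiate movies from shows for future filtering.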

Now the dataset is ready to be used for Univariate, Bivariate, and Multivariate analysis.
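One way to keep imputation transparent is to measure the missing share first and record which rows were filled. The sketch below assumes a hypothetical `titles_demo` frame and a hypothetical flag column `imdb_score_imputed`; neither exists in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titles_df (not real data)
titles_demo = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW", "MOVIE"],
    "imdb_score": [6.0, np.nan, 8.0, np.nan, 7.0],
})

# Share of missing values per column, measured before any filling
missing_share = titles_demo.isna().mean()

# Flag the rows that are about to be filled, then impute with the median
titles_demo["imdb_score_imputed"] = titles_demo["imdb_score"].isna()
median_score = titles_demo["imdb_score"].median()
titles_demo["imdb_score"] = titles_demo["imdb_score"].fillna(median_score)

# Rating charts can later be restricted to genuinely rated titles
rated_only = titles_demo[~titles_demo["imdb_score_imputed"]]
print(missing_share)
print(rated_only)
```

With the flag in place, a histogram of `rated_only["imdb_score"]` would show the observed ratings without the spike that median filling introduces.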


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Content Type Distribution

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(6, 4))
sns.countplot(data=titles_df, x='type', palette='pastel')
plt.title('Distribution of Content Type')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen to easily compare the number of TV shows and movies available on the platform.


##### 2. What is/are the insight(s) found from the chart?

Movies dominate the content library compared to TV shows, indicating Amazon Prime's focus on film-based entertainment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding that users are exposed to more movies can help teams decide whether to balance the catalog with more series or promote long-form content. There’s no negative growth seen here, but a lack of balance may affect binge-watch behavior.

#### Chart - 2 Visualization Code: Top 10 Genres

In [None]:
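# Chart - 2 visualization code
# Note: value_counts() counts each distinct value of the 'genres' column,
# so a multi-genre title is counted under its full genre combination
top_genres = titles_df['genres'].value_counts().head(10)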

plt.figure(figsize=(8, 5))
sns.barplot(x=top_genres.values, y=top_genres.index, palette='Set2')
plt.title('Top 10 Most Common Genres')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart is useful for displaying categorical values like genres, especially when sorting and comparing frequency. It helps visualize which genres are most popular at a glance.


##### 2. What is/are the insight(s) found from the chart?

Drama is the most dominant genre, followed by Comedy, Action, and Thriller. This aligns with global content trends and indicates what viewers are likely consuming the most on Amazon Prime.
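To rank individual genres rather than genre combinations, each title's genre list can be split before counting. The sketch below assumes that `genres` is stored as a string such as `"['drama', 'romance']"`; that format is an assumption and should be confirmed with `titles_df['genres'].head()`. The frame `titles_demo` and the helper `parse_genres` are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical stand-in for titles_df (not real data); 'Unknown' mimics the filled value
titles_demo = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "genres": ["['drama', 'romance']", "['comedy']", "['drama']", "Unknown"],
})


def parse_genres(value):
    """Turn "['drama', 'comedy']" into a list; keep unparseable text as a one-item list."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return [value]
    return list(parsed) if isinstance(parsed, (list, tuple)) else [value]


# One row per (title, genre) pair, then count each genre once per title
genre_counts = titles_demo["genres"].apply(parse_genres).explode().value_counts()
print(genre_counts)
```

The resulting `genre_counts` series can be passed to `sns.barplot` in the same way as `top_genres`.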


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights help the platform focus on top-performing genres and explore potential investment in less-represented ones to expand variety. If niche genres are consistently underrepresented, they could be tested through targeted marketing or regional content pilots.


#### Chart - 3 Visualization Code: IMDb Score Distribution

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8, 5))
sns.histplot(titles_df['imdb_score'].dropna(), bins=30, kde=True, color='skyblue')
plt.title('Distribution of IMDb Scores')
plt.xlabel('IMDb Score')
plt.ylabel('Number of Titles')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A histogram with KDE (Kernel Density Estimation) was selected to analyze how IMDb scores are spread across all titles. This type of chart provides both frequency distribution and the shape of the data.


##### 2. What is/are the insight(s) found from the chart?

The majority of titles have IMDb scores between 6 and 7.5, indicating that most content is rated as average to good. Very few titles have extremely low or extremely high ratings.
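The share of titles inside a score band can be computed directly instead of being read off the histogram. The scores below are hypothetical values used only to show the calculation.

```python
import pandas as pd

# Hypothetical IMDb scores (not real data)
scores = pd.Series([5.2, 6.1, 6.8, 7.4, 7.5, 8.3, 6.5, 4.9])

# between() is inclusive on both ends by default
in_band = scores.between(6.0, 7.5)
share_in_band = in_band.mean()

print(f"Share of titles scoring 6.0 to 7.5: {share_in_band:.1%}")
```

Applied to `titles_df['imdb_score']`, the same two lines would put a number on "the majority of titles".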


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that most content performs within a stable rating range shows consistency, but also highlights the need to produce more highly rated content to boost reputation. A cluster of lower-rated titles may indicate quality issues or outdated content, which could impact viewer trust negatively.


#### Chart - 4 Visualization Code: Content Released Over the Years

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 5))
titles_df['release_year'].value_counts().sort_index().plot(kind='line', color='salmon')
plt.title('Content Released Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line plot is the best fit for time-series data like release years. It helps identify trends, spikes, or drops over time in the number of titles released.


##### 2. What is/are the insight(s) found from the chart?

The chart shows a sharp rise in content release between 2018 and 2021. This suggests a surge in production or acquisition of titles during this period, possibly influenced by the pandemic and increased OTT consumption.
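The peak year and the share of recent releases can be confirmed from the same counts that feed the line plot. The release years below are hypothetical.

```python
import pandas as pd

# Hypothetical release years (not real data)
years = pd.Series([2015, 2018, 2019, 2019, 2020, 2020, 2020, 2021, 2021])

# Titles per year in chronological order, as used for the line plot
per_year = years.value_counts().sort_index()

peak_year = per_year.idxmax()             # year with the most titles
share_since_2018 = (years >= 2018).mean()  # fraction released in 2018 or later

print(per_year)
print(peak_year, round(share_since_2018, 3))
```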


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This trend reveals the years with the highest content volume, useful for analyzing viewer engagement or content performance during that time. A sudden drop (if any) after 2021 could indicate a slowdown in content production, which may require business attention.


#### Chart - 5 Visualization Code: Runtime Distribution

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8, 5))
sns.histplot(titles_df['runtime'].dropna(), bins=40, kde=True, color='lightgreen')
plt.title('Distribution of Runtime (in minutes)')
plt.xlabel('Runtime')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A histogram with KDE was used to analyze the spread of runtime values across all content. This helps identify the most common durations for titles on the platform.


##### 2. What is/are the insight(s) found from the chart?

Most content is concentrated in the 80–120 minute range, which aligns with typical movie lengths. A few shorter durations may belong to TV episodes or short films.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding runtime trends helps content teams align production with viewer preferences. If very long or very short content underperforms, adjustments can be made to match popular lengths. Outliers may also help explore new formats like mini-series or shorts.


#### Chart - 6 Visualization Code: Age Certification Distribution

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(8, 5))
sns.countplot(data=titles_df, x='age_certification',
              order=titles_df['age_certification'].value_counts().index,
              palette='muted')
plt.title('Age Certification Distribution')
plt.xlabel('Age Certification')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A count plot (bar chart) is ideal for comparing the number of titles under each age certification. It's simple, clear, and allows us to understand how content is rated for different age groups.


##### 2. What is/are the insight(s) found from the chart?

Most titles fall under age ratings like "TV-14", "TV-MA", and "PG-13", which indicates the platform targets teenagers and adults more than children. Very few titles are rated for general or child-specific audiences.
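`sns.countplot` leaves out rows where `age_certification` is missing, so the chart alone cannot show how many titles have no rating at all. A minimal sketch for including them, using hypothetical values and a hypothetical label `"Missing"`:

```python
import pandas as pd

# Hypothetical age certifications (not real data); None marks a missing value
cert = pd.Series(["TV-14", "R", None, "PG-13", None, "R", None])

# Give missing values an explicit label so they appear in the counts
cert_counts = cert.fillna("Missing").value_counts()
missing_share = cert.isna().mean()

print(cert_counts)
print(f"Share without a certification: {missing_share:.1%}")
```

If the missing share is large, conclusions about target audiences apply only to the subset of titles that carry a certification.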


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing the content age distribution helps Amazon Prime evaluate if they are missing out on specific audiences like families or kids. A business decision can be made to invest in more child-friendly or all-age content to widen the subscriber base.


#### Chart - 7 Visualization Code: IMDb Score by Content Type

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(7, 5))
sns.boxplot(data=titles_df, x='type', y='imdb_score', palette='pastel')
plt.title('IMDb Score by Content Type')
plt.xlabel('Type')
plt.ylabel('IMDb Score')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A box plot is ideal to compare the distribution of IMDb scores between Movies and Shows. It visually displays medians, ranges, and potential outliers in each category.

##### 2. What is/are the insight(s) found from the chart?

Both Movies and Shows have similar median IMDb scores, but shows seem to have a slightly wider range and more outliers. This suggests that while average quality is similar, shows vary more in audience reception.
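The median and spread that the box plot displays can also be tabulated per content type. The sketch below uses a hypothetical `demo` frame; the interquartile range (`iqr`) is used here as the measure of spread.

```python
import pandas as pd

# Hypothetical scores per content type (not real data)
demo = pd.DataFrame({
    "type": ["MOVIE"] * 4 + ["SHOW"] * 4,
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 4.0, 6.0, 7.0, 9.0],
})

# Median and quartiles per type, matching the box plot's center line and box edges
summary = demo.groupby("type")["imdb_score"].agg(
    median="median",
    q1=lambda s: s.quantile(0.25),
    q3=lambda s: s.quantile(0.75),
)
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```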

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It can guide the platform to focus on quality control in TV series, where variability is higher. Consistency in episodic content could improve viewer trust. If shows are more polarizing, Amazon may consider stricter quality filters or testing.


#### Chart - 8 Visualization Code: IMDb Score by Top 5 Genres

In [None]:
# Chart - 8 visualization code
top_genres_list = titles_df['genres'].value_counts().head(5).index.tolist()
filtered_df = titles_df[titles_df['genres'].isin(top_genres_list)]

plt.figure(figsize=(10, 5))
sns.boxplot(data=filtered_df, x='genres', y='imdb_score', palette='Set3')
plt.title('IMDb Score by Top 5 Genres')
plt.xlabel('Genre')
plt.ylabel('IMDb Score')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A box plot provides a clear comparison of IMDb score distributions across multiple categories. It’s the best way to spot median scores, spread, and outliers in different genres.


##### 2. What is/are the insight(s) found from the chart?

Among the top 5 genres, some like Drama and Thriller have wider score ranges and more outliers, while Comedy shows more tightly packed ratings. It shows how certain genres are more consistently rated than others.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely. Amazon Prime can use this to identify genres with unstable reception and focus on improving quality within those genres. It also helps understand which genre delivers consistent viewer satisfaction, guiding future investments.


#### Chart - 9 Visualization Code: Average IMDb Score Over the Years

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 5))
# errorbar=None replaces the deprecated ci=None (requires seaborn >= 0.12)
sns.lineplot(data=titles_df, x='release_year', y='imdb_score', estimator='mean', errorbar=None, color='purple')
plt.title('Average IMDb Score Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Average IMDb Score')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line plot is ideal for showing trends over time, especially when analyzing changes in average scores across release years.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that the average IMDb score fluctuates over the years, with slight rises and drops. Some years show higher consistency, while others may dip due to less favorable content.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Identifying years with lower average scores can help Amazon Prime review the type of content released during those periods. Learning from past performance can improve future release strategies and viewer satisfaction. A drop in ratings might signal a dip in content quality or mismatch with audience expectations.


#### Chart - 10 Visualization Code: Runtime vs IMDb Score

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(8, 5))
sns.scatterplot(data=titles_df, x='runtime', y='imdb_score', alpha=0.5, color='teal')
plt.title('Runtime vs IMDb Score')
plt.xlabel('Runtime (minutes)')
plt.ylabel('IMDb Score')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is perfect for identifying patterns between two continuous variables — in this case, runtime and IMDb score. It helps observe clusters, trends, or outliers.


##### 2. What is/are the insight(s) found from the chart?

There is no strong correlation between runtime and IMDb score. Both short and long content can be highly rated or poorly rated. However, most highly rated titles fall in the 80–120 minute range, which aligns with standard feature-length content.
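The strength of the relationship can be quantified with a correlation coefficient rather than judged by eye. The sketch below uses hypothetical values; the rank-based coefficient is computed as the Pearson correlation of the ranks, which is the definition of Spearman's rho.

```python
import pandas as pd

# Hypothetical runtime and score values (not real data)
demo = pd.DataFrame({
    "runtime": [30, 60, 90, 120, 150],
    "imdb_score": [6.0, 7.0, 6.5, 7.5, 7.0],
})

# Pearson measures linear association
pearson_r = demo["runtime"].corr(demo["imdb_score"])

# Spearman measures monotonic association: Pearson correlation of the ranks
spearman_rho = demo["runtime"].rank().corr(demo["imdb_score"].rank())

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```

Values near 0 on `titles_df` would support the "no strong correlation" reading; imputed scores and runtimes should ideally be excluded first, because filled values weaken any real relationship.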


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight reassures the business that runtime alone does not determine quality. However, focusing on optimizing content length to the most common highly rated range can enhance engagement. Very short or overly long content may require deeper viewer feedback to assess value.


#### Chart - 11 Visualization Code: Top 10 Directors

In [None]:
# Chart - 11 visualization code
top_directors = credits_df[credits_df['role'] == 'DIRECTOR']['name'].value_counts().head(10)

plt.figure(figsize=(9, 5))
sns.barplot(x=top_directors.values, y=top_directors.index, palette='flare')
plt.title('Top 10 Directors by Number of Titles')
plt.xlabel('Number of Titles')
plt.ylabel('Director Name')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart is ideal for showcasing rankings among categorical data like director names. It clearly highlights the top contributors.


##### 2. What is/are the insight(s) found from the chart?

The chart reveals which directors have directed the most titles available on Amazon Prime. This may include frequent collaborators or creators with long-term partnerships with the platform.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Identifying top-performing or frequently featured directors helps in recognizing valuable industry partnerships. It also allows Amazon Prime to evaluate the success of those titles and make future content decisions accordingly.


#### Chart - 12 Visualization Code: Top 10 Actors

In [None]:
# Chart - 12 visualization code
top_actors = credits_df[credits_df['role'] == 'ACTOR']['name'].value_counts().head(10)

plt.figure(figsize=(9, 5))
sns.barplot(x=top_actors.values, y=top_actors.index, palette='viridis')
plt.title('Top 10 Actors by Number of Appearances')
plt.xlabel('Number of Titles')
plt.ylabel('Actor Name')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart was used to clearly compare the number of appearances made by the top 10 actors. It helps in ranking contributors effectively.


##### 2. What is/are the insight(s) found from the chart?

This chart highlights the most featured actors on Amazon Prime content. These individuals may be fan favorites or versatile performers working across genres.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing which actors appear frequently helps in understanding star power and viewer engagement. These insights can help in casting decisions for future productions. Overexposure of the same faces, however, might risk content fatigue, which should be monitored.


#### Chart - 13 Visualization Code: Content Type vs Genre Count

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(data=titles_df, x='genres', hue='type', order=titles_df['genres'].value_counts().head(7).index, palette='Set1')
plt.title('Top Genres by Content Type (Movies vs Shows)')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A grouped count plot is perfect for comparing two categorical variables – genre and type (Movie or Show). It helps identify genre preferences based on content format.


##### 2. What is/are the insight(s) found from the chart?

Some genres like Drama and Comedy are more common across both movies and shows, while others may be skewed toward one type. For example, certain genres may appear more in shows than in movies.
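A cross-tabulation gives the exact counts behind a grouped count plot. The sketch below assumes the genres have already been parsed into Python lists; the `demo` frame and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical titles with genres already parsed into lists (not real data)
demo = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "SHOW", "MOVIE"],
    "genres": [["drama", "comedy"], ["drama"], ["comedy"], ["drama", "thriller"]],
})

# One row per (title, genre) pair, then count titles per genre and content type
exploded = demo.explode("genres")
genre_by_type = pd.crosstab(exploded["genres"], exploded["type"])
print(genre_by_type)
```

Passing `normalize="index"` to `pd.crosstab` would show, for each genre, the fraction of its titles that are movies versus shows.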


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding which genres work best in which format can help Amazon Prime plan content accordingly. If a genre performs better as a series, more episodic investments can be made. This insight helps optimize content for the right format.
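
One caveat on the chart above: the `genres` column appears to hold stringified lists (Chart 22 parses it with `ast.literal_eval`), so counting it directly ranks genre combinations rather than individual genres. A minimal sketch of a per-genre count by content type follows. It uses a small invented frame, `toy_titles`, in place of `titles_df`, and the `MOVIE`/`SHOW` labels are assumptions about how `type` is coded.

```python
import ast

import pandas as pd

# Toy stand-in for titles_df; the rows are invented for illustration.
toy_titles = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW"],
    "genres": ["['drama', 'romance']", "['comedy']", "['drama']", "['drama', 'comedy']"],
})

# Parse the stringified lists without modifying the original frame.
parsed = toy_titles.assign(
    genres=toy_titles["genres"].apply(
        lambda g: ast.literal_eval(g) if isinstance(g, str) else g
    )
)

# Give each genre its own row, then cross-tabulate genre against type.
per_genre = parsed.explode("genres")
genre_by_type = pd.crosstab(per_genre["genres"], per_genre["type"])
print(genre_by_type)
```

The resulting table could be passed to a grouped bar plot, so that a title tagged with several genres contributes to each of them.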


#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 visualization code
plt.figure(figsize=(10, 6))
numerical_cols = titles_df[['imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity', 'runtime']]
corr = numerical_cols.corr()

sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is perfect for displaying correlations between numerical variables. It helps easily detect which features move together or in opposite directions.


##### 2. What is/are the insight(s) found from the chart?

There is a positive correlation between IMDb score and TMDb score, and between TMDb popularity and IMDb votes. Runtime correlates only weakly with the ratings, which suggests that duration has little linear relationship with score.


##### 3. Will the gained insights help creating a positive business impact?
Yes. Knowing which factors are linked helps prioritize features for predictive modeling or business evaluation. For instance, if popularity moves closely with votes, marketing strategies could focus on increasing engagement, keeping in mind that correlation alone does not show which one drives the other. No negative growth insights were found, but weakly correlated features may require separate attention.
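
`DataFrame.corr()` computes Pearson correlation by default, which only measures linear association, and count-like columns such as votes and popularity are often heavily skewed. As a hedged cross-check, a rank-based (Spearman) matrix can be compared against it. The numbers below are invented purely to show how the two measures can differ.

```python
import pandas as pd

# Invented values: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 10, 100, 1000, 10000],
})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Pearson is pulled below 1 by the curvature; Spearman only looks at ranks.
print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

On the notebook's data the equivalent call would be `numerical_cols.corr(method='spearman')`.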

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Selecting relevant numeric columns for the pairplot
pairplot_df = titles_df[['imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity', 'runtime']].dropna()

# No hue variable is set, so a palette argument would have no effect here
sns.pairplot(pairplot_df, diag_kind='kde', corner=True)
plt.suptitle('Pair Plot of Numerical Features', y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot allows us to visualize all pairwise relationships between multiple numerical variables in a single grid. It reveals trends, clusters, and correlations that might not be obvious when comparing only two variables at a time.


##### 2. What is/are the insight(s) found from the chart?

IMDb and TMDb scores are visibly correlated. TMDb popularity and IMDb votes are widely scattered against each other, and runtime shows very weak linear correlation with the other variables. Some variable pairs form dense clusters, while others are more spread out.


##### 3. Will the gained insights help creating a positive business impact?
Yes. This visualization can guide feature selection for predictive modeling. Highly correlated variables can be combined or prioritized, and weakly correlated ones can be explored independently. Identifying such patterns early reduces trial-and-error in downstream modeling and decision-making.

#### Chart - 16 IMDb vs TMDb Score Comparison

In [None]:
# Chart - 16 visualization code
plt.figure(figsize=(7, 5))
sns.scatterplot(data=titles_df, x='imdb_score', y='tmdb_score', hue='type', palette='Set1')
plt.title('IMDb vs TMDb Score')
plt.xlabel('IMDb Score')
plt.ylabel('TMDb Score')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is perfect for comparing two continuous variables like IMDb and TMDb scores and identifying correlations or rating patterns.



##### 2. What is/are the insight(s) found from the chart?

There is a moderately strong positive relationship between IMDb and TMDb scores, with movies showing slightly wider variation in ratings than TV shows.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that IMDb and TMDb scores generally agree gives more confidence when evaluating content quality and helps curation teams make consistent decisions. No negative growth is found, but inconsistencies in a few titles may warrant manual review.
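
The manual review mentioned above could start from a simple list of titles where the two scores disagree most. This is a sketch under stated assumptions: `toy_titles` is an invented stand-in for `titles_df`, and `GAP_THRESHOLD` is a hypothetical cutoff, not a value derived from the data.

```python
import pandas as pd

# Toy stand-in for titles_df; titles and scores are invented.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.5, 6.0, 8.2, 4.0],
    "tmdb_score": [7.3, 8.5, 8.0, None],
})

# Hypothetical cutoff for a "large" disagreement; tune it to the real data.
GAP_THRESHOLD = 2.0

# Only titles with both scores can be compared.
scored = toy_titles.dropna(subset=["imdb_score", "tmdb_score"]).copy()
scored["score_gap"] = (scored["imdb_score"] - scored["tmdb_score"]).abs()

to_review = scored[scored["score_gap"] >= GAP_THRESHOLD].sort_values(
    "score_gap", ascending=False
)
print(to_review[["title", "imdb_score", "tmdb_score", "score_gap"]])
```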

#### Chart - 17: Content Release Trend by Year

In [None]:
# Chart - 17 Visualization code
yearly_titles = titles_df['release_year'].value_counts().sort_index()

plt.figure(figsize=(10, 5))
sns.lineplot(x=yearly_titles.index, y=yearly_titles.values, marker='o', color='orange')
plt.title('Number of Titles Released Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Count of Titles')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line chart effectively visualizes trends over time, showing how content production or acquisition has changed year over year.




##### 2. What is/are the insight(s) found from the chart?

A notable increase in titles released after 2015 shows that the catalog skews toward recent content, which is consistent with Amazon Prime's content expansion in recent years.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Recognizing this upward trend supports continued investment in content creation and acquisition. A dip or plateau in recent years could signal market saturation or shifting strategies.

#### Chart - 18: Top 10 Most Featured Actors

In [None]:
# Chart - 18 visualization code
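# Keep only acting credits; assumes credits_df has a 'role' column with the value 'ACTOR'
top_actors = credits_df.loc[credits_df['role'] == 'ACTOR', 'name'].value_counts().head(10)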

plt.figure(figsize=(8, 5))
sns.barplot(x=top_actors.values, y=top_actors.index, palette='magma')
plt.title('Top 10 Most Featured Actors on Amazon Prime')
plt.xlabel('Number of Titles')
plt.ylabel('Actor Name')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart is ideal for clearly comparing top performers across categories like actors, especially when name labels are long.



##### 2. What is/are the insight(s) found from the chart?

A small group of actors appear in multiple titles, indicating they are either favorites of Amazon Prime productions or have high audience engagement.



##### 3. Will the gained insights help creating a positive business impact?


Yes. Identifying popular actors helps with casting decisions and marketing. Over-reliance on a few faces could lead to audience fatigue — variety is key to long-term engagement.

#### Chart - 19: Top 10 Production Countries

In [None]:
# Chart - 19 visualization code (using 'production_countries')
plt.figure(figsize=(8, 5))
country_counts = titles_df['production_countries'].value_counts().head(10)
sns.barplot(x=country_counts.values, y=country_counts.index, palette='Blues')
plt.title('Top 10 Production Countries on Amazon Prime')
plt.xlabel('Number of Titles')
plt.ylabel('Country Code')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This horizontal bar chart is ideal for comparing categorical values like production countries. It makes longer labels easier to read and keeps the plot clean.



##### 2. What is/are the insight(s) found from the chart?

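The chart ranks production countries, not languages, so it describes where titles are made rather than what language they are in. A small number of countries account for most titles, followed by a tail of regional and international producers, which is consistent with the platform’s global-first strategy.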



##### 3. Will the gained insights help creating a positive business impact?

Yes. Knowing where content is produced helps in localization strategy and marketing. Less variety in production countries might limit content appeal to audiences outside the dominant markets.

#### Chart - 20: IMDb Score Distribution by Content Type

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(data=titles_df, x='type', y='imdb_score', palette='Set2')
plt.title('IMDb Score Distribution by Content Type')
plt.xlabel('Content Type')
plt.ylabel('IMDb Score')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot is ideal to visualize the spread, median, and outliers of IMDb scores across different content types (Movies vs TV Shows). It gives a clear picture of variation and central tendency.



##### 2. What is/are the insight(s) found from the chart?

*   TV Shows tend to have slightly higher median IMDb scores than Movies.

*   Movies show a wider spread in their scores compared to TV Shows.

*   There are some low-rated outliers in both categories.



##### 3. Will the gained insights help create a positive business impact?

Yes, understanding the quality of content (via IMDb ratings) helps prioritize promotions or licensing. High-rated shows may deserve spotlighting to retain users. There’s no negative growth, but identifying and minimizing low-rated content might help reduce churn.
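
The median and spread comparisons read off the boxplot can be checked numerically. The sketch below computes the median and interquartile range per content type on an invented frame; `toy_titles` and the helper `iqr` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for titles_df; scores are invented.
toy_titles = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "imdb_score": [3.0, 5.0, 6.0, 7.0, 9.0, 6.0, 6.5, 7.0, 7.5, 8.0],
})


def iqr(scores: pd.Series) -> float:
    """Interquartile range: the height of the box in a boxplot."""
    return scores.quantile(0.75) - scores.quantile(0.25)


# One row per content type with its median, IQR, and number of scored titles.
summary = toy_titles.groupby("type")["imdb_score"].agg(["median", iqr, "count"])
print(summary)
```

Running the same aggregation on `titles_df` would put concrete numbers behind "slightly higher median" and "wider spread".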

#### Chart - 21: Pie Chart – Content Type Distribution

In [None]:
plt.figure(figsize=(6, 6))
titles_df['type'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff'])
plt.title('Content Type Distribution (Pie Chart)')
plt.ylabel('')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is ideal for visualizing proportions of categories as part of a whole. It offers an intuitive way to show how much of the total content belongs to each type (Movie or TV Show).




##### 2. What is/are the insight(s) found from the chart?

Movies make up a significantly larger portion of the content available on Amazon Prime, while TV Shows form a smaller fraction. This confirms that the platform emphasizes movie-based content more heavily.



##### 3. Will the gained insights help create a positive business impact?

Yes. Knowing that movies dominate the catalog helps decision-makers judge whether to invest more in TV shows and other episodic content to improve long-term retention.



#### Chart - 22: Average IMDb Score by Genre and Content Type

In [None]:
# Convert the genre column from string representation of list to actual list (if not already a list)
import ast

# Safely convert stringified lists into actual Python lists
titles_df['genres'] = titles_df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Explode the list of genres
genre_expanded_df = titles_df.explode('genres')

# Group by type and genre, then calculate the average IMDb score
genre_score = genre_expanded_df.groupby(['type', 'genres'])['imdb_score'].mean().reset_index()

# Pivot the data for heatmap
genre_score_pivot = genre_score.pivot(index='genres', columns='type', values='imdb_score')

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(genre_score_pivot, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5, cbar_kws={'label': 'IMDb Score'})
plt.title('Average IMDb Score by Genre and Content Type')
plt.xlabel('Content Type')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is excellent for visualizing relationships in multi-dimensional data. It gives a clear overview of which genres perform best in terms of IMDb score across content types.





##### 2. What is/are the insight(s) found from the chart?

Some genres like Drama and Documentary have consistently high scores across both Movies and TV Shows.

Genres like Animation and Horror vary significantly depending on content type.

Certain genres are exclusive to either Movies or TV Shows, as seen in missing values.

##### 3. Will the gained insights help create a positive business impact?

Yes. This helps content planners focus on high-performing genres for both types of media. For instance, if Documentaries perform well in both formats, more of such content can be produced. There's no negative growth spotted, but genres with low scores might need review or rebranding.
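
A mean over a genre and type combination with only one or two titles can look extreme in the heatmap. One possible safeguard, sketched here on invented data, is to compute the group size alongside the mean and keep only cells above a minimum count. `MIN_TITLES` is a hypothetical cutoff and `toy` stands in for the exploded frame.

```python
import pandas as pd

# Toy exploded frame (one row per title and genre); values are invented.
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "genres": ["drama", "drama", "western", "drama", "drama", "drama"],
    "imdb_score": [6.0, 7.0, 9.0, 7.0, 8.0, 9.0],
})

MIN_TITLES = 2  # hypothetical cutoff; choose based on the real group sizes

# Mean score and number of titles for every (type, genre) pair.
stats = (
    toy.groupby(["type", "genres"])["imdb_score"]
    .agg(mean_score="mean", n_titles="count")
    .reset_index()
)

# Drop cells backed by too few titles before pivoting for the heatmap.
reliable = stats[stats["n_titles"] >= MIN_TITLES]
pivot = reliable.pivot(index="genres", columns="type", values="mean_score")
print(pivot)
```

In this toy example the single western movie is excluded, so its score of 9.0 does not appear as a top-performing genre.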




#### Chart - 23: Violin Plot – IMDb Score by Type


In [None]:
plt.figure(figsize=(8, 6))
sns.violinplot(data=titles_df, x='type', y='imdb_score', palette='Set2')
plt.title('Distribution of IMDb Scores by Content Type')
plt.xlabel('Type')
plt.ylabel('IMDb Score')
plt.show()


##### 1. Why did you pick the specific chart?

The violin plot shows both the distribution shape and spread of IMDb scores for movies and TV shows. It combines the benefits of a boxplot and KDE (density estimate), offering more depth in visual analysis.



##### 2. What is/are the insight(s) found from the chart?

TV shows appear to have slightly higher median IMDb scores than movies. Also, TV shows exhibit a more compact score distribution, while movies have a broader range, including more extreme outliers.



##### 3. Will the gained insights help create a positive business impact?

Yes. Understanding scoring trends helps in content investment—knowing TV shows often receive more consistent ratings could justify investing in episodic content. There’s no major negative insight here, but wide score variation in movies could mean quality inconsistency.



#### Chart - 24: Donut Chart – Age Certification Breakdown


In [None]:
# Prepare the data
age_counts = titles_df['age_certification'].value_counts().dropna().head(6)

# Plot improved donut chart
fig, ax = plt.subplots(figsize=(8, 6))
colors = sns.color_palette("Set2")[:6]
wedges, texts, autotexts = ax.pie(
    age_counts,
    labels=age_counts.index,
    autopct='%1.1f%%',
    startangle=140,
    pctdistance=0.80,
    colors=colors,
    wedgeprops=dict(width=0.4, edgecolor='white')
)

# Custom styling
for text in texts:
    text.set_fontsize(12)
    text.set_color('black')

for autotext in autotexts:
    autotext.set_fontsize(12)
    autotext.set_color('black')
    autotext.set_weight('bold')

# Add center circle to create a clean donut hole
centre_circle = plt.Circle((0, 0), 0.50, fc='white')
fig.gca().add_artist(centre_circle)

# Title and layout
plt.title('Top Age Certifications on Amazon Prime', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A donut chart is visually appealing and ideal for representing parts of a whole, just like a pie chart but with more clarity and space in the center. It simplifies reading large portions while still giving percentage context.



##### 2. What is/are the insight(s) found from the chart?

Most of the content falls under certifications like TV-MA, PG-13, and R, indicating that Amazon Prime targets mature and teenage audiences more than children or general audiences.



##### 3. Will the gained insights help create a positive business impact?

Yes. These insights help optimize parental controls, user segmentation, and targeted recommendations. If family content is underrepresented, it could be a growth area to explore. No direct negative impact unless younger audience demand is being ignored.
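
`value_counts()` drops missing values by default, so the percentages in the donut describe only titles that have a certification. If many rows lack one (an assumption worth checking on the real data), the chart overstates each rating's share of the whole catalog. A minimal sketch of that check, using an invented column:

```python
import pandas as pd

# Toy stand-in for titles_df['age_certification']; values are invented.
toy_titles = pd.DataFrame({
    "age_certification": ["R", "PG-13", None, None, "R", None, "TV-MA", None],
})

# Shares of all titles, with missing certifications kept as their own group.
shares = toy_titles["age_certification"].value_counts(normalize=True, dropna=False)

# Fraction of titles with no certification at all.
missing_share = toy_titles["age_certification"].isna().mean()

print(shares)
print(f"missing: {missing_share:.0%}")
```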




#### Chart - 25: Share of Content Types (Movies vs TV Shows)

In [None]:
# Count the content types
type_counts = titles_df['type'].value_counts()

# Define colors and explode effect
colors = ['#66b3ff', '#ff9999']
explode = (0.05, 0.05)  # Slight separation of both slices

# Plot pie chart
plt.figure(figsize=(7, 6))
plt.pie(
    type_counts,
    labels=type_counts.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=colors,
    explode=explode,
    shadow=True,
    textprops={'fontsize': 12, 'color': 'black'}
)
plt.title('Share of Movies vs TV Shows on Amazon Prime', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is ideal for showing proportional comparisons between categories, in this case between Movies and TV Shows.




##### 2. What is/are the insight(s) found from the chart?

Movies form the majority of the content on Amazon Prime, with a significantly smaller share of TV Shows.




##### 3. Will the gained insights help create a positive business impact?

Yes. Knowing the imbalance helps decision-makers invest more in TV Show production or promotion if they aim to capture binge-watchers and long-format viewers.





## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the Exploratory Data Analysis performed on Amazon Prime's content metadata and contributor datasets, several clear actions can be recommended to support the client's business objectives:

1. **Strengthen High-Performing Genres**: Genres like Drama, Comedy, and Thriller are not only frequent but also show consistent audience reception. The client should continue investing in these genres while closely monitoring their IMDb and TMDB score trends.

2. **Balance Content Types**: Movies dominate the platform compared to shows. To boost user engagement and long-term retention, the client can consider increasing high-quality TV series and episodic content.

3. **Improve Rating Metrics**: While the average IMDb scores are stable (mostly between 6–7.5), there is room for improvement. The client should analyze low-rated titles, identify common issues (e.g., poor direction, runtime problems), and use that feedback to refine future productions.

4. **Target Underrepresented Age Groups**: The data shows a lack of content rated for children and family viewing. Introducing more family-friendly or animated content could help expand the platform’s audience and capture new subscriber segments.

5. **Leverage Star Power Smartly**: Top directors and actors have a high presence on the platform. Collaborating with them on big projects can create marketing buzz. However, care should be taken not to overuse the same talent, which might cause viewer fatigue.

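6. **Time-Based Release Planning**: The surge in releases during 2020–2021 shows how quickly the catalog grew in that period, but release counts alone do not measure viewer response. This analysis uses only `release_year`, so the client should combine it with viewing or engagement data, broken down by month or quarter, before scheduling premium releases around specific periods.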

7. **Use Data for Content Forecasting**: The relationships found between popularity, votes, and ratings can help in building machine learning models to forecast potential performance of unreleased titles — helping with smarter acquisition and promotion decisions.

These actions align with Amazon Prime’s goal of boosting viewership, improving content quality, diversifying audience segments, and optimizing content performance using data-driven strategies.

# **Conclusion**

Through this Exploratory Data Analysis (EDA) project on Amazon Prime's movies and TV shows dataset, we systematically explored and uncovered meaningful insights across various dimensions — including content types, genres, user ratings, popularity, contributors, and release patterns.

By analyzing the metadata and credits, we identified the platform’s strengths such as a rich movie catalog and strong performance of genres like Drama and Comedy. We also highlighted areas of opportunity, such as the underrepresentation of certain genres, imbalanced content formats, and rating trends that can be improved through better content strategies.

This EDA not only revealed valuable data-driven patterns but also opened doors for actionable business improvements. It sets a strong foundation for future steps like predictive modeling, recommendation systems, or performance forecasting.

The entire notebook is structured, cleanly coded, and fully reproducible — aligning with deployment-ready standards.

This marks the successful completion of the EDA Capstone Project with key business insights and data-backed recommendations.
