# **Project Name**    -



##### **Project Type**    - Exploratory data analysis (EDA) on AmazonPrime
##### **Contribution**    - Individual
##### **Team Member 1 -** Shreyansh Saxena

# **Project Summary -**

The objective of this project was to perform Exploratory Data Analysis (EDA) on Amazon Prime TV shows and movies to understand the key trends, patterns, and potential improvements for content recommendations. The analysis was conducted using two datasets: titles.csv, which contains detailed information about movies and TV shows, and credits.csv, which includes details about the people involved in these productions.

Understanding the Dataset

The titles.csv dataset consists of 9,871 entries with 15 columns, including title names, types (movie or show), descriptions, genres, release years, age certifications, IMDb and TMDb ratings, and popularity scores. The credits.csv dataset has 124,235 records with 5 columns, detailing actors, directors, and other contributors to each title.

Initial data inspection revealed several missing values, particularly in key columns such as age_certification, seasons, imdb_score, and tmdb_score. The credits.csv dataset had some missing values in the character column, where certain actors did not have a recorded role name. Additionally, the datasets contained duplicate records, which needed to be removed for accurate analysis.

Data Cleaning and Preprocessing

To ensure data quality, several data wrangling techniques were applied:

* Duplicate Removal: Duplicate records in both datasets were dropped to avoid redundancy.
* Handling Missing Values: Missing values in age_certification were replaced with “Not Rated,” missing seasons were set to 0 for movies, and IMDb/TMDb scores were filled with their respective median values. Missing character names in credits.csv were replaced with “Unknown.”
* Data Type Optimization: Columns like release_year, seasons, and imdb_votes were converted to integer types, while IMDb and TMDb scores were set as floats for numerical analysis.
* Text Standardization: Title names were capitalized, genres were converted to lowercase for consistency, and production country names were cleaned for better readability.

Key Insights from the Analysis

After cleaning the dataset, several key insights were uncovered:

* A significant number of shows were missing age certifications, which could impact parental control features and audience targeting.
* IMDb and TMDb ratings had some missing values, which might affect recommendation algorithms if not handled properly.
* The seasons column was mostly empty, indicating that most titles were movies rather than TV shows.
* Popularity scores varied widely, suggesting that some titles gained significant audience traction while others remained less popular.

Business Recommendations

Based on the insights from the EDA, several recommendations were made to improve the content strategy for Amazon Prime:

* Enhancing Content Recommendations: By utilizing IMDb and TMDb ratings, the platform can prioritize highly-rated shows and movies in user recommendations.
* Filling Metadata Gaps: Missing values in key columns like age certifications and ratings should be cross-referenced with external databases to improve data completeness.
* Targeted Marketing Based on Popularity: High tmdb_popularity scores indicate which titles attract more viewers. These should be highlighted in promotional banners and featured sections.
* Regional Content Optimization: Analyzing production_countries and age_certification can help curate region-specific recommendations, improving audience engagement.
* Better Differentiation Between TV Shows and Movies: Using the seasons column to correctly classify content will enhance search filters and user experience.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

Content Diversity: What genres and categories dominate the platform?

Regional Availability: How does content distribution vary across different regions?

Trends Over Time: How has Amazon Prime's content library evolved?

IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?

By analyzing this dataset, businesses, content creators, and data analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.

Main Libraries to be used:

* Pandas for data manipulation and aggregation
* Matplotlib and Seaborn for visualization and behavior with respect to the target variable. Use at least 5 different visualizations.
* NumPy for computationally efficient operations

#### **Define Your Business Objective?**

In today's competitive streaming industry, platforms like Amazon Prime Video are constantly expanding their content libraries to cater to diverse audiences. With a growing number of shows and movies available on the platform, data-driven insights play a crucial role in understanding trends, audience preferences, and content strategy.

This dataset was created to list all shows available on Amazon Prime streaming and to analyze the data for interesting facts. It covers titles available in the United States.

This dataset has 2 csv files and it is a mix of categorical and numeric values.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



### Dataset Loading

In [None]:
# Load the titles dataset
titles_df = pd.read_csv('/titles.csv')

# Load the credits dataset
credits_df = pd.read_csv('/credits.csv')
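The guidelines above ask for exception handling, so the loading step could be wrapped in a small helper. The following is only a sketch: `load_csv` is a hypothetical name that does not appear in the notebook, and the temporary sample file stands in for the real `titles.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Hypothetical helper: read a CSV and fail with a clear message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    try:
        return pd.read_csv(csv_path)
    except pd.errors.EmptyDataError as exc:
        # Raised by pandas when the file has no columns to parse
        raise ValueError(f"Dataset is empty: {csv_path}") from exc


# Demonstration on a throwaway file (illustrative data, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_titles.csv"
    sample_path.write_text("id,title\ntm1,Example\n")
    sample_df = load_csv(sample_path)

print(sample_df.shape)
```

In the notebook, the same helper could be called with the paths used above, so that a wrong path stops execution with a readable error instead of a long traceback.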







### Dataset First View

In [None]:
# Display first few rows in a tabular format
import pandas as pd

print("First 5 rows of Titles Dataset:")
display(titles_df.head())

print("\nFirst 5 rows of Credits Dataset:")
display(credits_df.head())




### Dataset Rows & Columns count

In [None]:
# Display row and column count using Pandas
print("Titles Dataset:")
print(titles_df.shape)

print("\nCredits Dataset:")
print(credits_df.shape)



### Dataset Information

In [None]:
# Display dataset info
titles_df.info()
credits_df.info()


#### Duplicate Values

In [None]:
# Count duplicate rows in each dataset
titles_duplicates = titles_df.duplicated().sum()
credits_duplicates = credits_df.duplicated().sum()

print(f"Duplicate rows in Titles Dataset: {titles_duplicates}")
print(f"Duplicate rows in Credits Dataset: {credits_duplicates}")


#### Missing Values/Null Values

In [None]:
# Count missing values in each column
print("Missing Values in Titles Dataset:")
print(titles_df.isnull().sum())

print("\nMissing Values in Credits Dataset:")
print(credits_df.isnull().sum())


In [None]:
# Bar plot for titles.csv showing missing values
titles_df = pd.read_csv("/titles.csv")

missing_values = titles_df.isnull().sum()  # Count missing values
missing_values = missing_values[missing_values > 0]  # Filter out columns with no missing values

plt.figure(figsize=(10, 5))
missing_values.plot(kind='bar', color='red')

plt.title("Missing Values in titles.csv")
plt.xlabel("Columns")
plt.ylabel("Count of Missing Values")
plt.xticks(rotation=45)
plt.show()




*Missing values in credits.csv*

We provide a bar plot for titles.csv to visualize missing values, as several of its columns contain nulls. We do not generate a bar plot for credits.csv, since only its character column has missing values.
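Raw counts are easier to interpret next to percentages. As a minimal sketch, the summary below combines both in one table; the `sample` frame and its values are illustrative and only mimic the shape of the real data.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for titles_df
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "seasons": [np.nan, 2.0, np.nan, np.nan],
    "imdb_score": [7.1, np.nan, 6.4, 8.0],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(1),
})

# Keep only columns that have gaps, largest gap first
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_count", ascending=False)

print(missing_summary)
```

Replacing `sample` with `titles_df` or `credits_df` would give the same table for the real datasets.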

### What did you know about your dataset?

1. Credits Dataset (credits.csv):

* Contains 124,235 entries with 5 columns: person_id, id, name, character, and role.
* No missing values except in the character column, which has 16,287 missing values.
* Represents a mapping of people (actors, directors, etc.) to titles.

2. Titles Dataset (titles.csv):

* Contains 9,871 entries with 15 columns.
* Significant missing values in:
  * age_certification (6,487 missing)
  * seasons (8,514 missing)
  * imdb_id, imdb_score, imdb_votes, tmdb_score (some missing values)
* Contains information about movies and TV shows, including their genres, ratings, and popularity scores.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
import pandas as pd

# Load the datasets
credits = pd.read_csv("/credits.csv")
titles = pd.read_csv("/titles.csv")

# Display dataset columns
print("Credits Dataset Columns:")
print(credits.columns.tolist())

print("\nTitles Dataset Columns:")
print(titles.columns.tolist())


In [None]:
# Display dataset statistics
print("Credits Dataset Description:")
print(credits.describe())

print("\nTitles Dataset Description:")
print(titles.describe())


### Variables Description

The Credits dataset (credits.csv) contains information about people involved in movies and TV shows. It includes person_id (unique identifier for a person), id (title identifier), name (person’s name), character (name of the character played, if applicable), and role (job type like actor, director, etc.).

The Titles dataset (titles.csv) provides details about movies and shows. It includes id (unique title identifier), title (name of the movie/show), type (movie or show), description (short summary), release_year (year of release), and age_certification (age rating like PG, R, etc.). It also has runtime (duration in minutes), genres (categories like action, drama, etc.), and production_countries (where it was made). Additionally, it includes seasons (only for TV shows), imdb_id, imdb_score, imdb_votes (IMDb data), and tmdb_popularity, tmdb_score (TMDb data).

### Check Unique Values for each variable.

In [None]:
import pandas as pd

# Load the datasets
credits = pd.read_csv("/credits.csv")
titles = pd.read_csv("/titles.csv")

# Check unique values count for each column
print("Unique Values in Credits Dataset:")
print(credits.nunique())

print("\nUnique Values in Titles Dataset:")
print(titles.nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
import pandas as pd

# Load the datasets
credits = pd.read_csv("/credits.csv")
titles = pd.read_csv("/titles.csv")

# 1️⃣ Remove Duplicates
credits.drop_duplicates(inplace=True)
titles.drop_duplicates(inplace=True)

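# 2️⃣ Handle Missing Values Smartly
# Assign the result back: fillna(inplace=True) on a selected column is unreliable in recent pandas versions
credits["character"] = credits["character"].fillna("Unknown")  # Fill missing character names with "Unknown"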

titles.fillna({
    "age_certification": "Not Rated",  # Replace missing age ratings
    "seasons": 0,  # Movies don't have seasons, so 0
    "imdb_score": titles["imdb_score"].median(),  # Use median for IMDb score
    "tmdb_score": titles["tmdb_score"].median(),  # Use median for TMDb score
    "imdb_votes": titles["imdb_votes"].median(),  # Fill missing votes with median
    "tmdb_popularity": titles["tmdb_popularity"].median()  # Use median for popularity score
}, inplace=True)

# 3️⃣ Convert Data Types for Correct Analysis
titles = titles.astype({
    "release_year": "int32",
    "seasons": "int32",
    "imdb_score": "float32",
    "tmdb_score": "float32",
    "imdb_votes": "int32",
    "tmdb_popularity": "float32"
})

# 4️⃣ Normalize Text Data for Consistency
titles["title"] = titles["title"].str.title()  # Capitalize each word in title
titles["genres"] = titles["genres"].str.lower()  # Convert genres to lowercase
# regex=False treats "[" and "]" as literal characters (a lone "[" is not a valid regular expression)
titles["production_countries"] = (
    titles["production_countries"]
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
    .str.replace("'", "", regex=False)
)  # Clean country names


# ✅ Data is Cleaned & Ready for Analysis
# info() prints its report directly and returns None, so it is not wrapped in print()
print("Credits Dataset Ready:")
credits.info()

print("\nTitles Dataset Ready:")
titles.info()


### What all manipulations have you done and insights you found?

 Data Manipulations Done & Insights Found

1️⃣ Removed Duplicates

✅ Action: Eliminated duplicate rows in both credits.csv and titles.csv.

🔍 Insight: Ensures unique records and prevents biased analysis.


2️⃣ Handled Missing Values

✅ Action:

Filled missing character values in credits.csv with "Unknown".
Replaced missing age_certification in titles.csv with "Not Rated".
Filled missing seasons with 0 (since movies don’t have seasons).
Replaced missing IMDb & TMDb scores/votes/popularity with median values.

🔍 Insight: Preserving data instead of dropping rows prevents data loss while ensuring meaningful analysis.


3️⃣ Converted Data Types

✅ Action:

Converted release_year, seasons, and imdb_votes to integers.
Converted imdb_score, tmdb_score, and tmdb_popularity to floats for proper numerical analysis.

🔍 Insight: Optimizing data types reduces memory usage and speeds up processing.


4️⃣ Normalized Text Data

✅ Action:

Capitalized title names for consistency.
Converted genres to lowercase to avoid case-sensitive mismatches.
Cleaned production_countries by removing extra brackets and quotes.

🔍 Insight: Standardized text formatting makes filtering, grouping, and analysis more accurate.

📌 Key Insights from Data Cleaning

🔹 Many shows have missing age certifications, indicating possible regional rating variations.

🔹 IMDb and TMDb scores have some missing values, which could affect rating-based recommendations.

🔹 The seasons column is mostly empty (for movies), so it’s only relevant for TV shows.

🔹 Text inconsistencies (like different cases in title and genres) could have led to incorrect groupings if not normalized.
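One caveat of median imputation is that filled values become indistinguishable from observed ones. A possible refinement, sketched here on illustrative data, is to record a flag before filling. The column name `imdb_score_was_missing` is an assumption and is not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the titles data
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "imdb_score": [5.0, np.nan, 7.0, 9.0, np.nan],
})

# Record which rows were missing before filling (hypothetical column name)
sample["imdb_score_was_missing"] = sample["imdb_score"].isna()

# median() skips NaN, so this is the median of the observed scores only
median_score = sample["imdb_score"].median()
sample["imdb_score"] = sample["imdb_score"].fillna(median_score)

# Later analyses can restrict themselves to observed scores when needed
observed_scores = sample.loc[~sample["imdb_score_was_missing"], "imdb_score"]

print(sample)
print("Observed scores only:", observed_scores.tolist())
```

With the flag in place, rating-based charts can either include or exclude imputed rows, which makes it easier to check whether the imputation changes a conclusion.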

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Count the number of titles per release year
titles_per_year = titles_df["release_year"].value_counts().sort_index()

# Plot the bar chart
plt.figure(figsize=(12, 6))
plt.bar(titles_per_year.index, titles_per_year.values, color="skyblue")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.title("Number of Movies and Shows Released Per Year")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is an excellent choice for visualizing the number of movie and show releases over the years. Since release years are discrete, ordered values, one bar per year effectively captures the frequency distribution, making it easy to observe patterns and trends. This visualization helps identify peaks and declines in content production, which would be difficult to interpret from raw numbers alone. Additionally, bar charts allow for clear comparisons between different years, making it simple to track growth or decline in the number of titles.

##### 2. What is/are the insight(s) found from the chart?

The bar chart reveals a significant increase in the number of movies and shows produced in recent years, suggesting the expansion of the media and entertainment industry. This could be attributed to the rise of streaming platforms, technological advancements, and increased global demand for diverse content. Additionally, the lower number of releases in older years may reflect historical limitations in production, fewer studios, and lack of advanced filmmaking technology at that time. Another key insight is that the highest spikes in content production might coincide with the emergence of digital platforms, allowing for easier content creation and distribution.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact: The increasing trend in content production highlights a growing market, presenting opportunities for media companies, streaming platforms, and content creators to invest in new projects. Understanding this trend enables businesses to make informed decisions about content creation, acquisitions, and marketing strategies. Companies can leverage the demand for entertainment by producing high-quality, diverse content to capture audience interest and maximize revenue.

Potential Negative Growth: If the number of releases has begun to decline in recent years, it could indicate market saturation. A saturated market results in heightened competition among production houses, making it more challenging for new entrants to achieve profitability. Additionally, audience fatigue due to excessive content availability may lead to reduced viewer engagement, affecting revenue and long-term sustainability. If companies fail to adapt to changing consumer preferences, they might struggle to maintain relevance in an increasingly competitive industry.

Final Justification:
While an increase in content production suggests a growing market, companies must balance quantity with quality to avoid oversaturation. Investing in data-driven content strategies, audience insights, and emerging technologies can help businesses sustain positive growth despite competition. However, failure to innovate or differentiate content could lead to stagnation or declining profits.
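Year-by-year bars can be noisy, so long-term growth is sometimes easier to judge per decade. The sketch below shows one way to do that with integer division; the `years` values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative release years standing in for titles_df["release_year"]
years = pd.Series([1955, 1987, 1999, 2004, 2015, 2018, 2019, 2021], name="release_year")

# Map each year to the first year of its decade, e.g. 2018 -> 2010
decades = (years // 10) * 10

# Count titles per decade in chronological order
titles_per_decade = decades.value_counts().sort_index()

print(titles_per_decade)
```

Plotting `titles_per_decade` as a bar chart would give a coarser but steadier view of the trend discussed above.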

#### Chart - 2

In [None]:
# MOVIES Vs SHOWS PIECHART
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Count the number of movies and shows
content_type_counts = titles_df["type"].value_counts()

# Plot the pie chart
plt.figure(figsize=(8, 8))
plt.pie(content_type_counts, labels=content_type_counts.index, autopct="%1.1f%%",
        colors=["lightblue", "lightcoral"], startangle=140, wedgeprops={'edgecolor': 'black'})
plt.title("Distribution of Movies vs. Shows")
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is ideal for representing proportions and distributions within a dataset. Since the goal is to compare the share of movies versus shows in the dataset, a pie chart effectively conveys the percentage contribution of each category in a visually intuitive manner. It allows us to quickly grasp the relative dominance of one type over another, making it easier to analyze the composition of the content.

##### 2. What is/are the insight(s) found from the chart?

* The chart reveals whether movies or shows make up the larger portion of the dataset.
* If movies have a higher percentage, it indicates that the industry still focuses more on standalone film content.
* If shows dominate, it suggests a shift towards serialized content, which could be attributed to the growing popularity of streaming platforms that favor long-term viewer engagement.
* The balance (or imbalance) between movies and shows also provides insight into consumer preferences and industry production strategies.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

If shows dominate, streaming platforms can capitalize on binge-watching trends, leading to higher user retention and engagement.
If movies have a higher share, businesses can focus on producing blockbuster films that attract large audiences in a short period.
Understanding the distribution of content types helps businesses align their production strategies with audience preferences, ensuring optimal investment in new projects.

Potential Negative Growth & Justification:

If the industry is oversaturated with one category (either movies or shows), it could lead to audience fatigue.
For instance, if too many long series are being produced while viewers prefer short, standalone content, platforms may struggle to retain their audience.
Similarly, an underrepresentation of one category may result in missed opportunities for businesses looking to attract a diverse audience.


Justification: An imbalanced content distribution can impact long-term engagement. If consumers do not find variety, they might switch platforms or reduce content consumption, ultimately affecting revenue. To maintain success, companies must produce a balanced mix of both movies and shows to meet evolving audience demands.
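The pie chart shows the split visually, and the same proportions can be computed directly. The sketch below also checks whether any movie carries a season count, which relates to using `seasons` to tell shows from movies. The sample rows and the `MOVIE`/`SHOW` labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for titles_df
sample = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW"],
    "seasons": [np.nan, np.nan, 2.0, np.nan, 1.0],
})

# Percentage share of each content type
type_share = sample["type"].value_counts(normalize=True).mul(100).round(1)

# Consistency check: movies are not expected to have a season count
movies_with_seasons = sample[(sample["type"] == "MOVIE") & sample["seasons"].notna()]

print(type_share)
print("Movies with a season count:", len(movies_with_seasons))
```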

#### Chart - 3

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# Load the dataset
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Split genres and count occurrences
# Strip brackets, quotes, and spaces so "['drama'" and " 'drama']" count as the same genre
genre_counts = Counter()
for raw in titles_df["genres"].fillna(''):  # Handle NaN values
    cleaned = [g.strip(" []'\"").lower() for g in raw.split(',')]
    genre_counts.update(g for g in cleaned if g)  # Skip empty entries

# Get the top 10 genres
top_genres = genre_counts.most_common(10)
genres, counts = zip(*top_genres)

# Plot the bar chart
plt.figure(figsize=(10, 6))
plt.barh(genres, counts, color='skyblue', edgecolor='black')
plt.xlabel("Number of Titles")
plt.ylabel("Genres")
plt.title("Top 10 Most Common Genres")
plt.gca().invert_yaxis()  # Invert for better readability
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart suits this comparison because genre is a categorical variable with many levels and text labels. Horizontal bars keep the genre names readable, and ordering them by count makes the ranking of the ten most frequent genres clear at a glance. A pie chart with ten slices would be much harder to read.

##### 2. What is/are the insight(s) found from the chart?

* The chart ranks the ten genres that appear most often across titles in the dataset.
* Because a title can carry several genres, the bars count genre tags rather than distinct titles, so they can sum to more than the number of titles.
* A steep drop after the first few bars would indicate that the catalog is concentrated in a handful of genres, while a flatter profile would indicate a more diverse library.
* This insight addresses the content diversity question from the problem statement: which genres dominate the platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Knowing which genres dominate the catalog helps the platform align acquisition and production with the content it already offers at scale.
Genres with few titles point to gaps that could be filled to attract audiences who are currently underserved.
Genre counts can also support recommendation and search features, for example by surfacing well-stocked genres in browsing menus.

Potential Negative Growth :

If the catalog is concentrated in a few genres, viewers who prefer other genres may find little to watch and switch platforms.
Adding more titles to already crowded genres may bring diminishing returns, since new titles compete with existing ones for the same audience.

Justification: Title counts measure supply, not demand. A genre with many titles is not necessarily the one viewers engage with most, so these counts should be read alongside ratings and popularity scores before making investment decisions.
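Genre counting depends on how the `genres` column is stored. The sketch below assumes each value is a string holding a Python-style list such as `"['drama', 'comedy']"`, which the bracket and quote cleanup in the wrangling section suggests but does not confirm. `parse_genres` is a hypothetical helper, and the sample values are illustrative.

```python
import ast
from collections import Counter


def parse_genres(raw):
    """Hypothetical helper: turn "['drama', 'comedy']" into ['drama', 'comedy']."""
    if not isinstance(raw, str) or not raw.strip():
        return []  # Missing or empty value
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Fall back to a plain comma-separated string
        parsed = raw.split(",")
    if not isinstance(parsed, (list, tuple)):
        parsed = [parsed]
    # Normalize case and whitespace, dropping empty entries
    return [str(g).strip().lower() for g in parsed if str(g).strip()]


# Illustrative raw values, including an empty list and a missing value
raw_genres = ["['drama', 'comedy']", "['Drama']", "[]", None, "comedy, romance"]

genre_counts = Counter()
for raw in raw_genres:
    genre_counts.update(parse_genres(raw))

print(genre_counts.most_common(3))
```

Parsing the list literal, rather than splitting on commas alone, avoids counting fragments such as `['drama'` and ` 'drama']` as separate genres.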

#### Chart - 4

In [None]:
# MOVIES SHOWS TRENDS
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Drop rows with missing release years and ensure integer type
titles_df = titles_df.dropna(subset=['release_year'])
titles_df['release_year'] = titles_df['release_year'].astype(int)

# Filter only reasonable years (e.g., 1912-2025)
titles_df = titles_df[(titles_df['release_year'] >= 1912) & (titles_df['release_year'] <= 2025)]

# Group data by release year and type
movies_shows_trend = titles_df.groupby(['release_year', 'type']).size().unstack(fill_value=0)

# Ensure data is sorted by year
movies_shows_trend = movies_shows_trend.sort_index()

# Plot the line chart
# DataFrame.plot creates its own figure, so a separate plt.figure() call would leave an empty one behind
movies_shows_trend.plot(kind='line', marker='o', figsize=(12, 6), color=['blue', 'red'])
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.title("Trend of Movies vs. Shows Over the Years")
plt.legend(title="Type")
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

##### 1. Why did you pick the specific chart?

A line chart was chosen because it is the best way to visualize trends over time. Since we are analyzing the number of movies and shows released per year, a line chart effectively highlights the growth, dips, and fluctuations in content production. Additionally, it allows us to easily compare the trends between movies and shows in a single, clear representation.

##### 2. What is/are the insight(s) found from the chart?

* There has been a significant rise in the number of movies and shows produced in recent years, indicating growth in the entertainment industry.
* The early years (before 2000) had fewer releases, which could be due to limited production capabilities and fewer platforms for content distribution.
* In the last decade, there has been an increase in show production compared to movies, likely due to the boom in streaming platforms like Netflix, Amazon Prime, and Disney+.
* If the number of releases has started to decline in recent years, it may indicate market saturation or shifts in audience preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
The increasing trend in content production suggests that demand for entertainment is growing. This is a positive sign for streaming services, production houses, and investors, as it encourages more content creation and investments in media.
The rise in TV shows over movies indicates that long-form storytelling and series formats are becoming more popular, making them a profitable segment for content creators.
Companies can capitalize on this trend by investing in high-quality shows and leveraging streaming services for global reach.

Potential Negative Growth:
If the recent years show a decline in the number of releases, it may indicate market saturation, where too much content is available, leading to stiff competition.
A decline could also mean changing audience behavior, with people preferring short-form content (e.g., YouTube, TikTok) over traditional movies and series.
Additionally, high production costs and low ROI on some projects might discourage companies from investing in large-scale productions, leading to industry slowdowns.

Justification:
If the entertainment industry is oversaturated, streaming platforms may face subscriber fatigue, meaning consumers have too many options and start canceling subscriptions. To avoid negative impacts, companies must adapt strategies by focusing on unique, high-quality content rather than mass-producing shows.
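Absolute counts rise for both types when the whole catalog grows, so a claim that shows are gaining on movies is easier to check with yearly shares. A minimal sketch using `pd.crosstab` follows; the sample rows and the `MOVIE`/`SHOW` labels are illustrative assumptions.

```python
import pandas as pd

# Illustrative stand-in for titles_df
sample = pd.DataFrame({
    "release_year": [2010, 2010, 2010, 2010, 2020, 2020, 2020, 2020],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE", "MOVIE", "SHOW", "SHOW"],
})

# normalize="index" turns each year's counts into fractions that sum to 1
type_share_by_year = pd.crosstab(sample["release_year"], sample["type"], normalize="index")

print(type_share_by_year)
```

A rising `SHOW` column across years would support the observation about show production, independent of overall catalog growth.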

#### Chart - 5

In [None]:
# TOP ACTORS CHART
import pandas as pd
import matplotlib.pyplot as plt

# Load the datasets
titles_path = "/titles.csv"
credits_path = "/credits.csv"
titles_df = pd.read_csv(titles_path)
credits_df = pd.read_csv(credits_path)

# Merge titles and credits data on 'id' to connect movies/shows with their cast/crew
merged_df = pd.merge(credits_df, titles_df, on="id")

# Filter only actors from the credits dataset
actors_df = merged_df[merged_df["role"] == "ACTOR"]

# Count appearances of each actor and get the top 10
top_actors = actors_df["name"].value_counts().head(10)

# Plot the bar chart for top 10 actors
plt.figure(figsize=(12, 6))
top_actors.plot(kind="bar", color="purple")
plt.xlabel("Actor Name")
plt.ylabel("Number of Titles")
plt.title("Top 10 Most Frequently Credited Actors")
plt.xticks(rotation=45, ha="right")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is the best choice for visualizing categorical data like actor names and their number of appearances. Since we are comparing the top 10 most frequently credited actors, a bar chart clearly shows the differences in the number of credits each actor has. The vertical bars, with rotated name labels, make it easy to compare and identify the most prominent actors in the dataset.

##### 2. What is/are the insight(s) found from the chart?

* The chart highlights the top 10 actors who have appeared in the most movies or shows.
* Some actors have been featured in a significantly higher number of titles, showing their popularity and demand in the industry.
* The leading actors might have played roles in long-running series or franchises, contributing to their high count.
* The distribution suggests that the industry is often dominated by a few highly active actors, while most others have fewer appearances.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Knowing the most frequently credited actors helps producers and streaming platforms identify high-demand talent, allowing them to make informed decisions when casting for new projects.
* Popular actors bring brand value, increasing viewership and profitability for movies and series they are part of.
* Streaming services like Netflix and Amazon Prime can leverage this data to recommend content based on actors who attract the most audience engagement.

Potential Negative Growth:
* Over-reliance on a few actors might indicate a lack of diversity in casting, which could lead to audience fatigue over time.
* If the same actors dominate the industry, new talent may struggle to get opportunities, causing stagnation in creativity and innovation.
* The industry might become risk-averse, favoring established stars over fresh talent, which can negatively impact the variety and uniqueness of content.

Justification:
While using well-known actors is a safe business strategy, excessive dependence on them can limit fresh storytelling and innovation. To balance commercial success with industry growth, production houses should promote new talent while still leveraging top actors for key roles.
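Counting rows by `name` can overstate an actor's reach in two ways: one person may be credited several times in the same title, and different people may share a name. As a sketch under those assumptions, the example below counts distinct titles per `person_id` instead. The sample rows are invented for illustration.

```python
import pandas as pd

# Illustrative stand-in for credits_df; persons 1 and 3 share a name
credits_sample = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 3, 3],
    "id": ["tm1", "tm1", "tm2", "tm1", "tm3", "tm4"],
    "name": ["Alex Kim", "Alex Kim", "Alex Kim", "Sam Lee", "Alex Kim", "Alex Kim"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "ACTOR", "ACTOR", "DIRECTOR"],
})

actors = credits_sample[credits_sample["role"] == "ACTOR"]

# Row count per name: mixes both people named "Alex Kim" and repeated credits
credits_per_name = actors["name"].value_counts()

# Distinct titles per person: one count per (person, title) pair
titles_per_actor = (
    actors.groupby(["person_id", "name"])["id"]
    .nunique()
    .sort_values(ascending=False)
)

print(credits_per_name)
print(titles_per_actor)
```

Here the name-based count reports 4 for "Alex Kim", while the person-based count shows 2 titles for one person and 1 for the other.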









#### Chart - 6

In [None]:
# TOP PRODUCTION COUNTRIES
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Count titles for each production_countries value and keep the 10 most frequent
top_countries = titles_df["production_countries"].value_counts().head(10)

# Plot a horizontal bar chart
plt.figure(figsize=(12, 6))
top_countries.plot(kind="barh", color="teal")
plt.xlabel("Number of Titles")
plt.ylabel("Production Country")
plt.title("Top 10 Countries Producing the Most Titles")
plt.gca().invert_yaxis()
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen because it allows for easy comparison of categorical data, especially when dealing with long text labels like country names. Since we are analyzing the top 10 countries producing the most content, this format clearly shows the differences in production volume among them while keeping the country names readable.

##### 2. What is/are the insight(s) found from the chart?

* Certain countries, such as the United States, India, and the United Kingdom, dominate content production, likely due to their well-established film and television industries.
* Some emerging markets are also producing a significant number of titles, indicating global expansion in the entertainment sector.
* Countries with fewer productions may have smaller industries, lower budgets, or a focus on regional rather than global content distribution.
* If a country is rising in the rankings, it suggests increasing investment and interest in media production from that region.
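The country ranking depends on how multi-country entries are counted. The sketch below assumes that `production_countries` stores list-like strings such as `"['US', 'GB']"` (an assumption about the file format, not confirmed here) and uses a hypothetical helper, `parse_list_cell`, with toy rows to contrast counting raw strings against counting individual countries.

```python
import ast
import pandas as pd

def parse_list_cell(cell):
    """Hypothetical helper: turn "['US', 'GB']" into ['US', 'GB']; anything else yields []."""
    if not isinstance(cell, str):
        return []
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

# Toy rows standing in for titles_df["production_countries"] (illustrative only)
sample = pd.DataFrame(
    {"production_countries": ["['US']", "['US', 'GB']", "['IN']", "[]", "['US', 'GB']"]}
)

# Counting raw strings treats each combination of countries as its own category
combo_counts = sample["production_countries"].value_counts()

# Parsing and exploding counts each country once per title it appears in
country_counts = (
    sample["production_countries"]
    .apply(parse_list_cell)
    .explode()
    .dropna()
    .value_counts()
)
print(country_counts)
```

On the toy rows, the raw-string count splits the United States across two categories, while the per-country count credits it with every title it participates in.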



##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Businesses in the streaming industry can use this insight to target top-producing countries for content acquisition and partnerships.
* Regional content strategies can be formed by focusing on emerging markets, ensuring that localized content reaches the right audiences.
* Knowing which countries dominate production allows companies to adapt their investment strategies, choosing whether to collaborate with top countries or invest in underrepresented markets for fresh content.

Potential Negative Growth:
* Over-dependence on a few top-producing countries could lead to content saturation, where audiences feel overwhelmed by similar types of content.
* If smaller markets are neglected, companies might miss out on niche content opportunities that could cater to unique cultural preferences.
* Some countries might have strict regulations or censorship policies affecting the global reach of their productions, impacting revenue potential.

Justification:
While focusing on top-producing countries ensures high-quality, globally appealing content, diversification is key. Expanding production into less dominant markets can bring fresh storytelling styles, cultural diversity, and new audience engagement opportunities. This helps avoid stagnation and keeps the industry innovative and competitive.

#### Chart - 7

In [None]:
# MOST COMMON ROLES
import pandas as pd
import matplotlib.pyplot as plt

# Load the datasets
titles_path = "/titles.csv"
credits_path = "/credits.csv"
titles_df = pd.read_csv(titles_path)
credits_df = pd.read_csv(credits_path)

# Merge titles and credits data on 'id'
merged_df = pd.merge(credits_df, titles_df, on="id")

# Count the occurrences of each role in the merged dataset
role_counts = merged_df["role"].value_counts()

# Keep at most the 10 most common roles (head() returns fewer if fewer exist)
top_roles = role_counts.head(10)

# Plot a bar chart for the most common roles
plt.figure(figsize=(12, 6))
top_roles.plot(kind="bar", color="orange")
plt.xlabel("Role in Production")
plt.ylabel("Number of Credits")
plt.title("Most Common Roles in the Entertainment Industry")
plt.xticks(rotation=45, ha="right")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()



##### 1. Why did you pick the specific chart?

A bar chart was chosen because it effectively displays categorical data (different roles) while allowing easy comparison between them. Since we want to analyze the most frequently credited roles in the entertainment industry, a bar chart clearly shows the distribution and ranking of these roles. It also allows us to identify whether certain roles dominate the industry.

##### 2. What is/are the insight(s) found from the chart?

* The dataset reveals that Actor and Director roles dominate the entertainment industry, suggesting that most credits in productions belong to these professions.
* Other roles, such as Writers, Producers, and Cinematographers, might be underrepresented in the dataset or may have been categorized differently.
* If the dataset contains mostly mainstream productions, it makes sense that Actors and Directors receive the most credits, as they are the most recognizable figures in a production.
* The presence (or absence) of additional roles can indicate how detailed the dataset is regarding crew members.
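Whether roles beyond Actor and Director exist in the data at all can be checked directly before interpreting the chart. The following is a minimal sketch on toy data; the role labels `"ACTOR"` and `"DIRECTOR"` are assumptions about how `credits.csv` encodes roles.

```python
import pandas as pd

# Toy stand-in for credits_df (illustrative only); role labels are assumed
toy_credits = pd.DataFrame({"role": ["ACTOR", "ACTOR", "ACTOR", "DIRECTOR", "ACTOR"]})

# Absolute credit counts and each role's share of all credits
role_summary = pd.DataFrame({
    "credits": toy_credits["role"].value_counts(),
    "share_pct": (toy_credits["role"].value_counts(normalize=True) * 100).round(1),
})
distinct_roles = toy_credits["role"].nunique()

print(role_summary)
print("Distinct roles:", distinct_roles)
```

If `distinct_roles` turns out to be two on the real file, the absence of writers or editors reflects what the dataset records rather than how the industry credits people.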

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Streaming platforms and production companies can use this insight to allocate budgets effectively, ensuring proper recognition and funding for essential roles like Actors and Directors.
* If certain roles (e.g., Writers, Editors) appear less frequently, it could indicate an opportunity to highlight and promote these professions, leading to more balanced industry growth.
* Talent agencies can use this data to understand demand trends and adjust their hiring strategies accordingly.

Potential Negative Growth:
* If the dataset primarily focuses on only lead roles (Actors and Directors) and ignores essential crew members (Editors, Cinematographers, Writers), it may create an industry imbalance where some professions get overvalued while others are overlooked.
* A lack of representation of technical roles could lead to skill shortages in specialized fields, which may negatively impact production quality in the long run.
* If the trend of undervaluing behind-the-scenes roles continues, it might discourage people from pursuing careers in technical or creative fields like scriptwriting, set design, or editing, leading to fewer skilled professionals in those domains.

Justification:
While it’s natural for Actors and Directors to receive the most recognition, a well-balanced industry also requires strong contributions from supporting roles. Investing in a diverse talent pool (including behind-the-scenes professionals) ensures high-quality productions, creative storytelling, and long-term sustainability of the entertainment industry.

#### Chart - 8

In [None]:
# AVERAGE IMDB RATING BY GENRE
import pandas as pd
import matplotlib.pyplot as plt

# Load the datasets
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Ensure 'imdb_score' column is numeric and drop missing values
titles_df['imdb_score'] = pd.to_numeric(titles_df['imdb_score'], errors='coerce')
titles_df = titles_df.dropna(subset=['imdb_score', 'genres'])

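# Split genres into one row per genre; strip list brackets and quotes first in case
# values are stored as list-like strings such as "['drama', 'comedy']"
genres_clean = titles_df["genres"].str.strip("[]").str.replace("'", "", regex=False)
titles_df = titles_df.assign(genres=genres_clean.str.split(", ")).explode("genres")
titles_df = titles_df[titles_df["genres"] != ""]  # drop titles with an empty genre list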

# Calculate average IMDb score per genre
avg_rating_per_genre = titles_df.groupby("genres")["imdb_score"].mean().sort_values(ascending=False).head(10)

# Plot a bar chart for the top 10 highest-rated genres
plt.figure(figsize=(12, 6))
avg_rating_per_genre.plot(kind="bar", color="purple")
plt.xlabel("Genre")
plt.ylabel("Average IMDb Score")
plt.title("Top 10 Genres with the Highest Average IMDb Ratings")
plt.xticks(rotation=45, ha="right")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen because it is effective in displaying categorical data (genres) while clearly representing numerical values (average IMDb ratings). This visualization makes it easy to compare different genres and determine which ones consistently receive higher audience appreciation. The ranking format also highlights the top-performing genres, helping to identify trends in viewer preferences.

##### 2. What is/are the insight(s) found from the chart?

* Certain genres, such as Documentary, History, and Biography, tend to have higher average IMDb ratings, indicating that audiences appreciate factual and well-researched content.
* More mainstream genres, like Action or Comedy, might have lower IMDb ratings despite their popularity, possibly due to variations in storytelling quality.
* Some niche genres receive higher ratings because they cater to specific audiences with strong engagement.
* The trend suggests that audiences value meaningful storytelling and high-quality content over just entertainment value.
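A genre with only a handful of titles can top an average-rating chart by chance, which matters for the claim about niche genres. The sketch below, on toy data, reports the number of titles alongside the mean and applies a hypothetical cutoff `MIN_TITLES`; the threshold value is an illustrative assumption.

```python
import pandas as pd

# Toy long-format data: one row per (title, genre) pair, as produced by explode()
toy = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "comedy", "comedy", "comedy", "western"],
    "imdb_score": [7.0, 6.0, 8.0, 5.0, 6.0, 7.0, 9.5],
})

MIN_TITLES = 3  # hypothetical threshold; tune to the size of the real dataset

# Mean score and title count per genre
genre_stats = toy.groupby("genres")["imdb_score"].agg(mean_score="mean", n_titles="count")

# Keep only genres with enough titles for the mean to be meaningful
reliable = (
    genre_stats[genre_stats["n_titles"] >= MIN_TITLES]
    .sort_values("mean_score", ascending=False)
)
print(reliable)
```

Here the single high-scoring western title is excluded from the ranking, so it cannot displace genres whose averages rest on more titles.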

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Streaming platforms and production houses can prioritize high-rated genres when making content acquisition or production decisions.
* If a genre consistently receives higher ratings, it may indicate higher audience satisfaction and potential for long-term engagement, leading to better customer retention for streaming services.
* Filmmakers can use this insight to improve scriptwriting and production strategies, ensuring they focus on genres with proven audience appreciation.

Potential Negative Growth:
* If studios focus only on high-rated genres, they may neglect commercial genres (such as Action and Comedy), which bring in more revenue despite receiving lower ratings.
* An overemphasis on niche, high-rated genres may lead to market saturation, where too much content of the same type reduces its uniqueness and appeal.
* Some highly rated genres may not necessarily translate into higher revenue, as IMDb ratings reflect viewer sentiment but not necessarily box office success.

Justification:
While highly rated genres indicate strong audience engagement, the entertainment industry must balance quality and profitability. A genre with moderate ratings but high commercial success (like superhero films or rom-coms) still plays a crucial role in business growth. Therefore, data-driven decisions should be balanced with market demand to ensure sustainable success.

#### Chart - 9

In [None]:
# DISTRIBUTION OF IMDB SCORES
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Ensure 'imdb_score' column has valid values and handle missing data
titles_df = titles_df.dropna(subset=['imdb_score'])

# Plot IMDb score distribution
plt.figure(figsize=(10, 6))
sns.histplot(titles_df['imdb_score'], bins=20, kde=True, color='purple')
plt.xlabel("IMDb Score")
plt.ylabel("Number of Titles")
plt.title("Distribution of IMDb Scores for Movies and TV Shows")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()




##### 1. Why did you pick the specific chart?

A histogram with a KDE curve was chosen because it effectively visualizes the distribution of IMDb scores across all movies and TV shows. This chart helps in understanding how content is rated—whether most titles have high, low, or average ratings—and provides insights into the general quality of content available in the dataset. The KDE curve further smooths out the data to show trends clearly.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows the most common IMDb ratings for movies and TV shows. If the distribution is skewed towards higher ratings, it indicates that many titles are well-received by audiences.
* If the distribution is more centered around the middle values (5-7), it suggests that most titles have moderate ratings, meaning a mix of good and average content.
* If a significant number of titles have very low ratings (below 4), it could indicate that there are many poorly rated movies/TV shows in the dataset.
* The presence of a peak around certain IMDb scores may indicate a trend in audience preferences or possible rating inflation in certain genres.
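The conditional statements above can be settled numerically rather than by eye. This is a minimal sketch on toy scores; the band edges (4 and 7) follow the ranges mentioned in the bullets, and the variable names are illustrative.

```python
import pandas as pd

# Toy scores standing in for titles_df["imdb_score"] (illustrative only)
scores = pd.Series([2.5, 4.5, 5.5, 6.0, 6.5, 6.8, 7.2, 7.5, 8.1, 9.0], name="imdb_score")

# Share of titles per rating band; intervals are (0, 4], (4, 7], (7, 10]
bands = pd.cut(scores, bins=[0, 4, 7, 10], labels=["<=4", "4-7", ">7"])
band_share = bands.value_counts(normalize=True).sort_index()

# Negative skewness means a longer tail towards low scores
skewness = scores.skew()

print(band_share)
print("median:", scores.median(), "skewness:", round(skewness, 2))
```

Run on the real column, `band_share` shows which of the three scenarios applies, and the sign of `skewness` indicates the direction of the tail.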

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* If the majority of movies and TV shows have high ratings, it shows that quality content is being produced, which is beneficial for streaming services and production houses.
* Understanding IMDb score distribution can help platforms recommend better-rated content to users, improving engagement and viewer satisfaction.
* Production companies can analyze patterns in highly rated movies and create more content in similar genres or storytelling styles to maximize audience retention.

Potential Negative Growth:
* If a large portion of content has low IMDb ratings, it might indicate that many movies/shows fail to meet audience expectations, leading to reduced viewership.
* A heavy concentration of titles around average ratings (5-7) might suggest that most content is mediocre, making it difficult for new titles to stand out.
* If the highest-rated content comes from a single genre, it could mean that other genres are not meeting audience expectations, requiring diversification strategies.

Justification:
The IMDb score distribution provides crucial insights into viewer satisfaction and content quality. While high ratings indicate success, a large presence of low-rated content might harm a platform’s reputation and affect business growth. Thus, streaming services and production studios must balance quality improvement with market trends to sustain long-term success.

#### Chart - 10

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Ensure 'imdb_score' and 'tmdb_score' columns have valid values and handle missing data
titles_df = titles_df.dropna(subset=['imdb_score', 'tmdb_score'])

# Plot the relationship between IMDb and TMDb scores
plt.figure(figsize=(8, 5))
sns.scatterplot(x=titles_df['imdb_score'], y=titles_df['tmdb_score'], alpha=0.5, color='purple')
plt.xlabel("IMDb Score")
plt.ylabel("TMDb Score")
plt.title("Correlation Between IMDb and TMDb Scores")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it effectively visualizes the relationship between IMDb scores and TMDb scores. Since both ratings measure audience perception, comparing them can reveal how closely they align. If the points form a linear trend, it indicates a strong correlation, whereas scattered points suggest varied opinions across platforms.

##### 2. What is/are the insight(s) found from the chart?

* If the plot shows a strong positive correlation (points forming a line), it means IMDb and TMDb scores are closely aligned, indicating that audiences rate content similarly on both platforms.
* If the plot is widely scattered, it suggests that IMDb and TMDb ratings differ significantly, possibly due to differences in rating systems, user demographics, or platform biases.
* Clusters of points at high or low scores could indicate certain types of content receiving consistent ratings across platforms, while outliers may highlight controversial or polarizing titles.
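How closely the two scores align can be quantified alongside the scatter plot. The sketch below, on toy data, computes the Pearson correlation, a Spearman rank correlation (obtained here as the Pearson correlation of the ranks), and the mean gap between platforms; the variable names are illustrative.

```python
import pandas as pd

# Toy data standing in for the two score columns (illustrative only)
toy = pd.DataFrame({
    "imdb_score": [5.1, 6.3, 7.0, 7.8, 8.4, None],
    "tmdb_score": [5.5, 6.0, 7.2, 7.5, 8.0, 6.6],
})
pairs = toy.dropna(subset=["imdb_score", "tmdb_score"])

# Pearson measures linear agreement between the raw scores
pearson_r = pairs["imdb_score"].corr(pairs["tmdb_score"])

# Spearman measures agreement in ordering: Pearson correlation of the ranks
spearman_rho = pairs["imdb_score"].rank().corr(pairs["tmdb_score"].rank())

# A positive mean gap would mean TMDb scores run higher on average
mean_gap = (pairs["tmdb_score"] - pairs["imdb_score"]).mean()

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}, mean gap = {mean_gap:.2f}")
```

A coefficient close to 1 would support the first bullet, a value near 0 the second, and `mean_gap` indicates whether one platform tends to score titles higher than the other.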

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* If the correlation is strong, businesses can use either IMDb or TMDb ratings interchangeably to predict audience preferences. This simplifies recommendation models for streaming platforms.
* Understanding rating discrepancies can help companies adjust marketing strategies—for example, if a movie has high TMDb ratings but low IMDb ratings, they can focus on the IMDb audience for improvement.
* A high correlation assures content creators that good ratings on one platform will likely reflect on another, helping them assess their content's reception.

Potential Negative Growth:
* A low or inconsistent correlation suggests that one platform’s ratings might not be reliable in predicting audience perception on another platform.
* If TMDb consistently gives higher ratings than IMDb, it could indicate rating inflation, meaning audiences might not fully trust TMDb ratings, leading to skepticism about content quality.
* If certain content gets highly polarized ratings (e.g., very high on IMDb but low on TMDb), it could indicate regional, demographic, or review-bombing biases, which businesses must consider to avoid misleading marketing.

Justification:
The correlation between IMDb and TMDb ratings provides insights into audience perception, rating reliability, and potential biases. A strong correlation benefits content prediction models, while inconsistencies require deeper analysis to understand audience preferences across platforms. This helps businesses refine content strategies and improve user engagement.

#### Chart - 11

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Load the dataset
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Ensure 'production_countries' column has valid values and handle missing data
titles_df = titles_df.dropna(subset=['production_countries'])

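# Split countries and count occurrences; strip list brackets and quotes in case
# values are stored as list-like strings such as "['US', 'GB']"
country_counts = Counter()
for countries in titles_df['production_countries']:
    for country in str(countries).strip("[]").split(','):
        country = country.strip().strip("'\"")
        if country:  # skip empty entries such as "[]"
            country_counts[country] += 1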

# Convert to DataFrame
country_df = pd.DataFrame(country_counts.items(), columns=['Country', 'Count'])
country_df = country_df.sort_values(by='Count', ascending=False).head(10)  # Top 10 countries

# Plot a bar chart
plt.figure(figsize=(10, 5))
sns.barplot(x=country_df['Country'], y=country_df['Count'], palette='coolwarm')
plt.xlabel("Country")
plt.ylabel("Number of Titles Produced")
plt.title("Top 10 Countries by Content Production")
plt.xticks(rotation=45)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is the best choice because it clearly shows the distribution of content production across different countries. Since the dataset contains country-wise production data, a bar chart helps compare the top 10 content-producing countries in an easy-to-read manner. The sorted bars provide a quick visual understanding of which regions dominate the media industry.

##### 2. What is/are the insight(s) found from the chart?

* Some countries produce significantly more content than others, indicating a strong media and entertainment industry in those regions.
* The USA, UK, and India are likely to be among the top contributors, given their well-established film and TV industries.
* Countries with fewer productions may have smaller entertainment industries, limited funding, or regional restrictions on content creation.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Companies looking to expand their media production can focus on countries with high output, as these regions have the necessary infrastructure, talent, and audience base.
* Streaming platforms can tailor content libraries by understanding which regions produce the most content and what genres are popular there.
* Investors can identify growing markets where content production is increasing, helping them make strategic decisions on partnerships and distribution.

Potential Negative Growth:
* If certain countries have low content production, it might indicate regulatory restrictions, lack of funding, or lower demand for entertainment.
* Heavy concentration of content in a few countries might lead to cultural homogenization, limiting diversity in global entertainment.
* Smaller industries may struggle to compete with dominant markets, leading to fewer opportunities for local filmmakers and actors.

Justification:
This analysis can help businesses navigate opportunities and challenges in the global media landscape. While the insights support data-driven decisions, they also highlight challenges related to market concentration, competition, and diversity that must be addressed for a balanced and inclusive media industry.

#### Chart - 12

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Load the dataset
credits_path = "/credits.csv"
credits_df = pd.read_csv(credits_path)

# Ensure 'role' and 'name' columns have valid values and handle missing data
credits_df = credits_df.dropna(subset=['role', 'name'])

# Filter only actors
actor_df = credits_df[credits_df['role'].str.lower() == 'actor']

# Count occurrences of each actor
actor_counts = Counter(actor_df['name'])

# Convert to DataFrame
actor_df = pd.DataFrame(actor_counts.items(), columns=['Actor', 'Count'])
actor_df = actor_df.sort_values(by='Count', ascending=False).head(10)  # Top 10 actors

# Plot a bar chart
plt.figure(figsize=(10, 5))
sns.barplot(x=actor_df['Actor'], y=actor_df['Count'], palette='viridis')
plt.xlabel("Actor")
plt.ylabel("Number of Credits")
plt.title("Top 10 Most Frequently Credited Actors")
plt.xticks(rotation=45, ha="right")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is the best choice because it allows us to compare the number of times different actors have been credited in the dataset. Since the dataset contains actor names with their frequency of appearances, a bar chart makes it easy to identify the most frequently credited actors at a glance. The bars are sorted in descending order, and the rotated tick labels keep long names readable.

##### 2. What is/are the insight(s) found from the chart?

* A small group of actors appear significantly more frequently than others, suggesting they may be highly sought-after or have worked on multiple projects.
* Lesser-known actors appear fewer times, which may indicate that the industry is dominated by a few recurring stars.
* Some actors may have higher credits due to appearing in TV series with multiple seasons, rather than standalone films.
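Raw credit counts can overstate an actor's reach if the same actor is listed more than once for a title, for example under several character names. The sketch below, on toy data, compares raw rows with distinct titles per actor and measures how concentrated the credits are; the `"ACTOR"` label and the toy rows are assumptions.

```python
import pandas as pd

# Toy stand-in for credits_df (illustrative only); actor A is listed twice for title t3
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t2", "t3", "t3", "t3"],
    "name": ["A", "B", "A", "C", "A", "B", "A"],
    "role": ["ACTOR"] * 7,
})
actors = toy_credits[toy_credits["role"].str.lower() == "actor"]

# Raw rows per actor versus number of distinct titles per actor
raw_credits = actors["name"].value_counts()
distinct_titles = actors.groupby("name")["id"].nunique().sort_values(ascending=False)

# Share of all actor-title pairs held by the single most credited actor
top_share = distinct_titles.head(1).sum() / distinct_titles.sum()

print(distinct_titles)
print(f"Top actor share: {top_share:.0%}")
```

Grouping by name merges different people who share a name; if the file provides a per-person identifier, grouping by that column instead would avoid the problem.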

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Casting Decisions: Filmmakers and producers can use this data to identify top actors who have a strong track record and fan following.
* Marketing Strategy: Streaming platforms and production houses can leverage popular actors to attract audiences, ensuring higher engagement and viewership.
* Talent Acquisition: Talent agencies can spot rising actors and sign them for future projects, contributing to industry growth.

Potential Negative Growth:
* Lack of Diversity: If a small group of actors dominates the industry, it may limit opportunities for emerging talent, making it difficult for newcomers to break into the business.
* Market Saturation: Overuse of the same actors may result in audience fatigue, reducing interest in new content featuring the same faces.
* Risk Dependence: Production companies relying on a few actors for success may face challenges if their popularity declines, affecting revenue and viewership.

Justification:
This analysis helps industry professionals and business strategists balance casting choices and identify upcoming talent. While leveraging data on top actors helps optimize casting, marketing, and investment, it also highlights challenges related to over-saturation, dependency, and lack of diversity. A balanced approach, introducing fresh talent while retaining audience-favorite actors, is crucial for sustainable industry growth.

#### Chart - 13

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
titles_path = "/titles.csv"
titles_df = pd.read_csv(titles_path)

# Selecting relevant numerical columns for correlation analysis
numerical_columns = ['imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
titles_numeric = titles_df[numerical_columns].dropna()

# Compute the correlation matrix
correlation_matrix = titles_numeric.corr()

# Plot a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Correlation Heatmap of IMDb and TMDb Metrics")
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is an effective way to visualize the correlation between different numerical variables in a dataset. Since IMDb and TMDb scores, votes, and popularity are all numerical, a heatmap helps identify relationships between them. Unlike scatter plots, which compare only two variables at a time, a heatmap gives a comprehensive overview of all relationships simultaneously.

##### 2. What is/are the insight(s) found from the chart?

* IMDb score and IMDb votes show a positive correlation, meaning movies with higher IMDb votes tend to have higher IMDb ratings.
* TMDb popularity and TMDb score may have a weaker correlation, suggesting that a movie’s popularity doesn’t always guarantee a higher score.
* There may be a moderate correlation between IMDb votes and TMDb popularity, indicating that movies with many IMDb votes tend to be more popular on TMDb as well.
* Some negative or weak correlations may indicate that different scoring systems (IMDb vs. TMDb) have different criteria for rating content.
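The strength of each relationship can be read off the correlation matrix programmatically instead of from the colors. The following is a minimal sketch on toy data that lists each pair of variables once, ordered by absolute correlation; the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data with the same four columns used for the heatmap (illustrative only)
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 6.5],
    "imdb_votes": [100, 400, 900, 5000, 300],
    "tmdb_popularity": [1.0, 3.0, 4.0, 20.0, 2.0],
    "tmdb_score": [5.5, 6.1, 6.8, 7.9, 7.0],
})
corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (variable, variable) -> correlation and sort by strength
ranked_pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(ranked_pairs)
```

Applied to `correlation_matrix` from the cell above, this gives a ranked list that makes words such as "weak" or "moderate" checkable against actual coefficients.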

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Better Content Selection: Streaming platforms and content producers can focus on highly rated movies that also have strong engagement (votes and popularity).
* Marketing Strategy Optimization: Understanding which factors contribute most to popularity can help improve promotional efforts, targeting the right audience.
* Data-Driven Decision Making: Platforms like Netflix and Amazon Prime can prioritize content that is both critically acclaimed and popular, leading to higher user retention.

Potential Negative Growth:
* Over-reliance on Popularity Metrics: If companies focus only on popularity (tmdb_popularity) rather than quality (IMDb score), they might end up promoting average or low-quality content, reducing long-term engagement.
* Discrepancies Between IMDb & TMDb Scores: If platforms rely only on one rating system, they might misjudge a film’s actual audience reception. For example, a movie may be highly rated on IMDb but have low popularity on TMDb, leading to biased content curation.
* Ignoring Niche Content: If decision-makers only invest in high-correlation trends, independent or niche films may get overlooked, leading to a less diverse content library.

Justification:
The heatmap provides valuable insights into content performance, helping businesses improve recommendations, marketing, and production strategies. However, relying too heavily on popularity metrics without considering audience diversity and quality ratings could lead to short-term gains but long-term negative impact on content engagement.

#### Chart - 14 - Correlation Heatmap

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
credits_path = "/credits.csv"
titles_path = "/titles.csv"
credits_df = pd.read_csv(credits_path)
titles_df = pd.read_csv(titles_path)

# Merge datasets on 'id' column to combine actor/director data with movie details
merged_df = pd.merge(credits_df, titles_df, on='id', how='inner')

# Selecting relevant numerical columns for correlation analysis
numerical_columns = ['imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
merged_numeric = merged_df[numerical_columns].dropna()

# Compute the correlation matrix
correlation_matrix = merged_numeric.corr()

# Plot a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Correlation Heatmap of IMDb and TMDb Metrics with Merged Data")
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is well suited to visualizing the correlation between numerical variables. Here the correlation is computed on the merged credits.csv and titles.csv data and covers IMDb scores, IMDb votes, TMDb popularity, and TMDb scores. Because the merge repeats each title's metrics once per credited person, titles with larger casts carry more weight than in Chart 13; the credits data itself adds no new numerical columns. The chart therefore shows how audience engagement and ratings relate when each title is weighted by the size of its cast and crew.

##### 2. What is/are the insight(s) found from the chart?

* IMDb votes and IMDb scores show a strong positive correlation, meaning movies with more user ratings generally receive higher scores.
* TMDb popularity has a moderate correlation with IMDb votes, suggesting that films with more IMDb engagement also tend to be popular on TMDb.
* TMDb score correlation with other metrics may be weaker, indicating that the rating system on TMDb is influenced by different factors compared to IMDb.
* Some weak or negative correlations may indicate that popularity doesn’t always align with high ratings, as some widely discussed films may still receive mixed reviews.
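Merging credits onto titles repeats every title once per credited person, which changes the correlations compared with a per-title calculation. The sketch below, on toy data, shows the effect and how `drop_duplicates` on `id` restores one row per title; the toy values are illustrative only.

```python
import pandas as pd

# Toy titles and credits (illustrative only); title t1 has five credited people
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "imdb_score": [8.0, 6.0, 7.0],
    "imdb_votes": [1000, 200, 150],
})
toy_credits = pd.DataFrame({
    "id": ["t1"] * 5 + ["t2", "t3"],
    "name": list("ABCDEFG"),
})

merged = pd.merge(toy_credits, toy_titles, on="id", how="inner")
per_title = merged.drop_duplicates(subset="id")  # back to one row per title

cols = ["imdb_score", "imdb_votes"]
corr_per_credit = merged[cols].corr().loc["imdb_score", "imdb_votes"]
corr_per_title = per_title[cols].corr().loc["imdb_score", "imdb_votes"]

print(f"rows: {len(merged)} merged vs {len(per_title)} per title")
print(f"correlation: {corr_per_credit:.3f} per credit vs {corr_per_title:.3f} per title")
```

The two coefficients differ because the title with the largest cast is counted five times in the merged frame.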

#### Chart - 15 - Pair Plot

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
credits_path = "/credits.csv"
titles_path = "/titles.csv"
credits_df = pd.read_csv(credits_path)
titles_df = pd.read_csv(titles_path)

# Merge datasets on 'id' column to combine actor/director data with movie details
merged_df = pd.merge(credits_df, titles_df, on='id', how='inner')

# Selecting relevant numerical columns for pair plot analysis
numerical_columns = ['imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
merged_numeric = merged_df[numerical_columns].dropna()

# Plot a pair plot
sns.pairplot(merged_numeric)
plt.suptitle("Pair Plot of IMDb and TMDb Metrics", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A Pair Plot is an effective way to visualize the relationships between multiple numerical variables. Since IMDb and TMDb scores, votes, and popularity are interconnected, a pair plot helps identify correlations, trends, and outliers in one glance. It is especially useful in understanding how different metrics interact and spotting clusters or anomalies in the data.

##### 2. What is/are the insight(s) found from the chart?

* There seems to be a strong positive correlation between IMDb votes and IMDb scores, meaning that movies with more ratings tend to have a higher score.
* TMDb popularity does not always correlate strongly with IMDb scores, suggesting that a movie can be widely discussed but not necessarily well-rated.
* Some movies with high IMDb scores might have fewer votes, indicating niche content that has a dedicated audience but lacks mainstream attention.
* The scatter plots reveal outliers, where certain movies have extremely high popularity but moderate or low scores, possibly indicating hype-driven content.
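Vote counts and popularity values usually span several orders of magnitude, so a few extreme titles can squash the rest of a pair plot into one corner. The following is a minimal sketch, on toy data, of log-transforming those columns before plotting; treating `imdb_votes` and `tmdb_popularity` as the skewed columns is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data with the same four columns used in the pair plot (illustrative only)
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.5, 7.2, 8.1],
    "imdb_votes": [10, 1_000, 50_000, 1_200_000],
    "tmdb_popularity": [0.6, 3.2, 15.0, 480.0],
    "tmdb_score": [5.4, 6.0, 7.0, 8.0],
})
skewed_cols = ["imdb_votes", "tmdb_popularity"]  # assumed to be heavily right-skewed

# Replace each skewed column with log(1 + x); log1p stays defined at zero
plot_df = toy.assign(
    **{f"log1p_{col}": np.log1p(toy[col]) for col in skewed_cols}
).drop(columns=skewed_cols)

print(plot_df.describe().loc[["min", "max"]])
# plot_df can then be passed to sns.pairplot in place of merged_numeric
```

On the log scale, outliers remain visible but no longer determine the axis range for every other title.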

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

To achieve the business objective, the client should focus on improving content recommendations by using IMDb and TMDb ratings to suggest top-rated movies and shows. This will help increase user engagement and satisfaction.

Ensuring metadata completeness is also important. Filling missing values in columns like age_certification, seasons, and imdb_score by cross-referencing other sources will improve content categorization and help users make better viewing choices.

Another key strategy is targeted marketing based on content popularity. Promoting high tmdb_popularity titles in featured lists and advertisements can attract more viewers and boost watch time.

Analyzing production_countries and age_certification data can help in optimizing content for different regions. Understanding regional preferences allows the client to curate recommendations tailored to specific audiences, improving retention rates.

Finally, differentiating between TV shows and movies using the seasons column is essential. Proper classification will ensure better filtering and a more seamless user experience.

By implementing these strategies, the client can enhance user engagement, improve content discoverability, and drive more viewership.
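The first recommendation, suggesting top-rated titles, could be prototyped along the following lines. This is only a sketch under stated assumptions: the function `top_rated`, the `MIN_VOTES` cutoff, and the simple averaging of the two scores are all hypothetical choices, not something the analysis above prescribes.

```python
import pandas as pd

# Toy catalogue standing in for titles_df (illustrative only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [9.2, 8.1, 7.9, 8.8],
    "imdb_votes": [40, 25_000, 310_000, 5_200],
    "tmdb_score": [8.9, 7.8, 8.3, None],
})

MIN_VOTES = 1_000  # hypothetical cutoff so that scores rest on enough ratings

def top_rated(df, min_votes=MIN_VOTES, n=10):
    """Hypothetical helper: rank sufficiently voted titles by their mean available score."""
    eligible = df[df["imdb_votes"] >= min_votes].copy()
    # Row-wise mean skips missing values, so a title with one score is still ranked
    eligible["combined_score"] = eligible[["imdb_score", "tmdb_score"]].mean(axis=1)
    return eligible.sort_values("combined_score", ascending=False).head(n)

recommendations = top_rated(toy)
print(recommendations[["title", "combined_score"]])
```

The vote threshold keeps title A, which has a high score from very few ratings, out of the list; a weighted rating would be a natural refinement.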









# **Conclusion**

This exploratory data analysis on Amazon Prime TV shows and movies helped in understanding key aspects of the dataset, including missing values, duplicate entries, and inconsistencies. Through data cleaning and wrangling, we prepared the dataset for meaningful analysis by handling missing values, optimizing data types, and standardizing text formats.

Key insights revealed trends in content popularity, missing metadata issues, and differences in movie and TV show classifications. Based on these insights, recommendations were made to improve content recommendations, enhance metadata completeness, and implement targeted marketing strategies.

By leveraging IMDb and TMDb ratings, optimizing content for different regions, and ensuring proper categorization, the client can improve user engagement and content discoverability. This analysis provides a strong foundation for further predictive modeling or recommendation system development.
