<a href="https://colab.research.google.com/github/DataCrack-Sushama/Amazon-Prime-Streaming-Analysis/blob/main/Amazon_prime_streaming_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -
Amazon Prime Streaming Analysis


##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Conducted an exploratory data analysis of Amazon Prime Video's streaming catalog in the United States to derive insights into content diversity, regional availability, evolving trends, and IMDb ratings. Analyzed two CSV datasets using Python libraries like Pandas, NumPy, Matplotlib, and Seaborn. Extracted valuable insights on genre dominance, regional content distribution, content evolution over time, and the popularity of top-rated shows. The analysis provided data-driven recommendations for enhancing content strategy, boosting user engagement, and optimizing subscription growth in the competitive streaming industry.

# **GitHub Link -**

https://github.com/DataCrack-Sushama/Amazon-Prime-Streaming-Analysis




# **Problem Statement**


This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

*  Content Diversity : Which genres and categories dominate the platform?
*  Regional Availability : How does content distribution vary across regions?
*  Trends over time : How has Amazon Prime's content library evolved?
*  IMDb Ratings & Popularity : What are the highest-rated or most popular shows on the platform?

By analyzing this dataset, businesses, content creators, and data analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.





#### **Define Your Business Objective?**

The primary objective of this project is to leverage data analytics and machine learning to enhance content strategy and improve user engagement on Amazon Prime Video in the United States. By analyzing streaming data, genre trends, regional preferences, and IMDb ratings, this project aims to:

Optimize Content Acquisition: Identify high-performing genres and categories to guide future content investments.

Enhance User Retention: Develop personalized content recommendations that increase viewer engagement and reduce churn.

Boost Revenue Growth: Support data-driven decisions for marketing strategies by understanding popular trends and viewer demographics.

Improve Platform Experience: Provide insights for content creators and producers to align with audience preferences, boosting satisfaction and platform loyalty.

Drive Strategic Planning: Enable Amazon Prime Video to forecast future content trends, optimize regional content strategies, and enhance subscription growth.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
from zipfile import ZipFile

# Open each zip archive and extract the CSV file it contains
with ZipFile("/content/credits.csv.zip", 'r') as zObject:

    # Extracting specific file in the zip
    # into a specific location.
    zObject.extract(
        "credits.csv", path="/content/sample_data/extracted_files")
with ZipFile("/content/titles.csv.zip", 'r') as zObject:

    # Extracting specific file in the zip
    # into a specific location.
    zObject.extract(
        "titles.csv", path="/content/sample_data/extracted_files")

# Load Dataset
data1 = pd.read_csv('/content/sample_data/extracted_files/credits.csv')
data2 = pd.read_csv('/content/sample_data/extracted_files/titles.csv')

df = pd.merge(data1,data2, on='id')
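
Since `credits.csv` appears to hold one row per credited person per title while `titles.csv` holds one row per title, the merge above likely yields one row per credit, so every title repeats once per person. The sketch below illustrates this on small made-up frames (all values are invented) and uses the `validate` argument of `pd.merge` to assert the expected many-to-one relationship.

```python
import pandas as pd

# Toy stand-ins for credits.csv and titles.csv (values invented for illustration)
credits_toy = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Actor A", "Actor B", "Actor C"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR"],
})
titles_toy = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["First", "Second", "Third"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
})

# validate="many_to_one" raises MergeError if 'id' is not unique in titles_toy
merged_toy = pd.merge(credits_toy, titles_toy, on="id", validate="many_to_one")

print(merged_toy.shape)            # one row per credit, not per title
print(merged_toy["id"].nunique())  # titles with no credits drop out of an inner merge
```

One consequence worth keeping in mind: counts computed on the merged frame are weighted by the number of credited people per title unless duplicates are removed first.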

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"number of rows: {df.shape[0]}")
print(f"number of columns: {df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
df.info()  # info() prints the summary itself and returns None, so it is not wrapped in print()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"number of duplicate values: {df.duplicated().sum()}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(f"number of missing values: {df.isnull().sum()}")

In [None]:
# Visualizing the missing values
plt.figure(figsize = (20,10))

sns.heatmap(df.isnull(),cmap = 'viridis',cbar = False,yticklabels = False)
plt.title('missing values')
plt.xlabel('Columns')
plt.ylabel('Rows')
plt.show()

### What did you know about your dataset?

Several columns have a high number of missing values:

*  character : 16,307 missing, out of a large merged dataset of movie and TV show credits.
*  seasons : 116,194 missing. These can be filtered by type, since movies do not have seasons.
*  age_certification : 67,640 missing. Many entries lack an age rating, possibly because the content is old or less popular.
*  imdb_id : 5,303 missing. This can affect linking to external IMDb data.
*  imdb_score : 6,051 and imdb_votes : 6,075 missing. IMDb data is unavailable for certain titles.
*  tmdb_score : 10,265 missing. A similar issue to the IMDb data.

Other columns, such as description, have comparatively few missing values.
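
Raw counts are easier to compare when shown next to percentages. The following is a minimal sketch on an invented toy frame; in the notebook the same expressions would be applied to the merged `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; stands in for the merged `df`
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "seasons": [np.nan, np.nan, 2.0, np.nan],
    "imdb_score": [6.1, np.nan, 7.4, 5.0],
})

# Count and percentage of missing values per column, worst first
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing)
```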

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
print(f"numerical columns: {df.describe()}")
print(f"categorical columns: {df.describe(include='object')}")

### Variables Description

In [None]:
variables_description= pd.DataFrame({
    'Column' : df.columns,
    'Data Type' : df.dtypes.values,
    'Non-Null Count': df.notnull().sum().values,
    'Missing values': df.isnull().sum().values,
    'unique values': df.nunique().values,
    'Sample values': [df[col].unique()[:5] for col in df.columns]} #display 1st five unique values in a column
    )

print(variables_description)

This summary table helps to understand the structure of the data, to spot columns that have missing values or unexpected data types, and to plan the exploratory data analysis and pre-processing steps.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = {col: df[col].unique() for col in df.columns}

for column, values in unique_values.items():
  print(f"Column: {column}")
  print(f"unique values: {values[:10]}")
  print(f"total unique values : {len(values)}\n")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

print("dataset info")
df.info()  # prints the summary directly; wrapping it in print() would also print None
print("\nMissing values before handling:\n", df.isnull().sum())

# Handling Missing Values
numeric_cols = ['imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Drop rows with missing 'id' or 'title' (essential columns)
df = df.dropna(subset=['id', 'title'])

# Fill missing 'character' with 'Unknown'
df['character'] = df['character'].fillna('Unknown')

# Fill missing 'description' with an empty string
# (assign the result back; inplace fillna on a selected column may not update df)
df['description'] = df['description'].astype(object).fillna('')

# Fill missing 'age_certification' with 'Not Rated'
df['age_certification'] = df['age_certification'].fillna('Not Rated')

# The numeric score and popularity columns were already converted and
# median-filled in the loops above, so only 'seasons' remains to be handled.
# Assumes a missing value means 1 season (movies have no seasons at all).
df['seasons'] = df['seasons'].fillna(1)

# Data type conversion
# Convert 'release_year' to datetime format
df['release_year'] = pd.to_datetime(df['release_year'], format='%Y', errors='coerce')

# Convert categorical columns to 'category' data type for optimization
categorical_columns = ['type','description','age_certification','genres','production_countries','role','name']
for col in categorical_columns:
  df[col] = df[col].astype('category')

# Handling Duplicates
# Drop duplicate rows if any
df.drop_duplicates(inplace=True)

# Feature Engineering
# Create a new column for content age (years since release)
df['content_age'] = pd.Timestamp.now().year - df['release_year'].dt.year

# Create a new column indicating whether the content is a 'Movie' or 'TV Show'
df['is_movie'] = df['type'].apply(lambda x: 1 if x.lower() == 'movie' else 0)

# Outlier Handling
# Clip IMDb scores to a valid range (0-10)
df['imdb_score'] = df['imdb_score'].clip(0, 10)

# Clip runtime to avoid extreme values
df['runtime'] = df['runtime'].clip(lower=0, upper=df['runtime'].quantile(0.95))

# Renaming Columns (if needed)
#df.rename(columns={'imdb_score': 'IMDB_Score', 'tmdb_score': 'TMDB_Score'}, inplace=True)

# Final Check for Missing Values
print("\nMissing Values After Handling:\n", df.isnull().sum())

# Save the Cleaned Data (Optional)
df.to_csv('cleaned_dataset.csv', index=False)

print("\nData Wrangling Completed! Dataset is now analysis-ready.")



### What all manipulations have you done and insights you found?

Data Manipulations Performed:
--------------

Checked for Missing Values
--------------------

Reviewed the missing-value count of every column before handling.

Handled Missing Values
----------------

*  character: Filled missing values with 'Unknown'.
*  description: Filled missing values with an empty string ('').
*  age_certification: Filled missing values with 'Not Rated'.
*  seasons: Filled missing values with 1.
*  imdb_score, imdb_votes, tmdb_score, tmdb_popularity: Converted to numeric and replaced missing values with the median.

Ensured Data Consistency
------------------

Verified column names to prevent KeyError (e.g., case-sensitive issues with 'age_certification').

Optimized Memory Usage
-------------

The dataset has 124,179 rows and 21 columns; several text columns were converted to the category data type to reduce memory usage.

Insights Derived:
--------------------

Data Completeness:
----------

Missing values in the columns that were handled above have been filled, so these columns are ready for analysis. Columns that were not handled, such as imdb_id, may still contain missing values.

Data Distribution:
---------

*  imdb_score & tmdb_score: Evaluating central tendency and spread could help identify highly-rated or low-rated titles.
*  genres & production_countries: Categorized, allowing for genre-based or country-based analysis.
*  content_age: Can be used for age-based content recommendations.

Potential Analysis Areas:
----------

*  Movie vs. TV Show Analysis: Using is_movie to compare trends.
*  Popularity Trends: Exploring tmdb_popularity over time.
*  Impact of Runtime on Ratings: Checking if longer movies receive better ratings.
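
The memory saving from the category data type mentioned above can be checked directly. This is a small sketch on an invented low-cardinality column, not a measurement of the actual dataset.

```python
import pandas as pd

# Invented example: a low-cardinality string column repeated many times
roles = pd.Series(["ACTOR", "DIRECTOR"] * 50_000)

# deep=True includes the memory held by the string values themselves
as_object = roles.memory_usage(deep=True)
as_category = roles.astype("category").memory_usage(deep=True)

print(f"original: {as_object:,} bytes, category: {as_category:,} bytes")
```

The benefit depends on how many distinct values a column has. For nearly unique text such as descriptions, the conversion may save little or nothing.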

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# IMDb Score Distribution
plt.figure(figsize=(8, 5))
sns.histplot(df['imdb_score'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of IMDb Scores')
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is ideal for visualizing the distribution of IMDb scores across movies.
It helps in understanding the central tendency, spread, and skewness of the ratings. The addition of a KDE curve smooths the distribution, making patterns easier to interpret.

##### 2. What is/are the insight(s) found from the chart?



*  IMDb scores follow a roughly bell-shaped distribution, with most movies scoring between 5 and 7.
*  The peak occurs around 6, meaning the majority of movies receive moderate ratings rather than extremely high or low scores.
*  Very few movies receive scores below 3 or above 9, indicating that extreme ratings are rare.
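
The visual impression from the histogram can be backed up with a few numbers. The sketch below uses invented scores; in the notebook the input would be `df['imdb_score']`. Because missing scores were filled with the median during wrangling, it also measures how many values sit exactly at the median, which could show up as a spike in the histogram.

```python
import pandas as pd

# Invented scores standing in for df['imdb_score']
scores = pd.Series([3.2, 5.1, 5.8, 6.0, 6.0, 6.3, 6.9, 7.4, 8.8])

share_5_to_7 = scores.between(5, 7).mean()       # fraction of scores in [5, 7]
skewness = scores.skew()                         # close to 0 for a symmetric distribution
at_median = (scores == scores.median()).mean()   # fraction of values equal to the median

print(f"share in 5-7: {share_5_to_7:.2f}, skew: {skewness:.2f}, at median: {at_median:.2f}")
```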








##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, streaming platforms and movie producers can benchmark their content based on typical IMDb scores.
* Content recommendation algorithms can use this insight to promote movies with average or higher ratings to maximize user engagement.
* Production houses can aim to optimize storytelling and marketing to move their movies into the higher-rating range.


Since most movies cluster around average scores, it might be difficult to differentiate between them, leading to content saturation in the mid-range.

A lack of highly rated movies (8-10 range) suggests that either audience expectations are very high or few movies manage to achieve outstanding quality.

Solution: Production teams should analyze factors contributing to top-rated movies and incorporate those elements (e.g., screenplay, direction, casting) to improve future releases.




#### Chart - 2

In [None]:
#  IMDb vs. TMDB Score
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df['imdb_score'], y=df['tmdb_score'], alpha=0.5, color='red')
plt.title('IMDb Score vs. TMDB Score')
plt.xlabel('IMDb Score')
plt.ylabel('TMDB Score')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is the best choice for visualizing relationships between two numerical variables (IMDb and TMDB scores).
It helps in identifying correlation, trends, and outliers in the dataset.
The plot efficiently highlights how closely the two rating systems align for different movies.

##### 2. What is/are the insight(s) found from the chart?



*  There is a positive correlation between IMDb and TMDB scores, meaning higher IMDb ratings tend to align with higher TMDB ratings.
*  However, the spread is wide, indicating that some movies are rated significantly differently on both platforms.

* Many movies cluster around the 6-8 rating range, suggesting most films receive moderate ratings rather than extreme scores.
* Some outliers exist, where a movie is highly rated on one platform but poorly rated on the other.
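
The strength of the relationship can be quantified with a correlation coefficient. The sketch below uses invented rows and assumes the `id` column identifies a title; it keeps one row per title first, so titles with many credited people do not dominate the result.

```python
import pandas as pd

# Invented rows: the same title repeats once per credited person, as in a merged frame
toy = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3", "t4"],
    "imdb_score": [7.0, 7.0, 7.0, 5.5, 8.1, 6.2],
    "tmdb_score": [6.8, 6.8, 6.8, 6.0, 7.9, 5.9],
})

# Keep one row per title before computing the statistic
per_title = toy.drop_duplicates(subset="id")
corr = per_title["imdb_score"].corr(per_title["tmdb_score"])  # Pearson by default

print(f"titles: {len(per_title)}, Pearson r: {corr:.2f}")
```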







##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, streaming services can prioritize content that has high ratings on both platforms to ensure audience satisfaction.

* Distributors can analyze rating discrepancies to understand viewer expectations across different platforms.
* Filmmakers can benchmark their movie ratings against similar content to improve future productions.


The presence of rating inconsistencies could lead to audience confusion, affecting trust in reviews.

Outliers indicate possible bias or differences in rating criteria, which might make it hard for businesses to rely on a single platform’s scores for decision-making.

Solution: Businesses should consider multiple rating sources and not rely solely on IMDb or TMDB when evaluating content success.




#### Chart - 3

In [None]:
#  Top Genres Count
plt.figure(figsize=(10, 5))
sns.countplot(y=df['genres'], order=df['genres'].value_counts().index[:10], palette='viridis')
plt.title('Top 10 Most Common Genres')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is ideal for categorical data with long labels (like genres), ensuring better readability.
It clearly shows the ranking of genres based on frequency, making comparisons easy and it  helps quickly identify dominant genres in the dataset.


##### 2. What is/are the insight(s) found from the chart?

*  Drama is the most common genre, significantly ahead of other genres.
*  Comedy follows next, but with a considerable gap from Drama.
*  Mixed genres (e.g., Drama-Romance, Comedy-Drama, Romance-Comedy) are quite frequent, indicating that audiences enjoy hybrid storytelling.
*  Horror and Thriller appear in the top 10, suggesting a steady demand for suspenseful and fear-driven content.
*  Documentary content has a notable presence, highlighting growing interest in factual and reality-based storytelling.
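
Mixed-genre labels can also be split into individual genres and counted once per title. The sketch below is a possible approach under two assumptions that are not confirmed by the notebook: that `genres` holds list-like strings such as `"['drama', 'romance']"`, and that `id` identifies a title. All rows are invented.

```python
import ast
import pandas as pd

# Invented rows; each title repeats once per credited person
toy = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3"],
    "genres": ["['drama', 'romance']", "['drama', 'romance']", "['comedy']", "['drama']"],
})

per_title = toy.drop_duplicates(subset="id").copy()
per_title["genre"] = per_title["genres"].apply(ast.literal_eval)  # string -> Python list
genre_counts = per_title.explode("genre")["genre"].value_counts()  # one row per (title, genre)

print(genre_counts)
```

If the column has been converted to the category data type, as in the wrangling step, casting it back with `.astype(str)` before parsing may be needed.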





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, streaming platforms can prioritize acquiring or producing more Drama and Comedy content, as they dominate in popularity.
* Content creators can focus on producing films in these genres to increase viewership and engagement.
* Marketing teams can design genre-specific promotional campaigns targeting Drama and Comedy fans, boosting subscription rates.



The dominance of Drama and Comedy might indicate oversaturation, making it harder for new content to stand out.

Less representation of other genres (e.g., Sci-Fi, Fantasy, Adventure) might mean missed opportunities in underexplored categories.

Solution: Platforms should diversify their content portfolio by investing in underrepresented genres to attract niche audiences and reduce market saturation risks.





#### Chart - 4

In [None]:
#  Runtime Distribution
plt.figure(figsize=(8, 5))
sns.histplot(df['runtime'], bins=30, kde=True, color='green')
plt.title('Distribution of Movie Runtimes')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with KDE is ideal for showing the distribution of numerical data, making it perfect for visualizing movie runtimes.
*  It allows for easy identification of common runtime lengths, skewness, and trends in movie durations.
*  The smooth KDE curve helps highlight underlying patterns beyond raw frequency counts.







##### 2. What is/are the insight(s) found from the chart?


*  Most movies have a runtime between 80 and 100 minutes, indicating this is the most common duration in the catalog.
*  A secondary peak is observed at around 140 minutes, which could indicate longer feature films or special genres. Because runtimes were clipped at the 95th percentile during data wrangling, this peak may also partly reflect capped values.
*  Few movies are extremely short (under 40 minutes) or very long (over 140 minutes).


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, content creators can optimize movie runtimes to align with audience expectations, ensuring higher engagement.
*  Streaming platforms can improve recommendations by filtering content based on user preference for runtime.
*  Advertisers can strategically place ads in movies of popular durations to maximize revenue.

Movies significantly deviating from the most common runtime (80-100 minutes) may struggle with audience retention.


* Short movies (<40 min): Might be perceived as incomplete or lacking depth, leading to lower viewer satisfaction.
* Long movies (>140 min): Risk higher dropout rates due to audience fatigue, potentially reducing engagement.

Solution: Filmmakers should balance storytelling efficiency with runtime expectations to maximize viewer retention and business success.




#### Chart - 5

In [None]:
# Age Certification Count
plt.figure(figsize=(8, 5))
sns.countplot(x=df['age_certification'], order=df['age_certification'].value_counts().index, palette='pastel')
plt.title('Distribution of Age Certification')
plt.xlabel('Age Certification')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is effective for visualizing categorical data, making it ideal for showing the distribution of age certifications. It helps in understanding whether movies are properly rated or if there is a significant number of "Not Rated" movies. The simplicity of the bar chart makes it easy to interpret.

##### 2. What is/are the insight(s) found from the chart?


*  The majority of the movies fall under a single age certification category (labeled as "1").
*  A significant portion of movies is marked as "Not Rated", meaning they lack an official age certification.

*  This suggests that many movies might not have been officially reviewed for content suitability.
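
The size of the "Not Rated" group can be expressed as a share per content type. This is a minimal sketch on invented rows; the certification labels shown are only examples.

```python
import pandas as pd

# Invented rows; 'Not Rated' is the fill value used during wrangling
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "age_certification": ["R", "Not Rated", "Not Rated", "TV-14", "Not Rated"],
})

# Row-normalised crosstab: each row (content type) sums to 1
cert_share = pd.crosstab(toy["type"], toy["age_certification"], normalize="index")
not_rated_share = cert_share["Not Rated"]

print(not_rated_share)
```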






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, streaming platforms and content distributors can ensure proper classification to enhance user experience.

* Parental control systems can be improved by reducing the number of "Not Rated" movies.
* Better age certification tagging can help in personalized content recommendations.


The high number of "Not Rated" movies could be a drawback.

Regulatory Issues: Some platforms require age ratings for compliance—failure to provide them can lead to restrictions or removals.

User Trust: Parents may avoid movies without ratings, leading to a loss of potential viewers.

Ad Revenue Impact: Advertisers often prefer content with a clear rating, so unrated content might have lower monetization potential.

Solution: Content providers should prioritize getting official age ratings to improve compliance, user experience, and business opportunities.

#### Chart - 6

In [None]:
#  IMDb Score vs. Runtime
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df['runtime'], y=df['imdb_score'], alpha=0.5, color='purple')
plt.title('IMDb Score vs. Runtime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is an effective way to visualize the relationship between IMDb scores and movie runtime.


* It helps in identifying clusters, trends, and outliers in how movie duration affects audience ratings.
*  The plot allows us to observe if there’s an optimal runtime that consistently gets higher ratings.





##### 2. What is/are the insight(s) found from the chart?


*  Most movies have a runtime between 60 and 120 minutes, indicating this is the standard length.
*  There is no clear correlation between runtime and IMDb score, meaning longer movies don't necessarily receive higher ratings.
*  Movies shorter than 30 minutes have a broad range of scores, suggesting they might include short films or TV specials.
*  Some outliers exist, where certain very short or very long movies receive extremely high or low ratings.
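
The "no clear correlation" reading can be checked numerically. The sketch below computes both Pearson and Spearman coefficients on invented values; Spearman is obtained here as the Pearson correlation of the ranks, which avoids any extra dependency.

```python
import pandas as pd

# Invented values standing in for df['runtime'] and df['imdb_score']
toy = pd.DataFrame({
    "runtime": [25, 60, 88, 95, 104, 120, 150],
    "imdb_score": [7.1, 5.4, 6.0, 6.8, 5.9, 7.3, 6.1],
})

pearson = toy["runtime"].corr(toy["imdb_score"])                 # linear association
spearman = toy["runtime"].rank().corr(toy["imdb_score"].rank())  # monotonic association

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Values near zero for both coefficients would support the observation that longer runtimes are not associated with higher ratings.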




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights can help content creators and streaming platforms:

*  Optimize content length based on viewer preferences—since most movies are within 60-120 minutes, investing in this range might ensure better audience retention.
*  Analyze outliers to understand why some long or short movies perform exceptionally well.
*  Target specific genres—for example, documentaries or short films might perform well at different runtimes than mainstream movies.

Are there any insights that lead to negative growth? Justify with specific reason.


Risk of producing excessively long movies:
------


* If a platform over-invests in long-duration content, it may not align with audience preferences, leading to lower engagement.

Solution: Focus on engagement data alongside ratings to determine the ideal runtime for each genre.

Ignoring short-format content:
---------
* Short movies (under 40 minutes) have diverse ratings, meaning some perform very well.

Solution: Streaming platforms should not overlook short films as they may attract a niche but dedicated audience.













#### Chart - 7

In [None]:
#Popularity vs. TMDb Score (Scatter Plot)
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df['tmdb_score'], y=df['tmdb_popularity'], alpha=0.7, color="blue")
plt.title('Popularity vs. TMDb Score')
plt.xlabel('TMDb Score')
plt.ylabel('TMDb Popularity')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is ideal for showing the relationship between TMDb scores and popularity. It helps in identifying trends, clusters, and outliers in the data.
The plot makes it easy to observe if higher ratings correlate with increased popularity.

##### 2. What is/are the insight(s) found from the chart?


*  Most movies have a TMDb score between 5 and 8 and show varying levels of popularity.
*  There are a few highly popular movies (outliers) with moderate scores (around 6), suggesting that popularity isn't solely dependent on ratings.
*  Some movies with high ratings (above 8) are not necessarily the most popular, indicating that other factors like marketing, star power, or genre trends might influence popularity.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help streaming platforms and content creators:

*  Target promotion efforts on moderately rated but highly popular movies.
*  Identify movies with high ratings but low popularity and boost their visibility through better recommendations or marketing campaigns.
*  Understand viewer preferences—popularity may not always align with high ratings, so focusing on user engagement metrics can be more valuable.
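
Titles with high ratings but low popularity can be listed with a simple filter. The sketch below uses invented rows, and both cut-offs (a score of at least 8 and popularity below the median) are hypothetical choices made only for illustration.

```python
import pandas as pd

# Invented rows standing in for one-row-per-title data
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "tmdb_score": [8.6, 6.1, 8.2, 5.0, 7.0, 6.4],
    "tmdb_popularity": [3.0, 250.0, 40.0, 2.0, 15.0, 90.0],
})

# Hypothetical cut-offs: score at or above 8, popularity below the median
score_cutoff = 8.0
popularity_cutoff = toy["tmdb_popularity"].median()

high_rated_low_popularity = toy[
    (toy["tmdb_score"] >= score_cutoff)
    & (toy["tmdb_popularity"] < popularity_cutoff)
]

print(high_rated_low_popularity["title"].tolist())
```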

Are there any insights that lead to negative growth? Justify with a specific reason.


Potential misalignment in content strategy:
------------


*  If a platform only promotes high-rated movies, it may miss out on movies that are popular but not critically acclaimed.


Solution: Implement a balanced recommendation strategy that considers both popularity and rating.

Overreliance on ratings for decision-making:
----------
* Investing in only high-rated but less popular movies might not yield high user engagement or revenue.

Solution: Focus on trending content and audience preferences, not just critic scores.







#### Chart - 8

In [None]:
#  Popularity vs. IMDb Score (Scatter Plot)
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df['imdb_score'], y=df['tmdb_popularity'], alpha=0.6)
plt.title("Popularity vs. IMDb Score")
plt.xlabel("IMDb Score")
plt.ylabel("TMDb Popularity")
plt.show()

##### 1. Why did you pick the specific chart?




* A scatter plot was chosen to show the relationship between IMDb scores and TMDb popularity.
* It helps in identifying trends, outliers, and correlations between movie ratings and their popularity.
*  This visualization is useful to analyze if higher-rated movies tend to be more popular or if popularity is independent of rating.








##### 2. What is/are the insight(s) found from the chart?


*  Most movies have low to moderate popularity, regardless of IMDb scores.
*  There are a few highly popular movies, but their ratings vary across the spectrum.
*  Some highly-rated movies (8-10) have lower popularity, suggesting that quality doesn't always translate to popularity.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, streaming platforms & production houses can use this to refine their content strategy.
*  Marketing teams can identify which factors boost popularity, even for movies with mid-level IMDb scores.
*  This can help in targeted promotions—for instance, boosting underrated high-quality films or capitalizing on popular but average-rated films.


If production houses rely only on IMDb scores to predict success, they may ignore films with lower ratings but strong popularity potential.

Some highly-rated movies remain unnoticed due to poor marketing, leading to lost revenue opportunities.

Solution: Invest in data-driven marketing strategies to push high-quality but less popular films toward wider audiences.





#### Chart - 9

In [None]:
# Number of Movies/Shows Released per Year (Line Chart)

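# 'release_year' was converted to datetime during wrangling; reduce it to the year.
# The dtype check keeps this cell safe to re-run (an integer year would otherwise
# be misread by pd.to_datetime as nanoseconds since 1970).
if pd.api.types.is_datetime64_any_dtype(df['release_year']):
    df['release_year'] = df['release_year'].dt.year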
release_counts = df.groupby('release_year').size()
plt.figure(figsize=(12, 6))
plt.plot(release_counts.index, release_counts.values, marker='o', linestyle='-')
plt.title("Number of Movies/Shows Released Per Year")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.grid(True)
plt.show()
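
Since the merged frame likely repeats each title once per credited person, counting rows per year and counting distinct titles per year can give different curves. The sketch below contrasts the two on invented rows, assuming `id` identifies a title.

```python
import pandas as pd

# Invented rows: t1 has two credits, t2 has one, t3 has three
toy = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3", "t3", "t3"],
    "release_year": [2019, 2019, 2019, 2021, 2021, 2021],
})

rows_per_year = toy.groupby("release_year").size()             # weighted by cast size
titles_per_year = toy.groupby("release_year")["id"].nunique()  # distinct titles only

print(pd.DataFrame({"rows": rows_per_year, "titles": titles_per_year}))
```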

##### 1. Why did you pick the specific chart?

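A line chart is well suited to showing how a count changes over time, so it was chosen to display the number of movies and shows released per year and to make upward or downward trends easy to spot.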

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Title vs. Age Certifications (Bar Plot - Top 20 for readability)
top_titles = df['title'].value_counts().nlargest(20).index
df_filtered = df[df['title'].isin(top_titles)]
plt.figure(figsize=(12, 6))
sns.countplot(y=df_filtered['title'], hue=df_filtered['age_certification'], palette="Set2")
plt.title('Top 20 Titles vs. Age Certifications')
plt.xlabel('Count')
plt.ylabel('Title')
plt.legend(title='Age Certification')
plt.show()

##### 1. Why did you pick the specific chart?



* A horizontal bar chart was chosen because it efficiently represents the top 20 titles (the 20 titles that appear most often in the merged dataset) and their age certifications, making comparisons clear.
* It allows easy visualization of the distribution of certified vs. not-rated content.
*  The horizontal layout ensures title names are readable, avoiding clutter in a vertical chart.



##### 2. What is/are the insight(s) found from the chart?


*  Most of the top 20 titles have an age certification.
*  Some titles are "Not Rated," meaning they lack an official age certification, which could be due to historical reasons or missing metadata.
*  Classic movies and well-known titles (e.g., Titanic, Pearl Harbor, Independence Day) are mostly rated, indicating that major films typically undergo certification.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help streaming platforms ensure compliance with regional content regulations.
* Understanding which movies lack certification allows businesses to update metadata, improving user experience and parental controls.
* Better categorization based on certification can enhance recommendation systems, boosting engagement and watch time.


Are there any insights that lead to negative growth? Justify with a specific reason.

Unrated content could lead to regulatory and trust issues:
----------
* If popular titles lack certification, users may hesitate to watch them, especially in regions with strict content guidelines.

Solution: Platforms should work to get official age ratings for unrated content to improve transparency.

Potential metadata inconsistency:
-------
* If some certified movies are incorrectly labeled as "Not Rated," it may indicate data quality issues in the platform's database.

Solution: Regular data validation and metadata enrichment should be performed.






#### Chart - 11

In [None]:
#  Is Movie vs. Age Certifications (Stacked Bar Plot)
plt.figure(figsize=(8, 5))
sns.histplot(data=df, x="is_movie", hue="age_certification", multiple="stack", shrink=0.8, palette="coolwarm")
plt.title("Movie vs. Age Certification")
plt.xlabel("Is Movie (1 = Movie, 0 = TV Show)")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?


* A bar chart was chosen because it clearly represents the comparison between movies and TV shows based on their age certification.
* It effectively highlights the disparity between certified and non-certified content.
*  This chart helps understand content rating trends, which is crucial for streaming platforms and regulatory compliance.




##### 2. What is/are the insight(s) found from the chart?


* A huge number of movies have age certifications, while TV shows are significantly lower in count.
*  A small proportion of content is labeled as "Not Rated", which might indicate missing metadata or a lack of formal certification.
*  Certified movies far outnumber certified TV shows. This may reflect the larger number of movies in the catalog as much as any difference in how rating bodies treat the two formats.
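
Raw counts make it hard to tell whether movies are certified more often or are simply more numerous. A minimal sketch of a row-normalized cross-tabulation, using a toy frame with made-up values and assuming `is_movie` is coded 1 for movies and 0 for TV shows:

```python
import pandas as pd

# Toy data (hypothetical values) mirroring the two columns used in the chart.
toy = pd.DataFrame({
    "is_movie": [1, 1, 1, 1, 0, 0],
    "age_certification": ["PG", "R", "R", "Not Rated", "TV-MA", "Not Rated"],
})

# Absolute counts per content type and certification.
counts = pd.crosstab(toy["is_movie"], toy["age_certification"])

# Row-normalized shares: each row sums to 1, so the two types are comparable.
shares = pd.crosstab(toy["is_movie"], toy["age_certification"], normalize="index")

print(counts)
print(shares.round(2))
```

Comparing the "Not Rated" column of `shares` across the two rows gives a like-for-like view that the stacked counts alone do not.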




##### 3. Will the gained insights help creating a positive business impact?




* Yes, these insights can help content platforms improve classification and parental control features.
* Platforms can enhance user experience by ensuring that proper age certifications are available for all content.
* Regulatory compliance can be improved, reducing legal risks related to unclassified content.
* Streaming services can optimize their recommendation algorithms by filtering content based on certification.


Are there any insights that lead to negative growth?

Justify with a specific reason.


High number of "Not Rated" content:
-------
* If a significant portion of content lacks an age rating, it can lead to trust issues for parents and viewers.

Solution: Platforms should ensure age certification is properly assigned to all content.

TV Shows Lack Proper Certification:
----------
* The low count of TV shows with certification could indicate less regulatory oversight, which may affect audience trust and compliance in certain regions.

Solution: Streaming platforms should work with rating agencies to certify more TV shows for better content transparency.



#### Chart - 12

In [None]:
# Release Year vs. Type (Count Plot)
plt.figure(figsize=(12, 6))
# Assumes release_year was converted to datetime earlier; .dt.year extracts the year
sns.countplot(x=df['release_year'].dt.year, hue=df['type'], palette="husl")
plt.xticks(rotation=90)
plt.title('Number of Movies/Shows Released per Year')
plt.xlabel('Release Year')
plt.ylabel('Count')
# Let seaborn supply the legend labels so they always match the hue categories
plt.legend(title="Type")
plt.show()

##### 1. Why did you pick the specific chart?


*  A grouped count plot (side-by-side bars per year) was chosen to analyze the trend of movie and TV show releases over time.
*  It effectively displays growth patterns in content production, highlighting the comparison between movies and TV shows.
*  The yearly trend provides insights into industry shifts, content demand, and production strategies.




##### 2. What is/are the insight(s) found from the chart?


*  Content production has increased significantly in recent years, especially for movies. TV shows have also increased, but their growth is much slower than that of movies.
*  Content volume rises sharply after 2000, with the steepest growth after 2010.
*  The highest content production year appears to be the most recent one, showing continued industry expansion.
*  Older decades (pre-1950s) have minimal content releases, indicating that data collection may be limited for historical records.
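
The growth claims above can be checked numerically rather than read off the bars. A minimal sketch on a toy frame with made-up values; the `MOVIE` and `SHOW` labels and the integer `release_year` column are assumptions for illustration:

```python
import pandas as pd

# Toy data (hypothetical values): one row per title.
toy = pd.DataFrame({
    "release_year": [2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE", "SHOW",
             "MOVIE", "MOVIE", "MOVIE", "SHOW"],
})

# Titles per year and type, with years as rows and types as columns.
per_year = toy.groupby(["release_year", "type"]).size().unstack(fill_value=0)

# Year-over-year relative change for each type.
growth = per_year.pct_change()

print(per_year)
print(growth)
```

A roughly constant positive value in `growth` would support describing the trend as exponential; a shrinking value would point to slower growth.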



##### 3. Will the gained insights help creating a positive business impact?



*  Yes, these insights help streaming platforms, content creators, and investors make data-driven decisions.
*  Companies can focus on high-growth years and trends to optimize content strategy.
*  Streaming services can acquire more recent content, as it aligns with the increased production rate.
*  Marketing strategies can be tailored based on production trends, ensuring efficient content distribution.


Are there any insights that lead to negative growth?
Justify with specific reason.


Market Saturation Risk:
-------------
*  The rapid rise in movie production may lead to overcrowding, making it harder for individual films to gain visibility.

Solution: Platforms should focus on personalized recommendations and quality over quantity.

TV Show Underrepresentation:
------------
*  Despite increasing content consumption, TV show production remains significantly lower.
*  Streaming services relying on long-form content may struggle to retain viewers.

Solution: Invest in high-quality TV shows to maintain engagement and subscription retention.








#### Chart - 13

In [None]:
# Top 15  genres vs type
plt.figure(figsize=(12, 6))
sns.countplot(data=df, y='genres', hue='type', palette="coolwarm", order=df['genres'].value_counts().index[:15])
plt.title('Top 15 Genres vs. Type')
plt.xlabel('Count')
plt.ylabel('Genres')
plt.legend(title="Type")
plt.show()


##### 1. Why did you pick the specific chart?


*  A horizontal bar chart is chosen to display the distribution of the top 15 genres, categorized by type (Movie or Show).
*   It effectively compares the frequency of different genres, making it easier to see which genres dominate the dataset.
*   The use of color distinction (Movies vs. Shows) provides an additional layer of insight into content type preferences.



##### 2. What is/are the insight(s) found from the chart?


*  Drama is the most popular genre, followed by Comedy and Horror.
*  Movies dominate across all genres compared to Shows.
*  Genre combinations like 'drama & romance' and 'thriller & crime' are relatively popular, suggesting audience interest in blended storytelling.
*  Documentaries have a strong presence, indicating a significant viewership for non-fiction content.
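
Because combined labels such as 'drama & romance' are counted as their own category, single genres can be undercounted. A minimal sketch of splitting combined labels before counting, on a toy frame with made-up values and assuming the separator is `" & "`:

```python
import pandas as pd

# Toy data (hypothetical values) with single and combined genre labels.
toy = pd.DataFrame({
    "genres": ["drama", "drama & romance", "comedy", "thriller & crime", "drama"],
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW"],
})

# Split each label into a list of genres, then give every genre its own row.
exploded = toy.assign(genre=toy["genres"].str.split(" & ")).explode("genre")

# Count how often each individual genre appears.
genre_counts = exploded["genre"].value_counts()

print(genre_counts)
```

Passing `exploded` and `y='genre'` to the count plot would then rank individual genres rather than label combinations.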



##### 3. Will the gained insights help creating a positive business impact?

*  Yes, these insights help streaming platforms, content creators, and distributors understand audience preferences.
*  Businesses can invest more in drama and comedy content to maximize engagement.
*  Platforms can optimize recommendations by targeting users based on their preferred genres.
*  Balanced content strategies can be developed by analyzing the demand for single vs. multi-genre content.



Are there any insights that lead to negative growth?
Justify with specific reason.


Potential Risk: Over-Saturation of Drama Content
-----------

* Since drama is highly dominant, excessive production of similar content might lead to viewer fatigue.

* If other genres (like sci-fi, fantasy, or niche categories) are neglected, platforms may lose diverse audience segments.

Solution: Maintain variety by investing in emerging genres and niche markets.

Underrepresentation of TV Shows:
----------------
*  The chart suggests that movies heavily outnumber shows across all genres.

*  Streaming services focusing on long-form content may miss out on user engagement and subscription retention.

Solution: Increase investment in high-quality, long-running TV shows to maintain a steady user base.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize=(10, 6))
#corr_matrix = df.corr()
numeric_df = df.select_dtypes(include=np.number)
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?


* A correlation heatmap is chosen to visually explore the relationships between numerical variables in the dataset.
* It provides an easy way to detect strong or weak correlations between features, helping in feature selection for predictive modeling.
* The color gradient allows quick identification of positive (red) and negative (blue) correlations, making data interpretation intuitive.

##### 2. What is/are the insight(s) found from the chart?


*  IMDb score and TMDb score have a strong positive correlation (~0.6), indicating that higher-rated movies on IMDb tend to have higher TMDb scores.
* IMDb votes and TMDb popularity show a weak positive correlation (~0.23), meaning more votes on IMDb are only loosely associated with higher popularity on TMDb.
* Content age has a weak or negative correlation with most variables, suggesting that the age of the content doesn't significantly impact ratings or popularity.
* Runtime has a low correlation with other factors, implying that the length of a movie/show does not strongly influence its popularity or rating.
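
The values read from the heatmap can also be listed as a ranked table. A minimal sketch on a toy frame with made-up values; the column names follow those mentioned above, but the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the text column is ignored by the correlation.
toy = pd.DataFrame({
    "imdb_score": [6.0, 7.0, 8.0, 5.0, 9.0],
    "tmdb_score": [6.2, 6.8, 7.9, 5.5, 8.5],
    "runtime": [90, 120, 100, 95, 110],
    "title": list("abcde"),
})

# Pearson correlation between the numeric columns only.
corr = toy.select_dtypes(include=np.number).corr()

# Keep each pair once (upper triangle, no diagonal) and sort by strength.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a quick scan of the color map can miss.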



#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

# Set plot style
sns.set(style="whitegrid")

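# Pair Plot: Name, Popularity, Type, Title, and Genres
# Note: pairplot draws numeric columns only, so with this subset the result is
# a single KDE of tmdb_popularity, split by type
subset = df[['name', 'tmdb_popularity', 'type', 'title', 'genres']].dropna()
sns.pairplot(subset, hue='type', diag_kind='kde', palette="Dark2")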
plt.show()

##### 1. Why did you pick the specific chart?



*   A density plot (KDE plot) is useful for visualizing the distribution of numerical data and identifying patterns.
* It helps understand how the values of TMDb popularity are spread across different movies or TV shows.
* The chart provides a smoother representation of data density compared to histograms, making it easier to spot skewness or anomalies.

##### 2. What is/are the insight(s) found from the chart?

*  The TMDb popularity values are highly skewed to the right (a long tail toward high values), meaning most movies/shows have low popularity scores.
*  A few outliers exist with extremely high popularity (beyond 1000), but they are rare.
*  The majority of content has low popularity, suggesting that only a small percentage of movies/shows achieve high engagement.
*  If this trend holds for both movies and TV shows, it might indicate that a handful of blockbuster releases dominate viewership.
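
The skew can be quantified instead of judged by eye. A minimal sketch on a made-up series; a positive skewness value indicates a long tail toward high values, and a log transform is one common way to make such a distribution easier to plot:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): mostly small scores plus a few large ones.
popularity = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 120], name="tmdb_popularity")

# Sample skewness of the raw values (positive means a long right tail).
raw_skew = popularity.skew()

# log1p compresses large values and is defined at zero.
log_skew = np.log1p(popularity).skew()

print(f"Skewness before log transform: {raw_skew:.2f}")
print(f"Skewness after log transform:  {log_skew:.2f}")
```

Plotting the KDE of the log-transformed values would show the bulk of the distribution, which the raw scale compresses into a narrow spike near zero.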

#### Chart - 16

In [None]:
# Distribution of Production Countries
plt.figure(figsize=(8, 8))
top_countries = df['production_countries'].value_counts().nlargest(8)
plt.pie(top_countries, labels=top_countries.index, autopct='%1.1f%%', colors=sns.color_palette('pastel'))
plt.title('Production Countries Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

Since we are analyzing the distribution of production countries, a pie chart clearly shows the dominant regions contributing to movie/show production.

It helps in quickly identifying the leading country and how others compare in terms of production share.

##### 2. What is/are the insight(s) found from the chart?


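*   The United States (US) dominates with 70.4% of the productions shown. Because the pie chart includes only the eight most frequent country values, this share is relative to those eight, not to the full catalog.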
*   India (IN) follows with 11.4%, indicating a strong film industry presence.
*   Other countries like the UK (GB), Canada (CA), Japan (JP), Australia (AU), and France (FR) have a much smaller share.
*   There is a small percentage of data entries with missing country information ([]), which could indicate either missing or multi-country productions.
*   The chart highlights the centralization of movie/TV production in a few dominant regions, with the US leading significantly.
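
Since the chart keeps only the most frequent country values, its percentages are shares of that subset. A minimal sketch, on a made-up series, of how the share changes once the remaining values are grouped into a hypothetical "Other" bucket:

```python
import pandas as pd

# Toy data (hypothetical values): 10 titles from four countries.
countries = pd.Series(["US"] * 6 + ["IN"] * 2 + ["GB", "CA"],
                      name="production_countries")

counts = countries.value_counts()
top = counts.nlargest(2)

# Shares relative to the top categories only (what a top-N pie chart shows).
shares_of_top = top / top.sum()

# Shares relative to all rows, with the remainder grouped as "Other".
other = counts.sum() - top.sum()
shares_of_all = pd.concat([top, pd.Series({"Other": other})]) / counts.sum()

print(shares_of_top)
print(shares_of_all)
```

In this toy example the leading country drops from 75% to 60% once the omitted values are included, so the two figures should not be quoted interchangeably.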




## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.

Based on the insights gained from all visualizations, the client can take the following data-driven actions to achieve their business objective:

Optimize Movie Selection for Maximum Engagement
---------
*  The IMDb Score vs. TMDb Score scatter plot shows a strong correlation, suggesting that both platforms' ratings align.
*  The client can prioritize movies with higher IMDb and TMDb scores for better user satisfaction.

Leverage Popularity Insights for Better Content Strategy
--------
*  The IMDb Score vs. TMDb Popularity scatter plot indicates that some moderate-rated movies (IMDb 6-7) gain high popularity, while highly rated movies (8+) might have lower popularity.
*  Instead of focusing only on high-rated content, invest in marketing and promotions for mid-rated but highly engaging content.
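
One way to shortlist such titles is a simple filter. A minimal sketch on a toy frame with made-up values; the 6-7 score band comes from the text above, while the 75th-percentile popularity cutoff is an illustrative assumption:

```python
import pandas as pd

# Toy data (hypothetical titles and values).
toy = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C", "Title D", "Title E"],
    "imdb_score": [6.4, 8.6, 6.9, 5.1, 7.5],
    "tmdb_popularity": [320.0, 12.0, 15.0, 8.0, 40.0],
})

# Popularity cutoff: the 75th percentile of the popularity column.
threshold = toy["tmdb_popularity"].quantile(0.75)

# Mid-rated (IMDb 6 to 7, inclusive) titles at or above the cutoff.
mid_rated_hits = toy[
    toy["imdb_score"].between(6, 7) & (toy["tmdb_popularity"] >= threshold)
]

print(mid_rated_hits)
```

The cutoff is a tunable choice; a stricter percentile would yield a shorter list of promotion candidates.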

Understand Viewer Preferences Through Distribution Analysis
----------
*  The IMDb Score Distribution histogram is roughly bell-shaped, with most movies clustering around a score of 5-7.
*  Clients should focus on curating a balanced mix of mainstream and niche content, targeting the largest audience segment (scores 5-7) while promoting standout high-rated movies.

Marketing Strategy Based on Trends
-------------
* Some movies have exceptional popularity despite moderate ratings due to external factors (franchise, actor, marketing).
*  Clients should invest in promotional campaigns, social media engagement, and strategic partnerships to enhance visibility for potential hit movies.




# **Conclusion**

The analysis of IMDb and TMDb ratings, popularity, and score distributions provides valuable insights for optimizing content strategy. Key takeaways include:

*  Strong correlation between IMDb and TMDb scores suggests reliable quality indicators for movie selection.
*  Popularity does not always align with high ratings, emphasizing the importance of marketing and audience engagement.
*  Most movies cluster around IMDb scores of 5-7, indicating the need to balance high-rated content with widely appealing mid-rated films.


By acting on these insights, the client can enhance content acquisition, improve recommendation algorithms, and implement targeted marketing strategies.

A data-driven approach will maximize user engagement, boost retention, and drive business growth.






