<a href="https://colab.research.google.com/github/Lokendra-cloud/Projects/blob/main/EDA_Amazon_Prime_Prediction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **Amazon Prime Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**


**Objective:**

The goal of Amazon Prime prediction is to identify users who are most likely to subscribe to (or renew) Amazon Prime services. This can be expanded to predict the likelihood of existing Prime users canceling their subscriptions.

**Business Context:**

Amazon Prime is a major revenue stream for Amazon, providing not only subscription income but also boosting customer retention and purchase frequency. Predicting user behavior around Prime subscriptions allows Amazon to:

**1. Optimize Marketing Efforts:** Focus campaigns on users with high conversion potential.

**2. Reduce Churn:** Identify users at risk of canceling and offer targeted incentives.

**3. Personalize Offers:** Provide customized deals or trial offers to undecided users.

**4. Inventory & Logistics Planning:** Anticipate demand for Prime-related perks like free shipping and same-day delivery.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import os
import seaborn as sns
from collections import Counter
import ast
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Set the working directory so the dataset files can be read by name.
os.chdir('/content/drive/MyDrive/M2_Project_Amazon_Prime')
print(os.getcwd())    # Confirm the current working directory
!ls                   # List the files available in the folder

In [None]:
# Reading Dataset-1
read_credits = pd.read_csv('credits.csv')

# Reading Dataset-2
read_titles = pd.read_csv('titles.csv')
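
The guidelines above ask for exception handling. A minimal sketch of a loader that fails early with a readable message could look like this; `load_csv` and its `required_columns` parameter are hypothetical helpers, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv(path, required_columns=()):
    """Read a CSV file and fail early with a readable message.

    Hypothetical helper: raises FileNotFoundError if the file is absent and
    ValueError if any of the expected columns is missing.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path.resolve()}")
    df = pd.read_csv(path)
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"{path.name} is missing expected columns: {missing}")
    return df
```

In this notebook it could stand in for the bare `pd.read_csv('titles.csv')` call, for example `load_csv('titles.csv', required_columns=['title', 'genres'])`.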

### Dataset First View

In [None]:
# Dataset-1 (credits) first look
read_credits.head()

In [None]:
# Dataset-2 (titles) first look
read_titles.head()


### Dataset Rows & Columns count

In [None]:
# Dataset-1 Rows & Columns
read_credits.shape

In [None]:
# Dataset-2 Rows & Columns
read_titles.shape

### Dataset Information

In [None]:
# Dataset-1 Info
read_credits.info()

In [None]:
# Dataset-2 Info
read_titles.info()

#### Duplicate Values

In [None]:
# Dataset-1(Credits) Duplicate Value Count
len(read_credits[read_credits.duplicated()])

In [None]:
# Dataset-2(titles) Duplicate Value Count
len(read_titles[read_titles.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count(Dataset-1(Credits))
print(read_credits.isnull().sum())

In [None]:
# Missing Values/Null Values Count(Dataset-2(titles))
print(read_titles.isnull().sum())

In [None]:
# Visualizing the missing values(Dataset-1(Credits))
# Checking Null Value by plotting Heatmap
sns.heatmap(read_credits.isnull(), cbar=False)

In [None]:
# Visualizing the missing values(Dataset-2(titles))
# Checking Null Value by plotting Heatmap
sns.heatmap(read_titles.isnull(), cbar=False)
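
A heatmap shows where values are missing but not how much of each column is affected. The sketch below builds a small summary table of missing counts and percentages; `missing_summary` is a hypothetical helper, and the toy frame uses made-up values rather than the real dataset.

```python
import pandas as pd


def missing_summary(df):
    """Return missing counts and percentages per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Toy frame standing in for read_titles (values are invented for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.1, None, 6.4, None],
    "age_certification": [None, None, None, "PG"],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(read_titles)` would give the same table for the real data.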

### What did you know about your dataset?

The dataset comes from the entertainment and media industry. We analyze Content Diversity, Regional Availability, Trends Over Time, and IMDb Ratings & Popularity.

Amazon Prime prediction is an analytical study of how likely a customer is to abandon a product or service. The goal is to understand that risk and act on it before the customer gives up the product or service.

Dataset-1 (credits) has 124235 rows and 5 columns; dataset-2 (titles) has 9871 rows and 15 columns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset-1 Columns(Credits)
read_credits.columns

In [None]:
# Dataset-2 Columns(titles)
read_titles.columns

In [None]:
# Dataset-1 Describe(Credits)
read_credits.describe(include='all')

In [None]:
# Dataset-2 Describe(titles)
read_titles.describe(include='all')

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable(Credits).
for col in read_credits.columns:
    print(f"No. of unique values in {col} is {read_credits[col].nunique()}.")

In [None]:
# Check Unique Values for each variable(titles).
for col in read_titles.columns:
    print(f"No. of unique values in {col} is {read_titles[col].nunique()}.")

## 3. ***Data Wrangling***

### Data Wrangling Code

### Content Diversity Analysis

In [None]:
# Work on copies so the original DataFrames stay untouched.
read_credits_copy = read_credits.copy()
read_titles_copy = read_titles.copy()

# Confirm the shapes of the copies (a cell displays only its last bare expression, so print both)
print(read_credits_copy.shape)
print(read_titles_copy.shape)

In [None]:
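# 1. Genre Distribution
# 'genres' is stored as a stringified list, so parse it before flattening;
# iterating over the raw strings would count single characters instead of genres.
parsed_genres = read_titles_copy['genres'].apply(ast.literal_eval)
all_genres = [genre for sublist in parsed_genres for genre in sublist]
genre_counts = Counter(all_genres)
top_genres = pd.DataFrame(genre_counts.items(), columns=['Genre', 'Count']).sort_values(by='Count', ascending=False)
print(top_genres)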

In [None]:
# 2. Type Analysis (Movie vs. Show)
type_distribution = read_titles_copy['type'].value_counts().reset_index()
type_distribution.columns = ['Type', 'Count']

In [None]:
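# 3. Multi-genre Analysis
# Parse each stringified list before counting; len() on the raw string would return its character count.
read_titles_copy['genre_count'] = read_titles_copy['genres'].apply(lambda g: len(ast.literal_eval(g)))
multi_genre_distribution = read_titles_copy['genre_count'].value_counts().reset_index()
multi_genre_distribution.columns = ['Number of Genres', 'Count']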

### Regional Availability Analysis

1. **Count Titles Available per Production Country :**

    We'll parse the production_countries column (which contains lists in string format) and count the number of titles per country.

In [None]:
# Start from fresh copies so 'production_countries' is still an unparsed string column.
read_credits_copy = read_credits.copy()
read_titles_copy = read_titles.copy()

# Confirm the shapes of the copies
print(read_credits_copy.shape)
print(read_titles_copy.shape)

In [None]:
# Convert 'production_countries' column from string representation of list to actual list
read_titles_copy['production_countries'] = read_titles_copy['production_countries'].apply(ast.literal_eval)

In [None]:
# Explode the 'production_countries' column so each country gets its own row
countries_exploded = read_titles_copy.explode('production_countries')

In [None]:
# Count titles per country
titles_per_country = countries_exploded['production_countries'].value_counts().reset_index()
titles_per_country.columns = ['country', 'title_count']

In [None]:
# Keep the top 15 countries by number of titles (used later for plotting)
country_counts = countries_exploded['production_countries'].value_counts().head(15)

In [None]:
# Display top 10 countries by number of titles
print(titles_per_country.head(10))
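
The cells above re-copy the DataFrame before each `ast.literal_eval` call because parsing an already-parsed column (or a missing value) raises an error. A more forgiving parser is sketched below; `parse_list_cell` is a hypothetical helper, and the toy values are invented for illustration.

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn a stringified list such as "['US', 'GB']" into a real list.

    Already-parsed lists are returned unchanged; missing or malformed
    values become an empty list instead of raising.
    """
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []  # NaN, None, or any other non-string value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else [parsed]


# Toy column standing in for read_titles['production_countries']
toy = pd.DataFrame({"production_countries": ["['US', 'GB']", "[]", None, "['IN']"]})
toy["production_countries"] = toy["production_countries"].apply(parse_list_cell)
# Applying it a second time is harmless because lists pass through unchanged
toy["production_countries"] = toy["production_countries"].apply(parse_list_cell)

# Empty lists explode to NaN, which value_counts() ignores by default
counts = toy.explode("production_countries")["production_countries"].value_counts()
print(counts)
```

With a parser like this, the repeated "fresh copy" cells would no longer be needed just to avoid double parsing.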

2. **Analyze Genre Distribution by Country :**

    We'll explode the genres and countries, and then count the occurrences of each genre per country.

In [None]:
# Start from fresh copies so 'production_countries' and 'genres' are still unparsed string columns.
read_credits_copy = read_credits.copy()
read_titles_copy = read_titles.copy()

# Confirm the shapes of the copies
print(read_credits_copy.shape)
print(read_titles_copy.shape)

In [None]:
# Convert string representations of lists to actual Python lists
read_titles_copy['production_countries'] = read_titles_copy['production_countries'].apply(ast.literal_eval)
read_titles_copy['genres'] = read_titles_copy['genres'].apply(ast.literal_eval)

In [None]:
# Explode both 'production_countries' and 'genres' columns
exploded_read_titles = read_titles_copy.explode('production_countries').explode('genres')

In [None]:
# Group by country and genre, then count occurrences
genre_distribution = exploded_read_titles.groupby(['production_countries', 'genres']).size().reset_index(name='count')

In [None]:
# Display the top 10 country-genre combinations
print(genre_distribution.sort_values(by='count', ascending=False).head(10))
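
One thing to keep in mind when reading these counts: after exploding two list columns, a title contributes one row per country-genre pair, so the numbers are pair counts rather than title counts. The toy example below (invented titles and values) illustrates this.

```python
import pandas as pd

# Two made-up titles: T1 has 2 countries and 3 genres, T2 has 1 of each
toy = pd.DataFrame({
    "title": ["T1", "T2"],
    "production_countries": [["US", "GB"], ["US"]],
    "genres": [["drama", "crime", "thriller"], ["drama"]],
})

# Same double explode as in the notebook
exploded = toy.explode("production_countries").explode("genres")

pair_counts = (
    exploded.groupby(["production_countries", "genres"])
    .size()
    .reset_index(name="count")
)

print(len(toy), "titles ->", len(exploded), "country-genre rows")
print(pair_counts)
```

Here two titles produce seven rows (2 x 3 for T1 plus 1 x 1 for T2), so summing the `count` column overstates the number of titles.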

3. **Highlight Countries Producing the Most High-Rated Content**

    We’ll define "high-rated" as having an IMDb score ≥ 7.5, and then count such titles per country.

In [None]:
# Start from fresh copies so 'production_countries' is still an unparsed string column.
read_credits_copy = read_credits.copy()
read_titles_copy = read_titles.copy()

# Confirm the shapes of the copies
print(read_credits_copy.shape)
print(read_titles_copy.shape)

In [None]:
# Convert 'production_countries' from string to list
read_titles_copy['production_countries'] = read_titles_copy['production_countries'].apply(ast.literal_eval)

In [None]:
# Filter titles with IMDb score >= 7.5
high_rated_df = read_titles_copy[read_titles_copy['imdb_score'] >= 7.5]

In [None]:
# Explode 'production_countries' so each country is in a separate row
high_rated_exploded = high_rated_df.explode('production_countries')

In [None]:
# Count high-rated titles per country
high_rated_counts = high_rated_exploded['production_countries'].value_counts().reset_index()
high_rated_counts.columns = ['country', 'high_rated_title_count']

In [None]:
# Display top 10 countries by high-rated title count
print(high_rated_counts.head(10))

### Trends Over Time Analysis

**1. Number of Releases per Year:**

There's a clear growth trend in the number of releases, peaking around recent years, with some fluctuations.

In [None]:
# Start from fresh copies so 'genres' is still an unparsed string column.
read_credits_copy = read_credits.copy()
read_titles_copy = read_titles.copy()

# Confirm the shapes of the copies
print(read_credits_copy.shape)
print(read_titles_copy.shape)

In [None]:
# Count number of titles released per year
releases_per_year = read_titles_copy['release_year'].value_counts().sort_index()
print(releases_per_year.head())

**2. Genre Evolution:**

Genres like Drama, Comedy, and Documentary show consistent prominence, with visible shifts in their frequency over time.

In [None]:
# Convert 'genres' from string to list
read_titles_copy['genres'] = read_titles_copy['genres'].apply(ast.literal_eval)

In [None]:
# Explode the genres column so each genre is in a separate row
genre_year_df = read_titles_copy.explode('genres')

In [None]:
# Group by year and genre, count the number of titles
genre_trend = genre_year_df.groupby(['release_year', 'genres']).size().unstack(fill_value=0)

In [None]:
# Select top 5 most frequent genres for clarity
top_genres = genre_trend.sum().sort_values(ascending=False).head(5).index
filtered_genre_trend = genre_trend[top_genres]
print(filtered_genre_trend.head(5))

**3. IMDb Score Trend:**
  
The average IMDb score has remained relatively stable, with slight variations, indicating consistent quality perception over time.

In [None]:
# Group by release year and calculate average IMDb score
imdb_score_trend = read_titles_copy.groupby('release_year')['imdb_score'].mean()
print(imdb_score_trend.head())
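
A yearly mean can be driven by a handful of titles in sparse years. The sketch below reports the number of rated titles next to each mean so that thin years can be spotted or filtered; the toy values and the `MIN_RATED_TITLES` threshold are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy data standing in for read_titles_copy (values are invented)
toy = pd.DataFrame({
    "release_year": [1950, 2020, 2020, 2020, 2021, 2021],
    "imdb_score": [9.0, 6.0, 7.0, 8.0, None, 6.5],
})

# Mean score plus the number of titles that actually have a score
yearly = toy.groupby("release_year")["imdb_score"].agg(
    mean_score="mean",
    rated_titles="count",  # count() ignores missing scores
)

MIN_RATED_TITLES = 3  # hypothetical cut-off for a "reliable" yearly mean
reliable = yearly[yearly["rated_titles"] >= MIN_RATED_TITLES]

print(yearly)
print(reliable)
```

In the toy data, 1950 has a high mean based on a single title, which is the kind of point that can distort a trend line.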

### IMDb Ratings & Popularity Analysis

**1. Top-Rated Titles by IMDb Score**

In [None]:
# Select relevant columns and drop rows with missing IMDb scores
top_imdb_df = read_titles_copy[['title', 'imdb_score', 'imdb_votes', 'release_year']].dropna(subset=['imdb_score'])

In [None]:
# Filter to ensure a reasonable number of votes (e.g., avoid titles with very few votes skewing the rating)
top_imdb_df = top_imdb_df[top_imdb_df['imdb_votes'] > 100]

In [None]:
# Sort by IMDb score and then by number of votes
top_imdb_df = top_imdb_df.sort_values(by=['imdb_score', 'imdb_votes'], ascending=[False, False])

In [None]:
# Select top 10 titles
top_10_imdb = top_imdb_df.head(10)
print(top_10_imdb)

**2. Top-Rated Titles by TMDb Score**

In [None]:
# Select relevant columns and drop rows with missing TMDb scores
top_tmdb_df = read_titles_copy[['title', 'tmdb_score', 'tmdb_popularity', 'release_year']].dropna(subset=['tmdb_score'])

In [None]:
# Sort by TMDb score and then by popularity
top_tmdb_df = top_tmdb_df.sort_values(by=['tmdb_score', 'tmdb_popularity'], ascending=[False, False])

In [None]:
# Select top 10 titles
top_10_tmdb = top_tmdb_df.head(10)
print(top_10_tmdb)
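
The IMDb and TMDb rankings follow the same steps: drop missing scores, optionally filter on a minimum count, sort by score with a tie-breaker, and take the top rows. A sketch of one reusable function is shown here; `top_n_by_score` and its parameters are hypothetical, and the toy frame uses invented values.

```python
import pandas as pd


def top_n_by_score(df, score_col, weight_col, n=10, min_weight=None):
    """Rank titles by score_col, breaking ties with weight_col.

    If min_weight is given, only rows with weight_col > min_weight are kept
    (e.g. a minimum number of votes).
    """
    ranked = df[["title", score_col, weight_col, "release_year"]].dropna(
        subset=[score_col]
    )
    if min_weight is not None:
        ranked = ranked[ranked[weight_col] > min_weight]
    return ranked.sort_values(
        by=[score_col, weight_col], ascending=[False, False]
    ).head(n)


# Toy frame standing in for read_titles_copy
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "imdb_score": [9.5, 8.9, 8.9, None, 7.0],
    "imdb_votes": [12, 5000, 90000, 300, 150],
    "release_year": [2001, 2010, 2015, 2018, 2020],
})

top_2 = top_n_by_score(toy, "imdb_score", "imdb_votes", n=2, min_weight=100)
print(top_2)
```

On the real data, `top_n_by_score(read_titles_copy, 'imdb_score', 'imdb_votes', min_weight=100)` and `top_n_by_score(read_titles_copy, 'tmdb_score', 'tmdb_popularity')` would mirror the two rankings above.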

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### Content Diversity Analysis

In [None]:
# Start from fresh copies so 'genres' is still an unparsed string column.
read_credits_copy = read_credits.copy()
read_titles_copy = read_titles.copy()

# Confirm the shapes of the copies
print(read_credits_copy.shape)
print(read_titles_copy.shape)

**1. Genre Distribution**

In [None]:
read_titles_copy['genres'] = read_titles_copy['genres'].apply(ast.literal_eval)
genre_df = read_titles_copy.explode('genres')
genre_counts = genre_df['genres'].value_counts()

In [None]:
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='muted')
plt.title('Overall Genre Distribution')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()

**2. Type Analysis (Movie vs. Show)**

In [None]:
# Fill missing types if any
read_titles_copy['type'] = read_titles_copy['type'].fillna('Unknown')

# Count types
type_counts = read_titles_copy['type'].value_counts()

# Display counts
print("Content Type Counts:\n", type_counts)

In [None]:
# Plotting
plt.figure(figsize=(6, 6))
type_counts.plot.pie(autopct='%1.1f%%', startangle=90, colors=sns.color_palette('pastel'))
plt.title('Content Type Distribution (Movie vs. Show)')
plt.ylabel('')
plt.tight_layout()
plt.show()

**3. Multi-genre Analysis**

In [None]:
# Start from fresh copies so 'genres' is still an unparsed string column.
read_credits_copy = read_credits.copy()
read_titles_copy = read_titles.copy()

In [None]:
# Convert 'genres' column to list
read_titles_copy['genres'] = read_titles_copy['genres'].apply(ast.literal_eval)

# Count number of genres per title
read_titles_copy['num_genres'] = read_titles_copy['genres'].apply(len)

In [None]:
# Plotting
plt.figure(figsize=(8, 4))
sns.histplot(read_titles_copy['num_genres'], bins=range(1, read_titles_copy['num_genres'].max() + 2), discrete=True)
plt.title('Number of Genres per Title')
plt.xlabel('Genres per Title')
plt.ylabel('Number of Titles')
plt.tight_layout()
plt.show()

### Regional Availability Analysis

1. **Count Titles Available per Production Country :**

    We'll parse the production_countries column (which contains lists in string format) and count the number of titles per country.

In [None]:
# Plotting
plt.figure(figsize=(12, 6))
sns.barplot(x=country_counts.values, y=country_counts.index, palette='coolwarm')
plt.title('Number of Titles per Production Country (Top 15)')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

2. **Analyze Genre Distribution by Country :**

    We'll explode the genres and countries, and then count the occurrences of each genre per country.

In [None]:
# Group by country and genre, then count
genre_country_counts = exploded_read_titles.groupby(['production_countries', 'genres']).size().unstack(fill_value=0)

In [None]:
# Optional: limit to top countries and genres for clearer plot
top_countries = genre_country_counts.sum(axis=1).sort_values(ascending=False).head(10).index
top_genres = genre_country_counts.sum().sort_values(ascending=False).head(10).index
filtered_data = genre_country_counts.loc[top_countries, top_genres]

In [None]:
# Plotting heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(filtered_data, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Genre Distribution by Country (Top 10 Countries & Genres)')
plt.xlabel('Genre')
plt.ylabel('Production Country')
plt.tight_layout()
plt.show()

3. **Highlight Countries Producing the Most High-Rated Content**

    "High-rated" means an IMDb score ≥ 7.5, as defined in the wrangling step. For countries with at least 20 such titles, we plot the average IMDb score of those titles.

In [None]:
# high_rated_exploded already contains only titles with IMDb score >= 7.5;
# dropping missing scores here is only a safeguard.
country_scores_df = high_rated_exploded.dropna(subset=['imdb_score'])

In [None]:
# Group by country and calculate average IMDb score and count of titles
country_rating_stats = country_scores_df.groupby('production_countries').agg(
    average_imdb_score=('imdb_score', 'mean'),
    title_count=('title', 'count'))

In [None]:
# Optional: keep countries with at least 20 high-rated titles to avoid small-sample bias
filtered_stats = country_rating_stats[country_rating_stats['title_count'] >= 20]

In [None]:
# Sort by average IMDb score
top_countries = filtered_stats.sort_values(by='average_imdb_score', ascending=False).head(10)

In [None]:
# Plotting
plt.figure(figsize=(12, 6))
sns.barplot(x=top_countries['average_imdb_score'], y=top_countries.index, palette='crest')
plt.title('Average IMDb Score of High-Rated Titles by Country (IMDb >= 7.5, Min 20 Titles)')
plt.xlabel('Average IMDb Score')
plt.ylabel('Production Country')
plt.tight_layout()
plt.show()

### Trends Over Time Analysis

**1. Number of Releases per Year:**

There's a clear growth trend in the number of releases, peaking around recent years, with some fluctuations.

In [None]:
# Plotting
sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.lineplot(x=releases_per_year.index, y=releases_per_year.values)
plt.title('Number of Releases per Year')
plt.xlabel('Year')
plt.ylabel('Number of Titles')
plt.tight_layout()
plt.show()

**2. Genre Evolution:**

Genres like Drama, Comedy, and Documentary show consistent prominence, with visible shifts in their frequency over time.

In [None]:
# Plotting
# DataFrame.plot() opens its own figure, so pass the size directly instead of calling plt.figure() first.
filtered_genre_trend.plot(figsize=(14, 7))
plt.title('Genre Evolution Over Time (Top 5 Genres)')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.legend(title='Genre')
plt.grid(True)
plt.tight_layout()
plt.show()

**3. IMDb Score Trend:**
  
The average IMDb score has remained relatively stable, with slight variations, indicating consistent quality perception over time.

In [None]:
# Plotting
sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.lineplot(x=imdb_score_trend.index, y=imdb_score_trend.values)
plt.title('Average IMDb Score Over Time')
plt.xlabel('Release Year')
plt.ylabel('Average IMDb Score')
plt.tight_layout()
plt.show()

### IMDb Ratings & Popularity Analysis

**1. Top-Rated Titles by IMDb Score**

In [None]:
# Plot:
plt.figure(figsize=(12, 6))
sns.barplot(data=top_10_imdb, x='imdb_score', y='title', palette='viridis')
plt.title('Top 10 Titles by IMDb Score (Min 100 Votes)')
plt.xlabel('IMDb Score')
plt.ylabel('Title')
plt.tight_layout()
plt.show()

**2. Top-Rated Titles by TMDb Score**

In [None]:
# Plot:
plt.figure(figsize=(12, 6))
sns.barplot(data=top_10_tmdb, x='tmdb_score', y='title', palette='magma')
plt.title('Top 10 Titles by TMDb Score (Ties Broken by Popularity)')
plt.xlabel('TMDb Score')
plt.ylabel('Title')
plt.tight_layout()
plt.show()

# **Conclusion**

1. Amazon Prime's Content Strategy:

    * Amazon Prime has a global reach but shows strong dominance in
      English-speaking countries like the US, UK, and Canada.

    * Indian content plays a major role, suggesting Amazon is targeting
      emerging markets with localized offerings.

    * Prime’s catalog leans heavily on movies, though its TV series
      library has grown, especially in recent years.

2. High-Rated Content Trends:

    * Countries such as South Korea, UK, and France consistently deliver high IMDb-rated titles on Prime.

    * Drama, Thriller, and Crime are the most consistently high-rated genres.

    * Ratings correlate loosely with popularity: titles with more votes and broader genre appeal tend to score higher.

3. Temporal Trends & Genre Shifts:

    * Content production has accelerated over the past decade, especially post-2015, indicating Amazon's growing investment in original programming.

    * Genre diversity is increasing. While Drama remains the most common, Documentaries, Animation, and Science Fiction are rising in frequency.

    * There's no strong upward or downward trend in average IMDb scores, but recent years show greater volume and variability.

4. Diversity & Reach:

    * Amazon Prime’s catalog is becoming more multi-genre, with an increasing number of titles crossing traditional genre boundaries (e.g., Drama + Sci-Fi, Romance + Thriller).

    * However, most titles come from a single production country, with fewer co-productions compared to global competitors like Netflix.

5. Recommendations for Strategy:

    * Content Acquisition: Amazon could boost content from underrepresented high-rating countries (e.g., Nordic nations, Argentina).

    * Localized Genres: Invest more in regional hits and culturally specific genres, especially in Asia and Latin America.

    * Genre Innovation: Encourage cross-genre experimentation and nurture niche genres with growing interest, like true crime and docuseries.

### ***Hurrah! You have successfully completed your Capstone Project !!!***