<a href="https://colab.research.google.com/github/SahilAgarwal03/Python_Projects/blob/main/Amazon_Prime_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Amazon Prime Video EDA






##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member  -** Sahil Agarwal


# **Project Summary -**

The dataset contains detailed information about Movies and TV Shows available on Amazon Prime. It includes 15 variables such as title, type, release year, runtime, genre, country of production, and audience ratings from IMDb and TMDb. The dataset allows analysis of content trends over time, audience preferences, and content quality. It helps identify the most popular and highly-rated titles across various genres and countries. Overall, it is well-suited for both descriptive and comparative entertainment analytics.

# **GitHub Link -**

https://github.com/SahilAgarwal03

# **Problem Statement**


The primary objective of this project is to perform Exploratory Data Analysis (EDA) on the Amazon Prime dataset to uncover meaningful insights about the platform's content library. The analysis will focus on identifying:

1. The type of content available (Movies vs TV Shows).

2. The top countries producing content on Amazon Prime.

3. The most common genres offered.

4. The distribution of content across different release years.

5. The top actors and directors featured in Amazon Prime titles.

This analysis aims to provide a comprehensive overview of the platform’s offerings and highlight key trends in its content catalog.

#### **Define Your Business Objective?**

Answer Here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import ast
import plotly.express as px
import seaborn as sns

### Dataset Loading

In [None]:
# Load Datasets
from google.colab import drive
drive.mount('/content/drive')
titles = pd.read_csv('/content/drive/My Drive/Amazon Prime Project/titles.csv')
credits = pd.read_csv('/content/drive/My Drive/Amazon Prime Project/credits.csv')

### Dataset First View

In [None]:
# Dataset First Look
display(titles.head())
display(credits.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns Count
print("Number of rows and columns in Titles dataset:", titles.shape)
print("Number of rows and columns in Credits:", credits.shape)

### Dataset Information

In [None]:
# Dataset Info
titles.info()
credits.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
display(titles.duplicated().sum())
display(credits.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
display(titles.isnull().sum())
display(credits.isnull().sum())

### What did you know about your dataset?

There are two CSV files, `titles.csv` and `credits.csv`. The columns of the titles file are:


id:	Unique identifier for each title (movie or show)

title:	Name of the movie or TV show

type:	Indicates whether the title is a MOVIE or a SHOW

description:	A short summary or plot of the content

release_year:	The year the movie or show was released

age_certification:	Content rating (e.g., PG, TV-14, R) based on age appropriateness

runtime:	Duration of the movie or an episode of a show in minutes

genres:	List of genres associated with the title (e.g., Drama, Comedy)

production_countries:	List of countries where the title was produced

seasons:	Number of seasons (only applicable for shows)

imdb_id:	IMDb identifier (used to fetch data from IMDb)

imdb_score:	Rating from IMDb (0–10) based on viewer ratings

imdb_votes:	Number of IMDb users who rated the title

tmdb_popularity:	Popularity score from TMDb (The Movie Database)

tmdb_score:	TMDb rating (usually 0–10) from users on The Movie Database


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
display(titles.columns)
display(credits.columns)

In [None]:
# Dataset Describe
display(titles.describe())
display(credits.describe())

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
display(titles.nunique())
display(credits.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code



In [None]:
# Write your code to make your dataset analysis ready.
# Merging both datasets on the shared title id (one row per title and credited person)
dataset = pd.merge(titles, credits, on='id')
# Replacing null values in text columns with 'Not Available'
text_columns = ['description', 'age_certification', 'imdb_id', 'character']
dataset[text_columns] = dataset[text_columns].fillna('Not Available')
# Numeric columns (seasons, imdb_score, imdb_votes, tmdb_popularity, tmdb_score)
# keep their NaN values so they stay numeric for averages and plots
# Null Values Count
display(dataset.isnull().sum())
# Dropping duplicate rows
dataset = dataset.drop_duplicates()
# Duplicate Values Count
display(dataset.duplicated().sum())
# Final view of the dataset
display(dataset.head())

### What all manipulations have you done and insights you found?

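The `titles` and `credits` tables were merged on `id`, missing values in columns such as `description` and `age_certification` were filled with 'Not Available', and duplicate rows were dropped. The merged dataset has one row per title and credited person, so a title appears once for every cast or crew member listed for it.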
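A text placeholder such as 'Not Available' suits text columns, but placing it in a numeric column changes that column's dtype. The sketch below uses made-up values (not taken from the dataset) to show the effect and how `pd.to_numeric(..., errors='coerce')`, which the chart cells call, turns the placeholder back into a missing value.

```python
import pandas as pd

# Toy stand-in for a numeric rating column with one missing value (hypothetical data).
scores = pd.Series([7.5, None, 6.0], name="imdb_score")

# Filling a numeric column with a text placeholder makes it an object column.
filled = scores.fillna("Not Available")
print(filled.dtype)

# errors="coerce" maps anything that cannot be parsed as a number back to NaN.
restored = pd.to_numeric(filled, errors="coerce")
print(restored.dtype)
print(restored.mean())  # NaN is skipped, so this is the mean of 7.5 and 6.0
```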

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Counting the number of Movies and TV Shows using .value_counts()
count = dataset['type'].value_counts()
# Plotting the pie chart
plt.pie(count, autopct='%1.1f%%', labels=count.index)
plt.title('TV Shows vs Movies')
plt.legend(count.index)
plt.show()

##### 1. Why did you pick the specific chart?

I picked a pie chart because it clearly shows the percentage split between Movies and TV Shows on Amazon Prime Video.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Movies make up the large majority of the records in the merged dataset, at 93.4%, while TV Shows account for only 6.6%.

This suggests that the platform's catalog leans much more heavily toward Movies than toward TV Shows.

It may also indicate that Movies are a key driver for the platform, since they make up most of the catalog, although the dataset contains no revenue figures to confirm this.
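Because the merged table holds one row per title and credited person, `value_counts()` on it counts credit rows rather than distinct titles. The following is a minimal sketch on invented rows (the ids and names are hypothetical) showing how the two shares can differ and how a title-level share could be computed by keeping one row per `id`.

```python
import pandas as pd

# Hypothetical merged table: one movie with three credits, one show with one credit.
merged = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "ts1"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"],
    "name": ["A", "B", "C", "D"],
})

# Share of credit rows per type (what value_counts on the merged table measures).
row_share = merged["type"].value_counts(normalize=True)

# Share of distinct titles per type: keep one row per title id first.
title_share = merged.drop_duplicates(subset="id")["type"].value_counts(normalize=True)

print(row_share["MOVIE"])    # 0.75
print(title_share["MOVIE"])  # 0.5
```

Which of the two views is more useful depends on the question: row counts weight titles by the size of their cast, while title counts describe the catalog itself.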

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart suggests that the platform's content strategy focuses far more on Movies than on TV Shows.

 One possible reason is that Movies are currently more in demand than TV Shows.

 This imbalance could lead to customer churn, because there is little **Content Diversity** as far as the type of content is concerned.

The content strategy may need to be adjusted toward a more balanced approach to content addition.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Top Performing Genres
# Using ast.literal_eval (from Python's ast module) to convert the strings to lists, explode() to split them into rows and value_counts() to count each genre (sorted from highest to lowest by default)
top_genres = dataset['genres'].apply(ast.literal_eval).explode().value_counts().head()
# Plotting the bar graph
top_genres.plot(kind='bar')
plt.title('Top Performing Genres')
plt.xlabel('Genres')
plt.ylabel('Count of Movies and TV Shows')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a bar chart because it is ideal for comparing categorical data, such as genres, by the count of Movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests that 'Drama' is the most common genre across Movies and TV Shows, with a count just below 70,000.

It is followed by 'Comedy' at over 40,000, 'Thriller' and 'Action' at over 30,000 each, and 'Romance' just below that.
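The genre counting relies on three steps: parsing the stringified lists, exploding them into one row per genre, and counting. A small sketch on made-up strings (not rows from the dataset) illustrates what each step produces.

```python
import ast
import pandas as pd

# Genres are stored as strings that look like Python lists (hypothetical sample).
genres = pd.Series(["['drama', 'comedy']", "['drama']", "[]"])

# ast.literal_eval safely parses each string into a real list.
parsed = genres.apply(ast.literal_eval)

# explode() gives one row per genre; an empty list becomes NaN,
# which value_counts() ignores by default.
counts = parsed.explode().value_counts()

print(counts.to_dict())  # {'drama': 2, 'comedy': 1}
```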



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart suggests that most of the content on Amazon Prime Video falls into these genres, so the platform can strategically optimize its recommendation systems accordingly.

This will support **Content Diversity** within a genre, because users will have multiple options of the same genre. This can reduce the chances of **Customer Churn**.

The platform can also design its **Marketing Strategy** accordingly, advertising the top and trending Movies and TV Shows from these genres.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Using ast.literal_eval to convert strings to lists, e.g. "['US', 'GB']" to ['US', 'GB'] (values that are already lists are left unchanged), then explode() to split them into rows, e.g. ['US', 'GB'] to 'US', 'GB'
countries = dataset['production_countries'].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v).explode()
# Using .value_counts() to count each country, e.g. ['US', 'GB', 'US'] to US: 2, GB: 1, and keeping the top 5
top_countries = countries.value_counts().head()
# Replacing countries not in top_countries with 'Others', e.g. 'GE' to 'Others'
countries = countries.apply(lambda x: x if x in top_countries.index else 'Others')
# Counting again with the 'Others' bucket, e.g. ['US', 'GB', 'US', 'Others'] to US: 2, GB: 1, Others: 1
countries_count = countries.value_counts()
# Plotting the pie chart
plt.pie(countries_count, labels=countries_count.index, autopct='%1.1f%%')
plt.title("Top 5 Content Production Countries")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I picked a pie chart because it clearly shows the percentage split between the production countries on Amazon Prime Video.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests that the US produces the most content, at around 53.6%, followed by GB with 8.5%, IN with 7.8%, CA with 4.5% and FR with 3.3%.

This means that 77.7% of the total content on Amazon Prime Video comes from these top 5 countries and only 22.3% is produced by the rest.

It also suggests that these 5 countries, and especially the US with more than half of the catalog, are likely to drive most of the viewing, although the dataset contains no revenue figures to confirm this.

Since the US, GB and CA together account for about two thirds of the content, most of the catalog is likely to be in English, so English-speaking users have a wide choice of content.
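The top 5 plus 'Others' split can be reproduced in a few lines. The sketch below is only illustrative: it uses an invented list of country codes and a top 2 instead of a top 5, and shows one way to bucket the remaining countries and express the result in percentages.

```python
import pandas as pd

# Hypothetical exploded country column: 5 x US, 3 x GB, 1 x IN, 1 x FR.
countries = pd.Series(["US"] * 5 + ["GB"] * 3 + ["IN", "FR"])

# Country codes of the top 2 producers (the notebook uses the top 5).
top = countries.value_counts().head(2).index

# Keep top countries as they are and replace every other value with 'Others'.
bucketed = countries.where(countries.isin(top), "Others")

# Percentage share per bucket, rounded to one decimal like the pie labels.
share = bucketed.value_counts(normalize=True).mul(100).round(1)
print(share.to_dict())
```

Summing the shares of the named countries and subtracting from 100 gives the 'Others' share, which is the same arithmetic as 100 - 77.7 = 22.3 in the text above.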



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart suggests that the platform focuses heavily on these countries, because they have released close to 80% of the total content on Amazon Prime Video.
This can limit Content Diversity for non-English-speaking audiences and for audiences who prefer regional content.

The platform should take a more balanced approach to the content it releases.
It should also optimize its recommendation systems: because there is a lot of content for English-speaking audiences, better recommendations would help users make full use of the catalog and could lead to new subscriptions.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Using ast.literal_eval to convert strings to lists, e.g. "['US', 'GB']" to ['US', 'GB'] (values that are already lists are left unchanged, so the cell can be re-run)
dataset['production_countries'] = dataset['production_countries'].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)

# Using explode() to split them into rows, e.g. ['US', 'GB'] to 'US', 'GB'
countries = dataset.explode('production_countries')

# Converting imdb_score to numeric; values that cannot be parsed become NaN
countries['imdb_score'] = pd.to_numeric(countries['imdb_score'], errors='coerce')

# Using .groupby() to group by country, .mean() for the average rating and .sort_values(ascending=False) to order from highest to lowest
ratings = countries.groupby('production_countries')['imdb_score'].mean().sort_values(ascending=False)

# Plotting the 10 highest rated countries (re-sorted ascending so the highest bar is drawn at the top)
ratings.head(10).sort_values().plot(kind='barh')
plt.title('Top 10 Highest IMDb Rated Countries')
plt.xlabel('Avg IMDb Rating')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I used a horizontal bar graph because it is well suited to comparing the average ratings of different countries.




##### 2. What is/are the insight(s) found from the chart?

This chart shows the countries with the highest average IMDb rating on the platform.

It also suggests that the top content-producing countries are not on the list. One possible reason is that they produce a large amount of content of mixed quality, which pulls their average rating down.
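An average based on very few titles can rank a country high or low by chance. One way to check this, sketched below on invented numbers, is to report the number of rated rows next to each mean and to rank only countries above a minimum count. The threshold `min_titles` is a hypothetical choice, not something taken from the notebook.

```python
import pandas as pd

# Hypothetical exploded data: 'US' has four rated rows, 'XX' has only one.
df = pd.DataFrame({
    "production_countries": ["US", "US", "US", "US", "XX"],
    "imdb_score": [5.0, 6.0, 7.0, 6.0, 9.0],
})

# Mean rating and number of rated rows per country.
stats = df.groupby("production_countries")["imdb_score"].agg(["mean", "count"])

# Keep only countries with enough rows for the mean to be meaningful.
min_titles = 3  # assumed threshold for illustration
reliable = stats[stats["count"] >= min_titles].sort_values("mean", ascending=False)

print(stats)
print(reliable)
```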


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This may mean that these countries do not produce a large amount of content, but what they produce is of high quality. To balance its content addition strategy, the platform can invest in these proven performers, which can increase content diversity and reduce customer churn.

The platform already has well-rated regional content from these countries, so it can optimize its recommendation systems accordingly.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Using explode() to split them into rows, e.g. ['US', 'GB'] to 'US', 'GB'
countries = dataset.explode('production_countries')

# Converting tmdb_score to numeric; values that cannot be parsed become NaN
countries['tmdb_score'] = pd.to_numeric(countries['tmdb_score'], errors='coerce')

# Using .groupby() to group by country, .mean() for the average rating and .sort_values(ascending=False) to order from highest to lowest
ratings = countries.groupby('production_countries')['tmdb_score'].mean().sort_values(ascending=False)

# Plotting the 10 highest rated countries (re-sorted ascending so the highest bar is drawn at the top)
ratings.head(10).sort_values().plot(kind='barh')
plt.title('Top 10 Highest TMDb Rated Countries')
plt.xlabel('Avg TMDb Rating')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I used a horizontal bar graph because it is well suited to comparing the average ratings of different countries.

##### 2. What is/are the insight(s) found from the chart?

This chart shows the countries with the highest average TMDb rating on the platform.

It also suggests that the top content-producing countries are not on the list. One possible reason is that they produce a large amount of content of mixed quality, which pulls their average rating down.

The countries with the highest average IMDb rating also differ from those with the highest average TMDb rating, which points to a difference in how the two sites collect ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This may mean that these countries do not produce a large amount of content, but what they produce is of high quality. To balance its content addition strategy, the platform can invest in these proven performers, which can increase content diversity and reduce customer churn.

The platform already has well-rated regional content from these countries, so it can optimize its recommendation systems accordingly.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Converting imdb_score to float value using pandas and avoiding errors
dataset['imdb_score'] = pd.to_numeric(dataset['imdb_score'], errors='coerce')

# Grouping by 'release_year' with .groupby() and finding the average 'imdb_score' for each year with .mean()
df = dataset.groupby('release_year')['imdb_score'].mean().reset_index()

# Plotting the line chart using plotly
fig = px.line(df, x='release_year', y='imdb_score', title='Average IMDb Rating by Release Years',labels={'release_year': 'Year', 'imdb_score': 'Avg IMDb Rating'})
fig.show()

##### 1. Why did you pick the specific chart?

I used a line chart because it is well suited to showing how a value changes over time.

##### 2. What is/are the insight(s) found from the chart?

This chart suggests that content released in 1916 has the highest average rating (7.5), meaning the platform's best-rated content comes from that specific year.

The average peaked again in 1926 at approximately 7.4, followed by a sharp dip in 1930, when it fell below 5.5.

These ups and downs continued until 1963, when the average reached 6.7. There was another dip in 1983, when the average fell to a near-record low of below 5.4, and the trend then stayed similar until 2022.
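Peaks in early years may rest on only a handful of rows. As a hedged illustration with invented values, the sketch below computes the yearly average together with the number of rated rows behind it, so a peak year can be read alongside its sample size.

```python
import pandas as pd

# Hypothetical data: a single old title and three recent ones.
df = pd.DataFrame({
    "release_year": [1916, 2020, 2020, 2020],
    "imdb_score": [7.5, 6.0, 7.0, 5.0],
})

# Average rating and number of rated rows for each release year.
by_year = df.groupby("release_year")["imdb_score"].agg(avg_score="mean", n_rows="count")

# Year with the highest average, and how many rows that average is based on.
peak_year = by_year["avg_score"].idxmax()
print(peak_year, by_year.loc[peak_year, "n_rows"])
```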

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart shows that there are specific release years for which the platform's content is rated best, such as:

1916 with 7.5

1926 with 7.4

1946 with 6.7

1963 with 6.69

1997 with 6.66

2022 with 6.57

Based on this, the platform can design its recommendation systems to showcase content from these specific years, because its highest-rated content comes from them. Doing so could attract new customers who want to watch content from these specific years.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Converting tmdb_score to numeric; values that cannot be parsed become NaN
dataset['tmdb_score'] = pd.to_numeric(dataset['tmdb_score'], errors='coerce')

# Group by year and calculate the average TMDb score
df = dataset.groupby('release_year')['tmdb_score'].mean().reset_index()

# Plotting the line chart using plotly
fig = px.line(df, x='release_year', y='tmdb_score', title='Average TMDb Rating by Release Years', labels={'release_year': 'Year', 'tmdb_score': 'Avg TMDb Rating'})
fig.show()

##### 1. Why did you pick the specific chart?

I used a line chart because it is well suited to showing how a value changes over time.

##### 2. What is/are the insight(s) found from the chart?

This chart suggests that content released in 1917 has the highest average rating (9), meaning the platform's best-rated content comes from that specific year.

The average peaked again in 1926 at approximately 7.5, followed by a sharp dip in 1930, when it fell below 5.23.

These ups and downs continued until 1981, when the average reached 6.6. There was another dip in 1983, when the average fell to a near-record low of below 5.3, and the trend then stayed similar until 2022.

Although the rating criteria of the two sites may differ, the trend is similar to that of the IMDb ratings.

This agreement between the two rating sources supports the findings.
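The similarity between the two yearly trends can also be expressed as a number. The sketch below, on made-up scores, averages both ratings per year and computes the Pearson correlation between the two yearly series. A value close to 1 would indicate that the two sources move together.

```python
import pandas as pd

# Hypothetical ratings from both sources for three release years.
df = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001, 2002, 2002],
    "imdb_score": [6.0, 7.0, 5.0, 5.5, 7.0, 8.0],
    "tmdb_score": [6.2, 6.8, 5.1, 5.9, 7.4, 7.6],
})

# Yearly average for each rating source.
yearly = df.groupby("release_year")[["imdb_score", "tmdb_score"]].mean()

# Pearson correlation between the two yearly average series.
corr = yearly["imdb_score"].corr(yearly["tmdb_score"])
print(round(corr, 3))
```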

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart shows that there are specific release years for which the platform's content is rated best, such as:

1917 with 9

1926 with 7.25

1950 with 6.25

1968 with 6.62

1981 with 6.5

2022 with 6.84

Based on this, the platform can design its recommendation systems to showcase content from these specific years, because its highest-rated content comes from them. Doing so could attract new customers who want to watch content from these specific years.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Creating a dataframe with 'title', 'imdb_score' and 'imdb_votes' (.copy() avoids modifying a view of dataset)
data = dataset[['title', 'imdb_score', 'imdb_votes']].copy()

# Converting imdb_votes to numeric; values that cannot be parsed become NaN
data['imdb_votes'] = pd.to_numeric(data['imdb_votes'], errors='coerce')

# Finding the mean number of votes across 'imdb_votes'
mean_votes = data['imdb_votes'].mean()

# Keeping only the rows whose 'imdb_votes' are equal to or greater than the mean
votes = data[data['imdb_votes'] >= mean_votes]

# Finding the top 10 highest rated Movies or TV Shows among them, keeping one row per title
top_rated = votes.sort_values(by=['imdb_score', 'imdb_votes'], ascending=[False, False]).drop_duplicates(subset='title').head(10)

# Plotting using plotly
fig = px.bar(top_rated, x='imdb_score', y='title', text='imdb_votes', title='Top IMDb-Rated Movies or TV Shows With At Least the Average Number of Votes')
fig.show()

##### 1. Why did you pick the specific chart?

I used a horizontal bar chart because it is well suited to comparing the ratings and vote counts of different Movies or TV Shows on the rating scale.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the top IMDb-rated Movies or TV Shows among the titles that received at least the average number of votes.

The displayed titles are therefore highly rated and their vote counts are equal to or higher than the average, which reduces the chance that a rating is skewed by a small number of voters.

The chart also suggests that most of the highest-rated titles are TV Shows, meaning that the platform has few TV Shows but the ones it has are among its best content.
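The filtering idea can be followed step by step on a toy table. In this sketch (the titles, scores and vote counts are invented), title 'B' has the highest score but very few votes, so it is dropped before ranking, and the repeated rows for 'A' collapse into one.

```python
import pandas as pd

# Hypothetical rows; 'A' appears twice because the merged table repeats titles per credit.
data = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "imdb_score": [9.0, 9.0, 9.5, 8.0],
    "imdb_votes": [1500, 1500, 10, 2000],
})

# Keep only rows with at least the average number of votes.
mean_votes = data["imdb_votes"].mean()
votes = data[data["imdb_votes"] >= mean_votes]

# Rank by score, then votes, and keep one row per title.
top_rated = (
    votes.sort_values(by=["imdb_score", "imdb_votes"], ascending=[False, False])
    .drop_duplicates(subset="title")
)

print(top_rated["title"].tolist())  # ['A', 'C']
```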


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

With this insight the platform can tune its recommendation systems and marketing campaigns accordingly: it has few TV Shows, but the ones it has are among its best content. This can improve customer satisfaction and user engagement and reduce customer churn.

The platform should also try to add more Movies that are rated as highly as its TV Shows, because this would create a better balance for customers.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Creating a dataframe with 'release_year', 'runtime' and 'type'
runtime = dataset[['release_year', 'runtime', 'type']]

# Using only the titles released from 2000 onwards
runtime = runtime[runtime['release_year'] >= 2000]

# Using matplotlib to set the size of the chart
plt.figure(figsize=(12, 5))

# Using seaborn to plot the chart
sns.boxplot(x='release_year', y='runtime', hue='type', data=runtime)
plt.title('Runtime Distribution of Movies and TV Shows by Year (Since 2000)')
plt.xlabel('Release Year')
plt.ylabel('Runtime')
plt.xticks(rotation=90) # Rotating the x-axis labels vertically to avoid overlapping
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I used a grouped box plot because it makes it easier to analyse the median runtime of Movies and TV Shows over the years, along with the outliers in each year.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how the runtime of Movies and TV Shows has changed over the years.

For Movies, the distribution looks stable over the years, including the outliers. The median stays between roughly 100 and 120 minutes, which suggests that the length of Movies has remained similar over the years.

For TV Shows, the median runtime appears to increase over the years, which suggests that TV Show episodes have become longer over time.
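The medians read off the box plot can also be tabulated directly. The following is a minimal sketch on invented runtimes that computes the median per release year and type, giving one column per type so the two trends can be compared side by side.

```python
import pandas as pd

# Hypothetical runtimes in minutes for two release years.
runtime = pd.DataFrame({
    "release_year": [2000, 2000, 2000, 2020, 2020, 2020],
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "SHOW"],
    "runtime": [100, 110, 30, 105, 40, 50],
})

# Median runtime per year and type; unstack() turns the types into columns.
medians = runtime.groupby(["release_year", "type"])["runtime"].median().unstack("type")
print(medians)
```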

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Considering that Movies dominate the platform and that their runtime has stayed about the same since 2000, more investment could be made in this type of content, because it is a proven performer.

There are far fewer TV Shows than Movies on the platform, and the median runtime of TV Shows has increased. Together with the high ratings of some TV Shows, this suggests that TV Shows are well received on the platform, so it should consider investing more strongly in them.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Creating a dataframe with 'production_countries' and 'imdb_score'
data = dataset[['production_countries', 'imdb_score']]

# Using explode() to split them into rows, e.g. ['US', 'GB'] to 'US', 'GB'
data = data.explode('production_countries')

# Using .value_counts() to count each country, e.g. ['US', 'GB', 'US'] to US: 2, GB: 1
country_counts = data['production_countries'].value_counts()

# Getting the top 10 countries by number of TV Shows and Movies
top_countries = country_counts.head(10)

# Grouping by country and finding the mean imdb_score using .groupby() and .mean()
avg_ratings = data.groupby('production_countries')['imdb_score'].mean().loc[top_countries.index] # Using .loc[top_countries.index] to keep the mean value only for the top countries

# Plotting the chart using matplotlib
avg_ratings.plot(kind='bar')
plt.title('Top 10 Countries and Their Average IMDb Rating')
plt.xlabel('Country')
plt.ylabel('Average IMDb Rating')
plt.ylim(0,10) # For showing the full rating scale from 0 to 10
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I used a vertical bar chart to display the average ratings of the top content-producing countries.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the top content-producing countries and their average ratings.

These are not the highest-rated countries, because they produce the largest amount of content, but comparing them gives a view of their content quality.

Their average ratings are fairly even. Considering that the US produces more than 50% of the content, maintaining an average rating just below 6 is notable, and GB has an average rating just above 6. This suggests that a large amount of well-rated content is available for English-speaking audiences.

Combined, these countries produce around 80 to 85% of the total content on the platform while maintaining an average rating of about 6, which suggests that the overall content quality of the platform is good.
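A combined share such as the 80 to 85% mentioned above can be checked by dividing the counts of the top countries by the total. The sketch below uses an invented list of country codes and a hypothetical `top_n` of 2 to show the calculation.

```python
import pandas as pd

# Hypothetical exploded country column: 6 x US, 2 x GB, 1 x IN, 1 x FR.
countries = pd.Series(["US"] * 6 + ["GB"] * 2 + ["IN", "FR"])

# Rows per country, sorted from most to least frequent.
counts = countries.value_counts()

# Combined percentage share of the top_n countries.
top_n = 2  # assumed value for illustration; the notebook looks at the top 10
top_share = counts.head(top_n).sum() / counts.sum() * 100
print(round(top_share, 1))
```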

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Almost 70% of the content is available for English-speaking audiences. The platform should make full use of this in its future marketing strategy to increase user engagement and new subscriptions.

As mentioned before, about 85% of the content is produced by these countries, which is good, but the platform needs to make its content more diverse by investing in more regional content and creating a better balance in content addition. The potential countries are mentioned in the previous charts.


#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Creating a dataframe with 'production_countries' and 'tmdb_score'
data = dataset[['production_countries', 'tmdb_score']]

# Using explode() to split them into rows, e.g. ['US', 'GB'] to 'US', 'GB'
data = data.explode('production_countries')

# Using .value_counts() to count each country, e.g. ['US', 'GB', 'US'] to US: 2, GB: 1
country_counts = data['production_countries'].value_counts()

# Getting the top 10 countries by number of TV Shows and Movies
top_countries = country_counts.head(10)

# Grouping by country and finding the mean tmdb_score using .groupby() and .mean()
avg_ratings = data.groupby('production_countries')['tmdb_score'].mean().loc[top_countries.index] # Using .loc[top_countries.index] to keep the mean value only for the top countries

# Plotting the chart using matplotlib
avg_ratings.plot(kind='bar')
plt.title('Top 10 Countries and Their Average TMDb Rating')
plt.xlabel('Country')
plt.ylabel('Average TMDb Rating')
plt.ylim(0,10) # For showing the full rating scale from 0 to 10
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I used a vertical bar chart to display the average ratings of the top content-producing countries.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the top content-producing countries and their average ratings.

These are not the highest-rated countries, because they produce the largest amount of content, but comparing them gives a view of their content quality.

Their average ratings are fairly even. Considering that the US produces more than 50% of the content, maintaining an average rating just below 6 is notable, and GB has an average rating just above 6. This suggests that a large amount of well-rated content is available for English-speaking audiences.

Combined, these countries produce around 80 to 85% of the total content on the platform while maintaining an average rating of about 6, which suggests that the overall content quality of the platform is good.

This chart also suggests that there is not much difference between the IMDb and TMDb ratings for these countries, although China seems to lead in TMDb ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Almost 70% of the content is available for English-speaking audiences. The platform should make full use of this in its future marketing strategy to increase user engagement and new subscriptions.

As mentioned before, about 85% of the content is produced by these countries, which is good, but the platform needs to make its content more diverse by investing in more regional content and creating a better balance in content addition. The potential countries are mentioned in the previous charts.

#### Chart - 12 - Pair Plot

In [None]:
# Pair Plot visualization code
# Creating a dataframe with 'type', 'imdb_score' and 'tmdb_score'
data = dataset[['type', 'imdb_score', 'tmdb_score']]

# Creating the pair plot using seaborn, coloured by 'type'
pair = sns.pairplot(data, hue='type')

# Title for the pair plot (y=1.02 places it just above the panels)
pair.figure.suptitle("Pair Plot of IMDb Score and TMDb Score by Type", y=1.02)

# Showing the graph
plt.show()


##### 1. Why did you pick the specific chart?

I used a pair plot because it can analyse more than two columns at the same time.

##### 2. What is/are the insight(s) found from the chart?

Movies are rated across a wider range than TV Shows, mainly because there are far more of them, but TV Shows tend to have higher ratings than Movies.

Some titles are rated as high as 10/10 among both TV Shows and Movies, which means that the platform has some very high-quality content available.
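The visual comparison between Movies and TV Shows can be backed by a small summary table. This sketch, on invented scores, reports the mean of both ratings and the number of rows per type, so a difference in averages can be read together with how many rows each type contributes.

```python
import pandas as pd

# Hypothetical scores for three movies and two shows.
data = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 7.0],
    "tmdb_score": [5.5, 6.5, 6.0, 8.0, 7.5],
})

# Mean and row count of each rating column, per content type.
summary = data.groupby("type")[["imdb_score", "tmdb_score"]].agg(["mean", "count"])
print(summary)
```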

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

The catalog is skewed away from TV Shows in terms of quantity, but there are some high-quality shows on the platform that need to be marketed strategically. The platform should invest more in TV Shows, since their runtime has increased over the years, and try to create a balance between TV Shows and Movies.

As for countries, 10 countries together contribute over 85% of the total content. This can reduce content diversity in the coming years as the market modernizes rapidly, which can lead to customer churn and give competitors an opening. The platform should invest in new countries that show potential, meaning those with highly rated content. These countries are mentioned in the charts.

The platform also has its best-rated Movies and TV Shows concentrated in specific release years, as shown in the charts. Knowing this, it can design its recommendation systems accordingly: if a user watches a Movie or TV Show from a high-performing year, the platform can suggest other titles from that year so that the customer stays engaged for a longer period of time.

# **Conclusion**

The analysis shows that movie runtimes have remained relatively consistent over the years, while TV show runtimes vary more and their median has generally increased, reflecting changing audience preferences and streaming trends.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***