# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual (Someshwar M)


# **Project Summary -**

In today’s digital world, online streaming platforms play a major role in entertainment. With thousands of movies and TV shows available, it becomes difficult for platforms to understand what type of content performs well and what audiences actually prefer.

This project analyses movies and TV shows data from Amazon's streaming platform to understand content performance, audience interest, and overall trends, so that the resulting insights can support content planning and business decisions. The dataset contains information on titles, genres, release year, ratings, popularity, runtime, age certification, production countries, content type, and more.

Data cleaning and wrangling are performed first so that the data used for visualization is clean, which in turn gives more accurate results for the business objective. After preparing the data, different types of visualizations are used to explore it. Univariate analysis helps in understanding individual variables like ratings, genres, and content types. Bivariate analysis is used to study relationships between two variables, such as popularity and ratings or content type and score. Multivariate analysis further helps in understanding how multiple factors interact with each other. Charts like bar plots, box plots, scatter plots, line charts, and heatmaps are used for better interpretation.

Overall, this project shows how data analysis can support decision making in the entertainment industry. Through structured analysis and visual exploration, it lays a foundation for deeper insights from the cleaned data, which in turn can support business growth.


# **GitHub Link -**

https://github.com/Someshwar24/Exploratory-Data-Analysis.git

# **Problem Statement**


Online streaming platforms like Amazon host many movies and TV shows, but it is hard to know which content performs well. Without proper analysis, the company may invest in the wrong content or miss the trends that audiences lean towards over the years. There is therefore a need to analyse the data to understand audience preferences and content performance.
This analysis will help the business make better decisions.

#### **Define Your Business Objective?**

The main objective of this project is to analyse movies and TV shows data to understand content performance. The goal is to identify which genres and content types are liked more by the audience.
This analysis helps the company learn what kind of content earns better ratings and engagement, which countries most of the production comes from, and where production could be increased. It also helps in planning future content investment and acquisition.
Using these insights, the business can improve user satisfaction and growth.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# Load Dataset
data_movies = pd.read_csv("titles.csv")
data_movies

In [None]:
data_title = pd.read_csv("credits.csv")
data_title
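
Since the guidelines ask for exception handling, the loading step could be made more defensive. The following is only a sketch: `load_csv` is a hypothetical helper that is not part of the original notebook, and the demo file is made up so that the example runs without `titles.csv` or `credits.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Hypothetical helper: read a CSV, failing early with a clear message."""
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"Dataset not found: {file}")
    return pd.read_csv(file)


# Usage with a throwaway file so the sketch runs without the real datasets
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("id,title\nt1,Example\n")
    demo = load_csv(demo_path)

print(demo.shape)
```

In the notebook itself, the same helper could wrap the two `pd.read_csv` calls so that a missing file stops the run with a readable error.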

In [None]:
data = pd.merge(data_movies,data_title,on="id",how="left")
data
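
The left merge keeps every title and repeats it once per matching credit row. As a rough sketch of how that relationship could be checked, the example below uses two small made-up frames (`titles` and `credits` are stand-ins, not the real files) together with the `validate` and `indicator` options of `pd.merge`. It assumes that `id` is unique in the titles table, which has not been verified for the real data.

```python
import pandas as pd

# Hypothetical miniature stand-ins for titles.csv and credits.csv
titles = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
credits = pd.DataFrame({"id": ["t1", "t1", "t2"], "name": ["X", "Y", "Z"]})

# validate="one_to_many" raises a MergeError if `id` repeats in `titles`
merged = pd.merge(titles, credits, on="id", how="left",
                  validate="one_to_many", indicator=True)

# Titles without any credit row are flagged as "left_only"
no_credits = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(merged.shape)
print(no_credits)
```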

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

### What did you know about your dataset?

The dataset contains details of the movies and TV shows on the Amazon platform, including title, genre, release year, IMDb rating, production countries, and popularity.

Additionally, cast and crew details such as names and roles are provided in a separate dataset.

Merging the two datasets on the `id` column gives 125354 rows and 19 columns. The row count is high because of the one-to-many relationship between a title and its cast members.
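
Because each title is repeated once per cast or crew member, counts taken directly on the merged frame count credit rows rather than titles. The sketch below, on made-up data, shows how the two views can differ. Keeping one row per `id` is a suggested variation, not something the notebook currently does.

```python
import pandas as pd

# Made-up merged sample: t1 has three credits, t2 one, t3 two
merged = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3", "t3"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE", "MOVIE"],
    "name": ["X", "Y", "Z", "X", "Q", "R"],
})

# Row-level counts: every credit row adds one to its title's type
row_counts = merged["type"].value_counts().to_dict()

# Title-level counts: keep a single row per title before counting
titles_only = merged.drop_duplicates(subset="id")
title_counts = titles_only["type"].value_counts().to_dict()

print(row_counts)
print(title_counts)
```

A title-level frame like `titles_only` is the safer input for charts that are meant to count movies or shows rather than credits.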

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description

**id**: Unique title identifier

**title**: Movie or show name

**type** : Content type (Movie/Show)

**description** : Content summary

**release_year** : Year of release

**age_certification** : Age rating

**runtime** : Duration in minutes

**genres** : Content genres

**production_countries** : Production countries

**seasons** : Number of seasons

**imdb_id** : IMDb unique ID

**imdb_score** : IMDb rating score

**imdb_votes** : Number of IMDb votes

**tmdb_popularity** : TMDB popularity score

**tmdb_score** : TMDB rating score

**person_id** : Cast/Crew identifier

**name** : Cast or crew name

**character** : Character played

**role** : Actor or director

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in data.columns:
  print(f'The no of unique values in {i} is',data[i].nunique())

In [None]:
data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Creating a copy of the dataset
df = data.copy()
df

In [None]:
#Converting the seasons from Float to Int values
df['seasons'] = df['seasons'].astype('Int64')
df.dtypes

In [None]:
# Identify missing values
df.isnull().sum()
# Treat 0 as a placeholder for missing data and convert it to NaN
df.replace(0, np.nan, inplace=True)
df.isnull().sum()

In [None]:
#Calculate and drop duplicate values
df.duplicated().sum()
df.drop_duplicates(subset=['id','name'],inplace=True)

In [None]:
# Add a new column based on the IMDb score
def rating_category(score):
  # Keep missing scores as missing instead of labelling them "Low"
  if pd.isna(score):
    return np.nan
  if score>=8:
    return "High"
  elif score>=6:
    return "Medium"
  else:
    return "Low"

df["rating_category"] = df['imdb_score'].apply(rating_category)
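
As an alternative sketch, the same thresholds (8 and 6) can be expressed with `pd.cut`, which leaves missing scores as missing rather than assigning them a category. The sample scores below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up scores, including one missing value
scores = pd.Series([8.4, 6.0, 5.9, 8.0, np.nan], name="imdb_score")

# right=False makes the bins left-closed:
# [0, 6) -> Low, [6, 8) -> Medium, [8, inf) -> High
category = pd.cut(scores, bins=[0, 6, 8, np.inf],
                  labels=["Low", "Medium", "High"], right=False)
print(category.tolist())
```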

### What all manipulations have you done and insights you found?

**Wrangling Insights:-**

The dataset was first copied so that the original data is preserved while the cleaning steps are performed. The seasons column was converted from float to integer to correctly reflect the number of seasons, since seasons are always whole numbers.

Missing values were identified, and zero values were treated as missing and replaced with NaN so that placeholder zeros do not distort the analysis and visualizations. Duplicate entries based on title ID and cast name were removed; more than 168 duplicate rows were dropped.

Finally, a new rating category column was created to group titles into high, medium, and low ratings based on IMDb scores, making it easier for users to pick movies/shows based on their preferences.
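
The missing-value step can be illustrated on a small made-up frame. Whether a zero really means "missing" depends on the column, so treating every zero as NaN is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Made-up sample with one explicit NaN and two zero placeholders
sample = pd.DataFrame({
    "runtime": [95, 0, 120],
    "imdb_votes": [1500.0, np.nan, 0.0],
})

# Count explicit NaNs and zero placeholders separately before changing anything
nan_counts = sample.isna().sum()
zero_counts = (sample == 0).sum()

# Convert zeros to NaN so they no longer pull down means or medians
cleaned = sample.replace(0, np.nan)
print(cleaned.isna().sum())
print(cleaned["runtime"].mean())
```

Comparing `nan_counts` and `zero_counts` per column before the replacement helps decide which columns the conversion should actually apply to.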

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
# Having the columns handy to see what charts can be created for insights
data.columns

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Histogram of the number of movies/shows released over the years
sns.histplot(data=df,x='release_year',color='blue',bins=20,kde=True)
plt.xlabel("Release Year")
plt.ylabel("Number of movies/shows")
plt.title("Movies/Shows Released over the Years")
plt.show()

##### 1. Why did you pick the specific chart?

**HISTOGRAM:**

The histogram was chosen to show how releases are distributed across ranges of years rather than individual years, since the data spans more than ten decades. It helps in understanding trends over time and identifying periods with more or fewer releases.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

*   The number of movies released shows a strong upward trend over time, with a sharp increase after 2000.
*   Releases in the early decades were relatively low and increased steadily over the years, peaking in recent years.
*   The sharp rise suggests that people are increasingly interested in digital platforms.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help create a positive business impact by showing that movie releases have increased a lot in recent years, indicating higher audience demand and growth opportunities.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Hist plot used to see the overall IMDB rating distribution
sns.histplot(data=df,x='imdb_score',color='red',bins=20,kde=True)
plt.xlim(2,10)
plt.xlabel("imdb_score")
plt.ylabel("Rating count")
plt.title("IMDB Rating")
plt.grid()
plt.show()

##### 1. Why did you pick the specific chart?

**HISTOGRAM:**

The histogram was chosen to show the distribution of IMDb scores. It helps in understanding how scores are spread across all the movies/shows in the dataset.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

*   Most movies/shows have average ratings.
*   Very low and very high ratings are much less frequent.
*   The most common score is around 6, with a count of more than 20000.
*   IMDb ratings roughly follow a normal distribution.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It shows that audiences mostly give average ratings and only rarely rate a movie very highly.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Count plot to calculate the total no.of movies and shows
sns.countplot(data=df,x="type",hue="type")
plt.xlabel("Content")
plt.ylabel("Count of Movies/Shows")
plt.xticks(rotation=45)
plt.title("Total movies/shows")
plt.show()

##### 1. Why did you pick the specific chart?

**COUNTPLOT:**

A count plot is used to compare the total number of movies and TV shows available on the platform.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

*   The catalogue contains far more movies than TV shows, which may reflect a stronger audience preference for movies.
*   Movies dominate the content library.
*   About 5 times more movies than shows have been released over the decades.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive impact:**

Expanding TV shows can increase user engagement, watch time, and long-term subscriptions. The platform should focus more on TV shows rather than movies.

**Negative impact:**

Based on the above data, movies dominate the content library and the catalogue is movie-centric, which may not fully serve binge-watching audiences and, in turn, may not draw them to monthly subscriptions.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Count plot of the number of movies/shows per age certification
sns.countplot(data=df,x="age_certification",color="green")
plt.xlabel("Age Certification")
plt.ylabel("Count of Movies/Shows")
plt.xticks(rotation=45)
plt.title("Movies/Shows by Age Certification")
plt.show()

##### 1. Why did you pick the specific chart?

Count plot was used because age certification is a categorical variable and it helps compare the number of movies and TV shows across different age groups.

##### 2. What is/are the insight(s) found from the chart?

*   R-rated content dominates the platform, indicating a strong focus on adult-oriented movies and TV shows.
*   PG-13 content forms the second-largest category, showing a balanced offering for teenage and young adult audiences.
*   Children and family-oriented certifications (TV-G, TV-Y, G) have significantly fewer titles, suggesting limited focus on younger audiences.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is a potential opportunity to expand family-friendly content to reach a wider audience, since such titles are far fewer than those aimed at teenagers and adults.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Count plot of the number of movies/shows in each rating category
sns.countplot(data=df,x="rating_category",palette="viridis",hue="rating_category")
plt.xlabel("Rating Category")
plt.ylabel("Total movies/shows")
plt.xticks(rotation=45)
plt.title("Rating category")
plt.show()

##### 1. Why did you pick the specific chart?

A count plot was chosen because rating category is a categorical variable and the chart clearly compares the number of movies and shows across quality levels. It was used to analyze the distribution of content across rating categories.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

*   Low-rated content dominates the dataset, indicating a large volume of below-average movies/shows.
*   Medium-rated titles form a significant portion, showing that a large share of content falls within acceptable quality levels.
*   High-rated content is limited, suggesting that very few movies and shows are exceptionally good and well received by the audience.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The platform appears to favour quantity over elite quality, offering a wide range of content. The business should try to produce more highly rated movies/shows.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Barchart to calculate the top 10 genres
top_genres = df["genres"].value_counts().head(10)
sns.barplot(x=top_genres.values,y=top_genres.index)
plt.xlabel("Total movies/shows")
plt.ylabel("Genres")
plt.title("Top 10 Genre Count")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen to show the top 10 genres and the number of movies/shows in each.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

*   Drama is the most dominant genre, ahead of all other genres in content count with 10000 movies/shows.
*   Comedy is the second most common genre, indicating a strong focus on entertainment-driven content.
*   Mixed-genre combinations (e.g., Drama–Romance, Comedy–Drama) appear frequently, showing the platform's preference for multi-genre content that attracts audiences.
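
Counting the raw `genres` values treats each combination as its own category. To count individual genres instead, the combinations can be split first. The sketch below assumes that genres are stored as list-like strings such as `"['drama', 'romance']"`. That format is an assumption and should be checked against the real column before using this approach.

```python
import ast

import pandas as pd

# Assumption: genres are stored as list-like strings (made-up sample)
sample = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "genres": ["['drama']", "['drama', 'romance']", "['comedy', 'drama']"],
})

# Parse each string into a Python list, then give every genre its own row
sample["genres"] = sample["genres"].apply(ast.literal_eval)
genre_counts = sample.explode("genres")["genres"].value_counts()
print(genre_counts)
```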


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help in creating a positive business impact by showing which genres are most common on the platform.
Since Drama is the most frequent genre, followed by Comedy, focusing more on Drama/Comedy is a sensible way to attract audiences.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Boxplot comparing content type with IMDB scores
plt.figure(figsize=(6,4))
sns.boxplot(data=df,x="type",y="imdb_score",color="green")
plt.title("IMDb Score by Content Type")
plt.xlabel("Content Type")
plt.ylabel("IMDb Score")
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is used to compare a numerical variable across the levels of a categorical variable, and it also makes outliers easy to spot.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

*  TV shows have a higher median IMDb score compared to movies.
*  Movies show more variation in ratings, with many low-rated outliers.
*  TV shows generally maintain more consistent quality than movies.
*  Most movie scores fall roughly between 5 and 7, while most show scores fall between 6.5 and 8.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since TV shows receive higher and more consistent ratings, focusing more on show-based content can improve user satisfaction.
The production strategy can shift towards investing in more shows, while movies can be made with stronger content to earn higher IMDb scores.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Scatterplot comparing IMDB votes with IMDB scores
plt.figure(figsize=(6,4))
sns.scatterplot(data=df,x="imdb_votes",y="imdb_score",color="blue")
plt.title("IMDB Votes vs IMDB Score")
plt.xlabel("IMDB Votes")
plt.ylabel("IMDB Score")
plt.ticklabel_format(style='plain', axis='x')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is used to find the relationship between two different variables, so here it was used to find whether the popularity (votes) influences content quality (ratings).

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS**:

*   Titles with very low votes show a wide range of ratings, indicating unreliable early feedback.
*   As the number of votes increases, IMDb scores become more stable and consistent
*  Highly voted titles generally maintain above-average ratings, showing audience trust.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since highly voted content shows more reliable ratings, decision-making can focus on popular titles.
So content promotion, recommendation accuracy, and user trust can be improved.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Barplot comparing average IMDB scores with top 10 genres (using the cleaned copy df)
avg_genres = df.groupby("genres")["imdb_score"].mean().sort_values(ascending=False).head(10)
sns.barplot(y=avg_genres.index,x=avg_genres.values)
plt.xlabel("Average IMDb Score")
plt.ylabel("Genre")
plt.title("Top 10 Genres by Average IMDb Score")
plt.show()

##### 1. Why did you pick the specific chart?

Barchart is used to compare the average IMDb scores across genres, helping evaluate content quality with ratings. A bar chart is suitable for comparing average values across categories.



##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- Genre combinations involving Drama consistently appear among the highest average IMDb scores.

- Mixed-genre content generally performs better than single-genre content in terms of ratings.

- The average IMDb scores across the top genres are relatively close, indicating consistent quality.
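
A ranking by average score can be dominated by genre combinations that appear on only a handful of titles. One possible refinement, sketched on made-up data, is to compute the mean and the count together and rank only groups above a minimum size. The `min_titles` threshold is hypothetical and would need tuning for the real dataset.

```python
import pandas as pd

# Made-up sample: genre "b" has a single, very highly rated title
sample = pd.DataFrame({
    "genres": ["a", "a", "a", "b", "c", "c"],
    "imdb_score": [7.0, 7.2, 6.8, 9.5, 6.0, 6.4],
})

# Mean score and number of rows per genre in one pass
stats = sample.groupby("genres")["imdb_score"].agg(["mean", "count"])

# Hypothetical threshold: ignore genres with fewer rows than this
min_titles = 2
ranked = stats[stats["count"] >= min_titles].sort_values("mean", ascending=False)
print(ranked)
```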

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Since certain genre combinations achieve higher average ratings, focusing on similar content themes, rather than on a single genre, can improve quality.

Negative Impact - Limiting movies/shows to a single genre could hold ratings back.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Lineplot to calculate the movies released per year
movies_per_year = df[df["type"] == "MOVIE"].groupby("release_year").size().reset_index(name='count')
sns.lineplot(data=movies_per_year,x="release_year",y="count")
plt.title("Movies Released per Year")
plt.xlabel("Release Year")
plt.ylabel("Number of Movies")
plt.show()

##### 1. Why did you pick the specific chart?

A line chart was used because release year is a time-based variable, and a line plot is best suited to showing how the number of movies changes over time.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- The number of movies released per year shows a steady increase over time, especially after the year 2000.

- There is a sharp rise in movie production in recent years, indicating rapid content expansion.

- The drop at the end likely reflects limited data availability for the most recent years rather than an actual decline.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since movie production has increased significantly over the years, the platform can plan to focus more on movies.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Boxplot comparing runtime and rating category
plt.figure(figsize=(7,5))
sns.boxplot(data=df,x="rating_category",y="runtime",color="yellow")
plt.title("Runtime by Rating Category")
plt.xlabel("Rating Category")
plt.ylabel("Runtime (minutes)")
plt.show()

##### 1. Why did you pick the specific chart?

Runtime is a numeric variable and rating category is categorical.
A box plot is suitable to compare runtime distribution, median, and outliers across categories.
Barplot can also be used for this analysis.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- Medium and Low-rated content have similar median runtimes, mostly around 90–110 minutes.

- High-rated content shows a wider runtime spread with no outliers, indicating flexibility in content length.

- Low-rated content has many extreme outliers, suggesting inconsistent runtime.
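
By default, seaborn and matplotlib draw a point as an outlier in a box plot when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below reproduces that rule on made-up runtimes so the outliers can be listed rather than only seen.

```python
import pandas as pd

# Made-up runtimes in minutes, with one extreme value
runtime = pd.Series([85, 90, 95, 100, 105, 110, 300])

# Quartiles and interquartile range
q1, q3 = runtime.quantile([0.25, 0.75])
iqr = q3 - q1

# Whisker limits used by the default box plot rule (1.5 * IQR)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = runtime[(runtime < lower) | (runtime > upper)]
print(outliers.tolist())
```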

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since higher-rated content allows flexible runtimes but averages between 80-120 minutes, content duration planning can be optimized. So production efficiency and viewer experience can be improved.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Barplot comparing Average imdb score and Age certification
avg_age = df.groupby("age_certification")["imdb_score"].mean().reset_index()
sns.barplot(data=avg_age,x="age_certification",y="imdb_score")
plt.title("Average IMDb Score by Age Certification")
plt.xlabel("Age Certification")
plt.ylabel("Average IMDb Score")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is suitable to compare average ratings across categories.
It helps understand how content ratings vary by target audience age group.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- TV-MA and TV-PG content have the highest average IMDb scores, indicating better audience reception for mature and teen-focused content.

- NC-17 content has the lowest average rating, suggesting that only a limited audience prefers it.

- Family-friendly certifications (G, PG, PG-13) show moderate and consistent ratings, indicating stable but not exceptional performance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since higher age-certified content tends to receive better ratings, focusing on such categories can improve content performance.
So content selection and audience targeting strategies can be improved.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Scatter plot of TMDB popularity vs IMDB score
plt.figure(figsize=(7,5))
sns.scatterplot(data=df,x="tmdb_popularity",y="imdb_score",alpha=0.6)
plt.title("TMDB Popularity vs IMDb Score")
plt.xlabel("TMDB Popularity")
plt.ylabel("IMDb Score")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is suitable to analyze the relationship and pattern between popularity and ratings. It helps understand whether high popularity translates into better audience ratings.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- Most titles have low TMDB popularity, but their IMDb scores vary widely, indicating mixed audience reception.

- Highly popular titles generally fall within mid to high IMDb score ranges, though popularity alone does not guarantee a high score.

- The relationship between popularity and IMDb score appears weak to moderate.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since popularity alone does not ensure high ratings, content evaluation should consider both metrics together.
So content promotion and recommendation strategies can be improved.

#### Chart - 14

In [None]:
# Chart - 14 visualization code
# Scatter plot of IMDB votes vs IMDB score against type
plt.figure(figsize=(7,5))
sns.scatterplot(data=df,x="imdb_votes",y="imdb_score",hue="type",alpha=0.6)
plt.title("IMDB votes vs IMDb Score")
plt.xlabel("IMDB Votes")
plt.ylabel("IMDb Score")
plt.ticklabel_format(style='plain', axis='x')
plt.legend(title="Type of Content")
plt.show()

##### 1. Why did you pick the specific chart?

IMDb votes and IMDb score are numeric variables, while content type is categorical so hue helps compare rating patterns between movies and TV shows.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- TV shows generally have higher IMDb scores compared to movies, even with fewer votes.

- Movies tend to receive significantly higher vote counts, indicating wider reach.

- As vote counts increase, IMDb scores become more stable for both content types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since TV shows maintain higher ratings and movies attract more votes, investment in both high-quality shows and high-reach movies can be improved.

#### Chart - 15

In [None]:
# Chart - 15 visualization code
# Boxplot of Genre vs IMDB score against content type
plt.figure(figsize=(10,5))
top_genres = df["genres"].value_counts().head(5).index  # use the cleaned copy df
filtered = df[df["genres"].isin(top_genres)]
sns.boxplot(data=filtered,x="genres",y="imdb_score",hue="type")
plt.title("IMDB Score by Genre and Content Type")
plt.xlabel("Genre")
plt.ylabel("IMDB Score")
plt.legend(title="Content Type")
plt.show()

##### 1. Why did you pick the specific chart?

Genre and content type are categorical variables, while IMDb score is numeric.
A box plot with hue helps compare rating distribution between movies and shows within each genre.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- TV shows consistently have higher median IMDb scores than movies across most top genres.

- Documentary and Drama genres show the highest overall ratings for both movies and shows.

- Horror content has lower and more varied ratings, especially for movies, indicating mixed audience response.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since shows perform better across popular genres, focusing more on show-based content can improve overall ratings.

#### Chart - 16

In [None]:
# Chart - 16 visualization code
# Lineplot of Release years vs IMDB score against content type
plt.figure(figsize=(8,6))
sns.lineplot(data=df,x="release_year",y="imdb_score",hue="type")
plt.title("Average IMDb Score Over Years by Content Type")
plt.xlabel("Release Year")
plt.ylabel("Average IMDb Score")
plt.legend(title="Type of Content")
plt.show()

##### 1. Why did you pick the specific chart?

Release year is a time-based variable, and IMDb score is numeric. A line plot with hue helps compare how ratings change over time for movies vs TV shows.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- TV shows consistently maintain higher average IMDb scores than movies across most years.

- Movie ratings remain relatively stable around mid-range values showing consistent but average ratings.

- TV show ratings show more variation over time, possibly due to changes in production quality or content formats. The sharp decline around 1970 may be due to data inconsistency.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since TV shows perform better in terms of ratings over time, focusing more on high-quality shows can improve platform perception.

#### Chart - 17

In [None]:
# Chart - 17 visualization code
# Lineplot of the number of movies and shows released per year
movies_per_year = df[df["type"] == "MOVIE"].groupby("release_year").size().reset_index(name='count')
shows_per_year = df[df["type"] == "SHOW"].groupby("release_year").size().reset_index(name='count')
sns.lineplot(data=movies_per_year,x="release_year",y="count",label="Movies")
sns.lineplot(data=shows_per_year, x="release_year", y="count", label="Shows")
plt.title("Movies and Shows Released per Year")
plt.xlabel("Release Year")
plt.ylabel("Number of Movies/Shows")
plt.show()

##### 1. Why did you pick the specific chart?

Release year is a time-based variable, and the number of titles is a count. A line plot is suitable to compare trends over time.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- Movie releases are consistently higher than TV shows across most years.

- Both movies and TV shows show a strong increase after 2000, indicating rapid platform expansion.

- TV show releases grow steadily in recent years, though still at a lower volume than movies.

- The sharp decline in both content types after 2020 may be due to data inconsistency.
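
The per-year counts for both content types can also be built in a single step. The following sketch uses made-up data with one row per title and assumes the same `MOVIE`/`SHOW` labels that the notebook filters on.

```python
import pandas as pd

# Made-up title-level sample: one row per title
sample = pd.DataFrame({
    "release_year": [2019, 2019, 2019, 2020, 2020, 2021],
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE"],
})

# One row per year, one column per content type; missing pairs become 0
per_year = sample.groupby(["release_year", "type"]).size().unstack(fill_value=0)
print(per_year)
```

Calling `per_year.plot()` on a frame like this draws one line per content type, and years with no shows appear as zero instead of being left out.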

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since both movies and TV shows have increased significantly over time, content demand appears to be growing, so long-term content and production planning can be improved.

#### Chart - 18

In [None]:
# Chart - 18 visualization code
# Barplot for Total movies/shows vs The production country
top_countries = df["production_countries"].value_counts().head(10)  # use the cleaned copy df
plt.figure(figsize=(8,5))
sns.barplot(x=top_countries.values,y=top_countries.index)
plt.xlabel("Total Movies/Shows")
plt.ylabel("Production Country")
plt.title("Top 10 Production Countries by Content Count")
plt.show()

##### 1. Why did you pick the specific chart?

Production country is a categorical variable, and total movies/shows is a count measure. A bar chart is suitable to compare content volume across different countries.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- The United States dominates content production by a very large margin compared to other countries.

- India and the UK are the next major contributors, but their content volume is far lower: even combined, the two countries do not reach 50% of the US total.

- Content from other countries such as Canada, Japan, and France forms a much smaller portion, indicating less focus on production from those countries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since most content is produced in the US, the platform has a strong base in a major content market, and it can diversify by increasing content from other countries.

#### Chart - 19 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(8,6))
corr_matrix = df[["imdb_score", "imdb_votes", "tmdb_score", "tmdb_popularity", "runtime"]].corr()  # use the cleaned copy df
sns.heatmap(corr_matrix,annot=True,cmap="coolwarm",linewidths=0.5)
plt.title("Correlation Heatmap of Numeric Variables")
plt.show()

##### 1. Why did you pick the specific chart?

All selected variables (imdb_score, imdb_votes, tmdb_score, tmdb_popularity, runtime) are numeric. A correlation heatmap is ideal to understand the strength and direction of relationships between multiple variables at once.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- IMDB score and TMDB score show a strong positive correlation, indicating consistency between the two rating platforms.

- IMDB votes have a weak to moderate correlation with ratings, suggesting popularity influences perception but does not fully determine quality.

- Runtime and TMDB popularity show very low correlation with IMDb score, indicating content length and trendiness do not strongly impact ratings.
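
`DataFrame.corr` uses Pearson correlation by default, which measures linear association and can understate a relationship when one variable, such as vote counts, is heavily skewed. As an optional cross-check, the sketch below compares Pearson with Spearman rank correlation on made-up numbers. It is not a result from the actual dataset.

```python
import pandas as pd

# Made-up sample: scores rise steadily while votes grow by orders of magnitude
sample = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "imdb_votes": [10, 100, 1_000, 10_000, 1_000_000],
})

# Pearson captures linear association; Spearman captures monotonic association
pearson = sample.corr(method="pearson").loc["imdb_score", "imdb_votes"]
spearman = sample.corr(method="spearman").loc["imdb_score", "imdb_votes"]
print(round(pearson, 2), round(spearman, 2))
```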

#### Chart - 20 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df[["imdb_score", "imdb_votes", "runtime"]])  # use the cleaned copy df
plt.suptitle("Pair Plot of Ratings, Popularity, and Runtime",y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

All selected variables are numeric. A pair plot allows simultaneous visualization of distributions and relationships. It helps detect correlation, spread, and outliers for different columns in one view.

##### 2. What is/are the insight(s) found from the chart?

**INSIGHTS:**

- IMDb score shows weak to moderate relationship with IMDb votes, indicating popularity does not directly define quality.

- Runtime has no strong correlation with IMDb score or votes, suggesting duration does not impact audience rating significantly.

- Most content falls around mid-range scores and runtimes, with few extreme outliers.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**OBJECTIVE:**

Based on the analysis of content ratings, genres, popularity, release trends, and production countries, the client should focus more on high-performing TV shows and top-rated genres such as Drama, Comedy, and Documentary, since they consistently receive better audience ratings. Additionally, the client should use popularity and voting trends to promote quality content while gradually expanding production beyond dominant regions like the US to reach a wider global audience.

Audiences may be more likely to watch movies because of their shorter runtime, but although runtime plays a crucial role, TV shows are consistently rated higher than movies. The dominance of R and PG-13 certifications also suggests that the core audience is adults and young adults, so more content can be produced with that group in mind.

# **Conclusion**

This project analyzed movies and TV shows to understand ratings, popularity, and related trends. The results show that the number of movies released has grown sharply over the years, but TV shows stand out with higher ratings, and certain genres perform better in terms of IMDb scores. Popularity and votes help measure audience interest but do not always reflect quality. The analysis also highlights growth in content over time and regional dominance, especially by the US, which contributes far more content than any other country in the dataset. Overall, these insights help in making better content and business decisions.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***