<a href="https://colab.research.google.com/github/Faisal-Ghub/EDA/blob/main/Project_of_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Exploratory Data Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The data covers Amazon Prime TV shows and movies. It has 125354 rows and 19 columns: person_id, id, name, character, role, title, type, description, release_year, age_certification, runtime, genres, production_countries, seasons, imdb_id, imdb_score, imdb_votes, tmdb_popularity and tmdb_score.
There are 8418 unique movies and 1355 unique shows in total. Only 5-10% of the content scores well on IMDB and TMDB, which suggests there are hidden patterns behind the success of such content. By analysing the data we aim to find insights that guide production houses and businesses to save time and produce quality content.

# **GitHub Link -**

https://github.com/Faisal-Ghub

# **Problem Statement**


In today’s hypercompetitive digital era, where content is produced at unprecedented scale, a few key factors such as genres, runtime and actors separate high-scoring, popular content from low-scoring, unpopular content. Through rigorous analysis of critical metrics, including IMDb/TMDB ratings, genre trends, runtime performance, and audience popularity, businesses can optimize content production to exceed audience expectations and maximize commercial success.

#### **Define Your Business Objective?**

Produce content that boosts engagement, delivers consumer satisfaction, scores high on TMDB and IMDB, and maximizes profit.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
credits = pd.read_csv("/content/credits.csv.zip")
titles = pd.read_csv("/content/titles.csv.zip")
data = pd.merge(credits,titles,on = "id",how = "outer")
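
An outer merge keeps rows that exist in only one of the two files, which is one source of the missing values seen later. The sketch below is a minimal illustration on hypothetical toy frames (`credits_demo` and `titles_demo` are made-up stand-ins, not the real files). It uses the `indicator` and `validate` options of `pd.merge` to show where each merged row came from and to check that each `id` appears only once on the titles side.

```python
import pandas as pd

# Hypothetical toy stand-ins for credits.csv and titles.csv
credits_demo = pd.DataFrame({
    "person_id": [1, 2, 3],
    "id": ["tm1", "tm1", "tm9"],   # "tm9" has no matching title
    "name": ["A", "B", "C"],
})
titles_demo = pd.DataFrame({
    "id": ["tm1", "ts2"],          # "ts2" has no credits
    "title": ["Movie One", "Show Two"],
    "type": ["MOVIE", "SHOW"],
})

merged_demo = pd.merge(
    credits_demo, titles_demo,
    on="id", how="outer",
    validate="many_to_one",  # raises if an id is repeated in titles_demo
    indicator=True,          # adds a "_merge" column: both / left_only / right_only
)

# How many rows matched on both sides and how many came from one side only
merge_counts = merged_demo["_merge"].value_counts()
print(merge_counts)
```

Rows marked `left_only` or `right_only` are the ones that end up with nulls in the columns of the other file.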

### Dataset First View

In [None]:
# Dataset First Look
data.head(15)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(data.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize = (10,8))
sns.heatmap(data.isnull(), cbar=False)

### What did you know about your dataset?

The dataset covers Amazon Prime TV shows and movies. It comprises 125354 rows and 19 columns. Important columns such as imdb_score, tmdb_score, imdb_votes, tmdb_popularity, name (actor) and title help us analyze relationships within the content and extract insights.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(data.columns)

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description


*   Person ID : Identifier of the person credited

*   ID : Identifier of the title; used as the key to merge credits with titles

*   Name : Name of the person credited

*   Character : Name of the character played in the content

*   Role : Role of the person in the content (for example, actor or director)

*   Title : Name of the content

*   Type : Type of content (Movie or Show)

*   Description : Short text description of the movie or show

*   Release year : Year in which the content was released

*   Age certification : Age rating of the content

*   Runtime : Duration of the movie or show, in minutes

*   Genres : Genres the content belongs to

*   Production Countries : Countries that produced the content

*   Seasons : Number of seasons; applies only to shows, so it is missing for movies

*   IMDB ID : IMDB identifier of the movie or show

*   IMDB Score : IMDB score of the movie or show

*   IMDB Votes : Number of IMDB votes for the movie or show

*   TMDB Popularity : TMDB popularity of the movie or show

*   TMDB Score : TMDB score of the content







### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in data.columns.tolist():
  print("No. of unique values in ",i,"is",data[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Create a copy of data
dataset=data.copy()


In [None]:
# Analysing the null value
dataset.isnull().sum()

In [None]:
# Analysing the null value in percentage
(dataset.isnull().sum()/ len(dataset))*100

Filling in missing values :

In [None]:
#  Dropping age_certification and seasons columns because of incomplete data.

dataset.drop(["age_certification","seasons"],inplace=True,axis=1)


In [None]:
# person_id is numerical and right skewed, so the median is used to fill the missing values

person_id_median = dataset["person_id"].median()


dataset.fillna({"person_id": person_id_median},inplace=True)

In [None]:
# Replacing missing values from name using mode as it comprises categorical data

name_mode = dataset["name"].mode()[0]

dataset.fillna({"name": name_mode},inplace=True)

In [None]:
# Replacing missing values from character by using mode
character_mode=dataset["character"].mode()[0]

dataset.fillna({"character": character_mode},inplace=True)

In [None]:
# Filling in missing values of the role column using mode
role_mode = dataset["role"].mode()[0]

dataset.fillna({"role": role_mode},inplace=True)

In [None]:
# Filling in missing values of the description column using mode

description_mode = dataset["description"].mode()[0]

dataset.fillna({"description": description_mode},inplace=True)



In [None]:
# Filling in missing values of imdb_id using mode

imdb_id_mode = dataset["imdb_id"].mode()[0]

dataset.fillna({"imdb_id": imdb_id_mode},inplace=True)

In [None]:
# Filling in missing values of imdb_score using median

dataset["imdb_score"].skew()  # This reflects that it is left skewed

imdb_score_median = dataset["imdb_score"].median()

dataset.fillna({"imdb_score": imdb_score_median},inplace= True)




In [None]:
# Filling in missing values  of imdb_votes using median

dataset["imdb_votes"].skew()  # Using median as it is right skewed

imdb_votes_median = dataset["imdb_votes"].median()

dataset.fillna({"imdb_votes": imdb_votes_median},inplace = True)




In [None]:
# Filling in missing values of tmdb_popularity using median

dataset["tmdb_popularity"].skew()

tmdb_popularity_median = dataset["tmdb_popularity"].median()

dataset.fillna({"tmdb_popularity": tmdb_popularity_median},inplace=True)



In [None]:
# Filling in missing values using median

dataset["tmdb_score"].skew()  # As it is left skewed so we will use median

tmdb_score_median = dataset["tmdb_score"].median()

dataset.fillna({"tmdb_score": tmdb_score_median},inplace = True)
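
The median fills above repeat the same three steps for each numeric column. As a minimal sketch on hypothetical toy data (the `demo` frame and the `numeric_cols` list are illustrative assumptions), the medians can be computed once and passed to a single `fillna` call:

```python
import pandas as pd

# Hypothetical toy data with gaps in two numeric columns
demo = pd.DataFrame({
    "imdb_score": [6.0, None, 8.0],
    "imdb_votes": [100.0, 300.0, None],
    "title": ["a", "b", "c"],
})

numeric_cols = ["imdb_score", "imdb_votes"]  # columns to impute (assumed list)

# Compute every column's median in one call, then fill each column with its own median
medians = demo[numeric_cols].median()
demo_filled = demo.fillna(medians.to_dict())

print(demo_filled)
```

Each column is still filled with its own median, so the result matches the column-by-column approach while keeping the imputed columns listed in one place.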



In [None]:
# Renaming the production countries as countries

dataset=dataset.rename(columns={"production_countries":"countries"})

In [None]:
# Creating a new column named movies (holds the title when type is MOVIE)

dataset.loc[:,"movies"] = dataset.loc[dataset.loc[:,"type"] == 'MOVIE', "title"]

# Drop the missing values of movies
dataset.dropna(subset=["movies"],inplace=True)

In [None]:
# Renaming name column as actor
dataset.rename(columns ={"name": "actor"},inplace=True)

In [None]:
dataset.isnull().sum()

Data manipulation


In [None]:
''' For high score movies this is my base parameter :
For IMDB - high score = > 7
For TMDB - high score = > 7.5  '''

In [None]:
# 1. Which top 10 countries are releasing movies with high imdb and tmdb scores


# Filter high-rated movies (IMDB > 7, TMDB > 7.5)
good_movies = dataset[(dataset.loc[:,"imdb_score"] > 7) & (dataset.loc[:,"tmdb_score"] > 7.5) & (dataset.loc[:,"type"] == "MOVIE")]

# convert string into list and then explode

good_movies.loc[:,"countries"] = good_movies["countries"].str.strip("[]").str.replace("'","").str.split(", ")

explode = good_movies.explode("countries")

explode = explode[explode["countries"] != '']

group = (explode.groupby("countries",as_index=False)
    ["movies"].count()
    .sort_values(by= "movies",ascending= False)
    .head(10))

group
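
The cell above turns the text of the countries column into lists by stripping brackets and quotes. Assuming the column is stored as Python-style list text such as `"['US', 'GB']"` (which the string cleaning above suggests), an alternative sketch parses it with `ast.literal_eval` from the standard library. The `demo` frame below is hypothetical toy data.

```python
import ast
import pandas as pd

# Hypothetical toy data: countries stored as list-like text
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "countries": ["['US', 'GB']", "['IN']", "[]"],
})

# literal_eval turns the text "['US', 'GB']" into a real Python list
demo["countries"] = demo["countries"].apply(ast.literal_eval)

# One row per (title, country); empty lists become NaN after explode and are dropped
exploded = demo.explode("countries").dropna(subset=["countries"])

counts = exploded["countries"].value_counts()
print(counts)
```

With this approach an empty list becomes a missing value after `explode`, so the separate comparison against `''` is not needed.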




In [None]:
# 2. Which countries are investing more in which particular genres for movies after 2000

recent_years = dataset[(dataset["release_year"] >= 2000) & (dataset["type"] == "MOVIE") & (dataset["imdb_score"]> 6.5) & (dataset["tmdb_score"]> 7)]

recent_years.loc[:,"genres"] = (recent_years["genres"].str.strip("[]").str.replace("'", "").str.split(", "))

recent_years.loc[:,"countries"] = (recent_years["countries"].str.strip("[]").str.replace("'", "").str.split(", "))

recent_exploded = recent_years.explode('genres')

recent_exploded = recent_exploded[recent_exploded['genres'] != '']



# Explode 'countries' column as well
recent_exploded = recent_exploded.explode('countries')
recent_exploded = recent_exploded[recent_exploded['countries'] != '']

genre_focus = recent_exploded.groupby(["countries", "genres"])["movies"].nunique().reset_index(name="movie_count")

genre_focus = genre_focus.sort_values(by=["countries", "movie_count"], ascending=[True, False])

# Get the top genre for each country (genre with max movie_count)
top_genres_by_country = genre_focus.groupby('countries').apply(
    lambda x: x.nlargest(1, 'movie_count')
).reset_index(drop=True)  # include_groups=False is omitted because older pandas versions do not support it

# This will give you a DataFrame with each country and its top genre
top_genres_by_country[['countries', 'genres', 'movie_count']].sort_values(by="movie_count",ascending=False).head(10)
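
The top genre per country can also be picked without `groupby().apply()`. The sketch below is one possible alternative, shown on a hypothetical toy version of `genre_focus`; it uses `idxmax` to find the row with the largest `movie_count` in each country.

```python
import pandas as pd

# Hypothetical toy version of the genre_focus table
genre_focus_demo = pd.DataFrame({
    "countries": ["US", "US", "IN", "IN"],
    "genres": ["documentation", "drama", "drama", "comedy"],
    "movie_count": [5, 3, 4, 1],
})

# idxmax returns the row label of the largest movie_count within each country
top_idx = genre_focus_demo.groupby("countries")["movie_count"].idxmax()

# Select those rows to get one (country, top genre, count) row per country
top_genres_demo = genre_focus_demo.loc[top_idx].reset_index(drop=True)
print(top_genres_demo)
```

When two genres tie within a country, `idxmax` keeps the first one it encounters.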

In [None]:
# 3. Which genres are among the top based on average popularity after 2000

# Filtering dataset including data after 2000
recent_years= dataset[dataset["release_year"] >= 2000]

# Converting elements of genres into a separate entity
recent_years.loc[:,"genres"] = recent_years["genres"].str.strip("[]").str.replace("'","").str.split(", ")

# Explode the genres
explode_recent_years = recent_years.explode("genres")

explode_recent_years = explode_recent_years[explode_recent_years["genres"] != '' ]    # Drop rows where the genre is empty

# Using groupby function to get the desired result


avg_gen_pop = explode_recent_years.groupby("genres",as_index = False).agg(avg_popularity=("tmdb_popularity", "mean"),movies_count=("movies","count")).sort_values(by="avg_popularity",ascending=False)

avg_gen_pop  # This data has been used to create chart #13





In [None]:
# 4. which genres are among the top based on average imdb and average tmdb score after 2000

recent_years = dataset[dataset["release_year"] >= 2000]

recent_years.loc[:,"genres"] = (recent_years["genres"].str.strip("[]").str.replace("'", "").str.split(", "))

recent_exploded = recent_years.explode('genres')

recent_exploded = recent_exploded[recent_exploded['genres'] != '']

genre_scores = (
    recent_exploded.groupby('genres', as_index=False)
    .agg(
        avg_imdb=('imdb_score', 'mean'),
        avg_tmdb=('tmdb_score', 'mean'),
        movie_count=('movies', 'count')
    )
    .query('movie_count >= 10')  # Ignore niche genres
    .sort_values(['avg_imdb', 'avg_tmdb'], ascending=False)
)

top_genres = genre_scores.head(10)
genre_score=top_genres[['genres', 'avg_imdb', 'avg_tmdb', 'movie_count']]
genre_score  # This data has been used to create chart # 12

In [None]:
# 5. Name the top 10 movies with the highest imdb and tmdb scores

# Step - 1 : Find the highest score data
high_rated = dataset[(dataset.loc[:,"imdb_score"] > 7 ) & (dataset.loc[:,"tmdb_score"] > 7.5 )].sort_values(by = ["imdb_score","tmdb_score"],ascending=[False,False]).reset_index(drop=True)

# Step - 2 : drop duplicate titles
unique_high_rated = high_rated.drop_duplicates("movies")

unique=unique_high_rated.head(10)[["movies","imdb_score","tmdb_score"]]
unique  # This result has been used to create chart - 10

In [None]:
# 6. Name the top 10 shows with the highest imdb and tmdb scores

# Step - 1 : Find the highest score data
high_rated = data[(data.loc[:,"imdb_score"] > 7 ) & (data.loc[:,"tmdb_score"] > 7.5 ) & (data["type"] == "SHOW") ].sort_values(by = ["imdb_score","tmdb_score"],ascending=[False,False]).reset_index(drop=True)

# Step - 2 : drop duplicate titles
unique_high_rated = high_rated.drop_duplicates("title")
uniques=unique_high_rated[['title',"imdb_score","tmdb_score"]].head(10)
uniques  # This dataframe is used to create chart # 11


In [None]:
# 7. Name the top 10 most popular actors for releases from 2000 onward

recent_years = dataset[dataset["release_year"] >= 2000]

actor_counts = recent_years["actor"].value_counts()

# Keep actors with at least 3 movies
frequent_actors = actor_counts[actor_counts >= 3].index

# Filter dataset
filtered = recent_years[recent_years["actor"].isin(frequent_actors)]

# Compute mean popularity again
pop = filtered.groupby("actor", as_index=False).agg({"tmdb_popularity":"mean","movies":"count"}).sort_values(by="tmdb_popularity", ascending=False)

# Show top 10
pop.head(10)    # The result pop.head(10) is used in chart 7


In [None]:
# 8. Name the top 10 most popular actors for releases before 2000

before_2000 = dataset[dataset["release_year"] < 2000]

actor_counts = before_2000["actor"].value_counts()

# Keep actors with at least 3 movies
frequent_actors = actor_counts[actor_counts >= 3].index

# Filter dataset
filtered = before_2000[before_2000["actor"].isin(frequent_actors)]

# Compute mean popularity again
pops = filtered.groupby("actor", as_index=False).agg({"tmdb_popularity":"mean","movies":"count"}).sort_values(by="tmdb_popularity", ascending=False)

# Show top 10
pops.head(10)
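
The two actor rankings above differ only in the year filter. As a minimal sketch, the shared steps could be wrapped in a helper; `top_actors` is a hypothetical function name and `demo` is toy data, neither taken from the notebook.

```python
import pandas as pd

def top_actors(df, min_titles=3, n=10):
    """Rank actors by mean tmdb_popularity, keeping those with at least min_titles rows."""
    stats = df.groupby("actor", as_index=False).agg(
        avg_popularity=("tmdb_popularity", "mean"),
        title_count=("movies", "count"),
    )
    # Drop actors with too few titles so one popular film cannot dominate the ranking
    stats = stats[stats["title_count"] >= min_titles]
    return stats.sort_values("avg_popularity", ascending=False).head(n)

# Hypothetical toy data
demo = pd.DataFrame({
    "actor": ["A", "A", "A", "B", "B", "B", "C"],
    "movies": ["m1", "m2", "m3", "m1", "m4", "m5", "m6"],
    "tmdb_popularity": [10.0, 20.0, 30.0, 1.0, 2.0, 3.0, 99.0],
    "release_year": [2001, 2005, 2010, 1990, 1995, 1999, 2015],
})

recent = top_actors(demo[demo["release_year"] >= 2000])
older = top_actors(demo[demo["release_year"] < 2000])
print(recent)
print(older)
```

In the toy data, actor C has the highest popularity but only one title, so the minimum-title filter removes C from the ranking.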



In [None]:
# 9. Which runtime groups score best on imdb and tmdb among high-rated movies

# Step 1: Remove duplicates (one row per movie)
unique_movies = dataset.drop_duplicates(subset=["title", "imdb_score", "tmdb_score", "runtime"])

# Step 2: Filter high-rated movies
high_score = unique_movies[
    (unique_movies["imdb_score"] > 7) &
    (unique_movies["tmdb_score"] > 7.5)
]

# Step 3: Group by runtime and calculate MEDIAN scores
runtime_stats = high_score.groupby(
    pd.cut(high_score["runtime"],
           bins=[0, 90, 120, 180, 1000],
           labels=["Short (<90m)", "Medium (90-120m)", "Long (120-180m)", "Epic (180m+)"]),observed=False
).agg({
    "imdb_score": "median",
    "tmdb_score": "median",
    "title": "count"  # Number of movies per group
}).rename(columns={"title": "movie_count"}).reset_index()

# Sort by highest median IMDb score
a = runtime_stats.sort_values("imdb_score", ascending=False)
a  # This data is used to create chart # 17
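
One detail of `pd.cut` affects how the labels above should be read: by default each bin includes its right edge, so a runtime of exactly 90 minutes falls into the "Short (<90m)" group. The sketch below, on a hypothetical series of runtimes, compares the default with `right=False`, which makes each bin include its left edge instead.

```python
import pandas as pd

# Hypothetical runtimes chosen to sit on the bin edges
runtimes = pd.Series([89, 90, 120, 180, 181])
bins = [0, 90, 120, 180, 1000]
labels = ["Short (<90m)", "Medium (90-120m)", "Long (120-180m)", "Epic (180m+)"]

# Default right=True: bins are (0, 90], (90, 120], ... so 90 lands in "Short"
default_cut = pd.cut(runtimes, bins=bins, labels=labels)

# right=False: bins are [0, 90), [90, 120), ... so 90 lands in "Medium"
left_closed_cut = pd.cut(runtimes, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"runtime": runtimes,
                    "default": default_cut,
                    "right=False": left_closed_cut}))
```

Which convention to use is a design choice; `right=False` matches the wording of the labels more closely.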

In [None]:
# 10. Name the 10 movies with the highest number of votes
unique_movies = dataset.drop_duplicates(subset="movies")

top_10 = unique_movies[["movies","imdb_votes"]].sort_values(by="imdb_votes",ascending=False).head(10)

top_10  # This result has been used to create chart 9

In [None]:
#  11. Which movie got the highest imdb and tmdb score in each release year

# Step 1: Filter high-rated movies (IMDB > 7, TMDB > 7.5) and remove duplicates
high_scores = dataset[
    (dataset["type"] == "MOVIE") &
    (dataset["imdb_score"] > 7) &
    (dataset["tmdb_score"] > 7.5)
].drop_duplicates(subset="movies")

# Step 2: Calculate a combined score (weighted average) for ranking
high_scores["combined_score"] = (high_scores["imdb_score"] * 0.5) + (high_scores["tmdb_score"] * 0.5)

# Step 3: Get the top movie per year based on combined score
top_movies_by_year = high_scores.sort_values(
    by=["release_year", "combined_score"],
    ascending=[True, False]
).groupby("release_year").first().reset_index()

# Step 4: Display results (Year, Movie, Scores)
result = top_movies_by_year[["release_year", "movies", "imdb_score", "tmdb_score"]]
result.tail(10)

In [None]:
# 12. which countries have released the most number of movies with good imdb and tmdb score after 2000

recent_years = dataset[(dataset["imdb_score"]>7 )& (dataset["tmdb_score"] > 7.5)& (dataset["release_year"]>=2000) & (dataset["type"]=="MOVIE")].drop_duplicates(subset="movies")

# Clean and explode countries column
recent_years.loc[:,"countries"] = recent_years["countries"].str.strip("[]").str.replace("'", "").str.split(", ")

# explode the countries column
explode = recent_years.explode("countries")
explode = explode[explode["countries"] !='' ]

country_count = (explode.groupby("countries")
    .size().reset_index(name="movie_count")
    .sort_values("movie_count",ascending=False))

country_count.head(10)  # The extracted data has been used in chart 8

### What all manipulations have you done and insights you found?




1. "I filtered the dataset to include only high-rated movies (IMDB > 7, TMDB > 7.5) and cleaned the countries column by converting it from a string to a list. Using explode(), I accounted for co-productions by splitting multi-country entries into separate rows. After grouping, the US emerged as the top producer (2,246 movies), followed by India (778) and the UK (448). This suggests Hollywood’s dominance, but also highlights India’s growing influence and Europe’s niche appeal. Anomalies like ‘SU’ (Soviet Union) indicate potential data quality issues."

2. "I filtered the dataset to include only high-rated movies (IMDB > 6.5, TMDB > 7) released after 2000. After cleaning and exploding the genres and countries columns, I grouped the data to count unique movies per genre-country pair. The results show the US produces the most high-quality titles in the documentation genre (106 films), followed by India (drama, 93 movies) and then the UK (drama, 30 movies). This analysis avoids overcounting by using nunique() and highlights global genre trends in critically acclaimed cinema."

3.  "To identify genre trends after 2000, I filtered the dataset to movies released from 2000 onward, exploded the genres column, and used groupby with tmdb_popularity to get the average popularity per genre. Animation is on top among all genres with an average popularity of 79.05, followed by fantasy (38.08) and sci-fi (37.9). This shows that although countries like the US and India release well-scored movies mainly in the documentation and drama genres, animation leads in terms of popularity. A high number of movies increases a genre's production share but does not necessarily increase its popularity: the drama genre had 40000+ movies but ranks 15th in popularity."

4. "To identify which movies got the highest IMDB and TMDB scores, I filtered the dataset (IMDB > 7, TMDB > 7.5) and kept unique movies. Pawankhind is on top, followed by Alexander Babu: Alex in Wonderland and then Soorarai Pottru."

5. "Among shows with IMDB > 7 and TMDB > 7.5, the top rated were Couple of mirrors (9.5, 8.0), The Chosen (9.4, 9.4) and Surgeons: At the edge of life (9.2, 8.0).
Note: Score is not about trend or popularity; it is all about quality."

6. "Among actors who appear in at least 3 movies before 2000, James Cameron is on top in terms of average popularity, followed by Paul Brightwell and then Bill Paxton.
They are the most popular actors before 2000."

7. "After grouping runtime into four categories, I found that the medium runtime group (90-120m) scored highest on TMDB and IMDB. This suggests the audience prefers content with a runtime between 90 and 120 minutes."

8. "The movies with the highest IMDB votes are Titanic (1133692), followed by The Usual Suspects (1059480) and then Braveheart (1016629)."

9. "After filtering the dataset (IMDB > 7 and TMDB > 7.5), I calculated a weighted average of the two scores and then used groupby to find the highest-scoring movie for each release year."

10. "After filtering the dataset (IMDB > 7 and TMDB > 7.5, release year >= 2000), I cleaned and exploded the countries column and then used groupby to count how many high-scoring films each country has released. The top countries include the US (68 movies), India (80 movies) and the United Kingdom (15 movies)."

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart visualization code
# Set up the figure size for better visualization
plt.figure(figsize=(8,6))

# Creating a histplot with KDE (Kernel Density Estimate) plot
# Filtering only MOVIE type from dataset
sns.histplot(data=dataset[dataset["type"]=="MOVIE"],x="release_year",color="blue",bins=30,kde=True)

# Adding labels and titles for clarity
plt.xlabel("Release year")
plt.ylabel("Number of movies")
plt.title("Distribution of release year")


# Adding grid for better readability of values
plt.grid(True)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A histogram lets us analyze how the movies are distributed across release years.

##### 2. What is/are the insight(s) found from the chart?

The chart shows a significant increase in the number of movies released after 1980. The highest number of movies were released between 2000 and 2020.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight from the distribution can help the business create a positive impact. The sharp rise in the number of movies after 2000, especially after 2010, indicates audience interest and industry expansion, guiding businesses to invest more.
There is no sign of negative growth, but a slight dip after 2020 might reflect pandemic disruption rather than declining interest.
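
One point to keep in mind when reading this chart: the merged table has one row per credit (person), so a histogram over its rows counts credits rather than unique titles. A minimal sketch, on hypothetical toy data, of deduplicating by the title `id` before counting releases per year:

```python
import pandas as pd

# Hypothetical toy data: movie "tm1" appears three times, once per credited person
demo = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "tm2", "ts3"],
    "actor": ["A", "B", "C", "D", "E"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW"],
    "release_year": [2010, 2010, 2010, 1995, 2018],
})

movies_only = demo[demo["type"] == "MOVIE"]

# Counting rows directly counts credits, so 2010 looks three times as busy as 1995
credit_rows_per_year = movies_only["release_year"].value_counts().sort_index()

# Keeping one row per title id gives the number of movies released per year
unique_movies = movies_only.drop_duplicates(subset="id")
movies_per_year = unique_movies["release_year"].value_counts().sort_index()

print(credit_rows_per_year)
print(movies_per_year)
```

Passing the deduplicated frame to `sns.histplot` would then show titles per year instead of credits per year.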

#### Chart - 2

In [None]:
# Chart visualization code
# Filtering the SHOW type from data
show = data[data["type"]== "SHOW"]

# Set up the figure size for better visualization
plt.figure(figsize=(8,6))

# Creating a histplot with KDE (Kernel Density Estimate) plot
sns.histplot(data = show,x = "release_year",bins=30,kde=True,color="red")

# Adding labels and titles for better clarity
plt.title("Distribution of Show over the period of time")
plt.xlabel("Release Years",labelpad=15,fontsize=12)
plt.ylabel("Total number of Shows",labelpad=15,fontsize=12)

# Adding grid for better readability
plt.minorticks_on()
plt.grid(which='minor', linestyle='--', alpha=0.1, color='black')

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A histogram highlights the distribution of shows by release year.

##### 2. What is/are the insight(s) found from the chart?

There is a significant increase in shows after 2000, which highlights an untapped opportunity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The increase in shows after 2000 indicates audience interest, so businesses can invest in this untapped opportunity.

#### Chart - 3

In [None]:
# Chart visualization code
# Set up the figure size for better visualization
plt.figure(figsize=(6,6))

# Creating a countplot (the palette sets the bar colors, so a separate color argument is not needed)
sns.countplot(x='type', data=data,hue="type",palette =["orange","blue"],legend=True)

# Adding labels and titles for better clarity
plt.xlabel("Type")
plt.ylabel("Count of rows")
plt.title("Distribution of content type  : Movies vs Shows",pad=30)

# Adding grid for better readability
plt.grid(True)

# Display the plot
plt.show()



##### 1. Why did you pick the specific chart?

A count plot counts the number of records under each category, which is why I used it here.

##### 2. What is/are the insight(s) found from the chart?

The data has more than 110000 rows for movies and 2000 rows for shows. Each row is one credit, so these are row counts rather than counts of unique titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help the business create a positive impact. Movies dominate the content types when compared to shows. This suggests higher production volume and higher availability, which can guide platforms to prioritize movie-based recommendations. The low number of shows does not mean there is no scope; it might indicate untapped opportunities or under-investment in series.

#### Chart - 4

In [None]:
# Chart visualization code
# Set up figure size for better visualization
plt.figure(figsize=(6,6))

# Creating a boxplot of tmdb_popularity column
sns.boxplot(x='tmdb_popularity', data=dataset)

# Adding labels and title for better clarity
plt.xlabel("TMDB Popularity",labelpad=15)
plt.title("Distribution of TMDB Popularity",pad=30)

# Adding grid for better readability
plt.grid(True)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Box plot was chosen because it's ideal for visualizing the distribution of numerical values like TMDB Popularity. Also it helps us to know the median, IQR and outliers.

##### 2. What is/are the insight(s) found from the chart?

Most of the movies/shows have low popularity: as the chart indicates, most of the data is clustered on the left side. There are also several extreme outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Since most of the content has low popularity, businesses can compare it with the high-popularity content on genres, runtime and other features, and apply those differences to get the desired result.
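
The outliers drawn in a box plot follow a simple rule: with the default settings, points further than 1.5 times the IQR beyond the quartiles are marked as outliers. The sketch below reproduces that rule numerically on a hypothetical series of popularity values, which can help when the outliers need to be listed rather than only viewed.

```python
import pandas as pd

# Hypothetical popularity values with one extreme point
popularity = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1 = popularity.quantile(0.25)   # first quartile
q3 = popularity.quantile(0.75)   # third quartile
iqr = q3 - q1                    # interquartile range

# Fences used by the default box plot rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points drawn as outliers
outliers = popularity[(popularity < lower_fence) | (popularity > upper_fence)]
print(q1, q3, iqr, outliers.tolist())
```

Applied to the real `tmdb_popularity` column, the same filter would return the titles behind the extreme points in the chart.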

#### Chart - 5

In [None]:
# Chart visualization code
# Set up the figure size for better visibility
plt.figure(figsize=(6,5))

# Creating a box plot of imdb_votes column
sns.boxplot(data=dataset,x="imdb_votes")

# Adding title and label for better clarity
plt.title("Distribution of IMDB Votes",pad =15,fontsize=15)
plt.xlabel("IMDB Votes (In Millions)",labelpad=15,fontsize=12)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A box plot shows the median, the IQR (interquartile range) and the outliers, along with the distribution of numerical values, which is why I used it.

##### 2. What is/are the insight(s) found from the chart?

Most of the movies have low IMDB votes, and there are multiple outliers representing top-voted movies such as Titanic and Shrek.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Analyzing the genres, runtime and type of the movies that got the most votes can help businesses uncover hidden patterns. However, looking at highly voted movies and shows alone is not enough to grow profits; scores must be considered too, because a movie may get a lot of votes without getting a high score.

#### Chart - 6

In [None]:
# Chart visualization code
# Set up figure size for better visualization
plt.figure(figsize=(6,6))

# Creating a violin plot to check the distribution of runtime
sns.violinplot(data=dataset,x="runtime")

# Adding label and title for better clarity
plt.xlabel("Runtime")
plt.title("Distribution of Runtime",pad=20)

# Adding grid for better readability
plt.grid(True)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot was chosen because it not only shows the distribution of runtime, the median, the IQR and the outliers but also highlights where the data is most densely concentrated.

##### 2. What is/are the insight(s) found from the chart?

The data is most densely clustered between 80 and 120 minutes, so a typical content runtime falls in that range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Production houses can focus on content with a runtime between 80 and 120 minutes, as this is where most content is concentrated. The chart also indicates that movies with a very long runtime (more than 180 minutes) are rare.

#### Chart - 7

In [None]:
# Chart visualization code
# Set up figure size for better visibility
plt.figure(figsize=(8,6))

# Creating a barplot to visually analyze the top 10 most popular actors for releases from 2000 onward
# Using the filtered data as pop.head(10)
sns.barplot(data=pop.head(10),x="actor",y="tmdb_popularity",hue="actor",palette="Blues_d")  # Data has been extracted in the 7th code cell of the data manipulation section

# Adding labels and title for better clarity
plt.xticks(rotation=45,ha="right")
plt.xlabel("Actors", fontsize=12, labelpad=15, loc='center',
           bbox=dict(facecolor='lightyellow', edgecolor='black'))
plt.title("Top 10 Actors by TMDB Popularity (Releases from 2000 Onward)",pad= 30)
plt.ylabel("TMDB Popularity",labelpad=20,bbox=dict(facecolor="lightyellow",edgecolor="black"))

# Adding grid for better readability
plt.grid(True)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot was chosen because it is ideal for comparing categorical data against numerical data. It clearly shows TMDB popularity for each actor, making comparison quick and easy.

##### 2. What is/are the insight(s) found from the chart?

Among releases from 2000 onward, Jeniffer Kluska is on top based on popularity, followed by Derek Drymon, Angela Yeoh and so on. This indicates these actors had significant recognition on popularity metrics during this period.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Businesses can invest in casting actors who have significant recognition on popularity metrics. Popular actors attract more audience, increase revenue and boost engagement.

#### Chart - 8

In [None]:
# Chart visualization code
# Set up figure size for better visualization
plt.figure(figsize=(8,6))

# Creating a barplot to visually analyze country wise movies with good score post 2000
sns.barplot(data=country_count.head(10),x="countries",y="movie_count",hue="countries",palette= "YlGnBu",legend=True)  # This data comes from the 12th code cell in the data manipulation section.

# Adding title and labels for better clarity
plt.title("Country wise Movies with Good Score (Post 2000)",pad= 20)
plt.xlabel("Countries",labelpad= 20 ,loc='center',
           bbox=dict(facecolor='lightyellow', edgecolor='black'))
plt.ylabel("Total Number of Movies",labelpad=20, loc='center',
           bbox=dict(facecolor='lightyellow', edgecolor='black'))

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

Bar plot was chosen because it is ideal for visualizing the categorical data along with numerical data. It makes our visualization simple and easy which ultimately saves time.

##### 2. What is/are the insight(s) found from the chart?

After the year 2000, the US leads in producing the highest number of movies with a good score (IMDB score > 7 & TMDB score > 7.5), followed by India with 50 movies, then the UK with 15 movies and so on. This suggests that the US and India have been major contributors to high-quality cinema over this period.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. As the US and India are producing high-quality movies, businesses can invest in their content. These countries not only demonstrate consistent quality but also have a large established audience (domestic as well as international).
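
The `country_count` table used in this chart comes from a data manipulation cell that is not shown here. As a rough sketch of how such a table could be derived, the snippet below uses a small made-up frame and assumes that `production_countries` holds Python lists of country codes; both the toy values and the name `country_count_sketch` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the notebook's dataset (hypothetical values)
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE"],
    "release_year": [2005, 2012, 1998, 2015, 2020],
    "imdb_score": [7.5, 8.1, 8.0, 8.5, 6.0],
    "tmdb_score": [7.8, 7.9, 8.2, 8.0, 7.9],
    "production_countries": [["US"], ["IN", "US"], ["US"], ["GB"], ["IN"]],
})

# Keep movies released after 2000 that clear both score thresholds
good = movies[
    (movies["type"] == "MOVIE")
    & (movies["release_year"] > 2000)
    & (movies["imdb_score"] > 7)
    & (movies["tmdb_score"] > 7.5)
]

# One row per (movie, country), then count distinct movies per country
country_count_sketch = (
    good.explode("production_countries")
    .groupby("production_countries")["title"]
    .nunique()
    .sort_values(ascending=False)
    .rename_axis("countries")
    .reset_index(name="movie_count")
)
print(country_count_sketch)
```

A co-production is counted once for each country involved, which is one possible convention; the original cell may have handled this differently.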

#### Chart - 9

In [None]:
# Chart visualization code
# Set up figure size for better visualization
plt.figure(figsize=(8,6))

# Creating a barplot to visually analyze top 10 movies with highest votes
sns.barplot(data = top_10, x ="movies",y = "imdb_votes", hue = "movies", palette = "GnBu", legend = False)  # This data comes from 10th code cell of data manipulation segment.

# Adding title and labels for better clarity
plt.xlabel("Movie",labelpad=15,  bbox=dict(facecolor='lightyellow', edgecolor='black'))
plt.ylabel("IMDB Votes (In Millions)",labelpad=15,  bbox=dict(facecolor='lightyellow', edgecolor='black'))
plt.title("Top 10 Movies with Highest Votes",pad=15)
plt.xticks(rotation=45,ha="right")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot highlights the comparison between multiple elements along with their numerical values, which is why I used it.

##### 2. What is/are the insight(s) found from the chart?

Among all the movies, Titanic ranks highest in terms of votes, followed by The Usual Suspects, then Braveheart and so on.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Analyzing the genres, runtime and marketing strategy of these movies can guide businesses to identify the hidden patterns behind their success. However, IMDB votes alone are not enough for a business to generate profit, but they do reflect a large audience and strong engagement.
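
Because the dataset has one row per cast or crew credit, a title appears many times, so a "top 10 by votes" table needs one row per title first. The sketch below shows one way to do that on made-up data; the frame `credits` and the name `top_votes` are hypothetical and not taken from the notebook's own manipulation cell.

```python
import pandas as pd

# Toy credit-level data: each title repeats once per credited person
credits = pd.DataFrame({
    "title": ["X", "X", "Y", "Z", "Z", "Z"],
    "name": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "type": ["MOVIE"] * 6,
    "imdb_votes": [900.0, 900.0, 1500.0, 300.0, 300.0, 300.0],
})

# Collapse to one row per movie, then take the most-voted titles
top_votes = (
    credits[credits["type"] == "MOVIE"]
    .drop_duplicates(subset="title")
    .nlargest(2, "imdb_votes")[["title", "imdb_votes"]]
    .reset_index(drop=True)
)
print(top_votes)
```

Without the `drop_duplicates` step, a single film with a large cast could fill several of the top slots.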



#### Chart - 10

In [None]:
# Chart visualization code
# Melting the filtered dataset to combine IMDB score & TMDB score
df=unique.melt(id_vars="movies",value_vars=["imdb_score","tmdb_score"],var_name= "score_type",value_name= "score")  # This data was generated in cell #5 of data manipulation segment.

# Set up the figure size for better visualization
plt.figure(figsize=(10,8))

# Creating a barplot
sns.barplot(data=df,x="movies",y="score",hue="score_type",palette=["gold","blue"])  # a list keeps the color order fixed; a set is unordered

# Adding title and labels for better clarity
plt.title("IMDB vs TMDB Score of Top 10 Movies",pad=10)
plt.xlabel("Movies",labelpad = 10,fontsize = 15, bbox = dict(facecolor = 'white', edgecolor='black'))
plt.ylabel("Score",labelpad=10,fontsize=15, bbox=dict(facecolor = 'white', edgecolor='black'))
plt.xticks(rotation=90,ha="center")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is ideal for comparing numerical values across categories.

##### 2. What is/are the insight(s) found from the chart?

Pawankhind stands on top among the movies with good scores followed by Alexander Babu then Soorarai Pottru and so on.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Movies like Pawankhind, Home and Drishyam 2 scored high on the IMDB and TMDB score metrics, which can guide businesses to invest in content released by such countries. Taking runtime, genres and actors into consideration can also help businesses make informed decisions.
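
The `melt` call in this chart's code reshapes the two score columns into a single `score` column so that `hue="score_type"` can draw the bars side by side. A minimal illustration on made-up values (the frame `wide` and its contents are hypothetical):

```python
import pandas as pd

# Wide format: one row per movie, one column per score
wide = pd.DataFrame({
    "movies": ["M1", "M2"],
    "imdb_score": [8.9, 8.7],
    "tmdb_score": [8.5, 8.8],
})

# Long format: one row per (movie, score_type) pair
long_df = wide.melt(
    id_vars="movies",
    value_vars=["imdb_score", "tmdb_score"],
    var_name="score_type",
    value_name="score",
)
print(long_df)
```

Each movie now appears twice, once per score type, which is the shape seaborn expects for a grouped bar plot.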

#### Chart - 11

In [None]:
# Chart visualization code
# Melting the filtered dataset to combine scores
df = uniques.melt(id_vars = "title",value_vars = ["imdb_score","tmdb_score"],var_name = "score_type",value_name = "score")  # This data has been generated in code # 6 cell of data manipulation segment.

# Set up figure size for better visualization
plt.figure(figsize=(10,8))

# Creating a barplot to compare the scores of top 10 Shows
sns.barplot(data=df,x="title",y="score",hue="score_type",palette=["gold","grey"])

# Adding title and labels for better clarity
plt.title("IMDB vs TMDB Score of Top 10 Shows",pad=20,fontsize=15,bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("Shows",labelpad = 15,fontsize = 15, bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))
plt.ylabel("Score",labelpad = 15,fontsize = 15, bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))
plt.xticks(rotation=90)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar plot makes it easy to compare IMDB and TMDB scores across different shows.

##### 2. What is/are the insight(s) found from the chart?

Couple of Mirrors ranks highest among shows on IMDB & TMDB score, followed by The Chosen, then Surgeons: At the Edge of Life.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart helps to identify top-performing shows with good scores. Although the number of shows is small compared with movies, businesses can invest in similar content to boost subscriber satisfaction and generate profit.

#### Chart - 12

In [None]:
# Chart visualization code
# Set up the figure size
plt.figure(figsize=(8,6))

# Melting the score to combine Average IMDB & Average TMDB Scores
df = genre_score.melt(id_vars="genres",value_vars=["avg_imdb","avg_tmdb"],var_name= "score_type",value_name="score")  # This data has been generated in code cell # 4 of data manipulation segment.

# Plotting a barplot to compare Average IMDB vs Average TMDB Score of recent years(post 2000)
sns.barplot(data = df,x = "genres",y = "score",hue = "score_type",palette = ["orange","blue"])

# Move legend outside to the right
plt.legend(bbox_to_anchor = (1.05, 1), loc = 'upper left', borderaxespad = 0)

# Adding labels and title for better clarity
plt.title("Avg IMDB vs Avg TMDB Score of Top 10 Genres",pad=20,fontsize=15,bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("Genres",labelpad = 15,fontsize = 15,  bbox = dict(facecolor='white', edgecolor = 'black'))
plt.ylabel("Score",labelpad = 15,fontsize = 15,  bbox = dict(facecolor = 'white', edgecolor = 'black'))
plt.xticks(rotation = 90,fontsize = 11)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot helps us to compare genres by average IMDB & average TMDB score.

##### 2. What is/are the insight(s) found from the chart?

Among all the genres, the documentation genre ranks highest on average IMDB & TMDB score, followed by history, then music and so on.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The chart above shows that the documentation genre scores highest among all the genres, which suggests that documentation content is of high quality and well liked by the audience. Businesses can therefore invest in and create similar content in such genres in order to boost engagement, generate revenue and increase consumer satisfaction.
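
Average scores per genre can be sensitive to how many titles each genre contains. The sketch below shows one way such averages might be computed alongside a title count, assuming `genres` holds Python lists of genre labels; the toy frame `titles` and the name `genre_score_sketch` are illustrative and not the notebook's actual cell.

```python
import pandas as pd

# Toy title-level data (hypothetical values)
titles = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["documentation", "history"], ["drama"], ["documentation"]],
    "imdb_score": [8.0, 6.0, 7.0],
    "tmdb_score": [7.5, 6.5, 8.5],
})

# One row per (title, genre), then aggregate per genre
genre_score_sketch = (
    titles.explode("genres")
    .groupby("genres")
    .agg(
        avg_imdb=("imdb_score", "mean"),
        avg_tmdb=("tmdb_score", "mean"),
        n_titles=("title", "nunique"),
    )
    .sort_values("avg_imdb", ascending=False)
    .reset_index()
)
print(genre_score_sketch)
```

In this toy data, `history` ranks first on the strength of a single title, which is why keeping `n_titles` next to the averages can help when reading a genre ranking.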

#### Chart - 13

In [None]:
# Chart visualization code
# Set up figure size for better visualization
plt.figure(figsize = (8,6))

# Plotting a barplot to visually analyze popular genres post 2000
sns.barplot(data = avg_gen_pop, x = "genres", y = "avg_popularity", hue = "genres",palette = "viridis")  # The data is generated in code cell #3 of data manipulation segment

# Adding title and labels for better clarity
plt.xticks(rotation = 90,fontsize = 11)
plt.title("Post-2000 Genres Ranked by Average Popularity", pad = 15, fontsize = 15,bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("Genres",labelpad = 10,fontsize = 14, bbox = dict(facecolor = 'white', edgecolor = 'black'))
plt.ylabel("Average Popularity",labelpad = 15,fontsize = 13, bbox = dict(facecolor = 'white', edgecolor = 'black'))

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

It helps to visually compare the genres based on average popularity.

##### 2. What is/are the insight(s) found from the chart?

Animation ranks far above all other genres in terms of average popularity, followed by fantasy, then scifi, then family and so on.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Genres like animation, scifi and family are among the top genres on popularity metrics after the year 2000. This can guide businesses to invest in such genres, as they hold a large audience base, which helps in generating revenue, improving consumer satisfaction and boosting engagement. However, high popularity does not indicate quality; it indicates awareness of the movies/shows.

#### Chart - 14

In [None]:
# Chart visualization code
# Set up figure size for better visualization
plt.figure(figsize=(6,5))

# Plotting a scatterplot to check the distribution of IMDB & TMDB Score
sns.scatterplot(data = dataset, x = 'imdb_score', y = 'tmdb_score')

# Adding title and labels for better clarity
plt.title("IMDb vs. TMDB Scores",pad = 10, bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("IMDB Score",labelpad = 15,fontsize = 12,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))
plt.ylabel("TMDB Score",labelpad = 15,fontsize = 12,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

It is ideal for comparing two numerical values (IMDB Score & TMDB Score). Also it visually represents the correlation, trends and any outliers.

##### 2. What is/are the insight(s) found from the chart?

It shows a moderate positive correlation between IMDB & TMDB scores, which means that as the IMDB score increases, the TMDB score tends to increase too. A dense cluster also forms between IMDB (4-6) & TMDB (5-8), which indicates that most of the movies/shows fall in this score range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If a production house wants to predict the audience response, understanding the difference between the IMDB & TMDB scores helps in recommending and promoting content more confidently. However, a noticeable difference between the two scores can create a negative impact, as it makes the audience response harder to judge.
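
The visual impression of a positive relationship can be backed by a number. The following sketch computes the Pearson correlation and the per-title gap between the two scores on made-up values; the frame `scores` and the `gap` column are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy scores (hypothetical values)
scores = pd.DataFrame({
    "imdb_score": [4.0, 5.0, 6.0, 7.0, 8.0],
    "tmdb_score": [5.0, 6.5, 6.0, 7.5, 8.0],
})

# Pearson correlation between the two rating scales
r = scores["imdb_score"].corr(scores["tmdb_score"])

# Positive gap means TMDB rates the title higher than IMDB
scores["gap"] = scores["tmdb_score"] - scores["imdb_score"]

print(f"Pearson r = {r:.2f}")
print(scores["gap"].describe())
```

Titles with a large absolute `gap` are the ones where the two platforms disagree most and may deserve a closer look.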

#### Chart - 15

In [None]:
# Chart visualization code
# Set up figure size for better visualization
plt.figure(figsize = (6,5))

# Plotting a scatterplot to visually analyze the distribution of Runtime and TMDB Popularity
sns.scatterplot(data = dataset, x ='runtime', y ='tmdb_popularity')

# Adding title and labels for better clarity
plt.title("Runtime vs TMDB Popularity",pad = 10, bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("Runtime",labelpad = 15,fontsize = 12,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))
plt.ylabel("TMDB Popularity",labelpad = 15,fontsize = 12,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

It helps to visually analyze the correlation between two numerical columns (Runtime,TMDB Popularity).

##### 2. What is/are the insight(s) found from the chart?

Content with a runtime between 90 and 120 minutes performs well on popularity metrics.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses should invest in content whose runtime is between 90 and 120 minutes, as movies/shows with such runtimes are the ones gaining popularity.
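
A scatterplot alone can make it hard to judge whether the 90-120 minute range really stands out. One possible check, shown here on made-up data, is to compare the median popularity inside and outside that range; the frame `content` and the `band` column are hypothetical.

```python
import pandas as pd

# Toy runtime and popularity values (hypothetical)
content = pd.DataFrame({
    "runtime": [45, 80, 95, 110, 118, 130, 150, 200],
    "tmdb_popularity": [3.0, 5.0, 12.0, 20.0, 9.0, 8.0, 6.0, 4.0],
})

# Label each title by whether its runtime falls in 90-120 minutes (inclusive)
content["band"] = (
    content["runtime"].between(90, 120).map({True: "90-120 min", False: "other"})
)

# Median is used because popularity is usually heavily skewed
band_summary = content.groupby("band")["tmdb_popularity"].agg(["count", "median"])
print(band_summary)
```

If the medians on the real data are close, the apparent concentration in the scatterplot may simply reflect that most titles have runtimes in that range.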

#### Chart - 16

In [None]:
# Chart visualization code
# Set up figure size for better visualization
plt.figure(figsize = (6,5))

# Plotting a scatterplot to visually analyze the correlation between Runtime and IMDB Votes
sns.scatterplot(data = dataset, x ='runtime', y ='imdb_votes')

# Adding title and labels for better clarity
plt.title("Runtime vs IMDB Votes ",pad = 10,bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("Runtime",labelpad = 15,fontsize = 12,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))
plt.ylabel("IMDB Votes (In Millions)",labelpad = 15,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A scatterplot helps to find the correlation between two numerical variables (Runtime & IMDB Votes).

##### 2. What is/are the insight(s) found from the chart?

Content with a runtime in the range of 90-120 minutes received the highest number of votes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Production houses should focus more on producing content whose runtime is between 90 and 120 minutes, as it receives more votes.
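
Vote counts tend to be dominated by a few blockbusters, so a rank-based measure can be a useful complement to the scatterplot. The sketch below computes a Spearman-style correlation as the Pearson correlation of ranks, on made-up data; the frame `sample` is hypothetical.

```python
import pandas as pd

# Toy runtime and vote counts with one extreme value (hypothetical)
sample = pd.DataFrame({
    "runtime": [60, 85, 95, 105, 115, 140, 190],
    "imdb_votes": [500, 2_000, 40_000, 1_200_000, 90_000, 30_000, 8_000],
})

# Pearson is sensitive to the single very large vote count
pearson_r = sample["runtime"].corr(sample["imdb_votes"])

# Spearman correlation, computed as Pearson on the ranks
spearman_r = sample["runtime"].rank().corr(sample["imdb_votes"].rank())

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```

A low rank correlation together with a visible peak in the middle of the runtime axis would suggest a non-monotonic relationship rather than "longer is better".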

#### Chart - 17



In [None]:
# Chart visualization code
# Filtering the dataset to get Runtime and Movie_count of movies with good score(IMDB > 7 & TMDB > 7.5)
runtime_data = a.set_index('runtime')['movie_count']  # This data has been generated in code cell #9 of data manipulation segment

# Create pie chart to visually analyze the proportion of movies with good score by runtime
plt.figure(figsize = (8, 6))
plt.pie(
    runtime_data,
    labels = runtime_data.index,
    autopct='%1.1f%%',
    startangle = 90,           # Rotate pie to start from top
    colors = ["gold","yellow","blue","lightpink"],  # Custom colors
    explode = (0.05, 0, 0, 0)  # Highlight "Long" segment slightly
)

# Equal aspect ratio ensures pie is circular
plt.axis('equal')

# Adding title for better clarity
plt.title('Proportion of Movies by Runtime', pad = 20,bbox = dict(facecolor = 'White', edgecolor = 'black'))

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart shows how movies with a good score are distributed across runtime categories.

##### 2. What is/are the insight(s) found from the chart?

Movies of Medium duration (90-120m) lead with 36.2 % of content, followed by Short (<90m) with 35.3 %, then Long (120-180m) with 25.9 % and lastly Epic (180m+) with 2.6 %.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Production houses should focus on producing content of Medium duration (90-120m) and Short duration (<90m), as these are the categories that score best and gain the most popularity.
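
The runtime categories behind this pie chart were built in a data manipulation cell that is not shown here. As a sketch of how such categories and their shares could be produced, the snippet below uses `pd.cut` on made-up runtimes. The bin edges follow the labels quoted above, but the choice to put exactly 120 minutes into "Long" is an assumption, as are the names `good_movies` and `share`.

```python
import pandas as pd

# Toy runtimes of well-scored movies (hypothetical values)
good_movies = pd.DataFrame({"runtime": [75, 88, 95, 100, 119, 125, 150, 185]})

# Left-closed bins: [0, 90), [90, 120), [120, 180), [180, inf)
runtime_category = pd.cut(
    good_movies["runtime"],
    bins=[0, 90, 120, 180, float("inf")],
    labels=["Short (<90m)", "Medium (90-120m)", "Long (120-180m)", "Epic (180m+)"],
    right=False,
)

# Share of movies per category, in percent
share = runtime_category.value_counts(normalize=True).mul(100).round(1)
print(share)
```

These shares describe how many well-scored movies fall in each category; comparing them with the category shares of all movies would show whether any runtime range is over-represented among the good ones.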

#### Chart - 18



In [None]:
# Chart visualization code
# Filter movies released in 2010 or later that have a good score (IMDB > 7 & TMDB > 7.5), one row per title

score_high= dataset[(dataset["imdb_score"]>7)&(dataset["tmdb_score"]>7.5)&(dataset["type"]=="MOVIE")&(dataset["release_year"]>=2010)].drop_duplicates(subset="title")
high_ = score_high.sort_values(by = ["imdb_score","tmdb_score"],ascending = [False,False]).head(10)

# Using melt() to combine IMDB & TMDB Score
actor_by_score = high_.melt(id_vars = "actor",value_vars = ["imdb_score","tmdb_score"],var_name = "score_type",value_name = "score")


# Set up figure size for better visualization
plt.figure(figsize = (8,8))

# Creating a barplot to visually analyze the top 10 actor based on IMDB & TMDB Score
sns.barplot(data = actor_by_score,x = "actor",y = "score", hue = "score_type",palette = ["blue","orange"])  # a list keeps the color order fixed; a set is unordered

# Adding title and labels for better clarity
plt.title("Top 10 Actors by IMDB & TMDB Score (POST-2000)",pad=15,fontsize=15, bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("Actors",labelpad = 10,fontsize = 14)
plt.ylabel("Score",labelpad = 10,fontsize = 14)
plt.xticks(rotation = 90)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is ideal for comparing numerical values across categories (IMDB vs TMDB score for the top 10 actors).

##### 2. What is/are the insight(s) found from the chart?

The chart shows that although actors like Jennifer Kluska and Derek Drymon rank highest on popularity metrics, Chinmey Mandlekar leads on IMDB & TMDB score, followed by Alexander Babu, then Suriya and so on.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Production houses should consider casting actors like Chinmey Mandlekar and Suriya, as they achieved the highest scores on IMDB as well as TMDB.
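
A ranking based on one title per actor reflects that title's score more than the actor's track record. As an optional robustness check, the sketch below averages scores per actor and keeps only actors with at least two titles. The data, the threshold of two and the names `credits`, `actor_stats` and `ranked` are all assumptions made for illustration.

```python
import pandas as pd

# Toy credit-level data (hypothetical actors and scores)
credits = pd.DataFrame({
    "name": ["Actor A", "Actor A", "Actor B", "Actor C", "Actor C", "Actor C"],
    "title": ["T1", "T2", "T3", "T1", "T4", "T5"],
    "imdb_score": [8.0, 7.0, 9.0, 8.0, 6.0, 7.0],
})

# Number of distinct titles and mean score per actor
actor_stats = credits.groupby("name").agg(
    n_titles=("title", "nunique"),
    avg_imdb=("imdb_score", "mean"),
)

# Keep actors with at least two titles before ranking
ranked = actor_stats[actor_stats["n_titles"] >= 2].sort_values(
    "avg_imdb", ascending=False
)
print(ranked)
```

In this toy example, "Actor B" has the single highest score but is excluded because it rests on one title only.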

#### Chart - 19 - Correlation Heatmap

In [None]:
# Chart visualization code
# Filtering numerical data
num_data = dataset.select_dtypes(include=["int64","float64"])
correl = num_data.corr()

# Set up figure size for better visualization
plt.figure(figsize = (8,6))

# Plot heatmap
sns.heatmap(correl,annot = True,cmap = "rainbow")
plt.tight_layout()

# Display the heatmap
plt.show()


##### 1. Why did you pick the specific chart?

It is ideal for understanding the correlation of numerical values.

##### 2. What is/are the insight(s) found from the chart?

*   IMDB Score and runtime show a weak positive correlation (0.26), meaning the score tends to rise slightly as runtime increases.
*   IMDB Score and TMDB Score show a moderate positive correlation (0.58).
*   IMDB Votes and TMDB Score show a weak positive correlation (0.24).
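
Reading individual pairs off a heatmap can be error-prone. One possible way to list the pairs directly from a correlation matrix, strongest first, is sketched below on made-up numbers; the frame `num` and the name `pairs` are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Toy numeric columns (hypothetical values)
num = pd.DataFrame({
    "runtime": [60, 90, 100, 120, 150],
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 7.5],
    "tmdb_score": [5.5, 6.0, 7.0, 6.5, 8.0],
})

corr = num.corr()

# Each unordered column pair once, skipping the diagonal
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values(ascending=False)

print(pairs.round(2))
```

The resulting Series is indexed by column pair, so the values quoted in the bullets above can be looked up by name instead of by eye.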









#### Chart - 20 - Pair Plot

In [None]:
# Chart visualization code
# Creating pairplot
sns.pairplot(dataset)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A pairplot is used to visualize both the distribution of individual variables and the relationship between pairs of variables in a dataset.

##### 2. What is/are the insight(s) found from the chart?





*   IMDb Score vs TMDB Score shows a clear positive linear relationship, confirming both rating platforms agree to a large extent.
*   IMDb Votes vs IMDb Score reveals a mild positive trend: higher-rated movies tend to have more votes.
*   Runtime is mostly concentrated between 50 and 150 minutes, indicating a common range for most films.
*   IMDb Votes vs Runtime shows a weak correlation, though some longer movies gather more votes, possibly due to big-budget productions.
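
Since the dataset holds one row per credit, a pair plot on the full frame draws every title once per cast member and can be slow. A possible preparation step, sketched on made-up data, is to keep one row per title, select the numeric columns of interest and sample. The column list, the sample size of 2000 and the names `rows`, `per_title` and `plot_data` are assumptions.

```python
import pandas as pd

# Toy credit-level rows: titles repeat once per credited person
rows = pd.DataFrame({
    "title": ["A", "A", "B", "C", "C", "C"],
    "name": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "runtime": [100, 100, 45, 130, 130, 130],
    "imdb_score": [7.1, 7.1, 6.0, 8.2, 8.2, 8.2],
    "tmdb_score": [7.0, 7.0, 6.4, 8.0, 8.0, 8.0],
    "imdb_votes": [1200.0, 1200.0, 300.0, 90000.0, 90000.0, 90000.0],
})

numeric_cols = ["runtime", "imdb_score", "tmdb_score", "imdb_votes"]

# One row per title so large casts do not weight a title more heavily
per_title = rows.drop_duplicates(subset="title")[numeric_cols]

# Cap the number of points; random_state makes the sample reproducible
plot_data = per_title.sample(n=min(len(per_title), 2000), random_state=42)
print(plot_data)
```

The reduced frame `plot_data` could then be passed to `sns.pairplot` in place of the full dataset.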





## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?


As the number of movies rises with release year, businesses should invest in producing movies with actors like Jennifer Kluska, Derek Drymon, Suriya, Alexander Babu and so forth. Additionally, businesses should pay close attention to genres like animation, scifi, documentation, fantasy and history, because these genres attract a large audience in terms of score and popularity. Runtime also plays a crucial role, as content of medium (90-120m) and short (<90m) duration is critically acclaimed. Businesses should focus on shows too, as their positive relation with release_year indicates a rise in the number of shows. Countries like the US, India and the UK are producing high-quality content, so businesses should invest in such content in order to maximize revenue, boost engagement and improve consumer satisfaction.

# **Conclusion**



1. High-Quality Movies & Shows

*   Focus on high-rated content (IMDB > 7, TMDB > 7.5) as they perform well in both popularity and quality.
*   Medium (90-120m) and short (<90m) runtimes tend to get better ratings, so prioritize well-paced storytelling.

2.   Popular Genres = Bigger Audience

*   Animation, Sci-Fi, Fantasy, and Documentary genres have the highest popularity and ratings.
*   Drama has high production volume but lower popularity, so balance quantity with engaging genres.

3.   Top Countries for High-Quality Content

*   US, India, and UK dominate in producing critically acclaimed films.
*   Hollywood leads, but India is rising fast, making it a key market for investment.

4.   Shows Are Growing in Demand


*   There’s a positive trend in shows (like Couple of Mirrors, The Chosen), indicating a shift toward binge-worthy series.

5.   Actors Matter

*   Actors like Suriya, Alexander Babu have strong audience appeal.
*   Casting popular actors from past hits (pre-2000) can add credibility.

6.   Data-Driven Production Decisions

*   Use TMDB/IMDB ratings, runtime trends, and genre popularity to guide investments.
*   Avoid overproducing low-popularity genres (even if they have high volume, like drama).




