**Data provided by [TMDb](https://www.themoviedb.org/). This product uses the TMDb API but is not endorsed or certified by TMDb.**

---




<p style="text-align:center;">
    <img src="https://storage.googleapis.com/kaggle-datasets-images/1369641/2274357/b4c2d21181787e2628bab89cb2047d68/dataset-card.png?t=2021-05-27-12-11-59"
         alt="TMDb Logo"
         style="width: auto; height: 50px; margin-right: 10px;" />
</p>


# **Exploratory Data Analysis on TMDB Movies Dataset**

# ⬇️ Importing the Dataset

In [None]:
import numpy as np  # numpy!
import seaborn as sns # visualisation!
import matplotlib.pyplot as plt # visualisation!
import pandas as pd # dataframes & data analysis!
import ast # to convert string to list!

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
df = pd.read_csv('TMDB_movies.csv')

# 🔎 Inspecting the Dataset

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.dtypes

# 🧹 Data Cleaning:


```
Looking at some columns where they can be categorised!
```

### **'status' column:**

```
Changing datatype to category
```

In [None]:
# looking at the status column
df.status.unique()

In [None]:
# changing it to category datatype
df['status'] = df.status.astype('category')

In [None]:
# check :)
df['status'].unique()

### **'genres' column:**

In [None]:
# looking at the genre column
df.genres.unique()

In [None]:
# creating a copy
df['genres_original'] = df['genres'].copy()

In [None]:
# cleaning
def extract_genres(genre_str):
    try:
        genre_list = ast.literal_eval(genre_str)  # convert string to a list of dictionaries
        return [genre['name'] for genre in genre_list]  # extracting only the genre names
    except (ValueError, SyntaxError):  # in case conversion fails
        return []

# applying function to the 'genres' column
df['genres'] = df['genres'].apply(extract_genres)

In [None]:
# check if everything worked :)
df[['genres_original', 'genres']].head()

In [None]:
# removing the copy
df.drop(columns=['genres_original'], inplace=True)

### **'keywords' column**

In [None]:
# looking at the keywords column
df.keywords.unique()

In [None]:
# creating a copy
df['keywords_original'] = df['keywords'].copy()

In [None]:
# cleaning
def extract_keywords(keywords_str):
    try:
        keywords_list = ast.literal_eval(keywords_str)
        return [keywords['name'] for keywords in keywords_list]
    except (ValueError, SyntaxError):
        return []

# applying
df['keywords'] = df['keywords'].apply(extract_keywords)

In [None]:
# check :)
df[['keywords_original','keywords']].head()

In [None]:
# removing the copy
df.drop(columns=['keywords_original'], inplace=True)

### **'production_companies' column**

In [None]:
# looking at the production_companies column
df.production_companies.unique()

In [None]:
# creating a copy
df['production_companies_original'] = df['production_companies'].copy()

In [None]:
# cleaning
def extract_companies(production_companies_str):
    try:
        companies_list = ast.literal_eval(production_companies_str)
        return [companies['name'] for companies in companies_list]
    except (ValueError, SyntaxError):
        return []

# applying
df['production_companies'] = df['production_companies'].apply(extract_companies)

In [None]:
# check :)
df[['production_companies_original','production_companies']].head()

In [None]:
# removing the copy
df.drop(columns=['production_companies_original'], inplace=True)

### **'production_countries' column**

In [None]:
# looking at the production_countries column
df.production_countries.unique()

```
Wow, that's a lot :O
```

In [None]:
# cleaning
def extract_countries(production_countries_str):
    try:
        countries_list = ast.literal_eval(production_countries_str)
        return [countries['name'] for countries in countries_list]
    except (ValueError, SyntaxError):
        return []

# applying
df['production_countries'] = df['production_countries'].apply(extract_countries)

In [None]:
# check :)
df[['production_countries']].head()

```
much better :)
```

### **'spoken_languages' column**

```
I noticed some unusual values, such as empty string ' ', '??????', and 'No Language'
```

In [None]:
# looking at the spoken_languages column
df.spoken_languages.unique()

In [None]:
# cleaning
def extract_languages(spoken_languages_str):
    try:
        languages_list = ast.literal_eval(spoken_languages_str)
        return [languages['name'] for languages in languages_list]
    except (ValueError, SyntaxError):
        return []

# applying
df['spoken_languages'] = df['spoken_languages'].apply(extract_languages)

In [None]:
# check :)
df[['spoken_languages']]

```
Still some strange quirks, but they will be sorted later!
```
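The `extract_*` functions above all share the same body, so they could be collapsed into one helper. Below is a minimal sketch, assuming every cell holds a string-encoded list of dictionaries with a `name` key. The names `extract_names` and `JUNK_NAMES` are hypothetical and not part of the original notebook. It also catches `TypeError` and `KeyError`, and can optionally drop the unusual language values noticed earlier.

```python
import ast

# Placeholder values noticed in 'spoken_languages' (assumed here to be noise)
JUNK_NAMES = {'', '??????', 'No Language'}

def extract_names(cell, drop=frozenset()):
    """Return the 'name' of every dict in a string-encoded list of dicts."""
    try:
        items = ast.literal_eval(cell)            # string -> list of dicts
        names = [item['name'] for item in items]  # keep only the names
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []                                 # unparseable or malformed cell
    # optionally filter out unwanted names (whitespace-only counts as empty)
    return [name for name in names if str(name).strip() not in drop]

sample = '[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "xx", "name": "No Language"}]'
print(extract_names(sample))
print(extract_names(sample, drop=JUNK_NAMES))
```

In the notebook this could be used as `df['genres'].apply(extract_names)`, or with `drop=JUNK_NAMES` for `spoken_languages`, provided the assumption about the cell format holds.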

### (Taking another look at the dataframe)

In [None]:
df

### **Null handling**

In [None]:
# counting nulls
null_df = df.isnull()
null_df.sum()

```
homepage - 3091 nulls (optional, ignorable)
overview - 3 nulls (optional, ignorable)
release date - 1 null                       (investigate!)
runtime - 2 nulls (movie length, ignorable)
tagline - 844 nulls (optional, ignorable)
```

In [None]:
df.shape

In [None]:
# lets identify the amount of nulls and put them into a data frame
# code credit: Digital Futures

def null_vals(dataframe):
    '''function to show both number of nulls and the percentage of nulls in the whole column'''
    null_vals = dataframe.isnull().sum() ## How many nulls in each column
    total_cnt = len(dataframe) ## Total entries in the dataframe
    null_vals = pd.DataFrame(null_vals,columns=['null']) ## Put the number of nulls in a single dataframe
    null_vals['percent'] = round((null_vals['null']/total_cnt)*100,3) ## Round how many nulls are there, as %, of the df

    return null_vals.sort_values('percent', ascending=False)

In [None]:
# applying the above ^ to look at percentages of nulls in the dataframe
null_vals(df).head(5)

In [None]:
# retrieve the 1 record where the release_date data is missing
df[df['release_date'].isnull()]

In [None]:
# checking whether this was a duplication error
df[df['original_title'].str.lower() == 'america is still the place']

# there is only one instance, and it is missing most of its data
# therefore, it is droppable

In [None]:
# dropping the row
df.drop(df[df['original_title'] == 'America Is Still the Place'].index, inplace=True)
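Dropping by title works here because the title is unique, but it would also remove any other row that happened to share it. A possible alternative, sketched on a made-up toy frame, is to drop on the missing field itself with `DataFrame.dropna(subset=...)`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    'original_title': ['Film A', 'Film B', 'Film C'],
    'release_date': ['2009-12-10', None, '2015-06-09'],
})

# Keep only the rows that have a release date
cleaned = toy.dropna(subset=['release_date'])
print(cleaned['original_title'].tolist())
```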

In [None]:
# checking the nulls
null_vals(df)

# now release_date has no nulls :)

```
Now, I noticed there are 0s in the budget and revenue columns.

In budget, a 0 most likely means no budget was recorded.
In revenue, a 0 could be valid for cancelled or unreleased films, but for released films it points to missing data.
These rows will be dropped for the visuals later.

Next, I'll investigate the revenue.
```

In [None]:
# revenue column has how many '0'

(df['revenue'] == 0).sum()

# 1426 instances of 0 in revenue

In [None]:
# looking at the movie status
df.status.unique()

In [None]:
(df['status'] == 'Post Production').sum()
# only 3 movies are in post production

In [None]:
(df['status'] == 'Rumored').sum()
# only 5 movies are rumored
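The per-status counts above could also be read off in one call with `Series.value_counts()`. This is a small sketch on an invented status column, so the counts shown are not the dataset's.

```python
import pandas as pd

# Illustrative status column (made-up values, not the dataset's)
status = pd.Series(
    ['Released', 'Released', 'Rumored', 'Post Production', 'Released'],
    dtype='category',
)

# One row per status, most frequent first
counts = status.value_counts()
print(counts)
```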

In [None]:
# count zero values in revenue column
# code credit: found on google

zero_revenue_count = (df['revenue'] == 0).sum()
total_count = len(df)

print(f"Total rows: {total_count}")
print(f"Zero revenue count: {zero_revenue_count} ({zero_revenue_count/total_count:.2%})")

```
In the 'EDA: Revenue Across Decades' section, I removed the 0s so they won't distort the visualisations
```
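An alternative to dropping the rows outright is to treat 0 as "missing", so those movies stay available for analyses that do not need the financial columns. A minimal sketch on a made-up frame, assuming 0 never represents a genuine budget or revenue:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'budget': [1_000_000, 0, 5_000_000, 2_000_000],
    'revenue': [3_000_000, 0, 0, 8_000_000],
})

# Replace 0 with NaN so pandas statistics skip these entries
money_cols = ['budget', 'revenue']
toy[money_cols] = toy[money_cols].replace(0, np.nan)

# The median now ignores the missing revenues
print(toy['revenue'].median())

# Rows with both figures present, e.g. for budget vs revenue plots
complete = toy.dropna(subset=money_cols)
print(complete['title'].tolist())
```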

# 🌟 Dataset Features:

```
budget - The budget for the movie.
genres - The primary genre(s) of the movie (e.g., Action, Thriller, Crime).
homepage - The official website of the movie, if available.
id - A unique identifier for the movie in the dataset.
keywords - Relevant keywords associated with the movie.
original_language - The primary language in which the movie was originally made.
original_title - The movie's original title before any translations.
overview - A brief synopsis or summary of the movie.
popularity - A numerical score representing the movie's popularity based on views, searches, or ratings.
production_companies - The studios or companies that produced the movie.
production_countries - The countries where the movie was produced.
release_date - The official release date of the movie.
revenue - The total box office earnings of the movie.
runtime - The duration of the movie in minutes.
spoken_languages - The languages spoken in the movie.
status - The current status of the movie (e.g., Released, In Production, Rumored).
tagline - A short promotional phrase or slogan associated with the movie.
title - The official title of the movie.
vote_average - The average user rating of the movie.
vote_count - The total number of user votes received.
```

# 🚀 EDA

## EDA: Number of Movies Released Per Year


In [None]:
# checking datatype for release date
df.dtypes

# object!

In [None]:
# looking at the release date format
df.release_date

In [None]:
# converting 'release_date' to datetime format and extracting the year to a new column
df['release_date'] = pd.to_datetime(df['release_date'], errors='raise')
df['release_year'] = df['release_date'].dt.year

In [None]:
# eyeballing the dataframe to check the release_year column was added
df

In [None]:
# checking datatype for release date and release year
df.dtypes

In [None]:
# Exploring using Visualisations:
plt.figure(figsize=(12,6))
df['release_year'].value_counts().sort_index().plot(kind='line', marker='o', color='skyblue')

plt.title("Number of Movies Released Per Year")
plt.xlabel("Year")
plt.ylabel("Count of Movies")
plt.grid(True)
plt.show()


```
Movie releases have increased steadily over the years, surging in the 1990s and peaking around the 2010s.

However, there is a steep dip around 2015. Why is this?
```

In [None]:
# count of movies per recent years
df['release_year'].value_counts().sort_index().tail(10)

# steadily decreases after 2014, and there's only 1 movie recorded in 2017

In [None]:
# checking for missing years
df[df['release_year'].isna()]

# no missing years

In [None]:
# investigating recent years
df[df['release_year'] > 2014]

# could the status be affecting the chart?

In [None]:
# looking at status unique values
df.status.unique()

# we have post production, released, and rumored

In [None]:
# does the Post Production status affect the revenue?
df_post_production = df[df["status"] == "Post Production"]
df_post_production


# A: no - in row 4178, for movie 'Higher Ground' there is revenue evidence

In [None]:
# does the Rumored status affect the revenue?
df_rumored = df[df["status"] == "Rumored"]
df_rumored

# A: yes - there is no revenue for those rumored aka unreleased

```
The visual might include the rumored and post production status movies,
so I created a filtered dataframe where it only keeps records of released status movies:
```

In [None]:
released_df = df[df['status'] == 'Released']  # keeps only 'Released' movies

plt.figure(figsize=(12,6))
released_df['release_year'].value_counts().sort_index().plot(kind='line', marker='o', color='skyblue')

plt.title('Number of Movies Released Per Year (Filtered)')
plt.xlabel('Year')
plt.ylabel('Count of Movies')
plt.grid(True)
plt.show()

```
Nothing really changed.
I can deduce that the dataset is incomplete for recent years: by 2017, movies were barely recorded.
```
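One quick way to support that deduction is to check the latest release date on record and the per-year counts at the tail. The sketch below uses a handful of invented dates purely for illustration.

```python
import pandas as pd

# Illustrative release dates only (not taken from the dataset)
dates = pd.to_datetime(pd.Series(['2014-11-05', '2015-06-09', '2016-09-09', '2017-02-03']))

latest = dates.max()                                   # most recent release on record
per_year = dates.dt.year.value_counts().sort_index()   # movies per year, oldest first

print(latest.date())
print(per_year.tail(3))
```

If the latest date sits early in its year and the final counts are tiny, the dip is more likely a collection cutoff than a real drop in production.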

## EDA: Revenue Across Decades

In [None]:
# removing values of 0 - data quality issue mentioned before
# .copy() makes the filtered frame independent of the original
df = df[(df['budget'] > 0) & (df['revenue'] > 0)].copy()

In [None]:
# creating a new column for decades
# (without .copy() on the filter above, this assignment gave a copy warning)
df['decade'] = (df['release_year'] // 10) * 10

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='decade', y='revenue', showfliers=False, hue='decade', palette='GnBu_r', legend=False)
plt.yscale('log')
plt.xlabel('Decades')
plt.ylabel('Revenue')
plt.title('Movie Revenue Distribution by Decade')
plt.show()

```
10⁹ (1,000,000,000) → 1 billion revenue
10⁸ (100,000,000) → 100 million revenue
10⁷ (10,000,000) → 10 million revenue
```
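Instead of decoding the powers of ten by hand, the axis labels could be shortened with a small formatter. A minimal sketch follows. `format_money` is a hypothetical helper, and no currency symbol is added because the notebook does not state the currency.

```python
def format_money(value, pos=None):
    """Format an amount compactly, e.g. 1e9 -> '1B'. `pos` is unused."""
    for threshold, suffix in ((1e9, 'B'), (1e6, 'M'), (1e3, 'K')):
        if abs(value) >= threshold:
            return f'{value / threshold:g}{suffix}'
    return f'{value:g}'

for tick in (1e9, 1e8, 1e7):
    print(format_money(tick))
```

The extra `pos` argument is there so the function matches the `(value, position)` signature that `matplotlib.ticker.FuncFormatter` expects, should it be attached to the y-axis of the box plot.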



---



## EDA: Low Budget hits

In [None]:
# finding movies where budget is low but revenue is high
low_budget_hits = df[(df['budget'] > 0) & (df['budget'] < 1000000) & (df['revenue'] > 50000000)]

# show the top results
low_budget_hits[['title', 'budget', 'revenue']].sort_values(by='revenue', ascending=False)

In [None]:
df[df['title'] == 'Paranormal Activity']

```
Paranormal Activity has a revenue of almost 200 million on a budget of only 15,000!
(Note: release year 2007, in the 2000s)

Let's visualise this returned table:
```

In [None]:
# sorting the data by revenue
low_budget_hits_sorted = low_budget_hits.sort_values(by='revenue', ascending=False)

# set figure size
plt.figure(figsize=(10, 6))

# creating bar plot
sns.barplot(x='title', y='revenue', data=low_budget_hits_sorted, hue='title', palette='winter_r')

# adding titles and labels
plt.title('Low Budget Films with High Revenue', fontsize=14)
plt.xlabel('Movie Title', fontsize=12)
plt.ylabel('Revenue', fontsize=12)
plt.xticks(rotation=45, ha='right')

plt.show()

```
^ unused visual
```

```
This is great, but I want to focus on 'Paranormal Activity' as a massive outlier:
with such a low budget, its ROI is off the charts.

A scatter plot would show this better:
```

In [None]:
# scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='budget', y='revenue', alpha=0.5)

# finding Paranormal Activity
pa = df[df['title'] == 'Paranormal Activity']
plt.scatter(pa['budget'], pa['revenue'], color='lime', s=50, label='Paranormal Activity')

# log scale to better visualize differences
plt.xscale('log')
plt.yscale('log')

# labels and title
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.title('Budget vs. Revenue for Movies')
plt.legend()

plt.show()



---



## EDA: Return on Investment for Low Budget Movies


```
After seeing Paranormal Activity's ROI, I became curious about the other outliers in the scatter plot.
Let's compare how many times over each film earned back its budget!
```

```
Low budget films:

Title                     Budget     Revenue
Bambi                     858000   267447150
The Blair Witch Project    60000   248000000
Paranormal Activity        15000   193355800
American Graffiti         777000   140000000
Mad Max                   400000   100000000
Halloween                 300000    70000000
Dr. No                    950000    59600000
Open Water                130000    54667954
```




In [None]:
# selecting low budget movies
compared_movies = df[df['title'].isin(['The Blair Witch Project', 'Paranormal Activity', 'American Graffiti', 'Mad Max', 'Halloween', 'Dr. No', 'Open Water'])].copy()

# calculating ROI
compared_movies['ROI'] = compared_movies['revenue'] / compared_movies['budget']

# sorting by movie's ROI
compared_movies = compared_movies.sort_values(by='ROI', ascending=False)

# bar chart
plt.figure(figsize=(10,6))
plt.bar(compared_movies['title'], compared_movies['ROI'], color='skyblue')

# labels and title
plt.ylabel('ROI (Revenue/Budget)')
plt.title('ROI Comparison of Selected Movies')
plt.xticks(rotation=45)
plt.show()

```
Paranormal Activity continues to be the best performing!
```
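Note that the chart plots revenue divided by budget, which is a revenue multiple. The conventional ROI formula is profit divided by cost, which is always exactly one less than that multiple. The sketch below compares the two, using budget and revenue figures copied from the low budget table in this section.

```python
# (budget, revenue) pairs copied from the low budget films table
films = {
    'The Blair Witch Project': (60_000, 248_000_000),
    'Paranormal Activity': (15_000, 193_355_800),
    'Mad Max': (400_000, 100_000_000),
}

results = {}
for title, (budget, revenue) in films.items():
    multiple = revenue / budget          # what the bar chart plots
    roi = (revenue - budget) / budget    # conventional ROI: profit / cost
    results[title] = (multiple, roi)
    print(f'{title}: multiple = {multiple:,.0f}x, ROI = {roi:,.0f}x')
```

At these scales the two measures rank the films identically, so the chart's conclusion is unaffected. The distinction matters more for films that only just recoup their budget.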



---



## EDA: Financial Powerhouse

In [None]:
# Let's find the highest budget movies!

highest_budget_movies = df.sort_values(by='budget', ascending=False).head(10)
highest_budget_movies[['title', 'budget', 'revenue']]

# 'Pirates of the Caribbean: On Stranger Tides' takes the lead!

In [None]:
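# looking at the info for 'Pirates of the Caribbean: On Stranger Tides'
# selecting by title, since row positions can shift after the zero-value filter
df[df['title'] == 'Pirates of the Caribbean: On Stranger Tides'].iloc[0]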

```
It's easy to see that this movie is the best performing in terms of budget, and in terms of revenue.
However, I want to see its impact!
Is it as popular as the other big budget hits?
```

In [None]:
# filtering for 'Pirates of the Caribbean: On Stranger Tides'
movie = df[df['title'] == "Pirates of the Caribbean: On Stranger Tides"]

plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='budget', y='revenue', size='popularity', alpha=0.5, sizes=(10, 300))
plt.scatter(movie['budget'], movie['revenue'], color='lime', s=300, label='Pirates of the Caribbean')
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.title('Budget vs Revenue (Bubble Size = Popularity)')
plt.legend()
plt.show()


### **unused visual**


 ```
 (not relevant to story)
 ```

In [None]:
low_budget_movies = df[df['budget'] < 1000000]

plt.figure(figsize=(10,6))
sns.histplot(low_budget_movies['revenue'], bins=50, kde=True, log_scale=True, color='skyblue')

plt.xlabel('Revenue')
plt.ylabel('Number of Movies')
plt.title('Revenue Distribution of Low Budget Films')
plt.show()



# *the analysis begins...*






# 🎥 **The Evolution of Cinema: A Century of Film Trends**

I'll be exploring the relationships between movie production, revenue, budget, and their success over the years.

We'll dive into trends and see how certain movies have defied expectations.

## **Number of Movies Released Per Year**



In [None]:
released_df = df[df['status'] == 'Released']
plt.figure(figsize=(12,6))
released_df['release_year'].value_counts().sort_index().plot(kind='line', marker='o', color='skyblue')
plt.title('Movies Released Over Time')
plt.xlabel('Year')
plt.ylabel('Count of Movies')
plt.grid(True)
plt.show()

🎯 Key Question:
- Has movie production increased or fluctuated over time?

📌 Key Points:
- Film production grew significantly around the 1990s, peaking in the 2000s.

🔍 In-Depth Observations:
- Older movies (pre-2000s) are fewer, possibly due to limited filmmaking technology and documentation. [Technology in Cinema](https://blushgrove.com/articles/cinema-landscape-2000s-analysis/#:~:text=The%202000s%20marked%20a%20significant%20transformation%20in%20the,while%20influential%20filmmakers%20carved%20out%20new%20cinematic%20paths)

- A sharp decline in recent years (2015 onwards) likely reflects incomplete data rather than a real drop in production (only 1 movie was recorded for 2017).




---



## **Revenue Across Decades**

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='decade', y='revenue', showfliers=False, hue='decade', palette='GnBu_r', legend=False)
plt.yscale('log')
plt.xlabel('Decades')
plt.ylabel('Revenue')
plt.title('Movie Revenue Distribution by Decade')
plt.show()


> 10⁹ (1,000,000,000) → 1 billion revenue

>  10⁸ (100,000,000) → 100 million revenue

> 10⁷ (10,000,000) → 10 million revenue



🎯 Key Question:
- Have movies become more profitable over time?

📌 Key Points:
- The box (middle 50% of data) shifts upwards in later decades, suggesting that movies are making more money over time.

- 1980s onwards: Movie revenue is consistently high, coinciding with the rise of high-budget companies generating billions in revenue.

🔍 In-Depth Observations:
- There might be inflation effects: a 10M budget in 1980 is not the same as 10M in 2020!

- **Some low-budget films became surprise hits, showing that storytelling matters as much as money.**



---



## **Low Budget Hits** (1)

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='budget', y='revenue', alpha=0.5)
pa = df[df['title'] == 'Paranormal Activity']
plt.scatter(pa['budget'], pa['revenue'], color='lime', s=50, label='Paranormal Activity')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.title('Budget vs. Revenue for Movies')
plt.legend()
plt.show()

🎯 Key Question:
- Are there outliers in the budget-to-revenue trends?

📌 Key Points:
- Most movies follow a budget-to-revenue trend (top-right cluster). Higher budgets generally lead to higher revenue.
- Paranormal Activity is an extreme outlier (green dot).

🔍 In-Depth Observations:
- **This suggests low budget horror films can be massively profitable.**



---



## **Return on Investment for Low Budget Hits** (2)

Let's compare the ROI of low budget movies!

In [None]:
compared_movies = df[df['title'].isin(['The Blair Witch Project', 'Paranormal Activity', 'American Graffiti', 'Mad Max', 'Halloween', 'Dr. No', 'Open Water'])].copy()
compared_movies['ROI'] = compared_movies['revenue'] / compared_movies['budget']
compared_movies = compared_movies.sort_values(by='ROI', ascending=False)
plt.figure(figsize=(10,6))
plt.bar(compared_movies['title'], compared_movies['ROI'], color='skyblue')
plt.ylabel('ROI (Revenue/Budget)')
plt.title('ROI Comparison of Low Budget Hits')
plt.xticks(rotation=45)
plt.show()

📌 Key Points:
- This chart shows Paranormal Activity as the best in terms of ROI in low budget movies, as we already know. Here we see it generates over 12,000 times its budget!
- Even low budget horror films, when executed well, can have surprising financial success.

While this movie showcases the power of minimal budgets, another interesting case is **[spoilers!]**, which followed a very different path to success...

Instead of a low-budget surprise hit, it became a financial powerhouse due to its massive investment in production and marketing. Let's explore its impact.

**Let's find out which movie I'm referring to.**





---



## **Financial Powerhouse**

Let's find the highest budget movies!

In [None]:
highest_budget_movies = df.sort_values(by='budget', ascending=False).head(10)
highest_budget_movies[['title', 'budget', 'revenue']].head(5)

**'Pirates of the Caribbean: On Stranger Tides' takes the lead!**

🎯 Key Question:
- How impactful was Pirates of the Caribbean: On Stranger Tides in terms of popularity?

In [None]:
movie = df[df['title'] == 'Pirates of the Caribbean: On Stranger Tides']
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='budget', y='revenue', size='popularity', alpha=0.5, sizes=(10, 300))
plt.scatter(movie['budget'], movie['revenue'], color='lime', s=300, label='Pirates of the Caribbean')
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.title('Budget vs Revenue (Bubble Size = Popularity)')
plt.legend()
plt.show()

📌 Key Points:
- It stands out in terms of budget; however, its performance in terms of revenue is average.
- That said, it ranks highly in audience engagement.

🔍 In-Depth Observations:
- Most of the movies cluster towards the bottom left, showing that movies with smaller budgets tend to earn less revenue and have smaller bubbles, meaning lower popularity.
- This movie suggests that a high budget and well-executed advertising can lead to high popularity.

# 🎉 **Conclusion**

 In conclusion, this study provides valuable insights into movie production trends, revenue, and success factors, but it's important to acknowledge that the conclusions are limited by the quality of the data. Missing values for budget and revenue, along with some questionable budget figures, affected the ROI results. I had to make assumptions about outliers and focused on general trends, which may not tell the full story...

While the study gives us insight into patterns like increasing production and higher budgets leading to more profitability, it doesn't prove a formula for success. Instead, it highlights common factors among successful films, but these factors might not necessarily be the key reasons behind their success. Ultimately, the creativity, execution, and audience engagement play a crucial role in a movie's true success!

<p style="text-align:center;">
    <img src="https://images.unsplash.com/photo-1554830072-52d78d0d4c18?q=80&w=1935&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
         alt="cute doggie"
         style="width: auto; height: 30px; margin-right: 10px;" />
</p>
