# <font color='darkred'>**TMDb Movies**</font>

## **Table of Contents**
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## **Introduction**

> This dataset contains information about roughly 10,000 movies, short films and TV series
collected from The Movie Database (TMDb), including user ratings, revenue, runtime and budget.

## **Generate Questions**

In this project, I'll answer the following questions:
- Which genre has the highest median popularity?
- Which genre has the highest median revenue?
- Which genre has the highest median vote_count?
- Which genre has the highest mean vote_average?
- Which movie is considered the "best" in each genre?
- Which month is considered "best" for releasing a film?
- How are the numeric columns related to one another?

## **Import Libraries**

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

<a id='wrangling'></a>
## **Data Wrangling**


### General Properties

In [6]:
df = pd.read_csv("https://raw.githubusercontent.com/Dee-M123/lab-tmdb-movies-eda-wrangling/refs/heads/main/tmdb-movies.csv")

df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [8]:
# df.info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

## **Data Cleaning**

In [9]:
import pandas as pd

def display_missing_values(df):
    """
    Displays the number and percentage of missing values per column.
    Only columns with missing values are shown.
    """

    missing_count = df.isnull().sum()
    missing_percent = (missing_count / len(df)) * 100

    missing_df = pd.DataFrame({
        'Missing Count': missing_count,
        'Missing Percentage (%)': missing_percent
    })

    # Filter columns with at least 1 missing value
    missing_df = missing_df[missing_df['Missing Count'] > 0]

    # Sort by highest missing percentage
    missing_df = missing_df.sort_values(by='Missing Percentage (%)', ascending=False)

    if missing_df.empty:
        print("No missing values found.")
    else:
        print("Columns with Missing Values:")
        display(missing_df)



In [10]:
display_missing_values(df)

Columns with Missing Values:


Unnamed: 0,Missing Count,Missing Percentage (%)
homepage,7930,72.979937
tagline,2824,25.989324
keywords,1493,13.740107
production_companies,1030,9.479109
cast,76,0.699429
director,44,0.404933
genres,23,0.211669
imdb_id,10,0.09203
overview,4,0.036812


In [12]:
df.dropna(subset=['genres'], inplace=True)

<a id='eda'></a>
## **Exploratory Data Analysis**

### What's the genre with the highest median popularity?

In [16]:
median_pop = (df.groupby('genres')['popularity'].median()
.sort_values(ascending=False))

median_pop.head(10)

Unnamed: 0_level_0,popularity
genres,Unnamed: 1_level_1
Adventure|Science Fiction|Thriller,13.112507
Adventure|Drama|Science Fiction,12.699699
Science Fiction|Adventure|Thriller,10.739009
Action|Thriller|Science Fiction|Mystery|Adventure,9.363643
Western|Drama|Adventure|Thriller,9.1107
Adventure|Family|Animation|Action|Comedy,8.691294
Science Fiction|Action|Thriller|Adventure,8.654359
Action|Animation|Horror,8.411577
History|Drama|Thriller|War,8.110711
Drama|Adventure|Science Fiction,7.6674
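
Grouping on the raw `genres` column treats each pipe-delimited combination (for example `Adventure|Science Fiction|Thriller`) as its own category, so the ranking above is per combination rather than per single genre. If a per-genre answer is wanted, one option is to split the string and explode it into one row per genre before grouping. The sketch below uses a small made-up `toy` frame instead of the real data; the names `toy`, `per_genre` and `median_pop_by_genre` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; titles and values are made up.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "genres": ["Action|Adventure", "Action|Drama", "Drama", "Adventure|Drama"],
    "popularity": [10.0, 4.0, 2.0, 6.0],
})

# Split the pipe-delimited string into a list, then give each genre its own row.
per_genre = (
    toy.assign(genre=toy["genres"].str.split("|"))
       .explode("genre")
)

# Median popularity per individual genre, highest first.
median_pop_by_genre = (
    per_genre.groupby("genre")["popularity"]
             .median()
             .sort_values(ascending=False)
)
print(median_pop_by_genre)
```

With this approach a film tagged with several genres contributes to each of them, so the per-genre groups overlap.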


### What's the genre with the highest median revenue?

In [23]:
df_revenue = df[df['revenue'] > 0]

median_revenue = (df_revenue.groupby('genres')['revenue'].median()
.sort_values(ascending=False))

median_revenue.head(10)

Unnamed: 0_level_0,revenue
genres,Unnamed: 1_level_1
Action|Adventure|Science Fiction|Fantasy,2068178000.0
Crime|Drama|Mystery|Thriller|Action,1106280000.0
Family|Fantasy|Adventure,1025467000.0
Adventure|Fantasy|Family|Mystery,938212700.0
Science Fiction|Thriller|Action|Adventure,847423500.0
Adventure|Fantasy|Family,833246500.0
Action|Thriller|Science Fiction|Mystery|Adventure,825500000.0
Science Fiction|Adventure|Family|Fantasy,792910600.0
Family|Animation|Drama,788241800.0
Fantasy|Adventure|Action|Family|Romance,758410400.0


### What's the genre with the highest median vote_count?

In [27]:
median_vote_count = (df.groupby('genres')['vote_count'].median().sort_values(ascending=False))

median_vote_count.head(3)

Unnamed: 0_level_0,vote_count
genres,Unnamed: 1_level_1
Action|Thriller|Science Fiction|Mystery|Adventure,9767.0
Science Fiction|Adventure|Fantasy,7080.0
Drama|Adventure|Science Fiction,4572.0


### What's the genre with the highest mean vote_average?

In [29]:
mean_vote_average = (df.groupby('genres')['vote_average'].mean().sort_values(ascending=False))

mean_vote_average.head(5)

Unnamed: 0_level_0,vote_average
genres,Unnamed: 1_level_1
Drama|Horror|Mystery|Science Fiction|Thriller,8.8
Music|Drama|Fantasy|Romance,8.4
Thriller|Documentary,8.2
Science Fiction|Adventure|Family,8.0
Mystery|Documentary|Crime,8.0
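
A mean taken over a group with only one or two titles can land at the top of the ranking by chance. One way to check this is to report the group size next to the mean and, optionally, require a minimum number of titles. The sketch below runs on made-up values; `toy`, `summary`, `robust` and the cutoff `min_titles` are hypothetical names and numbers chosen for illustration.

```python
import pandas as pd

# Made-up ratings; one combination has a single title.
toy = pd.DataFrame({
    "genres": ["Drama", "Drama", "Drama", "Thriller|Documentary", "Comedy", "Comedy"],
    "vote_average": [7.0, 6.0, 8.0, 8.2, 5.0, 6.0],
})

# Mean rating and number of titles per genre combination.
summary = (
    toy.groupby("genres")["vote_average"]
       .agg(["mean", "count"])
       .sort_values("mean", ascending=False)
)

# Hypothetical cutoff: keep only combinations with at least this many titles.
min_titles = 2
robust = summary[summary["count"] >= min_titles]

print(summary)
print(robust)
```

In the toy data the single-title combination leads the unfiltered ranking but drops out once the cutoff is applied.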


### Which movie is considered the "best" in each genre?


In [30]:
df_filtered = df[df['vote_count'] >= 50]

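# For each genre combination, keep the title with the highest vote_average
# (ties broken by vote_count), then rank the winners by rating.
best_movies = (
    df_filtered
    .sort_values(['genres', 'vote_average', 'vote_count'], ascending=[True, False, False])
    .groupby('genres')
    .first()[['original_title', 'vote_average', 'vote_count']]
    .sort_values('vote_average', ascending=False)
)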

best_movies

Unnamed: 0_level_0,original_title,vote_average,vote_count
genres,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Adventure|Documentary,The Art of Flight,8.5,60
Documentary,The Jinx: The Life and Deaths of Robert Durst,8.4,72
Drama|Crime,The Shawshank Redemption,8.4,5754
Crime|Documentary,Dear Zachary: A Letter to a Son About His Father,8.3,74
Drama|Music,Whiplash,8.2,2372
...,...,...,...
Action|Adventure|Fantasy|Horror,BloodRayne,3.8,62
Thriller|Drama|Science Fiction,Monsters: Dark Continent,3.8,60
Adventure|Thriller,Jaws: The Revenge,3.7,111
TV Movie|Action|Science Fiction,San Andreas Quake,3.7,60
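
The `vote_count >= 50` filter is a hard cutoff: a title with 60 votes and one with 5,000 votes are then compared on `vote_average` alone. A different technique, not used in this notebook, is a Bayesian-style weighted rating that shrinks each title's average toward the overall mean in proportion to how few votes it has:

$$WR = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C$$

where $R$ is the title's `vote_average`, $v$ its `vote_count`, $m$ a chosen minimum vote count and $C$ the mean rating across all titles. The sketch below applies it to made-up values; the titles, the choice `m = 50` and the column name `weighted_rating` are assumptions for illustration.

```python
import pandas as pd

# Made-up titles and ratings.
toy = pd.DataFrame({
    "original_title": ["Niche Doc", "Big Drama", "Mid Film"],
    "vote_average": [8.5, 8.4, 6.0],
    "vote_count": [60, 5000, 500],
})

m = 50                          # assumed minimum vote count (mirrors the filter threshold)
C = toy["vote_average"].mean()  # mean rating across the toy titles

v = toy["vote_count"]
R = toy["vote_average"]

# Titles with few votes are pulled toward C; titles with many votes keep roughly R.
toy["weighted_rating"] = (v / (v + m)) * R + (m / (v + m)) * C

ranked = toy.sort_values("weighted_rating", ascending=False)
print(ranked[["original_title", "vote_average", "vote_count", "weighted_rating"]])
```

In the toy data the heavily voted title overtakes the one with a slightly higher raw average but far fewer votes.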


### Which month is considered "best" for releasing a film or show?


In [36]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

df['release_month'] = df['release_date'].dt.month

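df_revenue = df[df['revenue'] > 0]

# Note: the mean below is taken over every row of df, zero-revenue entries included;
# df_revenue is defined above but is not used in this calculation.
best_month = (df.groupby('release_month')['revenue'].mean().sort_values(ascending=False))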

best_month

Unnamed: 0_level_0,revenue
release_month,Unnamed: 1_level_1
6,74649620.0
5,62444140.0
12,59339310.0
7,56869960.0
11,56383610.0
3,38194540.0
4,33115760.0
2,28811910.0
8,27814160.0
10,25569430.0
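
Whether zero-revenue rows are included changes the monthly means, because each zero pulls its month's average down. A quick way to see the effect is to compute both versions side by side. The sketch below does this on made-up dates and revenues; `toy`, `mean_all`, `mean_reported` and `comparison` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Made-up release dates (same m/d/yy style as the dataset) and revenues.
toy = pd.DataFrame({
    "release_date": ["6/9/15", "6/20/14", "12/15/15", "12/1/13", "12/25/12"],
    "revenue": [300, 0, 200, 100, 0],
})
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")
toy["release_month"] = toy["release_date"].dt.month

# Mean revenue per month over all rows, zeros included.
mean_all = toy.groupby("release_month")["revenue"].mean()

# Mean revenue per month over rows that report a positive revenue.
mean_reported = toy[toy["revenue"] > 0].groupby("release_month")["revenue"].mean()

comparison = pd.DataFrame({
    "mean_all_rows": mean_all,
    "mean_revenue_gt_0": mean_reported,
})
print(comparison)
```

Neither version is automatically the right one: the first understates revenue if zeros are really missing values, and the second ignores titles that may genuinely have earned nothing.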


<a id='conclusions'></a>
## **Conclusions**

- Genre combinations that include Drama or Documentary tend to have the highest average ratings, suggesting a strong reception among TMDb voters.

- High-engagement genre combinations (based on median vote_count) typically include Action, Adventure, and Science Fiction, indicating strong audience interest.

- Financially, blockbuster-style genre combinations (e.g., Action|Adventure|Science Fiction|Fantasy) generate the largest median revenues.

- By mean revenue, June is the strongest release month, followed by May and December.

## **Limitations**

- Many films report zero revenue and zero budget, which may represent missing data rather than true values.

- Removing zero-revenue entries may introduce bias toward commercially successful films.

- Genres are grouped by their full pipe-delimited combination, so each result describes a specific combination rather than a single genre.
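
The concern about zeros standing in for missing values can be quantified before deciding how to handle them. One option is to measure the share of zeros per column and then recode them as `NaN`, so that pandas skips them in medians and means without dropping whole rows. The sketch below uses made-up numbers; `toy`, `money_cols`, `zero_share` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up budgets and revenues; 0 is assumed here to mean "not reported".
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [100, 0, 50, 0],
    "revenue": [300, 0, 0, 80],
})

money_cols = ["budget", "revenue"]

# Percentage of rows that are exactly zero in each column.
zero_share = (toy[money_cols] == 0).mean() * 100
print(zero_share)

# Recode zeros as missing on a copy, leaving the original frame untouched.
cleaned = toy.copy()
cleaned[money_cols] = cleaned[money_cols].replace(0, np.nan)

# Aggregations now skip the missing entries instead of counting them as 0.
print(toy["revenue"].median(), cleaned["revenue"].median())
```

This keeps every title available for the rating and popularity questions while excluding unreported values from the financial ones.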