# **TMDB MOVIE DATA ANALYSIS**

This project extracts movie data from the TMDB API, cleans and preprocesses the dataset, and performs in-depth exploratory data analysis. The goal is to uncover meaningful insights, identify trends, and support data-driven decision-making based on movie performance, audience behavior, and industry patterns.

## **Data Extraction**
Extracts movie data from the TMDB API

In [46]:
#Imports packages and modules

#Import os and sys
import os
import sys 
from pathlib import Path
import pandas as pd

#Extracts the root path of the project and appends it to sys.path
project_root = Path().resolve().parent
sys.path.append(str(project_root))

#Imports loadEnv, getURL and create_retry from the config module
from Config.config import loadEnv, getURL, create_retry

#Imports extractDataFromAPI from the extract data module
from Data_Extraction.extractData import extractDataFromAPI

#Imports separateArray from the separate array module
from Data_Cleaning.separateArray import separateArray

#Imports removeColumn from the remove column module
from Data_Cleaning.removeColumn import removeColumn

#Imports the convertDataType from the convert data type module
from Data_Cleaning.convertDataType import convertDataType

In [37]:
movie_ids = [0, 299534, 19995, 140607, 299536, 597, 135397, 420818, 24428, 168259, 99861,
             284054, 12445, 181808, 330457, 351286, 109445, 321612, 260513]

API_KEY = loadEnv(fileName="API_KEY")
url = getURL()
session = create_retry()
data = extractDataFromAPI(session = session, url = url, API_KEY = API_KEY, movie_ids=movie_ids)

#Creates a copy of the extracted data so that the original is not modified
movie_data = data.copy()
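
Note that `movie_ids` holds 19 IDs while the extracted DataFrame reports 18 rows, which suggests that one ID (presumably the leading `0`) returned no movie and was skipped. A minimal sketch of that skip-on-failure pattern follows. `collect_movies` and `fake_fetch` are hypothetical stand-ins used for illustration, not the project's `extractDataFromAPI`.

```python
from typing import Callable, Optional


def collect_movies(movie_ids: list[int],
                   fetch: Callable[[int], Optional[dict]]) -> list[dict]:
    """Fetch each ID and keep only the successful responses."""
    records = []
    for movie_id in movie_ids:
        record = fetch(movie_id)
        if record is None:
            # Assumption: a failed lookup (e.g. HTTP 404) is reported as None
            print(f"Skipping movie id {movie_id}: no data returned")
            continue
        records.append(record)
    return records


# Stand-in for a real HTTP call, used only to make the sketch runnable
def fake_fetch(movie_id: int) -> Optional[dict]:
    known = {299534: "Avengers: Endgame", 19995: "Avatar"}
    title = known.get(movie_id)
    return {"id": movie_id, "title": title} if title else None


records = collect_movies([0, 299534, 19995], fake_fetch)
print(len(records))
```

Passing the fetch function in as an argument keeps the loop testable without network access.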


In [38]:
#Outputs the columns if the dataFrame is not None or empty
movie_data.columns if movie_data is not None and not movie_data.empty else "No data extracted from the API"

Index(['adult', 'backdrop_path', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'origin_country', 'original_language', 'original_title',
       'overview', 'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'belongs_to_collection.id',
       'belongs_to_collection.name', 'belongs_to_collection.poster_path',
       'belongs_to_collection.backdrop_path', 'belongs_to_collection'],
      dtype='object')
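
The dotted names such as `belongs_to_collection.id` are what `pandas.json_normalize` produces when it flattens nested JSON objects. Assuming the extraction step uses it (the source does not show this), the sketch below illustrates why a bare `belongs_to_collection` column can appear alongside the flattened ones: records whose collection is null keep the original key. The two records are simplified examples.

```python
import pandas as pd

# Two simplified records: one inside a collection, one without
records = [
    {"id": 299534, "title": "Avengers: Endgame",
     "belongs_to_collection": {"id": 86311, "name": "The Avengers Collection"}},
    {"id": 597, "title": "Titanic", "belongs_to_collection": None},
]

# Nested dicts are flattened into dotted column names
flat = pd.json_normalize(records)
print(list(flat.columns))
```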

In [39]:
#Outputs the first five rows in the extracted data
movie_data.head()

Unnamed: 0,adult,backdrop_path,budget,genres,homepage,id,imdb_id,origin_country,original_language,original_title,...,tagline,title,video,vote_average,vote_count,belongs_to_collection.id,belongs_to_collection.name,belongs_to_collection.poster_path,belongs_to_collection.backdrop_path,belongs_to_collection
0,False,/9wXPKruA6bWYk2co5ix6fH59Qr8.jpg,356000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 878, ...",https://www.marvel.com/movies/avengers-endgame,299534,tt4154796,[US],en,Avengers: Endgame,...,Avenge the fallen.,Avengers: Endgame,False,8.237,26983,86311.0,The Avengers Collection,/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg,/zuW6fOiusv4X9nnW3paHGfXcSll.jpg,
1,False,/7JNzw1tSZZEgsBw6lu0VfO2X2Ef.jpg,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",https://www.avatar.com/movies/avatar,19995,tt0499549,[US],en,Avatar,...,Enter the world of Pandora.,Avatar,False,7.6,32887,87096.0,Avatar Collection,/3C5brXxnBxfkeKWwA1Fh4xvy4wr.jpg,/6qkJLRCZp9Y3ovXti5tSuhH0DpO.jpg,
2,False,/8BTsTfln4jlQrLXUBquXJ0ASQy9.jpg,245000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",http://www.starwars.com/films/star-wars-episod...,140607,tt2488496,[US],en,Star Wars: The Force Awakens,...,Every generation has a story.,Star Wars: The Force Awakens,False,7.255,20107,10.0,Star Wars Collection,/22dj38IckjzEEUZwN1tPU5VJ1qq.jpg,/qVPChlozQ1BP3svfHjiAdNneMGA.jpg,
3,False,/mDfJG3LC3Dqb67AZ52x3Z0jU0uB.jpg,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",https://www.marvel.com/movies/avengers-infinit...,299536,tt4154756,[US],en,Avengers: Infinity War,...,Destiny arrives all the same.,Avengers: Infinity War,False,8.235,31192,86311.0,The Avengers Collection,/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg,/zuW6fOiusv4X9nnW3paHGfXcSll.jpg,
4,False,/xnHVX37XZEp33hhCbYlQFq7ux1J.jpg,200000000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",https://www.paramountmovies.com/movies/titanic,597,tt0120338,[US],en,Titanic,...,Nothing on earth could come between them.,Titanic,False,7.903,26522,,,,,


In [40]:
#Outputs information of the extracted data
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 30 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   adult                                18 non-null     bool   
 1   backdrop_path                        18 non-null     object 
 2   budget                               18 non-null     int64  
 3   genres                               18 non-null     object 
 4   homepage                             18 non-null     object 
 5   id                                   18 non-null     int64  
 6   imdb_id                              18 non-null     object 
 7   origin_country                       18 non-null     object 
 8   original_language                    18 non-null     object 
 9   original_title                       18 non-null     object 
 10  overview                             18 non-null     object 
 11  popularity                        

## **DATA CLEANING**

This section selects the data to keep for analysis and converts column data types to reduce memory use during execution.


### **DROPPING COLUMNS**

The below columns will be dropped based on the reasons attached:

adult — The adult column is not needed for KPI analysis.

backdrop_path — Rarely needed; large strings/URLs; drop to save space (poster_path is enough).

homepage — optional metadata; drop (keeps size down).

imdb_id — external id not used in spec; drop unless you plan cross-referencing.

origin_country — ambiguous / redundant with production_countries.

original_title — redundant with title for your analysis (drop unless you need original-language title).

video — not useful for KPIs.

belongs_to_collection.id, belongs_to_collection.poster_path, belongs_to_collection.backdrop_path — drop (keep only collection name).

belongs_to_collection (raw JSON) — if you extract the .name into a single column, drop the raw JSON.

Any duplicate columns: for example, where both belongs_to_collection and belongs_to_collection.name exist, keep only the .name value.

In [41]:
#Columns to drop
#Note: the belongs_to_collection.* columns are not dropped in this step
columns_to_remove = ('adult', 'imdb_id', 'original_title', 'video', 'homepage', 'backdrop_path', 'origin_country')

#Calls the remove column function to remove the stated columns from the dataFrame
movie_data = removeColumn(movie_data, columns=columns_to_remove)
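
The implementation of `removeColumn` is not shown in this notebook. As a rough idea of what such a helper could look like, the sketch below drops only the columns that actually exist, so a misspelled or already-removed name does not raise an error. `remove_columns_sketch` is a hypothetical name.

```python
import pandas as pd


def remove_columns_sketch(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of df without the listed columns, ignoring missing names."""
    present = [col for col in columns if col in df.columns]
    return df.drop(columns=present)


demo = pd.DataFrame({"adult": [False], "title": ["Avatar"], "video": [False]})
trimmed = remove_columns_sketch(demo, ("adult", "video", "homepage"))
print(list(trimmed.columns))  # → ['title']
```

The remaining `belongs_to_collection.*` columns listed above could be passed to the same helper in a later step.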

In [42]:
#Outputs the columns that remain after dropping those irrelevant to this analysis
movie_data.columns

Index(['budget', 'genres', 'id', 'original_language', 'overview', 'popularity',
       'poster_path', 'production_companies', 'production_countries',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'status',
       'tagline', 'title', 'vote_average', 'vote_count',
       'belongs_to_collection.id', 'belongs_to_collection.name',
       'belongs_to_collection.poster_path',
       'belongs_to_collection.backdrop_path', 'belongs_to_collection'],
      dtype='object')

### **EXTRACTING KEY DATA POINTS**

This section extracts the key data points from the genres, spoken_languages, production_companies and production_countries columns:

Genre names (genres → separate multiple genres with "|").

Spoken languages (spoken_languages → separate with "|").

Production countries (production_countries → separate with "|").

Production companies (production_companies → separate with "|").

In [43]:
movie_data = separateArray(data=movie_data, 
                           columns={"genres":"name", "production_countries": "name",
                                    "spoken_languages": "name", "production_companies": "name"}
                            )
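
The body of `separateArray` is not shown here. The sketch below illustrates the transformation described above for one column: each cell holds a list of dictionaries, and the values under a given key are joined with "|". `join_names` is a hypothetical helper, and returning a missing value for empty or non-list cells is an assumption.

```python
import pandas as pd


def join_names(items, key="name", sep="|"):
    """Join the `key` value of each dict in a list; return NA for non-lists."""
    if not isinstance(items, list) or not items:
        return pd.NA
    return sep.join(str(item[key]) for item in items if key in item)


demo = pd.DataFrame({
    "genres": [
        [{"id": 18, "name": "Drama"}, {"id": 10749, "name": "Romance"}],
        None,
    ]
})
demo["genres"] = demo["genres"].apply(join_names)
print(demo["genres"].tolist())
```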

In [45]:
#Value count of the modified columns
modified_columns = ["genres", "production_countries", "spoken_languages", "production_companies"]

for modified_column in modified_columns:
    print(f"\n ---- {modified_column.upper()} -------")
    #Counts movies with the same column data and outputs it
    print(movie_data[modified_column].value_counts(dropna=False).head(20))


 ---- GENRES -------
genres
Adventure|Action|Science Fiction             3
Action|Adventure|Science Fiction|Thriller    2
Action|Adventure|Science Fiction             2
Action|Adventure|Fantasy|Science Fiction     1
Drama|Romance                                1
Adventure|Science Fiction|Action             1
Adventure|Drama|Family|Animation             1
Science Fiction|Action|Adventure             1
Action|Crime|Thriller                        1
Adventure|Fantasy                            1
Family|Animation|Adventure|Comedy|Fantasy    1
Animation|Family|Adventure|Fantasy           1
Family|Fantasy|Romance                       1
Action|Adventure|Animation|Family            1
Name: count, dtype: int64

 ---- PRODUCTION_COUNTRIES -------
production_countries
United States of America                   16
United States of America|United Kingdom     1
United Kingdom|United States of America     1
Name: count, dtype: int64

 ---- SPOKEN_LANGUAGES -------
spoken_languages
English          
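
In the counts above, the same set of values in a different order is tallied as a separate entry (for example `United States of America|United Kingdom` and `United Kingdom|United States of America`). One option, sketched here on a small made-up sample rather than the real data, is to split on the separator and count each value individually, or to sort the parts before counting.

```python
import pandas as pd

sample = pd.Series(
    ["Adventure|Action|Science Fiction",
     "Action|Adventure|Science Fiction",
     "Drama|Romance"],
    name="genres",
)

# One row per individual genre, then count occurrences
per_genre = sample.str.split("|").explode().value_counts()

# Alternatively, sort the parts so that equal sets compare equal
normalised = sample.apply(lambda s: "|".join(sorted(s.split("|"))))

print(per_genre)
print(normalised.value_counts())
```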

### **HANDLING INCORRECT AND MISSING DATA**

#### **CONVERT COLUMN DATA TYPES**

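Casts budget and id to int64, the text columns (title, poster_path, tagline, overview) to string, and status to category. The cell also requests int64 for popularity, but that column holds fractional values (e.g., 14.7068), and the info() output after the conversion still reports it as float64.

Converts release_date to datetime.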

In [47]:
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   budget                               18 non-null     int64  
 1   genres                               18 non-null     object 
 2   id                                   18 non-null     int64  
 3   original_language                    18 non-null     object 
 4   overview                             18 non-null     object 
 5   popularity                           18 non-null     float64
 6   poster_path                          18 non-null     object 
 7   production_companies                 18 non-null     object 
 8   production_countries                 18 non-null     object 
 9   release_date                         18 non-null     object 
 10  revenue                              18 non-null     int64  
 11  runtime                           

In [48]:
movie_data.head()

Unnamed: 0,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,...,status,tagline,title,vote_average,vote_count,belongs_to_collection.id,belongs_to_collection.name,belongs_to_collection.poster_path,belongs_to_collection.backdrop_path,belongs_to_collection
0,356000000,Adventure|Science Fiction|Action,299534,en,After the devastating events of Avengers: Infi...,14.7068,/bR8ISy1O9XQxqiy0fQFw2BX72RQ.jpg,Marvel Studios,United States of America,2019-04-24,...,Released,Avenge the fallen.,Avengers: Endgame,8.237,26983,86311.0,The Avengers Collection,/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg,/zuW6fOiusv4X9nnW3paHGfXcSll.jpg,
1,237000000,Action|Adventure|Fantasy|Science Fiction,19995,en,"In the 22nd century, a paraplegic Marine is di...",39.5744,/gKY6q7SjCkAU6FqvqWybDYgUKIF.jpg,Dune Entertainment|Lightstorm Entertainment|20...,United States of America|United Kingdom,2009-12-15,...,Released,Enter the world of Pandora.,Avatar,7.6,32887,87096.0,Avatar Collection,/3C5brXxnBxfkeKWwA1Fh4xvy4wr.jpg,/6qkJLRCZp9Y3ovXti5tSuhH0DpO.jpg,
2,245000000,Adventure|Action|Science Fiction,140607,en,Thirty years after defeating the Galactic Empi...,8.615,/wqnLdwVXoBjKibFRR5U3y0aDUhs.jpg,Lucasfilm Ltd.|Bad Robot,United States of America,2015-12-15,...,Released,Every generation has a story.,Star Wars: The Force Awakens,7.255,20107,10.0,Star Wars Collection,/22dj38IckjzEEUZwN1tPU5VJ1qq.jpg,/qVPChlozQ1BP3svfHjiAdNneMGA.jpg,
3,300000000,Adventure|Action|Science Fiction,299536,en,As the Avengers and their allies have continue...,21.7168,/7WsyChQLEftFiDOVTGkv3hFpyyt.jpg,Marvel Studios,United States of America,2018-04-25,...,Released,Destiny arrives all the same.,Avengers: Infinity War,8.235,31192,86311.0,The Avengers Collection,/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg,/zuW6fOiusv4X9nnW3paHGfXcSll.jpg,
4,200000000,Drama|Romance,597,en,101-year-old Rose DeWitt Bukater tells the sto...,27.2002,/9xjZS2rlVxm8SFx8kPC3aIGCOYQ.jpg,Paramount Pictures|20th Century Fox|Lightstorm...,United States of America,1997-11-18,...,Released,Nothing on earth could come between them.,Titanic,7.903,26522,,,,,


In [None]:
#Checks the unique values in the status column
movie_data["status"].unique()

array(['Released'], dtype=object)

In [50]:
movie_data = convertDataType(
                                data=movie_data, 
                                columns={"budget": "int64", "id": "int64", "popularity": "int64",
                                         "release_date": "datetime", "title": "string", "poster_path": "string",
                                         "tagline": "string", "status": "category", "overview": "string"
                                         }
                            )
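
`convertDataType` is a project helper whose body is not shown. A minimal sketch of a function that accepts the same kind of mapping might look like the following, assuming that the label "datetime" is routed to `pd.to_datetime` and every other label is passed to `astype`. `convert_dtypes_sketch` is a hypothetical name.

```python
import pandas as pd


def convert_dtypes_sketch(df: pd.DataFrame, columns: dict) -> pd.DataFrame:
    """Cast each listed column to the requested dtype on a copy of df."""
    out = df.copy()
    for col, dtype in columns.items():
        if dtype == "datetime":
            # Unparseable dates become NaT instead of raising
            out[col] = pd.to_datetime(out[col], errors="coerce")
        else:
            out[col] = out[col].astype(dtype)
    return out


demo = pd.DataFrame({
    "budget": [356000000, 237000000],
    "release_date": ["2019-04-24", "2009-12-15"],
    "status": ["Released", "Released"],
    "title": ["Avengers: Endgame", "Avatar"],
})
converted = convert_dtypes_sketch(
    demo,
    {"budget": "int64", "release_date": "datetime",
     "status": "category", "title": "string"},
)
print(converted.dtypes)
```

Casting a fractional column such as popularity with `astype("int64")` would truncate the decimals, so it is left out of the mapping in this sketch.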

In [51]:
#Gets Info of the data
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   budget                               18 non-null     int64         
 1   genres                               18 non-null     object        
 2   id                                   18 non-null     int64         
 3   original_language                    18 non-null     object        
 4   overview                             18 non-null     string        
 5   popularity                           18 non-null     float64       
 6   poster_path                          18 non-null     string        
 7   production_companies                 18 non-null     object        
 8   production_countries                 18 non-null     object        
 9   release_date                         18 non-null     datetime64[ns]
 10  revenue         

### **REPLACE ALL UNREALISTIC VALUES**

Replaces all unrealistic values:

Budget/Revenue/Runtime = 0 → Replace with NaN or infer from similar movies.

Convert 'budget' and 'revenue' to million USD.

Movies with vote_count = 0 → Analyze their vote_average and adjust accordingly.

'overview' and 'tagline' → Replace known placeholders (e.g., 'No Data') with NaN.
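
The last two rules could be sketched as follows. The placeholder list and the choice to blank out `vote_average` when `vote_count` is 0 are assumptions made for illustration, and `clean_text_and_votes` is a hypothetical helper.

```python
import pandas as pd

PLACEHOLDERS = ["No Data", ""]  # assumed placeholder strings


def clean_text_and_votes(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in ("overview", "tagline"):
        # mask() replaces the matching entries with a missing value
        out[col] = out[col].mask(out[col].isin(PLACEHOLDERS))
    # Assumption: an average based on zero votes carries no information
    out.loc[out["vote_count"] == 0, "vote_average"] = float("nan")
    return out


demo = pd.DataFrame({
    "overview": ["A story.", "No Data"],
    "tagline": ["", "Catchy."],
    "vote_count": [120, 0],
    "vote_average": [7.5, 0.0],
})
cleaned = clean_text_and_votes(demo)
print(cleaned)
```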

#### **REPLACES BUDGET, REVENUE AND RUNTIME**

Replaces every zero in the budget, revenue and runtime columns with NaN; existing NaNs are left as they are.

In [65]:
#Counts the zeros and NaNs in the budget, revenue and runtime columns
unrealistic_columns = ["budget", "revenue", "runtime"]

#Loops through the unrealistic columns and outputs the number of zeros and NaNs in each
for column in unrealistic_columns:
    print(f"{column.upper()} has {(movie_data[column] == 0).sum()} zeros \n")
    print(f"{column.upper()} has {(movie_data[column].isna()).sum()} NaNs \n")
    print("------------\n")



BUDGET has 0 zeros 

BUDGET has 0 NaNs 

------------

REVENUE has 0 zeros 

REVENUE has 0 NaNs 

------------

RUNTIME has 0 zeros 

RUNTIME has 0 NaNs 

------------



Since none of the columns contains zeros or NaNs, no transformation is needed here.
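
For a larger extract in which zeros do occur, the replacement could be done along these lines. This is a sketch on made-up values, and `zeros_to_nan` is a hypothetical helper.

```python
import pandas as pd


def zeros_to_nan(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    out = df.copy()
    # mask() turns every cell equal to 0 into NaN;
    # integer columns may become float as a result
    out[columns] = out[columns].mask(out[columns] == 0)
    return out


demo = pd.DataFrame({
    "budget": [200000000, 0],
    "revenue": [0, 500],
    "runtime": [120, 95],
})
cleaned = zeros_to_nan(demo, ["budget", "revenue", "runtime"])
print(cleaned.isna().sum())
```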

#### **CONVERT BUDGET AND REVENUE TO MILLION USD**

Converts the budget and revenue columns to million USD
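
The conversion itself is a division by one million. The sketch below uses two budget values from the extracted data; the new column name `budget_musd` is a hypothetical choice, and revenue would be handled the same way.

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["Avengers: Endgame", "Avatar"],
    "budget": [356000000, 237000000],
})

# Hypothetical column name; the source does not say how the result is stored
demo["budget_musd"] = demo["budget"] / 1_000_000
print(demo[["title", "budget_musd"]])
```

Writing to a new column keeps the raw values available; overwriting `budget` in place is equally possible if the original figures are no longer needed.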