# Creating the Fact Table

In this file I enrich the daily box office data previously extracted from TMDb with IMDb identifiers, genres, runtimes, and special days (US public holidays).

## Importing necessary libraries


In [2]:
import pandas as pd
import numpy as np
import re
import os
import sys
import subprocess

## Loading the datasets

To attach a unique identifier (the IMDb ID in this case) to the movie dataset I created, I have to load two separate files: the box office data and the IMDb title basics file.

In [4]:
# 1. AUTO-INSTALL GDOWN (If missing)
try:
    import gdown
except ImportError:
    print("gdown not found. Installing...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "gdown"])
    import gdown

# 2. FILE MAPPING
# Map each local filename to its Google Drive share link
files = {
    "Box office new data.csv": "https://drive.google.com/file/d/1WJbnWtcpDzGCJjSo8W9sDdU7ELsX2pTU/view?usp=share_link",
    "title.basics.tsv":    "https://drive.google.com/file/d/1xoAUaTtZ-3Wn9IX9XcHJ5Tv8T0HA0n7w/view?usp=share_link"
}

# 3. DOWNLOADER LOOP
for filename, drive_link in files.items():
    if not os.path.exists(filename):
        print(f"Downloading {filename}...")

        # Extract ID from the link safely
        try:
            file_id = drive_link.split('/d/')[1].split('/')[0]
            url = f'https://drive.google.com/uc?id={file_id}'

            # Download (quiet=False shows the progress bar)
            gdown.download(url, filename, quiet=False)
        except IndexError:
            print(f"Error: Could not parse ID for {filename}. Check the link.")
    else:
        print(f"Found {filename} locally. Skipping download.")

# 4. LOAD DATA
print("\nLoading Dataframes...")

# Load Mojo (CSV)
if os.path.exists("Box office new data.csv"):
    df = pd.read_csv("Box office new data.csv")
    print("TMDb Data Loaded.")

# Load IMDB (TSV)
if os.path.exists("title.basics.tsv"):
    df2 = pd.read_csv("title.basics.tsv", sep='\t', low_memory=False)
    print("IMDB Data Loaded.")

Found Box office new data.csv locally. Skipping download.
Found title.basics.tsv locally. Skipping download.

Loading Dataframes...


Mojo Data Loaded.
IMDB Data Loaded.
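
The downloader loop above pulls the file ID out of each share link with two `split` calls. As a minimal sketch of the same step, the ID can also be extracted with a regular expression that returns `None` instead of raising when the link has no `/d/` segment. The helper name `extract_drive_file_id` is hypothetical and not part of `gdown`.

```python
import re


def extract_drive_file_id(link):
    """Return the file ID from a Drive share link, or None if absent (hypothetical helper)."""
    match = re.search(r"/d/([^/?#]+)", link)
    return match.group(1) if match else None


share_link = "https://drive.google.com/file/d/1WJbnWtcpDzGCJjSo8W9sDdU7ELsX2pTU/view?usp=share_link"
file_id = extract_drive_file_id(share_link)

# Same direct-download URL pattern that the loop above builds
direct_url = f"https://drive.google.com/uc?id={file_id}"
```

Returning `None` lets the caller decide how to report a malformed link, rather than relying on an `IndexError`.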


## Looking at the Dataset

In [5]:
df.head()

Unnamed: 0,TD,YD,Release,Daily,%¬± YD,%¬± LW,Theaters,Avg,To Date,Days,Distributor,date
0,1,1,Marley & Me,"$9,956,212",44.60%,-30.80%,3480,"$2,860","$82,400,283",8,Twentieth Century Fox,1/1/09
1,2,2,Bedtime Stories,"$8,336,917",46.40%,-21.20%,3681,"$2,264","$65,037,829",8,Walt Disney Studios Motion Pictures,1/1/09
2,3,3,The Curious Case of Benjamin Button,"$7,939,690",85.40%,-33.10%,2988,"$2,657","$60,605,838",8,Paramount Pictures,1/1/09
3,4,4,Valkyrie,"$5,747,446",64.80%,-32.30%,2711,"$2,120","$46,649,304",8,United Artists,1/1/09
4,5,5,Yes Man,"$5,567,221",102.30%,-6.10%,3434,"$1,621","$65,596,911",14,Warner Bros.,1/1/09


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 269406 entries, 0 to 269405
Data columns (total 12 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   TD           269406 non-null  int64 
 1   YD           269406 non-null  object
 2   Release      269406 non-null  object
 3   Daily        269406 non-null  object
 4   %¬± YD       269406 non-null  object
 5   %¬± LW       269406 non-null  object
 6   Theaters     269406 non-null  object
 7   Avg          269406 non-null  object
 8   To Date      269406 non-null  object
 9   Days         269406 non-null  object
 10  Distributor  268604 non-null  object
 11  date         269406 non-null  object
dtypes: int64(1), object(11)
memory usage: 24.7+ MB


### File Information:

**Box Office new data** is the dataset of movies from 2009 to 2025 that I created using the TMDb API. It contains the following columns:

TD (Today's Rank): Each movie's ranking for the specific day based on daily gross revenue (starting at 1 for top-grossing).

YD (Yesterday's Rank): Movie ranking from the previous day.

Release: The title of the movie.

Daily: Gross revenue earned at the box office on the specific date.

%± YD (Percent Change from Yesterday): Percentage increase or decrease in revenue compared to the previous day.

%± LW (Percent Change from Last Week): Percentage increase or decrease in revenue compared to the same day one week prior.

Theaters: Total number of movie theaters showing the film on that date.

Avg (Average per Theater): Daily gross divided by the number of Theaters.

To Date: The cumulative total gross revenue earned by the movie from its release up to the current date.

Days: The number of days the movie has been in release (starting at Day 1 for opening day).

Distributor: The studio or company responsible for releasing the movie.

date: The specific calendar date for this row of data.

The table above gives daily box office information for movies in theaters from 2009 to 2025. It still lacks important information such as a unique identifier, genres, and runtime, which I add using the following file from the official IMDb website.

In [8]:
df2.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short


### File information:

**title.basics.tsv.gz** is a non-commercial IMDb dataset which contains the following:

tconst (string) - alphanumeric unique identifier of the title

titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)

primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release

originalTitle (string) - original title, in the original language

isAdult (boolean) - 0: non-adult title; 1: adult title

startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year

endYear (YYYY) – TV Series end year. '\N' for all other title types

runtimeMinutes – primary runtime of the title, in minutes

genres (string array) – includes up to three genres associated with the title

## Cleaning and Filtering the Data

Since several different movies can share the same title, I merge on both title and year to get the closest match and improve accuracy. I therefore clean the title and year columns in both datasets.

In [None]:
df['Release_cleaned'] = df['Release'].str.lower().str.replace(r'[^a-zA-Z0-9 ]', '', regex=True) #standardized for better matching
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y', errors='coerce') #proper datetime format for necessary extraction later
df['Release_Year'] = df['date'].dt.year #year of the box office record, used as a proxy for the release year
df['Release_Year'] = pd.to_numeric(df['Release_Year'], errors='coerce')

In [None]:
df2['primaryTitle_cleaned'] = df2['primaryTitle'].str.lower().str.replace(r'[^a-zA-Z0-9 ]', '', regex=True) #standardized for better matching
df2['startYear'] = pd.to_numeric(df2['startYear'], errors='coerce')
df2['startYear'] = df2['startYear'].astype('Int64')
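
To make the effect of the title cleaning concrete, here is a minimal sketch using plain `re` with the same pattern as the cells above. The function names are hypothetical. Note that removing a character such as `&` leaves two consecutive spaces behind. Because both datasets are cleaned the same way the titles still match, but the second helper shows an optional variant (an assumption, not what the notebook does) that also collapses repeated spaces.

```python
import re


def clean_title(title):
    """Lowercase and keep only letters, digits, and spaces (same pattern as above)."""
    return re.sub(r"[^a-zA-Z0-9 ]", "", title.lower())


def clean_title_collapsed(title):
    """Hypothetical variant: also collapse repeated spaces and trim the ends."""
    return re.sub(r" +", " ", clean_title(title)).strip()


examples = ["Marley & Me", "Yes Man", "The Curious Case of Benjamin Button"]
cleaned = [clean_title(t) for t in examples]
collapsed = [clean_title_collapsed(t) for t in examples]
```

The collapsed variant is only worth adopting if it is applied to both `Release` and `primaryTitle`; cleaning one side differently would break matches.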

Since I am creating a movie prediction dataset, other types of media (shorts, TV series, episodes, and so on) are irrelevant and would only add noise. I therefore keep only movie records and only the IMDb columns needed for the merge:

In [None]:
df_movies = df2[df2['titleType'] == 'movie'] #Only movie type records are selected
df_movies = df_movies[['primaryTitle_cleaned', 'startYear', 'tconst', 'genres', 'runtimeMinutes']] #All relevant columns for the merge
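
The IMDb file marks missing values with the literal string `\N`, which is why `startYear` is passed through `pd.to_numeric(..., errors='coerce')` before the merge. The sketch below, on a small made-up frame, shows the filter and the conversion together. Treating `runtimeMinutes` the same way is an assumption on my part and is not done in the notebook.

```python
import pandas as pd

# Toy stand-in for df2; the rows are made up for illustration
raw = pd.DataFrame({
    "titleType": ["movie", "short", "movie"],
    "startYear": ["2008", "1894", "\\N"],
    "runtimeMinutes": ["115", "1", "\\N"],
})

# Keep only feature films, as in the cell above
movies = raw[raw["titleType"] == "movie"].copy()

# '\N' cannot be parsed as a number, so errors='coerce' turns it into a missing value
movies["startYear"] = pd.to_numeric(movies["startYear"], errors="coerce").astype("Int64")
movies["runtimeMinutes"] = pd.to_numeric(movies["runtimeMinutes"], errors="coerce")
```

Another option would be to pass `na_values='\\N'` to `pd.read_csv` when loading the file, so the placeholder is converted at load time.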

## Merging the Datasets

In [None]:
# Adding a temporary unique identifier to df to track original rows
df['orig_id'] = range(len(df))

# Creating expanded df_movies with year offsets for tolerance
offsets = [-2, -1, 0, 1, 2]
df_movies_expanded_list = []
for offset in offsets:
    temp = df_movies.copy()
    temp['adjusted_year'] = temp['startYear'] + offset
    df_movies_expanded_list.append(temp)
df_movies_expanded = pd.concat(df_movies_expanded_list, ignore_index=True)

# Performing left merge on cleaned titles and adjusted year
merged = pd.merge(
    df,
    df_movies_expanded,
    left_on=['Release_cleaned', 'Release_Year'],
    right_on=['primaryTitle_cleaned', 'adjusted_year'],
    how='left'
)

# Dropping the adjusted_year column
merged = merged.drop('adjusted_year', axis=1)

# Calculating year difference (use a large value for unmatched rows)
merged['year_diff'] = abs(merged['startYear'] - merged['Release_Year']).fillna(999)

# For each original row, selecting the match with the smallest year_diff
min_diff_indices = merged.groupby('orig_id')['year_diff'].idxmin()
merged_clean = merged.loc[min_diff_indices]

# Sorting back to original order and cleaning up temporary columns
merged_clean = merged_clean.sort_values('orig_id').drop(['orig_id', 'year_diff'], axis=1)

# Now merged_clean contains the merged dataframe with all original df rows preserved,
# and best-matching rows from df_movies based on title and year tolerance <= 2
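
The year-tolerance merge above is easier to follow on a toy example. The sketch below uses made-up rows and placeholder identifiers (`tt_a`, `tt_b`, `tt_c`), and the column names `title`, `year`, and `imdb_title` are simplified stand-ins for the real ones. It shows that a title with two IMDb candidates within two years keeps only the closest one, and that an unmatched title is preserved with a missing `tconst`.

```python
import pandas as pd

# Toy box office rows (left side of the merge)
box = pd.DataFrame({
    "title": ["valkyrie", "yes man", "unknown film"],
    "year": [2009, 2009, 2009],
})
# Toy IMDb rows: two candidates share the title 'valkyrie'
imdb = pd.DataFrame({
    "imdb_title": ["valkyrie", "valkyrie", "yes man"],
    "startYear": [2008, 2007, 2008],
    "tconst": ["tt_a", "tt_b", "tt_c"],
})

box["orig_id"] = range(len(box))

# One copy of the IMDb rows per allowed year offset
expanded = pd.concat(
    [imdb.assign(adjusted_year=imdb["startYear"] + k) for k in (-2, -1, 0, 1, 2)],
    ignore_index=True,
)

merged = box.merge(
    expanded,
    left_on=["title", "year"],
    right_on=["imdb_title", "adjusted_year"],
    how="left",
)

# Unmatched rows get a large sentinel so idxmin still returns a row for them
merged["year_diff"] = (merged["startYear"] - merged["year"]).abs().fillna(999)
best = merged.loc[merged.groupby("orig_id")["year_diff"].idxmin()]
```

Here `valkyrie` matches both candidates (year differences 1 and 2) and keeps `tt_a`. If two candidates are equally close, `idxmin` keeps whichever appears first, so ties are resolved by row order rather than by any quality measure.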

In [None]:
merged_clean.shape #Checking the number of records for an idea about the dataset

(269406, 19)

In [None]:
merged_clean.info() #Verifying desired columns

<class 'pandas.core.frame.DataFrame'>
Index: 269406 entries, 0 to 320269
Data columns (total 19 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   TD                    269406 non-null  int64         
 1   YD                    269406 non-null  object        
 2   Release               269406 non-null  object        
 3   Daily                 269406 non-null  object        
 4   %¬± YD                269406 non-null  object        
 5   %¬± LW                269406 non-null  object        
 6   Theaters              269406 non-null  object        
 7   Avg                   269406 non-null  object        
 8   To Date               269406 non-null  object        
 9   Days                  269406 non-null  object        
 10  Distributor           268604 non-null  object        
 11  date                  269406 non-null  datetime64[ns]
 12  Release_cleaned       269406 non-null  object        
 13  Rele

## Adding Special Days
Since I am working with domestic (US) box office data, it is useful to know whether a given date falls on a US public holiday. I add this feature by building a holiday calendar for every year present in the `date` column of `merged_clean` and mapping each date to its holiday name.

In [None]:
import holidays as hdays

us_holidays = hdays.US(years=merged_clean['date'].dt.year.unique())
print("US holidays object initialized.")

US holidays object initialized.


In [None]:
def get_holiday_name(date):
    """
    Checks if a given date is a US holiday and returns its name.
    Requires the us_holidays object to be globally accessible.
    """
    return us_holidays.get(date)

print("Function 'get_holiday_name' defined.")

Function 'get_holiday_name' defined.


In [None]:
merged_clean['Special_Day'] = merged_clean['date'].apply(lambda x: get_holiday_name(x.date()))
display(merged_clean.head())

Unnamed: 0,TD,YD,Release,Daily,%¬± YD,%¬± LW,Theaters,Avg,To Date,Days,Distributor,date,Release_cleaned,Release_Year,primaryTitle_cleaned,startYear,tconst,genres,runtimeMinutes,Special_Day
0,1,1,Marley & Me,"$9,956,212",44.60%,-30.80%,3480,"$2,860","$82,400,283",8,Twentieth Century Fox,2009-01-01,marley me,2009,marley me,2008,tt0822832,"Drama,Family",115,New Year's Day
1,2,2,Bedtime Stories,"$8,336,917",46.40%,-21.20%,3681,"$2,264","$65,037,829",8,Walt Disney Studios Motion Pictures,2009-01-01,bedtime stories,2009,bedtime stories,2008,tt0960731,"Adventure,Comedy,Family",99,New Year's Day
2,3,3,The Curious Case of Benjamin Button,"$7,939,690",85.40%,-33.10%,2988,"$2,657","$60,605,838",8,Paramount Pictures,2009-01-01,the curious case of benjamin button,2009,the curious case of benjamin button,2008,tt0421715,"Drama,Fantasy,Romance",166,New Year's Day
3,4,4,Valkyrie,"$5,747,446",64.80%,-32.30%,2711,"$2,120","$46,649,304",8,United Artists,2009-01-01,valkyrie,2009,valkyrie,2008,tt0985699,"Drama,History,Thriller",121,New Year's Day
4,5,5,Yes Man,"$5,567,221",102.30%,-6.10%,3434,"$1,621","$65,596,911",14,Warner Bros.,2009-01-01,yes man,2009,yes man,2008,tt1068680,"Comedy,Romance",104,New Year's Day
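
The lookup behind `Special_Day` is a dictionary-style `get`: a date that is a holiday returns its name, and any other date returns nothing, which is why most rows end up with an empty `Special_Day`. The sketch below illustrates this with a plain `dict` standing in for the `us_holidays` object, so it runs without the `holidays` package. The two entries are hard-coded for illustration only.

```python
import datetime as dt

import pandas as pd

# Stand-in for us_holidays: any mapping with a .get(date) method behaves the same way here
holiday_lookup = {
    dt.date(2009, 1, 1): "New Year's Day",
    dt.date(2009, 7, 4): "Independence Day",
}

dates = pd.Series(pd.to_datetime(["2009-01-01", "2009-01-02", "2009-07-04"]))

# Convert each timestamp to a plain date, then look it up; non-holidays come back missing
special_day = dates.dt.date.map(holiday_lookup.get)
```

Passing the bound method `holiday_lookup.get` directly to `map` is equivalent to the `apply` with a lambda used above, just slightly shorter.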


## Creating the Fact Table

I create the fact table by selecting the columns of the merged dataset that carry the relevant information.

In [None]:
fact_table = merged_clean[['tconst','Release', 'Daily','date','%¬± YD', '%¬± LW', 'Theaters', 'Release_Year','Avg', 'Days', 'To Date', 'Distributor','genres', 'runtimeMinutes', 'Special_Day']]
display(fact_table.tail())

Unnamed: 0,tconst,Release,Daily,date,%¬± YD,%¬± LW,Theaters,Release_Year,Avg,Days,To Date,Distributor,genres,runtimeMinutes,Special_Day
320265,tt30144839,One Battle After Another,"$140,000",2025-11-14,203.40%,-44%,251,2025,$557,50,"$69,815,567",IMAX,"Action,Crime,Drama",161.0,
320266,tt4627382,Roofman,"$50,000",2025-11-14,119.90%,-61.20%,254,2025,$196,36,"$22,575,135",Paramount Pictures International,"Biography,Crime,Drama",126.0,
320267,tt32820897,Demon Slayer: Kimetsu no Yaiba- The Movie - In...,"$40,000",2025-11-14,70.40%,-36.50%,183,2025,$218,64,"$133,740,032",Sony Pictures Releasing,"Action,Adventure,Animation",155.0,
320268,,Back to the Future40th Anniversary,"$36,000",2025-11-14,-4.70%,-85.70%,275,2025,$130,15,"$7,791,710",Universal Pictures International (UPI),,,
320269,tt32214143,Gabby's Dollhouse: The Movie,"$10,000",2025-11-14,109.20%,-53.90%,134,2025,$74,50,"$31,943,585",Universal Pictures International (UPI),"Adventure,Animation,Comedy",98.0,


The fact table is then saved as a CSV file, which I upload to Google Drive for further work.

In [None]:
fact_table.to_csv('fact_table.csv', index=False)
print('fact_table.csv saved successfully!')

fact_table.csv saved successfully!
