## Loading required libraries

In [76]:
suppressMessages({
  library(tidyverse)  # attaches dplyr, readr, tidyr, and related packages
  library(janitor)    # clean_names()
  library(lubridate)  # mdy(), year()
})

## Preliminary Analysis

### Load dataset

In [51]:
path <- "data/raw/imdb_data.csv"
imdb_data <- read.csv(path)

### Summary of dataset

In [75]:
summary(imdb_data)

      id            primaryTitle       originalTitle         isAdult 
 Length:3348        Length:3348        Length:3348        Min.   :0  
 Class :character   Class :character   Class :character   1st Qu.:0  
 Mode  :character   Mode  :character   Mode  :character   Median :0  
                                                          Mean   :0  
                                                          3rd Qu.:0  
                                                          Max.   :0  
                                                                     
 runtimeMinutes     genres          averageRating      numVotes      
 Min.   : 63.0   Length:3348        Min.   :1.000   Min.   :  50004  
 1st Qu.: 98.0   Class :character   1st Qu.:6.200   1st Qu.:  78977  
 Median :109.0   Mode  :character   Median :6.800   Median : 129040  
 Mean   :112.7                      Mean   :6.739   Mean   : 215549  
 3rd Qu.:124.0                      3rd Qu.:7.300   3rd Qu.: 246850  
 Max.   :242.0      

### Data wrangling

#### Data Cleaning and Preprocessing

The original dataset, `imdb_data.csv`, contained information on over 3,000 movies, including variables such as ratings, budget, runtime, release date, and more. To prepare the data for analysis, we first used the `read_csv()` function and applied `clean_names()` from the **janitor** package to standardize column names for easier handling.

Our analysis focuses on movies with substantial public engagement released from 1970 onward, so we kept only films with a release year of 1970 or later and at least 50,000 user votes.

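The `release_date` column was originally stored as a character string (e.g., `"December 11, 2001"`), so we used `mdy()` from the **lubridate** package to convert it to a proper `Date` object and then extracted the release year with `year()`. In the code this conversion runs before the filter, because the filter depends on `release_year`. Values that `mdy()` cannot parse become `NA` (79 in this dataset, according to the warning printed when the function runs), and the release-year filter drops those rows as well.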
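As a minimal sketch of how this parsing step behaves, the toy example below shows `mdy()` returning `NA` for a string it cannot parse, and `filter()` then dropping that row because the comparison on `release_year` evaluates to `NA`. The `toy` tibble, its titles, and its dates are made up for illustration and are not rows from `imdb_data.csv`; the example also assumes an English locale for month names.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical toy data; the second date is deliberately unparseable
toy <- tibble(
  title        = c("Film A", "Film B", "Film C"),
  release_date = c("December 11, 2001", "not available", "June 20, 1965")
)

parsed <- toy |>
  mutate(
    # suppressWarnings() hides the "failed to parse" warning for this demo
    release_date = suppressWarnings(mdy(release_date)),
    release_year = year(release_date)
  )

# Count the values that could not be parsed
n_failed <- sum(is.na(parsed$release_date))

# filter() keeps only rows where the condition is TRUE, so the row with an
# NA year is dropped together with the pre-1970 film
kept <- parsed |> filter(release_year >= 1970)
```

Counting `NA` dates right after parsing, as `n_failed` does here, is one way to see how many rows the year filter will remove silently.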

During this process, we also removed rows where the `gross` earnings were missing, as this information is critical to our exploratory analysis. Finally, we selected only the variables relevant to our research question: `average_rating`, `budget`, `runtime_minutes`, `release_year`, `gross`, and `num_votes`.

We then checked the cleaned dataset for remaining missing values using `summarise_all()` with `is.na()` and confirmed that none remained. The cleaned dataset was written to the `data/processed/` directory using `write_csv()`, ready for analysis.
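In dplyr 1.0.0 and later, `summarise_all()` is superseded by `across()`. The sketch below shows what the same missing-value check might look like in that form. The `toy` tibble and its values are assumptions for illustration, not the IMDb data.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical toy data with one missing value in `gross`
toy <- tibble(
  average_rating = c(7.1, 6.4, 8.0),
  gross          = c(1.2e8, NA, 3.5e7)
)

# Count NAs per column; same result as summarise_all(~ sum(is.na(.)))
missing_summary <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Dropping rows with a missing `gross` removes the incomplete row
cleaned <- toy |> drop_na(gross)
```

Running the count on `cleaned` instead of `toy` would return zero for every column, which is the condition the notebook confirms for the real dataset.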


In [71]:
cleaner <- function(path) {

    # Read, clean, and filter the raw data.
    # show_col_types = FALSE silences readr's column-specification message
    # without changing a global option.
    df <- read_csv(path, show_col_types = FALSE) |> 
        clean_names() |> 
        mutate(
          # mdy() returns NA (with a warning) for strings it cannot parse
          release_date = mdy(release_date),
          release_year = year(release_date)
        ) |> 
        # filter() also drops rows whose release_year is NA
        filter(release_year >= 1970, num_votes >= 50000) |> 
        drop_na(gross) |> 
        select(average_rating, budget, runtime_minutes, release_year, gross, num_votes)

    # Check for remaining missing values
    missing_summary <- df |> summarise_all(~sum(is.na(.)))
    print("Missing values in final cleaned dataset:")
    print(missing_summary)
    
    # Write the cleaned dataframe to the processed-data directory
    write_csv(df, "data/processed/processed_imdb.csv")

    # Return the cleaned dataframe explicitly rather than relying on
    # write_csv() returning its input invisibly
    df
}

In [74]:
# Use this for further analysis
imdb_cleaned <- cleaner(path)

ℹ In argument: `release_date = mdy(release_date)`.
! 79 failed to parse.


[1] "Missing values in final cleaned dataset:"
# A tibble: 1 × 6
  average_rating budget runtime_minutes release_year gross num_votes
           <int>  <int>           <int>        <int> <int>     <int>
1              0      0               0            0     0         0
