Previous notes from EDA:

- To deal with the zeros in [budget, popularity, revenue, runtime, vote_average]:
    - Possible approaches:
        1. Predictive modeling: build a model that predicts budget or revenue from other features (e.g., genre, popularity, runtime), which could provide more accurate results.
        2. Binning: instead of treating budget and revenue as continuous variables, categorize them into bins (e.g., low, medium, high).
        3. Data augmentation: seek out additional data sources to fill in the missing budget and revenue information.
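
As a rough illustration of the binning option, the sketch below cuts a toy budget column into three equal-frequency bins. The values, the bin labels, and the choice to keep zeros as a separate "unknown" category are assumptions made for the example, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy budgets (made-up values); 0 stands for "unknown", as in the movies data.
budget = pd.Series(
    [0, 1_000_000, 5_000_000, 0, 20_000_000, 60_000_000, 150_000_000, 250_000_000]
)

# Treat zeros as missing so they do not form an artificial lowest bin.
known = budget.replace(0, np.nan)

# Three equal-frequency bins over the known values; NaN rows stay NaN.
budget_bin = pd.qcut(known, q=3, labels=["low", "medium", "high"])

# Keep the unknown budgets as their own explicit category.
budget_bin = budget_bin.cat.add_categories("unknown").fillna("unknown")
print(budget_bin.tolist())
```

Keeping "unknown" as a category lets a downstream model use the fact that the budget was missing instead of silently mixing those rows into the "low" bin.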

- Now for the high-value outliers in [budget, popularity, revenue, vote_count]:
    - Possible approaches:
        1. Winsorization: cap the extreme values at a chosen percentile, e.g., points above the 95th percentile are set to the value of the 95th percentile.
        2. Binning: as explained above.
        3. Transformation: applying a log, square-root, or Box-Cox transformation can reduce the impact of outliers. Box-Cox requires strictly positive values, so the zeros would have to be handled first.
        4. Model choice: use models that are less sensitive to outliers, e.g., tree-based models.
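
To make the first and third options concrete, here is a minimal sketch of upper-tail winsorization at the 95th percentile and of a log transform. The array is made up for the example; `np.log1p` is used instead of a plain log as an assumption, because it stays defined for the zeros that these columns contain.

```python
import numpy as np

# Toy column with one extreme value (made-up numbers).
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

# Winsorization: cap everything above the 95th percentile at that percentile.
cap = np.percentile(values, 95)
winsorized = np.minimum(values, cap)

# Log transform: log1p(x) = log(1 + x), so zeros map to zero instead of -inf.
logged = np.log1p(values)

print(cap, winsorized.max(), logged.max())
```

Winsorization only changes the points above the cap, while the log transform rescales every value and compresses the whole upper tail.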


In [121]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [122]:
movies_df = pd.read_csv("../data/movies_after_eda.csv")
ratings_df = pd.read_csv("../data/ratings_small.csv")
credits_df = pd.read_csv("../data/credits.csv")

## Let's create two merged datasets
one (movies + credits) for content-based filtering
and the other (movies + ratings) for collaborative filtering

Some values in the movies `id` column are malformed (dates instead of numbers) or NaN, so we will clean those up first.

In [123]:
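# Coerce non-numeric ids (e.g. dates) to NaN so they can be dropped.
movies_df["id"] = movies_df["id"].apply(
    lambda x: int(x) if str(x).isnumeric() else np.nan
)
# Number of invalid ids (not displayed: only a cell's last expression is echoed).
movies_df["id"].isnull().sum()
# .copy() avoids a possible SettingWithCopyWarning on the assignment below.
movies_df = movies_df.dropna(subset=["id"]).copy()
movies_df["id"] = movies_df["id"].astype(int)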
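
An alternative way to do the same cleanup, shown here only as a sketch on a toy Series, is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN in one vectorized call. The sample ids are illustrative. Note that `pd.to_numeric` is more permissive than `str.isnumeric` (for example, it also accepts strings such as "862.0"), so the two approaches may not agree on every edge case.

```python
import pandas as pd

# Toy ids: valid numbers, a date that slipped into the column, and a missing value.
raw_ids = pd.Series(["862", "8844", "1997-08-20", None, "15602"])

# Unparseable entries become NaN instead of raising an error.
ids = pd.to_numeric(raw_ids, errors="coerce")

# Drop the invalid rows, then cast the remaining ids to integers.
clean_ids = ids.dropna().astype(int)
print(clean_ids.tolist())
```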

In [124]:
movies_credits_df = pd.merge(movies_df, credits_df, on="id")
movies_credits_df.head(2)

Unnamed: 0,id,imdb_id,title,budget,original_language,overview,popularity,release_date,revenue,runtime,vote_average,vote_count,genres_list,release_year,production_companies_list,cast,crew
0,862,tt0114709,Toy Story,30000000.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,7.7,5415.0,"['Animation', 'Comedy', 'Family']",1995.0,['Pixar Animation Studios'],"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,8844,tt0113497,Jumanji,65000000.0,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,6.9,2413.0,"['Adventure', 'Fantasy', 'Family']",1995.0,"['TriStar Pictures', 'Teitler Film', 'Intersco...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."
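
A merge on `id` like the one above silently drops movies without a credits row and repeats movies whose id appears more than once in the credits table. The sketch below, on toy frames with made-up credits rows, shows one way to check for both cases with the `indicator` parameter of `pd.merge`. Whether the real files contain such rows is not established here.

```python
import pandas as pd

movies = pd.DataFrame(
    {"id": [862, 8844, 15602], "title": ["Toy Story", "Jumanji", "Grumpier Old Men"]}
)
# Toy credits: id 8844 appears twice and id 15602 is missing.
credits = pd.DataFrame({"id": [862, 8844, 8844], "cast": ["[]", "[]", "[]"]})

# Default inner merge: unmatched movies vanish, duplicated ids multiply rows.
inner = pd.merge(movies, credits, on="id")

# A left merge with indicator=True adds a `_merge` column naming each row's origin.
checked = pd.merge(movies, credits, on="id", how="left", indicator=True)
match_counts = checked["_merge"].value_counts()

# Ids that occur more than once on the credits side.
duplicate_ids = credits.loc[credits["id"].duplicated(), "id"].tolist()
print(len(inner), match_counts.to_dict(), duplicate_ids)
```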


In [125]:
movies_credits_df.tail(2)

Unnamed: 0,id,imdb_id,title,budget,original_language,overview,popularity,release_date,revenue,runtime,vote_average,vote_count,genres_list,release_year,production_companies_list,cast,crew
45536,227506,tt0008536,Satan Triumphant,0.0,en,"In a small town live two brothers, one a minis...",0.003503,1917-10-21,0.0,87.0,0.0,0.0,[],1917.0,['Yermoliev'],"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de..."
45537,461257,tt6980792,Queerama,0.0,en,50 years after decriminalisation of homosexual...,0.163015,2017-06-09,0.0,75.0,0.0,0.0,[],2017.0,[],[],"[{'credit_id': '593e676c92514105b702e68e', 'de..."


In [126]:
movies_credits_df.isnull().sum()

id                             0
imdb_id                       17
title                          3
budget                         0
original_language             11
overview                     954
popularity                     3
release_date                  87
revenue                        3
runtime                      260
vote_average                   3
vote_count                     3
genres_list                    0
release_year                  87
production_companies_list      0
cast                           0
crew                           0
dtype: int64

The null values do not seem to hurt the dataset much, as they are mostly in less important columns,
so we will handle them with simple imputation.

In [127]:
# Impute nulls: mean for numerical columns, mode for categorical/date columns,
# and a placeholder string for the text columns (imdb_id, title, overview).
movies_credits_df["runtime"] = movies_credits_df["runtime"].fillna(
    movies_credits_df["runtime"].mean()
)
movies_credits_df["vote_average"] = movies_credits_df["vote_average"].fillna(
    movies_credits_df["vote_average"].mean()
)
movies_credits_df["vote_count"] = movies_credits_df["vote_count"].fillna(
    movies_credits_df["vote_count"].mean()
)
movies_credits_df["release_date"] = movies_credits_df["release_date"].fillna(
    movies_credits_df["release_date"].mode()[0]
)
movies_credits_df["release_year"] = movies_credits_df["release_year"].fillna(
    movies_credits_df["release_year"].mode()[0]
)
movies_credits_df["original_language"] = movies_credits_df["original_language"].fillna(
    movies_credits_df["original_language"].mode()[0]
)
movies_credits_df["revenue"] = movies_credits_df["revenue"].fillna(
    movies_credits_df["revenue"].mean()
)
movies_credits_df["imdb_id"] = movies_credits_df["imdb_id"].fillna(
    "No imdb_id available"
)
movies_credits_df["title"] = movies_credits_df["title"].fillna("No title available")
movies_credits_df["overview"] = movies_credits_df["overview"].fillna(
    "No overview available"
)
movies_credits_df["popularity"] = movies_credits_df["popularity"].fillna(
    movies_credits_df["popularity"].mean()
)

In [128]:
movies_credits_df.isnull().sum()

id                           0
imdb_id                      0
title                        0
budget                       0
original_language            0
overview                     0
popularity                   0
release_date                 0
revenue                      0
runtime                      0
vote_average                 0
vote_count                   0
genres_list                  0
release_year                 0
production_companies_list    0
cast                         0
crew                         0
dtype: int64

Let's change the cast and crew columns to a more readable format

In [129]:
import ast


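def extract_actors(cast_str):
    """Return the names of the first three cast members."""
    # The column stores a stringified list of dicts; parse it safely.
    cast_list = ast.literal_eval(cast_str)
    actors = [actor["name"] for actor in cast_list[:3]]
    return actors


def extract_crew(crew_str):
    """Return the names of the first three crew members."""
    crew_list = ast.literal_eval(crew_str)
    crew_members = [member["name"] for member in crew_list[:3]]
    return crew_members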


movies_credits_df["top_actors"] = movies_credits_df["cast"].apply(extract_actors)
movies_credits_df["top_crew"] = movies_credits_df["crew"].apply(extract_crew)

movies_credits_df.drop(["cast", "crew"], axis=1, inplace=True)

movies_credits_df.head()

Unnamed: 0,id,imdb_id,title,budget,original_language,overview,popularity,release_date,revenue,runtime,vote_average,vote_count,genres_list,release_year,production_companies_list,top_actors,top_crew
0,862,tt0114709,Toy Story,30000000.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,7.7,5415.0,"['Animation', 'Comedy', 'Family']",1995.0,['Pixar Animation Studios'],"[Tom Hanks, Tim Allen, Don Rickles]","[John Lasseter, Joss Whedon, Andrew Stanton]"
1,8844,tt0113497,Jumanji,65000000.0,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,6.9,2413.0,"['Adventure', 'Fantasy', 'Family']",1995.0,"['TriStar Pictures', 'Teitler Film', 'Intersco...","[Robin Williams, Jonathan Hyde, Kirsten Dunst]","[Larry J. Franco, Jonathan Hensleigh, James Ho..."
2,15602,tt0113228,Grumpier Old Men,0.0,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,0.0,101.0,6.5,92.0,"['Romance', 'Comedy']",1995.0,"['Warner Bros.', 'Lancaster Gate']","[Walter Matthau, Jack Lemmon, Ann-Margret]","[Howard Deutch, Mark Steven Johnson, Mark Stev..."
3,31357,tt0114885,Waiting to Exhale,16000000.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,81452156.0,127.0,6.1,34.0,"['Comedy', 'Drama', 'Romance']",1995.0,['Twentieth Century Fox Film Corporation'],"[Whitney Houston, Angela Bassett, Loretta Devine]","[Forest Whitaker, Ronald Bass, Ronald Bass]"
4,11862,tt0113041,Father of the Bride Part II,0.0,en,Just when George Banks has recovered from his ...,8.387519,1995-02-10,76578911.0,106.0,5.7,173.0,['Comedy'],1995.0,"['Sandollar Productions', 'Touchstone Pictures']","[Steve Martin, Diane Keaton, Martin Short]","[Alan Silvestri, Elliot Davis, Nancy Meyers]"
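
The extraction above assumes every `cast` and `crew` value is a well-formed stringified list. As a hedged sketch, a more defensive variant could return an empty list when parsing fails. The helper name `parse_names` and the sample string are hypothetical and written by hand for this example.

```python
import ast


def parse_names(raw, limit=3):
    """Return up to `limit` 'name' values from a stringified list of dicts."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Malformed strings or non-string inputs (e.g. NaN) yield no names.
        return []
    if not isinstance(items, list):
        return []
    return [
        item["name"]
        for item in items[:limit]
        if isinstance(item, dict) and "name" in item
    ]


# Hand-written toy sample in the same shape as the `cast` column.
sample = "[{'name': 'Tom Hanks'}, {'name': 'Tim Allen'}]"
print(parse_names(sample))
print(parse_names("not a list"))
```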


First, we will deal with the zeros in `budget` and `revenue` using predictive modelling, specifically linear regression.
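
The idea can be sketched on a toy frame: fit a linear model on the rows where `budget` is known, then predict it for the rows where it is zero. Everything in this sketch is an assumption made for illustration. The numbers are made up, `popularity` and `runtime` are just one possible feature set, and the fit uses NumPy's least-squares solver, which fits the same kind of ordinary least-squares model as scikit-learn's `LinearRegression`.

```python
import numpy as np
import pandas as pd

# Toy data: the last two budgets are 0, i.e. unknown (made-up values).
toy = pd.DataFrame(
    {
        "popularity": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "runtime": [90.0, 100.0, 95.0, 120.0, 110.0, 105.0],
        "budget": [277.0, 309.0, 296.0, 373.0, 0.0, 0.0],
    }
)
features = ["popularity", "runtime"]  # hypothetical feature choice
known = toy["budget"] > 0


def design_matrix(frame):
    """Prepend a column of ones so the model has an intercept."""
    return np.column_stack([np.ones(len(frame)), frame[features].to_numpy()])


# Fit ordinary least squares on the rows with a known budget.
coef, *_ = np.linalg.lstsq(
    design_matrix(toy[known]), toy.loc[known, "budget"].to_numpy(), rcond=None
)

# Replace the zero budgets with the model's predictions.
toy.loc[~known, "budget"] = design_matrix(toy[~known]) @ coef
print(toy)
```

On real data a linear model can predict negative budgets, and the heavy right tail noted in the EDA may make it worthwhile to fit on a log-transformed target, so the predictions should be checked before they are written back.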