## Import libraries

In [None]:
import pandas as pd
import seaborn as sns 
import numpy as np

import matplotlib.pyplot as plt

from sklearn.preprocessing import RobustScaler, LabelEncoder

In [None]:
pd.set_option('display.max_columns', None)

## Load the dataset

In [None]:
# X1: entry dataset (note: X2 is the testing dataset)
# the `Unnamed: 0` column holds a row id (it is renamed and dropped later)
X1 = pd.read_csv("datasets/X1.csv")

print(X1.shape)
X1.head(5)

In [None]:
X1.loc[X1["title"] == "Clown"]

The inputs dataset has dimension (3540, 14).

The first thing we notice is that our dataset uses the special string "\\N" for empty values. We should convert these to NaN.
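
As a side note, the marker could also be handled at load time. Here is a minimal sketch on a toy CSV (the toy content is made up for illustration, it is not the real `X1.csv`), using the `na_values` parameter of `pd.read_csv`:

```python
import io

import pandas as pd

# Toy CSV standing in for X1.csv (hypothetical content); \N marks a missing value
csv_text = "title,runtime,genres\nA,90,Drama\nB,\\N,Comedy\nC,120,\\N\n"

# na_values tells pandas to read the literal \N as NaN
toy = pd.read_csv(io.StringIO(csv_text), na_values=["\\N"])

print(toy.dtypes)        # runtime is parsed as a numeric column
print(toy.isna().sum())  # one missing value in runtime, one in genres
```

With this option, `runtime` would be numeric right after loading and `isna()` would report the missing values directly.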

In [None]:
# Y1: target dataset
Y1 = pd.read_csv("datasets/Y1.csv", header = None, names = ["revenues"])

print(Y1.shape)
Y1.head(5)

The target dataset has dimension (3540, 1).

In [None]:
# X2: testing entry dataset
X2 = pd.read_csv("datasets/X2.csv")

print(X2.shape)
X2.head()

## Dataset description

inputs (X1):     
- `title`: title of the movie.    
- `ratings`: rating on IMDB.    
- `n_votes`: number of votes that are averaged for the given rating.    
- `is_adult`: is the movie destined for a mature audience (0 or 1).    
- `production_year`: the year the movie was produced.    
- `release_year`: the year the movie was released.    
- `runtime`: how long the movie lasts for.    
- `genres`: a list of at most 3 genres that fit the movie.
- `studio`: the movie studio that produced the movie.        
- `img_url`: the url of the poster of the movie.
- `img_embeddings`: vector of size 2048 representing the poster.
- `description`: synopsis of the movie.
- `text_embeddings`: vector of size 768 representing the synopsis.

There is also an `"Unnamed: 0"` column that seems to be an **id for the movie**. We can remove it.

target (Y1):     
- `revenues`: the amount in dollars the movie made in the USA.

## EDA and data engineering

What we're gonna do:
- Re-encode some categorical and integer variables
- Remove useless / redundant features
- ...

For feature engineering, and for the sake of simplicity, we're gonna concatenate the inputs `X1` with the target `Y1`.

In [None]:
df = pd.concat([X1, Y1], axis = 1)
df.head()

First, let's rename the `Unnamed: 0` column to `movie_id`, then drop it since it is only a row id.

In [None]:
df.columns = df.columns.str.replace('Unnamed: 0','movie_id')
df.drop("movie_id", axis=1, inplace=True)
df.head()

### Types of variables

Let's check the different types of variables

In [None]:
# types of variables
df.dtypes.value_counts()

In [None]:
df.info()

Among the 8 object variables, we have the **2 embedding vectors** as well as the `img_url` and `description` features, which we can drop since we have the embeddings.
We also have the following categorical variables: `title`, `genres` and `studio`.

Finally, we have the `runtime` feature, which contains `str` instead of `int`. This is because it uses "\\N" instead of NaN for missing values.

We will drop `title` feature later.

In [None]:
df.drop(columns=["img_url", "description"], inplace=True)
df.head()

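Converting `img_embeddings` and `text_embeddings` from **string** to **numpy array**

In [None]:
import ast

# ast.literal_eval only parses Python literals, which is safer than eval on file contents
df["text_embeddings"] = df["text_embeddings"].apply(ast.literal_eval).apply(np.array)
df["img_embeddings"] = df["img_embeddings"].apply(ast.literal_eval).apply(np.array)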

df.head()

### Duplicated observations

Let's check if we have any duplicate observations (we saw before that there could be duplicated movies with different `movie_id`)

In [None]:
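# the embedding columns hold arrays, which are not hashable, so compare the other columns only
dup_cols = df.columns.difference(["img_embeddings", "text_embeddings"]).tolist()
df[df.duplicated(subset=dup_cols)]

In [None]:
df[df["title"] == "The Ox"]

In [None]:
df[df.duplicated(subset=dup_cols)].count()

In [None]:
df.drop_duplicates(subset=dup_cols, keep="first", inplace=True)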

### Missing values

In [None]:
df["runtime"].describe()

In [None]:
df["runtime"].unique()

In [None]:
df["genres"].unique()

Let's see if there are any empty values

In [None]:
df.isna().sum()

As we saw before, `isna()` finds no empty values because this dataset uses the string "\\N" for empty values. Let's convert it to NaN.

We also convert the `runtime` feature from `str` to a numeric type (it ends up as `float`, since an `int` column cannot hold NaN).

In [None]:
# replace the "\N" marker by NaN everywhere
df.replace("\\N", np.nan, inplace=True)

# convert runtime from str to a numeric dtype (float, because of the NaN values)
df["runtime"] = pd.to_numeric(df["runtime"])

In [None]:
df["runtime"].unique()

In [None]:
df["genres"].unique()

Now, we can check for any empty values

In [None]:
# number of missing values
df.isna().sum()

There are 227 missing values for the `runtime` feature and 3 missing values for the `genres` feature.

In [None]:
# percentage of missing values
((df.isna().sum() / df.shape[0]) * 100).round(decimals = 2)

In [None]:
plt.figure(figsize = (20, 10))
sns.heatmap(df.isna(), cbar = False)

Let's check the rows containing missing values

In [None]:
df.loc[df.isna().any(axis=1)]

**Rule of thumb**: _if values are missing at random and fewer than $5\%$ of the observations are affected, we can drop them without much risk of biasing the dataset._

We have $0.3\%$ of entries with missing values for the `genres` feature. These look like random missing values (no apparent reason for them to be missing, probably forgotten), so we can definitely drop these entries without risk of creating bias in our dataset.
However, for the `runtime` feature, we have ~ $7\%$ of missing values. That's a bit too much to remove all these entries, even though they also seem to be missing at random.

We could try to impute them with the mean or another statistic.

In [None]:
sns.histplot(df["runtime"])

print(df["runtime"].mode())

If data is missing randomly but the rows with these missing values are more than $5\%$ of the dataset, we can use **mean** (in case feature is normally distributed) or **median** (otherwise) imputation. We can also consider **mode** imputation.

However, keep in mind it affects data distribution (in particular the variance is reduced).
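
To make the variance remark concrete, here is a small sketch on synthetic data (the numbers are made up, not taken from our dataset): we knock out roughly 7% of a normally distributed "runtime" column at random and compare the variance before and after imputation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic runtimes (hypothetical values, roughly normal around 100 minutes)
runtime = pd.Series(rng.normal(100, 20, size=1000))

# Remove about 7% of the values completely at random
missing_mask = rng.random(1000) < 0.07
runtime_missing = runtime.mask(missing_mask)

mean_imputed = runtime_missing.fillna(runtime_missing.mean())
median_imputed = runtime_missing.fillna(runtime_missing.median())

print("variance of observed values:  ", runtime_missing.var())
print("variance after mean imputing:  ", mean_imputed.var())
print("variance after median imputing:", median_imputed.var())
```

Every imputed row sits exactly at the center of the distribution, so the spread shrinks a bit; the more values we impute, the stronger the effect.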

In [None]:
# Drop rows with NaN in genres column
df.dropna(subset=["genres"], axis=0, inplace=True)

print("runtime mean: {}".format(df["runtime"].mean()))
print("runtime median: {}".format(df["runtime"].median()))
print("runtime mode: {}".format(df["runtime"].mode()))

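# Impute rows with NaN in runtime column. Replacing with mean.
# (assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas)
df["runtime"] = df["runtime"].fillna(df["runtime"].mean())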

In [None]:
df.isna().sum()

### Analysis of columns

In [None]:
# plt.figure creates an empty figure; plt.subplots would add an extra full-size axes behind the grid
plt.figure(figsize=(20,30))

i = 1

for col in df.select_dtypes("int"):
    plt.subplot(4,3,i)
    sns.histplot(df[col])
    i += 1

There seem to be only _non-adult_ movies (to confirm later).

Movies were mainly produced between **1990** and **2010**. We have a left skewed distribution.

In [None]:
df["is_adult"].value_counts()

Indeed, we **do not have any movies** for a _mature audience_. 
Therefore, we could drop this column.

In [None]:
df.drop(["is_adult"], axis=1, inplace=True)
df.head()

In [None]:
# plt.figure creates an empty figure; plt.subplots would add an extra full-size axes behind the grid
plt.figure(figsize=(20,30))

i = 1

for col in df.select_dtypes("float"):
    plt.subplot(4,3,i)
    sns.histplot(df[col])
    i += 1

Ratings are more or less normally distributed with a mean around **6.5**.

The number of votes is pretty homogeneous among the movies.

Most movies were released between **2005** and **2010**. We have a left skewed distribution.

It doesn't seem to make sense to have `n_votes` and `release_year` as "float". Let's convert them into "int".

In [None]:
df["release_year"] = df["release_year"].astype(int)
df["n_votes"] = df["n_votes"].astype(int)

df.head()

In [None]:
df.index.is_unique

### Skewness, outliers and normalization

In [None]:
df.boxplot()

In [None]:
sns.boxplot(y = df["n_votes"])

We notice a big difference of scale for the `n_votes` feature. It also contains outliers. We do not remove outliers from the testing set.

In [None]:
sns.boxplot(y = df["revenues"])

In [None]:
# TODO: remove outliers via IQR (because the data is not normally distributed)
def find_boundaries(df, variable):
    iqr = df[variable].quantile(0.75) - df[variable].quantile(0.25)

    lower_bound = df[variable].quantile(0.25) - (iqr * 1.5)
    upper_bound = df[variable].quantile(0.75) + (iqr * 1.5)

    return lower_bound, upper_bound

lower_bound_votes, upper_bound_votes = find_boundaries(X1, "n_votes")

print(lower_bound_votes)
print(upper_bound_votes)   

# a value is an outlier if it falls below the lower bound or above the upper bound
votes_outliers = (X1["n_votes"] < lower_bound_votes) | (X1["n_votes"] > upper_bound_votes)
print(votes_outliers.sum())

X1_without_outliers = X1.loc[~votes_outliers]

print(X1.shape)
print(X1_without_outliers.shape)
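
To check the IQR rule by hand, here is a tiny worked example on made-up vote counts (the helper name `iqr_bounds` and the values are only for illustration):

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical vote counts with one obviously extreme value
votes = pd.Series([10, 12, 14, 15, 16, 18, 20, 500])

lower, upper = iqr_bounds(votes)
is_outlier = (votes < lower) | (votes > upper)

print(lower, upper)                 # 6.0 26.0
print(votes[is_outlier].tolist())   # [500]
```

Here Q1 = 13.5 and Q3 = 18.5, so the IQR is 5 and the fences are 13.5 - 7.5 = 6 and 18.5 + 7.5 = 26; only the value 500 falls outside.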

Then, we standardize the data. To do that, we fit a scaler on the training set and then apply it to both the training and testing sets.

In [None]:
df.describe()

Let's analyze the range of the different numerical features

In [None]:
df.select_dtypes(include=["int64", "float64"]).max() - df.select_dtypes(include=["int64", "float64"]).min()

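Standardizing data with `StandardScaler` or min-max normalization is not a good idea with skewed data, because the mean, standard deviation, minimum and maximum are all sensitive to outliers.
Since we cannot remove the outliers of the `n_votes` feature, we use `RobustScaler`, which centers on the median and scales by the interquartile range and so works better with skewed data.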
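
As a quick illustration on made-up values (not our data), the sketch below checks that `RobustScaler` with its default settings computes `(x - median) / IQR`, and compares it with `StandardScaler` when one extreme value is present:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

# Manual version of the default RobustScaler: center on the median, divide by the IQR
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(robust.ravel())
print(standard.ravel())
```

With `StandardScaler`, the outlier inflates the mean and the standard deviation, so the four ordinary values end up squeezed very close together. With `RobustScaler`, they stay spread out and only the outlier lands far away.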

In [None]:
scaler = RobustScaler()

numerical_features = df.select_dtypes(include=["int64", "float64"]).columns
numerical_features = numerical_features.to_list()
print(numerical_features)

# fit the scaler on training dataset (on the DataFrame, so fit and transform see the same feature names)
# note: numerical_features also contains the target `revenues`, which X2 does not have
scaler.fit(df[numerical_features])

# apply the scaler on both training and testing datasets
df_scaled = scaler.transform(df[numerical_features])
#X2_scaled = scaler.transform(X2[numerical_features])

df[numerical_features] = pd.DataFrame(df_scaled, columns=numerical_features, index=df.index)

df.head()

In [None]:
df["n_votes"].describe()

### One Hot Encoding

We should then One-Hot encode the genres

In [None]:
# separate all genres into one big list of list of genres
genres_list = df["genres"].str.split(",").tolist()

unique_genres = []

# retrieve each genre
for sublist in genres_list:
    for genre in sublist:
        if genre not in unique_genres:
            unique_genres.append(genre)

# sort
unique_genres = sorted(unique_genres)
print(unique_genres)

# one hot encode movies genres
# start from df's current columns (X1 still has the dropped columns and lacks `revenues`)
df = df.reindex(df.columns.tolist() + unique_genres, axis = 1, fill_value = 0)

for index, row in df.iterrows():
    for genre in row["genres"].split(","):
        df.loc[index, genre] = 1

In [None]:
# drop old genres column
df.drop("genres", axis=1, inplace=True)

df.head()
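
As an alternative to looping over the rows, pandas can build the same kind of indicator columns in one call with `Series.str.get_dummies`. A minimal sketch on a toy frame (titles and genres are made up):

```python
import pandas as pd

# Toy frame with comma-separated genres (hypothetical values)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Drama,Comedy", "Action", "Comedy"],
})

# One 0/1 column per genre, split on the comma
genre_dummies = toy["genres"].str.get_dummies(sep=",")

# Replace the original genres column by the indicator columns
encoded = pd.concat([toy.drop(columns="genres"), genre_dummies], axis=1)
print(encoded)
```

This assumes the `genres` column has no missing values left, which is the case here once the rows with NaN genres are dropped.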

### Label Encoding

We could also One-Hot encode the `studio` feature.
Well, there are a lot of different studios, so One-Hot encoding them would result in a lot of features. As a consequence, we would blow up the dimensionality of the data and increase the risk of overfitting. Better to Label encode?

In [None]:
df["studio"].describe()

In [None]:
df["studio"].unique()

In [None]:
label_encoder = LabelEncoder()

df["studio"] = label_encoder.fit_transform(df["studio"].to_numpy())

df.head()

We could drop `title` feature as it doesn't have any value for the target (all unique values).

In [None]:
df.drop("title", axis=1, inplace=True)

## Model

We're gonna build regression models :
- Linear regression
- K-Nearest Neighbors 
- MLP
- One other non-linear method (can be one not seen during the course)

We're gonna do **feature selection** and **model selection**.
/!\ model selection can require a lot of computation time /!\

We're gonna validate the model.
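
As a rough idea of what the validation could look like, here is a sketch with K-fold cross-validation on synthetic data. The data from `make_regression`, the number of folds and the hyperparameters are assumptions for illustration, not the final setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem standing in for our features and revenues
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

rmse = {}
for name, model in models.items():
    # each sample is predicted by a model that did not see it during training
    pred = cross_val_predict(model, X_toy, y_toy, cv=cv)
    rmse[name] = float(np.sqrt(mean_squared_error(y_toy, pred)))

print(rmse)
```

Comparing the out-of-fold RMSE of each candidate gives a fairer picture than the training error, and the same loop can be reused for the other models.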

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_predict

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

from sklearn.feature_selection import mutual_info_regression, SequentialFeatureSelector, SelectFromModel

from sklearn.metrics import mean_squared_error

### Train - Test splitting

In [None]:
X1 = df.drop("revenues", axis=1)
Y1 = df["revenues"]

X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, train_size = 0.8, test_size = 0.2, shuffle = True, random_state = 0)

print(f"training dataset dimension: X1 = {X1_train.shape}, Y1 = {Y1_train.shape}")
print(f"testing dataset dimension: X1 = {X1_test.shape}, Y1 = {Y1_test.shape}")

### Feature selection : Filter Method

We want to remove redundant or irrelevant features to improve computational efficiency and reduce the risk of overfitting.
As I understand it, there are multiple ways to select features with a filter method. As a reminder, a filter method is independent of any machine learning model, but it scores each feature on its own and so does not take feature redundancy into account.

Some of them are:
- `Chi-Square`: for categorical variables and categorical targets
- `ANOVA` (F-test): for continuous variables and categorical targets
- `Correlation matrix`: for continuous variables, continuous target and linear model
- `Mutual information`: for continuous variables, continuous target and non-linear model

Since we're dealing with a continuous target and we will train both linear and non-linear models, we use the last two.
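
To see why we want both criteria, here is a small sketch on a made-up relationship, `y = x ** 2`: the dependence is perfect but not linear, so the Pearson correlation misses it while mutual information picks it up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Symmetric grid, so the linear correlation between x and x**2 cancels out
x = np.linspace(-1, 1, 501)
y = x ** 2

pearson = pd.Series(x).corr(pd.Series(y))
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")
print(f"Mutual information:  {mi:.3f}")
```

A feature like this would be thrown away by a correlation threshold, even though a non-linear model could use it.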

In [None]:
X1.dtypes

#### Correlation matrix (linear models)

In [None]:
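fig, ax = plt.subplots(figsize = (12,10))

# keep scalar numeric columns only (the embedding columns hold arrays)
corr_with_target = X1.select_dtypes(include="number").corrwith(Y1)

print(corr_with_target)
# heatmap expects 2D data, so turn the Series into a one-column frame
sns.heatmap(corr_with_target.to_frame("revenues"), annot=True, ax=ax)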

In [None]:
# keep features whose absolute correlation with the target is above 0.4,
# so that strongly negative correlations count as well
relevant_features = corr_with_target[corr_with_target.abs() > 0.4]

relevant_features

In [None]:
# get relevant features list for linear models
rf_lm = relevant_features.index.tolist()

#### Mutual information (non-linear models)

In [None]:
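# the embedding columns hold arrays, so keep only scalar numeric columns
X1_num = X1.select_dtypes(include="number")
mutual_info_matrix = pd.Series(mutual_info_regression(X1_num, Y1, random_state=0), index=X1_num.columns)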

### Feature selection: Wrapper Method

In [436]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# the embedding columns hold arrays, which makes fit() raise
# "ValueError: setting an array element with a sequence.", so keep scalar numeric columns only
X1_train_num = X1_train.select_dtypes(include="number")

forest = RandomForestRegressor(n_estimators=500, random_state=0)
forest.fit(X1_train_num, Y1_train)

sfm = SelectFromModel(forest, threshold=0.1, prefit=True)

X1_selected = sfm.transform(X1_train_num)

feat_labels = X1_train_num.columns
importances = forest.feature_importances_
# feature indices sorted by decreasing importance
indices = np.argsort(importances)[::-1]

for i in range(X1_selected.shape[1]):
    print("%2d) %-*s %f" % (i + 1, 30, feat_labels[indices[i]], importances[indices[i]]))
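
scikit-learn estimators expect a 2D table of scalars, so a column whose cells are arrays cannot be passed as is. One possible way to still use the embeddings is to expand each vector into one column per dimension. A minimal sketch on a toy frame (the column names `text_emb_0`, `text_emb_1` and the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame where each cell of text_embeddings is a small vector (hypothetical values)
toy = pd.DataFrame({
    "ratings": [6.5, 7.1, 5.9],
    "text_embeddings": [np.array([0.1, 0.2]), np.array([0.3, 0.4]), np.array([0.5, 0.6])],
})

# Stack the per-row vectors into one (n_rows, n_dims) matrix
emb = np.vstack(toy["text_embeddings"].tolist())

# One scalar column per embedding dimension
emb_cols = [f"text_emb_{i}" for i in range(emb.shape[1])]
emb_df = pd.DataFrame(emb, columns=emb_cols, index=toy.index)

flat = pd.concat([toy.drop(columns="text_embeddings"), emb_df], axis=1)
print(flat)
```

With the real embeddings this would add 2048 + 768 columns, so it probably only makes sense together with feature selection or dimensionality reduction.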

## Prediction

We're gonna predict the revenue of the movies present in `X2.csv`.
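
A possible shape for this step, sketched on toy frames (the values and the choice of `LinearRegression` are placeholders, not our final model): wrapping the scaler and the model in one pipeline makes sure the testing set goes through exactly the preprocessing that was fitted on the training set.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Toy stand-ins for the prepared training and testing features (hypothetical values)
train = pd.DataFrame({"ratings": [6.0, 7.0, 8.0, 5.0], "n_votes": [100, 2000, 50000, 30]})
train_target = pd.Series([1.0e6, 5.0e6, 9.0e7, 2.0e5], name="revenues")
test = pd.DataFrame({"ratings": [6.5, 7.5], "n_votes": [500, 10000]})

# The scaler is fitted on the training features only, then reused on the testing features
model = make_pipeline(RobustScaler(), LinearRegression())
model.fit(train, train_target)

# Select the columns in the training order before predicting
predictions = pd.Series(model.predict(test[train.columns]), index=test.index, name="revenues")
print(predictions)
```

`X2` would first need the same cleaning and encoding steps as `X1` (missing values, genres, studio), so that both frames end up with the same columns.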