# XGBoost Model

We chose XGBoost because it has performed well in our previous data science work. For the rating prediction task, we want a model that can handle complex data while remaining fast and efficient to train.

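We evaluate the models with mean absolute error (MAE). This is a standard metric for this task and does not penalise outliers as strongly as mean squared error (MSE). Because MAE is expressed in the same units as the rating, it also makes our results easier to interpret. Note that `xgb.XGBRegressor()` is created with its default settings below, so MAE is the metric we score with rather than an explicitly configured training objective.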
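
To make the difference between the two metrics concrete, the following minimal sketch compares them on a handful of made-up ratings (the numbers and the helper names `manual_mae` and `manual_mse` are illustrative assumptions, not part of our dataset or pipeline). Four predictions are off by 0.5 and one is off by 4.0.

```python
def manual_mae(y_true, y_pred):
    """Average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def manual_mse(y_true, y_pred):
    """Average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ratings: four small errors of 0.5 and one outlier error of 4.0.
actual = [7.0, 6.0, 8.0, 5.0, 9.0]
predicted = [6.5, 6.5, 7.5, 5.5, 5.0]

mae = manual_mae(actual, predicted)
mse = manual_mse(actual, predicted)

# Share of each metric's total error contributed by the single outlier.
outlier_share_mae = 4.0 / (mae * len(actual))
outlier_share_mse = 4.0 ** 2 / (mse * len(actual))

print(f"MAE: {mae:.2f}, outlier share: {outlier_share_mae:.0%}")
print(f"MSE: {mse:.2f}, outlier share: {outlier_share_mse:.0%}")
```

The single outlier accounts for a much larger share of the MSE than of the MAE, which is the behaviour we wanted to avoid when scoring rating predictions.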

In [28]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, train_test_split

We begin by concatenating the LDA topic features with the output dataset from 02. The LDA models do not use the 'Title' column, since short film titles are likely to be weak predictors except in the rare case of sequels. The original 'Plot' and 'Title' columns are then dropped so that the data is in a format XGBoost can take as input.

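For an XGBoost model, we do not need to normalize our data: gradient-boosted decision trees split on feature thresholds, so rescaling a feature without changing the order of its values does not change which rows fall on each side of a split.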
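
The point about normalization can be illustrated with a deliberately simplified regression stump. This is a sketch under stated assumptions: `best_split_partition` is a hypothetical helper that searches exhaustively for the split minimising squared error, which is not how XGBoost finds splits internally, and the data is made up. It only shows that min-max scaling leaves the chosen partition unchanged.

```python
import math


def best_split_partition(xs, ys):
    """Return the set of row indices sent left by the best single split on xs.

    The split minimises the summed squared error around each side's mean,
    as a one-level regression tree would.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_cost, best_left = math.inf, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        cost = 0.0
        for side in (left, right):
            side_mean = sum(ys[i] for i in side) / len(side)
            cost += sum((ys[i] - side_mean) ** 2 for i in side)
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left


# Hypothetical feature and target values.
years = [2000, 2005, 2010, 2015, 2020, 2022]
ratings = [7.1, 6.9, 7.0, 5.2, 5.0, 5.4]

# Min-max normalization: a rescaling that preserves the order of the values.
lo, hi = min(years), max(years)
years_scaled = [(y - lo) / (hi - lo) for y in years]

same_partition = (
    best_split_partition(years, ratings)
    == best_split_partition(years_scaled, ratings)
)
print(same_partition)
```

The split cost depends only on the ordering of the feature values, so the raw and scaled features select the same rows for each side.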

We also need baselines against which to evaluate model performance.
For a simple baseline, we predict the mean 'IMDbRating' of the training set for every film.
For a stronger baseline that tests how much the text processing contributes, we run an XGBoost model which does not use any of the 'Title' or 'Plot' data.
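
As a minimal sketch of the simple baseline, assuming a few made-up ratings rather than our real data, the constant prediction is learned from the training ratings only and then scored against held-out ratings:

```python
from statistics import mean

# Hypothetical ratings, purely for illustration.
train_ratings = [7.1, 4.1, 6.6, 5.6, 7.7]
test_ratings = [4.0, 7.9, 7.6]

# The baseline "model" is a single constant computed from the training set.
baseline_prediction = mean(train_ratings)

# MAE of predicting that constant for every held-out film.
baseline_mae = mean(abs(y - baseline_prediction) for y in test_ratings)

print(f"Baseline prediction: {baseline_prediction:.2f}")
print(f"Baseline MAE: {baseline_mae:.2f}")
```

Any model that cannot beat this number is not extracting useful signal from its features.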

In [29]:
df1 = pd.read_csv("Data/PreProcessedData.csv")
df_no_plot = df1.drop(columns='Plot')

# LDA topic features. pd.concat(axis=1) aligns on the row index,
# so both files must list the films in the same order.
df2 = pd.read_csv("Data/LDA_topics.csv")
df_LDATOPICS = pd.concat([df_no_plot, df2], axis=1)
df_LDATOPICS = df_LDATOPICS.drop(columns=['Title', 'Unnamed: 0'])

In [30]:
# Same construction as above, using the LDA topics built with synonyms
df2 = pd.read_csv("Data/LDA_topics_synonym.csv")
df_LDATOPICS_synonym = pd.concat([df_no_plot, df2], axis=1)
df_LDATOPICS_synonym = df_LDATOPICS_synonym.drop(columns=['Title', 'Unnamed: 0'])
print(df_LDATOPICS_synonym)

      IMDbRating  Year  Action  Adult  Adventure  Animation  Biography  \
0            7.1  2000       0      0          0          0          0   
1            4.1  2000       0      0          0          0          0   
2            6.6  2000       0      0          1          0          0   
3            5.6  2000       0      0          0          0          0   
4            7.7  2000       0      0          0          0          0   
...          ...   ...     ...    ...        ...        ...        ...   
6509         4.0  2022       0      0          0          0          0   
6510         7.9  2022       0      0          1          1          0   
6511         7.6  2022       0      0          0          0          0   
6512         6.9  2022       0      0          0          0          1   
6513         6.7  2022       0      0          0          0          0   

      Comedy  Crime  Documentary  ...  Zoë Kravitz        t1        t2  \
0          0      0            0  ...

We create a function to produce train/test splits for each of the datasets we are going to use for our boosting algorithm. Because `test_trainsets` passes the same `random_state` to `train_test_split` on every call, the train and test sets are consistent across datasets, provided each dataset contains the same rows in the same order.

In [31]:
def test_trainsets(df):
    """Split df into train/test features and targets (80/20, fixed seed)."""
    rating = df['IMDbRating']
    xdf = df.drop('IMDbRating', axis=1)

    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(xdf, rating, test_size=0.2, random_state=1)

    # Targets are returned as one-column DataFrames, so later cells
    # index them with y_train['IMDbRating'].
    y_train_df = y_train.to_frame()
    y_test_df = y_test.to_frame()

    return X_train, y_train_df, X_test, y_test_df
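
The consistency of the splits can be checked directly. The sketch below uses two small made-up tables that share the same rows but have different feature columns (the frames and the helper `split_indices` are hypothetical and exist only for this check) and confirms that a fixed `random_state` sends the same row labels to the train and test sets in both.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two hypothetical feature tables: same rows, different columns.
df_a = pd.DataFrame({
    "IMDbRating": [7.1, 4.1, 6.6, 5.6, 7.7, 4.0, 7.9, 7.6, 6.9, 6.7],
    "Year": range(2000, 2010),
})
df_b = df_a.assign(t1=[0.1 * i for i in range(10)])


def split_indices(df):
    """Return the row labels that land in the train and test sets."""
    X = df.drop(columns="IMDbRating")
    y = df["IMDbRating"]
    X_train, X_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
    return list(X_train.index), list(X_test.index)


train_a, test_a = split_indices(df_a)
train_b, test_b = split_indices(df_b)

print(train_a == train_b and test_a == test_b)
```

This only holds when the tables have the same number of rows in the same order; a dataset with missing or reordered films would receive a different split.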

In [32]:
#ROOM FOR PARAMETER TESTING

First, we train the model on the LDA topic features.

In [33]:
X_train, y_train, X_test, y_test = test_trainsets(df_LDATOPICS)

# Create the XGBoost model
xgb_model = xgb.XGBRegressor()

# Kfold
kf = KFold(n_splits=5, shuffle=True, random_state=123)

# Store the model MAE and the baseline MAE for each fold
mae_scores = []
baseline_scores = []
# Iterate over the folds
for train_index, val_index in kf.split(X_train):
    # Split the data into training and validation sets
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

    # Train the model on the training data
    xgb_model.fit(X_train_fold, y_train_fold)

    # Make predictions on the validation data
    y_pred_fold = xgb_model.predict(X_val_fold)
    # Compute the MAE score
    baseline = mean_absolute_error(y_val_fold, np.array([np.mean(y_train['IMDbRating'])] * len(y_pred_fold)))
    mae = mean_absolute_error(y_val_fold, y_pred_fold)
    print(f"MAE: {mae}")
    print(f"Baseline: {baseline}")
    mae_scores.append(mae)
    baseline_scores.append(baseline)

# Compute the average MAE score
average_mae_LDA = sum(mae_scores) / len(mae_scores)
average_baseline = sum(baseline_scores) / len(baseline_scores)
print(f"Average MAE over training: {average_mae_LDA}")
print(f"Average Baseline over training: {average_baseline}")
# Evaluate on the held-out test set (xgb_model was last fitted on the final fold)
y_pred = xgb_model.predict(X_test)
test_score_LDA = mean_absolute_error(y_test, y_pred)
results = pd.DataFrame({'Actual': y_test['IMDbRating'], 'Prediction': y_pred})
results.to_csv('LDA_Topics_results.csv', index=False)

MAE: 0.7265008439039338
Baseline: 0.8036852310907336
MAE: 0.7397469437282511
Baseline: 0.8222375264785754
MAE: 0.7114159764804218
Baseline: 0.7977755235768447
MAE: 0.6829293358577647
Baseline: 0.7885243492376062
MAE: 0.7136890634694163
Baseline: 0.8163850941331495
Average MAE over training: 0.7148564326879576
Average Baseline over training: 0.8057215449033819
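
The same cross-validation loop is repeated for every dataset in this notebook, so it could be factored into a helper. The following is a sketch rather than the code that produced the results shown here: `cross_validate_mae` is a hypothetical name, `LinearRegression` and the synthetic data stand in for `xgb.XGBRegressor()` and our real features, and the baseline uses the mean of each fold's training rows, whereas the cells in this notebook use the mean of the whole training set.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold


def cross_validate_mae(model, X, y, n_splits=5, random_state=123):
    """Return (mean model MAE, mean baseline MAE) over K folds.

    `model` is any estimator with fit/predict; a fresh clone is fitted per fold.
    `y` is expected to be a pandas Series, e.g. y_train['IMDbRating'].
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    model_scores, baseline_scores = [], []
    for train_idx, val_idx in kf.split(X):
        X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Fit an unfitted copy so no state carries over between folds.
        fold_model = clone(model).fit(X_tr, y_tr)
        model_scores.append(mean_absolute_error(y_val, fold_model.predict(X_val)))

        # Constant baseline: mean rating of this fold's training rows.
        constant = np.full(len(y_val), y_tr.mean())
        baseline_scores.append(mean_absolute_error(y_val, constant))
    return float(np.mean(model_scores)), float(np.mean(baseline_scores))


# Synthetic stand-in data; LinearRegression stands in for xgb.XGBRegressor().
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y_demo = 6.5 + 0.8 * X_demo["f1"] + rng.normal(scale=0.1, size=100)

model_mae, baseline_mae = cross_validate_mae(LinearRegression(), X_demo, y_demo)
print(f"Model MAE: {model_mae:.3f}, baseline MAE: {baseline_mae:.3f}")
```

With a helper like this, each dataset would need only one call, and refitting on the full training set before scoring the test set could be added in one place.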


Next, we train the model on the LDA topic features built with synonyms.

In [34]:
X_train, y_train, X_test, y_test = test_trainsets(df_LDATOPICS_synonym)

# Create the XGBoost model
xgb_model = xgb.XGBRegressor()

# Kfold
kf = KFold(n_splits=5, shuffle=True, random_state=123)

# Store the model MAE and the baseline MAE for each fold
mae_scores = []
baseline_scores = []
# Iterate over the folds
for train_index, val_index in kf.split(X_train):
    # Split the data into training and validation sets
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

    # Train the model on the training data
    xgb_model.fit(X_train_fold, y_train_fold)

    # Make predictions on the validation data
    y_pred_fold = xgb_model.predict(X_val_fold)
    # Compute the MAE score
    baseline = mean_absolute_error(y_val_fold, np.array([np.mean(y_train['IMDbRating'])] * len(y_pred_fold)))
    mae = mean_absolute_error(y_val_fold, y_pred_fold)
    print(f"MAE: {mae}")
    print(f"Baseline: {baseline}")
    mae_scores.append(mae)
    baseline_scores.append(baseline)

# Compute the average MAE score
average_mae_LDASynonym = sum(mae_scores) / len(mae_scores)
average_baseline = sum(baseline_scores) / len(baseline_scores)

print(f"Average MAE over training: {average_mae_LDASynonym}")
print(f"Average Baseline over training: {average_baseline}")
# Evaluate on the held-out test set (xgb_model was last fitted on the final fold)
y_pred = xgb_model.predict(X_test)
test_score_LDA_synonym = mean_absolute_error(y_test, y_pred)
results = pd.DataFrame({'Actual': y_test['IMDbRating'], 'Prediction': y_pred})
results.to_csv('LDA_Topics_synonym_results.csv', index=False)

MAE: 0.7303719280092966
Baseline: 0.8036852310907336
MAE: 0.7207185729680272
Baseline: 0.8222375264785754
MAE: 0.698293873536152
Baseline: 0.7977755235768447
MAE: 0.6922140400148857
Baseline: 0.7885243492376062
MAE: 0.6980984519402034
Baseline: 0.8163850941331495
Average MAE over training: 0.7079393732937129
Average Baseline over training: 0.8057215449033819


Next, we train the model on the embeddings from our pre-trained Hugging Face (HF) text encoder.

In [35]:
df_HFTransformer = pd.read_csv("Data/PreProcessedData_with_HF_embeddings.csv")
X_train, y_train, X_test, y_test = test_trainsets(df_HFTransformer)

# Create the XGBoost model
xgb_model = xgb.XGBRegressor()

# Kfold
kf = KFold(n_splits=5, shuffle=True, random_state=123)

# Store the model MAE and the baseline MAE for each fold
mae_scores = []
baseline_scores = []
# Iterate over the folds
for train_index, val_index in kf.split(X_train):
    # Split the data into training and validation sets
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

    # Train the model on the training data
    xgb_model.fit(X_train_fold, y_train_fold)

    # Make predictions on the validation data
    y_pred_fold = xgb_model.predict(X_val_fold)
    # Compute the MAE score
    baseline = mean_absolute_error(y_val_fold, np.array([np.mean(y_train['IMDbRating'])] * len(y_pred_fold)))
    mae = mean_absolute_error(y_val_fold, y_pred_fold)
    print(f"MAE: {mae}")
    print(f"Baseline: {baseline}")
    mae_scores.append(mae)
    baseline_scores.append(baseline)

# Compute the average MAE score
average_mae_HF = sum(mae_scores) / len(mae_scores)
average_baseline = sum(baseline_scores) / len(baseline_scores)

print(f"Average MAE over training: {average_mae_HF}")
print(f"Average Baseline over training: {average_baseline}")
# Evaluate on the held-out test set (xgb_model was last fitted on the final fold)
y_pred = xgb_model.predict(X_test)
test_score_HF = mean_absolute_error(y_test, y_pred)
results = pd.DataFrame({'Actual': y_test['IMDbRating'], 'Prediction': y_pred})

results.to_csv('HF_Transformer_Model_Results.csv', index=False)

MAE: 0.7503356624083918
Baseline: 0.8036852310907336
MAE: 0.7656118661108036
Baseline: 0.8222375264785754
MAE: 0.7458166718254162
Baseline: 0.7977755235768447
MAE: 0.7168532994338052
Baseline: 0.7885243492376062
MAE: 0.7466297551446135
Baseline: 0.8163850941331495
Average MAE over training: 0.7450494509846061
Average Baseline over training: 0.8057215449033819


Finally, we train the baseline boosting model, which does not use any text data.

In [36]:
# Baseline boost model

df_BaselineBoost = pd.read_csv("Data/PreProcessedData.csv")
df_BaselineBoost = df_BaselineBoost.drop('Plot', axis=1)
df_BaselineBoost = df_BaselineBoost.drop('Title', axis=1)
X_train, y_train, X_test, y_test = test_trainsets(df_BaselineBoost)

# Create the XGBoost model
xgb_model = xgb.XGBRegressor()

# Kfold
kf = KFold(n_splits=5, shuffle=True, random_state=123)

# Store the model MAE and the baseline MAE for each fold
mae_scores = []
baseline_scores = []
# Iterate over the folds
for train_index, val_index in kf.split(X_train):
    # Split the data into training and validation sets
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

    # Train the model on the training data
    xgb_model.fit(X_train_fold, y_train_fold)

    # Make predictions on the validation data
    y_pred_fold = xgb_model.predict(X_val_fold)
    # Compute the MAE score
    baseline = mean_absolute_error(y_val_fold, np.array([np.mean(y_train['IMDbRating'])] * len(y_pred_fold)))
    mae = mean_absolute_error(y_val_fold, y_pred_fold)
    print(f"MAE: {mae}")
    print(f"Baseline: {baseline}")
    mae_scores.append(mae)
    baseline_scores.append(baseline)

# Compute the average MAE score
average_mae_baselineboost = sum(mae_scores) / len(mae_scores)
average_baseline = sum(baseline_scores) / len(baseline_scores)

print(f"Average MAE over training: {average_mae_baselineboost}")
print(f"Average Baseline over training: {average_baseline}")
# Evaluate on the held-out test set (xgb_model was last fitted on the final fold)
y_pred = xgb_model.predict(X_test)
test_score_baselineboost = mean_absolute_error(y_test, y_pred)
results = pd.DataFrame({'Actual': y_test['IMDbRating'], 'Prediction': y_pred})

results.to_csv('BaselineBoost_Model_Results.csv', index=False)

MAE: 0.7011224842711582
Baseline: 0.8036852310907336
MAE: 0.7171122579794241
Baseline: 0.8222375264785754
MAE: 0.6979645785351861
Baseline: 0.7977755235768447
MAE: 0.6693779420028949
Baseline: 0.7885243492376062
MAE: 0.6936342674116255
Baseline: 0.8163850941331495
Average MAE over training: 0.6958423060400578
Average Baseline over training: 0.8057215449033819


In [37]:
print("TRAINING AVERAGE MAE")
print(f"HuggingFaceTransformer: {average_mae_HF}, LDA: {average_mae_LDA}, LDA with synonyms: {average_mae_LDASynonym}, BaselineBoostingModel: {average_mae_baselineboost}, Baseline: {average_baseline}")
print("TEST MAE")
print(f"HuggingFaceTransformer: {test_score_HF}, LDA: {test_score_LDA}, LDA with synonyms: {test_score_LDA_synonym}, BaselineBoostingModel: {test_score_baselineboost}, Baseline: {mean_absolute_error(y_test, np.array([np.mean(y_train['IMDbRating'])] * len(y_test)))}")


TRAINING AVERAGE MAE
HuggingFaceTransformer: 0.7450494509846061, LDA: 0.7148564326879576, LDA with synonyms: 0.7079393732937129, BaselineBoostingModel: 0.6958423060400578, Baseline: 0.8057215449033819
TEST MAE
HuggingFaceTransformer: 0.74668775947846, LDA: 0.7180114193870211, LDA with synonyms: 0.7227935320033719, BaselineBoostingModel: 0.6949550314308219, Baseline: 0.817787598198688


We evaluate the performance of these models in the next section.