## Model Performance

In the previous section we computed the mean absolute error (MAE) for each of our four models: a baseline that always predicts the average viewer rating, two LDA models (one with basic preprocessing and one with improved preprocessing), and a pretrained Hugging Face transformer model. We could not get the HDP model to work, so we omit it from the evaluation.

MAE is a simple metric for this kind of prediction task: it is the average absolute difference between each prediction $x_i$ and its true value $y_i$, that is $\frac{1}{n}\sum_{i=1}^{n}|x_i-y_i|$. It is widely used for regression models and is closely related to the mean squared error, although it penalises large errors less heavily. Our models are listed below with their MAE (lower is better), followed by the boosting model trained without text features for comparison:

 - 1. LDA model (Synonyms): 0.716 MAE
 
 - 2. LDA model (Standard): 0.715 MAE
 
 - 3. Hugging Face transformer: 0.760 MAE
 
 - 4. Baseline: 0.826 MAE

 - 5. Baseline boosting algorithm with text features removed: 0.695 MAE
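
As a reminder of how the metric behind these numbers is computed, here is a minimal sketch of MAE together with a mean-rating baseline. The ratings are made up for illustration and the helper name `mae` is our own; this is not the evaluation code used for the results above.

```python
import numpy as np

def mae(y_true, y_pred):
    """Average absolute difference between true values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up ratings, for illustration only (not taken from our dataset)
ratings = np.array([6.0, 7.0, 8.0, 5.0])

# Baseline: predict the mean rating for every film
baseline_pred = np.full_like(ratings, ratings.mean())
baseline_mae = mae(ratings, baseline_pred)
print(baseline_mae)
```

Any model that uses the features should beat this kind of baseline, which is why it is a useful reference point.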
 

Unsurprisingly, the baseline performed the worst, which is a good sign for our three models. It also makes sense that the pretrained transformer ranked below both LDA models: it embeds the raw text wholesale, in the hope that the resulting embeddings turn out to be meaningful. This task needed a more targeted approach, such as topic modelling, to capture what information the plots give us. Interestingly, the synonym model slightly outperformed the standard LDA model, which suggests that refining the preprocessing of the input data is a good way to improve the LDA models.

It is not entirely clear which aspect of the synonym model caused the improvement. As mentioned in 04 - Preprocessing, the motivation behind it was to 'save' words from being removed, and so to increase the number of significant words. The better performance is some evidence that this had an effect: with more frequent words kept in the plots, a plot is more likely to contain a word that appears in a topic, which makes that information available to the predictor model.
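
The 'saving' effect can be sketched on a toy corpus. This is only an illustration of the idea, not the preprocessing from 04 - Preprocessing: the synonym table, the helper names and the frequency threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical synonym table; the real mapping is not reproduced here
SYNONYMS = {"homicide": "murder", "slaying": "murder", "cop": "police"}

def merge_synonyms(docs, synonyms):
    """Replace each token by its canonical synonym if one is listed."""
    return [[synonyms.get(tok, tok) for tok in doc] for doc in docs]

def filter_rare(docs, min_count):
    """Drop tokens that occur fewer than min_count times in the corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]

docs = [
    ["cop", "investigates", "homicide"],
    ["police", "solve", "murder"],
    ["slaying", "shocks", "town"],
]

# Without merging, every word is too rare and gets removed
standard = filter_rare(docs, min_count=2)

# After merging, "murder" and "police" are frequent enough to survive
with_synonyms = filter_rare(merge_synonyms(docs, SYNONYMS), min_count=2)
```

In this toy case every plot keeps at least one word after merging, whereas the standard filter leaves all three plots empty.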

## Effect of the Topic Model

Unfortunately, the boosting algorithm without any text features was the best model we trained. It seems that the added complexity from the text data only harmed our models' predictions. This is unlikely to be due to overfitting, since the validation scores were consistent with this result. We would also not put it down to bad text data: the plot data was complete and of a good length for training.

To gain more insight into this, we can produce a random forest to measure feature importance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Load the preprocessed features and drop the raw plot text
df1 = pd.read_csv("Data/PreProcessedData.csv")
df_no_plot = df1.drop('Plot', axis=1)

# Append the topic columns from the synonym LDA model
df2 = pd.read_csv("Data/LDA_topics_synonym.csv")
df_LDATOPICS_synonym = pd.concat([df_no_plot, df2], axis=1)
df_LDATOPICS_synonym = df_LDATOPICS_synonym.drop(['Title', 'Unnamed: 0'], axis=1)

y = df_LDATOPICS_synonym['IMDbRating']
X = df_LDATOPICS_synonym.drop(['IMDbRating'], axis=1)

# Create a random forest regressor with 100 trees
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model to the full dataset
rf.fit(X, y)

# Get the impurity-based feature importances
importances = rf.feature_importances_

# Print the 45 most important features in descending order
indices = np.argsort(importances)[::-1]
for rank, idx in enumerate(indices[:45], start=1):
    print("%d. %s (%f)" % (rank, X.columns[idx], importances[idx]))
```


1. t6 (0.069979)
2. t1 (0.061592)
3. t8 (0.061389)
4. t7 (0.060017)
5. t5 (0.059447)
6. t4 (0.059300)
7. Horror (0.058862)
8. t3 (0.057350)
9. t9 (0.055264)
10. t2 (0.055160)
11. Year (0.048794)
12. Drama (0.048281)
13. Documentary (0.039883)
14. Biography (0.014063)
15. Animation (0.010882)
16. Comedy (0.009644)
17. Action (0.008749)
18. Bruce Willis (0.008608)
19. Adventure (0.007874)
20. Miley Cyrus (0.007002)
21. Thriller (0.005326)
22. Crime (0.004219)
23. Mystery (0.004146)
24. Family (0.003596)
25. Sport (0.003558)
26. Fantasy (0.003184)
27. Sci-Fi (0.003137)
28. Romance (0.002940)
29. Patrick Muldoon (0.002745)
30. Tyler Perry (0.002546)
31. Danny Dyer (0.002489)
32. Short (0.002375)
33. Jaime Pressly (0.002259)
34. Music (0.002235)
35. War (0.002094)
36. Robert Downey Jr. (0.001957)
37. Jon Voight (0.001843)
38. Larry the Cable Guy (0.001767)
39. Nick Cannon (0.001721)
40. Christian Bale (0.001702)
41. James Corden (0.001628)
42. Suki Waterhouse (0.001477)
43. Ethan Hawke (0.0
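
One caveat when reading this ranking: the impurity-based `feature_importances_` of a random forest are computed on the training data and are known to favour features with many distinct values over binary ones. If the topic columns `t1` to `t9` are continuous weights while the genre and actor columns are 0/1 flags, part of the topics' high ranking could be an artefact. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic stand-in data with hypothetical column names, not our dataset, so it only illustrates the technique.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in: one informative binary flag, two continuous noise columns
X_demo = pd.DataFrame({
    "flag": rng.integers(0, 2, size=n),
    "noise_a": rng.random(n),
    "noise_b": rng.random(n),
})
y_demo = 6.0 + 1.5 * X_demo["flag"] + rng.normal(0, 0.5, size=n)

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)
rf_demo.fit(X_train, y_train)

# Importance = drop in validation score when a single column is shuffled
result = permutation_importance(
    rf_demo, X_val, y_val,
    scoring="neg_mean_absolute_error",
    n_repeats=10,
    random_state=42,
)

perm_importance = pd.Series(
    result.importances_mean, index=X_val.columns
).sort_values(ascending=False)
print(perm_importance)
```

Applied to our data, this would mean holding out part of `X` and `y` before fitting `rf` and then comparing the two rankings.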

We also tried removing the actor columns from our dataset and retraining all models.
Every trained model now performs noticeably worse (the baseline, which ignores the features, is unchanged), indicating that although the actor columns add a lot of complexity to our model, they work as good predictors.

 - 1. LDA model (Synonyms): 0.770 MAE

 - 2. LDA model (Standard): 0.774 MAE

 - 3. Baseline: 0.826 MAE

 - 4. Baseline boosting algorithm with text features removed: 0.759 MAE
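
This kind of ablation can be sketched as a small helper that retrains a model with a chosen group of columns removed and reports the validation MAE. The data below is synthetic, the actor column names are placeholders, and scikit-learn's `GradientBoostingRegressor` is a stand-in that may differ from the boosting algorithm we actually used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def mae_without(df, target, drop_cols, seed=0):
    """Validation MAE of a boosted model trained with drop_cols removed."""
    X = df.drop(columns=[target, *drop_cols])
    y = df[target]
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = GradientBoostingRegressor(random_state=seed)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))

# Synthetic stand-in data in which the actor flags carry real signal
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "Year": rng.integers(1980, 2021, size=n),
    "Horror": rng.integers(0, 2, size=n),
    "Actor A": rng.integers(0, 2, size=n),
    "Actor B": rng.integers(0, 2, size=n),
})
demo["IMDbRating"] = (
    6.5
    - 0.8 * demo["Horror"]
    + 1.0 * demo["Actor A"]
    - 1.0 * demo["Actor B"]
    + rng.normal(0, 0.3, size=n)
)

mae_full = mae_without(demo, "IMDbRating", drop_cols=[])
mae_no_actors = mae_without(demo, "IMDbRating", drop_cols=["Actor A", "Actor B"])
print("all features:  %.3f" % mae_full)
print("without actors: %.3f" % mae_no_actors)
```

The same helper could be reused to drop the topic columns instead, which keeps the split and model settings identical across comparisons.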

## A Possible Explanation for Why We Are Not Getting the Results We Expect

We have concluded that the additional text features are not causing overfitting. However, they may be introducing noise into the model, which weakens its overall predictive performance. It could be that, despite our expectations, movie plots are simply not a very good predictor of ratings: anyone can write a good movie plot, so a good plot may say little about how well the finished film is received.
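
One way to probe the noise hypothesis is to compare training and validation MAE for the same model with and without a block of uninformative columns. The sketch below does this on synthetic data, with `GradientBoostingRegressor` as a stand-in for our boosting algorithm; the helper name and all numbers are assumptions, and the outcome on our real features may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 600

# Synthetic stand-in: three informative columns and ten pure-noise columns
X_signal = rng.normal(size=(n, 3))
X_noise = rng.normal(size=(n, 10))
y_demo = 6.5 + X_signal @ np.array([0.6, -0.4, 0.3]) + rng.normal(0, 0.5, size=n)

def train_val_mae(X, y, seed=0):
    """Return (training MAE, validation MAE) for a boosted model."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr)
    return (
        mean_absolute_error(y_tr, model.predict(X_tr)),
        mean_absolute_error(y_va, model.predict(X_va)),
    )

signal_only = train_val_mae(X_signal, y_demo)
signal_plus_noise = train_val_mae(np.hstack([X_signal, X_noise]), y_demo)

print("signal only     train/val MAE: %.3f / %.3f" % signal_only)
print("signal + noise  train/val MAE: %.3f / %.3f" % signal_plus_noise)
```

If the topic columns behaved like the noise block here, we would expect the validation MAE to get no better, or slightly worse, when they are added, which matches what we observed.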