# Prediction of the superhero's overall score

The goal of this notebook is to predict a superhero's overall score based on the hero's history and powers description.
Two text encodings (bag of words and tf-idf) are each combined with two regression models (linear regression and a multilayer perceptron).

- __Section 2__ uses a bag of words (BoW) approach to encode the text into a fixed-length vector.
    - In __Section 2.1__ this representation is the input to a linear regression model that aims to predict the superhero's overall score.
    - In __Section 2.2__ the same inputs are used to train a multilayer perceptron.
- __Section 3__ follows the same approach but replaces the BoW encoding with a tf-idf encoding.

## 0. Setup

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

In [2]:
# global variables
MLP_LAYER_CONFIG = (800, 400, 200, 100)
experiments = []

# helper classes that reduce clutter in the global namespace
class DataSet:

    def __init__(self, name: str):
        self.name = name
        self.x_train = None
        self.y_train = None
        self.x_test = None
        self.y_test = None

    def set_data(self, x_train, x_test, y_train, y_test):
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test

    def get_bow_encoding(self):

        bow_transformer = CountVectorizer(analyzer='word').fit(self.x_train)

        ds_bow = DataSet(self.name + ' BoW')
        ds_bow.x_train = bow_transformer.transform(self.x_train)
        ds_bow.y_train = self.y_train
        ds_bow.x_test = bow_transformer.transform(self.x_test)
        ds_bow.y_test = self.y_test

        return ds_bow

    def get_tfidf_encoding(self):

        tfidf_transformer = TfidfVectorizer(analyzer='word').fit(self.x_train)

        tfidf_bow = DataSet(self.name + ' tf-idf')
        tfidf_bow.x_train = tfidf_transformer.transform(self.x_train)
        tfidf_bow.y_train = self.y_train
        tfidf_bow.x_test = tfidf_transformer.transform(self.x_test)
        tfidf_bow.y_test = self.y_test

        return tfidf_bow


class Experiment:

    def __init__(self, name: str, data_set: DataSet):
        self.name = name
        self.data_set = data_set
        # models
        self.model = None
        # errors
        self.errors = None  # errors for all predicted values
        self.mae = None  # mean absolute error
        self.mse = None  # mean square error

    def train(self):
        self.model.fit(self.data_set.x_train, self.data_set.y_train)

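    def evaluate(self):
        y_test_predicted = self.model.predict(self.data_set.x_test)
        self.errors = y_test_predicted - np.array(self.data_set.y_test.values)

        n = len(y_test_predicted)
        # mean absolute error: sum(|e_i|) / n
        self.mae = np.linalg.norm(self.errors, 1) / n
        # NOTE: this is ||e||_2 / n = sqrt(sum(e_i^2)) / n, the value shown in the
        # 'mean square error' column; the conventional mean squared error would be
        # np.mean(self.errors ** 2).
        self.mse = np.linalg.norm(self.errors, 2) / n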

    @staticmethod
    def compare(experiments):
        comparison_df = pd.DataFrame(columns=['mean absolute error', 'mean square error'])
        for ex in experiments:
            comparison_df.loc[ex.name, :] = [ex.mae, ex.mse]
        return comparison_df


class LinRegExperiment(Experiment):

    def __init__(self, name: str, data_set: DataSet, model: LinearRegression):
        super().__init__(name, data_set)
        self.model = model

class MLPExperiment(Experiment):

    def __init__(self, name: str, data_set: DataSet, model: MLPRegressor):
        super().__init__(name, data_set)
        self.model = model
        self.training_history_nn = None
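
The error measures in `Experiment.evaluate` are computed from vector norms. As a cross-check, the sketch below compares them with the conventional definitions on made-up error values (not taken from the notebook). It illustrates that the L1 norm divided by $n$ equals the mean absolute error, whereas the L2 norm divided by $n$ equals $\mathrm{RMSE}/\sqrt{n}$ rather than the mean squared error.

```python
import numpy as np

# Hypothetical prediction errors, chosen only for illustration.
errors = np.array([3.0, -4.0, 0.0, 12.0])
n = len(errors)

mae = np.mean(np.abs(errors))   # conventional mean absolute error
mse = np.mean(errors ** 2)      # conventional mean squared error
rmse = np.sqrt(mse)             # root mean squared error

mae_from_norm = np.linalg.norm(errors, 1) / n  # identical to mae
l2_over_n = np.linalg.norm(errors, 2) / n      # equals rmse / sqrt(n), not mse

print(mae, mae_from_norm)
print(mse, rmse, l2_over_n)
```

Because the L2-based value shrinks with $\sqrt{n}$, it is only comparable between experiments evaluated on test sets of the same size, as is the case in this notebook.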

## 1. Data Preparation

While loading the data, the following additional preprocessing steps are applied.

- The columns `history_text` and `powers_text` are concatenated in a new column `text`.
- The rows with `NaN` values in the `overall_score` column are dropped.

In [3]:
superheros = pd.read_csv('datasets/Preprocessed.csv')
superheros.loc[:, 'text'] = superheros.loc[:, 'history_text'].astype(str) + superheros.loc[:, 'powers_text'].astype(str)
superheros = superheros.dropna(subset=['overall_score'])
superheros.head(2)

Unnamed: 0,name,overall_score,history_text,powers_text,superpowers,creator,alignment,text
0,A-Bomb,20.0,richard rick jone orphan young age expel sever...,rare occasion unusual circumstance jone able t...,"['Accelerated Healing', 'Agility', 'Berserk Mo...",Marvel Comics,Good,richard rick jone orphan young age expel sever...
1,Abe Sapien,10.0,sapien begin life langdon everett caul victori...,abe humanoid amphibious creature pair gill nec...,"['Accelerated Healing', 'Agility', 'Cold Resis...",Other,Good,sapien begin life langdon everett caul victori...


Two pairs of inputs and outputs are created.
The first one contains the `history_text` and the second one the `text` column as input.
Both contain the `overall_score` as output.

The dataset is split up into training ($65\ \%$) and test data ($35\ \%$).

In [4]:
seed = 42
test_ratio = 0.35

ds_hist = DataSet('history')
ds_cnct = DataSet('concatenated')

ds_hist.set_data(*train_test_split(superheros.loc[:, 'history_text'], superheros.loc[:, 'overall_score'],
                                  test_size=test_ratio, random_state=seed))
ds_cnct.set_data(*train_test_split(superheros.loc[:, 'text'], superheros.loc[:, 'overall_score'],
                                  test_size=test_ratio, random_state=seed))

print(f'training data size: {ds_hist.x_train.shape}')
print(f'test data size: {ds_hist.x_test.shape}')

training data size: (605,)
test data size: (327,)
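
Both calls to `train_test_split` use the same `random_state` on inputs with the same number of rows, so the two datasets should contain the same superheroes in their training and test parts. The following sketch checks this on a made-up toy frame (the column names mirror the notebook, the values are invented).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the superhero data (values are made up).
toy = pd.DataFrame({
    'history_text': [f'history {i}' for i in range(20)],
    'text': [f'history {i} powers {i}' for i in range(20)],
    'overall_score': [float(i) for i in range(20)],
})

hist_split = train_test_split(toy['history_text'], toy['overall_score'],
                              test_size=0.35, random_state=42)
cnct_split = train_test_split(toy['text'], toy['overall_score'],
                              test_size=0.35, random_state=42)

# The returned lists are ordered as x_train, x_test, y_train, y_test.
same_train_rows = hist_split[0].index.equals(cnct_split[0].index)
same_test_rows = hist_split[1].index.equals(cnct_split[1].index)
print(same_train_rows, same_test_rows)
```

Identical splits mean that differences between the `history` and `concatenated` experiments come from the input text rather than from a different choice of test rows.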


- [ ] If there are not too many `NaN` values, run another experiment that uses only the `powers_text` column.

## 2. BoW

### 2.1 Linear Regression Model

In [5]:
bow_lin_hist = LinRegExperiment('BoW Lin Reg history', ds_hist.get_bow_encoding(), LinearRegression())
bow_lin_cnct = LinRegExperiment('BoW Lin Reg concatenated', ds_cnct.get_bow_encoding(), LinearRegression())
experiments.append(bow_lin_hist)
experiments.append(bow_lin_cnct)

# Training the model.
bow_lin_hist.train()
bow_lin_cnct.train()

# Evaluating the model
bow_lin_hist.evaluate()
bow_lin_cnct.evaluate()

# comparing the model
Experiment.compare(experiments)

Unnamed: 0,mean absolute error,mean square error
BoW Lin Reg history,27.414058,2.801669
BoW Lin Reg concatenated,19.521665,1.975268


In [6]:
# px.histogram(bow_lin_hist.errors)

In [7]:
# px.histogram(bow_lin_cnct.errors)

TODOs

- [ ] Style the plots better.

Ideas

- [ ] Find out which words have the highest weights in the regression model.

Observations

- The model trained on the concatenated text (history and powers) performs better than the model trained on the history text alone.
- The model performs well in most cases but makes large mistakes for a few superheroes.

### 2.2 Multilayer Perceptron Regressor

In [8]:
bow_nn_hist = MLPExperiment('BoW MLP history', ds_hist.get_bow_encoding(),
                               MLPRegressor(hidden_layer_sizes=MLP_LAYER_CONFIG, max_iter=25))
bow_nn_cnct = MLPExperiment('BoW MLP concatenated', ds_cnct.get_bow_encoding(),
                               MLPRegressor(hidden_layer_sizes=MLP_LAYER_CONFIG, max_iter=25))
experiments.append(bow_nn_hist)
experiments.append(bow_nn_cnct)

# Training the model.
bow_nn_hist.train()
bow_nn_cnct.train()

# Evaluating the model
bow_nn_hist.evaluate()
bow_nn_cnct.evaluate()
# px.line(bow_nn_hist.model.loss_curve_)
# px.line(bow_nn_cnct.model.loss_curve_)

# comparing the model
Experiment.compare(experiments)

# px.histogram(bow_nn_hist.errors)
# px.histogram(bow_nn_cnct.errors)



Unnamed: 0,mean absolute error,mean square error
BoW Lin Reg history,27.414058,2.801669
BoW Lin Reg concatenated,19.521665,1.975268
BoW MLP history,12.898538,2.023387
BoW MLP concatenated,12.263702,1.857456


Relative improvements over the linear regression model:

In [9]:
# lin_reg_errors / mlp_reg_errors - 1
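
The commented expression can be made concrete. The sketch below applies it to the mean absolute errors copied from the comparison table above. The dictionary and function names are illustrative and do not appear elsewhere in the notebook.

```python
# Mean absolute errors copied from the comparison table above.
lin_reg_mae = {'history': 27.414058, 'concatenated': 19.521665}
mlp_mae = {'history': 12.898538, 'concatenated': 12.263702}


def relative_improvement(baseline: float, candidate: float) -> float:
    """Return baseline / candidate - 1 (0.5 means the baseline error is 50 % larger)."""
    return baseline / candidate - 1


improvements = {name: relative_improvement(lin_reg_mae[name], mlp_mae[name])
                for name in lin_reg_mae}
for name, value in improvements.items():
    print(f'{name}: {value:.3f}')
```

The same function could be applied to the second error column or directly to the `mae` attributes of the `Experiment` objects.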

Observations

- The model trained on the concatenated text again outperforms the history-only model, this time by a small margin.
- The MLP models seem to perform only slightly better than the linear regression models despite being much more complex.
  This suggests that the limiting factor is not the linear regression model but another element of the approach, such as the
  encoding of the text, the amount of available data, or the data itself.

## 3. tf-idf

### 3.1 Linear Regression Model

In [10]:
tfidf_lin_hist = LinRegExperiment('tf-idf Lin Reg history', ds_hist.get_tfidf_encoding(), LinearRegression())
tfidf_lin_cnct = LinRegExperiment('tf-idf Lin Reg concatenated', ds_cnct.get_tfidf_encoding(), LinearRegression())
experiments.append(tfidf_lin_hist)
experiments.append(tfidf_lin_cnct)

# Training the model.
tfidf_lin_hist.train()
tfidf_lin_cnct.train()

# Evaluating the model
tfidf_lin_hist.evaluate()
tfidf_lin_cnct.evaluate()

# comparing the model
Experiment.compare(experiments)

Unnamed: 0,mean absolute error,mean square error
BoW Lin Reg history,27.414058,2.801669
BoW Lin Reg concatenated,19.521665,1.975268
BoW MLP history,12.898538,2.023387
BoW MLP concatenated,12.263702,1.857456
tf-idf Lin Reg history,14.697725,1.870293
tf-idf Lin Reg concatenated,15.025488,1.738038


- [ ] todo: observation and error plots

### 3.2 Multilayer Perceptron Regressor

In [11]:
tfidf_nn_hist = MLPExperiment('tf-idf MLP history', ds_hist.get_tfidf_encoding(),
                              MLPRegressor(hidden_layer_sizes=MLP_LAYER_CONFIG, max_iter=25))
tfidf_nn_cnct = MLPExperiment('tf-idf MLP Reg concatenated', ds_cnct.get_tfidf_encoding(),
                              MLPRegressor(hidden_layer_sizes=MLP_LAYER_CONFIG, max_iter=25))
experiments.append(tfidf_nn_hist)
experiments.append(tfidf_nn_cnct)

# Training the model.
tfidf_nn_hist.train()
tfidf_nn_cnct.train()

# Evaluating the model
tfidf_nn_hist.evaluate()
tfidf_nn_cnct.evaluate()

# comparing the model
Experiment.compare(experiments)



Unnamed: 0,mean absolute error,mean square error
BoW Lin Reg history,27.414058,2.801669
BoW Lin Reg concatenated,19.521665,1.975268
BoW MLP history,12.898538,2.023387
BoW MLP concatenated,12.263702,1.857456
tf-idf Lin Reg history,14.697725,1.870293
tf-idf Lin Reg concatenated,15.025488,1.738038
tf-idf MLP history,11.386986,1.910227
tf-idf MLP Reg concatenated,10.992288,1.750343


- [ ] word2vec: explain why we're not using it in the report.
