# Prediction of the superhero's overall score

The goal of this notebook is to predict a superhero's overall score based on the hero's history and powers description.
Different techniques are used to achieve this goal.

- __Section 2__ uses a bag of words (BoW) approach to encode the text into a fixed length vector. This representation is the input to a linear regression model that aims to predict the superhero's overall score.
- __Section 3__ Follows a similar approach while replacing the linear regression model by a multilayer perceptron.

## 0. Setup

In [18]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

## 1. Data Preparation

While loading the data, the following additional preprocessing steps are applied.

- The columns `history_text` and `powers_text` are concatenated in a new column `text`.
- The rows with `NaN` values in the `overall_score` column are dropped.

In [2]:
superheros = pd.read_csv('datasets/Preprocessed.csv')
superheros.loc[:, 'text'] = superheros.loc[:, 'history_text'].astype(str) + superheros.loc[:, 'powers_text'].astype(str)
superheros = superheros.dropna(subset=['overall_score'])
superheros.head(2)

Unnamed: 0,name,overall_score,history_text,powers_text,superpowers,creator,alignment,text
0,A-Bomb,20.0,richard rick jone orphan young age expel sever...,rare occasion unusual circumstance jone able t...,"['Accelerated Healing', 'Agility', 'Berserk Mo...",Marvel Comics,Good,richard rick jone orphan young age expel sever...
1,Abe Sapien,10.0,sapien begin life langdon everett caul victori...,abe humanoid amphibious creature pair gill nec...,"['Accelerated Healing', 'Agility', 'Cold Resis...",Other,Good,sapien begin life langdon everett caul victori...


The dataset is split up into training ($65\ \%$) and test data ($35\ \%$).

In [3]:
seed = 42
x_train, x_test, y_train, y_test = train_test_split(
    superheros.loc[:, 'text'], superheros.loc[:, 'overall_score'], test_size=0.35, random_state=seed)

## 2. BoW and Linear Regression

### 2.1 BoW Encoding

In [4]:
# defining the bag-of-words transformer on the text-processed corpus
bow_transformer = CountVectorizer(analyzer='word').fit(x_train)
# transforming into Bag-of-Words and hence textual data to numeric..
x_train_bow = bow_transformer.transform(x_train)
# transforming into Bag-of-Words and hence textual data to numeric..
x_test_bow = bow_transformer.transform(x_test)

## 2.2 Linear Regression Model

Training the model.

In [6]:
reg_model = LinearRegression().fit(x_train_bow, y_train)

Evaluating the model

In [24]:
y_test_predicted = reg_model.predict(x_test_bow)
error_bow = y_test_predicted - np.array(y_test.values)

print('\nmean absolute error:')
print(np.linalg.norm(error_bow, 1) / len(y_test_predicted))
print('\nmean square error:')
print(np.linalg.norm(error_bow, 2) / len(y_test_predicted))


mean absolute error:
19.521664589263775

mean square error:
1.9752684885927538


Ideas

- [ ] compare single column text to the concatenated column.
- [ ] Error distribution representation (histogram)
- [ ] find out which word have the highest weights in the regression model

## 3. BoW and Multilayer Perceptron

In [7]:
# todo: everything :P
