# Marvel Wikia Data

Data originally from: [Comic Books Are Still Made By Men, For Men And About Men, by Walt Hickey for FiveThirtyEight](https://fivethirtyeight.com/features/women-in-comic-books/)

Available on GitHub (with details): [fivethirtyeight/data/comic-characters](https://github.com/fivethirtyeight/data/tree/master/comic-characters)

## 1 Imports

In [6]:
import pandas as pd
import numpy  as np

%matplotlib inline
from matplotlib import pyplot as plt

# The old sklearn.cross_validation module was removed in scikit-learn 0.20:
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split

## 2 About the data

We will import the example data using the `read_csv` method from `pandas`.

In [24]:
data = pd.read_csv('../data/marvel-wikia-data.csv')

For convenience, we will rename all columns to lower case and drop unnecessary columns.

In [26]:
# Rename all columns to lower case
data.columns = data.columns.str.lower()

# Drop unnecessary columns 'urlslug' and 'gsm'
data = data.drop(columns=['urlslug', 'gsm'])

Finally, let's take a look at our data.

In [27]:
data.head(n=2)

| | page_id | name | id | align | eye | hair | sex | alive | appearances | first appearance | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1678 | Spider-Man (Peter Parker) | Secret Identity | Good Characters | Hazel Eyes | Brown Hair | Male Characters | Living Characters | 4043.0 | Aug-62 | 1962.0 |
| 1 | 7139 | Captain America (Steven Rogers) | Public Identity | Good Characters | Blue Eyes | White Hair | Male Characters | Living Characters | 3360.0 | Mar-41 | 1941.0 |


In [35]:
print("The complete dataset contains %s rows and %s columns." % data.shape)

The complete dataset contains 16376 rows and 11 columns.


## 3 Training a model

Ouch. The model doesn't generalize to new observations.

## 4 Train and test split

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("The training dataset contains %s rows and %s columns." % X_train.shape)
print("The test dataset contains %s rows and %s columns." % X_test.shape)

The training dataset contains 13100 rows and 10 columns.
The test dataset contains 3276 rows and 10 columns.


## 5 Train, validation and test split

When evaluating different hyperparameters for estimators, there is still a risk of overfitting on the test set, because the settings can be tuned until the estimator performs best on exactly that data. To keep knowledge about the test set from "leaking" into the model, we may want to hold out a validation set.

A simple way to do this is to call `train_test_split` twice: first to split test from train, and then to split validation from the remaining training data.

In [50]:
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, test_size=0.2, random_state=2)
print("The training dataset contains %s rows and %s columns." % X_train.shape)
print("The validation dataset contains %s rows and %s columns." % X_validation.shape)
print("The test dataset contains %s rows and %s columns." % X_test.shape)

The training dataset contains 10480 rows and 10 columns.
The validation dataset contains 2620 rows and 10 columns.
The test dataset contains 3276 rows and 10 columns.
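
The two calls above can also be wrapped in a small helper. The following is only a sketch: the function name `train_validation_test_split`, its parameters, and the all-zeros stand-in arrays are assumptions made for illustration and are not part of the original notebook or of scikit-learn. The stand-in data has the same shape as the feature matrix used here, so the resulting sizes should match the output printed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_validation_test_split(X, y, test_size=0.2, validation_size=0.2,
                                random_state=0):
    """Hypothetical helper: split X and y into train, validation and test parts."""
    # First split: hold out the test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    # Second split: carve the validation set out of the remaining rows.
    # Note that validation_size is a fraction of the remainder, not of all rows.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=validation_size, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in (assumption): same shape as the data, contents irrelevant.
X_demo = np.zeros((16376, 10))
y_demo = np.zeros(16376)

X_tr, X_val, X_te, y_tr, y_val, y_te = train_validation_test_split(X_demo, y_demo)
print(X_tr.shape, X_val.shape, X_te.shape)
```

Returning all six pieces from one function keeps the two random seeds and the two fractions in one place, which makes it harder to accidentally split the test set again.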


By partitioning our data into three sets, we drastically reduce the number of samples that can be used for training: here only 10480 of the 16376 rows remain. In practice, giving up this much training data is often neither affordable nor desirable.
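
To see where the training rows go, the split sizes can be recomputed by hand. This is a minimal sketch that assumes `train_test_split` rounds a fractional test size up and gives the rest to the training set; the variable names are illustrative only.

```python
import math

n_samples = 16376
test_fraction = 0.2
validation_fraction = 0.2  # applied to the rows left after removing the test set

# Assumption: the held-out part is rounded up, the remainder goes to training.
n_test = math.ceil(n_samples * test_fraction)
n_rest = n_samples - n_test
n_validation = math.ceil(n_rest * validation_fraction)
n_train = n_rest - n_validation

print(n_train, n_validation, n_test)
print("Share of rows left for training: %.2f" % (n_train / n_samples))
```

With two successive 80/20 splits, roughly 0.8 × 0.8 = 64% of the rows are left for fitting the model.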