```
From: https://github.com/ksatola
Version: 0.0.1

```

# Preprocess Data

In [1]:
# Connect with underlying Python code
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, '../src')

In [33]:
from datasets import (
    get_dataset,
    add_dataset
)

In [3]:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
df = pd.DataFrame(
    {
        "a": range(5),
        "b": [-100, -50, 0, 200, 1000],
    }
)
df.head()

Unnamed: 0,a,b
0,0,-100
1,1,-50
2,2,0
3,3,200
4,4,1000


## Standardize
`Standardization` rescales each feature so that it is centered on 0 with a standard deviation of 1. This is important when we compare measurements that have different units: variables measured at different scales do not contribute equally to the analysis and can end up biasing it.

Some algorithms, such as `SVM`, perform better when the data is standardized. Each column should have a mean value of 0 and standard deviation of 1. Sklearn provides a `.fit_transform` method that combines both `.fit` and `.transform`.

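The scikit-learn version comes first, followed by a pandas version. If you use the pandas version for preprocessing, remember to track the original mean and standard deviation: any sample that you later use for prediction must be standardized with those same values. The two outputs differ slightly because `StandardScaler` divides by the population standard deviation (`ddof=0`), while the pandas `.std` method defaults to the sample standard deviation (`ddof=1`).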

In [5]:
from sklearn import preprocessing

std = preprocessing.StandardScaler()
std.fit_transform(df)

array([[-1.41421356, -0.75995002],
       [-0.70710678, -0.63737744],
       [ 0.        , -0.51480485],
       [ 0.70710678, -0.02451452],
       [ 1.41421356,  1.93664683]])

In [6]:
std.scale_

array([  1.41421356, 407.92156109])

In [7]:
std.mean_

array([  2., 210.])

In [8]:
std.var_

array([2.000e+00, 1.664e+05])

In [9]:
# Pandas version
std2 = (df - df.mean()) / df.std()
std2

Unnamed: 0,a,b
0,-1.264911,-0.67972
1,-0.632456,-0.570088
2,0.0,-0.460455
3,0.632456,-0.021926
4,1.264911,1.73219


In [10]:
std2.mean()

a    4.440892e-17
b    0.000000e+00
dtype: float64

In [11]:
std2.std()

a    1.0
b    1.0
dtype: float64
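
The following is a minimal sketch of the point about tracking the training statistics, assuming the same toy frame as above. The names `train_mean`, `train_std`, and `new_sample` are illustrative and not part of the original notebook. Passing `ddof=0` makes the pandas result match the `StandardScaler` output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

# Store the training statistics so later samples can reuse them.
train_mean = df.mean()
train_std = df.std(ddof=0)  # ddof=0: population std, as StandardScaler uses

# Standardize the training data; this matches std.fit_transform(df).
std_pop = (df - train_mean) / train_std

# A hypothetical new sample, standardized with the *training* statistics,
# not with statistics computed from the new sample itself.
new_sample = pd.DataFrame({"a": [2], "b": [210]})
new_std = (new_sample - train_mean) / train_std
print(new_std)
```

Because the new sample sits exactly at the training mean of both columns, both standardized values come out as 0.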

## Normalization (Scale to Range)
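Scaling to range translates the data so that it lies between 0 and 1, inclusive. Having the data bounded may be useful. However, be careful if you have `outliers`: a single extreme value compresses the remaining values into a narrow band. In column `b` below, the value 1000 pushes every other value under 0.3.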

In [12]:
from sklearn import preprocessing

mms = preprocessing.MinMaxScaler()
mms.fit(df)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [13]:
mms.transform(df)

array([[0.        , 0.        ],
       [0.25      , 0.04545455],
       [0.5       , 0.09090909],
       [0.75      , 0.27272727],
       [1.        , 1.        ]])

In [14]:
# Pandas version
norm = (df - df.min()) / (df.max() - df.min())
norm

Unnamed: 0,a,b
0,0.0,0.0
1,0.25,0.045455
2,0.5,0.090909
3,0.75,0.272727
4,1.0,1.0
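
As with standardization, the pandas version only helps at prediction time if the training minimum and range are kept. The sketch below assumes the same toy frame; `train_min`, `train_range`, and `new_sample` are illustrative names. It also shows that values outside the training range fall outside [0, 1] unless they are clipped, which is one possible policy rather than the only one.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

# Keep the training minimum and range for later reuse.
train_min = df.min()
train_range = df.max() - df.min()

# A hypothetical new sample; column "a" exceeds the training maximum of 4.
new_sample = pd.DataFrame({"a": [6], "b": [450]})
scaled = (new_sample - train_min) / train_range

# Optionally force the result back into the [0, 1] interval.
clipped = scaled.clip(lower=0, upper=1)
print(scaled)
print(clipped)
```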


## Dummy Variables (One-hot Encoding)
We can use pandas to create dummy variables from categorical data. This is also referred to as one-hot encoding, or indicator encoding. Dummy variables are especially useful if the data is nominal (unordered). The `get_dummies` function in pandas creates multiple columns for a categorical column, each holding a 1 or 0 depending on whether the original column had that value. The `drop_first` option drops the first dummy column of each feature. This removes redundancy, because one of the dummy columns is always a linear combination of the others.

In [15]:
df_cat = pd.DataFrame(
    {
        "name": ["George", "Paul"],
        "inst": ["Bass", "Guitar"],
    }
)
df_cat.head()

Unnamed: 0,name,inst
0,George,Bass
1,Paul,Guitar


In [16]:
pd.get_dummies(df_cat)

Unnamed: 0,name_George,name_Paul,inst_Bass,inst_Guitar
0,1,0,1,0
1,0,1,0,1


In [17]:
pd.get_dummies(df_cat, drop_first=True)

Unnamed: 0,name_Paul,inst_Guitar
0,0,0
1,1,1
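
The outputs above show 1 and 0. In pandas 2.0 and later, `get_dummies` returns boolean columns by default, so the sketch below passes `dtype=int` to reproduce the integer indicators. The frame is the same `df_cat` as above; `dummies` and `reduced` are illustrative names.

```python
import pandas as pd

df_cat = pd.DataFrame(
    {
        "name": ["George", "Paul"],
        "inst": ["Bass", "Guitar"],
    }
)

# dtype=int yields 0/1 columns regardless of the pandas version default.
dummies = pd.get_dummies(df_cat, dtype=int)

# drop_first removes the first level of each feature.
reduced = pd.get_dummies(df_cat, drop_first=True, dtype=int)
print(dummies)
print(reduced)
```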


In [18]:
df_cat2 = pd.DataFrame(
    {
        "A": [1, None, 3],
        "names": [
            "Fred,George",
            "George",
            "John,Paul",
        ],
    }
)
df_cat2.head()

Unnamed: 0,A,names
0,1.0,"Fred,George"
1,,George
2,3.0,"John,Paul"


In [19]:
pd.get_dummies(df_cat2)

Unnamed: 0,A,"names_Fred,George",names_George,"names_John,Paul"
0,1.0,1,0,0
1,,0,1,0
2,3.0,0,0,1


In [20]:
pd.get_dummies(df_cat2, drop_first=True)

Unnamed: 0,A,names_George,"names_John,Paul"
0,1.0,0,0
1,,1,0
2,3.0,0,1


## Label Encoder
If we have high cardinality nominal data, we can use label encoding. This takes categorical data and assigns each value a number. The encoder imposes ordinality, which may or may not be desired. It can take up less space than one-hot encoding, and some (tree) algorithms can deal with this encoding. The label encoder can only deal with one column at a time.

In [21]:
from sklearn import preprocessing

lab = preprocessing.LabelEncoder()
lab.fit_transform(df_cat.name)

array([0, 1])

If you have encoded values, applying the `.inverse_transform` method decodes them.

In [22]:
lab.inverse_transform([1, 1, 0])

array(['Paul', 'Paul', 'George'], dtype=object)

You can also use pandas to label encode. First, you convert the column to a categorical column type, and then pull out the numeric code from it. This code will create a new series of numeric data from a pandas series. We use the `.as_ordered` method to ensure that the category is ordered.

In [23]:
df_cat.name.astype("category").cat.as_ordered().cat.codes + 1

0    1
1    2
dtype: int8
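
When the pandas approach is applied to new data, the codes only stay consistent if the category list from the training data is reused. The sketch below assumes that policy; `categories` and `new_names` are illustrative names. Values that were not seen in training receive the code -1.

```python
import pandas as pd

train = pd.Series(["George", "Paul"])

# Fix the category order once, from the training data.
categories = sorted(train.unique())

# Encode later data against the same category list.
new_names = pd.Series(["Paul", "Ringo", "George"])
codes = pd.Categorical(new_names, categories=categories).codes
print(codes)
```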

## Frequency Encoding
Another option for handling high cardinality categorical data is to frequency encode it. This means replacing the name of the category with the count it had in the training data. We will use pandas to do this. First, we use the pandas `.value_counts` method to make a mapping (a pandas series that maps strings to counts). With the mapping, we can use the `.map` method to do the encoding. Make sure you store the training mapping so you can encode future data with the same counts.

In [24]:
df_cat3 = pd.DataFrame(
    {
        "name": ["George", "Paul", "George"],
        "inst": ["Bass", "Guitar", "Bass"],
    }
)
df_cat3.head()

Unnamed: 0,name,inst
0,George,Bass
1,Paul,Guitar
2,George,Bass


In [25]:
mapping = df_cat3.name.value_counts()
mapping

George    2
Paul      1
Name: name, dtype: int64

In [26]:
df_cat3.name.map(mapping)

0    2
1    1
2    2
Name: name, dtype: int64
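
A short sketch of reusing the stored mapping on later data, assuming the same training names as above. Categories that were absent from the training data map to `NaN`; filling them with 0 is one reasonable choice, not something the original notebook prescribes. `new_names` and `encoded` are illustrative names.

```python
import pandas as pd

train = pd.Series(["George", "Paul", "George"], name="name")

# Mapping built once from the training data.
mapping = train.value_counts()

# Later data may contain categories the mapping has never seen.
new_names = pd.Series(["Paul", "Ringo", "George"], name="name")
encoded = new_names.map(mapping).fillna(0).astype(int)
print(encoded)
```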

## Pulling Categories from Strings
One way to increase the accuracy of the Titanic model is to pull out titles from the names. A quick hack to find the most common triples (three-character substrings) is to use the `Counter` class. Another option is to use a `regular expression` to pull out a run of letters followed by a period.

In [27]:
df = get_dataset('titanic3')

In [28]:
from collections import Counter

c = Counter()
def triples(val):
    for i in range(len(val)):
        c[val[i : i + 3]] += 1

df.name.apply(triples)
c.most_common(10)

[(', M', 1282),
 (' Mr', 954),
 ('r. ', 830),
 ('Mr.', 757),
 ('s. ', 460),
 ('n, ', 320),
 (' Mi', 283),
 ('iss', 261),
 ('ss.', 261),
 ('Mis', 260)]
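
The same counting idea can be sketched without the Titanic data and without a module-level counter. The three names below are taken from the outputs in this notebook; `count_triples` is an illustrative helper. Unlike the cell above, it skips the trailing fragments that are shorter than three characters.

```python
from collections import Counter

names = [
    "Allen, Miss. Elisabeth Walton",
    "Allison, Miss. Helen Loraine",
    "Allison, Mr. Hudson Joshua Creighton",
]


def count_triples(values):
    """Count every three-character substring across all strings."""
    counts = Counter()
    for val in values:
        for i in range(len(val) - 2):
            counts[val[i : i + 3]] += 1
    return counts


counts = count_triples(names)
print(counts.most_common(5))
```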

In [29]:
df.name.str.extract(r"([A-Za-z]+)\.", expand=False).head()

0      Miss
1    Master
2      Miss
3        Mr
4       Mrs
Name: name, dtype: object

In [30]:
# We can use .value_counts to see the frequency of these
df.name.str.extract(r"([A-Za-z]+)\.", expand=False).value_counts()

Mr          757
Miss        260
Mrs         197
Master       61
Rev           8
Dr            8
Col           4
Mlle          2
Major         2
Ms            2
Lady          1
Don           1
Jonkheer      1
Dona          1
Countess      1
Mme           1
Sir           1
Capt          1
Name: name, dtype: int64
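
The extraction can be tried on a few names without loading the dataset. The sketch below uses three names that appear in this notebook's outputs. The pattern is written as a raw string so that `\.` reaches the regex engine unchanged, and `.str.extract` keeps only the first match in each name.

```python
import pandas as pd

names = pd.Series(
    [
        "Allen, Miss. Elisabeth Walton",
        "Allison, Master. Hudson Trevor",
        "Allison, Mr. Hudson Joshua Creighton",
    ],
    name="name",
)

# One capture group: letters immediately followed by a literal period.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.value_counts())
```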

## Manual Feature Engineering
We can use pandas to generate new features. For the Titanic dataset, we can add aggregate cabin data (maximum age per cabin, mean age per cabin, etc.). To get aggregate data per cabin and merge it back in, use the pandas `.groupby` method to create the data. Then align it back to the original data using the `.merge` method.

In [31]:
agg = (
    df.groupby("cabin")
    .agg(["min", "max", "mean", "sum"])
    .reset_index()
)

agg.columns = [
    "_".join(c).strip("_")
    for c in agg.columns.values
]

agg_df = df.merge(agg, on="cabin")

In [32]:
agg_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,...,parch_mean,parch_sum,fare_min,fare_max,fare_mean,fare_sum,body_min,body_max,body_mean,body_sum
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,...,0.5,1,211.3375,211.3375,211.3375,422.675,,,,0.0
1,1,1,"Madill, Miss. Georgette Alexandra",female,15.0,0,1,24160,211.3375,B5,...,0.5,1,211.3375,211.3375,211.3375,422.675,,,,0.0
2,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,...,2.0,8,151.55,151.55,151.55,606.2,135.0,135.0,135.0,135.0
3,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,...,2.0,8,151.55,151.55,151.55,606.2,135.0,135.0,135.0,135.0
4,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,...,2.0,8,151.55,151.55,151.55,606.2,135.0,135.0,135.0,135.0
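
The groupby-then-merge pattern can be sketched on a small made-up frame. The values below are illustrative, not Titanic records. Two choices here are assumptions rather than the notebook's method: only numeric columns are aggregated, and the merge uses `how="left"` so that rows with a missing cabin are kept. `groupby` drops missing keys by default, so the default inner merge would discard those rows.

```python
import pandas as pd

# Made-up sample; the last passenger has no cabin recorded.
df = pd.DataFrame(
    {
        "cabin": ["B5", "B5", "C22", None],
        "age": [29.0, 15.0, 30.0, 40.0],
        "fare": [211.0, 211.0, 151.0, 8.0],
    }
)

# Aggregate the numeric columns per cabin.
agg = (
    df.groupby("cabin")[["age", "fare"]]
    .agg(["min", "max", "mean"])
    .reset_index()
)

# Flatten the two-level column index: ("age", "min") -> "age_min".
agg.columns = ["_".join(c).strip("_") for c in agg.columns.values]

# The left merge keeps every original row; the inner merge drops the last one.
merged = df.merge(agg, on="cabin", how="left")
inner = df.merge(agg, on="cabin")
print(merged)
```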


In [55]:
add_dataset(
    data=agg_df,
    name="titanic3_aggregated",
    url="../data/",
    type="csv",
    origin="Dataset created from titanic3",
)