# Process Data

In [55]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [56]:
X2 = pd.DataFrame({
    'a':range(5),
    'b':[-100,-50,0,200,1000]
})

In [57]:
X2

Unnamed: 0,a,b
0,0,-100
1,1,-50
2,2,0
3,3,200
4,4,1000


## Standardize

Some algorithms, such as SVM, perform better when the data is _standardized_. After standardizing, each column has a mean of 0 and a standard deviation of 1.

In [58]:
from sklearn import preprocessing

std = preprocessing.StandardScaler()
std.fit_transform(X2)

array([[-1.41421356, -0.75995002],
       [-0.70710678, -0.63737744],
       [ 0.        , -0.51480485],
       [ 0.70710678, -0.02451452],
       [ 1.41421356,  1.93664683]])

In [59]:
std.scale_

array([  1.41421356, 407.92156109])

In [60]:
std.mean_

array([  2., 210.])

In [61]:
std.var_

array([2.000e+00, 1.664e+05])

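Remember that you will need to track the original mean and standard deviation if you use this for preprocessing. Any sample that you use to predict later will need to be standardized with those same values. The same calculation can be done in pandas, although pandas `.std` uses the sample standard deviation by default, so the values differ slightly from the scikit-learn result: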

In [62]:
X_std = (X2-X2.mean())/X2.std()
X_std

Unnamed: 0,a,b
0,-1.264911,-0.67972
1,-0.632456,-0.570088
2,0.0,-0.460455
3,0.632456,-0.021926
4,1.264911,1.73219


In [63]:
X2.mean()

a      2.0
b    210.0
dtype: float64
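The following is a minimal sketch of reusing the training statistics on a later sample. `X_new` is a hypothetical sample that is not part of the original data. The sketch also checks that pandas matches `StandardScaler` once `ddof=0` is passed, since `StandardScaler` divides by the population standard deviation:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({'a': range(5), 'b': [-100, -50, 0, 200, 1000]})
scaler = StandardScaler().fit(X_train)

# Hypothetical later sample; it must use the training mean and scale.
X_new = pd.DataFrame({'a': [10], 'b': [500]})
scaled_new = scaler.transform(X_new)

# Same result computed by hand from the stored attributes.
manual_new = (X_new.to_numpy() - scaler.mean_) / scaler.scale_

# pandas agrees with StandardScaler when the population std (ddof=0) is used.
pandas_pop = (X_train - X_train.mean()) / X_train.std(ddof=0)
```

Keeping the fitted `scaler` object around (or its `mean_` and `scale_` arrays) is enough to standardize any future sample consistently.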

The `fastai` library also implements this.

## Scale to range

Scaling to range translates the data so that it lies between 0 and 1, inclusive. Having the data bounded may be useful. However, if you have outliers, you probably want to be careful using this:

In [64]:
from sklearn import preprocessing

mms = preprocessing.MinMaxScaler()
mms.fit(X2)
mms.transform(X2)

array([[0.        , 0.        ],
       [0.25      , 0.04545455],
       [0.5       , 0.09090909],
       [0.75      , 0.27272727],
       [1.        , 1.        ]])
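As a rough illustration of why outliers matter here, the sketch below reproduces the scaling by hand from the fitted minimum and maximum, then transforms a hypothetical later sample (`X_new`, not in the original data) whose `b` value lies beyond the training maximum:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X_train = pd.DataFrame({'a': range(5), 'b': [-100, -50, 0, 200, 1000]})
mms = MinMaxScaler().fit(X_train)

# Manual formula: (x - min) / (max - min), using the fitted attributes.
manual = (X_train.to_numpy() - mms.data_min_) / (mms.data_max_ - mms.data_min_)

# Hypothetical later sample: b=2100 is above the training maximum of 1000.
X_new = pd.DataFrame({'a': [2], 'b': [2100]})
scaled_new = mms.transform(X_new)  # -> [[0.5, 2.0]], no longer inside [0, 1]
```

The single large value of 1000 in column `b` also compresses the other training values toward 0, which is visible in the output above.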

## Dummy Variables

We can use pandas to create dummy variables from categorical data. This is also referred to as one-hot encoding, or indicator encoding. Dummy variables are especially useful if the data is nominal (unordered). The `get_dummies` function in pandas creates multiple columns for a categorical column, each with a 1 or 0 indicating whether the original column had that value.

In [65]:
X_cat = pd.DataFrame({
    'name': ['George', 'Paul'],
    'inst': ['Bass', 'Guitar']
})

In [66]:
X_cat

Unnamed: 0,name,inst
0,George,Bass
1,Paul,Guitar


In [67]:
pd.get_dummies(X_cat, drop_first=True)

Unnamed: 0,name_Paul,inst_Guitar
0,0,0
1,1,1


In [68]:
import janitor as jn

In [69]:
X_cat2 = pd.DataFrame({
    'names': ['Fred', 'George', 'John' ,'Paul'],
})

In [70]:
jn.expand_column(X_cat2, 'names', sep=',')

Unnamed: 0,names,Fred,George,John,Paul
0,Fred,1,0,0,0
1,George,0,1,0,0
2,John,0,0,1,0
3,Paul,0,0,0,1


If we have high cardinality nominal data, we can use _label encoding_.

## Label Encoding

An alternative to dummy variable encoding is label encoding. This will take categorical data and assign each value a number. It is useful for high cardinality data. This encoder imposes ordinality, which may or may not be desired. It can take up less space than one-hot encoding and some (tree) algorithms can deal with this encoding.

In [71]:
from sklearn import preprocessing

In [72]:
lab = preprocessing.LabelEncoder()
lab.fit_transform(X_cat['inst'])

array([0, 1])

In [73]:
lab.inverse_transform([1,1,0])

array(['Guitar', 'Guitar', 'Bass'], dtype=object)

You can also use pandas to label encode. First, you convert the column to a categorical column type, and then pull out the numeric code from it.
This code will create a new series of numeric data from a pandas series. We use the `.as_ordered` method to ensure that the category is ordered:

In [74]:
X_cat.name.astype(
    'category').cat.as_ordered().cat.codes +1 

0    1
1    2
dtype: int8
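By default, pandas sorts string categories lexically, so the codes follow alphabetical order. If the order should carry meaning, one option is to pass the categories explicitly. This is a small sketch; the `sizes` data and `size_order` list are made up for illustration:

```python
import pandas as pd

sizes = pd.Series(['med', 'small', 'lg', 'small'])
size_order = ['small', 'med', 'lg']  # hypothetical ordering, smallest first

# Codes now follow the position in size_order rather than alphabetical order.
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)
codes = pd.Series(ordered.codes, index=sizes.index)
```

Values that are not listed in `categories` receive the code -1.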

## Frequency Encoding 

Another option for handling high cardinality categorical data is to _frequency encode_ it. This means replacing the name of the category with the count it had in the training data. We will use pandas to do this.

First, we will use the pandas `.value_counts` method to make a mapping (a pandas series that maps strings to counts). With the mapping, we can use the `.map` method to do the encoding.

In [75]:
mapping = X_cat.name.value_counts()
X_cat.name.map(mapping)

0    1
1    1
Name: name, dtype: int64

Make sure you store the training mapping so you can encode future data with the same counts.
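A minimal sketch of that reuse is shown here, with made-up `train` and `future` series. Categories that never appeared in training have no entry in the mapping, so `.map` returns `NaN` for them; filling with 0 is one possible choice, not the only one:

```python
import pandas as pd

# Hypothetical training and future data.
train = pd.Series(['Mr', 'Mrs', 'Mr', 'Miss', 'Mr'], name='title')
future = pd.Series(['Mrs', 'Dr', 'Mr'], name='title')

# Build the mapping once, from the training data only.
mapping = train.value_counts()

# Unseen categories ('Dr') map to NaN; here they are treated as count 0.
encoded = future.map(mapping).fillna(0).astype(int)
```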

## Pulling Categories from Strings

One way to increase the accuracy of the Titanic model is to pull out titles from the names. A quick hack to find the most common triples is to use the `Counter` class:

In [76]:
from collections import Counter

In [77]:
c = Counter()

In [78]:
def triples(val):
    for i in range(len(val)):
        c[val[i:i+3]] += 1 

In [79]:
df = pd.read_excel('titanic3.xls')

In [80]:
df.name.apply(triples)

0       None
1       None
2       None
3       None
4       None
        ... 
1304    None
1305    None
1306    None
1307    None
1308    None
Name: name, Length: 1309, dtype: object

In [81]:
c.most_common(10)

[(', M', 1282),
 (' Mr', 954),
 ('r. ', 830),
 ('Mr.', 757),
 ('s. ', 460),
 ('n, ', 320),
 (' Mi', 283),
 ('iss', 261),
 ('ss.', 261),
 ('Mis', 260)]

Another option is to use a regular expression to pull out a run of letters followed by a period (in these names, a capital letter followed by lowercase letters):

In [82]:
# A raw string avoids an invalid escape sequence warning for '\.'
df.name.str.extract(
    r'([A-Za-z]+)\.', expand=False).head()

0      Miss
1    Master
2      Miss
3        Mr
4       Mrs
Name: name, dtype: object

We can use `.value_counts` to see the frequency of these:

In [83]:
df.name.str.extract(
    r'([A-Za-z]+)\.', expand=False).value_counts()

Mr          757
Miss        260
Mrs         197
Master       61
Rev           8
Dr            8
Col           4
Mlle          2
Ms            2
Major         2
Capt          1
Sir           1
Dona          1
Jonkheer      1
Countess      1
Don           1
Mme           1
Lady          1
Name: name, dtype: int64

Using these manipulations and pandas, you can create dummy variables or combine columns with low counts into other categories (or drop them).
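One way to combine low-count titles is sketched below. The `names` series is a small sample (partly invented) so the example runs without the Titanic file, and the `min_count` threshold and `'Other'` label are arbitrary choices:

```python
import pandas as pd

# Small illustrative sample of names; several entries are hypothetical.
names = pd.Series([
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Allison, Mr. Hudson Joshua Creighton',
    'Smith, Mr. John',
    'Brown, Dr. Alice',
    'Jones, Mr. Tom',
    'White, Miss. Ann',
])
min_count = 2  # hypothetical threshold for keeping a title

title = names.str.extract(r'([A-Za-z]+)\.', expand=False)
counts = title.value_counts()
keep = counts[counts >= min_count].index

# Replace titles below the threshold with a single 'Other' category.
title_grouped = title.where(title.isin(keep), 'Other')
dummies = pd.get_dummies(title_grouped, prefix='title')
```

As with the other encoders, the list of kept titles should come from the training data and be reused on future data.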

## Other Categorical encoding

The `category_encoders` library is used to convert categorical data into numeric data.


One algorithm implemented in the library is a hash encoder. This is useful if you do not know how many categories you have ahead of time or if you are using a bag of words to represent text. This will hash the categorical columns into `n_components` columns.
If you are using online learning, this can be very helpful:

In [84]:
#! pip install category_encoders

In [85]:
import category_encoders as ce

In [86]:
he = ce.HashingEncoder(verbose=1)
he.fit_transform(X_cat)

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,0,0,0,1,0,1,0,0
1,0,2,0,0,0,0,0,0
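To give a feel for the idea, here is a rough sketch of the hashing trick in plain Python. It is an illustration only, not the library's implementation: the `hash_encode` helper is hypothetical, the choice of md5 is an assumption, and it encodes a single list of values rather than a DataFrame:

```python
import hashlib
import numpy as np

def hash_encode(values, n_components=8):
    """Map each value to one of n_components buckets via a stable hash."""
    out = np.zeros((len(values), n_components), dtype=int)
    for row, value in enumerate(values):
        digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
        # The bucket index depends only on the value, not on other data seen.
        out[row, int(digest, 16) % n_components] += 1
    return out

encoded = hash_encode(['Bass', 'Guitar', 'Bass'])
```

Because the bucket is computed from the value alone, no fitted vocabulary is needed, which is what makes the approach convenient for online learning. Different values can collide in the same bucket.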


The ordinal encoder can convert categorical columns that have order to a single column of numbers. Here we convert the size column to ordinal numbers. If a value is missing from the mapping dictionary, the default value of -1 is used: 

In [87]:
size_df = pd.DataFrame({
    'name': ['Fred', 'John','Matt'],
    'size': ['small', 'med','xxl']
})
ore = ce.OrdinalEncoder(
                        mapping = [{
                            'col': 'size',
                            'mapping':{
                                'small':1,
                                'med':2,
                                'lg':3
                            }
                        }])

In [88]:
ore.fit_transform(size_df)

Unnamed: 0,name,size
0,Fred,1.0
1,John,2.0
2,Matt,-1.0
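For comparison, a similar result for a single column can be sketched with plain pandas, using `.map` for the lookup and `.fillna` for values missing from the dictionary. The `size_map` name is introduced here for illustration:

```python
import pandas as pd

size_df = pd.DataFrame({
    'name': ['Fred', 'John', 'Matt'],
    'size': ['small', 'med', 'xxl'],
})
size_map = {'small': 1, 'med': 2, 'lg': 3}

# 'xxl' is not in the mapping, so it becomes NaN and is then filled with -1.
encoded = size_df['size'].map(size_map).fillna(-1)
```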


If you have high cardinality data (a large number of unique values), consider using one of the Bayesian encoders that output a single column per categorical column. These are:
* `TargetEncoder`
* `LeaveOneOutEncoder`
* `WOEEncoder`
* `JamesSteinEncoder`
* `MEstimateEncoder`

For example, to convert the Titanic title column into a blend of the posterior probability of the target given the title (categorical) information and the prior probability of the target, use the following code:

In [89]:
def get_title(df):
    return df.name.str.extract(r'([A-Za-z]+)\.', expand=False)

In [90]:
te = ce.TargetEncoder(cols='Title')
te.fit_transform(
                df.assign(Title = get_title), df.survived)['Title'].head()


0    0.676923
1    0.508197
2    0.676923
3    0.162483
4    0.786802
Name: Title, dtype: float64
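The blending idea can be sketched by hand with a simple additive-smoothing formula. This is a simplification under stated assumptions: the library's encoders use their own weighting, so the numbers will not match the output above, and both the `df_small` data and the smoothing weight `m` are made up:

```python
import pandas as pd

# Hypothetical toy data.
df_small = pd.DataFrame({
    'title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Dr'],
    'survived': [0, 0, 1, 1, 1, 0],
})
m = 2.0  # hypothetical smoothing weight: larger values pull toward the prior

prior = df_small['survived'].mean()  # overall survival rate
stats = df_small.groupby('title')['survived'].agg(['mean', 'count'])

# Categories with few rows stay close to the prior; large ones approach
# their own mean.
blend = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
encoded = df_small['title'].map(blend)
```

Because the encoding uses the target, it should be fit on training data only, otherwise information about the target leaks into the features.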

## Date Feature Engineering 

The `fastai` library has an `add_datepart` function that will generate date attribute columns based on a datetime column. This is useful as most machine learning algorithms would not be able to infer this type of signal from a numeric representation of a date:

In [91]:
from fastai import *
from fastai.torch_basics import *
from fastai.data.all import *

In [92]:
dates = pd.DataFrame({
    'A': pd.to_datetime(['9/17/2001', 'Jan 1, 2002'], 
                       )
})

In [93]:
#export
def make_date(df, date_field):
    "Make sure `df[date_field]` is of the right date type."
    field_dtype = df[date_field].dtype
    if isinstance(field_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        field_dtype = np.datetime64
    if not np.issubdtype(field_dtype, np.datetime64):
        df[date_field] = pd.to_datetime(df[date_field], infer_datetime_format=True)

In [94]:
#export
def add_datepart(df, field_name, prefix=None, drop=True, time=False):
    "Helper function that adds columns relevant to a date in the column `field_name` of `df`."
    make_date(df, field_name)
    field = df[field_name]
    prefix = ifnone(prefix, re.sub('[Dd]ate$', '', field_name))
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start',
            'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time: attr = attr + ['Hour', 'Minute', 'Second']
    # Pandas removed `dt.week` in v1.1.10
    week = field.dt.isocalendar().week.astype(field.dt.day.dtype) if hasattr(field.dt, 'isocalendar') else field.dt.week
    for n in attr: df[prefix + n] = getattr(field.dt, n.lower()) if n != 'Week' else week
    mask = ~field.isna()
    df[prefix + 'Elapsed'] = np.where(mask,field.values.astype(np.int64) // 10 ** 9,np.nan)
    if drop: df.drop(field_name, axis=1, inplace=True)
    return df

In [95]:
add_datepart(dates, 'A')

Unnamed: 0,AYear,AMonth,AWeek,ADay,ADayofweek,ADayofyear,AIs_month_end,AIs_month_start,AIs_quarter_end,AIs_quarter_start,AIs_year_end,AIs_year_start,AElapsed
0,2001,9,38,17,0,260,False,False,False,False,False,False,1000685000.0
1,2002,1,1,1,1,1,False,True,False,True,False,True,1009843000.0


https://docs.fast.ai/tabular.core.html#add_datepart

__Warning__: `add_datepart` mutates the DataFrame, which pandas can do, but normally does not!
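If mutation is a concern, a few of the same attributes can be derived without touching the input by using the pandas `.dt` accessor with `.assign`, which returns a new DataFrame. The `date_features` helper below is a hypothetical, much smaller stand-in for `add_datepart`, not a replacement for it:

```python
import pandas as pd

dates = pd.DataFrame({'A': pd.to_datetime(['2001-09-17', '2002-01-01'])})

def date_features(df, col):
    """Return a new DataFrame with a few date attributes added."""
    field = df[col]
    return df.assign(**{
        col + 'Year': field.dt.year,
        col + 'Month': field.dt.month,
        col + 'Dayofweek': field.dt.dayofweek,  # Monday is 0
        col + 'Is_month_start': field.dt.is_month_start,
    })

features = date_features(dates, 'A')  # `dates` itself is left unchanged
```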

## Add col_na Feature

The `fastai` library used to have a function for creating a column to fill a missing value (with the median) and indicate that a value was missing. There might be some signal in knowing that a value is missing. Here is a copy of the function and an example using it:

In [96]:
from pandas.api.types import is_numeric_dtype

In [97]:
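def fix_missing(df, col, name, na_dict):
    # Fill missing numeric values and record which rows were missing.
    if is_numeric_dtype(col):
        if pd.isnull(col).sum() or (name in na_dict):
            df[name + '_na'] = pd.isnull(col)
            # Reuse a stored filler if present; otherwise use the median.
            filler = (na_dict[name] if name in na_dict else col.median())
            df[name] = col.fillna(filler)
            na_dict[name] = filler
    # Always return the dictionary so it can be reused on later data.
    return na_dict

In [98]:
data = pd.DataFrame({'A':[0,None,5,100]})

In [99]:
fix_missing(data, data.A, 'A', {})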

{'A': 5.0}

In [100]:
data

Unnamed: 0,A,A_na
0,0.0,False
1,5.0,True
2,5.0,False
3,100.0,False


Here is a pandas version:

In [101]:
data = pd.DataFrame({'A':[0,None,5,100]})

In [102]:
# Use the column (a Series) rather than the whole DataFrame for the flag.
data['A_na'] = data.A.isnull()

In [103]:
data['A'] = data.A.fillna(data.A.median())

In [104]:
data

Unnamed: 0,A,A_na
0,0.0,False
1,5.0,True
2,5.0,False
3,100.0,False
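Another possible route is scikit-learn's `SimpleImputer`, which can fill with the median and append a missing-value indicator column when `add_indicator=True`. The fitted imputer keeps the training median, so it can be applied to later data. The `future` frame here is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({'A': [0, None, 5, 100]})
future = pd.DataFrame({'A': [None, 7.0]})  # hypothetical later data

imp = SimpleImputer(strategy='median', add_indicator=True)

# First output column is the filled value, second is the missing flag.
train_out = imp.fit_transform(train)

# The training median (5.0) is reused for the later data.
future_out = imp.transform(future)
```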


## Manual Feature Engineering

We can use pandas to generate new features. For the Titanic dataset, we can add aggregate cabin data (maximum age per cabin, mean age per cabin, etc.). To get aggregate data per cabin and merge it back in, use the pandas `.groupby` method to create the data. Then align it back to the original data using the `.merge` method:

In [105]:
agg = (df.groupby('cabin').agg('min,max,mean,sum'.split(',')).reset_index())

In [106]:
agg.columns = ['_'.join(c).strip('_') for c in agg.columns.values]

In [107]:
agg_df = df.merge(agg, on ='cabin')

In [108]:
agg_df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,...,parch_mean,parch_sum,fare_min,fare_max,fare_mean,fare_sum,body_min,body_max,body_mean,body_sum
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,...,0.5,1,211.3375,211.3375,211.3375,422.675,,,,0.0
1,1,1,"Madill, Miss. Georgette Alexandra",female,15.0000,0,1,24160,211.3375,B5,...,0.5,1,211.3375,211.3375,211.3375,422.675,,,,0.0
2,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,...,2.0,8,151.5500,151.5500,151.5500,606.200,135.0,135.0,135.0,135.0
3,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,...,2.0,8,151.5500,151.5500,151.5500,606.200,135.0,135.0,135.0,135.0
4,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,...,2.0,8,151.5500,151.5500,151.5500,606.200,135.0,135.0,135.0,135.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
290,3,1,"Sandstrom, Mrs. Hjalmar (Agnes Charlotta Bengtsson)",female,24.0000,0,2,PP 9549,16.7000,G6,...,1.2,6,10.4625,16.7000,14.2050,71.025,,,,0.0
291,3,1,"Sandstrom, Miss. Marguerite Rut",female,4.0000,1,1,PP 9549,16.7000,G6,...,1.2,6,10.4625,16.7000,14.2050,71.025,,,,0.0
292,3,0,"Strom, Miss. Telma Matilda",female,2.0000,0,1,347054,10.4625,G6,...,1.2,6,10.4625,16.7000,14.2050,71.025,,,,0.0
293,3,0,"Strom, Mrs. Wilhelm (Elna Matilda Persson)",female,29.0000,1,1,347054,10.4625,G6,...,1.2,6,10.4625,16.7000,14.2050,71.025,,,,0.0
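Note that the merged result above has only 294 rows: `groupby` skips rows with a missing cabin, and the default inner merge then drops them. If every row should be kept, one alternative is `.transform`, which returns values aligned to the original index. The sketch below uses a tiny made-up frame (`small`) rather than the Titanic data:

```python
import pandas as pd

# Hypothetical toy data; the last row has no cabin.
small = pd.DataFrame({
    'cabin': ['B5', 'B5', 'G6', None],
    'age': [29.0, 15.0, 4.0, 40.0],
})

# Aggregate and merge, as above: the row without a cabin is dropped.
agg = (small.groupby('cabin')['age'].agg(['max', 'mean'])
       .add_prefix('age_').reset_index())
merged = small.merge(agg, on='cabin')

# transform keeps all rows; the row without a cabin gets NaN aggregates.
with_agg = small.assign(
    age_max=small.groupby('cabin')['age'].transform('max'),
    age_mean=small.groupby('cabin')['age'].transform('mean'),
)
```

Passing `how='left'` to `.merge` is another way to keep the rows that have no cabin.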


In [109]:
agg_df.to_excel(excel_writer='C:/Users/alber/Desktop/UniTn/Machine Learning practicals/agg_df.xls')

