# Feature Preprocessing and Engineering

In [25]:
%matplotlib inline
import numpy as np 
import pandas as pd
from datetime import timedelta
import datetime as dt
import matplotlib.pyplot as plt

In [26]:
plt.rcParams['figure.figsize'] = [13, 5]

The representation of your data can have a bigger influence on the performance of your model than the type of model or the exact hyperparameters you use. This lecture is about feature preprocessing and feature engineering.

## Pre-processing

Here are some useful functions: `df.info()`, `df.head()`, and `df['col'].value_counts()`.

In [27]:
# data for homework 2
# here is the data https://www.kaggle.com/c/avazu-ctr-prediction
path = "/Users/yinterian/teaching/ML-2/avazu_data/"
data = pd.read_csv(path + "train")
test = pd.read_csv(path + "test")

In [28]:
data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40428967 entries, 0 to 40428966
Data columns (total 24 columns):
id                  float64
click               int64
hour                int64
C1                  int64
banner_pos          int64
site_id             object
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type         int64
device_conn_type    int64
C14                 int64
C15                 int64
C16                 int64
C17                 int64
C18                 int64
C19                 int64
C20                 int64
C21                 int64
dtypes: float64(1), int64(14), object(9)
memory usage: 26.5 GB


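Note that the data takes more than 7.2 GB even before the string contents are counted (26.5 GB with `memory_usage='deep'`), partly because every integer column is stored as int64.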

To select the appropriate type, read
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html. For example, int8 covers -128 to 127.

In [29]:
data.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1.000009e+18,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,2,15706,320,50,1722,0,35,-1,79
1,1.000017e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
2,1.000037e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
3,1.000064e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15706,320,50,1722,0,35,100084,79
4,1.000068e+19,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,...,1,0,18993,320,50,2161,0,35,-1,157


In [30]:
data.describe() # look at the min and max of every column. Can we change column types?

Unnamed: 0,id,click,hour,C1,banner_pos,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
count,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0,40428970.0
mean,9.223017e+18,0.1698056,14102560.0,1004.968,0.2880146,1.015305,0.331315,18841.81,318.8831,60.10201,2112.601,1.432499,227.1444,53216.85,83.38229
std,5.325443e+18,0.375462,296.6837,1.094586,0.506382,0.5274336,0.8547935,4959.457,21.2725,47.29538,609.4124,1.326227,351.0221,49956.82,70.28996
min,521159400000.0,0.0,14102100.0,1001.0,0.0,0.0,0.0,375.0,120.0,20.0,112.0,0.0,33.0,-1.0,1.0
25%,4.611181e+18,0.0,14102300.0,1005.0,0.0,1.0,0.0,16920.0,320.0,50.0,1863.0,0.0,35.0,-1.0,23.0
50%,9.223224e+18,0.0,14102600.0,1005.0,0.0,1.0,0.0,20346.0,320.0,50.0,2323.0,2.0,39.0,100048.0,61.0
75%,1.383561e+19,0.0,14102810.0,1005.0,1.0,1.0,0.0,21894.0,320.0,50.0,2526.0,3.0,171.0,100093.0,101.0
max,1.844674e+19,1.0,14103020.0,1012.0,7.0,5.0,5.0,24052.0,1024.0,1024.0,2758.0,3.0,1959.0,100248.0,255.0


In [31]:
data["device_conn_type"].value_counts()

0    34886838
2     3317443
3     2181796
5       42890
Name: device_conn_type, dtype: int64

### Reducing memory usage

In [9]:
# Choose the smallest type that covers the min/max reported by describe().
# Caution: describe() above shows id up to ~1.8e19 and C20 ranging from -1 to 100248,
# so uint32 for id and uint16 for C20 cannot hold every value; np.uint64 and np.int32 would.
types = {'id': np.uint32, 'click': np.uint8, 'hour': np.uint32, 'C1': np.uint32, 'banner_pos': np.uint32,
         'site_id': object, 'site_domain': object, 'site_category': object, 'app_id': object,
         'app_domain': object, 'app_category': object, 'device_id': object,
         'device_ip': object, 'device_model': object, 'device_type': np.uint8, 'device_conn_type': np.uint8,
         'C14': np.uint16, 'C15': np.uint16, 'C16': np.uint16, 'C17': np.uint16, 'C18': np.uint16, 'C19': np.uint16,
         'C20': np.uint16, 'C21': np.uint16}

data = pd.read_csv(path + "train", usecols=list(types), dtype=types)
print(data.info(memory_usage='deep'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40428967 entries, 0 to 40428966
Data columns (total 24 columns):
id                  uint32
click               uint8
hour                uint32
C1                  uint32
banner_pos          uint32
site_id             object
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type         uint8
device_conn_type    uint8
C14                 uint16
C15                 uint16
C16                 uint16
C17                 uint16
C18                 uint16
C19                 uint16
C20                 uint16
C21                 uint16
dtypes: object(9), uint16(8), uint32(4), uint8(3)
memory usage: 4.0+ GB
None


You can further reduce memory usage by converting the "object" type into categorical or numerical variables.
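
As a rough illustration of the point above, the sketch below compares the memory of a string column stored as `object` with the same column stored as a pandas categorical. The toy column is hypothetical (it reuses two `site_category` values from the output above); the real savings depend on how often values repeat.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a string column with many repeated values
rng = np.random.default_rng(0)
values = rng.choice(["28905ebd", "0569f928"], size=10_000)
col = pd.Series(values).astype(object)

as_object = col.memory_usage(deep=True)
as_category = col.astype("category").memory_usage(deep=True)

# The categorical stores each distinct string once plus small integer codes
print(as_object, as_category)
```

Columns where almost every value is unique (for example `device_ip`) may gain little from this conversion.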

## Numerical features

Some of these ideas were already explained in ML-1, but we summarize them here for completeness.

**Summary**: Linear models, neural networks and KNN need **feature scaling** while tree-based methods don't need scaling.

* KNN needs feature scaling because the distance between points is greatly affected by scaling.
* Linear models and neural networks need scaling because:
     * The amount of regularization applied to a feature depends on the feature's scale.
     * Optimization methods converge more rapidly when features are scaled.
* Scales are computed on the training set and applied to the test/validation sets.
* Outliers need to be dealt with, since they distort the computed scales.
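
The rule that scales come from the training set can be sketched with plain NumPy. The arrays below are hypothetical toy data, and standardization is used only as one example of a scaling method.

```python
import numpy as np

# Hypothetical toy data: rows are observations, columns are features
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Statistics are computed on the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# The same statistics are then applied to both sets
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

Note that the test values can fall outside the range seen in training, which is expected.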

In [13]:
# Boston house pricing
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
boston = load_boston()
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['target'] = boston.target

In [14]:
data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [15]:
train=data.sample(frac=0.8, random_state=3)
test=data.drop(train.index)

### MinMaxScaler
Transforms features by scaling each feature to a given range.
```
min = X.min() 
max = X.max() 
X = (X - min)/(max - min) 
```

In [23]:
# https://github.com/scikit-learn-contrib/sklearn-pandas
# how to do transforms with pandas and sklearn
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import MinMaxScaler

mapper = DataFrameMapper([(train.columns, MinMaxScaler())])

# fit on the training set only; wrap the result in np.round(..., 4) to round to 4 digits
scaled_train = mapper.fit_transform(train.copy())
scaled_train = pd.DataFrame(scaled_train, index=train.index, columns=train.columns) # converts back to a dataframe
scaled_test = mapper.transform(test.copy())
scaled_test = pd.DataFrame(scaled_test, index=test.index, columns=test.columns)

In [22]:
scaled_train = mapper.fit_transform(train.copy())  # returns a NumPy array
scaled_train[0]

array([ 0.00416552,  0.        ,  0.21041056,  0.        ,  0.24485597,
        0.9015137 ,  0.77651905,  0.15991628,  0.30434783,  0.22900763,
        0.5106383 ,  0.96994674,  0.0665011 ,  0.88100686])

### Standard Scaler

```
mean = X.mean()
std = X.std()
X = (X - mean)/std
```

Exercise: Reproduce the previous example with standard scaler.

### Outliers

**Summary**: If you see outliers that don't make sense, you can discard observations above the 99% quantile or below the 1% quantile. Alternatively, you can "clip" those values to the 99% or 1% quantile. Here is an example.

```
{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41} 
to
{92, 19, 101, 58, 101, 91, 26, 78, 10, 13, -5, 101, 86, 85, 15, 89, 89, 28, -5, 41} 
```

Some models (such as linear regression) are highly sensitive to outliers. Read more on outliers [here](https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/).

In [24]:
import scipy.stats
import numpy as np
a = np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41])
scipy.stats.mstats.winsorize(a, limits=0.05)

masked_array(data = [ 92  19 101  58 101  91  26  78  10  13  -5 101  86  85  15  89  89  28
  -5  41],
             mask = False,
       fill_value = 999999)
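
Clipping at quantiles can also be sketched with `np.percentile` and `np.clip`. This is an alternative to `winsorize`, not the same procedure: the bounds here are interpolated percentiles, so the clipped values can differ from the winsorized ones. In practice the bounds would be computed on the training set and reused for validation and test data.

```python
import numpy as np

a = np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
              -40, 101, 86, 85, 15, 89, 89, 28, -5, 41])

# 5% and 95% percentiles (linear interpolation between sorted values)
low, high = np.percentile(a, [5, 95])

# Values below `low` become `low`, values above `high` become `high`
clipped = np.clip(a, low, high)
```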

### Rank transformation
Example: rank([-100, 0, 10000]) = [1,2,3]

It smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.

In [11]:
scipy.stats.rankdata([-100,-100,0,100000])

array([ 1.5,  1.5,  3. ,  4. ])

#### Other transformations
```
np.log(x + 1)
np.sqrt(x + 1)
```
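
A small sketch of the log transform, assuming a non-negative feature `x` (hypothetical values). `np.log1p(x)` computes the same quantity as `np.log(x + 1)` but stays accurate for very small `x`, and `np.expm1` inverts it.

```python
import numpy as np

x = np.array([0.0, 1e-10, 1.0, 1000.0])   # hypothetical non-negative feature

log_feature = np.log1p(x)          # log(x + 1), compresses large values
restored = np.expm1(log_feature)   # inverse transform, exp(v) - 1
```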

## Categorical and ordinal features

Ordinal = Ordered categorical features.

**Summary**:

* Label encoding can be used by tree-based methods.
* For non-tree-based methods use one-hot encoding (or embeddings, which will be discussed later).
* High cardinality can create very sparse data. One-hot encoding can be used with sparse matrices.
* Missing values are difficult to impute. NA can be treated as another category.

### Label encoding
Encodes labels with values between 0 and n_classes-1, transforming non-numerical labels into numerical ones. This method is useful for **tree-based** methods.

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [15]:
df = pd.DataFrame({'A':['a','b','c'],
                   'B':['T1','T1','T3']})
df

Unnamed: 0,A,B
0,a,T1
1,b,T1
2,c,T3


In [16]:
df.apply(lambda x: pd.factorize(x)[0])

Unnamed: 0,A,B
0,0,0
1,1,0
2,2,1


In [35]:
df = pd.DataFrame({'A':['a','b','b','c']})
pd.Categorical(df['A'],categories=['a', 'b'])

[a, b, b, NaN]
Categories (2, object): [a, b]
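
Building on the `pd.Categorical` cell above, here is a minimal sketch of label encoding where the categories are learned from the training column and reused on the test column. The two toy series are hypothetical. Categories that were not seen in training receive the code `-1`.

```python
import pandas as pd

train_col = pd.Series(["a", "b", "b"])
test_col = pd.Series(["a", "c"])          # "c" never appears in train

# Categories are learned from the training data only
categories = sorted(train_col.unique())

train_codes = pd.Categorical(train_col, categories=categories).codes
test_codes = pd.Categorical(test_col, categories=categories).codes

print(list(train_codes))   # [0, 1, 1]
print(list(test_codes))    # [0, -1]
```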

### One-hot encoding

* Often used for linear models.
* Produces very high dimensionality, which increases the model's training time, serving time, and memory consumption.
* Can easily cause a model to overfit the data.
* Can't handle categories that weren't in the training data (e.g. a new city or device type). This can be problematic in domains that change all the time.
* Some of these disadvantages can be reduced by encoding all rare categories as the same feature ("Other"). This can reduce the dimensionality drastically in some datasets with little or no decrease in performance.
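
The "Other" trick from the last bullet can be sketched with `pd.get_dummies`. The city names and the `min_count` threshold are hypothetical. New data is mapped with the categories learned from training, so unseen values also land in "Other".

```python
import pandas as pd

cities = pd.Series(["NY", "NY", "SF", "SF", "LA", "Reno"], name="city")
min_count = 2   # hypothetical rarity threshold

# Categories that appear at least min_count times in the training data
counts = cities.value_counts()
frequent = counts[counts >= min_count].index

# Everything else is collapsed into a single "Other" category
grouped = cities.where(cities.isin(frequent), "Other")
one_hot = pd.get_dummies(grouped, prefix="city")

# New data: unseen "Tokyo" falls into "Other"; columns are aligned with training
new = pd.Series(["Tokyo", "NY"], name="city")
new_grouped = new.where(new.isin(frequent), "Other")
new_one_hot = (pd.get_dummies(new_grouped, prefix="city")
                 .reindex(columns=one_hot.columns, fill_value=0))
```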

### Frequency encoding

Each category is replaced by the frequency of that category in the training data. Used for tree-based methods.
```
["a", "a", "a", "b", "c"]
```
is encoded as
```
[3/5, 3/5, 3/5, 1/5, 1/5]
```
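
The encoding above can be reproduced with `value_counts(normalize=True)` and `map`. Giving unseen categories a frequency of 0 is an assumption made for this sketch, not the only reasonable choice.

```python
import pandas as pd

train_col = pd.Series(["a", "a", "a", "b", "c"])

# Relative frequency of each category in the training data
freq = train_col.value_counts(normalize=True)

encoded = train_col.map(freq)

# Categories unseen in training (here "d") get frequency 0 in this sketch
test_encoded = pd.Series(["a", "d"]).map(freq).fillna(0.0)
```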

### Label encoding versus one-hot encoding

When to use label encoding versus one-hot encoding.

Tree based methods:
* When a categorical feature is ordinal, **label encoding** can lead to better quality if it preserves the correct order of values. In this case a split made by a tree divides the feature into values 'lower' and 'higher' than the value chosen for the split.

Non-tree based methods:
* One-hot encoding or embeddings should be used.
* Unless there is a linear relationship between the label encoding and the dependent variable, non-tree based methods will have a hard time with label encoding.


One-hot encoding a categorical feature with a huge number of values can lead to high memory consumption. You can use sparse matrices to deal with this problem. You can also ignore a subset of rare categories to decrease the number of new features.

### Feature hashing (hashing trick)
It is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly. It works well in settings with a large number of categories.

Pros:
* It is low dimensional, so it is efficient in processing time and memory.
* It can be computed online (without seeing all the data).

Cons:
* Hash functions sometimes collide, so if H(New York) = H(Tehran) the model can't tell which city was in the data. Studies have shown that collisions usually don't significantly affect the model's performance.
* Hashed features are not interpretable, so things like feature importance and model debugging become very hard.
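
To make the mechanics concrete, here is a minimal from-scratch sketch of the hashing trick. It is not how `category_encoders` implements it. The helper names `hash_index` and `hash_encode` and the bucket count are hypothetical, and `hashlib.md5` is used only because it gives the same result on every run.

```python
import hashlib
import numpy as np

def hash_index(value, n_buckets):
    """Map any value to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(values, n_buckets=8):
    """Return one row per value with a single 1 in the hashed bucket."""
    out = np.zeros((len(values), n_buckets), dtype=np.uint8)
    for row, value in enumerate(values):
        out[row, hash_index(value, n_buckets)] = 1
    return out

cities = ["New York", "Tehran", "New York", "Lima"]
encoded = hash_encode(cities)
```

No dictionary of categories is stored, which is why this can be computed online. Two different cities may still share a bucket.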

You need to install the package `category_encoders`:
```
conda install -c conda-forge category_encoders
```

In [51]:
data = pd.DataFrame({
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})
y = np.array([0, 0, 1, 1, 1])

In [52]:
# Note: Python 3 randomizes str hashes per process, so this value changes between runs
print(hash("CAL"))

8522546718958492772


In [55]:
from category_encoders import HashingEncoder
enc = HashingEncoder(cols=['year'], n_components=10).fit(data, None)

In [56]:
data2 = enc.transform(data)
data2

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,pop,state
0,0,0,0,0,0,0,1,0,0,0,1.5,Ohio
1,0,0,0,0,0,1,0,0,0,0,1.7,Ohio
2,0,1,0,0,0,0,0,0,0,0,3.6,Ohio
3,0,0,0,0,0,1,0,0,0,0,2.4,Nevada
4,0,1,0,0,0,0,0,0,0,0,2.9,Nevada


## Date and time

* Periodicity
    * Day number in week, month, season, year, second, minute, hour
* Time since
* Difference between dates
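
A sketch of these ideas with pandas, assuming the Avazu `hour` column is encoded as YYMMDDHH (an assumption based on the competition's data description; the two values below are taken from the range shown in `describe()` above).

```python
import pandas as pd

hours = pd.Series([14102100, 14103023])

# Parse YYMMDDHH into timestamps
ts = pd.to_datetime(hours.astype(str), format="%y%m%d%H")

date_features = pd.DataFrame({
    "hour_of_day": ts.dt.hour,            # periodicity within a day
    "day_of_week": ts.dt.dayofweek,       # Monday = 0
    "hours_since_start": (ts - ts.min()) / pd.Timedelta(hours=1),  # time since
})
```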

## Coordinates

* Distance to interesting points from external data or training data
* Cluster your data and use the center of the cluster to compute distances
* Compute aggregated statistics
    * Mean sale price per neighbourhood
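
For the first bullet, one common choice is the great-circle (haversine) distance. The sketch below assumes latitude/longitude in degrees and a spherical Earth of radius 6371 km; the coordinates and the point of interest are hypothetical.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    r = 6371.0  # mean Earth radius in km (spherical approximation)
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# Hypothetical observations and a hypothetical point of interest
lats = np.array([37.77, 37.80])
lons = np.array([-122.42, -122.27])
center = (37.7749, -122.4194)

dist_to_center = haversine_km(lats, lons, center[0], center[1])
```

The same function can be applied to cluster centers obtained from the training data.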

## Missing values

* Missing values can be hidden (replaced by a placeholder such as 9999 or ?).
* Replacing missing values
    * mean, median
    * -999 works for tree-based methods
    * reconstruct the missing value (train a model that predicts it)
* Add a new column `is_null` for every feature with missing values.
* Treat categories present in the test data but not present in the train data as missing values.
* Some methods like XGBoost can handle missing values.
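
Several of these steps can be combined in a short pandas sketch. The `age` column, the frame names, and the choice of 9999 as a hidden marker are hypothetical. The median is computed on the training frame only.

```python
import numpy as np
import pandas as pd

train_df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 9999.0]})
test_df = pd.DataFrame({"age": [np.nan, 31.0]})

# 9999 is assumed to be a hidden missing-value marker in this toy column
train_df["age"] = train_df["age"].replace(9999.0, np.nan)

median = train_df["age"].median()   # computed on train only

for df in (train_df, test_df):
    # Keep track of which values were missing before filling them
    df["age_is_null"] = df["age"].isna().astype(np.uint8)
    df["age"] = df["age"].fillna(median)
```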

## Feature extraction from text
We will talk about this in week 5.

## Target or mean encoding
Use the target variable to generate features.

In [64]:
data = pd.DataFrame({
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'y': [1, 1, 0, 0, 0]})
data

Unnamed: 0,state,y
0,Ohio,1
1,Ohio,1
2,Ohio,0
3,Nevada,0
4,Nevada,0


In [77]:
m =  pd.DataFrame({'y_mean' : data["y"].groupby(data["state"]).mean()}).reset_index()
m

Unnamed: 0,state,y_mean
0,Nevada,0.0
1,Ohio,0.666667


In [78]:
pd.merge(data, m, how="left", on=["state"])

Unnamed: 0,state,y,y_mean
0,Ohio,1,0.666667
1,Ohio,1,0.666667
2,Ohio,0,0.666667
3,Nevada,0,0.0
4,Nevada,0,0.0


Note that mean encoding needs to be computed on the training set and later joined with the validation and test sets.
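
A sketch of that train/test split of responsibilities, reusing the toy data above. The test frame and the fallback to the global training mean for unseen categories are assumptions of this sketch.

```python
import pandas as pd

train_df = pd.DataFrame({"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
                         "y": [1, 1, 0, 0, 0]})
test_df = pd.DataFrame({"state": ["Ohio", "Texas"]})   # Texas is unseen in train

# Per-category target mean, computed on the training data only
means = train_df.groupby("state")["y"].mean().rename("y_mean").reset_index()
global_mean = train_df["y"].mean()

# Join onto the test data; unseen categories fall back to the global mean
test_enc = test_df.merge(means, how="left", on="state")
test_enc["y_mean"] = test_enc["y_mean"].fillna(global_mean)
```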

# Aggregation and distance based features

For this section let's think about the CTR data we described above. We have categorical features like `site_id`, `app_id`, `device_ip` and numerical features like `C14`, `C15`, etc.

BTW, it is not clear that `C14`, `C15` should be considered numerical features in this case, but let's assume they are for now.

## Aggregate by one or multiple categorical features

Here are some features that we can compute:
* Number of times `device_ip` appears in the training data. It would be better if we had `user_id`.
* Number of times `device_ip` appears per month in the training data.
* Min, max, average `C1` per `site_id`.

We can compute these features using `groupby` to aggregate to a new dataframe and then use `merge` to make the new feature.
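
The `groupby` then `merge` pattern can be sketched as follows. The toy frame is hypothetical and only borrows column names from the CTR data.

```python
import pandas as pd

toy = pd.DataFrame({
    "device_ip": ["ip1", "ip1", "ip2", "ip1"],
    "site_id": ["s1", "s2", "s1", "s1"],
    "C1": [1005, 1002, 1005, 1010],
})

# Number of rows per device_ip
ip_counts = toy.groupby("device_ip").size().rename("device_ip_count").reset_index()

# Min, max and mean of C1 per site_id
c1_stats = (toy.groupby("site_id")["C1"]
               .agg(["min", "max", "mean"])
               .add_prefix("C1_site_")
               .reset_index())

# Left merges keep every original row and attach the aggregates
features = (toy.merge(ip_counts, how="left", on="device_ip")
               .merge(c1_stats, how="left", on="site_id"))
```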

## Features based on KNN
A $K$ nearest neighbor (KNN) classifier looks at the $K$ points in the training set that are nearest to the test input $x$ and returns the mean of their target values. That mean can be used as a feature.

There are many other possibilities here. 
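
One such feature, the mean target of the $K$ nearest training points, can be sketched with NumPy alone. The helper name `knn_mean_target` and the toy arrays are hypothetical. The brute-force distance matrix is only practical for small data, and when the feature is computed for training rows the point itself should be excluded to avoid leaking its own target.

```python
import numpy as np

def knn_mean_target(X_train, y_train, X_query, k=3):
    """Mean target of the k nearest training points for each query row."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k closest training points per query row
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0.0, 0.0, 1.0, 1.0])
X_query = np.array([[0.5], [10.5]])

knn_feature = knn_mean_target(X_train, y_train, X_query, k=3)
```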

# References

* https://www.slideshare.net/HJvanVeen/feature-engineering-72376750?trk=v-feed
* https://www.coursera.org/learn/competitive-data-science
* http://scikit-learn.org/stable/modules/preprocessing.html
* https://github.com/amueller/introduction_to_ml_with_python/blob/master/04-representing-data-feature-engineering.ipynb
* Introduction to Machine Learning with Python. Muller & Guido Chapter 4.
* https://blog.myyellowroad.com/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-66041f734512
* https://www.dataquest.io/blog/pandas-big-data/

In [16]:
# idea for hw https://www.csie.ntu.edu.tw/~r01922136/slides/kaggle-avazu.pdf

In [80]:
data = pd.DataFrame({
        'prediction': [0.31, 0.52, 0.95, 0.83, 0.45, 0.03, 0.44],
        'y': [1, 0, 1, 1, 1, 0, 0]})
data

Unnamed: 0,prediction,y
0,0.31,1
1,0.52,0
2,0.95,1
3,0.83,1
4,0.45,1
5,0.03,0
6,0.44,0


In [84]:
data.sort_values(["prediction"]).T

Unnamed: 0,5,0,6,4,1,3,2
prediction,0.03,0.31,0.44,0.45,0.52,0.83,0.95
y,0.0,1.0,0.0,1.0,0.0,1.0,1.0


In [85]:
data = pd.DataFrame({
        'prediction': [0.3, 0.5, 0.95, 0.99, 0.8, 0.4, 0.03, 0.44],
        'y': [1, 0, 1, 1, 1, 1, 0, 0]})
data.sort_values(["prediction"]).T

Unnamed: 0,6,0,5,7,1,4,2,3
prediction,0.03,0.3,0.4,0.44,0.5,0.8,0.95,0.99
y,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0


In [86]:
data

Unnamed: 0,prediction,y
0,0.3,1
1,0.5,0
2,0.95,1
3,0.99,1
4,0.8,1
5,0.4,1
6,0.03,0
7,0.44,0
