# <center> Recap </center>

- pandas package and data frames
- how to read in files into a data frame
- how to select rows (3 methods)
- how to select columns
- how to merge (4 methods) and append data frames

# <center> Data preprocessing, part 1, categorical and continuous features </center>
### By the end of this lecture, you will be able to
- describe why preprocessing is necessary
- apply one-hot encoding or ordinal encoding to categorical variables with and without missing data
- apply scaling and normalization to continuous variables
- apply label encoding to a categorical target variable


## Problem description, why preprocessing is necessary

Data format suitable for ML: 2D numerical values.

| X|feature_1|feature_2|...|feature_j|...|feature_m|<font color='red'>y</font>|
|-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__data_point_1__|x_11|x_12|...|x_1j|...|x_1m|__<font color='red'>y_1</font>__|
|__data_point_2__|x_21|x_22|...|x_2j|...|x_2m|__<font color='red'>y_2</font>__|
|__...__|...|...|...|...|...|...|__<font color='red'>...</font>__|
|__data_point_i__|x_i1|x_i2|...|x_ij|...|x_im|__<font color='red'>y_i</font>__|
|__...__|...|...|...|...|...|...|__<font color='red'>...</font>__|
|__data_point_n__|x_n1|x_n2|...|x_nj|...|x_nm|__<font color='red'>y_n</font>__|

### Data almost never comes in a format that's directly usable in ML.
- ML works with numerical data but some columns can be text (e.g., home country, educational level, gender, race)
- the order of magnitude of numerical features can vary greatly which is not good for most ML algorithms (e.g., salary in USD, age in years, time spent on the site in sec)
- the target variable is not in the right format (e.g., the target variable is text in classification)
- some values are missing, and most ML methods implemented in scikit-learn don't work with NaNs. The next lecture covers this.

### scikit-learn transformers to the rescue!

Preprocessing is done with various transformers. All transformers have three methods:
- **fit** method: estimates parameters necessary to do the transformation,
- **transform** method: transforms the data based on the estimated parameters,
- **fit_transform** method: both steps are performed at once, this can be faster than doing the steps separately.
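
The three methods can be illustrated with a minimal sketch. The toy arrays below are hypothetical and `StandardScaler` is used only as an example transformer; any other scikit-learn transformer follows the same pattern.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: one feature with three training values and one test value
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                 # estimates the mean and standard deviation of X_train
print(scaler.mean_, scaler.scale_)  # the estimated parameters

X_train_scaled = scaler.transform(X_train)  # applies the estimated parameters
X_test_scaled = scaler.transform(X_test)    # reuses the training parameters on new data

# fit_transform gives the same result as fit followed by transform
X_train_scaled_2 = StandardScaler().fit_transform(X_train)
print(np.allclose(X_train_scaled, X_train_scaled_2))
```

Note that `transform` on the test data reuses the parameters estimated from the training data; nothing is re-estimated.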

### Transformers we cover today
- **OneHotEncoder** - converts categorical features into dummy arrays
- **OrdinalEncoder** - converts categorical features into an integer array
- **MinMaxScaler** - scales continuous variables to be between 0 and 1
- **StandardScaler** - standardizes continuous features by removing the mean and scaling to unit variance
- **LabelEncoder** - converts text target variable to numerical values


### Lecture outline

- Describe how the transformers work on some dummy data
- Show how and when to use them in the ML pipeline
- Common preprocessing mistakes
- Show scikit-learn pipelines to simplify your code

## Unordered categorical data: one-hot encoder

In [1]:
# let's assume we have two categorical features in the training and test sets
# first feature: gender (Male, Female, or Unknown)
# second feature: browser used
# these categories cannot be ordered

import pandas as pd

train = {'gender':['Male','Female','Unknown','Male','Female','Female'],\
         'browser':['Safari','Safari','Internet Explorer','Chrome','Chrome','Internet Explorer']}
test = {'gender':['Female','Male','Unknown','Female'],'browser':['Chrome','Firefox','Internet Explorer','Safari']}

X_train = pd.DataFrame(train)
X_test = pd.DataFrame(test)
# How do we convert this to numerical features?
 
from sklearn.preprocessing import OneHotEncoder

help(OneHotEncoder)

Help on class OneHotEncoder in module sklearn.preprocessing._encoders:

class OneHotEncoder(_BaseEncoder)
 |  Encode categorical integer features as a one-hot numeric array.
 |  
 |  The input to this transformer should be an array-like of integers or
 |  strings, denoting the values taken on by categorical (discrete) features.
 |  The features are encoded using a one-hot (aka 'one-of-K' or 'dummy')
 |  encoding scheme. This creates a binary column for each category and
 |  returns a sparse matrix or dense array.
 |  
 |  By default, the encoder derives the categories based on the unique values
 |  in each feature. Alternatively, you can also specify the `categories`
 |  manually.
 |  The OneHotEncoder previously assumed that the input features take on
 |  values in the range [0, max(values)). This behaviour is deprecated.
 |  
 |  This encoding is needed for feeding categorical data to many scikit-learn
 |  estimators, notably linear models and SVMs with the standard kernels.
 |  
 | 

In [2]:
# initialize the encoder
# by default, enc returns a sparse matrix. sparse=False returns a dense 2D array
# (in recent scikit-learn versions this parameter is called sparse_output)
enc = OneHotEncoder(sparse=False)
# fit the training data
enc.fit(X_train)
print('categories:',enc.categories_)
# recent scikit-learn versions use enc.get_feature_names_out() instead
print('feature names:',enc.get_feature_names())
# transform X_train
X_train_ohe = enc.transform(X_train)
#print(X_train_ohe)

# do all of this in one step
X_train_ohe = enc.fit_transform(X_train)
#print(X_train_ohe)

# transform X_test
#X_test_ohe = enc.transform(X_test)
#print(X_test_ohe)

categories: [array(['Female', 'Male', 'Unknown'], dtype=object), array(['Chrome', 'Internet Explorer', 'Safari'], dtype=object)]
feature names: ['x0_Female' 'x0_Male' 'x0_Unknown' 'x1_Chrome' 'x1_Internet Explorer'
 'x1_Safari']
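
The test set contains a browser ('Firefox') that never appears in the training set, so with the default settings `enc.transform(X_test)` raises an error. One possible way to handle this, sketched below, is the `handle_unknown='ignore'` option, which encodes an unseen category as all zeros in that feature's columns. Whether this is appropriate depends on the problem; the sketch uses the default sparse output and converts it with `toarray()` so that it does not depend on the name of the sparse parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({'gender': ['Male', 'Female', 'Unknown', 'Male', 'Female', 'Female'],
                        'browser': ['Safari', 'Safari', 'Internet Explorer', 'Chrome', 'Chrome', 'Internet Explorer']})
X_test = pd.DataFrame({'gender': ['Female', 'Male', 'Unknown', 'Female'],
                       'browser': ['Chrome', 'Firefox', 'Internet Explorer', 'Safari']})

# handle_unknown='ignore': categories not seen during fit are encoded as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)

# the default output is a sparse matrix; toarray() converts it to a dense 2D array
X_test_ohe = enc.transform(X_test).toarray()
print(X_test_ohe)
```

The second test row (Male, Firefox) has a 1 in the 'Male' column and zeros in all three browser columns.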


## Ordered categorical data: OrdinalEncoder

In [3]:
# the categories can be ordered
# e.g., educational level

train_edu = {'educational level':['Bachelors','Masters','Bachelors','Doctorate','HS-grad','Masters']} 
test_edu = {'educational level':['HS-grad','Masters','Masters','College','Bachelors']}

X_train = pd.DataFrame(train_edu)
X_test = pd.DataFrame(test_edu)

from sklearn.preprocessing import OrdinalEncoder
help(OrdinalEncoder)

Help on class OrdinalEncoder in module sklearn.preprocessing._encoders:

class OrdinalEncoder(_BaseEncoder)
 |  Encode categorical features as an integer array.
 |  
 |  The input to this transformer should be an array-like of integers or
 |  strings, denoting the values taken on by categorical (discrete) features.
 |  The features are converted to ordinal integers. This results in
 |  a single column of integers (0 to n_categories - 1) per feature.
 |  
 |  Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
 |  
 |  Parameters
 |  ----------
 |  categories : 'auto' or a list of lists/arrays of values.
 |      Categories (unique values) per feature:
 |  
 |      - 'auto' : Determine categories automatically from the training data.
 |      - list : ``categories[i]`` holds the categories expected in the ith
 |        column. The passed categories should not mix strings and numeric
 |        values, and should be sorted in case of numeric values.
 |  
 |      The use

In [4]:
# initialize the encoder
# The ordered list of categories needs to be provided.
# By default, the categories are sorted alphabetically!
enc = OrdinalEncoder(categories = [['HS-grad','Bachelors','Masters','Doctorate']])

# fit the training data
enc.fit(X_train)
# print the categories - not really important this time because we manually gave the ordered list of categories
print(enc.categories_)
# transform X_train. Again, we could have used enc.fit_transform(X_train) to combine fit and transform
X_train_oe = enc.transform(X_train)
print(X_train_oe)
# transform X_test
# 'College' is not in the category list. OrdinalEncoder has no handle_unknown='ignore' option;
# by default it raises an error for unknown categories.
X_test_oe = enc.transform(X_test)


[array(['HS-grad', 'Bachelors', 'Masters', 'Doctorate'], dtype=object)]
[[1.]
 [2.]
 [1.]
 [3.]
 [0.]
 [2.]]


ValueError: Found unknown categories ['College'] in column 0 during transform
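
If raising an error is not the desired behavior, one option is sketched below. It assumes scikit-learn 0.24 or newer, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'` together with `unknown_value`. The choice of -1 as the code for unknown categories is an illustrative assumption; another option is to add 'College' to the ordered category list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({'educational level': ['Bachelors', 'Masters', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters']})
X_test = pd.DataFrame({'educational level': ['HS-grad', 'Masters', 'Masters', 'College', 'Bachelors']})

# assumption: scikit-learn >= 0.24; unknown categories are mapped to -1 instead of raising an error
enc = OrdinalEncoder(categories=[['HS-grad', 'Bachelors', 'Masters', 'Doctorate']],
                     handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(X_train)

X_test_oe = enc.transform(X_test)
print(X_test_oe)  # the 'College' row is encoded as -1
```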

## Exercise 1
Would you use the OneHotEncoder or the OrdinalEncoder for the following categorical features?
- marital status (Married, Divorced, Never-married, Separated, Widowed)
- economic status (low, medium, high)
- race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)


## Continuous features: MinMaxScaler

In [None]:
# let's assume we have two continuous features:
# feature 1: age [years]
# feature 2: number of hours worked per week [hours]
train = {'age':[32,65,13,68,42,75,32],'number of hours worked':[0,40,10,60,40,20,40]}
test = {'age':[83,26,10,60],'number of hours worked':[0,40,0,60]}

X_train = pd.DataFrame(train)
X_test = pd.DataFrame(test)

# if the continuous feature values are reasonably bounded, MinMaxScaler is a good way to scale the features.
# Age is expected to be within the range of 0 and 100.
# Number of hours worked per week is in the range of 0 to 80.
# If unsure, plot the histogram of the feature!

from sklearn.preprocessing import MinMaxScaler
help(MinMaxScaler)

In [None]:
scaler = MinMaxScaler()
scaler.fit(X_train)
print(scaler.transform(X_train))
print(scaler.transform(X_test)) # note how scaled X_test contains values larger than 1 and smaller than 0.
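
To see what the scaler computes, the transformation can be reproduced by hand: each value becomes (x - min) / (max - min), where min and max are estimated from the training set only. The sketch below is a check of that formula against `MinMaxScaler` on the same toy data, not a replacement for the transformer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X_train = pd.DataFrame({'age': [32, 65, 13, 68, 42, 75, 32],
                        'number of hours worked': [0, 40, 10, 60, 40, 20, 40]})
X_test = pd.DataFrame({'age': [83, 26, 10, 60],
                       'number of hours worked': [0, 40, 0, 60]})

# the minimum and maximum are estimated from the training set only
train_min = X_train.min()
train_max = X_train.max()

X_train_manual = (X_train - train_min) / (train_max - train_min)
X_test_manual = (X_test - train_min) / (train_max - train_min)

scaler = MinMaxScaler().fit(X_train)
print(np.allclose(X_train_manual, scaler.transform(X_train)))
print(np.allclose(X_test_manual, scaler.transform(X_test)))
print(X_test_manual)
```

The test ages 83 and 10 fall outside the training range of 13 to 75, which is why their scaled values fall outside the interval from 0 to 1.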

## Continuous features: StandardScaler

In [None]:
# If the continuous feature values follow a distribution with a tail, StandardScaler is a better choice!
# Salaries are a good example. Most people earn less than 100k but there are a small number of super-rich people.
# feature 1: salary
train = {'salary':[50_000,75_000,40_000,1_000_000,30_000,250_000,35_000,45_000]}
test = {'salary':[25_000,55_000,1_500_000,60_000]}

X_train = pd.DataFrame(train)
X_test = pd.DataFrame(test)

from sklearn.preprocessing import StandardScaler
help(StandardScaler)


In [None]:
scaler = StandardScaler()
print(scaler.fit_transform(X_train))
print(scaler.transform(X_test))
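
The standardization can also be reproduced by hand as (x - mean) / std, with the mean and standard deviation estimated from the training set. The sketch below checks this against `StandardScaler` on the same salary data; note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({'salary': [50_000, 75_000, 40_000, 1_000_000, 30_000, 250_000, 35_000, 45_000]})
X_test = pd.DataFrame({'salary': [25_000, 55_000, 1_500_000, 60_000]})

# the mean and standard deviation are estimated from the training set only
train_mean = X_train.mean()
train_std = X_train.std(ddof=0)  # ddof=0 matches what StandardScaler computes

X_train_manual = (X_train - train_mean) / train_std
X_test_manual = (X_test - train_mean) / train_std

scaler = StandardScaler().fit(X_train)
print(np.allclose(X_train_manual, scaler.transform(X_train)))
print(np.allclose(X_test_manual, scaler.transform(X_test)))
```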

## Exercise 2

Would you use MinMaxScaler or StandardScaler for the following features?
- number of minutes spent on the website
- number of days a year spent abroad
- USD donated to charity

## Categorical target variable: LabelEncoder

In [None]:
# classification labels need to be integers with values between 0 and n_classes-1.
# the label is sometimes categorical, and sometimes numerical but outside the range of 0 and n_classes-1.
y_train = ['>50K','>50K', '>50K', '<=50K', '<=50K', '>50K']
y_test = ['<=50K','>50K','>50K','<=50K']

from sklearn.preprocessing import LabelEncoder

help(LabelEncoder)


In [None]:
le = LabelEncoder()
print(le.fit_transform(y_train))
print(le.transform(y_test))
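
After fitting, the mapping between the original labels and the integers can be inspected and reversed. The sketch below uses the same labels as above; `classes_` and `inverse_transform` are part of the `LabelEncoder` API.

```python
from sklearn.preprocessing import LabelEncoder

y_train = ['>50K', '>50K', '>50K', '<=50K', '<=50K', '>50K']
y_test = ['<=50K', '>50K', '>50K', '<=50K']

le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_test_le = le.transform(y_test)

# classes_ holds the sorted unique labels; the position of a label is its integer code
print(le.classes_)
# inverse_transform maps the integers back to the original labels
print(le.inverse_transform(y_test_le))
```

This is useful for converting a model's integer predictions back into readable labels.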

## Exercise 3

```python
y = [2,2,5,8,8,2]
le = LabelEncoder()
print(le.fit_transform(y))
```

What will be printed?

## How and when to do preprocessing in the ML pipeline?

- **APPLY TRANSFORMER.FIT ONLY ON YOUR TRAINING DATA!** Then transform the CV and test sets.
- One of the most common mistakes practitioners make is leaking statistics!
     - fit_transform is applied to the whole dataset.
     - Then the data is split into train/CV/test.
- This is wrong because the properties of the CV and test set (e.g., mean and stdev) should not influence how the training set is transformed.
   - You will see in the problem set that **leaking statistics produces a model that does not generalize well to previously unseen data.**
   - Keep your sets cleanly separated to avoid this mistake!
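
The difference between the two orders of operations can be sketched as follows. The skewed feature is randomly generated and purely hypothetical; the point is only that the parameters estimated from the whole dataset differ from those estimated from the training set alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical skewed feature, generated only for illustration
rng = np.random.RandomState(0)
X = rng.lognormal(mean=10, sigma=1, size=(100, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# wrong: the parameters are estimated from the whole dataset, so the test set leaks in
leaky = StandardScaler().fit(X)

# right: the parameters are estimated from the training set only
clean = StandardScaler().fit(X_train)

print('mean estimated from all data:     ', leaky.mean_)
print('mean estimated from training data:', clean.mean_)

# the test set is transformed with the training parameters
X_train_scaled = clean.transform(X_train)
X_test_scaled = clean.transform(X_test)
```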

## Scikit-learn's pipelines

The steps in the ML pipeline can be chained together into a scikit-learn pipeline, which consists of transformers and one final estimator, usually your classifier or regression model.

It serves multiple purposes but we only discuss one for now:
- leaking statistics is safely avoided with pipelines.

https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html


In [None]:
# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

In [None]:
# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#prep = Pipeline(steps=[('preprocessor', preprocessor)])
#print(prep.fit_transform(X_train,y_train))

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
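
To check that a pipeline fits its transformers on the training data only, the fitted steps can be inspected through `named_steps`. The sketch below uses a synthetic dataset from `make_classification` instead of the Titanic data so that it runs without a download; the dataset and the step names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', LogisticRegression())])

# fit calls fit_transform on the scaler with X_train only, then fits the classifier
pipe.fit(X_train, y_train)

# the scaler's mean matches the training mean, not the mean of the whole dataset
scaler_mean = pipe.named_steps['scaler'].mean_
print(np.allclose(scaler_mean, X_train.mean(axis=0)))

# score transforms X_test with the training parameters before predicting
test_score = pipe.score(X_test, y_test)
print("model score: %.3f" % test_score)
```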

## Exercise 4
True or False?

- The ordinal encoder should be used if the target variable (label) in classification is not integer or it is integer but outside the range of 0 and n_classes-1.
- If a feature has a distribution with a tail, the standard scaler is the best way to scale the feature.
- The fit_transform method of a transformer is applied to the whole dataset because this is usually faster than doing fit and transform separately.
- Leaking statistics means that the properties of your CV and test sets influence how your training data is preprocessed.

## Summary
#### Now you can
- describe why preprocessing is necessary
- apply one-hot encoding or ordinal encoding to categorical variables with and without missing data
- apply scaling and normalization to continuous variables
- apply label encoding to a categorical target variable
