## <center> Go to piazza and open today's lecture notes in the hub! </center>
## <center> https://piazza.com/class/jzioyk40mhs6r2 </center>
## <center> Let's go to tophat for attendance! </center>
## <center> https://app.tophat.com/e/245218 </center>

## Project
- hopefully your TA mentor already sent an email to you.
- advice on project:
   - don't choose a very well-known dataset! No iris, titanic, MNIST, etc.
   - kaggle dataset is OK but don't participate in a competition.
      - use the kaggle dataset to answer a different question.
   - UCI ML repository is a great resource for datasets
   - supervised ML problem is preferred.
- send the first draft of the project description to the TA mentor by the 24th of September.
- send the final project description to me by the **30th of September**.

# <center> Mud card questions </center>

- **Pandas merge and sql join, same thing?**
   - Yes!
- **Is it possible to represent all kinds of merges through combinations of left, right, inner, and outer joins?**
   - I don't think so. Sometimes you need pd.concat to do more complex combinations.
- **When you merge, how are the values sorted? Or are they not sorted?**
   - It depends on the type of join you use and whether sort is True or False. Check the manual and experiment with a toy dataset!
- **I understand when to use brackets vs paren, but it's a little unclear that [[ ]] seems to indicate to iloc that you are now referencing multiple rows separated by commas**
   - The inner [] is a list of indices which is passed to .iloc[]
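
To make the last answer concrete, here is a minimal sketch on a toy DataFrame (the column names and values are made up for illustration). A single integer position returns one row as a Series, while a list of positions returns a DataFrame.

```python
import pandas as pd

# toy data, invented for this illustration
df = pd.DataFrame({'a': [10, 20, 30, 40], 'b': ['w', 'x', 'y', 'z']})

one_row = df.iloc[0]          # a single integer position -> Series
two_rows = df.iloc[[0, 2]]    # the inner [] is a list of positions -> DataFrame
still_2d = df.iloc[[0]]       # a one-element list still returns a DataFrame

print(type(one_row).__name__, type(two_rows).__name__, still_2d.shape)
```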

# <center> Data preprocessing, part 1, categorical and continuous features </center>
### By the end of this course, you will be able to
- describe why preprocessing is necessary
- apply one-hot encoding or ordinal encoding to categorical variables
- apply scaling and normalization to continuous variables
- apply label encoding to a categorical target variable


### <font color='LIGHTGRAY'>By the end of this course, you will be able to</font>
- **describe why preprocessing is necessary**
- <font color='LIGHTGRAY'>apply one-hot encoding or ordinal encoding to categorical variables</font>
- <font color='LIGHTGRAY'>apply scaling and normalization to continuous variables</font>
- <font color='LIGHTGRAY'>apply label encoding to a categorical target variable</font>


## Problem description, why preprocessing is necessary

Data format suitable for ML: 2D numerical values.

| X|feature_1|feature_2|...|feature_j|...|feature_m|<font color='red'>y</font>|
|-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__data_point_1__|x_11|x_12|...|x_1j|...|x_1m|__<font color='red'>y_1</font>__|
|__data_point_2__|x_21|x_22|...|x_2j|...|x_2m|__<font color='red'>y_2</font>__|
|__...__|...|...|...|...|...|...|__<font color='red'>...</font>__|
|__data_point_i__|x_i1|x_i2|...|x_ij|...|x_im|__<font color='red'>y_i</font>__|
|__...__|...|...|...|...|...|...|__<font color='red'>...</font>__|
|__data_point_n__|x_n1|x_n2|...|x_nj|...|x_nm|__<font color='red'>y_n</font>__|

### Data almost never comes in a format that's directly usable in ML.
- ML works with numerical data but some columns can be text (e.g., home country, educational level, gender, race)
- the order of magnitude of numerical features can vary greatly which is not good for most ML algorithms (e.g., salary in USD, age in years, time spent on the site in sec)
- the target variable is not in the right format (e.g., the target variable is text in classification (<=50K, >50K))
- some values are missing and ML methods implemented in scikit-learn don't work with NaNs - next lecture will cover this.
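
A quick way to spot these issues is to inspect the column types and count the missing values. The sketch below uses a small made-up DataFrame (column names and values are assumptions for illustration, loosely modeled on the examples above).

```python
import numpy as np
import pandas as pd

# toy data, invented for this illustration
df = pd.DataFrame({
    'home country': ['USA', 'Hungary', 'China'],       # text feature
    'salary in USD': [50_000.0, np.nan, 1_000_000.0],  # large values, one missing
    'age in years': [32, 65, 13],                      # small values
    'income': ['<=50K', '<=50K', '>50K'],              # text target variable
})

print(df.dtypes)        # text columns show up with a non-numeric dtype
print(df.isna().sum())  # number of missing values per column
```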

### scikit-learn transformers to the rescue!

Preprocessing is done with various transformers. All transformers have three methods:
- **fit** method: estimates the parameters necessary to do the transformation,
- **transform** method: transforms the data based on the estimated parameters,
- **fit_transform** method: performs both steps at once, which can be faster than doing the steps separately.
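
The three methods can be sketched with MinMaxScaler, which is introduced later in this lecture. The arrays are made up for illustration. Note that the test set is only transformed, never fitted.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data, invented for this illustration
X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[2.0], [7.0]])

# two steps: estimate the parameters, then apply them
scaler = MinMaxScaler()
scaler.fit(X_train)                       # learns the min and max of X_train
X_train_two_steps = scaler.transform(X_train)

# one step: same result on the training data
X_train_one_step = MinMaxScaler().fit_transform(X_train)

# the test set is only transformed, using the parameters learned from X_train
X_test_scaled = scaler.transform(X_test)

print(scaler.data_min_, scaler.data_max_)
print(X_test_scaled)
```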

### Transformers we cover today
- **OneHotEncoder** - converts categorical features into dummy arrays
- **OrdinalEncoder** - converts categorical features into an integer array
- **MinMaxScaler** - scales continuous variables to be between 0 and 1
- **StandardScaler** - standardizes continuous features by removing the mean and scaling to unit variance
- **LabelEncoder** - converts text target variable to numerical values


### <font color='LIGHTGRAY'>By the end of this course, you will be able to</font>
- <font color='LIGHTGRAY'>describe why preprocessing is necessary</font>
- **apply one-hot encoding or ordinal encoding to categorical variables**
- <font color='LIGHTGRAY'>apply scaling and normalization to continuous variables</font>
- <font color='LIGHTGRAY'>apply label encoding to a categorical target variable</font>


## Ordered categorical data: OrdinalEncoder

Let's assume we have a categorical feature, a training set, and a test set.

The categories can be ordered or ranked.

E.g., educational level in the adult dataset.

In [1]:
import pandas as pd

train_edu = {'educational level':['Bachelors','Masters','Bachelors','Doctorate','HS-grad','Masters']} 
test_edu = {'educational level':['HS-grad','Masters','Masters','College','Bachelors']}

X_train = pd.DataFrame(train_edu)
X_test = pd.DataFrame(test_edu)

In [2]:
from sklearn.preprocessing import OrdinalEncoder
help(OrdinalEncoder)

Help on class OrdinalEncoder in module sklearn.preprocessing._encoders:

class OrdinalEncoder(_BaseEncoder)
 |  Encode categorical features as an integer array.
 |  
 |  The input to this transformer should be an array-like of integers or
 |  strings, denoting the values taken on by categorical (discrete) features.
 |  The features are converted to ordinal integers. This results in
 |  a single column of integers (0 to n_categories - 1) per feature.
 |  
 |  Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
 |  
 |  Parameters
 |  ----------
 |  categories : 'auto' or a list of lists/arrays of values.
 |      Categories (unique values) per feature:
 |  
 |      - 'auto' : Determine categories automatically from the training data.
 |      - list : ``categories[i]`` holds the categories expected in the ith
 |        column. The passed categories should not mix strings and numeric
 |        values, and should be sorted in case of numeric values.
 |  
 |      The use

In [3]:
# initialize the encoder
enc = OrdinalEncoder(categories = [['HS-grad','Bachelors','Masters','Doctorate']]) # The ordered list of 
# categories needs to be provided. By default, the categories are alphabetically ordered!

# fit the training data
enc.fit(X_train)
# print the categories - not really important because we manually gave the ordered list of categories
print(enc.categories_)
# transform X_train. We could have used enc.fit_transform(X_train) to combine fit and transform
X_train_oe = enc.transform(X_train)
print(X_train_oe)
# transform X_test
X_test_oe = enc.transform(X_test) # By default, OrdinalEncoder raises a ValueError if 
                                  # it encounters an unknown category in test

[array(['HS-grad', 'Bachelors', 'Masters', 'Doctorate'], dtype=object)]
[[1.]
 [2.]
 [1.]
 [3.]
 [0.]
 [2.]]


ValueError: Found unknown categories ['College'] in column 0 during transform
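
One possible way around this error is sketched below. It assumes scikit-learn 0.24 or newer, where OrdinalEncoder accepts `handle_unknown='use_encoded_value'` together with `unknown_value`. The choice of -1 as the code for unknown categories is an assumption made for illustration. The sketch also prints the default (alphabetical) category order for comparison.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({'educational level': ['Bachelors', 'Masters', 'Bachelors',
                                              'Doctorate', 'HS-grad', 'Masters']})
X_test = pd.DataFrame({'educational level': ['HS-grad', 'Masters', 'Masters',
                                             'College', 'Bachelors']})

# without the categories argument, the categories are sorted alphabetically
default_enc = OrdinalEncoder().fit(X_train)
print(default_enc.categories_)

# assumption: scikit-learn >= 0.24, where handle_unknown and unknown_value exist
enc = OrdinalEncoder(categories=[['HS-grad', 'Bachelors', 'Masters', 'Doctorate']],
                     handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(X_train)
X_test_oe = enc.transform(X_test)  # 'College' is mapped to -1 instead of raising
print(X_test_oe)
```

Whether -1 is a sensible value for an unseen category depends on the model that consumes the feature, so treat this as a starting point rather than a recommendation.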

## Unordered categorical data: one-hot encoder

Some categories cannot be ordered, e.g., workclass, relationship status.

First feature: gender (male, female, unknown)

Second feature: browser used

These categories cannot be ordered.

In [4]:
train = {'gender':['Male','Female','Unknown','Male','Female','Female'],\
         'browser':['Safari','Safari','Internet Explorer','Chrome','Chrome','Internet Explorer']}
test = {'gender':['Female','Male','Unknown','Female'],'browser':['Chrome','Firefox','Internet Explorer','Safari']}

X_train = pd.DataFrame(train)
X_test = pd.DataFrame(test)

In [5]:
# How do we convert this to numerical features?
from sklearn.preprocessing import OneHotEncoder

help(OneHotEncoder)

Help on class OneHotEncoder in module sklearn.preprocessing._encoders:

class OneHotEncoder(_BaseEncoder)
 |  Encode categorical integer features as a one-hot numeric array.
 |  
 |  The input to this transformer should be an array-like of integers or
 |  strings, denoting the values taken on by categorical (discrete) features.
 |  The features are encoded using a one-hot (aka 'one-of-K' or 'dummy')
 |  encoding scheme. This creates a binary column for each category and
 |  returns a sparse matrix or dense array.
 |  
 |  By default, the encoder derives the categories based on the unique values
 |  in each feature. Alternatively, you can also specify the `categories`
 |  manually.
 |  The OneHotEncoder previously assumed that the input features take on
 |  values in the range [0, max(values)). This behaviour is deprecated.
 |  
 |  This encoding is needed for feeding categorical data to many scikit-learn
 |  estimators, notably linear models and SVMs with the standard kernels.
 |  
 | 

In [6]:
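# initialize the encoder
# by default, OneHotEncoder returns a sparse matrix. sparse=False returns a 2D array
# (scikit-learn 1.2 renamed this argument to sparse_output)
enc = OneHotEncoder(sparse=False)
# fit the training data
enc.fit(X_train)
print('categories:',enc.categories_)
# (scikit-learn 1.0 added get_feature_names_out, which replaces get_feature_names in newer versions)
print('feature names:',enc.get_feature_names())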
# transform X_train
X_train_ohe = enc.transform(X_train)
print(X_train_ohe)

# do all of this in one step
X_train_ohe = enc.fit_transform(X_train)
#print(X_train_ohe)

# transform X_test
X_test_ohe = enc.transform(X_test)
print(X_test_ohe)

categories: [array(['Female', 'Male', 'Unknown'], dtype=object), array(['Chrome', 'Internet Explorer', 'Safari'], dtype=object)]
feature names: ['x0_Female' 'x0_Male' 'x0_Unknown' 'x1_Chrome' 'x1_Internet Explorer'
 'x1_Safari']
[[0. 1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 1. 0.]
 [0. 1. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 1. 0.]]


ValueError: Found unknown categories ['Firefox'] in column 1 during transform
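
If unknown categories are expected in the test set, `handle_unknown='ignore'` is one option: the unseen category is encoded as all zeros in the columns of that feature. A minimal sketch with the same toy data follows. It keeps the default sparse output and converts it with `toarray()` so it does not depend on the name of the sparse argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    'gender': ['Male', 'Female', 'Unknown', 'Male', 'Female', 'Female'],
    'browser': ['Safari', 'Safari', 'Internet Explorer', 'Chrome', 'Chrome',
                'Internet Explorer']})
X_test = pd.DataFrame({
    'gender': ['Female', 'Male', 'Unknown', 'Female'],
    'browser': ['Chrome', 'Firefox', 'Internet Explorer', 'Safari']})

# unseen categories are encoded as zeros instead of raising a ValueError
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)

# the default output is a sparse matrix; toarray() converts it to a dense 2D array
X_test_ohe = enc.transform(X_test).toarray()
print(enc.categories_)
print(X_test_ohe)  # the browser columns of the second row ('Firefox') are all zero
```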

## Exercise 1
Would you use the OneHotEncoder or the OrdinalEncoder for the following categorical features?
- marital status (Married, Divorced, Never-married, Separated, Widowed)
- exterior quality of a house (Excellent, Good, Average/Typical, Fair, Poor)
- native country (USA, Hungary, China, India, Germany)


### <font color='LIGHTGRAY'>By the end of this course, you will be able to</font>
- <font color='LIGHTGRAY'>describe why preprocessing is necessary</font>
- <font color='LIGHTGRAY'>apply one-hot encoding or ordinal encoding to categorical variables</font>
- **apply scaling and normalization to continuous variables**
- <font color='LIGHTGRAY'>apply label encoding to a categorical target variable</font>


## Continuous features: MinMaxScaler

In [7]:
# let's assume we have two continuous features:
train = {'age':[32,65,13,68,42,75,32],'number of hours worked':[0,40,10,60,40,20,40]}
test = {'age':[83,26,10,60],'number of hours worked':[0,40,0,60]}

X_train = pd.DataFrame(train)
X_test = pd.DataFrame(test)

If the continuous feature values are reasonably bounded, MinMaxScaler is a good way to scale the features.

Age is expected to be within the range of 0 and 100.

Number of hours worked per week is in the range of 0 to 80.

If unsure, plot the histogram of the feature!

In [8]:
from sklearn.preprocessing import MinMaxScaler
help(MinMaxScaler)

Help on class MinMaxScaler in module sklearn.preprocessing.data:

class MinMaxScaler(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)
 |  Transforms features by scaling each feature to a given range.
 |  
 |  This estimator scales and translates each feature individually such
 |  that it is in the given range on the training set, e.g. between
 |  zero and one.
 |  
 |  The transformation is given by::
 |  
 |      X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
 |      X_scaled = X_std * (max - min) + min
 |  
 |  where min, max = feature_range.
 |  
 |  The transformation is calculated as::
 |  
 |      X_scaled = scale * X + min - X.min(axis=0) * scale
 |      where scale = (max - min) / (X.max(axis=0) - X.min(axis=0))
 |  
 |  This transformation is often used as an alternative to zero mean,
 |  unit variance scaling.
 |  
 |  Read more in the :ref:`User Guide <preprocessing_scaler>`.
 |  
 |  Parameters
 |  ----------
 |  feature_range : tuple (min, max), de

In [9]:
scaler = MinMaxScaler()
scaler.fit(X_train)
print(scaler.transform(X_train))
print(scaler.transform(X_test)) # note how scaled X_test contains values larger than 1 and smaller than 0.

[[0.30645161 0.        ]
 [0.83870968 0.66666667]
 [0.         0.16666667]
 [0.88709677 1.        ]
 [0.46774194 0.66666667]
 [1.         0.33333333]
 [0.30645161 0.66666667]]
[[ 1.12903226  0.        ]
 [ 0.20967742  0.66666667]
 [-0.0483871   0.        ]
 [ 0.75806452  1.        ]]
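
The test values outside the [0, 1] range follow directly from the formula in the help text: the minimum and maximum come from the training set, so a test age of 83 is scaled to (83 - 13) / (75 - 13), which is larger than 1. The sketch below recomputes the scaled test set by hand and compares it with MinMaxScaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X_train = pd.DataFrame({'age': [32, 65, 13, 68, 42, 75, 32],
                        'number of hours worked': [0, 40, 10, 60, 40, 20, 40]})
X_test = pd.DataFrame({'age': [83, 26, 10, 60],
                       'number of hours worked': [0, 40, 0, 60]})

# the minimum and maximum are estimated from the training set only
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)

# (X - min) / (max - min), applied to the test set with the training min and max
X_test_manual = (X_test - train_min) / (train_max - train_min)

scaler = MinMaxScaler().fit(X_train)
matches = np.allclose(X_test_manual, scaler.transform(X_test))
print(matches)
print(X_test_manual)
```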


## Continuous features: StandardScaler

If the continuous feature values follow a long-tailed distribution, StandardScaler is the better choice!

Salaries are a good example. Most people earn less than 100k but there are a small number of super-rich people.

In [10]:
train = {'salary':[50_000,75_000,40_000,1_000_000,30_000,250_000,35_000,45_000]}
test = {'salary':[25_000,55_000,1_500_000,60_000]}

X_train = pd.DataFrame(train)
X_test = pd.DataFrame(test)

In [11]:
from sklearn.preprocessing import StandardScaler
help(StandardScaler)

Help on class StandardScaler in module sklearn.preprocessing.data:

class StandardScaler(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)
 |  Standardize features by removing the mean and scaling to unit variance
 |  
 |  The standard score of a sample `x` is calculated as:
 |  
 |      z = (x - u) / s
 |  
 |  where `u` is the mean of the training samples or zero if `with_mean=False`,
 |  and `s` is the standard deviation of the training samples or one if
 |  `with_std=False`.
 |  
 |  Centering and scaling happen independently on each feature by computing
 |  the relevant statistics on the samples in the training set. Mean and
 |  standard deviation are then stored to be used on later data using the
 |  `transform` method.
 |  
 |  Standardization of a dataset is a common requirement for many
 |  machine learning estimators: they might behave badly if the
 |  individual features do not more or less look like standard normally
 |  distributed data (e.g. Gaussian with 0 mean 

In [12]:
scaler = StandardScaler()
print(scaler.fit_transform(X_train))
print(scaler.transform(X_test))

[[-0.44873188]
 [-0.36895732]
 [-0.4806417 ]
 [ 2.58270127]
 [-0.51255153]
 [ 0.18946457]
 [-0.49659661]
 [-0.46468679]]
[[-0.52850644]
 [-0.43277697]
 [ 4.1781924 ]
 [-0.41682206]]
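
The standard score from the help text, z = (x - u) / s, can be checked by hand. The sketch below computes `u` and `s` from the training salaries with NumPy and compares the result with StandardScaler. StandardScaler uses the population standard deviation, which is NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[50_000], [75_000], [40_000], [1_000_000],
                    [30_000], [250_000], [35_000], [45_000]], dtype=float)
X_test = np.array([[25_000], [55_000], [1_500_000], [60_000]], dtype=float)

u = X_train.mean(axis=0)  # mean of the training samples
s = X_train.std(axis=0)   # population standard deviation (ddof=0) of the training samples

# z = (x - u) / s, applied to the test set with the training statistics
X_test_manual = (X_test - u) / s

scaler = StandardScaler().fit(X_train)
matches = np.allclose(X_test_manual, scaler.transform(X_test))
print(scaler.mean_, scaler.scale_)
print(matches)
```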


## Exercise 2

Would you use MinMaxScaler or StandardScaler for the following features?
- number of minutes spent on the website
- number of days a year spent abroad
- USD donated to charity

### <font color='LIGHTGRAY'>By the end of this course, you will be able to</font>
- <font color='LIGHTGRAY'>describe why preprocessing is necessary</font>
- <font color='LIGHTGRAY'>apply one-hot encoding or ordinal encoding to categorical variables</font>
- <font color='LIGHTGRAY'>apply scaling and normalization to continuous variables</font>
- **apply label encoding to a categorical target variable**


## Categorical target variable: LabelEncoder

Classification labels need to be integers with values between 0 and n_classes-1.

The label is sometimes categorical, and sometimes numerical but outside the range of 0 to n_classes-1.

In [13]:
y_train = ['>50K','>50K', '>50K', '<=50K', '<=50K', '>50K']
y_test = ['<=50K','>50K','>50K','<=50K']

In [14]:
from sklearn.preprocessing import LabelEncoder
help(LabelEncoder)

Help on class LabelEncoder in module sklearn.preprocessing.label:

class LabelEncoder(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)
 |  Encode labels with value between 0 and n_classes-1.
 |  
 |  Read more in the :ref:`User Guide <preprocessing_targets>`.
 |  
 |  Attributes
 |  ----------
 |  classes_ : array of shape (n_class,)
 |      Holds the label for each class.
 |  
 |  Examples
 |  --------
 |  `LabelEncoder` can be used to normalize labels.
 |  
 |  >>> from sklearn import preprocessing
 |  >>> le = preprocessing.LabelEncoder()
 |  >>> le.fit([1, 2, 2, 6])
 |  LabelEncoder()
 |  >>> le.classes_
 |  array([1, 2, 6])
 |  >>> le.transform([1, 1, 2, 6]) #doctest: +ELLIPSIS
 |  array([0, 0, 1, 2]...)
 |  >>> le.inverse_transform([0, 0, 1, 2])
 |  array([1, 1, 2, 6])
 |  
 |  It can also be used to transform non-numerical labels (as long as they are
 |  hashable and comparable) to numerical labels.
 |  
 |  >>> le = preprocessing.LabelEncoder()
 |  >>> le.fit(["paris"

In [15]:
le = LabelEncoder()
print(le.fit_transform(y_train))
print(le.transform(y_test))

[1 1 1 0 0 1]
[0 1 1 0]
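
To see which integer belongs to which label, the fitted encoder can be inspected. A minimal sketch using the `classes_` attribute and the `inverse_transform` method shown in the help text above:

```python
from sklearn.preprocessing import LabelEncoder

y_train = ['>50K', '>50K', '>50K', '<=50K', '<=50K', '>50K']

le = LabelEncoder()
y_train_le = le.fit_transform(y_train)

# classes_ holds the original labels; the position of a label is its integer code
print(le.classes_)
# inverse_transform converts integer codes back to the original labels
y_back = le.inverse_transform([0, 1, 1, 0])
print(y_back)
```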


## Exercise 3

```python
y = [2,2,5,8,8,2]
le = LabelEncoder()
print(le.fit_transform(y))
```

What will be printed?

## How and when to do preprocessing in the ML pipeline?

- **APPLY TRANSFORMER.FIT ONLY ON YOUR TRAINING DATA!** Then transform the CV and test sets.
- One of the most common mistakes practitioners make is leaking statistics! It happens like this:
     - fit_transform is applied to the whole dataset.
     - Then the data is split into train/CV/test.
- This is wrong because the properties of the CV and test sets (e.g., mean and stdev) should not influence how the training set is transformed.
   - You will see later this semester that **leaking statistics produces a model that does not generalize well to previously unseen data.**
   - Keep your sets cleanly separated to avoid this mistake!
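
The difference between the two orders of operation can be sketched on a tiny made-up dataset. The values and the simple unshuffled split are assumptions chosen to keep the example deterministic; the last point plays the role of an outlier that ends up in the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data, invented for this illustration; the last point is an outlier
X = np.array([[30.0], [40.0], [50.0], [60.0], [1000.0]])
X_train, X_test = X[:4], X[4:]  # simple split without shuffling

# wrong: the scaler sees the test point, so its mean and stdev depend on the test set
leaky_scaler = StandardScaler().fit(X)

# right: fit on the training set only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('mean with leakage:', leaky_scaler.mean_)
print('mean without leakage:', scaler.mean_)
```

With leakage, a single test point shifts the mean used to transform the training set from 45 to 236.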

## Summary
#### Now you can
- describe why preprocessing is necessary
- apply one-hot encoding or ordinal encoding to categorical variables
- apply scaling and normalization to continuous variables
- apply label encoding to a categorical target variable


## Scikit-learn's pipelines

The steps in the ML pipeline can be chained together into a scikit-learn pipeline, which consists of transformers and one final estimator, usually your classifier or regression model.

It serves multiple purposes but we only discuss one for now:
- leaking statistics is safely avoided with pipelines.

https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html


In [16]:
# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

In [17]:
# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [18]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#prep = Pipeline(steps=[('preprocessor', preprocessor)])
#print(prep.fit_transform(X_train,y_train))

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.790
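
To check that a pipeline really fits its transformers on the training data only, the fitted steps can be inspected. The sketch below is a smaller, self-contained example on synthetic data (the data, the split, and the step names are assumptions for illustration, not part of the Titanic example above).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data, invented for this illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])  # two features on very different scales
y = (X[:, 0] > 0).astype(int)

X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', LogisticRegression())])

# fit calls fit_transform on the scaler with X_train only, then fits the classifier
pipe.fit(X_train, y_train)

# the scaler inside the pipeline stores the training mean, not the mean of all data
uses_train_mean = np.allclose(pipe.named_steps['scaler'].mean_, X_train.mean(axis=0))
print(uses_train_mean)

# score (like predict) only calls transform on the scaler before the classifier
score = pipe.score(X_test, y_test)
print(score)
```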
