# Data preprocessing using pandas and scikit-learn

### Feature selection

Data preprocessing is almost always the step that comes before training a machine learning model. Some features are not useful for predicting a given outcome. For example, an `id` field that uniquely identifies each sample carries no information that generalizes to new samples.
Such variables can safely be dropped.


In [5]:
import pandas as pd
from matplotlib import pyplot as plt
filename="../../additional_resources/datasets/NYC Squirrels/data.csv"
df = pd.read_csv(filename)
df.columns

Index(['X', 'Y', 'Unique Squirrel ID', 'Hectare', 'Shift', 'Date',
       'Hectare Squirrel Number', 'Age', 'Primary Fur Color',
       'Highlight Fur Color', 'Combination of Primary and Highlight Color',
       'Color notes', 'Location', 'Above Ground Sighter Measurement',
       'Specific Location', 'Running', 'Chasing', 'Climbing', 'Eating',
       'Foraging', 'Other Activities', 'Kuks', 'Quaas', 'Moans', 'Tail flags',
       'Tail twitches', 'Approaches', 'Indifferent', 'Runs from',
       'Other Interactions', 'Lat/Long'],
      dtype='object')

Again, we'll ask you to do a bit of work yourself. This time, we ask you to drop unnecessary columns.

In [6]:
# Drop the `Unique Squirrel ID` column (by name, which is safer than by position)
df.drop(columns=['Unique Squirrel ID'], inplace=True)
df

Unnamed: 0,X,Y,Hectare,Shift,Date,Hectare Squirrel Number,Age,Primary Fur Color,Highlight Fur Color,Combination of Primary and Highlight Color,...,Kuks,Quaas,Moans,Tail flags,Tail twitches,Approaches,Indifferent,Runs from,Other Interactions,Lat/Long
0,-73.956134,40.794082,37F,PM,10142018,3,,,,+,...,False,False,False,False,False,False,False,False,,POINT (-73.9561344937861 40.7940823884086)
1,-73.957044,40.794851,37E,PM,10062018,3,Adult,Gray,Cinnamon,Gray+Cinnamon,...,False,False,False,False,False,False,False,True,me,POINT (-73.9570437717691 40.794850940803904)
2,-73.976831,40.766718,02E,AM,10102018,3,Adult,Cinnamon,,Cinnamon+,...,False,False,False,False,False,False,True,False,,POINT (-73.9768311751004 40.76671780725581)
3,-73.975725,40.769703,05D,PM,10182018,5,Juvenile,Gray,,Gray+,...,False,False,False,False,False,False,False,True,,POINT (-73.9757249834141 40.7697032606755)
4,-73.959313,40.797533,39B,AM,10182018,1,,,,+,...,True,False,False,False,False,False,False,False,,POINT (-73.9593126695714 40.797533370163)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3018,-73.963943,40.790868,30B,AM,10072018,4,Adult,Gray,,Gray+,...,False,False,False,False,False,False,False,True,,POINT (-73.9639431360458 40.7908677445466)
3019,-73.970402,40.782560,19A,PM,10132018,5,Adult,Gray,White,Gray+White,...,False,False,False,False,False,False,True,False,,POINT (-73.9704015859639 40.7825600069973)
3020,-73.966587,40.783678,22D,PM,10122018,7,Adult,Gray,"Black, Cinnamon, White","Gray+Black, Cinnamon, White",...,False,False,False,False,False,False,True,False,,POINT (-73.9665871993517 40.7836775064883)
3021,-73.963994,40.789915,29B,PM,10102018,2,,Gray,"Cinnamon, White","Gray+Cinnamon, White",...,False,False,False,False,False,False,True,False,,POINT (-73.9639941227864 40.7899152327912)
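
As a minimal sketch of the same idea on a toy table (the column names and values below are invented for illustration), identifier-like columns can be spotted by checking whether every value is unique, and then dropped by name. This is only a heuristic: a continuous numeric column can also have all-unique values, so the result should be reviewed by hand.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "sample_id": ["a1", "b2", "c3", "d4"],        # unique per row: no predictive value
    "age": ["Adult", "Juvenile", "Adult", "Adult"],
    "running": [True, False, True, False],
})

# Heuristic: a column whose number of distinct values equals the number of rows
# behaves like an identifier
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]

# Drop the identifier-like columns by name and keep the rest as features
features = toy.drop(columns=id_like)
print(id_like)
print(features.columns.tolist())
```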


### Feature slicing
Feature slicing is the act of *slicing* a feature into multiple different features.
For example, we can slice the `Date` into day, month, and year.


Hint: use the `Series.apply()` method with a lambda function. [[Help]](https://www.analyticsvidhya.com/blog/2020/03/what-are-lambda-functions-in-python/)


In [8]:
# Dates are stored as integers in MMDDYYYY form (e.g. 10142018 is October 14, 2018)
sr = df['Date']
Month = sr.apply(lambda x: x // 1000000)
Day = sr.apply(lambda x: (x // 10000) % 100)
Year = sr.apply(lambda x: x % 10000)
df['Year'] = Year
df['Month'] = Month
df['Day'] = Day

#print(df)

In [9]:

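# Same slicing through string positions: characters 0-1 are the month, 2-3 the day, 4-7 the year.
# This assumes every date has 8 digits (a two-digit month), as in the rows shown above.
sr = df['Date']
Month = sr.apply(lambda x: int(str(x)[0:2]))
Day = sr.apply(lambda x: int(str(x)[2:4]))
Year = sr.apply(lambda x: int(str(x)[4:]))
print(Month)
print(Day)
print(Year)
df['Year'] = Year
df['Month'] = Month
df['Day'] = Day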

0       10
1       10
2       10
3       10
4       10
        ..
3018    10
3019    10
3020    10
3021    10
3022    10
Name: Date, Length: 3023, dtype: int64
0       14
1        6
2       10
3       18
4       18
        ..
3018     7
3019    13
3020    12
3021    10
3022    12
Name: Date, Length: 3023, dtype: int64
0       2018
1       2018
2       2018
3       2018
4       2018
        ... 
3018    2018
3019    2018
3020    2018
3021    2018
3022    2018
Name: Date, Length: 3023, dtype: int64
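
As an alternative to slicing by hand, pandas can parse the integers into real dates first. The sketch below assumes the `MMDDYYYY` layout seen in the rows above and uses a small hand-written Series rather than the full data set; `zfill(8)` restores a leading zero for months before October.

```python
import pandas as pd

# A few dates in the same integer MMDDYYYY layout as the `Date` column
dates = pd.Series([10142018, 10062018, 10102018, 10182018])

# Convert to zero-padded strings, then parse with an explicit format
parsed = pd.to_datetime(dates.astype(str).str.zfill(8), format="%m%d%Y")

# The .dt accessor exposes each date component as its own Series
parts = pd.DataFrame({
    "Month": parsed.dt.month,
    "Day": parsed.dt.day,
    "Year": parsed.dt.year,
})
print(parts)
```

Parsing with an explicit format also fails loudly on malformed values instead of silently producing a wrong day or month.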


### Feature engineering

You can create new features based on the features you have. These might be more useful for your (future) machine learning model than the ones that are already present in the dataset.
In this squirrel dataset, most of the fields encode the action taken by the squirrel when being approached by the human.
We will combine them into a single feature `Reaction` with values `'yes'` and `'no'`.

In [10]:
reaction_columns = ['Kuks', 'Quaas', 'Moans', 'Tail flags',
                   'Tail twitches', 'Approaches', 'Runs from',
                   'Other Interactions']

df['Reaction'] = df[reaction_columns].any(axis=1)
df['Reaction'] = df['Reaction'].apply(lambda x : "yes" if x else "no")
df['Reaction']

0        no
1       yes
2        no
3       yes
4       yes
       ... 
3018    yes
3019     no
3020     no
3021     no
3022    yes
Name: Reaction, Length: 3023, dtype: object

An important step in a data processing pipeline is making the data understandable for machine learning algorithms. Most of them do not understand strings, like `yes` and `no` in our newly created column.
We need to transform them to a binary format so that the machine learning model can take advantage of that feature. We are going to **One Hot Encode** our feature.


In [11]:
pd.get_dummies(df.Reaction, prefix='Reaction')

Unnamed: 0,Reaction_no,Reaction_yes
0,1,0
1,0,1
2,1,0
3,0,1
4,0,1
...,...,...
3018,0,1
3019,1,0
3020,1,0
3021,1,0


However, we have a redundancy here, as we could just transform `'yes'` to `1` and `'no'` to `0` in our `Reaction` column. This can be done by setting the argument `drop_first` to `True`.

In [None]:
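# Keep only one of the two dummy columns; assign to a new name so that `df` is not overwritten
reaction = pd.get_dummies(df.Reaction, prefix='Reaction', drop_first=True)
reaction.rename(columns={"Reaction_yes": "Reaction"})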

Unnamed: 0,Reaction
0,0
1,1
2,0
3,1
4,1
...,...
3018,1
3019,0
3020,0
3021,0
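
For a two-valued column, the same 0/1 encoding can also be written as an explicit mapping. A small sketch on a toy Series (values invented for illustration) using `Series.map`:

```python
import pandas as pd

# Toy column with the same two labels as `Reaction`
reaction = pd.Series(["no", "yes", "no", "yes", "yes"], name="Reaction")

# Map each label to its integer code; labels missing from the dict would become NaN
encoded = reaction.map({"no": 0, "yes": 1})
print(encoded.tolist())
```

An explicit mapping makes the meaning of `0` and `1` visible in the code, whereas `drop_first` decides which category is dropped based on sort order.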



Similar things can be done after converting the data frame to an array using the `scikit-learn` library with `LabelBinarizer` or `OneHotEncoder`.
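
A minimal sketch of the `LabelBinarizer` route, assuming the labels have already been extracted into a NumPy array (the values below are a toy sample, not the squirrel data):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy labels standing in for the `Reaction` column
labels = np.array(["no", "yes", "no", "yes", "yes"])

binarizer = LabelBinarizer()
# For a two-class problem, fit_transform returns a single 0/1 column of shape (n_samples, 1)
encoded = binarizer.fit_transform(labels)

print(binarizer.classes_)   # classes in sorted order; the second one is encoded as 1
print(encoded.ravel())

# The fitted binarizer can also map the codes back to the original labels
decoded = binarizer.inverse_transform(encoded)
```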

## Feature normalization or standardization
Although they are sometimes used interchangeably, normalization and standardization are two different ways to bring a column of values to a common scale.
In this section, we're going to use the word normalization to refer to this concept.

Why do we normalize data?

*For example, assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The great difference in the scale of the numbers could cause problems when you attempt to combine the values as features during modeling.* [[Source]](https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data)

Some algorithms require that data be normalized before training a model. Other algorithms perform their own data scaling or normalization.

Given a column of values *x*, if we choose to scale them, there are a few options:

- Normalization, also called *min-max scaling*, rescales every value to the range [0, 1]. The maximum and the minimum are computed for each column separately.

  $$ z = \frac{x - min(x)}{max(x) - min(x)} $$

- Standardization, also called z-score normalization, rescales the value around a 0 mean and a standard deviation of 1. It essentially transforms all values of *x* to a *z-score*. Mean and standard deviation are computed for each column separately.

$$ z = \frac{x - mean(x)}{std(x)} $$
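
Both formulas can be checked on a single toy column with NumPy. The values below are invented for illustration; note that `np.std` uses the population standard deviation (`ddof=0`) by default.

```python
import numpy as np

# One toy column of values
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: the smallest value maps to 0, the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)
print(x_zscore)
```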


### Be careful!

When you want to normalize your dataset, you have to do so **AFTER** splitting your data into train and test sets. Computing the minimum, maximum, mean, or standard deviation on the full dataset would leak information from the testing set into the training set, and thus bias your evaluation of the model.
In a real-world scenario, you would not have access to the testing set, as this would be the data that you are meant to predict.

The procedure is the following:
1. Split your data into train and test
2. For every variable $x$ of your **training set**, compute $max(x_{train})$ and $min(x_{train})$, or $mean(x_{train})$ and $std(x_{train})$, depending on whether you use min-max scaling or z-score normalization.
3. Normalize your training set and your testing set using these values (only the testing set is shown here; the first formula is min-max scaling, the second is z-score normalization).
$$ z_{test} = \frac{x_{test} - min(x_{train})}{max(x_{train}) - min(x_{train})} $$


$$ z_{test} = \frac{x_{test} - mean(x_{train})}{std(x_{train})} $$
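
The procedure can be sketched on two small hand-made arrays (the numbers are invented for illustration). Because the statistics come from the training set only, scaled test values are allowed to fall outside [0, 1].

```python
import numpy as np

# Toy training and testing values for a single feature
x_train = np.array([2.0, 4.0, 6.0, 8.0])
x_test = np.array([1.0, 5.0, 10.0])

# Step 2: statistics are computed on the training set only
train_min, train_max = x_train.min(), x_train.max()
train_mean, train_std = x_train.mean(), x_train.std()

# Step 3: the test set is scaled with the training statistics
z_test_minmax = (x_test - train_min) / (train_max - train_min)
z_test_zscore = (x_test - train_mean) / train_std

print(z_test_minmax)   # values below 0 or above 1 are expected here
print(z_test_zscore)
```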


In [13]:
from sklearn.model_selection import train_test_split
from sklearn import datasets
import numpy as np
dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target

In [23]:
#print(dataset.feature_names)
print(dataset.data[0])

[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]


In [34]:
# Splitting data into train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=41, test_size=0.2)
print(X_train.shape)
print(X_test.shape)

(455, 30)
(114, 30)

Here, we'll want you to use numpy's mean and standard deviation functions to standardize each feature of the training and testing set. These are `np.mean()` and `np.std()`.

In [39]:
# Standardizing each feature using the train mean and standard deviation
n_features = X_train.shape[1]
mean_train = np.zeros(n_features)
std_train = np.zeros(n_features)
z_train = np.zeros_like(X_train)
z_test = np.zeros_like(X_test)
for idx, name in enumerate(dataset.feature_names):
    # Get mean and standard deviation from training set (per feature)
    mean_train[idx] = np.mean(X_train[:, idx])
    std_train[idx] = np.std(X_train[:, idx])
    # Standardize training and testing set using the mean and standard deviation from the training set
    z_train[:, idx] = (X_train[:, idx] - mean_train[idx]) / std_train[idx]
    z_test[:, idx] = (X_test[:, idx] - mean_train[idx]) / std_train[idx]

# After the loop, `name` and `idx` still refer to the last feature
print(f"Feature '{name}' has mean {mean_train[idx]} and standard deviation {std_train[idx]}")
print(z_train)

Feature 'worst fractal dimension' has mean 0.08436483516483516 and standard deviation 0.01796064049389251
[[-0.36498729 -0.83463794 -0.37044213 ... -0.25967334  0.17192889
  -0.3532633 ]
 [ 0.05208811 -0.26700868 -0.01772895 ... -0.60430302 -0.44112743
  -0.4234167 ]
 [-0.12582517 -1.43250975 -0.15915297 ... -0.52384354 -0.86529613
   0.15674078]
 ...
 [-0.09665906  1.03574702 -0.1312069  ... -0.82212489 -0.39639089
  -1.17784414]
 [ 1.77288871  0.07961739  1.6708907  ...  0.4236204   0.25311743
  -1.39387207]
 [-0.76747963  0.38436917 -0.75491221 ... -0.81891263 -0.2323569
   0.23079159]]


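If you now compute the mean and standard deviation of each column of `z_train` (for example with `z_train.mean(axis=0)` and `z_train.std(axis=0)`), you'll see that they are 0 and 1 respectively, up to floating-point error, which is exactly what we want when we standardize (z-score normalization). The columns of `z_test` will only be close to 0 and 1, because they are scaled with the training statistics.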
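
The same computation is available in scikit-learn as `StandardScaler`. The sketch below repeats the split with the same `random_state` so that it is self-contained. The scaler is fitted on the training set only and then applied to both sets.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=41, test_size=0.2)

scaler = StandardScaler()
# fit_transform learns the per-feature mean and standard deviation from the training set
z_train = scaler.fit_transform(X_train)
# transform reuses those training statistics on the testing set
z_test = scaler.transform(X_test)

# Training columns end up with mean 0 and standard deviation 1 (up to rounding)
print(np.allclose(z_train.mean(axis=0), 0.0), np.allclose(z_train.std(axis=0), 1.0))
```

Keeping the fitted `scaler` object around also makes it easy to apply the identical transformation to new data later.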

## Resampling

Sometimes, when you have multiple classes and the samples are not equally distributed among them, i.e. there is an imbalance in the number of samples of each class, you can resort to resampling.
Resampling means using more (or fewer) samples of a given class to get a balanced dataset.

**BE CAREFUL**, resample **AFTER** splitting your data set into two parts. You do not want to accidentally have a copy of a testing sample in the training set.
Moreover, **do not resample the testing set**. This would give a false sense of the performance of the model.


In [None]:

import sklearn
import numpy as np
from sklearn import datasets

dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target


In [None]:
# We separate the samples of the different classes using boolean masks
class_one_x = X[y == 1]
class_zero_x = X[y == 0]

print("Shape of class 0 samples : ", class_zero_x.shape)
print("Shape of class 1 samples : ", class_one_x.shape)


Shape of class 0 samples :  (212, 30)
Shape of class 1 samples :  (357, 30)


You see that we have 212 samples of class 0 and 357 samples of class 1.
We can either upsample, i.e. take more samples of, the minority class (here class 0) or we can downsample, i.e. take fewer samples of, the majority class (here class 1).
This is why we first separated the samples of each class.

In [None]:
from sklearn.utils import resample

# Upsample minority class
class_zero_upsampled = resample(class_zero_x, 
                                 replace=True,     # sample with replacement
                                 n_samples=357,    # to match majority class
                                 random_state=123) # reproducible results

print("New shape of class 0 samples: ",class_zero_upsampled.shape)

# Downsample majority class
class_one_downsampled = resample(class_one_x, 
                                 replace=False,    # sample without replacement
                                 n_samples=212,    # to match minority class
                                 random_state=123) # reproducible results

print("New shape of class 1 samples: ",class_one_downsampled.shape)


New shape of class 0 samples:  (357, 30)
New shape of class 1 samples:  (212, 30)


After having either upsampled our minority class, or downsampled our majority class, we can combine the upsampled with the majority class or the downsampled with the minority class to have a balanced data set.

Which one you use depends on what you want to do, and which one does best.

In [None]:
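# Upsampled version: class 0 samples first, then class 1, with matching labels
X_balanced = np.concatenate((class_zero_upsampled, class_one_x), axis=0)
y_balanced = np.concatenate((np.zeros(len(class_zero_upsampled), dtype=int), np.ones(len(class_one_x), dtype=int)))
print("(Upsampled) Balanced data set shape : ", X_balanced.shape)

# Downsampled version: class 1 samples first, then class 0, with matching labels
X_balanced = np.concatenate((class_one_downsampled, class_zero_x), axis=0)
y_balanced = np.concatenate((np.ones(len(class_one_downsampled), dtype=int), np.zeros(len(class_zero_x), dtype=int)))
print("(Downsampled) Balanced data set shape : ", X_balanced.shape)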

(Upsampled) Balanced data set shape :  (714, 30)
(Downsampled) Balanced data set shape :  (424, 30)
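
The cells above resample the full data set for brevity. Following the warning at the start of this section, a sketch that splits first and then upsamples only the training set could look like the following. The `stratify=y` argument and the variable names are choices made for this illustration, not part of the cells above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = datasets.load_breast_cancer(return_X_y=True)

# Split first, keeping the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=41, stratify=y)

# Separate the training samples by class (class 0 is the minority)
X_minority = X_train[y_train == 0]
X_majority = X_train[y_train == 1]

# Upsample the minority class, with replacement, to the size of the majority class
X_minority_up = resample(X_minority, replace=True,
                         n_samples=len(X_majority), random_state=123)

# Rebuild a balanced training set together with its labels
X_train_balanced = np.concatenate((X_minority_up, X_majority), axis=0)
y_train_balanced = np.concatenate((np.zeros(len(X_minority_up), dtype=int),
                                   np.ones(len(X_majority), dtype=int)))

print("Balanced training set shape:", X_train_balanced.shape)
print("Untouched testing set shape:", X_test.shape)
```

The testing set keeps its original class imbalance, so the measured performance reflects the data the model would actually see.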


## Reading material and additional resources

[[1] Feature Engineering - Elite Data Science](https://elitedatascience.com/feature-engineering)  
[[2] Feature Engineering - Towards Data Science](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)  
[[3] Feature Engineering Tutorial - Kaggle](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)  
[[4] LabelBinarizer - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)  
[[5] OneHotEncoder - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)  
[[6] Zscore - Simply Psychology](https://www.simplypsychology.org/z-score.html)  
[[7] Normalize data - Microsoft Azure](https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data)  
[[8] How to handle imbalanced classes - Elite Data Science ](https://elitedatascience.com/imbalanced-classes)