# Data Preparation

- This is supplementary material for the [Machine Learning Simplified](https://themlsbook.com) book. It covers Python implementations of the topics discussed; all detailed explanations can be found in the book. 
- I assume you know Python syntax and how it works. If you don't, I recommend taking a break to get familiar with the language before going forward with my code. 
- This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> `.ipynb`) to reproduce the code and play around with it. 



## 1. Required Libraries

Before we start, we need to import a few libraries and functions that we will use in this notebook. You don't need to understand what those functions do for now.

In [2]:
import math
import pandas as pd
import plotly.express as px
from sklearn import preprocessing
from IPython.display import display
from matplotlib import pyplot as plt
from imblearn.over_sampling import SMOTE

## 2. Problem Representation

In this notebook we will use a hypothetical dataset containing 4 features describing inhabitants, along with two classes, `downtown` and `suburbs`, indicating the area where they live. 

In [3]:
#load dataframe
df = pd.read_csv('https://raw.githubusercontent.com/5x12/themlsbook/master/supplements/data/amsterdam_demographics.csv')

#print the result
df

## 3. Outlier Detection

To identify outliers in data, one common approach is to use the standard deviation of the data. The standard deviation is a measure of how spread out the data is from the mean, or average, value.

To apply this approach, first calculate the mean and the standard deviation of your data. Then decide on a threshold, such as 3 standard deviations, to use as a cutoff for identifying outliers. Any data points that are more than 3 standard deviations from the mean would be considered outliers.

For example, let's say you have a dataset with a mean of 100 and a standard deviation of 20. If you set a threshold of 3 standard deviations, any data points more than 60 points away from the mean (3 * 20) would be considered outliers. In this case, any data points below 40 (100 - 60) or above 160 (100 + 60) would be considered outliers.
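
The arithmetic of this example can be sketched in a few lines of plain Python. The numbers are the hypothetical ones from the paragraph above, and `is_outlier` is a helper name introduced here only for illustration.

```python
# Worked example with the hypothetical numbers from the text above
mean = 100
std = 20
k = 3  # number of standard deviations used as the cutoff

lower_bound = mean - k * std  # 100 - 60
upper_bound = mean + k * std  # 100 + 60
print(f'{lower_bound=}, {upper_bound=}')


def is_outlier(x):
    """Return True if x lies outside [lower_bound, upper_bound]."""
    return x < lower_bound or x > upper_bound


# 35 is below the lower bound, 120 is inside the range, 165 is above the upper bound
print(is_outlier(35), is_outlier(120), is_outlier(165))
```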

It's important to note that this method of identifying outliers is just one approach and there are other methods that can also be used. It's also important to consider whether the data points that are being identified as outliers are actually errors or if they represent valid observations.


Let’s start with identifying outliers in the `Income` column of the table `df`. For that, we need to calculate two values:
- income mean 
- income standard deviation

### 3.1. Calculating Mean

Calculating the mean of the `Income` column:

In [14]:
# Calculating the mean as the sum of all observations divided by the number of observations
income_mean = df['Income'].sum() / len(df['Income'])
print(f'Mean={income_mean} (direct approach)')

# Or, more simply, using the pandas built-in method mean()
income_mean = df['Income'].mean()
print(f'Mean={income_mean} (using pandas built-in method)')

Mean=151.875 (direct approach)
Mean=151.875 (using pandas built-in method)


### 3.2. Calculating Standard Deviation

Calculating standard deviation of the `Income` column:

In [19]:
# Calculating std directly using the formula for the (uncorrected) standard deviation
# income_mean is the mean of the Income column computed in section 3.1
income_std = math.sqrt(df['Income'].apply(lambda x: (x - income_mean) ** 2).sum() / len(df['Income']))
print(f'Standard deviation={income_std} (direct approach)')

# Or, more simply, using the pandas built-in method std()
income_std = df['Income'].std()
print(f'Standard deviation={income_std}  (using pandas built-in method)')

Standard deviation=123.21063012175533 (direct approach)
Standard deviation=131.71770409261077  (using pandas built-in method)


We see that the results differ. This is because pandas by default normalizes the sum of squares by N-1 (corrected sample standard deviation). To normalize by N instead (uncorrected sample standard deviation), we can change the parameter `ddof` (Delta Degrees of Freedom) from 1 to 0.

In [20]:
# Uncorrected sample standard deviation using built-in method std() 
income_std = df['Income'].std(ddof=0)
print(f'Standard deviation={income_std}  (using pandas built-in method with ddof=0)')

Standard deviation=123.21063012175533  (using pandas built-in method with ddof=0)
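
As a cross-check that does not depend on pandas, the same two quantities can be computed with Python's standard `statistics` module. This is only a sketch: the `incomes` list below is copied by hand from the `Income` column shown in the tables of this notebook.

```python
import math
import statistics

# Income values copied by hand from the Income column of the dataset
incomes = [95, 210, 75, 30, 55, 430, 100, 220]

# pstdev divides the sum of squares by N (equivalent to ddof=0)
population_std = statistics.pstdev(incomes)
# stdev divides the sum of squares by N-1 (equivalent to ddof=1, the pandas default)
sample_std = statistics.stdev(incomes)

print(f'Uncorrected (N):   {population_std}')
print(f'Corrected (N-1):   {sample_std}')

# The two estimates differ only by the constant factor sqrt(N / (N - 1))
n = len(incomes)
ratio = sample_std / population_std
print(f'Ratio: {ratio}, sqrt(N/(N-1)): {math.sqrt(n / (n - 1))}')
```

The gap between the two estimates shrinks as N grows, which is why the choice of `ddof` matters mostly for small datasets like this one.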


### 3.3. Identifying outliers

Then, we should decide on a threshold to use as a cutoff for identifying outliers. Suppose we choose `k = 3` standard deviations. In other words, any data points that are more than 3 standard deviations from the mean would be considered outliers. 

Finally, we can calculate an upper bound and a lower bound:

$lower\_bound = mean_{income} - 3 * std_{income}$

$upper\_bound = mean_{income} + 3 * std_{income}$

where any data points below the lower bound or above the upper bound would be considered outliers.

In [17]:
lower_bound = income_mean - 3 * income_std
print(f'{lower_bound=}')

lower_bound=-217.75689036526597


In [22]:
upper_bound = income_mean + 3 * income_std
print(f'{upper_bound=}')

upper_bound=521.5068903652659


Hence, anything below -217.8 or above 521.5 would be treated as an outlier. Since an income cannot be negative, only the upper bound matters here: values above 521.5 would be outliers. The largest income in this dataset is 430, so there are no outliers present.
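
The whole procedure can be wrapped in one small function. The sketch below uses plain Python and a hand-copied `incomes` list; `find_outliers` is a hypothetical helper name, not part of any library. One caveat worth knowing: with N observations, no value can lie more than sqrt(N-1) uncorrected standard deviations from the mean, so with 8 rows (sqrt(7) ≈ 2.65) a cutoff of `k = 3` can never flag anything. A lower cutoff is shown for comparison.

```python
import math


def find_outliers(values, k=3.0):
    """Return the values lying more than k (uncorrected) standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    lower, upper = mean - k * std, mean + k * std
    return [v for v in values if v < lower or v > upper]


# Income values copied by hand from the Income column of the dataset
incomes = [95, 210, 75, 30, 55, 430, 100, 220]

outliers_k3 = find_outliers(incomes, k=3)  # the threshold used in this section
outliers_k2 = find_outliers(incomes, k=2)  # a stricter threshold, for comparison only
print(outliers_k3)
print(outliers_k2)
```

With such a small sample, the threshold itself becomes a modelling decision, so it is worth checking flagged values by hand before removing them.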

## 4. Feature Transformation

Encoding variables refers to the process of converting data that is represented in one format (e.g. categorical values) into another format (e.g. numerical values). There are a few reasons why we might need to encode variables in data science and machine learning:

- Compatibility: Different algorithms and models may require input data to be in a specific format. For example, some algorithms may only accept numerical data, while others may only accept categorical data. Encoding variables allows us to convert data into the required format so that it can be used as input to these algorithms.
- Improved performance: The choice of encoding can affect how well an algorithm performs. For example, one-hot encoding a categorical variable with no natural order avoids suggesting a ranking between categories that plain integer codes would imply.
- Increased interpretability: Encoding variables can also make it easier to understand and interpret the results of an analysis. For example, encoding a categorical variable as a set of binary variables can make it easier to understand the relationship between the variable and the target variable in a regression model.
- Reduced memory usage: Encoding variables can also help to reduce the amount of memory needed to store and process data, which can be important when working with large datasets.

Overall, encoding variables is an important step in the data preparation process that can help to improve the performance and interpretability of machine learning models.

In this section we will introduce 2 encoding techniques:
1. label encoder, for features containing 2 unique values
2. one-hot encoder, for features containing more than 2 unique values

### 4.1. Label Encoder

How can we convert categorical variables to numeric variables? For binary categorical variables we can simply substitute the values 0 and 1 for each category. For example, for the `Kids` feature we can map value `no` to 0 and `yes` to 1, and for the `Residence` feature we can map value `downtown` to 0 and `suburbs` to 1. `LabelEncoder` assigns the codes in sorted order of the class labels, which is why `downtown` gets 0.

Let's implement this transformation with `preprocessing.LabelEncoder()`.

In [25]:
# To implement these transforms we use the sklearn class LabelEncoder for each column separately
# We will store the result in a new dataframe df_enc
df_enc = df.copy()

# Initializing class LabelEncoder for column Kids
le_kids = preprocessing.LabelEncoder()
# Fitting encoder to column Kids
le_kids.fit(df['Kids'])
# Checking encoding classes
print(f'Encoded classes for column Kids: {le_kids.classes_}')
# Transforming column
df_enc['Kids'] = le_kids.transform(df['Kids'])

# Initializing class LabelEncoder for column Residence
le_res = preprocessing.LabelEncoder()
# Fitting encoder to column Residence
le_res.fit(df['Residence'])
# Checking encoding classes
print(f'Encoded classes for column Residence: {le_res.classes_}')
# Transforming column
df_enc['Residence'] = le_res.transform(df['Residence'])

# Printing the result
df_enc

Encoded classes for column Kids: ['no' 'yes']
Encoded classes for column Residence: ['downtown' 'suburbs']


Unnamed: 0,Age,Income,Vehicle,Kids,Residence
0,32,95.0,none,0,0
1,46,210.0,car,1,0
2,25,75.0,truck,1,1
3,36,30.0,car,1,1
4,29,55.0,none,0,1
5,54,430.0,car,1,0
6,30,100.0,none,0,0
7,41,220.0,truck,1,0


An advantage of using the built-in class (besides clearer and shorter code) is the possibility to apply the inverse transform of the feature encoding.

In [26]:
list(le_kids.inverse_transform(df_enc['Kids']))

['no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes']

In [27]:
list(le_res.inverse_transform(df_enc['Residence']))

['downtown',
 'downtown',
 'suburbs',
 'suburbs',
 'suburbs',
 'downtown',
 'downtown',
 'downtown']
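
To see what the encoder does under the hood, here is a minimal pure-Python sketch of the same idea: sort the unique labels, use each label's position as its code, and look the label up again to invert. The function names are hypothetical, and this is a simplified illustration rather than scikit-learn's actual implementation.

```python
def fit_label_encoder(values):
    """Return the sorted unique labels; a label's position in this list is its code."""
    return sorted(set(values))


def encode(values, classes):
    """Map every label to its integer code."""
    lookup = {label: code for code, label in enumerate(classes)}
    return [lookup[v] for v in values]


def decode(codes, classes):
    """Inverse transform: map every integer code back to its label."""
    return [classes[c] for c in codes]


# Kids values copied by hand from the dataset
kids = ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes']

classes = fit_label_encoder(kids)
codes = encode(kids, classes)
print(classes)
print(codes)
print(decode(codes, classes))
```

Because the classes are sorted, the resulting codes depend only on the label names, not on the order in which they appear in the data.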

### 4.2. One-hot encoder

The column `Vehicle` has more than two unique values with no implicit order, so we need to apply one-hot encoding to it.

In [12]:
# One-hot encoding in pandas can be done using the get_dummies function
# dtype=int keeps 0/1 integers (recent pandas versions return booleans by default)
df_enc = pd.get_dummies(df_enc, dtype=int)
df_enc

Unnamed: 0,Age,Income,Kids,Residence,Vehicle_car,Vehicle_none,Vehicle_truck
0,32,95.0,0,0,0,1,0
1,46,210.0,1,0,1,0,0
2,25,75.0,1,1,0,0,1
3,36,30.0,1,1,1,0,0
4,29,55.0,0,1,0,1,0
5,54,430.0,1,0,1,0,0
6,30,100.0,0,0,0,1,0
7,41,220.0,1,0,0,0,1
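
For intuition, one-hot encoding can also be written by hand. The sketch below is plain Python, with a hypothetical `one_hot` helper and the `Vehicle` values copied by hand from the dataset; it produces one 0/1 column per category, named in the same `prefix_category` style as the table above.

```python
def one_hot(values, prefix):
    """Return (column names, rows), with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    columns = [f'{prefix}_{c}' for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


# Vehicle values copied by hand from the dataset
vehicle = ['none', 'car', 'truck', 'car', 'none', 'car', 'none', 'truck']

columns, rows = one_hot(vehicle, prefix='Vehicle')
print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.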


## 5. Feature Scaling

Feature scaling is a preprocessing step that is commonly used in machine learning and data science to standardize or normalize the range of independent variables or features of a dataset. There are a few reasons why feature scaling is important:

- Algorithms can converge faster: Many machine learning algorithms use some form of distance to measure the similarity between data points. If the features in a dataset have different scales, then the distance between data points will be dominated by the features with the larger scales, which can cause the algorithm to converge slowly or poorly. Feature scaling helps to balance the scales of the features, which can improve the convergence rate of the algorithm.
- Algorithms can perform better: Some machine learning algorithms are sensitive to the scale of the input features and can perform poorly if the features are not scaled. For example, algorithms that use gradient descent (such as linear regression) can converge faster and perform better if the features are scaled.
- Easier to interpret the results: Feature scaling can also make it easier to interpret the results of an analysis. When features are on different scales, it can be difficult to compare the importance of different features. Scaling the features allows us to compare the features on a level playing field, which can make it easier to understand the results of an analysis.

Overall, feature scaling is an important preprocessing step that can improve the performance and interpretability of machine learning models.

In this section we will learn two feature scaling techniques:
- feature standardization, using `preprocessing.StandardScaler()` 
- feature normalization, using `preprocessing.MinMaxScaler()`.

### 5.1. Feature Standardization

#### 5.1.1. Implementing from scratch

First, to understand the idea, let's implement feature standardization from scratch, without scikit-learn.

In [30]:
# Feature standardization by hand for column Income
m = df_enc['Income'].mean()
s = df_enc['Income'].std(ddof=0)
display((df_enc['Income'] - m) / s)

0   -0.461608
1    0.471753
2   -0.623932
3   -0.989160
4   -0.786255
5    2.257313
6   -0.421027
7    0.552915
Name: Income, dtype: float64
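
The same computation can be sketched without pandas, which also makes it easy to verify the defining property of standardization: the transformed values have mean 0 and standard deviation 1. The `standardize` function is a hypothetical helper and the `incomes` list is copied by hand from the dataset.

```python
import math


def standardize(values):
    """Subtract the mean and divide by the uncorrected standard deviation (ddof=0)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


# Income values copied by hand from the dataset
incomes = [95, 210, 75, 30, 55, 430, 100, 220]
z = standardize(incomes)
print([round(v, 6) for v in z])

# After standardization the mean is (numerically) 0 and the standard deviation is 1
z_mean = sum(z) / len(z)
z_std = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))
print(z_mean, z_std)
```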

#### 5.1.2. Implementing with a library

In [31]:
# Feature standardization using sklearn class StandardScaler
# Initializing StandardScaler class instance 
scaler = preprocessing.StandardScaler()

# Fitting to column Income
# the fit method expects 2D input, so instead of df[column_name] (Series object)
# we pass df[[column_name]] (DataFrame object)
scaler.fit(df_enc[['Income']])

# Printing result of column transformation
display(scaler.transform(df_enc[['Income']]))

array([[-0.4616079 ],
       [ 0.47175313],
       [-0.62393155],
       [-0.98915978],
       [-0.78625521],
       [ 2.25731335],
       [-0.42102698],
       [ 0.55291495]])

Now we move forward to transforming not just one column, but all numerical columns. To do so, we just need to fit StandardScaler to the whole dataset.

In [14]:
## Transforming all numerical columns

#initialize StandardScaler()
scaler = preprocessing.StandardScaler()

#fit StandardScaler() to the dataset
scaler.fit(df_enc)

#transform
result = scaler.transform(df_enc)

#print the result
print(result)

[[-0.50618484 -0.4616079  -1.29099445 -0.77459667 -0.77459667  1.29099445
  -0.57735027]
 [ 1.02605036  0.47175313  0.77459667 -0.77459667  1.29099445 -0.77459667
  -0.57735027]
 [-1.27230244 -0.62393155  0.77459667  1.29099445 -0.77459667 -0.77459667
   1.73205081]
 [-0.06840336 -0.98915978  0.77459667  1.29099445  1.29099445 -0.77459667
  -0.57735027]
 [-0.83452096 -0.78625521 -1.29099445  1.29099445 -0.77459667  1.29099445
  -0.57735027]
 [ 1.90161333  2.25731335  0.77459667 -0.77459667  1.29099445 -0.77459667
  -0.57735027]
 [-0.72507559 -0.42102698 -1.29099445 -0.77459667 -0.77459667  1.29099445
  -0.57735027]
 [ 0.4788235   0.55291495  0.77459667 -0.77459667 -0.77459667 -0.77459667
   1.73205081]]


As we see, `StandardScaler` outputs a numpy array, but we want a pandas dataframe as output. Fortunately, converting a numpy array to a pandas dataframe is easy.

In [15]:
display(pd.DataFrame(data=result, columns=df_enc.columns))

Unnamed: 0,Age,Income,Kids,Residence,Vehicle_car,Vehicle_none,Vehicle_truck
0,-0.506185,-0.461608,-1.290994,-0.774597,-0.774597,1.290994,-0.57735
1,1.02605,0.471753,0.774597,-0.774597,1.290994,-0.774597,-0.57735
2,-1.272302,-0.623932,0.774597,1.290994,-0.774597,-0.774597,1.732051
3,-0.068403,-0.98916,0.774597,1.290994,1.290994,-0.774597,-0.57735
4,-0.834521,-0.786255,-1.290994,1.290994,-0.774597,1.290994,-0.57735
5,1.901613,2.257313,0.774597,-0.774597,1.290994,-0.774597,-0.57735
6,-0.725076,-0.421027,-1.290994,-0.774597,-0.774597,1.290994,-0.57735
7,0.478824,0.552915,0.774597,-0.774597,-0.774597,-0.774597,1.732051


### 5.2. Feature Normalization

#### 5.2.1. Implementing from scratch

First, to understand the idea, let's implement feature normalization from scratch, without scikit-learn.

In [33]:
# Feature normalization by hand for column Income
min_income = df_enc['Income'].min()
max_income = df_enc['Income'].max()
display((df_enc['Income'] - min_income) / (max_income - min_income))

0    0.1625
1    0.4500
2    0.1125
3    0.0000
4    0.0625
5    1.0000
6    0.1750
7    0.4750
Name: Income, dtype: float64
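
Below is a plain-Python sketch of the same min-max formula, together with its inverse, which maps scaled values back to the original units. Both function names are hypothetical, and the `incomes` list is copied by hand from the dataset.

```python
def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    # Note: a constant column (hi == lo) would cause a division by zero here
    return [(v - lo) / (hi - lo) for v in values]


def min_max_inverse(scaled, lo, hi):
    """Map scaled values back to the original range [lo, hi]."""
    return [s * (hi - lo) + lo for s in scaled]


# Income values copied by hand from the dataset
incomes = [95, 210, 75, 30, 55, 430, 100, 220]

scaled = min_max_scale(incomes)
restored = min_max_inverse(scaled, min(incomes), max(incomes))
print(scaled)
print(restored)
```

Unlike standardization, this transformation is driven entirely by the two extreme values, so a single unusually large or small observation compresses all other values into a narrow part of the [0, 1] range.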

#### 5.2.2. Implementing with a library

In [34]:
# Feature Normalization using sklearn class MinMaxScaler
# Initializing MinMaxScaler class instance 
scaler = preprocessing.MinMaxScaler()

# Fitting to column Income
# the fit method expects 2D input, so instead of df[column_name] (Series object)
# we pass df[[column_name]] (DataFrame object)
scaler.fit(df_enc[['Income']])

# Printing result of column transformation
display(scaler.transform(df_enc[['Income']]))

array([[0.1625],
       [0.45  ],
       [0.1125],
       [0.    ],
       [0.0625],
       [1.    ],
       [0.175 ],
       [0.475 ]])

Now we move forward to transforming not just one column, but all numerical columns. To do so, we just need to fit MinMaxScaler to the whole dataset.

In [17]:
# Transforming all numerical columns

#initialize MinMaxScaler()
scaler = preprocessing.MinMaxScaler()

#fit MinMaxScaler() to the dataset
scaler.fit(df_enc)

#transform
result = scaler.transform(df_enc)

#print the result
print(result)

[[0.24137931 0.1625     0.         0.         0.         1.
  0.        ]
 [0.72413793 0.45       1.         0.         1.         0.
  0.        ]
 [0.         0.1125     1.         1.         0.         0.
  1.        ]
 [0.37931034 0.         1.         1.         1.         0.
  0.        ]
 [0.13793103 0.0625     0.         1.         0.         1.
  0.        ]
 [1.         1.         1.         0.         1.         0.
  0.        ]
 [0.17241379 0.175      0.         0.         0.         1.
  0.        ]
 [0.55172414 0.475      1.         0.         0.         0.
  1.        ]]


As we see, `MinMaxScaler` outputs a numpy array, but we want a pandas dataframe as output. Fortunately, converting a numpy array to a pandas dataframe is easy.

In [18]:
display(pd.DataFrame(data=result, columns=df_enc.columns))

Unnamed: 0,Age,Income,Kids,Residence,Vehicle_car,Vehicle_none,Vehicle_truck
0,0.241379,0.1625,0.0,0.0,0.0,1.0,0.0
1,0.724138,0.45,1.0,0.0,1.0,0.0,0.0
2,0.0,0.1125,1.0,1.0,0.0,0.0,1.0
3,0.37931,0.0,1.0,1.0,1.0,0.0,0.0
4,0.137931,0.0625,0.0,1.0,0.0,1.0,0.0
5,1.0,1.0,1.0,0.0,1.0,0.0,0.0
6,0.172414,0.175,0.0,0.0,0.0,1.0,0.0
7,0.551724,0.475,1.0,0.0,0.0,0.0,1.0


## 6. Feature Engineering

### 6.1. Domain Knowledge Binning

Feature binning is the process that converts a numerical (either continuous or discrete) feature into a categorical feature represented by a set of ranges, or bins. For example, instead of representing age as a single real-valued feature, we chop ranges of age into 3 discrete bins: young ∈ [ages 25−34], middle ∈ [ages 35−44], old ∈ [ages 45−54]. Let's see how it works in practice.

First, to understand the idea, let's implement feature binning from scratch, using plain Python.
We start by writing a function for binning, `get_bin`.

In [35]:
# First, we implement a simple function that returns the discrete bin (young/middle/old) for a given value
def get_bin(age: int) -> str:
    if age <= 34:
        return 'young'
    elif age >= 45:
        return 'old'
    else:
        return 'middle'
    
# Let's test its behaviour
print(get_bin(27))
print(get_bin(40))
print(get_bin(47))

young
middle
old


We then simply apply `get_bin` function to the column `Age`.

In [20]:
# We copy our dataset to a new variable so that the original stays unchanged
df_binned = df_enc.copy()

# Now we apply our function get_bin to column Age
df_binned['Age'] = df_binned['Age'].apply(get_bin)

df_binned

Unnamed: 0,Age,Income,Kids,Residence,Vehicle_car,Vehicle_none,Vehicle_truck
0,young,95.0,0,0,0,1,0
1,old,210.0,1,0,1,0,0
2,young,75.0,1,1,0,0,1
3,middle,30.0,1,1,1,0,0
4,young,55.0,0,1,0,1,0
5,old,430.0,1,0,1,0,0
6,young,100.0,0,0,0,1,0
7,middle,220.0,1,0,0,0,1


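### 6.2. Equal Width Binning

Equal width binning splits the range between the smallest and the largest value of a feature into a chosen number of bins of identical width. Now let's see how it works in practice.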

In [21]:
# We will use the sklearn class KBinsDiscretizer to produce bins
# We copy our dataset to a new variable so that the original stays unchanged
df_binned = df_enc.copy()

# Initializing KBinsDiscretizer class instance
# Let number of bins be 3
# To define equal width binning we need to set parameter strategy to uniform
est = preprocessing.KBinsDiscretizer(n_bins=3, strategy='uniform', encode='ordinal')

# Fitting to column Age
est.fit(df_binned[['Age']])

# Bins edges
display(est.bin_edges_)

# Printing result of column transformation
df_binned[['Age']] = est.transform(df_binned[['Age']])
display(df_binned)

array([array([25.        , 34.66666667, 44.33333333, 54.        ])],
      dtype=object)

Unnamed: 0,Age,Income,Kids,Residence,Vehicle_car,Vehicle_none,Vehicle_truck
0,0.0,95.0,0,0,0,1,0
1,2.0,210.0,1,0,1,0,0
2,0.0,75.0,1,1,0,0,1
3,1.0,30.0,1,1,1,0,0
4,0.0,55.0,0,1,0,1,0
5,2.0,430.0,1,0,1,0,0
6,0.0,100.0,0,0,0,1,0
7,1.0,220.0,1,0,0,0,1
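
The bin edges printed above follow directly from the minimum, the maximum and the number of bins. A minimal sketch in plain Python, assuming a hypothetical helper `equal_width_bins` and hand-copied ages, reproduces both the edges and the ordinal codes; it is meant as an illustration of the idea, not as a description of how `KBinsDiscretizer` is implemented internally.

```python
def equal_width_bins(values, n_bins):
    """Return (bin edges, ordinal bin code per value) for equal width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    # The maximum value would land in bin n_bins, so it is clamped into the last bin
    codes = [min(int((v - lo) / width), n_bins - 1) for v in values]
    return edges, codes


# Age values copied by hand from the dataset
ages = [32, 46, 25, 36, 29, 54, 30, 41]

edges, codes = equal_width_bins(ages, n_bins=3)
print(edges)
print(codes)
```

Note that equal width does not mean equal counts: here four of the eight people fall into the first bin.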


## 7. Handling Class Label Imbalance

Class imbalance occurs when the number of observations in one class is much higher or lower than the number of observations in the other class. This can be a problem in machine learning because it can lead to biased models that perform poorly on the minority class.

For example, consider a binary classification problem where the goal is to predict whether a customer will default on a loan. If the dataset has a very low percentage of customers who default, the model might learn to always predict that a customer will not default, resulting in high accuracy but poor performance on the minority class of customers who do default.
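
This effect is easy to reproduce with made-up numbers. The sketch below assumes a hypothetical set of 100 customers, 5 of whom default, and a "model" that always predicts no default; none of these figures come from the book's dataset.

```python
# Hypothetical labels: 95 customers who repay (0) and 5 who default (1)
y_true = [0] * 95 + [1] * 5
# A trivial model that always predicts the majority class
y_pred = [0] * 100

# Accuracy: share of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual defaulters the model finds
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f'{accuracy=}, {recall=}')
```

The accuracy looks excellent while the model never identifies a single defaulter, which is why accuracy alone is a poor metric on imbalanced data.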

To address class imbalance, there are several approaches that can be taken. One approach is to collect more data for the minority class to better balance the dataset. Another approach is to use algorithms that are specifically designed to handle class imbalance, such as cost-sensitive learning algorithms. 

But in this section, we will learn about a third, more generic approach, which is to use sampling techniques, such as oversampling the minority class or undersampling the majority class, to create a more balanced dataset.
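
The simplest sampling technique is random oversampling: duplicate randomly chosen minority observations until the classes are the same size. The sketch below is a plain-Python illustration with a hypothetical `random_oversample` helper; the `(Age, Income)` pairs and `Kids` labels are copied by hand from the dataset.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random observations of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lbl in zip(X, y) if lbl == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# (Age, Income) pairs and Kids labels copied by hand from the dataset
X = [(32, 95), (46, 210), (25, 75), (36, 30), (29, 55), (54, 430), (30, 100), (41, 220)]
y = [0, 1, 1, 1, 0, 1, 0, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y), '->', Counter(y_bal))
```

Since the added rows are exact copies, this adds no new information and can encourage overfitting, which is one motivation for techniques that generate new synthetic points instead.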


### 7.1. Dataset overview

Suppose we want to predict if a person has a child using `Age` and `Income` parameters (seems a little bit weird, but imagine we are demographers).

In [37]:
# Keeping only the columns needed for this task
data = df_enc[['Age', 'Income', 'Kids']]

In [38]:
# Let's look at how the classes are balanced
data['Kids'].value_counts()

1    5
0    3
Name: Kids, dtype: int64

There are 5 observations for the `Has Kid` class (1) and 3 observations for the `Doesn't have kid` class (0). Let's also plot these data points to visualize them.

In [43]:
# Plotting our dataset
fig = px.scatter(data, x="Age", y="Income", color="Kids", color_continuous_scale='Bluered_r')
fig.update_traces(marker_size=8)
fig.show()

### 7.2. Oversampling

We see classes are imbalanced: 
- 3 observations for class 0 
- 5 observations for class 1

A difference of 2 observations is not normally critical, but given that this is a small dataset, let's apply an oversampling technique to balance these classes.

First, we need to identify X and y in our dataset.

In [45]:
# Splitting data into target and independent variables
X = data[['Age', 'Income']]
y = data['Kids']

Second, we need to initialize a SMOTE class instance with `k_neighbors=2`. The minority class has only 3 points, so each point has at most 2 minority-class neighbors to interpolate with.

In [50]:
# Initializing SMOTE class instance with k_neighbors=2
# (the minority class has 3 points, so each point has at most 2 neighbors)
# Also don't forget to set the seed in order to have reproducible experiments
oversample = SMOTE(k_neighbors=2, random_state=123)

Third, we need to fit SMOTE to the dataset.

In [None]:
# Fitting SMOTE and resampling dataset
X, y = oversample.fit_resample(X, y)

Let's see the resulting dataset.

In [41]:
# Plot our oversampled dataset
fig = px.scatter(x=X['Age'], y=X['Income'], color=y, color_continuous_scale='Bluered_r')
fig.update_traces(marker_size=8)
fig.show()
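
To make the idea behind SMOTE more concrete, here is a simplified pure-Python sketch of its core step: pick a minority point, pick one of its `k` nearest minority neighbors, and create a new point somewhere on the line segment between them. The function `smote_sketch` is hypothetical and leaves out details of the `imblearn` implementation, so its output will not match `fit_resample` point for point. The minority points are the `(Age, Income)` pairs of the three people without kids, copied by hand from the dataset.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=123):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # The k nearest minority neighbors of the base point (Euclidean distance)
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbor = rng.choice(neighbors)
        # Random position on the segment between the base point and the neighbor
        lam = rng.random()
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# (Age, Income) of the three observations with Kids = 0
minority = [(32, 95), (29, 55), (30, 100)]

# Two new points are enough to bring the minority class from 3 to 5 observations
new_points = smote_sketch(minority, n_new=2)
print(new_points)
```

Because every synthetic point lies between two existing minority points, the new observations stay inside the region already covered by the minority class.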


Overall, it is important to address class imbalance in datasets because it can lead to biased models that do not generalize well to the minority class, which can have serious consequences in real-world applications.