# Data Preprocessing

In real-world scenarios, data is often messy and not immediately ready for use. Therefore, data preprocessing is essential for efficient model training.
<br>

Typical data preprocessing challenges:
- Handling missing values
- Scaling numerical features
- Converting categorical attributes to numerical format
- Reducing the number of features (dimensionality reduction)
- Extracting features from non-numerical data


In Python, the `sklearn` library (scikit-learn) offers a wide variety of pre-processing transformers. It works well with `numpy`, a library used for creating and processing numerical arrays.
<br>

It is important to apply the same pre-processing steps to both the training data and the testing data. scikit-learn offers a `Pipeline` class that makes it easy to chain multiple pre-processing transformers and apply them uniformly to the training data and the testing data. scikit-learn groups its data pre-processing tools into the following categories:
- Feature Extraction
- Data Cleaning
- Feature Reduction
- Feature Expansion
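
The pipeline idea described above can be sketched as follows. This is a minimal illustration, not a recommended setup: the arrays `X_train` and `X_test` and the step names `"impute"` and `"scale"` are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training and testing data (not part of the samples used in this notebook)
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, np.nan]])

# Chain two transformers: fill missing values first, then standardize
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# The statistics (column means, standard deviations) are learned from the training data only
X_train_prepared = pipe.fit_transform(X_train)

# The same fitted steps are then reused, unchanged, on the testing data
X_test_prepared = pipe.transform(X_test)

print(X_train_prepared)
print(X_test_prepared)
```

Calling `fit_transform` on the training data and only `transform` on the testing data is what keeps the two sets on the same pre-processing.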

Consider the following sample data (in the form of nested lists), which is messy and needs cleaning before use.

In [44]:
import numpy as np

#Sample data with missing values
X = [[1, 2], [np.nan, 3], [7, 6], [1, 4]]
Y = [[np.nan, 2, 5], [6, np.nan, 4], [np.nan, 7, 6], [8, 2, np.nan]]
Z = [[12], [2], [4], [-3], [6], [-5], [-100]]
A = [[1], [2], [3], [9], [2], [3], [1]]
B = [[1, "human", "gamer"], [2, "robot", "machine"], [0, "bot", "speaker"]]
C = [[i] for i in range(1, 11)]

## Data Cleaning

### Imputer

Imputers are used to handle missing values in a dataset. They estimate each missing value from the values that are present, for example from the other values in the same column or from similar rows, and fill the gap with that estimate.
<br>

Missing values are empty fields, which are usually represented as NaN (Not a Number) in numerical arrays.
<br>

To use imputers, import the `impute` module from the `sklearn` library.
<br>
There are two main imputer APIs available in sklearn:
- SimpleImputer
- KNNImputer 

#### SimpleImputer

This kind of imputer expects a 2D array. It fills in each missing value by applying a strategy to the non-missing values of the same column. The strategies available are:
- mean: Computes the mean of the other values in the column and fills in the result
- median: Computes the median of the other values in the column and fills in the result
- most_frequent (Mode): Computes the mode of the other values in the column and fills in the result
- constant: Fills in a fixed value given by the `fill_value` parameter (0 by default for numerical data)

<br>

If a column has multiple missing values, all of them are filled with the same computed value. The missing values themselves are ignored while computing it.
<br>

To use this imputer, import `SimpleImputer` from `sklearn.impute`.

In [2]:
#Filling in missing values with the mean strategy

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

imp.fit(X) #Learn the statistics from the data provided with variable X
imp.transform(X) #Apply the learned stats to fill in the missing values

array([[1., 2.],
       [3., 3.],
       [7., 6.],
       [1., 4.]])

In [3]:
#Filling in missing values with the median strategy

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='median')

imp.fit(X)
imp.transform(X)

array([[1., 2.],
       [1., 3.],
       [7., 6.],
       [1., 4.]])

In [4]:
#Filling in missing values with the most_frequent strategy

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

imp.fit(X)
imp.transform(X)

array([[1., 2.],
       [1., 3.],
       [7., 6.],
       [1., 4.]])

In [5]:
#Filling in missing values with the constant strategy

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='constant')

imp.fit(X)
imp.transform(X)

array([[1., 2.],
       [0., 3.],
       [7., 6.],
       [1., 4.]])
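
The constant strategy fills in 0 here because no replacement value was given. As a small sketch, the replacement can be chosen with the `fill_value` parameter. The name `X_demo` and the value `-1` are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as X above, redefined so that the snippet runs on its own
X_demo = [[1, 2], [np.nan, 3], [7, 6], [1, 4]]

# fill_value sets the replacement used by the constant strategy
imp_const = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=-1)
X_filled = imp_const.fit_transform(X_demo)

print(X_filled)
```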

In [6]:
# Filling in missing values for a dataset with multiple nan values in a column

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(Y)
imp.transform(Y)

array([[7.        , 2.        , 5.        ],
       [6.        , 3.66666667, 4.        ],
       [7.        , 7.        , 6.        ],
       [8.        , 2.        , 5.        ]])
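
To see what `fit` actually learns, the fitted imputer can be inspected through its `statistics_` attribute. The sketch below repeats the data of `Y` under the name `Y_demo`. The row `new_row` is a made-up example of unseen data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as Y above, redefined so that the snippet runs on its own
Y_demo = [[np.nan, 2, 5], [6, np.nan, 4], [np.nan, 7, 6], [8, 2, np.nan]]

imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_mean.fit(Y_demo)

# One learned value per column, computed from the non-missing entries only
print(imp_mean.statistics_)

# A hypothetical new row is filled with the statistics learned during fit
new_row = [[np.nan, np.nan, 1]]
filled_row = imp_mean.transform(new_row)
print(filled_row)
```

Both missing entries in the first column of `Y` receive the same value because there is a single learned statistic per column.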

#### KNNImputer

This kind of imputer expects a 2D array, and the missing values are filled by computing the mean of the `n` nearest neighbours, where `n` is a value provided by the user through the `n_neighbors` parameter.

<br>

To find the nearest neighbours, it uses a Euclidean distance that skips missing coordinates, which is calculated using the formula:
$$\text {dist} = \sqrt {\text {weight} \cdot \sum \text {(distance between the coordinates)}^2}$$

<br>

The weight is calculated by using the formula:
$$\text {weight} = \frac{\text {total number of coordinates}}{\text {number of coordinates present in both rows}}$$

<br>

The distance between the coordinates is the difference between the value in the row that contains the missing value and the value in the row being compared, taken column by column. If either value in a column is nan or missing, that column is ignored.

<br>

Consider the example: `X = [[1, 2, 3], [3, np.nan, 6], [np.nan, 4, 5]]`
<br>

The second row contains a nan value. So we take the second row (say `a`) and choose a row to compare it with, say the first row (say `b`).
<br>

Now, `a = [3, np.nan, 6]` and `b = [1, 2, 3]`
<br>

The second column has a nan value in `a`, so it is ignored. The sum of squared differences over the remaining columns is: $$(3 - 1)^2 + (6 - 3)^2 = 13$$
<br>

Each row has three values, but only two columns are free of nan or missing values in both rows. So the weight is: $$\text {weight} = \frac {3}{2}$$

<br>

Now the Euclidean distance becomes: $$\text {dist} = \sqrt {\frac {3}{2} \cdot 13} = \sqrt {\frac {39}{2}} \approx 4.42$$
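
The worked example can be checked in code. The sketch below uses `nan_euclidean_distances` from `sklearn.metrics.pairwise` and compares its result with the hand computation. The variable names are chosen only for this illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

row_with_nan = [[3, np.nan, 6]]  # the row that contains the missing value
other_row = [[1, 2, 3]]          # the row it is compared with

# Distance as computed by scikit-learn (missing coordinates are skipped)
dist = nan_euclidean_distances(row_with_nan, other_row)[0, 0]

# Hand computation: weight * sum of squared differences over the usable columns
squared_sum = (3 - 1) ** 2 + (6 - 3) ** 2   # 13
weight = 3 / 2                              # 3 columns in total, 2 usable
manual = np.sqrt(weight * squared_sum)

print(dist, manual)
```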

<br>

To use this imputer, import `KNNImputer` from `sklearn.impute`.

In [7]:
#Filling in the missing values using KNNImputer

from sklearn.impute import KNNImputer

imp = KNNImputer(n_neighbors=2)
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imp.fit(X)
imp.transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

### Scaler

Scalers are used to bring numerical features onto the same scale. When features are on different scales, convergence of iterative optimization procedures becomes slower.
<br>

To use scalers, the `preprocessing` module from `sklearn` must be imported.
<br>

The main scaler classes in sklearn covered here are:
- StandardScaler
- MinMaxScaler
- MaxAbsScaler
- RobustScaler

#### StandardScaler

Transforms the original feature `x` into `x'` using the formula: 
$$x \text {'} = \frac{x - \mu}{\sigma} \text {, where } \mu \text { = Mean; } \sigma \text { = Standard Deviation}$$
<br>

To use this scaler, import the `StandardScaler` class from `sklearn.preprocessing`.

In [8]:
#Using StandardScaler to unify the scale

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(Z)
ss.transform(Z)

array([[ 0.66107927],
       [ 0.38562957],
       [ 0.44071951],
       [ 0.24790472],
       [ 0.49580945],
       [ 0.19281479],
       [-2.42395731]])
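
As a quick check of the formula, the same result can be reproduced with plain `numpy`. This is only a sketch: `Z_demo` repeats the values of `Z`, and the other names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same values as Z above, redefined so that the snippet runs on its own
Z_demo = np.array([[12], [2], [4], [-3], [6], [-5], [-100]], dtype=float)

mu = Z_demo.mean(axis=0)     # per-column mean
sigma = Z_demo.std(axis=0)   # per-column population standard deviation (ddof=0)
manual = (Z_demo - mu) / sigma

scaled = StandardScaler().fit_transform(Z_demo)

print(mu, sigma)
print(np.allclose(manual, scaled))
```

The single outlier `-100` pulls the mean down to `-12`, which is why all the other values end up close together after scaling.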

#### MinMaxScaler

Transforms the original feature `x` into `x'` using the formula: 
$$x \text {'} = \frac{x - x_\text {min}}{x_\text {max} - x_\text {min}}$$
Where $x_\text {min}$ is the minimum value of the feature (column) and $x_\text {max}$ is its maximum value. The scaled values range from 0 to 1, and both ends appear in every column that is not constant.
<br>

To use this scaler, import the `MinMaxScaler` class from `sklearn.preprocessing`.

In [None]:
#Using MinMaxScaler to unify the scale

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
mms.fit_transform(Z)

array([[1.        ],
       [0.91071429],
       [0.92857143],
       [0.86607143],
       [0.94642857],
       [0.84821429],
       [0.        ]])
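
scikit-learn applies this formula column by column. The sketch below uses a made-up two-column array, `data`, to show that each column gets its own minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two columns with very different ranges
data = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

mms_demo = MinMaxScaler()
scaled = mms_demo.fit_transform(data)

# Minimum and maximum learned for each column during fit
print(mms_demo.data_min_, mms_demo.data_max_)
print(scaled)
```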

#### MaxAbsScaler

Transforms the original feature `x` into `x'` using the formula: 
$$x \text {'} = \frac{x}{\text {MaxAbsoluteValue}}$$
Where $\text {MaxAbsoluteValue} = \max \lbrace x_{max}, |x_{min}| \rbrace$, computed per feature (column). The scaled values range from -1 to 1, and at least one of the two ends appears in every column that is not all zeros.
<br>

To use this scaler, import the `MaxAbsScaler` class from `sklearn.preprocessing`.

In [16]:
# Using MaxAbsScaler to scale values

from sklearn.preprocessing import MaxAbsScaler

mas = MaxAbsScaler()
mas.fit_transform(Z)

array([[ 0.12],
       [ 0.02],
       [ 0.04],
       [-0.03],
       [ 0.06],
       [-0.05],
       [-1.  ]])
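
Which end of the range shows up depends on the sign of the value with the largest magnitude. The sketch below contrasts a made-up column, `pos`, with the values of `Z` (repeated here as `neg`).

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical column whose largest absolute value is positive
pos = np.array([[4.0], [-2.0], [1.0]])
# Same values as Z above: the largest absolute value belongs to a negative number
neg = np.array([[12.0], [2.0], [4.0], [-3.0], [6.0], [-5.0], [-100.0]])

pos_scaled = MaxAbsScaler().fit_transform(pos)
neg_scaled = MaxAbsScaler().fit_transform(neg)

print(pos_scaled.ravel())  # contains 1, but not -1
print(neg_scaled.ravel())  # contains -1, but not 1
```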

#### RobustScaler

This scaler is used to reduce the effect of outliers on scaling. It does so by using the median and the Inter Quartile Range (IQR), which are barely affected by outliers.
<br>

**Outlier**: A value that lies far from the common range of the rest of the data, at the lower or upper extreme, and can therefore distort statistics such as the mean.
<br>

**Inter Quartile Range (IQR)**: The difference between the third quartile Q3 (75th percentile) and the first quartile Q1 (25th percentile).
<br>

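To find them, first find the median of the data (the second quartile Q2, the 50th percentile) and split the sorted data into two parts, so that the median lies between the highest value of the first part and the lowest value of the second. Then the median of the first part is Q1 and the median of the second part is Q3. scikit-learn instead computes the quartiles by linear interpolation between sorted values (as `np.percentile` does), so its Q1 and Q3 can differ slightly from this hand method.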
<br>

This scaler transforms the original feature `x` to `x'` using the formula:
$$x \text ' = \frac {x - Q_2}{Q_3 - Q_1}$$
<br>

To use this scaler, import the `RobustScaler` class from `sklearn.preprocessing`.

In [17]:
# Using RobustScaler to reduce the effect of outliers

from sklearn.preprocessing import RobustScaler

rs = RobustScaler()
rs.fit_transform(Z)

array([[  1.11111111],
       [  0.        ],
       [  0.22222222],
       [ -0.55555556],
       [  0.44444444],
       [ -0.77777778],
       [-11.33333333]])
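
The output above can be reproduced by hand with `np.percentile`. This sketch assumes the default settings of `RobustScaler` (quartiles at 25 and 75). `Z_demo` repeats the values of `Z`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Same values as Z above, redefined so that the snippet runs on its own
Z_demo = np.array([[12], [2], [4], [-3], [6], [-5], [-100]], dtype=float)

# Quartiles computed with linear interpolation between sorted values
q1, q2, q3 = np.percentile(Z_demo, [25, 50, 75])
manual = (Z_demo - q2) / (q3 - q1)

scaled = RobustScaler().fit_transform(Z_demo)

print(q1, q2, q3)
print(np.allclose(manual, scaled))
```

Here the quartiles come out as `-4`, `2` and `5`, so the IQR is `9` and the outlier `-100` has no influence on the scale.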

#### FunctionTransformer
Constructs transformed features using a user defined function.
<br>

To use it, import the `FunctionTransformer` class from `sklearn.preprocessing`.

In [18]:
# Using Function Transformer

from sklearn.preprocessing import FunctionTransformer

ft = FunctionTransformer(np.log2)
ft.fit_transform(A)

array([[0.       ],
       [1.       ],
       [1.5849625],
       [3.169925 ],
       [1.       ],
       [1.5849625],
       [0.       ]])
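
A transformation is often more useful when it can be undone. As a sketch, `FunctionTransformer` also accepts an `inverse_func`. Pairing `np.log2` with `np.exp2` is an illustrative choice here, and `A_demo` repeats the values of `A`.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Same values as A above, redefined so that the snippet runs on its own
A_demo = np.array([[1], [2], [3], [9], [2], [3], [1]], dtype=float)

# func is applied by transform, inverse_func by inverse_transform
ft_demo = FunctionTransformer(func=np.log2, inverse_func=np.exp2)

logged = ft_demo.fit_transform(A_demo)
restored = ft_demo.inverse_transform(logged)

print(logged.ravel())
print(restored.ravel())
```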

### Encoder

An encoder is a pre-processing tool that converts categorical data into numeric form so that models can work with it. This matters for training because most models operate only on numeric data and cannot process text or other non-numeric values directly.
<br>

There are three main encoders:
- OneHotEncoder
- LabelEncoder
- OrdinalEncoder

#### OneHotEncoder

An encoding method that converts each unique category of a feature (arranged in ascending / alphabetical order) into its own binary column. Each row has a 1 in the column of the category it contains and a 0 in every other column.
<br>

The number of columns in the encoded matrix is equal to the number of unique categories in the data (summed over all input columns).
<br>

To use this encoder, import `OneHotEncoder` from `sklearn.preprocessing`.

In [24]:
#Applying OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit_transform(A).toarray()

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.]])
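
The mapping between columns and categories is stored on the fitted encoder. The sketch below inspects `categories_` and uses `handle_unknown="ignore"`, which is one possible way to deal with values that were not seen during fitting. The unseen value `5` is made up for this example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same values as A above, redefined so that the snippet runs on its own
A_demo = [[1], [2], [3], [9], [2], [3], [1]]

# With handle_unknown="ignore", unseen categories do not raise an error
ohe_demo = OneHotEncoder(handle_unknown="ignore")
ohe_demo.fit(A_demo)

# Sorted unique categories found in each input column
print(ohe_demo.categories_)

# 9 is a known category; 5 was never seen and is encoded as all zeros
encoded = ohe_demo.transform([[9], [5]]).toarray()
print(encoded)
```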

#### LabelEncoder

An encoding method that converts each category (class) into an integer label. The unique classes are first arranged in increasing order and then assigned values from $0$ to $K - 1$, where $K$ is the number of unique classes. It is intended for target labels rather than input features.
<br>

This encoder only works on a 1D array of $n$ values, i.e. a single column, which is why `A` is flattened with `np.ravel` before encoding. Passing data with multiple columns raises an error.

To use this encoder, import `LabelEncoder` from `sklearn.preprocessing`.

In [34]:
#Applying LabelEncoder

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit_transform(np.ravel(A))

array([0, 1, 2, 3, 1, 2, 0])
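
The same encoder also works on text labels and can map the integers back. The list `labels` below is a made-up example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical 1D list of string labels
labels = ["robot", "human", "bot", "human"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

print(le_demo.classes_)                   # sorted unique labels
print(codes)                              # position of each label in classes_
print(le_demo.inverse_transform(codes))   # back to the original labels
```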

#### OrdinalEncoder

An encoding method that works similarly to the label encoder, except that it can work with data having multiple columns. Each column is labelled independently from $0$ to $K - 1$, where $K$ is the number of unique categories in that column.
<br>

To use this encoder, import `OrdinalEncoder` from `sklearn.preprocessing`.

In [35]:
#Applying OrdinalEncoder

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
oe.fit_transform(B)

array([[1., 1., 0.],
       [2., 2., 1.],
       [0., 0., 2.]])
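
By default the labels follow sorted order, which may not match the natural order of the categories. As a sketch, the order can be set explicitly through the `categories` parameter. The data `sizes` and the order low < medium < high are assumptions made for this example.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column data with a natural order
sizes = [["medium"], ["low"], ["high"], ["low"]]

# Default: categories are sorted alphabetically (high, low, medium)
default_codes = OrdinalEncoder().fit_transform(sizes)

# Explicit order: low < medium < high (one list of categories per column)
ordered = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered.fit_transform(sizes)

print(default_codes.ravel())
print(ordered_codes.ravel())
```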

### Discretizer

A discretizer is a pre-processing tool that converts continuous data into discrete data, in categorical or ordinal form, by splitting the range of the data into different parts or *bins*.
<br>

The most commonly used discretizer is:
- KBinsDiscretizer

#### KBinsDiscretizer

This discretizer takes three main parameters:
- `n_bins`: number of bins
- `strategy`: how bins are split
- `encode`: how to encode the bins. Accepted values are `onehot` (sparse one-hot matrix), `onehot-dense` (dense one-hot array) and `ordinal` (integer bin index).
<br>

`strategy` accepts three values:
- `uniform`: all bins have equal width
- `quantile`: all bins hold roughly the same number of values
- `kmeans`: bin edges are based on a 1D k-means clustering of the values
<br>

To use this discretizer, import the `KBinsDiscretizer` class from `sklearn.preprocessing`.

In [45]:
# Using KBinsDiscretizer

from sklearn.preprocessing import KBinsDiscretizer

kbd = KBinsDiscretizer(n_bins = 7, strategy='uniform', encode='ordinal')
kbd.fit_transform(C)

array([[0.],
       [0.],
       [1.],
       [2.],
       [3.],
       [3.],
       [4.],
       [5.],
       [6.],
       [6.]])
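
To see where the bins start and end, the fitted discretizer exposes `bin_edges_`. The sketch below also shows the `onehot-dense` encoding for comparison. `C_demo` repeats the values of `C`, and the other names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Same values as C above, redefined so that the snippet runs on its own
C_demo = [[i] for i in range(1, 11)]

kbd_demo = KBinsDiscretizer(n_bins=7, strategy="uniform", encode="ordinal")
binned = kbd_demo.fit_transform(C_demo)

# Edges of the bins for the first (and only) column: 8 edges for 7 bins
edges = kbd_demo.bin_edges_[0]
print(edges)

# Same bins, but each bin becomes its own 0/1 column
kbd_onehot = KBinsDiscretizer(n_bins=7, strategy="uniform", encode="onehot-dense")
onehot = kbd_onehot.fit_transform(C_demo)
print(onehot.shape)
```

With the `uniform` strategy the edges are evenly spaced between the smallest and largest value, here from 1 to 10.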

### Binarizer

