# Data Preprocessing

In the real world, the data we get is not always clean and ready to use. There may be missing values, features on different scales, and non-numeric values.

<br>

Hence, the data must be pre-processed before it can be used to train a model efficiently.

## SKLearn Library

In Python, the `sklearn` library (scikit-learn) offers a wide variety of pre-processing transformers. It works well with `numpy`, a library used for creating and processing numerical arrays.

<br>

It is important to apply the same pre-processing steps to both the training data and the testing data. SKLearn offers a pipeline that makes it easy to chain multiple pre-processing transformers and apply them uniformly to the training data and the testing data. SKLearn groups its data pre-processing tools into the following categories:
- Data Cleaning
- Feature Extraction
- Feature Reduction
- Feature Expansion
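
The pipeline idea mentioned above can be sketched as follows. This is a minimal illustration, assuming a hypothetical toy dataset (`X_train`, `X_test`) and an arbitrary choice of two steps, mean imputation followed by standard scaling. The pipeline learns its statistics from the training data only and then reuses them on the testing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features on very different scales
X_train = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0], [4.0, 300.0]])
X_test = np.array([[np.nan, 250.0], [5.0, 350.0]])

# Chain two transformers: fill missing values first, then scale
preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# fit_transform learns the column means and standard deviations from the training data
X_train_prepared = preprocess.fit_transform(X_train)

# transform reuses the statistics learned above; nothing is re-learned from the test data
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared)
print(X_test_prepared)
```

Calling `transform` rather than `fit_transform` on the testing data is what keeps the pre-processing identical across both sets.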

<br>

Upon getting the training data, it is important to explore the data to know what pre-processing is required. Typical problems include:
- Missing values
- Numerical values not being on the same scale
- Categorical attributes not being represented in a numerical format
- Too many features (must be reduced)
- Features that must be extracted from non-numerical data
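
A quick way to spot the first two problems is to count the missing values and compare the range of each column. The sketch below assumes a small hypothetical numeric array named `data` and uses only `numpy`.

```python
import numpy as np

# Hypothetical dataset: rows are samples, columns are features
data = np.array([
    [1.0, 2000.0],
    [np.nan, 3500.0],
    [3.0, np.nan],
    [4.0, 1200.0],
])

# Count the NaN entries in each column
missing_per_column = np.isnan(data).sum(axis=0)

# Column-wise minimum and maximum, ignoring NaN entries
column_min = np.nanmin(data, axis=0)
column_max = np.nanmax(data, axis=0)

print("Missing values per column:", missing_per_column)
print("Column minimums:", column_min)
print("Column maximums:", column_max)
```

Here the two columns differ in scale by roughly three orders of magnitude and each has one gap, which suggests that both an imputer and a scaler are needed.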

### Imputer

Imputers handle missing values in a dataset. They fill each gap with a value computed from the observed data, such as a statistic of the column or the values found in similar rows.

<br>

Missing values include empty fields as well as entries marked as NaN (Not a Number), the floating-point placeholder commonly used for missing data.

<br>

To use imputers, import the `impute` module from the `sklearn` library.

<br>

There are two main imputer APIs available in sklearn:
- SimpleImputer
- KNNImputer 

#### SimpleImputer

This imputer expects a 2D array. It fills each missing value using the observed values in the same column, combined according to a chosen strategy. The available strategies are:
- mean: Fills in the mean of the observed values in the column
- median: Fills in the median of the observed values in the column
- most_frequent (mode): Fills in the most frequent observed value in the column
- constant: Fills in a fixed value given by the `fill_value` parameter (0 by default for numeric data)

<br>

If a column has multiple missing values, all of them receive the same fill value. The statistic is computed from the observed values only, so the other missing entries in the column are ignored.

<br>

To use this imputer, import `SimpleImputer` from `sklearn.impute`.

In [2]:
import numpy as np

#Sample data with missing values
X = [[1, 2], [np.nan, 3], [7, 6], [1, 4]]
Y = [[np.nan, 2, 5], [6, np.nan, 4], [np.nan, 7, 6], [8, 2, np.nan]]
Z = [[12], [2], [4], [-3], [6], [-5], [-100]]
A = [[1], [2], [3], [9], [2], [3], [1]]
B = [[1, "human", "gamer"], [2, "robot", "machine"], [0, "bot", "speaker"]]

In [3]:
#Filling in missing values with the mean strategy

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

imp.fit(X) #Learn the statistics from the data provided with variable X
imp.transform(X) #Apply the learned stats to fill in the missing values

array([[1., 2.],
       [3., 3.],
       [7., 6.],
       [1., 4.]])

In [4]:
#Filling in missing values with the median strategy

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='median')

imp.fit(X)
imp.transform(X)

array([[1., 2.],
       [1., 3.],
       [7., 6.],
       [1., 4.]])

In [5]:
#Filling in missing values with the most_frequent strategy

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

imp.fit(X)
imp.transform(X)

array([[1., 2.],
       [1., 3.],
       [7., 6.],
       [1., 4.]])

In [6]:
#Filling in missing values with the constant strategy

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='constant')

imp.fit(X)
imp.transform(X)

array([[1., 2.],
       [0., 3.],
       [7., 6.],
       [1., 4.]])

In [7]:
# Filling in missing values for a dataset with multiple nan values in a column

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(Y)
imp.transform(Y)

array([[7.        , 2.        , 5.        ],
       [6.        , 3.66666667, 4.        ],
       [7.        , 7.        , 6.        ],
       [8.        , 2.        , 5.        ]])
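
To see what `fit` actually learns, the fitted imputer can be inspected through its `statistics_` attribute, which holds one fill value per column. The sketch below reuses the `Y` data from above and applies the learned values to a hypothetical unseen row (`new_rows`) that was not part of the fitted data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

Y = [[np.nan, 2, 5], [6, np.nan, 4], [np.nan, 7, 6], [8, 2, np.nan]]

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(Y)

# One learned fill value per column, computed from the observed entries only
column_means = imp.statistics_
print(column_means)

# Hypothetical unseen row: every gap is filled with the value learned from Y
new_rows = [[np.nan, np.nan, np.nan]]
filled = imp.transform(new_rows)
print(filled)
```

Because the fill values come from the fitted data, the same imputer can be applied unchanged to testing data.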

#### KNNImputer

This imputer expects a 2D array. Each missing value is filled with the mean of that column's values in the `n` nearest neighbouring rows, where `n` is provided by the user through the `n_neighbors` parameter.

<br>

To find the nearest neighbours, it uses a Euclidean distance that tolerates missing values, calculated using the formula:
$$\text {dist} = \sqrt {\text {weight} \cdot \sum \text {(distance between the coordinates)}^2}$$

<br>

The weight is calculated using the formula:
$$\text {weight} = \frac{\text {total number of coordinates}}{\text {number of coordinates present in both rows}}$$

<br>

The distance between the coordinates is the difference between the value in the row that contains the missing value and the value in the row being compared, taken column by column. If either value in a column is NaN or missing, that column is skipped.

<br>

Consider the example: `X = [[1, 2, 3], [3, np.nan, 6], [np.nan, 4, 5]]`

<br>

The second row contains a NaN value. So we take the second row (say `a`), and we also choose a row to compare it with, say the first row (say `b`).

<br>

Now, `a = [3, np.nan, 6]` and `b = [1, 2, 3]`

<br>

The second column has a NaN value in `a`, so it is ignored. The sum of squared differences over the remaining columns becomes: $$(3 - 1)^2 + (6 - 3)^2 = 13$$

<br>

There are three values in a row, and only two columns have values in both rows. So the weight is: $$\text {weight} = \frac {3}{2}$$

<br>

Now the Euclidean distance becomes: $$\text {dist} = \sqrt {\frac {3}{2} \cdot 13} = \sqrt {\frac {39}{2}} \approx 4.42$$
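
<br>

The calculation above can be checked in code. The sketch below computes the distance by hand with `numpy` and compares it with `nan_euclidean_distances` from `sklearn.metrics.pairwise`. The variable names (`row_with_gap`, `other_row`, and so on) are chosen here for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# The two rows from the example: one with a missing value, one complete
row_with_gap = np.array([3.0, np.nan, 6.0])
other_row = np.array([1.0, 2.0, 3.0])

# Keep only the columns where both rows have a value
present = ~np.isnan(row_with_gap) & ~np.isnan(other_row)

# Sum of squared differences over the shared columns
squared_sum = np.sum((row_with_gap[present] - other_row[present]) ** 2)

# weight = total number of coordinates / number of shared coordinates
weight = row_with_gap.size / present.sum()

manual_dist = np.sqrt(weight * squared_sum)

# The same distance as computed by scikit-learn
library_dist = nan_euclidean_distances([row_with_gap], [other_row])[0, 0]

print(manual_dist, library_dist)
```

Both values should agree with $\sqrt{39/2}$.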

<br>

To use this imputer, import `KNNImputer` from `sklearn.impute`.

In [8]:
#Filling in the missing values using KNNImputer

from sklearn.impute import KNNImputer

imp = KNNImputer(n_neighbors=2)
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imp.fit(X)
imp.transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
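
The filled value in the first row can be reproduced by hand. The following is a simplified sketch, not the library's implementation: it assumes uniform neighbour weights and no ties, takes the pairwise distances from `nan_euclidean_distances`, keeps the two nearest rows that have a value in the target column, and averages them. The names `target_row`, `target_col`, `candidates` and `nearest` are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]], dtype=float)

# Pairwise distances between all rows, skipping missing coordinates
distances = nan_euclidean_distances(X)

# The missing entry to fill: first row, third column
target_row, target_col = 0, 2

# Only other rows with an observed value in the target column can be neighbours
candidates = [
    i for i in range(len(X))
    if i != target_row and not np.isnan(X[i, target_col])
]

# Pick the two candidates closest to the target row
nearest = sorted(candidates, key=lambda i: distances[target_row, i])[:2]

# The imputed value is the mean of the neighbours' values in that column
imputed_value = X[nearest, target_col].mean()

print(nearest, imputed_value)
```

The two nearest rows are the second and third, whose third-column values 3 and 5 average to 4, matching the output above.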

### Scaler

Scalers bring numerical features onto a common scale. When values are on different scales, the convergence of iterative optimization procedures becomes slower.

<br>

To use scalers, the `preprocessing` module from `sklearn` must be imported.

<br>

There are three main Scaler APIs available in sklearn:
- StandardScaler
- MinMaxScaler
- MaxAbsScaler

#### StandardScaler

Transforms the original feature `x` into `x'` using the formula: 
$$x \text {'} = \frac{x - \mu}{\sigma} \text {, where } \mu \text { = Mean; } \sigma \text { = Standard Deviation}$$

<br>

To use this scaler, import the `StandardScaler` class from `sklearn.preprocessing`.

In [9]:
#Using StandardScaler to unify the scale

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(Z)
ss.transform(Z)

array([[ 0.66107927],
       [ 0.38562957],
       [ 0.44071951],
       [ 0.24790472],
       [ 0.49580945],
       [ 0.19281479],
       [-2.42395731]])
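
The output above can be checked against the formula directly. The sketch below applies $(x - \mu) / \sigma$ with `numpy`, assuming the population standard deviation (`ddof=0`, the `numpy` default), and compares the result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

Z = np.array([[12], [2], [4], [-3], [6], [-5], [-100]], dtype=float)

# Column-wise mean and population standard deviation
mu = Z.mean(axis=0)
sigma = Z.std(axis=0)

# Apply the formula by hand
manual = (Z - mu) / sigma

# Same transformation through scikit-learn
scaled = StandardScaler().fit_transform(Z)

print(mu, sigma)
print(np.allclose(manual, scaled))
```

After scaling, the column has mean 0 and standard deviation 1. The outlier `-100` still dominates the result, which is why the other values end up close together.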

#### MinMaxScaler

Transforms the original feature `x` into `x'` using the formula: 
$$x \text {'} = \frac{x - x_\text {min}}{x_\text {max} - x_\text {min}}$$
Where $x_\text {min}$ and $x_\text {max}$ are the minimum and maximum values of the feature (column) seen during fitting. Each scaled training column then ranges from 0 to 1, with both ends attained as long as the column is not constant.

<br>

To use this scaler, import the `MinMaxScaler` class from `sklearn.preprocessing`.
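
A minimal sketch of `MinMaxScaler` follows, reusing the values of `Z` defined earlier and comparing the result with the formula applied by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

Z = np.array([[12], [2], [4], [-3], [6], [-5], [-100]], dtype=float)

# Learn the column minimum and maximum, then rescale to the range [0, 1]
mm = MinMaxScaler()
scaled = mm.fit_transform(Z)

# The same formula applied by hand, column by column
z_min = Z.min(axis=0)
z_max = Z.max(axis=0)
manual = (Z - z_min) / (z_max - z_min)

print(scaled)
print(np.allclose(manual, scaled))
```

The smallest value (`-100`) maps to 0 and the largest (`12`) maps to 1.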


## Seaborn

This Python library uses the `matplotlib` library underneath to plot graphs from the provided data. It will be used to visualise random distributions.

<br>

To use it, import the `seaborn` library along with `pyplot` from the `matplotlib` library.

#### Box Plot


#### Violin Plot

In [11]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Dataset with 3 categorical columns
X = np.array([
    ["Red", "Petrol", "Toyota"],
    ["Blue", "Diesel", "Honda"],
    ["Green", "Electric", "Tesla"],
    ["Red", "Diesel", "Toyota"]
])

# The `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2 and later removed
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(X)

print("Categories:", encoder.categories_)
print("Encoded:\n", encoded)