# Week 2
### Data preprocessing for ML applications

During this class we will cover several basic steps of data treatment and preprocessing that prepare a dataset for use in Machine Learning models:

1. Choosing and retrieving the dataset
2. Importing the libraries
3. Importing the dataset
4. Finding and treating missing data
5. Encoding categorical data
6. Feature scaling
7. Feature selection and dimensionality reduction
8. Splitting the dataset: training, validation, and testing

### Choosing and retrieving the dataset

Since every ML model learns from data and its performance depends strongly on the training process, it is crucial to select appropriate data to feed into the model.

For any particular task there are two distinct approaches to data acquisition:
* Find an existing dataset with appropriate samples
* Create a new dataset from scratch (recordings, polls, data mining, etc.)
    
Important features of datasets:
* Size (How to determine a sufficient number of samples?)
* Cleanliness (How many missing values?)
* Homogeneity (Are the samples appropriate, and do they correspond to the task?)
* Number of features (Is it good to have too many features? Too few?)
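
The first checks on this list can be made with a few Pandas calls. The following is a minimal sketch on a hypothetical toy table (the column names and values are invented for illustration), not a complete data-quality audit:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, used only to illustrate the checks
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 26.55],
    "port": ["S", "C", "S", None],
})

n_samples, n_features = toy.shape           # size and number of features
missing_per_column = toy.isnull().sum()     # cleanliness: NaN count per column
missing_fraction = toy.isnull().mean()      # same, as a fraction of the rows

print(n_samples, n_features)
print(missing_per_column)
print(missing_fraction)
```

Whether the resulting numbers are acceptable depends on the task; the sketch only makes them visible.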
    
Data sources:
*  Kaggle
*  Google Dataset Search
*  Datahub.io
*  Subject-specific websites and services (e.g., CERN Open Data, NASA Earth Data, etc.)

### Importing the libraries

There are a handful of useful Python libraries that make it easy to process data quickly and efficiently. Some popular examples:
* NumPy
* Pandas
* Scikit-learn
* TensorFlow / Keras
* PyTorch
* JAX

In [191]:
import pandas as pd
import numpy as np
import sklearn

### Importing the dataset

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
print(data)

### Finding and treating missing data

Some samples within a dataset may have incomplete information: some or all of the features might be missing. Most models cannot be trained on data with such gaps. Thus, before proceeding to the learning, one must deal with missing data. There are several strategies that one can think of:
* Removing entries with missing features
* Removing features that are absent in too many samples
* Imputation, i.e. filling in the missing fields:
    * with the most common value
    * with the mean/median of all samples
    * with more complex strategies, e.g., Linear Regression, KNN
    * using some prior knowledge
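
The strategies above can be sketched on a small example. The table below is hypothetical, and the 50% threshold for dropping a feature is an assumption chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, 41.0],
    "port": ["S", None, "C", "S", "S"],
    "deck": [None, None, "C", None, None],
})

# 1. Remove entries (rows) with any missing feature
rows_dropped = toy.dropna(axis=0)

# 2. Remove features (columns) missing in more than half of the samples
too_sparse = toy.columns[toy.isnull().mean() > 0.5]
cols_dropped = toy.drop(columns=too_sparse)

# 3. Impute: most common value for a categorical feature,
#    median for a numerical one
imputed = cols_dropped.copy()
imputed["port"] = imputed["port"].fillna(imputed["port"].mode()[0])
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

print(rows_dropped)
print(cols_dropped.columns.tolist())
print(imputed)
```

Note how dropping rows keeps a single sample here, while dropping the sparse column and imputing the rest keeps all five.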

In [None]:
# Checking the dataset information
data.info()

In [194]:
# Dropping the columns that do not contain useful information 
data.drop(labels=['PassengerId', 'Ticket'], axis=1, inplace=True)

In [None]:
# The "embarked" feature

# Let's check entries with NaN

data[data['Embarked'].isnull()]

In [None]:
# Checking what value is the most frequent one
data['Embarked'].value_counts()

In [None]:
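# Filling NaNs with the most frequent value
# (assigning the result back avoids the chained in-place call, which newer pandas versions warn about)
data['Embarked'] = data['Embarked'].fillna('S')
print(data)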

In [None]:
# One of the ways to impute the numerical feature: filling out with the median
data['Age'].median()

In [None]:
from sklearn.impute import SimpleImputer

fea_transformer = SimpleImputer(strategy="median")
values = fea_transformer.fit_transform(data[["Age"]])
print(pd.DataFrame(values))

In [200]:
# KNN imputation (Later on in the code)

# from sklearn.impute import KNNImputer

# fea_transformer = KNNImputer(n_neighbors=3)
# data["Age"] = fea_transformer.fit_transform(data[["Age"]]).astype(float)
# print(data)

In [None]:
# info() prints its report itself and returns None, so no print() is needed
data.info()

### Encoding categorical data

Some features in a dataset might be represented as values from a discrete set of categories (e.g., 'Sex', 'Embarked').
To process such features, one has to encode them, i.e. convert the categorical set (usually strings) into a set of numerical values.
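
Two common encodings are integer codes and one-hot indicator columns. The sketch below contrasts them on a hypothetical toy column; it is an illustration under that assumption, not a recommendation for a particular model:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with three categories
toy = pd.DataFrame({"port": ["S", "C", "Q", "S"]})

# Integer codes: compact, but they imply an ordering C < Q < S
# that has no real meaning for this feature
codes = LabelEncoder().fit_transform(toy["port"])

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy, columns=["port"], dtype=int)

print(codes)
print(one_hot)
```

The Scikit-learn documentation describes `LabelEncoder` as intended for target values; for input features, `OrdinalEncoder` and `OneHotEncoder` play the corresponding roles.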

In [None]:
#What is the simplest numerical encoding?
pd.get_dummies(data, columns = ['Embarked']).head()

In [None]:
#Using LabelEncoder to transform categorical features
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['Embarked'] = le.fit_transform(data['Embarked'])
print(data)

In [None]:
data['Sex'] = le.fit_transform(data['Sex'])
print(data)

In [None]:
# Let's use only deck instead of the cabin because of many missing values
def cabin_replace(cabin):
    cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G']
    for substring in cabin_list:
        if substring in str(cabin):
            return substring
    return np.nan

data['Cabin'] = data['Cabin'].apply(cabin_replace)
print(data)

In [None]:
data['Cabin'] = le.fit_transform(data['Cabin'])
print(data)

In [None]:
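# Name itself is not useful: we can extract only titles
import re

# Titles are followed by a period in this dataset (e.g. "Braund, Mr. Owen Harris").
# Requiring the period keeps 'Mrs.' from being matched as 'Mr'
# and avoids accidental hits inside surnames (e.g. 'dr' in 'Andrews').
TITLE_REGEX = re.compile(
    r'\b(Mr|Don|Major|Capt|Jonkheer|Rev|Col|Dr|Mrs|Countess|Dona|Mme|Ms|Miss|Mlle|Master)\.',
    re.IGNORECASE)

def get_title(string):
    results = TITLE_REGEX.search(string)
    if results is not None:
        return results.group(1).lower()
    return np.nan

data['Name'] = data['Name'].apply(get_title)
data['Name'] = le.fit_transform(data['Name'])
print(data)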

In [None]:
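# Going back to filling out the missing data with KNNImputer.
# Neighbours are searched using all the numerically encoded features except the target:
# given the 'Age' column alone, KNNImputer has nothing to measure distances with
# and falls back to the column mean.
from sklearn.impute import KNNImputer

fea_transformer = KNNImputer(n_neighbors=3)
features = data.drop(columns='Survived')
imputed = fea_transformer.fit_transform(features)
data["Age"] = imputed[:, features.columns.get_loc("Age")]
print(data)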

### Feature scaling

* Standardization

$x_{standard} = \frac{x - x_{mean}}{\sigma_x}$

This operation shifts and rescales every feature so that it has mean $\mu=0$ and standard deviation $\sigma=1$. The shape of the distribution is preserved: a Gaussian feature, for example, becomes a standard normal one.

In [None]:
print((data-data.mean())/data.std())

* Normalization

$x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}$

This operation scales all the features to the same interval [0, 1] (or [-1, 1] in some cases).

In [None]:
print((data-data.min())/(data.max()-data.min()))

The Scikit-learn tools for normalization and standardization are MinMaxScaler and StandardScaler, respectively.
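
As a minimal sketch, both scalers can be applied to a hypothetical two-column table (the values are invented for illustration). One detail worth noticing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.std()` uses the sample one (`ddof=1`) by default, so the manual Pandas formula and the scaler differ slightly unless `ddof=0` is passed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy features on very different scales
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0],
                    "fare": [5.0, 10.0, 50.0, 500.0]})

standardized = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
normalized = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Manual standardization with the same convention as StandardScaler (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized)
print(normalized)
print(np.allclose(standardized, manual))
```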

Why would we want to use Standardization with this data?

Why would we want to use Normalization with this data?

In the following example we will proceed with the normalization:

In [None]:
data = (data-data.min())/(data.max()-data.min())

### Feature selection and dimensionality reduction

In [None]:
import matplotlib.pyplot as plt

a = data.corr()
print(a)
fig, ax = plt.subplots(figsize=(9, 9))
heatmap = ax.imshow(a, cmap='gray_r', interpolation='nearest')
ax.set_yticks(range(len(a.index.values)))
ax.set_yticklabels(a.index.values)
ax.set_xticks(range(len(a.index.values)))
ax.set_xticklabels(a.index.values)
plt.colorbar(heatmap)
plt.show()
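
One possible way to act on a correlation matrix like this is to drop one feature from every strongly correlated pair. The sketch below uses hypothetical synthetic features, and the 0.95 threshold is an assumption to be tuned for the task; it is not the selection applied to the Titanic data in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: "b" is almost a copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so that every pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```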

In [None]:
data['Family_size'] = data['SibSp'] + data['Parch']
data.drop(labels=['SibSp', 'Parch', 'Cabin'], axis=1, inplace=True)
print(data)

### Splitting the dataset: training, validation, and testing

In [183]:
# `import sklearn` alone does not guarantee that the submodule is loaded
from sklearn.model_selection import train_test_split

X = data.drop('Survived', axis=1)
y = data['Survived']

# Standard train/test split (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train/validation/test split (60% / 20% / 20%):
# 25% of the remaining 80% gives a validation set of 20% of the full data
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25)

In [None]:
print(X_train, y_train)
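
The proportions of a two-step split are easy to verify on synthetic data. This is a minimal sketch with a hypothetical 100-sample array; `random_state` and `stratify` are optional arguments added here for reproducibility and balanced classes, and are not part of the split used on the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, balanced binary target
X = np.arange(300).reshape(100, 3)
y = np.array([0, 1] * 50)

# First hold out 20% for testing,
# then take 25% of the remaining 80% for validation
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=0, stratify=y_train_val)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```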

### Matrix manipulation using Pandas
Pandas provides vectorized methods for matrix multiplication and element-wise arithmetic:

In [212]:
df = pd.DataFrame([[0, 1], [-2, -1], [1, 3], [1, 1]], columns=['col1', 'col2'])
print(df)

   col1  col2
0     0     1
1    -2    -1
2     1     3
3     1     1


In [213]:
s = pd.Series({'col1': 1, 'col2': 2})
print(df.dot(s))

0    2
1   -4
2    7
3    3
dtype: int64
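
A point worth illustrating about `DataFrame.dot` is that it matches the Series index to the DataFrame columns by label rather than by position. The sketch below reuses the same `df`; the reordered Series is introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1], [-2, -1], [1, 3], [1, 1]], columns=['col1', 'col2'])

# The Series index is matched to the DataFrame columns by label,
# so the order of the entries in the Series does not matter
s_reordered = pd.Series({'col2': 2, 'col1': 1})
by_label = df.dot(s_reordered)

# The same product on the raw arrays, where only positions matter
by_position = df.to_numpy() @ np.array([1, 2])

print(by_label.tolist())
print(by_position.tolist())
```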


In [214]:
# Different ways to perform simple operations
print(df+df)
print()
print(df.add(df))

   col1  col2
0     0     2
1    -4    -2
2     2     6
3     2     2

   col1  col2
0     0     2
1    -4    -2
2     2     6
3     2     2


In [216]:
# add, sub, mul, div, floordiv, mod, pow are equivalent to arithmetic operators: +, -, *, /, //, %, **

#Element-wise multiplication df * x

df.mul(df)
# df * df

   col1  col2
0     0     1
1     4     1
2     1     9
3     1     1


In [218]:
#Element-wise division: df / x

df.div(2)
# df / 2

   col1  col2
0   0.0   0.5
1  -1.0  -0.5
2   0.5   1.5
3   0.5   0.5


In [222]:
#Element-wise division: x / df

df.rdiv(2)
# 2 / df

   col1      col2
0   inf  2.000000
1  -1.0 -2.000000
2   2.0  0.666667
3   2.0  2.000000
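
Although the methods and the operators give the same results in the cases above, the methods accept extra arguments such as `fill_value` and `axis`. The following is a minimal sketch with hypothetical small frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1], [-2, -1]], columns=['col1', 'col2'])
other = pd.DataFrame([[10, np.nan], [20, 30]], columns=['col1', 'col2'])

# With the operator, a missing operand propagates as NaN
with_operator = df + other

# With the method, fill_value substitutes the missing operand before adding
with_method = df.add(other, fill_value=0)

# axis=0 matches a Series against the row index instead of the column labels
row_offsets = pd.Series([100, 200])
shifted = df.add(row_offsets, axis=0)

print(with_operator)
print(with_method)
print(shifted)
```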
