### Import Libraries

In [None]:
import pandas as pd

In [None]:
# get dataset
loans_df = pd.read_csv('../../data/loans_day2.csv', index_col=0)

In [None]:
#show first five lines
loans_df.head()

Note! If you have `purpose` as a column here, please delete it to make this notebook run completely.

### Preprocessing

Before we dive into the modelling we'll need to talk a bit about preprocessing first!

You might not expect it, but data preprocessing is one of the most important aspects of Data Science and therefore also one of the most time-consuming.

![img](https://docs.google.com/uc?export=download&id=1JQuyBRxSWh90xIuxGU12cIAVyMOoI4aO)

Let's look at this chart, which visualises how the workload of a typical data scientist is spread across tasks.

We can clearly see the actual modelling only takes up a small amount of time on a daily basis and that the steps before take up a lot more time.

We already talked about data-sourcing or collecting data-sets in the last lecture. So let's focus on the data cleaning and preprocessing now.

**Data cleaning** is relatively straightforward. Here we make sure our dataset does not include `missing values` and/or `duplicates`. Duplicates we will of course remove. We can do the same for missing values, but there are [other ways](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to handle these (which we will not discuss here).

Another part of our cleaning is to potentially remove noisy data such as `outliers`. This will again not be discussed here, but you can find more information [here](https://medium.com/analytics-vidhya/how-to-remove-outliers-for-machine-learning-24620c4657e8).

After we've cleaned our data we move on to **data preprocessing**. Here we transform the data into a form that a machine can work with. An example you will see a bit later is `scaling`, where we use statistical methods to put all our variables on the same scale, which generally improves the performance of our models. Another is handling categorical or text data.

#### Data cleaning

##### 1. Duplicates

In any preprocessing workflow you would tackle `duplicates` first. You can check `.duplicated().sum()` to see how many duplicate rows you have. Next you can remove them with `.drop_duplicates()`. Let's check the cell below. If it returns 0, we've already taken care of it for you. If not, good luck!

In [None]:
loans_df.duplicated().sum()
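
If the count above is not 0, the clean-up could look roughly like the sketch below. It uses a small made-up DataFrame (`toy_df` and its values are assumptions for illustration, not part of the loans data) to show what `.duplicated()` flags and what `.drop_duplicates()` keeps.

```python
import pandas as pd

# Hypothetical toy data: the third row is an exact copy of the second.
toy_df = pd.DataFrame({
    "loan_amnt": [1000, 2500, 2500, 4000],
    "grade": ["A", "B", "B", "C"],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged
n_duplicates = toy_df.duplicated().sum()
print(n_duplicates)

# drop_duplicates() keeps the first occurrence and returns a new DataFrame
deduped_df = toy_df.drop_duplicates()
print(len(deduped_df))
```

Note that `.drop_duplicates()` does not modify the DataFrame in place by default, so the result has to be assigned back.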

##### 2. Null values/Missing data

After the duplicates you would normally handle missing data or `null-values`. Again, we've already taken care of this for you. Check out the code cell below to see the fraction of `null-values` per column as proof!

In [None]:
loans_df.isnull().sum()/len(loans_df)
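
For a dataset that does contain gaps, the two usual options are dropping rows or imputing values. The sketch below is only an illustration on made-up data (`toy_df` and its numbers are assumptions); it uses the `SimpleImputer` linked earlier with the median strategy, which is one choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with a few gaps
toy_df = pd.DataFrame({
    "annual_inc": [40000.0, np.nan, 65000.0, 55000.0],
    "emp_length": [2.0, 5.0, np.nan, np.nan],
})

# Share of missing values per column, expressed as a percentage
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct)

# Option 1: drop every row that has at least one missing value
dropped_df = toy_df.dropna()

# Option 2: fill the gaps with the column median instead of losing rows
imputer = SimpleImputer(strategy="median")
imputed_df = pd.DataFrame(imputer.fit_transform(toy_df), columns=toy_df.columns)
print(imputed_df)
```

Dropping is simplest but can throw away a lot of rows, as it does here, while imputing keeps every row at the cost of inventing plausible values.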

#### Data preprocessing

#### 1. Encoding

You might have already noticed that we have some columns containing text. Most ML models of course can't handle raw text, but no worries! We'll handle these quickly by using encoding.

#### Ordinal encoding

![img](https://docs.google.com/uc?export=download&id=12BTL1wqY9qvolONkwDQCL7AqzFZcsi2H)

These are `ordinal variables`, where we can assign an inherent rank to each category. Hence for these variables **order matters**.
This means that it makes sense for us in this context to assign higher numbers to categories of higher rank. That way our ML algorithm can pick up on this hierarchy.

Examples of this in real life would be level of education, customer satisfaction or in our example `grade`.

`grade` has 7 unique values, all representing a specific grade that is related to the client's solvency.

In [None]:
loans_df.grade.nunique()

In [None]:
loans_df.grade.value_counts()

It makes sense that there is a structure here and that someone with grade A would have a better chance of paying back their loan than someone with an F grade.

Let's use sklearn's `OrdinalEncoder` class to add this hierarchy to our dataset.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

#specify list of lists in order
cat = [['G', 'F', 'E', 'D', 'C', 'B', 'A']]

#instantiate encoder
ord_enc = OrdinalEncoder(categories=cat)

#fit encoder on grade
ord_enc.fit(loans_df[['grade']])

#transform grade variable
loans_df['grade'] = ord_enc.transform(loans_df[['grade']])

loans_df.grade.value_counts()
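
To double-check which number each grade ends up with, you can inspect the fitted encoder. The sketch below repeats the same idea on a small made-up column (`toy_df` is an assumption for illustration): the position of a category in the list we pass in becomes its code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with a handful of grades
toy_df = pd.DataFrame({"grade": ["A", "G", "C", "A", "F"]})

# Categories listed from worst to best, so the best grade gets the highest number
grade_order = [["G", "F", "E", "D", "C", "B", "A"]]
encoder = OrdinalEncoder(categories=grade_order)
encoded = encoder.fit_transform(toy_df[["grade"]]).ravel()
print(encoded)

# categories_ holds the order that was used; the index of a category is its code
mapping = {category: float(code) for code, category in enumerate(encoder.categories_[0])}
print(mapping)
```

Because the order is given explicitly, grades that happen to be absent from the data (B, D and E here) still keep their slot in the ranking.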

#### One-hot encoding


![img](https://docs.google.com/uc?export=download&id=1Z_kta3r_2IeUeF80W_qkIr8maFbaJa6N)

These are `nominal variables`, where there's not really an inherent hierarchy as all possibilities are of equal importance. Hence for these variables **order does not matter**.
Here we have to apply a different technique than above because we want our ML algorithm to pick up on the fact that all these categories are of equal importance. This technique, which creates a binary column for each individual option, is referred to as `one-hot encoding`.
Examples of this in real life would be state, gender or in our example (although slightly debatable) `home_ownership`.

`home_ownership` has 6 unique values.

In [None]:
loans_df.home_ownership.value_counts()

Of our 6 values we're only really interested in the top 3. Use `.isin()` to filter the DF on multiple values at once.

In [None]:
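# .copy() gives us an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
loans_df = loans_df[loans_df['home_ownership'].isin(['MORTGAGE', 'RENT', 'OWN'])].copy()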

In [None]:
loans_df.home_ownership.unique()

Let's use the `OneHotEncoder` to turn these values into separate binary columns.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Instantiate encoder and ask for a dense array
# (`sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False)

# Fit encoder
ohe.fit(loans_df[['home_ownership']])

# Encode ownership
home_encoded = ohe.transform(loans_df[['home_ownership']])

# Transpose encoded ownership back into the dataframe -> check ohe.categories_ for the column order (sorted alphabetically: MORTGAGE, OWN, RENT)
loans_df["mortgage"], loans_df["own"], loans_df["rent"] = home_encoded.T

# Drop the original column
loans_df.drop(columns='home_ownership', inplace=True)

loans_df.head()

Now that we've gotten used to the `OneHotEncoder`, let's take a look at the column **term** as well. Here we have only two options, so we can transform it into a single binary column with one value being 1 and the other being 0.

We can use the parameter `drop='if_binary'` to get our desired result.

In [None]:
loans_df.term.value_counts()

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Instantiate encoder with the extra param
# (`sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False, drop='if_binary')

# Fit encoder
ohe.fit(loans_df[['term']])

# Encode term
term_encoded = ohe.transform(loans_df[['term']])

# Write the encoded term back into the dataframe by overwriting the original column
# (.ravel() flattens the single-column 2D array into 1D)
loans_df["term"] = term_encoded.ravel()


loans_df.head()
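
With `drop='if_binary'` it is easy to lose track of which value became 1. The sketch below, on made-up data (the `"36 months"` and `"60 months"` strings are assumptions about how the values look), shows that the encoder sorts the categories, drops the first one and keeps a column for the second.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with exactly two distinct values
toy_df = pd.DataFrame({"term": ["36 months", "60 months", "36 months"]})

ohe = OneHotEncoder(drop="if_binary")
encoded = ohe.fit_transform(toy_df[["term"]]).toarray().ravel()
print(encoded)

# categories_ is sorted; drop_idx_ is the position of the dropped category
dropped_category = ohe.categories_[0][ohe.drop_idx_[0]]
kept_category = [c for c in ohe.categories_[0] if c != dropped_category][0]
print(f"1 means: {kept_category}, 0 means: {dropped_category}")
```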

**loan_status** will be our target for our classification task. We can transform this one using the `OneHotEncoder` as before, or we can use the `LabelEncoder`, which is specifically designed for encoding target labels.

In [None]:
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()

label.fit(loans_df['loan_status'])

loans_df['loan_status'] = label.transform(loans_df['loan_status'])

loans_df.head()

In [None]:
loans_df['loan_status'].value_counts()
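
Once the labels are numbers, it helps to know how to get back to the original text. The sketch below uses made-up status values (`"Fully Paid"` and `"Charged Off"` are assumptions for illustration) to show `classes_` and `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy target
statuses = pd.Series(["Fully Paid", "Charged Off", "Fully Paid"])

label = LabelEncoder()
codes = label.fit_transform(statuses)
print(codes)

# classes_ is sorted alphabetically; the index of a class is its code
print(label.classes_)

# inverse_transform maps the codes back to the original labels
restored = label.inverse_transform(codes)
print(restored)
```

The codes follow alphabetical order, not any notion of "good" or "bad", so it is worth checking `classes_` before interpreting a 0 or a 1.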

Let's save our preprocessed dataset before we move on to scaling.

In [None]:
loans_df.to_csv('../../data/loans_day3.csv')

#### 2. Feature scaling

![img](https://docs.google.com/uc?export=download&id=1AfP4BzVxlQ2Kr16YZgPwUuHCCJBOSSU0)

Scaling is another important preprocessing step. Putting variables on the same scale lets our model treat every column equally instead of giving more weight to columns with larger values (like `annual_inc` vs `emp_length`, for example).
Scale your features whenever a model relies on distances or on gradient-based optimisation. Tree-based models are largely insensitive to feature scale, but for models like the one we use below it can greatly improve results!
There are quite a few scalers out there, but let's use the `StandardScaler` to keep it simple. It subtracts each feature's mean and divides by its standard deviation, so that every feature ends up with a mean of 0 and a standard deviation of 1.
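
To make the formula concrete, here is a small sketch that standardises a made-up column by hand and compares it with `StandardScaler` (the numbers are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with four observations
values = np.array([[5.0], [10.0], [15.0], [20.0]])

# Manual standardisation: z = (x - mean) / std
mean = values.mean(axis=0)
std = values.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
manual = (values - mean) / std

# The same thing with scikit-learn
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```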

Let's check first the distribution of `int_rate` and `installment`

In [None]:
import matplotlib.pyplot as plt

plt.hist(loans_df['int_rate']);

In [None]:
plt.hist(loans_df['installment']);

You can clearly see that these variables have a similar distribution but are on a different scale. 

Scaling these variables will have multiple effects for our model:
- it makes sure that `installment` does not outweigh `int_rate` purely because of its scale
- features on similar, small scales help gradient-based optimisers converge faster
- for linear models, it makes the feature coefficients easier to compare

In [None]:
# define X and y: loan_amnt is our regression target, and we don't want to scale the target
from sklearn.model_selection import train_test_split

X = loans_df.drop(columns='loan_amnt')

y = loans_df.loan_amnt

#train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit scaler to data
scaler.fit(X_train)

# Use scaler to transform data
X_train_scaled = scaler.transform(X_train) 

# create df to show output
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns = scaler.get_feature_names_out())

X_train_scaled_df.head()

Now let's check `int_rate` and `installment` again. The shape of each distribution stays the same, but the scale has changed: both are now centred around 0.

In [None]:
plt.hist(X_train_scaled_df['int_rate']);

In [None]:
plt.hist(X_train_scaled_df['installment']);

### Check performance

Now let's check the difference in performance. For this we'll use a new model!

K-Nearest Neighbors (KNN) is a non-linear, distance-based model capable of solving both regression and classification tasks.

*   Looks at K closest samples to make a prediction
*   Up to us to determine K (hyperparameter)
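
The two points above can be sketched in a few lines of NumPy. This is a simplified illustration for regression only (the function name `knn_predict` and the toy data are assumptions, and scikit-learn's implementation is far more optimised): compute the distance to every training sample, pick the K closest, and average their targets.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the target of x_new as the mean target of its k nearest neighbours."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Hypothetical toy data: one feature, four training samples
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

# The two closest samples to 2.5 are 2.0 and 3.0, so we average their targets
prediction = knn_predict(X_train, y_train, np.array([2.5]), k=2)
print(prediction)
```

Because everything hinges on distances, a feature measured in large units dominates the neighbour search, which is why scaling matters so much for this model.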



![img](https://docs.google.com/uc?export=download&id=1zPhJFC1Y4rUcHIRbhyAWRs3I8924Pp-_)

![img](https://docs.google.com/uc?export=download&id=1tJgdoNwaZ8IzhyUNUQtyUSry5L3SvUFr)

![img](https://docs.google.com/uc?export=download&id=1XdHcWnahZQKEaa9w7SUGVvYPG3wjUGkd)

![img](https://docs.google.com/uc?export=download&id=1MOWK38ksmco27lv95UEEV8Ynk23WUIKC)

![img](https://docs.google.com/uc?export=download&id=1-fkfQ4OHTX2fjfdWxZz_Z7i6V9fjp13b)

![img](https://docs.google.com/uc?export=download&id=1wEKxO0BZ_-ZQpb2V7qfUsFCg_AiFCUyd)

Let's first run a regression with our unscaled variables:

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()

knn.fit(X_train, y_train)

knn.score(X_test, y_test)

Now let's run one with our scaled variables to see the difference in performance.

In [None]:
knn.fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)

knn.score(X_test_scaled, y_test)

The score (for a regressor, `.score()` returns R²) was already quite high for our initial model, but you should see a clear increase after scaling. Many models, KNN included, rely on some kind of distance between samples, so scaling helps to improve the performance of those models.
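
One way to keep the scaler and the model together is a scikit-learn pipeline, which fits the scaler on the training data only and reuses it at prediction time. The sketch below uses synthetic data rather than the loans dataset (the features, noise levels and seed are assumptions chosen so the effect is visible): one informative feature on a small scale and one pure-noise feature on a large scale.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: `small` drives the target, `large` is noise on a much bigger scale
rng = np.random.default_rng(42)
n_samples = 500
small = rng.normal(0, 1, n_samples)
large = rng.normal(0, 10_000, n_samples)
X = np.column_stack([small, large])
y = 3 * small + rng.normal(0, 0.1, n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without scaling, distances are dominated by the large noise feature
unscaled_score = KNeighborsRegressor().fit(X_train, y_train).score(X_test, y_test)

# The pipeline scales first, then fits KNN; the scaler only ever sees the training data
scaled_model = make_pipeline(StandardScaler(), KNeighborsRegressor())
scaled_score = scaled_model.fit(X_train, y_train).score(X_test, y_test)

print(f"R² without scaling: {unscaled_score:.3f}")
print(f"R² with scaling:    {scaled_score:.3f}")
```

The pipeline also removes the need to remember `scaler.transform(X_test)` before scoring, since calling `.score()` or `.predict()` applies the fitted scaler automatically.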