## Feature Engineering

Feature engineering is crucial in machine learning. It involves selecting, transforming, and creating features from raw data to improve model performance and interpretability. Effective feature engineering can significantly improve a model's predictive power and its ability to generalize.

Yesterday, with the KNN regression approach, we ended up with a pretty poor model. Let's apply some feature engineering techniques and see whether they improve it.

In [3]:
# Example showing the importance of data scaling
import numpy as np
import pandas as pd

my_dict = {'age': [20, 22, 60],
           'height': [1.81, 1.83, 1.98],
           'weight': [75, 77, 86],
           'salary': [32000, 28000, 31000]}
df = pd.DataFrame.from_dict(my_dict, orient="columns")
df.head()

|   | age | height | weight | salary |
|---|-----|--------|--------|--------|
| 0 | 20 | 1.81 | 75 | 32000 |
| 1 | 22 | 1.83 | 77 | 28000 |
| 2 | 60 | 1.98 | 86 | 31000 |


In [4]:
# Euclidean distance from each row to individual 0, using the raw (unscaled) values
def distance_individual_zero(row):
    distance = (row["age"]-20)**2 + (row["height"]-1.81)**2 + (row["weight"]-75)**2 + (row["salary"]-32000)**2
    return np.sqrt(distance)

df.apply(distance_individual_zero, axis=1)

0       0.000000
1    4000.001000
2    1000.860145
dtype: float64
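To see why the raw distances are dominated by salary, it can help to break the squared distance into per-feature contributions. The following is a minimal sketch that rebuilds the same toy frame; the names `reference`, `squared_diff` and `share` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same toy data as in the scaling example
df = pd.DataFrame({
    "age": [20, 22, 60],
    "height": [1.81, 1.83, 1.98],
    "weight": [75, 77, 86],
    "salary": [32000, 28000, 31000],
})

reference = df.iloc[0]                   # individual 0
squared_diff = (df - reference) ** 2     # per-feature squared differences
distance = np.sqrt(squared_diff.sum(axis=1))

# Share of each feature in the squared distance for individuals 1 and 2
others = squared_diff.iloc[1:]
share = others.div(others.sum(axis=1), axis=0)

print(distance)
print(share.round(4))
```

On this data, salary accounts for more than 99% of the squared distance for both individuals, so age, height and weight have almost no influence on which individual counts as "closest".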

In [5]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns = df.columns)
df_scaled

|   | age | height | weight | salary |
|---|-----|--------|--------|--------|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.05 | 0.117647 | 0.181818 | 0.0 |
| 2 | 1.0 | 1.0 | 1.0 | 0.75 |


In [6]:
# Same distance, now measured on the min-max scaled values of individual 0
def distance_individual_zero(row):
    distance = (row["age"]-0.00)**2 + (row["height"]-0.00)**2 + (row["weight"]-0.00)**2 + (row["salary"]-1)**2
    return np.sqrt(distance)

df_scaled.apply(distance_individual_zero, axis=1)

0    0.000000
1    1.024402
2    1.750000
dtype: float64

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns = df.columns)
df_scaled

|   | age | height | weight | salary |
|---|-----|--------|--------|--------|
| 0 | -0.76075 | -0.834812 | -0.905753 | 0.980581 |
| 1 | -0.652071 | -0.571187 | -0.487713 | -1.372813 |
| 2 | 1.412821 | 1.405999 | 1.393466 | 0.392232 |


In [8]:
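# Same distance on the standardized values. The reference values are hard-coded
# to six decimals, so individual 0's distance to itself is ~4e-07 rather than exactly 0.
def distance_individual_zero(row):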
    distance = (row["age"]-(-0.760750))**2 + (row["height"]-(-0.834812))**2 + (row["weight"]-(-0.905753))**2 + (row["salary"]-0.980581)**2
    return np.sqrt(distance)

df_scaled.apply(distance_individual_zero, axis=1)

0    4.335532e-07
1    2.407183e+00
2    3.921506e+00
dtype: float64

In [10]:
# The distance for individual 0 above, 4.335532e-07, written in decimal notation: effectively zero
0.0000004335532

4.335532e-07

#### Loading and preparing the data

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [3]:
#your code here

#### Checking for anomalies

In [4]:
#your code here

#### Train Test Split

In [5]:
#your code here

#### Normalization

During normalization or standardization, it's essential to fit the scaler on the training data only. The test data must not be seen at this stage, otherwise information about it leaks into the preprocessing (data leakage).
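As a rough illustration of the fit-on-train-only rule, the sketch below uses a small synthetic feature matrix. That is an assumption made so the snippet runs offline; it is not the California housing data, and the variable names are only suggestions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: two features on very different scales
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 10_000, 200)])
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()
scaler.fit(X_train)                        # learn min/max from the training split only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training min/max, never refit on test

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

The scaled test split may fall slightly outside [0, 1]. That is expected, since its minimum and maximum were not seen during fitting.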

Create an instance of the normalizer

In [6]:
#your code here

Fit it to our training data

In [7]:
#your code here

Transforming our training and testing data

In [8]:
#your code here

When we transform a DataFrame, the normalizer by default returns a NumPy array instead of a DataFrame object
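One possible way to get a DataFrame back is to wrap the array again, reusing the original column names and index. This is a sketch on a hypothetical three-row frame (`feat_a` and `feat_b` are made-up column names), not the notebook's actual data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training frame with a shuffled index, as left by a train/test split
X_train = pd.DataFrame(
    {"feat_a": [1.0, 2.0, 3.0], "feat_b": [100.0, 300.0, 200.0]},
    index=[7, 2, 5],
)

scaler = MinMaxScaler().fit(X_train)
X_train_array = scaler.transform(X_train)   # plain NumPy array by default
print(type(X_train_array))

# Rebuild the DataFrame, reusing the original column names and index
X_train_scaled = pd.DataFrame(
    X_train_array, columns=X_train.columns, index=X_train.index
)
print(X_train_scaled)
```

Passing `index=X_train.index` keeps the rows aligned with the target after the split. Without it, the new frame would get a fresh 0..n-1 index.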

In [9]:
#your code here

In [10]:
#your code here

In [11]:
#your code here

#### KNN Regressor - modeling

Let's create an instance of KNN with the same hyperparameter as before, n_neighbors = 10.

In [12]:
#your code here

Training KNN on our normalized data

In [13]:
#your code here

Evaluating the model's performance

In [14]:
#your code here

With the raw data we obtained an R2 of 0.16. Just by normalizing the data, the model's performance increases substantially, to an R2 of 0.70.

This happens because KNN is a distance-based algorithm, so it suffers badly when features are on completely different scales.
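The effect can be reproduced on synthetic data. The sketch below is an assumption-laden toy setup, not the California housing experiment: one small-scale feature drives the target, while a large-scale feature is pure noise. It keeps `n_neighbors=10` as in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000
informative = rng.uniform(0, 1, n)          # small scale, drives the target
noise_feature = rng.uniform(0, 10_000, n)   # large scale, unrelated to the target
X = np.column_stack([informative, noise_feature])
y = 10 * informative + rng.normal(0, 0.1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# KNN on raw features: distances are dominated by the large-scale column
knn_raw = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
r2_raw = knn_raw.score(X_test, y_test)

# KNN on standardized features, with the scaler fitted on the training split only
scaler = StandardScaler().fit(X_train)
knn_scaled = KNeighborsRegressor(n_neighbors=10).fit(scaler.transform(X_train), y_train)
r2_scaled = knn_scaled.score(scaler.transform(X_test), y_test)

print(f"R2 raw: {r2_raw:.2f}, R2 scaled: {r2_scaled:.2f}")
```

The exact scores depend on the synthetic data and are unrelated to the 0.16 and 0.70 reported above. The point is the gap between the two models.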

## Feature Selection

Even though normalizing our data had a huge impact on KNN performance, we are currently using every single feature of the dataset.

Now let's select features based on their correlations with each other and with the target.

We want low correlation between features, but high correlation between each feature and the target.
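A correlation matrix gives both pieces of information at once. The sketch below builds a small synthetic frame (the columns `feat_a`, `feat_b`, `feat_c` and `target` are hypothetical) and uses `DataFrame.corr`, which computes Pearson correlation by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
feat_a = rng.normal(size=n)
feat_b = feat_a + rng.normal(scale=0.1, size=n)   # nearly a copy of feat_a
feat_c = rng.normal(size=n)                       # unrelated to the target
target = 2 * feat_a + rng.normal(scale=0.5, size=n)

df = pd.DataFrame(
    {"feat_a": feat_a, "feat_b": feat_b, "feat_c": feat_c, "target": target}
)

corr = df.corr()   # pairwise Pearson correlations, including the target column
print(corr.round(2))

# Absolute correlation of each feature with the target, strongest first
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)
print(target_corr)
```

Here `feat_a` and `feat_b` are redundant with each other, and `feat_c` carries almost no information about the target, which are the two situations the selection rule is meant to catch.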

In [15]:
#your code here

By the correlation matrix we can see that:
- "AveRooms" is highly correlated with "AveBedrms", so we drop whichever of the two is less correlated with our target
- "AveOccup" and "Population" have very low correlation with our target variable, so let's remove them from our selected features
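The two rules above can also be written as code. This is only a sketch on synthetic data with hypothetical column names, and both thresholds (`PAIR_THRESHOLD`, `TARGET_THRESHOLD`) are assumed cut-offs that the notebook does not specify.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
feat_a = rng.normal(size=n)
df = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_a + rng.normal(scale=0.1, size=n),  # redundant with feat_a
    "feat_c": rng.normal(size=n),                      # unrelated to the target
    "feat_d": rng.normal(size=n),                      # moderately related to the target
})
df["target"] = 2 * df["feat_a"] + df["feat_d"] + rng.normal(scale=0.5, size=n)

corr = df.corr().abs()
features = [c for c in df.columns if c != "target"]
to_drop = set()

# Rule 1: for each highly correlated feature pair,
# drop the one that is less correlated with the target
PAIR_THRESHOLD = 0.8      # assumed cut-off
for i, first in enumerate(features):
    for second in features[i + 1:]:
        if corr.loc[first, second] > PAIR_THRESHOLD:
            if corr.loc[first, "target"] < corr.loc[second, "target"]:
                to_drop.add(first)
            else:
                to_drop.add(second)

# Rule 2: drop features with very low correlation with the target
TARGET_THRESHOLD = 0.2    # assumed cut-off
to_drop.update(f for f in features if corr.loc[f, "target"] < TARGET_THRESHOLD)

selected = [f for f in features if f not in to_drop]
print("dropped:", sorted(to_drop))
print("selected:", selected)
```

In practice the thresholds are a judgment call, and reading the correlation matrix by eye, as done here in the notebook, is a perfectly reasonable alternative for a dataset with only a handful of features.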

In [16]:
#your code here

In [17]:
#your code here

By normalizing our data and selecting a subset of the available features, we were able to massively improve our model, increasing the R2 score from 0.16 to 0.70.

Notice that we still haven't tuned our hyperparameter, so there is still room to improve the model further.