# Exercise: Feature Scaling

Definition and exercise content.

Notes and references:

https://www.baeldung.com/cs/normalization-vs-standardization

https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling

https://stats.stackexchange.com/questions/324369/feature-scaling-giving-reduced-output-linear-regression-using-gradient-descent


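Surprisingly, feature scaling doesn’t improve the regression performance in our case. In fact, following the same steps on well-known toy datasets doesn’t improve the model’s fit either. For ordinary least squares this is expected: rescaling a feature changes the fitted coefficients, but not the predictions.

However, this doesn’t mean feature scaling is unnecessary for linear regression. It still matters when the model is fit by gradient descent or uses regularization. Older scikit-learn releases even gave LinearRegression a boolean normalize parameter that normalized the input when set to True; that parameter was deprecated in version 1.0 and removed in 1.2, so scaling is now done explicitly, for example with StandardScaler.

Instead, this result reminds us that there’s no one-size-fits-all preprocessing method in machine learning. We need to examine the dataset carefully and choose methods that suit it.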
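To illustrate a case where scaling does matter, here is a minimal sketch of batch gradient descent on a one-feature linear model, with and without standardization. The data are synthetic and hypothetical (they only imitate shoe sizes and heights), and `gradient_descent`, the learning rates, and the step count are illustrative choices, not part of the exercise.

```python
import numpy as np

# Hypothetical synthetic data: EU shoe sizes and heights in cm (not the exercise dataset)
rng = np.random.default_rng(0)
x = np.linspace(35, 47, 100)
y = 100 + 1.8 * x + rng.normal(0, 5, size=x.size)


def gradient_descent(x, y, lr, steps=1000):
    """Fit y ~ b0 + b1 * x by batch gradient descent and return the final MSE."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)
    for _ in range(steps):
        grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of the mean squared error
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)


# Reference: closed-form least squares, which is unaffected by feature scale
X = np.column_stack([np.ones_like(x), x])
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_ols = np.mean((X @ w_ols - y) ** 2)

# Unscaled feature: the learning rate must be tiny to keep the updates stable
mse_unscaled = gradient_descent(x, y, lr=1e-4)

# Standardized feature: a much larger learning rate is stable
x_std = (x - x.mean()) / x.std()
mse_scaled = gradient_descent(x_std, y, lr=0.1)

print(f"Closed-form MSE:         {mse_ols:.3f}")
print(f"Gradient descent, raw:   {mse_unscaled:.3f}")
print(f"Gradient descent, scaled: {mse_scaled:.3f}")
```

With the same number of steps, the run on the standardized feature should reach the closed-form error, while the run on the raw feature is still noticeably worse because the intercept converges very slowly.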



from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only
min_max_scaler = MinMaxScaler().fit(X_train)
X_norm = min_max_scaler.transform(X)

As a rule of thumb, we fit a scaler on the training data, then transform the whole dataset with it. By doing this, we completely ignore the test dataset while building the model.
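A minimal sketch of keeping the test split out of the scaler's fit, assuming a small hypothetical feature matrix in place of the real data. The array `X`, the split size, and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: 20 samples, one feature
X = np.arange(20, dtype=float).reshape(-1, 1)

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Learn the minimum and maximum from the training split only
scaler = MinMaxScaler().fit(X_train)

# Apply the same learned transformation to both splits
X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)

print("train range:", X_train_norm.min(), X_train_norm.max())
print("test range: ", X_test_norm.min(), X_test_norm.max())
```

The training split always maps onto [0, 1]. Test values can fall slightly outside that range if they lie beyond the training minimum or maximum, which is the expected price of not looking at the test data.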

Normalizing the concrete dataset, we get:

## Preparing data
......



In [6]:
# Import everything we will need for this unit
import pandas as pd
import numpy as np
import operator
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as smf


# Load data from our dataset file into a pandas dataframe
# dataset = pd.read_csv('Data/investments.csv', index_col=False, sep=",",header=0)

# # Check what's in the dataset
# print(dataset.head())



# Load a file containing people's shoe sizes
# and height, both in cm
data = pd.read_csv('Data/shoe-size-height.csv')

# Convert EU shoe sizes to the USA shoe sizes
# that we sell in our store
data["shoe_size_usa"] = data.shoe_size_eu - 33

# Print the first few rows
data.head()

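| Unnamed: 0 | shoe_size_eu | height | sex | age_years | shoe_size_usa |
|---|---|---|---|---|---|
| 0 | 39 | 173 | male | 60 | 6 |
| 1 | 38 | 173 | male | 48 | 5 |
| 2 | 37 | 157 | female | 43 | 4 |
| 3 | 39 | 175 | male | 51 | 6 |
| 4 | 38 | 170 | male | 39 | 5 |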


In [7]:
model = smf.ols(formula = "height ~ shoe_size_usa", data = data).fit()

X = data["shoe_size_usa"].to_numpy()
y = data["height"].to_numpy()
y_hat = model.predict(data["shoe_size_usa"])

# Calculate metrics
rmse_1 = np.sqrt(mean_squared_error(y,y_hat))
r2_1 = r2_score(y,y_hat)

print(f"RMSE metrics (without scaling): {rmse_1}")
print(f"R2 metrics (without scaling): {r2_1}")


RMSE metrics: 5.8944593557470295
R2 metrics: 0.5807836498967944
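For readers who want to see what the two metrics measure, here is a small sketch that computes RMSE and R2 by hand with NumPy. The arrays `y_true` and `y_pred` are made-up values for illustration, not taken from the shoe-size dataset.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the units of the target
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: one minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}")
print(f"R2:   {r2:.4f}")
```

RMSE is expressed in the target's units (centimetres of height in this exercise), while R2 is unitless, with 1 meaning a perfect fit.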


Explain dataset, 

why features need scaling

train unscaled model

In [8]:
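# Standardize the feature (zero mean, unit variance), then refit the model

scaler = StandardScaler()
# fit_transform expects a 2D array; flatten the result back into a single column
data["shoe_size_usa"] = scaler.fit_transform(data["shoe_size_usa"].to_numpy().reshape(-1, 1)).ravel()
model2 = smf.ols(formula = "height ~ shoe_size_usa", data = data).fit()
y_hat2 = model2.predict(data["shoe_size_usa"])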

# Calculate metrics
rmse_2 = np.sqrt(mean_squared_error(y,y_hat2))
r2_2 = r2_score(y,y_hat2)

print(f"RMSE metrics (with feature scaling): {rmse_2}")
print(f"R2 metrics (with feature scaling): {r2_2}")


RMSE metrics: 5.894459355747031
R2 metrics: 0.5807836498967942
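The two runs agree to about 15 significant digits, which is what ordinary least squares predicts. A short derivation, using notation introduced here for illustration ($\mu$ and $\sigma$ for the mean and standard deviation of the feature $x$, and $b_0$, $b_1$ for the intercept and slope):

$$
z = \frac{x - \mu}{\sigma}
\quad\Longrightarrow\quad
x = \mu + \sigma z
$$

$$
\hat{y} = b_0 + b_1 x = (b_0 + b_1 \mu) + (b_1 \sigma)\, z
$$

Every line in $x$ is therefore also a line in $z$, with intercept $b_0 + b_1 \mu$ and slope $b_1 \sigma$, and vice versa. Least squares searches the same set of candidate lines in both cases, so it finds the same fitted values. Only the coefficients change, and the small difference in the last printed digits comes from floating-point rounding.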


Show how to scale features, train new  model

In [9]:
# Do a model comparison

# Use a dataframe to create a comparison table of metrics
comparison = [["Without scaling", rmse_1, r2_1],
              ["With feature scaling", rmse_2, r2_2]]

pd.DataFrame(comparison, columns=["", "RMSE", "R2"])

Conclusion



## Summary

.....
