## Feature Engineering

Very often we have to deal with datasets that aren't ready to be analyzed and modelled. We then need to apply feature engineering techniques to the raw data to obtain a dataset suitable for advanced analysis. We've already talked about techniques to deal with *missing values* and *outliers* and how to perform *discretization*. Now, we'll learn about other feature engineering techniques, such as:
- Normalization
- Standardization
- Transformation from categorical to dummy variables

## Feature Scaling

**Feature scaling** refers to the methods or techniques used to normalize the range of independent variables in our data, or in other words, the methods to set the feature value range within a similar scale. Feature scaling is generally the last step in the data preprocessing pipeline, performed **just before training the machine learning algorithms**.

**Why is it important?**

- The regression coefficients of linear models are directly influenced by the scale of the variable.
- Variables with a bigger magnitude or larger value range dominate over those with a smaller magnitude or value range.
- Gradient descent converges faster when features are on similar scales.
- Feature scaling helps decrease the time to find support vectors for SVMs.
- Euclidean distances are sensitive to feature magnitude.
- Some algorithms, like PCA, require the features to be centered at 0.
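
To make the point about Euclidean distances concrete, here is a minimal sketch with made-up numbers. The two features (an income-like column and an age-like column) and the helper `euclidean` are assumptions for illustration only; they are not part of the dataset used later.

```python
import numpy as np

# Hypothetical data: column 0 is on a scale of tens of thousands,
# column 1 is on a scale of tens.
X = np.array([
    [50_000.0, 25.0],
    [51_000.0, 60.0],
    [58_000.0, 26.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On the raw data the large-scale column dominates: a 35-unit gap in
# column 1 adds less than 1 unit to the distance between rows 0 and 1.
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# After standardizing each column, both features contribute.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = euclidean(X_std[0], X_std[1])
d_std_02 = euclidean(X_std[0], X_std[2])

print(d_raw_01, d_raw_02)
print(d_std_01, d_std_02)
```

On the raw data, row 1 looks far closer to row 0 than row 2 does, purely because of the first column. After standardization the two distances are of similar size, because the large gap in the second column now counts too.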

**Techniques:**

There are different techniques, such as:
- Standardization
- Mean normalization
- Scaling to minimum and maximum values - MinMaxScaling

## Standardization

Standardization involves centering the variable at zero and scaling the variance to 1. The procedure involves subtracting the mean from each observation and then dividing by the standard deviation:

**z = (x - x_mean) /  std**

The result of the above transformation is **z**, which is called the z-score and represents how many standard deviations a given observation deviates from the mean. A z-score specifies the location of the observation within a distribution (in numbers of standard deviations with respect to the mean of the distribution). The sign of the z-score indicates whether the observation is above (+) or below (-) the mean.
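
As a small worked example of the formula, the sketch below computes z-scores by hand with NumPy. The values in `x` are made up for illustration.

```python
import numpy as np

# Made-up observations with mean 5 and (population) standard deviation 2
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

x_mean = x.mean()
std = x.std()  # NumPy default is ddof=0 (population standard deviation)

z = (x - x_mean) / std
print(z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

The observation 9 has a z-score of 2, meaning it lies two standard deviations above the mean. Note that NumPy's `std` divides by n by default, while the Pandas `std` method divides by n - 1, so the two can give slightly different z-scores on small samples.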

The shape of a standardized (or z-score normalized) distribution will be identical to the original distribution of the variable. In other words, **standardizing a variable does not normalize the distribution of the data**.

In a nutshell, standardization:

- centers the mean at 0
- scales the variance to 1
- preserves the shape of the original distribution
- leaves minimum and maximum values that may vary across variables
- preserves outliers
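
The claim that the shape is preserved can be checked numerically. The sketch below uses synthetic right-skewed data and a hand-written `skewness` helper (both are assumptions for illustration) and compares the skewness before and after standardization.

```python
import numpy as np

# Synthetic right-skewed sample (illustrative only)
rng = np.random.default_rng(0)
x = rng.exponential(scale=3.0, size=1_000)

def skewness(v):
    """Sample skewness: third central moment over variance to the power 1.5."""
    v = np.asarray(v, dtype=float)
    centered = v - v.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

z = (x - x.mean()) / x.std()

print(skewness(x), skewness(z))   # same value: the shape is unchanged
print(np.argmax(x) == np.argmax(z))  # the largest value is still the largest
```

Standardization shifts and rescales the data, so a skewed variable stays skewed and extreme observations stay extreme.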

Let's use the Boston House Prices dataset to learn how to do it. This dataset used to ship with Scikit-learn (the `load_boston` loader was removed in version 1.2), so here we load it from a CSV file. It describes houses and their neighborhoods and can be used to predict the house price.

In [0]:
pip install -U scikit-learn

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split

# the scaler - for standardization
from sklearn.preprocessing import StandardScaler, RobustScaler

In [0]:
data = pd.read_csv('/dbfs/FileStore/CDS2024/boston_house_prices.csv', header=1)
data.head()

In [0]:
data.describe()

Note how the variables show different magnitudes or scales. In other words, how **the mean values are not centered at zero, and the standard deviations are not scaled to 1**.

When standardizing the data set, we need to first identify the **mean and standard deviation** of the variables. These parameters need to be learned from the train set, stored, and then used to scale test and future data. So let's split the dataset as we've done before.

In [0]:
# let's separate the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(data.drop('MEDV', axis=1),
                                                    data['MEDV'],
                                                    test_size=0.3,
                                                    random_state=0)

X_train.shape, X_test.shape

### Standardization with Scikit-learn

The StandardScaler from Scikit-learn removes the mean and scales the data to unit variance. It also learns and stores the parameters needed for scaling, which makes it the top choice for this feature scaling technique.

On the downside, you can't directly select which variables to scale: it scales the entire dataset, and it returns a NumPy array, without the variable names.
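
One possible way around both limitations is to fit the scaler on a subset of columns and write the result back into a copy of the dataframe. The sketch below uses a small made-up dataframe; the column names and the `cols_to_scale` list are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data: two numeric features to scale and one binary flag to leave alone
df = pd.DataFrame(
    {
        "rooms": [4.0, 6.0, 8.0],
        "age": [10.0, 50.0, 90.0],
        "near_river": [0, 1, 0],
    },
    index=[10, 11, 12],
)

cols_to_scale = ["rooms", "age"]

# Fit only on the selected columns
scaler = StandardScaler().fit(df[cols_to_scale])

# Write the scaled values back into a copy, keeping column names and index
df_scaled = df.copy()
df_scaled[cols_to_scale] = scaler.transform(df[cols_to_scale])

print(df_scaled)
```

With a train and test split, the same idea applies: fit on the selected columns of the train set only, then transform those columns in both sets. Scikit-learn's `ColumnTransformer` is another option for applying a transformer to selected columns.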

In [0]:
# set up the scaler
scaler = StandardScaler()

# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)

# transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# the scaler stores the mean and standard deviation of the features, learned from train set
print(scaler.mean_)
print(scaler.scale_)

In [0]:
# let's transform the returned NumPy arrays to dataframes
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

X_train_scaled.head()

In [0]:
# let's have a look at the original training dataset: mean and standard deviation
# np.round to reduce the number of decimal places to 1
np.round(X_train.describe(), 1)

In [0]:
# let's have a look at the scaled training dataset: mean and standard deviation
# np.round to reduce the number of decimal places to 1
np.round(X_train_scaled.describe(), 1)

As expected, the mean of each variable, which was not centered at zero before, is now around zero, and the standard deviation is set to 1. Note, however, that the minimum and maximum values vary according to how spread out the variable was to begin with and are highly influenced by the presence of outliers.

Let's compare the variable distributions before and after scaling:

In [0]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Before Scaling')
sns.kdeplot(X_train['RM'], ax=ax1)
sns.kdeplot(X_train['LSTAT'], ax=ax1)
sns.kdeplot(X_train['CRIM'], ax=ax1)

# after scaling
ax2.set_title('After Standard Scaling')
sns.kdeplot(X_train_scaled['RM'], ax=ax2, label = 'RM')
sns.kdeplot(X_train_scaled['LSTAT'], ax=ax2, label = 'LSTAT')
sns.kdeplot(X_train_scaled['CRIM'], ax=ax2, label = 'CRIM')
plt.legend()
plt.show()

Note from the above plots how standardization centered all the distributions at zero while preserving their original shapes. The value range is not identical, but it looks more homogeneous across the variables.

Note something interesting in the following plot:

In [0]:
# let's compare the variable distributions before and after scaling

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Before Scaling')
sns.kdeplot(X_train['AGE'], ax=ax1)
sns.kdeplot(X_train['DIS'], ax=ax1)
sns.kdeplot(X_train['NOX'], ax=ax1)

# after scaling
ax2.set_title('After Standard Scaling')
sns.kdeplot(X_train_scaled['AGE'], label = 'Age', ax=ax2)
sns.kdeplot(X_train_scaled['DIS'], label = 'Dis', ax=ax2)
sns.kdeplot(X_train_scaled['NOX'], label = 'Nox', ax=ax2)
plt.legend()
plt.show()

In the above plot, we can see how scaling brought NOX, which varied across a very narrow range of values [0-1], and AGE, which varied across [0-100], onto a more homogeneous range of values, so that we can now compare them directly in one plot, which was difficult before. In a linear model fitted on the raw data, AGE could dominate the output simply because of its larger magnitude; after standardization, both variables can contribute (assuming that they are both predictive).

In [0]:
plt.scatter(X_train['AGE'], X_train['NOX'])
plt.xlabel("Age")
plt.ylabel("Nox")
plt.show()

In [0]:
plt.scatter(X_train_scaled['AGE'], X_train_scaled['NOX'])
plt.xlabel("Age")
plt.ylabel("Nox")
plt.show()

## Mean Normalization
Mean normalization involves centering the variable at zero and rescaling it to the value range. The procedure involves subtracting the mean from each observation and then dividing by the difference between the maximum and minimum values:

**x_scaled = (x - x_mean) / ( x_max - x_min)**


The result of the above transformation is a distribution that is centered at 0, with its minimum and maximum values within the range of -1 to 1. Because mean normalization is a linear transformation, the shape of the distribution is the same as that of the original variable; only its location and spread (and therefore its variance) change.

Again, this technique will not **normalize the distribution of the data**.

In a nutshell, mean normalization:

- centers the mean at 0
- changes the variance, which will differ across variables
- preserves the shape of the original distribution
- squeezes the minimum and maximum values between -1 and 1
- preserves outliers

Note: There is no Scikit-learn transformer for mean normalization, but we can implement it using a combination of 2 other transformers.
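
One combination that appears to work is sketched below, on a small made-up array: a `StandardScaler` configured to subtract the mean only, followed by a `RobustScaler` configured to divide by the range between the 0th and 100th percentiles (that is, max - min) without centering. The variable names `centerer` and `ranger` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Made-up data: two features on different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Step 1: subtract the mean, but do not divide by the standard deviation
centerer = StandardScaler(with_mean=True, with_std=False)

# Step 2: divide by (max - min), i.e. the 0th-100th percentile range,
# without subtracting the median
ranger = RobustScaler(with_centering=False, with_scaling=True,
                      quantile_range=(0, 100))

X_centered = centerer.fit_transform(X)
X_mean_norm = ranger.fit_transform(X_centered)

# Reference result computed directly from the formula
expected = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_mean_norm)
print(np.allclose(X_mean_norm, expected))
```

With a train and test split, both transformers would be fitted on the train set and then used to transform the test set, as with any other scaler.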

Let's see again the statistical parameters of the variables.

In [0]:
data.describe()

Note for this demo, how **the mean values are not centered at zero, and the min and max value vary across a big range**.

We need to first identify **the mean, the minimum and maximum values** of the variables.

As we did before, let's separate into train and test sets.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('MEDV', axis=1),
                                                    data['MEDV'],
                                                    test_size=0.3,
                                                    random_state=0)

X_train.shape, X_test.shape

### Mean normalization with Pandas

In [0]:
means = X_train.mean(axis=0)
ranges = X_train.max(axis=0)-X_train.min(axis=0)
means, ranges

In [0]:
X_train_scaled = (X_train - means) / ranges
X_test_scaled = (X_test - means) / ranges

In [0]:
np.round(X_train_scaled.describe(), 1)

As expected, the mean of each variable, which was not centered at zero before, is now around zero, and the minimum and maximum values lie approximately between -1 and 1. Note, however, that the standard deviations vary according to how spread out the variable was to begin with and are highly influenced by the presence of outliers.

Let's compare the variable distributions before and after scaling:

In [0]:
# let's compare the variable distributions before and after scaling
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Before Scaling')
sns.kdeplot(X_train['RM'], ax=ax1)
sns.kdeplot(X_train['LSTAT'], ax=ax1)
sns.kdeplot(X_train['CRIM'], ax=ax1)

# after scaling
ax2.set_title('After Mean Normalisation')
sns.kdeplot(X_train_scaled['RM'], label = 'RM', ax=ax2)
sns.kdeplot(X_train_scaled['LSTAT'], label = 'LSTAT', ax=ax2)
sns.kdeplot(X_train_scaled['CRIM'], label = 'CRIM', ax=ax2)
plt.legend()
plt.show()

As we can see, the main effect of mean normalization was to center all the distributions at zero and bring the values into the range between -1 and 1.

## Scaling to Minimum and Maximum values - MinMaxScaling

Minimum and maximum scaling squeezes the values between 0 and 1. It subtracts the minimum value from all the observations, and then divides the result by the value range:

**x_scaled = (x - x_min) / (x_max - x_min)**


The result of the above transformation is a distribution whose values vary within the range of 0 to 1. The mean is not centered at zero, and the standard deviation varies across variables. Because min-max scaling is a linear transformation, the shape of the distribution is the same as that of the original variable; only its location and spread (and therefore its variance) change. This scaling technique is also sensitive to outliers.

As we said before, this technique will not **normalize the distribution of the data**.

In a nutshell, MinMaxScaling:

- does not center the mean at 0
- gives a variance that varies across variables
- preserves the shape of the original distribution
- sets the minimum and maximum values to 0 and 1
- is sensitive to outliers
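
The sensitivity to outliers is easy to see on a toy example. The sketch below applies the formula by hand with NumPy; the helper `min_max_scale` and the values are made up for illustration.

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array to [0, 1] using its own minimum and maximum."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Evenly spread values use the whole [0, 1] range
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
scaled = min_max_scale(values)
print(scaled)  # [0.   0.25 0.5  0.75 1.  ]

# A single large outlier squeezes all other values close to 0
with_outlier = np.array([10.0, 20.0, 30.0, 40.0, 1010.0])
scaled_outlier = min_max_scale(with_outlier)
print(scaled_outlier)  # [0.   0.01 0.02 0.03 1.  ]
```

Because the maximum defines the upper end of the scale, one extreme value compresses the remaining observations into a narrow band near 0.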

In [0]:
from sklearn.preprocessing import MinMaxScaler

Let's remember the statistical parameters of the variables.

In [0]:
data.describe()

Note how **the minimum and maximum values are quite different for different variables**.

Following the same logic as before, we need to first identify the minimum and maximum values of the variables.

In [0]:
# let's separate the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(data.drop('MEDV', axis=1),
                                                    data['MEDV'],
                                                    test_size=0.3,
                                                    random_state=0)

X_train.shape, X_test.shape

In [0]:
# set up the scaler
scaler = MinMaxScaler()

# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)

# transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [0]:
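# the scaler stores the per-feature maximum, minimum and range learned from the train set
print(scaler.data_max_)
print(scaler.data_min_)
print(scaler.data_range_)
# data_range_ = data_max_ - data_min_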

In [0]:
# let's transform the returned NumPy arrays to dataframe
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

In [0]:
np.round(X_train_scaled.describe(), 1)

As expected, the minimum and maximum values for all the variables are 0 and 1, respectively. The mean is not centered at zero, and the variance changes.

In [0]:
# let's compare the variable distributions before and after scaling
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Before Scaling')
sns.kdeplot(X_train['RM'], ax=ax1)
sns.kdeplot(X_train['LSTAT'], ax=ax1)
sns.kdeplot(X_train['CRIM'], ax=ax1)

# after scaling
ax2.set_title('After Min-Max Scaling')
sns.kdeplot(X_train_scaled['RM'], label = 'RM', ax=ax2)
sns.kdeplot(X_train_scaled['LSTAT'], label = 'LSTAT', ax=ax2)
sns.kdeplot(X_train_scaled['CRIM'], label = 'CRIM', ax=ax2)
plt.legend()
plt.show()

We can see that the values now lie between 0 and 1, but the distributions are not centered at zero.

=====================================================================================================

## Last Topic: Dummy variables
In some cases we'll have to transform categorical variables into dummy variables in order to use them in machine learning modelling. Dummy variables are also known as binary, because they can assume just two values: 0 or 1.
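
Before working with a real dataset, here is a minimal sketch of what dummy variables look like. The dataframe `cars` and its `fuel` column are made up for illustration.

```python
import pandas as pd

# Made-up categorical variable with two distinct values
cars = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One 0/1 column per distinct value; dtype=int requests integers
# instead of booleans
dummies = pd.get_dummies(cars, dtype=int)
print(dummies)
#    fuel_diesel  fuel_gas
# 0            0         1
# 1            1         0
# 2            0         1
```

Each row has a 1 in the column matching its original category and a 0 everywhere else.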

Let's use the automobile dataset from the internet to practice converting categorical variables into dummy variables.

In [0]:
# Let's use the automobile dataset from the internet.
path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
data = pd.read_csv(path)
data.head()

In [0]:
import numpy
data_not_categorical = data.select_dtypes(numpy.number)
data_not_categorical.head()

Let's first select the categorical variables:

In [0]:
# Filtering just categorical variables
data_categorical = data.select_dtypes(numpy.object_)
data_categorical.head()

Let's use the *get_dummies* function from Pandas to convert each categorical variable into as many 0/1 variables as it has distinct values.

In [0]:
# dtype=int returns 0/1 integers (recent Pandas versions return booleans by default)
data_categorical = pd.get_dummies(data_categorical, dtype=int)
data_categorical.head()

**Now we are ready to start with Machine Learning !!! :)**

**Authors:** Juliana Coelho, Camila Mizokami

**Adapted by:** Kamilla Silva
