## Data transformation

Data transformation is a crucial step in the preprocessing pipeline when preparing data for machine learning models. Properly transformed data can lead to models that converge faster and produce more accurate results.

In this module we will cover the most common transformations you can apply to your data: normalization, standardization, distribution changes and one-hot encoding. We will continue using the "large rivers" dataset as an example, and use methods from the *scikit-learn* library (often abbreviated as *sklearn*).

**1. Load libraries and data**

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Read dataframe from CSV file
file_url = 'https://github.com/DHI/Intro_ML_course/raw/main/module_1/large_rivers_processed.csv'
df = pd.read_csv(file_url)

# If you are unable to read the file from the url, you can download it and read it locally
# file_path = 'large_rivers_processed.csv'
# df = pd.read_csv(file_path)

**2. Normalization (or Min-Max Scaling)**

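Normalization linearly rescales each feature so that its values lie between a chosen minimum and maximum (most often 0 and 1). With the default range of [0, 1], each value becomes `(x - min) / (max - min)`.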
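To make the arithmetic behind min-max scaling concrete, here is a minimal NumPy sketch of the forward and inverse mapping. The elevation values and the helper names `min_max_scale` and `min_max_inverse` are hypothetical and used only for illustration; they are not part of the dataset or of *sklearn*.

```python
import numpy as np

# Hypothetical elevations in metres (illustrative values, not from the dataset)
x = np.array([120.0, 450.0, 980.0, 2300.0, 5100.0])

def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so that min -> low and max -> high."""
    low, high = feature_range
    v_min, v_max = values.min(), values.max()
    unit = (values - v_min) / (v_max - v_min)  # now in [0, 1]
    return unit * (high - low) + low

def min_max_inverse(scaled, v_min, v_max, feature_range=(0.0, 1.0)):
    """Map scaled values back to the original units."""
    low, high = feature_range
    return (scaled - low) / (high - low) * (v_max - v_min) + v_min

x_norm = min_max_scale(x)
print(x_norm)
# 0.5 maps back to the midpoint between the minimum and the maximum
print(min_max_inverse(0.5, x.min(), x.max()))
```

The shape of the distribution is unchanged by this mapping; only the axis is rescaled.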

In [None]:
df['Elevation'].describe()

In [None]:
# Normalize elevation
scaler = MinMaxScaler()
df['Elevation_norm'] = scaler.fit_transform(df[['Elevation']])

df['Elevation_norm'].describe()

In [None]:
# Plot two histograms as subplots with Elevation and Elevation normalized
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df['Elevation'].plot.hist(ax=axes[0], bins=20)
df['Elevation_norm'].plot.hist(ax=axes[1], bins=20)
axes[0].set_title('Elevation')
axes[1].set_title('Elevation Normalized [0,1]')
plt.show()


In [None]:
# Explore scaler data and reverse transform
print('Min value: ', scaler.data_min_)
print('Max value: ', scaler.data_max_)

print('Middle of scaling range: ', scaler.data_min_ + (scaler.data_max_ - scaler.data_min_) / 2)
print('Inverse transform of 0.5 = ', scaler.inverse_transform([[0.5]]))

**3. Standardization (or Z-score Normalization)**

Standardization rescales a feature to have a mean of 0 and a standard deviation of 1, by subtracting the mean and dividing by the standard deviation.
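The z-score calculation can be sketched by hand as follows, again with hypothetical elevation values. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas reports the sample standard deviation (`ddof=1`), so `describe()` on a standardized column will typically show a value slightly above 1.

```python
import numpy as np

# Hypothetical elevations in metres (illustrative values, not from the dataset)
x = np.array([120.0, 450.0, 980.0, 2300.0, 5100.0])

mean = x.mean()
std_pop = x.std(ddof=0)  # population standard deviation
z = (x - mean) / std_pop

print('Mean of z:           ', z.mean())
print('Std of z (ddof=0):   ', z.std(ddof=0))
print('Std of z (ddof=1):   ', z.std(ddof=1))  # equals sqrt(n / (n - 1))
```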

In [None]:
# Apply Standardization
scaler = StandardScaler()
df['Elevation_stand'] = scaler.fit_transform(df[['Elevation']])

df['Elevation_stand'].describe()

In [None]:
# Plot two boxplots as subplots with Elevation and Elevation standardized
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df['Elevation'].plot.box(ax=axes[0])
df['Elevation_stand'].plot.box(ax=axes[1])
axes[0].set_title('Elevation')
axes[1].set_title('Elevation Standardized')
plt.show()

**4. Distribution change**

Skewness measures the asymmetry of a distribution: it indicates whether the data have a longer tail on one side of the mean. Positive skewness means a longer right tail, negative skewness a longer left tail.
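As a rough illustration of what the skewness number captures, the sketch below computes the simple moment-based sample skewness (third central moment divided by the second central moment to the power 1.5). The helper `sample_skewness` and the data are hypothetical. Note that pandas' `.skew()` applies a bias correction, so its values will differ somewhat from this uncorrected estimate on small samples.

```python
import numpy as np

def sample_skewness(values):
    """Uncorrected sample skewness: m3 / m2**1.5, using central moments."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance, ddof=0)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

# Illustrative data, not from the dataset
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]

print('Symmetric data:   ', sample_skewness(symmetric))
print('Right-tailed data:', sample_skewness(right_tailed))
```

Symmetric data give a skewness of zero, while a single large value on the right pulls the skewness well above zero.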

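Some algorithms work best if the data is approximately normally distributed. A normal distribution is symmetric, so its skewness is 0 (although zero skewness alone does not guarantee normality). We can bring a skewed distribution closer to symmetry by applying one of the following transformations: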
- Logarithm: only for x>0
- Square root: only for x>=0
- Cube root: defined for all real x, including x<0
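The domain restrictions listed above can be checked directly in NumPy. The sketch below uses a small hypothetical array containing a negative value and a zero; `np.errstate` is used only to silence the runtime warnings that NumPy would otherwise emit.

```python
import numpy as np

# Illustrative values covering x < 0, x == 0 and x > 0
values = np.array([-8.0, 0.0, 1.0, 27.0])

# Suppress the "invalid value" and "divide by zero" warnings for this demo
with np.errstate(divide='ignore', invalid='ignore'):
    log_vals = np.log(values)    # nan for x < 0, -inf for x == 0
    sqrt_vals = np.sqrt(values)  # nan for x < 0

cbrt_vals = np.cbrt(values)      # defined for every real x

print('log: ', log_vals)
print('sqrt:', sqrt_vals)
print('cbrt:', cbrt_vals)
```

If a column contains zeros or negative values, it is worth checking for `nan` or `-inf` in the result before plotting or computing skewness.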

In [None]:
# Square root transformation
df['Elevation_sqrt'] = np.sqrt(df['Elevation'])

# Cube root transformation
df['Elevation_cbrt'] = np.cbrt(df['Elevation'])

# Logarithmic transformation
df['Elevation_log'] = np.log(df['Elevation'])

# Plot four histograms as subplots with original Elevation distribution and transformed versions
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
df['Elevation'].plot.hist(ax=axes[0, 0], bins=30)
df['Elevation_sqrt'].plot.hist(ax=axes[0, 1], bins=30)
df['Elevation_cbrt'].plot.hist(ax=axes[1, 0], bins=30)
df['Elevation_log'].plot.hist(ax=axes[1, 1], bins=30)
axes[0, 0].set_title('Elevation')
axes[0, 1].set_title('Elevation Square Root')
axes[1, 0].set_title('Elevation Cube Root')
axes[1, 1].set_title('Elevation Logarithmic')
plt.show()

In [None]:
# Compute skewness of Elevation and transformed versions
print('Skewness of Elevation: ', df['Elevation'].skew())
print('Skewness of Elevation Square Root: ', df['Elevation_sqrt'].skew())
print('Skewness of Elevation Cube Root: ', df['Elevation_cbrt'].skew())
print('Skewness of Elevation Logarithmic: ', df['Elevation_log'].skew())

**Exercise 1.5**

Find the transformation that minimizes the skewness of Discharge and visualize the result.

In [None]:
# 1.5 Solution


**5. One-Hot Encoding**

One-hot encoding converts a categorical column into a binary matrix with one column per category: each row has a 1 in the column of its category and 0 everywhere else.
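To show what that binary matrix looks like, here is a minimal hand-rolled sketch in NumPy. It is not how `OneHotEncoder` is implemented; the continent labels are hypothetical sample values chosen for illustration.

```python
import numpy as np

# Hypothetical categorical column (illustrative values)
continents = np.array(['Asia', 'Africa', 'Asia', 'Europe'])

# Unique categories in sorted order become the columns of the matrix
categories = np.unique(continents)

# Compare every row with every category; True becomes 1, False becomes 0
one_hot = (continents[:, None] == categories[None, :]).astype(int)

print(categories)
print(one_hot)
```

Each row sums to 1 because every sample belongs to exactly one category.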

In [None]:
# Print beginning of Name and Continent columns
df[['Name', 'Continent']].head()

In [None]:
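# Apply One-Hot encoding
# Note: `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
encoder = OneHotEncoder(sparse_output=False)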
encoded_data = encoder.fit_transform(df[['Continent']])

encoded_data

In [None]:
# Add encoded data to dataframe with column names equal to encoder categories and river names as index
df_encoded = pd.DataFrame(encoded_data, columns=encoder.categories_[0], index=df['Name'])
df_encoded.head()