---
University Paris 1 Panthéon-Sorbonne

Introduction to Machine Learning

Dr. Nourhène Ben Rabah


---

# Session 2: Data Scaling and Transformation

---
In the previous session, we explored the first step of data preparation: data cleaning.  
In this session, you will learn about data scaling and transformation. Within data preparation, these methods put the data into a format that machine learning algorithms can use.

Let’s examine each of these methods:


#### 1) Data Scaling

**Scaling** adjusts the scale of data, especially when variables have different value ranges. The goal is to bring all variables to the same scale so that they carry equal importance during model training. This is particularly important for algorithms that compute distances or are otherwise sensitive to the scale of the data, such as **KNN**, **K-means**, **Neural Networks**, and **Linear Regression** when it is regularized or trained with gradient descent.
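To illustrate why distance-based algorithms are affected, here is a minimal sketch that compares the Euclidean distance between persons A and B from the table below, before and after rescaling. The helper variable names (`mins`, `maxs`, `income_share`) are illustrative choices, not part of the course material.

```python
import numpy as np

# Persons A and B from the example table: (height in cm, income in €)
a = np.array([160.0, 15_000.0])
b = np.array([175.0, 45_000.0])

# Raw Euclidean distance: the income gap (30,000) swamps the height gap (15)
raw_distance = np.linalg.norm(a - b)
income_share = (a[1] - b[1]) ** 2 / raw_distance ** 2
print(f"Raw distance: {raw_distance:.2f}")
print(f"Share of squared distance due to income: {income_share:.6f}")

# Rescale each feature to [0, 1] using the column minima and maxima of the table
mins = np.array([160.0, 15_000.0])
maxs = np.array([190.0, 300_000.0])
a_scaled = (a - mins) / (maxs - mins)
b_scaled = (b - mins) / (maxs - mins)

# After scaling, both features are expressed on a comparable scale
scaled_distance = np.linalg.norm(a_scaled - b_scaled)
print(f"Scaled distance: {scaled_distance:.4f}")
```

Before scaling, almost all of the distance comes from income; after scaling, height is no longer ignored.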

**Example**:

| Person | Height (cm) | Income (€) |
|--------|-------------|------------|
| A      | 160         | 15,000     |
| B      | 175         | 45,000     |
| C      | 180         | 100,000    |
| D      | 190         | 300,000    |
| E      | 165         | 25,000     |
| F      | 178         | 50,000     |
| G      | 182         | 120,000    |
| H      | 168         | 35,000     |
| I      | 185         | 90,000     |
| J      | 170         | 55,000     |
| K      | 167         | 33,000     |
| L      | 174         | 60,000     |
| M      | 180         | 110,000    |

Here are two distribution plots for **height** and **income**:


![Example Image](exemple1-sans.png)

There are many methods to scale data, such as:

- [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
- [`MaxAbsScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)
- [`RobustScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)
- [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
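As a rough comparison, the sketch below applies the four scalers with their default settings to the Income column of the example. With defaults, `MinMaxScaler` maps values to the range 0 to 1, `MaxAbsScaler` divides by the largest absolute value, `RobustScaler` subtracts the median and divides by the interquartile range, and `StandardScaler` subtracts the mean and divides by the standard deviation. The `results` dictionary is only an illustrative container.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler, StandardScaler

# Income column from the example table, shaped as one feature with 13 samples
income = np.array([15000, 45000, 100000, 300000, 25000, 50000, 120000,
                   35000, 90000, 55000, 33000, 60000, 110000], dtype=float).reshape(-1, 1)

scalers = {
    "MinMaxScaler": MinMaxScaler(),
    "MaxAbsScaler": MaxAbsScaler(),
    "RobustScaler": RobustScaler(),
    "StandardScaler": StandardScaler(),
}

# Fit each scaler on the same column and keep the scaled values for comparison
results = {}
for name, scaler in scalers.items():
    scaled = scaler.fit_transform(income).ravel()
    results[name] = scaled
    print(f"{name:>14}: min={scaled.min():.3f}, max={scaled.max():.3f}")
```

Comparing the printed minima and maxima shows how differently each method treats the 300,000 outlier.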

In this example, we will apply the **[`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)**.  
This method rescales each variable to a chosen range, by default **0 to 1**; another common choice is **-1 to 1**.

The formula to obtain the scaled value is:

<div>
$$
x_{\text{scaled}_i} = \frac{x_i - \text{min}(x)}{\text{max}(x) - \text{min}(x)} \times (\text{max}_{\text{new}} - \text{min}_{\text{new}}) + \text{min}_{\text{new}}
$$
</div>

Where:

- $x_i$ is the original value.  
- $\text{min}(x)$ and $\text{max}(x)$ are the minimum and maximum values of the original dataset.  
- $\text{min}_{\text{new}}$ and $\text{max}_{\text{new}}$ are the desired range limits (e.g., 0 and 1 or -1 and 1).
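The formula can be written directly in NumPy. This is a minimal sketch, not the Scikit-learn implementation: the function name `min_max_scale` and its parameters are hypothetical, and raising an error for a constant column is a choice made for this sketch.

```python
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Rescale a 1-D sequence to [new_min, new_max] with the Min-Max formula."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Sketch-level choice: refuse constant columns instead of dividing by zero
        raise ValueError("All values are identical; the formula would divide by zero.")
    return (x - x_min) / (x_max - x_min) * (new_max - new_min) + new_min

heights = [160, 175, 180, 190, 165, 178, 182, 168, 185, 170, 167, 174, 180]

scaled_01 = min_max_scale(heights)              # target range 0 to 1
scaled_pm1 = min_max_scale(heights, -1.0, 1.0)  # target range -1 to 1

print(scaled_01[:3])
print(scaled_pm1[:3])
```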

###### 1) Use the formula above to scale the values between 0 and 1 for the **Height** and **Income** columns

###### Scale the Height column between 0 and 1


**a. Calculate the minimum and maximum values of the feature**

- $ \text{min}(x) $ = 160  
- $ \text{max}(x) $ = 190  
- $ \text{max}_{\text{new}} $ = 1  
- $ \text{min}_{\text{new}} $ = 0

**b. Scale each value using the Min-Max scaling formula**

- For 160:  
  $ x_{\text{scaled}} = \frac{160 - 160}{190 - 160} \times (1 - 0) + 0 = 0 $

- For 175:  
  $ x_{\text{scaled}} = \frac{175 - 160}{190 - 160} \times (1 - 0) + 0 = 0.5 $

- For 180:  
  $ x_{\text{scaled}} = \frac{180 - 160}{190 - 160} \times (1 - 0) + 0 = 0.6667 $
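The same steps apply to the Income column. As a worked sketch using the table values, with $ \text{min}(x) $ = 15000 and $ \text{max}(x) $ = 300000, the first three incomes scale as follows:

- For 15,000:  
  $ x_{\text{scaled}} = \frac{15000 - 15000}{300000 - 15000} \times (1 - 0) + 0 = 0 $

- For 45,000:  
  $ x_{\text{scaled}} = \frac{45000 - 15000}{300000 - 15000} \times (1 - 0) + 0 = \frac{30000}{285000} \approx 0.1053 $

- For 100,000:  
  $ x_{\text{scaled}} = \frac{100000 - 15000}{300000 - 15000} \times (1 - 0) + 0 = \frac{85000}{285000} \approx 0.2982 $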

###### 2) The following code uses Scikit-learn to apply Min-Max scaling to the Height (cm) and Income (€) columns, with a target range of 0 to 1.




In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Data
data = {
    "Person": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M"],
    "Height (cm)": [160, 175, 180, 190, 165, 178, 182, 168, 185, 170, 167, 174, 180],
    "Income (€)": [15000, 45000, 100000, 300000, 25000, 50000, 120000, 35000, 90000, 55000, 33000, 60000, 110000]
}

df = pd.DataFrame(data)
scaler = MinMaxScaler(feature_range=(0, 1))
# fit: calculates the min and max; transform applies the scaling
df[['Height (cm)', 'Income (€)']] = scaler.fit_transform(df[['Height (cm)', 'Income (€)']])
print(df)


   Person  Height (cm)  Income (€)
0       A     0.000000    0.000000
1       B     0.500000    0.105263
2       C     0.666667    0.298246
3       D     1.000000    1.000000
4       E     0.166667    0.035088
5       F     0.600000    0.122807
6       G     0.733333    0.368421
7       H     0.266667    0.070175
8       I     0.833333    0.263158
9       J     0.333333    0.140351
10      K     0.233333    0.063158
11      L     0.466667    0.157895
12      M     0.666667    0.333333
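To make the difference between `fit` and `transform` more concrete, the sketch below calls them separately on the Height column. The 172 cm value is a hypothetical new person added only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

heights = pd.DataFrame(
    {"Height (cm)": [160, 175, 180, 190, 165, 178, 182, 168, 185, 170, 167, 174, 180]}
)

scaler = MinMaxScaler(feature_range=(0, 1))

# fit only learns the minimum and maximum of each column
scaler.fit(heights)
print("Learned min:", scaler.data_min_, "Learned max:", scaler.data_max_)

# transform applies the Min-Max formula with the learned values
scaled = scaler.transform(heights)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)

# A new (hypothetical) person is scaled with the min and max learned above
new_person = pd.DataFrame({"Height (cm)": [172]})
new_scaled = scaler.transform(new_person)
print("Scaled value for 172 cm:", new_scaled[0, 0])
```

Keeping the fitted scaler makes it possible to apply exactly the same transformation to data seen later.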


###### 3) Apply scaling for both variables between -1 and 1

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Data
data = {
    "Person": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M"],
    "Height (cm)": [160, 175, 180, 190, 165, 178, 182, 168, 185, 170, 167, 174, 180],
    "Income (€)": [15000, 45000, 100000, 300000, 25000, 50000, 120000, 35000, 90000, 55000, 33000, 60000, 110000]
}

df = pd.DataFrame(data)
scaler = MinMaxScaler(feature_range=(-1, 1))
# fit: calculates the min and max; transform applies the scaling
df[['Height (cm)', 'Income (€)']] = scaler.fit_transform(df[['Height (cm)', 'Income (€)']])
print(df)

   Person  Height (cm)  Income (€)
0       A    -1.000000   -1.000000
1       B     0.000000   -0.789474
2       C     0.333333   -0.403509
3       D     1.000000    1.000000
4       E    -0.666667   -0.929825
5       F     0.200000   -0.754386
6       G     0.466667   -0.263158
7       H    -0.466667   -0.859649
8       I     0.666667   -0.473684
9       J    -0.333333   -0.719298
10      K    -0.533333   -0.873684
11      L    -0.066667   -0.684211
12      M     0.333333   -0.333333


We will now observe the distribution of the data for both variables after scaling.

![Example Image](exemple1-MINMAXscaling.png)

##### Exercise  
You can now download the diabetes dataset (from the dataset folder) and apply Min-Max scaling to the columns that need it. For more information about this dataset, see https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database.


In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('./diabetes.csv', delimiter=',', header=0)
scaler = MinMaxScaler(feature_range=(0, 1))
df[['Glucose', 'BloodPressure', 'Insulin', 'BMI']] = scaler.fit_transform(df[['Glucose', 'BloodPressure', 'Insulin', 'BMI']])
df


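|     | Pregnancies | Glucose  | BloodPressure | SkinThickness | Insulin  | BMI      | DiabetesPedigreeFunction | Age | Outcome |
|-----|-------------|----------|---------------|---------------|----------|----------|--------------------------|-----|---------|
| 0   | 6           | 0.743719 | 0.590164      | 35            | 0.000000 | 0.500745 | 0.627                    | 50  | 1       |
| 1   | 1           | 0.427136 | 0.540984      | 29            | 0.000000 | 0.396423 | 0.351                    | 31  | 0       |
| 2   | 8           | 0.919598 | 0.524590      | 0             | 0.000000 | 0.347243 | 0.672                    | 32  | 1       |
| 3   | 1           | 0.447236 | 0.540984      | 23            | 0.111111 | 0.418778 | 0.167                    | 21  | 0       |
| 4   | 0           | 0.688442 | 0.327869      | 35            | 0.198582 | 0.642325 | 2.288                    | 33  | 1       |
| ... | ...         | ...      | ...           | ...           | ...      | ...      | ...                      | ... | ...     |
| 763 | 10          | 0.507538 | 0.622951      | 48            | 0.212766 | 0.490313 | 0.171                    | 63  | 0       |
| 764 | 2           | 0.613065 | 0.573770      | 27            | 0.000000 | 0.548435 | 0.340                    | 27  | 0       |
| 765 | 5           | 0.608040 | 0.590164      | 23            | 0.132388 | 0.390462 | 0.245                    | 30  | 0       |
| 766 | 1           | 0.633166 | 0.491803      | 0             | 0.000000 | 0.448584 | 0.349                    | 47  | 1       |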


There is another scaling method you can try: StandardScaler (Z-score). It centers the values around zero and scales the standard deviation to 1.

This is commonly done using the **Z-score formula**:
<div>
$$
z = \frac{x - \mu}{\sigma}
$$
</div>
    
Where:
- $x$ is the value of a data point,
- $\mu$ is the mean of the data,
- $\sigma$ is the standard deviation of the data.
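As a quick check of the formula, here is a minimal NumPy sketch that computes the Z-scores of the Height column by hand. It uses the population standard deviation (`ddof=0`, the NumPy default), which is also what `StandardScaler` uses.

```python
import numpy as np

heights = np.array([160, 175, 180, 190, 165, 178, 182, 168, 185, 170, 167, 174, 180],
                   dtype=float)

mu = heights.mean()     # mean of the column
sigma = heights.std()   # population standard deviation (ddof=0)

# Z-score formula applied element-wise
z = (heights - mu) / sigma

print("First three z-scores:", z[:3])
print("Mean after scaling:", z.mean())
print("Standard deviation after scaling:", z.std())
```

The mean after scaling is zero up to floating-point rounding, and the standard deviation is 1.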

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Data
data = {
    "Person": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M"],
    "Height (cm)": [160, 175, 180, 190, 165, 178, 182, 168, 185, 170, 167, 174, 180],
    "Income (€)": [15000, 45000, 100000, 300000, 25000, 50000, 120000, 35000, 90000, 55000, 33000, 60000, 110000]
}

# Creating the DataFrame
df = pd.DataFrame(data)

# Initializing the StandardScaler
scaler = StandardScaler()

# Applying scaling to 'Height (cm)' and 'Income (€)' columns
df[['Height (cm) Scaling', 'Income (€) Scaling']] = scaler.fit_transform(df[['Height (cm)', 'Income (€)']])

# Checking the mean and standard deviation after scaling
# (np.std uses the population standard deviation, like StandardScaler)
mean_height = np.mean(df['Height (cm) Scaling'])
std_height = np.std(df['Height (cm) Scaling'])

mean_income = np.mean(df['Income (€) Scaling'])
std_income = np.std(df['Income (€) Scaling'])

# Displaying results
print(df[['Person', 'Height (cm)', 'Height (cm) Scaling', 'Income (€)', 'Income (€) Scaling']])
print("\nMean of height after scaling:", mean_height)
print("Standard deviation of height after scaling:", std_height)
print("\nMean of income after scaling:", mean_income)
print("Standard deviation of income after scaling:", std_income)


   Person  Height (cm)  Height (cm) Scaling  Income (€)  Income (€) Scaling
0       A          160            -1.796604       15000           -0.909857
1       B          175             0.009261       45000           -0.488927
2       C          180             0.611216      100000            0.282779
3       D          190             1.815126      300000            3.088980
4       E          165            -1.194649       25000           -0.769547
5       F          178             0.370434       50000           -0.418772
6       G          182             0.851998      120000            0.563399
7       H          168            -0.833476       35000           -0.629237
8       I          185             1.213171       90000            0.142469
9       J          170            -0.592694       55000           -0.348617
10      K          167            -0.953867       33000           -0.657299
11      L          174            -0.111130       60000           -0.278462
12      M   

##### Apply scaling to the diabetes dataset using StandardScaler()


In [8]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('./diabetes.csv', delimiter=',', header=0)
scaler = StandardScaler()
df[['Glucose', 'BloodPressure', 'Insulin', 'BMI']] = scaler.fit_transform(df[['Glucose', 'BloodPressure', 'Insulin', 'BMI']])
df


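|     | Pregnancies | Glucose   | BloodPressure | SkinThickness | Insulin   | BMI       | DiabetesPedigreeFunction | Age | Outcome |
|-----|-------------|-----------|---------------|---------------|-----------|-----------|--------------------------|-----|---------|
| 0   | 6           | 0.848324  | 0.149641      | 35            | -0.692891 | 0.204013  | 0.627                    | 50  | 1       |
| 1   | 1           | -1.123396 | -0.160546     | 29            | -0.692891 | -0.684422 | 0.351                    | 31  | 0       |
| 2   | 8           | 1.943724  | -0.263941     | 0             | -0.692891 | -1.103255 | 0.672                    | 32  | 1       |
| 3   | 1           | -0.998208 | -0.160546     | 23            | 0.123302  | -0.494043 | 0.167                    | 21  | 0       |
| 4   | 0           | 0.504055  | -1.504687     | 35            | 0.765836  | 1.409746  | 2.288                    | 33  | 1       |
| ... | ...         | ...       | ...           | ...           | ...       | ...       | ...                      | ... | ...     |
| 763 | 10          | -0.622642 | 0.356432      | 48            | 0.870031  | 0.115169  | 0.171                    | 63  | 0       |
| 764 | 2           | 0.034598  | 0.046245      | 27            | -0.692891 | 0.610154  | 0.340                    | 27  | 0       |
| 765 | 5           | 0.003301  | 0.149641      | 23            | 0.279594  | -0.735190 | 0.245                    | 30  | 0       |
| 766 | 1           | 0.159787  | -0.470732     | 0             | -0.692891 | -0.240205 | 0.349                    | 47  | 1       |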


### 2) Data Transformation

In this session, data transformation means converting categorical variables into numerical values, because many machine learning algorithms accept only numerical inputs.

### CASE 1. If I have **ordinal categorical** variables

Ordinal variables have a natural order or ranking between the different categories.

For example, categories such as "Small", "Medium", and "Large" have an implicit order: "Small" < "Medium" < "Large".

##### OrdinalEncoder

The `OrdinalEncoder` transforms each category of an ordinal variable into a numerical value while respecting the order of the categories.


In [3]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Creating a DataFrame with ordinal data
data = {
    "Size": ["Small", "Large", "Medium", "Medium", "Small", "Large", "Large"]
}
df = pd.DataFrame(data)

# Initializing the OrdinalEncoder with the order of categories
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

# Applying the OrdinalEncoder
df['Size_encoded'] = ordinal_encoder.fit_transform(df[['Size']])

# Displaying the DataFrame with encoded labels
print(df)


     Size  Size_encoded
0   Small           0.0
1   Large           2.0
2  Medium           1.0
3  Medium           1.0
4   Small           0.0
5   Large           2.0
6   Large           2.0
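The `categories` argument matters here. As a sketch of what happens without it: by default, `OrdinalEncoder` sorts the categories alphabetically, so the codes no longer follow the Small < Medium < Large order. The variable names below are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Size": ["Small", "Large", "Medium", "Medium", "Small", "Large", "Large"]})

# Default behaviour: categories are sorted alphabetically (Large, Medium, Small)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(df[["Size"]]).ravel()
print("Default category order:", list(default_encoder.categories_[0]))
print("Default codes:", default_codes)

# Explicit order: the codes respect Small < Medium < Large
ordered_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordered_codes = ordered_encoder.fit_transform(df[["Size"]]).ravel()
print("Ordered codes:", ordered_codes)
```

With the default order, "Small" receives the largest code, which contradicts the meaning of the variable.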


### CASE 2. If the categories have no order

For example, if you have a categorical variable "Color" with categories:

| Color  |
|--------|
| Red    |
| Blue   |
| Green  |
| Red    |
| Yellow |

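`LabelEncoder` assigns integer codes following the alphabetical order of the categories (Blue = 0, Green = 1, Red = 2, Yellow = 3), so these values are encoded as follows:

| Category | Encoded |
|----------|---------|
| Red      | 2       |
| Blue     | 0       |
| Green    | 1       |
| Red      | 2       |
| Yellow   | 3       |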

In [4]:
from sklearn.preprocessing import LabelEncoder

# Example data
colors = ['Red', 'Blue', 'Green','Red', 'Yellow']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform data
encoded_colors = label_encoder.fit_transform(colors)

# Print encoded data
print("Encoded Colors:", encoded_colors)

Encoded Colors: [2 0 1 2 3]


However, it is **important** to note that label encoding can introduce an artificial order among categories (here Blue < Green < Red < Yellow), which a model may wrongly treat as meaningful when the categories are not ordered. For this reason, `LabelEncoder` is best reserved for encoding **the target variable**.
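To see where the codes come from, the fitted encoder can be inspected. This is a small sketch: `classes_` holds the categories in sorted order, and `inverse_transform` converts codes back to labels. The `mapping` dictionary is built here only for display.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Red', 'Yellow']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(colors)

# classes_ lists the labels in sorted order; the position of each label is its code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print("Mapping:", mapping)

# inverse_transform recovers the original labels from the codes
decoded = label_encoder.inverse_transform(encoded)
print("Decoded:", list(decoded))
```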


#####  One-Hot Encoder:

**One-Hot Encoding** is a technique that encodes each category as a **binary vector**, which avoids introducing any order relation between categories.  
For example, take the same categorical variable "Color" with the values:

| Color   |
|---------|
| Red     |
| Blue    |
| Green   |
| Red     |
| Yellow  |

After one-hot encoding, we will have:

| Color_Blue | Color_Green | Color_Red | Color_Yellow |
|------------|-------------|-----------|--------------|
| 0.0        | 0.0         | 1.0       | 0.0          |
| 1.0        | 0.0         | 0.0       | 0.0          |
| 0.0        | 1.0         | 0.0       | 0.0          |
| 0.0        | 0.0         | 1.0       | 0.0          |
| 0.0        | 0.0         | 0.0       | 1.0          |


In [5]:
import pandas as pd
# Example data
colors = ['Red', 'Blue', 'Green', 'Red', 'Yellow']

# Convert data to DataFrame
df = pd.DataFrame({'Color': colors})

# Perform One-Hot Encoding (dtype=int gives 0/1; pandas >= 2.0 returns True/False by default)
one_hot_encoded = pd.get_dummies(df['Color'], dtype=int)

# Print One-Hot Encoded data
print("One-Hot Encoded Data:")
print(one_hot_encoded)


One-Hot Encoded Data:
   Blue  Green  Red  Yellow
0     0      0    1       0
1     1      0    0       0
2     0      1    0       0
3     0      0    1       0
4     0      0    0       1


##### Exercise. Let's encode the "class" column from the iris dataset.
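One possible starting point is sketched below. It assumes the iris data bundled with Scikit-learn (`load_iris`) rather than the course file, so the column name "class" is created by hand and the label spellings may differ from those in the dataset folder. Since "class" is the target variable, `LabelEncoder` is used.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

# Assumption: use the copy of iris shipped with Scikit-learn
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Build a text "class" column from the numeric target and the species names
df["class"] = iris.target_names[iris.target]

# Encode the target variable as integers
label_encoder = LabelEncoder()
df["class_encoded"] = label_encoder.fit_transform(df["class"])

print("Classes:", list(label_encoder.classes_))
print(df[["class", "class_encoded"]].drop_duplicates())
```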


<div style="background-color:lightblue; padding:1px">

### A summary

The use of scaling depends on the specific requirements of your data and the machine learning algorithm you are using.

- Scaling: Scaling is typically used when the features in your dataset have different ranges, and you want to bring them to a similar scale. This is often important for algorithms that are sensitive to feature scaling, such as Support Vector Machines (SVM), k-Nearest Neighbors (KNN), logistic regression, and K-means.

- Encoding: Ordinal encoding transforms ordered categorical variables into numerical values, while one-hot encoding transforms unordered categorical variables into binary variables for use in machine learning models.

</div>
