
# **Chapter 5: Unfolding the Essentials of Data Scaling and Normalization**

## Data Scaling: Leveling the Playing Field

- Puts features on a comparable numeric range.  
- Prevents models from favoring a feature only because its values are numerically larger.  
- Two common targets: the range **0–1** (normalization) or **mean = 0, std = 1** (standardization).  

**Example:**  
- Features:  
  - Study hours: 0–20  
  - Grades: 0–100  
- Without scaling, grades can dominate simply because their values are larger, not because they matter more.  
- Scaling lets both features contribute on comparable terms.  
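
As a minimal sketch of this effect, the snippet below compares Euclidean distances between three students before and after scaling. The three students are hypothetical, and dividing by the stated ranges (20 and 100) is an assumed, simplified scaling used only for illustration.

```python
import numpy as np

# Hypothetical students: (study hours on a 0-20 scale, grade on a 0-100 scale)
a = np.array([2.0, 80.0])
b = np.array([18.0, 82.0])  # very different study hours, similar grade
c = np.array([2.0, 60.0])   # same study hours, noticeably lower grade

# Assumed feature ranges taken from the example above
ranges = np.array([20.0, 100.0])

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(p - q))

# Raw features: the grade axis contributes most of the distance
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Scaled features: each axis is expressed as a fraction of its range
scaled_ab = euclidean(a / ranges, b / ranges)
scaled_ac = euclidean(a / ranges, c / ranges)

print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.2f}, d(a, c) = {scaled_ac:.2f}")
```

On the raw values, student `a` looks closer to `b` than to `c`, even though `a` and `b` differ by 80% of the study-hours range. After scaling, the ordering flips.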

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [None]:
# Build a small example dataset to scale with the StandardScaler class from sklearn.preprocessing

# example data: one row per student
data = {
    'Study Hours': [10, 15, 8, 9, 12, 14, 13],
    'Grades': [85, 90, 76, 81, 87, 92, 88]
}
df = pd.DataFrame(data)

df.head()


| | Study Hours | Grades |
|---|---|---|
| 0 | 10 | 85 |
| 1 | 15 | 90 |
| 2 | 8 | 76 |
| 3 | 9 | 81 |
| 4 | 12 | 87 |


In [None]:
# scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

print(scaled_data)

[[-0.64372631 -0.11215443]
 [ 1.40449377  0.86919684]
 [-1.46301434 -1.87858672]
 [-1.05337032 -0.89723545]
 [ 0.17556172  0.28038608]
 [ 0.99484975  1.26173735]
 [ 0.58520574  0.47665633]]


The result is a new array in which 'Study Hours' and 'Grades' each have mean 0 and standard deviation 1, putting them on an equal footing.
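
The same numbers can be reproduced by hand, which makes the transformation explicit. The sketch below applies z = (x − mean) / std to each column and compares the result with `StandardScaler`. The name `df_check` is introduced here only for illustration. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas defaults to the sample standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same student data as above, rebuilt so this cell runs on its own
df_check = pd.DataFrame({
    'Study Hours': [10, 15, 8, 9, 12, 14, 13],
    'Grades': [85, 90, 76, 81, 87, 92, 88],
})

# Manual standardization: subtract the column mean, divide by the
# population standard deviation (ddof=0, matching StandardScaler)
manual = (df_check - df_check.mean()) / df_check.std(ddof=0)

# Library result for comparison
sklearn_scaled = StandardScaler().fit_transform(df_check)

# True when the manual and library results agree
print(np.allclose(manual.to_numpy(), sklearn_scaled))
```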

## Data Normalization

- A type of data scaling.  
- Rescales each feature so its values fall within the **0 to 1** range.  
- Useful when features have very different ranges.  
- A reasonable default when you have no prior reason to weight one feature more than another.

In [None]:
# Consider our previous student data ('Study Hours' and 'Grades'). Let's normalize it:

from sklearn.preprocessing import MinMaxScaler

# example data
data_2 = {
    'Study Hours': [10, 15, 8, 9, 12, 14, 13],
    'Grades': [85, 90, 76, 81, 87, 92, 88]
}
df_2 = pd.DataFrame(data_2)

df_2.head()

| | Study Hours | Grades |
|---|---|---|
| 0 | 10 | 85 |
| 1 | 15 | 90 |
| 2 | 8 | 76 |
| 3 | 9 | 81 |
| 4 | 12 | 87 |


In [None]:
# normalize the data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(df_2)

print(normalized_data)

[[0.28571429 0.5625    ]
 [1.         0.875     ]
 [0.         0.        ]
 [0.14285714 0.3125    ]
 [0.57142857 0.6875    ]
 [0.85714286 1.        ]
 [0.71428571 0.75      ]]
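
For reference, the min-max transformation can also be written out directly as (x − min) / (max − min) per column. The sketch below is an illustration under that formula, with `df_mm` as a name introduced here, and checks that it agrees with `MinMaxScaler` at its default `feature_range` of (0, 1).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same student data as above, rebuilt so this cell runs on its own
df_mm = pd.DataFrame({
    'Study Hours': [10, 15, 8, 9, 12, 14, 13],
    'Grades': [85, 90, 76, 81, 87, 92, 88],
})

# Manual min-max normalization: the column minimum maps to 0,
# the column maximum maps to 1
manual_mm = (df_mm - df_mm.min()) / (df_mm.max() - df_mm.min())

# Library result for comparison
sklearn_mm = MinMaxScaler().fit_transform(df_mm)

# True when the manual and library results agree
print(np.allclose(manual_mm.to_numpy(), sklearn_mm))
```

For example, a student with 10 study hours maps to (10 − 8) / (15 − 8) ≈ 0.2857, which matches the first row of the output above.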


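- **MinMaxScaler** rescales each feature so its minimum maps to **0** and its maximum to **1**.  
- Puts 'Study Hours' and 'Grades' on the **same scale** for fair comparison.  
- Helps many ML algorithms, such as distance-based methods and models trained with gradient descent, but is **not always required**: tree-based models, for example, are largely insensitive to feature scale.  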
- The need for scaling/normalization depends on:  
  - The **nature of your data**  
  - The **requirements of the algorithm**  