# 🏡 Min-Max Normalization Workshop
## Team Name: FIVE
## Team Members: Fasalu Rahman Kottaparambu, Christo Pananjickal Baby
---

## ❗ Why We Normalize: The Problem with Raw Feature Scales

In housing data, `Price` can reach the hundreds of thousands and `Lot_Size` the thousands, while `Num_Bedrooms` ranges only from 1 to 5. Such mismatched scales cause problems for algorithms that depend on numeric magnitudes.

---

### ⚠️ What Goes Wrong Without Normalization

---

### 1. 🧭 K-Nearest Neighbors (KNN)

KNN uses the **Euclidean distance** formula:

$$
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + \cdots}
$$

**Example:**

- $ \text{Price}_1 = 650{,}000, \quad \text{Price}_2 = 250{,}000 $
- $ \text{Bedrooms}_1 = 3, \quad \text{Bedrooms}_2 = 2 $

Now compute squared differences:

$$
(\text{Price}_1 - \text{Price}_2)^2 = (650{,}000 - 250{,}000)^2 = (400{,}000)^2 = 1.6 \times 10^{11}
$$
$$
(\text{Bedrooms}_1 - \text{Bedrooms}_2)^2 = (3 - 2)^2 = 1
$$

➡️ **Price dominates the distance calculation**, making smaller features like `Bedrooms` irrelevant.
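
To make the effect concrete, the short sketch below recomputes the distance between the two example houses, first on raw values and then after min-max scaling. The feature ranges used for scaling (Price from 250,000 to 800,000, Bedrooms from 1 to 5) are assumptions chosen for illustration, not values measured from the dataset.

```python
import math

# Two example houses as (Price, Bedrooms), taken from the example above
house_a = (650_000, 3)
house_b = (250_000, 2)

# Assumed (min, max) range per feature, for illustration only
ranges = [(250_000, 800_000), (1, 5)]

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def scale(point):
    """Min-max scale each coordinate using the assumed ranges."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, ranges))

raw_distance = euclidean(house_a, house_b)
scaled_distance = euclidean(scale(house_a), scale(house_b))

print(f"Raw distance:    {raw_distance:,.2f}")
print(f"Scaled distance: {scaled_distance:.4f}")
```

On the raw values the distance is essentially the price gap alone. After scaling, the one-bedroom difference accounts for a visible share of the result.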

---

### 2. 📉 Linear Regression

Linear regression estimates:

$$
y = \beta_1 \cdot \text{Price} + \beta_2 \cdot \text{Bedrooms} + \beta_3 \cdot \text{Lot\_Size} + \epsilon
$$

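If `Price` has very large values and the model is fit with gradient descent:
- The gradient for $ \beta_1 $ will be **much larger**, because each gradient component is proportional to the magnitude of its feature
- The gradient for $ \beta_2 $ (Bedrooms) will be **very small** by comparison

➡️ No single learning rate suits both coefficients, so training converges slowly or diverges. (A closed-form least-squares fit is not affected by feature scale, but the coefficient magnitudes become hard to compare.)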
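
The gradient imbalance can be seen in a minimal sketch that evaluates the gradient of the mean squared error with all coefficients at zero. The `Price` and `Bedrooms` values come from the first rows of the dataset preview in this notebook. The target values and the helper name `mse_gradient` are hypothetical and exist only for this illustration.

```python
# Toy rows as (Price, Bedrooms), from the dataset preview
rows = [
    (574_507, 3),
    (479_260, 2),
    (597_153, 5),
]
targets = [1.0, 0.5, 1.5]  # hypothetical target values, for illustration only

def mse_gradient(rows, targets, betas):
    """Gradient of the mean squared error with respect to each coefficient."""
    n = len(rows)
    grads = [0.0] * len(betas)
    for x, y in zip(rows, targets):
        residual = sum(b * xi for b, xi in zip(betas, x)) - y
        for j, xi in enumerate(x):
            grads[j] += 2 * residual * xi / n
    return grads

grad_price, grad_bedrooms = mse_gradient(rows, targets, betas=[0.0, 0.0])
print(f"dMSE/d(beta_price):    {grad_price:,.1f}")
print(f"dMSE/d(beta_bedrooms): {grad_bedrooms:,.1f}")
```

The two gradient components differ by several orders of magnitude, which is why a single learning rate struggles on unscaled features.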

---

### 3. 🧠 Neural Networks

A single neuron computes:

$$
z = w_1 \cdot \text{Price} + w_2 \cdot \text{Bedrooms} + w_3 \cdot \text{Lot\_Size}
$$

If:

- $ \text{Price} = 650{,}000 $
- $ \text{Bedrooms} = 3 $
- $ \text{Lot\_Size} = 8{,}000 $

Then:

$$
z \approx w_1 \cdot 650{,}000 + w_2 \cdot 3 + w_3 \cdot 8{,}000
$$

➡️ Even with equal weights, `Price` contributes **most of the activation**, making it difficult for the network to learn from other features.
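
As a quick check of that claim, the sketch below computes each feature's share of $z$ when all three weights are equal. The weight value `0.01` is arbitrary and chosen only for illustration.

```python
# Example inputs from the text above
features = {"Price": 650_000, "Bedrooms": 3, "Lot_Size": 8_000}

# Equal weight for every input, chosen arbitrarily for illustration
weight = 0.01

contributions = {name: weight * value for name, value in features.items()}
z = sum(contributions.values())

for name, c in contributions.items():
    print(f"{name:<9} contributes {c:>10,.2f} ({c / z:.2%} of z)")
```

With these inputs, `Price` supplies more than 98% of $z$, and `Bedrooms` is effectively invisible.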

---

### ✅ Solution: Min-Max Normalization

We apply the transformation:

$$
x_{\text{normalized}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
$$

This scales all features to a common range (typically $[0, 1]$).

| Feature      | Raw Value | Min     | Max     | Normalized Value |
|--------------|-----------|---------|---------|------------------|
| Price        | 650,000   | 250,000 | 800,000 | 0.727            |
| Bedrooms     | 3         | 1       | 5       | 0.50             |
| Lot_Size     | 8,000     | 3,000   | 10,000  | 0.714            |

➡️ Now, **each feature is on a comparable scale**, so no single feature dominates model training or distance comparisons.
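
The rows of the table can be recomputed with a few lines of plain Python. This is a minimal sketch of the formula above. The helper name `min_max` is introduced here only for illustration.

```python
def min_max(x, x_min, x_max):
    """Scale x into [0, 1] given the feature's minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

# (raw value, min, max) for each feature in the table above
table = {
    "Price": (650_000, 250_000, 800_000),
    "Bedrooms": (3, 1, 5),
    "Lot_Size": (8_000, 3_000, 10_000),
}

normalized = {name: min_max(*values) for name, values in table.items()}
for name, value in normalized.items():
    print(f"{name:<9} {value:.3f}")
```

Rounded to three decimals, this prints 0.727 for `Price`, 0.500 for `Bedrooms`, and 0.714 for `Lot_Size`.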

---

## 📌 Use Case: Housing Data
We are normalizing features from a real estate dataset to prepare it for machine learning analysis.

In [1]:
# 🔢 Load and display dataset
import pandas as pd
from tabulate import tabulate
df = pd.read_csv('data/housing_data.csv')
print(tabulate(df.head(), headers='keys', tablefmt='psql', showindex=False))

+------------+---------+-------------+----------------+-----------------+--------------+------------+
| House_ID   |   Price |   Area_sqft |   Num_Bedrooms |   Num_Bathrooms |   Year_Built |   Lot_Size |
|------------+---------+-------------+----------------+-----------------+--------------+------------|
| H100000    |  574507 |        1462 |              3 |               3 |         2002 |       4878 |
| H100001    |  479260 |        1727 |              2 |               2 |         1979 |       4943 |
| H100002    |  597153 |        1403 |              5 |               2 |         1952 |       5595 |
| H100003    |  728454 |        1646 |              5 |               2 |         1992 |       9305 |
| H100004    |  464876 |         853 |              1 |               1 |         1956 |       7407 |
+------------+---------+-------------+----------------+-----------------+--------------+------------+


### 🔎 Step 1 — Implement Min-Max Normalization on the Housing Dataset

### Cleaning the dataset

1. We remove the `House_ID` column because it is an identifier rather than a feature and is not needed for normalization.
2. We fill missing values in each numeric column with that column's median so that NaN values do not affect the normalization.


In [2]:
# ✍️ Prepare the data for manual Min-Max Normalization (no sklearn/numpy)
# Columns to normalize: Price, Area_sqft, Num_Bedrooms, Num_Bathrooms, Year_Built, Lot_Size


# Drop unnecessary columns (House_ID)
df = df.drop(columns=['House_ID'])


# Clean the dataset by filling missing values with median
numeric_columns = ['Price', 'Area_sqft', 'Num_Bedrooms', 'Num_Bathrooms', 'Year_Built', 'Lot_Size']

for col in numeric_columns:
    median_value = df[col].median()
    df[col] = df[col].fillna(median_value)

print(tabulate(df.head(), headers='keys', tablefmt='psql', showindex=False))


+---------+-------------+----------------+-----------------+--------------+------------+
|   Price |   Area_sqft |   Num_Bedrooms |   Num_Bathrooms |   Year_Built |   Lot_Size |
|---------+-------------+----------------+-----------------+--------------+------------|
|  574507 |        1462 |              3 |               3 |         2002 |       4878 |
|  479260 |        1727 |              2 |               2 |         1979 |       4943 |
|  597153 |        1403 |              5 |               2 |         1952 |       5595 |
|  728454 |        1646 |              5 |               2 |         1992 |       9305 |
|  464876 |         853 |              1 |               1 |         1956 |       7407 |
+---------+-------------+----------------+-----------------+--------------+------------+
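
Because the preview shows no missing values, the effect of the median fill is not visible in the output. The sketch below applies the same fill to a small hypothetical frame in which one `Price` has been deliberately blanked out. The frame `toy` and its missing entry are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame with one missing Price, for illustration only
toy = pd.DataFrame({
    "Price": [574507, None, 597153, 728454, 464876],
    "Num_Bedrooms": [3, 2, 5, 5, 1],
})

# Count missing values per column before filling
print(toy.isna().sum())

# Fill each column's missing entries with that column's median
for col in toy.columns:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

The median is computed from the non-missing entries only, so the blank `Price` is replaced by the midpoint of the two middle values among the remaining four.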


## Explanation of the Code

- **Copy DataFrame:** We create a copy of the original DataFrame to avoid modifying it.
- **Normalization Function:** The `min_max_normalize` function:
  - Computes the minimum and maximum values of the column using Python’s built-in `min` and `max`.
  - Checks whether `max_val == min_val` to avoid division by zero. In that case all values are identical, so it returns a list of zeros.
  - Applies the Min-Max formula using a list comprehension.
- **Apply to Each Column:** We loop through the numeric columns, apply the normalization, and overwrite each column in the copy with its normalized values. The column names stay the same (e.g., `Price`).
- **Display Results:** We print the first rows of the normalized DataFrame to verify the output.

In [4]:
# Manual Min-Max Normalization

# Create a copy of the dataframe to store normalized values
df_normalized = df.copy()

# Function to perform Min-Max normalization on a single column
def min_max_normalize(column):
    min_val = min(column)
    max_val = max(column)
    # Guard against division by zero when every value is identical
    if max_val == min_val:
        return [0.0] * len(column)  # Return zeros if all values are the same
    return [(x - min_val) / (max_val - min_val) for x in column]

# Apply normalization to each numeric column, overwriting it in the copy
for col in numeric_columns:
    df_normalized[col] = min_max_normalize(df[col])

# Display the first few rows of the normalized dataframe
print(tabulate(df_normalized.head(), headers='keys', tablefmt='psql', showindex=False))

+----------+-------------+----------------+-----------------+--------------+------------+
|    Price |   Area_sqft |   Num_Bedrooms |   Num_Bathrooms |   Year_Built |   Lot_Size |
|----------+-------------+----------------+-----------------+--------------+------------|
| 0.485226 |    0.315789 |           0.5  |             1   |    0.722222  |   0.320814 |
| 0.387827 |    0.394588 |           0.25 |             0.5 |    0.402778  |   0.326191 |
| 0.508384 |    0.298246 |           1    |             0.5 |    0.0277778 |   0.380129 |
| 0.642651 |    0.370503 |           1    |             0.5 |    0.583333  |   0.687045 |
| 0.373119 |    0.134701 |           0    |             0   |    0.0833333 |   0.53003  |
+----------+-------------+----------------+-----------------+--------------+------------+
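
For comparison, the same scaling can be written as one vectorized pandas expression. This is a sketch of an alternative to the manual loop, not the method the workshop asks for. It uses only the five preview rows of two columns, so the minimum and maximum are taken over those five rows and the results differ from the full-dataset output above.

```python
import pandas as pd

# First five rows of two columns, copied from the dataset preview
sample = pd.DataFrame({
    "Price": [574507, 479260, 597153, 728454, 464876],
    "Num_Bedrooms": [3, 2, 5, 5, 1],
})

# Vectorized min-max scaling: min() and max() are aligned by column name
scaled = (sample - sample.min()) / (sample.max() - sample.min())

print(scaled.round(4))

# Sanity check: every scaled column should span exactly [0, 1]
print(scaled.min().tolist(), scaled.max().tolist())
```

Unlike `min_max_normalize`, this expression has no guard for constant columns. A column whose values are all identical would come out as NaN rather than zeros.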


### 🔎 Talking Point 1 — [Some insights about the dataset EDA]
 
In our repo, the dataset did not contain any missing values, so we did not need to fill any NaN values. However, we noticed that the other team filled missing values with the median of each column. To understand the dataset better, it is usually good practice to add some visualizations and Exploratory Data Analysis (EDA).
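
A first EDA pass along these lines might count missing values per column and print summary statistics. The sketch below runs on the five preview rows so that it is self-contained. In the notebook itself, the same two calls would be applied to the full `df`.

```python
import pandas as pd

# Preview rows copied from the dataset output earlier in the notebook
sample = pd.DataFrame({
    "Price": [574507, 479260, 597153, 728454, 464876],
    "Area_sqft": [1462, 1727, 1403, 1646, 853],
    "Num_Bedrooms": [3, 2, 5, 5, 1],
    "Lot_Size": [4878, 4943, 5595, 9305, 7407],
})

# Missing values per column: all zeros means no fill is required
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count, mean, spread, and quartiles for each numeric column
summary = sample.describe()
print(summary)
```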
 
Reviewed by:
- Eris Leksi
- Erica Holden
- Reham Abuarqoub
 