# 🏡 Min-Max Normalization Workshop
## Team Name: 
## Team Members: 
---

## ❗ Why We Normalize: The Problem with Raw Feature Scales

In housing data, `Price` can reach the hundreds of thousands and `Lot_Size` the thousands, while features like `Num_Bedrooms` range only from 1 to 5. This mismatch causes problems for algorithms that depend on numeric magnitudes.

---

### ⚠️ What Goes Wrong Without Normalization

---

### 1. 🧭 K-Nearest Neighbors (KNN)

KNN uses the **Euclidean distance** formula:

$$
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + \cdots}
$$

**Example:**

- $ \text{Price}_1 = 650{,}000, \quad \text{Price}_2 = 250{,}000 $
- $ \text{Bedrooms}_1 = 3, \quad \text{Bedrooms}_2 = 2 $

Now compute squared differences:

$$
(\text{Price}_1 - \text{Price}_2)^2 = (650{,}000 - 250{,}000)^2 = (400{,}000)^2 = 1.6 \times 10^{11}
$$
$$
(\text{Bedrooms}_1 - \text{Bedrooms}_2)^2 = (3 - 2)^2 = 1
$$

➡️ **Price dominates the distance calculation**, making smaller-scale features like `Bedrooms` effectively irrelevant.
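
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `euclidean` is an illustrative choice, and the two houses are the ones from the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (Price, Bedrooms) for the two houses in the example
house_1 = (650_000, 3)
house_2 = (250_000, 2)

d_full = euclidean(house_1, house_2)                # both features
d_price_only = euclidean(house_1[:1], house_2[:1])  # Price alone

print(d_full)
print(d_price_only)
print(d_full - d_price_only)  # extra distance contributed by Bedrooms
```

The last value shows how little the one-bedroom difference changes the distance when `Price` is left unscaled.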

---

### 2. 📉 Linear Regression

Linear regression estimates:

$$
y = \beta_1 \cdot \text{Price} + \beta_2 \cdot \text{Bedrooms} + \beta_3 \cdot \text{Lot\_Size} + \epsilon
$$

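If `Price` has very large values and the coefficients are fit with gradient descent:
- Gradient updates for $ \beta_1 $ will be **much larger**
- Gradient updates for $ \beta_2 $ (Bedrooms) will be **very small** by comparison

➡️ A learning rate small enough to keep $ \beta_1 $ stable makes $ \beta_2 $ converge very slowly, so training becomes slow or unstable. The fitted coefficients also end up on very different scales, which makes them hard to compare.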
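
The gap in gradient size follows directly from the squared-error loss, where each coefficient's gradient is proportional to its feature value. The sketch below assumes a single training sample and a hypothetical residual of 1.0; the ratio between gradients does not depend on that choice.

```python
# Squared-error loss for one sample: L = (y - y_hat)^2
# dL/d(beta_j) = -2 * (y - y_hat) * x_j, so each gradient scales with its feature.
features = {"Price": 650_000, "Bedrooms": 3, "Lot_Size": 8_000}
residual = 1.0  # hypothetical (y - y_hat); any nonzero value gives the same ratios

gradients = {name: -2 * residual * x for name, x in features.items()}
for name, g in gradients.items():
    print(f"{name:>9}: {g:,.0f}")

ratio = gradients["Price"] / gradients["Bedrooms"]
print(f"Price/Bedrooms gradient ratio: {ratio:,.0f}")
```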

---

### 3. 🧠 Neural Networks

A single neuron computes:

$$
z = w_1 \cdot \text{Price} + w_2 \cdot \text{Bedrooms} + w_3 \cdot \text{Lot\_Size}
$$

If:

- $ \text{Price} = 650{,}000 $
- $ \text{Bedrooms} = 3 $
- $ \text{Lot\_Size} = 8{,}000 $

Then:

$$
z \approx w_1 \cdot 650{,}000 + w_2 \cdot 3 + w_3 \cdot 8{,}000
$$

➡️ Even with equal weights, `Price` contributes **most of the activation**, making it difficult for the network to learn from other features.
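
To make the imbalance concrete, the following sketch computes each feature's share of $z$ for the values above. The equal weights of 1.0 are an assumption chosen only for illustration.

```python
features = {"Price": 650_000, "Bedrooms": 3, "Lot_Size": 8_000}
weights = {name: 1.0 for name in features}  # hypothetical equal weights

# Contribution of each term w_i * x_i to the weighted sum z
contributions = {name: weights[name] * x for name, x in features.items()}
z = sum(contributions.values())
shares = {name: c / z for name, c in contributions.items()}

for name, share in shares.items():
    print(f"{name:>9}: {share:.4%} of z")
```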

---

### ✅ Solution: Min-Max Normalization

We apply the transformation:

$$
x_{\text{normalized}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
$$

This scales all features to a common range (typically $[0, 1]$).

| Feature      | Raw Value | Min     | Max     | Normalized Value |
|--------------|-----------|---------|---------|------------------|
| Price        | 650,000   | 250,000 | 800,000 | 0.727            |
| Bedrooms     | 3         | 1       | 5       | 0.500            |
| Lot_Size     | 8,000     | 3,000   | 10,000  | 0.714            |

➡️ Now **all features share the same scale**, so no single feature dominates model training or distance comparisons by magnitude alone.
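
The table's arithmetic can be reproduced with a small helper. This is a sketch: `min_max` is an illustrative name, and the raw, min, and max values are taken from the table, with results rounded to three decimals.

```python
def min_max(x, x_min, x_max):
    """Apply the Min-Max formula to a single value."""
    return (x - x_min) / (x_max - x_min)

# (raw value, min, max) for each feature in the table
rows = {
    "Price": (650_000, 250_000, 800_000),
    "Bedrooms": (3, 1, 5),
    "Lot_Size": (8_000, 3_000, 10_000),
}

normalized = {name: round(min_max(*values), 3) for name, values in rows.items()}
for name, value in normalized.items():
    print(f"{name:>9}: {value}")
```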

---

## Use Case: Housing Data
We are normalizing features from a real estate dataset to prepare it for machine learning analysis.

In [None]:
#  Load and display dataset
import pandas as pd
df = pd.read_csv('housing_data.csv')
df.head()

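|   | House_ID | Price  | Area_sqft | Num_Bedrooms | Num_Bathrooms | Year_Built | Lot_Size |
|---|----------|--------|-----------|--------------|---------------|------------|----------|
| 0 | H100000  | 574507 | 1462      | 3            | 3             | 2002       | 4878     |
| 1 | H100001  | 479260 | 1727      | 2            | 2             | 1979       | 4943     |
| 2 | H100002  | 597153 | 1403      | 5            | 2             | 1952       | 5595     |
| 3 | H100003  | 728454 | 1646      | 5            | 2             | 1992       | 9305     |
| 4 | H100004  | 464876 | 853       | 1            | 1             | 1956       | 7407     |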


In [None]:
# Inspect the dataset structure
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   House_ID       2000 non-null   object
 1   Price          2000 non-null   int64 
 2   Area_sqft      2000 non-null   int64 
 3   Num_Bedrooms   2000 non-null   int64 
 4   Num_Bathrooms  2000 non-null   int64 
 5   Year_Built     2000 non-null   int64 
 6   Lot_Size       2000 non-null   int64 
dtypes: int64(6), object(1)
memory usage: 109.5+ KB


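|       | Price     | Area_sqft  | Num_Bedrooms | Num_Bathrooms | Year_Built | Lot_Size    |
|-------|-----------|------------|--------------|---------------|------------|-------------|
| count | 2000.0    | 2000.0     | 2000.0       | 2000.0        | 2000.0     | 2000.0      |
| mean  | 506896.1  | 1796.453   | 2.9835       | 1.966         | 1985.6895  | 6025.246    |
| std   | 147878.6  | 502.185109 | 1.409333     | 0.825945      | 21.159536  | 2008.527265 |
| min   | 100000.0  | 400.0      | 1.0          | 1.0           | 1950.0     | 1000.0      |
| 25%   | 406600.2  | 1445.0     | 2.0          | 1.0           | 1967.0     | 4664.0      |
| 50%   | 506703.0  | 1799.5     | 3.0          | 2.0           | 1986.0     | 6010.5      |
| 75%   | 602445.8  | 2132.0     | 4.0          | 3.0           | 2003.0     | 7414.0      |
| max   | 1077909.0 | 3763.0     | 5.0          | 3.0           | 2022.0     | 13088.0     |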


## Step 2: Explore the Dataset

We inspect the data types and basic statistics of the dataset using `.info()` and `.describe()`. This helps us identify which columns are numeric and suitable for normalization.
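
Besides reading the `.info()` output by eye, the numeric columns can be listed programmatically. The sketch below inlines the five preview rows so it runs without the CSV file; `df_sample` and `numeric_cols` are illustrative names.

```python
import pandas as pd

# First five rows from the dataset preview, inlined so the sketch is self-contained
df_sample = pd.DataFrame({
    "House_ID": ["H100000", "H100001", "H100002", "H100003", "H100004"],
    "Price": [574507, 479260, 597153, 728454, 464876],
    "Area_sqft": [1462, 1727, 1403, 1646, 853],
    "Num_Bedrooms": [3, 2, 5, 5, 1],
    "Num_Bathrooms": [3, 2, 2, 2, 1],
    "Year_Built": [2002, 1979, 1952, 1992, 1956],
    "Lot_Size": [4878, 4943, 5595, 9305, 7407],
})

# Keep only columns with a numeric dtype; House_ID (object) is excluded
numeric_cols = df_sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

A numeric dtype alone does not make a column a good candidate for scaling, so the resulting list is a starting point to review rather than a final selection.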


In [None]:
#Step 3: Select Columns for Min-Max Normalization
columns_to_normalize = ['Price', 'Area_sqft', 'Num_Bedrooms', 'Num_Bathrooms', 'Lot_Size']


## Step 3: Select Columns for Min-Max Normalization

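We select only continuous or ordinal numeric columns that influence model predictions. `House_ID` is an identifier rather than a feature, and `Year_Built` is left unscaled in this workshop. For this dataset, we choose: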

- Price
- Area (sqft)
- Number of Bedrooms
- Number of Bathrooms
- Lot Size


In [6]:
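# Define and apply Min-Max normalization
def min_max_normalize(series):
    """Rescale a numeric Series to [0, 1]; assumes series.max() > series.min()."""
    return (series - series.min()) / (series.max() - series.min())

# Work on a copy so the raw values in df stay available for comparison
df_normalized = df.copy()

for col in columns_to_normalize:
    df_normalized[col] = min_max_normalize(df[col])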



## ⚙️ Step 4: Apply Min-Max Normalization

We apply a custom Min-Max function to each selected column. This scales the data to a common range of [0, 1], making features comparable.
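
One way to check the function is to redo the calculation by hand for a single house. The sketch below takes the raw values of house `H100000` and the column minimums and maximums reported by `.describe()`, then applies the formula directly; the dictionary names are illustrative.

```python
# Raw values for house H100000 and (min, max) per column from df.describe()
raw = {"Price": 574507, "Area_sqft": 1462, "Num_Bedrooms": 3,
       "Num_Bathrooms": 3, "Lot_Size": 4878}
bounds = {"Price": (100000, 1077909), "Area_sqft": (400, 3763),
          "Num_Bedrooms": (1, 5), "Num_Bathrooms": (1, 3),
          "Lot_Size": (1000, 13088)}

# Apply (x - min) / (max - min) column by column
by_hand = {col: (raw[col] - lo) / (hi - lo) for col, (lo, hi) in bounds.items()}
for col, value in by_hand.items():
    print(f"{col:>13}: {value:.6f}")
```

These values should agree with the first row of `df_normalized`.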


In [7]:
df_normalized.head()


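|   | House_ID | Price    | Area_sqft | Num_Bedrooms | Num_Bathrooms | Year_Built | Lot_Size |
|---|----------|----------|-----------|--------------|---------------|------------|----------|
| 0 | H100000  | 0.485226 | 0.315789  | 0.5          | 1.0           | 2002       | 0.320814 |
| 1 | H100001  | 0.387827 | 0.394588  | 0.25         | 0.5           | 1979       | 0.326191 |
| 2 | H100002  | 0.508384 | 0.298246  | 1.0          | 0.5           | 1952       | 0.380129 |
| 3 | H100003  | 0.642651 | 0.370503  | 1.0          | 0.5           | 1992       | 0.687045 |
| 4 | H100004  | 0.373119 | 0.134701  | 0.0          | 0.0           | 1956       | 0.53003  |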


## Step 5: Preview the Normalized Data

We display the top rows of the normalized dataset to confirm that the selected columns now fall within [0, 1], while `House_ID` and `Year_Built` are unchanged.
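
A visual check can be backed by a programmatic one. The sketch below uses a few of the normalized preview values (only three columns, for brevity) and tests that the scaled columns stay inside [0, 1]; the names `preview`, `scaled_cols`, `in_unit_range`, and `year_untouched` are illustrative.

```python
import pandas as pd

# Normalized preview rows copied from the notebook output (three columns for brevity)
preview = pd.DataFrame({
    "Price": [0.485226, 0.387827, 0.508384, 0.642651, 0.373119],
    "Num_Bedrooms": [0.5, 0.25, 1.0, 1.0, 0.0],
    "Year_Built": [2002, 1979, 1952, 1992, 1956],
})
scaled_cols = ["Price", "Num_Bedrooms"]

# Every scaled value should lie in [0, 1]
in_unit_range = bool(
    preview[scaled_cols].ge(0).all().all() and preview[scaled_cols].le(1).all().all()
)
# Year_Built was not selected for scaling, so it should still hold raw years
year_untouched = bool(preview["Year_Built"].min() > 1)

print(in_unit_range, year_untouched)
```

On the full dataset, the same idea extends to checking that each scaled column has a minimum of 0 and a maximum of 1.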


## Peer Review: Talking Points from Team XYZ
### Talking Point 1 — Inconsistent Normalization Logic

**Issue:**  
The normalization logic is applied using `df[col] = ...` inside a loop, but `df` is not copied first. This modifies the original data, which could cause confusion or data leakage in future steps.

**Why It Matters:**  
Maintaining the original dataset (`df`) allows for comparisons between raw and normalized data, which is useful during debugging or further EDA. Mutating the original DataFrame directly is risky.

**Suggested Fix:**  
Use `df_normalized = df.copy()` before applying normalization to preserve the original.
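
A minimal sketch of the suggested fix, using three prices from the dataset preview. The names `df_raw` and `df_scaled` are illustrative.

```python
import pandas as pd

df_raw = pd.DataFrame({"Price": [574507, 479260, 597153]})

df_scaled = df_raw.copy()  # independent copy; edits here do not touch df_raw
price = df_raw["Price"]
df_scaled["Price"] = (price - price.min()) / (price.max() - price.min())

print(df_raw["Price"].tolist())     # raw prices are still intact
print(df_scaled["Price"].tolist())  # scaled values live only in the copy
```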

---

### Talking Point 2 — Missing Markdown Documentation

**Issue:**  
There is minimal or no Markdown explaining the logic of the Min-Max function or why specific columns were chosen.

**Why It Matters:**  
Clear Markdown improves readability, shows intent behind code choices, and helps teammates or reviewers understand the rationale.

**Suggested Fix:**  
Add explanatory Markdown before each major code block (especially the normalization loop) and describe column selection choices.

---

### Talking Point 3 — Redundant Calculation

**Issue:**  
The min and max values are recalculated inside each function call, which is inefficient for large datasets.

**Why It Matters:**  
Recomputing `min()` and `max()` repeatedly for the same column can be a performance issue.

**Suggested Fix:**  
Pre-calculate and store `min` and `max` before the loop, or consider using vectorized operations for better efficiency.
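
One possible shape for this fix is sketched below, assuming pandas and a small inline sample in place of the full dataset. It collects the minimum and maximum of every column in one `agg` call, then scales all columns in a single vectorized expression; `df_small`, `stats`, and `scaled` are illustrative names.

```python
import pandas as pd

df_small = pd.DataFrame({
    "Price": [574507, 479260, 597153, 728454, 464876],
    "Lot_Size": [4878, 4943, 5595, 9305, 7407],
})
cols = ["Price", "Lot_Size"]

# One call collects the statistics; rows are "min" and "max", columns are features
stats = df_small[cols].agg(["min", "max"])
col_min = stats.loc["min"]
col_range = stats.loc["max"] - stats.loc["min"]

# Reuse the stored statistics for every column at once (no explicit loop)
scaled = (df_small[cols] - col_min) / col_range
print(scaled)
```

Keeping `stats` around also makes it possible to apply the same scaling to rows that arrive later, instead of recomputing the bounds from new data.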
