# Module 3: Scaling, Encoding, and Distance Measures

## 📚 What You Will Learn:

**Part 1: Scaling (Putting Data on the Same Scale)**
- ✅ Standardization - Centering data at 0 with a spread of 1
- ✅ Min-Max Scaling - Fitting data between 0 and 1
- ✅ Robust Scaling - Handling extreme values

**Part 2: Encoding (Converting Categories to Numbers)**
- ✅ One-Hot Encoding - For categories with no order
- ✅ Ordinal Encoding - For categories with order

**Part 3: Distance (Measuring How Far Apart Things Are)**
- ✅ Euclidean Distance - Straight line distance
- ✅ Manhattan Distance - Block-by-block distance


In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import warnings
warnings.filterwarnings('ignore')

## 1. Standardization (Z-score Scaling)

### 🎯 What is Standardization?
- It centers each feature around zero and rescales it so its spread is 1
- It helps when different features have different ranges

### 📐 Formula:
$z = \frac{x - \mu}{\sigma}$

### 📝 Simple Meaning:
- **x** = Your original number
- **μ (mu)** = Average of all numbers
- **σ (sigma)** = How spread out the numbers are (standard deviation)

### 💡 When to Use:
- When you want to compare different features equally
- When your data has different units (like height in cm and weight in kg)

### ✨ Result:
- After standardization:
  - Average becomes **0**
  - Standard deviation becomes **1**
  - All values are now comparable

In [2]:
# Create dataset for Standardization
data_standard = {
    'Person': ['A', 'B', 'C', 'D', 'E'],
    'Height (cm)': [150, 160, 170, 180, 190],
    'Weight (Kg)': [50, 60, 70, 80, 90]
}

df_standard = pd.DataFrame(data_standard)
print("Original Data:")
print(df_standard)
print("\n" + "="*60)

Original Data:
  Person  Height (cm)  Weight (Kg)
0      A          150           50
1      B          160           60
2      C          170           70
3      D          180           80
4      E          190           90



In [3]:
# Manual Calculation for Height
heights = df_standard['Height (cm)'].values
mean_height = np.mean(heights)
std_height = np.std(heights, ddof=0)  # Population std

print("Height Statistics:")
print(f"Mean (μ): {mean_height}")
print(f"Standard Deviation (σ): {std_height:.4f}")
print("\nManual Z-score Calculation for Height:")

z_scores_height_manual = []
for h in heights:
    z = (h - mean_height) / std_height
    z_scores_height_manual.append(z)
    print(f"Height {h}: z = ({h} - {mean_height}) / {std_height:.4f} = {z:.4f}")

df_standard['Height_Z_Manual'] = z_scores_height_manual
print("\n" + "="*60)

Height Statistics:
Mean (μ): 170.0
Standard Deviation (σ): 14.1421

Manual Z-score Calculation for Height:
Height 150: z = (150 - 170.0) / 14.1421 = -1.4142
Height 160: z = (160 - 170.0) / 14.1421 = -0.7071
Height 170: z = (170 - 170.0) / 14.1421 = 0.0000
Height 180: z = (180 - 170.0) / 14.1421 = 0.7071
Height 190: z = (190 - 170.0) / 14.1421 = 1.4142



In [4]:
# Using sklearn StandardScaler
scaler_standard = StandardScaler()
scaled_values = scaler_standard.fit_transform(df_standard[['Height (cm)', 'Weight (Kg)']])

df_standard['Height_Z_Score'] = scaled_values[:, 0]
df_standard['Weight_Z_Score'] = scaled_values[:, 1]

print("Standardized Data (Z-scores):")
print(df_standard)
print("\n" + "="*60)
print(f"\nVerification - Height Z-scores:")
print(f"Mean: {df_standard['Height_Z_Score'].mean():.10f} (should be ~0)")
print(f"Std Dev: {df_standard['Height_Z_Score'].std(ddof=0):.10f} (should be 1)")

Standardized Data (Z-scores):
  Person  Height (cm)  Weight (Kg)  Height_Z_Manual  Height_Z_Score  \
0      A          150           50        -1.414214       -1.414214   
1      B          160           60        -0.707107       -0.707107   
2      C          170           70         0.000000        0.000000   
3      D          180           80         0.707107        0.707107   
4      E          190           90         1.414214        1.414214   

   Weight_Z_Score  
0       -1.414214  
1       -0.707107  
2        0.000000  
3        0.707107  
4        1.414214  


Verification - Height Z-scores:
Mean: 0.0000000000 (should be ~0)
Std Dev: 1.0000000000 (should be 1)
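
The verification above passes `ddof=0` on purpose. The `StandardScaler` output matches the manual calculation, which used the population standard deviation, whereas pandas defaults to the sample version (`ddof=1`). As a minimal sketch (variable names here are illustrative, not from the notebook), checking the z-scores with a bare `.std()` would not give exactly 1:

```python
import pandas as pd

heights = pd.Series([150, 160, 170, 180, 190], dtype=float)

pop_std = heights.std(ddof=0)   # population std: divides by n
sample_std = heights.std()      # pandas default ddof=1: divides by n - 1

# Same z-score formula as the manual cell, using the population std
z = (heights - heights.mean()) / pop_std

print(f"Population std (ddof=0): {pop_std:.4f}")
print(f"Sample std (ddof=1):     {sample_std:.4f}")
print(f"Std of z with ddof=0:    {z.std(ddof=0):.4f}")
print(f"Std of z with ddof=1:    {z.std():.4f}")
```

With only five rows the gap is visible; on large datasets the two versions are nearly identical.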


## 2. Min-Max Scaling (Rescaling to [0, 1])

### 🎯 What is Min-Max Scaling?
- It squeezes all data between 0 and 1
- The smallest value becomes 0
- The largest value becomes 1
- Everything else falls in between

### 📐 Formula:
$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$

### 📝 Simple Meaning:
- **x** = Your original number
- **x_min** = The smallest number in your data
- **x_max** = The biggest number in your data
- **x'** = Your new scaled number (between 0 and 1)

### 💡 When to Use:
- When you need values in a specific range (0 to 1)
- When working with neural networks

### ✨ Result:
- All values now fit between 0 and 1
- The order and relative spacing of the values are preserved

### ⚠️ Warning:
- Be careful with extreme values (outliers)
- One very large value squeezes all the other values into a narrow band near 0

In [5]:
# Create dataset for Min-Max Scaling
data_minmax = {
    'Person': ['A', 'B', 'C', 'D', 'E'],
    'Height (cm)': [150, 160, 170, 180, 190],
    'Weight (Kg)': [50, 60, 70, 80, 90]
}

df_minmax = pd.DataFrame(data_minmax)
print("Original Data:")
print(df_minmax)
print("\n" + "="*60)

Original Data:
  Person  Height (cm)  Weight (Kg)
0      A          150           50
1      B          160           60
2      C          170           70
3      D          180           80
4      E          190           90



In [6]:
# Manual Calculation for Weight
weights = df_minmax['Weight (Kg)'].values
min_weight = np.min(weights)
max_weight = np.max(weights)

print(f"Weight Statistics:")
print(f"Min (x_min): {min_weight}")
print(f"Max (x_max): {max_weight}")
print("\nManual Min-Max Scaling Calculation for Weight:")

scaled_weights_manual = []
for w in weights:
    scaled = (w - min_weight) / (max_weight - min_weight)
    scaled_weights_manual.append(scaled)
    print(f"Weight {w}: x' = ({w} - {min_weight}) / ({max_weight} - {min_weight}) = {scaled:.4f}")

df_minmax['Weight_Scaled_Manual'] = scaled_weights_manual
print("\n" + "="*60)

Weight Statistics:
Min (x_min): 50
Max (x_max): 90

Manual Min-Max Scaling Calculation for Weight:
Weight 50: x' = (50 - 50) / (90 - 50) = 0.0000
Weight 60: x' = (60 - 50) / (90 - 50) = 0.2500
Weight 70: x' = (70 - 50) / (90 - 50) = 0.5000
Weight 80: x' = (80 - 50) / (90 - 50) = 0.7500
Weight 90: x' = (90 - 50) / (90 - 50) = 1.0000



In [7]:
# Using sklearn MinMaxScaler
scaler_minmax = MinMaxScaler()
scaled_values = scaler_minmax.fit_transform(df_minmax[['Height (cm)', 'Weight (Kg)']])

df_minmax['Height_Scaled'] = scaled_values[:, 0]
df_minmax['Weight_Scaled'] = scaled_values[:, 1]

print("Min-Max Scaled Data [0, 1]:")
print(df_minmax)
print("\n" + "="*60)
print(f"\nVerification - Weight Scaled:")
print(f"Min: {df_minmax['Weight_Scaled'].min()} (should be 0)")
print(f"Max: {df_minmax['Weight_Scaled'].max()} (should be 1)")

Min-Max Scaled Data [0, 1]:
  Person  Height (cm)  Weight (Kg)  Weight_Scaled_Manual  Height_Scaled  \
0      A          150           50                  0.00           0.00   
1      B          160           60                  0.25           0.25   
2      C          170           70                  0.50           0.50   
3      D          180           80                  0.75           0.75   
4      E          190           90                  1.00           1.00   

   Weight_Scaled  
0           0.00  
1           0.25  
2           0.50  
3           0.75  
4           1.00  


Verification - Weight Scaled:
Min: 0.0 (should be 0)
Max: 1.0 (should be 1)
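
One caveat: the 0-to-1 guarantee holds only for the data the minimum and maximum were computed from. The sketch below assumes two hypothetical new weights (40 and 100 kg) that are not in the original table and reuses the original min and max, which is also how a fitted `MinMaxScaler` treats new rows by default:

```python
import numpy as np

train_weights = np.array([50, 60, 70, 80, 90], dtype=float)
x_min, x_max = train_weights.min(), train_weights.max()

# Hypothetical unseen values, chosen to fall outside the original range
new_weights = np.array([40, 70, 100], dtype=float)

# Same formula as above, but with the min and max taken from the original data
scaled_new = (new_weights - x_min) / (x_max - x_min)
print(scaled_new)  # values below 0 and above 1 are possible
```

If later data may exceed the original range, the scaled values need to be clipped or the downstream model needs to tolerate them.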


## 3. Robust Scaling (Outlier-Resistant)

### 🎯 What is Robust Scaling?
- It handles extreme values (outliers) very well
- It uses median (middle value) instead of mean (average)
- Extreme values don't mess up the scaling of the other values

### 📐 Formula:
$x' = \frac{x - Q_2}{IQR}$

### 📝 Simple Meaning:
- **x** = Your original number
- **Q1** = 25th percentile (25% of data is below this)
- **Q2** = Median (middle value when sorted)
- **Q3** = 75th percentile (75% of data is below this)
- **IQR** = Q3 - Q1 (range of middle 50% data)

### ✨ Why It's Special:
- **The median and IQR barely move when extreme values are present**
- Uses middle values, not averages
- Example: If one person is 300 cm tall, it won't ruin the scaling for everyone else (that person still gets a large scaled value)

### 💡 When to Use:
- When your data has outliers (very large or very small values)
- When data has extreme values that are real (not mistakes)
- When you want safer scaling

### 📊 Comparison:
- **Robust**: Not sensitive to outliers ✅
- **Standardization**: Somewhat sensitive ⚠️
- **Min-Max**: Sensitive to outliers ❌

In [8]:
# Create dataset with outliers for Robust Scaling
data_robust = {
    'Person': ['A', 'B', 'C', 'D', 'E'],
    'Height (cm)': [150, 160, 170, 180, 300],  # E is outlier
    'Weight (Kg)': [50, 60, 70, 80, 200]       # E is outlier
}

df_robust = pd.DataFrame(data_robust)
print("Original Data (with Outliers):")
print(df_robust)
print("\n" + "="*60)

Original Data (with Outliers):
  Person  Height (cm)  Weight (Kg)
0      A          150           50
1      B          160           60
2      C          170           70
3      D          180           80
4      E          300          200



In [9]:
# Manual Calculation for Weight
weights_robust = df_robust['Weight (Kg)'].values
Q1 = np.percentile(weights_robust, 25)
Q2 = np.percentile(weights_robust, 50)  # Median
Q3 = np.percentile(weights_robust, 75)
IQR = Q3 - Q1

print(f"Weight Statistics:")
print(f"Q1 (25th percentile): {Q1}")
print(f"Q2 (Median, 50th percentile): {Q2}")
print(f"Q3 (75th percentile): {Q3}")
print(f"IQR (Q3 - Q1): {IQR}")
print("\nManual Robust Scaling Calculation for Weight:")

robust_scaled_manual = []
for w in weights_robust:
    scaled = (w - Q2) / IQR
    robust_scaled_manual.append(scaled)
    print(f"Weight {w}: x' = ({w} - {Q2}) / {IQR} = {scaled:.4f}")

df_robust['Weight_Robust_Manual'] = robust_scaled_manual
print("\n" + "="*60)

Weight Statistics:
Q1 (25th percentile): 60.0
Q2 (Median, 50th percentile): 70.0
Q3 (75th percentile): 80.0
IQR (Q3 - Q1): 20.0

Manual Robust Scaling Calculation for Weight:
Weight 50: x' = (50 - 70.0) / 20.0 = -1.0000
Weight 60: x' = (60 - 70.0) / 20.0 = -0.5000
Weight 70: x' = (70 - 70.0) / 20.0 = 0.0000
Weight 80: x' = (80 - 70.0) / 20.0 = 0.5000
Weight 200: x' = (200 - 70.0) / 20.0 = 6.5000



In [10]:
# Using sklearn RobustScaler
scaler_robust = RobustScaler()
scaled_values = scaler_robust.fit_transform(df_robust[['Height (cm)', 'Weight (Kg)']])

df_robust['Height_Robust'] = scaled_values[:, 0]
df_robust['Weight_Robust'] = scaled_values[:, 1]

print("Robust Scaled Data:")
print(df_robust)
print("\n" + "="*60)
print("\nNotice: Person E still stands out (6.5), but A-D keep a sensible spread because the median and IQR are not pulled by the outlier")

Robust Scaled Data:
  Person  Height (cm)  Weight (Kg)  Weight_Robust_Manual  Height_Robust  \
0      A          150           50                  -1.0           -1.0   
1      B          160           60                  -0.5           -0.5   
2      C          170           70                   0.0            0.0   
3      D          180           80                   0.5            0.5   
4      E          300          200                   6.5            6.5   

   Weight_Robust  
0           -1.0  
1           -0.5  
2            0.0  
3            0.5  
4            6.5  


Notice: Person E still stands out (6.5), but A-D keep a sensible spread because the median and IQR are not pulled by the outlier
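
To make the comparison between the three methods concrete, here is a rough sketch that runs all three scalers on the same outlier-containing weight column and measures how much room the four ordinary values (A to D) get after scaling. The `spread` helper is a made-up name for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Same weights as the robust-scaling example; the last value is the outlier
weights = np.array([[50], [60], [70], [80], [200]], dtype=float)

def spread(scaled):
    """Range covered by the four non-outlier rows after scaling."""
    normal = scaled[:4, 0]
    return normal.max() - normal.min()

scalers = {
    "Standard": StandardScaler(),
    "Min-Max": MinMaxScaler(),
    "Robust": RobustScaler(),
}

spreads = {}
for name, scaler in scalers.items():
    scaled = scaler.fit_transform(weights)
    spreads[name] = spread(scaled)
    print(f"{name:8s} scaled: {np.round(scaled[:, 0], 3)}  spread of A-D: {spreads[name]:.3f}")
```

The outlier pulls the mean, standard deviation, and maximum toward itself, so Standard and Min-Max compress A to D into a narrow band. The median and IQR do not move, so Robust keeps them spread out.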


## 4. Understanding Categories: Nominal vs Ordinal

### 🎨 Nominal Features (No Order)
**What are they?**
- Categories that have NO natural order
- No category is bigger or better than another

**Examples:**
- 🎨 Colors: red, blue, green (which color is bigger? None!)
- 🌍 Countries: USA, India, Japan (no order)
- 🍕 Food types: Pizza, Burger, Salad (no ranking)

**How to convert?**
- Use **One-Hot Encoding**
- Creates separate columns for each category
- Uses 1 (yes) or 0 (no)

---

### 📏 Ordinal Features (With Order)
**What are they?**
- Categories that HAVE a natural order
- You can say one is bigger/better/higher than another

**Examples:**
- 👕 Size: Small < Medium < Large (clear order!)
- 🎓 Education: High School < Bachelor < Master < PhD
- ⭐ Ratings: 1 star < 2 stars < 3 stars < 4 stars < 5 stars

**How to convert?**
- Use **Ordinal Encoding**
- Assign numbers based on order
- Example: Small=1, Medium=2, Large=3

---

### 🔑 Key Difference:
- **Nominal**: No order → Use One-Hot Encoding
- **Ordinal**: Has order → Use Ordinal Encoding

In [11]:
# Create tiny dataset
data_encoding = {
    'id': [1, 2, 3, 4],
    'color': ['red', 'blue', 'green', 'red'],
    'size': ['Small', 'Medium', 'Large', 'Medium'],
    'price': [10, 12, 15, 11]
}

df_encoding = pd.DataFrame(data_encoding)
print("Original Tiny Dataset:")
print(df_encoding)
print("\n" + "="*60)

Original Tiny Dataset:
   id  color    size  price
0   1    red   Small     10
1   2   blue  Medium     12
2   3  green   Large     15
3   4    red  Medium     11



### 🎨 One-Hot Encoding Step-by-Step

### 📋 What We're Doing:
- Converting colors (red, blue, green) to numbers
- Creating separate columns for each color

### 🔧 Steps:

**Step 1: Create New Columns**
- Make one column for each color:
  - `Color_red`
  - `Color_blue`
  - `Color_green`

**Step 2: Fill with 1 or 0**
- Put **1** if the row has that color
- Put **0** if the row doesn't have that color

**Step 3: Result**
- Each color becomes a pattern of 1s and 0s:
  - 🔴 red → [1, 0, 0] (only red column is 1)
  - 🔵 blue → [0, 1, 0] (only blue column is 1)
  - 🟢 green → [0, 0, 1] (only green column is 1)

### 💡 Think of It Like:
- Asking three questions:
  1. Is it red? (Yes=1, No=0)
  2. Is it blue? (Yes=1, No=0)
  3. Is it green? (Yes=1, No=0)

### ✅ Why This Works:
- Each color is independent
- No color is treated as "bigger" than another
- Computer can now understand colors as numbers

In [12]:
# Manual One-Hot Encoding
df_onehot = df_encoding.copy()

print("Step 1: Creating New Columns")
print("Columns: Color_red, Color_blue, Color_green\n")

# Manual assignment
df_onehot['Color_red'] = df_onehot['color'].apply(lambda x: 1 if x == 'red' else 0)
df_onehot['Color_blue'] = df_onehot['color'].apply(lambda x: 1 if x == 'blue' else 0)
df_onehot['Color_green'] = df_onehot['color'].apply(lambda x: 1 if x == 'green' else 0)

print("Step 2: The Transformed Table:")
print(df_onehot)
print("\n" + "="*60)

Step 1: Creating New Columns
Columns: Color_red, Color_blue, Color_green

Step 2: The Transformed Table:
   id  color    size  price  Color_red  Color_blue  Color_green
0   1    red   Small     10          1           0            0
1   2   blue  Medium     12          0           1            0
2   3  green   Large     15          0           0            1
3   4    red  Medium     11          1           0            0



In [13]:
# Step 3: Row by Row Visualization
print("Step 3: Row by Row Visualization:")
for idx, row in df_onehot.iterrows():
    vector = [row['Color_red'], row['Color_blue'], row['Color_green']]
    print(f"id {row['id']}: {row['color']} → {vector}")

print("\n" + "="*60)

Step 3: Row by Row Visualization:
id 1: red → [1, 0, 0]
id 2: blue → [0, 1, 0]
id 3: green → [0, 0, 1]
id 4: red → [1, 0, 0]
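
The same vectors can also be produced with scikit-learn. This is a sketch, not the notebook's method. Note that `OneHotEncoder` sorts categories alphabetically, so the columns come out as blue, green, red rather than the red, blue, green order used in the manual cells.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(df[['color']]).toarray()

print("Column order:", list(encoder.categories_[0]))
print(encoded)
```

Because the column order differs, red appears here as [0, 0, 1]. The pattern is still "exactly one 1 per row", which is what matters.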



In [14]:
# Using pandas get_dummies
df_onehot_pd = pd.get_dummies(df_encoding, columns=['color'], prefix='Color')
print("Using pandas get_dummies():")
print(df_onehot_pd)

Using pandas get_dummies():
   id    size  price  Color_blue  Color_green  Color_red
0   1   Small     10       False        False       True
1   2  Medium     12        True        False      False
2   3   Large     15       False         True      False
3   4  Medium     11       False        False       True
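
Recent pandas versions return `True`/`False` columns from `get_dummies`, as in the output above. If 0/1 integers are wanted, to match the manual table, one option is the `dtype` argument. A minimal sketch on a reduced copy of the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'color': ['red', 'blue', 'green', 'red'],
})

# dtype=int asks for 0/1 integers instead of booleans
dummies = pd.get_dummies(df, columns=['color'], prefix='Color', dtype=int)
print(dummies)
```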


## 5. Ordinal Encoding (Encoding with Order)

### 🎯 What is Ordinal Encoding?
- Converting categories that have order into numbers
- Smaller number = lower rank
- Bigger number = higher rank

### 📏 Example: T-Shirt Sizes
- Small → 1 (smallest)
- Medium → 2 (middle)
- Large → 3 (biggest)

### 🔧 How It Works:
1. **Decide the order** (which is smallest, which is biggest)
2. **Assign numbers** based on that order
3. **Keep the ranking** in the numbers

### ✅ Why Use It:
- Keeps the natural order in your data
- Computer understands: 3 > 2 > 1
- Simple and straightforward

### ⚠️ Important:
- Only use when categories have clear order
- Don't use for nominal features (like colors)

In [15]:
# Ordinal Encoding for size feature
df_ordinal = df_encoding.copy()

print("Step 1: Decide Order:")
print("Small: 1, Medium: 2, Large: 3")
print("\n" + "="*60)

# Manual mapping
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
df_ordinal['Size_encode'] = df_ordinal['size'].map(size_mapping)

print("\nStep 2: The Transformed Table:")
print(df_ordinal)
print("\n" + "="*60)

Step 1: Decide Order:
Small: 1, Medium: 2, Large: 3


Step 2: The Transformed Table:
   id  color    size  price  Size_encode
0   1    red   Small     10            1
1   2   blue  Medium     12            2
2   3  green   Large     15            3
3   4    red  Medium     11            2



In [16]:
# Alternative: Using sklearn OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df_ordinal_sklearn = df_encoding.copy()
df_ordinal_sklearn['Size_encode_sklearn'] = ordinal_encoder.fit_transform(df_ordinal_sklearn[['size']])

print("Using sklearn OrdinalEncoder:")
print(df_ordinal_sklearn[['id', 'size', 'Size_encode_sklearn']])

Using sklearn OrdinalEncoder:
   id    size  Size_encode_sklearn
0   1   Small                  0.0
1   2  Medium                  1.0
2   3   Large                  2.0
3   4  Medium                  1.0
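
The manual mapping starts at 1, while `OrdinalEncoder` starts at 0 and returns floats. Both preserve the order, which is the part that matters. If the two columns need to agree, a small adjustment like the following sketch works (the column name `Size_encode` is reused from the manual cell):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['Small', 'Medium', 'Large', 'Medium']})

encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
codes = encoder.fit_transform(df[['size']]).ravel()  # 0-based float codes

# Shift to 1-based integers so the result matches the manual size_mapping
df['Size_encode'] = (codes + 1).astype(int)
print(df)
```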


## 6. Distance Measures (How Far Apart Are Things?)

### 🎯 Why Measure Distance?
- To find which data points are similar
- To find nearest neighbors
- To group similar things together

---

### 📏 Type 1: Euclidean Distance (Straight Line)

**What is it?**
- The shortest straight line between two points
- Like: "As the crow flies"
- This is the normal distance we think of

**Formula:** $d_2(p,q) = \sqrt{\sum_{i=1}^{k} (p_i - q_i)^2}$

**Simple Steps:**
1. Find differences in each direction
2. Square each difference
3. Add them all up
4. Take square root

**Example:**
- Point A = [70, 80]
- Point B = [75, 70]
- Absolute differences = [5, 10]
- Distance = √(5² + 10²) = √125 ≈ 11.18

---

### 🏙️ Type 2: Manhattan Distance (City Block)

**What is it?**
- The distance walking on a city grid
- Like: Taxi cab distance in New York
- You can only move horizontally or vertically

**Formula:** $d_1(p,q) = \sum_{i=1}^{k} |p_i - q_i|$

**Simple Steps:**
1. Find differences in each direction
2. Take absolute value (make positive)
3. Add them all up
4. Done! (no square root needed)

**Example:**
- Point A = [70, 80]
- Point B = [75, 70]
- Absolute differences = [5, 10]
- Distance = |5| + |10| = 15

---

### 🔍 Key Differences:

| Feature | Euclidean | Manhattan |
|---------|-----------|------------|
| **Path** | Straight line ✈️ | Grid blocks 🚕 |
| **Formula** | Has √ (square root) | No √ (just addition) |
| **Distance** | Never longer than Manhattan | Never shorter than Euclidean |
| **Use case** | General purpose | Grid-based problems |

### 💡 Which One to Use?
- **Euclidean**: Most common, general purpose
- **Manhattan**: When you can only move in grids (like city streets)

In [17]:
# Tiny dataset for distance calculation
data_distance = {
    'Student': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'Feature1': [70, 60, 85, 78, 62],
    'Feature2': [80, 90, 60, 76, 65]
}

df_distance = pd.DataFrame(data_distance)
print("Student Data:")
print(df_distance)
print("\nQuery Point q = [75, 70]")
print("\n" + "="*60)

Student Data:
  Student  Feature1  Feature2
0      S1        70        80
1      S2        60        90
2      S3        85        60
3      S4        78        76
4      S5        62        65

Query Point q = [75, 70]



In [18]:
# Define query point
q = np.array([75, 70])

print("\nDetailed Calculation for S1 vs q:")
print("="*60)
S1 = np.array([70, 80])
print(f"S1 = {S1}")
print(f"q  = {q}")
print(f"\nDifferences:")
diff1 = 70 - 75
diff2 = 80 - 70
print(f"  Feature1: (70 - 75) = {diff1}")
print(f"  Feature2: (80 - 70) = {diff2}")

# Euclidean
euclidean_s1 = np.sqrt(diff1**2 + diff2**2)
print(f"\nEuclidean Distance:")
print(f"  d = sqrt(({diff1})² + ({diff2})²)")
print(f"  d = sqrt({diff1**2} + {diff2**2})")
print(f"  d = sqrt({diff1**2 + diff2**2})")
print(f"  d ≈ {euclidean_s1:.3f}")

# Manhattan
manhattan_s1 = abs(diff1) + abs(diff2)
print(f"\nManhattan Distance:")
print(f"  d = |{diff1}| + |{diff2}|")
print(f"  d = {abs(diff1)} + {abs(diff2)}")
print(f"  d = {manhattan_s1}")

print("\n" + "="*60)


Detailed Calculation for S1 vs q:
S1 = [70 80]
q  = [75 70]

Differences:
  Feature1: (70 - 75) = -5
  Feature2: (80 - 70) = 10

Euclidean Distance:
  d = sqrt((-5)² + (10)²)
  d = sqrt(25 + 100)
  d = sqrt(125)
  d ≈ 11.180

Manhattan Distance:
  d = |-5| + |10|
  d = 5 + 10
  d = 15



In [19]:
# Calculate distances for all students
def euclidean_distance(p1, p2):
    """Calculate Euclidean distance between two points"""
    return np.sqrt(np.sum((p1 - p2)**2))

def manhattan_distance(p1, p2):
    """Calculate Manhattan distance between two points"""
    return np.sum(np.abs(p1 - p2))

# Calculate for all students
results = []
for idx, row in df_distance.iterrows():
    student = row['Student']
    point = np.array([row['Feature1'], row['Feature2']])
    
    euclidean = euclidean_distance(point, q)
    manhattan = manhattan_distance(point, q)
    
    results.append({
        'Student': student,
        'Point': f"[{point[0]}, {point[1]}]",
        'Euclidean Distance': round(euclidean, 3),
        'Manhattan Distance': manhattan
    })

df_results = pd.DataFrame(results)
print("\nDistance Calculations for All Students:")
print(df_results)
print("\n" + "="*60)


Distance Calculations for All Students:
  Student     Point  Euclidean Distance  Manhattan Distance
0      S1  [70, 80]              11.180                  15
1      S2  [60, 90]              25.000                  35
2      S3  [85, 60]              14.142                  20
3      S4  [78, 76]               6.708                   9
4      S5  [62, 65]              13.928                  18



In [20]:
# Using sklearn for distance calculations
from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances

# Prepare data
X = df_distance[['Feature1', 'Feature2']].values
q_reshaped = q.reshape(1, -1)

# Calculate distances
euclidean_dist = euclidean_distances(X, q_reshaped).flatten()
manhattan_dist = manhattan_distances(X, q_reshaped).flatten()

df_distance['Euclidean_sklearn'] = np.round(euclidean_dist, 3)
df_distance['Manhattan_sklearn'] = manhattan_dist

print("Using sklearn distance functions:")
print(df_distance)
print("\n" + "="*60)

Using sklearn distance functions:
  Student  Feature1  Feature2  Euclidean_sklearn  Manhattan_sklearn
0      S1        70        80             11.180               15.0
1      S2        60        90             25.000               35.0
2      S3        85        60             14.142               20.0
3      S4        78        76              6.708                9.0
4      S5        62        65             13.928               18.0



In [21]:
# Find nearest neighbor
nearest_euclidean = df_results.loc[df_results['Euclidean Distance'].idxmin()]
nearest_manhattan = df_results.loc[df_results['Manhattan Distance'].idxmin()]

print("\nNearest Neighbors to Query Point q = [75, 70]:")
print(f"\nBy Euclidean Distance: {nearest_euclidean['Student']} with distance {nearest_euclidean['Euclidean Distance']}")
print(f"By Manhattan Distance: {nearest_manhattan['Student']} with distance {nearest_manhattan['Manhattan Distance']}")


Nearest Neighbors to Query Point q = [75, 70]:

By Euclidean Distance: S4 with distance 6.708
By Manhattan Distance: S4 with distance 9
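
For larger tables, the row-by-row loop can be replaced by a vectorized calculation. The sketch below uses NumPy's `np.linalg.norm` with `ord=2` (Euclidean) and `ord=1` (Manhattan) along each row. The array names are illustrative; the data is the same five students and query point.

```python
import numpy as np

X = np.array([[70, 80], [60, 90], [85, 60], [78, 76], [62, 65]], dtype=float)
students = ['S1', 'S2', 'S3', 'S4', 'S5']
q = np.array([75, 70], dtype=float)

diff = X - q  # broadcasting subtracts q from every row
euclidean = np.linalg.norm(diff, ord=2, axis=1)
manhattan = np.linalg.norm(diff, ord=1, axis=1)

print("Euclidean:", np.round(euclidean, 3))
print("Manhattan:", manhattan)
print("Nearest (Euclidean):", students[int(np.argmin(euclidean))])
print("Nearest (Manhattan):", students[int(np.argmin(manhattan))])

# Euclidean distance is never larger than Manhattan distance
print("Euclidean <= Manhattan for all rows:", bool(np.all(euclidean <= manhattan)))
```

Both metrics pick S4 here, but they can disagree on other data, so the choice of metric is part of the model.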


## 📝 Quick Summary - Everything You Learned

---

### 🎚️ Part 1: Scaling (Putting Data on the Same Scale)

#### 1️⃣ Standardization (Z-score)
- **What it does:** Makes average = 0, spread = 1
- **When to use:** Data has different units
- **Warning:** Affected by outliers
- **Formula:** `(value - mean) / std`

#### 2️⃣ Min-Max Scaling
- **What it does:** Squeezes everything between 0 and 1
- **When to use:** Need specific range [0, 1]
- **Warning:** Very sensitive to outliers ⚠️
- **Formula:** `(value - min) / (max - min)`

#### 3️⃣ Robust Scaling
- **What it does:** Uses middle values (median and IQR), so extremes don't distort the scale
- **When to use:** Data has outliers (extreme values)
- **Advantage:** Scaling is not distorted by outliers ✅
- **Formula:** `(value - median) / IQR`

---

### 🏷️ Part 2: Encoding (Categories → Numbers)

#### 4️⃣ One-Hot Encoding
- **Use for:** Categories with NO order (nominal)
- **Example:** Colors, countries, food types
- **How:** Creates separate columns with 1s and 0s
- **Result:** Red → [1,0,0], Blue → [0,1,0], Green → [0,0,1]

#### 5️⃣ Ordinal Encoding
- **Use for:** Categories WITH order (ordinal)
- **Example:** Small, Medium, Large
- **How:** Assigns numbers based on ranking
- **Result:** Small=1, Medium=2, Large=3

---

### 📏 Part 3: Distance (Measuring Closeness)

#### 6️⃣ Euclidean Distance
- **What:** Straight line distance (as the crow flies) ✈️
- **Formula:** Has square root
- **Use:** Most common, general purpose
- **Example:** √(5² + 10²) ≈ 11.18

#### 7️⃣ Manhattan Distance
- **What:** Grid distance (as taxi drives) 🚕
- **Formula:** Add up the absolute differences
- **Use:** Grid-based or city problems
- **Example:** |5| + |10| = 15

---

### 🎯 Quick Decision Guide:

**Need to scale data?**
- Has outliers? → Robust Scaling ✅
- No outliers? → Min-Max or Standardization

**Need to encode categories?**
- No order (colors)? → One-Hot Encoding ✅
- Has order (sizes)? → Ordinal Encoding ✅

**Need to measure distance?**
- General use? → Euclidean Distance ✅
- Grid-based? → Manhattan Distance ✅

---

### 🎓 Congratulations!
You now understand scaling, encoding, and distance measures! 🎉