Topics:
1. Standardization, Min-Max scaling, Robust scaling
2. Nominal vs ordinal variables, one-hot vs ordinal encoding
3. Vectors, dot product, norms, Euclidean and Manhattan distance

In [26]:
import numpy as np
import pandas as pd
from scipy.spatial import distance
import warnings
warnings.filterwarnings('ignore')

# Part A: Quick Basics

## A1. Spot the right scaler

In [28]:
print("A1. Spot the right scaler\n")
print("a) Apartment_price_BDT with a few luxury penthouses")
print("   Answer: Robust Scaler")
print("   Justification: Handles outliers (luxury penthouses) well by using median and IQR\n")

print("b) Skin_temperature_C measured from a wearable between 30 and 36")
print("   Answer: Min-Max Scaler")
print("   Justification: Data is bounded with no outliers, perfect for Min-Max scaling to [0,1]\n")

print("c) Daily_app_opens with many zeros and a few power users")
print("   Answer: Robust Scaler")
print("   Justification: Power users create outliers, Robust scaler prevents them from dominating")

A1. Spot the right scaler

a) Apartment_price_BDT with a few luxury penthouses
   Answer: Robust Scaler
   Justification: Handles outliers (luxury penthouses) well by using median and IQR

b) Skin_temperature_C measured from a wearable between 30 and 36
   Answer: Min-Max Scaler
   Justification: Data is bounded with no outliers, perfect for Min-Max scaling to [0,1]

c) Daily_app_opens with many zeros and a few power users
   Answer: Robust Scaler
   Justification: Power users create outliers, Robust scaler prevents them from dominating
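
One way to back these choices with numbers is Tukey's 1.5 x IQR rule, which flags values that sit far outside the middle half of the data. The sketch below is a rough heuristic rather than a fixed rule; `suggest_scaler` and both sample lists are hypothetical and only stand in for the columns described above.

```python
import math

def percentile(values, q):
    """Percentile with linear interpolation between neighbouring sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def suggest_scaler(values):
    """Return 'robust' if any value lies beyond the 1.5 * IQR fences, else 'min-max'."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    has_outlier = any(v < low or v > high for v in values)
    return "robust" if has_outlier else "min-max"

# Hypothetical samples, invented for illustration only
apartment_prices = [55, 60, 62, 70, 75, 900]   # one luxury penthouse
skin_temps = [30.5, 31.2, 33.0, 34.1, 35.8]    # bounded sensor readings

print(suggest_scaler(apartment_prices))
print(suggest_scaler(skin_temps))
```

A check like this only informs the decision; domain knowledge (for example, known physical bounds on skin temperature) still matters.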


## A2. Manual Min-Max on a tiny set

In [32]:
print("A2. Manual Min-Max Scaling\n")
scores = [20, 25, 30, 50]
print(f"Original scores: {scores}")

# Step 1: Find min and max
min_val = min(scores)
max_val = max(scores)
print(f"\nStep 1: min = {min_val}, max = {max_val}")

# Step 2: Apply formula (x - min) / (max - min)
print("\nStep 2: Apply formula (x - min) / (max - min)")
scaled_scores = []
for score in scores:
    scaled = (score - min_val) / (max_val - min_val)
    scaled_scores.append(scaled)
    print(f"  ({score} - {min_val}) / ({max_val} - {min_val}) = {score - min_val} / {max_val - min_val} = {scaled}")

print(f"\nScaled scores: {scaled_scores}")

A2. Manual Min-Max Scaling

Original scores: [20, 25, 30, 50]

Step 1: min = 20, max = 50

Step 2: Apply formula (x - min) / (max - min)
  (20 - 20) / (50 - 20) = 0 / 30 = 0.0
  (25 - 20) / (50 - 20) = 5 / 30 = 0.16666666666666666
  (30 - 20) / (50 - 20) = 10 / 30 = 0.3333333333333333
  (50 - 20) / (50 - 20) = 30 / 30 = 1.0

Scaled scores: [0.0, 0.16666666666666666, 0.3333333333333333, 1.0]
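
The same formula extends to any target range `[new_min, new_max]`. The helper below is a minimal sketch, not a library API: the name `min_max_scale` is hypothetical, and mapping a constant column to `new_min` is an assumption made here to avoid division by zero.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so send every value to new_min
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]

scores = [20, 25, 30, 50]
print(min_max_scale(scores))             # default range [0, 1]
print(min_max_scale(scores, -1.0, 1.0))  # rescaled to [-1, 1]
```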


## A3. Z-scores on a subset

In [33]:
print("A3. Z-scores (Standardization)\n")
x = [8, 9, 11]
print(f"Original values: {x}")

# Step 1: Calculate mean
mean = sum(x) / len(x)
print(f"\nStep 1: Mean = {sum(x)} / {len(x)} = {mean}")

# Step 2: Calculate population standard deviation
variance = sum((xi - mean)**2 for xi in x) / len(x)
std = variance ** 0.5
print(f"\nStep 2: Variance = {variance:.4f}")
print(f"        Std Dev = sqrt({variance:.4f}) = {std:.4f}")

# Step 3: Standardize each value
print("\nStep 3: Standardize using (x - mean) / std")
z_scores = []
for val in x:
    z = (val - mean) / std
    z_scores.append(z)
    print(f"  ({val} - {mean}) / {std:.4f} = {z:.4f}")

print(f"\nZ-scores: {[round(z, 4) for z in z_scores]}")

A3. Z-scores (Standardization)

Original values: [8, 9, 11]

Step 1: Mean = 28 / 3 = 9.333333333333334

Step 2: Variance = 1.5556
        Std Dev = sqrt(1.5556) = 1.2472

Step 3: Standardize using (x - mean) / std
  (8 - 9.333333333333334) / 1.2472 = -1.0690
  (9 - 9.333333333333334) / 1.2472 = -0.2673
  (11 - 9.333333333333334) / 1.2472 = 1.3363

Z-scores: [-1.069, -0.2673, 1.3363]
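
The cell above divides by `n` (population standard deviation). Dividing by `n - 1` (sample standard deviation) gives smaller z-scores in magnitude, and the gap is noticeable on a set this small. A short sketch using only the standard library:

```python
import statistics

x = [8, 9, 11]
mean = statistics.fmean(x)

pop_std = statistics.pstdev(x)     # divides by n, as in the cell above
sample_std = statistics.stdev(x)   # divides by n - 1

z_pop = [(v - mean) / pop_std for v in x]
z_sample = [(v - mean) / sample_std for v in x]

print("population:", [round(z, 4) for z in z_pop])
print("sample:    ", [round(z, 4) for z in z_sample])
```

Whichever convention is used, it should be stated, because tools differ in their defaults (NumPy's `np.std` uses `ddof=0` unless told otherwise).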


## A4. Robust scaling ingredients

In [5]:
print("A4. Robust Scaling Ingredients\n")
y = [5, 6, 6, 7, 50]
print(f"Values: {y}")

# Median (taking the middle element works here only because len(y) is odd)
sorted_y = sorted(y)
median = sorted_y[len(sorted_y) // 2]
print(f"\nMedian: {median}")

# Q1 (25th percentile)
Q1 = np.percentile(y, 25)
print(f"Q1 (25th percentile): {Q1}")

# Q3 (75th percentile)
Q3 = np.percentile(y, 75)
print(f"Q3 (75th percentile): {Q3}")

# IQR
IQR = Q3 - Q1
print(f"IQR (Q3 - Q1): {Q3} - {Q1} = {IQR}")

A4. Robust Scaling Ingredients

Values: [5, 6, 6, 7, 50]

Median: 6
Q1 (25th percentile): 6.0
Q3 (75th percentile): 7.0
IQR (Q3 - Q1): 7.0 - 6.0 = 1.0
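
With the median and IQR in hand, the robust-scaled values follow from `(y - median) / IQR`. The sketch below puts the ingredients together; `percentile` and `robust_scale` are hypothetical helpers written for this illustration, with `percentile` using linear interpolation between sorted values.

```python
import math
import statistics

def percentile(values, q):
    """Percentile with linear interpolation between neighbouring sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    med = statistics.median(values)
    iqr = percentile(values, 75) - percentile(values, 25)
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined for this column")
    return [(v - med) / iqr for v in values]

y = [5, 6, 6, 7, 50]
print(robust_scale(y))
```

The outlier 50 stays far from the rest after scaling; what the median and IQR protect is the scale of the typical values, which remain within one unit of zero.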


## A5. Nominal or ordinal

In [6]:
print("A5. Nominal or Ordinal\n")
print("a) T-shirt_size {S, M, L, XL}")
print("   Answer: ORDINAL (has natural ordering from small to extra large)\n")

print("b) City {Dhaka, Chattogram, Rajshahi}")
print("   Answer: NOMINAL (no inherent ordering between cities)\n")

print("c) Satisfaction {Low, Medium, High}")
print("   Answer: ORDINAL (has natural ordering from low to high)")

A5. Nominal or Ordinal

a) T-shirt_size {S, M, L, XL}
   Answer: ORDINAL (has natural ordering from small to extra large)

b) City {Dhaka, Chattogram, Rajshahi}
   Answer: NOMINAL (no inherent ordering between cities)

c) Satisfaction {Low, Medium, High}
   Answer: ORDINAL (has natural ordering from low to high)
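
An ordinal variable needs its order spelled out, because the default string order is alphabetical and usually wrong. A minimal sketch, with a hypothetical sample of sizes:

```python
# Explicit order, smallest first
SIZE_ORDER = ["S", "M", "L", "XL"]
size_rank = {size: i for i, size in enumerate(SIZE_ORDER)}

sizes = ["L", "S", "XL", "M"]   # hypothetical sample

alphabetical = sorted(sizes)                  # ignores the meaning of the labels
by_rank = sorted(sizes, key=size_rank.get)    # respects S < M < L < XL

print(alphabetical)
print(by_rank)
```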


# Part B: Hands on Practice

## B1. Three scalers side by side

In [7]:
print("B1. Three Scalers Side by Side\n")
Heights = [150, 160, 170, 175, 180]
Weights = [58, 62, 65, 66, 190]

print(f"Heights: {Heights}")
print(f"Weights: {Weights}\n")

# a) Min-Max scale both to [0, 1]
print("a) Min-Max Scaling to [0, 1]")
heights_min, heights_max = min(Heights), max(Heights)
weights_min, weights_max = min(Weights), max(Weights)

heights_minmax = [(h - heights_min) / (heights_max - heights_min) for h in Heights]
weights_minmax = [(w - weights_min) / (weights_max - weights_min) for w in Weights]

print(f"Heights scaled: {[round(h, 4) for h in heights_minmax]}")
print(f"Weights scaled: {[round(w, 4) for w in weights_minmax]}\n")

# b) Standardize the first three values only
print("b) Standardize first three values")
heights_first3 = Heights[:3]
weights_first3 = Weights[:3]

h_mean = np.mean(heights_first3)
h_std = np.std(heights_first3, ddof=0)
w_mean = np.mean(weights_first3)
w_std = np.std(weights_first3, ddof=0)

heights_std = [(h - h_mean) / h_std for h in heights_first3]
weights_std = [(w - w_mean) / w_std for w in weights_first3]

print(f"Heights first 3 standardized: {[round(h, 4) for h in heights_std]}")
print(f"Weights first 3 standardized: {[round(w, 4) for w in weights_std]}\n")

# c) Robust scale Weights
print("c) Robust Scaling for Weights")
w_median = np.median(Weights)
w_q1 = np.percentile(Weights, 25)
w_q3 = np.percentile(Weights, 75)
w_iqr = w_q3 - w_q1

print(f"Median: {w_median}, Q1: {w_q1}, Q3: {w_q3}, IQR: {w_iqr}")
weights_robust = [(w - w_median) / w_iqr for w in Weights]
print(f"Weights robust scaled: {[round(w, 4) for w in weights_robust]}\n")

# d) Which scaler handles the outlier best
print("d) Which scaler handles the outlier (190) best?")
print("   Answer: ROBUST SCALER")
print(f"   Min-Max scaled outlier: {weights_minmax[-1]:.4f} (pins the top of the scale and squeezes the other weights into [0, 0.06])")
print(f"   Robust scaled outlier: {weights_robust[-1]:.4f} (still extreme, but the other weights keep a usable spread)")

B1. Three Scalers Side by Side

Heights: [150, 160, 170, 175, 180]
Weights: [58, 62, 65, 66, 190]

a) Min-Max Scaling to [0, 1]
Heights scaled: [0.0, 0.3333, 0.6667, 0.8333, 1.0]
Weights scaled: [0.0, 0.0303, 0.053, 0.0606, 1.0]

b) Standardize first three values
Heights first 3 standardized: [np.float64(-1.2247), np.float64(0.0), np.float64(1.2247)]
Weights first 3 standardized: [np.float64(-1.2787), np.float64(0.1162), np.float64(1.1625)]

c) Robust Scaling for Weights
Median: 65.0, Q1: 62.0, Q3: 66.0, IQR: 4.0
Weights robust scaled: [np.float64(-1.75), np.float64(-0.75), np.float64(0.0), np.float64(0.25), np.float64(31.25)]

d) Which scaler handles the outlier (190) best?
   Answer: ROBUST SCALER
   Min-Max scaled outlier: 1.0000 (pins the top of the scale and squeezes the other weights into [0, 0.06])
   Robust scaled outlier: 31.2500 (still extreme, but the other weights keep a usable spread)
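
A useful way to compare the two scalers is to look at how much room the non-outlier weights get after scaling. The sketch below measures that spread; it assumes `statistics.quantiles` with `method="inclusive"`, which interpolates linearly and reproduces the Q1 and Q3 shown above for this data.

```python
import statistics

weights = [58, 62, 65, 66, 190]
inliers = weights[:-1]   # every value except the outlier 190

w_min, w_max = min(weights), max(weights)
med = statistics.median(weights)
q1, _, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1

minmax_inliers = [(w - w_min) / (w_max - w_min) for w in inliers]
robust_inliers = [(w - med) / iqr for w in inliers]

# Spread = distance between the largest and smallest scaled inlier
minmax_spread = max(minmax_inliers) - min(minmax_inliers)
robust_spread = max(robust_inliers) - min(robust_inliers)

print(f"Min-Max spread of inliers: {minmax_spread:.4f}")
print(f"Robust spread of inliers:  {robust_spread:.4f}")
```

Under Min-Max the four typical weights share about 6% of the scale; under robust scaling they span two full units.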


## B2. One-hot by hand

In [8]:
print("B2. One-Hot Encoding by Hand\n")
Cities = ['Dhaka', 'Chattogram', 'Dhaka', 'Rajshahi', 'Rajshahi']
print(f"Original Cities: {Cities}\n")

# Create one-hot encoding
df_onehot = pd.DataFrame({
    'City_Dhaka': [1 if c == 'Dhaka' else 0 for c in Cities],
    'City_Chattogram': [1 if c == 'Chattogram' else 0 for c in Cities],
    'City_Rajshahi': [1 if c == 'Rajshahi' else 0 for c in Cities]
})

print("One-Hot Encoded:")
print(df_onehot)

B2. One-Hot Encoding by Hand

Original Cities: ['Dhaka', 'Chattogram', 'Dhaka', 'Rajshahi', 'Rajshahi']

One-Hot Encoded:
   City_Dhaka  City_Chattogram  City_Rajshahi
0           1                0              0
1           0                1              0
2           1                0              0
3           0                0              1
4           0                0              1
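
The hand-built columns above generalize to any list of categories. The function below is a minimal sketch (the name `one_hot` is hypothetical); it orders the columns alphabetically, so the column order differs from the hand-built frame.

```python
def one_hot(values):
    """Return (categories, rows); each row holds a single 1 marking its category."""
    categories = sorted(set(values))                       # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cities = ['Dhaka', 'Chattogram', 'Dhaka', 'Rajshahi', 'Rajshahi']
categories, rows = one_hot(cities)

print(categories)
for row in rows:
    print(row)
```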


## B3. Ordinal mapping

In [9]:
print("B3. Ordinal Mapping\n")
Education = ['High School', 'Bachelor', 'Master', 'Bachelor', 'Master']
print(f"Original Education: {Education}\n")

# First mapping: 0, 1, 2
mapping1 = {'High School': 0, 'Bachelor': 1, 'Master': 2}
encoded1 = [mapping1[e] for e in Education]
print(f"Mapping 1 (High School=0, Bachelor=1, Master=2): {encoded1}")
print(f"Distance Bachelor to Master: |{mapping1['Bachelor']} - {mapping1['Master']}| = {abs(mapping1['Bachelor'] - mapping1['Master'])}\n")

# Second mapping: 1, 2, 3
mapping2 = {'High School': 1, 'Bachelor': 2, 'Master': 3}
encoded2 = [mapping2[e] for e in Education]
print(f"Mapping 2 (High School=1, Bachelor=2, Master=3): {encoded2}")
print(f"Distance Bachelor to Master: |{mapping2['Bachelor']} - {mapping2['Master']}| = {abs(mapping2['Bachelor'] - mapping2['Master'])}\n")

print("Explanation: The distances between categories remain the same (still 1 unit apart),")
print("but the absolute values shift, which can affect intercepts in models but not slopes.")

B3. Ordinal Mapping

Original Education: ['High School', 'Bachelor', 'Master', 'Bachelor', 'Master']

Mapping 1 (High School=0, Bachelor=1, Master=2): [0, 1, 2, 1, 2]
Distance Bachelor to Master: |1 - 2| = 1

Mapping 2 (High School=1, Bachelor=2, Master=3): [1, 2, 3, 2, 3]
Distance Bachelor to Master: |2 - 3| = 1

Explanation: The distances between categories remain the same (still 1 unit apart),
but the absolute values shift, which can affect intercepts in models but not slopes.
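
The claim about intercepts and slopes can be checked with a one-variable least-squares fit. In the sketch below the `salary` values are a hypothetical target invented for illustration, and `fit_line` is a small helper using the closed-form formulas.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

salary = [30, 45, 60, 50, 65]   # hypothetical target, not from the exercise
encoded1 = [0, 1, 2, 1, 2]      # High School=0, Bachelor=1, Master=2
encoded2 = [1, 2, 3, 2, 3]      # High School=1, Bachelor=2, Master=3

slope1, intercept1 = fit_line(encoded1, salary)
slope2, intercept2 = fit_line(encoded2, salary)

print(f"Mapping 1: slope={slope1:.4f}, intercept={intercept1:.4f}")
print(f"Mapping 2: slope={slope2:.4f}, intercept={intercept2:.4f}")
```

Shifting every code by 1 leaves the slope untouched and lowers the intercept by exactly one slope, so the fitted predictions are identical.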


## B4. Encoding mixup (Optional)

In [10]:
print("B4. Encoding Mixup\n")
print("Risk of applying ordinal encoding to City and one-hot to Education:")
print("\nOrdinal encoding on City imposes a false ordering (e.g., Dhaka < Chattogram < Rajshahi),")
print("so a linear model fits a single slope across cities as if they lay on a numeric scale,")
print("while one-hot on Education discards the natural ordering and treats every pair of levels as equally far apart.")

B4. Encoding Mixup

Risk of applying ordinal encoding to City and one-hot to Education:

Ordinal encoding on City imposes a false ordering (e.g., Dhaka < Chattogram < Rajshahi),
so a linear model fits a single slope across cities as if they lay on a numeric scale,
while one-hot on Education discards the natural ordering and treats every pair of levels as equally far apart.
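
The effect on distances is easy to see numerically. The sketch below encodes the three education levels both ways and compares Euclidean distances between each pair; the helper `euclidean` is written here for illustration.

```python
import math

levels = ["High School", "Bachelor", "Master"]

# Ordinal: one number per level; one-hot: one indicator column per level
ordinal = {lvl: [float(i)] for i, lvl in enumerate(levels)}
one_hot = {
    lvl: [1.0 if j == i else 0.0 for j in range(len(levels))]
    for i, lvl in enumerate(levels)
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

pairs = [("High School", "Bachelor"), ("Bachelor", "Master"), ("High School", "Master")]
for a, b in pairs:
    d_ord = euclidean(ordinal[a], ordinal[b])
    d_hot = euclidean(one_hot[a], one_hot[b])
    print(f"{a} vs {b}: ordinal={d_ord:.4f}, one-hot={d_hot:.4f}")
```

Ordinal codes place High School twice as far from Master as from Bachelor, while one-hot puts every pair at the same distance, sqrt(2).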


## B5. Vectors and alignment (Optional)

In [11]:
print("B5. Vectors and Alignment\n")
a = np.array([3, -1, 2])
b = np.array([4, 0, -2])
c = np.array([-6, 2, -4])

print(f"a = {a}")
print(f"b = {b}")
print(f"c = {c}\n")

# a) Compute dot products
dot_ab = np.dot(a, b)
dot_ac = np.dot(a, c)
print(f"a) Dot products:")
print(f"   a · b = (3)(4) + (-1)(0) + (2)(-2) = 12 + 0 - 4 = {dot_ab}")
print(f"   a · c = (3)(-6) + (-1)(2) + (2)(-4) = -18 - 2 - 8 = {dot_ac}\n")

# b) Compare signs and magnitudes
print(f"b) Alignment analysis:")
print(f"   a · b = {dot_ab} (positive but small) - vectors are slightly aligned")
print(f"   a · c = {dot_ac} (negative and large) - vectors point in opposite directions")
print(f"   Note: c = -2 * a, so they are perfectly anti-aligned\n")

# c) L2 normalize a
norm_a = np.linalg.norm(a)
a_normalized = a / norm_a
print(f"c) L2 normalization:")
print(f"   ||a|| = sqrt(3^2 + (-1)^2 + 2^2) = sqrt(9 + 1 + 4) = sqrt(14) = {norm_a:.6f}")
print(f"   a_normalized = a / ||a|| = {a_normalized}")
print(f"   a_normalized (3 decimals) = [{a_normalized[0]:.3f}, {a_normalized[1]:.3f}, {a_normalized[2]:.3f}]")

B5. Vectors and Alignment

a = [ 3 -1  2]
b = [ 4  0 -2]
c = [-6  2 -4]

a) Dot products:
   a · b = (3)(4) + (-1)(0) + (2)(-2) = 12 + 0 - 4 = 8
   a · c = (3)(-6) + (-1)(2) + (2)(-4) = -18 - 2 - 8 = -28

b) Alignment analysis:
   a · b = 8 (positive but small) - vectors are slightly aligned
   a · c = -28 (negative and large) - vectors point in opposite directions
   Note: c = -2 * a, so they are perfectly anti-aligned

c) L2 normalization:
   ||a|| = sqrt(3^2 + (-1)^2 + 2^2) = sqrt(9 + 1 + 4) = sqrt(14) = 3.741657
   a_normalized = a / ||a|| = [ 0.80178373 -0.26726124  0.53452248]
   a_normalized (3 decimals) = [0.802, -0.267, 0.535]
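
The raw dot product mixes direction with vector length, so its size alone does not say how well two vectors align. Dividing by both norms gives the cosine similarity, which lies in [-1, 1]. A minimal sketch with plain Python lists:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the L2 norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [3, -1, 2]
b = [4, 0, -2]
c = [-6, 2, -4]

cos_ab = cosine_similarity(a, b)
cos_ac = cosine_similarity(a, c)

print(f"cos(a, b) = {cos_ab:.4f}")
print(f"cos(a, c) = {cos_ac:.4f}")
```

On this scale `a` and `b` are moderately aligned (about 0.48), while `a` and `c` reach the minimum of -1, which matches `c = -2 * a`.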


## B6. Two distances, different vibes

In [12]:
print("B6. Two Distances, Different Vibes\n")
P1 = np.array([2, 3])
P2 = np.array([5, 7])
P3 = np.array([2, 10])

print(f"P1 = {P1}")
print(f"P2 = {P2}")
print(f"P3 = {P3}\n")

# a) Compute both distances for all pairs
print("a) Distances for all pairs:\n")

euclidean_p1_p2 = distance.euclidean(P1, P2)
manhattan_p1_p2 = distance.cityblock(P1, P2)
print(f"P1 to P2:")
print(f"  Euclidean = sqrt((5-2)^2 + (7-3)^2) = sqrt(9 + 16) = sqrt(25) = {euclidean_p1_p2:.4f}")
print(f"  Manhattan = |5-2| + |7-3| = 3 + 4 = {manhattan_p1_p2:.4f}\n")

euclidean_p1_p3 = distance.euclidean(P1, P3)
manhattan_p1_p3 = distance.cityblock(P1, P3)
print(f"P1 to P3:")
print(f"  Euclidean = sqrt((2-2)^2 + (10-3)^2) = sqrt(0 + 49) = {euclidean_p1_p3:.4f}")
print(f"  Manhattan = |2-2| + |10-3| = 0 + 7 = {manhattan_p1_p3:.4f}\n")

euclidean_p2_p3 = distance.euclidean(P2, P3)
manhattan_p2_p3 = distance.cityblock(P2, P3)
print(f"P2 to P3:")
print(f"  Euclidean = sqrt((2-5)^2 + (10-7)^2) = sqrt(9 + 9) = {euclidean_p2_p3:.4f}")
print(f"  Manhattan = |2-5| + |10-7| = 3 + 3 = {manhattan_p2_p3:.4f}\n")

# b) Which distance is more sensitive
print("b) Sensitivity to large jumps:")
print("   Euclidean distance is MORE sensitive to a single large jump in one coordinate,")
print("   because squaring weights large differences more heavily. P1-P3 (one jump of 7)")
print("   is 7.0 in both metrics, but P1-P2 (jumps of 3 and 4) drops to 5.0 under Euclidean.\n")

# c) Scale y by 10 and recompute
print("c) Effect of scaling y-coordinate by 10:\n")
P1_scaled = np.array([2, 3*10])
P2_scaled = np.array([5, 7*10])

euclidean_scaled = distance.euclidean(P1_scaled, P2_scaled)
manhattan_scaled = distance.cityblock(P1_scaled, P2_scaled)

print(f"Original P1 to P2: Euclidean = {euclidean_p1_p2:.4f}, Manhattan = {manhattan_p1_p2:.4f}")
print(f"Scaled P1 to P2:   Euclidean = {euclidean_scaled:.4f}, Manhattan = {manhattan_scaled:.4f}\n")

print("Explanation: Scaling y by 10 amplifies distances in the y-direction, making the")
print("y-component dominate both distance metrics, but Euclidean grows more dramatically due to squaring.")

B6. Two Distances, Different Vibes

P1 = [2 3]
P2 = [5 7]
P3 = [ 2 10]

a) Distances for all pairs:

P1 to P2:
  Euclidean = sqrt((5-2)^2 + (7-3)^2) = sqrt(9 + 16) = sqrt(25) = 5.0000
  Manhattan = |5-2| + |7-3| = 3 + 4 = 7.0000

P1 to P3:
  Euclidean = sqrt((2-2)^2 + (10-3)^2) = sqrt(0 + 49) = 7.0000
  Manhattan = |2-2| + |10-3| = 0 + 7 = 7.0000

P2 to P3:
  Euclidean = sqrt((2-5)^2 + (10-7)^2) = sqrt(9 + 9) = 4.2426
  Manhattan = |2-5| + |10-7| = 3 + 3 = 6.0000

b) Sensitivity to large jumps:
   Euclidean distance is MORE sensitive to a single large jump in one coordinate,
   because squaring weights large differences more heavily. P1-P3 (one jump of 7)
   is 7.0 in both metrics, but P1-P2 (jumps of 3 and 4) drops to 5.0 under Euclidean.

c) Effect of scaling y-coordinate by 10:

Original P1 to P2: Euclidean = 5.0000, Manhattan = 7.0000
Scaled P1 to P2:   Euclidean = 40.1123, Manhattan = 43.0000

Explanation: Scaling y by 10 amplifies distances in the y-direction, making the
y-component dominate both distance metrics, but Euclidean grows more dramatically due to squaring.
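
Manhattan and Euclidean distance are two members of the Minkowski family, which raises each coordinate difference to a power `p` before summing. The sketch below is a plain-Python illustration on the same three points; the function name `minkowski` is chosen here for clarity.

```python
def minkowski(u, v, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

P1, P2, P3 = (2, 3), (5, 7), (2, 10)

for name, (u, v) in {"P1-P2": (P1, P2), "P1-P3": (P1, P3), "P2-P3": (P2, P3)}.items():
    d1 = minkowski(u, v, 1)
    d2 = minkowski(u, v, 2)
    print(f"{name}: p=1 -> {d1:.4f}, p=2 -> {d2:.4f}")
```

As `p` grows, the largest single coordinate difference counts for more of the total, which is why the two metrics agree on P1-P3 (one jump of 7) but disagree on P1-P2.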

# Part C: Mini Datasets

In [13]:
# Create C-Data-1
c_data_1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Age': [20, 21, 22, 20, 23],
    'Hours_Study': [1.0, 0.5, 2.2, 5.0, 0.2],
    'GPA': [3.10, 2.60, 3.40, 3.90, 2.30],
    'Internet': ['Yes', 'No', 'Yes', 'Yes', 'No'],
    'City': ['Dhaka', 'Chattogram', 'Rajshahi', 'Dhaka', 'Rajshahi']
})

# Create C-Data-2
c_data_2 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Income_BDT': [30000, 45000, 52000, 300000, 38000],
    'Transactions': [0, 1, 2, 12, 0],
    'Temp_C': [25.0, 26.0, 24.5, 28.0, 25.5],
    'Education': ['High School', 'Bachelor', 'Master', 'Bachelor', 'Master'],
    'Satisfaction': ['Low', 'Medium', 'High', 'Medium', 'Medium']
})

print("C-Data-1:")
print(c_data_1)
print("\nC-Data-2:")
print(c_data_2)

C-Data-1:
   ID  Age  Hours_Study  GPA Internet        City
0   1   20          1.0  3.1      Yes       Dhaka
1   2   21          0.5  2.6       No  Chattogram
2   3   22          2.2  3.4      Yes    Rajshahi
3   4   20          5.0  3.9      Yes       Dhaka
4   5   23          0.2  2.3       No    Rajshahi

C-Data-2:
   ID  Income_BDT  Transactions  Temp_C    Education Satisfaction
0   1       30000             0    25.0  High School          Low
1   2       45000             1    26.0     Bachelor       Medium
2   3       52000             2    24.5       Master         High
3   4      300000            12    28.0     Bachelor       Medium
4   5       38000             0    25.5       Master       Medium


## C1. Scaler choices with evidence

In [14]:
print("C1. Scaler Choices with Evidence\n")

# Income_BDT
print("1. Income_BDT:")
print("   Scaler: ROBUST SCALER")
print("   Justification: Contains outlier (300000 vs 30000-52000 range)\n")
income = c_data_2['Income_BDT'].values
median_inc = np.median(income)
q1_inc = np.percentile(income, 25)
q3_inc = np.percentile(income, 75)
iqr_inc = q3_inc - q1_inc
income_robust = (income - median_inc) / iqr_inc
print(f"   Numeric illustration:")
print(f"   Original: {income}")
print(f"   Robust scaled: {income_robust.round(4)}")
print(f"   Notice the outlier 300000 becomes {income_robust[3]:.4f}: still far out, but the typical incomes stay spread over roughly [-1.07, 0.5]\n")

# Transactions
print("2. Transactions:")
print("   Scaler: ROBUST SCALER")
print("   Justification: Has outlier (12 vs mostly 0-2) and many zeros\n")
trans = c_data_2['Transactions'].values
median_trans = np.median(trans)
q1_trans = np.percentile(trans, 25)
q3_trans = np.percentile(trans, 75)
iqr_trans = q3_trans - q1_trans
trans_robust = (trans - median_trans) / iqr_trans if iqr_trans != 0 else trans - median_trans
print(f"   Numeric illustration:")
print(f"   Original: {trans}")
print(f"   Robust scaled: {trans_robust.round(4)}")
print(f"   Power user (12) is handled without dominating the scale\n")

# Temp_C
print("3. Temp_C:")
print("   Scaler: MIN-MAX SCALER")
print("   Justification: Bounded range (24.5-28.0), no outliers, natural physical bounds\n")
temp = c_data_2['Temp_C'].values
temp_min = temp.min()
temp_max = temp.max()
temp_minmax = (temp - temp_min) / (temp_max - temp_min)
print(f"   Numeric illustration:")
print(f"   Original: {temp}")
print(f"   Min-Max scaled: {temp_minmax.round(4)}")
print(f"   All values fit nicely in [0, 1] with no distortion from outliers")

C1. Scaler Choices with Evidence

1. Income_BDT:
   Scaler: ROBUST SCALER
   Justification: Contains outlier (300000 vs 30000-52000 range)

   Numeric illustration:
   Original: [ 30000  45000  52000 300000  38000]
   Robust scaled: [-1.0714  0.      0.5    18.2143 -0.5   ]
   Notice the outlier 300000 becomes 18.2143: still far out, but the typical incomes stay spread over roughly [-1.07, 0.5]

2. Transactions:
   Scaler: ROBUST SCALER
   Justification: Has outlier (12 vs mostly 0-2) and many zeros

   Numeric illustration:
   Original: [ 0  1  2 12  0]
   Robust scaled: [-0.5  0.   0.5  5.5 -0.5]
   Power user (12) is handled without dominating the scale

3. Temp_C:
   Scaler: MIN-MAX SCALER
   Justification: Bounded range (24.5-28.0), no outliers, natural physical bounds

   Numeric illustration:
   Original: [25.  26.  24.5 28.  25.5]
   Min-Max scaled: [0.1429 0.4286 0.     1.     0.2857]
   All values fit nicely in [0, 1] with no distortion from outliers


## C2. Mixed preprocessing plan

In [15]:
print("C2. Mixed Preprocessing Plan\n")

# a) Identify nominal and ordinal
print("a) Categorization:\n")
print("NOMINAL columns:")
print("  - Internet (Yes/No - no ordering)")
print("  - City (Dhaka, Chattogram, Rajshahi - no ordering)\n")

print("ORDINAL columns:")
print("  - Education (High School < Bachelor < Master)")
print("  - Satisfaction (Low < Medium < High)\n")

# b) Encoding plan
print("b) Encoding Plan:\n")
print("ONE-HOT ENCODING:")
print("  - Internet")
print("  - City\n")

print("ORDINAL ENCODING:")
print("  - Education: High School=0, Bachelor=1, Master=2")
print("  - Satisfaction: Low=0, Medium=1, High=2\n")

# c) Scaling plan
print("c) Scaling Plan:\n")
print("MIN-MAX SCALING:")
print("  - Age (small range, no outliers: 20-23)")
print("  - GPA (bounded: 2.30-3.90, no outliers)")
print("  - Temp_C (physical bounds: 24.5-28.0)\n")

print("ROBUST SCALING:")
print("  - Income_BDT (has extreme outlier: 300000)")
print("  - Transactions (has outlier: 12 vs mostly 0-2)")
print("  - Hours_Study (has extreme values: 5.0 vs mostly < 2.5)\n")

print("STANDARDIZATION:")
print("  - (None in this case - use Min-Max for bounded or Robust for outliers)")

C2. Mixed Preprocessing Plan

a) Categorization:

NOMINAL columns:
  - Internet (Yes/No - no ordering)
  - City (Dhaka, Chattogram, Rajshahi - no ordering)

ORDINAL columns:
  - Education (High School < Bachelor < Master)
  - Satisfaction (Low < Medium < High)

b) Encoding Plan:

ONE-HOT ENCODING:
  - Internet
  - City

ORDINAL ENCODING:
  - Education: High School=0, Bachelor=1, Master=2
  - Satisfaction: Low=0, Medium=1, High=2

c) Scaling Plan:

MIN-MAX SCALING:
  - Age (small range, no outliers: 20-23)
  - GPA (bounded: 2.30-3.90, no outliers)
  - Temp_C (physical bounds: 24.5-28.0)

ROBUST SCALING:
  - Income_BDT (has extreme outlier: 300000)
  - Transactions (has outlier: 12 vs mostly 0-2)
  - Hours_Study (has extreme values: 5.0 vs mostly < 2.5)

STANDARDIZATION:
  - (None in this case - use Min-Max for bounded or Robust for outliers)


## C3. Outlier stress test (Optional)

In [16]:
print("C3. Outlier Stress Test\n")
income = c_data_2['Income_BDT'].values
print(f"Income_BDT: {income}\n")

# Min-Max scaling
income_min = income.min()
income_max = income.max()
income_minmax = (income - income_min) / (income_max - income_min)
print("Min-Max Scaled:")
print(income_minmax.round(4))

# Robust scaling
income_median = np.median(income)
income_q1 = np.percentile(income, 25)
income_q3 = np.percentile(income, 75)
income_iqr = income_q3 - income_q1
income_robust = (income - income_median) / income_iqr
print("\nRobust Scaled:")
print(income_robust.round(4))

print("\nComparison:")
print("Min-Max compresses the typical incomes into [0, 0.08] because the outlier pins the top of the scale at 1.0;")
print(f"Robust leaves the outlier at {income_robust[3]:.2f} but keeps the typical incomes spread over [-1.07, 0.5].")

C3. Outlier Stress Test

Income_BDT: [ 30000  45000  52000 300000  38000]

Min-Max Scaled:
[0.     0.0556 0.0815 1.     0.0296]

Robust Scaled:
[-1.0714  0.      0.5    18.2143 -0.5   ]

Comparison:
Min-Max compresses the typical incomes into [0, 0.08] because the outlier pins the top of the scale at 1.0;
Robust leaves the outlier at 18.21 but keeps the typical incomes spread over [-1.07, 0.5].


## C4. Distance on feature space (Optional)

In [17]:
print("C4. Distance on Feature Space\n")

# Extract features
hours_study = c_data_1['Hours_Study'].values
gpa = c_data_1['GPA'].values

# ID 1 and ID 4
id1_features = np.array([hours_study[0], gpa[0]])  # [1.0, 3.10]
id4_features = np.array([hours_study[3], gpa[3]])  # [5.0, 3.90]

print(f"ID 1: Hours_Study={id1_features[0]}, GPA={id1_features[1]}")
print(f"ID 4: Hours_Study={id4_features[0]}, GPA={id4_features[1]}\n")

# a) Euclidean distance
euclidean_orig = distance.euclidean(id1_features, id4_features)
print(f"a) Euclidean distance:")
print(f"   sqrt((5.0-1.0)^2 + (3.90-3.10)^2) = sqrt(16 + 0.64) = sqrt(16.64) = {euclidean_orig:.4f}\n")

# b) Manhattan distance
manhattan_orig = distance.cityblock(id1_features, id4_features)
print(f"b) Manhattan distance:")
print(f"   |5.0-1.0| + |3.90-3.10| = 4.0 + 0.8 = {manhattan_orig:.4f}\n")

# c) Normalize and recompute
print("c) After Min-Max normalization:\n")

# Min-Max normalize
hours_min, hours_max = hours_study.min(), hours_study.max()
gpa_min, gpa_max = gpa.min(), gpa.max()

hours_normalized = (hours_study - hours_min) / (hours_max - hours_min)
gpa_normalized = (gpa - gpa_min) / (gpa_max - gpa_min)

id1_normalized = np.array([hours_normalized[0], gpa_normalized[0]])
id4_normalized = np.array([hours_normalized[3], gpa_normalized[3]])

print(f"ID 1 normalized: {id1_normalized}")
print(f"ID 4 normalized: {id4_normalized}\n")

euclidean_norm = distance.euclidean(id1_normalized, id4_normalized)
manhattan_norm = distance.cityblock(id1_normalized, id4_normalized)

print(f"Euclidean distance (normalized): {euclidean_norm:.4f}")
print(f"Manhattan distance (normalized): {manhattan_norm:.4f}\n")

print("Comment: Normalization stops Hours_Study (range 4.8) from dominating GPA (range 1.6),")
print("putting both features on a comparable [0, 1] scale for distance calculations.")

C4. Distance on Feature Space

ID 1: Hours_Study=1.0, GPA=3.1
ID 4: Hours_Study=5.0, GPA=3.9

a) Euclidean distance:
   sqrt((5.0-1.0)^2 + (3.90-3.10)^2) = sqrt(16 + 0.64) = sqrt(16.64) = 4.0792

b) Manhattan distance:
   |5.0-1.0| + |3.90-3.10| = 4.0 + 0.8 = 4.8000

c) After Min-Max normalization:

ID 1 normalized: [0.16666667 0.5       ]
ID 4 normalized: [1. 1.]

Euclidean distance (normalized): 0.9718
Manhattan distance (normalized): 1.3333

Comment: Normalization stops Hours_Study (range 4.8) from dominating GPA (range 1.6),
putting both features on a comparable [0, 1] scale for distance calculations.
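
How much each feature drives the distance can be quantified as its share of the squared Euclidean distance. The sketch below computes that share for ID 1 and ID 4 before and after Min-Max normalization; `min_max` and `squared_shares` are hypothetical helpers written for this illustration.

```python
hours = [1.0, 0.5, 2.2, 5.0, 0.2]
gpa = [3.10, 2.60, 3.40, 3.90, 2.30]

def min_max(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

def squared_shares(u, v):
    """Fraction of the squared Euclidean distance contributed by each coordinate."""
    sq = [(a - b) ** 2 for a, b in zip(u, v)]
    total = sum(sq)
    return [s / total for s in sq]

hours_n, gpa_n = min_max(hours), min_max(gpa)

# Index 0 is ID 1, index 3 is ID 4
raw_shares = squared_shares((hours[0], gpa[0]), (hours[3], gpa[3]))
norm_shares = squared_shares((hours_n[0], gpa_n[0]), (hours_n[3], gpa_n[3]))

print("Raw    [Hours_Study, GPA]:", [round(s, 4) for s in raw_shares])
print("Scaled [Hours_Study, GPA]:", [round(s, 4) for s in norm_shares])
```

Before scaling, Hours_Study accounts for about 96% of the squared distance. After scaling its share falls to roughly 74%, so GPA now has a real say, though the two contributions are still not identical for this particular pair.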


# Part D: Mini Project (Optional)

## Step 1: Create a small DataFrame

In [18]:
# Create synthetic dataset
df_project = pd.DataFrame({
    'Income': [35000, 48000, 250000, 42000, 38000, 55000],
    'Hours_Study': [2.5, 3.0, 1.5, 4.5, 2.0, 3.5],
    'GPA': [3.2, 3.5, 2.8, 3.8, 3.1, 3.6],
    'City': ['Dhaka', 'Chattogram', 'Dhaka', 'Rajshahi', 'Dhaka', 'Chattogram'],
    'Internet': ['Yes', 'Yes', 'No', 'Yes', 'Yes', 'No'],
    'Education_Level': ['Bachelor', 'Master', 'High School', 'Master', 'Bachelor', 'Master'],
    'Satisfaction': ['Medium', 'High', 'Low', 'High', 'Medium', 'High']
})

print("Mini Project DataFrame:")
print(df_project)

Mini Project DataFrame:
   Income  Hours_Study  GPA        City Internet Education_Level Satisfaction
0   35000          2.5  3.2       Dhaka      Yes        Bachelor       Medium
1   48000          3.0  3.5  Chattogram      Yes          Master         High
2  250000          1.5  2.8       Dhaka       No     High School          Low
3   42000          4.5  3.8    Rajshahi      Yes          Master         High
4   38000          2.0  3.1       Dhaka      Yes        Bachelor       Medium
5   55000          3.5  3.6  Chattogram       No          Master         High


## Step 2: Preprocessing Plan

**One-Hot Encoding:**
- City (nominal: Dhaka, Chattogram, Rajshahi)
- Internet (nominal: Yes, No)

**Ordinal Encoding:**
- Education_Level: High School=0, Bachelor=1, Master=2
- Satisfaction: Low=0, Medium=1, High=2

**Scaling:**
- Income: Robust Scaler (has outlier: 250000)
- Hours_Study: Standardization (roughly symmetric with no outliers; six values are too few to confirm normality)
- GPA: Min-Max Scaler (bounded range, no outliers)

## Step 3: Apply ColumnTransformer

In [19]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# Define column groups
onehot_cols = ['City', 'Internet']
ordinal_cols = ['Education_Level', 'Satisfaction']
robust_cols = ['Income']
standard_cols = ['Hours_Study']
minmax_cols = ['GPA']

# Create transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(drop='first', sparse_output=False), onehot_cols),
        ('ordinal_education', OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master']]), ['Education_Level']),
        ('ordinal_satisfaction', OrdinalEncoder(categories=[['Low', 'Medium', 'High']]), ['Satisfaction']),
        ('robust', RobustScaler(), robust_cols),
        ('standard', StandardScaler(), standard_cols),
        ('minmax', MinMaxScaler(), minmax_cols)
    ]
)

# Apply transformation
transformed = preprocessor.fit_transform(df_project)

print(f"Transformed array shape: {transformed.shape}\n")
print("First few rows of transformed data:")
print(transformed[:3])

# Get feature names
feature_names = (preprocessor.named_transformers_['onehot'].get_feature_names_out(onehot_cols).tolist() +
                ['Education_Level', 'Satisfaction', 'Income', 'Hours_Study', 'GPA'])
print(f"\nFeature names: {feature_names}")

Transformed array shape: (6, 8)

First few rows of transformed data:
[[ 1.          0.          1.          1.          1.         -0.70175439
  -0.3380617   0.4       ]
 [ 0.          0.          1.          2.          2.          0.21052632
   0.16903085  0.7       ]
 [ 1.          0.          0.          0.          0.         14.38596491
  -1.35224681  0.        ]]

Feature names: ['City_Dhaka', 'City_Rajshahi', 'Internet_Yes', 'Education_Level', 'Satisfaction', 'Income', 'Hours_Study', 'GPA']


## Step 4: Distance before vs after scaling

In [20]:
# Select Income and Hours_Study for distance analysis
# Pick rows 0, 1, 2 as P1, P2, P3
P1 = df_project[['Income', 'Hours_Study']].iloc[0].values
P2 = df_project[['Income', 'Hours_Study']].iloc[1].values
P3 = df_project[['Income', 'Hours_Study']].iloc[2].values

print("Original Points:")
print(f"P1: {P1}")
print(f"P2: {P2}")
print(f"P3: {P3}\n")

# Distances before scaling
print("Distances BEFORE scaling:")
eucl_p1_p2_orig = distance.euclidean(P1, P2)
eucl_p1_p3_orig = distance.euclidean(P1, P3)
eucl_p2_p3_orig = distance.euclidean(P2, P3)
manh_p1_p2_orig = distance.cityblock(P1, P2)
manh_p1_p3_orig = distance.cityblock(P1, P3)
manh_p2_p3_orig = distance.cityblock(P2, P3)

print(f"Euclidean: P1-P2={eucl_p1_p2_orig:.2f}, P1-P3={eucl_p1_p3_orig:.2f}, P2-P3={eucl_p2_p3_orig:.2f}")
print(f"Manhattan: P1-P2={manh_p1_p2_orig:.2f}, P1-P3={manh_p1_p3_orig:.2f}, P2-P3={manh_p2_p3_orig:.2f}\n")

# Apply Standard Scaler (fit on the three selected rows only, not the full DataFrame)
scaler_std = StandardScaler()
data_std = scaler_std.fit_transform(df_project[['Income', 'Hours_Study']].iloc[:3])
P1_std, P2_std, P3_std = data_std[0], data_std[1], data_std[2]

print("Distances AFTER Standard Scaling:")
eucl_p1_p2_std = distance.euclidean(P1_std, P2_std)
eucl_p1_p3_std = distance.euclidean(P1_std, P3_std)
eucl_p2_p3_std = distance.euclidean(P2_std, P3_std)
manh_p1_p2_std = distance.cityblock(P1_std, P2_std)
manh_p1_p3_std = distance.cityblock(P1_std, P3_std)
manh_p2_p3_std = distance.cityblock(P2_std, P3_std)

print(f"Euclidean: P1-P2={eucl_p1_p2_std:.2f}, P1-P3={eucl_p1_p3_std:.2f}, P2-P3={eucl_p2_p3_std:.2f}")
print(f"Manhattan: P1-P2={manh_p1_p2_std:.2f}, P1-P3={manh_p1_p3_std:.2f}, P2-P3={manh_p2_p3_std:.2f}\n")

# Apply Robust Scaler (fit on the three selected rows only, not the full DataFrame)
scaler_robust = RobustScaler()
data_robust = scaler_robust.fit_transform(df_project[['Income', 'Hours_Study']].iloc[:3])
P1_robust, P2_robust, P3_robust = data_robust[0], data_robust[1], data_robust[2]

print("Distances AFTER Robust Scaling:")
eucl_p1_p2_robust = distance.euclidean(P1_robust, P2_robust)
eucl_p1_p3_robust = distance.euclidean(P1_robust, P3_robust)
eucl_p2_p3_robust = distance.euclidean(P2_robust, P3_robust)
manh_p1_p2_robust = distance.cityblock(P1_robust, P2_robust)
manh_p1_p3_robust = distance.cityblock(P1_robust, P3_robust)
manh_p2_p3_robust = distance.cityblock(P2_robust, P3_robust)

print(f"Euclidean: P1-P2={eucl_p1_p2_robust:.2f}, P1-P3={eucl_p1_p3_robust:.2f}, P2-P3={eucl_p2_p3_robust:.2f}")
print(f"Manhattan: P1-P2={manh_p1_p2_robust:.2f}, P1-P3={manh_p1_p3_robust:.2f}, P2-P3={manh_p2_p3_robust:.2f}")

Original Points:
P1: [3.5e+04 2.5e+00]
P2: [4.8e+04 3.0e+00]
P3: [2.5e+05 1.5e+00]

Distances BEFORE scaling:
Euclidean: P1-P2=13000.00, P1-P3=215000.00, P2-P3=202000.00
Manhattan: P1-P2=13000.50, P1-P3=215001.00, P2-P3=202001.50

Distances AFTER Standard Scaling:
Euclidean: P1-P2=0.81, P1-P3=2.71, P2-P3=3.16
Manhattan: P1-P2=0.93, P1-P3=3.79, P2-P3=4.46

Distances AFTER Robust Scaling:
Euclidean: P1-P2=0.68, P1-P3=2.40, P2-P3=2.74
Manhattan: P1-P2=0.79, P1-P3=3.33, P2-P3=3.88


### Distance Comparison Table

| Pair | Original Euclidean | Original Manhattan | Standard Euclidean | Standard Manhattan | Robust Euclidean | Robust Manhattan |
|------|--------------------|--------------------|--------------------|--------------------|------------------|------------------|
| P1-P2 | 13000.00 | 13000.50 | 0.81 | 0.93 | 0.68 | 0.79 |
| P1-P3 | 215000.00 | 215001.00 | 2.71 | 3.79 | 2.40 | 3.33 |
| P2-P3 | 202000.00 | 202001.50 | 3.16 | 4.46 | 2.74 | 3.88 |
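
To see whether scaling changes which pairs count as closest, the pairs can be ranked by distance before and after standardization. The sketch below uses only the standard library and, like the cell above, standardizes with statistics computed from these three rows alone; `rank_pairs` and `standardize` are hypothetical helpers.

```python
import math
import statistics
from itertools import combinations

points = {"P1": (35000, 2.5), "P2": (48000, 3.0), "P3": (250000, 1.5)}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_pairs(pts):
    """Return all pairs of point names, closest pair first."""
    pairs = combinations(sorted(pts), 2)
    return sorted(pairs, key=lambda pr: euclidean(pts[pr[0]], pts[pr[1]]))

def standardize(pts):
    """Z-score each column using the population standard deviation."""
    names = sorted(pts)
    cols = list(zip(*(pts[n] for n in names)))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return {n: tuple((v - m) / s for v, m, s in zip(pts[n], means, stds)) for n in names}

before = rank_pairs(points)
after = rank_pairs(standardize(points))

print("Before scaling:", before)
print("After scaling: ", after)
```

P1-P2 stays the closest pair in both rankings, while P1-P3 and P2-P3 trade places once Hours_Study is allowed to matter.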

## Step 5: Short Reflection

In [21]:
print("""REFLECTION:

1. Which scaler handled outliers better?
   Robust Scaler handled the Income outlier (250000) better than Standard Scaler.
   It uses median and IQR, making it resistant to extreme values, while Standard
   Scaler's mean and std are affected by outliers.

2. Did scaling change which points are closer?
   Partly. P1-P2 stays the closest pair, but the other two swap: before scaling
   P2-P3 (202000) is closer than P1-P3 (215000) because Income dominates; after
   standard scaling P1-P3 (2.71) is closer than P2-P3 (3.16) because Hours_Study now counts too.

3. Why does this matter for distance-based algorithms?
   Distance-based algorithms (KNN, K-Means, SVM) are sensitive to feature scales.
   Without proper scaling, features with larger ranges dominate distance calculations,
   leading to biased results. Scaling ensures all features contribute fairly to
   similarity measurements, improving model performance and interpretability.
""")

REFLECTION:

1. Which scaler handled outliers better?
   Robust Scaler handled the Income outlier (250000) better than Standard Scaler.
   It uses median and IQR, making it resistant to extreme values, while Standard
   Scaler's mean and std are affected by outliers.

2. Did scaling change which points are closer?
   Partly. P1-P2 stays the closest pair, but the other two swap: before scaling
   P2-P3 (202000) is closer than P1-P3 (215000) because Income dominates; after
   standard scaling P1-P3 (2.71) is closer than P2-P3 (3.16) because Hours_Study now counts too.

3. Why does this matter for distance-based algorithms?
   Distance-based algorithms (KNN, K-Means, SVM) are sensitive to feature scales.
   Without proper scaling, features with larger ranges dominate distance calculations,
   leading to biased results. Scaling ensures all features contribute fairly to
   similarity measurements, improving model performance and interpretability.



# Summary

This notebook covered:
- Different scaling techniques (Min-Max, Standardization, Robust)
- Encoding strategies (One-Hot for nominal, Ordinal for ordered categories)
- Vector operations (dot products, norms, normalization)
- Distance metrics (Euclidean and Manhattan)
- Practical preprocessing with ColumnTransformer
- Impact of scaling on distance-based algorithms

Key Takeaways:
- Use Robust Scaler for data with outliers
- Use Min-Max for bounded data without outliers
- Use Standardization for roughly symmetric, approximately normal data without extreme outliers
- Always scale features before using distance-based algorithms
- Choose encoding based on whether categories have natural ordering