# **Module 3 Practice Sheet**

## Topics covered:
Standardization, Min-Max scaling, Robust scaling<br>
Nominal vs ordinal variables, one-hot vs ordinal encoding<br>
Vectors, dot product, norms, Euclidean and Manhattan distance


## **Part A. Quick basics**

**A1. Spot the right scaler**

For each feature, pick one scaler and justify in one line.

 a) Apartment_price_BDT with a few luxury penthouses

 b) Skin_temperature_C measured from a wearable between 30 and 36

 c) Daily_app_opens with many zeros and a few power users

a: Apartment_price_BDT has a few luxury penthouses (outliers), so Robust scaling is the most appropriate choice: it centers on the median and divides by the IQR, both of which are barely affected by outliers.

b: Skin_temperature_C lies in a narrow, known range (30 to 36) with no outliers, so Min-Max scaling is the best fit here.

c: Daily_app_opens contains many zeros and a few power users. Standardization (z-scoring) is a workable choice here, because with this many zeros the IQR can be 0, which breaks Robust scaling; note that the power users still inflate the mean and standard deviation.
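
As a rough check on answer c, the sketch below uses made-up open counts (the `opens` array is an assumption, not data from this sheet) to show what can happen to the IQR when most values are zero.

```python
import numpy as np

# Hypothetical daily app opens: mostly zeros plus two heavier users
opens = np.array([0, 0, 0, 0, 0, 0, 0, 0, 3, 120], dtype=float)

q1, q3 = np.quantile(opens, [0.25, 0.75])
iqr = q3 - q1
print("IQR:", iqr)  # both quartiles fall on zeros here

# Robust scaling divides by the IQR, so it is only usable when the IQR is positive
robust_ok = iqr > 0
print("Robust scaling usable:", robust_ok)

# Z-scores stay finite because the standard deviation is non-zero
z = (opens - opens.mean()) / opens.std()
print("z-scores:", np.round(z, 3))
```

With less extreme data the IQR would be positive and Robust scaling would be back on the table, so this only illustrates one failure case.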

***A2. Manual Min-Max on a tiny set***

Given scores = [20, 25, 30, 50], scale to [0, 1] by hand. Show each step.

In [67]:
import numpy as np
import pandas as pd

scores = [20, 25, 30, 50]
scores_df = pd.DataFrame({"scores": scores})
mn = scores_df["scores"].min()
mx = scores_df["scores"].max()
scores_df["min_max_scale"] = np.round((scores_df["scores"] - mn) / (mx - mn),3)
scores_df

| | scores | min_max_scale |
|---|---|---|
| 0 | 20 | 0.0 |
| 1 | 25 | 0.167 |
| 2 | 30 | 0.333 |
| 3 | 50 | 1.0 |
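
Since the task asks for the steps by hand, the arithmetic can be written out as below, taking the minimum as 20 and the maximum as 50 from the given scores.

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} = \frac{x - 20}{50 - 20} = \frac{x - 20}{30}

\frac{20 - 20}{30} = 0, \qquad
\frac{25 - 20}{30} \approx 0.167, \qquad
\frac{30 - 20}{30} \approx 0.333, \qquad
\frac{50 - 20}{30} = 1
```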


***A3. Z-scores on a subset***

Given x = [8, 9, 11], compute mean, standard deviation, then standardize each. Use population standard deviation for this question.

In [68]:
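x = pd.DataFrame({"x":[8,9,11]})
mean = x["x"].mean()
std = x["x"].std(ddof=0)  # population standard deviation; pandas defaults to ddof=1 (sample)
x["Standardization_score"] = (x["x"] - mean)/std
x

| | x | Standardization_score |
|---|---|---|
| 0 | 8 | -1.069045 |
| 1 | 9 | -0.267261 |
| 2 | 11 | 1.336306 |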
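
The question asks for the population standard deviation, and the two libraries used in this sheet have different defaults. The sketch below compares them on the same three values; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

values = [8, 9, 11]

pop_std = np.std(values)               # NumPy default: ddof=0 (population)
sample_std = pd.Series(values).std()   # pandas default: ddof=1 (sample)
print("population std:", round(pop_std, 6))
print("sample std:", round(sample_std, 6))

# Passing ddof explicitly makes pandas return the population value
pandas_pop_std = pd.Series(values).std(ddof=0)
print("pandas with ddof=0:", round(pandas_pop_std, 6))
```

On only three values the gap between the two formulas is large, so the choice of `ddof` visibly changes every z-score.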


***A4. Robust scaling ingredients***

Given y = [5, 6, 6, 7, 50], find median, Q1, Q3, IQR. Do not scale yet.

In [69]:
y = [5, 6, 6, 7, 50]
median = np.median(y)
q1 = np.quantile(y,.25)
q3 = np.quantile(y,.75)
iqr = q3 - q1
print("median:",median)
print("q1:",q1)
print("q3:",q3)
print("iqr:",iqr)

median: 6.0
q1: 6.0
q3: 7.0
iqr: 1.0


***A5. Nominal or ordinal***

Mark each as nominal or ordinal.

 a) T-shirt_size {S, M, L, XL}

 b) City {Dhaka, Chattogram, Rajshahi}
 
 c) Satisfaction {Low, Medium, High}

a: T-shirt_size {S, M, L, XL} -> ordinal

b: City {Dhaka, Chattogram, Rajshahi} -> nominal

c: Satisfaction {Low, Medium, High} -> ordinal
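
One way to make the ordinal label explicit in pandas is an ordered categorical. The sketch below is a minimal illustration with a made-up `sizes` column; the order S < M < L < XL is the one implied by the question.

```python
import pandas as pd

# Hypothetical T-shirt sizes for five people
sizes = pd.Series(["M", "S", "XL", "L", "M"])

# ordered=True tells pandas that S < M < L < XL
size_type = pd.CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True)
sizes_ordered = sizes.astype(size_type)

codes = sizes_ordered.cat.codes.tolist()        # position of each value in the declared order
smaller_than_l = (sizes_ordered < "L").tolist() # comparisons only work for ordered categories
print(codes)
print(smaller_than_l)
```

A nominal column such as City would use the same dtype with `ordered=False`, in which case the `<` comparison raises an error instead of returning a ranking.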

## Part B. Hands-on practice

### B1. Three scalers side by side

Heights = [150, 160, 170, 175, 180]

Weights = [58, 62, 65, 66, 190]

Tasks:

a) Min-Max scale both to [0, 1]

b) Standardize the first three values of each only

c) Robust scale Weights with median and IQR

d) One line on which scaler handles the outlier best


**a) Min-Max scale both to [0, 1]**

In [70]:
Heights = [150, 160, 170, 175, 180]
Weights = [58, 62, 65, 66, 190]
b_df = pd.DataFrame({"heights":Heights,"weights":Weights})
b_a = b_df.copy()
mn = b_a.min()
mx = b_a.max()
b_a[["min_max_scale_heights","min_max_scale_weights"]] = (b_a - mn)/(mx - mn)
b_a

| | heights | weights | min_max_scale_heights | min_max_scale_weights |
|---|---|---|---|---|
| 0 | 150 | 58 | 0.0 | 0.0 |
| 1 | 160 | 62 | 0.333333 | 0.030303 |
| 2 | 170 | 65 | 0.666667 | 0.05303 |
| 3 | 175 | 66 | 0.833333 | 0.060606 |
| 4 | 180 | 190 | 1.0 | 1.0 |


**b) Standardize the first three values of each only**

In [71]:
b_b = b_df.copy().head(3)
mean = b_b.mean()
std = b_b.std()
b_b[["Standardization_heights","Standardization_weights"]] = (b_b - mean)/std
b_b

| | heights | weights | Standardization_heights | Standardization_weights |
|---|---|---|---|---|
| 0 | 150 | 58 | -1.0 | -1.044074 |
| 1 | 160 | 62 | 0.0 | 0.094916 |
| 2 | 170 | 65 | 1.0 | 0.949158 |


***c) Robust scale Weights with median and IQR***

In [72]:
b_c = b_df.copy()
median = b_c.median()
q1 = b_c.quantile(.25)
q3 = b_c.quantile(.75)
iqr = q3 - q1
b_c[["Robust_Scale_heights","Robust_Scale_weights"]] = (b_c - median)/iqr
b_c

| | heights | weights | Robust_Scale_heights | Robust_Scale_weights |
|---|---|---|---|---|
| 0 | 150 | 58 | -1.333333 | -1.75 |
| 1 | 160 | 62 | -0.666667 | -0.75 |
| 2 | 170 | 65 | 0.0 | 0.0 |
| 3 | 175 | 66 | 0.333333 | 0.25 |
| 4 | 180 | 190 | 0.666667 | 31.25 |
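
For task d), one possible way to compare the scalers is to look at how much spread the four ordinary weights keep once the 190 outlier is in the data. The sketch below is an illustration under that assumption; it applies all three scalers to all five Weights, and the z-score uses the population standard deviation.

```python
import numpy as np

weights = np.array([58, 62, 65, 66, 190], dtype=float)

min_max = (weights - weights.min()) / (weights.max() - weights.min())
z_score = (weights - weights.mean()) / weights.std()  # population std (ddof=0)
q1, med, q3 = np.quantile(weights, [0.25, 0.5, 0.75])
robust = (weights - med) / (q3 - q1)

for name, scaled in [("min-max", min_max), ("z-score", z_score), ("robust", robust)]:
    inliers = scaled[:-1]  # everything except the 190 outlier
    spread = inliers.max() - inliers.min()
    print(f"{name:8s} spread of the four ordinary weights: {spread:.3f}")
```

Robust scaling keeps the ordinary weights spread over 2 units, while Min-Max squeezes them into about 0.06 of the range, so Robust scaling handles the outlier best here.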


**B2. One-hot by hand**

Cities = [Dhaka, Chattogram, Dhaka, Rajshahi, Rajshahi]

Create three columns City_Dhaka, City_Chattogram, City_Rajshahi using 0 and 1.

In [73]:
Cities = ["Dhaka", "Chattogram", "Dhaka", "Rajshahi", "Rajshahi"]
b2 = pd.DataFrame({"Cities":Cities})
# One 0/1 indicator column per city; the new columns come out sorted by city name
ohe_city = pd.get_dummies(b2,prefix="City",dtype=int)
b2 = pd.concat([b2,ohe_city],axis = 1)
b2

| | Cities | City_Chattogram | City_Dhaka | City_Rajshahi |
|---|---|---|---|---|
| 0 | Dhaka | 0 | 1 | 0 |
| 1 | Chattogram | 1 | 0 | 0 |
| 2 | Dhaka | 0 | 1 | 0 |
| 3 | Rajshahi | 0 | 0 | 1 |
| 4 | Rajshahi | 0 | 0 | 1 |
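
Because the task says "by hand", the same columns can also be built without a library helper. The sketch below is one plain-Python way to do it; the column order follows the order given in the task.

```python
cities = ["Dhaka", "Chattogram", "Dhaka", "Rajshahi", "Rajshahi"]
categories = ["Dhaka", "Chattogram", "Rajshahi"]

# For every row, write 1 in the column that matches its city and 0 elsewhere
rows = [[1 if city == cat else 0 for cat in categories] for city in cities]

print(["City_" + cat for cat in categories])
for city, row in zip(cities, rows):
    print(city, row)
```

Each row contains exactly one 1, which is a quick sanity check for any one-hot encoding.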


***B3. Ordinal mapping***

Education = [High School, Bachelor, Master, Bachelor, Master]
Map with High School=0, Bachelor=1, Master=2.
Then change the map to High School=1, Bachelor=2, Master=3 and explain in one line how this shifts distances.


In [74]:
education_dict = {"High School":0,"Bachelor":1,"Master":2}
Education = ["High School", "Bachelor", "Master", "Bachelor", "Master"]
b3 = pd.DataFrame({"education":Education})
b3["Ordinal_encoding"] = b3["education"].map(education_dict)
b3

| | education | Ordinal_encoding |
|---|---|---|
| 0 | High School | 0 |
| 1 | Bachelor | 1 |
| 2 | Master | 2 |
| 3 | Bachelor | 1 |
| 4 | Master | 2 |
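
For the second half of the task, the sketch below applies both maps and compares the absolute difference between every pair of rows. The helper names are only for illustration.

```python
from itertools import combinations

education = ["High School", "Bachelor", "Master", "Bachelor", "Master"]
map_from_zero = {"High School": 0, "Bachelor": 1, "Master": 2}
map_from_one = {"High School": 1, "Bachelor": 2, "Master": 3}

codes_zero = [map_from_zero[e] for e in education]
codes_one = [map_from_one[e] for e in education]

# Absolute difference between every pair of rows under each mapping
dist_zero = [abs(a - b) for a, b in combinations(codes_zero, 2)]
dist_one = [abs(a - b) for a, b in combinations(codes_one, 2)]

print(codes_one)
print(dist_zero == dist_one)
```

Adding 1 to every code shifts the values but leaves all pairwise distances unchanged; only the distance of each code from zero changes.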


***B4. Encoding mixup [Optional]***

You mistakenly apply ordinal encoding to City and one-hot to Education. Write one sentence on the risk this creates in a linear model.

ans: City has no inherent order, so ordinal encoding imposes a made-up ranking and spacing on the cities, and a linear model will fit a single slope to those arbitrary numbers. Education does have an order, so one-hot encoding still works but throws away the ranking and adds extra columns.
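
To make the City half of this risk concrete, the sketch below compares the gaps between cities under an arbitrary ordinal coding and under one-hot coding. The specific codes 0, 1, 2 are an assumption chosen for illustration.

```python
import numpy as np

# Hypothetical ordinal codes wrongly assigned to cities
ordinal = {"Dhaka": 0, "Chattogram": 1, "Rajshahi": 2}
# One-hot rows for the same three cities
one_hot = {"Dhaka": np.array([1, 0, 0]),
           "Chattogram": np.array([0, 1, 0]),
           "Rajshahi": np.array([0, 0, 1])}

pairs = [("Dhaka", "Chattogram"), ("Dhaka", "Rajshahi"), ("Chattogram", "Rajshahi")]
for u, v in pairs:
    d_ord = abs(ordinal[u] - ordinal[v])
    d_hot = np.linalg.norm(one_hot[u] - one_hot[v])
    print(f"{u}-{v}: ordinal gap = {d_ord}, one-hot distance = {d_hot:.3f}")
```

Under the ordinal codes Dhaka looks twice as far from Rajshahi as from Chattogram, while one-hot keeps every pair of cities equally far apart.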


***B5. Vectors and alignment [Optional]***

 a = [3, −1, 2], b = [4, 0, −2], c = [−6, 2, −4]
 
 Tasks:

 a) Compute a·b and a·c

 b) Compare signs and magnitudes to comment on the alignment of a with b and with c

 c) L2 normalize a and give the normalized vector to three decimals


a) Compute a·b and a·c

In [75]:
a = np.array([3, -1, 2])
b = np.array([4, 0, -2])
c = np.array([-6, 2, -4])
axb = np.sum(a*b)
axc = np.sum(a*c)
print("a.b:",axb)
print("a.c:",axc)

a.b: 8
a.c: -28


b) Compare signs and magnitudes to comment on the alignment of a with b and with c


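a·b = 8 > 0, so a and b point in broadly the same direction (the angle between them is less than 90°), but the modest value shows they are only partly aligned.

a·c = −28 < 0, so a and c point in opposite directions; in fact c = −2a, so they are exactly anti-parallel.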
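
The raw dot product mixes direction with length, so a common way to judge alignment is cosine similarity, which divides by both norms. The sketch below is a small illustration; the `cosine` helper is not part of the original sheet.

```python
import numpy as np

a = np.array([3, -1, 2])
b = np.array([4, 0, -2])
c = np.array([-6, 2, -4])

def cosine(u, v):
    """Dot product divided by the product of the two L2 norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cos_ab = cosine(a, b)
cos_ac = cosine(a, c)
print("cos(a, b):", round(cos_ab, 3))
print("cos(a, c):", round(cos_ac, 3))
```

A cosine near 1 means the vectors point the same way, near 0 means they are close to perpendicular, and near −1 means they point in opposite directions.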

c) L2 normalize a and give the normalized vector to three decimals

In [76]:
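L2_a = np.linalg.norm(a)
L2_a = np.sqrt(np.sum(a*a))   #Both will provide the same value
a_normalized = a / L2_a       #Divide by the norm to get a unit-length vector
print("L2 norm of a:",round(L2_a,3))
print("L2 normalized a:",np.round(a_normalized,3))

L2 norm of a: 3.742
L2 normalized a: [ 0.802 -0.267  0.535]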


***B6. Two distances, different vibes***

Points: P1(2, 3), P2(5, 7), P3(2, 10)

Tasks:

a) Compute Euclidean and Manhattan distances for all pairs

b) Which distance is more sensitive to a single large jump in one coordinate

c) Scale y by 10 and recompute d(P1, P2) for both distances, then explain the effect in one line


a) Compute Euclidean and Manhattan distances for all pairs

In [77]:
from itertools import combinations

p1 = [2, 3]
p2 = [5, 7]
p3 = [2, 10]
points = pd.DataFrame({"x":[p1[0],p2[0],p3[0]],"y":[p1[1],p2[1],p3[1]]}, index=["P1","P2","P3"])
# Distances are taken between pairs of points (rows), not between the x and y columns
for i, j in combinations(points.index, 2):
    diff = (points.loc[i] - points.loc[j]).to_numpy()
    euclidean_dist = np.linalg.norm(diff)         # same as np.sqrt(np.sum(diff**2))
    manhattan_dist = np.linalg.norm(diff, ord=1)  # same as np.sum(np.abs(diff))
    print(f"{i}-{j}: Euclidean = {euclidean_dist:.3f}, Manhattan = {manhattan_dist:.0f}")

P1-P2: Euclidean = 5.000, Manhattan = 7
P1-P3: Euclidean = 7.000, Manhattan = 7
P2-P3: Euclidean = 4.243, Manhattan = 6


b) Which distance is more sensitive to a single large jump in one coordinate

Euclidean distance is more sensitive to a single large jump in one coordinate, because squaring magnifies large differences more strongly than small ones.
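
A small numeric illustration of this point, using two made-up difference vectors that have the same total change of 8 (both arrays are assumptions, not points from the task):

```python
import numpy as np

# Two hypothetical coordinate differences with the same total change of 8
spread_out = np.array([4, 4])    # change split evenly across both coordinates
single_jump = np.array([8, 0])   # the whole change sits in one coordinate

for name, diff in [("spread_out", spread_out), ("single_jump", single_jump)]:
    manhattan = np.sum(np.abs(diff))
    euclidean = np.sqrt(np.sum(diff ** 2))
    print(f"{name}: Manhattan = {manhattan}, Euclidean = {euclidean:.3f}")
```

Manhattan distance treats both cases the same, while Euclidean distance is larger when the change is concentrated in one coordinate.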

c) Scale y by 10 and recompute d(P1, P2)

In [80]:
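scaled = points.copy()
scaled['y'] *= 10   # work on a copy so re-running the cell does not rescale y again
diff = (scaled.iloc[0] - scaled.iloc[1]).to_numpy()   # P1 minus P2
new_euclidean_distance = np.linalg.norm(diff)
new_manhattan_distance = np.linalg.norm(diff, ord=1)
print("New_euclidean_distance:", round(new_euclidean_distance, 3))
print("New_manhattan_distance:", new_manhattan_distance)

New_euclidean_distance: 40.112
New_manhattan_distance: 43.0

Scaling y by 10 makes the y-difference (40) dominate: d(P1, P2) grows from 5 to about 40.112 for Euclidean and from 7 to 43 for Manhattan, so both distances now mostly reflect y.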


## Part C. Mini datasets

In [82]:
data1 = {
    "ID": [1, 2, 3, 4, 5],
    "Age": [20, 21, 22, 20, 23],
    "Hours_Study": [1.0, 0.5, 2.2, 5.0, 0.2],
    "GPA": [3.10, 2.60, 3.40, 3.90, 2.30],
    "Internet": ["Yes", "No", "Yes", "Yes", "No"],
    "City": ["Dhaka", "Chattogram", "Rajshahi", "Dhaka", "Rajshahi"],
}

c_data_1 = pd.DataFrame(data1)

data2 = {
    "ID": [1, 2, 3, 4, 5],
    "Income_BDT": [30000, 45000, 52000, 300000, 38000],
    "Transactions": [0, 1, 2, 12, 0],
    "Temp_C": [25.0, 26.0, 24.5, 28.0, 25.5],
    "Education": ["High School", "Bachelor", "Master", "Bachelor", "Master"],
    "Satisfaction": ["Low", "Medium", "High", "Medium", "Medium"],
}
c_data_2 = pd.DataFrame(data2)
c_data_1,c_data_2

(   ID  Age  Hours_Study  GPA Internet        City
 0   1   20          1.0  3.1      Yes       Dhaka
 1   2   21          0.5  2.6       No  Chattogram
 2   3   22          2.2  3.4      Yes    Rajshahi
 3   4   20          5.0  3.9      Yes       Dhaka
 4   5   23          0.2  2.3       No    Rajshahi,
    ID  Income_BDT  Transactions  Temp_C    Education Satisfaction
 0   1       30000             0    25.0  High School          Low
 1   2       45000             1    26.0     Bachelor       Medium
 2   3       52000             2    24.5       Master         High
 3   4      300000            12    28.0     Bachelor       Medium
 4   5       38000             0    25.5       Master       Medium)

***C1. Scaler choices with evidence***

Pick a scaler for Income_BDT, Transactions, Temp_C. For each, give a one line justification and a two-line numeric illustration using C-Data-2 values.


In [83]:
#Robust scaling on Income_BDT: the 300000 outlier barely moves the median and IQR
median = c_data_2["Income_BDT"].median()
q1 = c_data_2["Income_BDT"].quantile(.25)
q3 = c_data_2["Income_BDT"].quantile(.75)
iqr = q3 - q1
c_data_2["Robust_Scaling_income"] = (c_data_2["Income_BDT"] - median)/iqr
c_data_2

| | ID | Income_BDT | Transactions | Temp_C | Education | Satisfaction | Robust_Scaling_income |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 30000 | 0 | 25.0 | High School | Low | -1.071429 |
| 1 | 2 | 45000 | 1 | 26.0 | Bachelor | Medium | 0.0 |
| 2 | 3 | 52000 | 2 | 24.5 | Master | High | 0.5 |
| 3 | 4 | 300000 | 12 | 28.0 | Bachelor | Medium | 18.214286 |
| 4 | 5 | 38000 | 0 | 25.5 | Master | Medium | -0.5 |


In [84]:
#Z-score standardization on Transactions: centers the counts at 0 (pandas .std() uses the sample formula, ddof=1)
mean = c_data_2["Transactions"].mean()
std = c_data_2["Transactions"].std()
c_data_2["Standardization_Transaction"] = (c_data_2["Transactions"] - mean)/std
c_data_2

| | ID | Income_BDT | Transactions | Temp_C | Education | Satisfaction | Robust_Scaling_income | Standardization_Transaction |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 30000 | 0 | 25.0 | High School | Low | -1.071429 | -0.588348 |
| 1 | 2 | 45000 | 1 | 26.0 | Bachelor | Medium | 0.0 | -0.392232 |
| 2 | 3 | 52000 | 2 | 24.5 | Master | High | 0.5 | -0.196116 |
| 3 | 4 | 300000 | 12 | 28.0 | Bachelor | Medium | 18.214286 | 1.765045 |
| 4 | 5 | 38000 | 0 | 25.5 | Master | Medium | -0.5 | -0.588348 |


In [85]:
#Min-Max scaling on Temp_C: narrow range (24.5 to 28.0) with no extreme values
mn = c_data_2["Temp_C"].min()
mx = c_data_2["Temp_C"].max()
c_data_2["Min_max_Scale_Temp_C"] = (c_data_2['Temp_C'] - mn)/(mx - mn)
c_data_2

| | ID | Income_BDT | Transactions | Temp_C | Education | Satisfaction | Robust_Scaling_income | Standardization_Transaction | Min_max_Scale_Temp_C |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 30000 | 0 | 25.0 | High School | Low | -1.071429 | -0.588348 | 0.142857 |
| 1 | 2 | 45000 | 1 | 26.0 | Bachelor | Medium | 0.0 | -0.392232 | 0.428571 |
| 2 | 3 | 52000 | 2 | 24.5 | Master | High | 0.5 | -0.196116 | 0.0 |
| 3 | 4 | 300000 | 12 | 28.0 | Bachelor | Medium | 18.214286 | 1.765045 | 1.0 |
| 4 | 5 | 38000 | 0 | 25.5 | Master | Medium | -0.5 | -0.588348 | 0.285714 |


***C2. Mixed preprocessing plan***<br>
For C-Data-1 and C-Data-2 combined:<br>
 a) Identify nominal and ordinal columns<br>
 b) Propose one encoding plan listing exact columns to one-hot vs ordinal<br>
 c) Propose one scaling plan listing exact columns to Min-Max vs Standardization vs Robust

a) Identify nominal and ordinal columns<br>
C_Data_1:<br>
Nominal Columns -> Internet, City<br>
Ordinal Columns -> None<br>
C_Data_2:<br>
Nominal Columns -> None<br>
Ordinal Columns -> Education, Satisfaction

b) Propose one encoding plan listing exact columns to one-hot vs ordinal<br>
C_data_1:<br>
One hot encoding -> Internet, City<br>
Ordinal encoding -> None<br>
C_data_2:<br>
One hot encoding -> None<br>
Ordinal encoding -> Education, Satisfaction

c) Propose one scaling plan listing exact columns to Min-Max vs Standardization vs Robust <br>
C_data_1:<br>
Age -> Min-Max Scaling<br>
Hours_Study -> Robust Scaling<br>
GPA -> Min-Max Scaling<br><br>
C_data_2:<br>
Income_BDT -> Robust Scaling<br>
Transactions -> Standardization<br>
Temp_C -> Min-Max Scaling<br>
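
The encoding plan in b) could be applied roughly as follows. This is a sketch under two assumptions: the categorical columns of both datasets are placed side by side in one frame (both share IDs 1 to 5), and Satisfaction is ranked Low=0, Medium=1, High=2. The Education ranks reuse the map from B3.

```python
import pandas as pd

# Small frame holding only the categorical columns from the two mini datasets
cats = pd.DataFrame({
    "Internet": ["Yes", "No", "Yes", "Yes", "No"],
    "City": ["Dhaka", "Chattogram", "Rajshahi", "Dhaka", "Rajshahi"],
    "Education": ["High School", "Bachelor", "Master", "Bachelor", "Master"],
    "Satisfaction": ["Low", "Medium", "High", "Medium", "Medium"],
})

# Rank orders for the ordinal columns (the Satisfaction ranks are an assumption)
education_rank = {"High School": 0, "Bachelor": 1, "Master": 2}
satisfaction_rank = {"Low": 0, "Medium": 1, "High": 2}

# One-hot the nominal columns, then map the ordinal ones to their ranks
encoded = pd.get_dummies(cats, columns=["Internet", "City"], dtype=int)
encoded["Education"] = encoded["Education"].map(education_rank)
encoded["Satisfaction"] = encoded["Satisfaction"].map(satisfaction_rank)
print(encoded)
```

The numeric columns would then be scaled separately according to the plan in c).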