# Part A. Quick basics

## A1. Spot the right scaler

| Feature                        | Scaler         | Reason                                                                      |
|--------------------------------|----------------|-----------------------------------------------------------------------------|
| Apartment_price_BDT (luxury)   | Robust Scaler  | Outliers are present, and Robust Scaler is far less affected by them.       |
| Skin_temperature_C (30 to 36)  | Min-Max Scaler | The range is small and fixed, so Min-Max works well.                        |
| Daily_app_opens (many zeros)   | Robust Scaler  | Many zeros plus outliers, so Robust Scaler is the better fit.               |

---

## A2. Manual Min-Max on a tiny set

Given scores = `[20, 25, 30, 50]`

- Min = 20  
- Max = 50

Formula:  
\[
x_{scaled} = \frac{x - Min}{Max - Min}
\]

Calculations:

| Original | Calculation               | Scaled Value |
|----------|---------------------------|--------------|
| 20       | (20 - 20) / (50 - 20) = 0 | 0            |
| 25       | (25 - 20) / 30 = 5/30      | 0.1667       |
| 30       | (30 - 20) / 30 = 10/30     | 0.3333       |
| 50       | (50 - 20) / 30 = 30/30     | 1            |
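
The table can be cross-checked with a few lines of NumPy. This is only a sketch of the same formula; the variable names `lo`, `hi`, and `scaled` are illustrative.

```python
import numpy as np

scores = np.array([20, 25, 30, 50], dtype=float)

# Min-Max scaling: subtract the minimum, then divide by the range (max - min).
lo, hi = scores.min(), scores.max()
scaled = (scores - lo) / (hi - lo)

print(np.round(scaled, 4))
```

The smallest score always maps to 0 and the largest to 1; everything else lands in between in proportion to its distance from the minimum.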

---

## A3. Z-scores on a subset

Given \(x = [8, 9, 11]\)

1. **Mean:**  
\[
\bar{x} = \frac{8 + 9 + 11}{3} = 9.3333
\]

2. **Population standard deviation:**  
\[
\sigma = \sqrt{\frac{(8-9.333)^2 + (9-9.333)^2 + (11-9.333)^2}{3}} = 1.2472
\]

3. **Z-scores:**  
\[
z = \frac{x_i - \bar{x}}{\sigma}
\]

| Value | Z-score Calculation                  | Z-score  |
|-------|------------------------------------|----------|
| 8     | (8 - 9.333) / 1.2472 = -1.069      | -1.069   |
| 9     | (9 - 9.333) / 1.2472 = -0.267      | -0.267   |
| 11    | (11 - 9.333) / 1.2472 = 1.336      | 1.336    |
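
A quick way to check these numbers is to let NumPy carry full precision instead of rounded intermediates, since rounding the mean to 9.333 first can shift the last digit of a z-score. The sketch below assumes the population standard deviation (`ddof=0`), as in step 2.

```python
import numpy as np

x = np.array([8, 9, 11], dtype=float)

mean = x.mean()
sigma = x.std(ddof=0)      # population standard deviation
z = (x - mean) / sigma     # z-score of every value

print(round(float(mean), 4), round(float(sigma), 4))
print(np.round(z, 3))
```

Z-scores of a full sample always sum to zero, which is a handy sanity check on a hand calculation.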

---

## A4. Robust scaling ingredients

Given \(y = [5, 6, 6, 7, 50]\)

- **Median:** 6  
- **Q1 (First Quartile):** median of [5, 6] = 5.5  
- **Q3 (Third Quartile):** median of [7, 50] = 28.5  
- **IQR (Interquartile Range):** \(28.5 - 5.5 = 23\)
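
The quartiles above use the median-of-halves convention (the middle value is left out of both halves). Other conventions exist, and `np.percentile` with its default linear interpolation gives different quartiles for this small sample. The sketch below compares the two; the helper names are illustrative.

```python
import numpy as np

y = np.array([5, 6, 6, 7, 50], dtype=float)
y_sorted = np.sort(y)
n = len(y_sorted)

# Median-of-halves: split around the middle value (excluded when n is odd).
lower = y_sorted[: n // 2]
upper = y_sorted[(n + 1) // 2 :]
q1_halves, q3_halves = np.median(lower), np.median(upper)

# NumPy default: linear interpolation between order statistics.
q1_np, q3_np = np.percentile(y, [25, 75])

print("median-of-halves:", q1_halves, q3_halves, q3_halves - q1_halves)
print("np.percentile:   ", q1_np, q3_np, q3_np - q1_np)
```

With only five values the outlier 50 falls inside the upper half, so the median-of-halves IQR is 23, while the interpolated IQR is 1. It is worth stating which convention is in use whenever a robust-scaled value is reported.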

---

## A5. Nominal or ordinal

| Variable                           | Type    | Reason                                       |
|------------------------------------|---------|----------------------------------------------|
| T-shirt_size {S, M, L, XL}         | Ordinal | There is a clear order (S < M < L < XL).     |
| City {Dhaka, Chattogram, Rajshahi} | Nominal | The categories have no order or rank.        |
| Satisfaction {Low, Medium, High}   | Ordinal | There is a clear order (Low < Medium < High). |
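
One way to make the ordinal/nominal distinction explicit in pandas is an ordered categorical dtype. The sketch below is an illustration only; the sample values and the names `size_type` and `sizes_ordinal` are assumptions, not part of the exercise.

```python
import pandas as pd

sizes = pd.Series(["M", "S", "XL", "L"])

# An ordered categorical records the ranking S < M < L < XL explicitly.
size_type = pd.CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True)
sizes_ordinal = sizes.astype(size_type)

codes = sizes_ordinal.cat.codes          # integer ranks: S=0, M=1, L=2, XL=3
print(codes.tolist())
print((sizes_ordinal > "S").tolist())    # comparisons respect the declared order
```

A nominal variable such as City would use an unordered categorical, for which comparisons like `>` are not defined.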


In [10]:
import numpy as np
import pandas as pd

# B1.a) Min-Max scale both to [0, 1] (by hand)
def min_max_scale(arr):
    min_val = np.min(arr)
    max_val = np.max(arr)
    return (arr - min_val) / (max_val - min_val)

heights = np.array([150, 160, 170, 175, 180])
weights = np.array([58, 62, 65, 66, 190])

heights_minmax = min_max_scale(heights)
weights_minmax = min_max_scale(weights)

print("Min-Max scaled heights:", heights_minmax)
print("Min-Max scaled weights:", weights_minmax)

# B1.b) Standardize the first three values of each only (by hand)
def standardize(arr):
    mean = np.mean(arr)
    std = np.std(arr)  # population std deviation (ddof=0)
    return (arr - mean) / std

heights_std = standardize(heights[:3])
weights_std = standardize(weights[:3])

print("\nStandardized first 3 heights:", heights_std)
print("Standardized first 3 weights:", weights_std)

# B1.c) Robust scale Weights with median and IQR (by hand)
def robust_scale(arr):
    median = np.median(arr)
    q1 = np.percentile(arr, 25)
    q3 = np.percentile(arr, 75)
    iqr = q3 - q1
    return (arr - median) / iqr

weights_robust = robust_scale(weights)
print("\nRobust scaled weights:", weights_robust)

# B1.d) Which scaler handles the outlier best?
# # Robust scaling handles the outlier best because it uses the median and IQR,
# # which the single extreme weight (190) barely moves.

# --------------------------------------------------------------------------------

# B2. One-hot by hand
import pandas as pd

cities = ['Dhaka', 'Chattogram', 'Dhaka', 'Rajshahi', 'Rajshahi']
df = pd.DataFrame({'City': cities})

one_hot_df = pd.get_dummies(df['City'], prefix='City')
print(one_hot_df)

# B3. Ordinal mapping
education = ['High School', 'Bachelor', 'Master', 'Bachelor', 'Master']
df = pd.DataFrame({'Education': education})

map1 = {'High School': 0, 'Bachelor': 1, 'Master': 2}
df['Education_mapped'] = df['Education'].map(map1).astype(int)
print(df)

# Explanation:
# Increasing all mapped values by 1 shifts the scale by a constant,
# but relative distances between categories remain unchanged.


Min-Max scaled heights: [0.         0.33333333 0.66666667 0.83333333 1.        ]
Min-Max scaled weights: [0.         0.03030303 0.0530303  0.06060606 1.        ]

Standardized first 3 heights: [-1.22474487  0.          1.22474487]
Standardized first 3 weights: [-1.27872403  0.11624764  1.16247639]

Robust scaled weights: [-1.75 -0.75  0.    0.25 31.25]
   City_Chattogram  City_Dhaka  City_Rajshahi
0            False        True          False
1             True       False          False
2            False        True          False
3            False       False           True
4            False       False           True
     Education  Education_mapped
0  High School                 0
1     Bachelor                 1
2       Master                 2
3     Bachelor                 1
4       Master                 2
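
Since B2 asks for one-hot encoding "by hand", the same indicator columns can also be built without `pd.get_dummies`. The following is a minimal sketch using plain comparisons; `df_city` is an illustrative name, and the columns are cast to 0/1 integers rather than booleans.

```python
import pandas as pd

cities = ["Dhaka", "Chattogram", "Dhaka", "Rajshahi", "Rajshahi"]
df_city = pd.DataFrame({"City": cities})

# One indicator column per distinct category, built with plain comparisons.
for category in sorted(df_city["City"].unique()):
    df_city[f"City_{category}"] = (df_city["City"] == category).astype(int)

print(df_city)
```

Each row has exactly one indicator set to 1, which is the defining property of a one-hot encoding.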


In [12]:
import numpy as np

# B4. Encoding mixup [Optional]
# You mistakenly apply ordinal encoding to City and one-hot to Education.
# Write one sentence on the risk this creates in a linear model.

# # Comment:
# # Applying ordinal encoding to City falsely implies an order or hierarchy among cities,
# # so a linear model fits a slope to arbitrary city codes and learns a meaningless relationship.
# # One-hot encoding Education removes the natural order between levels,
# # potentially losing important ordinal information.

# ------------------------------------------------------------------------------------

# B5. Vectors and alignment [Optional]
a = np.array([3, -1, 2])
b = np.array([4, 0, -2])
c = np.array([-6, 2, -4])

# a) Compute dot products
dot_ab = np.dot(a, b)
dot_ac = np.dot(a, c)

print("# B5.a) Dot products:")
print("a·b =", dot_ab)
print("a·c =", dot_ac)

# b) Compare signs and magnitudes to comment on alignment
# # dot_ab is positive (8), indicating partial alignment between vectors a and b (cosine about 0.48).
# # dot_ac is negative (-28), indicating opposite alignment; in fact c = -2a, so a and c point in exactly opposite directions.

# c) L2 normalize a and give the normalized vector to three decimals
norm_a = np.linalg.norm(a)
a_normalized = a / norm_a
a_normalized_rounded = np.round(a_normalized, 3)

print("\n# B5.c) L2 normalized vector a:", a_normalized_rounded)

# ------------------------------------------------------------------------------------

# B6. Two distances, different vibes
P1 = np.array([2, 3])
P2 = np.array([5, 7])
P3 = np.array([2, 10])

def euclidean_dist(p, q):
    return np.linalg.norm(p - q)

def manhattan_dist(p, q):
    return np.sum(np.abs(p - q))

pairs = [(P1, P2), (P2, P3), (P1, P3)]

print("\n# B6.a) Euclidean and Manhattan distances for all pairs:")
for i, (p, q) in enumerate(pairs, 1):
    euc = euclidean_dist(p, q)
    man = manhattan_dist(p, q)
    print(f"Pair {i}: Euclidean = {euc:.3f}, Manhattan = {man}")

# b) Which distance is more sensitive to a single large jump in one coordinate?
# # Euclidean distance is more sensitive: it squares each coordinate difference before summing,
# # so one large gap dominates the total, whereas Manhattan weights every unit of difference equally.

# c) Scale y by 10 and recompute d(P1, P2)
P1_scaled = np.array([2, 3*10])
P2_scaled = np.array([5, 7*10])

euc_scaled = euclidean_dist(P1_scaled, P2_scaled)
man_scaled = manhattan_dist(P1_scaled, P2_scaled)

print(f"\n# B6.c) After scaling y by 10:")
print(f"Euclidean distance between P1 and P2: {euc_scaled:.3f}")
print(f"Manhattan distance between P1 and P2: {man_scaled}")

# # Explanation:
# # Scaling y by 10 stretches the y gap from 4 to 40, so both distances are now driven almost
# # entirely by y: Euclidean grows from 5 to about 40.112 and Manhattan from 7 to 43.
# # The x gap of 3 barely matters any more, which is why features should share a common scale.


# B5.a) Dot products:
a·b = 8
a·c = -28

# B5.c) L2 normalized vector a: [ 0.802 -0.267  0.535]

# B6.a) Euclidean and Manhattan distances for all pairs:
Pair 1: Euclidean = 5.000, Manhattan = 7
Pair 2: Euclidean = 4.243, Manhattan = 6
Pair 3: Euclidean = 7.000, Manhattan = 7

# B6.c) After scaling y by 10:
Euclidean distance between P1 and P2: 40.112
Manhattan distance between P1 and P2: 43
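
To see how much of each distance comes from a single coordinate, the per-coordinate shares can be computed directly. This is a small sketch for the scaled points from B6.c; the "share" quantities are an illustrative device, not a standard metric.

```python
import numpy as np

p1 = np.array([2, 30])   # P1 with y scaled by 10
p2 = np.array([5, 70])   # P2 with y scaled by 10

diff = np.abs(p1 - p2)                         # per-coordinate gaps: x = 3, y = 40
manhattan_share = diff / diff.sum()            # share of each coordinate in the L1 distance
euclidean_share = diff**2 / (diff**2).sum()    # share of each coordinate in the squared L2 distance

print("Manhattan share (x, y):", np.round(manhattan_share, 3))
print("Euclidean share (x, y):", np.round(euclidean_share, 3))
```

The y coordinate accounts for 40 of 43 units of the Manhattan distance but 1600 of 1609 units of the squared Euclidean distance, so the large gap dominates even more strongly once differences are squared.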


# Part C: Mini Datasets

## C-Data-1

| ID | Age | Hours_Study | GPA  | Internet | City       |
|----|-----|-------------|------|----------|------------|
| 1  | 20  | 1.0         | 3.10 | Yes      | Dhaka      |
| 2  | 21  | 0.5         | 2.60 | No       | Chattogram |
| 3  | 22  | 2.2         | 3.40 | Yes      | Rajshahi   |
| 4  | 20  | 5.0         | 3.90 | Yes      | Dhaka      |
| 5  | 23  | 0.2         | 2.30 | No       | Rajshahi   |

---

## C-Data-2

| ID | Income_BDT | Transactions | Temp_C | Education    | Satisfaction |
|----|------------|--------------|--------|--------------|--------------|
| 1  | 30000      | 0            | 25.0   | High School  | Low          |
| 2  | 45000      | 1            | 26.0   | Bachelor     | Medium       |
| 3  | 52000      | 2            | 24.5   | Master       | High         |
| 4  | 300000     | 12           | 28.0   | Bachelor     | Medium       |
| 5  | 38000      | 0            | 25.5   | Master       | Medium       |

---

## C1. Scaler choices with evidence

| Feature          | Chosen Scaler   | Justification                                                        | Numeric Illustration                                                                                                                                   |
|------------------|-----------------|----------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Income_BDT**   | Robust Scaler   | There is an outlier (300000), so Robust Scaler works better.         | Median = 45000; IQR = 14000 (Q1 = 38000, Q3 = 52000, linear-interpolation quartiles); the outlier 300000 is so large that Min-Max would squash the other values toward 0. |
| **Transactions** | Min-Max Scaler  | Small counts from 0 to 12 with a limited spread.                     | Min = 0, Max = 12, so Min-Max scaling is simple and effective: for example 0 → 0 and 12 → 1.                                                           |
| **Temp_C**       | Standard Scaler | The range is small (about 24.5 to 28) and roughly Gaussian.          | Mean = 25.8, population Std ≈ 1.21; StandardScaler centers the values at 0 with unit variance.                                                         |
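
The summary statistics in the numeric illustration can be reproduced from the C-Data-2 columns. The sketch below assumes NumPy's default quartile convention (linear interpolation) and prints both the population and the sample standard deviation, since the two differ noticeably for five values.

```python
import numpy as np

income = np.array([30000, 45000, 52000, 300000, 38000], dtype=float)
temp_c = np.array([25.0, 26.0, 24.5, 28.0, 25.5])

# Quartiles and IQR for Income_BDT (linear interpolation, NumPy's default).
q1, median, q3 = np.percentile(income, [25, 50, 75])
iqr = q3 - q1
print("Income: median, Q1, Q3, IQR =", median, q1, q3, iqr)

# Mean and standard deviation for Temp_C.
temp_mean = temp_c.mean()
temp_std_pop = temp_c.std(ddof=0)      # population std
temp_std_sample = temp_c.std(ddof=1)   # sample std
print("Temp_C: mean, population std, sample std =",
      round(float(temp_mean), 3), round(float(temp_std_pop), 3), round(float(temp_std_sample), 3))
```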

---

## C2. Mixed preprocessing plan

### a) Identify the nominal and ordinal columns

| Nominal Columns  | Ordinal Columns       |
|------------------|-----------------------|
| Internet (Yes/No) | Education (High School < Bachelor < Master) |
| City             | Satisfaction (Low < Medium < High)            |

### b) Encoding plan

| Encoding type    | Columns                             |
|------------------|-----------------------------------|
| One-hot encoding | Internet, City                    |
| Ordinal encoding | Education, Satisfaction           |

### c) Scaling plan

| Scaler           | Columns                           |
|------------------|----------------------------------|
| Robust Scaler    | Income_BDT                       |
| Min-Max Scaler   | Transactions, Hours_Study, Age   |
| Standard Scaler  | Temp_C, GPA                     |

---

**Notes**  
- Age and Hours_Study have small ranges and are treated as roughly normal, so either StandardScaler or Min-Max scaling is workable; Min-Max is used here.  
- GPA is approximately Gaussian, so StandardScaler is a good fit.  
- Internet and City are nominal, so they are one-hot encoded.  
- Education and Satisfaction are ordinal, so they are ordinal encoded.
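
As an illustration of the plan, the sketch below applies the chosen scalers and ordinal maps to C-Data-2 using plain pandas. The output column names (`Income_robust` and so on) are hypothetical, the population standard deviation is assumed for standardization, and the nominal columns (Internet, City) are left out because they belong to C-Data-1; they would go through `pd.get_dummies` as in B2.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Income_BDT": [30000, 45000, 52000, 300000, 38000],
    "Transactions": [0, 1, 2, 12, 0],
    "Temp_C": [25.0, 26.0, 24.5, 28.0, 25.5],
    "Education": ["High School", "Bachelor", "Master", "Bachelor", "Master"],
    "Satisfaction": ["Low", "Medium", "High", "Medium", "Medium"],
})
out = pd.DataFrame(index=df2.index)

# Robust scaling: (x - median) / IQR
inc = df2["Income_BDT"]
out["Income_robust"] = (inc - inc.median()) / (inc.quantile(0.75) - inc.quantile(0.25))

# Min-Max scaling: (x - min) / (max - min)
tr = df2["Transactions"]
out["Transactions_minmax"] = (tr - tr.min()) / (tr.max() - tr.min())

# Standardization: (x - mean) / population std
t = df2["Temp_C"]
out["Temp_std"] = (t - t.mean()) / t.std(ddof=0)

# Ordinal encoding with explicit rank maps
out["Education_ord"] = df2["Education"].map({"High School": 0, "Bachelor": 1, "Master": 2})
out["Satisfaction_ord"] = df2["Satisfaction"].map({"Low": 0, "Medium": 1, "High": 2})

print(out.round(3))
```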



In [13]:
import numpy as np
import pandas as pd

# Income_BDT values
income = np.array([30000, 45000, 52000, 300000, 38000])

# Min-Max Scaling
income_min = income.min()
income_max = income.max()
income_minmax = (income - income_min) / (income_max - income_min)

# Robust Scaling (using Median and IQR)
median = np.median(income)
Q1 = np.percentile(income, 25)
Q3 = np.percentile(income, 75)
IQR = Q3 - Q1
income_robust = (income - median) / IQR

print("Min-Max Scaled Income:", np.round(income_minmax, 3))
print("Robust Scaled Income:", np.round(income_robust, 3))

# # Comparison comment:
# Min-Max scaling compresses most values into a narrow range because the outlier 300000 sets a very large max, 
# whereas Robust scaling centers data around the median and reduces outlier influence.


Min-Max Scaled Income: [0.    0.056 0.081 1.    0.03 ]
Robust Scaled Income: [-1.071  0.     0.5   18.214 -0.5  ]


In [18]:
# Data for IDs 1 and 4
import numpy as np
hs = np.array([1.0, 5.0])  # Hours_Study for ID 1 and 4
gpa = np.array([3.10, 3.90])  # GPA for ID 1 and 4

# Points as vectors (Hours_Study, GPA)
p1 = np.array([1.0, 3.10])  # ID 1
p4 = np.array([5.0, 3.90])  # ID 4
# a) Euclidean distance between the two points (Hours_Study, GPA)
euclidean = np.linalg.norm(p1 - p4)

# b) Manhattan distance
manhattan = np.abs(hs[1] - hs[0]) + np.abs(gpa[1] - gpa[0])

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Manhattan distance: {manhattan:.3f}")

# c) Min-Max normalize Hours_Study and GPA
hs_min, hs_max = hs.min(), hs.max()
gpa_min, gpa_max = gpa.min(), gpa.max()

hs_norm = (hs - hs_min) / (hs_max - hs_min)
gpa_norm = (gpa - gpa_min) / (gpa_max - gpa_min)

euclidean_norm = np.sqrt((hs_norm[1] - hs_norm[0])**2 + (gpa_norm[1] - gpa_norm[0])**2)
manhattan_norm = np.abs(hs_norm[1] - hs_norm[0]) + np.abs(gpa_norm[1] - gpa_norm[0])

print(f"Normalized Euclidean distance: {euclidean_norm:.3f}")
print(f"Normalized Manhattan distance: {manhattan_norm:.3f}")

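# # Comment:
# # Normalization puts both features on the same [0, 1] range, so Hours_Study no longer dominates the distance.
# # Note that min and max are taken from these two points only, so each feature maps to exactly 0 and 1
# # and the results (sqrt(2) and 2) are fixed by construction; the full column ranges would give other values.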


Euclidean distance: 4.079
Manhattan distance: 4.800
Normalized Euclidean distance: 1.414
Normalized Manhattan distance: 2.000
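
For comparison, here is a sketch that takes the Min-Max range from all five rows of C-Data-1 instead of only IDs 1 and 4. This is one possible reading of the exercise, not necessarily the intended one; the helper `min_max` and the array names are illustrative.

```python
import numpy as np

# Hours_Study and GPA for all five rows of C-Data-1
hours = np.array([1.0, 0.5, 2.2, 5.0, 0.2])
gpa_all = np.array([3.10, 2.60, 3.40, 3.90, 2.30])

def min_max(col):
    """Scale a column to [0, 1] using its own minimum and maximum."""
    return (col - col.min()) / (col.max() - col.min())

X = np.column_stack([min_max(hours), min_max(gpa_all)])
p1, p4 = X[0], X[3]   # rows for ID 1 and ID 4

euclid = np.linalg.norm(p1 - p4)
manhat = np.abs(p1 - p4).sum()
print(f"Normalized Euclidean distance (full column range): {euclid:.3f}")
print(f"Normalized Manhattan distance (full column range): {manhat:.3f}")
```

With the full ranges, the scaled gaps are about 0.833 in Hours_Study and 0.5 in GPA, so the distances reflect where the two students sit within the whole dataset.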


In [22]:
import pandas as pd
import math

# Step 1: Create DataFrame
data = {
    "Income": [30000, 45000, 35000, 80000],
    "Hours_Study": [1.0, 2.0, 1.5, 3.0],
    "GPA": [3.1, 3.4, 3.0, 3.8],
    "Transactions_7d": [0, 3, 1, 10],
    "City": ["Dhaka", "Chattogram", "Rajshahi", "Dhaka"],
    "Internet": ["Yes", "No", "Yes", "No"],
    "Education_Level": ["Bachelor", "Master", "High School", "Master"],
    "Satisfaction": ["High", "Medium", "Low", "Medium"]
}
df = pd.DataFrame(data)

# Step 2: Encoding

# One-hot encode City and Internet (reassign df so the indicator columns are kept)
df = pd.get_dummies(df, columns=['City', 'Internet'])


# Ordinal encode Education_Level and Satisfaction
education_map = {"High School": 0, "Bachelor": 1, "Master": 2}
satisfaction_map = {"Low": 0, "Medium": 1, "High": 2}

df['Education_Level_encoded'] = df['Education_Level'].map(education_map)
df['Satisfaction_encoded'] = df['Satisfaction'].map(satisfaction_map)

# Step 3: Scaling

# StandardScaler function
def standard_scale(series):
    mean = series.mean()
    std = series.std(ddof=0)
    return (series - mean) / std

# MinMaxScaler function
def minmax_scale(series):
    min_val = series.min()
    max_val = series.max()
    return (series - min_val) / (max_val - min_val)

# RobustScaler function
def robust_scale(series):
    median = series.median()
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series - median) / iqr

df['Hours_Study_std'] = standard_scale(df['Hours_Study'])
df['GPA_std'] = standard_scale(df['GPA'])
df['Income_minmax'] = minmax_scale(df['Income'])
df['Transactions_robust'] = robust_scale(df['Transactions_7d'])

# Step 4: Distance calculations before and after scaling

# Select first 3 rows
P = df.iloc[:3]

def euclidean_dist(row1, row2, cols):
    return math.sqrt(sum((row1[c] - row2[c])**2 for c in cols))

def manhattan_dist(row1, row2, cols):
    return sum(abs(row1[c] - row2[c]) for c in cols)

cols_before = ['Income', 'Transactions_7d']
cols_after = ['Income_minmax', 'Transactions_robust']

print("Distances BEFORE scaling:")
for i, (idx1, idx2) in enumerate([(0,1), (1,2), (0,2)], 1):
    d_eu = euclidean_dist(P.loc[idx1], P.loc[idx2], cols_before)
    d_ma = manhattan_dist(P.loc[idx1], P.loc[idx2], cols_before)
    print(f"Pair {i}: Euclidean = {d_eu:.2f}, Manhattan = {d_ma:.2f}")

print("\nDistances AFTER scaling:")
for i, (idx1, idx2) in enumerate([(0,1), (1,2), (0,2)], 1):
    d_eu = euclidean_dist(P.loc[idx1], P.loc[idx2], cols_after)
    d_ma = manhattan_dist(P.loc[idx1], P.loc[idx2], cols_after)
    print(f"Pair {i}: Euclidean = {d_eu:.2f}, Manhattan = {d_ma:.2f}")

# Step 5: Reflection (printed as text)

print("""
Reflection:
- RobustScaler handled outliers better for Transactions_7d, reducing the influence of large values.
- Scaling changes the magnitude of distances but usually preserves relative closeness.
- Scaling is essential for distance-based algorithms to prevent dominance of features with larger scales or outliers.
""")


Distances BEFORE scaling:
Pair 1: Euclidean = 15000.00, Manhattan = 15003.00
Pair 2: Euclidean = 10000.00, Manhattan = 10002.00
Pair 3: Euclidean = 5000.00, Manhattan = 5001.00

Distances AFTER scaling:
Pair 1: Euclidean = 0.81, Manhattan = 1.05
Pair 2: Euclidean = 0.54, Manhattan = 0.70
Pair 3: Euclidean = 0.27, Manhattan = 0.35

Reflection:
- RobustScaler handled outliers better for Transactions_7d, reducing the influence of large values.
- Scaling changes the magnitude of distances but usually preserves relative closeness.
- Scaling is essential for distance-based algorithms to prevent dominance of features with larger scales or outliers.
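
The dominance point in the reflection can be made concrete by asking what fraction of the squared Euclidean distance comes from Income, before and after scaling. The sketch below hard-codes the first three rows and the scaling constants that follow from the DataFrame above (Income min 30000 and max 80000; Transactions_7d median 2 and IQR 4); `income_share` is a hypothetical helper.

```python
import numpy as np

# First three rows of Income and Transactions_7d
income = np.array([30000.0, 45000.0, 35000.0])
trans = np.array([0.0, 3.0, 1.0])

# Scaled versions: Min-Max on Income, robust scaling on Transactions_7d
income_mm = (income - 30000.0) / (80000.0 - 30000.0)
trans_rb = (trans - 2.0) / 4.0

def income_share(a, b, i, j):
    """Fraction of the squared Euclidean distance between rows i and j that comes from column a."""
    da, db = a[i] - a[j], b[i] - b[j]
    return da**2 / (da**2 + db**2)

before = income_share(income, trans, 0, 1)
after = income_share(income_mm, trans_rb, 0, 1)
print(f"Income share of squared distance, rows 0 and 1: before = {before:.6f}, after = {after:.3f}")
```

Before scaling, Income contributes essentially all of the distance; after scaling, it contributes about 14% for this pair, so Transactions_7d finally has a say.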

