# Part D. Mini project [Optional]
## Goal
Make one notebook that shows encoding + scaling + distance change. No train–test split, no models.

## Step 1: Create a small DataFrame
Manually create a pandas DataFrame with:
- 3–4 numeric columns (like Income, Hours_Study, GPA)
- 1–2 nominal columns (like City, Internet)
- 1–2 ordinal columns (like Education_Level, Satisfaction)

## Step 2: Decide preprocessing plan (short markdown cell)
Write:
- Which columns will be one-hot encoded
- Which columns will be ordinal encoded (with mapping)
- Which numeric columns will use Standardization, Min–Max, or Robust

## Step 3: Apply ColumnTransformer
Use ColumnTransformer to:
- One-hot encode nominal columns
- Ordinal encode ordinal columns
- Scale numeric columns using your chosen scaler(s)

Show the transformed array (shape + first few rows).

## Step 4: Distance before vs after scaling
Pick two numeric columns, for example (Income, Transactions_7d):
- Take 3 rows only, call them P1, P2, P3.
- Compute Euclidean and Manhattan distances between them before scaling.
- Apply two different scalers to these two columns (for example Standard vs Robust).
- Recompute the distances after each scaler.
- Put results in a tiny table in markdown.

## Step 5: Short reflection
In 3–4 sentences:
- Which scaler handled outliers better for your chosen features?
- Did scaling change which points are “closer” to each other?
- Why does this matter for algorithms that use distance?

In [26]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [27]:
# Step 1: build a small DataFrame with numeric, nominal, and ordinal columns
data = {
    "Income": [30000, 45000, 52000, 61000, 40000],
    "Hours_Study": [10, 15, 8, 20, 12],
    "GPA": [3.2, 3.8, 2.9, 3.6, 3.1],
    "Age": [20, 22, 21, 23, 19],
    
    "City": ["Dhaka", "Chittagong", "Sylhet", "Dhaka", "Rajshahi"],
    "Internet": ["WiFi", "Mobile", "WiFi", "Broadband", "Mobile"],
    
    "Education_Level": ["High School", "College", "College", "University", "High School"],
    "Satisfaction": ["Low", "High", "Medium", "High", "Medium"]
}

df = pd.DataFrame(data)
df

,Income,Hours_Study,GPA,Age,City,Internet,Education_Level,Satisfaction
0,30000,10,3.2,20,Dhaka,WiFi,High School,Low
1,45000,15,3.8,22,Chittagong,Mobile,College,High
2,52000,8,2.9,21,Sylhet,WiFi,College,Medium
3,61000,20,3.6,23,Dhaka,Broadband,University,High
4,40000,12,3.1,19,Rajshahi,Mobile,High School,Medium


In [28]:
# Step 2: preprocessing plan
# Ordinal encoding: Education_Level (High School < College < University) and Satisfaction (Low < Medium < High)
# One-hot encoding: City and Internet
# Scaling: Standardization on Hours_Study, Min-Max on GPA, Robust scaling on Income; Age is left unscaled

In [29]:
# Step 3 (done manually with pandas): ordinal-encode Education_Level with an explicit mapping
order = {'High School': 1, 'College': 2, 'University': 3}
df['Education_Level'] = df['Education_Level'].map(order).astype(int)
df

,Income,Hours_Study,GPA,Age,City,Internet,Education_Level,Satisfaction
0,30000,10,3.2,20,Dhaka,WiFi,1,Low
1,45000,15,3.8,22,Chittagong,Mobile,2,High
2,52000,8,2.9,21,Sylhet,WiFi,2,Medium
3,61000,20,3.6,23,Dhaka,Broadband,3,High
4,40000,12,3.1,19,Rajshahi,Mobile,1,Medium
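
The Step 2 plan also lists Satisfaction as ordinal, but only Education_Level is mapped above. A minimal sketch of encoding both columns with scikit-learn's `OrdinalEncoder` follows. The Education_Level order comes from the mapping in the cell above; the Satisfaction order (Low < Medium < High) is an assumption, and `ordinal_df` is a hypothetical name for a fresh copy of the two raw columns. Note that `OrdinalEncoder` produces codes starting at 0, not at 1 as in the manual dictionary.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Fresh copy of the two ordinal columns from Step 1 (hypothetical name).
ordinal_df = pd.DataFrame({
    "Education_Level": ["High School", "College", "College", "University", "High School"],
    "Satisfaction": ["Low", "High", "Medium", "High", "Medium"],
})

# One category list per column, in ascending order.
education_order = ["High School", "College", "University"]
satisfaction_order = ["Low", "Medium", "High"]  # assumed order

encoder = OrdinalEncoder(categories=[education_order, satisfaction_order])
ordinal_codes = encoder.fit_transform(ordinal_df)  # float array, codes start at 0

print(ordinal_codes)
```

Passing `categories` explicitly matters here: without it the encoder sorts the labels alphabetically, which would put "High" before "Low".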


In [19]:
# One-hot encode Internet; the prefix names the new columns after the source column
d_internet = pd.get_dummies(df['Internet'], prefix='Internet', dtype=bool)
df_int = pd.concat([df, d_internet], axis=1)
df_int = df_int.drop('Internet', axis=1)
df_int

,Income,Hours_Study,GPA,Age,City,Education_Level,Satisfaction,Internet_Broadband,Internet_Mobile,Internet_WiFi
0,30000,10,3.2,20,Dhaka,1,Low,False,False,True
1,45000,15,3.8,22,Chittagong,2,High,False,True,False
2,52000,8,2.9,21,Sylhet,2,Medium,False,False,True
3,61000,20,3.6,23,Dhaka,3,High,True,False,False
4,40000,12,3.1,19,Rajshahi,1,Medium,False,True,False
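
The plan in Step 2 one-hot encodes City as well as Internet, while the cell above handles Internet only. As a rough sketch, `pd.get_dummies` can encode both nominal columns in one call; `nominal_df` and `dummies` are hypothetical names, and the data is copied from Step 1.

```python
import pandas as pd

# Fresh copy of the two nominal columns from Step 1 (hypothetical name).
nominal_df = pd.DataFrame({
    "City": ["Dhaka", "Chittagong", "Sylhet", "Dhaka", "Rajshahi"],
    "Internet": ["WiFi", "Mobile", "WiFi", "Broadband", "Mobile"],
})

# Each listed column is replaced by one 0/1 column per category,
# named <column>_<category>.
dummies = pd.get_dummies(nominal_df, columns=["City", "Internet"], dtype=int)

print(dummies.shape)
print(dummies.columns.tolist())
```

Four distinct cities and three internet types give seven indicator columns in total.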


In [30]:
mn_val = df['Hours_Study'].mean()
print(mn_val)
# pandas .std() is the sample standard deviation (ddof=1)
std_val = df['Hours_Study'].std()
print(std_val)
df['Hours_Study'] = (df['Hours_Study'] - mn_val)/std_val
df['Hours_Study']

13.0
4.69041575982343


0   -0.639602
1    0.426401
2   -1.066004
3    1.492405
4   -0.213201
Name: Hours_Study, dtype: float64
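
One detail worth checking when comparing the manual standardization with scikit-learn: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, using the raw Hours_Study values from Step 1 and hypothetical variable names, shows that the two results differ only by a constant factor of sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

hours = pd.DataFrame({"Hours_Study": [10, 15, 8, 20, 12]})
col = hours["Hours_Study"]

# Manual z-score as in the cell above (pandas std, ddof=1).
manual = ((col - col.mean()) / col.std()).to_numpy()

# scikit-learn z-score (population std, ddof=0).
sk = StandardScaler().fit_transform(hours).ravel()

n = len(col)
ratio = sk / manual  # constant for every row
print(ratio)
print(np.sqrt(n / (n - 1)))
```

With only five rows the factor is noticeable (about 1.118), so the manual values above would not match `StandardScaler` output exactly unless `ddof=0` is passed to `.std()`.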

In [31]:
max_val = df['GPA'].max()
print(max_val)
min_val = df['GPA'].min()
print(min_val)
rg = max_val - min_val
df['GPA'] = (df['GPA'] - min_val)/rg
df['GPA']

3.8
2.9


0    0.333333
1    1.000000
2    0.000000
3    0.777778
4    0.222222
Name: GPA, dtype: float64

In [32]:
med_val = df['Income'].median()
print(med_val)
q1 = df['Income'].quantile(.25)
q3 = df['Income'].quantile(.75)
iqr = q3-q1
print(iqr)
df['Income'] = (df['Income'] - med_val)/iqr
df['Income']

45000.0
12000.0


0   -1.250000
1    0.000000
2    0.583333
3    1.333333
4   -0.416667
Name: Income, dtype: float64
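
The manual robust scaling above subtracts the median and divides by the interquartile range. As a quick cross-check, and assuming scikit-learn's default settings (centering on the median, `quantile_range=(25.0, 75.0)`), `RobustScaler` should reproduce the same values on the raw Income column. Variable names here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

income = pd.DataFrame({"Income": [30000, 45000, 52000, 61000, 40000]})
col = income["Income"]

# Manual version: (x - median) / IQR, as in the cell above.
iqr = col.quantile(0.75) - col.quantile(0.25)
manual_robust = ((col - col.median()) / iqr).to_numpy()

# scikit-learn version with default quantile_range=(25.0, 75.0).
sk_robust = RobustScaler().fit_transform(income).ravel()

print(manual_robust)
print(sk_robust)
```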

In [34]:
print(df.shape)
df

(5, 8)


,Income,Hours_Study,GPA,Age,City,Internet,Education_Level,Satisfaction
0,-1.25,-0.639602,0.333333,20,Dhaka,WiFi,1,Low
1,0.0,0.426401,1.0,22,Chittagong,Mobile,2,High
2,0.583333,-1.066004,0.0,21,Sylhet,WiFi,2,Medium
3,1.333333,1.492405,0.777778,23,Dhaka,Broadband,3,High
4,-0.416667,-0.213201,0.222222,19,Rajshahi,Mobile,1,Medium
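
Step 3 of the task asks for a `ColumnTransformer`, while the cells above apply each transformation by hand. One possible way to express the same Step 2 plan in a single object is sketched below. It is not the notebook's original code: the Satisfaction order is an assumption, Age is passed through unscaled because the plan assigns it no scaler, and the numeric results for Hours_Study will differ slightly from the manual ones because `StandardScaler` uses the population standard deviation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    MinMaxScaler, OneHotEncoder, OrdinalEncoder, RobustScaler, StandardScaler,
)

raw = pd.DataFrame({
    "Income": [30000, 45000, 52000, 61000, 40000],
    "Hours_Study": [10, 15, 8, 20, 12],
    "GPA": [3.2, 3.8, 2.9, 3.6, 3.1],
    "Age": [20, 22, 21, 23, 19],
    "City": ["Dhaka", "Chittagong", "Sylhet", "Dhaka", "Rajshahi"],
    "Internet": ["WiFi", "Mobile", "WiFi", "Broadband", "Mobile"],
    "Education_Level": ["High School", "College", "College", "University", "High School"],
    "Satisfaction": ["Low", "High", "Medium", "High", "Medium"],
})

ordinal_categories = [
    ["High School", "College", "University"],
    ["Low", "Medium", "High"],  # assumed order
]

preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["City", "Internet"]),
        ("ordinal", OrdinalEncoder(categories=ordinal_categories),
         ["Education_Level", "Satisfaction"]),
        ("standard", StandardScaler(), ["Hours_Study"]),
        ("minmax", MinMaxScaler(), ["GPA"]),
        ("robust", RobustScaler(), ["Income"]),
    ],
    remainder="passthrough",  # Age is appended unchanged as the last column
    sparse_threshold=0,       # always return a dense array
)

X = preprocess.fit_transform(raw)
print(X.shape)
print(X[:3])
```

The output columns follow the order of the `transformers` list: seven one-hot columns, two ordinal codes, the three scaled numeric columns, then Age.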


In [42]:
# Step 4: rebuild the raw (unscaled) DataFrame, since df was scaled in place above
data = {
    "Income": [30000, 45000, 52000, 61000, 40000],
    "Hours_Study": [10, 15, 8, 20, 12],
    "GPA": [3.2, 3.8, 2.9, 3.6, 3.1],
    "Age": [20, 22, 21, 23, 19],
    
    "City": ["Dhaka", "Chittagong", "Sylhet", "Dhaka", "Rajshahi"],
    "Internet": ["WiFi", "Mobile", "WiFi", "Broadband", "Mobile"],
    
    "Education_Level": ["High School", "College", "College", "University", "High School"],
    "Satisfaction": ["Low", "High", "Medium", "High", "Medium"]
}
df2 = pd.DataFrame(data)
df2

,Income,Hours_Study,GPA,Age,City,Internet,Education_Level,Satisfaction
0,30000,10,3.2,20,Dhaka,WiFi,High School,Low
1,45000,15,3.8,22,Chittagong,Mobile,College,High
2,52000,8,2.9,21,Sylhet,WiFi,College,Medium
3,61000,20,3.6,23,Dhaka,Broadband,University,High
4,40000,12,3.1,19,Rajshahi,Mobile,High School,Medium


In [46]:
# P1, P2, P3 are rows 0, 1, 2; each vector holds (Income, Age)
vec_cols = ['Income', 'Age']
v1 = df2.loc[0, vec_cols].values
v2 = df2.loc[1, vec_cols].values
v3 = df2.loc[2, vec_cols].values
print(v1)
print(v2)
print(v3)

[np.int64(30000) np.int64(20)]
[np.int64(45000) np.int64(22)]
[np.int64(52000) np.int64(21)]


In [49]:
dif_12 = v1 - v2
euclid_12 = np.sqrt(np.dot(dif_12, dif_12))
print(euclid_12)
manhat_12 = np.sum(np.abs(dif_12))
print(manhat_12)

15000.000133333333
15002


In [50]:
dif_23 = v2 - v3
euclid_23 = np.sqrt(np.dot(dif_23, dif_23))
print(euclid_23)
manhat_23 = np.sum(np.abs(dif_23))
print(manhat_23)

7000.000071428571
7001


In [51]:
dif_13 = v1 - v3
euclid_13 = np.sqrt(np.dot(dif_13, dif_13))
print(euclid_13)
manhat_13 = np.sum(np.abs(dif_13))
print(manhat_13)

22000.000022727272
22001
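
The cells above cover only the "before scaling" half of Step 4. A sketch of the remaining part, recomputing the same three distances after `StandardScaler` and `RobustScaler`, could look like the following. Two assumptions are made here: the scalers are fit on all five rows of (Income, Age) and P1, P2, P3 are then taken as the first three scaled rows, and the helper `pairwise_distances_3` is a hypothetical name, not part of any library.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

full = pd.DataFrame({
    "Income": [30000, 45000, 52000, 61000, 40000],
    "Age": [20, 22, 21, 23, 19],
})

def pairwise_distances_3(points):
    """Return {pair: (euclidean, manhattan)} for the first three rows."""
    result = {}
    for i, j in [(0, 1), (1, 2), (0, 2)]:
        diff = points[i] - points[j]
        euclidean = float(np.sqrt(np.dot(diff, diff)))
        manhattan = float(np.sum(np.abs(diff)))
        result[f"P{i + 1}-P{j + 1}"] = (euclidean, manhattan)
    return result

# Distances on the raw values, then after each scaler.
results = {"raw": pairwise_distances_3(full.to_numpy(dtype=float))}
for name, scaler in [("standard", StandardScaler()), ("robust", RobustScaler())]:
    scaled = scaler.fit_transform(full)  # fit on all five rows (assumption)
    results[name] = pairwise_distances_3(scaled)

# Flatten into a small table for the markdown summary.
rows = [
    {"scaling": name, "pair": pair, "euclidean": euc, "manhattan": man}
    for name, dists in results.items()
    for pair, (euc, man) in dists.items()
]
table = pd.DataFrame(rows)
print(table.round(4).to_string(index=False))
```

On the raw values the distances are almost entirely determined by Income, since Age differences of 1 or 2 are negligible next to differences of thousands. After either scaler both columns are on comparable scales, so Age contributes visibly to each distance, which is the effect the Step 5 reflection asks about.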
