
# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name:**   Robiul Awal

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any libraries other than `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset with the following columns:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [None]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [None]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
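
A minimal sketch of how the column check above could be automated. The helper `missing_columns` and the two-row `sample` frame are hypothetical illustrations (they are not part of the assignment template); in the notebook you would pass the real `df` instead.

```python
import pandas as pd

# Columns the assignment expects (taken from the list above).
REQUIRED_COLUMNS = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Hypothetical helper: return the required columns absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Tiny stand-in frame built from the first two rows shown in df.head().
sample = pd.DataFrame({
    "user_id": [1, 2],
    "age": [43, 49],
    "monthly_income": [3734.19, 2594.19],
    "daily_screen_time_min": [109, 194],
    "daily_app_opens": [48, 7],
    "true_label": [0, 0],
    "pred_label": [0, 0],
    "satisfaction_level": ["Medium", "Low"],
    "city_type": ["Suburban", "Urban"],
})

print("Shape:", sample.shape)
print("Missing columns:", missing_columns(sample))
```

An empty list means every expected column is present.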



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [None]:
# Q1.1: Choose your numeric column here [answer already provided]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [Marks: 05]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

>  The average daily screen time is 181.89 minutes with a standard deviation of 68.89 minutes, based on 100 users; the minimum is 60 minutes and the maximum is 299 minutes.
>  A quarter of users spend at most 122 minutes, half spend at most 178 minutes, and three quarters spend at most 243.75 minutes.
>  The maximum (299) is about 117 minutes above the mean (181.89), which is roughly 1.7 standard deviations, so the heaviest users spend noticeably more time than average but there is no extreme outlier.
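
One way to judge whether the maximum is "far" from the mean is to express the gap in standard deviations. The sketch below assumes only the summary values copied from the `describe()` output above; the variable names are illustrative.

```python
# Summary values copied from the describe() output above.
mean_val = 181.89
std_val = 68.886951
min_val = 60.0
max_val = 299.0

# Distance from the mean, measured in standard deviations.
z_max = (max_val - mean_val) / std_val
z_min = (min_val - mean_val) / std_val

print(f"max is {z_max:.2f} standard deviations above the mean")
print(f"min is {abs(z_min):.2f} standard deviations below the mean")
```

Both extremes sit within two standard deviations of the mean, which suggests the column has no extreme outliers.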


### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [None]:
# Q2.1: Compute proportion of positive class [answer already provided]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

>  The proportion of the positive class is 0.52, so the dataset is nearly balanced (52% positive vs 48% negative), with the positive class only slightly more common than the negative class.
>  
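
As an alternative view of the same calculation, `value_counts(normalize=True)` reports the proportion of every class at once. This is a sketch on a stand-in Series that only reproduces the class counts printed above (52 positives, 48 negatives); in the notebook you would use `df["true_label"]`.

```python
import pandas as pd

# Stand-in label column with the same class counts as the output above.
labels = pd.Series([1] * 52 + [0] * 48, name="true_label")

# Proportion of each class, indexed by label value (0, then 1).
proportions = labels.value_counts(normalize=True).sort_index()
print(proportions)

# For 0/1 labels the mean equals the proportion of the positive class.
print("Positive proportion via mean:", labels.mean())
```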



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [None]:
# Q3.1: Manually compute TP, TN, FP, FN [answer already provided]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [None]:
# Q3.2: Compute accuracy, precision, recall [answer already provided]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

> The model has an accuracy of 55%, which is only slightly better than random guessing on a roughly balanced dataset. Its precision is low at 0.571: when the model predicts "positive" (label 1), it is correct a little more than half the time, and 21 of its 49 positive predictions are false positives. The recall is also low at 0.538, so the model misses almost half of the actual positive cases (24 of 52). Overall, the model performs poorly on both classes and needs significant improvement.
>  
>  
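
To put the 55% accuracy in context, it can be compared with a trivial baseline. The sketch below recomputes the three metrics from the confusion matrix counts printed in Q3.1 and adds a majority-class baseline; the baseline is an assumption introduced here for comparison, not something the assignment asks for.

```python
# Counts copied from the Q3.1 output above.
tp, tn, fp, fn = 28, 27, 21, 24
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Baseline (assumption for comparison): always predict the majority class.
# Actual positives = tp + fn = 52, actual negatives = tn + fp = 48.
majority_baseline = max(tp + fn, tn + fp) / total

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
print(f"majority-class baseline accuracy={majority_baseline:.2f}")
```

The model's accuracy is only three percentage points above a rule that always predicts class 1.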



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [None]:
# Q4.1: Choose the numeric column [2 marks]

numeric_column = "monthly_income"
income = df[numeric_column]
income.head()

Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84


In [None]:
# Q4.2: Standardization with z-score [10 marks]

mean_income = income.mean()
std_income = income.std()


df['monthly_income_standardized'] = (income - mean_income)  / std_income


print("Mean of the standardized column:", df['monthly_income_standardized'].mean())


print("Std of the standardized column:", df['monthly_income_standardized'].std())

df[['monthly_income', 'monthly_income_standardized']].head(10)

Mean of the standardized column: -8.826273045769995e-16
Std of the standardized column: 1.0


Unnamed: 0,monthly_income,monthly_income_standardized
0,3734.19,0.944685
1,2594.19,-0.324626
2,3550.47,0.740126
3,3821.18,1.041542
4,1750.84,-1.263639
5,4043.4,1.288969
6,4076.52,1.325846
7,2961.36,0.084192
8,2420.1,-0.518464
9,1000.0,-2.099647


In [None]:
# Q4.3: Min max scaling implementation [10 marks]

min_income = income.min()
max_income = income.max()


df['monthly_income_minmax'] = (income - min_income) /  (max_income - min_income)



print("Min of min-max scaled column:", df['monthly_income_minmax'].min())

print("Max of min-max scaled column:", df['monthly_income_minmax'].max())



df[['monthly_income', 'monthly_income_standardized', 'monthly_income_minmax']].head(10)

Min of min-max scaled column: 0.0
Max of min-max scaled column: 1.0


Unnamed: 0,monthly_income,monthly_income_standardized,monthly_income_minmax
0,3734.19,0.944685,0.675209
1,2594.19,-0.324626,0.393685
2,3550.47,0.740126,0.629839
3,3821.18,1.041542,0.696691
4,1750.84,-1.263639,0.18542
5,4043.4,1.288969,0.751568
6,4076.52,1.325846,0.759747
7,2961.36,0.084192,0.484358
8,2420.1,-0.518464,0.350694
9,1000.0,-2.099647,0.0



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

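>  Standardization (z-score) rescales the column to mean 0 and standard deviation 1, so the values are unbounded and can be negative or larger than 1; in the first ten rows they fall roughly between -2.1 and +1.3. Min-max scaling maps every value into the closed range [0, 1], with the minimum at 0 and the maximum at 1. Both are linear transformations, so both keep the shape of the original distribution; min-max is useful when a fixed [0, 1] range is needed, while standardization expresses each value as a number of standard deviations from the mean.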
>  
>  
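
The comparison can be checked numerically. This is a sketch that uses only the first ten `monthly_income` values shown in the output above, so its mean, standard deviation, and scaled values will differ from the full 100-row results; the point is the ranges and the linear relationship, not the exact numbers.

```python
import numpy as np
import pandas as pd

# First ten monthly_income values copied from the Q4.2 output above.
income = pd.Series([3734.19, 2594.19, 3550.47, 3821.18, 1750.84,
                    4043.40, 4076.52, 2961.36, 2420.10, 1000.00])

# Z-score: pandas .std() uses the sample standard deviation (ddof=1).
z = (income - income.mean()) / income.std()

# Min-max: smallest value maps to 0, largest to 1.
mm = (income - income.min()) / (income.max() - income.min())

print("z-score range:", round(z.min(), 3), "to", round(z.max(), 3))
print("min-max range:", mm.min(), "to", mm.max())

# Both are linear transforms of the same column, so they are perfectly correlated.
corr = float(np.corrcoef(z, mm)[0, 1])
print("correlation between the two scaled columns:", round(corr, 6))
```

Note that `numpy`'s `np.std` defaults to `ddof=0`, so standardizing with it gives slightly different values from the pandas version used in the notebook.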



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [None]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]


city_onehot = pd.get_dummies(df['city_type'], prefix='city')


print("One-hot encoded city_type columns:")


city_onehot.head()

One-hot encoded city_type columns:


Unnamed: 0,city_Rural,city_Suburban,city_Urban
0,False,True,False
1,False,False,True
2,True,False,False
3,False,True,False
4,False,True,False


In [None]:
# Q5.2: Attach one hot encoded columns to df [5 marks]

df = pd.concat([df, city_onehot], axis=1)



df[['city_type', 'city_Rural', 'city_Suburban', 'city_Urban']].head()


Unnamed: 0,city_type,city_Rural,city_Suburban,city_Urban
0,Suburban,False,True,False
1,Urban,False,False,True
2,Rural,True,False,False
3,Suburban,False,True,False
4,Suburban,False,True,False


In [None]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]

order = ['Low', 'Medium', 'High']


df['satisfaction_level_ordinal'] = df['satisfaction_level'].map(
{
    'Low': 0,
    'Medium': 1,
    'High': 2
})



df[['satisfaction_level', 'satisfaction_level_ordinal']].head(10)

Unnamed: 0,satisfaction_level,satisfaction_level_ordinal
0,Medium,1
1,Low,0
2,High,2
3,High,2
4,Medium,1
5,Low,0
6,Low,0
7,High,2
8,High,2
9,Medium,1



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

>  `city_type` has no natural order (Urban, Suburban, and Rural are just different categories), so one-hot encoding is appropriate: it treats the categories equally without creating a fake ordering. `satisfaction_level` has a clear order (Low < Medium < High), so ordinal encoding is the better choice: it preserves the meaningful ranking, which distance-based algorithms can use.
>  
>  
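
The two encodings can be seen side by side on a small example. The sketch below uses a three-row stand-in frame with the category values from the dataset, and it uses an ordered `pd.Categorical` for the ordinal codes as an alternative to the `.map()` dictionary in Q5.3; both approaches produce the same integers for these categories.

```python
import pandas as pd

# Small stand-in sample using the category values from the dataset.
sample = pd.DataFrame({
    "city_type": ["Suburban", "Urban", "Rural"],
    "satisfaction_level": ["Medium", "Low", "High"],
})

# Nominal feature: one indicator column per category, no implied order.
city_onehot = pd.get_dummies(sample["city_type"], prefix="city")

# Ordinal feature: an ordered categorical yields integer codes that
# follow the declared order Low < Medium < High.
order = ["Low", "Medium", "High"]
sat = pd.Categorical(sample["satisfaction_level"], categories=order, ordered=True)
sample["satisfaction_level_ordinal"] = sat.codes

print(city_onehot.columns.tolist())
print(sample["satisfaction_level_ordinal"].tolist())
```

One difference worth knowing: a value outside `order` gets code -1 with `pd.Categorical`, whereas `.map()` with a dictionary returns `NaN` for it.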



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
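- `monthly_income_standardized` (the standardized income column created in Q4.2; the provided code below uses the raw `monthly_income`)  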
- `daily_app_opens` (or its min max scaled version if you prefer)


In [None]:
# Q6.1: Build 2D vectors for first two users [answer already provided]
vec_cols = ["monthly_income", "daily_app_opens"]




v1 = df.loc[0, vec_cols].values

v2 = df.loc[1, vec_cols].values




print("v1:", v1)

print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [None]:
# Q6.2: Euclidean distance computation [5 marks]

import numpy as np


v1 = df.loc[0, ['monthly_income', 'daily_app_opens']].values

v2 = df.loc[1, ['monthly_income', 'daily_app_opens']].values




euclidean_dist = np.sqrt( ((v1[0] - v2[0])**2)  + ((v1[1] - v2[1])**2) )




print("Euclidean distance:", euclidean_dist)


In [None]:
# Q6.3: Manhattan distance computation [5 marks]

# Sum of absolute coordinate differences, computed over both features at once
manhattan_dist = np.sum(np.abs(v1 - v2))

print("Manhattan distance:", manhattan_dist)


> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

>  In my result the Manhattan distance (1181.0) is larger than the Euclidean distance (about 1140.74). Manhattan distance is always larger than or equal to Euclidean distance, because Euclidean takes the square root of the sum of squared differences (the straight-line distance), while Manhattan simply adds the absolute differences (walking along grid lines). The two are equal only when the points differ in at most one dimension.
>  
>  
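
The relationship between the two distances can be verified directly. This sketch rebuilds the two raw vectors from the Q6.1 output above as plain float arrays (an assumption made to keep the example self-contained) and also shows how much of the distance comes from income alone.

```python
import numpy as np

# Raw feature vectors for users 0 and 1, copied from the Q6.1 output above:
# [monthly_income, daily_app_opens]
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])

diff = np.abs(v1 - v2)                    # per-feature absolute differences
euclidean = np.sqrt(np.sum(diff ** 2))    # straight-line distance
manhattan = np.sum(diff)                  # grid-line distance

print("Per-feature absolute differences:", diff)
print("Euclidean:", round(float(euclidean), 2))
print("Manhattan:", round(float(manhattan), 2))
print("Income share of Manhattan distance:", round(float(diff[0] / manhattan), 3))
```

Because the vectors are unscaled, the income difference accounts for almost all of both distances, which is the usual argument for scaling features before computing distances.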



---
## Final Reflection [10 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

>  The three modules in this assignment connect through the data preparation pipeline for machine learning. From Modules 1 and 2, I computed the class proportion (52% positive vs 48% negative) and the basic model metrics; on this nearly balanced dataset, an accuracy of 0.55, a precision of about 0.57, and a recall of about 0.54 show that the model is barely better than random guessing and needs better features. Module 3 addresses this with preprocessing: standardization and min-max scaling put numeric features on comparable scales, while one-hot encoding for the nominal `city_type` and ordinal encoding for the ordered `satisfaction_level` preserve the meaningful relationships in the categorical features. We then computed Euclidean and Manhattan distances between two users from income and app opens, which illustrates why scaling is necessary: without it, a high-magnitude feature like income dominates the distance. Together, the dataset characteristics from Modules 1 and 2 (class balance and metric limitations) motivate the preprocessing choices made in Module 3. The result is a cleaner, fairer, and more interpretable dataset that is ready for effective modelling.


>  
>  



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- As shown in the instruction video, first use "Save a copy in Drive" on the Colab file we provided. Then write your code in that copy in Google Colab, set its sharing to 'Anyone with the link' with 'View' access, and submit the file's shareable link.
