
# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name:**   MD SHARIFUL ISLAM

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Modules 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any libraries beyond `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [86]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [87]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

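| | user_id | age | monthly_income | daily_screen_time_min | daily_app_opens | true_label | pred_label | satisfaction_level | city_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 43 | 3734.19 | 109 | 48 | 0 | 0 | Medium | Suburban |
| 1 | 2 | 49 | 2594.19 | 194 | 7 | 0 | 0 | Low | Urban |
| 2 | 3 | 19 | 3550.47 | 146 | 36 | 1 | 0 | High | Rural |
| 3 | 4 | 19 | 3821.18 | 287 | 14 | 1 | 0 | High | Suburban |
| 4 | 5 | 63 | 1750.84 | 66 | 46 | 0 | 0 | Medium | Suburban |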



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
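
The column check in step 2 can be sketched as a small helper. This is only an illustration: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names that do not appear in the assignment, and the `loaded` list simply copies the column names visible in the `df.head()` output.

```python
# Hypothetical helper (not part of the assignment): list expected columns
# that are absent from a collection of column names.
REQUIRED_COLUMNS = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are not present, in order."""
    present = set(columns)
    return [name for name in required if name not in present]

# Column names copied from the df.head() output in this notebook.
loaded = ["user_id", "age", "monthly_income", "daily_screen_time_min",
          "daily_app_opens", "true_label", "pred_label",
          "satisfaction_level", "city_type"]

print(missing_columns(loaded))  # → []
```

In the notebook itself, the same helper could be called as `missing_columns(df.columns)`; an empty list means every expected column is present.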



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [88]:
# Q1.1: Choose your numeric column here [answer already provided]
num_col = "daily_app_opens"

df[num_col].describe()

| | daily_app_opens |
|---|---|
| count | 100.0 |
| mean | 26.99 |
| std | 13.669619 |
| min | 5.0 |
| 25% | 15.0 |
| 50% | 27.0 |
| 75% | 39.0 |
| max | 49.0 |



> **Q1.2 Short answer: [Marks: 05]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

>  The `daily_app_opens` column has 100 values with a mean of 26.99 and a standard deviation of 13.67; the minimum is 5 opens per day and the maximum is 49. The maximum is about 22 opens above the mean, which is roughly 1.6 standard deviations, so it does not look like an extreme outlier. The median (27) is almost identical to the mean, which suggests the values are spread fairly symmetrically.
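
One way to judge whether the max is "far" from the mean is to express the gap in standard deviations. The sketch below assumes only the summary statistics printed by `describe()` for `daily_app_opens`; the helper name `z_score` is illustrative.

```python
# Summary statistics copied from the describe() output for daily_app_opens.
mean, std = 26.99, 13.669619
col_min, col_max = 5.0, 49.0

def z_score(value, mean, std):
    """Distance of a value from the mean, measured in standard deviations."""
    return (value - mean) / std

z_max = z_score(col_max, mean, std)
z_min = z_score(col_min, mean, std)

print(f"z(max) = {z_max:.2f}, z(min) = {z_min:.2f}")
# → z(max) = 1.61, z(min) = -1.61
```

Both extremes sit about 1.6 standard deviations from the mean, so neither end of the column stands out as unusual.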


### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [89]:
# Q2.1: Compute proportion of positive class [answer already provided]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

>  The proportion of the positive class is 0.52 (52 positives out of 100 users), so the dataset is fairly balanced between the two classes. Because neither class dominates, the data is well suited to training a classifier, and accuracy is a meaningful metric here.
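
The balance claim can be made concrete with the ratio of the larger class to the smaller one. This is a minimal sketch using the counts printed in Q2.1; `imbalance_ratio` is an illustrative name, and a value close to 1 indicates a balanced dataset.

```python
# Counts copied from the Q2.1 output.
positive_count, total_count = 52, 100
negative_count = total_count - positive_count

# Ratio of the larger class to the smaller class (1.0 means perfectly balanced).
# This sketch assumes both classes are non-empty.
imbalance_ratio = (max(positive_count, negative_count)
                   / min(positive_count, negative_count))

print(f"negative={negative_count}, ratio={imbalance_ratio:.3f}")
# → negative=48, ratio=1.083
```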



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [90]:
# Q3.1: Manually compute TP, TN, FP, FN [answer already provided]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [91]:
# Q3.2: Compute accuracy, precision, recall [answer already provided]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

>  From the Q3.2 cell, the model's accuracy is 55%, meaning it predicts correctly only slightly more than half of the time, which is barely better than random guessing on this nearly balanced dataset. The precision of 57% shows that when the model predicts positive, it is correct a little more than half of the time. The recall of 54% means the model finds only about half of the actual positives and misses the rest (24 false negatives). Overall, the model is weak on both precision and recall and needs improvement.
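
To put the 55% accuracy in context, it helps to compare it with a trivial baseline that always predicts the more common true class. The sketch below uses only the confusion-matrix counts printed in Q3.1; the variable names are illustrative, and the baseline is an assumption added for comparison, not something the assignment asks for.

```python
# Confusion-matrix counts copied from the Q3.1 output.
tp, tn, fp, fn = 28, 27, 21, 24
total = tp + tn + fp + fn

accuracy = (tp + tn) / total

# Baseline: always predict whichever true class is more common.
majority_baseline = max(tp + fn, tn + fp) / total

# Share of actual positives the model misses (1 - recall).
missed_positives = fn / (tp + fn)

# Share of positive predictions that are wrong (1 - precision).
wrong_positive_calls = fp / (tp + fp)

print(f"accuracy={accuracy:.2f}, baseline={majority_baseline:.2f}")
print(f"missed positives={missed_positives:.2f}, "
      f"wrong positive calls={wrong_positive_calls:.2f}")
```

The model beats the always-positive baseline by only three percentage points, which supports the conclusion that its performance is weak.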



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [92]:
# Q4.1: Choose the numeric column [2 marks]
numeric_col = "monthly_income"

print(df[numeric_col].describe())

count     100.000000
mean     2885.745000
std       898.124693
min      1000.000000
25%      2379.165000
50%      2894.020000
75%      3593.165000
max      5049.400000
Name: monthly_income, dtype: float64


In [93]:
# Q4.2: Standardization with z-score [10 marks]

mean = df[numeric_col].mean()
std = df[numeric_col].std()

df["monthly_income_std"] = (df[numeric_col] - mean)/std

print(df["monthly_income_std"].describe())

print(f"\nFirst 10 values [monthly_income] and [monthly_income_std]:")
print(df[[numeric_col, f'{numeric_col}_std']].head(10))

count    1.000000e+02
mean    -8.826273e-16
std      1.000000e+00
min     -2.099647e+00
25%     -5.640419e-01
50%      9.213643e-03
75%      7.876635e-01
max      2.409081e+00
Name: monthly_income_std, dtype: float64

First 10 values [monthly_income] and [monthly_income_std]:
   monthly_income  monthly_income_std
0         3734.19            0.944685
1         2594.19           -0.324626
2         3550.47            0.740126
3         3821.18            1.041542
4         1750.84           -1.263639
5         4043.40            1.288969
6         4076.52            1.325846
7         2961.36            0.084192
8         2420.10           -0.518464
9         1000.00           -2.099647


In [94]:
# Q4.3: Min-max scaling implementation [10 marks]

# Use distinct names so Python's built-in min() and max() are not shadowed
col_min = df[numeric_col].min()
col_max = df[numeric_col].max()

df["monthly_income_min_max"] = (df[numeric_col] - col_min) / (col_max - col_min)

print(df["monthly_income_min_max"].describe())

print(f"\nFirst 10 values [monthly_income] and [monthly_income_min_max]:")
print(df[[numeric_col, f'{numeric_col}_min_max']].head(10))

count    100.000000
mean       0.465685
std        0.221792
min        0.000000
25%        0.340585
50%        0.467729
75%        0.640383
max        1.000000
Name: monthly_income_min_max, dtype: float64

First 10 values [monthly_income] and [monthly_income_min_max]:
   monthly_income  monthly_income_min_max
0         3734.19                0.675209
1         2594.19                0.393685
2         3550.47                0.629839
3         3821.18                0.696691
4         1750.84                0.185420
5         4043.40                0.751568
6         4076.52                0.759747
7         2961.36                0.484358
8         2420.10                0.350694
9         1000.00                0.000000



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

>  Standardized values: numbers such as 0.94, -0.32, and -1.26 can be negative or positive and are centered around 0 (mean 0, standard deviation 1).

>  Min-max scaled values: numbers such as 0.68, 0.75, and 0.35 all lie between 0 and 1.

>  Range comparison: standardized values have no fixed bounds (here about -2.10 to 2.41), while min-max scaled values always fall in [0, 1].
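
The two formulas can be compared side by side with NumPy alone. This is a sketch under stated assumptions: the ten income values are copied from the Q4.2 output, and the mean, standard deviation, min, and max are the full-column statistics from the Q4.1 `describe()` output rather than values recomputed from these ten rows.

```python
import numpy as np

# First 10 monthly_income values, copied from the Q4.2 output.
income = np.array([3734.19, 2594.19, 3550.47, 3821.18, 1750.84,
                   4043.40, 4076.52, 2961.36, 2420.10, 1000.00])

# Full-column statistics copied from the Q4.1 describe() output.
col_mean, col_std = 2885.745, 898.124693
col_min, col_max = 1000.0, 5049.4

# Z-score standardization: centered on 0, unbounded in both directions.
z = (income - col_mean) / col_std

# Min-max scaling: linearly mapped so col_min -> 0 and col_max -> 1.
mm = (income - col_min) / (col_max - col_min)

print("z-score :", np.round(z, 4))
print("min-max :", np.round(mm, 4))
```

One detail worth knowing when reproducing these numbers: pandas `Series.std()` uses the sample standard deviation (`ddof=1`) by default, while `np.std` defaults to `ddof=0`, so the two give slightly different z-scores unless `ddof` is set explicitly.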


### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [95]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]

unique_cities = df['city_type'].unique()
print(f"Unique city types: {unique_cities}")

for city in unique_cities:
    df[f'city_type_{city}'] = (df['city_type'] == city).astype(int)

Unique city types: ['Suburban' 'Urban' 'Rural']


In [96]:
# Q5.2: Display one hot encoded columns for city_type [5 marks]

print("DataFrame head showing city_type one-hot columns:")
display(df[['city_type', 'city_type_Suburban', 'city_type_Urban', 'city_type_Rural']].head())

DataFrame head showing city_type one-hot columns:


| | city_type | city_type_Suburban | city_type_Urban | city_type_Rural |
|---|---|---|---|---|
| 0 | Suburban | 1 | 0 | 0 |
| 1 | Urban | 0 | 1 | 0 |
| 2 | Rural | 0 | 0 | 1 |
| 3 | Suburban | 1 | 0 | 0 |
| 4 | Suburban | 1 | 0 | 0 |


In [97]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]

satisfaction_order = ['Low', 'Medium', 'High']

ordinal_mapping = {level: i for i, level in enumerate(satisfaction_order)}

df['satisfaction_level_encoded'] = df['satisfaction_level'].map(ordinal_mapping)

print("First 10 values [satisfaction_level] and [satisfaction_level_encoded]:")
display(df[['satisfaction_level', 'satisfaction_level_encoded']].head(10))

print("\nValue counts for satisfaction_level_encoded:")
print(df['satisfaction_level_encoded'].value_counts())

First 10 values [satisfaction_level] and [satisfaction_level_encoded]:


| | satisfaction_level | satisfaction_level_encoded |
|---|---|---|
| 0 | Medium | 1 |
| 1 | Low | 0 |
| 2 | High | 2 |
| 3 | High | 2 |
| 4 | Medium | 1 |
| 5 | Low | 0 |
| 6 | Low | 0 |
| 7 | High | 2 |
| 8 | High | 2 |
| 9 | Medium | 1 |



Value counts for satisfaction_level_encoded:
satisfaction_level_encoded
1    36
0    35
2    29
Name: count, dtype: int64



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

>  One-hot encoding is suitable for city_type because it's a nominal categorical feature, meaning its categories ('Urban', 'Suburban', 'Rural') have no numerical relationship.

>  Ordinal encoding is suitable for satisfaction_level because it's an ordinal categorical feature, where categories ('Low', 'Medium', 'High') have a clear, meaningful order, and the encoding preserves this inherent ranking.

>  One-hot encoding avoids the fake ordinality that integer codes would introduce for `city_type`, which could mislead a model.
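
The "fake ordinality" point can be illustrated by measuring distances between categories under each encoding. This is a sketch for illustration only: the integer codes assigned to the city types below are an arbitrary assumption, and `pairwise` is a hypothetical helper.

```python
import numpy as np

cities = ["Rural", "Suburban", "Urban"]

# Arbitrary integer codes (assumed for illustration): Rural=0, Suburban=1, Urban=2.
int_codes = {name: float(i) for i, name in enumerate(cities)}

# One-hot vectors: row i of the 3x3 identity matrix.
identity = np.eye(len(cities))
one_hot = {name: identity[i] for i, name in enumerate(cities)}

def pairwise(encoding):
    """Euclidean distance between every pair of categories under an encoding."""
    distances = {}
    for i, a in enumerate(cities):
        for b in cities[i + 1:]:
            delta = np.atleast_1d(encoding[a] - encoding[b])
            distances[(a, b)] = float(np.linalg.norm(delta))
    return distances

int_dist = pairwise(int_codes)
hot_dist = pairwise(one_hot)

print("integer codes:", int_dist)
print("one-hot      :", hot_dist)
```

With integer codes, Rural ends up twice as far from Urban as from Suburban, a relationship that exists only because of how the codes were assigned. With one-hot vectors, every pair of city types is the same distance apart.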


---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `monthly_income_std` (the standardized income column from Q4.2)  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [98]:
# Q6.1: Build 2D vectors for first two users [answer already provided]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [99]:
# Q6.2: Euclidean distance computation [5 marks]

euclidean_distance = np.sqrt(np.sum((v1 - v2)**2))
print(f"Euclidean Distance: {euclidean_distance}")

Euclidean Distance: 1140.7370424422975


In [100]:
# Q6.3: Manhattan distance computation [5 marks]

manhattan_distance = np.sum(np.abs(v1 - v2))
print(f"Manhattan Distance: {manhattan_distance}")

Manhattan Distance: 1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

>  In this case, the Manhattan distance (1181.0) is larger than the Euclidean distance (1140.74).

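>  This is expected from the formulas. Euclidean distance is the straight-line distance (the square root of the sum of squared differences), while Manhattan distance is the sum of absolute differences along each dimension, like walking along a city grid. A straight line is never longer than the grid path, so Manhattan distance is always greater than or equal to Euclidean distance, with equality only when the two points differ in at most one dimension.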
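
The relationship between the two distances can be checked numerically. The sketch below assumes the vectors printed in Q6.1 and the standard deviations from the earlier `describe()` outputs; it also repeats the calculation on standardized differences, which is an addition for illustration and not a step the notebook performs. Because the mean cancels when two standardized values are subtracted, dividing the raw difference by the standard deviation gives the same result as subtracting the standardized values.

```python
import numpy as np

# Vectors copied from the Q6.1 output: [monthly_income, daily_app_opens].
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])

diff = np.abs(v1 - v2)

# Raw features: income differences dominate because of their larger scale.
euclidean = np.sqrt(np.sum(diff ** 2))
manhattan = np.sum(diff)

# Standard deviations copied from the describe() outputs for each column.
stds = np.array([898.124693, 13.669619])

# Difference between the two users in units of standard deviations.
diff_std = diff / stds
euclidean_std = np.sqrt(np.sum(diff_std ** 2))
manhattan_std = np.sum(diff_std)

print(f"raw:          euclidean={euclidean:.2f}, manhattan={manhattan:.2f}")
print(f"standardized: euclidean={euclidean_std:.2f}, manhattan={manhattan_std:.2f}")
print("per-feature standardized differences:", np.round(diff_std, 2))
```

In both cases Manhattan is at least as large as Euclidean. On the raw features almost all of the distance comes from income, whereas after standardization the difference in app opens is the larger of the two.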


---
## Final Reflection [10 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

>  From Modules 1 and 2, we used classification evaluation metrics (accuracy, precision, and recall) to assess the model's performance on the `true_label` and `pred_label` columns. This initial evaluation helped us understand the model's strengths and weaknesses.

>  From Module 3, we applied standardization and min-max scaling to a numeric feature (`monthly_income`), and one-hot and ordinal encoding to the categorical features `city_type` and `satisfaction_level`, to prepare the data.

>  Together, these ideas give a deeper understanding of both the dataset and the model. The evaluation metrics (Module 2) reveal what needs to improve, and the preprocessing techniques (Module 3) show how to transform the data into a format suitable for model training and distance-based analysis.


## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- As shown in the instruction video, first use "Save a copy in Drive" on the Colab file we provided. Then write your code in that copy in Google Colab, set its sharing to 'Anyone with the link' with 'View' access, and submit the file's shareable link.
