
# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name: Sayed Hasan Sami**   

This is a small assignment that connects topics from Module 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any extra libraries beyond `pandas`, `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have columns given below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [1]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [2]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [4]:
# Q1.1: Choose your numeric column here [We already write this ans]
num_col = "daily_screen_time_min"

df[num_col].describe()

count    100.000000
mean     181.890000
std       68.886951
min       60.000000
25%      122.000000
50%      178.000000
75%      243.750000
max      299.000000
Name: daily_screen_time_min, dtype: float64


> **Q1.2 Short answer: [Marks: 05]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

> Looking at the daily_screen_time_min data:
> - The max value (299 minutes) is much bigger than the mean (181.89 minutes), showing a huge difference between the highest and average screen time usage.
> - The standard deviation of 68.89 minutes is quite large, which means the data is spread out widely and there is much variation between users' screen time habits.
>  
>  



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [6]:
# Q2.1: Compute proportion of positive class [We already write this ans]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

> The positive count is 52 and negative is 48. This means neither class is much more common than the other, so it is more likely balanced.
>  
>  



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [9]:
# Q3.1: Manually compute TP, TN, FP, FN [We already write this ans]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [10]:
# Q3.2: Compute accuracy, precision, recall [We already write this ans]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

> The accuracy of 0.55 shows that the model gets 55% of predictions right and 45% wrong in total. For precision of 0.57, this means when it predicts positive, 57% are correct and 43% are incorrect. The recall of 0.54 indicates it successfully finds 54% of actual positive cases but misses 46% of them. Overall, this model has poor performance and makes too many wrong predictions to be reliable.
>  
>  



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [12]:
# Q4.1: Choose the numeric column [2 marks]
income_col = "monthly_income"
print("Selected column:", income_col)

0     3734.19
1     2594.19
2     3550.47
3     3821.18
4     1750.84
       ...   
95    3975.90
96    3764.77
97    2336.44
98    3007.53
99    2195.91
Name: monthly_income, Length: 100, dtype: float64

In [16]:
# Q4.2: Standardization with z-score [10 marks]
mean_valuse = df[income_col].mean()
std_value = df[income_col].std()
Z_scores = (df[income_col] - mean_valuse) / std_value
df['z_score_income'] = Z_scores
df['z_score_income']

0    0.944685
1   -0.324626
2    0.740126
3    1.041542
4   -1.263639
Name: z_score_income, dtype: float64

In [17]:
# Q4.3: Min max scaling implementation [10 marks]
min_val = df[income_col].min()
max_val = df[income_col].max()
min_max_scaled = (df[income_col] - min_val) / (max_val - min_val)
df['min_max_income'] = min_max_scaled
df['min_max_income']

0     0.675209
1     0.393685
2     0.629839
3     0.696691
4     0.185420
        ...   
95    0.734899
96    0.682760
97    0.330034
98    0.495760
99    0.295330
Name: min_max_income, Length: 100, dtype: float64


> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

> The z-score numbers can be positive or negative with no limits, telling us how far each data point is from the mean. Min-max scaling always makes values between 0 and 1, where 0 is the smallest original value and 1 is the biggest original value. Z-score keeps the same data shape but puts it around zero, while min-max scaling puts everything into the 0-1 range.
>  
>  



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [22]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
city_col = "city_type"
one_hot_encoded = pd.get_dummies(df[city_col], prefix='city')
print("One hot encoded columns:")
print(one_hot_encoded)

One hot encoded columns:
    city_Rural  city_Suburban  city_Urban
0        False           True       False
1        False          False        True
2         True          False       False
3        False           True       False
4        False           True       False
..         ...            ...         ...
95       False          False        True
96       False          False        True
97        True          False       False
98       False          False        True
99       False           True       False

[100 rows x 3 columns]


In [32]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
df = pd.concat([df, one_hot_encoded], axis=1)
print(df.head())

   user_id  age  monthly_income  daily_screen_time_min  daily_app_opens  \
0        1   43         3734.19                    109               48   
1        2   49         2594.19                    194                7   
2        3   19         3550.47                    146               36   
3        4   19         3821.18                    287               14   
4        5   63         1750.84                     66               46   

   true_label  pred_label satisfaction_level city_type  income_std  ...  \
0           0           0             Medium  Suburban    0.944685  ...   
1           0           0                Low     Urban   -0.324626  ...   
2           1           0               High     Rural    0.740126  ...   
3           1           0               High  Suburban    1.041542  ...   
4           0           0             Medium  Suburban   -1.263639  ...   

   city_Urban  city_Rural  city_Suburban  city_Urban  city_Rural  \
0       False       False     

In [34]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
sat_col = "satisfaction_level"
order_mapping = {'Low': 1, 'Medium': 2, 'High': 3}
ordinal_encoded = df[sat_col].map(order_mapping)
df['satisfaction_ordinal'] = ordinal_encoded
print(df[[sat_col, 'satisfaction_ordinal']])

   satisfaction_level  satisfaction_ordinal
0              Medium                     2
1                 Low                     1
2                High                     3
3                High                     3
4              Medium                     2
..                ...                   ...
95             Medium                     2
96                Low                     1
97               High                     3
98               High                     3
99             Medium                     2

[100 rows x 2 columns]



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

> One hot encoding works well for city_type because Urban, Suburban, and Rural are just different categories with no order between them. Ordinal encoding is good for satisfaction_level because Low, Medium, and High have a clear order where Low < Medium < High. This way we keep the ranking for satisfaction but treat city types as separate things.
>  
>  



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std`  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [None]:
# Q6.1: Build 2D vectors for first two users [We already write this ans]
# Scale daily_app_opens to match the instruction preference
daily_app_scaled = (df["daily_app_opens"] - df["daily_app_opens"].min()) / (df["daily_app_opens"].max() - df["daily_app_opens"].min())
df['daily_app_scaled'] = daily_app_scaled

vec_cols = ["z_score_income", "daily_app_scaled"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [52]:
# Q6.2: Euclidean distance computation [5 marks]
euclidean_dist = np.sqrt(np.sum((v1 - v2) ** 2))
print("Euclidean distance:", euclidean_dist)

Euclidean distance: 1140.7370424422975


In [49]:
# Q6.3: Manhattan distance computation [5 marks]
manhattan_dist = np.sum(np.abs(v1 - v2))
print("Manhattan distance:", manhattan_dist)

Manhattan distance: 1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

> Manhattan distance is usually bigger than Euclidean distance in most cases. Euclidean distance measures the shortest straight line between 2 points only. But Manhattan distance does summation of absolute difference of 2 points. The squaring and square root in Euclidean distance formula makes the final number smaller compared to Manhattan distance which just adds up all the absolute differences.
>  
>  



---
## Final Reflection [10 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

> I used descriptive statistics from Module 1 to understand data patterns like mean and standard deviation. From Module 3, I used z-score standardization to scale income data for easier comparison. These work together by first understanding data with basic statistics, then preparing it using scaling and encoding methods. This shows data science is about understanding your data first, then preparing it properly for analysis.
>  
>  



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Instruction video অনুযায়ী আমাদের দেয়া Colab ফাইলটি থেকে প্রথম একটি Save copy in drive করে নিবা। এরপর Google colab এর মধ্যে কোডগুলো করবে এবং সেই ফাইলটি ‘Anyone with the link’ & ‘View’ Access দিয়ে ফাইলটির Shareble Link টি সাবমিট করবে।
