
# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name: Md.Atikur Rahaman**   

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any libraries beyond `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should contain the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [1]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [2]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("module123_data.csv")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
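
The column check above can be sketched as a small helper. This is only an illustration: `missing_columns` and the two-row `sample` frame are hypothetical names introduced here, and the sample values are copied from the first two rows of the `df.head()` output rather than loaded from the CSV.

```python
import pandas as pd

# Columns the assignment expects (taken from the checklist above)
REQUIRED_COLUMNS = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]

# Hypothetical stand-in for df: the first two rows shown by df.head()
sample = pd.DataFrame({
    "user_id": [1, 2],
    "age": [43, 49],
    "monthly_income": [3734.19, 2594.19],
    "daily_screen_time_min": [109, 194],
    "daily_app_opens": [48, 7],
    "true_label": [0, 0],
    "pred_label": [0, 0],
    "satisfaction_level": ["Medium", "Low"],
    "city_type": ["Suburban", "Urban"],
})

print(missing_columns(sample))
print(missing_columns(sample.drop(columns=["city_type"])))
```

In the notebook itself the same helper could be called on `df` directly; an empty list means every expected column is present.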



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [3]:
# Q1.1: Choose your numeric column here [This answer is already provided]
num_col = "daily_screen_time_min"

df[num_col].describe()

count    100.000000
mean     181.890000
std       68.886951
min       60.000000
25%      122.000000
50%      178.000000
75%      243.750000
max      299.000000
Name: daily_screen_time_min, dtype: float64

> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

> The average daily screen time is 181.89 minutes (about 3 hours) across 100 users. The standard deviation of 68.89 minutes is large relative to the mean, so screen time varies considerably from user to user rather than clustering tightly around the average. The minimum (60 minutes) and maximum (299 minutes) both lie within about 1.8 standard deviations of the mean, so neither looks like an unusual extreme value.
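
One way to judge whether the minimum or maximum is "far" from the mean is to express each as a number of standard deviations. The sketch below uses only the summary values printed by `describe()` above; the variable names are illustrative.

```python
# Summary values copied from the describe() output above
mean, std = 181.89, 68.886951
col_min, col_max = 60.0, 299.0

# Distance of each extreme from the mean, in units of standard deviation
z_min = (col_min - mean) / std
z_max = (col_max - mean) / std

print(f"min is {z_min:.2f} standard deviations from the mean")
print(f"max is {z_max:.2f} standard deviations from the mean")
```

Both extremes fall within two standard deviations of the mean, which is consistent with a column that has no obvious outliers.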


### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [4]:
# Q2.1: Compute proportion of positive class [This answer is already provided]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52


> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

> The dataset is well balanced: 52% of the samples belong to the positive class and 48% to the negative class. This is helpful for training because both classes are almost equally represented, so a model is less likely to be biased toward one class.
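
The class balance can also be read off in one step with `value_counts(normalize=True)`. This is a sketch under an assumption: the labels below are rebuilt from the printed counts (52 positive, 48 negative), so the row order is arbitrary and does not match the real dataset.

```python
import numpy as np
import pandas as pd

# Labels rebuilt from the counts above; the order is not the real row order
labels = pd.Series(np.repeat([1, 0], [52, 48]), name="true_label")

# Share of each class, as a fraction of all rows
proportions = labels.value_counts(normalize=True)
print(proportions)

# Ratio of the larger class to the smaller one; 1.0 would be perfectly balanced
imbalance_ratio = proportions.max() / proportions.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

In the notebook, `df["true_label"].value_counts(normalize=True)` would give the same proportions from the real column.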


### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [5]:
# Q3.1: Manually compute TP, TN, FP, FN [This answer is already provided]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24
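
The four counts can be cross-checked with `pd.crosstab`, which lays them out as a confusion matrix. In this sketch the label pairs are rebuilt from the counts printed above, so they are a stand-in for the real rows, not the dataset itself.

```python
import numpy as np
import pandas as pd

# Label pairs rebuilt from the printed counts, in the order TP, TN, FP, FN
counts = [28, 27, 21, 24]
true = np.repeat([1, 0, 0, 1], counts)
pred = np.repeat([1, 0, 1, 0], counts)

# Rows are actual labels, columns are predicted labels
matrix = pd.crosstab(
    pd.Series(true, name="true_label"),
    pd.Series(pred, name="pred_label"),
)
print(matrix)
```

With the real data, `pd.crosstab(df["true_label"], df["pred_label"])` should show the same four numbers.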


In [6]:
# Q3.2: Compute accuracy, precision, recall [This answer is already provided]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384


> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

> The model has 55% accuracy, which is only slightly better than always predicting the majority class (52%). The precision is 57%, meaning that when the model predicts positive, it is correct about 57% of the time. The recall is 54%, meaning the model catches only 54% of the actual positive cases and misses the other 46%. Overall, the model needs substantial improvement because it makes many mistakes in both directions (21 false positives and 24 false negatives).
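
To put the 55% accuracy in context, it helps to compare it with a trivial baseline that always predicts the more common class. The sketch below recomputes the three metrics from the printed counts and adds that baseline; `majority_baseline` is a name introduced here for illustration.

```python
# Counts copied from the Q3.1 output
tp, tn, fp, fn = 28, 27, 21, 24
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of all predicted positives, how many are correct
recall = tp / (tp + fn)      # of all actual positives, how many are found

# Baseline: always predict whichever class is more common in true_label
actual_positives = tp + fn
majority_baseline = max(actual_positives, total - actual_positives) / total

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

The model beats the baseline by only three percentage points, which supports the conclusion that it has learned little from the features.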


---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [7]:
# Q4.1: Choose the numeric column [2 marks]
income_col = "monthly_income"
print(f"Selected column: {income_col}")
print(f"\nOriginal values (first 5):")
print(df[income_col].head())

Selected column: monthly_income

Original values (first 5):
0    3734.19
1    2594.19
2    3550.47
3    3821.18
4    1750.84
Name: monthly_income, dtype: float64


In [8]:
# Q4.2: Standardization with z-score [10 marks]
# Calculate mean and standard deviation
mean_income = df[income_col].mean()
std_income = df[income_col].std()

# Apply z-score standardization: (x - mean) / std
df['income_std'] = (df[income_col] - mean_income) / std_income

print(f"Mean: {mean_income:.2f}")
print(f"Std: {std_income:.2f}")
print(f"\nStandardized values (first 5):")
print(df['income_std'].head())
print(f"\nStandardized range: [{df['income_std'].min():.2f}, {df['income_std'].max():.2f}]")

Mean: 2885.75
Std: 898.12

Standardized values (first 5):
0    0.944685
1   -0.324626
2    0.740126
3    1.041542
4   -1.263639
Name: income_std, dtype: float64

Standardized range: [-2.10, 2.41]


In [9]:
# Q4.3: Min max scaling implementation [10 marks]
# Calculate min and max
min_income = df[income_col].min()
max_income = df[income_col].max()

# Apply min-max scaling: (x - min) / (max - min)
df['income_minmax'] = (df[income_col] - min_income) / (max_income - min_income)

print(f"Min: {min_income:.2f}")
print(f"Max: {max_income:.2f}")
print(f"\nMin-Max scaled values (first 5):")
print(df['income_minmax'].head())
print(f"\nMin-Max range: [{df['income_minmax'].min():.2f}, {df['income_minmax'].max():.2f}]")

Min: 1000.00
Max: 5049.40

Min-Max scaled values (first 5):
0    0.675209
1    0.393685
2    0.629839
3    0.696691
4    0.185420
Name: income_minmax, dtype: float64

Min-Max range: [0.00, 1.00]


> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

> The standardized column is centered on 0 and measured in standard deviations; in this dataset it ranges from -2.10 to 2.41. Negative values mean below-average income and positive values mean above-average income. The min-max scaled column maps every value into the range 0 to 1, where 0 is the smallest income and 1 is the largest. Standardization is useful when we want to know how far a value is from the average, and min-max scaling is useful when every feature must lie on the same fixed 0 to 1 scale.
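
The two formulas can be checked by hand against the notebook output. This sketch applies them to the first five `monthly_income` values, using the full-dataset mean, standard deviation, minimum, and maximum as printed above (rounded to two decimals, so results agree only to about three decimals). It also shows one detail worth knowing: pandas `std()` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

# First five monthly_income values shown in the notebook output
income = pd.Series([3734.19, 2594.19, 3550.47, 3821.18, 1750.84])

# Full-dataset statistics as printed by the notebook (rounded)
mean_income, std_income = 2885.75, 898.12
min_income, max_income = 1000.00, 5049.40

# z-score: distance from the mean in units of standard deviation
z_scores = (income - mean_income) / std_income

# min-max: position between the smallest and largest value, from 0 to 1
min_max = (income - min_income) / (max_income - min_income)

print(z_scores.round(3).tolist())
print(min_max.round(3).tolist())

# The two libraries use different default divisors (n - 1 versus n)
sample_std = income.std()                    # pandas default, ddof=1
population_std = np.std(income.to_numpy())   # NumPy default, ddof=0
print(sample_std, population_std)
```

If a standardized column ever needs to match a NumPy-based calculation exactly, pass the same `ddof` to both.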


### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [10]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
# Use pd.get_dummies() for one-hot encoding
city_encoded = pd.get_dummies(df['city_type'], prefix='city')

print("One-hot encoded city_type:")
print(city_encoded.head(10))
print(f"\nOriginal column vs encoded columns:")
print(pd.concat([df['city_type'].head(10), city_encoded.head(10)], axis=1))

One-hot encoded city_type:
   city_Rural  city_Suburban  city_Urban
0       False           True       False
1       False          False        True
2        True          False       False
3       False           True       False
4       False           True       False
5        True          False       False
6        True          False       False
7        True          False       False
8       False          False        True
9        True          False       False

Original column vs encoded columns:
  city_type  city_Rural  city_Suburban  city_Urban
0  Suburban       False           True       False
1     Urban       False          False        True
2     Rural        True          False       False
3  Suburban       False           True       False
4  Suburban       False           True       False
5     Rural        True          False       False
6     Rural        True          False       False
7     Rural        True          False       False
8     Urban       False          False        True
9     Rural        True          False       False

In [11]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
# Concatenate the one-hot encoded columns to the original dataframe
df = pd.concat([df, city_encoded], axis=1)

print("Dataframe with one-hot encoded columns:")
print(df[['city_type', 'city_Rural', 'city_Suburban', 'city_Urban']].head(10))

Dataframe with one-hot encoded columns:
  city_type  city_Rural  city_Suburban  city_Urban
0  Suburban       False           True       False
1     Urban       False          False        True
2     Rural        True          False       False
3  Suburban       False           True       False
4  Suburban       False           True       False
5     Rural        True          False       False
6     Rural        True          False       False
7     Rural        True          False       False
8     Urban       False          False        True
9     Rural        True          False       False


In [12]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
# Define the ordinal mapping: Low < Medium < High
ordinal_mapping = {'Low': 0, 'Medium': 1, 'High': 2}

# Apply the mapping
df['satisfaction_encoded'] = df['satisfaction_level'].map(ordinal_mapping)

print("Ordinal encoding for satisfaction_level:")
print(f"\nMapping: {ordinal_mapping}")
print(f"\nOriginal vs Encoded:")
print(df[['satisfaction_level', 'satisfaction_encoded']].head(10))
print(f"\nValue counts:")
print(df[['satisfaction_level', 'satisfaction_encoded']].value_counts().sort_index())

Ordinal encoding for satisfaction_level:

Mapping: {'Low': 0, 'Medium': 1, 'High': 2}

Original vs Encoded:
  satisfaction_level  satisfaction_encoded
0             Medium                     1
1                Low                     0
2               High                     2
3               High                     2
4             Medium                     1
5                Low                     0
6                Low                     0
7               High                     2
8               High                     2
9             Medium                     1

Value counts:
satisfaction_level  satisfaction_encoded
High                2                       29
Low                 0                       35
Medium              1                       36
Name: count, dtype: int64


> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

> One-hot encoding is good for `city_type` because Urban, Suburban, and Rural are just different categories with no ranking or order. One-hot encoding creates separate columns for each city type so the model doesn't think one city is "bigger" or "better" than another. Ordinal encoding is good for `satisfaction_level` because Low, Medium, and High have a clear order (Low < Medium < High), and encoding them as 0, 1, 2 helps the model understand that High is greater than Medium, which is greater than Low.
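
Two small variations on the encodings above, shown as a sketch on the first five rows from the `df.head()` output. `pd.get_dummies` accepts a `dtype` argument, which gives 0/1 integers instead of booleans, and an ordered `pd.Categorical` derives the ordinal codes from the declared category order instead of a hand-written dictionary.

```python
import pandas as pd

# First five values of each column, copied from the df.head() output
city = pd.Series(
    ["Suburban", "Urban", "Rural", "Suburban", "Suburban"], name="city_type"
)
satisfaction = pd.Series(
    ["Medium", "Low", "High", "High", "Medium"], name="satisfaction_level"
)

# One-hot encoding with integer 0/1 columns; no column implies any ranking
city_onehot = pd.get_dummies(city, prefix="city", dtype=int)
print(city_onehot)

# Ordered categorical: the position in `categories` becomes the integer code
ordered = pd.Categorical(
    satisfaction, categories=["Low", "Medium", "High"], ordered=True
)
satisfaction_codes = pd.Series(ordered.codes, name="satisfaction_encoded")
print(satisfaction_codes.tolist())
```

Each one-hot row contains exactly one 1, while the ordinal codes preserve Low < Medium < High.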


---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std`  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [13]:
# Q6.1: Build 2D vectors for first two users [This answer is already provided]
# Note: this cell uses the raw monthly_income column, not income_std
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [3734.19 48]
v2: [2594.19 7]


In [14]:
# Q6.2: Euclidean distance computation [5 marks]
# Euclidean distance formula: sqrt(sum((x_i - y_i)^2))
euclidean_dist = np.sqrt(np.sum((v1 - v2) ** 2))

print(f"Vector 1: {v1}")
print(f"Vector 2: {v2}")
print(f"\nEuclidean distance: {euclidean_dist:.2f}")
print(f"\nCalculation: sqrt((3734.19-2594.19)^2 + (48-7)^2)")
print(f"           = sqrt({(v1[0]-v2[0])**2:.2f} + {(v1[1]-v2[1])**2:.2f})")
print(f"           = sqrt({np.sum((v1-v2)**2):.2f}) = {euclidean_dist:.2f}")

Vector 1: [3734.19 48]
Vector 2: [2594.19 7]

Euclidean distance: 1140.74

Calculation: sqrt((3734.19-2594.19)^2 + (48-7)^2)
           = sqrt(1299600.00 + 1681.00)
           = sqrt(1301281.00) = 1140.74


In [15]:
# Q6.3: Manhattan distance computation [5 marks]
# Manhattan distance formula: sum(|x_i - y_i|)
manhattan_dist = np.sum(np.abs(v1 - v2))

print(f"Vector 1: {v1}")
print(f"Vector 2: {v2}")
print(f"\nManhattan distance: {manhattan_dist:.2f}")
print(f"\nCalculation: |3734.19-2594.19| + |48-7|")
print(f"           = {abs(v1[0]-v2[0]):.2f} + {abs(v1[1]-v2[1]):.2f}")
print(f"           = {manhattan_dist:.2f}")

Vector 1: [3734.19 48]
Vector 2: [2594.19 7]

Manhattan distance: 1181.00

Calculation: |3734.19-2594.19| + |48-7|
           = 1140.00 + 41.00
           = 1181.00


> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

> Manhattan distance (1181.00) is larger than Euclidean distance (1140.74) in this result. Manhattan distance adds the absolute differences directly (1140 + 41 = 1181), while Euclidean distance takes the square root of the sum of squared differences (√(1140² + 41²) ≈ 1140.74), which is the straight-line distance and can never exceed the sum of the individual differences. Manhattan distance is therefore always at least as large as Euclidean distance, and the two are equal only when the vectors differ in at most one feature. Here the gap is small because the income difference (1140) dominates the app-opens difference (41).
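
The comparison can be reproduced with plain NumPy arrays. This sketch uses the two vectors printed above and adds one extra quantity, `income_share` (a name introduced here), to show how much of the squared Euclidean distance comes from the income feature alone.

```python
import numpy as np

# The two user vectors printed above: [monthly_income, daily_app_opens]
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])

diff = np.abs(v1 - v2)

manhattan = diff.sum()                   # sum of absolute differences
euclidean = np.sqrt((diff ** 2).sum())   # straight-line distance

# Fraction of the squared Euclidean distance contributed by income
income_share = diff[0] ** 2 / (diff ** 2).sum()

print(f"Manhattan: {manhattan:.2f}")
print(f"Euclidean: {euclidean:.2f}")
print(f"Income share of squared distance: {income_share:.4f}")
```

Because income accounts for almost all of the squared distance, the Euclidean result is nearly equal to the income difference alone, and the Manhattan result exceeds it only by roughly the app-opens difference.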

---
## Final Reflection [10 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

> In this assignment, I learned how the different modules work together to analyze data and build machine learning models. From Modules 1 and 2, I used descriptive statistics such as the mean and standard deviation to understand the data, and I used a confusion matrix with precision and recall to check how well the model performs. From Module 3, I learned how to prepare data for machine learning by using standardization and min-max scaling to make numbers comparable, and by using one-hot and ordinal encoding to convert categories into numbers. These techniques work together as a complete process: first I explore the data to understand it, then I transform the data to make it ready for models, and finally I evaluate the model to see whether it works well. The distance calculations showed me why scaling is important: without scaling, features with large values (like income) dominate the calculation over features with small values (like app opens). This complete approach helps me understand both what the data means and how to prepare it properly for machine learning.
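
The point about scaling can be illustrated with the five rows shown by `df.head()`. This is a sketch under a stated assumption: the mean and standard deviation are computed from these five rows only, so the z-scores differ from the notebook's full-dataset `income_std` values. The helper `income_share` is a hypothetical name introduced here.

```python
import pandas as pd

# The two distance features for the five users shown by df.head()
users = pd.DataFrame({
    "monthly_income": [3734.19, 2594.19, 3550.47, 3821.18, 1750.84],
    "daily_app_opens": [48, 7, 36, 14, 46],
})

# Standardize each column using statistics from these five rows only
scaled = (users - users.mean()) / users.std()

def income_share(frame):
    """Fraction of the squared Euclidean distance between rows 0 and 1
    that comes from the monthly_income column."""
    sq_diff = (frame.loc[0] - frame.loc[1]) ** 2
    return sq_diff["monthly_income"] / sq_diff.sum()

raw_share = income_share(users)
scaled_share = income_share(scaled)

print(f"Income share before scaling: {raw_share:.3f}")
print(f"Income share after scaling:  {scaled_share:.3f}")
```

Before scaling, income accounts for nearly all of the distance between the two users; after standardization, the difference in app opens contributes a meaningful part as well.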