

# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name: Moriam Akter Swarna**   

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any libraries beyond `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset with the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [None]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [None]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
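
Both checks can be sketched roughly as follows. This is a minimal sketch and not part of the provided notebook: `required` and `missing` are hypothetical names, and the small `sample` frame rebuilds the five rows shown above so the snippet runs without downloading the file. In the notebook the same check would be applied to `df`.

```python
import pandas as pd

# Five rows copied from the df.head() output above (a stand-in for the full df).
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [43, 49, 19, 19, 63],
    "monthly_income": [3734.19, 2594.19, 3550.47, 3821.18, 1750.84],
    "daily_screen_time_min": [109, 194, 146, 287, 66],
    "daily_app_opens": [48, 7, 36, 14, 46],
    "true_label": [0, 0, 1, 1, 0],
    "pred_label": [0, 0, 0, 0, 0],
    "satisfaction_level": ["Medium", "Low", "High", "High", "Medium"],
    "city_type": ["Suburban", "Urban", "Rural", "Suburban", "Suburban"],
})

# Hypothetical helper list: the columns this assignment relies on.
required = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]

# Any required column that is absent from the frame ends up in this list.
missing = [col for col in required if col not in sample.columns]

print("Shape:", sample.shape)
print("Missing columns:", missing)
print(sample.dtypes)
```

An empty `missing` list together with numeric dtypes for the four numeric columns suggests the file loaded as expected.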



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [None]:
# Q1.1: Choose your numeric column here [answer already provided]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

> The column has 100 values with a mean of about 181.89 minutes, a minimum of 60, and a maximum of 299, which indicates a wide spread in the data. The standard deviation of about 68.89 also shows high variability among the values. The max is roughly 117 minutes above the mean (about 1.7 standard deviations), so some users have clearly higher screen time than average, although the value is not far enough out to look like an extreme outlier.
>  
>  
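
One way to make "far from the mean" concrete is to express the gap in units of standard deviation. The snippet below is a small sketch that reuses the numbers printed by `describe()` above; the variable names are illustrative only.

```python
# Summary statistics copied from the describe() output above.
mean, std = 181.89, 68.886951
col_min, col_max = 60.0, 299.0

# Distance of each extreme from the mean, measured in standard deviations.
z_max = (col_max - mean) / std
z_min = (col_min - mean) / std

print(f"max is {z_max:.2f} standard deviations above the mean")
print(f"min is {abs(z_min):.2f} standard deviations below the mean")
```

With these numbers both extremes sit within two standard deviations of the mean.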



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [None]:
# Q2.1: Compute proportion of positive class [answer already provided]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

>  The positive class proportion is 0.52, meaning 52% of the samples have label 1 and 48% have label 0. Since neither class is noticeably more common than the other, the dataset is roughly balanced.
>  
>  
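
The same proportion can be read off in other ways. The sketch below uses a stand-in series with the counts printed above (52 ones, 48 zeros) rather than the real `df`, so the names `labels` and `proportions` are assumptions made for illustration.

```python
import pandas as pd

# Stand-in labels with the same counts as the output above: 52 ones, 48 zeros.
labels = pd.Series([1] * 52 + [0] * 48, name="true_label")

# normalize=True returns class proportions instead of raw counts.
proportions = labels.value_counts(normalize=True).sort_index()
print(proportions)

# The mean of a 0/1 column equals the proportion of ones.
print("Positive proportion via mean():", labels.mean())
```

Seeing both class proportions side by side makes the balance check immediate.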



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [None]:
# Q3.1: Manually compute TP, TN, FP, FN [answer already provided]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [None]:
# Q3.2: Compute accuracy, precision, recall [answer already provided]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

>  With an accuracy of 0.55, the model correctly predicts the label for 55% of the samples, only slightly above the 52% that always predicting the majority class would achieve. Its precision of 0.57 means that about 57% of its positive predictions are correct, so roughly 4 in 10 positive predictions are false positives. Its recall of 0.54 means it detects only about 54% of the actual positives and misses the remaining 46%. Precision and recall are similar, so the model favors neither, and both are low, which leaves considerable room for improvement in recognizing positive cases and in making reliable positive predictions.
>  
>  
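
To judge whether these metrics are good, it can help to compare them with trivial baselines. The following is a sketch under the assumption that the counts from Q3.1 are typed in by hand; `majority_baseline` and `always_positive_precision` are hypothetical names, not part of the assignment.

```python
# Counts copied from the Q3.1 output above.
tp, tn, fp, fn = 28, 27, 21, 24
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Baseline 1: always predict the more common true class.
positives = tp + fn  # actual positives
negatives = tn + fp  # actual negatives
majority_baseline = max(positives, negatives) / total

# Baseline 2: always predict 1. Precision then equals the positive rate
# (and recall would be 1.0).
always_positive_precision = positives / total

print(f"accuracy={accuracy:.2f} vs majority baseline={majority_baseline:.2f}")
print(f"precision={precision:.3f} vs always-positive precision={always_positive_precision:.2f}")
print(f"recall={recall:.3f}")
```

Under these counts the model beats both baselines only by a few percentage points.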



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [None]:
# Q4.1: Choose the numeric column [2 marks]
num_col_q4 = "monthly_income"

In [None]:
# Q4.2: Standardization with z-score [10 marks]
mean_income = df[num_col_q4].mean()
std_income = df[num_col_q4].std()
df['monthly_income_std'] = (df[num_col_q4] - mean_income) / std_income

print("Original monthly_income head:")
print(df[num_col_q4].head())
print("\nStandardized monthly_income_std head:")
print(df['monthly_income_std'].head())
print("\nDescriptive statistics for monthly_income_std:")
print(df['monthly_income_std'].describe())

Original monthly_income head:
0    3734.19
1    2594.19
2    3550.47
3    3821.18
4    1750.84
Name: monthly_income, dtype: float64

Standardized monthly_income_std head:
0    0.944685
1   -0.324626
2    0.740126
3    1.041542
4   -1.263639
Name: monthly_income_std, dtype: float64

Descriptive statistics for monthly_income_std:
count    1.000000e+02
mean    -8.826273e-16
std      1.000000e+00
min     -2.099647e+00
25%     -5.640419e-01
50%      9.213643e-03
75%      7.876635e-01
max      2.409081e+00
Name: monthly_income_std, dtype: float64


In [None]:
# Q4.3: Min max scaling implementation [10 marks]
min_income = df[num_col_q4].min()
max_income = df[num_col_q4].max()
df['monthly_income_minmax'] = (df[num_col_q4] - min_income) / (max_income - min_income)

print("Original monthly_income head:")
print(df[num_col_q4].head())
print("\nMin-Max Scaled monthly_income_minmax head:")
print(df['monthly_income_minmax'].head())
print("\nDescriptive statistics for monthly_income_minmax:")
print(df['monthly_income_minmax'].describe())

Original monthly_income head:
0    3734.19
1    2594.19
2    3550.47
3    3821.18
4    1750.84
Name: monthly_income, dtype: float64

Min-Max Scaled monthly_income_minmax head:
0    0.675209
1    0.393685
2    0.629839
3    0.696691
4    0.185420
Name: monthly_income_minmax, dtype: float64

Descriptive statistics for monthly_income_minmax:
count    100.000000
mean       0.465685
std        0.221792
min        0.000000
25%        0.340585
50%        0.467729
75%        0.640383
max        1.000000
Name: monthly_income_minmax, dtype: float64



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

>  The standardized column is centered at 0 with a standard deviation of 1, and most values fall between -2 and 2 (here the full range is about -2.10 to 2.41). The min-max scaled column, in contrast, is bounded exactly between 0 and 1, so every value is non-negative and the original minimum and maximum map to 0 and 1 regardless of the shape of the distribution.
>  
>  
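
One detail worth checking when standardizing by hand is which standard deviation is used. The sketch below is illustrative only: it uses just the five `monthly_income` values shown in the `head()` output, so its numbers will not match the full-dataset results above, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# The five monthly_income values shown in the head() output above.
income = pd.Series([3734.19, 2594.19, 3550.47, 3821.18, 1750.84])

# pandas uses the sample standard deviation (ddof=1) by default,
# while numpy uses the population standard deviation (ddof=0).
z_pandas = (income - income.mean()) / income.std()
z_numpy = (income - income.mean()) / np.std(income.to_numpy())

# Min-max scaling maps the smallest value to 0 and the largest to 1.
minmax = (income - income.min()) / (income.max() - income.min())

print("z-score (ddof=1):", z_pandas.round(3).tolist())
print("z-score (ddof=0):", z_numpy.round(3).tolist())
print("min-max:", minmax.round(3).tolist())
```

The two z-score variants differ by a constant factor of sqrt(n / (n - 1)). That is noticeable for 5 rows but only about 0.5% for the 100 rows in this dataset.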



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [None]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
city_dummies = pd.get_dummies(df['city_type'], prefix='city')
print(city_dummies.head())

   city_Rural  city_Suburban  city_Urban
0       False           True       False
1       False          False        True
2        True          False       False
3       False           True       False
4       False           True       False


In [None]:
# Q5.2: Attach one hot encoded columns to df [5 marks]


df = pd.concat([df, city_dummies], axis=1)


print(df.head())

   user_id  age  monthly_income  daily_screen_time_min  daily_app_opens  \
0        1   43         3734.19                    109               48   
1        2   49         2594.19                    194                7   
2        3   19         3550.47                    146               36   
3        4   19         3821.18                    287               14   
4        5   63         1750.84                     66               46   

   true_label  pred_label satisfaction_level city_type  monthly_income_std  \
0           0           0             Medium  Suburban            0.944685   
1           0           0                Low     Urban           -0.324626   
2           1           0               High     Rural            0.740126   
3           1           0               High  Suburban            1.041542   
4           0           0             Medium  Suburban           -1.263639   

   monthly_income_minmax  city_Rural  city_Suburban  city_Urban  
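0               0.675209       False           True       False  
1               0.393685       False          False        True  
2               0.629839        True          False       False  
3               0.696691       False           True       False  
4               0.185420       False           True       False  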

In [None]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]

mapping = {'Low': 0, 'Medium': 1, 'High': 2}


df['satisfaction_level_encoded'] = df['satisfaction_level'].map(mapping)


print(df[['satisfaction_level', 'satisfaction_level_encoded']].head())

  satisfaction_level  satisfaction_level_encoded
0             Medium                           1
1                Low                           0
2               High                           2
3               High                           2
4             Medium                           1



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

>  One-hot encoding is suitable for `city_type` because it is a nominal feature whose categories (Urban, Suburban, Rural) have no rank or order, so giving each category its own 0/1 column avoids implying one. Ordinal encoding is suitable for `satisfaction_level` because its categories have a clear inherent order (Low < Medium < High), and mapping them to increasing integers preserves that order for the model.
>  
>  
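
Both encodings can also be written in slightly different ways with pandas. This is a sketch on the five rows shown earlier, not the notebook's own solution; `sample`, `levels`, and `ordered` are illustrative names.

```python
import pandas as pd

# Values copied from the first five rows shown above.
sample = pd.DataFrame({
    "city_type": ["Suburban", "Urban", "Rural", "Suburban", "Suburban"],
    "satisfaction_level": ["Medium", "Low", "High", "High", "Medium"],
})

# One-hot encoding as 0/1 integers instead of True/False.
city_dummies = pd.get_dummies(sample["city_type"], prefix="city", dtype=int)

# Ordinal encoding through an ordered categorical: codes follow the listed order.
levels = ["Low", "Medium", "High"]
ordered = pd.Categorical(sample["satisfaction_level"], categories=levels, ordered=True)
sample["satisfaction_level_encoded"] = ordered.codes

print(city_dummies)
print(sample[["satisfaction_level", "satisfaction_level_encoded"]])
```

One difference from the dictionary approach: a value missing from the dictionary becomes `NaN` with `.map(...)`, while a value missing from `levels` gets the code `-1` here, so either way unexpected categories should be checked before modeling.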



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `monthly_income_std` (the standardized income column created in Q4.2)  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [None]:
# Q6.1: Build 2D vectors for first two users [answer already provided]
# Note: this uses the raw monthly_income column, not a scaled version.
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [None]:
# Q6.2: Euclidean distance computation [5 marks]

euclidean_dist = np.sqrt(np.sum((v1 - v2)**2))

print("Euclidean Distance:", euclidean_dist)

Euclidean Distance: 1140.7370424422975


In [None]:
# Q6.3: Manhattan distance computation [5 marks]


manhattan_dist = np.sum(np.abs(v1 - v2))

print("Manhattan Distance:", manhattan_dist)

Manhattan Distance: 1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

>  The Manhattan distance is larger (1181.0 versus about 1140.74). This is expected because Euclidean distance is the straight-line (shortest) path between two points, while Manhattan distance adds up the absolute differences along each axis, like walking the two legs of a right triangle. The sum of the legs is always greater than or equal to the hypotenuse, with equality only when the points differ along a single axis.
>  
>  
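
The effect of feature scale on both distances can be illustrated with the two users above. The sketch below hard-codes the raw vectors from the Q6.1 output and, as an assumption for comparison, swaps in the standardized income values printed in Q4.2. The helper functions `euclidean` and `manhattan` are hypothetical names.

```python
import numpy as np

# Raw vectors from the Q6.1 output: [monthly_income, daily_app_opens].
v1_raw = np.array([3734.19, 48.0])
v2_raw = np.array([2594.19, 7.0])

# Same users with the standardized income from Q4.2 in place of raw income.
v1_std = np.array([0.944685, 48.0])
v2_std = np.array([-0.324626, 7.0])

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared gaps."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Axis-aligned distance: sum of the absolute gaps."""
    return float(np.sum(np.abs(a - b)))

for name, a, b in [("raw", v1_raw, v2_raw), ("income standardized", v1_std, v2_std)]:
    gaps = np.abs(a - b)
    print(f"{name}: gaps={gaps}, "
          f"Euclidean={euclidean(a, b):.3f}, Manhattan={manhattan(a, b):.3f}")
```

With raw values the income gap (about 1140) dwarfs the app-opens gap (41), so both distances are almost entirely determined by income. Once income is standardized the app-opens gap dominates instead, which suggests scaling both features before comparing users. In either case Manhattan stays at least as large as Euclidean.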



---
## Final Reflection [10 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

>  In Modules 1 and 2, we examined the dataset with descriptive statistics and class proportions, and established baseline performance using accuracy, precision, and recall, which showed that the predictions had room for improvement. Module 3 introduced data transformation techniques, specifically scaling (standardization and min-max) and encoding (one-hot and ordinal), to prepare numeric and categorical features for mathematical processing. These ideas connect because raw data often has features on very different scales or of non-numeric types, which distorts distance-based calculations such as the Euclidean and Manhattan distances in Part C. Applying the scaling and encoding from Module 3 puts the data in a form that models can use properly, which is a prerequisite for improving the evaluation metrics analyzed in the earlier modules.
>  
>  



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Following the instruction video, first use "Save a copy in Drive" on the Colab file we provided. Then write your code in Google Colab, set that file's sharing to "Anyone with the link" with "View" access, and submit the file's shareable link.
