

# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name: Rakib Ahmed**   

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any libraries other than `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the following columns:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [None]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [None]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
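
One way to run both checks is to compare the required column names against `df.columns`. The sketch below is only illustrative: it uses a small stand-in DataFrame built from the first two rows shown by `df.head()` (in the notebook you would pass the loaded `df` instead), and the helper name `missing_columns` is hypothetical.

```python
import pandas as pd

# Stand-in for the loaded dataset: the first two rows shown by df.head().
sample = pd.DataFrame({
    "user_id": [1, 2],
    "age": [43, 49],
    "monthly_income": [3734.19, 2594.19],
    "daily_screen_time_min": [109, 194],
    "daily_app_opens": [48, 7],
    "true_label": [0, 0],
    "pred_label": [0, 0],
    "satisfaction_level": ["Medium", "Low"],
    "city_type": ["Suburban", "Urban"],
})

REQUIRED = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]

def missing_columns(frame, required):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]

print("Missing columns:", missing_columns(sample, REQUIRED))
print(sample.dtypes)  # numeric columns should show int64/float64, not object
```

An empty list means every required column is present. The `dtypes` printout is a quick way to spot a numeric column that was read as text.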



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [None]:
# Q1.1: Choose your numeric column here [Answer already provided]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

>  The max is 299 - 181.89 = 117.11 above the mean and the min is 181.89 - 60 = 121.89 below it, so the min and max sit at almost the same distance from the mean. The median (178) is also close to the mean, which suggests little skew.

>  IQR = 243.75 - 122 = 121.75, which is about half of the full range (299 - 60 = 239). The middle 50% of users is therefore spread fairly widely rather than packed tightly around the median.

>  The standard deviation of 68.89 means that a typical user's screen time lies within about one standard deviation of the mean, roughly 113.00 to 250.78 minutes.
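
The arithmetic behind these comments can be reproduced from the `describe()` output alone. The sketch below hard-codes the summary values printed above rather than reading the dataset, so it is a check of the calculations, not a new analysis.

```python
# Summary statistics copied from the describe() output above.
mean, std = 181.89, 68.886951
minimum, q1, median, q3, maximum = 60.0, 122.0, 178.0, 243.75, 299.0

dist_to_max = maximum - mean      # how far the max sits above the mean
dist_to_min = mean - minimum      # how far the min sits below the mean
iqr = q3 - q1                     # spread of the middle 50% of values
full_range = maximum - minimum    # spread of all values
one_sd_band = (mean - std, mean + std)

print("max - mean:", round(dist_to_max, 2))
print("mean - min:", round(dist_to_min, 2))
print("IQR:", iqr, "out of a full range of", full_range)
print("mean +/- 1 std:", tuple(round(v, 2) for v in one_sd_band))
```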



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [None]:
# Q2.1: Compute proportion of positive class [Answer already provided]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

>  The output shows that the proportion of the positive class is 0.52, that is, 52% of the samples.

> So 52 of the 100 samples are positive and 48 are negative.

> The dataset is therefore balanced between 0 and 1, because the two classes differ by only 4 percentage points.
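
As a quick cross-check, both class shares can be read off in one call with `value_counts(normalize=True)`. The sketch below rebuilds a label column from the counts reported above (52 positives, 48 negatives) instead of loading the dataset, which is an assumption made only to keep the example self-contained.

```python
import pandas as pd

# Rebuild a label column with the counts reported above: 52 positives, 48 negatives.
labels = pd.Series([1] * 52 + [0] * 48, name="true_label")

# normalize=True returns proportions instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions)

# Gap between the two class shares; a small gap indicates a balanced dataset.
gap = abs(proportions.loc[1] - proportions.loc[0])
print("Gap between classes:", round(gap, 2))
```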



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [None]:
# Q3.1: Manually compute TP, TN, FP, FN [Answer already provided]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [None]:
# Q3.2: Compute accuracy, precision, recall [Answer already provided]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

>  Accuracy is the share of all predictions that are correct, (TP + TN) / total. An accuracy of 0.55 means the model classifies 55 of the 100 samples correctly, which is only slightly better than always predicting the majority class (52%).

>  Recall is the share of actual positives that the model finds, TP / (TP + FN). Precision is the share of predicted positives that are truly positive, TP / (TP + FP).

> Here, precision is about 57.14% and recall is about 53.85%. Recall is lower than precision because FN (24) is greater than FP (21). To improve recall, the model needs to produce fewer false negatives.
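
The three metrics, and the link between FN, FP, and the precision/recall gap, can be recomputed directly from the four counts. This sketch hard-codes the counts from the Q3.1 output rather than reading the dataset.

```python
# Confusion-matrix counts copied from the Q3.1 output.
tp, tn, fp, fn = 28, 27, 21, 24

total = tp + tn + fp + fn
accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of actual positives that the model found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")

# Precision and recall share the numerator TP, so (for TP > 0)
# recall < precision exactly when FN > FP.
print("FN > FP:", fn > fp, "| recall < precision:", recall < precision)
```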



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [None]:
# Q4.1: Choose the numeric column [2 marks]
col = df['monthly_income']
col

Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84
...,...
95,3975.90
96,3764.77
97,2336.44
98,3007.53


In [None]:
# Q4.2: Standardization with z-score [10 marks]
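m = col.mean()        # column mean
s = col.std()         # sample standard deviation (pandas default, ddof=1)
z = (col - m) / s     # z-score: distance from the mean in standard deviations
z.round(2)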

Unnamed: 0,monthly_income
0,0.94
1,-0.32
2,0.74
3,1.04
4,-1.26
...,...
95,1.21
96,0.98
97,-0.61
98,0.14


In [None]:
# Q4.3: Min max scaling implementation [10 marks]
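mn = col.min()
mx = col.max()
rg = mx - mn          # range of the column
ss = col - mn         # shift so that the minimum becomes 0
mm = ss / rg          # rescale so that the maximum becomes 1
mm.round(2)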

Unnamed: 0,monthly_income
0,0.68
1,0.39
2,0.63
3,0.70
4,0.19
...,...
95,0.73
96,0.68
97,0.33
98,0.50



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

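>  Z-scores are centred on 0, so they include negative values (roughly -1.26 to 1.21 in the rows shown) and have no fixed bounds. Min max scaled values always fall between 0 and 1.

>   The z-score uses the column's mean and standard deviation, whereas min max scaling needs only the column's minimum and maximum.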
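
The difference in ranges is easy to see on a handful of values. The sketch below uses only the five `monthly_income` values shown by `df.head()`, so its mean, standard deviation, min, and max come from those five rows and the results will not match the notebook outputs computed on all 100 rows.

```python
import numpy as np

# First five monthly_income values shown by df.head(); a sample, not the full column.
income = np.array([3734.19, 2594.19, 3550.47, 3821.18, 1750.84])

# Z-score: centre on the mean, divide by the sample standard deviation
# (ddof=1 matches the default of pandas Series.std()).
z = (income - income.mean()) / income.std(ddof=1)

# Min max: shift by the minimum, divide by the range.
mm = (income - income.min()) / (income.max() - income.min())

print("z-scores:", z.round(2))
print("  mean:", round(float(z.mean()), 6), "| std:", round(float(z.std(ddof=1)), 6))
print("min max :", mm.round(2))
print("  min:", mm.min(), "| max:", mm.max())
```

The standardized values have mean 0 and standard deviation 1 but no fixed bounds, while the min max values are pinned to the interval from 0 to 1.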



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [None]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
d_city = pd.get_dummies(df['city_type'], prefix = "city_type", dtype = int)

In [None]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
df = pd.concat([df, d_city], axis = 1)
df

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type,city_type_Rural,city_type_Suburban,city_type_Urban
0,1,43,3734.19,109,48,0,0,Medium,Suburban,0,1,0
1,2,49,2594.19,194,7,0,0,Low,Urban,0,0,1
2,3,19,3550.47,146,36,1,0,High,Rural,1,0,0
3,4,19,3821.18,287,14,1,0,High,Suburban,0,1,0
4,5,63,1750.84,66,46,0,0,Medium,Suburban,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,20,3975.90,259,15,0,1,Medium,Urban,0,0,1
96,97,52,3764.77,295,16,1,0,Low,Urban,0,0,1
97,98,35,2336.44,108,46,0,1,High,Rural,1,0,0
98,99,18,3007.53,202,42,1,1,High,Urban,0,0,1


In [None]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
order = {"Low": 1, "Medium": 2, "High": 3}
df["satisfaction_encoded"] = df["satisfaction_level"].map(order).astype(int)
df

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type,city_type_Rural,city_type_Suburban,city_type_Urban,satisfaction_encoded
0,1,43,3734.19,109,48,0,0,Medium,Suburban,0,1,0,2
1,2,49,2594.19,194,7,0,0,Low,Urban,0,0,1,1
2,3,19,3550.47,146,36,1,0,High,Rural,1,0,0,3
3,4,19,3821.18,287,14,1,0,High,Suburban,0,1,0,3
4,5,63,1750.84,66,46,0,0,Medium,Suburban,0,1,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,20,3975.90,259,15,0,1,Medium,Urban,0,0,1,2
96,97,52,3764.77,295,16,1,0,Low,Urban,0,0,1,1
97,98,35,2336.44,108,46,0,1,High,Rural,1,0,0,3
98,99,18,3007.53,202,42,1,1,High,Urban,0,0,1,3



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

>  Ordinal values have a natural order. For example, Bad, Good, and Excellent can be mapped to 1, 2, and 3. Nominal values have no such order.

> We use one hot encoding for `city_type` because `Urban`, `Suburban`, and `Rural` have no order, and separate 0/1 columns avoid implying one.

> `satisfaction_level` has the order `Low` < `Medium` < `High`, so ordinal encoding, which preserves that order, is suitable.
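
The two encodings can be compared side by side on a toy frame. The three rows below are made up for illustration from the category values named in this assignment. The `notna()` check is an extra safeguard that is not part of the original cells: `map()` returns `NaN` for any label missing from the dictionary, and `astype(int)` would then fail.

```python
import pandas as pd

# Toy rows using the category values named in this assignment.
toy = pd.DataFrame({
    "city_type": ["Suburban", "Urban", "Rural"],
    "satisfaction_level": ["Medium", "Low", "High"],
})

# One hot: one 0/1 column per city, so no city is "larger" than another.
city_dummies = pd.get_dummies(toy["city_type"], prefix="city_type", dtype=int)

# Ordinal: integers that preserve Low < Medium < High.
order = {"Low": 1, "Medium": 2, "High": 3}
encoded = toy["satisfaction_level"].map(order)

# map() yields NaN for any label missing from `order`, so check before casting to int.
assert encoded.notna().all(), "unexpected satisfaction_level value"
toy["satisfaction_encoded"] = encoded.astype(int)

print(pd.concat([toy, city_dummies], axis=1))
```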



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std`  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [None]:
# Q6.1: Build 2D vectors for first two users [Answer already provided]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [None]:
# Q6.2: Euclidean distance computation [10 marks]
eu = np.linalg.norm(v1 - v2)
print("Euclidean:", np.round(eu, 3).tolist())

Euclidean: 1140.737


In [None]:
# Q6.3: Manhattan distance computation [10 marks]
ma = np.linalg.norm(v1 - v2, ord = 1)
print("Manhattan:", ma.tolist())

Manhattan: 1181.0
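
The vectors above use raw `monthly_income`, so the income gap (1140.00) dominates both distances and the app-open gap (41) barely registers. The sketch below shows how standardizing each feature first changes that balance. It uses only the five rows shown by `df.head()` as a stand-in for the full dataset, so the numbers are illustrative and will not match a run on all 100 rows.

```python
import numpy as np

# Columns: monthly_income, daily_app_opens (first five rows from df.head()).
X = np.array([
    [3734.19, 48],
    [2594.19, 7],
    [3550.47, 36],
    [3821.18, 14],
    [1750.84, 46],
])

# Standardize each column with the sample standard deviation (ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Per-feature gaps between user 0 and user 1, before and after scaling.
raw_diff = np.abs(X[0] - X[1])
std_diff = np.abs(Z[0] - Z[1])

print("Raw per-feature gaps:         ", raw_diff)
print("Standardized per-feature gaps:", std_diff.round(3))
print("Euclidean (standardized):", round(float(np.linalg.norm(Z[0] - Z[1])), 3))
print("Manhattan (standardized):", round(float(np.linalg.norm(Z[0] - Z[1], ord=1)), 3))
```

After standardization both features are measured in standard deviations, so neither one dominates the distance simply because its units are larger.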



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

>  Manhattan distance (1181.0) is larger than Euclidean distance (1140.737) in my result.

> Manhattan distance is always at least as large as Euclidean distance, because Euclidean distance is the straight-line (shortest) distance between the two points.

> Manhattan distance instead sums the absolute differences along each axis. The two are equal only when the vectors differ along a single axis; otherwise Manhattan distance is strictly greater.  
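
The relationship follows from the formulas: for per-axis gaps a and b, (|a| + |b|)^2 = a^2 + b^2 + 2|a||b|, which is never smaller than a^2 + b^2. The sketch below checks this on the two user vectors from Q6.1 and on a made-up pair that differs along one axis only, where the two distances coincide. The helper names `euclidean` and `manhattan` are illustrative and are not the notebook's own code.

```python
import numpy as np

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return float(np.sum(np.abs(a - b)))

# The two user vectors from Q6.1 (monthly_income, daily_app_opens).
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])
print("Euclidean:", round(euclidean(v1, v2), 3), "| Manhattan:", round(manhattan(v1, v2), 3))

# Equality case (made-up points): the vectors differ in only one coordinate.
p = np.array([0.0, 5.0])
q = np.array([0.0, 2.0])
print("Euclidean:", euclidean(p, q), "| Manhattan:", manhattan(p, q))
```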



---
## Final Reflection [5 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

> The IQR and standard deviation from Module 1 are used here. From Module 2, the confusion matrix (TP, FN, FP, TN) is used, along with accuracy, precision, and recall.

> From Module 3, one hot encoding and ordinal encoding are used for nominal and ordinal values. Euclidean and Manhattan distance are also used to find the distance between two vectors.

> Accuracy shows how often the model's predictions are correct overall (TP and TN). Sensitivity (recall) measures how many of the actual positives the model identifies. The standard deviation describes how widely the values in a column are spread around the mean. Finally, knowing whether a feature is nominal or ordinal tells us when to use one hot encoding and when to use ordinal encoding.



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Download this notebook as `.ipynb` and upload it according to the given instructions.
- ***You must read the Assignment Module text instructions in full; they explain how to submit this assignment.***
