# Module 10.5: Practice Sheet for Modules 9 and 10

**Topic:** Data Preprocessing, Feature Engineering, Linear and Logistic Regression

This notebook contains short practice questions for:

- **Module 09:** Data Preprocessing and Feature Engineering Part 2
- **Module 10:** Linear Regression and Logistic Regression

Write your answers in the provided markdown and code cells. You can duplicate this notebook for multiple attempts.

## Module 09: Data Preprocessing and Feature Engineering Part 2

Topics:
- Outlier detection and handling
- Feature transformation (polynomial, binning)
- Feature construction and domain driven feature creation

### Q1. Outlier Detection and Handling

You collected daily website traffic data:

```python
traffic = [120, 135, 140, 125, 138, 142, 130, 900]
```

1. Detect the outlier using the **IQR method**. You can either show calculations or explain the idea.
2. Suggest **one** appropriate way to handle this outlier and justify it in one line.
3. Give an example of a real life situation where this outlier should **not** be removed.

In [None]:
# Optional: calculate IQR to detect the outlier
import numpy as np

traffic = np.array([120, 135, 140, 125, 138, 142, 130, 900])
traffic

Use the code cell above if you want to quickly check quartiles and IQR. Explain your reasoning below.
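As an optional check, the quartiles and IQR fences can be computed with NumPy. This is a minimal sketch that assumes the common 1.5 × IQR rule and NumPy's default linear interpolation for percentiles; other quartile conventions give slightly different fences. The names `lower_fence`, `upper_fence`, and `outliers` are illustrative and not part of the original sheet.

```python
import numpy as np

traffic = np.array([120, 135, 140, 125, 138, 142, 130, 900])

# Quartiles using NumPy's default (linear interpolation) percentile method
q1, q3 = np.percentile(traffic, [25, 75])
iqr = q3 - q1

# Assumed rule: flag anything more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = traffic[(traffic < lower_fence) | (traffic > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

Any value outside the two fences is flagged; whether it should then be removed, capped, or kept is the judgment call that parts 2 and 3 ask about.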

**Your answer for Q1:**

1. The IQR method ...

### Q2. Polynomial Transformation

You are predicting house prices using the number of rooms. The scatter plot shows a clear **curved** relationship.

1. Explain why adding polynomial features such as `rooms**2` might improve the model.

Short conceptual answer only. No coding required.
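Although no code is required, an optional sketch can make the idea concrete. The data below is hypothetical (price is made to grow with the square of `rooms`), and the helper `r_squared` is an illustrative name. It compares an ordinary least squares fit on `rooms` alone against a fit that also includes `rooms**2`.

```python
import numpy as np

# Hypothetical data with a curved relationship (price grows with rooms squared)
rooms = np.arange(1, 9, dtype=float)
price = 50.0 + 5.0 * rooms**2

def r_squared(features, target):
    """Fit ordinary least squares with an intercept; return R squared on the same data."""
    design = np.column_stack([np.ones(len(target)), features])
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coeffs
    return 1.0 - residuals.var() / target.var()

r2_linear = r_squared(rooms, price)
r2_poly = r_squared(np.column_stack([rooms, rooms**2]), price)

print("R squared, rooms only:", round(r2_linear, 4))
print("R squared, rooms and rooms**2:", round(r2_poly, 4))
```

The model is still linear in its coefficients; the extra column simply lets the fitted line bend to follow the curve.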

**Your answer for Q2:**

### Q3. Binning or Discretization

A continuous variable is given:

```python
ages = [18, 20, 45, 67, 72, 23]
```

1. Create **three bins** such as `young`, `middle`, `old` and assign each age to a bin.
2. State **one benefit** and **one drawback** of using binning in a machine learning model.

In [None]:
# Optional: try binning using pandas
import pandas as pd

ages = [18, 20, 45, 67, 72, 23]
ages_series = pd.Series(ages, name="Age")
ages_series


0    18
1    20
2    45
3    67
4    72
5    23
Name: Age, dtype: int64


In [None]:
import pandas as pd

ages = [18, 20, 45, 67, 72, 23]
ages_series = pd.Series(ages, name="Age")  # re-create the series so this cell runs on its own

# Left-closed bins: [18, 30) -> young, [30, 50) -> middle, [50, 100) -> old
ages_bin = pd.cut(
    ages_series,
    bins=[18, 30, 50, 100],
    labels=["young", "middle", "old"],
    right=False,
)
print(ages_series, ages_bin, sep="\n")

0    18
1    20
2    45
3    67
4    72
5    23
Name: Age, dtype: int64
0     young
1     young
2    middle
3       old
4       old
5     young
Name: Age, dtype: category
Categories (3, object): ['young' < 'middle' < 'old']


**Your answer for Q3 (bins, benefit, drawback):**

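**Bins:** With the left-closed bins used in the code cell above, `young` = [18, 30), `middle` = [30, 50), and `old` = [50, 100), so 18, 20, and 23 are `young`, 45 is `middle`, and 67 and 72 are `old`.

**Benefit:** Binning can make a model more robust to noise by grouping similar numeric values together, which reduces the impact of small fluctuations in the data.

**Drawback:** Binning loses information because continuous values are collapsed into broad categories, which may reduce model accuracy if important variation is removed.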

### Q4. Domain Driven Feature Construction

A food delivery dataset includes the following features:

- `distance_km`
- `order_time`
- `delivery_time`

Your task:

1. Propose **two new features** that might help predict **delivery delay**.
2. For each new feature, give **one sentence** explaining why it can be useful.

Hint: think about duration, rush hour, peak time and so on.

**Your answer for Q4:**

- `travel_duration = delivery_time − order_time`: This captures how long the delivery actually took, which directly relates to delay.

- `is_peak_hour` (e.g., 6–9 PM): Orders during peak hours face heavier traffic and higher demand, making delays more likely.
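The two features could be built with pandas roughly as follows. This is a sketch on hypothetical rows: the timestamps are made up, `travel_duration_min` expresses the duration in minutes, and the peak window is assumed to cover orders placed from 18:00 up to 20:59.

```python
import pandas as pd

# Hypothetical orders; the column names follow the question
orders = pd.DataFrame({
    "distance_km": [2.5, 7.0, 4.2],
    "order_time": pd.to_datetime(
        ["2024-01-05 12:10", "2024-01-05 19:30", "2024-01-05 21:05"]
    ),
    "delivery_time": pd.to_datetime(
        ["2024-01-05 12:40", "2024-01-05 20:35", "2024-01-05 21:30"]
    ),
})

# Feature 1: elapsed time between order and delivery, in minutes
elapsed = orders["delivery_time"] - orders["order_time"]
orders["travel_duration_min"] = elapsed.dt.total_seconds() / 60

# Feature 2: 1 if the order was placed in the assumed 6-9 PM peak window
orders["is_peak_hour"] = orders["order_time"].dt.hour.between(18, 20).astype(int)

print(orders[["distance_km", "travel_duration_min", "is_peak_hour"]])
```

The peak window is a modelling choice; in practice it would be picked from the actual order volume by hour.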

## Module 10: Linear Regression and Logistic Regression

Topics:
- Concept of regression and line fitting
- Cost function, gradient descent and optimization
- Model evaluation metrics: R squared, MAE, RMSE
- Assumptions and limitations of linear regression
- Transition from regression to classification with the sigmoid function

### Q5. Concept of Regression and Line Fitting

You are predicting exam scores based on hours studied.

```python
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 80]
```

1. In your own words, describe what **line fitting** means in linear regression.
2. What does the **slope** of the line represent in this context, in plain language?

In [16]:
# Optional: fit a simple linear regression model
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
scores = np.array([50, 55, 65, 70, 80])

model = LinearRegression()
model.fit(hours, scores)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)

Slope: 7.500000000000001
Intercept: 41.5


**Your answer for Q5:**

1. Line fitting in linear regression means finding the straight line that best matches the data, so the model can predict the output from the input with the least error.
2. The slope of the line represents how much the predicted score changes when study time increases by one hour.
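For a single input feature, the best-fit line can also be computed by hand with the least squares formulas, which makes the meaning of the slope explicit. A minimal sketch with NumPy, using the same `hours` and `scores` as the question:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5], dtype=float)
scores = np.array([50, 55, 65, 70, 80], dtype=float)

# Least squares slope: co-variation of x and y divided by variation of x
x_centered = hours - hours.mean()
y_centered = scores - scores.mean()
slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)

# The fitted line always passes through the point (mean of x, mean of y)
intercept = scores.mean() - slope * hours.mean()

print("Slope:", slope)
print("Intercept:", intercept)
print("Predicted score for 6 hours:", slope * 6 + intercept)
```

The slope and intercept agree with the `LinearRegression` output above, up to floating point rounding.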

### Q6. Cost Function and Gradient Descent

Answer in simple language:

1. What does the **cost function** such as mean squared error measure in a regression model?
2. What does **gradient descent** do to the value of the cost function step by step?
3. Why is using a very **large learning rate** risky when running gradient descent?

**Your answer for Q6:**

1. A cost function such as mean squared error (MSE) measures how far the model's predictions are from the actual target values. A smaller MSE means the predictions are closer to the actual values, so the model fits better.
2. Gradient descent is an optimization method that adjusts the model's parameters (such as the slope and intercept) to minimize the cost function. Step by step:

   i. It calculates the gradient (slope) of the cost function and moves the parameters in the direction that reduces the error.

   ii. It repeats this until the cost function reaches a minimum, so the cost gets a little smaller with each step.

3. A very large learning rate can make gradient descent overshoot the minimum, so the cost function bounces around or even grows instead of decreasing.
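The effect of the learning rate can be seen in a small experiment. This is a sketch of plain batch gradient descent on MSE for the hours and scores data from Q5; the function name `run_gradient_descent` and the two learning rates (0.01 and 0.2) are illustrative choices, not values from the module.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5], dtype=float)
scores = np.array([50, 55, 65, 70, 80], dtype=float)

def run_gradient_descent(learning_rate, steps):
    """Return the MSE recorded before each update of slope w and intercept b."""
    w, b = 0.0, 0.0
    costs = []
    for _ in range(steps):
        errors = w * hours + b - scores
        costs.append(np.mean(errors ** 2))
        # Gradients of the MSE with respect to w and b
        grad_w = 2.0 * np.mean(errors * hours)
        grad_b = 2.0 * np.mean(errors)
        # Move against the gradient, scaled by the learning rate
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return costs

costs_small = run_gradient_descent(learning_rate=0.01, steps=50)
costs_large = run_gradient_descent(learning_rate=0.2, steps=20)

print("Small rate, first and last cost:", costs_small[0], costs_small[-1])
print("Large rate, first and last cost:", costs_large[0], costs_large[-1])
```

With the small rate the cost falls at every step; with the large rate each update overshoots and the cost grows.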


### Q7. Regression Metrics Interpretation

A regression model produced the following metrics:

- R squared = 0.75
- MAE = 4.2
- RMSE = 7.6

1. Explain what each of these numbers tells you about the model.
2. Which metric **penalizes big errors more**, and why?


In [21]:
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

true_vals = np.array([10, 12, 15, 20])
pred_vals = np.array([11, 13, 14, 19])

print("R squared:", r2_score(true_vals, pred_vals))
print("MAE:", mean_absolute_error(true_vals, pred_vals))
print("RMSE:", np.sqrt(mean_squared_error(true_vals, pred_vals)))

R squared: 0.9295154185022027
MAE: 1.0
RMSE: 1.0


**Your answer for Q7:**

1. Interpretation of each metric:
   - R squared = 0.75 means the model explains about 75% of the variation in the target, which is a reasonably good fit.
   - MAE = 4.2 means the predictions are off from the actual values by 4.2 units on average.
   - RMSE = 7.6 is the square root of the average squared error; because it is well above the MAE, a few predictions probably have large errors.

2. RMSE penalizes big errors more than MAE.
   - Reason: squaring the differences gives more weight to larger errors, so a few big mistakes increase RMSE much more than MAE.
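A small numeric comparison illustrates the difference. The two error sets below are hypothetical and chosen to have the same MAE; only the way the error is distributed differs.

```python
import numpy as np

# Two hypothetical sets of prediction errors with the same average size
errors_even = np.array([4.0, 4.0, 4.0, 4.0])    # every prediction is off by 4
errors_spiky = np.array([0.0, 0.0, 0.0, 16.0])  # one large miss

def mae(errors):
    """Mean absolute error."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error."""
    return np.sqrt(np.mean(errors ** 2))

print("Even errors:  MAE =", mae(errors_even), "RMSE =", rmse(errors_even))
print("Spiky errors: MAE =", mae(errors_spiky), "RMSE =", rmse(errors_spiky))
```

Both sets have an MAE of 4, but the RMSE doubles for the set with one large miss, which is why a gap between RMSE and MAE hints at a few big errors.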



### Q8. Assumptions and Residual Patterns

A residual plot for a linear regression model shows two things:

- A clear **curved** pattern in the residuals.
- The spread of residuals becomes larger for bigger values of `x`.

1. Name **two** linear regression assumptions that are probably being violated.
2. For each assumption, explain in **one sentence** why this is a problem for the model.

**Your answer for Q8:**

- Linearity assumption
Residuals show a curved pattern, which means the relationship between x and y is not truly linear, so the model cannot capture it accurately.

- Homoscedasticity (constant variance) assumption
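The residual spread increases for larger x, indicating non-constant variance, which makes standard errors and confidence intervals unreliable and the estimates inefficient, even though the coefficients themselves are not biased by this alone.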
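The curved residual pattern is easy to reproduce. The sketch below uses hypothetical, noise-free data where `y` depends on `x` squared, fits a straight line with `np.polyfit`, and prints the residuals; it illustrates only the linearity violation, not the changing spread.

```python
import numpy as np

# Hypothetical curved data: y depends on x squared, not on x alone
x = np.arange(0, 11, dtype=float)
y = x ** 2

# Fit a straight line (degree-1 polynomial) and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

print(np.round(residuals, 2))
```

The residuals are positive at both ends and negative in the middle, a U shape that signals the straight line is missing a curved component.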

### Q9. From Linear Regression to Logistic Regression

You want to predict whether a customer will buy a product, where `0` means No and `1` means Yes.

1. Why is **linear regression** not a good choice for this classification problem?
2. What role does the **sigmoid function** play in logistic regression?
3. If the sigmoid output is `0.81`, what is the predicted class when the decision threshold is `0.5`?

**Your answer for Q9:**

1. Linear regression predicts a continuous value.
For binary classification, it can produce outputs outside [0, 1], which do not make sense as probabilities. It also does not handle class separation well.
2. The sigmoid function converts any real number into a value between 0 and 1.
This makes it suitable for binary classification, since the output can be read as a probability and thresholded to get a class label.
3. 0.81 > 0.5, so the predicted class is 1 (Yes).
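The sigmoid and the thresholding step can be sketched in a few lines. The helper names `sigmoid` and `predict_class` are illustrative; the formula is the standard logistic function 1 / (1 + e^(-z)).

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(probability, threshold=0.5):
    """Return 1 (Yes) if the probability reaches the threshold, else 0 (No)."""
    return int(probability >= threshold)

# Large negative inputs approach 0, large positive inputs approach 1
for z in [-4.0, 0.0, 4.0]:
    print(z, "->", round(float(sigmoid(z)), 3))

print("Class for probability 0.81:", predict_class(0.81))
```

An input of 0 maps to exactly 0.5, so with the default threshold the sign of the linear score decides the class.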

### Q10. Decision Threshold and Trade offs

A hospital uses a logistic regression model to detect a risky health condition. The current decision threshold is **0.5**.

1. If the hospital wants to **reduce false negatives**, that is, avoid missing patients who truly have the condition, should the threshold go **up** or **down**?
2. Explain your answer in **one sentence**.

**Your answer for Q10:**

- The threshold should go down, because a lower threshold classifies more patients as positive, reducing the chance of false negatives (at the cost of more false positives).
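The trade-off can be checked on a toy example. The labels and predicted probabilities below are hypothetical, and `count_errors` is an illustrative helper that counts false negatives and false positives at a given threshold.

```python
import numpy as np

# Hypothetical patients: 1 = has the condition, 0 = does not
y_true = np.array([1, 1, 1, 0, 0, 0])
probs = np.array([0.90, 0.45, 0.35, 0.40, 0.20, 0.10])

def count_errors(threshold):
    """Return (false negatives, false positives) at the given threshold."""
    y_pred = (probs >= threshold).astype(int)
    false_negatives = int(np.sum((y_true == 1) & (y_pred == 0)))
    false_positives = int(np.sum((y_true == 0) & (y_pred == 1)))
    return false_negatives, false_positives

for threshold in [0.5, 0.3]:
    fn, fp = count_errors(threshold)
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

In this toy data, lowering the threshold from 0.5 to 0.3 removes both missed patients but flags one healthy patient, which is the usual price of reducing false negatives.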



---

✅ You have reached the end of the practice sheet.

You can now:
- Review your answers.
- Run the optional code cells.