# Module 10.5: Practice Sheet for Modules 9 and 10

**Topic:** Data Preprocessing, Feature Engineering, Linear and Logistic Regression

This notebook contains short practice questions for:

- **Module 09:** Data Preprocessing and Feature Engineering Part 2
- **Module 10:** Linear Regression and Logistic Regression

Write your answers in the provided markdown and code cells. You can duplicate this notebook for multiple attempts.

## Module 09: Data Preprocessing and Feature Engineering Part 2

Topics:
- Outlier detection and handling
- Feature transformation (polynomial, binning)
- Feature construction and domain-driven feature creation

### Q1. Outlier Detection and Handling

You collected daily website traffic data:

```python
traffic = [120, 135, 140, 125, 138, 142, 130, 900]
```

1. Detect the outlier using the **IQR method**. You can either show calculations or explain the idea.
2. Suggest **one** appropriate way to handle this outlier and justify it in one line.
3. Give an example of a real life situation where this outlier should **not** be removed.

```python
# Optional: calculate IQR to detect the outlier
import numpy as np

traffic = np.array([120, 135, 140, 125, 138, 142, 130, 900])
traffic
```

```
array([120, 135, 140, 125, 138, 142, 130, 900])
```

Use the code cell above if you want to quickly check quartiles and IQR. Explain your reasoning below.

**Your answer for Q1:**

1. **The IQR method:** Sorting the data gives 120, 125, 130, 135, 138, 140, 142, 900. With NumPy's default linear interpolation, Q1 = 128.75 and Q3 = 140.5, so IQR = Q3 - Q1 = 140.5 - 128.75 = 11.75. The outlier boundaries are:
   - Lower bound: Q1 - 1.5×IQR = 128.75 - 1.5×11.75 = 111.125
   - Upper bound: Q3 + 1.5×IQR = 140.5 + 1.5×11.75 = 158.125
   
   The value 900 is above the upper bound of 158.125, making it an outlier.

2. **Handling the outlier:** Use capping/winsorization to replace 900 with the upper bound of 158.125, as this preserves the data point while reducing its extreme influence.

3. **Real-life situation where outlier should not be removed:** During a viral marketing campaign or major website announcement, the traffic spike of 900 could represent a genuine business event that should be studied rather than removed.
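The IQR check and the capping step can be verified numerically. The block below is a minimal sketch that assumes NumPy's default (linear interpolation) percentile method; other quartile conventions give slightly different fences, but 900 is flagged either way.

```python
import numpy as np

traffic = np.array([120, 135, 140, 125, 138, 142, 130, 900])

# Quartiles using NumPy's default linear interpolation
q1, q3 = np.percentile(traffic, [25, 75])
iqr = q3 - q1

# Standard 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = traffic[(traffic < lower) | (traffic > upper)]

# Capping (winsorization): pull extreme values back to the nearest fence
capped = np.clip(traffic, lower, upper)

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Bounds:", lower, upper)
print("Outliers:", outliers)
print("Capped:", capped)
```

Capping keeps all eight observations, which matters when the dataset is this small.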

### Q2. Polynomial Transformation

You are predicting house prices using the number of rooms. The scatter plot shows a clear **curved** relationship.

1. Explain why adding polynomial features such as `rooms**2` might improve the model.

Short conceptual answer only. No coding required.

**Your answer for Q2:**

Adding polynomial features like `rooms**2` allows the linear regression model to capture non-linear relationships between variables. With `rooms` as the only input, linear regression can fit only a straight line. Adding `rooms**2` as an extra input keeps the model linear in its coefficients but lets the fitted curve bend when plotted against `rooms`. This gives the model the flexibility to fit data where the relationship between rooms and house price isn't strictly linear but shows acceleration or diminishing returns.
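To see the effect on a small scale, the sketch below compares a straight-line fit with a fit that also uses `rooms**2`. The `rooms` and `price` values are made up for illustration and deliberately follow a curve; they are not from any real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: price rises faster as the number of rooms grows
rooms = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
price = np.array([54, 62, 74, 90, 110, 134])  # made-up values

# Model 1: rooms only (straight line)
linear = LinearRegression().fit(rooms, price)

# Model 2: rooms and rooms**2 as inputs to the same linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
rooms_poly = poly.fit_transform(rooms)  # columns: rooms, rooms**2
quadratic = LinearRegression().fit(rooms_poly, price)

print("R squared (rooms only):", linear.score(rooms, price))
print("R squared (rooms + rooms**2):", quadratic.score(rooms_poly, price))
```

On real data the improvement should be checked on held-out data, since higher-degree terms can also overfit.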

### Q3. Binning or Discretization

A continuous variable is given:

```python
ages = [18, 20, 45, 67, 72, 23]
```

1. Create **three bins** such as `young`, `middle`, `old` and assign each age to a bin.
2. State **one benefit** and **one drawback** of using binning in a machine learning model.

```python
# Optional: try binning using pandas
import pandas as pd

ages = [18, 20, 45, 67, 72, 23]
ages_series = pd.Series(ages, name="Age")
ages_series
```

```
0    18
1    20
2    45
3    67
4    72
5    23
Name: Age, dtype: int64
```

**Your answer for Q3 (bins, benefit, drawback):**

**Bins:**
- young: 18, 20, 23
- middle: 45
- old: 67, 72

**Benefit:** Binning can capture non-linear relationships and make the model more robust to outliers by grouping similar values together.

**Drawback:** Binning results in loss of information by treating all values within a bin as identical, which can reduce model precision and hide important variations within each group.
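The bin assignment can be reproduced with `pd.cut`. The bin edges used here (up to 30, 31 to 60, above 60) are an assumption chosen to match the grouping above; the question does not fix them.

```python
import pandas as pd

ages = pd.Series([18, 20, 45, 67, 72, 23], name="Age")

# Assumed edges: (0, 30] -> young, (30, 60] -> middle, (60, 120] -> old
age_group = pd.cut(
    ages,
    bins=[0, 30, 60, 120],
    labels=["young", "middle", "old"],
)

print(pd.DataFrame({"Age": ages, "Group": age_group}))
```

Note that 18 and 23 end up with the same label, which is exactly the information loss described in the drawback.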

### Q4. Domain Driven Feature Construction

A food delivery dataset includes the following features:

- `distance_km`
- `order_time`
- `delivery_time`

Your task:

1. Propose **two new features** that might help predict **delivery delay**.
2. For each new feature, give **one sentence** explaining why it can be useful.

Hint: think about duration, rush hour, peak time and so on.

**Your answer for Q4:**

**Two new features:**

1. **delivery_duration = delivery_time - order_time**
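   - This measures how long the delivery actually took. Because it is only known after the order arrives, it is best used to define the delay target or to build historical averages (for example, the average duration of past orders) rather than as a direct model input.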

2. **is_rush_hour = 1 if order_time is during peak meal hours (12-2pm or 7-9pm), else 0**
   - This captures traffic conditions and order volume effects that typically cause longer delivery times during busy periods.
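Both features can be built with pandas datetime operations. The sketch below uses a small hypothetical `orders` table; the column names follow the question, while the values and the exact rush-hour windows (12:00 to 13:59 and 19:00 to 20:59) are assumptions.

```python
import pandas as pd

# Hypothetical orders; values are made up for illustration
orders = pd.DataFrame({
    "distance_km": [2.5, 6.0, 4.2],
    "order_time": pd.to_datetime(
        ["2024-01-05 12:30", "2024-01-05 15:10", "2024-01-05 19:45"]
    ),
    "delivery_time": pd.to_datetime(
        ["2024-01-05 13:05", "2024-01-05 15:40", "2024-01-05 20:40"]
    ),
})

# Feature 1: elapsed minutes between order and delivery
elapsed = orders["delivery_time"] - orders["order_time"]
orders["delivery_duration_min"] = elapsed.dt.total_seconds() / 60

# Feature 2: rush-hour flag based on the hour the order was placed
hour = orders["order_time"].dt.hour
orders["is_rush_hour"] = (hour.between(12, 13) | hour.between(19, 20)).astype(int)

print(orders[["distance_km", "delivery_duration_min", "is_rush_hour"]])
```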

## Module 10: Linear Regression and Logistic Regression

Topics:
- Concept of regression and line fitting
- Cost function, gradient descent and optimization
- Model evaluation metrics: R squared, MAE, RMSE
- Assumptions and limitations of linear regression
- Transition from regression to classification with the sigmoid function

### Q5. Concept of Regression and Line Fitting

You are predicting exam scores based on hours studied.

```python
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 80]
```

1. In your own words, describe what **line fitting** means in linear regression.
2. What does the **slope** of the line represent in this context, in plain language?

```python
# Optional: fit a simple linear regression model
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
scores = np.array([50, 55, 65, 70, 80])

model = LinearRegression()
model.fit(hours, scores)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
```

```
Slope: 7.500000000000001
Intercept: 41.5
```


**Your answer for Q5:**

1. **Line fitting** means finding the straight line that best represents the relationship between hours studied and exam scores. "Best" here means the line that minimizes the sum of squared vertical distances (residuals) between the line and the actual data points.

2. The **slope** represents how much the exam score is expected to increase for each additional hour of studying, in this case approximately 7.5 points per hour.
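For a single input, the least-squares slope and intercept have a closed form, which makes it easy to check the fitted values by hand. This is a minimal sketch using the standard formulas slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and intercept = ȳ - slope·x̄.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5])
scores = np.array([50, 55, 65, 70, 80])

x_mean, y_mean = hours.mean(), scores.mean()

# Closed-form least-squares estimates for one feature
slope = np.sum((hours - x_mean) * (scores - y_mean)) / np.sum((hours - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residuals: vertical distances between actual scores and the fitted line
residuals = scores - (slope * hours + intercept)

print("Slope:", slope)
print("Intercept:", intercept)
print("Sum of squared residuals:", np.sum(residuals ** 2))
```

Any other straight line through these points would give a larger sum of squared residuals.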

### Q6. Cost Function and Gradient Descent

Answer in simple language:

1. What does the **cost function** such as mean squared error measure in a regression model?
2. What does **gradient descent** do to the value of the cost function step by step?
3. Why is using a very **large learning rate** risky when running gradient descent?

**Your answer for Q6:**

1. The **cost function** measures the average squared difference between predicted and actual values, quantifying how "wrong" the model's predictions are.

2. **Gradient descent** iteratively adjusts the model parameters to gradually reduce the cost function, like walking downhill to find the lowest point.

3. A very **large learning rate** is risky because it can overshoot the optimal solution, potentially causing the algorithm to diverge or bounce around without finding the minimum.
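The effect of the learning rate can be seen on the hours and scores data from Q5. The sketch below is a plain batch gradient descent on MSE; the function name `gradient_descent` and the two learning rates (0.05 and 0.5) are illustrative choices, not values from the module.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5], dtype=float)
scores = np.array([50, 55, 65, 70, 80], dtype=float)

def gradient_descent(x, y, learning_rate, steps):
    """Fit y ~ w*x + b by gradient descent on MSE; returns (w, b, cost history)."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(steps):
        error = (w * x + b) - y
        history.append(np.mean(error ** 2))  # MSE before this update
        # Gradients of MSE with respect to w and b
        grad_w = 2 * np.mean(error * x)
        grad_b = 2 * np.mean(error)
        # Step against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history

# Small learning rate: cost shrinks and parameters approach the least-squares fit
w_ok, b_ok, cost_ok = gradient_descent(hours, scores, learning_rate=0.05, steps=5000)

# Large learning rate: each step overshoots and the cost grows
_, _, cost_big = gradient_descent(hours, scores, learning_rate=0.5, steps=20)

print("lr=0.05 ->", round(w_ok, 3), round(b_ok, 3), "final cost:", cost_ok[-1])
print("lr=0.5  -> first cost:", cost_big[0], "last cost:", cost_big[-1])
```

With the small rate the parameters settle near the slope and intercept found in Q5, while the large rate makes the cost increase from step to step.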

### Q7. Regression Metrics Interpretation

A regression model produced the following metrics:

- R squared = 0.75
- MAE = 4.2
- RMSE = 7.6

1. Explain what each of these numbers tells you about the model.
2. Which metric **penalizes big errors more**, and why?


```python
# Optional: try computing these metrics on a toy example
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

true_vals = np.array([10, 12, 15, 20])
pred_vals = np.array([11, 13, 14, 19])

print("R squared:", r2_score(true_vals, pred_vals))
print("MAE:", mean_absolute_error(true_vals, pred_vals))
# The `squared=False` argument is deprecated in newer scikit-learn releases,
# so RMSE is computed here as the square root of MSE.
print("MSE:", mean_squared_error(true_vals, pred_vals))
print("RMSE:", np.sqrt(mean_squared_error(true_vals, pred_vals)))
```

```
R squared: 0.9295154185022027
MAE: 1.0
MSE: 1.0
RMSE: 1.0
```


**Your answer for Q7:**

1. **Interpretation of metrics:**
   - **R squared = 0.75**: The model explains 75% of the variance in the target variable. Whether that counts as a good fit depends on the problem, but it shows the model captures most of the variation.
   - **MAE = 4.2**: On average, the model's predictions are off by 4.2 units from the actual values.
   - **RMSE = 7.6**: The model's typical prediction error is 7.6 units when larger errors are weighted more heavily. RMSE being well above MAE suggests that a few predictions have large errors.

2. **RMSE penalizes big errors more** because it squares the errors before averaging, so an error twice as large contributes four times as much. MAE treats all errors linearly.
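A small comparison makes the difference concrete. The two error sets below are hypothetical and chosen so that both have the same MAE; only the second contains one large error.

```python
import numpy as np

# Hypothetical prediction errors with the same mean absolute error
errors_even = np.array([4.0, 4.0, 4.0, 4.0])    # all errors similar
errors_spiky = np.array([1.0, 1.0, 1.0, 13.0])  # one large error

def mae(errors):
    return np.mean(np.abs(errors))

def rmse(errors):
    return np.sqrt(np.mean(errors ** 2))

for name, errors in [("even", errors_even), ("spiky", errors_spiky)]:
    print(name, "MAE:", mae(errors), "RMSE:", round(rmse(errors), 3))
```

MAE cannot tell the two sets apart, while RMSE is noticeably higher for the set with the single large error.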

### Q8. Assumptions and Residual Patterns

A residual plot for a linear regression model shows two things:

- A clear **curved** pattern in the residuals.
- The spread of residuals becomes larger for bigger values of `x`.

1. Name **two** linear regression assumptions that are probably being violated.
2. For each assumption, explain in **one sentence** why this is a problem for the model.

**Your answer for Q8:**

1. **Two violated assumptions:**
   - **Linearity**: The curved pattern indicates the relationship isn't linear
   - **Homoscedasticity** (constant variance): The increasing spread shows variance isn't constant

2. **Why these are problems:**
   - Violating linearity means the model won't capture the true relationship, leading to biased predictions.
   - Non-constant variance (heteroscedasticity) makes the usual standard errors wrong, so confidence intervals and hypothesis tests become unreliable.
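The curved residual pattern is easy to reproduce. The sketch below fits a straight line to hypothetical data that follows `y = x**2` exactly, so it illustrates only the linearity violation; showing a growing spread would additionally need noisy data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical curved relationship with no noise
x = np.arange(1, 11).reshape(-1, 1)
y = (x.ravel() ** 2).astype(float)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# Residuals are positive at both ends and negative in the middle (a U shape)
print(np.round(residuals, 1))
```

A model that fits well would leave residuals scattered around zero with no visible shape.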

### Q9. From Linear Regression to Logistic Regression

You want to predict whether a customer will buy a product, where `0` means No and `1` means Yes.

1. Why is **linear regression** not a good choice for this classification problem?
2. What role does the **sigmoid function** play in logistic regression?
3. If the sigmoid output is `0.81`, what is the predicted class when the decision threshold is `0.5`?

**Your answer for Q9:**

1. **Linear regression is not good** because it can predict values outside 0-1 range and assumes continuous outcomes rather than binary classes.

2. The **sigmoid function** transforms any real-valued number into a probability between 0 and 1, enabling classification.

3. With sigmoid output **0.81** and threshold **0.5**, the predicted class is **1 (Yes)**.
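The sigmoid and the thresholding step can be written in a few lines. This is a minimal sketch; the input scores in `z` are hypothetical, with 1.45 chosen because it maps to roughly 0.81.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w*x + b) for four customers
z = np.array([-3.0, 0.0, 1.45, 3.0])
probs = sigmoid(z)

# Apply the 0.5 decision threshold: 1 = Yes, 0 = No
preds = (probs >= 0.5).astype(int)

print("Probabilities:", np.round(probs, 2))
print("Predicted classes:", preds)
```

A score of exactly 0 maps to 0.5, so which class it gets depends on whether the threshold comparison uses `>=` or `>`.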

### Q10. Decision Threshold and Trade-offs

A hospital uses a logistic regression model to detect a risky health condition. The current decision threshold is **0.5**.

1. If the hospital wants to **reduce false negatives**, that is, avoid missing patients who truly have the condition, should the threshold go **up** or **down**?
2. Explain your answer in **one sentence**.

**Your answer for Q10:**

1. The threshold should go **down**.

2. Lowering the threshold makes the model more sensitive, classifying more patients as positive and thus reducing false negatives at the cost of increasing false positives.
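The trade-off can be counted directly on a toy example. The labels and predicted probabilities below are hypothetical, and `count_errors` is an illustrative helper, not part of any library.

```python
import numpy as np

# Hypothetical true labels (1 = has the condition) and model probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.90, 0.60, 0.45, 0.35, 0.40, 0.30, 0.20, 0.10])

def count_errors(threshold):
    """Return (false negatives, false positives) at the given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    false_neg = int(np.sum((y_true == 1) & (y_pred == 0)))
    false_pos = int(np.sum((y_true == 0) & (y_pred == 1)))
    return false_neg, false_pos

for threshold in (0.5, 0.3):
    fn, fp = count_errors(threshold)
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

In this toy data, lowering the threshold from 0.5 to 0.3 removes both missed patients but flags two healthy ones, which is the cost the hospital has to weigh.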