# Module 10.5: Practice Sheet for Modules 9 and 10

**Topic:** Data Preprocessing, Feature Engineering, Linear and Logistic Regression

This notebook contains short practice questions for:

- **Module 09:** Data Preprocessing and Feature Engineering Part 2
- **Module 10:** Linear Regression and Logistic Regression

Write your answers in the provided markdown and code cells. You can duplicate this notebook for multiple attempts.

## Module 09: Data Preprocessing and Feature Engineering Part 2

Topics:
- Outlier detection and handling
- Feature transformation (polynomial, binning)
- Feature construction and domain-driven feature creation

### Q1. Outlier Detection and Handling

You collected daily website traffic data:

```python
traffic = [120, 135, 140, 125, 138, 142, 130, 900]
```

1. Detect the outlier using the **IQR method**. You can either show calculations or explain the idea.
2. Suggest **one** appropriate way to handle this outlier and justify it in one line.
3. Give an example of a real life situation where this outlier should **not** be removed.

In [None]:
# Optional: calculate IQR to detect the outlier
import numpy as np

traffic = np.array([120, 135, 140, 125, 138, 142, 130, 900])

# Quartiles use NumPy's default linear interpolation
Q1 = np.quantile(traffic, 0.25)
Q3 = np.quantile(traffic, 0.75)
IQR = Q3 - Q1
print(IQR)

11.75


Use the code cell above if you want to quickly check quartiles and IQR. Explain your reasoning below.

**Your answer for Q1:**

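1. With Q1 = 128.75 and Q3 = 140.5, the IQR is 11.75. The fences are Q1 - 1.5 × IQR = 111.125 and Q3 + 1.5 × IQR = 158.125, so 900 is the only value outside them and is the outlier.
2. One way to handle the outlier is to cap it at the upper fence, which keeps the day in the dataset while limiting its pull on the mean.
3. Medical data and stock market data, where an extreme value can be a real event rather than a recording error.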
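
The IQR rule used in Q1 can be sketched as follows. This is a minimal sketch that assumes NumPy's default linear-interpolation quantiles and the conventional 1.5 × IQR fences; the names `lower_fence`, `upper_fence`, and `capped` are illustrative, and capping is only one of several reasonable treatments.

```python
import numpy as np

traffic = np.array([120, 135, 140, 125, 138, 142, 130, 900])

# Quartiles and interquartile range (NumPy's default linear interpolation)
q1, q3 = np.quantile(traffic, [0.25, 0.75])
iqr = q3 - q1

# Conventional fences: anything outside them is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = traffic[(traffic < lower_fence) | (traffic > upper_fence)]

# Capping: pull flagged values back to the nearest fence
capped = np.clip(traffic, lower_fence, upper_fence)

print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
print("Capped:", capped)
```

Capping keeps all eight days in the dataset. Dropping the row or investigating the cause of the spike first may be better choices, depending on why the value is extreme.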

### Q2. Polynomial Transformation

You are predicting house prices using the number of rooms. The scatter plot shows a clear **curved** relationship.

1. Explain why adding polynomial features such as `rooms**2` might improve the model.

Short conceptual answer only. No coding required.

**Your answer for Q2:**

Adding a polynomial feature such as `rooms**2` lets the fitted line bend, so the model can follow the curved relationship that a straight line would miss.
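
To make the idea concrete, the sketch below fits the same least-squares model twice, once with `rooms` only and once with `rooms` plus `rooms ** 2`. The data values are hypothetical and chosen to look curved, and `fit_and_mse` is an illustrative helper, not part of any library.

```python
import numpy as np

# Hypothetical data: price grows faster than linearly with the room count
rooms = np.array([1, 2, 3, 4, 5, 6], dtype=float)
price = np.array([62, 88, 141, 209, 302, 408], dtype=float)

def fit_and_mse(features, target):
    """Least-squares fit on the given feature columns; returns the training MSE."""
    # Prepend a column of ones so the model has an intercept
    X = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    return float(np.mean(residuals ** 2))

mse_linear = fit_and_mse([rooms], price)
mse_poly = fit_and_mse([rooms, rooms ** 2], price)

print("MSE with rooms only:      ", mse_linear)
print("MSE with rooms + rooms**2:", mse_poly)
```

The model is still linear in its coefficients; only the inputs changed. Training error always drops when a feature is added, so a held-out set is needed to confirm the extra term helps on new data.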

### Q3. Binning or Discretization

A continuous variable is given:

```python
ages = [18, 20, 45, 67, 72, 23]
```

1. Create **three bins** such as `young`, `middle`, `old` and assign each age to a bin.
2. State **one benefit** and **one drawback** of using binning in a machine learning model.

In [None]:
# Optional: try binning using pandas
import pandas as pd

ages = [18, 20, 45, 67, 72, 23]
ages_series = pd.Series(ages, name="Age")
df = pd.DataFrame(ages_series)
df['Age_bin'] = pd.cut(df["Age"], bins = [0,30, 50, 100], labels=['Young', 'Middle', 'Old'])

print(df)


   Age Age_bin
0   18   Young
1   20   Young
2   45  Middle
3   67     Old
4   72     Old
5   23   Young


**Your answer for Q3 (bins, benefit, drawback):**

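- **Bins:** `Young` = 18, 20, 23; `Middle` = 45; `Old` = 67, 72.
- **Benefit:** binning reduces noise, because small fluctuations inside a range no longer affect the model.
- **Drawback:** all values within a bin are treated as the same, so detail inside each range is lost.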
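
The drawback shows up most clearly at bin edges. The sketch below reproduces the same three bins with `np.digitize` instead of `pd.cut` (a swapped-in technique, chosen only to keep the example small) and then checks two hypothetical ages, 30 and 31, that sit on either side of an edge.

```python
import numpy as np

ages = np.array([18, 20, 45, 67, 72, 23])
labels = np.array(["Young", "Middle", "Old"])

# Inner edges for the bins (0, 30], (30, 50], (50, 100] used in the pd.cut call above
edges = [30, 50]

# right=True makes each bin include its upper edge, like pd.cut's default
bin_index = np.digitize(ages, edges, right=True)
age_bins = labels[bin_index]
print(list(zip(ages.tolist(), age_bins.tolist())))

# Hypothetical boundary case: one year apart, but different bins
boundary_bins = labels[np.digitize([30, 31], edges, right=True)]
print(boundary_bins.tolist())
```

Ages 18 and 23 become indistinguishable, while 30 and 31 land in different bins despite being one year apart. Unlike `pd.cut`, `np.digitize` does not mark values outside the 0 to 100 range as missing.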

### Q4. Domain Driven Feature Construction

A food delivery dataset includes the following features:

- `distance_km`
- `order_time`
- `delivery_time`

Your task:

1. Propose **two new features** that might help predict **delivery delay**.
2. For each new feature, give **one sentence** explaining why it can be useful.

Hint: think about duration, rush hour, peak time and so on.

**Your answer for Q4:**

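1. `Delivery_speed`: distance divided by delivery duration. It captures how distance and travel time relate, so unusually slow deliveries stand out.
2. `time_status`: flags whether `order_time` falls in a peak period or a normal period, since orders placed at peak times are more likely to be delayed.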
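
One way such features might be derived is sketched below using only the standard library. The sample orders, the timestamp format, the peak windows in `PEAK_HOURS`, and the output names `duration_min`, `speed_km_per_h`, and `is_peak` are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical orders; the timestamp format and values are assumptions
orders = [
    {"distance_km": 3.0, "order_time": "2024-05-01 12:30", "delivery_time": "2024-05-01 13:00"},
    {"distance_km": 5.0, "order_time": "2024-05-01 15:10", "delivery_time": "2024-05-01 15:35"},
]

# Assumed peak windows (lunch and dinner) as half-open hour ranges
PEAK_HOURS = [(12, 14), (19, 22)]

def add_features(order):
    """Return a copy of the order with duration, speed, and peak-time features."""
    fmt = "%Y-%m-%d %H:%M"
    placed = datetime.strptime(order["order_time"], fmt)
    delivered = datetime.strptime(order["delivery_time"], fmt)
    duration_min = (delivered - placed).total_seconds() / 60
    return {
        **order,
        "duration_min": duration_min,
        "speed_km_per_h": order["distance_km"] / (duration_min / 60),
        "is_peak": any(start <= placed.hour < end for start, end in PEAK_HOURS),
    }

featured = [add_features(order) for order in orders]
for row in featured:
    print(row["duration_min"], round(row["speed_km_per_h"], 2), row["is_peak"])
```

A real pipeline would also need to guard against a zero duration before dividing, and against using `delivery_time` at prediction time if it is not known yet.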

## Module 10: Linear Regression and Logistic Regression

Topics:
- Concept of regression and line fitting
- Cost function, gradient descent and optimization
- Model evaluation metrics: R squared, MAE, RMSE
- Assumptions and limitations of linear regression
- Transition from regression to classification with the sigmoid function

### Q5. Concept of Regression and Line Fitting

You are predicting exam scores based on hours studied.

```python
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 80]
```

1. In your own words, describe what **line fitting** means in linear regression.
2. What does the **slope** of the line represent in this context, in plain language?

In [None]:
# Optional: fit a simple linear regression model
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
scores = np.array([50, 55, 65, 70, 80])

model = LinearRegression()
model.fit(hours, scores)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)

Slope: 7.500000000000001
Intercept: 41.5


**Your answer for Q5:**

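Line fitting in linear regression means finding the straight line that best represents the relationship between hours studied and scores, typically by minimizing the squared differences between the actual and predicted scores.

The slope is the change in y (score) for a one-unit change in x (hours). Here, each extra hour of study is associated with about 7.5 more points.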
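
The fitted values can be checked by hand with the closed-form least-squares formulas for a single feature, where the slope is the sum of cross-deviations divided by the sum of squared deviations of x. The sketch below is plain Python; the names `s_xy` and `s_xx` are illustrative.

```python
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 80]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Sum of cross-deviations and sum of squared deviations of x
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)

slope = s_xy / s_xx                  # 75 / 10
intercept = mean_y - slope * mean_x  # 64 - 7.5 * 3

print("Slope:", slope)          # 7.5
print("Intercept:", intercept)  # 41.5
```

The result agrees with the `LinearRegression` output above up to floating-point rounding.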

### Q6. Cost Function and Gradient Descent

Answer in simple language:

1. What does the **cost function** such as mean squared error measure in a regression model?
2. What does **gradient descent** do to the value of the cost function step by step?
3. Why is using a very **large learning rate** risky when running gradient descent?

**Your answer for Q6:**

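1. MSE measures the average squared difference between the actual and predicted values.
2. Gradient descent updates the parameters (slope and intercept) a little at each step in the direction that reduces the cost, moving toward the minimum error for the regression line.
3. A very large learning rate can overshoot the minimum, so the cost may oscillate or even grow instead of shrinking.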
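
The effect of the learning rate can be seen on the hours and scores data from Q5. The sketch below is a minimal batch gradient descent on the MSE; the helper `run_gradient_descent`, the zero starting point, and the two learning rates (0.01 and 0.5) are choices made for illustration.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5], dtype=float)
scores = np.array([50, 55, 65, 70, 80], dtype=float)

def run_gradient_descent(learning_rate, steps):
    """Return the MSE recorded before each update, starting from slope = intercept = 0."""
    slope, intercept = 0.0, 0.0
    costs = []
    for _ in range(steps):
        error = slope * hours + intercept - scores
        costs.append(float(np.mean(error ** 2)))
        # Gradients of the MSE with respect to slope and intercept
        grad_slope = 2 * np.mean(error * hours)
        grad_intercept = 2 * np.mean(error)
        # Step against the gradient
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return costs

small_lr_costs = run_gradient_descent(learning_rate=0.01, steps=20)
large_lr_costs = run_gradient_descent(learning_rate=0.5, steps=20)

print("lr=0.01 first/last cost:", small_lr_costs[0], small_lr_costs[-1])
print("lr=0.5  first/last cost:", large_lr_costs[0], large_lr_costs[-1])
```

With the small rate the cost falls at every step. With the large rate each update overshoots the minimum by more than the previous one, so the cost grows instead of shrinking.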





### Q7. Regression Metrics Interpretation

A regression model produced the following metrics:

- R squared = 0.75
- MAE = 4.2
- RMSE = 7.6

1. Explain what each of these numbers tells you about the model.
2. Which metric **penalizes big errors more**, and why?


In [None]:
# Optional: try computing these metrics on a toy example
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

true_vals = np.array([10, 12, 15, 20])
pred_vals = np.array([11, 13, 14, 19])

print("R squared:", r2_score(true_vals, pred_vals))
print("MAE:", mean_absolute_error(true_vals, pred_vals))
# mean_squared_error returns the MSE, so take the square root to get the RMSE
print("RMSE:", np.sqrt(mean_squared_error(true_vals, pred_vals)))

R squared: 0.9295154185022027
MAE: 1.0
RMSE: 1.0


**Your answer for Q7:**
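1.
- R squared = 0.75 means the model explains 75% of the variance in the target variable.
- MAE = 4.2 means the predictions are off by 4.2 units on average.
- RMSE = 7.6 is the square root of the mean squared error, in the same units as the target. It is well above the MAE here, which suggests a few large errors.

2. RMSE penalizes big errors more, because each error is squared before averaging, so a large error contributes disproportionately.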
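
A small comparison shows the difference. The two residual lists below are hypothetical and have the same total absolute error; `mae` and `rmse` are illustrative helpers written in plain Python.

```python
import math

def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Two hypothetical sets of residuals with the same total absolute error
even_errors = [1, 1, 1, 1]    # four small misses
spiky_errors = [0, 0, 0, 4]   # one large miss

print(mae(even_errors), rmse(even_errors))    # 1.0 1.0
print(mae(spiky_errors), rmse(spiky_errors))  # 1.0 2.0
```

MAE cannot tell the two cases apart, while RMSE doubles when the same total error is concentrated in a single prediction.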