# Supervised Learning: Regression Models and Performance Metrics — Solution

**Assignment Code:** D-AG-008

**Notes:** This notebook contains the answers and runnable code cells. No personal data has been included.

## Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

**Answer:**

Simple Linear Regression (SLR) is a statistical method that models the relationship between a single independent variable (predictor) `x` and a dependent variable (response) `y` using a straight line. The goal is to find the best-fitting line that predicts `y` from `x`.

**Purpose:**
- To quantify and model the linear relationship between two variables.
- To make predictions of the dependent variable given new values of the independent variable.
- To understand how changes in the predictor affect the response (direction and magnitude).

SLR is widely used for trend estimation, forecasting, and as a simple baseline model in predictive tasks.

## Question 2: What are the key assumptions of Simple Linear Regression?

**Answer:**

1. **Linearity**: The relationship between `x` and the expected value of `y` is linear.
2. **Independence**: Observations (and their errors) are independent of one another.
3. **Homoscedasticity**: The variance of the residuals (errors) is constant across all values of `x`.
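4. **Normality of errors**: The errors are (approximately) normally distributed. This matters mainly for small-sample inference (confidence intervals and hypothesis tests); the least-squares estimates themselves do not require it.
5. **No multicollinearity (multiple regression only)**: With a single predictor, multicollinearity cannot occur in simple linear regression; it becomes a concern only when a model has several correlated predictors.
6. **No influential outliers**: This is a practical condition rather than a formal assumption. Extreme points, especially those with high leverage, can disproportionately affect the fitted line.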

Violations of these assumptions can degrade the model's performance or invalidate statistical inferences. Visual diagnostics and tests (residual plots, Durbin-Watson for independence, Breusch-Pagan for heteroscedasticity, QQ-plot for normality) help check assumptions.
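
The diagnostics mentioned above can be sketched with NumPy alone. This is a minimal illustration on hypothetical synthetic data, not a full diagnostic workflow: it computes the Durbin-Watson statistic directly from its definition and uses a crude comparison of residual spread as a stand-in for a formal heteroscedasticity test such as Breusch-Pagan.

```python
import numpy as np

# Hypothetical synthetic data: a linear signal plus constant-variance noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit a straight line; for deg=1, np.polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Durbin-Watson statistic: lies in [0, 4]; values near 2 suggest
# no first-order autocorrelation in the residuals
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Rough homoscedasticity check (an assumption of this sketch, not a formal test):
# compare residual spread in the lower and upper halves of x
half = x.size // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()

print(f"Mean residual: {residuals.mean():.2e}")
print(f"Durbin-Watson: {dw:.3f}")
print(f"Residual spread ratio (upper/lower half of x): {spread_ratio:.3f}")
```

A spread ratio far from 1, or a Durbin-Watson value far from 2, would be a prompt to inspect residual plots more closely.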

## Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

**Answer:**

The mathematical equation for simple linear regression is:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

where:
- \(y\) is the dependent (response) variable.
- \(x\) is the independent (predictor) variable.
- \(\beta_0\) is the **intercept** (the expected value of `y` when `x = 0`).
- \(\beta_1\) is the **slope** (the change in the expected value of `y` for a one-unit increase in `x`).
- \(\epsilon\) is the **error term**: the deviation of an observed `y` from the true line \(\beta_0 + \beta_1 x\). It is assumed to have mean zero and constant variance \(\sigma^2\). The residual \(y - \hat{y}\) computed from a fitted line is its observable estimate.

## Question 4: Provide a real-world example where simple linear regression can be applied.

**Answer:**

**Example:** Predicting a student's exam score `y` from the number of hours studied `x`.
- Here `x` = hours studied, `y` = exam score.
- Simple linear regression estimates how many points the exam score rises per additional hour studied (the slope) and predicts scores for new study times.

Other examples, each using a single predictor: predicting house price from square footage, estimating sales from advertising spend, or estimating fuel consumption from vehicle weight.
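
The hours-studied example can be sketched as follows. The six data points are made up purely for illustration (they are not real measurements), and `np.polyfit` is used here as one convenient way to fit a straight line.

```python
import numpy as np

# Made-up illustrative data (not real measurements)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
scores = np.array([52.0, 58.0, 61.0, 68.0, 74.0, 77.0])

# For deg=1, np.polyfit returns [slope, intercept]
slope, intercept = np.polyfit(hours, scores, deg=1)

# Predict the score for a hypothetical student who studies 3.5 hours
predicted_score = intercept + slope * 3.5

print(f"Points gained per extra hour (slope): {slope:.2f}")
print(f"Expected score at 0 hours (intercept): {intercept:.2f}")
print(f"Predicted score for 3.5 hours: {predicted_score:.2f}")
```

For this toy data the fitted slope is about 5.14 points per hour, the intercept is 47, and the prediction at 3.5 hours is 65.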

## Question 5: What is the method of least squares in linear regression?

**Answer:**

The method of least squares estimates the regression coefficients (\(\beta_0, \beta_1\)) by minimizing the **sum of squared residuals** (differences between observed and predicted values):

\[ \text{RSS} = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2. \]

Solving for the coefficients that minimize RSS gives closed-form formulas for simple linear regression:

\[ \hat{\beta}_1 = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]

Least squares provides the best linear unbiased estimator (BLUE) when the Gauss–Markov assumptions hold (errors have zero mean, are uncorrelated, and have equal variance).
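
The closed-form formulas above translate almost line for line into NumPy. In this minimal sketch, `fit_slr` is a hypothetical helper name and the five data points are invented for illustration; the result is cross-checked against `np.polyfit`.

```python
import numpy as np

def fit_slr(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form least-squares formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the fitted line through the point (x_bar, y_bar)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Invented toy data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

beta0, beta1 = fit_slr(x, y)
ref_slope, ref_intercept = np.polyfit(x, y, deg=1)

print(f"Closed form: intercept={beta0:.4f}, slope={beta1:.4f}")
print(f"np.polyfit:  intercept={ref_intercept:.4f}, slope={ref_slope:.4f}")
```

Both approaches give a slope of 1.96 and an intercept of 1.10 for this data.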

## Question 6: What is Logistic Regression? How does it differ from Linear Regression?

**Answer:**

**Logistic Regression** is a classification algorithm used to predict a categorical outcome (most commonly binary: 0 or 1). It models the probability that the target belongs to a particular class using the logistic (sigmoid) function applied to a linear combination of inputs:

\[ P(y=1|x) = \sigma(\beta_0 + \beta_1 x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}. \]

**Differences from Linear Regression:**
- **Target type:** Linear regression predicts continuous numeric outcomes; logistic regression predicts probabilities for discrete classes.
- **Model output:** Linear regression gives a direct numeric prediction; logistic regression gives a probability, which is thresholded (commonly at 0.5) to obtain a class label.
- **Loss function:** Linear regression commonly uses least squares; logistic regression uses the logistic (cross-entropy) loss (maximum likelihood estimation).
- **Interpretation:** Each logistic coefficient is the change in the log-odds of `y = 1` per one-unit increase in `x` (so \(e^{\beta_1}\) is an odds ratio); it is not a direct change in the predicted probability or label.
- **Assumptions & diagnostics:** Different assumptions and evaluation metrics (e.g., accuracy, ROC-AUC for classification vs. RMSE, R² for regression).
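
A small sketch of the sigmoid formula and the thresholding step, using scikit-learn's `LogisticRegression` on hypothetical synthetic data. The "true" coefficients and sample size are arbitrary choices for illustration. Note that scikit-learn applies L2 regularization by default, so the learned coefficients are shrunk slightly relative to plain maximum likelihood.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical synthetic binary data generated from a known logistic model
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(300, 1))
true_b0, true_b1 = -0.5, 2.0  # arbitrary illustrative values
p = sigmoid(true_b0 + true_b1 * x.ravel())
y = rng.binomial(1, p)

# Fit with scikit-learn (default settings include L2 regularization, C=1.0)
clf = LogisticRegression()
clf.fit(x, y)

# Predicted probabilities are the sigmoid of the learned linear combination
probs = clf.predict_proba(x)[:, 1]
manual_probs = sigmoid(clf.intercept_[0] + clf.coef_[0, 0] * x.ravel())

# Threshold the probabilities at 0.5 to obtain class labels
labels = (probs >= 0.5).astype(int)
accuracy = np.mean(labels == y)

print(f"Learned intercept: {clf.intercept_[0]:.3f}, slope: {clf.coef_[0, 0]:.3f}")
print(f"Training accuracy at threshold 0.5: {accuracy:.3f}")
```

Unlike a linear regression prediction, every value in `probs` stays between 0 and 1, whatever the value of `x`.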

## Question 7: Name and briefly describe three common evaluation metrics for regression models.

**Answer:**

1. **Mean Squared Error (MSE):** Average of squared differences between predicted and actual values:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2. \]
   - Penalizes larger errors more strongly because of squaring.

2. **Root Mean Squared Error (RMSE):** Square root of the MSE; it has the same units as the target:
\[ \text{RMSE} = \sqrt{\text{MSE}}. \]
   - Easier to interpret than MSE because of units.

3. **Mean Absolute Error (MAE):** Average of absolute differences between predicted and actual values:
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|. \]
   - Less sensitive to outliers than MSE/RMSE.

(Other common metrics: R-squared, Mean Absolute Percentage Error (MAPE), and adjusted R-squared.)
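
The three metrics can be computed directly from their definitions. The sketch below uses four invented values and cross-checks the results against `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Invented toy values for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # same units as the target
mae = np.mean(np.abs(errors))   # average absolute error

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")

# Cross-check against scikit-learn's implementations
sk_mse = mean_squared_error(y_true, y_pred)
sk_mae = mean_absolute_error(y_true, y_pred)
```

Here MSE is 0.875 and MAE is 0.75. The single largest error (1.5) contributes 2.25 of the 3.5 total squared error, which shows how squaring emphasizes large errors.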

## Question 8: What is the purpose of the R-squared metric in regression analysis?

**Answer:**

**R-squared (\(R^2\))** measures the proportion of variance in the dependent variable that is predictable from the independent variable(s). It is defined as:

\[ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}. \]

- **Interpretation:** An \(R^2\) of 0.8 means 80% of the variance in `y` is explained by the model.
- **Range:** Between 0 and 1 for a least-squares model with an intercept evaluated on its own training data. It can be negative on held-out data, or for a model without an intercept, when the predictions are worse than always predicting the mean.
- **Caution:** A high R² does not imply causation, and training-set R² never decreases when predictors are added; use **adjusted R²** to compare models with different numbers of predictors.

R² is useful to quantify goodness-of-fit but should be used alongside other diagnostics (residual analysis, domain knowledge).
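
To make the definition and its range concrete, the sketch below computes R² from RSS and TSS for three sets of invented predictions: a close fit, the mean-only baseline, and a deliberately bad set of predictions. `r_squared` is a hypothetical helper name.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, computed from the definition."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / tss

# Invented toy values
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

good_pred = np.array([1.1, 1.9, 3.2, 3.9, 5.1])   # close to the truth
mean_pred = np.full_like(y_true, y_true.mean())   # always predict the mean
bad_pred = y_true[::-1].copy()                    # reversed: worse than the mean

r2_good = r_squared(y_true, good_pred)
r2_mean = r_squared(y_true, mean_pred)
r2_bad = r_squared(y_true, bad_pred)

print(f"Close fit:     R^2 = {r2_good:.3f}")
print(f"Mean baseline: R^2 = {r2_mean:.3f}")
print(f"Bad fit:       R^2 = {r2_bad:.3f}")
```

The three values are 0.992, 0.0, and -3.0: predicting the mean gives exactly zero, and predictions worse than the mean push R² below zero.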

## Question 9: Fit a simple linear regression using scikit-learn and print the slope and intercept.

**Answer:**

In [None]:
# Create synthetic data, fit the model, and print the learned coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

# Create synthetic linear data
rng = np.random.RandomState(42)
X = 2.5 * rng.rand(100, 1)  # predictor between 0 and 2.5
true_slope = 4.2
true_intercept = 1.5
noise = rng.normal(scale=0.8, size=(100, 1))
y = true_intercept + true_slope * X + noise

# Fit model
model = LinearRegression()
model.fit(X, y.ravel())

slope = model.coef_[0]
intercept = model.intercept_

print("Learned slope (beta1):", slope)
print("Learned intercept (beta0):", intercept)

# Quick sanity checks (asserts)
assert abs(slope - true_slope) < 1.0, "Learned slope deviates too much from expected"
assert abs(intercept - true_intercept) < 1.0, "Learned intercept deviates too much from expected"

# Show a few predictions
X_new = np.array([[0.0], [1.0], [2.0]])
preds = model.predict(X_new)
for xi, p in zip(X_new.ravel(), preds):
    print(f"X={xi:.2f} => predicted y={p:.3f}")


## Question 10: How do you interpret the coefficients in a simple linear regression model?

**Answer:**

- **Intercept (\(\beta_0\))**: The expected value of the response `y` when the predictor `x` is zero. It is the point where the regression line crosses the y-axis, and it has a meaningful interpretation only if `x = 0` lies within (or near) the observed range of the data.

- **Slope (\(\beta_1\))**: The expected change in the response `y` for a one-unit increase in the predictor `x`. If \(\beta_1\) is positive, `y` tends to increase as `x` increases; if negative, `y` tends to decrease as `x` increases.

**Example interpretation:** If \(\beta_1 = 4.2\), then for each additional unit increase in `x`, the expected value of `y` increases by 4.2 units on average.

**Notes & caveats:**
- Coefficients show association, not causation.
- If the predictor is standardized (zero mean, unit standard deviation), the slope is the expected change in `y` per one-standard-deviation increase in `x`.
- In the presence of multiple predictors, interpretation of one coefficient assumes other predictors are held constant.
- Large standard errors around coefficients mean less reliable estimates; use confidence intervals or hypothesis tests for inference.
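
The note on standardized predictors can be checked numerically. In this sketch on hypothetical synthetic data, standardizing `x` multiplies the fitted slope by the standard deviation of `x` and moves the intercept to the mean of `y`; the fitted line itself is unchanged.

```python
import numpy as np

# Hypothetical synthetic data with arbitrary illustrative coefficients
rng = np.random.default_rng(1)
x = rng.uniform(0, 50, size=200)
y = 3.0 + 0.4 * x + rng.normal(scale=2.0, size=200)

# Fit on the raw predictor: slope is the change in y per one unit of x
slope_raw, intercept_raw = np.polyfit(x, y, deg=1)

# Fit on the standardized predictor: slope is the change in y per one SD of x
x_std = (x - x.mean()) / x.std()
slope_std, intercept_std = np.polyfit(x_std, y, deg=1)

print(f"Raw slope (per unit of x):       {slope_raw:.4f}")
print(f"Standardized slope (per SD of x): {slope_std:.4f}")
print(f"Raw slope * SD of x:              {slope_raw * x.std():.4f}")
print(f"Standardized intercept vs mean y: {intercept_std:.4f} vs {y.mean():.4f}")
```

The two slopes describe the same relationship in different units, so comparing coefficient sizes is meaningful only once the units are known.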

### How to run this notebook

1. Open this notebook in Google Colab or Jupyter.
2. Run each cell in order. The code cell under Question 9 will run a quick synthetic example using scikit-learn.
3. If running in a fresh environment, ensure `scikit-learn` and `numpy` are installed. In a Colab or Jupyter cell you can install them with `%pip install scikit-learn numpy`; from a terminal, use `pip install scikit-learn numpy`.

---

**End of assignment solution.**