# Supervised Learning: Regression Models and Performance Metrics

---

## Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

**Simple Linear Regression (SLR)** is a statistical method used to model and analyze the relationship between two continuous variables.

It involves:
* An **independent variable** (predictor or explanatory variable), denoted as X.
* A **dependent variable** (response or outcome variable), denoted as Y.

### Purpose of SLR

The primary purpose of Simple Linear Regression is to understand and quantify the relationship between the independent and dependent variables. Specifically, it aims to:
1.  **Model the Relationship**: Find the best-fitting straight line (called the regression line) that describes how the dependent variable `Y` changes as the independent variable `X` changes.
2.  **Make Predictions**: Use the established linear relationship to predict the value of the dependent variable (`Y`) for a given value of the independent variable (`X`).

---

## Question 2: What are the key assumptions of Simple Linear Regression?

For a Simple Linear Regression model to be accurate and reliable, several key assumptions about the data must be met:

1.  **Linearity**: The relationship between the independent variable (X) and the dependent variable (Y) is linear: the expected value of Y changes by a constant amount for each one-unit change in X.

2.  **Independence**: The residuals (the differences between the actual and predicted values) are independent of one another, so the error of one observation is not influenced by the error of another. This assumption is most often violated in time-series data, where consecutive observations tend to have correlated errors.

3.  **Homoscedasticity** (Constant Variance): The variance of the residuals is constant for all values of X. In simpler terms, the spread of the errors should be roughly the same across all points along the regression line.

4.  **Normality of Residuals**: The residuals of the model are normally distributed. This assumption is important for conducting hypothesis tests and constructing confidence intervals.



---

## Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

The mathematical equation for a simple linear regression model is:

$$ Y = \beta_0 + \beta_1X + \epsilon $$

Each term in the equation represents:

* **Y**: The **dependent variable**. This is the outcome or the variable you are trying to predict.

* **X**: The **independent variable**. This is the predictor or the variable you are using to make the prediction.

* **$\beta_0$ (Beta 0)**: The **y-intercept** of the regression line. It is the predicted value of `Y` when `X` is equal to 0.

* **$\beta_1$ (Beta 1)**: The **slope** of the regression line. It represents the change in the dependent variable `Y` for every one-unit increase in the independent variable `X`.

* **$\epsilon$ (Epsilon)**: The **error term**. This represents the random variability in the data that is not explained by the model. It accounts for the difference between the actual observed value of `Y` and the value predicted by the line.
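To make the roles of the terms concrete, the following sketch generates data from the equation above. The parameter values, the noise level, and the variable names are all assumptions chosen for illustration; they do not come from any real dataset.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "true" parameters, chosen only for illustration.
beta_0 = 35.0   # intercept: expected Y when X = 0
beta_1 = 5.0    # slope: expected change in Y per one-unit increase in X
noise_sd = 3.0  # standard deviation of the error term

x_values = [1, 2, 3, 4, 5, 6, 7, 8]
errors = [random.gauss(0.0, noise_sd) for _ in x_values]  # epsilon

# Each observation is the point on the line plus its own random error.
y_values = [beta_0 + beta_1 * x + eps for x, eps in zip(x_values, errors)]

for x, eps, y in zip(x_values, errors, y_values):
    line_value = beta_0 + beta_1 * x
    print(f"X={x}  line={line_value:5.1f}  epsilon={eps:+6.2f}  Y={y:6.2f}")
```

Each printed `Y` differs from the value on the line by exactly its `epsilon`, which is the part of the observation the line does not explain.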

---

## Question 4: Provide a real-world example where simple linear regression can be applied.

A classic real-world example of simple linear regression is **predicting a student's final exam score based on the number of hours they studied**.

* **Independent Variable (X)**: Number of hours studied.
* **Dependent Variable (Y)**: Final exam score (e.g., out of 100).

**Application:**

A university could collect data on students' study hours and their corresponding exam scores. By fitting a simple linear regression model, they could establish a relationship, such as: `Score = 35 + 5 * (Hours Studied)`. 

This model could then be used to:
- **Predict performance**: Estimate the expected score for a student who studies a certain number of hours.
- **Offer guidance**: Advise students that, on average, each additional hour of study is associated with a 5-point higher score. (The model describes an association in the data, not a guaranteed causal effect.)
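As a small sketch, the illustrative equation `Score = 35 + 5 * (Hours Studied)` can be applied directly. The function name `predicted_score` is hypothetical, and the coefficients are the example values from the text, not estimates from real data.

```python
def predicted_score(hours_studied: float) -> float:
    """Apply the illustrative equation Score = 35 + 5 * (Hours Studied)."""
    return 35 + 5 * hours_studied


for hours in (0, 2, 4, 6):
    print(f"{hours} hours -> predicted score {predicted_score(hours)}")

# Difference between two students who differ by one hour of study.
print(predicted_score(5) - predicted_score(4))  # 5
```

Note that `predicted_score(14)` returns 105, which is above the maximum of 100. A fitted line should only be trusted within the range of study hours that was actually observed.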

---

## Question 5: What is the method of least squares in linear regression?

The **method of least squares** is the standard technique used to determine the best-fitting line for a dataset in linear regression.

Its main goal is to find the values for the intercept ($\beta_0$) and the slope ($\beta_1$) that **minimize the sum of the squared residuals**.

### How It Works:
1.  A **residual** is the vertical distance between an actual data point and the predicted point on the regression line. It represents the prediction error for that data point.
2.  For each data point, this residual is calculated and then squared.
3.  The method of least squares finds the unique line that makes the sum of all these squared residuals as small as possible, hence the name "least squares."

By minimizing this sum, it finds the line with the smallest total squared vertical distance to the data points.
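For simple linear regression the minimization has a well-known closed-form solution, where $\bar{x}$ and $\bar{y}$ are the sample means:

$$ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$

The sketch below applies these formulas to a small made-up dataset and then checks that shifting the fitted line makes the sum of squared residuals larger. The function names are hypothetical helpers written for this example.

```python
def least_squares_fit(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the data and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Illustrative data (hours studied vs. exam score); not from a real study.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [50, 55, 68, 72, 75, 84, 90, 95]

intercept, slope = least_squares_fit(x, y)
best = sum_squared_residuals(x, y, intercept, slope)

# Moving the line up by one point should increase the sum.
nudged = sum_squared_residuals(x, y, intercept + 1.0, slope)

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"sum of squared residuals: fitted = {best:.2f}, nudged = {nudged:.2f}")
```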



---

## Question 6: What is Logistic Regression? How does it differ from Linear Regression?

**Logistic Regression** is a supervised learning algorithm used for **classification** problems, where the goal is to predict a categorical outcome. Unlike linear regression, which predicts a continuous value, logistic regression predicts the probability that an observation belongs to a particular category (e.g., the probability of an email being spam or not spam).

### Key Differences from Linear Regression:

| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| **Primary Use Case** | Regression (predicting continuous values) | Classification (predicting discrete categories) |
| **Output** | A continuous numerical value (e.g., 150.7, -23.5) | A probability between 0 and 1 |
| **Relationship** | Models a linear relationship | Models the probability using a logistic (sigmoid) function |
| **Equation** | $Y = \beta_0 + \beta_1X + \epsilon$ | $P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X)}}$ |
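The difference in output can be seen by evaluating both equations for the same inputs. In this sketch the coefficients are hypothetical, and the 0.5 cut-off used to turn a probability into a class label is a common convention rather than part of the model itself.

```python
import math


def linear_output(x, beta_0, beta_1):
    """Linear regression prediction: unbounded continuous value."""
    return beta_0 + beta_1 * x


def logistic_probability(x, beta_0, beta_1):
    """Logistic regression prediction: probability that Y = 1."""
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))


# Hypothetical coefficients, chosen only for illustration.
beta_0, beta_1 = -4.0, 1.0

for x in (-10, 0, 4, 10, 20):
    z = linear_output(x, beta_0, beta_1)
    p = logistic_probability(x, beta_0, beta_1)
    label = 1 if p >= 0.5 else 0  # assumed decision threshold of 0.5
    print(f"x={x:>3}  linear={z:6.1f}  probability={p:.4f}  class={label}")
```

The linear output grows without bound as `x` changes, while the logistic output stays strictly between 0 and 1 and equals 0.5 exactly where the linear part is zero.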


---

## Question 7: Name and briefly describe three common evaluation metrics for regression models.

Here are three common metrics used to evaluate the performance of regression models:

1.  **Mean Absolute Error (MAE)**
    * **Description**: MAE calculates the average of the absolute differences between the actual and predicted values.
    * **Interpretation**: It gives a straightforward idea of the average magnitude of the errors in the predictions, in the same units as the target variable.

2.  **Mean Squared Error (MSE)**
    * **Description**: MSE calculates the average of the squared differences between the actual and predicted values.
    * **Interpretation**: By squaring the errors, it penalizes larger errors more heavily than smaller ones, which is useful when large errors are particularly undesirable. Note that MSE is expressed in squared units of the target variable.

3.  **R-squared (R²)**
    * **Description**: Also known as the coefficient of determination, R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
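    * **Interpretation**: It measures how well the model explains the variability of the data. For a least-squares linear model with an intercept, evaluated on its training data, values range from 0 to 1 (0% to 100%); on new data, or for a model that fits worse than simply predicting the mean, R² can be negative.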
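The three metrics can be computed by hand in a few lines. The actual and predicted values below are made up for illustration, and the function names are hypothetical helpers rather than a library API.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical exam scores and model predictions.
actual = [50, 55, 68, 72]
predicted = [52, 54, 65, 75]

print(mean_absolute_error(actual, predicted))  # 2.25
print(mean_squared_error(actual, predicted))   # 5.75
print(round(r_squared(actual, predicted), 4))
```

The errors here are -2, 1, 3 and -3, so MAE is 9 / 4 = 2.25 points while MSE is 23 / 4 = 5.75 squared points: the two errors of size 3 dominate the MSE.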

---

## Question 8: What is the purpose of the R-squared metric in regression analysis?

The primary purpose of the **R-squared (R²)** metric is to measure the **goodness of fit** of a regression model.

Specifically, R-squared quantifies the proportion (or percentage) of the total variance in the dependent variable (Y) that can be explained by the variation in the independent variable(s) (X) included in the model.

**In simple terms:**
- An R-squared of **0.85** means that **85%** of the variance in the dependent variable is explained by the independent variable(s).
- An R-squared of **0** means that the model explains **none** of the variability.
- An R-squared of **1** means that the model explains **all** of the variability.

It helps analysts understand how well their model is capturing the underlying patterns in the data. A higher R-squared generally indicates a better fit, although it should be used in conjunction with other metrics for a complete evaluation.
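These interpretations follow from the standard definition, written here with $\hat{y}_i$ for the model's prediction and $\bar{y}$ for the mean of the observed values:

$$ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

If every prediction is exact, $SS_{\text{res}} = 0$ and $R^2 = 1$. If the model does no better than always predicting $\bar{y}$, then $SS_{\text{res}} = SS_{\text{tot}}$ and $R^2 = 0$.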

---

## Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample Data: Let's use the example of hours studied vs. exam scores
# X = Hours Studied (Independent Variable)
# y = Exam Score (Dependent Variable)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)  # Needs to be a 2D array for scikit-learn
y = np.array([50, 55, 68, 72, 75, 84, 90, 95])

# 1. Create a Linear Regression model object
model = LinearRegression()

# 2. Fit the model to the data
model.fit(X, y)

# 3. Get the slope and intercept from the fitted model
slope = model.coef_[0]
intercept = model.intercept_

# 4. Print the results
print(f"The equation for the regression line is: y = {intercept:.2f} + {slope:.2f}x")
print("--------------------------------------------------")
print(f"Intercept (β0): {intercept:.2f}")
print(f"Slope (β1):     {slope:.2f}")
```

Output:

```text
The equation for the regression line is: y = 44.64 + 6.44x
--------------------------------------------------
Intercept (β0): 44.64
Slope (β1):     6.44
```


---

## Question 10: How do you interpret the coefficients in a simple linear regression model?

In a simple linear regression model, there are two coefficients to interpret: the intercept ($\beta_0$) and the slope ($\beta_1$).

### 1. Interpreting the Intercept ($\beta_0$)

The **intercept** represents the **predicted value of the dependent variable (Y) when the independent variable (X) is 0**.

* **Example**: In our `hours studied vs. exam score` model, if the intercept is 44.64, it means that a student who studies for **0 hours** is predicted to get a score of **44.64**.
* **Caution**: The interpretation is only meaningful if X=0 is a realistic and relevant value within the context of the data. If X can never be zero (e.g., if X is a person's weight), the intercept serves mainly as a mathematical baseline for the line.

### 2. Interpreting the Slope ($\beta_1$)

The **slope** represents the **estimated change in the dependent variable (Y) for a one-unit increase in the independent variable (X)**.

* **Example**: If the slope for our model is 6.44, it means that for **each additional hour** a student studies, their exam score is predicted to **increase by 6.44 points**, on average.
* The sign of the slope indicates the direction of the relationship:
    * A **positive slope** means Y increases as X increases.
    * A **negative slope** means Y decreases as X increases.
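
Both interpretations can be read directly off the fitted line $\hat{Y}(x) = \beta_0 + \beta_1 x$:

$$ \hat{Y}(0) = \beta_0, \qquad \hat{Y}(x + 1) - \hat{Y}(x) = \bigl(\beta_0 + \beta_1 (x + 1)\bigr) - \bigl(\beta_0 + \beta_1 x\bigr) = \beta_1 $$

The second identity holds for any starting value $x$, which is why the slope can be described as the predicted change in Y per one-unit increase in X without saying where on the line that increase occurs.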