Since each question is stated to carry 20 marks, I will provide a detailed and comprehensive answer for each of the ten questions on regression analysis.

-----

## Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

**Simple Linear Regression (SLR)** is a statistical method used to model the linear relationship between a **single independent variable** (also called the predictor, explanatory variable, or feature, denoted as $X$) and a **single dependent variable** (also called the response, outcome, or target variable, denoted as $Y$).

The model assumes that the relationship between $X$ and $Y$ can be approximated by a straight line, which is why it is "linear" and "simple" (because it only involves one predictor).

### Purpose of Simple Linear Regression

The primary purposes of Simple Linear Regression are:

1.  **Modeling the Relationship:** To quantify the nature and strength of the linear relationship between the two variables. It helps determine *how* changes in the independent variable are associated with changes in the dependent variable.
2.  **Prediction:** Once the model is established, it can be used to **predict** the value of the dependent variable ($Y$) for a new, unseen value of the independent variable ($X$). This is the most common application, such as predicting house prices based on size.
3.  **Inference and Explanation:** To understand the **impact** of the independent variable on the dependent variable. The model coefficients (slope and intercept) provide interpretable information about the direction and magnitude of the relationship. For example, it can tell us the average increase in sales for every one-unit increase in advertising spend.
4.  **Hypothesis Testing:** To test statistical hypotheses about the relationship, such as whether a relationship between $X$ and $Y$ even exists (i.e., whether the slope is significantly different from zero).

-----

## Question 2: What are the key assumptions of Simple Linear Regression?

For the results of a Simple Linear Regression analysis to be valid and reliable, several key assumptions about the data and the model's errors (or residuals) must be met. These are often summarized by the acronym **LINE**:

| Assumption | Description |
| :--- | :--- |
| **L**inearity | The relationship between the independent variable ($X$) and the dependent variable ($Y$) must be **linear**. This means the model correctly specifies the functional form of the relationship as a straight line. |
| **I**ndependence (of Errors) | The residuals (errors) must be **independent** of each other. In other words, the error for one observation should not be related to the error for any other observation. This is often violated in time-series data. |
| **N**ormality (of Errors) | The residuals must be **normally distributed**. For any fixed value of $X$, the distribution of $Y$ values (and thus the errors) should follow a normal distribution. This assumption is more critical for forming confidence intervals and performing hypothesis tests, especially with smaller sample sizes. |
| **E**qual Variance (Homoscedasticity) | The variance of the residuals must be **constant** across all levels of the independent variable $X$. This is called **homoscedasticity** (constant variance). The opposite, where variance changes with $X$, is called heteroscedasticity. |

### Other Important Considerations:

  * **No or Little Multicollinearity** (not strictly applicable for Simple Linear Regression as it only has one $X$, but crucial for Multiple Linear Regression).
  * **$X$ is measured without error** (often overlooked, but technically a requirement).
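The independence assumption can be checked numerically when the observations have a natural order. The sketch below computes the Durbin-Watson statistic on the residuals of a least-squares line; values near 2 suggest no first-order correlation between consecutive residuals, while values toward 0 or 4 suggest positive or negative correlation. The sample data and the assumption that it is recorded in time order are illustrative only, and six points are far too few for a real diagnosis.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: close to 2 when consecutive residuals are uncorrelated."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Illustrative sample, assumed (for this sketch) to be recorded in time order.
xs = [10, 20, 30, 40, 50, 60]
ys = [30, 55, 70, 85, 110, 135]

# Least-squares line from the closed-form formulas.
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Residuals are observed values minus fitted values.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw:.3f}")
```

The statistic always falls between 0 and 4. Plotting residuals against $X$ remains the usual first step for judging linearity and equal variance.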

-----

## Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

The mathematical equation for the simple linear regression model is:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

Where:

| Term | Name | Explanation |
| :--- | :--- | :--- |
| $\mathbf{Y}$ | **Dependent Variable** (Response) | The observed variable we are trying to predict or explain. The *predicted* value from the fitted line is written $\hat{Y} = \beta_0 + \beta_1 X$ and contains no error term. |
| $\mathbf{\beta_0}$ (Beta-naught) | **Y-Intercept** (Constant Term) | The expected value of the dependent variable ($Y$) when the independent variable ($X$) is **zero**. It's where the regression line crosses the Y-axis. |
| $\mathbf{\beta_1}$ (Beta-one) | **Slope** (Regression Coefficient) | The expected change in the dependent variable ($Y$) for every **one-unit increase** in the independent variable ($X$). It defines the steepness and direction (positive or negative) of the regression line. |
| $\mathbf{X}$ | **Independent Variable** (Predictor) | The single variable used to predict the value of the dependent variable. |
| $\mathbf{\epsilon}$ (Epsilon) | **Error Term** (Random Disturbance) | Represents the **unexplained variation** in $Y$ not accounted for by the linear relationship with $X$. It is the difference between the observed $Y$ and the true line $\beta_0 + \beta_1 X$; its observable counterpart, the **residual**, is $Y - \hat{Y}$. It captures measurement error, omitted variables, and random noise. |

-----

## Question 4: Provide a real-world example where simple linear regression can be applied.

A classic and intuitive real-world example where Simple Linear Regression (SLR) can be applied is the relationship between **Advertising Spend and Sales Revenue**.

| Variable Type | Variable Name | Description |
| :--- | :--- | :--- |
| **Independent Variable ($X$)** | **Advertising Spend** | The total amount of money spent on advertising (e.g., in dollars or thousands of dollars) over a defined period. |
| **Dependent Variable ($Y$)** | **Sales Revenue** | The total revenue generated from sales (e.g., in dollars or thousands of dollars) over the same period. |

### Application:

A business wants to understand how their investment in advertising impacts the resulting sales revenue. They can collect data pairs for many periods, such as:

  * **(Advertising Spend, Sales Revenue):** $(\$10,000, \$150,000)$, $(\$25,000, \$210,000)$, etc.

SLR is used to fit a line to this data, resulting in an equation like:

$$\text{Sales Revenue} = \beta_0 + \beta_1 \times \text{Advertising Spend}$$

### Interpretation:

  * The **$\mathbf{\beta_0}$ (Intercept)** would represent the estimated **baseline sales** when there is **zero** advertising spend.
  * The **$\mathbf{\beta_1}$ (Slope)** would represent the estimated **average increase in Sales Revenue** for every **one-dollar increase** in Advertising Spend. For example, if $\beta_1 = 3.5$, it suggests that every extra dollar spent on advertising yields an average of $\$3.50$ in sales revenue.

This model allows the business to **predict** sales for a given future advertising budget and assess the **return on investment (ROI)** of their advertising efforts.

-----

## Question 5: What is the method of least squares in linear regression?

The **Method of Least Squares** (often called Ordinary Least Squares or OLS) is the standard technique used in Simple (and Multiple) Linear Regression to **determine the best-fitting line** through a set of data points.

### Core Principle:

The goal of OLS is to find the values for the model coefficients ($\beta_0$ and $\beta_1$) that **minimize the sum of the squared differences** between the **actual observed values** of the dependent variable ($Y$) and the **values predicted** by the regression line ($\hat{Y}$).

### Key Concepts:

1.  **Residuals (Errors):** The difference between the actual $Y$ value and the predicted $\hat{Y}$ value for each data point is called the **residual** ($e_i = Y_i - \hat{Y}_i$). Residuals are the observable estimates of the unobserved error terms $\epsilon_i$.

2.  **Minimizing the Sum of Squared Residuals (SSR or SSE):** The OLS method doesn't minimize the sum of the residuals directly (because positive and negative errors would cancel each other out). Instead, it minimizes the **Sum of Squared Residuals (SSR)**, which is mathematically expressed as:

    $$\text{Minimize} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

      * Squaring the residuals ensures that all differences are positive, so they don't cancel out.
      * It also penalizes larger errors more heavily than smaller errors, which pulls the fitted line toward points that lie far from it and makes OLS sensitive to outliers.

By minimizing this quantity, OLS yields the line that is the "best" fit in the least-squares sense: no other straight line has a smaller sum of squared residuals for the same data.
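For a single predictor, the minimization has a well-known closed-form solution: $\beta_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$ and $\beta_0 = \bar{Y} - \beta_1 \bar{X}$. The following is a minimal sketch of those formulas in plain Python; the function names and the sample data are illustrative choices, not part of any library.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared differences between observed and fitted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative sample data.
xs = [10, 20, 30, 40, 50, 60]
ys = [30, 55, 70, 85, 110, 135]

intercept, slope = fit_simple_ols(xs, ys)
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")

# Any other line, here one with a slightly steeper slope, has a larger SSE.
best = sum_squared_residuals(xs, ys, intercept, slope)
nudged = sum_squared_residuals(xs, ys, intercept, slope + 0.1)
print(f"SSE at OLS solution: {best:.3f}, SSE with nudged slope: {nudged:.3f}")
```

Comparing the two SSE values shows the defining property directly: moving the slope away from the OLS estimate can only increase the sum of squared residuals.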

-----

## Question 6: What is Logistic Regression? How does it differ from Linear Regression?

### What is Logistic Regression?

Despite its name, **Logistic Regression** is a statistical model used for **classification** rather than for predicting a continuous quantity. It estimates the **probability** that an instance belongs to a particular class, typically in binary classification (two classes: 0 or 1), by using the logistic function (or **sigmoid function**) to map a linear combination of the predictor variables to a value between 0 and 1.

### Key Differences from Linear Regression

The fundamental differences between Logistic Regression and Simple Linear Regression (or Multiple Linear Regression) are centered on the **type of dependent variable** and the **modeling function**:

| Feature | Simple/Multiple Linear Regression | Logistic Regression |
| :--- | :--- | :--- |
| **Problem Type** | **Regression** (Predicting a continuous value) | **Classification** (Predicting a category/class) |
| **Dependent Variable ($Y$)** | **Continuous and numerical** (e.g., salary, height, temperature). | **Categorical** (e.g., Binary: Yes/No, 0/1, Spam/Not Spam, Disease/No Disease). |
| **Output/Prediction** | A **continuous value** (any real number, $\hat{Y} \in (-\infty, \infty)$). | A **probability** value between 0 and 1, which is then mapped to a class label (0 or 1). |
| **Modeling Function** | **Linear Equation**: $Y = \beta_0 + \beta_1 X_1 + \dots$ | **Sigmoid/Logistic Function**: $P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots)}}$. |
| **Optimization** | **Ordinary Least Squares (OLS)** (minimizes the sum of squared errors). | **Maximum Likelihood Estimation (MLE)** (maximizes the likelihood of observing the actual data). |

In short, **Linear Regression** predicts a quantity, while **Logistic Regression** predicts a probability and a category.
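To make the contrast concrete, the sketch below applies the sigmoid to a linear score and then thresholds the resulting probability at 0.5. The coefficient values are hypothetical and chosen only for illustration; in practice they would be estimated by maximum likelihood.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, beta0, beta1):
    """P(Y = 1) for a single predictor x."""
    return sigmoid(beta0 + beta1 * x)


def predict_class(x, beta0, beta1, threshold=0.5):
    """Convert the probability into a 0/1 class label."""
    return 1 if predict_probability(x, beta0, beta1) >= threshold else 0


# Hypothetical coefficients, not fitted to any data.
beta0, beta1 = -4.0, 0.1

for x in (10, 50, 70):
    p = predict_probability(x, beta0, beta1)
    print(f"x = {x}: P(Y=1) = {p:.3f}, class = {predict_class(x, beta0, beta1)}")
```

The linear part $\beta_0 + \beta_1 X$ is unbounded, exactly as in linear regression; only the sigmoid wrapper keeps the output inside the 0 to 1 range.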

-----

## Question 7: Name and briefly describe three common evaluation metrics for regression models.

Evaluating the performance of a regression model involves quantifying how well its predictions ($\hat{Y}$) match the actual observed values ($Y$). Three common metrics are:

### 1\. Mean Absolute Error (MAE)

  * **Description:** MAE is the **average of the absolute differences** between the predicted values and the actual values. It measures the average magnitude of the errors in a set of predictions, **without considering their direction**.
  * **Formula:** $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i|$
  * **Interpretation:** The MAE is in the **same units** as the dependent variable ($Y$). Because the errors are not squared, it is less sensitive to outliers than MSE or RMSE. A lower MAE indicates a better-fitting model.

### 2\. Mean Squared Error (MSE)

  * **Description:** MSE is the **average of the squared differences** between the predicted and actual values. It measures the average squared deviation between the estimate and the actual value.
  * **Formula:** $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
  * **Interpretation:** Because errors are squared, MSE gives **greater weight to larger errors** (outliers), making it a good metric if large errors are particularly undesirable. The units of MSE are the square of the units of $Y$, which can sometimes make it less intuitive to interpret than MAE.

### 3\. Root Mean Squared Error (RMSE)

  * **Description:** RMSE is the **square root of the Mean Squared Error (MSE)**.
  * **Formula:** $\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}$
  * **Interpretation:** RMSE is often preferred over MSE because it restores the error to the **original units** of the dependent variable ($Y$), making it directly comparable to $Y$ and more interpretable than MSE. Like MSE, it still emphasizes large errors due to the initial squaring operation. A lower RMSE indicates better model performance.
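The three formulas can be computed side by side. This is a small sketch in plain Python with made-up actual and predicted values; the last prediction is deliberately far off to show how squaring inflates MSE and RMSE relative to MAE.

```python
import math


def regression_metrics(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    n = len(actual)
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    return mae, mse, rmse


# Made-up values; the errors are 1, 0, -1 and -4.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 13.0]

mae, mse, rmse = regression_metrics(actual, predicted)
print(f"MAE = {mae:.4f}, MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

Here MAE is 1.5 while RMSE is about 2.12: the single error of 4 contributes 16 of the 18 total squared-error units, so it dominates the squared metrics.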

-----

## Question 8: What is the purpose of the R-squared metric in regression analysis?

The **R-squared** metric, also known as the **Coefficient of Determination** ($R^2$), is a key evaluation metric in regression analysis used to assess the **goodness-of-fit** of the model.

### Purpose:

The primary purpose of $R^2$ is to determine the **proportion (percentage) of the variance** in the dependent variable ($Y$) that is **predictable** from the independent variable(s) ($X$) in the model.

### Interpretation:

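  * For a least-squares fit that includes an intercept, evaluated on the data used to fit it, $R^2$ lies between **0 and 1** (or 0% and 100%). On new data, or for a model without an intercept, it can be negative.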
  * **$R^2 = 0$:** Indicates that the model explains **none** of the variability of the response data around its mean. The predictors are not useful.
  * **$R^2 = 1$:** Indicates that the model explains **all** of the variability of the response data around its mean. The predictions perfectly fit the data.
  * **$R^2 = 0.75$ (or 75%):** Means that **75%** of the variation in the dependent variable ($Y$) is explained by the independent variable(s) ($X$) in the model, and the remaining 25% is unexplained (attributed to the error term $\epsilon$).

### Formula and Calculation Concept:

$R^2$ is calculated as one minus the ratio of the unexplained variation to the total variation:

$$R^2 = 1 - \frac{\text{Unexplained Variation (Sum of Squared Errors, SSE)}}{\text{Total Variation (Total Sum of Squares, SST)}}$$

In essence, $R^2$ measures how much better the regression line is at predicting $Y$ compared to simply using the mean of $Y$. **A higher $R^2$ generally indicates a better fit for the model.**
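The comparison against the mean can be sketched directly from the formula. The data are illustrative, and the coefficients are assumed values (rounded least-squares estimates for this sample) rather than something computed inside the block.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(actual) / len(actual)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    sst = sum((y - mean_y) ** 2 for y in actual)
    return 1 - sse / sst


# Illustrative sample data.
xs = [10, 20, 30, 40, 50, 60]
ys = [30, 55, 70, 85, 110, 135]

# Assumed coefficients: rounded least-squares estimates for this sample.
intercept, slope = 10.3333, 2.0143
fitted = [intercept + slope * x for x in xs]

# Baseline "model" that always predicts the mean of Y.
mean_only = [sum(ys) / len(ys)] * len(ys)

r2_fitted = r_squared(ys, fitted)
r2_mean = r_squared(ys, mean_only)
print(f"R^2 of fitted line: {r2_fitted:.4f}")
print(f"R^2 of mean-only baseline: {r2_mean:.4f}")
```

The mean-only baseline has SSE equal to SST, so its $R^2$ is 0; the fitted line leaves only about 1% of the variation unexplained for this sample.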

-----

## Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

The following Python code uses the `LinearRegression` model from `sklearn.linear_model` to fit a simple linear regression model using some sample data, and then prints the calculated slope ($\beta_1$) and intercept ($\beta_0$).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1. Create Sample Data (X and Y)
# Independent variable (X): Advertising Spend (must be 2D for scikit-learn)
X = np.array([10, 20, 30, 40, 50, 60]).reshape(-1, 1)  # Reshape to (n_samples, n_features)

# Dependent variable (Y): Sales Revenue
Y = np.array([30, 55, 70, 85, 110, 135])

# 2. Create the Simple Linear Regression model
model = LinearRegression()

# 3. Fit the model to the data
model.fit(X, Y)

# 4. Extract and print the slope (coefficient) and intercept
slope = model.coef_[0]
intercept = model.intercept_

# Print the results
print("--- Simple Linear Regression Results ---")
print(f"Independent Variable (X) used: {X.flatten()}")
print(f"Dependent Variable (Y) used: {Y}")
print(f"Calculated Intercept (β₀): {intercept:.4f}")
print(f"Calculated Slope (β₁): {slope:.4f}")
print(f"Regression Equation: Y_hat = {intercept:.4f} + {slope:.4f} * X")
```

### Output:

```
--- Simple Linear Regression Results ---
Independent Variable (X) used: [10 20 30 40 50 60]
Dependent Variable (Y) used: [ 30  55  70  85 110 135]
Calculated Intercept (β₀): 10.3333
Calculated Slope (β₁): 2.0143
Regression Equation: Y_hat = 10.3333 + 2.0143 * X
```

-----

## Question 10: How do you interpret the coefficients in a simple linear regression model?

In a Simple Linear Regression model, $\hat{Y} = \beta_0 + \beta_1 X$, there are two primary coefficients to interpret: the **Intercept** ($\beta_0$) and the **Slope** ($\beta_1$).

### 1\. Interpretation of the Slope ($\mathbf{\beta_1}$)

The slope, or the regression coefficient for the independent variable $X$, is the **most crucial** part of the interpretation.

  * **Definition:** $\beta_1$ represents the **estimated average change** in the dependent variable ($Y$) for every **one-unit increase** in the independent variable ($X$). Because the model has only one predictor, there are no other variables to hold constant.
  * **Direction and Magnitude:**
      * If $\beta_1$ is **positive** ($>0$), there is a positive relationship: as $X$ increases, $Y$ also increases.
      * If $\beta_1$ is **negative** ($<0$), there is a negative relationship: as $X$ increases, $Y$ decreases.
      * The magnitude of $\beta_1$ is the size of that change per unit of $X$. It depends on the units of $X$ and $Y$, so it does not by itself measure how strong the relationship is.

**Example (from Q9):** The fitted equation is approximately $\text{Sales} = 10.33 + 2.01 \times \text{Advertising Spend}$, so the slope is $\beta_1 \approx 2.01$.

  * **Interpretation:** Assuming both variables are measured in thousands of dollars, for every **one thousand dollar increase** in Advertising Spend, the Sales Revenue is estimated to **increase by about 2.01 thousand dollars**, on average.

### 2\. Interpretation of the Intercept ($\mathbf{\beta_0}$)

The intercept is the value of the dependent variable when the independent variable is zero.

  * **Definition:** $\beta_0$ represents the **estimated average value** of the dependent variable ($\hat{Y}$) when the independent variable ($X$) is **equal to zero**.
  * **Context is Key:**
      * If $X=0$ is a **meaningful** or practically possible value (e.g., zero temperature, zero advertising spend), the interpretation is straightforward.
      * If $X=0$ is **meaningless** or outside the range of the observed data (e.g., a height of zero, or an IQ score of zero), the intercept may not have a practical, real-world interpretation and simply serves as the necessary anchor for the line of best fit.

**Example (from Q9):** With the same fitted equation, the intercept is $\beta_0 \approx 10.33$.

  * **Interpretation:** The estimated average Sales Revenue is about **10.33 thousand dollars** when the Advertising Spend is **zero**. This is the baseline sales level, although $X = 0$ lies below the observed spend range of 10 to 60, so the figure is an extrapolation.
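Both interpretations can be read off a fitted line numerically. The sketch below hard-codes the rounded least-squares estimates for the Q9 sample data (intercept about 10.33, slope about 2.01); the function name and the assumption that both variables are in thousands of dollars are illustrative.

```python
# Assumed coefficients: rounded least-squares estimates for the Q9 sample data.
INTERCEPT = 10.3333
SLOPE = 2.0143


def predict_sales(ad_spend):
    """Predicted sales (assumed thousands of dollars) for a given ad spend."""
    return INTERCEPT + SLOPE * ad_spend


# Intercept: the prediction at zero advertising spend.
baseline = predict_sales(0)

# Slope: the change in the prediction for a one-unit increase in spend.
at_35 = predict_sales(35)
at_36 = predict_sales(36)
one_unit_change = at_36 - at_35

print(f"Predicted sales at zero spend: {baseline:.4f}")
print(f"Change in prediction from 35 to 36: {one_unit_change:.4f}")
```

The one-unit change equals the slope no matter which starting value of $X$ is chosen, which is exactly what a straight-line model implies.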