# Supervised Learning: Regression Models and Performance Metrics

Q 1. What is Simple Linear Regression (SLR)? Explain its purpose.

Ans--Simple linear regression is a statistical method used to model the relationship between two variables: one independent variable (predictor) and one dependent variable (response). Its primary purpose is to predict outcomes and understand relationships between these variables.

This technique works by fitting a straight line, known as the regression line, to the data points in a way that minimizes the sum of squared differences (residuals) between the observed values and the predicted values. The equation for this line is typically expressed as:

-     ŷ = b0 + b1x

Here, b0 is the y-intercept, b1 is the slope, and x is the independent variable.

Q 2. What are the key assumptions of Simple Linear Regression?

Ans--Linear regression relies on several key assumptions to ensure the validity and reliability of its results. These assumptions are critical for making accurate predictions and valid statistical inferences. Below are the primary assumptions:

-  **Linearity:** The relationship between the independent and dependent variables must be linear. This ensures that changes in the independent variable result in proportional changes in the dependent variable.

-  **Homoscedasticity:** The residuals (differences between observed and predicted values) should have constant variance across all levels of the independent variables. If the variance changes (heteroscedasticity), it can lead to inefficient estimates and unreliable hypothesis tests.

-   **Normality of Residuals:** The residuals should follow a normal distribution. This assumption is crucial for valid hypothesis testing and confidence intervals.

-   **Independence of Errors:** The residuals should not be correlated with one another. This is particularly important in time-series data, where autocorrelation can occur if errors at one time point influence errors at another.

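-   **Lack of Multicollinearity:** This applies only when a model has more than one predictor, so it cannot arise in simple linear regression, which has a single independent variable. In multiple regression, the independent variables should not be highly correlated with each other, because multicollinearity inflates standard errors and makes it difficult to assess the individual impact of predictors.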

-   **Absence of Endogeneity:** The independent variables should not be correlated with the error term. Endogeneity leads to biased and inconsistent coefficient estimates, undermining the model's validity.
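
Some of these assumptions can be checked from the residuals of a fitted line. The following is a minimal sketch in plain Python, not a full diagnostic suite: the data are made up for illustration, and the helper names `fit_line` and `mean_square` are assumptions of this example. It checks that the residuals average to zero, computes the Durbin-Watson statistic as a rough independence check, and compares residual spread across the lower and upper halves of x as a crude homoscedasticity check.

```python
# Illustrative data, assumed for this sketch (not taken from the text above).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_square(values):
    """Average of the squared values."""
    return sum(v * v for v in values) / len(values)


b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# With an intercept in the model, least-squares residuals average to zero
# (up to floating-point error).
mean_residual = sum(residuals) / len(residuals)

# Independence: the Durbin-Watson statistic lies in [0, 4]; values near 2
# suggest little lag-1 autocorrelation in the residuals.
durbin_watson = sum(
    (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
) / sum(r * r for r in residuals)

# Homoscedasticity (crude check): compare residual spread for the lower and
# upper halves of x; a ratio far from 1 hints at non-constant variance.
half = len(x) // 2
spread_ratio = mean_square(residuals[half:]) / mean_square(residuals[:half])

print(f"mean residual: {mean_residual:.2e}")
print(f"Durbin-Watson: {durbin_watson:.2f}")
print(f"spread ratio (upper/lower): {spread_ratio:.2f}")
```

With only eight points these numbers are indicative at best; in practice, residual plots and formal tests on a larger sample give a more reliable picture.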

Q 3. Write the mathematical equation for a simple linear regression model and explain each term.

Ans--The general equation for a simple linear regression model is:

-     y = a + bx

Where:

-   y: The dependent variable (response variable) whose value we aim to predict.

-   x: The independent variable (predictor variable) that influences the dependent variable.

-   a: The intercept, representing the value of y when x = 0. It is the point where the regression line crosses the y-axis.

-   b: The slope of the regression line, indicating the rate of change in y for a one-unit increase in x.

**Explanation of Terms**

-  **Dependent Variable (y):** This is the outcome or target variable that the model predicts. For example, in predicting house prices, y would represent the price of the house.

-   **Independent Variable (x):** This is the input or explanatory variable that influences the dependent variable. For instance, in the house price example, x could represent the size of the house.

-   **Intercept (a):** This is the starting value of y when x is zero. It provides a baseline prediction when no influence from the independent variable is present. However, in some contexts, the intercept may not always have a meaningful interpretation (e.g., predicting tree height when x = 0).

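-   **Slope (b):** This measures the direction and rate of change of y with respect to x. A positive slope indicates that as x increases, y also increases, while a negative slope suggests the opposite. The slope alone does not measure the strength of the relationship; that is captured by the correlation coefficient or R-squared.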

**Key Insights**

The equation assumes a linear relationship between the variables, meaning changes in the independent variable are associated with proportional changes in the dependent variable. The slope and intercept are calculated using statistical methods like the least squares approach, which minimizes the sum of squared differences between observed and predicted values.

This model is widely used in predictive analytics, trend analysis, and understanding relationships between variables in fields like economics, biology, and machine learning.

Q 4. Provide a real-world example where simple linear regression can be applied.

Ans--Advertising Spending and Sales Revenue

In this scenario, a company wants to understand how its advertising budget impacts its sales revenue. The company collects data over several months, recording the amount spent on advertising and the corresponding sales revenue generated during that period.

-   Independent Variable (X): Advertising Spending (in dollars)
-   Dependent Variable (Y): Sales Revenue (in dollars)

**Data Collection**

-   The company gathers the following data over a few months:

| Month | Advertising Spending (X) | Sales Revenue (Y) |
|-------|---------------------------|-------------------|
| 1     | $1,000                   | $10,000           |
| 2     | $1,500                   | $15,000           |
| 3     | $2,000                   | $20,000           |
| 4     | $2,500                   | $25,000           |
| 5     | $3,000                   | $30,000           |

**Applying Simple Linear Regression**

Using simple linear regression, the company can fit a line to this data to model the relationship between advertising spending and sales revenue. The regression equation takes the form:

Y = β<sub>0</sub> + β<sub>1</sub>X

Where:

-   β<sub>0</sub>: the y-intercept (the expected sales revenue when advertising spending is zero).
-   β<sub>1</sub>: the slope of the line (the expected change in sales revenue for each additional dollar spent on advertising).
  
**Interpretation of Results**

Every row in the table satisfies Y = 10X exactly, so fitting the regression to these five months gives:

Y = 0 + 10X

This means:

-   For every additional dollar spent on advertising, sales revenue increases by $10.

-   The intercept is 0: the fitted line predicts $0 in sales revenue when advertising spending is $0. Because $0 lies outside the observed range ($1,000 to $3,000), this extrapolation should be treated with caution.

**Conclusion**

By using simple linear regression, the company can make informed decisions about its advertising budget. If the analysis shows a strong positive relationship, the company may decide to increase its advertising spending to boost sales revenue further. This example illustrates how simple linear regression can be a powerful tool for businesses to understand and predict the impact of their investments on revenue.
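
As a check on the example, the five rows of the table can be fitted directly. This is a small sketch in plain Python using the closed-form least-squares estimates. The variable names and the planned spend of $3,500 are assumptions made for illustration. Because every row satisfies Y = 10X exactly, the fitted slope is 10 and the fitted intercept is 0.

```python
# Data from the advertising table above.
ad_spend = [1000, 1500, 2000, 2500, 3000]
revenue = [10000, 15000, 20000, 25000, 30000]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(revenue) / n

# Closed-form least-squares estimates for one predictor.
sxx = sum((x - mean_x) ** 2 for x in ad_spend)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, revenue))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Hypothetical budget, slightly above the observed range.
planned_spend = 3500
predicted_revenue = intercept + slope * planned_spend

print(f"Y = {intercept:.2f} + {slope:.2f} * X")
print(f"Predicted revenue at ${planned_spend}: ${predicted_revenue:,.2f}")
```

Real sales data would not fall exactly on a line, so the fitted coefficients would carry uncertainty that this toy table hides.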

Q 5. What is the method of least squares in linear regression?

Ans--When the value of the independent variable (x) is known, the Least-Squares Regression Line can be used to predict the value of the dependent variable (y). This method finds the line of best fit by minimizing the sum of squared differences between observed and predicted values.

The regression line is expressed as:

$$ y = mx + c $$

Where:

-   m = slope of the line

-   c = y-intercept

Formulas:

$$ m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} $$

$$ c = \frac{\sum y - m(\sum x)}{n} $$

Once m and c are known, you can substitute any given x to predict y.

```python
# Given data points
x_values = [1, 2, 4, 6, 8]
y_values = [3, 4, 8, 10, 15]

# Sums needed by the closed-form formulas
n = len(x_values)
sum_x = sum(x_values)
sum_y = sum(y_values)
sum_xy = sum(x * y for x, y in zip(x_values, y_values))
sum_x2 = sum(x**2 for x in x_values)

# Calculate slope (m) and intercept (c)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
c = (sum_y - m * sum_x) / n

# Predict y for a given x
x_given = 5
y_pred = m * x_given + c
print(f"Equation of line: y = {m:.2f}x + {c:.2f}")
print(f"Predicted y for x={x_given}: {y_pred:.2f}")
```

```text
Equation of line: y = 1.68x + 0.96
Predicted y for x=5: 9.34
```


**Key Points:**

-  This method works best when data is linear and free from extreme outliers.

-   It is widely used in predictive modeling, such as forecasting sales, estimating trends, or predicting physical measurements.

-   For multiple variables, the concept extends to multiple linear regression.

By applying the regression equation, any known x can be used to estimate the corresponding y with reasonable accuracy.
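
To see what "minimizing the sum of squared differences" means in practice, the sketch below reuses the five data points from the code above and compares the sum of squared errors (SSE) at the least-squares solution with the SSE at nearby lines. The helper name `sse` and the perturbation sizes are assumptions chosen for illustration.

```python
# Same data points as in the least-squares example above.
x_values = [1, 2, 4, 6, 8]
y_values = [3, 4, 8, 10, 15]


def sse(slope, intercept):
    """Sum of squared differences between observed y and slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(x_values, y_values))


# Least-squares slope and intercept from the closed-form formulas.
n = len(x_values)
sum_x = sum(x_values)
sum_y = sum(y_values)
sum_xy = sum(x * y for x, y in zip(x_values, y_values))
sum_x2 = sum(x**2 for x in x_values)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
c = (sum_y - m * sum_x) / n

best_sse = sse(m, c)

# Nudge the slope and/or intercept away from the solution; every nudged line
# should have a larger SSE than the least-squares line.
perturbed_sse = [
    sse(m + dm, c + dc)
    for dm in (-0.1, 0.0, 0.1)
    for dc in (-0.5, 0.0, 0.5)
    if (dm, dc) != (0.0, 0.0)
]

print(f"SSE at least-squares line: {best_sse:.4f}")
print(f"Smallest SSE among perturbed lines: {min(perturbed_sse):.4f}")
```

This only probes eight neighbouring lines, but because the SSE is a convex quadratic in the slope and intercept, the closed-form solution is its unique minimum whenever the x values are not all identical.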

Q 6. What is Logistic Regression? How does it differ from Linear Regression?

Ans--Linear Regression and Logistic Regression are both supervised learning algorithms, but they serve different purposes and produce different types of outputs.

Linear Regression is used for predicting continuous numeric values. It models the relationship between input features and the target variable as a straight line, using the equation:


-     Y = a + bX

The output can be any real number, positive or negative, without restriction.

Logistic Regression, on the other hand, is used for classification tasks, typically binary (0 or 1). It applies a sigmoid (logistic) function to the linear combination of inputs, which squashes the output to a range between 0 and 1, representing probabilities:

-     P(Y=1) = 1 / (1 + e^-(a + bX))

This probability can then be thresholded (e.g., at 0.5) to assign a class label.

Key difference: only logistic regression produces outputs bounded between 0 and 1.

Why this holds:

-  Linear Regression outputs are unbounded and can take any real value.

-  Logistic Regression outputs are bounded between 0 and 1 due to the sigmoid transformation, making them interpretable as probabilities for classification tasks.
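
The sigmoid mapping and the thresholding step can be sketched in a few lines of plain Python. The coefficients `a = -4.0` and `b = 1.0` are hypothetical values picked for illustration rather than fitted from data, and the function names are assumptions of this example.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only for illustration (not fitted).
a, b = -4.0, 1.0


def predict_proba(x):
    """P(Y=1) for input x under the logistic model with coefficients a and b."""
    return sigmoid(a + b * x)


def predict_class(x, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(x) >= threshold else 0


for x in (0, 2, 4, 6, 8):
    print(f"x={x}: linear term={a + b * x:+.1f}, "
          f"probability={predict_proba(x):.3f}, class={predict_class(x)}")
```

The linear term `a + b * x` is unbounded, exactly like a linear regression output; only after passing through the sigmoid does it become a value that can be read as a probability.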


```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y_linear = np.array([2, 4, 6, 8, 10])    # Continuous target
y_logistic = np.array([0, 0, 0, 1, 1])   # Binary target

# Linear Regression
lin_reg = LinearRegression().fit(X, y_linear)
print("Linear Regression Prediction for 6:", lin_reg.predict([[6]]))

# Logistic Regression
log_reg = LogisticRegression().fit(X, y_logistic)
print("Logistic Regression Probability for 6:", log_reg.predict_proba([[6]])[0, 1])
```

```text
Linear Regression Prediction for 6: [12.]
Logistic Regression Probability for 6: 0.9264970551893357
```


Here, the linear regression output is unbounded (12), while the logistic regression output is a probability (about 0.93) between 0 and 1.

Key takeaway: If your task involves predicting probabilities or classifying outcomes, logistic regression is the right choice because its outputs are naturally constrained between 0 and 1. For continuous value prediction, linear regression is appropriate.

Q 7. Name and briefly describe three common evaluation metrics for regression models.

Ans--
-   **Mean Absolute Error (MAE):** Measures the average magnitude of errors in a set of predictions, without considering their direction. It is the average over the test sample of the absolute differences between prediction and actual observation.

-   **Mean Squared Error (MSE):** Measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.

-   **Root Mean Squared Error (RMSE):** The square root of the MSE. It is expressed in the same units as the dependent variable and, like MSE, gives a relatively high weight to large errors.


These metrics help assess the performance of regression models and guide improvements.
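
The three metrics can be computed directly from their definitions. The sketch below uses plain Python; the `actual` and `predicted` values are made-up numbers for illustration only.

```python
import math

# Made-up observations and predictions, for illustration only.
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

# Mean Absolute Error: average size of the errors, ignoring sign.
mae = sum(abs(e) for e in errors) / n

# Mean Squared Error: average of the squared errors.
mse = sum(e**2 for e in errors) / n

# Root Mean Squared Error: square root of MSE, in the units of the target.
rmse = math.sqrt(mse)

print(f"MAE:  {mae:.4f}")   # 0.5000
print(f"MSE:  {mse:.4f}")   # 0.3750
print(f"RMSE: {rmse:.4f}")  # 0.6124
```

The single error of size 1 contributes 1.0 of the 1.5 total squared error but only half of the total absolute error, which is why MSE and RMSE are described as more sensitive to large errors than MAE.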


Q 8. What is the purpose of the R-squared metric in regression analysis?

Ans--**Understanding R-squared**

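-   Definition: R-squared, also known as the coefficient of determination, quantifies the proportion of the variability of the dependent variable that the independent variables explain. For a least-squares model with an intercept, evaluated on the data it was fitted to, it ranges from 0 to 1, where:

    -   0 indicates that the model does not explain any variability in the dependent variable.
    -   1 indicates that the model explains all the variability in the dependent variable.

    On held-out data, or for a model fitted without an intercept, R-squared can be negative, meaning the model fits worse than always predicting the mean.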

**Purpose and Applications**

-   **Goodness of Fit:** R-squared serves as a measure of how well the regression model fits the data. A higher R-squared value suggests a better fit, meaning the model's predictions are closer to the actual data points.


-   **Model Comparison:** It allows analysts to compare different regression models. By evaluating R-squared values, one can determine which model better explains the observed data variations.


-   **Interpretation of Variance:** R-squared helps in understanding the proportion of the total variance in the dependent variable that is accounted for by the independent variables. This insight is crucial for assessing the effectiveness of the model.


-   **Communication Tool:** It provides a straightforward way to communicate the predictive power of a model to stakeholders who may not have a technical background, making it easier to convey the model's effectiveness.

**Limitations**

While R-squared is a valuable metric, it has limitations:

-   **Not Always Indicative of Model Quality:** A high R-squared does not necessarily mean the model is good; it may indicate overfitting, where the model captures noise rather than the underlying relationship.


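-   **Ignores Model Complexity:** R-squared does not account for the number of predictors in the model. On the training data it never decreases when a predictor is added, even an irrelevant one, which can lead to misleading interpretations if it is used in isolation. Adjusted R-squared corrects for this by penalizing additional predictors.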



In summary, R-squared is a fundamental metric in regression analysis that helps evaluate model performance, interpret variance, and compare different models, but it should be used alongside other metrics and diagnostic tools for a comprehensive assessment of model quality.
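
R-squared can be computed from its definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes the five data points used in Q 9 and the slope (0.6) and intercept (2.2) reported there; the variable names are this example's own. It also computes adjusted R-squared for a model with one predictor.

```python
# Data points and fitted coefficients as reported in Q 9.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept = 0.6, 2.2

predictions = [intercept + slope * xi for xi in x]
mean_y = sum(y) / len(y)

# Residual sum of squares: variation the model leaves unexplained.
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))

# Total sum of squares: variation around the mean of y.
ss_tot = sum((yi - mean_y) ** 2 for yi in y)

r_squared = 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors p.
n, p = len(y), 1
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(f"R-squared:          {r_squared:.4f}")
print(f"Adjusted R-squared: {adjusted_r_squared:.4f}")
```

For this data the line explains about 60% of the variance in y, and the adjusted value is lower (about 0.47) because the penalty is substantial when there are only five observations.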

Q 9. Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

```python
# Import required libraries
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data (independent variable X and dependent variable y)
# X must be a 2D array for scikit-learn
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

# Create Linear Regression model
model = LinearRegression()

# Fit the model
model.fit(X, y)

# Get slope (coefficient) and intercept
slope = model.coef_[0]
intercept = model.intercept_

# Print results
print("Slope (Coefficient):", slope)
print("Intercept:", intercept)
```

```text
Slope (Coefficient): 0.6
Intercept: 2.2
```


Q 10. How do you interpret the coefficients in a simple linear regression model?

Ans--Interpreting a fitted regression model means understanding how each independent variable influences the dependent variable. In simple linear regression there is a single predictor, so the model has only an intercept β₀ and one slope β₁. The same interpretation carries over to the general multiple linear regression equation:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon $$

Here, β₀ is the intercept, βᵢ are the coefficients, and ϵ is the error term.

1. **Intercept (β₀):** Represents the expected value of Y when all independent variables are zero. While it sets the baseline, it may not have a practical meaning if zero values are unrealistic.

2. **Coefficients (βᵢ):**

    -   **Sign:** Positive → direct relationship (X increases → Y increases). Negative → inverse relationship (X increases → Y decreases).

    -   **Magnitude:** The expected change in Y for a one-unit increase in Xᵢ, holding the other predictors constant. Because it depends on the units of Xᵢ, magnitudes are only comparable across predictors measured on the same scale.

    -   **Statistical Significance:** Check p-values (commonly p < 0.05) to see whether the effect is distinguishable from zero.

    -   **Confidence Intervals:** Narrow intervals mean more precise estimates.

3. **Example:** For a housing price model:

$$ \text{Price} = 50000 + 300 \times \text{Size} + 10000 \times \text{Bedrooms} - 2000 \times \text{Age} $$

-   Intercept = 50000 → Base price when all predictors are zero.

-   Size = 300 → Each extra square foot adds $300.

-   Bedrooms = 10000 → Each extra bedroom adds $10,000.

-   Age = -2000 → Each year reduces price by $2,000.
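
The "one-unit change" reading of each coefficient can be checked numerically. The sketch below plugs the housing equation above into plain Python. The example house (2,000 square feet, 3 bedrooms, 10 years old) and the names `predict_price` and `effects` are hypothetical, introduced only for illustration.

```python
# Coefficients taken from the housing price equation above.
intercept = 50000
coefficients = {"Size": 300, "Bedrooms": 10000, "Age": -2000}


def predict_price(house):
    """Evaluate the regression equation for one house (dict of predictor values)."""
    return intercept + sum(coefficients[name] * house[name] for name in coefficients)


# Hypothetical house used as the baseline.
base_house = {"Size": 2000, "Bedrooms": 3, "Age": 10}
base_price = predict_price(base_house)

# Increase one predictor by one unit at a time, holding the others fixed,
# and record how much the predicted price changes.
effects = {}
for name in coefficients:
    bumped = dict(base_house)
    bumped[name] += 1
    effects[name] = predict_price(bumped) - base_price

print("Baseline predicted price:", base_price)
for name, change in effects.items():
    print(f"One more unit of {name}: {change:+d}")
```

Each recorded change equals the corresponding coefficient, which is what "holding the other predictors constant" means in a linear model.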

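The Python code below fits a multiple regression of this kind with statsmodels, using square footage and number of bedrooms as predictors: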


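```python
import pandas as pd
import statsmodels.api as sm

# Small illustrative housing dataset
data = {
    'Price': [200000, 250000, 300000, 350000, 400000],
    'SquareFootage': [1500, 2000, 2500, 3000, 3500],
    'Bedrooms': [3, 4, 3, 5, 4]
}
df = pd.DataFrame(data)

# add_constant adds the intercept column, which OLS does not include by default
X = sm.add_constant(df[['SquareFootage', 'Bedrooms']])
y = df['Price']

# Fit ordinary least squares and print the coefficient table
model = sm.OLS(y, X).fit()
print(model.summary())
```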

```text
                            OLS Regression Results
Dep. Variable:                  Price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 8.758e+28
Date:                Fri, 13 Feb 2026   Prob (F-statistic):           1.14e-29
Time:                        11:17:23   Log-Likelihood:                 103.68
No. Observations:                   5   AIC:                            -201.4
Df Residuals:                       2   BIC:                            -202.5
Df Model:                           2
Covariance Type:            nonrobust
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const              5e+04   8.85e-10   5.65e+13

  warn("omni_normtest is not valid with less than 8 observations; %i "
```
