# Supervised Learning: Regression Models and Performance Metrics | Solution


1. What is Simple Linear Regression (SLR)? Explain its purpose.
  - Simple Linear Regression (SLR) is a statistical technique used to model and analyze the relationship between two quantitative variables: one independent variable (predictor) and one dependent variable (outcome).
  
  - The method fits a straight line (the "regression line") through the data points so that the line best predicts the values of the dependent variable from the independent variable.

  - Purpose of Simple Linear Regression

    - The main purpose of SLR is to estimate and understand how changes in the independent variable X are associated with changes in the dependent variable.

    - SLR helps to quantify the strength and direction of this relationship, and it provides a mathematical equation (typically $Y = \beta_{0} + \beta_{1}X + \epsilon$), which can be used to make predictions about new data where only X is known.

    - SLR is valuable for both descriptive analysis (understanding relationships) and predictive modeling (forecasting outcomes).

2. What are the key assumptions of Simple Linear Regression?
  - The key assumptions of Simple Linear Regression (SLR) are as follows:

    - Linearity
      - The relationship between the independent variable and the dependent variable must be linear, meaning that the change in the outcome should be proportional to changes in the predictor.

    - Independence
      - The residuals (errors) should be independent. Observations and their errors must not be correlated with each other, ensuring valid statistical inference.

    - Homoscedasticity (Constant Variance)
      - The variance of the residuals should remain constant across all levels of the independent variable. This means that the spread of errors should not increase or decrease as the predictor changes.

    - Normality
      - The residuals (error terms) should be normally distributed. This is crucial for valid hypothesis testing and confidence interval estimation.
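These assumptions can be checked informally from the residuals of a fitted line. The sketch below is one possible set of numeric checks, not a standard procedure: the data are synthetic and generated to satisfy the assumptions, and the variable names (`durbin_watson`, `spread_ratio`, `curvature`, `skewness`) are illustrative choices. In practice, residual plots and formal tests are usually used alongside numbers like these.

```python
import numpy as np

# Hypothetical data, generated so that the assumptions hold (illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=x.size)

# Fit a straight line by least squares; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Linearity: residuals should show no leftover curvature in x.
curvature = np.corrcoef((x - x.mean()) ** 2, residuals)[0, 1]

# Independence: the Durbin-Watson statistic is close to 2 when
# consecutive residuals are uncorrelated.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Homoscedasticity: residual spread for large x vs. small x should be similar.
half = x.size // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()

# Normality (rough check): sample skewness should be close to 0.
skewness = np.mean((residuals - residuals.mean()) ** 3) / residuals.std() ** 3

print(f"curvature={curvature:.3f}, DW={durbin_watson:.2f}, "
      f"spread ratio={spread_ratio:.2f}, skewness={skewness:.2f}")
```

Values far from 0 (curvature, skewness), 2 (Durbin-Watson), or 1 (spread ratio) would suggest that the corresponding assumption deserves a closer look.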

3. Write the mathematical equation for a simple linear regression model and explain each term.
  - The mathematical equation for a simple linear regression model is:

$$Y = \beta_{0} + \beta_{1}x + \epsilon$$

  where,

 - $Y$ = Dependent variable (the response variable).
 - $x$ = Independent variable (the predictor variable).
 - $\beta_{0}$ = Intercept term; the expected value of Y when x = 0.
 - $\beta_{1}$ = Slope or regression coefficient; the expected change in Y for a one-unit increase in x.
 - $\epsilon$ = Error term; the part of Y that the straight line does not explain.

4. Provide a real-world example where simple linear regression can be applied.
  - A classic real-world example where simple linear regression is applied is predicting a student's exam score based on the number of hours they studied.

    - Independent Variable (x): Number of hours studied

    - Dependent Variable (y): Exam score obtained

  - In this scenario, historical data is collected showing how many hours different students studied and what scores they achieved. By applying simple linear regression, a best-fit line can be drawn through this data to determine the relationship between hours studied and exam score. The resulting model allows for predictions such as estimating the likely score if a student studies for 5 hours.
  - This approach helps educators and students understand the strength of the association between preparation (study time) and performance, providing actionable insight for planning study schedules.
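The example can be sketched in a few lines. The study records below are invented purely for illustration, and `np.polyfit` is used here only as a convenient way to fit the line; any least-squares routine would do.

```python
import numpy as np

# Hypothetical study records, invented purely for illustration.
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 6.0])
exam_scores = np.array([52.0, 58.0, 61.0, 70.0, 80.0])

# Fit score = intercept + slope * hours by least squares.
# polyfit with degree 1 returns [slope, intercept].
slope, intercept = np.polyfit(hours_studied, exam_scores, 1)

# Estimate the likely score for a student who studies for 5 hours.
predicted_score = intercept + slope * 5.0
print(f"Estimated score after 5 hours: {predicted_score:.1f}")
```

The estimate is only as trustworthy as the data behind it; predictions far outside the observed range of study hours should be treated with caution.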

5. What is the method of least squares in linear regression?
  - The method of least squares in linear regression is a mathematical technique used to find the line of best fit for a set of data points by minimizing the sum of the squares of the differences (residuals) between the observed values and the values predicted by the line.
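For a single predictor, the minimization has a closed-form solution: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below applies these formulas to the hours-studied data from the scikit-learn example in this document; the function names are illustrative.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared differences between observed and predicted values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((y - (intercept + slope * x)) ** 2)

hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 75]

slope, intercept = least_squares_line(hours, scores)
best = sum_squared_residuals(hours, scores, slope, intercept)

# Any other line, such as one with a slightly different slope, fits worse.
worse = sum_squared_residuals(hours, scores, slope + 0.5, intercept)
print(slope, intercept, best, worse)
```

Comparing `best` with `worse` shows the defining property: no other choice of slope and intercept gives a smaller sum of squared residuals.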

6. What is Logistic Regression? How does it differ from Linear Regression?
  - Logistic Regression is a supervised machine learning algorithm used for classification tasks, where the output (dependent variable) is categorical, typically representing binary outcomes such as yes/no or 0/1. It models the probability that a given input belongs to a certain class using the logistic (sigmoid) function, which transforms predictions to a range between 0 and 1.

  - Logistic Regression differs from Linear Regression in the following aspects:

    - Linear Regression
      - Output Type: Continuous (e.g., price)
      - Purpose: Predict values
      - Mathematical Function: Linear equation
      - Output Range: Any real number
      - Loss/Estimation Method: Least Squares
      - Common Use Cases: Forecasting numbers
    
    - Logistic Regression
      - Output Type: Categorical (e.g., yes/no)
      - Purpose: Classify outcomes
      - Mathematical Function: Logistic (sigmoid) function
      - Output Range: Probability (0 to 1)
      - Loss/Estimation Method: Maximum Likelihood
      - Common Use Cases: Predicting class labels
    
  - Linear Regression predicts continuous values by fitting a straight line that minimizes the sum of squared errors (least squares).

  - Logistic Regression predicts the probability of an event and is used for classification; it uses the logistic function to map predictions to probability, and classifies based on a threshold (like 0.5).
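The mapping from a linear score to a class label can be sketched as follows. The coefficients `beta_0` and `beta_1` are hypothetical values chosen for illustration, not fitted to any data; in practice they would be estimated by maximum likelihood.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen for illustration only (not fitted).
beta_0, beta_1 = -4.0, 1.0
hours = np.array([1.0, 3.0, 4.0, 5.0, 8.0])

# Same linear form as linear regression; the result can be any real number.
linear_score = beta_0 + beta_1 * hours

# The sigmoid turns the score into a probability of the positive class.
probability = sigmoid(linear_score)

# Classify with a 0.5 threshold: 1 if the probability is at least 0.5.
predicted_class = (probability >= 0.5).astype(int)
print(probability.round(3), predicted_class)
```

The threshold of 0.5 is a common default; a different cut-off can be chosen when false positives and false negatives carry different costs.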

7. Name and briefly describe three common evaluation metrics for regression models.
  - Three common evaluation metrics for regression models are:

    - Mean Absolute Error (MAE)
      - Measures the average absolute difference between the actual values and the predicted values.

      - It is easy to interpret because it is in the same units as the target variable and treats all errors equally.

    - Mean Squared Error (MSE)
      - Calculates the average of the squared differences between actual and predicted values.

      - It penalizes larger errors more heavily and is commonly used for model evaluation and optimization.

    - R-squared (R²) / Coefficient of Determination
      - Reflects the proportion of the variance in the dependent variable that is explained by the model.

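      - On the data used to fit the model, values range from 0 to 1, with higher values indicating a better fit of the regression model to the data. On held-out data, R² can be negative when the model predicts worse than simply using the mean.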

These metrics help in understanding and comparing the predictive accuracy and explanatory power of regression models.
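As a minimal sketch, the three metrics can be computed directly from their definitions. The actual values are the exam scores from the scikit-learn example in this document, and the predictions come from its fitted line (intercept 43.5, slope 6.5); the function name `regression_metrics` is illustrative. scikit-learn provides equivalent functions in `sklearn.metrics` (`mean_absolute_error`, `mean_squared_error`, `r2_score`).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) for actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # average absolute error
    mse = np.mean(errors ** 2)                       # average squared error
    ss_res = np.sum(errors ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2

y_true = [50, 55, 65, 70, 75]
y_pred = [43.5 + 6.5 * hours for hours in [1, 2, 3, 4, 5]]

mae, mse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.2f}, MSE={mse:.2f}, R^2={r2:.4f}")
```

MAE and MSE depend on the units of the target variable, while R² is unitless, which makes it easier to compare across problems.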

8. What is the purpose of the R-squared metric in regression analysis?
  - The purpose of the R-squared metric in regression analysis is to measure the proportion of the variance in the dependent variable that is explained by the independent variable(s) in the regression model.

  - It quantifies how well the regression model fits the observed data by indicating the fraction of the total variation in the outcome that the model accounts for. On the data used to fit the model, R-squared values range from 0 to 1, where a value of 1 means the model perfectly explains the data variation, and 0 means it explains none of it. On new data, the value can fall below 0 if the model predicts worse than the mean of the outcome.

  - In essence, R-squared helps assess the goodness of fit, showing how effectively the model's independent variables predict or explain the dependent variable's behavior.
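Written out, the usual definition compares the variation left unexplained by the model with the total variation around the mean. The symbols $\hat{y}_i$ (predicted value) and $\bar{y}$ (mean of the observed values) are notation introduced here for the formula:

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

In simple linear regression fitted by least squares, this value equals the square of the Pearson correlation coefficient between x and y.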

9. Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.
(Include your Python code and output in the code box below.)

```python
# Import necessary libraries
from sklearn.linear_model import LinearRegression
import numpy as np

# Example data: Hours studied (X) and exam scores (y)
X = np.array([[1], [2], [3], [4], [5]])  # Independent variable (2D array, as scikit-learn expects)
y = np.array([50, 55, 65, 70, 75])      # Dependent variable

# Create the model and fit it
model = LinearRegression()
model.fit(X, y)

# Retrieve slope and intercept
slope = model.coef_[0]
intercept = model.intercept_

print(f"Slope (Coefficient): {slope:.2f}")
print(f"Intercept: {intercept}")
```

Output:

```
Slope (Coefficient): 6.50
Intercept: 43.5
```



10. How do you interpret the coefficients in a simple linear regression model?
  - In a simple linear regression model, the coefficients are interpreted as follows:

    - The intercept (β₀) represents the expected value of the dependent variable (response) when the independent variable is zero. It serves as the baseline level of the outcome variable.

    - The slope coefficient (β₁) indicates the average change in the dependent variable for a one-unit increase in the independent variable. A positive slope means the dependent variable increases as the independent variable increases, while a negative slope means it decreases.
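Applied to the fitted values printed in the scikit-learn example in this document (slope 6.5, intercept 43.5), the interpretation can be checked numerically. The helper `predict` is a hypothetical name used only for this sketch.

```python
# Coefficients taken from the printed output of the scikit-learn example.
slope = 6.5
intercept = 43.5

def predict(hours):
    """Predicted exam score for a given number of hours studied."""
    return intercept + slope * hours

# Intercept: the predicted score when hours studied is zero.
baseline_score = predict(0)

# Slope: the change in predicted score for one extra hour of study,
# which is the same no matter where on the line it is measured.
gain_per_hour = predict(4) - predict(3)

print(baseline_score, gain_per_hour)
```

Here each additional hour of study is associated with an expected gain of 6.5 points. The baseline of 43.5 points is an extrapolation, since the data contain no student who studied zero hours.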