Supervised Learning: Regression Models and Performance Metrics

In [None]:
'''
Question 1 : What is Simple Linear Regression (SLR)? Explain its purpose.
Answer:      Simple Linear Regression (SLR) is a statistical method used to model and analyze the relationship between two continuous variables: 
             one independent variable (predictor) and one dependent variable (response). The goal of SLR is to find a linear equation that best 
             predicts the dependent variable based on the independent variable.
            
             Purpose of Simple Linear Regression:
               a. Prediction: Estimate the value of the dependent variable based on the independent variable.
               b. Understanding relationships: Determine the direction and size of the linear relationship between X and Y.
               c. Trend analysis: Identify how changes in the independent variable influence the dependent variable.


Question 2: What are the key assumptions of Simple Linear Regression?
Answer:      The key assumptions of Simple Linear Regression (SLR) are:
                 a. Linearity: The relationship between the independent variable (X) and dependent variable (Y) is linear.
                 b. Independence: The residuals (errors) are independent of each other.
                 c. Homoscedasticity: The variance of the residuals is constant across all values of X.
                 d. Normality of errors: The residuals are normally distributed.
                 e. No multicollinearity: (In SLR, only one predictor exists, so this is inherently satisfied.)


Question 3: Write the mathematical equation for a simple linear regression model and explain each term.
Answer:

           The mathematical equation for a Simple Linear Regression (SLR) model is:
               Y = β0 + β1 * X + ε
               Where:
                    - Y : Dependent variable (what we want to predict)
                    - X : Independent variable (predictor)
                    - β0 : Intercept (expected value of Y when X = 0)
                    - β1 : Slope (expected change in Y for a one-unit increase in X)
                    - ε : Error term (the random part of Y that the line β0 + β1 * X does not capture)
           This equation represents a straight line where Y changes linearly with X, and ε accounts for variability not explained by X.


Question 4: Provide a real-world example where simple linear regression can be applied.
Answer:     A real-world example of Simple Linear Regression is predicting a person's **monthly electricity bill (Y)** 
            based on their **electricity consumption in units (X)**. 
            Here:
               - X (Independent variable) = Number of electricity units consumed
               - Y (Dependent variable) = Monthly electricity bill amount
            By using SLR, we can estimate how much a person's bill will increase for each additional unit of electricity consumed.


Question 5: What is the method of least squares in linear regression?
Answer:     The **method of least squares** is a technique used in linear regression to find the best-fitting line 
            through the data points. 
            It works by **minimizing the sum of the squares of the differences** (errors) between the observed values 
            and the values predicted by the linear model.
            Mathematically, it chooses β0 and β1 to minimize the sum of squared errors over all data points:

                   Σ (Yi - (β0 + β1*Xi))^2

           Where:
                - Yi = observed value of the dependent variable
                - Xi = value of the independent variable
                - β0 = intercept
                - β1 = slope
           This ensures that the regression line is as close as possible to all the data points in terms of squared error.
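
            As a minimal sketch of what this minimization yields, the standard closed-form solution
            β1 = Σ (Xi - X̄)(Yi - Ȳ) / Σ (Xi - X̄)^2 and β0 = Ȳ - β1*X̄ can be evaluated with NumPy.
            The data is the small sample used in Question 9; the helper names `least_squares_fit`
            and `sum_squared_errors` are illustrative assumptions, not part of any library.

```python
import numpy as np

def least_squares_fit(x, y):
    # Closed-form least-squares estimates for a single predictor.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return float(slope), float(intercept)

def sum_squared_errors(x, y, slope, intercept):
    # The quantity that least squares minimizes: sum of (Yi - (b0 + b1*Xi))^2.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - (intercept + slope * x)) ** 2))

# Same sample data as in Question 9.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = least_squares_fit(x, y)
print("Slope:", round(slope, 4))
print("Intercept:", round(intercept, 4))
print("SSE at the fitted line:", round(sum_squared_errors(x, y, slope, intercept), 4))
# Any other line should give a larger sum of squared errors.
print("SSE with a nudged slope:", round(sum_squared_errors(x, y, slope + 0.1, intercept), 4))
```

            Nudging the slope away from the fitted value increases the sum of squared errors, which
            is the sense in which the fitted line is the "best" one.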


Question 6: What is Logistic Regression? How does it differ from Linear Regression?
Answer:     Logistic Regression is a statistical method used for predicting a binary outcome (such as Yes/No, 0/1) 
            based on one or more independent variables. Instead of predicting a continuous value, it estimates the 
            probability that a given input belongs to a particular class using the logistic (sigmoid) function.

            Differences from Linear Regression:
            1. Output type:  
            - Linear Regression predicts a continuous variable.  
            - Logistic Regression predicts a probability (between 0 and 1) and maps it to discrete classes.
            2. Equation:  
            - Linear: Y = β0 + β1*X + ε  
            - Logistic: log(p/(1-p)) = β0 + β1*X, where p is the probability of the outcome.
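            3. Error distribution:  
            - Linear Regression assumes normally distributed errors around the regression line.  
            - Logistic Regression assumes the binary outcome follows a Bernoulli (binomial) distribution with probability p; there is no separate normal error term.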
            4. Use case:  
            - Linear Regression: predicting sales, price, temperature, etc.  
            - Logistic Regression: predicting pass/fail, disease/no disease, spam/not spam, etc.
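
            The link between the two equations above can be sketched as follows. The coefficients
            `beta0` and `beta1` and the input values are hypothetical, chosen only for illustration;
            they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number to a value strictly between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and inputs (assumptions for illustration only).
beta0, beta1 = -4.0, 1.0
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

log_odds = beta0 + beta1 * x          # linear predictor, unbounded
p = sigmoid(log_odds)                 # probability of the positive class
labels = (p >= 0.5).astype(int)       # map probabilities to classes at a 0.5 threshold

print("Log-odds:     ", log_odds)
print("Probabilities:", np.round(p, 3))
print("Classes:      ", labels)
# Applying log(p / (1 - p)) to the probabilities recovers the linear predictor.
print("Recovered log-odds:", np.round(np.log(p / (1 - p)), 3))
```

            The linear part is the same as in linear regression; only the sigmoid step turns it
            into a probability that can be thresholded into a class.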


Question 7: Name and briefly describe three common evaluation metrics for regression models.
Answer:     1. Mean Absolute Error (MAE):
               - Measures the average of the absolute differences between actual and predicted values.
               - Formula: MAE = (1/n) * Σ |Yi - Ŷi|
               - Lower MAE indicates better model performance.

            2. Mean Squared Error (MSE):
               - Measures the average of the squared differences between actual and predicted values.
               - Formula: MSE = (1/n) * Σ (Yi - Ŷi)^2
               - Penalizes larger errors more heavily than MAE because the errors are squared; lower is better.

            3. R-squared (R²):
               - Represents the proportion of variance in the dependent variable explained by the model.
               - Formula: R² = 1 - (SS_res / SS_tot), where SS_res = Σ (Yi - Ŷi)^2 and SS_tot = Σ (Yi - Ȳ)^2
               - Lies between 0 and 1 for a least-squares fit with an intercept evaluated on its training data; higher values indicate a better fit.
                 It can be negative on other data when the model predicts worse than the mean of Y.


Question 8: What is the purpose of the R-squared metric in regression analysis?
Answer:     R-squared (R²) measures the proportion of the variance in the dependent variable (Y) 
            that is explained by the independent variable(s) (X) in the regression model.
            Purpose:
                   - Indicates how well the regression model fits the data.
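                   - For a least-squares fit with an intercept, evaluated on the training data, it ranges from 0 to 1:
                        - 0 means the model explains none of the variability in Y.
                        - 1 means the model explains all the variability in Y.
                   - On new data it can be negative, meaning the model predicts worse than simply using the mean of Y.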
                  - Helps assess the predictive power of the model: higher R² means a better fit.
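
            The three metrics from Question 7 can be computed directly from their formulas. This is a
            minimal sketch: `regression_metrics` is an illustrative helper, `y_fit` holds the
            predictions printed in Question 9, and `y_bad` is a made-up set of poor predictions
            included only to show that R² can drop below 0.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    # Compute MAE, MSE and R-squared from their definitions.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
    return {
        "MAE": float(np.mean(np.abs(residuals))),
        "MSE": float(np.mean(residuals ** 2)),
        "R2": float(1.0 - ss_res / ss_tot),
    }

y_true = [2, 4, 5, 4, 5]
y_fit = [2.8, 3.4, 4.0, 4.6, 5.2]   # least-squares predictions from Question 9
y_bad = [5, 4, 2, 4, 3]             # made-up poor predictions (assumption)

fit_metrics = regression_metrics(y_true, y_fit)
bad_metrics = regression_metrics(y_true, y_bad)
print("Least-squares fit:", fit_metrics)
print("Poor predictions: ", bad_metrics)
```

            scikit-learn provides the same quantities as `mean_absolute_error`, `mean_squared_error`
            and `r2_score` in `sklearn.metrics`.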
'''


In [3]:
'''
Question 9: Write Python code to fit a simple linear regression model using scikit-learn
and print the slope and intercept.
Answer:
'''
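import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data: scikit-learn expects X as a 2D array of shape (n_samples, n_features)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
Y = np.array([2, 4, 5, 4, 5])

# Fit the model by ordinary least squares
model = LinearRegression()
model.fit(X, Y)

# coef_ holds one slope per feature; intercept_ is the fitted β0
print("Slope (β1):", model.coef_[0])
print("Intercept (β0):", model.intercept_)

# Predictions for the training inputs
Y_pred = model.predict(X)
print("Predicted Y values:", Y_pred)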

Slope (β1): 0.6000000000000002
Intercept (β0): 2.1999999999999993
Predicted Y values: [2.8 3.4 4.  4.6 5.2]


In [None]:
'''
Question 10: How do you interpret the coefficients in a simple linear regression model?
Answer:      In a simple linear regression model (Y = β0 + β1*X + ε):

             1. Intercept (β0):
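                  - Represents the expected value of the dependent variable (Y) when the independent variable (X) is 0.
                  - It is the point where the regression line crosses the Y-axis; it is only meaningful in practice if X = 0 lies within or near the range of the observed data.
             2. Slope (β1):
                  - Represents the expected (average) change in the dependent variable (Y) for a **one-unit increase** in the independent variable (X).
                  - Its magnitude is the rate of change in units of Y per unit of X; the strength of the association is measured by the correlation or R², not by the slope.
                  - Its sign indicates the direction of the relationship: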
                            - Positive slope → Y increases as X increases.
                            - Negative slope → Y decreases as X increases.
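
             These interpretations can be checked numerically. The sketch below refits the
             Question 9 data with `np.polyfit` (used here instead of scikit-learn only to keep
             the example short); `predict` is an illustrative helper, and the rescaling of X by
             100 is an assumption made purely to show that the slope depends on the units of X.

```python
import numpy as np

# Same sample data as in Question 9.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# For a degree-1 fit, np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

def predict(x_value):
    # Fitted line: intercept + slope * X.
    return intercept + slope * x_value

# Intercept: the prediction at X = 0.
print("Prediction at X = 0:", round(float(predict(0.0)), 4))
# Slope: the change in the prediction for a one-unit increase in X.
print("Change from X = 3 to X = 4:", round(float(predict(4.0) - predict(3.0)), 4))

# Expressing X in units 100 times smaller shrinks the slope by a factor of 100,
# while the correlation between X and Y stays the same.
x_scaled = x * 100
slope_scaled, intercept_scaled = np.polyfit(x_scaled, y, deg=1)
corr = np.corrcoef(x, y)[0, 1]
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]
print("Slope with rescaled X:", round(float(slope_scaled), 6))
print("Correlation (original, rescaled):", round(float(corr), 4), round(float(corr_scaled), 4))
```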
'''
