# Regression & Its Evaluation

Question 1: What is Simple Linear Regression?


- Simple Linear Regression is a statistical technique that models the relationship between two variables.
- It describes how one independent variable (X) affects one dependent variable (Y) by fitting a straight line through the data points.

Formula: $Y = a + bX$, where $a$ is the intercept and $b$ is the slope.
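
The coefficients of this line are usually estimated by ordinary least squares. As a short worked statement of the standard result (the sample means $\bar{X}$ and $\bar{Y}$ and the index $i$ are notation introduced here for illustration):

$$
b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad a = \bar{Y} - b\,\bar{X}
$$

These values of $a$ and $b$ minimize the sum of squared residuals $\sum_{i=1}^{n} (Y_i - a - bX_i)^2$.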

Question 2: What are the key assumptions of Simple Linear Regression?

- The key assumptions of simple linear regression are linearity, independence of errors, homoscedasticity, normal distribution of errors, and absence of influential outliers.

Question 3: What is heteroscedasticity, and why is it important to address in regression models?

- Heteroscedasticity refers to a situation in regression analysis where the variance of the error terms (residuals) is not constant across all levels of the independent variable.

- Why is it important to address?

1. **Incorrect standard errors**

   * Leads to unreliable confidence intervals and p-values.

2. **Invalid hypothesis tests**

   * t-tests and F-tests may give wrong conclusions.

3. **Inefficient estimates**

   * Coefficient estimates remain unbiased but are no longer minimum-variance.

4. **Poor model reliability**

   * Predictions and inferences become less trustworthy.
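
To make the idea of non-constant error variance concrete, here is a minimal sketch on synthetic data. The data-generating process (noise whose standard deviation grows with X) and the split at the median are assumptions chosen for illustration; the comparison is an informal check, not a formal test.

```python
import numpy as np

# Synthetic data (assumption): the noise standard deviation is proportional to X
rng = np.random.default_rng(0)
X = np.linspace(1, 10, 200)
Y = 2.0 + 3.0 * X + rng.normal(0, 1, size=X.size) * (0.5 * X)

# Ordinary least-squares fit of Y = a + bX (polyfit returns slope first)
b, a = np.polyfit(X, Y, 1)
residuals = Y - (a + b * X)

# Compare the residual variance in the lower and upper halves of X
median_x = np.median(X)
low_var = residuals[X <= median_x].var()
high_var = residuals[X > median_x].var()

print("Residual variance, low X :", low_var)
print("Residual variance, high X:", high_var)
```

Under constant variance the two numbers would be similar; here the upper half is noticeably larger, which is the pattern a residuals plot would show as a widening funnel.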

Question 4: What is Multiple Linear Regression?

- Multiple Linear Regression (MLR) is a statistical technique used to model the relationship between one dependent variable and two or more independent variables.
  
Formula: $Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$
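
As a rough sketch of how the coefficients can be estimated, the example below solves the least-squares problem with NumPy. The data are hypothetical and built without noise from known coefficients, so the fit should recover them.

```python
import numpy as np

# Hypothetical data built exactly from Y = 1 + 2*X1 + 3*X2 (no noise)
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
Y = 1 + 2 * X1 + 3 * X2

# Design matrix with a leading column of ones for the intercept b0
A = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares solution for [b0, b1, b2]
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
print("b0, b1, b2:", coeffs)
```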

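Question 5: What is polynomial regression, and how does it differ from linear regression?

- Polynomial Regression is a type of regression analysis in which the relationship between the independent variable and the dependent variable is modeled as a polynomial function of the independent variable.
- It differs from linear regression in that it fits a curve rather than a straight line, so it can capture non-linear relationships. The model is still linear in its coefficients, so it can be estimated with the same least-squares method.

Formula: $Y = b_0 + b_1 X + b_2 X^2 + b_3 X^3 + \dots$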
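
One way to see the connection to linear regression is to treat the powers of X as separate predictors. The sketch below (with illustrative data values) fits a 2nd-degree polynomial by ordinary least squares on the columns 1, X and X², and compares the result with `np.polyfit`.

```python
import numpy as np

# Illustrative data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.2, 4.8, 7.5, 11.2, 14.7])

# Design matrix with columns [1, X, X^2]: a linear model in b0, b1, b2
A = np.column_stack([np.ones_like(X), X, X**2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)

# np.polyfit returns the highest power first, so reverse it for comparison
b_polyfit = np.polyfit(X, Y, 2)[::-1]

print("Least squares on [1, X, X^2]:", b)
print("np.polyfit (reversed)       :", b_polyfit)
```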

Question 6:

```python
import numpy as np
import matplotlib.pyplot as plt

# Given data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 4.3, 6.1, 7.9, 10.2])

# Calculate slope (m) and intercept (c) with the least-squares formulas
x_mean = np.mean(X)
y_mean = np.mean(Y)

m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
c = y_mean - m * x_mean

# Predicted values on the fitted line
Y_pred = m * X + c

# Plot the data points and the regression line
plt.scatter(X, Y)
plt.plot(X, Y_pred)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Simple Linear Regression")
plt.show()

print("Slope (m):", m)
print("Intercept (c):", c)
```

Question 7:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Sample data
data = {
    "Area": [1200, 1500, 1800, 2000],
    "Rooms": [2, 3, 3, 4],
    "Price": [250000, 300000, 320000, 370000]
}

df = pd.DataFrame(data)

# Independent variables (features) and dependent variable (target)
X = df[["Area", "Rooms"]]
y = df["Price"]

# Add a constant column so the model includes an intercept
X_const = sm.add_constant(X)

# Fit Multiple Linear Regression model
model = sm.OLS(y, X_const).fit()
print(model.summary())

# Calculate VIF for each column
# (the value reported for "const" is not interpreted as multicollinearity)
vif = pd.DataFrame()
vif["Feature"] = X_const.columns
vif["VIF"] = [
    variance_inflation_factor(X_const.values, i)
    for i in range(X_const.shape[1])
]

print("\nVariance Inflation Factor (VIF):")
print(vif)
```

Question 8:

```python
import numpy as np
import matplotlib.pyplot as plt

# Given data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.2, 4.8, 7.5, 11.2, 14.7])

# Fit 2nd-degree polynomial (coefficients are returned highest power first)
coefficients = np.polyfit(X, Y, 2)
poly_model = np.poly1d(coefficients)

# Generate a smooth curve over the range of X
X_curve = np.linspace(min(X), max(X), 100)
Y_curve = poly_model(X_curve)

# Plot the data points and the fitted curve
plt.scatter(X, Y)
plt.plot(X_curve, Y_curve)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("2nd Degree Polynomial Regression")
plt.show()

print("Polynomial Coefficients:", coefficients)
```

Question 9:

```python
import numpy as np
import matplotlib.pyplot as plt

# Given data
X = np.array([10, 20, 30, 40, 50])
Y = np.array([15, 35, 40, 50, 65])

# Fit simple linear regression (y = mx + c)
x_mean = np.mean(X)
y_mean = np.mean(Y)

m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
c = y_mean - m * x_mean

# Predictions
Y_pred = m * X + c

# Residuals (observed minus predicted)
residuals = Y - Y_pred

# Residuals plot: a roughly even band around zero suggests constant variance
plt.scatter(X, residuals)
plt.axhline(0)
plt.xlabel("X")
plt.ylabel("Residuals")
plt.title("Residuals Plot")
plt.show()
```

Question 10: Imagine you are a data scientist working for a real estate company. You need to predict house prices using features like area, number of rooms, and location. However, you detect heteroscedasticity and multicollinearity in your regression model. Explain the steps you would take to address these issues and ensure a robust model.

- As a data scientist predicting **house prices** from features like **area, rooms, and location**, suppose the regression model shows:

- Heteroscedasticity (non-constant error variance)
- Multicollinearity (high correlation among predictors)

Both issues can make the model unreliable and unstable.

1. Handling Heteroscedasticity

Detection

- Plot **residuals vs predicted values**
- Perform **Breusch–Pagan or White test**
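
As a hedged illustration of the test idea, the sketch below computes the studentized (Koenker) form of the Breusch–Pagan statistic by hand: regress the squared OLS residuals on the predictors and use n·R² from that auxiliary regression. The data are synthetic (noise that grows with X is an assumption), and the statistic is compared against 3.841, the 5% critical value of a chi-squared distribution with 1 degree of freedom. In practice a library routine such as `het_breuschpagan` from `statsmodels.stats.diagnostic` would normally be used instead.

```python
import numpy as np

# Synthetic data (assumption): noise standard deviation proportional to X
rng = np.random.default_rng(0)
n = 200
X = np.linspace(1, 10, n)
Y = 2.0 + 3.0 * X + rng.normal(0, 1, size=n) * (0.5 * X)

# Step 1: OLS fit and residuals
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
resid = Y - A @ beta

# Step 2: auxiliary regression of squared residuals on the same predictors
e2 = resid**2
gamma, *_ = np.linalg.lstsq(A, e2, rcond=None)
fitted = A @ gamma
r2_aux = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)

# Step 3: LM statistic = n * R^2, compared with the chi-squared(1) critical value
lm_stat = n * r2_aux
CHI2_CRIT_1DF_5PCT = 3.841
print("LM statistic:", lm_stat)
print("Reject constant variance at 5%:", lm_stat > CHI2_CRIT_1DF_5PCT)
```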

Solutions

1. **Transform the target variable**

   * Apply **log, square root, or Box-Cox transformation** to house prices
   * Example: `log(Price)`

2. **Use Weighted Least Squares (WLS)**

   * Assign lower weights to observations with high variance

3. **Robust Standard Errors**

   * Use **heteroscedasticity-robust (HC) standard errors** to get reliable p-values

4. **Feature engineering**

   * Add missing variables (e.g., neighborhood quality, age of house)
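
Of the remedies above, Weighted Least Squares is the easiest to sketch with NumPy alone. The example assumes the error standard deviation is proportional to X, so the weights are 1/X²; with real house-price data the variance structure is unknown and would have to be estimated. The data are synthetic.

```python
import numpy as np

# Synthetic data (assumption): noise standard deviation proportional to X
rng = np.random.default_rng(0)
X = np.linspace(1, 10, 200)
Y = 2.0 + 3.0 * X + rng.normal(0, 1, size=X.size) * (0.5 * X)

A = np.column_stack([np.ones_like(X), X])

# Ordinary least squares: every observation counts equally
beta_ols, *_ = np.linalg.lstsq(A, Y, rcond=None)

# Weighted least squares: weight = 1 / variance, so noisy points count less.
# Scaling each row by sqrt(weight) turns WLS into an ordinary lstsq problem.
weights = 1.0 / X**2
sqrt_w = np.sqrt(weights)
beta_wls, *_ = np.linalg.lstsq(A * sqrt_w[:, None], Y * sqrt_w, rcond=None)

print("OLS [intercept, slope]:", beta_ols)
print("WLS [intercept, slope]:", beta_wls)
```

Both estimators target the same coefficients (2 and 3 here); the weighted fit simply relies less on the high-variance observations.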



2. Handling Multicollinearity

Detection

* Compute **Variance Inflation Factor (VIF)**
* Check **correlation matrix**
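
To show what these two checks compute, here is a minimal sketch on hypothetical features. The helper `vif` is written for this example (it is not a library function): for each feature it regresses that feature on the others and returns 1 / (1 − R²). The feature values and the way `rooms` is tied to `area` are assumptions.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    n, p = X.shape
    values = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on the remaining features plus an intercept
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        values.append(1.0 / (1.0 - r2))
    return np.array(values)

# Hypothetical features: rooms is strongly tied to area, age is unrelated
rng = np.random.default_rng(0)
area = rng.uniform(800, 3000, size=200)
rooms = area / 500 + rng.normal(0, 0.3, size=200)
age = rng.uniform(0, 50, size=200)
X = np.column_stack([area, rooms, age])

corr = np.corrcoef(X, rowvar=False)   # pairwise correlation matrix
vifs = vif(X)
print("Correlation matrix:\n", corr)
print("VIF (area, rooms, age):", vifs)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity.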

Solutions

1. **Remove or combine correlated features**

   * Example: Combine `Area` and `Rooms` into a single feature such as **area per room**

2. **Feature selection**

   * Use **stepwise regression** or **Lasso** to drop redundant predictors

3. **Regularization**

   * **Ridge Regression** → shrinks coefficients toward zero, reducing their variance
   * **Lasso Regression** → can shrink some coefficients exactly to zero, removing less important features

4. **Principal Component Analysis (PCA)**

   * Convert correlated features into uncorrelated components
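
As an illustration of the regularization option, the sketch below fits ridge regression in closed form, (XᵀX + λI)⁻¹Xᵀy, on standardized features. The helper `ridge_fit`, the synthetic house data, and the penalty value 50 are all assumptions made for this example; in practice the penalty would be tuned by cross-validation.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients for standardized X and centered y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical data: rooms is strongly correlated with area
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(800, 3000, size=n)
rooms = area / 500 + rng.normal(0, 0.3, size=n)
price = 50_000 + 100 * area + 10_000 * rooms + rng.normal(0, 20_000, size=n)

# Standardize features and center the target so no intercept is needed
X = np.column_stack([area, rooms])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = price - price.mean()

beta_ols = ridge_fit(Xs, yc, lam=0.0)     # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_fit(Xs, yc, lam=50.0)  # penalty shrinks the coefficients

print("OLS coefficients  :", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

The ridge coefficient vector is always shorter than the unpenalized one, which is what stabilizes the estimates when predictors are highly correlated.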

3. Ensuring a Robust Model

- Re-train the model after corrections
- Validate using **cross-validation**
- Check assumptions again
- Evaluate performance using **RMSE and R²**
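- Interpret coefficients carefully, especially after transformations (for example, with `log(Price)` as the target, a coefficient describes an approximate percentage change in price rather than an absolute one)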
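
The two evaluation metrics named above can be computed directly from the residuals. This is a minimal sketch using a small simple-regression dataset; for the house-price model the same formulas would be applied to held-out predictions from cross-validation.

```python
import numpy as np

# Small example dataset
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 4.3, 6.1, 7.9, 10.2])

# Fit a straight line and compute predictions
slope, intercept = np.polyfit(X, Y, 1)
Y_pred = slope * X + intercept

# RMSE: typical size of a prediction error, in the units of Y
rmse = np.sqrt(np.mean((Y - Y_pred) ** 2))

# R^2: share of the variance in Y explained by the model
ss_res = np.sum((Y - Y_pred) ** 2)
ss_tot = np.sum((Y - np.mean(Y)) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("RMSE:", rmse)
print("R^2 :", r2)
```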