Regression & Its Evaluation | **Assignment**

Ans 1- Simple Linear Regression (SLR) is a statistical technique used to model the relationship between one independent variable (X) and one dependent variable (Y) by fitting a straight line to the observed data. The goal is to predict the value of Y based on X.

The regression equation is:
Y = β₀ + β₁X + ε

Where:

β₀ = intercept

β₁ = slope of the line

ε = error term


Simple Linear Regression assumes a linear relationship between X and Y and is widely used for prediction and trend analysis.
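The least-squares estimates of β₀ and β₁ have a closed form, which can make the equation above more concrete. The following is a minimal sketch in NumPy, assuming a small illustrative dataset; the variable names `beta_0` and `beta_1` are chosen here only to mirror the symbols in the equation.

```python
import numpy as np

# Small illustrative dataset (assumed for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.1, 7.9, 10.2])

# Closed-form ordinary least squares for a single predictor:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# Residuals are the sample estimates of the error term
residuals = y - (beta_0 + beta_1 * x)

print(f"intercept = {beta_0:.3f}, slope = {beta_1:.3f}")
```

For these five points the estimates work out to an intercept of about 0.18 and a slope of about 1.98, and the residuals sum to zero, as they always do when an intercept is included.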


Ans 2- The key assumptions of Simple Linear Regression are:

1. Linearity: The relationship between X and Y must be linear.


2. Independence: Observations should be independent of each other.


3. Homoscedasticity: The variance of errors should be constant across all values of X.


4. Normality of Errors: Residuals should be normally distributed.


5. No Multicollinearity: Not applicable to simple regression as there is only one predictor.



Violation of these assumptions can lead to biased or inefficient coefficient estimates and to unreliable standard errors and hypothesis tests.




Ans 3- Heteroscedasticity occurs when the variance of the residuals is not constant across different levels of the independent variable. In such cases, residuals show unequal spread.

Why it is important to address:

Leads to inefficient coefficient estimates

Invalidates hypothesis tests

Causes unreliable confidence intervals


Solutions include:

Log transformation of variables

Weighted Least Squares

Robust standard errors
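Of the remedies listed above, Weighted Least Squares is perhaps the easiest to illustrate. The sketch below uses synthetic data whose noise grows with X and assumes the error variance is proportional to X², so each observation is weighted by 1/X². Both the data and the weighting rule are assumptions made for illustration; in practice the variance structure has to be estimated or justified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: noise standard deviation grows with x (assumed for illustration)
x = np.linspace(1, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: every observation counts equally
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares: weight = 1 / variance, assumed proportional to 1 / x**2.
# Scaling each row by sqrt(weight) turns WLS into an ordinary least-squares problem.
sqrt_w = 1.0 / x
beta_wls, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)

print("OLS [intercept, slope]:", beta_ols)
print("WLS [intercept, slope]:", beta_wls)
```

Both estimators should land near the true slope of 2; the benefit of WLS is lower variance of the estimates when the assumed weights are roughly right.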


Ans 4- Multiple Linear Regression (MLR) is an extension of simple linear regression where the dependent variable is predicted using two or more independent variables.

The equation is:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε

It is commonly used in real-world problems where outcomes depend on multiple factors, such as predicting house prices based on area, rooms, and location.
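As a rough illustration of the equation above, the sketch below fits a model with two predictors by least squares. The data and the coefficients (50,000, 120 and 15,000) are hypothetical; the target is built without noise so that the fit can be checked against known values.

```python
import numpy as np

# Hypothetical predictors: area in square feet and number of rooms
area = np.array([1200.0, 1500.0, 1800.0, 2000.0, 2400.0, 1000.0])
rooms = np.array([2.0, 3.0, 3.0, 4.0, 4.0, 3.0])

# Noise-free target built from known (made-up) coefficients
price = 50_000 + 120 * area + 15_000 * rooms

# Design matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones_like(area), area, rooms])

# Least-squares solution for [intercept, coefficient of area, coefficient of rooms]
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

print("intercept:", round(beta[0], 2))
print("per sq ft:", round(beta[1], 2))
print("per room: ", round(beta[2], 2))
```

Each coefficient is read as the expected change in Y for a one-unit change in that predictor while the other predictors are held fixed.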



Ans 5- Polynomial Regression models a non-linear relationship by transforming the independent variable into polynomial terms (X², X³, etc.).

Example equation (2nd degree):
Y = β₀ + β₁X + β₂X² + ε

Difference from Linear Regression:

Linear regression fits a straight line

Polynomial regression fits a curve

Polynomial regression captures complex patterns


Despite being non-linear in shape, polynomial regression is linear in parameters.
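The "linear in parameters" point can be seen by building the polynomial columns by hand and solving an ordinary least-squares problem. This is a minimal sketch with a noise-free quadratic whose coefficients (1, 2 and 0.5) are assumed for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Noise-free quadratic with assumed coefficients, so the fit can be verified
y = 1.0 + 2.0 * x + 0.5 * x**2

# The curve is non-linear in x, but the model is a linear combination of
# the columns [1, x, x^2], so plain least squares applies unchanged.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated [b0, b1, b2]:", np.round(beta, 6))
```

The solver never sees anything but a matrix of numbers; the curvature comes entirely from how the columns were constructed.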



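Ans 6-

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
Y = np.array([2.1, 4.3, 6.1, 7.9, 10.2])

model = LinearRegression()
model.fit(X, Y)  # estimates the intercept and slope by least squares

y_pred = model.predict(X)  # fitted values on the regression line

plt.scatter(X, Y, label="Data")
plt.plot(X, y_pred, color="red", label="Regression line")
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Simple Linear Regression")
plt.legend()
plt.show()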

Output:
A straight regression line, approximately Y = 0.18 + 1.98X, that fits the data points closely, indicating a strong linear relationship.


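Ans 7-

import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.DataFrame({
    'Area': [1200, 1500, 1800, 2000],
    'Rooms': [2, 3, 3, 4],
    'Price': [250000, 300000, 320000, 370000]
})

X = data[['Area', 'Rooms']]
y = data['Price']

model = LinearRegression()
model.fit(X, y)  # multiple linear regression of Price on Area and Rooms
print(model.intercept_, model.coef_)

X_const = sm.add_constant(X)  # VIF should be computed with an intercept column, otherwise the values are inflated
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])]  # index 0 is the constant

print(vif)

Result:

Both predictors have a VIF of about 7.7, because Area and Rooms have a correlation of about 0.93

VIF values below 5 are usually taken to indicate low multicollinearity; here the values exceed 5, so the individual coefficients should be interpreted with caution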





Ans 8-

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
Y = np.array([2.2, 4.8, 7.5, 11.2, 14.7])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)  # columns: 1, X, X²

model = LinearRegression()
model.fit(X_poly, Y)  # ordinary linear regression on the expanded features

X_line = np.linspace(1, 5, 100).reshape(-1, 1)  # dense grid for a smooth curve
Y_line = model.predict(poly.transform(X_line))

plt.scatter(X, Y)
plt.plot(X_line, Y_line)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Polynomial Regression (degree 2)")
plt.show()

Output:
A smooth curved line fitting the data accurately.



Ans 9-

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)
Y = np.array([15, 35, 40, 50, 65])

model = LinearRegression()
model.fit(X, Y)
predictions = model.predict(X)

residuals = Y - predictions  # observed minus fitted values

plt.scatter(predictions, residuals)
plt.axhline(0, linestyle="--")  # reference line at zero residual
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted Values")
plt.show()

Assessment:
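If the spread of the residuals widens or narrows systematically as the predicted values increase (a funnel shape), heteroscedasticity is likely present. With only five observations, the plot gives no more than a rough indication.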
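A visual check can be complemented with a numeric one. The sketch below implements a simple Breusch-Pagan style statistic (Koenker's studentized variant, LM = n · R² from regressing the squared residuals on X) in plain NumPy. The function name `breusch_pagan_lm` is introduced here for illustration, and the result is compared with 3.841, the 5% critical value of a chi-square distribution with one degree of freedom.

```python
import numpy as np

def breusch_pagan_lm(x, y):
    """Return n * R^2 from regressing squared OLS residuals on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])

    # Step 1: fit the main regression and square its residuals
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid_sq = (y - X @ beta) ** 2

    # Step 2: auxiliary regression of the squared residuals on the same predictor
    gamma, *_ = np.linalg.lstsq(X, resid_sq, rcond=None)
    ss_res = np.sum((resid_sq - X @ gamma) ** 2)
    ss_tot = np.sum((resid_sq - resid_sq.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot

    return len(y) * r_squared

x = np.array([10, 20, 30, 40, 50])
y = np.array([15, 35, 40, 50, 65])
lm = breusch_pagan_lm(x, y)

CHI2_CRIT_5PCT_DF1 = 3.841  # 5% critical value, chi-square with 1 degree of freedom
print(f"LM = {lm:.3f}, reject constant variance: {lm > CHI2_CRIT_5PCT_DF1}")
```

For these five points the statistic is about 1.37, below the critical value, so the test does not reject constant variance. With so few observations the test has very little power, so this should be read as "no evidence" rather than proof of homoscedasticity.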



Ans 10- As a data scientist working for a real estate company, ensuring the reliability of a house price prediction model is crucial. When heteroscedasticity and multicollinearity are detected, the following steps should be taken to address these issues and build a robust regression model.

1. Addressing Heteroscedasticity

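Heteroscedasticity occurs when the variance of residuals is not constant. The coefficient estimates remain unbiased but become inefficient, and the usual standard errors are biased, which makes hypothesis tests and confidence intervals unreliable.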

Steps to handle heteroscedasticity:

Residual Analysis: Plot residuals versus predicted values to confirm non-constant variance.

Data Transformation: Apply log, square root, or Box-Cox transformation on the dependent variable (house prices) to stabilize variance.

Robust Standard Errors: Use heteroscedasticity-robust standard errors to obtain valid hypothesis tests.

Weighted Least Squares (WLS): Assign weights to observations to reduce the impact of unequal variance.
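To make the robust standard errors step concrete, the sketch below computes White's heteroscedasticity-consistent (HC0) standard errors next to the classical ones. The house-price data are synthetic, with noise that grows with area, and every number in the data-generating line is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic house data: price noise grows with area (assumed for illustration)
area = rng.uniform(500, 3000, size=300)
price = 40_000 + 150 * area + rng.normal(scale=20 * area)

X = np.column_stack([np.ones_like(area), area])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
resid = price - X @ beta

n, k = X.shape
XtX_inv = np.linalg.inv(X.T @ X)

# Classical standard errors: assume a single constant error variance
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE [intercept, area]:", se_classical)
print("robust SE    [intercept, area]:", se_robust)
```

The coefficient estimates are identical in both cases; only the standard errors, and therefore the t-statistics and confidence intervals, change.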

2. Addressing Multicollinearity

Multicollinearity arises when independent variables such as area and number of rooms are highly correlated, leading to unstable coefficient estimates.

Steps to handle multicollinearity:

Variance Inflation Factor (VIF): Calculate VIF values to identify highly correlated predictors.

Feature Removal: Drop or combine variables with high VIF values.

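Feature Engineering: Combine correlated predictors into a single feature, such as area per room, instead of using multiple correlated variables. Features derived from the target, such as price per square foot, should not be used as predictors because they leak the value being predicted.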

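Regularization Techniques: Apply Ridge or Lasso regression to penalize large coefficients. This does not remove the correlation between predictors, but it stabilizes the coefficient estimates, and Lasso can drop redundant predictors by shrinking their coefficients to zero.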
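The effect of regularization on correlated predictors can be sketched with the closed-form Ridge solution. The data below are synthetic, with `rooms` constructed to be almost a copy of `area`; the helper `ridge_fit` and the penalty value `alpha=10.0` are assumptions chosen for illustration, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Two nearly collinear predictors (synthetic, standardized units)
area = rng.normal(size=n)
rooms = 0.95 * area + 0.05 * rng.normal(size=n)
y = 3 * area + 2 * rooms + rng.normal(scale=0.5, size=n)

# Standardize predictors and center the target so no intercept is needed
X = np.column_stack([area, rooms])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y_c = y - y.mean()

def ridge_fit(X, y, alpha):
    """Closed-form Ridge solution: (X'X + alpha * I)^-1 X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

beta_ols = ridge_fit(X, y_c, alpha=0.0)     # alpha = 0 reduces to ordinary least squares
beta_ridge = ridge_fit(X, y_c, alpha=10.0)  # positive alpha shrinks the coefficients

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With near-collinear columns the OLS coefficients can swing widely from sample to sample, while the Ridge coefficients are pulled toward smaller, more stable values. In practice the penalty strength would be chosen by cross-validation.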

3. Model Validation and Evaluation

Perform train-test split or cross-validation to check model generalization.

Evaluate performance using R², RMSE, and MAE.

Recheck residual plots to ensure assumptions are satisfied.
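The validation steps above can be sketched as a simple hold-out evaluation. The metric functions follow the standard definitions of RMSE, MAE and R²; the data are synthetic and the 80/20 split is an assumption made for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic house data (assumed for illustration)
rng = np.random.default_rng(7)
area = rng.uniform(800, 3000, size=200)
price = 50_000 + 130 * area + rng.normal(scale=15_000, size=200)

# 80/20 train-test split on shuffled indices
idx = rng.permutation(200)
train, test = idx[:160], idx[160:]

# Fit on the training rows only, then predict the held-out rows
X = np.column_stack([np.ones_like(area), area])
beta, *_ = np.linalg.lstsq(X[train], price[train], rcond=None)
pred = X[test] @ beta

print(f"RMSE = {rmse(price[test], pred):.0f}")
print(f"MAE  = {mae(price[test], pred):.0f}")
print(f"R2   = {r2(price[test], pred):.3f}")
```

RMSE and MAE are in the units of the target (here, currency), with RMSE penalizing large errors more heavily, while R² is unitless. Reporting them on held-out data rather than on the training data is what makes them a check on generalization.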

