# Regression 1: Linear, Multiple, and Polynomial Regression
This notebook covers key concepts in regression analysis, including simple and multiple linear regression, assumptions, interpretation, gradient descent, multicollinearity, and polynomial regression.

## Q1. Explain the difference between simple linear regression and multiple linear regression. Provide an example of each.

**Simple linear regression** models the relationship between a single independent variable (X) and a dependent variable (Y) using a straight line: Y = b0 + b1*X.

**Example:** Predicting house price (Y) based on square footage (X).

**Multiple linear regression** models the relationship between two or more independent variables (X1, X2, ...) and a dependent variable (Y): Y = b0 + b1*X1 + b2*X2 + ...

**Example:** Predicting house price (Y) based on square footage (X1), number of bedrooms (X2), and age of the house (X3).

In [None]:
# Simple linear regression example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Example data
X = np.array([1000, 1500, 2000, 2500, 3000]).reshape(-1, 1)
Y = np.array([200, 250, 300, 350, 400])

model = LinearRegression()
model.fit(X, Y)
plt.scatter(X, Y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.xlabel('Square Footage')
plt.ylabel('House Price (in $1000s)')
plt.title('Simple Linear Regression')
plt.show()

In [None]:
# Multiple linear regression example
import pandas as pd

# Example data
house_data = pd.DataFrame({
    'SquareFootage': [1000, 1500, 2000, 2500, 3000],
    'Bedrooms': [2, 3, 3, 4, 4],
    'Age': [10, 5, 8, 3, 1],
    'Price': [200, 250, 300, 350, 400]
})

X_multi = house_data[['SquareFootage', 'Bedrooms', 'Age']]
Y_multi = house_data['Price']
model_multi = LinearRegression()
model_multi.fit(X_multi, Y_multi)
print('Coefficients:', model_multi.coef_)
print('Intercept:', model_multi.intercept_)

## Q2. Discuss the assumptions of linear regression. How can you check whether these assumptions hold in a given dataset?

**Assumptions of linear regression:**
1. Linearity: Relationship between independent and dependent variables is linear.
2. Independence: Observations are independent.
3. Homoscedasticity: Constant variance of errors.
4. Normality: Errors are normally distributed.
5. No multicollinearity (for multiple regression).

**How to check:**
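- Linearity: Plot residuals vs. fitted values. The residuals should scatter randomly around zero with no curve or trend.
- Independence: Review how the data were collected. For time-ordered data, check the residuals for autocorrelation (e.g., with the Durbin-Watson statistic).
- Homoscedasticity: In the same residuals vs. fitted values plot, the vertical spread should be roughly constant. A funnel shape suggests heteroscedasticity.
- Normality: Q-Q plot of residuals. The points should lie close to the reference line.
- Multicollinearity: Calculate the VIF (Variance Inflation Factor) for each predictor.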

In [None]:
# Example: Checking assumptions
import statsmodels.api as sm

X_const = sm.add_constant(X)
model_sm = sm.OLS(Y, X_const).fit()
residuals = model_sm.resid

# Q-Q plot for normality
sm.qqplot(residuals, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()

# Residuals vs. fitted values
plt.scatter(model_sm.fittedvalues, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()

## Q3. How do you interpret the slope and intercept in a linear regression model? Provide an example using a real-world scenario.

- **Slope:** The change in the dependent variable (Y) for a one-unit increase in the independent variable (X).
- **Intercept:** The expected value of Y when X = 0.

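**Example:** In a model predicting salary (Y) based on years of experience (X), a slope of 5,000 means that each additional year of experience is associated with a $5,000 increase in predicted salary. The intercept is the predicted salary for someone with zero years of experience. The intercept is only meaningful if X = 0 is a realistic value within the range of the data.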
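The salary scenario can be sketched in code. The numbers below are hypothetical and chosen so that the fitted line is exact. They are not real salary data, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary in dollars
years = np.array([0, 1, 2, 3, 4, 5]).reshape(-1, 1)
salary = np.array([40000, 45000, 50000, 55000, 60000, 65000])

salary_model = LinearRegression().fit(years, salary)
slope = salary_model.coef_[0]         # change in predicted salary per extra year
intercept = salary_model.intercept_   # predicted salary at zero years

print(f"Slope: {slope:.0f} dollars per additional year")
print(f"Intercept: {intercept:.0f} dollars at zero years of experience")
```

Because the toy data lie exactly on a line, the fitted slope and intercept match the values used to build it. With real data they would be estimates with some uncertainty.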

## Q4. Explain the concept of gradient descent. How is it used in machine learning?

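**Gradient descent** is an optimization algorithm used to minimize the cost function in machine learning models. Starting from an initial guess, it repeatedly computes the gradient of the cost function with respect to the model parameters and moves the parameters a small step in the opposite direction (the direction of steepest decrease). The step size is set by the learning rate, and the process repeats until the cost stops decreasing meaningfully. It is widely used for training linear regression, logistic regression, and neural networks.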
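As a minimal sketch, the procedure can be written out for simple linear regression with a mean squared error cost. The toy data, learning rate, and iteration count below are assumptions picked for illustration, not recommended defaults.

```python
import numpy as np

# Toy data generated from y = 2 + 3x (illustrative values, no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

b0, b1 = 0.0, 0.0       # initial guess for intercept and slope
learning_rate = 0.05    # step size (assumed)
n_iterations = 5000     # number of update steps (assumed)

for _ in range(n_iterations):
    error = (b0 + b1 * x) - y            # prediction error for each point
    grad_b0 = 2.0 * error.mean()         # d(MSE)/d(b0)
    grad_b1 = 2.0 * (error * x).mean()   # d(MSE)/d(b1)
    b0 -= learning_rate * grad_b0        # step against the gradient
    b1 -= learning_rate * grad_b1

print(f"Estimated intercept: {b0:.4f}, estimated slope: {b1:.4f}")
```

If the learning rate is too large, the updates can overshoot and diverge. If it is too small, convergence is slow. Scaling the features usually makes the choice less sensitive.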

## Q5. Describe the multiple linear regression model. How does it differ from simple linear regression?

**Multiple linear regression** models the relationship between a dependent variable and two or more independent variables. The equation is:
Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn

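It differs from simple linear regression, which uses only one independent variable. In the multiple regression model, each coefficient bi is the expected change in Y for a one-unit increase in Xi while the other independent variables are held constant.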

## Q6. Explain the concept of multicollinearity in multiple linear regression. How can you detect and address this issue?

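**Multicollinearity** occurs when independent variables in a multiple regression model are highly correlated with each other. It inflates the standard errors of the coefficient estimates, so the estimates become unstable (small changes in the data can change them substantially) and hard to interpret individually.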

**Detection:**
- Calculate the Variance Inflation Factor (VIF) for each variable. A VIF above 5 (or above 10, under a looser rule of thumb) is commonly taken to indicate problematic multicollinearity.

**Addressing:**
- Remove or combine correlated variables.
- Use dimensionality reduction (e.g., PCA).

In [None]:
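# Example: Calculating VIF
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add an intercept column first; without it, the VIFs are not computed against a model with an intercept
X_vif = sm.add_constant(X_multi)

vif_data = pd.DataFrame()
vif_data['feature'] = X_multi.columns
# Column 0 is the constant, so the predictors start at index 1
vif_data['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])]
print(vif_data)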
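
The effect of removing a correlated variable can be sketched with a small synthetic example. The data and the helper `vif_scores` below are hypothetical. The helper computes the VIF directly from its definition, 1 / (1 - R²), where R² comes from regressing one feature on the others.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_scores(features):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of a 2-D array."""
    scores = []
    for j in range(features.shape[1]):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        # R^2 from regressing column j on the remaining columns (with intercept)
        r_squared = LinearRegression().fit(others, target).score(others, target)
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)

# Synthetic features: x2 is almost exactly 2 * x1, x3 is roughly unrelated
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = 2.0 * x1 + np.array([0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2])
x3 = np.array([5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0])

vif_before = vif_scores(np.column_stack([x1, x2, x3]))  # all three features
vif_after = vif_scores(np.column_stack([x1, x3]))       # x2 removed

print("VIF with x1, x2, x3:", vif_before)
print("VIF with x1, x3:", vif_after)
```

In this sketch the VIFs for `x1` and `x2` are very large while both are present, and the remaining VIFs fall close to 1 once `x2` is dropped.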

## Q7. Describe the polynomial regression model. How is it different from linear regression?

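**Polynomial regression** models the relationship between the independent variable and the dependent variable as an nth-degree polynomial: Y = b0 + b1*X + b2*X^2 + ... + bn*X^n. Unlike simple linear regression, which fits a straight line, it can capture curved (non-linear) relationships. The model is still linear in its coefficients, so it can be fit with ordinary linear regression on the transformed features (X, X^2, ..., X^n).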

In [None]:
# Example: Polynomial regression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model_poly = LinearRegression()
model_poly.fit(X_poly, Y)
plt.scatter(X, Y, color='blue')
plt.plot(X, model_poly.predict(X_poly), color='green')
plt.xlabel('Square Footage')
plt.ylabel('House Price (in $1000s)')
plt.title('Polynomial Regression (Degree 2)')
plt.show()

## Q8. What are the advantages and disadvantages of polynomial regression compared to linear regression? In what situations would you prefer to use polynomial regression?

**Advantages:**
- Can model non-linear relationships.

**Disadvantages:**
- Prone to overfitting with high-degree polynomials.
- Less interpretable than linear regression.

**Use polynomial regression** when the relationship between variables is clearly curved (for example, when the residuals of a straight-line fit show a systematic pattern) and a low-degree polynomial is enough to capture it.
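
As a rough illustration of that situation, the sketch below fits a straight line and a degree-2 polynomial to hypothetical, noise-free curved data and compares their R² scores. The data and variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical curved data: y = 1 + 0.5 * x^2 (no noise)
x_curve = np.linspace(-3, 3, 13).reshape(-1, 1)
y_curve = 1 + 0.5 * x_curve.ravel() ** 2

# Straight-line fit
linear_fit = LinearRegression().fit(x_curve, y_curve)
r2_linear = r2_score(y_curve, linear_fit.predict(x_curve))

# Degree-2 polynomial fit on the expanded features [1, x, x^2]
x_curve_poly = PolynomialFeatures(degree=2).fit_transform(x_curve)
quad_fit = LinearRegression().fit(x_curve_poly, y_curve)
r2_quad = r2_score(y_curve, quad_fit.predict(x_curve_poly))

print(f"R^2 (linear): {r2_linear:.3f}")
print(f"R^2 (degree 2): {r2_quad:.3f}")
```

Both scores here are measured on the training data. To judge whether a higher degree is overfitting, the comparison would need to be made on held-out data.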