# <div align="center" style="color: brown"><strong>Multiple Linear Regression Tutorial</strong></div>

## <div style="color: red"><strong>Part 1. Introduction to Multiple Linear Regression</strong></div>

Multiple linear regression is an extension of simple linear regression that allows us to model the relationship between a dependent variable and multiple independent variables. While simple linear regression has one independent variable (X) and one dependent variable (Y), multiple linear regression has two or more independent variables.

### Mathematical Representation

The mathematical equation for multiple linear regression is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

Where:
- Y is the dependent variable (what we're trying to predict)
- X₁, X₂, ..., Xₙ are the independent variables (features)
- β₀ is the y-intercept (constant term)
- β₁, β₂, ..., βₙ are the coefficients for each independent variable
- ε is the error term (the variation in Y that the predictors do not explain; the residuals of a fitted model are estimates of it)
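
As a rough sketch of how the coefficients in this equation are estimated, the snippet below fits them by ordinary least squares using only NumPy. The data, the "true" coefficients, and the names `X_demo`, `design`, and `beta_hat` are invented for illustration and are not part of the tutorial's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations, 2 predictors
n = 200
X_demo = rng.normal(size=(n, 2))
true_beta = np.array([1.5, 2.0, -3.0])   # [β0, β1, β2], chosen arbitrarily
noise = rng.normal(scale=0.1, size=n)    # plays the role of ε

# Prepend a column of ones so that β0 is estimated along with β1 and β2
design = np.column_stack([np.ones(n), X_demo])
y_demo = design @ true_beta + noise

# Least-squares solution: the beta that minimizes ||y - design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
print("Estimated coefficients:", beta_hat)
```

With little noise and enough observations, the estimates should land close to the coefficients used to generate the data.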

### Key Differences from Simple Linear Regression

- **Multiple predictors**: Uses two or more independent variables to predict the outcome
- **Higher dimensional space**: While simple linear regression models a line, multiple regression models a plane (2 predictors) or hyperplane (3+ predictors)
- **More complex interpretation**: Each coefficient represents the change in Y for a one-unit change in Xᵢ while holding all other variables constant
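
The "holding all other variables constant" interpretation can be checked numerically. In this minimal sketch (made-up data, hypothetical names such as `demo_model`), raising one predictor by one unit while leaving the others fixed changes the prediction by exactly that predictor's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data with three predictors
X_demo = rng.normal(size=(100, 3))
y_demo = 4.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)

base = np.array([[1.0, 2.0, 3.0]])
bumped = base.copy()
bumped[0, 1] += 1.0  # raise only the second predictor by one unit

diff = demo_model.predict(bumped)[0] - demo_model.predict(base)[0]
print("Change in prediction:", diff)
print("Coefficient of the second predictor:", demo_model.coef_[1])
```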

### Assumptions

Multiple linear regression relies on the same assumptions as simple linear regression, plus one that only arises when there are several predictors:

1. **Linearity**: The relationship between independent and dependent variables is linear
2. **Independence**: Observations are independent of each other
3. **Homoscedasticity**: Constant variance in errors
4. **Normality**: Residuals are normally distributed
5. **No multicollinearity**: Independent variables are not highly correlated with each other (unique to multiple regression)

## <div style="color: red"><strong>Part 2. Implementing Multiple Linear Regression in Python</strong></div>

Let's start by importing the necessary libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

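# Set the style for our plots
# (Matplotlib's 'seaborn-whitegrid' style name was deprecated in 3.6 and removed in later
# releases, so we rely on seaborn's own setter, which gives the same look)
sns.set_style("whitegrid")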

### 2.1 Creating a Synthetic Dataset

Let's create a synthetic dataset to demonstrate multiple linear regression. We'll simulate a housing price dataset with three features: house size (in square feet), number of bedrooms, and age of the house (in years).

In [None]:
# Set a random seed for reproducibility
np.random.seed(42)

# Create 100 samples
n_samples = 100

# Generate features
house_size = np.random.normal(1500, 500, n_samples)  # House size in square feet
bedrooms = np.random.randint(1, 6, n_samples)        # Number of bedrooms (1-5)
house_age = np.clip(np.random.normal(15, 10, n_samples), 0, None)  # House age in years (clipped at 0 so no age is negative)

# Generate target variable (house price) with some noise
# Formula: price = 50000 + 100*size + 25000*bedrooms - 5000*age + noise
price = 50000 + 100 * house_size + 25000 * bedrooms - 5000 * house_age + np.random.normal(0, 25000, n_samples)

# Create a DataFrame
data = pd.DataFrame({
    'Size': house_size,
    'Bedrooms': bedrooms,
    'Age': house_age,
    'Price': price
})

# Display the first few rows
data.head()

### 2.2 Exploratory Data Analysis

Let's explore our dataset to better understand the relationships between variables.

In [None]:
# Display summary statistics
data.describe()

In [None]:
# Create a correlation matrix and visualize it
correlation_matrix = data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix', fontsize=18)
plt.show()

In [None]:
# Create pairplots to visualize relationships between all variables
sns.pairplot(data, height=2.5)
plt.tight_layout()
plt.show()

In [None]:
# Create individual scatter plots with regression lines for each feature vs price
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Size vs Price
sns.regplot(x='Size', y='Price', data=data, ax=axes[0])
axes[0].set_title('Size vs Price', fontsize=16)
axes[0].set_xlabel('House Size (sq ft)', fontsize=14)
axes[0].set_ylabel('Price ($)', fontsize=14)

# Bedrooms vs Price
sns.regplot(x='Bedrooms', y='Price', data=data, ax=axes[1])
axes[1].set_title('Bedrooms vs Price', fontsize=16)
axes[1].set_xlabel('Number of Bedrooms', fontsize=14)
axes[1].set_ylabel('Price ($)', fontsize=14)

# Age vs Price
sns.regplot(x='Age', y='Price', data=data, ax=axes[2])
axes[2].set_title('Age vs Price', fontsize=16)
axes[2].set_xlabel('House Age (years)', fontsize=14)
axes[2].set_ylabel('Price ($)', fontsize=14)

plt.tight_layout()
plt.show()

### 2.3 Preparing the Data for Modeling

Let's split our data into training and testing sets and prepare it for modeling.

In [None]:
# Define features (X) and target variable (y)
X = data[['Size', 'Bedrooms', 'Age']]
y = data['Price']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

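# Scale the features (plain OLS predictions do not depend on scaling, but standardized
# features put the coefficients on a comparable footing; fit the scaler on the training set only)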
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")
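
The reason for calling `fit_transform` on the training set but only `transform` on the test set is that the scaling statistics should come from the training data alone. The sketch below, on made-up data with hypothetical names (`demo_scaler`, `train`, `test`), shows that `StandardScaler` applies the training mean and standard deviation to both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical single feature (e.g. house size) split into train and test parts
train = rng.normal(loc=1500, scale=500, size=(80, 1))
test = rng.normal(loc=1500, scale=500, size=(20, 1))

demo_scaler = StandardScaler().fit(train)   # statistics come from the training set only
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)

# transform() computes (x - mean_) / scale_ using the stored training statistics
manual = (test - demo_scaler.mean_) / demo_scaler.scale_

print("Train mean/std after scaling:", train_scaled.mean(), train_scaled.std())
print("Test mean/std after scaling: ", test_scaled.mean(), test_scaled.std())
```

The training set ends up with mean 0 and standard deviation 1 by construction; the test set is generally only close to those values, which is expected.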

## <div style="color: red"><strong>Part 3. Training and Evaluating the Multiple Linear Regression Model</strong></div>

Now let's train one model on the original features and one on the standardized features so we can compare the two.

In [None]:
# Train the model with original (unscaled) features
model = LinearRegression()
model.fit(X_train, y_train)

# Train another model with scaled features
model_scaled = LinearRegression()
model_scaled.fit(X_train_scaled, y_train)

In [None]:
# Print model coefficients and intercept for unscaled model
print("Unscaled Model:")
print(f"Intercept: ${model.intercept_:.2f}")
print("Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"- {feature}: ${coef:.2f} per unit change")

print("\nInterpreting these coefficients:")
print(f"- For each additional square foot, the house price increases by ${model.coef_[0]:.2f}")
print(f"- For each additional bedroom, the house price increases by ${model.coef_[1]:.2f}")
print(f"- For each additional year of age, the house price decreases by ${-model.coef_[2]:.2f}")

### 3.1 Model Evaluation

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred_scaled = model_scaled.predict(X_test_scaled)

# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

mse_scaled = mean_squared_error(y_test, y_pred_scaled)
rmse_scaled = np.sqrt(mse_scaled)
r2_scaled = r2_score(y_test, y_pred_scaled)

print("Unscaled Model Performance:")
print(f"Mean Squared Error (MSE): {mse:.2f} (squared dollars)")
print(f"Root Mean Squared Error (RMSE): ${rmse:.2f}")
print(f"R² Score: {r2:.4f}")

print("\nScaled Model Performance:")
print(f"Mean Squared Error (MSE): {mse_scaled:.2f} (squared dollars)")
print(f"Root Mean Squared Error (RMSE): ${rmse_scaled:.2f}")
print(f"R² Score: {r2_scaled:.4f}")
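
The two sets of metrics should agree up to floating-point error, because standardizing the features is an invertible linear transformation and ordinary least squares (with an intercept) produces the same fitted values either way. A small self-contained check on made-up data, with hypothetical names such as `raw_fit` and `scaled_fit`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical housing-like data: size, bedrooms, age
X_demo = rng.normal(loc=[1500, 3, 15], scale=[500, 1, 10], size=(100, 3))
y_demo = 50000 + X_demo @ np.array([100.0, 25000.0, -5000.0]) + rng.normal(scale=25000, size=100)

X_tr, X_te, y_tr = X_demo[:80], X_demo[80:], y_demo[:80]

demo_scaler = StandardScaler().fit(X_tr)
raw_fit = LinearRegression().fit(X_tr, y_tr)
scaled_fit = LinearRegression().fit(demo_scaler.transform(X_tr), y_tr)

pred_raw = raw_fit.predict(X_te)
pred_scaled = scaled_fit.predict(demo_scaler.transform(X_te))
print("Largest absolute difference in predictions:", np.max(np.abs(pred_raw - pred_scaled)))
```

This invariance holds for unregularized linear regression; it does not carry over to penalized models such as Ridge or Lasso, where the scale of each feature affects the penalty.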

In [None]:
# Visualize Actual vs Predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Price', fontsize=14)
plt.ylabel('Predicted Price', fontsize=14)
plt.title('Actual vs Predicted House Prices', fontsize=16)
plt.grid(True)
plt.show()

In [None]:
# Plot residuals to check for patterns
residuals = y_test - y_pred

plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price', fontsize=14)
plt.ylabel('Residuals', fontsize=14)
plt.title('Residual Plot', fontsize=16)
plt.grid(True)
plt.show()

# Plot a histogram of residuals to check normality
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.xlabel('Residual Value', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title('Distribution of Residuals', fontsize=16)
plt.grid(True)
plt.show()
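
A histogram is only a visual check. As a complement, a formal test such as Shapiro-Wilk can be applied to the residuals. The sketch below uses simulated residuals (the name `demo_residuals` is hypothetical); in the tutorial the input would be `residuals = y_test - y_pred`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical residuals standing in for y_test - y_pred
demo_residuals = rng.normal(loc=0.0, scale=25000.0, size=20)

# Shapiro-Wilk test: the null hypothesis is that the sample comes from a normal distribution
statistic, p_value = stats.shapiro(demo_residuals)
print(f"W = {statistic:.3f}, p-value = {p_value:.3f}")
```

A small p-value is evidence against normality. With only 20 test observations the test has little power, so a large p-value should not be read as proof that the residuals are normal.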

## <div style="color: red"><strong>Part 4. Feature Importance and Model Interpretation</strong></div>

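Let's analyze which features have the largest impact on the prediction. Because the features are measured in different units, we compare standardized coefficients rather than raw ones.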

In [None]:
# Calculate standardized coefficients (for better comparison)
coef_scaled = model_scaled.coef_
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Raw Coefficient': model.coef_,
    'Standardized Coefficient': coef_scaled
})

feature_importance = feature_importance.sort_values(by='Standardized Coefficient', key=abs, ascending=False)
feature_importance

In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Standardized Coefficient'])
plt.xlabel('Standardized Coefficient', fontsize=14)
plt.ylabel('Feature', fontsize=14)
plt.title('Feature Importance', fontsize=16)
plt.grid(True)
plt.show()
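
The raw and standardized coefficients are directly related: a standardized coefficient equals the raw coefficient multiplied by the standard deviation of that feature in the training data. A minimal check on made-up data (names such as `raw_coef` and `std_coef` are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)

# Hypothetical housing-like training data: size, bedrooms, age
X_demo = rng.normal(loc=[1500, 3, 15], scale=[500, 1, 10], size=(80, 3))
y_demo = 50000 + X_demo @ np.array([100.0, 25000.0, -5000.0]) + rng.normal(scale=25000, size=80)

demo_scaler = StandardScaler().fit(X_demo)
raw_coef = LinearRegression().fit(X_demo, y_demo).coef_
std_coef = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo).coef_

# scale_ holds each feature's standard deviation on the data the scaler was fitted to
print("Raw coefficient x feature std:", raw_coef * demo_scaler.scale_)
print("Standardized coefficient:     ", std_coef)
```

Read this way, a standardized coefficient is the change in the predicted price for a one-standard-deviation change in the feature.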

## <div style="color: red"><strong>Part 5. Making Predictions with the Model</strong></div>

Let's use our model to make predictions for new houses.

In [None]:
# Create some new house data
new_houses = pd.DataFrame({
    'Size': [1200, 1800, 2500, 3000],
    'Bedrooms': [2, 3, 4, 5],
    'Age': [20, 10, 5, 1]
})

# Make predictions
new_predictions = model.predict(new_houses)

# Add predictions to the DataFrame
new_houses['Predicted Price'] = new_predictions

# Display the results
new_houses

## <div style="color: red"><strong>Part 6. Advanced Techniques for Multiple Regression</strong></div>

### 6.1 Adding Polynomial Features

Sometimes, the relationship between features and the target is not strictly linear. We can add polynomial terms to our model to capture non-linear relationships.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features (degree=2 includes squared terms and interaction terms)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Get the feature names
feature_names = poly.get_feature_names_out(X.columns)
print(f"Original features: {X.columns.tolist()}")
print(f"Polynomial features: {feature_names.tolist()}")

In [None]:
# Train a model with polynomial features
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)

# Make predictions with the polynomial model
y_pred_poly = poly_model.predict(X_test_poly)

# Calculate performance metrics
mse_poly = mean_squared_error(y_test, y_pred_poly)
rmse_poly = np.sqrt(mse_poly)
r2_poly = r2_score(y_test, y_pred_poly)

print("Polynomial Model Performance:")
print(f"Mean Squared Error (MSE): {mse_poly:.2f} (squared dollars)")
print(f"Root Mean Squared Error (RMSE): ${rmse_poly:.2f}")
print(f"R² Score: {r2_poly:.4f}")
print("\nCompare with original model:")
print(f"Original model R²: {r2:.4f}")
# The difference can be negative: extra polynomial terms may overfit the training data
print(f"Change in R²: {(r2_poly - r2) * 100:.2f} percentage points")
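
A single train/test split with 20 test points gives a noisy comparison. One way to get a steadier picture is k-fold cross-validation, sketched here under the assumption that the data resemble the tutorial's synthetic housing set (regenerated below with hypothetical names so the block runs on its own).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
n = 100

# Hypothetical data generated from a purely linear relationship
X_demo = np.column_stack([
    rng.normal(1500, 500, n),   # size
    rng.integers(1, 6, n),      # bedrooms (1-5)
    rng.normal(15, 10, n),      # age
])
y_demo = 50000 + X_demo @ np.array([100.0, 25000.0, -5000.0]) + rng.normal(0, 25000, n)

# Putting PolynomialFeatures in a pipeline re-fits it inside every fold
poly_pipe = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())

linear_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
poly_scores = cross_val_score(poly_pipe, X_demo, y_demo, cv=5, scoring="r2")

print(f"Linear model mean R²:   {linear_scores.mean():.4f}")
print(f"Degree-2 model mean R²: {poly_scores.mean():.4f}")
```

Since the simulated relationship is linear, the degree-2 model has no real structure to gain from its extra terms, so any difference between the two scores mostly reflects noise.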

### 6.2 Dealing with Multicollinearity

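Multicollinearity occurs when two or more independent variables are highly correlated with each other. It inflates the variance of the coefficient estimates, which makes individual coefficients unstable and hard to interpret even when the model's predictions remain accurate. Let's demonstrate how to detect and handle it.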

In [None]:
# Create a dataset with multicollinearity
np.random.seed(42)
n_samples = 100

# Create two highly correlated features
base_feature = np.random.normal(0, 1, n_samples)
feature1 = base_feature + np.random.normal(0, 0.1, n_samples)  # Feature 1 is very similar to base_feature
feature2 = base_feature + np.random.normal(0, 0.1, n_samples)  # Feature 2 is very similar to base_feature
feature3 = np.random.normal(0, 1, n_samples)  # Independent feature

# Generate target variable
target = 3 * base_feature + 2 * feature3 + np.random.normal(0, 1, n_samples)

# Create a DataFrame
collinear_data = pd.DataFrame({
    'Feature1': feature1,
    'Feature2': feature2,
    'Feature3': feature3,
    'Target': target
})

# Check the correlation matrix
corr_matrix = collinear_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix - Multicollinearity Example', fontsize=16)
plt.show()

Variance Inflation Factor (VIF) is a common way to detect multicollinearity. For each predictor, VIF measures how much the variance of its estimated coefficient is inflated by that predictor's correlation with the other predictors.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Calculate VIF
X_collinear = collinear_data[['Feature1', 'Feature2', 'Feature3']]
# variance_inflation_factor does not add an intercept itself, so include a constant
# column to get the usual (centered) VIF
X_vif = add_constant(X_collinear)
vif_data = pd.DataFrame()
vif_data["Feature"] = X_collinear.columns
# Column 0 is the constant, so compute VIF for columns 1..k only
vif_data["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])]

print("VIF Values:")
print(vif_data)
print("\nInterpretation (common rules of thumb):")
print("- VIF = 1: No multicollinearity")
print("- 1 < VIF < 5: Moderate multicollinearity")
print("- 5 <= VIF <= 10: High multicollinearity")
print("- VIF > 10: Severe multicollinearity")
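
To make the definition concrete: the VIF of predictor j is 1 / (1 - R_j²), where R_j² is the R² obtained by regressing predictor j on all the other predictors. The sketch below computes it by hand with scikit-learn; `manual_vif` is a hypothetical helper written for this illustration, and the data are regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def manual_vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R^2 from regressing column j on the remaining columns (intercept included)
        r2_j = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)

rng = np.random.default_rng(8)
base = rng.normal(size=100)

# Two nearly identical features plus one independent feature
X_demo = np.column_stack([
    base + rng.normal(scale=0.1, size=100),
    base + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
])

vifs = manual_vif(X_demo)
print("VIF per feature:", vifs)
```

The two near-duplicate features should show large VIFs, while the independent feature should stay close to 1.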

### 6.3 Regularization: Ridge and Lasso Regression

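To handle multicollinearity and overfitting, we can use regularization techniques like Ridge and Lasso regression. Ridge adds an L2 penalty that shrinks coefficients toward zero, while Lasso adds an L1 penalty that can set some coefficients exactly to zero.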

In [None]:
from sklearn.linear_model import Ridge, Lasso

# Prepare training and testing data
X_collinear_train, X_collinear_test, y_collinear_train, y_collinear_test = train_test_split(
    X_collinear, collinear_data['Target'], test_size=0.2, random_state=42
)

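# Scale features (a separate scaler, so the one fitted on the housing data is not overwritten)
collinear_scaler = StandardScaler()
X_collinear_train_scaled = collinear_scaler.fit_transform(X_collinear_train)
X_collinear_test_scaled = collinear_scaler.transform(X_collinear_test)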

# Train OLS, Ridge, and Lasso models
ols_model = LinearRegression().fit(X_collinear_train_scaled, y_collinear_train)
ridge_model = Ridge(alpha=1.0).fit(X_collinear_train_scaled, y_collinear_train)
lasso_model = Lasso(alpha=0.1).fit(X_collinear_train_scaled, y_collinear_train)

# Compare coefficients
coef_df = pd.DataFrame({
    'Feature': X_collinear.columns,
    'OLS': ols_model.coef_,
    'Ridge': ridge_model.coef_,
    'Lasso': lasso_model.coef_
})

coef_df

In [None]:
# Visualize coefficients
plt.figure(figsize=(12, 6))
bar_width = 0.25
x = np.arange(len(coef_df['Feature']))

plt.bar(x - bar_width, coef_df['OLS'], width=bar_width, label='OLS', color='blue')
plt.bar(x, coef_df['Ridge'], width=bar_width, label='Ridge', color='green')
plt.bar(x + bar_width, coef_df['Lasso'], width=bar_width, label='Lasso', color='red')

plt.xlabel('Feature', fontsize=14)
plt.ylabel('Coefficient Value', fontsize=14)
plt.title('Comparison of Regression Coefficients', fontsize=16)
plt.xticks(x, coef_df['Feature'])
plt.legend()
plt.grid(True)
plt.show()

In [None]:
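# Evaluate model performance
# (distinct variable names keep this loop from overwriting `model`, `y_pred`, `r2`, and `rmse` from Part 3)
models = {'OLS': ols_model, 'Ridge': ridge_model, 'Lasso': lasso_model}
model_scores = {}

for name, fitted_model in models.items():
    y_pred_collinear = fitted_model.predict(X_collinear_test_scaled)
    r2_collinear = r2_score(y_collinear_test, y_pred_collinear)
    rmse_collinear = np.sqrt(mean_squared_error(y_collinear_test, y_pred_collinear))
    model_scores[name] = {'R²': r2_collinear, 'RMSE': rmse_collinear}

pd.DataFrame(model_scores).T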
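
To see what the Ridge penalty does mechanically, the sketch below reproduces scikit-learn's Ridge coefficients from the closed-form solution (ZᵀZ + αI)⁻¹Zᵀy on standardized features. The data are regenerated with hypothetical names (`Z`, `ridge_manual`) so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
base = rng.normal(size=100)

# Two nearly identical features plus one independent feature
X_demo = np.column_stack([
    base + rng.normal(scale=0.1, size=100),
    base + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
])
y_demo = 3 * base + 2 * X_demo[:, 2] + rng.normal(size=100)

Z = StandardScaler().fit_transform(X_demo)   # centered, unit-variance columns
y_centered = y_demo - y_demo.mean()           # the intercept is not penalized
alpha = 1.0

# Closed-form ridge solution: solve (ZᵀZ + αI) w = Zᵀy
ridge_manual = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ y_centered)
ridge_sklearn = Ridge(alpha=alpha).fit(Z, y_demo).coef_

print("Closed form: ", ridge_manual)
print("scikit-learn:", ridge_sklearn)
```

Adding αI to ZᵀZ keeps the matrix well conditioned even when columns are nearly collinear, which is why Ridge coefficients are more stable than OLS ones here. Note that scikit-learn scales the two objectives differently (Lasso divides the squared-error term by 2n, Ridge does not), so the `alpha` values used for Ridge and Lasso are not directly comparable.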

## <div style="color: red"><strong>Part 7. Summary and Conclusion</strong></div>

In this tutorial, we've covered:

1. **Theory of Multiple Linear Regression**
   - Mathematical representation and key concepts
   - Differences from simple linear regression
   - Assumptions of the model

2. **Implementing Multiple Linear Regression in Python**
   - Creating and exploring synthetic datasets
   - Preparing data for modeling
   - Training and evaluating the model
   - Feature importance and interpretation

3. **Advanced Techniques**
   - Polynomial regression for non-linear relationships
   - Detecting and handling multicollinearity
   - Regularization with Ridge and Lasso regression

### Key Takeaways

- Multiple linear regression extends simple linear regression by incorporating multiple predictors
- Feature scaling can be important for fair comparison of coefficients
- Polynomial features can help capture non-linear relationships
- Multicollinearity can be detected using correlation matrices and VIF
- Ridge and Lasso regression can help manage multicollinearity and prevent overfitting

### Next Steps

- Apply multiple regression to real-world datasets
- Explore techniques for feature selection
- Learn about other regression techniques like Random Forest or Gradient Boosting
- Study model validation techniques like cross-validation