# Multiple Linear Regression

Performing multiple linear regression in Python is similar to simple linear regression, but you work with multiple independent variables (features) instead of just one.  
Here's a step-by-step guide on how to perform multiple linear regression using Python and scikit-learn:

### Step 1: Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

### Step 2: Load and explore your dataset

Load your dataset using pandas:  
data = pd.read_csv('your_dataset.csv')  
In this tutorial, we will use the Boston Housing dataset, which is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. More info about the dataset can be found at: https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset.

In [2]:
data = pd.read_csv('https://github.com/YBIFoundation/Dataset/raw/main/Boston.csv')

Explore the dataset as needed using commands like data.head() and data.info() to understand the structure and data types.

In [3]:
data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NX       506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
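
Beyond `head()` and `info()`, a few more pandas calls are often useful at this stage: summary statistics, a count of missing values, and the correlation of each feature with the target. The following is a minimal sketch on a small made-up DataFrame; the column names `rooms`, `age` and `price` are hypothetical and do not come from the Boston data.

```python
import pandas as pd

# Hypothetical toy frame standing in for a real dataset
df = pd.DataFrame({
    "rooms": [4, 5, 6, 7, 8],
    "age": [50, 40, 30, 20, 10],
    "price": [20.0, 25.0, 30.0, 35.0, 40.0],
})

# Summary statistics (count, mean, std, min, quartiles, max) per column
summary_stats = df.describe()

# Number of missing values in each column
missing_per_column = df.isnull().sum()

# Pearson correlation of every feature with the target column
corr_with_target = df.corr()["price"].drop("price")

print(summary_stats)
print(missing_per_column)
print(corr_with_target)
```

On the tutorial's data, the same calls would be applied to `data`, with `MEDV` as the target column.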


### Step 3: Prepare your data

For multiple linear regression, you need to select the independent variables (features) and the dependent variable (target). Extract these columns from your dataset:  
X = data[['Feature1', 'Feature2', 'Feature3']]  # Include all the independent variables you want to use  
Y = data['Target_Variable']  
In this example, we will select MEDV (median value of owner-occupied homes in $1000's) as the dependent variable and all remaining variables as independent variables, to build a multiple linear regression model for predicting house values.


In [5]:
Y = data['MEDV']
X = data.drop(['MEDV'],axis=1) # Select all variables except for MEDV

In [6]:
Y

0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Name: MEDV, Length: 506, dtype: float64

In [7]:
X

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48


### Step 4: Split the data into training and testing sets

To assess the model's performance, it's crucial to split your data into a training set and a testing set. This helps you evaluate how well the model generalizes to unseen data:

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
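
To see what `test_size=0.2` does in practice, the sketch below splits a synthetic array with the same shape as the tutorial's feature matrix (506 rows, 13 columns) and prints the resulting sizes. The arrays `X_demo` and `y_demo` are random stand-ins, not the Boston data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the tutorial's feature matrix
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 20% of 506 rows is rounded up for the test set; the rest go to training
print(X_tr.shape, X_te.shape)  # (404, 13) (102, 13)

# Fixing random_state makes the split reproducible across runs
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
```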

### Step 5: Create and fit the multiple linear regression model

In [9]:
# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, Y_train)

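The model has now calculated the least-squares estimates of the coefficients (β0, β1, β2, ...), that is, the values that minimize the sum of squared differences between the actual and predicted targets on the training data.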
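
As an illustration of what least-squares estimation computes, the sketch below solves the same kind of problem directly with NumPy on synthetic, noise-free data. The names `X_demo`, `y_demo` and `true_beta` are made up for this example; it is not a description of scikit-learn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 3))

# Known coefficients [β0, β1, β2, β3] used to generate noise-free targets
true_beta = np.array([2.0, 1.5, -3.0, 0.5])
y_demo = true_beta[0] + X_demo @ true_beta[1:]

# Prepend a column of ones so that β0 (the intercept) is estimated too
A = np.column_stack([np.ones(n), X_demo])

# Least-squares solution: minimizes the sum of squared residuals ||A β - y||²
beta_hat, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Equivalent closed form via the normal equations (AᵀA) β = Aᵀy
beta_normal = np.linalg.solve(A.T @ A, A.T @ y_demo)

print(np.round(beta_hat, 4))
```

With noise-free data both routes recover the generating coefficients; with real data they return the best-fitting estimates instead.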

### Step 6: Make predictions

You can use your trained model to make predictions on the test data:

In [10]:
Y_pred = model.predict(X_test)

### Step 7: Evaluate the model

You can evaluate the model's performance using various metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²):

In [11]:
mse = mean_squared_error(Y_test, Y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(Y_test, Y_pred)

print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
print(f'R-squared (R²): {r2:.2f}')

Mean Squared Error (MSE): 24.29
Root Mean Squared Error (RMSE): 4.93
R-squared (R²): 0.67


These metrics help you assess how well the multiple linear regression model fits the data and makes predictions.
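
To make the definitions concrete, the three metrics can also be computed by hand with NumPy. The arrays below are tiny made-up values chosen for easy arithmetic, not results from the Boston data.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat

mse_manual = np.mean(residuals ** 2)   # average squared error
rmse_manual = np.sqrt(mse_manual)      # square root: same units as the target

ss_res = np.sum(residuals ** 2)                   # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"MSE={mse_manual:.4f} RMSE={rmse_manual:.4f} R2={r2_manual:.4f}")
# MSE=0.5000 RMSE=0.7071 R2=0.9000
```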

### Step 8: Interpret the Results

Interpreting a multiple linear regression model involves understanding the relationships between the independent variables (features) and the dependent variable (target) as well as the coefficients (β0, β1, β2, ...) of the model. Here's a general guideline on how to interpret the model:

**Intercept (β0)**: The intercept represents the estimated mean value of the dependent variable when all independent variables are set to zero. In many cases, setting all variables to zero may not be meaningful. If it is, the intercept provides a baseline value for the dependent variable.

**Coefficients (β1, β2, ...)**: Each coefficient represents the expected change in the dependent variable associated with a one-unit change in the corresponding independent variable, while holding all other variables constant. Here's how to interpret them:  
A positive coefficient (e.g., β1) indicates that an increase in the corresponding independent variable is associated with an increase in the dependent variable.  
A negative coefficient indicates that an increase in the corresponding independent variable is associated with a decrease in the dependent variable.  
The magnitude of the coefficient is the size of that change per unit of the feature. Because it depends on the feature's units, magnitudes are only directly comparable across features measured on the same scale.  
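
One caveat when comparing magnitudes is that a coefficient is expressed per unit of its feature, so changing the units changes the coefficient without changing the model. The sketch below shows this on synthetic data; the feature (a distance in metres) and the helper `fit_line` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Hypothetical feature measured in metres, with a noise-free linear target
x_metres = rng.uniform(50, 200, size=n)
y_demo = 10.0 + 0.2 * x_metres

def fit_line(x, y):
    """Return [intercept, slope] from an ordinary least-squares fit."""
    A = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

beta_m = fit_line(x_metres, y_demo)            # slope per metre
beta_km = fit_line(x_metres / 1000.0, y_demo)  # slope per kilometre

# Same relationship, but the slope is 1000 times larger in kilometres
print(round(beta_m[1], 4), round(beta_km[1], 4))
```

For this reason, coefficient sizes are usually compared only after putting features on a common scale, for example by standardizing them.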

**R-squared (R²)**: R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. On the data used to fit the model it ranges from 0 to 1, with higher values indicating a better fit; on held-out test data it can be negative if the model predicts worse than simply using the mean. For example, an R² value of 0.80 means that 80% of the variation in the dependent variable is explained by the model, while the remaining 20% is unexplained.

**Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)**: MSE is the average squared difference between the actual and predicted values. RMSE is its square root, which expresses the typical error in the same units as the dependent variable. Lower MSE and RMSE values indicate better model performance.

To interpret the coefficients, you can print them along with their corresponding feature names:  


In [12]:
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})  
print(coefficients)  

    Feature  Coefficient
0      CRIM    -0.113056
1        ZN     0.030110
2     INDUS     0.040381
3      CHAS     2.784438
4        NX   -17.202633
5        RM     4.438835
6       AGE    -0.006296
7       DIS    -1.447865
8       RAD     0.262430
9       TAX    -0.010647
10  PTRATIO    -0.915456
11        B     0.012351
12    LSTAT    -0.508571
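
The coefficients and the intercept together define each prediction as ŷ = β0 + β1·x1 + β2·x2 + ... The sketch below checks this on synthetic data by comparing a hand-computed prediction with `predict`; the names `X_demo`, `y_demo` and `demo_model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: intercept plus the weighted sum of the features
manual_pred = demo_model.intercept_ + X_demo @ demo_model.coef_

# Should agree with the model's own predictions
matches = np.allclose(manual_pred, demo_model.predict(X_demo))
print(matches)  # True
```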


**Significance of Coefficients**: It's essential to assess the statistical significance of each coefficient. You can do this by examining the p-values associated with each coefficient. A low p-value (typically below 0.05) suggests that the corresponding variable is statistically significant in explaining the variation in the dependent variable.

To examine the p-values for each variable in a multiple linear regression model, you'll need to use statistical libraries such as statsmodels in Python. statsmodels provides a more detailed statistical summary of the regression model, including p-values for each coefficient. Here's how you can do it:

**Step 1:** Install and import the statsmodels library

In [13]:
!pip install statsmodels



Then, import the library:

In [14]:
import statsmodels.api as sm

**Step 2:** Fit the multiple linear regression model using statsmodels

In [15]:
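# Add a constant term to the independent variables matrix for the intercept
# (unlike scikit-learn, statsmodels does not add one automatically)
X = sm.add_constant(X)

# Fit the multiple linear regression model on the full dataset (all 506 rows),
# so the estimates differ slightly from the scikit-learn model fitted on the training split.
# Note: this reassigns `model`, replacing the scikit-learn model from Step 5.
model = sm.OLS(Y, X).fit()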

**Step 3:** Examine the summary statistics

After fitting the model, you can examine the summary statistics, which include p-values for each variable:

In [16]:
summary = model.summary()
print(summary)

                            OLS Regression Results                            
Dep. Variable:                   MEDV   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Sun, 28 Jan 2024   Prob (F-statistic):          6.72e-135
Time:                        20:08:22   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         36.4595      5.103      7.144      0.0

The summary provides detailed statistics for the regression model, including a p-value for each coefficient. Look for the "P>|t|" column in the summary table, which gives the p-value for each variable.  
If the p-value for a variable is less than your chosen significance level (e.g., 0.05), you can consider that variable statistically significant in explaining the variation in the dependent variable.

If the p-value is greater than the significance level, you may consider that variable not statistically significant.
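
To show where the "P>|t|" values come from, the sketch below computes standard errors, t-statistics and two-sided p-values by hand for an ordinary least-squares fit with the default (nonrobust) covariance. The data are synthetic and the variable names are made up; `x2` is generated with no true effect on the target.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                   # feature with a real effect
x2 = rng.normal(size=n)                   # feature with no true effect
y_demo = 1.0 + 3.0 * x1 + rng.normal(size=n)

# Design matrix with a constant column for the intercept
A = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Residual variance, using the residual degrees of freedom
resid = y_demo - A @ beta
dof = n - A.shape[1]
sigma2 = resid @ resid / dof

# Standard errors from the diagonal of σ² (AᵀA)⁻¹
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))

# t-statistic and two-sided p-value for each coefficient
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, p in zip(["const", "x1", "x2"], p_values):
    print(f"{name}: p = {p:.4g}")
```

A coefficient that is large relative to its standard error gives a large |t| and therefore a small p-value.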

Below is a detailed interpretation of the model:

**R-squared (R²):** R-squared measures the proportion of the variance in the dependent variable (MEDV) that is explained by the independent variables in the model. In this case, R-squared is 0.741, which means that approximately 74.1% of the variation in MEDV is explained by the independent variables included in the model.

**Adjusted R-squared (Adj. R²):** Adjusted R-squared penalizes R-squared for the number of predictors in the model, so it increases only when an added variable improves the fit more than would be expected by chance. An Adj. R² of 0.734, only slightly below the R-squared of 0.741, indicates a good fit given the complexity of the model.
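
The adjustment can be reproduced from three numbers in the summary using the standard formula Adj. R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The variable names below are illustrative, and R² is taken as printed (rounded to three decimals).

```python
# Values read from the statsmodels summary
r2_summary = 0.741   # R-squared, as printed
n_obs = 506          # No. Observations
n_pred = 13          # Df Model (predictors, excluding the constant)

adj_r2 = 1 - (1 - r2_summary) * (n_obs - 1) / (n_obs - n_pred - 1)
print(f"{adj_r2:.3f}")  # 0.734
```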

**F-statistic:** The F-statistic is used to test whether the overall regression model is statistically significant. A high F-statistic (in this case, 108.1) suggests that the model as a whole is statistically significant in predicting MEDV.

**Prob (F-statistic):** The p-value associated with the F-statistic is very close to zero (6.72e-135), indicating that the model as a whole is highly significant.
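
For a model with an intercept, the overall F-statistic can be written in terms of R² as F = (R²/p) / ((1 - R²)/(n - p - 1)). The sketch below plugs in the values printed in the summary; the variable names are illustrative.

```python
# Values read from the statsmodels summary
r2_summary = 0.741   # R-squared, as printed
n_obs = 506          # No. Observations
n_pred = 13          # Df Model

# Explained variation per predictor divided by unexplained variation
# per residual degree of freedom
f_stat = (r2_summary / n_pred) / ((1 - r2_summary) / (n_obs - n_pred - 1))
print(f"{f_stat:.1f}")  # 108.3
```

The result is close to the reported 108.1; the small gap comes from using the rounded R² rather than its full-precision value.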

**Coefficients (coef):** These coefficients represent the estimated changes in MEDV associated with a one-unit change in each of the independent variables while holding all other variables constant.

For example, the coefficient for CRIM is -0.1080, meaning that for every one-unit increase in the CRIM variable, MEDV is expected to decrease by approximately 0.1080 units (about $108, since MEDV is in $1000's), holding the other variables constant.  
The constant (intercept) has a coefficient of 36.4595, which represents the estimated MEDV when all independent variables are zero.

**P-values (P>|t|):** These p-values assess the statistical significance of each coefficient. A low p-value (typically below 0.05) suggests that the corresponding variable is statistically significant in explaining the variation in MEDV.

For instance, variables such as CRIM, ZN, CHAS, NX, RM, DIS, RAD, TAX, PTRATIO, B, and LSTAT have low p-values (below 0.05), indicating their statistical significance.  
On the other hand, the INDUS and AGE variables have higher p-values, suggesting that they may not be statistically significant predictors of MEDV.