## **MULTIPLE LINEAR REGRESSION**

## Let's break down Multiple Linear Regression in a simple way, as if explaining it to someone new to data science.

### 🔍 What is Multiple Linear Regression?
##### Multiple Linear Regression (MLR) is a method used in statistics and machine learning to predict a numeric outcome (called the target) using two or more input features (called predictors or independent variables).

#### Think of it like this:

#### 💡 "You’re trying to predict the price of a house based on its size, number of bedrooms, and distance from the city center."

### 🏗️ The Formula (Don’t worry, we’ll simplify it!)
##### Price = β₀ + β₁·Size + β₂·Bedrooms + β₃·Distance + ε

##### β₀: Intercept (the predicted price when all features are zero)

##### β₁, β₂, β₃: Coefficients (each tells us how much the price changes when that feature increases by one unit while the other features stay fixed)

##### ε: Error term (stuff we can’t explain)
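
To make the formula concrete, here is a minimal sketch that plugs numbers into it. The coefficient values are copied from the fitted example later in this notebook; the helper name `predict_price` is just an illustrative choice. The error term ε is left out because it is unknown when we make a prediction.

```python
import numpy as np

# Coefficients taken from the fitted example later in this notebook
beta_0 = 3.275                              # intercept (β₀), in millions
betas = np.array([0.00225, 0.3, -0.225])    # β₁ (size), β₂ (bedrooms), β₃ (distance)

def predict_price(size, bedrooms, distance):
    """Apply Price = β₀ + β₁·Size + β₂·Bedrooms + β₃·Distance (hypothetical helper)."""
    features = np.array([size, bedrooms, distance], dtype=float)
    return beta_0 + betas @ features

# A 1000 sq ft house with 2 bedrooms, 5 km from the city
price = predict_price(1000, 2, 5)
print(f"Predicted price: {price:.3f}M")
```

Each feature is multiplied by its own coefficient and the results are added to the intercept, which is all the formula says.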

##### 💬 Real-life Analogy
##### Imagine you’re a chef and you’re trying to predict how much customers will love your new dish. The score depends on:

##### How spicy it is 🌶️

##### How sweet it is 🍯

##### How salty it is 🧂

##### Each of these ingredients contributes in a different way to the final taste. You want to find the right weights (coefficients) for each ingredient that best explain the customers' reactions. That’s exactly what MLR does!

#### ✅ What MLR Does:
##### Fits a line (or plane/hyperplane) through the data.

##### Finds the best combination of input features to predict the output.

##### Minimizes the sum of squared differences between the predicted and actual values (these differences are called the residuals).
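
The "minimizing" idea can be sketched with NumPy's least-squares solver. As an assumption for illustration, this uses the four houses from the code example later in this notebook but only two features (size and distance), so that the fit is not exact and the residuals are visible. Nudging a coefficient away from the least-squares solution should make the squared error larger.

```python
import numpy as np

# Four houses from the notebook's later example; only size and distance are used here
size = np.array([1000, 1200, 1500, 1800], dtype=float)
distance = np.array([5, 3, 2, 1], dtype=float)
price = np.array([5.0, 6.2, 7.1, 8.3])      # in millions

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(size), size, distance])

# Least-squares solution: the coefficients with the smallest sum of squared residuals
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

residuals = price - X @ beta
sse = float(residuals @ residuals)

# Any other coefficient vector does worse, e.g. a nudged distance coefficient
nudged = beta + np.array([0.0, 0.0, 0.05])
sse_nudged = float(np.sum((price - X @ nudged) ** 2))

print("Residuals:", residuals)
print("Sum of squared residuals (best fit):", sse)
print("Sum of squared residuals (nudged):  ", sse_nudged)
```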

##### 📊 Example with Data
#### Let’s say we have this table:

| Size (sq ft) | Bedrooms | Distance to city (km) | Price (Ksh) |
|---|---|---|---|
| 1000 | 2 | 5 | 5M |
| 1200 | 3 | 3 | 6.2M |
| 1500 | 3 | 2 | 7.1M |

#### MLR will learn how much Size, Bedrooms, and Distance each contribute to the final Price, then use that to predict prices for new houses.

#### 📦 Tools to Use in Python:
#### sklearn.linear_model.LinearRegression

#### statsmodels.api.OLS (for more statistical details)

#### 🎯 Why It’s Powerful
#### Helps you understand how much each feature matters.

#### Can be used for forecasting, business planning, risk modeling, and more.



## Let's go through a short Python code example using scikit-learn to fit a Multiple Linear Regression model, and then see how to interpret the coefficients.

#### 🏠 Example: Predicting House Price
##### We’ll use a small dataset where we want to predict the price of a house based on:

##### size in square feet

##### bedrooms

##### distance to city in kilometers

### ✅ Step-by-Step Code:

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Sample data
data = {
    'size': [1000, 1200, 1500, 1800],
    'bedrooms': [2, 3, 3, 4],
    'distance': [5, 3, 2, 1],
    'price': [5.0, 6.2, 7.1, 8.3]  # in millions
}

df = pd.DataFrame(data)

# 2. Features (X) and Target (y)
X = df[['size', 'bedrooms', 'distance']]
y = df['price']

# 3. Create and train the model
model = LinearRegression()
model.fit(X, y)

# 4. Show coefficients
print("Intercept (β₀):", model.intercept_)
print("Coefficients (β₁, β₂, β₃):", model.coef_)

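# 5. Predict the price of a new house
# (a DataFrame with the same column names avoids sklearn's feature-name warning)
new_house = pd.DataFrame([[1600, 3, 2]], columns=['size', 'bedrooms', 'distance'])
predicted_price = model.predict(new_house)
print("Predicted Price (in millions):", predicted_price[0])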


Intercept (β₀): 3.274999999999988
Coefficients (β₁, β₂, β₃): [ 0.00225  0.3     -0.225  ]
Predicted Price (in millions): 7.324999999999999




### 📊 Output Interpretation
##### Rounded, the output above is:

##### Intercept (β₀): 3.275
##### Coefficients (β₁, β₂, β₃): [0.00225, 0.3, -0.225]
##### Predicted Price (in millions): 7.325


#### This means:

##### Intercept (β₀ = 3.275): If size, bedrooms, and distance were all 0, the model predicts a price of 3.275M (not realistic, but mathematically valid).

##### Size (β₁ = 0.00225): For each extra square foot, price increases by 2,250 Ksh (since 0.00225M = 2.25K), with bedrooms and distance held fixed.

##### Bedrooms (β₂ = 0.3): Each additional bedroom adds 0.3M to the price.

##### Distance (β₃ = -0.225): Every 1 km farther from the city reduces the price by 0.225M.

##### Note: with only 4 houses and 4 parameters (β₀ to β₃), the model fits the sample exactly, so treat these numbers as practice in reading coefficients rather than as reliable estimates.
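
The same reading can be checked in code. The sketch below refits the notebook's model, pairs each coefficient with its column name, and rebuilds one prediction by hand from the intercept and coefficients. The variable names `coef_by_name`, `manual`, and `from_model` are illustrative choices, not part of scikit-learn.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same sample data as in the notebook
df = pd.DataFrame({
    'size': [1000, 1200, 1500, 1800],
    'bedrooms': [2, 3, 3, 4],
    'distance': [5, 3, 2, 1],
    'price': [5.0, 6.2, 7.1, 8.3],  # in millions
})
features = ['size', 'bedrooms', 'distance']

model = LinearRegression().fit(df[features], df['price'])

# Pair each coefficient with the feature it belongs to
coef_by_name = dict(zip(features, model.coef_))
for name, value in coef_by_name.items():
    print(f"{name}: {value:+.5f}M per unit")

# Rebuild a prediction by hand: intercept + sum(coefficient * feature value)
new_house = pd.DataFrame([[1600, 3, 2]], columns=features)
manual = model.intercept_ + sum(
    coef_by_name[name] * new_house.loc[0, name] for name in features
)
from_model = model.predict(new_house)[0]

print("Manual prediction:", manual)
print("model.predict:    ", from_model)
```

If the two printed predictions agree, the coefficients are being read correctly.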



### 💡 Insights:

##### Coefficients tell you how much each feature affects the price.

##### Positive = increases price, Negative = decreases price.

#### You can use it to make predictions and understand relationships in your data.


### Let's now do the same example using statsmodels, which gives you a detailed statistical summary, including:

#### Coefficients

#### R² score

#### p-values (to check if features are significant)

#### Confidence intervals



In [5]:
import pandas as pd
import statsmodels.api as sm

# 1. Sample data
data = {
    'size': [1000, 1200, 1500, 1800],
    'bedrooms': [2, 3, 3, 4],
    'distance': [5, 3, 2, 1],
    'price': [5.0, 6.2, 7.1, 8.3]  # in millions
}

df = pd.DataFrame(data)

# 2. Define X and y
X = df[['size', 'bedrooms', 'distance']]
X = sm.add_constant(X)  # Adds intercept term (β₀)
y = df['price']

# 3. Fit the model
model = sm.OLS(y, X).fit()

# 4. View summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Thu, 03 Jul 2025   Prob (F-statistic):                nan
Time:                        12:29:09   Log-Likelihood:                 111.02
No. Observations:                   4   AIC:                            -214.0
Df Residuals:                       0   BIC:                            -216.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.2750        inf          0        n

  warn("omni_normtest is not valid with less than 8 observations; %i "
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
  return np.dot(wresid, wresid) / self.df_resid


### 🧠 Interpretation:
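#### coef: These are the same as in sklearn (β values).

#### P>|t|: This is the p-value.

#### If it's < 0.05, the feature is conventionally called statistically significant (i.e., the data give evidence that its coefficient is not zero).

#### R-squared (R²): 1.000 means the model reproduces all 4 prices exactly. That is not a sign of a good model here: with 4 observations and 4 parameters there are 0 residual degrees of freedom (Df Residuals: 0), so a perfect fit is guaranteed. This is also why the summary shows nan for Adj. R-squared and the F-statistic, and inf for the standard error.

#### Confidence Interval [0.025, 0.975]: The 95% confidence interval, i.e. the range in which the true coefficient likely falls. It is only informative when there are more observations than parameters.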








## 🧠 When to Use statsmodels vs sklearn
| Purpose | Use This |
|---|---|
| Predict new values | sklearn |
| Understand model statistically | statsmodels |
| Get p-values, CI, R² | statsmodels |
| Fast prediction pipelines | sklearn |

