In [1]:
## 🔍 What is Multicollinearity?

**Multicollinearity** occurs when **two or more independent variables (predictors) in a regression model are highly correlated**, meaning they contain similar information about the variance in the dependent variable.

This causes problems because it becomes **difficult to determine the individual effect of each predictor on the outcome**.

---

## 📘 Example Equation

Consider a **multiple linear regression** equation:

$$
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \varepsilon
$$

Where:

* $y$ = dependent variable
* $x_1, x_2, x_3$ = independent variables
* $\beta_0$ = intercept
* $\beta_1, \beta_2, \beta_3$ = coefficients for predictors
* $\varepsilon$ = error term

Now, suppose:

$$
x_3 \approx 2x_1 + 0.5x_2
$$

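This means that $x_3$ is **almost an exact linear combination** of $x_1$ and $x_2$, so it adds very little information that the other two predictors do not already carry. This is a case of **multicollinearity**.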
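To make this concrete, here is a minimal sketch on synthetic data. The sample size, seed, and noise scale are arbitrary assumptions and do not come from any dataset used in this notebook. It builds $x_3$ from $x_1$ and $x_2$ as in the relation above and compares the condition number of the design matrix with and without $x_3$.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 200                         # arbitrary sample size

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# x3 is almost an exact linear combination of x1 and x2 (small added noise)
x3 = 2 * x1 + 0.5 * x2 + rng.normal(scale=0.01, size=n)

X_without = np.column_stack([np.ones(n), x1, x2])
X_with = np.column_stack([np.ones(n), x1, x2, x3])

# A large condition number signals that the columns are nearly dependent
cond_without = np.linalg.cond(X_without)
cond_with = np.linalg.cond(X_with)
print(f"condition number without x3: {cond_without:.1f}")
print(f"condition number with x3:    {cond_with:.1f}")
```

The condition number jumps once $x_3$ is included, which is the numerical symptom of the near dependence.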

---

## 🚨 Why Is It a Problem?

Multicollinearity affects:

* **Interpretability**: Coefficient values become unreliable or change drastically with small data changes.
* **Statistical Significance**: High multicollinearity increases standard errors → t-values decrease → predictors may appear **insignificant** even if they are important.
* **Model Stability**: The fitted coefficients vary widely from one sample to the next, so the model becomes unstable and hard to generalize.
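The effect on standard errors can be illustrated with a small simulation. This is only a sketch under assumed settings: the true coefficients, sample size, number of trials, and correlation levels are all made up for illustration. It refits the same two-predictor model on many synthetic samples and measures how much the estimate of $\beta_1$ spreads out.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n, n_trials = 100, 500          # arbitrary simulation sizes

def slope_spread(rho):
    """Std. dev. of the estimated x1 coefficient over repeated samples,
    when x1 and x2 have population correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = []
    for _ in range(n_trials):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        # Hypothetical true model: y = 1 + 2*x1 + 3*x2 + noise
        y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n)
        design = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates.append(beta[1])
    return np.std(estimates)

spread_low = slope_spread(0.0)    # uncorrelated predictors
spread_high = slope_spread(0.99)  # highly correlated predictors
print(f"spread of beta_1 at rho=0.00: {spread_low:.3f}")
print(f"spread of beta_1 at rho=0.99: {spread_high:.3f}")
```

The true coefficient is the same in both cases; only the correlation between the predictors changes, yet the estimate is far noisier in the correlated case.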

---

## 🧮 How to Detect It?

### 1. **Correlation Matrix**

Check correlation between independent variables:

```python
import seaborn as sns

# df is assumed to be a DataFrame holding only the numeric predictor columns
sns.heatmap(df.corr(), annot=True)
```

### 2. **Variance Inflation Factor (VIF)**

VIF quantifies how much the variance of a coefficient is inflated due to multicollinearity:

$$
\text{VIF}_i = \frac{1}{1 - R_i^2}
$$

Where:

* $R_i^2$ is the R-squared from regressing the $i$-th variable against all other predictors

Interpretation:

* **VIF = 1**: The predictor is uncorrelated with the other predictors (no multicollinearity)
* **VIF > 5 or 10**: High multicollinearity (a rule of thumb; the exact cutoff varies by source)
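The formula can be implemented directly. Below is a minimal NumPy sketch, not a reference implementation: the helper name `vif` and the synthetic columns `a`, `b`, `c` are hypothetical. It regresses each column on the others (plus an intercept) and applies $1 / (1 - R_i^2)$. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.

```python
import numpy as np

def vif(X):
    """Return VIF_i = 1 / (1 - R_i^2) for each column of X.
    X must contain the predictors only (no constant column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for i in range(k):
        target = X[:, i]
        # Regress column i on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[i] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: c is nearly a + b, so all three columns are entangled
rng = np.random.default_rng(2)  # arbitrary seed
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b + rng.normal(scale=0.1, size=300)
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```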

---

## ✅ How to Handle It?

* **Drop one of the correlated features**
* **Combine variables** (e.g., via PCA or feature engineering)
* **Regularization** (like Ridge Regression)
* **Check domain knowledge** to choose which features to retain
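As an illustration of the regularization option, the sketch below compares OLS with a closed-form ridge estimate on synthetic data. Everything here is assumed for the example: the helper `ridge`, the penalty `lam = 1.0`, and the data, in which `x2` is nearly a copy of `x1`. In practice the penalty is usually chosen by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
y = 3.0 * x1 + rng.normal(size=n)         # hypothetical true model

# Center the data so that no intercept term is needed
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^-1 X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

beta_ols = ridge(Xc, yc, 0.0)    # lam = 0 reduces to ordinary least squares
beta_ridge = ridge(Xc, yc, 1.0)  # lam = 1.0 is an arbitrary choice
print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)
```

Ridge cannot tell the two near-identical predictors apart either, but it splits the effect between them instead of letting the individual coefficients swing to large offsetting values.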

---

## 🔚 Summary

**Multicollinearity** makes regression models unstable and the coefficients unreliable. It occurs when
predictors are highly correlated. You can detect it using **VIF** or a **correlation matrix**, and fix it by removing or transforming features.




In [1]:
import pandas as pd

In [9]:
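import statsmodels.api as sm

# The first CSV column is a row index, so use it as the DataFrame index
df_adv = pd.read_csv('Advertising.csv', index_col=0)
X = df_adv[['TV', 'radio', 'newspaper']]  # predictors
y = df_adv['sales']                       # response
df_adv.head()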

Unnamed: 0,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [None]:
# The ordinary least squares (OLS) method is used to fit this regression

In [6]:
# Model equation: sales = B0 + B1*TV + B2*radio + B3*newspaper

In [10]:
X = sm.add_constant(X)      # add the intercept column ('const')
model = sm.OLS(y, X).fit()  # fit by ordinary least squares

In [11]:
model.summary()

0,1,2,3
Dep. Variable:,sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,570.3
Date:,"Tue, 05 Aug 2025",Prob (F-statistic):,1.58e-96
Time:,04:52:45,Log-Likelihood:,-386.18
No. Observations:,200,AIC:,780.4
Df Residuals:,196,BIC:,793.6
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.9389,0.312,9.422,0.000,2.324,3.554
TV,0.0458,0.001,32.809,0.000,0.043,0.049
radio,0.1885,0.009,21.893,0.000,0.172,0.206
newspaper,-0.0010,0.006,-0.177,0.860,-0.013,0.011

0,1,2,3
Omnibus:,60.414,Durbin-Watson:,2.084
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151.241
Skew:,-1.327,Prob(JB):,1.44e-33
Kurtosis:,6.332,Cond. No.,454.0


In [12]:
import matplotlib.pyplot as plt
# Skip the constant column (position 0) and correlate the predictors only
X.iloc[:, 1:].corr()

Unnamed: 0,TV,radio,newspaper
TV,1.0,0.054809,0.056648
radio,0.054809,1.0,0.354104
newspaper,0.056648,0.354104,1.0
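The correlation matrix can also be turned into VIF values, because the diagonal of the inverse correlation matrix equals the VIF of each predictor. The sketch below applies this to the correlations printed above; the values are copied by hand from that output, so they are rounded to six decimals.

```python
import numpy as np

# Pairwise correlations copied from the output above (TV, radio, newspaper)
corr = np.array([
    [1.0,      0.054809, 0.056648],
    [0.054809, 1.0,      0.354104],
    [0.056648, 0.354104, 1.0],
])

# The diagonal of the inverse correlation matrix gives the VIFs
vifs = np.diag(np.linalg.inv(corr))
for name, v in zip(["TV", "radio", "newspaper"], vifs):
    print(f"{name}: VIF = {v:.3f}")
```

All three values come out close to 1, well below the usual thresholds of 5 or 10.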


In [13]:
# The pairwise correlations are all low (the largest is about 0.35, between radio and newspaper),
# so multicollinearity is not an issue here and this dataset is free to be used as a test dataset in projects

In [14]:
df_salary = pd.read_csv('Salary_Data.csv')
df_salary.head()

Unnamed: 0,YearsExperience,Age,Salary
0,1.1,21.0,39343
1,1.3,21.5,46205
2,1.5,21.7,37731
3,2.0,22.0,43525
4,2.2,22.2,39891


In [16]:
X = df_salary[['YearsExperience', 'Age']]
y = df_salary['Salary']

In [17]:
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

In [18]:
model.summary()

0,1,2,3
Dep. Variable:,Salary,R-squared:,0.96
Model:,OLS,Adj. R-squared:,0.957
Method:,Least Squares,F-statistic:,323.9
Date:,"Tue, 05 Aug 2025",Prob (F-statistic):,1.35e-19
Time:,05:04:50,Log-Likelihood:,-300.35
No. Observations:,30,AIC:,606.7
Df Residuals:,27,BIC:,610.9
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-6661.9872,2.28e+04,-0.292,0.773,-5.35e+04,4.02e+04
YearsExperience,6153.3533,2337.092,2.633,0.014,1358.037,1.09e+04
Age,1836.0136,1285.034,1.429,0.165,-800.659,4472.686

0,1,2,3
Omnibus:,2.695,Durbin-Watson:,1.711
Prob(Omnibus):,0.26,Jarque-Bera (JB):,1.975
Skew:,0.456,Prob(JB):,0.372
Kurtosis:,2.135,Cond. No.,626.0


In [20]:
# Skip the constant column (position 0) and correlate the predictors only
X.iloc[:, 1:].corr()

Unnamed: 0,YearsExperience,Age
YearsExperience,1.0,0.987258
Age,0.987258,1.0
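With only two predictors, the $R_i^2$ in the VIF formula is just the squared pairwise correlation, so the VIF can be worked out directly from the number above. A short sketch of the arithmetic (the correlation is copied from the output, rounded to six decimals):

```python
r = 0.987258        # correlation between YearsExperience and Age, from the output above
r_squared = r ** 2  # with two predictors, R_i^2 equals r^2
vif = 1 / (1 - r_squared)
print(f"R^2 = {r_squared:.4f}, VIF = {vif:.1f}")
```

The result is roughly 39, far above the rule-of-thumb thresholds of 5 or 10, and both predictors share the same VIF.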


In [None]:
# Once multicollinearity has been found, there are basically two options:
# 1. Leave the multicollinearity as it is, i.e. ignore it.
# 2. Remove the feature with a high p-value and a high VIF (here Age has the higher p-value, 0.165).