**Dummy Variables**: Treating categorical variables like **numerical** variables!

Think binary code: *1 = yes, 0 = no*

So, a new variable (column) is placed for EACH category, with 1 meaning that the row is in this category and 0 meaning that the row is NOT in this category.

In [13]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('https://www.ishelp.info/data/insurance.csv')

# Generate dummy variables
for col in df:  
  if not pd.api.types.is_numeric_dtype(df[col]):
    df = pd.get_dummies(df, columns=[col], prefix=col, dtype=int)
    
df.head()

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,1,0,0,1,0,0,0,1
1,18,33.77,1,1725.5523,0,1,1,0,0,0,1,0
2,28,33.0,3,4449.462,0,1,1,0,0,0,1,0
3,33,22.705,0,21984.47061,0,1,1,0,0,1,0,0
4,32,28.88,0,3866.8552,0,1,1,0,0,1,0,0


*Get Dummies: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html*

Compare to original

In [14]:
dforiginal = pd.read_csv('https://www.ishelp.info/data/insurance.csv')
dforiginal

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [15]:
y = df['charges']
X = df.drop(columns=['charges']).assign(const=1)

# Run the multiple linear regression model
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())  # View results

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.751
Model:                            OLS   Adj. R-squared:                  0.749
Method:                 Least Squares   F-statistic:                     500.8
Date:                Tue, 28 Jan 2025   Prob (F-statistic):               0.00
Time:                        08:53:43   Log-Likelihood:                -13548.
No. Observations:                1338   AIC:                         2.711e+04
Df Residuals:                    1329   BIC:                         2.716e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
age                256.8564     11.899  

In [16]:
def regression_fit(model, y_actual):
  y_pred = results.fittedvalues

  print(f"R2:\t{round(results.rsquared, 4)}")
  print(f"R2-adj:\t{round(results.rsquared_adj, 4)}")
  print(f"MAE:\t{round(abs(y_pred - y_actual).mean(), 4)}")
  print(f"RMSE:\t{round(((y_pred - y_actual)**2).mean() ** (1/2), 4)}")

regression_fit(model, y)

R2:	0.7509
R2-adj:	0.7494
MAE:	4170.8869
RMSE:	6041.6797


**Variance Inflation Factor (VIF):** https://www.investopedia.com/terms/v/variance-inflation-factor.asp

**MULTICOLLINEARITY!** *WHY?* Because each of the dummy variables from the same categories are related to each other. *This is bad*. So, **let's get rid of them.**

In [17]:
dfNEW = pd.read_csv('https://www.ishelp.info/data/insurance.csv')

# Generate dummy variables
for col in dfNEW:  
  if not pd.api.types.is_numeric_dtype(dfNEW[col]):
    dfNEW = pd.get_dummies(dfNEW, columns=[col], prefix=col, drop_first=True, dtype=int)

dfNEW.head()

Unnamed: 0,age,bmi,children,charges,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,0,1,0,0,1
1,18,33.77,1,1725.5523,1,0,0,1,0
2,28,33.0,3,4449.462,1,0,0,1,0
3,33,22.705,0,21984.47061,1,0,1,0,0
4,32,28.88,0,3866.8552,1,0,1,0,0


In [18]:
yNEW = dfNEW['charges']
XNEW = dfNEW.drop(columns=['charges']).assign(const=1)

# Run the multiple linear regression model
model2 = sm.OLS(yNEW, XNEW)
results2 = model2.fit()
print(results2.summary())  # View results

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.751
Model:                            OLS   Adj. R-squared:                  0.749
Method:                 Least Squares   F-statistic:                     500.8
Date:                Tue, 28 Jan 2025   Prob (F-statistic):               0.00
Time:                        08:53:44   Log-Likelihood:                -13548.
No. Observations:                1338   AIC:                         2.711e+04
Df Residuals:                    1329   BIC:                         2.716e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
age                256.8564     11.899  

**Multicollineary** fixed: so we can move on with our model-fitting.. Below are some measures you can use to compare models.

*MAE and RMSE have decreased dramatically since we got rid of multicollinearity. See your book for further explanation and comparison.* Otherwise, these won't change much with the further model-fitting.

In [19]:
regression_fit(model2, yNEW)

R2:	0.7509
R2-adj:	0.7494
MAE:	4170.8869
RMSE:	6041.6797


Remember when we didn't see much of connection between **sex** and **charges** on insurance? Now, notice that the p-value for *sex_male* is high? It's not much of a determining factor in determining charges to insurance. **Get rid of it**

In [20]:
yNEW = dfNEW['charges']
XNEW = dfNEW.drop(columns=['charges','sex_male']).assign(const=1)

# Run the multiple linear regression model
model2 = sm.OLS(yNEW, XNEW)
results2 = model2.fit()
print(results2.summary())  # View results

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.751
Model:                            OLS   Adj. R-squared:                  0.750
Method:                 Least Squares   F-statistic:                     572.7
Date:                Tue, 28 Jan 2025   Prob (F-statistic):               0.00
Time:                        08:53:44   Log-Likelihood:                -13548.
No. Observations:                1338   AIC:                         2.711e+04
Df Residuals:                    1330   BIC:                         2.715e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
age                256.9736     11.891  

**DROP** the Northwest region, and notice how the other p-values change. *Keep dropping columns/variables and see what happens...*

In [21]:
yNEW = dfNEW['charges']
XNEW = dfNEW.drop(columns=['charges','sex_male','region_northwest']).assign(const=1)

# Run the multiple linear regression model
model2 = sm.OLS(yNEW, XNEW)
results2 = model2.fit()
print(results2.summary())  # View results

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.751
Model:                            OLS   Adj. R-squared:                  0.750
Method:                 Least Squares   F-statistic:                     668.3
Date:                Tue, 28 Jan 2025   Prob (F-statistic):               0.00
Time:                        08:53:44   Log-Likelihood:                -13548.
No. Observations:                1338   AIC:                         2.711e+04
Df Residuals:                    1331   BIC:                         2.715e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
age                257.0064     11.889  

In [22]:
yNEW = dfNEW['charges']
XNEW = dfNEW.drop(columns=['charges','sex_male','region_northwest','region_southwest']).assign(const=1)

# Run the multiple linear regression model
model2 = sm.OLS(yNEW, XNEW)
results2 = model2.fit()
print(results2.summary())  # View results

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.749
Method:                 Least Squares   F-statistic:                     799.7
Date:                Tue, 28 Jan 2025   Prob (F-statistic):               0.00
Time:                        08:53:44   Log-Likelihood:                -13550.
No. Observations:                1338   AIC:                         2.711e+04
Df Residuals:                    1332   BIC:                         2.714e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
age                257.1365     11.901  

In [23]:
yNEW = dfNEW['charges']
XNEW = dfNEW.drop(columns=['charges','sex_male','region_northwest','region_southwest','region_southeast']).assign(const=1)

# Run the multiple linear regression model
model2 = sm.OLS(yNEW, XNEW)
results2 = model2.fit()
print(results2.summary())  # View results

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.749
Method:                 Least Squares   F-statistic:                     998.1
Date:                Tue, 28 Jan 2025   Prob (F-statistic):               0.00
Time:                        08:53:44   Log-Likelihood:                -13551.
No. Observations:                1338   AIC:                         2.711e+04
Df Residuals:                    1333   BIC:                         2.714e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
age          257.8495     11.896     21.675      0.0

In [24]:
regression_fit(model2, yNEW)

R2:	0.7509
R2-adj:	0.7494
MAE:	4170.8869
RMSE:	6041.6797


**Smoking** is the only categorical variable that plays a significant role in charges to insurance, as we can see from the process above. This should remind us of our previous analysis on individual columns and their effect on insurance.

**Therefore, our model ultimately only includes *smoker_yes* in the model. The above model means that smoking increases charges to insurance on average by over $23,000.**