# Lab 12 Building Parsimonious Models - Group - [5 points] - Solutions


## <u>Case Study</u>: Forward Selection with BIC and your Own Dataset

In this analysis, you will do the following.

1. **Choose your own dataset that meets the following specifications.**
    * It is not the fake vs. real Instagram account dataset. Any other dataset that meets the specifications below is fine.
    * This dataset has at least one categorical variable that you can use as your response variable in a logistic regression model.
        - This categorical variable should have just two levels.
        - Alternatively, you can create a categorical variable that has just two levels.
    * This dataset has at least 4 other variables that you will use as *potential* explanatory variables to put in the model.
    
2. **Use a forward selection algorithm with BIC** to try to find a parsimonious logistic regression model, built from the 4 candidate explanatory variables that you are considering.



### Imports

In [14]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

## 1. [0.5 pt] Data Preliminaries

Load your CSV file into a dataframe. Then create a new 0/1 response variable in your dataframe, where 1 = the response variable level that you are trying to predict (i.e., the success level) and 0 = the level that you are not trying to predict (i.e., the failure level). Finally, display the first 5 rows of your updated dataframe below.

In [32]:
missing_values = ["data unavailable"]
df = pd.read_csv('county.csv', 
                   na_values=missing_values)
df.head()

Unnamed: 0,name,state,pop2000,pop2010,pop2017,pop_change,poverty,homeownership,multi_unit,unemployment_rate,metro,median_edu,per_capita_income,median_hh_income,smoking_ban
0,Autauga County,Alabama,43671.0,54571,55504.0,1.48,13.7,77.5,7.2,3.86,yes,some_college,27841.7,55317.0,none
1,Baldwin County,Alabama,140415.0,182265,212628.0,9.19,11.8,76.7,22.6,3.99,yes,some_college,27779.85,52562.0,none
2,Barbour County,Alabama,29038.0,27457,25270.0,-6.22,27.2,68.0,11.1,5.9,no,hs_diploma,17891.73,33368.0,partial
3,Bibb County,Alabama,20826.0,22915,22668.0,0.73,15.2,82.9,6.6,4.39,yes,hs_diploma,20572.05,43404.0,none
4,Blount County,Alabama,51024.0,57322,58013.0,0.68,15.6,82.0,3.7,4.02,yes,hs_diploma,21367.39,47412.0,none


In [33]:
# 1 = metro county (success level), 0 = non-metro county (failure level)
df['y'] = df['metro'].map({'yes': 1, 'no': 0})
df.head()

Unnamed: 0,name,state,pop2000,pop2010,pop2017,pop_change,poverty,homeownership,multi_unit,unemployment_rate,metro,median_edu,per_capita_income,median_hh_income,smoking_ban,y
0,Autauga County,Alabama,43671.0,54571,55504.0,1.48,13.7,77.5,7.2,3.86,yes,some_college,27841.7,55317.0,none,1.0
1,Baldwin County,Alabama,140415.0,182265,212628.0,9.19,11.8,76.7,22.6,3.99,yes,some_college,27779.85,52562.0,none,1.0
2,Barbour County,Alabama,29038.0,27457,25270.0,-6.22,27.2,68.0,11.1,5.9,no,hs_diploma,17891.73,33368.0,partial,0.0
3,Bibb County,Alabama,20826.0,22915,22668.0,0.73,15.2,82.9,6.6,4.39,yes,hs_diploma,20572.05,43404.0,none,1.0
4,Blount County,Alabama,51024.0,57322,58013.0,0.68,15.6,82.0,3.7,4.02,yes,hs_diploma,21367.39,47412.0,none,1.0
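The `y` column above displays as `1.0`/`0.0` rather than `1`/`0`. In pandas this usually means the mapped column contains at least one value that is missing or not in the mapping, since such entries become `NaN` and force a float dtype. The sketch below shows one way to check for this on a small hypothetical Series (it is a stand-in, not the county data); the names `metro`, `y_clean`, and `n_missing` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for a two-level column with one missing entry.
metro = pd.Series(['yes', 'no', None, 'yes'])

# Entries not found in the mapping (including missing ones) become NaN.
y = metro.map({'yes': 1, 'no': 0})

# Count the rows that did not receive a 0/1 label.
n_missing = int(y.isna().sum())
print(y.dtype, n_missing)

# One option: keep only the labelled rows and store the response as integers.
y_clean = y.dropna().astype(int)
print(y_clean.tolist())
```

Whether to drop such rows or handle them another way is a judgment call for your own dataset; the point is only to know how many there are before fitting.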


<hr>

## <u>Tutorial</u>: Fitting a Regression Curve with No Explanatory Variables

If you would like to use the **statsmodels.formula.api** package to fit a logistic regression (or linear regression) model that does not have any explanatory variables (i.e., just the intercept), you can write a 1 where you would normally put the explanatory variables in the formula passed to the **.logit()** or **.ols()** function, as shown below.

In [1]:
import pandas as pd
df_temp=pd.DataFrame({'y': [1,1,0,0,1,1,0]})
df_temp

Unnamed: 0,y
0,1
1,1
2,0
3,0
4,1
5,1
6,0


In [2]:
import statsmodels.formula.api as smf
example_model = smf.logit('y~1', data=df_temp).fit()
example_model.summary()

Optimization terminated successfully.
         Current function value: 0.682908
         Iterations 4


Dep. Variable:,y,No. Observations:,7.0
Model:,Logit,Df Residuals:,6.0
Method:,MLE,Df Model:,0.0
Date:,"Wed, 10 Nov 2021",Pseudo R-squ.:,5.282e-12
Time:,13:31:21,Log-Likelihood:,-4.7804
converged:,True,LL-Null:,-4.7804
Covariance Type:,nonrobust,LLR p-value:,

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.2877,0.764,0.377,0.706,-1.209,1.785
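As a sanity check on the summary above: in an intercept-only logistic regression, every observation gets the same predicted probability, and the maximum likelihood estimate of that probability is the sample proportion of 1s. The intercept is therefore the log-odds of that proportion. The short sketch below recomputes the intercept and the log-likelihood by hand using only the standard library; the variable names are illustrative.

```python
import math

y = [1, 1, 0, 0, 1, 1, 0]          # same values as df_temp['y']
n_success = sum(y)                  # number of 1s
n_failure = len(y) - n_success      # number of 0s

# Intercept-only fit: the predicted probability is the sample proportion,
# so the intercept is the log-odds of that proportion.
p_hat = n_success / len(y)
intercept = math.log(p_hat / (1 - p_hat))

# Log-likelihood of the data at that constant probability.
log_lik = n_success * math.log(p_hat) + n_failure * math.log(1 - p_hat)

print(round(intercept, 4), round(log_lik, 4))
```

Both numbers should agree with the `Intercept` coefficient and the `Log-Likelihood` entry in the summary table.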


<hr>

## 2.  [4.5 pts]  Forward Selection with BIC Score

Next, starting with the logistic regression model with **no explanatory variables**, perform a forward selection algorithm using the BIC in an attempt to find the logistic regression model with the lowest BIC score.
* You should consider 4 possible explanatory variables in this algorithm.
* You will use your **full dataset** when fitting this logistic regression model (i.e., you should *not* split this dataset into training and test datasets in this particular assignment).

Once the algorithm has stopped, print out the summary output of your **final model**. 
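For reference, the score being minimized can be written as follows, where $k$ is the number of estimated parameters (including the intercept), $n$ is the number of observations used in the fit, and $\hat{L}$ is the maximized likelihood. This is the standard definition of BIC; a fitted statsmodels result exposes the value through its `.bic` attribute.

$$
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
$$

Under this definition, adding one explanatory variable raises the penalty term by $\ln(n)$, so the BIC only goes down if the new variable increases $2\ln(\hat{L})$ by more than $\ln(n)$.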

In [54]:
# Step 0: intercept-only baseline; its BIC is the score to beat.
mode0 = smf.logit('y~1', data=df).fit()
mode0.bic

Optimization terminated successfully.
         Current function value: 0.659558
         Iterations 4


4148.75804009704
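The two numbers printed for this cell are related. As far as I understand statsmodels, the "Current function value" reported by the optimizer is the average negative log-likelihood per observation, so $-2\ln(\hat{L}) = 2n \times$ (that value). The number of rows used in the fit is not printed in this section, so the sketch below solves for the $n$ that makes the BIC formula reproduce the printed score. Treat the result as an inference from rounded output, not as a reported fact; `bic_at` and `n_implied` are hypothetical names.

```python
import math

avg_neg_loglik = 0.659558        # "Current function value" printed for y~1
printed_bic = 4148.75804009704   # mode0.bic printed for y~1
k = 1                            # intercept-only model has one parameter

def bic_at(n):
    """BIC implied by n observations: k*ln(n) - 2*ln(L), with -ln(L) = n * average."""
    return 2 * n * avg_neg_loglik + k * math.log(n)

# bic_at is increasing in n, so bisection finds the n that matches the printed BIC.
lo, hi = 1.0, 1e6
for _ in range(100):
    mid = (lo + hi) / 2
    if bic_at(mid) < printed_bic:
        lo = mid
    else:
        hi = mid

n_implied = round(lo)
print(n_implied, bic_at(n_implied))
```

If the implied `n` is smaller than the number of rows in your dataframe, one likely explanation is that the formula interface dropped rows with missing values before fitting.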

In [63]:
# Step 1: try each of the 4 candidate variables on its own.
mode0 = smf.logit('y~multi_unit', data=df).fit()
mode0.bic

Optimization terminated successfully.
         Current function value: 0.583086
         Iterations 6


3676.715343208284

In [64]:
mode0 = smf.logit('y~homeownership', data=df).fit()
mode0.bic

Optimization terminated successfully.
         Current function value: 0.651691
         Iterations 5


4107.422193410641

In [66]:
mode0 = smf.logit('y~unemployment_rate', data=df).fit()
mode0.bic

Optimization terminated successfully.
         Current function value: 0.654355
         Iterations 5


4124.141337694156

In [68]:
mode0 = smf.logit('y~poverty', data=df).fit()
mode0.bic

Optimization terminated successfully.
         Current function value: 0.637008
         Iterations 5


4015.2384533201966

In [69]:
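# Step 2: multi_unit had the lowest BIC in step 1, so keep it and try adding each remaining variable.
mode0 = smf.logit('y~multi_unit+homeownership', data=df).fit()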
mode0.bic

Optimization terminated successfully.
         Current function value: 0.562975
         Iterations 6


3558.5135818520666

In [70]:
mode0 = smf.logit('y~multi_unit+unemployment_rate', data=df).fit()
mode0.bic

Optimization terminated successfully.
         Current function value: 0.581245
         Iterations 6


3673.212413205703

In [71]:
mode0 = smf.logit('y~multi_unit+poverty', data=df).fit()
mode0.bic

Optimization terminated successfully.
         Current function value: 0.563603
         Iterations 6


3562.4554954122764

In [72]:
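# Step 3: homeownership gave the lowest two-variable BIC, so keep it and try the two remaining variables.
mode0 = smf.logit('y~multi_unit+homeownership+unemployment_rate', data=df).fit()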
mode0.bic

Optimization terminated successfully.
         Current function value: 0.562909
         Iterations 6


3566.152477497295

In [73]:
mode0 = smf.logit('y~multi_unit+homeownership+poverty', data=df).fit()
mode0.bic

Optimization terminated successfully.
         Current function value: 0.557435
         Iterations 6


3531.784677981465
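The manual steps above follow a greedy pattern that can be written as a small loop. The sketch below is one possible way to express it, not the required solution: `forward_select`, `recorded_score`, `RECORDED_BIC`, and the `max_terms` argument are hypothetical names introduced here. To stay self-contained, the scoring function simply looks up the BIC values printed in the cells above instead of refitting models. Because no four-variable fit appears in those cells, the replay is capped at three terms.

```python
from typing import Callable, List, Optional, Sequence, Tuple

def forward_select(
    candidates: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
    max_terms: Optional[int] = None,
) -> Tuple[List[str], float]:
    """Greedy forward selection: add the variable that lowers the score most."""
    selected: List[str] = []
    best = score(tuple(selected))            # score of the intercept-only model
    remaining = list(candidates)
    while remaining and (max_terms is None or len(selected) < max_terms):
        # Score every model that adds exactly one remaining variable.
        trials = {var: score(tuple(selected + [var])) for var in remaining}
        var = min(trials, key=trials.get)
        if trials[var] >= best:              # no candidate improves the score: stop
            break
        selected.append(var)
        remaining.remove(var)
        best = trials[var]
    return selected, best

# BIC values copied from the outputs printed in this section.
RECORDED_BIC = {
    frozenset(): 4148.75804009704,
    frozenset({'multi_unit'}): 3676.715343208284,
    frozenset({'homeownership'}): 4107.422193410641,
    frozenset({'unemployment_rate'}): 4124.141337694156,
    frozenset({'poverty'}): 4015.2384533201966,
    frozenset({'multi_unit', 'homeownership'}): 3558.5135818520666,
    frozenset({'multi_unit', 'unemployment_rate'}): 3673.212413205703,
    frozenset({'multi_unit', 'poverty'}): 3562.4554954122764,
    frozenset({'multi_unit', 'homeownership', 'unemployment_rate'}): 3566.152477497295,
    frozenset({'multi_unit', 'homeownership', 'poverty'}): 3531.784677981465,
}

def recorded_score(variables: Tuple[str, ...]) -> float:
    """Look up a BIC already printed above (raises KeyError for unfitted models)."""
    return RECORDED_BIC[frozenset(variables)]

selected, best_bic = forward_select(
    ['multi_unit', 'homeownership', 'unemployment_rate', 'poverty'],
    recorded_score,
    max_terms=3,   # the four-variable model is not fitted in the cells shown
)
print(selected, best_bic)
```

To run the same loop on live fits, the lookup could be swapped for a function that builds the formula string from the selected variables (using `'1'` when none are selected), fits it with `smf.logit(formula, data=df).fit(disp=0)`, and returns `.bic`. With `max_terms` left at `None`, the loop would then also test whether adding the fourth variable lowers the BIC any further.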