# STAT 207 Homework 11 [25 points]

## Feature Selection for Linear Models

Due: Friday, April 26, end of day (11:59 pm CT)

<hr>

## Imports 

Run the following code cell to import the necessary packages into the file.  You may import additional packages, as needed for this assignment.

In [155]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf
from sklearn.metrics import r2_score

## The Data

A famous study called "SUPPORT" (Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments) was conducted to determine which factors affected or predicted patient outcomes, including how long a patient remained in the hospital.

We will use a random sample of 580 seriously ill hospitalized patients from the SUPPORT study, with the following variables:

- **Days**: days to death or hospital discharge
- **Age**: age on day of hospital admission
- **Sex**: female or male
- **Comorbidity**: whether the patient was diagnosed with more than one chronic disease: yes or no
- **EdYears**: years of education
- **Education**: education level: high or low
- **Income**: income level: high or low
- **Charges**: hospital charges, in dollars
- **Care**: level of care required: high or low
- **Race**: non-white or white
- **Pressure**: blood pressure, in mmHg
- **Blood**: white blood cell count, in gm/dL
- **Rate**: heart rate, in bpm

Run the code in the cell below to read in the cleaned data and split it into a training set and a test set for this document.  The code stores the full data as `df`, then holds out 20% of the rows as `df_test` and keeps the remaining 80% as `df_train` for later analysis; the fixed `random_state` makes the split reproducible.

In [156]:
df = pd.read_csv('hospital.csv')
df_train, df_test = train_test_split(df, test_size = 0.20, random_state = 9876)
df.head()

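| Unnamed: 0 | Days | Age | Sex | Comorbidity | EdYears | Education | Income | Charges | Care | Race | Pressure | Blood | Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 42.258972 | female | no | 11 | low | high | 9914.0 | low | non-white | 84 | 11.298828 | 94 |
| 1 | 14 | 63.662994 | female | no | 22 | high | high | 283303.0 | high | white | 69 | 30.097656 | 108 |
| 2 | 21 | 41.521973 | male | yes | 18 | high | high | 320843.0 | high | white | 66 | 0.199982 | 130 |
| 3 | 4 | 41.959991 | male | yes | 16 | high | high | 4173.0 | low | white | 97 | 10.798828 | 88 |
| 4 | 11 | 52.089996 | male | yes | 8 | low | high | 13414.0 | low | white | 89 | 6.399414 | 92 |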
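As a quick check on what `train_test_split` does with these settings, the sketch below splits a hypothetical stand-in frame (`toy`, 580 placeholder rows, not the hospital data) with the same `test_size` and `random_state`. It only illustrates the resulting row counts and the fact that a fixed seed reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in frame with the same number of rows as the sample (580).
toy = pd.DataFrame({"row_id": range(580)})

# Same arguments as the cell above, run twice to compare the results.
first_train, first_test = train_test_split(toy, test_size=0.20, random_state=9876)
second_train, second_test = train_test_split(toy, test_size=0.20, random_state=9876)

print(len(first_train), len(first_test))  # 464 training rows, 116 test rows

# With a fixed random_state, the same rows land in the test set each time.
same_split = first_test.index.equals(second_test.index)
print(same_split)
```

The 464 training rows match the "No. Observations" value reported for models fit on `df_train`.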


## 1. Research Purpose [0 points]

Question 1 does not require any analysis.

## 2. Fitting our Full Model [1 point]

For the purposes of this question, we want to be sure that we only include variables that could be determined or anticipated upon admittance to the hospital, so that our model can be applied to new patients.

Fit a linear model to the training data to predict the length of the hospital stay (**Days**) with the following predictor variables: Age, Sex, Comorbidity, EdYears, Education, Income, Care, Race, Pressure, Blood, and Rate.  Print a summary of this linear model.

In [157]:
lin_mod = smf.ols("Days ~ Age + Sex + Comorbidity + EdYears + Education + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
lin_mod.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Days | R-squared: | 0.15 |
| Model: | OLS | Adj. R-squared: | 0.129 |
| Method: | Least Squares | F-statistic: | 7.23 |
| Date: | Fri, 26 Apr 2024 | Prob (F-statistic): | 2.13e-11 |
| Time: | 18:00:19 | Log-Likelihood: | -2042.3 |
| No. Observations: | 464 | AIC: | 4109.0 |
| Df Residuals: | 452 | BIC: | 4158.0 |
| Df Model: | 11 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 16.3417 | 9.863 | 1.657 | 0.098 | -3.041 | 35.724 |
| Sex[T.male] | -2.3662 | 1.912 | -1.238 | 0.216 | -6.124 | 1.391 |
| Comorbidity[T.yes] | -2.1125 | 3.139 | -0.673 | 0.501 | -8.281 | 4.056 |
| Education[T.low] | 0.1577 | 2.969 | 0.053 | 0.958 | -5.677 | 5.992 |
| Income[T.low] | -1.5625 | 2.090 | -0.748 | 0.455 | -5.670 | 2.545 |
| Care[T.low] | -11.5546 | 2.048 | -5.641 | 0.000 | -15.580 | -7.529 |
| Race[T.white] | 3.4544 | 2.379 | 1.452 | 0.147 | -1.222 | 8.131 |
| Age | -0.0800 | 0.063 | -1.270 | 0.205 | -0.204 | 0.044 |
| EdYears | -0.3958 | 0.392 | -1.011 | 0.313 | -1.165 | 0.374 |

| | | | |
|---|---|---|---|
| Omnibus: | 390.399 | Durbin-Watson: | 1.981 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 7914.321 |
| Skew: | 3.615 | Prob(JB): | 0.0 |
| Kurtosis: | 21.897 | Cond. No. | 1600.0 |
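Labels such as `Sex[T.male]` in the summary come from the treatment (dummy) coding that the formula interface applies to categorical predictors: one level serves as the baseline and each remaining level gets its own indicator coefficient. The sketch below illustrates this on a tiny made-up frame (`toy`, with hypothetical values, not the hospital data), where the intercept is the baseline group's mean and the `[T.male]` coefficient is the difference between group means.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Days": [4.0, 6.0, 5.0, 10.0, 12.0, 11.0],
    "Sex": ["female", "female", "female", "male", "male", "male"],
})

# "female" sorts first, so it becomes the baseline level.
fit = smf.ols("Days ~ Sex", data=toy).fit()

# Intercept: mean Days for the baseline group (female) = 5.0
# Sex[T.male]: mean Days for male minus mean Days for female = 11.0 - 5.0 = 6.0
print(fit.params)
```

Read the same way, a negative coefficient on a `[T.low]` term means lower predicted `Days` for the "low" level than for the "high" baseline, holding the other predictors fixed.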


## 3. Metrics for Model Fit [0 points]

Question 3 does not require any analysis.

## 4. Selecting a Parsimonious Model [3 points]

We'll select a model by backwards elimination on the training data, using $R^2_{\text{adj}}$ as our metric.  Be sure to show your work: don't delete any models that you fit.  Add as many code cells below as you need.
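For reference, adjusted $R^2$ penalizes $R^2$ for the number of slope terms $p$ relative to the sample size $n$: $R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$. The sketch below uses a hypothetical helper, `adjusted_r2`, and plugs in the rounded values printed in the Question 2 summary ($R^2 = 0.15$, $n = 464$, $p = 11$), so the result is only approximate.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Rounded inputs from the full-model summary: R^2 = 0.15, n = 464, Df Model = 11.
full_adj = adjusted_r2(0.15, 464, 11)
print(round(full_adj, 3))  # 0.129
```

Because the penalty grows with $p$, removing a predictor that explains little can raise $R^2_{\text{adj}}$ even though plain $R^2$ never increases when a term is removed.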

In [158]:
# Starting model
lin_mod = smf.ols("Days ~ Age + Sex + Comorbidity + EdYears + Education + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
lin_mod.rsquared_adj

0.1289374762917086

In [159]:
test_mod1 = smf.ols("Days ~ Sex + Comorbidity + EdYears + Education + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod1.rsquared_adj

0.1277601013127515

In [160]:
test_mod2 = smf.ols("Days ~ Age + Comorbidity + EdYears + Education + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod2.rsquared_adj

0.12791491513245457

In [161]:
# Dropping Comorbidity raises adjusted R^2 above the starting model's 0.1289,
# so this becomes my new current model
test_mod3 = smf.ols("Days ~ Age + Sex + EdYears + Education + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod3.rsquared_adj

0.12998928877883442

In [162]:
test_mod4 = smf.ols("Days ~ Sex + EdYears + Education + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod4.rsquared_adj

0.1285870750166357

In [163]:
test_mod5 = smf.ols("Days ~ Age + EdYears + Education + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod5.rsquared_adj

0.1287956436860369

In [164]:
test_mod6 = smf.ols("Days ~ Age + Sex + Education + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod6.rsquared_adj

0.1299869446823626

In [165]:
# Dropping Education raises adjusted R^2 above the current model's 0.1300,
# so this becomes my new current model
test_mod7 = smf.ols("Days ~ Age + Sex + EdYears + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod7.rsquared_adj

0.1319033456064036

In [166]:
test_mod8 = smf.ols("Days ~ Sex + EdYears + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod8.rsquared_adj

0.13045322100109558

In [167]:
test_mod9 = smf.ols("Days ~ Age + EdYears + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod9.rsquared_adj

0.13070608912878123

In [168]:
test_mod10 = smf.ols("Days ~ Age + Sex + Income + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod10.rsquared_adj

0.1299811682315155

In [169]:
# Dropping Income raises adjusted R^2 above the current model's 0.1319,
# so this becomes my new current model
test_mod11 = smf.ols("Days ~ Age + Sex + EdYears + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod11.rsquared_adj

0.13256801987613032

In [170]:
test_mod12 = smf.ols("Days ~ Sex + EdYears + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod12.rsquared_adj

0.13114400458814313

In [171]:
test_mod13 = smf.ols("Days ~ Age + EdYears + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod13.rsquared_adj

0.1317472204654323

In [172]:
test_mod14 = smf.ols("Days ~ Age + Sex + Care + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod14.rsquared_adj

0.13174191587114337

In [173]:
test_mod15 = smf.ols("Days ~ Age + Sex + EdYears + Race + Pressure + Blood + Rate", data = df_train).fit()
test_mod15.rsquared_adj

0.06596390190132573

In [174]:
test_mod16 = smf.ols("Days ~ Age + Sex + EdYears + Care + Pressure + Blood + Rate", data = df_train).fit()
test_mod16.rsquared_adj

0.12984629985979967

In [175]:
test_mod17 = smf.ols("Days ~ Age + Sex + EdYears + Care + Race + Blood + Rate", data = df_train).fit()
test_mod17.rsquared_adj

0.11210706023609596

In [176]:
# Dropping Blood raises adjusted R^2 above the current model's 0.1326,
# so this becomes my new current model
test_mod18 = smf.ols("Days ~ Age + Sex + EdYears + Care + Race + Pressure + Rate", data = df_train).fit()
test_mod18.rsquared_adj

0.13402570920886248

In [177]:
test_mod19 = smf.ols("Days ~ Sex + EdYears + Care + Race + Pressure + Rate", data = df_train).fit()
test_mod19.rsquared_adj

0.13276582399548542

In [178]:
test_mod20 = smf.ols("Days ~ Age + EdYears + Care + Race + Pressure + Rate", data = df_train).fit()
test_mod20.rsquared_adj

0.13291624358124488

In [179]:
test_mod21 = smf.ols("Days ~ Age + Sex + Care + Race + Pressure + Rate", data = df_train).fit()
test_mod21.rsquared_adj

0.13302234791793488

In [180]:
test_mod22 = smf.ols("Days ~ Age + Sex + EdYears + Race + Pressure + Rate", data = df_train).fit()
test_mod22.rsquared_adj

0.06496144885818012

In [181]:
test_mod23 = smf.ols("Days ~ Age + Sex + EdYears + Care + Pressure + Rate", data = df_train).fit()
test_mod23.rsquared_adj

0.131169976610493

In [182]:
test_mod24 = smf.ols("Days ~ Age + Sex + EdYears + Care + Race + Rate", data = df_train).fit()
test_mod24.rsquared_adj

0.11324299483545086

In [183]:
test_mod25 = smf.ols("Days ~ Age + Sex + EdYears + Care + Race + Pressure", data = df_train).fit()
test_mod25.rsquared_adj

0.1267820133544637

In [184]:
# No single removal from test_mod18 raises adjusted R^2 further, so it is the final model
final_mod = smf.ols("Days ~ Age + Sex + EdYears + Care + Race + Pressure + Rate", data = df_train).fit()
final_mod.rsquared_adj

0.13402570920886248
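The cell-by-cell search above can also be written as a loop. The following is a minimal sketch, not the procedure used in the cells: the helper name `backward_eliminate` is hypothetical, the data are synthetic stand-ins rather than the hospital data, and at each step it compares every single-predictor removal and takes the best one, whereas the cells above move on as soon as one removal improves the metric.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def backward_eliminate(data, response, predictors):
    """Drop one predictor at a time while adjusted R^2 keeps improving."""
    current = list(predictors)
    best = smf.ols(f"{response} ~ {' + '.join(current)}", data=data).fit().rsquared_adj
    while len(current) > 1:
        # Adjusted R^2 of every model with exactly one predictor removed.
        scores = {}
        for drop in current:
            kept = [p for p in current if p != drop]
            formula = f"{response} ~ {' + '.join(kept)}"
            scores[drop] = smf.ols(formula, data=data).fit().rsquared_adj
        drop, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break  # no single removal improves the metric
        current.remove(drop)
        best = score
    return current, best


# Synthetic stand-in data (not the hospital data): y depends on x1 only.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
toy["y"] = 2.0 * toy["x1"] + rng.normal(size=200)

selected, adj_r2 = backward_eliminate(toy, "y", ["x1", "x2", "x3"])
print(selected, round(adj_r2, 3))
```

Either stopping rule ends when no single removal raises $R^2_{\text{adj}}$, but the two can follow different paths and are not guaranteed to reach the same final model.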

## 5. Measuring our Test Data [0 points] 

For this question, you'll evaluate the model selected in Question 4 on our test data using an appropriate $R^2$ measure (either unadjusted or adjusted).  You won't get points for this code here, but you can use the cell below for the calculation that you enter on Gradescope.

In [185]:
# Fits the selected formula to df_test itself and reports the R^2 of that fit
smf.ols("Days ~ Age + Sex + EdYears + Care + Race + Pressure + Rate", data = df_test).fit().rsquared

0.10322703089809082
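A different way to measure fit on held-out data is to keep the coefficients estimated on the training set and score their predictions on the test set. The sketch below shows that calculation on synthetic stand-in data (the frame `toy` and its columns are hypothetical, not the hospital data); it is an illustration of out-of-sample $R^2$, not the calculation performed in the cell above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (not the hospital data).
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.normal(size=300)})
toy["y"] = 1.5 * toy["x"] + rng.normal(size=300)
toy_train, toy_test = toy.iloc[:240], toy.iloc[240:]

# Fit on the training rows only, then predict the held-out rows.
fit = smf.ols("y ~ x", data=toy_train).fit()
pred = fit.predict(toy_test)

# Out-of-sample R^2 = 1 - SSE / SST, with SST taken around the test mean.
sse = float(((toy_test["y"] - pred) ** 2).sum())
sst = float(((toy_test["y"] - toy_test["y"].mean()) ** 2).sum())
test_r2 = 1.0 - sse / sst
print(round(test_r2, 3))
```

This is the same quantity that `r2_score(y_true, y_pred)` from `sklearn.metrics` returns. Unlike an in-sample $R^2$, it can be negative when the predictions do worse than the test mean.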

Remember to keep all your cells and hit the save icon above periodically to checkpoint (save) your results on your local computer. Once you are satisfied with your results, restart the kernel and run all cells (Kernel -> Restart & Run All). **Make sure nothing has changed**. Checkpoint and exit (File -> Save and Checkpoint + File -> Close and Halt). Follow the instructions on the Homework 11 Canvas Assignment to submit your notebook to GitHub.