# Python Assessment

<img src="images/ment.jpg"/>

In [1]:
# Libraries

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [2]:
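# Read Boston data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version.
from sklearn.datasets import load_boston
boston_dataset = load_boston()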

boston = pd.DataFrame(data=boston_dataset.data, 
                      columns=boston_dataset.feature_names)
boston["MEDV"] = boston_dataset.target

# Read NHANES data
NHANES = pd.read_csv("data/nhanes_2015_2016.csv")

# Named `cols` rather than `vars` to avoid shadowing the built-in vars()
cols = ["BPXSY1", "RIDAGEYR", "RIAGENDR", "RIDRETH1", "DMDEDUC2", 
        "BMXBMI", "SMQ020"]
NHANES = NHANES[cols].dropna()

NHANES["smq"] = NHANES.SMQ020.replace({2: 0, 7: np.nan, 9: np.nan})
NHANES["RIAGENDRx"] = NHANES.RIAGENDR.replace({1: "Male", 2: "Female"})
NHANES["DMDEDUC2x"] = NHANES.DMDEDUC2.replace({1: "lt9", 2: "x9_11", 3: "HS", 
                                               4: "SomeCollege",5: "College", 
                                               7: np.nan, 9: np.nan})

np.random.seed(123)

### Questions 1-3

The first three questions use the Boston housing dataset introduced in week 1.

Here is the description for each column:

* __CRIM:__ Per capita crime rate by town
* __ZN:__ Proportion of residential land zoned for lots over 25,000 sq. ft
* __INDUS:__ Proportion of non-retail business acres per town
* __CHAS:__ Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* __NOX:__ Nitric oxide concentration (parts per 10 million)
* __RM:__ Average number of rooms per dwelling
* __AGE:__ Proportion of owner-occupied units built prior to 1940
* __DIS:__ Weighted distances to five Boston employment centers
* __RAD:__ Index of accessibility to radial highways
* __TAX:__ Full-value property tax rate per $\$10,000$
* __PTRATIO:__ Pupil-teacher ratio by town
* __B:__ $1000(Bk - 0.63)^2$, where Bk is the proportion of [people of African American descent] by town
* __LSTAT:__ Percentage of lower status of the population
* __MEDV:__ Median value of owner-occupied homes in $\$1000$s

In [3]:
# Run the following code to fit a linear regression with two predictors
# and output the model summary:
model = sm.OLS.from_formula("MEDV ~ RM + CRIM", data=boston)
result = model.fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.542 |
| Model: | OLS | Adj. R-squared: | 0.540 |
| Method: | Least Squares | F-statistic: | 297.6 |
| Date: | Sat, 27 Mar 2021 | Prob (F-statistic): | 5.22e-86 |
| Time: | 08:30:00 | Log-Likelihood: | -1642.7 |
| No. Observations: | 506 | AIC: | 3291.0 |
| Df Residuals: | 503 | BIC: | 3304.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -29.2447 | 2.588 | -11.300 | 0.000 | -34.330 | -24.160 |
| RM | 8.3911 | 0.405 | 20.726 | 0.000 | 7.596 | 9.186 |
| CRIM | -0.2649 | 0.033 | -8.011 | 0.000 | -0.330 | -0.200 |

| | | | |
|---|---|---|---|
| Omnibus: | 172.412 | Durbin-Watson: | 0.807 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1047.536 |
| Skew: | 1.349 | Prob(JB): | 3.39e-228 |
| Kurtosis: | 9.512 | Cond. No. | 92.3 |


### Question 1
**What is the value of the coefficient for predictor __RM__?**

**Answer.** 8.3911

### Question 2
**Are the predictors for this model statistically significant, yes or no? (Hint: What are their p-values?)**

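**Answer.** Yes. Both RM and CRIM have p-values reported as 0.000 (that is, below 0.0005), far under the significance level alpha = 0.05 that corresponds to a 95% confidence level. Their 95% confidence intervals also exclude zero.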
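
As a rough check on this answer, the p-values can be approximated from the t statistics in the summary. The sketch below uses a standard normal approximation to the t distribution, which is an assumption (reasonable here with 503 residual degrees of freedom) and not the exact computation statsmodels performs. `two_sided_p` is a hypothetical helper name.

```python
import math

ALPHA = 0.05  # significance level for a 95% confidence level

# t statistics copied from the summary of MEDV ~ RM + CRIM
t_stats = {"RM": 20.726, "CRIM": -8.011}

def two_sided_p(t):
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))

p_values = {name: two_sided_p(t) for name, t in t_stats.items()}
significant = {name: p < ALPHA for name, p in p_values.items()}

for name in t_stats:
    print(f"{name}: p ~ {p_values[name]:.3g}, significant: {significant[name]}")
```

Both approximate p-values are many orders of magnitude below 0.05, which is why the summary rounds them to 0.000.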

In [4]:
# Run the following code for question 3:

model = sm.OLS.from_formula("MEDV ~ RM + CRIM + LSTAT", data=boston)
result = model.fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.646 |
| Model: | OLS | Adj. R-squared: | 0.644 |
| Method: | Least Squares | F-statistic: | 305.2 |
| Date: | Sat, 27 Mar 2021 | Prob (F-statistic): | 1.01e-112 |
| Time: | 08:30:01 | Log-Likelihood: | -1577.6 |
| No. Observations: | 506 | AIC: | 3163.0 |
| Df Residuals: | 502 | BIC: | 3180.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.5623 | 3.166 | -0.809 | 0.419 | -8.783 | 3.658 |
| RM | 5.2170 | 0.442 | 11.802 | 0.000 | 4.348 | 6.085 |
| CRIM | -0.1029 | 0.032 | -3.215 | 0.001 | -0.166 | -0.040 |
| LSTAT | -0.5785 | 0.048 | -12.135 | 0.000 | -0.672 | -0.485 |

| | | | |
|---|---|---|---|
| Omnibus: | 171.754 | Durbin-Watson: | 0.822 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 628.308 |
| Skew: | 1.535 | Prob(JB): | 3.67e-137 |
| Kurtosis: | 7.514 | Cond. No. | 216.0 |


### Question 3
**What happened to our R-Squared value when we added the third predictor __LSTAT__ to our initial model?**

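**Answer.** R-squared increased from 0.542 to 0.646 (adjusted R-squared rose from 0.540 to 0.644), so the model with LSTAT explains more of the variance in MEDV.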
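
R-squared and adjusted R-squared are linked by the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. As a minimal sketch, the adjusted values can be recomputed from the rounded R-squared figures in the two summaries. `adjusted_r2` is a hypothetical helper name, and small rounding differences are possible because the inputs are already rounded.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

n_obs = 506  # "No. Observations" in both summaries

adj_two = adjusted_r2(0.542, n_obs, 2)    # MEDV ~ RM + CRIM
adj_three = adjusted_r2(0.646, n_obs, 3)  # MEDV ~ RM + CRIM + LSTAT

print(round(adj_two, 3), round(adj_three, 3))
```

The penalty for the extra predictor is small here, so the adjusted value increases along with R-squared.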

### Question 4
**What type of model should we use when our target outcome, or dependent variable, is continuous?**

**Answer.** Linear regression. It applies when the target (dependent) variable is continuous, the independent variables are continuous or a mix of continuous and categorical, and the relationship between the independent variables and the dependent variable is approximately linear.
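
To make the idea concrete, here is a minimal sketch of a least-squares fit with one continuous predictor, using the closed-form estimates for the slope and intercept. The data are hypothetical toy values, not taken from the Boston dataset.

```python
# Hypothetical toy data: x is one continuous predictor, y a continuous target
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Closed-form least-squares estimates for y = intercept + slope * x
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

`sm.OLS` solves the same least-squares problem, generalized to several predictors.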

In [5]:
# Run the following code to fit a logistic regression 
# and output the model summary:

model = sm.GLM.from_formula("smq ~ RIAGENDRx + RIDAGEYR + DMDEDUC2x", 
                            family=sm.families.Binomial(), 
                            data=NHANES)
result = model.fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | smq | No. Observations: | 5093.0 |
| Model: | GLM | Df Residuals: | 5086.0 |
| Model Family: | Binomial | Df Model: | 6.0 |
| Link Function: | logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -3201.2 |
| Date: | Sat, 27 Mar 2021 | Deviance: | 6402.4 |
| Time: | 08:30:01 | Pearson chi2: | 5100.0 |
| No. Iterations: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.3060 | 0.114 | -20.174 | 0.000 | -2.530 | -2.082 |
| RIAGENDRx[T.Male] | 0.9096 | 0.060 | 15.118 | 0.000 | 0.792 | 1.028 |
| DMDEDUC2x[T.HS] | 0.9434 | 0.090 | 10.521 | 0.000 | 0.768 | 1.119 |
| DMDEDUC2x[T.SomeCollege] | 0.8322 | 0.084 | 9.865 | 0.000 | 0.667 | 0.998 |
| DMDEDUC2x[T.lt9] | 0.2662 | 0.109 | 2.438 | 0.015 | 0.052 | 0.480 |
| DMDEDUC2x[T.x9_11] | 1.0986 | 0.107 | 10.296 | 0.000 | 0.889 | 1.308 |
| RIDAGEYR | 0.0183 | 0.002 | 10.582 | 0.000 | 0.015 | 0.022 |


### Question 5
**Which of our predictors has the largest coefficient?**

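**Answer.** Education (DMDEDUC2x). Its level x9_11 has the largest coefficient, 1.0986, measured relative to the reference education level.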
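
The coefficients of a logistic regression are on the log-odds scale, so exponentiating them gives odds ratios relative to the reference level. The sketch below copies the coefficients from the summary, picks the largest, and converts each one to an odds ratio. The dictionary layout is only an illustration.

```python
import math

# Coefficients (log-odds scale) copied from the GLM summary
coefs = {
    "RIAGENDRx[T.Male]": 0.9096,
    "DMDEDUC2x[T.HS]": 0.9434,
    "DMDEDUC2x[T.SomeCollege]": 0.8322,
    "DMDEDUC2x[T.lt9]": 0.2662,
    "DMDEDUC2x[T.x9_11]": 1.0986,
    "RIDAGEYR": 0.0183,
}

largest = max(coefs, key=coefs.get)
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

print("largest coefficient:", largest)
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio ~ {ratio:.2f}")
```

Note that the RIDAGEYR coefficient is per year of age, so its size is not directly comparable with the coefficients of the categorical indicators.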

### Question 6
**Which values for DMDEDUC2x and RIAGENDRx are represented in our intercept, or what is our reference level?**

In [6]:
print('DMDEDUC2x:', NHANES['DMDEDUC2x'].unique())
print('RIAGENDRx:', NHANES['RIAGENDRx'].unique())

DMDEDUC2x: ['College' 'HS' 'SomeCollege' 'x9_11' 'lt9' nan]
RIAGENDRx: ['Male' 'Female']


**Answer.** The reference levels are the categories that do not appear explicitly in the model summary; their effect is absorbed into the intercept:
        
        DMDEDUC2x: 'College'
        RIAGENDRx: 'Female'
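
With the default treatment coding, the formula interface typically takes the first level in sorted order as the reference, which is consistent with the summary above. The sketch below assumes that sorting rule and then uses the intercept to estimate the smoking probability for a hypothetical 40-year-old in the reference group (the age is an arbitrary illustration).

```python
import math

# Levels observed in the data (NaN excluded)
education_levels = ["College", "HS", "SomeCollege", "x9_11", "lt9"]
gender_levels = ["Male", "Female"]

# Assumption: the reference level is the first level in sorted order
# (Python sorts uppercase letters before lowercase ones).
reference = {
    "DMDEDUC2x": sorted(education_levels)[0],
    "RIAGENDRx": sorted(gender_levels)[0],
}
print(reference)

# The intercept describes the reference group; adding the age term gives
# the log-odds for a hypothetical 40-year-old in that group.
intercept, age_coef, age = -2.3060, 0.0183, 40
log_odds = intercept + age_coef * age
probability = 1 / (1 + math.exp(-log_odds))
print(f"P(smq = 1) ~ {probability:.3f}")
```

For any other group, the matching coefficients from the summary are added to `log_odds` before applying the logistic function.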

### Question 7
**What model should we use when our target outcome, or dependent variable, is binary, that is, it takes only two values, 0 and 1?**

**Answer.** Logistic Regression.