3.3.4 Challenge- Advanced Regression

## Thinkful Challenge- Advanced Regression

Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be run on a training set and a test-set (or multiple test-sets, if you take a folds approach). The models should be:

- Vanilla logistic regression
- Ridge logistic regression
- Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: the scikit-learn LogisticRegression class has a "penalty" argument that takes either 'l1' or 'l2' as a value.
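As a rough guide to what the "penalty" argument changes, the two penalized objectives can be written as below. This follows the form I understand the scikit-learn user guide to use, up to scaling constants; the notation ($w$ for coefficients, $c$ for the intercept, labels $y_i \in \{-1, +1\}$) is an assumption made for this note.

$$
\text{L2 (ridge):}\quad \min_{w,\,c}\; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + c)}\right)
$$

$$
\text{L1 (lasso):}\quad \min_{w,\,c}\; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + c)}\right)
$$

In both cases "C" multiplies the data-fit term, so a smaller "C" means stronger regularization and a very large "C" approaches an unpenalized fit.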

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?

Record your work and reflections in a notebook to discuss with your mentor.

## Import libraries and dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import sklearn
from sklearn.linear_model import Ridge, Lasso, LogisticRegression
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
sns.set_style('white')

In [None]:
df = pd.read_csv('cardio_train.csv', sep=';')
df.shape

In [None]:
df.head()

### Dataset details

Details from Kaggle about the dataset
- Age | Objective Feature | age | int (days)
- Height | Objective Feature | height | int (cm) |
- Weight | Objective Feature | weight | float (kg) |
- Gender | Objective Feature | gender | categorical code |
- Systolic blood pressure | Examination Feature | ap_hi | int |
- Diastolic blood pressure | Examination Feature | ap_lo | int |
- Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
- Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
- Smoking | Subjective Feature | smoke | binary |
- Alcohol intake | Subjective Feature | alco | binary |
- Physical activity | Subjective Feature | active | binary |
- Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

## Clean dataset, add features and split training and test sets

In [None]:
#Drop rows with missing values before engineering features
heart = df.dropna(axis=0, how='any')

In [None]:
#Evaluate changes to the shape after dropping NaN's
heart.shape

In [None]:
#We don't need id in the features or as a target, we will drop this
heart = heart.drop('id', axis=1)

#Create new features: bmi (kg/m^2, height is in cm) and age in years
heart['bmi'] = (heart.weight / (heart.height**2)) * 10000
heart['age_yrs'] = (heart.age / 365).round()

#Recode gender from codes 1/2 to 1/0 (code 1 stays 1, code 2 becomes 0)
heart['gender'] = np.where(heart['gender'] == 1, 1, 0)

In [None]:
#Create additional features that may help predict cardio disease
heart['adult_50plus'] = np.where(heart['age_yrs'] >= 50, 1, 0)
heart['overweight'] = np.where((heart['bmi'] >= 25) & (heart['bmi'] < 30), 1, 0)
heart['obese'] = np.where(heart['bmi']>=30,1, 0)
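
The BMI flags above can also be built in one step with pandas' cut function, which avoids gaps between the bands. This is only a sketch: the BMI values and the label names are made up for illustration and are not taken from the cardio data.

```python
import pandas as pd

# Hypothetical BMI values for illustration only
bmi = pd.Series([18.0, 24.9, 25.0, 29.5, 30.0, 41.2], name="bmi")

# right=False closes each bin on the left: [25, 30) is overweight, [30, inf) is obese
bmi_class = pd.cut(
    bmi,
    bins=[0, 25, 30, float("inf")],
    labels=["normal_or_under", "overweight", "obese"],
    right=False,
)

# One 0/1 indicator column per band
bmi_dummies = pd.get_dummies(bmi_class, prefix="bmi").astype(int)
print(bmi_dummies)
```

With left-closed bins a BMI of 29.5 falls in the overweight band and 30.0 in the obese band, so every value belongs to exactly one band.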

In [None]:
#drop highly correlated and recalculated fields
heart = heart.drop('age', axis=1)
heart = heart.drop('age_yrs', axis=1)

In [None]:
#evaluate the new dataset
heart.info()

In [None]:
#establish target variable "cardio"
y = heart.cardio.values

In [None]:
#drop target from feature set for scaling
heart2 = heart.drop('cardio', axis=1)

In [None]:
#scale data
names = heart2.columns
heart2 = pd.DataFrame(preprocessing.scale(heart2), columns=names)

In [None]:
#evaluate dataset
#heart.head(30)

In [None]:
#Evaluate new dataset to create features
#heart2.head()

In [None]:
#Define features as "X"
X = heart2.values
X.shape

In [None]:
#Keep the target 1-D; scikit-learn classifiers expect y with shape (n_samples,)
y = y.ravel()
y.shape

In [None]:
#Split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
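
One thing to note about the steps above: the features are scaled before the split, so the test rows contribute to the means and standard deviations used for scaling. A minimal sketch of an alternative is to put the scaler in a pipeline so it is fit on the training rows only. The data here is synthetic (make_classification) and the variable names are hypothetical, so this is not the cardio result.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled feature matrix and binary target
X_demo, y_demo = make_classification(n_samples=500, n_features=15, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then reuses it on X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy: ", test_accuracy)
```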

In [None]:
#Make a correlation matrix to evaluate features collinearity
corrmat = heart.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()


## Create Models

### Logistic Regression Model

In [None]:
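#Create the classifier. scikit-learn applies an L2 penalty by default (C=1.0),
#so a very large C is used here to make the regularization negligible
log_reg = LogisticRegression(C=1e9, random_state=42, max_iter=1000)

#Fit the classifier to the training set
log_reg.fit(X_train, y_train)

#Predict based on test features
log_pred = log_reg.predict(X_test)

#score() returns mean accuracy for a classifier, not R2
print("Training accuracy: ", log_reg.score(X_train, y_train))
print("Test accuracy: ", log_reg.score(X_test, y_test))
print("Prediction: ", log_pred)
print("Probability: ", log_reg.predict_proba(X_test))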
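
For a classifier, score reports mean accuracy, and printing raw predictions and probabilities is hard to compare across models. A short sketch of test-set metrics that may be easier to compare is below. It uses synthetic data and hypothetical variable names, and the choice of metrics is my assumption about what is useful here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cardio features and target
X_demo, y_demo = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

acc = accuracy_score(y_te, pred)   # same quantity clf.score(X_te, y_te) returns
cm = confusion_matrix(y_te, pred)  # rows are true classes, columns are predictions
auc = roc_auc_score(y_te, proba)   # ranking quality of the probabilities

print("Test accuracy: ", acc)
print("Confusion matrix:\n", cm)
print("ROC AUC: ", auc)
```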


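### Ridge Logistic Regression Model

In [None]:
#Create the classifier with an L2 (ridge) penalty at the default strength C=1.0
ridge = LogisticRegression(random_state=42, penalty='l2')

#Fit the classifier to the training set
ridge.fit(X_train, y_train)

#Predict based on test features
ridge_pred = ridge.predict(X_test)

#score() returns mean accuracy for a classifier, not R2
print("Training accuracy: ", ridge.score(X_train, y_train))
print("Test accuracy: ", ridge.score(X_test, y_test))
print("Prediction: ", ridge_pred)
print("Probability: ", ridge.predict_proba(X_test))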


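### Lasso Logistic Regression Model

In [None]:
#Create the classifier with an L1 (lasso) penalty.
#The default lbfgs solver does not support 'l1', so liblinear is used
lasso = LogisticRegression(C=.2, random_state=42, penalty='l1', solver='liblinear')

#Fit the classifier to the training set
lasso.fit(X_train, y_train)

#Predict based on test features
lasso_pred = lasso.predict(X_test)

#score() returns mean accuracy for a classifier, not R2
print("Training accuracy: ", lasso.score(X_train, y_train))
print("Test accuracy: ", lasso.score(X_test, y_test))
print("Prediction: ", lasso_pred)
print("Probability: ", lasso.predict_proba(X_test))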
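
One practical difference between the two penalties is that L1 can set coefficients to exactly zero, while L2 only shrinks them. The sketch below counts zero coefficients for both penalties at the same "C". It runs on synthetic data with hypothetical names, so the counts say nothing about the cardio features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 15 features, only a few of them informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=15, n_informative=4, n_redundant=2, random_state=1
)
X_demo = StandardScaler().fit_transform(X_demo)

# Same solver and C for both, so only the penalty type differs
l1_model = LogisticRegression(
    penalty='l1', C=0.05, solver='liblinear', random_state=1
).fit(X_demo, y_demo)
l2_model = LogisticRegression(
    penalty='l2', C=0.05, solver='liblinear', random_state=1
).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Zero coefficients with L1: ", n_zero_l1)
print("Zero coefficients with L2: ", n_zero_l2)
```

Features whose L1 coefficient is zero are effectively dropped from the model, which is one way to read lasso as a feature selection step.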

## Conclusion

After evaluating all three models, I believe the L1 (lasso) penalty gives the best results of the three for this dataset.

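Initially I tried various "C" values to see how the models differed. The L2 (ridge) penalty gave outcomes similar to the plain logistic regression. This is expected: scikit-learn's LogisticRegression applies an L2 penalty by default, so a model created without an explicit "penalty" argument is already a ridge model with C=1.0.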
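
The manual comparison of "C" values described above could be made systematic with a cross-validated grid search. This is a minimal sketch on synthetic data; the grid of values, the fold count, and the variable names are assumptions for illustration, not the values I actually tried.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features and target
X_demo, y_demo = make_classification(n_samples=400, n_features=15, random_state=7)

# Hypothetical grid of regularization strengths (smaller C = stronger penalty)
c_grid = [0.01, 0.1, 1.0, 10.0]

search = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear', random_state=7),
    param_grid={'C': c_grid},
    cv=5,                # 5-fold cross-validation on the training data
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

best_c = search.best_params_['C']
mean_scores = search.cv_results_['mean_test_score']

print("Best C: ", best_c)
print("Mean CV accuracy per C: ", dict(zip(c_grid, mean_scores)))
```

Selecting "C" by cross-validation on the training set keeps the test set untouched for the final comparison of the three models.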

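In conclusion, the lasso model gave better predicted probabilities and a better score. Note that "score" on a classifier reports mean accuracy rather than R2, so the comparison should be read as a comparison of accuracy.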