# Assignment 3

## Question 1 (12 points)
Using the [Framingham Heart Study dataset](https://github.com/soltaniehha/Intro-to-Data-Analytics/blob/main/data/AnalyticsEdge-Datasets/Framingham.csv), create a **logistic regression** model to predict whether a patient will develop heart disease within 10 years.

Follow the steps outlined in the [Classification notebook](https://github.com/soltaniehha/Intro-to-Data-Analytics/blob/main/08-Machine-Learning-Overview/03-Classification.ipynb):
* Preprocessing: deleting columns with no predictive power/handling missing values
* Preprocessing: handle categorical variables, if any
* Create feature matrix and target vector. Our target variable is `TenYearCHD`
* Split the data randomly into train and test with a 70-30 split (use `random_state=780`)
* Instantiate and fit a logistic regression model
* Make predictions and find the overall accuracy, sensitivity, and specificity on your test set

**Note:** We have seen this dataset during the discussion on the Framingham Heart Study from Analytics Edge.

## Question 2 (8 points)
Open ended - Do further data exploration and create new variables when possible (feature engineering). Show your discovery process using plots and summaries. 
* How does the model performance change by adding new variables or potentially removing some of the less important ones? 
* How does the model performance change by trying different classification models?

---

### Upload your .ipynb file to Questrom Tools

A potential issue is downloading the notebook before it has been fully saved. To avoid this, follow these steps:
1. Go to Runtime (in the menu) and hit "Restart and run all..."
2. After the notebook has fully run, save it and then download your .ipynb to your computer
3. Upload it back to your Drive and open it with Colab to ensure all of your recent changes are there
4. Upload the originally downloaded file to Questrom Tools.

---

The data has been loaded in the following cell:

In [1]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/soltaniehha/Intro-to-Data-Analytics/master/data/AnalyticsEdge-Datasets/Framingham.csv')
df.head(3)

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
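
The `Unnamed: 0` column in the preview appears to be a row index that was saved along with the data rather than a patient measurement. If that reading is right, one way to keep it out of the feature matrix is to pass `index_col=0` to `pd.read_csv`. The sketch below uses a hypothetical three-row inline CSV as a stand-in for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the first rows of Framingham.csv. The empty
# leading header is what pandas would otherwise display as "Unnamed: 0".
csv_text = """,male,age,TenYearCHD
0,1,39,0
1,0,46,0
2,1,48,0
"""

# index_col=0 uses the first column as the row index instead of a feature
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(toy.columns))
```

On an already loaded frame, `df.drop(columns='Unnamed: 0')` would have the same effect.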


# Question 1

In [2]:
# Preprocessing: deleting columns with no predictive power/handling missing values
# I'm gonna delete 5 cols (BPMeds, prevalentStroke, prevalentHyp, sysBP, diaBP) because I don't have any domain knowledge about them
df.drop(['BPMeds', 'prevalentStroke', 'prevalentHyp', 'sysBP', 'diaBP'], axis = 1, inplace = True)


# Using DataFrame.drop
# df.drop(df.columns[[1, 2]], axis=1, inplace=True)

In [3]:
#dropping nulls so hard right now

df.dropna(axis = 0, how = 'any', inplace = True)
df.isnull().sum()

male             0
age              0
education        0
currentSmoker    0
cigsPerDay       0
diabetes         0
totChol          0
BMI              0
heartRate        0
glucose          0
TenYearCHD       0
dtype: int64

In [4]:
#Preprocessing: handle categorical variables, if any
#Don't have any
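
`education` is stored as a number (values such as 4.0, 2.0, and 1.0 in the preview), but it may be a coded level rather than a measured quantity. If it is treated as categorical, one option is one-hot encoding with `pd.get_dummies`. This is a sketch on a hypothetical toy frame, not the assignment data; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "education": [4, 2, 1, 3],
})

# One indicator column per education level; drop_first=True drops the
# lowest level so the indicators are not perfectly collinear
encoded = pd.get_dummies(toy, columns=["education"], drop_first=True)
print(list(encoded.columns))
```

Whether this helps depends on the model: treating the codes as a single numeric column assumes each step up in education has the same effect, while indicator columns let each level have its own coefficient.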

In [5]:
# Create feature matrix and target vector. Our target variable is `TenYearCHD`

X = df.drop('TenYearCHD', axis = 1)
y = df['TenYearCHD']

y.shape

(3709,)

In [6]:
# Split the data randomly into train and test with a 70-30 split (use `random_state=780`)



# !pip install scikit-learn -q
from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=780, stratify=y)

In [7]:
#* Instantiate and fit a logistic regression model

model = linear_model.LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

LogisticRegression(max_iter=2000)

In [8]:
#Make predictions 

y_hat = model.predict(X_test)
y_hat.shape

(1113,)

In [9]:
y_test.shape

(1113,)

In [15]:
# Find the overall accuracy, sensitivity, and specificity on your test set
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_hat)

In [11]:
print("Our model is", round(sum(y_test == y_hat)/len(y_hat),2) * 100, "% accurate!")


Our model is 85.0 % accurate!
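
The question also asks for sensitivity and specificity, which the cells above do not compute. A minimal sketch of one way to get them, assuming the labels are coded 0/1 as in `TenYearCHD`. The helper name `sensitivity_specificity` and the toy labels are hypothetical, used only for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1.

    Assumes at least one actual positive and one actual negative,
    otherwise one of the divisions below is by zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn)  # share of actual positives that were caught
    specificity = tn / (tn + fp)  # share of actual negatives that were cleared
    return sensitivity, specificity


# Hypothetical toy labels, not results from the Framingham data
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true_toy, y_pred_toy)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this notebook the call would be `sensitivity_specificity(y_test, y_hat)`. `sklearn.metrics.confusion_matrix(y_test, y_hat).ravel()` returns the same four counts in the order `tn, fp, fn, tp`.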


# Question 2 

## How does the model performance change by adding new variables or potentially removing some of the less important ones? 


In [12]:
# OK let's add all the variables to see how it goes haha

df = pd.read_csv('https://raw.githubusercontent.com/soltaniehha/Intro-to-Data-Analytics/master/data/AnalyticsEdge-Datasets/Framingham.csv')
df.dropna(axis = 0, how = 'any', inplace = True)
df.isnull().sum()

X = df.drop('TenYearCHD', axis = 1)
y = df['TenYearCHD']

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.3, random_state = 780, stratify= y)

model.fit(Xtrain, ytrain)

yhat = model.predict(Xtest)
yhat.shape

(1098,)

In [13]:
from sklearn.metrics import accuracy_score
adding_all_accuracy = accuracy_score(ytest, yhat)

In [16]:
adding_all_accuracy - accuracy 

0.005223906244629983

In [14]:
print("Our model is", round(sum(ytest == yhat)/len(yhat),2) * 100, "% accurate!")


Our model is 85.0 % accurate!


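Huh, rounded to two decimals it's literally the same. Worth noting the two test sets aren't identical though: 1113 rows before vs. 1098 now, because dropping nulls across all the columns removes more rows.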

In [17]:
print("The fit all variables model improves by about", round((adding_all_accuracy - accuracy) * 100, 2), "percentage points")

The fit all variables model improves by about 0.52 percentage points
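
A small change in accuracy is easier to judge next to the accuracy of always predicting the most common class. A minimal sketch of that baseline, with a hypothetical helper name and hypothetical toy labels (the actual class balance of `TenYearCHD` is not shown here):

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)


# Hypothetical toy labels: 8 negatives and 2 positives
y_toy = [0] * 8 + [1] * 2

baseline = majority_baseline_accuracy(y_toy)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

Calling `majority_baseline_accuracy(ytest)` in the notebook would give the figure to compare both models against.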


## How does the model performance change by trying different classification models?

In [22]:
from sklearn.naive_bayes import GaussianNB

Gau_model = GaussianNB()

Gau_model.fit(Xtrain, ytrain)

y_hat_gau = Gau_model.predict(Xtest)

print("Our Gaussian model is", round(sum(ytest == y_hat_gau)/len(y_hat_gau),2), "accurate!")


Our Gaussian model is 0.82 accurate!
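
To compare several classifiers on equal footing, one option is to fit each on the same split and collect the scores in one place. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so the numbers say nothing about the Framingham data), and the decision tree is an extra model added purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); swap in X and y from the notebook
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=780)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=780, stratify=y_toy
)

# Candidate models, all evaluated on the same train/test split
candidates = {
    "logistic regression": LogisticRegression(max_iter=2000),
    "gaussian naive bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=780),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, clf.predict(X_te))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Keeping every model on one split avoids mixing differences between models with differences between test sets.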
