# <font color='#31394d'> Logistic Regression Practice Exercise </font>

For this exercise we are going to use the heart dataset to predict whether or not someone has heart disease. You can read more about the dataset here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease. The dataset is provided as a CSV file in the `data` folder.

🚀 <font color='#d9c4b1'> Exercise: </font> Start by reading in the dataset from the `data` folder and having a look at the data. Don't forget to import the necessary packages!

In [1]:
import pandas as pd
import numpy as np

In [2]:
data=pd.read_csv('data/heart.csv')

In [3]:
data.sample(10)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,Y
129,62,0,4,124,209,0,0,163,0,0.0,1,0.0,False
247,62,1,2,128,208,1,2,140,0,0.0,1,0.0,False
65,60,1,4,145,282,0,2,142,1,2.8,2,2.0,True
22,58,1,2,120,284,0,2,160,0,1.8,2,0.0,True
8,63,1,4,130,254,0,2,147,0,1.4,2,1.0,True
99,48,1,4,122,222,0,2,186,0,0.0,1,0.0,False
146,57,1,4,165,289,1,2,124,0,1.0,2,3.0,True
228,52,0,3,136,196,0,2,169,0,0.1,2,0.0,False
281,35,1,2,122,192,0,0,174,0,0.0,1,0.0,False
240,49,0,4,130,269,0,0,163,0,0.0,1,0.0,False


<!-- ### Attributes
- X1: age in years
- X2: sex (1 = male; 0 = female)
- X3: chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- X4: resting blood pressure (in mm Hg on admission to the hospital)
- X5: serum cholesterol in mg/dl
- X6: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- X7: resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- X8: maximum heart rate achieved
- X9: exercise induced angina (1 = yes; 0 = no)
- X10: oldpeak = ST depression induced by exercise relative to rest
- X11: slope: the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
- X12: number of major vessels (0-3) colored by fluoroscopy
- Y: num: diagnosis of heart disease (angiographic disease status)
    - Value 0: < 50% diameter narrowing
    - Value 1: > 50% diameter narrowing -->

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      299 non-null    int64  
 1   X2      299 non-null    int64  
 2   X3      299 non-null    int64  
 3   X4      299 non-null    int64  
 4   X5      299 non-null    int64  
 5   X6      299 non-null    int64  
 6   X7      299 non-null    int64  
 7   X8      299 non-null    int64  
 8   X9      299 non-null    int64  
 9   X10     299 non-null    float64
 10  X11     299 non-null    int64  
 11  X12     299 non-null    float64
 12  Y       299 non-null    bool   
dtypes: bool(1), float64(2), int64(10)
memory usage: 28.4 KB


In [5]:
data.describe()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,54.528428,0.675585,3.160535,131.668896,247.100334,0.147157,0.996656,149.505017,0.327759,1.051839,1.602007,0.672241
std,9.02095,0.468941,0.962893,17.705668,51.914779,0.354856,0.994948,22.954927,0.470183,1.163809,0.617526,0.937438
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.0,0.0,0.0,1.0,0.0
50%,56.0,1.0,3.0,130.0,242.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0
75%,61.0,1.0,4.0,140.0,275.5,0.0,2.0,165.5,1.0,1.6,2.0,1.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0


#### Observations
- No null values.
- 12 numeric features and 1 boolean column, `Y`, which is the target to predict.

### Split the data for training and testing

In [6]:
from sklearn.model_selection import train_test_split
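X = data.drop(['Y'], axis=1)   # 12 feature columns
y = data[['Y']]                # target, kept as a one-column DataFrame
# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)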

### Standardize the features

🚀 <font color='#d9c4b1'> Exercise: </font> Now standardize the features. You can learn more about standardization in the `Logistic Regression.ipynb` notebook that we used during the session!

In [7]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [8]:
pd.DataFrame(X_train, columns=data.columns[:-1])

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12
0,1.735199,0.679366,-1.361238,1.334799,-0.007679,-0.401386,0.997663,-0.268800,-0.702038,-0.882900,-0.974555,-0.725351
1,-1.114978,-1.471960,-1.361238,-0.118768,-0.210591,-0.401386,0.997663,1.125875,-0.702038,-0.392356,0.592230,-0.725351
2,0.253107,0.679366,-1.361238,-0.454206,0.287467,-0.401386,-1.016941,-0.355968,-0.702038,-0.637628,-0.974555,-0.725351
3,-1.571007,-1.471960,-1.361238,-0.342393,1.117563,-0.401386,-1.016941,0.602872,-0.702038,-0.882900,-0.974555,-0.725351
4,0.253107,0.679366,0.827214,0.999360,0.564165,-0.401386,0.997663,-1.619892,1.424425,-0.392356,0.592230,0.298962
...,...,...,...,...,...,...,...,...,...,...,...,...
204,0.481121,0.679366,0.827214,0.440296,-1.262046,-0.401386,-1.016941,0.559288,1.424425,-0.882900,-0.974555,0.298962
205,1.507184,0.679366,-0.267012,2.676552,0.527272,2.491364,0.997663,0.036285,1.424425,0.425216,0.592230,-0.725351
206,-0.088915,0.679366,0.827214,0.440296,-0.118358,-0.401386,-1.016941,0.472121,-0.702038,0.098187,-0.974555,-0.725351
207,-0.658950,0.679366,-0.267012,-0.789644,-1.778550,-0.401386,0.997663,-1.009722,-0.702038,-0.228842,-0.974555,2.347588
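
As a sanity check on what `StandardScaler` is doing, the sketch below standardizes a small array by hand with NumPy. The values in `toy_train` and `toy_test` are made up for illustration and are not taken from `heart.csv`. The point is that the mean and standard deviation are learned from the training rows only and then reused on the test rows, mirroring `fit_transform` followed by `transform`.

```python
import numpy as np

# Hypothetical toy data (not from heart.csv): 4 training rows, 2 test rows, 2 features.
toy_train = np.array([[50.0, 120.0], [60.0, 140.0], [40.0, 130.0], [70.0, 150.0]])
toy_test = np.array([[55.0, 135.0], [45.0, 125.0]])

# Learn the statistics on the training rows only.
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# Apply the same shift and scale to both splits.
toy_train_std = (toy_train - mu) / sigma
toy_test_std = (toy_test - mu) / sigma

print(toy_train_std.mean(axis=0))  # approximately 0 for each feature
print(toy_train_std.std(axis=0))   # approximately 1 for each feature
```

The test rows are not guaranteed to have mean 0 and standard deviation 1 after the transform, because they are scaled with the training statistics.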


### Logistic Regression Model

🚀 <font color='#d9c4b1'> Exercise: </font> Fit a standard logistic regression model and determine which features look most promising.

In [9]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty=None)  # no regularization; scikit-learn < 1.2 spells this penalty='none'

In [10]:
complex_model=model.fit(X_train, y_train.values.ravel())

#### Determining feature importance

In [11]:
complex_model.coef_[0]

array([ 0.07546429,  0.58467449,  0.86366748,  0.26114435,  0.11270334,
       -0.51808234,  0.25679271, -0.26717581,  0.58641758,  0.68859559,
        0.37278552,  1.15178896])

In [12]:
pd.Series(np.abs(complex_model.coef_[0]), index=data.columns[:-1]).sort_values(ascending=False)

X12    1.151789
X3     0.863667
X10    0.688596
X9     0.586418
X2     0.584674
X6     0.518082
X11    0.372786
X8     0.267176
X4     0.261144
X7     0.256793
X5     0.112703
X1     0.075464
dtype: float64
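
Because the features were standardized, each coefficient is the change in the log-odds of `Y` being `True` per one standard deviation increase in that feature, with the other features held fixed. Exponentiating a coefficient turns it into an odds ratio, which can be easier to read. The sketch below does this for the coefficients printed above; the dictionary `coefs` is typed in by hand from that output.

```python
import math

# Coefficients copied from the output above (rounded to 6 decimals).
coefs = {
    "X1": 0.075464, "X2": 0.584674, "X3": 0.863667, "X4": 0.261144,
    "X5": 0.112703, "X6": -0.518082, "X7": 0.256793, "X8": -0.267176,
    "X9": 0.586418, "X10": 0.688596, "X11": 0.372786, "X12": 1.151789,
}

# exp(coef) = multiplicative change in the odds per +1 standard deviation.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

# List features from the strongest to the weakest effect, in either direction.
for name in sorted(coefs, key=lambda n: abs(coefs[n]), reverse=True):
    print(f"{name}: odds ratio {odds_ratios[name]:.2f}")
```

An odds ratio above 1 means the odds of `Y = True` rise as the feature increases, and a ratio below 1 (here `X6` and `X8`) means they fall.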

In [13]:
y_hat = complex_model.predict(X_test)
y_hat[:10]

array([False, False, False, False, False, False, False,  True, False,
        True])

In [14]:
y_prob = complex_model.predict_proba(X_test)
np.around(y_prob[:10,:], 2)

array([[0.86, 0.14],
       [0.56, 0.44],
       [0.84, 0.16],
       [0.8 , 0.2 ],
       [0.98, 0.02],
       [0.8 , 0.2 ],
       [0.91, 0.09],
       [0.46, 0.54],
       [0.97, 0.03],
       [0.11, 0.89]])
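
For a binary `LogisticRegression`, `predict` amounts to thresholding the second column of `predict_proba` (the probability of `True`) at 0.5. The sketch below re-derives the ten labels shown earlier from the ten rounded probabilities printed above; the array `p_true` is typed in by hand from that output, and the alternative threshold of 0.4 is an arbitrary value chosen for illustration.

```python
import numpy as np

# Second column of the predict_proba output above: P(Y = True) for the first 10 test rows.
p_true = np.array([0.14, 0.44, 0.16, 0.20, 0.02, 0.20, 0.09, 0.54, 0.03, 0.89])

# Default decision rule: predict True when P(Y = True) exceeds 0.5.
labels = p_true > 0.5
print(labels)

# A lower threshold flags more cases as positive, trading precision for recall.
labels_04 = p_true > 0.4
print(labels_04.sum(), "positives at threshold 0.4 versus", labels.sum(), "at 0.5")
```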

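`X12` and `X3` have the largest absolute coefficients (about 1.15 and 0.86), so they look like the most important features for predicting heart disease. Because the features were standardized, the coefficient magnitudes are comparable across features.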

### Simpler Model

🚀 <font color='#d9c4b1'> Exercise: </font> Fit another model that includes only the features that you think look promising. Use cross-validation with the accuracy, precision, and recall scoring metrics to determine which model is best.

In [15]:
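# Fit a separate estimator on X12 and X3 only (column positions 11 and 2).
# Calling model.fit again would refit the same object that complex_model refers to.
from sklearn.base import clone
simple_model = clone(model).fit(X_train[:, [11, 2]], y_train.values.ravel())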

In [16]:
y_pred = simple_model.predict(X_test[:,[11,2]])
y_pred[:10]

array([False,  True,  True, False, False, False, False, False, False,
       False])

In [17]:
y_prob = simple_model.predict_proba(X_test[:,[11,2]])
np.around(y_prob[:10,:], 2)

array([[0.53, 0.47],
       [0.5 , 0.5 ],
       [0.22, 0.78],
       [0.53, 0.47],
       [0.92, 0.08],
       [0.53, 0.47],
       [0.92, 0.08],
       [0.53, 0.47],
       [0.92, 0.08],
       [0.53, 0.47]])

### Model evaluation

In [18]:
from sklearn.metrics import confusion_matrix

# Test-set confusion matrices: y_hat comes from the complex model, y_pred from the simple one
c_model = confusion_matrix(y_test, y_hat)
s_model = confusion_matrix(y_test, y_pred)

pd.DataFrame(c_model, index=['actual0','actual1'], columns=['pred0','pred1'])

Unnamed: 0,pred0,pred1
actual0,44,7
actual1,13,26


In [19]:
pd.DataFrame(s_model, index=['actual0','actual1'], columns=['pred0','pred1'])

Unnamed: 0,pred0,pred1
actual0,43,8
actual1,19,20
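
Accuracy, precision, and recall on the held-out test set can be read directly off the two confusion matrices above. The sketch below does the arithmetic in plain Python; the helper name `summarize` is introduced here for illustration only.

```python
def summarize(tn, fp, fn, tp):
    """Return (accuracy, precision, recall) for a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return accuracy, precision, recall

# Counts copied from the tables above (rows = actual, columns = predicted).
full_model = summarize(tn=44, fp=7, fn=13, tp=26)
two_feature_model = summarize(tn=43, fp=8, fn=19, tp=20)

for name, scores in [("all 12 features", full_model), ("X12 and X3 only", two_feature_model)]:
    acc, prec, rec = scores
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

On this particular split the full model comes out ahead on all three metrics, most clearly on recall.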


#### a) Classification accuracy

In [20]:
from sklearn.model_selection import cross_val_score

complex_scoring = cross_val_score(estimator=complex_model, X=X_train, y=y_train.values.ravel(), scoring='accuracy', cv=10)
complex_scoring.mean()

0.8138095238095238

In [21]:
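# Note: X_train holds all 12 features here, and cross_val_score refits a clone of the estimator,
# so this call scores the same 12-feature model as the previous cell.
simple_accuracy = cross_val_score(estimator=simple_model, X=X_train, y=y_train.values.ravel(), scoring="accuracy", cv=10)
simple_accuracy.mean()

0.8138095238095238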

#### b) Classification precision

In [22]:
complex_scoring = cross_val_score(estimator=complex_model, X=X_train, y=y_train.values.ravel(), scoring='precision', cv=10)
complex_scoring.mean()

0.8233469308469308

In [23]:
simple_precision = cross_val_score(estimator=simple_model, X=X_train, y=y_train.values.ravel(), scoring='precision', cv=10)
simple_precision.mean()

0.8233469308469308

#### c) Classification recall

In [24]:
complex_scoring = cross_val_score(estimator=complex_model, X=X_train, y=y_train.values.ravel(), scoring='recall', cv=10)
complex_scoring.mean()

0.7877777777777778

In [25]:
simple_recall = cross_val_score(estimator=simple_model, X=X_train, y=y_train.values.ravel(), scoring='recall', cv=10)
simple_recall.mean()

0.7877777777777778

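The complex and simple models show identical cross-validated accuracy, precision, and recall, but this is an artifact of the calls rather than evidence that the models perform equally. Every `cross_val_score` call above receives the full 12-feature `X_train`, and `cross_val_score` refits a clone of whatever estimator it is given, so both sets of scores describe the same 12-feature model. To score the simple model, pass `X_train[:, [11, 2]]` instead. The test-set confusion matrices do differ: the complex model classifies 70 of 90 cases correctly, the simple model 63 of 90.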
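
One way to make the cross-validated comparison contrast two genuinely different models is to give each its own unfitted estimator and its own feature matrix. The sketch below uses `cross_validate`, which scores several metrics in one pass. It runs on synthetic data from `make_classification` as a stand-in for `heart.csv`, so the numbers it prints say nothing about the heart data; the column positions `[11, 2]` only mirror the `X12`/`X3` choice above. It also keeps scikit-learn's default L2 penalty rather than the unpenalized model used in this notebook, and places the scaler inside a pipeline so that it is refit within each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data: 299 rows, 12 features (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=299, n_features=12, n_informative=5, random_state=123
)

metrics = ["accuracy", "precision", "recall"]
candidates = {
    "all features": X_demo,
    "two features": X_demo[:, [11, 2]],  # mirrors X12 and X3 by position only
}

results = {}
for name, features in candidates.items():
    # A fresh pipeline per candidate, so no fitted state is shared between models.
    pipeline = make_pipeline(StandardScaler(), LogisticRegression())
    scores = cross_validate(pipeline, features, y_demo, scoring=metrics, cv=10)
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}

for name, summary in results.items():
    print(name, {m: round(v, 3) for m, v in summary.items()})
```

With the real data, replacing `X_demo` and `y_demo` by the unscaled training features and labels would give one row of mean scores per model, which can then be compared metric by metric.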