<a href="https://www.kaggle.com/code/noemicj/dataset-uci-heart-disease?scriptVersionId=115934783" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Classification using scikit-learn | UCI Heart Disease Dataset**

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

> As part of the final project of the course "[Introduction to Data Science and scikit-learn in Python](https://www.coursera.org/learn/data-science-and-scikit-learn-in-python)", we had to predict the presence of heart disease using [patient data](/kaggle/input/heart-disease-dataset-uci/HeartDiseaseTrain-Test.csv).

**Goal: predict the presence or absence of heart disease based on the following health-related features**

* age: age in years
* sex: (0 = female; 1 = male)
* cp: chest pain type (0 = Typical angina; 1 = Atypical angina; 2 = Non-anginal pain; 3 = Asymptomatic)
* trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* chol: serum cholesterol in mg/dl
* fbs: (fasting blood sugar > 120 mg/dl) (0 = false; 1 = true)
* restecg: resting electrocardiographic results (0 = Normal; 1 = ST-T Wave Abnormality; 2 = Left Ventricular Hypertrophy)
* thalach: maximum heart rate achieved 
* exang: exercise induced angina (0 = no; 1 = yes)
* oldpeak: ST depression induced by exercise relative to rest 
* slope: the slope of the peak exercise ST segment (0 = Upsloping; 1 = Flat; 2 = Downsloping)
* ca: number of major vessels (0-3) colored by fluoroscopy
* thal: a blood disorder called 'Thalassemia' (0 = Normal; 1 = Fixed Defect; 2 = Reversible Defect)
* condition: presence of heart disease (0 = no; 1 = yes)

In [None]:
data = pd.read_csv("/kaggle/input/heart-disease-cleveland-uci/heart_cleveland_upload.csv", index_col=0)
data.head()

# Part 1: Data Preprocessing Techniques (50 pts)

a) Use one-hot encoding to transform the 'thal' feature into three columns called 'is_normal', 'is_fixed',
and 'is_reversible' (15 pts). Be sure to drop the 'thal' column afterwards.
Hint: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

In [None]:
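# Assign the result back: replace(..., inplace=True) on a column selection may not modify `data` in recent pandas versions.
data['thal'] = data['thal'].replace({0: 'normal', 1: 'fixed', 2: 'reversible'})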
data = pd.get_dummies(data, columns=["thal"], prefix=["is"])
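
The encoding step can be checked on a small frame first. The following is a minimal sketch with made-up values; the `toy` frame and its `age` column are illustrative assumptions, and `map` is used here in place of `replace` purely for the demonstration.

```python
import pandas as pd

# Toy frame with made-up values; only the 'thal' codes matter here.
toy = pd.DataFrame({"age": [63, 45, 58], "thal": [0, 1, 2]})

# Turn the integer codes into readable labels (map stands in for replace).
toy["thal"] = toy["thal"].map({0: "normal", 1: "fixed", 2: "reversible"})

# get_dummies replaces 'thal' with one indicator column per label,
# each named with the 'is' prefix, and drops the original column.
encoded = pd.get_dummies(toy, columns=["thal"], prefix=["is"])
print(encoded.columns.tolist())
```

Each row ends up with exactly one of the three indicator columns set, and no separate drop of 'thal' is needed because `get_dummies` removes the encoded column itself.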

b) Use min-max normalization to rescale all the features between 0 and 1 (15 pts). Make sure that the data remains in the same
DataFrame format.
Hint: Use https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)
data.head()
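
For reference, min-max normalization maps each column with (x - min) / (max - min). The sketch below, using made-up numbers rather than the heart disease data, compares that formula with `MinMaxScaler` under its default `feature_range=(0, 1)`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for two features (illustrative only).
X = np.array([[29.0, 94.0], [54.0, 130.0], [77.0, 200.0]])

# Column-wise min-max formula: (x - min) / (max - min).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# MinMaxScaler with default settings should reproduce the manual result.
scaled = MinMaxScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

Columns that are already 0/1, such as the one-hot indicators, are left unchanged by this mapping as long as both values occur.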

c) Split the data into train and test sets using a 75/25 split. Use a random state of 42 for grading purposes (20 pts).
Hint: Use https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
X = data.drop(columns = 'condition')
y = data['condition']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, y_train.shape)
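
As a quick illustration of what `test_size=0.25` and a fixed `random_state` do, here is a small sketch on synthetic arrays (the names `X_toy` and `y_toy` are hypothetical and unrelated to the dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 2 features and an alternating 0/1 label.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# Two calls with the same random_state produce the same partition.
first = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
second = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)
print(all(np.array_equal(p, q) for p, q in zip(first, second)))
```

Fixing the seed is what makes the split, and therefore the reported metrics, reproducible across runs.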

# Part 2: Fitting Model and Analyzing Results (50 pts)

a) Fit a logistic regression classifier on the data. Save the model in a variable called 'clf'. Use a random state of 42.
Use the following parameters: penalty:'l2', solver:'liblinear', C:0.1. 15 pts.

In [None]:
clf = LogisticRegression(penalty='l2', solver='liblinear', C=0.1, random_state = 42)
clf.fit(X_train, y_train)

b) Generate 0/1 predictions on the test set and store them in a variable called 'pred'.
Generate probability predictions on the test set and store them in a variable called 'scores'.
10 pts

In [None]:
pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]
print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))
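
The two outputs are closely related: for a binary logistic regression, the 0/1 predictions correspond to thresholding the class-1 probability at 0.5. The sketch below checks this on synthetic data from `make_classification`; the `demo_*` names are hypothetical and the data is not the heart disease set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Same solver and C as above; the default penalty is already 'l2'.
demo_clf = LogisticRegression(solver="liblinear", C=0.1, random_state=42)
demo_clf.fit(X_demo, y_demo)

demo_pred = demo_clf.predict(X_demo)                 # hard 0/1 labels
demo_scores = demo_clf.predict_proba(X_demo)[:, 1]   # probability of class 1

# Thresholding the probabilities at 0.5 should reproduce the hard labels.
thresholded = (demo_scores > 0.5).astype(int)
print(np.array_equal(demo_pred, thresholded))
```

This is why accuracy is computed from 'pred' while AUROC uses 'scores': AUROC evaluates the ranking of the probabilities across all thresholds, not just the 0.5 cut.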

c) Fill in this function to find and return the root mean squared error (RMSE) between the predicted and actual values.
Hint: Use this formula for the RMSE: https://sciencing.com/calculate-mean-deviation-7152540.html.
10 pts

In [None]:
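def rsme(predictions, actuals):
    # Take the square root explicitly; the squared=False argument is deprecated in recent scikit-learn releases.
    from sklearn.metrics import mean_squared_error
    return np.sqrt(mean_squared_error(actuals, predictions))

# Pass arguments in the order the signature expects (RMSE is symmetric, so the value is unchanged).
print('RMSE: ', rsme(pred, y_test))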
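
Because both the predictions and the labels are 0/1, each squared error is either 0 or 1, so the mean squared error is simply the misclassification rate and the RMSE equals sqrt(1 - accuracy). A minimal NumPy sketch with made-up labels:

```python
import numpy as np

# Made-up 0/1 labels and predictions; they differ in 2 of 8 positions.
actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# RMSE from the definition: square root of the mean squared difference.
rmse_manual = np.sqrt(np.mean((predicted - actual) ** 2))

# For 0/1 values this matches sqrt(1 - accuracy).
accuracy = np.mean(predicted == actual)
print(rmse_manual, np.sqrt(1 - accuracy))
```

For hard class predictions the RMSE therefore carries the same information as accuracy; it becomes more informative when computed on the probability scores.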

d) Try using a random forest classifier to fit the data instead. Use the default parameters and a random state of 42.
Save the fitted model into a variable called 'rf'. Generate the 'pred' and 'scores' in a similar way to part b.
Hint: Use https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
15 pts

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
scores = rf.predict_proba(X_test)[:,1]

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))
print('RMSE: ', rsme(pred, y_test))