# Assignment 2
# Classification with Scikit-Learn

## Deadline: Thursday, October 12 at 8:00 PM
## The assignment must be submitted in the form of a Jupyter notebook and uploaded to eClass.

During this assignment, you will work with a dataset to predict whether a patient has heart disease or does not have heart disease (i.e., the patient is "normal"). This is a binary classification problem. The dataset is the same as in Assignment 1, but you do not need any code from Assignment 1 to solve this assignment. The assignment is based on the following Kaggle example, which you are free to examine: https://www.kaggle.com/code/pasanjayaweera/you-stole-my-heart-w-python

## Marks:
- Step 1. Load the data and create a feature matrix and a target array. 1 mark. (Filtering or imputing data values is not required. You may optionally reuse your code from Assignment 1 to do so, but this is not needed to receive full marks.)
- Step 2. Create training and test sets. 1 mark.
- For the next three steps, select three different machine learning classifiers implemented in Scikit-Learn. You can select from a linear perceptron classifier, a logistic regression classifier, any type of support vector classifier (including with the kernel trick), a decision tree, and/or a random forest. You may also use any other classifier introduced in the course.
- Step 3. Train the first model, then evaluate the performance using at least accuracy, sensitivity (a.k.a. recall for class = 1) and specificity (a.k.a. recall for class = 0). Also, display the confusion matrix with Matplotlib or Seaborn. Any other visualizations of the model are optional. 2 marks.
- Step 4. Repeat step 3 for the second model. 2 marks.
- Step 5. Repeat step 3 for the third model. 2 marks.
- Step 6. Provide a brief discussion on why you selected particular models, how the performance varied between models, which model you believe is the most generalizable, and any issues or problems you encountered during the assignment (200 words max). 2 marks.
- Total = 10 marks.
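
The metrics named in Step 3 can be sketched on a handful of labels. This is a minimal illustration, not a required approach: the label lists below are made up, and only the `sklearn.metrics` functions are real library calls. Sensitivity is the recall computed with class 1 as the positive label, and specificity is the recall computed with class 0 as the positive label.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels invented for illustration (1 = heart disease, 0 = normal).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels {0, 1}, ravel() returns the counts in the order
# tn, fp, fn, tp (rows are true classes, columns are predicted classes).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)                # (tp + tn) / total
sensitivity = recall_score(y_true, y_pred, pos_label=1)  # tp / (tp + fn)
specificity = recall_score(y_true, y_pred, pos_label=0)  # tn / (tn + fp)

print(f"accuracy={accuracy:.3f}, "
      f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```

For the plot, recent Scikit-Learn versions provide `ConfusionMatrixDisplay.from_predictions`, and Seaborn's `heatmap` applied to the confusion matrix array is another option; check the documentation for the version you have installed.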

## Notes:
This notebook is structured as a series of steps. Earlier steps must be completed before later steps for the code to run. Some partial code is provided; your solution should use that partial code. The solution doesn't need to be pretty! Make sure the code runs without errors. Some required imports will be provided for you; you will need additional imports from sklearn. You may need to check with the documentation for Scikit-Learn or other Python packages. There are multiple solutions for most tasks. Feel free to write reusable functions to share among steps; however, this is not required.

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# This code will download the data file from GitHub.

import requests

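def download_data(source, dest):
    # Build the raw-file URL. base_url has no trailing slash, so joining
    # the parts with '/' does not produce a double slash.
    base_url = 'https://raw.githubusercontent.com'
    owner = 'MaralAminpour'
    repo = 'ML-BME-Course-UofA-Fall-2023'
    branch = 'main'
    url = '{}/{}/{}/{}/{}'.format(base_url, owner, repo, branch, source)
    r = requests.get(url)
    r.raise_for_status()  # Stop here if the download did not succeed.
    # Write the raw bytes; the with-block closes the file even on error.
    with open(dest, 'wb') as f:
        f.write(r.content)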

download_data('Assignments/data/heart.csv', 'heart.csv')

## Data dictionary

The following data dictionary is provided. Note that unlike in Assignment 1, you can use the full set of features. There are 11 features (items 1 to 11); item 12 is the target.

  1. Age - Age of the Patient - **Numerical**
  2. Sex - Gender of the Patient - **Categorical**
        * M - Male
        * F - Female
  3. ChestPainType - Chest Pain Type - **Categorical**
        * TA - [Typical Angina](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680106) - Substernal chest pain precipitated by physical exertion or emotional stress
        * ATA - [Atypical Angina](https://www.ncbi.nlm.nih.gov/medgen/149267) - Angina pectoris without the classical symptoms of chest pain. Symptoms include weakness, nausea, or sweating
        * NAP - [Non-Anginal Chest Pain](https://my.clevelandclinic.org/health/diseases/15851-gerd-non-cardiac-chest-pain) - Pain in the chest that is NOT caused by Heart Disease or Heart Attack
        * ASY - [Asymptomatic](https://www.mayoclinic.org/diseases-conditions/heart-attack/expert-answers/silent-heart-attack/faq-20057777) - No symptoms
  4. RestingBP - [Resting Blood Pressure (mm Hg)](https://www.medicinenet.com/blood_pressure_chart_reading_by_age/article.htm) - **Numerical**
  5. Cholesterol - [Serum Cholesterol (mg/dl)](https://www.medicalnewstoday.com/articles/321519) - **Numerical**
  6. FastingBS - [Fasting Blood Sugar](https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451) - **Categorical (1: if FastingBS > 120 mg/dl, 0: otherwise)**
  7. RestingECG - Resting ElectroCardiogram Results - **Categorical**
        * Normal - [Normal ECG Reading](https://ecgwaves.com/topic/ecg-normal-p-wave-qrs-complex-st-segment-t-wave-j-point/) 
        * ST - [Abnormality in ST-T Wave Part of ECG](https://www.healio.com/cardiology/learn-the-heart/ecg-review/ecg-interpretation-tutorial/68-causes-of-t-wave-st-segment-abnormalities) 
        * LVH - [Probable or definite Left Ventricular hypertrophy](https://www.healio.com/cardiology/learn-the-heart/ecg-review/ecg-interpretation-tutorial/68-causes-of-t-wave-st-segment-abnormalities) 
  8. MaxHR - Maximum Heart Rate Achieved (60-202) - **Numerical**
  9. ExerciseAngina - [Exercise Induced Angina](https://www.mayoclinic.org/diseases-conditions/angina/symptoms-causes/syc-20369373) - When your heart wants more blood, but narrowed arteries slow down the blood flow - **Categorical (Yes/No)**
  10. Oldpeak - [ST Depression](https://en.wikipedia.org/wiki/ST_depression) - **Numerical**
  11. ST_Slope - [Slope](https://pubmed.ncbi.nlm.nih.gov/3739881/) of the peak exercise ST Segment - **Categorical**
        * Up - Upward Slope
        * Flat - Slope is zero
        * Down - Downward Slope
  12. HeartDisease - Output Class - **Categorical (1: Heart Disease, 0: Normal)**
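
The split into a feature matrix and a target array, together with one-hot encoding of categorical columns, can be sketched on a toy frame. This is only an illustration under stated assumptions: the column names come from the data dictionary above, but the three rows are made up and are not taken from `heart.csv`. With `drop_first=True`, the first category of each encoded column (in sorted order) is dropped.

```python
import pandas as pd

# Toy rows invented for illustration; the real data comes from heart.csv.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Down"],
    "HeartDisease": [0, 1, 0],
})

# Feature matrix: every column except the target.
X = toy.drop(columns="HeartDisease")
# Target array: the output class (1 = heart disease, 0 = normal).
y = toy["HeartDisease"]

# One-hot encode the categorical columns; numerical columns pass through.
X_encoded = pd.get_dummies(X, columns=["Sex", "ST_Slope"], drop_first=True)

print(list(X_encoded.columns))
```

Here `Sex` becomes a single `Sex_M` column and `ST_Slope` becomes `ST_Slope_Flat` and `ST_Slope_Up`, because the categories `F` and `Down` are dropped.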

In [None]:
# Step 1. Load the data and create a feature matrix and a target array. 1 mark.
#   (Filtering or imputing data values is not required.
#   You may optionally reuse your code from Assignment 1 to do so here,
#   but this is not needed to receive full marks.)

# The downloaded file heart.csv contains data in a CSV (comma-separated values) formatted text file.

heart_data = pd.read_csv('heart.csv')
heart_data.head()

# For the categorical values, it is recommended to use one-hot encoding.
# Code similar to the following one-hot encodes the categorical columns:
# X = pd.get_dummies(X, columns=['ColumnName1', 'ColumnName2'], drop_first=True)

# Add your code here.



In [None]:
# Step 2. Create training and test sets. 1 mark.

# Add your code here.

- For the next three steps, select three different machine learning classifiers implemented in Scikit-Learn. You can select from a linear perceptron classifier, a logistic regression classifier, any type of support vector classifier (including with the kernel trick), a decision tree, and/or a random forest. You may also use any other classifier introduced in the course.
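
As a rough sketch of the overall workflow, the following splits a dataset and fits three classifiers from the list above. Several things here are assumptions made for illustration: the data is synthetic (generated with `make_classification`, not the heart dataset), the choice of these three models is arbitrary, and the hyperparameters (`test_size=0.2`, `max_iter=1000`, `n_estimators=100`, the `random_state` values) are placeholders rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with 11 features; NOT the heart dataset.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)

# Hold out 20% for testing; stratify keeps the class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Three example classifiers; any three permitted by the assignment would do.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVC (RBF kernel)": SVC(kernel="rbf"),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training set and record accuracy on the test set.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {test_accuracy[name]:.3f}")
```

Some of these models (for example, support vector classifiers) are sensitive to feature scale, so on real data a scaling step such as `StandardScaler` fitted on the training set may be worth considering.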


In [None]:
# Step 3. Train the first model, then evaluate the performance using at least: accuracy,
#   sensitivity (a.k.a. recall on class = 1) and specificity (a.k.a. recall for class = 0).
#   Also, display the confusion matrix with Matplotlib or Seaborn. 2 marks.

# Add your code here.

In [None]:
# Step 4. Repeat step 3 for the second model. 2 marks.

# Add your code here.

In [None]:
# Step 5. Repeat step 3 for the third model. 2 marks.

# Add your code here.

_Step 6. Provide a brief discussion on why you selected particular models, how the performance varied between models, which model you believe is the most generalizable, and any issues or problems you encountered during the assignment (200 words max). 2 marks._

**Add your text here**