# Lab Statement
This lab introduces the Logistic Regression classification algorithm using a cardiology dataset. We will build a predictive model and use it to classify unknown patients as either healthy or sick.

# Objectives
- Create and use Machine Learning Models with SciKit Learn
- Learn more advanced NumPy (Numerical Python) features
- Learn how to import data from remote websites (UCI Machine Learning Dataset)
- Train-Test Split of data with SciKit Learn
- Perform Classification with SciKit Learn
- Create Predictive Models
- Perform Model Evaluation
    - Model Score and Accuracy
    - Compute Class Membership Probabilities
- Decision Boundary Plotting with MatPlotLib (Not this lab)

# Introduction

### Supervised Learning
**Supervised Learning** is the process of building classification models using data instances of known origin.

Classification is probably the best understood of all the data science and machine learning strategies. Classification tasks have three common characteristics:

 - Learning is supervised
 - The dependent variable is categorical
 - The emphasis is on building models able to assign new instances to one of a set of well-defined classes

### Logistic Regression

There are many applications where we are not only interested in the predicted class labels, but where the *estimation of the class-membership probability* is particularly useful (the output of the Sigmoid function prior to applying the threshold function). 

<img src="images/math.png" align="center" width=600; height=600>
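For reference, the relationship between the net input, the Sigmoid function, and the threshold function can be written out as below. This is the standard textbook formulation and assumes the figure above depicts the same quantities; the symbols $w$, $b$, $\phi$, and $\hat{y}$ are notation chosen for this sketch.

```latex
z = w^{T}x + b = \sum_{j} w_j x_j + b, \qquad
\phi(z) = \frac{1}{1 + e^{-z}}, \qquad
\hat{y} =
\begin{cases}
1 & \text{if } \phi(z) \ge 0.5 \\
0 & \text{otherwise}
\end{cases}
```

Here $\phi(z)$ is the class-membership probability for class 1, and $\hat{y}$ is the class label obtained after thresholding.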

Logistic Regression is used in weather forecasting, for example, not only to predict if it will rain on a particular day but also to report the chance of rain. Similarly, Logistic Regression can be used to predict the chance that a patient has a particular disease given certain symptoms, which is why it's very popular in the field of medicine.

### Cardiology Dataset

This dataset has 303 samples of 13 different attributes:

<img src="images/cardiolog_dataset.png" align="center" width=600; height=400>

The Cardiology Patient Dataset is often used in Machine Learning. The original data was gathered by Dr. Robert Detrano at the VA Medical Center in Long Beach, California. The dataset consists of 303 instances, 138 of which hold information about patients with heart disease. The original dataset contains 13 numeric attributes and a 14th attribute indicating whether the patient has a heart condition. The dataset was later modified by Dr. John Gennari, who changed seven of the numerical attributes to categorical equivalents for the purpose of testing data mining tools able to classify datasets with mixed data types. The Microsoft Excel file names for the datasets are **CardiologyNumerical.xlsx** and **CardiologyCategorical.xlsx**, respectively. This dataset is interesting because it represents real patient data and has been extensively used for testing various data science techniques. We can use this data, together with one or more data science techniques, to help us develop profiles for differentiating individuals with heart disease from those without heart conditions.

<img src="images/table_1.png" align="center" width=600; height=400>

# Importing the Essentials

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the Cardiology Dataset

In [24]:
df = pd.read_csv('data/CardiologyNumerical_Lab5.csv')
#Preview the First Five Rows
df.head()

Unnamed: 0,angina,slope,thal,class
0,1,2,7,0
1,0,1,3,1
2,1,2,3,1
3,0,2,7,0
4,1,3,7,0
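The `Unnamed: 0` column in the output above looks like a row index that was saved along with the data. A minimal sketch, assuming that is the case, of how `index_col=0` would read it as the index instead. The inline CSV text is a stand-in copied from the five rows shown above, not the real file.

```python
import io
import pandas as pd

# Stand-in for the CSV file: the five rows shown above, with an
# unnamed leading column that holds the saved row index.
csv_text = """,angina,slope,thal,class
0,1,2,7,0
1,0,1,3,1
2,1,2,3,1
3,0,2,7,0
4,1,3,7,0
"""

# Default read: the unnamed column becomes a data column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))    # includes 'Unnamed: 0'
print(list(clean.columns))  # only the four data columns
```

The rest of the lab selects columns by name, so it works with either form.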


# Data Preprocessing

### Creating the Feature/Target Matrices

In [3]:
#Creating the Feature Matrix
X = df[['angina', 'slope', 'thal']]

#Creating the Target Matrix
y = df['class']

### Checking the Shape of the Matrices

In [4]:
print('The shape of the Feature array (X) is: {}'.format(X.shape))
print('The shape of the Target Array (y) is: {}'.format(y.shape))

The shape of the Feature array (X) is: (303, 3)
The shape of the Target Array (y) is: (303,)


### Turning Feature/Target Arrays into NumPy Arrays

In [5]:
X = df[['angina', 'slope', 'thal']].values
y = df['class'].values
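The `.values` attribute returns the underlying NumPy array. The pandas documentation recommends the `.to_numpy()` method for the same purpose, so the conversion could also be sketched as follows. The three-row `demo` frame is hypothetical and only borrows the lab's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with the same column names as the lab data.
demo = pd.DataFrame({'angina': [1, 0, 1],
                     'slope':  [2, 1, 2],
                     'thal':   [7, 3, 3],
                     'class':  [0, 1, 1]})

X_demo = demo[['angina', 'slope', 'thal']].to_numpy()  # 2D feature array
y_demo = demo['class'].to_numpy()                      # 1D target array

print(type(X_demo), X_demo.shape)  # ndarray with shape (3, 3)
print(type(y_demo), y_demo.shape)  # ndarray with shape (3,)
```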

# Splitting of the Data

In [6]:
#Import Train-Test Split from Sci-Kit Learn
from sklearn.model_selection import train_test_split

#Complete the Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

### Checking the Shape (See if Split was Correct)

In [7]:
#Shape of Feature Dataset
print('Feature Training Set: {}'.format(X_train.shape))
print('Feature Testing Set: {}'.format(X_test.shape))

#Shape of the Target Dataset
print('Target Training Set: {}'.format(y_train.shape))
print('Target Testing Set: {}'.format(y_test.shape))

Feature Training Set: (227, 3)
Feature Testing Set: (76, 3)
Target Training Set: (227,)
Target Testing Set: (76,)
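The 227/76 split follows from applying `test_size = 0.25` to 303 rows: 0.25 × 303 = 75.75, which is rounded up to 76 test rows, leaving 227 rows for training. The sketch below reproduces those shapes on synthetic stand-in data (random integers, not the cardiology data). The `stratify` argument is an optional extra that the lab does not use; it keeps the class proportions similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same dimensions as the lab data (303 x 3).
rng = np.random.default_rng(0)
X_fake = rng.integers(0, 8, size=(303, 3))
y_fake = rng.integers(0, 2, size=303)

# Same call as in the lab, plus stratify (added here for illustration).
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    X_fake, y_fake, test_size=0.25, random_state=0, stratify=y_fake)

print(Xf_train.shape, Xf_test.shape)  # (227, 3) (76, 3)
print(yf_train.shape, yf_test.shape)  # (227,) (76,)
```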


# Building, Training, and Evaluating the Model

### Building the Model - Importing and Instantiating the Class

In [8]:
#Import Logistic Regression Model
from sklearn.linear_model import LogisticRegression

#Instantiating the Class
lr = LogisticRegression(solver = 'newton-cg',
                       multi_class = 'multinomial',
                       random_state = 0)

### Training the Model - Calling the Fit Function

In [9]:
#Calling the Fit Function
lr.fit(X_train, y_train)

LogisticRegression(multi_class='multinomial', random_state=0,
                   solver='newton-cg')

### Evaluating the Model - Checking Predictions and Scoring the Model

In [10]:
#Create the Prediction Vector
y_pred = lr.predict(X_test)

#Printing the Actual Values
print("Actual Values of the Testing Set:\n {}".format(y_test))

#Checking the Predictions
print("Test set predictions:\n {}".format(y_pred))

#Verifying the Accuracy
print('Test accuracy: {0:0.2}'.format(lr.score(X_test, y_test)))

Actual Values of the Testing Set:
 [1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 1
 1 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1 1 0 0 1 1 1 0 0
 1 0]
Test set predictions:
 [1 1 1 1 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 1
 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 1 0 0 1 1 1 0 0 0 1 1 0 0 0
 1 1]
Test accuracy: 0.8
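The value returned by `lr.score` is the accuracy, i.e. the fraction of test samples whose predicted label equals the actual label. The sketch below computes it by hand on the first ten entries of the two arrays printed above. It uses only a subset, so it illustrates the calculation rather than reproducing the full test score.

```python
import numpy as np

# First ten actual labels and predictions, copied from the output above.
y_true_10 = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0])
y_pred_10 = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

# Accuracy = share of positions where prediction and label agree.
accuracy = np.mean(y_true_10 == y_pred_10)

# Breakdown of the four outcome types, treating class 1 as "positive".
tp = np.sum((y_true_10 == 1) & (y_pred_10 == 1))
tn = np.sum((y_true_10 == 0) & (y_pred_10 == 0))
fp = np.sum((y_true_10 == 0) & (y_pred_10 == 1))
fn = np.sum((y_true_10 == 1) & (y_pred_10 == 0))

print(accuracy)        # 0.8
print(tp, tn, fp, fn)  # 5 3 0 2
```

Accuracy alone does not show which kind of mistake the model makes. In this subset both errors are class 1 patients predicted as class 0.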


# Predicting Class-Membership Probabilities

SciKit-Learn has a method that allows prediction of *class-membership probabilities.*
   - The probability that training or testing examples belong to a certain class can be computed using the **predict_proba** method. 
   - For example, the probabilities for the first three samples in the test set can be predicted as follows (NOTE: X_test[:3,:] selects the first 3 rows and all of their columns from the test dataset X_test):

In [11]:
lr.predict_proba(X_test[:3,:])

array([[0.21509924, 0.78490076],
       [0.11636406, 0.88363594],
       [0.21509924, 0.78490076]])

The method returns an array in which each row holds one sample's probabilities of belonging to each class (class 0 in the first column and class 1 in the second), so each row sums to 1. To predict the class-membership probabilities for a single row of the test set, slice out just that row:

In [12]:
lr.predict_proba(X_test[1:2])

array([[0.11636406, 0.88363594]])

**NOTE:** We can also predict probabilities (e.g. for a patient in the dataset) by passing a 2D array (a 2D list) directly to the **lr.predict_proba()** method. For example, if **angina = 0.0, slope = 2.0, and thal = 3.0**, then we could substitute the 2D list **[[angina, slope, thal]]** for the **X_test[start:end]** in the **lr.predict_proba()** method above.
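The reason for the double brackets is that scikit-learn estimators expect input of shape (number of samples, number of features), even for a single patient. A small NumPy-only sketch of the shapes involved, using the example values from the note (the variable names `one_patient`, `flat`, and `as_row` are illustrative):

```python
import numpy as np

angina, slope, thal = 0.0, 2.0, 3.0  # example values from the note above

# A nested list (or 2D array) has one row per patient.
one_patient = np.array([[angina, slope, thal]])
print(one_patient.shape)  # (1, 3)

# A flat 1D vector must be reshaped into a single row first.
flat = np.array([angina, slope, thal])
as_row = flat.reshape(1, -1)
print(flat.shape, as_row.shape)  # (3,) (1, 3)
```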

# Predictive Model Building

### Extracting the Coefficients and Intercept

In [13]:
print('The Coefficients from the Logistic Regression are {}'.format(lr.coef_))
print('The Intercept from the Logistic Regression is {}'.format(lr.intercept_))

The Coefficients from the Logistic Regression are [[-0.74518511 -0.36643182 -0.25781723]]
The Intercept from the Logistic Regression is [2.15354424]


### "De-Arraying" the Coefficients and Intercept

In [14]:
#Get the Coefficients and Intercept from Above (Still in Array Format)
coefs = lr.coef_[0]
intercept = lr.intercept_

#Strip the Array Formatting
print('The coefficient of the feature "angina" is {}'.format(coefs[0]))
print('The coefficient of the feature "slope" is {}'.format(coefs[1]))
print('The coefficient of the feature "thal" is {}'.format(coefs[2]))
print('The Intercept has the value of {}'.format(intercept[0]))

The coefficient of the feature "angina" is -0.7451851096866057
The coefficient of the feature "slope" is -0.36643182206362124
The coefficient of the feature "thal" is -0.25781723439203913
The Intercept has the value of 2.153544244675405


### Final Model:

### z = -0.745*angina - 0.366*slope - 0.258*thal + 2.154

# Testing the Model on Unseen Data

### Creating the Sigmoid Function

In [15]:
#Sigmoid Function
def sigmoid(z):
    return 1/(1+np.exp(-z))
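A few properties of this function are worth checking before using it. The sketch below repeats the definition so that it runs on its own, and uses made-up `z_values` to show that the function works element-wise on arrays, that `sigmoid(0)` is exactly 0.5, and that thresholding the output at 0.5 gives the same labels as checking the sign of z.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up net input values for illustration.
z_values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
p_values = sigmoid(z_values)  # applied element-wise

# Thresholding the probability at 0.5 ...
labels_from_p = (p_values >= 0.5).astype(int)
# ... is the same as checking whether z is non-negative.
labels_from_z = (z_values >= 0).astype(int)

print(sigmoid(0))       # 0.5
print(labels_from_p)    # [0 0 1 1 1]
print(labels_from_z)    # [0 0 1 1 1]
```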

### Calculating the Dot Product

In [16]:
angina, slope, thal = 1, 2, 7 #Class 0
z = np.dot(np.array([angina, slope, thal]), np.array(lr.coef_[0])) + lr.intercept_[0]
print("From the dot product, we find: z=wx + b = {0:0.3f}\n".format(z))

From the dot product, we find: z=wx + b = -1.129



### Checking our Dot Product for Accuracy

In [17]:
print('After doing a Brute Force Calculation, z = {0:0.3f}'.format(angina*coefs[0] + slope*coefs[1] + thal*coefs[2] + intercept[0]))

After doing a Brute Force Calculation, z = -1.129
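The same calculation can be done for several patients at once with a matrix product. The sketch below hard-codes the coefficients and intercept printed earlier so that it runs without the fitted model; the names `w`, `b`, and `patients` are illustrative. The rows are the patient used above followed by the three patients classified later in this lab.

```python
import numpy as np

# Coefficients and intercept copied from the printed output above.
w = np.array([-0.7451851096866057, -0.36643182206362124, -0.25781723439203913])
b = 2.153544244675405

# One row per patient: [angina, slope, thal].
patients = np.array([[1, 2, 7],
                     [0, 2, 3],
                     [1, 3, 6],
                     [0, 1, 3]])

# Matrix-vector product computes w.x for every row, then b is added to each.
z_all = patients @ w + b
print(np.round(z_all, 3))  # [-1.129  0.647 -1.238  1.014]
```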


### Calling the Sigmoid Function and Displaying Value

In [18]:
p = sigmoid(np.dot([angina, slope, thal], lr.coef_[0]) + lr.intercept_[0])
p

0.2443041250951056

Since p ≈ 0.244 is below the 0.5 threshold, the model assigns Patient 1 (Index 0) to Class 0 (no heart issues), which matches the known class in the dataset.

### Classifying Unknown Patient 1

In [19]:
#Establishing Features
angina, slope, thal = 0, 2, 3 #Class 1

#Calculating the Dot Product
z = np.dot(np.array([angina, slope, thal]), np.array(lr.coef_[0])) + lr.intercept_[0]
print('The Dot Product Calculation is {0:0.3f}'.format(z))

#Predicting the Class
p = sigmoid(z)
print('The Prediction Calculation from the Sigmoid function is {0:0.3f}'.format(p))


#Explaining Results
print('\nThe predicted value is {0:0.3f}, which rounds to {1:0.0f}'.format(p, round(p, 0)))
print('Therefore, the predicted class is {}'.format(lr.predict([[angina, slope, thal]])))

The Dot Product Calculation is 0.647
The Prediction Calculation from the Sigmoid function is 0.656

The predicted value is 0.656, which rounds to 1
Therefore, the predicted class is [1]
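The hand-computed Sigmoid value (0.656) is not necessarily the number that **lr.predict_proba()** would report for the same features. As far as I understand scikit-learn's behavior, when `LogisticRegression` is created with `multi_class = 'multinomial'` and the target has only two classes, `predict_proba` applies a softmax to the pair [-z, z], which equals sigmoid(2z) rather than sigmoid(z). This is an assumption about library internals and may differ between versions. The sketch below uses the printed coefficients; `softmax_pair` is a helper written for this illustration. Its output agrees with the values 0.78490076 and 0.88363594 shown in the **predict_proba** output earlier in this lab.

```python
import numpy as np

# Coefficients and intercept copied from the printed output earlier in the lab.
w = np.array([-0.7451851096866057, -0.36643182206362124, -0.25781723439203913])
b = 2.153544244675405

def softmax_pair(z):
    """Softmax over the two scores [-z, z]: [P(class 0), P(class 1)]."""
    scores = np.array([-z, z])
    exp_scores = np.exp(scores - scores.max())  # shift for numerical stability
    return exp_scores / exp_scores.sum()

results = []
for features in ([0, 2, 3], [0, 1, 3]):
    z = np.dot(features, w) + b
    results.append(softmax_pair(z))
    print(features,
          'softmax:', np.round(results[-1], 3),
          'sigmoid(z):', round(1 / (1 + np.exp(-z)), 3))
```

The predicted class is the same either way, because both sigmoid(z) and sigmoid(2z) cross 0.5 exactly at z = 0.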


### Classifying Unknown Patient 2

In [20]:
#Establishing Features
angina, slope, thal = 1, 3, 6 #Class 0

#Calculating the Dot Product
z = np.dot(np.array([angina, slope, thal]), np.array(lr.coef_[0])) + lr.intercept_[0]
print('The Dot Product Calculation is {0:0.3f}'.format(z))

#Predicting the Class
p = sigmoid(z)
print('The Prediction Calculation from the Sigmoid function is {0:0.3f}'.format(p))


#Explaining Results
print('\nThe predicted value is {0:0.3f}, which rounds to {1:0.0f}'.format(p, round(p, 0)))
print('Therefore, the predicted class is {}'.format(lr.predict([[angina, slope, thal]])))

The Dot Product Calculation is -1.238
The Prediction Calculation from the Sigmoid function is 0.225

The predicted value is 0.225, which rounds to 0
Therefore, the predicted class is [0]


### Classifying Unknown Patient 3

In [21]:
#Establishing Features
angina, slope, thal = 0, 1, 3 #Class 1

#Calculating the Dot Product
z = np.dot(np.array([angina, slope, thal]), np.array(lr.coef_[0])) + lr.intercept_[0]
print('The Dot Product Calculation is {0:0.3f}'.format(z))

#Predicting the Class
p = sigmoid(z)
print('The Prediction Calculation from the Sigmoid function is {0:0.3f}'.format(p))


#Explaining Results
print('\nThe predicted value is {0:0.3f}, which rounds to {1:0.0f}'.format(p, round(p, 0)))
print('Therefore, the predicted class is {}'.format(lr.predict([[angina, slope, thal]])))

The Dot Product Calculation is 1.014
The Prediction Calculation from the Sigmoid function is 0.734

The predicted value is 0.734, which rounds to 1
Therefore, the predicted class is [1]


### Summary Table

<table>
    <tr> <td>Angina</td> <td>Slope</td> <td>Thal</td> <td>Known Class</td> <td>Sigmoid Output (Predicted Probability) from the Logistic Regression Model</td> <td>Predicted Class from the Logistic Regression Model</td> </tr>
       <tr> <td>0</td> <td>2</td> <td>3</td> <td>1</td> <td> <center/>0.656 </td> <td><center/>1 </td> </tr>
       <tr> <td>1</td> <td>3</td> <td>6</td> <td>0</td> <td> <center/>0.225 </td> <td><center/>0 </td> </tr>
       <tr> <td>0</td> <td>1</td> <td>3</td> <td>1</td> <td> <center/>0.734 </td> <td><center/>1 </td> </tr>
</table>

Based on the calculated values and predicted classes, the model correctly classified all the unknown instances!
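The summary table can also be rebuilt in code, which avoids copying values by hand. A sketch using pandas and the coefficients printed earlier (hard-coded so that the block runs without the fitted model); the column names `z`, `sigmoid_p`, and `predicted_class` are chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Coefficients and intercept copied from the printed output earlier in the lab.
w = np.array([-0.7451851096866057, -0.36643182206362124, -0.25781723439203913])
b = 2.153544244675405

# The three unknown patients and their known classes from the lab.
summary = pd.DataFrame({'angina': [0, 1, 0],
                        'slope': [2, 3, 1],
                        'thal': [3, 6, 3],
                        'known_class': [1, 0, 1]})

features = summary[['angina', 'slope', 'thal']].to_numpy()
summary['z'] = features @ w + b                          # net input
summary['sigmoid_p'] = 1 / (1 + np.exp(-summary['z']))   # Sigmoid output
summary['predicted_class'] = (summary['sigmoid_p'] >= 0.5).astype(int)

print(summary.round(3))
```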