# Supervised Classification Algorithms 

## About the Dataset

> age: The person's age in years  
> sex: The person's sex (1 = male, 0 = female)  
> cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)  
> trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)  
> chol: The person's cholesterol measurement in mg/dl  
> fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)  
> restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)  
> thalach: The person's maximum heart rate achieved  
> exang: Exercise induced angina (1 = yes; 0 = no)  
> oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)  
> slope: The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)  
> ca: The number of major vessels (0-3)  
> thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)  
> target: Heart disease (0 = no, 1 = yes)  

## Logistic Regression

### What is different between Linear and Logistic Regression?

While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it is not the best tool for predicting the class of an observed data point. In order to estimate the class of a data point, we need some sort of guidance on what would be the **most probable class** for that data point. For this, we use **Logistic Regression**.

<div class="alert alert-success alertsuccess">
<font size = 3><strong>Recall linear regression:</strong></font>
<br>
<br>
As you know, __linear regression__ finds a function that relates a continuous dependent variable, y, to some predictors (independent variables $x_1, x_2$, etc.). For example, linear regression with several predictors assumes a function of the form:
<br><br>
$$
y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots
$$
<br>
and finds the values of the parameters $\theta_0, \theta_1, \theta_2$, etc., where the term $\theta_0$ is the "intercept". In general form, it can be written as:
<br><br>
$$
h_\theta(x) = \theta^T X
$$
<p></p>

</div>

Logistic Regression is a variation of Linear Regression, useful when the observed dependent variable, y, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.

Logistic regression fits a special S-shaped curve by taking the linear regression estimate and transforming it into a probability with the following function, called the sigmoid function $\sigma$:

$$
h_\theta(x) = \sigma(\theta^T X) = \frac{e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots}}{1 + e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots}}
$$
Or:
$$
\text{Probability of class 1} = P(Y=1 \mid X) = \sigma(\theta^T X) = \frac{e^{\theta^T X}}{1 + e^{\theta^T X}}
$$

In this equation, $\theta^T X$ is the regression result (the sum of the variables weighted by the coefficients), $e^{(\cdot)}$ is the exponential function, and $\sigma(\theta^T X)$ is the sigmoid or [logistic function](http://en.wikipedia.org/wiki/Logistic_function), whose graph is the familiar "S"-shaped (sigmoid) curve.

In short, Logistic Regression passes the linear combination of the inputs through the logistic/sigmoid function and treats the result as a probability:

<img
src="https://ibm.box.com/shared/static/kgv9alcghmjcv97op4d6onkyxevk23b1.png" width = "400" align = "center">


The objective of the __Logistic Regression__ algorithm is to find the best parameters $\theta$ for $h_\theta(x) = \sigma(\theta^T X)$, so that the model best predicts the class of each case.
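To make the formula concrete, here is a minimal sketch of the sigmoid applied to a linear combination. The parameter vector `theta` and the observation `x` are made-up values for illustration only, not coefficients fitted to the heart dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one observation (illustrative values only).
# x[0] = 1 so that theta[0] plays the role of the intercept theta_0.
theta = np.array([-1.0, 0.5, 2.0])
x = np.array([1.0, 2.0, 0.25])

z = theta @ x       # linear part theta^T x
p = sigmoid(z)      # interpreted as P(Y=1 | x)

# e^z / (1 + e^z) and 1 / (1 + e^-z) are the same function written two ways.
p_alt = np.exp(z) / (1.0 + np.exp(z))

print(f"z = {z:.2f}, P(Y=1|x) = {p:.4f}, predicted class = {int(p >= 0.5)}")
```

Thresholding the probability at 0.5 is the usual default for turning it into a class label; a different threshold trades precision against recall.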

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import *  

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
%matplotlib inline 

import os
print(os.listdir("../input"))

In [None]:
df = pd.read_csv('../input/heart.csv')
df.head()

In [None]:
df.info()

In [None]:
X = df.drop(['target', ], axis=1)
X.head()

In [None]:
y = df['target']
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

**Utility Functions**

In [None]:
def plot_confusion_matrix(y, pred, labels, classes, normalize=False, cmap=plt.cm.Blues):
    """
    Plots the confusion matrix.
    Args:
        y: Data labels
        pred: Predicted outputs
        labels: A list of label values used to compute the confusion matrix
        classes: A list containing the unique class names for plotting
        normalize: Whether to plot percentage values instead of integer counts. Default is False (counts).
        cmap: Color map of the plot
    
    """
    cm = confusion_matrix(y, pred, labels=labels)
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title('Confusion Matrix')
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
def best_model(model, train, test, grid_params):
    """
    Takes a model and grid params as an input and finds the best model.
    
    Args:
        model: A model class
        train: A dict containing train features as X and labels as y
        test: A dict containing test features as X and labels as y
        grid_params: GridSearchCV parameters
        
    Returns:
        best_estimator, table and best_params
    """
    
    grid = GridSearchCV(model, grid_params, cv=4, scoring='f1_weighted', 
                        n_jobs=-1, return_train_score=True).fit(train['X'], train['y'])
    estimator = grid.best_estimator_
    table = pd.DataFrame(grid.cv_results_).loc[:, 
                ['params', 'mean_test_score', 'std_test_score','mean_train_score', 
                 'std_train_score']].sort_values(by='mean_test_score', ascending=False)
    
    params = grid.best_params_
    preds = estimator.predict(test['X'])
    plot_confusion_matrix(test['y'], preds, labels=[1, 0], classes=['target=1','target=0'])
    print(classification_report(test['y'], preds))
    
    return estimator, table, params
    

Let's build our model using __LogisticRegression__ from the scikit-learn package. This class implements logistic regression and can use different numerical optimizers to find the parameters, including the 'newton-cg', 'lbfgs', 'liblinear', 'sag', and 'saga' solvers. The scikit-learn documentation describes the trade-offs between these solvers.

The scikit-learn version of Logistic Regression supports regularization, a technique used to reduce overfitting in machine learning models.
The **C** parameter is the __inverse of regularization strength__ and must be a positive float; smaller values specify stronger regularization.
Now let's fit our model on the training set:

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

In [None]:
est, table, params = best_model(lr, train={'X': X_train, 'y': y_train},
                                test={'X': X_test, 'y':y_test},
                                grid_params=[{'solver':['liblinear', 'sag', 'newton-cg', 'lbfgs'],
                                              'C': [0.01, 0.05, 0.1, 0.5, 1, 5]}])

Based on the counts in each cell of the confusion matrix, we can calculate the precision and recall of each label:

- __Precision__ measures how often a predicted class label is correct. It is defined as: Precision = TP / (TP + FP)

- __Recall__ is the true positive rate. It is defined as: Recall = TP / (TP + FN)

__F1 score:__
With the precision and recall of each label in hand, we can calculate the F1 score for that label.

The F1 score is the harmonic mean of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall). It reaches its best value at 1 (perfect precision and recall) and its worst at 0, so a high F1 score shows that a classifier does well on both recall and precision.

Finally, the overall score for this classifier is the average of the F1 scores of both labels, which is 0.93 in our case.

Of the 25 test cases with target 0, the classifier correctly predicted 22 as 0 and wrongly predicted 3 as 1, so it did a good job on the negative class. It also did a good job on the positive class: 35 out of 36 were predicted correctly. A good thing about the confusion matrix is that it shows the model's ability to correctly predict or separate the classes. In the specific case of a binary classifier, such as this one, we can interpret these numbers as the counts of true positives, false positives, true negatives, and false negatives.
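As a quick check of the arithmetic, the sketch below recomputes precision, recall, and F1 by hand from the counts quoted above (22 and 3 for target 0, 35 and 1 for target 1). The helper name `prf` is just an illustrative choice; `classification_report` does the same work for you.

```python
# Counts taken from the confusion matrix discussed above, with target=1 as the positive class.
tp, fn = 35, 1   # true label 1: 35 predicted as 1, 1 predicted as 0
tn, fp = 22, 3   # true label 0: 22 predicted as 0, 3 predicted as 1

def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its TP, FP and FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p1, r1, f1_1 = prf(tp, fp, fn)   # class 1
p0, r0, f1_0 = prf(tn, fn, fp)   # class 0: the roles of the two classes swap

accuracy = (tp + tn) / (tp + tn + fp + fn)
macro_f1 = (f1_1 + f1_0) / 2     # plain average of the two per-class F1 scores

print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"accuracy={accuracy:.2f} macro F1={macro_f1:.2f}")
```

Both the accuracy and the averaged F1 round to 0.93 here, which is why the two numbers are easy to confuse; they are different quantities and will not agree in general.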

In [None]:
est

In [None]:
params

In [None]:
table

__predict_proba__ returns probability estimates for all classes, ordered by class label. So, the first column is the probability of class 0, P(Y=0|X), and the second column is the probability of class 1, P(Y=1|X):

In [None]:
est.predict_proba(X_test)[0:5]

## Evaluation

### Jaccard Index

Let's try the Jaccard index for evaluation. We can define Jaccard as the size of the intersection divided by the size of the union of two label sets. For a binary problem, the two sets are the samples predicted as positive and the samples that are actually positive, so the score is 1.0 when they match exactly and 0.0 when they share nothing.

In [None]:
# jaccard_similarity_score was removed in scikit-learn 0.23; jaccard_score replaces it
jaccard_score(y_test, est.predict(X_test))
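The intersection-over-union definition can be sketched by hand for the positive class. The function name `jaccard_binary` and the toy labels below are assumptions made for illustration; they are not part of scikit-learn or of this dataset.

```python
import numpy as np

def jaccard_binary(y_true, y_pred, positive=1):
    """Jaccard index of the positive class: |true ∩ predicted| / |true ∪ predicted|."""
    true_pos = np.asarray(y_true) == positive
    pred_pos = np.asarray(y_pred) == positive
    intersection = np.sum(true_pos & pred_pos)   # samples that are positive in both
    union = np.sum(true_pos | pred_pos)          # samples that are positive in either
    return intersection / union

# Toy labels (hypothetical): 3 samples are positive in both, 5 are positive in at least one.
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 1, 1]

score = jaccard_binary(y_true_demo, y_pred_demo)
print(f"Jaccard index = {score:.2f}")
```

In confusion-matrix terms this is TP / (TP + FP + FN): true negatives do not enter the score, which is what distinguishes it from plain accuracy.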

### Log Loss

Now, let's try __log loss__ for evaluation. In logistic regression, the output can be the probability that the target equals 1, a value between 0 and 1.
Log loss (logarithmic loss) measures the performance of a classifier whose predicted output is such a probability. Confident predictions that turn out to be wrong are penalized heavily, so lower values are better.

In [None]:
log_loss(y_test, est.predict_proba(X_test))
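To see how log loss rewards and punishes probabilities, here is a small hand-rolled version for the binary case. The function `binary_log_loss` and the toy probabilities are illustrative assumptions; the clipping constant `eps` is only there to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(Y=1)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pos, dtype=float), eps, 1 - eps)  # keep log() finite
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_demo = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # high probability on the true class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # high probability on the wrong class
undecided = [0.5, 0.5, 0.5, 0.5]      # no information: loss equals ln(2)

loss_right = binary_log_loss(y_demo, mostly_right)
loss_wrong = binary_log_loss(y_demo, mostly_wrong)
loss_undecided = binary_log_loss(y_demo, undecided)

print(f"right={loss_right:.3f} undecided={loss_undecided:.3f} wrong={loss_wrong:.3f}")
```

The ordering is the point: good probabilities score below ln(2) ≈ 0.693, and confidently wrong ones score well above it.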

## SVM (Support Vector Machines)

SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data are transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong.

The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:

1. Linear
2. Polynomial
3. Radial basis function (RBF)
4. Sigmoid
    
Each of these functions has its characteristics, its pros and cons, and its equation, but as there's no easy way of knowing which function performs best with any given dataset, we usually choose different functions in turn and compare the results.
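For reference, the four kernel types listed above can be written as plain functions of two vectors. This is a sketch using common textbook forms; the default values for `gamma`, `coef0`, and `degree` are arbitrary choices for the example, not the values `SVC` would pick.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * (x @ z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    # Depends only on the squared Euclidean distance between the two points.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return np.tanh(gamma * (x @ z) + coef0)

# Two made-up points, for illustration only.
a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])

for name, kernel in [("linear", linear_kernel), ("polynomial", polynomial_kernel),
                     ("rbf", rbf_kernel), ("sigmoid", sigmoid_kernel)]:
    print(f"{name:>10}: {kernel(a, b):.4f}")
```

Each function returns a similarity between the two points; the SVM only ever needs these pairwise values, never the coordinates in the higher-dimensional space.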

In [None]:
from sklearn.svm import SVC

sv = SVC(gamma='scale')

In [None]:
est, table, params = best_model(sv, train={'X': X_train, 'y': y_train},
                                test={'X': X_test, 'y':y_test},
                                grid_params=[{'kernel':['linear', 'rbf'],
                                              'C': [1, 3, 5, 7, 10, 20]}])

In [None]:
est

In [None]:
table

In [None]:
params

In [None]:
# jaccard_similarity_score was removed in scikit-learn 0.23; jaccard_score replaces it
jaccard_score(y_test, est.predict(X_test))

## K-Nearest Neighbors


**K-Nearest Neighbors** is an algorithm for supervised learning in which the model is 'trained' by storing data points together with their class labels. When a new point is to be predicted, the algorithm takes into account the 'K' points nearest to it to determine its classification.

### Here's a visualization of the K-Nearest Neighbors algorithm.

<img src = "https://ibm.box.com/shared/static/mgkn92xck0z05v7yjq8pqziukxvc2461.png" width=500 height=500>

In this case, we have data points of Class A and B. We want to predict what the star (test data point) is. If we consider a k value of 3 (3 nearest data points) we will obtain a prediction of Class B. Yet if we consider a k value of 6, we will obtain a prediction of Class A.

In this sense, it is important to consider the value of k. But hopefully from this diagram, you should get a sense of what the K-Nearest Neighbors algorithm is. It considers the 'K' Nearest Neighbors (points) when it predicts the classification of the test point.
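The idea can be sketched in a few lines of NumPy. The points below are made up so that they behave like the diagram (the star sits at the origin); `knn_predict` is an illustrative helper that uses Euclidean distance and a simple majority vote, and is not how `KNeighborsClassifier` is implemented internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two B points close to the star, four A points a bit further out.
X_demo = np.array([[1.0, 0.0], [0.0, 1.2],                              # class B (near)
                   [-1.5, 0.0], [0.0, -2.0], [2.2, 0.0], [0.0, 2.5],    # class A
                   [5.0, 5.0], [-5.0, 5.0]])                            # class B (far)
y_demo = np.array(["B", "B", "A", "A", "A", "A", "B", "B"])
star = np.array([0.0, 0.0])

pred_k3 = knn_predict(X_demo, y_demo, star, k=3)
pred_k6 = knn_predict(X_demo, y_demo, star, k=6)
print(f"k=3 -> {pred_k3}, k=6 -> {pred_k6}")
```

With these points, k=3 votes B (two B, one A) while k=6 votes A (four A, two B), mirroring the diagram and showing why k is worth tuning.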

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

In [None]:
est, table, params = best_model(knn, train={'X': X_train, 'y': y_train},
                                test={'X': X_test, 'y':y_test},
                                grid_params=[{'n_neighbors':list(range(5,30)),
                                              'algorithm': ['ball_tree', 'kd_tree', 'brute'],
                                              'leaf_size': [10, 20, 30, 40, 50]}])

In [None]:
est

In [None]:
table.head()

In [None]:
params

## Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()

In [None]:
est, table, params = best_model(dt, train={'X': X_train, 'y': y_train},
                                test={'X': X_test, 'y':y_test},
                                grid_params=[{'max_depth':list(range(4,15)),
                                              'criterion': ['gini', 'entropy']}])

In [None]:
est

In [None]:
table

In [None]:
params

**See the Generated Tree Structure**

In [None]:
import matplotlib.image as mpimg
from sklearn import tree

filename = "tree.png"
feature_names = X.columns.tolist()
target_names = ['0', '1']
tree.export_graphviz(est, feature_names=feature_names, out_file='tree.dot', 
                           class_names=target_names, filled=True, 
                           special_characters=True) 

In [None]:
print(os.listdir('../working/'))

In [None]:
! dot -Tpng tree.dot -o tree.png

In [None]:
img = mpimg.imread('../working/tree.png')
plt.figure(figsize=(100, 200))
plt.imshow(img, interpolation='nearest')

The picture is not very clear at this size, but when working locally you can zoom in to inspect the actual structure of the tree.

## Naive Bayes

Naive Bayes classifiers are built on Bayesian classification methods.
These rely on Bayes's theorem, which is an equation describing the relationship of conditional probabilities of statistical quantities.
In Bayesian classification, we're interested in finding the probability of a label given some observed features, which we can write as $P(L~|~{\rm features})$.
Bayes's theorem tells us how to express this in terms of quantities we can compute more directly:

$$
P(L~|~{\rm features}) = \frac{P({\rm features}~|~L)P(L)}{P({\rm features})}
$$

If we are trying to decide between two labels—let's call them $L_1$ and $L_2$—then one way to make this decision is to compute the ratio of the posterior probabilities for each label:

$$
\frac{P(L_1~|~{\rm features})}{P(L_2~|~{\rm features})} = \frac{P({\rm features}~|~L_1)}{P({\rm features}~|~L_2)}\frac{P(L_1)}{P(L_2)}
$$

All we need now is some model by which we can compute $P({\rm features}~|~L_i)$ for each label.
Such a model is called a *generative model* because it specifies the hypothetical random process that generates the data.
Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier.
The general version of such a training step is a very difficult task, but we can make it simpler through the use of some simplifying assumptions about the form of this model.

This is where the "naive" in "naive Bayes" comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification.
Different types of naive Bayes classifiers rest on different naive assumptions about the data, and we will examine the Gaussian version with this data.
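As a rough sketch of what the Gaussian version assumes, the code below models each feature within each label as an independent normal distribution and applies Bayes's theorem directly. The helper names and the toy data are assumptions for illustration; scikit-learn's `GaussianNB` additionally stabilizes the variances (the `var_smoothing` parameter) and works in log space.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution, evaluated element-wise."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def fit_gaussian_nb(X, y):
    """Estimate P(L) and per-feature mean and variance for every label L."""
    model = {}
    for label in np.unique(y):
        rows = X[y == label]
        model[int(label)] = (len(rows) / len(X), rows.mean(axis=0), rows.var(axis=0))
    return model

def posterior(model, x):
    """P(L | features) for every label, using the naive independence assumption."""
    # P(features | L) is the product of one-dimensional Gaussian densities.
    joint = {label: prior * np.prod(gaussian_pdf(x, mean, var))
             for label, (prior, mean, var) in model.items()}
    total = sum(joint.values())            # P(features), the normalizing constant
    return {label: value / total for label, value in joint.items()}

# Toy data (hypothetical): two features, two well-separated labels.
X_demo = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                   [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

model = fit_gaussian_nb(X_demo, y_demo)
post = posterior(model, np.array([1.1, 1.9]))
print({label: round(float(p), 4) for label, p in post.items()})
```

The predicted label is simply the one with the largest posterior, which is equivalent to checking whether the ratio of posteriors is above or below 1.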

In [None]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

In [None]:
est, table, params = best_model(nb, train={'X': X_train, 'y': y_train},
                                test={'X': X_test, 'y':y_test},
                                grid_params=[{'var_smoothing':[1e-2, 1e-3, 1e-4, 1e-5,
                                                               1e-6, 1e-7, 1e-8, 1e-9]}])

In [None]:
est

In [None]:
table

In [None]:
params

## Conclusions

For this problem, our best method is Logistic Regression, with 93% accuracy.
I hope you liked my kernel - don't forget to upvote if you do :).