![alt text](https://www.auth.gr/sites/default/files/banner-horizontal-282x100.png)
# Advanced Topics in Machine Learning - Assignment 1 - Part B


## Cost-Sensitive Learning

#### Useful library documentation, references, and resources used in this assignment:

* Statlog (Heart) Dataset: <http://archive.ics.uci.edu/ml/datasets/statlog+(heart)>
* CostCla's Documentation: <http://albahnsen.github.io/CostSensitiveClassification/#>
* scikit-learn ML library (aka *sklearn*): <http://scikit-learn.org/stable/documentation.html>
* Random Forest Classifier: <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>
* Linear Support Vector Classifier: <https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html>
* Multinomial Naive Bayes Classifier: <https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html>
* Probability Calibration of Classifiers: <https://scikit-learn.org/stable/modules/calibration.html>
* Model evaluation: quantifying the quality of predictions: <https://scikit-learn.org/stable/modules/model_evaluation.html>



# 0. Import necessary libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from costcla.metrics import cost_loss
from sklearn.calibration import CalibratedClassifierCV
from costcla.models import BayesMinimumRiskClassifier
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

import matplotlib.pyplot as plt
# import seaborn as sns
# import statsmodels.api as sm
# from sklearn.feature_selection import RFE
# from sklearn import model_selection
# from sklearn.model_selection import cross_val_score
# from sklearn.model_selection import train_test_split
# from sklearn import cross_validation
# from sklearn.model_selection import GridSearchCV
# from sklearn.linear_model import LogisticRegression
# from sklearn import svm



# 1. Download Required Dataset
#### Use the Statlog (Heart) dataset from the UCI repository. The dataset consists of 4 separate data files

In [48]:
# Download Data from Internet
# path_cleveland = "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
# path_hungary = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data"
# path_swiss = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data"
# path_venice = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data"

# Random connection problems occur frequently while downloading the dataset
# In that case it is better to use the locally stored files (in folder 'data'):
path_cleveland = "data/processed.cleveland.data"
path_hungary = "data/processed.hungarian.data"
path_swiss = "data/processed.switzerland.data"
path_venice = "data/processed.va.data"

# 2. Data Preprocessing
## 2.1 Store data into an easy to handle DataFrame

In [68]:

paths_to_data = [path_cleveland, path_hungary, path_swiss, path_venice]
# Features' Headers used in DataFrame
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", 
         "ca", "thal", "target"]

# Whole DataSet consists of 4 individual data files
# First create a separate dataframe for each data file
dfs = []
for path in paths_to_data:
    dfs.append(pd.read_csv(path, names=columns))

# Then concat all 4 dataframes into a single one. Create new index
initial_data = pd.concat(dfs, ignore_index=True)

# Alternative way to create the final dataframe using a single command with lambda function
#initial_data = pd.concat(map(lambda x: pd.read_csv(x, names=columns), paths_to_data))

## 2.2 Examine initial dataset

In [81]:
# Number of Examples
samples = initial_data.shape[0]
print('Initial dataset contains %d examples' %samples)

# Compute initial class distribution (label 0 corresponds to Class 0, all other labels correspond to Class 1)
# Count label 0 explicitly; value_counts() is sorted by frequency, not by label
class_0 = (initial_data.target == 0).sum()
class_1 = samples - class_0
print('\nInitial class distribution:')
print('Label: 0, Counts: %d (%.1f%s)' %(class_0, 100*class_0/samples,'%'))
print('Label: 1, Counts: %d (%.1f%s)' %(class_1, 100*class_1/samples,'%'))

print('\n',initial_data.head(10))

Initial dataset contains 920 examples

Initial class distribution:
Label: 0, Counts: 411 (44.7%)
Label: 1, Counts: 509 (55.3%)

   age  sex   cp trestbps chol fbs restecg thalach exang oldpeak slope   ca  \
0  63  1.0  1.0      145  233   1       2     150     0     2.3     3  0.0   
1  67  1.0  4.0      160  286   0       2     108     1     1.5     2  3.0   
2  67  1.0  4.0      120  229   0       2     129     1     2.6     2  2.0   
3  37  1.0  3.0      130  250   0       0     187     0     3.5     3  0.0   
4  41  0.0  2.0      130  204   0       2     172     0     1.4     1  0.0   
5  56  1.0  2.0      120  236   0       0     178     0     0.8     1  0.0   
6  62  0.0  4.0      140  268   0       2     160     0     3.6     3  2.0   
7  57  0.0  4.0      120  354   0       0     163     1     0.6     1  0.0   
8  63  1.0  4.0      130  254   0       2     147     0     1.4     2  1.0   
9  53  1.0  4.0      140  203   1       2     155     1     3.1     3  0.0   

  thal  targ

## 2.3 Dealing with missing data
#### Unfortunately, a large portion of the examples contain missing values, marked with the '?' symbol:

In [70]:
initial_data.tail(10)

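| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 910 | 51 | 0.0 | 4.0 | 114 | 258 | 1 | 2 | 96 | 0 | 1 | 1 | ? | ? | 0 |
| 911 | 62 | 1.0 | 4.0 | 160 | 254 | 1 | 1 | 108 | 1 | 3 | 2 | ? | ? | 4 |
| 912 | 53 | 1.0 | 4.0 | 144 | 300 | 1 | 1 | 128 | 1 | 1.5 | 2 | ? | ? | 3 |
| 913 | 62 | 1.0 | 4.0 | 158 | 170 | 0 | 1 | 138 | 1 | 0 | ? | ? | ? | 1 |
| 914 | 46 | 1.0 | 4.0 | 134 | 310 | 0 | 0 | 126 | 0 | 0 | ? | ? | 3 | 2 |
| 915 | 54 | 0.0 | 4.0 | 127 | 333 | 1 | 1 | 154 | 0 | 0 | ? | ? | ? | 1 |
| 916 | 62 | 1.0 | 1.0 | ? | 139 | 0 | 1 | ? | ? | ? | ? | ? | ? | 0 |
| 917 | 55 | 1.0 | 4.0 | 122 | 223 | 1 | 1 | 100 | 0 | 0 | ? | ? | 6 | 2 |
| 918 | 58 | 1.0 | 4.0 | ? | 385 | 1 | 2 | ? | ? | ? | ? | ? | ? | 0 |
| 919 | 62 | 1.0 | 2.0 | 120 | 254 | 0 | 2 | 93 | 1 | 0 | ? | ? | ? | 1 |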


#### As shown below, about 2/3 of the initial examples contain missing data

In [71]:
# Replace '?' symbol with 'nan'
initial_data.replace("?", np.nan, inplace=True)
# Show missing data count for each feature
print(initial_data.isnull().sum())

age           0
sex           0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalach      55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
target        0
dtype: int64


#### The suggested practice is to delete ALL examples containing missing data

In [72]:
# Delete rows with missing data and rebuild the index
# (chaining returns a new DataFrame, so later edits do not touch initial_data)
data = initial_data.dropna(axis=0).reset_index(drop=True)


## 2.4 Correct data labels corresponding to Class 1
#### In the initial data files, labels '1', '2', '3' and '4' are all used to denote Class 1

In [75]:
data.tail(20)

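| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 279 | 35 | 1.0 | 2.0 | 122 | 192 | 0 | 0 | 174 | 0 | 0.0 | 1 | 0.0 | 3.0 | 0 |
| 280 | 61 | 1.0 | 4.0 | 148 | 203 | 0 | 0 | 161 | 0 | 0.0 | 1 | 1.0 | 7.0 | 2 |
| 281 | 58 | 1.0 | 4.0 | 114 | 318 | 0 | 1 | 140 | 0 | 4.4 | 3 | 3.0 | 6.0 | 4 |
| 282 | 58 | 0.0 | 4.0 | 170 | 225 | 1 | 2 | 146 | 1 | 2.8 | 2 | 2.0 | 6.0 | 2 |
| 283 | 56 | 1.0 | 2.0 | 130 | 221 | 0 | 2 | 163 | 0 | 0.0 | 1 | 0.0 | 7.0 | 0 |
| 284 | 56 | 1.0 | 2.0 | 120 | 240 | 0 | 0 | 169 | 0 | 0.0 | 3 | 0.0 | 3.0 | 0 |
| 285 | 67 | 1.0 | 3.0 | 152 | 212 | 0 | 2 | 150 | 0 | 0.8 | 2 | 0.0 | 7.0 | 1 |
| 286 | 55 | 0.0 | 2.0 | 132 | 342 | 0 | 0 | 166 | 0 | 1.2 | 1 | 0.0 | 3.0 | 0 |
| 287 | 44 | 1.0 | 4.0 | 120 | 169 | 0 | 0 | 144 | 1 | 2.8 | 3 | 0.0 | 6.0 | 2 |
| 288 | 63 | 1.0 | 4.0 | 140 | 187 | 0 | 2 | 144 | 1 | 4.0 | 1 | 2.0 | 7.0 | 2 |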


#### Replace all these labels with a single one ('1')

In [76]:
# Replace target labels 2, 3 and 4 with 1
# (assign the result back; an in-place replace on a selected column may act on a copy)
data['target'] = data['target'].replace(to_replace=[2, 3, 4], value=1)

data.tail(20)


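| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 279 | 35 | 1.0 | 2.0 | 122 | 192 | 0 | 0 | 174 | 0 | 0.0 | 1 | 0.0 | 3.0 | 0 |
| 280 | 61 | 1.0 | 4.0 | 148 | 203 | 0 | 0 | 161 | 0 | 0.0 | 1 | 1.0 | 7.0 | 1 |
| 281 | 58 | 1.0 | 4.0 | 114 | 318 | 0 | 1 | 140 | 0 | 4.4 | 3 | 3.0 | 6.0 | 1 |
| 282 | 58 | 0.0 | 4.0 | 170 | 225 | 1 | 2 | 146 | 1 | 2.8 | 2 | 2.0 | 6.0 | 1 |
| 283 | 56 | 1.0 | 2.0 | 130 | 221 | 0 | 2 | 163 | 0 | 0.0 | 1 | 0.0 | 7.0 | 0 |
| 284 | 56 | 1.0 | 2.0 | 120 | 240 | 0 | 0 | 169 | 0 | 0.0 | 3 | 0.0 | 3.0 | 0 |
| 285 | 67 | 1.0 | 3.0 | 152 | 212 | 0 | 2 | 150 | 0 | 0.8 | 2 | 0.0 | 7.0 | 1 |
| 286 | 55 | 0.0 | 2.0 | 132 | 342 | 0 | 0 | 166 | 0 | 1.2 | 1 | 0.0 | 3.0 | 0 |
| 287 | 44 | 1.0 | 4.0 | 120 | 169 | 0 | 0 | 144 | 1 | 2.8 | 3 | 0.0 | 6.0 | 1 |
| 288 | 63 | 1.0 | 4.0 | 140 | 187 | 0 | 2 | 144 | 1 | 4.0 | 1 | 2.0 | 7.0 | 1 |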


## 2.5 Examine final data

In [82]:
# Number of Examples
samples = data.shape[0]
print('Final dataset contains %d examples' %samples)

# Distribution of target labels
print('\nFinal class distribution:')
for i in range(2):
    # Count each label explicitly; value_counts() is sorted by frequency, not by label
    counts = (data.target == i).sum()
    print('Label: %1d, Counts: %d (%.1f%s)' %(i, counts, 100*counts/data.shape[0],'%'))


Final dataset contains 299 examples

Final class distribution:
Label: 0, Counts: 160 (53.5%)
Label: 1, Counts: 139 (46.5%)


#### As shown above, the two classes are reasonably well balanced

## 2.6 Split data into train and test sets

In [7]:
# Separate Dependent and Independent variables
X = data.drop('target', axis=1)
y = data.target
# Splitting into Train (70%) and Test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## 2.7 Create the CostMatrix
#### Based on information taken from: http://archive.ics.uci.edu/ml/datasets/statlog+(heart)
#### Explanation of data labels used:
>#### 0: absence of heart disease || 1: presence of heart disease
#### Misclassification costs:
>#### Misclassification cost of Class 0 (corresponds to a False Positive (fp) prediction) = 1
>#### Misclassification cost of Class 1 (corresponds to a False Negative (fn) prediction) = 5

In [83]:
# Create data for fp, fn, tp, tn
fp = np.full((y_test.shape[0],1), 1)
fn = np.full((y_test.shape[0],1), 5)
tp = np.zeros((y_test.shape[0],1))
tn = np.zeros((y_test.shape[0],1))
cost_matrix = np.hstack((fp, fn, tp, tn))

In [86]:
cost_matrix[:5]

array([[1., 5., 0., 0.],
       [1., 5., 0., 0.],
       [1., 5., 0., 0.],
       [1., 5., 0., 0.],
       [1., 5., 0., 0.]])
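
With these costs, the Bayes minimum risk rule reduces to a single probability threshold. The following is a minimal sketch of that standard decision rule, not necessarily the exact code path inside costcla's `BayesMinimumRiskClassifier`. The helper names `bmr_threshold` and `bmr_predict` are hypothetical, and the probabilities are made up for illustration.

```python
def bmr_threshold(c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Probability of Class 1 above which predicting 1 has the lower expected cost."""
    return (c_fp - c_tn) / (c_fn - c_tn - c_tp + c_fp)


def bmr_predict(p_pos, c_fp=1.0, c_fn=5.0, c_tp=0.0, c_tn=0.0):
    """Pick the label with the smaller expected cost, given P(y=1 | x) = p_pos."""
    risk_predict_1 = p_pos * c_tp + (1.0 - p_pos) * c_fp
    risk_predict_0 = p_pos * c_fn + (1.0 - p_pos) * c_tn
    return 1 if risk_predict_1 < risk_predict_0 else 0


# Costs used in this notebook: fp = 1, fn = 5, tp = tn = 0
threshold = bmr_threshold(c_fp=1.0, c_fn=5.0)

# Illustrative (made-up) probabilities of heart disease
probabilities = (0.05, 0.10, 0.20, 0.50, 0.90)
predictions = [bmr_predict(p) for p in probabilities]

print('Threshold: %.4f' % threshold)
print('Predictions:', predictions)
```

With fp = 1 and fn = 5 the threshold is 1/6, so an example is labelled as Class 1 as soon as its estimated probability of heart disease exceeds roughly 0.17 instead of 0.5. The rule therefore depends directly on how trustworthy the predicted probabilities are.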

# 3. Using Cost-Sensitive Techniques
#### I will use and compare 3 different Classification Algorithms:
- #### Random Forest Algorithm
- #### Linear SVM Algorithm
- #### Multinomial Naive Bayes Algorithm

#### Function used to print Classification results from each Algorithm used

In [93]:
def print_results(y_test,pred_test,target_names,cost_matrix):
    print(classification_report(y_test, pred_test, target_names=target_names ))
    # Compute Confusion Matrix using test set's true labels and corresponding predicted labels
    cm = confusion_matrix(y_test, pred_test)
    # Extract fp and fn values
    fp = cm[0][1]
    fn = cm[1][0]
    total_predictions = len(y_test)
    # Print misclassifications' data
    print('Misclassifications:%d(%.2f%s),  fp:%d,  fn:%d' %(fp+fn,100*(fp+fn)/total_predictions,'%',fp,fn))
    # Compute total misclassification cost using method from costcla library
    loss = cost_loss(y_test, pred_test, cost_matrix)
    print('Total Loss:%d\n' %loss)
    print('Confusion Matrix (rows:predictions, columns:true values):')
    print(cm.T)
    return('%d' %loss)
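
As a cross-check of the reported numbers, the total loss can also be computed by hand from the fp and fn counts. This is a minimal sketch in plain Python, assuming zero cost for correct predictions as in the cost matrix above. The helpers `confusion_counts` and `total_cost` are hypothetical names, and the toy label lists are constructed only to reproduce the counts of the Random Forest baseline (tn=41, fp=8, fn=9, tp=32).

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


def total_cost(y_true, y_pred, c_fp=1, c_fn=5):
    """Total misclassification cost when correct predictions cost nothing."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    return fp * c_fp + fn * c_fn


# Toy labels arranged to give tn=41, fp=8, fn=9, tp=32
y_true_toy = [0] * 41 + [0] * 8 + [1] * 9 + [1] * 32
y_pred_toy = [0] * 41 + [1] * 8 + [0] * 9 + [1] * 32

counts = confusion_counts(y_true_toy, y_pred_toy)
loss = total_cost(y_true_toy, y_pred_toy)
print('tn, fp, fn, tp =', counts)
print('Total Loss:', loss)
```

Here 8 false positives at cost 1 plus 9 false negatives at cost 5 give 8 + 45 = 53, matching the "Total Loss" printed for the uncalibrated Random Forest.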


## 3.1 Probability Calibration

#### I will run and compare 5 variations of each Classification Algorithm:
- #### Pure algorithm without Cost Minimization
- #### Algorithm with Cost Minimization but no probability Calibration
- #### Algorithm with Cost Minimization using Costcla Calibration on Training data set
- #### Algorithm with Cost Minimization using Sigmoid Calibration
- #### Algorithm with Cost Minimization using Isotonic Calibration
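
For intuition about the isotonic variant, the idea is to replace raw scores with a non-decreasing step function of the score that best fits the observed labels. The sketch below implements the classic Pool Adjacent Violators procedure in plain Python on made-up scores. It is an illustration only: `pool_adjacent_violators` is a hypothetical helper, and scikit-learn's `CalibratedClassifierCV` additionally uses cross-validation and handles ties and unseen scores in its own way.

```python
def pool_adjacent_violators(scores, labels):
    """Fit non-decreasing calibrated values to 0/1 labels ordered by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum of labels, number of examples]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge neighbouring blocks while their means violate monotonicity
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, size = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += size
    # Expand block means back to one value per example (in sorted order)
    fitted_sorted = []
    for label_sum, size in blocks:
        fitted_sorted.extend([label_sum / size] * size)
    # Restore the original example order
    fitted = [0.0] * len(scores)
    for rank, i in enumerate(order):
        fitted[i] = fitted_sorted[rank]
    return fitted


# Made-up classifier scores and true labels
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 1, 1, 1]
fitted = pool_adjacent_violators(scores, labels)
print(fitted)
```

The second and third examples are out of order (a positive scored below a negative), so they are pooled and both receive the calibrated value 0.5.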

In [106]:
# Creating DataFrame to hold results for Probability Calibration technique
columns = ['Classification Algorithm', 'No CM', 'CM-No Cal', 'CM-Costcla', 'CM-Sigmoid', 'CM-Isotonic']
prdf = pd.DataFrame(columns = columns)

### 3.1.1 Random Forest Algorithm

In [107]:
# List structure to hold results of specific algorithm
rf = ['Random Forest']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42  # Random State
ests=100  # Number of Estimators

print('\n\n********** Random Forest (No Cost Minimization) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)

print('\n\n********** Random Forest (Cost Minimization - No Calibration) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)

print('\n\n********** Random Forest (Cost Minimization - Costcla Calibration on training set) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train, y_train)
prob_train = model.predict_proba(X_train)
bmr = BayesMinimumRiskClassifier(calibration=True)
bmr.fit(y_train, prob_train) 
prob_test = model.predict_proba(X_test)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)

print('\n\n********** Random Forest (Cost Minimization - Sigmoid Calibration) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
cc = CalibratedClassifierCV(clf, method="sigmoid", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)

print('\n\n********** Random Forest (Cost Minimization - Isotonic Calibration) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
cc = CalibratedClassifierCV(clf, method="isotonic", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)



********** Random Forest (No Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.84      0.83        49
Presence of heart disease (1)       0.80      0.78      0.79        41

                  avg / total       0.81      0.81      0.81        90

Misclassifications:17(18.89%),  fp:8,  fn:9
Total Loss:53

Confusion Matrix (rows:predictions, columns:true values):
[[41  9]
 [ 8 32]]


********** Random Forest (Cost Minimization - No Calibration) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.95      0.43      0.59        49
Presence of heart disease (1)       0.59      0.98      0.73        41

                  avg / total       0.79      0.68      0.66        90

Misclassifications:29(32.22%),  fp:28,  fn:1
Total Loss:33

Confusion Matrix (rows:predictions, columns:true values):
[[21  1]
 [28 40]]


********** Rand

#### Random Forest Algorithm Results (Cost)

In [108]:
rf = pd.Series(rf, index=columns)
# Show the costs as a one-row table (DataFrame.append was removed in pandas 2.0)
pd.DataFrame([rf])

| | Classification Algorithm | No CM | CM-No Cal | CM-Costcla | CM-Sigmoid | CM-Isotonic |
|---|---|---|---|---|---|---|
| 0 | Random Forest | 53 | 33 | 76 | 34 | 25 |


#### Conclusions
#### For the Random Forest Classification Algorithm, the lowest total cost (25) is achieved by using Isotonic Calibration

### 3.1.2 Linear SVM Algorithm

In [112]:
# List structure to hold results of specific algorithm
lsvm = ['Linear SVM']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']

print('\n\n********** Linear SVM (No Cost Minimization) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)

print('\n\n********** Linear SVM (Cost Minimization - No Calibration) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)

print('\n\n********** Linear SVM (Cost Minimization - Costcla Calibration on training set) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train, y_train)
prob_train = model.predict_proba(X_train)
bmr = BayesMinimumRiskClassifier(calibration=True)
bmr.fit(y_train, prob_train) 
prob_test = model.predict_proba(X_test)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)

print('\n\n********** Linear SVM (Cost Minimization - Sigmoid Calibration) **********')
clf = SVC(kernel='linear', probability=True, C=1)
cc = CalibratedClassifierCV(clf, method="sigmoid", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)

print('\n\n********** Linear SVM (Cost Minimization - Isotonic Calibration) **********')
clf = SVC(kernel='linear', probability=True, C=1)
cc = CalibratedClassifierCV(clf, method="isotonic", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)



********** Linear SVM (No Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.92      0.87        49
Presence of heart disease (1)       0.89      0.76      0.82        41

                  avg / total       0.85      0.84      0.84        90

Misclassifications:14(15.56%),  fp:4,  fn:10
Total Loss:54

Confusion Matrix (rows:predictions, columns:true values):
[[45 10]
 [ 4 31]]


********** Linear SVM (Cost Minimization - No Calibration) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.93      0.57      0.71        49
Presence of heart disease (1)       0.65      0.95      0.77        41

                  avg / total       0.80      0.74      0.74        90

Misclassifications:23(25.56%),  fp:21,  fn:2
Total Loss:31

Confusion Matrix (rows:predictions, columns:true values):
[[28  2]
 [21 39]]


********** Linear SV

#### Linear SVM Algorithm Results (Cost)

In [113]:
lsvm = pd.Series(lsvm, index=columns)
# Show the costs as a one-row table (DataFrame.append was removed in pandas 2.0)
pd.DataFrame([lsvm])

| | Classification Algorithm | No CM | CM-No Cal | CM-Costcla | CM-Sigmoid | CM-Isotonic |
|---|---|---|---|---|---|---|
| 0 | Linear SVM | 54 | 31 | 34 | 33 | 29 |


#### Conclusions
#### For the Linear SVM Classification Algorithm, the lowest total cost (29) is achieved by using Isotonic Calibration

### 3.1.3 Multinomial Naive Bayes Algorithm

In [116]:
# List structure to hold results of specific algorithm
mnb = ['Multinomial Naive Bayes']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
NBalpha = 0.1

print('\n\n********** Multinomial Naive Bayes (No Cost Minimization) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization - No Calibration) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization - Costcla Calibration on training set) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
prob_train = model.predict_proba(X_train)
bmr = BayesMinimumRiskClassifier(calibration=True)
bmr.fit(y_train, prob_train) 
prob_test = model.predict_proba(X_test)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization - Sigmoid Calibration) **********')
clf = MultinomialNB(alpha = NBalpha)
cc = CalibratedClassifierCV(clf, method="sigmoid", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization - Isotonic Calibration) **********')
clf = MultinomialNB(alpha = NBalpha)
cc = CalibratedClassifierCV(clf, method="isotonic", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)



********** Multinomial Naive Bayes (No Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.75      0.82      0.78        49
Presence of heart disease (1)       0.76      0.68      0.72        41

                  avg / total       0.76      0.76      0.75        90

Misclassifications:22(24.44%),  fp:9,  fn:13
Total Loss:74

Confusion Matrix (rows:predictions, columns:true values):
[[40 13]
 [ 9 28]]


********** Multinomial Naive Bayes (Cost Minimization - No Calibration) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.80      0.76      0.78        49
Presence of heart disease (1)       0.73      0.78      0.75        41

                  avg / total       0.77      0.77      0.77        90

Misclassifications:21(23.33%),  fp:12,  fn:9
Total Loss:57

Confusion Matrix (rows:predictions, columns:true values):
[[37  9]
 [12 3

#### Multinomial Naive Bayes Algorithm Results (Cost)

In [117]:
mnb = pd.Series(mnb, index=columns)
# Show the costs as a one-row table (DataFrame.append was removed in pandas 2.0)
pd.DataFrame([mnb])

| | Classification Algorithm | No CM | CM-No Cal | CM-Costcla | CM-Sigmoid | CM-Isotonic |
|---|---|---|---|---|---|---|
| 0 | Multinomial Naive Bayes | 74 | 57 | 37 | 49 | 35 |


#### Conclusions
#### For the Multinomial Naive Bayes Classification Algorithm, the lowest total cost (35) is achieved by using Isotonic Calibration

### 3.1.4 Probability Calibration Results

In [118]:
# Collect the three result rows in one table (DataFrame.append was removed in pandas 2.0)
prdf = pd.DataFrame([rf, lsvm, mnb])
prdf

| | Classification Algorithm | No CM | CM-No Cal | CM-Costcla | CM-Sigmoid | CM-Isotonic |
|---|---|---|---|---|---|---|
| 0 | Random Forest | 53 | 33 | 76 | 34 | 25 |
| 1 | Linear SVM | 54 | 31 | 34 | 33 | 29 |
| 2 | Multinomial Naive Bayes | 74 | 57 | 37 | 49 | 35 |


#### According to the total costs of the 3 Classification Algorithms shown above, each run in 5 variations, I conclude the following:
- #### Among the 5 variations, __Isotonic Calibration__ delivers the lowest total cost for every Classification Algorithm (25, 29 and 35)
- #### Among the 3 Classification Algorithms examined, __Random Forest__ with Isotonic Calibration delivers the best overall result (25), although Costcla Calibration on the training set makes it worse (76) than using no Cost Minimization at all (53)
- #### The Multinomial Naive Bayes Algorithm presents __very poor performance__ (the highest cost, 74) if Cost Minimization is not taken into account
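
These conclusions can be re-derived mechanically from the cost table. The snippet below is a small sketch that copies the reported costs into a plain dictionary (the name `costs` and the other variable names are chosen here for illustration) and looks up the cheapest variation per algorithm and overall.

```python
# Total costs as reported in the results table above
costs = {
    'Random Forest': {'No CM': 53, 'CM-No Cal': 33, 'CM-Costcla': 76,
                      'CM-Sigmoid': 34, 'CM-Isotonic': 25},
    'Linear SVM': {'No CM': 54, 'CM-No Cal': 31, 'CM-Costcla': 34,
                   'CM-Sigmoid': 33, 'CM-Isotonic': 29},
    'Multinomial Naive Bayes': {'No CM': 74, 'CM-No Cal': 57, 'CM-Costcla': 37,
                                'CM-Sigmoid': 49, 'CM-Isotonic': 35},
}

# Cheapest variation for each algorithm
best_variation = {algo: min(row, key=row.get) for algo, row in costs.items()}

# Cheapest (algorithm, variation, cost) combination overall
overall_best = min(
    ((algo, variation, cost)
     for algo, row in costs.items()
     for variation, cost in row.items()),
    key=lambda item: item[2],
)

# Algorithm with the highest cost when no cost minimization is applied
worst_without_cm = max(costs, key=lambda algo: costs[algo]['No CM'])

print(best_variation)
print(overall_best)
print(worst_without_cm)
```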

## 3.2 Stratification (Rebalancing)
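
Rebalancing changes the class proportions of the training set so that a cost-insensitive learner sees the expensive class more often. The sketch below shows, in plain Python, what random undersampling and random oversampling do to the class counts. It is an illustration under stated assumptions: `random_resample` is a hypothetical helper, not the imbalanced-learn implementation, and the toy labels only mimic the training class counts of this notebook (111 examples of Class 0 and 98 of Class 1).

```python
import random
from collections import Counter


def random_resample(y, target_counts, seed=42):
    """Return example indices so that each label reaches its target count."""
    rng = random.Random(seed)
    selected = []
    for label, n_target in target_counts.items():
        indices = [i for i, value in enumerate(y) if value == label]
        if n_target <= len(indices):
            # Undersampling: keep a random subset, without replacement
            selected.extend(rng.sample(indices, n_target))
        else:
            # Oversampling: keep every example and add random duplicates
            extra = [rng.choice(indices) for _ in range(n_target - len(indices))]
            selected.extend(indices + extra)
    return selected


# Toy labels with the same class counts as the training split
y_toy = [0] * 111 + [1] * 98

under_idx = random_resample(y_toy, {0: 20, 1: 98})
over_idx = random_resample(y_toy, {0: 111, 1: 555})

under_counts = Counter(y_toy[i] for i in under_idx)
over_counts = Counter(y_toy[i] for i in over_idx)
print(under_counts)
print(over_counts)
```

Both strategies push the ratio of Class 1 to Class 0 to about 5:1, in line with the 5:1 ratio of the misclassification costs. Undersampling discards information from Class 0, while oversampling only duplicates existing Class 1 examples.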

### 3.2.1 Random Forest Algorithm

In [14]:

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42
ests=100

print('\n\n********** Random Forest (Without Sampling) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
print(Counter(y_train))
#0:111, 1:98
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Random Forest (With Undersampling Class 0 of Training Set) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
sampler = RandomUnderSampler(sampling_strategy={0:20, 1:98}, random_state=rs)
# fit_resample replaces the deprecated fit_sample in recent imbalanced-learn releases
X_rs, y_rs = sampler.fit_resample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Random Forest (With Oversampling Class 1 of Training Set) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
sampler = RandomOverSampler(sampling_strategy={0:111, 1:555}, random_state=rs)
X_rs, y_rs = sampler.fit_resample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Random Forest (With Combination: Undersampling Class 0 & Oversampling Class 1 of Training Set) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
sampler = RandomUnderSampler(sampling_strategy={0:80, 1:98}, random_state=rs)
X_rs, y_rs = sampler.fit_resample(X_train, y_train)
sampler = RandomOverSampler(sampling_strategy={0:80, 1:400}, random_state=rs)
X_rs, y_rs = sampler.fit_resample(X_rs, y_rs)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)



***** Random Forest (Without Sampling) *****
Counter({0: 111, 1: 98})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.84      0.83        49
Presence of heart disease (1)       0.80      0.78      0.79        41

                  avg / total       0.81      0.81      0.81        90

Misclassifications:17(18.89%),  fp:8,  fn:9
Total Loss:53

Confusion Matrix (rows:predictions, columns:true values):
[[41  9]
 [ 8 32]]

***** Random Forest (With Undersampling Class 0 of Training Set) *****
Counter({1: 98, 0: 20})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.92      0.47      0.62        49
Presence of heart disease (1)       0.60      0.95      0.74        41

                  avg / total       0.77      0.69      0.67        90

Misclassifications:28(31.11%),  fp:26,  fn:2
Total Loss:36

Confusion Matrix (rows:predictions, columns:true values):
[[23 

### 3.2.2 Linear SVM Algorithm

In [49]:

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42

print('\n\n********** Linear SVM (Without Sampling) **********')
clf = SVC(kernel='linear', probability=False, C=1)
print(Counter(y_train))
#0:111, 1:98
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Linear SVM (With Undersampling Class 0 of Training Set) **********')
clf = SVC(kernel='linear', probability=False, C=1)
sampler = RandomUnderSampler(sampling_strategy={0:20, 1:98}, random_state=rs)
# fit_resample replaces the deprecated fit_sample in recent imbalanced-learn releases
X_rs, y_rs = sampler.fit_resample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Linear SVM (With Oversampling Class 1 of Training Set) **********')
clf = SVC(kernel='linear', probability=False, C=1)
sampler = RandomOverSampler(sampling_strategy={0:111, 1:555}, random_state=rs)
X_rs, y_rs = sampler.fit_resample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Linear SVM (With Combination: Undersampling Class 0 & Oversampling Class 1 of Training Set) **********')
clf = SVC(kernel='linear', probability=False, C=1)
sampler = RandomUnderSampler(sampling_strategy={0:80, 1:98}, random_state=rs)
X_rs, y_rs = sampler.fit_resample(X_train, y_train)
sampler = RandomOverSampler(sampling_strategy={0:80, 1:400}, random_state=rs)
X_rs, y_rs = sampler.fit_resample(X_rs, y_rs)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)



***** Linear SVM (Without Sampling) *****
Counter({0: 111, 1: 98})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.92      0.87        49
Presence of heart disease (1)       0.89      0.76      0.82        41

                  avg / total       0.85      0.84      0.84        90

Misclassifications:14(15.56%),  fp:4,  fn:10
Total Loss:54

Confusion Matrix (rows:predictions, columns:true values):
[[45 10]
 [ 4 31]]

***** Linear SVM (With Undersampling Class 0 of Training Set) *****
Counter({1: 98, 0: 20})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.88      0.47      0.61        49
Presence of heart disease (1)       0.59      0.93      0.72        41

                  avg / total       0.75      0.68      0.66        90

Misclassifications:29(32.22%),  fp:26,  fn:3
Total Loss:41

Confusion Matrix (rows:predictions, columns:true values):
[[23  3]
 

### 3.2.3 Multinomial Naive Bayes Algorithm

In [16]:

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42
NBalpha = 0.1

print('\n\n********** Multinomial Naive Bayes (Without Sampling) **********')
clf = MultinomialNB(alpha = NBalpha)
print(Counter(y_train))
#0:111, 1:98
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Multinomial Naive Bayes (With Undersampling Class 0 of Training Set) **********')
clf = MultinomialNB(alpha = NBalpha)
sampler = RandomUnderSampler(sampling_strategy={0:20, 1:98}, random_state=rs)
# fit_resample replaces the deprecated fit_sample in recent imbalanced-learn releases
X_rs, y_rs = sampler.fit_resample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Multinomial Naive Bayes (With Oversampling Class 1 of Training Set) **********')
clf = MultinomialNB(alpha = NBalpha)
sampler = RandomOverSampler(sampling_strategy={0:111, 1:555}, random_state=rs)
X_rs, y_rs = sampler.fit_resample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Multinomial Naive Bayes (With Combination: Undersampling Class 0 & Oversampling Class 1 of Training Set) **********')
clf = MultinomialNB(alpha = NBalpha)
sampler = RandomUnderSampler(sampling_strategy={0:80, 1:98}, random_state=rs)
X_rs, y_rs = sampler.fit_resample(X_train, y_train)
sampler = RandomOverSampler(sampling_strategy={0:80, 1:400}, random_state=rs)
X_rs, y_rs = sampler.fit_resample(X_rs, y_rs)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)



***** Multinomial Naive Bayes (Without Sampling) *****
Counter({0: 111, 1: 98})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.75      0.82      0.78        49
Presence of heart disease (1)       0.76      0.68      0.72        41

                  avg / total       0.76      0.76      0.75        90

Misclassifications:22(24.44%),  fp:9,  fn:13
Total Loss:74

Confusion Matrix (rows:predictions, columns:true values):
[[40 13]
 [ 9 28]]

***** Multinomial Naive Bayes (With Undersampling Class 0 of Training Set) *****
Counter({1: 98, 0: 20})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.84      0.73      0.78        49
Presence of heart disease (1)       0.72      0.83      0.77        41

                  avg / total       0.79      0.78      0.78        90

Misclassifications:20(22.22%),  fp:13,  fn:7
Total Loss:48

Confusion Matrix (rows:predictions, column
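
The `sampling_strategy` counts used in the sampling cell above are consistent with a 1:5 ratio between class 0 and class 1, which matches the 1:5 misclassification costs used for example weighting in section 3.3. The notebook does not state that this is how the counts were chosen, so the sketch below is only an assumption about the arithmetic; the helper names are hypothetical and are not part of imbalanced-learn.

```python
# Hypothetical helpers: derive per-class target counts so that class 1
# ends up roughly `cost_ratio` times as frequent as class 0.

def undersample_class0(n_pos, cost_ratio):
    """Shrink class 0 until n_pos / n_neg is about cost_ratio."""
    return {0: round(n_pos / cost_ratio), 1: n_pos}

def oversample_class1(n_neg, cost_ratio):
    """Grow class 1 until n_pos / n_neg equals cost_ratio."""
    return {0: n_neg, 1: n_neg * cost_ratio}

def combined(n_neg_kept, n_pos, cost_ratio):
    """First undersample class 0 to n_neg_kept, then oversample class 1."""
    under = {0: n_neg_kept, 1: n_pos}
    over = {0: n_neg_kept, 1: n_neg_kept * cost_ratio}
    return under, over

# Training-set class counts as printed by Counter for the unsampled data
n_neg, n_pos = 111, 98

print(undersample_class0(n_pos, 5))  # {0: 20, 1: 98}
print(oversample_class1(n_neg, 5))   # {0: 111, 1: 555}
print(combined(80, n_pos, 5))        # ({0: 80, 1: 98}, {0: 80, 1: 400})
```

Each resulting dictionary has the same shape as the `sampling_strategy` argument passed to `RandomUnderSampler` and `RandomOverSampler` in the cell.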

## 3.3 Example Weighting

#### Defining weights for the training data

In [22]:
# Create one sample weight per training example, based on its true label in y_train:
# cost of misclassifying a class 0 example (false positive) = 1
# cost of misclassifying a class 1 example (false negative) = 5
weights = np.zeros(y_train.shape[0])
weights[np.where(y_train == 1)] = 5
weights[np.where(y_train == 0)] = 1


In [19]:
len(weights)

209
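
The weight vector assigns each training example the cost of misclassifying its true class, so every class 1 example counts five times as much as a class 0 example during fitting. A minimal sketch of the same mapping on a toy label list; `class_cost` and `labels` are illustrative names and do not appear in this notebook.

```python
# Cost of misclassifying an example of each true class:
# class 0 -> 1 (false positive), class 1 -> 5 (false negative).
class_cost = {0: 1, 1: 5}

labels = [0, 1, 1, 0, 1]  # toy labels standing in for y_train
weights = [class_cost[label] for label in labels]
print(weights)  # [1, 5, 5, 1, 5]

# Total weight carried by each class: 2 examples * 1 and 3 examples * 5
weight_per_class = {
    cls: sum(w for w, label in zip(weights, labels) if label == cls)
    for cls in class_cost
}
print(weight_per_class)  # {0: 2, 1: 15}
```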

### 3.3.1 Random Forest Algorithm

In [40]:

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42
ests=150

print('\n\n********** Random Forest (Without Weights) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Random Forest (With Weights) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
# Pass the per-example costs explicitly as sample weights
model = clf.fit(X_train, y_train, sample_weight=weights)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)


***** Random Forest (Without Weights) *****
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.80      0.84      0.82        49
Presence of heart disease (1)       0.79      0.76      0.77        41

                  avg / total       0.80      0.80      0.80        90

Misclassifications:18(20.00%),  fp:8,  fn:10
Total Loss:58

Confusion Matrix (rows:predictions, columns:true values):
[[41 10]
 [ 8 31]]

***** Random Forest (With Weights) *****
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.81      0.90      0.85        49
Presence of heart disease (1)       0.86      0.76      0.81        41

                  avg / total       0.84      0.83      0.83        90

Misclassifications:15(16.67%),  fp:5,  fn:10
Total Loss:55

Confusion Matrix (rows:predictions, columns:true values):
[[44 10]
 [ 5 31]]
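
The `Total Loss` figures reported above are consistent with charging 1 per false positive and 5 per false negative. A short check of that arithmetic follows; `total_loss` is a hypothetical helper written for this illustration, not the notebook's `print_results` function or costcla's `cost_loss`.

```python
def total_loss(fp, fn, cost_fp=1, cost_fn=5):
    """Sum of misclassification costs given false positive and false negative counts."""
    return fp * cost_fp + fn * cost_fn

# Counts taken from the Random Forest outputs above
print(total_loss(fp=8, fn=10))  # 58 (without weights)
print(total_loss(fp=5, fn=10))  # 55 (with weights)
```

Since both runs leave the number of false negatives at 10, the weighted forest's lower loss comes only from its three fewer false positives.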


### 3.3.2 Linear SVM Algorithm

In [52]:

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']

print('\n\n********** Linear SVM (Without Weights) **********')
clf = SVC(kernel='linear', probability=False, C=1)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Linear SVM (With Weights) **********')
clf = SVC(kernel='linear', probability=False, C=1)
# Pass the per-example costs explicitly as sample weights
model = clf.fit(X_train, y_train, sample_weight=weights)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)


***** Linear SVM (Without Weights) *****
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.92      0.87        49
Presence of heart disease (1)       0.89      0.76      0.82        41

                  avg / total       0.85      0.84      0.84        90

Misclassifications:14(15.56%),  fp:4,  fn:10
Total Loss:54

Confusion Matrix (rows:predictions, columns:true values):
[[45 10]
 [ 4 31]]

***** Linear SVM (With Weights) *****
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.90      0.55      0.68        49
Presence of heart disease (1)       0.63      0.93      0.75        41

                  avg / total       0.78      0.72      0.71        90

Misclassifications:25(27.78%),  fp:22,  fn:3
Total Loss:37

Confusion Matrix (rows:predictions, columns:true values):
[[27  3]
 [22 38]]
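
The confusion matrices in this notebook are printed with rows as predictions and columns as true values, which is the transpose of scikit-learn's default layout. Under that convention precision is computed along a row and recall down a column. The sketch below recovers the figures of the weighted Linear SVM report from its matrix; the helper name is hypothetical.

```python
def precision_recall(cm, cls):
    """cm[i][j] = number of examples predicted as class i whose true class is j."""
    tp = cm[cls][cls]
    predicted_as_cls = sum(cm[cls])                 # row sum
    actually_cls = sum(row[cls] for row in cm)      # column sum
    return tp / predicted_as_cls, tp / actually_cls

cm = [[27, 3],
      [22, 38]]  # Linear SVM (With Weights)

for cls in (0, 1):
    p, r = precision_recall(cm, cls)
    print(cls, round(p, 2), round(r, 2))
# 0 0.9 0.55
# 1 0.63 0.93
```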


### 3.3.3 Multinomial Naive Bayes Algorithm

In [51]:

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
NBalpha = 0.1

print('\n\n********** Multinomial Naive Bayes (Without Weights) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)

print('\n\n********** Multinomial Naive Bayes (With Weights) **********')
clf = MultinomialNB(alpha = NBalpha)
# Pass the per-example costs explicitly as sample weights
model = clf.fit(X_train, y_train, sample_weight=weights)
pred_test = model.predict(X_test)
print_results(y_test,pred_test,target_names,cost_matrix)


***** Multinomial Naive Bayes (Without Weights) *****
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.75      0.82      0.78        49
Presence of heart disease (1)       0.76      0.68      0.72        41

                  avg / total       0.76      0.76      0.75        90

Misclassifications:22(24.44%),  fp:9,  fn:13
Total Loss:74

Confusion Matrix (rows:predictions, columns:true values):
[[40 13]
 [ 9 28]]

***** Multinomial Naive Bayes (With Weights) *****
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.80      0.76      0.78        49
Presence of heart disease (1)       0.73      0.78      0.75        41

                  avg / total       0.77      0.77      0.77        90

Misclassifications:21(23.33%),  fp:12,  fn:9
Total Loss:57

Confusion Matrix (rows:predictions, columns:true values):
[[37  9]
 [12 32]]
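
Collecting the `Total Loss` values printed in sections 3.3.1 to 3.3.3 makes the effect of example weighting easier to compare across the three classifiers. The sketch below only reuses numbers reported above; the dictionary layout and the `loss_reduction` helper are illustrative.

```python
def loss_reduction(before, after):
    """Relative reduction in total loss, in percent."""
    return 100 * (before - after) / before

# (loss without weights, loss with weights), as reported above
losses = {
    "Random Forest": (58, 55),
    "Linear SVM": (54, 37),
    "Multinomial Naive Bayes": (74, 57),
}

for name, (before, after) in losses.items():
    print(f"{name}: {before} -> {after} ({loss_reduction(before, after):.1f}% lower)")
# Random Forest: 58 -> 55 (5.2% lower)
# Linear SVM: 54 -> 37 (31.5% lower)
# Multinomial Naive Bayes: 74 -> 57 (23.0% lower)
```

On this single train/test split the weighted Linear SVM reaches the lowest total loss (37), even though it makes the most misclassifications (25), because it trades cheap false positives for expensive false negatives.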
