![alt text](https://www.auth.gr/sites/default/files/banner-horizontal-282x100.png)
# Advanced Topics in Machine Learning - Assignment 1 - Part B


## Cost-Sensitive Learning

#### Useful library documentation, references, and resources used in this assignment:

* Statlog (Heart) Dataset: <http://archive.ics.uci.edu/ml/datasets/statlog+(heart)>
* CostCla's Documentation: <http://albahnsen.github.io/CostSensitiveClassification/#>
* scikit-learn ML library (aka *sklearn*): <http://scikit-learn.org/stable/documentation.html>
* Random Forest Classifier: <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>
* Linear Support Vector Classifier: <https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html>
* Multinomial Naive Bayes Classifier: <https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html>
* Probability Calibration of Classifiers: <https://scikit-learn.org/stable/modules/calibration.html>
* Model evaluation: quantifying the quality of predictions: <https://scikit-learn.org/stable/modules/model_evaluation.html>



# 0. __Install packages - Import necessary libraries__

In [1]:
# pip install imblearn
# pip install costcla
## Important! costcla is incompatible with latest versions of sklearn
# pip install scikit-learn==0.19.0

In [2]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from costcla.metrics import cost_loss
from sklearn.calibration import CalibratedClassifierCV
from costcla.models import BayesMinimumRiskClassifier
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from collections import Counter




# 1. __Download Required Dataset__
#### Use the Statlog (Heart) dataset from the UCI repository. The dataset consists of 4 separate data files

In [3]:
# Download Data from Internet
# path_cleveland = "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
# path_hungary = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data"
# path_swiss = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data"
# path_venice = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data"

# Various random connection problems occurred frequently during dataset download
# In this case it is better to use the locally stored files (in folder 'data'):
path_cleveland = "data/processed.cleveland.data"
path_hungary = "data/processed.hungarian.data"
path_swiss = "data/processed.switzerland.data"
path_venice = "data/processed.va.data"
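
#### The choice between the remote and the local copy of a data file can also be made at run time. The sketch below is only an illustration: `pick_source` is a hypothetical helper (not part of the original notebook) that returns the local path when the file exists and falls back to the URL otherwise. `pd.read_csv` accepts either form.

```python
import os

def pick_source(local_path, url):
    """Return local_path if that file exists, otherwise the remote url.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return local_path if os.path.isfile(local_path) else url

# Example with the Cleveland file used in this notebook
cleveland_url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/"
                 "heart-disease/processed.cleveland.data")
path_cleveland_example = pick_source("data/processed.cleveland.data", cleveland_url)
print(path_cleveland_example)
```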

# 2. __Data Preprocessing__
## 2.1 Store data into an easy to handle DataFrame

In [4]:

paths_to_data = [path_cleveland, path_hungary, path_swiss, path_venice]
# Features' Headers used in DataFrame
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", 
         "ca", "thal", "target"]

# Whole DataSet consists of 4 individual data files
# First create a separate dataframe for each data file
dfs = []
for i in range(len(paths_to_data)):
    dfs.append(pd.read_csv(paths_to_data[i], names=columns))

# Then concat all 4 dataframes into a single one. Create new index
initial_data = pd.concat(dfs, ignore_index=True)

# Alternative way to create the final dataframe using a single command with lambda function
#initial_data = pd.concat(map(lambda x: pd.read_csv(x, names=columns), paths_to_data))

## 2.2 Examine initial dataset

In [5]:
def exam_dataset(labels):
    '''
    Method that examines a given dataset and prints the number of examples and the distribution of classes
    Parameters:
        labels: A list or a dataframe's column containing the labels of the dataset
    '''
    labels = list(labels)
    total_examples = len(labels)
    print('Dataset contains %d examples' %total_examples)
    # Counter() returns the label counts in an unsorted dictionary-like object.
    # Sort its (label, count) pairs by label and then iterate
    distr = sorted(Counter(labels).items())
    print('\nDataset\'s label (class) distribution:')
    for classes, examples in distr:
        print('\tLabel %d:  %d examples (%.1f%%)' %(classes, examples, 100*examples/total_examples))


In [6]:
print('In initial dataset label 0 corresponds to Class 0, all other labels used correspond to Class 1')
exam_dataset(initial_data.target)
print('\n',initial_data.head(10))

In initial dataset label 0 corresponds to Class 0, all other labels used correspond to Class 1
Dataset contains 920 examples

Dataset's label (class) distribution:
	Label 0:  411 examples (44.7%)
	Label 1:  265 examples (28.8%)
	Label 2:  109 examples (11.8%)
	Label 3:  107 examples (11.6%)
	Label 4:  28 examples (3.0%)

   age  sex   cp trestbps chol fbs restecg thalach exang oldpeak slope   ca  \
0  63  1.0  1.0      145  233   1       2     150     0     2.3     3  0.0   
1  67  1.0  4.0      160  286   0       2     108     1     1.5     2  3.0   
2  67  1.0  4.0      120  229   0       2     129     1     2.6     2  2.0   
3  37  1.0  3.0      130  250   0       0     187     0     3.5     3  0.0   
4  41  0.0  2.0      130  204   0       2     172     0     1.4     1  0.0   
5  56  1.0  2.0      120  236   0       0     178     0     0.8     1  0.0   
6  62  0.0  4.0      140  268   0       2     160     0     3.6     3  2.0   
7  57  0.0  4.0      120  354   0       0     163   

## 2.3 Dealing with missing data
#### Unfortunately, a large portion of the examples contain missing values, marked with the '?' symbol:

In [7]:
initial_data.tail(10)

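|     | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-----|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 910 | 51 | 0.0 | 4.0 | 114 | 258 | 1 | 2 | 96 | 0 | 1 | 1 | ? | ? | 0 |
| 911 | 62 | 1.0 | 4.0 | 160 | 254 | 1 | 1 | 108 | 1 | 3 | 2 | ? | ? | 4 |
| 912 | 53 | 1.0 | 4.0 | 144 | 300 | 1 | 1 | 128 | 1 | 1.5 | 2 | ? | ? | 3 |
| 913 | 62 | 1.0 | 4.0 | 158 | 170 | 0 | 1 | 138 | 1 | 0 | ? | ? | ? | 1 |
| 914 | 46 | 1.0 | 4.0 | 134 | 310 | 0 | 0 | 126 | 0 | 0 | ? | ? | 3 | 2 |
| 915 | 54 | 0.0 | 4.0 | 127 | 333 | 1 | 1 | 154 | 0 | 0 | ? | ? | ? | 1 |
| 916 | 62 | 1.0 | 1.0 | ? | 139 | 0 | 1 | ? | ? | ? | ? | ? | ? | 0 |
| 917 | 55 | 1.0 | 4.0 | 122 | 223 | 1 | 1 | 100 | 0 | 0 | ? | ? | 6 | 2 |
| 918 | 58 | 1.0 | 4.0 | ? | 385 | 1 | 2 | ? | ? | ? | ? | ? | ? | 0 |
| 919 | 62 | 1.0 | 2.0 | 120 | 254 | 0 | 2 | 93 | 1 | 0 | ? | ? | ? | 1 |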


#### As shown below, missing data is widespread: the `ca` feature alone is missing in 611 of the 920 initial examples (about 2/3)

In [8]:
# Replace '?' symbol with 'nan'
initial_data.replace("?", np.nan, inplace=True)
# Show missing data count for each feature
print(initial_data.isnull().sum())

age           0
sex           0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalach      55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
target        0
dtype: int64
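
#### The per-feature counts above do not directly give the number of examples that will be lost, because an example is dropped as soon as any one of its features is missing. The sketch below illustrates the difference on a few made-up rows that use the same '?' convention; the values and the reduced column list are assumptions for illustration only, and plain Python is used instead of pandas to keep the example self-contained.

```python
import csv
import io

# Toy rows in the same format as the data files ('?' marks a missing value).
# The values are made up for illustration only.
raw = """63,1,145,?,0
67,1,?,?,1
41,0,130,3,0
58,1,?,6,1
"""
columns = ["age", "sex", "trestbps", "thal", "target"]

rows = list(csv.DictReader(io.StringIO(raw), fieldnames=columns))

# Missing values per column (the quantity reported by isnull().sum())
missing_per_column = {c: sum(r[c] == "?" for r in rows) for c in columns}

# Rows with at least one missing value (the rows removed by dropna(axis=0))
incomplete_rows = sum(any(v == "?" for v in r.values()) for r in rows)

print(missing_per_column)
print("rows dropped:", incomplete_rows, "of", len(rows))
```

#### In this toy case no single column is missing more than 2 values, yet 3 of the 4 rows would be dropped.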


#### The suggested practice is to delete ALL examples containing missing data

In [9]:
# Delete rows with missing data
data = initial_data.dropna(axis=0)
data.reset_index(drop=True, inplace=True)


## 2.4 Correct data labels corresponding to Class 1
#### In the initial data files, labels '1', '2', '3' and '4' are all used to denote Class 1

In [10]:
data.tail(20)

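|     | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-----|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 279 | 35 | 1.0 | 2.0 | 122 | 192 | 0 | 0 | 174 | 0 | 0.0 | 1 | 0.0 | 3.0 | 0 |
| 280 | 61 | 1.0 | 4.0 | 148 | 203 | 0 | 0 | 161 | 0 | 0.0 | 1 | 1.0 | 7.0 | 2 |
| 281 | 58 | 1.0 | 4.0 | 114 | 318 | 0 | 1 | 140 | 0 | 4.4 | 3 | 3.0 | 6.0 | 4 |
| 282 | 58 | 0.0 | 4.0 | 170 | 225 | 1 | 2 | 146 | 1 | 2.8 | 2 | 2.0 | 6.0 | 2 |
| 283 | 56 | 1.0 | 2.0 | 130 | 221 | 0 | 2 | 163 | 0 | 0.0 | 1 | 0.0 | 7.0 | 0 |
| 284 | 56 | 1.0 | 2.0 | 120 | 240 | 0 | 0 | 169 | 0 | 0.0 | 3 | 0.0 | 3.0 | 0 |
| 285 | 67 | 1.0 | 3.0 | 152 | 212 | 0 | 2 | 150 | 0 | 0.8 | 2 | 0.0 | 7.0 | 1 |
| 286 | 55 | 0.0 | 2.0 | 132 | 342 | 0 | 0 | 166 | 0 | 1.2 | 1 | 0.0 | 3.0 | 0 |
| 287 | 44 | 1.0 | 4.0 | 120 | 169 | 0 | 0 | 144 | 1 | 2.8 | 3 | 0.0 | 6.0 | 2 |
| 288 | 63 | 1.0 | 4.0 | 140 | 187 | 0 | 2 | 144 | 1 | 4.0 | 1 | 2.0 | 7.0 | 2 |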


#### Replace all these labels with a single one ('1')

In [11]:
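# Replace target labels 2, 3 and 4 with 1
# (assign the result back instead of a chained inplace replace, which may leave 'data' unchanged)
data['target'] = data['target'].replace(to_replace=[2, 3, 4], value=1)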

data.tail(20)


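|     | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-----|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 279 | 35 | 1.0 | 2.0 | 122 | 192 | 0 | 0 | 174 | 0 | 0.0 | 1 | 0.0 | 3.0 | 0 |
| 280 | 61 | 1.0 | 4.0 | 148 | 203 | 0 | 0 | 161 | 0 | 0.0 | 1 | 1.0 | 7.0 | 1 |
| 281 | 58 | 1.0 | 4.0 | 114 | 318 | 0 | 1 | 140 | 0 | 4.4 | 3 | 3.0 | 6.0 | 1 |
| 282 | 58 | 0.0 | 4.0 | 170 | 225 | 1 | 2 | 146 | 1 | 2.8 | 2 | 2.0 | 6.0 | 1 |
| 283 | 56 | 1.0 | 2.0 | 130 | 221 | 0 | 2 | 163 | 0 | 0.0 | 1 | 0.0 | 7.0 | 0 |
| 284 | 56 | 1.0 | 2.0 | 120 | 240 | 0 | 0 | 169 | 0 | 0.0 | 3 | 0.0 | 3.0 | 0 |
| 285 | 67 | 1.0 | 3.0 | 152 | 212 | 0 | 2 | 150 | 0 | 0.8 | 2 | 0.0 | 7.0 | 1 |
| 286 | 55 | 0.0 | 2.0 | 132 | 342 | 0 | 0 | 166 | 0 | 1.2 | 1 | 0.0 | 3.0 | 0 |
| 287 | 44 | 1.0 | 4.0 | 120 | 169 | 0 | 0 | 144 | 1 | 2.8 | 3 | 0.0 | 6.0 | 1 |
| 288 | 63 | 1.0 | 4.0 | 140 | 187 | 0 | 2 | 144 | 1 | 4.0 | 1 | 2.0 | 7.0 | 1 |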


## 2.5 Examine final data

In [12]:
exam_dataset(data.target)

Dataset contains 299 examples

Dataset's label (class) distribution:
	Label 0:  160 examples (53.5%)
	Label 1:  139 examples (46.5%)


#### As shown above, both classes are reasonably well balanced (53.5% vs 46.5%)

## 2.6 Split data into train and test sets

In [13]:
# Separate Dependent and Independent variables
X = data.drop('target', axis=1)
y = data.target
# Split data into Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## 2.7 Create the CostMatrix
#### Based on information taken from: http://archive.ics.uci.edu/ml/datasets/statlog+(heart)
#### Explanation of data labels used:
>#### 0: absence of heart disease || 1: presence of heart disease
#### Misclassification costs:
>#### Misclassification cost of Class 0 (corresponds to a False Positive (fp) prediction) = 1
>#### Misclassification cost of Class 1 (corresponds to a False Negative (fn) prediction) = 5

In [14]:
# Create data for fp, fn, tp, tn
fp = np.full((y_test.shape[0],1), 1)
fn = np.full((y_test.shape[0],1), 5)
tp = np.zeros((y_test.shape[0],1))
tn = np.zeros((y_test.shape[0],1))
cost_matrix = np.hstack((fp, fn, tp, tn))

In [15]:
cost_matrix[:5]

array([[1., 5., 0., 0.],
       [1., 5., 0., 0.],
       [1., 5., 0., 0.],
       [1., 5., 0., 0.],
       [1., 5., 0., 0.]])
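
#### Each row of the cost matrix holds the costs of one test example in the order `[fp, fn, tp, tn]`. The sketch below shows how a total misclassification cost follows from such rows. `total_cost` is a hypothetical plain-Python helper written for illustration under that column-order assumption; it is not the `costcla` implementation, and the labels are toy values.

```python
def total_cost(y_true, y_pred, cost_rows):
    """Sum per-example costs; each row of cost_rows is [fp, fn, tp, tn]."""
    total = 0.0
    for t, p, (c_fp, c_fn, c_tp, c_tn) in zip(y_true, y_pred, cost_rows):
        if t == 1:
            # True class 1: correct prediction costs tp, a miss costs fn
            total += c_tp if p == 1 else c_fn
        else:
            # True class 0: a false alarm costs fp, correct prediction costs tn
            total += c_fp if p == 1 else c_tn
    return total

# Toy labels: one false positive and two false negatives
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 0, 0]
cost_rows = [[1, 5, 0, 0]] * len(y_true)

print(total_cost(y_true, y_pred, cost_rows))
```

#### With the costs used in this notebook the total reduces to `1*fp + 5*fn`, which is 11 for the toy labels above.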

# 3. __Using Cost-Sensitive Techniques__
#### Techniques that aim to convert existing cost-insensitive learning algorithms into cost-sensitive ones. They usually act by modifying the training data without altering the learning algorithms themselves. In total, 4 cost-sensitive techniques will be used:
- #### Probability Calibration
- #### Stratification (Rebalancing)
- #### Example Weighting
- #### Roulette Sampling (Cost Proportionate Roulette Sampling, CPRS)

#### I will apply and compare the cost-sensitive techniques on 3 different Classification Algorithms:
- #### Random Forest Algorithm
- #### Linear SVM Algorithm
- #### Multinomial Naive Bayes Algorithm

#### Method to print Classification results from each Algorithm used

In [16]:
def print_results(y_test,pred_test,target_names,cost_matrix):
    print(classification_report(y_test, pred_test, target_names=target_names ))
    # Compute Confusion Matrix using test set's true labels and corresponding predicted labels
    cm = confusion_matrix(y_test, pred_test)
    # Extract fp and fn values
    fp = cm[0][1]
    fn = cm[1][0]
    total_predictions = len(y_test)
    # Print misclassifications' data
    print('Misclassifications:%d(%.2f%s),  fp:%d,  fn:%d' %(fp+fn,100*(fp+fn)/total_predictions,'%',fp,fn))
    # Compute total misclassification cost using method from costcla library
    loss = cost_loss(y_test, pred_test, cost_matrix)
    print('Total Loss:%d\n' %loss)
    print('Confusion Matrix (rows:predictions, columns:true values):')
    print(cm.T)
    return('%d' %loss)
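
#### The orientation of the confusion matrix is easy to mix up: scikit-learn puts true classes on rows and predicted classes on columns, so `cm[0][1]` counts false positives and `cm[1][0]` counts false negatives, while `print_results` prints the transpose. The sketch below reproduces this with plain Python; `confusion_counts` and `transpose` are hypothetical helpers for illustration, and the toy labels are constructed to match the counts of the first Random Forest run reported below (fp: 8, fn: 9).

```python
def confusion_counts(y_true, y_pred):
    """Return [[tn, fp], [fn, tp]] with rows = true class, columns = predicted class."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

# Toy labels: 41 true negatives, 8 false positives, 9 false negatives, 32 true positives
y_true = [0] * 41 + [0] * 8 + [1] * 9 + [1] * 32
y_pred = [0] * 41 + [1] * 8 + [0] * 9 + [1] * 32

cm = confusion_counts(y_true, y_pred)
fp, fn = cm[0][1], cm[1][0]
print(cm)             # rows: true values, columns: predictions
print(transpose(cm))  # rows: predictions, columns: true values
```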


## 3.1 __Probability Calibration__

#### This method calibrates the probabilistic predictions produced by the classifiers (Classification Algorithms). Calibration has been shown to improve the Cost Minimization performance of classifiers.
#### I will execute and compare 5 variations of Cost minimization and Probability Calibration on each Classification Algorithm:
- #### Pure algorithm without Cost Minimization
- #### Algorithm with Cost Minimization but no data Calibration
- #### Algorithm with Cost Minimization using Costcla Calibration on Training data set
- #### Algorithm with Cost Minimization using Sigmoid Calibration (Platt Scaling)
- #### Algorithm with Cost Minimization using Isotonic Calibration
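
#### Cost Minimization in the variations above means choosing, for each example, the class with the lower expected cost given the predicted probability of Class 1. The sketch below spells out this Bayes minimum risk rule in plain Python. It is an illustration under the cost values of this notebook, not the `costcla` code, and the function names are hypothetical. With fp = 1, fn = 5 and zero cost for correct predictions, the rule reduces to predicting Class 1 whenever its probability exceeds 1/6, which is why well-calibrated probabilities matter.

```python
def bmr_threshold(c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Probability of Class 1 above which predicting Class 1 has the lower expected cost."""
    return (c_fp - c_tn) / (c_fp - c_tn + c_fn - c_tp)

def bmr_predict(prob_class1, c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Pick, for each probability, the class with the lower expected cost."""
    preds = []
    for p in prob_class1:
        risk_predict_1 = (1 - p) * c_fp + p * c_tp   # expected cost of predicting 1
        risk_predict_0 = p * c_fn + (1 - p) * c_tn   # expected cost of predicting 0
        preds.append(1 if risk_predict_1 < risk_predict_0 else 0)
    return preds

threshold = bmr_threshold(c_fp=1, c_fn=5)
predictions = bmr_predict([0.05, 0.10, 0.20, 0.60], c_fp=1, c_fn=5)
print(round(threshold, 4))
print(predictions)
```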

In [17]:
# Creating DataFrame to hold results for Probability Calibration techniques
columns = ['Classification Algorithm', 'No CM', 'CM-No Cal', 'CM-Costcla', 'CM-Sigmoid', 'CM-Isotonic']
prdf = pd.DataFrame(columns = columns)

### 3.1.1 Random Forest Algorithm

In [18]:
# List structure to hold results of specific algorithm
rf = ['Random Forest']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42  # Random State
ests=100  # Number of Estimators

print('\n\n********** Random Forest (No Cost Minimization) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)

print('\n\n********** Random Forest (Cost Minimization - No Calibration) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)

print('\n\n********** Random Forest (Cost Minimization - Costcla Calibration on training set) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train, y_train)
prob_train = model.predict_proba(X_train)
bmr = BayesMinimumRiskClassifier(calibration=True)
bmr.fit(y_train, prob_train) 
prob_test = model.predict_proba(X_test)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)

print('\n\n********** Random Forest (Cost Minimization - Sigmoid Calibration) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
cc = CalibratedClassifierCV(clf, method="sigmoid", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)

print('\n\n********** Random Forest (Cost Minimization - Isotonic Calibration) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
cc = CalibratedClassifierCV(clf, method="isotonic", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf.append(cost)



********** Random Forest (No Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.84      0.83        49
Presence of heart disease (1)       0.80      0.78      0.79        41

                  avg / total       0.81      0.81      0.81        90

Misclassifications:17(18.89%),  fp:8,  fn:9
Total Loss:53

Confusion Matrix (rows:predictions, columns:true values):
[[41  9]
 [ 8 32]]


********** Random Forest (Cost Minimization - No Calibration) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.95      0.43      0.59        49
Presence of heart disease (1)       0.59      0.98      0.73        41

                  avg / total       0.79      0.68      0.66        90

Misclassifications:29(32.22%),  fp:28,  fn:1
Total Loss:33

Confusion Matrix (rows:predictions, columns:true values):
[[21  1]
 [28 40]]


********** Rand

#### Random Forest Algorithm Results (Cost)

In [19]:
rf = pd.Series(rf, index=columns)
prdf.append(rf, ignore_index=1)

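|   | Classification Algorithm | No CM | CM-No Cal | CM-Costcla | CM-Sigmoid | CM-Isotonic |
|---|--------------------------|-------|-----------|------------|------------|-------------|
| 0 | Random Forest | 53 | 33 | 76 | 34 | 25 |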


### Conclusions
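- #### Cost Minimization __significantly improves__ the misclassification cost of the Random Forest Classification Algorithm (from 53 to 33)
- #### Of all Probability Calibration approaches, only Isotonic Calibration __significantly improves__ the cost further (to 25); Sigmoid Calibration leaves it practically unchanged (34) and Costcla Calibration on the training set makes it much worse (76)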

### 3.1.2 Linear SVM Algorithm

In [20]:
# List structure to hold results of specific algorithm
lsvm = ['Linear SVM']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']

print('\n\n********** Linear SVM (No Cost Minimization) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)

print('\n\n********** Linear SVM (Cost Minimization - No Calibration) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)

print('\n\n********** Linear SVM (Cost Minimization - Costcla Calibration on training set) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train, y_train)
prob_train = model.predict_proba(X_train)
bmr = BayesMinimumRiskClassifier(calibration=True)
bmr.fit(y_train, prob_train) 
prob_test = model.predict_proba(X_test)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)

print('\n\n********** Linear SVM (Cost Minimization - Sigmoid Calibration) **********')
clf = SVC(kernel='linear', probability=True, C=1)
cc = CalibratedClassifierCV(clf, method="sigmoid", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)

print('\n\n********** Linear SVM (Cost Minimization - Isotonic Calibration) **********')
clf = SVC(kernel='linear', probability=True, C=1)
cc = CalibratedClassifierCV(clf, method="isotonic", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm.append(cost)



********** Linear SVM (No Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.92      0.87        49
Presence of heart disease (1)       0.89      0.76      0.82        41

                  avg / total       0.85      0.84      0.84        90

Misclassifications:14(15.56%),  fp:4,  fn:10
Total Loss:54

Confusion Matrix (rows:predictions, columns:true values):
[[45 10]
 [ 4 31]]


********** Linear SVM (Cost Minimization - No Calibration) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.96      0.53      0.68        49
Presence of heart disease (1)       0.63      0.98      0.77        41

                  avg / total       0.81      0.73      0.72        90

Misclassifications:24(26.67%),  fp:23,  fn:1
Total Loss:28

Confusion Matrix (rows:predictions, columns:true values):
[[26  1]
 [23 40]]


********** Linear SV

#### Linear SVM Algorithm Results (Cost)

In [21]:
lsvm = pd.Series(lsvm, index=columns)
prdf.append(lsvm, ignore_index=1)

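|   | Classification Algorithm | No CM | CM-No Cal | CM-Costcla | CM-Sigmoid | CM-Isotonic |
|---|--------------------------|-------|-----------|------------|------------|-------------|
| 0 | Linear SVM | 54 | 28 | 34 | 33 | 29 |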


### Conclusions
- #### Cost Minimization __significantly improves__ the misclassification cost of the Linear SVM Classification Algorithm (from 54 to 28)
- #### None of the Probability Calibration approaches __further improves__ this result: Isotonic Calibration comes closest (29), followed by Sigmoid Calibration (33) and Costcla Calibration (34)

### 3.1.3 Multinomial Naive Bayes Algorithm

In [22]:
# List structure to hold results of specific algorithm
mnb = ['Multinomial Naive Bayes']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
NBalpha = 0.1

print('\n\n********** Multinomial Naive Bayes (No Cost Minimization) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization - No Calibration) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization - Costcla Calibration on training set) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
prob_train = model.predict_proba(X_train)
bmr = BayesMinimumRiskClassifier(calibration=True)
bmr.fit(y_train, prob_train) 
prob_test = model.predict_proba(X_test)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization - Sigmoid Calibration) **********')
clf = MultinomialNB(alpha = NBalpha)
cc = CalibratedClassifierCV(clf, method="sigmoid", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization - Isotonic Calibration) **********')
clf = MultinomialNB(alpha = NBalpha)
cc = CalibratedClassifierCV(clf, method="isotonic", cv=3)
model = cc.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb.append(cost)



********** Multinomial Naive Bayes (No Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.75      0.82      0.78        49
Presence of heart disease (1)       0.76      0.68      0.72        41

                  avg / total       0.76      0.76      0.75        90

Misclassifications:22(24.44%),  fp:9,  fn:13
Total Loss:74

Confusion Matrix (rows:predictions, columns:true values):
[[40 13]
 [ 9 28]]


********** Multinomial Naive Bayes (Cost Minimization - No Calibration) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.80      0.76      0.78        49
Presence of heart disease (1)       0.73      0.78      0.75        41

                  avg / total       0.77      0.77      0.77        90

Misclassifications:21(23.33%),  fp:12,  fn:9
Total Loss:57

Confusion Matrix (rows:predictions, columns:true values):
[[37  9]
 [12 3

#### Multinomial Naive Bayes Algorithm Results (Cost)

In [23]:
mnb = pd.Series(mnb, index=columns)
prdf.append(mnb, ignore_index=1)

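|   | Classification Algorithm | No CM | CM-No Cal | CM-Costcla | CM-Sigmoid | CM-Isotonic |
|---|--------------------------|-------|-----------|------------|------------|-------------|
| 0 | Multinomial Naive Bayes | 74 | 57 | 37 | 49 | 35 |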


### Conclusions
- #### Cost Minimization __significantly improves__ the misclassification cost of the Multinomial Naive Bayes Classification Algorithm (from 74 to 57)
- #### __All__ Probability Calibration approaches further __significantly improve__ the algorithm's performance (Costcla: 37, Sigmoid: 49, Isotonic: 35)
- #### For the Multinomial Naive Bayes Classification Algorithm, the lowest misclassification cost (35) is achieved by using __Isotonic Calibration__

### 3.1.4 Probability Calibration Results

In [24]:
prdf = prdf.append([rf,lsvm,mnb], ignore_index=1)
prdf

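|   | Classification Algorithm | No CM | CM-No Cal | CM-Costcla | CM-Sigmoid | CM-Isotonic |
|---|--------------------------|-------|-----------|------------|------------|-------------|
| 0 | Random Forest | 53 | 33 | 76 | 34 | 25 |
| 1 | Linear SVM | 54 | 28 | 34 | 33 | 29 |
| 2 | Multinomial Naive Bayes | 74 | 57 | 37 | 49 | 35 |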


#### According to the results shown above for the 3 Classification Algorithms, each run in 5 variations of Cost Minimization and Probability Calibration, I conclude the following:
- #### Cost Minimization __significantly improves__ misclassification cost on all 3 Classification Algorithms
- #### Probability Calibration further __significantly improves__ performance on 2 out of 3 Classification Algorithms (Random Forest and Multinomial Naive Bayes) with respect to misclassification cost; on Linear SVM no calibration approach beats uncalibrated Cost Minimization (28)
- #### Among the 3 Probability Calibration approaches (Costcla, Sigmoid, Isotonic), __Isotonic Calibration__ always delivers the lowest cost (25, 29 and 35), regardless of the Classification Algorithm used each time
- #### Among the 3 different Classification Algorithms examined, __Random Forest__ delivers the best overall result (25, with Isotonic Calibration)
- #### The Multinomial Naive Bayes Algorithm has the __highest misclassification cost__ (74) if Cost Minimization is not taken into account
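
#### The comparison can be reproduced directly from the costs in the table above. The sketch below copies those costs into a plain dictionary (an illustrative structure, not the `prdf` DataFrame) and reports, for each algorithm, the variation with the lowest cost and the relative cost reduction achieved by uncalibrated Cost Minimization.

```python
# Costs copied from the results table above
results = {
    "Random Forest": {"No CM": 53, "CM-No Cal": 33, "CM-Costcla": 76,
                      "CM-Sigmoid": 34, "CM-Isotonic": 25},
    "Linear SVM": {"No CM": 54, "CM-No Cal": 28, "CM-Costcla": 34,
                   "CM-Sigmoid": 33, "CM-Isotonic": 29},
    "Multinomial Naive Bayes": {"No CM": 74, "CM-No Cal": 57, "CM-Costcla": 37,
                                "CM-Sigmoid": 49, "CM-Isotonic": 35},
}

best = {}
reduction = {}
for algorithm, costs in results.items():
    # Variation with the lowest total misclassification cost
    best[algorithm] = min(costs, key=costs.get)
    # Relative reduction (%) of uncalibrated Cost Minimization over the plain algorithm
    reduction[algorithm] = 100 * (costs["No CM"] - costs["CM-No Cal"]) / costs["No CM"]
    print("%s: lowest cost with %s (%d); CM-No Cal reduces cost by %.1f%%"
          % (algorithm, best[algorithm], costs[best[algorithm]], reduction[algorithm]))
```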

## 3.2 __Stratification (Rebalancing)__
#### This method changes the distribution of the training data according to the misclassification costs. In other words, it modifies the training data so that the number of examples of each class is __proportional__ to the misclassification cost of the class. 
#### In the dataset under study, the misclassification cost of Class 0 (corresponds to a False Positive (fp) prediction) equals 1, whereas the misclassification cost of Class 1 (corresponds to a False Negative (fn) prediction) equals 5. I therefore need to modify the class distribution of the training set so that Class 1 contains 5 times as many examples as Class 0. The initial distribution of classes in the training set is:
>#### Class 0: 111 examples  |  Class 1: 98 examples.
#### There are 3 approaches to achieve the desired class distribution:
- #### __Oversampling__ via sampling with replacement (as in bagging), to increase the number of examples of Class 1 to 555 while Class 0 remains at 111 examples
- #### __Undersampling__, to decrease the number of examples of Class 0 to 20 while Class 1 remains at 98 examples
- #### __Combination__ of the above methods, to increase the number of examples of Class 1 to 250 and at the same time decrease the number of examples of Class 0 to 50
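
#### The target class sizes quoted above follow from the cost ratio. The sketch below derives them in plain Python; `stratified_counts` is a hypothetical helper for illustration, and the returned dictionaries have the same `{class: count}` shape as the `sampling_strategy` arguments used in the cells that follow.

```python
def stratified_counts(n_class0, n_class1, cost_fp, cost_fn):
    """Class sizes that make Class 1 : Class 0 equal to cost_fn : cost_fp."""
    ratio = cost_fn / cost_fp
    # Oversampling keeps Class 0 and grows Class 1
    oversampling = {0: n_class0, 1: int(round(n_class0 * ratio))}
    # Undersampling keeps Class 1 and shrinks Class 0
    undersampling = {0: int(round(n_class1 / ratio)), 1: n_class1}
    return oversampling, undersampling

over, under = stratified_counts(n_class0=111, n_class1=98, cost_fp=1, cost_fn=5)
print(over)
print(under)
```

#### The Combination approach is free to pick any pair of sizes with the same 5:1 ratio; 250 and 50 is the choice made in this notebook.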



In [25]:
# Creating DataFrame to hold results for Stratification techniques
columns_st = ['Classification Algorithm', 'No Sampling', 'UnderSampling', 'OverSampling', 'Combination']
stdf = pd.DataFrame(columns = columns_st)

### 3.2.1 Random Forest Algorithm

In [26]:
# List structure to hold results of specific algorithm
rf_st = ['Random Forest']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42  # Random State
ests=100  # Number of Estimators

print('\n\n********** Random Forest (Without Sampling) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
print(Counter(y_train))
#0:111, 1:98
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf_st.append(cost)

print('\n\n********** Random Forest (With Undersampling Class 0 of Training Set) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
sampler = RandomUnderSampler(sampling_strategy={0:20, 1:98}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf_st.append(cost)

print('\n\n********** Random Forest (With Oversampling Class 1 of Training Set) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
sampler = RandomOverSampler(sampling_strategy={0:111, 1:555}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf_st.append(cost)

print('\n\n********** Random Forest (With Combination: Undersampling Class 0 & Oversampling Class 1 of Training Set) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
sampler = RandomUnderSampler(sampling_strategy={0:50, 1:98}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_train, y_train)
sampler = RandomOverSampler(sampling_strategy={0:50, 1:250}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_rs, y_rs)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf_st.append(cost)




********** Random Forest (Without Sampling) **********
Counter({0: 111, 1: 98})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.84      0.83        49
Presence of heart disease (1)       0.80      0.78      0.79        41

                  avg / total       0.81      0.81      0.81        90

Misclassifications:17(18.89%),  fp:8,  fn:9
Total Loss:53

Confusion Matrix (rows:predictions, columns:true values):
[[41  9]
 [ 8 32]]


********** Random Forest (With Undersampling Class 0 of Training Set) **********
Counter({1: 98, 0: 20})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.92      0.47      0.62        49
Presence of heart disease (1)       0.60      0.95      0.74        41

                  avg / total       0.77      0.69      0.67        90

Misclassifications:28(31.11%),  fp:26,  fn:2
Total Loss:36

Confusion Matrix (rows:predictions, colum

#### Random Forest Algorithm Results (Cost)

In [27]:
rf_st = pd.Series(rf_st, index=columns_st)
stdf.append(rf_st, ignore_index=1)

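|   | Classification Algorithm | No Sampling | UnderSampling | OverSampling | Combination |
|---|--------------------------|-------------|---------------|--------------|-------------|
| 0 | Random Forest | 53 | 36 | 57 | 40 |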


### Conclusions
- #### UnderSampling approach __significantly improves__ the misclassification cost of the Random Forest Classification Algorithm (from 53 to 36)
- #### OverSampling approach __doesn't improve__ the algorithm's performance; the cost is slightly higher (57)
- #### Combination approach also __improves__ the algorithm's performance (to 40), but less than UnderSampling

### 3.2.2 Linear SVM Algorithm

In [28]:
# List structure to hold results of specific algorithm
lsvm_st = ['Linear SVM']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42

print('\n\n********** Linear SVM (Without Sampling) **********')
clf = SVC(kernel='linear', probability=False, C=1)
print(Counter(y_train))
#0:111, 1:98
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm_st.append(cost)

print('\n\n********** Linear SVM (With Undersampling Class 0 of Training Set) **********')
clf = SVC(kernel='linear', probability=False, C=1)
sampler = RandomUnderSampler(sampling_strategy={0:20, 1:98}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm_st.append(cost)

print('\n\n********** Linear SVM (With Oversampling Class 1 of Training Set) **********')
clf = SVC(kernel='linear', probability=False, C=1)
sampler = RandomOverSampler(sampling_strategy={0:111, 1:555}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm_st.append(cost)

print('\n\n********** Linear SVM (With Combination: Undersampling Class 0 & Oversampling Class 1 of Training Set) **********')
clf = SVC(kernel='linear', probability=False, C=1)
sampler = RandomUnderSampler(sampling_strategy={0:50, 1:98}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_train, y_train)
sampler = RandomOverSampler(sampling_strategy={0:50, 1:250}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_rs, y_rs)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm_st.append(cost)




********** Linear SVM (Without Sampling) **********
Counter({0: 111, 1: 98})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.92      0.87        49
Presence of heart disease (1)       0.89      0.76      0.82        41

                  avg / total       0.85      0.84      0.84        90

Misclassifications:14(15.56%),  fp:4,  fn:10
Total Loss:54

Confusion Matrix (rows:predictions, columns:true values):
[[45 10]
 [ 4 31]]


********** Linear SVM (With Undersampling Class 0 of Training Set) **********
Counter({1: 98, 0: 20})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.88      0.47      0.61        49
Presence of heart disease (1)       0.59      0.93      0.72        41

                  avg / total       0.75      0.68      0.66        90

Misclassifications:29(32.22%),  fp:26,  fn:3
Total Loss:41

Confusion Matrix (rows:predictions, columns:tr

#### Linear SVM Algorithm Results (Cost)

In [29]:
lsvm_st = pd.Series(lsvm_st, index=columns_st)
stdf.append(lsvm_st, ignore_index=1)

Unnamed: 0,Classification Algorithm,No Sampling,UnderSampling,OverSampling,Combination
0,Linear SVM,54,41,34,30


### Conclusions
- #### All 3 Stratification approaches __significantly improve__ the misclassification cost of the Linear SVM classifier (from 54 to 41, 34 and 30)
- #### The Combination approach delivers the __best results__ (30), closely followed by OverSampling (34)
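
#### The `Total Loss` values printed above can be reproduced from the `fp` and `fn` counts. The following is a minimal sketch, assuming a False Positive costs 1, a False Negative costs 5, and correct predictions cost nothing; `total_loss` is a hypothetical helper, not the notebook's `print_results` function.

```python
def total_loss(fp, fn, cost_fp=1, cost_fn=5):
    """Total misclassification cost, assuming correct predictions cost nothing."""
    return fp * cost_fp + fn * cost_fn

# (fp, fn) pairs copied from the Linear SVM outputs above
print(total_loss(fp=4, fn=10))   # without sampling -> 54
print(total_loss(fp=26, fn=3))   # with undersampling -> 41
```

#### Undersampling makes far more mistakes overall (29 instead of 14), but because most of them are cheap False Positives the total cost still drops.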

### 3.2.3 Multinomial Naive Bayes Algorithm

In [30]:
# List structure to hold results of specific algorithm
mnb_st = ['Multinomial Naive Bayes']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42
NBalpha = 0.1

print('\n\n********** Multinomial Naive Bayes (Without Sampling) **********')
clf = MultinomialNB(alpha = NBalpha)
print(Counter(y_train))
#0:111, 1:98
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb_st.append(cost)

print('\n\n********** Multinomial Naive Bayes (With Undersampling Class 0 of Training Set) **********')
clf = MultinomialNB(alpha = NBalpha)
sampler = RandomUnderSampler(sampling_strategy={0:20, 1:98}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb_st.append(cost)

print('\n\n********** Multinomial Naive Bayes (With Oversampling Class 1 of Training Set) **********')
clf = MultinomialNB(alpha = NBalpha)
sampler = RandomOverSampler(sampling_strategy={0:111, 1:555}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_train, y_train)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb_st.append(cost)

print('\n\n********** Multinomial Naive Bayes (With Combination: Undersampling Class 0 & Oversampling Class 1 of Training Set) **********')
clf = MultinomialNB(alpha = NBalpha)
sampler = RandomUnderSampler(sampling_strategy={0:50, 1:98}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_train, y_train)
sampler = RandomOverSampler(sampling_strategy={0:50, 1:250}, random_state=rs)
X_rs, y_rs = sampler.fit_sample(X_rs, y_rs)
print(Counter(y_rs))
model = clf.fit(X_rs, y_rs)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb_st.append(cost)




********** Multinomial Naive Bayes (Without Sampling) **********
Counter({0: 111, 1: 98})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.75      0.82      0.78        49
Presence of heart disease (1)       0.76      0.68      0.72        41

                  avg / total       0.76      0.76      0.75        90

Misclassifications:22(24.44%),  fp:9,  fn:13
Total Loss:74

Confusion Matrix (rows:predictions, columns:true values):
[[40 13]
 [ 9 28]]


********** Multinomial Naive Bayes (With Undersampling Class 0 of Training Set) **********
Counter({1: 98, 0: 20})
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.84      0.73      0.78        49
Presence of heart disease (1)       0.72      0.83      0.77        41

                  avg / total       0.79      0.78      0.78        90

Misclassifications:20(22.22%),  fp:13,  fn:7
Total Loss:48

Confusion Matrix (ro

#### Multinomial Naive Bayes Algorithm Results (Cost)

In [31]:
mnb_st = pd.Series(mnb_st, index=columns_st)
stdf.append(mnb_st, ignore_index=1)

Unnamed: 0,Classification Algorithm,No Sampling,UnderSampling,OverSampling,Combination
0,Multinomial Naive Bayes,74,48,52,53


### Conclusions
- #### __All__ 3 Stratification approaches __significantly improve__ misclassification cost on Multinomial Naive Bayes Classification Algorithm
- #### UnderSampling approach delivers __best results__

### 3.2.4 Stratification Results

In [32]:
stdf = stdf.append([rf_st,lsvm_st,mnb_st], ignore_index=1)
stdf

Unnamed: 0,Classification Algorithm,No Sampling,UnderSampling,OverSampling,Combination
0,Random Forest,53,36,57,40
1,Linear SVM,54,41,34,30
2,Multinomial Naive Bayes,74,48,52,53


#### According to the results of the 3 Classification Algorithms shown above, using 3 different Stratification approaches, I conclude the following:
- #### Stratification __improves__ misclassification cost in 8 of the 9 cases; the only exception is OverSampling on Random Forest (53 to 57)
- #### __UnderSampling__ delivers the best results on 2 out of 3 Classification Algorithms (Random Forest with 36 and Multinomial Naive Bayes with 48)
- #### The __Combination__ approach delivers the best results on the Linear SVM classifier (30), ahead of OverSampling (34)
- #### Among the 3 different Classification Algorithms examined, __Linear SVM__ delivers the lowest cost overall (30)
- #### The Multinomial Naive Bayes Algorithm has the __highest cost__ of the three even after Stratification is applied (48 at best)
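
#### The sampling strategies used in this section (`{0:20, 1:98}`, `{0:111, 1:555}` and `{0:50, 1:250}`) all bring the Class 1 : Class 0 ratio close to 5:1, which is the ratio of the two misclassification costs. The sketch below shows how such target counts could be derived. `rebalanced_counts` is a hypothetical helper, and the 5:1 target is an assumption read off the numbers rather than something the notebook states.

```python
def rebalanced_counts(n0, n1, cost_ratio, mode):
    """Return {class: count} so that count[1] / count[0] is close to cost_ratio.

    mode='under' shrinks Class 0, mode='over' grows Class 1.
    """
    if mode == 'under':
        return {0: round(n1 / cost_ratio), 1: n1}
    if mode == 'over':
        return {0: n0, 1: round(n0 * cost_ratio)}
    raise ValueError("mode must be 'under' or 'over'")

# Training set class counts: 111 examples of Class 0, 98 of Class 1
print(rebalanced_counts(111, 98, 5, 'under'))  # {0: 20, 1: 98}
print(rebalanced_counts(111, 98, 5, 'over'))   # {0: 111, 1: 555}
# Combination: Class 0 is first cut to 50, then Class 1 is grown to match
print(rebalanced_counts(50, 98, 5, 'over'))    # {0: 50, 1: 250}
```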

## 3.3 __Example Weighting__
#### This method assigns a weight to each instance according to the misclassification cost of its class, so that the learning algorithm favours the class with the higher weight/cost.
#### In this dataset, misclassifying a Class 0 example (a False Positive (fp) prediction) costs 1, whereas misclassifying a Class 1 example (a False Negative (fn) prediction) costs 5. So weight=1 is assigned to each example of Class 0 and weight=5 to each example of Class 1.
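
#### To see how strongly these weights shift the balance between the classes, the sketch below computes each class's share of the total training weight, using the training class counts reported earlier (111 examples of Class 0, 98 of Class 1). `weighted_class_share` is a hypothetical helper written in plain Python.

```python
from collections import Counter

def weighted_class_share(labels, cost_by_class):
    """Share of the total example weight carried by each class."""
    counts = Counter(labels)
    mass = {c: n * cost_by_class[c] for c, n in counts.items()}
    total = sum(mass.values())
    return {c: m / total for c, m in mass.items()}

labels = [0] * 111 + [1] * 98          # training class counts
share = weighted_class_share(labels, {0: 1, 1: 5})
print(round(share[1], 3))              # 0.815
```

#### With these weights Class 1 carries about 81.5% of the total weight (490 of 601), although it makes up only about 47% of the training examples.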

#### Create WeightMatrix containing weights for Training data

In [33]:
# Weight each training example by the misclassification cost of its class:
# Class 0 -> 1, Class 1 -> 5
weights = np.where(y_train == 1, 5.0, 1.0)

In [34]:
# Creating DataFrame to hold results for Example Weighting techniques
columns_w = ['Classification Algorithm', 'wo Weighting', 'with Weighting']
wdf = pd.DataFrame(columns = columns_w)

### 3.3.1 Random Forest Algorithm

In [35]:
# List structure to hold results of specific algorithm
rf_w = ['Random Forest']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42
ests=150

print('\n\n********** Random Forest (Without Weights) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
#clf = SVC(kernel='linear', probability=False, C=1)
#clf = DecisionTreeClassifier()
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf_w.append(cost)

print('\n\n********** Random Forest (With Weights) **********')
clf = RandomForestClassifier(n_estimators=ests, random_state=rs)
model = clf.fit(X_train, y_train, sample_weight=weights)
pred_test = clf.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf_w.append(cost)



********** Random Forest (Without Weights) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.80      0.84      0.82        49
Presence of heart disease (1)       0.79      0.76      0.77        41

                  avg / total       0.80      0.80      0.80        90

Misclassifications:18(20.00%),  fp:8,  fn:10
Total Loss:58

Confusion Matrix (rows:predictions, columns:true values):
[[41 10]
 [ 8 31]]


********** Random Forest (With Weights) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.81      0.90      0.85        49
Presence of heart disease (1)       0.86      0.76      0.81        41

                  avg / total       0.84      0.83      0.83        90

Misclassifications:15(16.67%),  fp:5,  fn:10
Total Loss:55

Confusion Matrix (rows:predictions, columns:true values):
[[44 10]
 [ 5 31]]


#### Random Forest Algorithm Results (Cost)

In [36]:
rf_w = pd.Series(rf_w, index=columns_w)
wdf.append(rf_w, ignore_index=1)

Unnamed: 0,Classification Algorithm,wo Weighting,with Weighting
0,Random Forest,58,55


### Conclusions
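- #### Example Weighting only __slightly improves__ misclassification cost on Random Forest Classification Algorithm (58 to 55)
- #### This section uses `n_estimators=150`, so the unweighted baseline (58) differs from the Random Forest baseline of 53 reported in the other sections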

### 3.3.2 Linear SVM Algorithm

In [37]:
# List structure to hold results of specific algorithm
lsvm_w = ['Linear SVM']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']

print('\n\n********** Linear SVM (Without Weights) **********')
clf = SVC(kernel='linear', probability=False, C=1)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm_w.append(cost)

print('\n\n********** Linear SVM (With Weights) **********')
clf = SVC(kernel='linear', probability=False, C=1)
model = clf.fit(X_train, y_train, sample_weight=weights)
pred_test = clf.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm_w.append(cost)



********** Linear SVM (Without Weights) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.92      0.87        49
Presence of heart disease (1)       0.89      0.76      0.82        41

                  avg / total       0.85      0.84      0.84        90

Misclassifications:14(15.56%),  fp:4,  fn:10
Total Loss:54

Confusion Matrix (rows:predictions, columns:true values):
[[45 10]
 [ 4 31]]


********** Linear SVM (With Weights) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.90      0.55      0.68        49
Presence of heart disease (1)       0.63      0.93      0.75        41

                  avg / total       0.78      0.72      0.71        90

Misclassifications:25(27.78%),  fp:22,  fn:3
Total Loss:37

Confusion Matrix (rows:predictions, columns:true values):
[[27  3]
 [22 38]]


#### Linear SVM Algorithm Results (Cost)

In [38]:
lsvm_w = pd.Series(lsvm_w, index=columns_w)
wdf.append(lsvm_w, ignore_index=1)

Unnamed: 0,Classification Algorithm,wo Weighting,with Weighting
0,Linear SVM,54,37


### Conclusions
- #### Example Weighting __significantly improves__ misclassification cost on Linear SVM Classification Algorithm

### 3.3.3 Multinomial Naive Bayes Algorithm

In [39]:
# List structure to hold results of specific algorithm
mnb_w = ['Multinomial Naive Bayes']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
NBalpha = 0.1

print('\n\n********** Multinomial Naive Bayes (Without Weights) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb_w.append(cost)

print('\n\n********** Multinomial Naive Bayes (With Weights) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train, sample_weight=weights)
pred_test = clf.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb_w.append(cost)



********** Multinomial Naive Bayes (Without Weights) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.75      0.82      0.78        49
Presence of heart disease (1)       0.76      0.68      0.72        41

                  avg / total       0.76      0.76      0.75        90

Misclassifications:22(24.44%),  fp:9,  fn:13
Total Loss:74

Confusion Matrix (rows:predictions, columns:true values):
[[40 13]
 [ 9 28]]


********** Multinomial Naive Bayes (With Weights) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.80      0.76      0.78        49
Presence of heart disease (1)       0.73      0.78      0.75        41

                  avg / total       0.77      0.77      0.77        90

Misclassifications:21(23.33%),  fp:12,  fn:9
Total Loss:57

Confusion Matrix (rows:predictions, columns:true values):
[[37  9]
 [12 32]]


#### Multinomial Naive Bayes Algorithm Results (Cost)

In [40]:
mnb_w = pd.Series(mnb_w, index=columns_w)
wdf.append(mnb_w, ignore_index=1)

Unnamed: 0,Classification Algorithm,wo Weighting,with Weighting
0,Multinomial Naive Bayes,74,57


### Conclusions
- #### Example Weighting __significantly improves__ misclassification cost on Multinomial Naive Bayes Classification Algorithm (74 to 57)

### 3.3.4 Example Weighting Results

In [41]:
wdf = wdf.append([rf_w,lsvm_w,mnb_w], ignore_index=1)
wdf

Unnamed: 0,Classification Algorithm,wo Weighting,with Weighting
0,Random Forest,58,55
1,Linear SVM,54,37
2,Multinomial Naive Bayes,74,57


#### According to the results of the 3 Classification Algorithms shown above, using Example Weighting, I conclude the following:
- #### Example Weighting __significantly improves__ misclassification cost on 2 out of 3 Classification Algorithms (Linear SVM from 54 to 37 and Multinomial Naive Bayes from 74 to 57)
- #### Example Weighting only __slightly improves__ misclassification cost on Random Forest Classification Algorithm (58 to 55)
- #### Among the 3 different Classification Algorithms examined, __Linear SVM__ delivers the best results (37)
- #### The Multinomial Naive Bayes Algorithm still has the __highest cost__ after Example Weighting is applied (57), although only marginally above Random Forest (55)

## 3.4 __Roulette Sampling (Cost Proportionate Roulette Sampling, CPRS)__
#### This technique is intended for example-dependent misclassification costs. However, it can also be applied to our dataset, where the cost is class-dependent (a special case of the general concept).
#### A weight equal to its misclassification cost is assigned to each example. The weights are then turned into probabilities by dividing each one by the sum of all weights. The resulting number is the probability of each example being selected for the new dataset.
#### A new dataset is then created by __sampling with replacement__ from the original dataset using these probabilities (which are proportional to the misclassification cost of each example).
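
#### The steps described above can be sketched in plain Python as follows. This is only an illustration on hypothetical toy data (6 examples of Class 0 and 4 of Class 1); `roulette_sample` is a made-up helper built on the standard library's `random.choices`, not the NumPy-based code used in this notebook.

```python
import random

def roulette_sample(costs, size=None, seed=None):
    """Draw indices with replacement, with probability proportional to cost."""
    rng = random.Random(seed)
    total = sum(costs)
    probs = [c / total for c in costs]        # weights turned into probabilities
    size = len(costs) if size is None else size
    return rng.choices(range(len(costs)), weights=probs, k=size)

# Hypothetical toy data: Class 0 costs 1, Class 1 costs 5
labels = [0] * 6 + [1] * 4
costs = [5 if y == 1 else 1 for y in labels]

picked = roulette_sample(costs, size=1000, seed=0)
share_class1 = sum(labels[i] for i in picked) / len(picked)
print(share_class1)   # should be close to 20 / 26, the cost share of Class 1
```

#### Class 1 holds 4 x 5 = 20 of the 26 cost units in the toy data, so roughly 77% of the sampled examples are expected to belong to it, even though it accounts for only 40% of the original examples.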

#### First we must calculate each example's probability (example cost / sum of all examples' costs)

In [42]:
# Create a new column holding each example's misclassification cost,
# initialised to 1 (the cost for Class 0)
data['cost'] = 1
# Examples that belong to Class 1 get cost 5
# (.loc avoids pandas' chained-assignment pitfall)
data.loc[data.target == 1, 'cost'] = 5

# Create a new column to hold examples' probabilities (example cost/sum of all examples cost)
# Sum of all examples cost
total_cost = data['cost'].sum()
# Calculate each example's probability
data['probs'] = data['cost']/total_cost


In [43]:
print(data.probs.sum())
data.head()

0.9999999999999999


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,cost,probs
0,63,1.0,1.0,145,233,1,2,150,0,2.3,3,0.0,6.0,0,1,0.00117
1,67,1.0,4.0,160,286,0,2,108,1,1.5,2,3.0,3.0,1,5,0.005848
2,67,1.0,4.0,120,229,0,2,129,1,2.6,2,2.0,7.0,1,5,0.005848
3,37,1.0,3.0,130,250,0,0,187,0,3.5,3,0.0,3.0,0,1,0.00117
4,41,0.0,2.0,130,204,0,2,172,0,1.4,1,0.0,3.0,0,1,0.00117


In [44]:
# Use NumPy's random.choice() function to create the new dataset, same size as the original one
# New examples are selected from the original dataset via sampling with replacement (bootstrap-style sampling)
# Only the examples' indices are needed at this stage

# Create a list of the calculated examples' probabilities
probs = list(data['probs'])
# Create a list of the examples' indices
samples_id = list(data.index)

# Generate the examples' indices of the new dataset
new_samples_id = list(np.random.choice(samples_id,size=len(samples_id),replace=True,p=probs))

# Use the new indices to build a DataFrame containing the new dataset's examples
# (all sampled rows are selected at once, duplicates included)
data_rlt = data.loc[new_samples_id].reset_index(drop=True)

# Columns 'cost' and 'probs' are not needed in the new dataset
data_rlt.drop(['cost','probs'], axis=1, inplace=True)

In [45]:
data_rlt.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,35,0.0,4.0,138,183,0,0,182,0,1.4,1,0.0,3.0,0
1,58,1.0,3.0,132,224,0,2,173,0,3.2,1,2.0,7.0,1
2,65,0.0,3.0,155,269,0,0,148,0,0.8,1,0.0,3.0,0
3,44,1.0,4.0,120,169,0,0,144,1,2.8,3,0.0,6.0,1
4,56,1.0,4.0,130,283,1,2,103,1,1.6,3,0.0,7.0,1


#### Examine new dataset

In [46]:
# Number of Examples
samples = data_rlt.shape[0]
print('New dataset contains %d examples' %samples)

# Distribution of target labels
print('\nNew dataset class distribution:')
Counter(data_rlt.target)


New dataset contains 299 examples

New dataset class distribution:


Counter({0: 75, 1: 224})

#### As shown above, the class distribution of the new dataset reflects the sampling probabilities: Class 1 now accounts for 224 of the 299 examples (about 75%)

#### Split new dataset into train and test sets

In [47]:
# Separate Dependent and Independent variables
X_rlt = data_rlt.drop('target', axis=1)
y_rlt = data_rlt.target
# Spliting Train and Test variables
X_train_rlt, X_test_rlt, y_train_rlt, y_test_rlt = train_test_split(X_rlt, y_rlt, test_size=0.3, random_state=42)
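
#### Because the dataset is resampled with replacement before it is split, copies of the same original example can end up in both the train and the test set. One possible alternative, sketched below, is to split the example indices first and apply Roulette Sampling to the training indices only. `split_then_roulette` is a hypothetical helper in plain Python, and its split is a simple shuffle, not the `train_test_split` call used above.

```python
import random

def split_then_roulette(labels, cost_by_class, test_fraction=0.3, seed=42):
    """Split indices first, then resample only the training part by cost."""
    rng = random.Random(seed)
    ids = list(range(len(labels)))
    rng.shuffle(ids)
    n_test = int(round(test_fraction * len(ids)))
    test_ids, train_ids = ids[:n_test], ids[n_test:]
    # Cost-proportionate sampling with replacement, restricted to training ids
    weights = [cost_by_class[labels[i]] for i in train_ids]
    resampled_train_ids = rng.choices(train_ids, weights=weights, k=len(train_ids))
    return resampled_train_ids, test_ids

# Hypothetical toy labels: 6 examples of Class 0 and 4 of Class 1
labels = [0] * 6 + [1] * 4
train_ids, test_ids = split_then_roulette(labels, {0: 1, 1: 5})
print(sorted(set(train_ids) & set(test_ids)))   # [] : no example is shared
```

#### With this ordering the test set keeps the original class distribution and contains no example that the model has seen during training.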

In [48]:
# Creating DataFrame to hold results for Roulette technique
columns_rlt = ['Classification Algorithm', 'No_CM', 'CM', 'No_CM-RLT', 'CM-RLT']
rltdf = pd.DataFrame(columns = columns_rlt)

### 3.4.1 Random Forest Algorithm

In [49]:
# List structure to hold results of specific algorithm
rf_rlt = ['Random Forest']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
rs=42  # Random State
ests=100  # Number of Estimators

print('\n\n********** Random Forest (No Cost Minimization) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf_rlt.append(cost)

print('\n\n********** Random Forest (Cost Minimization) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
rf_rlt.append(cost)

print('\n\n********** Random Forest (No Cost Minimization - Roulette Sampling) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train_rlt, y_train_rlt)
pred_test = model.predict(X_test_rlt)
cost = print_results(y_test_rlt,pred_test,target_names,cost_matrix)
rf_rlt.append(cost)

print('\n\n********** Random Forest (Cost Minimization - Roulette Sampling) **********')
clf = RandomForestClassifier(random_state=rs, n_estimators=ests)
model = clf.fit(X_train_rlt, y_train_rlt)
prob_test = model.predict_proba(X_test_rlt)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test_rlt,pred_test,target_names,cost_matrix)
rf_rlt.append(cost)




********** Random Forest (No Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.84      0.83        49
Presence of heart disease (1)       0.80      0.78      0.79        41

                  avg / total       0.81      0.81      0.81        90

Misclassifications:17(18.89%),  fp:8,  fn:9
Total Loss:53

Confusion Matrix (rows:predictions, columns:true values):
[[41  9]
 [ 8 32]]


********** Random Forest (Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.95      0.43      0.59        49
Presence of heart disease (1)       0.59      0.98      0.73        41

                  avg / total       0.79      0.68      0.66        90

Misclassifications:29(32.22%),  fp:28,  fn:1
Total Loss:33

Confusion Matrix (rows:predictions, columns:true values):
[[21  1]
 [28 40]]


********** Random Forest (No Cos

#### Random Forest Algorithm Results (Cost)

In [50]:
rf_rlt = pd.Series(rf_rlt, index=columns_rlt)
rltdf.append(rf_rlt, ignore_index=1)

Unnamed: 0,Classification Algorithm,No_CM,CM,No_CM-RLT,CM-RLT
0,Random Forest,53,33,18,13


### Conclusions
- #### As already shown in the Probability Calibration section, Cost Minimization __significantly improves__ misclassification cost on Random Forest Classification Algorithm (53 to 33)
- #### Roulette Sampling __significantly improves__ the algorithm's cost even if no Cost Minimization is applied (18)
- #### The combination of Cost Minimization and Roulette Sampling delivers the best results (13)
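
#### The effect of Cost Minimization can be illustrated with the textbook minimum-expected-cost rule: predict Class 1 whenever the expected cost of predicting 0 exceeds that of predicting 1. The sketch below assumes costs of 1 (fp) and 5 (fn) and zero cost for correct predictions, which moves the decision threshold from 0.5 to 1/6. `min_risk_predict` is a hypothetical helper and is not claimed to match the internal implementation of costcla's `BayesMinimumRiskClassifier`.

```python
def min_risk_predict(p1, cost_fp=1, cost_fn=5):
    """Return the class with the lower expected cost, given P(Class 1) = p1."""
    risk_if_predict_0 = p1 * cost_fn          # a true Class 1 would be missed
    risk_if_predict_1 = (1 - p1) * cost_fp    # a true Class 0 would be flagged
    return 1 if risk_if_predict_0 > risk_if_predict_1 else 0

# Equivalent threshold on P(Class 1): cost_fp / (cost_fp + cost_fn)
threshold = 1 / (1 + 5)

print([min_risk_predict(p) for p in (0.10, 0.20, 0.50)])   # [0, 1, 1]
```

#### An example with only a 20% estimated probability of heart disease is already labelled as Class 1, which explains the pattern in the outputs above: many more False Positives, far fewer False Negatives, and a lower total cost.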

### 3.4.2 Linear SVM Algorithm

In [51]:
# List structure to hold results of specific algorithm
lsvm_rlt = ['Linear SVM']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']

print('\n\n********** Linear SVM (No Cost Minimization) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm_rlt.append(cost)

print('\n\n********** Linear SVM (Cost Minimization) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
lsvm_rlt.append(cost)

print('\n\n********** Linear SVM (No Cost Minimization - Roulette Sampling) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train_rlt, y_train_rlt)
pred_test = model.predict(X_test_rlt)
cost = print_results(y_test_rlt,pred_test,target_names,cost_matrix)
lsvm_rlt.append(cost)

print('\n\n********** Linear SVM (Cost Minimization - Roulette Sampling) **********')
clf = SVC(kernel='linear', probability=True, C=1)
model = clf.fit(X_train_rlt, y_train_rlt)
prob_test = model.predict_proba(X_test_rlt)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test_rlt,pred_test,target_names,cost_matrix)
lsvm_rlt.append(cost)




********** Linear SVM (No Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.82      0.92      0.87        49
Presence of heart disease (1)       0.89      0.76      0.82        41

                  avg / total       0.85      0.84      0.84        90

Misclassifications:14(15.56%),  fp:4,  fn:10
Total Loss:54

Confusion Matrix (rows:predictions, columns:true values):
[[45 10]
 [ 4 31]]


********** Linear SVM (Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.93      0.55      0.69        49
Presence of heart disease (1)       0.64      0.95      0.76        41

                  avg / total       0.80      0.73      0.73        90

Misclassifications:24(26.67%),  fp:22,  fn:2
Total Loss:32

Confusion Matrix (rows:predictions, columns:true values):
[[27  2]
 [22 39]]


********** Linear SVM (No Cost Minimi

#### Linear SVM Algorithm Results (Cost)

In [52]:
lsvm_rlt = pd.Series(lsvm_rlt, index=columns_rlt)
rltdf.append(lsvm_rlt, ignore_index=1)

Unnamed: 0,Classification Algorithm,No_CM,CM,No_CM-RLT,CM-RLT
0,Linear SVM,54,32,17,18


### Conclusions
- #### As already shown in the Probability Calibration section, Cost Minimization __significantly improves__ misclassification cost on Linear SVM Classification Algorithm (54 to 32)
- #### Roulette Sampling __significantly improves__ the algorithm's cost even if no Cost Minimization is applied (17)
- #### Combining Cost Minimization with Roulette Sampling brings __no further gain__ here: its cost (18) is marginally higher than that of Roulette Sampling alone (17)

### 3.4.3 Multinomial Naive Bayes Algorithm

In [53]:
# List structure to hold results of specific algorithm
mnb_rlt = ['Multinomial Naive Bayes']

target_names = ['Absence of heart disease (0)','Presence of heart disease (1)']
NBalpha = 0.1

print('\n\n********** Multinomial Naive Bayes (No Cost Minimization) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
pred_test = model.predict(X_test)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb_rlt.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train, y_train)
prob_test = model.predict_proba(X_test)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test,pred_test,target_names,cost_matrix)
mnb_rlt.append(cost)

print('\n\n********** Multinomial Naive Bayes (No Cost Minimization - Roulette Sampling) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train_rlt, y_train_rlt)
pred_test = model.predict(X_test_rlt)
cost = print_results(y_test_rlt,pred_test,target_names,cost_matrix)
mnb_rlt.append(cost)

print('\n\n********** Multinomial Naive Bayes (Cost Minimization - Roulette Sampling) **********')
clf = MultinomialNB(alpha = NBalpha)
model = clf.fit(X_train_rlt, y_train_rlt)
prob_test = model.predict_proba(X_test_rlt)
bmr = BayesMinimumRiskClassifier(calibration=False)
pred_test = bmr.predict(prob_test, cost_matrix)
cost = print_results(y_test_rlt,pred_test,target_names,cost_matrix)
mnb_rlt.append(cost)



********** Multinomial Naive Bayes (No Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.75      0.82      0.78        49
Presence of heart disease (1)       0.76      0.68      0.72        41

                  avg / total       0.76      0.76      0.75        90

Misclassifications:22(24.44%),  fp:9,  fn:13
Total Loss:74

Confusion Matrix (rows:predictions, columns:true values):
[[40 13]
 [ 9 28]]


********** Multinomial Naive Bayes (Cost Minimization) **********
                               precision    recall  f1-score   support

 Absence of heart disease (0)       0.80      0.76      0.78        49
Presence of heart disease (1)       0.73      0.78      0.75        41

                  avg / total       0.77      0.77      0.77        90

Misclassifications:21(23.33%),  fp:12,  fn:9
Total Loss:57

Confusion Matrix (rows:predictions, columns:true values):
[[37  9]
 [12 32]]


********** 

#### Multinomial Naive Bayes Algorithm Results (Cost)

In [54]:
mnb_rlt = pd.Series(mnb_rlt, index=columns_rlt)
rltdf.append(mnb_rlt, ignore_index=1)

Unnamed: 0,Classification Algorithm,No_CM,CM,No_CM-RLT,CM-RLT
0,Multinomial Naive Bayes,74,57,44,21


### Conclusions
- #### As already shown in the Probability Calibration section, Cost Minimization __significantly improves__ misclassification cost on Multinomial Naive Bayes Classification Algorithm (74 to 57)
- #### Roulette Sampling __significantly improves__ the algorithm's cost even if no Cost Minimization is applied (74 to 44)
- #### The combination of Cost Minimization and Roulette Sampling delivers the best results (21)

### 3.4.4 Roulette Sampling Results

In [55]:
rltdf = rltdf.append([rf_rlt,lsvm_rlt,mnb_rlt], ignore_index=1)
rltdf

Unnamed: 0,Classification Algorithm,No_CM,CM,No_CM-RLT,CM-RLT
0,Random Forest,53,33,18,13
1,Linear SVM,54,32,17,18
2,Multinomial Naive Bayes,74,57,44,21


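#### According to the results of the 3 Classification Algorithms shown above, using the Roulette Sampling technique, I conclude the following:
- #### Cost Minimization __significantly improves__ misclassification cost on all 3 Classification Algorithms (53 to 33, 54 to 32 and 74 to 57)
- #### Roulette Sampling __significantly improves__ the cost of all 3 algorithms even if no Cost Minimization is applied (18, 17 and 44)
- #### The __combination__ of Cost Minimization and Roulette Sampling delivers the __best results__ on Random Forest (13) and Multinomial Naive Bayes (21); on Linear SVM it is marginally worse (18) than Roulette Sampling alone (17)
- #### Among the 3 different Classification Algorithms examined, __Random Forest__ delivers the lowest cost when the combination is applied (13), followed by __Linear SVM__ (18)
- #### The Multinomial Naive Bayes Algorithm has the __highest cost__ in every column, even when both techniques are combined (21)
- #### The Roulette Sampling columns are computed on `X_test_rlt`, a test set drawn from the resampled dataset, whereas the other columns use the original `X_test`. Because sampling with replacement takes place before the train/test split, copies of the same example can appear in both `X_train_rlt` and `X_test_rlt`, so these costs are not directly comparable with the rest and are probably optimistic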

# 4. __Total Results__

In [62]:
print('\t1. Probability Calibration Results:')
print(prdf)
print('\n\n\t2. Stratification (Rebalancing) Results:')
print(stdf)
print('\n\n\t3. Weighting Results:')
print(wdf)
print('\n\n\t4. Roulette Sampling Results:')
print(rltdf)

	1. Probability Calibration Results:
  Classification Algorithm No CM CM-No Cal CM-Costcla CM-Sigmoid CM-Isotonic
0            Random Forest    53        33         76         34          25
1               Linear SVM    54        28         34         33          29
2  Multinomial Naive Bayes    74        57         37         49          35


	2. Stratification (Rebalancing) Results:
  Classification Algorithm No Sampling UnderSampling OverSampling Combination
0            Random Forest          53            36           57          40
1               Linear SVM          54            41           34          30
2  Multinomial Naive Bayes          74            48           52          53


	3. Weighting Results:
  Classification Algorithm wo Weighting with Weighting
0            Random Forest           58             55
1               Linear SVM           54             37
2  Multinomial Naive Bayes           74             57


	4. Roulette Sampling Results:
  Classification Algo

#### According to the total results shown above, from 4 different Cost-Sensitive Techniques applied on 3 different Classification Algorithms, I conclude the following:
- #### All 4 Cost-Sensitive Techniques __can reduce__ misclassification cost on all 3 Classification Algorithms, although not every variant does: CM-Costcla (53 to 76) and OverSampling (53 to 57) increase the cost of Random Forest
- #### When applied without calibration, Cost Minimization __significantly improves__ misclassification cost on all 3 Classification Algorithms (53 to 33, 54 to 28 and 74 to 57)
- #### Regardless of the Classification Algorithm used, __Roulette Sampling__ shows the lowest costs, in most cases when combined with Cost Minimization, followed by Probability Calibration, Stratification and finally Weighting. The Roulette Sampling costs, however, are measured on a resampled test set and are therefore not directly comparable with those of the other techniques
- #### Among the 3 different Classification Algorithms examined:
    - #### __Random Forest__ delivered the best results on Probability Calibration (25) and Roulette Sampling (13) Techniques
    - #### __Linear SVM__ delivered the best results on Stratification (30) and Weighting (37) Techniques
    - #### __Multinomial Naive Bayes__ consistently had the highest best-case cost, regardless of the Cost-Sensitive Technique applied
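
#### The ranking of the techniques can be checked directly against the tables printed above. The sketch below copies the cost-sensitive columns of each table (baseline columns excluded) into a hypothetical `results` dictionary and sorts the techniques by their lowest cost for each algorithm; `rank_techniques` is a made-up helper.

```python
# Costs copied from the result tables above (baseline columns left out)
results = {
    'Probability Calibration': {
        'Random Forest': [33, 76, 34, 25],
        'Linear SVM': [28, 34, 33, 29],
        'Multinomial Naive Bayes': [57, 37, 49, 35],
    },
    'Stratification': {
        'Random Forest': [36, 57, 40],
        'Linear SVM': [41, 34, 30],
        'Multinomial Naive Bayes': [48, 52, 53],
    },
    'Weighting': {
        'Random Forest': [55],
        'Linear SVM': [37],
        'Multinomial Naive Bayes': [57],
    },
    'Roulette Sampling': {
        'Random Forest': [18, 13],
        'Linear SVM': [17, 18],
        'Multinomial Naive Bayes': [44, 21],
    },
}

def rank_techniques(results, algorithm):
    """Return (technique, best cost) pairs sorted from lowest to highest cost."""
    best = {tech: min(costs[algorithm]) for tech, costs in results.items()}
    return sorted(best.items(), key=lambda item: item[1])

for algorithm in ('Random Forest', 'Linear SVM', 'Multinomial Naive Bayes'):
    print(algorithm, rank_techniques(results, algorithm))
```

#### For all 3 algorithms the order is the same: Roulette Sampling, Probability Calibration, Stratification, Weighting.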
