# Machine Learning with Python

Collaboratory workshop, 02/21/2018

This is a notebook developed throughout the first day of the Collaboratory Workshop, Machine Learning with Python. For more information, go to the workshop home page:

https://github.com/QCB-Collaboratory/W17.MachineLearning/wiki/Day-2

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Loading the synthetic data

You can download this data from our [wiki](https://github.com/QCB-Collaboratory/W17.MachineLearning/wiki/Day-2), or using this [direct link](https://github.com/QCB-Collaboratory/W17.MachineLearning/raw/master/materials/day_2/Day2_testdataset.zip). After downloading, move the file to the directory where you are running your notebook and unzip it. You should now have two files:

* ```CollML_testdataset_features.dat```: contains feature values for each sample
* ```CollML_testdataset_labels.dat```: contains the class of each sample

In [None]:
features_origin = np.loadtxt('CollML_testdataset_features.dat')
labels_origin = np.loadtxt('CollML_testdataset_labels.dat')

In [None]:
print("Shape of features", features_origin.shape)
print("Shape of labels", labels_origin.shape)

In [None]:
np.unique(labels_origin)   ## shows unique values in an array

We want to create a classifier that reproduces the labels in the NumPy array _labels_ based on _features_. Based on the shapes shown above, we have 2 features, and 500 samples.

## Testing/Training dataset

In [None]:
from sklearn.model_selection import train_test_split

features, features_test, labels, labels_test = train_test_split( 
                features_origin, labels_origin, test_size=0.2,
                shuffle=False)

In [None]:
print("Shape of the whole dataset",features_origin.shape)
print("Shape of the train dataset",features.shape)
print("Shape of the test dataset",features_test.shape)
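
To see what `test_size=0.2` and `shuffle=False` do, here is a minimal sketch on made-up data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins with the same shape as the workshop data, not the actual files. Without shuffling, the split is a plain cut of the rows, so it only gives representative sets if the rows are not ordered by class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the workshop data: 500 samples, 2 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False)

print("Train:", X_tr.shape, "Test:", X_te.shape)

# With shuffle=False the split is a plain cut: leading rows train, trailing rows test
same_order = (np.array_equal(X_tr, X_demo[:len(X_tr)])
              and np.array_equal(X_te, X_demo[len(X_tr):]))
print("Order preserved:", same_order)
```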

## Visualization

Let's start by visualizing this dataset.

In [None]:
plt.plot( features[ labels == 0, 0 ], features[ labels == 0, 1 ], 'bo'  )
plt.plot( features[ labels == 1, 0 ], features[ labels == 1, 1 ], 'rs'  )
plt.show()

In [None]:
## Re-doing the previous plot, but with more details
plt.figure( figsize=(4,3) )
plt.plot( features[ labels == 0, 0 ], features[ labels == 0, 1 ], 'o',
           markersize=5, color='b')
plt.plot( features[ labels == 1, 0 ], features[ labels == 1, 1 ], 's',
           markersize=4, color='r')

plt.xlabel('X1')
plt.ylabel('X2')
plt.tight_layout()
plt.savefig('SyntheticDataset.png', dpi=500)
plt.show()

## Decision Trees  - Synthetic data

Let's start by creating our Decision Tree classifier.

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

Variable _clf_ will contain all information learned by the classifier. To perform the learning step, we use the method _fit_:

In [None]:
clf.fit( features, labels )

To find the accuracy of this classifier, we can use the method _score_.

In [None]:
clf.score( features, labels )

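A score of 1.0 means that our classifier perfectly reproduces the labels of all training points. Note that this is the accuracy on the same data used for fitting, so it does not tell us how well the classifier generalizes to new samples.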
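
A rough sketch of why the training score alone can be misleading: we fit an unrestricted tree and compare its score on the training points with its score on held-out points. The data here comes from `make_moons`, a hypothetical substitute for the workshop dataset, so the numbers will differ from the ones in this notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy two-class data (not the workshop dataset)
X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# An unrestricted tree keeps splitting until every training leaf is pure
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print("Accuracy on training points:", train_acc)
print("Accuracy on held-out points:", test_acc)
```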

In [None]:
from sklearn.tree import export_graphviz
export_graphviz( clf, 'Graph_DecisionTree_testdataset.dat' )

Feature ranking by Gini Importance

In [None]:
clf.feature_importances_
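
As a small illustration of how to read these numbers, the sketch below builds hypothetical data in which the class depends only on the first feature. The importances are normalized to sum to one, so in this constructed case we expect nearly all of the weight on the first feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the class depends only on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
importances = tree.feature_importances_

print("Importances:", importances)
print("Sum of importances:", importances.sum())
```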

Looking at the plot, we know that the point $P_1 = [-1,-1]$ should be of class 0 (zero), and the point $P_2 = [-1,4]$ should belong to class 1. Let's check the classifier's prediction:

In [None]:
P1 = np.array([[-1,-1]])
P2 = np.array([[-1,4]])

print("Prediction for P1: ", clf.predict(P1))
print("Prediction for P2: ", clf.predict(P2))

**Question:** Why did we use [[ and ]] in the previous cell?
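
One way to explore this question is to look at the array shapes. The sketch below uses a tiny hypothetical dataset: `predict` expects a 2D array of shape (number of samples, number of features), so a single point has to be passed as one row.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical dataset: the class equals the first feature
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_demo = np.array([0, 1, 0, 1])
tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

point_1d = np.array([1.0, 0.0])      # shape (2,): just a list of numbers
point_2d = np.array([[1.0, 0.0]])    # shape (1, 2): one sample with two features

pred = tree.predict(point_2d)
pred_reshaped = tree.predict(point_1d.reshape(1, -1))  # same thing via reshape
print(point_1d.shape, point_2d.shape, pred, pred_reshaped)

# Passing the 1D array directly is rejected by recent scikit-learn versions
try:
    tree.predict(point_1d)
    raised = False
except ValueError:
    raised = True
print("1D input raised ValueError:", raised)
```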

You can also check the probability that a given point belongs to a class:

In [None]:
print("Probability for P1: ", clf.predict_proba(P1))
print("Probability for P2: ", clf.predict_proba(P2))

Next, let's investigate the "decision boundaries", i.e. the boundaries between the regions assigned to each class.

In [None]:
delta = 0.5
x     = np.arange(-2.0, 5.001, delta)
y     = np.arange(-2.0, 5.001, delta)

X, Y = np.meshgrid(x, y)
Z    = clf.predict( np.c_[X.ravel(), Y.ravel()] )
Z    = Z.reshape( X.shape )

plt.contourf( X, Y, Z, cmap=plt.get_cmap('jet'))

plt.show()
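
The grid construction used above can be unpacked step by step. This sketch only looks at array shapes (no classifier, no plot): `meshgrid` builds the 2D coordinates, `ravel` flattens them, and `np.c_` stacks them into one row per grid point, which is the layout `predict` expects.

```python
import numpy as np

delta = 0.5
x = np.arange(-2.0, 5.001, delta)   # 15 values from -2.0 to 5.0
y = np.arange(-2.0, 5.001, delta)

X, Y = np.meshgrid(x, y)                     # two 15 x 15 coordinate arrays
grid_points = np.c_[X.ravel(), Y.ravel()]    # one (x, y) pair per row

print("x:", x.shape, "X:", X.shape, "grid points:", grid_points.shape)

# Reshaping a flat result back to X.shape restores the grid layout
restored = grid_points[:, 0].reshape(X.shape)
print("Round trip ok:", np.array_equal(restored, X))
```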

In [None]:
delta = 0.01
x     = np.arange(-2.0, 5.001, delta)
y     = np.arange(-2.0, 5.001, delta)

X, Y = np.meshgrid(x, y)
Z    = clf.predict( np.c_[X.ravel(), Y.ravel()] )
Z    = Z.reshape( X.shape )

plt.contourf( X, Y, Z, cmap=plt.get_cmap('jet'))

plt.show()

Let's use some of the meta-parameters available in scikit-learn to modify the learning process.

In [None]:
clf = DecisionTreeClassifier( max_depth = 5 )
clf.fit( features, labels )
print( clf.score( features, labels ) )
export_graphviz( clf, 'Graph_DecisionTree_testdataset_2.dat' )
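
To get a feeling for what `max_depth` controls, the following sketch fits trees with different depth limits and reports their actual depth, number of leaves, and training accuracy. It uses `make_moons` as a hypothetical substitute for the workshop data.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy two-class data (not the workshop dataset)
X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)

results = {}
for depth in (1, 3, 5, None):   # None means "no limit"
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_demo, y_demo)
    results[depth] = (tree.get_depth(), tree.get_n_leaves(),
                      tree.score(X_demo, y_demo))
    print("max_depth =", depth, "-> depth, leaves, training accuracy:",
          results[depth])
```

Shallow trees cannot carve out every noisy point, so their training accuracy is lower, but their decision boundaries are usually simpler.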

In [None]:
delta = 0.01
x     = np.arange(-2.0, 5.001, delta)
y     = np.arange(-2.0, 5.001, delta)

X, Y = np.meshgrid(x, y)
Z    = clf.predict( np.c_[X.ravel(), Y.ravel()] )
Z    = Z.reshape( X.shape )

plt.contourf( X, Y, Z, cmap=plt.get_cmap('jet'))
plt.xlim(-2, 5)
plt.ylim(-2, 5)


plt.show()

# Random forests

Random forests are among the most important models in Machine Learning, especially for applications that demand low latency.

In [None]:
# plot example of sampled data:
np.random.seed(0)  # the random seed, to be sure that you always plot the same thing

for i in range (3):
    plt.figure( figsize=(4,3) )
    
    choice = np.random.random(size=len(labels))>0.8
    plt.plot( features[ np.logical_and(choice , labels == 0), 0 ], features[ np.logical_and(choice , labels == 0), 1 ], 'o',
               markersize=4, color='b')
    plt.plot( features[ np.logical_and(choice , labels == 1), 0 ], features[ np.logical_and(choice , labels == 1), 1 ], 's',
               markersize=4, color='r')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf = RandomForestClassifier( n_estimators = 50, 
                            max_depth =5, oob_score = True )
clf.fit( features, labels )
print("Out of bag score",clf.oob_score_)
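
By default, each tree in scikit-learn's random forest is trained on a bootstrap sample, i.e. points drawn with replacement from the training set. The points a tree never saw are its "out of bag" points, and they are what `oob_score_` is computed on. A minimal NumPy sketch of one such draw (the sample size of 400 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 400   # arbitrary size, chosen only for illustration

# Draw n_samples indices with replacement: one bootstrap sample
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Points that were never drawn are "out of bag" for this tree
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[np.unique(bootstrap_idx)] = False
oob_fraction = oob_mask.mean()

print("Fraction of points left out of the bag:", oob_fraction)
print("Theoretical limit 1/e:", np.exp(-1))
```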

In [None]:
print("Accuracy score: ",clf.score(features,labels))
print("Features importance: ",clf.feature_importances_)

In [None]:
print("Prediction for P1: ", clf.predict(P1))
print("Prediction for P2: ", clf.predict(P2))

In [None]:
delta = 0.01
x     = np.arange(-2.0, 5.001, delta)
y     = np.arange(-2.0, 5.001, delta)

X, Y = np.meshgrid(x, y)
Z    = clf.predict( np.c_[X.ravel(), Y.ravel()] )
Z    = Z.reshape( X.shape )

plt.contourf( X, Y, Z, cmap=plt.get_cmap('jet'))
plt.xlim(-2, 5)
plt.ylim(-2, 5)

plt.show()

In [None]:
clf = RandomForestClassifier( n_estimators = 50, 
                        max_depth =5, oob_score = True )
clf.fit( features, labels )

In [None]:
print("Accuracy score: ",
          clf.score(features,labels))

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(labels, clf.predict(features))

Because random forests are ensembles of decision trees, there is a way to access the trees and inspect them closely. To do so, use the attribute ```estimators_``` of your model.

In [None]:
clf.estimators_[0]
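
Having access to the individual trees also lets us check how the forest combines them. The sketch below, on hypothetical `make_moons` data, averages the class probabilities of all trees by hand and compares the result with the forest's own `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

# Hypothetical noisy two-class data (not the workshop dataset)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
forest.fit(X_demo, y_demo)

# Stack the per-tree probabilities: shape (n_trees, n_samples, n_classes)
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
avg_proba = per_tree.mean(axis=0)

matches = np.allclose(avg_proba, forest.predict_proba(X_demo))
print("Average of tree probabilities equals forest probabilities:", matches)
```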

Let's check how to visualize trees in the random forest model we just created.

In [None]:
export_graphviz( clf.estimators_[0], 
                'Graph_DecisionTree_testdataset_RF0.dat' )

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
clf_gb = GradientBoostingClassifier(learning_rate=1.5, n_estimators=100)
clf_gb.fit( features, labels )
clf_gb.score(features, labels)
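
Boosting adds trees one at a time, each one correcting the errors of the ensemble so far. One way to watch this, sketched here on hypothetical `make_moons` data with arbitrarily chosen settings, is `staged_predict`, which yields the predictions after each boosting stage.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Hypothetical noisy two-class data (not the workshop dataset)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

# learning_rate and n_estimators are illustrative values only
gb = GradientBoostingClassifier(learning_rate=0.5, n_estimators=50,
                                random_state=0)
gb.fit(X_demo, y_demo)

# Training accuracy after 1, 2, ..., 50 boosting stages
staged_acc = [accuracy_score(y_demo, pred) for pred in gb.staged_predict(X_demo)]

print("After  1 stage: ", staged_acc[0])
print("After 10 stages:", staged_acc[9])
print("After 50 stages:", staged_acc[-1])
```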

In [None]:
clf_gb.feature_importances_

In [None]:
print("Prediction for P1: ", clf_gb.predict(P1))
print("Prediction for P2: ", clf_gb.predict(P2))

In [None]:
delta = 0.01
x     = np.arange(-2.0, 5.001, delta)
y     = np.arange(-2.0, 5.001, delta)

X, Y = np.meshgrid(x, y)
Z    = clf_gb.predict( np.c_[X.ravel(), Y.ravel()] )
Z    = Z.reshape( X.shape )

plt.contourf( X, Y, Z, cmap=plt.get_cmap('jet'))
plt.xlim(-2, 5)
plt.ylim(-2, 5)

plt.xlabel('X1')
plt.ylabel('X2')

plt.show()

# Support Vector Machines

Support Vector Machines (or SVMs) were for a long time the most widely used model in the Machine Learning community.

In [None]:
from sklearn.svm import SVC

In [None]:
clf = SVC()
clf.fit( features, labels )
print( clf.score( features, labels ) )

In [None]:
clf = SVC(gamma=10000.)
clf.fit( features, labels )
print( clf.score( features, labels ) )
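
To interpret `gamma`, recall that `SVC` uses the RBF kernel by default, $K(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$. The sketch below evaluates this formula by hand for two arbitrary points and compares it with scikit-learn's `rbf_kernel`. With a very large `gamma` the kernel is practically zero unless two points almost coincide, so each training point only influences its immediate neighborhood.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) for two points."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])   # squared distance to a is 2

kernel_values = {}
for gamma in (0.01, 1.0, 100.0):
    manual = rbf(a, b, gamma)
    reference = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]
    kernel_values[gamma] = (manual, reference)
    print("gamma =", gamma, "manual:", manual, "scikit-learn:", reference)
```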

Let's write a function that automatically draws the decision boundaries for us (this avoids too much replication of code).

In [None]:
def plotContours(clf, figname, delta = 0.01):
    """Plot the decision regions of clf on [-2,5] x [-2,5] and save them to figname."""
    
    x     = np.arange(-2.0, 5.001, delta)
    y     = np.arange(-2.0, 5.001, delta)

    X, Y = np.meshgrid(x, y)
    Z    = clf.predict( np.c_[X.ravel(), Y.ravel()] )
    Z    = Z.reshape( X.shape )

    plt.contourf( X, Y, Z, cmap=plt.get_cmap('jet'))

    plt.xlabel('X1')
    plt.ylabel('X2')

    plt.savefig(figname)   ## save the figure under the given file name
    plt.show()

In [None]:
clf = SVC()
clf.fit( features, labels )
plotContours(clf, 'SVC_decisionboundary1.png')

In [None]:
clf = SVC( C = 1000., gamma = 100. )
clf.fit( features, labels )
plotContours(clf, 'SVC_decisionboundaryC100g100.png')

In [None]:
clf = SVC( C = 1000 ,gamma = 1)
clf.fit( features, labels )
plotContours(clf, 'SVC_decisionboundary_C1000.png')

In [None]:
clf = SVC( C = 0.02 ,gamma = 1 )
clf.fit( features, labels )
plotContours(clf, 'SVC_decisionboundaryC01.png')

In [None]:
clf = SVC( C = 0.5, gamma = 70. )
clf.fit( features, labels )
plotContours(clf, 'SVC_decisionboundaryC01g100.png')

In [None]:
clf = SVC( C = 0.1, gamma = 1. )
clf.fit( features, labels )
plotContours(clf, 'SVC_decisionboundaryC001g1.png')

In [None]:
clf = SVC( C = 0.1, gamma = 0.1 )
clf.fit( features, labels )
plotContours(clf, 'SVC_decisionboundaryC001g01.png')

In [None]:
clf = SVC( C = 1, gamma = 0.01 )
clf.fit( features, labels )
plotContours(clf, 'SVC_decisionboundary1.png')

# Train-Validation split

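Next, we will set aside part of the training data as a validation set, which we use to compare meta-parameter values without touching the test set.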

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, Y_train, Y_valid = train_test_split( 
        features, labels, test_size=0.33,shuffle=False)

In [None]:
print("Shape of the train dataset: ", X_train.shape)
print("Shape of the validation dataset: ", X_valid.shape)

In [None]:
clf = SVC( C = 0.001, gamma = 10. )
clf.fit( X_train, Y_train )
clf.score(X_valid, Y_valid)

In [None]:
setGammas = np.array( [0.003,0.01,0.03,0.1,0.3,1.0,3.,10.,30.,100.,300] )

accuracies = []
for gamma in setGammas:
    clf = SVC( C = 1., gamma = gamma )
    clf.fit( X_train, Y_train )
    accuracies.append( clf.score(X_valid, Y_valid) )

plt.plot(setGammas, accuracies)

plt.ylabel(r'Accuracy')
plt.xlabel(r'$\gamma$')
plt.xscale('log')

plt.show()

Because the splitting is performed at random, a single split gives a noisy estimate. To properly estimate the accuracy, you should repeat the train-validation splitting several times and average the results.

In [None]:
numRepetitions = 200
setGammas = np.array( [0.003,0.01,0.03,0.1,0.3,1.0,3.,10.,30.,100.,300] )

accuracies = np.zeros( setGammas.shape )

for j in range(numRepetitions):
    ## Draw a new random train-validation split in every repetition
    X_train, X_valid, Y_train, Y_valid = train_test_split( 
            features, labels, test_size=0.33)
    for k, gamma in enumerate(setGammas):
        clf = SVC( C = 1., gamma = gamma )
        accuracies[k] += clf.fit( X_train, Y_train ).score(X_valid, Y_valid)

accuracies = accuracies / numRepetitions
plt.plot(setGammas, accuracies)

plt.ylabel(r'Accuracy')
plt.xlabel(r'$\gamma$')
plt.xscale('log')

plt.show()

Estimating the accuracy

In [None]:
numRepetitions = 100
accuracies = np.zeros( numRepetitions )

for j in range(numRepetitions):
    X_train, X_valid, Y_train, Y_valid = train_test_split( features, labels, 
                                                            test_size=0.33)
    clf = SVC( C = 1., gamma = 0.5 )
    clf.fit( X_train, Y_train )
    accuracies[j] = clf.score(X_valid, Y_valid)

print(r"Average accuracy (gamma = 0.5): ", accuracies.mean() )


for j in range(numRepetitions):
    X_train, X_valid, Y_train, Y_valid = train_test_split( features, labels,
                                                            test_size=0.33)
    clf = SVC( C = 1., gamma = 100. )
    clf.fit( X_train, Y_train )
    accuracies[j] = clf.score(X_valid, Y_valid)

print("Average accuracy (gamma = 100.): ", accuracies.mean() )
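
The repeated random splitting above can also be written more compactly. This is a sketch of one possible alternative, not the approach used in the workshop: `ShuffleSplit` generates the random splits and `cross_val_score` fits and scores the model on each of them. It uses hypothetical `make_moons` data, and also reports the spread of the accuracies, which tells us how much a single split can vary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# Hypothetical noisy two-class data (not the workshop dataset)
X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)

# 100 random train-validation splits, each holding out 33% of the points
cv = ShuffleSplit(n_splits=100, test_size=0.33, random_state=0)
scores = cross_val_score(SVC(C=1.0, gamma=0.5), X_demo, y_demo, cv=cv)

mean_acc = scores.mean()
std_acc = scores.std()
print("Average accuracy:", mean_acc)
print("Standard deviation across splits:", std_acc)
```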

# K-fold cross validation

In [None]:
from sklearn.model_selection import KFold

In [None]:
kf = KFold(n_splits=4)

for train_index, valid_index in kf.split( features ):
    X_train = features[train_index]
    X_valid  = features[valid_index]
    Y_train = labels[train_index]
    Y_valid  = labels[valid_index]
    
    clf = SVC( C = 1., gamma = 0.5 )
    clf.fit( X_train, Y_train )
    print( clf.score(X_valid, Y_valid) )

In [None]:
kf = KFold(n_splits=4)

for train_index, valid_index in kf.split(features):
    X_train = features[train_index]
    X_valid  = features[valid_index]
    Y_train = labels[train_index]
    Y_valid  = labels[valid_index]
    
    clf = SVC( C = 1., gamma = 0.5 )
    clf.fit( X_train, Y_train )
    print( clf.score(X_valid, Y_valid) )
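
To make clear what `KFold` actually returns, here is a small sketch on a hypothetical array with only 10 samples. Each sample lands in the validation fold exactly once, and without shuffling the folds are contiguous blocks of rows.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy array: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=4)
counts = np.zeros(len(X_demo), dtype=int)   # how often each sample is validated
fold_sizes = []

for train_index, valid_index in kf.split(X_demo):
    counts[valid_index] += 1
    fold_sizes.append(len(valid_index))
    print("train:", train_index, "valid:", valid_index)

print("Validation fold sizes:", fold_sizes)
print("Each sample validated exactly once:", bool((counts == 1).all()))
```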

## Assessing Performance

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [None]:
clf = RandomForestClassifier( n_estimators = 50 )
clf.fit( features, labels )

print("Accuracy: ", accuracy_score(labels_test, clf.predict(features_test)))
print("Precision: ", precision_score(labels_test, clf.predict(features_test)))
print("Recall: ", recall_score(labels_test, clf.predict(features_test)))
print("F1-score: ", f1_score(labels_test, clf.predict(features_test)))

In [None]:
clf = SVC( C = 1., gamma = 1.)
clf.fit( features, labels )        
        
print("Accuracy: ", accuracy_score(labels_test, clf.predict(features_test)))
print("Precision: ", precision_score(labels_test, clf.predict(features_test)))
print("Recall: ", recall_score(labels_test, clf.predict(features_test)))
print("F1-score: ", f1_score(labels_test, clf.predict(features_test)))
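
These scores are all derived from the counts of true/false positives and negatives. The sketch below uses small hand-made label arrays (hypothetical, chosen only so the counts are easy to verify) and recomputes precision, recall, and F1 from the confusion matrix, then compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Hand-made labels, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() gives the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall    = tp / (tp + fn)   # of the actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision:", precision, "vs", precision_score(y_true, y_pred))
print("Recall:   ", recall, "vs", recall_score(y_true, y_pred))
print("F1-score: ", f1, "vs", f1_score(y_true, y_pred))
```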

<br />
<br />

# After-the-class Practicing

<br />

Next, I show possible solutions to the proposed practice at the end of our slides. It is highly recommended that you try the exercises by yourself first.

### Wisconsin Breast Cancer dataset

As an exercise, let's try and reproduce the same analysis in the same dataset we explored yesterday!

In [None]:
from sklearn.datasets import load_breast_cancer
bcancer = load_breast_cancer()

print("Num samples x Num Features: ", bcancer.data.shape)
print("Num samples: ", bcancer.target.shape)

In [None]:
clf_bcancer = RandomForestClassifier()
clf_bcancer.fit( bcancer.data, bcancer.target )

In [None]:
clf_bcancer.score( bcancer.data, bcancer.target )

In [None]:
clf_bcancer.feature_importances_

### Screening the $\gamma$ and $C$ meta-parameters

In [None]:
setGammas = np.linspace(0.005,10.0,50)

accuracies = []
for gamma in setGammas:
    clf = SVC( C = 0.001, gamma = gamma )
    accuracies.append( clf.fit( features, labels ).score(features, labels) )

plt.plot(setGammas, accuracies)

plt.ylabel(r'Accuracy')
plt.xlabel(r'$\gamma$')
plt.savefig('SVM_accuracyvsgamma.png', dpi=500)
plt.show()
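
The scan above varies $\gamma$ at a fixed $C$ and scores on the training data. One possible way to screen both meta-parameters at once, using cross-validated accuracy instead of training accuracy, is `GridSearchCV`. This is a sketch under assumptions: the grid values are arbitrary and the data is a hypothetical `make_moons` sample.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical noisy two-class data (not the workshop dataset)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

# Illustrative grid; in practice a logarithmic grid with more values is common
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.1, 1.0, 10.0]}

# Every (C, gamma) pair is scored with 4-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=4)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print("Best meta-parameters:", best_params)
print("Cross-validated accuracy:", best_score)
```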

## Banana dataset

In [None]:
bdataset = np.loadtxt('banana_dataset.csv', delimiter=',')
print("Shape of the bdataset: ", bdataset.shape )

In [None]:
bfeat = bdataset[:,1:]
blabl = bdataset[:,0]

In [None]:
clf = SVC( C = 0.001, gamma = 10. )   ## set gamma explicitly instead of reusing the last loop value
clf.fit( bfeat, blabl )
clf.score(bfeat, blabl)

In [None]:
setGammas = np.linspace(0.005,10.0,50)

accuracies = []
for gamma in setGammas:
    clf = SVC( C = 10., gamma = gamma )
    accuracies.append( clf.fit( bfeat, blabl ).score(bfeat, blabl) )

plt.plot(setGammas, accuracies)

plt.ylabel(r'Accuracy')
plt.xlabel(r'$\gamma$')
plt.savefig('SVM_accuracyvsgamma.png', dpi=500)
plt.show()

In [None]:
clf = SVC( C = 10., gamma = 10 )
clf.fit( bfeat, blabl )
plotContours(clf, 'SVC_decbound_bdataset1.png')

In [None]:
clf = SVC( C = 10., gamma = 1 )
clf.fit( bfeat, blabl )
plotContours(clf, 'SVC_decbound_bdataset2.png')