# Basic Classification with Scikit-Learn

## Load Data and Set Up Training and Testing Data

Import the Pandas and NumPy libraries.

In [1]:
# Import the Pandas and NumPy Python libraries
import pandas as pd
import numpy as np

Read the Iris dataset into a Pandas data frame

In [3]:
iris = pd.read_csv('datasets/iris.csv')

View the structure of the Iris data frame

In [4]:
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


Use the hold out method to create training data (70% random sample) and testing data (30% random sample)

In [5]:
train=iris.sample(frac=0.7,random_state=1234)
test=iris.drop(train.index)
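The hold-out logic above can be checked on a small made-up frame. This is only a sketch: `toy`, `toy_train`, and `toy_test` are hypothetical names, not part of the Iris notebook. It shows that sampling 70% of the rows and dropping the sampled index leaves two disjoint pieces that together cover every row.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for the Iris data
toy = pd.DataFrame({"x": range(10), "label": list("aabbccddee")})

# 70% random sample for training; the fixed seed makes the split repeatable
toy_train = toy.sample(frac=0.7, random_state=1234)
# Every row whose index was not sampled becomes the test set
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))
```

Applied to the 150-row Iris frame, the same split yields 105 training rows and 45 testing rows, which is the support total shown in the classification reports later in this notebook.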

Separate the observations from the class/target variable in both the training and testing data.  Selecting columns with a list of names returns a two-dimensional array of shape (n, 1), so use the ravel() function to flatten the class variable into a 1D array.  This is necessary for some of the methods used to classify and assess accuracy.

In [6]:
obs = ['sepal_length', 'sepal_width', 'petal_length','petal_width']
cls = ['class']
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
trainObs = train[obs].to_numpy()
trainCls = train[cls].to_numpy().ravel()
testObs = test[obs].to_numpy()
testCls = test[cls].to_numpy().ravel()
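To see why ravel() is needed, the following sketch compares the array shapes on a tiny made-up frame (the `toy` frame and its values are hypothetical). Selecting with a list of column names keeps the column dimension, while selecting with a single name does not.

```python
import pandas as pd

# Hypothetical three-row frame
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "class": ["x", "y", "x"]})

col_2d = toy[["class"]].to_numpy()   # list of names -> 2D array, shape (3, 1)
col_1d = col_2d.ravel()              # flattened -> 1D array, shape (3,)
direct = toy["class"].to_numpy()     # single name -> already 1D, shape (3,)

print(col_2d.shape, col_1d.shape, direct.shape)
```

Either route gives Scikit-Learn the 1D target array it expects; the notebook uses the list-plus-ravel() form.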

## K Nearest Neighbor Classification

Set up a K Nearest Neighbor classifier with the number of neighbors = 7 and votes weighted by the inverse of each neighbor's distance (weights='distance').  The default distance metric is Euclidean.

In [50]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7, weights='distance')

Fit the K Nearest Neighbor classifier to the training data and use the resulting classifier to predict the class values for the test dataset

In [51]:
knn.fit(trainObs, trainCls)
knn_pred = knn.predict(testObs)

Calculate the accuracy of the classifier.

In [52]:
from __future__ import division
(sum(testCls==knn_pred))/len(knn_pred)

0.9777777777777777
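The manual accuracy calculation is the fraction of predictions that match the true labels. As a small sketch with made-up labels (`y_true` and `y_hat` are hypothetical), the same number comes out of NumPy's mean and Scikit-Learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels: 4 of 5 match
y_true = np.array(["a", "b", "b", "c", "c"])
y_hat = np.array(["a", "b", "c", "c", "c"])

manual = sum(y_true == y_hat) / len(y_hat)   # same form as the cell above
mean_based = np.mean(y_true == y_hat)        # mean of a boolean array
library = accuracy_score(y_true, y_hat)      # Scikit-Learn helper

print(manual, mean_based, library)
```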

Create a confusion matrix using Scikit-Learn confusion_matrix

In [53]:
from sklearn.metrics import confusion_matrix
knn_tab = confusion_matrix(testCls, knn_pred, labels=(['Iris-setosa','Iris-virginica','Iris-versicolor']))
knn_tab

array([[15,  0,  0],
       [ 0, 17,  1],
       [ 0,  0, 12]])
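In a Scikit-Learn confusion matrix, rows are the true classes and columns are the predicted classes, in the order given by `labels` (sorted order if `labels` is omitted). The following sketch uses made-up labels to show how the `labels` argument rearranges the same counts:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_hat = ["cat", "dog", "dog", "dog", "cat"]

# Default: labels in sorted order (cat, dog)
cm_sorted = confusion_matrix(y_true, y_hat)
# Custom order: dog first, then cat
cm_custom = confusion_matrix(y_true, y_hat, labels=["dog", "cat"])

print(cm_sorted)
print(cm_custom)
```

The matrices in this notebook list the classes as setosa, virginica, versicolor, so the second row is Iris-virginica rather than Iris-versicolor.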

Create a classification report for the result including precision, recall, and f measure.

In [19]:
from sklearn import metrics
print(metrics.classification_report(testCls, knn_pred))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       0.86      1.00      0.92        12
 Iris-virginica       1.00      0.89      0.94        18

       accuracy                           0.96        45
      macro avg       0.95      0.96      0.95        45
   weighted avg       0.96      0.96      0.96        45



Exercise 1: Now go back and experiment with different values of k.  What happened?

Different values of k affected the accuracy of the model.  When k=7, a higher accuracy was achieved.  When k was set too high, the accuracy decreased again.
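One way to run this experiment is a loop over several values of k. The sketch below is an illustration only: it assumes Scikit-Learn's bundled copy of the Iris data (`load_iris`) and `train_test_split` instead of the `datasets/iris.csv` file and hold-out sample used above, so its accuracies will not necessarily match the notebook's.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled Iris data and a 70/30 split, not the CSV split used above
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=1234)

k_scores = {}
for k in (1, 3, 5, 7, 15, 51, 75):
    model = KNeighborsClassifier(n_neighbors=k, weights="distance")
    model.fit(X_tr, y_tr)
    # score() returns the accuracy on the held-out data
    k_scores[k] = model.score(X_te, y_te)

for k, acc in k_scores.items():
    print(f"k={k:3d}  accuracy={acc:.3f}")
```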

## Decision Tree Classification

Create a decision tree classifier and fit it to the training dataset.  This Scikit-Learn decision tree is based on the CART algorithm.  The default parameters use the Gini index as the metric for finding the best attribute split.

In [54]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(trainObs, trainCls)

Export the resulting tree in Graphviz DOT format.  You can open the resulting file "tree.dot" with the graphviz Python library or paste its contents into the online viewer at http://www.webgraphviz.com/

In [55]:
# class_names must follow the sorted order of clf.classes_
tree.export_graphviz(clf, out_file='tree.dot',
                     feature_names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
                     class_names=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
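If Graphviz is not available, `sklearn.tree.export_text` prints the same kind of tree as indented plain-text rules. The sketch below is an assumption-laden illustration: it fits a separate, shallow tree (`small_tree`, limited to depth 2) on Scikit-Learn's bundled Iris data rather than on the notebook's `trainObs`. It also prints `classes_`, the sorted class labels, which is the order `export_graphviz` expects for `class_names`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: bundled Iris data and a depth-2 tree, for a short printout
iris_data = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris_data.data, iris_data.target)

# Plain-text rendering of the fitted tree's split rules
rules = export_text(small_tree, feature_names=list(iris_data.feature_names))
print(rules)

# Sorted class labels, in the order used for class_names
print(small_tree.classes_)
```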

In [56]:
dt_pred = clf.predict(testObs)

Exercise 2: Use the evaluation metric code from the KNN example to assess the quality of your decision tree classifier.  Did you find any differences?

In [65]:
from __future__ import division
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Determine the accuracy of the model
print("Model Accuracy\n", sum(testCls==dt_pred)/len(dt_pred))

# Print a confusion matrix
dt_tab = confusion_matrix(testCls, dt_pred, labels=(['Iris-setosa','Iris-virginica','Iris-versicolor']))
print("Confusion Matrix\n", dt_tab)

# Print a classification report
print(metrics.classification_report(testCls, dt_pred))

Model Accuracy
 0.9555555555555556
Confusion Matrix
 [[15  0  0]
 [ 0 16  2]
 [ 0  0 12]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       0.86      1.00      0.92        12
 Iris-virginica       1.00      0.89      0.94        18

       accuracy                           0.96        45
      macro avg       0.95      0.96      0.95        45
   weighted avg       0.96      0.96      0.96        45
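The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The following sketch copies that matrix (rows are true classes, columns are predicted classes, in the order setosa, virginica, versicolor) and derives precision, recall, F1, and accuracy with NumPy:

```python
import numpy as np

# Decision tree confusion matrix from the output above
labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor"]
cm = np.array([[15, 0, 0],
               [0, 16, 2],
               [0, 0, 12]])

recall = np.diag(cm) / cm.sum(axis=1)      # correct / actual members of the class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / items predicted as the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()         # 43 correct out of 45

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:16s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For example, two of the 18 virginica flowers were predicted as versicolor, so virginica recall is 16/18 and versicolor precision is 12/14.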



You can also use entropy as the split criterion by including the parameter criterion="entropy".

In [64]:
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(trainObs, trainCls)
dt_pred = clf.predict(testObs)
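Both criteria measure how mixed the classes are in a node. As a rough sketch (the helper functions `gini` and `entropy` are written here for illustration and are not Scikit-Learn functions), Gini impurity is 1 minus the sum of squared class proportions, and entropy is the sum of p * log2(1/p) over the classes:

```python
import numpy as np

def gini(p):
    """Gini impurity of a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Shannon entropy (base 2) of a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # skip empty classes (0 * log 0 is taken as 0)
    return float(np.sum(p * np.log2(1.0 / p)))

pure = [1.0, 0.0, 0.0]                 # node containing a single class
even_two = [0.5, 0.5]                  # two classes, evenly mixed
even_three = [1 / 3, 1 / 3, 1 / 3]     # three classes, evenly mixed

for name, p in [("pure", pure), ("even_two", even_two), ("even_three", even_three)]:
    print(f"{name:10s} gini={gini(p):.3f} entropy={entropy(p):.3f}")
```

Both measures are 0 for a pure node and largest for an even mix, which is why the two criteria often choose similar splits.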

## Random Forest Classifier

Set up a random forest classifier.  The number of estimators (trees) is set to 10,000 here, the largest value tried in Exercise 3 below.

In [75]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10000)
clf = clf.fit(trainObs, trainCls)

In [76]:
rf_pred = clf.predict(testObs)

Exercise 3: Assess the quality of your random forest classifier.  What did you find?  Now change the n_estimators parameter to 100, 1000, and 10,000.  What happened?

In [77]:
from __future__ import division
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Determine the accuracy of the model
print("Model Accuracy\n", sum(testCls==rf_pred)/len(rf_pred))

# Print a confusion matrix
rf_tab = confusion_matrix(testCls, rf_pred, labels=(['Iris-setosa','Iris-virginica','Iris-versicolor']))
print("Confusion Matrix\n", rf_tab)

# Print a classification report
print(metrics.classification_report(testCls, rf_pred))

Model Accuracy
 0.9555555555555556
Confusion Matrix
 [[15  0  0]
 [ 0 16  2]
 [ 0  0 12]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       0.86      1.00      0.92        12
 Iris-virginica       1.00      0.89      0.94        18

       accuracy                           0.96        45
      macro avg       0.95      0.96      0.95        45
   weighted avg       0.96      0.96      0.96        45
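The comparison asked for in Exercise 3 can be scripted as a loop over forest sizes. This is a sketch under stated assumptions: it uses Scikit-Learn's bundled Iris data and `train_test_split` rather than the notebook's CSV split, fixes `random_state` so repeated runs agree, and stops at 1,000 trees to keep the run short.

```python
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data and a 70/30 split, not the CSV split used above
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=1234)

rf_scores = {}
for n in (10, 100, 1000):
    start = time.perf_counter()
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X_tr, y_tr)
    rf_scores[n] = forest.score(X_te, y_te)   # accuracy on held-out data
    elapsed = time.perf_counter() - start
    print(f"n_estimators={n:5d}  accuracy={rf_scores[n]:.3f}  seconds={elapsed:.2f}")
```

Printing the elapsed time alongside the accuracy makes the cost of each additional batch of trees visible.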



## Naive Bayes Classifier

Create a Naive Bayes classifier

In [81]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb = gnb.fit(trainObs, trainCls)

In [82]:
nb_pred = gnb.predict(testObs)

In [84]:
from __future__ import division
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Determine the accuracy of the model
print("Model Accuracy\n", sum(testCls==nb_pred)/len(nb_pred))

# Print a confusion matrix
nb_tab = confusion_matrix(testCls, nb_pred, labels=(['Iris-setosa','Iris-virginica','Iris-versicolor']))
print("Confusion Matrix\n", nb_tab)

# Print a classification report
print(metrics.classification_report(testCls, nb_pred))

Model Accuracy
 0.9555555555555556
Confusion Matrix
 [[15  0  0]
 [ 0 16  2]
 [ 0  0 12]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       0.86      1.00      0.92        12
 Iris-virginica       1.00      0.89      0.94        18

       accuracy                           0.96        45
      macro avg       0.95      0.96      0.95        45
   weighted avg       0.96      0.96      0.96        45



# Classification - More Models and Ideas

## Load Data and Set Up Training and Testing Data

Read the breast cancer dataset from Scikit-Learn's datasets module

In [102]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

Set X = the attributes and y = the target variable

In [86]:
X = cancer['data']
y = cancer['target']

Use train_test_split to split the data into training and testing sets

In [89]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
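Called with no extra arguments, `train_test_split` holds out 25% of the rows and shuffles with a different random seed on every run, so results change between runs. The sketch below shows two optional arguments that make the split repeatable and keep the class proportions similar in both parts; the variable names are hypothetical and the seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# random_state fixes the shuffle; stratify keeps class proportions similar
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_bc, y_bc, random_state=0, stratify=y_bc)
# Same arguments again: the split is identical
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_bc, y_bc, random_state=0, stratify=y_bc)

same_split = np.array_equal(Xa_test, Xb_test)
print(len(Xa_train), len(Xa_test), same_split)
```

With 569 rows, the default 25% test size gives 143 test rows, which matches the support total in the logistic regression report below.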


Standardize the data so that each attribute has a mean of 0 and a standard deviation of 1 (this puts the attributes on a comparable scale, so none dominates because of its units)

In [93]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only to the training data
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
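StandardScaler centers each column on 0 with unit standard deviation; it does not squeeze values into a fixed range. As a small sketch on made-up numbers (the `demo` array is hypothetical), compare it with MinMaxScaler, which is the scaler that maps each column onto the range 0 to 1:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two columns on very different scales
demo = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0],
                 [4.0, 400.0]])

standardized = StandardScaler().fit_transform(demo)   # mean 0, std 1 per column
minmaxed = MinMaxScaler().fit_transform(demo)         # min 0, max 1 per column

print(standardized.mean(axis=0), standardized.std(axis=0))
print(minmaxed.min(axis=0), minmaxed.max(axis=0))
```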

## Logistic Regression

Set up the logistic regression classifier with a large C (C=1e5, which means very weak regularization) and a limit of 10,000 solver iterations

In [96]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression(C=1e5, max_iter=10000)

Fit the logistic regression model to the training data and use the resulting classifier to predict the class values for the test dataset

In [97]:
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)
logreg

Calculate the accuracy of the classifier.

In [98]:
from __future__ import division
(sum(y_test==logreg_pred))/len(logreg_pred)

0.9370629370629371

Create a confusion matrix using Scikit-Learn confusion_matrix

In [99]:
from sklearn.metrics import confusion_matrix
logreg_tab = confusion_matrix(y_test, logreg_pred)
logreg_tab

array([[53,  3],
       [ 6, 81]])

Create a classification report for the result including precision, recall, and f measure.

In [100]:
from sklearn import metrics
print(metrics.classification_report(y_test, logreg_pred))

              precision    recall  f1-score   support

           0       0.90      0.95      0.92        56
           1       0.96      0.93      0.95        87

    accuracy                           0.94       143
   macro avg       0.93      0.94      0.93       143
weighted avg       0.94      0.94      0.94       143



## Neural Networks

Create a multilayer perceptron classifier and fit it to the training dataset.  The classifier will use the stochastic gradient descent solver (solver='sgd') and three hidden layers of 30 units each.

In [106]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(solver='sgd', hidden_layer_sizes=(30,30,30), max_iter=10000)
mlp.fit(X_train,y_train)

Use the neural network to predict the test set and calculate the accuracy.

In [107]:
from __future__ import division
mlp_pred = mlp.predict(X_test)
(sum(y_test==mlp_pred))/len(mlp_pred)

0.965034965034965

In [None]:
print(confusion_matrix(y_test,mlp_pred))

Now try a different solver.  Did you get different results?

In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(30,30,30))
mlp.fit(X_train,y_train)

## Support Vector Machines

Set up an SVM classifier using the radial basis function kernel

In [None]:
from sklearn import svm
svm_clf = svm.SVC(kernel="rbf")
svm_clf.fit(X_train,y_train)

In [None]:
svm_pred = svm_clf.predict(X_test)
print(confusion_matrix(y_test,svm_pred))

## Stacked Ensemble Methods

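Bagging: train an ensemble of Gaussian Naive Bayes classifiers, each fit on a random sample half the size of the training set and a random half of the features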

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
bagging = BaggingClassifier(GaussianNB(), max_samples=0.5, max_features=0.5)
bagging.fit(X_train,y_train)

In [None]:
bag_pred = bagging.predict(X_test)
print(confusion_matrix(y_test,bag_pred))

Use cross validation to find the overall accuracy of the bagged classifier

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(bagging, X, y)
scores.mean()

Boosting: train an AdaBoost ensemble of 100 estimators (by default, decision trees of depth 1)

In [None]:
from sklearn.ensemble import AdaBoostClassifier
boosting = AdaBoostClassifier(n_estimators=100)
boosting.fit(X_train, y_train)
boost_pred = boosting.predict(X_test)
print(confusion_matrix(y_test, boost_pred))                     

Use cross validation to find the overall accuracy of the boosted classifier

In [None]:
scores = cross_val_score(boosting, X, y)
scores.mean()

## Cross Validation Sampling

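Run 10-fold cross validation by hand: fit an SVM on each training fold and print its accuracy on the held-out fold.  Note that the loop indexes the unscaled X and overwrites X_train, X_test, y_train, and y_test from the earlier split.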

In [None]:
from sklearn.model_selection import KFold
from sklearn import svm
svm_clf = svm.SVC(kernel="rbf")
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    svm_clf.fit(X_train,y_train)
    svm_pred = svm_clf.predict(X_test)
    print((sum(y_test==svm_pred))/len(svm_pred))
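The loop above fits the SVM on unscaled attributes. One possible variation, sketched here as an assumption rather than as the notebook's method, wraps StandardScaler and the SVM in a pipeline so that the scaler is refit on the training part of every fold, and lets `cross_val_score` run the folds. The shuffling and the seed value are choices made for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Scaler and classifier are fit together on each training fold only
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# Shuffled 10-fold split; seed chosen arbitrarily for repeatability
cv = KFold(n_splits=10, shuffle=True, random_state=0)

fold_scores = cross_val_score(pipe, X_bc, y_bc, cv=cv)
print(fold_scores)
print(fold_scores.mean())
```

Keeping the scaler inside the pipeline means the held-out fold never influences the scaling statistics.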

Exercise 4: Assess the quality of your Naive Bayes classifier.  How does it compare?