# Basic Classification with Scikit-Learn

## Load Data and Set Up Training and Testing Data

Import the Pandas and NumPy libraries

In [None]:
# Import the Pandas and NumPy Python libraries
import pandas as pd
import numpy as np

Read the Iris dataset into a Pandas data frame

In [None]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
iris = pd.read_csv(r'C:\Teaching\COSC670\Labs\iris.csv')

View the structure of the Iris data frame

In [None]:
iris

Use the hold out method to create training data (70% random sample) and testing data (30% random sample)

In [None]:
train=iris.sample(frac=0.7,random_state=1234)
test=iris.drop(train.index)
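
A quick sanity check on the holdout split can be sketched as follows. The `demo` data frame and its column names are hypothetical stand-ins for the Iris data; the sketch only illustrates that `sample` followed by `drop` should give two disjoint sets that together cover every row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Iris data frame: 150 rows, default integer index
demo = pd.DataFrame({'x': np.arange(150),
                     'label': np.repeat(['a', 'b', 'c'], 50)})

demo_train = demo.sample(frac=0.7, random_state=1234)  # 70% random sample
demo_test = demo.drop(demo_train.index)                # the remaining 30%

# The two sets should not share any row
overlap = demo_train.index.intersection(demo_test.index)
print(len(demo_train), len(demo_test), len(overlap))

# A purely random sample does not guarantee equal class proportions
print(demo_train['label'].value_counts())
```

If the class counts in the training sample look very uneven, a stratified split may be worth considering.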

Separate the observations from the class/target variable in both the training and testing data.  Use the ravel() function to flatten the single-column 2D array for the class variable into a 1D array.  This is necessary for some of the methods used to classify and assess accuracy.

In [None]:
obs = ['sepal_length', 'sepal_width', 'petal_length','petal_width']
cls = ['class']
# as_matrix() was removed from Pandas; select the columns and convert with to_numpy()
trainObs = train[obs].to_numpy()
trainCls = train[cls].to_numpy().ravel()
testObs = test[obs].to_numpy()
testCls = test[cls].to_numpy().ravel()

## K Nearest Neighbor Classification

Set up a K Nearest Neighbor classifier with the number of neighbors = 3 and votes weighted by the inverse of each neighbor's distance (Euclidean distance by default)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
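
To see what the `weights` parameter changes, here is a minimal sketch on a made-up one-feature dataset (`X_toy`, `y_toy`, and `query` are hypothetical). With uniform weights the three neighbors vote equally; with distance weights a single close neighbor can outvote two distant ones.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training set: one 'A' point near the query, two 'B' points farther away
X_toy = np.array([[0.0], [2.0], [2.2]])
y_toy = np.array(['A', 'B', 'B'])
query = np.array([[0.5]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# Uniform voting: 'B' wins 2 votes to 1.
# Distance weighting: 'A' gets 1/0.5 = 2.0, 'B' gets 1/1.5 + 1/1.7 (about 1.25).
uniform_label = uniform_knn.predict(query)[0]
distance_label = distance_knn.predict(query)[0]
print(uniform_label, distance_label)
```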

Fit the K Nearest Neighbor classifier to the training data and use the resulting classifier to predict the class values for the test dataset

In [None]:
knn.fit(trainObs, trainCls)
knn_pred = knn.predict(testObs)

Calculate the accuracy of the classifier.

In [None]:
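# Only needed on Python 2; true division is already the default on Python 3
from __future__ import division
# Fraction of test observations whose predicted class matches the true class
(sum(testCls==knn_pred))/len(knn_pred)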
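
The same accuracy figure can be obtained in a couple of equivalent ways. The sketch below uses made-up label arrays (`y_true` and `y_hat` are hypothetical) and assumes Python 3 division.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, for illustration only
y_true = np.array(['a', 'b', 'b', 'c', 'c', 'c'])
y_hat = np.array(['a', 'b', 'c', 'c', 'c', 'b'])

manual = np.sum(y_true == y_hat) / len(y_hat)  # count of matches over total
via_mean = np.mean(y_true == y_hat)            # mean of a boolean array
via_sklearn = accuracy_score(y_true, y_hat)    # Scikit-Learn helper

print(manual, via_mean, via_sklearn)
```

All three values agree: four of the six predictions match.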

Create a confusion matrix using Scikit-Learn confusion_matrix

In [None]:
from sklearn.metrics import confusion_matrix
knn_tab = confusion_matrix(testCls, knn_pred, labels=(['Iris-setosa','Iris-virginica','Iris-versicolor']))
knn_tab
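
Reading a confusion matrix depends on the order given in `labels`. A small sketch with made-up labels (all names here are hypothetical): rows correspond to the true class and columns to the predicted class, both in the order passed to `labels`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = ['cat', 'cat', 'dog', 'dog', 'dog', 'bird']
y_hat = ['cat', 'dog', 'dog', 'dog', 'bird', 'bird']
order = ['cat', 'dog', 'bird']

cm = confusion_matrix(y_true, y_hat, labels=order)
print(cm)

# Correct predictions sit on the diagonal
correct = cm.trace()
total = cm.sum()
print(correct, total)
```

Here the first row says that one true `cat` was predicted as `cat` and one as `dog`.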

Create a classification report for the result, including precision, recall, and F1 score.

In [None]:
from sklearn import metrics
print(metrics.classification_report(testCls, knn_pred))
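
The numbers in the report can be reproduced by hand. The sketch below works through a made-up binary example (the labels are hypothetical, and Python 3 division is assumed) and compares the hand calculation with the Scikit-Learn helpers.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary labels; 1 is the positive class
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 1, 0, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```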

Exercise 1: Now go back and experiment with different values of k.  What happened?

## Decision Tree Classification

Create a decision tree classifier and fit it to the training dataset.  This Scikit-Learn decision tree is based on the CART algorithm.  The default parameters use the Gini index as the metric for finding the best attribute split.

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(trainObs, trainCls)

Export the resulting tree in Graphviz format.  You can open the resulting file "tree.dot" with the graphviz Python library or paste its contents into the WebGraphviz site located at: http://www.webgraphviz.com/

In [None]:
# class_names must follow the sorted order of the class labels (see clf.classes_)
tree.export_graphviz(clf, out_file='tree.dot', feature_names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
                         class_names=['Iris-setosa','Iris-versicolor','Iris-virginica'])
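
If Graphviz is not at hand, a fitted tree can also be inspected as plain text. This is a sketch under a few assumptions: it uses Scikit-Learn's bundled copy of Iris rather than the lab's CSV file, it assumes scikit-learn 0.21 or later (where `tree.export_text` is available), and the names `iris_demo`, `labels`, and `demo_tree` are hypothetical. It also prints `classes_`, which shows the order in which class names should be passed to the export functions.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris_demo = load_iris()
# Map the integer targets to species names so the classifier sees string labels
labels = iris_demo.target_names[iris_demo.target]

demo_tree = tree.DecisionTreeClassifier(random_state=0).fit(iris_demo.data, labels)

# classes_ holds the labels in sorted order
class_order = list(demo_tree.classes_)
print(class_order)

# Plain-text rendering of the split rules
rules = tree.export_text(demo_tree, feature_names=list(iris_demo.feature_names))
print(rules)
```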

In [None]:
dt_pred = clf.predict(testObs)

Exercise 2: Use the evaluation metric code from the KNN example to assess the quality of your decision tree classifier.  Did you find any differences?

You can also use a measure of entropy as the split criterion by including the parameter criterion="entropy".

In [None]:
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(trainObs, trainCls)
dt_pred = clf.predict(testObs)
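
Both criteria measure how mixed the classes are within a node. As a rough illustration (the helper functions `gini` and `entropy` below are hypothetical, not part of Scikit-Learn), the two impurity measures can be computed directly from class counts:

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # skip zero proportions to avoid log2(0)
    return -np.sum(p * np.log2(p))

mixed = [50, 50, 50]  # three equally represented classes
pure = [50, 0, 0]     # a node containing a single class

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are largest for the evenly mixed node and fall to zero for the pure node, which is why either can be used to rank candidate splits.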

## Random Forest Classifier

Set up a random forest classifier with the number of estimators (trees) = 10

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(trainObs, trainCls)

In [None]:
rf_pred = clf.predict(testObs)
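
Because a random forest is built from random samples, repeated runs can give slightly different results. The sketch below assumes Scikit-Learn's bundled Iris data (not the lab's CSV file) and hypothetical variable names; it shows that fixing `random_state` makes the fit repeatable, and it reads the per-feature importances from the fitted forest.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

# Two forests with the same seed are built from the same random draws
forest_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
forest_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
same_predictions = np.array_equal(forest_a.predict(X_demo), forest_b.predict(X_demo))
print(same_predictions)

# Relative contribution of each feature to the forest's splits
importances = forest_a.feature_importances_
for name, value in zip(iris_demo.feature_names, importances):
    print(name, round(value, 3))
```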

Exercise 3: Assess the quality of your random forest classifier.  What did you find?  Now change the n_estimators parameter to 100, 1000, and 10,000.  What happened?

## Naive Bayes Classifier

Create a Naive Bayes classifier

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb = gnb.fit(trainObs, trainCls)

In [None]:
nb_pred = gnb.predict(testObs)

# Classification - More Models and Ideas

## Load Data and Set Up Training and Testing Data

Read the breast cancer dataset from Scikit-Learn's datasets module

In [None]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

Set X = the attributes and y = the target variable

In [None]:
X = cancer['data']
y = cancer['target']

Use train_test_split to split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
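
Called with no extra arguments, `train_test_split` draws a different random split on every run. A sketch of the commonly used options follows; the values chosen for `test_size` and `random_state` are arbitrary, and the variable names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# test_size sets the held-out fraction, random_state makes the split repeatable,
# and stratify keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(y_demo.mean(), y_tr.mean(), y_te.mean())  # share of class 1 in each part
```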

Standardize the data so that each attribute has zero mean and unit variance (this puts the attributes on a comparable scale)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only to the training data
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
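
The difference between standardizing and rescaling to a fixed range can be checked on synthetic data. In this sketch, `X_demo` is a made-up two-column array; `StandardScaler` centers each column at zero with unit standard deviation, while `MinMaxScaler` maps each column onto the interval from 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two attributes on very different scales
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=[10, 200], scale=[2, 50], size=(100, 2))

standardized = StandardScaler().fit_transform(X_demo)
ranged = MinMaxScaler().fit_transform(X_demo)

print(standardized.mean(axis=0).round(6), standardized.std(axis=0).round(6))
print(ranged.min(axis=0), ranged.max(axis=0))
```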

## Logistic Regression

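Set up the logistic regression classifier with C=1e5.  C is the inverse of the regularization strength, so a large value means very weak regularization.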

In [None]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression(C=1e5)
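
The effect of `C` can be seen in the size of the fitted coefficients. This is an illustrative sketch only: it standardizes the whole dataset for convenience, the `C` values and `max_iter` setting are arbitrary choices, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

# Larger C means a weaker penalty, which lets the coefficients grow
coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    coef_norms[C] = np.linalg.norm(model.coef_)

for C, norm in coef_norms.items():
    print(C, round(norm, 3))
```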

Fit the logistic regression model to the training data and use the resulting classifier to predict the class values for the test dataset

In [None]:
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)
logreg

Calculate the accuracy of the classifier.

In [None]:
from __future__ import division
(sum(y_test==logreg_pred))/len(logreg_pred)

Create a confusion matrix using Scikit-Learn confusion_matrix

In [None]:
from sklearn.metrics import confusion_matrix
logreg_tab = confusion_matrix(y_test, logreg_pred)
logreg_tab

Create a classification report for the result, including precision, recall, and F1 score.

In [None]:
from sklearn import metrics
print(metrics.classification_report(y_test, logreg_pred))

## Neural Networks

Create a multilayer perceptron classifier and fit it to the training dataset.  The classifier will use the stochastic gradient descent (sgd) solver and three hidden layers of 30 units each.

In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(solver='sgd', hidden_layer_sizes=(30,30,30))
mlp.fit(X_train,y_train)

Use the neural network to predict the test set and calculate the accuracy.

In [None]:
# A __future__ import must be the first statement in the cell
from __future__ import division
mlp_pred = mlp.predict(X_test)
(sum(y_test==mlp_pred))/len(mlp_pred)

In [None]:
print(confusion_matrix(y_test,mlp_pred))

Now try a different solver.  Did you get different results?

In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(30,30,30))
mlp.fit(X_train,y_train)

## Support Vector Machines

Set up an SVM classifier using the radial basis function kernel

In [None]:
from sklearn import svm
svm_clf = svm.SVC(kernel="rbf")
svm_clf.fit(X_train,y_train)

In [None]:
svm_pred = svm_clf.predict(X_test)
print(confusion_matrix(y_test,svm_pred))

## Ensemble Methods

Bagging

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
bagging = BaggingClassifier(GaussianNB(), max_samples=0.5, max_features=0.5)
bagging.fit(X_train,y_train)

In [None]:
bag_pred = bagging.predict(X_test)
print(confusion_matrix(y_test,bag_pred))

Use cross validation to find the overall accuracy of the bagged classifier

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(bagging, X, y)
scores.mean()
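
When no `cv` argument is given, the number of folds is whatever default the installed Scikit-Learn version uses, so it can help to set it explicitly. The sketch below is one possible setup, not the lab's method: it uses a plain Gaussian Naive Bayes model, 10 folds, and a pipeline so that the scaler is refit on the training part of every fold. All variable names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling happens inside each fold, so the held-out part never influences the scaler
model = make_pipeline(StandardScaler(), GaussianNB())
fold_scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring='accuracy')

print(len(fold_scores))
print(fold_scores.mean(), fold_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the accuracy varies from fold to fold.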

Boosting

In [None]:
from sklearn.ensemble import AdaBoostClassifier
boosting = AdaBoostClassifier(n_estimators=100)
boosting.fit(X_train, y_train)
boost_pred = boosting.predict(X_test)
print(confusion_matrix(y_test, boost_pred))                     

Use cross validation to find the overall accuracy of the boosted classifier

In [None]:
scores = cross_val_score(boosting, X, y)
scores.mean()

## Cross Validation Sampling

K-fold Cross Validation Sample

In [None]:
from sklearn.model_selection import KFold
from sklearn import svm
svm_clf = svm.SVC(kernel="rbf")
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    svm_clf.fit(X_train,y_train)
    svm_pred = svm_clf.predict(X_test)
    print((sum(y_test==svm_pred))/len(svm_pred))
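
To see which rows each fold holds out, the splitter can be run on a tiny array. In this sketch `X_toy` is a made-up array of 10 observations. Without shuffling, `KFold` takes consecutive blocks of rows as test folds; with `shuffle=True` the rows are assigned at random, but every row still appears in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 hypothetical observations

# Default behavior: consecutive blocks
kf_plain = KFold(n_splits=5)
test_folds = [test_idx.tolist() for _, test_idx in kf_plain.split(X_toy)]
print(test_folds)

# Shuffled folds with a fixed seed for repeatability
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_folds = [test_idx.tolist() for _, test_idx in kf_shuffled.split(X_toy)]
covered = sorted(i for fold in shuffled_folds for i in fold)
print(shuffled_folds)
```

If a dataset happens to be ordered by class, unshuffled folds may be unrepresentative, so shuffling (or a stratified splitter) may be preferable.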

Exercise 4: Assess the quality of your Naive Bayes classifier.  How does it compare?