# Classifier Comparison (using scikit-learn)

## In this assignment, we compare the accuracy of six ML classifiers, using their scikit-learn implementations.

#### Importing various toolkits I'll be using

In [2]:
from numpy import genfromtxt
from sklearn import metrics
from sklearn.model_selection import train_test_split

#### Now I'll be importing the classifiers from sklearn

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier


I'm using genfromtxt to import the dataset as a NumPy array, which is the format sklearn's methods expect


In [4]:
transport_dataset = genfromtxt("data.csv",delimiter=",")
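One thing worth checking after this import: by default genfromtxt parses every field as a float, and anything it cannot parse (a header row, a text label) becomes nan without raising an error. Here is a minimal sketch of that behaviour, using a made-up inline CSV rather than data.csv (the column names and values are purely illustrative):

```python
import io

import numpy as np

# Hypothetical three-line CSV with a header row; NOT the assignment's data.csv.
csv_text = "f1,f2,label\n1.0,2.0,0\n3.0,4.0,1\n"

# Without skip_header, the header row is parsed as floats and turns into nan.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# With skip_header=1, only the numeric rows are read.
clean = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(raw.shape, int(np.isnan(raw).sum()))
print(clean.shape, int(np.isnan(clean).sum()))
```

Counting nan values with np.isnan right after loading is a cheap way to catch this before any model is fitted.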

#### Load features (column data) and target (response) variables

In [17]:
X = transport_dataset[:,:-1]

The response variable must be a single column holding a numeric code for each class. I extracted the class labels from the original data into a new CSV file named 'data_target.csv', and I import it in this block with genfromtxt so that it is also a NumPy array

In [18]:
y = genfromtxt("data_target.csv",delimiter=",")

Checking the data types to make sure we've got the variables in the necessary format for the sklearn toolkit

In [19]:
type(X)

numpy.ndarray

In [20]:
type(y)

numpy.ndarray

#### Perfect. Let's move on and check the dimensions of the data to ensure it's been imported correctly into our feature and response variables

In [21]:
print("Original dataset dimensions")
print(X.shape)
print(y.shape)

Original dataset dimensions
(21000, 128)
(21000,)
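The shapes confirm the row counts match, but they say nothing about missing values or how the classes are distributed. A rough sketch of two extra checks follows, run here on synthetic stand-in arrays (the 7 balanced classes and the names X_demo and y_demo are my assumptions, not properties of the real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for X and y (assumption: 7 balanced classes stored as floats).
X_demo = rng.normal(size=(700, 128))
y_demo = np.repeat(np.arange(7, dtype=float), 100)

# Number of rows that contain at least one nan feature.
rows_with_nan = int(np.isnan(X_demo).any(axis=1).sum())

# Class frequencies; the largest share is the accuracy obtained by
# always predicting the most common class.
classes, counts = np.unique(y_demo, return_counts=True)
majority_baseline = counts.max() / counts.sum()

print(rows_with_nan)
print(dict(zip(classes.tolist(), counts.tolist())))
print(round(majority_baseline, 3))
```

The majority-class share is a useful reference point: an accuracy score only means something relative to what constant guessing would achieve.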


### NOTE - In this section I first check the training accuracies, i.e. how the algorithms perform on the same data they were fitted on. I'll then compare these results with the accuracy on a held-out test split!

### We instantiate the models as follows (mostly using default parameters; the exceptions are gamma='scale' for the SVM and 100 estimators instead of 10 for the random forest)


In [22]:
# instantiate the models (default parameters except where set explicitly)
naive = GaussianNB()
logreg = LogisticRegression()
knn = KNeighborsClassifier()
decision_tree = DecisionTreeClassifier()
svm = SVC(gamma='scale')
randomforest = RandomForestClassifier(n_estimators=100)


#### Now that we've created the instances, I'm training each model on the data using its fit() method.

In [23]:
# fit the models with data
naive.fit(X,y)
logreg.fit(X, y)
knn.fit(X, y)
decision_tree.fit(X,y)
svm.fit(X,y)
randomforest.fit(X,y)




RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
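Since every classifier exposes the same fit() and score() methods, the six near-identical calls can also be written as a loop over a dictionary. This is just a sketch on synthetic data from make_classification (the data, the dictionary keys and max_iter=1000 are my choices for illustration, not part of the assignment):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); the notebook's X and y would go here.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

models = {
    "naive_bayes": GaussianNB(),
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(gamma="scale"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its accuracy on the data it was fitted on.
train_scores = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    train_scores[name] = model.score(X_demo, y_demo)

for name, score in train_scores.items():
    print(f"{name:<15} {score:.3f}")
```

For classifiers, score() returns the mean accuracy, so it gives the same number as calling predict() followed by metrics.accuracy_score.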

#### Predict response values using the .predict() method! (The notebook only displays the result of the last call, here the random forest's.)

In [24]:
naive.predict(X)
logreg.predict(X)
knn.predict(X)
decision_tree.predict(X)
svm.predict(X)
randomforest.predict(X)


array([0., 0., 0., ..., 6., 6., 6.])

#### Store the values in variables:

In [25]:
naive_predicted = naive.predict(X)
logreg_predicted = logreg.predict(X)
knn_predicted = knn.predict(X)
decision_tree_predicted = decision_tree.predict(X)
svm_predicted = svm.predict(X)
randomforest_predicted = randomforest.predict(X)


### To ensure each model produced exactly one prediction per row, we check the length of the resulting arrays


In [26]:
print(len(naive_predicted))

21000


In [27]:
print(len(logreg_predicted))

21000


In [28]:
print(len(knn_predicted))

21000


In [29]:
print(len(decision_tree_predicted))

21000


In [30]:
print(len(svm_predicted))

21000


In [31]:
print(len(randomforest_predicted))

21000


### We have thus verified that every classifier returns one prediction for each of the 21000 samples

### Now we check how accurately the classifiers predict their training data

In [32]:
print("Naive Bayes accuracy score")
print(metrics.accuracy_score(y, naive_predicted))

Naive Bayes accuracy score
0.5687619047619048
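For reference, accuracy_score with its default arguments is simply the fraction of predictions that exactly match the true labels. A plain-Python equivalent, written only to illustrate the definition (the function name accuracy and the toy labels are mine):

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy labels, not taken from the dataset: 3 of the 4 predictions match.
print(accuracy([0, 1, 2, 2], [0, 1, 2, 1]))  # 0.75
```

Read this way, the Naive Bayes score above corresponds to 11944 correctly labelled rows out of 21000.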


In [33]:
print("Logistic Regression accuracy score")
print(metrics.accuracy_score(y, logreg_predicted))

Logistic Regression accuracy score
0.5721904761904761


In [34]:
print("KNN accuracy score")
print(metrics.accuracy_score(y,knn_predicted))

KNN accuracy score
0.7446190476190476


In [36]:
print("Decision Tree accuracy score")
print(metrics.accuracy_score(y,decision_tree_predicted))

Decision Tree accuracy score
1.0


In [37]:
print("SVM accuracy score")
print(metrics.accuracy_score(y,svm_predicted))

SVM accuracy score
0.5862857142857143


In [38]:
print("Random forest accuracy score")
print(metrics.accuracy_score(y,randomforest_predicted))

Random forest accuracy score
1.0
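A training accuracy of 1.0 deserves some suspicion: a model flexible enough to memorize its training rows will score perfectly on them even when there is nothing to learn. The sketch below illustrates this with a hand-written 1-nearest-neighbour rule on random points whose labels are independent of the features (everything here, including the 7 classes, is a made-up illustration and not the assignment's data or models):

```python
import random

random.seed(0)


def make_points(n):
    # Random 2-D points with labels drawn independently of the features,
    # so there is no real pattern to learn (7 classes is an assumption).
    return [((random.random(), random.random()), random.randrange(7)) for _ in range(n)]


train = make_points(500)
test = make_points(500)


def predict_1nn(point):
    # Label of the closest training point (squared Euclidean distance).
    nearest = min(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def accuracy(data):
    return sum(predict_1nn(x) == label for x, label in data) / len(data)


train_acc = accuracy(train)  # every training point is its own nearest neighbour
test_acc = accuracy(test)    # fresh points: roughly chance level
print(train_acc, test_acc)
```

Each training point is its own nearest neighbour, so the training score is perfect, while the score on fresh points stays near chance. An unpruned decision tree can memorize training rows in a similar way, which is why a held-out split is needed.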


## The scores above are training accuracies: each model was scored on the same data it was fitted on, so they are optimistic estimates rather than a measure of performance on new data. Now we split the data into 80% training and 20% test and repeat the procedure on the resulting datasets.

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
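train_test_split shuffles the rows before splitting, and random_state fixes that shuffle so the split is reproducible. It does not keep class proportions equal across the two parts unless the stratify argument is passed. A small sketch of the stratified variant on synthetic, deliberately imbalanced labels (the class shares and the *_demo names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 7 imbalanced classes (assumption, not the real data).
rng = np.random.default_rng(0)
y_demo = rng.choice(7, size=1000, p=[0.4, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05])
X_demo = rng.normal(size=(1000, 4))

# stratify=y_demo keeps each class's share (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo
)

# Compare class shares in the full data and in the test part.
full_share = np.bincount(y_demo, minlength=7) / len(y_demo)
test_share = np.bincount(y_te, minlength=7) / len(y_te)

print(len(y_tr), len(y_te))
print(np.abs(full_share - test_share).max())
```

Whether stratification matters for this dataset depends on how balanced its classes are, which the notebook does not show.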

In [40]:
#Calculating the split accuracy
print("Calculating the split accuracies")


Calculating the split accuracies


In [41]:
# fitting models
naive.fit(X_train,y_train)
logreg.fit(X_train, y_train)
knn.fit(X_train, y_train)
decision_tree.fit(X_train,y_train)
svm.fit(X_train,y_train)
randomforest.fit(X_train,y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### Note here that we've switched to predicting values based on the test dataset (20% of the data)

In [42]:
# predicting response values as before
naive.predict(X_test)
logreg.predict(X_test)
knn.predict(X_test)
decision_tree.predict(X_test)
svm.predict(X_test)
randomforest.predict(X_test)


array([1., 1., 2., ..., 1., 3., 1.])

In [44]:
# storing the predicted response values
naive_split_predicted = naive.predict(X_test)
logreg_split_predicted = logreg.predict(X_test)
knn_split_predicted = knn.predict(X_test)
decision_tree_split_predicted = decision_tree.predict(X_test)
svm_split_predicted = svm.predict(X_test)
randomforest_split_predicted = randomforest.predict(X_test)


In [45]:
# checking that each classifier returned one prediction per test row
print(len(naive_split_predicted))
print(len(logreg_split_predicted))
print(len(knn_split_predicted))
print(len(decision_tree_split_predicted))
print(len(svm_split_predicted))
print(len(randomforest_split_predicted))

4200
4200
4200
4200
4200
4200


### Finally, calculate the accuracy scores against the true labels of the test set (y_test)

In [46]:
print("Naive Bayes accuracy score")
print(metrics.accuracy_score(y_test, naive_split_predicted))

Naive Bayes accuracy score
0.5630952380952381


In [47]:
print("Logistic Regression accuracy score")
print(metrics.accuracy_score(y_test, logreg_split_predicted))

Logistic Regression accuracy score
0.5507142857142857


In [48]:
print("KNN accuracy score")
print(metrics.accuracy_score(y_test,knn_split_predicted))

KNN accuracy score
0.6576190476190477


In [49]:
print("Decision Tree accuracy score")
print(metrics.accuracy_score(y_test,decision_tree_split_predicted))

Decision Tree accuracy score
0.5883333333333334


In [50]:
print("Support Vector Machine accuracy score")
print(metrics.accuracy_score(y_test,svm_split_predicted))

Support Vector Machine accuracy score
0.5814285714285714


In [51]:
print("Random Forest accuracy score")
print(metrics.accuracy_score(y_test,randomforest_split_predicted))

Random Forest accuracy score
0.7033333333333334
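To compare the two rounds of results side by side, the accuracies printed in the cells above can be collected and ranked. The numbers below are copied from those outputs and rounded to four decimals; the dictionary and variable names are mine:

```python
# Training accuracies (models fitted and scored on all 21000 rows).
train_acc = {
    "Naive Bayes": 0.5688,
    "Logistic Regression": 0.5722,
    "KNN": 0.7446,
    "Decision Tree": 1.0,
    "SVM": 0.5863,
    "Random Forest": 1.0,
}
# Test accuracies (models fitted on 80% of the rows, scored on the other 20%).
test_acc = {
    "Naive Bayes": 0.5631,
    "Logistic Regression": 0.5507,
    "KNN": 0.6576,
    "Decision Tree": 0.5883,
    "SVM": 0.5814,
    "Random Forest": 0.7033,
}

# Rank by test accuracy, best first, and show the train-minus-test gap.
ranking = sorted(test_acc, key=test_acc.get, reverse=True)
for name in ranking:
    gap = train_acc[name] - test_acc[name]
    print(f"{name:<20} train={train_acc[name]:.4f} test={test_acc[name]:.4f} gap={gap:.4f}")
```

Sorting by test accuracy rather than training accuracy matters here, because the two orderings differ noticeably.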


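#### Notice that the test accuracy is lower than the training accuracy for every classifier. The gap is small for Naive Bayes, logistic regression and the SVM (less than 0.03), moderate for KNN (0.74 down to 0.66), and large for the decision tree (1.0 down to 0.59) and the random forest (1.0 down to 0.70), which suggests the tree-based models overfit the training data.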

#### This concludes my comparison of the various ML classifiers!

#### As we can see, the random forest performs best on the held-out test data (about 70% accuracy), followed by K-Nearest Neighbors (about 66%); the other four classifiers score between 55% and 59%. The decision tree and random forest reach 100% accuracy only on the training data. However, this is based on mostly default parameters, and classifiers like K-Nearest Neighbors might do better with a different value of K, for instance.
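One way to test the idea about K would be a cross-validated grid search over n_neighbors. I have not run this on the assignment's data; the sketch below uses synthetic data from make_classification, and the candidate values of K are arbitrary choices of mine:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); X_train and y_train would be used in practice.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Try several neighbourhood sizes, scoring each by 5-fold cross-validated accuracy.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 11, 21]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Running the search on the training portion only keeps the test set untouched, so the final test accuracy stays an honest estimate.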

### - By
## Rahul Basu (rb622)