Jerry Cheng  
CS-559 Assignment 4

In [1]:
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

In [2]:
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

## Question 1: Do a clustering of the training dataset with the cluster number K=10

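The first step is to prepare the training and testing data: convert the pixel values to floats, scale them to the range 0-1, and flatten each 28x28 image into a 784-element vector.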

In [3]:
# Data Normalization
# Conversion to float
x_train = x_train.astype('float32') 
x_test = x_test.astype('float32')
# Normalization
x_train = x_train/255.0
x_test = x_test/255.0

X_train = x_train.reshape(len(x_train),-1)
X_test = x_test.reshape(len(x_test),-1)

Then, we run KMeans clustering on the training data.

In [4]:
kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X_train)

KMeans(n_clusters=10, random_state=0)

Then assign a class label to each cluster based on a majority vote of the cluster members' known digit labels.

Based on that, compute the training classification accuracy for 10 classes from clustering.

In [5]:
kmeans.labels_

array([8, 2, 9, ..., 8, 5, 7], dtype=int32)

In [6]:
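def retrieve_info(cluster_labels, y_train):
    # Map each cluster id to the most common true digit among its members
    reference_labels = {}
    # Loop over the cluster ids present in cluster_labels
    # (the argument, not the global kmeans object)
    for i in np.unique(cluster_labels):
        # Boolean mask selecting the members of cluster i
        mask = cluster_labels == i
        # Majority vote over the true labels of those members
        reference_labels[i] = np.bincount(y_train[mask]).argmax()

    return reference_labels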

In [7]:
reference_labels = retrieve_info(kmeans.labels_,y_train)

# Translate each sample's cluster id into the digit assigned to that cluster
number_labels = np.zeros(len(kmeans.labels_), dtype=int)
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]

print(accuracy_score(number_labels,y_train))

0.5910166666666666
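
The cluster-to-digit mapping can be checked by hand on a toy example. The sketch below uses made-up cluster ids and labels (`toy_clusters` and `toy_digits` are illustrative, not taken from MNIST) and applies the same majority-vote idea, using a lookup array to label all samples at once instead of looping over them.

```python
import numpy as np

# Hypothetical toy data: 8 samples spread over 3 clusters
toy_clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_digits = np.array([7, 7, 1, 3, 3, 4, 9, 9])

# Majority vote: most frequent true digit inside each cluster
n_clusters = toy_clusters.max() + 1
lookup = np.zeros(n_clusters, dtype=int)
for c in range(n_clusters):
    lookup[c] = np.bincount(toy_digits[toy_clusters == c]).argmax()

# Indexing the lookup array with the cluster ids labels every sample at once
predicted = lookup[toy_clusters]
accuracy = np.mean(predicted == toy_digits)

print(lookup)
print(accuracy)
```

Two of the eight toy samples sit in a cluster whose majority digit is not their own, so they are counted as errors. This is the same effect that limits the accuracy on MNIST.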


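Performing the same process as before, but on the testing data. Note that K-means is refit on the test set here, so the clusters differ from those found on the training set.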

In [8]:
kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X_test)

KMeans(n_clusters=10, random_state=0)

In [9]:
reference_labels = retrieve_info(kmeans.labels_,y_test)

# Translate each sample's cluster id into the digit assigned to that cluster
number_labels = np.zeros(len(kmeans.labels_), dtype=int)
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]

print(accuracy_score(number_labels,y_test))

0.5945


#### Summary
After running the K-means algorithm on the data, the accuracy on the training data was 0.591 and the accuracy on the testing data was 0.5945. Overall, I can conclude that K-means clustering with K=10 does not work well on this image classification problem: roughly 4 in 10 digits end up with the wrong label.
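
The test accuracy above comes from clustering the test set on its own. An alternative, which is not what this notebook does, is to fit the clusters on the training data only, build the cluster-to-label mapping there, and then assign test samples with `KMeans.predict`. The sketch below illustrates that workflow on small synthetic data from `make_blobs` (the data, the centers, and the variable names are assumptions chosen so the example runs quickly; no MNIST result is implied).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score

# Synthetic stand-in for MNIST: 3 well-separated classes in 2 dimensions
X, y = make_blobs(n_samples=600, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.5, random_state=0)
X_tr, y_tr = X[:400], y[:400]
X_te, y_te = X[400:], y[400:]

# Fit the clusters on the training split only
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)

# Majority vote on the training split gives the cluster -> class mapping
mapping = np.zeros(3, dtype=int)
for c in range(3):
    mapping[c] = np.bincount(y_tr[km.labels_ == c]).argmax()

# Assign test samples to the nearest training centroid, then translate
test_pred = mapping[km.predict(X_te)]
test_acc = accuracy_score(y_te, test_pred)
print(test_acc)
```

With this setup the test labels are never used to build the mapping, so the test accuracy measures how well the training clusters generalize.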

## Question 2: Build a number of non DNN based classifiers using all pixels as features.

Use the four techniques ['Logistic Regression', 'SVM', 'Decision Tree', 'Random Forest'].

#### Logistic Regression
Logistic regression classification uses the sigmoid function to predict the probability that a piece of data belongs to a given category. Logistic regression is very simple to use. However, it performs poorly on data that is not linearly separable. As a result, logistic regression is not well suited to this handwritten digit classification problem.
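
To make the role of the sigmoid concrete, here is a minimal sketch for a binary problem with two features. The weights `w`, bias `b`, and input `x` are made-up values for illustration only; for the ten digit classes the same idea is extended to one score per class.

```python
import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature binary problem
w = np.array([2.0, -1.0])
b = 0.5

x = np.array([1.0, 3.0])
score = w @ x + b        # linear score: 2*1 - 1*3 + 0.5 = -0.5
prob = sigmoid(score)    # estimated probability of the positive class

print(round(prob, 4))
```

A negative score maps to a probability below 0.5, so this sample would be assigned to the negative class. The decision boundary `w @ x + b = 0` is a straight line, which is why the model struggles when the classes are not linearly separable.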

In [10]:
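# Data Normalization
# Note: x_train and x_test were already converted and scaled in In [3].
# If this cell runs in the same session, the pixel values are divided
# by 255 a second time.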
# Conversion to float
x_train = x_train.astype('float32') 
x_test = x_test.astype('float32')
# Normalization
x_train = x_train/255.0
x_test = x_test/255.0

X_train = x_train.reshape(len(x_train),-1)
X_test = x_test.reshape(len(x_test),-1)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.7793


#### Decision Tree

The Decision Tree classification algorithm can be defined as a decision support tool that uses a tree-like model of decisions and their possible consequences. Each node of the tree tests a certain attribute, and branches off based on different test outcomes. As such, decision trees do not require any data scaling or normalization, and will automatically select features. However, the algorithm is prone to overfitting.
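
The overfitting tendency can be seen by comparing training and test accuracy. The sketch below is an illustration under stated assumptions: it uses scikit-learn's small bundled 8x8 digits dataset (`load_digits`) rather than MNIST so that it runs in seconds, and `max_depth=5` is an arbitrary value chosen only to show the effect of limiting tree depth.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small 8x8 digits dataset bundled with scikit-learn (not MNIST)
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for depth in (None, 5):
    # max_depth=None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth setting
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in results.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

A fully grown tree fits the training split almost perfectly while scoring lower on held-out data. That gap is the overfitting referred to above.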

In [12]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.876


#### Random Forest
The random forest classification algorithm is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees. Generally, random forests perform better than single decision trees because averaging over many trees reduces overfitting.
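
The "mode of the classes" step can be sketched with plain NumPy. The predictions in `tree_preds` are made up for illustration. Note also that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this shows the textbook idea and not that library's exact procedure.

```python
import numpy as np

# Hypothetical predictions: rows are 5 trees, columns are 4 samples
tree_preds = np.array([
    [3, 8, 1, 7],
    [3, 8, 7, 7],
    [5, 8, 1, 9],
    [3, 3, 1, 7],
    [3, 8, 1, 9],
])

# For each sample (column), count votes per digit and keep the most frequent
forest_pred = np.array([
    np.bincount(tree_preds[:, j], minlength=10).argmax()
    for j in range(tree_preds.shape[1])
])

print(forest_pred)
```

Individual trees disagree on every sample here, yet the majority vote settles on a single digit for each one, which is how the ensemble smooths out the errors of single trees.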

In [13]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.9688


#### Comparison Table

In [16]:
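# Values typed in by hand; the decision tree and random forest figures differ
# slightly from the cell outputs above (0.876 and 0.9688), likely because those
# models were rerun without a fixed random_state
print('Kmeans accuracy: 0.5945')
print('Logistic regression accuracy: 0.7793')
print('Decision tree accuracy: 0.875')
print('Random forest accuracy: 0.9701')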

Kmeans accuracy: 0.5945
Logistic regression accuracy: 0.7793
Decision tree accuracy: 0.875
Random forest accuracy: 0.9701
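
The same comparison can be laid out as an actual table with pandas. This is a minimal sketch that uses the test accuracies printed by the model cells earlier in this notebook (0.876 and 0.9688 for the tree models, which differ slightly from the hand-typed values in the print statements). The dictionary and column names are illustrative.

```python
import pandas as pd

# Test accuracies copied from the outputs of the model cells above
accuracies = {
    'K-means': 0.5945,
    'Logistic regression': 0.7793,
    'Decision tree': 0.876,
    'Random forest': 0.9688,
}

# One row per model, sorted from best to worst
table = pd.DataFrame(
    {'Model': list(accuracies), 'Test accuracy': list(accuracies.values())}
).sort_values('Test accuracy', ascending=False).reset_index(drop=True)

print(table.to_string(index=False))
```

In a live notebook, storing each `accuracy_score` result in a variable and building the table from those variables would keep the table in sync with the cells.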


## Question 3: Neural Network

In [18]:
# cnn model with batch normalization for mnist
from numpy import mean
from numpy import std
from matplotlib import pyplot
from sklearn.model_selection import KFold
from keras.datasets import mnist
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
from keras.optimizers import SGD
from keras.layers import BatchNormalization
 
# load train and test dataset
def load_dataset():
	# load dataset
	(trainX, trainY), (testX, testY) = mnist.load_data()
	# reshape dataset to have a single channel
	trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
	testX = testX.reshape((testX.shape[0], 28, 28, 1))
	# one hot encode target values
	trainY = to_categorical(trainY)
	testY = to_categorical(testY)
	return trainX, trainY, testX, testY
 
# scale pixels
def prep_pixels(train, test):
	# convert from integers to floats
	train_norm = train.astype('float32')
	test_norm = test.astype('float32')
	# normalize to range 0-1
	train_norm = train_norm / 255.0
	test_norm = test_norm / 255.0
	# return normalized images
	return train_norm, test_norm
 
# define cnn model
def define_model():
	model = Sequential()
	model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
	model.add(BatchNormalization())
	model.add(MaxPooling2D((2, 2)))
	model.add(Flatten())
	model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
	model.add(BatchNormalization())
	model.add(Dense(10, activation='softmax'))
	# compile model
	opt = SGD(learning_rate=0.01, momentum=0.9)
	model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
	return model
 
# evaluate a model using k-fold cross-validation
def evaluate_model(dataX, dataY, n_folds=5):
	scores, histories = list(), list()
	# prepare cross validation
	kfold = KFold(n_folds, shuffle=True, random_state=1)
	# enumerate splits
	for train_ix, test_ix in kfold.split(dataX):
		# define model
		model = define_model()
		# select rows for train and test
		trainX, trainY, testX, testY = dataX[train_ix], dataY[train_ix], dataX[test_ix], dataY[test_ix]
		# fit model
		history = model.fit(trainX, trainY, epochs=10, batch_size=32, validation_data=(testX, testY), verbose=0)
		# evaluate model
		_, acc = model.evaluate(testX, testY, verbose=0)
		print('> %.3f' % (acc * 100.0))
		# stores scores
		scores.append(acc)
		histories.append(history)
	return scores, histories
 
# plot diagnostic learning curves
def summarize_diagnostics(histories):
	for i in range(len(histories)):
		# plot loss
		pyplot.subplot(2, 1, 1)
		pyplot.title('Cross Entropy Loss')
		pyplot.plot(histories[i].history['loss'], color='blue', label='train')
		pyplot.plot(histories[i].history['val_loss'], color='orange', label='test')
		# plot accuracy
		pyplot.subplot(2, 1, 2)
		pyplot.title('Classification Accuracy')
		pyplot.plot(histories[i].history['accuracy'], color='blue', label='train')
		pyplot.plot(histories[i].history['val_accuracy'], color='orange', label='test')
	pyplot.show()
 
# summarize model performance
def summarize_performance(scores):
	# print summary
	print('Accuracy: mean=%.3f std=%.3f, n=%d' % (mean(scores)*100, std(scores)*100, len(scores)))
	# box and whisker plots of results
	pyplot.boxplot(scores)
	pyplot.show()
 
# run the test harness for evaluating a model
def run_test_harness():
	# load dataset
	trainX, trainY, testX, testY = load_dataset()
	# prepare pixel data
	trainX, testX = prep_pixels(trainX, testX)
	# evaluate model
	scores, histories = evaluate_model(trainX, trainY)
	# learning curves
	summarize_diagnostics(histories)
	# summarize estimated performance
	summarize_performance(scores)
 
# entry point, run the test harness
run_test_harness()

> 98.592


KeyboardInterrupt: 

Accuracy: 98.592

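Note: I interrupted the loop because it was taking a long time to execute, so only the first of the five folds (98.592% accuracy) completed. The KeyboardInterrupt above comes from that manual interruption and does not mean the code does not work.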
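
To see why the full run is slow, it helps to look at what the cross-validation loop does. The sketch below reuses the same `KFold(5, shuffle=True, random_state=1)` settings on a plain index array standing in for the 60,000 training images (no model is trained here) and counts the work involved.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the 60,000 MNIST training images (only the row count matters)
indices = np.arange(60000)

kfold = KFold(5, shuffle=True, random_state=1)
fold_sizes = [(len(train_ix), len(test_ix))
              for train_ix, test_ix in kfold.split(indices)]
print(fold_sizes)

# Each fold trains a fresh model for 10 epochs, as in evaluate_model above
total_epochs = len(fold_sizes) * 10
print(total_epochs)
```

Every fold trains a new CNN on 48,000 images for 10 epochs, so the complete harness amounts to 50 training epochs. Lowering `n_folds` or `epochs` would shorten the run at the cost of a less reliable estimate.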

## Question 4: More Data

In [1]:
import os
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

folder = 'raw img/'
new_folder = 'resized img/'
all_files = os.listdir(folder)

for file in all_files:
    gray = cv2.imread(folder + file, cv2.IMREAD_GRAYSCALE)
    gray = cv2.resize(255-gray, (28,28))
    cv2.imwrite(new_folder+file, gray)

FileNotFoundError: [Errno 2] No such file or directory: 'raw img/'

In [None]:
all_files = os.listdir(new_folder)
all_files.sort()
all_test_data = []
for file in all_files:
    label, rest = file.split('-')
    gray = cv2.imread(new_folder + file, cv2.IMREAD_GRAYSCALE)
    row = gray.flatten()
    row = np.concatenate(([int(label)], row))
    all_test_data.append(row)
all_test_data = np.array(all_test_data)

In [None]:
all_img_square = []
all_img_flat = all_test_data[:,1:]

for i in range(len(all_img_flat)):
    row = all_img_flat[i]
    all_img_square.append(row.reshape((28,28)))
all_img_square = np.array(all_img_square)
img_merged = []

for i in range(0, len(all_img_square), 5):
    temp = all_img_square[i:i+5]
    img_merged.append(np.concatenate(temp, axis=1))
img_merged = np.array(img_merged)
img_merged = np.concatenate(img_merged, axis = 0)

plt.figure(figsize=(5,10))
plt.imshow(img_merged, cmap='Greys')
plt.show()
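
The tiling step above (five images per row, rows stacked vertically) can also be written as a single reshape and transpose. This is a sketch under the assumption that the number of images is a multiple of five. The `images` array is random placeholder data and not the handwritten samples.

```python
import numpy as np

# Hypothetical stack of 10 images, each 28x28 (random values as placeholders)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 28, 28))

per_row = 5
rows = len(images) // per_row

# (rows, per_row, 28, 28) -> (rows, 28, per_row, 28) -> (rows*28, per_row*28)
grid = (images.reshape(rows, per_row, 28, 28)
              .transpose(0, 2, 1, 3)
              .reshape(rows * 28, per_row * 28))

print(grid.shape)
```

Image `k` lands in grid row `k // 5` and grid column `k % 5`, which matches the layout produced by the two `np.concatenate` calls.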