## Introduction

Image detection is a good entry point for learning about neural networks, which is what I aim to do in this project. The specific problem is recognizing handwritten digits, a standard task in the branch of machine learning that deals with images.

## Data Wrangling

I'll first read in the data and then check for any anomalies. Reading in the data entails a series of functions, the first of which loads a raw gzip file and deserializes its contents from a series of bytes back into a hierarchy of Python objects. That step makes the data easier to vectorize and put into arrays afterward.

In [None]:
# may need to adjust intro above
 # cite: http://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html#sphx-glr-auto-examples-linear-model-plot-sparse-logistic-regression-mnist-py

# explain that I'm taking it from Python package and why

In [24]:
import numpy as np
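# note: fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22 (fetch_openml replaces it)
from sklearn.datasets import fetch_mldata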
from sklearn.model_selection import train_test_split

mnist = fetch_mldata('MNIST')
X = mnist.data.astype('float64')
y = mnist.target

In [25]:
print(X.shape)
print(y.shape)
print(np.unique(y))

(70000, 780)
(70000,)
[0 1 2 3 4 5 6 7 8 9]


In [None]:
# try with and w/o standard scaling? will that resolve issue w/ L1 and solver=saga? #look up default penalty 
# explain scaling being used (standard normal) and process of making computations more stable 
 # check distribution of data (level of gray) to see if necessary

# look into effect of permuting 

# decrease tolerance (maybe .01, .001, etc.)

In [26]:
# train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
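
The notes above raise the question of standard scaling. As a minimal sketch of what that step could look like, the cell below uses a small synthetic array as a stand-in for the pixel data (the `*_demo` names and the array sizes are assumptions for illustration, not part of this project's data). It fits `StandardScaler` on the training split only and then applies the same transformation to the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for pixel data: gray levels in [0, 255].
rng = np.random.RandomState(0)
X_demo_train = rng.randint(0, 256, size=(100, 16)).astype('float64')
X_demo_test = rng.randint(0, 256, size=(20, 16)).astype('float64')

# Fit on the training split only, then reuse the fitted scaler on the test
# split, so no information from the test split leaks into preprocessing.
scaler = StandardScaler()
X_demo_train_scaled = scaler.fit_transform(X_demo_train)
X_demo_test_scaled = scaler.transform(X_demo_test)

# Each training column now has mean close to 0 and standard deviation close to 1.
print(X_demo_train_scaled.mean(axis=0).round(6))
print(X_demo_train_scaled.std(axis=0).round(6))
```

Whether scaling actually helps the `sag` solver on this dataset is something to check empirically by comparing runs with and without it.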

In [28]:
# logistic regression on training data
# a looser tolerance (tol=0.1) makes the solver stop earlier, which shortens training

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

C_grid = {'C': [0.001, 0.1, 1, 10, 100]}
clf_grid = GridSearchCV(LogisticRegression(multi_class='multinomial', solver='sag', tol=0.1), 
                        C_grid, cv = 5, scoring = 'accuracy')
clf_grid.fit(X_train, y_train)
print(clf_grid.best_params_, clf_grid.best_score_)

{'C': 1} 0.918714285714


In [32]:
# check performance on test set
clf_grid_test = LogisticRegression(C = clf_grid.best_params_['C'], multi_class='multinomial', solver='sag', tol=0.1)
clf_grid_test.fit(X_train, y_train)
accuracy_score(y_test, clf_grid_test.predict(X_test))  # argument order is (y_true, y_pred)

0.92278571428571432

In [None]:
# check accuracy per digit -- metrics package may provide directly 
# heat-map-style confusion matrix for each digit
# table with test and training accuracy (and precision/recall) for each digit
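
One possible way to get the per-digit numbers listed above is through `sklearn.metrics`. The sketch below uses short hypothetical label arrays (`y_true_demo`, `y_pred_demo`) in place of `y_test` and the model's predictions, so the figures it prints say nothing about the actual classifier. "Accuracy per digit" is taken here to mean recall, i.e. the share of each true digit that was predicted correctly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels and predictions, standing in for y_test and
# clf_grid_test.predict(X_test).
y_true_demo = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 2])
y_pred_demo = np.array([0, 1, 2, 1, 1, 0, 2, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Per-class recall: correct predictions for a class divided by the number
# of true samples of that class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)

# Precision, recall, and F1 for each class in one table.
print(classification_report(y_true_demo, y_pred_demo))
```

The matrix `cm` can be passed to a plotting function to produce the heat-map view.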

In [None]:
# random forest

In [10]:
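import numpy as np
import gzip
import pickle  # on load, converts a byte stream back into an object hierarchy

def load_mnist():
    # open in binary read mode; the with-block closes the file even on error
    with gzip.open('mnist.pkl.gz', 'rb') as f:
        # assumes the pickle holds exactly two (inputs, labels) tuples
        train, test = pickle.load(f)
    return (train, test)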
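
To make the two directions of `pickle` concrete, here is a small round-trip sketch. It writes a hypothetical miniature dataset (`tiny_train`, `tiny_test`, and the file name `tiny_mnist.pkl.gz` are made up for illustration) to a temporary gzip file and reads it back the same way `load_mnist` reads `mnist.pkl.gz`. The two-tuple layout is an assumption that mirrors what `load_mnist` unpacks.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

# Hypothetical miniature dataset: two (inputs, labels) tuples.
tiny_train = (np.zeros((3, 784)), np.array([5, 0, 4]))
tiny_test = (np.zeros((2, 784)), np.array([7, 2]))

path = os.path.join(tempfile.mkdtemp(), 'tiny_mnist.pkl.gz')

# Serialize: object hierarchy -> byte stream -> gzip-compressed file.
with gzip.open(path, 'wb') as f:
    pickle.dump((tiny_train, tiny_test), f)

# Deserialize: gzip-compressed file -> byte stream -> object hierarchy.
with gzip.open(path, 'rb') as f:
    loaded_train, loaded_test = pickle.load(f)

print(loaded_train[0].shape, loaded_train[1])
print(loaded_test[0].shape, loaded_test[1])
```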

The next function creates training and test datasets that will be easier to process. The resulting training data is a list of 50,000 (x, y) tuples, where x is a 784-dimensional array (28x28 pixels, flattened) holding the input image and y is a ten-dimensional array encoding the corresponding classification (digit value). The resulting test data is a list of 10,000 tuples with the same kind of x, but there y is left as the digit value itself rather than a ten-dimensional array.

In [11]:
def data_wrapper():
    train, test = load_mnist()
    # reshape each training input (x) into a 784x1 column vector (28x28 pixels)
    train_inputs = [np.reshape(x, (784, 1)) for x in train[0]]
    train_results = [vectorized_result(y) for y in train[1]]
    # list() so the pairs can be indexed and reused (zip is a one-shot iterator in Python 3)
    train_data = list(zip(train_inputs, train_results))
    # validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    # validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in test[0]]
    test_data = list(zip(test_inputs, test[1]))
    return (train_data, test_data)  # validation_data,

In [None]:
# see to-do list before proceeding -- may not need vectorized_result and beyond

The third function converts a digit (0-9) into the corresponding desired output from the neural network, using 1 as a binary marker among zeroes otherwise. It sets up a one-versus-many method of evaluation for the algorithm, in which a yes-or-no decision is made for each potential digit. (For example, in a correct prediction of a 6, the algorithm would choose "no" for the other nine digits and "yes" for 6.) This scheme is also the reason we have a 10-dimensional unit vector: one dimension for each digit.

In [12]:
def vectorized_result(j):
    tenD_uv = np.zeros((10, 1)) #create 10D unit vector
    tenD_uv[j] = 1.0 #put 1 in jth position
    return tenD_uv
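
As a quick check of the encoding, the sketch below applies `vectorized_result` to the digit 6 from the example above and then recovers the digit with `np.argmax`. The function is repeated only so the cell runs on its own; the names `encoded` and `decoded` are illustrative.

```python
import numpy as np

def vectorized_result(j):
    # Same definition as in the cell above, repeated so this sketch is self-contained.
    tenD_uv = np.zeros((10, 1))
    tenD_uv[j] = 1.0
    return tenD_uv

# Encode the digit 6: a (10, 1) column of zeroes with a single 1.0 at index 6.
encoded = vectorized_result(6)
print(encoded.ravel())

# Decoding goes the other way: the index of the largest entry is the digit.
decoded = int(np.argmax(encoded))
print(decoded)
```

The same `np.argmax` step is one way to turn a network's ten output activations back into a predicted digit.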

In [None]:
# next step: initialize a Network object??