### Dataset Description

The dataset for this project is the Digit Recognizer dataset from Kaggle. Each image is 28 pixels by 28 pixels (784 pixels in total) and shows a single digit from 0 to 9. Each pixel has a single value between 0 and 255 (inclusive) indicating its lightness or darkness, with higher numbers meaning darker. There are two files: train.csv and test.csv.

The training dataset has 785 columns. The first column, 'label', is the digit that the image depicts. The remaining 784 columns hold the pixel values of that image, and each row corresponds to a different image. Each pixel column is named pixelx, where x is an integer between 0 and 783 (inclusive) that decomposes as `x = i*28 + j`, with i and j integers between 0 and 27 giving the pixel's row and column in the image. The test dataset (28000 images) has the same layout as the training set, except that it does not contain the 'label' column.
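
As a minimal sketch of the `x = i*28 + j` decomposition described above, the helpers below convert between a flat pixel index and its (row, column) position. The function names are hypothetical and only for illustration; they are not part of the dataset or of any library.

```python
import numpy as np

IMAGE_SIDE = 28  # images are 28 x 28 pixels


def pixel_index_to_row_col(x):
    """Map a flat pixel index x (0..783) to (row i, column j), where x = i*28 + j."""
    return divmod(x, IMAGE_SIDE)


def row_col_to_pixel_index(i, j):
    """Inverse mapping: (row i, column j) back to the flat index x."""
    return i * IMAGE_SIDE + j


# A flat row of 784 pixel values reshapes to a 28 x 28 image in the same row-major order,
# so image[i, j] holds the value of column "pixel{i*28 + j}".
flat = np.arange(IMAGE_SIDE * IMAGE_SIDE)
image = flat.reshape(IMAGE_SIDE, IMAGE_SIDE)

print(pixel_index_to_row_col(29))      # (1, 1)
print(row_col_to_pixel_index(27, 27))  # 783
```

Because the flat order is row-major, a plain `reshape(28, 28)` is enough to recover an image for plotting.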

### Goal

For each of the 28000 images in the test set, the output should be a single line containing the imageId and the predicted digit. The evaluation metric is categorization accuracy, that is, the proportion of test images that are correctly classified.
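
The output format and metric could be sketched roughly as follows. The header names `ImageId` and `Label`, the 1-based numbering of image IDs, and both function names are assumptions made for illustration; the text above only says that each line holds the imageId and the predicted digit.

```python
def submission_lines(predictions):
    """Build one 'imageId,digit' line per prediction, preceded by a header line."""
    lines = ["ImageId,Label"]  # assumed header names
    for image_id, digit in enumerate(predictions, start=1):  # assumed 1-based IDs
        lines.append(f"{image_id},{int(digit)}")
    return lines


def categorization_accuracy(predicted, actual):
    """Proportion of images whose predicted digit equals the true digit."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


print("\n".join(submission_lines([2, 0, 9])))
print(categorization_accuracy([2, 0, 9], [2, 1, 9]))
```

Accuracy can only be computed where true labels are known, so in practice it would be measured on a held-out part of train.csv rather than on test.csv.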

In [109]:
# pandas is for reading data
import pandas as pd
# numpy is for linear algebra
import numpy as np
from matplotlib import pyplot as plt

In [110]:
data = pd.read_csv("data_files/train.csv")

In [111]:
data.head()

,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [112]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Columns: 785 entries, label to pixel783
dtypes: int64(785)
memory usage: 251.5 MB


In [113]:
# Convert the pandas DataFrame to a NumPy array
# so that we can perform linear algebra manipulations on it

data = np.array(data)

In [114]:
m, n = data.shape

# Shuffle the rows before splitting so the dev set is a random sample of the data;
# accuracy on the held-out dev set is what reveals overfitting
np.random.shuffle(data)

# Transpose so that each column is one image:
# the first row holds the labels and the remaining 784 rows hold the pixels
data_dev = data[0:1000].T
Y_dev = data_dev[0]
X_dev = data_dev[1:n]
# Normalize pixel values to the range [0, 1]
X_dev = X_dev / 255

data_train = data[1000:m].T
Y_train = data_train[0]
X_train = data_train[1:n]
X_train = X_train / 255





In [115]:
data_train.size

32185000

In [157]:
print('X_train: ', X_train.shape)
print('Y_train: ', Y_train.shape)

X_train:  (784, 41000)
Y_train:  (41000,)
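
The shuffle-and-split step can be tried on a toy array to confirm the resulting shapes. The sketch below is an illustration under stated assumptions, not the notebook's own code: it swaps in a seeded `np.random.default_rng` so the split is reproducible, and the function name `shuffle_and_split` is hypothetical.

```python
import numpy as np


def shuffle_and_split(data, dev_size, seed=0):
    """Shuffle rows, hold out the first dev_size rows, and return column-per-image arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(data)  # returns a row-shuffled copy; the input is untouched
    dev = shuffled[:dev_size].T       # after transposing, each column is one image
    train = shuffled[dev_size:].T
    Y_dev, X_dev = dev[0], dev[1:] / 255        # row 0 is the label, the rest are pixels
    Y_train, X_train = train[0], train[1:] / 255
    return X_train, Y_train, X_dev, Y_dev


# Toy data: 10 "images" with a label column followed by 4 pixel columns
labels = np.arange(10).reshape(-1, 1)
pixels = np.full((10, 4), 255)
toy = np.hstack([labels, pixels])

X_train, Y_train, X_dev, Y_dev = shuffle_and_split(toy, dev_size=3)
print(X_train.shape, Y_train.shape)  # (4, 7) (7,)
print(X_dev.shape, Y_dev.shape)      # (4, 3) (3,)
```

With the real data the same pattern gives 784 pixel rows and 41000 training columns, matching the shapes printed above.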


In [131]:
def init_params():
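    # np.random.rand draws uniformly from [0, 1), so subtracting 0.5 gives values in [-0.5, 0.5)
    # w1 is the weight matrix of the hidden layer, shape (# hidden neurons, # input neurons)
        # each input neuron (784) connects to each hidden neuron (10)
    # w2 is the weight matrix of the output layer: each hidden neuron (10) connects to each output neuron (10)
    # b1 and b2 are the biases of the hidden-layer and output-layer neurons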
    w1 = np.random.rand(10,784) - 0.5
    b1 = np.random.rand(10,1) - 0.5
    w2 = np.random.rand(10,10) - 0.5
    b2 = np.random.rand(10,1) - 0.5
    return w1, b1, w2, b2

In [133]:
w1, b1, w2, b2 = init_params()

In [155]:
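def ReLU(Z):
    # np.maximum is element-wise, so each element of Z is compared with 0
    return np.maximum(Z, 0)

def softmax(Z):
    a = np.exp(Z)
    # Debug output comparing two ways of summing:
    # np.sum(a) with no axis collapses the whole array to a single scalar, while the
    # built-in sum(a) adds the rows together, giving one total per column (per image)
    print("a: ", a.shape)
    print("np sum: ", np.sum(a))
    print("regular sum: ", sum(a))
    nump = np.sum(a)
    reg = sum(a)
    print("nump: ", type(nump), nump.shape)
    print("reg: ", type(reg), reg.shape)

    # Normalize each column by its own total so every image's probabilities sum to 1;
    # np.sum(a, axis=0) gives the same per-column totals as the built-in sum(a)
    return a / np.sum(a, axis=0)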
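
One possible refinement, shown here only as a sketch, is a softmax that subtracts each column's maximum before exponentiating. This is a common way to avoid overflow in `np.exp` for large inputs and does not change the result, because softmax is unaffected by adding a constant to every entry of a column. The name `softmax_stable` is hypothetical.

```python
import numpy as np


def softmax_stable(Z):
    """Column-wise softmax for Z of shape (classes, samples)."""
    # Subtracting the per-column maximum keeps every exponent <= 0
    shifted = Z - np.max(Z, axis=0, keepdims=True)
    exp = np.exp(shifted)
    # axis=0 sums over classes, giving one total per sample (column)
    return exp / np.sum(exp, axis=0, keepdims=True)


# The second column is the first column plus 999; a naive np.exp would overflow on it
Z = np.array([[1.0, 1000.0],
              [2.0, 1001.0],
              [3.0, 1002.0]])
A = softmax_stable(Z)
print(A.sum(axis=0))  # each column sums to 1
```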

In [147]:
def forward_prop(w1, b1, w2, b2, X):
    # NumPy broadcasting adds the (10, 1) bias array to the (10, m) product:
    # it effectively replicates the bias column m times, once per image
    print("Forward Propogation")
    z1 = w1.dot(X) + b1
    print("z1: ", z1.shape)
    A1 = ReLU(z1)
    print("A1: ", A1.shape)
    z2 = w2.dot(A1) + b2
    print("z2: ", z2.shape)
    A2 = softmax(z2)
    print("A2: ", A2.shape)
    print()

    return z1, A1, z2, A2
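
The broadcasting step in `w1.dot(X) + b1` can be checked in isolation. The sketch below uses small made-up values (all-ones weights and inputs, and a bias equal to the neuron index) purely so the result is easy to verify by hand.

```python
import numpy as np

m = 5                                # number of images in this toy batch
W = np.ones((10, 784))               # stand-in for w1
X = np.ones((784, m))                # stand-in for the input, one image per column
b = np.arange(10).reshape(10, 1)     # stand-in for b1: neuron k has bias k

# W.dot(X) has shape (10, m); the (10, 1) bias is broadcast across all m columns
Z = W.dot(X) + b
print(Z.shape)   # (10, 5)
print(Z[3])      # every entry is 784 + 3 = 787
```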

In [121]:
# Transform a vector Y of class labels into a one-hot encoded matrix
# One-hot encoding is a common way to represent categorical variables as binary vectors
# Y is a 1-D array of length m where each element is the true class label of the corresponding column of the input data array
def one_hot(Y):
    # np.zeros creates a 2-D array of zeros with shape (number of samples, number of classes)
        # Y.size is the total number of elements in Y, i.e. the number of samples
        # Y.max() + 1 is the number of classes, because the labels start from 0 (0-9)
    ohY = np.zeros((Y.size, Y.max()+1))
    # "For each row, go to the column specified by the label in Y and set it equal to 1"
    # Each row of ohY corresponds to a sample and each column to a class
        # np.arange(Y.size) generates the row indices 0 to Y.size - 1, one per sample
        # Y supplies the matching column index, i.e. the class label of each sample
    ohY[np.arange(Y.size), Y] = 1
    # Transpose because we want each column, not each row, to be a sample
    return ohY.T

def deriv_ReLU(Z):
    # ReLU has derivative 1 for x > 0 (where ReLU(x) = x) and 0 for x <= 0 (where ReLU(x) = 0)
    # Returning the boolean array works because True and False behave as 1 and 0 in arithmetic
    return Z > 0
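
A small worked example may make the one-hot indexing easier to follow. This sketch is a variant rather than the notebook's function: it takes the number of classes as an explicit argument (an assumption, so that a batch missing the largest digit still produces a full-height matrix) and writes directly into a column-per-sample array. The name `one_hot_columns` is hypothetical.

```python
import numpy as np


def one_hot_columns(Y, num_classes):
    """Return a (num_classes, Y.size) matrix with a single 1 per column."""
    encoded = np.zeros((num_classes, Y.size))
    # Row index = class label, column index = sample position
    encoded[Y, np.arange(Y.size)] = 1
    return encoded


Y = np.array([2, 0, 1])
encoded = one_hot_columns(Y, num_classes=3)
print(encoded)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```

Reading down each column gives the label back: the first column has its 1 in row 2, the second in row 0, the third in row 1.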

In [122]:
# TODO: print the shape of each matrix as it is created to make sure the sizes line up for all the matrix multiplications

In [145]:
def back_prop(z1, A1, z2, A2, w2, X, Y):
    print('Back Propogation')
    m = Y.size
    print("Y.size: ", Y.size)
    ohY = one_hot(Y)
    print(ohY, '\n')
    print("ohY: ", ohY.size)
    dz2 = A2 - ohY
    print('A2: ', A2.shape)
    print('ohY: ', ohY.shape)
    print('dz2: ', dz2.shape)
    dw2 = 1/m * dz2.dot(A1.T)
    print('A1.T: ', A1.T.shape)
    print('dw2: ', dw2.shape)
    db2 = 1/m * np.sum(dz2, axis=1)
    print('db2: ', db2.shape)
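    # Chain rule: w2.T.dot(dz2) carries the output-layer error back to the hidden layer,
    # and multiplying element-wise by deriv_ReLU(z1) zeroes it wherever the ReLU was inactive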
    dz1 = w2.T.dot(dz2) * deriv_ReLU(z1)
    print('w2.T: ', w2.T.shape)
    print('z1: ', z1.shape)
    print('dz1: ', dz1.shape)
    dw1 = 1/m * dz1.dot(X.T)
    print('X.T: ', X.T.shape)
    print('dw1: ', dw1.shape)
    db1 = 1/m * np.sum(dz1, axis=1)
    print('db1: ', db1.shape)
    return dw1, db1, dw2, db2
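
The gradients computed in `back_prop` can be written out as equations. This is a sketch assuming the loss is the average cross-entropy between the softmax output and the one-hot labels; the notebook does not state the loss, but that is the setting in which `dz2 = A2 - ohY` holds. Here $m$ is the number of images, $\odot$ is the element-wise product, $g'$ is the ReLU derivative, and the sums run over the columns (images).

$$
\begin{aligned}
dZ^{[2]} &= A^{[2]} - Y_{\text{one-hot}} \\
dW^{[2]} &= \frac{1}{m}\, dZ^{[2]} \left(A^{[1]}\right)^{T} \\
db^{[2]} &= \frac{1}{m} \sum_{k=1}^{m} dZ^{[2]}_{:,k} \\
dZ^{[1]} &= \left(W^{[2]}\right)^{T} dZ^{[2]} \odot g'\!\left(Z^{[1]}\right) \\
dW^{[1]} &= \frac{1}{m}\, dZ^{[1]} X^{T} \\
db^{[1]} &= \frac{1}{m} \sum_{k=1}^{m} dZ^{[1]}_{:,k}
\end{aligned}
$$

The fourth line is the chain rule step: $\left(W^{[2]}\right)^{T} dZ^{[2]}$ has shape (10, m), the same as $Z^{[1]}$, so the element-wise product with $g'\!\left(Z^{[1]}\right)$ is well defined.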

In [124]:
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1.reshape(-1,1)    
    W2 = W2 - alpha * dW2  
    b2 = b2 - alpha * db2.reshape(-1,1)    
    return W1, b1, W2, b2


In [125]:
def get_predictions(A2):
    return np.argmax(A2, 0)

def get_accuracy(predictions, Y):
    print(predictions, Y)
    return np.sum(predictions==Y) / Y.size


def gradient_descent(X, Y, iterations, alpha):
    w1, b1, w2, b2 = init_params()
    for i in range(iterations):
        z1, A1, z2, A2 = forward_prop(w1, b1, w2, b2, X)
        dw1, db1, dw2, db2 = back_prop(z1, A1, z2, A2, w2, X, Y)
        w1, b1, w2, b2 = update_params(w1, b1, w2, b2, dw1, db1, dw2, db2, alpha)
        if (i%10) == 0:
            print("Iteration: ", i)
            print(f"Accuracy: {get_accuracy(get_predictions(A2), Y)}")
            print()
    print('w1: ', w1.shape)
    print('b1: ', b1.shape)
    print('w2: ', w2.shape)
    print('b2: ', b2.shape)
    return w1, b1, w2, b2

In [156]:
w1, b1, w2, b2 = gradient_descent(X_train, Y_train, 1, .05)

Forward Propogation
z1:  (10, 41000)
A1:  (10, 41000)
z2:  (10, 41000)
a:  (10, 41000)
np sum:  977974.7422236332
regular sum:  [18.28837951 16.84779586 38.74916146 ... 22.39813265 36.96516505
 13.81445062]
nump:  <class 'numpy.float64'> ()
reg:  <class 'numpy.ndarray'> (41000,)
A2:  (10, 41000)

Back Propogation
Y.size:  41000
[[0. 0. 0. ... 0. 0. 1.]
 [0. 0. 1. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 1. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]] 

ohY:  410000
A2:  (10, 41000)
ohY:  (10, 41000)
dz2:  (10, 41000)
A1.T:  (41000, 10)
dw2:  (10, 10)
db2:  (10,)
w2.T:  (10, 10)
z1:  (10, 41000)
dz1:  (10, 41000)
X.T:  (41000, 784)
dw1:  (10, 784)
db1:  (10,)
Iteration:  0
[5 2 2 ... 2 6 2] [2 3 1 ... 6 2 0]
Accuracy: 0.12292682926829268

w1:  (10, 784)
b1:  (10, 1)
w2:  (10, 10)
b2:  (10, 1)
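
The dev split created earlier is not used in the cells above. One way it might be scored is sketched below, with assumptions stated: the helper names `predict` and `accuracy` are hypothetical, the parameters and inputs are random stand-ins so the block runs on its own, and the softmax is skipped because the largest entry of `z2` in each column is also the largest softmax probability.

```python
import numpy as np


def predict(X, w1, b1, w2, b2):
    """Forward pass without debug prints; returns one predicted digit per column of X."""
    z1 = w1.dot(X) + b1
    A1 = np.maximum(z1, 0)       # ReLU
    z2 = w2.dot(A1) + b2
    return np.argmax(z2, axis=0)  # same argmax as softmax(z2), column by column


def accuracy(predictions, Y):
    """Proportion of predictions that match the labels."""
    return np.mean(predictions == Y)


# Random stand-ins with the same shapes as the notebook's parameters
rng = np.random.default_rng(0)
w1 = rng.random((10, 784)) - 0.5
b1 = rng.random((10, 1)) - 0.5
w2 = rng.random((10, 10)) - 0.5
b2 = rng.random((10, 1)) - 0.5

X_demo = rng.random((784, 6))           # 6 fake images, one per column
Y_demo = rng.integers(0, 10, size=6)    # 6 fake labels

preds = predict(X_demo, w1, b1, w2, b2)
print(preds.shape, accuracy(preds, Y_demo))
```

With trained parameters, calling `accuracy(predict(X_dev, w1, b1, w2, b2), Y_dev)` would give an estimate of how well the network generalizes to images it was not trained on.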
