***
Developing a way to find the acceptable difference between training and real world data. (MNIST)
***

*By Jared Kelnhofer, 10-02-2020*

This notebook is designed to help me complete my DigitRecognizer program. The problem I am trying to solve is this: I don't know whether the images that the user of my program draws are close enough to MNIST's to make a good comparison between the two. Giving the user a square to draw in and resizing the result to 28 by 28 pixels doesn't guarantee that the image can be compared successfully to the MNIST dataset. My results so far make me think that my generated images might be a little too far off from MNIST, which could require me to change different aspects of my image generation.

<font size="6">Getting MNIST ready</font>

The first order of business is to get the MNIST dataset and split it into 11 different datasets: the MNIST set in full, plus one set for each digit. The labels only matter for splitting the data. Once everything is divided into separate CSV files and saved under the ./Dataset directory next to this notebook, we no longer need to bother with labels. Here I grab the dataset, set up some directories, and get a little info about the shape of MNIST.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import random

import os
if not os.path.exists('./Dataset'):
    os.makedirs('./Dataset')
    
if not os.path.exists('./Dataset/individual_digits'):
    os.makedirs('./Dataset/individual_digits')

from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", version=1, data_home="./Dataset")

print("Mnist keys: " + str(mnist.keys()))
print("Shape of \"data\" key: " + str(np.shape(mnist.data)))
print("Shape of \"target\" key: " + str(np.shape(mnist.target)))

Mnist keys: dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])
Shape of "data" key: (70000, 784)
Shape of "target" key: (70000,)
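
Each of the 70,000 rows holds one image flattened to 784 values (28 × 28). As a quick sanity check, a row can be reshaped back into a square grid. The sketch below uses a synthetic row so that it runs without downloading anything; `row_to_image` and `fake_row` are illustrative names, not part of this notebook.

```python
import numpy as np

def row_to_image(row, side=28):
    """Reshape a flat row of side*side pixel values into a (side, side) grid."""
    row = np.asarray(row, dtype=float)
    if row.size != side * side:
        raise ValueError("expected %d values, got %d" % (side * side, row.size))
    return row.reshape(side, side)

# Synthetic stand-in for one MNIST row: pixel k has value k.
fake_row = np.arange(784)
image = row_to_image(fake_row)

print(image.shape)   # (28, 28)
print(image[1, 0])   # 28.0, the first pixel of the second image row
```

With the real data, something like `plt.imshow(row_to_image(np.asarray(mnist.data)[0]), cmap="gray")` should draw the first digit, assuming the pixels are stored row by row.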


<font size="6">Some helpful Methods</font>


This clear_directory function will be handy during development for getting rid of leftover .csv files. (They can be left behind if I reduce the number of files I generate, because only files with matching names get overwritten.)

In [129]:
def clear_directory(folder):
    import os, shutil
    for filename in os.listdir(folder):
        file_path = os.path.join(folder, filename)
        try:
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            print('Failed to delete %s. Reason: %s' % (file_path, e))

I'll include a TIMESAVER constant to speed things up during development. The higher TIMESAVER is, the fewer instances the get_digit_and_save_to_csv function will check. (Obviously, if you make it 70,000 or more, then it will check no instances at all, as MNIST only has 70,000 of them.) Also, you can alter the digits_to_get array to include fewer digits for testing purposes.

In [132]:
TIMESAVER = 0
print("we're doing: " + str(len(mnist.target)-TIMESAVER) + " iterations" )

we're doing: 70000 iterations
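
The arithmetic behind TIMESAVER can be checked without touching MNIST. The sketch below uses a hypothetical helper, `iterations_for`, which is not part of this notebook. It computes how many instances get examined: the dataset length minus TIMESAVER, and never fewer than zero.

```python
def iterations_for(total_instances, timesaver):
    """Number of instances examined for a given TIMESAVER value."""
    # Clamp at zero: a TIMESAVER larger than the dataset means nothing is examined.
    return max(total_instances - timesaver, 0)

print(iterations_for(70000, 0))      # 70000
print(iterations_for(70000, 69000))  # 1000
print(iterations_for(70000, 80000))  # 0
```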


I'll write a function *get_digit_and_save_to_csv* that can grab a huge array from MNIST containing only the instances of a specified digit. Then we'll use this function to split the dataset into its 10 components.

In [136]:
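def get_digit_and_save_to_csv(mnist_instance, digit_name):

    filename = "./Dataset/individual_digits/" + str(digit_name) + ".csv"

    # Only look at the first len(target) - TIMESAVER instances (never fewer than 0).
    limit = max(len(mnist_instance.target) - TIMESAVER, 0)
    data = np.asarray(mnist_instance.data)[:limit]
    target = np.asarray(mnist_instance.target)[:limit]

    # Labels are strings ("0" to "9"). Keep the matching pixel rows, not the labels,
    # so the result has shape (number_of_matches, 784).
    digit_array = data[target == str(digit_name)]

    np.savetxt(filename, digit_array, delimiter=',', fmt="%s")
    return digit_array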
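
Two NumPy details matter when collecting the rows for one digit: the labels in this dataset are strings, and `np.append` without an `axis` argument flattens its inputs into a 1-D array. The toy example below uses made-up `data` and `target` arrays (not real MNIST) to show the flattening, and how a boolean mask selects whole rows while keeping the 2-D shape.

```python
import numpy as np

# Tiny stand-in for MNIST: 4 "images" of 3 pixels each, with string labels.
data = np.array([[0, 0, 1],
                 [5, 5, 5],
                 [0, 1, 1],
                 [9, 9, 9]])
target = np.array(["7", "3", "7", "3"])

# np.append without an axis flattens everything into one long 1-D array.
flattened = np.append(np.empty((0, 3)), data[0])
flattened = np.append(flattened, data[2])
print(flattened.shape)  # (6,)

# A boolean mask keeps the 2-D shape and selects every matching row in one step.
sevens = data[target == "7"]
print(sevens.shape)  # (2, 3)
```

The mask version also avoids copying the growing array on every match, which is what repeated `np.append` calls do.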

In [139]:
def get_mnist_csv_files_by_digit(digits_to_get):

    # Start from an empty folder so stale files from earlier runs don't linger.
    clear_directory("./Dataset/individual_digits")

    for digit in digits_to_get:
        get_digit_and_save_to_csv(mnist, digit)

<font size="5">Generating All our CSV files</font>


In [143]:
get_mnist_csv_files_by_digit([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# TODO: create a CSV of the whole MNIST dataset called "all_values" in the same directory

If we add up the number of rows in each of our CSV files, it should come to 70000 (as long as TIMESAVER is 0). Let's check!

In [144]:
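# Count the rows in each digit's CSV and add them up.
total = 0
for digit in range(10):
    with open("./Dataset/individual_digits/" + str(digit) + ".csv") as f:
        total += sum(1 for _ in f)
print("Total instances: " + str(total))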

Now we need a way to tell how different one value is from another. (For the purposes of this notebook, assume that a value is an MNIST image, 28 by 28 pixels.)
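
One simple candidate for such a measure is the Euclidean distance between the pixel values of two images. The sketch below is only an illustration of the idea, not necessarily the measure this notebook ends up using; `euclidean_distance`, `blank`, and `one_pixel` are hypothetical names, and the images are synthetic.

```python
import numpy as np

def euclidean_distance(image_a, image_b):
    """Euclidean distance between two images with the same number of pixels."""
    a = np.asarray(image_a, dtype=float).ravel()
    b = np.asarray(image_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("images must have the same number of pixels")
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Two synthetic 28x28 images: an empty one, and one with a single bright pixel.
blank = np.zeros((28, 28))
one_pixel = blank.copy()
one_pixel[14, 14] = 255

print(euclidean_distance(blank, blank))      # 0.0
print(euclidean_distance(blank, one_pixel))  # 255.0
```

A distance of zero means the two images are identical pixel for pixel; larger numbers mean larger differences. Because the inputs are flattened first, the function accepts either 28 by 28 grids or 784-value rows.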