# Overview

In this assignment, we will walk through the steps of creating a simple rules-based (symbolic) AI to differentiate the handwritten digits 0 and 1. While this might seem trivial, successfully creating such an AI will prove to be much more difficult than expected given the tremendous variation in human handwriting appearance. In part, this will highlight the difficulties of expert rule-based AI even for such a seemingly simple task.

## Restarting your Virtual Machine

If at any point during this assignment you accidentally execute code or do something that you cannot seem to undo and need to "restart" the system (including deleting all temporary folders), run the following single line of code. The restart takes about 1 minute. Afterwards, you will have to start again from the beginning of the assignment to re-download the data and re-run the code you have written. Note that the code you have already written will **not** be deleted; you simply need to execute it once again from the start.

In [0]:
!kill -9 -1 # Warning this restarts your machine



## Downloading the Data

The following commands copy the assignment materials to your local Colaboratory instance and unzip them in preparation for your assignment. The commands are Linux (Unix) based; you will gradually learn them during your time at the CAIDM, but for now do not be concerned about deciphering their meaning (simply click run to set up):


In [0]:
!git clone https://github.com/CAIDMRes/lecture_01
!unzip -q lecture_01/data.zip -d .
!rm -r lecture_01

# Data

## Discovery

The first step of any data science / image analysis pipeline is to take a look at the data. The bash command `ls` is a simple tool to list the contents of a folder; it is prefixed with an exclamation mark `!` to indicate that the command is to be executed by bash (the Terminal) instead of Python.

A simple `!ls` command with nothing else will list the contents of the current (root) folder. If you desire to look **into** another folder, simply write the name (or path) of the folder you wish to see.

Let's start by seeing what we just downloaded:

In [0]:
!ls

As we can see, the data is organized into three folders, `0`, `1` and `2` (the folder `datalab` is there by default). Let us see what is in each folder:

In [0]:
!ls 0 # Cycle through 0, 1 and 2 to look inside

First of all, that's a lot of files! The naming convention for the files is as follows:

"DIGIT-ID.npy" where,

* `DIGIT` represents the number (either zero or one)
* `ID` represents the index of the example, i.e. the n-th example of that digit

For example `0-00508.npy` means that the file is the 508th example of a handwritten zero digit. 
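
The naming convention above can also be unpacked in Python. The following is a minimal sketch, not part of the assignment template; the helper name `parse_digit_filename` is made up for illustration, and it assumes every file name contains exactly one dash as in the examples shown.

```python
import os

def parse_digit_filename(path):
    """Split a name like '0/0-00508.npy' into (digit, example_id)."""
    base = os.path.basename(path)          # drop any folder prefix -> '0-00508.npy'
    stem, _ext = os.path.splitext(base)    # drop the extension     -> '0-00508'
    digit, example_id = stem.split('-')    # split on the dash      -> '0', '00508'
    return int(digit), int(example_id)

print(parse_digit_filename('0-00508.npy'))  # (0, 508)
```

Converting both parts to `int` drops the leading zeros of the ID, which is convenient for sorting or counting but means the original file name cannot be rebuilt without re-padding.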

## Cleaning

Based on file names, we now see that the examples of digits 0 and 1 are randomly distributed within all three directories. Let us first organize this data into two separate folders named `zeros` and `ones`. 

## shutil library

The `os` and `glob` libraries for file/folder management and search (respectively) were covered in the first workshop. If there are questions, feel free to reference those notes. One final thing that we need to understand is how to move files from A to B. To do so, we will use the `shutil` library (short for **sh**ell **util**ities). The method of interest is `shutil.move(src, dst)` where:

* `src` is the location of file A (to be moved)
* `dst` is the location of file B (future location)

Usually the source file is easy to identify. However, the destination file can be tricky because it doesn't currently exist (i.e. we need to construct its name). In general, this destination file name takes on the following pattern:

`/new/path/folder/` + `base_name.extension`

In our examples, the new base folder will be either `zeros/` or `ones/`. The basename will be for example `0-00508.npy`, `0-00509.npy`, etc. To create this basename from the original **full** name like `0/0-00508.npy` use the `os.path.basename()` method:

In [0]:
import os

# DEFINE src
src = '0/0-00508.npy' # File A to be moved

# DEFINE dst
dst_folder = 'zeros/'
dst_base = os.path.basename(src) # Remove prefix folder name e.g. 0-00508.npy

dst = dst_folder + dst_base

# At this point shutil.move(src, dst) will move src to dst

In [0]:
import shutil # must be imported before asking for help on it
shutil.move?

## Data Organization

Now we are ready to organize the data. Again, the goal is to move all the files with `0-XXXXX.npy` format to the `zeros/` folder and all the files with `1-XXXXX.npy` format to the `ones/` folder. Use the following template to accomplish this task:

In [0]:
import os, glob, shutil

# Create the following two folders: 'zeros', 'ones' 
# HINT: use the os library
os.makedirs('zeros', exist_ok=True)
os.makedirs('ones', exist_ok=True)

# Find all files with zero digits (e.g. 0-XXXXX.npy)
# HINT: use the glob library
zeros = glob.glob('*/0*')

# Move all files to the zeros folder
# HINT: use the shutil library
for zero in zeros:
    
    print('Moving: ' + zero)
    dst = 'zeros/' + os.path.basename(zero)
    shutil.move(src=zero, dst=dst)
    
# Repeat for all files with ones digits (e.g. 1-XXXXX.npy)
ones = glob.glob('*/1*')
for one in ones:
    
    print('Moving: ' + one)
    dst = 'ones/' + os.path.basename(one)
    shutil.move(src=one, dst=dst)

In [0]:
!ls ones

To confirm that the move was successful, check that the `zeros` and `ones` folders have been created and contain the appropriate files, using the same `!ls` command as above.

In [0]:
!ls ones # alternate between zeros, ones
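
Scanning a long `!ls` listing by eye is error-prone, so the same check can be done by counting files in Python. This is only a sketch: the helper name `count_by_prefix` is hypothetical, and it assumes the `DIGIT-ID.npy` naming convention described earlier.

```python
import glob
import os

def count_by_prefix(folder):
    """Count the .npy files in `folder`, grouped by the digit before the dash."""
    counts = {}
    for path in glob.glob(os.path.join(folder, '*.npy')):
        digit = os.path.basename(path).split('-')[0]
        counts[digit] = counts.get(digit, 0) + 1
    return counts

# After a successful move we would expect only the key '0' for zeros/
# and only the key '1' for ones/ (an empty dict means nothing was found)
print(count_by_prefix('zeros'))
print(count_by_prefix('ones'))
```

If either dictionary contains an unexpected key, some files ended up in the wrong folder.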

## Loading and Viewing the Data

Alright, now let's take a look at the data that we have. Note that the files themselves are in a special NumPy format (`*.npy`) which is simply the raw image matrix. It can be loaded as shown below:

In [0]:
import numpy as np

# Find all files with a zero digit
zeros = glob.glob('zeros/0-*')

# Load the first file
arr = np.load(zeros[0])

# Let us see the shape of the matrix
print(arr.shape)

Easy enough: the image is a 28 x 28 matrix. Now let's take a look at the image itself. To draw matrices (images) we will use the `pylab` library:

In [0]:
import pylab

pylab.imshow(arr)
pylab.show()

# Algorithm Development

Now let us create an algorithm to differentiate between 0s and 1s. As a simple first pass, let us assume that, in general, an image of the number 1 will have a lower mean pixel value than an image of the number 0, since a 0 usually covers more of the image area than a 1. Let us check whether this is the case:

In [0]:
# Find all files with the number zero (should be in zeros/ folder)
zeros = glob.glob('zeros/*.npy')

# Loop through each file and record its mean in a running list
# HINT 1: use np.mean() to determine mean
# HINT 2: what is the method to append a new item into a list?
means = []
for zero in zeros:
    arr = np.load(zero)
    mean = np.mean(arr)
    means.append(mean)
    
print('The mean of all zeros is: %2.4f' % np.mean(means))

# Find all files with the number one (should be in the ones/ folder)
ones = glob.glob('ones/*.npy')

# Loop through each file and record its mean in a running list
means = []
for one in ones:
    arr = np.load(one)
    mean = np.mean(arr)
    means.append(mean)

print('The mean of all ones is: %2.4f' % np.mean(means))

# Compare the mean values between the two digits



In [0]:
np.mean(means)

Okay, so we think there may be a slight difference between the two, but can this be used to reliably differentiate 0s and 1s? Let us first write a method:

In [0]:
def classify_digit(arr):
    """
    Method to classify if image array is a 0 or 1
    
    We assume that arr is a loaded NumPy image array
    
    """
    # Find mean of the input array
    mean = np.mean(arr)
    
    # Based on your assessment above, what is a reasonable threshold to differentiate 0 and 1?
    threshold = 30
    
    # If mean of input array is greater than a given threshold, classify as 0
    if mean > threshold:
        prediction = 0
    
    # Else classify as 1
    else:
        prediction = 1
    
    # Return your prediction
    return prediction
        

Now, let's test out our hypothesis!

In [0]:
# Find all files with the number zero
zeros = glob.glob('zeros/*.npy')

# Loop through each file, open (with np.load()) and run your method above
preds = {0: 0, 1: 0}

for zero in zeros:
    arr = np.load(zero)
    pred = classify_digit(arr)
    preds[pred] += 1
    
# Calculate the number of mistakes / correct predictions
accuracy = preds[0] / len(zeros)
print('Accuracy for predicting zeros: %0.5f' % accuracy)

# Repeat for all files with the number one
ones = glob.glob('ones/*.npy')

preds = {0: 0, 1: 0}

for one in ones:
    arr = np.load(one)
    pred = classify_digit(arr)
    preds[pred] += 1
    
accuracy = preds[1] / len(ones)
print('Accuracy for predicting ones: %0.5f' % accuracy)
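
Rather than guessing a threshold, one option is to try several candidates and keep the one that scores best. The sketch below is an assumption about how that could be done, not part of the assignment template: the function name `best_threshold` is made up, it follows the same rule as `classify_digit` (mean above the threshold means 0, otherwise 1), and the numbers passed in at the end are invented purely for illustration.

```python
import numpy as np

def best_threshold(zero_means, one_means):
    """
    Try every observed mean as a candidate threshold and return
    (threshold, score) for the candidate with the highest average of
    the two per-digit accuracies.

    Rule being scored: mean > threshold -> 0, otherwise -> 1
    """
    zero_means = np.asarray(zero_means, dtype=float)
    one_means = np.asarray(one_means, dtype=float)

    best = (None, -1.0)
    for t in np.unique(np.concatenate([zero_means, one_means])):
        acc_zero = np.mean(zero_means > t)   # fraction of zeros above the threshold
        acc_one = np.mean(one_means <= t)    # fraction of ones at or below it
        score = (acc_zero + acc_one) / 2
        if score > best[1]:
            best = (float(t), float(score))
    return best

# Invented per-image means, NOT measurements from the dataset
threshold, score = best_threshold([40, 45, 50], [15, 20, 25])
print(threshold, score)
```

To use this on the real data, collect the per-image means of the zeros and the ones in two separate lists and pass both in. A best score well below 1.0 suggests that the mean alone cannot separate the two digits.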


Can you think of alternate ways to improve our differentiation of the two digits? Some thoughts to help you get started:

* The number 1 is generally more "skinny" than the number 0. In other words, if we take the sum of each column in the image, the number 1 should have more empty (or low-valued) columns overall than the number 0.




* The number 0 usually has an empty value (or low values) in the center of the image (e.g. where the expected "hole" should be).
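
The two ideas above can each be turned into a single number per image. The following is a rough sketch under stated assumptions rather than a reference solution: the function names, the `low` cutoff and the window size `half_width` are all made up for illustration, and the two test images are synthetic stand-ins, not samples from the dataset.

```python
import numpy as np

def count_empty_columns(arr, low=0):
    """Number of columns whose pixel sum is at or below `low`."""
    column_sums = np.sum(arr, axis=0)   # one sum per column
    return int(np.sum(column_sums <= low))

def center_mean(arr, half_width=3):
    """Mean pixel value in a small square window around the image center."""
    rows, cols = arr.shape
    r, c = rows // 2, cols // 2
    window = arr[r - half_width:r + half_width, c - half_width:c + half_width]
    return float(np.mean(window))

# Synthetic stand-ins: a thin vertical bar and a hollow box
one_like = np.zeros((28, 28))
one_like[4:24, 13:15] = 255

zero_like = np.zeros((28, 28))
zero_like[6:22, 8:20] = 255
zero_like[9:19, 11:17] = 0   # carve out the "hole"

print(count_empty_columns(one_like), count_empty_columns(zero_like))
print(center_mean(one_like), center_mean(zero_like))
```

On these idealized shapes the bar has more empty columns and a brighter center than the hollow box. Real handwriting is slanted, off-center and varies in thickness, so the cutoffs would need to be tuned against the actual images.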

See if you can implement some (or all) of these. Can you think of a way to incorporate all these strategies into a method? Can you create an algorithm that is 100% accurate?