## Getting Started

#### Instructions: Run the cell below to import the necessary libraries. Links to the official package documentation are provided. The search box on each documentation website is typically the easiest way to find what you need. Searching Google for a language reference question is also usually successful.

In [None]:
# General imports
import math
from IPython.display import display

# Data science imports
import pandas as pd # pandas documentation: https://pandas.pydata.org/docs/reference/index.html#api
import numpy as np # numpy documentation: https://numpy.org/doc/stable/reference/index.html
import scipy as sp

# ML imports
from sklearn import neighbors, metrics    # Main ML models; SK learn documentation https://scikit-learn.org/stable/modules/classes.html

# Graphical imports (not necessary to understand these libraries)
from PIL import Image
from matplotlib import pyplot as plt

# Configuration (not necessary to understand)
np.set_printoptions(threshold=1000, linewidth=500)

#### Instructions: Use the pandas documentation to read about the `read_csv` function. Use this function to read in the data file from `/data/digits.csv`. The `read_csv` function returns a `DataFrame` object. A data frame is essentially a mini spreadsheet table. *Please see `images/Instructional/DF.png` for an example.* Save this data frame in a variable. Starter code for this task is provided below. Hint: use the provided `col_names` variable as the column names for the `DataFrame`.
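
As a minimal sketch of how `read_csv` can attach column names to a file, the example below parses a small in-memory CSV. It assumes the file has no header row; the data and the `toy_col_names` list are made up for illustration, so the real file and the provided `col_names` will differ.

```python
import io

import pandas as pd

# Hypothetical stand-in for a header-less CSV file (two pixel features and a label).
toy_csv = io.StringIO("0,5,3\n2,0,7\n")
toy_col_names = ["pixel_0", "pixel_1", "label"]  # assumed names, not the real ones

# header=None tells pandas the first row is data; names supplies the column labels.
toy_df = pd.read_csv(toy_csv, header=None, names=toy_col_names)
print(toy_df)
```

The same two parameters work when the first argument is a file path instead of an in-memory buffer.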

In [53]:
# digits_data_train = pd.read_csv(...)
# digits_data_test = pd.read_csv(...)

#### Instructions: Call the `display` function on our data frames

In [69]:
### Display data frames here

Before moving on to data preparation, consider what each row of the data frame represents. What does each column represent? Hint: read the dataset description [here](website.com).

## Data Preparation

Let's remember the data we're given. We are given both **feature values** (here, image information) and **labels** (here, the value of the handwritten digit). In the data frame we initialized, all of this information is packaged together. We need to separate the feature values into one data frame (typically denoted `<dataset name>_<train/test>_X`) and the labels (or **targets**) into another (typically denoted `<dataset name>_<train/test>_y`). To accomplish this, we can use the `pop` method of a `pandas` `DataFrame` to "pop off" the labels column. Using the `pandas` documentation as a guide, obtain the following four `DataFrame`s: 

* training features
* training labels
* testing features
* testing labels

*Note: Error checking code provided so that if the cell is rerun and the target column has already been popped off the `DataFrame`, there won't be any errors.*
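
To make the behavior of `pop` concrete, here is a small sketch on a made-up data frame. The column name `"label"` and the `toy_` variables are assumptions for illustration only. Note that `pop` returns the removed column as a `pandas` `Series`, and the original data frame is modified in place.

```python
import pandas as pd

toy_train = pd.DataFrame({"pixel_0": [0, 2], "pixel_1": [5, 0], "label": [3, 7]})

# pop removes the column from the DataFrame in place and returns it as a Series.
toy_train_y = toy_train.pop("label")
toy_train_X = toy_train  # only the feature columns remain

print(list(toy_train_X.columns))
print(toy_train_y.tolist())

# Popping the same column a second time raises KeyError,
# which is the situation the try/except in the next cell guards against.
try:
    toy_train.pop("label")
except KeyError as e:
    print("KeyError:", e)
```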

In [None]:
try:
    ### Add code to get the four data frames here ###
    pass  # placeholder so the cell runs before your code is added
except KeyError as e:
    print("Note: Key Error Exception thrown---the key %s may not exist." % (e))

## First ML model

#### Instructions: Fit a first machine learning model (KNN) using the starter code given below. Using the SK learn documentation as a guide, fill in the missing sections.

In [71]:
knn_classifier = neighbors.KNeighborsClassifier(n_neighbors = 5, algorithm = 'brute') # Creating a KNN classifier object

# Fit the KNN classifier with the TRAINING DATA here 

# SOLUTION:
knn_classifier.fit(digits_train_X, digits_train_y) 

# Score (using the default metric of accuracy) the KNN classifier here USING TEST DATA

# SOLUTION:
knn_classifier.score(digits_test_X, digits_test_y)

# Obtain a list of predictions (and true values) here.

# SOLUTION:
sk_learn_predictions = knn_classifier.predict(digits_test_X)
true_values = digits_test_y

# The following line displays a confusion matrix for the data. This shows the number of test examples in each predicted-label, true-label combination. Can you explain any patterns you see?
metrics.ConfusionMatrixDisplay(metrics.confusion_matrix(true_values, sk_learn_predictions)).plot()

#### Congratulations! You've just fitted your first machine learning model! Was there less "to do" than you expected? You may find it interesting that many different types of machine learning algorithms are implemented by the SK learn library. Furthermore, all SK learn models share the same syntax (with minor differences)! But maybe this first model felt unsatisfying. At the end of the day, you didn't need to implement the algorithm from scratch. Commonly, students of machine learning implement various (tractable) ML algorithms themselves to understand how they work and to gain an appreciation for the underlying mechanics. That's what you will do next!
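
To illustrate the shared syntax, the sketch below fits two different SK learn classifiers on a tiny made-up dataset using identical `fit`, `predict`, and `score` calls. The data and the choice of a decision tree as the second model are assumptions for illustration; they are not part of this notebook's tasks.

```python
from sklearn import neighbors, tree

# Tiny made-up dataset: one feature, two well-separated classes.
toy_X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
toy_y = [0, 0, 0, 1, 1, 1]

models = [
    neighbors.KNeighborsClassifier(n_neighbors=3),
    tree.DecisionTreeClassifier(random_state=0),
]

toy_scores = {}
for model in models:
    name = type(model).__name__
    model.fit(toy_X, toy_y)                    # same fit call for both models
    preds = model.predict([[0.05], [1.15]])    # same predict call
    toy_scores[name] = model.score(toy_X, toy_y)  # same score call (accuracy)
    print(name, preds, toy_scores[name])
```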

## Implementing KNN

#### Instructions: Implement and test the K-nearest neighbors algorithm. Your function should take 5 parameters: `X_train`, `y_train`, `X_test`, `y_test`, and `k`. It should (1) print the prediction, true value, test example number, and whether the classifier was correct for each test example, and (2) at the end, print the accuracy ($=\frac{\text{\# correct predictions}}{\text{\# total predictions}}$). Your implementation may require much more computational time than the library method, since the library method uses low-level C code. <!--It should run in $O(m\cdot n\cdot k\log(k))$ time, where $n$ is the length of `X_test` and $m$ is the length of `X_train`. (Note that since, typically, $k$ is regarded as a small constant with $k \ll n$, the runtime is essentially $O(n)$.)-->

#### You should create a helper function that calculates the distance between two examples in a `DataFrame`.
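
One common choice of distance, shown here as a sketch rather than the required solution, is the Euclidean distance (SK learn's `KNeighborsClassifier` uses it by default). The function name `euclidean_distance` and the toy data frame are hypothetical.

```python
import numpy as np
import pandas as pd


def euclidean_distance(row_a, row_b):
    """Euclidean distance between two equal-length feature vectors."""
    a = np.asarray(row_a, dtype=float)
    b = np.asarray(row_b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Two made-up examples with two features each.
toy_X = pd.DataFrame({"pixel_0": [0, 3], "pixel_1": [0, 4]})

# iloc selects a row by position; each row is passed as a pandas Series.
toy_distance = euclidean_distance(toy_X.iloc[0], toy_X.iloc[1])
print(toy_distance)  # 5.0 (a 3-4-5 right triangle)
```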

In [None]:
### Implement KNN and distance functions here.

In [None]:
### Test your KNN implementation with k = 5 here. That is, ensure your predictions exactly match SK Learn's predictions.


#### Food for thought: Why might we want to avoid using even values of $k$?
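
As a starting point for this question, the sketch below counts votes among the labels of the nearest neighbors and returns every label that shares the highest count. The helper name `top_voted_labels` and the label lists are made up for illustration. Keep in mind that with more than two classes, an odd $k$ reduces ties but does not rule them out.

```python
from collections import Counter


def top_voted_labels(neighbor_labels):
    """Return every label that shares the highest vote count, sorted."""
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    return sorted(label for label, n in counts.items() if n == best)


print(top_voted_labels([3, 3, 8, 8]))     # k = 4: two labels share the top count
print(top_voted_labels([3, 3, 8, 8, 8]))  # k = 5: a single label wins
```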

## Real-World handwriting

#### In this section, we will put our classifiers to use! We will create novel examples of human-handwritten digits and see how well the computer can recognize them! Images are provided from two sources. One source is an online "pixel art" program where you can create a black and white image in a 32x32 pixel grid. The other source is smartphone camera images of actual human handwriting. (Later, you can create images from these sources yourself!)

#### Below is a function `query_from_img`, along with its documentation. Below *that* function is another function called `pre_proc_img`. Its documentation is also provided.

#### Instructions: Pass the provided numeral example `png`s (found in the `images/TrueHandwriting` directory) to `pre_proc_img` and then to `query_from_img`, and save the resulting `DataFrame`s in appropriately named variables. (You may use an intermediate variable to store the result between the two function calls.) Then, pass these `DataFrame`s into the `predict` method of the SK learn KNN classifier from before. How did the ML model do? Complete the same task using the pre-processed images in the `images/PixelArt` directory. This time, you don't need to pass the images to `pre_proc_img`.
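
The core of `query_from_img` is a compression step: it sums each 4x4 block of a 32x32 binary grid into one cell of an 8x8 grid, so each of the 64 feature values lies between 0 and 16. The sketch below reproduces that loop on a made-up grid and compares it with an equivalent vectorized form; the `toy_grid` data and the `reshape` variant are illustrative and not part of the provided function.

```python
import numpy as np

# Hypothetical 32x32 binary image: a filled 8x8 square in the top-left corner.
toy_grid = np.zeros((32, 32))
toy_grid[:8, :8] = 1

# Loop form: each 4x4 block of the 32x32 grid is summed into one cell of the 8x8 grid.
loop_result = np.zeros((8, 8))
for i in range(32):
    for j in range(32):
        loop_result[i // 4, j // 4] += toy_grid[i, j]

# Equivalent vectorized form: view the grid as 8x8 blocks of 4x4 pixels and sum each block.
vectorized_result = toy_grid.reshape(8, 4, 8, 4).sum(axis=(1, 3))

print(vectorized_result[:3, :3])
print(np.array_equal(loop_result, vectorized_result))
```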

In [None]:
def query_from_img(image_file_path, verbose = False):
    """ function query_from_img: 
        Preprocessing function to turn an image into a DataFrame query so it can be passed into ML classifier.

    Args:
        image_file_path (string): Path to image file to be processed
        verbose (boolean, optional): Whether or not to print out preprocessing information about the input image. Defaults to False.

    Returns:
        (pandas DataFrame): Contains a single novel query which is a feature vector representation of input image
    """
    
    newArr = np.zeros(shape=(32,32))
    img = Image.open(image_file_path).convert('RGBA')
    white = Image.new("RGBA", img.size, "WHITE")
    white.paste(img, (0, 0), img)    
    img = white.convert('L')
    arr = np.array(img)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] == 0: 
                newArr[i, j] = 1
    
    if verbose:
        print("=== Array representation of handwritten digit ===")       
        print(newArr)
        print()

    
    finalArr = np.zeros(shape=(8, 8))
    for i in range(32):
        for j in range(32):
            finalArr[i//4, j//4] += newArr[i, j]
    
    if verbose:
        print("=== Compressed array representation of handwritten digit ===")
        print(finalArr)
        print()
        print("=== Graphical pseudo-greyscale representation of image ===")
        plt.figure(1, figsize=(3, 3))
        plt.imshow(finalArr, cmap=plt.cm.gray_r, interpolation="nearest")
        plt.show()
    
    feature_vector = pd.DataFrame(finalArr.ravel()).T
    feature_vector.columns = col_names[:-1]
    
    if verbose:
        print("=== Feature vector representation of handwritten digit ===")
        display(feature_vector)
    
    return feature_vector

In [None]:
import matplotlib.pyplot as plt

def pre_proc_img(filename):
    """ function pre_proc_img:
        Given an image, process this into a compressed 32 x 32 pixel representation, similar to those images found in the "digits" study. I.e., this function prepares images for the ML model.

    Args:
        filename (str): filename of input image

    Returns:
        str: file path to 32 x 32 png similar to those images found in the "digits" study.
    """
    image_file = Image.open(filename)  # open colour image
    levels = np.array(image_file).ravel()
    thsh = np.quantile(levels, 0.15)
    def fn(x): return 255 if x > thsh else 0
    image_file = image_file.convert('L').point(fn, mode='1')
    image_file = image_file.resize((256, 256))
    pix_ar = np.array(image_file)
    final_ar = np.zeros(shape=(32, 32))
    for i in range(32):
        for j in range(32):
            total_on = 0
            for k in range(8):
                for l in range(8):
                    if pix_ar[i*8+k, j*8+l] == False:
                        total_on += 1
            if total_on > 4:
                final_ar[i, j] = 1
                if i not in [0, 1, 30, 31] and j not in [0, 1, 30, 31]:
                    final_ar[i - 1, j] = 0
                    final_ar[i + 1, j] = 0
                    final_ar[i, j - 1] = 0
                    final_ar[i, j + 1] = 0
                    final_ar[i - 1, j - 1] = 0
                    final_ar[i - 1, j + 1] = 0
                    final_ar[i + 1, j + 1] = 0
                    final_ar[i + 1, j - 1] = 0
            else:
                final_ar[i, j] = 255
    new_img = Image.fromarray(final_ar).convert("RGBA")
    # *Student file paths will differ!*
    path = ("/content/drive/MyDrive/Instructor-ML-Codebook-main/images/External_Processed_" + filename[-5] + ".png")
    new_img.save(path)
    return path

In [None]:
# Your code here



##### Instructions to create images from a camera: 
1. Write your favorite digit 0-9 on a piece of paper with a marker. The size of the digit should be approximately 1.5-3 inches in width/height. Take a picture of this with a camera. Make sure the digit is very clear and has good lighting and contrast.
2. Import the image into your computer. The format of the image may be `.png` or `.jpeg`. If this is not the format, you must convert it. The website https://www.cloudconvert.com is a good option.  
3. Crop the image so there is a bit of white space around the digit. Use the provided examples in the `images/TrueHandwriting` directory as a guide. If you don't have image processing software on your computer, you can use the following website: https://www.iloveimg.com/crop-image  
4. Save image appropriately in the `images/TrueHandwriting` directory.

##### Instructions to create images from the pixel art program:  
1. Go to https://www.piskelapp.com/p/create/sprite.  
2. Ensure the canvas is 32x32 pixels by selecting the "Resize" menu option from the right-hand toolbar. *See `images/Instructional/Resize.png` for a screenshot.*

3. Make sure you select the following pen tools for best results. *See `images/Instructional/Resize.png` for a screenshot.*

4. Make your drawing of your favorite digit.  
5. Export the image by selecting the Export menu from the right-hand toolbar... *See `images/Instructional/Export1.png` for a screenshot.*

6. ... ensuring you are on the "PNG" tab of the Export menu, download your digit with the first "download" button. *See `images/Instructional/Export2.png` for a screenshot.*

7. Save the image appropriately in the `images/PixelArt` directory.