## Getting Started

#### Instructions: Run the cell below to import the necessary libraries. Links to official package documentation are provided.

In [None]:
# General imports
import math, sys
from IPython.display import display


# Data science imports
import pandas as pd # pandas documentation: 
import numpy as np # numpy documentation:
import scipy as sp

# ML imports
from sklearn import neighbors, tree, svm, linear_model, metrics    # Main ML models

# Graphical imports
from PIL import Image
from matplotlib import pyplot as plt # matplotlib documentation:

# Configuration
np.set_printoptions(threshold=sys.maxsize, linewidth=500)

#### Instructions: Use the pandas documentation to read about how the `read_csv` function works. Use this function to read in the data file from `/data/digits.csv`. The `read_csv` function returns a `DataFrame` object. A data frame is essentially a mini spreadsheet table. Save this data frame in a variable. Starter code for this task is provided below. Hint: use the provided `col_names` variable as the column names for the `DataFrame`.

In [None]:
col_names = ["pixel " + str(i) for i in range(64)] + ["true digit"]

In [None]:
# digits_data_train = pd.read_csv(...)
# digits_data_test = pd.read_csv(...)
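
As a rough illustration of how `read_csv` can attach column names, here is a minimal sketch that reads two made-up rows from an in-memory string instead of the real file. It assumes the data has no header row (hence `header=None`); check the actual file before relying on that. The names `fake_csv` and `demo_frame` are hypothetical.

```python
import io
import pandas as pd

col_names = ["pixel " + str(i) for i in range(64)] + ["true digit"]

# Two made-up rows standing in for the real file: 64 pixel values, then the label.
fake_csv = io.StringIO(
    ",".join(["0"] * 64 + ["7"]) + "\n" +
    ",".join(["1"] * 64 + ["3"]) + "\n"
)

# header=None tells pandas the first line is data, not column names;
# names=col_names supplies the column labels instead.
demo_frame = pd.read_csv(fake_csv, header=None, names=col_names)

print(demo_frame.shape)
print(demo_frame["true digit"].tolist())
```

If the real file does contain a header row, `header=None` would turn that row into data, so inspect the first few lines of the file first.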

#### Instructions: Call the `display` function on our data frames

In [None]:
### Display data frames here

Before moving on to data preparation, consider what each row of the data frame represents. What does each column represent? Hint: read the dataset description [here](website.com)

## Data Preparation

Let's recall the data we're given: both **feature values** (here, image information) and **labels** (here, the value of the handwritten digit). In the data frames we initialized, all of this information is packaged together. We need to separate the feature values into one object (typically denoted `<dataset name>_<train/test>_X`) and the labels (or **targets**) into another (typically denoted `<dataset name>_<train/test>_y`). To accomplish this, we can use the `pop` method of a `pandas` `DataFrame` to "pop off" the labels column. Note that `pop` removes the column from the data frame in place and returns it as a `Series`. Using the `pandas` documentation as a guide, obtain the following four objects:

* training features
* training labels
* testing features
* testing labels

*Note: Error-checking code is provided so that if the cell is rerun after the target column has already been popped off the `DataFrame`, no error is raised.*

In [None]:
try:
    ### Add code to get the four data frames here ###
    pass  # placeholder so the cell runs as-is; remove once your code is added
except KeyError as e:
    print("Note: Key Error Exception thrown---the key %s may not exist." %(e))
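
To see what `pop` does before applying it to the real data, here is a minimal sketch on a tiny made-up data frame. The names `toy_train`, `toy_train_X`, and `toy_train_y` are hypothetical; only the column name `"true digit"` comes from `col_names` above.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "pixel 0": [0, 5, 9],
    "pixel 1": [3, 3, 1],
    "true digit": [4, 4, 7],
})

# pop removes the column from toy_train in place and returns it as a Series.
toy_train_y = toy_train.pop("true digit")
toy_train_X = toy_train  # what remains holds the features only

# Popping the same column a second time raises KeyError,
# which is why the starter cell wraps the call in try/except.
try:
    toy_train.pop("true digit")
except KeyError:
    second_pop_failed = True
else:
    second_pop_failed = False

print(list(toy_train_X.columns), toy_train_y.tolist(), second_pop_failed)
```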

## First ML model

#### Instructions: Fit a first machine learning model (KNN) using the starter code given below. Using the scikit-learn documentation as a guide, fill in the missing sections.

In [None]:
knn_classifier = neighbors.KNeighborsClassifier(n_neighbors = 5, algorithm = 'brute') # Creating a KNN classifier object

# Fit the KNN classifier with the TRAINING DATA here

# Score (using the default metric of accuracy) the KNN classifier here USING TEST DATA

# Obtain a list of predictions here 
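
The fit, score, and predict steps can be sketched on a small made-up dataset as follows. This is only an illustration of the call pattern, not the solution for the digits data; every `toy_` name and the data itself are hypothetical.

```python
import numpy as np
from sklearn import neighbors

# Made-up 2-feature data: class 0 clusters near the origin, class 1 near (10, 10).
toy_X_train = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
toy_y_train = np.array([0, 0, 0, 1, 1, 1])
toy_X_test = np.array([[0.5, 0.5], [10.5, 10.5]])
toy_y_test = np.array([0, 1])

toy_knn = neighbors.KNeighborsClassifier(n_neighbors=3, algorithm="brute")

toy_knn.fit(toy_X_train, toy_y_train)                 # learn from the training data
toy_accuracy = toy_knn.score(toy_X_test, toy_y_test)  # mean accuracy on the test data
toy_predictions = toy_knn.predict(toy_X_test)         # one predicted label per test row

print(toy_accuracy, toy_predictions)
```

The same three calls accept `DataFrame` and `Series` inputs, so the feature and label objects from the previous section can be passed in directly.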

#### Instructions: Uncomment the following line to see a *confusion matrix* for the data. This shows the number of test examples in each predicted-label, true-label combination. Can you explain any patterns you see?

In [None]:
# metrics.ConfusionMatrixDisplay(metrics.confusion_matrix(true_values, predictions)).plot()
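
To make the layout of a confusion matrix concrete, here is a minimal sketch on made-up labels (`toy_true` and `toy_pred` are hypothetical). In the array returned by `metrics.confusion_matrix`, row `i` is the true label and column `j` is the predicted label, so off-diagonal entries are mistakes.

```python
from sklearn import metrics

toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2, 2]

# Entry [i, j] counts the examples whose true label is i and predicted label is j.
toy_matrix = metrics.confusion_matrix(toy_true, toy_pred)
print(toy_matrix)
```

Here one true `0` was predicted as `1` and one true `1` was predicted as `2`; every other example sits on the diagonal.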

#### Congratulations! You've just fitted your first machine learning model! Was there less "to do" than you expected? You may find it interesting that the scikit-learn library implements many different types of machine learning algorithms. Furthermore, they share essentially the same interface (`fit`, `predict`, `score`), with minor differences! But maybe this first model felt unsatisfying: at the end of the day, you didn't implement the algorithm yourself. Students of machine learning commonly implement various (tractable) ML algorithms from scratch to understand how they work and to gain an appreciation for the underlying mechanics. That's what you will do next!

## Implementing KNN

#### Instructions: Implement and test the K-nearest neighbors algorithm. Your function should take 5 parameters: `X_train`, `y_train`, `X_test`, `y_test`, and `k`. It should (1) print, for each test example, the example number, the prediction, the true value, and whether the classifier was correct, and (2) return the accuracy of the model. (Your implementation may require much more computation time than the library method, since the library relies on optimized, compiled code.) <!--It should run in $O(m\cdot n\cdot k\log(k))$ time, where $n$ is the length of `X_test` and $m$ is the length of `X_train`. (Note that since, typically, $k$ is regarded as a small constant with $k \ll m$, the runtime is essentially $O(m\cdot n)$.)-->

#### You should create a helper function that calculates the distance between two examples in a `DataFrame`.

In [None]:
### Implement KNN and distance functions here.
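
One possible shape for the distance helper and the voting step is sketched below on plain Python sequences. It assumes Euclidean distance and simple majority voting; the function names, the toy data, and the tie-breaking behavior are illustrative assumptions, and the result is not guaranteed to match scikit-learn in every tie case. It also omits the per-example printing the task asks for.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict_one(train_rows, train_labels, query, k):
    """Predict a label for one query by majority vote among its k nearest rows."""
    # Pair each training row's distance with its label, then keep the k closest.
    by_distance = sorted(
        (euclidean_distance(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # most_common(1) returns the label with the highest vote count.
    return Counter(nearest_labels).most_common(1)[0][0]

toy_rows = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_queries = [(0.5, 0.5), (10.5, 10.5)]
toy_truth = [0, 1]

toy_guesses = [knn_predict_one(toy_rows, toy_labels, q, k=3) for q in toy_queries]
toy_accuracy = sum(g == t for g, t in zip(toy_guesses, toy_truth)) / len(toy_truth)
print(toy_guesses, toy_accuracy)
```

To use something like this with data frames, one option is to iterate over the rows of `X_train.to_numpy()` and `X_test.to_numpy()`.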

In [None]:
### Test your KNN implementation with k = 5 here. Ensure that your predictions exactly match scikit-learn's predictions.

#### Food for thought: Why might we want to avoid using even values of $k$?
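
As one way to explore this question, the sketch below counts votes for a hypothetical query whose 4 nearest neighbors split evenly between two labels. The label values are made up.

```python
from collections import Counter

# Labels of the 4 nearest neighbors for some hypothetical query.
nearest_labels_k4 = [3, 8, 8, 3]

votes = Counter(nearest_labels_k4)
top_count = max(votes.values())

# Every label that reaches the top vote count; more than one means a tie.
tied_labels = sorted(label for label, count in votes.items() if count == top_count)
print(tied_labels)
```

With two labels tied at two votes each, the classifier needs an extra rule to pick a winner. Note that with more than two classes, an odd $k$ reduces such ties but does not rule them out (for example, a 2-2-1 split with $k = 5$).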

## Real-World Handwriting

#### In this section, we will put our classifiers to use! We will create novel examples of human-handwritten digits and see how well the computer can recognize them! The function `process_human_example` below handles the preprocessing; its docstring describes its arguments and return value.

In [None]:
def process_human_example(image_file_path, verbose = True):
    """ function process_human_example: 
        Preprocessing function to turn an image into a DataFrame query so it can be passed into ML classifier.

    Args:
        image_file_path (string): Path to image file to be processed
        verbose (bool, optional): Whether or not to print out preprocessing information about the input image. Defaults to True.

    Returns:
        (pandas DataFrame): Contains a single novel query which is a feature vector representation of input image
    """
    
    newArr = np.zeros(shape=(32,32))
    img = Image.open(image_file_path).convert('RGBA')
    arr = np.array(img)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j, 3] == 255: 
                newArr[i, j] = 1
    
    if verbose:
        print("=== Array representation of handwritten digit ===")       
        print(newArr)
        print()

    
    finalArr = np.zeros(shape=(8, 8))
    for i in range(32):
        for j in range(32):
            finalArr[i//4, j//4] += newArr[i, j]
    
    if verbose:
        print("=== Compressed array representation of handwritten digit ===")
        print(finalArr)
        print()
        print("=== Graphical pseudo-greyscale representation of image ===")
        plt.figure(1, figsize=(3, 3))
        plt.imshow(finalArr, cmap=plt.cm.gray_r, interpolation="nearest")
        plt.show()
    
    feature_vector = pd.DataFrame(finalArr.ravel()).T
    feature_vector.columns = col_names[:-1]
    
    if verbose:
        print("=== Feature vector representation of handwritten digit ===")
        display(feature_vector)
    
    return feature_vector
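
The two nested loops in `process_human_example` (thresholding the alpha channel, then summing each 4x4 block) can be expressed more compactly with NumPy. The sketch below is an equivalent formulation on a synthetic RGBA array rather than a real image file; the names `rgba`, `binary`, and `compressed` are hypothetical, and like the original function it assumes a 32x32 input.

```python
import numpy as np

# Synthetic 32x32 RGBA image: fully transparent except an opaque 4x8 patch.
rgba = np.zeros((32, 32, 4), dtype=np.uint8)
rgba[0:4, 0:8, 3] = 255

# 1 where the alpha channel is fully opaque, 0 elsewhere (same test as the loop).
binary = (rgba[:, :, 3] == 255).astype(float)

# Split each axis into 8 blocks of 4 pixels and sum inside each block,
# so every cell of the 8x8 result counts opaque pixels in one 4x4 block (0 to 16).
compressed = binary.reshape(8, 4, 8, 4).sum(axis=(1, 3))

print(compressed.shape, compressed[0, 0], compressed[0, 1])
```

Because each output cell is a count over 16 pixels, the feature values range from 0 to 16.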

#### Instructions: Pass the provided numeral example `png`s (found in the `images` directory) to `process_human_example` and save the resulting `DataFrame`s in appropriately named variables. Then, pass these `DataFrame`s into the `predict` method of the scikit-learn KNN classifier from before. How did the ML model do?

In [None]:
# Your code here