<h1><center><font size="6">Tensorflow/Keras/GPU for Chinese MNIST Prediction</font></center></h1>


# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Prepare the analysis</a>   
- <a href='#4'>Characters classification</a>       
- <a href='#5'>Conclusions</a>       


# <a id='1'>Introduction</a>  


We will use RAPIDS to solve the Chinese MNIST problem.

For more details about the problem, you can check this Notebook: [Tensorflow/Keras/GPU for Chinese MNIST Prediction](https://www.kaggle.com/gpreda/tensorflow-keras-gpu-for-chinese-mnist-prediction)


We will follow the preparation steps from that Notebook but change the solution approach to use KNN with RAPIDS, as shown in Chris Deotte's Notebook: 
[RAPIDS GPU kNN - MNIST - [0.97]](https://www.kaggle.com/cdeotte/rapids-gpu-knn-mnist-0-97/data).

Note: I updated the RAPIDS installation steps, drawing on this Notebook: [👨‍🎓Answer Correctness - RAPIDS crazy fast](https://www.kaggle.com/andradaolteanu/answer-correctness-rapids-crazy-fast)

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='2'>Prepare the analysis</a>   


Before starting the analysis, we need a few preparation steps: install RAPIDS from the dataset, load the packages, and load and inspect the data.



## <a id='21'>Install RAPIDS & load packages</a>




In [None]:
%%time
import sys
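# Copy the RAPIDS 0.15.0 archive from the attached dataset and unpack it as a conda environment
!cp ../input/rapids/rapids.0.15.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
# Put the RAPIDS environment first on the Python import path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
# Copy the bundled XGBoost shared library into the default conda lib folder
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/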

We load the packages used for the analysis.

In [None]:
import cudf, cuml
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split, KFold
from cuml.neighbors import KNeighborsClassifier, NearestNeighbors
print('cuML version',cuml.__version__)

We also set a number of parameters for the data and model.

In [None]:
IMAGE_PATH = '..//input//chinese-mnist//data//data//'
IMAGE_WIDTH = 64
IMAGE_HEIGHT = 64
IMAGE_CHANNELS = 1
TEST_SIZE = 0.2
VAL_SIZE = 0.2

<a href="#0"><font size="1">Go to top</font></a>  


## <a id='22'>Load the data</a>  

Let's first see what data files we have in the root directory.

In [None]:
import os
os.listdir("..//input//chinese-mnist")

There is a dataset file and a folder with images.  

Let's load the dataset file first.

In [None]:
data_df=pd.read_csv('..//input//chinese-mnist//chinese_mnist.csv')

In [None]:
image_files = list(os.listdir(IMAGE_PATH))
print("Number of image files: {}".format(len(image_files)))

In [None]:
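def create_file_name(x):
    # Build the image file name from the first three columns of the row (accessed by position)
    file_name = f"input_{x.iloc[0]}_{x.iloc[1]}_{x.iloc[2]}.jpg"
    return file_name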

In [None]:
data_df["file"] = data_df.apply(create_file_name, axis=1)

In [None]:
file_names = list(data_df['file'])
print("Matching image names: {}".format(len(set(file_names).intersection(image_files))))
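
The check above boils down to a set intersection between the file names derived from the CSV rows and the names found on disk. Here is a minimal sketch with made-up rows and a made-up directory listing. The tuples, the `files_on_disk` list, and the assumption that the three name components are suite, sample, and code are illustrative and not taken from the dataset.

```python
# Hypothetical (suite_id, sample_id, code) rows; the real values come from chinese_mnist.csv.
rows = [(1, 1, 10), (1, 2, 10), (2, 1, 3)]

# Hypothetical directory listing; one expected file is deliberately missing.
files_on_disk = ["input_1_1_10.jpg", "input_2_1_3.jpg", "notes.txt"]

# Build the expected file name for each row, following the input_<a>_<b>_<c>.jpg pattern.
expected = {f"input_{suite}_{sample}_{code}.jpg" for suite, sample, code in rows}

matching = expected & set(files_on_disk)  # names present in both places
missing = expected - set(files_on_disk)   # names listed in the CSV but absent on disk

print(f"Matching image names: {len(matching)}")
print(f"Missing image files: {sorted(missing)}")
```

If the number of matching names equals the number of rows, every row has its image; a non-empty `missing` set points to the rows that need attention.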

Let's also check the number of suites and the sample IDs.

In [None]:
print(f"Number of suites: {data_df.suite_id.nunique()}")
print(f"Samples: {data_df.sample_id.unique()}")

# <a id='4'>Characters classification</a>

Our objective is to use the images inspected so far to correctly identify the Chinese numbers (characters).   

We have a single dataset, so we have to split it into **train** and **test** sets. The **train** set is used to train the model, and the **test** set is used to measure the model's accuracy on fresh data that was not used in training.



## <a id='40'>Split the data</a>  

First, we split the whole dataset into train and test sets. We use **random_state** to ensure reproducibility of the results. We also use **stratify** to keep the label distribution balanced across the resulting sets. 

The train-test split is **80%** for the training set and **20%** for the test set.


In [None]:
train_df, test_df = train_test_split(data_df, test_size=TEST_SIZE, random_state=42, stratify=data_df["code"].values)
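
To illustrate what **stratify** does, here is a minimal pure-Python sketch that splits toy labels so that each class keeps the same share in both parts. The helper `stratified_split` and the toy labels are hypothetical; this is an illustration of the idea, not how scikit-learn implements `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) so that each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    # Group sample indices by label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Send the same fraction of every class to the test part.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 3 classes with 10 samples each (not the real dataset).
labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Without stratification, a random 20% sample could by chance contain few or no examples of some class, which would distort the per-class test metrics.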

A further split of the **train** set into **train** and **validation** (**80%** / **20%**, see `VAL_SIZE`) would let us measure not only how well the model fits the training data but also how well it generalizes, so that we can understand both the bias and the variance of the model. The KNN experiments in this Notebook use only the train and test sets, so that split is not performed here.  

Let's check the size of the two datasets.

In [None]:
print("Train set rows: {}".format(train_df.shape[0]))
print("Test  set rows: {}".format(test_df.shape[0]))

We are now ready to start building our first model.

## <a id='41'>Build the model</a>    


The next step is to create the predictive model.  

Let's define a few auxiliary functions that we will need for building our models.

* A function that reads an image from its file and turns it into a flat feature vector for KNN
* A function that prepares the data: it calls the image reading function and applies label encoding
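
KNN works on flat feature vectors, so each 64 x 64 grayscale image has to become a vector of 4096 pixel values. Below is a minimal NumPy sketch of one way to do that, using a synthetic array in place of a real image (the `fake_image` array is an assumption made for illustration).

```python
import numpy as np

IMAGE_WIDTH = 64
IMAGE_HEIGHT = 64

# Synthetic stand-in for a grayscale image loaded from disk (values 0..255).
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(IMAGE_HEIGHT, IMAGE_WIDTH), dtype=np.uint8)

# Flatten row by row: pixel (r, c) lands at index r * IMAGE_WIDTH + c.
flat = fake_image.reshape(-1)
print(flat.shape)

# The flattening is lossless: reshaping back recovers the original image.
restored = flat.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
print(np.array_equal(restored, fake_image))
```

Because every pixel is kept, distances between flat vectors compare the images pixel by pixel.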

In [None]:
import cv2
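def read_image(file_name):
    # Read the image as a single-channel (grayscale) array
    image_data = cv2.imread(IMAGE_PATH + file_name, cv2.IMREAD_GRAYSCALE)
    # Resize to the expected size, then flatten row by row into a 1-D vector
    # of IMAGE_WIDTH * IMAGE_HEIGHT pixel values
    image = cv2.resize(image_data, (IMAGE_WIDTH, IMAGE_HEIGHT))

    return image.reshape(-1)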

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(train_df['character'])
print(le.classes_)
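
`LabelEncoder` maps each distinct label to an integer index, with the classes stored in sorted order. The same idea can be sketched in plain Python on a few example characters. This is an illustration only, not scikit-learn's implementation, and the sample list is hypothetical.

```python
# A few Chinese numerals as example labels (a small sample, not the full label set).
characters = ["三", "一", "二", "一", "三"]

# Sorted unique classes, analogous to LabelEncoder.classes_.
# Strings sort by Unicode code point, not by numeric meaning.
classes = sorted(set(characters))

# Forward mapping (character -> integer) and the inverse lookup.
to_index = {ch: i for i, ch in enumerate(classes)}
encoded = [to_index[ch] for ch in characters]
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
```

Note that the integer assigned to a character is only an identifier: it does not correspond to the number the character represents.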

In [None]:
def prepare_data(dataset,label_encoding=le):
    X = np.stack(dataset['file'].apply(read_image))
    y = label_encoding.transform(dataset['character'])
    return X, y

In [None]:
X_train, y_train = prepare_data(train_df)
X_test, y_test = prepare_data(test_df)

Now we are ready to start experimenting with the KNN model.
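
As a reminder of what a KNN classifier does, here is a minimal CPU-only NumPy sketch: compute distances to all training points, take the k closest, and return the majority label. The function `knn_predict` and the toy data are hypothetical, Euclidean distance is an assumption, and this is not how cuML implements its GPU version.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Classify each query row by majority vote among its k nearest training rows."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k closest training points for each query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote over the neighbours' labels (ties go to the smallest label).
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy 2-D data: class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                  [5.0, 5.0], [5.3, 4.8], [4.7, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.2, 0.3], [4.9, 5.2]])

toy_pred = knn_predict(X_toy, y_toy, X_query, k=3)
print(toy_pred)
```

There is no real training step: "fitting" only stores the data, and all the work happens at prediction time, which is why a GPU implementation helps on larger image sets.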

In [None]:
for k in range(1,16):
    # Fit a KNN classifier with k neighbors on the training set
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # Class probabilities for the test and training samples
    y_hat_p = knn.predict_proba(X_test)
    y_tr_hat_p = knn.predict_proba(X_train)
    # Predicted class = index of the highest probability
    y_pred = pd.DataFrame(y_hat_p).values.argmax(axis=1)
    y_tr_pred = pd.DataFrame(y_tr_hat_p).values.argmax(axis=1)
    # Accuracy = fraction of correctly classified samples
    acc = (y_pred==y_test).sum()/y_test.shape[0]
    acc_tr = (y_tr_pred==y_train).sum()/y_train.shape[0]
    print(f"k: {k} accuracy(train): {round(acc_tr,3)} accuracy(test): {round(acc,3)} ")
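
The loop turns class probabilities into predicted labels with `argmax` and then computes accuracy as the fraction of matching labels. A small self-contained sketch of those two steps, with made-up probabilities and labels chosen purely for illustration:

```python
import numpy as np

# Hypothetical class probabilities for 4 samples and 3 classes (each row sums to 1).
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])
true_labels = np.array([0, 1, 2, 1])

# Predicted label = index of the largest probability in each row.
pred_labels = proba.argmax(axis=1)

# Accuracy = fraction of predictions that match the true labels.
accuracy = (pred_labels == true_labels).mean()

print(pred_labels, accuracy)
```

Taking the mean of the boolean comparison is equivalent to dividing the number of matches by the number of samples, as done in the loop.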

In [None]:
from sklearn.metrics import classification_report
# Per-class report for the predictions of the last model fitted in the loop above (k=15)
print(classification_report(y_test, y_pred, target_names=le.classes_))