# Digit Recognizer Data Preparation Notebook

This notebook prepares the data for the Kaggle [Digit Recognizer competition](https://www.kaggle.com/competitions/digit-recognizer/overview), which describes the task as follows:

>MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

>In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.

## Install necessary packages

We list all dependencies in the requirements.txt file and install them with pip.

In [None]:

%pip install -r requirements.txt --user --quiet

If this is the first time you have run this pip command, restart the kernel before continuing.

## Imports

In this section, we import the packages needed in this example.  It is good practice to gather your imports into a single place.  

In [None]:
# Imports
import os
import random
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from zipfile import ZipFile


# NetApp DataOps Toolkit functions for managing Kubernetes volumes and snapshots
from netapp_dataops.k8s import (
    clone_volume, create_volume, delete_volume, list_volumes,
    create_volume_snapshot, delete_volume_snapshot, list_volume_snapshots,
    restore_volume_snapshot,
)

## Raw Data

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

The training data set (train.csv) has 785 columns. The first column, called "label", is the digit that was drawn by the user. The remaining 784 columns contain the pixel values of the associated image.

- Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels.
- Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

The test data set (test.csv) has the same layout as the training set, except that it does not contain the "label" column.
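
The layout described above can be illustrated with a small synthetic frame. This is only a sketch: the column names `pixel0` … `pixel783` and the random values are assumptions for illustration, not data read from train.csv.

```python
import numpy as np
import pandas as pd

# Build a tiny synthetic frame with the same layout as train.csv:
# one "label" column followed by 784 pixel columns.
rng = np.random.default_rng(0)
n_rows = 3
pixel_cols = [f"pixel{i}" for i in range(28 * 28)]  # assumed column names
pixels = rng.integers(0, 256, size=(n_rows, 28 * 28))  # values in 0..255
labels = rng.integers(0, 10, size=n_rows)               # digits 0..9

demo_df = pd.DataFrame(pixels, columns=pixel_cols)
demo_df.insert(0, "label", labels)

# Reshaping a flat row in row-major order maps image position
# (row, col) to flat pixel index row * 28 + col.
first_image = demo_df.drop(columns="label").iloc[0].to_numpy().reshape(28, 28)
row, col = 5, 7
same_pixel = first_image[row, col] == demo_df[f"pixel{row * 28 + col}"].iloc[0]

print(demo_df.shape)  # (3, 785)
```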





In [None]:
TRAIN_CSV_ZIP = 'train.csv.zip'
TEST_CSV_ZIP = 'test.csv.zip'

In [None]:
ROOT = '/home/jovyan'
assert os.path.exists(ROOT)

In [None]:
DATA_ROOT = os.path.join(ROOT, 'data')
assert os.path.exists(DATA_ROOT)

In [None]:
# Training data paths
DATA_TRAIN_PVC = 'digits-train'
DATA_TRAIN_ROOT = os.path.join(DATA_ROOT, DATA_TRAIN_PVC)
os.makedirs(DATA_TRAIN_ROOT, exist_ok=True)
assert os.path.exists(DATA_TRAIN_ROOT)
DATA_TRAIN_FILE = os.path.join(DATA_TRAIN_ROOT,'train.csv')

# Testing data paths
DATA_TEST_PVC = 'digits-test'
DATA_TEST_ROOT = os.path.join(DATA_ROOT, DATA_TEST_PVC)
os.makedirs(DATA_TEST_ROOT, exist_ok=True)
assert os.path.exists(DATA_TEST_ROOT)
DATA_TEST_FILE = os.path.join(DATA_TEST_ROOT,'test.csv')

# Validation data paths
DATA_VALID_PVC = 'digits-valid'
DATA_VALID_ROOT = os.path.join(DATA_ROOT,DATA_VALID_PVC)
os.makedirs(DATA_VALID_ROOT, exist_ok=True)
assert os.path.exists(DATA_VALID_ROOT)
DATA_VALID_FILE = os.path.join(DATA_VALID_ROOT,'valid.csv')

# Production data paths
DATA_PROD_PVC = 'digits-prod'
DATA_PROD_ROOT = os.path.join(DATA_ROOT, DATA_PROD_PVC)
os.makedirs(DATA_PROD_ROOT, exist_ok=True)
assert os.path.exists(DATA_PROD_ROOT)
DATA_PROD_FILE = os.path.join(DATA_PROD_ROOT,'prod.csv')
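
The four path blocks above follow one pattern: a `digits-<name>` directory holding a `<name>.csv` file. As a sketch, the pattern could be factored into a helper. The function `make_data_paths` is hypothetical, and a temporary directory stands in for `/home/jovyan/data` so that the example runs anywhere.

```python
import os
import tempfile

def make_data_paths(data_root, names):
    """Create <data_root>/digits-<name>/ and return the CSV path for each name."""
    paths = {}
    for name in names:
        pvc_dir = os.path.join(data_root, f"digits-{name}")
        os.makedirs(pvc_dir, exist_ok=True)
        paths[name] = os.path.join(pvc_dir, f"{name}.csv")
    return paths

# A temporary directory stands in for the real data root in this sketch.
with tempfile.TemporaryDirectory() as demo_root:
    demo_paths = make_data_paths(demo_root, ["train", "test", "valid", "prod"])
    dirs_exist = all(os.path.isdir(os.path.dirname(p)) for p in demo_paths.values())

print(sorted(demo_paths))  # ['prod', 'test', 'train', 'valid']
```

Keeping the explicit constants, as the notebook does, is equally valid; the helper only reduces repetition.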

In [None]:
# Extract train.csv into ROOT; the with block closes the archive automatically
with ZipFile(TRAIN_CSV_ZIP, 'r') as archive:
    archive.extractall(ROOT)
RAW_TRAIN_ROOT = os.path.join(ROOT,'train.csv')
assert os.path.exists(RAW_TRAIN_ROOT)
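
If the archive might not contain the expected file, checking its member list before extraction gives a clearer error than a failed `assert` afterwards. The following is a minimal sketch; the helper `extract_member` and the demo archive are hypothetical and are created only so that the example is self-contained.

```python
import os
import tempfile
from zipfile import ZipFile

def extract_member(zip_path, member, dest_dir):
    """Extract one named member from a zip archive and return its path."""
    with ZipFile(zip_path, "r") as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        return archive.extract(member, path=dest_dir)

with tempfile.TemporaryDirectory() as tmp:
    # Create a tiny demo archive so the sketch does not need train.csv.zip.
    demo_zip = os.path.join(tmp, "demo.csv.zip")
    with ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo.csv", "label,pixel0\n7,0\n")

    extracted = extract_member(demo_zip, "demo.csv", tmp)
    extracted_ok = os.path.exists(extracted)
```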

In [None]:
# split the training data into two parts
# 75% for training
# 25% for (cross)validation
RAW_TRAIN_DF1 = pd.read_csv(RAW_TRAIN_ROOT)
PART_75 = RAW_TRAIN_DF1.sample(frac=0.75)
PART_25 = RAW_TRAIN_DF1.drop(PART_75.index)
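
The `sample`/`drop` pattern above draws a different random split on every run. As a hedged sketch, passing `random_state` to `DataFrame.sample` makes the split repeatable. The helper `split_frame`, the seed value 42, and the synthetic frame are assumptions for illustration; the approach relies on the frame having a unique index.

```python
import pandas as pd

def split_frame(df, frac, seed=42):
    """Return (sampled part, remainder) as two disjoint frames."""
    first = df.sample(frac=frac, random_state=seed)  # fixed seed: repeatable
    second = df.drop(first.index)                    # everything not sampled
    return first, second

# Synthetic stand-in for the raw training frame.
demo_df = pd.DataFrame({"label": list(range(100)), "pixel0": [0] * 100})

part_a, part_b = split_frame(demo_df, 0.75)
part_a_again, _ = split_frame(demo_df, 0.75)

print(len(part_a), len(part_b))  # 75 25
```

The two parts share no rows, and calling the helper again with the same seed selects the same rows.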

In [None]:
# Save the split data sets to files
PART_75.to_csv(DATA_TRAIN_FILE, encoding='utf-8', index=False)
PART_25.to_csv(DATA_VALID_FILE, encoding='utf-8', index=False)

In [None]:
# Extract test.csv into ROOT; the with block closes the archive automatically
with ZipFile(TEST_CSV_ZIP, 'r') as archive:
    archive.extractall(ROOT)

RAW_TEST_ROOT = os.path.join(ROOT,'test.csv')
assert os.path.exists(RAW_TEST_ROOT)

In [None]:
# Split the test.csv into 2 parts
# 50% for Test
# 50% for Prod
RAW_TEST_DF1 = pd.read_csv(RAW_TEST_ROOT)
PART_50 = RAW_TEST_DF1.sample(frac=0.5)
PART_50_2 = RAW_TEST_DF1.drop(PART_50.index)

In [None]:
# Save the split data sets to files
PART_50.to_csv(DATA_TEST_FILE, encoding='utf-8', index=False)
PART_50_2.to_csv(DATA_PROD_FILE, encoding='utf-8', index=False)

In [None]:
# Load the four data sets into pandas DataFrames
TRAIN_DF = pd.read_csv(DATA_TRAIN_FILE)
TEST_DF = pd.read_csv(DATA_TEST_FILE)
EVAL_DF = pd.read_csv(DATA_VALID_FILE)
PROD_DF = pd.read_csv(DATA_PROD_FILE)

## Training Data

Let us now explore the data. To this end, we use the pandas `head` method to display the first five rows of the data set.

In [None]:
TRAIN_DF.head()

In [None]:
TRAIN_DF.shape

In [None]:
# Split the training data so that the label is in TRAIN_Y and TRAIN_X does not include the label
TRAIN_X = TRAIN_DF.drop('label', axis=1)
TRAIN_Y = TRAIN_DF.label

# Reshape image in 3 dimensions (height = 28px, width = 28px , channel = 1)
TRAIN_X = TRAIN_X.values.reshape(-1,28,28,1)


# Normalize the data
# Each pixel has a value between 0-255. Here we divide by 255, to get values from 0-1
TRAIN_X = TRAIN_X / 255.0

In [None]:
TRAIN_X.shape
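
The same drop-label, reshape, and normalize steps are applied to each of the four data sets in this notebook. One possible way to express them once is sketched below. The helper `to_images` is hypothetical, the pixel column names are assumed, and the choice of `float32` is specific to this sketch (the cells in this notebook produce `float64` arrays).

```python
import numpy as np
import pandas as pd

def to_images(df, label_col="label"):
    """Return (images, labels); labels is None when the frame has no label column."""
    labels = df[label_col].to_numpy() if label_col in df.columns else None
    pixels = df.drop(columns=[label_col], errors="ignore")
    # Reshape to (n, 28, 28, 1) and scale pixel values from 0..255 to 0..1
    images = pixels.to_numpy(dtype="float32").reshape(-1, 28, 28, 1) / 255.0
    return images, labels

# Synthetic frame: two all-255 images with labels 3 and 8.
pixel_cols = [f"pixel{i}" for i in range(784)]  # assumed column names
demo = pd.DataFrame(np.full((2, 784), 255), columns=pixel_cols)
demo.insert(0, "label", [3, 8])

images, labels = to_images(demo)
unlabeled_images, no_labels = to_images(demo.drop(columns="label"))

print(images.shape)  # (2, 28, 28, 1)
```

The same function handles the unlabeled test and production frames, for which it returns `None` in place of the labels.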

In [None]:
# Visualize single data instances

img_no = 31499 # Change the number to display other examples

first_number = TRAIN_X[img_no]
plt.imshow(first_number, cmap='gray') # Visualize the numbers in gray mode
plt.show()
print(f"correct number: {TRAIN_Y[img_no]}")

## Validation Data

In [None]:
EVAL_DF.head()

In [None]:
EVAL_DF.shape

In [None]:
# Split the validation data so that the label is in EVAL_Y and EVAL_X does not include the label
EVAL_X = EVAL_DF.drop('label', axis=1)
EVAL_Y = EVAL_DF.label

# Reshape image in 3 dimensions (height = 28px, width = 28px , channel = 1)
EVAL_X = EVAL_X.values.reshape(-1,28,28,1)


# Normalize the data
# Each pixel has a value between 0-255. Here we divide by 255, to get values from 0-1
EVAL_X = EVAL_X / 255.0

In [None]:
EVAL_X.shape

In [None]:
# Visualize single data instances

img_no = 10499 # Change the number to display other examples

first_number = EVAL_X[img_no]
plt.imshow(first_number, cmap='gray') # Visualize the numbers in gray mode
plt.show()
print(f"correct number: {EVAL_Y[img_no]}")

## Testing Data

In [None]:
TEST_DF.head()

In [None]:
TEST_DF.shape

In [None]:
TEST_X = TEST_DF

# Reshape image in 3 dimensions (height = 28px, width = 28px , channel = 1)
TEST_X = TEST_X.values.reshape(-1,28,28,1)


# Normalize the data
# Each pixel has a value between 0-255. Here we divide by 255, to get values from 0-1
TEST_X = TEST_X / 255.0

In [None]:
TEST_X.shape

In [None]:
# Visualize single data instances

img_no = 13999 # Change the number to display other examples

first_number = TEST_X[img_no]
plt.imshow(first_number, cmap='gray') # Visualize the numbers in gray mode
plt.show()


## Production Data

In [None]:
PROD_DF.head()

In [None]:
PROD_DF.shape

In [None]:
PROD_X = PROD_DF

# Reshape image in 3 dimensions (height = 28px, width = 28px , channel = 1)
PROD_X = PROD_X.values.reshape(-1,28,28,1)


# Normalize the data
# Each pixel has a value between 0-255. Here we divide by 255, to get values from 0-1
PROD_X = PROD_X / 255.0

In [None]:
PROD_X.shape

In [None]:
# Visualize single data instances

img_no = 13999 # Change the number to display other examples

first_number = PROD_X[img_no]
plt.imshow(first_number, cmap='gray') # Visualize the numbers in gray mode
plt.show()


## Create Snapshots of the 4 Data Volumes

In [None]:
USER_NAMESPACE = "kubeflow-user-example-com"
DATA_TRAIN_SNAP = 'digits-train-snap'
DATA_TEST_SNAP = 'digits-test-snap'
DATA_VALID_SNAP = 'digits-valid-snap'
DATA_PROD_SNAP = 'digits-prod-snap'

In [None]:
# Create a VolumeSnapshot for the volume attached to the
#   PersistentVolumeClaim (PVC) named by DATA_TRAIN_PVC in the namespace USER_NAMESPACE.
#   NOTE: if snapshot_name is not specified, the snapshot name will be set to 'ntap-dsutil.<timestamp>'
create_volume_snapshot(pvc_name=DATA_TRAIN_PVC, namespace=USER_NAMESPACE, snapshot_name=DATA_TRAIN_SNAP, print_output=True)

In [None]:
# Create a VolumeSnapshot for the volume attached to the
#   PersistentVolumeClaim (PVC) named by DATA_TEST_PVC in the namespace USER_NAMESPACE.
#   NOTE: if snapshot_name is not specified, the snapshot name will be set to 'ntap-dsutil.<timestamp>'
create_volume_snapshot(pvc_name=DATA_TEST_PVC, namespace=USER_NAMESPACE, snapshot_name=DATA_TEST_SNAP, print_output=True)

In [None]:
# Create a VolumeSnapshot for the volume attached to the
#   PersistentVolumeClaim (PVC) named by DATA_VALID_PVC in the namespace USER_NAMESPACE.
#   NOTE: if snapshot_name is not specified, the snapshot name will be set to 'ntap-dsutil.<timestamp>'
create_volume_snapshot(pvc_name=DATA_VALID_PVC, namespace=USER_NAMESPACE, snapshot_name=DATA_VALID_SNAP, print_output=True)

In [None]:
# Create a VolumeSnapshot for the volume attached to the
#   PersistentVolumeClaim (PVC) named by DATA_PROD_PVC in the namespace USER_NAMESPACE.
#   NOTE: if snapshot_name is not specified, the snapshot name will be set to 'ntap-dsutil.<timestamp>'
create_volume_snapshot(pvc_name=DATA_PROD_PVC, namespace=USER_NAMESPACE, snapshot_name=DATA_PROD_SNAP, print_output=True)

In [None]:
# List the VolumeSnapshots in the namespace
list_volume_snapshots(namespace=USER_NAMESPACE)
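
The four snapshot cells above repeat the same call with different names, so they could be collapsed into a loop. The sketch below is illustrative only: `snapshot_all` is a hypothetical helper, and `fake_create_volume_snapshot` is a stand-in that merely records its arguments so that the example runs without a cluster. It assumes only the keyword arguments already used in this notebook.

```python
def snapshot_all(pvc_to_snapshot, namespace, create_fn):
    """Call create_fn once per PVC, using the keyword arguments from the cells above."""
    for pvc_name, snapshot_name in pvc_to_snapshot.items():
        create_fn(
            pvc_name=pvc_name,
            namespace=namespace,
            snapshot_name=snapshot_name,
            print_output=True,
        )

# Stand-in that only records calls; it does not talk to Kubernetes.
recorded_calls = []

def fake_create_volume_snapshot(**kwargs):
    recorded_calls.append(kwargs)

snapshot_all(
    {
        "digits-train": "digits-train-snap",
        "digits-test": "digits-test-snap",
        "digits-valid": "digits-valid-snap",
        "digits-prod": "digits-prod-snap",
    },
    namespace="kubeflow-user-example-com",
    create_fn=fake_create_volume_snapshot,
)

print(len(recorded_calls))  # 4
```

In the notebook itself, `create_fn=create_volume_snapshot` would be passed instead of the stand-in.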