# Jonathan's EMNIST dataset importer + expander
Welcome!

This notebook imports the EMNIST dataset and provides tools to save it to files, either as images or as tensors.

In addition, it can use `.ttf` (TrueType Font) files to generate more images, for example to add symbols to the dataset. Font files are linked in this project's GitHub repository; to use them, unzip the folder into `/MyDrive/Fonts/`.

EMNIST *(Extended Modified National Institute of Standards and Technology database)* is a set of handwritten letters and numbers provided by NIST, which will be used to train a neural network in the next notebook.

## Libraries

In [None]:
from google.colab import drive
import os
import numpy as np
from PIL import Image as im
from tqdm import tqdm
import cv2
import torch
from torchvision import io

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install emnist
from emnist import list_datasets, extract_training_samples, extract_test_samples

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting emnist
  Downloading emnist-0.0-py3-none-any.whl (7.3 kB)
Installing collected packages: emnist
Successfully installed emnist-0.0


## Setup & Settings

This cell controls the location of the project's parent directory.

In [None]:
dir = "/content/drive" + "/MyDrive/Datasets/EMNIST"

Here, training data `x` and classes `y`, as well as testing data `xt` and classes `yt`, are imported as images.

The dataset's classes come as integers 0-46, so `key` is provided with the real class names.

In [None]:
x, y = extract_training_samples('balanced')
xt, yt = extract_test_samples('balanced')
key = ['0','1','2','3','4','5','6','7','8','9','A','B','c','D','E','F','G','H','i','j','k','l','m','N','o','p','Q','R','s','T','u','v','w','x','y','z','a','b','d','e','f','g','h','n','q','r','t']

This cell iterates through class lists `y` and `yt` and creates new class lists `a` and `at`.

While the old lists use integers for their classes, `a` and `at` use characters by indexing `key`.

In [None]:
a, at = [],[]
for i in range(len(y)):
  a.append(key[y[i]])
for i in range(len(yt)):
  at.append(key[yt[i]])
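
The same lookup can also be written as a list comprehension. The sketch below is only an illustration: it uses a shortened `key` and a made-up label list (`labels`, a hypothetical stand-in for `y`) to show how each integer class indexes into `key`.

```python
# Shortened stand-in for the notebook's 47-entry `key` list.
key = ['0', '1', '2', 'A', 'B']

# Hypothetical integer labels, playing the role of `y`.
labels = [3, 0, 4, 1]

# Index `key` with every label to get the character classes.
chars = [key[label] for label in labels]
print(chars)  # ['A', '0', 'B', '1']
```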

## Data Enrichment
This section uses TrueType fonts to generate additional classes for the dataset.

In [None]:
from PIL import ImageFont, ImageDraw
import matplotlib.pyplot as plt
import shutil

This cell defines the settings for the generator:
- `pkey` is the list of characters to be generated.
- `qkey` is the list of subdirectories to make, one per character in `pkey` (note that `/` and `.` get substitute names, a single space and `~.`, because they cannot be used as folder names).
- `shrink` is the list of characters' images to make smaller if a full 28x28 space is too large for them.
- `shrink2` is the list of characters' images to further reduce.

In [None]:
pkey = ['!','?','(',')',"'",'&','+','-','*','@','%','=',',',':',';','#','$','\\','/','.']
qkey = ['!','?','(',')',"'",'&','+','-','*','@','%','=',',',':',';','#','$','\\',' ','~.']
shrink = ['-','*','=','+',':',';']
shrink2 = [',','.',"'"]
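
Because `pkey` and `qkey` are parallel lists, they can be zipped into a lookup from character to folder name. The sketch below is one possible way to check which characters get a substitute folder name; `folder_for` and `renamed` are names introduced here for illustration.

```python
pkey = ['!','?','(',')',"'",'&','+','-','*','@','%','=',',',':',';','#','$','\\','/','.']
qkey = ['!','?','(',')',"'",'&','+','-','*','@','%','=',',',':',';','#','$','\\',' ','~.']

# The two lists are parallel, so zipping them gives a character -> folder lookup.
folder_for = dict(zip(pkey, qkey))

# Only the entries that differ use a substitute folder name.
renamed = {ch: name for ch, name in folder_for.items() if ch != name}
print(renamed)  # {'/': ' ', '.': '~.'}
```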

To ensure proper use, this cell will display all font files present in the folder.

In [None]:
[x.path for x in os.scandir('/content/drive/MyDrive/Fonts')]
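
If the folder might contain files other than fonts, the listing can be filtered by extension. This is a minimal sketch, not part of the original notebook: `list_fonts` is a hypothetical helper, and the demo runs on a temporary folder with empty placeholder files rather than real fonts.

```python
import os
import tempfile

def list_fonts(folder):
    """Return sorted paths of the .ttf files directly inside `folder`."""
    return sorted(
        entry.path
        for entry in os.scandir(folder)
        if entry.is_file() and entry.name.lower().endswith('.ttf')
    )

# Demo on a temporary folder with placeholder files (not real fonts).
with tempfile.TemporaryDirectory() as folder:
    for name in ['b.ttf', 'a.TTF', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()
    found = [os.path.basename(p) for p in list_fonts(folder)]

print(found)  # ['a.TTF', 'b.ttf']
```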

This cell, for every font provided, creates an image for every character in `pkey`.

It takes seven steps per image:
1. A blank 100x100 canvas is created as a `numpy` array and saved to `image1`.
2. The array is converted to a PIL image object, saved to `image2`.
3. A PIL `ImageDraw` object is created for `image2`, and the character is drawn onto it at font size 56.
4. Using `Image.getbbox()` and `Image.crop()`, the image is cropped to its content and saved to `image3`.
5. The image is resized so that its larger dimension is 26 pixels and saved to `image4`.
6. The image is padded to 28x28 and saved to `image5`.
7. Characters on the `shrink` or `shrink2` list are rescaled to 16x16 or 8x8, respectively, then padded back to 28x28, overwriting `image5`.

All generated data is stored in the `images` array. Since they are already ordered by class, a separate class list is not needed.
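
The sizing arithmetic behind the resize and padding steps can be sketched without any image library. `fit_to_canvas` is a hypothetical helper written for this illustration; it uses integer division in place of the float expression in the cell below, and the input sizes are made up.

```python
def fit_to_canvas(height, width, inner=26, canvas=28):
    """Return (new_height, new_width, pad) for a cropped glyph.

    The longer side is scaled to `inner`; the shorter side is scaled by the
    same ratio, then rounded up to an even number so that it can be centred
    with equal padding on both sides of a `canvas`-pixel square.
    """
    long_side, short_side = max(height, width), min(height, width)
    scaled = short_side * inner // long_side  # floor of short * inner / long
    if scaled % 2:                            # odd -> bump to the next even number
        scaled += 1
    pad = (canvas - scaled) // 2              # padding on each side of the short axis
    if height >= width:                       # tall glyph: height is the long side
        return inner, scaled, pad
    return scaled, inner, pad                 # wide glyph: width is the long side

print(fit_to_canvas(40, 10))  # (26, 6, 11)
print(fit_to_canvas(30, 20))  # (26, 18, 5)
print(fit_to_canvas(10, 40))  # (6, 26, 11)
```

The long axis always receives 1 pixel of padding per side (26 + 2 = 28), while the short axis receives `pad` pixels per side. Like the cell below, this sketch does not guard against a short side that scales down to 0.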

In [None]:
images = []
for i in tqdm([x.path for x in os.scandir('/content/drive/MyDrive/Fonts')]):
  font = ImageFont.truetype(i,size=56)
  for j in range(len(pkey)):
    image1 = np.zeros((100,100,3)).astype(np.uint8)
    image2 = im.fromarray(image1)

    draw = ImageDraw.Draw(image2)
    draw.text(xy=(0,0),text=pkey[j],font=font)

    image3 = np.asarray(image2.crop(image2.getbbox()))

    if image3.shape[0] >= image3.shape[1]: # for tall images
      i4x = int(image3.shape[1]*(26/image3.shape[0]))
      i4x = i4x if i4x % 2 == 0 else i4x + 1
      image4 = cv2.resize(image3,(i4x,26))

      pad = int((28-image4.shape[1])/2)
      image5 = cv2.copyMakeBorder(image4,1,1,pad,pad,cv2.BORDER_CONSTANT)

    else: # for wide images
      i4x = int(image3.shape[0]*(26/image3.shape[1]))
      i4x = i4x if i4x % 2 == 0 else i4x + 1
      image4 = cv2.resize(image3,(26,i4x))

      pad = int((28-image4.shape[0])/2)
      image5 = cv2.copyMakeBorder(image4,pad,pad,1,1,cv2.BORDER_CONSTANT)

    if pkey[j] in shrink: # shrink
      image5 = cv2.resize(image5,(16,16))
      image5 = cv2.copyMakeBorder(image5,6,6,6,6,cv2.BORDER_CONSTANT)

    if pkey[j] in shrink2: # shrink2
      image5 = cv2.resize(image5,(8,8))
      image5 = cv2.copyMakeBorder(image5,10,10,10,10,cv2.BORDER_CONSTANT)

    images.append(image5)

100%|██████████| 140/140 [00:07<00:00, 18.08it/s]


## Tensor Format
This section moves the dataset into tensors which are then saved in bulk. This is the method which will be used in the next notebook.

Define subdirectory names for raw (not expanded) and expanded datasets.

In [None]:
raw = 'raw'
expanded = 'data'

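The next two cells convert the following into tensors, in this order:
1. EMNIST training data
2. EMNIST testing data
3. Generated training data
4. Generated testing data

To emulate three channels, each grayscale EMNIST image is stacked three times along a new channel axis before being converted. The generated images already have three channels, so they are only transposed to the same channels-first layout.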

In [None]:
xtor = torch.from_numpy(np.stack((x,x,x),axis=1))
xttor = torch.from_numpy(np.stack((xt,xt,xt),axis=1))

In [None]:
imtor = torch.from_numpy(np.array(images[:2400]).transpose(0,3,1,2))
imttor = torch.from_numpy(np.array(images[2400:]).transpose(0,3,1,2))
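
The effect of `np.stack` and `transpose` on the array shapes can be checked with placeholder data. This sketch uses small all-zero arrays (`gray` and `rgb`, hypothetical stand-ins for the EMNIST and generated images) and only looks at shapes.

```python
import numpy as np

# Stand-in for EMNIST: 5 grayscale images, shape (N, 28, 28).
gray = np.zeros((5, 28, 28), dtype=np.uint8)
# Stacking along axis 1 repeats each image into 3 identical channels.
gray3 = np.stack((gray, gray, gray), axis=1)
print(gray3.shape)  # (5, 3, 28, 28)

# Stand-in for generated images: channels last, shape (N, 28, 28, 3).
rgb = np.zeros((4, 28, 28, 3), dtype=np.uint8)
# Moving the channel axis to position 1 gives the same layout as above.
rgb_first = rgb.transpose(0, 3, 1, 2)
print(rgb_first.shape)  # (4, 3, 28, 28)
```

Both results share the `(N, 3, 28, 28)` layout, which is what allows them to be concatenated along the first axis later on.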

This cell generates the class list for the generated images. Since the classes are being saved to files, they are stored as integers, starting at 47 so that they follow the EMNIST classes 0-46.

In [None]:
imindex = []
for i in range(2800):
  imindex.append(i%len(qkey)+47)
imindex = np.array(imindex)
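
The modulo pattern works because images are generated font by font, with every symbol produced in the same order for each font. The sketch below reproduces the pattern on a smaller scale; `n_fonts = 3` is a made-up value, and `key_len` and `n_symbols` are names introduced here for illustration.

```python
key_len = 47          # number of EMNIST 'balanced' classes (indices 0-46)
n_symbols = 20        # len(pkey) == len(qkey) in this notebook
n_fonts = 3           # hypothetical; the notebook run above used 140 fonts

# The symbol index is the image position modulo n_symbols;
# adding key_len shifts it past the EMNIST class indices.
labels = [i % n_symbols + key_len for i in range(n_fonts * n_symbols)]

print(labels[0], labels[19], labels[20])  # 47 66 47
```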

The next three cells combine the raw and generated data into training and testing image tensors.

They then create the training and testing class-index tensors, as well as an array of the class names.

In [None]:
train_image_tensor = torch.cat((xtor, imtor), 0)
test_image_tensor = torch.cat((xttor, imttor), 0)

In [None]:
train_class_keys = torch.from_numpy(np.concatenate((y,imindex[:2400])))
test_class_keys = torch.from_numpy(np.concatenate((yt,imindex[2400:])))

In [None]:
class_names = np.array(key+pkey)

This cell saves the combined dataset into the `data` subdirectory.

In [None]:
for name,var in zip(['train_data','test_data','train_keys','test_keys','class_names'],[train_image_tensor, test_image_tensor, train_class_keys, test_class_keys, class_names]):
  torch.save(var,dir+'/'+expanded+'/'+name+'.dmp') # `expanded` is 'data', defined above

### Raw EMNIST Tensors

Without the generated images, only these two cells are needed to create the tensor files for the raw EMNIST data.

In [None]:
train_emnist_tensor = torch.from_numpy(np.stack((x,x,x),axis=1))
test_emnist_tensor = torch.from_numpy(np.stack((xt,xt,xt),axis=1))
train_emnist_keys = torch.from_numpy(y)
test_emnist_keys = torch.from_numpy(yt)
emnist_names = np.array(key)

In [None]:
for name,var in zip(['train_data','test_data','train_keys','test_keys','class_names'],[train_emnist_tensor, test_emnist_tensor, train_emnist_keys, test_emnist_keys, emnist_names]):
  torch.save(var,dir+'/'+raw+'/'+name+'.dmp') # `raw` is 'raw', defined above

## Dataset as Images
This section saves the dataset as individual images for each entry, organized into folders by class. This approach is not recommended.

This cell defines the subdirectory names for training and testing data.

In [None]:
train = "train"
test = "valid"

For ease of use, this cell creates directories for all classes inside the training and testing subdirectories.

In [None]:
for i in range(len(key)):
  os.mkdir(dir+'/'+train+'/'+key[i])
  os.mkdir(dir+'/'+test+'/'+key[i])
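
`os.mkdir` raises `FileExistsError` if a folder already exists and `FileNotFoundError` if its parent is missing. If the cell may be re-run, `os.makedirs` with `exist_ok=True` is one alternative. The sketch below is an illustration only: it uses a temporary folder and a shortened class list (`classes`, hypothetical) instead of the Drive path and `key`.

```python
import os
import tempfile

classes = ['0', '1', 'A']        # shortened stand-in for `key`
splits = ['train', 'valid']      # same folder names as `train` and `test` above

with tempfile.TemporaryDirectory() as root:   # stand-in for the Drive folder
    for _ in range(2):                        # second pass simulates a re-run
        for split in splits:
            for name in classes:
                # Creates missing parents and ignores folders that already exist.
                os.makedirs(os.path.join(root, split, name), exist_ok=True)
    created = sorted(os.listdir(os.path.join(root, 'train')))

print(created)  # ['0', '1', 'A']
```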

The next two cells save the training and testing data, respectively, into their corresponding folders.

Since these are saving large numbers of files to Google Drive, they may take several minutes to run.

In [None]:
start = 0
for i in tqdm(np.arange(start,len(x))):
  im.fromarray(x[i]).save(dir+'/'+train+'/'+a[i]+'/'+str(i)+'.jpg')

In [None]:
start = 0
for i in tqdm(np.arange(start,len(xt))):
  im.fromarray(xt[i]).save(dir+'/'+test+'/'+at[i]+'/'+str(i)+'.jpg')

### Enriched Data Saving

For ease of use, the next two cells can generate and delete the subdirectories for enriched data, respectively.

Directory deletion is provided in case the dataset must be changed.

In [None]:
for i in range(len(qkey)): #generate directories
  os.mkdir(dir+'/'+train+'/'+qkey[i])
  os.mkdir(dir+'/'+test+'/'+qkey[i])

In [None]:
#for i in range(len(qkey)): #delete directories
#  shutil.rmtree(dir+'/'+train+'/'+qkey[i])
#  shutil.rmtree(dir+'/'+test+'/'+qkey[i])

This cell saves the enriched data, sending the first 6/7 of the images to the training folders and the remainder to the testing folders.

In [None]:
for i in tqdm(range(len(images))):
  if i < (len(images)*6/7):
    im.fromarray(images[i]).save(dir+'/'+train+'/'+qkey[i%len(qkey)]+'/'+str(i)+'.jpg')
  else:
    im.fromarray(images[i]).save(dir+'/'+test+'/'+qkey[i%len(qkey)]+'/'+str(i)+'.jpg')
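
The index-based split can be checked with plain arithmetic. This sketch assumes 2800 generated images (140 fonts with 20 symbols each, as in the run above) and shows that the 6/7 cutoff matches the 2400/400 split used in the tensor section; `train_idx` and `test_idx` are names introduced for illustration.

```python
n_images = 2800                      # 140 fonts x 20 symbols, as in the run above
cutoff = n_images * 6 / 7            # 2400.0

# Same condition as the saving loop: indices below the cutoff go to training.
train_idx = [i for i in range(n_images) if i < cutoff]
test_idx = [i for i in range(n_images) if i >= cutoff]

print(len(train_idx), len(test_idx))  # 2400 400
```

Since the images are stored in font order, the testing portion consists of the last 20 fonts processed rather than a random sample.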