# Mini Project V
#### - "We will combine our skills from Deep Learning and NLP and create our own language translator."



### ...

### ...

### ...

## Hmmmmmmm...... 
### It doesn't technically say it has to be a text language, does it?


#### I'm sure they meant for the project to involve a regular translation of language, say, from French to English, but this idea of just "make a translator" got me thinking...



#### What if we translated sign language instead? And instead of just images, we tried to translate from live video?


Looking around on the internet, I came across an American Sign Language dataset in CSV form, made up of greyscale images of the ASL alphabet.
While it would be simple enough to translate these images to their respective letters, what if we took it a step further, and did real-time video translation of ASL?

#### - What would that look like?


#### A little something like the below code:

- For this project, we'll construct a class called SignLanguage, which will contain the various functions required to read the data into Python, transform the images, and return them. (This is so we could run this in multiple files as scripts if we wanted. Just running it in the notebook, we can run it all together.)
- Next, we'll read the samples from Kaggle into our notebook
- Third, we'll apply various transformations to the dataset to increase the amount of data we're working with, so the model will (hopefully) be able to accurately classify images from video
- Each of our images is 28x28, stored in the CSV as a flat row of 784 pixel values, so we'll reshape each row back to 28 x 28 x 1
- Note: The dataset comes from Kaggle:

https://www.kaggle.com/datamunge/sign-language-mnist
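
To make the data layout a bit more concrete, here's a rough sketch of how one CSV row turns into a label and a 28x28 image. It assumes the file has a header line followed by rows of `label, pixel1, ..., pixel784`; the sample row is made up for illustration, and `row_to_image` is a hypothetical helper rather than part of the notebook.

```python
import csv
import io

SIDE = 28  # images are 28x28, stored flat as 784 pixel values per row


def row_to_image(line):
    """Split one CSV row into (raw_label, 28x28 nested list of pixels)."""
    raw_label = int(line[0])
    pixels = [int(v) for v in line[1:]]
    if len(pixels) != SIDE * SIDE:
        raise ValueError(f"expected {SIDE * SIDE} pixels, got {len(pixels)}")
    image = [pixels[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]
    return raw_label, image


# Made-up sample data in the assumed layout (not real dataset values).
header = ["label"] + [f"pixel{i}" for i in range(1, SIDE * SIDE + 1)]
sample = [10] + [i % 256 for i in range(SIDE * SIDE)]
fake_csv = io.StringIO(",".join(header) + "\n" + ",".join(map(str, sample)) + "\n")

reader = csv.reader(fake_csv)
next(reader)  # skip the header line
raw_label, image = row_to_image(next(reader))

# Same remapping idea as get_label_mapping(): raw label 9 never occurs,
# so labels above it shift down by one to give contiguous class indices.
mapping = list(range(25))
mapping.pop(9)
class_index = mapping.index(raw_label)

print(raw_label, class_index, len(image), len(image[0]))
```

The remap matters because the loss function used later expects class indices with no gaps in them.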

In [1]:
import numpy as np
import torch
from torch.utils.data import Dataset
from torch.autograd import Variable
from typing import List
import csv
import torchvision.transforms as transforms
import torch.nn as nn



class SignLanguage(Dataset):

    @staticmethod
    def get_label_mapping():

        mapping = list(range(25))
        mapping.pop(9)
        return mapping

    @staticmethod
    def read_label_samples(path: str):

        mapping = SignLanguage.get_label_mapping()
        labels, samples = [], []
        with open(path) as f:
            _ = next(f)  # skip header
            for line in csv.reader(f):
                label = int(line[0])
                labels.append(mapping.index(label))
                samples.append(list(map(int, line[1:])))
        return labels, samples

    def __init__(self,
            path: str="data/sign_mnist_train.csv",
            mean: List[float]=[0.485],
            std: List[float]=[0.229]):

        labels, samples = SignLanguage.read_label_samples(path)
        self._samples = np.array(samples, dtype=np.uint8).reshape((-1, 28, 28, 1))
        self._labels = np.array(labels, dtype=np.uint8).reshape((-1, 1))

        self._mean = mean
        self._std = std

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, idx):
        transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize(128),
            transforms.RandomRotation(10),
            transforms.RandomResizedCrop(28, scale=(0.8, 1.2)),
            transforms.ToTensor(),
            transforms.Normalize(mean=self._mean, std=self._std)])

        return {
            'image': transform(self._samples[idx]).float(),
            'label': torch.from_numpy(self._labels[idx]).float()
        }

#### Next, we'll use our dataset class to construct the training and testing data loaders
- This whole little project uses PyTorch

In [2]:
def get_train_test(batch_size=32):
    trainset = SignLanguage('sign_mnist_train.csv')
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)

    testset = SignLanguage('sign_mnist_test.csv')
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)
    return trainloader, testloader

In [3]:
if __name__ == '__main__':
    loader, _ = get_train_test(2)
    print(next(iter(loader)))

{'image': tensor([[[[-0.2856, -0.2513, -0.2171,  ..., -0.1314, -0.1486, -0.2856],
          [-0.2684, -0.2342, -0.1828,  ..., -0.0972, -0.1143, -0.1314],
          [-0.2171, -0.1828, -0.1486,  ..., -0.0629, -0.0801, -0.0972],
          ...,
          [ 0.4337,  0.4851,  0.5364,  ...,  0.5364,  0.5022,  0.0227],
          [ 0.4337,  0.4851,  0.5364,  ...,  0.5536,  0.5193,  0.0398],
          [-0.5253, -0.4739, -0.0287,  ...,  0.5536,  0.5364, -0.3369]]],


        [[[ 0.0398,  0.1083,  0.1426,  ...,  0.3652,  0.3823,  0.0398],
          [ 0.0741,  0.1254,  0.1768,  ...,  0.3994,  0.4166,  0.3652],
          [ 0.1083,  0.1597,  0.2111,  ...,  0.4337,  0.4337,  0.4508],
          ...,
          [ 0.6563,  0.7248,  0.7248,  ...,  0.9817,  0.9817,  0.9817],
          [ 0.6906,  0.7419,  0.7591,  ...,  0.9988,  0.9988,  0.9988],
          [ 0.7077,  0.7591,  0.7762,  ..., -0.3369, -0.3369, -0.3369]]]]), 'label': tensor([[19.],
        [17.]])}


#### Just importing various dependencies

In [10]:
from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch


#### Next, construct a neural network class that we can instantiate later on

In [11]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 6, 3)
        self.conv3 = nn.Conv2d(6, 16, 3)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 48)
        self.fc3 = nn.Linear(48, 24)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
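
If you're wondering where the `16 * 5 * 5` in `fc1` comes from, it's just the size of the feature maps after the last pooling layer. Here's a small sketch of the arithmetic, assuming unpadded, stride-1 convolutions (the `nn.Conv2d` defaults); `conv_out` is a hypothetical helper for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                      # input images are 28x28
size = conv_out(size, 3)       # conv1, 3x3 kernel      -> 26
size = conv_out(size, 3)       # conv2, 3x3 kernel      -> 24
size = conv_out(size, 2, 2)    # 2x2 max pool, stride 2 -> 12
size = conv_out(size, 3)       # conv3, 3x3 kernel      -> 10
size = conv_out(size, 2, 2)    # 2x2 max pool, stride 2 -> 5

channels = 16                  # conv3 produces 16 feature maps
flat_features = channels * size * size
print(size, flat_features)
```

So the flattened vector going into the first fully connected layer has 400 values per image.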

### Here's the guts of our model:
- Runs for 15 epochs
- Uses a learning rate scheduler, which decreases the learning rate over time. The linked article uses Keras, but the idea is the same; see here:

https://machinelearningmastery.com/using-learning-rate-schedules-deep-learning-models-python-keras/
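
The training cell uses `optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.1`. As a rough sketch of what that means for our 15 epochs, the schedule can be reproduced with plain arithmetic. This assumes the scheduler is stepped once per epoch, as in `main()`; `step_lr` is a hypothetical helper, not a PyTorch function.

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate under a step schedule: multiply by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(0.01, epoch) for epoch in range(15)]
for epoch, lr in enumerate(schedule):
    print(epoch, round(lr, 6))
```

In other words, the first ten epochs train at 0.01 and the last five at roughly 0.001.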

In [12]:
def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

    trainloader, _ = get_train_test()
    for epoch in range(15):  # loop over the dataset multiple times
        train(net, criterion, optimizer, trainloader, epoch)
        scheduler.step()
    torch.save(net.state_dict(), "checkpoint.pth")

### Next, we write a function to train the network for one epoch

In [13]:
def train(net, criterion, optimizer, trainloader, epoch):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs = Variable(data['image'].float())
        labels = Variable(data['label'].long())
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels[:, 0])
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 0:
            print('[%d, %5d] loss: %.6f' % (epoch, i, running_loss / (i + 1)))
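
One thing worth knowing when reading the log: the number printed is the average loss over all batches seen so far in the current epoch, not the loss of the latest batch. Here's a minimal sketch of that bookkeeping with made-up loss values; `running_averages` is a hypothetical helper that mirrors the logic in `train`.

```python
def running_averages(batch_losses, every=100):
    """Return (batch_index, mean loss so far) pairs, logged every `every` batches."""
    logged = []
    running_loss = 0.0
    for i, loss in enumerate(batch_losses):
        running_loss += loss
        if i % every == 0:
            logged.append((i, running_loss / (i + 1)))
    return logged


# Made-up per-batch losses, logged every 2 batches to keep the example short.
losses = [3.0, 1.0, 2.0, 0.0, 4.0]
print(running_averages(losses, every=2))
```

Because the tally restarts every epoch, the first line of each epoch is a single batch's loss, which is why it can jump around compared with the line before it.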

### Finally, we actually train the neural network. The function keeps a running tally of our loss

In [9]:
if __name__ == '__main__':
    main()

[0,     0] loss: 3.157291
[0,   100] loss: 3.180325
[0,   200] loss: 3.177582
[0,   300] loss: 3.173294
[0,   400] loss: 3.092386
[0,   500] loss: 2.903753
[0,   600] loss: 2.682888
[0,   700] loss: 2.479103
[0,   800] loss: 2.295678
[1,     0] loss: 1.321397
[1,   100] loss: 0.765594
[1,   200] loss: 0.693508
[1,   300] loss: 0.662273
[1,   400] loss: 0.610449
[1,   500] loss: 0.569189
[1,   600] loss: 0.538422
[1,   700] loss: 0.509749
[1,   800] loss: 0.484125
[2,     0] loss: 0.168282
[2,   100] loss: 0.253409
[2,   200] loss: 0.246858
[2,   300] loss: 0.235893
[2,   400] loss: 0.228783
[2,   500] loss: 0.225065
[2,   600] loss: 0.219218
[2,   700] loss: 0.211332
[2,   800] loss: 0.206462
[3,     0] loss: 0.226306
[3,   100] loss: 0.121513
[3,   200] loss: 0.126204
[3,   300] loss: 0.130682
[3,   400] loss: 0.134932
[3,   500] loss: 0.133569
[3,   600] loss: 0.131217
[3,   700] loss: 0.131261
[3,   800] loss: 0.130944
[4,     0] loss: 0.181717
[4,   100] loss: 0.108770
[4,   200] l

### Now that our model is trained, we can evaluate it on the remaining testing data we have

In [15]:
from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch
import numpy as np

import onnx
import onnxruntime as ort



def evaluate(outputs: np.ndarray, labels: torch.Tensor) -> float:
    """Count how many predictions (argmax over class scores) match the labels."""
    Y = labels.numpy()
    Yhat = np.argmax(outputs, axis=1)
    return float(np.sum(Yhat == Y))


def batch_evaluate(
        net: Net,
        dataloader: torch.utils.data.DataLoader) -> float:
    score = n = 0.0
    for batch in dataloader:
        n += len(batch['image'])
        outputs = net(batch['image'])
        if isinstance(outputs, torch.Tensor):
            outputs = outputs.detach().numpy()
        score += evaluate(outputs, batch['label'][:, 0])
    return score / n


def validate():
    trainloader, testloader = get_train_test()
    net = Net().float().eval()

    pretrained_model = torch.load("checkpoint.pth")
    net.load_state_dict(pretrained_model)

    print('=' * 10, 'PyTorch', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation accuracy: %.1f' % test_acc)

    # the ONNX model is exported with a batch size of 1 (see dummy), so reload with batch_size=1
    trainloader, testloader = get_train_test(1)

    # export to onnx
    fname = "signlanguage.onnx"
    dummy = torch.randn(1, 1, 28, 28)
    torch.onnx.export(net, dummy, fname, input_names=['input'])

    # check exported model
    model = onnx.load(fname)
    onnx.checker.check_model(model)  # check model is well-formed

    # create runnable session with exported model
    ort_sesh = ort.InferenceSession(fname)
    net = lambda inp: ort_sesh.run(None, {'input': inp.data.numpy()})[0]

    print('=' * 10, 'ONNX', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training set accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation set accuracy: %.1f' % test_acc)


if __name__ == '__main__':
    validate()

Training accuracy: 99.8
Validation accuracy: 97.0
Training set accuracy: 99.8
Validation set accuracy: 96.8
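
For reference, the accuracy numbers boil down to "take the highest-scoring class for each image and count how often it matches the label". Here's a dependency-free sketch of that idea with made-up scores; `argmax` and `accuracy` are hypothetical helpers, while the notebook itself does this with NumPy.

```python
def argmax(row):
    """Index of the largest value in a list of class scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    correct = sum(1 for row, y in zip(scores, labels) if argmax(row) == y)
    return correct / len(labels)


# Made-up scores for 4 images and 3 classes (not real model output).
scores = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
labels = [1, 0, 2, 1]
print(accuracy(scores, labels))
```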


### Here's where it gets good.
- The following function does a number of things. The important stuff is as follows:
- Creates a list of potential answers (the letters the model can predict)
- Uses ONNX Runtime to create a session with our exported model
- Captures video input frame by frame
- Takes each frame, and preprocesses it:
    - Center crops it
    - Drops it down from colour (OpenCV captures frames as BGR) to grey
    - Resizes the frame to match our 28x28 dataset
    - Normalizes and reshapes it
- Runs the preprocessed frame through our model
- Finally, puts a letter on the window based on what the model predicts
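
One detail the preprocessing has to get right is normalization. During training, `ToTensor()` scales pixels to the 0-1 range before `Normalize` subtracts the mean and divides by the standard deviation. The webcam frames stay in the 0-255 range, so the constants get scaled by 255 instead. Here's a quick sketch checking that the two routes agree; the function names are hypothetical, and the constants are the ones used in this notebook.

```python
import math

MEAN, STD = 0.485, 0.229  # normalization constants used by the dataset class


def normalize_training_style(pixel):
    """Scale a 0-255 pixel to 0-1 first, then normalize."""
    return (pixel / 255.0 - MEAN) / STD


def normalize_webcam_style(pixel):
    """Normalize a 0-255 pixel directly with constants scaled by 255."""
    return (pixel - MEAN * 255.0) / (STD * 255.0)


# Largest disagreement between the two routes over every possible pixel value.
max_diff = max(
    abs(normalize_training_style(p) - normalize_webcam_style(p))
    for p in range(256)
)
print(max_diff)
```

The difference should be nothing more than floating point noise, so the live frames end up in the same value range the network was trained on.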

In [1]:
import cv2
import numpy as np
import onnxruntime as ort


def center_crop(frame):
    h, w, _ = frame.shape
    start = abs(h - w) // 2
    if h > w:
        return frame[start: start + w]
    return frame[:, start: start + h]


def main():
    # constants
    index_to_letter = list('ABCDEFGHIKLMNOPQRSTUVWXY')
    mean = 0.485 * 255.
    std = 0.229 * 255.

    # create runnable session with exported model
    ort_sesh = ort.InferenceSession("signlanguage.onnx")

    cap = cv2.VideoCapture(0)
    while True:
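        # Capture frame-by-frame
        ret, frame = cap.read()
        if not ret:  # camera unavailable or stream ended
            break

        # preprocess data
        frame = center_crop(frame)
        # OpenCV delivers frames in BGR channel order
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)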
        x = cv2.resize(frame, (28, 28))
        x = (x - mean) / std

        x = x.reshape(1, 1, 28, 28).astype(np.float32)
        y = ort_sesh.run(None, {'input': x})[0]

        index = np.argmax(y, axis=1)
        letter = index_to_letter[int(index[0])]

        cv2.putText(frame, letter, (100, 100), cv2.FONT_HERSHEY_TRIPLEX, 2.0, (0, 255, 0), thickness=2)
        cv2.imshow("Sign Language Translator", frame)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    main()
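
To see what `center_crop` actually keeps, here's a small sketch that computes the crop bounds without needing a camera. The frame sizes are just examples, and `center_crop_bounds` is a hypothetical helper that follows the same logic as `center_crop` above.

```python
def center_crop_bounds(h, w):
    """Return (row_start, row_stop, col_start, col_stop) of the centred square."""
    side = min(h, w)
    start = abs(h - w) // 2
    if h > w:
        # taller than wide: trim rows from the top and bottom
        return start, start + side, 0, w
    # wider than tall (or square): trim columns from the left and right
    return 0, h, start, start + side


print(center_crop_bounds(480, 640))  # landscape frame
print(center_crop_bounds(640, 480))  # portrait frame
```

For a 480x640 webcam frame this keeps columns 80 to 560, so your hand needs to be roughly in the middle of the picture for the model to see it.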


## Next Steps


#### So that's good and all, but it's not fantastic at distinguishing sign language in a way you would actually use - it's okay when you literally only have a hand onscreen

#### So the next idea was to start creating my own datasets, using Labelbox, which allowed me to create both bounding boxes on videos, as well as classifications
#### As you can see from the examples, it's a pretty nifty little tool.
#### Basically, you can draw the box, and the box stays on screen between frames. You can resize and move the box in between frames to keep up with whatever you're bounding, and can select up to 60 frames at a time to label.
#### These labels, the frames they appear at, and the shapes of the boxes can then be extracted to Postman.
#### From there, you can create CSVs with the labels, and append them to the original frames, once you've extracted and greyscaled those.
#### Unfortunately, I didn't have time to complete this addendum.
#### As a proof of concept, this is pretty darn cool! With the above changes, I think you could really get a useful working model out of it.
#### Of course, someone's already done this - some Australians completed a similar project for a hackathon, which does a similar real-time translation of Australian Sign Language, using neural networks, which is pretty darn neat.

- What else can we 'translate' ?
- We could translate a whole bunch of things. Maybe facial expressions, body language, and so on.
- Those might be useful for autistic individuals, if we could construct a real time program to translate the subtleties of human interactions, but as a proof of concept,
it might be easier to try and construct a sign language 