### Language Identification using AI

This notebook trains the model that will detect languages for our application. We are going to use PyTorch. I chose PyTorch over TensorFlow because PyTorch is my favorite this week.


We are going to reference several notebooks that I've used before, but two of them will be used the most. They are as follows:

1. [01_Emotions_Sentiment_Analyisis_Packed_Padded_Sequences.ipynb](https://github.com/CrispenGari/nlp-pytorch/blob/main/03_Emotions/01_Emotions_Sentiment_Analyisis_Packed_Padded_Sequences.ipynb)

2. [02_Duplicate_Questions_FastText.ipynb](https://github.com/CrispenGari/nlp-pytorch/blob/main/05_Duplicate_Questions/02_Duplicate_Questions_FastText.ipynb)

I was planning to use packed padded sequences based on the first referenced notebook, but because I care a lot about speed, I will be using FastText, so most of the code will be taken from the second referenced notebook.

The data that I will be working with comes from my Google Drive. I've already prepared the data, and we have `3` files:

```py
# 1. train.csv
# 2. test.csv
# 3. valid.csv

# The languages we will be identifying
languages = ["eng", "fra", "deu", "ita", "swe", "por", "afr"]
```

### Mounting the drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Imports

In [2]:
import time, os, random, math

from prettytable import PrettyTable
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

import torch
from torch import nn
import torch.nn.functional as F

from torchtext.legacy import data

torch.__version__

'1.9.0+cu102'

We will be using the legacy torchtext API (`torchtext.legacy`) for the fields, datasets and iterators.


### Device

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Paths to files

In [4]:
base_path = '/content/drive/My Drive/NLP Data/lang-identification'
train_path = 'train.csv'
val_path = 'valid.csv'
test_path = 'test.csv'

### Generating `bigrams`.

In [5]:
def generate_bigrams(x):
  x = [i.lower() for i in x]
  n_grams = set(zip(*[x[i: ] for i in range(2)]))
  for n_gram in n_grams:
      x.append(' '.join(n_gram))
  return x
generate_bigrams(['What', 'is', 'the', 'meaning', "of", "OCR", "in", "python"])

['what',
 'is',
 'the',
 'meaning',
 'of',
 'ocr',
 'in',
 'python',
 'what is',
 'the meaning',
 'ocr in',
 'meaning of',
 'is the',
 'of ocr',
 'in python']
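
Because `generate_bigrams` collects the bigrams in a `set`, the order in which they are appended is not guaranteed (note how `'the meaning'` comes before `'is the'` in the output above). If a reproducible order matters, a minimal order-preserving sketch could look like the following. The name `generate_bigrams_ordered` is hypothetical and is not used elsewhere in this notebook.

```python
def generate_bigrams_ordered(tokens):
    """Lowercase the tokens and append each distinct bigram in first-seen order."""
    unigrams = [t.lower() for t in tokens]
    seen = set()
    bigrams = []
    # Pair every token with its right-hand neighbour.
    for left, right in zip(unigrams, unigrams[1:]):
        bigram = f"{left} {right}"
        if bigram not in seen:
            seen.add(bigram)
            bigrams.append(bigram)
    return unigrams + bigrams


print(generate_bigrams_ordered(["What", "is", "the", "meaning"]))
# ['what', 'is', 'the', 'meaning', 'what is', 'is the', 'the meaning']
```

Unlike the original function, this variant returns a new list instead of appending to its argument.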

### Tokenizer function

I'm going to use my own tokenization function because different languages have different tokenization rules. To keep this simple, I will tokenize the sentences by splitting on spaces.

In [6]:
def tokenizer(sent):
  return sent.split(" ")
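
One detail worth knowing about splitting on a single space: consecutive spaces produce empty-string tokens. The sketch below contrasts the notebook's `tokenizer` with a hypothetical `whitespace_tokenizer` (an assumption for comparison only, not what the model was trained with).

```python
def tokenizer(sent):
    # Same behaviour as the notebook's tokenizer: split on single spaces.
    return sent.split(" ")


def whitespace_tokenizer(sent):
    # Hypothetical alternative: split on any run of whitespace.
    return sent.split()


sample = "J'ai  peur de tomber."  # note the double space
print(tokenizer(sample))             # ["J'ai", '', 'peur', 'de', 'tomber.']
print(whitespace_tokenizer(sample))  # ["J'ai", 'peur', 'de', 'tomber.']
```

Punctuation stays attached to the neighbouring word in both cases, which matches tokens such as `'that.'` in the dataset example further down.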

### Creating the fields that will process our data.

In [7]:
TEXT = data.Field(
    tokenize = tokenizer,
    preprocessing = generate_bigrams,
)
LABEL = data.LabelField()

In [8]:
fields = {
    "sent": ("text", TEXT),
    "code": ("label", LABEL),
}

### Creating the dataset using `TabularDataset.splits()`


In [9]:
train_data, val_data, test_data = data.TabularDataset.splits(
   base_path,
   train=train_path,
   test= test_path,
   validation= val_path,
   format = "csv",
   fields=fields
)

In [10]:
print(vars(train_data.examples[0]))

{'text': ['it', 'is', 'absurd', 'of', 'you', 'to', 'do', 'that.', 'is absurd', 'it is', 'absurd of', 'of you', 'to do', 'do that.', 'you to'], 'label': 'eng'}
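
The `fields` dictionary maps CSV column names to attribute names, so the files are assumed to have a header row containing at least the `sent` and `code` columns. The following standard-library sketch illustrates that mapping only. The two rows are borrowed from the inference examples at the end of this notebook and are not actual rows from `train.csv`.

```python
import csv
import io

# Illustrative rows only; the real files live on the author's Google Drive.
sample_csv = io.StringIO(
    "sent,code\n"
    "\"J'ai peur de tomber.\",fra\n"
    "Ek gaan nie lank hier wees nie.,afr\n"
)

# Same column -> attribute mapping as in `fields`, without the Field objects.
column_to_attribute = {"sent": "text", "code": "label"}

examples = []
for row in csv.DictReader(sample_csv):
    examples.append({attr: row[col] for col, attr in column_to_attribute.items()})

print(examples[0])
# {'text': "J'ai peur de tomber.", 'label': 'fra'}
```

In the notebook itself the `text` attribute additionally goes through `tokenizer` and `generate_bigrams`, which is why the printed example above is a list of lowercased unigrams and bigrams.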


### Building the vocabulary.

We are not going to load pretrained word vectors, since we are working with several languages and it does not make sense to do that.

In [11]:
TEXT.build_vocab(
    train_data
)
LABEL.build_vocab(train_data)

In [12]:
LABEL.vocab.stoi

defaultdict(None,
            {'afr': 6,
             'deu': 3,
             'eng': 0,
             'fra': 2,
             'ita': 4,
             'por': 5,
             'swe': 1})
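
Building a vocabulary amounts to assigning an integer index to every token seen in the training data. Below is a simplified, torchtext-free sketch of that idea. It assumes the special tokens `<unk>` and `<pad>` come first (consistent with `padding_idx=1` in the model printout later on) and that the remaining tokens are ordered by frequency. `build_vocab_sketch` is a hypothetical helper, not a torchtext function.

```python
from collections import Counter


def build_vocab_sketch(tokenized_examples, specials=("<unk>", "<pad>")):
    """Return (stoi, itos) for a list of token lists."""
    counter = Counter(tok for example in tokenized_examples for tok in example)
    # Most frequent tokens first; ties are broken alphabetically.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: index for index, tok in enumerate(itos)}
    return stoi, itos


stoi, itos = build_vocab_sketch([["it", "is"], ["it", "was"]])
print(stoi)
# {'<unk>': 0, '<pad>': 1, 'it': 2, 'is': 3, 'was': 4}
```

The label vocabulary shown above has no special tokens, which is why `'eng'` sits at index `0` there.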

### Iterators

We are going to make use of the `BucketIterator` to create iterators for our sets.

In [13]:
sort_key = lambda x: len(x.text)

BATCH_SIZE = 128

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = sort_key,
    sort_within_batch=True
)
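
The point of bucketing is to batch examples of similar length together so that little padding is needed. The sketch below illustrates only that idea in plain Python. The real `BucketIterator` is more involved (for example, it shuffles the training data), and `bucket_batches` is a hypothetical helper.

```python
def bucket_batches(examples, batch_size, pad_token="<pad>"):
    """Group examples of similar length and pad each batch to its longest member."""
    ordered = sorted(examples, key=len)  # plays the role of `sort_key`
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = max(len(example) for example in chunk)
        batches.append(
            [example + [pad_token] * (longest - len(example)) for example in chunk]
        )
    return batches


toy = [["a"], ["b", "c", "d", "e"], ["f", "g"], ["h", "i", "j"]]
for batch in bucket_batches(toy, batch_size=2):
    print(batch)
# [['a', '<pad>'], ['f', 'g']]
# [['h', 'i', 'j', '<pad>'], ['b', 'c', 'd', 'e']]
```

Without the sort, the first batch would pair `['a']` with the four-token example and need three pad tokens instead of one.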

### Next we will create a model.

In [14]:
class LanguageIndentifierFastText(nn.Module):
  def __init__(self,
               vocab_size,
               embedding_size,
               output_dim,
               pad_index,
               dropout=.5
               ):
    super(LanguageIndentifierFastText, self).__init__()
    self.embedding = nn.Embedding(
        vocab_size,
        embedding_size,
        padding_idx = pad_index
    )
    self.out = nn.Linear(
        embedding_size,
        out_features = output_dim
    )
    self.dropout = nn.Dropout(dropout)
  
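  def forward(self, text):
    # text: [seq_len, batch_size] -> embedded: [batch_size, seq_len, embedding_size]
    embedded = self.embedding(text).permute(1, 0, 2)
    # Average over the sequence dimension -> pooled: [batch_size, embedding_size]
    pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)
                          ).squeeze(1)
    # Note: self.dropout is defined in __init__ but is not applied here.
    return self.out(pooled)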
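
The pooling step in `forward` is simply the mean of the token embeddings of each sentence. A small standalone check of that claim, using random numbers in place of real embeddings (the sizes are arbitrary assumptions):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, embedding_size = 2, 5, 4
# Stand-in for the output of the embedding layer after the permute.
embedded = torch.randn(batch_size, seq_len, embedding_size)

# Pooling as written in the model: a (seq_len x 1) window over each sentence.
pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)

# The same quantity expressed as a plain mean over the sequence dimension.
mean_over_tokens = embedded.mean(dim=1)

print(pooled.shape)
print(torch.allclose(pooled, mean_over_tokens, atol=1e-6))
```

Since the average runs over every position in the batch, padded positions contribute to it as well.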

### Model instance

In [15]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM =  len(LABEL.vocab)
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] 

language_identifier_model = LanguageIndentifierFastText(
            INPUT_DIM, 
            EMBEDDING_DIM, 
            OUTPUT_DIM, 
            pad_index = PAD_IDX
            ).to(device)
language_identifier_model

LanguageIndentifierFastText(
  (embedding): Embedding(152163, 100, padding_idx=1)
  (out): Linear(in_features=100, out_features=7, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

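### Initializing model weights

In [16]:
def init_weights(m):
  # Draw every parameter (weights and biases) from a normal distribution
  # with mean 0 and standard deviation 0.1.
  # Note: this also overwrites the embedding row at `padding_idx`.
  for name, param in m.named_parameters():
    nn.init.normal_(param.data, mean=0, std=0.1)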

language_identifier_model.apply(init_weights)

LanguageIndentifierFastText(
  (embedding): Embedding(152163, 100, padding_idx=1)
  (out): Linear(in_features=100, out_features=7, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Counting model parameters

In [17]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(language_identifier_model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")

Total number of parameters: 15,217,007
Total trainable parameters: 15,217,007
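
The reported total can be checked by hand from the layer shapes in the model printout: one embedding matrix plus one linear layer with a bias. A short arithmetic sketch:

```python
vocab_size, embedding_dim, output_dim = 152_163, 100, 7

embedding_params = vocab_size * embedding_dim             # one vector per vocabulary entry
linear_params = embedding_dim * output_dim + output_dim   # weights + biases

total = embedding_params + linear_params
print(f"{embedding_params:,} + {linear_params:,} = {total:,}")
# 15,216,300 + 707 = 15,217,007
```

Almost all of the parameters sit in the embedding table, so the model size is driven by the vocabulary size (unigrams plus bigrams).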


### Criterion and optimizer

In [18]:
optimizer = torch.optim.Adam(language_identifier_model.parameters())
criterion = nn.CrossEntropyLoss().to(device)

### Accuracy function

In [19]:
def categorical_accuracy(preds, y):
    top_pred = preds.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc
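
To make the accuracy computation concrete, here is a framework-free sketch of the same logic on plain lists: take the index of the highest score in each row and compare it with the target. The scores and targets are made-up numbers, and `categorical_accuracy_lists` is a hypothetical name.

```python
def categorical_accuracy_lists(preds, y):
    """Fraction of rows whose highest-scoring class matches the target."""
    top_pred = [max(range(len(row)), key=row.__getitem__) for row in preds]
    correct = sum(int(p == t) for p, t in zip(top_pred, y))
    return correct / len(y)


scores = [
    [0.1, 2.0, -1.0],  # argmax 1
    [1.5, 0.2, 0.3],   # argmax 0
    [0.0, 0.1, 3.0],   # argmax 2
    [0.9, 0.8, 0.7],   # argmax 0
]
targets = [1, 0, 2, 2]
print(categorical_accuracy_lists(scores, targets))  # 0.75
```

Note that the `train` and `evaluate` functions below average this per-batch value over the number of batches, so a smaller final batch counts as much as a full one.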

In [20]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text = batch.text
        predictions = model(text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = categorical_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text = batch.text
            predictions = model(text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = categorical_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Train loop

In [21]:
from prettytable import PrettyTable

In [22]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)
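
`hms_string` turns a duration in seconds into an `h:mm:ss.ss` string. A quick standalone check of the formatting, repeating the function so the snippet runs on its own (the sample durations are arbitrary):

```python
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


print(hms_string(3.06))    # 0:00:03.06
print(hms_string(3725.5))  # 1:02:05.50
```

The value shown in the `ETA` column of the table is therefore the time the epoch took (`end - start`), not a remaining-time estimate.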
  

In [23]:
N_EPOCHS = 100
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start = time.time()
    train_loss, train_acc = train(language_identifier_model, 
                                  train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(language_identifier_model, 
                                     val_iter, criterion)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(language_identifier_model.state_dict(), 'best-lang-ident-model.pt')
    end = time.time()
    visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)

+--------------------------------------------+
|     EPOCH: 01/100 saving best model...     |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 1.483 |    0.788 | 0:00:03.06 |
| Validation | 0.819 |    0.973 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/100 saving best model...     |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.316 |    0.992 | 0:00:02.93 |
| Validation | 0.267 |    0.986 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/100 saving best model...     |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   

### Evaluating the best model.

In [24]:
language_identifier_model.load_state_dict(torch.load('best-lang-ident-model.pt'))

test_loss, test_acc = evaluate(language_identifier_model, test_iter, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.036 | Test Acc: 99.22%


### Model inference

In [25]:
LABEL.vocab.stoi

defaultdict(None,
            {'afr': 6,
             'deu': 3,
             'eng': 0,
             'fra': 2,
             'ita': 4,
             'por': 5,
             'swe': 1})

In [26]:
labels = {v:k for k, v in LABEL.vocab.stoi.items() }
labels

{0: 'eng', 1: 'swe', 2: 'fra', 3: 'deu', 4: 'ita', 5: 'por', 6: 'afr'}

In [27]:
def predict_language(model, sent):
  model.eval()
  sent = sent.lower()
  tokenized = tokenizer(sent)
  indexed = [TEXT.vocab.stoi[t] for t in tokenized]
  tensor = torch.LongTensor(indexed).to(device)
  tensor = tensor.unsqueeze(1)
  probabilities = torch.softmax(model(tensor), dim=1)
  prediction = torch.argmax(probabilities, dim=1)
  item = prediction.item()

  return {
      "label": item,
      "lang": labels[item]
  }

predict_language(language_identifier_model, "this")

{'label': 0, 'lang': 'eng'}
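
During training the `TEXT` field applied `generate_bigrams` to every example, whereas `predict_language` only lowercases and tokenizes the sentence. If matching the training-time preprocessing is desired, the encoding step could be sketched as follows. `encode_for_inference` and `toy_stoi` are hypothetical and stand in for the real `TEXT.vocab.stoi`; unknown tokens are assumed to map to index `0`.

```python
def tokenizer(sent):
    return sent.split(" ")


def generate_bigrams(x):
    # Same logic as the notebook's function.
    x = [i.lower() for i in x]
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(" ".join(n_gram))
    return x


def encode_for_inference(sent, stoi, unk_index=0):
    """Tokenize, add bigrams and map tokens to indices, as done for training data."""
    tokens = generate_bigrams(tokenizer(sent))
    return [stoi.get(tok, unk_index) for tok in tokens]


toy_stoi = {"<unk>": 0, "<pad>": 1, "this": 2, "is": 3, "this is": 4}
indices = encode_for_inference("This is new", toy_stoi)
print(indices[:3])          # unigram indices: [2, 3, 0]
print(sorted(indices[3:]))  # bigram indices (set order varies): [0, 4]
```

The resulting list could then be wrapped in a tensor exactly as `predict_language` already does with `indexed`.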

In [28]:
# deu
predict_language(language_identifier_model, "Herzlichen Glückwunsch zum Geburtstag, Muiriel!")

{'label': 3, 'lang': 'deu'}

In [29]:
# deu
predict_language(language_identifier_model,
                 "Herzlichen Glückwunsch zum Geburtstag, Muiriel!")

{'label': 3, 'lang': 'deu'}

In [30]:
# ita
predict_language(language_identifier_model,
                 "Si è fatto tagliare i capelli.")

{'label': 4, 'lang': 'ita'}

In [31]:
# fra
predict_language(language_identifier_model,
                 "J'ai peur de tomber.")

{'label': 2, 'lang': 'fra'}

In [33]:
# swe
predict_language(language_identifier_model,
                 "ag skrev min första mening på tyska.")

{'label': 1, 'lang': 'swe'}

In [34]:
# afr
predict_language(language_identifier_model,
                 "Ek gaan nie lank hier wees nie.")

{'label': 6, 'lang': 'afr'}

In [35]:
# por
predict_language(language_identifier_model,
                 "Para mim não há problema algum.")

{'label': 5, 'lang': 'por'}