## About
DeepNLP with PyTorch

1. Representing Linguistic units.

The task is to map each item in a text (characters and words) to a vector that can be processed in PyTorch or an equivalent deep learning framework.

In [17]:
# mandatory imports
import torch as T
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
import matplotlib.pyplot as plt
from torchsummary import summary
import random
import pandas as pd
%matplotlib inline

In [18]:
num_chars = 26
dimension_char_vector = 3
# create an embedding table: one learnable 3-dimensional vector per character

char_embeddings = nn.Embedding(num_chars,dimension_char_vector)
print(char_embeddings)

Embedding(26, 3)


In [19]:
#'cab'
char_sequence =  [2,0,1]
#converting to tensor
char_idxs = T.LongTensor([char_sequence])
char_vector = char_embeddings(char_idxs)
print(char_vector)

tensor([[[ 0.2744, -1.4286, -0.9463],
         [-0.7210,  1.1087, -1.3602],
         [-0.0595,  0.0340,  1.3645]]], grad_fn=<EmbeddingBackward0>)
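The same lookup can be reproduced for any lowercase word. The sketch below assumes the indexing a=0 ... z=25 implied by the 'cab' example; `encode_word` is a hypothetical helper, not part of PyTorch.

```python
import string
import torch
import torch.nn as nn

# Assumption: characters are indexed a=0 ... z=25, as in the 'cab' example.
char_to_idx = {ch: i for i, ch in enumerate(string.ascii_lowercase)}

def encode_word(word):
    """Hypothetical helper: map a lowercase word to a (1, len(word)) index tensor."""
    return torch.tensor([[char_to_idx[ch] for ch in word]], dtype=torch.long)

embedding = nn.Embedding(num_embeddings=26, embedding_dim=3)
idxs = encode_word("cab")
vectors = embedding(idxs)

print(idxs.tolist())         # [[2, 0, 1]]
print(tuple(vectors.shape))  # (1, 3, 3)
```

Each row of the result is simply the row of `embedding.weight` selected by the character index, so the values change with every random initialisation while the shape stays the same.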


2. Processing these via Recurrent Neural Networks (RNNs)

- These networks learn the context of a sequence by carrying a hidden state that summarizes the items seen so far.

- Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs) are the two common gated variants used for this.

- GRUs have one gate fewer than LSTMs and usually train faster, so we will use them here.
- Each index in the character vector, i.e. each character, is fed to one GRU step.

In [20]:
dimension_gru = 5
num_layers= 1
gru = nn.GRU(dimension_char_vector, dimension_gru, num_layers)

In [21]:
context_vectors, last_hidden_ids = gru(char_vector)

In [22]:
context_vectors


tensor([[[-0.2390, -0.0721,  0.1782, -0.4897,  0.0909],
         [ 0.3331,  0.0687,  0.1149, -0.0594,  0.4009],
         [-0.3206,  0.2261,  0.2416, -0.3540,  0.0245]]],
       grad_fn=<StackBackward0>)

In [23]:
# squeeze() drops every dimension of size 1 (here the leading one)
context_vectors.squeeze()

tensor([[-0.2390, -0.0721,  0.1782, -0.4897,  0.0909],
        [ 0.3331,  0.0687,  0.1149, -0.0594,  0.4009],
        [-0.3206,  0.2261,  0.2416, -0.3540,  0.0245]],
       grad_fn=<SqueezeBackward0>)
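One detail worth checking here: by default `nn.GRU` expects its input as (sequence length, batch, features). A tensor of shape (1, 3, 3) is therefore read as three sequences of length one, so no context flows between the characters. The sketch below is an assumption about the intended layout, not a change to the cells above: it compares the default with `batch_first=True` by perturbing the first character and checking whether the output at the last position reacts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 3, 3)   # intended as (batch=1, seq_len=3, features=3)

gru_default = nn.GRU(3, 5, 1)                        # reads x as (seq_len=1, batch=3, features=3)
gru_batch_first = nn.GRU(3, 5, 1, batch_first=True)  # reads x as (batch=1, seq_len=3, features=3)

# Perturb only the first character.
x2 = x.clone()
x2[0, 0] += 1.0

with torch.no_grad():
    out_a, _ = gru_default(x)
    out_b, _ = gru_default(x2)
    out_c, _ = gru_batch_first(x)
    out_d, _ = gru_batch_first(x2)

# Does the output for the third character depend on the first character?
default_sees_context = not torch.allclose(out_a[0, 2], out_b[0, 2])
batch_first_sees_context = not torch.allclose(out_c[0, 2], out_d[0, 2])
print(default_sees_context, batch_first_sees_context)
```

If the second value is `True` and the first is `False`, only the `batch_first=True` variant treats 'cab' as one sequence of three characters.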

We can also learn the context from both sides of each character. For that, we need to set the `bidirectional` flag.

In [24]:
bidirectional_gru = nn.GRU(dimension_char_vector, dimension_gru, num_layers,bidirectional=True)

In [25]:
bidirectional_context_vectors, last_hidden_ids = bidirectional_gru(char_vector)

In [26]:
bidirectional_context_vectors

tensor([[[-0.1700, -0.3908,  0.3062,  0.1363, -0.1156, -0.3271, -0.3678,
          -0.1393, -0.0938, -0.2658],
         [ 0.2168,  0.0903,  0.2295,  0.1065, -0.2877,  0.2146,  0.2384,
           0.2826,  0.0298, -0.1723],
         [-0.1992,  0.2225, -0.2047,  0.2786,  0.1238, -0.3141,  0.0886,
           0.3634,  0.2949, -0.2617]]], grad_fn=<CatBackward0>)

In [27]:
bidirectional_context_vectors.squeeze()

tensor([[-0.1700, -0.3908,  0.3062,  0.1363, -0.1156, -0.3271, -0.3678, -0.1393,
         -0.0938, -0.2658],
        [ 0.2168,  0.0903,  0.2295,  0.1065, -0.2877,  0.2146,  0.2384,  0.2826,
          0.0298, -0.1723],
        [-0.1992,  0.2225, -0.2047,  0.2786,  0.1238, -0.3141,  0.0886,  0.3634,
          0.2949, -0.2617]], grad_fn=<SqueezeBackward0>)

With `bidirectional=True`, the size of each context vector doubles (here from 5 to 10), because the forward and backward states are concatenated.
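A quick shape check makes the doubling concrete. This is a minimal sketch on a random input; `batch_first=True` is an assumption made here so that the input is read as one sequence of three characters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 5
x = torch.randn(1, 3, 3)  # (batch=1, seq_len=3, features=3)

uni = nn.GRU(3, hidden_size, 1, batch_first=True)
bi = nn.GRU(3, hidden_size, 1, batch_first=True, bidirectional=True)

out_uni, h_uni = uni(x)
out_bi, h_bi = bi(x)

print(tuple(out_uni.shape), tuple(out_bi.shape))  # (1, 3, 5) (1, 3, 10)
print(tuple(h_uni.shape), tuple(h_bi.shape))      # (1, 1, 5) (2, 1, 5)

# The last dimension is laid out as [forward | backward]; split it per direction.
forward_part, backward_part = out_bi.split(hidden_size, dim=-1)
```

The forward half at the last position and the backward half at the first position are the two final hidden states returned as the second output.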

- The more RNN layers we stack, the more abstract the representations the network can learn, such as syllables or morphemes.

- For example, a second layer on top of the first (n=1) can learn syllable-like units from characters.
- A third layer on top of n=2 can learn morpheme-like units from syllable-level representations.

However, adding layers comes at the cost of longer training time.

Let's add more layers.

In [28]:
bidirectional_gru_2 = nn.GRU(dimension_char_vector, dimension_gru, num_layers=2,bidirectional=True)

In [29]:
bidirectional_context_vectors2, last_hidden_ids = bidirectional_gru_2(char_vector)

In [30]:
bidirectional_context_vectors2

tensor([[[-0.2284,  0.1906, -0.1641, -0.1026,  0.0951,  0.1044,  0.1135,
           0.1801,  0.0754,  0.1298],
         [-0.0714,  0.3674, -0.1182,  0.0193,  0.0399, -0.0024,  0.0944,
           0.1611, -0.1253,  0.0265],
         [-0.1115,  0.2815,  0.1032,  0.0078,  0.0823,  0.1562, -0.1154,
           0.2552,  0.0586,  0.1108]]], grad_fn=<CatBackward0>)
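One rough way to see the training cost of extra layers is to count parameters. The sketch below uses the same sizes as above (input 3, hidden 5); `count_parameters` is a hypothetical helper written for this illustration.

```python
import torch.nn as nn

def count_parameters(module):
    """Hypothetical helper: total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

one_layer = nn.GRU(3, 5, num_layers=1, bidirectional=True)
two_layers = nn.GRU(3, 5, num_layers=2, bidirectional=True)

params_one = count_parameters(one_layer)
params_two = count_parameters(two_layers)
print(params_one, params_two)  # 300 810
```

The second layer is larger than the first because its input is the 10-dimensional bidirectional output of the layer below rather than the 3-dimensional character vector.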

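3. Sequence Prediction.

Here we assume that a given string is generated statistically by a hidden sequence of state transitions, as in a Hidden Markov Model (HMM). The task is to predict that hidden state for every item in the sequence.

- This formulation is mostly used for POS tagging, NER and word segmentation.

- In NER, the states are usually the BIO tags: B = beginning of an entity, I = inside an entity, O = outside any entity.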
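To make the BIO scheme concrete, here is a small sketch that turns entity spans into one tag per token. The sentence, the spans and the helper `spans_to_bio` are made up for illustration.

```python
def spans_to_bio(tokens, spans):
    """Hypothetical helper: spans is a list of (start, end, label), end exclusive."""
    tags = ["O"] * len(tokens)            # everything is outside by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining tokens of the entity
    return tags

tokens = ["Ada", "Lovelace", "visited", "New", "York", "City"]
spans = [(0, 2, "PER"), (3, 6, "LOC")]
tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, tags)))
```

A sequence model then only has to predict one of these tags at every position, which is the same setup as predicting one output symbol per character.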



## Word Segmentation using GRUs.

Dataset - https://www.kaggle.com/datasets/thedevastator/morphemic-segmentation-of-english-words?select=lookup.csv

In [31]:
# loading the dataset
df = pd.read_csv('/home/suraj/ClickUp/Jan-Feb/data/lookup.csv')

In [32]:
df.head()

| Unnamed: 0 | y | x |
|---|---|---|
| 0 | later | late ##er |
| 1 | years | year ##s |
| 2 | used | use ##ed |
| 3 | being | be ##ing |
| 4 | united | unite ##ed |


In [105]:
words = df['y'].values.tolist()
segmented_words = df['x'].values.tolist()

In [106]:
words[:5]

['later', 'years', 'used', 'being', 'united']

In [107]:
segmented_words[:5]

['late ##er', 'year ##s', 'use ##ed', 'be ##ing', 'unite ##ed']

In [108]:
import string
alphabet = list(string.ascii_lowercase)
alphabet_mapper = {alphabet[i]:i for i in range(len(alphabet))}
# add two extra entries: one for the blank and one for the hash sign
alphabet_mapper[' ']=26
alphabet_mapper['#'] =27
print(alphabet_mapper)

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, ' ': 26, '#': 27}


In [109]:
# pad a list in place with `content` until it has `width` items (28 is used below)
def pad(l, content, width):
    l.extend([content] * (width - len(l)))
    return l
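Two properties of this helper are easy to miss: it modifies the list it is given, and padding with 0 reuses the index of 'a'. The sketch below shows one possible alternative, assuming a dedicated padding index one past the 28 symbols; `PAD_IDX` and `pad_copy` are hypothetical names.

```python
PAD_IDX = 28  # assumption: one index past the 28 symbols a-z, ' ' and '#'

def pad_copy(seq, pad_value, width):
    """Return a new list padded (or truncated) to exactly `width` items."""
    return (list(seq) + [pad_value] * max(0, width - len(seq)))[:width]

original = [24, 4, 0, 17, 18]          # 'years'
padded = pad_copy(original, PAD_IDX, 10)
print(padded)    # [24, 4, 0, 17, 18, 28, 28, 28, 28, 28]
print(original)  # [24, 4, 0, 17, 18]  (unchanged)
```

With a separate padding index, the embedding table would need 29 rows, and `nn.Embedding(..., padding_idx=PAD_IDX)` together with `nn.CrossEntropyLoss(ignore_index=PAD_IDX)` could keep padded positions out of training.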

In [110]:
word_sequences = []
for word in words:
    char_sequence = []
    for char in list(word):
        char_sequence.append(alphabet_mapper[char])
    word_sequences.append(pad(char_sequence,0,28))

In [116]:
word_sequences[1]

[24,
 4,
 0,
 17,
 18,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [117]:
seg_word_sequences = []
for word in segmented_words:
    char_sequence = []
    for char in list(word):
        char_sequence.append(alphabet_mapper[str(char)])
    seg_word_sequences.append(pad(char_sequence,0,28))

In [133]:
type(seg_word_sequences[9][1])

int

In [119]:
# materialise the pairs in a list: a bare zip object is exhausted after one pass
train_data = list(zip(word_sequences,seg_word_sequences))

In [120]:
for idx, pair in enumerate(train_data):
    print(idx, pair)
    break

0 ([11, 0, 19, 4, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [11, 0, 19, 4, 26, 27, 27, 4, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
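A round-trip check helps confirm that the encoding is what we expect. The sketch below decodes the target sequence printed above back into text; `decode` is a hypothetical helper, and the true length has to be passed in because trailing zeros are indistinguishable from the letter 'a'.

```python
import string

# Inverse of alphabet_mapper: 0-25 -> a-z, 26 -> ' ', 27 -> '#'
idx_to_char = dict(enumerate(string.ascii_lowercase + " #"))

def decode(indices, length):
    """Hypothetical helper: turn the first `length` indices back into a string."""
    return "".join(idx_to_char[i] for i in indices[:length])

seg = [11, 0, 19, 4, 26, 27, 27, 4, 17] + [0] * 19
print(decode(seg, 9))    # late ##er
print(decode(seg, 12))   # late ##eraaa
```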


In [182]:
num_samples = len(words)
dimension_char_vector=28

In [273]:
class WordSeg(nn.Module):
    def __init__(self, num_chars, char_vector_dimension, dimension_gru, num_layers):
        super().__init__()
        # use the constructor arguments rather than the global variables
        self.character_embedding = nn.Embedding(num_chars, char_vector_dimension)
        # batch_first=True: the input is read as (batch, seq_len, features)
        self.gru = nn.GRU(char_vector_dimension, dimension_gru, num_layers,
                          bidirectional=True, batch_first=True)
        # adding activation
        self.activation1 = nn.Tanh()
        # linear layer mapping each context vector to one score per output symbol
        self.hidden_layer = nn.Linear(2*dimension_gru, num_chars)
        # softmax over symbols, only needed when inspecting probabilities
        self.activation2 = nn.Softmax(dim=1)

    def forward(self, char_sequence):
        char_indices = T.LongTensor([char_sequence])
        char_vector = self.character_embedding(char_indices)
        context_vectors, last_hidden_ids = self.gru(char_vector)
        # drop only the batch dimension: (1, seq_len, 2*dimension_gru) -> (seq_len, 2*dimension_gru)
        context_vectors = context_vectors.squeeze(0)
        # return raw scores (logits): nn.CrossEntropyLoss applies log-softmax itself
        return self.hidden_layer(self.activation1(context_vectors))

In [274]:
dimension_gru = 28
model = WordSeg(28,dimension_char_vector,dimension_gru,num_layers)

In [275]:
device = T.device("cpu")
model = model.to(device)

In [276]:
# training
num_epochs = 5
loss_fn = nn.CrossEntropyLoss() # log-softmax followed by negative log-likelihood
learning_rate =0.01
optimizer = optim.SGD(model.parameters(),lr=learning_rate)
losses = []

In [277]:
for i in tqdm(range(num_epochs)):
    total_loss=0.0
    for charseq, segmented_char in train_data:
        # one target class index (0-27) per output position, taken from the segmented word
        targets = T.LongTensor(segmented_char)

        preds = model(charseq)

        loss = loss_fn(preds,targets)

        total_loss+=loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("Epoch - {}, Loss - {}".format(i,total_loss/num_samples))
    losses.append(total_loss/num_samples)

  0%|          | 0/5 [00:09<?, ?it/s]


KeyboardInterrupt: 
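For reference, here is a minimal sketch of a single-sample training step in which the loss actually depends on the segmented target. `nn.CrossEntropyLoss` expects raw scores of shape (positions, classes) and integer class indices of shape (positions,). `TinyTagger`, its layer sizes, the learning rate and the number of steps are all assumptions made for this illustration, not the notebook's model.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
num_symbols = 28  # a-z, ' ' and '#'

class TinyTagger(nn.Module):
    """Hypothetical stand-in: embedding -> bidirectional GRU -> linear logits."""
    def __init__(self, num_symbols, emb_dim=16, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(num_symbols, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_symbols)

    def forward(self, idxs):                 # idxs: (batch, seq_len)
        ctx, _ = self.gru(self.emb(idxs))    # (batch, seq_len, 2*hidden)
        return self.out(ctx)                 # raw logits, no softmax

model = TinyTagger(num_symbols)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# The (input, target) pair printed earlier: 'later' -> 'late ##er', padded to 28.
x = torch.tensor([[11, 0, 19, 4, 17] + [0] * 23])
y = torch.tensor([[11, 0, 19, 4, 26, 27, 27, 4, 17] + [0] * 19])

losses = []
for _ in range(20):
    logits = model(x)                                               # (1, 28, 28)
    loss = loss_fn(logits.reshape(-1, num_symbols), y.reshape(-1))  # targets are class indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Repeating the step on one sample only shows that the loss can go down; a real run would loop over the whole list of pairs, ideally in batches.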

Refer to the following table for a brief overview of the layer types implemented in PyTorch.

![Layers](Layers.png "Layers")

