# Language Modelling with a Bidirectional Multilayer LSTM

In this notebook we build a bidirectional multilayer LSTM language model.

Data preparation and batching are handled with Torchtext.
Paragraphs are tokenized with mosestokenizer.

In [1]:
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install -U torchtext
!pip install -U mosestokenizer

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.7.1%2Bcu101-cp36-cp36m-linux_x86_64.whl (735.4MB)
Collecting torchvision==0.8.2+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.8.2%2Bcu101-cp36-cp36m-linux_x86_64.whl (12.8MB)
Installing collected packages: torch, torchvision
  Found existing installation: torch 1.7.0+cu101
    Uninstalling torch-1.7.0+cu101:
      Successfully uninstalled torch-1.7.0+cu101
  Found existing installation: torchvision 0.8.1+cu101
    Uninstalling torchvision-0.8.1+cu101:
      Successfully uninstalled torchvision-0.8.1+cu101
Successfully installed torch-1.7.1+cu101 torchvision-0.8.2+cu101
Collecting torchtext
  Downloading https://files.pythonhosted.org/packages/0e/81/be2d72b1ea641a

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Make the project folder on Drive importable, then load the helper module
import sys
sys.path.append('/content/drive/MyDrive/demetre_{pipia, uridia}')
import utils

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

import torch
import torch.nn as nn
import torchtext
from torchtext.datasets import PennTreebank, LanguageModelingDataset
from mosestokenizer import *
import gensim
from tqdm import tqdm_notebook
from gensim.models import KeyedVectors
import re

# This notebook was tested with PyTorch 1.7.1 and Torchtext 0.8.1
print(torch.__version__, torchtext.__version__)

1.7.1+cu101 0.8.1


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [None]:
# Build the text field and the data loaders for the LSTM model with the helpers in utils

w2v_model_path = '/content/drive/MyDrive/demetre_{pipia, uridia}/resources/word2vec.model_paragraph_all_only_georgian_shuffled_3M_30it'
df_path = '/content/drive/MyDrive/demetre_{pipia, uridia}/data/paragraph_all_only_georgian_shuffled.csv'
text_field = utils.TextField(w2v_model_path, df_path) 
txt_field = text_field.get_txt_field()

train_dl, dev_dl, test_dl = utils.DataLoader(text_field, device).get_dls()

In [None]:
len(train_dl), len(dev_dl), len(test_dl)

(5319, 285, 285)

In [None]:
batch = next(iter(train_dl))

# The shape should be (batch_size, bptt_len). With batch_first=False we would get (bptt_len, batch_size) instead.
batch.text.shape, batch.target.shape 

(torch.Size([128, 10]), torch.Size([128, 10]))

In [None]:
# Note that 'target' is 'text' shifted left by one token, which is exactly what next-word prediction needs.
batch.text[11], batch.target[11]

(tensor([    0,     0,  1614,   495, 27260, 20908,     4,    73,     0, 16530],
        device='cuda:0'),
 tensor([    0,  1614,   495, 27260, 20908,     4,    73,     0, 16530,   677],
        device='cuda:0'))
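The text/target relationship shown above can be reproduced on a toy token stream. The sketch below is a minimal pure-Python illustration and not the code inside `utils.DataLoader`, which is not shown in this notebook; the function name `bptt_pairs` and the toy ids are assumptions made for the example.

```python
from typing import List, Tuple


def bptt_pairs(token_ids: List[int], bptt_len: int) -> List[Tuple[List[int], List[int]]]:
    """Cut a token stream into (text, target) windows for next-word prediction.

    Each target window is the text window shifted left by one position,
    so target[i] is the token that follows text[i] in the stream.
    """
    pairs = []
    # Stop one token early: every text token needs a successor to predict.
    for start in range(0, len(token_ids) - 1, bptt_len):
        text = token_ids[start:start + bptt_len]
        target = token_ids[start + 1:start + 1 + bptt_len]
        if len(text) != len(target):
            # In the final window, drop the last text token because it has no successor.
            text = text[:len(target)]
        pairs.append((text, target))
    return pairs


stream = [5, 8, 2, 9, 4, 7, 3]  # hypothetical token ids
pairs = bptt_pairs(stream, bptt_len=3)
for text, target in pairs:
    print(text, "->", target)
```

Within every pair, `text[1:]` equals `target[:-1]`, which is the same pattern visible in `batch.text[11]` and `batch.target[11]`.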

In [None]:
hidden_dim = 100
model = utils.LSTMModel(utils.EMBED_SIZE, hidden_dim, len(txt_field.vocab), txt_field, device, num_layers=2).to(device)
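Since `utils.LSTMModel` lives in a separate module, here is a rough idea of what a bidirectional multilayer LSTM language model can look like in PyTorch. This is only a sketch under assumptions: the class name `BiLSTMLanguageModel`, the layer names `lstm` and `out`, and the toy vocabulary size are hypothetical. The embedding layer is called `emb` only because the saved weights in this notebook are stored under `'emb.weight'`.

```python
import torch
import torch.nn as nn


class BiLSTMLanguageModel(nn.Module):
    """Hypothetical stand-in for utils.LSTMModel: embedding -> BiLSTM -> vocabulary logits."""

    def __init__(self, embed_size: int, hidden_dim: int, vocab_size: int, num_layers: int = 2):
        super().__init__()
        # Named `emb` so the weights appear under 'emb.weight' in state_dict().
        self.emb = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated, hence 2 * hidden_dim.
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, text: torch.Tensor) -> torch.Tensor:
        embedded = self.emb(text)        # (batch_size, bptt_len, embed_size)
        hidden, _ = self.lstm(embedded)  # (batch_size, bptt_len, 2 * hidden_dim)
        return self.out(hidden)          # (batch_size, bptt_len, vocab_size)


toy_model = BiLSTMLanguageModel(embed_size=100, hidden_dim=100, vocab_size=1000)
toy_batch = torch.randint(0, 1000, (128, 10))  # same shape as batch.text above
logits = toy_model(toy_batch)
print(logits.shape)
```

The logits have one score per vocabulary word at every position, so they can be compared against `batch.target` with a cross-entropy loss.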

# Train

In [None]:
epochs = 1
model_save_path = '/content/drive/MyDrive/demetre_{pipia, uridia}/resources/model_tmp'
utils.train_loop(model, train_dl, dev_dl, device, epochs, model_save_path)

Epoch 1 | Iter 100 | Avg Train Loss 11.275244369506837 | Dev Perplexity None | LR  0.0001
Epoch 1 | Iter 200 | Avg Train Loss 9.500002593994141 | Dev Perplexity None | LR  0.0001
Epoch 1 | Iter 300 | Avg Train Loss 8.438881168365478 | Dev Perplexity None | LR  0.0001
Epoch 1 | Iter 400 | Avg Train Loss 8.383275566101075 | Dev Perplexity None | LR  0.0001
Epoch 1 | Iter 500 | Avg Train Loss 8.355646858215332 | Dev Perplexity 3854.6311255381393 | LR  0.0001
Epoch 1 | Iter 600 | Avg Train Loss 8.322541770935059 | Dev Perplexity 3854.6311255381393 | LR  0.0001
Epoch 1 | Iter 700 | Avg Train Loss 8.32110122680664 | Dev Perplexity 3854.6311255381393 | LR  0.0001
Epoch 1 | Iter 800 | Avg Train Loss 8.288846797943116 | Dev Perplexity 3854.6311255381393 | LR  0.0001
Epoch 1 | Iter 900 | Avg Train Loss 8.309331703186036 | Dev Perplexity 3854.6311255381393 | LR  0.0001
Epoch 1 | Iter 1000 | Avg Train Loss 8.286775550842286 | Dev Perplexity 3570.756080078349 | LR  0.0001
Epoch 1 | Iter 1100 | Avg 

# Compute perplexity on the test data

In [None]:
utils.compute_perplexity(model, test_dl, device)

520.5526483478595
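Perplexity is the exponential of the average per-token negative log-likelihood (cross-entropy in nats). The body of `utils.compute_perplexity` is not shown here, so the sketch below only illustrates the standard definition; the helper name `perplexity` is an assumption made for the example.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), using natural logs."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that spreads probability uniformly over V words has perplexity V.
vocab_size = 87151
uniform_nll = -math.log(1.0 / vocab_size)
print(perplexity([uniform_nll] * 10))

# An average loss of about 8.257 nats corresponds to a perplexity of roughly 3850.
print(perplexity([8.257]))
```

This also explains the training log: an untrained model starts near a loss of ln(87151) ≈ 11.4, and the perplexity falls as the average loss falls.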

In [None]:
# Load a previously trained model; the hyperparameters must match those of the saved checkpoint
model = utils.LSTMModel(utils.EMBED_SIZE, 100, len(txt_field.vocab), txt_field, device, num_layers=2).to(device)
model.load_state_dict(torch.load('/content/drive/MyDrive/demetre_{pipia, uridia}/resources/lstm_model_300K_tuning', map_location=device))

<All keys matched successfully>

# Save embeddings after training

In [None]:
embs = model.state_dict()['emb.weight']
embs.shape

torch.Size([87151, 100])

In [None]:
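# Write the embeddings in word2vec text format: a "<vocab_size> <dim>" header,
# then one "<token> <v1> ... <vn>" line per word.
# Mode 'w' (not 'a') so that re-running the cell does not append a second copy to the file.
with open("/content/drive/MyDrive/demetre_{pipia, uridia}/resources/word2vec_from_trained_model", 'w', encoding='utf-8') as the_file:
    the_file.write(f"{embs.size(0)} {embs.size(1)}\n")
    for i in range(embs.size(0)):
        # tolist() moves the whole row to Python floats at once instead of calling .item() per value
        the_file.write(txt_field.vocab.itos[i] + ' ' + ' '.join(str(v) for v in embs[i].tolist()) + '\n')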
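To sanity-check a file written this way, we can read it back and confirm that the header matches the contents. The sketch below is a minimal pure-Python reader for the word2vec text layout (a header line, then one word and its vector per line); `read_word2vec_text` and the two sample words are assumptions made for the example, and it reads from an in-memory string rather than the Drive file.

```python
import io
from typing import Dict, List, TextIO


def read_word2vec_text(handle: TextIO) -> Dict[str, List[float]]:
    """Parse word2vec text format and check it against its "<vocab_size> <dim>" header."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"{word!r} has {len(values)} values, expected {dim}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"header promises {vocab_size} words, found {len(vectors)}")
    return vectors


# Hypothetical two-word file with 3-dimensional vectors.
sample = io.StringIO("2 3\nსიტყვა 0.1 0.2 0.3\nენა -0.5 0.0 1.5\n")
vectors = read_word2vec_text(sample)
print(len(vectors), vectors["ენა"])
```

A file that had been written twice in append mode would fail the header check, because a second header line would be parsed as a word with the wrong number of values.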