# Intro

This notebook covers:</br>
- **torchtext**</br>
    - Loading data with torchtext</br>
    - Using torchtext Fields, Vocab</br>
    - Building iterator from Fields</br>
- Using **pretrained** word embeddings (FastText)





Loading custom dataset with torchtext [tutorial](https://colab.research.google.com/github/bentrevett/pytorch-sentiment-analysis/blob/master/A%20-%20Using%20TorchText%20with%20Your%20Own%20Datasets.ipynb#scrollTo=dSWFMam5VZeu)</br>
Torchtext in action [notebook](https://colab.research.google.com/github/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb#scrollTo=-1_qUNzqlYK6)</br>
Torchtext multiclass classification [tutorial](https://colab.research.google.com/github/bentrevett/pytorch-sentiment-analysis/blob/master/5%20-%20Multi-class%20Sentiment%20Analysis.ipynb#scrollTo=jR7tZ_5mxDAH)

**Note**: Textual explanations are took from aforementioned notebooks

TODO:</br>
Try to use torchtext pipeline for preprocessing

# Setup

In [1]:
!python -m spacy download en_core_web_sm &> /dev/null

In [342]:
import warnings
warnings.filterwarnings("ignore")

from torchtext.legacy import data
from torchtext.vocab import FastText

import torch
import torch.nn as nn
import torch.optim as optim

import pandas as pd
import spacy
import numpy as np
from sklearn.metrics import f1_score

from google_drive_downloader import GoogleDriveDownloader as gdd
from pathlib import Path
from tqdm import tqdm
import json
import time

In [3]:
TRAIN_DATASET_PATH = '/tmp/train-en.tsv'
TRAIN_JSON_DATASET_PATH = '/tmp/train-en.json'
EVAL_DATASET_PATH = '/tmp/eval-en.tsv'
EVAL_JSON_DATASET_PATH = '/tmp/eval-en.json'
TRAIN_DATASET_GDRIVE_ID = '196d23FA_YFJTpu_yDRpXjBG6DPKs9WI2'
EVAL_DATASET_GDRIVE_ID = '1p4O7Y2ePV17gPbwPTOio5EM7QMXHFI5V'
TEXT = 'text'
LABEL = 'label'

MAX_VOCAB_SIZE = 25_000
BATCH_SIZE = 64

In [4]:
def download_data_from_gdrive(local_path, gdrive_file_id):
    if not Path(local_path).is_file():
        gdd.download_file_from_google_drive(
            file_id=gdrive_file_id,
            dest_path=local_path,
        )
download_data_from_gdrive(TRAIN_DATASET_PATH, TRAIN_DATASET_GDRIVE_ID)
download_data_from_gdrive(EVAL_DATASET_PATH, EVAL_DATASET_GDRIVE_ID)

Downloading 196d23FA_YFJTpu_yDRpXjBG6DPKs9WI2 into /tmp/train-en.tsv... Done.
Downloading 1p4O7Y2ePV17gPbwPTOio5EM7QMXHFI5V into /tmp/eval-en.tsv... Done.


In [5]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Prepare

## Why use JSON over CSV/TSV with torchtext?

1. Your `csv` or `tsv` data cannot store lists. This means data cannot be already tokenized, thus everytime you run your Python script that reads this data via TorchText, it has to be tokenized. Using advanced tokenizers, such as the `spaCy` tokenizer, takes a non-negligible amount of time. Thus, it is better to tokenize your datasets and store them in the `json lines` format.

2. If tabs appear in your `tsv` data, or commas appear in your `csv` data, TorchText will think they are delimiters between columns. This will cause your data to be parsed incorrectly. Worst of all TorchText will not alert you to this as it cannot tell the difference between a tab/comma in a field and a tab/comma as a delimiter. As `json` data is essentially a dictionary, you access the data within the fields via its key, so do not have to worry about "surprise" delimiters.
</br>

**JSON example**
```
{"name": "John", "location": "United Kingdom", "age": 42, "quote": ["i", "love", "the", "united kingdom"]}
{"name": "Mary", "location": "United States", "age": 36, "quote": ["i", "want", "more", "telescopes"]}
```

**TSV example**
```
name	location	age	quote
John	United Kingdom	42	i love the united kingdom
Mary	United States	36	i want more telescopes
```

Preprocess data

In [6]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def clean(text):
    text = text.lower()
    text = text.strip() 
    text = ' '.join(text.split())  # replace whitespace with single space

    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop]
    tokens = [token for token in tokens if len(token) > 1]

    return ' '.join(tokens)

def preprocess(df):
    df.drop_duplicates(subset=[TEXT], inplace=True)
    texts = df[TEXT].tolist()
    texts_clean = [clean(x) for x in tqdm(texts)]
    df[TEXT] = texts_clean
    df.drop_duplicates(subset=[TEXT], inplace=True)
    df[TEXT] = df[TEXT].apply(lambda x: x.split())
    return df[TEXT].tolist(), df[LABEL].tolist()



In [7]:
def prepare_json_file(input_data_path, output_data_path):
    df = pd.read_csv(input_data_path, delimiter='\t', names=[LABEL, TEXT])
    texts, labels = preprocess(df)

    temp_list = []
    for t, l in zip(texts, labels):
        temp_list.append(json.dumps({LABEL: l, TEXT: t}))

    with open(output_data_path, 'w') as f:
        f.write('\n'.join(temp_list))

prepare_json_file(TRAIN_DATASET_PATH, TRAIN_JSON_DATASET_PATH)
prepare_json_file(EVAL_DATASET_PATH, EVAL_JSON_DATASET_PATH)

100%|██████████| 24551/24551 [01:12<00:00, 336.93it/s]
100%|██████████| 3894/3894 [00:11<00:00, 341.23it/s]


# Torchtext Dataset Iterator


Our `TEXT` field has `tokenize='spacy'` as an argument. This defines that the "tokenization" (the act of splitting the string into discrete "tokens") should be done using the [spaCy](https://spacy.io) tokenizer. If no `tokenize` argument is passed, the default is simply splitting the string on spaces. We also need to specify a `tokenizer_language` which tells torchtext which spaCy model to use. We use the `en_core_web_sm` model which has to be downloaded with `python -m spacy download en_core_web_sm` before you run this notebook!

`LABEL` is defined by a `LabelField`, a special subset of the `Field` class specifically used for handling labels.

More on [Field](https://pytorch.org/text/0.8.1/data.html#field)

We also set the random seeds for reproducibility.

In [433]:
SEED = 42

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True  # also used for reproducibility

TEXT_FIELD = data.Field(tokenize = 'spacy',
                        tokenizer_language = 'en_core_web_sm')
LABEL_FIELD = data.LabelField()

Next, we must tell TorchText which fields apply to which elements of the `json` object. 

For `json` data, we must create a dictionary where:
- the key matches the key of the `json` object
- the value is a tuple where:
  - the first element becomes the batch object's attribute name
  - the second element is the name of the `Field`
  
What do we mean when we say "becomes the batch object's attribute name"? Recall in the previous exercises where we accessed the `TEXT` and `LABEL` fields in the train/evaluation loop by using `batch.text` and `batch.label`, this is because TorchText sets the batch object to have a `text` and `label` attribute, each being a tensor containing either the text or the label.

A few notes:

* The order of the keys in the `fields` dictionary does not matter, as long as its keys match the `json` data keys.

- When dealing with `json` data, not all of the keys have to be used

- Also, if the values of `json` field are a string then the `Fields` tokenization is applied (default is to split the string on spaces), however if the values are a list then no tokenization is applied. Usually it is a good idea for the data to already be tokenized into a list, this saves time as you don't have to wait for TorchText to do it.

- The value of the `json` fields do not have to be the same type. Some examples can have their `"text"` as a string, and some as a list. The tokenization will only get applied to the ones with their `"text"` as a string.

- If you are using a `json` field, every single example must have an instance of that field, e.g. in this example all examples must have a text and label. However, if you are not using some field, it does not matter if an example does not have it.

In [434]:
fields = {TEXT: ('t', TEXT_FIELD), LABEL: ('l', LABEL_FIELD)} 

Now, in a training loop we can iterate over the data iterator and access the text via `batch.t` and the label via `batch.l`.

We then create our datasets (`train_data` and `eval_data`) with the `TabularDataset.splits` function. 

The `path` argument specifices the top level folder common among both datasets, and the `train` and `eval` arguments specify the filename of each dataset, e.g. here the train dataset is located at `tmp/train-en.json`.

We tell the function we are using `json` data, and pass in our `fields` dictionary defined previously.

In [435]:
train_data, eval_data = data.TabularDataset.splits(
                            path = '/tmp',
                            train = 'train-en.json',
                            validation = 'eval-en.json',
                            format = 'json',
                            fields = fields
)

In [436]:
print(train_data[0])
print(vars(train_data[0]))

<torchtext.legacy.data.example.Example object at 0x7f9b79b5afd0>
{'t': ['tell', 'weather', 'report', 'half', 'moon', 'bay'], 'l': 'weather/find'}


In [437]:
TEXT_FIELD.build_vocab(train_data, max_size=MAX_VOCAB_SIZE, vectors='fasttext.simple.300d', unk_init = torch.Tensor.normal_)
LABEL_FIELD.build_vocab(train_data)

In [438]:
vars(LABEL_FIELD.vocab)

{'freqs': Counter({'alarm/cancel_alarm': 624,
          'alarm/modify_alarm': 291,
          'alarm/set_alarm': 2003,
          'alarm/show_alarms': 237,
          'alarm/snooze_alarm': 110,
          'alarm/time_left_on_alarm': 97,
          'reminder/cancel_reminder': 593,
          'reminder/set_reminder': 3688,
          'reminder/show_reminders': 238,
          'weather/checkSunrise': 48,
          'weather/checkSunset': 63,
          'weather/find': 7550}),
 'itos': ['weather/find',
  'reminder/set_reminder',
  'alarm/set_alarm',
  'alarm/cancel_alarm',
  'reminder/cancel_reminder',
  'alarm/modify_alarm',
  'reminder/show_reminders',
  'alarm/show_alarms',
  'alarm/snooze_alarm',
  'alarm/time_left_on_alarm',
  'weather/checkSunset',
  'weather/checkSunrise'],
 'stoi': defaultdict(None,
             {'alarm/cancel_alarm': 3,
              'alarm/modify_alarm': 5,
              'alarm/set_alarm': 2,
              'alarm/show_alarms': 7,
              'alarm/snooze_alarm': 8,
    

In [439]:
TEXT_FIELD.vocab.itos[:10]

['<unk>',
 '<pad>',
 'alarm',
 'remind',
 'set',
 'today',
 'tomorrow',
 'reminder',
 'weather',
 'go']

In [440]:
print(TEXT_FIELD.vocab.vectors.shape)

torch.Size([3448, 300])


In [441]:
index = 1
print(len(TEXT_FIELD.vocab.vectors[index]))
print(TEXT_FIELD.vocab.itos[index])
print(TEXT_FIELD.vocab.vectors[index][:10])

300
<pad>
tensor([ 0.7694,  2.5574,  0.5716,  1.3596,  0.4334, -0.7172,  1.0554, -1.4534,
         1.7361,  1.8350])


In [442]:
word = 'alarms'
print(TEXT_FIELD.vocab.stoi[word])
print(TEXT_FIELD.vocab.vectors[TEXT_FIELD.vocab.stoi[word]][:10])

2054
tensor([ 0.0767, -0.0613, -0.1499,  0.3306,  0.3727, -0.0158, -0.2204,  0.2401,
         0.4039, -0.1509])


Then, we can create the iterators after defining our batch size and device.

By default, the train data is shuffled each epoch, but the validation/test data is sorted. However, TorchText doesn't know what to use to sort our data and it would throw an error if we don't tell it. 

There are two ways to handle this, you can either tell the iterator not to sort the validation/test data by passing `sort = False`, or you can tell it how to sort the data by passing a `sort_key`. A sort key is a function that returns a key on which to sort the data on. For example, `lambda x: x.s` will sort the examples by their `s` attribute, i.e their quote. Ideally, you want to use a sort key as the `BucketIterator` will then be able to sort your examples and then minimize the amount of padding within each batch.

We can then iterate over our iterator to get batches of data. Note how by default TorchText has the batch dimension second.

In [443]:
train_iterator, eval_iterator = data.BucketIterator.splits(
    (train_data, eval_data),
    sort_key = lambda x: x.t,  # sort by t attribute (text)
    batch_size=BATCH_SIZE,
    device=device)

# Model

In [545]:
fastText = FastText('simple')
oovs = []
vocab_shape = TEXT_FIELD.vocab.vectors.shape
weights_matrix = np.zeros((vocab_shape[0], vocab_shape[1]))

for i, word in enumerate(TEXT_FIELD.vocab.itos):
    try: 
        weights_matrix[i] = fastText[word]
    except KeyError:
        weights_matrix[i] = np.random.normal(scale=0.6, size=(emb_dim, ))
        oovs.append(word)

In [546]:
print(len(oovs))

0


In [547]:
def create_emb_layer(weights_matrix, trainable=True):
    num_embeddings, embedding_dim = weights_matrix.shape
    emb_layer = nn.EmbeddingBag(num_embeddings, embedding_dim)  # averages word embeddings
    emb_layer.weight=nn.Parameter(torch.tensor(weights_matrix,dtype=torch.float32))
    if not trainable:
        emb_layer.weight.requires_grad = False  # default value is True

    return emb_layer, num_embeddings, embedding_dim

Dropout is implemented by initializing an `nn.Dropout` layer (the argument is the probability of dropping out each neuron) and using it within the `forward` method after each layer we want to apply dropout to. **Note**: never use dropout on the input or output layers, you only ever want to use dropout on hidden layers.

Shoud you fine tune word embeddings? [link](https://stackoverflow.com/questions/58630101/using-torch-nn-embedding-for-glove-should-we-fine-tune-the-embeddings-or-just-u)

In [548]:
class Logistic_Regression(nn.Module):
    def __init__(self, weights_matrix, hidden_dim, output_dim, dropout):
        super(Logistic_Regression, self).__init__()
        self.embedding, num_embeddings, embedding_dim = create_emb_layer(weights_matrix, False)
        self.hidden_layer = nn.Linear(embedding_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.output_layer = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.embedding(x.view(x.shape[1], x.shape[0]))  # set batch size to be first dimension
        x = self.dropout(self.hidden_layer(x))
        x = self.output_layer(x)
        return x

In [549]:
model = Logistic_Regression(weights_matrix=weights_matrix,  # dimension of FastText embeddings
                            hidden_dim=128,
                            output_dim=len(LABEL_FIELD.vocab),
                            dropout=0.1)  

In [550]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 40,076 trainable parameters


In [551]:
for name, param in model.named_parameters(): 
    print(name, '\ttrainable='+str(param.requires_grad))

embedding.weight 	trainable=False
hidden_layer.weight 	trainable=True
hidden_layer.bias 	trainable=True
output_layer.weight 	trainable=True
output_layer.bias 	trainable=True


# Train

In [552]:
softmax = nn.Softmax()
def calculate_f1(outputs, targets):
    out = softmax(outputs)
    y_pred = torch.argmax(out, dim=1).tolist()
    y = targets.tolist()
    return f1_score(y, y_pred, average='micro')

**Note**: as we are now using dropout, we must remember to use `model.train()` to ensure the dropout is "turned on" while training.

In [553]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_f1 = 0

    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.t)
        
        loss = criterion(predictions, batch.l)
        
        f1 = calculate_f1(predictions, batch.l)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_f1 += f1
        
    return epoch_loss / len(iterator), epoch_f1 / len(iterator)

**Note**: as we are now using dropout, we must remember to use `model.eval()` to ensure the dropout is "turned off" while evaluating.

In [554]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_f1 = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
            
            predictions = model(batch.t)
            
            loss = criterion(predictions, batch.l)
            
            f1 = calculate_f1(predictions, batch.l)

            epoch_loss += loss.item()
            epoch_f1 += f1
        
    return epoch_loss / len(iterator), epoch_f1 / len(iterator)

In [555]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [556]:
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
#criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

Let's train!

In [558]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_f1 = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_f1 = evaluate(model, eval_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train F1: {train_f1*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. F1: {valid_f1*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 0s
	Train Loss: 1.526 | Train F1: 48.57%
	 Val. Loss: 1.541 |  Val. F1: 48.80%
Epoch: 02 | Epoch Time: 0m 0s
	Train Loss: 1.525 | Train F1: 48.57%
	 Val. Loss: 1.533 |  Val. F1: 48.80%
Epoch: 03 | Epoch Time: 0m 0s
	Train Loss: 1.525 | Train F1: 48.57%
	 Val. Loss: 1.555 |  Val. F1: 48.80%
Epoch: 04 | Epoch Time: 0m 0s
	Train Loss: 1.525 | Train F1: 48.57%
	 Val. Loss: 1.530 |  Val. F1: 48.80%
Epoch: 05 | Epoch Time: 0m 0s
	Train Loss: 1.525 | Train F1: 48.58%
	 Val. Loss: 1.536 |  Val. F1: 48.80%
