This lab introduces:
* Tokenization 
* Dataset and Dataloader in PyTorch
* Bag-of-Words (BoW) models

### Dataset

We will start by downloading the 20 Newsgroups text dataset:

```http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset```

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroup_train = fetch_20newsgroups(subset='train')
newsgroup_test = fetch_20newsgroups(subset='test') # we will use it later
print(type(newsgroup_train))
print(type(newsgroup_test))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


<class 'sklearn.utils.Bunch'>
<class 'sklearn.utils.Bunch'>


In [2]:
# Split train data into actual train and validation sets

train_split = 10000
train_data = newsgroup_train.data[:train_split]
train_targets = newsgroup_train.target[:train_split]

val_data = newsgroup_train.data[train_split:]
val_targets = newsgroup_train.target[train_split:]

test_data = newsgroup_test.data
test_targets = newsgroup_test.target

print ("Train dataset size is {}".format(len(train_data)))
print ("Val dataset size is {}".format(len(val_data)))
print ("Test dataset size is {}".format(len(test_data)))

Train dataset size is 10000
Val dataset size is 1314
Test dataset size is 7532


In [3]:
print ("Unique labels are {}".format((set(test_targets))))
print ("Numbers of target variables: {}".format(len(set(test_targets))))

Unique labels are {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}
Numbers of target variables: 20


In [4]:
from pprint import pprint
pprint(list(newsgroup_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [5]:
# Random sample from train dataset
import random
print (train_data[random.randint(0, len(train_data) - 1)])

From: hades@coos.dartmouth.edu (Brian V. Hughes)
Subject: Re: LCIII->PowerPC?
Reply-To: hades@Dartmouth.Edu
Organization: Dartmouth College, Hanover, NH
Disclaimer: Personally, I really don't care who you think I speak for.
Moderator: Rec.Arts.Comics.Info
Lines: 10

mirsky@hal.gnu.ai.mit.edu (David Joshua Mirsky) writes:

>Hi. I own an LCIII and I recently heard an interesting rumor.
>I heard that the LCIII has a built in slot for a PowerPC chip.
>Is this true? I heard that the slot is not the same as the PDS
>slot.  Is that true?

    Don't believe the hype. There is no such thing as a PowerPC slot.

-Hades



## Tokenizing Dataset

Before we train the classifier, we have to tokenize the dataset. Tokenization means splitting a sequence (here, a document or sentence) into small constituent units (we will use words as our units), often discarding some characters such as punctuation marks or special symbols. Transformations such as lowercasing all characters can also be treated as part of tokenization.
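To make the idea concrete before bringing in a full library, here is a minimal sketch that uses only the Python standard library. The function name `simple_tokenize` and the regular expression are assumptions chosen for illustration; they are much cruder than a real tokenizer and will split things like `u.k.` differently.

```python
import re
import string

PUNCTUATION = set(string.punctuation)

def simple_tokenize(text):
    """Lowercase the text, split it into word-like units, and drop punctuation."""
    # \w+ matches runs of letters, digits, and underscores;
    # [^\w\s] matches a single punctuation or symbol character.
    raw_tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [tok for tok in raw_tokens if tok not in PUNCTUATION]

print(simple_tokenize("Hello, world! Tokenization is fun."))
# ['hello', 'world', 'tokenization', 'is', 'fun']
```

A rule-based splitter like this is fast but naive, which is one reason to prefer a dedicated tokenizer for real data.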

We are going to tokenize the dataset using [spacy.io](https://spacy.io/)

Run (shown in the cell below):

* ```pip install spacy``` <br>
or, if you want to use conda,
* ```conda install -c conda-forge spacy```

followed by
* ```python -m spacy download en_core_web_sm```

In [6]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [7]:
# !conda install -c conda-forge spacy
# !python -m spacy download en_core_web_sm

In [8]:
# Let's write the tokenization function 

import spacy
import string

# Load English tokenizer, tagger, parser, NER and word vectors
tokenizer = spacy.load('en_core_web_sm')
punctuations = string.punctuation

# lowercase and remove punctuation
def tokenize(sent):
  tokens = tokenizer(sent)
  return [token.text.lower() for token in tokens if (token.text not in punctuations)]

# Example
tokens = tokenize(u'Apple is looking at buying U.K. startup for $1 billion.')
print (tokens)

['apple', 'is', 'looking', 'at', 'buying', 'u.k.', 'startup', 'for', '1', 'billion']


In [9]:
import pickle as pkl

In [10]:
# This is the code cell that tokenizes the train/val/test datasets.
# However, it takes about 15-20 minutes to run.
# For convenience, we have provided the preprocessed datasets.
# Please see the next code cell.
# Don't run this cell; otherwise it will overwrite the files downloaded in the next cell.

def tokenize_dataset(dataset):
    token_dataset = []
    # we are keeping track of all tokens in dataset 
    # in order to create vocabulary later
    all_tokens = []
    
    for sample in dataset:
        tokens = tokenize(sample)
        token_dataset.append(tokens)
        all_tokens += tokens

    return token_dataset, all_tokens

#val set tokens
print ("Tokenizing val data")
val_data_tokens, _ = tokenize_dataset(val_data)
pkl.dump(val_data_tokens, open("val_data_tokens.p", "wb"))

#test set tokens
print ("Tokenizing test data")
test_data_tokens, _ = tokenize_dataset(test_data)
pkl.dump(test_data_tokens, open("test_data_tokens.p", "wb"))

#train set tokens
print ("Tokenizing train data")
train_data_tokens, all_train_tokens = tokenize_dataset(train_data)
pkl.dump(train_data_tokens, open("train_data_tokens.p", "wb"))
pkl.dump(all_train_tokens, open("all_train_tokens.p", "wb"))

Tokenizing val data


KeyboardInterrupt: 

In [11]:
# First, download datasets from here
# Use your NYU account
#https://drive.google.com/open?id=1eR2LFI5MGliHlaL1S2nsX4ouIO1k_ip2
#https://drive.google.com/open?id=133QCWbiz_Xc7Qm4r6t-fJP1K669xjNlM
#https://drive.google.com/open?id=1SuUIUpJ1iznU707ktkpnEGSwt_XIqOYp
#https://drive.google.com/open?id=1UQsrZ2LVfcxdxxa47344fMs_qvya72KR

# Then, load preprocessed train, val and test datasets
train_data_tokens = pkl.load(open("train_data_tokens.p", "rb"))
all_train_tokens = pkl.load(open("all_train_tokens.p", "rb"))

val_data_tokens = pkl.load(open("val_data_tokens.p", "rb"))
test_data_tokens = pkl.load(open("test_data_tokens.p", "rb"))

# double checking
print ("Train dataset size is {}".format(len(train_data_tokens)))
print ("Val dataset size is {}".format(len(val_data_tokens)))
print ("Test dataset size is {}".format(len(test_data_tokens)))

print ("Total number of tokens in train dataset is {}".format(len(all_train_tokens)))
print ("Total number of *unique* tokens in train dataset is {}".format(len(set(all_train_tokens))))

Train dataset size is 10000
Val dataset size is 1314
Test dataset size is 7532
Total number of tokens in train dataset is 3433739
Total number of *unique* tokens in train dataset is 135791


#### Vocabulary

Now we are going to build a vocabulary from the 10,000 most common tokens in the training set. We also add the special tokens `<unk>` and `<pad>` to the vocabulary.

In [12]:
from collections import Counter

max_vocab_size = 10000
# reserve index 0 for <pad> and index 1 for <unk>, as is standard
PAD_IDX = 0
UNK_IDX = 1

def build_vocab(all_tokens):
    # Returns:
    # id2token: list of tokens, where id2token[i] returns the token with index i
    # token2id: dictionary where keys represent tokens and corresponding values represent indices
    token_counter = Counter(all_tokens)
    vocab, count = zip(*token_counter.most_common(max_vocab_size))
    id2token = list(vocab)
    token2id = dict(zip(vocab, range(2,2+len(vocab)))) 
    id2token = ['<pad>', '<unk>'] + id2token
    token2id['<pad>'] = PAD_IDX 
    token2id['<unk>'] = UNK_IDX
    return token2id, id2token

token2id, id2token = build_vocab(all_train_tokens)

In [13]:
# Let's check the dictionary by loading a random token from it

random_token_id = random.randint(0, len(id2token)-1)
random_token = id2token[random_token_id]

print ("Token id {} ; token {}".format(random_token_id, id2token[random_token_id]))
print ("Token {}; token id {}".format(random_token, token2id[random_token]))

Token id 7451 ; token regions
Token regions; token id 7451


In [14]:
# convert token to id in the dataset
def token2index_dataset(tokens_data):
    indices_data = []
    for tokens in tokens_data:
        index_list = [token2id[token] if token in token2id else UNK_IDX for token in tokens]
        indices_data.append(index_list)
    return indices_data

train_data_indices = token2index_dataset(train_data_tokens)
val_data_indices = token2index_dataset(val_data_tokens)
test_data_indices = token2index_dataset(test_data_tokens)

# double checking
print ("Train dataset size is {}".format(len(train_data_indices)))
print ("Val dataset size is {}".format(len(val_data_indices)))
print ("Test dataset size is {}".format(len(test_data_indices)))

Train dataset size is 10000
Val dataset size is 1314
Test dataset size is 7532
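
The vocabulary and index-conversion steps above can be sketched on toy data to see how out-of-vocabulary tokens are handled. The helper `build_toy_vocab` and the toy token list are hypothetical and exist only for this illustration; the index layout (0 for `<pad>`, 1 for `<unk>`, real tokens from 2) mirrors the cells above.

```python
from collections import Counter

PAD_IDX, UNK_IDX = 0, 1

def build_toy_vocab(tokens, max_size):
    """Map the max_size most frequent tokens to ids starting at 2."""
    most_common = [tok for tok, _ in Counter(tokens).most_common(max_size)]
    token2id = {"<pad>": PAD_IDX, "<unk>": UNK_IDX}
    for offset, tok in enumerate(most_common):
        token2id[tok] = offset + 2
    return token2id

toy_tokens = ["the", "cat", "the", "dog", "the", "cat", "bird"]
toy_token2id = build_toy_vocab(toy_tokens, max_size=2)

# Tokens outside the vocabulary fall back to UNK_IDX.
sentence = ["the", "bird", "cat"]
indices = [toy_token2id.get(tok, UNK_IDX) for tok in sentence]
print(indices)  # [2, 1, 3]
```

With `max_size=2` only `the` and `cat` are kept, so `bird` is mapped to the `<unk>` index.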


Visualize a random tokenized training example:

In [16]:
rand_training_example = random.randint(0, len(train_data) - 1)
print (train_data_tokens[rand_training_example])
print(train_data_indices[rand_training_example])

['from', 'jschief@finbol.toppoint.de', 'joerg', 'schlaeger', '\n', 'subject', 're', 'difference', 'between', 'vlb', 'and', 'isa', 'eisa', '\n', 'distribution', 'world', '\n', 'organization', 'myself', '\n', 'lines', '24', '\n\n', 'hurley@epcot.spdc.ti.com', 'writes', 'in', 'article', '1993apr14.090534.6892@spdc.ti.com', '\n', '\n', 'what', 'about', 'vlb', 'and', 'a', '486dx50', '  ', 'does', 'the', 'local', 'bus', 'still', 'run', 'at', '33mhz', 'or', 'does', '\n', 'it', 'try', 'to', 'run', 'at', '50mhz', '\n', '\n', '\n', 'brian', '\n', '\n', '\n', 'hi', '\n', 'vlb', 'is', 'defined', 'for', '3', 'cards', 'by', '33mhz', '\n', 'and', '2', 'cards', 'by', '40mhz', '\n\n', 'there', 'are', 'designs', 'with', '50mhz', 'and', '2', 'vlb', 'slots', '\n', 's.', "c't", '9.92', '10.92', '11.92', '\n\n', '50mhz', 'and', '2', 'slots', 'are', 'realy', 'difficult', 'to', 'design', '\n\n', 'better', 'oss', 'os/2', 'ix', 'are', 'able', 'to', 'handle', 'more', 'than', '16', 'mb', 'of', 'dram', '\n', 'if',

#### Exercise
Write an `index2token_dataset()` function that takes the indices and returns the actual tokens.

### PyTorch DataLoader 

In [17]:
MAX_SENTENCE_LENGTH = 200

In [18]:

import numpy as np
import torch
from torch.utils.data import Dataset

class NewsGroupDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch
    Note that this class inherits torch.utils.data.Dataset
    """
    
    def __init__(self, data_list, target_list):
        """
        @param data_list: list of newsgroup tokens 
        @param target_list: list of newsgroup targets 

        """
        self.data_list = data_list
        self.target_list = target_list
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)
        
    def __getitem__(self, key):
        """
        Triggered when you call dataset[i]
        """
        
        token_idx = self.data_list[key][:MAX_SENTENCE_LENGTH]
        label = self.target_list[key]
        return [token_idx, len(token_idx), label]



In [19]:
train_dataset = NewsGroupDataset(train_data_indices, train_targets)

In [20]:
### Let's look at the number of tokens in the first few datapoints
for i in range(5):
    print(train_dataset[i][1])

141
146
200
139
194


In [21]:
## example output

print("x {};\ny {}".format(train_dataset[0][0], train_dataset[0][2]))

x [17, 1, 142, 25, 45, 241, 2, 34, 42, 275, 12, 19, 2, 94, 84, 91, 1, 2, 38, 83, 6, 2792, 601, 1645, 2, 36, 301, 380, 10, 29, 1218, 31, 156, 70, 44, 108, 7993, 67, 18, 19, 275, 10, 639, 2, 3, 87, 253, 14, 29, 7, 1, 2200, 275, 1123, 5, 21, 17, 3, 1266, 1, 2, 813, 9605, 14, 29, 316, 7, 1, 3, 3616, 79, 170, 461, 11, 1309, 2, 3, 785, 6988, 29, 1773, 17, 3, 765, 6, 3, 695, 19, 12, 2, 47, 10, 88, 31, 156, 40, 1, 7, 880, 272, 1005, 2655, 184, 2, 6, 2527, 142, 19, 275, 12, 228, 613, 30, 793, 393, 16, 2, 23, 18, 19, 1, 370, 275, 174, 270, 189, 9, 198, 2, 2564, 107, 856, 1267, 5, 16, 37, 62, 6685, 1, 856, 2058];
y 7


We need a **collate function** so that all sentences in a batch have the same length. We fix a `MAX_SENTENCE_LENGTH`: if a sentence has fewer tokens, we pad the remainder with zeros (the `<pad>` index), and if it has more tokens, we truncate it at `MAX_SENTENCE_LENGTH`.
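
As a rough illustration of the pad-or-truncate rule for a single sequence, here is a plain-Python sketch. The function name `pad_or_truncate` is hypothetical; the notebook itself truncates in the dataset class and pads in the collate function.

```python
def pad_or_truncate(indices, max_len, pad_idx=0):
    """Return (fixed_length_list, true_length) for one sequence of token ids."""
    clipped = indices[:max_len]
    padded = clipped + [pad_idx] * (max_len - len(clipped))
    return padded, len(clipped)

# A short sequence is padded on the right with pad_idx.
print(pad_or_truncate([5, 8, 3], max_len=5))           # ([5, 8, 3, 0, 0], 3)
# A long sequence is cut at max_len.
print(pad_or_truncate([5, 8, 3, 9, 4, 7], max_len=5))  # ([5, 8, 3, 9, 4], 5)
```

The true length is returned alongside the padded list because the model later needs it to average over real tokens only.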

In [22]:
def newsgroup_collate_func(batch):
    """
    Customized function for DataLoader that pads every sequence in the batch
    to MAX_SENTENCE_LENGTH so that all data have the same length
    """
    data_list = []
    label_list = []
    length_list = []
    #print("collate batch: ", batch[0][0])
    #batch[0][0] = batch[0][0][:MAX_SENTENCE_LENGTH]
    for datum in batch:
        label_list.append(datum[2])
        length_list.append(datum[1])
    # padding
    for datum in batch:
        padded_vec = np.pad(np.array(datum[0]), 
                                pad_width=((0,MAX_SENTENCE_LENGTH-datum[1])), 
                                mode="constant", constant_values=0)
        data_list.append(padded_vec)
    return [torch.from_numpy(np.array(data_list)), torch.LongTensor(length_list), torch.LongTensor(label_list)]

In [23]:
BATCH_SIZE = 32

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=newsgroup_collate_func,
                                           shuffle=True)

val_dataset = NewsGroupDataset(val_data_indices, val_targets)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=newsgroup_collate_func,
                                           shuffle=True)

test_dataset = NewsGroupDataset(test_data_indices, test_targets)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=newsgroup_collate_func,
                                           shuffle=False)

In [24]:
### checking your data loader

for i, (data, lengths, labels) in enumerate(train_loader):
    print(data)
    print(labels)
    break

tensor([[  17,    1,    2,  ...,  980,    6,    1],
        [  17, 5636,  449,  ..., 1844,    4,    1],
        [  17, 3694,  914,  ...,  115,   31,   44],
        ...,
        [  17, 5414, 1284,  ...,    0,    0,    0],
        [  17, 4259,  514,  ...,  161,   11,    7],
        [  17,    1, 6179,  ...,  248, 1354,  682]])
tensor([14, 18,  0,  3,  0, 15, 16,  8, 14, 14,  6, 15,  8, 15,  8,  8,  1,  8,
        17, 19, 16,  0, 15, 14,  3, 16, 16,  2,  9, 17, 18,  1])


### Bag-of-Words model in PyTorch

Next, we will implement a Bag-of-Words model in PyTorch as an `nn.Module`.

An `nn.Module` can really be any function, but it is often used to implement layers, functions and models. Note that you can also nest modules.

Importantly, modules need to have their `forward()` method overridden, and very often you will want to override the `__init__` method as well. 

The `__init__` method sets up the module. This is also often where the internal modules and parameters are initialized.

The `forward` method defines what happens when you *apply* the module.

In the background, PyTorch makes use of your code in the forward method and determines how to implement back-propagation with it - but all you need to do is to define the forward pass!

In [25]:
# First import torch related libraries
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfWords(nn.Module):
    """
    BagOfWords classification model
    """
    def __init__(self, vocab_size, emb_dim):
        """
        @param vocab_size: size of the vocabulary. 
        @param emb_dim: size of the word embedding
        """
        super(BagOfWords, self).__init__()
        # pay attention to padding_idx 
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.linear = nn.Linear(emb_dim,20)
    
    def forward(self, data, length):
        """
        
        @param data: matrix of size (batch_size, max_sentence_length). Each row in data represents a 
            newsgroup post as a sequence of token indices. Note that rows are padded to have the same length.
        @param length: an int tensor of size (batch_size), which holds the true (excluding padding)
            length of each sentence in the data.
        """
        out = self.embed(data)
        out = torch.sum(out, dim=1)
        out /= length.view(length.size()[0],1).expand_as(out).float()
     
        # return logits
        out = self.linear(out.float())
        return out
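
To see what the forward pass computes for one example, here is a plain-Python sketch without PyTorch. The function `bow_forward` and all toy parameter values are hypothetical. It relies on the same idea as `padding_idx=0` above: the padding row of the embedding table is all zeros, so padded positions add nothing to the sum, and dividing by the true length gives the mean embedding of the real tokens.

```python
def bow_forward(token_ids, length, embeddings, weights, bias):
    """Average the embeddings of one padded sequence, then apply a linear layer."""
    emb_dim = len(embeddings[0])
    summed = [0.0] * emb_dim
    for idx in token_ids:
        for d in range(emb_dim):
            summed[d] += embeddings[idx][d]
    mean = [s / length for s in summed]
    # One logit per class: dot(weight_row, mean) + bias
    return [sum(w * m for w, m in zip(row, mean)) + b
            for row, b in zip(weights, bias)]

# Toy parameters: 4 vocabulary entries, 2 embedding dimensions, 2 classes.
embeddings = [[0.0, 0.0],   # index 0 is padding, so its embedding is all zeros
              [1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -0.5]

# One sequence with two real tokens (ids 1 and 2) followed by two pads.
logits = bow_forward([1, 2, 0, 0], length=2,
                     embeddings=embeddings, weights=weights, bias=bias)
print(logits)  # [2.5, 2.5]
```

Here the summed embedding is `[4.0, 6.0]`, the mean is `[2.0, 3.0]`, and the linear layer maps it to the two logits.
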


In [28]:
emb_dim = 100
model = BagOfWords(len(id2token), emb_dim)

In [29]:
model.embed.weight.shape

torch.Size([10002, 100])

In [30]:
for x in model.parameters():
    print(x.shape)

torch.Size([10002, 100])
torch.Size([20, 100])
torch.Size([20])


### Loss Function and Optimizer

Note that in our Bag-of-Words model we haven't applied softmax to the output of the linear layer. Why?
We use `nn.CrossEntropyLoss()` to train. According to the PyTorch documentation for `nn.CrossEntropyLoss()` (https://pytorch.org/docs/stable/nn.html), this criterion combines `nn.LogSoftmax()` and `nn.NLLLoss()` in a single class. So training with it on raw logits is exactly the same as minimizing the negative log-likelihood after applying softmax.
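
The relationship can be checked numerically with a small standard-library sketch. The helper names `log_softmax` and `cross_entropy` below are plain-Python stand-ins written for illustration, not the PyTorch functions, and the logits are made-up values.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -log_softmax(logits)[target]

logits = [2.0, 0.5, -1.0]
target = 0

loss = cross_entropy(logits, target)
probs = [math.exp(v) for v in log_softmax(logits)]

# The loss on raw logits equals -log(softmax probability of the target).
print(abs(loss + math.log(probs[target])) < 1e-9)      # True
# Softmax is monotonic, so the predicted class is the same with or without it.
print(probs.index(max(probs)) == logits.index(max(logits)))  # True
```

The second check is also why the model can return logits at prediction time: taking the argmax of the logits gives the same class as taking the argmax of the softmax probabilities.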

In [35]:
# Criterion and Optimizer
criterion = torch.nn.CrossEntropyLoss()  

learning_rate = 0.01
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [36]:
for x in model.parameters():
    print(x.shape)

torch.Size([10002, 100])
torch.Size([20, 100])
torch.Size([20])


### Training the Model

In [41]:
labels.size()

torch.Size([32])

In [39]:
num_epochs = 10 # number of epochs to train

# Function for testing the model
def test_model(loader, model):
    """
    Helper function that tests the model's performance on a dataset
    @param: loader - data loader for the dataset to test against
    """
    correct = 0
    total = 0
    model.eval()
    for data, lengths, labels in loader:
        data_batch, length_batch, label_batch = data, lengths, labels
        outputs = F.softmax(model(data_batch, length_batch), dim=1)
        predicted = outputs.max(1, keepdim=True)[1]
        
        total += labels.size(0)
        correct += predicted.eq(labels.view_as(predicted)).sum().item()
    return (100 * correct / total)

for epoch in range(num_epochs):
    for i, (data, lengths, labels) in enumerate(train_loader):
        model.train()
        data_batch, length_batch, label_batch = data, lengths, labels
        optimizer.zero_grad()
        outputs = model(data_batch, length_batch)
        loss = criterion(outputs, label_batch)
        loss.backward()
        optimizer.step()
        # validate every 100 iterations
        if i > 0 and i % 100 == 0:
            # validate
            val_acc = test_model(val_loader, model)
            print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format( 
                       epoch+1, num_epochs, i+1, len(train_loader), val_acc))



In [87]:
print ("After training for {} epochs".format(num_epochs))
print ("Train Acc {}".format(test_model(train_loader, model)))
print ("Val Acc {}".format(test_model(val_loader, model)))
print ("Test Acc {}".format(test_model(test_loader, model)))

After training for 10 epochs
Train Acc 99.91
Val Acc 89.3455098934551
Test Acc 79.42113648433352


## Analysis Exercises

1. Suppose we modify the collate function to the following:
```python
def newsgroup_collate_func(batch):
    """
    Customized function for DataLoader that dynamically pads the batch so that all 
    data have the same length
    """
    data_list = []
    label_list = []
    length_list = []
    #print("collate batch: ", batch[0][0])
    #batch[0][0] = batch[0][0][:MAX_SENTENCE_LENGTH]
    for datum in batch:
        label_list.append(datum[2])
        length_list.append(datum[1])
    # padding
    for datum in batch:
        padded_vec = np.pad(np.random.permutation(np.array(datum[0])),  ## note the added shuffling here (np.random.shuffle works in place and returns None)
                                pad_width=((0,MAX_SENTENCE_LENGTH-datum[1])), 
                                mode="constant", constant_values=0)
        data_list.append(padded_vec)
    return [torch.from_numpy(np.array(data_list)), torch.LongTensor(length_list), torch.LongTensor(label_list)]
```
    
a) What would your test and val accuracies be for the model you trained above? Try it and verify that your prediction is correct. <br>
b) You train your model with this changed collate function. Do you expect to achieve similar results to what you have currently? <br> <br>

2. Create and visualize a confusion matrix for this model. Do you see anything interesting? (Look at the frequency of occurrence of different labels in your train set and see the classification performance on labels that are less frequent.)
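
As a possible starting point for the confusion matrix, here is a plain-Python sketch that only does the counting. The function name `confusion_matrix` and the toy labels are hypothetical and are not results from the model above; collecting real predictions and plotting the matrix are left to you.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

# Toy labels for 3 classes, for illustration only.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(toy_true, toy_pred, num_classes=3)
for row in cm:
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 1]
```

The diagonal holds correct predictions; off-diagonal entries show which classes are confused with each other.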

## Implementation Exercises

1. Try training the model with a larger embedding size and for a larger number of epochs. Also plot the training curves of the model.
2. Try downloading the IMDB Large Movie Review Dataset (http://ai.stanford.edu/~amaas/data/sentiment/) and tokenize it. After tokenizing the dataset, try training the Bag-of-Words model on it and report your initial results on the validation set. It is again interesting to note the effect of embedding size, tokenization scheme, etc. on performance.

#### Credits

This lab is built on top of the lab developed for DS-GA 1011 Fall 2018.