# **CSE6521(AU22): PyTorch Tutorial**

### Author: Zexin (Jason) Xu

This notebook will serve as a basic introduction to PyTorch. This notebook will include 3 sessions: tensor, neural network, and a sample NLP task. After finishing this notebook, you will be able to build a fully-connected feed-forward neural network with PyTorch.

### Credit
* ["Word Window Classification" tutorial notebook](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/materials/ww_classifier.ipynb) by Matt Lamm, from Winter 2020 offering of CS224N
* CSE224N: PyTorch Tutorial (Winter '22) by Dilara Soylu, Ethan Chi
* Official PyTorch Documentation on [Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) by Soumith Chintala
* PyTorch Tutorial Notebook, [Build Basic Generative Adversarial Networks (GANs) | Coursera](https://www.coursera.org/learn/build-basic-generative-adversarial-networks-gans) by Sharon Zhou, offered on Coursera

 

# Sample

Sentiment analysis

Deep Averaging Network..(DAN)

0. Dataset https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
1. Preprocessing
2. Introduce dataloader
3. Word embeddings
4. Batch
5. GPU
6. Training
7. Evaluation
8. Confusion matrix

In [8]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd

In [9]:
dataframe = pd.read_csv('Data/movie.csv')
print(dataframe.shape)
print(dataframe.head())
print('-' * 25)

corpus = dataframe['text'].values
labels = dataframe['label'].values
print(corpus[:2])
print(labels[:2])

(40000, 2)
                                                text  label
0  I grew up (b. 1965) watching and loving the Th...      0
1  When I put this movie in my DVD player, and sa...      0
2  Why do people who do not know what a particula...      0
3  Even though I have great interest in Biblical ...      0
4  Im a die hard Dads Army fan and nothing will e...      1
-------------------------
['I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played "Thunderbirds" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Ger

In [10]:
from sklearn.model_selection import train_test_split

# Getting max length of a sentence
max_len = -1
for sent in corpus:
    max_len = max(max_len, len(sent.split()))
print(f'Max length: {max_len}')

# Splitting the data
corpus_train, corpus_test, labels_train, labels_test = train_test_split(corpus, labels, test_size=0.2, random_state=42)

print(corpus_train.shape)
print(corpus_test.shape)

Max length: 2470
(32000,)
(8000,)


## Word embeddings

In [11]:
# First off, let's create a indexer!
# Credit: ...
class Indexer(object):
    """
    Bijection between objects and integers starting at 0. Useful for mapping
    labels, features, etc. into coordinates of a vector space.
    """
    def __init__(self):
        self.objs_to_idx = {}
        self.idx_to_objs = {}
        # Adding <unk> and <pad>
        self.objs_to_idx['<unk>'] = 0
        self.objs_to_idx['<pad>'] = 1
        self.idx_to_objs[0] = '<unk>'
        self.idx_to_objs[1] = '<pad>'

    def __repr__(self):
        return str(self.objs_to_idx)

    def __str__(self):
        return self.__repr__()

    def __len__(self):
        return len(self.objs_to_idx)

    def get_object(self, index):
        """
        :param index: integer index to look up
        :return: Returns the object corresponding to the particular index or <unk> if not found
        """
        if (index not in self.idx_to_objs):
            return self.idx_to_objs[0]
        else:
            return self.idx_to_objs[index]

    def contains(self, object):
        """
        :param object: object to look up
        :return: Returns True if it is in the Indexer, False otherwise
        """
        return self.index_of(object) != 0

    def index_of(self, object):
        """
        :param object: object to look up
        :return: Returns -1 if the object isn't present, index otherwise
        """
        if (object not in self.objs_to_idx):
            return self.objs_to_idx['<unk>']
        else:
            return self.objs_to_idx[object]

    def add_and_get_index(self, object, add=True):
        """
        Adds the object to the index if it isn't present, always returns a nonnegative index
        :param object: object to look up or add
        :param add: True by default, False if we shouldn't add the object. If False, equivalent to index_of.
        :return: The index of the object
        """
        if not add:
            return self.index_of(object)
        if (object not in self.objs_to_idx):
            new_idx = len(self.objs_to_idx)
            self.objs_to_idx[object] = new_idx
            self.idx_to_objs[new_idx] = object
        return self.objs_to_idx[object]


In [12]:
indexer = Indexer()

for corpus in corpus_train:
    for word in corpus.split():
        indexer.add_and_get_index(word.lower())
print(f'Indexer: {len(indexer)}')
print(list(indexer.idx_to_objs.items())[:50]) # print first 50 items in the indexer

Indexer: 295808
[(0, '<unk>'), (1, '<pad>'), (2, 'i'), (3, 'watched'), (4, 'it'), (5, 'last'), (6, 'night'), (7, 'and'), (8, 'again'), (9, 'this'), (10, 'morning'), (11, '-'), (12, "that's"), (13, 'how'), (14, 'much'), (15, 'liked'), (16, 'it.'), (17, 'there'), (18, 'is'), (19, 'something'), (20, 'about'), (21, 'movie...'), (22, 'when'), (23, 'the'), (24, 'movie'), (25, 'was'), (26, 'almost'), (27, 'over,'), (28, 'to'), (29, 'cry.'), (30, 'would'), (31, 'strongly'), (32, 'recommend'), (33, '"latter'), (34, 'days"'), (35, 'my'), (36, 'friends'), (37, "it's"), (38, 'definitely'), (39, 'worth'), (40, 'seeing!'), (41, 'agree'), (42, 'with'), (43, 'those'), (44, 'who'), (45, 'say'), (46, 'that'), (47, 'some'), (48, 'parts'), (49, 'of')]


In [13]:
# Creating word embeddings
embedding_dim = 5 # 50, 100
embedding = nn.Embedding(len(indexer), embedding_dim)
print(list(embedding.parameters()))
print(list(embedding.parameters())[0].shape)

[Parameter containing:
tensor([[-1.5301, -1.2222,  0.8139, -1.4456, -0.0746],
        [ 1.8849, -1.8677,  0.6735, -0.3937,  0.6189],
        [ 0.4110,  0.0099, -0.1404,  0.5565, -0.3442],
        ...,
        [-0.3361, -2.7709,  1.5129, -0.3075, -1.2698],
        [ 1.2058,  1.6777, -1.7956, -1.6128,  0.0434],
        [ 2.7031,  0.4356, -1.6229,  2.1715, -2.8551]], requires_grad=True)]
torch.Size([295808, 5])


In [14]:
sentence = "I hate this movie!"

idx = [indexer.index_of(word.lower()) for word in sentence.split()]
print(idx)

# Remember in the tensor indexing, we can do tensor[tensor]? Let's utilize it!
idx_tensor = torch.tensor(idx, dtype=torch.long)
print(embedding(idx_tensor))

[2, 1271, 9, 10610]
tensor([[ 0.4110,  0.0099, -0.1404,  0.5565, -0.3442],
        [-1.6840, -0.5460,  0.8224, -0.8703, -0.1626],
        [ 0.3520, -0.3347,  1.8257, -1.1186,  0.2296],
        [-1.1989,  0.5747, -0.1248,  1.2426, -0.5887]],
       grad_fn=<EmbeddingBackward0>)


## Now, let's batch stuff!

we are going to use data loader

In [15]:
from torch.utils.data import DataLoader

train_data = list(zip(corpus_train, labels_train))
batch_size = 4
shuffle = True

# Things to add:
# Collate function customization
# collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)

dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=shuffle)
counter = 0
batched_corpus, batched_labels = next(iter(dataloader))
print(len(batched_corpus))
print("Batched Input:")
print(batched_corpus)
print("Batched Labels:")
print(batched_labels)

4
Batched Input:
('I also attended the RI International Horror Film Festival and I can easily see why this film won best of show.<br /><br />SEA OF DUST is a wild romp of Horror, Comedy and beautiful scenery. A back in time tale of strange goings on. An increasingly wide spread illness with an overwhelmingly irritating side effect of people\'s heads exploding, brings a young Professor\'s apprentice; Stefan, to investigate. Along his travels, he decides to briefly detour and once again ask for his long time love\'s hand in marriage, only to once again be sent packing by her extremely stubborn father\x85 Along the way "out of town" he comes across an ill girl in the road and delivers her to Dr. Maitland, (brilliantly played by up and coming Vincent Price like actor: Edward X Young.) Who fills Stefan in on the Evils a foot. Only the Dr. is insulted that he had called for the Professor and only received a boy in training\x85None the less, Stefan turns out to be much more than a common byst

## Deep Averaging Network

Time to put everything together!!

In [16]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd

from torch.utils.data import DataLoader
import torch.optim as optim

In [17]:
class Indexer(object):
    def __init__(self):
        self.objs_to_idx = {}
        self.idx_to_objs = {}
        # Adding <unk> and <pad>
        self.objs_to_idx['<unk>'] = 0
        self.objs_to_idx['<pad>'] = 1
        self.idx_to_objs[0] = '<unk>'
        self.idx_to_objs[1] = '<pad>'

    def __repr__(self):
        return str(self.objs_to_idx)

    def __str__(self):
        return self.__repr__()

    def __len__(self):
        return len(self.objs_to_idx)

    def get_object(self, index):
        if (index not in self.idx_to_objs):
            return self.idx_to_objs[0]
        else:
            return self.idx_to_objs[index]
        
    def pad(self, index):
        return self.objs_to_idx.index_of['<pad>']

    def contains(self, object):
        return self.index_of(object) != 0

    def index_of(self, object):
        if (object not in self.objs_to_idx):
            return self.objs_to_idx['<unk>']
        else:
            return self.objs_to_idx[object]

    def add_and_get_index(self, object, add=True):
        if not add:
            return self.index_of(object)
        if (object not in self.objs_to_idx):
            new_idx = len(self.objs_to_idx)
            self.objs_to_idx[object] = new_idx
            self.idx_to_objs[new_idx] = object
        return self.objs_to_idx[object]

In [33]:
class DeepAveragingNetwork(nn.Module):
    
    def __init__(self, hyper_params):
        super(DeepAveragingNetwork, self).__init__()
        self.hyper_params = hyper_params
        self.embedding = nn.Embedding(len(indexer), hyper_params['embedding_dim'])
        # nn.Embedding.from_pretrained()
        
        # model
        self.model = nn.Sequential(
            nn.Linear(hyper_params['embedding_dim'], hyper_params['hidden_dim']),
            nn.ReLU(),
            nn.Linear(hyper_params['hidden_dim'], hyper_params['output_dim']),
            nn.LogSoftmax(dim=0)
        )
    
    def average_embeddings(self, x):
        embedding_tensor = []
        for i, x_idx in enumerate(x):
            embedding = torch.mean(self.embedding(torch.LongTensor(x_idx)), dim=0)
            embedding_tensor.append(embedding.detach().numpy())
        embedding_tensor = torch.Tensor(embedding_tensor)
        return embedding_tensor
        
    def forward(self, x):
        x = self.average_embeddings(x)
        # print(self.model(x))
        return self.model(x)
    
    def predict(self, x):
        x = self.average_embeddings(x)
        return torch.argmax(self.model(x), dim=1)

## Training

In [35]:
print(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))

indexer = Indexer()
for corpus in corpus_train:
    for word in corpus.split():
        indexer.add_and_get_index(word.lower())
        
hyper_params = {
    "batch_size": 4096,
    "vocab_size" : len(indexer),
    "embedding_dim": 50,
    "hidden_dim": 50,
    "output_dim": 2,
    "learning_rate": 0.01,
    "num_epochs": 30,
}
train_data = list(zip(corpus_train, labels_train))
dataloader = DataLoader(train_data, batch_size=hyper_params['batch_size'], shuffle=shuffle)

DAN = DeepAveragingNetwork(hyper_params)
optimizer = optim.Adam(DAN.parameters(), lr=hyper_params['learning_rate'])
loss_function = nn.NLLLoss()

# Training
for epoch in range(hyper_params['num_epochs']):
    total_loss = 0
    for batch in dataloader:
        batched_corpus, batched_labels = batch
        # print(batched_labels)
        # Zero the grads!
        optimizer.zero_grad()
        # corpus -> indexes
        idx = [[indexer.index_of(word.lower()) for word in corpus.split()] for corpus in batched_corpus]
        # labels -> tensor
        pred = DAN(idx)
        loss = loss_function(pred, batched_labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f'Epoch: {epoch}, Loss: {total_loss/hyper_params["batch_size"]}')


cuda
Epoch: 0, Loss: 0.01617091754451394
Epoch: 1, Loss: 0.016123207518830895
Epoch: 2, Loss: 0.016105429036542773
Epoch: 3, Loss: 0.016099855303764343
Epoch: 4, Loss: 0.016094488324597478
Epoch: 5, Loss: 0.016090859891846776
Epoch: 6, Loss: 0.016086806543171406
Epoch: 7, Loss: 0.016083694295957685
Epoch: 8, Loss: 0.016080213710665703
Epoch: 9, Loss: 0.016077284468337893
Epoch: 10, Loss: 0.016075460705906153
Epoch: 11, Loss: 0.016073307488113642
Epoch: 12, Loss: 0.016071424121037126
Epoch: 13, Loss: 0.01606796868145466
Epoch: 14, Loss: 0.016064456896856427
Epoch: 15, Loss: 0.016062066424638033
Epoch: 16, Loss: 0.016059661051258445
Epoch: 17, Loss: 0.016057911328971386
Epoch: 18, Loss: 0.01605471852235496
Epoch: 19, Loss: 0.016052857507020235
Epoch: 20, Loss: 0.016050246078521013
Epoch: 21, Loss: 0.016047833021730185
Epoch: 22, Loss: 0.016046809731051326
Epoch: 23, Loss: 0.016045585507526994
Epoch: 24, Loss: 0.01604228955693543
Epoch: 25, Loss: 0.016039262525737286
Epoch: 26, Loss: 0.01

#  Evaluation

Confusion matrix explained and testing

In [37]:
test_idx = [[indexer.index_of(word.lower()) for word in corpus.split()] for corpus in corpus_test]
pred_test = DAN.predict(test_idx)

def print_evaluation(golds, predictions):
    """
    Prints evaluation statistics comparing golds and predictions, each of which is a sequence of 0/1 labels.
    Prints accuracy as well as precision/recall/F1 of the positive class, which can sometimes be informative if either
    the golds or predictions are highly biased.

    :param golds: gold labels
    :param predictions: pred labels
    :return:
    """
    num_correct = 0
    num_pos_correct = 0
    num_pred = 0
    num_gold = 0
    num_total = 0
    if len(golds) != len(predictions):
        raise Exception("Mismatched gold/pred lengths: %i / %i" % (len(golds), len(predictions)))
    for idx in range(0, len(golds)):
        gold = golds[idx]
        prediction = predictions[idx]
        if prediction == gold:
            num_correct += 1
        if prediction == 1:
            num_pred += 1
        if gold == 1:
            num_gold += 1
        if prediction == 1 and gold == 1:
            num_pos_correct += 1
        num_total += 1
    acc = float(num_correct) / num_total
    output_str = "Accuracy: %i / %i = %f" % (num_correct, num_total, acc)
    prec = float(num_pos_correct) / num_pred if num_pred > 0 else 0.0
    rec = float(num_pos_correct) / num_gold if num_gold > 0 else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec > 0 and rec > 0 else 0.0
    output_str += ";\nPrecision (fraction of predicted positives that are correct): %i / %i = %f" % (num_pos_correct, num_pred, prec)
    output_str += ";\nRecall (fraction of true positives predicted correctly): %i / %i = %f" % (num_pos_correct, num_gold, rec)
    output_str += ";\nF1 (harmonic mean of precision and recall): %f;\n" % f1
    print(output_str)
    # return acc, f1, output_str

print_evaluation(labels_test, pred_test.tolist())

Accuracy: 5109 / 8000 = 0.638625;
Precision (fraction of predicted positives that are correct): 2544 / 3945 = 0.644867;
Recall (fraction of true positives predicted correctly): 2544 / 4034 = 0.630640;
F1 (harmonic mean of precision and recall): 0.637674;



## Build your own!

Couple of stuff that can improve:
1. Data cleaning
2. More layers?
3. Activation functions