# RNN with LSTM on a Log Dataset

This lab trains a simple RNN on a CSV dataset of log text for binary classification with two labels, Anomaly and Normal. The model is built from LSTM (Long Short-Term Memory) cells, with GRU cells as an alternative.

In [1]:
import torch
import torch.nn.functional as F
from torchtext.legacy import data
from spacy.lang.en import English
import spacy
import en_core_web_sm
from torchtext import datasets
import time
import random
import pandas as pd

torch.backends.cudnn.deterministic = True

In [6]:

spacy_en = spacy.load('en_core_web_sm')

## General Settings

In [2]:
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


VOCABULARY_SIZE = 20000
LEARNING_RATE = 1e-4
BATCH_SIZE = 128
NUM_EPOCHS = 15
EMBEDDING_DIM = 128 #128
HIDDEN_DIM = 328 #256
OUTPUT_DIM = 1

## Dataset

Check that the dataset looks okay:

In [3]:
# df = pd.read_csv('/content/drive/MyDrive/Lab 4/movie_data.csv')
# df.head()

#log= pd.read_csv('/content/drive/MyDrive/586_project_NPL/unique_id.csv')
log= pd.read_csv('/Users/yuxuancui/Desktop/MDS/data586/project/rnn/log_rnn.csv')


In [4]:
log.head()

Unnamed: 0,Label,Content_npl
0,Normal,BLOCK NameSystem allocateBlock user root sortr...
1,Normal,Receiving block blk src dest Receiving block b...
2,Normal,BLOCK NameSystem allocateBlock user root randt...
3,Normal,BLOCK NameSystem allocateBlock user root rand ...
4,Normal,Receiving block blk src dest BLOCK NameSystem ...


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# del df  # only needed if the commented-out movie_data.csv cell above was run; otherwise df is undefined

Define the Label and Text field formatters:

In [19]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
TEXT = data.Field(sequential=True,
                  #tokenize='spacy',
                  include_lengths=True) # necessary for packed_padded_sequence

LABEL = data.LabelField(dtype=torch.float)

Process the dataset:

In [12]:
fields = [('Label', LABEL),('Content_npl', TEXT)]

dataset = data.TabularDataset(
    path='/Users/yuxuancui/Desktop/MDS/data586/project/rnn/log_rnn.csv', format='csv',
    skip_header=True, fields=fields)

Split the dataset into training, validation, and test partitions:

In [13]:
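# random.seed() returns None, so seed the generator first and pass its state explicitly
random.seed(RANDOM_SEED)
train_data, valid_data, test_data = dataset.split(
    split_ratio=[0.75, 0.05, 0.2],
    random_state=random.getstate())
# The split ratios can be varied to change the train/validation/test proportions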
print(f'Num Train: {len(train_data)}')
print(f'Num Valid: {len(valid_data)}')
print(f'Num Test: {len(test_data)}')

Num Train: 431296
Num Valid: 115012
Num Test: 28753
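
The validation split comes out larger than the test split. That is consistent with the legacy torchtext `split` reading a three-element `split_ratio` as train, test, valid while returning the datasets in the order train, valid, test. A minimal arithmetic sketch, assuming that ordering and simple rounding (the helper name `split_sizes` is hypothetical):

```python
def split_sizes(n_examples, split_ratio):
    """Reproduce the split sizes, reading split_ratio as (train, test, valid)."""
    train_ratio, test_ratio, _valid_ratio = split_ratio
    n_train = round(train_ratio * n_examples)
    n_test = round(test_ratio * n_examples)
    # Validation takes whatever is left, so the three sizes always sum to n_examples.
    n_valid = n_examples - n_train - n_test
    return n_train, n_valid, n_test


# Total number of rows implied by the printed counts above.
total = 431296 + 115012 + 28753
sizes = split_sizes(total, [0.75, 0.05, 0.2])
print(sizes)  # (train, valid, test)
```

Under this assumption, a 5% validation and 20% test split would need the list `[0.75, 0.2, 0.05]`.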


Build the vocabulary based on the top "VOCABULARY_SIZE" words:

In [14]:
TEXT.build_vocab(train_data, max_size=VOCABULARY_SIZE)
LABEL.build_vocab(train_data)

print(f'Vocabulary size: {len(TEXT.vocab)}')
print(f'Number of classes: {len(LABEL.vocab)}')

Vocabulary size: 201
Number of classes: 2


In [11]:
LABEL.vocab.freqs
TEXT.vocab.freqs

Counter({' ': 327,
         'BLOCK': 19192,
         'Could': 39,
         'Deleting': 8,
         'EOFException': 1,
         'Exception': 1,
         'IOException': 39,
         'NameSystem': 19180,
         'PacketResponder': 14119,
         'Received': 14190,
         'Receiving': 15063,
         'Served': 303,
         'SocketTimeoutException': 1,
         'Starting': 15,
         'Transmitted': 9,
         'Verification': 681,
         'addStoredBlock': 14105,
         'added': 14105,
         'allocateBlock': 5075,
         'ask': 12,
         'blk': 63628,
         'block': 43706,
         'blockMap': 14105,
         'conf': 1,
         'current': 8,
         'data': 8,
         'datanode': 12,
         'dest': 15067,
         'dfs': 8,
         'ec': 1,
         'empty': 9,
         'exception': 40,
         'file': 8,
         'for': 14808,
         'from': 14216,
         'hadoop': 10,
         'history': 1,
         'internal': 1,
         'io': 40,
         'ip': 1,
      

The `TEXT.vocab` object holds the word counts and indices. It contains at most VOCABULARY_SIZE + 2 entries, because two special tokens are added for unknown words and padding: `<unk>` and `<pad>`. Here the vocabulary size is only 201, well below that limit, because the training logs contain just 199 distinct tokens.
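
How the two special tokens fit in can be sketched in plain Python. This illustrates the idea only and is not torchtext's implementation; the helper `build_vocab` and the toy token list are made up for the example.

```python
from collections import Counter


def build_vocab(tokens, max_size, specials=("<unk>", "<pad>")):
    """Map the max_size most frequent tokens to indices, after the special tokens."""
    counts = Counter(tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


tokens = "Receiving block blk src dest Received block blk of size".split()
itos, stoi = build_vocab(tokens, max_size=5)

# At most max_size + 2 entries: the kept tokens plus <unk> and <pad>.
print(len(itos))
# Tokens outside the vocabulary fall back to the <unk> index.
print(stoi.get("terminating", stoi["<unk>"]))
```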

Make dataset iterators:

In [15]:
train_loader, valid_loader, test_loader = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE,
    sort_within_batch=True, # necessary for packed_padded_sequence
    sort_key=lambda x: len(x.Content_npl),
    device=DEVICE)

Testing the iterators (note that the number of rows depends on the longest document in the respective batch):

In [16]:
print('Train')
for batch in train_loader:
    print(f'Text matrix size: {batch.Content_npl[0].size()}')
    print(f'Target vector size: {batch.Label.size()}')
    break
    
print('\nValid:')
for batch in valid_loader:
    print(f'Text matrix size: {batch.Content_npl[0].size()}')
    print(f'Target vector size: {batch.Label.size()}')
    break
    
print('\nTest:')
for batch in test_loader:
    print(f'Text matrix size: {batch.Content_npl[0].size()}')
    print(f'Target vector size: {batch.Label.size()}')
    break

Train
Text matrix size: torch.Size([146, 128])
Target vector size: torch.Size([128])

Valid:
Text matrix size: torch.Size([16, 128])
Target vector size: torch.Size([128])

Test:
Text matrix size: torch.Size([16, 128])
Target vector size: torch.Size([128])
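
The first dimension of each text matrix is the length of the longest sequence in that batch; shorter sequences are filled with the padding index. A rough sketch of that padding step in plain Python (the function `pad_batch` and the index values are illustrative assumptions, not torchtext internals):

```python
def pad_batch(sequences, pad_idx=1):
    """Pad index sequences to the batch maximum and return [max_len][batch_size] rows."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    # Transpose so that rows are time steps and columns are examples.
    time_major = [list(step) for step in zip(*padded)]
    return time_major, lengths


batch = [[5, 8, 2, 9], [4, 7], [3, 6, 2]]
matrix, lengths = pad_batch(batch)
print(len(matrix), len(matrix[0]))  # number of time steps, batch size
print(lengths)
```

The true lengths are kept alongside the padded matrix, which is what `include_lengths=True` supplies for `pack_padded_sequence`.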


## Model

### The primary goal of this lab is to vary the hyperparameters of the LSTM model, observe the results, and analyze them.
### The second task is to use another RNN cell, such as GRU, tune its parameters, and report the results.

### The remainder of the code will have to be modified accordingly.

In [17]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        
        super().__init__()
        # Preliminary model using an LSTM cell.
        # Primary goal: vary the dimensions of the embeddings and compare the results.
        # Second task: use another RNN cell such as GRU, tune its parameters, and report the results.
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text, text_length):

        #[sentence len, batch size] => [sentence len, batch size, embedding size]
        embedded = self.embedding(text)
        
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, text_length)
      
        #[sentence len, batch size, embedding size] => 
        #  output: [sentence len, batch size, hidden size]
        #  hidden: [1, batch size, hidden size]
        packed_output, (hidden, cell) = self.rnn(packed)
        
        return self.fc(hidden.squeeze(0)).view(-1)
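
For the second task, a GRU can replace the LSTM with few changes. The sketch below is one possible variant, not a tuned model; the class name `GRUClassifier` and the toy sizes are assumptions. The main difference is that `nn.GRU` returns only a hidden state, with no separate cell state.

```python
import torch
import torch.nn as nn


class GRUClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text, text_length):
        # [sentence len, batch size] => [sentence len, batch size, embedding size]
        embedded = self.embedding(text)
        packed = nn.utils.rnn.pack_padded_sequence(embedded, text_length)
        # A GRU returns (output, hidden); there is no cell state to unpack.
        _, hidden = self.rnn(packed)
        return self.fc(hidden.squeeze(0)).view(-1)


# Toy batch: 3 sequences (one per column), longest first, padded with index 1.
text = torch.tensor([[5, 4, 3],
                     [8, 7, 6],
                     [2, 9, 1],
                     [9, 1, 1]])
lengths = torch.tensor([4, 3, 2])
gru_model = GRUClassifier(input_dim=10, embedding_dim=8, hidden_dim=16, output_dim=1)
logits = gru_model(text, lengths)
print(logits.shape)  # one logit per sequence in the batch
```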

In [17]:
INPUT_DIM = len(TEXT.vocab)

torch.manual_seed(RANDOM_SEED)
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

## Training

In [18]:
def compute_binary_accuracy(model, data_loader, device):
    model.eval()
    correct_pred, num_examples = 0, 0
    with torch.no_grad():
        for batch_idx, batch_data in enumerate(data_loader):
            text, text_lengths = batch_data.Content_npl
            logits = model(text, text_lengths.cpu())
            predicted_labels = (torch.sigmoid(logits) > 0.5).long()
            num_examples += batch_data.Label.size(0)
            correct_pred += (predicted_labels.long() == batch_data.Label.long()).sum()
        return correct_pred.float()/num_examples * 100
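
Since the sigmoid is monotonic and `sigmoid(0) = 0.5`, thresholding the probability at 0.5 is the same as checking the sign of the logit. A small plain-Python sketch of the accuracy computation on made-up logits and labels:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_accuracy(logits, labels):
    """Percentage of examples whose thresholded prediction matches the label."""
    predictions = [1 if sigmoid(z) > 0.5 else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


logits = [2.3, -0.7, 0.1, -4.0]   # made-up model outputs
labels = [1, 0, 0, 0]             # made-up targets
print(binary_accuracy(logits, labels))

# The probability threshold 0.5 corresponds to the logit threshold 0.
assert all((sigmoid(z) > 0.5) == (z > 0) for z in logits)
```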

In [19]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, batch_data in enumerate(train_loader):
        
        text, text_lengths = batch_data.Content_npl
        
        ### FORWARD AND BACK PROP
        logits = model(text, text_lengths.cpu())
        cost = F.binary_cross_entropy_with_logits(logits, batch_data.Label)
        optimizer.zero_grad()
        
        cost.backward()
        
        ### UPDATE MODEL PARAMETERS
        optimizer.step()
        
        ### LOGGING
        if not batch_idx % 50:
            print (f'Epoch: {epoch+1:03d}/{NUM_EPOCHS:03d} | '
                   f'Batch {batch_idx:03d}/{len(train_loader):03d} | '
                   f'Cost: {cost:.4f}')

    with torch.set_grad_enabled(False):
        print(f'training accuracy: '
              f'{compute_binary_accuracy(model, train_loader, DEVICE):.2f}%'
              f'\nvalid accuracy: '
              f'{compute_binary_accuracy(model, valid_loader, DEVICE):.2f}%')
        
    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')
    
print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_binary_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch: 001/015 | Batch 000/498 | Cost: 0.7357
Epoch: 001/015 | Batch 050/498 | Cost: 0.3297
Epoch: 001/015 | Batch 100/498 | Cost: 0.1690
Epoch: 001/015 | Batch 150/498 | Cost: 0.1873
Epoch: 001/015 | Batch 200/498 | Cost: 0.1004
Epoch: 001/015 | Batch 250/498 | Cost: 0.1630
Epoch: 001/015 | Batch 300/498 | Cost: 0.1092
Epoch: 001/015 | Batch 350/498 | Cost: 0.1359
Epoch: 001/015 | Batch 400/498 | Cost: 0.1112
Epoch: 001/015 | Batch 450/498 | Cost: 0.1662
training accuracy: 97.23%
valid accuracy: 97.31%
Time elapsed: 0.05 min
Epoch: 002/015 | Batch 000/498 | Cost: 0.1117
Epoch: 002/015 | Batch 050/498 | Cost: 0.1970
Epoch: 002/015 | Batch 100/498 | Cost: 0.0622
Epoch: 002/015 | Batch 150/498 | Cost: 0.1640
Epoch: 002/015 | Batch 200/498 | Cost: 0.1751
Epoch: 002/015 | Batch 250/498 | Cost: 0.1989
Epoch: 002/015 | Batch 300/498 | Cost: 0.2042
Epoch: 002/015 | Batch 350/498 | Cost: 0.1363
Epoch: 002/015 | Batch 400/498 | Cost: 0.1100
Epoch: 002/015 | Batch 450/498 | Cost: 0.1713
training
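
The first logged cost, 0.7357, is close to ln 2 ≈ 0.693. That is the binary cross-entropy of a model that outputs a logit near zero for every example, which is roughly what one would expect right after initialization. A worked sketch of the per-example loss in plain Python (the function name `bce_with_logits` is only for illustration; it follows the standard formula, not PyTorch's internal code):

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy for one example, computed from the raw logit."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


# An uninformed model (logit 0) pays ln 2 regardless of the label.
print(bce_with_logits(0.0, 1.0), bce_with_logits(0.0, 0.0), math.log(2))

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(bce_with_logits(3.0, 1.0), bce_with_logits(3.0, 0.0))
```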

In [20]:

def predict_sentiment(model, sentence):
    # based on:
    # https://github.com/bentrevett/pytorch-sentiment-analysis/blob/
    # master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb
    model.eval()
    # Tokenize with the spaCy pipeline loaded earlier as spacy_en
    tokenized = [tok.text for tok in spacy_en.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(DEVICE)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()

In [96]:
# testing set 

test_2k=pd.read_csv('/content/drive/MyDrive/586_project_NPL/test_2k.csv')

In [97]:
test=test_2k[['BlockId','Full_Event_Description']]

In [98]:
test

Unnamed: 0,BlockId,Full_Event_Description
0,38865049064139660,PacketResponder for block blk terminating
1,6952295868487656571,PacketResponder for block blk terminating
2,7128370237687728475,BLOCK NameSystem addStoredBlock blockMap updat...
3,8229193803249955061,PacketResponder for block blk terminating
4,6670958622368987959,PacketResponder for block blk terminating
...,...,...
1995,4198733391373026104,Receiving block blk src dest
1996,5815145248455404269,Received block blk of size from
1997,295306975763175640,Receiving block blk src dest
1998,5225719677049010638,PacketResponder for block blk terminating


In [104]:
test.iloc[100]

BlockId                                    7517964792804498202
Full_Event_Description     Got exception while serving blk to 
Name: 100, dtype: object

In [103]:
print('Probability anomaly:')
1-predict_sentiment(model, "Got exception while serving blk to ")

Probability anomaly:


0.5205644965171814
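
Whether `1 - predict_sentiment(...)` really is the anomaly probability depends on which label received index 1 when the label vocabulary was built; in the notebook this can be read from `LABEL.vocab.stoi`. The sketch below assumes indices are assigned in order of decreasing frequency and uses made-up label counts, so it only illustrates the check:

```python
import math
from collections import Counter

# Made-up label column; in practice Normal entries usually dominate log data.
labels = ["Normal"] * 97 + ["Anomaly"] * 3

# Assumption: the most frequent label receives index 0, the next index 1.
stoi = {label: idx for idx, (label, _) in enumerate(Counter(labels).most_common())}
print(stoi)

logit = -0.08  # hypothetical model output for one log line
p_index_1 = 1.0 / (1.0 + math.exp(-logit))  # sigmoid gives P(label with index 1)

# Pick the anomaly probability according to the mapping instead of hard-coding 1 - p.
p_anomaly = p_index_1 if stoi["Anomaly"] == 1 else 1.0 - p_index_1
print(round(p_anomaly, 4))
```

If the mapping in the notebook turns out to be `Normal: 0, Anomaly: 1`, the sigmoid output itself would be the anomaly probability and the `1 -` would need to be dropped.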

In [101]:
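# Score every event description in the 2k test set (1 - sigmoid output, as in the single example above)
Prob = []
for line in test.Full_Event_Description:
    p = 1 - predict_sentiment(model, line)
    Prob.append(p)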




In [77]:
# Threshold the scores at 0.5 to assign a label to each event
Label = ["Anomaly" if p > 0.5 else "Normal" for p in Prob]

In [78]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the column slice
test = test.assign(Label=Label)


In [85]:
test.groupby(Label)["Label"].count()

Anomaly    1573
Normal      427
Name: Label, dtype: int64