## Imports 

In [1]:
from IPython.display import clear_output
from collections import Counter

import pandas as pd 
import numpy as np 
import torch
import torch.nn as nn 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

%matplotlib inline

## RNNs

Please read about RNNs (Recurrent Neural Networks) and answer the questions below.  

1. How do RNNs differ from FFNNs (feed-forward neural networks)? (Write your answer below.)  

https://towardsdatascience.com/recurrent-neural-networks-rnn-explained-the-eli5-way-3956887e8b75

https://towardsdatascience.com/learn-how-recurrent-neural-networks-work-84e975feaaf7

2. Why do we need recurrent neural networks? 
3. For which tasks do they work better? 

In [2]:
### 1. Your answer here 

In [3]:
### 2. Your answer here 

In [4]:
### 3. Your answer here 

## Load data 

In [5]:
# Load the DF created during the previous task

df_binary = pd.read_json("../jigsaw-toxic-comment-classification-challenge/df_binary.json")
df_binary.head()

| | index | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate | cleaned | toxicity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60236 | b3925e41b823f473 | "\n\nThank you Ian. I knew about WP:NOTCENSORE... | 0 | 0 | 0 | 0 | 0 | 0 | [``, thank, ian, knew, wp, notcensored, also, ... | 0 |
| 1 | 116612 | b686d9f97deab4ad | Oh. I never took your comments in any negative... | 0 | 0 | 0 | 0 | 0 | 0 | [oh, never, took, comment, negative, way, perf... | 0 |
| 2 | 72935 | d96a1c99002f9cfc | Village pump and newbie \n\nI think your handl... | 0 | 0 | 0 | 0 | 0 | 0 | [village, pump, newbie, think, handling, newbi... | 0 |
| 3 | 30137 | 59a0576f85786c1f | I didn't change it hence this BS claim of me s... | 0 | 0 | 0 | 0 | 0 | 0 | [n't, change, hence, b, claim, saying, keep, a... | 0 |
| 4 | 148580 | 0f701c200f54455c | What the hell do you people expect? Wikipedia'... | 1 | 0 | 1 | 0 | 1 | 1 | [hell, people, expect, wikipedia, 's, controll... | 4 |


In [6]:
# Work with a small part of the data (test_size=0.7 keeps 30% of the rows): 
df_sample, _ = train_test_split(df_binary, test_size=0.7, stratify=df_binary['obscene'])

In [7]:
def flat_nested(nested):
    flatten = []
    for item in nested:
        if isinstance(item, list):
            flatten.extend(item)
        else:
            flatten.append(item)
    return flatten

cnt_vocab = Counter(flat_nested(df_sample.cleaned.tolist()))

print("Vocab size before filtering: {}".format(len(cnt_vocab)))

threshold_count_l = 1
threshold_count_h = 500
threshold_len = 2

cleaned_vocab = [token for token, count in cnt_vocab.items() if 
                     threshold_count_h > count > threshold_count_l and len(token) > threshold_len
                ]
print("Vocab size after filtering: {}".format(len(cleaned_vocab)))

Vocab size before filtering: 110213
Vocab size after filtering: 42419


In [8]:
cleaned_vocab.append(" ")
# Convert list to set 
cleaned_vocab = set(cleaned_vocab)

In [9]:
token_to_id = {v: k for k, v in enumerate(sorted(cleaned_vocab))}
id_to_token = {v: k for k, v in token_to_id.items()}

Before passing raw text to the model, we need to represent each text as a vector.   
We do this by replacing every token with its id and padding shorter sequences with the id of the `" "` token, so that all rows in a batch have the same length. 

In [10]:
def vectorize(data, token_to_id, max_len=None, dtype='int32', batch_first=True):
    """
    Casts a list of token lists into an RNN-digestible matrix of token ids.
        "data" must contain only tokens from the dictionary, so filter noise first.
        Returns the padded id matrix and the original sequence lengths.
    """
    seq_lengths = list(map(len, data))
    max_len = max_len or max(map(len, data))
    # Create a matrix with shape [batch size, max number of tokens in a sequence],
    # pre-filled with the id of the padding token " "
    data_ix = np.zeros([len(data), max_len], dtype) + token_to_id[' ']

    for i in range(len(data)):
        line_ix = [token_to_id[c] for c in data[i]]
        data_ix[i, :len(line_ix)] = line_ix

    return data_ix, seq_lengths
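
To make the padding behaviour concrete, here is a minimal, self-contained sketch on a toy batch. The names `toy_vocab`, `toy_token_to_id`, `toy_id_to_token` and `pad_and_index` are hypothetical and exist only for this illustration; the helper mirrors the idea of `vectorize` but is not the function used in the rest of the notebook.

```python
import numpy as np

# Hypothetical toy vocabulary; " " plays the role of the padding token,
# as it does in the notebook's cleaned_vocab.
toy_vocab = {" ", "hello", "world", "goodbye"}
toy_token_to_id = {tok: i for i, tok in enumerate(sorted(toy_vocab))}
toy_id_to_token = {i: tok for tok, i in toy_token_to_id.items()}


def pad_and_index(data, token_to_id, pad_token=" "):
    """Map token lists to ids and right-pad every row to the longest sequence."""
    seq_lengths = [len(seq) for seq in data]
    max_len = max(seq_lengths)
    # Start from a matrix filled with the padding id, then overwrite the real tokens
    data_ix = np.full((len(data), max_len), token_to_id[pad_token], dtype="int32")
    for i, seq in enumerate(data):
        data_ix[i, :len(seq)] = [token_to_id[tok] for tok in seq]
    return data_ix, seq_lengths


toy_batch = [["hello", "world"], ["goodbye"]]
toy_ix, toy_lengths = pad_and_index(toy_batch, toy_token_to_id)

print(toy_ix)       # the second row ends with the padding id
print(toy_lengths)  # true lengths, needed later for packing
# Decode the first row back to tokens
print([toy_id_to_token[int(i)] for i in toy_ix[0]])
```

The returned lengths matter because the padded matrix alone does not say where each sequence really ends.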

In [11]:
def filter_noise_tokens(df, cleaned_vocab): 
    # Work on a copy so that a slice of another DataFrame is not modified in place
    df = df.copy()
    df['filtered_tokens'] = df.cleaned.apply(lambda x: [tok for tok in x if tok in cleaned_vocab])
    return df 

In [12]:
# After filtering, some comments may have no tokens left (empty lists). 
df_sample = filter_noise_tokens(df_sample, cleaned_vocab)

# Remove examples without any tokens left 
df_filtered = df_sample[df_sample['filtered_tokens'].map(len) > 0]



In [13]:
# Stratified train-test split (both parts keep the original class imbalance)
df_train, df_test = train_test_split(df_filtered, test_size=0.4, stratify=df_filtered['obscene'])

In [14]:
print("Train shape: {}".format(df_train.shape))
print("Test shape: {}".format(df_test.shape))

Train shape: (26392, 12)
Test shape: (17596, 12)


In [15]:
class RNNLoop(nn.Module):
    
    def __init__(self, num_tokens, emb_size=200, hid_size=128):
        super().__init__()
        self.emb = nn.Embedding(num_tokens, emb_size)
        self.rnn = nn.RNN(emb_size, hid_size, batch_first=True)
        self.logits = nn.Linear(hid_size, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, seq_lengths):
        # Embed the obtained sequence 
        emb = self.emb(x)
        # Pack the padded sequence so that the RNN skips padding positions; for details, see:
        # https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch
        
        pack = torch.nn.utils.rnn.pack_padded_sequence(emb,
                                                   seq_lengths,
                                                   batch_first=True,
                                                   enforce_sorted=False
                                                  ) 
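        # hidden has shape [num_layers=1, batch size, hid_size]: the last state of each sequence
        all_hidden_states, hidden = self.rnn(pack)
        logits = self.logits(hidden)
        # Map logits to probabilities in the range from 0 to 1 
        output = self.sigmoid(logits)
        return output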
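
The effect of packing can be checked on a toy batch. The sketch below uses arbitrary sizes and random inputs that are assumptions for illustration only, not values from this notebook. It compares the final hidden state obtained with and without `pack_padded_sequence` against a reference run on the short sequence alone.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Toy sizes chosen for illustration only
emb_size, hid_size = 4, 3
rnn = nn.RNN(emb_size, hid_size, batch_first=True)

# Two "embedded" sequences padded to length 5; the second has only 2 real steps
emb = torch.randn(2, 5, emb_size)
emb[1, 2:] = 0.0  # padded positions
seq_lengths = [5, 2]

with torch.no_grad():
    # Without packing, the RNN keeps updating the state on the padded steps
    _, hidden_padded = rnn(emb)

    # With packing, the RNN stops at each sequence's true length
    packed = pack_padded_sequence(emb, seq_lengths, batch_first=True, enforce_sorted=False)
    _, hidden_packed = rnn(packed)

    # Reference: the short sequence on its own, truncated to its real length
    _, hidden_short = rnn(emb[1:2, :2])

print(hidden_packed.shape)  # [num_layers, batch size, hid_size]
print(torch.allclose(hidden_packed[0, 1], hidden_short[0, 0], atol=1e-5))
print(torch.allclose(hidden_padded[0, 1], hidden_short[0, 0], atol=1e-5))
```

Only the packed run should agree with the reference, which is why the classifier reads the hidden state returned for the packed input.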

In [16]:
# Initialise the model 
model = RNNLoop(num_tokens=len(cleaned_vocab))
# specify loss function
criterion = nn.BCELoss()
# specify optimizer
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-2)
history = []

batch_size = 64
n_epochs = 10
n_iters = df_train.shape[0] // batch_size
print("Number of iterations for 1 epoch: {}".format(n_iters))

for epoch in range(n_epochs):
    epoch_loss = 0 
    for step in range(n_iters):

        optimizer.zero_grad()    # Reset gradients accumulated in the previous step
        # Make a random sample from the dataframe 
        sample = df_train.sample(batch_size)

        # Vectorize the obtained sample 
        batch_ix, seq_lengths = vectorize(sample.filtered_tokens.tolist(), token_to_id)
        # Convert vectorized batch to tensor 
        batch_ix = torch.tensor(batch_ix, dtype=torch.int64)

        # Select true labels 
        y_true = sample.obscene.tolist()
        # Convert true labels to tensor 
        y_true = torch.tensor(y_true, dtype=torch.float)

        # Make prediction 
        y_pred = model(batch_ix, seq_lengths)

        loss = criterion(y_pred.squeeze(), y_true)

        epoch_loss += loss.item() / n_iters
        loss.backward()   # Backward pass 
        optimizer.step()
            
    print('Epoch {}: train loss: {}'.format(epoch, epoch_loss))    

Number of iterations for 1 epoch: 412
Epoch 0: train loss: 0.15799956292691597
Epoch 1: train loss: 0.09098772610564845
Epoch 2: train loss: 0.06469609696718695
Epoch 3: train loss: 0.055170654709726714
Epoch 4: train loss: 0.05423462519691173
Epoch 5: train loss: 0.05278378341785509
Epoch 6: train loss: 0.04587735820254222
Epoch 7: train loss: 0.04313932653591123
Epoch 8: train loss: 0.036186877298745285
Epoch 9: train loss: 0.02939191049397744


In [17]:
# Function for splitting the test dataset into batches 

def split(df, chunk_size):
    # Slice the frame into consecutive chunks of at most chunk_size rows;
    # unlike np.split with index marks, this never yields an empty last chunk
    return [df.iloc[start:start + chunk_size] for start in range(0, df.shape[0], chunk_size)]

In [18]:
def make_predictions(model, df_test, batch_size, threshold): 
    n_prints = 0
    predictions = []
    true_labels = []
    # Split data in batches 
    test_batches = split(df_test, batch_size)
    
    for batch in test_batches:
        # Vectorize batches
        batch_ix, seq_lengths = vectorize(batch.filtered_tokens.tolist(), token_to_id)
        # Convert vectorized batch to tensor 
        batch_ix = torch.tensor(batch_ix, dtype=torch.int64)

        # Select true labels 
        y_true = batch.obscene.tolist()

        # Make prediction; reshape(-1) keeps a 1-D array even for a batch of size 1 
        y_pred = model(batch_ix, seq_lengths).detach().reshape(-1).numpy()
        # Convert probabilities to binary labels 
        y_pred = [int(pred.item() > threshold) for pred in y_pred]
        
        # Add them to parallel lists 
        predictions.extend(y_pred)
        true_labels.extend(y_true)
        
        # Print up to 10 obscene documents together with their predicted and true labels 
        for true, pred, document in zip(y_true, y_pred, batch.comment_text):
            if true == 1.0 and n_prints < 10:
                print("Predicted label: {}".format(pred))
                print("True label: {}".format(true))
                print("Document: {}".format(document))
                print("*-*-"*20)
                n_prints += 1
        
    return true_labels, predictions

In [19]:
true_labels, predictions = make_predictions(model, df_test, batch_size=64, threshold=0.3)

Predicted label: 0
True label: 1
Document: Do you think people would like you more if you weren't such a dick? Or are you a dick because no one likes you? Either way, you won't be missed.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 1
True label: 1
Document: GET A LIFE shit Nerd!
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 1
True label: 1
Document: YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY LICK YOU CAN SUCK MY L

In [20]:
# Print a classification report: 

print(classification_report(true_labels, predictions))

              precision    recall  f1-score   support

           0       0.97      0.95      0.96     16629
           1       0.37      0.49      0.42       967

    accuracy                           0.93     17596
   macro avg       0.67      0.72      0.69     17596
weighted avg       0.94      0.93      0.93     17596
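
As a sanity check on the report, the F1 columns can be recomputed from the printed precision and recall, since F1 is their harmonic mean. The helper name `f1_score_from` is hypothetical. Because the inputs are already rounded to two decimals, the results match the report only after rounding.

```python
def f1_score_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the classification report
f1_clean = f1_score_from(0.97, 0.95)
f1_obscene = f1_score_from(0.37, 0.49)
# The macro average weights both classes equally, regardless of their support
macro_f1 = (f1_clean + f1_obscene) / 2

print(round(f1_clean, 2), round(f1_obscene, 2), round(macro_f1, 2))  # 0.96 0.42 0.69
```

The gap between the macro average (0.69) and the weighted average (0.93) comes from the class sizes: 967 obscene comments against 16629 clean ones.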



## Task

1. Balance the dataset: for example, select all of the obscene messages, count them, and sample an equal number of examples from the clean messages. **(1) See whether this improves your score on toxic messages.** 


2. Read about the different types of RNNs (LSTMs and GRUs): 
  https://colah.github.io/posts/2015-08-Understanding-LSTMs/  

  https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21 
  
  **(2) What is the difference between an RNN and an LSTM? Why do we need LSTMs? Explain it in your own words.**  
  
  **(3) What is the difference between an LSTM and a GRU? Explain it in your own words.** 
  
  
3. Modify your network so that it can work with nn.LSTM or nn.GRU layers. (Their outputs may differ slightly from those of nn.RNN, so be careful to modify your code accordingly.) 

4. Compare all of the previous examples: classification with an RNN (or LSTM/GRU) and with an FFNN. **(4) Which one performed better according to the metrics? (5) Which one performed better in terms of time?**

5. **(6) How does dataset imbalance influence your model? Read about dataset imbalance and ways to handle it. (7) Write down below what we can do about it, or implement a solution.** 
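
The balancing procedure from step 1 could be sketched as follows. This is a minimal illustration on a made-up frame: `toy_df` and `undersample` are hypothetical names, and the sketch assumes the positive class is the smaller one. In the notebook the same idea would be applied to the training data only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the training data; only the label column matters
toy_df = pd.DataFrame({
    "comment_text": ["text {}".format(i) for i in range(20)],
    "obscene": [1] * 4 + [0] * 16,
})


def undersample(df, label_col="obscene", random_state=42):
    """Keep all positive rows and an equally sized random sample of negative rows."""
    positives = df[df[label_col] == 1]
    # Assumes there are at least as many negative rows as positive rows
    negatives = df[df[label_col] == 0].sample(n=len(positives), random_state=random_state)
    # Shuffle so that the two classes are not grouped together
    return pd.concat([positives, negatives]).sample(frac=1, random_state=random_state)


toy_balanced = undersample(toy_df)
print(toy_balanced["obscene"].value_counts())
```

Undersampling discards most of the clean comments, so it trades data volume for balance. Keeping the test set imbalanced makes the resulting metrics comparable with the report above.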
  
  

Please answer questions 1-7 and write your answers below: 

In [21]:
### Your answers here 