## Prerequisites

torch==1.1.0

In [0]:
import random
from collections import Counter

import numpy as np 
import pandas as pd 
import torch 
import torch.nn as nn 

from gensim.models import KeyedVectors
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [0]:
df = pd.read_csv("train.csv")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned
0,0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"['explanation', 'edits', 'made', 'username', '..."
1,1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[""d'aww"", 'match', 'background', 'colour', ""'m..."
2,2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"['hey', 'man', ""'m"", 'really', 'trying', 'edit..."
3,3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"['``', 'ca', ""n't"", 'make', 'real', 'suggestio..."
4,4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"['sir', 'hero', 'chance', 'remember', 'page', ..."
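
The `cleaned` column is displayed above as list-like text. If it was written to the CSV as a Python list, `pd.read_csv` returns each cell as a string, and iterating over such a cell later yields single characters rather than tokens. Whether that applies to this file is an assumption; `type(df.cleaned.iloc[0])` settles it. A minimal sketch of a parser (the helper name `parse_tokens` is hypothetical):

```python
import ast

def parse_tokens(value):
    """Return a list of tokens from a list or from its string representation."""
    if isinstance(value, list):
        return value
    parsed = ast.literal_eval(value)  # safely evaluates a Python literal
    if not isinstance(parsed, list):
        raise ValueError("expected a list literal, got: {!r}".format(value))
    return parsed

raw = "['hey', 'man', \"'m\", 'really']"
tokens = parse_tokens(raw)
print(tokens)

# Hypothetical usage with the notebook's DataFrame:
# df["cleaned"] = df["cleaned"].apply(parse_tokens)
```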


In this notebook you will learn the basics of PyTorch, the framework you will use to build simple neural networks in this task.   
The first neural network we will train is a feed-forward neural network (FFNN) which contains one fully connected layer.  
An FFNN can have one or more fully connected layers; it is also called an MLP (multilayer perceptron). 

Read about PyTorch here:  
https://en.wikipedia.org/wiki/PyTorch

And here:

https://neurohive.io/ru/tutorial/glubokoe-obuchenie-s-pytorch/

While reading these articles you will probably meet some unfamiliar terms: 
backpropagation algorithm, gradient descent, activation function, loss function, etc.  
Please look up why each of them is needed. 

Answer these questions about neural networks: 

1. In previous tasks we created some features manually, tried to weight them, and tried to select special words for vectorization. How does deep learning solve this problem? 

2. Why do we work with tensors in PyTorch?

3. Why do we need activation functions in our models? Look up the XOR problem for an MLP without an activation function and use it to answer the question. 

4. What is a gradient? Why do we need the gradient descent algorithm? Which problem does it solve? 

5. What is the backpropagation algorithm? 

6. What is a loss function? 

1. Take Word2Vec as an example: it contains an implementation of a simple neural network, and that network is exactly how the problem in the question is solved. The point is the flexibility and tunability of the model parameters.

2. In essence, PyTorch tensors are the same multidimensional arrays as in NumPy, with similar capabilities. They are used for computation. According to the documentation (which I also checked in practice), computations on tensors can run on either the CPU or the GPU.

3. Activation is the process in which a neuron, given sufficient input, passes a value further along the network. The neuron's activation function transforms that value. Examples of activation functions: sigmoid functions (tanh, logistic, ...), the Heaviside step function, etc.
<br>
Activation functions are needed to make a neural network flexible. They are also what solves the XOR problem, where the data is not linearly separable.

4. Let $\Omega \subset \mathbb{R}^d \> (d \in \mathbb{N})$ be a domain in $\mathbb{R}^d$. Then a function $\phi: \Omega \rightarrow \mathbb{R}$ is a scalar field.
<br>
The gradient of $\phi$ is the following expression:
$\nabla \phi = (\frac{\partial \phi}{\partial t_1}, \frac{\partial \phi}{\partial t_2}, \ldots, \frac{\partial \phi}{\partial t_d})$,<br>
where $\frac{\partial \phi}{\partial t_j}$ is the partial derivative of $\phi$ with respect to $t_j$. The gradient is identified with the direction in $\Omega$ in which $\phi$ increases fastest.
<br>
Gradient descent is a method for finding a local minimum of a function by following its negative gradient. In machine learning, for neural networks, it is used to train the model together with backpropagation: the gradient is taken of the error function, which measures the quality of the network during iterative training.
<br>
Gradient descent is used to minimize the mean error at the output of the network by updating the weights of the model.

5. Backpropagation is a way of computing the gradient of the error function that is used when updating the parameters of a multilayer perceptron. The goal is to minimize the error and obtain the desired result.

6. A loss function measures the disagreement between the observed data and the values predicted by the fitted model; it is often sensitive to outliers.
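
Answers 4 and 5 can be illustrated with a toy case. The following is a minimal sketch, not part of the original notebook: it minimizes the hypothetical loss $(w - 3)^2$ with plain gradient descent, using a hand-written derivative in place of backpropagation. The starting point, learning rate and step count are arbitrary assumptions.

```python
def loss(w):
    # Toy loss with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):
    # Analytic derivative of the toy loss: d/dw (w - 3)^2 = 2 (w - 3)
    return 2.0 * (w - 3.0)

w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (step size)
for step in range(100):
    w -= lr * grad(w)   # step against the gradient

print(w, loss(w))
```

Each update moves `w` a fraction of the way toward the minimum, so the loss shrinks at every step for this choice of learning rate.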

Read the following article:

https://en.wikipedia.org/wiki/Feedforward_neural_network

What is an FFNN? 

A feed-forward neural network is a type of network in which the input is processed in one direction, from one end to the other, through sequentially connected neurons that pass the signals forward.
<br>
Cycles and feedback loops are not characteristic of this type of network.
<br>
Simple examples of such networks are single-layer and multilayer perceptrons.

## PyTorch basics

#### Autograd

In [4]:
# Creating a tensor:
x = torch.ones(1, requires_grad=True)

print(x.grad)    # returns None

None


`x.grad` is `None` here because no backward pass has been run yet: gradients are populated only after `.backward()` is called on a value computed from `x`.

In [5]:
x = torch.ones(1, requires_grad=True)
y = 20 + x
z = (y ** 2) * 2 
z.backward()     # auto gradient calculation

print(x.grad)    # ∂z/∂x 

tensor([84.])
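
The printed value can be checked by hand with the chain rule. With $y = 20 + x$ and $z = 2y^2$:

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x} = 4y \cdot 1 = 4(20 + x), \qquad \left.\frac{\partial z}{\partial x}\right|_{x=1} = 4 \cdot 21 = 84.$$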


### Prepare the data

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned
0,0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"['explanation', 'edits', 'made', 'username', '..."
1,1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[""d'aww"", 'match', 'background', 'colour', ""'m..."
2,2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"['hey', 'man', ""'m"", 'really', 'trying', 'edit..."
3,3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"['``', 'ca', ""n't"", 'make', 'real', 'suggestio..."
4,4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"['sir', 'hero', 'chance', 'remember', 'page', ..."


In [0]:
# Cast the label columns to int so that they can be summed
label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
for column in label_columns:
    df[column] = df[column].astype('int32')

# Create a toxicity column (sums all of the toxic labels).
# Columns are selected by name: a positional slice such as iloc[:, 2:8]
# depends on the exact column order and can miss a label.
df['toxicity'] = df[label_columns].sum(axis=1)

# Clean data: rows where toxicity == 0
clean = df[df['toxicity'] == 0]
# Messages which were labelled as obscene
obscene = df[df['obscene'] == 1]

# Create a dataset for binary classification
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_binary = pd.concat([clean, obscene], ignore_index=True, sort=False)

In [0]:
# Shuffle
df_binary = df_binary.sample(frac=1)

# Reset index of the pd.DataFrame
df_binary.reset_index(inplace=True)

In [9]:
df_binary.head()

Unnamed: 0.1,index,Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned,toxicity
0,100556,111910,56aa435ccdab23e3,And there is your main problem. Your edited a ...,0,0,0,0,0,0,"['main', 'problem', 'edited', 'page', 'band', ...",0
1,99725,110989,51c3e418d77f16be,Needs rewording \n\nWhole passages are lifted ...,0,0,0,0,0,0,"['need', 'rewording', 'whole', 'passage', 'lif...",0
2,106372,118360,786adbf417e61232,"First of all, please indicate on your user pag...",0,0,0,0,0,0,"['first', 'please', 'indicate', 'user', 'page'...",0
3,125406,139555,eae9a684c7d803ba,""", 2 December 2008 (UTC)\nAlso, with respect t...",0,0,0,0,0,0,"['``', '2', 'december', '2008', 'utc', 'also',...",0
4,14031,15601,292c4283aa8adaaf,GR \n\nI'll try and spend some time on the art...,0,0,0,0,0,0,"['gr', ""'ll"", 'try', 'spend', 'time', 'article...",0


In [10]:
# Load W2V model 
import gensim.downloader as api
we_model = api.load('word2vec-google-news-300')
#we_model = KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)





In [0]:
# Make a stratified sample, for example: select 500 examples with obscene == 1 and 500 clean examples. 
''' TASK HERE'''

# Select only a small sample of your data (20%); do not train your model on all of the data available. 
# To make the task easier, make a stratified selection 
# (ideally the number of 1 labels would be approximately equal to the number of 0 labels).
# Note: stratify= keeps the class ratio of df_binary; it does not balance the classes.
df_sample, _ = train_test_split(
    df_binary, train_size = 0.2, stratify = df_binary['obscene'])

# Split the sample into stratified training and test sets 
''' TASK HERE'''

df_train, df_test = train_test_split(
    df_sample, train_size = 0.6, stratify = df_sample['obscene'])
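
The `stratify=` argument of `train_test_split` preserves the class proportions of the input; it does not equalize them. To get roughly equal numbers of 0 and 1 labels, as the task comment asks, the majority class has to be downsampled separately. A minimal sketch using only the standard library (the helper name `balanced_indices` and the toy labels are hypothetical):

```python
import random

def balanced_indices(labels, n_per_class, seed=0):
    """Pick up to n_per_class row positions for each distinct label."""
    rng = random.Random(seed)
    by_label = {}
    for position, label in enumerate(labels):
        by_label.setdefault(label, []).append(position)
    picked = []
    for label in sorted(by_label):
        positions = by_label[label]
        picked.extend(rng.sample(positions, min(n_per_class, len(positions))))
    rng.shuffle(picked)   # mix the classes so they are not grouped together
    return picked

toy_labels = [0] * 95 + [1] * 5          # imbalanced toy data
idx = balanced_indices(toy_labels, n_per_class=5)
print(sorted(toy_labels[i] for i in idx))

# Hypothetical usage with the notebook's DataFrame:
# df_balanced = df_binary.iloc[balanced_indices(df_binary["obscene"].tolist(), 500)]
```

A balanced training set changes the class prior the model sees, so the test set is usually kept at the original class ratio to get realistic metrics.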

In [51]:
print("Train shape: {}".format(df_train.shape))
print("Test shape: {}".format(df_test.shape))

Train shape: (18221, 12)
Test shape: (12148, 12)


In [0]:
def get_vectors(df_sample): 
    '''
    Process a DataFrame and build lists of vectors, labels and raw documents.
    Documents with no tokens in the word2vec vocabulary are skipped.
        
    Args: 
        df_sample: pd.DataFrame - DataFrame to vectorize
    Returns: 
        X: list - Vectorized documents, each item in the list is a torch.tensor
        labels: list - Labels for each document, each item in the list is a torch.tensor
        documents: list - Raw texts of the vectorized documents 
    '''
    
    X, labels, documents = [], [], []
    for document, tokens, label in zip(df_sample.comment_text, df_sample.cleaned, df_sample.obscene):
        row_vectors = []
        for kw in tokens:
            try: 
                row_vectors.append(we_model[kw])
            except (IndexError, KeyError): 
                # Token is not in the word2vec vocabulary
                continue
        if not row_vectors:
            continue
        # Document vector = mean of its word vectors
        vec = np.asarray(row_vectors).mean(axis=0)
        X.append(torch.tensor(vec))
        documents.append(document)
        labels.append(torch.tensor(label, dtype=torch.float))
        
    return X, labels, documents

In [0]:
X_train, y_train, documents_train = get_vectors(df_train)
X_test, y_test, documents_test = get_vectors(df_test)

### How to create a simple NN: 

In [0]:
# Modify your model to work with batches, not only single item. 
''' TASK HERE'''

class FeedForward(nn.Module):
    
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        self.fc1 = nn.Linear(self.input_size, self.hidden_size)
        self.relu = nn.ReLU()
        self.logits = nn.Linear(self.hidden_size, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        # Makes a forward pass 
        hidden = self.fc1(x)
        relu = self.relu(hidden)
        logits = self.logits(relu)
        output = self.sigmoid(logits)
        return output

In [53]:
model = FeedForward(300, 200)
model

FeedForward(
  (fc1): Linear(in_features=300, out_features=200, bias=True)
  (relu): ReLU()
  (logits): Linear(in_features=200, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [55]:
# Initialise the model 


# Specify loss and optimization functions:

# specify loss function
criterion = nn.BCELoss()
# specify optimizer
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)

# Move model to the training mode
model.train()

# init n_epochs 
n_epochs = 10

# init number of iterations for one epoch 
# during an epoch we want the model to walk through all of the training examples 
# for batch_size == 1, number of iterations would be equal to number of examples 
# in the training set 
n_iters = len(X_train)

# initialise batch_size
# NOTE! For now it is equal to 1; you need to modify your model and training loop so that they work with 
# batches during training, not only making an update for a single example 
batch_size = 1
for epoch in range(n_epochs):  
    epoch_loss = 0
    for idx in range(n_iters):
        
        # Selects only 1 sample, modify it to select N samples, N == batch_size
        ''' TASK HERE'''
        # idx = random.sample(range(len(X_train)), 1) # TIP: You can random sample N examples 
        
        optimizer.zero_grad()    # Reset the gradients left from the previous step

        # Select corresponding data from:
        # X (vectors) and labels - for calculating the loss and making a backward pass 
        ''' TASK HERE'''
        x = X_train[idx]
        y_true = y_train[idx]

        y_pred = model(x)    # Forward pass
        loss = criterion(y_pred.squeeze(), y_true)    # Compute loss
        
        epoch_loss += loss.item() / n_iters
        loss.backward()   # Backward pass: compute gradients of the loss
        optimizer.step()  # Update the weights using the gradients
        
    print('Epoch {}: train loss: {}'.format(epoch, epoch_loss))

Epoch 0: train loss: 0.15862504894716078
Epoch 1: train loss: 0.15816792545241073
Epoch 2: train loss: 0.1576178433830413
Epoch 3: train loss: 0.1572119482741705
Epoch 4: train loss: 0.1567066553905162
Epoch 5: train loss: 0.15633618162955137
Epoch 6: train loss: 0.15596891036991206
Epoch 7: train loss: 0.15559410756721104
Epoch 8: train loss: 0.15525694491526815
Epoch 9: train loss: 0.15502369151111636
Epoch 10: train loss: 0.15465777890029828
Epoch 11: train loss: 0.1543910327060794
Epoch 12: train loss: 0.15412987406842216
Epoch 13: train loss: 0.1539037730773472
Epoch 14: train loss: 0.1535646347953046
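
The batching task in the loop above mostly comes down to choosing groups of example positions and stacking the matching tensors. A minimal sketch of the index side, using only the standard library (the helper name `iter_batches` and the batch size are hypothetical); the commented lines show how it might plug into the loop, assuming `X_train`, `y_train`, `model` and `criterion` as defined in this notebook:

```python
import random

def iter_batches(n_examples, batch_size, seed=None):
    """Yield lists of shuffled example positions, batch_size at a time."""
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]   # the last batch may be shorter

batches = list(iter_batches(10, batch_size=4, seed=0))
print([len(b) for b in batches])  # [4, 4, 2]

# Hypothetical use inside the training loop (assumes torch is imported):
# for batch in iter_batches(len(X_train), batch_size=32):
#     x = torch.stack([X_train[i] for i in batch])        # shape (B, 300)
#     y_true = torch.stack([y_train[i] for i in batch])   # shape (B,)
#     y_pred = model(x).squeeze(1)                        # shape (B,)
#     loss = criterion(y_pred, y_true)
```

`nn.Linear` operates on the last dimension of its input, so the `FeedForward` class accepts a `(B, 300)` batch without changes to its layers.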


In [0]:
def make_predictions(model, X_test, y_test, documents_test, threshold): 
    # Returns a list of 0/1 predictions, one per test example
    n_prints = 0
    preds = []
    for example, label, document in zip(X_test, y_test, documents_test):
        # Gradients are not needed for inference
        with torch.no_grad():
            pred = model(example)
        y_pred = int(pred.item() > threshold)
        preds.append(y_pred)
        
        # Print the first 10 obscene documents with their predicted and true labels 
        if label.item() == 1.0 and n_prints < 10:
            print("Predicted label: {}".format(y_pred))
            print("True label: {}".format(label.item()))
            print("Document: {}".format(document))
            print("*-*-"*20)
            n_prints += 1
        
    return preds

In [57]:
# Move model to the eval mode before making a prediction
model.eval()
preds = make_predictions(model, X_test, y_test, documents_test, threshold=0.5)

test_labels = [label.item() for label in y_test]

Predicted label: 0
True label: 1.0
Document: ONOREM IS STILL A FAUGOTT
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 1
True label: 1.0
Document: dr karl loves himself, fuck yourselves wikipedia 

go fuck yourselves wikipedia
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 0
True label: 1.0
Document: "

 Thake v Maurice 

Just because it had a picture of a penis!  "
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 0
True label: 1.0
Document: "

 You're a fucking moron 

Please kill yourself. You are a cancerous moron devoid of all intelligence and have no business editing Wikipedia or having any privileges thereof. I demand that you restore my userpages in full, removing any copyright infringement contained within. Mate1 "
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Predicted label: 0
True label: 1.0
D

In [58]:
print(classification_report(test_labels, preds))

              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     11465
         1.0       0.74      0.24      0.36       681

    accuracy                           0.95     12146
   macro avg       0.85      0.62      0.67     12146
weighted avg       0.94      0.95      0.94     12146



In [0]:
# Initial classification report (the score to beat in Task 1)

              precision    recall  f1-score   support

         0.0       0.98      0.99      0.99      5724
         1.0       0.87      0.62      0.72       337

    accuracy                           0.97      6061
   macro avg       0.92      0.81      0.86      6061
weighted avg       0.97      0.97      0.97      6061



## Task 1: 

#### Find all of the ''' TASK HERE ''' messages. 

1. Create a stratified dataset and make your classes balanced! Train the model. Try to beat the initial score.

2. While vectorizing with the W2V model, add tf-idf weighting; look at TfidfVectorizer in sklearn. 

3. Add a batch size: modify your model architecture to make it possible to process batches, not only single items. 

4. Change hidden_size, n_layers, activation function, etc. to modify your model. 

5. Tweak the learning rate: see what happens if the LR is too small or too big (0.0001 / 0.8 for example).

In [0]:
# Tip:
# Use tf-idf scores calculated by sklearn:

def dummy_fun(doc):
    # This function replaces the default tokenizer and preprocessor in sklearn. 
    # Passing already tokenized documents to the tf-idf vectorizer 
    # is much faster 
    return doc

def get_idf(tokenized_docs, max_features=180000):
    ''' Returns an IDF dictionary: 
            key: word,
            value: IDF score. 
    '''
    vectorizer = TfidfVectorizer(
        min_df=3,
        max_features=max_features,
        analyzer='word',
        tokenizer=dummy_fun,
        preprocessor=dummy_fun,
        token_pattern=None,
        ngram_range=(1, 1))

    vectorizer.fit(tokenized_docs)
    # vocabulary_ maps each token to its column index in idf_; unlike
    # get_feature_names(), it is available in both old and new sklearn versions
    idf_dict = {token: vectorizer.idf_[idx] for token, idx in vectorizer.vocabulary_.items()}
    
    return idf_dict
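
One way to use such an IDF dictionary for point 2 of Task 1 is to replace the plain mean of word vectors with an IDF-weighted mean. The sketch below is an assumption about how that could look, not the course's reference solution: the helper name `weighted_mean_vector`, the `default_idf` fallback and the toy vectors are hypothetical, and a plain dict stands in for the word2vec model.

```python
import numpy as np

def weighted_mean_vector(tokens, vectors, idf, default_idf=1.0):
    """IDF-weighted mean of the vectors of the tokens found in `vectors`.

    Returns None when no token has a vector.
    """
    rows, weights = [], []
    for token in tokens:
        if token not in vectors:
            continue   # skip out-of-vocabulary tokens
        rows.append(vectors[token])
        weights.append(idf.get(token, default_idf))
    if not rows:
        return None
    return np.average(np.asarray(rows), axis=0, weights=weights)

toy_vectors = {"good": np.array([1.0, 0.0]), "bad": np.array([0.0, 1.0])}
toy_idf = {"good": 1.0, "bad": 3.0}
doc_vector = weighted_mean_vector(["good", "bad", "unknown"], toy_vectors, toy_idf)
print(doc_vector)   # [0.25 0.75]
```

Rare words get a larger IDF and therefore pull the document vector toward themselves, while very frequent words contribute less.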

## Task 2, advanced

Working with nn.Embedding layer 

https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html 

Read the example below. 

Please try to modify your initial version of the model (referred to here as SingleLayerPerceptron; the `FeedForward` class above) by adding one more layer: 

1. Define your vocabulary size  
2. Add an nn.Embedding layer to the model architecture (vocabulary_size, embedding_size) 
3. Retrain your model and see if the metrics improved.

### Useful material for part 2: 

Refer to part 4.3 of the course:

https://stepik.org/lesson/262247/

It will help you understand how to use an nn.Embedding layer. 

#####  Let's create a vocabulary: 

In [0]:
def flat_nested(nested):
    flatten = []
    for item in nested:
        if isinstance(item, list):
            flatten.extend(item)
        else:
            flatten.append(item)
    return flatten

cnt_vocab = Counter(flat_nested(df.cleaned.tolist()))

In [0]:
threshold_count_l = 15
threshold_count_h = 500
threshold_len = 4
cleaned_vocab = [token for token, count in cnt_vocab.items() if 
                     threshold_count_h > count > threshold_count_l and len(token) > threshold_len
                ]
print("Vocab size: {}".format(len(cleaned_vocab)))

Vocab size: 13061


In [0]:
# You will need an id for each of your tokens 

token_to_id = {token: idx for idx, token in enumerate(sorted(cleaned_vocab))}
id_to_token = {idx: token for token, idx in token_to_id.items()}
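
Before documents can be fed to an nn.Embedding layer, each one has to become a fixed-length list of ids. A minimal sketch of one possible scheme, with assumptions stated up front: ids 0 and 1 are reserved for padding and unknown tokens (so real tokens start at 2, unlike `token_to_id` above), and the names `PAD_ID`, `UNK_ID`, `build_token_ids` and `encode` are hypothetical.

```python
PAD_ID, UNK_ID = 0, 1   # hypothetical reserved ids

def build_token_ids(vocab):
    """Map each token to an id, leaving 0 and 1 for padding and unknown tokens."""
    return {token: idx for idx, token in enumerate(sorted(vocab), start=2)}

def encode(tokens, token_ids, max_len):
    """Convert tokens to ids, truncating or right-padding to max_len."""
    ids = [token_ids.get(token, UNK_ID) for token in tokens[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))

toy_ids = build_token_ids(["username", "background", "colour"])
encoded = encode(["colour", "zzz", "username"], toy_ids, max_len=5)
print(encoded)   # [3, 1, 4, 0, 0]
```

With this scheme the embedding layer would need `len(toy_ids) + 2` rows, and `nn.Embedding` accepts a `padding_idx` argument so that the padding id maps to a vector that is not updated during training.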