## The ```torch.nn``` package

The ```torch.nn``` package provides a number of higher-level APIs that resemble TensorFlow's ```layers``` module or Keras. Building a simple MLP is therefore as easy as it should be:

In [0]:
!pip install torch torchvision

## Model definition

In [0]:
import torch

num_features = 28*28
num_classes = 10
H_size = 100

model = torch.nn.Sequential(
  torch.nn.Linear(num_features, H_size),
  torch.nn.ReLU(),
  torch.nn.Linear(H_size, num_classes)
)
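
As a quick sanity check, the sketch below rebuilds the same architecture under a different name (`mlp`, chosen here only to avoid clobbering `model`) and pushes a random dummy batch through it. The batch of 32 random vectors is an assumption for illustration, not real MNIST data.

```python
import torch

num_features = 28 * 28
num_classes = 10
H_size = 100

mlp = torch.nn.Sequential(
    torch.nn.Linear(num_features, H_size),
    torch.nn.ReLU(),
    torch.nn.Linear(H_size, num_classes),
)

# Dummy batch of 32 flattened "images" (random values, for shape checking only).
dummy_batch = torch.randn(32, num_features)
logits = mlp(dummy_batch)
print(logits.shape)  # torch.Size([32, 10])

# Weights plus biases of both linear layers: 784*100 + 100 + 100*10 + 10.
num_params = sum(p.numel() for p in mlp.parameters())
print(num_params)  # 79510
```

The output has one row per sample and one column per class; these are raw scores (logits), not probabilities.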

## Dataset loading

In [0]:
from torchvision import datasets, transforms

batch_size = 32

train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=batch_size, shuffle=True)

test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=1, shuffle=True)
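
The `Normalize` transform subtracts a per-channel mean and divides by a per-channel standard deviation. The sketch below reproduces that arithmetic with plain tensors, using the two constants from the cell above (presumably the mean and standard deviation of the MNIST training pixels). The random `image` is a stand-in for a `ToTensor()` output.

```python
import torch

mean, std = 0.1307, 0.3081

# Stand-in for a ToTensor() output: one channel, values in [0, 1].
image = torch.rand(1, 28, 28)
normalized = (image - mean) / std

# Where a black pixel (0.0) and a white pixel (1.0) end up after normalization.
black = (0.0 - mean) / std
white = (1.0 - mean) / std
print(round(black, 4), round(white, 4))
```

After this step the pixel values are roughly centered around zero, which tends to make optimization better behaved.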





## Training loop

In [0]:
import torch.optim as optim

loss = torch.nn.CrossEntropyLoss()
learning_rate = 1e-5

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

num_epochs = 10

for i in range(num_epochs):
  # train 
  for j, (x, y_true) in enumerate(train_loader):
    x = x.view(-1, num_features)  # flatten each image; -1 infers the batch size
    optimizer.zero_grad()
    y_pred = model(x)
    loss_value = loss(y_pred, y_true)
    loss_value.backward()
    optimizer.step()
    if j % 100 == 0:
      print('Epoch {}; batch {}: loss {}'.format(i, j, loss_value.detach().numpy()))
  # test
  loss_value_test = 0
  correct = 0
  with torch.no_grad():
    for x, y_true in test_loader:
      x = x.view(1, num_features)
      y_pred = model(x)
      loss_value_test += loss(y_pred, y_true)
      y_pred = y_pred.max(1, keepdim=True)[1]
      correct += y_pred.eq(y_true).sum().numpy()
  loss_value_test /= len(test_loader.dataset)
  accuracy = correct / len(test_loader.dataset)
  print('Epoch {}: loss {} accuracy {}'.format(i, loss_value_test, accuracy))
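
A note on the reshape inside the training loop: a hard-coded batch size only works when every batch is full. With 60000 training images and a batch size of 32 that happens to be the case, but another batch size may leave a smaller final batch. A minimal sketch of a reshape that does not depend on the batch size (the 7-image batch is a made-up example):

```python
import torch

num_features = 28 * 28

# Stand-in for a final, smaller batch: 7 images instead of 32.
x = torch.randn(7, 1, 28, 28)

# -1 lets PyTorch infer the batch dimension from the tensor itself.
flat = x.view(-1, num_features)
print(flat.shape)  # torch.Size([7, 784])
```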
  
  

## Extending the nn.Module class

Another nice thing about PyTorch is that it fits well with the object-oriented model. When you build a neural net, you can extend the base ```nn.Module``` class. This might be overkill for an MLP or any very simple feedforward model, where ```nn.Sequential``` is all you need; however, it gives you an incentive to write clean code and a natural way to handle hyperparameters (as constructor parameters of your new class).

In [0]:
import torch.nn.functional as F
import torch.nn as nn

class FCNet(nn.Module):
  def __init__(self, H1, H2):
    super(FCNet, self).__init__()
    self.fc1 = nn.Linear(num_features, H1)
    self.fc2 = nn.Linear(H1, H2)
    self.fc3 = nn.Linear(H2, num_classes)
    
  def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    y_pred = self.fc3(x)
    return y_pred
  
  
model = FCNet(100, 50)

loss = torch.nn.CrossEntropyLoss()
learning_rate = 1e-5
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
num_epochs = 10

for i in range(num_epochs):
  # train 
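  model.train()  # training mode (matters for layers such as dropout or batch norm)
  for j, (x, y_true) in enumerate(train_loader):
    x = x.view(-1, num_features)  # flatten each image; -1 infers the batch size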
    optimizer.zero_grad()
    y_pred = model(x)
    loss_value = loss(y_pred, y_true)
    loss_value.backward()
    optimizer.step()
    if j % 100 == 0:
      print('Epoch {}; batch {}: loss {}'.format(i, j, loss_value.detach().numpy()))
  # test
  loss_value_test = 0
  correct = 0
  model.eval()  # evaluation mode
  with torch.no_grad():  # no gradients are needed for evaluation
    for x, y_true in test_loader:
      x = x.view(1, num_features)
      y_pred = model(x)
      loss_value_test += loss(y_pred, y_true)
      y_pred = y_pred.max(1, keepdim=True)[1]
      correct += y_pred.eq(y_true).sum().numpy()
  loss_value_test /= len(test_loader.dataset)
  accuracy = correct / len(test_loader.dataset)
  print('Epoch {}: loss {} accuracy {}'.format(i, loss_value_test, accuracy))
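
Why bother with `model.train()` and `model.eval()`? `FCNet` contains only linear layers, so the two calls make no difference here, but they become essential as soon as a layer behaves differently in the two modes. The sketch below uses a standalone `nn.Dropout` layer (not part of the model above) to show the effect.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # entries are either zeroed or scaled by 1 / (1 - p)

dropout.eval()
eval_out = dropout(x)   # identity in evaluation mode

print(train_out.unique())
print(torch.equal(eval_out, x))  # True
```

Forgetting `model.eval()` with such layers would make test-time predictions noisy, which is why it is a good habit even when it is not strictly needed.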

  

## Recurrent Networks with PyTorch

Thanks to its dynamic graph building, PyTorch is a good fit for training recurrent networks. While a recurrent network can be thought of as a feedforward model that has loops, like so

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-rolled.png" height="300px;"/>

another way to think about it, perhaps more explicit, comes from **unrolling the loop**: 

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" height="300px;"/>

*(images from [Chris Olah's blog](http://colah.github.io))*

Here, $X_0, X_1, \dots, X_t$ are the elements of the sequence we wish to process. Since there is no fixed length for our sequence, a library that builds a static computational graph would have to resort to some kind of trick here. From what I have gathered and experienced personally, TensorFlow requires you to specify the length of each of your input sequences and to pad them with zeros until they are as long as the longest one. PyTorch, on the other hand, will build the computational graph with the correct number of nodes/tensors/operations depending on the length of the sequence you are passing to the model as input.


The specific recurrent network flavor we will try today is a *many-to-one* model: we want to read a whole sequence (actually a sentence) and return a class label. This is called a many-to-one model because we only want to predict a single value while having multiple values as an input; alternatives would be many-to-many (machine translation), one-to-many (image captioning) and one-to-one (traditional feedforward networks). 

<img src="http://karpathy.github.io/assets/rnn/diags.jpeg" height="300px;"/>

*(image from [Andrej Karpathy's blog](http://karpathy.github.io/2015/05/21/rnn-effectiveness/))*

To get an intuition for what a recurrent network does, reason about its three separate sets of parameters:

* **Input-hidden parameters**, which map an element of the sequence to the hidden layer neurons (red lines in the figure above);
* **hidden-hidden parameters**, which model the *evolution* of your sequence by mapping the network's hidden state at step $t-1$ to step $t$ (green lines);
* **hidden-output parameters**, which map the network's hidden state to the output layer neurons, which you may softmax to output a prediction (blue lines).

Let us call these separate sets of parameters $V_{ih}$, $V_{hh}$ and $V_{ho}$. Then we can write the core equation for RNNs, which computes the activation of the hidden layer, as follows:

$$
h_t = \tanh(V_{ih} x_t + b_{ih} + V_{hh} h_{t-1} + b_{hh})
$$
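
To check that this really is the standard recurrence, the sketch below evaluates the equation by hand and compares it with PyTorch's built-in `nn.RNNCell`, reusing the cell's own weights. The sizes (5 inputs, 10 hidden units) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 5, 10
cell = nn.RNNCell(input_size, hidden_size)  # tanh nonlinearity by default

x_t = torch.randn(1, input_size)       # one sequence element (batch of 1)
h_prev = torch.zeros(1, hidden_size)   # h_{t-1}

# The hidden-state equation, written with the cell's own parameters.
h_manual = torch.tanh(
    x_t @ cell.weight_ih.T + cell.bias_ih
    + h_prev @ cell.weight_hh.T + cell.bias_hh
)
h_builtin = cell(x_t, h_prev)
print(torch.allclose(h_manual, h_builtin, atol=1e-6))
```

Note that `nn.RNNCell` stores its weights as (output, input) matrices, hence the transposes when multiplying row vectors from the left.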

In our case, let's name $T$ the last timestep for a given sequence. To get a prediction for our sequence, we will compute the following:

$$
y = \operatorname{argmax}(\operatorname{softmax}(V_{ho} h_T + b_{ho}))
$$
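
Putting the two equations together gives a complete many-to-one forward pass. The sketch below uses randomly initialised parameters (so the predictions themselves are meaningless) and row vectors multiplied from the left; the names `predict`, `short_seq` and `long_seq` are made up for this example. It runs the same parameters over sequences of length 3 and 7 without any padding.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size, num_classes = 5, 10, 2

# Randomly initialised parameters, for illustration only.
V_ih = torch.randn(input_size, hidden_size)
V_hh = torch.randn(hidden_size, hidden_size)
V_ho = torch.randn(hidden_size, num_classes)
b_ih, b_hh = torch.zeros(hidden_size), torch.zeros(hidden_size)
b_ho = torch.zeros(num_classes)

def predict(sequence):
    h = torch.zeros(hidden_size)           # initial hidden state
    for x_t in sequence:                   # one step per sequence element
        h = torch.tanh(x_t @ V_ih + b_ih + h @ V_hh + b_hh)
    logits = h @ V_ho + b_ho               # only the last hidden state is used
    probs = torch.softmax(logits, dim=0)
    return probs.argmax().item(), logits.argmax().item()

short_seq = torch.randn(3, input_size)
long_seq = torch.randn(7, input_size)
print(predict(short_seq), predict(long_seq))
```

Both returned indices agree because softmax is monotonic: when only the predicted class is needed, taking the argmax of the raw scores is enough.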

These equations may be implemented directly, without the need for a pre-made module. Of course, this kind of "recurrent layer" is available in PyTorch, alongside many others (```RNN```, ```LSTM```, ```GRU```...); however, here we will just use the fully-connected ```Linear``` layer we introduced above.

One thing we also have to decide is which kind of vectors we will have in our sequence $X_0, \dots, X_T$. There are two possible choices here: **characters** or **words**. Suppose we have to model a distribution $P(y \mid \textbf{x})$, where $\textbf{x}$ is drawn from either the words or the characters of our dataset: its dimensionality will be higher when using words, as there are many more possible words than characters; however, it seems hard to figure out the meaning of a sentence from its characters alone. On top of that, if we build a word-level RNN we can leverage pre-computed **word embeddings** [1]. For these reasons, a word-level RNN is the choice that makes most sense to me in this setting.

Just below, I define two RNNs, ```CharRNN``` and ```MyRNN```, which work on characters and words respectively. Characters are represented with a simple one-hot encoding; the Italian word embeddings are provided by the Polyglot library and, as far as I have been told, were extracted from the Italian Wikipedia corpus.



[[1]](https://arxiv.org/abs/1301.3781): Mikolov et al., Efficient Estimation of Word Representations in Vector Space 




In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class CharRNN(nn.Module):
  
  def __init__(self, input_size, hidden_size, output_size):
    super(CharRNN, self).__init__()

    self.input_size = input_size
    self.hidden_size = hidden_size
    self.output_size = output_size
    
    self.Vih = nn.Linear(input_size, hidden_size)
    self.Vhh = nn.Linear(hidden_size, hidden_size)
    self.Vho = nn.Linear(hidden_size, output_size)
    
  def forward(self, x):
    hidden_state = torch.zeros(self.hidden_size)
    for xi in x:
      # onehot_dict is a global built in the CharRNN preprocessing cell
      xi = torch.Tensor(onehot_dict[xi.lower()])
      hidden_state = F.relu(self.Vih(xi) + self.Vhh(hidden_state))
    return self.Vho(hidden_state)
  
  
class MyRNN(nn.Module):
  def __init__(self, input_size, hidden_size, output_size):
    super(MyRNN, self).__init__()

    self.input_size = input_size
    self.hidden_size = hidden_size
    self.output_size = output_size
    
    self.Vih = torch.nn.Parameter(data=torch.randn(input_size, hidden_size))
    self.bih = torch.nn.Parameter(data=torch.randn(hidden_size))
    self.Vhh = torch.nn.Parameter(data=torch.randn(hidden_size, hidden_size))
    self.bhh = torch.nn.Parameter(data=torch.randn(hidden_size))
    # one column of weights per output class
    self.Vho = torch.nn.Parameter(data=torch.randn(hidden_size, output_size))
    self.bho = torch.nn.Parameter(data=torch.randn(output_size))
    
  def forward(self, x):
    hidden_state = torch.zeros(self.hidden_size)
    for xi in x:
      xi = torch.Tensor(xi)
      hidden_state = torch.tanh((torch.matmul(xi, self.Vih) + self.bih) + (torch.matmul(hidden_state, self.Vhh) + self.bhh))
    return torch.matmul(hidden_state, self.Vho) + self.bho
   

model = MyRNN(5, 10, 2)
x = np.array([[0, 1, 0, 1, 1], [1, 0, 1, 1, 1], [1, 1, 0, 1, 1]])
y_ = model(torch.Tensor(x))
print(y_)

## Haspeede

Haspeede is the [hate speech detection task](http://www.di.unito.it/~tutreeb/haspeede-evalita18/index.html) at EVALITA 2018. It is a binary classification task: deciding whether a social media comment constitutes an act of hate speech. While definitions of this phenomenon are elusive, some examples can give us at least some intuition about it:

* **Encouraging or justifying violence**: any kind of comment that depicts acts of violence against minorities, or encourages others to perform such acts, can be described as *hate speech*;
* **Encouraging hate and discrimination**: comments that actively promote discrimination or express hate towards minorities because of their intrinsic characteristics.


## Load Haspeede data

In [0]:
!pip install --upgrade -q gspread pandas

In [0]:
from google.colab import auth
auth.authenticate_user()

import gspread
import numpy as np
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

worksheet = gc.open_by_url('https://docs.google.com/spreadsheets/d/1M5oznzGZgL24DtsVDnXbnnYIiKv9XGZIjE6AZmr-Nj4/').sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()

data = np.array(rows)
X = data[:, 0]
y = data[:, 1].astype('int32')

print(X[0])
print(y[0])

In [0]:
!pip install nltk
import nltk

nltk.download('punkt')
nltk.download('stopwords')

## CharRNN - Preprocessing

In [0]:
def unique_char_dict(X):
  d = {}
  for x in X:
    for xi in x:
      try:
        d[xi.lower()] += 1
      except KeyError:
        d[xi.lower()] = 1
  return d

def get_dropped_chars(d, thresh=100):
  l = []
  for char, occ in d.items():
    if occ < thresh:
      l.append(char)
  return l

def drop_chars(X):
  d = unique_char_dict(X)
  # counts are keyed by lowercased characters, so compare lowercased characters too
  dropped_chars = set(get_dropped_chars(d))
  X_new = []
  for x in X:
    X_new.append([xi for xi in x if xi.lower() not in dropped_chars])
  return X_new

def get_onehot_dict(X, d):
  temp_d = {}
  for x in X:
    for xi in x:
      temp_d[xi.lower()] = 0
    if len(temp_d) == len(d):
      break
  onehot_d = {}
  for i, char in enumerate(temp_d.keys()):
    onehot_d[char] = np.array([0 for el in temp_d.values()])
    onehot_d[char][i] = 1
  return onehot_d

label_dict = {0: np.array([0, 1]), 1: np.array([1, 0])}

X_new = drop_chars(X)
char_dict = unique_char_dict(X_new)
onehot_dict = get_onehot_dict(X_new, char_dict)
#print(onehot_dict)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3)
print(X_train[0])
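
For reference, the counting and filtering steps above can be written more compactly with `collections.Counter`. This is only a sketch on two made-up comments, with a threshold of 2 instead of 100 so that the toy data has something to drop.

```python
from collections import Counter

# Toy comments, for illustration only.
comments = ["Ciao a tutti", "ciao!!"]

# Case-insensitive character frequencies over the whole corpus.
counts = Counter(ch.lower() for comment in comments for ch in comment)

# Characters seen fewer than 2 times are considered rare.
rare = {ch for ch, n in counts.items() if n < 2}
cleaned = [[ch for ch in comment if ch.lower() not in rare] for comment in comments]

print(counts["c"], sorted(rare))
print("".join(cleaned[0]))
```

Dropping rare characters keeps the one-hot vectors short, at the price of losing whatever signal those characters carried.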

## CharRNN - training

In [0]:
import torch.optim as optim

num_features = len(onehot_dict)
hidden_size = 100
num_classes = 2

model = CharRNN(num_features, hidden_size, num_classes)

loss = torch.nn.CrossEntropyLoss()
learning_rate = 1e-5

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

num_epochs = 100

for i in range(num_epochs):
  # train
  for j, (x, y_true) in enumerate(zip(X_train, y_train)):
    optimizer.zero_grad()
    y_pred = model(x)
    loss_value = loss(y_pred.view(1, -1), torch.LongTensor([y_true]))
    loss_value.backward()
    optimizer.step()
    if j % 100 == 0:
      print('Epoch {}; batch {}: loss {}'.format(i, j, loss_value.detach().numpy()))
  # test
  loss_value_test = 0
  correct = 0
  with torch.no_grad():  # no gradients are needed for evaluation
    for j, (x, y_true) in enumerate(zip(X_test, y_test)):
      y_true = torch.LongTensor([y_true])
      y_pred = model(x).view(1, -1)
      loss_value_test += loss(y_pred, y_true)
      y_pred = y_pred.max(1, keepdim=True)[1]
      correct += y_pred.eq(y_true).sum().numpy()
  loss_value_test /= (j+1)
  accuracy = correct / (j+1)
  print('Epoch {}: loss {} accuracy {}'.format(i, loss_value_test, accuracy))
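
The calls to `view(1, -1)` and `torch.LongTensor([y_true])` are there because `CrossEntropyLoss` expects a batch of raw scores with shape `(N, C)` and a batch of class indices with shape `(N,)`, while our model returns a single vector of `C` scores. The sketch below uses made-up scores to show the shapes involved and checks the result against the definition (negative log-probability of the true class).

```python
import torch
import torch.nn.functional as F

criterion = torch.nn.CrossEntropyLoss()

logits = torch.tensor([2.0, 0.5])  # raw scores for one sample, shape (2,)
target = torch.tensor([0])         # class index, shape (1,)

# Reshape the scores to (1, 2): a batch containing a single sample.
loss_value = criterion(logits.view(1, -1), target)

# Same quantity by hand: negative log-probability of the true class.
manual = -F.log_softmax(logits, dim=0)[0]
print(torch.allclose(loss_value, manual))
```

Since the loss applies the softmax internally, the model should output raw scores rather than probabilities.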

## Word-level RNN: Getting embeddings with Polyglot

In [0]:
!pip install polyglot pyicu pycld2 morfessor
!polyglot download sgns2.it

Test: additive properties of word embeddings (the *king - man + woman* example). The expected result would be that $$roma - italia + germania \sim berlino$$

However, you can see that an unrelated word such as Washington is actually closer...

Also observe that the original example ($king - man + woman \sim queen$) roughly holds for the Italian embeddings, too!

In [14]:
from polyglot.mapping import Embedding
from polyglot.text import Word

w1 = Word("re", language="it")
w2 = Word("uomo", language="it")
w3 = Word("donna", language="it")


w4 = Word("gengive", language="it")
w5 = Word("parroco", language="it")
w6 = Word("principe", language="it")
w7 = Word("regina", language="it")



vect = w1.vector - w2.vector + w3.vector

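# Note: sum(a - b)**2 squares the *sum* of the differences, not each difference,
# so positive and negative components can cancel out.
print(sum(w4.vector - vect)**2 / len(vect))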
print(sum(w5.vector - vect)**2 / len(vect))
print(sum(w6.vector - vect)**2 / len(vect))
print(sum(w7.vector - vect)**2 / len(vect))



0.007138776223034396
0.2632947066149454
0.3559078571242888
0.024611842383821074
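
A word of caution when reading these numbers: the expression `sum(a - b)**2 / len(a)` squares the sum of the component-wise differences, so differences of opposite sign cancel and two very different vectors can look close. If a mean squared error was the intent, each difference has to be squared before summing. The sketch below contrasts the two on made-up vectors.

```python
import numpy as np

a = np.array([1.0, -1.0, 2.0, -2.0])
b = np.zeros(4)

squared_sum = sum(a - b) ** 2 / len(a)  # (1 - 1 + 2 - 2)^2 / 4
mean_squared = np.mean((a - b) ** 2)    # (1 + 1 + 4 + 4) / 4

print(squared_sum, mean_squared)  # 0.0 2.5
```

Cosine similarity is another common choice for comparing word embeddings, since it ignores vector length.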


## Word-level RNN: training

In [0]:
from polyglot.text import Word
import nltk
import numpy as np


def sentence2matrix(tokens):
  x = []
  for token in tokens:
    try: 
      w = Word(token, language="it").vector
      x.append(w)
    except KeyError:
      pass
  return np.array(x)

def transform_embeddings(X):
  X_new = []
  for x in X:
    tokens = nltk.word_tokenize(x)
    x_new = sentence2matrix(tokens)
    X_new.append(x_new)
  # sentences have different lengths, so store the matrices in an object array
  return np.array(X_new, dtype=object)
    
X_w2v = transform_embeddings(X)

print(X.shape)
print(X_w2v.shape)
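
To see what this transformation produces, here is a sketch of the same idea with a tiny made-up embedding table in place of Polyglot (the dictionary `toy_embeddings` and the function `toy_sentence2matrix` are hypothetical names). Each sentence becomes a matrix with one row per known token, and unknown tokens are silently skipped.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "il": np.array([0.1, 0.0, 0.2]),
    "gatto": np.array([0.5, 0.3, 0.1]),
    "dorme": np.array([0.0, 0.4, 0.4]),
}

def toy_sentence2matrix(tokens):
    # Keep only the tokens that have an embedding.
    rows = [toy_embeddings[t] for t in tokens if t in toy_embeddings]
    return np.array(rows)

m1 = toy_sentence2matrix(["il", "gatto", "dorme"])
m2 = toy_sentence2matrix(["il", "xyzzy", "dorme", "gatto", "il"])
print(m1.shape, m2.shape)  # (3, 3) (4, 3)
```

The number of rows changes from sentence to sentence while the number of columns stays fixed, which is exactly the kind of input the word-level RNN loops over.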


### Word-level RNN: Training loop 

In [0]:
from sklearn.model_selection import train_test_split
import torch.optim as optim

X_train, X_test, y_train, y_test = train_test_split(X_w2v, y, test_size=0.3)

num_features = X_train[0].shape[1]
num_hidden = 100
num_classes = 2

model = MyRNN(num_features, num_hidden, num_classes)

loss = torch.nn.CrossEntropyLoss()
learning_rate = 1e-4

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

num_epochs = 100

for i in range(num_epochs):
  # train 
  model.train()
  for j, (x, y_true) in enumerate(zip(X_train, y_train)):
    optimizer.zero_grad()
    y_pred = model(x)
    y_pred = y_pred.view(1, -1)
    loss_value = loss(y_pred, torch.LongTensor([y_true]))
    loss_value.backward()
    optimizer.step()
    if j % 500 == 0:
      print('Epoch {}; batch {}: loss {}'.format(i, j, loss_value.detach().numpy()))
  # test
  loss_value_test = 0
  correct = 0
  model.eval()
  with torch.no_grad():  # no gradients are needed for evaluation
    for j, (x, y_true) in enumerate(zip(X_test, y_test)):
      y_true = torch.LongTensor([y_true])
      y_pred = model(x)
      y_pred = y_pred.view(1, -1)
      loss_value_test += loss(y_pred, y_true)
      y_pred = y_pred.max(1, keepdim=True)[1]
      correct += y_pred.eq(y_true).sum().numpy()
  loss_value_test /= (j+1)
  accuracy = correct / (j+1)
  print('Epoch {}: loss {} accuracy {}'.format(i, loss_value_test, accuracy))