# Assignment 9

Use data from `https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip`  
Implement the model in PyTorch from ["An Unsupervised Neural Attention Model for Aspect Extraction, He et al, 2017"](https://www.comp.nus.edu.sg/~leews/publications/acl17.pdf), also described in the seminar notes.  


You can use sentence embeddings with attention **[7 points]**:  
$z_s = \sum_{i=1}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding  
$\alpha_i = softmax(d_i)$  attention weight for i-th token  
$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{d \times d}$  
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
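
The attention formulas above can be sketched as a small PyTorch module. This is a minimal illustration, not a reference implementation: the class name `AttentionPooling`, the identity initialization of $M$, and the `mask` argument for padded batches are assumptions, and every sentence is assumed to contain at least one real token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPooling(nn.Module):
    """Attention-weighted sentence embedding z_s (hypothetical helper)."""

    def __init__(self, emb_dim):
        super().__init__()
        # Trainable matrix M of shape (d, d); identity init is an assumption
        self.M = nn.Parameter(torch.eye(emb_dim))

    def forward(self, emb, mask):
        # emb:  (batch, n, d) token embeddings e_{w_i}
        # mask: (batch, n), 1 for real tokens and 0 for padding
        mask = mask.float()
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
        # y_s: average of the real token embeddings, shape (batch, d)
        y_s = (emb * mask.unsqueeze(-1)).sum(dim=1) / lengths
        # d_i = e_{w_i}^T M y_s, shape (batch, n)
        d = torch.einsum('bnd,de,be->bn', emb, self.M, y_s)
        # Padding positions get zero attention weight after the softmax
        d = d.masked_fill(mask == 0, float('-inf'))
        alpha = F.softmax(d, dim=1)
        # z_s = sum_i alpha_i e_{w_i}, shape (batch, d)
        z_s = (alpha.unsqueeze(-1) * emb).sum(dim=1)
        return z_s, alpha


pool = AttentionPooling(emb_dim=3)
emb = torch.randn(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
z_s, alpha = pool(emb, mask)
```

Returning `alpha` alongside `z_s` makes it easy to inspect which tokens the model attends to.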

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{K \times d}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings  
$T \in R^{K \times d}$ trainable matrix of topic embeddings, $K$ = number of topics
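
A minimal sketch of the reconstruction step, assuming the sentence embedding $z_s$ is already computed. The class name `TopicReconstruction` and the small random initialization of $T$ are assumptions made for illustration. `nn.Linear(d, K)` stores its weight with shape `(K, d)`, so it maps $z_s \in R^d$ to $K$ topic scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopicReconstruction(nn.Module):
    """Maps z_s to topic weights p_t and a reconstruction r_s (hypothetical helper)."""

    def __init__(self, emb_dim, num_topics):
        super().__init__()
        self.W = nn.Linear(emb_dim, num_topics)  # computes W z_s + b
        # Topic embedding matrix T of shape (K, d); init scale is an assumption
        self.T = nn.Parameter(torch.randn(num_topics, emb_dim) * 0.1)

    def forward(self, z_s):
        p_t = F.softmax(self.W(z_s), dim=-1)  # (batch, K), rows sum to 1
        r_s = p_t @ self.T                    # (batch, d), each row is T^T p_t
        return r_s, p_t


torch.manual_seed(0)
layer = TopicReconstruction(emb_dim=4, num_topics=3)
r_s, p_t = layer(torch.randn(5, 4))
```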


**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T^T T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{j=1}^n e_{w_j}$ average of word embeddings in the i-th negative sentence  
$||T^T T - I ||_F$ regularizer that pushes the matrix $T$ toward orthogonality  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{N \times M}$ squared Frobenius norm
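
The two terms of the objective can be sketched as plain functions. The function names and tensor layouts here are assumptions. One point to note: with $T$ stored as a $K \times d$ matrix, this sketch builds the $K \times K$ Gram matrix $T T^T$, because a $d \times d$ product has rank at most $K$ and cannot equal the identity when $K < d$. Adjust the product if your $T$ is stored transposed.

```python
import torch


def max_margin_loss(r_s, z_s, negs):
    # r_s, z_s: (batch, d); negs: (batch, m, d) averaged negative sentences
    pos = (r_s * z_s).sum(dim=-1, keepdim=True)   # r_s^T z_s, shape (batch, 1)
    neg = torch.einsum('bd,bmd->bm', r_s, negs)   # r_s^T n_i, shape (batch, m)
    # Hinge term max(0, 1 - r_s^T z_s + r_s^T n_i), summed over sentences and negatives
    return torch.clamp(1.0 - pos + neg, min=0.0).sum()


def orthogonality_penalty(T):
    # T: (K, d) topic embeddings; squared Frobenius norm of (T T^T - I)
    gram = T @ T.t()
    eye = torch.eye(T.shape[0], device=T.device, dtype=T.dtype)
    return ((gram - eye) ** 2).sum()


r = torch.tensor([[1.0, 0.0]])
z = torch.tensor([[1.0, 0.0]])
n = torch.tensor([[[0.0, 1.0], [1.0, 0.0]]])
loss = max_margin_loss(r, z, n)                   # hinge terms are 0 and 1
penalty = orthogonality_penalty(torch.eye(2, 3))  # orthonormal rows give zero penalty
```

The full objective would then be `max_margin_loss(...) + lmbd * orthogonality_penalty(T)` for some weight `lmbd`.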


**[3 points]** Compute topic coherence for at least 3 different numbers of topics, training one model per number of topics. Use the 10 nearest words for each topic. You can use the code from the seminar notes with word2vec similarity scores.
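
A rough sketch of the evaluation step, under stated assumptions: nearest words are chosen by cosine similarity between topic embeddings and the model's word embeddings, and coherence is taken as the mean pairwise cosine similarity of those words under external vectors (for example word2vec). The seminar notes may define coherence differently; the function names and the `vectors` dictionary are hypothetical.

```python
import torch
import torch.nn.functional as F


def nearest_words(topic_emb, word_emb, itos, k=10):
    # topic_emb: (K, d), word_emb: (V, d), itos: list of V token strings
    sims = F.normalize(topic_emb, dim=1) @ F.normalize(word_emb, dim=1).t()  # (K, V)
    top = sims.topk(k, dim=1).indices
    return [[itos[j] for j in row.tolist()] for row in top]


def topic_coherence(words, vectors):
    # vectors: dict mapping a token to a 1-D tensor (e.g. loaded from word2vec)
    vecs = [vectors[w] for w in words if w in vectors]
    if len(vecs) < 2:
        return 0.0
    X = F.normalize(torch.stack(vecs), dim=1)
    sims = X @ X.t()
    n = X.shape[0]
    # Average over all ordered pairs of distinct words
    off_diag = sims.sum() - sims.diagonal().sum()
    return (off_diag / (n * (n - 1))).item()


itos = ['a', 'b', 'c']
top_words = nearest_words(torch.tensor([[0.9, 0.1, 0.0]]), torch.eye(3), itos, k=2)
same = topic_coherence(['a', 'b'], {'a': torch.tensor([1.0, 0.0]),
                                    'b': torch.tensor([1.0, 0.0])})
orth = topic_coherence(['a', 'b'], {'a': torch.tensor([1.0, 0.0]),
                                    'b': torch.tensor([0.0, 1.0])})
```

Averaging `topic_coherence` over all topics of a trained model gives one score per choice of $K$.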

In [0]:
import nltk

import pandas as pd
import numpy as np

import torch as tt
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import DataLoader, TensorDataset
from torchtext.data import Field, TabularDataset, Iterator

from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

In [0]:
batch_size = 256
random_state = 42
num_neg_samples = 5

In [34]:
!wget -O data.zip https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true 

--2020-03-22 16:51:38--  https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip [following]
--2020-03-22 16:51:38--  https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip [following]
--2020-03-22 16:51:38--  https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.1

In [35]:
!unzip data.zip

Archive:  data.zip
replace data.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: data.txt                
replace stopwords.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: stopwords.txt           


In [36]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [37]:
with open('data.txt', 'r') as f:
  text = f.read()
print(text[: 1000])

Barclays' defiance of US fines has merit Barclays disgraced itself in many ways during the pre-financial crisis boom years. So it is tempting to think the bank, when asked by US Department of Justice to pay a large bill for polluting the financial system with mortgage junk between 2005 and 2007, should cough up, apologise and learn some humility. That is not the view of the chief executive, Jes Staley. Barclays thinks the DoJ’s claims are “disconnected from the facts” and that it has “an obligation to our shareholders, customers, clients and employees to defend ourselves against unreasonable allegations and demands.” The stance is possibly foolhardy, since going into open legal battle with the most powerful US prosecutor is risky, especially if you end up losing. But actually, some grudging respect for Staley and Barclays is in order. The US system for dishing out fines to errant banks for their mortgage sins has come to resemble a casino. The approach prefers settlements behind closed

In [0]:
stop_words = []
with open( "stopwords.txt", "r" ) as f:
    for line in f.readlines():
        stop_words.append( line.strip().lower())

### Add negative samples


In [0]:
df = pd.DataFrame()
df['data'] = nltk.tokenize.sent_tokenize(text)

In [40]:
df[:13]

| | data |
|---|---|
| 0 | Barclays' defiance of US fines has merit Barcl... |
| 1 | So it is tempting to think the bank, when aske... |
| 2 | That is not the view of the chief executive, J... |
| 3 | Barclays thinks the DoJ’s claims are “disconne... |
| 4 | But actually, some grudging respect for Staley... |
| 5 | The US system for dishing out fines to errant ... |
| 6 | The approach prefers settlements behind closed... |
| 7 | Occasional leaks of the negotiating demands ma... |
| 8 | Deutsche Bank was initially asked for $14bn (£... |
| 9 | Where is the rhyme or reason? |


I think it is better to simply ignore quotation marks: they contain periods, so the sentence tokenizer splits inside quotes.

In [0]:
def add_negative(df):
  neg_id = np.random.choice(len(df))
  return df.iloc[neg_id, 0]

In [0]:
for i in range(num_neg_samples):
  df[f'neg{i}'] = df['data'].apply(lambda x: add_negative(df))
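
The row-by-row sampling above can also be done in one vectorized draw. The sketch below is an alternative under assumptions: the helper name `sample_negative_indices` is hypothetical, it fixes a seed for reproducibility, and it guarantees that a sentence is never drawn as its own negative (which the loop above does not). It requires at least two sentences.

```python
import numpy as np


def sample_negative_indices(n_sentences, num_neg, seed=42):
    """Return an (n_sentences, num_neg) array of negative row indices."""
    rng = np.random.default_rng(seed)
    # Offsets lie in [1, n_sentences - 1]; adding them to the row index
    # modulo n_sentences never maps a row onto itself.
    offsets = rng.integers(1, n_sentences, size=(n_sentences, num_neg))
    return (np.arange(n_sentences)[:, None] + offsets) % n_sentences


idx = sample_negative_indices(10, 5)
# With a DataFrame, the columns could then be filled as:
#   sentences = df['data'].to_numpy()
#   df[f'neg{i}'] = sentences[idx[:, i]]
```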

In [43]:
df

| | data | neg0 | neg1 | neg2 | neg3 | neg4 |
|---|---|---|---|---|---|---|
| 0 | Barclays' defiance of US fines has merit Barcl... | Perhaps English football’s ability to particip... | She had been so mute on Europe that it was an ... | There are the nurses who offer you their secre... | Hundreds of thousands of users say Twitter is ... | Anushka Asthana is joined by Gary Younge, Rand... |
| 1 | So it is tempting to think the bank, when aske... | With delicious timing, ARM’s half-year results... | MPs and peers are demanding a full inquiry by ... | George, who signed off: “I rely entirely on yo... | It’s all going off! | He anticipated the international community, mi... |
| 2 | That is not the view of the chief executive, J... | So why didn’t this apply to the banking sector? | “They were part-time jobs,” he said. | “Maybe we are not experienced enough in a situ... | They were still there!” The bottom line, she t... | Well, that was a rather unexpected ending to t... |
| 3 | Barclays thinks the DoJ’s claims are “disconne... | The same study detailed how people with panic ... | I can take our party to victory. | The government has taken a number of steps and... | The latest edition is the start of a miniserie... | This time Burnley win the ball. |
| 4 | But actually, some grudging respect for Staley... | But would parents know where to look and if th... | “Trump is not a career politician,” he said. | The amount of $14 billion which was initially ... | Juxtaposed with its backdrop of a cheerful lat... | I’m very short-sighted and was nervous. |
| ... | ... | ... | ... | ... | ... | ... |
| 183395 | It feels as though Stone realised that some of... | There is a church in every village and distric... | It said Bourdieu and leading lady Géraldine Pa... | What Griffiths has said could be seen as pre-e... | Mauricio Pochettino’s side are hardly in good ... | A few results are beginning to trickle in from... |
| 183396 | There are some fun elements, many involving Rh... | When I first moved here six years ago I found ... | I keep trying to figure out if he knows, or ca... | George Lucas could invent an entire religion t... | But this is such a bracing and optimistic and ... | “We’re disappointed that ad blocking companies... |
| 183397 | I particularly enjoyed a scene in which O’Bria... | The comparable figure 12 months ago was 4.2% a... | McCarthy was crushed by the ultra-conservative... | Translated as “black market’ in Japanese, the ... | Iowa has sent notice that the Republican nomin... | It’s maddening. |
| 183398 | His carnivorous snarl fills the immense screen... | The new ad by Future 45 shows Clinton talking ... | They went through the middle and so I asked th... | The Greens and Lib Dems seem ready to play. | The content, its lies and contradictions, were... | So it is misleading to describe Zika as “yet a... |


In [0]:
TEXT = Field(include_lengths=False, 
             batch_first=True, 
             tokenize=nltk.tokenize.word_tokenize,
             lower=True,
             stop_words=stop_words)

datafields = [('data',TEXT), *[(f'neg{i}', TEXT) for i in range(num_neg_samples)]]

In [0]:
df.to_csv('text.csv', index=False)

In [0]:
trn = TabularDataset(path="text.csv",
                     format='csv',
                     skip_header=True,
                     fields=datafields)

In [0]:
TEXT.build_vocab(trn)

In [0]:
vocab_size = len(TEXT.vocab.itos) 
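
For readers without the legacy `torchtext` API, the vocabulary and padding steps can be approximated in plain Python. This is a simplified sketch, not torchtext's implementation: the helper names are hypothetical, and it only mirrors the convention seen in this notebook that `<unk>` gets index 0 and `<pad>` gets index 1, followed by tokens in descending frequency.

```python
from collections import Counter


def build_vocab(tokenized, specials=('<unk>', '<pad>')):
    # tokenized: list of sentences, each a list of (lowercased, filtered) tokens
    counts = Counter(tok for sent in tokenized for tok in sent)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokenized, stoi, pad='<pad>', unk='<unk>'):
    # Map tokens to indices and right-pad every sentence to the longest one
    max_len = max(len(sent) for sent in tokenized)
    return [[stoi.get(tok, stoi[unk]) for tok in sent]
            + [stoi[pad]] * (max_len - len(sent))
            for sent in tokenized]


sents = [['a', 'b', 'a'], ['b']]
itos, stoi = build_vocab(sents)
batch_ids = numericalize(sents, stoi)
```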

## Model



In [0]:
trn_itr = Iterator(trn, batch_size, shuffle=True)

In [50]:
pad_id = TEXT.vocab.stoi['<pad>']
pad_id

1

In [0]:
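class MyModel(nn.Module):

    def __init__(self, vocab_size, emb_dim=300, topic_dim=5,
                 num_neg=num_neg_samples, pad_idx=1):
        super().__init__()
        self.pad_idx = pad_idx  # index of '<pad>' in TEXT.vocab (pad_id above)
        self.num_neg = num_neg
        # padding_idx keeps the <pad> embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.pt = nn.Linear(emb_dim, topic_dim)
        # rs.weight plays the role of T^T, shape (emb_dim, topic_dim)
        self.rs = nn.Linear(topic_dim, emb_dim, bias=False)

    def sentence_mean(self, tokens):
        # Average token embeddings over real tokens only, so padding is ignored
        mask = (tokens != self.pad_idx).unsqueeze(-1).float()
        summed = (self.embedding(tokens) * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1.0)

    def forward(self, batch):
        emb_x = self.sentence_mean(batch.data)   # z_s
        p_t = F.softmax(self.pt(emb_x), dim=-1)  # topic weights p_t
        x = self.rs(p_t)                         # reconstruction r_s

        # One averaged embedding per negative column neg0 .. neg{num_neg - 1}
        negs = [self.sentence_mean(getattr(batch, f'neg{i}'))
                for i in range(self.num_neg)]
        negs = tt.stack(negs, dim=-1)            # (batch, emb_dim, num_neg)

        return x, emb_x, negs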

In [52]:
example_batch = next(iter(trn_itr))
example_batch


[torchtext.data.batch.Batch of size 256]
	[.data]:[torch.LongTensor of size 256x68]
	[.neg0]:[torch.LongTensor of size 256x87]
	[.neg1]:[torch.LongTensor of size 256x48]
	[.neg2]:[torch.LongTensor of size 256x58]
	[.neg3]:[torch.LongTensor of size 256x79]
	[.neg4]:[torch.LongTensor of size 256x112]

In [53]:
model = MyModel(vocab_size)
x, emb_x, negs = model(example_batch)
x.shape


torch.Size([256, 300])

In [54]:
emb_x.shape

torch.Size([256, 300])

In [55]:
negs.shape

torch.Size([256, 300, 5])

In [56]:
list(model.parameters())[0].shape

torch.Size([95799, 300])

In [57]:
model.embedding.weight.shape

torch.Size([95799, 300])

## Loss


In [0]:
class MyLoss(nn.Module):
    def __init__(self, lmbd=0.01):
        super().__init__()
        self.lmbd = lmbd

    def forward(self, vecs_true, negs, vecs_rec, T):
        # vecs_true, vecs_rec: (batch, emb_dim); negs: (batch, emb_dim, num_neg)
        # r_s^T z_s as a dot product per sentence: (batch,)
        pos = (vecs_rec * vecs_true).sum(dim=-1)
        # r_s^T n_i for every negative sample: (batch, num_neg)
        neg = (vecs_rec.unsqueeze(-1) * negs).sum(dim=1)
        # hinge term max(0, 1 - r_s^T z_s + r_s^T n_i)
        losses = tt.clamp(1 - pos.unsqueeze(-1) + neg, min=0)
        # T is expected with shape (emb_dim, num_topics), e.g. model.rs.weight
        reg_0 = tt.mm(T.t(), T)
        eye = tt.eye(reg_0.shape[0], device=reg_0.device)
        reg = tt.norm(reg_0 - eye, p='fro') ** 2  # squared Frobenius norm
        return tt.sum(losses) + self.lmbd * reg

In [59]:
criterion = MyLoss()

# pass the topic matrix (model.rs.weight), not the word embedding matrix
criterion(emb_x, negs, x, model.rs.weight)

## Train

In [0]:
num_epochs = 2
optimizer = tt.optim.Adam(model.parameters())

In [0]:
def train_epoch(data_iter, len_iter, n_epoch, model, criterion, optimizer=None):
    train_losses = []
    total_loss = 0
    data_iter = tqdm_notebook(data_iter, total=len_iter, desc=f"Epoch {n_epoch + 1}", leave=True)
    counter = 0
    for batch in data_iter:
        if optimizer:
            optimizer.zero_grad()
        # the model returns the reconstruction, the sentence embedding and the negatives
        vec_rec, vecs_true, negs = model(batch)
        # model.rs.weight holds the topic embeddings used by the regularizer
        loss = criterion(vecs_true, negs, vec_rec, model.rs.weight)
        if optimizer:
            loss.backward()
            optimizer.step()
        loss_value = loss.detach().item()
        total_loss += loss_value
        train_losses.append(loss_value)
        data_iter.set_postfix(loss=loss_value)
        counter += 1

    total_loss /= counter
    return total_loss, train_losses

In [62]:
total_train_losses = []
total_valid_losses = []
for epoch in range(num_epochs):
    model.train()
    loss, train_losses = train_epoch(trn_itr, len(trn_itr), epoch, model, criterion, optimizer)
    total_train_losses += train_losses
    print('train', loss)