<a href="https://colab.research.google.com/github/AnnaZhuravleva/compling/blob/master/assignment_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 9

Use data from `https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip`  
Implement model in pytorch from "An Unsupervised Neural Attention Model for Aspect Extraction, He et al, 2017", also desribed in seminar notes.  

You can use sentence embeddings with attention **[7 points]**:  
$z_s = \sum_{i}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding  
$\alpha_i = softmax(d_i)$  attention weight for i-th token  
$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{dxd}$  
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{dxK}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{Kxd}$ trainable matrix of topic embeddings, K=number of topics


**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T^T T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{i=j}^n e_{w_j}$ average of word embeddings in the i-th sentence  
$||T^T T - I ||_F$ regularizer, that enforces matrix $T$ to be orthogonal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{NxM}$ Frobenius norm


**[3 points]** Compute topic coherence for at least for 3 different number of topics. Use 10 nearest words for each topic. It means you have to train one model for each number of topics. You can use code from seminar notes with word2vec similarity scores.

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.externals import joblib
from sklearn import metrics
from tqdm import tqdm, tqdm_notebook
import nltk 
from nltk import PunktSentenceTokenizer
import gensim.downloader as api
nltk.download('punkt')
import re
import nltk
import gensim
import spacy
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torchtext.data import Field, LabelField, BucketIterator, TabularDataset, Iterator
from keras.utils.np_utils import to_categorical
from tqdm import tqdm_notebook as tqdm
import numpy as np
import pandas as pd
from sklearn.externals import joblib
from sklearn import metrics




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Using TensorFlow backend.


## Data preparation

In [2]:
!wget -O data.zip https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
!unzip data.zip

--2020-03-22 16:22:31--  https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
Resolving github.com (github.com)... 140.82.118.3
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip [following]
--2020-03-22 16:22:31--  https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip [following]
--2020-03-22 16:22:32--  https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.1

In [0]:
with open('data.txt', 'r', encoding='utf-8') as txt_file:
    text = txt_file.read()

In [0]:
def tokenize(text):
  return nltk.word_tokenize(text) 

with open("stopwords.txt", "r" ) as f:
    stopwords=[line.strip().lower() for line in f.readlines()]


In [0]:
TEXT = Field(include_lengths=False, 
             batch_first=True, 
             tokenize=tokenize,
             lower=True,
             stop_words=stopwords)

datafields = [('sent',TEXT)]

#dataset = pd.DataFrame()
#dataset['sent'] = nltk.sent_tokenize(text)
#dataset.to_csv('dataset.csv', index=False)
#all_data = TabularDataset(path="dataset.csv", format='csv',
#                     skip_header=True, fields=datafields)


train_dataset = pd.DataFrame()
train_dataset['sent'] = nltk.sent_tokenize(text)[:3000] 
# I used reduced dataset because training on the whole data promised to last up to 27 hours!!!
# But even on the reduced dataset training through 2 epochs takes 25 minutes to complete...
train_dataset.to_csv('dataset.csv', index=False)
train = TabularDataset(path="dataset.csv", format='csv',
                     skip_header=True, fields=datafields)


TEXT.build_vocab(train)
train_iterator = Iterator(train, 512, shuffle=True)



In [6]:
next(iter(train_iterator)).sent[0]

tensor([   4,    5,   20,  131, 6813,    3,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1])

In [7]:
vocab_size = len(TEXT.vocab)
vocab_size

10537

## Model

In [0]:
class MyModel(nn.Module):
    
    def __init__(self, vocab_size, i_dim=300, t_dim=5):
        super(MyModel, self).__init__()  
        self.get_emb = nn.EmbeddingBag(vocab_size, i_dim)
        self.pt = nn.Linear(i_dim, t_dim)
        self.softmax = F.softmax
        self.rs = nn.Linear(t_dim, i_dim, bias=False)
    
    def step(self, x):
        x = self.pt(torch.tensor(x))
        x = self.softmax(x)
        x = self.rs(x)
        return x
     
    def forward(self, batch):
        vecs = self.get_emb(batch.sent)
        zs = torch.tensor(vecs).unsqueeze(1)
        rs = self.step(torch.tensor(vecs))
        rsT = rs.unsqueeze(1).permute(0, 2, 1)
        rsTzs = torch.bmm(rsT, zs)
        return rsT, rsTzs

    def negs(self, batch):
        total = len(batch)
        for idx in range(total):
            to_random = list(range(0, idx)) + list(range(idx+1, total))
            neg_ids = np.random.choice(to_random, size=5,replace=False)

            negs = [self.get_emb(batch.sent)[i] for i in neg_ids]
            negs = torch.stack(negs, dim=-1)
            yield negs


In [0]:
class MyLoss(nn.Module):
  
    def __init__(self):
        super().__init__()

    def regularization(self, param, lambda_=1):
        inner = torch.mm(param.permute(1, 0), param) 
        reg = inner - torch.eye(inner.shape[0])
        return lambda_ * torch.norm(input=reg, p='fro')

    def forward(self, rsT, rsTzs, negs, param):
        negs = torch.stack(list(negs))
        losses = []
        for ni in negs.permute(2, 0, 1):
            ni = torch.bmm(rsT, ni.unsqueeze(1))
            tmp = (1 - rsTzs + ni).squeeze(1)
            zeros = torch.zeros_like(tmp)
            values, _ = torch.max(torch.stack([tmp, zeros]), 0)
            losses.append(values)
        losses = torch.stack(losses, dim=-1)
        return torch.sum(losses) + self.regularization(param)

## Train

In [0]:
def train_epoch(data_iter, n_epoch, model, criterion, optimizer=None):
    loss_history = []
    total_loss = 0
    counter = 0
    data_iter = tqdm_notebook(data_iter, total=len(data_iter), 
                              desc=f"Epoch {n_epoch + 1}", leave=True)
    for batch in data_iter:
        if optimizer:
          optimizer.zero_grad()
        rsT, rsTzs = model(batch)
        negs = model.negs(batch)
        param = list(model.parameters())[1]
        loss = criterion(rsT, rsTzs, negs, param)
        loss.backward()
        if optimizer:
          optimizer.step()
        curr_loss = loss.detach().item()
        total_loss += curr_loss
        loss_history.append(curr_loss)
        data_iter.set_postfix(loss = curr_loss)
        counter += 1
        
    total_loss /= counter
    return total_loss, loss_history


## Topic coherence

In [0]:
import gensim
from itertools import combinations

def my_get_descriptor(param, word_vecs, top=10):
    sims = []
    for topic_vec in param:
        similar = F.cosine_similarity(topic_vec.unsqueeze(0), word_vecs, dim=1)
        top_words = np.argsort(similar.detach().numpy())[::-1][:top]
        top_words = torch.tensor([int(word) for word in top_words])
        sims.append(top_words)
    return torch.stack(sims)

def calculate_coherence( w2v_model, term_rankings ):
    overall_coherence = 0.0
    for topic_index in range(len(term_rankings)):
        # check each pair of terms
        pair_scores = []
        for pair in combinations( term_rankings[topic_index], 2 ):
            if pair[0] in w2v_model.vocab and pair[1] in w2v_model.vocab:
              pair_scores.append( w2v_model.similarity(pair[0], pair[1]) )
        # get the mean for all pairs in this topic
        topic_score = sum(pair_scores) / len(pair_scores)
        overall_coherence += topic_score
    # get the mean score across all topics
    return overall_coherence / len(term_rankings)

def get_topic_words(topic_words):
  for topic in topic_words:
      yield [TEXT.vocab.itos[i] for i in topic]

In [12]:
TEXT.vocab.itos[0]

'<unk>'

In [13]:
a = torch.tensor([[1.0,2.0,3.0,4.0]])

b = torch.tensor([[1.0,2.0,3.0,4.0], [-1.0,2.0,-3.0,4.0]])
F.cosine_similarity(a, b, dim=1), a.shape

(tensor([1.0000, 0.3333]), torch.Size([1, 4]))

In [14]:
import gensim.downloader as api

w2v_model = api.load("word2vec-google-news-300")



  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [15]:
model = MyModel(vocab_size=vocab_size, t_dim=12)
criterion = MyLoss()
num_epochs = 2
optimizer = torch.optim.Adam(model.parameters())

total_train_losses = []
total_valid_losses = []
for epoch in range(num_epochs):
    model.train()
    loss, train_losses = train_epoch(train_iterator, epoch, model, criterion, optimizer)
    total_train_losses += train_losses
    print('train', loss)

topic_words = my_get_descriptor(list(model.parameters())[1], list(model.parameters())[0])
coherence = calculate_coherence(w2v_model, list(get_topic_words(topic_words)))

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  


HBox(children=(IntProgress(value=0, description='Epoch 1', max=6, style=ProgressStyle(description_width='initi…

  # This is added back by InteractiveShellApp.init_path()
  if sys.path[0] == '':



train 225000010.66666666


HBox(children=(IntProgress(value=0, description='Epoch 2', max=6, style=ProgressStyle(description_width='initi…


train 225000213.33333334


  if np.issubdtype(vec.dtype, np.int):


In [16]:
print(f'coherence for model with k = 12: {coherence}')

coherence for model with k = 12: 0.06733345620466734


In [17]:
for item in model.parameters():
    print(item.shape)

torch.Size([10537, 300])
torch.Size([12, 300])
torch.Size([12])
torch.Size([300, 12])


In [18]:
model = MyModel(vocab_size=vocab_size)
criterion = MyLoss()
num_epochs = 2
optimizer = torch.optim.Adam(model.parameters())

total_train_losses = []
total_valid_losses = []
for epoch in range(num_epochs):
    model.train()
    loss, train_losses = train_epoch(train_iterator, epoch, model, criterion, optimizer)
    total_train_losses += train_losses
    print('train', loss)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  


HBox(children=(IntProgress(value=0, description='Epoch 1', max=6, style=ProgressStyle(description_width='initi…

  # This is added back by InteractiveShellApp.init_path()
  if sys.path[0] == '':



train 224999464.0


HBox(children=(IntProgress(value=0, description='Epoch 2', max=6, style=ProgressStyle(description_width='initi…


train 224997784.0


In [19]:
topic_words = my_get_descriptor(list(model.parameters())[1], list(model.parameters())[0])
coherence = calculate_coherence(w2v_model, list(get_topic_words(topic_words)))
print("K=%02d: Coherence=%.4f" % (5, coherence) )

K=05: Coherence=0.0540


  if np.issubdtype(vec.dtype, np.int):


In [20]:
model = MyModel(vocab_size=vocab_size, t_dim=7)
criterion = MyLoss()
num_epochs = 2
optimizer = torch.optim.Adam(model.parameters())

total_train_losses = []
total_valid_losses = []
for epoch in range(num_epochs):
    model.train()
    loss, train_losses = train_epoch(train_iterator, epoch, model, criterion, optimizer)
    total_train_losses += train_losses
    print('train', loss)

topic_words = my_get_descriptor(list(model.parameters())[1], list(model.parameters())[0])
coherence = calculate_coherence(w2v_model, list(get_topic_words(topic_words)))

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  


HBox(children=(IntProgress(value=0, description='Epoch 1', max=6, style=ProgressStyle(description_width='initi…

  # This is added back by InteractiveShellApp.init_path()
  if sys.path[0] == '':



train 224999898.66666666


HBox(children=(IntProgress(value=0, description='Epoch 2', max=6, style=ProgressStyle(description_width='initi…


train 224999373.33333334


  if np.issubdtype(vec.dtype, np.int):


In [21]:
print(f'coherence for model with k = 7: {coherence}')

coherence for model with k = 7: 0.06221849816909262
