# Assignment 9

Use data from `https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip`  
Implement model in pytorch from ["An Unsupervised Neural Attention Model for Aspect Extraction, He et al, 2017"](https://www.comp.nus.edu.sg/~leews/publications/acl17.pdf), also desribed in seminar notes.  


You can use sentence embeddings with attention **[7 points]**:  
$z_s = \sum_{i}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding  
$\alpha_i = softmax(d_i)$  attention weight for i-th token  
$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{dxd}$  
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{dxK}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{Kxd}$ trainable matrix of topic embeddings, K=number of topics


**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T^T T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{i=j}^n e_{w_j}$ average of word embeddings in the i-th sentence  
$||T^T T - I ||_F$ regularizer, that enforces matrix $T$ to be orthogonal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{NxM}$ Frobenius norm


**[3 points]** Compute topic coherence for at least for 3 different number of topics. Use 10 nearest words for each topic. It means you have to train one model for each number of topics. You can use code from seminar notes with word2vec similarity scores.

In [16]:
!wget -O data.zip https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true 

--2020-03-22 20:43:11--  https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip [following]
--2020-03-22 20:43:11--  https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip [following]
--2020-03-22 20:43:12--  https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.1

In [0]:
import gensim
import gensim.downloader as api
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchtext.data import Field, TabularDataset, Iterator
from itertools import combinations
from scipy.ndimage.filters import gaussian_filter1d
from tqdm import tqdm, tqdm_notebook
from nltk import tokenize
import nltk
from zipfile import ZipFile
from nltk.tokenize import TreebankWordTokenizer
import pickle
import random
from itertools import combinations
from torch import nn
from torch.nn import functional as F
from torch.optim import Adam
from pathlib import Path

In [18]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2020-03-22 20:43:15--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-03-22 20:43:15--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-03-22 20:43:15--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.3’


2020

In [19]:
!unzip data.zip

Archive:  data.zip
replace data.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace stopwords.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [0]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [0]:
def random_seed(value):
    torch.manual_seed(value)
    torch.cuda.manual_seed(value)
    np.random.seed(value)
    random.seed(value)

In [22]:
!nvidia-smi

Sun Mar 22 20:55:48 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8    29W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [23]:
import gensim
from gensim.scripts.glove2word2vec import glove2word2vec

if not Path('emb_word2vec_format.txt').exists():
    glove2word2vec(glove_input_file="glove.6B.300d.txt", word2vec_output_file="emb_word2vec_format.txt")

model = gensim.models.KeyedVectors.load_word2vec_format('emb_word2vec_format.txt')
weights = torch.FloatTensor(model.vectors)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [0]:
word2idx = {k:i for i, k in enumerate(model.vocab.keys())}
weight = np.array([model[k] for _, k in enumerate(model.vocab.keys())])

**Подготовим данные**

In [0]:
with open('data.txt', 'r') as f:
  text = f.read()

In [0]:
stopwords = []
with open( "stopwords.txt", "r" ) as f:
    for line in f.readlines():
        stopwords.append( line.strip().lower())

In [27]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
raw_documents = []
snippets = []
with open( "data.txt", "r") as f:
    for line in f.readlines():
        text = line.strip()
        raw_documents.append( text.lower() )
        
        snippets.append( text[0:min(len(text),100)] )

## Train

In [0]:
class DataSampler():
    def __init__(self, dataset, negative_size=10, batch_size=5):
        self.dataset = random.sample(dataset, len(dataset))
        self.negative_size = negative_size
        self.batch_size = batch_size
        self.cur_idx = 0
        self.seed = 0 

    def __next__(self):
        batch = {'positive': [], 'negative': []}
        for j in range(self.batch_size):
            positive = self.dataset[self.cur_idx]
            rest = self.dataset[:self.batch_size*self.cur_idx+j] + self.dataset[self.batch_size*self.cur_idx+j+1:]
            random.seed(self.seed + j)
            negative = torch.stack(random.sample(rest, self.negative_size), 0)
            batch['positive'].append(positive)
            batch['negative'].append(negative)
        batch['positive'] = torch.stack(batch['positive'], 0)
        batch['negative'] = torch.stack(batch['negative'], 0)
          
        return batch

    def __iter__(self):
        self.cur_idx = 0
        for i in range(len(self.dataset) // self.batch_size):
            self.cur_idx += 1
            self.seed = i * self.batch_size
            yield self.__next__()

    def __len__(self):
        return len(self.dataset) // self.batch_size

In [0]:
padding_idx = word2idx['pad']


In [0]:
tokenizer = TreebankWordTokenizer()
dataset = [tokenizer.tokenize(x) for x in raw_documents]

if not Path('vocab.pickle').exists():
    vocab_freq = {}
    for doc in dataset:
        for word in doc:
            if word in vocab_freq:
                vocab_freq[word] += 1
            else:
                vocab_freq[word] = 1
else:
    with open('vocab.pickle', 'rb') as f:
        vocab_freq = pickle.load(f)

In [0]:
vocab_list = [k for k, v in vocab_freq.items() if v > 50]

In [0]:

encode = lambda x: word2idx[x] if x in model.vocab.keys() else word2idx['unk']
dataset = [[encode(x) for x in y] for y in dataset]
dataset = dataset + [[encode(x)] for x in vocab_list]

In [0]:
max_len = 512
tensor_dataset = []
for doc in dataset:
    tensor_dataset.append(torch.LongTensor(doc[:max_len]+(max_len - len(doc))*[encode('pad')]))

**Модель**

In [0]:
class TopicModel(nn.Module):
    def __init__(self, vocab_size, d, n_topics):
        super().__init__()
        self.vocab_size = vocab_size
        self.d = d
        self.embedding = nn.Embedding(self.vocab_size, d)
        self.M_matrix = nn.Linear(d, d, bias=False)
        self.proj = nn.Linear(d, n_topics)

        self.T_matrix = nn.Parameter(nn.init.xavier_uniform_(torch.empty(n_topics, d)))

    def load_embedding_weight(self, weight, padding_idx=None, freeze=False):
        self.embedding = nn.Embedding(self.vocab_size, self.d).from_pretrained(weight, padding_idx=padding_idx)
        if freeze == True:
            self.embedding.requires_grad = False

    def forward(self, batch):
        pos_emb = self.embedding(batch['positive'])
        neg_context_emb = self.embedding(batch['negative']).mean(2)

        sent_context = pos_emb.mean(1)
        transf_emb = self.M_matrix(pos_emb)
        sim = torch.einsum('ble,be->bl', transf_emb, sent_context)
        alphas = F.softmax(sim, -1)
        attn = torch.einsum('ble,bl->be', pos_emb, alphas)
        p = F.softmax(self.proj(attn), -1)
        r = p @ self.T_matrix
        
        pos = torch.einsum('be,be->b', r, attn)
        neg = torch.einsum('be,bme->bm', r, neg_context_emb)

        return pos, neg 
    
    def get_probs(self, inp):
        pos_emb = self.embedding(inp)

        sent_context = pos_emb.mean(0)
        transf_emb = self.M_matrix(pos_emb)
        sim = torch.einsum('le,e->l', transf_emb, sent_context)
        alphas = F.softmax(sim, -1)
        attn = torch.einsum('le,l->e', pos_emb, alphas)
        p = F.softmax(self.proj(attn))

        return p

In [0]:
n_epoch = 5
batch_size = 10
negative_size = 20
lamda = 1

In [0]:
n_topics_range = [3, 4, 5, 6, 7, 8, 9, 10]
topic_models = []

In [39]:
for n_topics in n_topics_range:
    print(f'{n_topics} topics \n')
    random_seed(999)
    topic_model = TopicModel(len(word2idx), 300, n_topics=n_topics)
    topic_model.load_embedding_weight(torch.FloatTensor(weight), padding_idx=padding_idx, freeze=True)
    topic_model.to(device)
    optimizer = Adam(topic_model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, verbose=True)

    def topic_loss_function(pos, neg, model):
        pos = pos.unsqueeze(-1)
        delta = 1 - pos + neg
        delta = F.relu(delta)
        reg = torch.frobenius_norm(model.T_matrix @ model.T_matrix.permute(1,0)  - torch.eye(n_topics).to(pos.device))
        loss = delta.sum() / batch_size + lamda * reg / len(sampler) / batch_size
        return loss

    topic_model.train()
    for ep in range(n_epoch):
        ep_loss = 0
        sampler = DataSampler(tensor_dataset, negative_size=negative_size, batch_size=batch_size)
        for step, batch in enumerate(iter(sampler)):
            for k, v in batch.items():
                batch[k] = v.to(device)
            optimizer.zero_grad()
            pos, neg = topic_model(batch)
            loss = topic_loss_function(pos, neg, topic_model)
            loss.backward()
            optimizer.step()
            ep_loss += loss.item()
        scheduler.step(ep_loss)
        print(f'Epoch {ep}, loss {ep_loss / len(sampler)}')

    topic_model.eval()
    with torch.no_grad():
        W = []
        for x in tensor_dataset:
            W.append(topic_model.get_probs(x.to(device)).cpu().numpy())
        W = np.array(W)

        H = []
        for x in vocab_list:
            if x not in word2idx.keys():
                continue
            H.append(topic_model.get_probs(torch.tensor([word2idx[x]]).to(device)).cpu().numpy())
        H = np.array(H)
        H = H.transpose()

    topic_models.append((n_topics, W, H))

3 topics 

Epoch 0, loss 1.9149345496476227
Epoch 1, loss 0.5372674344885393
Epoch 2, loss 0.23591477085950507
Epoch 3, loss 0.13628685431837448
Epoch 4, loss 0.07475079432465356




4 topics 

Epoch 0, loss 1.950862026236933
Epoch 1, loss 0.5755815415883574
Epoch 2, loss 0.2423686529863145
Epoch 3, loss 0.10766235287396982
Epoch 4, loss 0.04486044019254736
5 topics 

Epoch 0, loss 2.08254938715119
Epoch 1, loss 0.5563474206519411
Epoch 2, loss 0.22361525588479236
Epoch 3, loss 0.10516403289501615
Epoch 4, loss 0.048700271897553285
6 topics 

Epoch 0, loss 2.066286248204463
Epoch 1, loss 0.4534000423928859
Epoch 2, loss 0.15305404735768616
Epoch 3, loss 0.05572209231965309
Epoch 4, loss 0.02025284052827279
7 topics 

Epoch 0, loss 2.2445070831384464
Epoch 1, loss 0.5039599054086727
Epoch 2, loss 0.1735416554600613
Epoch 3, loss 0.07447927892491245
Epoch 4, loss 0.03060126555922636
8 topics 

Epoch 0, loss 2.099631497687388
Epoch 1, loss 0.39927460073630106
Epoch 2, loss 0.13011391842998532
Epoch 3, loss 0.04307928825706188
Epoch 4, loss 0.010784386056463119
9 topics 

Epoch 0, loss 7.771446455326019
Epoch 1, loss 7.820559848397809
Epoch 2, loss 0.6949806072480719
E

**Coherence**


In [0]:
terms = list(filter(lambda x: x in word2idx.keys(), vocab_list))

In [41]:
def get_descriptor( terms, H, topic_index, top ):
    top_indices = np.argsort( H[topic_index,:] )[::-1]
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append( terms[term_index] )
    return top_terms

for (k,_,H) in topic_models:
    print(f'{k} topics')
    descriptors = []
    for topic_index in range(k):
        descriptors.append( get_descriptor( terms, H, topic_index, 10 ) )
        str_descriptor = ", ".join( descriptors[topic_index] )
        print("Topic %02d: %s" % ( topic_index+1, str_descriptor ) )
    print('\n')

3 topics
Topic 01: reforms, pledged, ministers, allies, leaders, tensions, policymakers, economic, austerity, confidence
Topic 02: crying, laughing, screaming, tears, guy, loud, heard, dressed, surrounded, wearing
Topic 03: download, multiple, wi-fi, format, #, species, user, server, entries, 2013


4 topics
Topic 01: percent, ministers, against, parliamentary, votes, elections, opposition, parliament, election, presidency
Topic 02: please, information, data, download, identify, customer, providers, disclose, publish, users
Topic 03: cech, klopp, grimsby, belfast, paschi, wembley, organised, reporters, breakfast, davos
Topic 04: funny, comedy, boring, mouth, piano, kid, weird, fiction, guitar, discovered


5 topics
Topic 01: productivity, devices, applications, systems, properties, larger, smaller, function, offset, products
Topic 02: somebody, sorry, anybody, yeah, scared, me, everybody, ok, nobody, guess
Topic 03: award, fame, awards, acclaimed, prize, favourite, tribute, guitarist, 

In [0]:
def calculate_coherence( w2v_model, term_rankings ):
    overall_coherence = 0.0
    for topic_index in range(len(term_rankings)):
        # check each pair of terms
        pair_scores = []
        for pair in combinations( term_rankings[topic_index], 2 ):
            if pair[0] in w2v_model.vocab and pair[1] in w2v_model.vocab:
                pair_scores.append( w2v_model.similarity(pair[0], pair[1]) )
        # get the mean for all pairs in this topic
        topic_score = sum(pair_scores) / len(pair_scores)
        overall_coherence += topic_score
    # get the mean score across all topics
    return overall_coherence / len(term_rankings)

****Topic Coherence****





In [43]:
k_values = []
coherences = []
for (k,W,H) in topic_models:
    term_rankings = []
    for topic_index in range(k):
        term_rankings.append( get_descriptor( terms, H, topic_index, 10 ) )
    k_values.append( k )
    coherences.append( calculate_coherence( model, term_rankings ) )
    print("K=%02d: Coherence=%.4f" % ( k, coherences[-1] ) )

K=03: Coherence=0.3056
K=04: Coherence=0.2673
K=05: Coherence=0.3684
K=06: Coherence=0.3009
K=07: Coherence=0.3372
K=08: Coherence=0.2787
K=09: Coherence=0.2443
K=10: Coherence=0.2486


  if np.issubdtype(vec.dtype, np.int):
