## Training an RMN

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import os

In [2]:
os.chdir("../../../scripts/assembly")
from session_speaker_assembly import *
from preprocess import *
from document import *
from constant import SPEECHES, SPEAKER_MAP, HB_PATH, EMBEDDINGS

In [3]:
os.chdir("../modeling")
from token_mapping import ohe_attribures, build_tokenizer_dict, build_metadata_dict


In [4]:
subject_df = pd.read_csv('../../data/gen-docs/documents_health.txt', sep = '|')
feature_columns = subject_df.columns.drop('speech')
subject_df = ohe_attribures(subject_df)
token_dict = build_tokenizer_dict(subject_df)

There are a total of 535 Members of Congress. 100 serve in the U.S. Senate and 435 serve in the U.S. House of Representatives. A length of 50 suggests that nearly everyone commented on "health" (in a speech of more than 50 words) at some point.

In [5]:
speeches_word_index = token_dict['speech']['word_index']
vocab_size = len(speeches_word_index)
vocab_size

13136

In [6]:
speeches_train = token_dict['speech']['train']
len(speeches_train)

2177

In [7]:
speeches_train_padded = token_dict['speech']['train_padded']
speeches_train_padded

array([[ 1334,     1,   162, ...,     0,     0,     0],
       [    1,   162,    51, ...,     0,     0,     0],
       [   14,    57,  2819, ...,     0,     0,     0],
       ...,
       [  102,     2,    60, ...,     0,     0,     0],
       [    2,   285, 13117, ...,     0,     0,     0],
       [   54,   203, 13135, ...,     0,     0,     0]], dtype=int32)

I think that the sentences need to be in integer-tokenized form.
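
To make "integer-tokenized" concrete, here is a minimal sketch of tokenizing and zero-padding with plain NumPy. It is not the code inside `build_tokenizer_dict`; the helper names `build_word_index` and `tokenize_and_pad` and the toy speeches are made up for illustration, and ids are assigned in order of first appearance (a real tokenizer may order them differently, for example by frequency).

```python
import numpy as np

def build_word_index(texts):
    """Map each distinct word to an integer id, starting at 1 (0 is kept for padding)."""
    word_index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def tokenize_and_pad(texts, word_index, max_len):
    """Turn each text into integer ids, then truncate or right-pad with zeros to max_len."""
    padded = np.zeros((len(texts), max_len), dtype=np.int32)
    for row, text in enumerate(texts):
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[:max_len]
        padded[row, :len(ids)] = ids
    return padded

# hypothetical toy data, not taken from the speeches file
toy_speeches = ["health care costs are rising", "we need health reform"]
toy_word_index = build_word_index(toy_speeches)
toy_padded = tokenize_and_pad(toy_speeches, toy_word_index, max_len=6)
print(toy_padded)
```

Each row has the same length, which is what lets the padded array be used to index an embedding matrix in one step.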

From Iyyer et al.

"Each input to the RMN is a tuple that contains identifiers for a book and two characters, as well as the spans corresponding to their relationship: $(b, c_1, c_2, S_{c_1,c_2})$. Given one such input, our objective is to reconstruct $S_{c_1,c_2}$ using a linear combination of relationship descriptors from R as shown in Figure 2; we now describe this process formally."


### Needs for Baseline goal

Let...
* $s_{v_t}$ be the $t^{th}$ span of text in the span set $S_{c_1,c_2}$
* $v_{s_t}$ be the vector that results from taking the element-wise average of the word vectors in $s_{v_t}$
* $C$ be the set of metadata embeddings
* $m_{t,c}$ be the metadata embeddings vector for metadata $c$ with 
* $d$ be the dimension of the embedding
* $k$ be the number of descriptors


Compute Sequence: Given $s_{v_t}$, do the following steps:
1. compute the average speech vector, $v_{s_t}$:
    * $v_{s_t} \in \mathbb{R}^{d}$
2. concatenate the average span vector and the metadata embeddings:
    * $m_{t,c} \in \mathbb{R}^{d}$
    * $[v_{s_t}; m_{t,1};...; m_{t,|C|}]$
3. compute the hidden state with a ReLU activation:
    * $h_t = \text{ReLU}\,(W_h \cdot [v_{s_t}; m_{t,1};...; m_{t,|C|}])$
    * $W_h \in \mathbb{R}^{d \times (d + d|C|)}$
    * $h_t \in \mathbb{R}^{d}$
4. get the distribution over topics using another hidden layer:
    * $d_t = \text{softmax}\,(W_d \cdot h_t)$
    * $W_d \in \mathbb{R}^{k \times d}$
    * $d_t \in \mathbb{R}^{k}$
    * $d_{t,i} \in (0,1) \space \forall i$
5. recompose the original sentence using the distribution over descriptors and the descriptor matrix:
    * $r_t = R^Td_t$
    * $R^T \in \mathbb{R}^{d \times k}$
    * $r_t \in \mathbb{R}^{d}$
6. score the distance between $r_t$ and $v_{s_t}$:
    * $distance = dist(r_t, v_{s_t})$
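
The compute sequence above can be sketched for a single span in plain NumPy. This is only an illustration with random weights, not the `RMN` class used later. The sizes `d`, `k` and `n_meta` are arbitrary, and cosine distance stands in for $dist$, which the notes leave unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_meta = 50, 20, 3  # embedding dim, number of descriptors, |C| (illustrative values)

v_st = rng.normal(size=d)                           # average span vector
m_t = [rng.normal(size=d) for _ in range(n_meta)]   # one embedding per metadata field

W_h = rng.normal(scale=0.1, size=(d, d + d * n_meta))
W_d = rng.normal(scale=0.1, size=(k, d))
R = rng.normal(scale=0.1, size=(k, d))              # descriptor matrix, one row per descriptor

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

concat = np.concatenate([v_st] + m_t)    # [v_st; m_1; ...; m_|C|]
h_t = np.maximum(0.0, W_h @ concat)      # hidden state with ReLU
d_t = softmax(W_d @ h_t)                 # distribution over descriptors
r_t = R.T @ d_t                          # reconstruction of the span vector

# assumed choice of dist: cosine distance between reconstruction and target
distance = 1.0 - (r_t @ v_st) / (np.linalg.norm(r_t) * np.linalg.norm(v_st))
print(h_t.shape, d_t.shape, r_t.shape, distance)
```

The shapes match the dimensions listed above: `h_t` and `r_t` live in $\mathbb{R}^{d}$, and `d_t` is a probability vector in $\mathbb{R}^{k}$.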
    
    
#### Notes on implementing it with Keras
Every step above that uses a matrix multiplication can be implemented in Keras using a dense layer, formatted like this:
* `h = keras.layers.Dense(units = a, input_shape = (b, ), activation = "the_activation")(prev_layer)`
    * This gives the dense layer a weight matrix equivalent to $W \in \mathbb{R}^{a \times b}$ and the activation "`the_activation`". Keras stores the kernel with shape $(b, a)$, the transpose of $W$, and adds a bias vector unless `use_bias=False` is passed.
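
As a sanity check on the shapes, the following NumPy sketch mimics what a dense layer computes, `activation(inputs @ kernel + bias)`, and compares it with the $W \cdot x$ form used in the compute sequence. The function `dense_forward` is a hypothetical stand-in written for this note, not a Keras API.

```python
import numpy as np

a, b = 4, 7  # units and input size (illustrative values)
rng = np.random.default_rng(1)

W = rng.normal(size=(a, b))  # math convention: W maps R^b to R^a
x = rng.normal(size=b)

kernel = W.T         # dense-layer convention: kernel has shape (b, a)
bias = np.zeros(a)   # zero bias, equivalent to use_bias=False

def relu(z):
    return np.maximum(0.0, z)

def dense_forward(batch, kernel, bias, activation):
    """Row-wise dense layer: one output row per input row."""
    return activation(batch @ kernel + bias)

batch = x[np.newaxis, :]                       # a batch containing one example
out = dense_forward(batch, kernel, bias, relu)
print(out.shape)
```

With a zero bias the two conventions agree exactly, so the dimensions written for $W_h$ and $W_d$ carry over by setting `units` to the row count and `input_shape` to the column count.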

In [8]:
# Imports
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Embedding, Dense, Lambda, Input, Masking

The GloVe embeddings are on a local VM, and are not yet in `gs://rwc1/embeddings/`. Attempts to access the embeddings from the gcloud bucket ran into bugs. You can find the embeddings used [here](https://nlp.stanford.edu/projects/glove/); they are the Wikipedia + Gigaword 5 trained embeddings with 6 billion tokens.

In [9]:
os.chdir("../modeling")
os.listdir(os.getcwd())

['token_mapping.py',
 'embeddings.py',
 'rmn_hyperparams.py',
 'orthoganlity_constraint.py',
 '__pycache__',
 'train_rmn.py',
 '.ipynb_checkpoints',
 'rmn.py']

In [11]:
# run this cell two or three times for some reason
os.chdir("../modeling")
from embeddings import *
from orthoganlity_constraint import Orthoganal
from rmn import RMN

In [None]:
# use this cell if you have the embeddings files stored locally

NUM_TOPICS = 20
GLOVE_DIMS = [50, 100, 200, 300]
EMBEDDING_DIM = GLOVE_DIMS[0]

embeddings_index = {}
# each line of the GloVe file is a word followed by its vector components
with open('../../data/glove/glove.6B.%dd.txt' % EMBEDDING_DIM, encoding='utf-8') as glove:
    for line in glove:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except ValueError:
            # show the offending row before re-raising
            print(values[1:])
            raise

        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

embeddings_matrix = np.zeros((len(speeches_word_index) + 1, EMBEDDING_DIM))
for word, i in speeches_word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embeddings_matrix[i] = embedding_vector

In [None]:
# uncomment and run this cell to use embeddings from the gcloud bucket
# warning: this takes longer than above

## build embedding matrix
# embeddings_index = fetch_embeddings()
# embeddings_matrix = build_embedding_matrix(speeches_word_index, embeddings_index)

In [None]:
# average of span embeddings (padding positions are included in the mean)
Vst_train = embeddings_matrix[speeches_train_padded].mean(axis=1)
Vst_train.shape
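
One thing to be aware of: `mean(axis=1)` averages over every position, including the zero padding. Assuming row 0 of the embedding matrix is all zeros, this shrinks the vectors of short speeches toward zero. The sketch below shows an alternative that averages over real tokens only. The function `masked_mean_embeddings` and the toy arrays are hypothetical and are not what the notebook runs.

```python
import numpy as np

def masked_mean_embeddings(padded_ids, embeddings_matrix, pad_id=0):
    """Average word vectors per row, ignoring positions equal to pad_id."""
    mask = padded_ids != pad_id                       # (n, max_len) booleans
    vectors = embeddings_matrix[padded_ids]           # (n, max_len, dim)
    summed = (vectors * mask[..., np.newaxis]).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1)  # avoid dividing by zero
    return summed / counts

# toy embedding matrix: row 0 is the padding row
toy_matrix = np.array([[0.0, 0.0],
                       [1.0, 3.0],
                       [3.0, 5.0]])
toy_ids = np.array([[1, 2, 0, 0],
                    [2, 0, 0, 0]])

masked = masked_mean_embeddings(toy_ids, toy_matrix)
plain = toy_matrix[toy_ids].mean(axis=1)  # same form as the cell above
print(masked)
print(plain)
```

In the toy case the plain mean of the second row is a quarter of the word vector, because three of its four positions are padding, while the masked mean returns the word vector itself.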

In [None]:
metadata_dict = build_metadata_dict(feature_columns, subject_df)
metadata_dict.keys()

In [None]:
np.random.seed(565)
model = RMN().build_model(metadata_dict)
model.summary()

In [None]:
inputs = [Vst_train]
for key in metadata_dict.keys():
    inputs.append(metadata_dict[key]['input'])
np.random.seed(565)
model.fit(x=inputs, y=Vst_train, batch_size=50, epochs = 10)

In [None]:
R = np.transpose(model.get_layer('R').get_weights()[0])
R.shape

In [None]:
np.linalg.matrix_rank(R)

In [None]:
# R R^T minus the identity: entries near zero mean the rows of R are close to orthonormal
R_ = np.dot(R, np.transpose(R))
ones_R = np.eye(R_.shape[0])
(R_ - ones_R)
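
Reading the full difference matrix by eye is hard, so a single number can help. The sketch below reduces $RR^T - I$ to its Frobenius norm. The function name `orthogonality_gap` is made up for this note, and it assumes each row of its argument is one descriptor, which may require transposing the matrix used above.

```python
import numpy as np

def orthogonality_gap(descriptors):
    """Frobenius norm of (D D^T - I); zero when the rows of D are orthonormal."""
    gram = descriptors @ descriptors.T
    return np.linalg.norm(gram - np.eye(gram.shape[0]))

orthonormal_rows = np.eye(3, 5)           # three orthonormal rows in R^5
duplicate_rows = np.array([[1.0, 0.0],
                           [1.0, 0.0]])   # two identical unit rows

gap_orthonormal = orthogonality_gap(orthonormal_rows)
gap_duplicate = orthogonality_gap(duplicate_rows)
print(gap_orthonormal, gap_duplicate)
```

A value near zero indicates nearly orthonormal descriptors, while repeated or strongly correlated descriptors push the value up.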

In [None]:
from scipy.spatial.distance import cosine

y_pred = model.predict(inputs)
y_truth = Vst_train

# scipy's `cosine` returns the cosine distance, so subtract it from 1 to get the similarity
sims = []
for i in range(y_truth.shape[0]):
    cos_sim = 1 - cosine(y_truth[i], y_pred[i])
    sims.append(cos_sim)

np.array(sims).mean()
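
Note that `scipy.spatial.distance.cosine` returns a distance, that is, one minus the cosine similarity. For reference, here is a vectorized sketch that computes the row-wise similarity directly with NumPy and avoids the Python loop. The function `rowwise_cosine_similarity` and the toy arrays are illustrative, not part of the project code.

```python
import numpy as np

def rowwise_cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between matching rows of a and b."""
    dots = (a * b).sum(axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / np.maximum(norms, eps)  # eps guards against all-zero rows

toy_truth = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
toy_pred = np.array([[2.0, 0.0], [3.0, 0.0], [-1.0, -1.0]])

toy_sims = rowwise_cosine_similarity(toy_truth, toy_pred)
print(toy_sims, toy_sims.mean())
```

The three toy pairs are parallel, orthogonal and opposite, so they cover the full range of values from 1 down to -1.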