## Training an RMN

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import os

In [2]:
os.chdir("../../../scripts/assembly")
from session_speaker_assembly import *
from preprocess import *
from document import *
from constant import SPEECHES, SPEAKER_MAP, HB_PATH

In [3]:
df = subject_docs(session = 111, path = HB_PATH, subject = "health", min_len_tokens=100)

In [4]:
df.head()

Unnamed: 0,speakerid,speech
0,111118060.0,pay their bills and keep their homes. small bu...
1,111120160.0,honest and fair prosperity for the many. not j...
2,111121410.0,rarely has our great Nation faced such grave c...
3,111120961.0,together. With the middle class struggling to ...
4,111114091.0,amount of pride in noting that in each of thes...


In [5]:
speaker_speeches = df.groupby("speakerid")

In [6]:
speaker_keys = list(speaker_speeches.groups.keys())

In [7]:
speaker_keys[:10]

[111113931.0,
 111113951.0,
 111113981.0,
 111114011.0,
 111114021.0,
 111114091.0,
 111114101.0,
 111114121.0,
 111114171.0,
 111114321.0]

In [8]:
len(speaker_keys)

536

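There are 535 voting Members of Congress: 100 serve in the U.S. Senate and 435 serve in the U.S. House of Representatives. A count of 536 distinct speaker IDs suggests that nearly everyone commented on "health" at some point (in a speech that passed the `min_len_tokens=100` filter). The count can exceed 535 presumably because seats change hands during a session; this notebook does not verify that.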

In [9]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [10]:
tokenizer = Tokenizer()

In [11]:
tokenizer.fit_on_texts(df["speech"].values)

In [12]:
tokenizer.word_index

{'the': 1,
 'to': 2,
 'and': 3,
 'health': 4,
 'of': 5,
 'care': 6,
 'that': 7,
 'in': 8,
 'a': 9,
 'is': 10,
 'we': 11,
 'for': 12,
 'this': 13,
 'i': 14,
 'have': 15,
 'it': 16,
 'are': 17,
 'on': 18,
 'our': 19,
 'bill': 20,
 'as': 21,
 'they': 22,
 'with': 23,
 'will': 24,
 'their': 25,
 'about': 26,
 'not': 27,
 'be': 28,
 'you': 29,
 'people': 30,
 'reform': 31,
 'from': 32,
 'insurance': 33,
 'has': 34,
 'by': 35,
 'all': 36,
 'mr': 37,
 'but': 38,
 'who': 39,
 'what': 40,
 'my': 41,
 'was': 42,
 'going': 43,
 'do': 44,
 'speaker': 45,
 'would': 46,
 'an': 47,
 'so': 48,
 'more': 49,
 'at': 50,
 'president': 51,
 'or': 52,
 'there': 53,
 'one': 54,
 'american': 55,
 'been': 56,
 'if': 57,
 'when': 58,
 'americans': 59,
 'which': 60,
 'government': 61,
 'system': 62,
 'now': 63,
 'he': 64,
 'can': 65,
 'because': 66,
 'want': 67,
 'these': 68,
 'know': 69,
 'over': 70,
 'its': 71,
 'country': 72,
 'were': 73,
 'many': 74,
 'today': 75,
 'out': 76,
 'new': 77,
 'us': 78,
 'need': 

In [13]:
vocab_size = len(tokenizer.word_index)
vocab_size

17985

In [14]:
speaker_speeches.get_group(speaker_keys[0]).speech.values

array(['opposition to my approach who wanted to speak as well. Senator KENNEDY cosponsors my amendment and is fully supportive. Because of health care concerns he could not be here today. I do wish to share with our colleagues and for the record a statement he issued on June',
       'consistently advocated on behalf of small businesses. not only across Arkansas but across the country. We both want to reform the health care system. We know this has a major impact on small businesses. They create most of the new jobs in our society. So if we care',
       'jobs. I hope our colleagues will take note of this letter. The Senator from Maine also pointed out. why should we control the health care choices of individuals who are receiving no subsidies. That ought to be up to them. We accomplish all of those things. It',
       'conditions. With the leadership of the Federal Government and input from all stakeholders. including Alzheimers patient advocates. health cafe prodders. State health de

In [15]:
x_train = tokenizer.texts_to_sequences(speaker_speeches.get_group(speaker_keys[0]).speech.values)

In [16]:
x_train

[[789,
  2,
  41,
  592,
  39,
  691,
  2,
  411,
  21,
  125,
  110,
  849,
  4134,
  41,
  173,
  3,
  10,
  1056,
  2349,
  66,
  5,
  4,
  6,
  509,
  64,
  176,
  27,
  28,
  95,
  75,
  14,
  44,
  451,
  2,
  743,
  23,
  19,
  165,
  3,
  12,
  1,
  819,
  9,
  1284,
  64,
  2495,
  18,
  2021],
 [2423,
  3393,
  18,
  1062,
  5,
  134,
  167,
  27,
  158,
  230,
  2277,
  38,
  230,
  1,
  72,
  11,
  260,
  67,
  2,
  31,
  1,
  4,
  6,
  62,
  11,
  69,
  13,
  34,
  9,
  334,
  450,
  18,
  134,
  167,
  22,
  274,
  126,
  5,
  1,
  77,
  87,
  8,
  19,
  1277,
  48,
  57,
  11,
  6],
 [87,
  14,
  443,
  19,
  165,
  24,
  129,
  2006,
  5,
  13,
  945,
  1,
  110,
  32,
  2334,
  88,
  1187,
  76,
  183,
  133,
  11,
  303,
  1,
  4,
  6,
  615,
  5,
  369,
  39,
  17,
  1493,
  96,
  1632,
  7,
  682,
  2,
  28,
  93,
  2,
  100,
  11,
  1982,
  36,
  5,
  91,
  168,
  16],
 [574,
  23,
  1,
  323,
  5,
  1,
  147,
  61,
  3,
  2621,
  32,
  36,
  3660,
  242,
  2241,
 

In [17]:
from keras.preprocessing.sequence import pad_sequences

In [18]:
max_len = WINDOW_DEFAULT + 1
x_train_padded = pad_sequences(x_train, maxlen=max_len, padding="post")

In [19]:
x_train_padded

array([[ 789,    2,   41, ...,    0,    0,    0],
       [2423, 3393,   18, ...,    0,    0,    0],
       [  87,   14,  443, ...,    0,    0,    0],
       [ 574,   23,    1, ...,    0,    0,    0]], dtype=int32)
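
To make the padded output above easier to read, here is a minimal NumPy sketch of what post-padding does. `pad_post` is a hypothetical helper written for illustration, not a Keras function. It assumes the padding value is 0 and that overlong sequences keep their last `maxlen` tokens, which is my understanding of the `pad_sequences` default (`truncating="pre"`).

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: right-pad integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]            # if too long, keep only the last maxlen tokens
        out[row, :len(kept)] = kept     # real tokens first, padding value after
    return out

# Toy token-id sequences of different lengths
toy = [[789, 2, 41], [87, 14]]
padded = pad_post(toy, maxlen=5)
print(padded)
```

Because 0 is used as the padding value, index 0 must never refer to a real word. The Keras `Tokenizer` starts its word indices at 1, which is consistent with this.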

I think that the sentences need to be in integer-tokenized form.

From Iyyer et al.

"Each input to the RMN is a tuple that contains identifiers for a book and two characters, as well as the spans corresponding to their relationship: $(b, c_1, c_2, S_{c_1,c_2})$. Given one such input, our objective is to reconstruct $S_{c_1,c_2}$ using a linear combination of relationship descriptors from R as shown in Figure 2; we now describe this process formally."


### Needs for Baseline goal

Let...
* $s_{v_t}$ be the $t^{th}$ span of text in the span set $S_{c_1,c_2}$
* $v_{s_t}$ be the vector that results from taking the element-wise average of the word vectors in $s_{v_t}$
* $d$ be the dimension of the embedding
* $k$ be the number of descriptors


Compute Sequence: Given $s_{v_t}$, do the following steps:
1. compute avg speech vector, $v_{s_t}$,
    * $v_{s_t} \in \mathbb{R}^{d}$
2. compute hidden state with ReLU activation:
    * $h_t = \text{ReLU}(W_h \cdot v_{s_t})$
    * $W_h \in \mathbb{R}^{d \times d}$ 
    * $h_t \in  \mathbb{R}^{d}$
3. get distribution over descriptors (topics) using another hidden layer:
    * $d_t = \text{softmax}(W_d \cdot h_t)$
    * $W_d \in  \mathbb{R}^{k \times d}$
    * $d_t \in  \mathbb{R}^{k}$
    * $d_{t,i} \in (0,1) \space \forall i$ 
4. recompose original sentence using the distribution over descriptors and the descriptor matrix:
    * $r_t = R^Td_t$
    * $R^T \in \mathbb{R}^{d \times k}$
    * $r_t \in \mathbb{R}^{d}$
5. score distance between $r_t$ and $v_{s_t}$
    * $distance = dist(r_t, v_{s_t})$
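
The five steps above can be sketched in plain NumPy for a single span. This is only an illustration under stated assumptions: the weights are random rather than trained, `word_vectors` stands in for the embedded words of one span, and cosine distance is an assumed choice for `dist`, which the notes above leave unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_words = 100, 20, 48                      # embedding dim, descriptors, words in span

# Random stand-ins (assumptions): in the real model these are learned
word_vectors = rng.normal(size=(n_words, d))     # one row per word in the span
W_h = rng.normal(scale=0.1, size=(d, d))
W_d = rng.normal(scale=0.1, size=(k, d))
R = rng.normal(scale=0.1, size=(k, d))           # descriptor matrix, one row per descriptor

def softmax(z):
    z = z - z.max()                              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

v_st = word_vectors.mean(axis=0)                 # step 1: average word vector, shape (d,)
h_t = np.maximum(0.0, W_h @ v_st)                # step 2: ReLU hidden state, shape (d,)
d_t = softmax(W_d @ h_t)                         # step 3: weights over descriptors, shape (k,)
r_t = R.T @ d_t                                  # step 4: reconstruction, shape (d,)

# step 5: cosine distance, an assumed choice of dist(r_t, v_st)
distance = 1.0 - (r_t @ v_st) / (np.linalg.norm(r_t) * np.linalg.norm(v_st))
print(d_t.shape, r_t.shape, distance)
```

Since $d_t$ sums to one, $r_t$ is a convex combination of the rows of $R$, which is what lets each row be read as a descriptor.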
    
    
#### Notes on implementing it with Keras
Every step above that uses a matrix multiplication can be implemented in Keras with a dense layer, written like this:
* `h = keras.layers.Dense(units = a, input_shape = (b, ), activation= "the_activation")(prev_layer)`
    * This gives the dense layer a weight matrix equivalent to $W \in \mathbb{R}^{a \times b}$ (Keras stores it transposed, with shape $b \times a$) and the activation "`the_activation`". A bias term is also added unless `use_bias=False` is passed.
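
As a sanity check on the shapes, here is a small NumPy sketch of what a dense layer computes. `dense_forward` is a hypothetical stand-in, not a Keras function. It assumes the kernel is stored with shape `(b, a)` (input dimension first) and that the bias is zero.

```python
import numpy as np

def dense_forward(x, kernel, bias, activation):
    """Hypothetical stand-in for a dense layer: activation(x @ kernel + bias)."""
    return activation(x @ kernel + bias)

def relu(z):
    return np.maximum(0.0, z)

a, b = 3, 5                                  # units (output dim) and input dim
rng = np.random.default_rng(0)
kernel = rng.normal(size=(b, a))             # stored as (input_dim, units)
bias = np.zeros(a)
W = kernel.T                                 # the W in R^{a x b} used in the notes

x = rng.normal(size=(2, b))                  # a batch of two input vectors
out = dense_forward(x, kernel, bias, relu)

# Row by row, the layer output matches relu(W @ x_i)
same = np.allclose(out[0], relu(W @ x[0])) and np.allclose(out[1], relu(W @ x[1]))
print(out.shape, same)
```

The batch form `x @ kernel` and the per-vector form `W @ x_i` describe the same computation, so the formulas above carry over to the layers unchanged.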

In [21]:
# Imports
import keras
import tensorflow as tf
from keras.layers import Embedding, Dense, Lambda

In [22]:
d = 100
k = 20

In [None]:
wordids = keras.layers.Input(shape=(max_len,))

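# Embed the wordids. Tokenizer indices run from 1 to vocab_size and 0 is the
# padding value, so the embedding table needs vocab_size + 1 rows.
e = keras.layers.Embedding(input_dim=vocab_size + 1, 
                           output_dim=d, 
                           input_length=max_len)(wordids)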

# Take elementwise average over vectors
a = keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1))(e)

# dense layer
ht = keras.layers.Dense(units = d, input_shape = (d, ), activation = "relu")(a)

# dense layer with softmax activation, (where previous states will eventually be inserted) 
dt = keras.layers.Dense(units = k, input_shape = (d, ), activation = "softmax")(ht)

# reconstruction layer
rt = keras.layers.Dense(units = d, input_shape = (k, ), activation = "linear")(dt)

# rt = keras.layers.Dense(units = d, input_shape = (k, ), activation = "linear")(a)
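
One caveat about the averaging step: the `Lambda` layer takes the mean over all `max_len` positions, so padded positions (index 0) are averaged in together with real words. The NumPy sketch below shows a masked mean that ignores padding. It is an alternative for comparison, not what the cell above does; `masked_mean` and the toy embedding `table` are hypothetical.

```python
import numpy as np

def masked_mean(embedded, token_ids, pad_id=0):
    """Hypothetical helper: average embeddings over non-padding positions only."""
    mask = (token_ids != pad_id).astype(embedded.dtype)     # (batch, max_len)
    summed = (embedded * mask[:, :, None]).sum(axis=1)      # (batch, d)
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)  # avoid dividing by zero
    return summed / counts

rng = np.random.default_rng(0)
table = rng.normal(size=(10, 4))             # toy embedding table: 10 ids, d = 4
token_ids = np.array([[5, 7, 0, 0]])         # one sequence, two real tokens, two pads
embedded = table[token_ids]                  # (1, 4, 4), like the Embedding output

plain = embedded.mean(axis=1)                # what a plain mean over axis=1 computes
masked = masked_mean(embedded, token_ids)    # mean over the two real tokens only
print(plain, masked)
```

For short speeches the two averages can differ a lot, because most positions in the padded sequence are padding.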

In [None]:
print(rt)

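In [None]:
# Build the model before inspecting it; summary() needs an existing model
model = keras.Model(inputs=wordids, outputs=rt)
model.summary()

In [None]:
# The reconstruction target is the averaged speech vector `a` (v_{s_t}), not the
# padded word ids, so the loss is attached to the graph instead of passed as `y`.
# dist() is not specified above; mean squared error is a placeholder choice.
model.add_loss(keras.backend.mean(keras.backend.square(rt - a)))
model.compile(optimizer="adam")

In [None]:
# No `y` argument: the loss added above depends only on the inputs
model.fit(x=x_train_padded, batch_size=1)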

In [None]:
for l in model.layers:
    print(l)
    print(50*"=")
    print("input shape", l.input_shape)
    print("output shape", l.output_shape)

In [None]:
from keras.models import Sequential
from keras.layers import Flatten, Dropout

In [None]:
model = Sequential()
model.add(Flatten(input_shape=(4,)))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='sigmoid'))

In [None]:
mo