## Training an RMN

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import os

In [None]:
os.chdir("../../../scripts/assembly")
from session_speaker_assembly import *
from preprocess import *
from document import *
from constant import SPEECHES, SPEAKER_MAP, HB_PATH, EMBEDDINGS

In [None]:
# os.chdir("../modeling")
# os.getcwd()

In [None]:
os.chdir("../modeling")
from token_mapping import ohe_attribures, build_tokenizer_dict, build_metadata_dict

In [None]:
# session = 111
# speak_map_cols = ['speakerid','chamber','state','gender']

# speaker_map_df = pd.read_csv(os.path.join(HB_PATH,SPEAKER_MAP % session), sep = '|')[speak_map_cols]
# speaker_map_df = speaker_map_df.groupby('speakerid').last().reset_index()
# speaker_map_df

In [None]:
# subject_df = subject_docs(session = session,
#                           speech_path = HB_PATH,
#                           min_tokens=MIN_TOKENS,
#                           span_finder=make_span_finder("health", WINDOW))
# subject_df.head()

In [None]:
subject_df = pd.read_csv('../../data/gen-docs/documents_health.txt', sep = '|')
feature_columns = subject_df.columns.drop('speech')
subject_df = ohe_attribures(subject_df)
token_dict = build_tokenizer_dict(subject_df)

In [None]:
# # merge speech and speaker metadata
# session_df = subject_df.merge(speaker_map_df, how = 'inner', on = 'speakerid')

# # ensure proper merge
# assert(subject_df.shape[0]==session_df.shape[0])
# assert(subject_df.shape[1] + len(speak_map_cols) - 1 == session_df.shape[1])

In [None]:
# # subset data for prelim building
# size = subject_df.shape[0]
# sample_df = subject_df.iloc[:size,:]

# sample_df['speakerid'] = sample_df['speakerid'].astype(str)

# # one-hot-encode speaker metadata
# for col in feature_columns[:3]:
#     sample_df = pd.concat([sample_df,pd.get_dummies(sample_df[col])], axis = 1)
    

# sample_df

In [None]:
# sample_speakers = sample_df['speakerid'].unique()
# print('speaker count:', len(sample_speakers))

There are a total of 535 Members of Congress. 100 serve in the U.S. Senate and 435 serve in the U.S. House of Representatives. A length of 50 suggests that nearly everyone commented on "health" (in a speech of more than 50 words) at some point.

In [None]:
# from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
# # building tokenizers, word indices, and train data

# speech_tokenizer = Tokenizer()
# speech_tokenizer.fit_on_texts(sample_df['speech'].values)
# speeches_word_index = speech_tokenizer.word_index

# tokenizers = {}
# tokenizers['speech'] = {'tokenizer': speech_tokenizer,
#                         'train': speech_tokenizer.texts_to_sequences(sample_df['speech'].values),
#                         'word_index': speeches_word_index}

# for col in feature_columns:
#     tokenizer = Tokenizer()
#     tokenizer.fit_on_texts(sample_df[col].values)
#     tokenizers[col] = {}
#     tokenizers[col]['train'] = tokenizer.texts_to_sequences(sample_df[col].values)
#     tokenizers[col]['word_index'] = tokenizer.word_index
#     tokenizers[col]['tokenizer'] = tokenizer
    

In [None]:
speeches_word_index = token_dict['speech']['word_index']
vocab_size = len(speeches_word_index)
vocab_size

In [None]:
speeches_train = token_dict['speech']['train']
len(speeches_train)

In [None]:
# from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# max_len = WINDOW + 1
# speeches_train_padded = pad_sequences(speeches_train, maxlen=max_len, padding="post")

In [None]:
speeches_train_padded = token_dict['speech']['train_padded']
speeches_train_padded

I think that the sentences need to be in integer-tokenized form.
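
To make "integer-tokenized form" concrete, here is a minimal pure-Python sketch of mapping spans to zero-padded integer sequences. It is not the Keras `Tokenizer` or the project's `build_tokenizer_dict`; the helper names `build_word_index` and `texts_to_padded` and the toy spans are assumptions made for illustration. Indices start at 1 so that 0 can act as the padding id.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer >= 1, most frequent first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sort by descending count, then alphabetically, so the mapping is deterministic
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def texts_to_padded(texts, word_index, max_len):
    """Convert texts to fixed-length integer sequences, zero-padded on the right."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[:max_len]  # truncate long spans, keeping the start
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded

# hypothetical toy spans
toy_spans = ["health care reform", "the health bill"]
toy_index = build_word_index(toy_spans)
toy_padded = texts_to_padded(toy_spans, toy_index, max_len=4)
```

Each row of `toy_padded` has the same length, so the rows can be stacked into a single integer array and used to index an embedding matrix.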

From Iyyer et al.

"Each input to the RMN is a tuple that contains identifiers for a book and two characters, as well as the spans corresponding to their relationship: $(b, c_1, c_2, S_{c_1,c_2})$. Given one such input, our objective is to reconstruct $S_{c_1,c_2}$ using a linear combination of relationship descriptors from R as shown in Figure 2; we now describe this process formally."


### Needs for Baseline goal

Let...
* $s_{v_t}$ be the $t^{th}$ span of text in the span set $S_{c_1,c_2}$
* $v_{s_t}$ be the vector that results from taking the element-wise average of the word vectors in $s_{v_t}$
* $C$ be the set of metadata embeddings
* $m_{t,c}$ be the metadata embedding vector for metadata $c$, with $m_{t,c} \in \mathbb{R}^{d}$
* $d$ be the dimension of the embedding
* $k$ be the number of descriptors


Compute Sequence: Given $s_{v_t}$, do the following steps:
1. compute the average speech vector, $v_{s_t}$,
    * $v_{s_t} \in \mathbb{R}^{d}$
2. concatenate the average span and metadata embeddings
    * $m_{t,c} \in \mathbb{R}^{d}$
    * [$v_{s_t}; m_{t,1};...; m_{t,|C|}$]
3. compute the hidden state with ReLU activation: 
    * $h_t = \text{ReLU}(W_h \cdot [v_{s_t}; m_{t,1};...; m_{t,|C|}])$
    * $W_h \in \mathbb{R}^{d \times (d + d|C|)}$ 
    * $h_t \in  \mathbb{R}^{d}$
4. get the distribution over topics using another hidden layer: 
    * $d_t = \text{softmax}(W_d \cdot h_t)$
    * $W_d \in  \mathbb{R}^{k \times d}$
    * $d_t \in  \mathbb{R}^{k}$
    * $d_{t,i} \in (0,1) \space \forall i$ 
5. recompose the original sentence using the distribution over descriptors and the descriptor matrix:
    * $r_t = R^Td_t$
    * $R^T \in \mathbb{R}^{d \times k}$
    * $r_t \in \mathbb{R}^{d}$
6. score the distance between $r_t$ and $v_{s_t}$
    * $distance = dist(r_t, v_{s_t})$
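
The compute sequence above can be sketched as a single forward pass in NumPy. This is an illustration under stated assumptions, not the `RMN` class used later: the weights are random, the number of metadata fields (`n_meta`) is a made-up toy value, and cosine distance is only one possible choice for $dist$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_meta = 50, 20, 3  # embedding dim, number of descriptors, |C| (toy values)

# toy inputs: one averaged span vector and |C| metadata embeddings
v_st = rng.normal(size=d)
m_t = [rng.normal(size=d) for _ in range(n_meta)]

# randomly initialised parameters, with the shapes given in the list above
W_h = rng.normal(scale=0.1, size=(d, d + d * n_meta))
W_d = rng.normal(scale=0.1, size=(k, d))
R = rng.normal(scale=0.1, size=(k, d))

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.concatenate([v_st] + m_t)  # [v_st; m_t1; ...; m_t|C|]
h_t = np.maximum(0.0, W_h @ x)    # ReLU hidden state
d_t = softmax(W_d @ h_t)          # distribution over descriptors
r_t = R.T @ d_t                   # reconstruction of the span vector

# cosine distance, used here as an assumed stand-in for dist()
distance = 1.0 - (r_t @ v_st) / (np.linalg.norm(r_t) * np.linalg.norm(v_st))
```

Checking the shapes of `x`, `h_t`, `d_t`, and `r_t` against the list is a quick way to confirm that the dimensions of $W_h$, $W_d$, and $R$ are consistent with each other.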
    
    
#### Notes on implementing it with keras
Every step above that uses a matrix multiplication can be implemented in Keras using a dense layer, formatted like this:
* `h = keras.layers.Dense(units = a, input_shape = (b, ), activation= "the_activation")(prev_layer)`
    * This makes the dense layer use a weight matrix $W \in \mathbb{R}^{a \times b}$ and the activation "`the_activation`"
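
One detail worth checking when mapping the equations onto dense layers: Keras stores a `Dense` kernel with shape `(b, a)` and computes `x @ kernel + bias`, with the bias enabled unless `use_bias=False` is passed. The NumPy sketch below (toy sizes, random values, nothing taken from the trained model) shows that this is the same product as $W \cdot x$ with $W$ equal to the transposed kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 4, 6                       # units and input dimension (toy values)

x = rng.normal(size=b)            # one input vector
kernel = rng.normal(size=(b, a))  # layout Keras uses for Dense kernels
bias = np.zeros(a)                # zero here, i.e. the use_bias=False case

dense_out = x @ kernel + bias     # what a Dense layer computes before the activation
W = kernel.T                      # the matrix W in R^{a x b} from the note above
matrix_out = W @ x

same = np.allclose(dense_out, matrix_out)
```

This matters when reading weights back with `get_weights()`: the returned kernel is the transpose of the $W$ written in the equations.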

In [None]:
# Imports
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Embedding, Dense, Lambda, Input, Masking

The GloVe embeddings are on a local VM and are not yet in `gs://rwc1/embeddings/`. Attempts to access the embeddings from the gcloud bucket ran into bugs. You can find the embeddings used [here](https://nlp.stanford.edu/projects/glove/), which are the Wikipedia + Gigaword 5 trained embeddings with 6 billion tokens.

In [None]:
os.chdir("../modeling")
os.getcwd()

In [None]:
os.chdir("../modeling")
from embeddings import *
from orthoganlity_constraint import Orthoganal
from rmn import RMN

In [None]:
NUM_TOPICS = 20
GLOVE_DIMS = [50, 100, 200, 300]
EMBEDDING_DIM = GLOVE_DIMS[0]

embeddings_index = {}
# each line holds a word followed by its EMBEDDING_DIM coefficients
with open('../../data/glove/glove.6B.%dd.txt' % EMBEDDING_DIM, encoding='utf-8') as glove:
    for line in glove:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except Exception:
            # show the offending values before re-raising
            print(values[1:])
            raise

        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

In [None]:
embeddings_matrix = np.zeros((len(speeches_word_index) + 1, EMBEDDING_DIM))
for word, i in speeches_word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embeddings_matrix[i] = embedding_vector

In [None]:
# build embedding matrix
# embeddings_index = fetch_embeddings()
# embeddings_matrix = build_embedding_matrix(speeches_word_index, embeddings_index)

# average of span embeddings
Vst_train = embeddings_matrix[speeches_train_padded].mean(axis=1)
Vst_train.shape
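
Note that a plain `.mean(axis=1)` over padded sequences also averages in the all-zero rows used for padding, which shrinks the vectors of short spans. A possible alternative is a masked mean that divides by the number of real tokens only. The sketch below assumes index 0 is the padding id; `masked_mean_embeddings` and the toy arrays are hypothetical names, not part of the project code.

```python
import numpy as np

def masked_mean_embeddings(embeddings_matrix, padded_ids, pad_id=0):
    """Average word vectors per span, skipping positions equal to pad_id."""
    padded_ids = np.asarray(padded_ids)
    vectors = embeddings_matrix[padded_ids]    # (n_spans, max_len, dim)
    mask = (padded_ids != pad_id)[..., None]   # (n_spans, max_len, 1)
    counts = np.maximum(mask.sum(axis=1), 1)   # avoid dividing by zero
    return (vectors * mask).sum(axis=1) / counts

toy_matrix = np.array([[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]])  # row 0 = padding
toy_ids = [[1, 2, 0, 0], [2, 0, 0, 0]]

plain = toy_matrix[np.asarray(toy_ids)].mean(axis=1)   # padding included
masked = masked_mean_embeddings(toy_matrix, toy_ids)   # padding excluded
```

Out-of-vocabulary words that have a non-zero id but an all-zero embedding row are still counted by this mask; excluding them would need a separate check.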

In [None]:
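# one-hot-encoded speaker metadata inputs
# NOTE: `sample_df` is only defined in a commented-out cell above, so the
# one-hot-encoded `subject_df` is used instead. This assumes the dummy columns
# are named after the values of each feature column.

metadata_dict = {}

for col in feature_columns:
    df = subject_df[subject_df[col].unique()].values
    dim = df.shape[1]
    metadata_dict[col] = {'input': df, 'input_dim': dim}

metadata_dict.keys()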

In [None]:
metadata_dict = build_metadata_dict(feature_columns, subject_df)
metadata_dict.keys()

In [None]:
np.random.seed(565)
model = RMN().build_model(metadata_dict)
model.summary()

In [None]:
inputs = [Vst_train]
for key in metadata_dict.keys():
    inputs.append(metadata_dict[key]['input'])
np.random.seed(565)
model.fit(x=inputs, y=Vst_train, batch_size=50, epochs = 10)

In [None]:
R = np.transpose(model.get_layer('R').get_weights()[0])
R.shape

In [None]:
np.linalg.matrix_rank(R)

In [None]:
# deviation of R R^T from the identity; all zeros would mean the rows of R are orthonormal
R_ = np.dot(R, np.transpose(R))
identity = np.eye(R_.shape[0])
(R_ - identity)
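
The check above can be summarised as a single number. One common form of orthogonality penalty is the Frobenius norm of $R_n R_n^T - I$, where $R_n$ is the descriptor matrix with unit-length rows. The sketch below is an assumption about how such a penalty could be computed for a matrix with one descriptor per row; it is not taken from the `Orthoganal` constraint imported earlier.

```python
import numpy as np

def orthogonality_penalty(R):
    """Frobenius norm of R_n R_n^T - I, where R_n is R with unit-length rows."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    R_n = R / np.maximum(norms, 1e-12)  # guard against all-zero rows
    gram = R_n @ R_n.T
    return np.linalg.norm(gram - np.eye(R.shape[0]))

orthonormal = np.eye(3, 5)   # 3 orthonormal rows in R^5
collapsed = np.ones((3, 5))  # 3 identical rows

penalty_low = orthogonality_penalty(orthonormal)
penalty_high = orthogonality_penalty(collapsed)
```

A value near zero indicates nearly orthogonal descriptors, while larger values indicate that descriptors overlap.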

In [None]:
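from scipy.spatial.distance import cosine

y_pred = model.predict(inputs)
y_truth = Vst_train

# scipy's `cosine` returns the cosine distance (1 - cosine similarity),
# so lower values mean better reconstructions
dists = []
for i in range(y_truth.shape[0]):
    cos_dist = cosine(y_truth[i], y_pred[i])
    dists.append(cos_dist)

np.array(dists).mean()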

- What is the file drawer problem? Why is the file drawer problem important from the perspective of a firm trying to learn about the effectiveness of an intervention from peer-reviewed research?
- One response to the file drawer problem is to say that if there are multiple findings that point in the same direction, the effect is "real." What is the logic of this claim? How does p-hacking subvert this logic?
- What is the p-curve? What is it meant to demonstrate (Figure 1)? What is the key comparison to make based on Figure 1?

In [None]:
from google.cloud import storage

client = storage.Client()
client.get_bucket('rwc1')

In [None]:
import requests

# constants
GLOVE_6B = "http://nlp.stanford.edu/data/glove.6B.zip"
GLOVE_42B = "http://nlp.stanford.edu/data/glove.42B.300d.zip"
GLOVE_840B = "http://nlp.stanford.edu/data/glove.840B.300d.zip"

r = requests.get(GLOVE_6B)
r