# Introduction
Our goal is to determine whether text generated with explicit grammatical structure is better than text generated without it. Specifically, we want to know whether adding part-of-speech vectors improves text generation. To test this, we train two models with the same architecture on the same dataset: one that sees only words and one that also receives parts of speech. The cells below illustrate the results.

Before running these cells, please read the instructions in README.md. In particular:

1. Acquire and preprocess the data with `preprocess.py`.
1. Adjust and train the models with and without parts of speech using `train.py --include_pos y` and `train.py --include_pos n`.

Once training is complete and the model weights have been saved, we can generate text as seen below.

The following code generates "sentences" for each of the two models, where a sentence is considered complete as soon as the RNN outputs the "EN" token. Because the loop runs while `en_count <= 20`, each run stops after 21 sentences.
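To make the stopping rule concrete, here is a minimal sketch of the generation loop that runs without the trained models. `StubModel`, `pad_to_width`, `nearest_word`, `generate_sentences`, and the `max_tokens` guard are hypothetical names introduced for illustration; only the `predict([token, noise])` call shape, the `ST`/`EN` tokens, and the noise scaling come from the notebook cells. The sketch also assumes each mapping value has shape `(1, dim)`.

```python
import numpy as np


class StubModel:
    """Stand-in for a trained model: ignores its inputs and emits a fixed cycle of word vectors."""

    def __init__(self, mapping, cycle):
        self._vectors = [np.asarray(mapping[w], dtype=float).reshape(1, 1, -1) for w in cycle]
        self._step = 0

    def predict(self, inputs):
        out = self._vectors[self._step % len(self._vectors)]
        self._step += 1
        return out


def pad_to_width(vec, width):
    """Zero-pad a 1-D vector into a (1, 1, width) array."""
    padded = np.zeros((1, 1, width))
    padded[0, 0, :len(vec)] = vec
    return padded


def nearest_word(mapping, vec):
    """Return the word whose stored vector is closest (Euclidean distance) to vec."""
    return min(
        mapping,
        key=lambda w: np.linalg.norm(np.asarray(mapping[w], dtype=float).reshape(-1) - vec),
    )


def generate_sentences(model, mapping, num_sentences, num_pos_tags=0, max_tokens=1000, seed=0):
    """Generate token lists until num_sentences 'EN' tokens have been emitted."""
    rng = np.random.default_rng(seed)
    start = np.asarray(mapping["ST"], dtype=float).reshape(-1)
    width = start.size + num_pos_tags  # extra slots stay zero, as in the notebook
    token = pad_to_width(start, width)

    sentences, current = [], ["ST"]
    for _ in range(max_tokens):  # guard against a model that never emits 'EN'
        if len(sentences) == num_sentences:
            break
        noise = rng.random(token.shape) / 10  # small amount of noise
        out = np.asarray(model.predict([token, noise]))[0, 0, :]
        word = nearest_word(mapping, out)
        # feed the snapped word vector, not the raw prediction, back in
        token = pad_to_width(np.asarray(mapping[word], dtype=float).reshape(-1), width)
        current.append(word)
        if word == "EN":
            sentences.append(current)
            current = []
    return sentences


toy_mapping = {"ST": [[0.0, 0.0]], "hello": [[1.0, 0.0]], "EN": [[0.0, 1.0]]}
stub = StubModel(toy_mapping, cycle=["hello", "EN"])
sentences = generate_sentences(stub, toy_mapping, num_sentences=2)
print(sentences)
```

Counting completed sentences with `len(sentences) == num_sentences` yields exactly the requested number, which avoids the off-by-one that a `<=` comparison on a counter introduces.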

In [1]:
import numpy as np
import sys
import os
import json
from tqdm import tqdm
from keras.models import load_model
from keras.models import model_from_json
from keras import backend as K
import pickle

Using TensorFlow backend.


In [2]:
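########## UTILITY FUNCTIONS ##########
def dist(x, y):
    """Euclidean distance between two vectors of equal length."""
    return np.linalg.norm(x - y)

def closest(dictionary, vec):
    """Return the key whose stored vector is nearest to vec, plus that vector."""
    min_dist = float("inf")
    closest_key = None
    closest_vec = None
    for key, val in dictionary.items():
        # values appear to be stored with a leading axis, so take the first row
        v = np.array(val)[0]

        d = dist(v, vec)

        if d < min_dist:
            min_dist = d
            closest_key = key
            closest_vec = val
    return closest_key, np.array(closest_vec)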
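
The linear scan in `closest` calls `dist` once per vocabulary entry. As a hedged alternative, the same nearest-neighbor lookup can be done in a single NumPy operation by stacking the vocabulary into a matrix once. `build_index` and `closest_vectorized` are hypothetical helpers, and the toy mapping assumes each value has shape `(1, dim)`, as the indexing in `closest` suggests.

```python
import numpy as np


def build_index(mapping):
    """Stack every word vector into one (vocab_size, dim) matrix."""
    words = list(mapping)
    matrix = np.stack([np.asarray(mapping[w], dtype=float).reshape(-1) for w in words])
    return words, matrix


def closest_vectorized(words, matrix, vec):
    """Return the word and vector with the smallest Euclidean distance to vec."""
    dists = np.linalg.norm(matrix - vec, axis=1)  # one distance per vocabulary row
    i = int(np.argmin(dists))
    return words[i], matrix[i]


# toy vocabulary, for illustration only
toy_mapping = {"cat": [[0.0, 0.0]], "dog": [[1.0, 1.0]], "pickle": [[5.0, 5.0]]}
words, matrix = build_index(toy_mapping)
word, vector = closest_vectorized(words, matrix, np.array([0.9, 1.2]))
print(word, vector)
```

Building the index once and reusing it for every generated token avoids re-converting each dictionary value to an array on every call.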

In [3]:
########## SET DIRECTORIES ##########
DATA_DIR = os.path.join("data", "train", "cleaned")
MAPPING_FILE = os.path.join("utils", "mapping.pkl")
RNN_MODEL_POS = os.path.join("models", "rnn_model_pos.hdf5")
RNN_MODEL_NO_POS = os.path.join("models", "rnn_model_no_pos.hdf5")

NUM_POS_TAGS = 47

########## IMPORT DATA ##########
with open(MAPPING_FILE, 'rb') as f:
    mapping = pickle.load(f)

In [4]:
########## LOAD MODEL ##########

########## NO GRAMMAR VERSION ##########
model_no_pos = load_model(RNN_MODEL_NO_POS)

########## GRAMMAR VERSION ##########
model_pos = load_model(RNN_MODEL_POS)

# Generate Sentences without Part of Speech Vector

In [9]:
########## NO GRAMMAR ##########
INCLUDE_POS = False

# set up start token
token = mapping['ST']
token = np.array(token)
token = np.reshape(token, (1,) + token.shape)

if INCLUDE_POS:
    final_shape = token.shape[-1] + NUM_POS_TAGS 
else:
    final_shape = token.shape[-1]

tmp = np.zeros(shape=(1,1,final_shape))
tmp[0,0,:len(token[0,0])] = token[0,0,:]
token = tmp
noise = np.random.rand(token.shape[0], token.shape[1], token.shape[2])
noise /= 10 #small amount of noise

en_count = 0

words = []
words.append('ST')

########## GENERATE WORDS ##########

print('ST', end=' ')

while en_count <= 20:
    out = model_no_pos.predict([token, noise])

    # snap the network's prediction to the closest real word and feed that
    # word's vector (not the raw prediction) back in as the next input,
    # so the model always conditions on real words
    closest_word, closest_vec = closest(mapping, out[0,0,:])
    token = np.zeros(shape=out.shape)
    token[0,0,:] = closest_vec

    # fix shapes
    tmp = np.zeros(shape=(1,1,final_shape))
    tmp[0,0,:len(out[0,0])] = out[0,0,:]
    out = tmp

    tmp = np.zeros(shape=(1,1,final_shape))
    tmp[0,0,:len(token[0,0])] = token[0,0,:]
    token = tmp

    noise = np.random.rand(token.shape[0], token.shape[1], token.shape[2])
    noise /= 10

    words.append(closest_word)
    
    if closest_word == "EN":
        en_count += 1
        print(closest_word)
    else:
        print(closest_word, end=' ')

ST EN
pickle. ST EN
pickle. a into myself ST EN
and a i'm ST EN
pickle. a into myself ST EN
do i ST EN
pickle. a into myself ST EN
pickle. i'm ST EN
pickle! a into myself ST EN
pickle. a into myself ST EN
pickle. ST EN
house. ST EN
pickle. ST EN
pickle. a into myself ST EN
pickle. ST EN
pickle. a into myself ST EN
pickle. a into myself ST EN
pickle! a into myself ST EN
pickle. i'm ST EN
pickle. a into myself ST EN
pickle. ST EN


---
We can see here that the network fails to associate ST with the start of a sentence: after the initial seed, ST appears only immediately before EN, never at the beginning of a line. The sentences are also short and repetitive. Let's see how the model does once the part-of-speech vectors are added!

# Generate Sentences with Part of Speech Vector

In [6]:
########## GRAMMAR ##########
INCLUDE_POS = True 

# set up start token
token = mapping['ST']
token = np.array(token)
token = np.reshape(token, (1,) + token.shape)

if INCLUDE_POS:
    final_shape = token.shape[-1] + NUM_POS_TAGS 
else:
    final_shape = token.shape[-1]

tmp = np.zeros(shape=(1,1,final_shape))
tmp[0,0,:len(token[0,0])] = token[0,0,:]
token = tmp
noise = np.random.rand(token.shape[0], token.shape[1], token.shape[2])
noise /= 10 #small amount of noise

en_count = 0

words = []
words.append('ST')

########## GENERATE WORDS ##########

print('ST', end=' ')

while en_count <= 20:
    out = model_pos.predict([token, noise])

    # snap the network's prediction to the closest real word and feed that
    # word's vector (not the raw prediction) back in as the next input,
    # so the model always conditions on real words
    closest_word, closest_vec = closest(mapping, out[0,0,:])
    token = np.zeros(shape=out.shape)
    token[0,0,:] = closest_vec

    # fix shapes
    tmp = np.zeros(shape=(1,1,final_shape))
    tmp[0,0,:len(out[0,0])] = out[0,0,:]
    out = tmp

    tmp = np.zeros(shape=(1,1,final_shape))
    tmp[0,0,:len(token[0,0])] = token[0,0,:]
    token = tmp

    noise = np.random.rand(token.shape[0], token.shape[1], token.shape[2])
    noise /= 10

    words.append(closest_word)
    
    if closest_word == "EN":
        en_count += 1
        print(closest_word)
    else:
        print(closest_word, end=' ')

ST from it turns week who from it turns myself rick, don't science. heels] science. rick, don't i layers i-i'm which layers i-i'm which mind. [beth which from it did [beth from it did from it ooh, all EN
from it ooh, all -- rick: EN
from it ooh, all -- rick: mean, who from it boom! that's stop stop terrorism which stop about dark which mind. dark i-i'm which layers science. science. layers which stop about all -- -- family sweetie. which mind. [beth from it oh, all EN
from it did [beth think think sweetie. layers science. rick, don't science. rick, don't science. science. rick, stop enter dark and [beth from it this layers that's stop about dark sweetie. which mind. [beth from it oh, all -- rick: mean, who from it turns magic i-i'm who beth: which mind. [beth from it did [beth from it pickle, all EN
from it turns week who from it ooh, and [beth from it oh, can, from it for, layers myself rick, stop my EN
from it turns myself layers which layers myself rick, don't i rick, don't science.

---
Overall, the results with parts of speech included seem better, but only in length and variety. This time the ST token is entirely absent (except for the initial seed), and every sentence begins with "from it". These sentences are still neither comprehensible nor cohesive. We believe that with a less naive encoding, a more fleshed-out architecture, and possibly a better way of encouraging diverse output, the RNN would be able to produce significantly better-structured sentences when grammatical context is present.