# Composing adjective-noun phrases into a metaphor-literal vector space

In [1]:
# load libraries
import numpy as np
from gensim import models

In [2]:
# Load word embeddings
# You can add more pretrained word embeddings to this collection,
# but loading each file into memory is time-consuming.
# Loading them all at once is not recommended.
embeddings = {
    'w2v-gnews': models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True),
}

## Adjective-Noun compositions

In [3]:
# Read the annotation file and skip phrases whose adjective or noun is missing from any embedding
phrase_annotate = []
with open('AN-phrase-annotations.csv') as f_csv:
    for i, line in enumerate(f_csv):
        if i == 0:
            continue
        
        adj, noun, is_meta, count = line.strip().split(',')
        
        is_oov = False
        for title, emb in embeddings.items():
            if noun[:-2] not in emb or adj[:-2] not in emb:
                is_oov = True

        if is_oov:
            continue
        
        phrase_annotate.append((
            adj[:-2],
            noun[:-2],
            1 if is_meta=='y' else 0,
            0 if count=='#N/A' else int(count))
        )

adjectives = set(adj for adj, _, _, _ in phrase_annotate)
nouns = set(n for _, n, _, _ in phrase_annotate)

print("""
{0:10} {nadj}
{1:10} {nn}
""".format('adjectives', 'nouns', nadj=len(adjectives),nn=len(nouns)))


adjectives 23
nouns      3418



# Training

In [4]:
import os
os.environ["CUDA_DEVICE_ORDER"]= "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]= "1"

import tensorflow as tf
from keras.models import Sequential, Model
from keras.layers import Dense, TimeDistributed
from keras.layers import Input, Flatten, Reshape, Merge, Lambda, merge

Using TensorFlow backend.


In [5]:
# shuffle the training data (a copy of the original order is kept first):
phrase_annotate_org = phrase_annotate[:]
np.random.shuffle(phrase_annotate)

## The Model

The architecture for composing vectors is similar to that of Mitchell and Lapata (2010):

\begin{equation}
\mathbf{p} = f(\mathbf{u}, \mathbf{v}; \theta)
\end{equation}

where $\mathbf{u}$ and $\mathbf{v}$ are the two word vectors to be composed, and $\mathbf{p}$ is the vector representation of their composition. The function $f$ is parameterized by $\theta$ (a list of parameters to be learned). Following Mitchell and Lapata (2008), it must encode latent information about the syntactic relation in the composition and world knowledge about each word (or whatever it represents).

The objective of our model is to learn the parameters of this function such that the phrase vectors of metaphoric compositions can be distinguished from those of literal compositions. To this end, we propose a neural network that learns these representations in a hidden layer.

As a result, the last layer before the literal/metaphoric prediction is taken as the compositional representation. The final weight matrix of the model is then a vector in the same space that indicates the direction of maximal metaphoricity, so degrees of metaphoricity can be compared by cosine similarity with this vector.
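The cosine comparison mentioned above can be sketched with NumPy. This is a minimal illustration, assuming the composed phrase vectors and the metaphoricity vector are available as arrays; the names `cosine_similarity`, `p_a`, `p_b`, and `q` and the toy values are hypothetical stand-ins, not values from the trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional stand-ins for the 300-dimensional vectors.
q = np.array([1.0, 0.0, 0.0])      # assumed metaphoricity direction
p_a = np.array([2.0, 0.5, 0.0])    # composed vector pointing mostly along q
p_b = np.array([-1.0, 1.0, 0.0])   # composed vector pointing away from q

sim_a = cosine_similarity(p_a, q)
sim_b = cosine_similarity(p_b, q)
print(sim_a > sim_b)  # p_a is ranked as more metaphorical than p_b
```

Under this reading, a phrase whose composed vector has a higher cosine similarity with `q` would be treated as more metaphorical.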


### First Architecture

One possible formulation is similar to the additive composition in Mitchell and Lapata (2010), but instead of scaling each vector by a scalar, a weight matrix transforms the feature dimensions of each vector, and an additive bias term $b$ is included:

\begin{equation}
\mathbf{p} = \mathbf{u}W_{adjective} + \mathbf{v} W_{noun} + b \\
W = \left[\begin{array}{l}
      W_{adjective} \\
      W_{noun}
    \end{array}\right]
\end{equation}

where $b$ is a bias term. With $\theta = (W, b)$, the composition function follows Mitchell and Lapata's formulation. It is also very similar to the composition model in Socher et al. (2011, 2012), except that where they apply a non-linearity $g$, we use the identity (a linear activation) with a bias:

\begin{equation}
\mathbf{p} = f_{\theta}(\mathbf{u}, \mathbf{v}) = [\mathbf{u} ; \mathbf{v}] W + b
\end{equation}
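A quick numerical check can make the equivalence of the two forms concrete. The sketch below assumes row-vector inputs and small random stand-in parameters (the dimensionality `d = 4` and all values are illustrative, not the trained weights): stacking $W_{adjective}$ on top of $W_{noun}$ and multiplying by the concatenation $[\mathbf{u} ; \mathbf{v}]$ gives the same result as the two separate products.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy dimensionality; the notebook uses 300

u = rng.normal(size=d)                 # adjective vector (stand-in)
v = rng.normal(size=d)                 # noun vector (stand-in)
W_adjective = rng.normal(size=(d, d))
W_noun = rng.normal(size=(d, d))
b = rng.normal(size=d)

# Stack the two blocks vertically so that W has shape (2d, d).
W = np.vstack([W_adjective, W_noun])

p_blocks = u @ W_adjective + v @ W_noun + b
p_concat = np.concatenate([u, v]) @ W + b

print(np.allclose(p_blocks, p_concat))
```

This mirrors the cell that follows, where a single `Dense(300)` layer on the 600-dimensional concatenated input plays the role of $W$ and $b$.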


In [6]:
# First choose an embedding for this part
# embedding {title, total-score, per-adjective-scores}
report = []
for title, emb in embeddings.items():

    ### Prepare the dataset
    # Create the training and testing dataset based on the given embedding:
    X_all = []
    y_all = []

    for adj, noun, is_met, _ in phrase_annotate:
        X_all.append([emb[adj], emb[noun]])
        y_all.append(is_met)

    X_all = np.array(X_all)
    y_all = np.array(y_all)

    # use the first 500 shuffled phrases for training and the rest for testing:
    test_split = 500 #int(len(phrase_annotate)/2)
    X_train, y_train = X_all[:test_split], y_all[:test_split]
    X_test, y_test   = X_all[test_split:], y_all[test_split:]
    
    
    ### Define the network layers
    # Compose two vectors (W)
    model_composer = Sequential()
    model_composer.add(Dense(300, activation='linear',input_shape=(600,)))

    # Map it to one measure (find a vector q that maximizes the prediction of metaphor)
    model_decoder = Sequential()
    model_decoder.add(Dense(1, activation='sigmoid', input_shape=(300,)))

    # Connecting models
    input_adj  = Input(shape=(300,))
    input_noun = Input(shape=(300,))
    input_seq  = merge([input_adj, input_noun], mode='concat', concat_axis=1)
    
    out_binary = model_decoder(
        model_composer(input_seq)
    )

    # final model specifications (loss, optimizer, etc.)
    final_model = Model(input=[input_adj, input_noun], output=out_binary)
    final_model.compile(optimizer='adam',
                  loss='binary_crossentropy', #good
                  #loss='mse', #good 
                  #loss='msle', #mehhh
                  #loss='cosine_proximity', #nope
                  metrics=['accuracy', 'recall', 'precision'])

    ### Train the network
    final_model.fit([X_train[:,0], X_train[:,1]], y_train, nb_epoch=20, batch_size=100, validation_split=0.0)

    
    ### Evaluate the trained network based on the test data
    score = final_model.evaluate([X_test[:,0], X_test[:,1]], y_test, batch_size=len(X_test))
    
    # print and save the report
    print("\n")
    print("Embedding:", title)
    for key, value in dict(zip(final_model.metrics_names, score)).items():
        print("{0:10} {1:0.4}".format(key, value))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


Embedding: w2v-gnews
acc        0.9358
recall     0.9433
loss       0.1832
precision  0.9374
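For readers who want to recompute the reported metrics outside Keras (for example, if the `'recall'` and `'precision'` metric strings are not available in the installed version), here is a minimal NumPy sketch. The helper `binary_metrics`, the 0.5 threshold, and the toy labels and scores are assumptions for illustration; they are not the notebook's test data.

```python
import numpy as np

def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision and recall for thresholded binary scores."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score) >= threshold
    tp = np.sum(y_pred & y_true)     # predicted metaphor, is metaphor
    fp = np.sum(y_pred & ~y_true)    # predicted metaphor, is literal
    fn = np.sum(~y_pred & y_true)    # predicted literal, is metaphor
    accuracy = float(np.mean(y_pred == y_true))
    precision = float(tp / (tp + fp)) if (tp + fp) > 0 else 0.0
    recall = float(tp / (tp + fn)) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# Hypothetical labels (1 = metaphorical) and sigmoid outputs.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.6, 0.4, 0.8, 0.1]
accuracy, precision, recall = binary_metrics(y_true, y_score)
print(accuracy, precision, recall)
```

With the trained model, the scores would presumably come from `final_model.predict` on the test inputs.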


### Second Architecture

This architecture is even simpler than the first one. In this model, the weight matrix is shared between the noun and the adjective:

\begin{equation}
\mathbf{p} = f_{\theta}(\mathbf{u}, \mathbf{v}) = \mathbf{u}W + \mathbf{v} W + b
\end{equation}

Notice that when comparing two compositions, $b$ is redundant.
A benefit of this model is that, at the same accuracy, the trained transformation can map any word vector into the new vector space, whether or not the word occurs in a compositional relation. In the new vector space, the sum of any two vectors can be compared with the metaphoricity vector.

In this new vector space, a simple addition operator composes two vectors. The resulting vector can be compared with the metaphor vector $\mathbf{q}$ to predict the metaphoricity of the composition, based on the hypothesis about the metaphoricity of adjective-noun phrases in Gutierrez et al. (2016).
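Two properties of the shared-weight form can be checked numerically. The sketch below uses random stand-in parameters (toy dimensionality `d = 4`; `compose` is a hypothetical helper name) and shows that transforming the sum equals summing the transformed words, and that $b$ drops out of the difference between two compositions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # toy dimensionality; the notebook uses 300
W = rng.normal(size=(d, d))
b = rng.normal(size=d)
u1, v1, u2, v2 = rng.normal(size=(4, d))  # two adjective-noun pairs (stand-ins)

def compose(u, v):
    """Shared-weight composition: uW + vW + b."""
    return u @ W + v @ W + b

# Mapping each word and adding gives the same vector as mapping the sum.
same = np.allclose(compose(u1, v1), (u1 + v1) @ W + b)

# The bias cancels in the difference between two compositions.
diff_with_b = compose(u1, v1) - compose(u2, v2)
diff_without_b = (u1 + v1) @ W - (u2 + v2) @ W

print(same, np.allclose(diff_with_b, diff_without_b))
```

The first property is why the cell below can sum the two inputs before a single `Dense(300)` layer.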

In [7]:
# First choose an embedding for this part
# embedding {title, total-score, per-adjective-scores}
report = []
for title, emb in embeddings.items():

    ### Prepare the dataset
    # Create the training and testing dataset based on the given embedding:
    X_all = []
    y_all = []

    for adj, noun, is_met, _ in phrase_annotate:
        X_all.append([emb[adj], emb[noun]])
        y_all.append(is_met)

    X_all = np.array(X_all)
    y_all = np.array(y_all)

    # use the first 500 shuffled phrases for training and the rest for testing:
    test_split = 500 #int(len(phrase_annotate)/2)
    X_train, y_train = X_all[:test_split], y_all[:test_split]
    X_test, y_test   = X_all[test_split:], y_all[test_split:]
    
    
    ### Define the network layers
    # Compose two vectors (W)
    model_composer = Sequential()
    model_composer.add(Dense(300, activation='linear',input_shape=(300,)))

    # Map it to one measure (find a vector q that maximizes the prediction of metaphor)
    model_decoder = Sequential()
    model_decoder.add(Dense(1, activation='sigmoid', input_shape=(300,)))

    # Connecting models
    input_adj  = Input(shape=(300,))
    input_noun = Input(shape=(300,))
    input_seq  = merge([input_adj, input_noun], mode='sum')  # element-wise sum; concat_axis has no effect here
    
    out_binary = model_decoder(
        model_composer(input_seq)
    )

    # final model specifications (loss, optimizer, etc.)
    final_model = Model(input=[input_adj, input_noun], output=out_binary)
    final_model.compile(optimizer='adam',
                  loss='binary_crossentropy', #good
                  #loss='mse', #good 
                  #loss='msle', #mehhh
                  #loss='cosine_proximity', #nope
                  metrics=['accuracy', 'recall', 'precision'])

    ### Train the network
    final_model.fit([X_train[:,0], X_train[:,1]], y_train, nb_epoch=20, batch_size=100, validation_split=0.0)

    
    ### Evaluate the trained network based on the test data
    score = final_model.evaluate([X_test[:,0], X_test[:,1]], y_test, batch_size=len(X_test))
    
    # print and save the report
    print("\n")
    print("Embedding:", title)
    for key, value in dict(zip(final_model.metrics_names, score)).items():
        print("{0:10} {1:0.4}".format(key, value))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


Embedding: w2v-gnews
acc        0.9334
recall     0.9384
loss       0.1829
precision  0.9376


### Metaphoricity vector


Since the objective is to minimize the distance between the predicted metaphoricity and the gold-standard label, which takes a value in $\{0, 1\}$ in our training data, the parameters of the final layer can be regarded as an indicator vector for maximum metaphoricity:

\begin{equation}
\mathbf{p} = f_{\theta}(\mathbf{u}, \mathbf{v}) = \mathbf{u}W + \mathbf{v} W + b_0 \\
\hat{y} = \sigma(\mathbf{p} \cdotp \mathbf{q} + b_1) = \frac{1}{1+e^{-(\mathbf{p} \cdotp \mathbf{q} + b_1)}}
\end{equation}

where $b_1$ is the bias of the final layer, $\mathbf{q}$ is the metaphoricity indicator vector, and $\hat{y}$ is the predicted metaphoricity score.
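The two equations above amount to a short forward pass. The following NumPy sketch assumes the shared-weight composition and uses small random stand-in parameters; `predict_metaphoricity` and all values are hypothetical, not the trained weights.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_metaphoricity(u, v, W, b0, q, b1):
    """Compose u and v with shared weights, then score the result against q."""
    p = u @ W + v @ W + b0       # composed phrase vector
    return sigmoid(p @ q + b1)   # predicted metaphoricity in (0, 1)

rng = np.random.default_rng(2)
d = 4                                  # toy dimensionality; the notebook uses 300
W = 0.1 * rng.normal(size=(d, d))      # stand-in for the composer weights
b0 = np.zeros(d)
q = rng.normal(size=d)                 # stand-in for the metaphoricity vector
b1 = 0.0
u = rng.normal(size=d)                 # adjective vector (stand-in)
v = rng.normal(size=d)                 # noun vector (stand-in)

y_hat = predict_metaphoricity(u, v, W, b0, q, b1)
print(0.0 < y_hat < 1.0)
```

In the notebook's Keras model, the learned $\mathbf{q}$ and $b_1$ would presumably be the kernel and bias returned by `model_decoder.get_weights()`.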

We fit $\theta$ on our labeled data using Adam, a variant of stochastic gradient descent, with a training set of size $T=500$:

\begin{equation}
    \begin{array}{r c l l}
        \mathbf{x} &=& (x_1, \ldots, x_T) & \text{adjective-noun pairs in the training dataset}\\
        \mathbf{y} &=& (y_1, \ldots, y_T) & \text{labels in the training dataset}\\
        \theta &=& (W, b_0, \mathbf{q}, b_1) & \\
        P(\mathbf{x}\ \text{are all metaphorical}) &=& \prod_{t=1}^{T}{P(x_t\ \text{is metaphorical})}& \\
        y_t &=& P(x_t\ \text{is metaphorical}) & \in \{0,1\}\\
        \hat{y}_t & = & \sigma(\mathbf{p}_t \cdotp \mathbf{q} + b_1) &\in (0, 1)  \\
        \mathcal{L}(\mathbf{x}) &=& -\sum_{t=1}^{T}\left[y_t \log(\hat{y}_t)+(1-y_t) \log(1-\hat{y}_t)\right] & 
    \end{array}
\end{equation}

where each $x_t$ is a pair of adjective and noun vectors, i.e. $\mathbf{u}$ and $\mathbf{v}$ in the previous equation.
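The loss $\mathcal{L}$ can be written directly in NumPy. This is a sketch under assumptions: the function name `binary_cross_entropy`, the clipping constant `eps`, and the toy labels and predictions are illustrative. Note that it sums over samples as in the formula, whereas Keras reports the mean.

```python
import numpy as np

def binary_cross_entropy(y_true, y_hat, eps=1e-7):
    """Summed binary cross-entropy between labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so that log() stays finite.
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    return float(-np.sum(y_true * np.log(y_hat)
                         + (1.0 - y_true) * np.log(1.0 - y_hat)))

y = [1, 0, 1]  # hypothetical gold labels (1 = metaphorical)
confident = binary_cross_entropy(y, [0.9, 0.1, 0.8])
uncertain = binary_cross_entropy(y, [0.5, 0.5, 0.5])
print(confident < uncertain)
```

Predictions that agree with the labels yield a lower loss than uninformative predictions of 0.5, which is what the optimizer exploits when fitting $\theta$.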