This file builds an automatic metric for redundancy, which may be used as a feature in the model we plan to create to predict redundancy.  The metric uses the Universal Sentence Encoder found [here](https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder-large/3) to create vector representations of sentences, then computes the squared cosine similarity for every pair of sentences in a summary.  Lastly, we take the mean of the squared cosine similarities and subtract it from 1.

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity
import sqlite3
import pandas as pd
import nltk
import itertools
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

Here we retrieve the post-edits from the SQLite DB, then keep only the first row for each system summary.  We only care about the system summaries, not the post-edits, so keeping one post-edit per system summary avoids extra work.

In [2]:
conn = sqlite3.connect("../data/cdm_postedits.db")
df = pd.read_sql("SELECT * FROM cdm_postedits", conn)
conn.close()
# Build a "system + id" key for every row and find the index of each key's first occurrence.
_unique_vals, indices = np.unique([row.system + str(row.id) for _, row in df.iterrows()], return_index=True)
# Keep one row per system summary, in the original row order.
df = df.iloc[indices].sort_index()
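
The same "first row per system summary" selection can also be sketched with pandas alone. This is a minimal illustration on a made-up frame, not the notebook's data: `toy` and its values are assumptions, and only the `system` and `id` column names come from the cell above. Comparing the two columns as a pair also avoids the ambiguity of a concatenated string key (for example, system `a1` with id `2` and system `a` with id `12` both concatenate to `a12`).

```python
import pandas as pd

# Toy stand-in for the post-edit table; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "system": ["a", "a", "a", "b"],
    "edit": ["first", "second", "third", "fourth"],
})

# Keep the first row for each (system, id) pair; the original row order is preserved.
first_per_summary = toy.drop_duplicates(subset=["system", "id"], keep="first")
print(first_per_summary)
```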

Here we split the system summaries into sentences and collect the sentences in a single list.  We will later use the Universal Sentence Encoder to create a vector for each sentence.  We keep track of how many sentences each system summary has so we can later group the sentence vectors by system summary.

In [4]:
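sentences = []        # all sentences from all summaries, in one flat list
sentence_counts = []  # number of sentences in each summary, in row order
for _i, row in df.iterrows():
    # Split the summary text into sentences with NLTK's sentence tokenizer.
    tokenized_sentences = nltk.sent_tokenize(row.system_summary)
    sentence_counts.append(len(tokenized_sentences))
    sentences.extend(tokenized_sentences)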

Here we get the embeddings for all sentences from the Universal Sentence Encoder.

In [22]:
tf.logging.set_verbosity(tf.logging.ERROR)  # silence INFO-level module loading logs
with tf.Graph().as_default():
    # Load the module and build one embedding op covering every sentence.
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
    embeddings = embed(sentences)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        sentence_embeddings = sess.run(embeddings)

Here we get ELMo embeddings for the same sentences, computed in batches.

In [5]:
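BATCH_SIZE = 200
elmo_batches = list(range(0, len(sentences), BATCH_SIZE))
elmo_sentence_embeddings = []
tf.logging.set_verbosity(tf.logging.ERROR)
with tf.Graph().as_default():
    embed = hub.Module("https://tfhub.dev/google/elmo/2", name="elmo")
    # Build the embedding op once and feed each batch through a placeholder,
    # rather than adding a new op to the graph on every loop iteration.
    batch_input = tf.placeholder(tf.string, shape=[None])
    embeddings = embed(batch_input)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        for batch_index in elmo_batches:
            print(batch_index)  # progress indicator
            batch = sentences[batch_index:batch_index + BATCH_SIZE]
            elmo_sentence_embeddings.extend(sess.run(embeddings, feed_dict={batch_input: batch}))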

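Function for calculating the sentence-based redundancy score: 1 minus the mean squared cosine similarity over all pairs of sentences in a summary, so a lower score means the sentences are more similar to one another. If a summary consists of only one sentence, there are no pairs to compare and we return 1.

args:
    **sentences** - the vector representations for sentences in a summary

returns:
    the score, i.e. 1 minus the mean squared cosine similarity between all pairs of sentences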
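
Written out, the description above corresponds to the following formula. The symbols are notation introduced here for illustration, not taken from the notebook: $v_1, \dots, v_n$ are the sentence vectors of one summary with $n \ge 2$, and there are $n(n-1)/2$ unordered pairs.

$$
\text{score} = 1 - \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \cos^2(v_i, v_j),
\qquad
\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}
$$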

In [14]:
def sent_redundancy(sentences):
    # With fewer than two sentences there are no pairs to compare.
    if len(sentences) < 2:
        return 1
    similarity_scores = []
    for first, second in itertools.combinations(sentences, 2):
        # cosine_similarity expects 2-D inputs and returns a 1x1 matrix here.
        similarity_scores.append(cosine_similarity([first], [second])[0][0] ** 2)
    # 1 minus the mean squared cosine similarity over all pairs.
    return 1 - (sum(similarity_scores) / len(similarity_scores))
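
To get a feel for the score's range, here is a small NumPy-only sketch of the same calculation on toy 2-d vectors. `pairwise_redundancy` and the example vectors are hypothetical names and values introduced for illustration; the sketch assumes no sentence vector is all zeros, since that would make the cosine similarity undefined.

```python
import itertools
import numpy as np

def pairwise_redundancy(vectors):
    """Return 1 minus the mean squared cosine similarity over all pairs of rows."""
    vectors = np.asarray(vectors, dtype=float)
    if len(vectors) < 2:
        return 1.0
    # Normalise each row so that dot products equal cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    squared = [float(np.dot(unit[i], unit[j])) ** 2
               for i, j in itertools.combinations(range(len(unit)), 2)]
    return 1.0 - sum(squared) / len(squared)

identical = [[1.0, 0.0], [2.0, 0.0]]            # same direction: similarity 1
orthogonal = [[1.0, 0.0], [0.0, 1.0]]           # unrelated directions: similarity 0
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]    # one identical pair out of three

print(pairwise_redundancy(identical), pairwise_redundancy(orthogonal), pairwise_redundancy(mixed))
```

Summaries whose sentences all point the same way score 0, and summaries with mutually orthogonal sentences score 1, which matches the "lower means more similar" reading of the function above.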

Here we calculate the sent_redundancy scores for the system summaries.

In [23]:
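index = 0  # offset of the current summary's first sentence in the flat list
sent_redundancy_scores = []
for count in sentence_counts:
    # Slice out this summary's sentence vectors and score them.
    system_sentence_vectors = sentence_embeddings[index:index+count]
    sent_redundancy_scores.append(sent_redundancy(system_sentence_vectors))
    index += count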
        
        

In [15]:
index = 0  # offset of the current summary's first sentence in the flat list
elmo_sent_redundancy_scores = []
for count in sentence_counts:
    # Same slicing as above, using the ELMo vectors.
    system_sentence_vectors = elmo_sentence_embeddings[index:index+count]
    elmo_sent_redundancy_scores.append(sent_redundancy(system_sentence_vectors))
    index += count
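
The running-offset pattern used in the two loops above can also be expressed with `np.split`. The sketch below is one possible alternative, not the notebook's method; `split_by_counts` and the toy array are hypothetical names introduced for illustration.

```python
import numpy as np

def split_by_counts(flat_vectors, counts):
    """Split a flat array of row vectors into consecutive groups of the given sizes."""
    flat_vectors = np.asarray(flat_vectors)
    if sum(counts) != len(flat_vectors):
        raise ValueError("counts must sum to the number of vectors")
    # Cumulative sums, without the last one, are the boundaries between groups.
    boundaries = np.cumsum(counts, dtype=int)[:-1]
    return np.split(flat_vectors, boundaries)

flat = np.arange(12).reshape(6, 2)          # six toy 2-d "sentence vectors"
groups = split_by_counts(flat, [1, 3, 2])   # three summaries with 1, 3 and 2 sentences
for group in groups:
    print(group.shape)
```

Each element of `groups` could then be passed to the scoring function in place of the manual `index:index+count` slice.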
        
        

In [None]:
df["elmo_redundancy"] = elmo_sent_redundancy_scores
df["use_redundancy"] = sent_redundancy_scores

conn = sqlite3.connect('../data/cdm_postedits.db')
c = conn.cursor()

# Reload the full table (all post-edits) and add empty score columns.
orig_df = pd.read_sql("SELECT * FROM cdm_postedits", conn)
orig_df["elmo_redundancy"] = None
orig_df["use_redundancy"] = None

# Copy each system summary's scores to every post-edit row with the same id and system.
for _, row in df.iterrows():
    matches = (orig_df.id == row.id) & (orig_df.system == row.system)
    orig_df.loc[matches, "elmo_redundancy"] = row.elmo_redundancy
    orig_df.loc[matches, "use_redundancy"] = row.use_redundancy
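
Copying scores onto every row that shares an `id` and `system` is essentially a left join, so a merge is one possible alternative to the row-by-row loop. The sketch below uses made-up frames: `all_rows`, `scores`, and their values are assumptions for illustration, and only the `id`, `system`, and `use_redundancy` column names come from the cell above.

```python
import pandas as pd

# Toy stand-ins for the full post-edit table and the per-summary scores.
all_rows = pd.DataFrame({
    "id": [1, 1, 2],
    "system": ["a", "a", "b"],
    "edit": ["e1", "e2", "e3"],
})
scores = pd.DataFrame({
    "id": [1, 2],
    "system": ["a", "b"],
    "use_redundancy": [0.25, 0.75],
})

# A left join keeps every post-edit row and attaches the matching score to each.
merged = all_rows.merge(scores, on=["id", "system"], how="left")
print(merged)
```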

Here we write the scores back to the .db file by recreating the table with the two new columns.

In [None]:
# Drop the old table so it can be recreated with the two new score columns.
c.execute('DROP TABLE IF EXISTS cdm_postedits;')
# Row values follow orig_df's column order, which must match the column order in CREATE TABLE below.
rows = orig_df.values.tolist()

# Create table
c.execute('''CREATE TABLE cdm_postedits
             (annotator_id integer, edit text, grammar integer, hter integer, id integer, overall integer, redundancy integer, reference text, sim real, system text, system_summary text, elmo_redundancy real, use_redundancy real)''')

# Insert all rows
c.executemany('INSERT INTO cdm_postedits VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)', rows)

# Save (commit) the changes
conn.commit()

conn.close()
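
Because the hand-written schema has to stay in step with the DataFrame's column order, `DataFrame.to_sql` is one possible alternative for the write-back step. The sketch below runs against an in-memory database with a made-up frame, so `toy_df`, `toy_postedits`, and the values are assumptions for illustration. Note that `to_sql` infers column types from the data, so the resulting schema may differ from the explicit one above.

```python
import sqlite3
import pandas as pd

# Toy frame standing in for orig_df; column names and values are illustrative only.
toy_df = pd.DataFrame({
    "id": [1, 2],
    "system": ["a", "b"],
    "use_redundancy": [0.25, 0.75],
})

conn = sqlite3.connect(":memory:")  # in-memory database, so nothing on disk is touched
# if_exists="replace" drops and recreates the table; index=False skips the DataFrame index.
toy_df.to_sql("toy_postedits", conn, if_exists="replace", index=False)
round_trip = pd.read_sql("SELECT * FROM toy_postedits", conn)
conn.close()
print(round_trip)
```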
