Copyright 2021, Jeffrey Stanton. Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0) feel free to create and distribute derivative works with attribution.

This notebook demonstrates how to ingest a Paraphrase Database (PPDB) phrasal equivalence file and organize it into a Pandas dataframe for further analysis. The notebook includes several diagnostic displays to help users better understand the data. For more details about PPDB, please consult:

Pavlick, E., Rastogi, P., Ganitkevitch, J., Van Durme, B., & Callison-Burch, C. (2015, July). PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (pp. 425-430).

You can download PPDB phrasal files from the PPDB website by clicking the appropriate buttons here: http://paraphrase.org/#/download
This code has been tested on the English small phrasal file, which unzips to about 1.35 GB. To test all of the steps in this notebook yourself, get a copy of the small phrasal file from the URL mentioned above, unzip it on your own computer, upload it to a convenient location on your Google Drive, and then change the pathname as noted in the third code block below.

In [None]:
import pandas as pd
import numpy as np
import time

from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# The PPDB raw files are quite large. This code assumes that you host
# the file on your Google drive. This code attaches your Google drive
# to the file store of your Colab notebook. The call to drive.mount() 
# will request an auth code and will provide a link to help you get
# that code. Assuming you are logged in or can log in to your Google
# Drive account, you can copy and paste the auth code into a small
# box that will appear at the end of this cell.
from google.colab import drive

drive_mounted = False

if drive_mounted == False: # These save time if you click Run All again
    ppdb_imported = False
    drive.mount("/content/gdrive", force_remount=True)
    drive_mounted = True

In [None]:
# Choose a pathname within your Google drive. The easy way to do this 
# on Colab is to use the file browser on the left to navigate to the
# file and then use the ... menu to copy the pathname. Replace the
# pathname shown in the next line of code.
pathname = '/content/gdrive/MyDrive/Data/ppdb-2.0-s-phrasal.txt'

# Import the PPDB file from Google Drive: Takes up to half a minute.
if not ppdb_imported and drive_mounted:
  with open(pathname, 'r') as f:
    ppdb = f.read()
  ppdb_imported = True # These save time if you click Run All again

In [None]:
ppdb[:1000] # Each individual record is delimited by a newline

In [None]:
# Create a list with one line/record per list entry
ppdb_lines = ppdb.split("\n")

In [None]:
type(ppdb_lines), len(ppdb_lines)

In [None]:
# Optionally remove ppdb to save RAM on small Jupyter VMs
del ppdb # Once we have parsed the lines, no need to keep the raw text
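
The cells above read the whole file into memory and then split it into lines. When RAM is tight and only the first few records are needed, one alternative is to read just those lines from disk. The following is a minimal sketch, not part of the original workflow: `read_first_lines` is a hypothetical helper, and a tiny temporary file with made-up records stands in for the PPDB file.

```python
import itertools
import os
import tempfile

def read_first_lines(path, max_lines):
    """Return up to max_lines lines from a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after max_lines, so the rest of the file is never read
        return [line.rstrip("\n") for line in itertools.islice(f, max_lines)]

# Demo on a small temporary file (the three records are made up)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("record one\nrecord two\nrecord three\n")
    demo_path = tmp.name

first_two = read_first_lines(demo_path, 2)
os.remove(demo_path)  # Clean up the temporary file
print(first_two)
```

With the real file, the same call could take the Drive pathname and the value of max_lines used later in this notebook.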

In [None]:
# This is an explorer routine to use to diagnose the contents of the PPDB. 
# This code was developed to address the PPDB 2.0 phrasal dataset.
def fields_from_line(line):
  """
  Show the field names from any line of a PPDB file. Expects a text string
  that contains one line from a PPDB file.
  """
  # Build the list of column names, starting with the three text fields
  col_list = ["POS", "LHS", "RHS"] # The first three fields
  att_str = line.split('|||')[3].strip() # All the junk after the first 3 fields
  # This extracts the first string in each field of the form name=1.2345
  col_list += [item.split("=")[0] for item in att_str.split(" ")] 

  return col_list


In [None]:
ppdb_lines[1000]

In [None]:
fields_from_line(ppdb_lines[1000]) # Show the field names found in record 1000
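
To make the record layout concrete, here is a small sketch that splits one record into its sections by hand. The sample line and its values are made up for illustration; it only assumes the `|||` section delimiter and the `name=value` attribute format that the function above relies on.

```python
# A made-up record in the same layout as a PPDB 2.0 phrasal line
sample_line = ("[NN] ||| car ||| automobile ||| "
               "PPDB2.0Score=3.5 PPDB1.0Score=2.1 Identity=0 ||| 0-0 ||| Equivalence")

# The major sections are separated by '|||'
sections = [s.strip() for s in sample_line.split("|||")]
pos, lhs, rhs, attr_str = sections[:4]

# Each attribute has the form name=value; split on the first '=' only
attrs = dict(item.split("=", 1) for item in attr_str.split(" "))

print(pos, lhs, rhs)
print(attrs)
```

Note that the attribute values come out as strings, which is why a numeric conversion step is needed before analysis.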

These field definitions are quoted mostly verbatim from this supplement (https://aclanthology.org/attachments/P15-2070.Notes.pdf) and from other papers published by Pavlick et al. Apart from POS, LHS, and RHS, each field holds a numeric quantity, but the corresponding column in the Pandas dataframe initially stores it as text.

* 'POS' - A part of speech tag sequence summarizing the LHS
* 'LHS' - The left hand side phrase (AKA source)
* 'RHS' - The right hand side phrase (AKA target)
* 'PPDB2.0Score'
* 'PPDB1.0Score' - the score used to rank paraphrases in the original release of PPDB, computed according to the heuristic weighting given in the paper
* '-logp(LHS|e1)'
* '-logp(LHS|e2)'
* '-logp(e1|LHS)'
* '-logp(e1|e2)' - Negative log probability of e1 given e2 based on Bannard and Callison-Burch (2005)
* '-logp(e1|e2,LHS)'
* '-logp(e2|LHS)'
* '-logp(e2|e1)' - Negative log probability of e2 given e1 based on Bannard and Callison-Burch (2005)
* '-logp(e2|e1,LHS)'
* 'AGigaSim' - the distributional similarity of e1 and e2, computed according to contexts observed in the Annotated Gigaword corpus (Napoles et al., 2012)
* 'Abstract' - a binary feature that indicates whether the rule is composed exclusively of nonterminal symbols.
* 'Adjacent'
* 'CharCountDiff' - The difference in the number of characters in RHS versus LHS strings; negative when LHS is longer
* 'CharLogCR' - the log-compression ratio in characters, log chars(f2), a feature used in sentence compression
* 'ContainsX' - a binary feature that indicates whether the rule contains the nonterminal symbol X (see Chiang, 2007)
* 'Equivalence' - predicted probability that the paraphrase pair represents semantic equivalence (e1 entails e2 and e2 entails e1), according to the model used in Pavlick et al. (2015)
* 'Exclusion' - predicted probability that the paraphrase pair represents semantic exclusion
* 'GlueRule' - a binary feature that indicates whether this is a "glue rule" (see Post et al., 2013)
* 'GoogleNgramSim' - the distributional similarity of e1 and e2, computed according to contexts observed in the Google Ngram corpus (Brants and Franz, 2006)
* 'Identity' - a binary feature that indicates whether the phrase is identical to the paraphrase. Note that these should generally be excluded from an analysis.
* 'Independent' - predicted probability that the paraphrase pair represents semantic independence.
* 'Lex(e1|e2)' - the “lexical translation” probability of e1 (the LHS phrase) given e2 (the RHS phrase) (see Koehn et al., 2003)
* 'Lex(e2|e1)' - the “lexical translation” probability of e2 (the RHS phrase) given e1 (the LHS phrase)
* 'Lexical' - a binary feature that says whether this is a single word paraphrase
* 'LogCount' - the log of the frequency estimate for this paraphrase pair.
* 'MVLSASim' - Cosine similarity according to the Multiview Latent Semantic Analysis embeddings described by Rastogi et al. (2015)
* 'Monotonic' - a binary feature that indicates whether multiple nonterminal symbols occur in the same order (are monotonic) or if they are re-ordered
* 'OtherRelated' - predicted probability that the paraphrase pair represents topical relatedness but not entailment
* 'PhrasePenalty' - this feature is used by the decoder to count how many rules it uses in a derivation
* 'RarityPenalty' - marks rules that have only been seen a handful of times. It is calculated as exp(1 − c(e1, e2)), where c(e1, e2) is the estimate of the frequency of this paraphrase pair
* 'ReverseEntailment' - predicted probability that the target phrase entails the source phrase; either this feature or the ForwardEntailment feature will be present, but not both
* 'SourceTerminalsButNoTarget' - a binary feature showing when the source phrase contains terminal symbols, but the target phrase contains no terminal symbols
* 'SourceWords' - The word count for the LHS phrase
* 'TargetTerminalsButNoSource' - a binary feature showing when the target phrase contains terminal symbols, but the source phrase contains no terminal symbols
* 'TargetWords' - The word count for the RHS phrase
* 'UnalignedSource' - a binary feature showing if there are any words in the source phrase that are not aligned to any words in the target phrase
* 'UnalignedTarget' - a binary feature showing if there are any words in the target phrase that are not aligned to any words in the source phrase
* 'WordCountDiff' - The difference in word count between RHS and LHS phrases; a negative number means LHS phrase is longer
* 'WordLenDiff' - the difference in average word length between the source phrase and the target phrase
* 'WordLogCR' - the log-compression ratio in words, estimated as log(words(e) / words(f))
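
A few of these fields can be recomputed directly from the phrases, which is a handy sanity check on the definitions. The sketch below follows the descriptions above for 'WordCountDiff', 'CharCountDiff' and 'RarityPenalty'. The helper names are hypothetical, and counting spaces as characters is an assumption that may not match how the stored PPDB values were computed.

```python
import math

def count_diffs(lhs, rhs):
    """Return (WordCountDiff, CharCountDiff), each computed as RHS minus LHS."""
    word_diff = len(rhs.split()) - len(lhs.split())
    char_diff = len(rhs) - len(lhs)  # Assumption: spaces count as characters
    return word_diff, char_diff

def rarity_penalty(count):
    """exp(1 - c), where c is the estimated frequency of the paraphrase pair."""
    return math.exp(1 - count)

diffs = count_diffs("in a timely manner", "promptly")
penalty = rarity_penalty(1)
print(diffs)    # Both values are negative because the LHS phrase is longer
print(penalty)
```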


In [None]:
# This function eats up a list of lines and turns it into a Pandas dataframe
# that will have the same fields as shown in the diagnostic above.


def load_ppdb_lines(lines):
    """
    Load a paraphrase file from a list of input strings.

    Note that the resulting pandas dataframe can be quite large and may occupy
    a substantial amount of RAM, depending on the number of records provided
    by the input.

    :param lines: list of lines from the input file
    :return: a pandas data frame with POS, LHS, RHS and 
    the various numeric attributes 
    """

    rows = [] # We will build up a list of dictionaries that we will use to
              # construct the Pandas data frame at the end

    for line in lines:
        
        # discard lines with unicode character encoding issues
        if '\\ x' in line or 'xc3' in line:
            continue
        
        fields = line.split('|||') # These are the major sections of each entry
        
        # Error check: skip lines with fewer than four sections and make sure
        # that both phrases contain at least one character.
        if len(fields) < 4 or len(fields[1].strip()) == 0 or len(fields[2].strip()) == 0:
            continue
        
        tpd_dict = {"POS": fields[0].strip()} # Create a dictionary of available values
        tpd_dict.update( {"LHS" : fields[1].strip()} )
        tpd_dict.update( {"RHS" : fields[2].strip()} )
        
        attr = fields[3].strip() # Grab the list of statistics

        for a in attr.split(" "): # Parse the list of values
          keyval = a.split("=") # This yields the name of the attribute and its value
          tpd_dict.update( {keyval[0].strip(): keyval[1].strip()} )

        # print(tpd_dict)
        rows.append(tpd_dict) # Add this dictionary to the list

    return pd.DataFrame(rows) # One row per dictionary; missing keys become NaN
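
Building the dataframe from a list of dictionaries has a useful side effect: records do not all need the same attributes. As a small illustration with made-up values, the two rows below each carry only one of the two entailment attributes, and pandas fills the missing cells with NaN.

```python
import pandas as pd

# Two made-up records; each has only one of the entailment attributes
rows = [
    {"LHS": "car", "RHS": "automobile", "ReverseEntailment": "0.12"},
    {"LHS": "dog", "RHS": "animal", "ForwardEntailment": "0.80"},
]

demo_df = pd.DataFrame(rows)  # Columns are the union of all dictionary keys
print(demo_df)
```

This is one source of the NaN values that show up in the attribute columns later in the notebook.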

In [None]:
# This takes about a minute for a million records. Choose a value for max_lines
# suitable for your task. Uncomment the following line to read in the whole db.
# max_lines = len(ppdb_lines) # process whole dataset; note, uses much RAM
max_lines = 100000

t0 = time.perf_counter() # Time the process
ppdb_df = load_ppdb_lines(ppdb_lines[:max_lines])
t1 = time.perf_counter()
print("Execution time per line:", (t1 - t0)/max_lines * 1000 , "milliseconds.")
print("Total time:", (t1 - t0), "seconds.")

In [None]:
# Optionally delete the ppdb_lines object to save RAM
del ppdb_lines

In [None]:
ppdb_df.shape

In [None]:
ppdb_df.head(10)

In [None]:
# Now proceed to whatever subsequent analysis is going to be done. Remember 
# that the numeric fields are currently stored as text, so must be converted
# before analysis. For example, here is how to calc the correlation between 
# PPDB 1.0 scores and PPDB 2.0 scores:
ppdb_df['ppdb1'] = pd.to_numeric(ppdb_df['PPDB1.0Score'], errors='coerce')
ppdb_df['ppdb2'] = pd.to_numeric(ppdb_df['PPDB2.0Score'], errors='coerce')

ppdb_df['ppdb1'].corr(ppdb_df['ppdb2'])

In [None]:
# Let's convert all of the string columns containing numbers to actual numbers
columns = list(ppdb_df)

for i in range(3, len(columns)):
  ppdb_df[columns[i]] = pd.to_numeric(ppdb_df[columns[i]], errors='coerce')
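
The `errors='coerce'` argument matters here: any string that cannot be parsed as a number becomes NaN instead of raising an error. A minimal sketch on a made-up two-row dataframe:

```python
import pandas as pd

# Made-up data: the second score is deliberately not a number
demo = pd.DataFrame({
    "LHS": ["car", "quickly"],
    "PPDB2.0Score": ["3.5", "not-a-number"],
})

demo["PPDB2.0Score"] = pd.to_numeric(demo["PPDB2.0Score"], errors="coerce")
n_missing = int(demo["PPDB2.0Score"].isna().sum())

print(demo.dtypes)
print("NaN count:", n_missing)
```

Counting the NaN values after conversion is a quick way to see how many entries failed to parse.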

In [None]:
ppdb_df.to_csv("ppdb.csv") # Save a csv file in case we would like to reload later
# If using colab, download this file before the VM shuts down.

In [None]:
from google.colab import files
files.download("ppdb.csv")

In [None]:
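# Now, optionally read the data structure back in if you are restarting from 
# this point. Uncomment these lines to start here (this assumes that you
# copied ppdb.csv to the Drive folder shown below):
# import pandas as pd
# pathname = '/content/gdrive/MyDrive/Data/ppdb.csv'
#
# ppdb_df = pd.read_csv(pathname, index_col=0) # Column 0 is the saved row index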

From this point, one can use the brief texts stored in the Pandas dataframe to perform subsequent analyses. Some analyses might be enhanced by the precalculated values that are also stored in the dataframe. For example, the dataframe holds two measures of distributional similarity (AGigaSim and GoogleNgramSim) that could be compared with other measures of similarity. Don't forget to adapt the code in the earlier conversion block to convert any attributes that are still stored as text into numeric values. Note that some of the attribute fields contain NaN, so you may have to clean the fields or subset the data to suit your particular problem domain.
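
One possible way to condition the data is sketched below. It keeps only non-identical pairs that have both similarity values present. The dataframe and its values are made up for illustration, and the column names are assumed to match those in the field list above.

```python
import numpy as np
import pandas as pd

# Made-up rows: one identical pair and one pair with a missing similarity
demo = pd.DataFrame({
    "LHS": ["car", "car", "dog", "house"],
    "RHS": ["automobile", "car", "hound", "home"],
    "Identity": [0, 1, 0, 0],
    "AGigaSim": [0.71, 1.00, np.nan, 0.55],
    "GoogleNgramSim": [0.64, 1.00, 0.40, 0.52],
})

# Keep non-identical pairs that have both similarity values
has_both = demo[["AGigaSim", "GoogleNgramSim"]].notna().all(axis=1)
subset = demo[(demo["Identity"] == 0) & has_both]
print(subset)
```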

In [None]:
def hist(dfcol):
  """
  Here's a function that creates a histogram from a pd series, converting
  if necessary from a string to numeric.

  :param dfcol: a column from a pandas dataframe
  :return: None
  """

  import matplotlib.pyplot as plt
  import numpy as np
  import pandas as pd

  if len(dfcol) < 1:
    print("Error: Need at least one data point for histogram.")
    return None

  # Convert to numeric if the values are strings; otherwise plt.hist()
  # would draw a bar plot of the distinct strings. Drop NaN values,
  # which plt.hist() cannot bin.
  dfcol = pd.to_numeric(dfcol, errors='coerce').dropna()

  plt.hist(dfcol)

  return None


In [None]:
def scatter(dfcol1, dfcol2):
  """
  Here's a function that creates a scattergram from two pd series, converting
  if necessary from string to numeric.

  :param dfcol1: a column from a pandas dataframe
  :param dfcol2: a column from a pandas dataframe
  :return: None
  """  
  
  import matplotlib.pyplot as plt
  import numpy as np
  import pandas as pd

  if len(dfcol1) < 1:
    print("Error: Need at least one data point for scattergram.")
    return None

  if len(dfcol1) != len(dfcol2):
    print("Error: The two vectors are different lengths.")
    return None

  # Convert to numeric if the values are strings; otherwise plt.scatter()
  # would treat them as categories rather than numbers.
  dfcol1 = pd.to_numeric(dfcol1, errors='coerce')
  dfcol2 = pd.to_numeric(dfcol2, errors='coerce')

  plt.scatter(dfcol1, dfcol2)

  return None

In [None]:
def bicorr(dfcol1, dfcol2):
  """
  Here's a function that creates a correlation from two pd series, converting
  if necessary from string to numeric.

  :param dfcol1: a column from a pandas dataframe
  :param dfcol2: a column from a pandas dataframe
  :return: Pearson's correlation coefficient
  """  
  import pandas as pd
  import numpy as np

  if len(dfcol1) < 2:
    print("Error: Need at least two data points for correlation.")
    return None

  if len(dfcol1) != len(dfcol2):
    print("Error: The two vectors are different lengths.")
    return None

  # Convert to numeric if the values are strings. Series.corr() skips
  # any pair in which either value is NaN.
  dfcol1 = pd.to_numeric(dfcol1, errors='coerce')
  dfcol2 = pd.to_numeric(dfcol2, errors='coerce')

  return dfcol1.corr(dfcol2)
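
Missing values are the main reason to prefer `Series.corr()` over a raw NumPy call here. The sketch below uses two short made-up series. The pandas result matches `np.corrcoef()` on the rows where both values are present, while `np.corrcoef()` on the raw series returns NaN.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each series
a = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
b = pd.Series([2.0, 1.0, 3.0, np.nan, 6.0])

r_pandas = a.corr(b)  # Uses only rows where both values are present

both = a.notna() & b.notna()
r_numpy = np.corrcoef(a[both], b[both])[0, 1]  # Same rows, selected by hand
r_raw = np.corrcoef(a, b)[0, 1]                # NaN propagates through

print(r_pandas, r_numpy, r_raw)
```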


In [None]:
hist(ppdb_df['ppdb1'])

In [None]:
#hist(ppdb_df['ppdb2'])

In [None]:
scatter(ppdb_df['ppdb1'], ppdb_df['ppdb2'])

In [None]:
#hist(ppdb_df['AGigaSim'])

In [None]:
#hist(ppdb_df['GoogleNgramSim'])

In [None]:
#scatter(ppdb_df['AGigaSim'], ppdb_df['GoogleNgramSim'])

In [None]:
bicorr(ppdb_df['AGigaSim'], ppdb_df['GoogleNgramSim'])

In [None]:
# We can also sample texts from the main dataframe to create a new df.
# Each sample is collected as a row in a list and the dataframe is built
# once at the end (DataFrame.append() was removed in pandas 2.0).
# Takes about 2 minutes for 5000 samples

text_samps = 1000
col_list = ["LHS", "RHS", "RandText", "AGigaSim", "GoogleNgramSim"]

sample_rows = [] # One list of five values per sample

for i in range(text_samps):
  # Start with a single row randomly sampled; using seed to make results reproducible
  lrdf = ppdb_df.sample(n = 1, random_state=i)
  lhs = lrdf["LHS"].values[0] # Peel off the first phrase
  rhs = lrdf["RHS"].values[0] # Peel off the second phrase
  agiga = lrdf["AGigaSim"].values[0] # Save the similarity
  googn = lrdf["GoogleNgramSim"].values[0] # Save the similarity

  # Now sample a second row to get an additional text; the seed is offset
  # so that this draw is also reproducible but uses a different seed
  lrdf2 = ppdb_df.sample(n = 1, random_state=text_samps + i)
  randtext = lrdf2["RHS"].values[0] # Save the second text

  sample_rows.append([lhs, rhs, randtext, agiga, googn])

text_df = pd.DataFrame(sample_rows, columns=col_list)

len(text_df), type(text_df)
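
If the row-by-row loop is too slow, a similar table can be drawn with two calls to `DataFrame.sample()`. This is a sketch of an alternative, not the procedure used above: `sample_text_pairs` is a hypothetical helper, the demo data are made up, and sampling all rows in one call draws without replacement, so the results will not match the loop.

```python
import pandas as pd

def sample_text_pairs(df, n, seed=0):
    """Sample n paraphrase pairs plus n unrelated RHS texts in two calls."""
    pairs = df.sample(n=n, random_state=seed).reset_index(drop=True)
    rand = df.sample(n=n, random_state=seed + 1).reset_index(drop=True)
    out = pairs[["LHS", "RHS", "AGigaSim", "GoogleNgramSim"]].copy()
    # Put the random text in the third column, as in the loop version
    out.insert(2, "RandText", rand["RHS"].to_numpy())
    return out

# Made-up stand-in for ppdb_df
demo = pd.DataFrame({
    "LHS": ["car", "dog", "house", "fast"],
    "RHS": ["automobile", "hound", "home", "quick"],
    "AGigaSim": [0.7, 0.6, 0.5, 0.8],
    "GoogleNgramSim": [0.6, 0.4, 0.5, 0.7],
})

text_demo = sample_text_pairs(demo, n=3, seed=42)
print(text_demo)
```

As with the loop, the random text can occasionally come from the same row as the pair it sits next to, so such rows may need to be filtered out.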

In [None]:
text_df

In [None]:
!pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
xflist = ['distiluse-base-multilingual-cased-v2',
'distiluse-base-multilingual-cased',
'distiluse-base-multilingual-cased-v1',
'paraphrase-multilingual-MiniLM-L12-v2',
'clip-ViT-B-32-multilingual-v1',
'all-mpnet-base-v2',
'clip-ViT-B-32',
'msmarco-bert-co-condensor',
'msmarco-distilbert-cos-v5',
'msmarco-MiniLM-L12-cos-v5',
'msmarco-MiniLM-L6-cos-v5',
'facebook-dpr-ctx_encoder-multiset-base',
'msmarco-bert-base-dot-v5',
'msmarco-distilbert-dot-v5',
'all-roberta-large-v1',
'paraphrase-distilroberta-base-v2',
'all-mpnet-base-v1',
'paraphrase-mpnet-base-v2',
'paraphrase-MiniLM-L12-v2',
'paraphrase-MiniLM-L6-v2',
'paraphrase-MiniLM-L3-v2',
'all-MiniLM-L12-v2',
'all-MiniLM-L12-v1',
'all-MiniLM-L6-v2',
'all-MiniLM-L6-v1',
'all-distilroberta-v1',
'multi-qa-mpnet-base-cos-v1',
'msmarco-distilbert-base-tas-b',
'multi-qa-mpnet-base-dot-v1',
'multi-qa-distilbert-cos-v1',
'multi-qa-MiniLM-L6-cos-v1',
'multi-qa-distilbert-dot-v1',
'multi-qa-MiniLM-L6-dot-v1',
'xlm-r-large-en-ko-nli-ststb',
'xlm-r-distilroberta-base-paraphrase-v1',
'xlm-r-bert-base-nli-stsb-mean-tokens',
'xlm-r-bert-base-nli-mean-tokens',
'xlm-r-base-en-ko-nli-ststb',
'xlm-r-100langs-bert-base-nli-stsb-mean-tokens',
'xlm-r-100langs-bert-base-nli-mean-tokens',
'stsb-xlm-r-multilingual',
'stsb-roberta-large',
'stsb-roberta-base',
'stsb-roberta-base-v2',
'stsb-mpnet-base-v2',
'stsb-distilroberta-base-v2',
'stsb-distilbert-base',
'stsb-bert-large',
'stsb-bert-base',
'roberta-large-nli-stsb-mean-tokens',
'roberta-large-nli-mean-tokens',
'roberta-base-nli-stsb-mean-tokens',
'roberta-base-nli-mean-tokens',
'quora-distilbert-multilingual',
'quora-distilbert-base',
'paraphrase-xlm-r-multilingual-v1',
'paraphrase-multilingual-mpnet-base-v2',
'paraphrase-distilroberta-base-v1',
'paraphrase-albert-small-v2',
'paraphrase-albert-base-v2',
'paraphrase-TinyBERT-L6-v2',
'nq-distilbert-base-v1',
'nli-roberta-large',
'nli-roberta-base',
'nli-roberta-base-v2',
'nli-mpnet-base-v2',
'nli-distilroberta-base-v2',
'nli-distilbert-base',
'nli-distilbert-base-max-pooling',
'nli-bert-large',
'nli-bert-large-max-pooling',
'nli-bert-large-cls-pooling',
'nli-bert-base',
'nli-bert-base-max-pooling',
'nli-bert-base-cls-pooling',
'msmarco-roberta-base-v3',
'msmarco-roberta-base-v2',
'msmarco-roberta-base-ance-firstp',
'msmarco-distilroberta-base-v2',
'msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch',
'msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned',
'msmarco-distilbert-base-v4',
'msmarco-distilbert-base-v3',
'msmarco-distilbert-base-v2',
'msmarco-distilbert-base-dot-prod-v3',
'msmarco-MiniLM-L-6-v3',
'msmarco-MiniLM-L-12-v3',
'facebook-dpr-question_encoder-single-nq-base',
'facebook-dpr-question_encoder-multiset-base',
'facebook-dpr-ctx_encoder-single-nq-base',
'distilroberta-base-paraphrase-v1',
'distilroberta-base-msmarco-v2',
'distilroberta-base-msmarco-v1',
'distilbert-multilingual-nli-stsb-quora-ranking',
'distilbert-base-nli-stsb-quora-ranking',
'distilbert-base-nli-stsb-mean-tokens',
'distilbert-base-nli-mean-tokens',
'distilbert-base-nli-max-tokens',
'bert-large-nli-stsb-mean-tokens',
'bert-large-nli-mean-tokens',
'bert-large-nli-max-tokens',
'bert-large-nli-cls-token',
'bert-base-wikipedia-sections-mean-tokens',
'bert-base-nli-stsb-mean-tokens',
'bert-base-nli-mean-tokens',
'bert-base-nli-max-tokens',
'bert-base-nli-cls-token',
'average_word_embeddings_levy_dependency',
'average_word_embeddings_komninos',
'average_word_embeddings_glove.840B.300d',
'average_word_embeddings_glove.6B.300d',
'allenai-specter',
'LaBSE']

In [None]:
len(xflist)

In [None]:
pathname = '/content/gdrive/MyDrive/Data/text_df.csv'

text_df.to_csv(pathname, index=False, encoding='utf-8-sig')
lhs = text_df["LHS"].values
rhs = text_df["RHS"].values
randtext = text_df["RandText"].values
del text_df

In [None]:
# Now process similarity values for each transformer in the list
import time

# The notebook doesn't have enough disk to process all 113 models.
# So do a range in any given run. Modify the following two lines
# to cover the range you want.
#startm = 0
startm = 0
#endm = len(xflist)
endm = 113

#time_list = []

# Create Log File for each Transformer
log_df = pd.DataFrame(columns=['Transformer_Index','Transformer_Name',"Time_Elapsed"])

pathname = '/content/gdrive/MyDrive/Data/text_df.csv'

# Work our way through the list of transformers
for i in range(startm, endm):
  print("Model:", i)
  t0 = time.perf_counter() # Time the sentence process: Capture the start time

  model = SentenceTransformer(xflist[i])

  # Compute sentence summaries: Model encode will handle a list of text values
  vleft = model.encode(lhs)
  vright = model.encode(rhs)
  vrand = model.encode(randtext)

  del model # Recoup the memory

  # Now compute similarities: There are three for each entry in the df
  lr_list = [] # Similarity between left and right
  lrand_list = [] # Similarity between left and rand
  rrand_list = [] # Similarity between right and rand

  for j in range(text_samps):
    # cosine_similarity() expects 2-D inputs, hence the reshape to one row
    lr_list.append(cosine_similarity(vleft[j].reshape(1, -1), vright[j].reshape(1, -1))[0][0])
    lrand_list.append(cosine_similarity(vleft[j].reshape(1, -1), vrand[j].reshape(1, -1))[0][0])
    rrand_list.append(cosine_similarity(vright[j].reshape(1, -1), vrand[j].reshape(1, -1))[0][0])

  log_df.loc[len(log_df)] = [i, xflist[i], time.perf_counter()-t0]
  #t1 = time.perf_counter()
  #time_list.append(t1 - t0)
  
  text_df = pd.read_csv(pathname)
  # Finally, add the columns to the dataset
  text_df.insert(4, "RRANDsim" + str(i), rrand_list, False)
  text_df.insert(4, "LRANDsim" + str(i), lrand_list, False)
  text_df.insert(4, "LRsim" + str(i), lr_list, False)

  text_df.to_csv(pathname, index=False, encoding='utf-8-sig')
  del text_df
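
The loop above calls `cosine_similarity()` once per pair of rows. The same row-by-row similarities can also be computed for all rows at once with NumPy. The following is a minimal sketch under that assumption: `rowwise_cosine` is a hypothetical helper, and the small arrays stand in for the embedding matrices returned by `model.encode()`.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two 2-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=1)  # Dot product of each row pair
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms  # Assumes no row is all zeros

# Made-up stand-ins for two embedding matrices
va = np.array([[1.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
vb = np.array([[1.0, 0.0], [1.0, -1.0], [4.0, 3.0]])

sims = rowwise_cosine(va, vb)
print(sims)
```

Unlike this sketch, a library routine may handle all-zero vectors gracefully, so a guard would be needed if such rows can occur.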

In [None]:
log_df

In [None]:
text_df = pd.read_csv(pathname) # Reload from Drive; the loop deletes text_df after each save
text_df

In [None]:
from google.colab import files

text_df.to_csv("ppdb_sim_rest.csv")
files.download("ppdb_sim_rest.csv")

In [None]:
log_df.to_csv("ppdb_sim_timelog_0_34.csv")
files.download("ppdb_sim_timelog_0_34.csv")

In [None]:
run_time_df = pd.DataFrame(columns=['Model_Index','Model_Name','Run_Time','Run_Type'])

In [None]:
# Calculate consistency of run time

# BY PICKING RANDOM MODEL AND LOADING THE MODEL IN EVERY ITERATION
import random
model_index_to_test = random.randint(0,len(xflist)-1)
print("Model:", model_index_to_test)

for iteration in range(100):
  print("Iteration:", iteration)
  t0 = time.perf_counter() # Time the sentence process: Capture the start time

  model = SentenceTransformer(xflist[model_index_to_test])

  # Compute sentence summaries: Model encode will handle a list of text values
  vleft = model.encode(lhs)
  vright = model.encode(rhs)
  vrand = model.encode(randtext)

  del model # Recoup the memory

  # Now compute similarities: There are three for each entry in the df
  lr_list = [] # Similarity between left and right
  lrand_list = [] # Similarity between left and rand
  rrand_list = [] # Similarity between right and rand

  for j in range(text_samps):
    lr_list.append(cosine_similarity(vleft[j].reshape(1, -1), vright[j].reshape(1, -1))[0][0])
    lrand_list.append(cosine_similarity(vleft[j].reshape(1, -1), vrand[j].reshape(1, -1))[0][0])
    rrand_list.append(cosine_similarity(vright[j].reshape(1, -1), vrand[j].reshape(1, -1))[0][0])

  run_time_df.loc[len(run_time_df)] = [model_index_to_test, xflist[model_index_to_test], time.perf_counter()-t0, 'model_reloaded_in_every_iteration']

In [None]:
# Calculate consistency of run time

# BY PICKING A MODEL AND LOADING THE MODEL JUST ONCE AT THE START OF THE LOOP
# The list below should hold the indices of the models that were run through
# the random-model test above (the original run used 2 models; only index 25
# is listed here). This checks the run time effect of loading the model.
for model_index_to_test in [25]:
  print("Model:", model_index_to_test)
  model = SentenceTransformer(xflist[model_index_to_test])

  for iteration in range(100):
    print("Iteration:", iteration)
    t0 = time.perf_counter() # Time the sentence process: Capture the start time

    # Compute sentence summaries: Model encode will handle a list of text values
    vleft = model.encode(lhs)
    vright = model.encode(rhs)
    vrand = model.encode(randtext)

    # Now compute similarities: There are three for each entry in the df
    lr_list = [] # Similarity between left and right
    lrand_list = [] # Similarity between left and rand
    rrand_list = [] # Similarity between right and rand

    for j in range(text_samps):
      lr_list.append(cosine_similarity(vleft[j].reshape(1, -1), vright[j].reshape(1, -1))[0][0])
      lrand_list.append(cosine_similarity(vleft[j].reshape(1, -1), vrand[j].reshape(1, -1))[0][0])
      rrand_list.append(cosine_similarity(vright[j].reshape(1, -1), vrand[j].reshape(1, -1))[0][0])

    run_time_df.loc[len(run_time_df)] = [model_index_to_test, xflist[model_index_to_test], time.perf_counter()-t0, 'model_loaded_once_before_iteration']

In [None]:
run_time_df
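
One way to judge the consistency of the run times is to summarize them by run type. The sketch below computes the mean, the standard deviation, and their ratio (the coefficient of variation) for each group. The timings are made up, and only the `Run_Type` and `Run_Time` column names are taken from the dataframe above.

```python
import pandas as pd

# Made-up timings in seconds, three runs per run type
demo_times = pd.DataFrame({
    "Run_Type": ["reloaded", "reloaded", "reloaded",
                 "loaded_once", "loaded_once", "loaded_once"],
    "Run_Time": [12.0, 14.0, 13.0, 4.0, 5.0, 6.0],
})

summary = demo_times.groupby("Run_Type")["Run_Time"].agg(["mean", "std"])
summary["cv"] = summary["std"] / summary["mean"]  # Relative spread per group
print(summary)
```

A smaller coefficient of variation indicates more consistent run times relative to the mean.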

In [None]:
from google.colab import files

run_time_df.to_csv("model_run_time_stats.csv")
files.download("model_run_time_stats.csv")