# ST446 Distributed Computing for Big Data
## Assignment 2 - PART 2
---

## P2: Topic Modelling

In this homework problem, you are asked to perform a semantic analysis of the DBLP author publications dataset `dblp/author_large.txt`.

Please, refer to:
* Week02 on how to download this dataset into the master node of your Dataproc cluster. You can put this dataset into a bucket on your GCP account and access it from your code as well.
* Week09 on how to configure your Dataproc cluster to use the `NLTK` library.
* Week09 on an example code for running LDA topic modelling.

## Questions

**P2.A (25 points)** Use Latent Dirichlet Allocation (LDA) to cluster publications by using words in their titles and represent each publication by 10 topics. Please follow these steps:

**A.1** Convert titles to tokens by:
   * Tokenizing words in the title of each publication.
   * Removing stop words using the `nltk` package.
   * Removing puctuations, numbers or other symbols.
   * Lemmatizing tokens.

Note that you may skip some of these editing steps or add some additional steps to edit the tokens, but if you do this provide a justification for it.

**A.2** Convert tokens into sparse vectors.

**A.3** Use LDA to find out 10 topics for each publication and represent each topic with the first few most relevant words. Note that you can choose to use different number of topics rather than 10. Again if you do so, please provide a justification.

**A.4** Comment the obtained results.

**P2.B (25 points)** Address each question as in part A, but with each *document* representing all publication tiles of a specific author. For example, if an author $Y$ wrote "introduction to databases" and "database design", then the *document* for the author $Y$ will be "introduction to database database design". 

In addition, calculate the **topic density** vector for each author and use the topic density to calculate the **cosine similarity** for each pair of authors. For example, if the topic density for author X is $[x_1, x_2, x_3, \dots]$ and topic density vector for author Y is $[y_1, y_2, y_3, \dots]$, then the cosine similarity is $\frac{x_1\cdot y_1 + x_2\cdot y_2 + x_3\cdot y_3 +\dots}{\sqrt{x_1^2+ x_2^2+ x_3^2 +\dots}\sqrt{y_1^2+ y_2^2+ y_3^2 +\dots}}$. Show the 10 most similar author pairs and comment on their similarity, if possible taking into consideration the results from the previous section.

## 0. Load data

In [95]:
import numpy as np

# your code to adjust the path to your dataset author-large.txt
author_rdd2 = sc.textFile('gs://bucket-sar/author-large.txt', 4) \
                .map(lambda row: np.array(row.strip().split("\t")))

In [96]:
#I take a sample of the dataset of 707 records, because the full dataset takes a really long time to run:
author_rdd = author_rdd2.filter(lambda r :(int(r[3])== 2010 and 'al' in r[0] ) )

In [97]:
author_rdd.count()

707

In [98]:
# example on how you can manipulate the RDD containing the data
# you can adjust for your case

#authors = author_rdd.map(lambda r: (r[0],1)).reduceByKey(lambda a,b: a+b)
#author_30 = set(authors.filter(lambda r: r[1] >= 30).map(lambda r: r[0]).collect())

In [99]:
#title_author = author_rdd.filter(lambda r: r[0] in author_30). \
#                    map(lambda r: (r[0],r[2])).distinct()
#title_author.take(10)

In [100]:
#print(author_rdd.count())
#print(title_author.count())

In [101]:
#Preparation for part A:
#Get all the titles in one object, so that they appear only once.
titles = author_rdd.map(lambda r: (r[2])).distinct()

In [102]:
#Preparation for part B:
#Get all publication tiles of a specific author in one row, so that each document represents all the publications of an author:
author_publ = author_rdd.map(lambda r: (r[0],r[2])).distinct()
author_publ = author_publ.reduceByKey(lambda x, y: x + ' ' + y)
author_publ = author_publ.map(lambda r: (r[0],r[1]))

## A1. Parse the data

Here we make use of the natural language processing module `nltk`. Please download both the module and the corresponding data. See https://www.nltk.org/install.html and https://www.nltk.org/data.html for more details.

In [103]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

#A.1 Convert titles to tokens by:
#Tokenizing words in the title of each publication.
#Removing stop words using the nltk package.
#Removing puctuations, numbers or other symbols.
#Lemmatizing tokens.

stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)
lmtzr = WordNetLemmatizer() 

def get_tokens(line):
    # get tokens from line
    tokens = word_tokenize(line)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuations from each word
    stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words
    words = [w for w in words if not w in stop_words]
    # lemmatizing the words
    words = [lmtzr.lemmatize(w) for w in words]
    return (words)

titles_tokens = titles.map(lambda line: (str(line), get_tokens(line)))
print(titles_tokens.take(1))


#Remove stop words based on document features:

'''Find and store words that appear extremely rarely in the documents. Here, we consider these words to appear
less than 2 times across all titles. These words will be removed, hence they will not become features. 
The reason is that words that appear so rarely cannot provide a guideline on the topic, 
but might only overfit and overcomplicate the model. If the code would run in the full datset I would select 
a higher threshold than 2 for identifying rare words and removing them.
'''
doc_stop_words = titles_tokens.flatMap(lambda r: r[1]).map(lambda r: (r,1)).reduceByKey(lambda a,b: a+b)
doc_stop_words = doc_stop_words.filter(lambda a: a[1]<2).map(lambda r: r[0]).collect()

# throw away stop words and words that are just single letters.
titles_nostop = titles_tokens.map(lambda r: (r[0],[w for w in r[1] if not w in doc_stop_words and not len(w)==1])) 

titles_nostop.take(1)[0][1][:10]

[('Dual-Level Attack Detection and Characterization for Networks under DDoS.', ['duallevel', 'attack', 'detection', 'characterization', 'network', 'ddos'])]


['attack', 'detection', 'network']

## A2. Convert tokens into sparse vectors


In [104]:
# your code
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import monotonically_increasing_id

titles_df = spark.createDataFrame(titles_nostop, ["title","words"])
titles_df.cache()
titles_df.take(2)

[Row(title='Dual-Level Attack Detection and Characterization for Networks under DDoS.', words=['attack', 'detection', 'network']),
 Row(title='A Probabilistic Approach for On-Line Sum-Auditing.', words=['probabilistic', 'approach', 'online'])]

## Generate a vectorized representation of the *tokens*


In [105]:
# your code

# convert a collection of text documents to a matrix of token counts.
''' `minDF`: I use the value of 3, so that words that appear in less than 3 documents 
#won't be part of the vectorized representation of tokens. The reason is that is they are so rare, 
#then it will be like overfitting the model later on.'''
cv = CountVectorizer(inputCol="words", outputCol="features", minDF=3)

# learn a vocabulary dictionary of all tokens in the raw documents.
cv_model = cv.fit(titles_df)

# learn the vocabulary dictionary and return document-term matrix.
titles_df_w_features = cv_model.transform(titles_df)
titles_df_w_features.cache()
titles_df_w_features.show(3)

+--------------------+--------------------+--------------------+
|               title|               words|            features|
+--------------------+--------------------+--------------------+
|Dual-Level Attack...|[attack, detectio...|(325,[5,25,97],[1...|
|A Probabilistic A...|[probabilistic, a...|(325,[4,56],[1.0,...|
|Analysing and Vis...|[security, usabil...|    (325,[45],[1.0])|
+--------------------+--------------------+--------------------+
only showing top 3 rows



## Convert pyspark.ml vectors to pyspark.mllib vectors

In [106]:
# your code
from pyspark.mllib.linalg import Vectors

def as_mllib_vector(v):
    return Vectors.sparse(v.size, v.indices, v.values)

features = titles_df_w_features.select("features")
feature_vec = features.rdd.map(lambda r: as_mllib_vector(r[0]))

feature_vec.cache()
feature_vec.take(1)

[SparseVector(325, {5: 1.0, 25: 1.0, 97: 1.0})]

## Check the vocabulary

In [107]:
# your code

print ("Vocabulary from CountVectorizerModel is:")
print(cv_model.vocabulary[:100])
print("\n---\n")

m = len(cv_model.vocabulary)
print("Number of terms m: ", m)

Vocabulary from CountVectorizerModel is:
['de', 'system', 'model', 'using', 'approach', 'network', 'application', 'analysis', 'performance', 'information', 'service', 'data', 'social', 'framework', 'based', 'computing', 'towards', 'learning', 'efficient', 'memory', 'environment', 'design', 'study', 'tool', 'program', 'detection', 'computer', 'pour', 'development', 'programming', 'problem', 'science', 'software', 'distributed', 'management', 'la', 'virtual', 'architecture', 'code', 'mobile', 'algorithm', 'user', 'language', 'support', 'pattern', 'security', 'dynamic', 'classification', 'wireless', 'knowledge', 'high', 'interaction', 'interface', 'parallel', 'multiple', 'automatic', 'online', 'value', 'logic', 'semantic', 'sensor', 'transformation', 'simulation', 'structure', 'technology', 'search', 'education', 'base', 'web', 'modeling', 'role', 'access', 'optimization', 'challenge', 'community', 'der', 'abstract', 'java', 'video', 'interactive', 'experience', 'evaluation', 'le', 'local

## A3. Latent Dirichlet Allocation

In [108]:
# your code
from pyspark.ml.clustering import LDA

# instantiate LDA model. It is a batch LDA using EM algorithm. 
#Even though the online LDA can be more accurate, we use batch, as it executes quicker.
#I select 10 topics, which could fit better in the full dataset.
lda = LDA(k=10, maxIter=5)
# training the model
lda_model = lda.fit(titles_df_w_features)

#LONG TIME:
# calculate logLikelihood
#ll = lda_model.logLikelihood(titles_df_w_features)
# calculate perplexity
#lp = lda_model.logPerplexity(titles_df_w_features)

#print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
#print("The upper bound on the perplexity: " + str(lp))

Looking at the topics:

In [109]:
# Describe topics
# your code

topics = lda_model.describeTopics(5)

print("The topics described by their top-weighted terms:")

topics.show(truncate=False)

The topics described by their top-weighted terms:
+-----+------------------------+----------------------------------------------------------------------------------------------------------------+
|topic|termIndices             |termWeights                                                                                                     |
+-----+------------------------+----------------------------------------------------------------------------------------------------------------+
|0    |[1, 134, 59, 9, 170]    |[0.02084882858555773, 0.00777319155846653, 0.007376292704212854, 0.007321170771130758, 0.00656767314316723]     |
|1    |[51, 214, 240, 104, 22] |[0.008452264510354676, 0.007230752479272849, 0.006926990275135945, 0.005610810024368863, 0.0053655463460259255] |
|2    |[31, 26, 66, 11, 44]    |[0.010973444595536574, 0.00925387455292773, 0.008174444940824591, 0.007856212059772893, 0.0073779446604198985]  |
|3    |[191, 285, 309, 176, 72]|[0.0105412676897365, 0.008483363048266844,

In [110]:
# Shows the results
# your code
topic_i = topics.select("termIndices").rdd.map(lambda r: r[0]).collect()
for j, i in enumerate(topic_i):
    print('topic', j+1, ':', np.array(cv_model.vocabulary)[i])

topic 1 : ['system' 'functional' 'semantic' 'information' 'multiagent']
topic 2 : ['interaction' 'tree' 'assignment' 'game' 'study']
topic 3 : ['science' 'computer' 'education' 'data' 'pattern']
topic 4 : ['solution' 'ehealth' 'emerging' 'enhancing' 'optimization']
topic 5 : ['algebraic' 'global' 'structure' 'team' 'medium']
topic 6 : ['transactional' 'program' 'approach' 'memory' 'development']
topic 7 : ['study' 'feasibility' 'structure' 'par' 'hardware']
topic 8 : ['integration' 'automatic' 'segmentation' 'partial' 'modeling']
topic 9 : ['de' 'model' 'pour' 'la' 'service']
topic 10 : ['information' 'dynamic' 'social' 'analysis' 'abstract']


## A4. Comment your results

From the results abe we see that 10 topics have been created for the titles and we see the words that mainly describe thse topics. Topic 1 has a higher presence in the document (title) when the words 'algorithm', 'challenge','using', 'efficient' and 'student' occur inhigh percentage. e see that most if the topics are related to computer science, but this might be due to the small dataset used, since only 325 tokens were used in the model.

## B1. Convert tokens into sparse vectors

In [121]:
# your code
#B Convert titles to tokens by:
#Tokenizing words in the title of each publication.
#Removing stop words using the nltk package.
#Removing puctuations, numbers or other symbols.
#Lemmatizing tokens.

stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)
lmtzr = WordNetLemmatizer() # see https://www.nltk.org/_modules/nltk/stem/wordnet.html for details

author_publ_toks = author_publ.map(lambda line: (str(line[0]), get_tokens(line[1])))
print(author_publ_toks.take(1))


#Remove stop words based on document features:

'''Find and store words that appear extremely rarely in the documents. Here, we consider these words to appear
less than 2 times across all concatinated titles of authors. These words will be removed, hence they will not become features. 
The reason is that words that appear so rarely cannot provide a guideline on the topic, 
but might only overfit and overcomplicate the model. If the code would run in the full dataset I would select 
a higher threshold than 2 for identifying rare words and removing them.
'''

doc_stop_words2 = author_publ_toks.flatMap(lambda r: r[1]).map(lambda r: (r,1)).reduceByKey(lambda a,b: a+b)
doc_stop_words2 = doc_stop_words2.filter(lambda a: a[1]<2).map(lambda r: r[0]).collect()

# throw away stop words and words that are just single letters.
author_publ_nostop = author_publ_toks.map(lambda r: (r[0],[w for w in r[1] if not w in doc_stop_words2 and not len(w)==1])) 

author_publ_nostop.take(1)[0][1][:10]


[('Ana Cavalcanti', ['communication', 'system', 'clawz'])]


['suivi', 'dautomobiles', 'par', 'classification', 'hirarchique']

In [122]:
#Convert tokens into sparse vectors
author_publ_df = spark.createDataFrame(author_publ_nostop, ["author","words"])
author_publ_df.cache()
author_publ_df.take(2)

[Row(author='Ana Cavalcanti', words=['communication', 'system']),
 Row(author='Sudhakar Yalamanchili', words=['modeling', 'workload', 'system'])]

## Generate a vectorized representation of the *tokens*

In [123]:
# your code
# convert a collection of text documents to a matrix of token counts.
''' `minDF`: We dont use a minDF because here the documents are way less and longer than the title-documents we used before. 
Hence in this small dataset an author might be writing about a topic that none else will.'''
cv2 = CountVectorizer(inputCol="words", outputCol="features")

# learn a vocabulary dictionary of all tokens in the raw documents.
cv_model2 = cv2.fit(author_publ_df)

# learn the vocabulary dictionary and return document-term matrix.
author_publ_df_w_features = cv_model2.transform(author_publ_df)
author_publ_df_w_features.cache()
author_publ_df_w_features.show(2, truncate = False)

+---------------------+----------------------------+-------------------------------+
|author               |words                       |features                       |
+---------------------+----------------------------+-------------------------------+
|Ana Cavalcanti       |[communication, system]     |(885,[1,125],[1.0,1.0])        |
|Sudhakar Yalamanchili|[modeling, workload, system]|(885,[1,102,425],[1.0,1.0,1.0])|
+---------------------+----------------------------+-------------------------------+
only showing top 2 rows



## Convert pyspark.ml vectors to pyspark.mllib vectors

In [124]:
# your code

features2 = author_publ_df_w_features.select("features")
feature_vec2 = features2.rdd.map(lambda r: as_mllib_vector(r[0]))

#feature_vec2.cache()
feature_vec2.take(1)

[SparseVector(885, {1: 1.0, 125: 1.0})]

### Take a look at the vocabulary

In [125]:
# your code
print ("Vocabulary from CountVectorizerModel is:")
print(cv_model2.vocabulary[:100])
print("\n---\n")

m = len(cv_model2.vocabulary)
print("Number of terms m: ", m)

Vocabulary from CountVectorizerModel is:
['de', 'system', 'model', 'using', 'approach', 'network', 'analysis', 'application', 'service', 'data', 'performance', 'information', 'efficient', 'computing', 'memory', 'social', 'development', 'program', 'based', 'study', 'framework', 'environment', 'design', 'pour', 'towards', 'tool', 'learning', 'software', 'management', 'programming', 'science', 'computer', 'detection', 'architecture', 'support', 'distributed', 'dynamic', 'interaction', 'interface', 'security', 'classification', 'technology', 'pattern', 'user', 'virtual', 'multiple', 'problem', 'mobile', 'algorithm', 'la', 'code', 'base', 'wireless', 'language', 'parallel', 'sensor', 'knowledge', 'education', 'health', 'document', 'logic', 'structure', 'automatic', 'high', 'optimization', 'planning', 'transactional', 'semantic', 'transformation', 'search', 'web', 'interactive', 'et', 'simulation', 'strategy', 'experience', 'le', 'role', 'patient', 'control', 'semantics', 'value', 'java', 'm

## B1. Latent Dirichlet Allocation

We now analyse the same dataset but using the Latent Dirichlet Allocation to find feature vectors characterizing topics of documents, and feature vectors characterizing the words of topics.



In [126]:
# your code
# instantiate LDA model
lda2 = LDA(k=10, maxIter=5).setTopicDistributionCol("topicDistributionCol")
# training the model
lda_model2 = lda2.fit(author_publ_df_w_features)

#LONG TIME:
# calculate logLikelihood
#ll2 = lda_model2.logLikelihood(author_publ_df_w_features)
# calculate perplexity
#lp2 = lda_model2.logPerplexity(author_publ_df_w_features)

#print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
#print("The upper bound on the perplexity: " + str(lp))

The perplexity below is a measurement of how well a probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample.

In [127]:
# Describe topics and top-weighted terms
# your code
topics2 = lda_model2.describeTopics(5)

print("The topics described by their top-weighted terms:")

topics2.show(truncate=False)

The topics described by their top-weighted terms:
+-----+-------------------------+------------------------------------------------------------------------------------------------------------------+
|topic|termIndices              |termWeights                                                                                                       |
+-----+-------------------------+------------------------------------------------------------------------------------------------------------------+
|0    |[11, 128, 20, 1, 68]     |[0.0038467355846325746, 0.0038105359724533434, 0.0037395917580262776, 0.0037356877350951446, 0.003630661180328617]|
|1    |[5, 83, 373, 643, 55]    |[0.003088524103112293, 0.003023962986769739, 0.002778274468819081, 0.002582373221437107, 0.002576776369402407]    |
|2    |[1, 3, 48, 10, 27]       |[0.008133595017822766, 0.005726834451699499, 0.005083177207777827, 0.004925496432479812, 0.003956571874342173]    |
|3    |[559, 117, 560, 38, 862] |[0.0020747157667571144,

In [128]:
# Shows the results
# your code
topic_i2 = topics2.select("termIndices").rdd.map(lambda r: r[0]).collect()
for j, i in enumerate(topic_i2):
    print('topic', j+1, ':', np.array(cv_model2.vocabulary)[i])

topic 1 : ['information' 'extended' 'framework' 'system' 'transformation']
topic 2 : ['network' 'monitoring' 'comparing' 'combinatorial' 'sensor']
topic 3 : ['system' 'using' 'algorithm' 'performance' 'software']
topic 4 : ['graphic' 'physical' 'profitable' 'interface' 'possibility']
topic 5 : ['analysis' 'permutation' 'lowlevel' 'detection' 'array']
topic 6 : ['application' 'support' 'mapping' 'design' 'array']
topic 7 : ['programming' 'shape' 'developing' 'graph' 'execution']
topic 8 : ['efficacy' 'call' 'threadlevel' 'speculation' 'graphlevel']
topic 9 : ['de' 'pour' 'der' 'video' 'approach']
topic 10 : ['biomedical' 'standard' 'applicability' 'technology' 'microassembly']


In [129]:
cv2 = CountVectorizer(inputCol="words", outputCol="features", minDF = 1) # TO DO 2

# learn a vocabulary dictionary of all tokens in the raw documents.
cv_model2 = cv2.fit(author_publ_df)

# learn the vocabulary dictionary and return document-term matrix.
author_publ_df_w_features = cv_model2.transform(author_publ_df)
author_publ_df_w_features.cache()
author_publ_df_w_features.show(2, truncate = True)

+--------------------+--------------------+--------------------+
|              author|               words|            features|
+--------------------+--------------------+--------------------+
|      Ana Cavalcanti|[communication, s...|(885,[1,125],[1.0...|
|Sudhakar Yalamanc...|[modeling, worklo...|(885,[1,102,425],...|
+--------------------+--------------------+--------------------+
only showing top 2 rows



Comment on above results:

Here we have also generated 10 topics, but wee see that some different words have ben picked up. This makes sense because this time we have included 885 tokens in the model. We see for example topic 8 focusing on efficacy and graphlevel, which we didn't see in the top words before.

## B2. Calculate the topic density vector for each author and the cosine similarity ## 

In [130]:
#Topic density vector:
transformed = lda_model2.transform(author_publ_df_w_features)
topics_dens= transformed.select("topicDistributionCol")
topics_dens.show(2, truncate= False)

#Transform it from pyspark dataframe into a numpy array:
topics_dens_np= np.array(topics_dens.select('topicDistributionCol').collect())

#Transform the lists in the numpy array into arrays:
ncols2 = len(topics_dens_np[0][0])
nrows2 =len(topics_dens_np)

topics_dens_np2 = np.zeros((nrows2,ncols2))

for i in range(nrows2):
    for j in range(ncols2):
        topics_dens_np2[i-1, j-1] += topics_dens_np[i-1][0][j-1]

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topicDistributionCol                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[0.03311606979560238,0.03267691589733875,0.7045669242824184,0.032631219383541736,0.03269794961452489,0.03287849479348196,0.03269988339040325,0.032416311066207765,0.03381987393354242,0.032496357842938535]   |
|[0.02480202792804722,0.02447107394993546,0.7787396202928778,0.024437429904178687,0.024487869695255633,0.024621943358962596,0.024486896505917746,0.02427608116833271

In [131]:
# your code

def dot_prod(A):
    B= A.transpose()
    ncols = len(A[0])


    product = np.zeros((nrows,nrows))

    # iterating by row of A
    for i in range(nrows):

        # iterating by column by B
        for j in range(nrows):

            # iterating by rows of B
            for k in range(ncols):
                product[i-1][j-1] += A[i-1][k-1] * B[k-1][j-1]   
    return product

nrows= len(topics_dens_np2)
product=dot_prod(topics_dens_np2)

In [132]:
# your code
from numpy import array  
from numpy.linalg import norm 


def cos_sim(A):
    
    #Calculate norms
    norms = np.zeros(nrows)
    
    #cosine similarity
    cos_sim = np.zeros((nrows,nrows))
    
    # iterating by row of A
    for i in range(nrows):
        norms[i-1]=norm(A[i-1])
        
    for i in range(nrows):
            # iterating by column 
        for j in range(nrows):
            cos_sim[i-1][j-1] += product[i-1][j-1]/(norms[i-1]*norms[j-1])  
        return cos_sim
    

cos_sim = cos_sim(topics_dens_np2)



## B3. Show the 10 most similar author pairs and comment on their similarity,

In [133]:
# your code

#get an array with the name of the authors
author_names= np.array(author_publ_df_w_features.select('author').collect())

#get the ones with the highst correlation:
from collections import defaultdict

d = defaultdict(list)
for i in range(len(cos_sim)):
    for j in range(len(cos_sim)):
        if i<j:
            d[cos_sim[i-1][j-1]].append((str(author_names[i-1]),str(author_names[j-1])))

for value in sorted(d.keys(), reverse=True)[0:9]:
    print (value, d[value])      

1.0000000000000002 [("['Stefan Nesbigall']", "['Philipp Slusallek']")]
0.9999999999687291 [("['Stefan Nesbigall']", "['Emilio del Rosal Garca']")]
0.9999770077981366 [("['Stefan Nesbigall']", "['Sigurdur O. Adalgeirsson']"), ("['Stefan Nesbigall']", "['Cynthia Breazeal']")]
0.9999769639203687 [("['Stefan Nesbigall']", "['Maider Zamalloa']")]
0.9999769240371887 [("['Stefan Nesbigall']", "['Aruna D. Balakrishnan']")]
0.9999585182482165 [("['Stefan Nesbigall']", "['Greg Malysa']")]
0.9998683270435408 [("['Stefan Nesbigall']", "['Ananya Kanjilal']")]
0.9998681943007313 [("['Stefan Nesbigall']", "['Gerald DeHondt']")]
0.9998681935693876 [("['Stefan Nesbigall']", "['Marcelo A. Falappa']")]
