# Sentence similarity with NLU using BERT embeddings


## 1. Install NLU and Java

In [1]:
import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu  > /dev/null   
import nlu

## 2. Download sample dataset: 60k Stack Overflow Questions with Quality Rating


https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate

In [2]:
import pandas as pd
# Download the dataset 
! wget -N https://ckl-it.de/wp-content/uploads/2020/11/60kstackoverflow.csv -P /tmp
# Load dataset to Pandas
df = pd.read_csv('/tmp/60kstackoverflow.csv')
max_r = 500
df = df.iloc[0:max_r]
df

--2020-11-10 03:33:15--  https://ckl-it.de/wp-content/uploads/2020/11/60kstackoverflow.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50356825 (48M) [text/csv]
Saving to: ‘/tmp/60kstackoverflow.csv’


2020-11-10 03:33:23 (6.62 MB/s) - ‘/tmp/60kstackoverflow.csv’ saved [50356825/50356825]



Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,2016-01-01 00:21:59,LQ_CLOSE
1,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,2016-01-01 02:03:20,HQ
2,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,2016-01-01 02:48:24,HQ
3,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,2016-01-01 03:30:17,HQ
4,34553755,hide/show fab with scale animation,<p>I'm using custom floatingactionmenu. I need...,<android><material-design><floating-action-but...,2016-01-01 05:21:48,HQ
...,...,...,...,...,...,...
495,34744788,Can't call function Python,<p>Here is my code that calls the <code>__init...,<python><function>,2016-01-12 13:18:11,LQ_CLOSE
496,34744959,alright this is my most rescent code,\r\n \r\n highest = {}\r\n def reader...,<python>,2016-01-12 13:26:23,LQ_EDIT
497,34746224,pre tag text not coming in innerText,<p>I was just testing something and noticed th...,<javascript><jquery><html>,2016-01-12 14:25:36,LQ_CLOSE
498,34746726,"loading fonts ttf crashes , error loading with...","i have the problem with load the ttf file, my ...",<android-studio><fonts><libgdx><load>,2016-01-12 14:48:27,LQ_EDIT


## 3. Embed Sentences with BERT Sentence Embeddings

We could embed either the question Title or the question Body. Here we embed the Title.
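If you would rather embed the Body, note that it contains raw HTML (for example `<p>` and `<code>` tags). The snippet below is only a rough sketch of one way to clean it first, using the Python standard library. `strip_html` is a hypothetical helper and not part of NLU, and a regex is not a full HTML parser.

```python
import html
import re

def strip_html(text):
    """Rough cleanup: drop HTML tags, unescape entities, collapse whitespace.

    Hypothetical helper for illustration; not part of NLU.
    """
    no_tags = re.sub(r"<[^>]+>", " ", text)    # replace every tag with a space
    unescaped = html.unescape(no_tags)          # e.g. &amp; becomes &
    return re.sub(r"\s+", " ", unescaped).strip()

# Made-up sample in the style of the Body column
sample_body = "<p>Why are <code>Optional</code> values immutable &amp; final?</p>"
cleaned = strip_html(sample_body)
print(cleaned)
```

Assuming the Body column holds strings, something like df.Body.apply(strip_html) could then be passed to pipe.predict in place of df.Title.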

In [3]:
import nlu
pipe = nlu.load('embed_sentence.bert')
# pipe = nlu.load('en.embed_sentence.bert_large_cased') # if you have some time and RAM try a big BERT model!
predictions = pipe.predict(df.Title, output_level='document')
predictions

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


origin_index,document,embed_sentence_bert_embeddings
0,Java: Repeat Task Every Random Seconds,"[-1.72942316532135, 0.6468319892883301, -0.351..."
1,Why are Java Optionals immutable?,"[-0.6685013175010681, 0.08217886090278625, -0...."
2,Text Overlay Image with Darkened Opacity React...,"[-0.8454132080078125, -0.7770175337791443, -0...."
3,Why ternary operator in swift is so picky?,"[-0.41476115584373474, 0.15586626529693604, -0..."
4,hide/show fab with scale animation,"[-1.2917425632476807, -0.0196269191801548, -0...."
...,...,...
495,Can't call function Python,"[-1.2739437818527222, 0.7318032383918762, -0.6..."
496,alright this is my most rescent code,"[-1.373586654663086, 0.46381285786628723, -0.6..."
497,pre tag text not coming in innerText,"[-1.4447592496871948, -0.48064124584198, -0.80..."
498,"loading fonts ttf crashes , error loading with...","[-0.22394110262393951, 1.0780186653137207, -0...."


## 4. Calculate pairwise similarities between all sentence embeddings
We use cosine similarity as the measure. The higher the similarity score between two embeddings, the more similar the sentences are deemed to be.
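To build some intuition for the scores, here is a minimal sketch with made-up 2-D vectors. It computes the cosine of the angle between two vectors by hand with NumPy, which is the same quantity sklearn's cosine_similarity returns for each pair of rows. The helper name `cosine_sim` and the toy vectors are assumptions for illustration only.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])    # same direction as a, twice the length
c = np.array([0.0, 3.0])    # orthogonal to a
d = np.array([-1.0, 0.0])   # opposite direction to a

sim_ab = cosine_sim(a, b)   # same direction
sim_ac = cosine_sim(a, c)   # orthogonal
sim_ad = cosine_sim(a, d)   # opposite direction
print(sim_ab, sim_ac, sim_ad)
```

Vector length does not matter, only direction: the score is 1 for the same direction, 0 for orthogonal vectors and -1 for opposite directions.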

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Calculate the similarity between one sentence and every other sentence in the DataFrame
def get_sim_df_for_iloc(sentence_id, predictions, e_col, pipe=pipe):
  # Computes the cosine similarity between the sentence at predictions.iloc[sentence_id]
  # and all sentences in predictions, using the embedding column e_col.
  # Note: the scores are written to the global df, which is returned.

  # put embeddings in matrix
  embed_mat = np.array([x for x in predictions[e_col]])

  # calculate similarity between every embedding pair
  sim_mat = cosine_similarity(embed_mat, embed_mat)

  print("Similarities for Sentence : " + df.iloc[sentence_id].Title)

  # write sim scores to df
  df['sim_score'] = sim_mat[sentence_id]
  return df

sim_df = get_sim_df_for_iloc(0,predictions,'embed_sentence_bert_embeddings')
sim_df.sort_values('sim_score', ascending = False)

Similarities for Sentence : Java: Repeat Task Every Random Seconds


Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y,sim_score
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,2016-01-01 00:21:59,LQ_CLOSE,1.000000
306,34647874,Java servets response.getMethod() not working,Hello I am trying to create a simple servlet a...,<java><servlet-3.0><get-method>,2016-01-07 05:21:09,LQ_EDIT,0.835296
453,34710117,SQL Server: Displaying result in Java Textfield,<p>I use MS SQL Server and Java with JDBC to c...,<java><sql><sql-server><database><swing>,2016-01-10 19:55:48,LQ_CLOSE,0.830390
339,34662574,Node.JS: Getting error : [nodemon] Internal wa...,<p>I just installed <code>Node.js</code> on my...,<javascript><node.js>,2016-01-07 18:31:37,HQ,0.829243
107,34581270,Understanding JavaScript Object(value),<p>I understand that the following code wraps ...,<javascript>,2016-01-03 20:31:45,HQ,0.823874
...,...,...,...,...,...,...,...
40,34564543,Android Studio Import Failing,Ok guys i am trying to implement spinner in in...,<java><android><android-layout><android-studio...,2016-01-02 09:46:27,LQ_EDIT,0.612905
133,34589033,Trying to get property of non-object in yii,<p>I'm using yii framework and I'm new in yii ...,<php><mysql><yii>,2016-01-04 10:26:20,LQ_CLOSE,0.596515
331,34659252,Polymer - Animating a DIV,<p>I am learning Polymer. I have a element tha...,<javascript><polymer>,2016-01-07 15:41:51,HQ,0.592278
325,34656168,Stay signed in option with cookie-session in e...,"<p>I would like to have a ""Stay signed in"" opt...",<node.js><express><cookie-session>,2016-01-07 13:18:55,HQ,0.588495


# Calculate the similarity score between every pair of sentences in the input DataFrame

In [5]:
def get_sim_df_total(predictions, e_col, string_to_embed, pipe=pipe):
  # Calculates the similarity for every sentence pair. For every sentence i, a new column
  # i_sim is created that holds the similarity of the sentence at predictions.iloc[i]
  # to every other sentence j. string_to_embed and pipe are not used here.

  # put embeddings in matrix
  embed_mat = np.array([x for x in predictions[e_col]])

  # calculate similarity between every embedding pair
  sim_mat = cosine_similarity(embed_mat, embed_mat)

  # add one similarity column per sentence
  for i, row in enumerate(sim_mat):
    predictions[str(i) + '_sim'] = row

  return predictions

sim_df = get_sim_df_total(predictions,'embed_sentence_bert_embeddings', 'How to get started with Machine Learning and Python' )
sim_df

origin_index,document,embed_sentence_bert_embeddings,0_sim,1_sim,2_sim,3_sim,4_sim,5_sim,6_sim,7_sim,8_sim,9_sim,10_sim,11_sim,12_sim,13_sim,14_sim,15_sim,16_sim,17_sim,18_sim,19_sim,20_sim,21_sim,22_sim,23_sim,24_sim,25_sim,26_sim,27_sim,28_sim,29_sim,30_sim,31_sim,32_sim,33_sim,34_sim,35_sim,36_sim,37_sim,...,460_sim,461_sim,462_sim,463_sim,464_sim,465_sim,466_sim,467_sim,468_sim,469_sim,470_sim,471_sim,472_sim,473_sim,474_sim,475_sim,476_sim,477_sim,478_sim,479_sim,480_sim,481_sim,482_sim,483_sim,484_sim,485_sim,486_sim,487_sim,488_sim,489_sim,490_sim,491_sim,492_sim,493_sim,494_sim,495_sim,496_sim,497_sim,498_sim,499_sim
0,Java: Repeat Task Every Random Seconds,"[-1.72942316532135, 0.6468319892883301, -0.351...",1.000000,0.819372,0.659262,0.694719,0.777048,0.725663,0.725625,0.702916,0.755798,0.759118,0.684960,0.666919,0.788409,0.725577,0.703996,0.698227,0.771101,0.814983,0.724666,0.753221,0.735336,0.645955,0.696854,0.729837,0.727965,0.785126,0.759896,0.724465,0.708779,0.699182,0.713487,0.796558,0.795756,0.769290,0.731599,0.705547,0.663230,0.737241,...,0.710759,0.701657,0.742022,0.715796,0.671633,0.682077,0.783963,0.718364,0.755613,0.669950,0.813350,0.721166,0.747419,0.649444,0.807471,0.729133,0.701040,0.697930,0.682483,0.801441,0.657247,0.657017,0.740262,0.733250,0.694280,0.743189,0.734860,0.752737,0.672762,0.732879,0.759939,0.770987,0.737586,0.739217,0.743965,0.655714,0.692343,0.775117,0.752383,0.784925
1,Why are Java Optionals immutable?,"[-0.6685013175010681, 0.08217886090278625, -0....",0.819372,1.000000,0.681445,0.785746,0.746594,0.760794,0.686735,0.765891,0.736671,0.750726,0.699515,0.718243,0.781912,0.723444,0.723196,0.718621,0.780115,0.800614,0.753087,0.719433,0.702536,0.682586,0.725512,0.731164,0.740763,0.767661,0.753056,0.746062,0.710316,0.746146,0.764438,0.777287,0.752945,0.747914,0.764362,0.754737,0.700815,0.782308,...,0.705635,0.726485,0.785977,0.722739,0.745312,0.692379,0.789531,0.706067,0.756747,0.709930,0.806407,0.766143,0.694382,0.675494,0.784660,0.759528,0.760680,0.753494,0.700965,0.756325,0.743803,0.713018,0.778026,0.824247,0.654408,0.767928,0.698624,0.811279,0.787773,0.772062,0.770704,0.783659,0.771810,0.683055,0.790450,0.732093,0.705242,0.762211,0.791996,0.799584
2,Text Overlay Image with Darkened Opacity React...,"[-0.8454132080078125, -0.7770175337791443, -0....",0.659262,0.681445,1.000000,0.700908,0.712997,0.744349,0.677978,0.698549,0.785566,0.741941,0.671408,0.684902,0.727052,0.697170,0.776989,0.736434,0.768504,0.671498,0.705557,0.693677,0.726465,0.682551,0.696212,0.727335,0.677732,0.729136,0.736803,0.732434,0.748954,0.608658,0.716553,0.640638,0.686526,0.706538,0.794051,0.681888,0.733818,0.697512,...,0.766923,0.712709,0.713497,0.713273,0.732967,0.713179,0.655296,0.781803,0.718077,0.745576,0.673234,0.781648,0.694080,0.650255,0.651726,0.767944,0.704866,0.666595,0.632337,0.729460,0.855053,0.743301,0.679530,0.705767,0.710097,0.655219,0.712059,0.726609,0.658780,0.727759,0.775218,0.676189,0.767131,0.746961,0.723573,0.598759,0.658729,0.786395,0.725651,0.706981
3,Why ternary operator in swift is so picky?,"[-0.41476115584373474, 0.15586626529693604, -0...",0.694719,0.785746,0.700908,1.000000,0.709669,0.752974,0.631151,0.710866,0.673875,0.737547,0.739440,0.728838,0.728043,0.763093,0.709968,0.733170,0.796574,0.790821,0.800898,0.732886,0.686625,0.752719,0.738881,0.620819,0.821819,0.806534,0.726634,0.735734,0.730317,0.744735,0.765111,0.654879,0.766204,0.737667,0.806084,0.862083,0.760542,0.794786,...,0.727900,0.757381,0.707900,0.728623,0.754694,0.735460,0.708523,0.722562,0.802051,0.778878,0.759916,0.724896,0.744051,0.685967,0.731031,0.716881,0.712002,0.738101,0.767521,0.704312,0.784878,0.786396,0.848044,0.795011,0.691402,0.735671,0.760157,0.729812,0.625895,0.815532,0.763806,0.791429,0.769846,0.647629,0.806612,0.697572,0.768133,0.745683,0.748908,0.721279
4,hide/show fab with scale animation,"[-1.2917425632476807, -0.0196269191801548, -0....",0.777048,0.746594,0.712997,0.709669,1.000000,0.686274,0.781686,0.769383,0.849608,0.697609,0.784714,0.727806,0.747672,0.671043,0.768991,0.701838,0.688884,0.793603,0.682163,0.775068,0.659666,0.723660,0.767470,0.726983,0.701460,0.730216,0.813612,0.652733,0.702719,0.721716,0.682510,0.720005,0.807717,0.844214,0.720020,0.615638,0.672022,0.717970,...,0.717623,0.659555,0.801951,0.676592,0.762317,0.694043,0.780064,0.770141,0.813450,0.716977,0.726947,0.717852,0.731239,0.696027,0.755972,0.768154,0.741713,0.762901,0.703013,0.710839,0.732210,0.708758,0.734915,0.789922,0.692274,0.751966,0.755104,0.800827,0.686754,0.687980,0.782181,0.742591,0.812714,0.778561,0.713158,0.727395,0.732339,0.817965,0.803347,0.706341
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,Can't call function Python,"[-1.2739437818527222, 0.7318032383918762, -0.6...",0.655714,0.732093,0.598759,0.697572,0.727395,0.561384,0.614975,0.673705,0.659852,0.666031,0.626060,0.778819,0.645398,0.637153,0.625942,0.645707,0.659481,0.738943,0.647012,0.655602,0.582020,0.754944,0.851217,0.610026,0.646869,0.696024,0.675875,0.681873,0.668802,0.724839,0.586340,0.681417,0.705478,0.681926,0.678499,0.644976,0.605934,0.712498,...,0.623264,0.643910,0.676288,0.608738,0.704484,0.606561,0.696069,0.586650,0.707953,0.691828,0.683269,0.680691,0.599897,0.694955,0.685631,0.638154,0.761407,0.708931,0.690334,0.655885,0.693766,0.659238,0.732861,0.781721,0.697190,0.636980,0.646070,0.649400,0.682896,0.627332,0.685858,0.685453,0.722892,0.620080,0.732348,1.000000,0.652642,0.750132,0.604775,0.650420
496,alright this is my most rescent code,"[-1.373586654663086, 0.46381285786628723, -0.6...",0.692343,0.705242,0.658729,0.768133,0.732339,0.685872,0.673702,0.724322,0.696315,0.687306,0.723635,0.756914,0.703290,0.789414,0.673507,0.707992,0.721320,0.741644,0.710577,0.697502,0.680040,0.773037,0.703829,0.655994,0.772555,0.689742,0.711606,0.698840,0.727198,0.758632,0.633888,0.655046,0.667987,0.739698,0.653596,0.704631,0.723601,0.761140,...,0.714460,0.560928,0.675953,0.729824,0.661799,0.717107,0.679618,0.719278,0.785106,0.761438,0.735974,0.699210,0.630811,0.714587,0.727807,0.692938,0.726155,0.670583,0.773246,0.683561,0.659360,0.751009,0.815266,0.704249,0.720219,0.686310,0.685837,0.685674,0.591106,0.771287,0.695108,0.733666,0.794046,0.685854,0.818380,0.652642,1.000000,0.771558,0.669400,0.746257
497,pre tag text not coming in innerText,"[-1.4447592496871948, -0.48064124584198, -0.80...",0.775117,0.762211,0.786395,0.745683,0.817965,0.759645,0.762469,0.748065,0.790604,0.760869,0.790691,0.797259,0.752927,0.754208,0.774780,0.743570,0.765898,0.756446,0.683320,0.700143,0.739780,0.787285,0.782067,0.730017,0.754930,0.799160,0.738103,0.785867,0.726839,0.758471,0.649263,0.792139,0.802581,0.809475,0.809583,0.691792,0.721761,0.791922,...,0.763336,0.732019,0.757222,0.709079,0.779132,0.770120,0.797542,0.786174,0.784747,0.763017,0.814598,0.774457,0.759156,0.732264,0.787489,0.822046,0.816639,0.742405,0.741625,0.752880,0.803099,0.720333,0.805989,0.796365,0.792750,0.734402,0.740022,0.785095,0.719734,0.762320,0.808274,0.786587,0.805083,0.729022,0.810718,0.750132,0.771558,1.000000,0.721261,0.804163
498,"loading fonts ttf crashes , error loading with...","[-0.22394110262393951, 1.0780186653137207, -0....",0.752383,0.791996,0.725651,0.748908,0.803347,0.748382,0.729544,0.777338,0.783685,0.785164,0.765987,0.670951,0.796824,0.760482,0.787510,0.775785,0.816520,0.784957,0.737253,0.734189,0.721129,0.710458,0.695488,0.746929,0.773108,0.782977,0.821219,0.728300,0.699216,0.673416,0.756826,0.678704,0.788582,0.764936,0.742565,0.717040,0.710702,0.732620,...,0.666293,0.707038,0.744155,0.732448,0.735923,0.789924,0.782925,0.771727,0.800856,0.685395,0.742085,0.737671,0.727461,0.628968,0.761375,0.746103,0.727224,0.769949,0.684806,0.809202,0.743649,0.733095,0.755752,0.755018,0.633465,0.794575,0.724271,0.796990,0.696561,0.805522,0.785355,0.811493,0.724426,0.752514,0.724217,0.604775,0.669400,0.721261,1.000000,0.773170
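A full pairwise matrix like this one also lets us look up the nearest neighbour of every sentence. The following is a minimal sketch on a made-up 4x4 matrix that stands in for sim_mat. It masks the diagonal, which only holds each sentence's similarity to itself, and then takes the row-wise maximum.

```python
import numpy as np

# Toy symmetric similarity matrix (values are made up for illustration)
toy_sim = np.array([
    [1.00, 0.82, 0.66, 0.69],
    [0.82, 1.00, 0.68, 0.79],
    [0.66, 0.68, 1.00, 0.70],
    [0.69, 0.79, 0.70, 1.00],
])

masked = toy_sim.copy()
np.fill_diagonal(masked, -np.inf)     # ignore self-similarity on the diagonal

nearest = masked.argmax(axis=1)       # index of the most similar other sentence
nearest_score = masked.max(axis=1)    # its similarity score

for i, (j, s) in enumerate(zip(nearest, nearest_score)):
    print(f"sentence {i}: nearest neighbour is sentence {j} (score {s:.2f})")
```

On the real data, the same two lines could be applied to the matrix returned by cosine_similarity(embed_mat, embed_mat).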


# Compare an input string with all sentences and calculate similarity scores

In [6]:
def get_sim_df_for_string(predictions,e_col, string_to_embed,pipe=pipe):
  # Creates a DataFrame with a sim_score column that holds the similarity of every sentence to string_to_embed
  # Note: the scores are written to the global df, which is returned.

  # put prediction vectors in matrix
  embed_mat = np.array([x for x in predictions[e_col]])

  # embed the input string
  embedding = pipe.predict(string_to_embed).iloc[0][e_col]

  # replicate the input string's embedding once per row of df
  m = np.array([embedding,]*len(df))
  sim_mat = cosine_similarity(m,embed_mat)

  # write sim scores (all rows of sim_mat are identical, so the first one suffices)
  df['sim_score'] = sim_mat[0]


  return df

In [7]:
sim_df = get_sim_df_for_string(predictions,'embed_sentence_bert_embeddings', 'How to get started with Machine Learning and Python' )
sim_df.sort_values('sim_score', ascending = False)

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y,sim_score
305,34647831,Best resources for learning Machine Learning f...,<p>I am keen in learning machining learning. I...,<machine-learning>,2016-01-07 05:16:50,LQ_CLOSE,0.873803
58,34568008,How do I run webpack from SBT,<p>I'm developing a Play 2.4 application and w...,<playframework><sbt><webpack>,2016-01-02 16:26:08,HQ,0.867193
219,34621576,The background music keep restarting. How to s...,<p>I create a sharedInstance of a background m...,<ios><swift><avfoundation><segue>,2016-01-05 21:31:33,LQ_CLOSE,0.836021
326,34656814,how to implement this function,now have `fmapT` and `traverse` :\r\n\r\n f...,<haskell><traversal><functor><haskell-lens>,2016-01-07 13:50:12,LQ_EDIT,0.835821
291,34645131,How do I run PhantomJS on AWS Lambda with NodeJS,<p><em>After not finding a working answer anyw...,<node.js><amazon-web-services><phantomjs><aws-...,2016-01-06 23:59:08,HQ,0.827320
...,...,...,...,...,...,...,...
360,34668429,Count the difference in sql result,[enter image description here][1]\r\n\r\n\r\n ...,<sql><linq><linq-to-sql>,2016-01-08 02:14:52,LQ_EDIT,0.535925
132,34589023,Undefined index: category_icon_code in line 257,<p>Hello I am getting Notice Undefined index: ...,<php><indexing><undefined>,2016-01-04 10:25:46,LQ_CLOSE,0.535211
223,34622755,"Select all text between quotes, parentheses et...",<p>Sublime Text has this same functionality vi...,<editor><sublimetext3><atom-editor>,2016-01-05 22:55:51,HQ,0.531906
389,34678558,"C# ""content acceptance",Hi you know I can't make my character move in ...,<c#><unity3d>,2016-01-08 13:45:30,LQ_EDIT,0.527320


In [8]:
sim_df = get_sim_df_for_string(predictions,'embed_sentence_bert_embeddings', 'How to sort an array in Scala?' )
sim_df.sort_values('sim_score', ascending = False)

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y,sim_score
468,34726096,How to get an array values in the dropdown in ...,Please suggest how to get array values in the...,<perl>,2016-01-11 16:09:54,LQ_EDIT,0.889675
460,34718641,How to create asossiative array in wrapping class,"I have made a array associative like this , an...",<c#>,2016-01-11 10:01:48,LQ_EDIT,0.866405
492,34743937,how to use mysql one fiel AND,My simple question about mysql\r\n\r\n\r\nThis...,<mysql>,2016-01-12 12:38:55,LQ_EDIT,0.864099
316,34650738,How can I change the format of this input?,<p>I have this script in js:</p>\n\n<p><strong...,<javascript><regex><validation>,2016-01-07 08:48:14,LQ_CLOSE,0.863125
135,34589908,Using array values in images?,<p>I have a some pictures with values of 1 - 1...,<javascript><jquery><html><image><append>,2016-01-04 11:16:45,LQ_CLOSE,0.858471
...,...,...,...,...,...,...,...
341,34663335,C vs C++ sizeof,<p>I just came across this simple code snippet...,<c++><c><sizeof>,2016-01-07 19:16:31,HQ,0.615517
448,34706960,1.#QNAN000000000000 interrupts the loop,This is my problem: I am simulating a particle...,<c>,2016-01-10 15:08:06,LQ_EDIT,0.612291
236,34626978,Laravel framework tutorial,<p>I'm a PHP developer and want to learn <stro...,<php><laravel>,2016-01-06 06:30:00,LQ_CLOSE,0.604020
264,34634366,Android ActionBar Backbutton Default Padding,<p>I am creating a custom <code>ActionBar</cod...,<android><android-layout><android-actionbar><a...,2016-01-06 13:30:18,HQ,0.599509


In [9]:
sim_df = get_sim_df_for_string(predictions,'embed_sentence_bert_embeddings', 'How to install Linux?' )
sim_df.sort_values('sim_score', ascending = False)

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y,sim_score
326,34656814,how to implement this function,now have `fmapT` and `traverse` :\r\n\r\n f...,<haskell><traversal><functor><haskell-lens>,2016-01-07 13:50:12,LQ_EDIT,0.929197
473,34730910,How to make string accessible to all forms,"I have a form called ""AddFile"" and I have a te...",<c#><string><listview><global>,2016-01-11 20:56:09,LQ_EDIT,0.883789
117,34585453,How to bind raw html in Angular2,<p>I use Angular 2.0.0-beta.0 and I want to cr...,<angular>,2016-01-04 05:58:57,HQ,0.877632
242,34628958,How do i implement the algorithm below,"Get a list of numbers L1, L2, L3....LN as argu...",<python><algorithm>,2016-01-06 08:52:17,LQ_EDIT,0.874445
100,34579243,how to do this on bootstrap,<p>I am new in using bootstrap and I want to k...,<css><html>,2016-01-03 17:11:23,LQ_CLOSE,0.869619
...,...,...,...,...,...,...,...
341,34663335,C vs C++ sizeof,<p>I just came across this simple code snippet...,<c++><c><sizeof>,2016-01-07 19:16:31,HQ,0.607781
213,34620317,Asynchronous classes and its features,"<p>Newbie in programming, I am trying to under...",<java>,2016-01-05 20:05:50,LQ_CLOSE,0.603948
274,34637035,Are global static variables within a file comp...,<p>I know declaring a global variable as STATI...,<c><variables><static><global>,2016-01-06 15:41:29,LQ_CLOSE,0.588950
360,34668429,Count the difference in sql result,[enter image description here][1]\r\n\r\n\r\n ...,<sql><linq><linq-to-sql>,2016-01-08 02:14:52,LQ_EDIT,0.575117
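The queries above all follow the same pattern: embed one string, score it against every stored embedding, and sort. As a rough sketch of that pattern on made-up vectors, the hypothetical helper `top_k_similar` below scores a single query vector against a matrix directly, without replicating the query once per row, and returns the k best matches.

```python
import numpy as np

def top_k_similar(query_vec, embed_mat, k=3):
    """Return (indices, scores) of the k rows of embed_mat most similar to query_vec.

    Hypothetical helper for illustration; uses cosine similarity.
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = embed_mat / np.linalg.norm(embed_mat, axis=1, keepdims=True)
    scores = m @ q                       # one cosine similarity per row
    order = np.argsort(-scores)[:k]      # highest scores first
    return order, scores[order]

# Toy "sentence embeddings" (made up); each row is one sentence
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.0, 0.0])

idx, scores = top_k_similar(toy_query, toy_embeddings, k=2)
print(idx, scores)
```

With real data, embed_mat would be the matrix built from the embedding column and query_vec the embedding of the input string.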


# Let's use multiple Embeddings at the same time for our comparison!

First, let's load 3 embeddings at the same time and embed the text in our dataset

In [10]:
multi_pipe = nlu.load('en.embed_sentence.electra embed_sentence.bert en.embed_sentence.bert_large_cased ')
multi_embeddings = multi_pipe.predict(df.Title,output_level='document')
multi_embeddings

sent_electra_small_uncased download started this may take some time.
Approximate size to download 48.7 MB
[OK!]
sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
sent_bert_large_cased download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


origin_index,embed_sentence_bert_embeddings,en_embed_sentence_bert_large_cased_embeddings,document,en_embed_sentence_electra_embeddings
0,"[-1.72942316532135, 0.6468319892883301, -0.351...","[-0.08645061403512955, -0.6638216972351074, -0...",Java: Repeat Task Every Random Seconds,"[0.2554936707019806, 0.31124797463417053, -0.2..."
1,"[-0.6685013175010681, 0.08217886090278625, -0....","[-0.19792422652244568, -0.3418532609939575, -0...",Why are Java Optionals immutable?,"[0.0773053914308548, -0.06638598442077637, -0...."
2,"[-0.8454132080078125, -0.7770175337791443, -0....","[-0.050022322684526443, 0.13340312242507935, -...",Text Overlay Image with Darkened Opacity React...,"[0.05825200304389, 0.2296592742204666, 0.21679..."
3,"[-0.41476115584373474, 0.15586626529693604, -0...","[-0.4693029224872589, -0.054555993527173996, -...",Why ternary operator in swift is so picky?,"[-0.08927658945322037, -0.19631914794445038, -..."
4,"[-1.2917425632476807, -0.0196269191801548, -0....","[-0.103646419942379, -0.20527635514736176, -0....",hide/show fab with scale animation,"[-0.3903045654296875, -0.16252142190933228, -0..."
...,...,...,...,...
495,"[-1.2739437818527222, 0.7318032383918762, -0.6...","[-0.0020536284428089857, -0.09405096620321274,...",Can't call function Python,"[0.5778179168701172, 0.24898403882980347, -0.1..."
496,"[-1.373586654663086, 0.46381285786628723, -0.6...","[0.3330710530281067, -0.07615971565246582, -0....",alright this is my most rescent code,"[-0.20935986936092377, -0.11303772032260895, -..."
497,"[-1.4447592496871948, -0.48064124584198, -0.80...","[-0.12869122624397278, -0.5377069115638733, -0...",pre tag text not coming in innerText,"[-0.46354207396507263, -0.20844335854053497, 0..."
498,"[-0.22394110262393951, 1.0780186653137207, -0....","[-0.23117859661579132, -0.370358407497406, -0....","loading fonts ttf crashes , error loading with...","[-0.07354014366865158, -0.08907054364681244, 0..."


# Multi Embeddings Similarity


Let's define a function that takes in a string to embed, a list of embedding column names, and a pipeline.

get_sim_df_for_string_multi() embeds the input string with every embedding loaded in the input NLU pipeline, calculates its similarity to every sentence in the input DataFrame for each embedding, and averages the results into a final normalized score.

In [11]:
def get_sim_df_for_string_multi(predictions,embed_col_names, string_to_embed,pipe=multi_pipe):
  # Creates a DataFrame with a sim_score column that holds the similarity of every sentence to string_to_embed
  # The similarities of all embeddings in embed_col_names are accumulated and normalized by dividing by len(embed_col_names)

  # make an empty similarity matrix which will store the aggregated similarities across the different embeddings
  cum_sim = np.zeros((len(predictions),len(predictions)))

  # embed with all embedders currently loaded in pipeline
  embeddings = pipe.predict(string_to_embed).iloc[0]

  # loop over all embedding columns and accumulate the similarities with string_to_embed into cum_sim
  for e_col in embed_col_names:

    # get the current embedding for input string
    embedding = embeddings[e_col]

    # stack embedding vector for input string
    m = np.array([embedding,]*len(predictions))

    # put df vectors in np matrix
    embed_mat = np.array([x for x in predictions[e_col]])

    # calculate new similarities
    sim_mat = cosine_similarity(m,embed_mat)

    # accumulate new similarities in cum_sim
    cum_sim += sim_mat

  # average over the embeddings (all rows of cum_sim are identical, so the first one suffices)
  predictions['sim_score'] = cum_sim[0]/len(embed_col_names)
  return predictions
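To see what the averaging does, here is a minimal sketch on made-up data with two hypothetical embedding spaces of different dimensionality. Cosine similarity is computed separately inside each space, so the vector sizes do not need to match, and the per-space scores are then averaged. The helper `cosine_to_rows` and all vectors are assumptions for illustration only.

```python
import numpy as np

def cosine_to_rows(query_vec, embed_mat):
    """Cosine similarity of query_vec to every row of embed_mat."""
    q = query_vec / np.linalg.norm(query_vec)
    m = embed_mat / np.linalg.norm(embed_mat, axis=1, keepdims=True)
    return m @ q

# Three toy sentences embedded in two made-up spaces (2-D and 3-D)
space_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
space_b = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

# The same toy query embedded in both spaces
query_a = np.array([1.0, 0.0])
query_b = np.array([1.0, 0.0, 0.0])

# One row of scores per embedding space, then the mean over spaces
per_space = np.stack([
    cosine_to_rows(query_a, space_a),
    cosine_to_rows(query_b, space_b),
])
avg_score = per_space.mean(axis=0)
print(per_space)
print(avg_score)
```

Because every individual cosine similarity lies between -1 and 1, the averaged score stays in that range as well. A sentence ranks highly only if the embeddings agree on it, or if one of them scores it very strongly.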

In [12]:
col_names = ['en_embed_sentence_electra_embeddings','embed_sentence_bert_embeddings', 'en_embed_sentence_bert_large_cased_embeddings']
sim_df = get_sim_df_for_string_multi(multi_embeddings,col_names, 'How to get started with Machine Learning and Python' )
sim_df.sort_values('sim_score', ascending = False)

origin_index,embed_sentence_bert_embeddings,en_embed_sentence_bert_large_cased_embeddings,document,en_embed_sentence_electra_embeddings,sim_score
305,"[-1.0579495429992676, 0.9618443250656128, -0.9...","[0.019202401861548424, -0.08235745877027512, -...",Best resources for learning Machine Learning f...,"[0.0017500552348792553, 0.09248263388872147, -...",0.799615
473,"[-1.5010557174682617, 1.2317813634872437, -1.0...","[0.06717701256275177, -0.2253427803516388, -0....",How to make string accessible to all forms,"[0.1147889494895935, 0.3052586317062378, -0.45...",0.759447
100,"[-1.545145869255066, 1.301101565361023, -0.783...","[-0.08175431936979294, 0.15494807064533234, 0....",how to do this on bootstrap,"[-0.3878040313720703, 0.544439971446991, -0.28...",0.750830
155,"[-1.6578322649002075, 0.9215767979621887, -1.2...","[0.004186966456472874, -0.00012195772433187813...",How to decrease padding in NumberPicker,"[0.18992789089679718, 0.27696549892425537, -0....",0.746449
117,"[-0.8761248588562012, 0.8857410550117493, -0.7...","[0.22049112617969513, -0.5024952292442322, -0....",How to bind raw html in Angular2,"[0.0016242936253547668, 0.13965380191802979, -...",0.732380
...,...,...,...,...,...
48,"[-1.4526066780090332, -0.3846992552280426, -0....","[-0.14455106854438782, 0.2772638499736786, -0....",japanese and portuguese language cannot support,"[0.2043197602033615, 0.02978350780904293, 0.56...",0.483809
389,"[-1.6448919773101807, 0.9714635610580444, -0.8...","[-0.5681443810462952, -0.2840718924999237, 0.1...","C# ""content acceptance","[0.3010333776473999, 0.048339955508708954, -0....",0.481131
477,"[-1.5496937036514282, 0.48711639642715454, -0....","[-0.0023735463619232178, -0.46493029594421387,...",liste chainées C,"[0.3307200074195862, 0.6620448231697083, 0.155...",0.478984
434,"[-1.77239191532135, 0.32269561290740967, -0.30...","[-0.3565511405467987, -0.15524736046791077, -0...","Stadard ""veiw contact"" icon","[0.25067782402038574, 0.26973265409469604, -0....",0.478298


In [13]:
sim_df = get_sim_df_for_string_multi(multi_embeddings,col_names, 'How to sort an Array in Java' )
sim_df.sort_values('sim_score', ascending = False)

origin_index,embed_sentence_bert_embeddings,en_embed_sentence_bert_large_cased_embeddings,document,en_embed_sentence_electra_embeddings,sim_score
261,"[-1.7481812238693237, 1.1634891033172607, -0.3...","[-0.34271395206451416, -0.4548005163669586, -0...",How to pass parameters to AWS Lambda function,"[-0.0027412506751716137, 0.19408729672431946, ...",0.783900
454,"[-1.5620574951171875, 0.31375429034233093, -0....","[0.07928941398859024, -0.08970289677381516, -0...",How to find percentage value from a table column,"[-0.05851144343614578, 0.32486703991889954, -0...",0.783077
473,"[-1.5010557174682617, 1.2317813634872437, -1.0...","[0.06717701256275177, -0.2253427803516388, -0....",How to make string accessible to all forms,"[0.1147889494895935, 0.3052586317062378, -0.45...",0.782262
349,"[-1.2373740673065186, 0.9328885674476624, -0.6...","[-0.02082238532602787, -0.34088224172592163, -...",how to fix an error on a simulation?,"[0.16280964016914368, 0.6067922115325928, -0.0...",0.780453
155,"[-1.6578322649002075, 0.9215767979621887, -1.2...","[0.004186966456472874, -0.00012195772433187813...",How to decrease padding in NumberPicker,"[0.18992789089679718, 0.27696549892425537, -0....",0.772880
...,...,...,...,...,...
172,"[-0.3316279351711273, 0.5964305996894836, 0.11...","[-0.38492849469184875, -0.017544478178024292, ...",$_SERVER['HTTP_REFERER'] and RewriteCond %{HTT...,"[0.09372983127832413, -0.1859704703092575, 0.1...",0.518671
485,"[-1.266743779182434, 0.7834378480911255, -0.36...","[-0.4286240041255951, 0.0027907954063266516, 0...",Delete SSH key without SSH access,"[0.3149007260799408, 0.30479028820991516, 0.56...",0.511594
279,"[-0.7300604581832886, 0.5116581320762634, -1.2...","[-0.09596068412065506, 0.37932419776916504, 0....",iOS app rejected due to copyright issues,"[0.3499693274497986, 0.36864396929740906, 0.36...",0.510166
448,"[-0.3950641453266144, 0.8491625189781189, 0.00...","[-0.3739803731441498, -0.4027167856693268, 0.0...",1.#QNAN000000000000 interrupts the loop,"[0.3847207725048065, -0.08086767792701721, 0.0...",0.486479


In [14]:
sim_df = get_sim_df_for_string_multi(multi_embeddings,col_names, 'Find maximum of numpy vector' )
sim_df.sort_values('sim_score', ascending = False)

origin_index,embed_sentence_bert_embeddings,en_embed_sentence_bert_large_cased_embeddings,document,en_embed_sentence_electra_embeddings,sim_score
478,"[-1.252250075340271, 0.6805993914604187, -1.28...","[0.06547948718070984, -0.0623813271522522, -0....",how to get output of count variable ?,"[0.34422123432159424, 0.24366424977779388, 0.0...",0.789374
431,"[-0.6944184303283691, 0.8214976191520691, -0.9...","[-0.09017328172922134, -0.5152651071548462, -0...",output of the functions based on nodes,"[0.7579619288444519, 0.14354963600635529, 0.31...",0.771461
134,"[-1.4964807033538818, 0.8033791184425354, -1.1...","[-0.0442177914083004, 0.12318655103445053, -0....",How to remove edge between two vertices?,"[0.5725486874580383, 0.30581313371658325, 0.29...",0.749279
71,"[-1.2068607807159424, 0.35331201553344727, -1....","[-0.2602085769176483, -0.34384885430336, -0.10...",Get superclass name in ES6,"[0.14307811856269836, -0.014038349501788616, 0...",0.744386
412,"[-0.380255788564682, 0.6532482504844666, -0.35...","[0.13670296967029572, -0.4267256259918213, -0....",can you assign initial value to global static ...,"[0.5862113237380981, 0.21501655876636505, 0.70...",0.741941
...,...,...,...,...,...
156,"[-0.6751947402954102, 0.8644970059394836, -1.2...","[-0.285198837518692, -0.17980371415615082, 0.3...",Slick Carousel Easing Examples,"[-0.008313175290822983, 0.45552361011505127, -...",0.447209
190,"[-0.27754655480384827, 0.41774505376815796, -0...","[0.17826826870441437, -0.03250247985124588, -0...",how to connect an android application to MySQL...,"[0.12497323006391525, 0.4649435579776764, -0.0...",0.430095
400,"[-1.1138392686843872, 0.24342913925647736, -0....","[0.23466642200946808, 0.1420179009437561, -0.2...",google maps your timeline api,"[0.2741749882698059, 0.1183297410607338, -0.37...",0.423157
40,"[-0.2265370786190033, -0.17995868623256683, -1...","[-0.15689228475093842, 0.20346693694591522, -0...",Android Studio Import Failing,"[0.4190728962421417, 0.1055925041437149, -0.56...",0.409944


# There are many more Sentence Embeddings to try out!
NLU also offers multilingual embeddings, for example nlu.load('xx.embed_sentence.labse').

In [15]:
nlu.print_all_model_kinds_for_action('embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc