## Important!

As of April 11th, 2023, the Sentence Transformers package from sbert.net is only available for Python 3.10 and below. If you have Python 3.11 installed, you will have to downgrade your Python version (this is where having a venv is very helpful). If you are not using a venv, any package installed under 3.11 will have to be reinstalled under the older version of Python!
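A quick guard at the top of the notebook can catch an unsupported interpreter before any imports fail. This is only a sketch: `python_is_supported` and `MAX_SUPPORTED` are hypothetical names, and the 3.10 ceiling is taken from the note above and may no longer hold for newer releases of the package.

```python
import sys

# Assumption: (3, 10) is the newest supported release, per the note above.
MAX_SUPPORTED = (3, 10)

def python_is_supported(version=None, max_supported=MAX_SUPPORTED):
    """Return True if the (major, minor) version is at or below the ceiling."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) <= max_supported

if not python_is_supported():
    print("Python {}.{} is newer than {}.{}; consider a venv with an older interpreter."
          .format(*sys.version_info[:2], *MAX_SUPPORTED))
```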

### SBERT - Sentence Bidirectional Encoder Representations from Transformers

When it was introduced, BERT was by far the most successful model for general tasks involving semantic textual similarity. However, finding embedding representations for larger chunks of text, such as sentences or paragraphs, is in practice quite a different task from generating the simple word embeddings produced by models like BERT, GloVe, or Word2Vec.

Capturing the context of an entire piece of text required adapting existing models: taking word vector representations and producing a richer, single embedding that supports similarity comparisons between larger chunks of text.

SBERT is such an adaptation built on top of the BERT model. It uses twin BERT networks and a set of new objective functions to optimize similarity measures between pairs of sentences, in addition to a number of techniques like masking, to achieve notable results.

Another benefit of this model for this particular use is that the pretrained models are small and efficient enough to fit on a single computer and run on its own memory and compute power. The OpenAI embedding models, which are based on a roughly similar set of principles, require sending data to OpenAI's servers through an API. That is not an ideal situation when sensitive data is being transferred and processed.
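One common way to turn per-token vectors into a single sentence vector is mean pooling, where a mask excludes padding tokens from the average. The sketch below illustrates only that step with made-up three-dimensional token vectors; `mean_pool` is a hypothetical helper, not part of the sentence-transformers API, and a real model produces the token vectors with a transformer.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Made-up token vectors; the last row of sentence_a stands in for padding.
sentence_a = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0], [9.0, 9.0, 9.0]]
mask_a = [1, 1, 0]

emb_a = mean_pool(sentence_a, mask_a)
print(emb_a)
```

Because the padding row is masked out, it has no influence on the pooled vector.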

In [1]:
#Load necessary packages.

import re 
import pandas as pd
import numpy as np
import warnings
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import matplotlib.pyplot as plt


In [2]:
# model = SentenceTransformer('all-MiniLM-L6-v2')
model = SentenceTransformer('all-mpnet-base-v2')

In [3]:
# Loading, Cleaning, and preprocess data

In [4]:
data = pd.read_spss('S:\\Data Services\\Samson\\Test Datasets\\CMI001-1007 - 2022 Visitors Survey Coded - Copy.sav')

In [5]:
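# Drop rows where Q23 was answered 'No'.
data = data.drop(data[data['Q23'] == 'No'].index)

In [6]:
# Drop rows where the open-ended Q24 response is empty.
data = data.drop(data[data['Q24'] == ''].index)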

In [7]:
data = data.reset_index()

In [8]:
data['embedding'] = data.Q24.apply(lambda x: model.encode(x))

In [9]:
data['embedding'].isna().any()

False

In [10]:
matrix = np.vstack(data.embedding.values)
matrix.shape

(116, 768)

In [15]:
topic = ['research family history']

In [16]:
topic_embedding = model.encode(topic, convert_to_tensor=True)

In [17]:
cosine_scores = []
for i in range(len(data['embedding'])):
    cosine_scores.append(util.cos_sim(data['embedding'][i], topic_embedding))

In [18]:
cosine_scores[0:10]

[tensor([[0.3568]]),
 tensor([[0.2115]]),
 tensor([[0.3127]]),
 tensor([[0.4757]]),
 tensor([[0.5598]]),
 tensor([[0.4869]]),
 tensor([[0.5907]]),
 tensor([[0.2360]]),
 tensor([[0.2648]]),
 tensor([[0.4823]])]

In [19]:
pairs = []
# Cover every response, including the last one.
for i in range(len(cosine_scores)):
    pairs.append({'index': [i], 'score': cosine_scores[i]})

In [20]:
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

In [21]:
pairs[0:10]

[{'index': [39], 'score': tensor([[0.8677]])},
 {'index': [53], 'score': tensor([[0.8627]])},
 {'index': [95], 'score': tensor([[0.8472]])},
 {'index': [79], 'score': tensor([[0.8245]])},
 {'index': [97], 'score': tensor([[0.7539]])},
 {'index': [18], 'score': tensor([[0.7456]])},
 {'index': [31], 'score': tensor([[0.7177]])},
 {'index': [48], 'score': tensor([[0.7107]])},
 {'index': [83], 'score': tensor([[0.7007]])},
 {'index': [92], 'score': tensor([[0.6746]])}]
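The scoring and sorting cells above amount to a top-k search by cosine similarity. As a minimal, dependency-free sketch of that idea, the snippet below uses made-up two-dimensional vectors and plain floats instead of single-element tensors; `cosine` and `top_k` are illustrative names, not library functions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors closest to the query."""
    scored = [(i, cosine(vec, query_vec)) for i, vec in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up embeddings standing in for survey responses and a topic.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.0]

for idx, score in top_k(query_vec, corpus, k=2):
    print(idx, round(score, 4))
```

Keeping the scores as floats makes the sort key unambiguous and the results easier to write into a dataframe column.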

### Ranking results and final reporting

Reuse the ranking systems devised previously, this time with SBERT, and then append the resulting topic suggestions and their rankings to the final dataframe.

### Exploring Cross-Encoder ranking

SBERT also offers a different style of transformer: a cross encoder. Instead of generating a separate vector embedding for each sentence in a pair, a cross encoder is a single transformer that takes a concatenated pair of sentences as input and outputs one score for how well the two match. Depending on the model, that score may lie between 0 and 1 or be returned as an unbounded raw value, as with the scores printed later in this notebook.

In this case it makes sense to use this ability to generate a list of best matches for a particular query. In practice, the query would be a topic for which we want the top k matches.
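Depending on the model and its configuration, a cross encoder may return raw logits rather than values between 0 and 1; the negative scores later in this notebook are an example. If a 0-to-1 scale is wanted, one option is to pass the logits through a sigmoid, which leaves the ranking order unchanged. The sketch below assumes that approach; `sigmoid` is a local helper and the logits are illustrative values in the range of the outputs shown further down.

```python
import math

def sigmoid(x):
    """Logistic function, written to stay stable for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Illustrative raw scores (assumed values, not recomputed from the model).
logits = [-11.277727, -3.612744, 3.3]
probs = [sigmoid(v) for v in logits]

# Sorting by logit or by sigmoid output gives the same order.
order_before = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
order_after = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
print(probs)
print(order_before == order_after)
```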

### New methodology

Trying to impute cluster membership on the dataset via a clustering algorithm is particularly prone to error and is quite rigid in terms of how the coder might include or exclude certain records in any given topic cluster. Mislabelling is the last thing we want, as it would greatly affect the quality and reliability of the end results, and fixing these mistakes would mean going back and manually reviewing everything that was automatically coded.

This next method is a proposal to get around this issue. The first step is to use the community detection function built into sentence transformers to roughly arrange the text so that an experienced coder can rapidly glean the general topics appearing in the data.

From these topics, a cross encoder generates a ranking of every response for each topic. This rank column is ultimately what is given to the coder to assign labels from. It guards against mislabelling while still rapidly grouping large amounts of data into clusters that match a given topic.
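The rank column described above can be derived from the raw scores one topic at a time. The following is a rough sketch under assumptions: the scores are hypothetical values for four responses, and `rank_descending` is an illustrative helper rather than anything provided by sentence-transformers.

```python
def rank_descending(scores):
    """Return 1-based ranks, where rank 1 goes to the highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks

# Hypothetical cross-encoder scores for four responses under two topics.
topic_scores = {
    'research family history': [6.6, -7.9, -0.9, -10.5],
    'look through archives': [-7.0, -9.6, -10.5, 3.3],
}

rank_columns = {topic: rank_descending(s) for topic, s in topic_scores.items()}
for topic, ranks in rank_columns.items():
    print(topic, ranks)
```

With the scores already stored in a dataframe column, pandas' `Series.rank(ascending=False)` should give an equivalent column, though tie handling differs from this sketch.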

In [22]:
cross_en = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [23]:
cross_en_xl = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

In [25]:
query = 'look at archives and records'

In [26]:
sentence_combinations = [[query, corpus_sentence] for corpus_sentence in data['Q24']]

In [28]:
sim_scores = cross_en_xl.predict(sentence_combinations, apply_softmax=True)

In [60]:
sim_scores[0]

-11.277727

In [29]:
sim_scores_argsort = list(reversed(np.argsort(sim_scores)))

In [65]:
sim_scores_argsort[0:10]

[50, 52, 37, 83, 65, 3, 97, 84, 22, 78]

In [73]:
for idx in sim_scores_argsort:
    data.loc[idx, 'look through records and archives'] = sim_scores[idx]

In [74]:
data['look through records and archives']

0     -11.277727
1     -11.242486
2     -11.327414
3      -3.612744
4     -11.322302
         ...    
111   -11.309032
112   -11.214823
113   -11.229262
114   -11.309020
115   -11.190641
Name: look through records and archives, Length: 116, dtype: float64

In [75]:
query_1 = 'research family history'

In [76]:
sen_combo_1 = [[sentence, query_1] for sentence in data['Q24']]

In [77]:
sim_scores_1 = cross_en_xl.predict(sen_combo_1)

In [78]:
sim_scores_1_argsort = list(reversed(np.argsort(sim_scores_1)))

In [79]:
sim_scores_1_argsort

[92, 25, 95, 53, 63, 48, 39, 83, 79, 11,
 50, 12, 10, 66, 84, 97, 112, 6, 19, 18,
 69, 54, 16, 111, 60, 47, 62, 32, 96, 85,
 14, 102, 5, 58, 37, 46, 49, 31, 56, 29,
 107, 82, 55, 65, 27, 57, 51, 100, 74, 73,
 77, 30, 78, 21, 94, 23, 52, 2, 33, 75,
 71, 42, 41, 113, 68, 61, 88, 1, 59, 0,
 4, 9, 72, 43, 106, 103, 80, 28, 3, 38,
 108, 105, 89, 81, 90, 109, 115, 67, 45, 44,
 34, 110, 17, 104, 70, 86, 64, 93, 7, 40,
 91, 20, 15, 36, 87, 26, 35, 8, 76, 13,
 24, 99, 22, 114, 101, 98]

In [81]:
print("Query:", query_1)
for idx in sim_scores_1_argsort:
    print("{:.1f}\t{}".format(sim_scores_1[idx], data['Q24'][idx]))

Query: research family history
6.6	Research family
1.7	Research family immigration
-0.9	Investigate my family history
-1.1	Family history search
-2.9	To see the Museum, learn more about the Walnut as my sister in law was 3 when she immigration granted on the Walnut. Learn more about immigration process in my parents time.
-3.1	Research my great grandfather
-3.2	Look up family history
-3.8	Family archives
-3.9	Aid in learning my family history.
-4.2	See the exhibits and start researching family history of immigtation
-4.5	Look at family records.
-4.5	Research and experience
-5.7	To obtain family information
-6.1	Research names registry(
-6.4	Check the family records
-6.7	Find record of family
-6.9	To find any trace of my maternal grandfather coming to Canada in the early 1900s. And my paternal grandparents around 1910
-7.4	To gather information about our family that came from Italy
-7.4	To learn more about my husbands history
-7.9	Genealogy search
-8.1	Learn about family members experie

In [31]:
print("Query:", query)
for idx in sim_scores_argsort:
    print("{:.1f}\t{}".format(sim_scores[idx], data['Q24'][idx]))

Query: look at archives and records
3.3	Look at family records.
0.3	See archives
-2.4	Archive
-2.4	Family archives
-2.5	Records
-3.6	Immigration archives
-4.1	Find record of family
-4.5	Check the family records
-6.0	Find my in laws immigration archive
-6.2	Immigration records and Dutch history.
-6.8	Look into relatives
-7.0	Look up family history
-7.7	She my parents name in the archives
-8.2	Record of my grandmother
-8.8	Found records on my mom and her mom arriving in canada
-8.9	Check family immigration records
-8.9	Family history search
-9.4	My grandmother's  immigration records
-9.6	Genealogy search
-9.8	Look at my family's passage to Canada
-10.0	We wanted to learn about heritage
-10.0	See the exhibits and start researching family history of immigtation
-10.3	See exhibit
-10.5	PHOTOGRAPH EXHIBIT
-10.5	Learn
-10.5	Get documentation of my landing.
-10.5	Investigate my family history
-10.6	Photo exhibit
-10.6	Consult on exhibit
-10.7	See the registration papers for my grandmother
-10.

### Community Detection

SBERT includes a method for grouping similar pieces of text based on pairwise similarity scores, as opposed to something like k-means clustering, which finds optimal centroids. This might work better here because the groups formed around k-means centroids are not necessarily groups whose members are all similar to one another pairwise.
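To show how grouping by pairwise similarity differs from fitting centroids, here is a simplified greedy sketch. It is not the library's implementation: `greedy_communities` is a hypothetical function, the similarity matrix is made up, and the real `util.community_detection` may differ in details such as how overlapping communities are resolved.

```python
def greedy_communities(sim, threshold=0.63, min_size=3):
    """Group items whose similarity to a central item meets the threshold."""
    n = len(sim)
    candidates = []
    for center in range(n):
        members = [j for j in range(n) if sim[center][j] >= threshold]
        if len(members) >= min_size:
            # Put the central item first, then the rest by decreasing similarity.
            members.sort(key=lambda j: (j != center, -sim[center][j]))
            candidates.append(members)
    # Prefer the largest candidate communities.
    candidates.sort(key=len, reverse=True)
    assigned = set()
    communities = []
    for members in candidates:
        # Keep only items not already claimed by a larger community.
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_size:
            communities.append(fresh)
            assigned.update(fresh)
    return communities

# Made-up pairwise similarities: items 0-2 are alike, items 3-4 are alike.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 1.0, 0.9],
    [0.2, 0.1, 0.2, 0.9, 1.0],
]
print(greedy_communities(sim, threshold=0.63, min_size=2))
```

Raising the threshold makes the groups tighter and leaves more items unassigned, which matches the trade-off the `threshold` and `min_community_size` arguments control below.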

In [41]:
embeddings = model.encode(list(data['Q24']), convert_to_tensor=True)

In [42]:
clusters = util.community_detection(embeddings, threshold=0.63, min_community_size=3)

In [43]:
clusters

[[10, 16, 18, 19, 31, 39, 48, 50, 53, 73, 79, 83, 84, 95, 97, 107],
 [4, 9, 14, 22, 25, 29, 32, 46, 62, 77],
 [2, 26, 33, 38, 47, 57, 59, 60, 100, 112],
 [7, 34, 67, 86, 93, 104],
 [35, 40, 56, 61, 75],
 [11, 55, 63, 64],
 [44, 81, 90, 110],
 [3, 28, 37],
 [36, 43, 99],
 [0, 30, 105]]

In [55]:
# First element listed in each cluster is the central item. All other members of cluster are measured from this item.

for i, v in enumerate(clusters):
    print('Cluster '+ str(i))
    print(data['Q24'][v])
    print('---------------------------------------')

Cluster 0
10                         To obtain family information
16                                  Look into relatives
18                                     Genealogy search
19              To learn more about my husbands history
31                                      Ancestry search
39                               Look up family history
48                        Research my great grandfather
50                              Look at family records.
53                                Family history search
73     Search for any pictures or info about my parents
79                   Aid in learning my family history.
83                                      Family archives
84                             Check the family records
95                        Investigate my family history
97                                Find record of family
107                    Check family immigration records
Name: Q24, dtype: object
---------------------------------------
Cluster 1
4      To learn abo

#### Next Step:

Once satisfied with the quality of the clusters, consider how to define each cluster as a topic.

From there, consider a ranking system that uses a cross encoder to generate a rank for each response under each cluster topic, and use human expertise to adjust the clusters accordingly.

In [45]:
# topics list to append labels in final dataframe.
# These will be established by the coder manually.

topics = ['research family history', 'see exhibits', 'look through archives', 'war brides']
topic_value = [0]

In [None]:
# create blank dataframe column to fill with topics

data.insert(data.columns.get_loc('Q24_2'), 'Q24_1_1', None)

In [46]:
# Clusters may need to be grouped together; similar themes may have separated in community detection.
# Be certain to order clusters based on the topics list above.

# combine 0, 1, 2, 4, 8, and 9 (research family history)

clus1 = [0, 1, 2, 4, 8, 9]
cluster1 = []

for idx in clus1:
    cluster1.append(clusters[idx])

cluster1 = [item for sublist in cluster1 for item in sublist]

# write in values for topics[0] that match values in cluster1 list

for idx in cluster1:
    data.loc[idx, 'Q24_1_1'] = topics[0]

In [51]:
# Repeat the process for each topic.
clus2 = [3, 5]

# Flatten the selected clusters into a single list of row indices.
cluster2 = [item for idx in clus2 for item in clusters[idx]]

for idx in cluster2:
    data.loc[idx, 'Q24_1_1'] = topics[1]

In [52]:
clus3 = [7]

# Flatten the selected clusters into a single list of row indices.
cluster3 = [item for idx in clus3 for item in clusters[idx]]

for idx in cluster3:
    data.loc[idx, 'Q24_1_1'] = topics[2]

In [53]:
clus4 = [6]

# Flatten the selected clusters into a single list of row indices.
cluster4 = [item for idx in clus4 for item in clusters[idx]]

for idx in cluster4:
    data.loc[idx, 'Q24_1_1'] = topics[3]

In [54]:
data['Q24_1_1'].value_counts()

Q24_1_1
research family history    47
see exhibits               10
war brides                  4
look through archives       3
Name: count, dtype: int64
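The labelling cells above all follow one pattern: pick some cluster ids, flatten their members, and write one topic onto those rows. A possible way to express that once is sketched below; `merge_clusters` and `label_rows` are hypothetical helpers, and the clusters and assignment shown are toy values rather than the ones from this dataset.

```python
from itertools import chain

def merge_clusters(clusters, cluster_ids):
    """Flatten the chosen clusters into one list of row indices."""
    return list(chain.from_iterable(clusters[i] for i in cluster_ids))

def label_rows(clusters, topic_to_cluster_ids):
    """Map each row index to the topic whose clusters contain it."""
    labels = {}
    for topic, cluster_ids in topic_to_cluster_ids.items():
        for row in merge_clusters(clusters, cluster_ids):
            labels[row] = topic
    return labels

# Toy clusters and a coder-supplied assignment of cluster ids to topics.
toy_clusters = [[10, 16], [4, 9], [7, 34], [3, 28]]
assignment = {
    'research family history': [0, 1],
    'see exhibits': [2],
    'look through archives': [3],
}

labels = label_rows(toy_clusters, assignment)
print(labels)
```

In the notebook, the resulting dictionary could then be written to the label column in one step, for example with something like `data.index.map(labels)`; rows outside every cluster would stay unlabelled.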

### Export final data

The pyreadstat package can write sav, Stata, and other file types, which makes it a good fit for our use here.

In [63]:
import pyreadstat

In [64]:
# writes data from notebook to an sav file

# Raw string so the backslashes in the Windows path are not treated as escape sequences.
pyreadstat.write_sav(data, r'S:\Data Services\Samson\Test Datasets\CMI001-1007 - 2022 Visitors Survey Coded - Copy Alteration.sav')