## Semantic Search with Siamese BERT embeddings
This notebook demonstrates semantic search over job postings with a pretrained Siamese BERT model: every posting is embedded once, and each query is ranked against the corpus by cosine similarity.
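As a minimal sketch of the ranking measure, cosine similarity compares the direction of two vectors and ignores their length. The three-dimensional vectors below are hypothetical stand-ins for real embeddings, which have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (illustration only)
query = [1.0, 0.0, 1.0]
doc_similar = [2.0, 0.0, 2.0]    # same direction as the query, different length
doc_unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, doc_similar))
print(cosine_similarity(query, doc_unrelated))
```

A score near 1 means the two texts point in almost the same direction in embedding space; a score near 0 means they are unrelated.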


In [1]:
import re
import pandas as pd
import numpy as np
import spacy
import nltk
from nltk.stem import WordNetLemmatizer
from collections import Counter


In [2]:
raw_data = pd.read_excel('test_data.xlsx')
raw_data.head()

Unnamed: 0,jobId,jobUrl,jobTitle,jobDescription,datePosted,companyId,companyIdNormalised,companyName,rawWageMin,rawWageMax,sourceName,qualifications
0,00d9917e95ebbb58d237e90b5a01095947c31fbe119d82...,https://www.efinancialcareers.sg/jobs-Singapor...,"Change Manager, IBOR Transition Programme","Change Manager, IBOR Transition Programme<br/>...",2022-04-01,7cc4c2d8b4893e7c64265beccd30d4c1b644cf8b57a9e8...,06634f73009b4765beafae5f98c0996b33870a3d34fa87...,Standard Chartered Bank,0,0,E-FinancialCareer,['No Requirement']
1,0104963e8e1289488f2ff96edfe95dddc9ab84231b37a5...,https://www.efinancialcareers.sg/jobs-Singapor...,"Analyst, KYC Analyst, Corporate Banking, Insti...","Analyst, KYC Analyst, Corporate Banking, Insti...",2022-04-01,c3475240458aa07566e1db7eec98affa5d85d8bd2b9577...,1810faaf5f96a398f5b43df1b80809dbf5b7935f94a5f7...,DBS Bank Limited,0,0,E-FinancialCareer,['Bachelors']
2,01561a39ff31372551e0be1caaf6a2c32150925f75af76...,https://www.jobstreet.com.sg/en/job/senior-leg...,"(Senior) Legal Counsel, Autumn Venture - SC Ve...",About Standard CharteredWe are a leading inter...,2022-04-01,a1ad3581a81222507fa918dc2d978ed1db672c44415c3f...,25fa191ddad0bb854bd7bbe811437b1c820271351c4ac8...,Autumn Life Pte. Ltd.,0,0,JobStreetSG,['No Requirement']
3,0110a85844f5aa0b87060109b25567903d2188130391b2...,https://www.efinancialcareers.sg/jobs-Singapor...,Product/Data Analyst,About us<br/>Endowus is Asia's leading fee-onl...,2022-04-01,4aac38458acbd96d2de7ceda69e0c3f9923c5c8b4773f5...,3c7c5b38bc57d7f85029e41a219822866fe3dd6587027a...,Endowus,0,0,E-FinancialCareer,['Bachelors']
4,02728a523b995791d2c81657a906c223cfdec4a9eb980b...,https://www.efinancialcareers.sg/jobs-Singapor...,Product Operations Associate (Account Opening),About us<br/>Endowus is Asia's leading fee-onl...,2022-04-01,4aac38458acbd96d2de7ceda69e0c3f9923c5c8b4773f5...,3c7c5b38bc57d7f85029e41a219822866fe3dd6587027a...,Endowus,0,0,E-FinancialCareer,['No Requirement']


## Data Cleaning

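Removal of HTML tags, entities, URLs and some punctuation, followed by lowercasing. Helpers for tokenization and stopword filtering are also defined; stemming and lemmatization are set up but currently disabled in `tokenize`.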
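The cleaning steps can be sketched on a single string using only the standard library. The patterns, the tiny stopword set and the sample text below are illustrative assumptions, not the exact ones used in this notebook.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's list)
STOPWORDS = {"a", "the", "of", "and", "in", "to"}

def strip_markup(text):
    text = re.sub(r'<[^>]+>', ' ', text)      # HTML tags such as <br/>
    text = re.sub(r'&[a-zA-Z]+;', ' ', text)  # named entities such as &amp;
    text = re.sub(r'http\S+', ' ', text)      # URLs
    text = re.sub(r'\s+', ' ', text)          # collapse whitespace
    return text.strip().lower()

def drop_stopwords(text):
    # Keep tokens that are not stopwords and have at least 2 characters
    return " ".join(w for w in text.split() if w not in STOPWORDS and len(w) >= 2)

sample = "Manage the team<br/>Health &amp; Safety of a site: https://example.com/jobs"
cleaned = strip_markup(sample)
print(cleaned)
print(drop_stopwords(cleaned))
```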

In [3]:
from nltk.stem import WordNetLemmatizer, snowball
from nltk.tokenize import sent_tokenize, word_tokenize

# a set gives fast membership tests when filtering tokens
stopwords = set(nltk.corpus.stopwords.words('english'))
# stemmer is kept for the (currently disabled) stemming step in tokenize()
stemmer = snowball.SnowballStemmer("english")

def clean_text(text):
    # remove HTML tags, entities, URLs and some punctuation, then lowercase
    try:
        text = re.sub(r'<[^>]+>', ' ', text)      # HTML tags, e.g. <br/>
        text = re.sub(r'&[a-zA-Z]+;', '', text)   # named entities, e.g. &amp;
        text = re.sub(r'http\S+\s*', ' ', text)   # URLs
        text = re.sub(r'\.', '', text)            # periods
        text = re.sub(r'\(', '', text)
        text = re.sub(r'\)', '', text)
        text = re.sub(r' +', ' ', text)           # collapse repeated spaces
        text = text.lower()
    except Exception as e:
        # non-string input (e.g. NaN) ends up here and is returned unchanged
        print(f"Error cleaning {text!r}: {e}")
        return text

    return text

def clean_sent(text):
    words = tokenize(text)
    final = " ".join(words)
    
    return final

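def tokenize(input_doc):
    # Lowercase first so capitalised stopwords ("The") are filtered too,
    # then drop stopwords and single-character tokens.
    # Stemming/lemmatization is disabled; to enable it, map each token through
    # WordNetLemmatizer().lemmatize(stemmer.stem(x))
    tokens = [x.lower() for x in word_tokenize(input_doc)]

    return [x for x in tokens if x not in stopwords and len(x) >= 2]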

#check for nulls
def clean_col(text):
    if text.isnull().values.any():
        text.dropna(how='any', inplace=True)
    text.drop_duplicates(inplace=True)

    return text


In [4]:
# preprocess sample
clean_sent(clean_text(raw_data.jobDescription[68]))

'responsibilitiesmanage lead team provide residential services effectivelymaintain synergy hotel on-site managing agent resident councilresponsible time attendance record supervised team ensure accurate billing process mcstoversee maintenance accurate updated occupant records ensuring staff adherence confidentiality residents contact details personal informationconduct regular staff meetings maintain open channel communicationresolve resident complaints management office maintain high level resident satisfaction service qualitycommunicate on-site management office resident feedback recurring challenges improvementsparticipate site meetings convened management office requiredconduct participate yearly service excellence audit collaborate on-site managing agent meet compliancemaintain ongoing schedules ensure residential facilities safe clean attractivemaintain compliance regulatory requirements including workplace health amp safety occupational health amp safetyrequirementsdiploma hospi

In [5]:
# combine company name, title and description into a single text field
raw_data['jobdata'] = raw_data[['companyName', 'jobTitle', 'jobDescription']].apply(lambda x: ", ".join(x.dropna().astype(str)), axis=1)
# remove rows that contain NaN in any column
if raw_data.isnull().values.any():
    raw_data.dropna(how='any', inplace=True)
# reset_index(drop=True) already renumbers the rows 0..n-1
raw_data.reset_index(drop=True, inplace=True)
print(f"NaN rows: {raw_data.isnull().values.any()}")
print(f"total length after removal: {len(raw_data)}")

# Cleaning can make previously distinct rows identical, so drop duplicates and reindex again
%time raw_data['jobdata'] = raw_data['jobdata'].apply(clean_text)
raw_data.drop_duplicates(subset=['jobdata'], keep="first", inplace=True)
raw_data.reset_index(drop=True, inplace=True)
#sample output
print(raw_data['jobdata'][0])

NaN rows: False
total length after removal: 99994
CPU times: user 10.4 s, sys: 104 ms, total: 10.5 s
Wall time: 10.5 s
standard chartered bank, change manager, ibor transition programme, change manager, ibor transition programme about standard chartered we are a leading international bank focused on helping people and companies prosper across asia, africa and the middle east to us, good performance is about much more than turning a profit it's about showing how you embody our valued behaviours - do the right thing, better together and never settle - as well as our brand promise, here for good we're committed to promoting equality in the workplace and creating an inclusive and flexible culture - one where everyone can realise their full potential and make a positive contribution to our organisation this in turn helps us to provide better support to our broad client base 12 months direct contract the role responsibilities develop, drive and deliver coherent communications and engagement 
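To illustrate why duplicates are dropped a second time after cleaning, here is a small sketch with hypothetical rows and a simplified `clean` helper (not the notebook's `clean_text`): rows that differ only in markup or spacing collapse to the same string once cleaned.

```python
import re

def clean(text):
    # Simplified stand-in for the cleaning step: strip tags, collapse spaces, lowercase
    text = re.sub(r'<[^>]+>', ' ', text)
    return re.sub(r'\s+', ' ', text).strip().lower()

# Hypothetical raw rows: the first two differ only in spacing around the tag
raw_rows = [
    "Data Analyst<br/>Build reports",
    "Data Analyst <br/> Build reports",
    "Sales Executive<br/>Manage accounts",
]

cleaned_rows = [clean(r) for r in raw_rows]

# dict.fromkeys keeps the first occurrence of each value, like keep="first"
unique_raw = list(dict.fromkeys(raw_rows))
unique_cleaned = list(dict.fromkeys(cleaned_rows))

print(len(unique_raw), len(unique_cleaned))
```

Deduplicating only on the raw text would keep all three rows here, while deduplicating on the cleaned text keeps two.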


In [6]:
# check for badly formed (very short) job texts
# drop_index = []
# for jidx, jd in raw_data.iterrows():
#     if len(word_tokenize(jd['jobdata'])) <= 3:
#         print(f"{jidx}: ", jd['jobdata'])

In [7]:
from sentence_transformers import SentenceTransformer, util
import torch

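# We use a Siamese BERT model for the embeddings because it maps sentences and paragraphs into a fixed-size vector space
model = SentenceTransformer('sentence-transformers/paraphrase-distilroberta-base-v2')
# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)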

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
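The printed module lists `pooling_mode_mean_tokens: True`, meaning the sentence vector is the average of the token vectors. A minimal sketch of that idea in plain Python follows, with hypothetical two-dimensional token vectors and a padding mask; the library's actual implementation operates on batched tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions whose mask is 0 (padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Hypothetical token vectors: two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

Because the result has the same dimension regardless of how many tokens went in, texts of different lengths end up in the same fixed-size vector space.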

In [8]:
# pass a plain list of strings so encoding does not depend on the DataFrame index
%time embeddings = model.encode(raw_data['jobdata'].tolist())

CPU times: user 8min 55s, sys: 2.28 s, total: 8min 57s
Wall time: 7min


In [9]:
embeddings.shape
cuda_embeddings = torch.from_numpy(embeddings).float().to(device)

In [20]:
# Sample queries
queries = [
    'data scientist job in financial sector', 
    'Remote only job', 
    'Web frontend react',
    'sales executive',
    'Devops AWS',
    'Cooking job'
]
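The retrieval step amounts to scoring the query against every corpus vector and keeping the k best. The sketch below shows this in plain Python with hypothetical two-dimensional vectors: after L2-normalisation, cosine similarity reduces to a dot product. The helper names are illustrative and not part of `sentence_transformers`.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, corpus_vecs, k):
    """Return (score, index) pairs for the k corpus vectors closest to the query."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        v = normalize(vec)
        score = sum(a * b for a, b in zip(q, v))  # dot product of unit vectors
        scored.append((score, idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical toy corpus and query (illustration only)
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]

results = top_k(query, corpus, k=2)
for score, idx in results:
    print(f"idx: {idx} (Score: {score:.4f})")
```

Sorting the full score list is fine for a toy corpus; for roughly 100,000 postings a partial selection such as `torch.topk` avoids ordering scores that are never displayed.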

In [21]:
# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(raw_data['jobdata']))

for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, cuda_embeddings)[0]

    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print(top_results[0])
    print(f"\nTop {top_k} most similar sentences in corpus:\n")

    for score, idx in zip(top_results[0], top_results[1]):
        print(f"idx: {idx} (Score: {score:.4f})")
        print(f"Text:\n {raw_data['jobdata'][int(idx)]}\n")





Query: data scientist job in financial sector
tensor([0.6497, 0.6418, 0.6390, 0.6387, 0.6368], device='cuda:0')

Top 5 most similar sentences in corpus:

idx: 3138 (Score: 0.6497)
Text:
 singapore exchange limited, financial data engineer, job description amp requirementsa financial data engineer is responsible for developing, testing and maintaining architectures to improve data reliability and processing efficienciesyou will produce regular reports using financial data for leadership review, and work collaboratively across the entire finance team to analyse sgxs business performance and strategyyou will work closely with financial data analyst in forecasting future revenue and expenditures to help determine budgets for upcoming projectskey duties include: improving current methods used in the collection of data from other parts of the company while retaining the essential integrity and security of the information collected analyze financial data and create financial models for de