<a href="https://colab.research.google.com/github/MWFK/NLP-Semantic-Similarity/blob/main/ClinicalTrials/Models/01.%20Transformers_Cosine_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Objectives

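Compute semantic similarity between sentences with a pretrained SentenceTransformers model:

`model = SentenceTransformer('stsb-roberta-large')`

Other pretrained models are listed here:

https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0

Cosine similarity is the metric used in this notebook, although other metrics can be chosen too:

`util.pytorch_cos_sim(embedding1, embedding2)`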


The main library we use to compute semantic similarity is SentenceTransformers, which provides a simple way to compute dense vector representations (i.e. embeddings) of texts. It includes many state-of-the-art pretrained models that are fine-tuned for various applications. One of the primary tasks it supports is Semantic Textual Similarity, which is the focus of this notebook.

SentenceTransformers depends on PyTorch and Transformers. `pip` normally pulls both in automatically when installing `sentence-transformers`, but the Libs section also installs them explicitly.

Once the model is defined, we can compute the similarity score of two sentences: encode both sentences with the model, then calculate the cosine similarity of the two resulting embeddings. That value is the semantic similarity score.

In general, other formulas can be used to calculate the final similarity score (e.g. dot product or Jaccard), but here we use cosine similarity because it depends only on the angle between the two embedding vectors, not on their magnitudes. The more important factor is the embeddings, which are produced by the model, so it is important to use a good encoding model.
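To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not model output; the notebook itself relies on `util.pytorch_cos_sim` for the real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (made up for illustration, not model output)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [-3.0, 0.0, 1.0]  # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

print(cosine_similarity(a, b))  # close to 1: scaling a vector does not change the score
print(cosine_similarity(a, c))  # 0.0: orthogonal vectors
```

Because the score ignores vector length, two texts whose embeddings point in the same direction score close to 1 regardless of magnitude.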

### Libs

In [1]:
!pip install sentence-transformers



In [2]:
!pip install transformers



In [3]:
!pip install torch



In [4]:
import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util 

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Download Transformer

In [5]:
model = SentenceTransformer('stsb-roberta-large')

# Clinical Trials Use case

### Data

In [6]:
def get_data():

  # Download Clinical Trials data
  print('Downloading Clinical Trials Data')
  ct_dt = pd.read_csv(r'https://raw.githubusercontent.com/MWFK/NLP-Semantic-Similarity/main/ClinicalTrials/Data/Batches_0.csv', sep=',', engine='python', encoding="utf-8")
  for btch in range(1, 4):
      url = 'https://raw.githubusercontent.com/MWFK/NLP-Semantic-Similarity/main/ClinicalTrials/Data/Batches_' +str(btch)+ '.csv'
      tmp = pd.read_csv(url, sep=',', engine='python', encoding="ISO-8859-1")
      ct_dt = pd.concat([ct_dt, tmp], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
  ct_dt['AllLocation'] = ct_dt['LocationCity'].str.lower().map(str) + ' | ' + ct_dt['LocationState'].str.lower().map(str) + ' | ' + ct_dt['LocationCountry'].str.lower().map(str)
  print('Clinical Trials Data: ',ct_dt.shape, '\n')

  # Download User input data
  print('Downloading Test data')
  test = pd.read_csv('https://raw.githubusercontent.com/MWFK/NLP-Semantic-Similarity/main/ClinicalTrials/Data/TestData.csv', sep=';', engine='python', encoding = "utf-8", skiprows=[0], names=['PatientID','ConditionOrDisease','Age','Gender','LocationCountry','TravelDistance','InclusionCriteria'])
  print('Test Data: ', test.shape)

  return ct_dt, test

ctdt, test = get_data()

Downloading Clinical Trials Data
Clinical Trials Data:  (10152, 21) 

Downloading Test data
Test Data:  (7, 7)


### Data Processing

In [7]:
# Strip leading and trailing whitespace from text columns and replace missing values with ''
def cleansing(data):
  cols = data.select_dtypes(['object']).columns
  data[cols] = data[cols].apply(lambda x: x.str.strip().fillna(''))
  return data

In [8]:
ctdt, test = get_data()
ctdt = cleansing(ctdt)
test = cleansing(test)

test['InclusionCriteria'] = test['InclusionCriteria'].fillna('').astype(str)
ctdt['InclusionCriteria'] = ctdt['InclusionCriteria'].fillna('').astype(str)

Downloading Clinical Trials Data
Clinical Trials Data:  (10152, 21) 

Downloading Test data
Test Data:  (7, 7)


In [9]:
def embedding(ctdt, test):
    
    print('Embedding the test set...')
    embedding1 = model.encode(test['InclusionCriteria'].astype(str).tolist(), convert_to_tensor=True)
    test['InclusionCriteriaEmbedded'] = embedding1.tolist()

    print('Embedding the ctdt set...')
    embedding2 = model.encode(ctdt['InclusionCriteria'].fillna('').astype(str).tolist(), convert_to_tensor=True)
    ctdt['InclusionCriteriaEmbedded'] = embedding2.tolist()
    
    return ctdt, test

In [10]:
%%time
ctdtemb, testemb = embedding(ctdt, test)

Embedding the test set...
Embedding the ctdt set...
CPU times: user 7min 12s, sys: 1.48 s, total: 7min 14s
Wall time: 7min 10s


In [11]:
def data_filtering(ct_dt, test):

  print('Data dimensions before Filtering : ', ct_dt.shape, '\n')

  ### Filtering by Age ###
  print('Filtering by Age...')
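  # NOTE: both sides are strings, so this comparison is lexicographic, not numeric
  tmp = ct_dt[ct_dt.iloc[:,13] <= test.iloc[0,2]]
  # Keep rows whose age unit (Year/Month) matches the last 5 characters of the patient's age string
  tmp = tmp[tmp.iloc[:,13].str.find(test.iloc[0,2][-5:]) != -1]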
  print('Data dimensions: ', tmp.shape, '\n')

  ### Filtering by Gender ###
  print('Filtering by Gender...')
  tmp = tmp[(tmp.iloc[:,12] == test.iloc[0,3]) | (tmp.iloc[:,12] == 'All')] 
  print('Data dimensions: ', tmp.shape, '\n')

  ### Filtering by Travel Distance ###
  print('Filtering by Travel Distance...')
  tmp = tmp[tmp.iloc[:,20].str.find(str(test.iloc[0,5]).lower()) != -1] 
  print('Data dimensions: ', tmp.shape, '\n')

  return tmp

# filtered = data_filtering(ctdtemb, testemb)
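The age filter in `data_filtering` compares age strings directly, which orders them character by character. As a hedged sketch of a numeric alternative, the hypothetical helper `age_to_months` below converts an age string to months before comparing. It assumes ages look like `'18 Years'` or `'6 Months'`; that format is an assumption about the data, not something confirmed by the notebook.

```python
import re

# Unit-to-month factors; the unit names are an assumption about the data format.
_MONTHS_PER_UNIT = {"year": 12, "month": 1}

def age_to_months(age_text):
    """Convert a string like '18 Years' to months; return None if it cannot be parsed."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+?)s?\s*", str(age_text))
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    factor = _MONTHS_PER_UNIT.get(unit)
    return None if factor is None else value * factor

# String comparison: '9' sorts after '1', so this is False
print("9 Years" <= "18 Years")
# Numeric comparison in months: 108 <= 216, so this is True
print(age_to_months("9 Years") <= age_to_months("18 Years"))
```

Unparseable values (for example an empty string left by `cleansing`) return `None`, so a caller has to decide explicitly whether such trials are kept or dropped.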

### Execution

In [12]:
%%time

for index, row in testemb.iterrows():

    #print(index, row['InclusionCriteria'])
    print('\n################################')
    print('Processing the user input: [', index,']')
    print('################################\n')

    # Filter the Clinical Trials Data based on the test data
    filtered = data_filtering(ctdtemb, testemb.iloc[index:index+1,])

    # Rebuild tensors from the embeddings that were precomputed and stored as lists in the data frames
    embedding1        = torch.Tensor(testemb.iloc[index:index+1,7].tolist())   # query embedding (column 7 is 'InclusionCriteriaEmbedded')
    corpus_embeddings = torch.Tensor(filtered['InclusionCriteriaEmbedded'].tolist())         # embeddings of the filtered trials
    
    # top_k results to return
    top_k=2

    # compute similarity scores of the sentence with the corpus
    cos_scores = util.pytorch_cos_sim(embedding1, corpus_embeddings)[0]

    # Take the indices of the top_k highest scores in decreasing order (k is capped at the corpus size)
    top_results = torch.topk(cos_scores, k=min(top_k, len(cos_scores))).indices
    print("Sentence: ", *testemb.iloc[index:index+1,6], "\n")

    print("Top", top_k, "most similar sentences in corpus:")
    for idx in top_results[0:top_k]:
        print(*filtered.iloc[int(idx):int(idx+1), 9], "(Score: %.4f)" % (cos_scores[int(idx)])) 


################################
Processing the user input: [ 0 ]
################################

Data dimensions before Filtering :  (10152, 22) 

Filtering by Age...
Data dimensions:  (9517, 22) 

Filtering by Gender...
Data dimensions:  (9403, 22) 

Filtering by Travel Distance...
Data dimensions:  (645, 22) 

Sentence:  Histologically diagnosed with metastatic non-small cell lung cancer in 2018 | Initially treated with pertuzumab but relapsed | His performance status is ECOG 1 or KPS 90 | His blood and liver function analysis show normal | No other indications like HIV, HCV, HBV | No allergies | Life expectancy over 6 months | No mental disabilities. 

Top 2 most similar sentences in corpus:
Histologically or cytologically confirmed extensive-stage small cell lung cancer (ES-SCLC)|No prior systemic treatment for ES-SCLC|Eastern Cooperative Oncology Group (ECOG) Performance Status of 0 or 1|Measurable disease, as defined by Response Evaluation Criteria in Solid Tumors version 1.1
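The ranking step in the loop above amounts to picking the indices of the `top_k` largest similarity scores. A minimal sketch of that idea using only the standard library follows; `top_k_indices` is a hypothetical helper and the scores are made up for illustration.

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    k = min(k, len(scores))  # never ask for more results than there are scores
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

scores = [0.12, 0.87, 0.45, 0.91, 0.30]  # made-up similarity scores
for i in top_k_indices(scores, k=2):
    print(i, "(Score: %.4f)" % scores[i])
```

Capping `k` at the number of scores keeps the selection from failing when the filters leave fewer than `top_k` candidate trials.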

### Export

In [13]:
# from joblib import dump, load
# dump(ctdtemb, '/content/ctdtemb.joblib') 
# ctdtemb = load('/content/ctdtemb.joblib')

# from google.colab import files
# files.download('/content/ctdtemb.joblib')
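Since embedding the corpus took about seven minutes, caching the result is worth considering. The sketch below shows a load-or-compute pattern with the standard-library `pickle` module, swapped in here for `joblib` so the example has no extra dependencies. The helper `load_or_compute`, the stand-in `expensive` function, and the file name are all hypothetical.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the object cached at cache_path, computing and saving it on a miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result

calls = []

def expensive():
    """Stand-in for the slow embedding step; records each time it runs."""
    calls.append(1)
    return {"embeddings": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ctdtemb.pkl")
    first = load_or_compute(path, expensive)   # computes and writes the cache
    second = load_or_compute(path, expensive)  # reads the cache, no recomputation

print(first == second, len(calls))
```

Pickle files can execute arbitrary code when loaded, so only load caches you created yourself.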