<a href="https://colab.research.google.com/github/MWFK/NLP-Semantic-Similarity/blob/main/ClinicalTrials/01.%20ct_dt_Cosine_SoftCosine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Libs


In [1]:
# Python libs to manipulate dataframes and arrays
import pandas as pd
import numpy as np

# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer

# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity

# To get the word vectors, you need a word embedding model. Let’s download the FastText model using gensim’s downloader api.
import gensim
print(gensim.__version__)

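# softcossim is available in gensim 3.x (3.6.0 is used here); upgrade if an older 3.x release cannot import it. Note that it was removed in gensim 4.0.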
from gensim.matutils import softcossim 
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess

3.6.0


### Data

In [2]:
# Define the documents
doc_trump    = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference in the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin    = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"
documents    = [doc_trump, doc_election, doc_putin]

### Modeling using Cosine as a metric

To compute cosine similarity, you need the word counts of each document. The CountVectorizer or the TfidfVectorizer from scikit-learn computes these and returns a sparse matrix. Optionally, the sparse matrix can be converted to a pandas dataframe to view the word frequencies in a tabular format.

TfidfVectorizer() would arguably be a better choice than CountVectorizer(), because it down-weights words that occur frequently across documents. Then use cosine_similarity() to get the final output. It accepts the document-term matrix either as a pandas dataframe or as a sparse matrix.
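To make the down-weighting concrete, here is a minimal pure-Python sketch of the smoothed inverse document frequency that TfidfVectorizer applies by default. The toy token lists are hypothetical and are not the notebook's documents.

```python
import math

# Toy tokenised documents (hypothetical, not the notebook's data).
docs = [
    ["president", "trump", "election"],
    ["president", "putin", "election"],
    ["president", "putin", "russia"],
]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: number of documents that contain each term.
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, the default form in scikit-learn.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

for term in vocab:
    print(f"{term:10s} df={df[term]} idf={idf[term]:.3f}")
```

A term that appears in every document ("president") gets the smallest weight, while a term unique to one document ("russia") gets the largest.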

In [3]:
# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix    = count_vectorizer.fit_transform(documents)

# OPTIONAL: Convert Sparse Matrix to Pandas Dataframe if you want to see the word frequencies.
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix, columns=count_vectorizer.get_feature_names(), index=['doc_trump', 'doc_election', 'doc_putin'])
df



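| | career | claimed | earlier | election | elections | friend | friends | interference | lost | minister | mr | outcome | parties | political | post | president | prime | putin | republican | russia | says | served | support | trump | vladimir | winning | witchhunt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| doc_trump | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 |
| doc_election | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| doc_putin | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |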


In [4]:
print(cosine_similarity(df, df))

[[1.         0.51639778 0.36893239]
 [0.51639778 1.         0.45360921]
 [0.36893239 0.45360921 1.        ]]
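As a check on what cosine_similarity() computes, the sketch below recomputes the doc_trump vs doc_election score by hand from the non-zero counts in the document-term matrix above. The names counts_trump, counts_election and cosine are introduced here for illustration only.

```python
import math

# Non-zero word counts copied from the document-term matrix above.
counts_trump = {"election": 1, "friends": 2, "lost": 1, "mr": 1, "political": 1,
                "president": 2, "putin": 1, "republican": 1, "support": 1,
                "trump": 2, "winning": 1}
counts_election = {"claimed": 1, "election": 2, "friend": 1, "interference": 1,
                   "outcome": 1, "parties": 1, "political": 2, "president": 2,
                   "putin": 2, "says": 2, "trump": 1, "witchhunt": 1}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

score = cosine(counts_trump, counts_election)
print(round(score, 8))
```

The dot product is 12 and the squared norms are 20 and 27, so the score is 12 / sqrt(540), which agrees with the 0.51639778 entry in the matrix.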


### Modeling using SoftCosine as a metric

Suppose you have another set of documents on a completely different topic, say ‘food’. You want a similarity metric that gives higher scores to documents on the same topic and lower scores when comparing documents from different topics. For that, the semantic meaning of the words must be taken into account: words similar in meaning should be treated as similar. For example, ‘President’ vs ‘Prime minister’, ‘Food’ vs ‘Dish’, and ‘Hi’ vs ‘Hello’ should be considered similar. Converting the words into word vectors and then computing similarities between those vectors addresses this.
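The idea can be sketched with the usual soft cosine formula, a·S·b / sqrt((a·S·a)(b·S·b)), where S holds word-to-word similarities. The three-word vocabulary and the values in S below are made up for illustration; in this notebook S is derived from the FastText embeddings.

```python
import math

# Hypothetical vocabulary and word-to-word similarity matrix S (values invented).
vocab = ["president", "minister", "soup"]
S = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

def bilinear(a, b, sim):
    """Compute a^T * sim * b for dense lists a, b and a square matrix sim."""
    return sum(a[i] * sim[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b, sim):
    """Soft cosine similarity of two bag-of-words vectors under sim."""
    return bilinear(a, b, sim) / math.sqrt(bilinear(a, a, sim) * bilinear(b, b, sim))

# With the identity matrix, soft cosine reduces to ordinary cosine.
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

doc_a = [1, 0, 0]  # contains only "president"
doc_b = [0, 1, 0]  # contains only "minister"

plain = soft_cosine(doc_a, doc_b, identity)
soft = soft_cosine(doc_a, doc_b, S)
print(plain, soft)
```

The two toy documents share no words, so ordinary cosine is zero, while the soft variant credits the assumed similarity between "president" and "minister".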

In [5]:
# Define the documents
doc_soup = "Soup is a primarily liquid food, generally served warm or hot (but may be cool or cold), that is made by combining ingredients of meat or vegetables with stock, juice, water, or another liquid. "
doc_noodles = "Noodles are a staple food in many cultures. They are made from unleavened dough which is stretched, extruded, or rolled flat and cut into one of a variety of shapes."
doc_dosa = "Dosa is a type of pancake from the Indian subcontinent, made from a fermented batter. It is somewhat similar to a crepe in appearance. Its main ingredients are rice and black gram."
documents = [doc_trump, doc_election, doc_putin, doc_soup, doc_noodles, doc_dosa]

In [6]:
%%time
# Download the FastText model
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')



In [7]:
# Prepare a dictionary and a corpus.
dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in documents])

# Prepare the similarity matrix
similarity_matrix = fasttext_model300.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)

# Convert the sentences into bag-of-words vectors.
sent_1 = dictionary.doc2bow(simple_preprocess(doc_trump))
sent_2 = dictionary.doc2bow(simple_preprocess(doc_election))
sent_3 = dictionary.doc2bow(simple_preprocess(doc_putin))
sent_4 = dictionary.doc2bow(simple_preprocess(doc_soup))
sent_5 = dictionary.doc2bow(simple_preprocess(doc_noodles))
sent_6 = dictionary.doc2bow(simple_preprocess(doc_dosa))

sentences = [sent_1, sent_2, sent_3, sent_4, sent_5, sent_6]
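For readers unfamiliar with the bag-of-words format, the following pure-Python sketch mimics the shape of what doc2bow returns: a list of (token_id, count) pairs. It is an illustration only; the token lists are hypothetical and the id assignment here need not match gensim's Dictionary.

```python
from collections import Counter

# Hypothetical token lists standing in for simple_preprocess(doc) output.
tokenised_docs = [
    ["president", "putin", "president"],
    ["soup", "is", "food"],
]

# Assign an integer id to every distinct token, in order of first appearance.
token2id = {}
for tokens in tokenised_docs:
    for tok in tokens:
        token2id.setdefault(tok, len(token2id))

def to_bow(tokens):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow_1 = to_bow(["president", "putin", "president"])
bow_2 = to_bow(["soup", "president", "unknown"])
print(bow_1)
print(bow_2)
```

Each document becomes a short sparse vector over the shared dictionary, which is the form softcossim expects alongside the similarity matrix.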

In [8]:
# Compute soft cosine similarity for two sentences
print(softcossim(sent_1, sent_2, similarity_matrix))

0.5885144994929364


In [9]:
# Compute soft cosine similarity for all sentences
def create_soft_cossim_matrix(sentences):
    # Entry [i][j] is the soft cosine similarity between sentences i and j, rounded to 2 decimals
    n = len(sentences)
    cossim_mat = pd.DataFrame([[round(softcossim(sentences[i], sentences[j], similarity_matrix), 2) for j in range(n)] for i in range(n)])
    return cossim_mat

create_soft_cossim_matrix(sentences)

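| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.59 | 0.56 | 0.28 | 0.34 | 0.4 |
| 1 | 0.59 | 1.0 | 0.56 | 0.23 | 0.33 | 0.45 |
| 2 | 0.56 | 0.56 | 1.0 | 0.19 | 0.25 | 0.36 |
| 3 | 0.28 | 0.23 | 0.19 | 1.0 | 0.5 | 0.38 |
| 4 | 0.34 | 0.33 | 0.25 | 0.5 | 1.0 | 0.56 |
| 5 | 0.4 | 0.45 | 0.36 | 0.38 | 0.56 | 1.0 |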
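One way to read the matrix above is to compare average within-topic and cross-topic scores. The sketch below copies the rounded values from the output; rows 0 to 2 are the politics documents and rows 3 to 5 the food documents, following the order of the documents list. The helper mean_pairwise is introduced here for illustration.

```python
from statistics import mean

# Rounded soft cosine scores copied from the matrix above.
scores = [
    [1.0, 0.59, 0.56, 0.28, 0.34, 0.4],
    [0.59, 1.0, 0.56, 0.23, 0.33, 0.45],
    [0.56, 0.56, 1.0, 0.19, 0.25, 0.36],
    [0.28, 0.23, 0.19, 1.0, 0.5, 0.38],
    [0.34, 0.33, 0.25, 0.5, 1.0, 0.56],
    [0.4, 0.45, 0.36, 0.38, 0.56, 1.0],
]
politics, food = [0, 1, 2], [3, 4, 5]

def mean_pairwise(rows, cols):
    """Average score over all pairs (i, j) with i != j."""
    return mean(scores[i][j] for i in rows for j in cols if i != j)

within_politics = mean_pairwise(politics, politics)
within_food = mean_pairwise(food, food)
across = mean_pairwise(politics, food)
print(round(within_politics, 2), round(within_food, 2), round(across, 2))
```

Both within-topic averages come out higher than the cross-topic average, which is the behaviour the soft cosine metric was introduced to provide.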


# Test with real data


In [96]:
def get_data():

  # Download Clinical Trials data
  print('Downloading Clinical Trials Data')
  ct_dt = pd.read_csv(r'https://raw.githubusercontent.com/MWFK/NLP-Semantic-Similarity/main/ClinicalTrials/Data/Batches_0.csv', sep=',', engine='python', encoding="utf-8")
  for btch in range(1, 4):
      url = 'https://raw.githubusercontent.com/MWFK/NLP-Semantic-Similarity/main/ClinicalTrials/Data/Batches_' +str(btch)+ '.csv'
      tmp = pd.read_csv(url, sep=',', engine='python', encoding="ISO-8859-1")
      ct_dt = pd.concat([ct_dt, tmp], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
  ct_dt['AllLocation'] = ct_dt['LocationCity'].str.lower().map(str) + ' | ' + ct_dt['LocationState'].str.lower().map(str) + ' | ' + ct_dt['LocationCountry'].str.lower().map(str)
  print('Clinical Trials Data: ',ct_dt.shape, '\n')

  # Download User input data
  print('Downloading Test data')
  test = pd.read_csv('https://raw.githubusercontent.com/MWFK/NLP-Semantic-Similarity/main/ClinicalTrials/Data/TestData.csv', sep=';', engine='python', encoding = "utf-8", skiprows=[0], names=['PatientID','ConditionOrDisease','Age','Gender','LocationCountry','TravelDistance','InclusionCriteria'])
  print('Test Data: ', test.shape)

  return ct_dt, test

ct_dt, test = get_data()

Downloading Clinical Trials Data
Clinical Trials Data:  (10152, 21) 

Downloading Test data
Test Data:  (7, 7)


In [98]:
def data_processing(ct_dt):

  print('Data dimensions before Filtering : ', ct_dt.shape, '\n')

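  ### Filtering by Age ###
  # NOTE: both values are strings (e.g. '18 Years'), so <= compares them lexicographically, not numerically
  print('Filtering by Age...')
  tmp = ct_dt[ct_dt.iloc[:,13] <= test.iloc[:1,2][0]]               # column 13 is MinimumAge; column 2 of test is the patient's Age
  tmp = tmp[tmp.iloc[:,13].str.find(test.iloc[:1,2][0][-5:]) != -1] # keep rows whose unit matches the last 5 characters of Age (Year/Month)
  print('Data dimensions: ', tmp.shape, '\n')

  ### Filtering by Gender ###
  print('Filtering by Gender...')
  tmp = tmp[(tmp.iloc[:,12] == test.iloc[:1,3][0]) | (tmp.iloc[:,12] == 'All')] # column 12 is Gender; keep trials for the patient's gender or 'All'
  print('Data dimensions: ', tmp.shape, '\n')

  ### Filtering by Travel Distance ###
  print('Filtering by Travel Distance...')
  tmp = tmp[tmp.iloc[:,20].str.find(test.iloc[:1,5][0].lower()) != -1] # column 20 is AllLocation; keep trials whose location text contains the patient's TravelDistance value
  print('Data dimensions: ', tmp.shape, '\n')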

  return tmp

tmp = data_processing(ct_dt)
tmp

Data dimensions before Filtering :  (10152, 21) 

Filtering by Age...
Data dimensions:  (9517, 21) 

Filtering by Gender...
Data dimensions:  (9403, 21) 

Filtering by Travel Distance...
Data dimensions:  (645, 21) 



Unnamed: 0,Rank,NCTId,OrgFullName,OfficialTitle,OverallStatus,Keyword,DetailedDescription,Condition,EligibilityCriteria,InclusionCriteria,ExclusionCriteria,HealthyVolunteers,Gender,MinimumAge,StudyPopulation,LocationFacility,LocationCity,LocationState,LocationZip,LocationCountry,AllLocation
52,53,NCT02603627,Guy's and St Thomas' NHS Foundation Trust,Cross-sectional Study to Compare the Prevalenc...,Unknown status,,Chronic obstructive pulmonary disease (COPD) i...,COPD|Lung Cancer|Smoking,Inclusion Criteria:||Informed consent|Aged ove...,Informed consent|Aged over 18|Lung cancer grou...,Patient refusal|Age under 18|Control group: pr...,No,All,18 Years,Patients will be recruited from the multidisci...,Guy's and St Thomas' NHS Foundation Trust,London,England,SE1 9RT,United Kingdom,london | england | united kingdom
79,80,NCT04629079,King's College London,Lung Cancer Detection Using Blood Exosomes and...,Recruiting,Early detection,Lung cancer is the leading cause of cancer dea...,Lung Cancer,Inclusion Criteria:||Over 18 years of age|Susp...,Over 18 years of age|Suspected clinical diagno...,-Synchronous other cancer types.,No,All,18 Years,The study will include patients who have been ...,"Borthwick Research Unit, Lister Hospital",Stevenage,,SG1 4AB,United Kingdom,stevenage | nan | united kingdom
88,89,NCT02612532,Owlstone Ltd,Lung Cancer Indicator Detection,"Active, not recruiting",,Rationale Approximately 75% of patients with l...,Lung Cancer,Recruitment for these patients will be done fr...,patients will be done from NHS hospitals whom...,e patients will be done from NHS hospitals who...,No,All,18 Years,,UZA University Hospital Antwerp|UZG University...,Antwerp|Gent|Leipzig|Bari|Cambridge|Buckingham...,Cambridgeshire,04103,Belgium|Belgium|Germany|Italy|United Kingdom|U...,antwerp|gent|leipzig|bari|cambridge|buckingham...
100,101,NCT04178889,Papworth Hospital NHS Foundation Trust,Second Primary Lung Cancer Cohort Study (SPORT),Recruiting,Lung cancer|Non-small cell lung cancer,"This is a multi-centre, observational basic sc...",Lung Cancer,Inclusion Criteria:||previous treatment with c...,previous treatment with curative intent (surge...,Primary lung tumour was a carcinoid tumour|in ...,No,All,18 Years,Patients who have been treated with curative i...,Royal Papworth Hospital,Cambridge,,,United Kingdom,cambridge | nan | united kingdom
103,104,NCT04409444,Manchester University NHS Foundation Trust,An Observational Cohort Study Investigating th...,Recruiting,,,Lung Cancer,Main data study:||Inclusion Criteria:||- Any i...,lusion Criteria:||- Any individual attending t...,- Unable to give informed consent to study par...,,All,55 Years,Individuals will be attending a lung health ch...,Manchester University NHS Trust,Manchester,,,United Kingdom,manchester | nan | united kingdom
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10041,10042,NCT04385160,Fondazione per la Ricerca Ospedale Maggiore,Myeloproliferative Neoplasms (MPN) and COVID-19,Recruiting,thrombosis,This is an European multicenter observational ...,Myeloproliferative Neoplasm|COVID,Inclusion Criteria:||Age > 18 years|Confirmed ...,Age > 18 years|Confirmed diagnosed of MPN acco...,,No,All,18 Years,Patients with Myeloproliferative Neoplasm (Pol...,New York-Presbyterian/Weill Cornell Medical Ce...,New York|Zagreb|Paris|Aachen|Minden|Monza|Ales...,New York|Monza Brianza|Barcellona|Barcellona|B...,10065|20900|15121|24127|40138|25123|50134|2012...,United States|Croatia|France|Germany|Germany|I...,new york|zagreb|paris|aachen|minden|monza|ales...
10044,10045,NCT01334593,Liverpool University Hospitals NHS Foundation ...,The Effect of Neoadjuvant Chemoradiotherapy on...,Completed,,Purpose: To evaluate the effects of chemoradio...,Cancer,Inclusion Criteria:||All patients listed to un...,All patients listed to undergo neoadjuvant che...,Unable to consent.|Under 18 years of age.|Sign...,No,All,18 Years,Colorectal cancer is the third commonest cause...,Aintree University Hospitals,Liverpool,Merseyside,L7 8XP,United Kingdom,liverpool | merseyside | united kingdom
10086,10087,NCT03828578,Cardiff and Vale University Health Board,Comparison of AIRVO High Flow Oxygen Therapy W...,Completed,,Major head and neck surgery involving micro-va...,Head and Neck Cancer,Inclusion Criteria:||Undergoing head and neck ...,Undergoing head and neck surgery with microvas...,Under 18 years old|Lack of consent|Consultant ...,No,All,18 Years,,Cardiff and Vale University Health Board,Cardiff,,CF144XW,United Kingdom,cardiff | nan | united kingdom
10098,10099,NCT00324298,National Cancer Institute (NCI),A Randomized Phase III Toxicity Study of Day 2...,Completed,drug/agent toxicity by tissue/organ|stage III ...,OBJECTIVES:||Primary||Determine if long-infusi...,Drug/Agent Toxicity by Tissue/Organ|Testicular...,DISEASE CHARACTERISTICS:||Diagnosis of metasta...,CS:||Diagnosis of metastatic germ cell cancer ...,ICS:||Diagnosis of metastatic germ cell cancer...,No,Male,18 Years,,Basildon University Hospital|Addenbrooke's Hos...,Basildon|Cambridge|Colchester|Ipswich|Leeds|Lo...,England|England|England|England|England|Englan...,SS16 5NL|CB2 2QQ|C03 3NB|IP4 5PD|LS9 7TF|EC1A ...,United Kingdom|United Kingdom|United Kingdom|U...,basildon|cambridge|colchester|ipswich|leeds|lo...
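The age filter above compares strings such as '18 Years' directly, which orders them character by character. A minimal sketch of a numeric alternative follows; it assumes the ages always have the form "<number> Years" or "<number> Months", which is inferred from the MinimumAge values shown and is not verified against the full data. The helper parse_age_years is hypothetical.

```python
import re

def parse_age_years(text):
    """Convert strings such as '18 Years' or '6 Months' to a float in years.

    Returns None when the text does not match the assumed '<number> <unit>' form.
    """
    match = re.fullmatch(r"\s*(\d+)\s*(year|month)s?\s*", str(text), flags=re.IGNORECASE)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    return value / 12 if unit == "month" else float(value)

# String comparison works character by character, so '9 Years' sorts after '55 Years'.
as_strings = "9 Years" <= "55 Years"
as_numbers = parse_age_years("9 Years") <= parse_age_years("55 Years")
print(as_strings, as_numbers, parse_age_years("6 Months"))
```

Applying such a parser to both the MinimumAge column and the patient's Age before comparing would also make the separate Year/Month unit check unnecessary, at the cost of having to decide how to handle missing or unparseable ages.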


### Modeling using Cosine as a metric

The same approach is applied to the clinical trials data. The word counts come from CountVectorizer (TfidfVectorizer would additionally down-weight words that occur frequently across documents), and cosine_similarity() compares the patient's inclusion criteria against each filtered trial. It accepts the document-term matrix either as a pandas dataframe or as a sparse matrix.
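In the next cell the vectorizer is fitted once on the InclusionCriteria of every trial and then reused to transform both the filtered trials and the patient's text, so all vectors share the same columns. A minimal pure-Python sketch of that fit/transform split, using hypothetical token lists rather than the real data:

```python
from collections import Counter

# Hypothetical corpus and query, already lower-cased and tokenised.
corpus = [
    ["lung", "cancer", "consent"],
    ["informed", "consent", "age"],
]
query = ["metastatic", "lung", "cancer", "cancer"]

# "fit": learn a sorted vocabulary from the corpus only.
vocabulary = sorted({tok for doc in corpus for tok in doc})

def transform(tokens):
    """Count tokens against the fixed vocabulary; unseen tokens are dropped."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]

query_vector = transform(query)
print(vocabulary)
print(query_vector)
```

Words in the patient's text that never occur in any trial ("metastatic" in this toy case) contribute nothing to the vector, and therefore nothing to the similarity score.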

In [99]:
%%time

ct_dt['InclusionCriteria'] = ct_dt['InclusionCriteria'].fillna(' ')
tmp['InclusionCriteria']   = tmp['InclusionCriteria'].fillna(' ')

count_vectorizer           = CountVectorizer(stop_words='english')
count_vectorizer_ct_dt     = count_vectorizer.fit(ct_dt['InclusionCriteria'])

count_vectorizer_tmp       = count_vectorizer_ct_dt.transform(tmp['InclusionCriteria'])

count_vectorizer_test0    = count_vectorizer_ct_dt.transform(test.iloc[:1,6].fillna(' '))

CPU times: user 1.51 s, sys: 14 ms, total: 1.53 s
Wall time: 1.53 s


In [114]:
# print(type(cosine_similarity(count_vectorizer_test0 , count_vectorizer_tmp)))
# print(cosine_similarity(count_vectorizer_test0 , count_vectorizer_tmp).shape)
# print(cosine_similarity(count_vectorizer_test0 , count_vectorizer_tmp))
# print(cosine_similarity(count_vectorizer_test0 , count_vectorizer_tmp)[0])
print(*test.iloc[:1,6])
print(*tmp.iloc[2:3,9])

Histologically diagnosed with metastatic non-small cell lung cancer in 2018 | Initially treated with pertuzumab but relapsed | His performance status is ECOG 1 or KPS 90 | His blood and liver function analysis show normal | No other indications like HIV, HCV, HBV | No allergies | Life expectancy over 6 months | No mental disabilities.
 patients will be done from NHS hospitals whom identify or follow-up on patients suspected of having lung cancer.||Inclusion criteria:||Older than 18 years at time of consent||Referred for investigation due to suspicion of lung cancer||Referral based on suspicious symptoms|Referral based on suspicious finding on imaging, including CTscan with indeterminate nodule requiring follow-up evaluation.|Capable of understanding written and/or spoken language|Able to provide informed consent||Exclusion criteria:||(Anticipated) inability to complete breath sampling procedure due to e.g. hyper- or hypo-ventilation, respiratory failure or claustrophobia when wearing t

In [115]:
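# Similarity of the first patient's criteria to each filtered trial (one score per row of tmp)
tmp['Similarity'] = cosine_similarity(count_vectorizer_test0, count_vectorizer_tmp)[0]

# Copy so that adding the Similarity column does not also modify ct_dt
ct_dt_tmp = ct_dt.copy()
ct_dt_tmp['Similarity'] = 0
print(ct_dt_tmp.shape)
print(tmp.shape)

# Replace the filtered trials with their scored versions; trials that were filtered out keep Similarity 0
ct_dt_tmp = ct_dt_tmp[~ct_dt_tmp['NCTId'].isin(tmp['NCTId'])]
print(ct_dt_tmp.shape)
ct_dt_tmp = pd.concat([ct_dt_tmp, tmp], ignore_index=True)

# Number of trials above each candidate threshold
print(ct_dt_tmp[ct_dt_tmp['Similarity']>0.1].shape)
print(ct_dt_tmp[ct_dt_tmp['Similarity']>0.2].shape)
print(ct_dt_tmp[ct_dt_tmp['Similarity']>0.25].shape)
print(ct_dt_tmp[ct_dt_tmp['Similarity']>0.3].shape)

# Zero out scores at or below 0.25, then rank the trials by similarity
ct_dt_tmp['Similarity'] = ct_dt_tmp['Similarity'].apply(lambda score: score if score>0.25 else 0)
ct_dt_tmp = ct_dt_tmp.sort_values(by=['Similarity'], ascending=False)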

(10152, 22)
(645, 22)
(9507, 22)
(408, 22)
(93, 22)
(29, 22)
(10, 22)
