<a href="https://colab.research.google.com/github/MWFK/NLP-Semantic-Similarity/blob/main/00.%20ct_dt_Cosine_SoftCosine_DONE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Libs


In [None]:
# Python libs to manipulate dataframes and arrays
import pandas as pd
import numpy as np

# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer

# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity

# To get the word vectors, you need a word embedding model. Let’s download the FastText model using gensim’s downloader api.
import gensim
print(gensim.__version__)

# softcossim lives in gensim.matutils in gensim 3.x (this notebook ran on 3.6.0); it was removed in gensim 4
from gensim.matutils import softcossim 
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess

3.6.0


### Data

In [None]:
# Define the documents
doc_trump    = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference in the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin    = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"
documents    = [doc_trump, doc_election, doc_putin]

### Modeling using Cosine as a metric

To compute the cosine similarity, you need the count of each word in each document. The CountVectorizer or the TfidfVectorizer from scikit-learn computes this and returns a sparse matrix. Here I optionally convert it to a pandas dataframe to see the word frequencies in a tabular format.

TfidfVectorizer() would arguably be a better choice than CountVectorizer(), because it downweights words that occur frequently across documents. Then, use cosine_similarity() to get the final output. It accepts the document-term matrix either as a pandas dataframe or as a sparse matrix.
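To make the downweighting idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting. The toy documents and the helper names `smoothed_idf` and `tfidf_vector` are illustrative assumptions, not part of this notebook. The IDF formula is the smoothed form that scikit-learn documents as its default, reimplemented by hand rather than called from the library.

```python
import math
from collections import Counter

# Toy, pre-tokenized documents (illustrative, not the notebook's data).
docs = [
    ["president", "trump", "election"],
    ["president", "putin", "election", "election"],
    ["president", "putin", "russia"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    # Raw term counts scaled by IDF, then L2-normalized.
    counts = Counter(doc)
    weights = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in docs]
print(smoothed_idf("president"))  # term present in every document
print(smoothed_idf("russia"))     # term present in a single document
```

A term that appears in every document, such as "president" here, gets the smallest possible weight, while rarer terms contribute more to the similarity.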

In [None]:
# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix    = count_vectorizer.fit_transform(documents)

# OPTIONAL: Convert the sparse matrix to a pandas dataframe to inspect the word frequencies.
# Note: get_feature_names() is deprecated since scikit-learn 1.0; newer versions use get_feature_names_out().
doc_term_matrix = sparse_matrix.toarray()
df = pd.DataFrame(doc_term_matrix, columns=count_vectorizer.get_feature_names(), index=['doc_trump', 'doc_election', 'doc_putin'])
df



Unnamed: 0,career,claimed,earlier,election,elections,friend,friends,interference,lost,minister,mr,outcome,parties,political,post,president,prime,putin,republican,russia,says,served,support,trump,vladimir,winning,witchhunt
doc_trump,0,0,0,1,0,0,2,0,1,0,1,0,0,1,0,2,0,1,1,0,0,0,1,2,0,1,0
doc_election,0,1,0,2,0,1,0,1,0,0,0,1,1,2,0,2,0,2,0,0,2,0,0,1,0,0,1
doc_putin,1,0,1,0,1,0,0,0,0,1,0,0,0,1,1,2,1,2,0,1,0,1,0,0,1,0,0


In [None]:
print(cosine_similarity(df, df))

[[1.         0.51639778 0.36893239]
 [0.51639778 1.         0.45360921]
 [0.36893239 0.45360921 1.        ]]
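As a sanity check, the doc_trump vs doc_election entry can be recomputed by hand from the word counts in the table above. This is a small sketch using only the standard library; the dictionaries hold the non-zero counts copied from those two rows, and the helper name `cosine` is an assumption for illustration.

```python
import math

# Non-zero counts copied from the doc_trump and doc_election rows of the document-term matrix.
doc_trump_counts = {"election": 1, "friends": 2, "lost": 1, "mr": 1, "political": 1,
                    "president": 2, "putin": 1, "republican": 1, "support": 1,
                    "trump": 2, "winning": 1}
doc_election_counts = {"claimed": 1, "election": 2, "friend": 1, "interference": 1,
                       "outcome": 1, "parties": 1, "political": 2, "president": 2,
                       "putin": 2, "says": 2, "trump": 1, "witchhunt": 1}

def cosine(a, b):
    # Dot product over shared terms, divided by the product of the vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

similarity = cosine(doc_trump_counts, doc_election_counts)
print(round(similarity, 8))  # 12 / sqrt(20 * 27)
```

Only the five shared terms (election, political, president, putin, trump) contribute to the numerator, which is why documents with little vocabulary overlap score low under plain cosine.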


### Modeling using SoftCosine as a metric

Suppose you have another set of documents on a completely different topic, say ‘food’. You want a similarity metric that gives higher scores to documents on the same topic and lower scores when comparing documents from different topics. In such a case, the semantic meaning of words should be considered: words similar in meaning should be treated as similar. For example, ‘President’ vs ‘Prime minister’, ‘Food’ vs ‘Dish’, ‘Hi’ vs ‘Hello’ should be considered similar. Converting the words into their respective word vectors and then computing the similarities addresses this problem.
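The idea can be sketched without any embedding model. Soft cosine replaces the plain dot product with a bilinear form through a term-similarity matrix S, so related but different words still contribute. The three-word vocabulary and the values in `S` below are made up for illustration; in the notebook the matrix is derived from FastText vectors instead.

```python
import math

# Hypothetical 3-term vocabulary and term-similarity matrix S (values are made up).
vocab = ["president", "minister", "soup"]
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

def bilinear(a, b):
    # Computes a^T S b.
    return sum(a[i] * S[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b):
    # Soft cosine: a^T S b / sqrt((a^T S a) * (b^T S b)).
    return bilinear(a, b) / math.sqrt(bilinear(a, a) * bilinear(b, b))

def cosine(a, b):
    # Plain cosine, equivalent to soft cosine with S set to the identity matrix.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

doc_a = [1, 0, 0]  # mentions only "president"
doc_b = [0, 1, 0]  # mentions only "minister"
doc_c = [0, 0, 1]  # mentions only "soup"

print(cosine(doc_a, doc_b), soft_cosine(doc_a, doc_b), soft_cosine(doc_a, doc_c))
```

With no shared words, plain cosine between `doc_a` and `doc_b` is zero, while soft cosine recovers the similarity encoded in `S`. Unrelated terms such as "soup" still score zero.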

In [None]:
# Define the documents
doc_soup = "Soup is a primarily liquid food, generally served warm or hot (but may be cool or cold), that is made by combining ingredients of meat or vegetables with stock, juice, water, or another liquid. "
doc_noodles = "Noodles are a staple food in many cultures. They are made from unleavened dough which is stretched, extruded, or rolled flat and cut into one of a variety of shapes."
doc_dosa = "Dosa is a type of pancake from the Indian subcontinent, made from a fermented batter. It is somewhat similar to a crepe in appearance. Its main ingredients are rice and black gram."
documents = [doc_trump, doc_election, doc_putin, doc_soup, doc_noodles, doc_dosa]

In [184]:
# Download the FastText model
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')



In [None]:
# Prepare a dictionary and a corpus.
dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in documents])

# Prepare the similarity matrix
similarity_matrix = fasttext_model300.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)

# Convert the sentences into bag-of-words vectors.
sent_1 = dictionary.doc2bow(simple_preprocess(doc_trump))
sent_2 = dictionary.doc2bow(simple_preprocess(doc_election))
sent_3 = dictionary.doc2bow(simple_preprocess(doc_putin))
sent_4 = dictionary.doc2bow(simple_preprocess(doc_soup))
sent_5 = dictionary.doc2bow(simple_preprocess(doc_noodles))
sent_6 = dictionary.doc2bow(simple_preprocess(doc_dosa))

sentences = [sent_1, sent_2, sent_3, sent_4, sent_5, sent_6]

In [None]:
# Compute soft cosine similarity for two sentences
print(softcossim(sent_1, sent_2, similarity_matrix))

0.5885144994929364


In [None]:
# Compute soft cosine similarity for all sentences
def create_soft_cossim_matrix(sentences):
    # Entry (i, j) is the soft cosine similarity between sentences[i] and sentences[j].
    rows = [[round(softcossim(sent_i, sent_j, similarity_matrix), 2) for sent_j in sentences]
            for sent_i in sentences]
    return pd.DataFrame(rows)

create_soft_cossim_matrix(sentences)

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.59,0.56,0.28,0.34,0.4
1,0.59,1.0,0.56,0.23,0.33,0.45
2,0.56,0.56,1.0,0.19,0.25,0.36
3,0.28,0.23,0.19,1.0,0.5,0.38
4,0.34,0.33,0.25,0.5,1.0,0.56
5,0.4,0.45,0.36,0.38,0.56,1.0


# Test with real data

In [1]:
# Python libs to manipulate dataframes and arrays
import pandas as pd
import numpy as np

# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer

# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity

# To get the word vectors, you need a word embedding model. Let’s download the FastText model using gensim’s downloader api.
import gensim
print(gensim.__version__)

# softcossim lives in gensim.matutils in gensim 3.x (this notebook ran on 3.6.0); it was removed in gensim 4
from gensim.matutils import softcossim 
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess

3.6.0


In [185]:
# User input data
test = pd.read_csv('https://raw.githubusercontent.com/MWFK/NLP-Semantic-Similarity/main/ClinicalTrials/Data/TestData.csv', sep=';', engine='python', encoding = "utf-8", skiprows=[0], names=['PatientID','ConditionOrDisease','MinimumAge','Gender','LocationCountry','TravelDistance','InclusionCriteria'])
print(test.shape)
test.head()

(7, 7)


Unnamed: 0,PatientID,ConditionOrDisease,MinimumAge,Gender,LocationCountry,TravelDistance,InclusionCriteria
0,1,Lung Cancer,58 Years,Male,London,United Kingdom,Histologically diagnosed with metastatic non-s...
1,2,Lung Cancer,25 Years,Female,Paris,Paris,Clinically diagnosed with primary adenocarcino...
2,3,Lung Cancer,82 Years,Male,Southampton,Southampton,Smoker | Diagnosed with squamous cell carcinom...
3,4,Lung Cancer,16 Years,Female,Manchester,United Kingdom,Never smoke | Diagnosed with small cell lung c...
4,5,Lung Cancer,65 Years,Female,Belgium,,Former smoker | Histologically diagnosed with ...


In [237]:
test.iloc[:1,6][0]

'Histologically diagnosed with metastatic non-small cell lung cancer in 2018 | Initially treated with pertuzumab but relapsed | His performance status is ECOG 1 or KPS 90 | His blood and liver function analysis show normal | No other indications like HIV, HCV, HBV | No allergies | Life expectancy over 6 months | No mental disabilities.'

In [198]:
%%time

# Clinical Trials data
ct_dt = pd.read_csv(r'https://raw.githubusercontent.com/MWFK/NLP-Semantic-Similarity/main/ClinicalTrials/Data/Batches_0.csv', sep=',', engine='python', encoding="utf-8")
for btch in range(1, 4):
    url = 'https://raw.githubusercontent.com/MWFK/NLP-Semantic-Similarity/main/ClinicalTrials/Data/Batches_' +str(btch)+ '.csv'
    tmp = pd.read_csv(url, sep=',', engine='python', encoding="ISO-8859-1")
    print('Batch ', btch, ': ', tmp.shape)
    ct_dt = pd.concat([ct_dt, tmp], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

ct_dt['AllLocation'] = ct_dt['LocationCity'].str.lower().map(str) + ' | ' + ct_dt['LocationState'].str.lower().map(str) + ' | ' + ct_dt['LocationCountry'].str.lower().map(str)

print('All Batches: ',ct_dt.shape)
ct_dt.head()

Batch  1 :  (2538, 20)
Batch  2 :  (2538, 20)
Batch  3 :  (2538, 20)
All Batches:  (10152, 21)
CPU times: user 1.92 s, sys: 167 ms, total: 2.09 s
Wall time: 2.79 s


In [222]:
%%time
### Age Detection ###
print(test.iloc[:1,2][0])
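tmp = ct_dt[ct_dt.iloc[:,13] >= test.iloc[:1,2][0]]               # column 13 is MinimumAge; this compares strings like '58 Years' lexicographically, not numerically
tmp = tmp[tmp.iloc[:,13].str.find(test.iloc[:1,2][0][-5:]) != -1] # keep rows whose age uses the same unit suffix (here 'Years')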
print(tmp.shape)

58 Years
(133, 21)
CPU times: user 7.7 ms, sys: 0 ns, total: 7.7 ms
Wall time: 8.75 ms
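Because the age values are strings such as '58 Years', the `>=` in the cell above orders them character by character. A minimal sketch of a numeric alternative follows; `parse_age` is a hypothetical helper, and it assumes every value has the form "<number> Years" or "<number> Months", which may not hold for the whole dataset.

```python
import re

def parse_age(text):
    """Hypothetical helper: turn '58 Years' or '6 Months' into an age in years."""
    match = re.fullmatch(r"\s*(\d+)\s+(Year|Month)s?\s*", text)
    if match is None:
        return None  # unknown format; the caller decides how to handle it
    value, unit = int(match.group(1)), match.group(2)
    return value if unit == "Year" else value / 12

patient_age = parse_age("58 Years")

# String comparison looks only at characters: '6' sorts after '5'.
lexicographic = "6 Years" >= "58 Years"
# Numeric comparison after parsing gives the intended ordering.
numeric = parse_age("6 Years") >= patient_age

print(lexicographic, numeric)
```

Applied to the dataframe, the same helper could be mapped over the MinimumAge column before filtering, with rows that fail to parse handled separately.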


In [223]:
%%time
### Gender Detection ###
print(test.iloc[:1,3][0])
tmp = tmp[(tmp.iloc[:,12] == test.iloc[:1,3][0]) | (tmp.iloc[:,12] == 'All')]  # column 12 is Gender; keep trials open to this gender or to all
print(tmp.shape)

Male
(132, 21)
CPU times: user 3.62 ms, sys: 1.06 ms, total: 4.68 ms
Wall time: 7.56 ms


In [224]:
%%time
### Travel Distance ###
print(test.iloc[:1,5][0])
tmp = tmp[tmp.iloc[:,20].str.find(test.iloc[:1,5][0].lower()) != -1]  # column 20 is AllLocation; substring match on the lower-cased location
print(tmp.shape)

United Kingdom
(4, 21)
CPU times: user 4.57 ms, sys: 710 µs, total: 5.28 ms
Wall time: 5.32 ms
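The location filter above is a substring match, which can also match inside longer place names. As a hedged alternative, the sketch below splits an AllLocation-style string on '|' and compares whole tokens. The helper names are hypothetical, the first sample string is copied from the tmp table, and the second is made up to show the pitfall.

```python
def location_tokens(all_location):
    """Split an AllLocation-style string on '|' into trimmed, lower-case tokens."""
    return {part.strip().lower() for part in all_location.split("|") if part.strip()}

def matches_location(all_location, wanted):
    # Exact token match instead of substring search.
    return wanted.strip().lower() in location_tokens(all_location)

row_uk = "london | england | united kingdom"   # copied from the tmp table
row_made_up = "bucharest | nan | romania"      # made-up row for illustration

print(matches_location(row_uk, "United Kingdom"))
print(row_made_up.find("oman") != -1, matches_location(row_made_up, "Oman"))
```

Whether exact matching is preferable depends on how free-form the patient's location input is; substring search is more forgiving of partial names.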


In [225]:
tmp.head()

Unnamed: 0,Rank,NCTId,OrgFullName,OfficialTitle,OverallStatus,Keyword,DetailedDescription,Condition,EligibilityCriteria,InclusionCriteria,ExclusionCriteria,HealthyVolunteers,Gender,MinimumAge,StudyPopulation,LocationFacility,LocationCity,LocationState,LocationZip,LocationCountry,AllLocation
1589,1590,NCT00227708,UNICANCER,Phase II Trial Assessing the Impact on Instrum...,Completed,adenocarcinoma of the lung|adenosquamous cell ...,OBJECTIVES:||Primary||Determine the quality of...,Lung Cancer,DISEASE CHARACTERISTICS:||Histologically or cy...,CS:||Histologically or cytologically confirmed...,ICS:||Histologically or cytologically confirme...,No,All,70 Years,,Centre Medico-Chirurgical de Creil|Centre de L...,Creil|Dijon|Elbeuf|Marseille|Paris|Genolier|Lo...,England|Northern Ireland|Scotland,60107|21079|76503|13273|75248|Ch-1272|W6 8RF|B...,France|France|France|France|France|Switzerland...,creil|dijon|elbeuf|marseille|paris|genolier|lo...
3848,3849,NCT02558101,"University College, London",Randomised Controlled Trial to Test Novel Invi...,"Active, not recruiting",Screening|Early detection|Health inequality|Up...,Lung cancer screening using low dose computed ...,Lung Cancer,Inclusion Criteria:||Recorded as a current smo...,Recorded as a current smoker during the year 2...,Active diagnosis of lung cancer or metastases|...,Accepts Healthy Volunteers,All,60 Years,,University College London Hospital NHS Trust,London,England,NW1 2BU,United Kingdom,london | england | united kingdom
4410,4411,NCT00489983,Eli Lilly and Company,A Multicenter Phase 2 Randomized Trial of Sing...,Completed,,,Non-Small Cell Lung Cancer,Inclusion Criteria:||histologically or cytolog...,histologically or cytologically confirmed NSCL...,have received treatment within the last 30 day...,No,All,70 Years,,For additional information regarding investiga...,Berlin|Milano|London,,,Germany|Italy|United Kingdom,berlin|milano|london | nan | germany|italy|uni...
5814,5815,NCT00256711,AstraZeneca,"A Randomised, Open Label, Parallel Group, Mult...",Completed,Locally advanced or metastatic NSCLC.|Stage II...,,Non-Small-Cell Lung Carcinoma,Inclusion Criteria:||Histologically confirmed ...,Histologically confirmed NSCLC and willing to ...,Newly diagnosed CNS metastases|Less than 4 wee...,No,All,70 Years,,Research Site|Research Site|Research Site|Rese...,St. Leonards|Westmead|South Brisbane|Nedlands|...,New South Wales|New South Wales|Queensland|Wes...,,Australia|Australia|Australia|Australia|Austra...,st. leonards|westmead|south brisbane|nedlands|...


### Modeling using Cosine as a metric

The same approach as in the toy example is applied here. A CountVectorizer is fitted on the InclusionCriteria of all clinical trials so that the vocabulary covers the whole corpus, then it transforms both the filtered trials (tmp) and the patient's inclusion criteria into word-count vectors.

A TfidfVectorizer() could be used instead of CountVectorizer() to downweight words that occur frequently across documents. cosine_similarity() then compares the patient's vector with each filtered trial.

In [265]:
%%time
# Create the Document Term Matrix
# Fit the vocabulary on every trial, then vectorize only the filtered trials.
ct_dt['InclusionCriteria'] = ct_dt['InclusionCriteria'].fillna(' ')
tmp                        = tmp.copy()  # work on a copy to avoid SettingWithCopyWarning on the filtered slice
tmp['InclusionCriteria']   = tmp['InclusionCriteria'].fillna(' ')
count_vectorizer           = CountVectorizer(stop_words='english')
count_vectorizer_ct_dt     = count_vectorizer.fit(ct_dt['InclusionCriteria'])  # fit() returns the fitted vectorizer itself
count_vectorizer_tmp       = count_vectorizer_ct_dt.transform(tmp['InclusionCriteria'])

CPU times: user 1.52 s, sys: 8.64 ms, total: 1.52 s
Wall time: 1.53 s


In [266]:
%%time
sentence = test.iloc[:1,6].fillna(' ')
count_vectorizer_test0    = count_vectorizer_ct_dt.transform(sentence)

CPU times: user 1.35 ms, sys: 20 µs, total: 1.37 ms
Wall time: 1.38 ms


In [267]:
print(cosine_similarity(count_vectorizer_test0 , count_vectorizer_tmp).shape)
print(cosine_similarity(count_vectorizer_test0 , count_vectorizer_tmp))

(1, 4)
[[0.18931708 0.         0.0461143  0.17912443]]
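To read the scores as a ranking, they can be paired with the trial identifiers and sorted. This sketch assumes the four scores follow the row order of tmp shown earlier. The scores are copied from the output above and the IDs from the NCTId column.

```python
# Scores copied from the output above; IDs copied from the NCTId column of tmp, in row order.
scores = [0.18931708, 0.0, 0.0461143, 0.17912443]
nct_ids = ["NCT00227708", "NCT02558101", "NCT00489983", "NCT00256711"]

# Sort trials from most to least similar to the patient's inclusion criteria.
ranked = sorted(zip(nct_ids, scores), key=lambda pair: pair[1], reverse=True)
for nct_id, score in ranked:
    print(f"{nct_id}  {score:.4f}")
```

A score of 0 means the patient's text and the trial's inclusion criteria share no words after stop-word removal.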


In [268]:
print(*test.iloc[:1,6])

Histologically diagnosed with metastatic non-small cell lung cancer in 2018 | Initially treated with pertuzumab but relapsed | His performance status is ECOG 1 or KPS 90 | His blood and liver function analysis show normal | No other indications like HIV, HCV, HBV | No allergies | Life expectancy over 6 months | No mental disabilities.


In [271]:
print(*tmp.iloc[2:3,9])

histologically or cytologically confirmed NSCLC not amenable to surgery or radiotherapy of curative intent|locally advanced or metastatic Stage IIIb (with N3 supraclavicular or T4 for pleural effusion) or IV NSCLC|no prior chemotherapy|measurable disease according to Response Evaluation Criteria in Solid Tumors (RECIST) criteria (Therasse et al. 2000)|men and women greater than or equal to 70 years of age or patients who, in the investigator's opinion, are not eligible for platinum-based chemotherapy||
