# Relevance Score
For each paper we have to compute a relevance score which should state how relevant that paper is compared to the search we made. <br>

So i was thinking something like this:<br>
paper1 ---> score : 1 (exactly what we were looking for)<br>
paper2 ---> score : 0 (treats a completly different topic)<br>
and the rest of the cases are between 0 and 1

Strategies:
- [Title Matches](#section_id)
- [Keywords matches](#section_id2)
- [Title and Keywords matches](#section_id3)
- [Automatic dictionary from abstracts](#section_id4)
- [Automatic dictionary from keywords](#section_id5)

In [1]:
import requests
import json
import time
import pandas as pd

In [2]:
def esearch(search_term):
    search_term = search_term.strip().replace(' ', '+')
    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
    url = url + '?db=pubmed' + '&retmode=json' + '&retmax=500'
    url = url + '&term=' + search_term
    site = requests.get(url).content
    json_site = json.loads(site.decode()) 
    UIDs = json_site['esearchresult']['idlist']
    return UIDs

In [3]:
def efetch(UIDs):
    session = requests.Session()
    abstracts = []
    keywords = []
    titles = []
    i = 1
    for UID in UIDs:
        time.sleep(0.4)
        url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
        url = url + '?db=pubmed' + '&rettype=medline&id='
        url = url + UID
        
        data = session.get(url).content
        data = data.decode()
        data = data.split('\n')
        
        
        ls = []
        for elem in data:
            if elem == '':
                continue
            elif elem[0:4] != '    ': 
                elem = elem.strip()
                elem = elem.replace('\n', '')
                ls.append(elem)
            else:                    
                elem = elem.strip()
                elem = elem.replace('\n', '')
                ls[-1] = ls[-1] + elem
    
    
        ls_2 = [] 
        for elem in ls:
            key = elem[0:4].strip()
            value = elem[5:].strip()
            ls_2.append([key, value])
        
        all_key = ''
        abst = ''
        for row in ls_2:
            if row[0] == 'AB':
                abst = abst + row[1]
            
            elif row[0] == 'OT' or row[0] == 'MH':
                all_key = all_key + row[1] + ', '
            
            elif row[0] == 'TI':
                titles.append(row[1])
                
        keywords.append(all_key[:-2])
        abstracts.append(abst)
        
        if i%10==0:
            print('Paper ' + str(i) +': OK')
        i += 1
        
    dic = {
        'UID' : UIDs,
        'title' : titles,
        'abstarct' : abstracts,
        'keywords' : keywords
    }
    return dic

In [4]:
UIDs = esearch('machine learning cancer prevention')
data = efetch(UIDs)

Paper 10: OK
Paper 20: OK
Paper 30: OK
Paper 40: OK
Paper 50: OK
Paper 60: OK
Paper 70: OK
Paper 80: OK
Paper 90: OK
Paper 100: OK
Paper 110: OK
Paper 120: OK
Paper 130: OK
Paper 140: OK
Paper 150: OK
Paper 160: OK
Paper 170: OK
Paper 180: OK
Paper 190: OK
Paper 200: OK
Paper 210: OK
Paper 220: OK
Paper 230: OK
Paper 240: OK
Paper 250: OK
Paper 260: OK
Paper 270: OK
Paper 280: OK
Paper 290: OK
Paper 300: OK
Paper 310: OK
Paper 320: OK
Paper 330: OK
Paper 340: OK
Paper 350: OK
Paper 360: OK
Paper 370: OK
Paper 380: OK
Paper 390: OK
Paper 400: OK
Paper 410: OK
Paper 420: OK
Paper 430: OK
Paper 440: OK
Paper 450: OK
Paper 460: OK
Paper 470: OK
Paper 480: OK
Paper 490: OK
Paper 500: OK


In [5]:
df = pd.DataFrame(data)
df

Unnamed: 0,UID,title,abstarct,keywords
0,36179551,Baseline host determinants of robust human HIV...,BACKGROUND: The identification of baseline hos...,"Antibody, Baseline characteristics, CD4+ T cel..."
1,36168036,Molecular pathways enhance drug response predi...,Computational models have been successful in p...,"Cell Line, Cell Line, Tumor, *Everolimus, Hete..."
2,36159773,Comparative Analysis of Machine Learning Metho...,"Breast cancer is the leading cancer in women, ...","*Breast Neoplasms/diagnosis/genetics, Female, ..."
3,36159011,Radiomics and nomogram of magnetic resonance i...,BACKGROUND: Microvascular invasion (MVI) of sm...,"*Carcinoma, Hepatocellular/diagnostic imaging/..."
4,36157578,The integrated landscape of eRNA in gastric ca...,The comprehensive regulation effect of eRNA on...,"Bioinformatics, Biological sciences, Cancer, S..."
...,...,...,...,...
495,25540094,Multilevel modeling and value of information i...,BACKGROUND: Clinical trials are the main metho...,"Arabidopsis, Decision Support Techniques, Esch..."
496,25423479,Identifying predictive features in drug respon...,This article reviews several techniques from m...,"Algorithms, Animals, Antineoplastic Agents/adv..."
497,25189363,Computer-aided detection of exophytic renal le...,Renal lesions are important extracolonic findi...,"Adult, Aged, Algorithms, Artificial Intelligen..."
498,24732597,Gene expression profile alone is inadequate in...,With advent of several treatment options in mu...,"Genetic Testing, Humans, Microarray Analysis, ..."


<a id='section_id'></a>

# Title matches
one strategy might be to see if there is a match between the search term and the title <br>
we can compute the score like: (matching terms in the title)/(total search terms)


In [116]:
def process_text(text):
    text = text.strip()
    text = text.lower()
    text = text.replace('.', ' ')
    text = text.replace(',', ' ')
    text = text.replace(':', ' ')
    text = text.replace(';', ' ')
    text = text.replace('*', ' ')
    text = text.replace('/', ' ')
    text = text.replace('-', ' ')
    text = text.replace('&', ' ')
    text = text.replace('=', ' ')
    return text

In [117]:
search_term = 'machine learning cancer prevention'
search_term = process_text(search_term)
search_term = search_term.split()

scores = []
titles = df['title'].tolist()
for title in titles:
    title = process_text(title)
    matchs = 0
    for word in search_term:
        if word in title:
            matchs += 1
    scores.append(matchs/len(search_term))

In [118]:
df2 = pd.DataFrame(list(zip(titles, scores)), columns = ['title', 'score'])
df2 = df2.sort_values(by=['score'], ascending=False)
df2

Unnamed: 0,title,score
338,"Cancer Prevention Using Machine Learning, Nudg...",1.00
331,Machine learning highlights the deficiency of ...,1.00
54,Artificial Intelligence and Machine Learning i...,0.75
65,Development of Training Materials for Patholog...,0.75
316,Machine learning can accelerate discovery and ...,0.75
...,...,...
238,Mitochondriopathies as a Clue to Systemic Diso...,0.00
243,"Pathogenesis, Symptomatology, and Transmission...",0.00
244,Modeling of diagnosis for metabolic syndrome b...,0.00
246,Autism Spectrum Disorder from the Womb to Adul...,0.00


In [119]:
 for elem in df2[df2['score']==1]['title'].values:
        print(elem)
        print()

Cancer Prevention Using Machine Learning, Nudge Theory and Social Impact Bond.

Machine learning highlights the deficiency of conventional dosimetric constraintsfor prevention of high-grade radiation esophagitis in non-small cell lung cancertreated with chemoradiation.



In [120]:
df2.head()

Unnamed: 0,title,score
338,"Cancer Prevention Using Machine Learning, Nudg...",1.0
331,Machine learning highlights the deficiency of ...,1.0
54,Artificial Intelligence and Machine Learning i...,0.75
65,Development of Training Materials for Patholog...,0.75
316,Machine learning can accelerate discovery and ...,0.75


<a id='section_id2'></a>

# Keywords matches
same as title but for keywords

In [121]:
search_term = 'machine learning cancer prevention'
search_term = process_text(search_term)
search_term = search_term.split()

scores = []
keywords = df['keywords'].tolist()
for keys in keywords:
    keys = process_text(keys)
    matchs = 0
    for word in search_term:
        if word in keys:
            matchs += 1
    scores.append(matchs/len(search_term))

In [122]:
df2 = pd.DataFrame(list(zip(titles, scores)), columns = ['title', 'score'])
df2 = df2.sort_values(by=['score'], ascending=False)
df2

Unnamed: 0,title,score
248,Assessing Lung Cancer Absolute Risk Trajectory...,1.0
164,Identifying False Human Papillomavirus (HPV) V...,1.0
418,Prospective validation of the NCI Breast Cance...,1.0
177,Changes in Immune Cell Types with Age in Breas...,1.0
443,Predictors of the Healthy Eating Index and Gly...,1.0
...,...,...
382,Etiological Role of Diet in 30-Day Readmission...,0.0
254,A roadmap of six different pathways to improve...,0.0
243,"Pathogenesis, Symptomatology, and Transmission...",0.0
241,Identifying Stage II Colorectal Cancer Recurre...,0.0


In [123]:
for elem in df2[df2['score']==1.0]['title'].values:
        print(elem)
        print()

Assessing Lung Cancer Absolute Risk Trajectory Based on a Polygenic Risk Model.

Identifying False Human Papillomavirus (HPV) Vaccine Information andCorresponding Risk Perceptions From Twitter: Advanced Predictive Models.

Prospective validation of the NCI Breast Cancer Risk Assessment Tool (Gail Model)on 40,000 Australian women.

Changes in Immune Cell Types with Age in Breast are Consistent with a Decline inImmune Surveillance and Increased Immunosuppression.

Predictors of the Healthy Eating Index and Glycemic Index in Multi-EthnicColorectal Cancer Families.

Bidirectional deep neural networks to integrate RNA and DNA data for predictingoutcome for patients with hepatocellular carcinoma.

A data-driven ultrasound approach discriminates pathological high grade prostatecancer.

An immunogenic personal neoantigen vaccine for patients with melanoma.

Quantitative ultrasound image analysis of axillary lymph nodes to differentiatemalignancy from reactive benign changes due to COVID-19 vac

<a id='section_id3'></a>

# Title and Keywords matches

In [124]:
search_term = 'machine learning cancer prevention'
search_term = process_text(search_term)
search_term = search_term.split()

scores = []
keywords = df['keywords'].tolist()
titles = df['title'].tolist()
for i in range(len(titles)):
    titles[i] = process_text(titles[i])
    keywords[i] = process_text(keywords[i])
    matchs = 0
    for word in search_term:
        if word in keywords[i]:
            matchs += 1
        if word in titles[i]:
            matchs += 1
    scores.append(matchs)

In [125]:
df2 = pd.DataFrame(list(zip(titles, scores)), columns = ['title', 'score'])
df2 = df2.sort_values(by=['score'], ascending=False)
df2['score'] = df2['score'].apply(lambda x: x/max(df2['score']))
df2

Unnamed: 0,title,score
338,cancer prevention using machine learning nudg...,1.000
315,spatial distribution of esophageal cancer mort...,0.875
268,predicting peritoneal metastasis of gastric ca...,0.875
54,artificial intelligence and machine learning i...,0.875
331,machine learning highlights the deficiency of ...,0.875
...,...,...
93,utilization of host and microbiome features in...,0.000
428,hlbs popomics an online knowledge base to acc...,0.000
102,application of p4 (predictive preventive per...,0.000
111,an improved molecular inversion probe based ta...,0.000


In [126]:
for elem in df2.head(10)['title'].values:
        print(elem)
        print()

cancer prevention using machine learning  nudge theory and social impact bond 

spatial distribution of esophageal cancer mortality in china  a machine learningapproach 

predicting peritoneal metastasis of gastric cancer patients based on machinelearning 

artificial intelligence and machine learning in cancer research  a systematic andthematic analysis of the top 100 cited articles indexed in scopus database 

machine learning highlights the deficiency of conventional dosimetric constraintsfor prevention of high grade radiation esophagitis in non small cell lung cancertreated with chemoradiation 

estimating heterogeneous survival treatment effects of lung cancer screeningapproaches  a causal machine learning analysis 

classification tree based machine learning to visualize and validate a decisiontool for identifying malnutrition in cancer patients 

machine learning assisted discrimination of precancerous and cancerous fromhealthy oral tissue based on multispectral autofluorescence

<a id='section_id4'></a>

# Automatic dictionary from abstracts
idea of the tutor i overheard <br>
basically build a dictionary of the most common words in the abstracts and then use this list to check for matches (in title? in abstract? i have no idea)


In [127]:
def process_text(text):
    text = text.strip()
    text = text.lower()
    text = text.replace('.', ' ')
    text = text.replace(',', ' ')
    text = text.replace(':', ' ')
    text = text.replace(';', ' ')
    text = text.replace('*', ' ')
    text = text.replace('/', ' ')
    text = text.replace('-', ' ')
    text = text.replace('&', ' ')
    text = text.replace('=', ' ')
    return text

In [128]:
def word_freq(text, dic):
    text = process_text(text)
    text = text.split()
    for word in text:
        if word not in dic:
            dic[word] = 0
        dic[word] += 1
    return dic

In [129]:
abstracts = df['abstarct'].tolist()

In [130]:
dic = {key: val for key, val in sorted(dic.items(), key = lambda ele: ele[1], reverse = True)}

In [131]:
# most common words are not the one we care about
dic

{'the': 5756,
 'and': 4506,
 'of': 4464,
 'to': 2554,
 'in': 2366,
 'a': 1886,
 'for': 1491,
 'with': 1251,
 '0': 933,
 'cancer': 914,
 'is': 845,
 'we': 751,
 'on': 698,
 'that': 672,
 'were': 671,
 'was': 659,
 'this': 642,
 'as': 616,
 'from': 586,
 'by': 569,
 'learning': 564,
 'model': 555,
 'patients': 534,
 'data': 496,
 'risk': 474,
 'are': 452,
 'machine': 447,
 'based': 442,
 'an': 432,
 'using': 396,
 'results': 384,
 'be': 374,
 'study': 364,
 'clinical': 350,
 'models': 346,
 'methods': 344,
 '1': 343,
 'can': 306,
 'used': 303,
 'at': 289,
 'or': 282,
 '2': 253,
 'analysis': 248,
 'have': 242,
 'which': 237,
 'disease': 231,
 'prediction': 229,
 'these': 226,
 'our': 224,
 'accuracy': 217,
 'treatment': 217,
 'high': 215,
 'features': 215,
 'has': 214,
 'breast': 208,
 'prevention': 202,
 'performance': 199,
 'health': 190,
 'tumor': 183,
 'between': 177,
 'method': 176,
 'diagnosis': 175,
 'been': 171,
 'detection': 166,
 'most': 162,
 'also': 155,
 'it': 154,
 'such': 1

In [132]:
# try to remove the most common english words
dfs = pd.read_html('https://en.wikipedia.org/wiki/Most_common_words_in_English')

In [133]:
common = pd.read_csv('Untitled 1.csv', header=None)
common = common[0].tolist()

In [134]:
common_worlds = dfs[0]['Word'].tolist()
common = common + common_worlds

In [135]:
cleaned_dict = dic.copy()

In [136]:
for key, value in dic.items():
    if key in common:
        del cleaned_dict[key]

In [137]:
# still a lot of trash
# need other method to clean it better otherwise not useful
cleaned_dict

{'0': 933,
 'cancer': 914,
 'learning': 564,
 'model': 555,
 'patients': 534,
 'data': 496,
 'risk': 474,
 'machine': 447,
 'based': 442,
 'using': 396,
 'results': 384,
 'clinical': 350,
 'models': 346,
 'methods': 344,
 '1': 343,
 'used': 303,
 '2': 253,
 'analysis': 248,
 'disease': 231,
 'prediction': 229,
 'accuracy': 217,
 'treatment': 217,
 'features': 215,
 'breast': 208,
 'prevention': 202,
 'performance': 199,
 'health': 190,
 'tumor': 183,
 'method': 176,
 'diagnosis': 175,
 'detection': 166,
 '3': 149,
 'research': 146,
 'associated': 146,
 'deep': 146,
 'patient': 144,
 'identify': 144,
 'images': 144,
 'potential': 142,
 'factors': 142,
 'validation': 140,
 'screening': 138,
 '19': 136,
 'compared': 134,
 'predictive': 134,
 'related': 132,
 'ai': 132,
 'background': 131,
 'lung': 131,
 'network': 129,
 'developed': 129,
 'algorithms': 129,
 'however': 128,
 'classification': 127,
 'information': 125,
 'algorithm': 125,
 'system': 125,
 'predict': 125,
 'covid': 125,
 'ce

In [138]:
most_common_abstract = list(cleaned_dict.keys())[0:10]

<a id='section_id5'></a>

# Automatic dictionary from keywords

In [139]:
keywords = df['keywords'].tolist()
dic = {}
for keys in keywords:
    dic = word_freq(keys, dic)
    
dic = {key: val for key, 
       val in sorted(dic.items(), 
       key = lambda ele: ele[1], reverse = True)}

In [140]:
dic

{'learning': 462,
 'machine': 376,
 'humans': 347,
 'methods': 277,
 'neoplasms': 240,
 'cancer': 208,
 'diagnosis': 201,
 'imaging': 185,
 'genetics': 180,
 'pathology': 180,
 'diagnostic': 157,
 'prevention': 153,
 'aged': 148,
 'control': 141,
 'of': 139,
 'computer': 133,
 'female': 132,
 'deep': 130,
 'risk': 128,
 'epidemiology': 108,
 'studies': 106,
 'analysis': 106,
 'therapy': 97,
 'data': 95,
 'breast': 94,
 'artificial': 90,
 'drug': 87,
 'metabolism': 86,
 'middle': 83,
 'intelligence': 82,
 'male': 75,
 'disease': 73,
 'neural': 72,
 'and': 72,
 'cell': 70,
 'assisted': 69,
 'lung': 68,
 'adult': 67,
 'algorithms': 66,
 'factors': 66,
 'tomography': 65,
 'tumor': 64,
 'health': 64,
 'biomarkers': 63,
 'image': 61,
 'models': 60,
 'immunology': 60,
 'networks': 57,
 'computed': 57,
 'detection': 57,
 'neoplasm': 56,
 'effects': 55,
 'support': 54,
 'prognosis': 53,
 'assessment': 51,
 'early': 50,
 'medicine': 50,
 'covid': 49,
 '19': 49,
 'curve': 48,
 'gene': 48,
 'stati

In [141]:
# ok but were do i check for matches? title? keywords? abstract?
most_common_keywords = list(dic.keys())[0:10]

In [142]:
search_term

['machine', 'learning', 'cancer', 'prevention']

In [143]:
most_common = search_term + most_common_keywords
set(most_common)

{'cancer',
 'diagnosis',
 'genetics',
 'humans',
 'imaging',
 'learning',
 'machine',
 'methods',
 'neoplasms',
 'pathology',
 'prevention'}

In [144]:
abstarcts = df['abstarct'].tolist()
scores = []
for abstarct in abstarcts:
    abstarct = process_text(abstarct)
    matchs = 0
    for word in most_common:
        if word in abstarct:
            matchs += 1
    scores.append(matchs/len(most_common))

In [145]:
len(scores)

500

In [146]:
df2 = pd.DataFrame(list(zip(titles, scores)), columns = ['title', 'score'])
df2 = df2.sort_values(by=['score'], ascending=False)
df2['score'] = df2['score'].apply(lambda x: x/max(df2['score']))
df2

Unnamed: 0,title,score
42,abstracts of presentations at the association ...,1.000000
217,a primer on applying ai synergistically with d...,0.846154
71,recent advancement in cancer diagnosis using m...,0.769231
77,srg vote predicting mirna gene relationships ...,0.769231
275,computer aided diagonosis for colorectal cance...,0.769231
...,...,...
394,artificial intelligence in dermato oncology a...,0.000000
402,ascent of machine learning in medicine,0.000000
330,letter to the editor response to giardiello d...,0.000000
353,hunting for new drugs with ai,0.000000


In [147]:
for elem in df2.head(10)['title'].values:
        print(elem)
        print()

abstracts of presentations at the association of clinical scientists 143(rd)meeting louisville  ky may 11 14 2022 

a primer on applying ai synergistically with domain expertise to oncology 

recent advancement in cancer diagnosis using machine learning and deep learningtechniques  a comprehensive review 

srg vote  predicting mirna gene relationships via embedding and lstm ensemble 

computer aided diagonosis for colorectal cancer using deep learning with visualexplanations 

radiomics in stratification of pancreatic cystic lesions  machine learning inaction 

cohort profile  chinese cervical cancer clinical study 

artificial intelligence for the prevention and clinical management ofhepatocellular carcinoma 

low grade chronic inflammation and immune alterations in childhood and adolescentcancer survivors  a contribution to accelerated aging?

data driven methods for advancing precision oncology 



In [149]:
search_term = 'machine learning cancer prevention'
search_term = process_text(search_term)
search_term = search_term.split()

scores = []
titles = df['title'].tolist()
for title in titles:
    title = process_text(title)
    matchs = 0
    for word in most_common:
        if word in title:
            matchs += 1
    scores.append(matchs/len(most_common))

df2 = pd.DataFrame(list(zip(titles, scores)), columns = ['title', 'score'])
df2 = df2.sort_values(by=['score'], ascending=False)
df2

Unnamed: 0,title,score
30,Decoding the Role of Epigenetics in Breast Can...,0.571429
82,Effective Image Processing and Segmentation-Ba...,0.500000
130,Methodological quality of machine learning-bas...,0.500000
71,Recent advancement in cancer diagnosis using m...,0.500000
377,C-HMOSHSSA: Gene selection for cancer classifi...,0.500000
...,...,...
251,Cervical screening in high-income countries: t...,0.000000
262,Integration of a vertebral fracture identifica...,0.000000
263,COVID19 Drug Repository: text-mining the liter...,0.000000
272,Exploring chromosomal structural heterogeneity...,0.000000
