# Relevance Score
For each paper we have to compute a relevance score which should state how relevant that paper is compared to the search we made. <br>

So i was thinking something like this:<br>
paper1 ---> score : 1 (exactly what we were looking for)<br>
paper2 ---> score : 0 (treats a completly different topic)<br>
and the rest of the cases are between 0 and 1

Strategies:
- [Title Matches](#section_id)
- [Keywords matches](#section_id2)
- [Title and Keywords matches](#section_id3)
- [Automatic dictionary from abstracts](#section_id4)
- [Automatic dictionary from keywords](#section_id5)

In [1]:
import requests
import json
import time
import pandas as pd

In [2]:
def esearch(search_term):
    search_term = search_term.strip().replace(' ', '+')
    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
    url = url + '?db=pubmed' + '&retmode=json' + '&retmax=100'
    url = url + '&term=' + search_term
    site = requests.get(url).content
    json_site = json.loads(site.decode()) 
    UIDs = json_site['esearchresult']['idlist']
    return UIDs

In [3]:
def efetch(UIDs):
    session = requests.Session()
    abstracts = []
    keywords = []
    titles = []
    i = 1
    for UID in UIDs:
        time.sleep(0.4)
        url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
        url = url + '?db=pubmed' + '&rettype=medline&id='
        url = url + UID
        
        data = session.get(url).content
        data = data.decode()
        data = data.split('\n')
        
        
        ls = []
        for elem in data:
            if elem == '':
                continue
            elif elem[0:4] != '    ': 
                elem = elem.strip()
                elem = elem.replace('\n', '')
                ls.append(elem)
            else:                    
                elem = elem.strip()
                elem = elem.replace('\n', '')
                ls[-1] = ls[-1] + elem
    
    
        ls_2 = [] 
        for elem in ls:
            key = elem[0:4].strip()
            value = elem[5:].strip()
            ls_2.append([key, value])
        
        all_key = ''
        abst = ''
        for row in ls_2:
            if row[0] == 'AB':
                abst = abst + row[1]
            
            elif row[0] == 'OT' or row[0] == 'MH':
                all_key = all_key + row[1] + ', '
            
            elif row[0] == 'TI':
                titles.append(row[1])
                
        keywords.append(all_key[:-2])
        abstracts.append(abst)
        
        if i%5==0:
            print('Paper ' + str(i) +': OK')
        i += 1
        
    dic = {
        'UID' : UIDs,
        'title' : titles,
        'abstarct' : abstracts,
        'keywords' : keywords
    }
    return dic

In [4]:
UIDs = esearch('machine learning cancer prevention')
data = efetch(UIDs)

Paper 5: OK
Paper 10: OK
Paper 15: OK
Paper 20: OK
Paper 25: OK
Paper 30: OK
Paper 35: OK
Paper 40: OK
Paper 45: OK
Paper 50: OK
Paper 55: OK
Paper 60: OK
Paper 65: OK
Paper 70: OK
Paper 75: OK
Paper 80: OK
Paper 85: OK
Paper 90: OK
Paper 95: OK
Paper 100: OK


In [5]:
df = pd.DataFrame(data)
df

Unnamed: 0,UID,title,abstarct,keywords
0,36179551,Baseline host determinants of robust human HIV...,BACKGROUND: The identification of baseline hos...,"Antibody, Baseline characteristics, CD4+ T cel..."
1,36168036,Molecular pathways enhance drug response predi...,Computational models have been successful in p...,"Cell Line, Cell Line, Tumor, *Everolimus, Hete..."
2,36159773,Comparative Analysis of Machine Learning Metho...,"Breast cancer is the leading cancer in women, ...","*Breast Neoplasms/diagnosis/genetics, Female, ..."
3,36159011,Radiomics and nomogram of magnetic resonance i...,BACKGROUND: Microvascular invasion (MVI) of sm...,"*Carcinoma, Hepatocellular/diagnostic imaging/..."
4,36157578,The integrated landscape of eRNA in gastric ca...,The comprehensive regulation effect of eRNA on...,"Bioinformatics, Biological sciences, Cancer, S..."
...,...,...,...,...
95,35316197,"Artificial Intelligence for Colonoscopy: Past,...","During the past decades, many automated image ...","Algorithms, *Artificial Intelligence, Colonosc..."
96,35304310,Urine surface-enhanced Raman spectroscopy comb...,"In this paper, we investigated the feasibility...","Algorithms, Biomarkers, Tumor, *Carcinoma, Hep..."
97,35272662,Reinforcement learning evaluation of treatment...,BACKGROUND: Evaluation of new treatment polici...,"Aspartate Aminotransferases/therapeutic use, H..."
98,35269943,Personalized Risk Schemes and Machine Learning...,Myelodysplastic syndromes (MDS) are characteri...,"Chromosome Aberrations, Genomics/methods, Huma..."


<a id='section_id'></a>

# Title matches
one strategy might be to see if there is a match between the search term and the title <br>
we can compute the score like: (matching terms in the title)/(total search terms)


In [6]:
def process_text(text):
    text = text.strip()
    text = text.lower()
    text = text.replace('.', '')
    text = text.replace(',', '')
    text = text.replace(':', '')
    text = text.replace(';', '')
    text = text.replace('*', '')
    text = text.replace('/', ' ')
    return text

In [7]:
search_term = 'machine learning cancer prevention'
search_term = process_text(search_term)
search_term = search_term.split()

scores = []
titles = df['title'].tolist()
for title in titles:
    title = process_text(title)
    matchs = 0
    for word in search_term:
        if word in title:
            matchs += 1
    scores.append(matchs/len(search_term))

In [8]:
df2 = pd.DataFrame(list(zip(titles, scores)), columns = ['title', 'score'])
df2 = df2.sort_values(by=['score'], ascending=False)
df2

Unnamed: 0,title,score
18,Identification of Hub Genes Associated with Tu...,0.75
83,A Machine-Learning-Based Bibliometric Analysis...,0.75
29,Decoding the Role of Epigenetics in Breast Can...,0.75
70,Recent advancement in cancer diagnosis using m...,0.75
25,Machine Learning-based Correlation Study betwe...,0.75
...,...,...
24,Positive-gradient-weighted object activation m...,0.00
73,Ferroptosis-based molecular prognostic model f...,0.00
47,The Appalachia Mind Health Initiative (AMHI): ...,0.00
76,SRG-Vote: Predicting Mirna-Gene Relationships ...,0.00


In [9]:
 for elem in df2[df2['score']==0.75]['title'].values:
        print(elem)
        print()

Identification of Hub Genes Associated with Tumor-infiltrating Immune Cells andECM Dynamics as the Potential Therapeutic Targets in Gastric Cancer through anIntegrated Bioinformatic Analysis and Machine Learning Methods.

A Machine-Learning-Based Bibliometric Analysis of the Scientific Literature onAnal Cancer.

Decoding the Role of Epigenetics in Breast Cancer Using Formal Modeling andMachine-Learning Methods.

Recent advancement in cancer diagnosis using machine learning and deep learningtechniques: A comprehensive review.

Machine Learning-based Correlation Study between Perioperative ImmunonutritionalIndex and Postoperative Anastomotic Leakage in Patients with Gastric Cancer.

Machine Learning for Endometrial Cancer Prediction and Prognostication.

Machine learning-based demand forecasting in cancer palliative care homehospitalization.

Risk Prediction of Pancreatic Cancer in Patients With Recent-onset Hyperglycemia:A Machine-learning Approach.

Development of Training Materials fo

<a id='section_id2'></a>

# Keywords matches
same as title but for keywords

In [10]:
search_term = 'machine learning cancer prevention'
search_term = process_text(search_term)
search_term = search_term.split()

scores = []
keywords = df['keywords'].tolist()
for keys in keywords:
    keys = process_text(keys)
    matchs = 0
    for word in search_term:
        if word in keys:
            matchs += 1
    scores.append(matchs/len(search_term))

In [11]:
df2 = pd.DataFrame(list(zip(titles, scores)), columns = ['title', 'score'])
df2 = df2.sort_values(by=['score'], ascending=False)
df2

Unnamed: 0,title,score
69,Artificial intelligence for the prevention and...,1.0
53,Artificial Intelligence and Machine Learning i...,1.0
38,Increasing Women's Knowledge about HPV Using B...,1.0
84,Association between Serum Triglycerides and Pr...,1.0
35,Quantitative ultrasound image analysis of axil...,1.0
...,...,...
16,Vascular Implications of COVID-19: Role of Rad...,0.0
65,Chronic Lymphocytic Leukemia Progression Diagn...,0.0
74,Power of big data to improve patient care in g...,0.0
45,Computational identification of natural produc...,0.0


In [12]:
for elem in df2[df2['score']==1.0]['title'].values:
        print(elem)
        print()

Artificial intelligence for the prevention and clinical management ofhepatocellular carcinoma.

Artificial Intelligence and Machine Learning in Cancer Research: A Systematic andThematic Analysis of the Top 100 Cited Articles Indexed in Scopus Database.

Increasing Women's Knowledge about HPV Using BERT Text Summarization: An OnlineRandomized Study.

Association between Serum Triglycerides and Prostate Specific Antigen (PSA) amongU.S. Males: National Health and Nutrition Examination Survey (NHANES), 2003-2010.

Quantitative ultrasound image analysis of axillary lymph nodes to differentiatemalignancy from reactive benign changes due to COVID-19 vaccination.



<a id='section_id3'></a>

# Title and Keywords matches

In [13]:
search_term = 'machine learning cancer prevention'
search_term = process_text(search_term)
search_term = search_term.split()

scores = []
keywords = df['keywords'].tolist()
titles = df['title'].tolist()
for i in range(len(titles)):
    titles[i] = process_text(titles[i])
    keywords[i] = process_text(keywords[i])
    matchs = 0
    for word in search_term:
        if word in keywords[i]:
            matchs += 1
        if word in titles[i]:
            matchs += 1
    scores.append(matchs)

In [14]:
df2 = pd.DataFrame(list(zip(titles, scores)), columns = ['title', 'score'])
df2 = df2.sort_values(by=['score'], ascending=False)
df2['score'] = df2['score'].apply(lambda x: x/max(df2['score']))
df2

Unnamed: 0,title,score
53,artificial intelligence and machine learning i...,1.000000
83,a machine-learning-based bibliometric analysis...,0.857143
21,machine learning for endometrial cancer predic...,0.857143
75,machine learning-based demand forecasting in c...,0.857143
18,identification of hub genes associated with tu...,0.857143
...,...,...
41,abstracts of presentations at the association ...,0.000000
57,precision medicine journey through omics approach,0.000000
47,the appalachia mind health initiative (amhi) a...,0.000000
52,automatic segmentation of calcification areas ...,0.000000


In [15]:
for elem in df2.head(10)['title'].values:
        print(elem)
        print()

artificial intelligence and machine learning in cancer research a systematic andthematic analysis of the top 100 cited articles indexed in scopus database

a machine-learning-based bibliometric analysis of the scientific literature onanal cancer

machine learning for endometrial cancer prediction and prognostication

machine learning-based demand forecasting in cancer palliative care homehospitalization

identification of hub genes associated with tumor-infiltrating immune cells andecm dynamics as the potential therapeutic targets in gastric cancer through anintegrated bioinformatic analysis and machine learning methods

development and validation of a non-invasive chairside oral cavity cancer riskassessment prototype using machine learning approach

recent advancement in cancer diagnosis using machine learning and deep learningtechniques a comprehensive review

development and validation of machine learning models to predict epidermal growthfactor receptor mutation in non-small cell l

<a id='section_id4'></a>

# Automatic dictionary from abstracts
idea of the tutor i overheard <br>
basically build a dictionary of the most common words in the abstracts and then use this list to check for matches (in title? in abstract? i have no idea)


In [16]:
def process_text(text):
    text = text.strip()
    text = text.lower()
    text = text.replace('.', '')
    text = text.replace(',', '')
    text = text.replace(':', '')
    text = text.replace(';', '')
    text = text.replace('*', '')
    text = text.replace('/', ' ')
    return text

In [17]:
def word_freq(text, dic):
    text = process_text(text)
    text = text.split()
    for word in text:
        if word not in dic:
            dic[word] = 0
        dic[word] += 1
    return dic

In [24]:
abstracts = df['abstarct'].tolist()
dic = {}
for abstract in abstracts:
    dic = word_freq(abstract, dic)

In [25]:
dic = {key: val for key, val in sorted(dic.items(), key = lambda ele: ele[1], reverse = True)}

In [26]:
# most common words are not the one we care about
dic

{'the': 1568,
 'of': 1300,
 'and': 1279,
 'to': 728,
 'in': 721,
 'a': 513,
 'for': 406,
 'with': 374,
 'is': 217,
 'were': 211,
 'that': 198,
 'on': 190,
 'this': 184,
 'was': 184,
 'we': 182,
 'cancer': 175,
 'as': 168,
 'by': 159,
 'patients': 159,
 'from': 155,
 'be': 142,
 'are': 136,
 'model': 125,
 'data': 118,
 'an': 117,
 'learning': 113,
 'clinical': 103,
 'study': 101,
 'machine': 95,
 'results': 90,
 'using': 89,
 'models': 89,
 'can': 88,
 'at': 80,
 'methods': 76,
 'risk': 74,
 'these': 73,
 'or': 72,
 'based': 67,
 'covid-19': 61,
 'used': 60,
 'treatment': 60,
 'our': 59,
 'has': 58,
 'health': 58,
 'prediction': 57,
 'diagnosis': 56,
 'analysis': 56,
 'detection': 55,
 '1': 54,
 'have': 54,
 'between': 54,
 'disease': 53,
 'may': 50,
 'which': 50,
 'most': 48,
 'been': 47,
 'all': 46,
 'identify': 46,
 'also': 45,
 'accuracy': 45,
 'genes': 45,
 'research': 44,
 'it': 44,
 'patient': 43,
 'upon': 43,
 'not': 42,
 'should': 41,
 'early': 41,
 'prevention': 41,
 'partici

In [27]:
# try to remove the most common english words
dfs = pd.read_html('https://en.wikipedia.org/wiki/Most_common_words_in_English')

In [32]:
common_worlds = dfs[0]['Word'].tolist()

In [34]:
cleaned_dict = dic.copy()

for key, value in dic.items():
    if key in common_worlds:
        del cleaned_dict[key]

In [35]:
# still a lot of trash
# need other method to clean it better otherwise not useful
cleaned_dict

{'is': 217,
 'were': 211,
 'was': 184,
 'cancer': 175,
 'patients': 159,
 'are': 136,
 'model': 125,
 'data': 118,
 'learning': 113,
 'clinical': 103,
 'study': 101,
 'machine': 95,
 'results': 90,
 'using': 89,
 'models': 89,
 'methods': 76,
 'risk': 74,
 'based': 67,
 'covid-19': 61,
 'used': 60,
 'treatment': 60,
 'has': 58,
 'health': 58,
 'prediction': 57,
 'diagnosis': 56,
 'analysis': 56,
 'detection': 55,
 '1': 54,
 'between': 54,
 'disease': 53,
 'may': 50,
 'been': 47,
 'identify': 46,
 'accuracy': 45,
 'genes': 45,
 'research': 44,
 'patient': 43,
 'upon': 43,
 'should': 41,
 'early': 41,
 'prevention': 41,
 'participants': 40,
 'more': 40,
 'ai': 40,
 'able': 40,
 'features': 39,
 'had': 39,
 'testing': 39,
 '2': 39,
 'care': 38,
 'activity': 38,
 'completion': 38,
 'performance': 37,
 'such': 37,
 'system': 37,
 'tumor': 36,
 'potential': 36,
 'ml': 36,
 'showed': 35,
 'high': 34,
 '=': 34,
 'associated': 34,
 'mortality': 34,
 'improve': 33,
 'support': 33,
 'survival': 3

<a id='section_id5'></a>

# Automatic dictionary from keywords

In [36]:
abstracts = df['keywords'].tolist()
dic = {}
for abstract in abstracts:
    dic = word_freq(abstract, dic)
    
dic = {key: val for key, 
       val in sorted(dic.items(), 
       key = lambda ele: ele[1], reverse = True)}

In [37]:
dic

{'learning': 87,
 'machine': 73,
 'humans': 54,
 'diagnosis': 37,
 'neoplasms': 36,
 'cancer': 31,
 'imaging': 25,
 'methods': 25,
 'covid-19': 20,
 'genetics': 19,
 'diagnostic': 19,
 'of': 19,
 'health': 19,
 'deep': 18,
 'pathology': 17,
 'female': 16,
 'artificial': 16,
 'epidemiology': 16,
 'prevention': 16,
 'studies': 15,
 'intelligence': 15,
 'control': 15,
 'breast': 14,
 'therapy': 14,
 'care': 13,
 '&': 13,
 'algorithms': 12,
 'carcinoma': 11,
 'and': 11,
 'drug': 10,
 'neural': 10,
 'analysis': 10,
 'risk': 10,
 'cell': 9,
 'retrospective': 9,
 'etiology': 9,
 'prediction': 9,
 'liver': 8,
 'disease': 8,
 'model': 8,
 'detection': 8,
 'early': 7,
 'networks': 7,
 'processing': 7,
 'lung': 7,
 'mass': 7,
 'cervical': 7,
 'vaccines': 7,
 'the': 6,
 'modeling': 6,
 'diabetes': 6,
 'hepatocellular': 5,
 'biomarkers': 5,
 'computer': 5,
 'models': 5,
 'network': 5,
 'complications': 5,
 'nutrition': 5,
 'precision': 5,
 'image': 5,
 'computer-assisted': 5,
 'factors': 5,
 'sars-

In [47]:
# ok but were do i check for matches? title? keywords? abstract?
list(dic.keys())[0:5]

['learning', 'machine', 'humans', 'diagnosis', 'neoplasms']