### NLP Research Internship Assignment Biomedical Text Analysis
*data_extraction_starter.ipynb*

In [19]:
# Import necessary libraries
from Bio import Entrez
import ssl

# Bypass SSL certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

In [20]:
# Function to fetch abstracts from PubMed using MeSH terms
def fetch_abstracts(term, max_results=1000):
    """
    Fetch abstracts from PubMed based on search terms.
    
    Parameters:
    term (str): Search term or MeSH term for querying PubMed.
    max_results (int): Maximum number of results to fetch.
    
    Returns:
    list: A list of abstracts fetched from PubMed.
    """
    
    # Provide contact email for Entrez
    Entrez.email = "info@toxgensolutions.eu"
    
    # Perform the search query using Entrez
    handle = Entrez.esearch(db="pubmed", term=term, retmax=max_results)
    
    # Read search results
    record = Entrez.read(handle)
    handle.close()
    
    # Extract PubMed IDs from the search results
    id_list = record["IdList"]
    
    # Check if search returned results
    if not id_list:
        print("No results found.")
        return []
    
    # Fetch abstracts based on PubMed IDs
    handle = Entrez.efetch(db="pubmed", id=id_list, rettype="abstract", retmode="text")


    
    # Read and split the abstracts
    abstracts = handle.read().split("\n\n")
    handle.close()
    
    return abstracts

In [21]:
# Define the search term, e.g., "Cancer Immunotherapy"
search_term = "Cancer Immunotherapy"

# Fetch abstracts using the search term
abstracts = fetch_abstracts(search_term)

# Display first 5 abstracts for quick inspection (optional)
print("First 5 abstracts:\n")
for i, abstract in enumerate(abstracts[:5]):
    print(f"{i+1}. {abstract}\n")

First 5 abstracts:

1. 1. Br J Cancer. 2023 Sep 21. doi: 10.1038/s41416-023-02428-2. Online ahead of 
print.

2. Feasibility of mass cytometry proteomic characterisation of circulating tumour 
cells in head and neck squamous cell carcinoma for deep phenotyping.

3. Payne K(1), Brooks J(1), Batis N(2), Khan N(3), El-Asrag M(4), Nankivell P(1), 
Mehanna H(#)(1), Taylor G(#)(5).

4. Author information:
(1)Institute of Head and Neck Studies and Education, Institute of Cancer and 
Genomic Sciences, University of Birmingham, Birmingham, UK.
(2)School of Biomedical Sciences, Institute of Clinical Sciences, College of 
Medical and Dental Sciences, University of Birmingham, Birmingham, UK.
(3)Clinical Immunology Service, Institute of Immunology and Immunotherapy, 
University of Birmingham, Birmingham, UK.
(4)Institute of Cancer and Genomic Sciences, University of Birmingham, 
Birmingham, UK.
(5)Institute of Immunology and Immunotherapy, University of Birmingham, 
Birmingham, UK. g.s.taylor@bham

### Currently, the code provided outputs the first i parts of an element instead of first i abstracts. Therefore, I will adjust it to output the first i abstracts.

In [22]:
# Function to fetch abstracts from PubMed using MeSH terms
def fetch_abstracts_modified(term, max_results=1000):
    """
    Fetch abstracts from PubMed based on search terms.
    
    Parameters:
    term (str): Search term or MeSH term for querying PubMed.
    max_results (int): Maximum number of results to fetch.
    
    Returns:
    list: A list of abstracts fetched from PubMed.
    """
    
    # Provide contact email for Entrez
    Entrez.email = "info@toxgensolutions.eu"
    
    # Perform the search query using Entrez
    handle = Entrez.esearch(db="pubmed", term=term, retmax=max_results)
    
    # Read search results
    record = Entrez.read(handle)
    handle.close()
    
    # Extract PubMed IDs from the search results
    id_list = record["IdList"]
    
    # Check if search returned results
    if not id_list:
        print("No results found.")
        return []
    
    # Fetch abstracts based on PubMed IDs
    handle = Entrez.efetch(db="pubmed", id=id_list, rettype="abstract", retmode="text")
    

    
    # Read and split the abstracts
    abstracts = handle.read().split("\n\n")



    # Initialize a list to store the abstracts and a counter to keep track of which result
    abstracts_text = []
    counter = 1
    # Iterate through the abstracts to see the beginning of a new abstract 
    for i, abstract in enumerate(abstracts):
        # Check if an abstract begins with i. where i is article number

        if abstract.strip().startswith(str(counter) + '.'):
            
            #increment the counter to keep track of article number
            counter+=1
            # Check if there are 5 paragraphs left in the list
            if i + 5 <= len(abstracts):
                
                # Extract the 5th paragraph after the title as the abstract as results follows this order
                abstract = abstracts[i + 4].strip()
                abstracts_text.append(abstract)

                
    handle.close()
    
    return abstracts_text

In [23]:
search_term = "Cancer Immunotherapy"

# Fetch abstracts using the search term
abstracts = fetch_abstracts_modified(search_term)

# Display first 5 abstracts for quick inspection (optional)
print("First 5 abstracts:\n")
for i, abstract in enumerate(abstracts[:5]):
    print(f"{i+1}. {abstract}\n")

First 5 abstracts:

1. BACKGROUND: Circulating tumour cells (CTCs) are a potential cancer biomarker, 
but current methods of CTC analysis at single-cell resolution are limited. Here, 
we describe high-dimensional single-cell mass cytometry proteomic analysis of 
CTCs in HNSCC.
METHODS: Parsortix microfluidic-enriched CTCs from 14 treatment-naïve HNSCC 
patients were analysed by mass cytometry analysis using 41 antibodies. Immune 
cell lineage, epithelial-mesenchymal transition (EMT), stemness, proliferation 
and immune checkpoint expression was assessed alongside phosphorylation status 
of multiple signalling proteins. Patient-matched tumour gene expression and CTC 
EMT profiles were compared. Standard bulk CTC RNAseq was performed as a baseline 
comparator to assess mass cytometry data.
RESULTS: CTCs were detected in 13/14 patients with CTC counts of 2-24 CTCs/ml 
blood. Unsupervised clustering separated CTCs into epithelial, early EMT and 
advanced EMT groups that differed in signall

#### Now we are set to go let's use some NLP techniques to make use of the abstracts!

# Entity Recognition

One of the most useful techniques in NLP is Entity Recognition which detects entities within the text. Entity Recognition can be useful according to the application. In Biomedical field, detecting the names of the proteins, cells and diseases can be one of the most useful applications. There are various methods to do Entity Recognition, but the most famous nowadays is BERT. BERT is mostly successfull because it can be adapted according to the application and the dataset the model has been used on. For example, I am using "biomedical-ner-all" model from "d4data". The model can be easily found on HuggingFace.co . The model has been trained on a large dataset of labeled Biomedical data which makes it possible to recognize multiple entites within the biomedical field. 

In [7]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
biobert_tokenizer = AutoTokenizer.from_pretrained("d4data/biomedical-ner-all")
biobert_model = AutoModelForTokenClassification.from_pretrained("d4data/biomedical-ner-all")
biobert = pipeline('ner', model=biobert_model, tokenizer=biobert_tokenizer, aggregation_strategy="simple")

  from .autonotebook import tqdm as notebook_tqdm


In [8]:
for abstract in abstracts[:5]:
      detected_entities = biobert(abstract)
      print("Abstract: ")
      print()
      print(abstract)
      print("Detected entities: ")
      print()
      for ent in detected_entities:
            print((ent['word'], ent['entity_group']))
      print()




Abstract: 

Cellular exhaustion in various immune cells develops in response to prolonged 
stimulation and overactivation during chronic infections and in cancer. Marked 
by an upregulation of inhibitory receptors and diminished effector functions, 
exhausted immune cells are unable to fully eradicate the antigen responsible for 
the overexposure. In cancer settings, this results in a relatively small but 
constant tumor burden known as a localized tumor-immune stalemate. In recent 
years, studies have elucidated key aspects of the development and progression of 
cellular exhaustion and have re-addressed previous misconceptions. Biological 
publications have also provided insight into the functional capabilities of 
exhausted cells. Complementing these findings, the model presented here serves 
as a mathematical framework for the establishment of cellular exhaustion and the 
development of the localized stalemate against a solid tumor. Analysis of this 
model indicates that this stalem

We can see that the model recognised some entities and might have failed in detecting some. But for me, the most interesting result of this model is that it was able to detect that "t cells" are biological structures which shows that BERT takes many information into account to produce such results.

# Keyword Extraction

Keyword Extraction is a technique similar to Entity Recognition. However, the main difference is that there are no entity groups. Here we used spacy and its model en_core_web_lg which has been trained on a large amount of corpuses. 

In [36]:
!python3 -m spacy download en_core_web_lg


[38;5;1m✘ No compatible package found for 'en_core_sci_sm' (spaCy v3.6.1)[0m



In [37]:
import spacy
nlp = spacy.load("en_core_web_lg")
text = """spaCy is an open-source software library for advanced natural language processing, 
written in the programming languages Python and Cython. The library is published under the MIT license
and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion."""
for abstract in abstracts[:5]:
    doc = nlp(abstract)
    print("Abstract: ")
    print(doc)
    print()
    print("Keywords: ")
    print(doc.ents)
    print()

Abstract: 
BACKGROUND: Circulating tumour cells (CTCs) are a potential cancer biomarker, 
but current methods of CTC analysis at single-cell resolution are limited. Here, 
we describe high-dimensional single-cell mass cytometry proteomic analysis of 
CTCs in HNSCC.
METHODS: Parsortix microfluidic-enriched CTCs from 14 treatment-naïve HNSCC 
patients were analysed by mass cytometry analysis using 41 antibodies. Immune 
cell lineage, epithelial-mesenchymal transition (EMT), stemness, proliferation 
and immune checkpoint expression was assessed alongside phosphorylation status 
of multiple signalling proteins. Patient-matched tumour gene expression and CTC 
EMT profiles were compared. Standard bulk CTC RNAseq was performed as a baseline 
comparator to assess mass cytometry data.
RESULTS: CTCs were detected in 13/14 patients with CTC counts of 2-24 CTCs/ml 
blood. Unsupervised clustering separated CTCs into epithelial, early EMT and 
advanced EMT groups that differed in signalling pathway 

We can see that the model captures both important and unimportant keywords. An important thing to keep in mind is that this model has been trained on different type of corpuses and not only medical ones. There exists a model called en_core_sci_sm which we will test now:

In [54]:
import spacy
model = spacy.load("en_core_sci_sm")
for abstract in abstracts:
    print("Abstract: ")
    print(abstract)
    keywords = model(abstract).ents
    print("Keywords: ")
    print(keywords)
    print()



Abstract: 
BACKGROUND: Circulating tumour cells (CTCs) are a potential cancer biomarker, 
but current methods of CTC analysis at single-cell resolution are limited. Here, 
we describe high-dimensional single-cell mass cytometry proteomic analysis of 
CTCs in HNSCC.
METHODS: Parsortix microfluidic-enriched CTCs from 14 treatment-naïve HNSCC 
patients were analysed by mass cytometry analysis using 41 antibodies. Immune 
cell lineage, epithelial-mesenchymal transition (EMT), stemness, proliferation 
and immune checkpoint expression was assessed alongside phosphorylation status 
of multiple signalling proteins. Patient-matched tumour gene expression and CTC 
EMT profiles were compared. Standard bulk CTC RNAseq was performed as a baseline 
comparator to assess mass cytometry data.
RESULTS: CTCs were detected in 13/14 patients with CTC counts of 2-24 CTCs/ml 
blood. Unsupervised clustering separated CTCs into epithelial, early EMT and 
advanced EMT groups that differed in signalling pathway 

We can already observe how the model trained on biomedical corpus outperforms the previous model as it's able to recognize a bigger number of more specific terms.

# Find similar documents using abstracts

As a final step to wrap up, I will develop a model that I will train on the extracted abstratcs regarding a specific biomedical topic. Using the model I will try to find the most similar abstracts to the abstract I will insert. I will pick the first abstract as the one that I will use to find similar results. To find a search topic, I will choose the first abstract from the list of abstracts that I have.

In [83]:
search_term = "Cancer Immunotherapy"

# Fetch abstracts using the search term
abstracts = fetch_abstracts_modified(search_term)

query_abstract = abstracts[0]

print(query_abstract)


BACKGROUND: Lung adenocarcinoma (LUAD) is an extraordinarily malignant tumor, 
with rapidly increasing morbidity and poor prognosis. Immunotherapy has emerged 
as a hopeful therapeutic modality for lung adenocarcinoma. Furthermore, a 
prognostic model (based on immune genes) can fulfill the purpose of early 
diagnosis and accurate prognostic prediction.
METHODS: Immune-related mRNAs (IRmRNAs) were utilized to construct a prognostic 
model that sorted patients into high- and low-risk groups. Then, the prediction 
efficacy of our model was evaluated using a nomogram. The differences in overall 
survival (OS), the tumor mutation landscape, and the tumor microenvironment were 
further explored between different risk groups. In addition, the immune genes 
comprising the prognostic model were subjected to single-cell RNA sequencing to 
investigate the expression of these immune genes in different cells. Finally, 
the functions of BIRC5 were validated through in vitro experiments.
RESULTS: Pa

We can see that the first topic is Lung adenocarcinoma which we will use to query and then see the similarities between the query abstract and other abstracts.

In [84]:
search_term = "Lung adenocarcinoma"

# Fetch abstracts using the search term
abstracts = fetch_abstracts_modified(search_term)

# Display first 5 abstracts for quick inspection 
print("First 5 abstracts:\n")
for i, abstract in enumerate(abstracts[:5]):
    print(f"{i+1}. {abstract}\n")

    

First 5 abstracts:

1. BACKGROUND: Multiple genetic and epigenetic regulatory mechanisms play a vital 
role in tumorigenesis and development. Understanding the interplay between 
different epigenetic modifications and its contribution to transcriptional 
regulation in cancer is essential for precision medicine. Here, we aimed to 
investigate the interplay between N6-methyladenosine (m6A) modifications and 
histone modifications in lung adenocarcinoma (LUAD).
RESULTS: Based on the data from public databases, including chromatin property 
data (ATAC-seq, DNase-seq), methylated RNA immunoprecipitation sequencing 
(MeRIP-seq), and gene expression data (RNA-seq), a m6A-related differentially 
expressed gene nerve growth factor inducible (VGF) was identified between LUAD 
tissues and normal lung tissues. VGF was significantly highly expressed in LUAD 
tissues and cells, and was associated with a worse prognosis for LUAD, silencing 
of VGF inhibited the malignant phenotype of LUAD cells by in

In [85]:
len(abstracts)

60

Let's now see use the 60 abstracts to train the Doc2vec model. Doc2Vec creates a vector for each abstract in n dimensional space. Those vectors generated will help us to see how similar is our query abstract and other abstracts.

In [86]:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.doc2vec import TaggedDocument

tags =  list(range(1,len(abstracts)))
tagged_data = [TaggedDocument(words=abstract, tags=[tag]) for abstract,tag in zip(abstracts, tags)]

model = Doc2Vec(vector_size=100, window=5,  min_count=1, epochs=20)         

model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)



In [87]:
abstract_vec = model.infer_vector([query_abstract])

similar_docs = model.docvecs.most_similar([abstract_vec], topn=5)
print("Abstract: ")
print(query_abstract)
print()
print("Similar docs: ")
for id,similarity_score in similar_docs:
    print("Abstract: ")
    print(abstracts[id-1])
    print("Similarity score: ")
    print(similarity_score)

Abstract: 
BACKGROUND: Lung adenocarcinoma (LUAD) is an extraordinarily malignant tumor, 
with rapidly increasing morbidity and poor prognosis. Immunotherapy has emerged 
as a hopeful therapeutic modality for lung adenocarcinoma. Furthermore, a 
prognostic model (based on immune genes) can fulfill the purpose of early 
diagnosis and accurate prognostic prediction.
METHODS: Immune-related mRNAs (IRmRNAs) were utilized to construct a prognostic 
model that sorted patients into high- and low-risk groups. Then, the prediction 
efficacy of our model was evaluated using a nomogram. The differences in overall 
survival (OS), the tumor mutation landscape, and the tumor microenvironment were 
further explored between different risk groups. In addition, the immune genes 
comprising the prognostic model were subjected to single-cell RNA sequencing to 
investigate the expression of these immune genes in different cells. Finally, 
the functions of BIRC5 were validated through in vitro experiments.


  similar_docs = model.docvecs.most_similar([abstract_vec], topn=5)


To see why this method might be important, let's consider the second most similar abstract, when we first searched, this was in the 51st position which shows us that the current query system doesn't keep context into account. I think querying by context rather than keywords can be the most powerful tool in the field of biomedicine. Note that the highest similarity score we got is 0.19 which is very low. This is a result of the training set being small, when trained on a larger number of documents, Doc2Vec can be powerful, BERT also have their own document comparsion models which are also worth looking at, but I decided to keep Doc2Vec to showcast different methods in NLP. 

##### To sum everything up, we worked on 3 different tasks:
    1- Entity Recogntion
    2- Keyword Extraction
    3- Document Similarity
I will now answer the questions mentioned in the email.
##### Q1. Could you please articulate your understanding of NLP as a technology, including any hands-on experience you may have in this specific field?
Throughout history, humans used text all the time, and the history of humanity is fully dependent on preserving texts. NLP uses this text to generate insightful results that would assist us in our daily lives. We already saw ChatGPT and the revolution it created when GPT was trained on a very small subset of texts available, which showed us the incredible results that NLP could acheive. After ChatGPT, the interest increased in the NLP field which will help in more research and hopefully better tools that will yield better results. My experience in NLP started 2 years ago when I took a course in unviersity that made it my favourite course. I also chose it for one of my project where I created a model trained on about 300 Doctoral thesis that helps finding the most similar abstract to a query of the user's choice. I also took a course in Information Retreival and Text mining where I had to make something useful with a book my choice. What I did was using a website for summaries to scrape the summaries and apply different methods such as Entity Recogntions, Topic Modelling, Emotion Analysis, and evaluations of the methods. I will also take an elective in Advanced Natural Language Processing next period which will help me increase my knowledge in new methods such as transformers.
##### Q2. We'd love to hear your thoughts on how NLP techniques could be employed to extract pertinent biomedical information from academic publications or alternative data sources.
The first appraoch that I get in mind is training models on academic publications and creating a retreival system that takes into account context rather than keywords. As humans, we always use synonyms which are not taken into account in most retreival systems, which makes me confident that biomedical information extraction would be easier with such system. Another Idea in mind is to pre-train existing models such as GPT-3 on academic publications to create a QA system which would be more powerful than a retreival system. I haven't researched the efficiency of such approach but I'm sure it's feasible.
##### Q3. Are there particular NLP algorithms or methodologies that you believe would be exceptionally beneficial in the biomedical research context?
BERT is one of the methods that I believe can make difference in any field that uses NLP. Personally, I would like to try Sentence-BERT on a similar appraoch to the one I implemented to find similar documents. According to my research, I felt there's a big room for improvment in the field of NLP and Biomedical research and I felt the potential for using NLP in the field. 

I want to note that this is my first time dealing with biomedical texts, and I could easily say I enjoyed working with it and I also want to clarify that I picked the same topic (Cancer Immunotherapy) as I don't have enough knowledge in the industry, but I can say I am very eager to learn about it. I would also like to thank you for the creative way to help applicants showcase their skills!