### NLP Research Internship Assignment: Biomedical Text Analysis
*data_extraction_starter.ipynb*

In [4]:
# Import necessary libraries
from Bio import Entrez
import ssl

# Disable SSL certificate verification globally (insecure workaround; use only for local testing)
ssl._create_default_https_context = ssl._create_unverified_context

In [5]:
# Function to fetch abstracts from PubMed using MeSH terms
def fetch_abstracts(term, max_results=1000):
    """
    Fetch abstracts from PubMed based on search terms.
    
    Parameters:
    term (str): Search term or MeSH term for querying PubMed.
    max_results (int): Maximum number of results to fetch.
    
    Returns:
    list: A list of abstracts fetched from PubMed.
    """
    
    # Provide contact email for Entrez
    Entrez.email = "info@toxgensolutions.eu"
    
    # Perform the search query using Entrez
    handle = Entrez.esearch(db="pubmed", term=term, retmax=max_results)
    
    # Read search results
    record = Entrez.read(handle)
    handle.close()
    
    # Extract PubMed IDs from the search results
    id_list = record["IdList"]
    
    # Check if search returned results
    if not id_list:
        print("No results found.")
        return []
    
    # Fetch abstracts based on PubMed IDs
    handle = Entrez.efetch(db="pubmed", id=id_list, rettype="abstract", retmode="text")
    
    # Read the response and split it on blank lines
    abstracts = handle.read().split("\n\n")
    handle.close()
    
    return abstracts

In [6]:
# Define the search term, e.g., "Cancer Immunotherapy"
search_term = "Cancer Immunotherapy"

# Fetch abstracts using the search term
abstracts = fetch_abstracts(search_term)

# Display first 5 abstracts for quick inspection (optional)
print("First 5 abstracts:\n")
for i, abstract in enumerate(abstracts[:5]):
    print(f"{i+1}. {abstract}\n")

First 5 abstracts:

1. 1. Biomark Med. 2023 Sep 15. doi: 10.2217/bmm-2023-0202. Online ahead of print.

2. CD39 (ENTPD1) in tumors: a potential therapeutic target and prognostic 
biomarker.

3. Li C(1), Zhang L(1), Jin Q(1), Jiang H(1), Wu C(1).

4. Author information:
(1)Department of Hematology, Lanzhou University Second Hospital, Lanzhou, 
730000, China.

5. As a regulator of the dynamic balance between immune-activated extracellular ATP 
and immunosuppressive adenosine, CD39 ectonucleotidase impairs the ability of 
immune cells to exert anticancer immunity and plays an important role in the 
immune escape of tumor cells within the tumor microenvironment. In addition, 
CD39 has been studied in cancer patients to evaluate the prognosis, the efficacy 
of immunotherapy (e.g., PD-1 blockade) and the prediction of recurrence. This 
article reviews the importance of CD39 in tumor immunology, summarizes the 
preclinical evidence on targeting CD39 to treat tumors and focuses on the 
potenti

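### The provided code prints the first five paragraphs of the first record rather than the first five abstracts, because splitting on blank lines breaks each record into its citation, title, authors, affiliation, and abstract. The function below is adjusted to return one abstract per record.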
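
The paragraph-versus-abstract problem can be reproduced offline. The sketch below uses a made-up two-record sample (`SAMPLE`) that only imitates the paragraph order seen in the output above; the sample text and the helper name `group_by_record` are assumptions for illustration, not part of the starter code or of Biopython.

```python
# Hypothetical two-record sample imitating the layout seen in the output above:
# citation, title, authors, affiliation, abstract.
SAMPLE = (
    "1. Journal A. 2023 Jan 1.\n\n"
    "Title of the first article.\n\n"
    "Author A(1), Author B(1).\n\n"
    "Author information:\n(1)Some Department.\n\n"
    "First abstract text.\n\n\n"
    "2. Journal B. 2023 Feb 2.\n\n"
    "Title of the second article.\n\n"
    "Author C(1).\n\n"
    "Author information:\n(1)Another Department.\n\n"
    "Second abstract text."
)

# Splitting on blank lines yields paragraphs, not whole records
paragraphs = SAMPLE.split("\n\n")


def group_by_record(paragraphs):
    """Group paragraphs into records; a record starts at a paragraph beginning with 'N.'."""
    records = []
    expected = 1  # number of the next record we expect to see
    for paragraph in paragraphs:
        text = paragraph.strip()
        if text.startswith(f"{expected}."):
            records.append([])
            expected += 1
        if records:
            records[-1].append(text)
    return records


records = group_by_record(paragraphs)
print(len(paragraphs))                      # number of paragraphs
print(len(records))                         # number of records
print([record[4] for record in records])    # fifth paragraph of each record
```

On this sample the fifth paragraph of each record is its abstract. Real records may not always have exactly this paragraph order, so a fixed offset should be treated as a heuristic.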

In [7]:
# Function to fetch abstracts from PubMed using MeSH terms
def fetch_abstracts_modified(term, max_results=1000):
    """
    Fetch abstracts from PubMed based on search terms.
    
    Parameters:
    term (str): Search term or MeSH term for querying PubMed.
    max_results (int): Maximum number of results to fetch.
    
    Returns:
    list: A list of abstracts fetched from PubMed.
    """
    
    # Provide contact email for Entrez
    Entrez.email = "info@toxgensolutions.eu"
    
    # Perform the search query using Entrez
    handle = Entrez.esearch(db="pubmed", term=term, retmax=max_results)
    
    # Read search results
    record = Entrez.read(handle)
    handle.close()
    
    # Extract PubMed IDs from the search results
    id_list = record["IdList"]
    
    # Check if search returned results
    if not id_list:
        print("No results found.")
        return []
    
    # Fetch abstracts based on PubMed IDs
    handle = Entrez.efetch(db="pubmed", id=id_list, rettype="abstract", retmode="text")
    
    # Read the response and split it into paragraphs on blank lines
    paragraphs = handle.read().split("\n\n")
    handle.close()
    
    # Collect one abstract per record; counter tracks the expected record number
    abstracts_text = []
    counter = 1
    for i, paragraph in enumerate(paragraphs):
        # A new record begins with its number followed by a period, e.g. "1."
        if paragraph.strip().startswith(f"{counter}."):
            counter += 1
            # Make sure the record still has five paragraphs available
            if i + 5 <= len(paragraphs):
                # Paragraph order is citation, title, authors, affiliation, abstract,
                # so the abstract is the fifth paragraph of the record
                abstracts_text.append(paragraphs[i + 4].strip())
    
    return abstracts_text

In [8]:
search_term = "Cancer Immunotherapy"

# Fetch abstracts using the search term
abstracts = fetch_abstracts_modified(search_term)

# Display first 5 abstracts for quick inspection (optional)
print("First 5 abstracts:\n")
for i, abstract in enumerate(abstracts[:5]):
    print(f"{i+1}. {abstract}\n")

First 5 abstracts:

1. As a regulator of the dynamic balance between immune-activated extracellular ATP 
and immunosuppressive adenosine, CD39 ectonucleotidase impairs the ability of 
immune cells to exert anticancer immunity and plays an important role in the 
immune escape of tumor cells within the tumor microenvironment. In addition, 
CD39 has been studied in cancer patients to evaluate the prognosis, the efficacy 
of immunotherapy (e.g., PD-1 blockade) and the prediction of recurrence. This 
article reviews the importance of CD39 in tumor immunology, summarizes the 
preclinical evidence on targeting CD39 to treat tumors and focuses on the 
potential of CD39 as a biomarker to evaluate the prognosis and the response to 
immune checkpoint inhibitors in tumors.

2. Brain tumors are the most common solid tumor in children and the leading cause 
of cancer-related deaths. Over the last few years, improvements have been made 
in the diagnosis and treatment of children with Central Nervous 
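
An alternative that does not depend on paragraph positions is to request XML instead of plain text (`Entrez.efetch` accepts `retmode="xml"`) and read the `AbstractText` elements of each `PubmedArticle`. The sketch below is a swapped-in technique, not the approach used in this notebook. It parses a hypothetical, heavily trimmed sample with the standard library only, and the function name `extract_abstracts` and the sample content are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample in the PubMed XML layout
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>First title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Part one.</AbstractText>
          <AbstractText Label="RESULTS">Part <i>two</i>.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <ArticleTitle>No abstract here</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_abstracts(xml_text):
    """Return one abstract string per article that has an abstract."""
    root = ET.fromstring(xml_text)
    abstracts = []
    for article in root.iter("PubmedArticle"):
        # Structured abstracts have several AbstractText sections; itertext()
        # also keeps text nested inside inline markup such as <i>
        parts = ["".join(node.itertext()).strip()
                 for node in article.iter("AbstractText")]
        if parts:
            abstracts.append(" ".join(parts))
    return abstracts


print(extract_abstracts(SAMPLE_XML))
```

Articles without an abstract are skipped here; whether to skip them or keep an empty placeholder depends on how the results are matched back to PubMed IDs.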

#### Now that we are set, let's use some NLP techniques to make use of the abstracts!

# Entity Recognition

In [4]:
pip install scispacy==0.4.0

Defaulting to user installation because normal site-packages is not writeable
Collecting scispacy==0.4.0
  Downloading scispacy-0.4.0-py3-none-any.whl (44 kB)
Collecting spacy<3.1.0,>=3.0.0
  Downloading spacy-3.0.9-cp39-cp39-macosx_11_0_arm64.whl (5.9 MB)
Collecting conllu
  Using cached conllu-4.5.3-py2.py3-none-any.whl (16 kB)
Collecting nmslib>=1.7.3.6
  Using cached nmslib-2.1.1.tar.gz (188 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-py3-none-any.whl (126 kB)
Collecting thinc<8.1.0,>=8.0.3
  Downloading thinc-8.0.17-cp39-cp39-macosx_11_0_arm64.whl (586 kB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting cat

In [2]:
import scispacy

ModuleNotFoundError: No module named 'scispacy'

In [2]:
import spacy
from spacy import displacy
import scispacy

NER = spacy.load("en_core_sci_scibert")



ModuleNotFoundError: No module named 'scispacy'

In [None]:
# Run the NER pipeline on the second abstract
doc = NER(abstracts[1])

# Print each recognized entity followed by its label
for ent in doc.ents:
    print(ent.text)
    print(ent.label_)
    print(" ")

the last few years
DATE
 
Central Nervous System
LOC
 
