# 📘 02b – Biomedical Entity Processing

This notebook extracts biomedical entities from cleaned abstracts using **SciSpaCy**. Normalization through **UMLS linking** was planned but is not applied in this version.

## Biomedical Entity Processing (without UMLS Linking)

In this notebook, we extract biomedical entities from research abstracts using the `en_core_sci_lg` model from [SciSpaCy](https://allenai.github.io/scispacy/).

Originally, the plan was to also integrate the **UMLS Entity Linker** (`UmlsEntityLinker`), which would allow us to:

- Normalize entities to unified UMLS concept identifiers
- Resolve abbreviations (e.g., `HTN` → `Hypertension`)
- Merge synonyms under the same concept (e.g., `non-small cell lung cancer`, `NSCLC`)
- Enable concept-level analysis and retrieval
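
To make the list above concrete, here is a minimal sketch of what concept-level normalization would provide. It does not call SciSpaCy or the real UMLS: the `TOY_CONCEPTS` table, the `CONCEPT_*` identifiers, and both helper functions are hypothetical stand-ins for a linker's output.

```python
from collections import defaultdict

# Hypothetical lookup: lower-cased surface form -> (placeholder concept ID, preferred name).
# Real UMLS identifiers would come from an entity linker, not from a hand-written dict.
TOY_CONCEPTS = {
    "htn": ("CONCEPT_001", "Hypertension"),
    "hypertension": ("CONCEPT_001", "Hypertension"),
    "nsclc": ("CONCEPT_002", "Non-small cell lung cancer"),
    "non-small cell lung cancer": ("CONCEPT_002", "Non-small cell lung cancer"),
}


def normalize(mention):
    """Return (concept_id, preferred_name) for a mention, or None if it is unknown."""
    return TOY_CONCEPTS.get(mention.strip().lower())


def group_by_concept(mentions):
    """Group surface mentions under their concept ID; unknown mentions are skipped."""
    groups = defaultdict(list)
    for mention in mentions:
        concept = normalize(mention)
        if concept is not None:
            groups[concept[0]].append(mention)
    return dict(groups)


mentions = ["HTN", "NSCLC", "non-small cell lung cancer", "hypertension", "aspirin"]
print(group_by_concept(mentions))
```

In this toy setup, the abbreviation and its long form end up under the same identifier, which is the property that enables concept-level counting and retrieval.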

However, persistent dependency issues, mainly the failure to install `nmslib` in the Colab environment, forced us to skip UMLS linking for now.

### What we achieved:
- Extracted entities directly from the text using a pretrained biomedical model.
- Cleaned and enriched the dataset with these entities.
- Saved the enriched abstracts for further analysis.

The extracted entities are **surface-level mentions**: the exact strings found in the text, not normalized concept identifiers. While less structured than UMLS concepts, they still support downstream work such as frequency analysis, relevance filtering, and potential future retrieval tasks.
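
As an illustration of the frequency analysis mentioned above, the sketch below counts in how many records each mention appears. The `sample_entries` records and the `entity_document_frequency` helper are hypothetical; only the `"entities"` key mirrors what this notebook writes.

```python
from collections import Counter

# Hypothetical records in the same shape this notebook produces:
# each entry carries an "entities" list of surface mentions.
sample_entries = [
    {"title": "A", "entities": ["hypertension", "HTN", "aspirin"]},
    {"title": "B", "entities": ["Hypertension", "NSCLC"]},
    {"title": "C", "entities": ["aspirin", "hypertension"]},
]


def entity_document_frequency(entries):
    """Count in how many entries each lower-cased mention appears."""
    counts = Counter()
    for entry in entries:
        # A set makes each mention count once per entry, even if casing variants repeat.
        counts.update({mention.lower() for mention in entry.get("entities", [])})
    return counts


freq = entity_document_frequency(sample_entries)
print(freq.most_common(3))
```

Note that `htn` and `hypertension` are counted separately here. That is the practical cost of working with surface mentions instead of linked concepts.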

We keep the pipeline modular so that **UMLS linking can be added later** with minimal changes once the installation issues are resolved.
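
One way to picture this modularity is an enrichment function that takes the extractor as a required step and the linker as an optional one. This is a sketch under assumptions, not the notebook's actual code: `enrich_entry`, `toy_extract`, and `toy_link` are hypothetical names, and the toy functions only stand in for the SciSpaCy model and a future UMLS linker.

```python
from typing import Callable, Dict, List, Optional

Extractor = Callable[[str], List[str]]
Linker = Callable[[List[str]], Dict[str, str]]


def enrich_entry(entry: dict, extract: Extractor, link: Optional[Linker] = None) -> dict:
    """Return a copy of `entry` with "entities" and, if a linker is given, "concepts"."""
    enriched = dict(entry)
    enriched["entities"] = extract(entry["abstract"])
    if link is not None:
        enriched["concepts"] = link(enriched["entities"])
    return enriched


def toy_extract(text: str) -> List[str]:
    """Stand-in extractor (illustration only): keeps all-uppercase words."""
    return [word.strip(".,") for word in text.split() if word.isupper()]


def toy_link(entities: List[str]) -> Dict[str, str]:
    """Stand-in linker with placeholder IDs; a real one would return UMLS identifiers."""
    return {entity: f"CONCEPT_{index:03d}" for index, entity in enumerate(entities, start=1)}


entry = {"title": "Example", "abstract": "Patients with HTN and NSCLC were enrolled."}
without_linking = enrich_entry(entry, toy_extract)
with_linking = enrich_entry(entry, toy_extract, toy_link)
print(without_linking["entities"])
print(with_linking["concepts"])
```

With this shape, adding linking later means passing one more argument. The extraction step and the saved `"entities"` field stay unchanged.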


In [None]:
# Run this cell only on Google Colab.
# After it finishes, a runtime restart may be needed so the reinstalled NumPy is picked up.

# Downgrade NumPy to a version that works with spaCy 3.4.4 and scispaCy 0.5.1
!pip install numpy==1.23.5 --force-reinstall -q

# Install spaCy 3.4.4
!pip install spacy==3.4.4 -q

# Install scispaCy 0.5.1 (biomedical NLP extensions for spaCy)
!pip install scispacy==0.5.1 -q

# Download and install the en_core_sci_lg biomedical model
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz -q



In [None]:
# Only if you are using Google Colab and want to retrieve the data from your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import spacy
import en_core_sci_lg
from tqdm import tqdm
import json
import os

# Load the pretrained SciSpaCy biomedical model
nlp = en_core_sci_lg.load()


In [None]:
# Load cleaned abstracts (assumed already preprocessed and cleaned)
with open("/content/drive/MyDrive/biomedical_text_generation/data/cleaned/all_abstracts_cleaned.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Output number of abstracts loaded
print(f"Total abstracts loaded: {len(data)}")



In [None]:
# Initialize list to store processed abstracts with entities
enriched_data = []

# Process every abstract; for faster iteration on a sample, loop over a slice such as data[:200]
for entry in tqdm(data):
    abstract = entry["abstract"]

    # Apply the biomedical NLP pipeline on the abstract
    doc = nlp(abstract)

    # Extract unique entity mentions longer than 2 characters (skips very short, often generic spans);
    # sorting makes the order reproducible across runs
    entities = sorted({ent.text for ent in doc.ents if len(ent.text) > 2})

    # Append the extracted entities to the original entry
    entry["entities"] = entities

    # Add to final enriched dataset
    enriched_data.append(entry)
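
The filtering step in the loop above can also be factored into a small helper that works on plain strings, so it can be tried without loading the model. This is an illustrative sketch: `filter_mentions` and `min_chars` are hypothetical names, and the case-insensitive deduplication is an added assumption that goes beyond what the loop does.

```python
def filter_mentions(mentions, min_chars=3):
    """Drop short mentions and case-insensitive duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for mention in mentions:
        text = mention.strip()
        key = text.lower()
        if len(text) < min_chars or key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


raw = ["EGFR", "egfr", "lung cancer", "pH", " EGFR ", "mutation"]
print(filter_mentions(raw))  # ['EGFR', 'lung cancer', 'mutation']
```

Inside the loop it could be called as `filter_mentions(ent.text for ent in doc.ents)`. The default `min_chars=3` matches the `len(ent.text) > 2` condition.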


In [None]:
# Create the output folder if it doesn't exist
os.makedirs("/content/drive/MyDrive/biomedical_text_generation/data/enriched", exist_ok=True)

# Save the enriched data (with extracted biomedical entities) to a JSON file
with open("/content/drive/MyDrive/biomedical_text_generation/data/enriched/abstracts_with_entities.json", "w", encoding="utf-8") as f:
    json.dump(enriched_data, f, ensure_ascii=False, indent=2)

# Print confirmation message
print("Enriched dataset saved with biomedical entities.")
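
After saving, it can be worth reloading the file to confirm that it round-trips unchanged. The sketch below does this in a temporary directory with one hypothetical record. `save_and_verify` is an illustrative helper, not part of the notebook.

```python
import json
import os
import tempfile

# Hypothetical record; the non-ASCII entity shows the effect of ensure_ascii=False.
records = [{"title": "Example", "abstract": "Text.", "entities": ["β-catenin", "NSCLC"]}]


def save_and_verify(records, path):
    """Write records as UTF-8 JSON, reload them, and check that nothing changed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    if reloaded != records:
        raise ValueError(f"Round-trip mismatch for {path}")
    return reloaded


with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "enriched", "abstracts_with_entities.json")
    reloaded = save_and_verify(records, out_path)
    with open(out_path, encoding="utf-8") as f:
        raw_text = f.read()

print(f"Verified {len(reloaded)} record(s).")
```

Because `ensure_ascii=False` is used, characters such as `β` are stored as-is rather than as `\u` escapes, which keeps the saved file readable.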


In [None]:
import pandas as pd

# Create a DataFrame to preview titles and extracted entities
df = pd.DataFrame(enriched_data)
df[["title", "entities"]].head(10)
