# 03 – Data Scanning & Token Analysis

In this notebook, we perform a **preliminary scan and token-based analysis** of the cleaned dataset in order to:

- Estimate the token length of abstracts and titles,
- Determine appropriate values for `max_input_length` / `max_output_length`,
- Provide visual and statistical insights before constructing the final training datasets.
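
As a rough illustration of the length-estimation goal listed above, the sketch below picks a length cap from a percentile of token counts. It assumes a whitespace split as a stand-in for the real tokenizer; the helper `suggest_max_length` and the sample texts are hypothetical, and actual values depend on the tokenizer used for training.

```python
import math


def count_tokens(text):
    # Assumption: whitespace tokens approximate the real tokenizer's count.
    return len(text.split())


def suggest_max_length(texts, coverage=0.95):
    """Return the smallest length that covers `coverage` of the texts."""
    lengths = sorted(count_tokens(t) for t in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Nearest-rank percentile: position of the value covering the requested share.
    rank = max(1, math.ceil(coverage * len(lengths)))
    return lengths[rank - 1]


# Made-up sample texts, for illustration only.
sample_abstracts = [
    "Lung cancer remains a leading cause of death.",
    "We evaluate a new therapy.",
    "Background: outcomes vary widely across treatment centers and patient groups.",
]
suggested_input_length = suggest_max_length(sample_abstracts, coverage=0.95)
print(f"Suggested max_input_length (whitespace tokens): {suggested_input_length}")
```

A subword tokenizer usually produces more tokens than a whitespace split, so a cap chosen this way is best treated as a lower bound.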


# Entity Analysis (without UMLS Linking)

This notebook performs entity frequency analysis on biomedical abstracts whose entities were extracted with the `en_core_sci_lg` model from SciSpaCy. It counts the extracted biomedical terms and identifies the most common multi-word entities.

## What This Does
- Loads pre-processed abstracts (with extracted entities)
- Counts and filters entities based on word count (e.g., only 2+ word phrases)
- Displays and saves the most frequent multi-word biomedical terms

## What This Does NOT Do
This version does **not** use UMLS entity linking due to technical limitations (`nmslib` build issues on Colab). As a result:
- **Synonyms are not merged** (e.g., `NSCLC` ≠ `non-small cell lung cancer`)
- **Entities are not normalized** to UMLS concept IDs
- **No abbreviation resolution** or semantic types are included

## Future Extension
If UMLS linking is enabled in the future:
- Terms will be linked to UMLS concepts (e.g., `C0007131`)
- Synonyms and abbreviations will be merged into canonical forms
- The analysis will be enriched with metadata (semantic types, vocabulary links, etc.)
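
If the `nmslib` issue is resolved, linking could be enabled roughly as follows. This is a sketch based on SciSpaCy's documented `scispacy_linker` pipe, not code from this project. The helper names `build_linked_pipeline` and `link_entities` are hypothetical, and the imports sit inside the function so the cell can be defined even when SciSpaCy is not installed.

```python
def build_linked_pipeline(model_name="en_core_sci_lg"):
    """Load a SciSpaCy model and attach abbreviation detection and UMLS linking."""
    import spacy
    # Importing these modules registers the pipe factories used below.
    from scispacy.abbreviation import AbbreviationDetector  # noqa: F401
    from scispacy.linking import EntityLinker  # noqa: F401

    nlp = spacy.load(model_name)
    nlp.add_pipe("abbreviation_detector")
    nlp.add_pipe(
        "scispacy_linker",
        config={"resolve_abbreviations": True, "linker_name": "umls"},
    )
    return nlp


def link_entities(nlp, text):
    """Return (span text, CUI, canonical name) for each entity's best candidate."""
    linker = nlp.get_pipe("scispacy_linker")
    linked = []
    for ent in nlp(text).ents:
        if not ent._.kb_ents:
            continue  # no candidate concept was found for this span
        # Each candidate is a (CUI, score) pair; keep the highest-scoring one.
        cui, _score = max(ent._.kb_ents, key=lambda pair: pair[1])
        concept = linker.kb.cui_to_entity[cui]
        linked.append((ent.text, cui, concept.canonical_name))
    return linked
```

Calling `link_entities(build_linked_pipeline(), abstract_text)` would then yield tuples that can be counted by CUI instead of by surface text. The first call downloads a large linker index, so it is worth caching on Drive.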

Despite this limitation, the current entity recognition pipeline is still useful for exploratory analysis and keyword extraction.


## 💡 Example: With vs. Without UMLS Linking

### Without UMLS Linking (Current Behavior)
Entities are extracted as plain text, without normalization:

- `NSCLC`
- `non-small cell lung cancer`
- `Non-Small Cell Lung Carcinoma`
- `lung adenocarcinoma`

Each of these is treated as a **distinct entity**, even though the first three refer to the same medical concept and the fourth is a subtype of it.

---

### With UMLS Linking (Desired Behavior)
Entities are mapped to UMLS concepts, allowing for:

- **Synonym merging**
- **Abbreviation resolution**
- **Concept-level normalization**

For example:

| Text Span                       | Linked UMLS Concept | Concept Name                  |
|---------------------------------|---------------------|-------------------------------|
| `NSCLC`                         | C0007131            | Non-Small Cell Lung Carcinoma |
| `non-small cell lung cancer`    | C0007131            | Non-Small Cell Lung Carcinoma |
| `Non-Small Cell Lung Carcinoma` | C0007131            | Non-Small Cell Lung Carcinoma |

This makes downstream analysis, clustering, and retrieval much more robust and semantically meaningful.
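
To make the effect of synonym merging concrete, the sketch below aggregates surface-form counts by concept ID. The `span_to_cui` lookup is a hand-written stand-in for a real linker's output and the counts are invented for illustration; only the CUI `C0007131` comes from the table above.

```python
from collections import Counter

# Invented counts, for illustration only.
surface_counts = Counter({
    "NSCLC": 40,
    "non-small cell lung cancer": 25,
    "Non-Small Cell Lung Carcinoma": 5,
    "breast cancer": 30,
})

# Hypothetical linker output: surface form -> UMLS concept ID.
span_to_cui = {
    "NSCLC": "C0007131",
    "non-small cell lung cancer": "C0007131",
    "Non-Small Cell Lung Carcinoma": "C0007131",
}


def merge_by_concept(counts, mapping):
    """Sum counts per concept; unlinked spans keep their own text as the key."""
    merged = Counter()
    for span, count in counts.items():
        merged[mapping.get(span, span)] += count
    return merged


concept_counts = merge_by_concept(surface_counts, span_to_cui)
for key, count in concept_counts.most_common():
    print(f"{key}: {count}")
```

With these invented numbers, the three lung cancer variants collapse into a single entry that outranks `breast cancer`, whereas counted separately none of them would.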



In [None]:
# Only needed if you are using Google Colab and want to retrieve the data from your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import json
import pandas as pd

# Load enriched biomedical abstracts (with extracted entities)
with open("/content/drive/MyDrive/biomedical_text_generation/data/enriched/abstracts_with_entities.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Convert to DataFrame for easier handling
df = pd.DataFrame(data)
df[["title", "entities"]].head()


In [None]:
# Flatten all entities from all abstracts into a single list
all_entities = [entity for entry in df["entities"] for entity in entry]
print(f"Total entities collected: {len(all_entities)}")


In [None]:
# Keep only entities with 2 or more words
multi_word_entities = [ent for ent in all_entities if len(ent.split()) >= 2]
print(f"Multi-word entities (2+ words): {len(multi_word_entities)}")


In [None]:
from collections import Counter

# Count frequency of multi-word entities
entity_counter = Counter(multi_word_entities)

# Keep the 350 most frequent entities (cutoff chosen so that each has over 100 appearances)
top_entities = entity_counter.most_common(350)

# Preview top 20
print("Top 20 multi-word entities:")
for entity, count in top_entities[:20]:
    print(f"{entity}: {count}")


In [None]:
import os

# Define output path
output_path = "/content/drive/MyDrive/biomedical_text_generation/data/processed/top_multiword_entities.json"

# Ensure the directory exists
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Save to JSON file (each (entity, count) tuple is written as a two-element list)
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(top_entities, f, ensure_ascii=False, indent=2)

print(f"\nSaved top multi-word entities to: {output_path}")
