# Medical Named Entity Recognition (NER) Project

## Overview

This project implements a medical Named Entity Recognition (NER) system using two approaches:

1. **Custom spaCy Model Training**: Training a custom NER model from scratch using annotated medical text data
2. **Pre-trained Transformer Model**: Using a pre-trained biomedical NER model from Hugging Face

The system is designed to identify and extract medical entities from clinical text, including:
- **Medications**: Drug names and prescriptions (e.g., Aspirin, Metformin, Warfarin)
- **Diseases**: Medical conditions and diagnoses (e.g., diabetes, pneumonia, COPD)
- **Treatments**: Medical procedures and therapies (e.g., surgery, inhaler therapy)

## Project Structure

The code is divided into the following sections:

1. **Data Loading and Exploration** - Uploading and examining the annotated JSON dataset
2. **Data Preprocessing** - Converting JSON annotations to spaCy training format
3. **spaCy Training Data Preparation** - Creating spaCy DocBin format for model training
4. **Model Training** - Training a custom spaCy NER model
5. **Model Inference (spaCy)** - Testing the trained model on sample medical text
6. **Transformer Model Inference** - Using a pre-trained biomedical NER model

## Use Cases

- Automated extraction of medical information from clinical notes
- Medical record processing and analysis
- Drug-disease relationship extraction
- Clinical decision support systems

## Technologies Used

- **spaCy**: For custom NER model training and inference
- **Transformers (Hugging Face)**: For pre-trained biomedical NER
- **Pandas**: For data manipulation
- **PyTorch**: Backend for transformer models
- **Google Colab**: Development environment

# Part 1: Data Loading and Exploration

This section handles uploading the annotated medical dataset and performing initial exploration to understand the data structure.

### 1. File Upload and JSON Loading

- **File Upload**: Uses Google Colab's `files.upload()` to allow users to upload files from their local computer
- **JSON Loading**: Reads the `Corona2.json` file which contains annotated medical text data
- **Data Type Check**: Prints the type of the loaded data to verify it's properly loaded as a dictionary or list

In [1]:
from google.colab import files
import pandas as pd
import json

# Upload file from your computer
uploaded = files.upload()

# Load JSON file (replace with your actual filename after upload)
with open("Corona2.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(type(data))


Saving Corona2.json to Corona2.json
<class 'dict'>


### 2. Library Imports
Imports essential libraries for:
- Data manipulation (numpy, pandas)
- Visualization (matplotlib, seaborn)
- NLP processing (nltk, spacy)
- Progress tracking (tqdm)

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import nltk
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

### 3. File System Exploration
- Walks through the file system to verify the uploaded file location
- Prints all files found in the specified path

In [3]:
import os
# os.walk expects a directory; walking the file path itself yields nothing
for dirname, _, filenames in os.walk("/content"):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### 4. Load JSON as DataFrame
- Loads the JSON file into a pandas DataFrame for easier manipulation
- Displays the first 5 rows to preview the data structure

In [5]:
path = "/content/Corona2.json"

data = pd.read_json(path)
data.head()

|   | examples |
|---|----------|
| 0 | {'id': '18c2f619-f102-452f-ab81-d26f7e283ffe',... |
| 1 | {'id': '487c93e3-0d45-4088-a378-cf3a01c8953d',... |
| 2 | {'id': 'd5056874-895a-4a7f-9e0f-828d414d65d9',... |
| 3 | {'id': '20c792c7-0c4b-42d0-8127-0e04113db384',... |
| 4 | {'id': 'f5359e0d-4d4a-4707-95a3-4c627fc4a83b',... |


### 5. Data Structure Exploration
- **Keys Inspection**: Lists all keys in the first example to understand available fields
- **Content Display**: Shows the actual text content of the first training example
- **Annotations Display**: Examines the structure of annotations (entity labels)

In [6]:
list(data['examples'][0].keys())

['id', 'content', 'metadata', 'annotations', 'classifications']

In [7]:
data['examples'][0]['content']

"While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]"

In [8]:
data['examples'][0]['annotations'][0]

{'id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
 'tag_id': 'c06bd022-6ded-44a5-8d90-f17685bb85a1',
 'end': 371,
 'start': 360,
 'example_id': '18c2f619-f102-452f-ab81-d26f7e283ffe',
 'tag_name': 'Medicine',
 'value': 'Diosmectite',
 'correct': None,
 'human_annotations': [{'timestamp': '2020-03-21T00:24:32.098000Z',
   'annotator_id': 1,
   'tagged_token_id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
   'name': 'Ashpat123',
   'reason': 'exploration'}],
 'model_annotations': []}
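
Before preprocessing, it can help to see how often each label occurs. The sketch below counts `tag_name` values across all examples. The `sample` dictionary is a made-up miniature that only mimics the structure shown above (the `MedicalCondition` spelling is an assumption), and `count_labels` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

# Hypothetical miniature of the dataset structure shown above.
# The real data comes from Corona2.json; these two examples are invented.
sample = {
    "examples": [
        {
            "content": "Loperamide treats diarrhea.",
            "annotations": [
                {"start": 0, "end": 10, "tag_name": "Medicine", "value": "Loperamide"},
                {"start": 18, "end": 26, "tag_name": "MedicalCondition", "value": "diarrhea"},
            ],
        },
        {
            "content": "Aspirin was prescribed.",
            "annotations": [
                {"start": 0, "end": 7, "tag_name": "Medicine", "value": "Aspirin"},
            ],
        },
    ]
}


def count_labels(dataset):
    """Count how often each tag_name occurs across all examples."""
    return Counter(
        annotation["tag_name"]
        for example in dataset["examples"]
        for annotation in example["annotations"]
    )


label_counts = count_labels(sample)
print(label_counts.most_common())
```

Because the function only iterates over `dataset["examples"]`, it should also work when `data` is the DataFrame loaded above, since iterating a pandas Series yields its values.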

# Part 2: Data Preprocessing
This section transforms the JSON-formatted annotations into the format required for spaCy NER model training.

### Converting JSON to spaCy Training Format
**Transformation Process**:
1. **Iterates through all examples** in the dataset
2. **Extracts text content** from each example's `content` field
3. **Processes annotations** by:
   - Getting the character start position (`annotation['start']`)
   - Getting the character end position (`annotation['end']`)
   - Converting tag names to uppercase (`annotation['tag_name'].upper()`)
4. **Creates tuples** in the format: `(start, end, label)`
5. **Stores in dictionary** with keys `text` and `entities`

### Data Validation
**Verification Steps**:
- **Displays entities** from the first training example to verify correct format
- **Extracts text slice** using the character positions to confirm that the annotations align correctly with the actual text
- This helps catch any misalignment issues between annotations and text

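### Why This Format?
The `(start_char, end_char, label)` tuples are an intermediate representation. They hold what spaCy needs to build annotated `Doc` objects in Part 3:
- Plain text strings
- Entity annotations as character offsets plus a label
- Labels in uppercase, which follows the convention of spaCy's built-in labels but is not strictly required

This preprocessing step bridges the gap between the annotated dataset and spaCy's expected input format.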

In [9]:
training_data = [
    {
        'text': example['content'],
        'entities': [
            (annotation['start'], annotation['end'], annotation['tag_name'].upper())
            for annotation in example['annotations']
        ]
    }
    for example in data['examples']
]


In [10]:
training_data[0]['entities']

[(360, 371, 'MEDICINE'),
 (383, 408, 'MEDICINE'),
 (104, 112, 'MEDICALCONDITION'),
 (679, 689, 'MEDICINE'),
 (6, 23, 'MEDICINE'),
 (25, 37, 'MEDICINE'),
 (461, 470, 'MEDICALCONDITION'),
 (577, 589, 'MEDICINE'),
 (853, 865, 'MEDICALCONDITION'),
 (188, 198, 'MEDICINE'),
 (754, 762, 'MEDICALCONDITION'),
 (870, 880, 'MEDICALCONDITION'),
 (823, 833, 'MEDICINE'),
 (852, 853, 'MEDICALCONDITION'),
 (461, 469, 'MEDICALCONDITION'),
 (535, 543, 'MEDICALCONDITION'),
 (692, 704, 'MEDICINE'),
 (563, 571, 'MEDICALCONDITION')]

In [11]:
training_data[0]['text'][563:571]

'diarrhea'
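
The single slice check above can be extended to every annotation. The following is a minimal sketch that assumes each annotation's `value` field holds the annotated text (as in the `Diosmectite` example earlier). `find_misaligned` and the `example` dictionary are hypothetical and for illustration only.

```python
def find_misaligned(example):
    """Return annotations whose character span does not reproduce their 'value'."""
    text = example["content"]
    problems = []
    for annotation in example["annotations"]:
        snippet = text[annotation["start"]:annotation["end"]]
        if snippet != annotation["value"]:
            problems.append(
                (annotation["start"], annotation["end"], annotation["value"], snippet)
            )
    return problems


# Invented example: the second annotation ends one character too early.
example = {
    "content": "Metformin controls diabetes.",
    "annotations": [
        {"start": 0, "end": 9, "tag_name": "Medicine", "value": "Metformin"},
        {"start": 19, "end": 26, "tag_name": "MedicalCondition", "value": "diabetes"},
    ],
}

problems = find_misaligned(example)
print(problems)
```

An empty result means every span reproduces its annotated text. Any reported tuple shows the offsets, the expected value, and what the slice actually returned.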

# Part 3: spaCy Training Data Preparation
This section converts the preprocessed training data into spaCy's binary format (DocBin), which is required for efficient model training.


### 1. Initialize spaCy Components
- **`spacy.blank("en")`**: Creates a blank English language model without any pre-trained components
- **`DocBin()`**: Initializes a container for efficiently storing spaCy Doc objects in binary format

In [12]:
nlp = spacy.blank("en")
doc_bin = DocBin()

### 2. Import Entity Filtering Utility
- Imports `filter_spans`, a utility function that resolves overlapping entity spans
- Ensures that no two entities overlap in the same text; spaCy raises an error if overlapping spans are assigned as entities

In [13]:
from spacy.util import filter_spans

### 3. Process Training Examples
**Step-by-Step Process**:

1. **Loop through training data**: Uses `tqdm` for progress bar visualization

2. **Extract text and labels**: Gets the text content and entity annotations from each example

3. **Create Doc object**: Converts raw text into a spaCy Doc object with tokenization

4. **Process each entity**:
   - **`char_span()`**: Creates a Span object from character positions
   - **`alignment_mode="contract"`**: If tokens don't align perfectly with character positions, contract the span to fit token boundaries
   - **Error handling**: Skips entities that can't be aligned (prints "Skipping entity")
   - **Append valid spans**: Adds successfully created spans to the list

5. **Filter overlapping spans**: `filter_spans` removes overlapping entities, preferring the longest span and, on ties, the one that starts first

6. **Set entities**: Assigns the filtered entities to the Doc object

7. **Add to DocBin**: Stores the processed Doc in the binary container

8. **Save to disk**: Writes the entire DocBin to `train.spacy` file

In [14]:
for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.set_ents(filtered_ents)
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

100%|██████████| 31/31 [00:00<00:00, 525.34it/s]

Skipping entity    (printed 16 times)
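
To make the overlap rule concrete, here is a rough pure-Python sketch of the same idea applied to `(start, end, label)` tuples. It is not spaCy's implementation: `filter_spans` works on token-based `Span` objects, while the hypothetical `filter_overlaps` below works on character offsets. The sample tuples are taken from the first training example shown in Part 2.

```python
def filter_overlaps(entities):
    """Keep non-overlapping (start, end, label) tuples, preferring longer spans."""
    # Longest span first; ties are broken by the earlier start offset.
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ordered:
        # A candidate is kept only if it lies entirely before or after every kept span.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


entities = [
    (461, 470, "MEDICALCONDITION"),
    (461, 469, "MEDICALCONDITION"),  # overlaps the span above and is shorter
    (853, 865, "MEDICALCONDITION"),
    (852, 853, "MEDICALCONDITION"),  # adjacent to the span above, not overlapping
    (563, 571, "MEDICALCONDITION"),
]

filtered = filter_overlaps(entities)
print(filtered)
```

Adjacent spans such as `(852, 853)` and `(853, 865)` share a boundary but no characters, so both survive. Only the shorter of the two spans starting at 461 is dropped.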





# Part 4: Model Training
This section creates a training configuration and trains a custom spaCy NER model using the prepared training data.

### 1. Initialize Training Configuration
**Command Breakdown**:
- **`python -m spacy`**: Runs spaCy as a Python module
- **`init config`**: Creates a training configuration file
- **`config.cfg`**: Output filename for the configuration
- **`--lang en`**: Sets the language to English
- **`--pipeline ner`**: Specifies that only the NER component should be included
- **`--optimize efficiency`**: Optimizes for faster training and inference and a smaller model (alternative: `accuracy`, which favors higher accuracy at the cost of a larger, slower model)

**What it Creates**:
A `config.cfg` file containing:
- Model architecture settings
- Training hyperparameters
- Pipeline component configurations
- Optimizer settings

In [15]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy



### 2. Train the Model

**Command Breakdown**:
- **`spacy train`**: Initiates the training process
- **`config.cfg`**: Uses the configuration file created in step 1
- **`--output ./`**: Saves trained models to the current directory
- **`--paths.train ./train.spacy`**: Specifies the training data file
- **`--paths.dev ./train.spacy`**: Specifies the development/validation data file

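**Note**: In this code, the same file is used for both training and development. The reported scores therefore measure how well the model fits data it has already seen and will overstate performance on new text. For a meaningful evaluation, split the data into separate training and validation sets.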
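
A held-out development set could be produced with a simple shuffled split before the DocBin files are written. The sketch below is one possible approach, not part of the original notebook: `split_examples`, the 80/20 ratio, and the seed are assumptions, and the placeholder list only stands in for `training_data`.

```python
import random


def split_examples(examples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder data with the same shape as training_data (31 examples, as in the log above).
examples = [{"text": f"document {i}", "entities": []} for i in range(31)]

train_examples, dev_examples = split_examples(examples)
print(len(train_examples), len(dev_examples))  # 25 6
```

Each part would then go through the same DocBin conversion as in Part 3 and be saved separately, for example as `train.spacy` and `dev.spacy`, so that `--paths.train` and `--paths.dev` point at different files.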

**Training Process**:
1. **Loads the configuration** and training data
2. **Initializes the model** with random weights
3. **Trains iteratively**:
   - Processes batches of training examples
   - Computes loss (prediction error)
   - Updates model weights via backpropagation
   - Evaluates on development set
4. **Saves checkpoints**:
   - `model-best/`: The model with the best validation performance
   - `model-last/`: The model from the final training iteration

In [16]:
! python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy

ℹ Saving to output directory: .
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    153.29    0.00    0.00    0.00    0.00
  7     200        734.73   3155.19   76.73   79.66   74.02    0.77
 14     400        596.12    731.85   95.69   95.31   96.06    0.96
 22     600        842.19    276.60   96.84   97.22   96.46    0.97
 30     800        177.29    181.67   98.03   98.03   98.03    0.98
 40    1000        172.56    176.14   98.43   98.43   98.43    0.98
 51    1200        793.37    172.28   98.22   98.80   97.64    0.98
 65    1400        187.19    181.08   98.43   98.43   98.43    0.98
 81    1600        162.25    176.52   98.43   98.43   98.

### 3. Load Trained Model
- Loads the best performing model from disk
- Creates a ready-to-use spaCy pipeline for predictions

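**Metrics Explained** (columns of the training log above):
- **E** / **#**: Epoch and cumulative training step
- **LOSS TOK2VEC** / **LOSS NER**: Training loss of each component (lower is better)
- **ENTS_F**: F1-score (harmonic mean of precision and recall)
- **ENTS_P**: Precision (percentage of predicted entities that are correct)
- **ENTS_R**: Recall (percentage of actual entities that were found)
- **SCORE**: Overall score used to select the best model; in this log it tracks ENTS_F on a 0-1 scale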
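
The relationship between these metrics can be worked through on a toy example. The sketch below computes entity-level precision, recall, and F1 by exact match of `(start, end, label)` tuples. The helper name and the spans are invented, and spaCy's own scorer may differ in details.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores based on exact (start, end, label) matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented spans: 3 gold entities, 4 predictions, 2 exact matches.
gold = {(0, 7, "MEDICINE"), (30, 38, "MEDICALCONDITION"), (50, 59, "MEDICINE")}
predicted = {
    (0, 7, "MEDICINE"),
    (30, 38, "MEDICINE"),   # right span, wrong label: counts as an error
    (50, 59, "MEDICINE"),
    (70, 75, "MEDICINE"),   # spurious entity
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.50 0.67 0.57
```

Applying the same F1 formula to the step-200 row of the log (precision 79.66, recall 74.02) gives roughly 76.74, which agrees with the reported 76.73 up to rounding of the displayed values.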

In [17]:
nlp_trained_model = spacy.load("model-best")

# Part 5: Model Inference with spaCy

This section demonstrates how to use the trained custom spaCy model to perform Named Entity Recognition on new medical text.

### 1. Test with Sample Text 

**Process**:
- Passes medical text through the trained model
- The model tokenizes the text and identifies entities
- Returns a `Doc` object containing tokens, entities, and other linguistic features

In [22]:
doc = nlp_trained_model('''
The patient was prescribed Aspirin for their heart condition.
The doctor recommended Ibuprofen to alleviate the patient's headache.
The patient is suffering from diabetes, and they need to take Metformin regularly.
After the surgery, the patient experienced some post-operative complications, including infection.
The patient is currently on a regimen of Lisinopril to manage their high blood pressure.
The antibiotic course for treating the bacterial infection should be completed as prescribed.
The patient's insulin dosage needs to be adjusted to better control their blood sugar levels.
The physician suspects that the patient may have pneumonia and has ordered a chest X-ray.
The patient's cholesterol levels are high, and they have been advised to take Atorvastatin.
The allergy to penicillin was noted in the patient's medical history.
''')
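
Besides visualizing them, the entities can be read programmatically from `doc.ents`. The sketch below shows one way to turn them into plain records. Since `model-best` is not available outside the notebook, a blank pipeline with a rule-based `entity_ruler` is used here purely as a stand-in for the trained model. The patterns, `nlp_demo`, and `entities_as_records` are hypothetical.

```python
import spacy

# Stand-in for the trained pipeline: a blank English pipeline with a
# rule-based entity_ruler. The patterns are invented for illustration.
nlp_demo = spacy.blank("en")
ruler = nlp_demo.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MEDICINE", "pattern": "Aspirin"},
    {"label": "MEDICALCONDITION", "pattern": "diabetes"},
])

demo_doc = nlp_demo("The patient takes Aspirin and has diabetes.")


def entities_as_records(doc):
    """Convert a Doc's entities into dictionaries with text, label, and offsets."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]


records = entities_as_records(demo_doc)
for record in records:
    print(record)
```

The same function should work unchanged on the `doc` returned by `nlp_trained_model`, because it relies only on the standard `Doc.ents` interface.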

### 2. Visualize Entities (Colab)
- Uses spaCy's built-in visualizer to display entities
- `style="ent"`: Renders Named Entities
- `jupyter=True`: Outputs directly in Jupyter/Colab notebooks
- Highlights entities with different colors based on their labels

In [23]:
spacy.displacy.render(doc, style="ent", jupyter=True)

### 3. Load Model from Local Path
- Shows how to load the model from a local Windows path (outside Colab)
- Useful for deploying the model in a different environment

In [1]:
import spacy

model_best = r"E:\NLP\NLP with Sequence Models\LSTM and Named Entity Recognition\Medical NER Application\model-best"
nlp_trained_model = spacy.load(model_best)


### 4. Test with Different Medical Text
- Tests the model with a new set of medical sentences
- Validates the model's ability to generalize to unseen text

In [5]:
doc = nlp_trained_model('''
The patient was prescribed Warfarin to prevent blood clots.
The physician recommended Acetaminophen for managing the patient's fever.
The patient has been diagnosed with chronic obstructive pulmonary disease (COPD) and requires inhaler therapy.
Following the appendectomy, the patient showed signs of mild inflammation.
The patient is on Atorvastatin to control elevated cholesterol levels.
Completing the full course of Amoxicillin is essential for treating the bacterial infection.
The endocrinologist adjusted the patient's Levothyroxine dosage to regulate thyroid function.
A chest CT scan was ordered to investigate suspected pulmonary embolism.
The patient was counseled on diet and exercise to manage their hypertension.
A documented allergy to Sulfa drugs was noted in the medical records.
''')

In [7]:
from spacy import displacy
from IPython.display import display, HTML

# jupyter=False makes render() return the HTML markup as a string
# instead of displaying it directly in the notebook
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(html))


<IPython.core.display.HTML object>

# Part 6: Transformer Model Inference
This section uses a pre-trained biomedical NER model from Hugging Face's Transformers library as an alternative to the custom-trained spaCy model.


### 1. Import Required Libraries
- Imports PyTorch libraries (the backend for Transformers)
- **Note**: Only `torch` is actually needed; `torchvision` and `torchaudio` are imported but not used

In [1]:
import torch 
import torchvision 
import torchaudio 

### 2. Load Pre-trained Model and Tokenizer

**Components**:
- **AutoTokenizer**: Converts text into tokens (subword units) that the model understands
- **AutoModelForTokenClassification**: A transformer model fine-tuned for token classification (NER)
- **Model ID**: `"d4data/biomedical-ner-all"` is a pre-trained model specialized for biomedical text

**What happens behind the scenes**:
1. Downloads the model weights from Hugging Face Hub (cached locally after first download)
2. Loads the tokenizer configuration (vocabulary, special tokens, etc.)
3. Initializes the model architecture and loads pre-trained weights


In [2]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("d4data/biomedical-ner-all")
model = AutoModelForTokenClassification.from_pretrained("d4data/biomedical-ner-all")




### 3. Create NER Pipeline

**Pipeline Configuration**:
- **`"ner"`**: Specifies this is a Named Entity Recognition task
- **`model=model`**: Uses the loaded model
- **`tokenizer=tokenizer`**: Uses the loaded tokenizer
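- **`aggregation_strategy="simple"`**: Groups adjacent tokens that share the same predicted entity into a single result
  - Without aggregation, every subword token is returned separately, so "Warfarin" might come back as ["War", "##far", "##in"]
  - `"simple"` does not guarantee whole words: pieces of one word can still be returned separately when the predicted tags break inside the word (see `ace`, `##tam`, `##inophen` in the output below). The word-level strategies `"first"`, `"average"`, and `"max"` avoid this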

**Optional**: Add `device=0` to use GPU: `pipeline(..., device=0)`


In [None]:
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu

### 4. Perform Entity Recognition

**Process**:
1. Text is tokenized into subwords
2. Tokens are passed through the transformer model
3. Model predicts entity labels for each token
4. Labels are aggregated back into word-level entities
5. Returns a list of dictionaries with entity information

**Fields**:
- **`entity_group`**: The type of entity (e.g. `Medication`, `Disease_disorder`)
- **`score`**: Confidence score (0-1)
- **`word`**: The actual entity text
- **`start`/`end`**: Character positions in the original text

In [None]:
pipe("""
The patient was prescribed Warfarin to prevent blood clots.
The physician recommended Acetaminophen for managing the patient's fever.
The patient has been diagnosed with chronic obstructive pulmonary disease (COPD) and requires inhaler therapy.
Following the appendectomy, the patient showed signs of mild inflammation.
The patient is on Atorvastatin to control elevated cholesterol levels.
Completing the full course of Amoxicillin is essential for treating the bacterial infection.
The endocrinologist adjusted the patient's Levothyroxine dosage to regulate thyroid function.
A chest CT scan was ordered to investigate suspected pulmonary embolism.
The patient was counseled on diet and exercise to manage their hypertension.
A documented allergy to Sulfa drugs was noted in the medical records.
""")

Device set to use cpu


[{'entity_group': 'Medication',
  'score': np.float32(0.99672157),
  'word': 'warfarin',
  'start': 28,
  'end': 36},
 {'entity_group': 'Medication',
  'score': np.float32(0.99865365),
  'word': 'ace',
  'start': 87,
  'end': 90},
 {'entity_group': 'Medication',
  'score': np.float32(0.9869897),
  'word': '##tam',
  'start': 90,
  'end': 93},
 {'entity_group': 'Medication',
  'score': np.float32(0.9743514),
  'word': '##inophen',
  'start': 93,
  'end': 100},
 {'entity_group': 'Disease_disorder',
  'score': np.float32(0.6789561),
  'word': 'pulmonary disease',
  'start': 191,
  'end': 208},
 {'entity_group': 'Disease_disorder',
  'score': np.float32(0.99904937),
  'word': 'cop',
  'start': 210,
  'end': 213},
 {'entity_group': 'Medication',
  'score': np.float32(0.999882),
  'word': 'at',
  'start': 339,
  'end': 341},
 {'entity_group': 'Medication',
  'score': np.float32(0.9997048),
  'word': '##or',
  'start': 341,
  'end': 343},
 {'entity_group': 'Medication',
  'score': np.float32(
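
The output above contains word pieces such as `ace`, `##tam`, and `##inophen` as separate entities. One way to repair this after the fact is to merge pieces that directly continue the previous entity. The sketch below is a hypothetical post-processing helper, not part of the Transformers library. It assumes continuation pieces start with `##`, and it keeps the lowest score of the merged pieces as a conservative choice. Passing `aggregation_strategy="first"` to the pipeline is the library's own way to keep words whole.

```python
def merge_wordpieces(entities):
    """Merge '##' continuation pieces into the preceding entity of the same group."""
    merged = []
    for entity in entities:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and entity["word"].startswith("##")
            and entity["start"] == previous["end"]
            and entity["entity_group"] == previous["entity_group"]
        ):
            previous["word"] += entity["word"][2:]  # drop the '##' marker
            previous["end"] = entity["end"]
            previous["score"] = min(previous["score"], entity["score"])
        else:
            merged.append(dict(entity))  # copy so the input is not modified
    return merged


# First four entities from the output above (scores as plain floats).
raw_entities = [
    {"entity_group": "Medication", "score": 0.99672157, "word": "warfarin", "start": 28, "end": 36},
    {"entity_group": "Medication", "score": 0.99865365, "word": "ace", "start": 87, "end": 90},
    {"entity_group": "Medication", "score": 0.9869897, "word": "##tam", "start": 90, "end": 93},
    {"entity_group": "Medication", "score": 0.9743514, "word": "##inophen", "start": 93, "end": 100},
]

merged_entities = merge_wordpieces(raw_entities)
for entity in merged_entities:
    print(entity["word"], entity["start"], entity["end"])
```

This only joins pieces the pipeline actually returned. A fragment like `cop` for "COPD" cannot be completed this way, because the remaining piece is missing from the output.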

## Advantages of Pre-trained Models

### 1. No Training Required
- Model is already trained on large biomedical corpora
- Can be used immediately without annotated data

### 2. Broad Coverage
The `biomedical-ner-all` model predicts many more entity types than the custom model. The output above shows two of them:
- **`Medication`**: Drug names
- **`Disease_disorder`**: Diseases and conditions

The full label set can be listed with `model.config.id2label`.

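### 3. Strong Contextual Modeling
- Based on a transformer architecture (BERT family)
- Often more accurate than smaller non-transformer models, although results depend on how closely the input resembles the model's training data
- Better at using surrounding context to resolve ambiguity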

## Comparison: spaCy vs Transformers

| Aspect | Custom spaCy Model | Pre-trained Transformer |
|--------|-------------------|------------------------|
| Training | Requires annotated data | Ready to use |
| Speed | Fast inference | Slower, especially on CPU (often more accurate) |
| Customization | Fully customizable | Limited without fine-tuning |
| Entity types | Only what you train | Pre-defined broad set |
| Memory | Small footprint | Large (100s of MBs) |

## Use Cases for the Pre-trained Model

A pre-trained model is best for:
- **Quick prototyping**: Test NER without training
- **Broad entity coverage**: Need many entity types
- **High accuracy**: When precision is critical
- **Limited training data**: Don't have enough annotations

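## Alternative Models

Other popular biomedical transformer checkpoints:
- `dmis-lab/biobert-base-cased-v1.1`
- `allenai/scibert_scivocab_uncased`
- `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`

These are pre-trained language models rather than ready-made NER taggers, so each needs a token-classification head fine-tuned on labeled data before it can be used like `d4data/biomedical-ner-all`. They differ mainly in pre-training corpus: biomedical literature for BioBERT and PubMedBERT, and a broader set of scientific papers for SciBERT.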