# Task 1: Extracting and Counting Distinct Semantic Entities (ADR, Drug, Disease, Symptom)

In this task, our goal is to identify all the **unique medical entities** mentioned in the dataset, namely **Adverse Drug Reactions (ADR)**, **Drugs**, **Diseases**, and **Symptoms**, and then group similar expressions together using semantic clustering.


In [1]:
import os
import re
import json
import string
import spacy
from symspellpy.symspellpy import SymSpell, Verbosity
from collections import defaultdict

# Load spaCy model
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

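# Initialize SymSpell
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# The dictionary path is relative to the working directory. If the file is missing,
# symspellpy only logs an error and the dictionary stays empty, so lookups return
# no suggestions and the text is left uncorrected.
sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", term_index=0, count_index=1)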

# Directories
root_dir = "./cadec"
original_dir = os.path.join(root_dir, "original")
meddra_dir = os.path.join(root_dir, "meddra")
sct_dir = os.path.join(root_dir, "sct")

# Regex for original .ann files
original_pattern = re.compile(r"^T(\d+)\s+(ADR|Drug|Disease|Symptom)\s+[\d; ]+\s+(.+)$")
meddra_pattern = re.compile(r"^TT(\d+)\s+(\d+)\s+[\d; ]+\s+(.+)$")
sct_pattern = re.compile(r"^TT(\d+)\s+(.+?)\s*\|\s*(.+?)\s*\|[\d; ]+\s+(.+)$")

# Store all annotations
annotations = defaultdict(lambda: defaultdict(dict))

2025-07-20 17:22:47,989: E symspellpy.symspellpy] Dictionary file not found at frequency_dictionary_en_82_765.txt.


## Step 1: Extract and Normalize Entities

Each file in the `original/` folder contains lines like:

    T1 Drug 10 15 Aspirin
    T2 Symptom 20 30 severe headache

For each line:
- We extract:
  - **Label type** (ADR, Drug, etc.)
  - **Raw text span**
- Then we normalize the text by:
  - Lowercasing
  - Removing punctuation
  - Correcting spelling with SymSpell
  - Lemmatizing using spaCy  
    (e.g., `headaches` → `headache`)

The goal here is to make the terms **uniform**, so that "headaches" and "headache" are treated the same.
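
As a rough illustration of the extraction and the first two normalization steps, the sketch below parses the two sample lines shown earlier with the same regular expression as the notebook, then applies lowercasing and punctuation removal using only the standard library. The helper names `parse_line` and `basic_normalize` are hypothetical, and hyphens are assumed to become spaces. Spelling correction and lemmatization are left out, so this is not the full `normalize` function.

```python
import re
import string

# Same pattern as the notebook's `original_pattern`.
ORIGINAL_PATTERN = re.compile(r"^T(\d+)\s+(ADR|Drug|Disease|Symptom)\s+[\d; ]+\s+(.+)$")

# Punctuation to delete: everything except apostrophes and hyphens (assumption).
_DELETE = str.maketrans("", "", string.punctuation.replace("'", "").replace("-", ""))


def parse_line(line):
    """Return (entity_id, label, raw_text), or None if the line does not match."""
    match = ORIGINAL_PATTERN.match(line.strip())
    if match is None:
        return None
    tid, label, raw_text = match.groups()
    return f"T{tid}", label, raw_text


def basic_normalize(text):
    """Lowercase, turn hyphens into spaces, drop other punctuation, collapse spaces."""
    text = text.lower().replace("-", " ").translate(_DELETE)
    return " ".join(text.split())


sample_lines = ["T1 Drug 10 15 Aspirin", "T2 Symptom 20 30 severe headache"]
parsed = [parse_line(line) for line in sample_lines]
normalized = [basic_normalize(raw) for _, _, raw in parsed]
print(parsed)
print(normalized)
```

Replacing hyphens before stripping punctuation keeps ranges such as "2-3" as two separate tokens instead of fusing them into "23".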


In [2]:
# Preprocessing function
def normalize(text):
    text = text.lower().strip()
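    # Strip punctuation but keep apostrophes. Note that "-" is part of
    # string.punctuation, so hyphens are deleted by translate() and the trailing
    # replace("-", " ") has no effect (e.g. "2-3" becomes "23").
    text = text.translate(str.maketrans('', '', string.punctuation.replace("'", ""))).replace("-", " ")
    # lookup() is SymSpell's single-term lookup, so a multi-word span is matched
    # as one dictionary entry; lookup_compound() is the multi-word variant.
    corrected = sym_spell.lookup(text, Verbosity.CLOSEST, max_edit_distance=2)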
    if corrected:
        text = corrected[0].term
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])

In [3]:
# Step 1: Extract & Normalize Entities from original/
for filename in os.listdir(original_dir):
    orig_file = os.path.join(original_dir, filename)
    if not filename.endswith(".ann"):
        continue
    with open(orig_file, 'r', encoding='utf-8') as f:
        for line in f:
            if line.startswith("#"):
                continue
            match = original_pattern.match(line.strip())
            if match:
                tid, label, raw_text = match.groups()
                norm_text = normalize(raw_text)
                annotations[filename][f"T{tid}"] = {
                    "label": label,
                    "raw_text": raw_text,
                    "normalized": norm_text
                }

## Step 2: Add MedDRA Codes (for ADR Only)

In the `meddra/` folder:
- We match entities from `original/` that are labeled **ADR**
- Using matching identifiers, we extract:
  - **MedDRA code** (standard medical ID)
  - **Description** from the MedDRA dictionary

This helps us **link ADR mentions to official medical vocabulary**.
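
The linking step can be sketched on in-memory data as follows. The sample lines and their code values are made up for illustration, and their layout is only inferred from the `meddra_pattern` regex. `link_meddra` is a hypothetical helper, not part of the notebook.

```python
import re

# Same pattern as the notebook's `meddra_pattern`.
MEDDRA_PATTERN = re.compile(r"^TT(\d+)\s+(\d+)\s+[\d; ]+\s+(.+)$")


def link_meddra(entities, meddra_lines):
    """Attach a MedDRA code to each entity whose T-id matches a TT-id."""
    for line in meddra_lines:
        match = MEDDRA_PATTERN.match(line.strip())
        if match is None:
            continue
        num, code, text = match.groups()
        tid = f"T{num}"  # "TT1" in meddra/ refers to "T1" in original/
        if tid in entities:
            entities[tid]["meddra"] = {"code": code, "text": text}
    return entities


# Toy data: one ADR entity and two made-up MedDRA lines (codes are placeholders).
entities = {"T1": {"label": "ADR", "raw_text": "headache"}}
meddra_lines = ["TT1 12345678 10 18 headache", "TT9 87654321 30 34 rash"]
linked = link_meddra(entities, meddra_lines)
print(linked)
```

Lines whose identifier has no counterpart in `original/` (here `TT9`) are skipped rather than creating new entities.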


In [4]:
# Step 2: Link MedDRA codes (ADR only)
for filename in os.listdir(meddra_dir):
    meddra_file = os.path.join(meddra_dir, filename)
    if not filename.endswith(".ann"):
        continue
    with open(meddra_file, 'r', encoding='utf-8') as f:
        for line in f:
            match = meddra_pattern.match(line.strip())
            if match:
                tid, code, text = match.groups()
                tid = f"T{tid}"
                if tid in annotations[filename]:
                    annotations[filename][tid]["meddra"] = {
                        "code": code,
                        "text": text
                    }


## Step 3: Add SNOMED CT Codes

In the `sct/` folder:
- We again match entity IDs to extract:
  - **SNOMED CT code**
  - **Concept description**

This step helps enrich entities across all label types, not just ADR.
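
To show what the SNOMED CT regex captures, here is a minimal sketch that applies the notebook's `sct_pattern` to a single line. The sample line is illustrative only: its layout is inferred from the regex, and the code and concept values are assumptions rather than data taken from the `sct/` folder. `parse_sct_line` is a hypothetical helper.

```python
import re

# Same pattern as the notebook's `sct_pattern`.
SCT_PATTERN = re.compile(r"^TT(\d+)\s+(.+?)\s*\|\s*(.+?)\s*\|[\d; ]+\s+(.+)$")


def parse_sct_line(line):
    """Return (entity_id, sct_info), or None if the line does not match."""
    match = SCT_PATTERN.match(line.strip())
    if match is None:
        return None
    num, code, concept, text = match.groups()
    return f"T{num}", {"code": code, "concept": concept, "text": text}


# Illustrative line: "<id> <code> | <concept> | <offsets> <text>".
sample = "TT2 25064002 | Headache | 20 35 severe headache"
tid, sct = parse_sct_line(sample)
print(tid, sct)
```

A line without the two `|` separators returns `None`, which mirrors the notebook silently skipping lines that do not match.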


In [5]:
# Step 3: Link SNOMED CT codes
for filename in os.listdir(sct_dir):
    sct_file = os.path.join(sct_dir, filename)
    if not filename.endswith(".ann"):
        continue
    with open(sct_file, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            match = sct_pattern.match(line.strip())
            if match:
                tid, code, concept, text = match.groups()
                tid = f"T{tid}"
                if tid in annotations[filename]:
                    annotations[filename][tid]["sct"] = {
                        "code": code,
                        "concept": concept,
                        "text": text
                    }


In [6]:

# Step 5–6–7: Save all linked normalized entities
output_list = []
for filename, ents in annotations.items():
    for tid, info in ents.items():
        out = {
            "file": filename,
            "id": tid,
            "label": info.get("label"),
            "raw_text": info.get("raw_text"),
            "normalized": info.get("normalized"),
            "meddra": info.get("meddra", None),
            "sct": info.get("sct", None)
        }
        output_list.append(out)

In [7]:
with open("linked_entities.json", "w", encoding="utf-8") as f:
    json.dump(output_list, f, indent=2, ensure_ascii=False)

print(f"✅ Linked entities saved to 'linked_entities.json' with {len(output_list)} total entries.")


✅ Linked entities saved to 'linked_entities.json' with 8676 total entries.


# Semantic Clustering of Normalized Entities
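Different mentions can mean the same thing. For example:

- **mild headaches**
- **severe headache**
- **headache**

We use a cosine similarity threshold of 0.75 to decide if two terms are similar enough to be clustered. In the code this becomes a cosine distance threshold of 1 - 0.75 = 0.25 for average-linkage agglomerative clustering, which is run separately for each label.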
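
The relationship between the similarity threshold and the distance threshold can be sketched with plain Python. The 2-D vectors below are made up and merely stand in for sentence embeddings, and `similar_enough` is a hypothetical helper. Note that the actual clustering compares the average distance between whole clusters (average linkage), not single pairs, so this only illustrates the pairwise case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_enough(u, v, threshold=0.75):
    """True if the cosine distance (1 - similarity) is below 1 - threshold."""
    distance = 1.0 - cosine_similarity(u, v)
    return distance < 1.0 - threshold


# Made-up vectors standing in for embeddings of three terms.
headache = [1.0, 0.0]
severe_headache = [0.8, 0.6]
nausea = [0.0, 1.0]

print(similar_enough(headache, severe_headache))  # similarity is about 0.8
print(similar_enough(headache, nausea))           # similarity is 0.0
```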

In [12]:
import json
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load JSON
with open("linked_entities.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Group normalized entities per label
label_to_entities = defaultdict(set)
for entry in data:
    label = entry.get("label")
    norm = entry.get("normalized")
    if label and norm:
        label_to_entities[label].add(norm)

# Load embedding model
# all-MiniLM-L6-v2 is a public model, so no access token is needed
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Cluster function
def cluster_entities(label, entities, threshold=0.75):
    print(f"Clustering for: {label} ({len(entities)} terms)")

    entity_list = list(entities)
    embeddings = model.encode(entity_list)

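    clustering = AgglomerativeClustering(
        n_clusters=None,                    # let the threshold decide the number of clusters
        distance_threshold=1 - threshold,   # cosine distance = 1 - cosine similarity
        metric='cosine',                    # scikit-learn >= 1.2; older versions call this `affinity`
        linkage='average'
    )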
    clustering.fit(embeddings)

    clusters = defaultdict(list)
    for idx, cluster_id in enumerate(clustering.labels_):
        clusters[str(cluster_id)].append(entity_list[idx])  # Convert key to str

    return clusters


# Run clustering per label
all_clusters = {}
for label, entity_set in label_to_entities.items():
    clusters = cluster_entities(label, entity_set, threshold=0.75)
    all_clusters[label] = clusters

# Save results
with open("entity_clusters.json", "w", encoding="utf-8") as f:
    json.dump(all_clusters, f, indent=2, ensure_ascii=False)

print("✅ Clustering complete. Saved to 'entity_clusters.json'")




Clustering for: ADR (3252 terms)
Clustering for: Drug (308 terms)
Clustering for: Disease (160 terms)
Clustering for: Symptom (144 terms)
✅ Clustering complete. Saved to 'entity_clusters.json'


In [13]:
import json

# Load clustered entities
with open("entity_clusters.json", "r", encoding="utf-8") as f:
    clustered = json.load(f)

# Print and save
with open("clustered_entities_readable.txt", "w", encoding="utf-8") as f:
    for label in clustered:
        cluster_dict = clustered[label]
        num_clusters = len(cluster_dict)

        print(f"\nLabel: {label} — {num_clusters} distinct semantic entities")
        f.write(f"\nLabel: {label} — {num_clusters} distinct semantic entities\n")

        for cluster_id, terms in sorted(cluster_dict.items(), key=lambda x: -len(x[1])):  # sort by cluster size
            print(f"  Cluster {cluster_id} ({len(terms)} terms):")
            f.write(f"  Cluster {cluster_id} ({len(terms)} terms):\n")
            for t in terms:
                print(f"    - {t}")
                f.write(f"    - {t}\n")

print("✅ Task 1 (Semantic Distinct Count) completed. Output saved to 'clustered_entities_readable.txt'")



Label: ADR — 1593 distinct semantic entities
  Cluster 47 (48 terms):
    - extreme pain in both shoulder
    - joint pain in shoulder
    - sore shoulder
    - severe pain in the muscle in the shoulder area
    - pain in the back in shoulder muscle
    - severe pain in left shoulder
    - joint pain in my shoulder
    - soreness in shoulder
    - pain in the shoulder
    - muscle pain in the shoulder
    - shoulder muscle pain
    - muscle pain in shoulder
    - muscular soreness in arm and shoulder
    - horrible pain in both shoulder
    - very bad pain in shoulder
    - severe pain shoulder
    - terrible pain in shoulder
    - severe intense left arm and shoulder pain
    - excruciate pain in shoulder
    - ache and pain in both shoulder
    - pain in my arm and shoulder
    - pain in shoulder joint
    - ache pain across my shoulder
    - constant pain shoulder
    - ache in shoulder
    - severe muscle pain in shoulder
    - shoulder pain be unbearable
    - muscle pain relate 

In [14]:
import json
from collections import defaultdict

# Load the linked JSON output
with open("linked_entities.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Group normalized entities by label
label_to_entities = defaultdict(set)

for entry in data:
    label = entry.get("label")
    normalized = entry.get("normalized")
    if label and normalized:
        label_to_entities[label].add(normalized)

# Output results
with open("task1_output_normalized.txt", "w", encoding="utf-8") as f:
    for label in sorted(label_to_entities.keys()):
        entities = sorted(label_to_entities[label])
        print(f"\n{label} Entities ({len(entities)} total):")
        f.write(f"\n{label} Entities ({len(entities)} total):\n")
        for ent in entities:
            print(f"  - {ent}")
            f.write(f"  - {ent}\n")
        print(f"Total distinct {label} entities: {len(entities)}")
        f.write(f"Total distinct {label} entities: {len(entities)}\n")

print("✅ Task 1 completed. Output saved to 'task1_output_normalized.txt'")



ADR Entities (3252 total):
  -   fluey feeling in upper arm leg
  - ' scare ' feeling
  - 23 period a month instead of once a month
  - 56 time at night to pee normally 1 or 2
  - I be 52 and feel 82
  - I be eighty year old
  - I be up all night in bathroom
  - I do not feel right
  - I feel like I always have the flu
  - I feel like I be go to pass out
  - I feel like I be in a fog
  - I may need to go to the restroom but when I do I can not
  - I simply could not walk
  - I think I be go to die
  - I walk like an old person
  - a lot grouchi
  - abdominal cramp
  - abdominal cramp and pain
  - abdominal cramping
  - abdominal discomfort
  - abdominal distention   feel full
  - abdominal flu symptom
  - abdominal gas
  - abdominal pain
  - abdominal pressure
  - abdominal problem
  - abdominal rash
  - ability to concentrate
  - abnoraml dream
  - abnormal dream
  - abnormal liver function
  - abnormal uterine bleeding
  - abominal cramp
  - absent mindendness
  - absentminde
  - ac