<a href="https://colab.research.google.com/github/EkataU/-Recognizing-hand_written/blob/main/patient_health.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import libraries

In [None]:
import pandas as pd
import re

## Dataset: https://huggingface.co/datasets/ncbi/Open-Patients


In [None]:
df = pd.read_json("hf://datasets/ncbi/Open-Patients/Open-Patients.jsonl", lines=True)



In [None]:
print(f"Total records: {len(df)}")
print(df.columns)

Total records: 180142
Index(['_id', 'description'], dtype='object')


In [None]:
sampled_df = df.sample(5)

In [None]:
## descriptions = df["description"].tolist()

In [None]:
sampled_df.head()


| | _id | description |
|---|---|---|
| 49261 | pmc-8084036-1 | A 71-year-old man suddenly lost consciousness ... |
| 75905 | pmc-4067881-1 | Twenty-year-old man with Scimitar syndrome was... |
| 95980 | pmc-2967834-1 | A 31-year-old man was admitted to our departme... |
| 170421 | usmle-3172 | A 26-year-old woman comes to the physician bec... |
| 105960 | pmc-7953974-1 | A 65-year-old man with a known history of hype... |


## Goal: turn this unstructured dataset into a structured one

- Intended for downstream tasks such as information retrieval

To do this, we process the `description` column and extract the following structured fields:

- `index`
- `_id`
- `age`
- `race`
- `gender`
- `health_problem`
- `symptoms`
- `ER_or_not_ER`
- `how_diagnosed`
- `case_summary_keyword`
- `organ_mapped`
- `ontology_info`
- `mapping_method`
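As a sketch of the target schema, the fields above can be written down as a dataclass with one instance per patient description. The class name `PatientRecord`, the field types, and the example values (including the id `example-1`) are assumptions made for illustration; the notebook itself stores these fields as DataFrame columns.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class PatientRecord:
    """Hypothetical container for one structured row; the types are assumed."""
    index: int
    _id: str
    age: Optional[int] = None
    race: Optional[str] = None
    gender: Optional[str] = None
    health_problem: Optional[str] = None
    symptoms: Optional[str] = None
    ER_or_not_ER: Optional[str] = None
    how_diagnosed: Optional[str] = None
    case_summary_keyword: Optional[str] = None
    organ_mapped: Optional[str] = None
    ontology_info: Optional[object] = None
    mapping_method: Optional[str] = None


# Made-up example row; fields that were not extracted stay None
record = PatientRecord(index=0, _id="example-1", age=45, gender="Male")
column_names = [f.name for f in fields(PatientRecord)]
print(column_names)
print(asdict(record))
```

A list of such records can be turned into a DataFrame with `pd.DataFrame([asdict(r) for r in records])`.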


## 1. Set Up PyMedTermino2 for SNOMED CT/UMLS Mapping

In [None]:
 !pip install owlready2

Successfully installed owlready2-0.48


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



In [None]:
!ls /content/drive/MyDrive


 umls-2019AA-metathesaurus.zip


The command lists the contents of the mounted Google Drive; the entry that matters here is the UMLS archive `umls-2019AA-metathesaurus.zip`, which is used below.

### Obtaining UMLS Data

To use the `import_umls` function, you need to download the UMLS Metathesaurus release files from the National Library of Medicine (NLM).

1.  **Register with NLM:** Go to the [UMLS homepage](https://www.nlm.nih.gov/research/umls/index.html) and register for a UTS (UMLS Terminology Services) account.
2.  **Download the Release:** Once registered, you can download the desired UMLS release (e.g., 2019AA), which is distributed as a zip file.
3.  **Make the archive available to Colab:** Place the downloaded zip file (`umls-2019AA-metathesaurus.zip` or similar) in Google Drive, since the code below reads it from `/content/drive/MyDrive`. Alternatively, upload it through the file browser in the left sidebar and adjust the path passed to `import_umls`.
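Before starting the lengthy import, it can help to confirm that the archive is where the notebook expects it. The helper below is a sketch using only the standard library; the function name `check_umls_archive` is hypothetical, and the path is the one used later in this notebook.

```python
import os
import zipfile


def check_umls_archive(path):
    """Return True if `path` exists and looks like a zip archive."""
    if not os.path.isfile(path):
        print(f"Not found: {path}")
        return False
    if not zipfile.is_zipfile(path):
        print(f"Not a zip archive: {path}")
        return False
    size_gb = os.path.getsize(path) / 1e9
    print(f"Found {path} ({size_gb:.2f} GB)")
    return True


# Path assumed from the import cell in this notebook; adjust if the file lives elsewhere
umls_zip = "/content/drive/MyDrive/umls-2019AA-metathesaurus.zip"
archive_ok = check_umls_archive(umls_zip)
```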

## 2. Define Ontology-Based Mapping Function
This function attempts to map a symptom or diagnosis string to a SNOMED CT concept and returns the concept's preferred label together with its semantic types, when available.

In [None]:
## Probable codebase
import os

from owlready2 import *
from owlready2.pymedtermino2 import *
from owlready2.pymedtermino2.umls import *

# Remove the old backend if it exists; run this only if you face an issue
#if os.path.exists("pym.sqlite3"):
    #os.remove("pym.sqlite3")

# Set the backend for Owlready2 (you can change the filename if needed)
default_world.set_backend(filename = "pym.sqlite3")

# Import UMLS - make sure the path matches the location of the zip file
# Adjust the path if the archive is stored somewhere other than Google Drive
try:
    import_umls("/content/drive/MyDrive/umls-2019AA-metathesaurus.zip", terminologies = ["ICD10", "SNOMEDCT_US", "CUI"])
    ##import_umls("umls-2019AA-metathesaurus.zip", terminologies = ["ICD10", "SNOMEDCT_US", "CUI"])
    default_world.save()
    print("UMLS import successful!")
except FileNotFoundError:
    print("UMLS zip file not found. Please ensure 'umls-2019AA-metathesaurus.zip' is present at the path given above.")
except Exception as e:
    print(f"An error occurred during UMLS import: {e}")

# Bind the SNOMED CT terminology; snomed_map() needs this name to exist
PYM = get_ontology("http://PYM/").load()
SNOMEDCT_US = PYM["SNOMEDCT_US"]


def snomed_map(term):
    # Try to find the SNOMED CT concept for the given term (full-text search)
    matches = SNOMEDCT_US.search(term)
    if matches:
        concept = matches[0]
        # Use the first label as the preferred term; fall back to the concept code
        labels = list(getattr(concept, "label", []))
        preferred = labels[0] if labels else concept.name
        # Semantic types are returned only if the concept exposes them
        semantic_types = getattr(concept, "semantic_types", [])
        return preferred, semantic_types
    return None, None

Importing UMLS from /content/drive/MyDrive/umls-2019AA-metathesaurus.zip with Python version 3.11.13 and Owlready version 2-0.48...
  Parsing 2019AA/META/MRSTY.RRF as MRSTY
  Parsing 2019AA/META/MRRANK.RRF as MRRANK
  Parsing 2019AA/META/MRCONSO.RRF as MRCONSO
  Parsing 2019AA/META/MRDEF.RRF as MRDEF
  Parsing 2019AA/META/MRREL.RRF as MRREL
  Parsing 2019AA/META/MRSAT.RRF as MRSAT
Breaking ORIG cycles...
    SNOMEDCT_US : 0 cycles found: 
    ICD10 : 0 cycles found: 
    SRC : 0 cycles found: 
Finalizing only properties and restrictions...
Finalizing CUI - ORIG mapping...
FTS Indexing...
UMLS import successful!



## 3. Define BioBERT/ClinicalBERT Fallback Mapping
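These functions embed the term and each candidate organ label with BioBERT (ClinicalBERT can be swapped in), mean-pool the token embeddings, and return the organ/system label with the highest cosine similarity to the term.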
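The ranking step can be illustrated without downloading a model. The sketch below uses made-up 3-dimensional vectors in place of real BioBERT embeddings (which are 768-dimensional); the vectors and the helper names `cosine_similarity` and `nearest_label` are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_label(vector, candidates):
    """Return the candidate label whose vector is most similar, with its score."""
    scored = {label: cosine_similarity(vector, v) for label, v in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Made-up vectors standing in for label embeddings
label_vectors = {
    "heart": [1.0, 0.0, 0.0],
    "lung": [0.0, 1.0, 0.0],
    "liver": [0.0, 0.0, 1.0],
}
# Made-up vector standing in for the embedding of a cardiac term
term_vector = [0.9, 0.1, 0.2]

best_label, best_score = nearest_label(term_vector, label_vectors)
print(best_label, round(best_score, 3))
```

Because the highest-scoring label is always returned, even a term unrelated to every organ receives some label. A minimum-similarity threshold would be one way to flag such cases.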

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load BioBERT (or ClinicalBERT)
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

organ_labels = ["heart", "lung", "liver", "kidney", "brain", "breast", "systemic", "skin", "blood"]

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=32)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze()

def biobert_map(term):
    term_emb = get_embedding(term)
    sims = []
    for label in organ_labels:
        label_emb = get_embedding(label)
        sim = torch.cosine_similarity(term_emb, label_emb, dim=0)
        sims.append(sim.item())
    best_idx = sims.index(max(sims))
    return organ_labels[best_idx], sims[best_idx]



## 4. Hybrid Mapping Function
Try SNOMED CT first; if not found, use BioBERT/ClinicalBERT.

In [None]:
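def hybrid_organ_mapping(term):
    """Return (label, info, method).

    For method "ontology", info holds the SNOMED CT semantic types;
    for method "biobert", info holds the cosine similarity score.
    """
    preferred, semantic_types = snomed_map(term)
    if preferred:
        # High-confidence mapping
        return preferred, semantic_types, "ontology"
    else:
        # Fallback to BioBERT/ClinicalBERT
        label, score = biobert_map(term)
        return label, score, "biobert"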
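The second element of the returned tuple means different things on the two paths: semantic types from the ontology, a similarity score from BioBERT. One possible way to make that explicit is to return a dictionary with named keys. The sketch below uses stand-in mappers (`fake_snomed_map`, `fake_biobert_map`) with made-up return values so that it runs without UMLS or a model download; the dictionary layout is an assumption, not part of the notebook.

```python
def fake_snomed_map(term):
    # Stand-in for snomed_map(): knows a single made-up entry
    known = {"asthma": ("Asthma", ["Disease or Syndrome"])}
    return known.get(term.lower(), (None, None))


def fake_biobert_map(term):
    # Stand-in for biobert_map(): always returns the same made-up label and score
    return "lung", 0.5


def hybrid_mapping_record(term, ontology_mapper, embedding_mapper):
    """Try the ontology first, then fall back to the embedding-based mapper."""
    preferred, semantic_types = ontology_mapper(term)
    if preferred:
        return {
            "organ_mapped": preferred,
            "semantic_types": semantic_types,
            "similarity": None,
            "mapping_method": "ontology",
        }
    label, score = embedding_mapper(term)
    return {
        "organ_mapped": label,
        "semantic_types": None,
        "similarity": score,
        "mapping_method": "biobert",
    }


hit = hybrid_mapping_record("Asthma", fake_snomed_map, fake_biobert_map)
miss = hybrid_mapping_record("wheezing at night", fake_snomed_map, fake_biobert_map)
print(hit)
print(miss)
```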


## 5. Apply to Your DataFrame
Extract the main health problem or symptom from each description, then map it.

In [None]:
def extract_health_problem(text):
    match = re.search(r'presents? (?:to|with|complaining of|in)\s+(.*?)[\.,;]', text, re.I)
    if match:
        return match.group(1).strip()
    match2 = re.search(r'diagnosis of ([^\.]+)', text, re.I)
    if match2:
        return match2.group(1).strip()
    return None

sampled_df['health_problem'] = sampled_df['description'].apply(extract_health_problem)
sampled_df[['organ_mapped', 'ontology_info', 'mapping_method']] = sampled_df['health_problem'].apply(
    lambda x: pd.Series(hybrid_organ_mapping(x) if pd.notnull(x) else (None, None, None))
)
# The original row labels are the DataFrame index, so they are printed automatically
print(sampled_df[['_id', 'health_problem', 'organ_mapped', 'ontology_info', 'mapping_method']])
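To see what the two regular expressions capture, and where they fall short, the function can be run on a few hand-written sentences. The function body below is copied from the cell above so the block runs on its own; the example sentences are made up for illustration.

```python
import re


def extract_health_problem(text):
    # Same logic as in the notebook cell above
    match = re.search(r'presents? (?:to|with|complaining of|in)\s+(.*?)[\.,;]', text, re.I)
    if match:
        return match.group(1).strip()
    match2 = re.search(r'diagnosis of ([^\.]+)', text, re.I)
    if match2:
        return match2.group(1).strip()
    return None


examples = [
    "A 45-year-old man presents with chest pain and dyspnea.",
    "She was given a diagnosis of type 2 diabetes. Follow-up was uneventful.",
    "A 60-year-old woman presents to the emergency department with chest pain.",
    "A 26-year-old woman comes to the physician because of fatigue.",
]
results = [extract_health_problem(t) for t in examples]
for text, result in zip(examples, results):
    print(repr(result), "<-", text)
```

The third sentence shows that "presents to ..." captures the location along with the complaint, and the fourth shows that descriptions without "present" or "diagnosis of" yield `None`, so many rows may end up without a `health_problem`.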


## Extract age, race, and other fields

In [None]:
def extract_age(text):
    # Matches numeric ages such as "71-year-old"; spelled-out ages ("Twenty-year-old") return None
    match = re.search(r'\b(\d{1,3})[- ]year[- ]old\b', text, re.I)
    return int(match.group(1)) if match else None

def extract_race(text):
    races = ['African-American', 'white', 'Caucasian', 'Asian', 'Hispanic', 'Latino', 'Black', 'Native American']
    for race in races:
        # Word boundaries prevent matches inside longer words (e.g. "Asian" in "Caucasian")
        if re.search(r'\b' + re.escape(race) + r'\b', text, re.I):
            return race
    return None

def extract_gender(text):
    # Use the first gendered noun in the text; the original pattern missed "man" and "boy"
    match = re.search(r'\b(female|woman|girl|male|man|boy)\b', text, re.I)
    if not match:
        return None
    return 'Female' if match.group(1).lower() in ('female', 'woman', 'girl') else 'Male'

# extract_health_problem is defined in section 5 above and reused here

def extract_symptoms(text):
    symptoms_list = ['pain', 'fever', 'cough', 'dyspnea', 'nausea', 'diaphoresis', 'irritability', 'malaise', 'lesion', 'shortness of breath', 'infiltrates', 'mass', 'anemia', 'swelling', 'oozing', 'tenderness']
    found = []
    for sym in symptoms_list:
        if re.search(r'\b' + re.escape(sym) + r'\b', text, re.I):
            found.append(sym)
    return ', '.join(found) if found else None

def extract_er(text):
    # "ER" is matched case-sensitively so that ordinary lowercase words are not picked up
    if re.search(r'\bER\b', text) or re.search(r'\bemergency (?:department|room)\b', text, re.I):
        return 'Yes'
    else:
        return 'No'

def extract_how_diagnosed(text):
    diag_methods = ['x-ray', 'CT scan', 'MRI', 'EKG', 'laboratory tests', 'echocardiogram', 'auscultation', 'examination', 'scan', 'biopsy', 'HbA1c', 'D-dimer']
    found = []
    for method in diag_methods:
        # Case-insensitive match; comparing mixed-case names against lowercased text never matched
        if re.search(r'\b' + re.escape(method) + r'\b', text, re.I):
            found.append(method)
    return ', '.join(found) if found else None

def extract_case_keywords(text):
    stopwords = set(['the', 'and', 'of', 'to', 'with', 'is', 'a', 'in', 'for', 'on', 'by', 'as', 'at', 'has', 'she', 'he', 'was', 'but', 'no', 'or', 'her', 'his', 'from', 'that', 'this', 'have', 'had', 'are', 'not', 'be', 'an', 'which', 'been', 'were', 'it', 'shows'])
    # Keep the three most frequent words of four or more letters that are not stopwords
    words = re.findall(r'\b[a-zA-Z]{4,}\b', text.lower())
    freq = pd.Series(words).value_counts()
    keywords = [w for w in freq.index if w not in stopwords][:3]
    return ', '.join(keywords)
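The symptom extractor relies on word-boundary matching rather than plain substring tests. The sketch below contrasts the two on a made-up sentence; the helper name `find_terms` and the small vocabulary are assumptions for illustration.

```python
import re


def find_terms(text, vocabulary):
    """Return vocabulary entries that occur in `text` as whole words or phrases."""
    return [
        term for term in vocabulary
        if re.search(r'\b' + re.escape(term) + r'\b', text, re.I)
    ]


vocabulary = ['mass', 'pain', 'shortness of breath']
text = "Massive hemoptysis and Shortness of Breath, with no chest pain."

# Whole-word matching ignores "mass" inside "Massive"
whole_word_hits = find_terms(text, vocabulary)
# Plain substring matching also reports "mass"
substring_hits = [term for term in vocabulary if term in text.lower()]

print(whole_word_hits)
print(substring_hits)
```

Note that neither approach handles negation: "no chest pain" still counts as a hit for `pain`, so the extracted symptoms should be read as mentions rather than confirmed findings.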



## Apply to DataFrame

In [None]:
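# Create the output frame first; it shares the row labels of the sample
structured = pd.DataFrame(index=sampled_df.index)
structured['index'] = sampled_df.index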
structured['_id'] = sampled_df['_id']
structured['age'] = sampled_df['description'].apply(extract_age)
structured['race'] = sampled_df['description'].apply(extract_race)
structured['gender'] = sampled_df['description'].apply(extract_gender)
structured['health_problem'] = sampled_df['description'].apply(extract_health_problem)
structured['symptoms'] = sampled_df['description'].apply(extract_symptoms)
structured['ER_or_not_ER'] = sampled_df['description'].apply(extract_er)
structured['how_diagnosed'] = sampled_df['description'].apply(extract_how_diagnosed)
structured['case_summary_keyword'] = sampled_df['description'].apply(extract_case_keywords)

In [None]:
# Organ mapping
organ_map_results = structured['health_problem'].apply(lambda x: hybrid_organ_mapping(x) if pd.notnull(x) else (None, None, None))
structured['organ_mapped'] = organ_map_results.apply(lambda x: x[0])
structured['ontology_info'] = organ_map_results.apply(lambda x: x[1])
structured['mapping_method'] = organ_map_results.apply(lambda x: x[2])

print(structured)