<a href="https://colab.research.google.com/github/Dlogical23/MedInputClass/blob/main/FactChecker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
Medical Classification Fact-Checker Model

In [2]:
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m61.8 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 MB[0m [31m15.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-m

In [1]:
import spacy
from spacy.matcher import Matcher
import pandas as pd

# Load a pre-trained spaCy model (you can use a medical-specific model like en_core_sci_sm if available)
nlp = spacy.load("en_core_web_md")

# Sample dataset (replace this with your actual medical transcription data)
data = {
    "transcription": [
        "The patient was prescribed 500mg of paracetamol twice a day.",
        "The doctor recommended aspirin for the headache.",
        "The patient has a history of hypertension and diabetes.",
        "Incorrect drug name: xyzabc 100mg daily."
    ]
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Function to extract medical entities (e.g., drugs, conditions)
def extract_medical_entities(text):
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append((ent.text, ent.label_))
    return entities

# Function to clean and preprocess text
def preprocess_text(text):
    # Basic cleaning (you can expand this)
    text = text.lower().strip()
    return text

# Apply preprocessing and entity extraction
df["cleaned_text"] = df["transcription"].apply(preprocess_text)
df["entities"] = df["transcription"].apply(extract_medical_entities)

# Display the processed data
print(df[["transcription", "cleaned_text", "entities"]])

                                       transcription  \
0  The patient was prescribed 500mg of paracetamo...   
1   The doctor recommended aspirin for the headache.   
2  The patient has a history of hypertension and ...   
3           Incorrect drug name: xyzabc 100mg daily.   

                                        cleaned_text  \
0  the patient was prescribed 500mg of paracetamo...   
1   the doctor recommended aspirin for the headache.   
2  the patient has a history of hypertension and ...   
3           incorrect drug name: xyzabc 100mg daily.   

                                           entities  
0                               [(500mg, QUANTITY)]  
1                                                []  
2                                                []  
3  [(xyzabc, ORG), (100mg, PERCENT), (daily, DATE)]  


Now build a labeled dataset for fact-checking. This involves:

1. Defining the fact-checking task: Decide what types of errors you want to detect (e.g., incorrect drug names, dosages, diagnoses).

2. Labeling the data: Create a labeled dataset where each transcription is marked as "correct" or "incorrect" based on expert review or synthetic errors.

3. Building a fact-checking model: Train a model to classify transcriptions as correct or incorrect, or to identify specific errors.

In [2]:
import random

# Sample list of correct and incorrect drug names
correct_drugs = ["paracetamol", "aspirin", "ibuprofen", "metformin"]
incorrect_drugs = ["xyzabc", "fakemed", "wrongdrug", "invalidmed"]

# Function to introduce synthetic errors
def introduce_errors(text, entities):
    for entity in entities:
        if entity[1] == "DRUG":  # Focus on drug entities
            if random.random() < 0.5:  # 50% chance to introduce an error
                incorrect_drug = random.choice(incorrect_drugs)
                text = text.replace(entity[0], incorrect_drug)
    return text

# Apply synthetic error generation
df["text_with_errors"] = df.apply(lambda row: introduce_errors(row["transcription"], row["entities"]), axis=1)
df["label"] = df["text_with_errors"] != df["transcription"]  # Label as correct (False) or incorrect (True)

# Display the updated DataFrame
print(df[["transcription", "text_with_errors", "label"]])

                                       transcription  \
0  The patient was prescribed 500mg of paracetamo...   
1   The doctor recommended aspirin for the headache.   
2  The patient has a history of hypertension and ...   
3           Incorrect drug name: xyzabc 100mg daily.   

                                    text_with_errors  label  
0  The patient was prescribed 500mg of paracetamo...  False  
1   The doctor recommended aspirin for the headache.  False  
2  The patient has a history of hypertension and ...  False  
3           Incorrect drug name: xyzabc 100mg daily.  False  


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Prepare the dataset
texts = df["text_with_errors"].tolist()
labels = df["label"].astype(int).tolist()

# Tokenize the texts
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor(labels)

# Split into training and validation sets
train_inputs, val_inputs, train_labels, val_labels = train_test_split(inputs["input_ids"], labels, test_size=0.2)

# Create a PyTorch dataset
class FactCheckingDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __getitem__(self, idx):
        return {
            "input_ids": self.inputs[idx],
            "labels": self.labels[idx]
        }

    def __len__(self):
        return len(self.labels)

train_dataset = FactCheckingDataset(train_inputs, train_labels)
val_dataset = FactCheckingDataset(val_inputs, val_labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

In [1]:
# Add this at the beginning of your 'ipython-input-3-39553657f713' file
!python -m spacy download en_core_web_md

from transformers import Trainer, TrainingArguments
from transformers import BertTokenizer, BertForSequenceClassification # Import BertForSequenceClassification
from sklearn.model_selection import train_test_split
import torch
import pandas as pd
import spacy
import random

# Load pre-trained BERT tokenizer and model (if not already loaded)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2) # Define model

# Sample list of correct and incorrect drug names # Define correct_drugs and incorrect_drugs for introduce_errors
correct_drugs = ["paracetamol", "aspirin", "ibuprofen", "metformin"]
incorrect_drugs = ["xyzabc", "fakemed", "wrongdrug", "invalidmed"]

# Function to introduce synthetic errors # Define introduce_errors function
def introduce_errors(text, entities):
    for entity in entities:
        if entity[1] == "DRUG":  # Focus on drug entities
            if random.random() < 0.5:  # 50% chance to introduce an error
                incorrect_drug = random.choice(incorrect_drugs)
                text = text.replace(entity[0], incorrect_drug)
    return text

# Function to extract medical entities (e.g., drugs, conditions)
def extract_medical_entities(text):
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append((ent.text, ent.label_))
    return entities

  # Load a pre-trained spaCy model (you can use a medical-specific model like en_core_sci_sm if available) #This line MUST BE HERE
nlp = spacy.load("en_core_web_md")

# Function to clean and preprocess text
def preprocess_text(text):
    # Basic cleaning (you can expand this)
    text = text.lower().strip()
    return

    # Assuming data is defined somewhere before this code # Define data for pd.DataFrame
data = {
    "transcription": [
        "The patient was prescribed 500mg of paracetamol twice a day.",
        "The doctor recommended aspirin for the headache.",
        "The patient has a history of hypertension and diabetes.",
        "Incorrect drug name: xyzabc 100mg daily."
    ]
}

# Convert to a DataFrame
df = pd.DataFrame(data) # Define df here

# Apply preprocessing and entity extraction
df["cleaned_text"] = df["transcription"].apply(preprocess_text)
df["entities"] = df["transcription"].apply(extract_medical_entities)

# Apply synthetic error generation   # Bring in required code from ipython-input-2-7169ec464d41 to create 'text_with_errors' and 'label' columns
df["text_with_errors"] = df.apply(lambda row: introduce_errors(row["transcription"], row["entities"]), axis=1)
df["label"] = df["text_with_errors"] != df["transcription"]  # Label as correct (False) or incorrect (True)

# Prepare the dataset # This was missing and is needed for train_dataset and val_dataset
# Assume df is already defined and contains the data
texts = df["text_with_errors"].tolist()
labels = df["label"].astype(int).tolist()

# Tokenize the texts
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor(labels)

# Split into training and validation sets
train_inputs, val_inputs, train_labels, val_labels = train_test_split(inputs["input_ids"], labels, test_size=0.2)

# Create a PyTorch dataset
class FactCheckingDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __getitem__(self, idx):
        return {
            "input_ids": self.inputs[idx],
            "labels": self.labels[idx]
        }

    def __len__(self):
        return len(self.labels)

train_dataset = FactCheckingDataset(train_inputs, train_labels)
val_dataset = FactCheckingDataset(val_inputs, val_labels)

# Define training arguments  # This is the missing part!!
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Evaluate the model
results = trainer.evaluate()
print(f"Validation accuracy: {results['eval_accuracy']}")
print(f"Validation loss: {results['eval_loss']}")

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 MB[0m [31m16.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_md')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:


Abort: 

In [2]:
# Function to predict if a transcription is correct or incorrect
def predict_fact_check(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return "correct" if predictions == 0 else "incorrect"

# Example usage
transcription = "The patient was prescribed xyzabc 100mg daily."
result = predict_fact_check(transcription)
print(f"Fact-checking result: {result}")

Fact-checking result: correct
