# Medzy
## Overview
This project aims to develop a machine learning model that interprets doctors’ handwriting on prescriptions. By detecting and transcribing hard-to-read handwriting, the model should let patients read their prescriptions independently and repurchase their medication without confusion when they run out.

The model uses Hugging Face's BART (https://huggingface.co/docs/transformers/en/model_doc/bart) to correct OCR errors.

## Get PyTorch Device

### DirectML

In [2]:
import torch
import torch_directml

device = torch_directml.device()

ModuleNotFoundError: No module named 'torch_directml'

### CUDA (fallback to CPU if none)

In [1]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
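
The DirectML cell above stops with `ModuleNotFoundError` when `torch_directml` is not installed. A minimal sketch of a guarded selection follows, assuming the preference order DirectML, then CUDA, then CPU. `pick_backend` is a hypothetical helper introduced for illustration, not part of the project.

```python
import importlib.util


def pick_backend() -> str:
    """Return "directml", "cuda", or "cpu" without raising on missing packages."""
    # find_spec returns None when a top-level module is not installed
    if importlib.util.find_spec("torch_directml") is not None:
        return "directml"
    if importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            return "cuda"
    return "cpu"


backend = pick_backend()
print(f"Selected backend: {backend}")
```

With the backend name known, the actual device can be built with `torch_directml.device()` or `torch.device(...)`, as in the two cells above.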

## Loading the data

In [2]:
import json
# Use a context manager so the file is closed after reading
with open("output.txt", "r") as f:
    texts = json.load(f)

In [3]:
texts

['AcAetaeta',
 'AcAetaeta',
 'AcAeta',
 'Aetaetaeta',
 'AcAeta',
 'AcAeta',
 'Aetaetaeta',
 'AcAetaeta',
 'Aetaetaeta',
 'AcB',
 'Ace',
 'Ace',
 'Ace',
 'Ace',
 'Ace',
 'Acecece',
 'Aceacceacacac',
 'Ace',
 'Ace',
 'Ace',
 'AlrolatArolrolrol',
 'atAlrolalAt',
 'atAlrolArolrolrol',
 'atAlrolArolrolrol',
 'AlrolKrolrolrol',
 'AlrolFlrolrolrol',
 'rolAlatArolrolrol',
 'Arolrolrol',
 'Arolrolrol',
 'AlRrolrol',
 'AxAmod',
 'AtAmod',
 'AxAinin',
 'Amod',
 'Amod',
 'A',
 'Amod',
 'Amod',
 'Amod',
 'Amod',
 'AtAzriin',
 'AtAzriininrocinininzininin',
 'AtAtriininzininininrocinin',
 'AtAtriininzininininrocinin',
 'AtAzriinzinininin',
 'AtMetsinininrocininzininin',
 'AtAzriinininzininzinzininin',
 'Atriininzininininrocinin',
 'AtAtriininzinininin',
 'AtAzriinin',
 'AxinodO',
 'AxAxAxinodAinin',
 'AxinodO',
 'AxinodO',
 'AxAzinin',
 'inAxAxAxodAinin',
 'inAxAxAxAininin',
 'AxAxAxodOin',
 'Axinod',
 'AxAxAxOodinodinin',
 'Azithrocinininrocinrocrocin',
 'Azithrocrocrocinrocinininrocroc',
 'Azithroc

In [4]:
# Getting correct labels
import pandas as pd

test_df = pd.read_csv("./Dataset/Testing/testing_labels.csv", delimiter=",")
valid_labels = test_df["MEDICINE_NAME"].unique()  # All unique medicine names

## Loading the model

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer

# Load the trained model
model = BartForConditionalGeneration.from_pretrained("./model-output/BART").to(device)
tokenizer = BartTokenizer.from_pretrained("./model-output/BART")

In [None]:
from rapidfuzz import process
import re

def reduce_repetitions(word):
    """Reduces repeated patterns like 'Arolrolrol' to 'Arol'."""
    return re.sub(r'(.{2,})\1+', r'\1', word)

def correct_text(ocr_text):
    """Generate corrected text using the fine-tuned BART model."""
    text = reduce_repetitions(ocr_text)

    with torch.no_grad():
        # Tokenize input
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=32).to(device)

        # Generate corrected text
        outputs = model.generate(**inputs, max_length=32, num_return_sequences=1)

    # Decode the corrected text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Find the closest match from the valid labels
    best_match, score, _ = process.extractOne(generated_text, valid_labels)
    return best_match if score > 80 else generated_text  # Use best match if confidence is high
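
To see what the repetition step does on its own, the sketch below runs the same regular expression as `reduce_repetitions` on a few OCR strings taken from the output earlier in this notebook. It uses only the standard library and does not load the model.

```python
import re


def reduce_repetitions(word: str) -> str:
    """Collapse a unit of two or more characters that repeats back to back."""
    return re.sub(r"(.{2,})\1+", r"\1", word)


samples = ["Arolrolrol", "Acecece", "AcAetaeta", "Ace"]
reduced = {s: reduce_repetitions(s) for s in samples}
for original, result in reduced.items():
    print(f"{original} -> {result}")
```

The pattern is case-sensitive and needs a repeated unit of at least two characters, so `AcAetaeta` only reduces to `AcAeta`; the remaining cleanup is left to the BART model and the fuzzy match.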

In [11]:
corrected_texts = []
for text in texts:
    corrected_text = correct_text(text)
    print(f"OCR Output: {text} -> Corrected: {corrected_text}")
    corrected_texts.append(corrected_text)
    

OCR Output: AcAetaeta -> Corrected: Aceta
OCR Output: AcAetaeta -> Corrected: Aceta
OCR Output: AcAeta -> Corrected: Aceta
OCR Output: Aetaetaeta -> Corrected: Aceta
OCR Output: AcAeta -> Corrected: Aceta
OCR Output: AcAeta -> Corrected: Aceta
OCR Output: Aetaetaeta -> Corrected: Aceta
OCR Output: AcAetaeta -> Corrected: Aceta
OCR Output: Aetaetaeta -> Corrected: Aceta
OCR Output: AcB -> Corrected: AcB
OCR Output: Ace -> Corrected: Ace
OCR Output: Ace -> Corrected: Ace
OCR Output: Ace -> Corrected: Ace
OCR Output: Ace -> Corrected: Ace
OCR Output: Ace -> Corrected: Ace
OCR Output: Acecece -> Corrected: Ace
OCR Output: Aceacceacacac -> Corrected: Aacac
OCR Output: Ace -> Corrected: Ace
OCR Output: Ace -> Corrected: Ace
OCR Output: Ace -> Corrected: Ace
OCR Output: AlrolatArolrolrol -> Corrected: AlrolAtrol
OCR Output: atAlrolalAt -> Corrected: AtrolalAt
OCR Output: atAlrolArolrolrol -> Corrected: AlrolArol
OCR Output: atAlrolArolrolrol -> Corrected: AlrolArol
OCR Output: AlrolKrolrolrol
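
Lines such as `AcB -> Corrected: AcB` suggest that no valid label scored above the threshold, so `correct_text` returned the generated text unchanged. The sketch below illustrates that threshold-and-fallback logic. It is an assumption-laden stand-in: it uses `difflib.SequenceMatcher` from the standard library instead of `rapidfuzz`, so its scores will differ from those of the scorer `rapidfuzz` applies by default. `snap_to_label` and the label subset are hypothetical.

```python
from difflib import SequenceMatcher


def snap_to_label(text, labels, threshold=80.0):
    """Return the closest label if its similarity exceeds `threshold`, else `text`."""
    best_label, best_score = text, -1.0
    for label in labels:
        # ratio() is in [0, 1]; scale to 0-100 to mirror the notebook's threshold
        score = 100.0 * SequenceMatcher(None, text, label).ratio()
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score > threshold else text


labels = ["Ace", "Aceta", "Alatrol"]  # hypothetical subset of the valid labels
print(snap_to_label("Acetta", labels))  # close to a label, so it is snapped
print(snap_to_label("Zzz", labels))     # nothing similar, so it is returned as-is
```

Any text that falls through this fallback and is not a valid label ends up as `"unknown"` in the metrics section.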

## Calculating Metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Ground-truth labels as strings, so they compare cleanly with the predictions
y_true = test_df["MEDICINE_NAME"].astype(str)

# All unique true labels, plus "unknown" so sklearn treats it as a valid class
unique_labels = set(y_true)
unique_labels.add("unknown")

# Map any prediction that is not a valid label to "unknown"
cleaned_predictions = [
    pred if pred in unique_labels else "unknown"
    for pred in corrected_texts
]

# Compute metrics (macro-averaged over every label, including "unknown")
accuracy = accuracy_score(y_true, cleaned_predictions)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, cleaned_predictions, labels=list(unique_labels), average="macro", zero_division=0
)

# Print results
print("Model Performance:")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")
print(f"F1-Score: {f1 * 100:.2f}%")

Model Performance:
Accuracy: 45.51%
Precision: 64.45%
Recall: 44.94%
F1-Score: 48.90%
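
One thing to keep in mind when reading these numbers: `"unknown"` is included in `labels`, and assuming it never occurs in the ground truth, it contributes a zero to both macro averages. The pure-Python sketch below reproduces that effect on hypothetical toy data. It is meant to mirror macro averaging with undefined values treated as 0, not to replace the scikit-learn call; `macro_scores` is a hypothetical helper.

```python
from collections import Counter


def macro_scores(y_true, y_pred, labels):
    """Return (accuracy, macro precision, macro recall) over `labels`."""
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    # Count correct predictions per label
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Undefined ratios (zero denominator) count as 0.0
    precisions = [hits[l] / pred_count[l] if pred_count[l] else 0.0 for l in labels]
    recalls = [hits[l] / true_count[l] if true_count[l] else 0.0 for l in labels]
    accuracy = sum(hits.values()) / len(y_true)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


# Hypothetical toy data: one "Ace" sample was mapped to "unknown"
y_true = ["Ace", "Ace", "Aceta", "Aceta"]
y_pred = ["Ace", "unknown", "Aceta", "Aceta"]
acc, prec, rec = macro_scores(y_true, y_pred, ["Ace", "Aceta", "unknown"])
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Here the two real classes have perfect precision, yet the macro precision is only two thirds because the `"unknown"` class scores zero. Leaving `"unknown"` out of `labels` would average over the real classes only, which is a design choice worth stating alongside the reported figures.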


In [13]:
cleaned_predictions

['Aceta',
 'Aceta',
 'Aceta',
 'Aceta',
 'Aceta',
 'Aceta',
 'Aceta',
 'Aceta',
 'Aceta',
 'unknown',
 'Ace',
 'Ace',
 'Ace',
 'Ace',
 'Ace',
 'Ace',
 'unknown',
 'Ace',
 'Ace',
 'Ace',
 'unknown',
 'unknown',
 'unknown',
 'unknown',
 'unknown',
 'unknown',
 'Alatrol',
 'unknown',
 'unknown',
 'unknown',
 'Amodis',
 'unknown',
 'unknown',
 'Amodis',
 'Amodis',
 'Aceta',
 'Amodis',
 'Amodis',
 'Amodis',
 'Amodis',
 'Az',
 'Az',
 'Atrizin',
 'Atrizin',
 'Az',
 'unknown',
 'Az',
 'Atrizin',
 'unknown',
 'Az',
 'unknown',
 'unknown',
 'unknown',
 'unknown',
 'Az',
 'Axodin',
 'unknown',
 'Axodin',
 'unknown',
 'Axodin',
 'Azithrocin',
 'Azithrocin',
 'Azithrocin',
 'Azithrocin',
 'Az',
 'Azithrocin',
 'Azithrocin',
 'Az',
 'Azithrocin',
 'Azithrocin',
 'Azyth',
 'Azyth',
 'Azyth',
 'Azyth',
 'Azyth',
 'Azyth',
 'Azyth',
 'Azyth',
 'Azyth',
 'Azyth',
 'Aceta',
 'Az',
 'Az',
 'Aceta',
 'Az',
 'Leptic',
 'Az',
 'Az',
 'Az',
 'Az',
 'Bacaid',
 'Bacaid',
 'Bacaid',
 'Bacaid',
 'Bacaid',
 'Bacai

In [16]:
test_df["MEDICINE_NAME"][41]

'Atrizin'