## PDF Text Extraction

In [2]:
!apt-get install -y poppler-utils tesseract-ocr
!pip install pdf2image pytesseract pillow pdfplumber PyMuPDF opencv-python numpy


### Using Tesseract

In [3]:
import pytesseract
from pdf2image import convert_from_path
import os

In [4]:
def extract_text_from_pdf_OCR(pdf_path):
    # Render each PDF page to an image at 500 DPI
    pages = convert_from_path(pdf_path, dpi=500)

    # Extract text from each page image using Tesseract OCR
    text_data = ''
    for page in pages:
        text = pytesseract.image_to_string(page)
        text_data += text + '\n\n'

    return text_data

In [None]:
# pdf_path = "/content/data/49pageText_ocr.pdf"
# text_data = extract_text_from_pdf_OCR(pdf_path)

In [None]:
# print(text_data)

### Using pdfplumber

In [5]:
import pdfplumber

In [7]:
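def extract_text_from_pdf_plumber(pdf_path):
    text_data = ''
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for a page with no text layer,
            # so fall back to an empty string before concatenating
            text = page.extract_text() or ''
            text_data += text + '\n\n'
    return text_data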

In [None]:
# pdf_path_1 = "/content/data/v2_55204-3.pdf"
# extract_text_pdfplumber = extract_text_from_pdf_plumber(pdf_path_1)

In [None]:
# print(extract_text_pdfplumber)

### Extract text data from any kind of PDF

In [8]:
def is_searchable_pdf(pdf_path):
    """Detect if the PDF is searchable or scanned"""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            if page.extract_text():
                return True  # Searchable PDF
    return False
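
`is_searchable_pdf` returns `True` as soon as any page yields any text, so a scanned PDF that carries only a short embedded text layer (a stamp or a header, for example) would be routed to pdfplumber rather than OCR. A minimal sketch of a stricter check follows, assuming the per-page strings have already been collected. The helper name `looks_searchable` and both thresholds are hypothetical and are not part of the notebook.

```python
def looks_searchable(page_texts, min_chars=50, min_fraction=0.5):
    """Return True when enough pages carry a meaningful amount of text.

    page_texts: one string (or None) per page, e.g. the values returned
    by pdfplumber's page.extract_text().
    min_chars and min_fraction are illustrative thresholds, not tuned values.
    """
    if not page_texts:
        return False
    # Count pages whose stripped text reaches the minimum length.
    pages_with_text = sum(
        1 for t in page_texts if t and len(t.strip()) >= min_chars
    )
    return pages_with_text / len(page_texts) >= min_fraction
```

With pdfplumber, the input could be built as `[page.extract_text() for page in pdf.pages]` inside the `with pdfplumber.open(...)` block.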

In [None]:
# pdf_path = "/content/data/49pageText_ocr.pdf"

In [None]:
# is_searchable_pdf(pdf_path)

In [9]:
def extract_text_from_pdf(pdf_path):
  if is_searchable_pdf(pdf_path):
    extract_text = extract_text_from_pdf_plumber(pdf_path)
  else:
    extract_text = extract_text_from_pdf_OCR(pdf_path)
  return extract_text

In [None]:
#  extracted_text = extract_text_from_pdf(pdf_path)

In [None]:
# print(extracted_text)

[Chart][Andrea Quintanilla][67374] [12/15/2017][Page 1 of 4]
History and Physical
Patient Name: Andrea Quintanilla Visit Date: June 27, 2017
Patient ID: 67374 Provider: Benjamin Leshin, MD
Sex: Female Location: LA Pain Management
Birthdate: May 14, 1968 Location Address: 1400 S Grand Ave Suite 707
Los Angeles, CA 900153048
Location Phone: (213) 839-1119
Chief Complaint
• Neck pain
• Low back pain
History Of Present Illness
Andrea Quintanilla is a 49 year old female seen in pain management consultation at the request of her physician,
Osvaldo Cuevas, for back pain and neck pain. The neck pain developed gradually several years ago. It is 10/10 in
severity, and has a sharp quality and radiates into the head. The pain severity is 10/10 The pain has been present for
4 years It is made worse by prolonged sitting and standing and made better by "nothing". The pain is described as
feeling like aching and sharp. The pain radiates down the le~ leg but not past the knee.
She denies any additional

## Preprocessing

#### Remove Unnecessary Part

In [10]:
import re

In [12]:
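def clean_text(text):
    #text = re.sub(r'\b\d{4}\b\s', ' ', text)
    # Replace every character except letters, digits, '.', ',', '/' and whitespace with a space
    text = re.sub(r'[^A-Za-z0-9.,/\s]', ' ', text)
    # Drop page markers such as "Page 1 of 4" or "Page 1"
    text = re.sub(r'(Page\s\d+\sof\s\d+)|(Page\s\d+)', ' ', text)
    # Drop the signature-validation footer
    text = re.sub(r'Digital Signature Validated', ' ', text)
    #text = re.sub(r'\s+', ' ', text)
    return text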
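
The cleaned output still contains runs of spaces wherever punctuation was replaced. The commented-out `\s+` substitution would remove them, but `\s` also matches newlines, and the line-based section extraction later in the notebook depends on those line breaks. A minimal sketch that collapses spaces and tabs within each line while keeping the line breaks is shown here; `collapse_spaces` is a hypothetical helper, not part of the notebook.

```python
import re

def collapse_spaces(text):
    """Collapse runs of spaces/tabs within each line but keep line breaks."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)

sample = " Chart  Andrea Quintanilla  67374   12/15/2017    \nHistory and Physical"
print(collapse_spaces(sample))
# Chart Andrea Quintanilla 67374 12/15/2017
# History and Physical
```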

In [None]:
# print(clean_text(extracted_text))

 Chart  Andrea Quintanilla  67374   12/15/2017    
History and Physical
Patient Name  Andrea Quintanilla Visit Date  June 27, 2017
Patient ID  67374 Provider  Benjamin Leshin, MD
Sex  Female Location  LA Pain Management
Birthdate  May 14, 1968 Location Address  1400 S Grand Ave Suite 707
Los Angeles, CA 900153048
Location Phone   213  839 1119
Chief Complaint
  Neck pain
  Low back pain
History Of Present Illness
Andrea Quintanilla is a 49 year old female seen in pain management consultation at the request of her physician,
Osvaldo Cuevas, for back pain and neck pain. The neck pain developed gradually several years ago. It is 10/10 in
severity, and has a sharp quality and radiates into the head. The pain severity is 10/10 The pain has been present for
4 years It is made worse by prolonged sitting and standing and made better by  nothing . The pain is described as
feeling like aching and sharp. The pain radiates down the le  leg but not past the knee.
She denies any additional symptoms.

#### Extract the data

In [None]:
#import re

def extract_sections(text):
    """Automatically structures extracted text into key sections."""
    sections = {
        "Patient Info": "",
        "Chief Complaint": "",
        "History of Present Illness": "",
        "Diagnosis": "",
        "Treatment": "",
        "Medications": "",
        "Physical Examination": "",
        "Recommendations": ""
    }

    # Define regex patterns for section headers
    patterns = {
        "Patient Info": r"(Patient Name|Patient ID|DOB|Sex|Age|Visit Date|Date of Service|Date Of Birth|Social History|Patient Status).*?",
        "Chief Complaint": r"(Chief Complaint|Reason for Visit).*?",
        "History of Present Illness": r"(History Of Present Illness|Progress Note).*?",
        "Diagnosis": r"(Assessment|Diagnosis).*?",
        "Treatment": r"(Treatment Plan|Orders|Procedures).*?",
        "Medications": r"(Medications|Drugs|Prescriptions|Prescribed Medications).*?",
        "Physical Examination": r"(Physical Examination).*?",
        "Recommendations": r"(Recommendations|Follow-up Instructions|Instructions).*?"
    }

    current_section = None
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue

        for section, pattern in patterns.items():
            if re.search(pattern, line, re.IGNORECASE):
                current_section = section
                break

        if current_section:
            sections[current_section] += line + " "

    return sections
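
One thing to watch in `extract_sections`: the patterns are searched anywhere in the line and case-insensitively, so a short keyword such as `Age` also matches inside ordinary words like "management". This is consistent with the output below, where the History of Present Illness narrative ends up under `Patient Info`. The sketch below contrasts the notebook's style of pattern with a hypothetical stricter variant that requires the keyword at the start of the line and a word boundary after it. The shortened alternative list is for illustration only.

```python
import re

# A shortened version of the "Patient Info" pattern, in the notebook's style.
loose = r"(Patient Name|Patient ID|DOB|Sex|Age)"
# Hypothetical stricter variant: keyword at line start, followed by a word boundary.
strict = r"^(Patient Name|Patient ID|DOB|Sex|Age)\b"

narrative = "Andrea Quintanilla is a 49 year old female seen in pain management consultation"
header = "Sex  Female Location  LA Pain Management"

print(bool(re.search(loose, narrative, re.IGNORECASE)))   # True: "age" inside "management"
print(bool(re.search(strict, narrative, re.IGNORECASE)))  # False
print(bool(re.search(strict, header, re.IGNORECASE)))     # True
```

Anchoring at the line start assumes that section headers always begin a line, which holds for the sample report shown here but may not hold for other layouts.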


In [None]:
#  extracted_sections = extract_sections(clean_text(extracted_text))

In [None]:
# type(extracted_sections)

dict

In [None]:
# extracted_sections

{'Patient Info': 'Patient Name  Andrea Quintanilla Visit Date  June 27, 2017 Patient ID  67374 Provider  Benjamin Leshin, MD Sex  Female Location  LA Pain Management Birthdate  May 14, 1968 Location Address  1400 S Grand Ave Suite 707 Los Angeles, CA 900153048 Location Phone   213  839 1119 Andrea Quintanilla is a 49 year old female seen in pain management consultation at the request of her physician, Osvaldo Cuevas, for back pain and neck pain. The neck pain developed gradually several years ago. It is 10/10 in severity, and has a sharp quality and radiates into the head. The pain severity is 10/10 The pain has been present for 4 years It is made worse by prolonged sitting and standing and made better by  nothing . The pain is described as feeling like aching and sharp. The pain radiates down the le  leg but not past the knee. She denies any additional symptoms. The patient has no prior history of neck or back surgery. RECENT INTERVENTIONS She has been previously treated with physical

In [None]:
# extracted_sections['Patient Info']

'Patient Name  Andrea Quintanilla Visit Date  June 27, 2017 Patient ID  67374 Provider  Benjamin Leshin, MD Sex  Female Location  LA Pain Management Birthdate  May 14, 1968 Location Address  1400 S Grand Ave Suite 707 Los Angeles, CA 900153048 Location Phone   213  839 1119 Andrea Quintanilla is a 49 year old female seen in pain management consultation at the request of her physician, Osvaldo Cuevas, for back pain and neck pain. The neck pain developed gradually several years ago. It is 10/10 in severity, and has a sharp quality and radiates into the head. The pain severity is 10/10 The pain has been present for 4 years It is made worse by prolonged sitting and standing and made better by  nothing . The pain is described as feeling like aching and sharp. The pain radiates down the le  leg but not past the knee. She denies any additional symptoms. The patient has no prior history of neck or back surgery. RECENT INTERVENTIONS She has been previously treated with physical therapy and chir

#### Summarization

In [14]:
!pip install google-generativeai



In [None]:
API_KEY = "*********************************"

In [16]:
import google.generativeai as genai
genai.configure(api_key=API_KEY)

In [17]:
def summarize_medical_report(medical_report):
    model = genai.GenerativeModel("gemini-pro")  # Using Gemini-Pro model
    summary_dict = {}

    for section, content in medical_report.items():
        prompt = f"""
        Summarize the following medical report section in simple and clear language :

        **Section**: {section}
        **Content**: {content}

        Provide a concise summary.
        """

        response = model.generate_content(prompt)
        summary_dict[section] = response.text  # Store summarized content

    return summary_dict
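
`summarize_medical_report` sends every section to the model, including sections for which nothing was extracted. In some of the runs further down, the model describes what a "Chief Complaint" section is instead of summarizing one, which may indicate that the section had little or no content. A minimal sketch of a pre-filter is shown here; the helper name `non_empty_sections` and the `min_chars` threshold are hypothetical.

```python
def non_empty_sections(sections, min_chars=20):
    """Keep only sections whose stripped content has at least min_chars characters."""
    return {
        name: text
        for name, text in sections.items()
        if len(text.strip()) >= min_chars
    }

sample_sections = {
    "Chief Complaint": "Chief Complaint ",  # header only, nothing to summarize
    "Diagnosis": "",
    "History of Present Illness": "The patient reports neck pain for four years.",
}
print(list(non_empty_sections(sample_sections)))
# ['History of Present Illness']
```

The filtered dictionary could then be passed to `summarize_medical_report` in place of the full one, at the cost of dropping very short but genuine sections.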


In [None]:
# summary_report=summarize_medical_report(extracted_sections)

In [None]:
# summary_report

{'Patient Info': 'Andrea Quintanilla, a 49-year-old female, presented with severe neck and back pain for four years, radiating down the left leg. Physical therapy and chiropractic treatments have been ineffective. Examination revealed sharp, aching pain in the neck radiating to the head, worse with prolonged sitting or standing. Back pain was also described as sharp and aching, radiating down the left leg but not past the knee. No neurological deficits were noted. She has a history of cervical disc herniation, idiopathic thrombocytopenia purpura, and lumbosacral disc herniation. She had previously tried unsuccessful medications and treatment, including physical therapy, chiropractic management, and various medications. An MRI of the cervical spine showed multilevel disc dehydration and protrusion. She underwent several injections and procedures, but none provided significant pain relief.',
 'Chief Complaint': "**Summary:** The patient's primary complaints are neck pain and lower back p

In [None]:
# for section, summary in summary_report.items():
#     print(f"**{section}**:\n{summary}\n")

**Patient Info**:
Andrea Quintanilla, a 49-year-old female, presented with severe neck and back pain for four years, radiating down the left leg. Physical therapy and chiropractic treatments have been ineffective. Examination revealed sharp, aching pain in the neck radiating to the head, worse with prolonged sitting or standing. Back pain was also described as sharp and aching, radiating down the left leg but not past the knee. No neurological deficits were noted. She has a history of cervical disc herniation, idiopathic thrombocytopenia purpura, and lumbosacral disc herniation. She had previously tried unsuccessful medications and treatment, including physical therapy, chiropractic management, and various medications. An MRI of the cervical spine showed multilevel disc dehydration and protrusion. She underwent several injections and procedures, but none provided significant pain relief.

**Chief Complaint**:
**Summary:** The patient's primary complaints are neck pain and lower back pa

### Anonymization Techniques

In [18]:
import spacy

In [22]:
# Load spaCy's English model with NER capability
nlp = spacy.load("en_core_web_sm")

def anonymize_summary(summary_text):
    # Detect named entities (e.g., patient names)
    doc = nlp(summary_text)
    anonymized_text = summary_text

    for ent in doc.ents:
        if ent.label_ == "PERSON":  # If the entity is a person's name
            anonymized_text = anonymized_text.replace(ent.text, "[NAME]")

    # Replace other identifiable information with placeholders
    anonymized_text = re.sub(r"\b\d{1,2}-year-old\b", "[AGE REDACTED]", anonymized_text)
    #anonymized_text = re.sub(r"\b(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}\b", "[DATE REDACTED]", anonymized_text)

    return anonymized_text
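
The age pattern in `anonymize_summary` only covers the hyphenated form ("49-year-old"). Other phrasings pass through unchanged, as in one of the runs below where "age 49" survives anonymization. A minimal sketch that covers a few more phrasings follows; `AGE_PATTERNS` and `redact_ages` are hypothetical names, and the pattern list is illustrative rather than exhaustive.

```python
import re

AGE_PATTERNS = [
    r"\b\d{1,3}[- ]year[- ]old\b",  # "49-year-old", "49 year old"
    r"\baged?\s+\d{1,3}\b",         # "age 49", "aged 49"
]

def redact_ages(text, placeholder="[AGE REDACTED]"):
    """Replace common age phrasings with a placeholder."""
    for pattern in AGE_PATTERNS:
        text = re.sub(pattern, placeholder, text, flags=re.IGNORECASE)
    return text

print(redact_ages("[NAME], age 49, is a 49 year old female."))
# [NAME], [AGE REDACTED], is a [AGE REDACTED] female.
```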



In [None]:
# Anonymize each section of the summary report
# anonymized_report = {section: anonymize_summary(text) for section, text in summary_report.items()}

# # Print the anonymized summary report
# print(anonymized_report)

{'Patient Info': '[PATIENT_NAME], a [AGE REDACTED] female, presented with severe neck and back pain for four years, radiating down the left leg. Physical therapy and chiropractic treatments have been ineffective. Examination revealed sharp, aching pain in the neck radiating to the head, worse with prolonged sitting or standing. Back pain was also described as sharp and aching, radiating down the left leg but not past the knee. No neurological deficits were noted. She has a history of cervical disc herniation, idiopathic thrombocytopenia purpura, and lumbosacral disc herniation. She had previously tried unsuccessful medications and treatment, including physical therapy, chiropractic management, and various medications. An MRI of the cervical spine showed multilevel disc dehydration and protrusion. She underwent several injections and procedures, but none provided significant pain relief.', 'Chief Complaint': "**Summary:** The patient's primary complaints are neck pain and lower back pai

In [None]:
# anonymized_report

{'Patient Info': '[PATIENT_NAME], a [AGE REDACTED] female, presented with severe neck and back pain for four years, radiating down the left leg. Physical therapy and chiropractic treatments have been ineffective. Examination revealed sharp, aching pain in the neck radiating to the head, worse with prolonged sitting or standing. Back pain was also described as sharp and aching, radiating down the left leg but not past the knee. No neurological deficits were noted. She has a history of cervical disc herniation, idiopathic thrombocytopenia purpura, and lumbosacral disc herniation. She had previously tried unsuccessful medications and treatment, including physical therapy, chiropractic management, and various medications. An MRI of the cervical spine showed multilevel disc dehydration and protrusion. She underwent several injections and procedures, but none provided significant pain relief.',
 'Chief Complaint': "**Summary:** The patient's primary complaints are neck pain and lower back pa

In [None]:
# for section, summary in anonymized_report.items():
#     print(f"**{section}**:\n{summary}\n")

**Patient Info**:
[PATIENT_NAME], a [AGE REDACTED] female, presented with severe neck and back pain for four years, radiating down the left leg. Physical therapy and chiropractic treatments have been ineffective. Examination revealed sharp, aching pain in the neck radiating to the head, worse with prolonged sitting or standing. Back pain was also described as sharp and aching, radiating down the left leg but not past the knee. No neurological deficits were noted. She has a history of cervical disc herniation, idiopathic thrombocytopenia purpura, and lumbosacral disc herniation. She had previously tried unsuccessful medications and treatment, including physical therapy, chiropractic management, and various medications. An MRI of the cervical spine showed multilevel disc dehydration and protrusion. She underwent several injections and procedures, but none provided significant pain relief.

**Chief Complaint**:
**Summary:** The patient's primary complaints are neck pain and lower back pai

In [36]:
def save_as_markdown(data, filename="medical_report.md"):
    md_text = "# Anonymized Medical Report\n\n"
    for section, content in data.items():
        md_text += f"## {section}\n{content}\n\n"

    with open(filename, "w", encoding="utf-8") as f:
        f.write(md_text)

In [37]:
import shutil
import os

def download_md_file(filename="medical_report.md", destination_folder="Downloads"):
    # Ensure the destination folder exists
    os.makedirs(destination_folder, exist_ok=True)

    # Move the file
    new_path = os.path.join(destination_folder, filename)
    shutil.move(filename, new_path)

#### User Interface

In [20]:
from google.colab import files

In [23]:
def cli_simulation():
    # Simulate the CLI upload process
    uploaded = files.upload()  # This will prompt the user to upload a file

    for pdf_file in uploaded.keys():
        print(f'Processing file: {pdf_file}')

        # Extract text from the uploaded PDF
        extracted_text = extract_text_from_pdf(pdf_file)


        # Extract sections from the cleaned text
        extracted_sections = extract_sections(clean_text(extracted_text))


        # Create a Summary Report
        summary_report=summarize_medical_report(extracted_sections)

        # Anonymize each section of the summary report
        anonymized_report = {section: anonymize_summary(text) for section, text in summary_report.items()}

        for section, summary in anonymized_report.items():
            print(f"**{section}**:\n{summary}\n")



# Run the CLI simulation function
cli_simulation()

Saving 49pageText_ocr.pdf to 49pageText_ocr.pdf
Processing file: 49pageText_ocr.pdf
**Patient Info**:
[NAME] is a [AGE REDACTED] female with a history of neck and back pain who is seen in pain management. Her neck pain started several years ago and is sharp, with a severity of 10/10. Her back pain has been present for 4 years and is aching and sharp, with a severity of 10/10. She has tried physical therapy, chiropractic management, and medication, but nothing has helped. She is currently taking Effexor 37.5 mg daily and Acetaminophen 1000mg BID-TID. She has a past medical history of cervical disc herniation, cervicalgia, idiopathic thrombocytopenia purpura, lumbosacral disc herniation, thrombocytopenia, and myofascial pain.

**Chief Complaint**:
The patient's chief complaints are neck pain and low back pain.

**History of Present Illness**:
[NAME], age 49, is experiencing severe neck and back pain. Her neck pain gradually started years ago and is sharp, radiates into her head, and caus

In [25]:
cli_simulation()

Saving v2_55204-3.pdf to v2_55204-3 (1).pdf
Processing file: v2_55204-3 (1).pdf
**Patient Info**:
After a work related accident where the patient fell, twisted her back and hit her knee, she complained about her lower back and right knee. The patient has since worsened but is not significantly worse. There is a need for referral/consultation. The patient has been given therapeutic services and will continue working as normal.

**Chief Complaint**:
**Chief Complaint:**

This section summarizes the main reason why the patient is seeking medical attention. It's a brief statement that describes the patient's primary concern or symptoms.

**History of Present Illness**:
**Summary of History of Present Illness:**

The patient is at a follow-up visit for an injury sustained on 08/28/2014. Their injury has worsened. They continue to experience pain in their right wrist and hand, which is described as dull and intermittent. They also report pain with motion. There are no associated symptoms, an

In [26]:
cli_simulation()

Saving v3_55204-3.pdf to v3_55204-3 (1).pdf
Processing file: v3_55204-3 (1).pdf
**Patient Info**:
**Patient Information:**

[NAME], injured on 08/28/2014.

**Condition:**

* Worsened since last exam
* Varicosities and abdominal pain

**Work Status:**

* Continue working with restrictions recommended by Dr. [NAME]

**Treatment:**

* Physical therapy
* Splint for right hand
* Appointment with Dr. [NAME] on 9/12/14 at 9:30 AM

**Chief Complaint**:
The "Chief Complaint" section summarizes the main reason why the patient is seeking medical attention. It captures the patient's primary concern or symptom.

**History of Present Illness**:
**Summary:**

The patient is a [AGE REDACTED] female who sustained an injury to their right wrist/hand on 08/28/2014. They are currently experiencing dull, intermittent pain in their right wrist and hand. There is no numbness, weakness, or discoloration. Pain is exacerbated by movement of the hand. Patient is currently on modified duty and is taking NSAIDs to

In [27]:
cli_simulation()

Saving v4_55204-3.pdf to v4_55204-3.pdf
Processing file: v4_55204-3.pdf
**Patient Info**:
**Patient Information:**

* Name: [NAME]
* Date of Injury: August 28, 2014
* Date of Service: November 20, 2014

**Medical History:**

* Family history of diabetes and heart disease
* Patient denies tobacco or alcohol use

**Symptoms:**

* Varicose veins
* Abdominal pain
* Muscle aches

**Current Status:**

* Condition has worsened since the last examination
* Work status has changed

**Treatment:**

* Patient is advised to continue with Dr. [NAME]'s recommendations

**Chief Complaint**:
**Summary:**

This section describes the primary reason why the patient is seeking medical attention. It briefly states what the patient is concerned about or what symptoms they are experiencing.

**History of Present Illness**:
**Summary:**

The patient is a [AGE REDACTED] woman who injured her right hand/wrist while lifting furniture. She continues to experience dull pain, swelling, and weakness in her hand. Her

In [38]:
def download_md_file_local(filename="medical_report.md"):
    files.download(filename)  # Google Colab download function
    print(f"✅ {filename} is ready for download!")

In [39]:
def cli_simulation():
    """CLI workflow: Upload, extract, clean, summarize, anonymize, save, and provide download option."""
    uploaded = files.upload()  # Prompt user to upload a file

    for pdf_file in uploaded.keys():
        print(f'Processing file: {pdf_file}')

        # Extract text from the uploaded PDF
        extracted_text = extract_text_from_pdf(pdf_file)

        # Extract sections from the cleaned text
        extracted_sections = extract_sections(clean_text(extracted_text))

        # Create a Summary Report
        summary_report = summarize_medical_report(extracted_sections)

        # Anonymize each section of the summary report
        anonymized_report = {section: anonymize_summary(text) for section, text in summary_report.items()}

        # Save anonymized report as Markdown
        save_as_markdown(anonymized_report)

        # Ask user if they want to download
        user_input = input("📥 Do you want to download the report? (yes/no): ").strip().lower()
        if user_input == "yes":
            #download_md_file()
            download_md_file_local()
        else:
            print("❌ Download cancelled.")

#cli_simulation()


In [40]:
cli_simulation()

Saving v4_55204-3.pdf to v4_55204-3 (2).pdf
Processing file: v4_55204-3 (2).pdf
📥 Do you want to download the report? (yes/no): yes


✅ medical_report.md is ready for download!
