# End-to-End GenAI Email Classification & OCR Pipeline

This notebook demonstrates an **end-to-end** solution to:
1. Parse multiple `.eml` emails from a directory.
2. Extract text from the email body and attachments (PDF/DOCX).
3. Classify request types using an **open-source Zero-Shot** model.
4. Apply domain rules to determine primary vs. sub-request types.
5. Perform **basic field extraction** (e.g., amounts).
6. (Optional) Perform **OCR** for scanned PDFs, if needed.
7. Generate a **report** summarizing results for all `.eml` files.

### Libraries & Installation
- **transformers** for zero-shot classification.
- **PyPDF2** for extracting text from PDFs.
- **python-docx** for extracting text from DOCX.
- **pytesseract** & **Pillow** (optional) for OCR on image-based PDFs; `pytesseract` is a wrapper, so the Tesseract binary must also be installed on the system.

You can install them with:
```
!pip install transformers PyPDF2 python-docx pytesseract Pillow
```
By default, the code below handles only text-based PDFs. For scanned (image-only) PDFs, enable the Tesseract-based OCR hooks that are commented out in the code.
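
One way to decide when the OCR path is needed is to check how much text the regular extractor returned. The sketch below is a heuristic and not part of the notebook's pipeline: `needs_ocr` and its `min_chars_per_page` threshold are hypothetical names and values chosen for illustration.

```python
from typing import List, Optional


def needs_ocr(page_texts: List[Optional[str]], min_chars_per_page: int = 20) -> bool:
    """Heuristic: treat a PDF as scanned when its pages yield almost no text.

    `page_texts` holds one entry per page, e.g. the values returned by
    `page.extract_text()`. Empty strings (and None, tolerated here) count
    as zero characters. The threshold is an assumed value; tune it.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    return total_chars < min_chars_per_page * len(page_texts)


# A PDF whose pages produced no text looks scanned; a text PDF does not.
print(needs_ocr(["", None]))
print(needs_ocr(["Requesting $2,500,000 for Redwood Project."]))
```

A check like this would only flag the file; rendering pages to images and calling `pytesseract.image_to_string` is still required to recover the text.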

Let's get started!

In [None]:
!pip install transformers PyPDF2 python-docx pytesseract Pillow --quiet

## 1. Imports and Pipeline Setup

We import the core libraries for parsing emails, performing zero-shot classification, and extracting text from PDF/DOCX attachments.

> **Note**: For demonstration, we embed sample `.eml` files **directly in this notebook**; those with attachments carry them Base64-encoded. We write them out to a `test_emails` folder so that they can be parsed like real emails.
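
Because the sample attachments are pasted in as Base64 text, it can help to confirm that a payload decodes to the file type its name claims. The following is a minimal sketch under that assumption; `sniff_payload_type` is a hypothetical helper, and matching the leading bytes does not prove the file is otherwise valid.

```python
import base64


def sniff_payload_type(b64_payload: bytes) -> str:
    """Decode a Base64 payload and guess its type from the leading bytes."""
    # With the default validate=False, characters outside the Base64
    # alphabet (such as newlines in a pasted block) are discarded.
    raw = base64.b64decode(b64_payload)
    if raw.startswith(b"%PDF-"):
        return "pdf"
    if raw.startswith(b"PK\x03\x04"):
        return "zip"  # DOCX files are ZIP containers
    return "unknown"


# Toy payloads built on the spot (not the notebook's sample data).
toy_pdf = base64.b64encode(b"%PDF-1.5\n% toy content")
toy_zip = base64.b64encode(b"PK\x03\x04" + b"\x00" * 8)
print(sniff_payload_type(toy_pdf))
print(sniff_payload_type(toy_zip))
```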


In [None]:
import os
import re
import base64
import email
import json
import io
from typing import Dict, List, Tuple

#######################################
# Transformers Zero-Shot Classifier
#######################################
from transformers import pipeline
# Initialize the pipeline (using bart-large-mnli as an example)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

#######################################
# PDF, DOCX, and optional OCR (Tesseract)
#######################################
import PyPDF2
import docx

# OPTIONAL: If you need OCR for image-based PDFs
# import pytesseract
# from PIL import Image

#######################################
# Utility: parse_eml
#######################################
def parse_eml(file_path: str) -> dict:
    """
    Parse .eml file and return a dictionary of:
      {
        'subject': str,
        'from': str,
        'to': str,
        'cc': str,
        'date': str,
        'body': str,
        'attachments': [ { 'filename': str, 'data': bytes }, ... ]
      }
    """
    with open(file_path, "rb") as f:
        raw_data = f.read()
    msg = email.message_from_bytes(raw_data)

    email_data = {
        "subject": msg.get("subject", ""),
        "from": msg.get("from", ""),
        "to": msg.get("to", ""),
        "cc": msg.get("cc", ""),
        "date": msg.get("date", ""),
        "body": "",
        "attachments": []
    }

    for part in msg.walk():
        filename = part.get_filename()
        content_type = part.get_content_type()

        # If there's a filename, treat it as attachment
        if filename:
            attach_data = part.get_payload(decode=True)
            email_data["attachments"].append(
                {"filename": filename, "data": attach_data}
            )
        else:
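            if content_type in ["text/plain", "text/html"]:
                try:
                    body_content = part.get_payload(decode=True)
                    if body_content:
                        # Decode with the part's declared charset, falling back to UTF-8
                        charset = part.get_content_charset() or "utf-8"
                        email_data["body"] += body_content.decode(charset, errors="ignore")
                except Exception:
                    # Skip parts that cannot be decoded (e.g. unknown charset)
                    pass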

    return email_data

#######################################
# Utility: extract_text_from_attachments
#######################################
def extract_text_from_attachments(attachments: List[dict]) -> str:
    """
    For each attachment:
      - If PDF, extract text using PyPDF2.
      - If DOCX, extract text using python-docx.
      - (Optionally, if scanned PDF, can do OCR with Tesseract.)
    Return concatenated text from all attachments.
    """
    full_attachment_text = ""
    for attach in attachments:
        filename = attach["filename"].lower()
        data = attach["data"]

        # PDF extraction
        if filename.endswith(".pdf"):
            try:
                pdf_file = PyPDF2.PdfReader(io.BytesIO(data))
                pdf_text = []
                for page in pdf_file.pages:
                    extracted = page.extract_text()
                    if extracted:
                        pdf_text.append(extracted)
                combined_pdf_text = "\n".join(pdf_text)

                # If it was a scanned PDF, you could do:
                #   - Convert each page to an image
                #   - Run pytesseract.image_to_string(image)
                # We'll skip that for now.

                full_attachment_text += (combined_pdf_text + "\n")
            except Exception as e:
                full_attachment_text += f"[PDF error: {e}]\n"

        # DOCX extraction
        elif filename.endswith(".docx"):
            try:
                file_stream = io.BytesIO(data)
                doc = docx.Document(file_stream)
                docx_text = []
                for para in doc.paragraphs:
                    if para.text.strip():
                        docx_text.append(para.text)
                full_attachment_text += ("\n".join(docx_text) + "\n")
            except Exception as e:
                full_attachment_text += f"[DOCX error: {e}]\n"

        else:
            # For other file types, handle as needed.
            pass

    return full_attachment_text

#######################################
# classify_request_types (Zero-shot)
#######################################
def classify_request_types(text: str, candidate_labels: List[str]) -> Dict[str, float]:
    """
    Use the zero-shot classification pipeline to get a probability
    for each candidate label.
    """
    if not text.strip():
        return {label: 0.0 for label in candidate_labels}

    result = classifier(text, candidate_labels, multi_label=True)
    # 'result' has 'labels' and 'scores' in descending order.
    scores_dict = dict(zip(result["labels"], result["scores"]))
    return scores_dict

#######################################
# apply_domain_rules
#######################################
def apply_domain_rules(scores_dict: Dict[str, float]) -> Tuple[str, List[str]]:
    """
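    Pick the highest-confidence label that appears in the domain priority
    list as the primary request; every other label becomes a sub-request.
    Note: the list acts only as an allow-list here; its order does not
    affect which label is chosen.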

    Domain priority example:
      1. Money Movement-Inbound
      2. Money Movement-Outbound
      3. Commitment Change
      4. Fee Payment
      5. Closing Notice
      6. AU Transfer
      7. Adjustments
    """
    if not scores_dict:
        return None, []

    # Sort labels by confidence descending
    sorted_labels = sorted(scores_dict.items(), key=lambda x: x[1], reverse=True)

    priority_order = [
        "Money Movement-Inbound",
        "Money Movement-Outbound",
        "Commitment Change",
        "Fee Payment",
        "Closing Notice",
        "AU Transfer",
        "Adjustments",
    ]

    primary = None
    sub_list = []
    for label, _ in sorted_labels:
        # if we haven't picked a primary yet AND label is recognized
        if (primary is None) and (label in priority_order):
            primary = label
        else:
            sub_list.append(label)

    return primary, sub_list

#######################################
# extract_key_fields
#######################################
def extract_key_fields(text: str) -> dict:
    """
    Basic example: extract first found $-amount. Optionally add more logic.
    """
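    # Require a leading "$" (e.g. $1,234,567) or a trailing currency word
    # (e.g. 2,000.00 USD) so bare numbers such as dates are not matched.
    amount_pattern = re.compile(
        r'\$\s?\d+(?:,\d{3})*(?:\.\d+)?|\d+(?:,\d{3})*(?:\.\d+)?\s?(?:USD|dollars)\b',
        re.IGNORECASE,
    )
    match = amount_pattern.search(text)
    amount = match.group(0) if match else None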

    extracted_fields = {
        "deal_name": None,         # Could do advanced logic or NER.
        "amount": amount,
        "expiration_date": None,   # Could do date regex or NER.
    }
    return extracted_fields

#######################################
# check_duplicate (stub)
#######################################
def check_duplicate(email_data: dict, text: str) -> Tuple[bool, str]:
    """
    Simple placeholder for duplicates. You might:
      - Track message IDs
      - Compare text similarity
    Here, we always return False.
    """
    return False, ""

#######################################
# process_email
#######################################
def process_email(file_path: str, request_types: List[str]) -> dict:
    """
    End-to-end pipeline for a single .eml:
      1. parse .eml
      2. extract text (body + attachments)
      3. classify request types (zero-shot)
      4. apply domain rules -> primary + subs
      5. extract key fields (amount, etc.)
      6. check duplicates
      7. return structured result
    """
    eml_data = parse_eml(file_path)
    body_text = eml_data["body"] or ""

    attachments_text = extract_text_from_attachments(eml_data["attachments"])
    combined_text = body_text + "\n" + attachments_text

    # Classify
    scores = classify_request_types(combined_text, request_types)
    primary_request, sub_requests = apply_domain_rules(scores)

    # Extract fields
    extracted_fields = extract_key_fields(combined_text)

    # Check duplicates
    duplicate_flag, duplicate_reason = check_duplicate(eml_data, combined_text)

    if not primary_request:
        primary_request = "Unknown"  # fallback if no request type matched

    output = {
        "filename": os.path.basename(file_path),
        "primary_request_type": {
            "label": primary_request,
            "confidence": scores.get(primary_request, 0.0),
            "reasoning": f"Detected {primary_request} with highest confidence; domain rules applied."
        },
        "sub_request_types": [
            {
                "label": sr,
                "confidence": scores.get(sr, 0.0)
            } for sr in sub_requests
        ],
        "extracted_fields": extracted_fields,
        "duplicate_flag": duplicate_flag,
        "duplicate_reason": duplicate_reason
    }
    return output
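
`apply_domain_rules` ranks labels by confidence alone; `priority_order` only filters which labels are recognized. If the domain priority should actually influence the choice, one option is to let it break ties among labels that clear a confidence threshold. The sketch below assumes that interpretation: `pick_primary_by_priority` and the `0.5` threshold are hypothetical, and the scores in the example are made up rather than model output.

```python
from typing import Dict, List, Optional, Tuple

PRIORITY_ORDER = [
    "Money Movement-Inbound",
    "Money Movement-Outbound",
    "Commitment Change",
    "Fee Payment",
    "Closing Notice",
    "AU Transfer",
    "Adjustments",
]


def pick_primary_by_priority(
    scores: Dict[str, float],
    threshold: float = 0.5,  # assumed cut-off for a "confident" label
) -> Tuple[Optional[str], List[str]]:
    """Choose the highest-priority label among those above the threshold."""
    confident = [l for l in PRIORITY_ORDER if scores.get(l, 0.0) >= threshold]
    if confident:
        primary = confident[0]
    else:
        # No label is confident: fall back to the best recognized label.
        recognized = [l for l in scores if l in PRIORITY_ORDER]
        primary = max(recognized, key=lambda l: scores[l]) if recognized else None
    # Remaining labels become sub-requests, most confident first.
    subs = sorted(
        (l for l in scores if l != primary),
        key=lambda l: scores[l],
        reverse=True,
    )
    return primary, subs


example_scores = {"Fee Payment": 0.91, "Money Movement-Inbound": 0.84, "Adjustments": 0.12}
print(pick_primary_by_priority(example_scores))
```

With these illustrative scores, "Money Movement-Inbound" is chosen over the higher-scoring "Fee Payment" because both clear the threshold and inbound movement ranks first in the priority list.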

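`check_duplicate` is a stub that always returns `False`. A minimal sketch of the two ideas named in its docstring (tracking message IDs and comparing text) might look like the following. The class name `DuplicateTracker`, the SHA-256 hash over whitespace-normalized text, and the reason strings are assumptions for illustration. It also assumes the caller supplies the `Message-ID` header, which `parse_eml` does not currently capture.

```python
import hashlib
from typing import Set, Tuple


class DuplicateTracker:
    """Remember message IDs and text hashes seen so far in one run."""

    def __init__(self) -> None:
        self._seen_ids: Set[str] = set()
        self._seen_hashes: Set[str] = set()

    def check(self, message_id: str, text: str) -> Tuple[bool, str]:
        # Normalize case and whitespace so trivial differences do not matter.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if message_id and message_id in self._seen_ids:
            return True, "Message-ID already processed"
        if digest in self._seen_hashes:
            return True, "Identical text already processed"
        if message_id:
            self._seen_ids.add(message_id)
        self._seen_hashes.add(digest)
        return False, ""


tracker = DuplicateTracker()
print(tracker.check("<a@example.com>", "Fee  Payment request"))
print(tracker.check("<b@example.com>", "fee payment request"))
```

Exact-hash matching only catches verbatim repeats after normalization; near-duplicates such as forwarded or quoted replies would need a similarity measure instead.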

## 2. Generate Sample EML Files

We'll create multiple `.eml` files in a `test_emails` directory. Each email will have a different body and attachments with **meaningful** content. Then, we'll run the pipeline on all of them.

**Note**: We'll embed base64-encoded PDFs and DOCXs. Each PDF or DOCX will contain some domain-relevant text, e.g., mention of amounts, request type, or other keywords.
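
The helper in the next cell assembles the MIME text by hand. As a point of comparison, the standard library's `email.message.EmailMessage` can build an equivalent message and take care of boundaries and Base64 encoding itself. This is a sketch of that alternative, not the notebook's method; `build_sample_eml` and the toy payload are hypothetical, and the attachments here are raw bytes (a Base64 constant would first need `base64.b64decode`).

```python
import email
from email.message import EmailMessage
from typing import List, Tuple


def build_sample_eml(
    subject: str,
    body_text: str,
    attachments: List[Tuple[str, bytes, str, str]],
) -> bytes:
    """Build a multipart message; each attachment is (filename, raw bytes, maintype, subtype)."""
    msg = EmailMessage()
    msg["From"] = "hackathon.user@example.com"
    msg["To"] = "genai.challenge@example.com"
    msg["Subject"] = subject
    msg.set_content(body_text)
    for filename, raw_bytes, maintype, subtype in attachments:
        # Bytes content is Base64-encoded by the library.
        msg.add_attachment(raw_bytes, maintype=maintype, subtype=subtype, filename=filename)
    return msg.as_bytes()


raw = build_sample_eml(
    "Inbound Funding Request",
    "See attached.",
    [("note.pdf", b"%PDF-1.5 toy bytes", "application", "pdf")],
)

# Parse it back the same way parse_eml does and list the attachments.
parsed = email.message_from_bytes(raw)
found = [
    (part.get_filename(), part.get_payload(decode=True))
    for part in parsed.walk()
    if part.get_filename()
]
print(found)
```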


In [None]:
# PDF 1: Inbound money movement, Redwood Project
PDF1_BASE64 = b"""
JVBERi0xLjUNCiW1t7IKDQoxIDAgb2JqDTw8IC9DcmVhdG9yIChHb29nbGUgQ29sYWIpCi9QYWdl
cyAyIDAgUgovVHlwZS9DYXRhbG9nPj4NZW5kb2JqDQoyIDAgb2JqDTw8IC9Db3VudCAxIC9LaWRz
IFszIDAgUl0gL1R5cGUgL1BhZ2VzPj4NZW5kb2JqDQozIDAgb2JqDTw8IC9Db250ZW50cyA0IDAg
UiAvUGFyZW50IDIgMCBSIC9UeXBlIC9QYWdlPj5lbmRvYmoNCjQgMCBvYmoNPDwgL0xlbmd0aCAx
NDc+PnN0cmVhbQpUaGlzIFBERiBjb250YWlucyBhbiBpbmJvdW5kIG1vbmV5IG1vdmVtZW50IHJl
cXVlc3QgZm9yIFJlZHdvb2QgUHJvamVjdC4KCklubmVyIFRleHQ6IFJlcXVlc3RpbmcgJDIuNTAw
LDAwMCBmb3IgUmVkd29vZCBQcm9qZWN0LCB0byBiZSBmb3JtYWxseSBpbml0aWF0ZWQgb24gMjAy
NS0xMi0wMS4KCkZlZWVzOiBUaGlzIGlzIGNsYXNzaWZpZWQgYXMgYW4gTW9uZXkgTW92ZW1lbnQt
SW5ib3VuZCByZXF1ZXN0LgoNCkVuZCBvZiBQQ0YNCmVuZHN0cmVhbQplbmRvYmoNCnhyZWYNCjAg
MTUNCjAwMDAwMDAwMDAgNjU1MzUgZiANCjAwMDAwMDAwMTAgMDAwMDAgbiANCjAwMDAwMDAwNTAg
MDAwMDAgbiANCjAwMDAwMDAwOTAgMDAwMDAgbiANCnRyYWlsZXINCjw8IC9TaXplIDUgL1Jvb3Qg
MSAwIFIgL0luZm8gNiAwIFIgL0lEIFsgPDZiYTg0MGEyNmI5OGI5YWFiODQyZDQ0ODgwNzQzMzAw
Mj4gPDZiYTg0MGEyNmI5OGI5YWFiODQyZDQ0ODgwNzQzMzAwMj5dID4+DQpzdGFydHhyZWYNCjUx
OQ0KJSVFT0YNCg==
"""

# DOCX 1: Redwood Project doc with some references
DOCX1_BASE64 = b"""
UEsDBBQABgAIAAAAIQDKgTJisAEAAB4FAAAQABwAUmVkd29vZC5kb2N4AAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAhrJiWVPQy9KTSxJVcjPTM0HczQwMDY2YpgBAkLEgQMAYzh5IQbKdyI9pNgp2
LxRJFVtlPxBiCfF1h3sO0mkPDI94RZbXWxchQmqB6dvzK1AR3FU7dKxOSnImsVM5fgy0ND/orybC
93jST6AUEsBAh4DFAAGAAgAAAAhAMqBMmKwAQAAHgUAABAAAAAAAAAAAAAAAAAAAAAAAACAAAAA
AFJlZHdvb2QuZG9jeFBLBQYAAAAAAQABAEcAAAD2AAAAAAA="""

# PDF 2: Mentions an Adjustment of $500 for Amendment Fees
PDF2_BASE64 = b"""
JVBERi0xLjUNCiW1t7IKDQoxIDAgb2JqDTw8IC9DcmVhdG9yIChHb29nbGUgQ29sYWIpCi9QYWdl
cyAyIDAgUgovVHlwZS9DYXRhbG9nPj4NZW5kb2JqDQoyIDAgb2JqDTw8IC9Db3VudCAxIC9LaWRz
IFszIDAgUl0gL1R5cGUgL1BhZ2VzPj4NZW5kb2JqDQozIDAgb2JqDTw8IC9Db250ZW50cyA0IDAg
UiAvUGFyZW50IDIgMCBSIC9UeXBlIC9QYWdlPj5lbmRvYmoNCjQgMCBvYmoNPDwgL0xlbmd0aCAx
Mjc+PnN0cmVhbQpUaGlzIGRvY3VtZW50IG91dGxpbmVzIGFuIGFkanVzdG1lbnQgcmVxdWVzdC4K
CldlIGFyZSBhc2tpbmcgZm9yIGFuICQ1MDAgQW1lbmRtZW50IEZlZS4KCkl0IHNob3VsZCBiZSBw
cmlvcml0aXplZCAgYWJvdmUgb3RoZXIgc2VydmljZSByZXF1ZXN0cy4KCkVuZCBvZiBkb2N1bWVu
dA0KZW5kc3RyZWFtDQplbmRvYmoNCnhyZWYNCjAgMTUNCjAwMDAwMDAwMDAgNjU1MzUgZiANCjAw
MDAwMDAwMTAgMDAwMDAgbiANCjAwMDAwMDAwNTAgMDAwMDAgbiANCjAwMDAwMDAwOTAgMDAwMDAg
biANCnRyYWlsZXINCjw8IC9TaXplIDUgL1Jvb3QgMSAwIFIgL0luZm8gNiAwIFIgL0lEIFsgPDY0
MjM0MDczMjBmYTE1ZjJkZWMyZTQzZTU5NTI2NzUwZT4gPDY0MjM0MDczMjBmYTE1ZjJkZWMyZTQz
ZTU5NTI2NzUwZT5dID4+DQpzdGFydHhyZWYNCjUxOQ0KJSVFT0YNCg==
"""

# DOCX 2: Mentions an "Adjustment" for $500 - Word doc.
DOCX2_BASE64 = b"""
UEsDBBQABgAIAAAAIQDmTaC9tgEAADcFAAAOABwAQWRqdXN0bWVudC5kb2N4AAAAAAAAAAAAAAAA
AAAAAAC0O2EK7M3JTczRbKyIVbIKjeUCAuERggA2QiAQAUEsDBBQABgAIAAAAIQDmTaC9tgEAADcF
AAAQABwAd29yZC9fcmVscy9kb2N1bWVudC54bWwucmVscwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AHRCRgpQSwECHgMUAAYACAAAACEA5k2gvbYBAAA3BQAAFgAAAAAAAAAAAAAAAAAkAAAAAEFkanVz
dG1lbnQuZG9jeFBLAQIeAxQABgAIAAAAIQDmTaC9tgEAADcFAAAQAAAAAAAAAAAAAAAAAAgAAQAA
AHdvcmQvX3JlbHMvZG9jdW1lbnQueG1sLnJlbHNUQkYKlBLAQIeAxQABgAIAAAAIQDmTaC9tgEA
AA=="""

##################################################################
#  Helper function to write sample .eml with attachments
##################################################################

def create_eml_file(
    file_path: str,
    subject: str,
    body_text: str,
    attachments: List[Tuple[str, bytes]]
) -> None:
    """
    Create an .eml file with the given subject, body_text, and a list of
    attachments (filename, base64 bytes).
    """
    boundary = "BOUNDARY"
    # Basic EML header: one header per line, no trailing newline
    # (the blank line that ends the header block is added when writing).
    eml_header = (
        "From: hackathon.user@example.com\n"
        "To: genai.challenge@example.com\n"
        f"Subject: {subject}\n"
        "MIME-Version: 1.0\n"
        f'Content-Type: multipart/mixed; boundary="{boundary}"'
    )

    # Body part
    eml_body = f"--{boundary}\nContent-Type: text/plain\n\n{body_text}\n\n"

    # Attachments part
    eml_attachments = ""
    for (fname, b64data) in attachments:
        if fname.lower().endswith(".pdf"):
            content_type = "application/pdf"
        elif fname.lower().endswith(".docx"):
            content_type = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
        else:
            content_type = "application/octet-stream"

        eml_attachments += (f"--{boundary}\n"
                            f"Content-Type: {content_type}\n"
                            f"Content-Transfer-Encoding: base64\n"
                            f"Content-Disposition: attachment; filename=\"{fname}\"\n"
                            f"\n"
                            f"{b64data.decode('utf-8')}\n\n")

    # Closing boundary
    eml_close = f"--{boundary}--\n"

    with open(file_path, "w", encoding="utf-8") as f:
        f.write(eml_header + "\n\n")
        f.write(eml_body)
        f.write(eml_attachments)
        f.write(eml_close)

# Now let's create a test_emails directory and populate multiple .eml files.

os.makedirs("test_emails", exist_ok=True)

# 1) EML for Inbound Money Movement
body_1 = ("Hello Loan Team,\n\n"
          "Requesting a Money Movement-Inbound for $2,500,000.\n"
          "See attached PDF & DOCX for Redwood Project details.\n\n"
          "Best Regards,\nClient XYZ")

create_eml_file(
    file_path="test_emails/email_inbound.eml",
    subject="Inbound Funding Request - Redwood Project",
    body_text=body_1,
    attachments=[
        ("redwood_funding.pdf", PDF1_BASE64),
        ("redwood_details.docx", DOCX1_BASE64)
    ]
)

# 2) EML for Adjustments / Amendment Fee
body_2 = ("Dear Servicing Team,\n\n"
          "We would like to request an Adjustment in the amount of $500 for Amendment Fees.\n"
          "Please review the attached PDF and DOCX.\n\n"
          "Thanks,\nAccount Manager")

create_eml_file(
    file_path="test_emails/email_adjustment.eml",
    subject="Adjustment Request - Amendment Fees",
    body_text=body_2,
    attachments=[
        ("amendment_fees.pdf", PDF2_BASE64),
        ("adjustment_details.docx", DOCX2_BASE64)
    ]
)

# 3) EML referencing multiple requests but we want to see which is primary
#    We'll mention both "money movement inbound" and "fee payment" in the text.

body_3 = ("Hello,\n\n"
          "We need a Fee Payment processed tomorrow, and also plan to do a Money Movement-Inbound.\n"
          "However, the inbound request is the priority.\n\n"
          "Regards,\nClient ABC")

create_eml_file(
    file_path="test_emails/email_multiple.eml",
    subject="Fee Payment & Inbound Movement",
    body_text=body_3,
    attachments=[]  # no attachments for this one
)

print("Sample EML files created in ./test_emails")

## 3. Run the Pipeline on All `.eml` Files in `test_emails`

We'll define our sample request types, iterate over each `.eml` file, parse & classify, then collect the results in a **report**.

Finally, we'll display the output as JSON.


In [None]:
# Define the known request types
SAMPLE_REQUEST_TYPES = [
    "Adjustments",
    "AU Transfer",
    "Closing Notice",
    "Commitment Change",
    "Fee Payment",
    "Money Movement-Inbound",
    "Money Movement-Outbound"
]

# Collect results in a list
report = []

# Iterate over all .eml files in test_emails
# sorted() keeps the report order stable, since os.listdir order is arbitrary
eml_files = sorted(f for f in os.listdir("test_emails") if f.lower().endswith(".eml"))
for eml_file in eml_files:
    eml_path = os.path.join("test_emails", eml_file)
    result = process_email(eml_path, SAMPLE_REQUEST_TYPES)
    report.append(result)

# Print the final report as JSON
print("\n==================== Final Report ====================")
print(json.dumps(report, indent=2))
print("=====================================================")
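
Beyond printing the raw JSON, the report can be condensed into a quick summary. The sketch below assumes that `report` entries have the shape returned by `process_email`; `summarize_report` is a hypothetical helper, and the example entries and their confidence values are made-up placeholders, not real model output.

```python
from collections import Counter
from typing import Dict, List


def summarize_report(report: List[dict]) -> Dict[str, int]:
    """Count how many emails were assigned each primary request type."""
    labels = (entry["primary_request_type"]["label"] for entry in report)
    return dict(Counter(labels))


example_report = [
    {"filename": "email_inbound.eml",
     "primary_request_type": {"label": "Money Movement-Inbound", "confidence": 0.9}},
    {"filename": "email_adjustment.eml",
     "primary_request_type": {"label": "Adjustments", "confidence": 0.8}},
    {"filename": "email_multiple.eml",
     "primary_request_type": {"label": "Money Movement-Inbound", "confidence": 0.7}},
]
print(summarize_report(example_report))
```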