In [20]:
import pandas as pd
import matplotlib.pyplot as plt

from utils.data_loading import (
    load_enron_emails,
    load_invoices,
    #load_cuad_full_contracts,
    load_arxiv_from_jsonl,
    load_invoices_kaggle
)
# Import only the path constants this notebook uses instead of a wildcard import
from utils.paths import RAW, RAW_INVOICES_2, FIG_EDA

In [9]:
# paths 
ENRON_CSV   = RAW / "enron.csv"                      
INVOICE_CSV = RAW / "invoices.csv"
ARXIV_JSONL = RAW / "arxiv.json"
#FULL_CONTRACT = RAW / "CUAD" / "FULL_CONTRACT_TXT"
INVOICE_2_CSV = RAW_INVOICES_2

emails_df   = load_enron_emails(ENRON_CSV, max_rows=None)       # if slow, set max_rows=50000
invoices_df = load_invoices(INVOICE_CSV)
#legal_df    = load_cuad_full_contracts(FULL_CONTRACT)
papers_df   = load_arxiv_from_jsonl(ARXIV_JSONL, max_rows=20000)
invoices_2_df = load_invoices_kaggle(INVOICE_2_CSV)



Loading Enron emails...
Enron after cleaning: (169269, 3)
Loading invoices...
Invoices after cleaning: (10000, 3)
Loading arXiv JSONL...
arXiv after cleaning: (19687, 3)
Loading Kaggle invoices from C:\Users\viach\Documents\doc_class\datasets\raw\invoices_2.csv …
Kaggle invoices after cleaning: (50000, 3)


In [10]:
print("ENRON shape:", emails_df.shape)
print("INVOICES shape:", invoices_df.shape)
#print("CUAD shape:", legal_df.shape)
print("ARXIV shape:", papers_df.shape)
print("INVOICES_2 shape:", invoices_2_df.shape)


ENRON shape: (169269, 3)
INVOICES shape: (10000, 3)
ARXIV shape: (19687, 3)
INVOICES_2 shape: (50000, 3)


In [18]:
invoices_2_df

| | text | doc_type | source |
| :--- | :--- | :--- | :--- |
| 0 | Invoice issued on 1982-09-10 for client Carmen... | INVOICE | KAGGLE_INVOICES |
| 1 | INVOICE 1982-09-10 — Client: Carmen Nixon Todd... | INVOICE | KAGGLE_INVOICES |
| 2 | Bill to: Carmen Nixon Todd Anderson, 283 Wendy... | INVOICE | KAGGLE_INVOICES |
| 3 | 1982-09-10 invoice for Carmen Nixon Todd Ander... | INVOICE | KAGGLE_INVOICES |
| 4 | Attention Carmen Nixon Todd Anderson, Logistic... | INVOICE | KAGGLE_INVOICES |
| ... | ... | ... | ... |
| 49995 | Invoice issued on 2021-07-25 for client Jeffre... | INVOICE | KAGGLE_INVOICES |
| 49996 | INVOICE 2021-07-25 — Client: Jeffrey Johnson Y... | INVOICE | KAGGLE_INVOICES |
| 49997 | Bill to: Jeffrey Johnson Yolanda Mcgrath, 783 ... | INVOICE | KAGGLE_INVOICES |
| 49998 | 2021-07-25 invoice for Jeffrey Johnson Yolanda... | INVOICE | KAGGLE_INVOICES |


In [17]:
for name, df in [
    ("EMAIL", emails_df),
    ("INVOICE", invoices_df),
    #("LEGAL_DOCUMENT", legal_df),
    ("SCIENTIFIC_PAPER", papers_df),
    ("INVOICE 2", invoices_2_df)
]:
    tmp = df.copy()
    tmp["n_chars"] = tmp["text"].str.len()
    print(f"=== {name} ===")
    print(tmp["n_chars"].describe())
    print()


=== EMAIL ===
count    169269.000000
mean        931.132310
std        1348.544744
min         100.000000
25%         216.000000
50%         435.000000
75%        1029.000000
max        9999.000000
Name: n_chars, dtype: float64

=== INVOICE ===
count    10000.000000
mean       250.592900
std          9.243306
min        234.000000
25%        244.000000
50%        249.000000
75%        255.000000
max        300.000000
Name: n_chars, dtype: float64

=== SCIENTIFIC_PAPER ===
count    19687.000000
mean       881.200030
std        403.311935
min        200.000000
25%        583.000000
50%        802.000000
75%       1122.000000
max       2051.000000
Name: n_chars, dtype: float64

=== INVOICE 2 ===
count    50000.000000
mean       134.812060
std         43.618914
min         61.000000
25%        115.000000
50%        134.500000
75%        150.000000
max        228.000000
Name: n_chars, dtype: float64



# Document Length and Volume Analysis

The four document types differ markedly in **sample volume** and **document length**, both of which matter when assembling a robust training dataset.

## Class-Specific Length Characteristics

The table below summarizes the key volume and length features for each document class:

| Document Class | Sample Count | Length Range (Characters) | Median Length (Approx.) | Key Observation |
| :--- | :--- | :--- | :--- | :--- |
| **EMAIL** | 169k | 100–10k | ~435 | Mixture of short messages and longer threaded discussions. |
| **INVOICE** | 10k | 234–300 | ~249 | **Very narrow range**, reflecting fixed text templates from structured fields. |
| **INVOICE** (Kaggle, `INVOICE 2`) | 50k | 61–228 | ~135 | Short templated texts, shorter than the first invoice set. |
| **LEGAL\_DOCUMENT** | 509 | Max ~300k | ~32k | **Smallest class by count**, but contains **extremely long contracts** (from CUAD). |
| **SCIENTIFIC\_PAPER** | 19.7k | 200–2050 | ~800 | Titles concatenated with abstracts (from arXiv metadata). |

---

## Implications for Model Training

These statistics show that the four document types differ not only in content and style but also in typical length. In particular, **LEGAL\_DOCUMENT** is both the smallest class by sample count and the largest by text length.

To address these disparities for effective model training, two key steps were taken:

### 1. Length-Based Filtering

The data was first filtered by length, using **class-specific thresholds** to manage length variance:

* **No Upper Bound for Legal:** No upper length bound was set for the legal documents, so the full contract texts are preserved.
* **Caps Applied:** Length caps were applied to other classes (e.g., emails and abstracts).
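
The filtering step described above could look roughly like the following. This is a minimal sketch, not the project's actual implementation: the `LENGTH_BOUNDS` values and the `filter_by_length` helper are hypothetical and chosen only for illustration. Only the `text` and `doc_type` column names come from this notebook.

```python
import pandas as pd

# Hypothetical per-class character bounds (min, max); None means "no upper bound".
# These numbers are illustrative assumptions, not the thresholds used in the project.
LENGTH_BOUNDS = {
    "EMAIL": (100, 5000),
    "INVOICE": (50, None),
    "SCIENTIFIC_PAPER": (200, 2000),
    "LEGAL_DOCUMENT": (1000, None),  # no upper bound, so full contracts survive
}

def filter_by_length(df, bounds=LENGTH_BOUNDS):
    """Keep rows whose text length lies within the bounds for their doc_type.

    Rows whose doc_type has no entry in `bounds` are dropped.
    """
    n_chars = df["text"].str.len()
    keep = pd.Series(False, index=df.index)
    for doc_type, (lo, hi) in bounds.items():
        in_class = df["doc_type"] == doc_type
        ok = n_chars >= lo
        if hi is not None:
            ok &= n_chars <= hi
        keep |= in_class & ok
    return df[keep]

# Tiny synthetic example: two emails fall outside the email bounds,
# while the very long legal document is kept.
demo = pd.DataFrame({
    "text": ["x" * 50, "x" * 300, "x" * 6000, "x" * 400_000],
    "doc_type": ["EMAIL", "EMAIL", "EMAIL", "LEGAL_DOCUMENT"],
})
filtered = filter_by_length(demo)
print(filtered["doc_type"].value_counts())
```

Keeping the bounds in a single dictionary makes it easy to see which classes are capped and which are not.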

### 2. Dataset Balancing

To mitigate class imbalance, the final dataset was balanced:

* **Capping by Smallest Class:** Each document class was later capped at **509 documents**.
* **Rationale:** This count matches the size of the smallest class (`LEGAL_DOCUMENT`), so the model trains on an equal number of samples from all four types.
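
One possible way to apply such a cap is to sample each class down to the cap. The sketch below assumes random downsampling with a fixed seed; the `cap_per_class` helper and the seed value are hypothetical, and the actual balancing code may select rows differently.

```python
import pandas as pd

def cap_per_class(df, cap=509, seed=42):
    """Randomly keep at most `cap` rows per doc_type (the seed is an assumption)."""
    parts = [
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in df.groupby("doc_type")
    ]
    return pd.concat(parts, ignore_index=True)

# Tiny synthetic example: 7 emails and 3 invoices, capped at 3 per class.
demo = pd.DataFrame({
    "text": [f"doc {i}" for i in range(10)],
    "doc_type": ["EMAIL"] * 7 + ["INVOICE"] * 3,
})
balanced = cap_per_class(demo, cap=3)
print(balanced["doc_type"].value_counts())
```

Using `min(len(group), cap)` keeps the helper safe for classes that are already at or below the cap.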

In [14]:
FIG_EDA.mkdir(parents=True, exist_ok=True)

def plot_length_hist(df, title, filename):
    tmp = df.copy()
    tmp["n_chars"] = tmp["text"].str.len()
    plt.figure()
    tmp["n_chars"].hist(bins=50)
    plt.title(title)
    plt.xlabel("Characters")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.savefig(FIG_EDA / filename, dpi=150)
    plt.close()

plot_length_hist(emails_df, "Email length", "length_hist_email.png")

In [13]:
def show_examples(df, label, n=3):
    print("=" * 80)
    print(label)
    print("=" * 80)
    # head(n) avoids an IndexError when df has fewer than n rows
    for i, text in enumerate(df["text"].head(n)):
        print(f"[{i}]")
        print(text[:800])  # first 800 chars
        print("\n" + "-" * 40 + "\n")

show_examples(emails_df, "EMAIL")
show_examples(invoices_df, "INVOICE")
#show_examples(legal_df, "LEGAL_DOCUMENT")
show_examples(papers_df, "SCIENTIFIC_PAPER")
show_examples(invoices_2_df, "INVOICE 2")


EMAIL
[0]
Traveling to have a business meeting takes the fun out of the trip. Especially if you have to prepare a presentation. I would suggest holding the business plan meetings here then take a trip without any formal business meetings. I would even try and get some honest opinions on whether a trip is even desired or necessary. As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not. Too often the presenter speaks and the others are quiet just waiting for their turn. The meetings might be better if held in a round table discussion format. My suggestion for where to go is Austin. Play golf and rent a ski boat and jet ski's. Flying somewhere takes too much time.

----------------------------------------

[1]
Randy, Can you send me a schedule of the salary and level of everyone in the scheduling group. Plus your thoughts on any changes that need to be made. (Patti S for examp

In [16]:
df_all = pd.concat(
    [
        emails_df[["text", "doc_type"]],
        invoices_df[["text", "doc_type"]],
        invoices_2_df[["text", "doc_type"]],
        #legal_df[["text", "doc_type"]],
        papers_df[["text", "doc_type"]],
    ],
    ignore_index=True
)

print(df_all["doc_type"].value_counts())


doc_type
EMAIL               169269
INVOICE              60000
SCIENTIFIC_PAPER     19687
Name: count, dtype: int64


## Initial Label Distribution and Balancing Strategy

### Raw Distribution Imbalance

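The raw dataset is heavily imbalanced:

* **EMAIL** is **massively overrepresented** (169,269 samples).
* **INVOICE** follows with 60,000 samples (10,000 + 50,000 from the two invoice sources), and **SCIENTIFIC\_PAPER** has 19,687.
* **LEGAL\_DOCUMENT** is **extremely rare** (the minority class). It is missing from the counts above because the CUAD loader is commented out in this notebook.

Left unaddressed, this disparity would bias the model toward emails and hurt its generalization to legal documents.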
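
To put numbers on the imbalance, the printed class counts can be turned into shares and ratios. The sketch below hard-codes the three counts shown by the `value_counts()` output above; the `summary` table and its column names are illustrative assumptions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"EMAIL": 169269, "INVOICE": 60000, "SCIENTIFIC_PAPER": 19687})

summary = pd.DataFrame({
    "count": counts,
    # Fraction of all loaded documents that belong to each class
    "share": (counts / counts.sum()).round(3),
    # How many times larger each class is than the smallest loaded class
    "ratio_to_smallest": (counts / counts.min()).round(1),
})
print(summary)
```

On a live DataFrame, `df_all["doc_type"].value_counts(normalize=True)` gives the shares directly.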

### Data Balancing Action

To create a fair and representative training environment, the data is balanced in the `02_data_preparation.ipynb` notebook:

* **Capping Strategy:** Each class is capped at **509 samples**.
* **Goal:** This cap matches the count of the smallest class (`LEGAL_DOCUMENT`), ensuring that all four document types contribute equally to the final training and testing sets.

In [19]:
df_all["n_chars"] = df_all["text"].str.len()
df_all.groupby("doc_type")["n_chars"].describe()


| doc_type | count | mean | std | min | 25% | 50% | 75% | max |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| EMAIL | 169269.0 | 931.13231 | 1348.544744 | 100.0 | 216.0 | 435.0 | 1029.0 | 9999.0 |
| INVOICE | 60000.0 | 154.108867 | 58.835415 | 61.0 | 117.0 | 140.0 | 205.0 | 300.0 |
| SCIENTIFIC_PAPER | 19687.0 | 881.20003 | 403.311935 | 200.0 | 583.0 | 802.0 | 1122.0 | 2051.0 |
