# Phase 1 — Corpus Profiling

## Objective

This notebook performs quantitative and qualitative structural inspection of the working ICU progress note corpus (n = 200) derived from MIMIC-IV.

This phase empirically characterises the structural and linguistic properties of the corpus before any schema definition or rule construction. It establishes an evidence base describing corpus composition, formatting regularities, and recurrent textual patterns, in order to reduce downstream design uncertainty.

No modelling, rule implementation, or architectural commitments are made in this notebook.

This notebook contains:

- Quantitative corpus-level statistics computed on all 200 notes  
- Sampling methodology for targeted manual inspection  
- Tables and plots generated from profiling  
- Pattern frequency counts  
- Observed structural irregularities  
- Consolidated manual inspection findings (pattern-level only)  

---

## Scope

This notebook performs structured corpus profiling through two complementary components.

### 1. Quantitative Profiling (Full Corpus, n = 200)

Quantitative profiling is conducted on the entire working corpus.

Analyses include:

- Token length distribution  
- Sentence count distribution  
- Section header frequency  
- Most common bigrams and trigrams  
- Most common uppercase patterns (candidate section headers)  
- Most frequent numeric patterns (candidate vital sign formats)  

Quantitative profiling outputs:

- Distributions
- Counts
- Frequency tables

These measurements provide corpus-wide structural visibility and reduce reliance on unguided manual reading.
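As a rough illustration of the length and n-gram measurements listed above, the following is a minimal sketch, not the notebook's actual implementation. It assumes whitespace tokenisation and a naive punctuation-based sentence splitter; both are simplifications, and clinical abbreviations such as "b.i.d." will inflate sentence counts. All function names here are hypothetical.

```python
import re
from collections import Counter
from statistics import mean, median

# Assumption: a token is any run of non-whitespace characters.
TOKEN_RE = re.compile(r"\S+")
# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def token_count(text):
    """Number of whitespace-delimited tokens in one note."""
    return len(TOKEN_RE.findall(text))


def sentence_count(text):
    """Number of naively split sentences in one note."""
    return len([s for s in SENT_SPLIT_RE.split(text.strip()) if s])


def ngram_counts(notes, n):
    """Corpus-wide counts of lowercased token n-grams (punctuation kept)."""
    counts = Counter()
    for text in notes:
        tokens = [t.lower() for t in TOKEN_RE.findall(text)]
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def length_summary(notes):
    """Summary statistics of the token length distribution."""
    lengths = [token_count(t) for t in notes]
    return {
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }
```

Calling `ngram_counts(notes, 2).most_common(20)` on the list of note texts would give a bigram frequency table; the same call with `n=3` gives trigrams.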

---

### 2. Targeted Manual Structural Inspection (Stratified Subset)

Manual inspection is conducted on a stratified subset of the corpus (approximately 40–50 notes) to identify structural and linguistic regularities not fully captured by automated profiling.

The subset is selected to capture variation in:

- Note length  
- Section density  
- Formatting variability  
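One possible way to draw such a subset is sketched below. It stratifies on note length only; section density and formatting variability would need their own measures, which are not defined here. The function name, the number of strata, the per-stratum sample size, and the fixed seed are all illustrative assumptions rather than the procedure actually used.

```python
import random


def stratified_sample(lengths, n_strata=4, per_stratum=10, seed=0):
    """Return sorted note indices sampled evenly from length-based strata.

    lengths: one token count per note, in corpus order.
    Notes are ranked by length and cut into n_strata roughly equal bins;
    up to per_stratum notes are then drawn at random from each bin.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    size, extra = divmod(len(order), n_strata)

    selected = []
    start = 0
    for s in range(n_strata):
        # the first `extra` strata absorb the remainder, one note each
        end = start + size + (1 if s < extra else 0)
        stratum = order[start:end]
        k = min(per_stratum, len(stratum))
        selected.extend(rng.sample(stratum, k))
        start = end
    return sorted(selected)
```

With 200 notes, four strata and ten notes per stratum, this yields a 40-note subset, at the lower end of the intended range.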

Manual inspection produces:

- Observed structural patterns  
- Section header variants  
- Negation constructions  
- Abbreviation classes  
- Formatting irregularities  

Findings are documented at the pattern level only. Individual note summaries are not recorded.

---

## Explicit Non-Scope

This notebook does not contain:

- Justification narratives  
- Architectural interpretation  
- Downstream modelling implications  
- Design commitments  

Design commitments derived from this profiling phase are documented in `decisions.md`.  
Phase-level synthesis and downstream implications are documented separately in `summary.md`.


## 1. Quantitative Profiling

This section computes corpus-wide structural statistics across all notes in the working corpus (n = 200).

The analyses below measure:

- Length distributions (tokens, sentences)
- Section header frequency
- Surface lexical regularities (n-grams, uppercase patterns)
- Numeric pattern prevalence relevant to vital sign formatting

All outputs in this section are descriptive statistics computed on the full corpus.
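The header and numeric pattern counts could be approximated with regular expressions, as in the sketch below. The patterns are assumptions chosen for illustration: a candidate header is taken to be a line-initial run of uppercase text ending in a colon, and the three numeric shapes are examples only, not a validated inventory of vital sign formats. The names `HEADER_RE`, `NUMERIC_RES`, `header_counts`, and `numeric_pattern_counts` are hypothetical.

```python
import re
from collections import Counter

# Assumption: candidate headers are uppercase runs at line start, ending in ':'.
HEADER_RE = re.compile(r"^[ \t]*([A-Z][A-Z /&]{1,40}):", re.MULTILINE)

# Assumption: illustrative numeric shapes, e.g. 120/80, 37.2, 98%.
NUMERIC_RES = {
    "ratio": re.compile(r"\b\d{2,3}/\d{2,3}\b"),
    "decimal": re.compile(r"\b\d+\.\d+\b"),
    "percent": re.compile(r"\b\d{1,3}%"),
}


def header_counts(notes):
    """Count occurrences of each candidate section header across all notes."""
    counts = Counter()
    for text in notes:
        counts.update(m.group(1).strip() for m in HEADER_RE.finditer(text))
    return counts


def numeric_pattern_counts(notes):
    """Count matches of each numeric shape across all notes."""
    counts = Counter()
    for text in notes:
        for label, pattern in NUMERIC_RES.items():
            counts[label] += len(pattern.findall(text))
    return counts
```

Because the header pattern is deliberately loose, short labelled values such as `BP:` would also be counted as candidates; separating true section headers from inline labels is left to manual inspection.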

## 2. Manual Structural Inspection

This section documents qualitative observations from targeted manual review of a stratified subset of notes (approximately 40–50).
The subset is selected to capture variability in length, section structure, and formatting.

Observations focus on:

- Section header variants
- Negation constructions
- Abbreviation usage
- Formatting and structural irregularities

Findings are recorded at the pattern level only.