# ![Banner](https://github.com/LittleHouse75/flatiron-resources/raw/main/NevitsBanner.png)
---
# Project Preface & Directory
### Dialogue Summarization: From Custom Architectures to Frontier LLMs
---

## BLUF: The Bottom Line Up Front

This project set out to answer a specific engineering question: **What is the most effective way to summarize noisy, informal chat logs in 2025?**

The industry trend suggests simply sending everything to a massive frontier model (like GPT-4 or Claude). However, our rigorous benchmarking across three distinct architectural approaches reveals a more nuanced reality.

**The Headlines:**

1.  **Specialization beats Generalization:** A fine-tuned **BART** model (Experiment 2) significantly outperformed massive frontier LLMs on standard quality metrics (ROUGE). It scored higher while being **3x faster** and **far cheaper** at scale than the API comparisons.
2.  **Architecture Matters:** Our custom "Frankenstein" model (DistilBERT encoder + DistilGPT-2 decoder) proved that simply gluing independently pre-trained components together is inefficient. Without pre-trained cross-attention, the model struggled to learn the task, highlighting exactly *why* models like BART and T5 exist.
3.  **The "API Tax" is Real:** While zero-shot frontier models (Experiment 3) provided the easiest "time-to-hello-world," they struggled to match the specific terse, outcome-focused style required by the dataset. They are excellent prototypes, but expensive production solutions.

Ultimately, this project demonstrates that for high-volume, domain-specific tasks, **a smaller, purpose-built model rooted in your own data is still the engineering gold standard.**

---

## The Problem: Drowning in Context

Much of modern work happens in chat applications, but chat logs are noisy, informal, and fragmented. Employees spend hours every week scrolling through back-and-forth messages just to answer simple questions like:
- *"What was the final decision?"*
- *"Who owns the next step?"*
- *"Did we agree on a time?"*

**The Goal**

Build an automated system that ingests raw, messy dialogue and outputs a crisp, third-person summary of the outcome.

**The Constraints**
- **Inputs:** Short, informal, multi-speaker text (slang, typos, emojis included).
- **Outputs:** 1-2 sentence summaries suitable for a notification preview.
- **Performance:** Must be fast enough for real-time interaction on standard hardware.
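
The output constraint above can be checked mechanically. The following is a minimal sketch, assuming a naive punctuation-based sentence splitter; `count_sentences`, `meets_output_constraint`, and `max_sentences` are hypothetical names for illustration, not code taken from the notebooks.

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences with a naive split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def meets_output_constraint(summary: str, max_sentences: int = 2) -> bool:
    """Return True if the summary is non-empty and at most max_sentences long."""
    return 1 <= count_sentences(summary) <= max_sentences
```

A splitter this simple will miscount abbreviations such as "e.g.", so it is best treated as a rough sanity check rather than a strict gate.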

## The Project Roadmap

We investigated this problem through three escalating levels of complexity. Here is how the notebooks are structured:

### **[01_eda.ipynb](./01_eda.ipynb) | Knowing the Data**
Before modeling, we must understand the terrain. This notebook explores the **SAMSum dataset**.
- **Why it matters:** We discover that this task isn't just extraction; it is **compression** (median 75% reduction) and **style transfer** (from informal first-person to formal third-person). This explains why simple extractive baselines fail.
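
As a rough illustration of how a compression figure like this can be computed, the sketch below measures the fraction of words removed between each dialogue and its summary and takes the median. It assumes whitespace-separated words as the unit of length; the notebook may count tokens or characters instead, and the function names are hypothetical.

```python
from statistics import median


def compression(dialogue: str, summary: str) -> float:
    """Fraction of words removed when going from dialogue to summary."""
    dialogue_len = len(dialogue.split())
    summary_len = len(summary.split())
    if dialogue_len == 0:
        return 0.0  # avoid dividing by zero on empty dialogues
    return 1 - summary_len / dialogue_len


def median_compression(pairs) -> float:
    """Median compression over an iterable of (dialogue, summary) pairs."""
    return median(compression(d, s) for d, s in pairs)
```

For example, a 40-word dialogue paired with a 10-word summary has a compression of 0.75.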

### **[02_experiment1_bert_gpt2.ipynb](./02_experiment1_bert_gpt2.ipynb) | The Frankenstein Architecture**
Can we build a summarizer by manually connecting two famous models? We create a custom `EncoderDecoderModel` using **DistilBERT** (to read) and **DistilGPT-2** (to write).
- **The Experiment:** We force two models that have never spoken to each other to cooperate via randomly initialized cross-attention layers.
- **The Takeaway:** It functions, but demonstrates the high cost of training encoder-decoder alignment from scratch.

### **[03_experiment2_bart_t5.ipynb](./03_experiment2_bart_t5.ipynb) | The Specialist Models**
We switch to models designed specifically for Sequence-to-Sequence tasks: **BART** (Denoising Autoencoder) and **T5** (Text-to-Text Transfer Transformer).
- **The Experiment:** We fine-tune these pretrained powerhouses on the same dataset.
- **The Takeaway:** Pre-trained cross-attention provides a dramatic leap in performance, stability, and convergence speed.

### **[04_experiment3_api_models.ipynb](./04_experiment3_api_models.ipynb) | The Frontier Giants**
We skip training entirely and ask the world's smartest models to do the work via API. We benchmark **GPT-5 Mini, Claude 4.5 Haiku, Gemini 2.5 Flash, Qwen 2.5,** and **Kimi K2**.
- **The Experiment:** Can zero-shot intelligence beat a fine-tuned specialist?
- **The Takeaway:** These models produce "vibes-based" summaries that are fluent but often miss the specific stylistic constraints of the ground truth.

### **[05_evaluation_and_conclusions.ipynb](./05_evaluation_and_conclusions.ipynb) | The Verdict**
We consolidate all metrics (ROUGE scores, inference latency, API costs, and qualitative error analysis) into a final trade-off analysis.
- **The Outcome:** A decision matrix for engineers choosing between accuracy, speed, privacy, and cost.
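
For readers unfamiliar with ROUGE, the sketch below shows the idea behind ROUGE-1 F1: unigram overlap between a reference summary and a candidate. It is a simplified illustration that assumes lowercased, whitespace-split tokens and no stemming; standard ROUGE libraries apply their own tokenization, so scores from this function will not match the notebook's numbers exactly. `rouge1_f1` is a hypothetical name.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1 based on clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per shared word.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A fluent summary that paraphrases the reference with different words scores low on this kind of metric, which is one reason zero-shot models can trail a fine-tuned model on ROUGE even when their output reads well.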

## Guided Reading Strategy

*   **For the Data Scientist:** Start with **01_eda** to see the n-gram analysis, then review the architecture code in **02_experiment1**.
*   **For the ML Engineer:** Focus on **03_experiment2** vs **04_experiment3** to compare the operational trade-offs of hosting local models vs. hitting external APIs.
*   **For the Stakeholder:** Skip straight to **05_evaluation_and_conclusions** for the cost/benefit analysis and final recommendations.