# Explanation of model pipeline

In [2]:
with open("sample_article.txt", "r") as f:
    sample_text = f.read()

print(sample_text)

New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos 

In [None]:
# import json

# sample_app_request = {"text": sample_text}

# with open("sample_app_input.json", "w", encoding="utf-8") as f:
#     json.dump(sample_app_request, f, indent=4, ensure_ascii=False)

## Normal pipeline

In [3]:
import os
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer(sample_text, max_length=130, min_length=30, do_sample=False)

print(result[0]["summary_text"])

Device set to use cuda:0


Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. If convicted, she faces up to four years in prison.


In [5]:
summarizer.model.config

BartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "force_bos_token_to_be_generated": true,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "length_penalty": 2.0,
  "ma

## Step by Step Explanation

### Understanding HuggingFace's Summarization Pipeline

Calling `pipeline("summarization", ...)` returns a **`SummarizationPipeline`**, which is a subclass of `Text2TextGenerationPipeline`.

#### Pipeline Execution Flow

When we run `summarizer(...)`, the execution follows this chain:

```
SummarizationPipeline.__call__() 
    ↓
Text2TextGenerationPipeline.__call__() 
    ↓
Pipeline.__call__()
```

#### 🔄 Three-Step Process

The `Pipeline.__call__()` method executes these functions **in order**:

| Step | Function | Purpose |
|------|----------|---------|
| **1** | `self.preprocess(inputs, **preprocess_params)` | Prepares input data for the model |
| **2** | `self.forward(model_inputs, **forward_params)` | Runs the model inference |
| **3** | `self.postprocess(model_outputs, **postprocess_params)` | Processes model outputs into final format |


**💡 Parameter Management**: The `preprocess_params`, `forward_params`, and `postprocess_params` come from `self._sanitize_parameters(**kwargs)`, which sorts the keyword arguments passed to the call into three dictionaries, one for each function.
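The three-step flow and the parameter routing can be illustrated without any model. The sketch below is a hypothetical stand-in, not the library's code: `ToyPipeline`, its whitespace "tokenizer", and the `uppercase` option are all invented for illustration. Only the method names mirror the ones listed in the table.

```python
class ToyPipeline:
    """Hypothetical stand-in that mimics preprocess -> forward -> postprocess."""

    def _sanitize_parameters(self, prefix=None, max_length=None, uppercase=None):
        # Route each keyword argument to the step that consumes it.
        preprocess_params, forward_params, postprocess_params = {}, {}, {}
        if prefix is not None:
            preprocess_params["prefix"] = prefix
        if max_length is not None:
            forward_params["max_length"] = max_length
        if uppercase is not None:
            postprocess_params["uppercase"] = uppercase
        return preprocess_params, forward_params, postprocess_params

    def preprocess(self, inputs, prefix=""):
        # "Tokenize" by splitting on whitespace.
        return {"tokens": (prefix + inputs).split()}

    def forward(self, model_inputs, max_length=3):
        # The toy "model" simply keeps the first max_length tokens.
        return {"tokens": model_inputs["tokens"][:max_length]}

    def postprocess(self, model_outputs, uppercase=False):
        text = " ".join(model_outputs["tokens"])
        return [{"summary_text": text.upper() if uppercase else text}]

    def __call__(self, inputs, **kwargs):
        pre, fwd, post = self._sanitize_parameters(**kwargs)
        model_inputs = self.preprocess(inputs, **pre)
        model_outputs = self.forward(model_inputs, **fwd)
        return self.postprocess(model_outputs, **post)


toy = ToyPipeline()
result = toy("she got married again in Westchester County", max_length=4)
print(result)
```

Here `max_length` only reaches `forward`, just as a generation argument passed to `summarizer(...)` only reaches the generation step.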

#### 🔍 **Preprocessing**

The preprocessing step transforms raw text into the tensors the model expects. Here is what the implementation does:

1. **Input Validation & Type Handling**: Ensures the input is either a single string or a list of strings, because the two cases are handled differently in the following steps.
```python
assert isinstance(args[0], str) or isinstance(args[0], list)
```

2. **Prefix Application**: A prefix conditions the model with a task-specific starter. For example, a general-purpose text-to-text generator may expect `"summarize: "` or `"translate English to French: "`. BART needs no task-specific prefix.
```python
prefix = prefix if prefix is not None else ""
```

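3. **Padding Strategy**: As Step 1 shows, the input is either a single string or a batch of strings. The prefix is applied to every string separately. A batch is padded so that all sequences end up with the same length, while a single string needs no padding.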
```python
if isinstance(args[0], list):
    args = ([prefix + arg for arg in args[0]],)
    padding = True
elif isinstance(args[0], str):
    args = (prefix + args[0],)
    padding = False
```

4. **Tokenization**: Neural network models cannot process a raw string, so each item in the batch is split into its constituent elements, called tokens. Tokens can be words, characters, or subword units. BART uses Byte Pair Encoding (BPE), which splits the text into subword units.
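To make the padding step concrete, here is a minimal sketch of what padding a batch to its longest sequence involves. It is not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the token IDs are illustrative, and a pad token ID of 1 with right-side padding is assumed.

```python
def pad_batch(batch_ids, pad_token_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative IDs only; a real tokenizer produces these values.
batch = pad_batch([[0, 27326, 11637, 2], [0, 27326, 2]], pad_token_id=1)
print(batch["input_ids"])
print(batch["attention_mask"])
```

A single string never goes through this step, which is why `padding` is `False` in that branch.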


##### ⚙️ **Configuration Parameters**

| Parameter | Value | Impact |
|-----------|--------|--------|
| **framework** | `"pt"` | Returns PyTorch tensors |
| **prefix** | `""` (empty for BART) | No task prefix needed |
| **truncation** | `DO_NOT_TRUNCATE` | The tokenizer does not truncate; the full input is passed on |
| **padding** | Dynamic | True for batches, False for single inputs |

In [36]:
from transformers import BartTokenizer, BartForConditionalGeneration, AutoConfig
from transformers.tokenization_utils import TruncationStrategy

model_name = "facebook/bart-large-cnn"
framework = "pt"

prefix = AutoConfig.from_pretrained(model_name).prefix

tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

loading configuration file config.json from cache at C:\Users\milli\.cache\huggingface\hub\models--facebook--bart-large-cnn\snapshots\37f520fa929c961707657b28798b30c003dd100b\config.json
Model config BartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "force_bos_token_to_be_generated": true,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"

In [37]:
from transformers.tokenization_utils import TruncationStrategy

def parse_and_tokenize(*args, truncation, prefix):
    assert isinstance(args[0], str) or isinstance(args[0], list), f" `args[0]`: {args[0]} has the wrong format. It should be either of type `str` or type `list`"
    
    prefix = prefix if prefix is not None else ""

    if isinstance(args[0], list):
        if tokenizer.pad_token_id is None:
            raise ValueError("Please make sure that the tokenizer has a pad_token_id when using a batch input")
        args = ([prefix + arg for arg in args[0]],)
        padding = True
    elif isinstance(args[0], str):
        args = (prefix + args[0],)
        padding = False

    inputs = tokenizer(*args, padding=padding, truncation=truncation, return_tensors=framework)
    if "token_type_ids" in inputs:
        del inputs["token_type_ids"]
    
    return inputs


def preprocess(inputs, truncation = TruncationStrategy.DO_NOT_TRUNCATE, prefix = "", **kwargs):
    inputs = parse_and_tokenize(inputs, truncation=truncation, prefix = prefix, **kwargs)
    return inputs

##### 🔬 **Real Example Breakdown**

From our test with the first 10 words:

**Input Text**: `"New York (CNN)When Liana Barrientos was 23 years old, she"`

**Tokenization Result**:
```
Tokens: ['<s>', ' New', ' York', ' (', 'CNN', ')', 'When', ' L', 'iana', ' Bar', 'rient', 'os', ' was', ' 23', ' years', ' old', ',', ' she', '</s>']
```

**Key Observations**:
- `<s>` and `</s>` are automatically added (start/end tokens)
- BART's byte-level BPE tokenizer marks a leading space with `Ġ`; in the list above it has been replaced with a plain space for readability
- Punctuation is handled separately: `(` and `)` are distinct tokens
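Where the `Ġ` marker comes from can be shown with a short sketch. Byte-level BPE tokenizers in the GPT-2 family map every byte to a visible character before applying merges, and the space byte happens to land on `Ġ`. The function below is a re-implementation of that mapping written for illustration; treat it as an approximation of the published table rather than the tokenizer's own code.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, newline, ...) are shifted above 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


table = bytes_to_unicode()
visible = "".join(table[b] for b in " New".encode("utf-8"))
print(visible)
```

The space is byte 32, the 33rd non-printable byte, so it maps to code point 256 + 32 = 288, which is `Ġ`.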

In [38]:
def choose_first_k_words(text, k=10):
    return " ".join(text.split(' ')[:k])

input_text = choose_first_k_words(sample_text)

tokenizer_result = preprocess(input_text, prefix=prefix)

# Use the public API rather than the private _convert_id_to_token helper
tokens = tokenizer.convert_ids_to_tokens(tokenizer_result["input_ids"].flatten().tolist())
reconstructed_text = tokenizer.convert_tokens_to_string(tokens)

print(input_text)
print([token.replace("Ġ", " ") for token in tokens])
print(reconstructed_text)

New York (CNN)When Liana Barrientos was 23 years old, she
['<s>', ' New', ' York', ' (', 'CNN', ')', 'When', ' L', 'iana', ' Bar', 'rient', 'os', ' was', ' 23', ' years', ' old', ',', ' she', '</s>']
<s> New York (CNN)When Liana Barrientos was 23 years old, she</s>


##### **How Tokenizers Are Used**

As we can see, BPE breaks a text into tokens, and each token has its own ID. Later in the pipeline, these IDs are used to look up the **embedding** of each token in an embedding layer:

If:

- Vocabulary size = $V$  
- Embedding dimension = $d$  
- Embedding matrix $E \in \mathbb{R}^{V \times d}$  

Then for a token ID $i$:

$$
\text{embedding}(i) = E[i]
$$

This is just a lookup into the $i$-th row of the embedding matrix.
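The row lookup is equivalent to multiplying a one-hot vector by $E$, which is why an embedding layer can be read as a linear layer without the wasted multiplications. A minimal pure-Python sketch with a made-up $4 \times 3$ matrix (the values and the helper names `embedding` and `one_hot_matmul` are illustrative assumptions):

```python
# Toy embedding matrix E with V = 4 rows (vocabulary) and d = 3 columns.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def embedding(i):
    """Look up row i of E."""
    return E[i]


def one_hot_matmul(i):
    """Compute one_hot(i) @ E explicitly."""
    one_hot = [1.0 if row == i else 0.0 for row in range(len(E))]
    d = len(E[0])
    return [sum(one_hot[r] * E[r][c] for r in range(len(E))) for c in range(d)]


token_ids = [0, 2, 3]
looked_up = [embedding(i) for i in token_ids]  # shape: (3 tokens, d = 3)
print(looked_up)
print(embedding(2) == one_hot_matmul(2))
```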

In [39]:
import torch

text = "apple pie"
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])

with torch.no_grad():
    embeddings = model.model.shared(inputs["input_ids"])

print(embeddings.shape)

tensor([[    0, 27326, 11637,     2]])
torch.Size([1, 4, 1024])


##### 🧩 **Byte Pair Encoding (BPE)**

**Byte Pair Encoding (BPE)** is a **subword tokenization algorithm** widely used in NLP models like GPT-2, RoBERTa, and BART. It works as follows:

1. **Start with characters**
   - The initial vocabulary contains all individual characters.

   Example:  
   `unhappiness` → [ "u", "n", "h", "a", "p", "p", "i", "n", "e", "s", "s" ]

2. **Find the most frequent pair of tokens**  
   - Count all adjacent pairs in the corpus.  
   - Merge the most frequent one into a new subword token.  
   - Add it to the vocabulary.

   Example:  
   - "h" + "a" → "ha"  
   - Later: "ha" + "p" → "hap"

3. **Repeat until vocabulary size is reached**  
   - Continue merging until you hit a predefined vocab size (e.g. 30k–50k tokens).  
   - Frequent subwords become single tokens, while rare words are broken into smaller pieces.

##### ⚙️ **Formula (merge step)**
At each step, find the pair:

$$
(a^*, b^*) = \arg\max_{(a,b)} \text{freq}(a, b)
$$

where $\text{freq}(a, b)$ = number of times token $a$ is followed by token $b$ in the corpus.

For better understanding, here is a sample implementation of BPE with the following corpus: `"unhappy", "happiness", "unhappiness"`

In [55]:
corpus = ["unhappy", "happiness", "unhappiness"]

In [56]:
# Step 1
def word_to_tokens(word):
    return list(word) + ["</w>"]

tokenized_corpus = [word_to_tokens(w) for w in corpus]
print("Initial tokenization:", tokenized_corpus)

Initial tokenization: [['u', 'n', 'h', 'a', 'p', 'p', 'y', '</w>'], ['h', 'a', 'p', 'p', 'i', 'n', 'e', 's', 's', '</w>'], ['u', 'n', 'h', 'a', 'p', 'p', 'i', 'n', 'e', 's', 's', '</w>']]


In [57]:
from collections import Counter

# Step 2
def get_stats(tokenized_corpus):
    pairs = Counter()
    for tokens in tokenized_corpus:
        for i in range(len(tokens)-1):
            pairs[(tokens[i], tokens[i+1])] += 1
    return pairs

stats = get_stats(tokenized_corpus)
print(stats)

Counter({('h', 'a'): 3, ('a', 'p'): 3, ('p', 'p'): 3, ('u', 'n'): 2, ('n', 'h'): 2, ('p', 'i'): 2, ('i', 'n'): 2, ('n', 'e'): 2, ('e', 's'): 2, ('s', 's'): 2, ('s', '</w>'): 2, ('p', 'y'): 1, ('y', '</w>'): 1})


In [62]:
def recalculate_tokens(pair, old_tokens):
    # Merge every adjacent occurrence of `pair` in one word into a single token.
    new_token = ''.join(pair)
    new_tokens = []
    i = 0
    while i < len(old_tokens):
        # Compare the pair itself, not the joined string, so that a different
        # split with the same concatenation (e.g. 'h' + 'ap') is not merged.
        if i < len(old_tokens)-1 and (old_tokens[i], old_tokens[i+1]) == pair:
            new_tokens.append(new_token)
            i += 2
        else:
            new_tokens.append(old_tokens[i])
            i += 1

    return new_tokens

def merge_vocab(pair, tokenized_corpus):
    new_corpus = [recalculate_tokens(pair, old_tokens) for old_tokens in tokenized_corpus]
    
    return new_corpus

pair = stats.most_common(1)[0][0]
print("Merging:", pair)

new_tokenized_corpus = merge_vocab(pair, tokenized_corpus)
print("New tokenized corpus:", new_tokenized_corpus)

Merging: ('h', 'a')
New tokenized corpus: [['u', 'n', 'ha', 'p', 'p', 'y', '</w>'], ['ha', 'p', 'p', 'i', 'n', 'e', 's', 's', '</w>'], ['u', 'n', 'ha', 'p', 'p', 'i', 'n', 'e', 's', 's', '</w>']]


In [63]:
# Let's run these three steps for 10 iterations
corpus = ["unhappy", "happiness", "unhappiness"]
tokenized_corpus = [word_to_tokens(w) for w in corpus]

for _ in range(10):
    pairs = get_stats(tokenized_corpus)
    if not pairs:
        break

    best_pair = pairs.most_common(1)[0][0]
    
    print("Merging:", best_pair)
    tokenized_corpus = merge_vocab(best_pair, tokenized_corpus)
    print(tokenized_corpus)

Merging: ('h', 'a')
[['u', 'n', 'ha', 'p', 'p', 'y', '</w>'], ['ha', 'p', 'p', 'i', 'n', 'e', 's', 's', '</w>'], ['u', 'n', 'ha', 'p', 'p', 'i', 'n', 'e', 's', 's', '</w>']]
Merging: ('ha', 'p')
[['u', 'n', 'hap', 'p', 'y', '</w>'], ['hap', 'p', 'i', 'n', 'e', 's', 's', '</w>'], ['u', 'n', 'hap', 'p', 'i', 'n', 'e', 's', 's', '</w>']]
Merging: ('hap', 'p')
[['u', 'n', 'happ', 'y', '</w>'], ['happ', 'i', 'n', 'e', 's', 's', '</w>'], ['u', 'n', 'happ', 'i', 'n', 'e', 's', 's', '</w>']]
Merging: ('u', 'n')
[['un', 'happ', 'y', '</w>'], ['happ', 'i', 'n', 'e', 's', 's', '</w>'], ['un', 'happ', 'i', 'n', 'e', 's', 's', '</w>']]
Merging: ('un', 'happ')
[['unhapp', 'y', '</w>'], ['happ', 'i', 'n', 'e', 's', 's', '</w>'], ['unhapp', 'i', 'n', 'e', 's', 's', '</w>']]
Merging: ('i', 'n')
[['unhapp', 'y', '</w>'], ['happ', 'in', 'e', 's', 's', '</w>'], ['unhapp', 'in', 'e', 's', 's', '</w>']]
Merging: ('in', 'e')
[['unhapp', 'y', '</w>'], ['happ', 'ine', 's', 's', '</w>'], ['unhapp', 'ine', 's', 
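The loop above only learns merges. To tokenize a word that was not in the corpus, the learned merges are replayed in the order they were found. The following is a minimal sketch under that assumption: `apply_merges` is a hypothetical helper, and the merge list is copied from the first seven merges printed above.

```python
def apply_merges(word, merges):
    """Tokenize a word by replaying learned merges in order."""
    tokens = list(word) + ["</w>"]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Merges in the order they were learned from the toy corpus.
merges = [
    ("h", "a"), ("ha", "p"), ("hap", "p"), ("u", "n"),
    ("un", "happ"), ("i", "n"), ("in", "e"),
]

print(apply_merges("happy", merges))
print(apply_merges("unhappily", merges))
```

The frequent stem is kept as one token, while the unseen suffix `ily` falls back to single characters.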

#### Forward

When generating text (e.g., summaries), the decoder does not have to commit to the single most likely token at every step (greedy search).  
Two decoding concepts matter for this pipeline:

---

#### 🔹 Beam Search
- Instead of keeping **only the single best sequence** at each decoding step (greedy search),  
  beam search keeps the **top *k* candidates** (the *beam size*).
- At each step:
  - Expand all current candidates by one token.
  - Keep the top *k* most probable sequences.
- This explores more possibilities and often produces **higher-quality summaries**.

Example (beam size = 3):
- Step 1: Keep 3 most likely first tokens.
- Step 2: Expand each, keep 3 best partial sequences.
- Continue until end-of-sequence.
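The difference between greedy search and beam search can be seen on a toy problem. The sketch below is not how `generate` is implemented: the `NEXT` probability table is invented, the "model" only looks at the previous token, and no length penalty is applied.

```python
import math

# Hypothetical next-token probabilities, conditioned on the previous token only.
NEXT = {
    "<s>": {"the": 0.5, "a": 0.4, "</s>": 0.1},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "dog": 0.05, "</s>": 0.05},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams, max_steps=5):
    """Return (log-probability, tokens) of the best sequence found."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_steps):
        # Expand every live beam by one token.
        candidates = []
        for score, seq in beams:
            for token, prob in NEXT[seq[-1]].items():
                candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the num_beams most probable candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:num_beams]:
            if seq[-1] == "</s>":
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[0])


greedy = beam_search(num_beams=1)
beam = beam_search(num_beams=2)
print(greedy[1], round(math.exp(greedy[0]), 3))
print(beam[1], round(math.exp(beam[0]), 3))
```

With one beam the search commits to `the` (0.5) and ends with a sequence of probability 0.2. With two beams it also keeps `a` (0.4) alive and finds `a cat`, whose total probability is 0.36.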

---

#### 🔹 Early Stopping
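- During beam search, hypotheses that have produced an end token (`</s>`) are collected as finished candidates while the remaining beams keep expanding.
- With `early_stopping=True`, generation stops as soon as there are `num_beams` finished candidates.
- With `early_stopping=False` (the default in `generate`), a heuristic stops generation once finding a better candidate becomes very unlikely.
- Stopping early makes generation **faster**, at the risk of missing a higher-scoring candidate that would have finished later.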

---

✅ **In practice (Hugging Face `generate`):**
- `num_beams=4` → use beam search with 4 beams.
- `early_stopping=True` → stop once `num_beams` finished candidates are available.

In [76]:
def check_inputs(input_length, min_length, max_length):
    if max_length < min_length:
        print(f"Your min_length={min_length} must be smaller than your max_length={max_length}.")

    if input_length < max_length:
        print(
            f"Your max_length is set to {max_length}, but your input_length is only {input_length}. Since this is "
            "a summarization task, where outputs shorter than the input are typically wanted, you might "
            f"consider decreasing max_length manually, e.g. summarizer('...', max_length={input_length // 2})"
        )

def forward(model, model_inputs, **generate_kwargs):
    in_b, input_length = model_inputs["input_ids"].shape

    check_inputs(
        input_length,
        generate_kwargs.get("min_length", 25),
        generate_kwargs.get("max_length", 60),
    )

    output_ids = model.generate(**model_inputs, **generate_kwargs)
    out_b = output_ids.shape[0]

    output_ids = output_ids.reshape(in_b, out_b // in_b, *output_ids.shape[1:])
    return {"output_ids": output_ids}


model_inputs = preprocess(sample_text, prefix=prefix)
generate_kwargs = {
    "num_beams": model.config.num_beams,
    "max_length": model.config.max_length,
    "min_length": model.config.min_length,
    "early_stopping": model.config.early_stopping
}

model_outputs = forward(model, model_inputs, **generate_kwargs)

# Use the public API rather than the private _convert_id_to_token helper
tokens = tokenizer.convert_ids_to_tokens(model_outputs["output_ids"].flatten().cpu().tolist())
summary_text = tokenizer.convert_tokens_to_string(tokens)

print(summary_text)

Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "length_penalty": 2.0,
  "max_length": 142,
  "min_length": 56,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1
}



</s><s>Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. At one time, she was married to eight men at once, prosecutors say.</s>


#### Postprocessing

Postprocessing decodes the generated token IDs back into clean text and strips special tokens such as `<s>` and `</s>`.

In [79]:
def postprocess(model_outputs, clean_up_tokenization_spaces=False, return_name="summary"):
    records = []
    for output_ids in model_outputs["output_ids"][0]:
        record = {
            f"{return_name}_text": tokenizer.decode(
                output_ids,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            )
        }
        records.append(record)

    return records

output = postprocess(model_outputs)
output[0]["summary_text"]

'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. At one time, she was married to eight men at once, prosecutors say.'