In NLP, the current leading architectures are Transformer-based, and they come in three types: encoder-only models like BERT, decoder-only models like GPT and Llama, and encoder-decoder models. Encoder-decoder models can be implemented using various neural network architectures, and their names often reflect the type of network used or the specific application. Here are some key examples:
1. Transformer-based Encoder-Decoder Models:
These models are based on the Transformer architecture, which has significantly advanced the field of natural language processing.
T5 (Text-to-Text Transfer Transformer): T5 is a Transformer-based encoder-decoder model that treats all NLP tasks as text-to-text problems. It's widely used for tasks like machine translation, summarization, and question answering.
BART (Bidirectional and Auto-Regressive Transformer): BART is another Transformer-based encoder-decoder model that excels at denoising sequence-to-sequence tasks. It's particularly effective for text generation and summarization.
Pegasus (Pre-training with Extracted Gap-sentences for Abstractive Summarization): Pegasus is a Transformer-based encoder-decoder model specifically designed for abstractive summarization. During pre-training, important sentences are removed from the input document and the model learns to generate them, which closely resembles the summarization task.
mT5 (Massively Multilingual Text-to-Text Transformer): mT5 is a multilingual variant of T5, pre-trained on a large corpus of text in many languages.
FLAN-T5 (introduced in the paper "Scaling Instruction-Finetuned Language Models"): FLAN-T5 is an extension of T5 that has been fine-tuned on a wide range of tasks and instructions to improve its generalization capabilities.
CodeT5: This is a variant of T5 designed specifically for code understanding and generation.
UL2 (Unifying Language Learning Paradigms): UL2 is another Transformer-based encoder-decoder model, pre-trained with a mixture of denoising objectives so that a single model handles several language learning paradigms.
FLAN-UL2: This is a finetuned version of UL2 with improved performance on various tasks.
EdgeFormer: A Transformer-based encoder-decoder model designed for efficient seq2seq generation on devices with limited resources.
2. Models Utilizing Encoder-Decoder Architecture for Specific Tasks:
Encoder-decoder architectures can also be used as components within larger models designed for specific tasks:
VisionEncoderDecoderModel: This model initializes an image-to-text model with a pretrained vision model (like ViT) as the encoder and a pretrained language model (like BERT or GPT2) as the decoder. This allows it to perform tasks like image captioning and optical character recognition (OCR).
TrOCR (Transformer-based Optical Character Recognition): TrOCR is a specific instance of the VisionEncoderDecoderModel architecture, fine-tuned for OCR.
Note: Encoder-decoder architecture is a framework, and specific implementations can vary in their internal network structure (RNN, CNN, Transformer) and pre-training objectives.
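
The difference between the three families largely comes down to which positions each token is allowed to attend to. The following is a minimal sketch in plain Python, not taken from any library (the function names are illustrative): an encoder uses a bidirectional mask, a decoder uses a causal mask, and an encoder-decoder adds cross-attention from each decoder position to all encoder positions.

```python
def bidirectional_mask(n):
    """Encoder-style self-attention: every token may attend to every token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style self-attention: token i may attend to tokens 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def cross_attention_mask(n_target, n_source):
    """Encoder-decoder cross-attention: each target token sees all source tokens."""
    return [[1] * n_source for _ in range(n_target)]


# Print the causal mask for a 4-token sequence, one row per query position.
for row in causal_mask(4):
    print(row)
```

A 1 at row i, column j means that position i may attend to position j. Real implementations usually express the same idea with additive masks on the attention scores rather than 0/1 lists.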

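LangChain can be used to build the LLM application logic. The web side of the app is still served with a framework such as FastAPI. If you want to include agents, LangGraph is used. The model can be fine-tuned in a parameter-efficient way using LoRA or QLoRA. LoRA (low-rank adaptation) freezes the base weights and trains only small low-rank adapter matrices, which suits cases where memory is a constraint but the base model should keep its original precision. QLoRA (quantized LoRA) additionally loads the frozen base model in 4-bit precision, which reduces memory use further in exchange for a minimal loss in performance.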
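
To make the memory argument for LoRA concrete, here is a small worked example in plain Python. It is a sketch under stated assumptions: the layer size of 2048 x 2048 is chosen for illustration only, while `r = 64` and `lora_alpha = 16` mirror the `LoraConfig` values used later in this notebook. For one weight matrix, LoRA trains two factors, A (r x d_in) and B (d_out x r), and scales their product by `lora_alpha / r`.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the full weight matrix that LoRA leaves frozen."""
    return d_in * d_out


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r


d = 2048  # assumed square projection size, for illustration only
r = 64
lora = lora_trainable_params(d, d, r)
full = full_params(d, d)

print(lora, full, lora / full)  # 262144 4194304 0.0625
print(lora_scaling(16, r))      # 0.25
```

Under these assumptions the adapter trains about 6% of the parameters of the layer it wraps, and only for the layers it is attached to.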

Llama 2, Llama 3, and Mistral models were attempted but did not fit within the available system RAM and GPU memory. TinyLlama was used instead, with a dataset that contains two columns: one with a full clinical note and the other with the summary written by a healthcare provider. The LLM was then given the task of summarizing a synthetic clinical note generated with ChatGPT-4. The result was an accurate response with extra, unnecessary details.

In [1]:
!pip install triton==2.1.0 bitsandbytes==0.41.0 peft==0.7.0 transformers==4.38.2 accelerate==0.30.0 trl==0.4.7

ERROR: Could not find a version that satisfies the requirement triton==2.1.0 (from versions: 2.2.0, 2.3.0, 2.3.1, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.3.1, 3.4.0, 3.5.0, 3.5.1)
ERROR: No matching distribution found for triton==2.1.0

In [2]:
!pip install huggingface_hub
!pip install numpy
import numpy as np



In [3]:
#!pip install triton
!pip install trl
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline)
from transformers.generation import LogitsProcessorList, TopKLogitsWarper, TopPLogitsWarper
from trl import SFTTrainer
from peft import LoraConfig
from datasets import load_dataset

Collecting trl
  Downloading trl-0.26.2-py3-none-any.whl.metadata (11 kB)
Downloading trl-0.26.2-py3-none-any.whl (518 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m518.9/518.9 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: trl
Successfully installed trl-0.26.2




The full-size Llama models require too much system RAM and crash the session, which is the reason for using TinyLlama.
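
A rough back-of-the-envelope estimate shows why. The sketch below counts memory for the weights alone, assuming rounded parameter counts (1.1 billion and 7 billion) and ignoring activations, optimizer state, and framework overhead, so real usage is higher. The helper name and the dtype table are illustrative.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(n_params, dtype):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in [("1.1B model", 1.1e9), ("7B model", 7e9)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name:>10} {dtype:>8}: {weight_memory_gb(n_params, dtype):5.1f} GB")
```

A 7B model needs roughly 28 GB for float32 weights or 14 GB for float16, while a 1.1B model needs about 4.4 GB or 2.2 GB respectively.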

In [4]:
llama_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    # Remove quantization_config
).to('cpu') # Explicitly move the model to CPU
llama_model.config.use_cache = False
llama_model.config.pretraining_tp = 1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

The code below is specifically for GPU, because the bitsandbytes quantization config is intended for GPU use. The code above loads the model on CPU.

In [5]:
# llama_model = AutoModelForCausalLM.from_pretrained(
#     pretrained_model_name_or_path="aboonaji/llama2finetune-v3",
#     quantization_config=BitsAndBytesConfig(
#         load_in_4bit=True,
#         bnb_4bit_compute_dtype=torch.float16,
#         bnb_4bit_quant_type="nf4",
#     ),
# )
# llama_model.config.use_cache = False
# llama_model.config.pretraining_tp = 1

In [6]:
llama_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path= "TinyLlama/TinyLlama-1.1B-Chat-v1.0", trust_remote_code = True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

In [7]:
training_arguments = TrainingArguments(output_dir= "./results", per_device_train_batch_size = 1, max_steps = 4)

In [8]:
!pip install -U datasets huggingface_hub fsspec


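# Downloads and caches the dataset; the return value is discarded here.
load_dataset("geekdom/clinical_data")
# Note: this prints the load_dataset function object, not the dataset.
# The dataset itself is loaded and assigned in the next cell.
print(load_dataset)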

Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-1.2.3-py3-none-any.whl.metadata (13 kB)
Collecting fsspec
  Downloading fsspec-2025.12.0-py3-none-any.whl.metadata (10 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Collecting fsspec
  Downloading fsspec-2025.10.0-py3-none-any.whl.metadata (10 kB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m511.6/511.6 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading huggingface_hub-1.2.3-py3-none-any.whl (520 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m521.0/521.0 kB[0m [31m11.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2025.10.0-py3-none-any.whl (200 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m201.0/201.0 kB[0m [31m7.0 MB/s[0m eta 

refined_clinical_data.jsonl:   0%|          | 0.00/53.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/19756 [00:00<?, ? examples/s]

<function load_dataset at 0x7f7bb576f740>


In [9]:
from datasets import load_dataset

dataset = load_dataset("geekdom/clinical_data", split="train")

def format_for_sft(example):
    return {
        "prompt": f"Summarize:\n{example['prompt']}\n\nSummary:",
        "completion": f" {example['response']}"  # note the leading space
    }

# Apply formatting
dataset = dataset.map(format_for_sft)

# Keep only the necessary columns
dataset = dataset.remove_columns([col for col in dataset.column_names if col not in ["prompt", "completion"]])

print(dataset[0])

Map:   0%|          | 0/19756 [00:00<?, ? examples/s]

{'prompt': "Summarize:\nBelow is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Hospital Course Summary:\n\nAdmission Date: [Insert date]\nDischarge Date: [Insert date]\n\nPatient: [Patient's Name]\nSex: Male\nAge: 57 years\n\nAdmission Diagnosis: Oxygen Desaturation\n\nHospital Course:\n\nThe patient was admitted to the ICU one week after a positive COVID-19 result due to oxygen desaturation. Physical therapy was initiated promptly after admission, which helped improve the patient's breathing frequency and oxygen saturation. The patient was guided to achieve a prone position resulting in a significant increase in oxygen saturation from 88% to 96%. The patient continued to receive intensive physical therapy, positioning, and oxygen therapy for the next few days. Although there were challenges in achieving the prone position due to the patient's profoundly reduced respiratory capacity and high risk of symptom exacerbatio

In [10]:

llama_sft_trainer = SFTTrainer(model = llama_model,
                               args = training_arguments,
                               train_dataset = dataset,
                               peft_config = LoraConfig(task_type = "CAUSAL_LM", r = 64, lora_alpha = 16, lora_dropout = 0.1))

Adding EOS to train dataset:   0%|          | 0/19756 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/19756 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/19756 [00:00<?, ? examples/s]

In [11]:
llama_sft_trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"




Step,Training Loss




TrainOutput(global_step=4, training_loss=0.7590062618255615, metrics={'train_runtime': 225.241, 'train_samples_per_second': 0.018, 'train_steps_per_second': 0.018, 'total_flos': 16197573070848.0, 'train_loss': 0.7590062618255615, 'entropy': 1.4997917115688324, 'num_tokens': 2587.0, 'mean_token_accuracy': 0.8140986263751984, 'epoch': 0.00020247013565499088})
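
The metrics above can be cross-checked with a little arithmetic. This sketch assumes a single device and the default gradient accumulation of 1, so each step consumes exactly one example; all numbers are copied from the cells and logs above.

```python
max_steps = 4              # from TrainingArguments
per_device_batch_size = 1  # from TrainingArguments
dataset_size = 19756       # from the "Generating train split" progress line
train_runtime_s = 225.241  # from TrainOutput

# Examples consumed during the whole run.
samples_seen = max_steps * per_device_batch_size
# Fraction of one pass over the dataset that the run covered.
epoch_fraction = samples_seen / dataset_size
# Throughput, to compare with train_samples_per_second.
samples_per_second = samples_seen / train_runtime_s

print(samples_seen)
print(epoch_fraction)
print(round(samples_per_second, 3))
```

The run therefore saw only 4 of the 19,756 examples, about 0.02% of one epoch, which matches the reported `epoch` value and explains why the fine-tuned model behaves much like the base model.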

In [12]:
user_prompt = (
    """Below is an instruction that describes a task. Write a response that appropriately completes the request.\n
    Make sure to use line breaks when appropriate.\n
    ### Instruction:\n
    Below is a clinical note.\n
    Your task is to summarize the clinical note in that you describe the patient's course of progression.\n
    ### Input:\n
    Hospital Course: The patient was admitted to the ICU six days after testing positive for COVID-19 due to worsening respiratory distress.\n
    Early physical therapy and prone positioning led to improved oxygen saturation from 85% to 94%. After five days of supportive care and rehabilitation,\n
    the patient was transferred to the general ward and continued progressing with assisted ambulation and breathing exercises.\n
    Discharge Condition: At discharge, the patient was stable, breathing comfortably on room air, and able to walk short distances with minimal assistance.\n
    Oxygen saturation and respiratory rate were within normal limits.\n
    ### Response:"""
)
#text_generation_pipeline = pipeline(task = "text-generation", model = llama_model, tokenizer = llama_tokenizer, max_length = 200)
#model_answer =  text_generation_pipeline(user_prompt)
#print(model_answer[0]['generated_text'])
inputs = llama_tokenizer(user_prompt, return_tensors="pt", padding = True, truncation = True)

# Ensure tensors are on CPU if you're CPU-only
inputs = {key: val.to("cpu") for key, val in inputs.items()}

with torch.no_grad():
    outputs = llama_model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.95
    )

response = llama_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

    Make sure to use line breaks when appropriate.

    ### Instruction:

    Below is a clinical note.

    Your task is to summarize the clinical note in that you describe the patient's course of progression.

    ### Input:

    Hospital Course: The patient was admitted to the ICU six days after testing positive for COVID-19 due to worsening respiratory distress.

    Early physical therapy and prone positioning led to improved oxygen saturation from 85% to 94%. After five days of supportive care and rehabilitation,

    the patient was transferred to the general ward and continued progressing with assisted ambulation and breathing exercises.

    Discharge Condition: At discharge, the patient was stable, breathing comfortably on room air, and able to walk short distances with minimal assistance.

    Oxygen saturation and respiratory rate were within normal limits.

    ### Re

In [17]:
import json
from google.colab import _message

# Fetch the current notebook's JSON from the Colab frontend.
# Assumption: this private API returns the notebook under the "ipynb" key.
# The response has no "notebookPath" key, which raised a KeyError in the original run.
nb = _message.blocking_request('get_ipynb')['ipynb']

# Remove metadata.widgets if it exists
nb.get("metadata", {}).pop("widgets", None)

# Save a cleaned version (hypothetical output filename)
clean_path = "notebook_clean.ipynb"
with open(clean_path, "w", encoding="utf-8") as f:
    json.dump(nb, f, indent=2)

print(f"Cleaned notebook saved to {clean_path}")

# Result:

Below is the input given to the model. The model produces a decent answer, but the answer depends heavily on the instructions given. The answer is accurate but lacks precision, because extra details are sometimes included before or after the response. Because sampling is enabled (`do_sample=True`), the response differs with each generation.
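
Part of the extra text is the prompt itself: for decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so decoding the whole sequence echoes the input. A minimal sketch of separating the two is shown below, using toy token ids (hypothetical values) in place of real tokenizer output; the helper name is illustrative.

```python
def generated_only(output_ids, prompt_length):
    """Drop the echoed prompt tokens and keep only the newly generated ones."""
    return output_ids[prompt_length:]


# Toy token ids standing in for real tokenizer output (hypothetical values).
prompt_ids = [1, 529, 29989, 1792]
output_ids = prompt_ids + [450, 16500, 2]

print(generated_only(output_ids, len(prompt_ids)))  # [450, 16500, 2]
```

In this notebook's terms, that would correspond to slicing `outputs[0]` at `inputs["input_ids"].shape[1]` before calling `llama_tokenizer.decode`. For repeatable output, sampling could be turned off with `do_sample=False`.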

# Given Input

Hospital Course:

The patient was admitted to the ICU six days after testing positive for COVID-19 due to worsening respiratory distress. Early physical therapy and prone positioning led to improved oxygen saturation from 85% to 94%. After five days of supportive care and rehabilitation, the patient was transferred to the general ward and continued progressing with assisted ambulation and breathing exercises.

Discharge Condition:

At discharge, the patient was stable, breathing comfortably on room air, and able to walk short distances with minimal assistance. Oxygen saturation and respiratory rate were within normal limits.