# AI Text Summarization Agent using Open Source LLM from HuggingFace

This is the second notebook in the series of experiments where I will build different AI agents using open-source LLMs from HuggingFace.

### Google Colab
I will use Google Colab to write and run the Python code for these AI agents. Why did I choose Google Colab instead of my local computer?
1. Free access to T4 GPUs, which speed up inference for many open-source LLMs.
2. Easy ability to share code and collaborate.

### Hugging Face
I need a HuggingFace account and an access token so that this Colab notebook can download and use open-source models from the HuggingFace Hub. Here are the steps:
1. Create a free HuggingFace account at https://huggingface.co
2. Navigate to Settings from the user menu on the top right.
3. Create a new API token with **write** permissions.
4. Return to this Colab notebook:
  * Press the "key" icon on the side panel to the left.
  * Click on "Add new secret".
  * In the name field, put `HF_TOKEN`.
  * In the value field, put your actual token: hf_...
  * Ensure the notebook access switch is turned ON.

This way I can use confidential API keys, for HuggingFace or other services, without typing them into the Colab notebook that I will be sharing with others.

In [None]:
# Check GPU availability and specifications, such as memory usage, temperature, power draw, and utilization.
# The same details are available under Runtime (top menu) > View resources.
!nvidia-smi

In [2]:
# To use open-source models, this Colab notebook logs in to HuggingFace with the token stored above.
# The huggingface_hub library lets us interact with the HuggingFace Hub, a platform that hosts open-source LLMs and datasets.

from huggingface_hub import login
from google.colab import userdata

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)
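
The login cell above only works inside Colab, because `google.colab` does not exist elsewhere. As a minimal sketch, the token lookup could fall back to an environment variable when the notebook runs on another machine. The helper name `get_hf_token` and the environment-variable fallback are my assumptions for illustration; they are not part of the HuggingFace or Colab APIs.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets if possible, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            token = userdata.get(secret_name)
            if token:
                return token
        except Exception:
            # Secret missing or notebook access switched off: fall through
            pass

    # Assumption for this sketch: the token was exported as an environment variable
    return os.environ.get(secret_name)
```

The result can then be passed to `login(...)` in place of `hf_token`. When no token is found the helper returns `None`, so it is worth checking the value before logging in.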

### Model Selection

I will select a model from the HuggingFace model library based on the specific application. Here are the steps:

* Go to https://huggingface.co/models.
* Click on Summarization under NLP.
* Choose any model and review its specification.
* I am choosing the `facebook/bart-large-cnn` model.

Note: We should select a model based on criteria such as the specific use case, available infrastructure, latency, and performance. I will cover those in detail later.

### Approach 1 - Using HuggingFace Pipeline Library

This is the simpler approach. The Hugging Face pipeline API provides a high-level, task-specific interface for running inference with pretrained models without manually handling tokenization, preprocessing, or postprocessing.

This approach is ideal for quick experimentation or prototyping, when we don't need granular control over the model's behavior.

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

# Load the pipeline with the desired task and the model
summarizer = pipeline(
    task="summarization",
    model="facebook/bart-large-cnn"
)
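
Depending on the transformers version, a pipeline created without a `device` argument may stay on the CPU even when the Colab GPU is available. A minimal sketch of picking the device explicitly is below. The helper name `pick_pipeline_device` is hypothetical, and it assumes the integer convention of `pipeline(..., device=...)`, where `-1` means CPU and `0` means the first CUDA GPU.

```python
def pick_pipeline_device():
    """Return 0 (first CUDA GPU) when one is available, else -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, fall back to CPU
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_pipeline_device()
print("pipeline device:", device)
```

The value can then be passed along when creating the summarizer, for example `pipeline(task="summarization", model="facebook/bart-large-cnn", device=device)`.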

In [None]:
# Application 1

# Provide the input text to summarize
text_to_summarize = """
Model quantization makes it possible to deploy increasingly complex deep learning models in resource-constrained environments
without sacrificing significant model accuracy. As AI models, especially generative AI models, grow in size and computational
demands, quantization addresses challenges such as memory usage, inference speed, and energy consumption by reducing the
precision of model parameters (weights and/or activations), e.g., from FP32 precision to FP8 precision. This reduction
decreases the model’s size and computational requirements, enabling faster computation during inference and lower power
consumption compared to the original model. However, quantization can lead to some accuracy degradation compared to the
original model. Finding the right tradeoff between model accuracy and efficiency depends heavily on the specific use case.
"""

# Run inference
summary = summarizer(text_to_summarize, max_length=100, min_length=50, do_sample=False)
# max_length and min_length are measured in tokens, not words or characters.
# do_sample=False (the default) decodes deterministically, which suits factual, consistent summaries.
# do_sample=True samples tokens at random, which suits more varied, creative summaries.

print(summary)
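
The `print(summary)` call above shows the raw pipeline result, which for the summarization task is a list with one dictionary per input, each holding the text under the `summary_text` key. A small sketch for pulling out just the strings follows. The helper name `extract_summaries` and the example text are placeholders I made up; the example list is not real model output.

```python
def extract_summaries(pipeline_output):
    """Pull the summary strings out of a summarization pipeline result."""
    return [item["summary_text"].strip() for item in pipeline_output]


# Stand-in for the `summary` variable from the cell above (placeholder text)
example_output = [
    {"summary_text": " Quantization reduces the precision of model parameters. "}
]

for text in extract_summaries(example_output):
    print(text)
```

In the notebook, `extract_summaries(summary)[0]` would give the summary of the single input text as a plain string.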

### Approach 2 - Using HuggingFace Tokenizer & Model Libraries

This approach is more hands-on than the pipeline API. Here, I will use the tokenizer and model classes from HuggingFace directly. It involves more code, but it gives us more granular access to the model execution and its intermediate values, such as token IDs, attention masks, and model logits.

This approach is ideal when we need custom preprocessing, custom post-processing, fine-tuning, debugging of model behavior, or our own pipeline.

In [5]:
# Import the tokenizer and model classes
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the specific tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

In [None]:
# Application 1

# Provide the input text to summarize
text_to_summarize = """
Model quantization makes it possible to deploy increasingly complex deep learning models in resource-constrained environments
without sacrificing significant model accuracy. As AI models, especially generative AI models, grow in size and computational
demands, quantization addresses challenges such as memory usage, inference speed, and energy consumption by reducing the
precision of model parameters (weights and/or activations), e.g., from FP32 precision to FP8 precision. This reduction
decreases the model’s size and computational requirements, enabling faster computation during inference and lower power
consumption compared to the original model. However, quantization can lead to some accuracy degradation compared to the
original model. Finding the right tradeoff between model accuracy and efficiency depends heavily on the specific use case.
"""

# Tokenize the input text
inputs = tokenizer(
    text_to_summarize,
    max_length=1024,
    truncation=True,  # without this, max_length is not enforced
    return_tensors="pt"
)
# max_length=1024 is the maximum number of input tokens bart-large-cnn can handle at a time;
# truncation=True cuts longer inputs down to that limit.
# Hence, to summarize larger content, we need to split it into chunks and then summarize each chunk.

# Run the model to generate the output tokens
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # marks which input tokens are real (not padding)
    num_beams=4,
    min_length=30,
    max_length=100
)
# num_beams sets how many candidate sequences beam search keeps at each step. Below are some options:
# num_beams=1: greedy search, which is fast but usually gives lower quality
# num_beams=2 to 4: light beam search, which is slightly slower but usually gives better quality
# num_beams=5 to 8: wider beam search, which is much slower, with diminishing quality gains

# Decode tokens back to text (truncation is a tokenization option, so it is not passed here)
summary = tokenizer.batch_decode(
    summary_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print("Summarized content:", summary)
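
The tokenization comment above notes that larger content has to be split into chunks that are summarized one at a time. One way to do that is sketched below, under stated assumptions: it splits on sentence boundaries with a simple regular expression, and it counts tokens with a whitespace word count as a rough stand-in for a real tokenizer. The names `split_into_chunks`, `summarize_long_text`, `count_tokens`, and `summarize_fn` are hypothetical helpers for illustration, not part of the transformers library.

```python
import re


def count_words(text):
    """Rough stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def split_into_chunks(text, max_tokens, count_tokens=count_words):
    """Group sentences into chunks whose token count stays within max_tokens.

    A single sentence longer than max_tokens is kept as its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_long_text(text, summarize_fn, max_tokens):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in split_into_chunks(text, max_tokens))


sample = "One two three. Four five six. Seven eight nine ten."
chunks = split_into_chunks(sample, max_tokens=6)
print(chunks)
```

With the tokenizer loaded earlier, a closer token count could be passed as `count_tokens=lambda s: len(tokenizer(s)["input_ids"])`, and `summarize_fn` could wrap the tokenize, generate, and decode steps from this cell. Joining partial summaries is the simplest option; a second summarization pass over the joined text is another one.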