## Open notebook in:
| Colab | Gradient |
|:------|:---------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/Nicolepcx/transformers-the-definitive-guide/blob/main/CH02/ch02_mistral_7b_32_K_llama_index_summarization) | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com//github.com/Nicolepcx/transformers-the-definitive-guide/blob/main/CH02/ch02_mistral_7b_32_K_llama_index_summarization) |

# About this notebook


In this notebook, you download a publicly accessible file from Google Drive and process it with LlamaIndex, using Mistral as the LLM to perform the summarization. You load the Mistral model with 4-bit [quantization](https://huggingface.co/docs/bitsandbytes/main/en/index) so that it needs less GPU memory than the full-precision model.


# Installs

In [None]:
!pip -q install llama-index-llms-huggingface==0.1.5 \
                llama-index-embeddings-huggingface==0.2.0 \
                loralib==0.1.2 \
                sentencepiece==0.1.99 \
                bitsandbytes==0.43.0 \
                accelerate==0.28.0 \
                llama-index==0.10.33

In [None]:
!pip install flash-attn --no-build-isolation -q

# Imports

In [None]:
import os
import requests
import torch
import transformers
from textwrap import TextWrapper

from huggingface_hub import HfApi, HfFolder

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizer,
    PreTrainedModel,
    BitsAndBytesConfig,
    pipeline,
)

from llama_index.core import (
    SummaryIndex,
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
    Settings,
    PromptTemplate,
)

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.llms import ChatMessage

In [None]:
def print_wrapper(print):
    """Adapted from: https://stackoverflow.com/questions/27621655/how-to-overload-print-function-to-expand-its-functionality/27621927"""

    def function_wrapper(text):
        if not isinstance(text, str):
            text = str(text)
        wrapper = TextWrapper()
        return print("\n".join([wrapper.fill(line) for line in text.split("\n")]))

    return function_wrapper

print = print_wrapper(print)
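
The wrapper relies on `TextWrapper`'s default width of 70 characters, which is why the printed outputs in this notebook break at roughly that column. Here is a minimal standalone sketch of the same idea, using a hypothetical `wrap_text` helper instead of overriding `print`:

```python
from textwrap import TextWrapper


def wrap_text(text, width=70):
    """Wrap each line of `text` separately so existing line breaks are kept."""
    wrapper = TextWrapper(width=width)
    return "\n".join(wrapper.fill(line) for line in str(text).split("\n"))


sample = "word " * 40  # one long line of 200 characters
wrapped = wrap_text(sample)
longest = max(len(line) for line in wrapped.split("\n"))

print(wrapped)
print(f"Longest line: {longest} characters")
```

Wrapping line by line, rather than passing the whole text to `fill` at once, keeps paragraph breaks in the model's answer intact.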

In [None]:
def download_file(url, destination_folder):
    """
    Download a file from a URL to the specified destination folder.
    Attempts to use the original filename from the Content-Disposition header.
    """
    # Ensure the destination folder exists (no error if it is already there)
    os.makedirs(destination_folder, exist_ok=True)

    # Get the file content from the URL
    response = requests.get(url, allow_redirects=True)
    response.raise_for_status()  # Raise an exception for HTTP errors

    # Try to fetch the filename from the Content-Disposition header
    content_disposition = response.headers.get('content-disposition')
    if content_disposition and 'filename=' in content_disposition:
        # Extract the filename, dropping any trailing parameters and quotes
        filename = content_disposition.split('filename=')[1].split(';')[0].strip().strip('"')
    else:
        # If no filename is found in the headers, fall back to a default
        filename = "default_filename.txt"

    # Create the full path for the local file
    local_file_path = os.path.join(destination_folder, filename)

    # Write the file content in binary mode to the local file
    with open(local_file_path, 'wb') as f:
        f.write(response.content)

    return local_file_path

In [None]:
# Hugging Face access token
hf_token = "your_access_token"

# HfFolder to save the token for subsequent API calls
HfFolder.save_token(hf_token)


In [None]:
system_prompt = """<|prompter|>\n"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("{query_str}\n<|assistant|>\n")
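
To see roughly what the model receives, the sketch below assembles a prompt from the two pieces with plain string formatting. It assumes the system prompt is simply prepended to the wrapped query. The exact joining (separator, whitespace) happens inside LlamaIndex and may differ between versions, and `build_prompt` is a hypothetical helper, not a LlamaIndex function.

```python
SYSTEM_PROMPT = "<|prompter|>\n"
QUERY_WRAPPER = "{query_str}\n<|assistant|>\n"


def build_prompt(query_str, system_prompt=SYSTEM_PROMPT, wrapper=QUERY_WRAPPER):
    """Hypothetical helper: prepend the system prompt to the wrapped query."""
    return system_prompt + wrapper.format(query_str=query_str)


prompt = build_prompt("Summarize the document.")
print(prompt)
```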

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)

# BitsAndBytes configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False, # You can optionally load it in 8bit
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

llm = HuggingFaceLLM(
    model_name=model_id,
    max_new_tokens=512,
    model_kwargs={
        "token": hf_token,
        "quantization_config": bnb_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=model_id,
    tokenizer_kwargs={"token": hf_token},
    device_map="auto",
)
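
As a rough guide to what 4-bit loading saves, the following back-of-the-envelope sketch compares weight storage at different precisions. It assumes a nominal 7 billion parameters and counts weights only. Quantization constants, layers kept in higher precision, activations, and the KV cache all add to the real footprint, so treat the numbers as lower bounds.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: nominal parameter count of a "7B" model

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```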

In [None]:
def count_tokens(text: str, tokenizer: PreTrainedTokenizer, max_tokens: int = None) -> int:
    """
    Counts the number of tokens in the given text using the specified tokenizer.

    Parameters:
    - text (str): The text to tokenize and count.
    - tokenizer (PreTrainedTokenizer): The tokenizer to use for tokenizing the text.
    - max_tokens (int, optional): The maximum number of tokens to consider. If None, uses the full text.

    Returns:
    - int: The number of tokens in the text.
    """
    if not isinstance(text, str):
        raise ValueError("Input text must be a string.")

    # Tokenize the text
    tokens = tokenizer.tokenize(text)

    # Shorten the list of tokens if max_tokens is specified
    if max_tokens is not None:
        tokens = tokens[:max_tokens]

    # The length of tokens is the count we're interested in
    token_count = len(tokens)

    return token_count


In [None]:
TXT_LINK = "https://drive.google.com/uc?export=download&id=1S3J6A2hX8rKlgLoWQv4fJHAHHSN4eFrD"

# Use requests to get the content of the text file
response = requests.get(TXT_LINK)
text = response.text
print(text[:1000])


Transformers in Action

Chapter 1: The need for transformers

This chapter covers
How transformers revolutionized NLP
Attention mechanism - the transformers key
architectural component
How to use transformers
When and why you want to use transformers

The field of natural language processing (NLP) has undergone a
revolutionary change
with the invention of a new class of neural networks called
transformers. These models,
famous for their capacity to understand and generate natural language,
are the backbone of
widely-used applications such as Google Translate or ChatGPT.
Transformers, along with their derivatives like large language models
(LLMs), are distinctive due to their unique architectural approach.
Unlike their predecessors, like Long Short
Term Memory (LSTMs), transformers incorporate an innovative
architectural component
called the "attention mechanism." This mechanism empowers the model to
concentrate
on different segments of the input data to varying extents, thereby
enhanc

In [None]:
token_count = count_tokens(text, tokenizer)
print(f"The number of tokens in the text is {token_count}.")

The number of tokens in the text is 16917.


In [None]:
urls = [
    "https://drive.google.com/uc?export=download&id=1S3J6A2hX8rKlgLoWQv4fJHAHHSN4eFrD",
]

# Destination folder
destination_folder = "data"

# Download each file
for url in urls:
    print(f"Downloading from {url}...")
    file_path = download_file(url, destination_folder)
    print(f"Saved to {file_path}")


Downloading from https://drive.google.com/uc?export=download&id=1S3J6A
2hX8rKlgLoWQv4fJHAHHSN4eFrD...
Saved to data/Transformers_in_Action_chapter_1_2.txt


### Summarization

In [None]:
documents = SimpleDirectoryReader('./data').load_data()
len(documents)

1

In [None]:
# Embedding model: set this explicitly,
# otherwise LlamaIndex will ask you for OpenAI credentials
embed_model = HuggingFaceEmbedding(
    model_name="hkunlp/instructor-large"
)


In [None]:
# Settings for the LLM
Settings.llm = llm
Settings.embed_model = embed_model
Settings.num_output = 128
Settings.context_window = 16000
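
These settings interact with the document length measured earlier: the text has 16,917 tokens, which is more than the 16,000-token `context_window`, so it cannot be sent to the model in a single prompt. The sketch below makes the arithmetic explicit. `prompt_budget` is a hypothetical helper, and it ignores the tokens used by the prompt template itself.

```python
def prompt_budget(context_window, num_output):
    """Tokens left for the prompt after reserving room for the answer."""
    return context_window - num_output


CONTEXT_WINDOW = 16000   # Settings.context_window
NUM_OUTPUT = 128         # Settings.num_output
DOCUMENT_TOKENS = 16917  # token count reported earlier in this notebook

budget = prompt_budget(CONTEXT_WINDOW, NUM_OUTPUT)
fits = DOCUMENT_TOKENS <= budget

print(f"Prompt budget: {budget} tokens")
print(f"Document fits in one prompt: {fits}")
```

LlamaIndex splits documents into smaller chunks when it builds an index, so the query engine works on those chunks rather than on the full text.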

In [None]:
vector_index = VectorStoreIndex.from_documents(documents)

In [None]:
# Query the model to summarize the document
query_engine = vector_index.as_query_engine()
print(query_engine.query("Could you summarize the given context? Return your response which covers the key points of the text and does not miss anything important."))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


The text discusses the Transformer model, its components, and its
advantages over traditional sequence-to-sequence models. It mentions
that Transformers employ attention and multi-head attention mechanisms
to navigate through a sentence, focusing on key parts. The attention
mechanism allows the model to focus on important information, while
multi-head attention helps identify numerous links between words in a
sequence. The text also notes that Transformers have revolutionized
NLP by making it possible to train models in a matter of days,
outperforming state-of-the-art networks. However, the text also
acknowledges the limitations of extremely large language models,
including decreased effectiveness in certain specialized domains and
computational demands.
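
Note that a query engine built on a `VectorStoreIndex` answers from the chunks most similar to the query, not from the entire document, so the summary above reflects the retrieved passages. The toy sketch below illustrates that top-k retrieval step with hand-made 3-dimensional vectors and cosine similarity. The vectors, chunk labels, and the `top_k` helper are invented for illustration and are not LlamaIndex internals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, chunk_vecs, k=2):
    """Return the labels of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Invented embeddings: each chunk is represented by a tiny vector
chunks = {
    "attention mechanism": [0.9, 0.1, 0.0],
    "training cost": [0.1, 0.9, 0.1],
    "model limitations": [0.2, 0.2, 0.9],
}
query = [0.8, 0.3, 0.1]

selected = top_k(query, chunks, k=2)
print(selected)
```

Only the selected chunks would be passed to the LLM as context. If every chunk should contribute to the summary, the already imported `SummaryIndex` is designed for that use case.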
