# FastChat-T5

FastChat-T5 is a 3B-parameter chatbot model developed by the FastChat team, primarily Dacheng Li, Lianmin Zheng, and Hao Zhang. It is based on the Flan-T5-XL model and fine-tuned on user-shared conversations collected from ShareGPT. The key points are summarized below.

### Key Features

- **Architecture**: FastChat-T5 uses an encoder-decoder transformer architecture: the encoder reads the user's input bidirectionally, and the decoder generates the response autoregressively.
- **Training Data**: The model was trained on 70K conversations collected from ShareGPT.com, processed in the form of question answering.
- **Training Details**: The model was fine-tuned for 3 epochs with a max learning rate of 2e-5, warmup ratio of 0.03, and a cosine learning rate schedule.
- **Evaluation**: The model's quality was evaluated by creating a set of 80 diverse questions and using GPT-4 to judge the model outputs.
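
The training details above name a peak learning rate, a warmup ratio, and a cosine schedule. A minimal sketch of what such a schedule could look like follows. It assumes linear warmup from zero and cosine decay to zero; the function name `lr_at_step` and the total of 1,000 steps are made up for illustration and are not taken from the FastChat training code.

```python
import math

def lr_at_step(step, total_steps, max_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    # Assumption: warmup length is the warmup ratio times the total step count.
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warmup phase.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimizer steps
for s in (0, 15, 30, 515, 1000):
    print(s, f"{lr_at_step(s, total):.2e}")
```

The rate climbs during the first 3% of steps, peaks at `max_lr`, and then falls smoothly, which is the general shape a warmup-plus-cosine schedule produces.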

### Usage

FastChat-T5 can be used for various applications, including:

- **Commercial Usage**: The model is primarily intended for commercial use in large language model and chatbot applications.
- **Research**: It can also be used for research purposes in natural language processing, machine learning, and artificial intelligence.

### Performance

FastChat-T5 has been reported to perform well on several tasks:

- **Summarizing Documents**: It has been used to summarize documents, with the summaries then passed to the model together with the user's query.
- **Instruction Following**: In instruction-following and inference tests, it has been found to perform better than Ada and worse than ChatGPT.
- **Financial Work**: It has been used for financial work, where it has been found to be more accurate than other models such as T5 and UL2.

### Availability

FastChat-T5 is available on the Hugging Face model hub and can be downloaded and used for various applications.

In [20]:
!pip -q install langchain tiktoken chromadb pypdf transformers InstructorEmbedding

In [21]:
!pip show langchain

Name: langchain
Version: 0.2.4
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, async-timeout, langchain-core, langchain-text-splitters, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: langchain-community


## QA Retrieval Without OpenAI: FastChat-T5

Transformers is a popular open-source library developed by Hugging Face that provides a wide range of pre-trained models and tools for natural language processing (NLP) tasks. This section explains the library and the two classes this notebook uses to load FastChat-T5: `AutoTokenizer` and `AutoModelForSeq2SeqLM`.

### Transformers

Transformers is a library that provides a unified interface for various transformer-based models, including BERT, RoBERTa, XLNet, and many others. It allows users to easily load and use pre-trained models for a variety of NLP tasks such as text classification, sentiment analysis, question answering, and language translation.

### AutoTokenizer

`AutoTokenizer` is a class in the transformers library that automatically selects the appropriate tokenizer for a given pre-trained model. It is a generic tokenizer class that can be instantiated using the `from_pretrained` method, which loads the tokenizer based on the model name or path.

For example:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```

### AutoModelForSeq2SeqLM

`AutoModelForSeq2SeqLM` is a class in the transformers library that automatically selects the appropriate model for sequence-to-sequence tasks. It is used for models with an encoder-decoder architecture, such as T5 and BART, which are commonly used for tasks like machine translation, text summarization, and question answering.

For example:
```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
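
The "automatic selection" described above is essentially a lookup from a checkpoint's declared model type to a concrete class. The sketch below illustrates that dispatch pattern with toy stand-ins. Every name in it (`ToyAutoTokenizer`, `TOKENIZER_REGISTRY`, `from_config`) is hypothetical, and it is not how the transformers library is actually implemented.

```python
# Toy stand-ins; these are NOT the real transformers classes.
class ToyT5Tokenizer:
    pass

class ToyBertTokenizer:
    pass

# Hypothetical registry from a checkpoint's declared model type to a class.
TOKENIZER_REGISTRY = {"t5": ToyT5Tokenizer, "bert": ToyBertTokenizer}

class ToyAutoTokenizer:
    @classmethod
    def from_config(cls, config):
        """Pick the concrete tokenizer class named by config['model_type']."""
        model_type = config["model_type"]
        try:
            return TOKENIZER_REGISTRY[model_type]()
        except KeyError:
            raise ValueError(f"Unrecognized model type: {model_type!r}") from None

tok = ToyAutoTokenizer.from_config({"model_type": "t5"})
print(type(tok).__name__)
```

The practical consequence is that the calling code stays the same when the checkpoint changes; only the name passed to `from_pretrained` differs.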

### Key Features

- **Pre-trained Models**: Transformers provides a wide range of pre-trained models that can be fine-tuned for specific tasks.
- **Auto Classes**: The library provides auto classes like `AutoTokenizer` and `AutoModelForSeq2SeqLM` that automatically select the appropriate tokenizer or model based on the model name or path.
- **Unified Interface**: Transformers provides a unified interface for various transformer-based models, making it easy to switch between different models and tasks.

### Usage

Transformers can be used for various NLP tasks, including:

- **Text Classification**: Transformers provides pre-trained models like BERT and RoBERTa that can be fine-tuned for text classification tasks.
- **Language Translation**: Models like T5 and BART can be used for machine translation tasks.
- **Question Answering**: Models like BERT and RoBERTa can be fine-tuned for question answering tasks.

### Advantages

- **Easy to Use**: Transformers provides a simple and unified interface for various transformer-based models, making it easy to use and fine-tune pre-trained models.
- **Wide Range of Models**: The library provides a wide range of pre-trained models that can be used for various NLP tasks.
- **Flexibility**: Transformers allows users to easily switch between different models and tasks, making it a flexible and versatile library.

### Common Issues

- **Model Selection**: Choosing the right pre-trained model for a specific task can be challenging, especially for users who are new to transformers.
- **Fine-tuning**: Fine-tuning pre-trained models requires a good understanding of the model architecture and the task at hand.

### Best Practices

- **Choose the Right Model**: Select a pre-trained model that is suitable for the task at hand.
- **Fine-tune Carefully**: Fine-tune the pre-trained model carefully, using a suitable optimizer and learning rate.
- **Monitor Performance**: Monitor the performance of the model during fine-tuning and adjust the hyperparameters as needed.

In [22]:
!wget -q https://www.dropbox.com/s/zoj9rnm7oyeaivb/new_papers.zip
!unzip -o -q new_papers.zip -d new_papers


In [1]:
import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

model = AutoModelForSeq2SeqLM.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

In [2]:
!pip install langchain_community



In [3]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
import torch

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=256
)

local_llm = HuggingFacePipeline(pipeline=pipe)



# LangChain multi-doc retriever with ChromaDB

***New Points***
- Multiple files (PDFs)
- ChromaDB, with more metadata?
- Source info
- gpt-3.5-turbo API
- HuggingFace Embeddings
- Instructor Embeddings


## Setting up LangChain


In [5]:
import os

!pip install sentence-transformers==2.2.2



In [6]:
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import DirectoryLoader


from InstructorEmbedding import INSTRUCTOR
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

## Load and process multiple documents

In [7]:
# Load every PDF in the directory (one Document per page)
# loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader('./new_papers/new_papers/', glob="./*.pdf", loader_cls=PyPDFLoader)

documents = loader.load()

In [8]:
len(documents)

142

In [9]:
# Split the documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

In [10]:
len(texts)

659
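
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified sliding-window splitter. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than the limit. The function name `split_with_overlap` and the sample text are made up for illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size sliding window: each chunk starts chunk_size - chunk_overlap after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # hypothetical 2,500-character page
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next.
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.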

In [11]:
texts[3]

Document(page_content='has time and memory complexity quadratic in sequence length. An important question is whether making\nattention faster and more memory-eﬃcient can help Transformer models address their runtime and memory\nchallenges for long sequences.\nMany approximate attention methods have aimed to reduce the compute and memory requirements of\nattention. These methods range from sparse-approximation [ 51,74] to low-rank approximation [ 12,50,84],\nand their combinations [ 3,9,92]. Although these methods reduce the compute requirements to linear or\nnear-linear in sequence length, many of them do not display wall-clock speedup against standard attention\nand have not gained wide adoption. One main reason is that they focus on FLOP reduction (which may not\ncorrelate with wall-clock speed) and tend to ignore overheads from memory access (IO).\nIn this paper, we argue that a missing principle is making attention algorithms IO-aware [1]—that is,', metadata={'source': 'new_papers/

## HF Instructor Embeddings

In [None]:

from langchain_community.embeddings import HuggingFaceInstructEmbeddings

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cuda"})


load INSTRUCTOR_Transformer


## Create the DB

In [9]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

# Use the Instructor embeddings created above
embedding = instructor_embeddings

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

In [10]:
# Persist the DB to disk
vectordb.persist()
vectordb = None


In [11]:
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

## Make a retriever

In [12]:
retriever = vectordb.as_retriever()

In [13]:
docs = retriever.invoke("What is Flash attention?")


In [14]:
len(docs)

4

In [15]:
docs[0]

Document(page_content='access.\nWe propose FlashAttention , a new attention algorithm that computes exact attention with far fewer\nmemory accesses. Our main goal is to avoid reading and writing the attention matrix to and from HBM.\nThis requires (i) computing the softmax reduction without access to the whole input (ii) not storing the large\nintermediate attention matrix for the backward pass. We apply two well-established techniques to address\nthese challenges. (i) We restructure the attention computation to split the input into blocks and make several\npasses over input blocks, thus incrementally performing the softmax reduction (also known as tiling). (ii) We\nstore the softmax normalization factor from the forward pass to quickly recompute attention on-chip in the\nbackward pass, which is faster than the standard approach of reading the intermediate attention matrix from\nHBM. We implement FlashAttention in CUDA to achieve ﬁne-grained control over memory access and', metadata={'

In [16]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

In [17]:
retriever.search_type

'similarity'

In [18]:
retriever.search_kwargs

{'k': 3}
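
The `k` setting controls how many of the closest chunks the retriever returns. The toy ranking below illustrates the idea with cosine similarity over made-up three-dimensional vectors. It is a sketch only: the vector values and names are invented, and a real vector store uses its own index and distance metric rather than this brute-force loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k vectors most similar to the query, best first."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
doc_vecs = {
    "flash-attention": [0.9, 0.1, 0.0],
    "alibi": [0.6, 0.4, 0.1],
    "toolformer": [0.1, 0.9, 0.2],
    "react": [0.0, 0.7, 0.7],
}
print(top_k([1.0, 0.0, 0.0], doc_vecs, k=3))
```

A smaller `k` keeps the prompt short and focused; a larger `k` gives the LLM more context but also more irrelevant text to sift through.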

## Make a chain

In [None]:
# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=local_llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
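
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows that assembly step in plain Python. The template text and the name `build_stuff_prompt` are hypothetical; the chain's real prompt lives on the chain object and may differ.

```python
# Hypothetical template, written for this example only.
TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_stuff_prompt(question, chunks, separator="\n\n"):
    """Concatenate all retrieved chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)

retrieved = [
    "FlashAttention computes exact attention with far fewer memory accesses.",
    "Tiling splits the input into blocks that are processed incrementally.",
]
print(build_stuff_prompt("What is Flash attention?", retrieved))
```

Because everything goes into one prompt, the number of retrieved chunks times the chunk size has to fit within the model's context window.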

In [None]:
## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
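
When several retrieved chunks come from the same PDF, the loop above prints that path more than once. One possible refinement is to group the sources first, as sketched below. The helper name `unique_sources` is made up, and the `page` key is an assumption about what the loader stores in each document's metadata; the function tolerates its absence.

```python
def unique_sources(source_metadatas):
    """Collapse repeated sources, collecting the pages seen for each in first-seen order."""
    pages_by_source = {}
    for meta in source_metadatas:
        pages = pages_by_source.setdefault(meta["source"], [])
        page = meta.get("page")  # assumed key; skipped when missing
        if page is not None and page not in pages:
            pages.append(page)
    return pages_by_source

# Hypothetical metadata dicts shaped like those on the retrieved documents.
metas = [
    {"source": "papers/a.pdf", "page": 1},
    {"source": "papers/a.pdf", "page": 0},
    {"source": "papers/b.pdf", "page": 4},
    {"source": "papers/a.pdf", "page": 1},
]
for source, pages in unique_sources(metas).items():
    print(source, sorted(pages))
```

It could be called with `[doc.metadata for doc in llm_response["source_documents"]]` in place of the hand-written list.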

In [None]:
# full example
query = "What is Flash attention?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
# break it down
query = "What does IO-aware mean?"
llm_response = qa_chain(query)
process_llm_response(llm_response)
# llm_response

In [None]:
query = "What is tiling in flash-attention?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
query = "What is toolformer?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
query = "What tools can be used with toolformer?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
query = "How many examples do we need to provide for each tool?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
query = "What are the best retrieval augmentations for LLMs?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
query = "What are the differences between REALM and RAG?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
qa_chain.retriever.search_type , qa_chain.retriever.vectorstore

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

## Deleting the DB

In [None]:
!zip -r db.zip ./db

In [None]:
# To cleanup, you can delete the collection
vectordb.delete_collection()
vectordb.persist()

# delete the directory
!rm -rf db/

## Starting again: loading the DB

Restart the runtime before running the cells below.

In [None]:
!unzip db.zip

In [None]:
import os

os.environ["OPENAI_API_KEY"] = ""

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

from langchain_community.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

In [None]:
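persist_directory = 'db'
# Note: queries must be embedded with the same model that built the DB in
# persist_directory. This cell assumes a DB built with OpenAI embeddings;
# a DB built with Instructor embeddings needs the Instructor embedding function.
embedding = OpenAIEmbeddings()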

vectordb2 = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding,
                   )

retriever = vectordb2.as_retriever(search_kwargs={"k": 2})

In [None]:
# Set up the turbo LLM
turbo_llm = ChatOpenAI(
    temperature=0,
    model_name='gpt-3.5-turbo'
)

In [None]:
# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=turbo_llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [None]:
## Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [None]:
# full example
query = "How much money did Pando raise?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

### Chat prompts

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[0].prompt.template)

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[1].prompt.template)