In [1]:
from dotenv import load_dotenv
import pickle
import os
load_dotenv("../.env")

import nest_asyncio
import langchain.schema
from llama_index import download_loader, GPTVectorStoreIndex
# GithubClient is needed further down; this import assumes the llama_hub package is installed
from llama_hub.github_repo import GithubClient, GithubRepositoryReader
nest_asyncio.apply()
#download_loader("GithubRepositoryReader")

# [Building a Chatbot](https://gpt-index.readthedocs.io/en/latest/guides/tutorials/building_a_chatbot.html)

## Get documents

Download data

In [None]:
!mkdir data

Imports:

### [`StorageContext`](https://github.com/jerryjliu/llama_index/blob/main/llama_index/storage/storage_context.py) 
...is a utility container for storing nodes, indices and vectors.

Has attributes:

- `docstore` (default: `SimpleDocumentStore()`)
- `index_store` (default: `SimpleIndexStore()`)
- `vector_store` (default: `SimpleVectorStore()`)
    
It persists these stores to a directory on disk and loads them back from it (see `persist_dir` below).
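
To make the persist/load round trip concrete, here is a minimal sketch using only the standard library. `ToyStorageContext` and its JSON file layout are hypothetical stand-ins chosen for illustration; they are not how LlamaIndex actually implements `StorageContext`.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

STORE_NAMES = ("docstore", "index_store", "vector_store")


@dataclass
class ToyStorageContext:
    """Hypothetical stand-in: three plain dicts persisted to one directory."""
    docstore: dict = field(default_factory=dict)      # node id -> text
    index_store: dict = field(default_factory=dict)   # index id -> list of node ids
    vector_store: dict = field(default_factory=dict)  # node id -> embedding

    def persist(self, persist_dir):
        # Write each store to its own JSON file inside persist_dir.
        persist_dir = Path(persist_dir)
        persist_dir.mkdir(parents=True, exist_ok=True)
        for name in STORE_NAMES:
            (persist_dir / f"{name}.json").write_text(json.dumps(getattr(self, name)))

    @classmethod
    def from_defaults(cls, persist_dir=None):
        # Without a directory, start empty; otherwise load all three stores.
        if persist_dir is None:
            return cls()
        persist_dir = Path(persist_dir)
        return cls(**{
            name: json.loads((persist_dir / f"{name}.json").read_text())
            for name in STORE_NAMES
        })


with tempfile.TemporaryDirectory() as tmp:
    ctx = ToyStorageContext()
    ctx.docstore["n1"] = "TinyViT is a small vision transformer."
    ctx.index_store["idx"] = ["n1"]
    ctx.vector_store["n1"] = [0.1, 0.2, 0.3]
    ctx.persist(tmp)
    restored = ToyStorageContext.from_defaults(persist_dir=tmp)
```

The point of bundling the three stores is that a single `persist`/`from_defaults` pair restores everything an index needs.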

### [`ServiceContext`](https://github.com/jerryjliu/llama_index/blob/main/llama_index/indices/service_context.py) 
...is a utility container for index and query classes. 

Has attributes:

- `llm_predictor` (default: `LLMPredictor()`)
- `prompt_helper` (default: `PromptHelper.from_llm_predictor(..., chunk_size_limit)`)
- `embed_model` (default: `OpenAIEmbedding()`)
    - Not used in all indices
- `node_parser` (default: `SimpleNodeParser(text_splitter=TokenTextSplitter(...), callback_manager=...)`)
    - Splits a `Document` into `Node`s using a `TextSplitter`
- `llama_logger` (default: `LlamaLogger()`)
- `callback_manager` (default: `CallbackManager()`)
- `chunk_size_limit` (default: `None`) 
    - Passed to both the node parser and the prompt helper
    - The `SimpleNodeParser` uses a `TokenTextSplitter` as its `text_splitter`; if `chunk_size_limit` is `None` the splitter keeps its default chunk size, otherwise `chunk_size_limit` is passed to it
    - The `PromptHelper` also instantiates a `TokenTextSplitter` with it.


So by instantiating one of these, you get a single object that gives access to *all* related services.
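
To illustrate what a chunk size and a chunk overlap do when a `Document` is split into `Node`s, here is a rough sketch of an overlapping token splitter. The function `split_tokens` is hypothetical and works on a plain list of tokens; the real `TokenTextSplitter` uses a proper tokenizer and may differ in its details.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = split_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With an overlap, the last token(s) of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole in at least one node.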

In [2]:
from llama_index import (
    download_loader,
    GPTVectorStoreIndex,
    ServiceContext, 
    StorageContext,
    load_index_from_storage
)
from pathlib import Path

Create Documents from files in Github repo

In [3]:
docs = None
if os.path.exists("docs.pkl"):
    with open("docs.pkl", "rb") as f:
        docs = pickle.load(f)

if docs is None:
    github_client = GithubClient(os.getenv("GITHUB_TOKEN"))
    loader = GithubRepositoryReader(
        github_client,
        owner =                  "microsoft",
        repo =                   "Cream",
        filter_directories =     (["TinyViT"], GithubRepositoryReader.FilterType.INCLUDE),
        filter_file_extensions = ([".py", ".md"], GithubRepositoryReader.FilterType.INCLUDE),
        verbose =                True,
        concurrent_requests =    10,
    )

    docs = loader.load_data(branch="main")

    with open("docs.pkl", "wb") as f:
        pickle.dump(docs, f)
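
The load-or-fetch pattern in the cell above can be factored into a small helper. This is only a sketch: `load_or_compute` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the expensive GitHub download.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the pickled value at cache_path, computing and caching it if missing."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    value = compute()
    with cache_path.open("wb") as f:
        pickle.dump(value, f)
    return value


calls = []


def fake_fetch():
    # Stand-in for loader.load_data(...); records how often it is invoked.
    calls.append(1)
    return ["doc-a", "doc-b"]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "docs.pkl"
    first = load_or_compute(cache, fake_fetch)   # computes and writes the cache
    second = load_or_compute(cache, fake_fetch)  # served from the cache
```

Only load pickle files you created yourself, since unpickling runs arbitrary code.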

## Create vector index

Setting up vector indices for each file

In [12]:
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser

In [13]:
embeddings_storage = Path("./storage/TinyViT")

if embeddings_storage.exists():
    # load document/index/vector store from storage
    storage_context = StorageContext.from_defaults(persist_dir=embeddings_storage)
    cur_index = load_index_from_storage(storage_context=storage_context)
else:
    # generate document/index/vector store from documents, and store
    # we need an embedding model, hence we create a service context
    # the chunk_size_limit internally propagates to the textsplitter (which in turn goes to 
    # the node parser) and limits the number of tokens in each node
    chunk_size_limit, chunk_overlap = 768, 64
    text_splitter = TokenTextSplitter(chunk_size=chunk_size_limit, chunk_overlap=chunk_overlap)
    node_parser = SimpleNodeParser(text_splitter=text_splitter)
    service_context = ServiceContext.from_defaults(
        node_parser=node_parser,
        chunk_size_limit=chunk_size_limit
    ) 
    #index_set = {}

    # just a flat index for now, could loop and create multiple, hierarchically for example
    storage_context = StorageContext.from_defaults()
    # from_documents uses the service_context's nodeparser to translate documents into nodes

    cur_index = GPTVectorStoreIndex.from_documents(
        docs,#doc_set[key]
        service_context=service_context,
        storage_context=storage_context
    )
    #index_set[key] = cur_index
    storage_context.persist(persist_dir=embeddings_storage)
    
# since we're just doing one
index_set = cur_index

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 168176 tokens


## Retriever

### Here we select a subset of nodes from the vector index by semantic similarity

In [14]:
from llama_index import (
    GPTVectorStoreIndex,
    ResponseSynthesizer,
    LLMPredictor
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.indices.postprocessor import SimilarityPostprocessor

#index = GPTVectorStoreIndex.from_documents(docs, storage_context=...)

# configure retriever
retriever = VectorIndexRetriever(
    index=cur_index,
    similarity_top_k=5
)

![Vector store query](./vector_store_query.webp)
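
Conceptually, the retriever embeds the query and returns the `similarity_top_k` nodes whose embeddings are closest to it. The sketch below shows that idea with cosine similarity over toy two-dimensional vectors; the `retrieve` function and the example vectors are made up for illustration and are not the library's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, node_vecs, top_k):
    """Return the top_k (node_id, score) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), node_id) for node_id, vec in node_vecs.items()]
    scored.sort(reverse=True)
    return [(node_id, score) for score, node_id in scored[:top_k]]


# Toy embeddings; real ones have hundreds or thousands of dimensions.
node_vecs = {
    "train.py": [1.0, 0.0],
    "README.md": [0.7, 0.7],
    "utils.py": [0.0, 1.0],
}
hits = retrieve([1.0, 0.1], node_vecs, top_k=2)
print(hits)
```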

## ResponseSynthesizer

### Decide how to process nodes returned by retriever into answer

[ResponseMode](https://gpt-index.readthedocs.io/en/latest/reference/query/response_synthesizer.html#llama_index.indices.response.type.ResponseMode)

Response modes of the response builder (and synthesizer).

ACCUMULATE = 'accumulate'

    Synthesize a response for each text chunk, and then return the concatenation.

COMPACT = 'compact'

    Compact and refine mode first combines text chunks into larger consolidated chunks that more fully utilize the available context window, then refines answers across them. This mode is faster than refine since it makes fewer calls to the LLM.

GENERATION = 'generation'

    Ignore context, just use LLM to generate a response.

NO_TEXT = 'no_text'

    Return the retrieved context nodes, without synthesizing a final response.

REFINE = 'refine'

    Refine is an iterative way of generating a response. We first use the context in the first node, along with the query, to generate an initial answer. We then pass this answer, the query, and the context of the second node as input into a “refine prompt” to generate a refined answer. We refine through N-1 nodes, where N is the total number of nodes.

SIMPLE_SUMMARIZE = 'simple_summarize'

    Merge all text chunks into one, and make an LLM call. This will fail if the merged text chunk exceeds the context window size.

TREE_SUMMARIZE = 'tree_summarize'

    Build a tree index over the set of candidate nodes, with a summary prompt seeded with the query. The tree is built bottom-up, and in the end the root node is returned as the response.


Constructing a `ResponseSynthesizer` with a given `response_mode` builds one of the `BaseResponseBuilder` subclasses, which handles the synthesis logic. The mapping from mode to builder is defined [here](https://github.com/jerryjliu/llama_index/blob/main/llama_index/indices/response/factory.py#L24).

Each of these internally uses prompt templates for e.g. "summarisation" and "refine" steps, which can be tweaked.

We may use a different `ServiceContext` here because we can choose to involve an LLM in the response synthesis.

Internally a `PromptHelper` is either passed explicitly or created `from_llm_predictor`. This takes a `chunk_size_limit`, which is used to figure out how many nodes can be squeezed into one prompt, so this can be taken from the same service context with which we generated the index.


`compact` is a synonym for "compact and refine". Here, the retrieved nodes are consolidated into chunks big enough to fill out the prompt. The query is issued with the first chunk of context to get an answer; this answer is then passed, together with the original query and the second chunk of context, to get a refined answer, and so on.
$$
\begin{aligned}
(Q, C_1) &\rightarrow_{\text{LLM}} A_1 \\
(Q, A_1, C_2) &\rightarrow_{\text{LLM}} A_2 \\
(Q, A_2, C_3) &\rightarrow_{\text{LLM}} A_3 \\
(Q, A_3, C_4) &\rightarrow_{\text{LLM}} A_4
\end{aligned}
$$
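
The two stages can be sketched as follows. This is a simplified illustration under stated assumptions: `compact` budgets by character count (the library works with token counts via the `PromptHelper`), and `fake_llm` is a stub that only records its inputs instead of calling a model.

```python
def compact(chunks, max_chars):
    """Greedily merge consecutive chunks while the merged text fits the budget."""
    merged, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            merged.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


def refine(query, contexts, llm):
    """Feed each context to the LLM together with the previous answer."""
    answer = None
    for context in contexts:
        answer = llm(query, answer, context)
    return answer


calls = []


def fake_llm(query, previous_answer, context):
    # Stub LLM: records what it was given and returns A1, A2, ...
    calls.append((previous_answer, context))
    return f"A{len(calls)}"


chunks = ["aaaa", "bbbb", "cccc", "dddd"]
contexts = compact(chunks, max_chars=9)
final = refine("Q", contexts, fake_llm)
print(len(chunks), "chunks ->", len(contexts), "LLM calls, final answer:", final)
```

Plain `refine` would make one call per retrieved chunk; compacting first reduces the number of calls to the number of consolidated contexts.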

`tree_summarize` also seems interesting.

The summary tree in this implementation resembles the tree produced by hierarchical agglomerative clustering: each chunk of text starts as its own leaf, and groups of nodes are merged into a parent as one moves up the hierarchy. Unlike clustering, though, the grouping appears to follow the order of the chunks rather than a similarity measure. Each parent node holds a summary of its child nodes.

Here's how the tree topology is defined in the `GPTTreeIndexBuilder` class:

- Leaves: The leaf nodes in the tree are chunks of the original input text. These are the smallest units of text that the system works with.

- Intermediate Nodes: The intermediate nodes are summaries of their children nodes. Each intermediate node represents a chunk of summarized text that is a summary of the chunks represented by its children. The summarization process is performed by a language model (specified by `llm_predictor` in the service context).

- Root: The root of the tree is a single node that represents the entire document. Its text is a summary of all the text in the document. If the tree is fully built, this root node is a summary of its children, which are summaries of their children, and so forth, down to the leaves which are the original chunks of text.

The **granularity of the summarization** is controlled by the `num_children` parameter of the `GPTTreeIndexBuilder` class. This parameter determines how many child nodes each parent node should have, and therefore how much text each intermediate node is responsible for summarizing. A smaller `num_children` value gives finer granularity: each intermediate node condenses less text, so more detail can survive at each level, but the tree is deeper and needs more LLM calls. A larger value gives coarser granularity: each node condenses more text at once and the tree is shallower.
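
The effect of `num_children` on the tree shape can be seen in a small sketch. `build_summary_tree` is a hypothetical helper, not the `GPTTreeIndexBuilder` code, and `fake_summarize` is a stub that just joins its inputs where a real builder would call the LLM.

```python
def build_summary_tree(leaves, num_children, summarize):
    """Build the tree bottom-up and return its levels, leaves first, root last."""
    if num_children < 2:
        raise ValueError("num_children must be at least 2")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Each parent summarizes up to num_children consecutive nodes.
        parents = [
            summarize(current[i:i + num_children])
            for i in range(0, len(current), num_children)
        ]
        levels.append(parents)
    return levels


def fake_summarize(texts):
    # Stub for the LLM summarization call.
    return "(" + "+".join(texts) + ")"


leaves = ["c1", "c2", "c3", "c4", "c5"]
levels = build_summary_tree(leaves, num_children=2, summarize=fake_summarize)
wide = build_summary_tree(leaves, num_children=5, summarize=fake_summarize)
print([len(level) for level in levels], levels[-1][0])
print([len(level) for level in wide], wide[-1][0])
```

With `num_children=2` the five leaves need three rounds of summarization, while `num_children=5` reaches the root in a single round.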

In [29]:
from langchain.chat_models import ChatOpenAI

In [31]:
# Response synthesis service context
service_context = ServiceContext.from_defaults(
    llm_predictor=LLMPredictor(ChatOpenAI(temperature=0, model_name="gpt-4")),
    chunk_size_limit=cur_index.service_context.chunk_size_limit
)

# configure response synthesizer
response_synthesizer = ResponseSynthesizer.from_args(
    service_context=service_context,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=.7)
    ],
    response_mode="compact"
)
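
The `similarity_cutoff` acts as a filter between retrieval and synthesis: nodes whose similarity score falls below the threshold are dropped before any prompt is built. A minimal sketch of that filtering step, with a hypothetical `apply_similarity_cutoff` function and made-up scores:

```python
def apply_similarity_cutoff(scored_nodes, similarity_cutoff):
    """Keep only (node, score) pairs whose score is known and at least the cutoff."""
    return [
        (node, score)
        for node, score in scored_nodes
        if score is not None and score >= similarity_cutoff
    ]


# Made-up retrieval results, most similar first.
scored_nodes = [
    ("train.py", 0.83),
    ("README.md", 0.71),
    ("utils.py", 0.55),
    ("setup.py", None),
]
kept = apply_similarity_cutoff(scored_nodes, similarity_cutoff=0.7)
print(kept)
```

If every node falls below the cutoff, the synthesizer has no context to work with, so a cutoff that is too strict can lead to empty answers.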

Now use LangChain to set up the outer chatbot agent, which has access to a set of `Tool`s.

LlamaIndex provides wrappers around indices and graphs so that they can be used through LangChain's `Tool` interface.

In [58]:
from langchain.agents import Tool
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent

from llama_index.langchain_helpers.agents import (
    LlamaToolkit,
    create_llama_chat_agent,
    create_llama_agent,
    IndexToolConfig
)

In [32]:
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

We define a separate tool for each index. Here we use only the vector index defined over the code.

In [38]:
tool_config = IndexToolConfig(
    query_engine=query_engine, 
    name="Vector Index (TinyViT GitHub)",
    description=(
        "useful for when you want to answer queries about the TinyViT vision transformer implementation"
    ),
    tool_kwargs={"return_direct": True}
)

Put all the tool configs into the LlamaToolkit (just one for now).

In [39]:
toolkit = LlamaToolkit(
    index_configs=[tool_config],
)
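
The tool `name` matters because the agent selects a tool by emitting a JSON action whose `"action"` field must match it. The sketch below shows that dispatch step in isolation; `dispatch` and the lambda tool are hypothetical and only mimic the JSON shape of the agent's actions, not LangChain's internals.

```python
import json


def dispatch(action_json, tools):
    """Route a JSON action either to the named tool or back as the final answer."""
    action = json.loads(action_json)
    name, tool_input = action["action"], action["action_input"]
    if name == "Final Answer":
        return tool_input
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    return tools[name](tool_input)


# Stand-in for the query engine wrapped as a tool.
tools = {"Vector Index (TinyViT GitHub)": lambda query: f"index result for: {query}"}

observation = dispatch(
    '{"action": "Vector Index (TinyViT GitHub)", "action_input": "training TinyViT"}',
    tools,
)
answer = dispatch('{"action": "Final Answer", "action_input": "done"}', tools)
print(observation)
print(answer)
```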

Finally, create the agent:

In [55]:
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

In [59]:
agent_chain = create_llama_agent(
    toolkit=toolkit,
    memory=memory,
    agent="chat-conversational-react-description",
    llm=ChatOpenAI(temperature=0, model_name="gpt-4"),
    verbose=True
)

In [60]:
agent_chain.run(input="Tell me about how the TinyViT model is trained.")

> Entering new AgentExecutor chain...
```json
{
    "action": "Vector Index (TinyViT GitHub)",
    "action_input": "training TinyViT"
}
```

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 4 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 3621 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens



Observation: To train a TinyViT model, follow these steps:

1. Save teacher sparse logits: Before training a TinyViT model, you need to save the teacher sparse logits. Follow the instructions in [How to save teacher sparse logits?](./docs/SAVE_TEACHER_LOGITS.md).

2. Train a TinyViT model: After saving the teacher sparse logits, you can train a TinyViT model using the proposed fast pretraining distillation. Follow the instructions in [Let's train a TinyViT model](./docs/TRAINING.md).

Make sure you have installed the necessary dependencies and prepared the datasets as mentioned in the [Preparation](./docs/PREPARATION.md) guide.

> Finished chain.


"To train a TinyViT model, follow these steps:\n\n1. Save teacher sparse logits: Before training a TinyViT model, you need to save the teacher sparse logits. Follow the instructions in [How to save teacher sparse logits?](./docs/SAVE_TEACHER_LOGITS.md).\n\n2. Train a TinyViT model: After saving the teacher sparse logits, you can train a TinyViT model using the proposed fast pretraining distillation. Follow the instructions in [Let's train a TinyViT model](./docs/TRAINING.md).\n\nMake sure you have installed the necessary dependencies and prepared the datasets as mentioned in the [Preparation](./docs/PREPARATION.md) guide."

In [None]:
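while True:
    text_input = input("User: ")
    # stop on an empty line or an explicit quit command instead of looping forever
    if text_input.strip().lower() in {"", "quit", "exit"}:
        break
    response = agent_chain.run(input=text_input)
    print(f'Agent: {response}')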

User: Can you summarize the training instructions explicitly here?

> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "To train a TinyViT model, follow these steps: 1. Save teacher sparse logits by following the instructions in the TinyViT repository. 2. Train a TinyViT model using the proposed fast pretraining distillation method, as described in the TinyViT repository. Make sure to install necessary dependencies and prepare the datasets as mentioned in the preparation guide."
}
```

> Finished chain.
Agent: To train a TinyViT model, follow these steps: 1. Save teacher sparse logits by following the instructions in the TinyViT repository. 2. Train a TinyViT model using the proposed fast pretraining distillation method, as described in the TinyViT repository. Make sure to install necessary dependencies and prepare the datasets as mentioned in the preparation guide.
User: I would like you to elaborate in m

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 2 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 3604 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens



Observation: To train a TinyViT model, follow these steps:

1. Install the required dependencies by running:
```bash
pip install -r requirements.txt
```

2. Prepare the ImageNet-1k and ImageNet-22k datasets as described in the [Preparation document](./docs/PREPARATION.md).

3. Save the teacher sparse logits by following the tutorial in the [Save Teacher Logits document](./docs/SAVE_TEACHER_LOGITS.md).

4. Pretrain the model on ImageNet-22k with the proposed fast pretraining distillation. Assuming the teacher sparse soft labels are stored in `./teacher_logits/` and the IN-22k dataset is stored in `./ImageNet-22k`, run the following command:

```bash
python -m torch.distributed.launch --master_addr=$MASTER_ADDR --nproc_per_node 8 --nnodes=4 --node_rank=$NODE_RANK main.py --cfg configs/22k_distill/tiny_vit_21m_22k_distill.yaml --data-path ./ImageNet-22k --batch-size 128 --output ./output --opts DISTILL.TEACHER_LOGITS_PATH ./teacher_logits/
```

Replace `$NODE_RANK` and `$MAS

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 5 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 3474 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens



Observation: Fast pretraining distillation is a method used to train small and efficient vision transformers by transferring knowledge from large pretrained models to smaller ones. This is achieved by saving the sparse soft labels (logits) of the large teacher models in advance, which helps in reducing memory cost and computational overheads. The saved logits can then be used to distill the knowledge into smaller student models without the need for re-forwarding the large teacher model. This approach enables faster and more efficient training of smaller vision transformers while maintaining high accuracy.

> Finished chain.
Agent: Fast pretraining distillation is a method used to train small and efficient vision transformers by transferring knowledge from large pretrained models to smaller ones. This is achieved by saving the sparse soft labels (logits) of the large teacher models in advance, which helps in reducing memory cost and computatio

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 2 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 3797 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens



Observation: The training loop in the given context can be found in the `train_one_epoch_distill_using_saved_logits` function. It iterates through the data_loader for each epoch, processes the samples and targets, performs forward and backward passes, updates the model parameters, and computes the loss and accuracy metrics.

> Finished chain.
Agent: The training loop in the given context can be found in the `train_one_epoch_distill_using_saved_logits` function. It iterates through the data_loader for each epoch, processes the samples and targets, performs forward and backward passes, updates the model parameters, and computes the loss and accuracy metrics.
