# Local Llama2 + VectorStoreIndex

This notebook walks through setting up Llama 2 for local use with LlamaIndex. You need a capable GPU to run it, ideally an A100 with at least 40 GB of memory.

Specifically, we look at using a vector store index.

Adapted from: https://docs.llamaindex.ai/en/stable/examples/vector_stores/SimpleIndexDemoLlama-Local.html#local-llama2-vectorstoreindex
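
As a rough guide to the GPU requirement, the memory needed for the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes about 6.74 billion parameters for the 7B model (an approximate figure, not taken from this notebook) and ignores activations and the key/value cache, which need additional memory on top of this.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: approximate parameter count of the 7B model.
LLAMA2_7B_PARAMS = 6.74e9

# Bytes per parameter for a few common weight formats.
for label, nbytes in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    gb = weight_memory_gb(LLAMA2_7B_PARAMS, nbytes)
    print(f"{label:>8}: ~{gb:.1f} GB for weights")
```

Under these assumptions, halving the bytes per parameter (for example by switching from `bfloat16` to 8-bit loading) roughly halves the weight footprint, which is why the dtype and quantization settings are the first thing to adjust on a smaller GPU.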

## Setup

In [1]:
!pip install llama-index==0.9.12 ipywidgets==8.1.1 torch==2.1 transformers==4.35.2 accelerate==0.25.0 bitsandbytes==0.41.2.post2 scipy

Collecting llama-index
  Using cached llama_index-0.9.12-py3-none-any.whl.metadata (8.2 kB)
Collecting ipywidgets
  Downloading ipywidgets-8.1.1-py3-none-any.whl.metadata (2.4 kB)
Collecting torch
  Using cached torch-2.1.1-cp310-cp310-manylinux1_x86_64.whl.metadata (25 kB)
Collecting transformers
  Using cached transformers-4.35.2-py3-none-any.whl.metadata (123 kB)
Collecting accelerate
  Using cached accelerate-0.25.0-py3-none-any.whl.metadata (18 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.41.2.post2-py3-none-any.whl.metadata (9.8 kB)
Collecting scipy
  Downloading scipy-1.11.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting SQLAlchemy>=1.4.49 (from SQLAlchemy[asyncio]>=1.4.49->llama-index)
  Using cached SQLAlchemy-2.0.23-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB

### Set Up

**IMPORTANT**: To use the meta-llama models, sign in to the Hugging Face Hub with an account that has access to the Llama 2 models by running `huggingface-cli login` in your console. For more details, see: https://ai.meta.com/resources/models-and-libraries/llama-downloads/.

Otherwise, continue with `NOUS`, which points to https://huggingface.co/NousResearch/Llama-2-7b-chat-hf
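
One way to make the choice between the two options automatic is to check whether Hugging Face credentials appear to be present. This is only a sketch: the helper names are hypothetical, and the environment variable names and token file location are assumptions about how the Hugging Face tooling stores credentials. Having a token also does not guarantee that the account has been granted access to the gated models.

```python
import os
from pathlib import Path

GATED_MODEL = "meta-llama/Llama-2-7b-chat-hf"
OPEN_MODEL = "NousResearch/Llama-2-7b-chat-hf"


def has_hf_credentials(env=None, token_file=None) -> bool:
    """Best-effort check for a Hugging Face token (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: a token may be supplied through one of these variables.
    if env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN"):
        return True
    # Assumption: `huggingface-cli login` writes the token to this file.
    if token_file is None:
        hf_home = env.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
        token_file = Path(hf_home) / "token"
    return Path(token_file).is_file()


def pick_model(env=None, token_file=None) -> str:
    """Prefer the gated model when credentials are found, else the open copy."""
    return GATED_MODEL if has_hf_credentials(env, token_file) else OPEN_MODEL


print(pick_model())
```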

In [2]:
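import logging
import sys

from IPython.display import Markdown, display

# Send INFO-level logs to stdout so they show up in the notebook.
# basicConfig already attaches a stdout handler to the root logger; adding a
# second StreamHandler on top of it would print every message twice.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)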

In [3]:
import torch
import accelerate
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

# Model names (make sure you have access on HF)
LLAMA2_7B = "meta-llama/Llama-2-7b-hf"
LLAMA2_7B_CHAT = "meta-llama/Llama-2-7b-chat-hf"
LLAMA2_13B = "meta-llama/Llama-2-13b-hf"
LLAMA2_13B_CHAT = "meta-llama/Llama-2-13b-chat-hf"
LLAMA2_70B = "meta-llama/Llama-2-70b-hf"
LLAMA2_70B_CHAT = "meta-llama/Llama-2-70b-chat-hf"

NOUS = "NousResearch/Llama-2-7b-chat-hf"

selected_model = NOUS

SYSTEM_PROMPT = """You are an AI assistant that answers questions in a friendly manner, based on the given source documents. Here are some rules you always follow:
- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="cuda:0",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.bfloat16, "load_in_8bit": False},
)

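
To see what the model actually receives, the wrapping done by `query_wrapper_prompt` can be reproduced with plain string formatting. The sketch below uses a placeholder system prompt and hypothetical helper names. The token budget function is a simplification that assumes the space left for the prompt and retrieved context is roughly `context_window - max_new_tokens`.

```python
PLACEHOLDER_SYSTEM_PROMPT = "You answer questions based on the given documents.\n"


def wrap_query(query_str: str, system_prompt: str = PLACEHOLDER_SYSTEM_PROMPT) -> str:
    """Build the same [INST]/<<SYS>> wrapper used in the cell above."""
    template = "[INST]<<SYS>>\n" + system_prompt + "<</SYS>>\n\n{query_str}[/INST] "
    return template.format(query_str=query_str)


def prompt_token_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once space for the answer is reserved."""
    return context_window - max_new_tokens


print(wrap_query("What are the main components of Determined?"))
print(prompt_token_budget(context_window=4096, max_new_tokens=2048))
```

With the settings used in this notebook, half of the context window is reserved for the answer, so lowering `max_new_tokens` leaves more room for retrieved context.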

## Download Data

In [4]:
!git clone https://github.com/determined-ai/determined

Cloning into 'determined'...
remote: Enumerating objects: 173633, done.
remote: Counting objects: 100% (6046/6046), done.
remote: Compressing objects: 100% (2460/2460), done.
remote: Total 173633 (delta 4062), reused 5011 (delta 3400), pack-reused 167587
Receiving objects: 100% (173633/173633), 221.37 MiB | 13.83 MiB/s, done.
Resolving deltas: 100% (134222/134222), done.


In [5]:
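from glob import glob

from llama_index import download_loader

MarkdownReader = download_loader("MarkdownReader")
markdownreader = MarkdownReader()

docs = []
# Without "**" in the pattern, recursive=True has no effect, so this matches
# only the Markdown files at the top level of the repo. Use
# "./determined/**/*.md" to include nested directories as well.
for file in glob("./determined/*.md", recursive=True):
    docs.extend(markdownreader.load_data(file=file))
print(f"Found {len(docs)} docs of .md format in that github repo")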

Found 42 docs of .md format in that github repo
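
The glob pattern in the cell above contains no `**`, so `recursive=True` does not make it descend into subdirectories. The following self-contained sketch builds a small temporary directory tree (the file names are made up for illustration) and compares the two patterns.

```python
import tempfile
from glob import glob
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one top-level file and two nested ones.
    (Path(root) / "docs" / "guide").mkdir(parents=True)
    (Path(root) / "README.md").write_text("# top\n")
    (Path(root) / "docs" / "index.md").write_text("# docs\n")
    (Path(root) / "docs" / "guide" / "setup.md").write_text("# setup\n")

    # No "**" in the pattern: only the top-level file matches.
    top_level = glob(f"{root}/*.md", recursive=True)
    # "**" matches zero or more directories when recursive=True.
    nested = glob(f"{root}/**/*.md", recursive=True)

    n_top, n_all = len(top_level), len(nested)
    print(f"top-level only: {n_top}, recursive: {n_all}")
```

If the goal is to index the documentation inside the repository rather than only its top-level Markdown files, the recursive pattern is the one to use. The document count reported above would then change.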


In [6]:
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    set_global_service_context,
)

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en"
)
set_global_service_context(service_context)

index = VectorStoreIndex.from_documents(docs)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.
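
Conceptually, a vector store index embeds each chunk of text, embeds the query the same way, and returns the chunks whose vectors are most similar to the query. The toy sketch below illustrates that idea with word-count vectors and cosine similarity. It is not how LlamaIndex or the `bge-small-en` model work internally, and all names in it are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    def __init__(self, chunks):
        # Embed every chunk once, at index-build time.
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, top_k: int = 2):
        """Return the top_k (chunk, score) pairs most similar to the query."""
        q = embed(query)
        scored = sorted(
            ((cosine(q, vec), chunk) for chunk, vec in self.entries), reverse=True
        )
        return [(chunk, score) for score, chunk in scored[:top_k]]


chunks = [
    "the cli starts a local cluster",
    "the web ui shows experiment metrics",
    "notebooks are launched from the cli",
]
toy_index = ToyVectorIndex(chunks)
top = toy_index.retrieve("how do I start a cluster", top_k=1)
print(top)
```

A real embedding model captures meaning rather than exact word overlap, but the retrieve-by-similarity step has the same shape.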


## Querying

In [7]:
# Set the logging level to DEBUG for more detailed output
query_engine = index.as_query_engine()

In [8]:
response = query_engine.query("What are the main components of Determined?")
display(Markdown(f"<b>{response}</b>"))



<b>The main components of Determined are the Python library, the command line interface (CLI), and the Web UI.</b>
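
When reading an answer, it helps to look at which retrieved chunks it was based on. The helper below is a sketch with a hypothetical name. It assumes the response object exposes `source_nodes`, where each item has a `score` and a `node` with a `get_content()` method. Small stub classes stand in for the real objects so that the block runs on its own.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


def summarize_sources(response: Any, max_chars: int = 80) -> List[str]:
    """Format each retrieved chunk as '[score] first characters of the text'."""
    lines = []
    for source in getattr(response, "source_nodes", []):
        text = source.node.get_content().strip().replace("\n", " ")
        score = source.score
        score_str = f"{score:.3f}" if score is not None else "n/a"
        lines.append(f"[{score_str}] {text[:max_chars]}")
    return lines


# Stand-ins for the real response objects, for illustration only.
@dataclass
class StubNode:
    text: str

    def get_content(self) -> str:
        return self.text


@dataclass
class StubSource:
    node: StubNode
    score: Optional[float]


@dataclass
class StubResponse:
    source_nodes: list


stub = StubResponse([StubSource(StubNode("Determined docs\nuse Sphinx"), 0.8123)])
for line in summarize_sources(stub):
    print(line)
```

In the notebook, `summarize_sources(response)` could be called right after a query. If the listed chunks do not contain the facts stated in the answer, the answer should be treated with caution.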

In [9]:
response = query_engine.query("What can you use the CLI to do?")
display(Markdown(f"<b>{response}</b>"))

<b>Great, I'm happy to help! Based on the context information provided, you can use the CLI to:

1. Start a Determined cluster locally using the `det deploy local cluster-up` command.
2. Launch Determined on cloud services such as Amazon Web Services (AWS) or Google Cloud Platform (GCP) using the `det deploy aws up` command.
3. Train your models using the `det experiment create gpt.yaml .` command.
4. Configure distributed training and hyperparameter tuning using YAML files.

You can access the WebUI at `http://localhost:8080` after following either set of instructions above. Additionally, you can use the `det` command-line tool to interact with Determined, such as listing available GPUs or CPUs using the `det slot list` command. For more information, please refer to the reference documentation.</b>

In [10]:
response = query_engine.query("How do you install Determined on Kubernetes?")
display(Markdown(f"<b>{response}</b>"))

<b>Great, let's get started! To install Determined on Kubernetes, you can follow these steps:

1. First, make sure you have a Kubernetes cluster set up and running. You can use any Kubernetes distribution, such as Google Kubernetes Engine (GKE), Amazon Elastic Container Service for Kubernetes (EKS), or a self-managed Kubernetes cluster.
2. Next, deploy the Determined cluster on your Kubernetes cluster. You can do this by running the following command:
```
det deploy --kubernetes
```
This command will deploy the Determined cluster on your Kubernetes cluster using the Kubernetes manifests provided in the `determined/kubernetes` directory.
3. Once the deployment is complete, you can access the Determined web interface by navigating to the IP address of your Kubernetes node or by using the Kubernetes service name. For example, if your Kubernetes node is running on `node-1.example.com`, you can access Determined by visiting `http://node-1.example.com:8080`.
4. That's it! You should now have Determined installed and running on your Kubernetes cluster. If you have any questions or encounter any issues during the installation process, feel free to reach out to the Determined community for help.

I hope this helps! Let me know if you have any other questions.</b>

In [11]:
response = query_engine.query("How are Determined docs generated?")
display(Markdown(f"<b>{response}</b>"))

<b>Determined documents are generated using a combination of natural language processing (NLP) and machine learning (ML) techniques. The process involves several steps:

1. Content Creation: Determined's content is created by a team of experienced writers and editors who specialize in various industries and topics.
2. Structuring: Once the content is created, it is structured using a standardized framework that includes headings, subheadings, and other elements to make it easy to navigate and read.
3. Optimization: The structured content is then optimized for search engines using various techniques, such as keyword research, meta tags, and other SEO best practices.
4. Publishing: The optimized content is then published to Determined's documentation platform, which is powered by a combination of NLP and ML algorithms.
5. Maintenance: Finally, Determined's documentation is maintained through a continuous process of updating, editing, and refining the content to ensure that it remains accurate and relevant.

By following these steps, Determined is able to generate high-quality, informative, and easy-to-use documentation that meets the needs of its users.</b>

In [12]:
response = query_engine.query("Tell me about writing docs for Determined")
display(Markdown(f"<b>{response}</b>"))

<b>Of course! If you're looking to write documentation for Determined, there are several resources available to help you get started.

Firstly, the Determined community on Slack is an excellent place to ask questions and get support. You can join the community by clicking on the link provided in the context information.

Additionally, Determined has a YouTube channel and Twitter account where you can find updates and announcements related to the project.

If you prefer to receive updates via email, you can join the community mailing list by visiting the Determined website and filling out the form provided.

If you encounter a bug or security issue while working with Determined, you can report it on GitHub or email the security team directly, respectively.

Overall, there are several ways to get involved and stay up-to-date with the latest news and developments in the Determined community.</b>

In [13]:
response = query_engine.query("What does Determined use Algolia Search for?")
display(Markdown(f"<b>{response}</b>"))

<b>Determined utilizes Algolia Search as its primary search engine.</b>

In [14]:
response = query_engine.query("Tell me about the Sphinx-Native Search Fallback")
display(Markdown(f"<b>{response}</b>"))

<b>Of course! Here's the answer to your query:

The Sphinx-Native Search Fallback is a feature in Determined that allows for native search functionality in the absence of Sphinx. This fallback is useful in situations where Sphinx is not available or is not functioning properly, such as during deployment on a non-Sphinx-compatible environment.

To enable the Sphinx-Native Search Fallback, you can set the `native_search_fallback` configuration option to `True` in your Determined configuration file. This will enable the fallback and allow Determined to use native search functionality instead of relying on Sphinx.

It's important to note that the Sphinx-Native Search Fallback may not provide the same level of performance as Sphinx, as it relies on the underlying file system and may not be able to leverage the full power of Sphinx's indexing capabilities. However, it can still provide a useful fallback in situations where Sphinx is not available or is not functioning properly.</b>

### Streaming Support

In [15]:
import time

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Why would a Determined user choose a notebook vs an experiment?")

start_time = time.time()

token_count = 0
for token in response.response_gen:
    print(token, end="")
    token_count += 1

time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed

print(f"\n\nStreamed output at {tokens_per_second} tokens/s")

Great, I'm here to help! A Determined user may choose to use a notebook instead of an experiment for several reasons:

1. Flexibility: Notebooks offer more flexibility in terms of data management and analysis. With a notebook, users can easily switch between different datasets, perform ad-hoc analysis, and experiment with different models and algorithms.
2. Ease of use: Notebooks are often more user-friendly and easier to use than experiments, especially for users who are new to machine learning. Notebooks provide an interactive environment where users can write code, run experiments, and visualize results without having to worry about setting up and managing complex infrastructure.
3. Collaboration: Notebooks are ideal for collaborative work. Multiple users can work on the same notebook simultaneously, and changes are tracked and versioned automatically. This makes it easier to work with teams and share results with others.
4. Rapid prototyping: Notebooks allow users to quickly protot

In [16]:
import time

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Why would a Determined user choose an experiment over a notebook?")

start_time = time.time()

token_count = 0
for token in response.response_gen:
    print(token, end="")
    token_count += 1

time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed

print(f"\n\nStreamed output at {tokens_per_second} tokens/s")

Sure, I'd be happy to help!

A Determined user may choose an experiment over a notebook for several reasons:

1. Reusability: Experiments are designed to be reusable, meaning that the user can easily share their code and results with others. This can be particularly useful for collaborative work or for sharing results with colleagues or supervisors.
2. Organization: Experiments are organized in a structured format, making it easier for users to find and reuse code and results. This can help to keep track of progress and to maintain a clear overview of the work being done.
3. Version control: Determined provides version control for experiments, which allows users to track changes to their code and results over time. This can be particularly useful for debugging and for maintaining a record of progress.
4. Collaboration: Experiments can be easily shared with others, allowing for collaborative work and code reuse. This can be particularly useful for large-scale projects or for working wit

In [17]:
import time

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Please compare the use of notebooks and experiments")

start_time = time.time()

token_count = 0
for token in response.response_gen:
    print(token, end="")
    token_count += 1

time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed

print(f"\n\nStreamed output at {tokens_per_second} tokens/s")

Sure, 

I'd be happy to help! Based on the context information provided, here's the answer to your query:

Notebooks and experiments are both tools used in the Determined platform for training machine learning models. However, they serve different purposes and have different characteristics.

A notebook is a single, self-contained document that contains a complete environment for training a model. It includes all the necessary code, data, and configuration settings to train a model from scratch. Notebooks are useful for exploratory work, prototyping, and sharing models with others. They can also be used to document the training process and results.

On the other hand, an experiment is a collection of related models that are trained using a shared environment. Experiments are useful for comparing the performance of different models, tuning hyperparameters, and evaluating the effectiveness of different training strategies. Experiments can also be used to reproduce and share the results of previo
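
The three streaming cells above repeat the same timing loop, which could be factored into a small helper. The sketch below uses a hypothetical function name and works on any iterator of text chunks, so it is demonstrated with a fake generator instead of a real model. Each yielded chunk is counted as one token, which is an approximation.

```python
import time
from typing import Iterable, Tuple


def stream_and_time(chunks: Iterable[str], echo: bool = True) -> Tuple[str, int, float]:
    """Consume a stream of text chunks; return (text, chunk count, chunks/s)."""
    start = time.perf_counter()
    parts = []
    for chunk in chunks:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    elapsed = time.perf_counter() - start
    rate = len(parts) / elapsed if elapsed > 0 else float("inf")
    return "".join(parts), len(parts), rate


def fake_stream():
    """Stand-in for a streaming response generator."""
    for piece in ["Note", "books ", "are ", "interactive."]:
        time.sleep(0.01)
        yield piece


text, n_chunks, rate = stream_and_time(fake_stream(), echo=False)
print(f"{n_chunks} chunks at {rate:.1f} chunks/s")
```

In the notebook this would be called as `stream_and_time(response.response_gen)`. Because the timer starts when iteration begins, the measured rate includes the wait for the first chunk.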