# Local Llama2 + VectorStoreIndex

This notebook walks through the setup needed to run Llama 2 locally with LlamaIndex. You need a capable GPU to run it, ideally an A100 with at least 40GB of memory.

Specifically, we look at using a vector store index.

Adapted from: https://docs.llamaindex.ai/en/stable/examples/vector_stores/SimpleIndexDemoLlama-Local.html#local-llama2-vectorstoreindex
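
As a rough guide to the GPU memory note above, the sketch below estimates the memory needed just to hold model weights at different precisions. The parameter counts are the nominal 7B/13B/70B figures (an assumption, not exact counts), and the estimate ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Back-of-the-envelope estimate of the GPU memory needed for model weights only.
# Assumption: nominal parameter counts; activations and KV cache are NOT included.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    bf16 = weight_memory_gb(params, "bfloat16")
    int8 = weight_memory_gb(params, "int8")
    print(f"Llama 2 {name}: ~{bf16:.0f} GB in bfloat16, ~{int8:.0f} GB in int8")
```

This kind of estimate can help decide between `torch.bfloat16` and 8-bit loading when configuring the model.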

## Setup

In [None]:
!pip install llama-index==0.9.12 ipywidgets==8.1.1 torch==2.1 transformers==4.35.2 accelerate==0.25.0 bitsandbytes==0.41.2.post2 scipy

### Set Up the Model

**IMPORTANT**: If you want to use the meta-llama models, sign in to the Hugging Face Hub with an account that has access to the Llama 2 models by running `huggingface-cli login` in your console. For more details, see: https://ai.meta.com/resources/models-and-libraries/llama-downloads/.

Otherwise, continue with `NOUS`, which points to https://huggingface.co/NousResearch/Llama-2-7b-chat-hf

In [1]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


from IPython.display import Markdown, display

In [2]:
import torch
import accelerate
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

# Model names (make sure you have access on HF)
LLAMA2_7B = "meta-llama/Llama-2-7b-hf"
LLAMA2_7B_CHAT = "meta-llama/Llama-2-7b-chat-hf"
LLAMA2_13B = "meta-llama/Llama-2-13b-hf"
LLAMA2_13B_CHAT = "meta-llama/Llama-2-13b-chat-hf"
LLAMA2_70B = "meta-llama/Llama-2-70b-hf"
LLAMA2_70B_CHAT = "meta-llama/Llama-2-70b-chat-hf"

NOUS = "NousResearch/Llama-2-7b-chat-hf"

selected_model = NOUS

SYSTEM_PROMPT = """You are an AI assistant that answers questions in a friendly manner, based on the given source documents. Here are some rules you always follow:
- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="cuda:0",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.bfloat16, "load_in_8bit": False},
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
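
To make the role of `query_wrapper_prompt` more concrete, here is a minimal sketch of how the template string from the cell above wraps a query. It uses plain `str.format` as a stand-in for `PromptTemplate` and a shortened, hypothetical system prompt, so it illustrates the string layout only, not LlamaIndex internals.

```python
# Stand-in for the notebook's PromptTemplate: plain str.format on the same layout.
# Assumption: a shortened, hypothetical system prompt is used for readability.
SYSTEM_PROMPT = "You are an AI assistant that answers questions in a friendly manner.\n"

TEMPLATE = "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "


def wrap_query(query_str: str) -> str:
    """Fill the {query_str} placeholder in the Llama 2 chat-style template."""
    return TEMPLATE.format(query_str=query_str)


prompt = wrap_query("What are the main components of Determined?")
print(prompt)
```

The system prompt sits between the `<<SYS>>` markers, and the user query is placed just before the closing `[/INST]` tag.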



## Download Data

In [3]:
!git clone https://github.com/determined-ai/determined

fatal: destination path 'determined' already exists and is not an empty directory.


In [4]:
from llama_index import download_loader
from glob import glob

MarkdownReader = download_loader("MarkdownReader")
markdownreader = MarkdownReader()

docs = []
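# Note: recursive=True only takes effect when the pattern contains "**";
# as written, this matches only the .md files at the top level of the repo.
for file in glob("./determined/*.md", recursive=True):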
    docs.extend(markdownreader.load_data(file=file))
print("Found " + str(len(docs)) + " docs of .md format in that github repo")

Found 42 docs of .md format in that github repo
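
The loading cell above passes `recursive=True` to `glob`, but that flag only changes the result when the pattern contains `**`. The sketch below shows the difference on a small temporary directory; the file layout is hypothetical and only serves the illustration.

```python
import tempfile
from glob import glob
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one top-level file and one nested file.
    Path(root, "README.md").write_text("# top level")
    Path(root, "docs").mkdir()
    Path(root, "docs", "index.md").write_text("# nested")

    # Without "**", recursive=True has no effect: only top-level files match.
    top_level = glob(f"{root}/*.md", recursive=True)
    # With "**", subdirectories at any depth are searched as well.
    all_levels = glob(f"{root}/**/*.md", recursive=True)

    top_level_names = sorted(Path(p).name for p in top_level)
    all_level_names = sorted(Path(p).name for p in all_levels)

print(top_level_names)
print(all_level_names)
```

Switching the notebook's pattern to `./determined/**/*.md` would index nested Markdown files too, which would change the document count and the indexing time.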


In [5]:
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    set_global_service_context,
)

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en"
)
set_global_service_context(service_context)

index = VectorStoreIndex.from_documents(docs)
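
Conceptually, a vector store index embeds each document chunk, embeds the query, and retrieves the chunks whose vectors are most similar to the query vector. The toy sketch below illustrates that idea with cosine similarity in pure Python. The three-dimensional vectors and file names are hypothetical, and this is not how LlamaIndex implements retrieval internally.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings"; a real embedding model produces far longer vectors.
doc_vecs = {
    "cli.md": [0.9, 0.1, 0.0],
    "k8s.md": [0.1, 0.9, 0.1],
    "docs.md": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In the notebook, the retrieved chunks are then passed to the LLM as context for answering the query.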

## Querying

In [6]:
# Set logging to DEBUG for more detailed output
query_engine = index.as_query_engine()

In [7]:
response = query_engine.query("What are the main components of Determined?")
display(Markdown(f"<b>{response}</b>"))



<b>The main components of Determined are the Python library, the command line interface (CLI), and the Web UI.</b>

In [8]:
response = query_engine.query("What can you use the CLI to do?")
display(Markdown(f"<b>{response}</b>"))

<b>Great, I'm happy to help! Based on the context information provided, you can use the CLI to:

1. Start a Determined cluster locally using the `det deploy local cluster-up` command.
2. Launch Determined on cloud services such as Amazon Web Services (AWS) or Google Cloud Platform (GCP) using the `det deploy aws up` command.
3. Train your models using the `det experiment create gpt.yaml .` command.
4. Configure distributed training and hyperparameter tuning using YAML files.

You can access the WebUI at `http://localhost:8080` after following either set of instructions above. Additionally, you can use the `det` command-line tool to interact with Determined, such as listing available GPUs or CPUs using the `det slot list` command. For more information, please refer to the reference documentation.</b>

In [9]:
response = query_engine.query("How do you install Determined on Kubernetes?")
display(Markdown(f"<b>{response}</b>"))



<b>Great, let's get started! To install Determined on Kubernetes, you can follow these steps:

1. First, you'll need to create a Kubernetes cluster. You can use a cloud provider like AWS or GCP, or you can set up a local cluster using tools like Minikube.
2. Once you have a Kubernetes cluster set up, you can deploy Determined to it by using the `det deploy` command. This command will create a Determined cluster on your Kubernetes cluster.
3. To deploy Determined on Kubernetes, you'll need to have the `kubectl` command-line tool installed on your system. If you don't have `kubectl`, you can install it using the instructions provided by Kubernetes.
4. Once you have `kubectl` installed, you can use the `det deploy` command to deploy Determined to your Kubernetes cluster. For example:
```
kubectl create deployment determined --image=determined/determined:latest
```
This command will create a Determined deployment on your Kubernetes cluster using the `determined/determined:latest` image.
5. After the deployment is created, you can use the `kubectl get deployment` command to verify that the deployment is running correctly.

That's it! With these steps, you should be able to install Determined on your Kubernetes cluster. If you have any questions or need further assistance, feel free to ask.</b>

In [10]:
response = query_engine.query("How are Determined docs generated?")
display(Markdown(f"<b>{response}</b>"))



<b>Determined documents are generated using a combination of natural language processing (NLP) and machine learning (ML) techniques. The process involves several steps:

1. Content Creation: Determined's content is created by a team of experienced writers and editors who specialize in various industries and topics.
2. Parsing: The content is then parsed into smaller units such as sentences, phrases, and keywords using NLP techniques.
3. Machine Learning: The parsed content is then fed into a machine learning model that analyzes the structure and format of the content.
4. Generation: Based on the analysis, the model generates the Determined document using a template that is specific to each topic.
5. Review and Editing: The generated document is reviewed and edited by a team of experts to ensure accuracy and consistency.
6. Maintenance: Once the document is published, it is continuously updated and maintained to ensure that it remains relevant and up-to-date.

By following this process, Determined is able to generate high-quality documents that are informative, well-structured, and easy to read.</b>

In [11]:
response = query_engine.query("Tell me about writing docs for Determined")
display(Markdown(f"<b>{response}</b>"))



<b>Of course! If you're looking to write documentation for Determined, there are several resources available to help you get started.

Firstly, the Determined community on Slack is an excellent place to ask questions and get support. You can join the community by clicking on the link provided in the context information.

Additionally, Determined has a YouTube channel and Twitter account where you can find updates and announcements related to the project.

If you prefer to receive updates via email, you can join the community mailing list by visiting the Determined website and filling out the form provided.

If you encounter a bug or security issue while working with Determined, you can report it on GitHub or email the security team directly, respectively.

Overall, there are several ways to get involved and stay up-to-date with the latest news and developments in the Determined community.</b>

In [13]:
response = query_engine.query("What does Determined use Algolia Search for?")
display(Markdown(f"<b>{response}</b>"))

<b>Determined utilizes Algolia Search as its primary search engine.</b>

In [14]:
response = query_engine.query("Tell me about the Sphinx-Native Search Fallback")
display(Markdown(f"<b>{response}</b>"))

<b>Of course! Here's the answer to your query:

The Sphinx-Native Search Fallback is a feature in Determined that allows for native search functionality in the absence of Sphinx. This fallback is useful in situations where Sphinx is not available or is not functioning properly, such as during deployment on a non-Sphinx-compatible environment.

To enable the Sphinx-Native Search Fallback, you can set the `native_search_fallback` configuration option to `True` in your Determined configuration file. Once enabled, Determined will use the native search functionality instead of Sphinx for search queries.

It's important to note that the native search functionality may not be as powerful or accurate as Sphinx, so it's recommended to use Sphinx whenever possible. However, the Sphinx-Native Search Fallback can be a useful backup option in situations where Sphinx is not available or is not functioning correctly.</b>
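
To check which retrieved chunks an answer was based on, it can help to look at the response's source nodes. The helper below is a minimal sketch that assumes the response exposes `source_nodes`, where each item has a `score` and a `node` with a `get_content()` method. The `summarize_sources` name and the stand-in objects are hypothetical, so the sketch runs without a model.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars: int = 80):
    """Return (score, snippet) pairs for the chunks attached to a response."""
    summary = []
    for item in response.source_nodes:
        text = item.node.get_content()
        # Collapse whitespace so each snippet fits on one line.
        snippet = " ".join(text.split())[:max_chars]
        summary.append((item.score, snippet))
    return summary


# Stand-in objects (hypothetical) that mimic the assumed attributes.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.83,
            node=SimpleNamespace(get_content=lambda: "Example chunk text\nspanning two lines."),
        )
    ]
)

for score, snippet in summarize_sources(fake_response):
    print(f"{score:.2f}  {snippet}")
```

With the notebook's objects, this would be called as `summarize_sources(response)` after a query. If the snippets do not mention the topic asked about, the answer is probably not grounded in the indexed files.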

### Streaming Support

In [17]:
import time

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Why would a Determined user choose a notebook vs an experiment?")

start_time = time.time()

token_count = 0
for token in response.response_gen:
    print(token, end="")
    token_count += 1

time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed

print(f"\n\nStreamed output at {tokens_per_second} tokens/s")

Great, I'm here to help! A Determined user may choose to use a notebook instead of an experiment for several reasons:

1. Flexibility: Notebooks offer more flexibility in terms of data management and analysis. With a notebook, users can easily switch between different datasets, perform ad-hoc analysis, and experiment with different models and algorithms.
2. Ease of use: Notebooks are often more user-friendly and easier to use than experiments, especially for users who are new to machine learning. Notebooks provide an interactive environment where users can write code, run experiments, and visualize results without having to worry about setting up and managing complex infrastructure.
3. Collaboration: Notebooks are ideal for collaborative work. Multiple users can work on the same notebook simultaneously, and changes are tracked and versioned automatically. This makes it easier to work with teams and share results with others.
4. Rapid prototyping: Notebooks allow users to quickly protot
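
The timing loop in the streaming cell can be wrapped in a small helper so that it is not repeated for every query. This is a minimal sketch: `stream_and_time` and `fake_stream` are hypothetical names, and the fake generator stands in for `response.response_gen` so the example runs without a model.

```python
import time
from typing import Iterable, Tuple


def stream_and_time(chunks: Iterable[str]) -> Tuple[str, int, float]:
    """Print chunks as they arrive; return (full_text, chunk_count, chunks_per_second)."""
    start = time.perf_counter()
    pieces = []
    for chunk in chunks:
        print(chunk, end="")
        pieces.append(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very fast (or empty) streams.
    rate = len(pieces) / elapsed if elapsed > 0 else float("inf")
    return "".join(pieces), len(pieces), rate


def fake_stream():
    """Hypothetical stand-in for a streaming response generator."""
    for word in ["Notebooks ", "are ", "interactive."]:
        time.sleep(0.01)
        yield word


text, count, rate = stream_and_time(fake_stream())
print(f"\n\nStreamed {count} chunks at {rate:.1f} chunks/s")
```

With the notebook's objects, the call would be `stream_and_time(response.response_gen)`. Streamed chunks may not map one-to-one to tokenizer tokens, so the reported rate is best read as an approximation of tokens per second.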

In [18]:
import time

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Why would a Determined user choose an experiment over a notebook?")

start_time = time.time()

token_count = 0
for token in response.response_gen:
    print(token, end="")
    token_count += 1

time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed

print(f"\n\nStreamed output at {tokens_per_second} tokens/s")



Sure, I'd be happy to help!

A Determined user may choose an experiment over a notebook for several reasons:

1. Reusability: Experiments are designed to be reusable, meaning that the user can easily share their code and results with others. This can be particularly useful for collaborative work or for sharing results with colleagues or supervisors.
2. Organization: Experiments are organized in a structured format, making it easier for users to find and reuse code and results. This can help to keep track of progress and to maintain a clear overview of the work being done.
3. Version control: Determined provides version control for experiments, which allows users to track changes to their code and results over time. This can be particularly useful for debugging and for maintaining a record of progress.
4. Collaboration: Experiments can be shared with others, allowing for collaborative work and code reuse. This can be particularly useful for large-scale projects or for projects that requ

In [19]:
import time

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Please compare the use of notebooks and experiments")

start_time = time.time()

token_count = 0
for token in response.response_gen:
    print(token, end="")
    token_count += 1

time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed

print(f"\n\nStreamed output at {tokens_per_second} tokens/s")

Sure, I'd be happy to help! Based on the context information provided, here's the answer to your query:

Notebooks and experiments are both tools used in the Determined platform for training machine learning models. However, they serve different purposes and have different characteristics.

A notebook is a single, self-contained document that contains code, markdown, and other content. It can be used to document a model training process, share it with others, or reuse it in the future. Notebooks are useful for exploratory data analysis, prototyping, and sharing models with others.

On the other hand, an experiment is a collection of related models, hyperparameters, and training jobs. It allows you to manage and track multiple models and their corresponding hyperparameters, and to run them in parallel on different hardware. Experiments are useful for conducting hyperparameter searches, comparing different models, and scaling up training to multiple GPUs or machines.

In summary, noteboo