# Getting Started with Llama2



# Setup

## Quick Start

The quick start below is copied from the [official repo](https://github.com/facebookresearch/llama) for convenience.

You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally. For more examples, see the Llama 2 recipes repository.

- In a conda env with PyTorch / CUDA available, clone and download the official repository.

- In the top level directory run:

```pip install -e .```

- Visit the [Meta website and register](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) to download the model/s.

- Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.

- Once you get the email, navigate to your downloaded llama repository and run the download.sh script.

- Make sure to grant execution permissions to the download.sh script

- During this process, you will be prompted to enter the URL from the email.

- Do not use the “Copy Link” option but rather make sure to manually copy the link from the email.

- Once the model/s you want have been downloaded, you can run the model locally using the command below:

```
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
```

### Note

- Replace llama-2-7b-chat/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model.
- `--nproc_per_node` should be set to the MP value for the model you are using.
- Adjust the max_seq_len and max_batch_size parameters as needed.
- This example runs the example_chat_completion.py found in this repository but you can change that to a different .py file.

> All of this info can be found in Meta's [official repo](https://github.com/facebookresearch/llama)!
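The notes above can be sketched as a small helper that assembles the `torchrun` command. This is only an illustration: `build_torchrun_command` is a hypothetical name, the MP values (1 for 7B, 2 for 13B, 8 for 70B) are the ones listed in Meta's README, and inferring the model size from the directory name is an assumption that only holds if you keep the default folder names.

```python
import shlex

# Model-parallel (MP) values as listed in Meta's llama repository README.
MP_BY_MODEL_SIZE = {"7b": 1, "13b": 2, "70b": 8}


def build_torchrun_command(ckpt_dir, tokenizer_path="tokenizer.model",
                           script="example_chat_completion.py",
                           max_seq_len=512, max_batch_size=6):
    """Return the torchrun invocation as a list of arguments (hypothetical helper)."""
    # Infer the model size from the checkpoint directory name, e.g. "llama-2-13b-chat/".
    name = ckpt_dir.rstrip("/").split("/")[-1].lower()
    size = next((s for s in MP_BY_MODEL_SIZE if f"-{s}" in name), None)
    if size is None:
        raise ValueError(f"cannot infer model size from {ckpt_dir!r}")
    return [
        "torchrun", "--nproc_per_node", str(MP_BY_MODEL_SIZE[size]),
        script,
        "--ckpt_dir", ckpt_dir,
        "--tokenizer_path", tokenizer_path,
        "--max_seq_len", str(max_seq_len),
        "--max_batch_size", str(max_batch_size),
    ]


cmd = build_torchrun_command("llama-2-13b-chat/")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` from the top level of the cloned repository.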

## You can also access the weights through [Hugging Face](https://huggingface.co/meta-llama):

- You must first request a download from the Meta website using the same email address as your Hugging Face account.
- After that you can request access to any of the models on Hugging Face and within 1-2 days your account will be granted access to all versions.

[Go through this recipes setup from the official repository from Meta](https://github.com/facebookresearch/llama)

OK, great! Once you've done all of that, you should have a folder with your chosen model. In my case, I have two quantized 7b-chat models:

In [1]:
!ls models/

llama-2-7b-chat.Q4_K_M.gguf llama-2-7b-chat.Q5_K_M.gguf


See here for why to use the [.gguf standard format](https://deci.ai/blog/ggml-vs-gguf-comparing-formats-amp-top-5-methods-for-running-gguf-files/#:~:text=GGUF%2C%20introduced%20by,the%20GGUF%20format.).
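A quick way to sanity-check a downloaded file is to read its header. The sketch below assumes the standard GGUF layout (the 4-byte magic `GGUF` followed by a 32-bit version number) and a little-endian file; `read_gguf_version` is a hypothetical helper name, and the demo uses a tiny stand-in file rather than a real model.

```python
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF version number, or None if the file does not look like GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)  # 4-byte magic + 4-byte little-endian version
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Demo on a stand-in file (not a real model): the magic followed by version 2.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(GGUF_MAGIC + struct.pack("<I", 2))
demo_version = read_gguf_version(tmp.name)
os.remove(tmp.name)
print(demo_version)
```

On a real file you would call it as `read_gguf_version("models/llama-2-7b-chat.Q4_K_M.gguf")`.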

Now, let's look at how to run inference with Llama2 using a Python-friendly approach!

For that we'll be looking at a few awesome options for running inference with llama2 models including:

- [llama-cpp](https://github.com/ggerganov/llama.cpp) library
- [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python) library, which presents Python bindings for the [llama-cpp](https://github.com/ggerganov/llama.cpp) library.
- Running [llama2 with Hugging Face](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF)
- Setup with [Llama-cpp-python + Langchain](https://python.langchain.com/docs/integrations/llms/llamacpp)
- And finally a setup using [Ollama](https://ollama.ai/) and [Ollama+Langchain](https://python.langchain.com/docs/integrations/llms/ollama)
- Also using [LM-Studio](https://lmstudio.ai/).

Why so many options? Because it's important to know that there are many ways to run these models, and I think it's nice to present a landscape of implementation options :).

# [Llama-cpp (original)](https://github.com/ggerganov/llama.cpp)

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# On Linux/Mac:
make

# On Windows:
# - Download the latest fortran version of w64devkit.
# - Extract w64devkit on your pc.
# - Run w64devkit.exe.
# - Use the cd command to reach the llama.cpp folder.
# - From here you can run:
make
```

Then all you have to do is:

```
# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
  # [Optional] for models using BPE tokenizers
  ls ./models
  65B 30B 13B 7B vocab.json

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the 7B model to GGUF FP16 format
python3 convert.py models/7B/

  # [Optional] for models using BPE tokenizers
  python3 convert.py models/7B/ --vocabtype bpe

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0

# update the gguf filetype to current if older version is unsupported by another application
./quantize ./models/7B/ggml-model-q4_0.gguf ./models/7B/ggml-model-q4_0-v2.gguf COPY


# run the inference
./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
```
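To get a feel for what quantization buys you, here is a rough back-of-the-envelope size estimate. It is a sketch under stated assumptions: roughly 6.74 billion parameters for the 7B model, every tensor stored at the same precision, and file metadata ignored. For `q4_0`, each block of 32 weights takes 16 bytes of 4-bit values plus a 2-byte FP16 scale, which works out to 4.5 bits per weight.

```python
def estimate_model_size_gb(n_params, bits_per_weight):
    """Rough on-disk size in GB (10**9 bytes), ignoring metadata and mixed-precision tensors."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS_7B = 6.74e9          # assumption: approximate parameter count of the 7B model
FP16_BITS = 16
Q4_0_BITS = 18 * 8 / 32       # 18 bytes per block of 32 weights = 4.5 bits per weight

fp16_gb = estimate_model_size_gb(N_PARAMS_7B, FP16_BITS)
q4_0_gb = estimate_model_size_gb(N_PARAMS_7B, Q4_0_BITS)
print(f"FP16: ~{fp16_gb:.1f} GB, q4_0: ~{q4_0_gb:.1f} GB")
```

Real files will differ somewhat from these numbers, but the ratio (a bit under 3.6x smaller) is a useful rule of thumb when deciding whether a model fits in memory.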



# Llama-cpp-Python

- [source](https://abetlen.github.io/llama-cpp-python/)

Let's start by installing the package:

In [3]:
# !pip install llama-cpp-python

In [4]:
# source for the weights: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

Now we can load the `Llama` class.

In [2]:
from llama_cpp import Llama

Now we just instantiate the llama object giving the path to the downloaded weights for our model of choice.

In [3]:
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=1, n_ctx=4096,verbose=False)

Now, running predictions for [`Text Completion`](https://github.com/abetlen/llama-cpp-python#:~:text=Below%20is%20a,28%2C%0A%20%20%20%20%22total_tokens%22%3A%2042%0A%20%20%7D%0A%7D) is a breeze.

In [4]:
output = llm("What is a large language model?.")
output

{'id': 'cmpl-f1a031e4-d66f-4dbc-a64d-43f04d7c2e03',
 'object': 'text_completion',
 'created': 1712588138,
 'model': './models/llama-2-7b-chat.Q4_K_M.gguf',
 'choices': [{'text': ' A large language model is a type of artificial intelligence (AI) model that is',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 8, 'completion_tokens': 16, 'total_tokens': 24}}

This schema is very similar to the one we are all used to with the ChatGPT API.

In [5]:
print(output["choices"][0]["text"])

 A large language model is a type of artificial intelligence (AI) model that is
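Since the result is a plain dictionary, it is easy to inspect programmatically. The sketch below uses a hypothetical helper, `summarize_completion`, on a copy of the output shown above to pull out the text and check whether generation stopped because it hit the token limit.

```python
def summarize_completion(output):
    """Extract the text and basic stats from a text-completion result (hypothetical helper)."""
    choice = output["choices"][0]
    return {
        "text": choice["text"].strip(),
        # finish_reason == "length" means generation hit the token limit
        "truncated": choice["finish_reason"] == "length",
        "completion_tokens": output["usage"]["completion_tokens"],
    }


# Copy of the output shown above, so the sketch runs without loading a model.
sample_output = {
    "id": "cmpl-f1a031e4-d66f-4dbc-a64d-43f04d7c2e03",
    "object": "text_completion",
    "created": 1712588138,
    "model": "./models/llama-2-7b-chat.Q4_K_M.gguf",
    "choices": [{
        "text": " A large language model is a type of artificial intelligence (AI) model that is",
        "index": 0,
        "logprobs": None,
        "finish_reason": "length",
    }],
    "usage": {"prompt_tokens": 8, "completion_tokens": 16, "total_tokens": 24},
}

summary = summarize_completion(sample_output)
print(summary)
```

If `truncated` is true, passing a larger `max_tokens` value to the call, for example `llm(prompt, max_tokens=256)`, should give the model room to finish.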


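Notice that the completion stops mid-sentence: `finish_reason` is `'length'` and only 16 completion tokens were generated, so the output was cut off by the token limit. This call is also plain text completion: the prompt is passed to the model as-is, without the chat template that the chat model was fine-tuned on. A "base model" used this way may simply continue the prompt, for example by answering a question with more questions.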

See Andrej Karpathy explain that in [this part of his youtube video introducing LLMs](https://youtu.be/zjkBMFhNj_g?t=1213)

To get chat-style answers, we specify the `chat_format` parameter in the `Llama` class.

Now, when we ask the same question:

In [6]:
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", chat_format="llama-2", verbose=False)

prompt = "What is a large language model?."
response = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are a helpful assistant."},
          {
              "role": "user",
              "content": prompt
          }
      ]
)
print(response)
response["choices"][0]["message"]["content"]

{'id': 'chatcmpl-8c8515f4-b8c0-4140-975b-b940cedc2daa', 'object': 'chat.completion', 'created': 1712588156, 'model': './models/llama-2-7b-chat.Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '  A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. The goal of an LLM is to learn the patterns and structures of a language, such as grammar, syntax, and semantics, in order to produce text that is similar to human language.\nLLMs are typically trained on vast amounts of text data, such as books, articles, or web pages, and use a combination of machine learning algorithms and neural networks to learn the patterns and relationships between words and phrases in the language. Once trained, an LLM can be used for a variety of natural language processing tasks, such as:\n1. Language Translation: An LLM can be trained to transl

'  A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. The goal of an LLM is to learn the patterns and structures of a language, such as grammar, syntax, and semantics, in order to produce text that is similar to human language.\nLLMs are typically trained on vast amounts of text data, such as books, articles, or web pages, and use a combination of machine learning algorithms and neural networks to learn the patterns and relationships between words and phrases in the language. Once trained, an LLM can be used for a variety of natural language processing tasks, such as:\n1. Language Translation: An LLM can be trained to translate text from one language to another, allowing it to generate translations that are more accurate and natural-sounding than those produced by traditional machine translation systems.\n2. Text Summarization: An LLM can be used

We get an actual response from the model.
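To see what a chat format does under the hood, here is a minimal sketch of the Llama 2 chat template for a single turn, based on the `[INST]` and `<<SYS>>` markers used in Meta's reference code. `format_llama2_prompt` is a hypothetical helper; `llama-cpp-python` has its own implementation, and its exact whitespace and special-token handling may differ.

```python
# Markers from Meta's reference implementation of the Llama 2 chat format.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def format_llama2_prompt(system, user):
    """Build a single-turn Llama 2 chat prompt (the BOS token is left to the tokenizer)."""
    return f"{B_INST} {B_SYS}{system}{E_SYS}{user.strip()} {E_INST}"


chat_prompt = format_llama2_prompt(
    "You are a helpful assistant.",
    "What is a large language model?",
)
print(chat_prompt)
```

The system message is folded into the first user turn, and the model is expected to write its answer after the closing `[/INST]`.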

# [Llama2 inference with Hugging Face](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF)


First, download the model if you haven't already.

In [6]:
# Uncomment below if you are on Google Colab and don't have the llama2 7b model downloaded
# !pip install huggingface-hub
# !huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False

Now we'll need to install the `ctransformers` library. Check [here](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF#:~:text=Python%20using%20ctransformers-,First%20install%20the%20package,0.2.24%20--no-binary%20ctransformers,-Simple%20example%20code) for the option that makes sense for your machine.

In [6]:
# source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

# # Base ctransformers with no GPU acceleration
# pip install "ctransformers>=0.2.24"
# # Or with CUDA GPU acceleration
# pip install "ctransformers[cuda]>=0.2.24"
# # Or with ROCm GPU acceleration
# CT_HIPBLAS=1 pip install "ctransformers>=0.2.24" --no-binary ctransformers


# # Or with Metal GPU acceleration for macOS systems ---- I am on a macOS machine, so I'll use this option!
# !CT_METAL=1 pip install ctransformers --no-binary ctransformers
# This is the only package that is not included in the requirements because it will be installed in a different way depending on the system.
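If you want to pick the install option programmatically, something like the sketch below could work. The helper name `suggest_backend` and its flags are hypothetical, the commands are the ones listed above (with the version specifier quoted so the shell does not treat `>=` as a redirect), and treating every macOS machine as Metal-capable is a simplifying assumption.

```python
import platform

# Install commands taken from the options listed above.
INSTALL_COMMANDS = {
    "cpu": 'pip install "ctransformers>=0.2.24"',
    "cuda": 'pip install "ctransformers[cuda]>=0.2.24"',
    "rocm": 'CT_HIPBLAS=1 pip install "ctransformers>=0.2.24" --no-binary ctransformers',
    "metal": "CT_METAL=1 pip install ctransformers --no-binary ctransformers",
}


def suggest_backend(system=None, has_nvidia_gpu=False, has_amd_gpu=False):
    """Pick a backend name; GPU flags are supplied by the caller, not detected."""
    system = system or platform.system()
    if system == "Darwin":      # assumption: treat macOS as Metal-capable
        return "metal"
    if has_nvidia_gpu:
        return "cuda"
    if has_amd_gpu:
        return "rocm"
    return "cpu"


backend = suggest_backend()
print(backend, "->", INSTALL_COMMANDS[backend])
```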

Example to load and run a GGUF model:

In [7]:
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
#llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GGUF", model_file="llama-2-7b-chat.Q4_K_M.gguf", model_type="llama", gpu_layers=50)
llm = AutoModelForCausalLM.from_pretrained(model_path_or_repo_id="./models/llama-2-7b-chat.Q4_K_M.gguf", model_type="llama")
print(llm("What is a large language model?."))

 A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a massive dataset of text to generate language outputs that are coherent and natural-sounding. LLMs have become increasingly popular in recent years due to their ability to generate text that is often indistinguishable from human language. In this essay, I will provide an overview of the technology behind LLMs, discuss some of the challenges and limitations of these models, and explore their potential applications and ethical considerations.
Large language models work by processing input texts through a series of neural network layers, which learn to predict the next word in a sequence given the context provided by the previous words. The model is trained on a vast dataset of text, such as books, articles, or websites, and the objective is to maximize the likelihood of generating the correct next word. During training, the model is evaluated on its ability to predict the correct word based 

In [8]:
print(llm("List 3 essential stretches a person should do to live longer, healthier and stronger."))


Here are three essential stretches that can help you live longer, healthier, and stronger:
1. Neck Stretch: This stretch helps to relieve tension in the neck and shoulders, which can lead to headaches, muscle strain, and poor posture. To perform this stretch, gently tilt your head to the side, bringing your ear towards your shoulder. Hold for 30 seconds and then switch sides.
2. Shoulder Rolls: This stretch helps to loosen up the shoulders and improve flexibility in the upper body. To perform this stretch, roll your shoulders forward and backward in a circular motion. Repeat for 10-15 repetitions.
3. Hip Circles: This stretch helps to improve flexibility in the hips and lower back, which can help to prevent injuries and alleviate lower back pain. To perform this stretch, stand with your feet shoulder-width apart and slowly circle your hips in a large circle. Repeat for 10-15 repetitions in each direction.
By incorporating these stretches into your daily routine, you can help to improv

You can also check out this demo on the Hugging Face website for interacting with the Llama2-70B-Chat model:
- https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI

# Setup with LangChain

[source](https://python.langchain.com/docs/integrations/llms/llamacpp)

Let's install the `langchain` package:

In [12]:
# if you are on google colab uncomment this and install langchain
# !pip install langchain

In [9]:
# source: https://python.langchain.com/docs/integrations/llms/llamacpp

# !CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
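It can help to see the exact text the model will receive. With the default f-string style template, the substitution is essentially what plain `str.format` does, so the sketch below reproduces it without LangChain; `render` is a hypothetical helper used only for illustration.

```python
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""


def render(template_text, **variables):
    """Fill the {placeholders} in a template with plain str.format."""
    return template_text.format(**variables)


rendered = render(
    template,
    question="What is the secret to becoming a great pragmatic programmer?",
)
print(rendered)
```

With the real object, `prompt.format(question=...)` should give the same string, which is handy for debugging a chain before running the model.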

In [10]:
# !CT_METAL=1  # no effect here: a "!" line runs in its own subshell, so the variable does not persist
n_gpu_layers = 50  # Change this value based on your model and your GPU VRAM pool.
n_batch = 4096  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

objc[28860]: Class GGMLMetalClass is implemented in both /Users/greatmaster/miniconda3/envs/oreilly-llama2/lib/python3.10/site-packages/ctransformers/lib/local/libctransformers.dylib (0x1067741d0) and /Users/greatmaster/miniconda3/envs/oreilly-llama2/lib/python3.10/site-packages/llama_cpp/libllama.dylib (0x1271b0250). One of the two will be used. Which one is undefined.
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096

In [11]:
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What is the secret to becoming a great pragmatic programmer?"
llm_chain.invoke(question)

 🙂

Step 1: Be Curious and Learn
The first step to becoming a great Pragmatic Programmer is to be curious about programming and the industry as a whole. Keep up-to-date with the latest trends, tools, and technologies by reading books, attending conferences, and participating in online forums.
Step 2: Practice, Practice, Practice
The next step is to practice what you have learned. Start by working on small projects, then gradually move on to more complex ones. The more you program, the better you will become at problem-solving and finding creative solutions.
Step 3: Focus on Building Quality Software
To become a great Pragmatic Programmer, it's essential to focus on building quality software that is easy to maintain and evolve over time. This means writing clean, modular code that adheres to established standards and best practices.
Step 4: Learn by Teaching Others
One of the most effective ways to learn any skill is to teach it to someone else. By explaining complex concepts in a simpl


llama_print_timings:        load time =     457.81 ms
llama_print_timings:      sample time =      32.60 ms /   256 runs   (    0.13 ms per token,  7852.52 tokens per second)
llama_print_timings: prompt eval time =     457.71 ms /    42 tokens (   10.90 ms per token,    91.76 tokens per second)
llama_print_timings:        eval time =    8028.85 ms /   255 runs   (   31.49 ms per token,    31.76 tokens per second)
llama_print_timings:       total time =    9175.38 ms /   297 tokens


{'question': 'What is the secret to becoming a great pragmatic programmer?',
 'text': " 🙂\n\nStep 1: Be Curious and Learn\nThe first step to becoming a great Pragmatic Programmer is to be curious about programming and the industry as a whole. Keep up-to-date with the latest trends, tools, and technologies by reading books, attending conferences, and participating in online forums.\nStep 2: Practice, Practice, Practice\nThe next step is to practice what you have learned. Start by working on small projects, then gradually move on to more complex ones. The more you program, the better you will become at problem-solving and finding creative solutions.\nStep 3: Focus on Building Quality Software\nTo become a great Pragmatic Programmer, it's essential to focus on building quality software that is easy to maintain and evolve over time. This means writing clean, modular code that adheres to established standards and best practices.\nStep 4: Learn by Teaching Others\nOne of the most effective w
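The `llama_print_timings` lines above report a duration and a token count, so throughput is just the count divided by the time in seconds. The sketch below parses those lines with a regular expression and recomputes tokens per second; `parse_timings` is a hypothetical helper, and the pattern is written against the log format shown in this notebook, which may change between llama.cpp versions.

```python
import re

TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[\w ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Map each timing name to its duration, token count, and tokens per second."""
    stats = {}
    for m in TIMING_RE.finditer(log_text):
        ms = float(m["ms"])
        count = int(m["count"]) if m["count"] else None
        rate = count / (ms / 1000.0) if count else None
        stats[m["name"]] = {"ms": ms, "count": count, "tokens_per_second": rate}
    return stats


# Log lines copied from the output above.
log = """
llama_print_timings:        load time =     457.81 ms
llama_print_timings:      sample time =      32.60 ms /   256 runs   (    0.13 ms per token,  7852.52 tokens per second)
llama_print_timings: prompt eval time =     457.71 ms /    42 tokens (   10.90 ms per token,    91.76 tokens per second)
llama_print_timings:        eval time =    8028.85 ms /   255 runs   (   31.49 ms per token,    31.76 tokens per second)
llama_print_timings:       total time =    9175.38 ms /   297 tokens
"""

timings = parse_timings(log)
print(round(timings["eval"]["tokens_per_second"], 2))
```

The `eval` entry is the one that matters most for interactive use, since it measures generation speed after the prompt has been processed.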

# Setup with Ollama and Ollama + Langchain

You can download the executable from the Ollama website and use it out of the box:
- https://ollama.ai/

In [19]:
# !pip install ollama
# Uncomment below if you haven't run this yet on your local machine or google colab
# !ollama pull llama2:latest

In [12]:
import ollama

response = ollama.chat(model='llama2', 
                       messages=[
  {
    'role': 'user',
    'content': 'What is a large language model?',
  },
])

print(response['message']['content'])


A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. The goal of an LLM is to be able to understand and generate language in a way that is comparable to human language use, and the models are typically trained on vast amounts of text data such as books, articles, and other forms of written content.

LLMs can be used for a variety of applications such as:

1. Language Translation: Large language models can be trained on multiple languages and used to translate text from one language to another.
2. Text Summarization: LLMs can be used to summarize long documents or articles, extracting the most important information and generating a concise summary.
3. Chatbots: Large language models can be used to build chatbots that can understand and respond to user queries in a conversational manner.
4. Content Generation: LLMs can be used to generate content su
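The call above is a single turn. For a back-and-forth conversation, the usual pattern is to keep a list of messages and append each reply before the next call. Here is a minimal sketch of that pattern: `chat_turn` is a hypothetical helper, and `fake_chat` is a stand-in for `ollama.chat` so the example runs without a local server.

```python
def chat_turn(history, user_text, chat_fn, model="llama2"):
    """Append the user message, call chat_fn, then append and return the assistant reply."""
    history.append({"role": "user", "content": user_text})
    response = chat_fn(model=model, messages=history)
    reply = response["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_chat(model, messages):
    """Stand-in for ollama.chat: echoes the last user message (for illustration only)."""
    return {"message": {"role": "assistant", "content": f"echo: {messages[-1]['content']}"}}


history = []
chat_turn(history, "What is a large language model?", fake_chat)
chat_turn(history, "Summarize that in one sentence.", fake_chat)
print([m["role"] for m in history])
```

With Ollama installed and the model pulled, you would pass `ollama.chat` as `chat_fn` instead of the stand-in.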

## Set up Ollama with Langchain

You can also use Ollama through `langchain`, as shown below.

In [14]:
from langchain.chat_models import ChatOllama

# Named chat_model so it does not shadow the `ollama` module imported earlier.
chat_model = ChatOllama(model="llama2")

chat_model.invoke("What is a large language model?")

AIMessage(content="\nA large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text, such as books, articles, or other written content. The goal of training a large language model is to enable the model to generate coherent and natural-sounding text, or to perform other language-related tasks, such as language translation, text summarization, or chatbots.\n\nThe key characteristic of a large language model is its size. Unlike smaller language models that are typically trained on smaller datasets, large language models are trained on massive datasets that can contain tens of millions or even billions of words. This allows the model to learn a much broader range of linguistic patterns and relationships, which in turn enables it to generate more diverse and coherent text.\n\nThere are several types of large language models, including:\n\n1. Neural network-based models: These models use artificial neural networks to learn the patterns and 

# Set up with LM-Studio

LM Studio is like an improved version of GPT4All, with additional features such as automatically evaluating your machine's specs to determine whether it can run a given model:

![](./assets-resources/lm-studio-machine-specs-analysis.png)

Another great feature of LM Studio is the local inference server, which lets you host your own model:

![](./assets-resources/lm-studio-local-inf-server.png)

Super easy to download open source models:

![](./assets-resources/lm-studio-download.png)

After downloading it, I can select it for chat easily:

![](./assets-resources/lm-studio-select-to-chat.png)

And now you can just chat with that model:

![](./assets-resources/lm-studio-chat.png)

You can easily host and run inference on your own models:

In [1]:
# Load a model, start the server, and run this example in your terminal
# Choose between streaming and non-streaming mode by setting the "stream" field

!curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d \
  '{"messages": [{"role": "system", "content": "Always answer in rhymes."},\
    {"role": "user", "content": "Introduce yourself."}\
  ], \
  "temperature": 0.7, \
  "max_tokens": -1,\
  "stream": false\
}'

{
  "id": "chatcmpl-4rxck8soi2xctzrc3zkauj",
  "object": "chat.completion",
  "created": 1707168904,
  "model": "/Users/greatmaster/.cache/lm-studio/models/TheBloke/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q2_K.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm a lively soul, with a heart that's whole."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 16,
    "total_tokens": 32
  }
}

Below are the results of running the inference above:

![](./assets-resources/lm-studio-local-server.png)
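The same request can be made from Python using only the standard library. The sketch below mirrors the `curl` call above (same URL, payload fields, and defaults); `build_chat_request`, `send`, and `extract_reply` are hypothetical helper names, and `send` only works while the LM Studio server is running on port 1234.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:1234/v1",
                       temperature=0.7, max_tokens=-1, stream=False):
    """Build the POST request for the chat completions endpoint without sending it."""
    payload = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON body (requires the local server to be running)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def extract_reply(response):
    """Return the assistant message text from a chat completion response."""
    return response["choices"][0]["message"]["content"]


request = build_chat_request([
    {"role": "system", "content": "Always answer in rhymes."},
    {"role": "user", "content": "Introduce yourself."},
])
print(request.full_url)
```

With the server started, `extract_reply(send(request))` should return just the assistant's text.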

# Some extra notes on running Llama on your phone

[MLC LLM](https://github.com/mlc-ai/mlc-llm) is an open-source project that makes it possible to run language models locally on a variety of devices and platforms, including iOS and Android.

For iPhone users, there’s an MLC chat app on the App Store. MLC now has support for the 7B, 13B, and 70B versions of Llama 2, but it’s still in beta and not yet in the App Store version, so you’ll need to install TestFlight to try it out. Check out the [instructions for installing the beta version here](https://llm.mlc.ai/docs/index.html#getting-started).

# Final Note

The setup we're going to use will depend on the type of application we are trying to build.

But the overall rule for this course will be to use `llama-cpp-python` combined with langchain, for ease of use in the use cases we are interested in.

# Resources

- https://github.com/abetlen/llama-cpp-python
- https://github.com/ggerganov/llama.cpp
- https://ai.meta.com/llama/
- https://ollama.ai/download
- https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
- https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI
- https://huggingface.co/blog/llama2
- https://replicate.com/blog/run-llama-locally
- https://ollama.ai/
- https://arxiv.org/pdf/2307.09288.pdf 
- https://ai.meta.com/llama/#resources
- https://www.philschmid.de/llama-2
- https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard