# Getting Started with Llama2



# Setup

## Quick Start

The quick start below is copied from the [official repo](https://github.com/facebookresearch/llama) for convenience.

You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally. For more examples, see the Llama 2 recipes repository.

- In a conda env with PyTorch / CUDA available, clone and download this repository.

- In the top level directory run:

```pip install -e .```

- Visit the [Meta website and register](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) to download the model/s.

- Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.

- Once you get the email, navigate to your downloaded llama repository and run the download.sh script.

- Make sure to grant execution permissions to the download.sh script.

- During this process, you will be prompted to enter the URL from the email.

- Do not use the “Copy Link” option but rather make sure to manually copy the link from the email.

- Once the model/s you want have been downloaded, you can run the model locally using the command below:

```
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
```

### Note

- Replace llama-2-7b-chat/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model.
- The `--nproc_per_node` value should be set to the MP value for the model you are using.
- Adjust the max_seq_len and max_batch_size parameters as needed.
- This example runs the example_chat_completion.py found in this repository but you can change that to a different .py file.
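
The notes above can be sketched as a small helper that assembles the `torchrun` command for a given checkpoint. This is only an illustration: `build_torchrun_command` and `MP_BY_MODEL_SIZE` are hypothetical names, and the MP values (1 for 7B, 2 for 13B, 8 for 70B) are taken from Meta's README as an assumption, so check them against the checkpoint you downloaded. Nothing is executed; the function only returns the argument list.

```python
# MP (model-parallel) values as listed in Meta's llama README; treat them as
# assumptions and verify them for the checkpoint you actually downloaded.
MP_BY_MODEL_SIZE = {"7b": 1, "13b": 2, "70b": 8}


def build_torchrun_command(ckpt_dir, tokenizer_path, model_size="7b",
                           script="example_chat_completion.py",
                           max_seq_len=512, max_batch_size=6):
    """Return the torchrun invocation as an argument list (nothing is run)."""
    return [
        "torchrun",
        "--nproc_per_node", str(MP_BY_MODEL_SIZE[model_size]),
        script,
        "--ckpt_dir", ckpt_dir,
        "--tokenizer_path", tokenizer_path,
        "--max_seq_len", str(max_seq_len),
        "--max_batch_size", str(max_batch_size),
    ]


cmd = build_torchrun_command("llama-2-7b-chat/", "tokenizer.model")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd)` from inside the cloned repository if you prefer launching from Python.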

> All of this info can be found in Meta's [official repo](https://github.com/facebookresearch/llama)!

## You can also access the weights through [Hugging Face](https://huggingface.co/meta-llama):

- You must first request a download from the Meta website using the same email address as your Hugging Face account.
- After that you can request access to any of the models on Hugging Face and within 1-2 days your account will be granted access to all versions.

[Go through the setup in Meta's official repository](https://github.com/facebookresearch/llama)

OK, great! Once you've done all of that, you should have a folder with your chosen model. In my case, I have two quantized 7b-chat models:

In [2]:
!ls models/

llama-2-7b-chat.Q4_K_M.gguf llama-2-7b-chat.Q5_K_M.gguf
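
If you prefer to check the folder from Python, here is a minimal sketch. It assumes the `<name>.<quant>.gguf` naming pattern seen in the listing above; `quant_tag` and `list_gguf_models` are hypothetical helper names, not part of any library.

```python
from pathlib import Path


def quant_tag(filename):
    """Extract the quantization tag from a name like 'model.Q4_K_M.gguf'."""
    parts = Path(filename).name.split(".")
    # Expect at least '<name>.<quant>.gguf'; otherwise the tag is unknown.
    if len(parts) >= 3 and parts[-1] == "gguf":
        return parts[-2]
    return None


def list_gguf_models(models_dir="models"):
    """Map each .gguf file in models_dir to its quantization tag."""
    return {p.name: quant_tag(p.name)
            for p in sorted(Path(models_dir).glob("*.gguf"))}


print(quant_tag("llama-2-7b-chat.Q4_K_M.gguf"))  # Q4_K_M
```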


Now, let's look at how to run inference with Llama2 using a Python-friendly approach!

For that we'll be looking at a few awesome options for running inference with llama2 models including:

- [llama-cpp](https://github.com/ggerganov/llama.cpp) library
- [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python) library, which presents Python bindings for the [llama-cpp](https://github.com/ggerganov/llama.cpp) library.
- Running [llama2 with Hugging Face](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF)
- Setup with [Llama-cpp-python + Langchain](https://python.langchain.com/docs/integrations/llms/llamacpp)
- And finally a setup using [Ollama](https://ollama.ai/) and [Ollama+Langchain](https://python.langchain.com/docs/integrations/llms/ollama)

Why so many options? Because it's important to know that there are many ways to run these models, and I think it's nice to present a landscape of implementation options :)

# [Llama-cpp (original)](https://github.com/ggerganov/llama.cpp)

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# On Linux/Mac:
make

# On Windows:
# - Download the latest fortran version of w64devkit.
# - Extract w64devkit on your PC.
# - Run w64devkit.exe.
# - Use the cd command to reach the llama.cpp folder.
# - From here you can run:
make
```

Then all you have to do is:

```
# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
  # [Optional] for models using BPE tokenizers
  ls ./models
  65B 30B 13B 7B vocab.json

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/

  # [Optional] for models using BPE tokenizers
  python convert.py models/7B/ --vocabtype bpe

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0

# update the gguf filetype to current if older version is unsupported by another application
./quantize ./models/7B/ggml-model-q4_0.gguf ./models/7B/ggml-model-q4_0-v2.gguf COPY


# run the inference
./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
```
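
To get a feel for what the quantize step buys you, here is a rough back-of-the-envelope estimate. It assumes a round 7 billion parameters and the `q4_0` layout of 32 four-bit weights plus one 16-bit scale per block (about 4.5 bits per weight). Real files differ somewhat because of metadata and tensors kept at other precisions, so treat the numbers as approximations.

```python
def estimated_size_gb(n_params, bits_per_weight):
    """Approximate size in GB (10^9 bytes), ignoring metadata and mixed-precision tensors."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS_7B = 7e9  # assumption: round figure for the "7B" model

# Assumed q4_0 layout: each block holds 32 four-bit weights plus one 16-bit scale.
Q4_0_BITS_PER_WEIGHT = (32 * 4 + 16) / 32  # 4.5 bits per weight

print(f"f16 : {estimated_size_gb(N_PARAMS_7B, 16):.1f} GB")
print(f"q4_0: {estimated_size_gb(N_PARAMS_7B, Q4_0_BITS_PER_WEIGHT):.1f} GB")
```

Under these assumptions the 4-bit file is a bit under a third of the FP16 size, which is what makes a 7B model practical on a laptop.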



# Llama-cpp-Python

- [source](https://abetlen.github.io/llama-cpp-python/)

Let's start by installing the package:

In [None]:
!pip install llama-cpp-python

Now we can load the `Llama` class.

In [1]:
from llama_cpp import Llama

Now we just instantiate the `Llama` object, giving it the path to the downloaded weights for our model of choice.

In [2]:
llm = Llama(model_path="./models/llama-2-7b-chat.Q5_K_M.gguf")

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q5_K     [  4096,  4096,     1

Now, running predictions for [`Text Completion`](https://github.com/abetlen/llama-cpp-python#:~:text=Below%20is%20a,28%2C%0A%20%20%20%20%22total_tokens%22%3A%2042%0A%20%20%7D%0A%7D) is a breeze.

In [3]:
output = llm("Write a template for writing Python automations to improve personal productivity.")
print(output["choices"][0]["text"])


As part of your work as a productivity coach, you have been asked to create a template for writing Python automations to improve personal productivity. Here is the outline you created:
I. Introduction

* Briefly explain what automation is and why it can be useful for improving personal productivity
* Mention that the template provided will help readers get started with writing their own Python automations

II. Identifying tasks to automate

* Ask readers to think about their daily tasks and identify those that are repetitive or time-consuming
* Provide examples of common tasks that can



llama_print_timings:        load time =  6454.00 ms
llama_print_timings:      sample time =    87.17 ms /   128 runs   (    0.68 ms per token,  1468.40 tokens per second)
llama_print_timings: prompt eval time =  6453.96 ms /    15 tokens (  430.26 ms per token,     2.32 tokens per second)
llama_print_timings:        eval time = 11359.43 ms /   127 runs   (   89.44 ms per token,    11.18 tokens per second)
llama_print_timings:       total time = 18055.35 ms
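
The timing lines above can be sanity-checked by hand: throughput is just the number of tokens divided by the elapsed seconds. The sketch below recomputes it from one of the printed lines; `tokens_per_second` is a hypothetical helper, and the regular expression only assumes the `<ms> ms / <n> runs` (or `tokens`) layout shown in this output.

```python
import re

# One line copied from the output above.
TIMING_LINE = ("llama_print_timings:        eval time = 11359.43 ms /   127 runs"
               "   (   89.44 ms per token,    11.18 tokens per second)")


def tokens_per_second(line):
    """Recompute throughput from the '<ms> ms / <n> runs|tokens' part of a timing line."""
    match = re.search(r"=\s*([\d.]+) ms /\s*(\d+) (?:runs|tokens)", line)
    if match is None:
        raise ValueError("not a per-token timing line")
    elapsed_ms, count = float(match.group(1)), int(match.group(2))
    return count / (elapsed_ms / 1000.0)


print(round(tokens_per_second(TIMING_LINE), 2))  # 11.18
```

Note that the prompt eval time roughly equals the load time here, which suggests most of that first call was spent loading the model rather than generating.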


You can also do chat completion, similar to the OpenAI API:

In [4]:
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", chat_format="llama-2")

prompt = "How many people live in Europe?"
response = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are a helpful assistant."},
          {
              "role": "user",
              "content": prompt
          }
      ]
)

response

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1

{'id': 'chatcmpl-edd02658-0af0-4275-965a-6921f8a9c698',
 'object': 'chat.completion',
 'created': 1702046138,
 'model': './models/llama-2-7b-chat.Q4_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': " As of 2023, the estimated population of Europe is approximately 746 million people. This number includes the 50 countries that are located on the European continent, as well as several island nations and territories that are part of the European Union (EU).\nIt's worth noting that the population of Europe is constantly changing due to factors such as births, deaths, migration, and urbanization. The United Nations estimates that the population of Europe will reach 760 million by 2030 and 815 million by 2050.\nIt's also important to note that the EU has 28 member states, with a total population of over 510 million people, which is approximately 69% of the total population of Europe."},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 37, 'compl
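
To make the chat call less of a black box, the sketch below shows roughly what a single-turn Llama 2 chat prompt looks like once the system and user messages are merged, and how to pull the assistant text out of the response dictionary. The template follows the `[INST]` / `<<SYS>>` convention from Meta's reference code; the exact whitespace applied by `chat_format="llama-2"` is an assumption here, and `format_llama2_prompt` and `first_reply` are hypothetical helper names.

```python
def format_llama2_prompt(system, user):
    """Single-turn Llama 2 chat prompt (the BOS token is left to the tokenizer)."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def first_reply(response):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"].strip()


print(format_llama2_prompt("You are a helpful assistant.",
                           "How many people live in Europe?"))

# Hypothetical response with the same shape as the output shown above.
sample = {"choices": [{"index": 0,
                       "message": {"role": "assistant", "content": " Example reply."},
                       "finish_reason": "stop"}]}
print(first_reply(sample))  # Example reply.
```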

# [Llama2 inference with Hugging Face](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF)


First, download the model (skip this if you already have it).

In [4]:
!huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False

downloading https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf to /Users/greatmaster/.cache/huggingface/hub/tmpth7xvajs
Downloading (…)-7b-chat.Q4_K_M.gguf: 100%|█| 4.08G/4.08G [01:36<00:00, 42.1MB/s]
Storing https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf in local_dir at ./models/llama-2-7b-chat.Q4_K_M.gguf (not cached).
./models/llama-2-7b-chat.Q4_K_M.gguf
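
The same download can be done from Python. The sketch below first rebuilds the URL that appears in the log above, then wraps `hf_hub_download` from the `huggingface_hub` package. `resolve_url` and `download_gguf` are hypothetical helper names, and the import is done lazily so the URL helper works even without the package installed.

```python
def resolve_url(repo_id, filename, revision="main"):
    """Build the direct-download URL pattern seen in the CLI log above."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_gguf(repo_id, filename, local_dir="./models"):
    """Download one file from the Hub and return its local path."""
    # Imported lazily: requires `pip install huggingface_hub` and network access.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


print(resolve_url("TheBloke/Llama-2-7b-Chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf"))
```

Calling `download_gguf("TheBloke/Llama-2-7b-Chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf")` should place the file under `./models`, as the CLI command does.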


Now we'll need to install the `ctransformers` library. Check [here](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF#:~:text=Python%20using%20ctransformers-,First%20install%20the%20package,0.2.24%20--no-binary%20ctransformers,-Simple%20example%20code) for the option that makes sense for your machine.

In [5]:
# source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

# Note: the version specifiers are quoted so the shell does not treat ">=" as a redirect.
# # Base ctransformers with no GPU acceleration
# pip install "ctransformers>=0.2.24"
# # Or with CUDA GPU acceleration
# pip install "ctransformers[cuda]>=0.2.24"
# # Or with ROCm GPU acceleration
# CT_HIPBLAS=1 pip install "ctransformers>=0.2.24" --no-binary ctransformers


# # Or with Metal GPU acceleration for macOS systems ---- I am on a macOS machine so I'll use this option!
# !CT_METAL=1 pip install ctransformers --no-binary ctransformers

Example to load and run a GGUF model:

In [6]:
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
#llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GGUF", model_file="llama-2-7b-chat.Q4_K_M.gguf", model_type="llama", gpu_layers=50)
llm = AutoModelForCausalLM.from_pretrained(model_path_or_repo_id="./models/llama-2-7b-chat.Q4_K_M.gguf", model_type="llama")
print(llm("AI is going to"))

  from .autonotebook import tqdm as notebook_tqdm
objc[49253]: Class GGMLMetalClass is implemented in both /Users/greatmaster/miniconda3/envs/oreilly_llama2/lib/python3.11/site-packages/llama_cpp/libllama.dylib (0x11609c1f8) and /Users/greatmaster/miniconda3/envs/oreilly_llama2/lib/python3.11/site-packages/ctransformers/lib/local/libctransformers.dylib (0x1179d01d0). One of the two will be used. Which one is undefined.


 make us all redundant. Here are 10 ways it could happen
AI is going to make us all redundant. Here are 10 ways it could happen, writes Paul Munro:
Most experts agree that automation and artificial intelligence (AI) will have a significant impact on the workforce in the coming years. While some predict that AI will create new jobs alongside existing ones, others warn that it could lead to widespread redundancy. Here are 10 ways in which AI could make us all redundant:
1. White-collar workers beware: AI is already making inroads into roles such as accounting, customer service, and even writing. As AI algorithms become more sophisticated, these jobs will continue to decline.
2. The gig economy could collapse: As self-driving cars and drones become more prevalent, the need for human drivers and delivery personnel will decrease. This could lead to a collapse of the gig economy as we know it.
3. Medical professionals under threat: AI is already being used in hospitals around the world to as

In [7]:
print(llm("What does it mean to be human?"))



What does it mean to be human? This is a question that has puzzled philosophers, scientists, and theologians for centuries. The answer can vary depending on one's perspective, beliefs, and values. Here are some possible answers:

1. Biological Definition: From a biological standpoint, being human means possessing certain physical characteristics, such as a large brain, upright posture, and the ability to reproduce. Humans belong to the species Homo sapiens and share a common ancestry with other primates.
2. Cultural Identity: Culture plays a significant role in shaping our understanding of what it means to be human. Our beliefs, values, customs, and practices are passed down from generation to generation, influencing how we see ourselves and others. Our cultural identity defines who we are, where we come from, and how we relate to the world around us.
3. Psychological Characteristics: The way humans think, feel, and behave is unique to our species. We have the capacity for self-aware

You can also check out this demo on the Hugging Face website for interacting with the Llama2-70B-Chat model:
- https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI

# Setup with LangChain

[source](https://python.langchain.com/docs/integrations/llms/llamacpp)

Let's install the `langchain` package:

In [None]:
!pip install langchain

In [8]:
# source: https://python.langchain.com/docs/integrations/llms/llamacpp

# !CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
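
To see the text the model will actually receive, you can fill the template by hand. For a simple template like this one, plain `str.format` produces the same text as the `PromptTemplate` substitution; the sketch below uses only the standard library, and `render` is a hypothetical helper name.

```python
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""


def render(question):
    """Fill the {question} placeholder the same way the prompt template would."""
    return template.format(question=question)


text = render("What is the secret to becoming a great pragmatic programmer?")
print(text)
```

Printing the rendered prompt like this is a cheap way to debug a chain before spending time on generation.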

In [9]:
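# Note: `!CT_METAL=1` only sets the variable in a throwaway subshell, so it has no effect here.
# Metal support is decided when llama-cpp-python is built (see the commented pip command above).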
n_gpu_layers = 50  # Change this value based on your model and your GPU VRAM pool.
n_batch = 4096  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What is the secret to becoming a great pragmatic programmer?"
llm_chain.run(question)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1

  Pragmatic programming involves practical problem-solving abilities and adopting an approach that is flexible, realistic, and successful over time. Here are some tips for becoming a great pragmatic programmer:
1. Learn from others: Keep up with industry trends and best practices by reading books and blogs about programming. Attend seminars and workshops to learn from other professionals in the field and gain insight into different approaches to problem-solving.
2. Practice, practice, practice: The more you write code, the better you will become at pragmatic programming. Try to solve problems in different ways to find what works best for you and your team. Practice coding on various platforms or with different languages, if possible. This will help you become comfortable and effective in a variety of contexts.
3. Be flexible: Pragmatic programmers must be adaptable and willing to try new approaches when necessary. They understand that no one solution works for every problem and are abl


llama_print_timings:        load time =   812.26 ms
llama_print_timings:      sample time =   191.06 ms /   256 runs   (    0.75 ms per token,  1339.92 tokens per second)
llama_print_timings: prompt eval time =   812.22 ms /    42 tokens (   19.34 ms per token,    51.71 tokens per second)
llama_print_timings:        eval time =  8100.95 ms /   255 runs   (   31.77 ms per token,    31.48 tokens per second)
llama_print_timings:       total time =  9703.93 ms


'  Pragmatic programming involves practical problem-solving abilities and adopting an approach that is flexible, realistic, and successful over time. Here are some tips for becoming a great pragmatic programmer:\n1. Learn from others: Keep up with industry trends and best practices by reading books and blogs about programming. Attend seminars and workshops to learn from other professionals in the field and gain insight into different approaches to problem-solving.\n2. Practice, practice, practice: The more you write code, the better you will become at pragmatic programming. Try to solve problems in different ways to find what works best for you and your team. Practice coding on various platforms or with different languages, if possible. This will help you become comfortable and effective in a variety of contexts.\n3. Be flexible: Pragmatic programmers must be adaptable and willing to try new approaches when necessary. They understand that no one solution works for every problem and are

# Setup with Ollama and Ollama + LangChain

You can download the executable from the Ollama website and use it out of the box:
- https://ollama.ai/
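
Once Ollama is running, it can also be queried over its local REST API without any extra packages. The sketch below is a minimal example under a few assumptions: the server listens on the default `http://localhost:11434`, the `llama2` model has already been pulled (for example with `ollama pull llama2`), and `build_generate_request` and `generate` are hypothetical helper names. Only the request is built at import time; nothing is sent until `generate` is called.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: default local port


def build_generate_request(prompt, model="llama2"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama2"):
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


request = build_generate_request("Tell me a joke")
print(request.full_url, request.get_method())
```

With the server up, `print(generate("Tell me a joke"))` should return a single string.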

## Set up Ollama with Langchain

You can also call Ollama from `langchain`, as shown below.

In [10]:
from langchain.chat_models import ChatOllama

ollama = ChatOllama(model="llama2")

ollama.predict("What is the best programming language?")

' There is no one "best" programming language, as different languages are better suited for different purposes and preferences.его\n\nThe choice of a programming language depends on a variety of factors such as:\n\n1. Purpose: What do you want to use the language for? Are you building a web application, creating a desktop application, or working on a mobile app? Different languages are better suited for different purposes.\n2. Level of complexity: How complex do you want the language to be? Some languages are more straightforward and easy to learn, while others offer more advanced features but are harder to master.\n3. Community: Are you interested in working with a large community of developers who can provide support and resources? Some languages have larger communities than others, which can be beneficial for learning and troubleshooting.\n4. Ecosystem: What kind of tools and libraries are available for the language? Are there plenty of libraries and frameworks that make it easier t

In [11]:
ollama.predict("Tell me a joke")

" Sure! Here's a quick one:\n Unterscheidung)\n\nWhy don't scientists trust atoms?\nBecause they make up everything!\n\nI hope that brought a smile to your face! Do you want to hear another one?"

# Some Extra Notes on Running Llama on Your Phone

[MLC LLM](https://github.com/mlc-ai/mlc-llm) is an open-source project that makes it possible to run language models locally on a variety of devices and platforms, including iOS and Android.

For iPhone users, there's an MLC chat app on the App Store. MLC now supports the 7B, 13B, and 70B versions of Llama 2, but that support is still in beta and not yet in the App Store version, so you'll need to install TestFlight to try it out. Check out the [instructions for installing the beta version here](https://llm.mlc.ai/docs/index.html#getting-started).

# Final Note

The setup we're going to use will depend on the type of application we are trying to build.

But the overall rule for this course will be to use `llama-cpp-python` combined with LangChain, for ease of use in the use cases we are interested in.

# Resources

- https://github.com/abetlen/llama-cpp-python
- https://github.com/ggerganov/llama.cpp
- https://ai.meta.com/llama/
- https://ollama.ai/download
- https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
- https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI
- https://huggingface.co/blog/llama2
- https://replicate.com/blog/run-llama-locally
- https://ollama.ai/
- https://arxiv.org/pdf/2307.09288.pdf 
- https://ai.meta.com/llama/#resources
- https://www.philschmid.de/llama-2
- https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard