# Getting Started with Llama2



# Setup

## Quick Start

The steps below are copied from the quick start in the [official repo](https://github.com/facebookresearch/llama) for convenience.

You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally. For more examples, see the Llama 2 recipes repository.

- In a conda env with PyTorch / CUDA available clone and download this repository.

- In the top level directory run:

```pip install -e .```

- Visit the [Meta website and register](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) to download the model/s.

- Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.

- Once you get the email, navigate to your downloaded llama repository and run the download.sh script.

- Make sure to grant execution permissions to the download.sh script

- During this process, you will be prompted to enter the URL from the email.

- Do not use the “Copy Link” option but rather make sure to manually copy the link from the email.

- Once the model/s you want have been downloaded, you can run the model locally using the command below:

```
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
```

### Note

- Replace llama-2-7b-chat/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model.
- The `--nproc_per_node` value should be set to the MP value for the model you are using.
- Adjust the max_seq_len and max_batch_size parameters as needed.
- This example runs the example_chat_completion.py found in this repository but you can change that to a different .py file.

> All of this info can be found in Meta's [official repo](https://github.com/facebookresearch/llama)!
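The note about `--nproc_per_node` can be made concrete with a small sketch. The MP values used here (1 for 7B, 2 for 13B, 8 for 70B) are the ones listed in the official repo's README. The helper name `build_torchrun_command` and the idea of inferring the size from the directory name are illustrative assumptions, not part of the repo.

```python
import shlex

# Model-parallel (MP) values as listed in the official repo's README.
MP_VALUES = {"7b": 1, "13b": 2, "70b": 8}


def build_torchrun_command(
    ckpt_dir: str,
    script: str = "example_chat_completion.py",
    tokenizer_path: str = "tokenizer.model",
    max_seq_len: int = 512,
    max_batch_size: int = 6,
) -> list[str]:
    """Build the torchrun argument list for a Llama 2 checkpoint directory.

    Hypothetical helper: the model size is guessed from the directory name,
    e.g. "llama-2-13b-chat/" -> "13b".
    """
    size = next((s for s in MP_VALUES if f"-{s}" in ckpt_dir.lower()), None)
    if size is None:
        raise ValueError(f"Cannot infer model size from {ckpt_dir!r}")
    return [
        "torchrun",
        "--nproc_per_node", str(MP_VALUES[size]),
        script,
        "--ckpt_dir", ckpt_dir,
        "--tokenizer_path", tokenizer_path,
        "--max_seq_len", str(max_seq_len),
        "--max_batch_size", str(max_batch_size),
    ]


# Print a shell-ready command line for the 13B chat checkpoint.
print(shlex.join(build_torchrun_command("llama-2-13b-chat/")))
```

The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run` from the top level of the cloned repository.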

## You can also access the weights through [Hugging Face](https://huggingface.co/meta-llama):

- You must first request a download from the Meta website using the same email address as your Hugging Face account.
- After that you can request access to any of the models on Hugging Face and within 1-2 days your account will be granted access to all versions.

[Go through the recipes setup from Meta's official repository](https://github.com/facebookresearch/llama)

OK, great! Once you've done all of that, you should have a folder with your chosen model. In my case, I have two quantized 7b-chat models:

In [3]:
!ls models/

llama-2-7b-chat.Q4_K_M.gguf llama-2-7b-chat.Q5_K_M.gguf
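Before loading a downloaded file, it can help to confirm that it really is a GGUF file. A GGUF file starts with the 4-byte magic `GGUF` followed by a little-endian 32-bit version number, so a quick sanity check needs only the standard library. This is a minimal sketch: the function name `read_gguf_version` is made up for illustration, and the `models/` folder is assumed to be the one listed above.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path) -> int:
    """Return the GGUF format version stored in the first 8 bytes of the file."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # The version is a little-endian unsigned 32-bit integer right after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    # Report the version and size of every GGUF file in ./models (if any).
    for model_file in sorted(Path("models").glob("*.gguf")):
        size_gb = model_file.stat().st_size / 1e9
        print(f"{model_file.name}: GGUF v{read_gguf_version(model_file)}, {size_gb:.2f} GB")
```

A truncated or failed download usually shows up here as a `ValueError` or an unexpectedly small file size.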


Now, let's look at how to run inference with Llama2 using a Python-friendly approach!

For that we'll be looking at a few awesome options for running inference with llama2 models including:

- [llama-cpp](https://github.com/ggerganov/llama.cpp) library
- [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python) library, which presents Python bindings for the [llama-cpp](https://github.com/ggerganov/llama.cpp) library.
- Running [llama2 with Hugging Face](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF)
- Setup with [Llama-cpp-python + Langchain](https://python.langchain.com/docs/integrations/llms/llamacpp)
- And finally a setup using [Ollama](https://ollama.ai/) and [Ollama+Langchain](https://python.langchain.com/docs/integrations/llms/ollama)
- Also using [LM-Studio](https://lmstudio.ai/).

Why so many options? Because it's important to know that there are many ways to run these models, and I think it's nice to present a landscape of implementation options :) .

# [Llama-cpp (original)](https://github.com/ggerganov/llama.cpp)

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# On Linux/Mac:
make

# On Windows:
# - Download the latest fortran version of w64devkit.
# - Extract w64devkit on your PC.
# - Run w64devkit.exe.
# - Use the cd command to reach the llama.cpp folder.
# - From here you can run:
make
```

Then all you have to do is:

```
# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
  # [Optional] for models using BPE tokenizers
  ls ./models
  65B 30B 13B 7B vocab.json

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/

  # [Optional] for models using BPE tokenizers
  python3 convert.py models/7B/ --vocabtype bpe

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0

# update the gguf filetype to current if older version is unsupported by another application
./quantize ./models/7B/ggml-model-q4_0.gguf ./models/7B/ggml-model-q4_0-v2.gguf COPY


# run the inference
./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
```
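To get a feel for what the `quantize` step buys you, here is a rough back-of-the-envelope estimate. It assumes about 6.74 billion parameters for the 7B model, and it assumes the `q4_0` layout stores one 16-bit scale per block of 32 4-bit weights (4.5 bits per weight on average). Real files differ a little because of metadata and tensors kept at higher precision, so treat the numbers as approximate.

```python
N_PARAMS_7B = 6.74e9  # approximate parameter count of the 7B model (assumption)

FP16_BITS = 16.0
# q4_0 (assumed layout): 32 weights * 4 bits + one 16-bit scale per block of 32.
Q4_0_BITS = (32 * 4 + 16) / 32  # = 4.5 bits per weight


def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimate the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16 = estimated_size_gb(N_PARAMS_7B, FP16_BITS)
q4 = estimated_size_gb(N_PARAMS_7B, Q4_0_BITS)
print(f"FP16: ~{fp16:.1f} GB, q4_0: ~{q4:.1f} GB ({fp16 / q4:.1f}x smaller)")
```

The estimate shows why a 4-bit model fits comfortably on a laptop while the FP16 version often does not.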



# Llama-cpp-Python

- [source](https://abetlen.github.io/llama-cpp-python/)

Let's start by installing the package:

In [3]:
# !pip install llama-cpp-python

In [4]:
# source for the weights: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

Now we can load the `Llama` class.

In [4]:
from llama_cpp import Llama

Now we just instantiate the llama object giving the path to the downloaded weights for our model of choice.

In [5]:
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=1, n_ctx=4096)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1

Now, running predictions for [`Text Completion`](https://github.com/abetlen/llama-cpp-python#:~:text=Below%20is%20a,28%2C%0A%20%20%20%20%22total_tokens%22%3A%2042%0A%20%20%7D%0A%7D) is a breeze.

In [6]:
output = llm("What is a large language model?.")


llama_print_timings:        load time =  5819.81 ms
llama_print_timings:      sample time =    95.72 ms /   128 runs   (    0.75 ms per token,  1337.21 tokens per second)
llama_print_timings: prompt eval time =  5819.77 ms /     8 tokens (  727.47 ms per token,     1.37 tokens per second)
llama_print_timings:        eval time =  3982.53 ms /   127 runs   (   31.36 ms per token,    31.89 tokens per second)
llama_print_timings:       total time = 10112.50 ms


In [7]:
output

{'id': 'cmpl-6f2d2ef8-ee77-4fc2-8ef5-caa904901e30',
 'object': 'text_completion',
 'created': 1707158457,
 'model': './models/llama-2-7b-chat.Q4_K_M.gguf',
 'choices': [{'text': '\nIts like you are asking what a large language model is. A large language model is a type of artificial intelligence (AI) model that is trained on vast amounts of text data to generate language outputs that are coherent and natural-sounding. These models have become increasingly popular in recent years due to their impressive capabilities, such as generating text, summarizing content, and even creating new forms of creative writing like poetry or short stories.\nSome examples of large language models include:\n1. BERT (Bidirectional Encoder Representations from Transformers): Developed by Google in 2',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 8, 'completion_tokens': 128, 'total_tokens': 136}}

This schema is very similar to the one we are all used to with the ChatGPT API.
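Because the response is a plain dictionary, a small helper can pull out the fields you usually care about. The sketch below works on an abbreviated copy of the response shown above, so it runs without loading a model. The helper name `summarize_completion` is an illustrative assumption.

```python
def summarize_completion(output: dict) -> dict:
    """Extract the generated text and a few useful flags from a text completion."""
    choice = output["choices"][0]
    return {
        "text": choice["text"].strip(),
        # "length" means generation stopped because the token limit was reached.
        "truncated": choice["finish_reason"] == "length",
        "completion_tokens": output["usage"]["completion_tokens"],
    }


# Abbreviated copy of the response above (the text is shortened for brevity).
sample_output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "\nIts like you are asking what a large language model is.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 8, "completion_tokens": 128, "total_tokens": 136},
}

summary = summarize_completion(sample_output)
print(summary["truncated"], summary["completion_tokens"])
```

A `finish_reason` of `'length'` explains why the answer above stops mid-sentence: generation hit the token limit. Passing a larger `max_tokens` value when calling `llm(...)` should allow a longer completion.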

In [8]:
print(output["choices"][0]["text"])


Its like you are asking what a large language model is. A large language model is a type of artificial intelligence (AI) model that is trained on vast amounts of text data to generate language outputs that are coherent and natural-sounding. These models have become increasingly popular in recent years due to their impressive capabilities, such as generating text, summarizing content, and even creating new forms of creative writing like poetry or short stories.
Some examples of large language models include:
1. BERT (Bidirectional Encoder Representations from Transformers): Developed by Google in 2


Notice that this is a plain text completion: the model simply continues the prompt instead of replying to it as a chat turn. This is how a "base model" behaves (given a question, it may just output more questions).

See Andrej Karpathy explain that in [this part of his YouTube video introducing LLMs](https://youtu.be/zjkBMFhNj_g?t=1213)

To get proper chat behavior, we specify the `chat_format` parameter in the `Llama` class.

Now, when we ask the same question:

In [9]:
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", chat_format="llama-2")

prompt = "What is a large language model?."
response = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are a helpful assistant."},
          {
              "role": "user",
              "content": prompt
          }
      ]
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1

In [10]:
response

{'id': 'chatcmpl-f94461ca-4ece-4e71-bfbd-61f275636aa3',
 'object': 'chat.completion',
 'created': 1707158606,
 'model': './models/llama-2-7b-chat.Q4_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': ' Ah, an excellent question! A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a massive amount of text data to generate language outputs that are coherent and natural-sounding. These models have become increasingly popular in recent years due to their ability to generate text that is often indistinguishable from human-written content.\nThe key characteristic of an LLM is its scale: these models are trained on billions of words or more, and can process input sequences of any length. This allows them to learn a vast vocabulary and understand the nuances of language in a way that smaller models cannot. As a result, they can generate text that is more diverse and accurate than what is possible with smaller mode

In [11]:
response["choices"][0]["message"]["content"]

' Ah, an excellent question! A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a massive amount of text data to generate language outputs that are coherent and natural-sounding. These models have become increasingly popular in recent years due to their ability to generate text that is often indistinguishable from human-written content.\nThe key characteristic of an LLM is its scale: these models are trained on billions of words or more, and can process input sequences of any length. This allows them to learn a vast vocabulary and understand the nuances of language in a way that smaller models cannot. As a result, they can generate text that is more diverse and accurate than what is possible with smaller models.\nThere are several types of LLMs, including:\n1. Language Translation Models: These models are trained on large amounts of text data in multiple languages to learn the patterns and structures of different languages. They can be used 

We get an actual response from the model.
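The difference comes from the prompt template. Llama 2 chat models were trained on prompts wrapped in `[INST]` and `<<SYS>>` markers, and a chat format applies that wrapping to your messages. The sketch below reproduces the single-turn template from Meta's reference code as a plain string. It is only an illustration: the function name `format_llama2_prompt` is made up, the `<s>` beginning-of-sequence marker is written out as text, and the exact string `llama-cpp-python` builds internally may differ in whitespace or token handling.

```python
# Markers used by the Llama 2 chat prompt template.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def format_llama2_prompt(system: str, user: str) -> str:
    """Wrap a system message and one user message in the Llama 2 chat template."""
    return f"<s>{B_INST} {B_SYS}{system}{E_SYS}{user.strip()} {E_INST}"


print(format_llama2_prompt("You are a helpful assistant.", "What is a large language model?."))
```

Without this wrapping, the chat model just sees raw text to continue, which is why the earlier call behaved like a plain completion.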

# [Llama2 inference with Hugging Face](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF)


First, download the model (if you haven't already).

In [2]:
# !pip install -U "huggingface_hub[cli]"

In [4]:
!huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False

downloading https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf to /Users/greatmaster/.cache/huggingface/hub/tmpth7xvajs
Downloading (…)-7b-chat.Q4_K_M.gguf: 100%|█| 4.08G/4.08G [01:36<00:00, 42.1MB/s]
Storing https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf in local_dir at ./models/llama-2-7b-chat.Q4_K_M.gguf (not cached).
./models/llama-2-7b-chat.Q4_K_M.gguf


Now we'll need to install the `ctransformers` library. Check [here](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF#:~:text=Python%20using%20ctransformers-,First%20install%20the%20package,0.2.24%20--no-binary%20ctransformers,-Simple%20example%20code) for the option that makes sense for your machine.

In [5]:
# source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

# # Base ctransformers with no GPU acceleration
# pip install "ctransformers>=0.2.24"
# # Or with CUDA GPU acceleration
# pip install "ctransformers[cuda]>=0.2.24"
# # Or with ROCm GPU acceleration
# CT_HIPBLAS=1 pip install "ctransformers>=0.2.24" --no-binary ctransformers


# # Or with Metal GPU acceleration for macOS systems ---- I am on a macOS machine so I'll use this option!
# !CT_METAL=1 pip install ctransformers --no-binary ctransformers

Example to load and run a GGUF model:

In [16]:
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
#llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GGUF", model_file="llama-2-7b-chat.Q4_K_M.gguf", model_type="llama", gpu_layers=50)
llm = AutoModelForCausalLM.from_pretrained(model_path_or_repo_id="./models/llama-2-7b-chat.Q4_K_M.gguf", model_type="llama")
print(llm("What is a large language model?."))



A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. These models are capable of producing human-like text based on the input they receive, and have become increasingly popular in recent years due to their impressive performance on a wide range of natural language processing (NLP) tasks.
The key characteristic of an LLM is its scale: these models are trained on massive datasets that contain billions of words or more, and are designed to capture the complexities and nuances of human language. This allows them to generate text that is not only grammatically correct but also contextually appropriate and semantically meaningful.
LLMs have many applications in areas such as:
1. Language Translation: LLMs can be trained on large datasets of text in multiple languages, allowing them to learn the patterns and structures of different languages and generat

In [17]:
print(llm("List 10 essential stretches a person should do to live longer, healthier and stronger."))



Stretching is an important part of any exercise routine. Not only can it improve flexibility, but it can also reduce the risk of injury and improve overall well-being. Here are ten essential stretches that everyone should incorporate into their daily routine:
1. Neck Stretch: Slowly tilt your head to the side, bringing your ear towards your shoulder. Hold for 30 seconds and repeat on the other side. This stretch can help relieve tension in the neck and improve posture.
2. Shoulder Rolls: Roll your shoulders forward and backward in a circular motion. Repeat for 10-15 repetitions. This stretch can help reduce tension in the shoulders and improve range of motion.
3. Chest Stretch: Place your hands on a wall or door frame and lean forward, stretching your chest. Hold for 30 seconds. This stretch can help improve posture and reduce stress.
4. Arm Circles: Hold your arms straight out to the sides and make small circles with your hands. Gradually increase the size of the circles as you cont

You can also check out this demo on the Hugging Face website for interacting with the Llama2-70B-Chat model:
- https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI

# Setup with LangChain

[source](https://python.langchain.com/docs/integrations/llms/llamacpp)

Let's install the `langchain` package:

In [None]:
# !pip install langchain

In [18]:
# source: https://python.langchain.com/docs/integrations/llms/llamacpp

# !CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
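To see exactly what text reaches the model, the template above can be rendered with plain string formatting. This sketch assumes the default behavior of `PromptTemplate`, which fills `{question}` the way `str.format` does. The helper name `render_prompt` is made up for illustration.

```python
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""


def render_prompt(question: str) -> str:
    """Fill the {question} placeholder, mirroring what the prompt template produces."""
    return template.format(question=question)


print(render_prompt("What is the secret to becoming a great pragmatic programmer?"))
```

The trailing "Answer: Let's work this out in a step by step way" nudges the model toward a stepwise answer, since the model continues the text from there.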

In [19]:
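# Note: running `!CT_METAL=1` here would only set a variable in a throwaway subshell and has no effect.
# Metal support is enabled when llama-cpp-python is installed (see the CMAKE_ARGS line in the previous cell).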
n_gpu_layers = 50  # Change this value based on your model and your GPU VRAM pool.
n_batch = 4096  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What is the secret to becoming a great pragmatic programmer?"
llm_chain.run(question)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1



Step 1: Learn the basics of programming - Before you can become a great pragmatic programmer, you need to have a solid foundation in computer science and programming. This includes understanding data types, control structures, functions, objects, and algorithms.
Step 2: Understand the domain you are programming in - A pragmatic programmer must have a deep understanding of the domain they are working in. This means learning about the problem you are trying to solve, the business goals, and the users needs.
Step 3: Learn practical programming techniques - A pragmatic programmer should be proficient in many different programming languages and should have experience with various paradigms such as object-oriented, functional, and imperative programming. They should also have knowledge of design patterns and principles.
Step 4: Practice writing clean, modular, and maintainable code - A pragmatic programmer should strive to write code that is easy to understand, modify, and debug. This mean


llama_print_timings:        load time =   613.04 ms
llama_print_timings:      sample time =   267.56 ms /   256 runs   (    1.05 ms per token,   956.80 tokens per second)
llama_print_timings: prompt eval time =   613.00 ms /    42 tokens (   14.60 ms per token,    68.52 tokens per second)
llama_print_timings:        eval time =  7902.81 ms /   255 runs   (   30.99 ms per token,    32.27 tokens per second)
llama_print_timings:       total time =  9384.24 ms


'\n\nStep 1: Learn the basics of programming - Before you can become a great pragmatic programmer, you need to have a solid foundation in computer science and programming. This includes understanding data types, control structures, functions, objects, and algorithms.\nStep 2: Understand the domain you are programming in - A pragmatic programmer must have a deep understanding of the domain they are working in. This means learning about the problem you are trying to solve, the business goals, and the users needs.\nStep 3: Learn practical programming techniques - A pragmatic programmer should be proficient in many different programming languages and should have experience with various paradigms such as object-oriented, functional, and imperative programming. They should also have knowledge of design patterns and principles.\nStep 4: Practice writing clean, modular, and maintainable code - A pragmatic programmer should strive to write code that is easy to understand, modify, and debug. Thi
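The `llama_print_timings` lines in the output above are handy for comparing setups. As a rough sketch, the snippet below parses those lines and recomputes tokens per second from the raw millisecond and token counts. The regular expression is an assumption based on the log format shown here and may need adjusting for other `llama.cpp` versions.

```python
import re

# Matches lines such as:
# "llama_print_timings:        eval time =  7902.81 ms /   255 runs   (...)"
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def tokens_per_second(log: str) -> dict:
    """Return tokens per second for every timing line that reports a token count."""
    rates = {}
    for match in TIMING_RE.finditer(log):
        if match["count"] is None:
            continue  # "load" and "total" lines carry no token count
        seconds = float(match["ms"]) / 1000
        rates[match["name"]] = int(match["count"]) / seconds
    return rates


log = """\
llama_print_timings: prompt eval time =   613.00 ms /    42 tokens (   14.60 ms per token,    68.52 tokens per second)
llama_print_timings:        eval time =  7902.81 ms /   255 runs   (   30.99 ms per token,    32.27 tokens per second)
llama_print_timings:       total time =  9384.24 ms
"""

for name, rate in tokens_per_second(log).items():
    print(f"{name}: {rate:.2f} tokens/s")
```

The `eval` rate is the generation speed, and it is usually the number to watch when tuning `n_gpu_layers`.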

# Setup with Ollama and Ollama + LangChain

You can download the executable from the Ollama website and use it out of the box:
- https://ollama.ai/

In [24]:
# !pip install ollama
# !ollama pull llama2:latest

In [12]:
import ollama

response = ollama.chat(model='llama2', 
                       messages=[
  {
    'role': 'user',
    'content': 'What is a large language model?',
  },
])

print(response['message']['content'])


A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. The goal of an LLM is to learn the patterns and structures of language, such as grammar, syntax, and semantics, in order to produce text that is similar to human language production.

LLMs are typically trained on vast amounts of text data, such as books, articles, or web pages, using a machine learning algorithm like deep learning. The model learns to predict the next word or character in a sequence of text based on the context provided by the previous words. As the model is trained, it can generate text that is coherent and contextually appropriate, such as answering questions, summarizing texts, or even generating creative writing.

Some examples of large language models include:

1. BERT (Bidirectional Encoder Representations from Transformers): A popular LLM developed by Google in 2018, whi
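The `ollama` package talks to a local Ollama server over HTTP, so the same request can be sketched with only the standard library. This assumes the server is running at its default address, `http://localhost:11434`, and that `llama2` has already been pulled. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_request(prompt: str, model: str = "llama2") -> urllib.request.Request:
    """Build a non-streaming chat request for the local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "llama2") -> str:
    """Send the request and return the reply (requires a running Ollama server)."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        return json.load(resp)["message"]["content"]
```

With the server running, `print(chat("What is a large language model?"))` should print a reply similar to the one above.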

## Set up Ollama with Langchain

You can also use Ollama through `langchain`, as shown below.

In [26]:
from langchain.chat_models import ChatOllama

ollama = ChatOllama(model="llama2")

ollama.predict("What is a large language model?")

'\nA large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. The goal of an LLM is to learn the patterns and structures of language, allowing it to generate text that is similar to human language production.\n\nLLMs are typically trained using deep learning techniques, such as recurrent neural networks (RNNs) or transformer models, on large datasets of text. These models can be used for a variety of tasks, such as:\n\n1. Language translation: LLMs can be trained to translate text from one language to another.\n2. Text generation: LLMs can be used to generate new text that is similar to a given input or style.\n3. Language understanding: LLMs can be trained to understand and interpret natural language text, allowing them to perform tasks such as sentiment analysis or question-answering.\n4. Chatbots: LLMs can be used to create chatbots that can engage in

# Set up with LM-Studio

LM Studio is like an improved version of GPT4All, with additional features such as automatically evaluating your machine's specs to tell whether it can run a given model:

![](./assets-resources/lm-studio-machine-specs-analysis.png)

Another great feature of LM Studio is the local inference server, which lets you host your own model:

![](./assets-resources/lm-studio-local-inf-server.png)

Super easy to download open source models:

![](./assets-resources/lm-studio-download.png)

After downloading it, I can select it for chat easily:

![](./assets-resources/lm-studio-select-to-chat.png)

And now you can just chat with that model:

![](./assets-resources/lm-studio-chat.png)

You can easily host and run inference on your own models:

In [1]:
# Load a model, start the server, and run this example in your terminal
# Choose between streaming and non-streaming mode by setting the "stream" field

!curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d \
  '{"messages": [{"role": "system", "content": "Always answer in rhymes."},\
    {"role": "user", "content": "Introduce yourself."}\
  ], \
  "temperature": 0.7, \
  "max_tokens": -1,\
  "stream": false\
}'

{
  "id": "chatcmpl-4rxck8soi2xctzrc3zkauj",
  "object": "chat.completion",
  "created": 1707168904,
  "model": "/Users/greatmaster/.cache/lm-studio/models/TheBloke/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q2_K.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm a lively soul, with a heart that's whole."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 16,
    "total_tokens": 32
  }
}

Below are the results of running the inference above:

![](./assets-resources/lm-studio-local-server.png)
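The same request can be sent from Python instead of `curl`. The sketch below mirrors the JSON body of the command above and reads the reply out of the `choices[0].message.content` field seen in the response. It assumes the LM Studio server is running on `localhost:1234` as in the example. The helper names are illustrative.

```python
import json
import urllib.request

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"  # address from the curl example


def build_request(messages, temperature=0.7, max_tokens=-1) -> urllib.request.Request:
    """Build a non-streaming chat completion request for the local server."""
    payload = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        LM_STUDIO_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of a chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(messages) -> str:
    """Send the request and return the reply (requires the local server to be running)."""
    with urllib.request.urlopen(build_request(messages)) as resp:
        return extract_reply(json.load(resp))


messages = [
    {"role": "system", "content": "Always answer in rhymes."},
    {"role": "user", "content": "Introduce yourself."},
]
```

With a model loaded and the server started, `print(chat(messages))` should return a rhyming introduction like the one above.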

# Some Extra notes on running Llama on your phone

[MLC LLM](https://github.com/mlc-ai/mlc-llm) is an open-source project that makes it possible to run language models locally on a variety of devices and platforms, including iOS and Android.

For iPhone users, there’s an MLC chat app on the App Store. MLC now has support for the 7B, 13B, and 70B versions of Llama 2, but it’s still in beta and not yet in the App Store version, so you’ll need to install TestFlight to try it out. Check out the [instructions for installing the beta version here](https://llm.mlc.ai/docs/index.html#getting-started).

# Final Note

The setup we're going to use will depend on the type of application we are trying to build.

But the general rule for this course will be to use `llama-cpp-python` combined with LangChain, for ease of use in the use cases we are interested in.

# Resources

- https://github.com/abetlen/llama-cpp-python
- https://github.com/ggerganov/llama.cpp
- https://ai.meta.com/llama/
- https://ollama.ai/download
- https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
- https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI
- https://huggingface.co/blog/llama2
- https://replicate.com/blog/run-llama-locally
- https://ollama.ai/
- https://arxiv.org/pdf/2307.09288.pdf 
- https://ai.meta.com/llama/#resources
- https://www.philschmid.de/llama-2
- https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard