# Llama 2 Introduction

#### Made by SimonLiu

1. My LinkedIn: https://www.linkedin.com/in/simonliuyuwei/

2. InfuseAI: https://infuseai.io

### Llama 2

The next generation of Meta's open source large language model.

1. Official Website: [Link](https://ai.meta.com/llama/)

2. Download Model: [Link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)

### Related Websites

1. llama.cpp: [Link](https://github.com/ggerganov/llama.cpp)

2. llama-cpp-python: [Link](https://github.com/abetlen/llama-cpp-python)

3. ggml: Tensor library for machine learning - [Link](https://github.com/ggerganov/ggml)

# Code

## Step 1: Install the required packages

In [None]:
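# Build llama-cpp-python with GPU (cuBLAS) support.
# Note: GGML (.bin) model files need an older llama-cpp-python release; newer releases expect GGUF files.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose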

In [None]:
# For downloading the models
!pip install huggingface_hub

## Step 2: Import Python libraries and configure variables

### Import Python Packages

In [None]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

### Configure Variables

In [None]:
download_model_bool = True

#### Hugging Face model links (GGML files for llama.cpp):

1. TheBloke/Llama-2-7B-chat-GGML: [Link](https://huggingface.co/TheBloke/Llama-2-7B-chat-GGML)
2. TheBloke/Llama-2-13B-chat-GGML: [Link](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML)
3. TheBloke/Llama-2-70B-chat-GGML: [Link](https://huggingface.co/TheBloke/Llama-2-70B-chat-GGML)
4. audreyt/Taiwan-LLaMa-v1.0-GGML: [Link](https://huggingface.co/audreyt/Taiwan-LLaMa-v1.0-GGML)

In [None]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"

#### The model is in bin format:
Pick the name of the bin file from the repository's file list: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main

In [None]:
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"
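The file name encodes the quantization level (here `q5_1`). As a rough sketch of how one might pick a file by that tag, the helper below parses the tag out of a name. The helper names `quantization_tag` and `pick_file` are hypothetical, and the example list is hand-written for illustration rather than fetched from the repository.

```python
import re
from typing import List, Optional


def quantization_tag(filename: str) -> Optional[str]:
    """Return the quantization tag (e.g. 'q5_1') from a GGML file name, or None."""
    match = re.search(r"\.(q\d+_[0-9a-z_]+)\.bin$", filename, flags=re.IGNORECASE)
    return match.group(1) if match else None


def pick_file(filenames: List[str], tag: str) -> Optional[str]:
    """Return the first file name whose quantization tag equals `tag`."""
    for name in filenames:
        found = quantization_tag(name)
        if found is not None and found.lower() == tag.lower():
            return name
    return None


# Hand-written example list (an assumption, not a live repository listing).
example_files = [
    "llama-2-13b-chat.ggmlv3.q4_0.bin",  # illustrative name in the same pattern
    "llama-2-13b-chat.ggmlv3.q5_1.bin",  # the file used in this notebook
    "README.md",
]

chosen = pick_file(example_files, "q5_1")
print(chosen)
```

In practice the list could come from `huggingface_hub.list_repo_files(model_name_or_path)`, which needs network access.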

## Step 3: Download Model

In [None]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

In [None]:
print(model_path)
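Before loading, it can help to confirm that the path really points to a file and to see how large it is. The following is a minimal sketch; `describe_model_file` is a hypothetical helper, and the demo uses a tiny temporary file as a stand-in for the downloaded model.

```python
import os
import tempfile


def describe_model_file(path: str) -> str:
    """Return '<file name>: <size> GiB', or raise if the file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    size_gib = os.path.getsize(path) / 1024**3
    return f"{os.path.basename(path)}: {size_gib:.2f} GiB"


# Demo with a small temporary file standing in for the real model file.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(b"\0" * 2048)
demo_path = tmp.name

summary = describe_model_file(demo_path)
print(summary)
os.remove(demo_path)
```

In the notebook, the call would be `describe_model_file(model_path)`.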

## Step 4: Loading the Model

In [None]:
# Load the model with GPU offloading
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2,             # Number of CPU threads to use
    n_batch=512,             # Prompt batch size; should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=32          # Number of layers offloaded to the GPU; change this based on your model and your GPU VRAM.
)

## Step 5: Create a Prompt

In [None]:
prompt = "Write a linear regression in python"

In [None]:
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.

USER: {prompt}

ASSISTANT:
'''
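To reuse the same SYSTEM / USER / ASSISTANT layout for other questions, the template can be wrapped in a small function. This is only a sketch: `build_prompt` is a hypothetical helper, and the short system message in the demo is a placeholder.

```python
def build_prompt(user_message: str, system_message: str) -> str:
    """Fill the SYSTEM / USER / ASSISTANT layout used in this notebook."""
    return (
        f"SYSTEM: {system_message}\n"
        "\n"
        f"USER: {user_message}\n"
        "\n"
        "ASSISTANT:\n"
    )


# Placeholder system message, used only for this demo.
demo_prompt = build_prompt(
    user_message="Write a linear regression in python",
    system_message="You are a helpful assistant.",
)
print(demo_prompt)
```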

## Step 6: Generating the Response

In [None]:
# Generate the response
response = lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                    repeat_penalty=1.2, top_k=150,
                    echo=True)
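The `temperature`, `top_k`, and `top_p` arguments all restrict which tokens can be sampled next. The toy function below illustrates the general idea on a hand-made set of scores. It is not llama.cpp's actual sampler (the order of steps and other details there may differ), and `filter_candidates` and the toy logits are made up for this illustration.

```python
import math


def filter_candidates(logits, temperature=0.5, top_k=150, top_p=0.95):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Toy scores for four candidate tokens (made up for this example).
toy_logits = {"the": 2.0, "a": 1.0, "cat": 0.0, "zebra": -3.0}
candidates = filter_candidates(toy_logits)
print(candidates)
```

With these toy scores, only the two most likely tokens survive the `top_p=0.95` cut, which shows how a low temperature combined with nucleus filtering narrows the choice.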

In [None]:
# Display the full response dictionary.
response

In [None]:
# Print the generated text (it includes the prompt because echo=True).
print(response["choices"][0]["text"])
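Because the call uses `echo=True`, the returned text starts with the prompt. A minimal sketch for keeping only the answer is shown below; `extract_answer` is a hypothetical helper, and the response dictionary is a hand-written stand-in that mirrors only the `choices[0]["text"]` field used above.

```python
def extract_answer(response: dict, prompt: str) -> str:
    """Return the completion text without the echoed prompt."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Hand-written stand-in for a model response (not real model output).
mock_prompt = "USER: Say hi\n\nASSISTANT:\n"
mock_response = {"choices": [{"text": mock_prompt + " Hello!", "finish_reason": "stop"}]}

answer = extract_answer(mock_response, mock_prompt)
print(answer)
```

In the notebook, the equivalent call would be `extract_answer(response, prompt_template)`.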