---
        
💡 **NOTE**: We will want to use a GPU to run Llama 2 for this use case. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

We will start by installing a number of packages that we are going to use throughout this example:

In [None]:
%%capture
!pip install transformers datasets accelerate bitsandbytes xformers langchain sentence_transformers autotrain-advanced faiss-gpu

# 🤗 HuggingFace Hub Credentials
Before we can load Llama 2 using a number of tricks, we first need to accept the license for using Llama 2. The steps are as follows:


* Create a HuggingFace account [here](https://huggingface.co)
* Apply for Llama 2 access [here](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
* Get your HuggingFace token [here](https://huggingface.co/settings/tokens)

After doing so, we can log in with our HuggingFace credentials so that this environment knows we have permission to download the Llama 2 model that we are interested in.

In [None]:
from huggingface_hub import notebook_login
notebook_login()


# 🦙 **Llama 2**

Now comes one of the more interesting components of this tutorial: how to load a Llama 2 model on a T4 GPU!

We will be focusing on the `'meta-llama/Llama-2-13b-chat-hf'` variant. It is large enough to give interesting and useful results whilst small enough that it can be run in our environment.

We start by defining our model and identifying if our GPU is correctly selected. We expect the output of `device` to show a cuda device:

In [None]:
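from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-13b-chat-hf'

# Check that a GPU is selected; we expect something like 'cuda:0'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
print(device)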

# 4-bit Quantization to load Llama 2 with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()

# Our text generator
generator = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1
)


The chat-version of Llama 2 requires a certain prompt template in order to correctly ask the model questions:

<br>
<div>
<img src="https://cdn-images-1.medium.com/v2/resize:fit:1200/1*id6E_ZLa77N6OEWuDKsPmg.png" width="650"/>
</div>



In [None]:
basic_prompt = """
<s>[INST] <<SYS>>

You are a helpful assistant

<</SYS>>

What is 1 + 1? [/INST]
"""
print(generator(basic_prompt)[0]["generated_text"])


<s>[INST] <<SYS>>

You are a helpful assistant

<</SYS>>

What is 1 + 1? [/INST]

Oh my, that's a simple one! The answer to 1 + 1 is... (drumroll please)... 2! 😊
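The template above can be wrapped in a small helper so that we do not have to retype the special tokens for every question. This is a minimal sketch: the function name `build_llama2_prompt` and its default system prompt are our own choices for illustration, not part of any library.

```python
def build_llama2_prompt(user_message: str,
                        system_prompt: str = "You are a helpful assistant.") -> str:
    """Wrap a single user message in the Llama 2 chat template.

    Hypothetical helper for this tutorial; it only formats a string.
    """
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt.strip()}\n"
        "<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )


prompt = build_llama2_prompt("What is 1 + 1?")
print(prompt)
```

The resulting string can be passed to the text generator in the same way as the hand-written prompt, for example `generator(prompt)`.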


# 📄 **Prompt Engineering**


## Example-based Prompt Engineering

<br>
<div>
<img src="https://cdn-images-1.medium.com/v2/resize:fit:1800/1*orJYX0HGbydHhzYF0TPuwA.png" width="1250"/>
</div>

We will start with prompting the model without any examples by simply asking the LLM the question directly:

In [None]:
prompt = """
<s>[INST] <<SYS>>

You are a helpful assistant.

<</SYS>>

Classify the text into neutral, negative or positive.
Text: I think the food was okay. [/INST]
"""
print(generator(prompt)[0]["generated_text"])


<s>[INST] <<SYS>>

You are a helpful assistant.

<</SYS>>

Classify the text into neutral, negative or positive.
Text: I think the food was okay. [/INST]

Neutral. The word "okay" is a neutral term and does not convey a particularly positive or negative sentiment.


Personally, I am not that convinced by this style of answer. The label is wrapped in an explanation, so we have to search the text for it.
Instead, let's give it an example of how we want the answer to be generated:

In [None]:
prompt = """
<s>[INST] <<SYS>>

You are a helpful assistant.

<</SYS>>

Classify the text into neutral, negative or positive.
Text: I think the food was alright.
Sentiment:
[/INST]

Neutral</s><s>

[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]
"""
print(generator(prompt)[0]["generated_text"])


<s>[INST] <<SYS>>

You are a helpful assistant.

<</SYS>>

Classify the text into neutral, negative or positive.
Text: I think the food was alright.
Sentiment:
[/INST]

Neutral</s><s>

[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]

Neutral
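Writing such prompts by hand becomes tedious once we have more than one example. Below is a rough sketch of how the example turns could be assembled from a list of labelled texts, and how the answer could be separated from the echoed prompt. The names `build_few_shot_prompt` and `extract_answer` are hypothetical helpers, and the spacing around the special tokens is slightly more compact than in the hand-written prompt above.

```python
INSTRUCTION = "Classify the text into neutral, negative or positive."


def build_few_shot_prompt(examples, text,
                          system_prompt="You are a helpful assistant."):
    """Build a Llama 2 chat prompt with one answered turn per example.

    `examples` is a list of (text, label) pairs. Hypothetical helper.
    """
    def turn(t):
        return f"{INSTRUCTION}\nText: {t}\nSentiment:"

    system_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    parts = []
    for i, (example_text, label) in enumerate(examples):
        # The system prompt only appears in the very first turn
        prefix = system_block if i == 0 else ""
        parts.append(f"<s>[INST] {prefix}{turn(example_text)} [/INST] {label} </s>")
    prefix = system_block if not examples else ""
    parts.append(f"<s>[INST] {prefix}{turn(text)} [/INST]")
    return "".join(parts)


def extract_answer(generated_text, prompt):
    """Drop the echoed prompt so that only the model's answer remains."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


few_shot_prompt = build_few_shot_prompt(
    examples=[("I think the food was alright.", "Neutral")],
    text="I think the food was okay.",
)
print(few_shot_prompt)
```

With the generator from earlier, `extract_answer(generator(few_shot_prompt)[0]["generated_text"], few_shot_prompt)` should leave only the label. The text-generation pipeline also accepts `return_full_text=False`, which may make the stripping step unnecessary.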


## Thought-based Prompt Engineering

We can go a step further and ask the LLM to "reason" about its answer.

Having the LLM divide its thinking into smaller steps allows more computation to be spent on each step. These smaller steps are generally referred to as the "thoughts" of the LLM.

There are many ways that we can use thought-based prompting, and we can make it quite complex, as seen below:

<br>
<div>
<img src="https://cdn-images-1.medium.com/v2/resize:fit:1800/1*LxwftaEgFetmX_LR6KE5IQ.png" width="1250"/>
</div>

However, let's start with the most basic example. What if we ask the model to solve a problem by breaking the problem into steps?



In [None]:
prompt = """
<s>[INST] <<SYS>>

You are a helpful assistant.

<</SYS>>

Do the odd numbers in this group add up to an even number? 3, 5, 15, 32.
Solve by breaking the problem into steps.
Identify the odd numbers, add them, and indicate whether the result is odd or even.[/INST]
"""
print(generator(prompt)[0]["generated_text"])


<s>[INST] <<SYS>>

You are a helpful assistant.

<</SYS>>

Do the odd numbers in this group add up to an even number? 3, 5, 15, 32.
Solve by breaking the problem into steps.
Identify the odd numbers, add them, and indicate whether the result is odd or even.[/INST]

Hello! I'd be happy to help you with that. Let's break down the problem into steps:

Step 1: Identify the odd numbers in the group.

The odd numbers in the group are:

3, 5, 15

Step 2: Add the odd numbers together.

3 + 5 + 15 = 23

Step 3: Is the result odd or even?

23 is an odd number.

Therefore, the sum of the odd numbers in the group is an odd number.
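Since the model lays out its steps, we can check each of them ourselves. The short snippet below repeats the same three steps in plain Python, using the numbers from the prompt:

```python
numbers = [3, 5, 15, 32]

# Step 1: identify the odd numbers
odd_numbers = [n for n in numbers if n % 2 == 1]

# Step 2: add them together
total = sum(odd_numbers)

# Step 3: is the result odd or even?
parity = "even" if total % 2 == 0 else "odd"

print(odd_numbers, total, parity)  # [3, 5, 15] 23 odd
```

This agrees with the model's reasoning: the odd numbers add up to 23, which is odd.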


# 🗨️ **Retrieval Augmented Generation (RAG)**

In RAG, a knowledge base, like Wikipedia, is converted to numerical representations that capture its meaning, called embeddings. These embeddings are stored in a vector database so that the information can easily be retrieved.

<br>
<div>
<img src="https://cdn-images-1.medium.com/v2/resize:fit:1800/1*sRqMlpbsHhOAxsncsqlEHw.png" width="1250"/>
</div>
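To make the retrieval idea concrete, here is a toy sketch that runs without a GPU or any extra packages. It is deliberately simplified: word counts stand in for a neural embedding model, and a plain Python list stands in for a vector database (the packages installed earlier, such as `sentence_transformers` and `faiss-gpu`, would be the more realistic choice). The three documents and all function names are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts instead of a neural model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# A tiny, made-up knowledge base and its "vector database"
documents = [
    "Llama 2 is a family of large language models released by Meta.",
    "The T4 is a GPU that is available in Google Colab.",
    "Central Park is a large public park in New York City.",
]
index = [(doc, embed(doc)) for doc in documents]


def retrieve(question, k=1):
    """Return the k documents most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


question = "Which GPU is available in Google Colab?"
context = retrieve(question, k=1)[0]

# Put the retrieved context into the Llama 2 chat template
rag_prompt = (
    "<s>[INST] <<SYS>>\n"
    "Answer the question using only the context below.\n\n"
    f"Context: {context}\n"
    "<</SYS>>\n\n"
    f"{question} [/INST]"
)
print(rag_prompt)
```

The design is the same as in a full RAG setup: embed the documents once, embed the question at query time, pick the nearest documents, and give them to the LLM as context.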

### Embedding Model

In [None]:
!autotrain llm --train \
--project_name Llama-Chat \
--model abhishek/llama-2-7b-hf-small-shards \
--data_path . \
--use_peft \
--use_int4 \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--trainer sft \
--merge_adapter

> INFO    Running LLM
> INFO    Params: Namespace(version=False, train=True, deploy=False, inference=False, data_path='.', train_split='train', valid_split=None, text_column='text', model='abhishek/llama-2-7b-hf-small-shards', learning_rate=0.0002, num_train_epochs=1, train_batch_size=2, warmup_ratio=0.1, gradient_accumulation_steps=1, optimizer='adamw_torch', scheduler='linear', weight_decay=0.0, max_grad_norm=1.0, seed=42, add_eos_token=False, block_size=-1, use_peft=True, lora_r=16, lora_alpha=32, lora_dropout=0.05, logging_steps=-1, project_name='Llama-Chat', evaluation_strategy='epoch', save_total_limit=1, save_strategy='epoch', auto_find_batch_size=False, fp16=False, push_to_hub=False, use_int8=False, model_max_length=1024, repo_id=None, use_int4=True, trainer='sft', target_modules=None, merge_adapter=True, token=None, backend='default', username=None, func=<function run_llm_command_factory at 0x7d207c3488b0>)
> INFO    loading dataset from csv
Loading che
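The log above shows that the trainer loads a CSV file and reads a column called `text` from the `train` split. A minimal sketch of how such a file could be prepared is shown below. We assume the file should be named `train.csv`, and we borrow the `### Human: ...### Assistant: ...` layout from the prompt used later with the fine-tuned model. The two question and answer pairs are placeholders, not real training data.

```python
import csv
import tempfile
from pathlib import Path

# Placeholder examples; replace them with your own data
pairs = [
    ("What is 1 + 1?", "1 + 1 is 2."),
    ("Classify the text into neutral, negative or positive.\n"
     "Text: I think the food was okay.", "Neutral"),
]


def format_example(human, assistant):
    """Join a question and its answer into one training string."""
    return f"### Human: {human}### Assistant: {assistant}"


def write_training_csv(pairs, path):
    """Write the pairs to a CSV file with a single `text` column."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text"])
        writer.writeheader()
        for human, assistant in pairs:
            writer.writerow({"text": format_example(human, assistant)})
    return path


# A temporary folder is used here so the sketch runs anywhere; in the
# notebook you would write to the folder that is passed as --data_path
out_dir = Path(tempfile.mkdtemp())
csv_path = write_training_csv(pairs, out_dir / "train.csv")
print(csv_path.read_text(encoding="utf-8"))
```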

In practice, each of these 3 methods can be run independently, but we can also combine them:

<br>
<div>
<img src="https://cdn-images-1.medium.com/v2/resize:fit:1800/1*kKj5u6L0zHeXtF_HLL-NyQ.png" width="1200"/>
</div>


In [None]:
from torch import cuda, bfloat16
import transformers

model_id = 'Llama-Chat'

# 4-bit Quantization to load Llama 2 with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()

# Our text generator
generator = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1
)


In [None]:
prompt = "### Human: Write me a numbered list of things to do in New York City.### Assistant:"
print(generator(prompt)[0]["generated_text"])

### Human: Write me a numbered list of things to do in New York City.### Assistant: Here is a list of 10 things you can do in New York City:

1. Visit the Empire State Building for an amazing view of the city skyline.
2. Take a stroll through Central Park and enjoy the beautiful scenery.
3. Go shopping on Fifth Avenue or SoHo for some great finds.
4. Check out the Museum of Modern Art (MoMA) for some world-class art exhibits.
5. Eat at one of the many delicious restaurants, such as Carnegie Deli or Sardi's.
6. See a Broadway show at one of the famous theaters.
7. Walk across the Brooklyn Bridge for a breathtaking view of Manhattan.
8. Take a boat tour around the Statue of Liberty and Ellis Island.
9. Shop for souvenirs at Times Square or Grand Central Terminal.
10. Enjoy a night out on the town with friends at one of the many bars or clubs.
