# Running llama2-7b-hi (Our Fine-tuned AI Model) on a Single T4 GPU with 16GB VRAM
Welcome to this notebook, which shows how to load and run llama2-7b-hi in 4-bit precision, using the quantization introduced with QLoRA. The QLoRA authors report that it comes with little to no performance degradation.

Note that this could be used for any model that supports device_map (i.e. loading the model with accelerate).
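To see why 4-bit loading matters on a 16GB card, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It is only a sketch: it assumes exactly 7 billion parameters and ignores activations, the KV cache, quantization constants, and any layers kept in higher precision.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumption: "7b" rounded to 7 billion parameters

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the float16 weights alone would already take most of the T4's memory, while the 4-bit weights leave plenty of headroom for generation.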

## Step 0 - Enable text wrapping so we don't have to scroll horizontally


In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)


## Step 1 - Install necessary packages (or you can use requirements.txt)

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git



## Step 2 - Define quantization parameters through the BitsAndBytesConfig from transformers


* load_in_4bit=True: Convert and load the model in 4-bit precision.
* bnb_4bit_use_double_quant=True: Use nested quantization, which also quantizes the quantization constants, for more memory-efficient inference and training.
* bnb_4bit_quant_type="nf4": The 4-bit integration comes with two quantization types, FP4 and NF4. NF4 stands for 4-bit NormalFloat and was introduced in the QLoRA paper. FP4 is the default.
* bnb_4bit_compute_dtype=torch.bfloat16: The compute dtype sets the dtype used during computation. It defaults to float32, but can be set to bf16 for speedups.
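To make the idea of 4-bit storage more concrete, here is a minimal pure-Python sketch of blockwise absmax quantization. It is not the bitsandbytes implementation: the 16-level codebook below is an evenly spaced placeholder (the real NF4 levels follow a normal distribution), and the function names are made up for illustration. The per-block `scale` is the kind of quantization constant that nested quantization compresses further.

```python
# Placeholder codebook: 16 evenly spaced levels in [-1, 1].
# Assumption: this is NOT the real NF4 codebook.
LEVELS = [-1 + 2 * i / 15 for i in range(16)]

def quantize_block(block):
    """Return (scale, codes): one float scale per block and a 4-bit index per weight."""
    scale = max(abs(w) for w in block) or 1.0  # absmax scaling; fall back to 1.0 for an all-zero block
    codes = [min(range(16), key=lambda i: abs(LEVELS[i] - w / scale)) for w in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Map 4-bit indices back to approximate float weights."""
    return [LEVELS[c] * scale for c in codes]

weights = [0.12, -0.40, 0.05, 0.33, -0.27, 0.0, 0.21, -0.09]  # toy block of weights
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight is stored as an index from 0 to 15 (4 bits) plus one shared scale per block, and is converted back to the compute dtype when it is used.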



In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
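One caveat on the compute dtype: native bfloat16 support starts with Ampere GPUs (compute capability 8.0), while the T4 has compute capability 7.5, so float16 may be the better choice on this card. The helper below is a hypothetical sketch of how the choice could be made from the compute capability. The function name is an assumption and the result is worth benchmarking rather than taking on trust.

```python
def pick_compute_dtype(capability):
    """Return the name of a torch dtype given a (major, minor) compute capability."""
    major, _minor = capability
    # Native bfloat16 support starts at compute capability 8.0 (Ampere).
    return "bfloat16" if major >= 8 else "float16"

print(pick_compute_dtype((7, 5)))  # T4
print(pick_compute_dtype((8, 0)))  # A100
```

In the notebook, the tuple would come from `torch.cuda.get_device_capability()`, and the dtype object from `getattr(torch, name)`.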



## Step 3 - Load the Model with quantization

Now we specify the model ID and then we load it with our previously defined quantization configuration.

In [None]:
model_id = "nvjkkartik/llama2-qlora-hi-7b"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

If you print the model, you will see that most of the nn.Linear layers are replaced by bnb.nn.Linear4bit layers!

In [None]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    

Let's make sure the whole model was loaded onto the GPU:

In [None]:
model.hf_device_map

{'': 0}
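In this map the empty key stands for the whole model and the value 0 is the GPU index, so nothing was offloaded. For larger models the map can contain many entries, some of them placed on "cpu" or "disk". A small helper along these lines (the name is hypothetical) can list them:

```python
def offloaded_modules(device_map):
    """Return the module names that were placed on CPU or disk instead of a GPU."""
    return [name for name, device in device_map.items() if str(device) in ("cpu", "disk")]

print(offloaded_modules({"": 0}))  # the map shown above
print(offloaded_modules({"model.embed_tokens": 0, "lm_head": "cpu"}))  # hypothetical mixed map
```

In the notebook this would be called as `offloaded_modules(model.hf_device_map)`. An empty list means every module sits on a GPU.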

## Step 4 - Once loaded, run a generation!

In [None]:
device = "cuda:0"

messages = [
    {"role": "user", "content": "Aapka naam kya hai, aap kaise ho"}
]


encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)


generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s>[INST] Aapka naam kya hai, aap kaise ho [/INST]  My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. лютого 2023

Hello! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm trained on a massive dataset of text from the internet and can generate human-like responses to a wide range of topics and questions. I can be used to create chatbots, virtual assistants, and other applications that require natural language understanding and generation capabilities.</s>
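The decoded text shows what `apply_chat_template` produced: the user message wrapped in `[INST] ... [/INST]`, with `<s>` added in front. The sketch below rebuilds that string by hand, assuming the model uses the standard Llama 2 chat format (the real template lives in the tokenizer configuration). The function name is hypothetical.

```python
def build_llama2_prompt(user_message, system_prompt=None):
    """Format a single-turn prompt in the Llama 2 chat style (without the <s> token)."""
    if system_prompt:
        user_message = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message}"
    return f"[INST] {user_message.strip()} [/INST]"

print(build_llama2_prompt("Aapka naam kya hai, aap kaise ho"))
print(build_llama2_prompt("Hanuman chalisa kya hai", system_prompt="You are a helpful assistant."))
```

The `<s>` token is left out on purpose, since the tokenizer adds it when the string is encoded with special tokens enabled.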


In [None]:
messages = [
    {"role": "user", "content": "1 और 10000 के बीच यादृच्छिक 1000 संख्याओं की सूची तैयार करने के लिए एक पायथन फ़ंक्शन लिखें?"}
    ]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s>[INST] 1 और 10000 के बीच यादृच्छिक 1000 संख्याओं की सूची तैयार करने के लिए एक पायथन फ़ंक्शन लिखें? [/INST]  Sure! Here's a Python function that takes a number N and returns a list of numbers between 1 and N:

```
def range_list(n):
    return list(range(1, n+1))
```

उदाहरण के लिए:

`range_list(1000)` पायथन द्वारा एक सूची `Range` के बजाय में सारे संख्याओं की सूची प्रदर्शित किए जाती है:

```
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
```

यादगार के लिए, आप मैं 60 संख्याओं को चैट के माधिदारियों के रूप में ध्यान देंगे और 70 संख्याओं को ऐसे कोशिश करें जो एक मकान देख सकता है, और चटक बदलाव प्रारूप में 60 संख्याओं को चक्किस देंगे या बीसीआउट देंगे और छ
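Because the cell passes `do_sample=True`, each new token is drawn at random from the model's probability distribution, so outputs differ between runs and can drift off topic. The toy sketch below contrasts this with greedy decoding, which always takes the highest-scoring token. It illustrates the idea only and is not how `generate` is implemented. The logits are made up.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_token(logits, do_sample, temperature=1.0, rng=random):
    """Greedy decoding takes the argmax; sampling draws from the softmax distribution."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # toy scores for a 3-token vocabulary
print(pick_next_token(logits, do_sample=False))
print([pick_next_token(logits, do_sample=True, rng=random.Random(i)) for i in range(5)])
```

Passing `do_sample=False` to `model.generate` switches to greedy decoding, which is deterministic and can be easier to compare across prompts.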

In [None]:
device = "cuda:0"
PROMPT= """ ### Instruction: Write a concise summary of the text, return your responses with 5 lines that cover the key points of the text.
### Paragraph:
प्रधानमंत्री नरेंद्र मोदी ने बुधवार को उत्तराखंड के मुख्यमंत्री पुष्कर सिंह धामी से बात की और उत्तरकाशी सुरंग बचाव कार्य की प्रगति के बारे में जानकारी ली, जो 11वें दिन में प्रवेश कर गया है। मुख्यमंत्री ने पूर्व ट्विटर एक्स पर पोस्ट किया कि उन्होंने पीएम मोदी को पिछले 24 घंटों में हुई प्रगति के बारे में जानकारी दी। जैसे-जैसे फंसे हुए मजदूरों तक पहुंचने की कोशिशें जारी हैं, अब एजेंसियां ​​उन्हें ठोस, गर्म भोजन उपलब्ध कराकर जीवित रखने की कोशिश कर रही हैं। मंगलवार शाम को श्रमिकों को पुलाव और मटर पनीर उपलब्ध कराया गया। बुधवार की सुबह एक बड़ी सफलता में, ऊर्ध्वाधर ड्रिलिंग के लिए एक स्थान की पहचान की गई। सीमा सड़क संगठन ने सुरंग स्थल के पास सड़कें बनाईं, जिससे कल फंसी एक मशीन को निकलने में सुविधा हुई।
### Answer:
"""

encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])



<s>  ### Instruction: Write a concise summary of the text, return your responses with 5 lines that cover the key points of the text.
### Paragraph:
प्रधानमंत्री नरेंद्र मोदी ने बुधवार को उत्तराखंड के मुख्यमंत्री पुष्कर सिंह धामी से बात की और उत्तरकाशी सुरंग बचाव कार्य की प्रगति के बारे में जानकारी ली, जो 11वें दिन में प्रवेश कर गया है। मुख्यमंत्री ने पूर्व ट्विटर एक्स पर पोस्ट किया कि उन्होंने पीएम मोदी को पिछले 24 घंटों में हुई प्रगति के बारे में जानकारी दी। जैसे-जैसे फंसे हुए मजदूरों तक पहुंचने की कोशिशें जारी हैं, अब एजेंसियां ​​उन्हें ठोस, गर्म भोजन उपलब्ध कराकर जीवित रखने की कोशिश कर रही हैं। मंगलवार शाम को श्रमिकों को पुलाव और मटर पनीर उपलब्ध कराया गया। बुधवार की सुबह एक बड़ी सफलता में, ऊर्ध्वाधर ड्रिलिंग के लिए एक स्थान की पहचान की गई। सीमा सड़क संगठन ने सुरंग स्थल के पास सड़कें बनाईं, जिससे कल फंसी एक मशीन को निकलने में सुविधा हुई।
### Answer:
मोदी को पिएम के बारे में जानकारी दी थी।मुख्यमंत्री निश्चित था कि मध्येश्वर बांधा हुं, लेकिन बांधा हुं के लिए आयोजन किया जाना चाहते थे।
प
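The prompt in this cell follows a `### Header:` layout with an empty `### Answer:` section left for the model to complete. A small helper can build such prompts consistently. This is a hypothetical convenience function, and whether the model was fine-tuned on exactly this layout is not stated in the notebook.

```python
def build_section_prompt(sections):
    """Join (header, body) pairs into a '### Header:' prompt; the last body is usually empty."""
    lines = []
    for header, body in sections:
        lines.append(f"### {header}:")
        if body:
            lines.append(body)
    return "\n".join(lines) + "\n"

prompt = build_section_prompt([
    ("Instruction", "Write a concise summary of the text."),
    ("Paragraph", "<text to summarise>"),  # placeholder for the actual paragraph
    ("Answer", ""),
])
print(prompt)
```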

In [None]:
device = "cuda:0"
PROMPT= """ ### Instruction: Write this paragraph in hindi, you have to transliterate it on your own. don't add additional information
### Paragraph:
This year has been particularly good at the box office for Hindi cinema. After a lull that was brought about by diminishing attendance in theatres due to the Covid-19 pandemic, the film industry has bounced back with some of the biggest hits in recent years. So much is the competition among films that one particular release grossed over Rs 600 crore and yet could not hold on to the record of highest grossing Hindi film for more than a few hours.
### Answer:
"""

encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s>  ### Instruction: Write this paragraph in hindi, you have to transliterate it on your own. don't add additional information
### Paragraph:
This year has been particularly good at the box office for Hindi cinema. After a lull that was brought about by diminishing attendance in theatres due to the Covid-19 pandemic, the film industry has bounced back with some of the biggest hits in recent years. So much is the competition among films that one particular release grossed over Rs 600 crore and yet could not hold on to the record of highest grossing Hindi film for more than a few hours.
### Answer:
ये साल हिंदी सिनेमा की बॉк ऑफिस मे बड़ी छिपाई का सबसे भाड़ी घंटा में प्यार था।हिंदी सिनेमा के इलाक त्रिपाई में नींसा और एंटर्टेainer का लुनहर ने स्टोरी किया जिसे इसके साथ एफ एक्सप्लॉ का जल्द घण्टा करना पड़ा।घैंस्टार ने एक अमेजॉन के साथ एक गालिया नहीं बनाया क्योंकि अन्य फिल्म इतना महकुट थी कि एक जगह दूर ट्ठा के बनाने की उसकी कृप थी।</s></s>
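The tokenize, generate, and decode steps are identical in every cell, so they can be wrapped in one function. The sketch below uses the same calls as the cells above. The name `run_prompt` and its default values are assumptions.

```python
def run_prompt(model, tokenizer, prompt, device="cuda:0", max_new_tokens=1000):
    """Tokenize a prompt, sample a completion, and return the decoded text."""
    model_inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(device)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS for padding, as in the cells above
    )
    return tokenizer.batch_decode(generated_ids)[0]

# Usage in the notebook, once model and tokenizer are loaded:
# print(run_prompt(model, tokenizer, PROMPT))
```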


In [None]:
device = "cuda:0"
PROMPT= """
### Question:
Mujhe India Jaana hai, sabse acha city konsa hai
### Answer:
"""

encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s> 
### Question:
Mujhe India Jaana hai, sabse acha city konsa hai
### Answer:
Hahaha! 🙂 India is a vast and diverse country with many wonderful cities. Here are a few cities you may want to consider, with some of their most famous landmarks:

1. Delhi - Old Delhi (Red Fort, Qutub Minar, India Gate), New Delhi (Rajasthan; Connaught Place, Dilli Haat, Mughal Gardens), Chandigarh (capital of Punjab, Haryana and Chandigarh); and Dilli Darwaza (the gate to Delhi).
2. Bengaluru (cool weather, cultural landmarks, fun nightlife, restaurants, and 200+ lakes); Agra (Taj Mahal, Fatehpur Sikri, Agra Fort)
3. Pune (Osho Ashram, Khandala Lake, Lonavala, famous temples, Pune University)
4. Chennai (Kapaleeswarar Temple, Marthandam, Fort St George, St. George's Cathedral)
5. Rajasthan (Jaisalmer (golden sand dunes, sand castles, forts, Havelis), Jodhpur (blue city, Mehrangarh Fort, Jaswant Thada); Udaipur (Lake Pichola, City Palace, Fateh Sagar Lake)
6. Agra (Taj Mahal, Fatehpur Sikri, Agra Fort)
7.
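Every decoded output repeats the prompt, because `generate` returns the prompt tokens followed by the new tokens. To show only the completion, the prompt tokens can be sliced off before decoding. Here is the idea with plain lists (the token ids are made up for illustration):

```python
def completion_only(prompt_ids, generated_ids):
    """Drop the echoed prompt tokens from one generated sequence."""
    return generated_ids[len(prompt_ids):]

prompt_ids = [1, 835, 894]          # hypothetical token ids for the prompt
generated = [1, 835, 894, 379, 2]   # prompt tokens followed by the new tokens
print(completion_only(prompt_ids, generated))  # [379, 2]
```

With the tensors from the cell above, the equivalent slice would be `generated_ids[0, model_inputs["input_ids"].shape[1]:]`. Passing `skip_special_tokens=True` to `batch_decode` also hides markers such as `<s>` and `</s>`.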

In [None]:
device = "cuda:0"
PROMPT= """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Hanuman chalisa kya hai[/INST]
"""

encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s> 
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Hanuman chalisa kya hai[/INST]
Hanuman Chalisa is a Hindu religious hymn that is said to have been composed by Tulsidas, a 16th-century Indian poet-saint, in praise of Lord Hanuman, the monkey-like deity and devotee of Lord Rama. The hymn consists of 44 couplets and is typically recited or sung during religious gatherings, especially during Navratri and other Hindu festivals.

The Hanuman Chalisa is a plea for Lord Hanuman's blessings, seeking his help in
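Note that the decoded output starts with `<s>` twice. This suggests the tokenizer added its own beginning-of-sequence token (because of `add_special_tokens=True`) on top of the `<s>` typed into PROMPT. One way to avoid the duplicate is to strip the hand-written marker before tokenizing. The helper below is a hypothetical sketch of that clean-up.

```python
def strip_literal_bos(prompt, bos="<s>"):
    """Remove leading whitespace and one hand-written BOS marker from a prompt string."""
    prompt = prompt.lstrip()
    if prompt.startswith(bos):
        prompt = prompt[len(bos):]
    return prompt

raw = "\n<s>[INST] Hanuman chalisa kya hai[/INST]\n"
print(repr(strip_literal_bos(raw)))  # '[INST] Hanuman chalisa kya hai[/INST]\n'
```

An alternative would be to keep the literal `<s>` and pass `add_special_tokens=False` to the tokenizer instead.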

In [None]:
device = "cuda:0"
PROMPT= """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Hanuman chalisa likhe do devnagiri mei[/INST]
"""

encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s> 
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Hanuman chalisa likhe do devnagiri mei[/INST]
जय हनुमान ज्ञान गुन सागर। जय कपीस तिहुं लोक उजागर।।

राम दूत अतुलित बल धामा। अंजनि-पुत्र पवनसुत नामा।।

महाबीर बिक्रम बजरंगी। कुमति निवार सुमति के संगी।।

कंचन बरन बिराज सुबेसा। कानन कुण्डल कुँचित केसा।। हाथ बज्र औ ध्वजा बिराजे। कांधे मूंज जनेउ साजे।। शंकर सुवन केसरी नंदन। तेज प्रताप महा जग वंदन।।बिद्यावान गुनी अति चातुर। राम काज करिबे को आतुर।।प्रभु चरित्र सुनिबे को रसिया। राम लखन सीता मन बसिया।।सूक्ष्म रूप ध

In [None]:
device = "cuda:0"
PROMPT= """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Mujhe ek wikipedia ka clone banake do javascript mei  [/INST]
"""

encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=2000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])



<s> 
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Mujhe ek wikipedia ka clone banake do javascript mei  [/INST]
I cannot create a full Wikipedia clone in JavaScript, as it would require a significant amount of resources and expertise. However, I can provide you with some guidance on how to create a simple wiki-like platform using Node.js, MongoDB, and Express.js.

Here are the basic steps you can follow:

1. Install Node.js and MongoDB on your system.
2. Create a new MongoDB database and collection for 