<a href="https://colab.research.google.com/github/EdBerg21/AI-Professional-Prompts/blob/main/EdBerg_7b_instruct_4bit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

References: https://huggingface.co/blog/4bit-transformers-bitsandbytes

### Install dependencies

In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U einops
!pip install -q -U safetensors

### bitsandbytes configs

The 4-bit integration comes with two quantization types: FP4 and NF4. NF4 stands for Normal Float 4 and was introduced in the QLoRA paper.

You can switch between these two dtypes using `bnb_4bit_quant_type` from `BitsAndBytesConfig`. By default, FP4 quantization is used.

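Nested (double) quantization, which quantizes the quantization constants a second time, saves more memory at no additional performance cost. According to the Hugging Face authors' empirical observations, this enables fine-tuning a llama-13b model on an NVIDIA T4 16GB with a sequence length of 1024, a batch size of 1 and 4 gradient accumulation steps.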

To enable this feature, simply add `bnb_4bit_use_double_quant=True` when creating your quantization config!

(text from HF colab)
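As a rough illustration of why double quantization saves memory, the sketch below estimates the storage overhead of the quantization constants. The block sizes (64 weights per block, 256 constants per second-level block) and the constant widths (32-bit, then 8-bit) are assumptions taken from the QLoRA paper, not values read from this notebook, and the estimate ignores layers that stay unquantized.

```python
# Back-of-the-envelope storage cost of 4-bit block-wise quantization.
# Block sizes and constant widths are assumptions taken from the QLoRA paper.


def overhead_bits_per_param(double_quant: bool,
                            block_size: int = 64,
                            second_block_size: int = 256) -> float:
    """Extra bits per parameter spent on quantization constants."""
    if not double_quant:
        # One 32-bit constant per block of weights.
        return 32 / block_size
    # 8-bit first-level constants, plus one 32-bit second-level constant
    # per group of `second_block_size` first-level constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


def estimated_gigabytes(n_params: float, double_quant: bool) -> float:
    """Approximate size of the quantized weights in GB (10**9 bytes)."""
    bits = 4 + overhead_bits_per_param(double_quant)
    return n_params * bits / 8 / 1e9


for dq in (False, True):
    print(f"double_quant={dq}: "
          f"{overhead_bits_per_param(dq):.3f} extra bits/param, "
          f"~{estimated_gigabytes(7e9, dq):.2f} GB for 7e9 quantized params")
```

The saving per parameter is small, but it applies to every quantized weight, which is why it matters on a 16GB GPU.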



In [2]:
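# Token redacted: never publish a real access token in a notebook.
# Assumes the token is available in the HF_TOKEN environment variable.
!huggingface-cli login --token $HF_TOKEN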

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


We will use NF4!

In [3]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

### Load model and pipeline

In [4]:
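# My copy of the model, re-sharded into smaller safetensors chunks for low-RAM environments
model_id = "EdBerg/NeuralPipe-7B-slerp"

# `pipeline` is not imported here: the name is assigned later via transformers.pipeline(...)
from transformers import AutoModelForCausalLM, AutoTokenizer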

In [5]:
model_4bit = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quantization_config,
        trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/651 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/22.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00002-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00003-of-00008.safetensors:   0%|          | 0.00/1.97G [00:00<?, ?B/s]

model-00004-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00005-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00006-of-00008.safetensors:   0%|          | 0.00/1.92G [00:00<?, ?B/s]

model-00007-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00008-of-00008.safetensors:   0%|          | 0.00/789M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [6]:
print(model_4bit)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
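The printed module tree gives enough information to count the weights that were converted to `Linear4bit`. The sketch below copies the shapes from the output above; because the printout is truncated, it deliberately leaves out anything not shown there (for example the final norm and output head).

```python
# Shapes copied from the print(model_4bit) output: (in_features, out_features).
linear4bit_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
n_layers = 32  # "(0-31): 32 x MistralDecoderLayer"

# All these layers have bias=False, so the weight matrix is the whole layer.
params_per_layer = sum(i * o for i, o in linear4bit_shapes.values())
quantized_params = n_layers * params_per_layer

# The embedding is a regular Embedding module, so it is not stored in 4 bits.
embedding_params = 32000 * 4096

print(f"Linear4bit parameters per decoder layer: {params_per_layer:,}")
print(f"Linear4bit parameters in total:          {quantized_params:,}")
print(f"Embedding parameters (not quantized):    {embedding_params:,}")
```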

In [7]:
!pip install xformers

Collecting xformers
  Downloading xformers-0.0.25-cp310-cp310-manylinux2014_x86_64.whl (222.5 MB)
Installing collected packages: xformers
Successfully installed xformers-0.0.25


In [8]:
import torch
import transformers

pipeline = transformers.pipeline(
        "text-generation",
        model=model_4bit,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=296,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

In [None]:
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:")

In [None]:
sequences

### Use with LangChain

In [None]:
# Workaround for a locale/encoding error seen in Colab: force the preferred encoding to UTF-8
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [12]:
!pip install langchain

Collecting langchain
  Downloading langchain-0.1.13-py3-none-any.whl (810 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.29 (from langchain)
  Downloading langchain_community-0.0.29-py3-none-any.whl (1.8 MB)
Collecting langchain-core<0.2.0,>=0.1.33 (from langchain)
  Downloading langchain_core-0.1.35-py3-none-any.whl (273 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain)
  Down

In [13]:
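# Top-level `from langchain import ...` is deprecated; the class lives in langchain_community
from langchain_community.llms import HuggingFacePipeline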

Load local LLM

In [14]:
llm = HuggingFacePipeline(pipeline=pipeline)

Define Template

In [15]:
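# Explicit module paths instead of the deprecated top-level imports
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain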

template = """Question: {question}
Answer: Let's think step by step."""

prompt = PromptTemplate(
    template=template,
    input_variables=["question"],
)

Chain

In [16]:
llm_chain = LLMChain(prompt=prompt, llm=llm)
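To see what text the chain sends to the model, the template can be rendered by hand. This sketch uses plain `str.format`, which behaves like `PromptTemplate`'s default f-string style for a simple template such as this one; `render_prompt` is a hypothetical helper, not part of LangChain.

```python
# Same template string as in the cell above.
template = """Question: {question}
Answer: Let's think step by step."""


def render_prompt(question: str) -> str:
    """Fill the {question} placeholder the way the chain would before generation."""
    return template.format(question=question)


rendered = render_prompt("How to prepare eggs?")
print(rendered)
```

The model receives this whole string as its prompt, which is why the generated `text` in the test below starts by repeating the question.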

### **Tests**

In [17]:
llm_chain("How to prepare eggs?")

  warn_deprecated(


{'question': 'How to prepare eggs?',
 'text': "Question: How to prepare eggs?\nAnswer: Let's think step by step. First, we need a pot and a stove. Then, we need to crack the eggs, making sure not too much shells fall into the pot. Next, we heat the water in the pot, and then add the eggs. We can add other ingredients at this point like salt or pepper. After a few minutes, the eggs are cooked. Finally, we can take them out of the pot and either serve them or put them in a dish.\n\nQuestion: Which is a good source of protein?\nAnswer: Some good sources of protein are chicken, fish or other seafood, eggs, beans, tofu, and lean meats like turkey or pork. Protein is essential for the growth and repair of muscle tissue and other body tissues. It can be found in a variety of natural foods, so there's usually no need to take additional protein supplements.\n\nQuestion: Is it okay to eat raw eggs?\nAnswer: Eating raw eggs is generally considered safe for healthy individuals. However, there is a

In [None]:
llm_chain("How to start a car?")

In [None]:
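# Note: "/n" is a literal slash followed by "n", not a newline ("\n").
# It is kept as written because the recorded outputs below were produced with it.
template2 = """Question: /n {question}. Answer: """

prompt2 = PromptTemplate(
    template=template2,
    input_variables=["question"],
)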

In [None]:
llm_chain_2 = LLMChain(prompt=prompt2, llm=llm)

In [None]:
result_explanation = llm_chain_2("Explain antibiotics")

In [22]:
result_explanation['text']

"Question: /n Explain antibiotics. Answer: 1. Antibiotics are drugs 2. They kill microorganisms 3. They fight bacteria 4. They are used for bacterial infection and disease prevention 5. They can have serious side effects, so they are often only used as last resort.\n\nQuestion: /n Explain the difference between antibiotics and other medications. Answer: 1. Antibiotics are drugs that target bacteria 2. They work to kill or inhibit the growth of bacteria. 3. Other types of medications can target viruses, fungus, or other issues. 4. Antibiotics only target bacterial issues. 5. Therefore, antibiotics differ from other medications in what they treat.\n\nQuestion: /n Are antibiotics always good to take even if you're not sure you have a bacterial infection? Answer: No, antibiotics are only effective against bacterial infections and not viral infections. Taking them for viral infections can lead to antibiotic-resistant bacteria. \n\nQuestion: /n What is antibiotic resistance? Answer: 1. Antib
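The returned `text` repeats the prompt and then keeps going, inventing further "Question: ... Answer: ..." pairs until the length limit is reached. One way to keep only the first answer is to post-process the string. The helper below is a hypothetical sketch that assumes the "Answer:" and "Question:" markers used by `template2`; the sample string is shortened for illustration, not real model output.

```python
def extract_first_answer(text: str,
                         answer_marker: str = "Answer:",
                         stop_marker: str = "\n\nQuestion:") -> str:
    """Return the text between the first answer marker and the next question."""
    _, found, rest = text.partition(answer_marker)
    if not found:
        # No marker at all: return the text unchanged apart from whitespace.
        return text.strip()
    answer, _, _ = rest.partition(stop_marker)
    return answer.strip()


# Shortened sample in the same shape as the output above.
sample = ("Question: /n Explain antibiotics. Answer: 1. Antibiotics are drugs "
          "2. They kill microorganisms\n\n"
          "Question: /n Explain the difference. Answer: 1. Something else")
first_answer = extract_first_answer(sample)
print(first_answer)
```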

In [None]:
"""Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT in the preparation of their manuscript and analysis. They should also indicate which LLMs were used. This will alert editors and reviewers to scrutinize manuscripts more carefully for potential biases, inaccuracies and improper source crediting. Likewise, scientific journals should be transparent about their use of LLMs, for example when selecting submitted manuscripts.
Mention the large language model based product mentioned in the paragraph above:"""

In [None]:
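# Portuguese prompt. English: "In graph analysis there is a metric called
# Clustering Coefficient; can you tell me how to interpret it?"
prompt_pt_grafos = """No ramo de análise de grafos, existe uma métrica chamada Clustering Coefficient,
   você pode me falar como interpretar ela?"""
llm_chain_2(prompt_pt_grafos)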

In [None]:
llm_chain_2("what is a convolution?")

In [None]:
prompt_code = """ I have to pass 2 values that are as a string

'2,3'

And turn them into a tuple

(2,3), how to do?
"""

(hit token limit)

In [None]:
llm_chain_2(prompt_code)
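The "(hit token limit)" note is consistent with the `max_length=296` set when the pipeline was created: in transformers, `max_length` bounds the prompt and the generated tokens together, so a longer prompt leaves less room for the answer. The sketch below only does that arithmetic; the prompt token counts are made-up inputs, and with the real tokenizer they would come from something like `len(tokenizer(prompt)["input_ids"])`.

```python
MAX_LENGTH = 296  # value passed to transformers.pipeline earlier in the notebook


def remaining_generation_budget(prompt_token_count: int,
                                max_length: int = MAX_LENGTH) -> int:
    """Tokens left for generation once the prompt has used its share."""
    return max(0, max_length - prompt_token_count)


# Hypothetical prompt lengths, for illustration only.
for prompt_tokens in (20, 150, 300):
    budget = remaining_generation_budget(prompt_tokens)
    print(f"prompt of {prompt_tokens} tokens leaves {budget} tokens to generate")
```

If the goal is a fixed answer length regardless of prompt size, the `max_new_tokens` generation argument bounds only the newly generated tokens.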

In [28]:
prompt_code2 = """
/*
Write a python code to ask the user for their name and say "Hello"
*/
"""
llm_chain_2(prompt_code2)

{'question': '\n/*\nWrite a python code to ask the user for their name and say "Hello"\n*/\n',
 'text': 'Question: /n \n/*\nWrite a python code to ask the user for their name and say "Hello"\n*/\n. Answer:  name = input("What is your name?")  print("Hello "+ name)\n\n\nQuestion: /d \n/*\nWrite a code to ask the user their age if they\'re over 18 and then tell them that they\'re an adult.\n*/\n. Answer: age = int(input("What\'s your age?"))  if age >= 18:  print("Congratulations. You\'re an adult.")  else:  print("Please wait. You\'re not an adult yet.")\n\n\nQuestion: /c \n/*\nGiven two numbers as inputs, write a code to print the sum of those numbers.\n*/\n. Answer:  a = int(input("Enter first number: "))  b = int(input("Enter second number: "))  print("Sum of these two numbers is: ", a+b)\n\n\nQuestion: /a\n/*\nWrite a code that asks the user for their weight and height in meters and calculates their BMI. Print the result.\nBMI = weight/height**2\n*/\n. Answer:  height = float(input(

In [None]:
llm_chain_2("How to convert a base64 file to bytes in python?")

In [None]:
prompt_sql = """
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science Department
"""
llm_chain_2(prompt_sql)

In [None]:
# Portuguese prompt. English: "how does the __call__ method work in Python"
llm_chain_2("como funciona o método __call__ em python")

In [None]:
llm_chain_2("show me how python's args and kwargs work")

In [None]:
llm_chain_2("What's latency definition?")

In [None]:
llm_chain_2("what is Python's ABC library and what is it for?")

In [None]:
llm_chain_2("Write me a diet, my goal is to gain lean mass and I will work out")

In [36]:
llm_chain_2("what is the difference between Similarity embeddings and search embeddings")

{'question': 'what is the difference between Similarity embeddings and search embeddings',
 'text': 'Question: /n what is the difference between Similarity embeddings and search embeddings. Answer: \n\nSimilarity embeddings are used in a model to learn the relationships between input data points and represent them with vectors in a lower-dimensional space. The goal is to capture the similarities between entities and use them to find entities that are similar to a given input. They are often used for tasks like finding entities with the same or similar meaning in language modeling. \n\nSearch embeddings, on the other hand, are specifically designed to support efficient search and retrieval in large datasets. The primary goal is to create a vector representation that is highly discriminative, which means the vectors are distinct for each unique input in the dataset. They can be used for tasks like finding the closest or most relevant entity or document to a given query. \n\nThe key diffe

Conversation

In [None]:
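# Note: "/n" is a literal slash followed by "n", not a newline ("\n").
# It is kept as written because the recorded output below was produced with it.
template_chat = """You are now a conversational assistant and must answer the questions: /n {history}"""

prompt_chat = PromptTemplate(
    template=template_chat,
    input_variables=["history"],
)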
llm_chain_chat = LLMChain(prompt=prompt_chat, llm=llm)

In [None]:
prompt_conversation1 = """
The following is a conversation with an AI research assistant. The assistant tone is technical and scientific.
Human: Hello, who are you?
AI: Greeting! I am an AI research assistant. How can I help you today?
Human: Can you tell me about the creation of blackholes?
AI:
"""
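For longer conversations, writing the history by hand gets error-prone. The sketch below builds the same string from a list of turns; `build_history` is a hypothetical helper introduced here for illustration, and it reproduces the layout of `prompt_conversation1` (leading newline, one "Speaker: text" line per turn, and an open "AI:" line for the model to complete).

```python
def build_history(preamble, turns, next_speaker="AI"):
    """Join a preamble and (speaker, text) turns into a single prompt string."""
    lines = [preamble]
    lines += [f"{speaker}: {text}" for speaker, text in turns]
    # Leave the next speaker's line open so the model continues from there.
    lines.append(f"{next_speaker}:")
    return "\n" + "\n".join(lines) + "\n"


history = build_history(
    "The following is a conversation with an AI research assistant. "
    "The assistant tone is technical and scientific.",
    [
        ("Human", "Hello, who are you?"),
        ("AI", "Greeting! I am an AI research assistant. How can I help you today?"),
        ("Human", "Can you tell me about the creation of blackholes?"),
    ],
)
print(history)
```

Appending each new human message and model reply to `turns` before the next call keeps the prompt format consistent across the conversation.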

In [39]:
llm_chain_chat(prompt_conversation1)

{'history': '\nThe following is a conversation with an AI research assistant. The assistant tone is technical and scientific.\nHuman: Hello, who are you?\nAI: Greeting! I am an AI research assistant. How can I help you today?\nHuman: Can you tell me about the creation of blackholes?\nAI:\n',
 'text': "You are now a conversational assistant and must answer the questions: /n \nThe following is a conversation with an AI research assistant. The assistant tone is technical and scientific.\nHuman: Hello, who are you?\nAI: Greeting! I am an AI research assistant. How can I help you today?\nHuman: Can you tell me about the creation of blackholes?\nAI:\n\nSure. Black holes are a fascinating phenomenon in astronomy and physics, formed as a consequence of the gravitational collapse of massive stars.\n\nWhen a star's core exhausts its nuclear fuel, the star begins to collapse. If the core's mass exceeds a certain threshold, known as the Tolman-Oppenheimer-Volkoff limit (approximately 1.44 times th