## Creating a Chatbot using Falcon-7B LLM

#### Installing Dependencies

In [1]:
!pip install -q -U bitsandbytes # 8-bit and 4-bit quantization for PyTorch models
!pip install -q -U git+https://github.com/huggingface/transformers.git # Load and use pre-trained language models
!pip install -q -U git+https://github.com/huggingface/peft.git  # Parameter-efficient fine-tuning methods such as LoRA
!pip install -q -U git+https://github.com/huggingface/accelerate.git # Device placement and loading of large models
!pip install -q -U einops # Readable tensor operations
!pip install -q -U safetensors # Safe and fast tensor serialization format
!pip install xformers # PyTorch-based library of optimized Transformer building blocks
!pip install langchain # Framework for developing applications powered by language models
!pip install gradio # Easy-to-use, customizable demo UI components

#### Import Dependencies

In [2]:
import torch
import transformers
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
import gradio as gr

#### bitsandbytes configs

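NF4 stands for 4-bit NormalFloat, a data type introduced in the QLoRA paper and designed for normally distributed weights. The Hugging Face blog post below describes how it is used with bitsandbytes:
https://huggingface.co/blog/4bit-transformers-bitsandbytes

Setting `bnb_4bit_use_double_quant=True` enables nested (double) quantization, which also quantizes the quantization constants. This saves additional memory at no extra performance cost. According to the empirical observations reported in that blog post, it enables fine-tuning a Llama-13B model on a 16 GB NVIDIA T4 with a sequence length of 1024, a batch size of 1, and 4 gradient accumulation steps. This notebook only runs inference, so the benefit here is the smaller memory footprint.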

In [3]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
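
To make the idea of blockwise 4-bit quantization more concrete, here is a minimal pure-Python sketch. It is a simplification and not what bitsandbytes does internally: it uses 16 evenly spaced levels, whereas NF4 uses levels chosen to fit normally distributed weights. The names `quantize_block` and `dequantize_block` are made up for this illustration.

```python
def quantize_block(values, levels=16):
    """Map floats to integer codes 0..levels-1 using absmax scaling."""
    absmax = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero block
    step = 2 * absmax / (levels - 1)             # spacing between adjacent levels
    codes = [round((v + absmax) / step) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from the codes and the stored block scale."""
    step = 2 * absmax / (levels - 1)
    return [c * step - absmax for c in codes]


# One small "block" of example weights (arbitrary illustrative values)
weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.0, 0.45]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale}, max error={max_error:.4f}")
```

Each weight is stored as a 4-bit code plus one shared scale per block. Double quantization, as enabled in the config above, additionally compresses those per-block scales.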

#### Load model and pipeline

In [4]:
# This version is in smaller chunks on safetensors for low RAM environments
model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"

In [5]:
model_4bit = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quantization_config,
        trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

A new version of the following files was downloaded from https://huggingface.co/vilsonrodrigues/falcon-7b-instruct-sharded:
- configuration_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/vilsonrodrigues/falcon-7b-instruct-sharded:
- modeling_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Some weights of FalconForCausalLM were not initialized from the model checkpoint at vilsonrodrigues/falcon-7b-instruct-sharded and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
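
The warnings above suggest pinning a revision when `trust_remote_code=True` is used, so that updated remote code is not downloaded silently. A hedged sketch of the keyword arguments one might pass to `from_pretrained` is shown below. `PINNED_REVISION` is a placeholder, not a real commit of this repository, and the `quantization_config` from the earlier cell would be added to the same dictionary.

```python
# Placeholder only: replace with a commit hash or tag you have reviewed yourself.
PINNED_REVISION = "<commit-sha-or-tag>"

# Keyword arguments for AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
load_kwargs = {
    "device_map": "auto",
    "trust_remote_code": True,
    "revision": PINNED_REVISION,  # fixes the repository snapshot that gets downloaded
}

print(load_kwargs)
```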


In [6]:
print(model_4bit)

FalconForCausalLM(
  (transformer): FalconModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x FalconDecoderLayer(
        (self_attention): FalconAttention(
          (maybe_rotary): FalconRotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): FalconMLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)
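
The layer shapes printed above are enough to estimate the parameter count and the weight memory by hand. The sketch below is a back-of-the-envelope calculation under two assumptions: `lm_head` shares its weights with the word embeddings, and the overhead of quantization constants is ignored.

```python
hidden = 4544     # model width, from the printed layers
vocab = 65024     # vocabulary size, from the embedding layer
n_layers = 32     # number of FalconDecoderLayer blocks

embedding = vocab * hidden          # word_embeddings
qkv = hidden * 4672                 # query_key_value, no bias
attn_out = hidden * hidden          # attention dense, no bias
mlp = 2 * hidden * 18176            # dense_h_to_4h + dense_4h_to_h, no bias
layernorm = 2 * hidden              # weight + bias

linear_per_layer = qkv + attn_out + mlp
per_layer = linear_per_layer + layernorm

# Assumption: lm_head is tied to the embeddings, so it is not counted again.
total = embedding + n_layers * per_layer + layernorm  # last term is ln_f

# Only the Linear4bit layers are quantized; 4 bits = 0.5 bytes per weight.
linear_total = n_layers * linear_per_layer
bytes_4bit = linear_total * 0.5
bytes_fp16 = linear_total * 2

print(f"total parameters: {total:,}")
print(f"linear weights in 4-bit: {bytes_4bit / 2**30:.2f} GiB")
print(f"linear weights in fp16:  {bytes_fp16 / 2**30:.2f} GiB")
```

The total comes to 6,921,720,704 parameters, and the quantized linear layers need roughly a quarter of the memory they would in float16.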


In [7]:
pipeline = transformers.pipeline(
        "text-generation",
        model=model_4bit,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=296,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)
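
The `do_sample=True` and `top_k=10` settings mean that each new token is drawn at random from the 10 highest-scoring candidates. The following pure-Python sketch illustrates top-k sampling on a toy list of scores. It is not the transformers implementation, and `sample_top_k` is a hypothetical helper name.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (shifted by the max for stability)
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)                       # fixed seed for repeatability
logits = [2.0, 0.5, -1.0, 3.1, 0.0, 1.2]     # toy scores for a 6-token vocabulary
samples = [sample_top_k(logits, k=3, rng=rng) for _ in range(1000)]

# Only the three best-scoring token ids (3, 0 and 5) can ever be drawn
print(sorted(set(samples)))
```

A smaller `top_k` makes the output more predictable, while a larger one allows more varied text.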

In [8]:
# Testing out the pipeline
pipeline('Who is Elon Musk?')

[{'generated_text': 'Who is Elon Musk?\nElon Musk is a South African-born American entrepreneur, inventor, and engineer. He is a CEO and founder of Tesla, SpaceX, Neuralink and The Boring Company.'}]
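
Note that the generated text starts with the prompt itself. For a chatbot, usually only the continuation is wanted. The text-generation pipeline accepts `return_full_text=False` for this purpose. The sketch below shows the same idea as plain post-processing, with `strip_prompt` as a hypothetical helper and a shortened copy of the output above as input.

```python
def strip_prompt(generated_text, prompt):
    """Return only the model's continuation, without the echoed prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text.strip()


prompt = "Who is Elon Musk?"
# Shortened version of the pipeline output shown above
output = [{"generated_text": "Who is Elon Musk?\nElon Musk is a South African-born American entrepreneur."}]

answer = strip_prompt(output[0]["generated_text"], prompt)
print(answer)
```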

#### Use with LangChain

In [21]:
# Pass hugging face pipeline to langchain class
llm = HuggingFacePipeline(pipeline=pipeline)

# Setup prompt template
template = """You are now a conversational assistant and must answer the questions in a kind, interesting and funny way if possible:\n{question}"""

# Memory buffer (note: it is created here but not passed to the chain below, so the chain keeps no history)
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

prompt = PromptTemplate(
    template=template,
    input_variables= ["question"]
)

# Build stacked LLM chain i.e. prompt-formatting + LLM
chain = LLMChain(llm=llm, prompt=prompt)
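
The chain above is built without the `memory` object, so every question is answered without knowledge of earlier turns. As a rough illustration of what a conversation buffer does, here is a plain-Python sketch that keeps past turns and renders them into the prompt. It is not LangChain's implementation. `SimpleBuffer`, `build_prompt` and the extended template are assumptions made for this example.

```python
# Hypothetical template with an extra slot for the conversation history
TEMPLATE = (
    "You are now a conversational assistant and must answer the questions "
    "in a kind, interesting and funny way if possible:\n"
    "{chat_history}\nQuestion: {question}\nAnswer:"
)


class SimpleBuffer:
    """Stores (question, answer) pairs and renders them as text."""

    def __init__(self):
        self.turns = []

    def add(self, question, answer):
        self.turns.append((question, answer))

    def render(self):
        return "\n".join(f"Question: {q}\nAnswer: {a}" for q, a in self.turns)


def build_prompt(buffer, question):
    """Fill the template with the stored history and the new question."""
    return TEMPLATE.format(chat_history=buffer.render(), question=question)


buffer = SimpleBuffer()
buffer.add("Who is Elon Musk?", "An entrepreneur known for Tesla and SpaceX.")
prompt_text = build_prompt(buffer, "Which companies did he found?")
print(prompt_text)
```

Because `max_length=296` in the pipeline counts prompt tokens as well, a growing history leaves less room for the answer, so a real chatbot would likely need to truncate older turns.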

In [22]:
# Test LLMChain
response = chain.run('who is Elon Musk?')
response

"\nElon Musk is a brilliant entrepreneur and inventor, best known for co-founding Tesla and SpaceX. He is a visionary in the tech industry, with a knack for creating revolutionary products and services. He is a busy guy, always thinking several steps ahead and working tirelessly to push the boundaries of what's possible. So, if you're looking for someone to transform the world and make it a better place, Elon Musk is definitely your man!"

#### Gradio UI

In [23]:
# Create generate function - this will be called when a user runs the gradio app
def generate(prompt):
    # The prompt will get passed to the LLM Chain!
    return chain.run(prompt)

# Define a string variable to hold the title of the app
title = '🦅🔗 Falcon 7B LLM Chatbot'

# Define another string variable to hold the description of the app
description = 'This application demonstrates the use of the open-source `Falcon-7b-Instruct` LLM.'

# Build the Gradio interface and define inputs and outputs (just text in this case)
gr.Interface(fn=generate, inputs=["text"], outputs=["text"],
            title=title, description=description,
            theme='ParityError/Anime').launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://a3abe1274eed4867d1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
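
Since the app is reachable through a public link, it may be worth guarding the `generate` function against empty input, very long input and runtime errors. The sketch below shows one possible wrapper. `make_generate`, `max_chars` and `fake_chain` are hypothetical names, and `fake_chain` stands in for `chain.run` so that the example runs without a model.

```python
def make_generate(run_chain, max_chars=1000):
    """Wrap a text-in, text-out callable with basic input checks."""

    def generate(prompt):
        text = (prompt or "").strip()
        if not text:
            return "Please enter a question."
        try:
            # Truncate overly long input before it reaches the model
            return run_chain(text[:max_chars]).strip()
        except Exception as exc:  # show a readable message instead of a stack trace
            return f"Generation failed: {exc}"

    return generate


def fake_chain(text):
    """Stand-in for chain.run, used only for this demonstration."""
    return f"\nYou asked: {text}"


generate = make_generate(fake_chain)
print(generate("  who is Elon Musk?  "))
print(generate(""))
```

In the notebook, `make_generate(chain.run)` could be passed as `fn` to `gr.Interface` in place of the original `generate`.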


