# TII Falcon 7B Tutorial in HuggingFace Transformers and LangChain 
In this tutorial, I show how to use the open-source Falcon 7B model developed by the Technology Innovation Institute (TII). According to TII, Falcon outperforms GPT-3 while using only 75% of its training compute budget and requiring a fifth of the compute at inference time. 


### Installing packages

In [1]:
!pip install -q transformers 
!pip install -q accelerate einops 
!pip install -q langchain 
!pip install -q xformers

Next, install a CUDA-enabled build of PyTorch. If a CPU-only build is already installed, uninstall it first, then use one of the two install commands (CUDA 11.7, or a pinned 2.0.1 build for CUDA 11.8).

In [6]:
pip uninstall -y torch

In [None]:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

In [5]:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip show torch

Name: torch
Version: 2.0.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: e:\documents\personal_project\huggingface\.venv\lib\site-packages
Requires: filelock, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, xformers
Note: you may need to restart the kernel to use updated packages.


In [None]:
import torch

if torch.cuda.is_available():
    print("CUDA is available. You can use GPU.")
    device = torch.device('cuda')
else:
    print("CUDA is not available. Using CPU.")
    device = torch.device('cpu')


### Let's initialize a HuggingFace pipeline and load the LLM

If you're using Colab, make sure to select a GPU runtime. A Pro membership is preferable, since you may need access to an A100 or V100 GPU.  
Most personal setups will not be powerful enough to run this tutorial locally.  
The general steps are:  
1. Load the Falcon model from HuggingFace 
2. Load the model's tokenizer 
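
To get a feel for why a large GPU is recommended, here is a back-of-the-envelope estimate of the memory taken by the model weights alone. The figure of 7 billion parameters is a nominal assumption based on the model's name, and the helper `weight_memory_gib` is hypothetical. Activations and the generation cache come on top of this estimate.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: about 7e9 parameters (nominal figure from the "7B" name).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Return the weight footprint in GiB for a given parameter count and dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("float32", "bfloat16", "int8"):
    print(f"{dtype:>8}: {weight_memory_gib(7e9, dtype):.1f} GiB")
```

Loading in `bfloat16`, as this tutorial does, halves the footprint compared with `float32`.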

In [5]:
import torch
from torch import cuda, bfloat16 
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if cuda.is_available() else 'cpu')

# Load microsoft/phi-1_5 with bfloat16 weights
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5", 
    trust_remote_code=True, 
    torch_dtype=bfloat16,
)



AssertionError: Torch not compiled with CUDA enabled
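
The assertion above typically means that the installed PyTorch build was compiled without CUDA support. The sketch below is one way to check this. The helper name `describe_torch_build` is hypothetical, and the import is guarded so the snippet also runs where PyTorch is not installed.

```python
def describe_torch_build() -> dict:
    """Summarize whether the installed PyTorch build can use CUDA."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA version the wheel was built against; None for CPU-only builds
        "built_with_cuda": torch.version.cuda,
        # True only if the build supports CUDA and a usable GPU is visible
        "cuda_available": torch.cuda.is_available(),
    }


info = describe_torch_build()
print(info)
```

If `built_with_cuda` is `None`, reinstalling PyTorch from one of the CUDA wheel indexes used in the install cells is the usual fix.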

In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")

inputs = tokenizer('''```python
def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)

In [2]:
from torch import cuda, bfloat16 
import transformers 
from transformers import AutoTokenizer, AutoModelForCausalLM 

# Use the current CUDA device if one is available, otherwise fall back to the CPU
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'  

# Load the model with bfloat16 weights
fmodel = AutoModelForCausalLM.from_pretrained(
    'tiiuae/falcon-7b-instruct',
    trust_remote_code=True,  # the checkpoint ships its own modelling code
    torch_dtype=bfloat16
)
fmodel.eval()  # switch to inference mode
fmodel.to(device) 
print(f'Model loaded on {device}') 

A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- configuration_RW.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- modelling_RW.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


The tokenizer is loaded from the same `tiiuae/falcon-7b-instruct` checkpoint as the model: 

In [3]:
tokenizer = AutoTokenizer.from_pretrained('tiiuae/falcon-7b-instruct')

Next, initialize the HF pipeline and set the generation parameters: 

In [4]:
gen_text = transformers.pipeline(
    model=fmodel, 
    tokenizer=tokenizer, 
    task='text-generation', 
    return_full_text=True,  # include the prompt in the returned text
    device=device, 
    max_length=10000,  # upper bound on prompt plus generated tokens
    temperature=0.1,  # a low temperature keeps sampling close to greedy decoding
    top_p=0.15,  # sample only from the most likely tokens whose cumulative probability reaches 15%
    top_k=0,  # 0 disables top-k filtering
    repetition_penalty=1.1,  # without a penalty, the output starts to repeat
    do_sample=True, 
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForC
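
To make the sampling parameters above more concrete, here is a simplified, pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It is an illustration only, not the code that `transformers` runs internally. The toy logits and the helper names `softmax` and `top_p_indices` are made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(probs, top_p):
    """Indices of the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical scores for four candidate tokens

sharp = top_p_indices(softmax(toy_logits, temperature=0.1), top_p=0.15)
loose = top_p_indices(softmax(toy_logits, temperature=1.0), top_p=0.9)
print("temperature=0.1, top_p=0.15 keeps tokens:", sharp)
print("temperature=1.0, top_p=0.9 keeps tokens:", loose)
```

In this toy example, the tutorial's settings leave a single candidate token, so sampling behaves almost like greedy decoding, while looser settings keep several candidates.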

In [6]:
result = gen_text("What is the name of the first president of the united arab emirates?") 
print(result[0]['generated_text']) 

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


What is the name of the first president of the united arab emirates?
The name of the first president of the United Arab Emirates is Sheikh Zayed bin Sultan Al Nahyan.
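
Because the pipeline was created with `return_full_text=True`, the generated text starts with the prompt itself. If only the answer is wanted, the prompt can be removed afterwards. A minimal sketch, where `strip_prompt` is a hypothetical helper and not part of `transformers`:

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


question = "What is the name of the first president of the united arab emirates?"
full_text = question + "\nThe name of the first president of the United Arab Emirates is Sheikh Zayed bin Sultan Al Nahyan."
answer_only = strip_prompt(question, full_text)
print(answer_only)
```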


### LangChain Implementation 

In [7]:
from langchain import PromptTemplate, LLMChain 
from langchain.chains.conversation.memory import ConversationBufferMemory, ConversationSummaryMemory
from langchain.llms import HuggingFacePipeline 

# Create a basic prompt template 

template = """
I want you to act as an informative AI assistant. 
Answer the following question: {question}.
"""
prompt = PromptTemplate(
    input_variables=['question'], 
    template=template 
)

In [8]:
# Show how the prompt looks once the question is filled in 
prompt.format(question="What is the name of the first US president? ")

'\nI want you to act as an informative AI assistant. \nAnswer the following question: What is the name of the first US president? .\n'

In [9]:
# Wrap the HF pipeline so LangChain can call it, then combine it with the prompt
llm = HuggingFacePipeline(pipeline=gen_text)
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [11]:
llm_chain.predict(question='List me some of the most important electrolytes') 

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


'Electrolytes are essential for maintaining the balance of fluids in the body. Some of the most important electrolytes include sodium, potassium, chloride, calcium, and magnesium.'
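
Conceptually, the chain does two things: it fills the template with the supplied variables and passes the resulting text to the model. The sketch below mimics that flow with plain Python. `run_chain` and `echo_llm` are hypothetical stand-ins, not LangChain APIs, and the stand-in model only reports what it received.

```python
from typing import Callable


def run_chain(template: str, llm: Callable[[str], str], **variables: str) -> str:
    """Render the template, then pass the resulting prompt to the model."""
    prompt_text = template.format(**variables)
    return llm(prompt_text)


def echo_llm(prompt_text: str) -> str:
    # Stand-in for the Falcon pipeline: returns a note instead of generating text
    return f"(model received: {prompt_text!r})"


toy_template = "Answer the following question: {question}"
print(run_chain(toy_template, echo_llm, question="What are electrolytes?"))
```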

### Implementing LangChain conversational memory

In [13]:
# Create a conversational template with a slot for the chat history 
template = """You are an informative assistant chatting with a human. 
{chat_history}
Human:{human_input}
Assistant:""" 
# Create the prompt 
prompt = PromptTemplate(
    input_variables=["chat_history","human_input"], 
    template=template
) 
# The memory fills the {chat_history} slot with the previous turns 
memory = ConversationBufferMemory(memory_key="chat_history") 

In [14]:
llm = HuggingFacePipeline(pipeline=gen_text) 
# Add the memory to the LLM chain 
llm_chain = LLMChain(llm=llm, prompt=prompt, memory=memory) 
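
The idea behind buffer memory is simple: every exchange is appended to a running transcript, and that transcript is substituted into `{chat_history}` on the next call. The toy class below illustrates this under stated assumptions. `BufferMemory` is a hypothetical stand-in, not LangChain's implementation, and the `Human`/`AI` prefixes are assumed here for illustration.

```python
class BufferMemory:
    """Toy conversation buffer that stores each turn as a line of text."""

    def __init__(self, human_prefix="Human", ai_prefix="AI"):
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix
        self.turns = []

    def save(self, human_input, ai_output):
        """Record one exchange."""
        self.turns.append(f"{self.human_prefix}: {human_input}")
        self.turns.append(f"{self.ai_prefix}: {ai_output}")

    def history(self):
        """Return the transcript that goes into the {chat_history} slot."""
        return "\n".join(self.turns)


chat_template = """You are an informative assistant chatting with a human.
{chat_history}
Human:{human_input}
Assistant:"""

toy_memory = BufferMemory()
toy_memory.save("What is the capital of France?", "Paris.")
prompt_text = chat_template.format(
    chat_history=toy_memory.history(),
    human_input="And of Italy?",
)
print(prompt_text)
```

Because the whole transcript is resent on every call, the prompt grows with each turn.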

In [15]:
#1st prompt 
llm_chain.predict(human_input='What is the first president of UAE?')

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


' The first president of UAE was Zayed bin Sultan Al Nahyan. He served as the President from 1971 until his death in 2006.\nUser '

In [16]:
#2nd prompt 
llm_chain.predict(human_input='Who was the president that came after him?')

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


' The current president of UAE is His Highness Sheikh Khalifa bin Zayed Al Nahyan. He has been in power since 2008.\nUser '
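
Both replies above end with a dangling `\nUser `, which suggests that the model starts writing the next turn of the conversation. One simple remedy is to cut the text at the first such marker before displaying it. A minimal sketch, where `truncate_at_stop` and the list of stop strings are assumptions made for illustration:

```python
from typing import Sequence


def truncate_at_stop(text: str, stop_sequences: Sequence[str]) -> str:
    """Cut the text at the earliest stop sequence and strip surrounding whitespace."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


raw_reply = " The first president of UAE was Zayed bin Sultan Al Nahyan.\nUser "
clean_reply = truncate_at_stop(raw_reply, ["\nUser", "\nHuman:"])
print(clean_reply)
```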