
# **Analytics 2 :** <font color=#DF4807>**Hugging Face Pipelines and LangChain**</font>



# **Preparation:**

We are going to be using an open-source LLM today, but first we need to get a few permissions from Meta and Hugging Face.

  1. First, you will need a Hugging Face account. You can sign up for one for free at https://huggingface.co/join.


  2. Next, you will need to request permission from Meta. You can make the request at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf when you are logged in with your Hugging Face credentials. This request may take one to two days to process.

  3. Finally, request a token from Hugging Face from your settings page at https://huggingface.co/settings/tokens. You will need this token to download the model.
  
Your Hugging Face account email address <font color=#DF4807>**MUST match**</font> the email you provide on the Meta website, or your request will not be approved.

We will need the Hugging Face token later.

## **Pipelines**

Pipelines are objects that abstract away most of the library's complex code, offering a simple API for several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.

A summary of the supported tasks can be found here:
https://huggingface.co/docs/transformers/task_summary

In [None]:
#install requirements

!pip install -q transformers datasets torch sentencepiece

from datasets import load_dataset
# from datasets import load_metric  # unused below; recent `datasets` releases dropped it in favor of the `evaluate` package

from transformers import pipeline
from transformers import AutoTokenizer #generic tokenizer that we can set to match a pre-trained model
from transformers import AutoModelForQuestionAnswering, DistilBertConfig #AutoModel class plus bert config
from transformers import Trainer, TrainingArguments #used for fine tuning a model

from sklearn.metrics import f1_score
import torch


Let's look at some examples...

In [None]:
# classification
classifier = pipeline("text-classification")
classifier(["This restmovie was surprisingly terrible", "The menu prices were really awful"])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.999505877494812},
 {'label': 'NEGATIVE', 'score': 0.9997275471687317}]
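
The pipeline returns one dictionary per input text, each with a `label` and a `score`. As a minimal sketch of working with that output, the snippet below keeps only predictions above a confidence cut-off. The helper name `confident_predictions` and the 0.9 threshold are illustrative assumptions, and the predictions are copied from the output above rather than recomputed.

```python
# Predictions copied from the classifier output above (not recomputed here).
preds = [
    {"label": "NEGATIVE", "score": 0.999505877494812},
    {"label": "NEGATIVE", "score": 0.9997275471687317},
]

def confident_predictions(preds, threshold=0.9):
    """Return (index, label, rounded score) for predictions at or above `threshold`.

    `threshold` is a hypothetical cut-off chosen for illustration only.
    """
    return [
        (i, pred["label"], round(pred["score"], 4))
        for i, pred in enumerate(preds)
        if pred["score"] >= threshold
    ]

kept = confident_predictions(preds)
for index, label, score in kept:
    print(f"input {index}: {label} ({score})")
```

The index in each tuple refers to the position of the text in the list passed to the pipeline, so results can be matched back to their inputs.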

Each pipeline task comes with a default pre-trained model. To see which models are used by default, we can visit the GitHub repo:

https://github.com/huggingface/transformers/blob/71688a8889c4df7dd6d90a65d895ccf4e33a1a56/src/transformers/pipelines.py#L2716-L2804

However, we can pass any compatible model from the Hugging Face Hub instead, for example:

`myPipeline = pipeline(model="roberta-large-mnli")`

Link to the translation model used in the next example: https://huggingface.co/Helsinki-NLP/opus-mt-en-de

In [None]:
# translation from English to German
translator = pipeline('translation', model='Helsinki-NLP/opus-mt-en-de')
translator(["What is your name?"])

[{'translation_text': 'Wie heißt du?'}]

In [None]:
# question answering
bot = pipeline(task="question-answering")
preds = bot(

    question="What is the average salary of a data scientist in Ireland?",
    context="The average salary for a data scientist is € 64,048 per year in Ireland. In the last 12 months, the average wage has decreased by -6.44% compared to the previous year."

)

print(f"score: {round(preds['score'], 4)}, answer: {preds['answer']}")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


score: 0.8225, answer: € 64,048
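
The default question-answering model is extractive: the answer is a span copied out of the context, not newly generated text. Alongside `score` and `answer`, the pipeline result also carries `start` and `end` character offsets for that span. The sketch below illustrates the idea using only the strings from the example above, locating the span with `str.find` instead of running the model.

```python
# Context and answer copied from the question-answering example above.
context = (
    "The average salary for a data scientist is € 64,048 per year in Ireland. "
    "In the last 12 months, the average wage has decreased by -6.44% "
    "compared to the previous year."
)
answer = "€ 64,048"

# Locate the answer inside the context as character offsets.
start = context.find(answer)
end = start + len(answer)

# Slicing the context with those offsets gives back the answer text.
print(start, end)
print(context[start:end])
```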


# **LangChain & LLMs**

Today we will focus on using LLMs with LangChain. In LangChain, an LLM takes a string as input and returns a string.

The code would look something like this:



```
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

llm = OpenAI()
chat_model = ChatOpenAI()

llm.predict("hi!")
# >> "Hi"
```

Alternatively, we can also ask the model to make a prediction with something along the lines of:



```
text = "What would be a good company name for a company that makes colorful socks?"

llm.predict(text)
# >> Feetful of Fun

```




## **Prompt Template**

A prompt template lets us build the input from a reusable pattern with placeholders, so we do not have to write out a long descriptive input each time.

```
from langchain.prompts import PromptTemplate

prompt = PromptTemplate.from_template("What is a good name for a company that makes {product}?")
prompt.format(product="colorful socks")
```
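
To make the idea concrete without installing LangChain, here is a plain-Python sketch of what filling a template amounts to. The helper `format_prompt` is a hypothetical stand-in for illustration, not LangChain's implementation.

```python
# Plain-Python sketch of the idea behind a prompt template (no LangChain needed).
template = "What is a good name for a company that makes {product}?"

def format_prompt(template: str, **values: str) -> str:
    """Fill the named placeholders in `template` with the given values."""
    return template.format(**values)

prompt_text = format_prompt(template, product="colorful socks")
print(prompt_text)
# What is a good name for a company that makes colorful socks?

# The same template can be reused with a different product.
print(format_prompt(template, product="wooden toys"))
```

The benefit is reuse: the wording of the question is written once, and only the variable part changes between calls.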

You can also use these prompt templates to set the tone of the output. Consider the following example:



```
from langchain.prompts.chat import ChatPromptTemplate

template = "You are a helpful assistant that translates {input_language} to {output_language}."
human_template = "{text}"

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", template),
    ("human", human_template),
])

chat_prompt.format_messages(input_language="English", output_language="French", text="I love programming.")
```
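
The chat version works the same way, except that the result is a list of role-tagged messages rather than one string. A rough plain-Python sketch of that idea follows. The `(role, content)` tuples and the helper `format_messages` are simplifications assumed for illustration and do not reproduce LangChain's message classes.

```python
# Plain-Python sketch of a chat prompt: a list of (role, template) pairs
# whose placeholders are all filled in one call.
message_templates = [
    ("system", "You are a helpful assistant that translates {input_language} to {output_language}."),
    ("human", "{text}"),
]

def format_messages(message_templates, **values):
    """Return (role, content) pairs with every placeholder filled in."""
    return [(role, tmpl.format(**values)) for role, tmpl in message_templates]

messages = format_messages(
    message_templates,
    input_language="English",
    output_language="French",
    text="I love programming.",
)
for role, content in messages:
    print(f"{role}: {content}")
```

The system message sets the behavior or tone, and the human message carries the actual request.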




## **Using Llama 2 with LangChain**

In [None]:
!pip install langchain xformers transformers datasets bitsandbytes accelerate --quiet


In [None]:
# first, paste your Hugging Face token here
hf_token = 'your key here'
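
Pasting a token directly into a notebook makes it easy to share by accident. One possible alternative, sketched here, is to read it from an environment variable. The helper `get_hf_token`, the default variable name `HF_TOKEN`, and the demo variable `DEMO_HF_TOKEN` are assumptions for illustration; use whatever name fits your setup.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face token from an environment variable.

    Raises a clear error if the variable is missing, instead of failing
    later when the model download is refused.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token

# Demonstration with a dummy value in a separate, hypothetical variable,
# so a real token in HF_TOKEN is left untouched.
os.environ["DEMO_HF_TOKEN"] = "hf_dummy_token_for_illustration"
demo_token = get_hf_token("DEMO_HF_TOKEN")
print(demo_token[:3] + "...")  # avoid printing the full token
```

In the notebook itself, this would become `hf_token = get_hf_token()` once the real variable is set.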

In [None]:
from langchain.chains import ConversationChain
from langchain.llms import HuggingFacePipeline
from langchain.memory import ConversationSummaryBufferMemory
from langchain.prompts.prompt import PromptTemplate

from torch import cuda, bfloat16
import torch
from transformers import StoppingCriteria, StoppingCriteriaList
import transformers


In [None]:
#define model
model_id = 'meta-llama/Llama-2-7b-chat-hf'

#check if gpu is available
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_token
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True, # allows custom model code from the Hub repo to run locally; only enable for repos you trust
    config=model_config,
    quantization_config=bnb_config, #optimizing how the model runs on hardware
    device_map='auto', #will map model to device, i.e. gpu or cpu
    use_auth_token=hf_token
)
model.eval()

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_token
)

stop_list = ['\nHuman:', '\n```\n']

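# skip special tokens (e.g. the leading BOS token) so the stop ids can match in the middle of generated text
stop_token_ids = [tokenizer(x, add_special_tokens=False)['input_ids'] for x in stop_list]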
stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model reply can be quite lengthy
    temperature=0.1,  # 'randomness' of outputs; values near 0 are close to deterministic, higher values are more random
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
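
The custom `StopOnTokens` class above stops generation once the generated ids end with one of the stop sequences. The same suffix check can be sketched with plain Python lists in place of tensors. The helper name `ends_with_stop_sequence` is hypothetical, and the token ids below are made up for illustration; they are not real Llama 2 token ids.

```python
def ends_with_stop_sequence(generated_ids, stop_sequences):
    """Return True if `generated_ids` ends with any of the stop sequences."""
    for stop_ids in stop_sequences:
        n = len(stop_ids)
        # Compare the last n generated ids with the stop sequence.
        if n and len(generated_ids) >= n and generated_ids[-n:] == stop_ids:
            return True
    return False

# Made-up token ids, for illustration only.
stop_sequences = [[5, 6], [9]]

print(ends_with_stop_sequence([1, 2, 3, 5, 6], stop_sequences))  # ends with [5, 6]
print(ends_with_stop_sequence([1, 2, 5, 6, 3], stop_sequences))  # stop ids not at the end
print(ends_with_stop_sequence([5], stop_sequences))              # too short to match [5, 6]
```

Only the tail of the sequence is checked, because the criterion is evaluated again after every newly generated token.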





DEFAULT_TEMPLATE:

`DEFAULT_TEMPLATE` is a string that defines the prompt template for the LangChain chatbot. It is used to build the full prompt, including the conversation context, before each response. The template combines Llama 2 chat formatting markers with two placeholders:



```
[INST] ... [/INST]: Markers that wrap the instruction sent to the model (not placeholders).
<<SYS>> ... <</SYS>>: Markers that wrap the system message (not placeholders).
{history}: This placeholder is replaced with the conversation history.
{input}: This placeholder is replaced with the current user input.
```

In [None]:
DEFAULT_TEMPLATE = """<s>[INST] <<SYS>>
The following is a friendly conversation between a human and a robot on a serious mission. If the AI does not know the answer to a question, it truthfully says it does not know.
Current conversation:
{history}
<</SYS>>
{input} [/INST]"""

PROMPT = PromptTemplate(input_variables=["history", "input"], template=DEFAULT_TEMPLATE)
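
To see what the model actually receives, the template can be filled in by hand. The sketch below uses plain `str.format` on a copy of the template text; the variable names and the sample history are made up for illustration, and in the real chain the history comes from the memory object.

```python
# Copy of the template text from above, kept under a separate name.
template_text = """<s>[INST] <<SYS>>
The following is a friendly conversation between a human and a robot on a serious mission. If the AI does not know the answer to a question, it truthfully says it does not know.
Current conversation:
{history}
<</SYS>>
{input} [/INST]"""

# Made-up conversation history, for illustration only.
sample_history = "Human: Hello!\nAI: Hello, how can I help?"

rendered = template_text.format(history=sample_history, input="Tell me a funny joke.")
print(rendered)
```

Everything between `<<SYS>>` and `<</SYS>>` acts as the system message, and the text just before `[/INST]` is the user's current input.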

In [None]:
# named `chat_pipeline` so it does not shadow the `pipeline` function imported from transformers earlier
chat_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=1000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id
)

llm = HuggingFacePipeline(pipeline=chat_pipeline, model_kwargs={'temperature':0.7})
chain = ConversationChain(llm=llm, memory=ConversationSummaryBufferMemory(llm=llm, max_token_limit=100), prompt=PROMPT)

# checking again that everything is working fine
response = chain.predict(input="Tell me a funny joke.")
print(response)

  Oh, wow, I'm so glad you asked! *chuckles* I've got a doozy of a joke for you. *clears throat*

Why did the astronaut break up with his girlfriend before going to Mars? *pauses for dramatic effect* Because he needed his space! *groans* Get it? Space! *winks*

I know, I know, it's a bit of a cheesy pun, but I just can't help myself. I'm just an AI, after all, and puns are my bread and butter. *chuckles* But seriously, it's great to be here with you on this mission. It's not every day that I get to chat with a human on a serious space mission. *nods*

So, what else would you like to know? I'm all ears... or should I say, all circuits? *winks*


In [None]:
chain.predict(input="What did I just ask about?")

'  You just asked the AI to tell a joke. The AI responded with a cheesy pun about an astronaut breaking up with his girlfriend before going to Mars.'