### Installing Libraries

In [None]:
!pip -q install langchain langchain-huggingface langchain-community huggingface_hub transformers sentence_transformers

#### Hugging Face
LangChain provides two Hugging Face LLM wrappers: one for local pipelines and one for models hosted on the Hugging Face Hub. Both work with models that support text generation and text-to-text generation tasks.

**Local Pipeline Wrapper**: Runs a model in your local environment through a transformers pipeline for text generation or text-to-text generation.

**Hugging Face Hub Wrapper**: Calls pre-trained models hosted on the Hugging Face Hub, with no local setup or training.

Both wrappers expose the same interface, so the rest of your code stays the same whether the model runs locally or on the Hub.

In [1]:
import os

os.environ['HUGGINGFACEHUB_API_TOKEN'] = ''
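
Hard-coding the token in a notebook cell risks leaking it when the notebook is shared. A minimal sketch of one alternative, assuming an interactive prompt is acceptable: ask for the token only when the environment variable is unset or empty. The helper name `ensure_hf_token` is hypothetical and is not part of LangChain or `huggingface_hub`.

```python
import os
from getpass import getpass


def ensure_hf_token(env=os.environ, prompt=getpass):
    """Return the Hugging Face token, prompting only if it is missing.

    Hypothetical helper for illustration; not a library function.
    """
    token = env.get("HUGGINGFACEHUB_API_TOKEN", "").strip()
    if not token:
        # getpass hides the typed value so it does not appear in cell output.
        token = prompt("Hugging Face API token: ").strip()
        env["HUGGINGFACEHUB_API_TOKEN"] = token
    return token
```

Calling `ensure_hf_token()` once at the top of the notebook leaves the variable set for the cells that follow.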

### langchain-huggingface: 0.1.2

**chat_models**

chat_models.huggingface.ChatHuggingFace

Hugging Face LLMs as ChatModels.


chat_models.huggingface.TGI_MESSAGE

Message to send to the TextGenInference API.

chat_models.huggingface.TGI_RESPONSE

Response from the TextGenInference API.

**embeddings**

embeddings.huggingface.HuggingFaceEmbeddings

HuggingFace sentence_transformers embedding models.

embeddings.huggingface_endpoint.HuggingFaceEndpointEmbeddings

HuggingFaceHub embedding models.

**llms**

llms.huggingface_endpoint.HuggingFaceEndpoint

HuggingFace Endpoint.

llms.huggingface_pipeline.HuggingFacePipeline

HuggingFace Pipeline API.

### Using Chat Hugging Face


Works with HuggingFaceTextGenInference, HuggingFaceEndpoint, HuggingFaceHub, and HuggingFacePipeline LLMs.

When this class is instantiated, the model_id is resolved from the URL provided to the LLM, and the matching tokenizer is loaded from the Hugging Face Hub.
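
The tokenizer matters because a chat model expects its messages rendered into a single prompt string using the model's own chat template. The sketch below illustrates the idea with made-up role markers. The `<|role|>` / `<|end|>` format and the `render_chat` helper are assumptions for illustration only; they are neither the template of any particular model nor the code that ChatHuggingFace runs.

```python
# Map LangChain-style role names onto the names used by the made-up template.
ROLE_ALIASES = {"human": "user", "ai": "assistant"}


def render_chat(messages):
    """Render (role, content) pairs into one prompt string.

    Hypothetical template: each message becomes "<|role|>\\ncontent<|end|>\\n",
    followed by an open assistant turn for the model to complete.
    """
    parts = []
    for role, content in messages:
        role = ROLE_ALIASES.get(role, role)
        parts.append(f"<|{role}|>\n{content}<|end|>\n")
    # Leave the assistant turn open so generation continues from here.
    parts.append("<|assistant|>\n")
    return "".join(parts)


print(render_chat([
    ("system", "You are a helpful translator."),
    ("human", "I love programming."),
]))
```

A real chat template differs per model, which is why the tokenizer has to be fetched for the specific model_id.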

In [5]:
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

llm = HuggingFaceEndpoint(
    repo_id="microsoft/Phi-3-mini-4k-instruct",
    task="text-generation",
    max_new_tokens=512,
    do_sample=False,
    repetition_penalty=1.03,
)

chat = ChatHuggingFace(llm=llm, verbose=True)

In [9]:
messages = [
    ("system", "You are a helpful translator. Translate the user sentence to French."),
    ("human", "I love programming."),
]

result = chat.invoke(messages)

In [11]:
result.content

"J'aime programmer."

### Hugging Face Endpoints

In [12]:
from langchain_huggingface import HuggingFaceEndpoint

In [13]:
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

In [14]:
question = "Who won the FIFA World Cup in the year 1994? "

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)
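
For a template with simple `{name}` placeholders like this one, the text sent to the model should match what Python's built-in `str.format` produces. A minimal standard-library sketch of the rendered prompt (an illustration of the output, not LangChain's implementation; the `render` helper is hypothetical):

```python
template = """Question: {question}

Answer: Let's think step by step."""


def render(template, **variables):
    """Fill the {placeholders} in a template with the given values."""
    return template.format(**variables)


rendered = render(template, question="Who won the FIFA World Cup in the year 1994? ")
print(rendered)
```

Printing the rendered prompt this way is a quick check that the placeholder names in the template match the keys passed at invocation time.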

In [18]:
from langchain_core.output_parsers import StrOutputParser
repo_id = "mistralai/Mistral-7B-Instruct-v0.2"

llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    max_length=128,
    temperature=0.5,
)
llm_chain = prompt | llm | StrOutputParser()
print(llm_chain.invoke({"question": question}))

                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.


 The FIFA World Cup is an international football tournament. It takes place every four years. In 1994, the tournament was held in the United States of America. The winning team was Brazil. They defeated Italy in the final match, which took place on July 17, 1994. So, the answer is Brazil won the FIFA World Cup in 1994.


### HuggingFaceEmbeddings

HuggingFace sentence_transformers embedding models.

To use, you should have the sentence_transformers python package installed.

In [24]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
hfe = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [22]:
text = "This is a test document."


In [25]:
query_result = hfe.embed_query(text)

In [26]:
type(query_result)

list

In [27]:
query_result[:3]

[-0.048951808363199234, -0.039862047880887985, -0.021562788635492325]
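
With `normalize_embeddings` set to `False`, the returned vectors may not have unit length, depending on the model, so cosine similarity is a safer way to compare two texts than a plain dot product. A minimal pure-Python sketch, assuming the embeddings are plain lists of floats as returned above (the helper names are hypothetical):

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here: similarity with a zero vector is 0.
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity(hfe.embed_query(a), hfe.embed_query(b))` would score how close two texts `a` and `b` are.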

### HuggingFaceEndpointEmbeddings

HuggingFaceHub embedding models.

To use, you should have the huggingface_hub python package installed, and the environment variable HUGGINGFACEHUB_API_TOKEN set with your API token, or pass it as a named parameter to the constructor.

In [29]:
from langchain_huggingface import HuggingFaceEndpointEmbeddings
model = "sentence-transformers/all-mpnet-base-v2"
hfee = HuggingFaceEndpointEmbeddings(
    model=model,
    task="feature-extraction"
)

In [32]:
text = "This is a test document."
query_result = hfee.embed_query(text)

In [34]:
query_result[:3]

[-0.048951830714941025, -0.03986202925443649, -0.021562786772847176]
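
The first three values from the endpoint are almost identical to those from the local model above, differing only around the eighth decimal place. A small sketch that makes the comparison explicit, using the printed values and an absolute tolerance of 1e-6 (the tolerance and the helper name `vectors_close` are assumptions for illustration):

```python
import math

# First three components printed by the local model and by the endpoint.
local_head = [-0.048951808363199234, -0.039862047880887985, -0.021562788635492325]
endpoint_head = [-0.048951830714941025, -0.03986202925443649, -0.021562786772847176]


def vectors_close(a, b, abs_tol=1e-6):
    """True if both vectors have the same length and agree element-wise."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b)
    )


print(vectors_close(local_head, endpoint_head))  # True
```

Small differences like these are expected when the same model runs on different hardware or numeric backends.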

### Hugging Face Hub

In [38]:
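from langchain_core.prompts import PromptTemplate
# HuggingFaceHub is deprecated in favor of HuggingFaceEndpoint; it now lives in langchain_community.
from langchain_community.llms import HuggingFaceHub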

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

In [40]:
gllm=HuggingFaceHub(repo_id="google/flan-t5-large",  model_kwargs={"temperature":0,  "max_length":100})

In [41]:
gllm_chain = prompt | gllm | StrOutputParser()
print(gllm_chain.invoke({"question": question}))

The 1994 FIFA World Cup was won by Brazil. Brazil won the 1994 FIFA World Cup. So the answer is Brazil.


### Another Example with distilbert/distilgpt2

In [46]:

model = HuggingFaceHub(
    repo_id="distilbert/distilgpt2",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)

In [47]:
gptllm_chain = prompt | model
question = "What is inside area 51?"

print(gptllm_chain.invoke(question))

Question: What is inside area 51?

Answer: Let's think step by step. The first thing you need to do is look at the map and see if there are any areas that have been identified in the map. If you're looking for a place where you can find your own, then it's important to look at the map and make sure that you don't miss any of the places. You should also be aware that there are no other areas on the map. This is an area that has been identified as the "home" of the player. It's not a home. There are only two locations on the map.
The second thing you need to do is look at the map and see if there are any areas that have been identified in the map. If you're looking for a place where you can find your own, then it's important to look at the map and make sure that you don't miss any of the places. You should also be aware that there are no other areas on the map. This is an area that has been identified as the "home" of the player. It's not a home. There are only two locations on the map.


### Downloading Models Locally
You can download and use any open-source, unrestricted Hugging Face model if you have sufficient VRAM and storage. There are two main approaches:

- Pipeline Approach 
- Hugging Face Auto Classes

The choice depends on your requirements, resources, and the level of customization needed.
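
To judge whether a model fits in VRAM, a common rule of thumb is parameter count times bytes per parameter. The sketch below covers the weights only; activations and any generation cache need extra memory on top. The helper names and the 1-billion-parameter figure are hypothetical, chosen only for illustration.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_bytes(num_params, dtype="float32"):
    """Rough size of the model weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


def to_gib(num_bytes):
    """Convert bytes to gibibytes."""
    return num_bytes / 1024**3


# Hypothetical 1-billion-parameter model, for illustration only.
for dtype in ("float32", "float16", "int8"):
    size = to_gib(estimate_weight_bytes(1_000_000_000, dtype))
    print(f"{dtype}: about {size:.2f} GiB for weights")
```

Treat the result as a lower bound and leave headroom for inference overhead.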

**Pipeline Option**

In [51]:
from langchain_huggingface import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM


pipe = pipeline("text-generation", model="distilbert/distilgpt2" , device=3,   max_length=100)

local_llm = HuggingFacePipeline(pipeline=pipe)

Device set to use cuda:3


In [56]:
gptllmhf_chain = prompt | local_llm
question = "What is the capital of Egypt?"

print(gptllmhf_chain.invoke(question, truncation = True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question: What is the capital of Egypt?

Answer: Let's think step by step.
A few things. The state of Libya has made an extraordinary contribution, because Egypt's people are still struggling in financial crisis and political upheaval, because they are not very happy in their own sense of self, or at all.
So, the question is how do we explain its state collapse, or how can we explain it when we put together some solutions?
One way is, Egypt's


**Auto Classes**

In [71]:
import torch  # needed for the direct-model example at the end of this cell
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline

# 1. Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# 2. Create a pipeline
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0,
    device = 3,
)

# 3. Create a HuggingFacePipeline instance
acllm = HuggingFacePipeline(pipeline=pipe)

# 4. Create a prompt template
prompt = PromptTemplate.from_template("{question}")

# 5. Create the chain
chain = prompt | acllm

# 6. Run the chain
question = "What is the capital of Egypt?"
result = chain.invoke({"question": question})
print(result)

# If you want to use the model directly without LangChain:
input_ids = tokenizer(question, return_tensors="pt").input_ids
device = "cuda:3" if torch.cuda.is_available() else "cpu"
input_ids = input_ids.to(device)

outputs = model.generate(input_ids)
direct_result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nDirect model output:", direct_result)

Device set to use cuda:3


cairo

Direct model output: cairo


In [67]:
from langchain_huggingface import HuggingFacePipeline

llm_hf = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    device = 3,
    model_kwargs={"temperature": 0.7, "max_length": 256}
)


Device set to use cuda:3


In [68]:
llm_hf_chain = prompt | llm_hf
question = "What is the capital of Egypt?"

print(llm_hf_chain.invoke(question, truncation = True))

The capital of Egypt is Cairo. Cairo is the capital of Egypt. So the answer is Cairo.


### Chatbot Use Case
Few-Shot Learning Contextual Chatbot: This example demonstrates the simplest way to manage conversational context in an LLM-based chatbot.

Longer conversations can be handled in two ways:
- Truncating the conversation history, dropping its oldest portion at set stages. This approach is analogous to capping log files at a certain size with rolling logs.

- Using an LLM to summarise the conversation history as the conversation continues.

In LangChain, ConversationBufferMemory is a type of memory that collects all previous input and output text and adds it to the context passed with each message from the user.

Reference: https://cobusgreyling.medium.com/langchain-creating-large-language-model-llm-applications-via-huggingface-192423883a74
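
The buffer and truncation ideas described above can be sketched in a few lines of plain Python. `RollingChatBuffer` is a hypothetical class written for illustration; it is not LangChain's ConversationBufferMemory, and it counts whole turns rather than tokens.

```python
from collections import deque


class RollingChatBuffer:
    """Keep conversation turns and render them as context for the next prompt.

    With max_turns=None every turn is kept (buffer behaviour); with a number,
    the oldest turns are dropped first, like a rolling log.
    """

    def __init__(self, max_turns=None):
        # A deque with maxlen discards items from the left when it is full.
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, human, ai):
        self._turns.append((human, ai))

    def as_context(self):
        lines = []
        for human, ai in self._turns:
            lines.append(f"Human: {human}")
            lines.append(f"AI: {ai}")
        return "\n".join(lines)


buffer = RollingChatBuffer(max_turns=2)
buffer.add_turn("What is the capital of Egypt?", "Cairo is the capital of Egypt.")
buffer.add_turn("And of France?", "Paris.")
buffer.add_turn("And of Italy?", "Rome.")
print(buffer.as_context())  # Only the two most recent turns remain.
```

Truncating by turns is the simplest policy; a token budget or an LLM-written summary of the dropped turns are the usual refinements.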

In [74]:
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferMemory

conversation = ConversationChain(
    llm=llm_hf, 
    verbose=True, 
    memory=ConversationBufferMemory()
)

conversation.predict(input=question)





> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

H: What is the capital of Egypt?
AI:

> Finished chain.


'Cairo is the capital of Egypt.'