## Hugging Face Local Pipelines

Hugging Face models can be run locally through the HuggingFacePipeline class.

The Hugging Face Model Hub is an online platform that hosts over 120k models, 20k datasets, and 50k demo apps (Spaces), all open source and publicly available, where people can collaborate and build ML together.

These can be called from LangChain either through this local pipeline wrapper or by calling their hosted inference endpoints through the HuggingFaceHub class. For more information on the hosted pipelines, see the HuggingFaceHub notebook.

To use this wrapper, install the `transformers` Python package along with PyTorch. You can also install `xformers` for a more memory-efficient attention implementation.
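
Before running the cells below, it may help to confirm which of these packages are importable in the current environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is illustrative, and the import names (`transformers`, `torch`, `xformers`) are assumed to match the installed distributions.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names for the packages mentioned above.
required = ["transformers", "torch"]
optional = ["xformers"]

print("Missing required packages:", missing_packages(required))
print("Missing optional packages:", missing_packages(optional))
```

An empty list for the required packages suggests the install cells can be skipped.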

In [1]:
%pip install --upgrade --quiet transformers

In [2]:
%pip install langchain


In [3]:
%pip install langchain_community



### Model Loading

Models can be loaded by specifying the model parameters using the `from_model_id` method.



In [4]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

hf = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 10},
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





**Models can also be loaded by passing in an existing `transformers` pipeline directly.**

In [5]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=100)
hf = HuggingFacePipeline(pipeline=pipe)

### Create Chain

With the model loaded into memory, you can compose it with a prompt to form a chain.

In [6]:
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

chain = prompt | hf

question = "What is machine learning?"

print(chain.invoke({"question": question}))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 First, we define a list of "things" that can help us solve an problem. If we take a list of all these things, what are they? This means, we have a list that can help us sort our data with information from the data:

What are all these things, and why should we care?

What are they worth?

How well can we solve an problem?

How hard can we solve an issue?

What about these?
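
To see what text the model receives, the template can be filled in by hand. The sketch below uses plain `str.format` as a stand-in, on the assumption that `PromptTemplate` with its default format substitutes `{question}` in the same way. It does not call LangChain or the model.

```python
template = """Question: {question}

Answer: Let's think step by step."""

question = "What is machine learning?"

# Stand-in for the prompt step of the chain: fill the {question} placeholder.
rendered = template.format(question=question)
print(rendered)
```

The model then continues from the end of this text, which is why the printed output above picks up after "Let's think step by step."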



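### GPU Inference

Passing `device_map="auto"` to `from_model_id` lets the Accelerate library decide where to place the model weights, for example on a GPU when one is available. This option requires the `accelerate` package to be installed.

In [9]: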
gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device_map="auto",
    pipeline_kwargs={"max_new_tokens": 30},
)

gpu_chain = prompt | gpu_llm

question = "What is google?"

print(gpu_chain.invoke({"question": question}))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 From basic Google search, you probably have some text you need for a recipe, then you click one, and everything changes like you type it. This
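
Because the cell above passes `device_map="auto"`, it can be useful to check whether PyTorch sees a CUDA device at all. This is a minimal sketch, not part of the original notebook. The helper name `cuda_available` is illustrative, and it returns `False` when PyTorch is not installed rather than raising an error.

```python
def cuda_available():
    """Return True if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU can be used through it.
        return False
    return bool(torch.cuda.is_available())


print("CUDA available:", cuda_available())
```

If this prints `False`, the model will most likely be placed on the CPU.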
