# what is this?

This notebook contains examples of running LangChain queries against open-source models locally, on a GPU, a CPU, or Apple Silicon.

In [1]:
from langchain import HuggingFacePipeline, HuggingFaceHub, LlamaCpp
from langchain import PromptTemplate, LLMChain
from dotenv import dotenv_values

# the models
First we need to get a model to work as a backend to langchain.

### Downloading models
There are two options for downloading models to run them locally. The section "huggingface: if you have GPU!" below describes automatically downloading the weights to a cache folder. However, for the Vicuna model and for the llama-cpp backend described here, you need to convert the weights, and for that you may have to download them manually. They can be downloaded from Hugging Face by cloning the repos, but you need to install the Git LFS extension first:

```
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/huggyllama/llama-13b  # for example
```
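If `git lfs install` was skipped, the clone still succeeds but the large files are left as small text stubs (Git LFS pointer files) instead of real weights. A minimal sketch for spotting that, assuming the weights use `*.bin` or `*.safetensors` file names; the helper names `is_lfs_pointer` and `find_pointer_stubs` are made up for this example:

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer stub, not real data."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_pointer_stubs(repo_dir, patterns=("*.bin", "*.safetensors")):
    """List weight files under repo_dir that are still LFS pointer stubs."""
    repo = Path(repo_dir)
    return sorted(
        path
        for pattern in patterns
        for path in repo.rglob(pattern)
        if is_lfs_pointer(path)
    )
```

Calling `find_pointer_stubs("llama-13b")` on the cloned folder should return an empty list when the weights were actually downloaded; running `git lfs pull` inside the repo fetches anything that is still a stub.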


### LLaMA
The Meta (Facebook) model whose weights leaked. It is available unofficially (and is probably not licensed for production use) here:
https://huggingface.co/huggyllama/llama-13b/tree/main

### Vicuna
"an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT"
This model depends on the LLaMA weights: the repo below holds delta weights, and a script applies them to the original LLaMA weights to produce the final model.
https://huggingface.co/lmsys/vicuna-13b-delta-v1.1

### Dolly
"the World's First Truly Open Instruction-Tuned LLM"
Available directly from huggingface: 
- https://huggingface.co/databricks/dolly-v2-3b
- https://huggingface.co/databricks/dolly-v1-6b
- https://huggingface.co/databricks/dolly-v2-7b
- https://huggingface.co/databricks/dolly-v2-12b

# loading the model
We will load a model into the `llm` variable

## For running on CPU/Apple: llama-cpp
llama-cpp (llama.cpp) runs Meta's LLaMA models and similar on CPUs and Apple Silicon, after the weights have been converted to its own format. The converted model binaries have names like `ggml-model.*.bin`. This also works for derived models such as Vicuna.

This GitHub repo has instructions for converting:
https://github.com/ggerganov/llama.cpp
The conversion only takes a few minutes.

In [2]:
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
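To see the exact text the model receives, the same template can be rendered with plain `str.format`. This is a sketch that assumes the template uses the default f-string-style placeholders, so it should mirror what `prompt.format(question=...)` produces:

```python
template = """Question: {question}

Answer: Let's think step by step."""

# Fill the single {question} placeholder by hand.
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
rendered = template.format(question=question)
print(rendered)
```

The model continues from the trailing "Let's think step by step.", which is why the answers below start mid-reasoning.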

In [3]:
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# ggml_model_path = '../llama-13b/ggml-model-f16.bin'
ggml_model_path = '../vicuna-13b/ggml-model-f16.bin'

llm = LlamaCpp(
    model_path=ggml_model_path,
    callback_manager=callback_manager,
    verbose=True,  # verbose is required to pass output to the callback manager
)

llama.cpp: loading model from ../vicuna-13b/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  85.08 KB
llama_model_load_internal: mem required  = 26874.67 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
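The numbers in the load log can be sanity-checked with a little arithmetic. The sketch below assumes the standard LLaMA layer layout (four attention projections, three feed-forward projections, two norm vectors per layer, untied input and output embeddings) and that the log's "MB" means MiB; the function names are made up for this example:

```python
def llama_param_count(n_vocab, n_embd, n_layer, n_ff):
    """Parameter count for a LLaMA-style transformer (assumed layout)."""
    attention = 4 * n_embd * n_embd      # q, k, v and output projections
    feed_forward = 3 * n_embd * n_ff     # gate, up and down projections
    norms = 2 * n_embd                   # two norm weight vectors per layer
    per_layer = attention + feed_forward + norms
    embeddings = 2 * n_vocab * n_embd    # input embedding + output head
    return n_layer * per_layer + embeddings + n_embd  # + final norm


def kv_cache_bytes(n_layer, n_ctx, n_embd, bytes_per_value=2):
    """Size of the key and value caches for a full context, stored as F16."""
    return 2 * n_layer * n_ctx * n_embd * bytes_per_value


MIB = 1024 ** 2

# Values taken from the log above (n_vocab, n_embd, n_layer, n_ff, n_ctx).
params = llama_param_count(n_vocab=32000, n_embd=5120, n_layer=40, n_ff=13824)
weights_mib = params * 2 / MIB           # 2 bytes per weight for F16
kv_mib = kv_cache_bytes(n_layer=40, n_ctx=512, n_embd=5120) / MIB

print(f"{params:,} parameters")
print(f"{weights_mib:,.0f} MiB of F16 weights")
print(f"{kv_mib:.0f} MiB KV cache")
```

The KV cache comes out at 400 MiB, matching the `kv self size` line. The weights alone come to roughly 24,800 MiB, so most of the reported `mem required` is the F16 weights; the remainder is presumably working buffers.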


### a small example using Vicuna-13b

In [4]:
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [5]:
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)

 We know that the Super Bowl was played on February 1, 2015. And we know that Justin Bieber was born on March 1, 1994. So, if a team won the Super Bowl in the year Justin Bieber was born, it would have been the team that won the Super Bowl in either 1993 or 1994.

There were two teams that won the Super Bowl in those years: the Dallas Cowboys and the San Francisco 49ers. The Cowboys won Super Bowl XXVII on January 31, 1993, which would have been the year before Justin Bieber was born. And the 49ers won Super Bowl XXIX on January 29, 1995, which would have been after Justin Bieber's birthday.

So, to summarize: there were no NFL teams that won the Super Bowl in the year Justin Bieber was born.


# huggingface: if you have GPU!
This is the easiest way to get running if you have GPU access. This will download the weights automatically from the Hugging Face Hub.
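The automatically downloaded weights land in a local cache folder. To my knowledge the default is `~/.cache/huggingface/hub`, with one folder per model named `models--<org>--<name>`, and the `HF_HOME` and `HF_HUB_CACHE` environment variables override the location. A small sketch under those assumptions (the helper `expected_cache_dir` is made up for this example) for finding where a model should end up:

```python
import os
from pathlib import Path


def expected_cache_dir(model_id, env=None):
    """Guess the cache folder for a model id, assuming the default hub layout."""
    env = os.environ if env is None else env
    hf_home = Path(env.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
    hub_cache = Path(env.get("HF_HUB_CACHE", hf_home / "hub"))
    # "databricks/dolly-v2-3b" -> "models--databricks--dolly-v2-3b"
    return hub_cache / ("models--" + model_id.replace("/", "--"))


print(expected_cache_dir("databricks/dolly-v2-3b"))
```

This is handy for checking disk usage before loading one of the larger models, or for deleting a model that is no longer needed.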

In [4]:
# huggingface_model_id = 'NbAiLab/nb-alpaca-lora-7b'
# huggingface_model_id = 'NbAiLab/nb-gpt-j-6B-alpaca'
# huggingface_model_id = "databricks/dolly-v2-3b"
# huggingface_model_id = "databricks/dolly-v2-12b"
# huggingface_model_id = 'decapoda-research/llama-7b-hf'
huggingface_model_id = 'NbAiLab/nb-gpt-j-6B-norpaca'

llm = HuggingFacePipeline.from_model_id(model_id=huggingface_model_id, task='text-generation',
                                        model_kwargs={"temperature": 0.4, "max_length": 2048},
                                        device=0,
                                        )



In [5]:
# the `max_length` keyword argument does not appear to apply to all models; for the NB-GPT-J model and more, we can hack it thus:
llm.pipeline.model.config.max_length=2048

In [6]:
import os
# Keep the SerpAPI key out of the notebook: read it from a local .env file (assumed to define SERPAPI_API_KEY)
os.environ['SERPAPI_API_KEY'] = dotenv_values('.env')['SERPAPI_API_KEY']
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

In [None]:
!nvidia-smi

In [None]:
#os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python' # I had to set this to load the Vicuna model; experiences may vary
# llm = HuggingFacePipeline.from_model_id(model_id='../vicuna-13b/', task='text-generation',
#                                         model_kwargs={"temperature": 0.4, "max_length": 2048},
#                                         device=0,
#                                         )

# simple search thing
Now that the `llm` variable holds a loaded model, we can define an agent and run a query.

In [8]:
from langchain.agents import load_tools
tools = load_tools(["serpapi", "llm-math"], llm=llm)

In [9]:
from langchain.agents import AgentType, initialize_agent
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)

### Trying to use a (machine translated) Norwegian template in a Norwegian model

In [31]:
new_template = "Svar på følgende spørsmål så godt du kan. Du har tilgang til følgende verktøy:\n\nSearch: En søkemotor. Nyttig når du trenger å svare på spørsmål om aktuelle hendelser. Inndata skal være et søk.\nCalculator: Nyttig når du skal svare på spørsmål om matematikk.\n\nBruk følgende format:\n\nSpørsmål: inndataspørsmålet du må svare på\nTanke: du bør alltid tenke på hva du skal gjøre\nHandling: handlingen som skal utføres, bør være en av [Search, Calculator]\nHandlingsinngang: input til handlingen\nObservasjon: resultatet av handlingen\n... (denne tanken/handlingen/handlingsinputten/observasjonen kan gjentas N ganger)\nTanke: Jeg vet nå det endelige svaret\nEndelig svar: det endelige svaret på det opprinnelige spørsmålet\n\nBegynn!\n\nSpørsmål: {input}\nTanke: {agent_scratchpad}"

In [32]:
# probably not the right way to do this
agent.agent.llm_chain.prompt.template = new_template

In [34]:
agent.run("Hva skjedde dagen Justin Bieber ble født?")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

> Entering new AgentExecutor chain...

## Conclusion?
poor result for the Norwegian model in this case
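
One possible contributor, stated as an assumption rather than a diagnosis: ReAct-style agents typically parse the model's output by looking for fixed English keywords such as `Action:`, `Action Input:` and `Final Answer:`. The translated template asks the model to write `Handling:`, `Handlingsinngang:` and `Endelig svar:` instead, which an English-keyword parser would not recognize. The sketch below uses a simplified, hypothetical parser (`parse_step`, not LangChain's own code) to illustrate the mismatch:

```python
import re


def parse_step(text, action_kw="Action", input_kw="Action Input",
               final_kw="Final Answer"):
    """Simplified stand-in for a ReAct output parser (not LangChain's code).

    Returns ("final", answer), ("action", tool, tool_input), or None
    when none of the expected keywords are found.
    """
    if f"{final_kw}:" in text:
        return ("final", text.split(f"{final_kw}:", 1)[1].strip())
    pattern = (
        rf"{re.escape(action_kw)}\s*:\s*(.*?)\s*\n"
        rf"\s*{re.escape(input_kw)}\s*:\s*(.*)"
    )
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return None
    return ("action", match.group(1).strip(), match.group(2).strip())


# Output in the shape the Norwegian template asks for (made-up example text).
norwegian_output = (
    "Tanke: jeg bør søke\n"
    "Handling: Search\n"
    "Handlingsinngang: Justin Bieber fødselsdato"
)

# English keywords: nothing is recognized.
print(parse_step(norwegian_output))
# Norwegian keywords: the tool call is found.
print(parse_step(norwegian_output, "Handling", "Handlingsinngang", "Endelig svar"))
```

If this is what happened, the options would be to keep the format keywords in English while translating only the instructions, or to give the agent an output parser that matches the translated keywords.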

In [7]:
from langchain import Wikipedia
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.agents.react.base import DocstoreExplorer
from langchain.memory import ConversationBufferMemory
from langchain import LLMMathChain

from langchain import SerpAPIWrapper

docstore = DocstoreExplorer(Wikipedia())


tools = [
    Tool(
        name="Search",
        func=docstore.search,
        description="useful for when you need to ask with search",
    ),
    Tool(
        name="Lookup",
        func=docstore.lookup,
        description="useful for when you need to ask with lookup",
    ),
]

memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True, output_key="output"
)

agent = initialize_agent(
    tools,
    llm,
    # agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    agent=AgentType.REACT_DOCSTORE,
    verbose=True,
    memory=memory,
    return_intermediate_steps=True, # This is needed for the evaluation later
    
)

In [8]:
query_one = "Thought: What is a table used for?"

test_outputs_one = agent({"input": query_one}, return_only_outputs=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

> Entering new AgentExecutor chain...

# try this?

In [None]:
from langchain import OpenAI, Wikipedia
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.agents.react.base import DocstoreExplorer
docstore=DocstoreExplorer(Wikipedia())
tools = [
    Tool(
        name="Search",
        func=docstore.search,
        description="useful for when you need to ask with search"
    ),
    Tool(
        name="Lookup",
        func=docstore.lookup,
        description="useful for when you need to ask with lookup"
    )
]

# llm = OpenAI(temperature=0, model_name="text-davinci-002")
react = initialize_agent(tools, llm, agent=AgentType.REACT_DOCSTORE, verbose=True)

In [None]:
llm

In [None]:
react.run('Who is the tallest man in the world?')