<div align="center">
<img src="https://camo.githubusercontent.com/473dd9f992924d27457650251786464f72e54121ac6e9210add0f483ca849277/68747470733a2f2f692e696d6775722e636f6d2f3765523750616e2e706e67" width="40%">  
</div>

# Getting started with Petals

This notebook will guide you through the basics of Petals, a system for inference and fine-tuning of 100B+ language models without the need for high-end GPUs. With Petals, you can pool compute resources with other people over the Internet and run large language models such as the 176B-parameter [BLOOM](https://huggingface.co/bigscience/bloom) or [BLOOMZ](https://huggingface.co/bigscience/bloomz), which are comparable in size to GPT-3.

💬 If you run into any issues while running this notebook, let us know in the **[#running-a-client](https://discord.gg/J29mCBNBvm)** channel of our Discord!

So, let's get started! First, let's install [the Petals package](https://github.com/bigscience-workshop/petals) along with the other libraries this notebook uses:


In [3]:
pip install chromadb gdown transformers torch

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Note: you may need to restart the kernel to use updated packages.


In [4]:
!pip install -q typer==0.9.0 petals langchain "unstructured[local-inference]" tiktoken

## Initialize the database

Download the documents

In [6]:
from pathlib import Path
from pydantic import BaseModel

In [7]:
import torch
from transformers import BloomTokenizerFast 
from petals import DistributedBloomForCausalLM
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import GPT4All
import os
# from langchain.callbacks.base import CallbackManager
# from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import AnalyzeDocumentChain
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# MODEL_NAME = "bigscience/bloom-petals"
# tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
# model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
# model = model.cuda()

May 14 08:28:42.524 [INFO] NumExpr defaulting to 4 threads.


In [8]:
#from google.colab import drive
#drive.mount('/content/drive')

In [9]:
import gdown
from pathlib import Path
# Create the docs folder next to the notebook; do nothing if it already exists
os.makedirs(f'{Path().absolute()}/docs', exist_ok=True)
url_1 = 'https://drive.google.com/u/0/uc?id=1fV9d8xFwqzafc_FUFAYUc2gopTQVR0eY&export=download&confirm=t&uuid=4574d808-b5a6-4893-b1eb-289adeb9dabe&at=AKKF8vzF4ZE7mAt-_jfxk1sRzwU3:1684027565236'
url_2 = 'https://drive.google.com/u/0/uc?id=1-64PFjerutODq6V2GqqDFkBiUj_5E8v2&export=download&confirm=t&uuid=50e639c5-167a-4b16-8ed3-e67c48a4daad&at=AKKF8vw6K5unCUwJgr5tjEUe-ObZ:1684027611825'
output_1 = f'{Path().absolute()}/docs/Guidelines for the clinical diagnosis and treatment of dengue chịkugunya and zika.pdf'
output_2 = f'{Path().absolute()}/docs/483.full.pdf'
gdown.download(url_1, output_1, quiet=False)
gdown.download(url_2, output_2, quiet=False)

Downloading...
From: https://drive.google.com/u/0/uc?id=1fV9d8xFwqzafc_FUFAYUc2gopTQVR0eY&export=download&confirm=t&uuid=4574d808-b5a6-4893-b1eb-289adeb9dabe&at=AKKF8vzF4ZE7mAt-_jfxk1sRzwU3:1684027565236
To: /home/ec2-user/SageMaker/docs/Guidelines for the clinical diagnosis and treatment of dengue chịkugunya and zika.pdf
100%|██████████| 2.01M/2.01M [00:00<00:00, 37.6MB/s]
Downloading...
From: https://drive.google.com/u/0/uc?id=1-64PFjerutODq6V2GqqDFkBiUj_5E8v2&export=download&confirm=t&uuid=50e639c5-167a-4b16-8ed3-e67c48a4daad&at=AKKF8vw6K5unCUwJgr5tjEUe-ObZ:1684027611825
To: /home/ec2-user/SageMaker/docs/483.full.pdf
100%|██████████| 326k/326k [00:00<00:00, 32.2MB/s]


'/home/ec2-user/SageMaker/docs/483.full.pdf'

In [10]:
chroma_path = f"{Path().absolute()}/db_2"

def extract_helpful_answer(text):
    helpful_answer = None
    # Split the string into lines
    lines = text.split("\n")
    # Loop through each line
    for line in lines:
        # Check if the line starts with "Helpful Answer:"
        if line.startswith("Helpful Answer:"):
            # Extract the answer text after the colon
            helpful_answer = line.split(":", 1)[1].strip()
            break  # Exit the loop after finding the answer
    return helpful_answer

def split_document(input_path):
    loader = UnstructuredPDFLoader(input_path)
    data = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
    return text_splitter.split_documents(data)

def db_init():
    docs = split_document(output_1) # replace with your desired local file path
    docs2 = split_document(output_2) # replace with your desired local file path
    vectordb = Chroma.from_documents(documents=docs, persist_directory=chroma_path)
    vectordb.add_documents(docs2)
    vectordb.persist()
    vectordb = None
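
To make the `chunk_size` and `chunk_overlap` arguments above concrete, here is a minimal sketch of fixed-size character chunking. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which the hypothetical `chunk_text` helper below does not attempt.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper for illustration only; consecutive windows share
    chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "x" * 1200
chunks = chunk_text(sample, chunk_size=500, chunk_overlap=0)
print([len(c) for c in chunks])  # [500, 500, 200]
```

With `chunk_overlap=0`, as in this notebook, the chunks do not share any text, so a sentence that straddles a boundary is split between two chunks.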

In [11]:
db_init()

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
May 14 08:29:17.293 [INFO] Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
May 14 08:29:17.295 [INFO] Running Chroma using direct local API.
May 14 08:29:17.358 [WARN] [chromadb.get_db:43] Using embedded DuckDB with persistence: data will be stored in: /home/ec2-user/SageMaker/db_2
May 14 08:29:17.363 [INFO] Successfully imported ClickHouse Connect C data optimizations
May 14 08:29:17.363 [INFO] Successfully import ClickHouse Connect C/Numpy optimizations
May 14 08:29:17.419 [INFO] Using ujson library for writing JSON byte strings
May 14 08:29:17.452 [INFO] N

Downloading (…)e9125/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)7e55de9125/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)55de9125/config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)125/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)e9125/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading (…)9125/train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading (…)7e55de9125/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)5de9125/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

May 14 08:29:19.895 [INFO] Use pytorch device: cuda


Batches:   0%|          | 0/38 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

May 14 08:30:20.661 [INFO] Persisting DB to disk, putting it in the save folder: /home/ec2-user/SageMaker/db_2


In [12]:
class Message(BaseModel):
    message: str
    sender: str

## Custom LangChain wrapper

In [13]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Sun May 14 08:30:20 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   29C    P0    85W / 300W |   1491MiB / 23028MiB |      0%      Default |
|                               |            

In [31]:
"""LangChain LLM wrapper around a Petals distributed BLOOM model."""
import importlib.util
import logging
from typing import Any, List, Mapping, Optional

from pydantic import Extra

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from langchain.llms.utils import enforce_stop_tokens

DEFAULT_MODEL_ID = "gpt2"
DEFAULT_TASK = "text-generation"
VALID_TASKS = ("text2text-generation", "text-generation")

logger = logging.getLogger(__name__)


class CustomPipeline(LLM):
    """Wrapper around a Petals distributed BLOOM model.

    To use, you should have the ``transformers`` and ``petals`` python
    packages installed.

    Example using from_model_id:
        .. code-block:: python

            llm = CustomPipeline.from_model_id(
                model_id="bigscience/bloomz-petals", task="text-generation"
            )
    """
    tokenizer: Any  
    model: Any  #: :meta private:
    model_id: str = DEFAULT_MODEL_ID
    """Model name to use."""
    model_kwargs: Optional[dict] = None
    """Key word arguments to pass to the model."""

    class Config:
        """Configuration for this pydantic object."""

        extra = Extra.forbid

    @classmethod
    def from_model_id(
        cls,
        model_id: str,
        task: str,
        device: int = -1,
        model_kwargs: Optional[dict] = None,
        **kwargs: Any,
    ) -> LLM:
        """Construct the pipeline object from model_id and task."""
        try:
            from transformers import BloomTokenizerFast
            from petals import DistributedBloomForCausalLM
        except ImportError:
            raise ValueError(
                "Could not import the transformers or petals python packages. "
                "Please install them with `pip install transformers petals`."
            )

        _model_kwargs = model_kwargs or {}
        tokenizer = BloomTokenizerFast.from_pretrained(model_id)

        try:
            model = DistributedBloomForCausalLM.from_pretrained(model_id)
        except ImportError as e:
            raise ValueError(
                f"Could not load the {task} model due to missing dependencies."
            ) from e

        if importlib.util.find_spec("torch") is not None:
            import torch

            cuda_device_count = torch.cuda.device_count()
            if device < -1 or (device >= cuda_device_count):
                raise ValueError(
                    f"Got device=={device}, "
                    f"device is required to be within [-1, {cuda_device_count})"
                )
            if device < 0 and cuda_device_count > 0:
                model = model.cuda()
                logger.warning(
                    "Device has %d GPUs available. "
                    "Provide device={deviceId} to `from_model_id` to use available"
                    "GPUs for execution. deviceId is -1 (default) for CPU and "
                    "can be a positive integer associated with CUDA device id.",
                    cuda_device_count,
                )
        if "trust_remote_code" in _model_kwargs:
            _model_kwargs = {
                k: v for k, v in _model_kwargs.items() if k != "trust_remote_code"
            }

        return cls(
            model = model,
            tokenizer = tokenizer,
            model_id=model_id,
            model_kwargs=_model_kwargs,
            **kwargs,
        )

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {
            **{"model_id": self.model_id},
            **{"model_kwargs": self.model_kwargs},
        }

    @property
    def _llm_type(self) -> str:
        return "custom_pipeline"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
    ) -> str:
        # Tokenize the prompt and move the token ids to the local GPU
        inputs = self.tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
        # Sample a continuation from the swarm; max_new_tokens=3 caps every
        # answer at three generated tokens
        outputs = self.model.generate(
            inputs, max_new_tokens=3, do_sample=True, top_p=0.9, temperature=0.75
        )
        # The decoded text contains the prompt followed by the generated tokens
        text = self.tokenizer.decode(outputs[0])
        if stop is not None:
            # Cut the text at the first stop sequence, if any were requested
            text = enforce_stop_tokens(text, stop)
        return text

In [32]:
llm = CustomPipeline.from_model_id(model_id="bigscience/bloomz-petals", task="text-generation")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


May 14 09:05:39.747 [WARN] [__main__.from_model_id:100] Device has 1 GPUs available. Provide device={deviceId} to `from_model_id` to use availableGPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.


## Step 1. The easiest way to generate text 🚀

Let's start with the easiest task &mdash; creating a __`DistributedBloom`__ model and using it for generating text.

This machine will download a small part of the model weights (~8 GB out of 352 GB) and rely on other computers in the network to run the rest of the model. Downloading the local part of the weights usually takes ~3 minutes.

🧑‍🏫 __Note:__ We suggest starting with the regular BLOOM, but you can also use [BLOOMZ](https://huggingface.co/bigscience/bloomz), a version of BLOOM fine-tuned to better follow human instructions in the zero-shot regime. To load it, set `MODEL_NAME = "bigscience/bloomz-petals"`.

Now, let's try to generate something by calling the __`model.generate()`__ method.

The first call to this method takes ~5 sec to connect to the Petals swarm. After that, you should expect a generation speed of 1&ndash;1.5 sec/token. If you don't have enough GPUs to host the entire model, this is much faster than other methods such as offloading (which takes at least 10&ndash;20 sec/token).
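
To put these figures in perspective, the following sketch works through the arithmetic. The sizes and per-token speeds come from the text above; the answer length of 50 tokens is a hypothetical value chosen only for illustration, and the one-time connection delay is ignored.

```python
def estimate_seconds(n_tokens, sec_per_token_range):
    """Return the (low, high) wall-clock estimate for generating n_tokens."""
    low, high = sec_per_token_range
    return n_tokens * low, n_tokens * high


# Figures quoted in the text above
LOCAL_GB, TOTAL_GB = 8, 352
PETALS_SEC_PER_TOKEN = (1.0, 1.5)
OFFLOAD_SEC_PER_TOKEN = (10.0, 20.0)

# Hypothetical answer length, chosen only for illustration
n_tokens = 50

local_share = LOCAL_GB / TOTAL_GB
petals_low, petals_high = estimate_seconds(n_tokens, PETALS_SEC_PER_TOKEN)
offload_low, offload_high = estimate_seconds(n_tokens, OFFLOAD_SEC_PER_TOKEN)

print(f"Local share of the weights: {local_share:.1%}")
print(f"Petals: {petals_low:.0f} to {petals_high:.0f} s for {n_tokens} tokens")
print(f"Offloading: {offload_low:.0f} to {offload_high:.0f} s for {n_tokens} tokens")
```

Under these assumptions the local machine holds a little over 2% of the weights, and a 50-token answer takes about a minute through the swarm compared with at least 8 minutes with offloading.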

In [33]:
qa_chain = load_qa_chain(llm, chain_type="stuff")
qa_document_chain = AnalyzeDocumentChain(combine_docs_chain=qa_chain)
vectordb = Chroma(persist_directory=chroma_path)
retriever = vectordb.as_retriever()

May 14 09:05:39.840 [INFO] Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
May 14 09:05:39.842 [INFO] Running Chroma using direct local API.
May 14 09:05:39.843 [WARN] [chromadb.get_db:43] Using embedded DuckDB with persistence: data will be stored in: /home/ec2-user/SageMaker/db_2
May 14 09:05:40.054 [INFO] loaded in 1303 embeddings
May 14 09:05:40.057 [INFO] loaded in 1 collections
May 14 09:05:40.058 [INFO] collection with name langchain already exists, returning existing collection
May 14 09:05:40.059 [WARN] [chromadb.api.models.Collection.__init__:51] No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
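
The `chain_type="stuff"` option places the retrieved text directly into a single prompt. The sketch below rebuilds that prompt by hand. The template wording is copied from the chain output printed further down in this notebook; treating it as the exact LangChain template is an assumption, and `build_stuff_prompt` is a hypothetical helper, not a LangChain function.

```python
# Template text copied from the chain output printed later in this notebook;
# treating it as the exact LangChain template is an assumption.
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(context, question):
    """Hypothetical helper: put the retrieved text and the question in one prompt."""
    return STUFF_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    context=(
        "There is considerable genetic diversity in the dengue virus family "
        "with four serotypes (Den I, II, III and IV)."
    ),
    question="How many types of dengue virus?",
)
print(prompt)
```

Because the prompt ends with `Helpful Answer:`, whatever the model generates next is read as the answer, which is why `extract_helpful_answer` looks for that prefix.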


In [34]:
def ask(question):
    # Retrieve the single most relevant chunk from the vector store
    doc = retriever.get_relevant_documents(question)[0]
    # Get the page content of the Document object
    text = doc.page_content
    # Run the "stuff" QA chain over that chunk
    res = qa_document_chain.run(input_document=text, question=question)
    print(res)
    # Keep only the text after "Helpful Answer:"
    extracted_res = extract_helpful_answer(res)
    final_res = Message(message=extracted_res, sender="Bloom")
    return final_res

In [35]:
ask('What is the type of dengue')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The dengue virus is a single stranded, positive-sense RNA virus approximately 11 kb in length. It is a member of the Flavivirus genus, which also includes yellow fever, Japanese encephalitis and West Nile virus. There is considerable genetic diversity in the dengue virus family with four serotypes (Den I, II, III and IV).

Question: What is the type of dengue
Helpful Answer: IV</s>


Message(message='IV</s>', sender='Bloom')
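
The extracted message above still carries the `</s>` end-of-sequence marker. One possible way to remove it is sketched below; `clean_answer` is a hypothetical helper that is not part of this notebook's pipeline, and it assumes `</s>` is the only marker that needs stripping.

```python
EOS_TOKEN = "</s>"  # end-of-sequence marker visible in the output above


def clean_answer(answer, eos_token=EOS_TOKEN):
    """Hypothetical helper: drop everything from the first EOS marker onward."""
    if answer is None:
        return None
    return answer.split(eos_token, 1)[0].strip()


print(clean_answer("IV</s>"))  # IV
```

Applying such a helper to the result of `extract_helpful_answer` before building the `Message` would keep the marker out of the final reply.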

In [None]:
ask('What are the four virus serotypes of dengue')

## Commonly asked questions

In [26]:
ask('What is the dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The dengue virus is a single stranded, positive-sense RNA virus approximately 11 kb in length. It is a member of the Flavivirus genus, which also includes yellow fever, Japanese encephalitis and West Nile virus. There is considerable genetic diversity in the dengue virus family with four serotypes (Den I, II, III and IV).

Question: What is the dengue virus?
Helpful Answer: a single stranded, positive-sense RNA virus approximately 11 kb in length</s>


Message(message='a single stranded, positive-sense RNA virus approximately 11 kb in length</s>', sender='Bloom')

In [28]:
ask('How many types of dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The dengue virus is a single stranded, positive-sense RNA virus approximately 11 kb in length. It is a member of the Flavivirus genus, which also includes yellow fever, Japanese encephalitis and West Nile virus. There is considerable genetic diversity in the dengue virus family with four serotypes (Den I, II, III and IV).

Question: How many types of dengue virus?
Helpful Answer: four</s>


Message(message='four</s>', sender='Bloom')

In [29]:
ask('List the diversity in the dengue virus family')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The dengue virus is a single stranded, positive-sense RNA virus approximately 11 kb in length. It is a member of the Flavivirus genus, which also includes yellow fever, Japanese encephalitis and West Nile virus. There is considerable genetic diversity in the dengue virus family with four serotypes (Den I, II, III and IV).

Question: List the diversity in the dengue virus family
Helpful Answer: four serotypes</s>


Message(message='four serotypes</s>', sender='Bloom')

==== NOT SATISFIED ====

Question: List the diversity in the dengue virus family

Helpful Answer: four serotypes</s>

==== CHANGE TO ====

Answer: Four serotypes (Den I, II, III, and IV)</s>.

In [39]:
ask('Is the dengue virus severe?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

ANNEX 5. GRADE tables: from evidence to recommendations

83

Additional considerations

The following factors were identified as NON-predictors or markers of severe dengue: High fever Positive tourniquet test Diarrhea Rhinorrhea Anorexia or hyporexia Petechiae or ecchymosis Nausea Obesity (considered as a potential risk factor and not a potential predictor) Malnutrition Rash Cough Leukopenia Retro-ocular pain Headache Myalgias or arthralgias

Additional considerations

Question: Is the dengue virus severe?
Helpful Answer: Yes</s>


Message(message='Yes</s>', sender='Bloom')

In [38]:
ask('How severity of the dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

207. Wichmann O, Gascon J, Schunk M, Puente S, Siikamaki H, Gjørup I, et al. Severe dengue virus infection in travelers: Risk factors and laboratory indicators.

Journal of Infectious Diseases 2007;195(8):1081-1083. Available from: https://doi.org/10.1086/512680.

Question: How severity of the dengue virus?
Helpful Answer: severe</s>


Message(message='severe</s>', sender='Bloom')

==== NOT SATISFIED ====

Question: How severity of the dengue virus?

Helpful Answer: severe</s>

==== CHANGE TO ====

Answer: Must describe how dangerous the virus is</s>.

In [40]:
ask('Is having fever as a symptom of infecting dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Clinical features

Dengue fever is a mild, self-limited febrile episode, commonly associated with a rash. It usually begins with fever, respiratory symptoms, anorexia, nausea, vomiting and headache. Back pain, myal- gias, arthralgias and conjunctivitis may also occur. The initial fever usually resolves within one week, and a few days later a generalised morbilliform or mac- ulopapular rash may develop. Fever may return with the rash (Figs 1 and 2).

Question: Is having fever as a symptom of infecting dengue virus?
Helpful Answer: Yes</s>


Message(message='Yes</s>', sender='Bloom')

In [41]:
ask('What are symptoms of being infected by dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Clinical features

Dengue fever is a mild, self-limited febrile episode, commonly associated with a rash. It usually begins with fever, respiratory symptoms, anorexia, nausea, vomiting and headache. Back pain, myal- gias, arthralgias and conjunctivitis may also occur. The initial fever usually resolves within one week, and a few days later a generalised morbilliform or mac- ulopapular rash may develop. Fever may return with the rash (Figs 1 and 2).

Question: What are symptoms of being infected by dengue virus?
Helpful Answer: fever, respiratory


Message(message='fever, respiratory', sender='Bloom')

In [42]:
ask('Is having respiratory considered as being infected by dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

infections. Journal of Medical Virology 2005;76(4):547-552. Available from: https://doi.org/10.1002/jmv.20397.

111. Malavige GN, Velathanthiri VG, Wijewickrama ES, Fernando S, Jayaratne SD, Aaskov J, et al. Patterns of disease among adults hospitalized with dengue

infections. QJM: An International Journal of Medicine 2006;99(5):299-305. Available from: https://doi.org/10.1093/qjmed/hcl039.

Question: Is having respiratory considered as being infected by dengue virus?
Helpful Answer: Yes</s>


Message(message='Yes</s>', sender='Bloom')

In [44]:
ask('Is having considered as being infected by dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Dengue is the most widely distributed mosquito-borne viral infection of humans, affecting an estimated 100 million people worldwide each year, with 40% (2.5 billion) of the world’s population estimated to be at risk for infection

Dengue should be considered in any patient with fever, particularly if there is a recent travel history to endemic regions

Dengue severity exists as a continuous spectrum of dengue fever through to severe dengue

Question: Is having considered as being infected by dengue virus?
Helpful Answer: Yes</s>


Message(message='Yes</s>', sender='Bloom')

==== NOT SATISFIED ====

The question names no symptom, yet the model still answers that the patient is infected by the dengue virus.

In [45]:
ask('How long does it take to recover when got dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

in other regions where dengue has been endemic for decades.

Question: How long does it take to recover when got dengue virus?
Helpful Answer: usually 7 days


Message(message='usually 7 days', sender='Bloom')

In [50]:
ask('What is the first thing patient should do as infected by dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

and emergency medicine physicians, among others), and nursing personnel, as well as medical and nursing

students, who are involved in caring for patients with suspected dengue, chikungunya, or Zika in one capacity

or another. These guidelines are also addressed to health unit managers and heads of national arboviral

disease prevention and control programs, who are responsible for facilitating the process to implement the

recommendations laid out in this publication.

2

Question: What is the first thing patient should do as infected by dengue virus?
Helpful Answer: Seek medical attention


Message(message='Seek medical attention', sender='Bloom')

In [51]:
ask('Why patient need to seek medical attention as infected by dengue virus?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

– Other factors that may determine the need for the hospitalization of dengue patients include the presence of

comorbidities other than those described above, the extremes of life, and social or environmental conditions.

The decision to hospitalize patients with the aforementioned conditions should be individualized.

– In situations in which hospital capacity is exceeded (for example, an epidemic), dengue patients without

Question: Why patient need to seek medical attention as infected by dengue virus?
Helpful Answer: in situations in


Message(message='in situations in', sender='Bloom')

==== NOT SATISFIED ====

The answer should explain why the patient needs to seek medical attention early.

In [52]:
ask('Is dengue fever contagious to others?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Dengue is the most widely distributed mosquito-borne viral infection of humans, affecting an estimated 100 million people worldwide each year, with 40% (2.5 billion) of the world’s population estimated to be at risk for infection

Dengue should be considered in any patient with fever, particularly if there is a recent travel history to endemic regions

Dengue severity exists as a continuous spectrum of dengue fever through to severe dengue

Question: Is dengue fever contagious to others?
Helpful Answer: Yes</s>


Message(message='Yes</s>', sender='Bloom')

In [53]:
ask('What happened if we expose a dengue virus infected patient?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

major risks. In summary, the body of evidence identified suggests that all alternatives evaluated would be safe

for the symptomatic management of dengue patients (see summary of findings table 7, Annex 4).

PART III. Recommendations

23

Considering the lack of reliable evidence and the absence of side effects related to the mechanism of action of

some of the drugs considered (e.g., hemorrhages and NSAIDs), the panel considered that there could be variability

Question: What happened if we expose a dengue virus infected patient?
Helpful Answer: Not enough information


Message(message='Not enough information', sender='Bloom')

==== NOT SATISFIED ====

In [20]:
stop

The `model.generate()` method runs **greedy** generation by default. However, you can choose other generation methods like **top-p/top-k sampling** or **beam search** by passing the corresponding parameters (you'll see an example in a bit). You can even implement custom generation methods (we'll cover that in **Step 5**).
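
The difference between greedy generation and top-p sampling can be sketched on a toy next-token distribution. This is an illustration of the two selection rules only, not the code that `model.generate()` runs; the probabilities and the helper names are invented for the example.

```python
import random


def greedy_pick(probs):
    """Return the single most likely token."""
    return max(probs, key=probs.get)


def top_p_pick(probs, top_p=0.9, rng=random):
    """Sample from the smallest set of top tokens whose total probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, total = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        total += p
        if total >= top_p:
            break
    tokens, weights = zip(*nucleus)
    # Sample within the nucleus, in proportion to the original probabilities
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution, invented for illustration
probs = {"four": 0.6, "IV": 0.25, "many": 0.1, "zero": 0.05}
print(greedy_pick(probs))  # four
print(top_p_pick(probs, top_p=0.8, rng=random.Random(0)))
```

Greedy selection always returns the same token for the same distribution, while top-p sampling with `top_p=0.8` can return either `four` or `IV` here and never the two least likely tokens.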

🔏 **Note:** Your data is processed by other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.

## Step 2. Chat bots and interactive generation 💬

If you'd like to talk to the model interactively, you can use the __inference session__ interface. This interface provides a simple way to print generated tokens on the fly or to make a chat bot that responds to a human's phrases.

The inference session looks for a sequence of servers that will run successive inference steps and store past attention caches. This way, you don't need to rerun previous tokens through the transformer to generate each phrase. If one of the remote servers fails, Petals will automatically find a replacement and regenerate a small part of the caches.
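
A rough count shows why keeping the attention caches matters. The sketch below compares how many token positions pass through the transformer with and without a cache. It is an idealized count under the assumption that no server fails, and the prompt and answer lengths are hypothetical.

```python
def token_passes_without_cache(prompt_len, new_tokens):
    """Every step reprocesses the prompt plus all tokens generated so far."""
    return sum(prompt_len + i for i in range(new_tokens))


def token_passes_with_cache(prompt_len, new_tokens):
    """The prompt is processed once; each later step feeds only the newest token."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)


# Hypothetical sizes: a 100-token prompt and a 20-token reply
print(token_passes_without_cache(100, 20))  # 2190
print(token_passes_with_cache(100, 20))     # 119
```

The gap grows with the length of the conversation, which is what makes a long-lived session worthwhile for a chat bot.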

Let's see how to use it to write a simple chat bot, showing tokens as soon as they are generated:

In [None]:
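# The model and tokenizer were loaded inside the LangChain wrapper above
model, tokenizer = llm.model, llm.tokenizer

with model.inference_session(max_length=512) as sess: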
    while True:
        prompt = input('Human: ')
        if prompt == "":
            break
        prefix = f"Human: {prompt}\nFriendly AI:"
        prefix = tokenizer(prefix, return_tensors="pt")["input_ids"].cuda()
        print("Friendly AI:", end="", flush=True)
        
        while True:
            outputs = model.generate(
                prefix, max_new_tokens=1, do_sample=True, top_p=0.9, temperature=0.75, session=sess
            )
            outputs = tokenizer.decode(outputs[0, -1:])
            print(outputs, end="", flush=True)
            if "\n" in outputs:
                break
            prefix = None  # Prefix is passed only for the 1st token of the bot's response

### 📦 Making apps that use Petals

If you develop a tool for other people, you can wrap up the code using Petals into a user-friendly web app, such as [chat.petals.ml](http://chat.petals.ml). Under the hood, this app may connect to a lightweight [HTTP endpoint](https://github.com/borzunov/petals-chat) for inference that forwards all requests to the Petals swarm.

📋 **Note:** If you build an app running BLOOM with Petals, make sure it follows BLOOM's [terms of use](https://huggingface.co/bigscience/bloom).

<div align="center">
<br>
<img src="https://i.imgur.com/p2nwiho.png" width="40%">  
</div>

## Step 3. How does it work? 🛠️

The `model` you are running is the actual BLOOM-176B, but only a part of it is loaded into your machine's GPU. Let's have a look under the hood:

In [None]:
model.transformer

As you can see, word embeddings and some other layers are regular PyTorch modules hosted on your machine, but the rest of the model (e.g., transformer blocks) is encased in the __RemoteSequential__ class. This is an advanced PyTorch module that runs on a distributed swarm of other machines.

Still, you can access individual layers and their outputs, as well as run forward/backward through them:

In [None]:
first_five_layers = model.transformer.h[0:5]
first_five_layers

In [None]:
dummy_inputs = torch.randn(1, 3, 14336, dtype=torch.bfloat16, requires_grad=True)
outputs = first_five_layers(dummy_inputs)
outputs

In [None]:
loss = torch.mean((outputs - torch.ones_like(outputs)) ** 2)
loss.backward()  # backpropagate through the internet
print("Grad w.r.t. inputs:", dummy_inputs.grad.flatten())

In general, you can mix and match distributed layers like in regular PyTorch and even insert and train your own layers (e.g., adapters) between the pre-trained ones.

<div align="center">
<img src="https://camo.githubusercontent.com/58732a64488a9be928e25f3e60e3692b989ffe212ac86cb4902d8df20a042b03/68747470733a2f2f692e696d6775722e636f6d2f525459463379572e706e67" width="80%">
</div>

<p align="center">📜 <b><a href="https://arxiv.org/pdf/2209.01188.pdf">Read details in our paper</a></b></p>

## Step 4. Adding a trainable adapter 🏋️

While the remotely hosted transformer blocks are **frozen** to keep the pretrained model the same for all users, using **parameter-efficient adapters** (small trainable layers added between the pretrained blocks of the model, such as [LoRA](https://arxiv.org/abs/2106.09685)) or **trainable prompts** (trainable inputs added before the inputs to the model, such as in [P-Tuning v2](https://arxiv.org/abs/2110.07602)) is usually enough to make BLOOM solve a variety of downstream tasks.

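Below, we show an example of how to add a basic **trainable** adapter (two linear layers with a 32-dimensional bottleneck) after the first six transformer blocks of the pretrained model, that is, between `model.transformer.h[5]` and `model.transformer.h[6]`. The adapter's weights and the corresponding optimizer statistics will be stored locally: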

In [None]:
import torch.nn as nn
import torch.nn.functional as F


class BloomBasedClassifier(nn.Module):
  def __init__(self, model):
    super().__init__()
    self.distributed_layers = model.transformer.h
    self.adapter = nn.Sequential(nn.Linear(14336, 32), nn.Linear(32, 14336))
    self.head = nn.Sequential(nn.LayerNorm(14336), nn.Linear(14336, 2))
  
  def forward(self, embeddings):
    hidden_states = self.distributed_layers[0:6](embeddings)
    hidden_states = self.adapter(hidden_states)
    hidden_states = self.distributed_layers[6:10](hidden_states)
    pooled_states = torch.mean(hidden_states, dim=1)
    return self.head(pooled_states)

In [None]:
classifier = BloomBasedClassifier(model).cuda()
opt = torch.optim.Adam(classifier.parameters(), 3e-5)
inputs = torch.randn(3, 2, 14336, device='cuda')
labels = torch.tensor([1, 0, 1], device='cuda')

for i in range(5):
  loss = F.cross_entropy(classifier(inputs), labels)
  print(f"loss[{i}] = {loss.item():.3f}")
  opt.zero_grad()
  loss.backward()
  opt.step()

print('predicted:', classifier(inputs).argmax(-1))  # 1, 0, 1
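
As a rough check on how small the locally stored part is, the following sketch counts the trainable parameters of the `adapter` and `head` defined above. It assumes PyTorch's defaults (bias terms in `nn.Linear`, a scale and a shift per feature in `nn.LayerNorm`); `linear_params` is a helper introduced here for illustration. The comparison with a full-width 14336×14336 layer is only a point of reference, not a statement about BLOOM's actual block layout.

```python
HIDDEN_SIZE = 14336  # hidden size used in the cells above
BOTTLENECK = 32      # inner width of the adapter above


def linear_params(in_features, out_features, bias=True):
    """Number of parameters in a dense layer (weights plus optional bias)."""
    return in_features * out_features + (out_features if bias else 0)


# Adapter: Linear(14336, 32) followed by Linear(32, 14336)
adapter_params = linear_params(HIDDEN_SIZE, BOTTLENECK) + linear_params(BOTTLENECK, HIDDEN_SIZE)
# Head: LayerNorm (scale + shift per feature) followed by a 2-class linear layer
head_params = 2 * HIDDEN_SIZE + linear_params(HIDDEN_SIZE, 2)
# A single full-width linear layer, for comparison only
dense_params = linear_params(HIDDEN_SIZE, HIDDEN_SIZE)

print(f"adapter: {adapter_params:,}")
print(f"head: {head_params:,}")
print(f"full-width linear layer: {dense_params:,}")
print(f"adapter / full-width ratio: {adapter_params / dense_params:.4f}")
```

Under these assumptions, the adapter and head together hold about one million parameters, so they and their optimizer statistics fit comfortably on a local GPU.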

## Step 5. Using custom sampling methods 🎰

The __`model.inference_session()`__ interface in Petals also allows you to write custom inference code. You can use it to implement any sampling algorithm you want, or to write a custom beam search that forbids the model from using swear words.

Below, let's see how we can reimplement the standard `model.generate()` interface by making forward passes through all the layers manually:

In [None]:
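text = "What is a good chatbot? Answer:"
token_ids = tokenizer(text, return_tensors="pt")["input_ids"].cuda()
max_length = 100  # Token budget for the session (prompt + generated tokens)
num_tokens = token_ids.shape[1]
with torch.inference_mode():
    with model.inference_session(max_length=max_length) as sess:
        while num_tokens < max_length:
            # Embed the new tokens locally (the full prompt on the first step, one token afterwards)
            embs = model.transformer.word_embeddings(token_ids)
            embs = model.transformer.word_embeddings_layernorm(embs)

            # Run the remote transformer blocks; the session keeps attention caches for past tokens
            h = sess.step(embs)
            h_last = model.transformer.ln_f(h[:, -1])
            logits = model.lm_head(h_last)

            # Greedy decoding: pick the most likely next token
            next_token = logits.argmax(dim=-1)
            text += tokenizer.decode(next_token)
            token_ids = next_token.reshape(1, 1)
            num_tokens += 1
            print(text)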
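
As one possible variation on the loop above, the greedy `argmax` step could be swapped for a sampling function. The sketch below is plain Python with no Petals dependencies; `sample_top_p` and its `banned_ids` parameter are hypothetical names introduced for illustration. It applies temperature scaling, removes banned token ids, and samples from the smallest set of tokens whose probabilities sum to at least `top_p` (nucleus sampling).

```python
import math
import random


def sample_top_p(logits, top_p=0.9, temperature=0.75, banned_ids=(), rng=random):
    """Sample a token id from raw logits using temperature and nucleus filtering."""
    banned = set(banned_ids)
    candidates = [(i, l / temperature) for i, l in enumerate(logits) if i not in banned]
    if not candidates:
        raise ValueError("all token ids are banned")

    # Softmax over the remaining candidates (max subtracted for stability)
    peak = max(l for _, l in candidates)
    exps = [(i, math.exp(l - peak)) for i, l in candidates]
    total = sum(e for _, e in exps)
    probs = sorted(((i, e / total) for i, e in exps), key=lambda x: x[1], reverse=True)

    # Keep the most likely tokens until their cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for i, p in probs:
        nucleus.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break

    ids = [i for i, _ in nucleus]
    weights = [p for _, p in nucleus]
    return rng.choices(ids, weights=weights, k=1)[0]


toy_logits = [5.0, 1.0, 0.5, -2.0]
picked = sample_top_p(toy_logits, banned_ids=[0], rng=random.Random(0))
print(picked)  # never 0, because token id 0 is banned
```

To try it in the loop above, one option, assuming `logits` has shape `(1, vocab_size)`, is `next_token = torch.tensor([sample_top_p(logits[0].float().tolist())], device=logits.device)`. A pure-Python pass over BLOOM's large vocabulary will be slow, so a tensor-based version would be preferable in practice.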

## Step 6. Sharing is caring 🤗

We developed Petals to be a community-run system, so we rely on people sharing their GPUs to increase the swarm’s capacity. If you have some GPUs that are not always busy, please **consider running a Petals server.** You can pause it at any time if you want to use the GPUs for something else. As a bonus, people running a server get a certain speedup when using Petals, since a larger part of the model is hosted locally.

<br>

🐋 You can run our [Docker](https://www.docker.com) image (works on Linux, macOS, and Windows with [WSL2](https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl)):

```
sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
    learningathome/petals:main python -m petals.cli.run_server bigscience/bloom-petals --port 31330
```

🐍 Or run these commands in an [Anaconda](https://www.anaconda.com) env (requires Linux and Python 3.7+):

```
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -U petals
python -m petals.cli.run_server bigscience/bloom-petals
```

<br>

📚 See [FAQ](https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#running-a-server) to learn how to configure the server to use multiple GPUs, address common issues, etc.

You can also host [BLOOMZ](https://huggingface.co/bigscience/bloomz), a version of BLOOM fine-tuned to follow human instructions in the zero-shot regime — just replace `bloom-petals` with `bloomz-petals`.

🔒 Hosting a server does not allow others to run custom code on your computer. Learn more about security [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety).

💬 If you have any issues or feedback, let us know on [our Discord server](https://discord.gg/D9MwApKgWa)!

## Step 7. Using other fine-tuning and prompt-tuning methods

While you can write your own custom adapters, Petals implements several [standard](https://arxiv.org/abs/2104.08691) [methods](https://arxiv.org/abs/2101.00190) for parameter-efficient fine-tuning. We provide a couple of advanced examples in our GitHub repository:

- Training a personified chatbot: [notebook](https://github.com/bigscience-workshop/petals/blob/main/examples/prompt-tuning-personachat.ipynb)

- Fine-tuning BLOOM for text semantic classification: [notebook](https://github.com/bigscience-workshop/petals/blob/main/examples/prompt-tuning-sst2.ipynb)

## What's next?

Congratulations on finishing our tutorial! Now, you are familiar with how to use Petals for different tasks, how it works under the hood, and how to increase its capacity.

You can find a few other helpful resources below:

* __More about Petals.__ The [README](https://github.com/bigscience-workshop/petals#readme) file in our GitHub repository has links to more Petals-related materials, including instructions for starting your own swarm (possibly with a model other than BLOOM).

* __Discord server.__ If you have any feedback, questions, or technical issues, please [join our Discord server](https://discord.gg/D9MwApKgWa) and let us know. If you want to build something based on Petals, we'd be happy to hear what you are up to.

* __Academic paper.__ We have released a [paper](https://arxiv.org/abs/2209.01188) that goes into detail about our research and what happens in Petals under the hood.