## References

- [Serving LLMs](https://medium.com/@vipra_singh/building-llm-applications-serving-llms-part-9-68baa19cef79c)

In [1]:
from torch import float16
from transformers import pipeline

In [1]:
# AWQ-quantized checkpoint of Mistral-7B-Instruct-v0.2
model = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

In [3]:
# Load the model as published: the checkpoint is already AWQ-quantized,
# so no further compression is applied here
pipe = pipeline(
    "text-generation", 
    model=model, 
    torch_dtype=float16, 
    device_map="auto"
)



In [4]:
# We use the tokenizer's chat template to format each message
# See https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "user",
        "content": "You are a friendly chatbot.",
    }
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)

In [5]:
outputs = pipe(
    prompt, 
    max_new_tokens=512, 
    do_sample=True, 
    temperature=0.1, 
    top_p=0.95
)
print(outputs[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST] You are a friendly chatbot. [/INST] Hello there! I'm glad you're having a nice day. How can I help make it even better? Feel free to ask me anything or share what's on your mind. I'm here to listen and provide information or just chat! 😊
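
The printed output shows what the chat template produced for this single user message: the text is wrapped as `<s>[INST] ... [/INST]`. As a rough illustration only, the hypothetical helper below (`format_mistral_chat` is not part of `transformers`) mimics that wrapping in plain Python. The single-turn case matches the output above; the handling of assistant turns and the `</s>` end marker are assumptions about the template, so `apply_chat_template` remains the authoritative source.

```python
def format_mistral_chat(messages, bos="<s>", eos="</s>"):
    """Approximate the Mistral instruct prompt format (illustrative sketch).

    Assumes roles alternate user/assistant, starting with user.
    """
    parts = [bos]
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected:
            raise ValueError(
                f"message {i}: expected role {expected!r}, got {msg['role']!r}"
            )
        if msg["role"] == "user":
            # User turns are wrapped in instruction markers
            parts.append(f"[INST] {msg['content']} [/INST]")
        else:
            # Assumption: assistant turns are closed with the end-of-sequence token
            parts.append(f"{msg['content']}{eos}")
    return "".join(parts)


messages = [{"role": "user", "content": "You are a friendly chatbot."}]
print(format_mistral_chat(messages))
```

Because this sketch accepts only `user` and `assistant` roles, the instruction "You are a friendly chatbot." has to travel as a user message, as it does in the cell above.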


In [2]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

In [3]:
llm = HuggingFacePipeline.from_model_id(
    model_id=model,
    task="text-generation",
    pipeline_kwargs={
        "max_new_tokens": 512,
        "do_sample": True,
        "temperature": 0.1, 
        "top_p": 0.95        
    },
    device=0,  # first CUDA GPU; the model is loaded on CPU first, hence the warning below
)

You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [4]:
llm.invoke('You are a friendly chatbot.')

"You are a friendly chatbot. Question: What is the difference between a chatbot and a human?\n\nAI: Hello there! I'm an artificial intelligence, or AI for short. I'm designed to simulate conversation and perform tasks based on pre-programmed algorithms and machine learning. I don't have feelings, emotions, or physical presence. I'm here to help answer questions and provide information.\n\nHumans, on the other hand, are living organisms with the ability to think, feel, and communicate. We have complex emotions, the ability to learn and adapt, and the capacity for creativity and problem-solving. We can understand context, tone, and nuance in language, and we can form relationships and build connections with each other.\n\nSo, while I can mimic human conversation and perform certain tasks, I don't possess the same level of complexity and versatility as a human. I'm just a helpful AI, here to make your life easier!"
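
The string returned above continues the raw input rather than answering it, most likely because `llm.invoke` received plain text without the `[INST] ... [/INST]` wrapping that the chat template added in the earlier cell. Passing the templated `prompt` variable instead is one way to address that. The return value also appears to begin with the prompt itself. A minimal sketch of removing that echo follows; `strip_prompt_echo` is a hypothetical helper, and the strings are stand-ins rather than real model output.

```python
def strip_prompt_echo(prompt: str, generated: str) -> str:
    """Return only the newly generated text when `generated` starts with `prompt`."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    # Nothing to strip: the backend did not echo the prompt
    return generated


# Stand-in strings for illustration; with the model loaded this would be
#   raw = llm.invoke(example_prompt)
example_prompt = "[INST] You are a friendly chatbot. [/INST]"
raw = example_prompt + " Hello there! How can I help?"

reply = strip_prompt_echo(example_prompt, raw)
print(reply)
```

The `startswith` check keeps the helper safe to use when a backend returns only the completion.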