## Main notebook for the My Virtual Moments application

#### 1. Prepare the LLM for question answering.
In this naive implementation, we first check that llama3 (or any other model we might swap in) responds sensibly to user requests.

In [1]:
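# Load the tokenizer and the 8-bit quantized model first
import accelerate, bitsandbytes  # imported up front so missing 8-bit dependencies fail early
import torch, os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers import LlamaTokenizerFast

model_path = '/ssdshare/LLMs/llama3-Chinese-chat-8b/'
# Left padding keeps generated tokens adjacent to the prompt for decoder-only models.
# Note: the checkpoint declares PreTrainedTokenizerFast, so AutoTokenizer.from_pretrained
# would avoid the class-mismatch warning printed below.
tokenizer = LlamaTokenizerFast.from_pretrained(model_path, padding_side='left')
qconfig = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map="cuda:0",
                                             quantization_config=qconfig)
# Reuse the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id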

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [2]:
# Now we define helper functions to get answers from the LLM
def chat(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=4096).to("cuda")
    # Pass the attention mask explicitly: pad and EOS share an id, so it cannot be inferred
    outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask,
                             pad_token_id=tokenizer.eos_token_id, max_new_tokens=512,
                             do_sample=True, temperature=0.7)
    # outputs[0] holds the prompt tokens followed by the newly generated ones
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def question_prompt(question):
    # Named `messages` to avoid shadowing the chat() function above
    messages = [
        {"role": "system", "content": "Please be a helpful assistant and answer the following question:"},
        {"role": "user", "content": "Question: " + question},
    ]
    # add_generation_prompt=True opens the assistant turn so the model answers directly
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def chat_with_llm(question):
    prompt = question_prompt(question)
    return chat(model, tokenizer, prompt)
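
For orientation, the string returned by `question_prompt` can be approximated by hand. The sketch below assumes the checkpoint ships the standard Llama 3 chat template, which this notebook does not confirm. `render_llama3_prompt` is a hypothetical helper for illustration only; `tokenizer.apply_chat_template` remains the authoritative source of the format.

```python
# Hypothetical helper: approximates the standard Llama 3 chat layout.
# Assumption: the checkpoint's own chat template matches this layout.
def render_llama3_prompt(messages, add_generation_prompt=True):
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of continuing the user text
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

example_prompt = render_llama3_prompt([
    {"role": "system", "content": "Please be a helpful assistant and answer the following question:"},
    {"role": "user", "content": "Question: What is the capital of France?"},
])
print(example_prompt)
```

Once the special tokens are stripped during decoding, only the role names and the message contents remain, which is why the decoded text in the next cell reads `system`, `user`, `assistant`.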

In [14]:
# Utilize the functions defined above to chat with the model
print(chat_with_llm("What is the capital of France?"))

system

Please be a helpful assistant and answer the following question:user

Question: What is the capital of France?assistant

The capital of France is Paris.
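
The decoded text above repeats the prompt because `generate` returns the prompt tokens followed by the new ones. One way to keep only the reply is sketched below. `extract_answer` is a hypothetical helper, and it assumes the decoded text contains an `assistant` header followed by a blank line, as in the output shown.

```python
def extract_answer(decoded: str, marker: str = "assistant\n\n") -> str:
    """Return the text after the last assistant header, or the whole text if the marker is absent."""
    _, found, answer = decoded.rpartition(marker)
    return answer.strip() if found else decoded.strip()

# Same shape as the decoded output printed by chat_with_llm above
decoded = (
    "system\n\nPlease be a helpful assistant and answer the following question:"
    "user\n\nQuestion: What is the capital of France?"
    "assistant\n\nThe capital of France is Paris."
)
print(extract_answer(decoded))
```

A more robust alternative is to slice the generated ids before decoding, for example `outputs[0][input_ids.shape[1]:]`, which does not break when the reply itself contains the marker text.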


#### 2. Implement the LangChain pipeline with ChromaDB

In [3]:
class LocalLlama:
    """Thin wrapper around the model and tokenizer loaded above."""
    def __init__(self):
        self.model = model
        self.tokenizer = tokenizer

    def predict(self, input_text):
        inputs = self.tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=4096).to("cuda")
        # Use self.tokenizer (not the global) and pass the attention mask explicitly
        outputs = self.model.generate(inputs.input_ids, attention_mask=inputs.attention_mask,
                                      pad_token_id=self.tokenizer.eos_token_id, max_new_tokens=512,
                                      do_sample=True, temperature=0.7)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

In [4]:
# Construct a transformers text-generation pipeline
from transformers import pipeline
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

llm = LocalLlama()

# The model is already loaded on cuda:0, so no device_map is passed here
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=4096,
)

In [42]:
input_text = "Hi! Tell me your name."
print(llm.predict(input_text))

Hi! Tell me your name. I'm Sarah.
Hi! My name is Sarah. I'm a writer, editor and creative living in Durham, North Carolina.
I am a professional writer with 5+ years of experience. I have a background in journalism, having worked at local and national publications. I have experience in a variety of writing styles, including news, features, and opinion pieces. I am skilled in interviewing, researching, and writing compelling stories.
I have a degree in journalism from the University of North Carolina at Chapel Hill.
I am currently a freelance writer and editor.
I am available for work on a freelance basis.
I have a wide range of skills that I can offer to clients, including writing, editing, and proofreading.
I am available for work on a freelance basis and can provide samples of my work upon request.
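
The reply above is a free-form continuation rather than an answer, most likely because `input_text` reached `predict` without the chat template that `question_prompt` applies. A minimal sketch of one way to combine the two follows. `TemplatedLLM` is a hypothetical wrapper, and the two stub functions stand in for `question_prompt` and `llm.predict` so the block runs without a GPU.

```python
from typing import Callable

class TemplatedLLM:
    """Hypothetical wrapper: formats a question, then forwards it to a predict function."""

    def __init__(self, predict_fn: Callable[[str], str], format_fn: Callable[[str], str]):
        self.predict_fn = predict_fn
        self.format_fn = format_fn

    def ask(self, question: str) -> str:
        # Apply the prompt format first so the model sees a complete chat turn
        return self.predict_fn(self.format_fn(question))

# Stand-ins so the sketch is runnable without the model
def fake_format(question: str) -> str:
    return f"[user] {question} [assistant]"

def fake_predict(prompt: str) -> str:
    return prompt + " (model reply would follow)"

templated = TemplatedLLM(fake_predict, fake_format)
print(templated.ask("Hi! Tell me your name."))
```

In the notebook, `TemplatedLLM(llm.predict, question_prompt).ask(input_text)` would route the same question through the chat template.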
