### Tutorial: using LlamaIndex with local open-source LLMs to build a RAG service

This tutorial is based on [this blog post](https://medium.com/@marco.bertelli/build-a-complete-opensource-llm-rag-qa-chatbot-choose-the-model-25968144d7a6), which is in turn derived from the original [LlamaIndex notebook](https://colab.research.google.com/drive/1ZAdrabTJmZ_etDp10rjij_zME2Q3umAQ?usp=sharing#scrollTo=ghqk6C04TD3b).

This tutorial runs successfully with **llama-index==0.9.26**, **transformers==4.36.2**, and **torch==2.1.2**.

#### step0. prepare the dependencies

In [1]:
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

# local_llama_model_root = os.getenv("LOCAL_LLAMA_MODEL_ROOT")
local_mistral_model_root = os.getenv("LOCAL_MISTRAL_MODEL_ROOT")
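
`os.getenv` returns `None` when a variable is missing, and the later `os.path.join(local_mistral_model_root, ...)` call would then fail with a less obvious `TypeError`. A minimal sketch of a fail-fast check follows; `require_env_dir` is a hypothetical helper (not part of `python-dotenv`), and it assumes the variable is meant to point at an existing directory.

```python
import os


def require_env_dir(name: str) -> str:
    """Return the directory stored in env var `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set (check your .env file)"
        )
    if not os.path.isdir(value):
        raise RuntimeError(f"{name}={value!r} is not an existing directory")
    return value
```

It could replace the plain lookup, e.g. `local_mistral_model_root = require_env_dir("LOCAL_MISTRAL_MODEL_ROOT")`.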

In [2]:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [3]:
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from llama_index.readers import BeautifulSoupWebReader
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM

#### step1. load some data as the local document

In [4]:
data_url = "https://www.theverge.com/2023/9/29/23895675/ai-bot-social-network-openai-meta-chatbots"
documents = BeautifulSoupWebReader().load_data([data_url])
documents

[Document(id_='df4e0d20-942a-4110-8a94-4f8628ef5813', embedding=None, metadata={'URL': 'https://www.theverge.com/2023/9/29/23895675/ai-bot-social-network-openai-meta-chatbots'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='3dc5d8241b72378e191206cc7c52f7854b819f8ceaf55ac21818dc21162cdea9', text="The synthetic social network is coming - The VergeSkip to main contentThe VergeThe Verge logo.The Verge homepageThe Verge homepageThe VergeThe Verge logo./Tech/Reviews/Science/Entertainment/MoreMenuExpandThe VergeThe Verge logo.MenuExpandPlatformer/Artificial Intelligence/TechThe synthetic social network is comingThe synthetic social network is coming / Between ChatGPT’s surprisingly human voice and Meta’s AI characters, our feeds may be about to change foreverBy  Casey Newton, a contributing editor who has been writing about tech for over 10 years. He founded Platformer, a newsletter about Big Tech and democracy. Sep 29, 2023, 1:30 PM UTC|CommentsShare

In [5]:
text = documents[0].text

# Print the scraped text in fixed-width slices of 200 characters for easier reading.
chars_per_line = 200
for start in range(0, len(text), chars_per_line):
    print(text[start:start + chars_per_line])

The synthetic social network is coming - The VergeSkip to main contentThe VergeThe Verge logo.The Verge homepageThe Verge homepageThe VergeThe Verge logo./Tech/Reviews/Science/Entertainment/MoreMenuEx
pandThe VergeThe Verge logo.MenuExpandPlatformer/Artificial Intelligence/TechThe synthetic social network is comingThe synthetic social network is coming / Between ChatGPT’s surprisingly human voice a
nd Meta’s AI characters, our feeds may be about to change foreverBy  Casey Newton, a contributing editor who has been writing about tech for over 10 years. He founded Platformer, a newsletter about Bi
g Tech and democracy. Sep 29, 2023, 1:30 PM UTC|CommentsShare this story Image: Álvaro Bernis / The VergeThis is Platformer, a newsletter on the intersection of Silicon Valley and democracy from Casey
 Newton and Zoë Schiffer. Sign up here.Today, let’s consider the implications of a truly profound week in the development of artificial intelligence and discuss whether we may be witnessing the ri

#### step2. load the quantized local open-source hf model

In [6]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4", # qlora
    bnb_4bit_use_double_quant=True,
)
quantization_config

BitsAndBytesConfig {
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

In [25]:
local_model_path = os.path.join(local_mistral_model_root, "Mistral-7B-Instruct-v0.2")
# local_model_path = os.path.join(local_llama_model_root, "Llama2-7b-chat")

tokenizer = AutoTokenizer.from_pretrained(
    local_model_path, 
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    quantization_config=quantization_config,
    trust_remote_code=True,
    device_map='auto'
)
model

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
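
To get a feel for what 4-bit loading saves, here is a back-of-the-envelope estimate built only from the `Linear4bit` shapes in the printout above. It is a rough sketch: it ignores the embedding table, the norm layers, the quantization constants, activations, and the KV cache, so real GPU usage will be higher.

```python
# Linear4bit shapes (in_features, out_features) copied from the printout above.
LAYER_LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # "(0-31): 32 x MistralDecoderLayer"

params_per_layer = sum(i * o for i, o in LAYER_LINEAR_SHAPES.values())
linear_params = NUM_LAYERS * params_per_layer

GIB = 1024 ** 3
fp16_gib = linear_params * 2 / GIB    # 2 bytes per weight in float16
nf4_gib = linear_params * 0.5 / GIB   # 4 bits = 0.5 byte per weight

print(f"quantized linear weights: {linear_params:,}")
print(f"fp16: {fp16_gib:.2f} GiB, 4-bit: {nf4_gib:.2f} GiB")
```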
   

In [26]:
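# In the Mistral/Llama-2 instruct format the prompt ends at [/INST]; </s> marks the end of the model's answer, so it is not placed here.
llama_query_prompt_wrapper = PromptTemplate("<s>[INST] {query_str} [/INST]")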

llm = HuggingFaceLLM(
    model=model, tokenizer=tokenizer,
    query_wrapper_prompt=llama_query_prompt_wrapper,
    context_window=1000,
    max_new_tokens=256,
    generate_kwargs=dict(do_sample=True, temperature=0.2, top_k=50, top_p=0.95, pad_token_id=tokenizer.eos_token_id),
)
# llm  # FIXME: do not display `llm` here, otherwise it will cause a recursion error
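
With `context_window=1000` and `max_new_tokens=256`, only part of the window is left for the retrieved text. The sketch below shows the arithmetic in simplified form; `context_budget` is a hypothetical helper, and the 100-token prompt size is an assumed figure (a real count would come from the tokenizer).

```python
def context_budget(context_window: int, max_new_tokens: int, prompt_tokens: int) -> int:
    """Tokens left for retrieved text after reserving the answer and the prompt template."""
    remaining = context_window - max_new_tokens - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt and answer reservation exceed the context window")
    return remaining


# 1000 and 256 come from the HuggingFaceLLM call above; 100 is an assumed prompt size.
print(context_budget(1000, 256, 100))  # 644
```

A small window like this keeps memory use low, but it also limits how much retrieved context each LLM call can see.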

#### step3. build a service based on the model

In [27]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5"
)
service_context

ServiceContext(llm_predictor=LLMPredictor(system_prompt=None, query_wrapper_prompt=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>), prompt_helper=PromptHelper(context_window=1000, num_output=256, chunk_overlap_ratio=0.1, chunk_size_limit=None, separator=' '), embed_model=HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5', embed_batch_size=10, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x7fd51c400160>, tokenizer_name='BAAI/bge-small-en-v1.5', max_length=512, pooling=<Pooling.CLS: 'cls'>, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None), transformations=[SentenceSplitter(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x7fd51c400160>, id_func=<function default_id_func at 0x7fdbd9a75480>, chunk_size=1024, chunk_overlap=200, separator=' ', paragraph_separator='\n\n\n', secondary_chunking_regex='[^,.;。？！]+[,.;。？！]?')], llama_logge
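
The printout shows a `SentenceSplitter` with `chunk_size=1024` and `chunk_overlap=200`. The sketch below illustrates only the overlap idea with a plain sliding window; it is not the library's algorithm (which prefers sentence boundaries), and `sliding_chunks` is a hypothetical helper that treats list items as stand-ins for tokens.

```python
from typing import List


def sliding_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in sliding_chunks(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk starts with the last item of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.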

#### step4. build a vectorstore index based on the data and the model service

In [28]:
from llama_index import VectorStoreIndex

vectorstore_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
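
Conceptually, the index stores one embedding vector per chunk, and a query is answered by embedding the question and picking the most similar chunks. A toy sketch of that lookup with cosine similarity follows; the three-dimensional vectors and the names `cosine`, `top_k`, and `store` are made up for illustration and stand in for real `bge-small-en-v1.5` embeddings.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k (chunk_id, score) pairs most similar to the query vector."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings: hand-written vectors, not model output.
store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], store, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```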

#### step5. instantiate a query engine from the vectorstore index

In [29]:
query_engine = vectorstore_index.as_query_engine(response_mode="compact")
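
Roughly speaking, `response_mode="compact"` packs as many retrieved chunks as fit into each LLM call and then refines the answer across the remaining calls. The sketch below shows only the greedy packing idea; `compact_chunks` is a hypothetical helper, and character counts stand in for the token counts a real implementation would use.

```python
from typing import List


def compact_chunks(chunks: List[str], budget: int, sep: str = "\n\n") -> List[str]:
    """Greedily merge consecutive chunks so each merged text stays within `budget` characters.

    A single chunk longer than `budget` is kept as its own (oversized) entry.
    """
    packed: List[str] = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + sep + chunk
        if len(candidate) <= budget or not current:
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


print(compact_chunks(["aaaa", "bbbb", "cccc"], budget=10, sep="|"))
```

Fewer, fuller prompts mean fewer LLM calls than refining over every chunk separately.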

#### step6. query the engine to construct a RAG service

In [30]:
from llama_index.response.notebook_utils import display_response

default_query = "How do OpenAI and Meta differ on AI tools?"
while True:
    print("="*50)
    query = input(f"Please input your query to the engine ('quit' to exit, empty to use the default one: {default_query}):\n")
    if query == "quit":
        break
    if query == "":
        query = default_query
    print(f"User Query:\n{query}")
    print("-"*50)

    response = query_engine.query(query)
    display_response(response)

User Query:
How do OpenAI and Meta differ on AI tools?
--------------------------------------------------


**`Final Response:`** Both OpenAI and Meta are making significant strides in the field of AI, but their approaches and applications differ. OpenAI, as previously mentioned, focuses on presenting its AI technologies as productivity tools, such as its popular ChatGPT model, which is designed to help users get things done through conversational interactions. However, the new context highlights the emotional and personal connection users can form with AI models like ChatGPT, offering not only productivity but also emotional support, coaching, therapy, and even daily affirmations. This could be particularly valuable for individuals who are lonely, isolated, or on the margins. The ability to give text messages a warm and kindly voice, as demonstrated by a trans teenager's use of ChatGPT for identity issues, should not be underestimated.

Meta, on the other hand, is taking a different approach by building Language Models (LLMs) for use in its messaging apps and social media platforms. Meta's AI characters, such as its 28 personality-driven chatbots, some of which are voiced by celebrities, are intended to provide a more engaging and interactive experience for users. These chatbots are not just tools, but rather a new form of entertainment and

User Query:
what happened on that Monday
--------------------------------------------------


**`Final Response:`** On that Monday, Meta and OpenAI made significant strides in the integration of artificial intelligence in social media and messaging platforms. Meta unveiled 28 personality-driven chatbots, featuring the voices of celebrities like Charli D'Amelio, Dwyane Wade, Kendall Jenner, MrBeast, Snoop Dogg, Tom Brady, and Paris Hilton. These chatbots, designed for use in Meta's messaging apps, offer a more immersive and interactive experience for users. The announcement came a day after OpenAI revealed updates for ChatGPT, which included its surprisingly human-like voice and productivity-focused features, such as daily affirmations and pep talks.

The use of celebrity voices marks a departure from OpenAI's productivity-focused approach and Meta's entry into the entertainment industry's use of generative AI. The power of giving these digital characters a warm and kindly voice, as demonstrated by a trans teenager's use of ChatGPT for daily affirmations about identity issues, should not be underestimated. This development is a significant step forward in the creation of synthetic social networks, with potential for fans to spend

