<a href="https://colab.research.google.com/github/HatefulRock/Chatbot/blob/main/Llama2_7b_chat_feature_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Note: Responses from local models can be quite slow, especially with 8-bit quantization.

With 4-bit quantization, llama2-7b-chat uses about 8 GB of VRAM.
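As a rough sanity check on the VRAM figure, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of 7 billion is an assumption based on the model name, and the helper `weight_memory_gb` is hypothetical. The estimate ignores activations, the KV cache, and framework overhead, which would account for part of the gap between the weight-only number and the total quoted above.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7 billion parameters, as the model name suggests.
N_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Halving the bits per parameter halves the weight footprint, which is why 4-bit loading is what makes a 7B model practical on a small GPU.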

In [1]:
%pip install llama-index
%pip install transformers accelerate bitsandbytes
%pip install llama-index-readers-web
%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-huggingface
%pip install llama-index-program-openai
%pip install llama-index-agent-openai
%pip install pypdf


Collecting llama-index
  Downloading llama_index-0.10.20-py3-none-any.whl (5.6 kB)
Collecting llama-index-agent-openai<0.2.0,>=0.1.4 (from llama-index)
  Downloading llama_index_agent_openai-0.1.5-py3-none-any.whl (12 kB)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_cli-0.1.9-py3-none-any.whl (25 kB)
Collecting llama-index-core<0.11.0,>=0.10.20 (from llama-index)
  Downloading llama_index_core-0.10.20.post2-py3-none-any.whl (15.4 MB)
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama-index)
  Downloading llama_index_embeddings_openai-0.1.6-py3-none-any.whl (6.0 kB)
Collecting llama-index-indices-managed-llama-cloud<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.1.4-py3-none-any.whl (6.6 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloa

## Setup

### Data

In [2]:
from llama_index.readers.web import BeautifulSoupWebReader
from llama_index.core import SimpleDirectoryReader

url = "https://www.theverge.com/2023/9/29/23895675/ai-bot-social-network-openai-meta-chatbots"

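# Alternative: load the article at `url` instead of local files.
# documents = BeautifulSoupWebReader().load_data([url])

# Load every file found in /content/data.
documents = SimpleDirectoryReader("/content/data").load_data()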


In [5]:
documents

[Document(id_='7cbe1ccf-ea48-4f94-b64c-e94449492b87', embedding=None, metadata={'page_label': '1', 'file_name': 'Déplacements professionnels.pdf', 'file_path': '/content/data/Déplacements professionnels.pdf', 'file_type': 'application/pdf', 'file_size': 153233, 'creation_date': '2024-03-18', 'last_modified_date': '2024-03-18'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Les déplacements professionnels\nPRODECURE\nPour des raisons pratiques et de sécurité, il est important que vous informiez votre service et le secrétariat de votre pôle \nde tous vos déplacements, quels qu’ils soient :\nSi vous devez effectuer un déplacement dans le cadre de votre travail, il vous faudra préalablement générer un ordre de \nmission informatique comportant les informat

### LLM

This should run on a T4 GPU in the Colab free tier.

In [3]:
# Hugging Face API token for downloading Llama 2.
# Read it from the environment instead of hardcoding a secret in the notebook.
import os

hf_token = os.environ["HF_TOKEN"]

In [27]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

system_prompt = """
You are a question-answering assistant. Your goal is to answer questions
as accurately as possible based on the instructions and context provided. Always answer the questions in French.
"""

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    query_wrapper_prompt=PromptTemplate("<s> [INST] {query_str} [/INST] "),
    system_prompt=system_prompt,
    context_window=3900,
    max_new_tokens=512,
    model_kwargs={"token": hf_token, "quantization_config": quantization_config},
    tokenizer_kwargs={"token": hf_token},
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
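To make the `query_wrapper_prompt` above concrete, here is a minimal sketch of what filling that template produces. It uses plain `str.format` as a stand-in for `PromptTemplate`, and the helper name `wrap_query` is hypothetical. How the library combines this with `system_prompt` is not shown here.

```python
# Same template string as the query_wrapper_prompt passed to HuggingFaceLLM.
QUERY_WRAPPER = "<s> [INST] {query_str} [/INST] "


def wrap_query(query_str: str) -> str:
    """Fill the wrapper template; plain str.format stands in for PromptTemplate."""
    return QUERY_WRAPPER.format(query_str=query_str)


# repr() makes the trailing space after [/INST] visible.
print(repr(wrap_query("Combien de jours de congés ?")))
```

The `[INST] ... [/INST]` markers are the instruction delimiters that the Llama 2 chat models were trained with, so the query needs to sit between them.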

In [28]:
from llama_index.core import Settings

Settings.llm = llm
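# Note: bge-small-en-v1.5 is an English embedding model; a multilingual model
# may retrieve better on the French documents loaded above.
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"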

### Index Setup

In [29]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)
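Conceptually, a vector index embeds each chunk and, at query time, returns the chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with cosine similarity over hand-written 3-dimensional vectors. It is a simplified illustration, not the library's implementation, and the chunk names, vectors, and helper functions are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical embeddings; real ones come from the embedding model.
chunks = {
    "leave_policy": [0.9, 0.1, 0.0],
    "business_travel": [0.1, 0.9, 0.0],
    "expenses": [0.2, 0.7, 0.1],
}
query = [0.8, 0.2, 0.0]

print(top_k(query, chunks, k=2))
```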

In [30]:
from llama_index.core.indices import SummaryIndex

summary_index = SummaryIndex.from_documents(documents)

### Helpful Imports / Logging

In [32]:
from llama_index.core.response.notebook_utils import display_response

In [31]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Basic Query Engine

### Compact (default)

In [35]:

# Wrap the string in a PromptTemplate (imported earlier), since the query engine
# expects a prompt template object rather than a plain string.
custom_qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query in french, mentioning any relevant sources from which you derived the answer.\n"
    "Query: {query_str}\n"
    "Answer: "
)
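The two placeholders in the template above are filled at query time: `{context_str}` with the retrieved text and `{query_str}` with the user's question. A minimal sketch of that substitution, using plain `str.format` and hypothetical retrieved passages in place of real retriever output:

```python
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query in french, mentioning any relevant sources "
    "from which you derived the answer.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Hypothetical retrieved chunks, standing in for what the retriever returns.
retrieved_chunks = ["First retrieved passage.", "Second retrieved passage."]

filled = QA_TEMPLATE.format(
    context_str="\n\n".join(retrieved_chunks),
    query_str="combien de jours de conges?",
)
print(filled)
```

The filled string ends with `Answer: `, so the model's completion is read as the answer itself.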

In [38]:
query_engine = vector_index.as_query_engine(text_qa_template=custom_qa_prompt)


qa_prompt_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query in french, mentioning the name of the text from which you derived the answer.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt_tmpl = PromptTemplate(
    qa_prompt_tmpl_str
)

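query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": qa_prompt_tmpl}
)
# Caution: the default refine template uses {existing_answer} and {context_msg},
# which this QA template does not define. Reusing it here is only reasonable if
# all retrieved context fits in one LLM call, so the refine step never runs.
query_engine.update_prompts(
    {"response_synthesizer:refine_template": qa_prompt_tmpl}
)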
user_question = input("Utilisateur: ")
response = query_engine.query(user_question)

display_response(response)



Utilisateur: combien de jours de conges?


**`Final Response:`** Based on the context information provided in the text "Les congés" from page 2, the answer to the query "Combien de jours de congés?" is:

"Les techniciens et employés ont droit à 5 jours de congés pour 5 ans de service, 10 jours de congés pour 10 ans de service, 15 jours de congés pour 15 ans de service, 20 jours de congés pour 20 ans de service, et 30 jours de congés pour 30 ans de service."

In summary, the number of days of vacation for technical staff and employees is:

* 5 days for 5 years of service
* 10 days for 10 years of service
* 15 days for 15 years of service
* 20 days for 20 years of service
* 30 days for 30 years of service.
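The "Compact" in this section's heading refers to the response mode that packs retrieved chunks into as few prompts as will fit the context window before calling the LLM. The sketch below shows the packing idea only. It assumes a character budget instead of a token count, and the helper `compact_chunks` is hypothetical, not the library's code.

```python
def compact_chunks(chunks, max_chars):
    """Greedily merge consecutive chunks so each merged block stays within max_chars.

    A single chunk longer than max_chars is kept as its own block, unsplit.
    """
    packed, current = [], ""
    for chunk in chunks:
        candidate = chunk if not current else current + "\n\n" + chunk
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this chunk would exceed the budget: close the current block.
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


chunks = ["a" * 40, "b" * 40, "c" * 40]
packed = compact_chunks(chunks, max_chars=100)
print(len(packed), [len(block) for block in packed])
```

Fewer blocks means fewer LLM calls, which matters with a slow local model.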

In [39]:
from IPython.display import Markdown, display


# Display each prompt's key as Markdown, then print its raw template text.
def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}<br>**Text:** <br>"
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown("<br><br>"))


prompts_dict = query_engine.get_prompts()
display_prompt_dict(prompts_dict)


**Prompt Key**: response_synthesizer:text_qa_template<br>**Text:** <br>

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query in french, mentioning the name of the text from which you derived the answer.
Query: {query_str}
Answer: 


<br><br>

**Prompt Key**: response_synthesizer:refine_template<br>**Text:** <br>

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query in french, mentioning the name of the text from which you derived the answer.
Query: {query_str}
Answer: 


<br><br>