<a href="https://colab.research.google.com/github/Balacoumarane/finetune_llama/blob/main/Build_RAG_Mistral.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
%%capture
! pip install farm-haystack

In [4]:
from haystack.nodes import PromptNode
from getpass import getpass

In [None]:
!huggingface-cli login

In [34]:
HF_TOKEN = getpass("Your Hugging Face Token")

Your Hugging Face Token··········


In [6]:
pn = PromptNode(model_name_or_path="mistralai/Mistral-7B-Instruct-v0.1",  # instruct fine-tuned model: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
                max_length=800,
                api_key=HF_TOKEN)

In [7]:
# Let's quickly try the model

out = pn("[INST] Explain in an ironic way why Large Language Models rock! [/INST]")

print(out[0])

 Large Language Models are the epitome of irony. They are so large and complex that they can only be trained on massive amounts of data, yet they are capable of understanding and generating language in ways that humans can't. They are like a giant brain in a computer, but they can't think or feel like a human. They are like a machine that can write poetry, but they can't appreciate the beauty of it. They are like a tool that can help us communicate better, but they can't understand the nuances of human communication. In short, Large Language Models are the ultimate example of how technology can be both amazing and terrifying at the same time.


In [8]:
! git clone https://huggingface.co/spaces/anakin87/fact-checking-rocks
! tar -xzf /content/fact-checking-rocks/data/rock_wiki.tar.gz

Cloning into 'fact-checking-rocks'...
remote: Enumerating objects: 319, done.
remote: Counting objects: 100% (319/319), done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 319 (delta 193), reused 319 (delta 193), pack-reused 0
Receiving objects: 100% (319/319), 68.13 KiB | 13.63 MiB/s, done.
Resolving deltas: 100% (193/193), done.
Filtering content: 100% (4/4), 222.56 MiB | 21.59 MiB/s, done.


In [9]:
# Each file is a JSON document with a content field and some metadata
import json

with open("./rock_wiki/100405.json") as f:
    doc = json.load(f)

for key, value in doc.items():
    print(key, ":", str(value)[:250])

content : Robert Anthony Plant  (born 20 August 1948) is an English singer and songwriter, best known as the lead singer and lyricist of the English rock band Led Zeppelin for all of its existence from 1968 until 1980, when the band broke up following the deat
meta : {'name': 'Robert Plant', 'url': 'https://en.wikipedia.org/wiki/Robert_Plant'}


In [10]:
import glob
import json

from haystack import Document
from haystack.nodes import PreProcessor

docs = []

# Load every JSON file and convert it into a Haystack Document
for json_file in glob.glob("/content/rock_wiki/*.json"):
    with open(json_file, "r") as fin:
        doc_json = json.load(fin)
    docs.append(Document.from_json(doc_json))

In [11]:
processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
    split_overlap=0,
    language="en",
)
preprocessed_docs = processor.process(docs)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Preprocessing: 100%|██████████| 453/453 [00:09<00:00, 49.48docs/s]
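
To get a feel for what `split_by="word"`, `split_length=200` and `split_respect_sentence_boundary=True` mean, here is a rough standalone sketch of the idea. It is not Haystack's implementation: `split_by_words` is a hypothetical helper, and the sentence splitter is a naive regex rather than the NLTK tokenizer that the PreProcessor downloads.

```python
import re


def split_by_words(text, split_length=200):
    """Greedily pack whole sentences into chunks of at most split_length words."""
    # Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would push it over the limit.
        if current and count + n_words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six seven. Eight nine. Ten."
chunks = split_by_words(sample, split_length=5)
for chunk in chunks:
    print(chunk)
```

In this sketch a single sentence longer than `split_length` is kept whole and becomes its own oversized chunk, since sentences are never cut in the middle.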


In [12]:
preprocessed_docs[0]

<Document: {'content': 'Bruce Frederick Joseph Springsteen (born September 23, 1949) is an American singer, songwriter, and musician. He has released 20 studio albums, many of which feature his backing band, the E Street Band. Originally from the Jersey Shore, he is one of the originators of the heartland rock style of music, combining mainstream rock musical style with narrative songs about working class American life. During a career that has spanned six decades, Springsteen has become known for his poetic, socially conscious lyrics and energetic stage performances, sometimes lasting up to four hours in length. He has been nicknamed "the Boss".In 1973, Springsteen released his first two albums, Greetings from Asbury Park, N.J. and The Wild, the Innocent & the E Street Shuffle, neither of which earned him a large audience. He changed his style and reached worldwide popularity with Born to Run in 1975. It was followed by Darkness on the Edge of Town (1978) and The River (1980), which t

# **Create an InMemoryDocumentStore and store data**

In [13]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

In [14]:
document_store.write_documents(preprocessed_docs)


Updating BM25 representation...: 100%|██████████| 13236/13236 [00:01<00:00, 11891.42 docs/s]
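
For intuition about what a BM25 representation scores, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. The whitespace tokenizer, the `k1=1.5` and `b=0.75` defaults, and the `bm25_scores` helper are assumptions for illustration; the variant and parameters used inside the document store may differ.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in corpus against query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "Robert Plant was the lead singer of Led Zeppelin",
    "Bruce Springsteen is nicknamed the Boss",
    "Sum 41 is a Canadian rock band",
]
scores = bm25_scores("lead singer Led Zeppelin", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])
```

Documents that share no term with the query score zero, which is why a purely lexical retriever can miss passages that answer the question in different words.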


# **Create a RAG Pipeline**

In [22]:
from haystack import Pipeline
from haystack.nodes import BM25Retriever, PromptNode, PromptTemplate

In [23]:
retriever = BM25Retriever(document_store, top_k=4)

In [24]:
# a good Question Answering template, adapted for the instruction format
# (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)

qa_template = PromptTemplate(prompt=
  """[INST] Using the information contained in the context, answer the question (using a maximum of two sentences).
  If the answer cannot be deduced from the context, answer \"I don't know.\"
  Context: {join(documents)};
  Question: {query}
  [/INST]""")
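
The template has two placeholders: `{join(documents)}` is filled with the text of the retrieved documents and `{query}` with the question. The sketch below mimics that substitution in plain Python so the final prompt is easy to inspect. `build_prompt` is a hypothetical helper, joining with a single space is an assumption, and the two context strings are made-up stand-ins for retrieved chunks.

```python
def build_prompt(query, documents, delimiter=" "):
    """Fill the question answering template with retrieved text and a question."""
    context = delimiter.join(documents)
    return (
        "[INST] Using the information contained in the context, answer the "
        "question (using a maximum of two sentences).\n"
        "If the answer cannot be deduced from the context, answer \"I don't know.\"\n"
        f"Context: {context};\n"
        f"Question: {query}\n"
        "[/INST]"
    )


prompt = build_prompt(
    "What was the initial name of Sum 41?",
    ["Sum 41 is a Canadian rock band.", "The band was originally called Kaspir."],
)
print(prompt)
```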

In [25]:
prompt_node = PromptNode(model_name_or_path="mistralai/Mistral-7B-Instruct-v0.1",
                         api_key=HF_TOKEN,
                         default_prompt_template=qa_template,
                         max_length=5500,
                         model_kwargs={"model_max_length":8000})
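
The two length settings interact. Assuming the prompt and the generated answer share one window of `model_max_length` tokens, reserving `max_length=5500` tokens for the answer leaves the remainder for the prompt. The sketch below does that arithmetic; both helpers are hypothetical, and the 1.3 tokens-per-word ratio is only a rough rule of thumb, not the Mistral tokenizer.

```python
def prompt_token_budget(model_max_length, max_length):
    """Tokens left for the prompt once room for the answer is reserved."""
    budget = model_max_length - max_length
    if budget <= 0:
        raise ValueError("max_length must be smaller than model_max_length")
    return budget


def rough_token_count(text, tokens_per_word=1.3):
    """Very rough estimate based on whitespace-separated words."""
    return int(len(text.split()) * tokens_per_word)


budget = prompt_token_budget(model_max_length=8000, max_length=5500)

# Four retrieved chunks of about 200 words each (top_k=4, split_length=200).
context_estimate = rough_token_count(" ".join(["word"] * 200 * 4))
fits = context_estimate < budget
print(budget, context_estimate, fits)
```

Under these assumptions the retrieved context fits comfortably, so a larger `top_k` or a smaller `max_length` would be options to explore if answers need more evidence.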

In [26]:
rag_pipeline = Pipeline()
rag_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])
rag_pipeline.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])
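
The pipeline wires two nodes in sequence: the query goes to the retriever, and the retriever's documents go to the prompt node. A toy version of that data flow, with no Haystack involved, could look like the sketch below. Every name in it (`run_pipeline`, `toy_retriever`, `toy_prompt_node`, `STORE`) is hypothetical, the retriever just counts shared words, and the prompt node echoes its prompt instead of calling an LLM.

```python
STORE = [
    "Sum 41 is a Canadian rock band.",
    "Robert Plant was the lead singer of Led Zeppelin.",
    "Bruce Springsteen is nicknamed the Boss.",
]


def run_pipeline(nodes, query):
    """Pass a shared state dict through each node in order."""
    state = {"query": query}
    for node in nodes:
        state = node(state)
    return state


def toy_retriever(state, top_k=2):
    """Rank stored texts by the number of words they share with the query."""
    words = set(state["query"].lower().split())
    ranked = sorted(
        STORE,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return {**state, "documents": ranked[:top_k]}


def toy_prompt_node(state):
    """Build a prompt from the retrieved texts; a real node would query the LLM."""
    context = " ".join(state["documents"])
    prompt = f"Context: {context}; Question: {state['query']}"
    return {**state, "results": [prompt]}


out = run_pipeline([toy_retriever, toy_prompt_node], "Who sang for Led Zeppelin")
print(out["results"][0])
```

The returned dict carries a `results` list, which mirrors how the answers are read from the output of `rag_pipeline.run` later in the notebook.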


In [27]:
from pprint import pprint


def print_answer(out):
    pprint(out["results"][0].strip())

# **Let's try our RAG Pipeline**

In [28]:
print_answer(rag_pipeline.run(query="Who was Elvis Presley?"))


('Elvis Presley was a renowned American singer, actor, and musician, known for '
 'his significant impact on the popularization of rock and roll in the 20th '
 'century. He was born on January 8, 1935, in Tupelo, Mississippi, and had a '
 'close bond with his mother. His father was of German, Scottish, and English '
 'origins, while his mother was of Scots-Irish with some French Norman '
 "ancestry. Presley's early years were marked by his love for music, which he "
 'discovered at an Assembly of God church in Tupelo.')


In [29]:
print_answer(rag_pipeline.run(query="What was the initial name of Sum 41?"))

'The initial name of Sum 41 was Kaspir.'


In [30]:
print_answer(rag_pipeline.run(query="Is the earth flat?"))


("I don't know. The provided context does not mention anything about the Earth "
 'being flat.')


In [31]:
print_answer(rag_pipeline.run(query="How can I use lamini to train an LLM model?"))


("I don't know. The provided context does not mention how to use lamini to "
 'train a LLM model.')


In [33]:
print_answer(rag_pipeline.run(query="Is there any Indian musician mentioned in the document?"))


'Yes, Ravi Shankar, an Indian sitar maestro, is mentioned in the document.'
