## Mistral-7B Retrieval Augmented Generation (RAG) ⚙️ 🗃️

As the applications of Large Language Models (LLMs) continue to grow, companies and users are increasingly seeking out ways to understand and extract value from their proprietary data by using LLMs. However, security and privacy are serious concerns that have made companies reluctant to expose their sensitive proprietary data to external models. 

There are two ways to address this. One is to build LLMs from scratch or fine-tune open-source LLMs on the proprietary data, which can be both expensive and time-consuming. The other is to build a RAG framework.

Simply put, RAG lets users query a data source and receive a relevant response.
A RAG framework, powered by a large language model (LLM), takes a data source, splits it into chunks, generates embeddings for the chunks, and stores those embeddings in a vector database. At query time it embeds the query, runs a similarity search against the vector database to find the most relevant chunks, and sends the query text together with those chunks to the LLM, which generates a response.
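
To make the retrieval steps concrete, here is a minimal sketch in plain Python. It is not how LlamaIndex or Qdrant work internally: the bag-of-words `embed` function stands in for a real embedding model, a Python list stands in for the vector database, and the sample chunks, function names, and prompt template are all made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    """Return the top_k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine the query text and the retrieved chunks into one LLM prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Illustrative chunks only; a real pipeline would read these from the data source.
chunks = [
    "Qdrant stores embeddings and supports similarity search.",
    "Mistral 7B is an open-weight language model.",
    "Bananas are rich in potassium.",
]
store = [(c, embed(c)) for c in chunks]  # the toy "vector database"

query = "Which database stores embeddings?"
top_chunks = retrieve(query, store, top_k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)  # this prompt is what would be sent to the LLM
```

Note that the LLM receives the query text and the chunk text, not the embedding vectors; the embeddings are only used to decide which chunks to include.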

In [1]:
!nvidia-smi

Mon Jan 15 23:05:18 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA TITAN RTX               Off | 00000000:00:05.0 Off |                  N/A |
| 41%   34C    P2              60W / 280W |   4811MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

#### 1. Import packages

- 🦙 `llama-index` is a framework for fast retrieval and querying of data

- 🗄️ `qdrant` is a vector database and vector similarity search engine for storing, searching and managing embeddings

In [1]:
# Import Modules
from llama_index.llms import Ollama
import qdrant_client
from pathlib import Path
from llama_index import download_loader
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

#### 2. Load the data and initialize the service context

In [2]:
UnstructuredReader = download_loader("UnstructuredReader")
loader = UnstructuredReader()
docs = loader.load_data(file=Path('../data/data.txt'))

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ubuntu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
# # alternative loader: reads only the files listed in input_files
# reader = SimpleDirectoryReader(input_files=["/home/ubuntu/Mistral-7B-RAG/data/data.txt"])
# docs = reader.load_data()

In [3]:
# path to store the data
client = qdrant_client.QdrantClient(path="../data/qdrant_data")

# name of the collection
vector_store = QdrantVectorStore(client=client, collection_name="mistral_data")

# context responsible for storing the nodes, indices, and vectors
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [4]:
# Initializing Ollama and ServiceContext
llm = Ollama(model="mistral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local") # "local" runs the embedding model on the local machine


In [5]:
# Creating the VectorStoreIndex and query engine
index = VectorStoreIndex.from_documents(docs, service_context=service_context, storage_context=storage_context) # embeds data and creates indices for the embeddings
query_engine = index.as_query_engine(streaming=True)
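
Before embedding, `VectorStoreIndex.from_documents` splits each document into smaller chunks (nodes). The sketch below illustrates overlapping chunking in general; it is not LlamaIndex's splitter. The function name `split_into_chunks` and the word-based `chunk_size` and `overlap` parameters are assumptions chosen for readability.

```python
def split_into_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
example_chunks = split_into_chunks(sample, chunk_size=4, overlap=1)
for chunk in example_chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of storing some words twice.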

In [7]:
# perform a query and stream the response
response = query_engine.query("which of the models performed best")

In [8]:
response.print_response_stream()

 Based on the provided context, both Elara and Seraphim played crucial roles in their journey to save Seraphim's homeland from a devastating curse. Their unique magical abilities and dedication allowed them to overcome various challenges they encountered along the way. However, it is important to note that the success of their mission was not solely dependent on their individual performances but rather on their combined efforts. Therefore, it is not accurate to determine which of the two models "performed best" based on the context alone. Instead, we can conclude that they complemented each other's strengths and worked together to achieve a common goal.
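
With `streaming=True`, the answer arrives as a sequence of tokens rather than one finished string, and `print_response_stream()` prints them as they come in. The sketch below mimics that pattern with a hypothetical token generator, `fake_token_stream`, so it runs without an LLM; it is an illustration of consuming a stream, not LlamaIndex's implementation.

```python
from typing import Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for an LLM token generator: yields one word at a time."""
    for word in text.split(" "):
        yield word + " "


def consume_stream(tokens: Iterator[str]) -> str:
    """Print tokens as they arrive and return the assembled text."""
    parts = []
    for token in tokens:
        print(token, end="", flush=True)  # show partial output immediately
        parts.append(token)
    print()
    return "".join(parts).strip()


full_text = consume_stream(fake_token_stream("Elara and Seraphim worked together."))
```

Collecting the tokens while printing them is useful when the full answer is needed afterwards, since a generator can only be consumed once.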

In [12]:
type(response)

llama_index.core.response.schema.StreamingResponse

In [14]:
chat_engine = index.as_chat_engine(streaming=True)

In [15]:
response = chat_engine.query('what is the story about?')

In [17]:
response.response

"The story is about Elara, a young woman with magical abilities, and Seraphim, a skilled mage, who team up to save Seraphim's homeland from a curse. Throughout their journey, they encounter challenges, form a bond, and eventually confront the malevolent being responsible for the curse. The story explores themes of power, identity, sacrifice, and the balance between magic and humanity."