<a href="https://colab.research.google.com/github/PranavkrishnaVadhyar/Mistral7B_LlamaIndex_RAG/blob/main/Mistral7_LLamaIndex_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q pypdf
!pip install -q python-dotenv
!pip install -q transformers
!pip install -q llama-index  # document loading, chunking, and vector indexing
!pip install -q llama-index-llms-llama-cpp
!pip install -q llama-index-embeddings-langchain
!pip install -q langchain-huggingface
!pip install -q --upgrade langchain sentence-transformers



llama.cpp: build with CPU and GPU support for inference

In [2]:
# CMake definitions need a leading -D; newer llama-cpp-python releases use -DGGML_CUDA=on instead
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 python -m pip install llama-cpp-python --no-cache-dir



In [3]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

In [4]:
documents = SimpleDirectoryReader('/content/Data/').load_data()

In [5]:
import torch
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

from llama_index.core.prompts.prompts import SimpleInputPrompt
system_prompt = "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided."
# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = "<|USER|>{query_str}<|ASSISTANT|>"


llm = LlamaCPP(
    # Direct download URL for the GGUF model
    model_url='https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    # Alternatively, set model_path to a pre-downloaded model file
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # The GGUF metadata reports a 32768-token context length for this model; 3900 is a conservative cap
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set n_gpu_layers to at least 1 to use the GPU (-1 offloads all layers)
    model_kwargs={"n_gpu_layers": 1},
    # System prompt that steers the assistant's answers
    system_prompt=system_prompt,
    verbose=True,
)


llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /tmp/llama_index/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:   
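
The model loaded above is Mistral-7B-Instruct, whose instruction format wraps each user turn in `[INST] ... [/INST]` tags. The following is a minimal sketch of prompt helpers for that format, not the helpers shipped in `llama_utils`. The function names are hypothetical, messages are represented as plain `(role, content)` tuples rather than LlamaIndex message objects, the system text is folded into the first user turn, and the BOS token is assumed to be added by llama.cpp.

```python
from typing import Iterable, Tuple


def mistral_completion_to_prompt(completion: str, system_prompt: str = "") -> str:
    """Wrap a single-turn request in [INST] tags (hypothetical helper)."""
    body = completion.strip()
    if system_prompt.strip():
        # Assumption: system text is prepended inside the instruction block
        body = f"{system_prompt.strip()}\n\n{body}"
    return f"[INST] {body} [/INST]"


def mistral_messages_to_prompt(messages: Iterable[Tuple[str, str]]) -> str:
    """Render (role, content) pairs as a multi-turn [INST] prompt."""
    system = ""
    prompt = ""
    for role, content in messages:
        if role == "system":
            system = content.strip()
        elif role == "user":
            text = content.strip()
            if system:
                # Fold the system text into the first user turn only
                text = f"{system}\n\n{text}"
                system = ""
            prompt += f"[INST] {text} [/INST]"
        elif role == "assistant":
            # Completed assistant turns end with the end-of-sequence marker
            prompt += f" {content.strip()}</s>"
    return prompt


print(mistral_completion_to_prompt("What is a GDSC?", "Answer briefly."))
```

If helpers like these were adapted to LlamaIndex message objects, they could be supplied through the `messages_to_prompt` and `completion_to_prompt` arguments of `LlamaCPP`.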

In [6]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding



embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="thenlper/gte-large"),
)



In [7]:
# ServiceContext is deprecated in recent llama-index releases in favor of llama_index.core.Settings
service_context = ServiceContext.from_defaults(
    chunk_size=256,  # size of each text chunk, in tokens
    llm=llm,
    embed_model=embed_model,
)
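
To give a feel for what `chunk_size=256` controls, here is a rough sketch of fixed-size chunking with overlap. It is not LlamaIndex's splitter: it counts whitespace-separated words instead of tokenizer tokens and ignores sentence boundaries. The function name `chunk_words` and the `overlap` value are assumptions made for illustration.

```python
def chunk_words(text: str, chunk_size: int = 256, overlap: int = 20) -> list:
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Smaller chunks give the retriever more precise passages to match but carry less surrounding context into the prompt.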


In [8]:
# The LLM is already configured through service_context, so it is not passed again here
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
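
Conceptually, a vector index stores one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query embedding. The sketch below shows that idea with cosine similarity over tiny hand-written vectors. It is an illustration only: the function names are hypothetical, `k=2` is an assumed setting, and real embeddings from `thenlper/gte-large` are much higher-dimensional.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k: int = 2) -> list:
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], toy_chunks))
```

The retrieved chunks are then placed into the prompt as context for the LLM.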

In [9]:
query_engine = index.as_query_engine(llm=llm)

response = query_engine.query("Explain the contents of the pdf?")

print(response)


llama_print_timings:        load time =  207336.15 ms
llama_print_timings:      sample time =      42.85 ms /    75 runs   (    0.57 ms per token,  1750.17 tokens per second)
llama_print_timings: prompt eval time =  207335.60 ms /   469 tokens (  442.08 ms per token,     2.26 tokens per second)
llama_print_timings:        eval time =   46331.72 ms /    74 runs   (  626.10 ms per token,     1.60 tokens per second)
llama_print_timings:       total time =  253766.18 ms /   543 tokens



The PDF file contains information about testing, containerization, deployment to production environments, cloud build, APIs for services such as Google Maps, YouTube, Gmail, Google Drive, Google Calendar, and more, and Chrome DevTools. The document also explains how these tools can help teams deliver software faster and more reliably while maintaining consistency and scalability.
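
The `llama_print_timings` lines in the output above report elapsed milliseconds and token counts, from which throughput follows by simple division. Here is a small sketch that parses one such line and recomputes tokens per second. The regular expression is an assumption based on the log layout shown in this notebook, and the helper names are hypothetical.

```python
import re

# Matches e.g. "llama_print_timings:  eval time =  46331.72 ms /  74 runs ..."
_TIMING = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?) time =\s*(?P<ms>[\d.]+) ms"
    r"(?: /\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timing(line: str):
    """Return (stage name, milliseconds, token count or None), or None if no match."""
    match = _TIMING.search(line)
    if match is None:
        return None
    count = int(match.group("count")) if match.group("count") else None
    return match.group("name").strip(), float(match.group("ms")), count


def tokens_per_second(ms: float, count: int) -> float:
    """Throughput implied by a token count and an elapsed time in milliseconds."""
    return count / (ms / 1000.0)


line = "llama_print_timings:        eval time =   46331.72 ms /    74 runs"
name, ms, count = parse_timing(line)
print(name, round(tokens_per_second(ms, count), 2))
```

For the eval stage this gives roughly 1.6 tokens per second, matching the figure printed in the log.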


In [10]:
!pip install ragas



In [28]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# Evaluation set: each item pairs a query with a reference answer
evaluation_data = [
    {
        'query': 'What is a Google Developer Student Club?',
        'expected_answer': (
            'Google Developer Student Clubs are university based community groups for students interested '
            'in Google developer technologies. Students from all undergraduate or graduate programs with '
            'an interest in growing as a developer are welcome. By joining a GDSC, students grow their '
            'knowledge in a peer-to-peer learning environment and build solutions for local businesses and '
            'their community.'
        ),
    },
    # Add more queries and expected answers
]

def evaluate_rag_system(evaluation_data, query_engine):
    """Run each query and collect the answer together with the retrieved context chunks."""
    results = []
    for item in evaluation_data:
        query = item['query']
        expected_answer = item['expected_answer']

        # Perform the query
        response = query_engine.query(query)

        # Log the results
        logging.info(f"Query: {query}")
        logging.info(f"Response: {response}")
        logging.info(f"Expected: {expected_answer}")

        # Collect results; faithfulness also needs the retrieved contexts
        results.append({
            "question": query,
            "answer": str(response),
            "contexts": [n.node.get_content() for n in response.source_nodes],
            "ground_truth": expected_answer,
        })

    return results

# Run the evaluation
results = evaluate_rag_system(evaluation_data, query_engine)

# ragas.evaluate expects a datasets.Dataset (not a list) and metric objects (not strings).
# The column names follow the ragas 0.1-style schema; other releases may name them differently.
dataset = Dataset.from_list(results)

# Unless an `llm` argument is supplied, ragas uses OpenAI models as the judge and needs OPENAI_API_KEY
metrics = evaluate(dataset, metrics=[faithfulness])

# Print the metrics
print(metrics)

Llama.generate: prefix-match hit

llama_print_timings:        load time =  207336.15 ms
llama_print_timings:      sample time =      86.36 ms /   144 runs   (    0.60 ms per token,  1667.44 tokens per second)
llama_print_timings: prompt eval time =  246063.36 ms /   582 tokens (  422.79 ms per token,     2.37 tokens per second)
llama_print_timings:        eval time =   85577.10 ms /   143 runs   (  598.44 ms per token,     1.67 tokens per second)
llama_print_timings:       total time =  331840.72 ms /   725 tokens
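
As a rough intuition for the faithfulness metric: it is the fraction of claims in the generated answer that are supported by the retrieved contexts. Ragas relies on an LLM to extract and verify claims. The sketch below swaps in a naive word-overlap check instead, so it only illustrates the ratio and does not reproduce ragas scores. The function name `toy_faithfulness` and the `threshold` parameter are hypothetical.

```python
import re


def _words(text: str) -> set:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_faithfulness(answer: str, contexts: list, threshold: float = 0.8) -> float:
    """Fraction of answer sentences whose words mostly appear in the contexts."""
    # Treat each sentence as one claim (a crude stand-in for LLM claim extraction)
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not claims:
        return 0.0
    context_words = set().union(*[_words(c) for c in contexts])
    supported = 0
    for claim in claims:
        words = _words(claim)
        if words and len(words & context_words) / len(words) >= threshold:
            supported += 1
    return supported / len(claims)


answer = "GDSC groups are university based. They sell cars."
contexts = ["GDSC groups are university based community groups."]
print(toy_faithfulness(answer, contexts))
```

A score near 1.0 suggests the answer stays within what was retrieved, while a low score points to statements the contexts do not back up.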