# RAG Tutorial with Groclake
Description:
This end-to-end tutorial demonstrates how to build an Agentic Retrieval-Augmented Generation (RAG) system using Groclake. The process involves managing documents in DataLake, generating vectors for those documents, performing vector searches, enriching the search results, and using ModelLake to provide contextual, AI-assisted responses. Each step showcases a Groclake capability needed for a working Agentic RAG system.

Groclake Documentation: https://plotch-ai.gitbook.io/groclake-by-plotch.ai

VectorLake is vector-centric infrastructure that lets developers create embedding vectors quickly, store them, and build RAG applications.

DataLake is a data warehouse for storing various types of structured and unstructured documents and records. Using DataLake, developers can store PDFs, Word documents, Excel sheets, Google Sheets, plain text, and similar files for RAG-based applications.

ModelLake is an infrastructure pipeline for LLM-based operations such as chat completion, language translation, automatic speech recognition, text-to-speech, speech-to-text, and speech-to-speech.
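
To make the retrieve-then-generate flow concrete before touching the Groclake APIs, here is a minimal, library-free sketch. It uses a toy bag-of-words vector and cosine similarity as stand-ins for real embeddings and vector search. The names `embed`, `search`, and `build_prompt` are illustrative assumptions, not Groclake functions.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, documents, top_k=1):
    """Rank documents by similarity to the query and return the best top_k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_docs):
    """Enrich the user query with retrieved context before calling an LLM."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical two-document corpus, for illustration only
documents = [
    "A square matrix has an equal number of rows and columns.",
    "Groceries are delivered every Tuesday morning.",
]
query = "How many rows does a square matrix have?"
prompt = build_prompt(query, search(query, documents))
print(prompt)
```

In a real system the `embed` and `search` steps would be handled by a vector store and the final prompt would be sent to a chat-completion endpoint; the control flow stays the same.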

# Step 1: Install the Required Library
First, install the `groclake` library, which is used for managing data, vectors, and models.

In [3]:
!pip install groclake

Collecting groclake
  Downloading groclake-0.1.14-py3-none-any.whl.metadata (83 bytes)
Downloading groclake-0.1.14-py3-none-any.whl (10.0 kB)
Installing collected packages: groclake
Successfully installed groclake-0.1.14


In [1]:
!pip install PyPDF2


Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


# Step 2: Set Environment Variables
Set up the API key and account ID for authenticating with the Groclake service. These are stored as environment variables to simplify access throughout the script.

In [4]:
import os

# Set API key and account ID
GROCLAKE_API_KEY = 'fe9fc289c3ff0af142b6d3bead98a923'
GROCLAKE_ACCOUNT_ID = '31df8ac36812112e6bc5ff0ad0daf847'

# Set them as environment variables
os.environ['GROCLAKE_API_KEY'] = GROCLAKE_API_KEY
os.environ['GROCLAKE_ACCOUNT_ID'] = GROCLAKE_ACCOUNT_ID

print("Environment variables set successfully.")


Environment variables set successfully.
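
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the same two variable names used above: read each value from the environment and fail early if it is missing. The helper `require_env` is a hypothetical name, not part of Groclake.

```python
import os

def require_env(name, env=None):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Example with an explicit mapping so the sketch runs without real credentials
demo_env = {"GROCLAKE_API_KEY": "your-api-key", "GROCLAKE_ACCOUNT_ID": "your-account-id"}
api_key = require_env("GROCLAKE_API_KEY", demo_env)
account_id = require_env("GROCLAKE_ACCOUNT_ID", demo_env)
print("Credentials loaded.")
```

Calling `require_env("GROCLAKE_API_KEY")` without the second argument reads from the real process environment instead of the demo mapping.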


In [6]:
!pip install python-docx


Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2


In [28]:
import os
from groclake.vectorlake import VectorLake
from groclake.datalake import DataLake
from groclake.modellake import ModelLake
import random
import PyPDF2
import docx
import textwrap

class SarcasticQuizBot:
    def __init__(self):
        # Initialize API credentials
        self.GROCLAKE_API_KEY = '93db85ed909c13838ff95ccfa94cebd9'
        self.GROCLAKE_ACCOUNT_ID = '89ff7fa5adc705887aa8186792153342'
        self.setup_environment()
        self.initialize_lakes()

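        # Token budget: cap the input size so there is room left for the completion
        self.MAX_INPUT_TOKENS = 4000  # Upper bound on prompt tokens per chunk
        self.COMPLETION_TOKENS = 2000  # Tokens requested for each completion
        self.CHARS_PER_TOKEN = 4  # Rough characters-per-token estimate used for chunking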

        # Canned responses: roasts for wrong answers, summary intros, and per-score comments
        self.roasts = [
            "Wow, that's impressively wrong! Did you even read the document?",
            "Oh honey... The answer was right there in the text. Like, literally right there.",
            "That's about as correct as saying the Earth is flat.",
            "Amazing! You managed to ignore everything in the document!",
            "Did you actually read the text, or just use it as a pillow?",
            "I'm not saying you're wrong, but... actually, yes, I am saying that.",
            "Congratulations! You've mastered the art of not absorbing information!",
            "Even a goldfish would remember more from the text!"
        ]

        self.summary_intros = [
            "Alright, buckle up buttercup! Here's what this masterpiece is about:",
            "Oh boy, let me break down this literary gem for you:",
            "Prepare yourself for this absolutely riveting summary:",
            "Here's what you apparently couldn't figure out yourself:",
            "Let me dumb this down to its essence:",
            "Warning: The following summary contains actual information:",
            "Behold, the contents of your document, simplified for your convenience:"
        ]

        self.score_comments = {
            0: "Wow, a perfect zero! You've truly mastered the art of not learning!",
            1: "One right answer... Did you actually read the document or just guess?",
            2: "Two correct! Your reading comprehension is as deep as a parking lot puddle.",
            3: "Three right! Moving up from 'totally clueless' to just 'mostly clueless'.",
            4: "Four correct. Are you even trying to understand the material?",
            5: "Half right! Perfectly balanced between knowledge and ignorance.",
            6: "Six correct! Starting to show signs of actually reading the document.",
            7: "Seven! Not bad... for someone who probably skimmed the text.",
            8: "Eight right! Almost impressive, if I had lower standards.",
            9: "Nine correct! Who knew you could actually read?",
            10: "Perfect score! What, did you write this document yourself or something?"
        }

    # Export the credentials as environment variables, then create the lakes
    def setup_environment(self):
        os.environ['GROCLAKE_API_KEY'] = self.GROCLAKE_API_KEY
        os.environ['GROCLAKE_ACCOUNT_ID'] = self.GROCLAKE_ACCOUNT_ID

    def initialize_lakes(self):
        try:
            self.vectorlake = VectorLake()
            vector_create = self.vectorlake.create()
            self.vectorlake_id = vector_create["vectorlake_id"]

            self.datalake = DataLake()
            datalake_create = self.datalake.create()
            self.datalake_id = datalake_create["datalake_id"]

            print("Lakes initialized successfully!")
        except Exception as e:
            print(f"Failed to initialize lakes: {str(e)}")
            raise

    def read_file(self, file_path):
        """Read different file types and return their content."""
        try:
            file_extension = os.path.splitext(file_path)[1].lower()

            if file_extension == '.txt':
                with open(file_path, 'r', encoding='utf-8') as file:
                    return file.read()

            elif file_extension == '.pdf':
                text = ""
                with open(file_path, 'rb') as file:
                    pdf_reader = PyPDF2.PdfReader(file)
                    for page in pdf_reader.pages:
                        # Guard against pages with no extractable text
                        text += (page.extract_text() or "") + "\n\n"
                return text

            elif file_extension == '.docx':
                # python-docx reads .docx only; legacy .doc files must be converted first
                doc = docx.Document(file_path)
                return '\n\n'.join([paragraph.text for paragraph in doc.paragraphs])

            else:
                raise ValueError(f"Unsupported file type: {file_extension}")

        except Exception as e:
            print(f"Error reading file: {str(e)}")
            raise

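    def chunk_text(self, text):
        """Split text into sentence-aligned chunks of about MAX_INPUT_TOKENS tokens.

        A single sentence longer than the limit is kept whole, so a chunk can
        still exceed the limit in that case.
        """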
        max_chars = self.MAX_INPUT_TOKENS * self.CHARS_PER_TOKEN
        chunks = []

        # Split into sentences first (rough approximation)
        sentences = [s.strip() for s in text.replace('\n', ' ').split('.') if s.strip()]

        current_chunk = []
        current_length = 0

        for sentence in sentences:
            sentence_length = len(sentence) + 2  # Add space for period and space

            if current_length + sentence_length > max_chars and current_chunk:
                # Join current chunk and add to chunks
                chunks.append('. '.join(current_chunk) + '.')
                current_chunk = [sentence]
                current_length = sentence_length
            else:
                current_chunk.append(sentence)
                current_length += sentence_length

        # Add the last chunk
        if current_chunk:
            chunks.append('. '.join(current_chunk) + '.')

        return chunks

    def generate_sassy_summary(self, text):
        """Summarize the text chunk by chunk, then merge the partial summaries."""
        try:
            chunks = self.chunk_text(text)
            if not chunks:
                raise ValueError("No text could be extracted from the document.")
            summaries = []

            # Generate individual summaries for each chunk
            for i, chunk in enumerate(chunks):
                prompt = (
                    f"Summarize part {i+1} of {len(chunks)} of the text in a sassy, "
                    "slightly sarcastic, but informative way. Focus on key points:\n\n" + chunk
                )

                payload = {
                    "messages": [
                        {
                            "role": "system",
                            "content": "You are a sassy but knowledgeable assistant who summarizes documents with a mix of snark and actual insight."
                        },
                        {"role": "user", "content": prompt}
                    ],
                    "token_size": self.COMPLETION_TOKENS
                }

                response = ModelLake().chat_complete(payload)
                summaries.append(response["answer"])

            # If we have multiple summaries, combine them
            if len(summaries) > 1:
                # Create a shorter version of each summary for combining
                short_summaries = [f"Part {i+1}: {summary[:1000]}" for i, summary in enumerate(summaries)]
                combine_prompt = (
                    "Combine these partial summaries into one cohesive, sassy summary "
                    "maintaining the key points and sarcastic tone:\n\n" +
                    "\n\n".join(short_summaries)
                )

                payload = {
                    "messages": [
                        {
                            "role": "system",
                            "content": "You are a sassy but knowledgeable assistant who combines summaries while maintaining style and key points."
                        },
                        {"role": "user", "content": combine_prompt}
                    ],
                    "token_size": self.COMPLETION_TOKENS
                }

                response = ModelLake().chat_complete(payload)
                return response["answer"]

            return summaries[0]

        except Exception as e:
            print(f"Error generating summary: {str(e)}")
            raise

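    def generate_question(self, text, question_number):
        """Generate one multiple-choice question, rotating through chunks by question number."""
        try:
            chunks = self.chunk_text(text)
            if not chunks:
                raise ValueError("No text could be extracted from the document.")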

            # Select chunk based on question number to ensure coverage
            chunk_index = question_number % len(chunks)
            chunk = chunks[chunk_index]

            # Ensure the chunk isn't too long for question generation
            if len(chunk) > (self.MAX_INPUT_TOKENS * self.CHARS_PER_TOKEN) // 2:
                chunk = chunk[:(self.MAX_INPUT_TOKENS * self.CHARS_PER_TOKEN) // 2]

            prompt = (
                f"Generate a challenging multiple choice question #{question_number} about this text excerpt. "
                "The question MUST have EXACTLY 4 options labeled A, B, C, and D. "
                "Make it specific to the content provided.\n\n"
                f"Text: {chunk}\n\n"
                "Format:\n"
                "Question: [Your question here]\n"
                "A) [First option]\n"
                "B) [Second option]\n"
                "C) [Third option]\n"
                "D) [Fourth option]\n"
                "Correct: [A, B, C, or D]"
            )

            payload = {
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a quiz bot that generates specific multiple-choice questions about provided text."
                    },
                    {"role": "user", "content": prompt}
                ],
                "token_size": self.COMPLETION_TOKENS
            }

            response = ModelLake().chat_complete(payload)
            return self._parse_question(response["answer"])

        except Exception as e:
            print(f"Error generating question: {str(e)}")
            raise

    # Helpers for parsing the model's output and running the interactive quiz
    def _parse_question(self, response):
        """Parse the generated question response."""
        lines = [line.strip() for line in response.split("\n") if line.strip()]

        question = lines[0]
        if question.startswith("Question:"):
            question = question[9:].strip()

        options = []
        for line in lines[1:5]:
            if line.startswith(("A)", "B)", "C)", "D)")):
                options.append(line)

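        correct_answer = None
        for line in lines:
            if line.startswith("Correct:"):
                # Keep only the leading letter so "B)", "[B]", or "b" still parse
                correct_answer = line.split(":", 1)[1].strip().lstrip("[(").strip()[:1].upper()
                break

        # Fall back to "A" when the model's answer key cannot be parsed
        if correct_answer not in ["A", "B", "C", "D"]:
            correct_answer = "A"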

        return {
            "question": question,
            "options": options,
            "correct_answer": correct_answer
        }

    def run_quiz(self, text, num_questions=10):
        """Run the quiz with the provided text."""
        print("\n" + random.choice(self.summary_intros))
        summary = self.generate_sassy_summary(text)
        print("\n" + textwrap.fill(summary, width=80))

        input("\nPress Enter when you're ready to prove how little you retained from that...")

        print("\nAlright, prepare to be humiliated!")
        print("=" * 50)

        correct_answers = 0

        for i in range(num_questions):
            print(f"\nQuestion {i+1} of {num_questions}")
            print("-" * 30)

            question_data = self.generate_question(text, i+1)
            print(question_data["question"])
            for option in question_data["options"]:
                print(option)

            while True:
                answer = input("\nYour answer (A/B/C/D): ").strip().upper()
                if answer in ['A', 'B', 'C', 'D']:
                    break
                print("Really? It's not that complicated. Just pick A, B, C, or D!")

            if answer == question_data["correct_answer"]:
                print("\nCorrect! Who would've thought you actually paid attention!")
                correct_answers += 1
            else:
                roast = random.choice(self.roasts)
                print(f"\n{roast}")
                print(f"The correct answer was {question_data['correct_answer']}.")

            print(f"\nCurrent score: {correct_answers}/{i+1}")

        final_score = correct_answers
        print("\n" + "=" * 50)
        print(f"\nFinal Score: {final_score}/{num_questions}")

        # Comments and thresholds assume a 10-point scale, so rescale other quiz lengths
        scaled_score = round(10 * final_score / num_questions) if num_questions else 0
        print(self.score_comments[scaled_score])

        if scaled_score < 5:
            print("Maybe try actually reading the document next time?")
        elif scaled_score < 8:
            print("Not terrible, but not good either. Story of your life?")
        else:
            print("I hate to admit it, but you might actually have understood the document.")

def main():
    quiz_bot = SarcasticQuizBot()

    print("Welcome to the Sarcastic Document Quiz Bot!")
    print("I'll read your document, summarize it with attitude, then test how well you actually read it.")

    while True:
        file_path = input("\nEnter the path to your document (or 'quit' to exit): ")
        if file_path.lower() == 'quit':
            break

        try:
            document_text = quiz_bot.read_file(file_path)
            quiz_bot.run_quiz(document_text)

            play_again = input("\nWant to test your reading comprehension on another document? (yes/no): ").lower()
            if play_again != 'yes':
                print("\nProbably for the best. Your ego couldn't take much more anyway.")
                break

        except Exception as e:
            print(f"\nOops! Something went wrong: {str(e)}")
            print("Maybe try a file that actually exists next time?")

if __name__ == "__main__":
    main()

Lakes initialized successfully!
Welcome to the Sarcastic Document Quiz Bot!
I'll read your document, summarize it with attitude, then test how well you actually read it.

Enter the path to your document (or 'quit' to exit): linea.pdf

Let me dumb this down to its essence:

Alright, buckle up buttercup, we're diving into the world of matrices, vectors,
and all things machine learning. This NITK Surathkal quiz is a whirlwind tour of
machine learning foundations, so hold onto your hats.  First off, we're getting
quizzed on matrix properties. Apparently, the determinant value isn't a basic
property of a matrix, who knew? And all elements are non-negative in a non-
negative matrix - shocker, right? A square matrix is the cool kid with equal
number of rows and columns.  Then we're thrown into the deep end with linear
independence in vector spaces. Turns out, linearly independent vectors are the
drama queens of the vector space, spanning the entire thing and refusing to be
expressed uniquely 

KeyboardInterrupt: Interrupted by user
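
The chunking step in `chunk_text` is what keeps each request near the input budget: 4000 tokens at an assumed 4 characters per token gives 16,000 characters per chunk. Below is a standalone sketch of the same sentence-packing idea, with one addition that is not in the class above: a sentence longer than the budget is hard-split so that no chunk exceeds the limit. The function name `pack_sentences` and the small `max_chars` value are illustrative only.

```python
def pack_sentences(text, max_chars):
    """Greedily pack sentences into chunks of at most max_chars characters."""
    sentences = [s.strip() + "." for s in text.replace("\n", " ").split(".") if s.strip()]

    # Hard-split any single sentence that is longer than the budget
    pieces = []
    for sentence in sentences:
        for start in range(0, len(sentence), max_chars):
            pieces.append(sentence[start:start + max_chars])

    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

MAX_INPUT_TOKENS, CHARS_PER_TOKEN = 4000, 4
print(MAX_INPUT_TOKENS * CHARS_PER_TOKEN)  # character budget per chunk

sample = "One. Two is longer. " + "x" * 30 + ". Four."
for chunk in pack_sentences(sample, max_chars=20):
    print(len(chunk), repr(chunk))
```

Splitting on `.` is only a rough sentence boundary (it also breaks on abbreviations and decimals), which is acceptable here because the goal is a size limit rather than exact sentence segmentation.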

actual code
