To do 1:
- pdf or docx
- folder
- prompts to be cvs wise
- streamlit (chat + old + uploading folders)
- Maher suggestions? (discuss with him)

To do 2:
- docker
- Azure
- Github
--------
To enhance 1:
- CVs with non-text content
- semantic chunking
- overlapping chunking

To enhance 2:
- web

# **Notebook Goal** 
This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) pipeline for question-answering tasks. It uses open-source models for embedding generation, a vector store for retrieval, and a conversational LLM for inference.

**Pipeline Overview**


1. **Data Loading:** Reading and preparing data from unstructured sources like PDFs.
2. **Text Splitting:** Breaking large documents into manageable chunks with overlap for better context retrieval.
3. **Embedding Creation:** Generating vector representations of text using pre-trained embedding models.
4. **Vector Store Initialization:** Storing embeddings in a vector database for efficient retrieval.
5. **Question-Answering Workflow:** Using a conversational retrieval chain to answer user queries.


**Pipeline Hyperparameters**
1. **`model_name`**: Specifies the language model to be used (default in the `RagChain` code: `"llama-3.3-70b-versatile"`).

2. **`temperature`**: Controls the randomness of the model's responses. A value of `0` makes the output deterministic.

3. **`k`**: Defines the number of top documents to retrieve from the vector database for answering a question.

4. **`chunk_size`**: Determines the size of text chunks when splitting documents (default: `1000`).

5. **`chunk_overlap`**: Specifies the amount of overlap between chunks to maintain context (default: `100`).
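
The hyperparameters above can be gathered in one place. The following is a minimal sketch, not part of the notebook's code: `PipelineConfig` is a hypothetical name, and the value of `k` is an assumption chosen for illustration, since the notebook code does not set it explicitly.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the pipeline hyperparameters."""

    model_name: str = "llama-3.3-70b-versatile"  # default used by RagChain in this notebook
    temperature: float = 0.0                     # 0 makes the output deterministic
    k: int = 3                                   # assumed value, for illustration only
    chunk_size: int = 1000                       # characters per chunk (PDFProcessor default)
    chunk_overlap: int = 100                     # characters shared by neighbouring chunks

    def __post_init__(self):
        # An overlap as large as the chunk itself would leave no room for new text.
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
        if self.k < 1:
            raise ValueError("k must be at least 1")


config = PipelineConfig()
print(asdict(config))
```

Keeping the values in a single frozen object makes it harder for the splitter and the retriever to drift apart when one of them is tuned.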

In [1]:
!pip install langchain-groq==0.1.3 langchain-pinecone chromadb langchain==0.1.17 langchain_community qdrant-client==1.9.1 fastembed==0.2.7 rapidocr-onnxruntime "unstructured[pdf]" -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.10.1 requires cubinlinker, which is not installed.
cudf 24.10.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.10.1 requires libcudf==24.10.*, which is not installed.
cudf 24.10.1 requires ptxcompiler, which is not installed.
cuml 24.10.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 24.10.0 requires cuvs==24.10.*, which is not installed.
cuml 24.10.0 requires nvidia-cublas, which is not installed.
cuml 24.10.0 requires nvidia-cufft, which is not installed.
cuml 24.10.0 requires nvidia-curand, which is not installed.
cuml 24.10.0 requires nvidia-cusolver, which is not installed.
cuml 24.10.0 requires nvidia-cusparse, which is not installed.
dask-cudf 24.10.1 requires cupy-cuda11x>=12.0.0, which is not installed.
pylibcudf 24.10.1 requires libcudf==24.10.*, wh

In [2]:
!pip install pinecone-client



In [3]:
import os
from kaggle_secrets import UserSecretsClient

# Read the API keys from Kaggle Secrets instead of hardcoding them in the notebook.
# Assumption: the secrets are stored under the labels "GROQ_API_KEY" and "PINECONE_API_KEY".
user_secrets = UserSecretsClient()
os.environ["GROQ_API_KEY"] = user_secrets.get_secret("GROQ_API_KEY")
os.environ["PINECONE_API_KEY"] = user_secrets.get_secret("PINECONE_API_KEY")
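
Because the later cells rely on both environment variables being present, a fail-fast check can surface a missing credential before any model or index is touched. This is a small sketch under that assumption; `missing_env_vars` is a hypothetical helper, not part of any library.

```python
import os


def missing_env_vars(names):
    """Return the variables from `names` that are unset or empty."""
    return [name for name in names if not os.environ.get(name)]


missing = missing_env_vars(["GROQ_API_KEY", "PINECONE_API_KEY"])
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
else:
    print("All credentials are set.")
```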

In [4]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import FastEmbedEmbeddings
from langchain.vectorstores import Chroma
from langchain.vectorstores.utils import filter_complex_metadata
from langchain.prompts import PromptTemplate
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_groq import ChatGroq
from langchain.memory import ConversationSummaryMemory
from langchain_pinecone import Pinecone
import uuid

# **PDF Processor Class**
**The `PDFProcessor` class is designed to:**

1. Load a PDF document.
2. Split its text into smaller, overlapping chunks to preserve context.
3. Generate embeddings for the chunks with the open-source `BAAI/bge-base-en-v1.5` model.
4. Store the embeddings in a Pinecone index for efficient similarity-based retrieval.
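
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is only an illustration: `split_with_overlap` is a hypothetical function, and `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraphs and newlines, which this sketch does not attempt.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut `text` into fixed-width chunks whose edges overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.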



In [5]:
class PDFProcessor:
    def __init__(self, pdf_path, index_name="cvs", embedding_model="BAAI/bge-base-en-v1.5", chunk_size=1000, chunk_overlap=100):
        self.pdf_path = pdf_path
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.embeddings = FastEmbedEmbeddings(model_name=embedding_model)
    
        # Initialize Pinecone vector store
        self.vector_db =  Pinecone.from_existing_index(
            index_name=index_name,
            embedding=self.embeddings,
            text_key="text" 
        )

    
    def read_pdf(self):
        """Reads the PDF document."""
        # loader = PyPDFLoader(self.pdf_path, extract_images=True)
        loader = UnstructuredFileLoader(self.pdf_path)
        self.pages = loader.load()
        return self.pages

    def split_document(self):
        """Splits the document into smaller chunks."""
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap
        )
        self.texts = text_splitter.split_documents(self.pages)
        return self.texts

    def perform_embedding(self):
        """Generates embeddings and stores them in Pinecone with the PDF file name as metadata."""
        # Create metadata for each chunk (e.g., including the PDF file name)
        metadatas = [{"source": self.pdf_path, "id": str(uuid.uuid4())} for _ in self.texts]

        # Add text chunks and their metadata to the Pinecone index
        texts = [doc.page_content for doc in self.texts]  # Extract text content from Document objects
        self.vector_db.add_texts(texts=texts, metadatas=metadatas)
        return self.vector_db


    def prepare_pdf(self):
        self.read_pdf()
        print("reading pdf is done")
        self.split_document()
        print("spliting and storing pdf is done")
        self.perform_embedding()
        print("embedding is done")

# **Conversational Chain Setup**
**The conversational chain combines:**

1. Memory for maintaining chat history.
2. A retriever for fetching the most relevant context.
3. A language model (LLM) for generating answers.
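
The retriever step can be pictured as a metadata filter followed by a similarity ranking. The sketch below uses plain Python with made-up two-dimensional vectors and file names; `retrieve` and the record layout are hypothetical and do not reflect how Pinecone stores or searches data internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, records, source, k=3):
    """Return the k records from `source` most similar to the query."""
    candidates = [r for r in records if r["source"] == source]
    candidates.sort(
        key=lambda r: cosine_similarity(query_vector, r["vector"]),
        reverse=True,
    )
    return candidates[:k]


# Toy data: the vectors are invented for illustration, not real embeddings.
records = [
    {"text": "Python and Django projects", "source": "cv_a.pdf", "vector": [0.9, 0.1]},
    {"text": "Power BI dashboards", "source": "cv_a.pdf", "vector": [0.2, 0.8]},
    {"text": "Recruiting experience", "source": "cv_b.pdf", "vector": [1.0, 0.0]},
]
top = retrieve([1.0, 0.0], records, source="cv_a.pdf", k=1)
print([r["text"] for r in top])
# ['Python and Django projects']
```

The record from `cv_b.pdf` matches the query vector exactly but is excluded by the source filter, which is the behaviour wanted when each CV should be queried in isolation.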


**Workflow of the `RagChain` Class**

1. **Initialize Components**:
   - Set up `vector_db` for context retrieval.
   - Configure `groq_client` with the specified language model (default: `llama-3.3-70b-versatile`).
   - Initialize `ConversationSummaryMemory` to retain a running summary of the chat history.

2. **Create QA Chain**:
   - **Define Prompt**: Construct system and human templates for clear instructions and user input.
   - **Setup Retriever**: Use `vector_db` to fetch the most relevant context pieces, filtered by the `source` metadata of the chosen PDF.
   - **Build Chain**: Combine the LLM, retriever, memory, and prompt into a RetrievalQA chain with "stuff" merging.

3. **Ask Questions**:
   - Query the system with a user-provided question.
   - Utilize `chat_history` from memory for contextual responses.
   - Return the generated response.
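
The "stuff" strategy mentioned above simply places all retrieved chunks into the prompt's `{context}` slot in one go. The following sketch shows the idea in plain Python; `build_stuff_prompt`, the shortened template, and the blank-line separator are assumptions made for illustration rather than LangChain's actual implementation.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of information to answer the user's question.\n"
    "Context: {context}"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join every chunk into one context string and pair it with the question."""
    context = separator.join(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = build_stuff_prompt(
    ["Chunk one.", "Chunk two."],
    "What is the document about?",
)
for role, content in messages:
    print(f"[{role}] {content}")
```

Because every chunk is sent in a single request, the approach is simple but the combined context must fit within the model's context window.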

In [6]:
class RagChain:
    def __init__(self, source_name, index_name="cvs", embedding_model="BAAI/bge-base-en-v1.5", model_name="llama-3.3-70b-versatile"):
        self.model_name = model_name
        self.groq_client = ChatGroq(temperature=0, model_name=model_name)
        # Summary memory condenses earlier turns using the same LLM
        self.memory = ConversationSummaryMemory(memory_key="chat_history", return_messages=True, llm=self.groq_client)
        self.embeddings = FastEmbedEmbeddings(model_name=embedding_model)
        self.source_name = source_name
        # Initialize Pinecone vector store
        self.vector_db = Pinecone.from_existing_index(
            index_name=index_name,
            embedding=self.embeddings,
            text_key="text"
        )
        self.conversational_chain = self.create_qa_chain()

    def create_qa_chain(self):
        """Creates a RetrievalQA chain."""

        # ChatGPT-style template using ChatPromptTemplate
        system_template = """
        Use the following pieces of information to answer the user's question.
        If you don't know the answer, just say that you don't know, don't try to make up an answer.
        Answer the question and provide additional helpful information,
        based on the pieces of information, if applicable. Be succinct.
        Responses should be properly formatted to be easily read.

        Context: {context}
        """

        system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)
        human_template = "{question}"
        human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

        chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

        # Create the conversational chain with the template
        conversational_chain = RetrievalQA.from_chain_type(
            llm=self.groq_client,
            retriever=self.vector_db.as_retriever(
                # Restrict retrieval to chunks whose "source" metadata matches this PDF
                search_kwargs={"filter": {"source": self.source_name}}
            ),
            memory=self.memory,
            chain_type="stuff",
            chain_type_kwargs={"prompt": chat_prompt}
        )
        return conversational_chain

    def ask_question(self, question):
        """Queries the chain and returns the response."""
        try:
            # Run the chain and return the result
            return self.conversational_chain.invoke({"query": question})
        except Exception as e:
            print(f"Error while querying the chain: {e}")
            return "An error occurred while processing your question."
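
Note that `ask_question` returns the chain's dictionary on success and a plain string on failure. A caller that only wants the answer text could normalize both cases with a small helper such as the hypothetical `extract_answer` below, which assumes the answer sits under the `result` key as in the outputs shown in this notebook.

```python
def extract_answer(response):
    """Return the answer text from either a chain result or an error string."""
    if isinstance(response, dict):
        # Successful calls return a dict with the answer under "result".
        return response.get("result", "")
    # Failed calls return the fallback error message as a plain string.
    return str(response)


print(extract_answer({"query": "q", "result": "The document is a resume."}))
print(extract_answer("An error occurred while processing your question."))
```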


# **Bring It All Together**

In [7]:
pdf_processor = PDFProcessor(pdf_path="/kaggle/input/cvs-folder/SW_MLEngineer_OmarMarie.pdf")
pdf_processor.prepare_pdf()

reading pdf is done
spliting and storing pdf is done
embedding is done


In [8]:
# Initialize and use the RetrievalQA chain
rag_chain = RagChain(source_name="/kaggle/input/cvs-folder/SW_MLEngineer_OmarMarie.pdf")

In [9]:
# Ask the first question
response = rag_chain.ask_question("what is the documnet about?")
response


{'query': 'what is the documnet about?',
 'chat_history': [SystemMessage(content='')],
 'result': "**Document Overview**\n\nThe document appears to be a portfolio or resume of a developer's work experience and projects. It highlights their skills and accomplishments in various areas, including:\n\n* Deep learning and machine learning\n* Web development (front-end and back-end)\n* Real-time applications\n* Data analysis and visualization\n* Image and object detection\n\nThe document lists several projects, including:\n\n* Glaucoma and cataract detection using deep learning\n* Real-time chat application\n* Campgrounds web application\n* E-commerce web application\n* Automatic number plate recognition\n* Head pose estimation\n* Job market data analysis\n\nOverall, the document showcases the developer's technical skills and experience in building a wide range of applications and systems."}

In [10]:
# Ask a follow-up question
response = rag_chain.ask_question("what is his skills?")
response


{'query': 'what is his skills?',
 'chat_history': [SystemMessage(content="The human asks what the document is about, and the AI responds that the document appears to be a portfolio or resume of a developer's work experience and projects, highlighting their skills and accomplishments in areas such as deep learning, web development, and data analysis, and listing various projects including glaucoma detection, real-time chat applications, and e-commerce web applications, showcasing the developer's technical skills and experience.")],
 'result': '**Technical Skills:**\n\n1. **Programming Languages:** \n   - Python (Data Scientist With Python Track Datacamp)\n   - React (React Nano-degree Udacity)\n   - Django (Virtual Assistant project)\n\n2. **Data Analysis and Visualization:**\n   - Power BI (Microsoft Power BI Data Analyst Certified)\n   - Kibana\n   - Dash (Plotly)\n   - Seaborn\n   - ETL processes\n   - Star Schema\n\n3. **Machine Learning:**\n   - AWS Machine Learning Engineer Nano-d

In [11]:
# Ask another follow-up question
response = rag_chain.ask_question("is he will suitable for hr position?")
response


{'query': 'is he will suitable for hr position?',
 'chat_history': [SystemMessage(content="The human asks what the document is about, and the AI responds that the document appears to be a portfolio or resume of a developer's work experience and projects, highlighting their skills and accomplishments in areas such as deep learning, web development, and data analysis, and listing various projects including glaucoma detection, real-time chat applications, and e-commerce web applications, showcasing the developer's technical skills and experience. The human then inquires about the developer's skills, and the AI lists the developer's technical skills, including programming languages such as Python, React, and Django, data analysis and visualization tools like Power BI and Kibana, machine learning expertise with AWS and Coursera certifications, database and big data skills including Apache NIFI and Apache Spark, design tools like Figma and Adobe XD, and speech recognition skills with Wav2Lip