# **<div align="center">Personalized Reading Comprehension Assistant</div>**
* **Team ID:** `214g1a3380@srit.ac.in`
* **Members Details**:
  * 1. **Name:** Rama Krishna Chaithanya Seela       
        **Mail ID:** `214g1a3380@srit.ac.in`
  * 2. **Name:** Venkata Sai Kumar Vaileti       
      **Mail ID:** `214g1a33c0@srit.ac.in`



# Problem Statement:  
Students face challenges in understanding complex texts
and often struggle with reading comprehension, affecting their academic performance.

---



# Solution:
We build a reading comprehension tool using LLaMA that generates
questions, provides summaries, and explains texts tailored to each student's reading level.

# Required Installations

In [26]:
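!pip install InstructorEmbedding
!pip install sentence-transformers==2.2.2
# CPU build of FAISS; install faiss-gpu instead (not both) if GPU search is needed
!pip install faiss-cpu
!pip install markdown
!pip install pdfminer.six
!pip install langchain==0.2.11 langchain-community==0.2.10 langchain-core==0.2.26 langchain-google-community==1.0.7 langchain-google-vertexai==1.0.8 langchain-text-splitters==0.2.1
# python-docx provides the `docx` module used for .docx extraction
!pip install gradio PyPDF2 python-docx
# llama-cpp-python backs LlamaCpp; huggingface_hub downloads the model file
!pip install llama-cpp-python huggingface_hub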




# Necessary Imports

In [27]:
import os
import gradio as gr
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import LlamaCpp
# Needed by the chunking code below
from langchain_text_splitters import RecursiveCharacterTextSplitter
from PyPDF2 import PdfReader
from huggingface_hub import hf_hub_download
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import (
    StreamingStdOutCallbackHandler
)


# Initializing the embedding model

* **sentence-transformers/paraphrase-MiniLM-L6-v2:** A pre-trained sentence-transformer model that generates embeddings (numerical vector representations) of text. It is tuned for paraphrase detection and semantic search.

* **MiniLM:** Refers to Microsoft's MiniLM, a compact transformer distilled from larger models such as BERT. It is efficient for embedding tasks while maintaining good performance.

* This model is commonly used in NLP tasks such as similarity search, paraphrase detection, and finding semantically related sentences.
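
To make the idea of "semantically related" concrete, here is a minimal sketch of how similarity between embeddings is usually measured. The three-dimensional vectors are hypothetical toy values (the real model outputs 384 dimensions), and the helper names are illustrative only. It also shows why normalizing embeddings is convenient: for unit-length vectors, the plain dot product equals the cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" for three short texts
sound = [0.9, 0.1, 0.3]
vibration = [0.8, 0.2, 0.4]
cooking = [0.1, 0.9, 0.2]

related = cosine_similarity(sound, vibration)
unrelated = cosine_similarity(sound, cooking)
print(related > unrelated)

# After normalization, a dot product alone gives the cosine similarity
u, v = l2_normalize(sound), l2_normalize(vibration)
dot_of_unit_vectors = sum(x * y for x, y in zip(u, v))
print(abs(dot_of_unit_vectors - related) < 1e-9)
```

Both checks print `True`: the two sound-related vectors score higher than the unrelated pair, and the normalized dot product matches the cosine value.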



In [28]:

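def initialize_model():
    """Create the sentence-transformer embedding model used for indexing."""
    model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
    # Use 'cpu' here if no GPU is available
    model_kwargs = {'device': 'cuda'}
    # Unit-length embeddings make similarity scores comparable
    encode_kwargs = {'normalize_embeddings': True}
    hf_embedding = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    return hf_embedding

# Shared embedding model used by the indexing code below
hf_embedding = initialize_model()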


# Processing Files:

* **chunk_size:** Each document will be split into chunks of 1200 characters.
* **chunk_overlap:** Consecutive chunks will overlap by 300 characters. This overlap ensures that the content in each chunk has some context from the previous one.
* **all_docs:** A list to store all the processed chunks for each document.
* **allowed_extensions:** A list of file extensions that the function will process. Only DOCX, PDF and TXT files will be processed.

* _**Extraction:**_ The **extract_text_from_file()** function determines the file type from its extension and extracts text accordingly. It reads PDFs with the PyPDF2 library, reads .txt and .md files directly, and reads .docx files with python-docx. It first checks that the file exists, wraps extraction in try-except blocks, and returns a descriptive message for missing files, unsupported formats, or extraction errors.

* The text is split into chunks using the **RecursiveCharacterTextSplitter**. This method is useful when working with large text data, making it easier to process and manage the content. The chunks will be 1200 characters long, with 300-character overlaps between them.
* Each chunk gets metadata associated with it:
    * File Name: The name of the file (without the extension).
    * Chunk Number: The sequence number of the chunk (1st, 2nd, etc.).
* A **header** is created for each chunk that includes the file name and metadata (like chunk number). The header is concatenated with the actual chunk content. Each chunk with the header is added to the all_docs list.
* The function returns the list all_docs, which contains the text chunks for all the files, along with their associated headers and metadata.
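
The chunking and header steps above can be sketched with a plain sliding window. This is a simplified stand-in, not the behavior of **RecursiveCharacterTextSplitter** (which also tries to break on separators such as paragraphs and sentences); the function names `split_with_overlap` and `add_header` are hypothetical.

```python
def split_with_overlap(text, chunk_size=1200, chunk_overlap=300):
    """Fixed-size chunks; each starts chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

def add_header(file_name, chunk_number, chunk):
    """Prefix a chunk with its file name and sequence number."""
    return f"File Name: {file_name}\nChunk Number: {chunk_number}\n{chunk}"

# 3000 characters of sample text
sample = "".join(str(i % 10) for i in range(3000))
chunks = split_with_overlap(sample)
docs = [add_header("notes", i + 1, c) for i, c in enumerate(chunks)]

print(len(chunks))                          # number of chunks
print(chunks[0][-300:] == chunks[1][:300])  # consecutive chunks share 300 characters
```

With a 1200-character chunk and a 300-character overlap, each new chunk begins 900 characters after the previous one, so the 3000-character sample yields three chunks.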

In [29]:

def extract_text_from_pdf(file_path):
    """Extract text from a PDF file."""
    try:
        reader = PdfReader(file_path)
        text = ""
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. scanned images)
            text += page.extract_text() or ""
        return text
    except Exception as e:
        return f"Error reading PDF file: {e}"

def extract_text_from_file(file_path):
    """Extract text from a file based on its extension."""
    try:
        # Check if the file exists
        if not os.path.isfile(file_path):
            return "File not found."

        # Extract text from PDF files
        if file_path.endswith('.pdf'):
            return extract_text_from_pdf(file_path)

        # Extract text from text files
        elif file_path.endswith('.txt'):
            with open(file_path, 'r', encoding='utf-8') as file:
                return file.read()

        # Extract text from Word documents (requires the python-docx package)
        elif file_path.endswith('.docx'):
            from docx import Document
            doc = Document(file_path)
            return "\n".join([para.text for para in doc.paragraphs])

        elif file_path.endswith('.md'):
            with open(file_path, 'r', encoding='utf-8') as file:
                return file.read()

        else:
            return "Unsupported file format."

    except Exception as e:
        return f"Error extracting text from file: {e}"


In [30]:


def process(files, question):
    # `question` is kept for the existing call signature; it is not used here
    chunk_size = 1200
    chunk_overlap = 300

    # List to store all document chunks
    all_docs = []
    allowed_extensions = ['.docx', '.pdf', '.txt']

    # Create the splitter once and reuse it for every file
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    for file in files or []:
        # Gradio may pass plain path strings or temp-file objects exposing `.name`
        file_path = file if isinstance(file, str) else file.name
        filename = os.path.basename(file_path)

        # Get the file extension
        _, file_extension = os.path.splitext(filename)
        if file_extension.lower() in allowed_extensions:
            # Remove the ".docx", ".pdf" or ".txt" extension from the file name
            file_name_without_extension = os.path.splitext(filename)[0]

            # Open and read the file
            file_content = extract_text_from_file(file_path)

            # Split the text into chunks
            docs = text_splitter.split_text(file_content)

            for i, chunk in enumerate(docs):
                # Define metadata for each chunk (you can customize this)
                metadata = {
                    "File Name": file_name_without_extension,
                    "Chunk Number": i + 1,
                }

                # Create a header with metadata and file name
                header = f"File Name: {file_name_without_extension}\n"
                for key, value in metadata.items():
                    header += f"{key}: {value}\n"

                # Combine header, file name, and chunk content
                chunk_with_header = header + file_name_without_extension + "\n" + chunk
                all_docs.append(chunk_with_header)
    return all_docs

# Loading the model:
**"llama-2-7b-chat.Q4_K_M.gguf":** This is the model file to download. "Q4_K_M" indicates a 4-bit quantized variant. Quantization reduces the precision of the weights (e.g., from 16-bit floating point to roughly 4 bits), which shrinks the file and makes inference faster and more efficient on lower-resource devices, at some cost in accuracy.
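
To illustrate what reducing precision to 4 bits means, here is a toy sketch that maps a few floating-point weights to integers in the range -8 to 7 with one shared scale, then maps them back. This is a deliberately simplified scheme chosen for illustration; the actual quantization used in the GGUF file is more elaborate, and the function names and sample weights are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_4bit(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

# Hypothetical sample weights
weights = [0.12, -0.53, 0.98, -0.07, 0.31]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)

# The rounding error per weight is at most half of one quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(max_error <= scale / 2 + 1e-9)
```

Each weight is stored as a small integer plus a shared scale, which is why the file is much smaller than a 16-bit version, and the bounded rounding error is the accuracy cost.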

In [32]:
def loading_model():
    repo_id = "TheBloke/Llama-2-7B-Chat-GGUF"
    filename = "llama-2-7b-chat.Q4_K_M.gguf"

    # Download the model file from Hugging Face Hub
    local_model_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return local_model_path

# Creating a Prompt Template:
* **Prompt Template Initialization:**
  A template is created for generating questions and answers. The initial template is a simple question-answer format, while a more complex template is created later for generating the final response.

* **Model Loading:**
The LlamaCpp model is initialized with specified parameters, allowing it to generate responses based on the context provided.
* **Document Search:**
The user's question is used as the query against the FAISS index of document chunks.
* **Search Context:**
The function searches for relevant documents based on the user query and retrieves the top 5 semantically similar chunks.
* **Final Prompt Formatting:**
A new prompt is generated that combines the context obtained from the search results and the user's question to elicit a specific answer from the model.
* **Response Generation:**
The llm_chain.invoke(final_prompt) method is called to generate the response based on the final formatted prompt.
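
The search and prompt-formatting steps above can be sketched without any model. The two-dimensional vectors, the sample chunks, and the helper names `top_k` and `build_prompt` are hypothetical, and the template is a shortened version of the one used in this notebook; the real pipeline delegates the search to the vector store.

```python
def top_k(query_vec, indexed_chunks, k=5):
    """Return the texts of the k chunks whose vectors score highest against the query."""
    def score(item):
        _, vec = item
        return sum(q * x for q, x in zip(query_vec, vec))
    ranked = sorted(indexed_chunks, key=score, reverse=True)
    return [text for text, _ in ranked[:k]]

TEMPLATE = (
    "Context: {context}\n"
    "Based on the Context, please answer the following question:\n"
    "Question: {question}\n"
    "If the context does not contain relevant information, "
    "state that the answer is not available in the given context."
)

def build_prompt(question, chunks):
    """Join retrieved chunks into one context string and fill the template."""
    context = "\n".join(chunks) if chunks else "No relevant context found."
    return TEMPLATE.format(context=context, question=question)

# Hypothetical index of (chunk text, embedding) pairs
index = [
    ("Sound is produced by vibrating objects.", [0.9, 0.1]),
    ("Plants make food by photosynthesis.", [0.1, 0.9]),
    ("Sound needs a medium to travel.", [0.8, 0.3]),
]
query_vector = [1.0, 0.0]  # stands in for the embedded user question

retrieved = top_k(query_vector, index, k=2)
final_prompt = build_prompt("How is sound produced?", retrieved)
print(final_prompt)
```

Joining the chunk texts into a single string before formatting keeps the prompt readable; passing the raw search results would embed object representations instead of the text.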

In [34]:

def answer_template(question, db, model_path):
    # Template for question-answer prompt
    template = '''Question: {question} \n\nAnswer:'''
    # Initialize prompt template and callback manager
    prompt = PromptTemplate(template=template, input_variables=["question"])
    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

    # Initialize LlamaCpp model from the downloaded model file
    llm = LlamaCpp(model_path=model_path, temperature=0.3, max_tokens=2047, top_p=1, callback_manager=callback_manager, n_ctx=5000)
    # Create LLMChain
    llm_chain = LLMChain(prompt=prompt, llm=llm)

    # Define a template for generating a final prompt
    template = '''Context: {context}
    Based on the Context, please answer the following question:
    Question: {question}
    Provide an answer based on the context only, without using general knowledge. The answer
    should be derived directly from the context provided.
    Please correct any grammatical errors for improved readability.
    If the context does not contain relevant information to answer the question, state that the
    answer is not available in the given context.
    Please include the source title of the information as a reference of how you arrive at your
    answer. '''

    # Create a prompt template
    prompt = PromptTemplate(input_variables=["context", "question"], template=template)

    # Search for semantically similar chunks and keep the top 5
    search = db.similarity_search(question, k=5)
    # Join the chunk texts; passing the raw Document list would embed object reprs
    context = "\n".join([doc.page_content for doc in search]) if search else "No relevant context found."
    final_prompt = prompt.format(question=question, context=context)

    # The chain returns a dict; the generated answer is under "text"
    response = llm_chain.invoke(final_prompt)
    return response["text"]

In [35]:
# Function to load and process PDF or text files


def process_and_answer(files, question):
    # Load all documents and split them into chunks
    all_docs = process(files, question)
    if not all_docs:
        return "No supported documents were uploaded."

    # Build the FAISS index from the chunks
    db = FAISS.from_texts(all_docs, hf_embedding)

    # Save the indexed data locally
    db.save_local("faiss_AiDoc")

    # Added allow_dangerous_deserialization=True to allow loading the pickle file
    db = FAISS.load_local("faiss_AiDoc", embeddings=hf_embedding, allow_dangerous_deserialization=True)

    # Download the model (or reuse the cached copy) and get its local path
    local_model_path = loading_model()

    response = answer_template(question, db, local_model_path)
    return response

# Creating Interface using Gradio:

In [43]:

# Gradio interface
def create_interface():
    # Gradio interface for multiple file uploads and question input
    with gr.Blocks() as demo:
        # file_types entries are extensions and need the leading dot
        file_input = gr.File(label="Upload PDF or TXT", file_types=[".pdf", ".txt"], file_count="multiple")
        question_input = gr.Textbox(label="Enter your question")
        output = gr.Textbox(label="Answer")

        # Add button for submission
        submit_btn = gr.Button("Submit")

        # Link the inputs to the processing function
        submit_btn.click(process_and_answer, inputs=[file_input, question_input], outputs=output)

    # Launch the Gradio interface
    demo.launch()
# Start the Gradio interface
if __name__ == "__main__":
    create_interface()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://08f496ca8153155db2.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


# To run in console:

In [37]:
# Directory containing documents to process
directory = r'/content/'


# Parameters for text splitting
chunk_size = 1200
chunk_overlap = 300

# List to store all document chunks
all_docs = []
allowed_extensions = ['.pdf', '.txt', '.docx']

# Process each file in the directory
for root, dirs, files in os.walk(directory):
    for filename in files:
        # Get the file extension
        _, file_extension = os.path.splitext(filename)
        if file_extension in allowed_extensions:
            file_path = os.path.join(root, filename)  # Full path of the file

            # Remove the ".docx", ".pdf" or ".txt" extension from the file name
            file_name_without_extension = os.path.splitext(filename)[0]

            # Open and read the file
            file_content = extract_text_from_file(file_path)

            # Split the text into chunks
            text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
            docs = text_splitter.split_text(file_content)

            for i, chunk in enumerate(docs):
                # Define metadata for each chunk (you can customize this)
                metadata = {
                    "File Name": file_name_without_extension,
                    "Chunk Number": i + 1,
                }

                # Create a header with metadata and file name
                header = f"File Name: {file_name_without_extension}\n"
                for key, value in metadata.items():
                    header += f"{key}: {value}\n"

                # Combine header, file name, and chunk content
                chunk_with_header = header + file_name_without_extension + "\n" + chunk
                all_docs.append(chunk_with_header)

            print(f"Processed: {filename}")

Processed: iesc111.pdf


In [38]:
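# Build the index from the chunks collected above. `prompt` and `llm_chain`
# must already exist at module level, built as in answer_template().
db = FAISS.from_texts(all_docs, hf_embedding)

while True:
  user_query = input("Ask a question (or type 'exit' to quit): ")
  if user_query.strip().lower() == 'exit':
    break

  search = db.similarity_search(user_query, k=5)
  # Pass the chunk texts, not the Document objects, as context
  context = "\n".join(doc.page_content for doc in search)
  final_prompt = prompt.format(question=user_query, context=context)
  llm_chain.invoke(final_prompt)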

Ask a question (or type 'exit' to quit): What is sound and how is it produced


Llama.generate: 767 prefix-match hit, remaining 1332 prompt tokens to eval


 Based on the context provided, sound is defined as a form of energy that produces a sensation of hearing in our ears. It is produced by vibrating objects, which sets the particles of the medium around it vibrating. These vibrations create pressure variations in the medium, which propagate through the medium as sound waves.
Therefore, the answer to the question "What is sound and how is it produced?" based on the context provided is:
Sound is a form of energy that produces a sensation of hearing in our ears. It is produced by vibrating objects, which sets the particles of the medium around it vibrating, creating pressure variations in the medium as sound waves.

llama_perf_context_print:        load time = 1375648.07 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  1332 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   144 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time = 1075368.99 ms /  1476 tokens


Ask a question (or type 'exit' to quit): exit
