Skip to content

Create rag.py#3

Merged
shouryashashank merged 9 commits into
mainfrom
feature/add-rag
Oct 12, 2024
Merged

Create rag.py#3
shouryashashank merged 9 commits into
mainfrom
feature/add-rag

Conversation

@shouryashashank
Copy link
Copy Markdown
Contributor

@shouryashashank shouryashashank commented Oct 12, 2024

Summary by CodeRabbit

  • New Features
    • Enhanced CLI functionality to interact with vector databases for more dynamic responses.
    • Introduced a WebScraper class for managing web scraping operations, including fetching and parsing HTML content.
    • Added methods for processing text documents, generating unique identifiers for chunks, and integrating with a Chroma vector store.
    • Implemented functionality to load documents from specified directories and update the Chroma database with new chunks.
    • Added a similarity search feature to retrieve relevant results based on queries.
    • Updated configuration options to support new features and improve user interaction.
  • Chores
    • Updated .gitignore to exclude /docs and /vector directories from version control.
    • Incremented package version and updated dependency requirements in setup.py.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Oct 12, 2024

Walkthrough

The changes enhance the Cli class in app/predacons_cli/src/cli.py by adding several methods and improving its interaction with vector databases and web scraping capabilities. A new WebScraper class is introduced in rag.py, which manages web scraping operations, while the existing VectorStore class continues to handle document embeddings and storage. The launch method now includes checks for a vector database, initializing the VectorStore if found, and modifying response generation to utilize this database. Additionally, the .gitignore file is updated to exclude specific directories.

Changes

File Change Summary
app/predacons_cli/src/cli.py Updated Cli class with methods: launch, load_model, clear_config, print_config_data, settings, version, help, check_config_file, create_config_file, load_config_file, and generate_response.
app/predacons_cli/src/rag.py Added WebScraper class with methods: google_search_extract and get_web_data. Updated VectorStore class with enhanced methods for document management and embedding operations.
setup.py Incremented version from 0.0.100 to 0.0.101 and updated dependencies in install_requires.
.gitignore Added entries: /docs and /vector to ignore specific directories.

Possibly related PRs

  • Feature/initial build #2: The changes in the cli.py file enhance the functionality of the Cli class, including modifications to the launch, check_config_file, and generate_response methods, which are also updated in the main PR to improve configuration handling and user interaction with a vector database.

Poem

In the land of code where rabbits play,
A VectorStore hops in, brightening the day.
With chunks of text, it dances and spins,
Adding to Chroma, where the fun begins!
So gather 'round, let the documents flow,
In the world of embeddings, watch our knowledge grow! 🐇✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
app/predacons_cli/src/rag.py (2)

23-28: Use logging instead of print statements for better practice

Using print statements is not recommended in production code. Consider using the logging module to log messages at appropriate severity levels.

Suggested modification:

+import logging

 class VectorStore:
     def split_text(self, documents: list[Document]):
+        logger = logging.getLogger(__name__)
         text_splitter = RecursiveCharacterTextSplitter(
             chunk_size=300,
             chunk_overlap=100,
             length_function=len,
             add_start_index=True,
         )
         chunks = text_splitter.split_documents(documents)
-        print(f"Split {len(documents)} documents into {len(chunks)} chunks.")
+        logger.info(f"Split {len(documents)} documents into {len(chunks)} chunks.")

         if len(chunks) > 10:
             document = chunks[10]
-            print(document.page_content)
-            print(document.metadata)
+            logger.debug(document.page_content)
+            logger.debug(document.metadata)
         else:
-            print("Not enough chunks to display sample content.")
+            logger.info("Not enough chunks to display sample content.")

         return chunks

71-71: Use logging instead of print statements for consistency

Consistently using the logging module enhances maintainability and allows for better control over log levels and outputs.

Suggested modification:

+import logging

 class VectorStore:
     def add_to_chroma(self, chunks: list[Document]):
+        logger = logging.getLogger(__name__)
         # ...

-        print(f"Number of existing documents in DB: {len(existing_ids)}")
+        logger.info(f"Number of existing documents in DB: {len(existing_ids)}")

         # ...

         if len(new_chunks):
-            print(f"👉 Adding new documents: {len(new_chunks)}")
+            logger.info(f"Adding new documents: {len(new_chunks)}")
             # ...

         else:
-            print("✅ No new documents to add")
+            logger.info("No new documents to add")

Also applies to: 80-80, 85-85

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between b6f9eaf and 2b5dce9.

📒 Files selected for processing (1)
  • app/predacons_cli/src/rag.py (1 hunks)
🧰 Additional context used
🔇 Additional comments (1)
app/predacons_cli/src/rag.py (1)

87-90: load_documents method is implemented correctly

The load_documents method properly utilizes DirectoryLoader to load and return documents from the specified path.

Comment thread app/predacons_cli/src/rag.py Outdated
)

# Calculate Page IDs.
chunks_with_ids = self.alculate_chunk_ids(chunks)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix typo in method call: alculate_chunk_ids should be calculate_chunk_ids

There's a typo in the method call self.alculate_chunk_ids(chunks). This will raise an AttributeError because the method doesn't exist. Please correct the method name.

Suggested fix:

     # Calculate Page IDs.
-    chunks_with_ids = self.alculate_chunk_ids(chunks)
+    chunks_with_ids = self.calculate_chunk_ids(chunks)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
chunks_with_ids = self.alculate_chunk_ids(chunks)
chunks_with_ids = self.calculate_chunk_ids(chunks)

Comment thread app/predacons_cli/src/rag.py Outdated
Comment on lines +25 to +28
document = chunks[10]
print(document.page_content)
print(document.metadata)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle potential IndexError when accessing chunks[10]

Accessing chunks[10] without checking its length may result in an IndexError if there are fewer than 11 chunks. To prevent this, consider adding a check before accessing chunks[10].

Suggested fix:

     chunks = text_splitter.split_documents(documents)
     print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

-    document = chunks[10]
-    print(document.page_content)
-    print(document.metadata)
+    if len(chunks) > 10:
+        document = chunks[10]
+        print(document.page_content)
+        print(document.metadata)
+    else:
+        print("Not enough chunks to display sample content.")

     return chunks
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
document = chunks[10]
print(document.page_content)
print(document.metadata)
if len(chunks) > 10:
document = chunks[10]
print(document.page_content)
print(document.metadata)
else:
print("Not enough chunks to display sample content.")

Comment on lines +39 to +55
for chunk in chunks:
source = chunk.metadata.get("source")
page = chunk.metadata.get("page")
current_page_id = f"{source}:{page}"

# If the page ID is the same as the last one, increment the index.
if current_page_id == last_page_id:
current_chunk_index += 1
else:
current_chunk_index = 0

# Calculate the chunk ID.
chunk_id = f"{current_page_id}:{current_chunk_index}"
last_page_id = current_page_id

# Add it to the page meta-data.
chunk.metadata["id"] = chunk_id
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle missing 'source' or 'page' in chunk metadata

The code assumes that chunk.metadata contains 'source' and 'page' keys. If these keys are missing, it could lead to issues when generating current_page_id. Consider adding default values or handling missing metadata to avoid potential errors.

Suggested improvement:

     for chunk in chunks:
-        source = chunk.metadata.get("source")
-        page = chunk.metadata.get("page")
+        source = chunk.metadata.get("source", "unknown_source")
+        page = chunk.metadata.get("page", "unknown_page")
         current_page_id = f"{source}:{page}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for chunk in chunks:
source = chunk.metadata.get("source")
page = chunk.metadata.get("page")
current_page_id = f"{source}:{page}"
# If the page ID is the same as the last one, increment the index.
if current_page_id == last_page_id:
current_chunk_index += 1
else:
current_chunk_index = 0
# Calculate the chunk ID.
chunk_id = f"{current_page_id}:{current_chunk_index}"
last_page_id = current_page_id
# Add it to the page meta-data.
chunk.metadata["id"] = chunk_id
for chunk in chunks:
source = chunk.metadata.get("source", "unknown_source")
page = chunk.metadata.get("page", "unknown_page")
current_page_id = f"{source}:{page}"
# If the page ID is the same as the last one, increment the index.
if current_page_id == last_page_id:
current_chunk_index += 1
else:
current_chunk_index = 0
# Calculate the chunk ID.
chunk_id = f"{current_page_id}:{current_chunk_index}"
last_page_id = current_page_id
# Add it to the page meta-data.
chunk.metadata["id"] = chunk_id

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (1)
app/predacons_cli/src/rag.py (1)

1-108: Overall assessment: Good foundation with room for improvement

The VectorStore class provides a solid foundation for managing document embeddings and storage using the LangChain library. The implementation covers essential functionality such as document loading, text splitting, chunk ID calculation, and similarity search.

Key strengths:

  1. Clear separation of concerns with distinct methods for each step in the process.
  2. Integration with appropriate libraries (LangChain, Chroma) for vector storage and embedding.

Areas for improvement:

  1. Enhance error handling across all methods to improve robustness.
  2. Implement consistent logging throughout the class for better debugging and monitoring.
  3. Address minor issues such as typos and potential edge cases in data handling.
  4. Consider performance optimizations, such as batching for large datasets.

By addressing these points, the VectorStore class will become more robust, maintainable, and performant, providing a reliable foundation for document processing and similarity search operations.

🧰 Tools
🪛 Ruff

107-107: f-string without any placeholders

Remove extraneous f prefix

(F541)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 2b5dce9 and f3bf930.

📒 Files selected for processing (1)
  • app/predacons_cli/src/rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/rag.py

107-107: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (5)
app/predacons_cli/src/rag.py (5)

1-5: LGTM: Imports are appropriate and concise.

The import statements are well-organized and include all necessary modules for the VectorStore class functionality.


7-13: LGTM: Class initialization is well-structured.

The __init__ method properly initializes the VectorStore class with necessary parameters and sets up the embedding model correctly.


15-29: 🛠️ Refactor suggestion

Consider additional error handling and logging improvements.

The split_text method effectively splits documents into chunks. However, as previously noted, there's a potential IndexError when accessing chunks[10]. In addition to addressing that issue, consider the following improvements:

  1. Add error handling for the text splitting process.
  2. Use logging instead of print statements for better integration with the application's logging system.

Example improvement:

import logging

def split_text(self, documents: list[Document]):
    try:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=300,
            chunk_overlap=100,
            length_function=len,
            add_start_index=True,
        )
        chunks = text_splitter.split_documents(documents)
        logging.info(f"Split {len(documents)} documents into {len(chunks)} chunks.")

        if len(chunks) > 10:
            sample_chunk = chunks[10]
            logging.debug(f"Sample chunk content: {sample_chunk.page_content}")
            logging.debug(f"Sample chunk metadata: {sample_chunk.metadata}")
        else:
            logging.warning("Not enough chunks to display sample content.")

        return chunks
    except Exception as e:
        logging.error(f"Error splitting documents: {str(e)}")
        raise

31-57: 🛠️ Refactor suggestion

Enhance robustness of chunk ID calculation.

The calculate_chunk_ids method effectively generates unique IDs for chunks. As previously noted, there's a potential issue with missing 'source' or 'page' in chunk metadata. In addition to addressing that, consider the following improvements for increased robustness:

  1. Use uuid.uuid4() as a fallback for truly unique IDs when metadata is incomplete.
  2. Sanitize the source and page values to ensure they don't contain characters that could cause issues in the ID string.

Example improvement:

import uuid
import re

def calculate_chunk_ids(self, chunks):
    last_page_id = None
    current_chunk_index = 0

    def sanitize(value):
        return re.sub(r'[^a-zA-Z0-9_-]', '_', str(value))

    for chunk in chunks:
        source = sanitize(chunk.metadata.get("source", "unknown_source"))
        page = sanitize(chunk.metadata.get("page", "unknown_page"))
        current_page_id = f"{source}:{page}"

        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0

        chunk_id = f"{current_page_id}:{current_chunk_index}"
        
        # Fallback to UUID if the generated ID is not unique
        if not chunk_id or chunk_id == ":":
            chunk_id = str(uuid.uuid4())

        last_page_id = current_page_id
        chunk.metadata["id"] = chunk_id

    return chunks

This implementation ensures that even with missing or problematic metadata, each chunk will receive a unique ID.


59-85: 🛠️ Refactor suggestion

⚠️ Potential issue

Fix typo and enhance error handling in add_to_chroma method.

As previously noted, there's a typo in the method call self.alculate_chunk_ids(chunks). In addition to fixing this typo, consider the following improvements:

  1. Add error handling for database operations.
  2. Use logging instead of print statements for consistency.
  3. Consider batching the addition of new chunks for better performance with large datasets.

Example improvement:

import logging
from typing import List
from langchain.schema import Document

def add_to_chroma(self, chunks: List[Document]):
    try:
        db = Chroma(
            persist_directory=self.chroma_path, embedding_function=self.embedder
        )

        chunks_with_ids = self.calculate_chunk_ids(chunks)  # Fix typo here

        existing_items = db.get(include=[])
        existing_ids = set(existing_items["ids"])
        logging.info(f"Number of existing documents in DB: {len(existing_ids)}")

        new_chunks = [chunk for chunk in chunks_with_ids if chunk.metadata["id"] not in existing_ids]

        if new_chunks:
            logging.info(f"Adding new documents: {len(new_chunks)}")
            new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
            
            # Consider batching for large datasets
            batch_size = 1000
            for i in range(0, len(new_chunks), batch_size):
                batch = new_chunks[i:i+batch_size]
                batch_ids = new_chunk_ids[i:i+batch_size]
                db.add_documents(batch, ids=batch_ids)
            
            db.persist()
        else:
            logging.info("No new documents to add")

    except Exception as e:
        logging.error(f"Error adding documents to Chroma: {str(e)}")
        raise

This implementation fixes the typo, adds error handling, uses logging, and introduces batching for better performance with large datasets.

Comment thread app/predacons_cli/src/rag.py Outdated
Comment on lines +92 to +96
def load_and_update_db(self):
documents = self.load_documents()
chunks = self.split_text(documents)
self.add_to_chroma(chunks)
print("✅ Done")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance load_and_update_db method with error handling and detailed logging.

The load_and_update_db method effectively orchestrates the document processing pipeline. However, it could benefit from improved error handling and more detailed logging. Consider the following improvements:

import logging

def load_and_update_db(self):
    try:
        logging.info("Starting database update process")
        documents = self.load_documents()
        logging.info(f"Loaded {len(documents)} documents")

        chunks = self.split_text(documents)
        logging.info(f"Split documents into {len(chunks)} chunks")

        self.add_to_chroma(chunks)
        logging.info("Completed adding chunks to Chroma database")

    except Exception as e:
        logging.error(f"Error in load_and_update_db: {str(e)}")
        raise
    else:
        logging.info("Database update process completed successfully")

This implementation adds error handling and more detailed logging throughout the process, providing better visibility into the progress and any potential issues during the database update.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Outside diff range and nitpick comments (1)
app/predacons_cli/src/cli.py (1)

59-59: Fix typo in print statement

There's a typo in the print statement: "Continueing" should be "Continuing".

Apply this diff to correct the typo:

 print("[red]Vector DB not found![/red]")
-print("[yellow]Continueing with simple chat[/yellow]")
+print("[yellow]Continuing with simple chat[/yellow]")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between f3bf930 and 22d007b.

📒 Files selected for processing (2)
  • app/predacons_cli/src/cli.py (3 hunks)
  • app/predacons_cli/src/rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/rag.py

106-106: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (4)
app/predacons_cli/src/rag.py (3)

1-5: LGTM: Import statements are appropriate and concise.

The import statements are well-organized and include all necessary modules for the VectorStore class functionality.


7-12: LGTM: VectorStore class initialization is well-structured.

The class initialization is flexible with the optional model parameter and correctly sets up the necessary attributes for the VectorStore functionality.


1-107: Overall assessment: Well-structured implementation with room for improvements

The VectorStore class provides a comprehensive set of methods for managing document embeddings and storage using the LangChain library. The overall structure and logic of the class are sound and align well with the intended functionality.

Key strengths:

  1. Clear separation of concerns across methods
  2. Effective use of LangChain components
  3. Thoughtful implementation of document chunking and ID generation

Suggested improvements:

  1. Enhance error handling across all methods
  2. Implement consistent logging throughout the class
  3. Address minor issues like typos and unnecessary f-strings
  4. Consider adding type hints for better code clarity

By incorporating these suggestions, the VectorStore class will become more robust, maintainable, and production-ready. Great job on the initial implementation!

🧰 Tools
🪛 Ruff

106-106: f-string without any placeholders

Remove extraneous f prefix

(F541)

app/predacons_cli/src/cli.py (1)

11-11: Import statement is appropriate

The import of VectorStore from the rag module appears correct and necessary for the added functionality.

Comment thread app/predacons_cli/src/rag.py Outdated
Comment on lines +15 to +29
def split_text(self,documents: list[Document]):
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=300,
chunk_overlap=100,
length_function=len,
add_start_index=True,
)
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

document = chunks[10]
print(document.page_content)
print(document.metadata)

return chunks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Improve error handling and use logging in split_text method.

The split_text method looks good overall, but there are two areas for improvement:

  1. Handle potential IndexError:
    When accessing chunks[10], add a check to ensure there are at least 11 chunks to avoid an IndexError.

  2. Use logging instead of print statements:
    Replace print statements with appropriate logging calls for better production-ready code.

Here's a suggested improvement:

import logging

def split_text(self, documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    logging.info(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    if len(chunks) > 10:
        sample_chunk = chunks[10]
        logging.debug(f"Sample chunk content: {sample_chunk.page_content}")
        logging.debug(f"Sample chunk metadata: {sample_chunk.metadata}")
    else:
        logging.warning("Not enough chunks to display a sample.")

    return chunks

This implementation addresses the potential IndexError and uses logging for better production readiness.

Comment thread app/predacons_cli/src/rag.py Outdated
Comment on lines +31 to +57
def calculate_chunk_ids(self,chunks):

# This will create IDs like "data/monopoly.pdf:6:2"
# Page Source : Page Number : Chunk Index

last_page_id = None
current_chunk_index = 0

for chunk in chunks:
source = chunk.metadata.get("source")
page = chunk.metadata.get("page")
current_page_id = f"{source}:{page}"

# If the page ID is the same as the last one, increment the index.
if current_page_id == last_page_id:
current_chunk_index += 1
else:
current_chunk_index = 0

# Calculate the chunk ID.
chunk_id = f"{current_page_id}:{current_chunk_index}"
last_page_id = current_page_id

# Add it to the page meta-data.
chunk.metadata["id"] = chunk_id

return chunks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle potential missing metadata in calculate_chunk_ids method.

The calculate_chunk_ids method effectively generates unique IDs for chunks. However, it assumes that 'source' and 'page' keys always exist in the chunk metadata. To improve robustness, consider handling cases where these keys might be missing.

Here's a suggested improvement:

def calculate_chunk_ids(self, chunks):
    last_page_id = None
    current_chunk_index = 0

    for chunk in chunks:
        source = chunk.metadata.get("source", "unknown_source")
        page = chunk.metadata.get("page", "unknown_page")
        current_page_id = f"{source}:{page}"

        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0

        chunk_id = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id

        chunk.metadata["id"] = chunk_id

    return chunks

This implementation uses the get method with default values to handle potential missing keys in the metadata.

Comment thread app/predacons_cli/src/rag.py Outdated
Comment on lines +59 to +86
def add_to_chroma(self,chunks: list[Document]):
# Load the existing database.
db = Chroma(
persist_directory=self.chroma_path, embedding_function=self.embedder
)

# Calculate Page IDs.
chunks_with_ids = self.alculate_chunk_ids(chunks)

# Add or Update the documents.
existing_items = db.get(include=[]) # IDs are always included by default
existing_ids = set(existing_items["ids"])
print(f"Number of existing documents in DB: {len(existing_ids)}")

# Only add documents that don't exist in the DB.
new_chunks = []
for chunk in chunks_with_ids:
if chunk.metadata["id"] not in existing_ids:
new_chunks.append(chunk)

if len(new_chunks):
print(f"👉 Adding new documents: {len(new_chunks)}")
new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
db.add_documents(new_chunks, ids=new_chunk_ids)
db.persist()
else:
print("✅ No new documents to add")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix typo and improve logging in add_to_chroma method.

The add_to_chroma method effectively adds new chunks to the Chroma database. However, there are two issues to address:

  1. There's a typo in the method call: self.alculate_chunk_ids should be self.calculate_chunk_ids.
  2. The logging can be improved for consistency and clarity.

Here's a suggested improvement:

import logging

def add_to_chroma(self, chunks: list[Document]):
    db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder)

    chunks_with_ids = self.calculate_chunk_ids(chunks)  # Fix typo here

    existing_items = db.get(include=[])
    existing_ids = set(existing_items["ids"])
    logging.info(f"Number of existing documents in DB: {len(existing_ids)}")

    new_chunks = [chunk for chunk in chunks_with_ids if chunk.metadata["id"] not in existing_ids]

    if new_chunks:
        logging.info(f"Adding {len(new_chunks)} new documents")
        new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
        db.add_documents(new_chunks, ids=new_chunk_ids)
        db.persist()
    else:
        logging.info("No new documents to add")

This implementation fixes the typo and uses consistent logging throughout the method.

Comment on lines +87 to +90
def load_documents(self):
loader = DirectoryLoader(self.document_path)
documents = loader.load()
return documents
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance load_documents method with error handling and logging.

While the load_documents method is concise, it could benefit from error handling and logging for improved robustness and debugging.

Consider the following improvement:

import logging
from langchain_community.document_loaders import DirectoryLoader

def load_documents(self):
    try:
        loader = DirectoryLoader(self.document_path)
        documents = loader.load()
        logging.info(f"Successfully loaded {len(documents)} documents from {self.document_path}")
        return documents
    except Exception as e:
        logging.error(f"Error loading documents from {self.document_path}: {str(e)}")
        raise

This implementation adds error handling and logging, providing more context about the document loading process and any potential issues.

Comment on lines +92 to +95
def load_and_update_db(self):
documents = self.load_documents()
chunks = self.split_text(documents)
self.add_to_chroma(chunks)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance load_and_update_db method with error handling and logging.

The load_and_update_db method effectively orchestrates the document processing pipeline. However, it could benefit from error handling and logging for improved robustness and debugging.

Consider the following improvement:

import logging

def load_and_update_db(self):
    try:
        logging.info("Starting database update process")
        documents = self.load_documents()
        logging.info(f"Loaded {len(documents)} documents")

        chunks = self.split_text(documents)
        logging.info(f"Split documents into {len(chunks)} chunks")

        self.add_to_chroma(chunks)
        logging.info("Completed adding chunks to Chroma database")

    except Exception as e:
        logging.error(f"Error in load_and_update_db: {str(e)}")
        raise
    else:
        logging.info("Database update process completed successfully")

This implementation adds error handling and logging throughout the process, providing better visibility into the progress and any potential issues during the database update.

Comment on lines +97 to +99
def load_db(self):
db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder)
return db
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling to load_db method.

The load_db method is concise, but it could benefit from error handling to improve robustness.

Consider the following improvement:

import logging
from langchain_community.vectorstores import Chroma

def load_db(self):
    try:
        db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder)
        logging.info(f"Successfully loaded Chroma database from {self.chroma_path}")
        return db
    except Exception as e:
        logging.error(f"Error loading Chroma database from {self.chroma_path}: {str(e)}")
        raise

This implementation adds error handling and logging, providing more context about the database loading process and any potential issues.

Comment thread app/predacons_cli/src/rag.py Outdated
Comment on lines +101 to +107
def get_similar(self, query,db =None, top_n = 5,similarity_threshold = 0.1):
if db is None:
db = self.load_db()
results = db.similarity_search_with_relevance_scores(query, k=top_n)
if len(results) == 0 or results[0][1] < similarity_threshold:
print(f"Unable to find matching results.")
return results No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Enhance get_similar method and address f-string issue.

The get_similar method performs a similarity search effectively, but there are a few improvements we can make:

  1. Address the f-string without placeholders (F541 error).
  2. Add error handling for the database operations.
  3. Improve the logging of results.
  4. Consider returning a more structured result.

Here's an improved version:

import logging
from typing import List, Tuple, Any

def get_similar(self, query: str, db=None, top_n: int = 5, similarity_threshold: float = 0.1) -> List[Tuple[Any, float]]:
    try:
        if db is None:
            db = self.load_db()
        
        results = db.similarity_search_with_relevance_scores(query, k=top_n)
        
        filtered_results = [result for result in results if result[1] >= similarity_threshold]
        
        if not filtered_results:
            logging.info("No results found above the similarity threshold.")
        else:
            logging.info(f"Found {len(filtered_results)} relevant results.")
        
        return filtered_results
    except Exception as e:
        logging.error(f"Error in similarity search: {str(e)}")
        raise

This implementation addresses the f-string issue, adds error handling, improves logging, and returns a filtered list of results. The method now returns a more structured result that can be easily processed by the calling code.

🧰 Tools
🪛 Ruff

106-106: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment thread app/predacons_cli/src/cli.py Outdated
Comment on lines +97 to +102
if config["chat_with_data"]:
# get response from vector store
response = vector_store.get_response(user_input)
response_body = {"role": "assistant", "content": response}
chat.append(response_body)
print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle potential use of uninitialized 'vector_store'

If the vector database is not found or required configurations are missing, vector_store may not be initialized, leading to a NameError when calling vector_store.get_response(user_input). To prevent this, ensure vector_store is initialized properly or adjust the logic to handle cases when it's not available.

Apply this diff to adjust the response generation:

 else:
     print("[red]Vector DB not found![/red]")
     print("[yellow]Continuing with simple chat[/yellow]")
+    vector_store = None

 # In the chat loop
 if user_input == "exit":
     return
 elif user_input == "clear":
     chat = []
     print("[yellow]Chat history cleared![/yellow]")
     ...
 else:
-    if config["chat_with_data"]:
+    if config["chat_with_data"] and vector_store is not None:
         # get response from vector store
         response = vector_store.get_response(user_input)
         response_body = {"role": "assistant", "content": response}
         chat.append(response_body)
         print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]")
+    else:
+        response = Cli.generate_response(self, chat, model, tokenizer, config)
+        response_body = {"role": "assistant", "content": response}
+        chat.append(response_body)
+        if config["print_as_markdown"]:
+            markdown = Markdown(response)
+            print("[orange1]Predacons: [/orange1]")
+            console.print(markdown)
+        else:
+            console.print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]")

Committable suggestion was skipped due to low confidence.


⚠️ Potential issue

Avoid redundant response generation

Currently, even when a response is obtained from vector_store, the code still calls Cli.generate_response, leading to multiple responses for a single user input. Consider adjusting the logic to prevent redundant response generation and ensure only one response is provided to the user.

Apply this diff to modify the response flow:

 else:
     if config["chat_with_data"] and vector_store is not None:
         # get response from vector store
         response = vector_store.get_response(user_input)
+        if response:
+            response_body = {"role": "assistant", "content": response}
+            chat.append(response_body)
+            print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]")
+            continue  # Skip generating response from the model
     response = Cli.generate_response(self, chat, model, tokenizer, config)
     response_body = {"role": "assistant", "content": response}
     chat.append(response_body)
     if config["print_as_markdown"]:
         markdown = Markdown(response)
         print("[orange1]Predacons: [/orange1]")
         console.print(markdown)
     else:
         console.print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]")

Committable suggestion was skipped due to low confidence.

Comment thread app/predacons_cli/src/cli.py Outdated
print("[yellow]Loading data from vector DB[/yellow]")
# load data from vector db

vector_store = VectorStore(config["vector_db_path"], config["document_path"], config["embedding_model"])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Ensure required configuration keys are present before initializing VectorStore

When initializing VectorStore, the code uses config["vector_db_path"], config["document_path"], and config["embedding_model"]. Without prior checks, missing keys can result in a KeyError. Consider verifying that these keys exist and handling missing values appropriately.

Apply this diff to check for required configuration keys:

 print("[yellow]Loading data from vector DB[/yellow]")
+# Verify required configuration keys
+required_keys = ["vector_db_path", "document_path", "embedding_model"]
+missing_keys = [key for key in required_keys if key not in config]
+if missing_keys:
+    print(f"[red]Missing configuration keys: {', '.join(missing_keys)}[/red]")
+    print("[yellow]Continuing with simple chat[/yellow]")
+    vector_store = None
+else:
     # load data from vector db
     vector_store = VectorStore(config["vector_db_path"], config["document_path"], config["embedding_model"])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
vector_store = VectorStore(config["vector_db_path"], config["document_path"], config["embedding_model"])
# Verify required configuration keys
required_keys = ["vector_db_path", "document_path", "embedding_model"]
missing_keys = [key for key in required_keys if key not in config]
if missing_keys:
print(f"[red]Missing configuration keys: {', '.join(missing_keys)}[/red]")
print("[yellow]Continuing with simple chat[/yellow]")
vector_store = None
else:
# load data from vector db
vector_store = VectorStore(config["vector_db_path"], config["document_path"], config["embedding_model"])

Comment thread app/predacons_cli/src/cli.py Outdated

chat = []
print("[yellow]Checking for data sources for the chat[/yellow]")
if config["chat_with_data"]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add 'chat_with_data' and related keys to configuration

The code checks config["chat_with_data"], but this key is not present in the default configuration or in the create_config_file method. This will result in a KeyError if 'chat_with_data' is not defined. Similarly, vector_db_path, document_path, and embedding_model are used but may not be specified.

Apply this diff to add the missing keys to the default configuration:

 def create_config_file(self):
     config = Cli.check_config_file(self)
     file_path = self.config_file_path
     default_config = {
         "model_path": "Precacons/Pico-Lamma-3.2-1B-Reasoning-Instruct",
         "trust_remote_code": False,
         "use_fast_generation": False,
         "draft_model_name": None,
         "gguf_file": None,
         "auto_quantize": False,
         "temperature": 0.3,
         "max_length": 1000,
         "top_k": 50,
         "top_p": 0.9,
         "repetition_penalty": 1.0,
         "num_return_sequences": 1,
         "print_as_markdown": False,
+        "chat_with_data": False,
+        "vector_db_path": "path/to/vector_db",
+        "document_path": "path/to/documents",
+        "embedding_model": "model/name"
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if config["chat_with_data"]:
def create_config_file(self):
config = Cli.check_config_file(self)
file_path = self.config_file_path
default_config = {
"model_path": "Precacons/Pico-Lamma-3.2-1B-Reasoning-Instruct",
"trust_remote_code": False,
"use_fast_generation": False,
"draft_model_name": None,
"gguf_file": None,
"auto_quantize": False,
"temperature": 0.3,
"max_length": 1000,
"top_k": 50,
"top_p": 0.9,
"repetition_penalty": 1.0,
"num_return_sequences": 1,
"print_as_markdown": False,
"chat_with_data": False,
"vector_db_path": "path/to/vector_db",
"document_path": "path/to/documents",
"embedding_model": "model/name"
}

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Outside diff range and nitpick comments (6)
app/predacons_cli/src/rag.py (2)

16-27: Simplify model_id assignment in init

The current assignment of model_id uses a ternary operator that can be simplified. The if model is not None check is redundant since model will be None if not provided.

Consider simplifying the model_id assignment:

-    self.model_id = model if model is not None else None
+    self.model_id = model

This change maintains the same functionality while making the code more concise.


150-175: Implement WebScraper class methods

The WebScraper class is currently a skeleton with empty methods. While the structure and docstrings are in place, the actual implementation is missing.

To complete the implementation, consider the following steps:

  1. Implement the fetch_html method using the requests library to retrieve HTML content from a given URL.
  2. Implement the parse_html method using BeautifulSoup to extract relevant text from the HTML content.
  3. Add error handling and logging to both methods.
  4. Consider adding a method to combine fetching and parsing in a single operation.

Would you like assistance in generating the implementation for these methods?

app/predacons_cli/src/cli.py (4)

11-11: Consider grouping imports for better organization

It's generally a good practice to group imports in the following order: standard library imports, third-party imports, and local imports. Consider moving this import statement to be grouped with other third-party or local imports for better code organization.


24-33: Remove or document commented-out code

Commented-out code can lead to confusion and clutter the codebase. If this GPU-related code is no longer needed, consider removing it entirely. If it's intended for future use, add a clear comment explaining its purpose and when it might be reintroduced, or move it to a separate development branch.


213-217: Approved: New configuration options for vector database

The addition of new configuration options for vector database functionality is well-implemented. The prompts for user input and the inclusion of these options in the config_data dictionary are consistent with the existing code style.

For improved clarity, consider adding brief comments explaining the purpose of each new configuration option, especially for users who might not be familiar with vector databases or embedding models.

Here's a suggested improvement for the default configuration:

default_config = {
    # ... existing options ...
    "chat_with_data": False,  # Enable/disable chat with vector database
    "vector_db_path": None,   # Path to the vector database
    "document_path": None,    # Path to the documents for vector database
    "embedding_model": None   # Name or path of the embedding model to use
}

These comments will help users understand the purpose of each option when they're editing the configuration file directly.

Also applies to: 273-276, 293-296


Line range hint 1-324: Overall improvements and suggestions for the Cli class

The changes to the Cli class significantly enhance its capabilities, particularly with the addition of vector database support. Here are some overall suggestions for further improvement:

  1. Consider breaking down the launch method into smaller, more focused methods. It's currently handling multiple responsibilities and could benefit from better separation of concerns.

  2. Implement proper logging throughout the class instead of using print statements. This will make it easier to debug and monitor the application in various environments.

  3. Add docstrings to new methods and update existing ones to reflect the new functionality.

  4. Consider implementing type hints to improve code readability and catch potential type-related errors early.

  5. Implement unit tests for the new functionality to ensure reliability and ease of maintenance.

Here's an example of how you might start refactoring the launch method:

import logging

class Cli:
    def __init__(self):
        self.logger = logging.getLogger(__name__)
        # ... other initialization ...

    def launch(self, logs=False):
        self.setup_logging(logs)
        config = self.load_or_create_config()
        model, tokenizer = self.load_model(config)
        vector_db = self.setup_vector_db(config)
        self.run_chat_loop(model, tokenizer, vector_db, config)

    def setup_logging(self, logs):
        # Set up logging configuration

    def load_or_create_config(self):
        # Load existing config or create a new one

    def setup_vector_db(self, config):
        # Set up vector database if enabled in config

    def run_chat_loop(self, model, tokenizer, vector_db, config):
        # Main chat loop logic

    # ... other methods ...

This structure would make the code more modular and easier to maintain and test.

🧰 Tools
🪛 Ruff

82-82: Avoid equality comparisons to False; use if not logs: for false checks

Replace with not logs

(E712)


84-84: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 22d007b and 82874d2.

📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • app/predacons_cli/src/cli.py (9 hunks)
  • app/predacons_cli/src/rag.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/rag.py

6-6: requests imported but unused

Remove unused import: requests

(F401)


7-7: bs4.BeautifulSoup imported but unused

Remove unused import: bs4.BeautifulSoup

(F401)


147-147: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (1)
app/predacons_cli/src/rag.py (1)

1-176: Summary of review for rag.py

Overall, the rag.py file introduces two important classes: VectorStore and WebScraper. Here's a summary of the main points:

  1. The VectorStore class is well-structured and functional, but could benefit from:

    • Improved error handling across all methods
    • Consistent use of logging instead of print statements
    • Minor optimizations in some methods
  2. The WebScraper class is currently a skeleton and requires implementation of its methods.

  3. There are unused imports at the beginning of the file that should be removed.

Addressing these points will improve the robustness, maintainability, and completeness of the code. Great progress on implementing the RAG (Retrieval-Augmented Generation) functionality!

🧰 Tools
🪛 Ruff

6-6: requests imported but unused

Remove unused import: requests

(F401)


7-7: bs4.BeautifulSoup imported but unused

Remove unused import: bs4.BeautifulSoup

(F401)


147-147: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment on lines +29 to +49
def split_text(self, documents: list[Document]):
"""
Split documents into chunks using RecursiveCharacterTextSplitter.

:param documents: List of Document objects to be split.
:return: List of chunks.
"""
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=300,
chunk_overlap=100,
length_function=len,
add_start_index=True,
)
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

document = chunks[10]
print(document.page_content)
print(document.metadata)

return chunks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Replace print statements with logging in split_text method

Using print statements for debugging in production code is not ideal. Consider using the logging module for better control over output and log levels.

Here's a suggested improvement:

import logging

def split_text(self, documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    logging.info(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    if len(chunks) > 10:
        sample_chunk = chunks[10]
        logging.debug(f"Sample chunk content: {sample_chunk.page_content}")
        logging.debug(f"Sample chunk metadata: {sample_chunk.metadata}")
    else:
        logging.warning("Not enough chunks to display a sample.")

    return chunks

This implementation uses logging instead of print statements and adds a check to avoid IndexError when accessing chunks[10].

Comment thread app/predacons_cli/src/rag.py Outdated
Comment on lines +133 to +148
def get_similar(self, query, db=None, top_n=5, similarity_threshold=0.1):
"""
Retrieve similar documents from the Chroma database based on a query.

:param query: Query string to search for similar documents.
:param db: Optional Chroma database object. If None, the database will be loaded.
:param top_n: Number of top similar documents to retrieve.
:param similarity_threshold: Minimum similarity score threshold.
:return: List of tuples containing similar documents and their relevance scores.
"""
if db is None:
db = self.load_db()
results = db.similarity_search_with_relevance_scores(query, k=top_n)
if len(results) == 0 or results[0][1] < similarity_threshold:
print(f"Unable to find matching results.")
return results
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance get_similar method and address f-string issue

The get_similar method could be improved in several ways:

  1. Address the f-string without placeholders (F541 error).
  2. Add more comprehensive error handling.
  3. Improve logging of results.
  4. Consider returning a more structured result.

Here's an improved version:

import logging
from typing import List, Tuple, Any

def get_similar(self, query: str, db=None, top_n: int = 5, similarity_threshold: float = 0.1) -> List[Tuple[Any, float]]:
    try:
        if db is None:
            db = self.load_db()
        
        results = db.similarity_search_with_relevance_scores(query, k=top_n)
        
        filtered_results = [result for result in results if result[1] >= similarity_threshold]
        
        if not filtered_results:
            logging.info("No results found above the similarity threshold.")
        else:
            logging.info(f"Found {len(filtered_results)} relevant results.")
        
        return filtered_results
    except Exception as e:
        logging.error(f"Error in similarity search: {str(e)}")
        raise

This implementation addresses the f-string issue, adds error handling, improves logging, and returns a filtered list of results. The method now returns a more structured result that can be easily processed by the calling code.

🧰 Tools
🪛 Ruff

147-147: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment on lines +64 to +80
vector_db = None
print("[yellow]Checking for data sources for the chat[/yellow]")
if config.get("chat_with_data", False):
print("[yellow]Chat with data enabled looking for vector db[/yellow]")
if not os.path.exists(config.get("vector_db_path", False)):
print("[red]Vector DB not found![/red]")
print("[yellow]Continueing with simple chat[/yellow]")
else:
print("[green]Vector DB found![/green]")
print("[yellow]Loading data from vector DB[/yellow]")
# load data from vector db

vector_store = VectorStore(config["vector_db_path"], config["document_path"], config.get("embedding_model", None))
# check and update the vector store
vector_store.load_and_update_db()
vector_db = vector_store.load_db()
print("[yellow]Data loaded successfully![/yellow]")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Improve config access consistency and error handling

  1. There's inconsistent use of config.get() vs direct dictionary access. For example, config.get("chat_with_data", False) is used, but config["vector_db_path"] is accessed directly. Consider using config.get() consistently to avoid potential KeyError exceptions.

  2. The error handling for a missing vector database could be improved. Instead of just printing a message and continuing, consider logging the error and/or raising a specific exception that can be caught and handled appropriately.

Here's a suggested improvement:

import logging

# ... (in the launch method)
if config.get("chat_with_data", False):
    print("[yellow]Chat with data enabled looking for vector db[/yellow]")
    vector_db_path = config.get("vector_db_path")
    if not vector_db_path or not os.path.exists(vector_db_path):
        logging.warning("Vector DB not found or path not specified.")
        print("[red]Vector DB not found or not specified![/red]")
        print("[yellow]Continuing with simple chat[/yellow]")
    else:
        print("[green]Vector DB found![/green]")
        print("[yellow]Loading data from vector DB[/yellow]")
        try:
            vector_store = VectorStore(
                vector_db_path,
                config.get("document_path"),
                config.get("embedding_model")
            )
            vector_store.load_and_update_db()
            vector_db = vector_store.load_db()
            print("[yellow]Data loaded successfully![/yellow]")
        except Exception as e:
            logging.error(f"Failed to load vector database: {str(e)}")
            print("[red]Failed to load vector database. Continuing with simple chat.[/red]")
            vector_db = None

This approach provides more robust error handling and consistent use of config.get().

Comment on lines +107 to +121
if config.get("chat_with_data", False) and vector_db:
# get response from vector store
response = vector_store.get_similar(user_input, db=vector_db, top_n=5, similarity_threshold=0.1)
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""
user_input = PROMPT_TEMPLATE.format(context=response, question=user_input)

user_body = {"role": "user", "content": user_input}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Optimize PROMPT_TEMPLATE definition and fix indentation

  1. The PROMPT_TEMPLATE is defined within the chat loop, which means it will be redefined in each iteration. This is inefficient. Consider moving it outside the loop or even making it a class constant.

  2. The indentation of the user_input assignment (line 121) seems to be at the same level as the if block above it. This could lead to logical errors as it will always be executed, regardless of whether the if condition is met.

Here's a suggested improvement:

# Define this outside the loop, perhaps as a class constant
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

# In the chat loop
if config.get("chat_with_data", False) and vector_db:
    # get response from vector store
    response = vector_store.get_similar(user_input, db=vector_db, top_n=5, similarity_threshold=0.1)
    user_input = PROMPT_TEMPLATE.format(context=response, question=user_input)

user_body = {"role": "user", "content": user_input}
chat.append(user_body)

This change will improve performance by avoiding unnecessary redefinition of the template and ensure correct logic flow by fixing the indentation.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Outside diff range and nitpick comments (9)
app/predacons_cli/src/rag.py (2)

16-27: Simplify model_id assignment in init method

The current implementation uses a ternary operator to set self.model_id, which can be simplified.

Consider simplifying the self.model_id assignment:

-    self.model_id = model if model is not None else None
+    self.model_id = model

This change maintains the same functionality while making the code more concise.


148-173: Add TODO comments or NotImplementedError to WebScraper methods

The WebScraper class currently contains empty methods. While it's common to have placeholder methods during development, it would be helpful to add TODO comments or raise NotImplementedError to indicate that these methods need to be implemented.

Consider updating the methods as follows:

class WebScraper:
    # ... (existing code)

    def fetch_html(self, url):
        """
        Fetch HTML content from a given URL.

        :param url: URL to fetch HTML content from.
        :return: HTML content as a string.
        """
        # TODO: Implement HTML fetching logic
        raise NotImplementedError("fetch_html method not implemented yet")

    def parse_html(self, html_content):
        """
        Parse HTML content and extract relevant text.

        :param html_content: HTML content to parse.
        :return: Extracted text as a string.
        """
        # TODO: Implement HTML parsing logic
        raise NotImplementedError("parse_html method not implemented yet")

This change will make it clear that these methods are not yet implemented and need to be completed.

app/predacons_cli/src/cli.py (7)

24-33: Consider removing or implementing the GPU check

The commented-out code for GPU availability check should either be removed or properly implemented. Keeping commented-out code in production can lead to confusion and maintenance issues.

If GPU support is planned for future implementation, consider adding a TODO comment instead of keeping the entire code block commented out.


64-78: LGTM: Vector database integration

The integration of vector database functionality is well-implemented. However, consider the following improvements for robustness:

  1. Use config.get("vector_db_path") instead of config["vector_db_path"] to avoid potential KeyError.
  2. Add error handling for the VectorStore initialization and load_db call.

Example:

vector_db_path = config.get("vector_db_path")
if vector_db_path and os.path.exists(vector_db_path):
    try:
        vector_store = VectorStore(vector_db_path, config.get("document_path"), config.get("embedding_model"))
        vector_db = vector_store.load_db()
        print("[yellow]Data loaded successfully![/yellow]")
    except Exception as e:
        print(f"[red]Error loading vector database: {e}[/red]")
        vector_db = None

110-123: LGTM: Vector DB integration in response generation

The integration of the vector database for context-aware responses is well-implemented. However, consider the following optimization:

  1. Move the PROMPT_TEMPLATE outside the loop to avoid recreating it in each iteration. You can define it as a class constant or at the module level.

Example:

PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

# In the loop
if config.get("chat_with_data", False) and vector_db:
    context_text = vector_store.get_similar(user_input, db=vector_db, top_n=5, similarity_threshold=0.1)
    user_input = PROMPT_TEMPLATE.format(context=context_text, question=user_input)

This change will improve performance by avoiding unnecessary string creation in each iteration.


217-221: LGTM: New vector database configuration options

The new configuration options for the vector database functionality are correctly added to the default configuration. However, consider adding validation for these new options when they are set by the user. For example:

if config["chat_with_data"]:
    if not config["vector_db_path"] or not os.path.exists(config["vector_db_path"]):
        print("[yellow]Warning: Vector DB path is not set or does not exist.[/yellow]")
    if not config["document_path"] or not os.path.exists(config["document_path"]):
        print("[yellow]Warning: Document path is not set or does not exist.[/yellow]")
    if not config["embedding_model"]:
        print("[yellow]Warning: Embedding model is not set.[/yellow]")

This will help users identify potential issues with their configuration early on.


277-280: LGTM: New prompts for vector database configuration

The new prompts for vector database configuration are correctly added. However, consider adding input validation for these new options. For example:

vector_db = Prompt.ask("Chat with data? (true/false)", default=str(config["chat_with_data"]))
if vector_db.lower() == 'true':
    vector_db_path = Prompt.ask("Enter the vector DB path", default=config["vector_db_path"])
    while not vector_db_path or not os.path.exists(vector_db_path):
        print("[yellow]Warning: Vector DB path is not valid or does not exist.[/yellow]")
        vector_db_path = Prompt.ask("Enter a valid vector DB path")
    
    document_path = Prompt.ask("Enter the document path", default=config["document_path"])
    while not document_path or not os.path.exists(document_path):
        print("[yellow]Warning: Document path is not valid or does not exist.[/yellow]")
        document_path = Prompt.ask("Enter a valid document path")
    
    embedding_model = Prompt.ask("Enter the embedding model", default=config["embedding_model"])
    while not embedding_model:
        print("[yellow]Warning: Embedding model cannot be empty.[/yellow]")
        embedding_model = Prompt.ask("Enter a valid embedding model")

This will ensure that the user provides valid inputs for the vector database configuration.


80-80: Use not instead of comparison to False

Replace if logs == False: with if not logs: for better readability and to follow Python style guidelines.

🧰 Tools
🪛 Ruff

80-80: Avoid equality comparisons to False; use if not logs: for false checks

Replace with not logs

(E712)


82-82: Rename unused loop variable

The loop control variable i is not used within the loop body. Rename it to _ to indicate it's intentionally unused:

for _ in range(1, 0, -1):
    time.sleep(1)

This change improves code clarity by explicitly showing that the loop variable is not used.

🧰 Tools
🪛 Ruff

82-82: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 82874d2 and d655f44.

📒 Files selected for processing (2)
  • app/predacons_cli/src/cli.py (8 hunks)
  • app/predacons_cli/src/rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/cli.py

80-80: Avoid equality comparisons to False; use if not logs: for false checks

Replace with not logs

(E712)


82-82: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

app/predacons_cli/src/rag.py

6-6: requests imported but unused

Remove unused import: requests

(F401)


7-7: bs4.BeautifulSoup imported but unused

Remove unused import: bs4.BeautifulSoup

(F401)


143-143: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (3)
app/predacons_cli/src/cli.py (3)

11-11: LGTM: New import for vector database functionality

The import of VectorStore from the rag module is correctly placed and necessary for the new vector database functionality.


87-87: LGTM: Updated help message

The help message has been appropriately updated to include the new 'update' command for updating loaded documents.


297-300: LGTM: New vector database options in configuration data structure

The new vector database options are correctly added to the configuration data structure. The use of conditional expressions to set default values is appropriate and helps maintain clean code.

Comment on lines +102 to +110
def load_documents(self):
"""
Load documents from the specified directory.

:return: List of loaded Document objects.
"""
loader = DirectoryLoader(self.document_path)
documents = loader.load()
return documents
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance load_documents method with error handling and logging

The load_documents method could benefit from error handling and logging to improve robustness and provide better context during document loading.

Consider enhancing the method as follows:

import logging

def load_documents(self):
    """
    Load documents from the specified directory.

    :return: List of loaded Document objects.
    """
    try:
        loader = DirectoryLoader(self.document_path)
        documents = loader.load()
        logging.info(f"Successfully loaded {len(documents)} documents from {self.document_path}")
        return documents
    except Exception as e:
        logging.error(f"Error loading documents from {self.document_path}: {str(e)}")
        raise

This implementation adds error handling and logging, providing more context about the document loading process and any potential issues.

Comment on lines +112 to +118
def load_and_update_db(self):
"""
Load documents, split them into chunks, and add new chunks to the Chroma database.
"""
documents = self.load_documents()
chunks = self.split_text(documents)
self.add_to_chroma(chunks)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance load_and_update_db method with error handling and logging

The load_and_update_db method orchestrates the document processing pipeline but lacks error handling and logging. Adding these would improve robustness and provide better visibility into the process.

Consider enhancing the method as follows:

import logging

def load_and_update_db(self):
    """
    Load documents, split them into chunks, and add new chunks to the Chroma database.
    """
    try:
        logging.info("Starting database update process")
        documents = self.load_documents()
        logging.info(f"Loaded {len(documents)} documents")

        chunks = self.split_text(documents)
        logging.info(f"Split documents into {len(chunks)} chunks")

        self.add_to_chroma(chunks)
        logging.info("Completed adding chunks to Chroma database")

    except Exception as e:
        logging.error(f"Error in load_and_update_db: {str(e)}")
        raise
    else:
        logging.info("Database update process completed successfully")

This implementation adds error handling and logging throughout the process, providing better visibility into the progress and any potential issues during the database update.

Comment on lines +120 to +127
def load_db(self):
"""
Load the Chroma database.

:return: Chroma database object.
"""
db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder)
return db
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling and logging to load_db method

The load_db method could benefit from error handling and logging to improve robustness and provide better context during database loading.

Consider enhancing the method as follows:

import logging

def load_db(self):
    """
    Load the Chroma database.

    :return: Chroma database object.
    """
    try:
        db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder)
        logging.info(f"Successfully loaded Chroma database from {self.chroma_path}")
        return db
    except Exception as e:
        logging.error(f"Error loading Chroma database from {self.chroma_path}: {str(e)}")
        raise

This implementation adds error handling and logging, providing more context about the database loading process and any potential issues.

Comment on lines +129 to +147
def get_similar(self, query, db=None, top_n=5, similarity_threshold=0.1):
"""
Retrieve similar documents from the Chroma database based on a query.

:param query: Query string to search for similar documents.
:param db: Optional Chroma database object. If None, the database will be loaded.
:param top_n: Number of top similar documents to retrieve.
:param similarity_threshold: Minimum similarity score threshold.
:return: List of tuples containing similar documents and their relevance scores.
"""
if db is None:
db = self.load_db()
results = db.similarity_search_with_relevance_scores(query, k=top_n)
if len(results) == 0 or results[0][1] < similarity_threshold:
print(f"Unable to find matching results.")

context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
return context_text

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance get_similar method and address f-string issue

The get_similar method could be improved in several ways:

  1. Replace the print statement with proper logging.
  2. Fix the f-string without placeholders.
  3. Consider returning a more structured result.

Here's an improved version of the method:

import logging
from typing import List, Tuple, Any

def get_similar(self, query: str, db=None, top_n: int = 5, similarity_threshold: float = 0.1) -> List[Tuple[Any, float]]:
    """
    Retrieve similar documents from the Chroma database based on a query.

    :param query: Query string to search for similar documents.
    :param db: Optional Chroma database object. If None, the database will be loaded.
    :param top_n: Number of top similar documents to retrieve.
    :param similarity_threshold: Minimum similarity score threshold.
    :return: List of tuples containing similar documents and their relevance scores.
    """
    try:
        if db is None:
            db = self.load_db()
        
        results = db.similarity_search_with_relevance_scores(query, k=top_n)
        
        filtered_results = [result for result in results if result[1] >= similarity_threshold]
        
        if not filtered_results:
            logging.info("No results found above the similarity threshold.")
        else:
            logging.info(f"Found {len(filtered_results)} relevant results.")
        
        return filtered_results
    except Exception as e:
        logging.error(f"Error in similarity search: {str(e)}")
        raise

This implementation addresses the f-string issue, adds error handling, improves logging, and returns a filtered list of results. The method now returns a more structured result that can be easily processed by the calling code.

🧰 Tools
🪛 Ruff

143-143: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment on lines +1 to +9
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import Chroma
from predacons import PredaconsEmbedding
import requests
from bs4 import BeautifulSoup
# from googlesearch import search

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unused imports and clean up commented code

There are a few issues with the import statements:

  1. The requests and BeautifulSoup modules are imported but not used in the current implementation.
  2. There's a commented-out import for search from googlesearch.

To keep the code clean and maintainable, please remove the unused imports and the commented-out code if it's not needed in the near future.

Apply this diff to clean up the imports:

 from langchain_community.document_loaders import DirectoryLoader
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.schema import Document
 from langchain_community.vectorstores import Chroma
 from predacons import PredaconsEmbedding
-import requests
-from bs4 import BeautifulSoup
-# from googlesearch import search
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import Chroma
from predacons import PredaconsEmbedding
import requests
from bs4 import BeautifulSoup
# from googlesearch import search
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import Chroma
from predacons import PredaconsEmbedding
🧰 Tools
🪛 Ruff

6-6: requests imported but unused

Remove unused import: requests

(F401)


7-7: bs4.BeautifulSoup imported but unused

Remove unused import: bs4.BeautifulSoup

(F401)

Comment on lines +47 to +71
def calculate_chunk_ids(self, chunks):
"""
Calculate unique IDs for each chunk based on source and page metadata.

:param chunks: List of chunks to calculate IDs for.
:return: List of chunks with updated metadata including IDs.
"""
last_page_id = None
current_chunk_index = 0

for chunk in chunks:
source = chunk.metadata.get("source")
page = chunk.metadata.get("page")
current_page_id = f"{source}:{page}"

if current_page_id == last_page_id:
current_chunk_index += 1
else:
current_chunk_index = 0

chunk_id = f"{current_page_id}:{current_chunk_index}"
last_page_id = current_page_id
chunk.metadata["id"] = chunk_id

return chunks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle potential missing metadata in calculate_chunk_ids method

The calculate_chunk_ids method assumes that 'source' and 'page' keys always exist in the chunk metadata. To improve robustness, it's better to handle cases where these keys might be missing.

Consider using default values when accessing metadata:

 for chunk in chunks:
-    source = chunk.metadata.get("source")
-    page = chunk.metadata.get("page")
+    source = chunk.metadata.get("source", "unknown_source")
+    page = chunk.metadata.get("page", "unknown_page")
     current_page_id = f"{source}:{page}"

This change ensures that even if 'source' or 'page' keys are missing, the method will still function without raising an error.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def calculate_chunk_ids(self, chunks):
"""
Calculate unique IDs for each chunk based on source and page metadata.
:param chunks: List of chunks to calculate IDs for.
:return: List of chunks with updated metadata including IDs.
"""
last_page_id = None
current_chunk_index = 0
for chunk in chunks:
source = chunk.metadata.get("source")
page = chunk.metadata.get("page")
current_page_id = f"{source}:{page}"
if current_page_id == last_page_id:
current_chunk_index += 1
else:
current_chunk_index = 0
chunk_id = f"{current_page_id}:{current_chunk_index}"
last_page_id = current_page_id
chunk.metadata["id"] = chunk_id
return chunks
def calculate_chunk_ids(self, chunks):
"""
Calculate unique IDs for each chunk based on source and page metadata.
:param chunks: List of chunks to calculate IDs for.
:return: List of chunks with updated metadata including IDs.
"""
last_page_id = None
current_chunk_index = 0
for chunk in chunks:
source = chunk.metadata.get("source", "unknown_source")
page = chunk.metadata.get("page", "unknown_page")
current_page_id = f"{source}:{page}"
if current_page_id == last_page_id:
current_chunk_index += 1
else:
current_chunk_index = 0
chunk_id = f"{current_page_id}:{current_chunk_index}"
last_page_id = current_page_id
chunk.metadata["id"] = chunk_id
return chunks

Comment on lines +104 to +108
elif user_input == "update":
print("[yellow]Updating documents...[/yellow]")
# check and update the vector store
vector_store.load_and_update_db()
print("[yellow]Documents updated successfully![/yellow]")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add check before updating vector store

The 'update' command is a good addition. However, you should check if the vector store exists before calling load_and_update_db(). This will prevent potential errors if the user tries to update when chat_with_data is disabled or the vector store failed to initialize.

Consider modifying the code as follows:

elif user_input == "update":
    if vector_store and config.get("chat_with_data", False):
        print("[yellow]Updating documents...[/yellow]")
        vector_store.load_and_update_db()
        print("[yellow]Documents updated successfully![/yellow]")
    else:
        print("[red]Vector store is not initialized. Cannot update documents.[/red]")

Comment thread app/predacons_cli/src/cli.py Outdated
Comment on lines +327 to +328
cli = Cli()
cli.launch()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Use a main guard for script entry point

Creating and launching the CLI instance at the module level will cause the CLI to start immediately when this module is imported, which may not be the desired behavior in all cases. It's a best practice to use a if __name__ == "__main__": guard for script entry points.

Consider modifying the code as follows:

def main():
    cli = Cli()
    cli.launch()

if __name__ == "__main__":
    main()

This change allows the module to be imported without automatically launching the CLI, which is useful for testing or if you want to use the Cli class in other scripts.

@shouryashashank shouryashashank merged commit 3c7c7c4 into main Oct 12, 2024
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🧹 Outside diff range and nitpick comments (11)
setup.py (2)

31-39: LGTM: Dependency updates are appropriate and well-defined.

The dependency updates are well-structured and align with the expected new features:

  1. Updated "predacons" to version 0.0.126 for compatibility.
  2. Added new dependencies with specific minimum versions:
    • rich: Enhanced CLI output
    • langchain and langchain_community: Language model interactions
    • requests and beautifulsoup4: Web scraping capabilities
    • googlesearch-python: Google search functionality

These additions suggest significant new features in web scraping and language model interactions.

Consider updating the README or documentation to reflect these new capabilities and dependencies.


Line range hint 1-46: Consider updating PR description and documentation.

The changes in setup.py suggest significant new features related to web scraping and language model interactions. However, the PR description lacks details about these changes.

  1. Update the PR description to explain the new features and their impact on the package's functionality.
  2. Ensure that the README and any other relevant documentation are updated to reflect these new capabilities and dependencies.
  3. If not already done, consider adding usage examples for the new features in the documentation.

These updates will help users understand the new capabilities and how to use them effectively.

app/predacons_cli/src/cli.py (9)

66-66: Typo in log message

There's a typo in the print statement. "data sources for the chat" may be clearer as "data sources for chat".

Optional improvement:

-print("[yellow]Checking for data sources for the chat[/yellow]")
+print("[yellow]Checking for data sources for chat[/yellow]")

68-68: Correct the typo in the log message

"iitializing" should be "initializing" in the print statement.

Apply this diff to fix the typo:

-print("[yellow]Web scraping enabled iitializing web scraper[/yellow]")
+print("[yellow]Web scraping enabled initializing web scraper[/yellow]")

74-74: Correct the typo in the log message

"Continueing" should be "Continuing" in the print statement.

Apply this diff to fix the typo:

-print("[yellow]Continueing with simple chat[/yellow]")
+print("[yellow]Continuing with simple chat[/yellow]")

80-80: Correct the typo in the log message

"Continueing" should be "Continuing" in the print statement.

Apply this diff to fix the typo:

-print("[yellow]Continueing with simple chat[/yellow]")
+print("[yellow]Continuing with simple chat[/yellow]")

97-97: Add a space after period in the log message

For better readability, add a space after the period in the print statement.

Apply this diff to fix:

-print("[yellow]You can start chatting with Predacons now.Type 'clear' to clear history, Type 'exit' to quit, Type 'help' for more options, Type 'update' to update the load documents[/yellow]")
+print("[yellow]You can start chatting with Predacons now. Type 'clear' to clear history, Type 'exit' to quit, Type 'help' for more options, Type 'update' to update the load documents[/yellow]")

132-132: Fix typos in the prompt template

Correct the misspellings in the prompt template to improve clarity.

Apply this diff:

-        Here are additianl infor from web search this context is google search results and web scraped data on the same topic :
+        Here is additional information from web search. This context includes Google search results and web scraped data on the same topic:

158-158: Improve punctuation and capitalization in prompt template

Add punctuation and capitalize "Google" for clarity and correctness.

Apply this diff:

-        Answer the question based only on the following context this context is google search results and web scraped data:
+        Answer the question based only on the following context. This context includes Google search results and web scraped data:

322-322: Rename variable for clarity

The variable vector_db stores the response to "Chat with data?" but its name suggests it's a database object. Consider renaming it to chat_with_data for better clarity.

Apply this diff:

-vector_db = Prompt.ask("Chat with data? (true/false)", default=str(config["chat_with_data"]))
+chat_with_data = Prompt.ask("Chat with data? (true/false)", default=str(config["chat_with_data"]))

And update usage in line 343:

-"chat_with_data": vector_db.lower() == 'true',
+"chat_with_data": chat_with_data.lower() == 'true',

326-326: Correct the typo in the prompt message

"Scrap" should be "Scrape" in the prompt to accurately reflect the action.

Apply this diff:

-scrap_web = Prompt.ask("Scrap the web for data? (true/false)", default=str(config["scrap_web"]))
+scrap_web = Prompt.ask("Scrape the web for data? (true/false)", default=str(config["scrap_web"]))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between d655f44 and f772643.

📒 Files selected for processing (3)
  • app/predacons_cli/src/cli.py (8 hunks)
  • app/predacons_cli/src/rag.py (1 hunks)
  • setup.py (2 hunks)
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/cli.py

72-72: Do not use bare except

(E722)


90-90: Avoid equality comparisons to False; use if not logs: for false checks

Replace with not logs

(E712)


92-92: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

app/predacons_cli/src/rag.py

143-143: f-string without any placeholders

Remove extraneous f prefix

(F541)


184-184: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


187-187: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🔇 Additional comments (4)
setup.py (1)

10-10: LGTM: Version update is appropriate.

The version increment from 0.0.100 to 0.0.101 follows semantic versioning principles and is consistent with a patch update. This is suitable for minor changes or bug fixes.

app/predacons_cli/src/rag.py (1)

1-200: Overall assessment and summary of suggestions

The rag.py file introduces two main classes, VectorStore and WebScraper, which provide functionality for document management, embedding operations, and web scraping. While the overall structure and implementation are good, there are several areas for improvement:

  1. Imports: Clean up unused imports and uncomment the necessary ones.
  2. Logging: Replace print statements with proper logging throughout the file.
  3. Error handling: Enhance exception handling in both classes, particularly in the WebScraper methods.
  4. Code style: Use list comprehensions and more pythonic constructs where applicable.
  5. Type hinting: Consider adding type hints to method parameters and return values for better code clarity.

These improvements will enhance the code's readability, maintainability, and robustness. After implementing these changes, the code will be more production-ready and easier to debug and extend in the future.

🧰 Tools
🪛 Ruff

143-143: f-string without any placeholders

Remove extraneous f prefix

(F541)


184-184: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


187-187: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

app/predacons_cli/src/cli.py (2)

11-12: Ensure 'rag' module is available and properly imported

The imports for VectorStore and WebScraper require the rag module. Make sure that rag.py is in the correct directory or installed as a module so that these imports work without issues.


117-117: ⚠️ Potential issue

Check if 'vector_store' is initialized before use

If vector_store is not initialized due to earlier failures or configurations, calling vector_store.load_and_update_db() will raise an error. Ensure that vector_store is initialized before invoking this method.

Comment on lines +1 to +8
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import Chroma
from predacons import PredaconsEmbedding
import requests
from bs4 import BeautifulSoup
from googlesearch import search
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Clean up import statements

  1. The requests and BeautifulSoup imports are unused in the current implementation of the WebScraper class. Consider removing them if they're not needed.
  2. The search import from googlesearch is commented out, but it's used in the google_search_extract method. Uncomment this import and remove the others if they're not needed.

Here's a suggested cleanup:

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import Chroma
from predacons import PredaconsEmbedding
from googlesearch import search

Comment on lines +29 to +45
def split_text(self, documents: list[Document]):
"""
Split documents into chunks using RecursiveCharacterTextSplitter.

:param documents: List of Document objects to be split.
:return: List of chunks.
"""
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=100,
length_function=len,
add_start_index=True,
)
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

return chunks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Replace print statement with logging in split_text method

Consider using the logging module instead of print statements for better control over output in a production environment.

import logging

def split_text(self, documents: list[Document]):
    # ... (existing code)
    chunks = text_splitter.split_documents(documents)
    logging.info(f"Split {len(documents)} documents into {len(chunks)} chunks.")
    return chunks

Don't forget to configure the logging module at the application level.

Comment on lines +73 to +101
def add_to_chroma(self, chunks: list[Document]):
"""
Add new chunks to the Chroma database if they do not already exist.

:param chunks: List of chunks to be added to the database.
"""
db = Chroma(
persist_directory=self.chroma_path, embedding_function=self.embedder
)

chunks_with_ids = self.calculate_chunk_ids(chunks)

existing_items = db.get(include=[]) # IDs are always included by default
existing_ids = set(existing_items["ids"])
print(f"Number of existing documents in DB: {len(existing_ids)}")

new_chunks = []
for chunk in chunks_with_ids:
if chunk.metadata["id"] not in existing_ids:
new_chunks.append(chunk)

if len(new_chunks):
print(f"👉 Adding new documents: {len(new_chunks)}")
new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
db.add_documents(new_chunks, ids=new_chunk_ids)
db.persist()
else:
print("✅ No new documents to add")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance add_to_chroma method with logging and list comprehension

  1. Replace print statements with logging:
import logging

def add_to_chroma(self, chunks: list[Document]):
    # ... (existing code)
    logging.info(f"Number of existing documents in DB: {len(existing_ids)}")
    
    # ... (rest of the method)
    
    if new_chunks:
        logging.info(f"Adding new documents: {len(new_chunks)}")
        # ... (rest of the code)
    else:
        logging.info("No new documents to add")
  1. Simplify the creation of new_chunks using a list comprehension:
new_chunks = [chunk for chunk in chunks_with_ids if chunk.metadata["id"] not in existing_ids]

These changes will improve the code's readability and maintainability.

Comment on lines +129 to +146
def get_similar(self, query, db=None, top_n=5, similarity_threshold=0.1):
"""
Retrieve similar documents from the Chroma database based on a query.

:param query: Query string to search for similar documents.
:param db: Optional Chroma database object. If None, the database will be loaded.
:param top_n: Number of top similar documents to retrieve.
:param similarity_threshold: Minimum similarity score threshold.
:return: List of tuples containing similar documents and their relevance scores.
"""
if db is None:
db = self.load_db()
results = db.similarity_search_with_relevance_scores(query, k=top_n)
if len(results) == 0 or results[0][1] < similarity_threshold:
print(f"Unable to find matching results.")

context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
return context_text,results[0][1]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Fix f-string issue and enhance get_similar method

  1. Address the f-string without placeholders (F541 error).
  2. Add error handling.
  3. Improve logging of results.

Here's an improved version:

import logging
from typing import List, Tuple, Any

def get_similar(self, query: str, db=None, top_n: int = 5, similarity_threshold: float = 0.1) -> Tuple[str, float]:
    try:
        if db is None:
            db = self.load_db()
        
        results = db.similarity_search_with_relevance_scores(query, k=top_n)
        
        if not results or results[0][1] < similarity_threshold:
            logging.info("No results found above the similarity threshold.")
            return "", 0.0
        
        context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
        return context_text, results[0][1]
    except Exception as e:
        logging.error(f"Error in similarity search: {str(e)}")
        raise

This implementation addresses the f-string issue, adds error handling, improves logging, and returns a more structured result.

🧰 Tools
🪛 Ruff

143-143: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment on lines +156 to +193
def google_search_extract(self,query, num_results=5):
results = []
try:
# Perform Google search
search_results = search(term = query, num_results=10, advanced = True)
i =0
for search_result in search_results:
url = search_result.url
title = search_result.title
description = search_result.description
results.append(title +" "+ description)
if i == num_results:
break
try:
# Fetch the content of the URL
response = requests.get(url)
response.raise_for_status()

# Parse the content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract text from the website
text = ' '.join([p.get_text() for p in soup.find_all('p')])

# Append the extracted text to results
results.append(text)
i = i+1
print(url)
except requests.RequestException as e:
# print(f"Failed to fetch {url}: {e}")
pass
except Exception as e:
# print(f"Failed to parse {url}: {e}")
pass
except Exception as e:
print(f"Failed to perform Google search: {e}")

return results
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Improve exception handling and use of imported libraries in google_search_extract method

  1. Log exceptions instead of using print statements:
import logging

def google_search_extract(self, query, num_results=5):
    # ... (existing code)
    try:
        response = requests.get(url)
        response.raise_for_status()
        # ... (rest of the code)
    except requests.RequestException as e:
        logging.warning(f"Failed to fetch {url}: {e}")
    except Exception as e:
        logging.error(f"Failed to parse {url}: {e}")
    # ... (rest of the method)
  1. Use the imported requests and BeautifulSoup consistently throughout the method.

  2. Consider using a more pythonic way to limit the number of results:

for i, search_result in enumerate(search_results):
    if i >= num_results:
        break
    # ... (rest of the loop)

These changes will improve error handling, make better use of imported libraries, and enhance the overall code quality.

🧰 Tools
🪛 Ruff

184-184: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


187-187: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

Comment on lines +25 to +34
# try:
# os.environ['CUDA_VISIBLE_DEVICES'] ='0'
# import torch
# if torch.cuda.is_available():
# torch.set_default_device('cuda')
# print("Using GPU")
# else:
# print("No GPU available")
# except:
# print("No GPU available")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Remove commented-out code for better readability

Lines 25-34 contain commented-out code. Keeping unused code increases clutter and can cause confusion. If this code is no longer needed, consider removing it.

Apply this diff to clean up the code:

-# try:
-#     os.environ['CUDA_VISIBLE_DEVICES'] ='0'
-#     import torch
-#     if torch.cuda.is_available():
-#         torch.set_default_device('cuda')
-#         print("Using GPU")
-#     else:
-#         print("No GPU available")
-# except:
-#     print("No GPU available")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# try:
# os.environ['CUDA_VISIBLE_DEVICES'] ='0'
# import torch
# if torch.cuda.is_available():
# torch.set_default_device('cuda')
# print("Using GPU")
# else:
# print("No GPU available")
# except:
# print("No GPU available")

Comment on lines +220 to +221
print("[blue]Predacons CLI version 0.0.101[blue]")
print("[blue]Predacons version 0.0.126[blue]")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Correct the closing tags in print statements

The closing tags for the color should be [/blue] instead of [blue].

Apply this diff to fix the tags:

-print("[blue]Predacons CLI version 0.0.101[blue]")
-print("[blue]Predacons version 0.0.126[blue]")
+print("[blue]Predacons CLI version 0.0.101[/blue]")
+print("[blue]Predacons version 0.0.126[/blue]")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print("[blue]Predacons CLI version 0.0.101[blue]")
print("[blue]Predacons version 0.0.126[blue]")
print("[blue]Predacons CLI version 0.0.101[/blue]")
print("[blue]Predacons version 0.0.126[/blue]")

print("[yellow]Loading data from vector DB[/yellow]")
# load data from vector db

vector_store = VectorStore(config["vector_db_path"], config["document_path"], config.get("embedding_model", None))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use consistent access methods for configuration

Accessing config["vector_db_path"] and config["document_path"] directly can raise KeyError if the keys are missing. Consider using config.get() for safer access or ensure these keys are present.

Apply this diff to improve robustness:

-vector_store = VectorStore(config["vector_db_path"], config["document_path"], config.get("embedding_model", None))
+vector_store = VectorStore(config.get("vector_db_path"), config.get("document_path"), config.get("embedding_model"))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
vector_store = VectorStore(config["vector_db_path"], config["document_path"], config.get("embedding_model", None))
vector_store = VectorStore(config.get("vector_db_path"), config.get("document_path"), config.get("embedding_model"))

try:
web_scraper = WebScraper()
print("[green]Web scraper initialized![/green]")
except:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Avoid bare 'except' clauses

Using a bare except: can catch unintended exceptions and make debugging difficult. Specify the exception type to catch only expected exceptions.

Apply this diff to fix the issue:

-    except:
+    except Exception as e:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except:
except Exception as e:
🧰 Tools
🪛 Ruff

72-72: Do not use bare except

(E722)

Comment on lines +78 to +81
if not os.path.exists(config.get("vector_db_path", False)):
print("[red]Vector DB not found![/red]")
print("[yellow]Continueing with simple chat[/yellow]")
else:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Ensure 'vector_db_path' is valid before checking existence

Using config.get("vector_db_path", False) may return False, leading os.path.exists(False) to check for a path 'False', which is unintended. Check if vector_db_path is valid before proceeding.

Apply this diff to fix the issue:

-    if not os.path.exists(config.get("vector_db_path", False)):
+    vector_db_path = config.get("vector_db_path")
+    if not vector_db_path or not os.path.exists(vector_db_path):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not os.path.exists(config.get("vector_db_path", False)):
print("[red]Vector DB not found![/red]")
print("[yellow]Continueing with simple chat[/yellow]")
else:
vector_db_path = config.get("vector_db_path")
if not vector_db_path or not os.path.exists(vector_db_path):
print("[red]Vector DB not found![/red]")
print("[yellow]Continueing with simple chat[/yellow]")
else:

@coderabbitai coderabbitai Bot mentioned this pull request Oct 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant