Create rag.py#3
Conversation
WalkthroughThe changes enhance the Changes
Possibly related PRs
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (2)
app/predacons_cli/src/rag.py (2)
23-28: Uselogginginstead ofUsing
loggingmodule to log messages at appropriate severity levels.Suggested modification:
+import logging class VectorStore: def split_text(self, documents: list[Document]): + logger = logging.getLogger(__name__) text_splitter = RecursiveCharacterTextSplitter( chunk_size=300, chunk_overlap=100, length_function=len, add_start_index=True, ) chunks = text_splitter.split_documents(documents) - print(f"Split {len(documents)} documents into {len(chunks)} chunks.") + logger.info(f"Split {len(documents)} documents into {len(chunks)} chunks.") if len(chunks) > 10: document = chunks[10] - print(document.page_content) - print(document.metadata) + logger.debug(document.page_content) + logger.debug(document.metadata) else: - print("Not enough chunks to display sample content.") + logger.info("Not enough chunks to display sample content.") return chunks
71-71: Uselogginginstead ofConsistently using the
loggingmodule enhances maintainability and allows for better control over log levels and outputs.Suggested modification:
+import logging class VectorStore: def add_to_chroma(self, chunks: list[Document]): + logger = logging.getLogger(__name__) # ... - print(f"Number of existing documents in DB: {len(existing_ids)}") + logger.info(f"Number of existing documents in DB: {len(existing_ids)}") # ... if len(new_chunks): - print(f"👉 Adding new documents: {len(new_chunks)}") + logger.info(f"Adding new documents: {len(new_chunks)}") # ... else: - print("✅ No new documents to add") + logger.info("No new documents to add")Also applies to: 80-80, 85-85
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- app/predacons_cli/src/rag.py (1 hunks)
🧰 Additional context used
🔇 Additional comments (1)
app/predacons_cli/src/rag.py (1)
87-90:load_documentsmethod is implemented correctlyThe
load_documentsmethod properly utilizesDirectoryLoaderto load and return documents from the specified path.
| ) | ||
|
|
||
| # Calculate Page IDs. | ||
| chunks_with_ids = self.alculate_chunk_ids(chunks) |
There was a problem hiding this comment.
Fix typo in method call: alculate_chunk_ids should be calculate_chunk_ids
There's a typo in the method call self.alculate_chunk_ids(chunks). This will raise an AttributeError because the method doesn't exist. Please correct the method name.
Suggested fix:
# Calculate Page IDs.
- chunks_with_ids = self.alculate_chunk_ids(chunks)
+ chunks_with_ids = self.calculate_chunk_ids(chunks)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| chunks_with_ids = self.alculate_chunk_ids(chunks) | |
| chunks_with_ids = self.calculate_chunk_ids(chunks) |
| document = chunks[10] | ||
| print(document.page_content) | ||
| print(document.metadata) | ||
|
|
There was a problem hiding this comment.
Handle potential IndexError when accessing chunks[10]
Accessing chunks[10] without checking its length may result in an IndexError if there are fewer than 11 chunks. To prevent this, consider adding a check before accessing chunks[10].
Suggested fix:
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks.")
- document = chunks[10]
- print(document.page_content)
- print(document.metadata)
+ if len(chunks) > 10:
+ document = chunks[10]
+ print(document.page_content)
+ print(document.metadata)
+ else:
+ print("Not enough chunks to display sample content.")
return chunks📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| document = chunks[10] | |
| print(document.page_content) | |
| print(document.metadata) | |
| if len(chunks) > 10: | |
| document = chunks[10] | |
| print(document.page_content) | |
| print(document.metadata) | |
| else: | |
| print("Not enough chunks to display sample content.") | |
| for chunk in chunks: | ||
| source = chunk.metadata.get("source") | ||
| page = chunk.metadata.get("page") | ||
| current_page_id = f"{source}:{page}" | ||
|
|
||
| # If the page ID is the same as the last one, increment the index. | ||
| if current_page_id == last_page_id: | ||
| current_chunk_index += 1 | ||
| else: | ||
| current_chunk_index = 0 | ||
|
|
||
| # Calculate the chunk ID. | ||
| chunk_id = f"{current_page_id}:{current_chunk_index}" | ||
| last_page_id = current_page_id | ||
|
|
||
| # Add it to the page meta-data. | ||
| chunk.metadata["id"] = chunk_id |
There was a problem hiding this comment.
Handle missing 'source' or 'page' in chunk metadata
The code assumes that chunk.metadata contains 'source' and 'page' keys. If these keys are missing, it could lead to issues when generating current_page_id. Consider adding default values or handling missing metadata to avoid potential errors.
Suggested improvement:
for chunk in chunks:
- source = chunk.metadata.get("source")
- page = chunk.metadata.get("page")
+ source = chunk.metadata.get("source", "unknown_source")
+ page = chunk.metadata.get("page", "unknown_page")
current_page_id = f"{source}:{page}"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for chunk in chunks: | |
| source = chunk.metadata.get("source") | |
| page = chunk.metadata.get("page") | |
| current_page_id = f"{source}:{page}" | |
| # If the page ID is the same as the last one, increment the index. | |
| if current_page_id == last_page_id: | |
| current_chunk_index += 1 | |
| else: | |
| current_chunk_index = 0 | |
| # Calculate the chunk ID. | |
| chunk_id = f"{current_page_id}:{current_chunk_index}" | |
| last_page_id = current_page_id | |
| # Add it to the page meta-data. | |
| chunk.metadata["id"] = chunk_id | |
| for chunk in chunks: | |
| source = chunk.metadata.get("source", "unknown_source") | |
| page = chunk.metadata.get("page", "unknown_page") | |
| current_page_id = f"{source}:{page}" | |
| # If the page ID is the same as the last one, increment the index. | |
| if current_page_id == last_page_id: | |
| current_chunk_index += 1 | |
| else: | |
| current_chunk_index = 0 | |
| # Calculate the chunk ID. | |
| chunk_id = f"{current_page_id}:{current_chunk_index}" | |
| last_page_id = current_page_id | |
| # Add it to the page meta-data. | |
| chunk.metadata["id"] = chunk_id |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (1)
app/predacons_cli/src/rag.py (1)
1-108: Overall assessment: Good foundation with room for improvementThe
VectorStoreclass provides a solid foundation for managing document embeddings and storage using the LangChain library. The implementation covers essential functionality such as document loading, text splitting, chunk ID calculation, and similarity search.Key strengths:
- Clear separation of concerns with distinct methods for each step in the process.
- Integration with appropriate libraries (LangChain, Chroma) for vector storage and embedding.
Areas for improvement:
- Enhance error handling across all methods to improve robustness.
- Implement consistent logging throughout the class for better debugging and monitoring.
- Address minor issues such as typos and potential edge cases in data handling.
- Consider performance optimizations, such as batching for large datasets.
By addressing these points, the
VectorStoreclass will become more robust, maintainable, and performant, providing a reliable foundation for document processing and similarity search operations.🧰 Tools
🪛 Ruff
107-107: f-string without any placeholders
Remove extraneous
fprefix(F541)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- app/predacons_cli/src/rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/rag.py
107-107: f-string without any placeholders
Remove extraneous
fprefix(F541)
🔇 Additional comments (5)
app/predacons_cli/src/rag.py (5)
1-5: LGTM: Imports are appropriate and concise.The import statements are well-organized and include all necessary modules for the
VectorStoreclass functionality.
7-13: LGTM: Class initialization is well-structured.The
__init__method properly initializes theVectorStoreclass with necessary parameters and sets up the embedding model correctly.
15-29: 🛠️ Refactor suggestionConsider additional error handling and logging improvements.
The
split_textmethod effectively splits documents into chunks. However, as previously noted, there's a potentialIndexErrorwhen accessingchunks[10]. In addition to addressing that issue, consider the following improvements:
- Add error handling for the text splitting process.
- Use logging instead of print statements for better integration with the application's logging system.
Example improvement:
import logging def split_text(self, documents: list[Document]): try: text_splitter = RecursiveCharacterTextSplitter( chunk_size=300, chunk_overlap=100, length_function=len, add_start_index=True, ) chunks = text_splitter.split_documents(documents) logging.info(f"Split {len(documents)} documents into {len(chunks)} chunks.") if len(chunks) > 10: sample_chunk = chunks[10] logging.debug(f"Sample chunk content: {sample_chunk.page_content}") logging.debug(f"Sample chunk metadata: {sample_chunk.metadata}") else: logging.warning("Not enough chunks to display sample content.") return chunks except Exception as e: logging.error(f"Error splitting documents: {str(e)}") raise
31-57: 🛠️ Refactor suggestionEnhance robustness of chunk ID calculation.
The
calculate_chunk_idsmethod effectively generates unique IDs for chunks. As previously noted, there's a potential issue with missing 'source' or 'page' in chunk metadata. In addition to addressing that, consider the following improvements for increased robustness:
- Use
uuid.uuid4()as a fallback for truly unique IDs when metadata is incomplete.- Sanitize the
sourceandpagevalues to ensure they don't contain characters that could cause issues in the ID string.Example improvement:
import uuid import re def calculate_chunk_ids(self, chunks): last_page_id = None current_chunk_index = 0 def sanitize(value): return re.sub(r'[^a-zA-Z0-9_-]', '_', str(value)) for chunk in chunks: source = sanitize(chunk.metadata.get("source", "unknown_source")) page = sanitize(chunk.metadata.get("page", "unknown_page")) current_page_id = f"{source}:{page}" if current_page_id == last_page_id: current_chunk_index += 1 else: current_chunk_index = 0 chunk_id = f"{current_page_id}:{current_chunk_index}" # Fallback to UUID if the generated ID is not unique if not chunk_id or chunk_id == ":": chunk_id = str(uuid.uuid4()) last_page_id = current_page_id chunk.metadata["id"] = chunk_id return chunksThis implementation ensures that even with missing or problematic metadata, each chunk will receive a unique ID.
59-85: 🛠️ Refactor suggestion
⚠️ Potential issueFix typo and enhance error handling in add_to_chroma method.
As previously noted, there's a typo in the method call
self.alculate_chunk_ids(chunks). In addition to fixing this typo, consider the following improvements:
- Add error handling for database operations.
- Use logging instead of print statements for consistency.
- Consider batching the addition of new chunks for better performance with large datasets.
Example improvement:
import logging from typing import List from langchain.schema import Document def add_to_chroma(self, chunks: List[Document]): try: db = Chroma( persist_directory=self.chroma_path, embedding_function=self.embedder ) chunks_with_ids = self.calculate_chunk_ids(chunks) # Fix typo here existing_items = db.get(include=[]) existing_ids = set(existing_items["ids"]) logging.info(f"Number of existing documents in DB: {len(existing_ids)}") new_chunks = [chunk for chunk in chunks_with_ids if chunk.metadata["id"] not in existing_ids] if new_chunks: logging.info(f"Adding new documents: {len(new_chunks)}") new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks] # Consider batching for large datasets batch_size = 1000 for i in range(0, len(new_chunks), batch_size): batch = new_chunks[i:i+batch_size] batch_ids = new_chunk_ids[i:i+batch_size] db.add_documents(batch, ids=batch_ids) db.persist() else: logging.info("No new documents to add") except Exception as e: logging.error(f"Error adding documents to Chroma: {str(e)}") raiseThis implementation fixes the typo, adds error handling, uses logging, and introduces batching for better performance with large datasets.
| def load_and_update_db(self): | ||
| documents = self.load_documents() | ||
| chunks = self.split_text(documents) | ||
| self.add_to_chroma(chunks) | ||
| print("✅ Done") |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance load_and_update_db method with error handling and detailed logging.
The load_and_update_db method effectively orchestrates the document processing pipeline. However, it could benefit from improved error handling and more detailed logging. Consider the following improvements:
import logging
def load_and_update_db(self):
try:
logging.info("Starting database update process")
documents = self.load_documents()
logging.info(f"Loaded {len(documents)} documents")
chunks = self.split_text(documents)
logging.info(f"Split documents into {len(chunks)} chunks")
self.add_to_chroma(chunks)
logging.info("Completed adding chunks to Chroma database")
except Exception as e:
logging.error(f"Error in load_and_update_db: {str(e)}")
raise
else:
logging.info("Database update process completed successfully")This implementation adds error handling and more detailed logging throughout the process, providing better visibility into the progress and any potential issues during the database update.
There was a problem hiding this comment.
Actionable comments posted: 11
🧹 Outside diff range and nitpick comments (1)
app/predacons_cli/src/cli.py (1)
59-59: Fix typo in print statementThere's a typo in the print statement: "Continueing" should be "Continuing".
Apply this diff to correct the typo:
print("[red]Vector DB not found![/red]") -print("[yellow]Continueing with simple chat[/yellow]") +print("[yellow]Continuing with simple chat[/yellow]")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- app/predacons_cli/src/cli.py (3 hunks)
- app/predacons_cli/src/rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/rag.py
106-106: f-string without any placeholders
Remove extraneous
fprefix(F541)
🔇 Additional comments (4)
app/predacons_cli/src/rag.py (3)
1-5: LGTM: Import statements are appropriate and concise.The import statements are well-organized and include all necessary modules for the
VectorStoreclass functionality.
7-12: LGTM: VectorStore class initialization is well-structured.The class initialization is flexible with the optional model parameter and correctly sets up the necessary attributes for the VectorStore functionality.
1-107: Overall assessment: Well-structured implementation with room for improvementsThe
VectorStoreclass provides a comprehensive set of methods for managing document embeddings and storage using the LangChain library. The overall structure and logic of the class are sound and align well with the intended functionality.Key strengths:
- Clear separation of concerns across methods
- Effective use of LangChain components
- Thoughtful implementation of document chunking and ID generation
Suggested improvements:
- Enhance error handling across all methods
- Implement consistent logging throughout the class
- Address minor issues like typos and unnecessary f-strings
- Consider adding type hints for better code clarity
By incorporating these suggestions, the
VectorStoreclass will become more robust, maintainable, and production-ready. Great job on the initial implementation!🧰 Tools
🪛 Ruff
106-106: f-string without any placeholders
Remove extraneous
fprefix(F541)
app/predacons_cli/src/cli.py (1)
11-11: Import statement is appropriateThe import of
VectorStorefrom theragmodule appears correct and necessary for the added functionality.
| def split_text(self,documents: list[Document]): | ||
| text_splitter = RecursiveCharacterTextSplitter( | ||
| chunk_size=300, | ||
| chunk_overlap=100, | ||
| length_function=len, | ||
| add_start_index=True, | ||
| ) | ||
| chunks = text_splitter.split_documents(documents) | ||
| print(f"Split {len(documents)} documents into {len(chunks)} chunks.") | ||
|
|
||
| document = chunks[10] | ||
| print(document.page_content) | ||
| print(document.metadata) | ||
|
|
||
| return chunks |
There was a problem hiding this comment.
Improve error handling and use logging in split_text method.
The split_text method looks good overall, but there are two areas for improvement:
-
Handle potential IndexError:
When accessingchunks[10], add a check to ensure there are at least 11 chunks to avoid an IndexError. -
Use logging instead of print statements:
Replace print statements with appropriate logging calls for better production-ready code.
Here's a suggested improvement:
import logging
def split_text(self, documents: list[Document]):
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=300,
chunk_overlap=100,
length_function=len,
add_start_index=True,
)
chunks = text_splitter.split_documents(documents)
logging.info(f"Split {len(documents)} documents into {len(chunks)} chunks.")
if len(chunks) > 10:
sample_chunk = chunks[10]
logging.debug(f"Sample chunk content: {sample_chunk.page_content}")
logging.debug(f"Sample chunk metadata: {sample_chunk.metadata}")
else:
logging.warning("Not enough chunks to display a sample.")
return chunksThis implementation addresses the potential IndexError and uses logging for better production readiness.
| def calculate_chunk_ids(self,chunks): | ||
|
|
||
| # This will create IDs like "data/monopoly.pdf:6:2" | ||
| # Page Source : Page Number : Chunk Index | ||
|
|
||
| last_page_id = None | ||
| current_chunk_index = 0 | ||
|
|
||
| for chunk in chunks: | ||
| source = chunk.metadata.get("source") | ||
| page = chunk.metadata.get("page") | ||
| current_page_id = f"{source}:{page}" | ||
|
|
||
| # If the page ID is the same as the last one, increment the index. | ||
| if current_page_id == last_page_id: | ||
| current_chunk_index += 1 | ||
| else: | ||
| current_chunk_index = 0 | ||
|
|
||
| # Calculate the chunk ID. | ||
| chunk_id = f"{current_page_id}:{current_chunk_index}" | ||
| last_page_id = current_page_id | ||
|
|
||
| # Add it to the page meta-data. | ||
| chunk.metadata["id"] = chunk_id | ||
|
|
||
| return chunks |
There was a problem hiding this comment.
Handle potential missing metadata in calculate_chunk_ids method.
The calculate_chunk_ids method effectively generates unique IDs for chunks. However, it assumes that 'source' and 'page' keys always exist in the chunk metadata. To improve robustness, consider handling cases where these keys might be missing.
Here's a suggested improvement:
def calculate_chunk_ids(self, chunks):
last_page_id = None
current_chunk_index = 0
for chunk in chunks:
source = chunk.metadata.get("source", "unknown_source")
page = chunk.metadata.get("page", "unknown_page")
current_page_id = f"{source}:{page}"
if current_page_id == last_page_id:
current_chunk_index += 1
else:
current_chunk_index = 0
chunk_id = f"{current_page_id}:{current_chunk_index}"
last_page_id = current_page_id
chunk.metadata["id"] = chunk_id
return chunksThis implementation uses the get method with default values to handle potential missing keys in the metadata.
| def add_to_chroma(self,chunks: list[Document]): | ||
| # Load the existing database. | ||
| db = Chroma( | ||
| persist_directory=self.chroma_path, embedding_function=self.embedder | ||
| ) | ||
|
|
||
| # Calculate Page IDs. | ||
| chunks_with_ids = self.alculate_chunk_ids(chunks) | ||
|
|
||
| # Add or Update the documents. | ||
| existing_items = db.get(include=[]) # IDs are always included by default | ||
| existing_ids = set(existing_items["ids"]) | ||
| print(f"Number of existing documents in DB: {len(existing_ids)}") | ||
|
|
||
| # Only add documents that don't exist in the DB. | ||
| new_chunks = [] | ||
| for chunk in chunks_with_ids: | ||
| if chunk.metadata["id"] not in existing_ids: | ||
| new_chunks.append(chunk) | ||
|
|
||
| if len(new_chunks): | ||
| print(f"👉 Adding new documents: {len(new_chunks)}") | ||
| new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks] | ||
| db.add_documents(new_chunks, ids=new_chunk_ids) | ||
| db.persist() | ||
| else: | ||
| print("✅ No new documents to add") | ||
|
|
There was a problem hiding this comment.
Fix typo and improve logging in add_to_chroma method.
The add_to_chroma method effectively adds new chunks to the Chroma database. However, there are two issues to address:
- There's a typo in the method call:
self.alculate_chunk_idsshould beself.calculate_chunk_ids. - The logging can be improved for consistency and clarity.
Here's a suggested improvement:
import logging
def add_to_chroma(self, chunks: list[Document]):
db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder)
chunks_with_ids = self.calculate_chunk_ids(chunks) # Fix typo here
existing_items = db.get(include=[])
existing_ids = set(existing_items["ids"])
logging.info(f"Number of existing documents in DB: {len(existing_ids)}")
new_chunks = [chunk for chunk in chunks_with_ids if chunk.metadata["id"] not in existing_ids]
if new_chunks:
logging.info(f"Adding {len(new_chunks)} new documents")
new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
db.add_documents(new_chunks, ids=new_chunk_ids)
db.persist()
else:
logging.info("No new documents to add")This implementation fixes the typo and uses consistent logging throughout the method.
| def load_documents(self): | ||
| loader = DirectoryLoader(self.document_path) | ||
| documents = loader.load() | ||
| return documents |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance load_documents method with error handling and logging.
While the load_documents method is concise, it could benefit from error handling and logging for improved robustness and debugging.
Consider the following improvement:
import logging
from langchain_community.document_loaders import DirectoryLoader
def load_documents(self):
try:
loader = DirectoryLoader(self.document_path)
documents = loader.load()
logging.info(f"Successfully loaded {len(documents)} documents from {self.document_path}")
return documents
except Exception as e:
logging.error(f"Error loading documents from {self.document_path}: {str(e)}")
raiseThis implementation adds error handling and logging, providing more context about the document loading process and any potential issues.
| def load_and_update_db(self): | ||
| documents = self.load_documents() | ||
| chunks = self.split_text(documents) | ||
| self.add_to_chroma(chunks) |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance load_and_update_db method with error handling and logging.
The load_and_update_db method effectively orchestrates the document processing pipeline. However, it could benefit from error handling and logging for improved robustness and debugging.
Consider the following improvement:
import logging
def load_and_update_db(self):
try:
logging.info("Starting database update process")
documents = self.load_documents()
logging.info(f"Loaded {len(documents)} documents")
chunks = self.split_text(documents)
logging.info(f"Split documents into {len(chunks)} chunks")
self.add_to_chroma(chunks)
logging.info("Completed adding chunks to Chroma database")
except Exception as e:
logging.error(f"Error in load_and_update_db: {str(e)}")
raise
else:
logging.info("Database update process completed successfully")This implementation adds error handling and logging throughout the process, providing better visibility into the progress and any potential issues during the database update.
| def load_db(self): | ||
| db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder) | ||
| return db |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add error handling to load_db method.
The load_db method is concise, but it could benefit from error handling to improve robustness.
Consider the following improvement:
import logging
from langchain_community.vectorstores import Chroma
def load_db(self):
try:
db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder)
logging.info(f"Successfully loaded Chroma database from {self.chroma_path}")
return db
except Exception as e:
logging.error(f"Error loading Chroma database from {self.chroma_path}: {str(e)}")
raiseThis implementation adds error handling and logging, providing more context about the database loading process and any potential issues.
| def get_similar(self, query,db =None, top_n = 5,similarity_threshold = 0.1): | ||
| if db is None: | ||
| db = self.load_db() | ||
| results = db.similarity_search_with_relevance_scores(query, k=top_n) | ||
| if len(results) == 0 or results[0][1] < similarity_threshold: | ||
| print(f"Unable to find matching results.") | ||
| return results No newline at end of file |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance get_similar method and address f-string issue.
The get_similar method performs a similarity search effectively, but there are a few improvements we can make:
- Address the f-string without placeholders (F541 error).
- Add error handling for the database operations.
- Improve the logging of results.
- Consider returning a more structured result.
Here's an improved version:
import logging
from typing import List, Tuple, Any
def get_similar(self, query: str, db=None, top_n: int = 5, similarity_threshold: float = 0.1) -> List[Tuple[Any, float]]:
try:
if db is None:
db = self.load_db()
results = db.similarity_search_with_relevance_scores(query, k=top_n)
filtered_results = [result for result in results if result[1] >= similarity_threshold]
if not filtered_results:
logging.info("No results found above the similarity threshold.")
else:
logging.info(f"Found {len(filtered_results)} relevant results.")
return filtered_results
except Exception as e:
logging.error(f"Error in similarity search: {str(e)}")
raiseThis implementation addresses the f-string issue, adds error handling, improves logging, and returns a filtered list of results. The method now returns a more structured result that can be easily processed by the calling code.
🧰 Tools
🪛 Ruff
106-106: f-string without any placeholders
Remove extraneous
fprefix(F541)
| if config["chat_with_data"]: | ||
| # get response from vector store | ||
| response = vector_store.get_response(user_input) | ||
| response_body = {"role": "assistant", "content": response} | ||
| chat.append(response_body) | ||
| print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]") |
There was a problem hiding this comment.
Handle potential use of uninitialized 'vector_store'
If the vector database is not found or required configurations are missing, vector_store may not be initialized, leading to a NameError when calling vector_store.get_response(user_input). To prevent this, ensure vector_store is initialized properly or adjust the logic to handle cases when it's not available.
Apply this diff to adjust the response generation:
else:
print("[red]Vector DB not found![/red]")
print("[yellow]Continuing with simple chat[/yellow]")
+ vector_store = None
# In the chat loop
if user_input == "exit":
return
elif user_input == "clear":
chat = []
print("[yellow]Chat history cleared![/yellow]")
...
else:
- if config["chat_with_data"]:
+ if config["chat_with_data"] and vector_store is not None:
# get response from vector store
response = vector_store.get_response(user_input)
response_body = {"role": "assistant", "content": response}
chat.append(response_body)
print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]")
+ else:
+ response = Cli.generate_response(self, chat, model, tokenizer, config)
+ response_body = {"role": "assistant", "content": response}
+ chat.append(response_body)
+ if config["print_as_markdown"]:
+ markdown = Markdown(response)
+ print("[orange1]Predacons: [/orange1]")
+ console.print(markdown)
+ else:
+ console.print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]")Committable suggestion was skipped due to low confidence.
Avoid redundant response generation
Currently, even when a response is obtained from vector_store, the code still calls Cli.generate_response, leading to multiple responses for a single user input. Consider adjusting the logic to prevent redundant response generation and ensure only one response is provided to the user.
Apply this diff to modify the response flow:
else:
if config["chat_with_data"] and vector_store is not None:
# get response from vector store
response = vector_store.get_response(user_input)
+ if response:
+ response_body = {"role": "assistant", "content": response}
+ chat.append(response_body)
+ print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]")
+ continue # Skip generating response from the model
response = Cli.generate_response(self, chat, model, tokenizer, config)
response_body = {"role": "assistant", "content": response}
chat.append(response_body)
if config["print_as_markdown"]:
markdown = Markdown(response)
print("[orange1]Predacons: [/orange1]")
console.print(markdown)
else:
console.print("[orange1]Predacons: [/orange1] [sky_blue1]" + response+"[/sky_blue1]")Committable suggestion was skipped due to low confidence.
| print("[yellow]Loading data from vector DB[/yellow]") | ||
| # load data from vector db | ||
|
|
||
| vector_store = VectorStore(config["vector_db_path"], config["document_path"], config["embedding_model"]) |
There was a problem hiding this comment.
Ensure required configuration keys are present before initializing VectorStore
When initializing VectorStore, the code uses config["vector_db_path"], config["document_path"], and config["embedding_model"]. Without prior checks, missing keys can result in a KeyError. Consider verifying that these keys exist and handling missing values appropriately.
Apply this diff to check for required configuration keys:
print("[yellow]Loading data from vector DB[/yellow]")
+# Verify required configuration keys
+required_keys = ["vector_db_path", "document_path", "embedding_model"]
+missing_keys = [key for key in required_keys if key not in config]
+if missing_keys:
+ print(f"[red]Missing configuration keys: {', '.join(missing_keys)}[/red]")
+ print("[yellow]Continuing with simple chat[/yellow]")
+ vector_store = None
+else:
# load data from vector db
vector_store = VectorStore(config["vector_db_path"], config["document_path"], config["embedding_model"])📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| vector_store = VectorStore(config["vector_db_path"], config["document_path"], config["embedding_model"]) | |
| # Verify required configuration keys | |
| required_keys = ["vector_db_path", "document_path", "embedding_model"] | |
| missing_keys = [key for key in required_keys if key not in config] | |
| if missing_keys: | |
| print(f"[red]Missing configuration keys: {', '.join(missing_keys)}[/red]") | |
| print("[yellow]Continuing with simple chat[/yellow]") | |
| vector_store = None | |
| else: | |
| # load data from vector db | |
| vector_store = VectorStore(config["vector_db_path"], config["document_path"], config["embedding_model"]) |
|
|
||
| chat = [] | ||
| print("[yellow]Checking for data sources for the chat[/yellow]") | ||
| if config["chat_with_data"]: |
There was a problem hiding this comment.
Add 'chat_with_data' and related keys to configuration
The code checks config["chat_with_data"], but this key is not present in the default configuration or in the create_config_file method. This will result in a KeyError if 'chat_with_data' is not defined. Similarly, vector_db_path, document_path, and embedding_model are used but may not be specified.
Apply this diff to add the missing keys to the default configuration:
def create_config_file(self):
config = Cli.check_config_file(self)
file_path = self.config_file_path
default_config = {
"model_path": "Precacons/Pico-Lamma-3.2-1B-Reasoning-Instruct",
"trust_remote_code": False,
"use_fast_generation": False,
"draft_model_name": None,
"gguf_file": None,
"auto_quantize": False,
"temperature": 0.3,
"max_length": 1000,
"top_k": 50,
"top_p": 0.9,
"repetition_penalty": 1.0,
"num_return_sequences": 1,
"print_as_markdown": False,
+ "chat_with_data": False,
+ "vector_db_path": "path/to/vector_db",
+ "document_path": "path/to/documents",
+ "embedding_model": "model/name"
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if config["chat_with_data"]: | |
| def create_config_file(self): | |
| config = Cli.check_config_file(self) | |
| file_path = self.config_file_path | |
| default_config = { | |
| "model_path": "Precacons/Pico-Lamma-3.2-1B-Reasoning-Instruct", | |
| "trust_remote_code": False, | |
| "use_fast_generation": False, | |
| "draft_model_name": None, | |
| "gguf_file": None, | |
| "auto_quantize": False, | |
| "temperature": 0.3, | |
| "max_length": 1000, | |
| "top_k": 50, | |
| "top_p": 0.9, | |
| "repetition_penalty": 1.0, | |
| "num_return_sequences": 1, | |
| "print_as_markdown": False, | |
| "chat_with_data": False, | |
| "vector_db_path": "path/to/vector_db", | |
| "document_path": "path/to/documents", | |
| "embedding_model": "model/name" | |
| } |
There was a problem hiding this comment.
Actionable comments posted: 11
🧹 Outside diff range and nitpick comments (6)
app/predacons_cli/src/rag.py (2)
16-27: Simplify model_id assignment in initThe current assignment of
model_iduses a ternary operator that can be simplified. Theif model is not Nonecheck is redundant sincemodelwill beNoneif not provided.Consider simplifying the
model_idassignment:- self.model_id = model if model is not None else None + self.model_id = modelThis change maintains the same functionality while making the code more concise.
150-175: Implement WebScraper class methodsThe
WebScraperclass is currently a skeleton with empty methods. While the structure and docstrings are in place, the actual implementation is missing.To complete the implementation, consider the following steps:
- Implement the
fetch_htmlmethod using therequestslibrary to retrieve HTML content from a given URL.- Implement the
parse_htmlmethod usingBeautifulSoupto extract relevant text from the HTML content.- Add error handling and logging to both methods.
- Consider adding a method to combine fetching and parsing in a single operation.
Would you like assistance in generating the implementation for these methods?
app/predacons_cli/src/cli.py (4)
11-11: Consider grouping imports for better organizationIt's generally a good practice to group imports in the following order: standard library imports, third-party imports, and local imports. Consider moving this import statement to be grouped with other third-party or local imports for better code organization.
24-33: Remove or document commented-out codeCommented-out code can lead to confusion and clutter the codebase. If this GPU-related code is no longer needed, consider removing it entirely. If it's intended for future use, add a clear comment explaining its purpose and when it might be reintroduced, or move it to a separate development branch.
213-217: Approved: New configuration options for vector databaseThe addition of new configuration options for vector database functionality is well-implemented. The prompts for user input and the inclusion of these options in the config_data dictionary are consistent with the existing code style.
For improved clarity, consider adding brief comments explaining the purpose of each new configuration option, especially for users who might not be familiar with vector databases or embedding models.
Here's a suggested improvement for the default configuration:
default_config = { # ... existing options ... "chat_with_data": False, # Enable/disable chat with vector database "vector_db_path": None, # Path to the vector database "document_path": None, # Path to the documents for vector database "embedding_model": None # Name or path of the embedding model to use }These comments will help users understand the purpose of each option when they're editing the configuration file directly.
Also applies to: 273-276, 293-296
Line range hint
1-324: Overall improvements and suggestions for the Cli classThe changes to the
Cliclass significantly enhance its capabilities, particularly with the addition of vector database support. Here are some overall suggestions for further improvement:
Consider breaking down the
launchmethod into smaller, more focused methods. It's currently handling multiple responsibilities and could benefit from better separation of concerns.Implement proper logging throughout the class instead of using print statements. This will make it easier to debug and monitor the application in various environments.
Add docstrings to new methods and update existing ones to reflect the new functionality.
Consider implementing type hints to improve code readability and catch potential type-related errors early.
Implement unit tests for the new functionality to ensure reliability and ease of maintenance.
Here's an example of how you might start refactoring the
launchmethod:import logging class Cli: def __init__(self): self.logger = logging.getLogger(__name__) # ... other initialization ... def launch(self, logs=False): self.setup_logging(logs) config = self.load_or_create_config() model, tokenizer = self.load_model(config) vector_db = self.setup_vector_db(config) self.run_chat_loop(model, tokenizer, vector_db, config) def setup_logging(self, logs): # Set up logging configuration def load_or_create_config(self): # Load existing config or create a new one def setup_vector_db(self, config): # Set up vector database if enabled in config def run_chat_loop(self, model, tokenizer, vector_db, config): # Main chat loop logic # ... other methods ...This structure would make the code more modular and easier to maintain and test.
🧰 Tools
🪛 Ruff
82-82: Avoid equality comparisons to
False; useif not logs:for false checksReplace with
not logs(E712)
84-84: Loop control variable
inot used within loop bodyRename unused
ito_i(B007)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- .gitignore (1 hunks)
- app/predacons_cli/src/cli.py (9 hunks)
- app/predacons_cli/src/rag.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/rag.py
6-6:
requestsimported but unusedRemove unused import:
requests(F401)
7-7:
bs4.BeautifulSoupimported but unusedRemove unused import:
bs4.BeautifulSoup(F401)
147-147: f-string without any placeholders
Remove extraneous
fprefix(F541)
🔇 Additional comments (1)
app/predacons_cli/src/rag.py (1)
1-176: Summary of review for rag.pyOverall, the
rag.pyfile introduces two important classes:VectorStoreandWebScraper. Here's a summary of the main points:
The
VectorStoreclass is well-structured and functional, but could benefit from:
- Improved error handling across all methods
- Consistent use of logging instead of print statements
- Minor optimizations in some methods
The
WebScraperclass is currently a skeleton and requires implementation of its methods.There are unused imports at the beginning of the file that should be removed.
Addressing these points will improve the robustness, maintainability, and completeness of the code. Great progress on implementing the RAG (Retrieval-Augmented Generation) functionality!
🧰 Tools
🪛 Ruff
6-6:
requestsimported but unusedRemove unused import:
requests(F401)
7-7:
bs4.BeautifulSoupimported but unusedRemove unused import:
bs4.BeautifulSoup(F401)
147-147: f-string without any placeholders
Remove extraneous
fprefix(F541)
| def split_text(self, documents: list[Document]): | ||
| """ | ||
| Split documents into chunks using RecursiveCharacterTextSplitter. | ||
|
|
||
| :param documents: List of Document objects to be split. | ||
| :return: List of chunks. | ||
| """ | ||
| text_splitter = RecursiveCharacterTextSplitter( | ||
| chunk_size=300, | ||
| chunk_overlap=100, | ||
| length_function=len, | ||
| add_start_index=True, | ||
| ) | ||
| chunks = text_splitter.split_documents(documents) | ||
| print(f"Split {len(documents)} documents into {len(chunks)} chunks.") | ||
|
|
||
| document = chunks[10] | ||
| print(document.page_content) | ||
| print(document.metadata) | ||
|
|
||
| return chunks |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Replace print statements with logging in split_text method
Using print statements for debugging in production code is not ideal. Consider using the logging module for better control over output and log levels.
Here's a suggested improvement:
import logging
def split_text(self, documents: list[Document]):
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=300,
chunk_overlap=100,
length_function=len,
add_start_index=True,
)
chunks = text_splitter.split_documents(documents)
logging.info(f"Split {len(documents)} documents into {len(chunks)} chunks.")
if len(chunks) > 10:
sample_chunk = chunks[10]
logging.debug(f"Sample chunk content: {sample_chunk.page_content}")
logging.debug(f"Sample chunk metadata: {sample_chunk.metadata}")
else:
logging.warning("Not enough chunks to display a sample.")
return chunksThis implementation uses logging instead of print statements and adds a check to avoid IndexError when accessing chunks[10].
| def get_similar(self, query, db=None, top_n=5, similarity_threshold=0.1): | ||
| """ | ||
| Retrieve similar documents from the Chroma database based on a query. | ||
|
|
||
| :param query: Query string to search for similar documents. | ||
| :param db: Optional Chroma database object. If None, the database will be loaded. | ||
| :param top_n: Number of top similar documents to retrieve. | ||
| :param similarity_threshold: Minimum similarity score threshold. | ||
| :return: List of tuples containing similar documents and their relevance scores. | ||
| """ | ||
| if db is None: | ||
| db = self.load_db() | ||
| results = db.similarity_search_with_relevance_scores(query, k=top_n) | ||
| if len(results) == 0 or results[0][1] < similarity_threshold: | ||
| print(f"Unable to find matching results.") | ||
| return results |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance get_similar method and address f-string issue
The get_similar method could be improved in several ways:
- Address the f-string without placeholders (F541 error).
- Add more comprehensive error handling.
- Improve logging of results.
- Consider returning a more structured result.
Here's an improved version:
import logging
from typing import List, Tuple, Any
def get_similar(self, query: str, db=None, top_n: int = 5, similarity_threshold: float = 0.1) -> List[Tuple[Any, float]]:
try:
if db is None:
db = self.load_db()
results = db.similarity_search_with_relevance_scores(query, k=top_n)
filtered_results = [result for result in results if result[1] >= similarity_threshold]
if not filtered_results:
logging.info("No results found above the similarity threshold.")
else:
logging.info(f"Found {len(filtered_results)} relevant results.")
return filtered_results
except Exception as e:
logging.error(f"Error in similarity search: {str(e)}")
raiseThis implementation addresses the f-string issue, adds error handling, improves logging, and returns a filtered list of results. The method now returns a more structured result that can be easily processed by the calling code.
🧰 Tools
🪛 Ruff
147-147: f-string without any placeholders
Remove extraneous
fprefix(F541)
| vector_db = None | ||
| print("[yellow]Checking for data sources for the chat[/yellow]") | ||
| if config.get("chat_with_data", False): | ||
| print("[yellow]Chat with data enabled looking for vector db[/yellow]") | ||
| if not os.path.exists(config.get("vector_db_path", False)): | ||
| print("[red]Vector DB not found![/red]") | ||
| print("[yellow]Continueing with simple chat[/yellow]") | ||
| else: | ||
| print("[green]Vector DB found![/green]") | ||
| print("[yellow]Loading data from vector DB[/yellow]") | ||
| # load data from vector db | ||
|
|
||
| vector_store = VectorStore(config["vector_db_path"], config["document_path"], config.get("embedding_model", None)) | ||
| # check and update the vector store | ||
| vector_store.load_and_update_db() | ||
| vector_db = vector_store.load_db() | ||
| print("[yellow]Data loaded successfully![/yellow]") |
There was a problem hiding this comment.
Improve config access consistency and error handling
-
There's inconsistent use of
config.get()vs direct dictionary access. For example,config.get("chat_with_data", False)is used, butconfig["vector_db_path"]is accessed directly. Consider usingconfig.get()consistently to avoid potential KeyError exceptions. -
The error handling for a missing vector database could be improved. Instead of just printing a message and continuing, consider logging the error and/or raising a specific exception that can be caught and handled appropriately.
Here's a suggested improvement:
import logging
# ... (in the launch method)
if config.get("chat_with_data", False):
print("[yellow]Chat with data enabled looking for vector db[/yellow]")
vector_db_path = config.get("vector_db_path")
if not vector_db_path or not os.path.exists(vector_db_path):
logging.warning("Vector DB not found or path not specified.")
print("[red]Vector DB not found or not specified![/red]")
print("[yellow]Continuing with simple chat[/yellow]")
else:
print("[green]Vector DB found![/green]")
print("[yellow]Loading data from vector DB[/yellow]")
try:
vector_store = VectorStore(
vector_db_path,
config.get("document_path"),
config.get("embedding_model")
)
vector_store.load_and_update_db()
vector_db = vector_store.load_db()
print("[yellow]Data loaded successfully![/yellow]")
except Exception as e:
logging.error(f"Failed to load vector database: {str(e)}")
print("[red]Failed to load vector database. Continuing with simple chat.[/red]")
vector_db = NoneThis approach provides more robust error handling and consistent use of config.get().
| if config.get("chat_with_data", False) and vector_db: | ||
| # get response from vector store | ||
| response = vector_store.get_similar(user_input, db=vector_db, top_n=5, similarity_threshold=0.1) | ||
| PROMPT_TEMPLATE = """ | ||
| Answer the question based only on the following context: | ||
|
|
||
| {context} | ||
|
|
||
| --- | ||
|
|
||
| Answer the question based on the above context: {question} | ||
| """ | ||
| user_input = PROMPT_TEMPLATE.format(context=response, question=user_input) | ||
|
|
||
| user_body = {"role": "user", "content": user_input} |
There was a problem hiding this comment.
Optimize PROMPT_TEMPLATE definition and fix indentation
-
The
PROMPT_TEMPLATEis defined within the chat loop, which means it will be redefined in each iteration. This is inefficient. Consider moving it outside the loop or even making it a class constant. -
The indentation of the
user_inputassignment (line 121) seems to be at the same level as theifblock above it. This could lead to logical errors as it will always be executed, regardless of whether theifcondition is met.
Here's a suggested improvement:
# Define this outside the loop, perhaps as a class constant
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
{context}
---
Answer the question based on the above context: {question}
"""
# In the chat loop
if config.get("chat_with_data", False) and vector_db:
# get response from vector store
response = vector_store.get_similar(user_input, db=vector_db, top_n=5, similarity_threshold=0.1)
user_input = PROMPT_TEMPLATE.format(context=response, question=user_input)
user_body = {"role": "user", "content": user_input}
chat.append(user_body)This change will improve performance by avoiding unnecessary redefinition of the template and ensure correct logic flow by fixing the indentation.
There was a problem hiding this comment.
Actionable comments posted: 10
🧹 Outside diff range and nitpick comments (9)
app/predacons_cli/src/rag.py (2)
16-27: Simplify model_id assignment in init methodThe current implementation uses a ternary operator to set
self.model_id, which can be simplified.Consider simplifying the
self.model_idassignment:- self.model_id = model if model is not None else None + self.model_id = modelThis change maintains the same functionality while making the code more concise.
148-173: Add TODO comments or NotImplementedError to WebScraper methodsThe
WebScraperclass currently contains empty methods. While it's common to have placeholder methods during development, it would be helpful to add TODO comments or raiseNotImplementedErrorto indicate that these methods need to be implemented.Consider updating the methods as follows:
class WebScraper: # ... (existing code) def fetch_html(self, url): """ Fetch HTML content from a given URL. :param url: URL to fetch HTML content from. :return: HTML content as a string. """ # TODO: Implement HTML fetching logic raise NotImplementedError("fetch_html method not implemented yet") def parse_html(self, html_content): """ Parse HTML content and extract relevant text. :param html_content: HTML content to parse. :return: Extracted text as a string. """ # TODO: Implement HTML parsing logic raise NotImplementedError("parse_html method not implemented yet")This change will make it clear that these methods are not yet implemented and need to be completed.
app/predacons_cli/src/cli.py (7)
24-33: Consider removing or implementing the GPU checkThe commented-out code for GPU availability check should either be removed or properly implemented. Keeping commented-out code in production can lead to confusion and maintenance issues.
If GPU support is planned for future implementation, consider adding a TODO comment instead of keeping the entire code block commented out.
64-78: LGTM: Vector database integrationThe integration of vector database functionality is well-implemented. However, consider the following improvements for robustness:
- Use
config.get("vector_db_path")instead ofconfig["vector_db_path"]to avoid potential KeyError.- Add error handling for the
VectorStoreinitialization andload_dbcall.Example:
vector_db_path = config.get("vector_db_path") if vector_db_path and os.path.exists(vector_db_path): try: vector_store = VectorStore(vector_db_path, config.get("document_path"), config.get("embedding_model")) vector_db = vector_store.load_db() print("[yellow]Data loaded successfully![/yellow]") except Exception as e: print(f"[red]Error loading vector database: {e}[/red]") vector_db = None
110-123: LGTM: Vector DB integration in response generationThe integration of the vector database for context-aware responses is well-implemented. However, consider the following optimization:
- Move the
PROMPT_TEMPLATEoutside the loop to avoid recreating it in each iteration. You can define it as a class constant or at the module level.Example:
PROMPT_TEMPLATE = """ Answer the question based only on the following context: {context} --- Answer the question based on the above context: {question} """ # In the loop if config.get("chat_with_data", False) and vector_db: context_text = vector_store.get_similar(user_input, db=vector_db, top_n=5, similarity_threshold=0.1) user_input = PROMPT_TEMPLATE.format(context=context_text, question=user_input)This change will improve performance by avoiding unnecessary string creation in each iteration.
217-221: LGTM: New vector database configuration optionsThe new configuration options for the vector database functionality are correctly added to the default configuration. However, consider adding validation for these new options when they are set by the user. For example:
if config["chat_with_data"]: if not config["vector_db_path"] or not os.path.exists(config["vector_db_path"]): print("[yellow]Warning: Vector DB path is not set or does not exist.[/yellow]") if not config["document_path"] or not os.path.exists(config["document_path"]): print("[yellow]Warning: Document path is not set or does not exist.[/yellow]") if not config["embedding_model"]: print("[yellow]Warning: Embedding model is not set.[/yellow]")This will help users identify potential issues with their configuration early on.
277-280: LGTM: New prompts for vector database configurationThe new prompts for vector database configuration are correctly added. However, consider adding input validation for these new options. For example:
vector_db = Prompt.ask("Chat with data? (true/false)", default=str(config["chat_with_data"])) if vector_db.lower() == 'true': vector_db_path = Prompt.ask("Enter the vector DB path", default=config["vector_db_path"]) while not vector_db_path or not os.path.exists(vector_db_path): print("[yellow]Warning: Vector DB path is not valid or does not exist.[/yellow]") vector_db_path = Prompt.ask("Enter a valid vector DB path") document_path = Prompt.ask("Enter the document path", default=config["document_path"]) while not document_path or not os.path.exists(document_path): print("[yellow]Warning: Document path is not valid or does not exist.[/yellow]") document_path = Prompt.ask("Enter a valid document path") embedding_model = Prompt.ask("Enter the embedding model", default=config["embedding_model"]) while not embedding_model: print("[yellow]Warning: Embedding model cannot be empty.[/yellow]") embedding_model = Prompt.ask("Enter a valid embedding model")This will ensure that the user provides valid inputs for the vector database configuration.
80-80: Usenotinstead of comparison toFalseReplace
if logs == False:withif not logs:for better readability and to follow Python style guidelines.🧰 Tools
🪛 Ruff
80-80: Avoid equality comparisons to
False; useif not logs:for false checksReplace with
not logs(E712)
82-82: Rename unused loop variableThe loop control variable
iis not used within the loop body. Rename it to_to indicate it's intentionally unused:for _ in range(1, 0, -1): time.sleep(1)This change improves code clarity by explicitly showing that the loop variable is not used.
🧰 Tools
🪛 Ruff
82-82: Loop control variable
inot used within loop bodyRename unused
ito_i(B007)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- app/predacons_cli/src/cli.py (8 hunks)
- app/predacons_cli/src/rag.py (1 hunks)
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/cli.py
80-80: Avoid equality comparisons to
False; useif not logs:for false checksReplace with
not logs(E712)
82-82: Loop control variable
inot used within loop bodyRename unused
ito_i(B007)
app/predacons_cli/src/rag.py
6-6:
requestsimported but unusedRemove unused import:
requests(F401)
7-7:
bs4.BeautifulSoupimported but unusedRemove unused import:
bs4.BeautifulSoup(F401)
143-143: f-string without any placeholders
Remove extraneous
fprefix(F541)
🔇 Additional comments (3)
app/predacons_cli/src/cli.py (3)
11-11: LGTM: New import for vector database functionalityThe import of
VectorStorefrom theragmodule is correctly placed and necessary for the new vector database functionality.
87-87: LGTM: Updated help messageThe help message has been appropriately updated to include the new 'update' command for updating loaded documents.
297-300: LGTM: New vector database options in configuration data structureThe new vector database options are correctly added to the configuration data structure. The use of conditional expressions to set default values is appropriate and helps maintain clean code.
| def load_documents(self): | ||
| """ | ||
| Load documents from the specified directory. | ||
|
|
||
| :return: List of loaded Document objects. | ||
| """ | ||
| loader = DirectoryLoader(self.document_path) | ||
| documents = loader.load() | ||
| return documents |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance load_documents method with error handling and logging
The load_documents method could benefit from error handling and logging to improve robustness and provide better context during document loading.
Consider enhancing the method as follows:
import logging
def load_documents(self):
"""
Load documents from the specified directory.
:return: List of loaded Document objects.
"""
try:
loader = DirectoryLoader(self.document_path)
documents = loader.load()
logging.info(f"Successfully loaded {len(documents)} documents from {self.document_path}")
return documents
except Exception as e:
logging.error(f"Error loading documents from {self.document_path}: {str(e)}")
raiseThis implementation adds error handling and logging, providing more context about the document loading process and any potential issues.
| def load_and_update_db(self): | ||
| """ | ||
| Load documents, split them into chunks, and add new chunks to the Chroma database. | ||
| """ | ||
| documents = self.load_documents() | ||
| chunks = self.split_text(documents) | ||
| self.add_to_chroma(chunks) |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance load_and_update_db method with error handling and logging
The load_and_update_db method orchestrates the document processing pipeline but lacks error handling and logging. Adding these would improve robustness and provide better visibility into the process.
Consider enhancing the method as follows:
import logging
def load_and_update_db(self):
"""
Load documents, split them into chunks, and add new chunks to the Chroma database.
"""
try:
logging.info("Starting database update process")
documents = self.load_documents()
logging.info(f"Loaded {len(documents)} documents")
chunks = self.split_text(documents)
logging.info(f"Split documents into {len(chunks)} chunks")
self.add_to_chroma(chunks)
logging.info("Completed adding chunks to Chroma database")
except Exception as e:
logging.error(f"Error in load_and_update_db: {str(e)}")
raise
else:
logging.info("Database update process completed successfully")This implementation adds error handling and logging throughout the process, providing better visibility into the progress and any potential issues during the database update.
| def load_db(self): | ||
| """ | ||
| Load the Chroma database. | ||
|
|
||
| :return: Chroma database object. | ||
| """ | ||
| db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder) | ||
| return db |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add error handling and logging to load_db method
The load_db method could benefit from error handling and logging to improve robustness and provide better context during database loading.
Consider enhancing the method as follows:
import logging
def load_db(self):
"""
Load the Chroma database.
:return: Chroma database object.
"""
try:
db = Chroma(persist_directory=self.chroma_path, embedding_function=self.embedder)
logging.info(f"Successfully loaded Chroma database from {self.chroma_path}")
return db
except Exception as e:
logging.error(f"Error loading Chroma database from {self.chroma_path}: {str(e)}")
raiseThis implementation adds error handling and logging, providing more context about the database loading process and any potential issues.
| def get_similar(self, query, db=None, top_n=5, similarity_threshold=0.1): | ||
| """ | ||
| Retrieve similar documents from the Chroma database based on a query. | ||
|
|
||
| :param query: Query string to search for similar documents. | ||
| :param db: Optional Chroma database object. If None, the database will be loaded. | ||
| :param top_n: Number of top similar documents to retrieve. | ||
| :param similarity_threshold: Minimum similarity score threshold. | ||
| :return: List of tuples containing similar documents and their relevance scores. | ||
| """ | ||
| if db is None: | ||
| db = self.load_db() | ||
| results = db.similarity_search_with_relevance_scores(query, k=top_n) | ||
| if len(results) == 0 or results[0][1] < similarity_threshold: | ||
| print(f"Unable to find matching results.") | ||
|
|
||
| context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results]) | ||
| return context_text | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance get_similar method and address f-string issue
The get_similar method could be improved in several ways:
- Replace the print statement with proper logging.
- Fix the f-string without placeholders.
- Consider returning a more structured result.
Here's an improved version of the method:
import logging
from typing import List, Tuple, Any
def get_similar(self, query: str, db=None, top_n: int = 5, similarity_threshold: float = 0.1) -> List[Tuple[Any, float]]:
"""
Retrieve similar documents from the Chroma database based on a query.
:param query: Query string to search for similar documents.
:param db: Optional Chroma database object. If None, the database will be loaded.
:param top_n: Number of top similar documents to retrieve.
:param similarity_threshold: Minimum similarity score threshold.
:return: List of tuples containing similar documents and their relevance scores.
"""
try:
if db is None:
db = self.load_db()
results = db.similarity_search_with_relevance_scores(query, k=top_n)
filtered_results = [result for result in results if result[1] >= similarity_threshold]
if not filtered_results:
logging.info("No results found above the similarity threshold.")
else:
logging.info(f"Found {len(filtered_results)} relevant results.")
return filtered_results
except Exception as e:
logging.error(f"Error in similarity search: {str(e)}")
raiseThis implementation addresses the f-string issue, adds error handling, improves logging, and returns a filtered list of results. The method now returns a more structured result that can be easily processed by the calling code.
🧰 Tools
🪛 Ruff
143-143: f-string without any placeholders
Remove extraneous
fprefix(F541)
| from langchain_community.document_loaders import DirectoryLoader | ||
| from langchain.text_splitter import RecursiveCharacterTextSplitter | ||
| from langchain.schema import Document | ||
| from langchain_community.vectorstores import Chroma | ||
| from predacons import PredaconsEmbedding | ||
| import requests | ||
| from bs4 import BeautifulSoup | ||
| # from googlesearch import search | ||
|
|
There was a problem hiding this comment.
Remove unused imports and clean up commented code
There are a few issues with the import statements:
- The
requestsandBeautifulSoupmodules are imported but not used in the current implementation. - There's a commented-out import for
searchfromgooglesearch.
To keep the code clean and maintainable, please remove the unused imports and the commented-out code if it's not needed in the near future.
Apply this diff to clean up the imports:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import Chroma
from predacons import PredaconsEmbedding
-import requests
-from bs4 import BeautifulSoup
-# from googlesearch import search📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from langchain_community.document_loaders import DirectoryLoader | |
| from langchain.text_splitter import RecursiveCharacterTextSplitter | |
| from langchain.schema import Document | |
| from langchain_community.vectorstores import Chroma | |
| from predacons import PredaconsEmbedding | |
| import requests | |
| from bs4 import BeautifulSoup | |
| # from googlesearch import search | |
| from langchain_community.document_loaders import DirectoryLoader | |
| from langchain.text_splitter import RecursiveCharacterTextSplitter | |
| from langchain.schema import Document | |
| from langchain_community.vectorstores import Chroma | |
| from predacons import PredaconsEmbedding |
🧰 Tools
🪛 Ruff
6-6:
requestsimported but unusedRemove unused import:
requests(F401)
7-7:
bs4.BeautifulSoupimported but unusedRemove unused import:
bs4.BeautifulSoup(F401)
| def calculate_chunk_ids(self, chunks): | ||
| """ | ||
| Calculate unique IDs for each chunk based on source and page metadata. | ||
|
|
||
| :param chunks: List of chunks to calculate IDs for. | ||
| :return: List of chunks with updated metadata including IDs. | ||
| """ | ||
| last_page_id = None | ||
| current_chunk_index = 0 | ||
|
|
||
| for chunk in chunks: | ||
| source = chunk.metadata.get("source") | ||
| page = chunk.metadata.get("page") | ||
| current_page_id = f"{source}:{page}" | ||
|
|
||
| if current_page_id == last_page_id: | ||
| current_chunk_index += 1 | ||
| else: | ||
| current_chunk_index = 0 | ||
|
|
||
| chunk_id = f"{current_page_id}:{current_chunk_index}" | ||
| last_page_id = current_page_id | ||
| chunk.metadata["id"] = chunk_id | ||
|
|
||
| return chunks |
There was a problem hiding this comment.
Handle potential missing metadata in calculate_chunk_ids method
The calculate_chunk_ids method assumes that 'source' and 'page' keys always exist in the chunk metadata. To improve robustness, it's better to handle cases where these keys might be missing.
Consider using default values when accessing metadata:
for chunk in chunks:
- source = chunk.metadata.get("source")
- page = chunk.metadata.get("page")
+ source = chunk.metadata.get("source", "unknown_source")
+ page = chunk.metadata.get("page", "unknown_page")
current_page_id = f"{source}:{page}"This change ensures that even if 'source' or 'page' keys are missing, the method will still function without raising an error.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def calculate_chunk_ids(self, chunks): | |
| """ | |
| Calculate unique IDs for each chunk based on source and page metadata. | |
| :param chunks: List of chunks to calculate IDs for. | |
| :return: List of chunks with updated metadata including IDs. | |
| """ | |
| last_page_id = None | |
| current_chunk_index = 0 | |
| for chunk in chunks: | |
| source = chunk.metadata.get("source") | |
| page = chunk.metadata.get("page") | |
| current_page_id = f"{source}:{page}" | |
| if current_page_id == last_page_id: | |
| current_chunk_index += 1 | |
| else: | |
| current_chunk_index = 0 | |
| chunk_id = f"{current_page_id}:{current_chunk_index}" | |
| last_page_id = current_page_id | |
| chunk.metadata["id"] = chunk_id | |
| return chunks | |
| def calculate_chunk_ids(self, chunks): | |
| """ | |
| Calculate unique IDs for each chunk based on source and page metadata. | |
| :param chunks: List of chunks to calculate IDs for. | |
| :return: List of chunks with updated metadata including IDs. | |
| """ | |
| last_page_id = None | |
| current_chunk_index = 0 | |
| for chunk in chunks: | |
| source = chunk.metadata.get("source", "unknown_source") | |
| page = chunk.metadata.get("page", "unknown_page") | |
| current_page_id = f"{source}:{page}" | |
| if current_page_id == last_page_id: | |
| current_chunk_index += 1 | |
| else: | |
| current_chunk_index = 0 | |
| chunk_id = f"{current_page_id}:{current_chunk_index}" | |
| last_page_id = current_page_id | |
| chunk.metadata["id"] = chunk_id | |
| return chunks |
| elif user_input == "update": | ||
| print("[yellow]Updating documents...[/yellow]") | ||
| # check and update the vector store | ||
| vector_store.load_and_update_db() | ||
| print("[yellow]Documents updated successfully![/yellow]") |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add check before updating vector store
The 'update' command is a good addition. However, you should check if the vector store exists before calling load_and_update_db(). This will prevent potential errors if the user tries to update when chat_with_data is disabled or the vector store failed to initialize.
Consider modifying the code as follows:
elif user_input == "update":
if vector_store and config.get("chat_with_data", False):
print("[yellow]Updating documents...[/yellow]")
vector_store.load_and_update_db()
print("[yellow]Documents updated successfully![/yellow]")
else:
print("[red]Vector store is not initialized. Cannot update documents.[/red]")| cli = Cli() | ||
| cli.launch() |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Use a main guard for script entry point
Creating and launching the CLI instance at the module level will cause the CLI to start immediately when this module is imported, which may not be the desired behavior in all cases. It's a best practice to use a if __name__ == "__main__": guard for script entry points.
Consider modifying the code as follows:
def main():
cli = Cli()
cli.launch()
if __name__ == "__main__":
main()This change allows the module to be imported without automatically launching the CLI, which is useful for testing or if you want to use the Cli class in other scripts.
There was a problem hiding this comment.
Actionable comments posted: 13
🧹 Outside diff range and nitpick comments (11)
setup.py (2)
31-39: LGTM: Dependency updates are appropriate and well-defined.The dependency updates are well-structured and align with the expected new features:
- Updated "predacons" to version 0.0.126 for compatibility.
- Added new dependencies with specific minimum versions:
- rich: Enhanced CLI output
- langchain and langchain_community: Language model interactions
- requests and beautifulsoup4: Web scraping capabilities
- googlesearch-python: Google search functionality
These additions suggest significant new features in web scraping and language model interactions.
Consider updating the README or documentation to reflect these new capabilities and dependencies.
Line range hint
1-46: Consider updating PR description and documentation.The changes in
setup.pysuggest significant new features related to web scraping and language model interactions. However, the PR description lacks details about these changes.
- Update the PR description to explain the new features and their impact on the package's functionality.
- Ensure that the README and any other relevant documentation are updated to reflect these new capabilities and dependencies.
- If not already done, consider adding usage examples for the new features in the documentation.
These updates will help users understand the new capabilities and how to use them effectively.
app/predacons_cli/src/cli.py (9)
66-66: Typo in log messageThere's a typo in the print statement. "data sources for the chat" may be clearer as "data sources for chat".
Optional improvement:
-print("[yellow]Checking for data sources for the chat[/yellow]") +print("[yellow]Checking for data sources for chat[/yellow]")
68-68: Correct the typo in the log message"iitializing" should be "initializing" in the print statement.
Apply this diff to fix the typo:
-print("[yellow]Web scraping enabled iitializing web scraper[/yellow]") +print("[yellow]Web scraping enabled initializing web scraper[/yellow]")
74-74: Correct the typo in the log message"Continueing" should be "Continuing" in the print statement.
Apply this diff to fix the typo:
-print("[yellow]Continueing with simple chat[/yellow]") +print("[yellow]Continuing with simple chat[/yellow]")
80-80: Correct the typo in the log message"Continueing" should be "Continuing" in the print statement.
Apply this diff to fix the typo:
-print("[yellow]Continueing with simple chat[/yellow]") +print("[yellow]Continuing with simple chat[/yellow]")
97-97: Add a space after period in the log messageFor better readability, add a space after the period in the print statement.
Apply this diff to fix:
-print("[yellow]You can start chatting with Predacons now.Type 'clear' to clear history, Type 'exit' to quit, Type 'help' for more options, Type 'update' to update the load documents[/yellow]") +print("[yellow]You can start chatting with Predacons now. Type 'clear' to clear history, Type 'exit' to quit, Type 'help' for more options, Type 'update' to update the load documents[/yellow]")
132-132: Fix typos in the prompt templateCorrect the misspellings in the prompt template to improve clarity.
Apply this diff:
- Here are additianl infor from web search this context is google search results and web scraped data on the same topic : + Here is additional information from web search. This context includes Google search results and web scraped data on the same topic:
158-158: Improve punctuation and capitalization in prompt templateAdd punctuation and capitalize "Google" for clarity and correctness.
Apply this diff:
- Answer the question based only on the following context this context is google search results and web scraped data: + Answer the question based only on the following context. This context includes Google search results and web scraped data:
322-322: Rename variable for clarityThe variable
vector_dbstores the response to "Chat with data?" but its name suggests it's a database object. Consider renaming it tochat_with_datafor better clarity.Apply this diff:
-vector_db = Prompt.ask("Chat with data? (true/false)", default=str(config["chat_with_data"])) +chat_with_data = Prompt.ask("Chat with data? (true/false)", default=str(config["chat_with_data"]))And update usage in line 343:
-"chat_with_data": vector_db.lower() == 'true', +"chat_with_data": chat_with_data.lower() == 'true',
326-326: Correct the typo in the prompt message"Scrap" should be "Scrape" in the prompt to accurately reflect the action.
Apply this diff:
-scrap_web = Prompt.ask("Scrap the web for data? (true/false)", default=str(config["scrap_web"])) +scrap_web = Prompt.ask("Scrape the web for data? (true/false)", default=str(config["scrap_web"]))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- app/predacons_cli/src/cli.py (8 hunks)
- app/predacons_cli/src/rag.py (1 hunks)
- setup.py (2 hunks)
🧰 Additional context used
🪛 Ruff
app/predacons_cli/src/cli.py
72-72: Do not use bare
except(E722)
90-90: Avoid equality comparisons to
False; useif not logs:for false checksReplace with
not logs(E712)
92-92: Loop control variable
inot used within loop bodyRename unused
ito_i(B007)
app/predacons_cli/src/rag.py
143-143: f-string without any placeholders
Remove extraneous
fprefix(F541)
184-184: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
187-187: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
🔇 Additional comments (4)
setup.py (1)
10-10: LGTM: Version update is appropriate.The version increment from 0.0.100 to 0.0.101 follows semantic versioning principles and is consistent with a patch update. This is suitable for minor changes or bug fixes.
app/predacons_cli/src/rag.py (1)
1-200: Overall assessment and summary of suggestionsThe
rag.pyfile introduces two main classes,VectorStoreandWebScraper, which provide functionality for document management, embedding operations, and web scraping. While the overall structure and implementation are good, there are several areas for improvement:
- Imports: Clean up unused imports and uncomment the necessary ones.
- Logging: Replace print statements with proper logging throughout the file.
- Error handling: Enhance exception handling in both classes, particularly in the WebScraper methods.
- Code style: Use list comprehensions and more pythonic constructs where applicable.
- Type hinting: Consider adding type hints to method parameters and return values for better code clarity.
These improvements will enhance the code's readability, maintainability, and robustness. After implementing these changes, the code will be more production-ready and easier to debug and extend in the future.
🧰 Tools
🪛 Ruff
143-143: f-string without any placeholders
Remove extraneous
fprefix(F541)
184-184: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
187-187: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
app/predacons_cli/src/cli.py (2)
11-12: Ensure 'rag' module is available and properly importedThe imports for
VectorStoreandWebScraperrequire theragmodule. Make sure thatrag.pyis in the correct directory or installed as a module so that these imports work without issues.
117-117:⚠️ Potential issueCheck if 'vector_store' is initialized before use
If
vector_storeis not initialized due to earlier failures or configurations, callingvector_store.load_and_update_db()will raise an error. Ensure thatvector_storeis initialized before invoking this method.
| from langchain_community.document_loaders import DirectoryLoader | ||
| from langchain.text_splitter import RecursiveCharacterTextSplitter | ||
| from langchain.schema import Document | ||
| from langchain_community.vectorstores import Chroma | ||
| from predacons import PredaconsEmbedding | ||
| import requests | ||
| from bs4 import BeautifulSoup | ||
| from googlesearch import search |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Clean up import statements
- The
requestsandBeautifulSoupimports are unused in the current implementation of theWebScraperclass. Consider removing them if they're not needed. - The
searchimport fromgooglesearchis commented out, but it's used in thegoogle_search_extractmethod. Uncomment this import and remove the others if they're not needed.
Here's a suggested cleanup:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import Chroma
from predacons import PredaconsEmbedding
from googlesearch import search| def split_text(self, documents: list[Document]): | ||
| """ | ||
| Split documents into chunks using RecursiveCharacterTextSplitter. | ||
|
|
||
| :param documents: List of Document objects to be split. | ||
| :return: List of chunks. | ||
| """ | ||
| text_splitter = RecursiveCharacterTextSplitter( | ||
| chunk_size=500, | ||
| chunk_overlap=100, | ||
| length_function=len, | ||
| add_start_index=True, | ||
| ) | ||
| chunks = text_splitter.split_documents(documents) | ||
| print(f"Split {len(documents)} documents into {len(chunks)} chunks.") | ||
|
|
||
| return chunks |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Replace print statement with logging in split_text method
Consider using the logging module instead of print statements for better control over output in a production environment.
import logging
def split_text(self, documents: list[Document]):
# ... (existing code)
chunks = text_splitter.split_documents(documents)
logging.info(f"Split {len(documents)} documents into {len(chunks)} chunks.")
return chunksDon't forget to configure the logging module at the application level.
| def add_to_chroma(self, chunks: list[Document]): | ||
| """ | ||
| Add new chunks to the Chroma database if they do not already exist. | ||
|
|
||
| :param chunks: List of chunks to be added to the database. | ||
| """ | ||
| db = Chroma( | ||
| persist_directory=self.chroma_path, embedding_function=self.embedder | ||
| ) | ||
|
|
||
| chunks_with_ids = self.calculate_chunk_ids(chunks) | ||
|
|
||
| existing_items = db.get(include=[]) # IDs are always included by default | ||
| existing_ids = set(existing_items["ids"]) | ||
| print(f"Number of existing documents in DB: {len(existing_ids)}") | ||
|
|
||
| new_chunks = [] | ||
| for chunk in chunks_with_ids: | ||
| if chunk.metadata["id"] not in existing_ids: | ||
| new_chunks.append(chunk) | ||
|
|
||
| if len(new_chunks): | ||
| print(f"👉 Adding new documents: {len(new_chunks)}") | ||
| new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks] | ||
| db.add_documents(new_chunks, ids=new_chunk_ids) | ||
| db.persist() | ||
| else: | ||
| print("✅ No new documents to add") | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance add_to_chroma method with logging and list comprehension
- Replace print statements with logging:
import logging
def add_to_chroma(self, chunks: list[Document]):
# ... (existing code)
logging.info(f"Number of existing documents in DB: {len(existing_ids)}")
# ... (rest of the method)
if new_chunks:
logging.info(f"Adding new documents: {len(new_chunks)}")
# ... (rest of the code)
else:
logging.info("No new documents to add")- Simplify the creation of
new_chunksusing a list comprehension:
new_chunks = [chunk for chunk in chunks_with_ids if chunk.metadata["id"] not in existing_ids]These changes will improve the code's readability and maintainability.
| def get_similar(self, query, db=None, top_n=5, similarity_threshold=0.1): | ||
| """ | ||
| Retrieve similar documents from the Chroma database based on a query. | ||
|
|
||
| :param query: Query string to search for similar documents. | ||
| :param db: Optional Chroma database object. If None, the database will be loaded. | ||
| :param top_n: Number of top similar documents to retrieve. | ||
| :param similarity_threshold: Minimum similarity score threshold. | ||
| :return: List of tuples containing similar documents and their relevance scores. | ||
| """ | ||
| if db is None: | ||
| db = self.load_db() | ||
| results = db.similarity_search_with_relevance_scores(query, k=top_n) | ||
| if len(results) == 0 or results[0][1] < similarity_threshold: | ||
| print(f"Unable to find matching results.") | ||
|
|
||
| context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results]) | ||
| return context_text,results[0][1] |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Fix f-string issue and enhance get_similar method
- Address the f-string without placeholders (F541 error).
- Add error handling.
- Improve logging of results.
Here's an improved version:
import logging
from typing import List, Tuple, Any
def get_similar(self, query: str, db=None, top_n: int = 5, similarity_threshold: float = 0.1) -> Tuple[str, float]:
try:
if db is None:
db = self.load_db()
results = db.similarity_search_with_relevance_scores(query, k=top_n)
if not results or results[0][1] < similarity_threshold:
logging.info("No results found above the similarity threshold.")
return "", 0.0
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
return context_text, results[0][1]
except Exception as e:
logging.error(f"Error in similarity search: {str(e)}")
raiseThis implementation addresses the f-string issue, adds error handling, improves logging, and returns a more structured result.
🧰 Tools
🪛 Ruff
143-143: f-string without any placeholders
Remove extraneous
fprefix(F541)
| def google_search_extract(self,query, num_results=5): | ||
| results = [] | ||
| try: | ||
| # Perform Google search | ||
| search_results = search(term = query, num_results=10, advanced = True) | ||
| i =0 | ||
| for search_result in search_results: | ||
| url = search_result.url | ||
| title = search_result.title | ||
| description = search_result.description | ||
| results.append(title +" "+ description) | ||
| if i == num_results: | ||
| break | ||
| try: | ||
| # Fetch the content of the URL | ||
| response = requests.get(url) | ||
| response.raise_for_status() | ||
|
|
||
| # Parse the content using BeautifulSoup | ||
| soup = BeautifulSoup(response.text, 'html.parser') | ||
|
|
||
| # Extract text from the website | ||
| text = ' '.join([p.get_text() for p in soup.find_all('p')]) | ||
|
|
||
| # Append the extracted text to results | ||
| results.append(text) | ||
| i = i+1 | ||
| print(url) | ||
| except requests.RequestException as e: | ||
| # print(f"Failed to fetch {url}: {e}") | ||
| pass | ||
| except Exception as e: | ||
| # print(f"Failed to parse {url}: {e}") | ||
| pass | ||
| except Exception as e: | ||
| print(f"Failed to perform Google search: {e}") | ||
|
|
||
| return results |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Improve exception handling and use of imported libraries in google_search_extract method
- Log exceptions instead of using print statements:
import logging
def google_search_extract(self, query, num_results=5):
# ... (existing code)
try:
response = requests.get(url)
response.raise_for_status()
# ... (rest of the code)
except requests.RequestException as e:
logging.warning(f"Failed to fetch {url}: {e}")
except Exception as e:
logging.error(f"Failed to parse {url}: {e}")
# ... (rest of the method)-
Use the imported
requestsandBeautifulSoupconsistently throughout the method. -
Consider using a more pythonic way to limit the number of results:
for i, search_result in enumerate(search_results):
if i >= num_results:
break
# ... (rest of the loop)These changes will improve error handling, make better use of imported libraries, and enhance the overall code quality.
🧰 Tools
🪛 Ruff
184-184: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
187-187: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
| # try: | ||
| # os.environ['CUDA_VISIBLE_DEVICES'] ='0' | ||
| # import torch | ||
| # if torch.cuda.is_available(): | ||
| # torch.set_default_device('cuda') | ||
| # print("Using GPU") | ||
| # else: | ||
| # print("No GPU available") | ||
| # except: | ||
| # print("No GPU available") |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Remove commented-out code for better readability
Lines 25-34 contain commented-out code. Keeping unused code increases clutter and can cause confusion. If this code is no longer needed, consider removing it.
Apply this diff to clean up the code:
-# try:
-# os.environ['CUDA_VISIBLE_DEVICES'] ='0'
-# import torch
-# if torch.cuda.is_available():
-# torch.set_default_device('cuda')
-# print("Using GPU")
-# else:
-# print("No GPU available")
-# except:
-# print("No GPU available")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # try: | |
| # os.environ['CUDA_VISIBLE_DEVICES'] ='0' | |
| # import torch | |
| # if torch.cuda.is_available(): | |
| # torch.set_default_device('cuda') | |
| # print("Using GPU") | |
| # else: | |
| # print("No GPU available") | |
| # except: | |
| # print("No GPU available") |
| print("[blue]Predacons CLI version 0.0.101[blue]") | ||
| print("[blue]Predacons version 0.0.126[blue]") |
There was a problem hiding this comment.
Correct the closing tags in print statements
The closing tags for the color should be [/blue] instead of [blue].
Apply this diff to fix the tags:
-print("[blue]Predacons CLI version 0.0.101[blue]")
-print("[blue]Predacons version 0.0.126[blue]")
+print("[blue]Predacons CLI version 0.0.101[/blue]")
+print("[blue]Predacons version 0.0.126[/blue]")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| print("[blue]Predacons CLI version 0.0.101[blue]") | |
| print("[blue]Predacons version 0.0.126[blue]") | |
| print("[blue]Predacons CLI version 0.0.101[/blue]") | |
| print("[blue]Predacons version 0.0.126[/blue]") |
| print("[yellow]Loading data from vector DB[/yellow]") | ||
| # load data from vector db | ||
|
|
||
| vector_store = VectorStore(config["vector_db_path"], config["document_path"], config.get("embedding_model", None)) |
There was a problem hiding this comment.
Use consistent access methods for configuration
Accessing config["vector_db_path"] and config["document_path"] directly can raise KeyError if the keys are missing. Consider using config.get() for safer access or ensure these keys are present.
Apply this diff to improve robustness:
-vector_store = VectorStore(config["vector_db_path"], config["document_path"], config.get("embedding_model", None))
+vector_store = VectorStore(config.get("vector_db_path"), config.get("document_path"), config.get("embedding_model"))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| vector_store = VectorStore(config["vector_db_path"], config["document_path"], config.get("embedding_model", None)) | |
| vector_store = VectorStore(config.get("vector_db_path"), config.get("document_path"), config.get("embedding_model")) |
| try: | ||
| web_scraper = WebScraper() | ||
| print("[green]Web scraper initialized![/green]") | ||
| except: |
There was a problem hiding this comment.
Avoid bare 'except' clauses
Using a bare except: can catch unintended exceptions and make debugging difficult. Specify the exception type to catch only expected exceptions.
Apply this diff to fix the issue:
- except:
+ except Exception as e:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except: | |
| except Exception as e: |
🧰 Tools
🪛 Ruff
72-72: Do not use bare
except(E722)
| if not os.path.exists(config.get("vector_db_path", False)): | ||
| print("[red]Vector DB not found![/red]") | ||
| print("[yellow]Continueing with simple chat[/yellow]") | ||
| else: |
There was a problem hiding this comment.
Ensure 'vector_db_path' is valid before checking existence
Using config.get("vector_db_path", False) may return False, leading os.path.exists(False) to check for a path 'False', which is unintended. Check if vector_db_path is valid before proceeding.
Apply this diff to fix the issue:
- if not os.path.exists(config.get("vector_db_path", False)):
+ vector_db_path = config.get("vector_db_path")
+ if not vector_db_path or not os.path.exists(vector_db_path):📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if not os.path.exists(config.get("vector_db_path", False)): | |
| print("[red]Vector DB not found![/red]") | |
| print("[yellow]Continueing with simple chat[/yellow]") | |
| else: | |
| vector_db_path = config.get("vector_db_path") | |
| if not vector_db_path or not os.path.exists(vector_db_path): | |
| print("[red]Vector DB not found![/red]") | |
| print("[yellow]Continueing with simple chat[/yellow]") | |
| else: |
Summary by CodeRabbit
WebScraperclass for managing web scraping operations, including fetching and parsing HTML content..gitignoreto exclude/docsand/vectordirectories from version control.setup.py.