Implement fundas: AI-powered file import library extending pandas#1
Co-authored-by: AMSeify <37192915+AMSeify@users.noreply.github.com>
Pull request overview
This PR introduces the Fundas library, an AI-powered Python package that extends pandas functionality to import and analyze unstructured files (PDFs, images, audio, video, and webpages) by leveraging the OpenRouter API for intelligent data extraction. All functions return pandas DataFrames for immediate analysis.
Key Changes:
- Implemented core OpenRouter API client for AI-powered data extraction
- Added five reader functions: `read_pdf()`, `read_image()`, `read_audio()`, `read_webpage()`, and `read_video()` with configurable data sources
- Created comprehensive package structure with proper dependency management and documentation
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| requirements.txt | Defines package dependencies including pandas, requests, PyPDF2, Pillow, beautifulsoup4, and opencv-python |
| pyproject.toml | Configures package metadata, build system, dependencies, and development tools for the fundas package |
| fundas/core.py | Implements OpenRouterClient class for API communication and structured data extraction from content |
| fundas/readers.py | Provides five main reader functions for different file types, each returning pandas DataFrames |
| fundas/__init__.py | Package initialization file that exports main reader functions and OpenRouterClient |
| examples/usage_example.py | Demonstrates usage patterns for all reader functions with example code |
| README.md | Comprehensive documentation including installation, usage examples, API reference, and feature overview |
| from PyPDF2 import PdfReader | ||
| except ImportError: | ||
| raise ImportError("PyPDF2 is required for read_pdf. Install it with: pip install PyPDF2") |
The error message references "PyPDF2" but the actual import statement uses "pypdf2" (lowercase). If the package is installed as "PyPDF2" (which is the correct package name), the import should be from PyPDF2 import PdfReader. The current code will fail because Python package imports are case-sensitive on some systems.
| try: | ||
| video = cv2.VideoCapture(str(filepath)) | ||
| fps = video.get(cv2.CAP_PROP_FPS) | ||
| frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT)) | ||
| duration = frame_count / fps if fps > 0 else 0 | ||
| width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH)) | ||
| height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT)) | ||
|
|
||
| content += f"Duration: {duration:.2f} seconds\n" | ||
| content += f"Frame rate: {fps:.2f} fps\n" | ||
| content += f"Resolution: {width}x{height}\n" | ||
| content += f"Total frames: {frame_count}\n" | ||
|
|
||
| if 'pics' in from_: | ||
| content += f"\n--- Frame Analysis ---\n" | ||
| content += f"Sampling every {sample_rate} frames\n" | ||
|
|
||
| # Sample frames from the video | ||
| frame_descriptions = [] | ||
| frame_idx = 0 | ||
| while frame_idx < frame_count: | ||
| video.set(cv2.CAP_PROP_POS_FRAMES, frame_idx) | ||
| ret, frame = video.read() | ||
| if not ret: | ||
| break | ||
|
|
||
| # Basic frame description (in a real implementation, you'd use OCR or vision models) | ||
| timestamp = frame_idx / fps if fps > 0 else 0 | ||
| frame_descriptions.append(f"Frame at {timestamp:.2f}s") | ||
|
|
||
| frame_idx += sample_rate | ||
|
|
||
| content += f"Sampled {len(frame_descriptions)} frames\n" | ||
| content += "\n".join(frame_descriptions[:10]) # Limit to first 10 for brevity | ||
| if len(frame_descriptions) > 10: | ||
| content += f"\n... and {len(frame_descriptions) - 10} more frames" | ||
|
|
||
| if 'audios' in from_: | ||
| content += f"\n\n--- Audio Analysis ---\n" | ||
| content += "[Note: Full audio extraction requires additional audio processing services]\n" | ||
|
|
||
| video.release() |
The video.release() call should be in a finally block to ensure proper resource cleanup even if an exception occurs during processing. Currently, if an exception is raised between lines 297-337, the video capture resource may not be properly released, leading to potential resource leaks.
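One way to get the cleanup guarantee described above is to wrap the capture in a small context manager. This is a minimal sketch, not the PR's code: `releasing` and `FakeCapture` are hypothetical names, and `FakeCapture` stands in for `cv2.VideoCapture` so the example runs without OpenCV installed.

```python
from contextlib import contextmanager


@contextmanager
def releasing(resource):
    """Yield `resource` and always call its release() method afterwards."""
    try:
        yield resource
    finally:
        # Runs on normal exit and when the body raises.
        resource.release()


class FakeCapture:
    """Stand-in for cv2.VideoCapture (assumption: only release() matters here)."""

    def __init__(self):
        self.released = False

    def release(self):
        self.released = True


capture = FakeCapture()
try:
    with releasing(capture) as video:
        raise ValueError("simulated failure while sampling frames")
except ValueError:
    pass

print(capture.released)  # True
```

A plain `try: ... finally: video.release()` inside `read_video()` would achieve the same thing; the context manager only makes the pattern reusable.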
| self.base_url, | ||
| headers=headers, | ||
| json=payload, | ||
| timeout=60 |
[nitpick] Hardcoded "magic number" for timeout without explanation. The timeout value of 60 seconds on line 75 is hardcoded. Consider making this configurable via a parameter or a constant at the module level (e.g., DEFAULT_TIMEOUT = 60) to improve maintainability and allow users to adjust it for their use cases.
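A rough sketch of what the suggested constant could look like. `ClientSettings` and `request_kwargs` are hypothetical names used only for illustration, and the `Authorization: Bearer` header is an assumption about how the client authenticates, since the PR excerpt only shows `headers=headers`.

```python
from dataclasses import dataclass

# Seconds to wait for the API before giving up; named so the intent is visible.
DEFAULT_TIMEOUT = 60


@dataclass
class ClientSettings:
    """Hypothetical settings holder; not part of the PR."""

    api_key: str
    timeout: float = DEFAULT_TIMEOUT

    def request_kwargs(self, payload: dict) -> dict:
        """Build keyword arguments suitable for requests.post()."""
        return {
            "headers": {"Authorization": f"Bearer {self.api_key}"},  # assumed header
            "json": payload,
            "timeout": self.timeout,
        }


default = ClientSettings(api_key="sk-example")
patient = ClientSettings(api_key="sk-example", timeout=120)
print(default.request_kwargs({})["timeout"])  # 60
print(patient.request_kwargs({})["timeout"])  # 120
```

The same constant-plus-override approach would apply to the 30-second webpage fetch timeout flagged later in this review.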
| json_end = response_text.find("```", json_start) | ||
| json_str = response_text[json_start:json_end].strip() | ||
| elif "```" in response_text: | ||
| json_start = response_text.find("```") + 3 | ||
| json_end = response_text.find("```", json_start) | ||
| json_str = response_text[json_start:json_end].strip() |
The JSON parsing logic doesn't properly handle responses that contain multiple JSON objects or nested code blocks. If a response has "```json" followed by another "```" for a different purpose before the actual closing "```", the parser will extract the wrong content. Consider using a more robust approach, such as finding the last occurrence of "```" or validating that the extracted string is valid JSON.
| json_end = response_text.find("```", json_start) | |
| json_str = response_text[json_start:json_end].strip() | |
| elif "```" in response_text: | |
| json_start = response_text.find("```") + 3 | |
| json_end = response_text.find("```", json_start) | |
| json_str = response_text[json_start:json_end].strip() | |
| json_end = response_text.rfind("```") | |
| json_str = response_text[json_start:json_end].strip() if json_end > json_start else response_text[json_start:].strip() | |
| elif "```" in response_text: | |
| json_start = response_text.find("```") + 3 | |
| json_end = response_text.rfind("```") | |
| json_str = response_text[json_start:json_end].strip() if json_end > json_start else response_text[json_start:].strip() |
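The comment's second option, validating that the extracted string is valid JSON, could be sketched as below. This is an illustrative alternative rather than the suggested change itself: `extract_json` is a hypothetical helper that tries every fenced block in turn, then the raw text, and returns the first candidate that `json.loads` accepts. The fence marker is built programmatically only so the example nests cleanly in Markdown.

```python
import json
import re

FENCE = "`" * 3  # triple-backtick marker, built without typing it literally

# Match a fenced block with an optional language tag, capturing its body.
_FENCED_BLOCK = re.compile(FENCE + r"[a-zA-Z]*\s*\n?(.*?)" + FENCE, re.DOTALL)


def extract_json(response_text: str):
    """Return the first candidate in `response_text` that parses as JSON."""
    candidates = [m.group(1).strip() for m in _FENCED_BLOCK.finditer(response_text)]
    candidates.append(response_text.strip())  # fall back to the raw text
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError("no valid JSON found in response")


sample = (
    "Here is a note:\n"
    + FENCE + "text\nnot json\n" + FENCE + "\n"
    + FENCE + "json\n{\"a\": 1}\n" + FENCE
)
print(extract_json(sample))  # {'a': 1}
```

With two fenced blocks in one response, slicing from the first opening fence to the last closing fence would span both blocks, so per-candidate validation is the more forgiving of the two options.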
| columns: Optional[List[str]] = None, | ||
| api_key: Optional[str] = None, | ||
| model: Optional[str] = None, | ||
| sample_rate: int = 30 |
[nitpick] The `sample_rate` default of 30 is an undocumented magic number whose meaning may not be immediately clear. Consider adding a module-level constant like `DEFAULT_FRAME_SAMPLE_RATE = 30` with a comment explaining that this means "sample 1 frame every 30 frames" to improve code clarity.
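To make the "1 frame every 30 frames" meaning concrete, here is a small sketch that mirrors the sampling loop in `read_video()` without needing a video file. `sampled_frames` is a hypothetical helper, not a function from the PR.

```python
DEFAULT_FRAME_SAMPLE_RATE = 30  # keep 1 frame out of every 30


def sampled_frames(frame_count, fps, sample_rate=DEFAULT_FRAME_SAMPLE_RATE):
    """Return (frame_index, timestamp_seconds) pairs for the sampled frames."""
    if sample_rate < 1:
        # A step of 0 would never advance the frame index.
        raise ValueError("sample_rate must be a positive integer")
    return [
        (idx, idx / fps if fps > 0 else 0.0)
        for idx in range(0, frame_count, sample_rate)
    ]


print(sampled_frames(frame_count=90, fps=30.0))
# [(0, 0.0), (30, 1.0), (60, 2.0)]
```

At 30 fps the default works out to roughly one sampled frame per second of video.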
| except requests.exceptions.RequestException as e: | ||
| raise RuntimeError(f"Error communicating with OpenRouter API: {str(e)}") |
Security concern: The API key is exposed in error messages through the generic exception handling. If requests.post() fails and includes details about the request in the exception message, the API key could potentially be leaked in logs or error outputs. Consider sanitizing the exception message or using more specific exception handling that doesn't expose sensitive information.
| except requests.exceptions.RequestException as e: | |
| raise RuntimeError(f"Error communicating with OpenRouter API: {str(e)}") | |
| except requests.exceptions.RequestException: | |
| raise RuntimeError("Error communicating with OpenRouter API. Please check your network connection and API credentials.") |
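The suggestion above drops the exception text entirely. A middle ground, sketched here under assumptions, keeps the diagnostic detail but scrubs the key first. `redact` and `call_api` are hypothetical helpers, and the sketch catches a generic `Exception` only so it runs without `requests`; the PR's handler catches `requests.exceptions.RequestException`.

```python
def redact(message: str, secret: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every occurrence of `secret` in `message` with `placeholder`."""
    if not secret:
        return message
    return message.replace(secret, placeholder)


def call_api(send, api_key: str):
    """Run `send()` and re-raise failures without the key in the message."""
    try:
        return send()
    except Exception as exc:  # the PR would catch RequestException here
        # `from None` hides the original exception from the traceback,
        # since its message might still contain the key.
        raise RuntimeError(
            "Error communicating with OpenRouter API: " + redact(str(exc), api_key)
        ) from None


def failing_send():
    raise ConnectionError("request with Authorization: Bearer sk-secret-123 failed")


try:
    call_api(failing_send, api_key="sk-secret-123")
except RuntimeError as err:
    message = str(err)

print(message)
```

Whether the extra detail is worth the added surface area is a judgment call; the fixed message in the suggested change is the more conservative choice.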
| content += "\n".join(frame_descriptions[:10]) # Limit to first 10 for brevity | ||
| if len(frame_descriptions) > 10: | ||
| content += f"\n... and {len(frame_descriptions) - 10} more frames" |
[nitpick] Hardcoded magic number: The limit of 10 frames (line 329) and the calculation on line 331 are hardcoded without explanation. Consider defining these as constants at the module level (e.g., MAX_FRAME_DESCRIPTIONS_DISPLAY = 10) to make the code more maintainable and the intent clearer.
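A minimal sketch of the truncation logic with the limit pulled into a named constant. `summarize` is a hypothetical helper; the PR builds the same text inline inside `read_video()`.

```python
MAX_FRAME_DESCRIPTIONS_DISPLAY = 10


def summarize(descriptions, limit=MAX_FRAME_DESCRIPTIONS_DISPLAY):
    """Join the first `limit` descriptions and note how many were omitted."""
    shown = "\n".join(descriptions[:limit])
    hidden = len(descriptions) - limit
    if hidden > 0:
        shown += f"\n... and {hidden} more frames"
    return shown


frames = [f"Frame at {i:.2f}s" for i in range(12)]
print(summarize(frames).splitlines()[-1])  # ... and 2 more frames
```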
|
|
||
| # Fetch webpage content | ||
| try: | ||
| response = requests.get(url, timeout=30) |
[nitpick] Hardcoded timeout value of 30 seconds without explanation or configuration option. Consider making this configurable or defining it as a module-level constant (e.g., DEFAULT_REQUEST_TIMEOUT = 30) for better maintainability.
| import pandas as pd | ||
| from typing import Optional, List, Union | ||
| from pathlib import Path | ||
| import base64 |
Import of 'base64' is not used.
| import base64 |
| content += f"\nImage size: {image.size[0]}x{image.size[1]}" | ||
| content += f"\nImage format: {image.format}" | ||
| content += f"\nImage mode: {image.mode}" | ||
| except Exception: |
'except' clause does nothing but pass and there is no explanatory comment.
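One possible way to address this, assuming the `try` block only wraps the metadata lookups shown above: catch the specific errors those lookups can raise and log a warning instead of passing silently. `describe_image` and the logger name are assumptions for illustration, and `SimpleNamespace` stands in for a Pillow image so the sketch runs without Pillow.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("fundas.readers")  # logger name is an assumption


def describe_image(image) -> str:
    """Build a metadata summary, tolerating objects that lack some attributes."""
    content = ""
    try:
        content += f"\nImage size: {image.size[0]}x{image.size[1]}"
        content += f"\nImage format: {image.format}"
        content += f"\nImage mode: {image.mode}"
    except (AttributeError, IndexError, TypeError) as exc:
        # Metadata is optional context for the prompt, so continue without it,
        # but leave a trace instead of silently swallowing the error.
        logger.warning("Could not read image metadata: %s", exc)
    return content


ok = SimpleNamespace(size=(640, 480), format="PNG", mode="RGB")
broken = SimpleNamespace(size=(640,))  # size[1] raises IndexError

print(describe_image(ok))
print(repr(describe_image(broken)))  # ''
```

If swallowing the error really is intended, a one-line comment explaining why would also resolve the finding.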
Implements fundas, a Python library that uses the OpenRouter API and generative AI to extract structured data from unstructured files (PDFs, images, audio, video, web pages) and return pandas DataFrames.
Core API
All functions accept custom prompts for AI extraction and return DataFrames:
Implementation
- `from_='pics'`, `'audios'`, `'both'`, or lists such as `['pics', 'audios']`
Security
Package Structure
Requires the `OPENROUTER_API_KEY` environment variable or an explicit `api_key` parameter.
Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
- openrouter.ai: /usr/bin/python3 python3 (dns block)
If you need me to access, download, or install something from one of these locations, you can either: