A comprehensive Python-based system that processes Excel files through an AI-powered analysis pipeline, converting them to HTML format, analyzing content with OpenAI's GPT models, and generating updated Excel files with filled data.
This system automates the process of filling empty cells in Excel spreadsheets by:
- Converting Excel files to HTML with preserved formatting and structure
- Analyzing HTML tables using OpenAI's GPT models to match text data to appropriate cells
- Generating multiple output formats (CSV, JSON, Excel) with cell mappings
- Updating original Excel files with the analyzed data
The system is designed for processing complex Excel workbooks with multiple worksheets, particularly useful for financial planning, project documentation, and data analysis workflows.
Input_Folder/ → Excel files (.xlsx)
↓ (Excel to HTML conversion)
html_outputs/ → HTML files (.html)
↓ (AI Analysis with text data)
DATA_SOURCES/ → Text data files (.txt)
↓ (Analysis results)
Output_folder/ → Analysis outputs (.csv, .json, .txt, .xlsx)
↓ (Excel updating)
Updated_excel_workbooks/ → Updated Excel files (.xlsx)
final_project/
├── 📂 Input_Folder/ # Input Excel files
│ ├── ENGLISH_1A priedas_InoStartas en.xlsx
│ ├── ENGISH_1B priedas_InoStartas en.xlsx
│ ├── ENGISH_Rekomenduojama forma...xlsx
│ └── ENGLISH_Finansinis planas en.xlsx
├── 📂 DATA_SOURCES/ # Text data for analysis
│ ├── ENGLISH_1A priedas_InoStartas en_1.txt
│ ├── ENGLISH_1A priedas_InoStartas en_2.txt
│ ├── ENGISH_1B priedas_InoStartas en_1_Fic_amounts_patenting.txt
│ └── ... (18 text files)
├── 📂 html_outputs/ # Generated HTML files
│ ├── ENGLISH_1A priedas_InoStartas en_1.html
│ ├── ENGLISH_1A priedas_InoStartas en_2.html
│ └── ... (22 HTML files)
├── 📂 Output_folder/ # Analysis results
│ ├── *.csv files # CSV format analysis
│ ├── *.json files # JSON format analysis
│ ├── *.txt files # Text summary
│ └── *.xlsx files # Excel format analysis
├── 📂 Updated_excel_workbooks/ # Updated Excel files
├── 📂 final_project/ # Main system code
│ ├── run_complete_system.py # Main entry point
│ ├── unified_processor.py # Core processing classes
│ └── requirements.txt # Dependencies
├── excel_processor.py # Legacy processor
├── final_json_from_outputFolder_to_xlsx_filling.py # Excel updater
├── requirements.txt # Main dependencies
└── README.md # This file
- Multi-worksheet Support: Processes Excel files with multiple worksheets
- AI-Powered Analysis: Uses OpenAI GPT models for intelligent data matching
- Format Preservation: Maintains Excel formatting, borders, alignment, and merged cells
- Multiple Output Formats: Generates CSV, JSON, Excel, and text outputs
- Batch Processing: Handles multiple files automatically
- Error Handling: Robust error handling with detailed logging
- Contextual Data Matching: AI analyzes table structure and matches data contextually
- Cell Reference Mapping: Provides Excel-style cell references (A1, B2, etc.)
- Encoding Detection: Handles multiple text encodings automatically
- Retry Mechanism: Built-in retry logic for API calls
- Modular Design: Clean, extensible architecture
- Python 3.8 or higher
- OpenAI API key
-
Clone or download the project
git clone <repository-url> cd final_project
-
Create virtual environment (recommended)
python -m venv formfillingvenv # Windows formfillingvenv\Scripts\activate # Linux/Mac source formfillingvenv/bin/activate
-
Install dependencies
pip install -r requirements.txt
-
Set up OpenAI API key
Option A: Environment variable
# Windows set OPENAI_API_KEY=your-api-key-here # Linux/Mac export OPENAI_API_KEY=your-api-key-here
Option B: .env file Create a
.envfile in the project root:OPENAI_API_KEY=your-api-key-here
The system requires the following Python packages:
# Excel Processing
openpyxl==3.1.5 # Excel file manipulation
pandas==2.2.3 # Data analysis
xlsxwriter==3.2.0 # Excel writing
# AI Integration
openai==1.51.0 # OpenAI API client
# Utilities
python-dotenv==1.0.1 # Environment variable management
typing_extensions==4.12.2 # Type hints support
- Place your Excel files in the
Input_Folder/directory - Add corresponding text data files in the
DATA_SOURCES/directory - Run the complete system:
python final_project/excel_processor.py
from excel_processor import ExcelToHTMLConverter
converter = ExcelToHTMLConverter()
html_files = converter.process_all_excel_files("Input_Folder", "html_outputs")from excel_processor import HTMLAnalyzer
analyzer = HTMLAnalyzer()
success = analyzer.process_all_files()from final_json_from_outputFolder_to_xlsx_filling import update_excel_from_json
update_excel_from_json("Input_Folder", "Output_folder")The system uses a specific naming convention for file matching:
- Excel files:
workbook_name.xlsx - Text data files:
workbook_name_worksheet_name.txt - HTML files:
workbook_name_worksheet_name.html - Analysis outputs:
workbook_name_worksheet_name.{csv,json,txt,xlsx}
Structured data with columns:
row: Row numbercolumn: Column lettercell_reference: Excel cell reference (A1, B2, etc.)value: Data to fillcontext: Explanation of data placementsource_file: Source file name
[
{
"cell_reference": "B7",
"value": "Patent application drafting and preparation...",
"context": "Phase I attorney service (item 1.1) supplier / offer details"
}
]Analysis results in Excel format with auto-adjusted column widths and proper formatting.
Human-readable summaries with:
- Source file information
- Number of cells filled
- Detailed cell mappings
- Generated file list
The system uses gpt-4o-mini by default with the following parameters:
temperature: 0.1 (for consistent results)max_completion_tokens: 16000top_p: 0.9frequency_penalty: 0.0presence_penalty: 0.0
- Max retries: 3 attempts
- Error handling: Graceful fallback to raw response saving
-
OpenAI API Key Not Found
Error: Please set your OPENAI_API_KEY environment variableSolution: Set your OpenAI API key as described in the Installation section.
-
No Excel Files Found
Error: Input folder 'Input_Folder' not foundSolution: Ensure the
Input_Folderdirectory exists and contains.xlsxfiles. -
JSON Parsing Errors
Error parsing JSON responseSolution: The system will save raw responses as fallback. Check your OpenAI API key and internet connection.
-
Encoding Issues
UnicodeDecodeErrorSolution: The system automatically tries multiple encodings. Ensure your text files are properly encoded.
Enable detailed logging by modifying the print statements in the code or adding logging configuration.
- Processing Speed: ~2-5 seconds per worksheet (depending on complexity)
- Memory Usage: Moderate (loads one Excel file at a time)
- API Costs: Varies based on OpenAI usage (typically $0.01-0.10 per file)
- API Key: Store securely, never commit to version control
- Data Privacy: Text data is sent to OpenAI for analysis
- File Access: System only reads specified input directories
The system is designed with modularity in mind. Key areas for extension:
- New AI Models: Add support for other AI providers
- Additional Formats: Support for other file formats
- Enhanced Analysis: More sophisticated data matching algorithms
- UI Interface: Web or desktop interface for easier use
This project is designed for internal use. Please ensure compliance with OpenAI's usage policies and any applicable data protection regulations.
For issues or questions:
- Check the troubleshooting section above
- Review the console output for detailed error messages
- Ensure all dependencies are properly installed
- Verify your OpenAI API key is valid and has sufficient credits
- Input: Excel files with empty cells + corresponding text data files
- Step 1: Convert Excel worksheets to HTML with preserved formatting
- Step 2: Use AI to analyze HTML tables and match text data to cells
- Step 3: Generate analysis results in multiple formats (CSV, JSON, Excel, TXT)
- Step 4: Update original Excel files with filled data
- Output: Updated Excel files with intelligently filled cells
The system is particularly effective for:
- Financial planning documents
- Project proposal forms
- Research data compilation
- Administrative form filling
- Data migration tasks
This system represents a sophisticated approach to automated Excel processing, combining traditional data manipulation with modern AI capabilities to create an efficient, reliable solution for complex spreadsheet management tasks.