
A comprehensive, multi-modal text analytics platform that combines smart column detection, advanced NLP processing, and multiple AI model integrations to provide actionable insights from customer feedback and text data.
- Smart Column Detection: Automatically identifies text, ID, product, and date columns
- Sentiment Analysis: TextBlob-powered sentiment classification with numerical scoring
- Topic Extraction: Multi-level topic identification using noun phrases and frequency analysis
- Actionable Insights: Dictionary-based extraction of improvement suggestions
- Advanced Search: TF-IDF vectorized semantic search with synonym expansion
- Multi-Provider Support: OpenAI, Anthropic Claude, Deepseek, Groq, Google Gemini
- Dynamic Model Switching: Change AI models on-the-fly
- Unified Interface: Consistent API across different providers
- AI-Powered Insights: Generate high-level analysis using selected AI models
- Memory Efficient: Smart data extraction and garbage collection
- Multiple Formats: Support for CSV, Excel (.xlsx, .xls), and JSON files
- Batch Processing: Handle large datasets efficiently
- Export Options: Export results in Excel or CSV formats
- Interactive Charts: Plotly-powered visualizations
- Web Interface: User-friendly Gradio-based UI
- Real-time Processing: Live feedback during data processing
- Multiple Views: Sentiment distribution, topic analysis, trends over time
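The multi-provider support and dynamic model switching listed above can be sketched as a thin registry over provider callables. This is a minimal illustration, not the project's actual `AIModelManager` API: the `register`/`switch`/`generate` names and the `callable(prompt) -> str` shape are assumptions for this sketch.

```python
# Minimal sketch of a unified multi-provider interface with on-the-fly
# model switching. Provider names mirror the feature list; the method
# names and signatures are hypothetical stand-ins.
class AIModelManager:
    def __init__(self):
        # Map provider name -> callable(prompt) -> str. Real clients
        # (openai, anthropic, groq, ...) would be registered here.
        self._providers = {}
        self._active = None

    def register(self, name, generate_fn):
        self._providers[name] = generate_fn

    def switch(self, name):
        if name not in self._providers:
            raise ValueError(f"Unknown provider: {name}")
        self._active = name  # dynamic model switching

    def generate(self, prompt):
        return self._providers[self._active](prompt)


manager = AIModelManager()
manager.register("openai", lambda p: f"[openai] {p}")
manager.register("anthropic", lambda p: f"[anthropic] {p}")
manager.switch("anthropic")
reply = manager.generate("Summarize the feedback")
```

Because every provider sits behind the same `generate` call, the rest of the pipeline never needs to know which backend is active.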
- Installation
- Quick Start
- Configuration
- Usage Guide
- API Documentation
- Architecture
- Contributing
- Troubleshooting
- License
- Python 3.8 or higher
- pip package manager
- Git (for cloning the repository)
```bash
# Clone the repository
git clone https://github.com/yourusername/text-analytics-ai-agent.git
cd text-analytics-ai-agent

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```
```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('brown')
```

```bash
python -m textblob.download_corpora
```
Create a `.env` file in the project root directory:

```env
# AI Model API Keys (add only the ones you plan to use)
ANTHROPIC_API_KEY=your_anthropic_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
DEEPSEEK_API_KEY=your_deepseek_api_key_here
GROQ_API_KEY=your_groq_api_key_here
GOOGLE_API_KEY=your_google_gemini_api_key_here
```
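At startup, the application can only offer providers whose keys are actually set. A minimal sketch of that check, using the same variable names as the `.env` example above (the `available_providers` helper is illustrative, not the project's actual code):

```python
# Decide which AI providers are usable based on which API keys are present.
import os

PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "groq": "GROQ_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def available_providers(env=os.environ):
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]

# Example: with only two keys configured, only those providers appear.
demo_env = {"OPENAI_API_KEY": "sk-...", "GROQ_API_KEY": "gsk-..."}
providers = available_providers(demo_env)
```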
| Provider | How to Get API Key | Documentation |
|---|---|---|
| OpenAI | OpenAI Platform | OpenAI Docs |
| Anthropic | Anthropic Console | Anthropic Docs |
| Deepseek | Deepseek Platform | Deepseek Docs |
| Groq | Groq Console | Groq Docs |
| Google | Google AI Studio | Gemini Docs |
Note: The system works with any combination of API keys. You don't need all providers configured.
1. Start the Application: `python Multimodal_Text_Analytics.py`
2. Open the Web Interface:
   - The application will provide a local URL (typically `http://127.0.0.1:7860`)
   - A public sharing link will also be generated automatically
3. Upload Your Data:
   - Select an AI model from the dropdown
   - Upload a CSV, Excel, or JSON file containing text data
   - Click "Process File"
4. Explore Results:
   - View processing status and AI insights
   - Search through your data
   - Generate visualizations
   - Export processed results
Your data should contain text columns (comments, feedback, reviews, etc.). The system automatically detects text, ID, product, and date columns. Example CSV:

```csv
id,customer_feedback,product_name,date,rating
1,"Great product but delivery was slow","Widget A","2024-01-15",4
2,"Poor quality, broke after one day","Widget B","2024-01-16",1
3,"Excellent customer service, very helpful","Service","2024-01-17",5
```
File Upload:
- Supported formats: `.csv`, `.xlsx`, `.xls`, `.json`
- Automatic column detection for text, ID, product, and date fields
- Memory-efficient processing with progress feedback
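Column detection can be done heuristically from the column name plus a sample of its values. The sketch below illustrates the idea; the keyword lists and the ">3 words" threshold are assumptions for this example, not taken from the project's `SmartColumnDetector`:

```python
# Hedged sketch of heuristic column-type detection: classify a column by
# its name first, then fall back on the shape of sampled values.
def detect_column_type(name, samples):
    lowered = name.lower()
    if lowered in ("id", "uid") or lowered.endswith("_id"):
        return "id"
    if "date" in lowered or "time" in lowered:
        return "date"
    if "product" in lowered or "item" in lowered:
        return "product"
    # Long, multi-word strings look like free text.
    text_like = [s for s in samples if isinstance(s, str) and len(s.split()) > 3]
    if samples and len(text_like) / len(samples) > 0.5:
        return "text"
    return "other"

columns = {
    "id": [1, 2, 3],
    "customer_feedback": ["Great product but delivery was slow",
                          "Poor quality, broke after one day"],
    "product_name": ["Widget A", "Widget B"],
    "date": ["2024-01-15", "2024-01-16"],
}
detected = {name: detect_column_type(name, vals) for name, vals in columns.items()}
```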
AI Model Selection:
- Choose from available AI providers
- Switch models dynamically
- Generate AI-powered insights from processed data
Processing Results:
- Smart column detection summary
- Data preview (first 10 rows)
- Downloadable processed file with analysis columns
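The sentiment columns come from a numeric polarity score: TextBlob's `sentiment.polarity` lies in [-1.0, 1.0], and a label is derived from it by thresholding. The ±0.1 cutoffs below are an assumed choice for this sketch, not necessarily the project's exact thresholds:

```python
# Map a polarity score in [-1.0, 1.0] (e.g. TextBlob's sentiment.polarity)
# to a sentiment label. The +/-0.1 cutoffs are illustrative assumptions.
def classify_sentiment(polarity):
    if polarity > 0.1:
        return "positive"
    if polarity < -0.1:
        return "negative"
    return "neutral"

labels = [classify_sentiment(p) for p in (0.8, -0.6, 0.05)]
```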
Semantic Search:
- Enter keywords to find relevant text entries
- Synonym expansion for better matching
- Similarity scoring with exact match boosting
- Export search results
Search Features:
- TF-IDF vectorized search
- Cosine similarity ranking
- Multi-term query support
- Results include sentiment and topics
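The TF-IDF search path can be sketched with scikit-learn, which the project already uses. Synonym expansion and exact-match boosting are omitted here for brevity; query terms are matched directly against the indexed corpus:

```python
# Minimal TF-IDF search sketch: vectorize the corpus, vectorize the query
# with the same vocabulary, and rank documents by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Great product but delivery was slow",
    "Poor quality, broke after one day",
    "Excellent customer service, very helpful",
]
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(corpus)

query_vec = vectorizer.transform(["slow delivery"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranked = scores.argsort()[::-1]  # best match first
```

Here the first document is the only one sharing terms with the query, so it ranks first; documents with no overlap score exactly zero.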
Available Visualizations:
- Sentiment Distribution: Pie chart of positive/negative/neutral sentiment
- Topic Distribution: Bar chart of most common topics
- Sentiment by Topic: Heatmap showing sentiment patterns across topics
- Sentiment Timeline: Trend analysis over time (if date data available)
- Top Insights: Most frequent actionable insights
Interactive Features:
- Plotly-powered interactive charts
- Zoom, pan, and hover functionality
- Downloadable chart images
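Each chart starts from a simple aggregation of the analysis columns. For the sentiment pie, that is just a label count; the resulting mapping is what a Plotly pie (e.g. `plotly.express.pie(names=..., values=...)`) would be fed. A standard-library sketch of the aggregation step:

```python
# Aggregate per-row sentiment labels into the distribution behind the
# sentiment pie chart.
from collections import Counter

sentiments = ["positive", "negative", "positive", "neutral", "positive"]
distribution = Counter(sentiments)
# e.g. plotly.express.pie(names=list(distribution), values=list(distribution.values()))
```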
Export Options:
- Excel Format: Full analysis with formatting
- CSV Format: Lightweight, compatible format
- Timestamp: Automatic file naming with timestamps
Export Contents:
- Original data plus analysis columns
- Sentiment scores and classifications
- Extracted topics (3 levels)
- Actionable insights
- Search scores (if applicable)
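Timestamped export naming can be sketched with the standard library alone. The analysis columns here mirror the list above, but the exact column set and layout of the project's export may differ:

```python
# Write analysis rows to a CSV whose filename carries a timestamp,
# matching the "automatic file naming" export behavior described above.
import csv
from datetime import datetime

rows = [
    {"id": 1, "text": "Great product but delivery was slow",
     "sentiment": "positive", "topic": "delivery"},
]
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"analysis_export_{stamp}.csv"

with open(filename, "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```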
```
├── SmartColumnDetector    # Automatic column type detection
├── EnhancedTextProcessor  # NLP processing and insights extraction
├── TextSearchEngine       # Advanced search with semantic capabilities
├── AIModelManager         # Multi-provider AI model integration
└── EnhancedTextAnalyzer   # Main orchestration class
```
- File Upload → Smart column detection → Data extraction
- Text Processing → Sentiment analysis → Topic extraction → Insights generation
- Search Index → TF-IDF vectorization → Similarity calculations
- AI Analysis → Sample selection → Prompt generation → Insight generation
- Visualization → Data aggregation → Chart generation → Interactive display
```
Raw Data → Column Detection → Text Cleaning → Sentiment Analysis →
Topic Extraction → Insights Generation → Search Index →
Visualizations → Export Options
```
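The "Insights Generation" stage is dictionary-based: known keywords map to suggested actions. The keyword/action pairs below are invented for illustration; the project ships its own dictionary:

```python
# Dictionary-based actionable-insight extraction: scan the text for known
# trigger keywords and collect the corresponding suggestions.
INSIGHT_RULES = {
    "slow": "Review delivery and response times",
    "broke": "Investigate product quality complaints",
    "helpful": "Highlight strong customer-service interactions",
}

def extract_insights(text):
    lowered = text.lower()
    return [action for keyword, action in INSIGHT_RULES.items() if keyword in lowered]

insights = extract_insights("Great product but delivery was slow")
```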
We welcome contributions! Here's how you can help:
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make your changes
- Add tests for new functionality
- Run existing tests: `python -m pytest tests/`
- Commit changes: `git commit -am 'Add feature'`
- Push to branch: `git push origin feature-name`
- Create a Pull Request
- New AI Providers: Add support for additional AI APIs
- Enhanced NLP: Improve topic extraction and sentiment analysis
- Visualizations: Create new chart types and insights
- Performance: Optimize processing for larger datasets
- Documentation: Improve guides and examples
- Testing: Add comprehensive test coverage
- Follow PEP 8 Python style guidelines
- Use type hints where appropriate
- Add docstrings for all functions and classes
- Include inline comments for complex logic
1. NLTK Data Missing
   - Error: `Resource punkt not found`
   - Solution: Run the NLTK download commands from the installation section.
2. TextBlob Corpora Missing
   - Error: `Resource 'corpora/brown' not found`
   - Solution: Run `python -m textblob.download_corpora`
3. API Key Issues
   - Error: `No API key provided`
   - Solution: Check your `.env` file configuration and ensure the API keys are valid.
4. Memory Issues with Large Files
   - Error: `MemoryError: Unable to allocate array`
   - Solution: Process files in smaller chunks or increase system memory.
5. Gradio Port Conflicts
   - Error: `Port 7860 is already in use`
   - Solution: The application will automatically find an available port.
For Large Datasets:
- Process files with < 50,000 rows for optimal performance
- Use CSV format for faster loading
- Close unnecessary applications to free memory
For Slow AI Responses:
- Check internet connection
- Verify API key limits haven't been exceeded
- Try switching to a different AI provider
- GitHub Issues: Report bugs and request features
- Documentation: Check this README and inline code comments
- Community: Join discussions in the Issues section
| Dataset Size | Processing Time | Memory Usage | Recommended |
|---|---|---|---|
| 1K rows | 10-30 seconds | 200 MB | Optimal |
| 10K rows | 1-3 minutes | 500 MB | Good |
| 50K rows | 5-15 minutes | 1.5 GB | Caution |
| 100K+ rows | 15+ minutes | 3 GB+ | Consider chunking |
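For datasets in the "Caution" range and beyond, processing in fixed-size batches keeps memory flat regardless of file size. A standard-library sketch of the idea; the batch size of 5000 is an assumed tuning value, not the project's setting:

```python
# Read a CSV in fixed-size batches so only one batch is in memory at a time.
import csv
from io import StringIO
from itertools import islice

def iter_batches(reader, batch_size=5000):
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch

# Demo with an in-memory CSV of 12 rows, batched in fives.
sample = StringIO("id,text\n" + "\n".join(f"{i},row {i}" for i in range(12)))
reader = csv.DictReader(sample)
batch_sizes = [len(b) for b in iter_batches(reader, batch_size=5)]
```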
- Real-time data streaming support
- Custom AI model integration
- Advanced topic modeling (LDA, BERTopic)
- Multi-language support
- API endpoint for programmatic access
- Automated report generation
- Integration with business intelligence tools
- Custom visualization builder
- Advanced export options (PDF reports)
- User authentication and data persistence
This project is licensed under the MIT License. See the LICENSE file for details.
MIT License
Copyright (c) 2024 [Your Name]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
- Gradio Team: For the web interface framework
- Hugging Face: For NLP tools and model hosting
- Plotly: For interactive visualization capabilities
- NLTK Team: For comprehensive natural language processing tools
- TextBlob: For sentiment analysis capabilities
- scikit-learn: For machine learning algorithms and utilities
- GitHub Issues: Create an issue
- Documentation: Wiki
Made for the open source community