This repository contains the implementation of BageLLM, an automated data preparation pipeline specifically optimized for Llama 3.2 fine-tuning.
```
BageLLM/
├── src/
│   ├── __init__.py
│   ├── annotator.py
│   ├── cleaner.py
│   ├── pipeline.py
│   ├── quality_control/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── metrics.py
│   │   └── rules.py
│   ├── annotation_system/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   └── processors.py
│   └── s3_connector.py
├── tests/
│   ├── __init__.py
│   ├── test_cleaner.py
│   ├── test_annotation_system.py
│   └── test_quality_control.py
├── config/
│   └── quality_config.json
├── requirements.txt
└── run.py
```
- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install required models:

  ```bash
  python -m spacy download en_core_web_sm
  ```

- Set up AWS credentials for S3 access:

  ```ini
  # ~/.aws/credentials
  [default]
  aws_access_key_id = your_access_key
  aws_secret_access_key = your_secret_key
  ```

- Create the quality control configuration (config/quality_config.json):
```json
{
  "min_completeness": 0.8,
  "min_consistency": 0.7,
  "min_validity": 0.8,
  "min_uniqueness": 0.9,
  "max_data_age_days": 30,
  "min_accuracy": 0.8,
  "text_length_threshold": 10,
  "min_overall_score": 0.75,
  "output_dir": "quality_reports"
}
```

The annotation system provides comprehensive text analysis:
```python
from src.annotation_system import AnnotationSystem, AnnotationConfig

# Initialize
config = AnnotationConfig(
    enable_entities=True,
    enable_sentiment=True,
    enable_topics=True,
    enable_keywords=True,
    enable_language=True
)
system = AnnotationSystem(config)

# Process a single text
annotations = system.annotate_text("Your text here")

# Process a batch
annotations = system.process_batch(["Text 1", "Text 2"])
```

Features:
- Named Entity Recognition
- Sentiment Analysis
- Topic Detection
- Keyword Extraction
- Language Detection
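These features map onto standard NLP building blocks. As a toy illustration only (not the package's actual implementation), keyword extraction can be approximated with a stopword-filtered frequency count:

```python
import re
from collections import Counter

# Minimal stopword list for illustration; a real system would use a fuller set
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "for"}

def top_keywords(text: str, k: int = 5) -> list[str]:
    """Naive frequency-based keyword extraction: tokenize, drop stopwords, count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(k)]
```

The spaCy-backed implementation in the package will produce richer results; this sketch just shows the shape of the task.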
The quality control system ensures data quality through multiple dimensions:
```python
from src.quality_control import QualityController, SpecialCharacterRule

# Initialize
controller = QualityController()

# Validate data
passed, issues, metrics = controller.validate_data(your_dataframe)

# Use a specific rule
rule = SpecialCharacterRule()
result = rule.validate(your_dataframe)
```

Quality Dimensions:
- Completeness
- Consistency
- Validity
- Uniqueness
- Timeliness
- Integrity
- Accuracy
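For instance, completeness is commonly measured as the share of non-missing cells; a minimal pandas sketch (hypothetical, the package's exact formula may differ):

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Fraction of cells in the frame that are not null/NaN."""
    return float(df.notna().to_numpy().mean())
```

A score below `min_completeness` in config/quality_config.json would then flag the dataset.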
The cleaning pipeline processes and prepares the data:
```python
from src.pipeline import AnnotationPipeline

pipeline = AnnotationPipeline()
pipeline.process_datasets()
```

Features:
- Text cleaning
- Format standardization
- Quality validation
- Batch processing
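The text-cleaning step can be pictured as a small normalization pass; a toy sketch of assumed behavior (not cleaner.py's actual implementation):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Toy cleaning pass: normalize unicode, drop control chars, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    # Remove control characters (category "C*") except common whitespace
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C" or ch in "\n\t ")
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text
```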
Run the full pipeline end to end:

```python
from src.pipeline import AnnotationPipeline

# Initialize and run the pipeline
pipeline = AnnotationPipeline(output_dir="processed_data")
pipeline.process_datasets()
```

Validate a dataset with the quality controller:

```python
from src.quality_control import QualityController
import pandas as pd

# Load data
data = pd.read_csv("your_data.csv")

# Initialize the controller
controller = QualityController()

# Validate the data
passed, issues, metrics = controller.validate_data(data)

# Check results
print(f"Passed: {passed}")
print(f"Issues: {issues}")
print(f"Overall Score: {metrics.overall_score}")
```

Annotate a single text:

```python
from src.annotation_system import AnnotationSystem

# Initialize
system = AnnotationSystem()

# Process text
text = "Apple Inc. is planning to release a new iPhone next year."
result = system.annotate_text(text)

# View results
print("Entities:", result['entities'])
print("Sentiment:", result['sentiment'])
print("Topics:", result['topics'])
```

Run all tests:
```bash
python -m unittest discover tests
```

Run specific test suites:

```bash
python -m unittest tests/test_cleaner.py
python -m unittest tests/test_annotation_system.py
python -m unittest tests/test_quality_control.py
```

Quality reports are written to quality_reports/ and processed datasets to processed_data/:

```
quality_reports/
├── quality_report_dataset1_20241110_123456.json
├── quality_report_dataset1_20241110_123456.txt
├── quality_report_dataset2_20241110_123457.json
└── quality_report_dataset2_20241110_123457.txt

processed_data/
├── processed_dataset1.csv
├── processed_dataset1_metrics.json
├── processed_dataset2.csv
└── processed_dataset2_metrics.json
```
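The timestamp suffix in the report names can be produced with a pattern like the following (a sketch; the pipeline's actual naming code may differ):

```python
from datetime import datetime

def report_basename(dataset, now=None):
    """Build a name like quality_report_dataset1_20241110_123456 (date + HHMMSS)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"quality_report_{dataset}_{stamp}"
```

The second-resolution suffix keeps reports from successive runs from overwriting each other.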
- Environment setup
- Core infrastructure
- Basic cleaning pipeline
- Initial testing
- Annotation system implementation
- Quality control metrics
- Validation rules
- Integration testing
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
Common issues and solutions:
- Import Errors:

  ```bash
  pip install -e .  # Install the package in editable mode
  ```

- Model Download Issues:

  ```bash
  python -m spacy download en_core_web_sm --force
  ```

- Quality Control Failures:
  - Check the quality_reports directory for detailed error reports
  - Adjust thresholds in config/quality_config.json
  - Review the validation rules in the logs
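Threshold tuning is easier to reason about with the comparison made explicit. A hypothetical helper (names are illustrative, not the package API) that checks computed metric values against the min_* keys from config/quality_config.json:

```python
def check_thresholds(metrics: dict, config: dict) -> tuple[bool, list[str]]:
    """Compare each metric against its 'min_<name>' threshold; collect failures."""
    failures = []
    for key, threshold in config.items():
        if not key.startswith("min_"):
            continue  # keys like max_data_age_days need their own handling
        name = key[len("min_"):]
        value = metrics.get(name)
        if value is not None and value < threshold:
            failures.append(f"{name}={value} < {threshold}")
    return (not failures, failures)
```

Lowering a failing threshold in the config (or improving the data) is what flips a dataset from failing to passing.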
For any issues:
- Check the logs in:
  - pipeline.log
  - quality_control.log
- Review the quality reports
- Run the test suite
- Create a GitHub issue with:
  - Detailed description
  - Relevant log excerpts
  - Steps to reproduce
This project is licensed under the MIT License - see the LICENSE file for details.