A machine learning project to differentiate between good and bad captions using Natural Language Processing.
This project uses TF-IDF vectorization and Logistic Regression to classify captions as either "good" or "bad" based on their quality.
- ✅ Load caption data from JSON files
- ✅ TF-IDF text vectorization
- ✅ Logistic Regression classification
- ✅ Model saving and loading
- ✅ Prediction with confidence scores
- ✅ Data augmentation utilities
- ✅ Interactive prediction mode
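To make the TF-IDF and Logistic Regression steps concrete, here is a minimal standard-library sketch of the idea. It is not the project's implementation: the helper names (`fit_idf`, `tfidf`, `train`, `predict`), the toy captions, and the 1 = good / 0 = bad encoding are all assumptions for illustration. The IDF formula is the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default, although this README does not state which library the project relies on.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real project would likely strip punctuation too.
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(vec, weights, bias):
    # Linear score of a sparse vector under the current model.
    return bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())


def train(vectors, labels, epochs=200, lr=0.5):
    """Stochastic gradient descent on the logistic loss (labels: 1 = good, 0 = bad)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = sigmoid(score(vec, weights, bias)) - y
            bias -= lr * err
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
    return weights, bias


def predict(text, idf, weights, bias):
    """Return the predicted label and the probability assigned to that label."""
    p_good = sigmoid(score(tfidf(text, idf), weights, bias))
    label = "good" if p_good >= 0.5 else "bad"
    confidence = p_good if label == "good" else 1.0 - p_good
    return {"label": label, "confidence": round(confidence, 2)}


# Toy training data, invented for this sketch.
captions = [
    "A golden sunset over a calm ocean",
    "Children laughing in a sunny park",
    "pic",
    "asdf img 001",
]
labels = [1, 1, 0, 0]

idf = fit_idf(captions)
weights, bias = train([tfidf(c, idf) for c in captions], labels)
result = predict("A calm sunset over the park", idf, weights, bias)
print(result)
```

The confidence here is simply the model's probability for whichever label it picked, which is one common way to produce the kind of score listed above.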
Install the dependencies:

    pip install -r requirements.txt

Train the model:

    python train.py

Run predictions:

    python predict.py

Use the classifier from Python:

    from caption_classifier import CaptionClassifier
    classifier = CaptionClassifier()
    captions, labels = classifier.load_data('caption_data.json')
    X = classifier.preprocess(captions)
    classifier.train(X, labels)
    result = classifier.predict("A beautiful sunset over the ocean")
    print(result)  # {'label': 'good', 'confidence': 0.95}

Project structure:

    ├── caption_classifier.py # Main classifier class
    ├── caption_data.json # Sample training data
    ├── config.py # Configuration settings
    ├── utils.py # Utility functions
    ├── augmentation.py # Data augmentation
    ├── train.py # Training script
    ├── predict.py # Prediction API
    ├── test_classifier.py # Unit tests
    └── requirements.txt # Dependencies
The sample dataset includes 10 labeled captions (5 good, 5 bad).
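The layout of `caption_data.json` is not documented here. As a rough sketch, the snippet below assumes a JSON array of objects with hypothetical `caption` and `label` keys and shows how such a file could be read into the parallel lists that `load_data` returns. The `load_captions` helper and the sample records are illustrative only; the real file may use different keys.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical records; the real caption_data.json may be structured differently.
sample = [
    {"caption": "A beautiful sunset over the ocean", "label": "good"},
    {"caption": "img_0042", "label": "bad"},
]


def load_captions(path):
    """Read a JSON array of records and return parallel caption and label lists."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    captions = [record["caption"] for record in records]
    labels = [record["label"] for record in records]
    return captions, labels


# Write the sample to a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "captions.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    captions, labels = load_captions(path)

# Counting labels is a quick way to check the class balance of a dataset.
print(Counter(labels))
```

Running the same label count on the bundled sample file is a quick check that the classes are balanced as described.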
Run the unit tests:

    python -m unittest test_classifier.py

MIT License