A production-ready FastAPI application that provides AI text generation with response caching and comprehensive monitoring.
## Features

- 🤖 AI Text Generation - Uses Google's Flan-T5-Small model
- ⚡ Response Caching - 1000x faster for repeat queries
- 📊 Metrics Tracking - Real-time performance monitoring
- 🔍 Comprehensive Logging - Detailed request/response tracking
- ✅ Input Validation - Prevents malformed requests
- 💚 Health Checks - Production monitoring ready
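The response-caching idea can be sketched as an in-memory dictionary keyed by the prompt and generation parameters. This is a minimal illustration, not the app's actual implementation; `model_fn` stands in for the real Flan-T5 inference call:

```python
import time

# In-memory cache: (prompt, max_length) -> generated text
_cache: dict[tuple[str, int], str] = {}

def generate_with_cache(prompt: str, max_length: int, model_fn) -> dict:
    """Return a cached response when available, otherwise run the model."""
    key = (prompt, max_length)
    start = time.perf_counter()
    if key in _cache:
        text, cached = _cache[key], True
    else:
        text = model_fn(prompt, max_length)  # slow path: real inference
        _cache[key] = text
        cached = False
    return {
        "response": text,
        "cached": cached,
        "inference_time_seconds": round(time.perf_counter() - start, 4),
    }
```

The first call for a given `(prompt, max_length)` pair pays the full inference cost; repeats are a dictionary lookup.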
## Tech Stack

- FastAPI - Modern Python web framework
- Transformers (Hugging Face) - AI model integration
- Flan-T5-Small - 80M parameter text generation model
- Python 3.12
## Requirements

- Python 3.8+
- 8GB+ RAM (for model)
## Installation

1. Clone the repository:

   ```bash
   git clone <your-repo-url>
   cd ml-api-project
   ```

2. Create a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Mac/Linux
   ```

3. Install dependencies:

   ```bash
   pip install fastapi uvicorn transformers torch
   ```

4. Run the API:

   ```bash
   uvicorn main:app --reload
   ```

5. Open your browser at http://127.0.0.1:8000/docs
## API Endpoints

- `GET /` - Root endpoint; confirms the API is running.
- Health check endpoint for production monitoring.
- Metrics endpoint - returns API usage statistics:
- Total requests
- Cache hit rate
- Average inference time
- And more...
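The statistics above could be tracked with a small accumulator class along these lines (a sketch; the field names and rounding are assumptions, not necessarily what `main.py` does):

```python
class Metrics:
    """Accumulate request counts, cache hits, and inference times."""

    def __init__(self):
        self.total_requests = 0
        self.cache_hits = 0
        self.inference_times = []  # seconds, uncached requests only

    def record(self, cached: bool, inference_time: float = 0.0) -> None:
        """Call once per request, after the response is produced."""
        self.total_requests += 1
        if cached:
            self.cache_hits += 1
        else:
            self.inference_times.append(inference_time)

    def snapshot(self) -> dict:
        """Shape of the payload a metrics endpoint might return."""
        return {
            "total_requests": self.total_requests,
            "cache_hit_rate": (
                self.cache_hits / self.total_requests if self.total_requests else 0.0
            ),
            "avg_inference_time_seconds": (
                sum(self.inference_times) / len(self.inference_times)
                if self.inference_times else 0.0
            ),
        }
```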
### `POST /generate`

Main text generation endpoint.
**Request body:**

```json
{
  "prompt": "What is machine learning?",
  "max_length": 100
}
```

**Response:**

```json
{
  "prompt": "What is machine learning?",
  "response": "Machine learning is a method of data analysis...",
  "model": "flan-t5-small",
  "inference_time_seconds": 5.59,
  "cached": false
}
```

## Performance

- First request: ~5-10 seconds (runs the AI model)
- Cached requests: ~0.01 seconds (near-instant)
- Cache hit rate: typically 60%+ in production
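The `/generate` request body is validated before inference (FastAPI does this through a Pydantic request model). In spirit, the checks amount to something like this stdlib sketch; the bounds shown are illustrative assumptions, not the app's actual limits:

```python
def validate_request(body: dict) -> tuple[str, int]:
    """Reject malformed /generate requests before running the model."""
    prompt = body.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    max_length = body.get("max_length", 100)  # default mirrors the example
    if not isinstance(max_length, int) or not (1 <= max_length <= 512):
        raise ValueError("max_length must be an integer between 1 and 512")
    return prompt, max_length
```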
## Example Usage

**cURL:**

```bash
curl -X POST "http://127.0.0.1:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Python?", "max_length": 100}'
```

**Python:**

```python
import requests

response = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"prompt": "What is Python?", "max_length": 100},
)
print(response.json())
```

## Logging

The API uses comprehensive logging with emoji markers:
- 🔵 New request received
- 📝 Prompt details
- 💾 Cache hit/miss
- 🤖 AI model running
- ✅ Response generated
- ⚠️ Warnings/errors
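The emoji markers are plain Python logging with decorated messages. A minimal sketch of the per-request trace (the exact messages in `main.py` may differ):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
logger = logging.getLogger("ml-api")

def log_request(prompt: str, cached: bool) -> str:
    """Emit the emoji-marked trace for one request; returns the status line."""
    logger.info("🔵 New request received")
    logger.info("📝 Prompt: %r", prompt)
    status = "💾 Cache hit" if cached else "🤖 Running AI model"
    logger.info(status)
    logger.info("✅ Response generated")
    return status
```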
## Future Improvements

- Deploy to AWS ECS/Lambda
- Add Redis for persistent caching
- Implement rate limiting
- Add authentication
- Support for multiple models
- Streaming responses
- OpenAPI/Swagger customization
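Of the items above, rate limiting is easy to prototype in-process with a token bucket; a sketch (capacity and refill rate are illustrative configuration choices, and a multi-worker deployment would want a shared store such as the Redis mentioned above instead):

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens/sec."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if this request fits within the rate limit."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```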
## Project Structure

```
ml-api-project/
├── main.py            # Main application code
├── README.md          # This file
├── requirements.txt   # Python dependencies
└── venv/              # Virtual environment
```
Built as part of an AI engineering portfolio project.
## License

MIT License