# Task 2: Build a File Upload and Processing API - SOLUTION

## Scenario
Build a FastAPI service that handles file uploads with background processing:
1. Accept file uploads with validation
2. Process files in background tasks
3. Track processing status
4. Use async file operations for efficiency

## Setup

In [None]:
import json
import time
import uuid
from pathlib import Path
from typing import Dict, List, Optional
from contextlib import asynccontextmanager

from fastapi import FastAPI, File, UploadFile, HTTPException, BackgroundTasks, status
from fastapi.testclient import TestClient
from pydantic import BaseModel, Field
import aiofiles

print("Imports successful!")

### Create temporary directory for file storage

In [None]:
import tempfile
import shutil

# Create temp directory for this session
TEMP_DIR = Path(tempfile.mkdtemp())
UPLOAD_DIR = TEMP_DIR / "uploads"
RESULTS_DIR = TEMP_DIR / "results"

UPLOAD_DIR.mkdir(exist_ok=True)
RESULTS_DIR.mkdir(exist_ok=True)

print(f"Upload directory: {UPLOAD_DIR}")
print(f"Results directory: {RESULTS_DIR}")

---
## Task 1: Setup Job Tracking and App - SOLUTION

In [None]:
# SOLUTION

# Global dict to track jobs
# Structure: {job_id: {"status": str, "filename": str, "result": dict}}
jobs: Dict[str, Dict] = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Lifespan context manager for cleanup."""
    # Startup
    print("File processing API starting up...")
    print(f"Upload directory: {UPLOAD_DIR}")
    
    yield  # App runs here
    
    # Shutdown: Cleanup
    print("Cleaning up temporary files...")
    # In production, you'd want to clean up old files
    jobs.clear()

# Create FastAPI app
app = FastAPI(
    title="File Upload and Processing API",
    description="Upload files for background processing with status tracking",
    version="1.0.0",
    lifespan=lifespan
)

print("FastAPI app created")

In [None]:
# TEST
assert 'jobs' in dir(), "jobs dict not found"
assert 'lifespan' in dir(), "lifespan function not found"
assert 'app' in dir(), "app not found"
assert isinstance(jobs, dict), "jobs should be a dict"

client = TestClient(app)

print("✓ Task 1 PASSED!")
print("  Job tracking and app initialized")
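
The shutdown comment in `lifespan` mentions cleaning up old files. One way this might look is sketched below, using only the standard library. `remove_stale_files` and its `max_age_seconds` parameter are hypothetical names for illustration, not part of the original solution.

```python
import time
from pathlib import Path


def remove_stale_files(directory: Path, max_age_seconds: float) -> int:
    """Delete regular files in `directory` older than `max_age_seconds`.

    Hypothetical helper (not part of the original solution).
    Returns the number of files removed.
    """
    cutoff = time.time() - max_age_seconds
    removed = 0
    for path in directory.iterdir():
        # Only touch regular files whose last modification predates the cutoff
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```

Calling `remove_stale_files(UPLOAD_DIR, 3600)` after the `yield` in `lifespan` would, under these assumptions, drop uploads older than an hour.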

---
## Task 2: Define Pydantic Models - SOLUTION

In [None]:
# SOLUTION

class UploadResponse(BaseModel):
    """Response model for file upload."""
    job_id: str = Field(..., description="Unique job identifier")
    filename: str = Field(..., description="Uploaded filename")
    status: str = Field(..., description="Current job status")
    message: str = Field(..., description="Status message")
    
    class Config:
        json_schema_extra = {
            "example": {
                "job_id": "550e8400-e29b-41d4-a716-446655440000",
                "filename": "document.txt",
                "status": "pending",
                "message": "File uploaded successfully and queued for processing"
            }
        }

class JobStatus(BaseModel):
    """Model for job status."""
    job_id: str = Field(..., description="Job identifier")
    status: str = Field(..., description="Current status")
    filename: str = Field(..., description="Filename being processed")
    result: Optional[Dict] = Field(None, description="Processing result (if completed)")
    
    class Config:
        json_schema_extra = {
            "example": {
                "job_id": "550e8400-e29b-41d4-a716-446655440000",
                "status": "completed",
                "filename": "document.txt",
                "result": {
                    "line_count": 100,
                    "word_count": 500,
                    "char_count": 3000
                }
            }
        }

class ProcessingResult(BaseModel):
    """Model for file processing results."""
    line_count: int = Field(..., description="Number of lines")
    word_count: int = Field(..., description="Number of words")
    char_count: int = Field(..., description="Number of characters")
    processing_time: float = Field(..., description="Processing time in seconds")
    
    class Config:
        json_schema_extra = {
            "example": {
                "line_count": 100,
                "word_count": 500,
                "char_count": 3000,
                "processing_time": 0.15
            }
        }

print("Pydantic models defined")

In [None]:
# TEST
assert 'UploadResponse' in dir(), "UploadResponse not found"
assert 'JobStatus' in dir(), "JobStatus not found"
assert 'ProcessingResult' in dir(), "ProcessingResult not found"

# Test models
upload_resp = UploadResponse(
    job_id="test-123",
    filename="test.txt",
    status="pending",
    message="File uploaded"
)
assert upload_resp.job_id == "test-123"

job_status = JobStatus(
    job_id="test-123",
    status="completed",
    filename="test.txt",
    result={"lines": 10}
)
assert job_status.result is not None

result = ProcessingResult(
    line_count=10,
    word_count=50,
    char_count=300,
    processing_time=0.5
)
assert result.line_count == 10

print("✓ Task 2 PASSED!")
print("  All Pydantic models defined")
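
The models above type `status` as a plain `str`, so any value is accepted. One possible tightening, sketched here with the standard library only, is a string-valued enum. `JobState` and `is_finished` are hypothetical names; the four members mirror the status strings used in this notebook.

```python
from enum import Enum


class JobState(str, Enum):
    """Hypothetical enum of the job statuses used in this notebook."""
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


def is_finished(state: JobState) -> bool:
    """Return True once a job can no longer change state."""
    return state in (JobState.COMPLETED, JobState.FAILED)
```

Because the enum subclasses `str`, its members compare equal to the raw strings, and Pydantic accepts an enum class as a field type (for example `status: JobState`).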

---
## Task 3: Implement Background Processing Function - SOLUTION

In [None]:
# SOLUTION

async def process_file(job_id: str, file_path: Path):
    """
    Process uploaded file asynchronously.
    
    Args:
        job_id: Job identifier
        file_path: Path to uploaded file
    """
    start_time = time.time()
    
    try:
        # Update status to processing
        jobs[job_id]["status"] = "processing"
        
        # Read file asynchronously
        async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
            content = await f.read()
        
        # Count lines; splitlines() avoids counting a phantom extra line
        # when the file ends with a trailing newline
        lines = content.splitlines()
        line_count = len(lines)
        
        # Count words (split on whitespace)
        words = content.split()
        word_count = len(words)
        
        # Count characters
        char_count = len(content)
        
        # Calculate processing time
        processing_time = time.time() - start_time
        
        # Create result
        result = {
            "line_count": line_count,
            "word_count": word_count,
            "char_count": char_count,
            "processing_time": processing_time
        }
        
        # Update job with result
        jobs[job_id]["status"] = "completed"
        jobs[job_id]["result"] = result
        
    except Exception as e:
        # Handle errors
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["error"] = str(e)

print("process_file function defined")

In [None]:
# TEST
import asyncio

assert 'process_file' in dir(), "process_file function not found"

# Create test file
test_file = UPLOAD_DIR / "test_processing.txt"
test_file.write_text("Hello world\nThis is a test\nThree lines total")

# Test processing
test_job_id = "test-job-123"
jobs[test_job_id] = {"status": "pending", "filename": "test_processing.txt"}

# Run async function (use await in Jupyter since event loop is already running)
await process_file(test_job_id, test_file)

# Check results
assert jobs[test_job_id]['status'] == 'completed', f"Expected completed, got {jobs[test_job_id]['status']}"
assert 'result' in jobs[test_job_id], "Result not found in job"
result = jobs[test_job_id]['result']
assert result['line_count'] == 3, f"Expected 3 lines, got {result['line_count']}"
assert result['word_count'] > 0, "Word count should be > 0"
assert result['char_count'] > 0, "Char count should be > 0"

print("✓ Task 3 PASSED!")
print(f"  Processing result: {result}")
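
`process_file` loads the whole file into memory before counting. For large files, a line-by-line pass is one alternative. The sketch below is an assumption-laden variant, not the solution's method: `count_text_stats` and `count_text_stats_async` are hypothetical names, and it uses `asyncio.to_thread` (Python 3.9+) instead of `aiofiles` to keep the blocking read off the event loop.

```python
import asyncio
from pathlib import Path
from typing import Dict


def count_text_stats(file_path: Path) -> Dict[str, int]:
    """Count lines, words and characters one line at a time (hypothetical helper)."""
    line_count = word_count = char_count = 0
    # newline="" disables newline translation so char_count matches the raw text
    with open(file_path, "r", encoding="utf-8", newline="") as f:
        for line in f:
            line_count += 1
            word_count += len(line.split())
            char_count += len(line)
    return {"line_count": line_count, "word_count": word_count, "char_count": char_count}


async def count_text_stats_async(file_path: Path) -> Dict[str, int]:
    """Run the blocking counter in a worker thread (requires Python 3.9+)."""
    return await asyncio.to_thread(count_text_stats, file_path)
```

Inside `process_file`, the result of `await count_text_stats_async(file_path)` could be merged with `processing_time` to build the same result dict.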

---
## Task 4: Implement File Upload Endpoint - SOLUTION

In [None]:
# SOLUTION

MAX_FILE_SIZE = 10 * 1024 * 1024  # 10MB
ALLOWED_EXTENSIONS = {".txt", ".csv", ".json"}

@app.post("/upload", response_model=UploadResponse, status_code=status.HTTP_200_OK)
async def upload_file(
    background_tasks: BackgroundTasks,
    file: UploadFile = File(...)
):
    """
    Upload a file for processing.
    
    Args:
        file: File to upload
        background_tasks: FastAPI background tasks
        
    Returns:
        UploadResponse with job information
    """
    try:
        # Validate file extension; use only the basename, and tolerate a missing filename
        original_name = Path(file.filename or "").name
        file_ext = Path(original_name).suffix.lower()
        if file_ext not in ALLOWED_EXTENSIONS:
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail=f"File type {file_ext!r} not allowed. Allowed: {sorted(ALLOWED_EXTENSIONS)}"
            )
        
        # Read the upload and check its size before touching the disk,
        # so a rejected upload never leaves an empty file behind
        content = await file.read()
        if len(content) > MAX_FILE_SIZE:
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail=f"File too large. Max size: {MAX_FILE_SIZE // (1024 * 1024)}MB"
            )
        
        # Generate unique job ID
        job_id = str(uuid.uuid4())
        
        # Build a unique filename from the basename only, so a client-supplied
        # name such as "../../x.txt" cannot escape UPLOAD_DIR
        safe_filename = f"{job_id}_{original_name}"
        file_path = UPLOAD_DIR / safe_filename
        
        # Save file asynchronously
        async with aiofiles.open(file_path, 'wb') as f:
            await f.write(content)
        
        # Create job entry
        jobs[job_id] = {
            "status": "pending",
            "filename": file.filename,
            "file_path": str(file_path)
        }
        
        # Schedule background processing
        background_tasks.add_task(process_file, job_id, file_path)
        
        return UploadResponse(
            job_id=job_id,
            filename=file.filename,
            status="pending",
            message="File uploaded successfully and queued for processing"
        )
        
    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Upload failed: {str(e)}"
        )

print("/upload endpoint implemented")

In [None]:
# TEST
from io import BytesIO

client = TestClient(app)

# Create test file content
test_content = b"Line 1\nLine 2\nLine 3"
test_file = ("test.txt", BytesIO(test_content), "text/plain")

# Upload file
response = client.post(
    "/upload",
    files={"file": test_file}
)

assert response.status_code == 200, f"Expected 200, got {response.status_code}"
data = response.json()
assert 'job_id' in data, "Response missing job_id"
assert 'filename' in data, "Response missing filename"
assert 'status' in data, "Response missing status"
assert data['status'] == 'pending', f"Expected pending, got {data['status']}"

job_id = data['job_id']
assert job_id in jobs, "Job not found in jobs dict"

# Test invalid file type
invalid_file = ("test.exe", BytesIO(b"fake exe"), "application/x-msdownload")
response = client.post(
    "/upload",
    files={"file": invalid_file}
)
assert response.status_code == 400, "Invalid file type should return 400"

# TestClient normally finishes background tasks before it returns the
# response, so this short sleep is only a safety margin
time.sleep(1)

print("✓ Task 4 PASSED!")
print(f"  File uploaded with job_id: {job_id}")
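
The endpoint reads the whole upload with `await file.read()` before checking its size. A chunked copy that stops as soon as the limit is crossed is one alternative. The sketch below is illustrative only: `save_with_limit` and `UploadTooLarge` are hypothetical names, `source` is assumed to be any object with an awaitable `read(size)` method (which `UploadFile` provides), and writes go through `asyncio.to_thread` rather than `aiofiles`.

```python
import asyncio
from pathlib import Path


class UploadTooLarge(Exception):
    """Raised when the incoming data exceeds the allowed size (hypothetical)."""


async def save_with_limit(source, dest: Path, max_bytes: int,
                          chunk_size: int = 64 * 1024) -> int:
    """Copy `source` to `dest` in chunks, aborting once `max_bytes` is exceeded.

    `source` is assumed to expose an awaitable `read(size)` method.
    Returns the number of bytes written.
    """
    written = 0
    try:
        with open(dest, "wb") as out:
            while True:
                chunk = await source.read(chunk_size)
                if not chunk:
                    break
                written += len(chunk)
                if written > max_bytes:
                    raise UploadTooLarge(f"more than {max_bytes} bytes received")
                # Hand the blocking write to a worker thread
                await asyncio.to_thread(out.write, chunk)
    except UploadTooLarge:
        # Remove the partial file so rejected uploads leave nothing behind
        dest.unlink(missing_ok=True)
        raise
    return written
```

In the endpoint, an `UploadTooLarge` would then be translated into an `HTTPException`, and memory use stays bounded by `chunk_size` instead of the full file size.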

---
## Task 5: Implement Status and Results Endpoints - SOLUTION

In [None]:
# SOLUTION

@app.get("/jobs/{job_id}", response_model=JobStatus, status_code=status.HTTP_200_OK)
async def get_job_status(job_id: str):
    """
    Get status of a specific job.
    
    Args:
        job_id: Job identifier
        
    Returns:
        JobStatus with current status and result
    """
    if job_id not in jobs:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=f"Job {job_id} not found"
        )
    
    job = jobs[job_id]
    
    return JobStatus(
        job_id=job_id,
        status=job["status"],
        filename=job["filename"],
        result=job.get("result")
    )

@app.get("/jobs", response_model=List[JobStatus], status_code=status.HTTP_200_OK)
async def list_jobs():
    """
    List all jobs.
    
    Returns:
        List of JobStatus for all jobs
    """
    return [
        JobStatus(
            job_id=job_id,
            status=job["status"],
            filename=job["filename"],
            result=job.get("result")
        )
        for job_id, job in jobs.items()
    ]

@app.delete("/jobs/{job_id}", status_code=status.HTTP_200_OK)
async def delete_job(job_id: str):
    """
    Delete a job and associated files.
    
    Args:
        job_id: Job identifier
        
    Returns:
        Success message
    """
    if job_id not in jobs:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=f"Job {job_id} not found"
        )
    
    job = jobs[job_id]
    
    # Delete uploaded file if exists
    if "file_path" in job:
        file_path = Path(job["file_path"])
        if file_path.exists():
            file_path.unlink()
    
    # Remove job from tracking
    del jobs[job_id]
    
    return {
        "message": f"Job {job_id} deleted successfully",
        "job_id": job_id
    }

print("Status and management endpoints implemented")

In [None]:
# TEST
client = TestClient(app)

# First upload a file
test_content = b"Test line 1\nTest line 2"
response = client.post(
    "/upload",
    files={"file": ("status_test.txt", BytesIO(test_content), "text/plain")}
)
job_id = response.json()['job_id']

# Wait for processing
time.sleep(1)

# Test get specific job
response = client.get(f"/jobs/{job_id}")
assert response.status_code == 200, f"Expected 200, got {response.status_code}"
data = response.json()
assert data['job_id'] == job_id
assert data['status'] in ['pending', 'processing', 'completed'], f"Unexpected status: {data['status']}"

# Test get all jobs
response = client.get("/jobs")
assert response.status_code == 200
jobs_list = response.json()
assert isinstance(jobs_list, list), "Expected list of jobs"
assert len(jobs_list) > 0, "Should have at least one job"

# Test delete job
response = client.delete(f"/jobs/{job_id}")
assert response.status_code == 200

# Verify job deleted
response = client.get(f"/jobs/{job_id}")
assert response.status_code == 404, "Deleted job should return 404"

# Test non-existent job
response = client.get("/jobs/fake-job-id")
assert response.status_code == 404, "Non-existent job should return 404"

print("✓ Task 5 PASSED!")
print("  Status and management endpoints working")
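
The tests above wait a fixed `time.sleep(1)` before checking the job. A client that polls until the job reaches a terminal state is a common alternative. The sketch below assumes a hypothetical helper `wait_for_job` that receives any zero-argument callable returning the current status string.

```python
import time
from typing import Callable


def wait_for_job(get_status: Callable[[], str], timeout: float = 5.0,
                 interval: float = 0.1) -> str:
    """Poll `get_status` until it returns a terminal state or `timeout` expires.

    Hypothetical helper: "completed" and "failed" are treated as terminal,
    matching the status strings used in this notebook.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = get_status()
        if current in ("completed", "failed"):
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {current!r} after {timeout}s")
        time.sleep(interval)
```

With the test client, this could be called as `wait_for_job(lambda: client.get(f"/jobs/{job_id}").json()["status"])`.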

---
## Bonus: Test Complete Upload Flow

In [None]:
# Load real test file
with open('../fixtures/input/test_file.txt', 'rb') as f:
    file_content = f.read()

print("=== Complete File Upload Flow ===")
print()

# 1. Upload file
response = client.post(
    "/upload",
    files={"file": ("test_file.txt", BytesIO(file_content), "text/plain")}
)
print("1. File Upload:")
upload_data = response.json()
print(f"   Job ID: {upload_data['job_id']}")
print(f"   Status: {upload_data['status']}")
print()

job_id = upload_data['job_id']

# 2. Check status immediately
response = client.get(f"/jobs/{job_id}")
print("2. Initial Status:")
status_data = response.json()
print(f"   Status: {status_data['status']}")
print()

# 3. Wait for processing
print("3. Waiting for processing...")
time.sleep(2)

# 4. Check final status
response = client.get(f"/jobs/{job_id}")
print("\n4. Final Status:")
final_data = response.json()
print(f"   Status: {final_data['status']}")
if final_data.get('result'):
    print("   Results:")
    for key, value in final_data['result'].items():
        print(f"     - {key}: {value}")
print()

# 5. List all jobs
response = client.get("/jobs")
all_jobs = response.json()
print(f"5. Total Jobs: {len(all_jobs)}")
print()

print("✓ Complete flow test passed!")

---
## Cleanup

In [None]:
# Cleanup temp directory
shutil.rmtree(TEMP_DIR)
print(f"Cleaned up: {TEMP_DIR}")

---
## Summary

**Key techniques used:**

1. **File uploads:**
   - Use `UploadFile` for efficient streaming
   - Validate file type and size before saving
   - Generate unique filenames to avoid collisions

2. **Background tasks:**
   - Use `BackgroundTasks` to process without blocking
   - Update job status as processing progresses
   - Handle errors gracefully

3. **Async file I/O:**
   - Use `aiofiles` for non-blocking file operations
   - Read and write files asynchronously
   - Keeps the event loop free to serve other requests while file I/O is in progress

4. **Job tracking:**
   - Use UUID for unique job identifiers
   - Store job state in memory (use database in production)
   - Provide status endpoints for monitoring

5. **Validation:**
   - Check file extensions
   - Enforce size limits
   - Return appropriate error codes

6. **Cleanup:**
   - Delete files when jobs are removed
   - Use lifespan for application cleanup
   - Clear job tracking on shutdown

**Production considerations:**
- Use database for job tracking (Redis, PostgreSQL)
- Implement job expiration and cleanup
- Add authentication and authorization
- Use object storage (S3) for files
- Add rate limiting
- Implement progress tracking for long operations
- Add webhooks for job completion
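
Job expiration, from the list above, can be sketched against the in-memory `jobs` dict. This is an illustration under stated assumptions: `expire_jobs` is a hypothetical helper, and it assumes each job entry carries a `created_at` timestamp, which the notebook's entries do not currently store.

```python
import time
from typing import Dict, List, Optional


def expire_jobs(jobs: Dict[str, dict], ttl_seconds: float,
                now: Optional[float] = None) -> List[str]:
    """Remove finished jobs older than `ttl_seconds`; return the removed ids.

    Assumes each job dict has a `created_at` value from time.time().
    Jobs without it are treated as brand new and are kept.
    """
    now = time.time() if now is None else now
    expired = [
        job_id for job_id, job in jobs.items()
        if job.get("status") in ("completed", "failed")
        and now - job.get("created_at", now) > ttl_seconds
    ]
    for job_id in expired:
        del jobs[job_id]
    return expired
```

Only finished jobs are removed, so a job that is still pending or processing is never dropped mid-flight.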

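**Common pitfalls to watch for:**
- Reading the entire file into memory (this solution does so with `await file.read()` for simplicity; stream in chunks for large uploads)
- Not validating file size before reading (the size check here runs only after the full read)
- Blocking operations in async context
- Not handling file encoding errors (here a decode failure marks the job as `failed`)
- Filename collision issues
- Not cleaning up uploaded files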
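
On encoding errors: `process_file` opens the file as UTF-8 and relies on its generic `except` to mark the job as failed. If a best-effort result is preferred over a failure, a tolerant decode is one option. `decode_text` below is a hypothetical helper, shown as a sketch rather than the solution's behaviour.

```python
def decode_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to replacement characters.

    Hypothetical helper: strict UTF-8 is tried first (accepting an optional
    BOM); undecodable bytes are then replaced with U+FFFD instead of failing.
    """
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="replace")
```

The trade-off is that counts for a badly encoded file become approximate, so recording that a fallback happened may be worth considering.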