# Notebook 17: AWS S3 Integration - Beyond Simple Database Storage

## 🎯 What You'll Learn

In our **Todo app**, everything was simple - we stored all data in a PostgreSQL database. But what happens when you need to store **files** like PDFs? This notebook shows you the key differences when you move from database-only applications to applications that handle file storage.

## 📊 Todo App vs PDF App: Storage Comparison

| Aspect | Todo App | PDF App | Why Different? |
|---------|----------|---------|----------------|
| **Data Storage** | Everything in database | Text in database + Files in S3 | Large binary files bloat the database and slow down backups |
| **Dependencies** | Basic FastAPI packages | + `boto3`, `python-multipart` | Need AWS SDK and file upload support |
| **Environment Variables** | Database credentials only | Database + AWS credentials | Need cloud service access |
| **Configuration** | Simple database connection | Database + S3 client setup | Multiple services to configure |

---

**💡 Key Insight**: When your app handles files, you need to think about **dual storage** - metadata goes in your database, actual files go in cloud storage.

## Part 1: What We Had - Todo App Storage

### Simple Todo App Storage Model

In our Todo app, everything was stored in one place:

```python
# Todo App - Everything in Database
class Todo(Base):
    __tablename__ = "todos"
    
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String)          # ← Todo text stored here
    completed = Column(Boolean)    # ← Status stored here
```

### Todo App Environment Variables
```bash
# .env file for Todo app
DATABASE_USER=todo_user
DATABASE_PASSWORD=todo_password
DATABASE_HOST=localhost
DATABASE_PORT=5432
DATABASE_NAME=modern_todo_db
```

**Why this worked**: Todo items are just text - small, simple data that fits perfectly in a database.

## Part 2: What Changes - PDF App Dual Storage

### Why We Need Cloud Storage for Files

**The Problem**: PDF files are:
- **Large** (can be several MB each)
- **Binary data** (not text like todo items)
- **Awkward to handle** (they need upload, download, and display logic)
- **Expensive to store in a database** (databases aren't optimized for large files)

**The Solution**: Split storage into two systems:
1. **Database**: Stores metadata (filename, selected status, file URL)
2. **AWS S3**: Stores the actual PDF files

```python
# PDF App - Dual Storage Model
class PDF(Base):
    __tablename__ = "pdfs"
    
    id = Column(Integer, primary_key=True, index=True)
    name = Column(Text)        # ← Filename stored in database
    file = Column(Text)        # ← S3 URL stored in database (not the file!)
    selected = Column(Boolean) # ← Status stored in database
```

**Key Difference**: The `file` column stores a **URL pointing to S3**, not the actual file content.
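
To make this concrete, here is a minimal sketch of how such a URL could be assembled from a bucket, a region, and an object key. The helper name `build_s3_url` and the key `uploads/report.pdf` are assumptions made for illustration, and the virtual-hosted URL layout is the common S3 format rather than something this app is confirmed to use.

```python
from urllib.parse import quote


def build_s3_url(bucket: str, region: str, key: str) -> str:
    """Return a virtual-hosted-style S3 URL for an object (hypothetical helper)."""
    # quote() percent-encodes spaces and other unsafe characters but keeps "/"
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


# This string is what the `file` column would hold, not the PDF bytes.
url = build_s3_url("pdf-basic-app", "us-east-1", "uploads/report.pdf")
print(url)
```

The database row stays a few hundred bytes no matter how large the PDF is.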

## Part 3: New Dependencies - What We Need to Add

### Todo App Dependencies (Simple)
```toml
# Todo app pyproject.toml
[tool.poetry.dependencies]
python = "3.13.3"
fastapi = "0.115.0"
uvicorn = "0.32.0"
sqlalchemy = "2.1.0"
psycopg2-binary = "2.9.9"
pydantic = "2.10.0"
```

### PDF App Dependencies (Added Complexity)
```toml
# PDF app pyproject.toml
[tool.poetry.dependencies]
python = "3.13.3"
fastapi = "0.115.0"
uvicorn = "0.32.0"
sqlalchemy = "2.1.0"
psycopg2-binary = "2.9.9"
pydantic = "2.10.0"
boto3 = "1.34.0"           # ← NEW: AWS SDK
python-multipart = "0.0.6" # ← NEW: File upload support
```

### What These New Dependencies Do

**`boto3`**: 
- Amazon's official Python SDK
- Lets your Python code talk to AWS services like S3
- Handles authentication, file uploads, downloads

**`python-multipart`**:
- Enables FastAPI to handle file uploads
- Processes `multipart/form-data` (how browsers send files)
- Required for `UploadFile` parameter type

## Part 4: Environment Variables - AWS Credentials

### Todo App Environment (Simple)
```bash
# Todo app .env
DATABASE_USER=todo_user
DATABASE_PASSWORD=todo_password
DATABASE_HOST=localhost
DATABASE_PORT=5432
DATABASE_NAME=modern_todo_db
```

### PDF App Environment (More Complex)
```bash
# PDF app .env
DATABASE_USER=pdf_user
DATABASE_PASSWORD=pdf_password
DATABASE_HOST=localhost
DATABASE_PORT=5432
DATABASE_NAME=pdf_database

# NEW: AWS Credentials
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=secret...
AWS_S3_BUCKET=pdf-basic-app
AWS_REGION=us-east-1
```

### Why AWS Credentials Are Needed

**Security**: AWS needs to know:
- **Who you are** (`AWS_ACCESS_KEY_ID`)
- **That it's really you** (`AWS_SECRET_ACCESS_KEY`)
- **Which bucket to use** (`AWS_S3_BUCKET`)
- **Which region** (`AWS_REGION`)

**⚠️ Important**: Never put these credentials in your code or commit them to GitHub!
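
A missing or misspelled variable is easier to diagnose at startup than in the middle of an upload. The sketch below shows one possible check using only the standard library; the function name `missing_aws_vars` is an assumption for illustration, and the variable names are the ones from the `.env` file above.

```python
import os

# Names taken from the PDF app .env file shown above
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_BUCKET",
    "AWS_REGION",
)


def missing_aws_vars(env=None) -> list:
    """Return the required AWS variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    # Report only the variable names, never the secret values themselves
    print("Missing AWS settings:", ", ".join(missing))
else:
    print("All AWS settings are present")
```

Printing names rather than values keeps secrets out of logs.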

## Part 5: Configuration Changes - S3 Client Setup

### Todo App Configuration (Simple)
```python
# Todo app config.py
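from pydantic_settings import BaseSettings, SettingsConfigDict  # from the pydantic-settings package

class Settings(BaseSettings):
    # Pydantic v2 style; replaces the deprecated inner `class Config`
    model_config = SettingsConfigDict(env_file=".env")

    database_user: str
    database_password: str
    database_host: str
    database_port: int
    database_name: str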
```

### PDF App Configuration (Added Complexity)
```python
# PDF app config.py
import boto3
from pydantic_settings import BaseSettings, SettingsConfigDict  # from the pydantic-settings package

class Settings(BaseSettings):
    # Pydantic v2 style; replaces the deprecated inner `class Config`
    model_config = SettingsConfigDict(env_file=".env")

    # Same database config as Todo app
    database_user: str
    database_password: str
    database_host: str
    database_port: int
    database_name: str

    # NEW: AWS Configuration
    aws_access_key_id: str
    aws_secret_access_key: str
    aws_s3_bucket: str
    aws_region: str = "us-east-1"

    # NEW: S3 Client Factory Method
    @staticmethod
    def get_s3_client():
        settings = Settings()  # reads the values from .env
        return boto3.client(
            's3',
            aws_access_key_id=settings.aws_access_key_id,
            aws_secret_access_key=settings.aws_secret_access_key,
            region_name=settings.aws_region
        )
```

### What the S3 Client Does
- **Authenticates** with AWS using your credentials
- **Provides methods** for uploading, downloading, deleting files
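- **Reports errors** such as permission failures as exceptions (for example `botocore.exceptions.ClientError`) and retries some transient network failures automatically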
- **Manages connections** to AWS servers
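
As a rough illustration of those methods, the sketch below wraps two real boto3 S3 client calls, `put_object` and `delete_object`. The wrapper names `upload_pdf` and `delete_pdf` and the example key are assumptions for illustration. The client is passed in as an argument, so the functions can be exercised with a stand-in object when no AWS credentials are available.

```python
def upload_pdf(s3_client, bucket: str, key: str, data: bytes) -> None:
    """Upload PDF bytes to S3 under the given key (hypothetical helper)."""
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ContentType="application/pdf",  # tells browsers the object is a PDF
    )


def delete_pdf(s3_client, bucket: str, key: str) -> None:
    """Remove a previously uploaded PDF (hypothetical helper)."""
    s3_client.delete_object(Bucket=bucket, Key=key)


# Usage with the factory from config.py (needs real credentials, so left commented):
# s3 = Settings.get_s3_client()
# upload_pdf(s3, "pdf-basic-app", "uploads/report.pdf", b"%PDF-1.7 ...")
```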

## Part 6: AWS S3 Setup Process

### What You Need to Set Up (Todo App Had None of This)

#### 1. Create AWS Account
- Sign up at [aws.amazon.com](https://aws.amazon.com)
- Provide credit card (free tier available)
- Verify identity

#### 2. Create S3 Bucket
```bash
# Bucket name: pdf-basic-app (must be globally unique)
# Region: us-east-1 (or your preferred region)
# Access: Public read (so users can view PDFs)
# Note: new buckets block public access by default, so this must be enabled explicitly
```

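#### 3. Create IAM User

Create a dedicated IAM user for the app and attach a policy that allows only object reads, writes, and deletes in this bucket: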
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::pdf-basic-app/*"
        }
    ]
}
```

#### 4. Get Access Keys
- Create access key for the IAM user
- Save `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
- Add to your `.env` file

**Why This Complexity?**: Unlike a database that runs on your computer, S3 is a cloud service that needs proper authentication and permissions.

## Part 7: Data Flow Comparison

### Todo App Data Flow (Simple)
```
User creates todo → FastAPI → PostgreSQL → Done
User reads todos ← FastAPI ← PostgreSQL ← Request
```

### PDF App Data Flow (Complex)
```
User uploads PDF → FastAPI → {
    1. File → AWS S3 (actual PDF)
    2. Metadata → PostgreSQL (filename, S3 URL)
}

User views PDF ← {
    1. Get S3 URL ← PostgreSQL ← FastAPI
    2. Download file ← AWS S3 ← Browser
}
}
```

### Key Difference: Two-Step Process
1. **Upload**: File goes to S3, URL goes to database
2. **Access**: Get URL from database, fetch file from S3

**Why**: Separating storage optimizes each system for what it does best:
- **Database**: Fast queries, relationships, transactions
- **S3**: Large file storage, high durability, direct delivery over HTTP
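
The two-step process can be sketched as follows. This is an illustration under stated assumptions, not the app's actual code: `sqlite3` stands in for PostgreSQL so the example needs no server, the function names and the `uploads/` key prefix are hypothetical, and the S3 client is passed in so a stand-in can be used without credentials.

```python
import sqlite3
from typing import Optional

# Mirrors the PDF model above; sqlite3 is a stand-in for PostgreSQL
SCHEMA = "CREATE TABLE pdfs (id INTEGER PRIMARY KEY, name TEXT, file TEXT, selected BOOLEAN)"


def save_pdf(s3_client, db, bucket: str, region: str, name: str, data: bytes) -> int:
    """Upload the file to S3, then record its metadata and URL in the database."""
    key = f"uploads/{name}"  # hypothetical key layout

    # Step 1: the file itself goes to S3
    s3_client.put_object(Bucket=bucket, Key=key, Body=data, ContentType="application/pdf")
    url = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

    try:
        # Step 2: only the metadata and the URL go to the database
        cursor = db.execute(
            "INSERT INTO pdfs (name, file, selected) VALUES (?, ?, ?)",
            (name, url, False),
        )
        db.commit()
    except sqlite3.Error:
        # The two stores share no transaction, so undo the upload by hand
        db.rollback()
        s3_client.delete_object(Bucket=bucket, Key=key)
        raise
    return cursor.lastrowid


def get_pdf_url(db, pdf_id: int) -> Optional[str]:
    """Access step: look up the stored URL; the browser then fetches it from S3."""
    row = db.execute("SELECT file FROM pdfs WHERE id = ?", (pdf_id,)).fetchone()
    return row[0] if row else None
```

The cleanup branch exists because the database and S3 cannot be updated atomically: if the metadata insert fails after the upload succeeded, the object would otherwise be left in the bucket with nothing pointing at it.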

## Part 8: Cost and Scaling Considerations

### Todo App Costs (Predictable)
- **Database storage**: ~$0.10/GB/month
- **Compute**: Fixed cost per server
- **Scaling**: Add more database resources

### PDF App Costs (Variable)
- **Database storage**: ~$0.10/GB/month (for metadata only)
- **S3 storage**: ~$0.023/GB/month (much cheaper for files)
- **S3 requests**: $0.0004 per 1000 GET requests
- **Data transfer**: $0.09/GB out of AWS

### Why This Matters
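**For 1000 PDFs (1MB each, about 1 GB in total)**:
- **Database storage**: ~$0.10/month
- **S3 storage**: ~$0.023/month (roughly 4x cheaper per GB)

At this size both bills are tiny. The gap grows with volume (at 1 TB it is roughly $100 vs. $23 per month), and files kept in the database also inflate every backup and replica.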
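
The arithmetic can be checked with a short calculation. This sketch uses the approximate per-GB prices listed above and assumes decimal units (1000 MB = 1 GB); the function name is hypothetical, and request and data transfer charges are left out.

```python
DB_PRICE_PER_GB = 0.10   # USD per GB per month, approximate figure from the list above
S3_PRICE_PER_GB = 0.023  # USD per GB per month, approximate figure from the list above


def monthly_storage_cost(file_count: int, avg_size_mb: float, price_per_gb: float) -> float:
    """Return the monthly storage cost in USD (storage only, hypothetical helper)."""
    total_gb = file_count * avg_size_mb / 1000  # decimal units: 1000 MB = 1 GB
    return total_gb * price_per_gb


for count in (1_000, 1_000_000):
    db_cost = monthly_storage_cost(count, 1, DB_PRICE_PER_GB)
    s3_cost = monthly_storage_cost(count, 1, S3_PRICE_PER_GB)
    print(f"{count:>9,} PDFs: database ${db_cost:,.3f}/month, S3 ${s3_cost:,.3f}/month")
```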

**Scaling Benefits**:
- S3 handles millions of files automatically
- Pairs with a CDN such as Amazon CloudFront for fast global downloads
- Built-in redundancy across multiple availability zones

## 🎯 Key Takeaways

### What Changes When You Add File Storage:

1. **Architecture**: Single storage → Dual storage (database + cloud)
2. **Dependencies**: Simple packages → AWS SDK + file handling
3. **Configuration**: Database only → Database + cloud credentials
4. **Environment**: Local variables → Local + cloud service variables
5. **Data Flow**: Direct storage → Two-step process
6. **Costs**: Predictable → Variable based on usage
7. **Complexity**: Simple → Multiple services to manage

### Why These Changes Are Worth It:

✅ **Better Performance**: Files don't slow down your database  
✅ **Lower Costs**: S3 is much cheaper per GB for file storage  
✅ **Better Scaling**: S3 handles millions of files automatically  
✅ **Global Access**: A CDN such as CloudFront can deliver files fast worldwide  
✅ **Reliability**: S3 is designed for 99.999999999% durability  

### Next Steps:

In **Notebook 18**, we'll see how these storage changes affect your FastAPI endpoints - spoiler alert: uploading files requires completely different endpoint patterns than simple JSON CRUD operations!

---

**Remember**: The core concepts from the Todo app still apply - we're just adding cloud storage to handle files efficiently. The database skills you learned are still essential for managing file metadata!