## DS 4300 - Spring 2025
Project Handout

## Due: Friday April 18 @ 11:59 pm

This project can be done in teams of two to four students. 

## Overview:
Over the last few weeks, we have explored several AWS services, including:
1. S3 for object storage
2. EC2 for compute
3. RDS for relational database access
4. Lambda for serverless or “on demand” computing

In this project, you will create a simple ETL pipeline using the compute and storage services we have covered, in addition to any other services from the AWS Free Tier you would like to explore as a team.  

The pipeline should have the following features:
1. Accept user-uploaded data through a UI or a programmatic mock user.
2. Automatically extract the uploaded data and preprocess it in a way that suits the data's context (for example, handle missing values or pre-process an image).
3. Store the pre-processed data either in S3 (in a different bucket than the upload bucket) or in an RDS database.
4. Storing the pre-processed data should trigger one or more additional processing steps, whose output is then stored in an analysis-ready form.
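
As a rough illustration of the upload-triggered preprocessing step, here is a minimal sketch of a Lambda handler that reacts to an S3 upload, cleans the file, and writes the result to a second bucket. It assumes each upload is a JSON array of flat objects; the bucket name `PROCESSED_BUCKET`, the `clean/` key prefix, and the helper names are hypothetical, and your own preprocessing will depend on your data.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical destination bucket. It must differ from the upload bucket,
# otherwise the write below would trigger this same function again.
PROCESSED_BUCKET = "my-team-processed-bucket"


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def is_missing(value):
    """Treat None and blank strings as missing values."""
    return value is None or (isinstance(value, str) and not value.strip())


def preprocess(rows):
    """Example cleanup: drop rows with missing values, trim string fields."""
    cleaned = []
    for row in rows:
        if any(is_missing(value) for value in row.values()):
            continue
        cleaned.append(
            {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        )
    return cleaned


def lambda_handler(event, context, s3_client=None):
    """Read each uploaded JSON file, clean it, and store the result."""
    if s3_client is None:
        import boto3  # provided by the AWS Lambda Python runtime
        s3_client = boto3.client("s3")

    written = []
    for bucket, key in extract_s3_objects(event):
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        cleaned = preprocess(json.loads(body))
        out_key = f"clean/{key}"
        s3_client.put_object(
            Bucket=PROCESSED_BUCKET,
            Key=out_key,
            Body=json.dumps(cleaned).encode("utf-8"),
        )
        written.append(out_key)
    return {"written": written}
```

Passing the client in as an optional argument is only a convenience for local testing with a stand-in object; inside Lambda the handler is called with `event` and `context` alone.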

Your project should also have a very simple web interface that shows some type of analytics of the data your users have uploaded.  The UI should run on an EC2 instance.  You can implement it using some package like Streamlit or another similar library or tool.  

Your project must use the 4 AWS services we have covered in class. 

## Example Pipeline:
(You cannot do this as a project, FYI.)

1. Implement a web app running on EC2 using Streamlit 
2. The web app, among other features, allows users to upload image files to S3.  
     - The app also allows users to enter additional information about the image that will be stored in a MySQL RDS database instance
3. When the new image file is loaded into S3, a Lambda is triggered that sanitizes the image file (removes geotags, etc.) and generates 3 different resolutions of the image: one for phone, one for tablet, and one for desktop
4. These 3 new images are written back to a different S3 bucket (it must be different than the source bucket), and their URLs are stored in the RDS instance. 
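
To make the "3 different resolutions" step concrete, the sketch below computes output dimensions for each device class while preserving the aspect ratio. The target widths are assumptions (the example does not specify sizes), and the actual resizing would still need an image library inside the Lambda.

```python
# Hypothetical target widths in pixels; the example pipeline names the
# device classes but not their sizes.
TARGET_WIDTHS = {"phone": 480, "tablet": 1024, "desktop": 1920}


def scaled_size(width, height, target_width):
    """Scale (width, height) down to target_width, keeping the aspect ratio.

    Images already narrower than the target keep their original size.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if width <= target_width:
        return width, height
    new_height = max(1, round(height * target_width / width))
    return target_width, new_height


def resolution_plan(width, height):
    """Map each device class to the output size for one source image."""
    return {
        name: scaled_size(width, height, target)
        for name, target in TARGET_WIDTHS.items()
    }
```

For a 4000x3000 source image, `resolution_plan(4000, 3000)` yields 480x360, 1024x768, and 1920x1440.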

## Submission:

You’ll submit the following:
1. A short slide deck in PDF (6-8 slides) detailing the following:
     - Overview of your pipeline and main goals it should have achieved, including information about the source data sets
     - The architecture of your data pipeline in diagram form
     - Briefly explain how you used each AWS service and why you chose that service for its task
     - An overview of what topics/skills you had to research to put your pipeline into full execution
2. A short 4-5 minute video of your team explaining your pipeline and demoing its functionality.  It should be comprehensive enough for the viewer to see the state changes in the various services and data stores as the overall pipeline runs. 
     - Put a link to your demo video prominently on the first slide of your slide deck. 
     - All team members must appear in the demo video with camera on and participate substantively. 

## Creativity:

Use this project as an opportunity to build something creative that you might want to show to future employers.  I’m giving you all a ton of leeway regarding what you build.  So, build something that satisfies the requirements and could also help you get a job 🙂

## Submission:

You’ll submit the PDF report (that includes a link to your demo video) to GradeScope. 


## Grading:
- Creation of a functional AWS pipeline using required services (40%)
- Report (30%)
- Demo Video (30%)

# S3 JSON Uploader

A Python CLI application that randomly selects and uploads JSON files from a specified folder to an AWS S3 bucket at regular intervals.

## Requirements

- Python 3.11
- AWS account with S3 access
- Required Python packages (see requirements.txt)

## AWS Setup Instructions

### 1. Create S3 Bucket

1. Log into AWS Management Console
2. Navigate to S3 service
3. Make sure you're in the correct AWS Region
4. Click "Create bucket"
5. Configure bucket settings:
   - Choose `General purpose` bucket type
   - Choose a globally unique bucket name (this will be your `S3_BUCKET_NAME` in .env)
   - Leave most settings as default
   - Click "Create bucket"

### 2. Create IAM User and Policy

1. Go to IAM service in AWS Console
2. Click "Users" → "Create user"
3. Give your user a name (e.g., "s3-uploader")
4. Do NOT check the box next to "Provide user access to the AWS Management Console"
5. Click "Next: Permissions"
6. Click "Attach policies directly"
7. Create a new policy (Button in Policy section)
8. On the next page, choose JSON in the Policy Editor
9. Copy and paste the following
   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
         "Resource": [
           "arn:aws:s3:::ds4300bucket01",
           "arn:aws:s3:::ds4300bucket01/*"
         ]
       }
     ]
   }
   ```
   (Replace `ds4300bucket01` with your actual bucket name)
10. Give the policy a name (e.g., "S3UploadAccess")
11. Attach this policy to your user
12. Complete the user creation
13. **IMPORTANT**: Save the Access Key ID and Secret Access Key - these are your credentials for the .env file
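
If you prefer to generate the policy JSON rather than edit the bucket name by hand, a small helper along these lines can do it. This is only a sketch: the function name `build_s3_policy` is hypothetical, and the output mirrors the policy document shown in step 9.

```python
import json


def build_s3_policy(bucket_name):
    """Return the policy document from step 9 for the given bucket name."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    # The bucket itself (needed for s3:ListBucket).
                    f"arn:aws:s3:::{bucket_name}",
                    # Every object in the bucket (needed for Put/GetObject).
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


def policy_json(bucket_name):
    """Render the policy as indented JSON, ready to paste into the editor."""
    return json.dumps(build_s3_policy(bucket_name), indent=2)
```

Calling `print(policy_json("your-bucket-name"))` prints text that can be pasted into the JSON tab of the Policy Editor.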

## Project Setup

1. Clone this repository
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Copy `.env.example` to `.env`:
   ```bash
   cp .env.example .env
   ```
4. Edit the `.env` file with your AWS credentials:
   ```
   AWS_ACCESS_KEY_ID=your_access_key_here
   AWS_SECRET_ACCESS_KEY=your_secret_key_here
   AWS_REGION=your_aws_region
   S3_BUCKET_NAME=your_bucket_name
   ```
5. Update the configuration variables in `src/s3_uploader.py`:
   - `DATA_FOLDER`: Path to your JSON files
   - `UPLOAD_INTERVAL`: Time between uploads in seconds
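
Before running the uploader, it can help to confirm that all four settings are present. The sketch below is a standard-library-only check and is not taken from `src/s3_uploader.py`, which may load the file differently (for example through a package listed in requirements.txt); the function names are hypothetical.

```python
import os

REQUIRED_KEYS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "S3_BUCKET_NAME",
)


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_config(path=".env"):
    """Combine the .env file with the process environment and validate it."""
    file_values = parse_env_file(path) if os.path.exists(path) else {}
    # Variables already set in the environment take precedence over the file.
    config = {
        key: os.environ.get(key, file_values.get(key, ""))
        for key in REQUIRED_KEYS
    }
    missing = [key for key, value in config.items() if not value]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return config
```

Failing fast with the list of missing names is usually easier to debug than a credentials error raised later by the AWS client.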

## Usage

Run the script:

```bash
python src/s3_uploader.py
```

The script will:

1. Load AWS credentials from the .env file
2. Connect to your S3 bucket
3. Randomly select a JSON file from the specified folder
4. Upload it to the S3 bucket
5. Wait for the specified interval
6. Repeat the process
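
The select, upload, wait loop described above could look roughly like the following. This is a sketch rather than the contents of `src/s3_uploader.py`: the function names and the `max_uploads` and `sleep` parameters are hypothetical, and `s3_client` is assumed to be a boto3 S3 client (or any object with a compatible `upload_file` method).

```python
import random
import time
from pathlib import Path


def pick_random_json(folder):
    """Return a randomly chosen .json file from the given folder."""
    files = sorted(Path(folder).glob("*.json"))
    if not files:
        raise FileNotFoundError(f"No JSON files found in {folder}")
    return random.choice(files)


def run_uploader(s3_client, bucket, folder, interval,
                 max_uploads=None, sleep=time.sleep):
    """Upload one random file per interval.

    Runs forever when max_uploads is None; otherwise stops after that many
    uploads and returns the count.
    """
    count = 0
    while max_uploads is None or count < max_uploads:
        path = pick_random_json(folder)
        # upload_file(Filename, Bucket, Key): the object key is the file name.
        s3_client.upload_file(str(path), bucket, path.name)
        count += 1
        sleep(interval)
    return count
```

With real credentials in place, something like `run_uploader(boto3.client("s3"), bucket_name, DATA_FOLDER, UPLOAD_INTERVAL)` would start the loop; the optional `max_uploads` argument exists only so the loop can be exercised in a test without running indefinitely.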

## Project Structure

```
.
├── data-news-articles/     # Folder containing JSON files to upload
├── src/
│   └── s3_uploader.py     # Main script
├── .env                   # AWS credentials (not in version control)
├── .env.example          # Template for .env file
├── requirements.txt      # Python dependencies
└── README.md            # This file
```

## Security Notes

- Never commit your `.env` file to version control
- Keep your AWS credentials secure
- Grant the IAM user only the S3 permissions it needs (least privilege), scoped to your bucket


In [2]:
!streamlit run src/ec2_streamlit.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://10.110.135.169:8501

  For better performance, install the Watchdog module:

  $ xcode-select --install
  $ pip install watchdog

       Team_ID     Game_ID     GAME_DATE      MATCHUP  ... BLK  TOV  PF  PTS
70  1610612738  0022400001  NOV 12, 2024  BOS vs. ATL  ...   5   20  12  116
67  1610612738  0022400021  NOV 19, 2024  BOS vs. CLE  ...   4   13  19  120
66  1610612738  0022400028  NOV 22, 2024    BOS @ WAS  ...   7   12  19  108
63  1610612738  0022400047  NOV 29, 2024    BOS @ CHI  ...   2    9  13  138
81  1610612738  0022400061  OCT 22, 2024  BOS vs. NYK  ...   3    3  15  132
..         ...         ...           ...          ...  ...  ..  ...  ..  ...
2   1610612738  0022401156  APR 09, 2025    BOS @ ORL  ...   5   15  11   76
1   1610612738  0022401174  APR 11, 2025  BOS vs. CH