# Data Engineering
Moving, Storing and Processing data in AWS

### To start off, we will look at the storage services available on AWS.
- Amazon S3 (Simple Storage Service): Object storage for unstructured data and data lakes.
- Amazon DynamoDB: NoSQL key-value and document database for high-performance applications.
- Amazon Redshift: Fully managed data warehouse designed for large-scale analytics.
- Amazon EFS (Elastic File System): Managed, elastic NFS file system that can be mounted by multiple EC2 instances at once.

## S3

S3 is object storage for unstructured data and is the usual storage layer for a data lake.

Objects are stored in containers called buckets.
- Example: `Image_Data/Animals/Dog/Pitbull`
    - Bucket names are globally unique: no two buckets can share a name, even across different AWS accounts and Regions.
    - The full path of an object inside its bucket is referred to as the key; the example above is a key.
    - `Image_Data`, `Animals`, and `Dog` are not separate buckets. S3 has no real directories, so they are prefixes within the key, which the console displays as folders.
    - Object storage => supports any file format

**Backbone for many AWS ML services (example: SageMaker)**
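
To make the bucket/key split concrete, here is a minimal sketch that separates an S3 URI into its bucket and key and lists the folder-like prefixes inside the key. The bucket name `example-ml-bucket` and both helper functions are hypothetical, introduced only for illustration.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the key.
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name")
    return bucket, key


def key_prefixes(key: str) -> list[str]:
    """Return the folder-like prefixes contained in a key."""
    parts = key.split("/")[:-1]  # drop the final object name
    return ["/".join(parts[: i + 1]) + "/" for i in range(len(parts))]


bucket, key = split_s3_uri("s3://example-ml-bucket/Image_Data/Animals/Dog/Pitbull")
print(bucket)             # example-ml-bucket
print(key)                # Image_Data/Animals/Dog/Pitbull
print(key_prefixes(key))  # ['Image_Data/', 'Image_Data/Animals/', 'Image_Data/Animals/Dog/']
```

Only the first component is a bucket; everything after it is a single key string that happens to contain slashes.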

Limitations: 
- Max object size is 5TB


### Storage Classes 

- Amazon S3 Standard - General Purpose
- Amazon S3 Standard-Infrequent Access (IA)
- Amazon S3 One Zone-Infrequent Access
- Amazon S3 Glacier Instant Retrieval
- Amazon S3 Glacier Flexible Retrieval
- Amazon S3 Glacier Deep Archive
- Amazon S3 Intelligent-Tiering

Objects can be moved between classes manually or automatically with S3 Lifecycle rules.

| **Storage Class**                          | **Use Cases**                                | **Retrieval Times**             | **Cost**                             |
|--------------------------------------------|---------------------------------------------|---------------------------------|--------------------------------------|
| **Amazon S3 Standard - General Purpose**   | Frequently accessed data.                   | Immediate                      | Highest cost per GB.                |
| **Amazon S3 Standard-Infrequent Access (IA)** | Long-lived, infrequently accessed data that still needs immediate access when requested, such as **backups**. | Immediate                      | Lower storage cost, retrieval fees. |
| **Amazon S3 One Zone-Infrequent Access**   | Secondary backups, easily reproducible data. | Immediate                      | Cheaper than Standard-IA.           |
| **Amazon S3 Glacier Instant Retrieval**    | Archival data requiring millisecond access.  | Milliseconds                   | Low storage cost, higher retrieval fees. |
| **Amazon S3 Glacier Flexible Retrieval**   | Long-term archival with occasional access.   | 1–5 minutes (expedited), 3–5 hours (standard), 5–12 hours (bulk). | Cheaper than Instant Retrieval.      |
| **Amazon S3 Glacier Deep Archive**         | Regulatory or long-term archival storage.    | 12–48 hours                    | Lowest storage cost.                 |
| **Amazon S3 Intelligent-Tiering**          | Data with unpredictable or changing access patterns. S3 moves objects between access tiers automatically based on usage. | Varies by access tier               | Monitoring fee, cost varies by access tier. |
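
The table can be read as a set of rules of thumb. The sketch below encodes them in a hypothetical helper, `choose_storage_class`; the thresholds are simplifications taken from the retrieval times above, not an official AWS decision procedure. The returned strings are the storage class identifiers used by the S3 API.

```python
def choose_storage_class(access_pattern: str,
                         max_wait_hours: float = 0.0,
                         reproducible: bool = False) -> str:
    """Pick a storage class from rough requirements.

    access_pattern: 'frequent', 'infrequent', 'archive', or 'unknown'.
    max_wait_hours: how long a reader can wait for retrieval (archive only).
    reproducible:   True if the data can easily be recreated if lost.
    """
    if access_pattern == "unknown":
        return "INTELLIGENT_TIERING"
    if access_pattern == "frequent":
        return "STANDARD"
    if access_pattern == "infrequent":
        # One Zone-IA is cheaper but keeps data in a single Availability Zone.
        return "ONEZONE_IA" if reproducible else "STANDARD_IA"
    if access_pattern == "archive":
        if max_wait_hours >= 48:
            return "DEEP_ARCHIVE"
        if max_wait_hours >= 5:
            return "GLACIER"  # Glacier Flexible Retrieval, standard tier: 3-5 hours
        return "GLACIER_IR"   # Glacier Instant Retrieval
    raise ValueError(f"unknown access pattern: {access_pattern!r}")


print(choose_storage_class("infrequent", reproducible=True))  # ONEZONE_IA
print(choose_storage_class("archive", max_wait_hours=6))      # GLACIER
```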

---

### Lifecycles of S3

Lifecycle rules automate the handling of data that is accessed less often as it ages:

<div style="display: flex; align-items: center;">
    <!-- Text on the left -->
    <div style="flex: 1; padding-right: 20px;">
        <p>We can set up actions to move tiers and delete objects:</p>
        <ul>
            <li>Move objects to Standard IA class 60 days after creation</li>
            <li>Move to Glacier for archiving after 6 months</li>
        </ul>
    </div>
    <div style="flex: 1; text-align: center;">
        <img src="../Figures/S3_Lifecycles.png" alt="S3 Lifecycles" style="width: 400px;">
    </div>
</div>
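
As a rough illustration, the two actions above could be written as a lifecycle configuration like the one below. This is a sketch: the rule ID, the `logs/` prefix, the bucket name, and the `build_lifecycle_rule` helper are all hypothetical, and "6 months" is approximated as 180 days. The dict follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts.

```python
import json


def build_lifecycle_rule(rule_id, prefix, transitions=(), expire_after_days=None):
    """Build one lifecycle rule as a plain dict.

    transitions: iterable of (days_after_creation, storage_class) pairs.
    """
    rule = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
    }
    if transitions:
        rule["Transitions"] = [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in sorted(transitions)
        ]
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


lifecycle_configuration = {
    "Rules": [
        build_lifecycle_rule(
            rule_id="archive-infrequent-data",  # hypothetical rule name
            prefix="logs/",                     # hypothetical key prefix
            transitions=[(60, "STANDARD_IA"), (180, "GLACIER")],
        )
    ]
}
print(json.dumps(lifecycle_configuration, indent=2))

# With boto3 installed and credentials configured, it could be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle_configuration)
```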


---


**Example Problem:**
- Your application on EC2 creates image thumbnails after profile photos are uploaded to Amazon S3. These thumbnails can be easily recreated and only need to be kept for 60 days. The source images must be immediately retrievable for these 60 days; afterwards, the user can wait up to 6 hours. How would you design this?

    - S3 source images can be on Standard, with a lifecycle configuration to transition them to Glacier Flexible Retrieval after 60 days (standard retrievals take 3–5 hours, within the 6-hour limit)
    - S3 thumbnails can be on One Zone-IA, with a lifecycle configuration to expire them (delete them) after 60 days
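
A sketch of this design as two lifecycle rules, plus a small helper that checks which state an object would be in at a given age. The `source/` and `thumbnails/` prefixes, the rule IDs, and the `state_after` helper are assumptions made for illustration; real lifecycle actions run asynchronously, so actual timing is approximate.

```python
SOURCE_RULE = {
    "ID": "archive-source-images",         # hypothetical rule name
    "Filter": {"Prefix": "source/"},       # hypothetical prefix
    "Status": "Enabled",
    # "GLACIER" is the API name for Glacier Flexible Retrieval.
    "Transitions": [{"Days": 60, "StorageClass": "GLACIER"}],
}

THUMBNAIL_RULE = {
    "ID": "expire-thumbnails",             # hypothetical rule name
    "Filter": {"Prefix": "thumbnails/"},   # hypothetical prefix
    "Status": "Enabled",
    "Expiration": {"Days": 60},
}


def state_after(rule, initial_class, age_days):
    """Return the storage class at age_days, or None once the object has expired."""
    expiration = rule.get("Expiration")
    if expiration and age_days >= expiration["Days"]:
        return None
    current = initial_class
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


print(state_after(SOURCE_RULE, "STANDARD", 30))       # STANDARD
print(state_after(SOURCE_RULE, "STANDARD", 90))       # GLACIER
print(state_after(THUMBNAIL_RULE, "ONEZONE_IA", 30))  # ONEZONE_IA
print(state_after(THUMBNAIL_RULE, "ONEZONE_IA", 61))  # None
```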

---

<div style="display: flex; align-items: center;">
    <!-- Text on the left -->
    <div style="flex: 1; padding-right: 20px;">
        <h3>Amazon S3 Analytics – Storage Class Analysis</h3>
        <ul>
            <li>Helps you decide when to transition objects to the right storage class.</li>
            <li>It only makes recommendations for Standard and Standard-IA. <strong>It does NOT work for One Zone-IA or Glacier.</strong></li>
            <li>Report is updated daily, and it takes 24 to 48 hours to start seeing data analysis.</li>
            <li><strong>Main purpose is to give insights on S3 lifecycles.</strong></li>
        </ul>
    </div>
    <div style="flex: 1; text-align: center;">
        <img src="../Figures/S3_Analytics.png" alt="Placeholder for S3 Analytics Image" style="width: 200px; border: 1px dashed #ccc;">
    </div>
</div>


---

### Security

S3 security can be managed using **IAM roles** and **bucket policies**, which can be used either **together** or **independently**.

- For public access (over the internet), we use bucket policies.
    - The **Block Public Access** setting prevents a bucket from being public, overriding any bucket policy, to mitigate data leaks.

- For example:
    - You attach an IAM role to an EC2 instance to allow it to access S3.
    - The S3 bucket policy ensures only that specific IAM role can perform actions on a specific bucket.

| IAM Role | Bucket Policy |
|---|---|
| Managed at the IAM level, tied to principals. | Managed at the S3 bucket level. |
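
For the EC2 example above, a bucket policy that grants access to one IAM role might look like the sketch below. The account ID, role name, bucket name, and the `role_bucket_policy` helper are all hypothetical placeholders.

```python
import json


def role_bucket_policy(bucket_name, role_arn, actions):
    """Return a bucket policy document (as a dict) granting `actions` to one IAM role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSpecificRole",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": list(actions),
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # bucket-level actions (e.g. ListBucket)
                    f"arn:aws:s3:::{bucket_name}/*",  # object-level actions (e.g. GetObject)
                ],
            }
        ],
    }


policy = role_bucket_policy(
    bucket_name="example-ml-bucket",                              # hypothetical
    role_arn="arn:aws:iam::123456789012:role/example-ec2-role",   # hypothetical
    actions=["s3:GetObject", "s3:ListBucket"],
)
print(json.dumps(policy, indent=2))
```

An `Allow` statement like this only grants access to the role. Locking out every other principal would additionally need an explicit `Deny` statement, which is omitted here to keep the sketch short.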

---

#### Encryption of Data

#### **4 Methods for Encrypting Objects in S3 Buckets**

**Server-Side Encryption (SSE)**  
   Encryption is handled by Amazon S3 on the server side.

   1. **SSE with Amazon S3-Managed Keys (SSE-S3)** - **Default Option**  
     - Encrypts S3 objects using keys managed and owned by AWS. 
     - Encryption is done server-side.

   2. **SSE with KMS Keys (SSE-KMS)**  
     - Leverages **AWS Key Management Service (KMS)** for key management.  
     - Ideal for tighter control over key usage and permissions.

   3. **SSE with Customer-Provided Keys (SSE-C)**  
     - Allows you to manage your own encryption keys.  
     - Keys must be provided with every request to Amazon S3.

**Client-Side Encryption**  
   Encryption is handled by the client, outside of AWS.

   4. **Client-Side Encryption**  
     - Encryption occurs **before uploading objects to S3**.  
     - Clients are responsible for managing encryption keys and operations.

#### **Key Exam Tip**  
Understand which encryption method applies to different scenarios for the exam, focusing on **SSE-KMS** for fine-grained control and **SSE-S3** for default managed encryption.
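
One way to keep the four methods apart is to look at what changes in the upload request. The sketch below returns the extra keyword arguments one would pass to boto3's `put_object` for each method; the `sse_upload_args` helper and the method labels are assumptions for illustration, and no AWS call is made here.

```python
def sse_upload_args(method, kms_key_id=None, customer_key=None):
    """Return extra put_object keyword arguments for an encryption method."""
    if method == "SSE-S3":
        return {"ServerSideEncryption": "AES256"}
    if method == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:
            # Without a key ID, S3 falls back to the AWS managed key for S3.
            args["SSEKMSKeyId"] = kms_key_id
        return args
    if method == "SSE-C":
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 256-bit (32-byte) key")
        # The same key must be supplied again on every read.
        return {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": customer_key}
    if method == "CLIENT-SIDE":
        # Data is already encrypted before upload; S3 needs no extra parameters.
        return {}
    raise ValueError(f"unknown method: {method!r}")


print(sse_upload_args("SSE-S3"))
print(sse_upload_args("SSE-KMS", kms_key_id="1234abcd-12ab-34cd-56ef-1234567890ab"))  # hypothetical key ID

# Usage with boto3 (requires credentials):
# s3.put_object(Bucket="example-bucket", Key="data.csv", Body=data,
#               **sse_upload_args("SSE-S3"))
```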

---

#### SSE with Amazon S3-Managed Keys (SSE-S3)  
Data is encrypted within AWS and stored in an S3 bucket, with encryption keys fully managed by AWS (not accessible to the owner).

<div style="flex: 1; text-align: center;">
        <img src="../Figures/S3_Encryption_1.png" alt="SSE-S3 encryption" style="width: 400px;">
</div>

#### SSE with KMS Keys (SSE-KMS)  
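- Data is encrypted using keys held in AWS KMS (either the AWS managed key for S3 or a customer managed key), allowing S3 owners to control access to the key and track key usage with CloudTrail. 
- Users need permission both to access the object (through IAM or the bucket policy) and to use the KMS key (through the KMS key policy) in order to read and decrypt data.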


<div style="flex: 1; text-align: center;">
        <img src="../Figures/S3_Encryption_2.png" alt="SSE-KMS encryption" style="width: 400px;">
</div>

**Limitations:**
- Increased latency in workflows, since S3 calls KMS on uploads and downloads.
- API Limits: KMS API calls (e.g., Encrypt, Decrypt, GenerateDataKey) are subject to request rate limits, which could cause throttling if exceeded. 
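
A back-of-the-envelope check for the API limit: the sketch below assumes one `GenerateDataKey` call per upload and one `Decrypt` call per download, and compares the total against a quota. The quota value used here is a placeholder, not a documented figure for any particular Region; check Service Quotas for the real limit in your account.

```python
def kms_requests_per_second(uploads_per_second, downloads_per_second):
    """Estimate KMS calls per second.

    Assumes one GenerateDataKey per upload and one Decrypt per download.
    """
    return uploads_per_second + downloads_per_second


def is_throttling_risk(uploads_per_second, downloads_per_second, quota_per_second):
    """True if the estimated KMS request rate exceeds the given quota."""
    return kms_requests_per_second(uploads_per_second, downloads_per_second) > quota_per_second


ASSUMED_QUOTA = 5500  # placeholder value; look up the real quota for your Region

print(is_throttling_risk(2000, 3000, ASSUMED_QUOTA))  # False (5000 <= 5500)
print(is_throttling_risk(2000, 4000, ASSUMED_QUOTA))  # True  (6000 > 5500)
```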

#### SSE with Customer-Provided Keys (SSE-C)
The encryption key is generated and managed on the client side. The client sends the key along with the data (HTTPS is required), S3 uses it to encrypt the object, then discards the key and stores only the encrypted data. To read the object, the client must provide the same key again so S3 can decrypt it. **Main point: the key stays on the client side and is not stored in AWS.**

<div style="flex: 1; text-align: center;">
        <img src="../Figures/S3_Encryption_3.png" alt="SSE-C encryption" style="width: 400px;">
</div>
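
To show what "providing the key with every request" means at the HTTP level, here is a minimal sketch that generates a 256-bit key locally and builds the three SSE-C request headers. The `sse_c_headers` helper is hypothetical; SDKs such as boto3 normally build these headers for you.

```python
import base64
import hashlib
import os


def sse_c_headers(key: bytes) -> dict:
    """Build the SSE-C request headers for a 256-bit customer-provided key."""
    if len(key) != 32:
        raise ValueError("SSE-C requires a 256-bit (32-byte) key")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        # The key itself, base64-encoded.
        "x-amz-server-side-encryption-customer-key":
            base64.b64encode(key).decode("ascii"),
        # Base64-encoded MD5 digest of the key, used by S3 as an integrity check.
        "x-amz-server-side-encryption-customer-key-MD5":
            base64.b64encode(hashlib.md5(key).digest()).decode("ascii"),
    }


key = os.urandom(32)  # generated and kept on the client side
headers = sse_c_headers(key)
for name in headers:
    print(name)
```

The same headers have to accompany both the upload and every later download, which is why losing the key means losing access to the object.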

#### Client-Side Encryption  
Data is encrypted before being uploaded to S3, with the owner managing the entire encryption lifecycle. Both encryption and decryption occur outside of AWS.

<div style="flex: 1; text-align: center;">
        <img src="../Figures/S3_Encryption_4.png" alt="Client-side encryption" style="width: 400px;">
</div>