# <h1 align="center">Data Engineering</h1>
<h3 align="center">Moving, Storing and Processing data in AWS</h3>

### To start off, we will talk about the storage services available on AWS.

1. **Amazon S3 (Simple Storage Service):** Object storage for unstructured data and data lakes.
2. **Amazon Redshift:** Fully managed data warehouse designed for large-scale analytics.
3. **Amazon RDS/Aurora:** Managed relational databases for transactional workloads.
4. **Amazon DynamoDB:** NoSQL key-value and document database for high-performance applications.

## S3 (Simple Storage Service - DataLake)

S3 is object storage for unstructured data and is the usual foundation of a data lake.

Objects are stored in containers called buckets.
- Example: `s3://image-data/Animals/Dog/Pitbull.jpg`
    - `image-data` is the bucket. Bucket names are globally unique across all AWS accounts, so nobody else can create a bucket with that name.
    - `Animals/Dog/Pitbull.jpg` is the key, the full path of the object inside the bucket.
    - `Animals/` and `Animals/Dog/` are key prefixes, not separate buckets. S3 has no real folders; the console only displays prefixes as folders.
    - Object storage => supports any file format

**Backbone for many AWS ML services (example: SageMaker)**
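
To make the bucket/key split concrete, here is a minimal sketch that separates an S3 URI into its bucket name and object key. The helper name `split_s3_uri` and the example URI are illustrative assumptions, not part of any AWS SDK.

```python
from typing import Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into (bucket, key)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the key.
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name")
    return bucket, key


bucket, key = split_s3_uri("s3://image-data/Animals/Dog/Pitbull.jpg")
print(bucket)  # image-data
print(key)     # Animals/Dog/Pitbull.jpg
```

Only the first path segment names a bucket; every later segment is part of the key.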

Limitations: 
- Max object size is 5TB


### Storage Classes 

- Amazon S3 Standard - General Purpose
- Amazon S3 Standard-Infrequent Access (IA)
- Amazon S3 One Zone-Infrequent Access
- Amazon S3 Glacier Instant Retrieval
- Amazon S3 Glacier Flexible Retrieval
- Amazon S3 Glacier Deep Archive
- Amazon S3 Intelligent-Tiering

Objects can be moved between classes manually or with S3 Lifecycle rules.

| **Storage Class**                          | **Use Cases**                                | **Retrieval Times**             | **Cost**                             |
|--------------------------------------------|---------------------------------------------|---------------------------------|--------------------------------------|
| **Amazon S3 Standard - General Purpose**   | Frequently accessed data.                   | Immediate                      | Highest cost per GB.                |
| **Amazon S3 Standard-Infrequent Access (IA)** | Long-lived, infrequently accessed data that still needs immediate access when requested, such as **backups**. | Immediate                      | Lower storage cost, retrieval fees. |
| **Amazon S3 One Zone-Infrequent Access**   | Secondary backups, easily reproducible data. | Immediate                      | Cheaper than Standard-IA.           |
| **Amazon S3 Glacier Instant Retrieval**    | Archival data requiring millisecond access.  | Milliseconds                   | Low storage cost, higher retrieval fees. |
| **Amazon S3 Glacier Flexible Retrieval**   | Long-term archival with occasional access.   | 1–5 minutes (expedited), 3–5 hours (standard), 5–12 hours (bulk) | Cheaper than Instant Retrieval.      |
| **Amazon S3 Glacier Deep Archive**         | Regulatory or long-term archival storage.    | 12–48 hours                    | Lowest storage cost.                 |
| **Amazon S3 Intelligent-Tiering**          | Data with unpredictable or dynamic access patterns. Objects are moved between access tiers automatically. | Varies by access tier               | Monitoring fee, cost varies by access tier. |

---

### Lifecycles of S3

Lifecycle rules automate moving objects to cheaper storage classes as they are accessed less often:

<div style="display: flex; align-items: center;">
    <!-- Text on the left -->
    <div style="flex: 1; padding-right: 20px;">
        <p>We can set up actions to move tiers and delete objects:</p>
        <ul>
            <li>Move objects to Standard IA class 60 days after creation</li>
            <li>Move to Glacier for archiving after 6 months</li>
        </ul>
    </div>
    <div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/S3_Lifecycles.png" alt="S3 Lifecycles" style="width: 400px;">
    </div>
</div>


---


**Example Problem:**
- Your application on EC2 creates image thumbnails after profile photos are uploaded to Amazon S3. These thumbnails can be easily recreated and only need to be kept for 60 days. The source images must be immediately retrievable for these 60 days; afterwards, the user can wait up to 6 hours. How would you design this?

    - S3 source images can be on Standard, with a lifecycle configuration to transition them to Glacier Flexible Retrieval after 60 days (standard retrievals take 3–5 hours, within the 6-hour limit)
    - S3 thumbnails can be on One Zone-IA, with a lifecycle configuration to expire them (delete them) after 60 days
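
The design above can be sketched as a lifecycle configuration. This is a minimal example that only builds the configuration document; the prefixes `source/` and `thumbnails/` and the rule IDs are assumptions made for illustration. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`, where `GLACIER` is the API name for Glacier Flexible Retrieval.

```python
import json


def build_lifecycle_configuration() -> dict:
    """Lifecycle rules for the thumbnail example (prefixes are hypothetical)."""
    return {
        "Rules": [
            {
                # Source images: archive after 60 days.
                "ID": "archive-source-images",
                "Filter": {"Prefix": "source/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 60, "StorageClass": "GLACIER"}],
            },
            {
                # Thumbnails: delete after 60 days.
                "ID": "expire-thumbnails",
                "Filter": {"Prefix": "thumbnails/"},
                "Status": "Enabled",
                "Expiration": {"Days": 60},
            },
        ]
    }


config = build_lifecycle_configuration()
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, the dictionary could be applied with `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`. The thumbnails would need to be uploaded with the One Zone-IA storage class in the first place, since lifecycle rules only handle the later expiry.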

---

<div style="display: flex; align-items: center;">
    <!-- Text on the left -->
    <div style="flex: 1; padding-right: 20px;">
        <h3>Amazon S3 Analytics – Storage Class Analysis</h3>
        <ul>
            <li>Helps you decide when to transition objects to the right storage class.</li>
            <li>Recommendations cover only Standard and Standard-IA. <strong>It does NOT work for One Zone-IA or Glacier.</strong></li>
            <li>Report is updated daily, and it takes 24 to 48 hours to start seeing data analysis.</li>
            <li><strong>Main purpose is to give insights on S3 lifecycles.</strong></li>
        </ul>
    </div>
    <div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/S3_Analytics.png" alt="S3 Analytics Storage Class Analysis" style="width: 200px; border: 1px dashed #ccc;">
    </div>
</div>


---

### Security

S3 security can be managed using **IAM roles** and **bucket policies**, which can be used either **together** or **independently**.

- For public access (the open internet), we use bucket policies.
    - The **Block Public Access** setting prevents a bucket from being public, overriding any bucket policy, to mitigate data leaks.

- For example:
    - You attach an IAM role to an EC2 instance to allow it to access S3.
    - The S3 bucket policy ensures only that specific IAM role can perform actions on a specific bucket.

| IAM Role | Bucket Policy |
|---|---|
| Managed at the IAM level, tied to principals. | Managed at the S3 bucket level. |

---

#### Encryption of Data

#### **4 Methods for Encrypting Objects in S3 Buckets**

**Server-Side Encryption (SSE)**  
   Encryption is handled by Amazon S3 on the server side.

   1. **SSE with Amazon S3-Managed Keys (SSE-S3)** - **Default Option**  
     - Encrypts S3 objects using keys managed and owned by AWS. 
     - Encryption is done on the server side. 

   2. **SSE with KMS Keys (SSE-KMS)**  
     - Leverages **AWS Key Management Service (KMS)** for key management.  
     - Ideal for tighter control over key usage and permissions.

   3. **SSE with Customer-Provided Keys (SSE-C)**  
     - Allows you to manage your own encryption keys.  
     - Keys must be provided with every request to Amazon S3.

**Client-Side Encryption**  
   Encryption is handled by the client, outside of AWS.

   4. **Client-Side Encryption**  
     - Encryption occurs **before uploading objects to S3**.  
     - Clients are responsible for managing encryption keys and operations.

#### **Key Exam Tip**  
Understand which encryption method applies to different scenarios for the exam, focusing on **SSE-KMS** for fine-grained control and **SSE-S3** for default managed encryption.
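
As a rough illustration of how the four methods differ at upload time, the sketch below returns the extra request arguments each one would need. The function name `sse_upload_args` and the method labels are assumptions for this example; the argument names (`ServerSideEncryption`, `SSEKMSKeyId`, `SSECustomerAlgorithm`, `SSECustomerKey`) follow boto3's `put_object` parameters.

```python
from typing import Optional


def sse_upload_args(method: str,
                    kms_key_id: Optional[str] = None,
                    customer_key: Optional[bytes] = None) -> dict:
    """Return extra upload arguments for one of the four encryption methods."""
    if method == "SSE-S3":
        # Keys are owned and managed by S3.
        return {"ServerSideEncryption": "AES256"}
    if method == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id is not None:
            # Without a key ID, the AWS managed key for S3 is used.
            args["SSEKMSKeyId"] = kms_key_id
        return args
    if method == "SSE-C":
        # The caller supplies a 256-bit key with every request.
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 32-byte key")
        return {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": customer_key}
    if method == "CLIENT":
        # Data is already encrypted before upload; S3 needs no extra arguments.
        return {}
    raise ValueError(f"unknown method: {method!r}")


print(sse_upload_args("SSE-S3"))
print(sse_upload_args("SSE-KMS", kms_key_id="alias/my-hypothetical-key"))
```

The key alias shown is a placeholder. For SSE-C, the same key must be supplied again on every read.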

---

#### 1. SSE with Amazon S3-Managed Keys (SSE-S3)  
Data is encrypted within AWS and stored in an S3 bucket, with encryption keys fully managed by AWS (not accessible to the owner).

<div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/S3_Encryption_1.png" alt="SSE-S3 encryption" style="width: 400px;">
</div>

#### 2. SSE with KMS Keys (SSE-KMS)  
- Data is encrypted with keys held in AWS KMS (an AWS managed key or a customer managed key), allowing S3 owners to control access and track key usage with CloudTrail. 
- Users need permission to both the S3 object (IAM or bucket policy) and the KMS key (key policy) to access and decrypt data.


<div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/S3_Encryption_2.png" alt="SSE-KMS encryption" style="width: 400px;">
</div>

**Limitations:**
- Increased latency, because each request also involves a call to KMS. 
- API Limits: KMS API calls (e.g., Encrypt, Decrypt, GenerateDataKey) are subject to request rate limits, which could cause throttling if exceeded. 

#### 3. SSE with Customer-Provided Keys (SSE-C)
The encryption key is generated and managed on the client side. The client sends the key along with the data; S3 uses it to encrypt the object and then discards the key. To read the object, the client must provide the same key again so S3 can decrypt it. **Main point: the key stays with the client and is never stored in AWS.**

<div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/S3_Encryption_3.png" alt="SSE-C encryption" style="width: 400px;">
</div>

#### 4. Client-Side Encryption  
Data is encrypted before being uploaded to S3, with the owner managing the entire encryption lifecycle. Both encryption and decryption occur outside of AWS.

<div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/S3_Encryption_4.png" alt="Client-side encryption" style="width: 400px;">
</div>

#### Encryption in Transit  
S3 exposes both HTTP and HTTPS (SSL/TLS) endpoints. HTTPS is recommended for all communications and is **mandatory for SSE-C** (method 3). To enforce HTTPS for the other encryption methods, configure a bucket policy that denies any request where `aws:SecureTransport` is false, so only HTTPS connections are allowed.
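
A bucket policy that blocks plain HTTP could look like the sketch below. It only builds the policy document; the bucket name is a placeholder and the function name `https_only_policy` is an assumption for this example. The `aws:SecureTransport` condition key is the standard way to test whether a request arrived over TLS.

```python
import json


def https_only_policy(bucket_name: str) -> dict:
    """Bucket policy that denies every request not sent over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy = https_only_policy("my-example-bucket")  # placeholder bucket name
print(json.dumps(policy, indent=2))
```

Because an explicit `Deny` overrides any `Allow`, this statement can sit alongside whatever access rules the bucket already has.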

---
---

## Amazon Redshift (Data Warehouse for Analytics 📊-SQL)

Redshift is a data warehouse for SQL analytics (OLAP - Online Analytical Processing).
- Load data from S3 into Redshift
- Use Redshift Spectrum to query data directly in S3 (no loading)
- Data is stored in columns

✅ Use Case: Business intelligence (BI), analytics, machine learning workloads.
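
To see why column storage suits analytics, the toy sketch below holds the same records in a row layout and a column layout. The table contents are made up, and this is plain Python, not how Redshift is implemented internally. An aggregate over one field only has to touch one list in the column layout, while the row layout walks every full record.

```python
# Row layout: each record is stored together (typical for OLTP databases).
rows = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "ben", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column layout: each field is stored together (typical for OLAP warehouses).
columns = to_columns(rows)

# Row layout reads every record to sum one field.
total_from_rows = sum(record["amount"] for record in rows)
# Column layout reads a single list.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)  # 50.0 50.0
```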

---
---

## Amazon RDS / Aurora (Managed Relational Database 🏦-SQL)

- Relational Store, SQL (OLTP - Online Transaction Processing)
- Data is stored in rows, used for transactions

- ✅ Best for: Transactional (OLTP) workloads with high availability & performance.
- ✅ Type: Relational database (fully managed PostgreSQL/MySQL-compatible).
- ✅ Use Case: Web apps, e-commerce, enterprise applications, SaaS.
---
---

## Amazon DynamoDB – (Key-Value Database ⚡ - NoSQL)

NoSQL data store, serverless, with provisioned or on-demand read/write capacity.

**Useful for storing a machine learning model's metrics and inference results, NOT THE ACTUAL MODEL**

- ✅ Best for: High-performance, low-latency NoSQL workloads.
- ✅ Type: NoSQL key-value & document store (fully managed, serverless).
- ✅ Use Case: Real-time apps, IoT, gaming, caching, leaderboards, recommendation engines.

---
---

## AWS Database Migration Service (DMS)

AWS Database Migration Service (DMS) is a fully managed service that helps migrate databases and data warehouses to AWS with minimal downtime.

**Example Use Case:** 
- Transactional data is stored in an on-premises Oracle database, but SageMaker requires data in S3 for training. AWS DMS is used to continuously replicate data from Oracle to Amazon S3.

---
---

## Kinesis Data Streams (Server-Less)(Real-Time)

Amazon Kinesis Data Streams is a real-time data streaming service that enables you to capture, process, and analyze continuous data streams at scale. It’s often used for applications requiring real-time processing, such as log processing, IoT, or machine learning pipelines.

<div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/Kinesis_DataStream.png" alt="Kinesis Data Streams producers and consumers" style="width: 600px;">
        <p>
                Producers are the agents, applications, or code that send real-time data into Kinesis. Kinesis then delivers the data to the consumers in real time.  
        </p>
</div>



### Key Details: 
- Streaming data collection
- You write the producer and consumer code (code required). 
- Real-time
- Provisioned / On-Demand mode
- Data storage up to 365 days
- Replay Capability
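
The retention and replay points can be pictured with a toy in-memory stream. This is a deliberately simplified model in plain Python, not the Kinesis API: the class name `ToyStream`, its methods, and the use of explicit timestamps are all assumptions made for illustration.

```python
from typing import List, Tuple


class ToyStream:
    """Append-only log that keeps records for a fixed retention period."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._records: List[Tuple[int, float, str]] = []  # (sequence, time, data)

    def put(self, data: str, now: float) -> int:
        """Producer side: append a record and return its sequence number."""
        sequence = len(self._records)
        self._records.append((sequence, now, data))
        return sequence

    def read_from(self, start_sequence: int, now: float) -> List[str]:
        """Consumer side: read (or replay) everything still inside retention."""
        cutoff = now - self.retention_seconds
        return [
            data
            for sequence, written_at, data in self._records
            if sequence >= start_sequence and written_at >= cutoff
        ]


stream = ToyStream(retention_seconds=100)
stream.put("a", now=0)
stream.put("b", now=50)
stream.put("c", now=90)

print(stream.read_from(0, now=95))   # ['a', 'b', 'c']
print(stream.read_from(0, now=120))  # ['b', 'c']  ('a' has aged out)
```

Reading does not remove records, so a consumer can replay from an earlier position for as long as the data is within the retention window.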


---
---

## Kinesis Video Streams (Server-less)(Real-time)

 <div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/KinesisVideoWorkflow.png" alt="Kinesis Data Flow" style="width: 600px;">
    </div>


### Kinesis Video Streams Workflow

1. **Ingest Video**: Real-time video is streamed via **Kinesis Video Streams** from a producer (e.g., IoT cameras).  
2. **Processing App**: A consumer application in a Docker container (e.g., on EC2) processes the stream, checkpointing progress to **DynamoDB** for recovery.  
3. **ML Inference**: Decoded frames are sent to **Amazon SageMaker** for machine learning predictions (e.g., object detection).  
4. **Stream Results**: Inference results are published to a **Kinesis Data Stream**.  
5. **Real-Time Actions**: **Lambda functions** consume the data stream to trigger notifications or other actions.

**Use Case**: Real-time detection of anomalies, like identifying a burglar in a house.

#### Example of Kinesis Video Stream Workflow

<div style="display: flex; align-items: flex-start;">
    <div style="flex: 1; padding-right: 20px;">
        <p>
            Amazon Kinesis Video Streams is a managed service designed to stream, process, and store video and audio data in real-time. It is specifically tailored for applications that deal with video feeds from connected devices (like IoT cameras), audio streams, or other time-encoded data (e.g., radar signals).
        </p>
        <ul>
            <li><strong>Producers:</strong>
                <ul>
                    <li>Security camera, body-worn camera, AWS DeepLens, smartphone camera</li>
                    <li>Audio feeds, images, RADAR data, RTSP camera</li>
                    <li>One producer per video stream</li>
                    <li>Video playback capability</li>
                </ul>
            </li>
            <li><strong>Consumers:</strong>
                <ul>
                    <li>Build your own (MXNet, TensorFlow)</li>
                    <li>AWS SageMaker</li>
                    <li>Amazon Rekognition Video</li>
                </ul>
            </li>
            <li>Data retention: Keep data for 1 hour to 10 years</li>
        </ul>
    </div>
    <div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/KinesisVideoStream.png" alt="Kinesis Data Flow" style="width: 600px;">
        <p style="margin-top: 10px; font-style: italic;">Illustration of data producers and consumers in Kinesis Video Streams</p>
    </div>
</div>


---
---

## Amazon Data Firehose (Server-Less)(Near Real-Time)

Amazon Data Firehose (formerly Kinesis Data Firehose) consumes incoming streaming data and delivers it to the desired destination. It also supports integration with AWS Lambda for data transformation during the streaming process.

<div style="flex: 1; text-align: center;">
        <img src="../Figures/Data_Engineering/DataFirhouse.png" alt="Data Firehose delivery flow" style="width: 600px;">
        <p>
            It delivers data to destinations in <strong>near real time</strong> because records are buffered first, and it is <strong>serverless</strong>.
        </p>
</div>

### Key Details: 
- Load streaming data into S3 / Redshift / OpenSearch / 3rd party /custom HTTP
- Fully managed, No code needed.
- Near real-time
- Automatic scaling
- No data storage
- Doesn’t support replay capability
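
The "near real-time" label comes from buffering: records are collected and delivered as a batch once a size or time limit is reached. The toy class below models that idea in plain Python. It is not the Firehose API; the class name `ToyBuffer` and its limits are assumptions, and unlike the real service it only checks the time limit when a new record arrives.

```python
from typing import List, Optional


class ToyBuffer:
    """Collect records and release them as a batch on a size or time limit."""

    def __init__(self, max_bytes: int, max_seconds: float) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._items: List[bytes] = []
        self._size = 0
        self._opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Add a record; return the flushed batch if a limit was reached."""
        if self._opened_at is None:
            self._opened_at = now  # first record opens a new batch
        self._items.append(record)
        self._size += len(record)

        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        if too_big or too_old:
            batch = self._items
            self._items, self._size, self._opened_at = [], 0, None
            return batch
        return None


buffer = ToyBuffer(max_bytes=10, max_seconds=60)
print(buffer.add(b"aaaa", now=0))   # None (still buffering)
print(buffer.add(b"bbbb", now=1))   # None
print(buffer.add(b"cccc", now=2))   # [b'aaaa', b'bbbb', b'cccc'] (size limit hit)
```

Once a batch is delivered, the buffer keeps no copy of it, which matches the "no data storage" and "no replay" points above.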

---
---

## Amazon Managed Service for Apache Flink (Server-Less)(Real-Time)

Formerly known as Kinesis Data Analytics
- It’s used for complex event processing, aggregations, joins, and anomaly detection in real-time streaming data.
- Streaming ETL
- Continuous metric generation
- Responsive real time analytics

**It cannot read from Data Firehose. It reads from real-time streams such as Kinesis Data Streams (or Amazon MSK).**

---
---

## AWS Glue (serverless)(Some Services real-time)

**AWS Glue** is a fully managed **data integration service** that enables **ETL (Extract, Transform, Load)** operations across various AWS and external data sources. It consists of multiple components that work together for **data cataloging, transformation, and real-time streaming.**

### **Key Components of AWS Glue**
| **Component**             | **Purpose** |
|---------------------------|------------|
| **Glue Data Catalog**     | Metadata repository for schema and table definitions. Works with Athena, Redshift, etc. |
| **Glue Crawlers**         | Automatically scan and classify data, adding schema details to the Glue Data Catalog. (Not real time)|
| **Glue ETL Jobs**         | Serverless data transformation using **Apache Spark** (PySpark or Scala) or plain **Python shell** jobs. |
| **Glue Studio**           | No-code **visual ETL editor** for designing and running ETL workflows. |
| **Glue DataBrew**         | No-code tool for **data cleaning and transformation**. Ideal for business analysts. |
| **Glue Streaming ETL**    | **Processes real-time streaming data** from **Kinesis, Kafka, etc.** |
| **Glue ML Transforms**    | **Machine learning-based** data deduplication and cleansing. |

### **Summary**
AWS Glue is a **unified service** with multiple capabilities, providing **serverless data processing, transformation, and cataloging** for analytics and machine learning.

---
---

## Amazon Athena (Serverless)(Near-RealTime)


Amazon Athena is a serverless, interactive query service that allows you to run SQL queries on data stored in Amazon S3 without needing to set up or manage databases or servers.

How AWS Athena Works:

1. Data is stored in Amazon S3 (raw or processed).
2. AWS Glue Crawlers scan and create schema metadata in the AWS Glue Data Catalog.
3. Athena runs SQL queries directly on S3 without needing a database.
4. Results are stored in S3 and can be used for analytics, dashboards, or ML processing.
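
Steps 3 and 4 can be sketched as the request a client would send to Athena. The example only builds the request dictionary; its keys match boto3's `start_query_execution` parameters, while the function name and the database, table, and bucket names are placeholders invented for illustration.

```python
def athena_query_request(database: str, table: str,
                         output_s3_uri: str, limit: int = 10) -> dict:
    """Build the arguments for an Athena query over a cataloged table."""
    # Keep the sketch safe: only allow simple identifiers as table names.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        # The database comes from the Glue Data Catalog (step 2).
        "QueryExecutionContext": {"Database": database},
        # Query results are written back to S3 (step 4).
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


request = athena_query_request(
    database="sales_db",                          # placeholder
    table="orders",                               # placeholder
    output_s3_uri="s3://my-example-results/athena/",  # placeholder
)
print(request["QueryString"])  # SELECT * FROM "orders" LIMIT 10
```

With boto3 available, the dictionary could be passed as `athena.start_query_execution(**request)`.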

------
---

## AWS Batch (Serverless)

AWS Batch – Fully Managed Batch Processing Service

**For any non-ETL related work, Batch is probably better**

**For ETL related jobs Glue is better**

####  **Likely used for cost-efficient ML training outside of SageMaker. It helps reduce costs but requires manual compute allocation.**

- ✅ Best for: Running large-scale batch processing jobs efficiently on AWS.
- ✅ Type: Fully managed batch job scheduler & compute orchestrator.
- ✅ Use Cases: ML training, data processing, financial modeling, genomics.

---
---

## AWS Step Functions (Not Real-Time)(Serverless)

#### **(Automation of AWS services in a workflow)**

AWS Step Functions is a serverless orchestration service that allows you to define, execute, and manage workflow sequences across multiple AWS services. It ensures fault tolerance, state management, and parallel execution without needing custom scripts.

- ✅ Best for: Orchestrating and automating workflows across multiple AWS services.
- ✅ Type: Serverless workflow automation service.
- ✅ Use Cases: ML pipelines, ETL orchestration, data processing, event-driven workflows.
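
Workflows are defined in Amazon States Language, a JSON document of named states. The sketch below builds a small two-step pipeline with retries and a failure branch. The state names, retry settings, and ARNs are hypothetical; `Type`, `Resource`, `Retry`, `Catch`, `Next`, and `End` are standard Amazon States Language fields.

```python
import json


def build_state_machine(etl_arn: str, train_arn: str) -> dict:
    """Two-task workflow: run ETL, then train, retrying on errors."""
    retry = [{
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }]
    to_failure = [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}]
    return {
        "Comment": "Hypothetical ETL then training pipeline",
        "StartAt": "RunEtl",
        "States": {
            "RunEtl": {
                "Type": "Task",
                "Resource": etl_arn,
                "Retry": retry,
                "Catch": to_failure,
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                "Resource": train_arn,
                "Retry": retry,
                "Catch": to_failure,
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Fail",
                "Error": "PipelineFailed",
                "Cause": "A task failed after all retries",
            },
        },
    }


definition = build_state_machine(
    etl_arn="arn:aws:lambda:us-east-1:111122223333:function:run-etl",      # placeholder
    train_arn="arn:aws:lambda:us-east-1:111122223333:function:train-model",  # placeholder
)
print(json.dumps(definition, indent=2))
```

The retry and catch rules live in the definition itself, which is how Step Functions provides fault tolerance without custom scripts.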

---
---

## <h1 align="center">Real-Time ML Workflows Examples</h1>


### 1️⃣ Real-Time Fraud Detection (Streaming ML Inference)

### 💡 Use Case: Detect fraudulent transactions in real time.

<h2 align="center">Real-Time Fraud Detection Workflow</h2>
<img src="../Figures/Data_Engineering/MLWorkFlows/FraudMLWorkFlow.png" style="width:800px; display:block; margin:auto;">

#### FYI: There are multiple ways to build this workflow. For instance, we could use Data Firehose for the main ingestion from the producer, but the workflow would then be near real-time rather than real-time.

---
---

### 2️⃣ Real-Time Video Classification

### 💡 Use Case: Perform Computer Vision on video in real time.

<h2 align="center">Real-Time Video Classification-Rekognition</h2>
<img src="../Figures/Data_Engineering/MLWorkFlows/VideoStream1.png" style="width:800px; display:block; margin:auto;">

<h2 align="center">Real-Time Video Classification-Sagemaker</h2>
<img src="../Figures/Data_Engineering/MLWorkFlows/VideoStream2.png" style="width:800px; display:block; margin:auto;">

---
---


### Data Migration Workflow 

<img src="../Figures/Data_Engineering/MLWorkFlows/DataMigrationWF.png" style="width:800px; display:block; margin:auto;">

**AWS Glue Data Catalog and its crawlers help track data schemas throughout the migration pipeline, ensuring data is updated correctly and identifying any unintended changes in the workflow.**

**AWS Batch is a fully managed service that runs large-scale batch computing workloads, such as data processing or script execution, on demand or on a scheduled basis using EventBridge.**

**In this workflow, AWS Batch creates jobs that periodically clean the data store or run large-scale scripts on a schedule.**

### To **orchestrate** all these workflows, we use **AWS Step Functions** to connect services, manage endpoints, and ensure a **fault-tolerant ML workflow environment**.
<img src="../Figures/Data_Engineering/MLWorkFlows/Stepfunctions.png" style="width:100px; display:block; margin:auto;">

# **📌 Data Engineering Summary**  



### **🔹 Data Storage & Access**  
- **Amazon S3** → Object storage for your data.  
- **VPC Gateway Endpoint** → Privately access your S3 bucket without using the public internet.  

---

### **🔹 Real-Time & Streaming Data Processing**  
- **Kinesis Data Streams** → Real-time data streaming for applications; requires capacity planning.  
- **Kinesis Data Firehose** → Near real-time data ingestion to **S3, Redshift, OpenSearch, Splunk**.  
- **Kinesis Data Analytics (now Managed Service for Apache Flink)** → Perform **real-time analytics and transformations on streaming data**.  
- **Kinesis Video Streams** → Real-time video ingestion and processing.  

---

### **🔹 Data Cataloging & ETL**  
- **Glue Data Catalog & Crawlers** → Metadata repository for schemas and datasets.  
- **Glue ETL** → Serverless **ETL jobs using Apache Spark**.  

---

### **🔹 Databases & Warehouses**  
- **DynamoDB** → **NoSQL** key-value store for scalable applications.  
- **Redshift** → Data warehousing for **OLAP (Online Analytical Processing)** with **SQL support**.  
- **Redshift Spectrum** → Query **S3 data using Redshift** without loading it into Redshift.  
- **RDS / Aurora** → **Relational database (OLTP)** supporting SQL.  
---

### **🔹 Data Orchestration & Processing**  
- **Data Pipelines** → Orchestrate **ETL jobs** across **RDS, DynamoDB, and S3** (runs on EC2).  
- **AWS Batch** → **Batch processing using Docker containers**; manages EC2 instances for you.  
- **DMS (Database Migration Service)** → **1-to-1 database replication** (Change Data Capture - CDC), not ETL.  
- **Step Functions** → **Orchestrate workflows** with **audit trails & retry mechanisms**.  

---