# #01 Data Ingestion and Storage

### 1. Types of Data

| Type                | Description                                        | Examples                       |
| ------------------- | -------------------------------------------------- | ------------------------------ |
| **Structured**      | Organized with a fixed schema; easy to query.      | SQL tables, CSV, Excel         |
| **Unstructured**    | No predefined structure; harder to query directly. | Images, videos, PDFs, emails   |
| **Semi-Structured** | Has partial structure (metadata, tags).            | JSON, XML, logs, email headers |
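
The practical difference between structured and semi-structured data shows up when reading it. The following is a minimal sketch using only the Python standard library; the sample records and field names (`user_id`, `country`, `tags`, `device`) are made up for illustration.

```python
import csv
import io
import json

# Structured: every row has the same columns, so fields are addressed by name.
csv_text = "user_id,country\n1,DE\n2,US\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records share some keys, but nesting and optional
# fields vary from record to record.
json_text = '[{"user_id": 1, "tags": ["a", "b"]}, {"user_id": 2, "device": {"os": "ios"}}]'
events = json.loads(json_text)

countries = [row["country"] for row in rows]
# Optional fields need a default because not every record carries them.
tag_counts = [len(event.get("tags", [])) for event in events]
```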


### 2. The 3 V’s of Data

| Property     | Meaning                                          | Example                         |
| ------------ | ------------------------------------------------ | ------------------------------- |
| **Volume**   | Size of the data.                                | Terabytes of social media data. |
| **Velocity** | Speed at which data arrives & must be processed. | IoT sensor data stream.         |
| **Variety**  | Different types & formats of data.               | DB records + logs + images.     |


### 3. Data Warehouse vs Data Lake

| Feature   | **Data Warehouse**               | **Data Lake**                         |
| --------- | -------------------------------- | ------------------------------------- |
| Data Type | Structured                       | All data types                        |
| Schema    | **Schema-on-write** (predefined) | **Schema-on-read** (decide later)     |
| Workflow  | **ETL** (clean before load)      | **ELT** (load first, transform later) |
| Use Case  | BI dashboards, analytics         | ML, scalable storage, exploration     |
| Examples  | Amazon **Redshift**, BigQuery    | Amazon **S3**, ADLS, HDFS             |
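
The schema row is the key difference, and it can be sketched in a few lines. This is a toy illustration, not how Redshift or S3 work internally: `SCHEMA`, `write_to_warehouse`, `write_to_lake`, and `read_amounts_from_lake` are hypothetical names, and plain Python lists stand in for the two stores.

```python
import json

SCHEMA = {"order_id": int, "amount": float}  # hypothetical warehouse table schema

def write_to_warehouse(table, record):
    """Schema-on-write: validate against SCHEMA before storing."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected_type in SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")
    table.append(record)

def write_to_lake(lake, raw_line):
    """Schema-on-read: store the raw line untouched."""
    lake.append(raw_line)

def read_amounts_from_lake(lake):
    """Apply structure only at query time, skipping records that lack it."""
    amounts = []
    for raw_line in lake:
        record = json.loads(raw_line)
        if "amount" in record:
            amounts.append(float(record["amount"]))
    return amounts

warehouse, lake = [], []
write_to_warehouse(warehouse, {"order_id": 1, "amount": 9.5})
write_to_lake(lake, '{"order_id": 1, "amount": "9.5"}')  # amount as a string: accepted as-is
write_to_lake(lake, '{"event": "page_view"}')            # different shape: also accepted
amounts = read_amounts_from_lake(lake)
```

The warehouse rejects bad records at load time, while the lake accepts anything and pushes the cleanup cost to whoever reads it.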

**When to Use Which**

- Data Warehouse → When data is structured and used mainly for analytics & reporting.

- Data Lake → When data is raw, large-scale, used for ML, discovery, or unstructured storage.

- Most real-world systems use both (Lake → Warehouse).

### 4. Data Lakehouse

A hybrid architecture combining:

- Flexibility of Data Lake

- Performance & reliability of Warehouse

**Examples:**

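- Databricks Lakehouse

- Delta Lake (open table format that adds ACID transactions to files in a data lake)

- AWS Lake Formation (governance and access control for data lakes built on S3)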

### 5. Data Mesh

More about organization & ownership, not tools:

- Each team owns its data as a product (domain-oriented).

- Federated governance.

- Supports self-service data consumption.

### 6. ETL Pipelines

| Step          | Purpose                                           |
| ------------- | ------------------------------------------------- |
| **Extract**   | Pull data from source systems (DBs, APIs, files). |
| **Transform** | Clean, normalize, aggregate, format data.         |
| **Load**      | Store data in warehouse (batch or streaming).     |
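
The three steps can be sketched end to end with the standard library. This is a minimal illustration under stated assumptions: the `SOURCE_CSV` data and the `orders` table are invented, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would come from a DB, API, or file.
SOURCE_CSV = "order_id,amount,country\n1, 10.50 ,de\n2,,us\n3,4.25,US\n"

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, cast types, normalize country codes."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((int(row["order_id"]), float(amount), row["country"].strip().upper()))
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(conn, transform(extract(SOURCE_CSV)))
totals = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
```

Swapping the order to extract, load, transform (ELT) would mean inserting the raw rows first and doing the cleanup with SQL inside the target system.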

**Common AWS tools:**

- AWS Glue (managed ETL jobs), Lambda (lightweight transforms), Step Functions and Airflow (MWAA) for orchestration, EventBridge for scheduling and event triggers.

### 7. Data File Formats

| Format      | Type                             | Best Use                                             |
| ----------- | -------------------------------- | ---------------------------------------------------- |
| **CSV**     | Text, structured                 | Simple tabular data, export/import.                  |
| **JSON**    | Semi-structured                  | API data, nested data.                               |
| **Avro**    | Binary + schema stored with data | Streaming & schema evolution (Kafka).                |
| **Parquet** | **Columnar** format              | Analytics & big data (Spark/Hive/Redshift Spectrum). |
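
Why a columnar format suits analytics can be shown with plain Python data structures. This is a conceptual sketch only: `to_columnar` is a hypothetical helper, and real Parquet files add row groups, encoding, and compression on top of the same idea.

```python
# Row-oriented layout (CSV, Avro): each record is stored together.
rows = [
    {"order_id": 1, "country": "DE", "amount": 10.5},
    {"order_id": 2, "country": "US", "amount": 4.25},
    {"order_id": 3, "country": "US", "amount": 7.0},
]

def to_columnar(records):
    """Pivot records into one list per column, the way a columnar format groups values."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columnar(rows)

# An aggregate over one column touches only that column's values,
total_columnar = sum(columns["amount"])
# whereas the row layout has to walk every full record to reach the same field.
total_rows = sum(record["amount"] for record in rows)
```

Values of the same type sitting next to each other also tend to compress better, which is part of why columnar files are usually smaller than the equivalent CSV.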

## Amazon S3

### 8. Amazon S3 Overview

- Object storage with virtually unlimited capacity.

- Stores files as objects inside buckets.

- No real folders: the bucket is a flat namespace, and the `/` characters in key names simulate folder paths.
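
The folder illusion comes from listing keys with a prefix and a delimiter. The sketch below imitates that grouping in plain Python; `list_folder` is a hypothetical helper, not an AWS API, and the keys are made up. The S3 list operation exposes the same idea through its `Prefix` and `Delimiter` parameters.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Group flat keys the way a prefix + delimiter listing does."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter is reported as a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)

# Hypothetical keys in a single bucket; the bucket itself is flat.
keys = [
    "raw/2024/01/events.json",
    "raw/2024/02/events.json",
    "raw/readme.txt",
    "curated/orders.parquet",
]

objects, folders = list_folder(keys, prefix="raw/")
top_level_objects, top_level_folders = list_folder(keys)
```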

#### Security

- IAM policies (user-based)

- Bucket policies (resource-based)

- ACLs (legacy / avoid)
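
A bucket policy is a JSON document attached to the bucket, and unlike an IAM identity policy it names the `Principal` it applies to. Here is a minimal sketch that builds one in Python; the bucket name, account ID, role name, and `curated/` prefix are all placeholders.

```python
import json

BUCKET = "example-data-lake"                           # hypothetical bucket name
READER_ARN = "arn:aws:iam::111122223333:role/analyst"  # hypothetical IAM role

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAnalystReadCurated",
            "Effect": "Allow",
            "Principal": {"AWS": READER_ARN},  # resource-based: says who gets access
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/curated/*",
        }
    ],
}

policy_json = json.dumps(bucket_policy, indent=2)
```

An identity policy granting the same access would carry the same `Action` and `Resource` but no `Principal`, because it is attached directly to the user or role.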

#### Versioning

- Keeps every version of an object, so overwritten or deleted objects can be restored.
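
How a delete stays recoverable can be modeled with a small in-memory class. This is a toy model, not the S3 API: `VersionedBucket` and its methods are hypothetical, and it only mirrors the idea that a plain delete adds a delete marker instead of erasing data.

```python
class VersionedBucket:
    """Toy in-memory model of a versioned bucket (not the real S3 API)."""

    DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}  # key -> list of versions, oldest first

    def put(self, key, body):
        self._versions.setdefault(key, []).append(body)

    def delete(self, key):
        # A plain delete hides the object by stacking a delete marker on top.
        self._versions.setdefault(key, []).append(self.DELETE_MARKER)

    def get(self, key):
        versions = self._versions.get(key, [])
        if not versions or versions[-1] is self.DELETE_MARKER:
            raise KeyError(key)
        return versions[-1]

    def restore(self, key):
        # Removing the delete marker makes the previous version current again.
        versions = self._versions.get(key, [])
        if versions and versions[-1] is self.DELETE_MARKER:
            versions.pop()

bucket = VersionedBucket()
bucket.put("report.csv", "v1")
bucket.put("report.csv", "v2")
bucket.delete("report.csv")
bucket.restore("report.csv")
current = bucket.get("report.csv")
```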

### 9. S3 Storage Classes

| Class                | Best For                         | Retrieval          | Relative Storage Cost              |
| -------------------- | -------------------------------- | ------------------ | ---------------------------------- |
| Standard             | Frequent access                  | Instant            | Highest                            |
| Standard-IA          | Infrequent access                | Instant            | Lower, plus a per-GB retrieval fee |
| One Zone-IA          | Non-critical backups             | Instant            | Lower than Standard-IA (single AZ) |
| Glacier Instant      | Rare access but need quick reads | Milliseconds       | Lower than the IA classes          |
| Glacier Flexible     | Archival                         | Minutes to 12 hrs  | Very low                           |
| Glacier Deep Archive | Long-term cold archive           | 12–48 hrs          | Lowest                             |
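
The class is chosen per object at upload time through the `StorageClass` parameter. The sketch below maps the table's rows to the identifiers the S3 API uses and builds the upload arguments without calling AWS; `put_object_args`, the bucket name, and the key are hypothetical.

```python
# Table rows mapped to the identifiers the S3 API uses for `StorageClass`.
STORAGE_CLASS_BY_TIER = {
    "Standard": "STANDARD",
    "Standard-IA": "STANDARD_IA",
    "One Zone-IA": "ONEZONE_IA",
    "Glacier Instant": "GLACIER_IR",
    "Glacier Flexible": "GLACIER",
    "Glacier Deep Archive": "DEEP_ARCHIVE",
}

def put_object_args(bucket, key, body, tier="Standard"):
    """Build keyword arguments for an upload that targets a given tier."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "StorageClass": STORAGE_CLASS_BY_TIER[tier],
    }

# Hypothetical bucket and key names.
args = put_object_args(
    "example-data-lake", "archive/2020/logs.gz", b"example bytes", tier="Glacier Deep Archive"
)
# With boto3 installed and credentials configured, these could be passed as:
#   boto3.client("s3").put_object(**args)
```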

#### S3 Intelligent-Tiering

- Automatically moves objects to cheaper tiers based on access frequency.

## Elastic Block Store (EBS)

## Elastic File System (EFS)

## Amazon FSx

## Amazon Kinesis Data Streams