1. Difference between AWS Regions, Availability Zones, and Edge Locations

Regions: Separate geographic areas, each containing multiple Availability Zones (e.g., us-east-1, ap-south-1).

Availability Zones (AZs): Isolated groups of one or more data centers within a Region, each with independent power and networking (e.g., ap-south-1a, ap-south-1b).

Edge Locations: Points of presence that CloudFront uses to cache content closer to users.

Importance: Choosing a nearby Region reduces latency, spreading resources across AZs enables high availability, and replicating across Regions supports disaster recovery for analytics workloads.
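
The naming relationship between Regions and AZs can be illustrated with a small helper. This is a minimal sketch, assuming standard AZ names of the form Region code plus one letter (such as `ap-south-1a`). The function name `region_of_az` is hypothetical, and Local Zone or Wavelength Zone names follow a longer pattern that this sketch rejects.

```python
import re

# Standard AZ names are the Region code plus one lowercase letter,
# e.g. "ap-south-1a" belongs to Region "ap-south-1".
# Assumption: Local Zones and Wavelength Zones are out of scope.
_AZ_PATTERN = re.compile(r"^([a-z]{2}(?:-[a-z]+)+-\d+)([a-z])$")


def region_of_az(az_name: str) -> str:
    """Return the Region code that a standard AZ name belongs to."""
    match = _AZ_PATTERN.match(az_name)
    if match is None:
        raise ValueError(f"not a standard AZ name: {az_name!r}")
    return match.group(1)
```

For example, `region_of_az("ap-south-1b")` yields `"ap-south-1"`, while a bare Region code such as `"ap-south-1"` raises `ValueError`.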

In [None]:
#2. AWS CLI to List All Regions

aws ec2 describe-regions --query "Regions[*].RegionName" --output table
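
The same listing can be sketched in Python. This assumes a boto3 EC2 client is passed in; the helper name `list_region_names` is hypothetical, and the client is a parameter so the function can be tried without AWS credentials.

```python
def list_region_names(ec2_client) -> list[str]:
    """Return the Region names reported by EC2 DescribeRegions, sorted."""
    response = ec2_client.describe_regions()
    return sorted(region["RegionName"] for region in response["Regions"])


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   print(list_region_names(boto3.client("ec2", region_name="us-east-1")))
```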


In [None]:
#3. Create IAM User with Least Privilege (for S3)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::your-bucket-name"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": ["arn:aws:s3:::your-bucket-name/*"]
    }
  ]
}
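
A policy like this can also be generated per bucket. The sketch below assumes a hypothetical helper named `s3_read_write_policy`. It scopes `s3:ListBucket` to the bucket ARN and the object actions to the keys under it, which is one way to keep the grant narrow.

```python
import json


def s3_read_write_policy(bucket_name: str) -> dict:
    """Build a policy allowing list on one bucket and get/put on its objects."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Bucket-level action applies to the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {   # Object-level actions apply to keys under the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# JSON text suitable for the PolicyDocument argument of IAM policy calls.
policy_json = json.dumps(s3_read_write_policy("your-bucket-name"), indent=2)
```

The resulting string could be passed as `PolicyDocument` to boto3's IAM `create_policy` or `put_user_policy`.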


4. Compare S3 Storage Classes

| Storage Class          | Description                     | When to Use                            |
| ---------------------- | ------------------------------- | -------------------------------------- |
| S3 Standard            | Default, for frequent access    | Daily analytics, active datasets       |
| S3 Intelligent-Tiering | Auto-moves data based on access | When access frequency is unpredictable |
| S3 Glacier             | Low-cost archival storage       | Rarely accessed archives and backups   |


In [None]:
#5. Create S3 Bucket & Enable Versioning

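# Regions other than us-east-1 require an explicit LocationConstraint
aws s3api create-bucket --bucket my-analytics-bucket --region ap-south-1 \
--create-bucket-configuration LocationConstraint=ap-south-1

aws s3api put-bucket-versioning --bucket my-analytics-bucket \
--versioning-configuration Status=Enabled

# Upload twice to the same key; versioning keeps both versions
aws s3 cp data.csv s3://my-analytics-bucket/
aws s3 cp data_v2.csv s3://my-analytics-bucket/data.csv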
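
To confirm that both uploads were kept, the versions of a key can be listed. This is a minimal sketch, assuming a boto3 S3 client is passed in; the helper name `list_versions` is hypothetical, and it reads only the first page of results.

```python
def list_versions(s3_client, bucket: str, key: str) -> list[tuple[str, bool]]:
    """Return (version_id, is_latest) pairs for one key, in the order S3 returns them."""
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    return [
        (version["VersionId"], version["IsLatest"])
        for version in response.get("Versions", [])
        if version["Key"] == key  # Prefix can also match keys like "data.csv.bak"
    ]


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   list_versions(boto3.client("s3"), "my-analytics-bucket", "data.csv")
```

After the two uploads above, two entries would be expected for `data.csv`, with exactly one marked as latest.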


In [None]:
#6. Lifecycle Policy for Glacier + Deletion

{
  "Rules": [
    {
      "ID": "GlacierAfter30Days",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 90
      }
    }
  ]
}
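
The timeline this rule produces can be sketched as a small function. It assumes the object was uploaded as S3 Standard and uses the 30-day and 90-day thresholds from the policy above. The name `lifecycle_state` is hypothetical, and S3 applies lifecycle actions asynchronously, so real changes may lag the thresholds.

```python
def lifecycle_state(age_days: int, transition_days: int = 30,
                    expiration_days: int = 90) -> str:
    """Return the expected state of an object of the given age under the rule."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= expiration_days:
        return "EXPIRED"      # removed by the Expiration action
    if age_days >= transition_days:
        return "GLACIER"      # moved by the Transitions action
    return "STANDARD"         # still in the upload storage class
```

With these thresholds an object spends about 60 days in Glacier before expiring. Glacier storage has a minimum storage duration, so expiring objects this early may incur an early-deletion charge.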


7. Compare RDS, DynamoDB, and Redshift

| Service  | Type       | Best Use Case                          |
| -------- | ---------- | -------------------------------------- |
| RDS      | Relational | Transactional systems (OLTP)           |
| DynamoDB | NoSQL      | Real-time apps, IoT, key-value storage |
| Redshift | Columnar   | Large-scale data analytics (OLAP)      |


In [None]:
#8. DynamoDB + Lambda Triggered by S3 Upload

aws dynamodb create-table \
--table-name UploadLogs \
--attribute-definitions AttributeName=ID,AttributeType=S \
--key-schema AttributeName=ID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST



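import uuid
from urllib.parse import unquote_plus

import boto3

# Create the table handle once per execution environment, not per invocation
table = boto3.resource('dynamodb').Table('UploadLogs')

def lambda_handler(event, context):
    for record in event['Records']:
        table.put_item(Item={
            'ID': str(uuid.uuid4()),  # unique ID; timestamps can collide
            # S3 event keys are URL-encoded (spaces arrive as '+')
            'FileName': unquote_plus(record['s3']['object']['key'])
        })
    return {"status": "logged"}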
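
The key extraction can be tried locally without AWS. The sketch below uses a trimmed, illustrative S3 event (real events carry more fields) and a hypothetical helper `uploaded_keys`. It decodes keys with `unquote_plus` because S3 URL-encodes object keys in event notifications.

```python
from urllib.parse import unquote_plus


def uploaded_keys(event: dict) -> list[str]:
    """Return the decoded object keys from an S3 event notification."""
    return [
        unquote_plus(record["s3"]["object"]["key"])
        for record in event.get("Records", [])
    ]


# Trimmed sample event for local testing (illustrative values only).
sample_event = {
    "Records": [
        {
            "eventTime": "2024-01-01T00:00:00.000Z",
            "s3": {
                "bucket": {"name": "my-analytics-bucket"},
                "object": {"key": "reports/sales+2024.csv", "size": 1024},
            },
        }
    ]
}

keys = uploaded_keys(sample_event)  # the '+' decodes to a space
```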


9. What is Serverless Computing? Pros/Cons of Lambda

Serverless = No server management. You deploy code, and AWS handles provisioning, scaling, and patching.

Pros:

Pay per execution

No infrastructure overhead

Highly scalable

Cons:

Cold starts

Limited runtime (15 minutes maximum per invocation)
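
"Pay per execution" can be made concrete with a rough cost estimate. Lambda bills for requests plus compute measured in GB-seconds (duration multiplied by configured memory). In this sketch the function name `estimate_lambda_cost` is hypothetical, both prices are caller-supplied placeholders rather than published rates, and the free tier is ignored.

```python
def estimate_lambda_cost(invocations: int, avg_duration_ms: float,
                         memory_mb: int, price_per_gb_second: float,
                         price_per_million_requests: float) -> float:
    """Estimate a Lambda bill from request count, duration, and memory size."""
    # Compute is billed in GB-seconds: seconds run times memory in GB.
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    # Requests are billed per invocation, quoted per million.
    request_cost = (invocations / 1_000_000) * price_per_million_requests
    return compute_cost + request_cost
```

Actual rates vary by Region and architecture, so they should be taken from the current pricing page.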

In [None]:
#10. Lambda Logs File Name, Size to CloudWatch

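from urllib.parse import unquote_plus

def lambda_handler(event, context):
    for record in event['Records']:
        # S3 event keys are URL-encoded (spaces arrive as '+')
        key = unquote_plus(record['s3']['object']['key'])
        size = record['s3']['object']['size']
        # print() output from Lambda is captured in CloudWatch Logs
        print(f"File: {key}, Size: {size} bytes, Time: {record['eventTime']}")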


In [None]:
#11. AWS Glue – Convert CSV to Parquet

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
df = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="csv_data")
df.toDF().write.parquet("s3://my-output-bucket/parquet/")


12. Kinesis Components Explained

| Component              | Purpose                       | Example Use Case                  |
| ---------------------- | ----------------------------- | --------------------------------- |
| Kinesis Data Streams   | Real-time streaming ingestion | IoT, live sensor data             |
| Kinesis Data Firehose  | Delivery to destinations      | Logs from app to S3/Redshift      |
| Kinesis Data Analytics | SQL on real-time streams      | Filter alerts from real-time logs |
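
Ingestion into a data stream can be sketched with a single `put_record` call. This assumes a boto3 Kinesis client is passed in and that the stream already exists; the helper name `send_reading` and the choice of device ID as partition key are illustrative.

```python
import json


def send_reading(kinesis_client, stream_name: str, device_id: str,
                 reading: dict) -> str:
    """Send one sensor reading to a Kinesis data stream; return its sequence number."""
    response = kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(reading).encode("utf-8"),  # record payload as bytes
        PartitionKey=device_id,  # records with the same key go to the same shard
    )
    return response["SequenceNumber"]


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   send_reading(boto3.client("kinesis"), "sensor-stream", "device-42", {"temp": 21.5})
```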


13. What is Columnar Storage? (Redshift)

Redshift uses columnar storage to:

Reduce I/O (scan only needed columns)

Compress better

Improve query performance

Great for analytics queries over wide tables.
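
The I/O saving can be illustrated with a toy comparison in plain Python. This is only an analogy for the layout difference, not how Redshift is implemented; the data and function names are made up for the example.

```python
# Row layout: each record stores all of its fields together.
rows = [
    {"id": 1, "product": "pen", "price": 1.5},
    {"id": 2, "product": "book", "price": 2.0},
    {"id": 3, "product": "cup", "price": 0.5},
]

# Column layout: each column's values are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_price_row_store(rows):
    """Sum prices from the row layout; every field of every row is read."""
    total, values_read = 0.0, 0
    for row in rows:
        values_read += len(row)  # the whole row must be read
        total += row["price"]
    return total, values_read


def total_price_column_store(columns):
    """Sum prices from the column layout; only the price column is read."""
    prices = columns["price"]
    return sum(prices), len(prices)
```

Both functions return the same total, but the row version touches 9 values while the column version touches 3. The gap widens as tables get wider.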

In [None]:
#14. Load CSV into Redshift Using COPY

CREATE TABLE sales (
  id INT,
  product VARCHAR(50),
  price FLOAT
);

COPY sales
FROM 's3://my-bucket/sales.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;


15. Glue Catalog + Athena: Schema-on-Read

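Glue Data Catalog stores schema metadata (table names, columns, data types, and S3 locations).

Athena queries S3 data in place, applying that metadata when the query runs rather than when the data is written (schema-on-read).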
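
Schema-on-read can be illustrated in plain Python. The raw text is stored as-is, and a schema is applied only when it is read. This is an analogy under simplified assumptions, not Athena's implementation; `RAW_CSV`, `SALES_SCHEMA`, and `read_with_schema` are hypothetical names.

```python
import csv
import io

# Raw data as it would sit in storage: no types, no column names.
RAW_CSV = "1,pen,1.5\n2,book,12.0\n"

# The schema lives separately, like table metadata in a catalog.
SALES_SCHEMA = [("id", int), ("product", str), ("price", float)]


def read_with_schema(raw_text: str, schema) -> list[dict]:
    """Parse raw CSV text, applying column names and types at read time."""
    records = []
    for fields in csv.reader(io.StringIO(raw_text)):
        records.append(
            {name: cast(value) for (name, cast), value in zip(schema, fields)}
        )
    return records


typed_rows = read_with_schema(RAW_CSV, SALES_SCHEMA)
```

Reading the same text with a different schema, for example every column as `str`, needs no change to the stored data.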


In [None]:
#16. Athena Table + Query Example

CREATE EXTERNAL TABLE sales_data (
  id INT,
  product STRING,
  price FLOAT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.format' = ',')
LOCATION 's3://your-bucket/sales/';
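
Once the table exists, a query can be submitted from Python. This is a minimal sketch, assuming a boto3 Athena client is passed in. The helper name `run_athena_query`, the example SQL, and the database and output location shown in the usage comment are illustrative.

```python
import time

# Example query against the table defined above (illustrative).
EXAMPLE_SQL = (
    "SELECT product, SUM(price) AS total_price "
    "FROM sales_data GROUP BY product"
)


def run_athena_query(athena_client, sql: str, database: str,
                     output_location: str, poll_seconds: float = 1.0):
    """Start an Athena query and poll until it reaches a final state."""
    start = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = start["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


# Typical use (requires boto3 and configured AWS credentials):
#   import boto3
#   run_athena_query(boto3.client("athena"), EXAMPLE_SQL,
#                    "default", "s3://your-bucket/athena-results/")
```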


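17. QuickSight BI – SPICE + Embedded Dashboards

SPICE: QuickSight's in-memory engine (Super-fast, Parallel, In-memory Calculation Engine). It serves dashboards from imported data instead of querying the source each time.

Embedded Dashboards: Share insights inside your own apps, so viewers do not need a separate QuickSight login.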

18. QuickSight Dashboard Steps

Connect to Athena table.

Add a calculated field: Profit = Revenue - Cost.

Add a region filter.

Visualize using a bar chart.

Share or embed the dashboard.

19. CloudWatch vs CloudTrail

| Tool       | Monitors              | Purpose                              |
| ---------- | --------------------- | ------------------------------------ |
| CloudWatch | Logs, Metrics, Alarms | Health & performance monitoring      |
| CloudTrail | API activity          | Auditing, compliance, investigations |


In [None]:
#20. End-to-End AWS Data Pipeline (Example)

S3 (raw data) →
Lambda (trigger) →
Glue (ETL jobs) →
Athena (SQL queries) →
QuickSight (dashboards)
