
Q. 1 - Explain the difference between AWS Regions, Availability Zones, and Edge Locations. Why is this important for data analysis and latency-sensitive applications?
A - Understanding the difference between AWS Regions, Availability Zones, and Edge Locations is crucial for designing scalable, fault-tolerant, and low-latency applications, especially in data analysis and latency-sensitive environments.
________________________________________
 1. AWS Regions
•	Definition: A Region is a geographically distinct area that contains multiple, isolated Availability Zones.
•	Examples: us-east-1 (N. Virginia), eu-west-1 (Ireland), ap-southeast-1 (Singapore)
•	Purpose: Regions allow you to deploy resources close to your users or comply with data sovereignty laws.
•	Impact: Choosing the right Region affects latency, data residency, cost, and availability.
________________________________________
2. Availability Zones (AZs)
•	Definition: An Availability Zone is a distinct location within a Region, with independent power, cooling, and networking.
•	Each Region contains multiple AZs, typically three or more.
•	Purpose: AZs provide fault isolation. You can deploy applications across multiple AZs to ensure high availability and disaster recovery.
•	Impact: Using multiple AZs protects against data center failures and supports robust, fault-tolerant architecture for analytics systems.
________________________________________
 3. Edge Locations
•	Definition: Edge Locations are globally distributed sites used by Amazon CloudFront (CDN) to cache content and by AWS Global Accelerator to route user traffic onto the AWS network closer to users.
•	Purpose: Minimize latency for static and dynamic content by serving from the location closest to the end-user.
•	Impact: For latency-sensitive applications (e.g., real-time dashboards, media streaming), Edge Locations reduce response times significantly.
________________________________________
 Why This Matters for Data Analysis and Latency-Sensitive Applications

| Concern                     | How AWS Components Help                                          |
| --------------------------- | ---------------------------------------------------------------- |
| Latency                     | Edge Locations deliver fast responses to users globally.         |
| Availability & Reliability  | Multi-AZ deployment ensures systems stay online during failures. |
| Scalability                 | Regions allow for global infrastructure to scale analytics.      |
| Data Sovereignty/Compliance | Regions allow control over where data is stored and processed.   |

________________________________________
Q. 2 - Using the AWS CLI, list all available AWS regions. Share the command used and the output.
A - To list all available AWS Regions using the AWS CLI, you can use the following command:
________________________________________
 Command:
aws ec2 describe-regions --all-regions --query "Regions[*].RegionName" --output table
________________________________________
 Explanation:
•	describe-regions: Lists the Regions that are enabled for your account.
•	--all-regions: Also includes Regions that are disabled for your account (opt-in Regions you have not enabled).
•	--query: Filters the output to show only Region names.
•	--output table: Formats the output as a table for easier readability.
________________________________________
 Example Output:
---------------------------------
|        DescribeRegions        |
+-------------------------------+
|  af-south-1                   |
|  ap-east-1                    |
|  ap-northeast-1               |
|  ap-northeast-2               |
|  ap-northeast-3               |
|  ap-south-1                   |
|  ap-south-2                   |
|  ap-southeast-1               |
|  ap-southeast-2               |
|  ap-southeast-3               |
|  ca-central-1                 |
|  eu-central-1                 |
|  eu-central-2                 |
|  eu-north-1                   |
|  eu-south-1                   |
|  eu-south-2                   |
|  eu-west-1                    |
|  eu-west-2                    |
|  eu-west-3                    |
|  me-central-1                 |
|  me-south-1                   |
|  sa-east-1                    |
|  us-east-1                    |
|  us-east-2                    |
|  us-west-1                    |
|  us-west-2                    |
+-------------------------------+
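
The same listing can be produced from Python. This is a minimal sketch, assuming boto3 is installed and credentials are configured; `list_region_names` is a hypothetical helper name, and the client is passed in so the function can be exercised without network access.

```python
def list_region_names(ec2_client, include_disabled=True):
    """Return sorted Region names using the EC2 DescribeRegions API."""
    # AllRegions=True mirrors the CLI's --all-regions flag.
    response = ec2_client.describe_regions(AllRegions=include_disabled)
    return sorted(region["RegionName"] for region in response["Regions"])


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(list_region_names(boto3.client("ec2", region_name="us-east-1")))
```

Passing the client as an argument keeps the helper easy to test with a stub object.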
________________________________________
Q. 3 - Create a new IAM user with least privilege access to Amazon S3. Share your attached policies (JSON or screenshot)
A - The policy below gives a new IAM user least-privilege access to a single S3 bucket.
________________________________________
IAM User: s3-limited-user
Grants access to only one bucket, with basic list, get, put, and delete permissions.
________________________________________
Attached Policy (JSON)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::example-bucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::example-bucket/*"
    }
  ]
}
🔁 Replace example-bucket with your actual bucket name.
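
The policy can also be generated per bucket rather than edited by hand. The sketch below assumes boto3 for the attach step; `build_bucket_policy`, `attach_inline_policy`, and the policy name are hypothetical, and the IAM user is assumed to exist already.

```python
import json


def build_bucket_policy(bucket_name):
    """Build a least-privilege policy scoped to a single bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # ListBucket applies to the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arn},
            # Object-level actions apply to the keys inside the bucket.
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


def attach_inline_policy(iam_client, user_name, bucket_name):
    """Attach the policy inline to an existing IAM user (not called here)."""
    iam_client.put_user_policy(
        UserName=user_name,
        PolicyName=f"{bucket_name}-least-privilege",  # hypothetical policy name
        PolicyDocument=json.dumps(build_bucket_policy(bucket_name)),
    )


print(json.dumps(build_bucket_policy("example-bucket"), indent=2))
```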
________________________________________
Q. 4 - Compare different Amazon S3 storage (Standard, Intelligent-Tiering, Glacier). When should each be used in data analytics workflows?
A - Here’s a short comparison of key Amazon S3 storage classes and when to use them in data analytics workflows:
________________________________________
 S3 Storage Classes Comparison
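
| Storage Class          | Use Case                             | Access Frequency | Cost                  | Retrieval Time        |
| ---------------------- | ------------------------------------ | ---------------- | --------------------- | --------------------- |
| S3 Standard            | Frequently accessed data             | High             | $$$                   | Milliseconds          |
| S3 Intelligent-Tiering | Unknown/variable access patterns     | Varies           | $$ (with tiering fee) | Milliseconds to hours |
| S3 Glacier             | Long-term archive, infrequent access | Very low         | $                     | Minutes to hours      |
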
________________________________________
When to Use in Data Analytics

| Workflow Stage                   | Recommended Storage Class | Why                                                 |
| -------------------------------- | ------------------------- | --------------------------------------------------- |
| Active data processing           | S3 Standard               | Low latency, high availability                      |
| Logs/data with unpredictable use | S3 Intelligent-Tiering    | Auto-optimizes cost based on access patterns        |
| Historical or archived datasets  | S3 Glacier                | Cheap storage for cold data, retrieved occasionally |

Q. 5 - Create an S3 bucket and upload a sample dataset (CSV or JSON). Enable versioning and show at least two versions of one file.
✅ Short Version (AWS CLI)
1. Create S3 Bucket
aws s3api create-bucket --bucket my-sample-bucket-123 --region us-east-1
2. Enable Versioning
aws s3api put-bucket-versioning \
  --bucket my-sample-bucket-123 \
  --versioning-configuration Status=Enabled
3. Create and Upload Sample CSV
echo -e "id,name\n1,Alice" > data.csv
aws s3 cp data.csv s3://my-sample-bucket-123/data.csv
4. Update File (New Version)
echo -e "id,name\n1,Alice\n2,Bob" > data.csv
aws s3 cp data.csv s3://my-sample-bucket-123/data.csv
5. List File Versions
aws s3api list-object-versions --bucket my-sample-bucket-123 --prefix data.csv
This will show at least two versions of data.csv.
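
The version listing can be checked from Python as well. This is a sketch assuming a boto3 S3 client is passed in; `summarize_versions` is a hypothetical helper name.

```python
def summarize_versions(s3_client, bucket, key):
    """Return (version_id, is_latest) pairs for one object key."""
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    return [
        (version["VersionId"], version["IsLatest"])
        for version in response.get("Versions", [])
        if version["Key"] == key  # Prefix can match other keys, so filter exactly
    ]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(summarize_versions(boto3.client("s3"), "my-sample-bucket-123", "data.csv"))
```

After the two uploads above, the returned list should contain two entries, exactly one of which is marked as latest.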

## Q 6. Write and apply a lifecycle policy to move files to Glacier after 30 days and delete them after 90. Share the policy JSON or Screenshot.

Ans -

### ✅ S3 Lifecycle Policy (Short Version)

#### 📝 **Policy JSON**

```json
{
  "Rules": [
    {
      "ID": "MoveToGlacierThenDelete",
      "Status": "Enabled",
      "Filter": {
        "Prefix": ""
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 90
      }
    }
  ]
}
```

---

#### 🚀 **Apply Policy via AWS CLI**

```bash
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-sample-bucket-123 \
  --lifecycle-configuration file://lifecycle.json
```
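
The same rule can be built and sanity-checked in Python before it is applied. This is a sketch under the assumption that boto3 is used for the apply step; `build_lifecycle_config` and `apply_lifecycle` are hypothetical helper names.

```python
def build_lifecycle_config(transition_days=30, expiration_days=90, prefix=""):
    """Build a lifecycle configuration: move to Glacier, then delete."""
    if expiration_days <= transition_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "MoveToGlacierThenDelete",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to all objects
                "Transitions": [{"Days": transition_days, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": expiration_days},
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, config):
    """Apply the configuration with boto3 (not called here)."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = build_lifecycle_config()
print(config["Rules"][0]["Transitions"], config["Rules"][0]["Expiration"])
```
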
## Q 7. Compare RDS, DynamoDB, and Redshift for use in different stages of a data pipeline. Give one use case for each.
Ans -

### ✅ RDS vs DynamoDB vs Redshift – For Data Pipeline Stages (Short)

| Service      | Type              | Best For                          | Example Use Case                                |
| ------------ | ----------------- | --------------------------------- | ----------------------------------------------- |
| **RDS**      | Relational DB     | Structured transactional data     | **App database** for storing user data in OLTP  |
| **DynamoDB** | NoSQL (Key-Value) | High-speed, scalable reads/writes | **Real-time ingestion** of IoT sensor data      |
| **Redshift** | Data Warehouse    | Analytics, large-scale queries    | **BI/Reporting**: Analyze historical sales data |

### 🎯 Summary:

* **RDS** → Best for transactional workloads (OLTP).
* **DynamoDB** → Ideal for high-throughput, low-latency ingestion.
* **Redshift** → Designed for heavy analytics (OLAP) and reporting.


## Q 8. Create a DynamoDB table and insert 3 records manually. Then write a Lambda function that adds records when triggered by S3 uploads.

Ans -

### ✅ Step-by-Step (Short Version)

---

### **1. Create DynamoDB Table**

```bash
aws dynamodb create-table \
  --table-name MyDataTable \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```

---

### **2. Insert 3 Records Manually**

```bash
aws dynamodb put-item --table-name MyDataTable --item '{"id":{"S":"1"}, "name":{"S":"Alice"}}'
aws dynamodb put-item --table-name MyDataTable --item '{"id":{"S":"2"}, "name":{"S":"Bob"}}'
aws dynamodb put-item --table-name MyDataTable --item '{"id":{"S":"3"}, "name":{"S":"Charlie"}}'
```
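
The `--item` argument uses DynamoDB's typed JSON, where each value is wrapped in a type tag such as `S` (string) or `N` (number). The sketch below shows that mapping for a few simple types; `to_attribute_value` and `to_item` are hypothetical helpers written for illustration, and boto3 ships its own serializer (`boto3.dynamodb.types.TypeSerializer`) for real use.

```python
import json


def to_attribute_value(value):
    """Convert a Python value to DynamoDB typed JSON (S, N, BOOL only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are transmitted as strings
    if isinstance(value, str):
        return {"S": value}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def to_item(record):
    """Convert a flat dict into a DynamoDB item."""
    return {name: to_attribute_value(value) for name, value in record.items()}


records = [
    {"id": "1", "name": "Alice"},
    {"id": "2", "name": "Bob"},
    {"id": "3", "name": "Charlie"},
]
for record in records:
    print(json.dumps(to_item(record)))
```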

---

### **3. Lambda Function (Triggered by S3 Upload)**

#### 📝 Sample Lambda (Python)

```python
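import urllib.parse
import uuid

import boto3

# Create the table handle once so warm invocations can reuse it
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('MyDataTable')

def lambda_handler(event, context):
    for record in event['Records']:
        # Object keys arrive URL-encoded in S3 events (e.g. spaces become '+')
        file_name = urllib.parse.unquote_plus(record['s3']['object']['key'])
        table.put_item(Item={
            'id': str(uuid.uuid4()),  # random primary key for each upload
            'name': file_name
        })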
```

---

### **4. Connect Lambda to S3 Event**

* In S3 bucket → **Properties** → **Event notifications**
* Add event to trigger Lambda on **`PUT` (upload)**
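
Before wiring up the trigger, the key-parsing logic can be exercised locally with a hand-built event. This is a sketch: `make_s3_put_event` and `extract_keys` are hypothetical helpers, the event contains only the fields read here (real S3 events carry many more), and the timestamp is a made-up value.

```python
from urllib.parse import unquote_plus


def make_s3_put_event(bucket, encoded_key, size):
    """Build a minimal S3 upload event with only the fields used below."""
    return {
        "Records": [
            {
                "eventName": "ObjectCreated:Put",
                "eventTime": "2024-01-01T00:00:00.000Z",  # made-up timestamp
                "s3": {
                    "bucket": {"name": bucket},
                    "object": {"key": encoded_key, "size": size},
                },
            }
        ]
    }


def extract_keys(event):
    """Return the decoded object keys from an S3 event."""
    # Keys are URL-encoded in the event payload, so '+' stands for a space.
    return [unquote_plus(record["s3"]["object"]["key"]) for record in event["Records"]]


event = make_s3_put_event("my-sample-bucket-123", "reports/sales+2024.csv", 42)
print(extract_keys(event))  # ['reports/sales 2024.csv']
```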


## Q 9. What is serverless computing? Discuss pros and cons of using AWS Lambda for data pipelines.

Ans -

### ✅ What is Serverless Computing?

**Serverless computing** is a cloud execution model where the cloud provider **automatically manages infrastructure**—scaling, provisioning, and maintenance. You just write code; the platform runs it **on demand**, often in response to events.

---

### ⚙️ AWS Lambda for Data Pipelines

**AWS Lambda** is a serverless function service ideal for event-driven tasks like S3 uploads, data transforms, or stream processing.

---

### ✅ Pros of Using AWS Lambda in Data Pipelines

| Advantage                | Description                                        |
| ------------------------ | -------------------------------------------------- |
| **No server management** | No provisioning or patching of servers needed.     |
| **Auto-scaling**         | Scales automatically with load, up to the account's concurrency limit. |
| **Event-driven**         | Easily triggered by S3, DynamoDB, Kinesis, etc.    |
| **Cost-effective**       | Pay only for compute time used (ms-level billing). |
| **Quick deployment**     | Fast to build and deploy small, modular functions. |

---

### ❌ Cons of Using AWS Lambda

| Limitation            | Description                                                                           |
| --------------------- | ------------------------------------------------------------------------------------- |
| **Timeout limit**     | Max execution time is 15 minutes.                                                     |
| **Cold starts**       | Initial requests may be slow if idle.                                                 |
| **Limited compute**   | Memory & CPU limits restrict heavy processing.                                        |
| **Complex debugging** | Harder to debug compared to monolithic apps.                                          |
| **State handling**    | Stateless by design—external systems (like S3 or DynamoDB) needed for stateful logic. |

---

### 🎯 Best Use in Data Pipelines:

* **Ingestion**: Trigger on file upload (S3) or data stream (Kinesis).
* **Transformation**: Clean/normalize data before storing in a database.
* **Orchestration**: Light coordination using Step Functions.
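
One common way to live with the 15-minute timeout is to stop early and hand back the unfinished work. The sketch below assumes the standard Lambda context method `get_remaining_time_in_millis()`; `process_with_deadline`, `handle`, and the 10-second safety margin are hypothetical choices for illustration.

```python
def process_with_deadline(items, context, handle, safety_margin_ms=10_000):
    """Process items until the remaining execution time drops below the margin.

    Returns the unprocessed remainder so the caller can re-queue it.
    """
    for index, item in enumerate(items):
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return items[index:]  # not enough time left: stop cleanly
        handle(item)
    return []
```

The remainder could then be sent to a queue or passed to the next Step Functions state, depending on how the pipeline is orchestrated.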

## Q 10. Create a Lambda function triggered by S3 uploads that logs file name, size, and timestamp to CloudWatch. Share code and a log screenshot.

Ans -

### ✅ Lambda Function Triggered by S3 Upload (Short Version)

#### 📝 **Lambda Code (Python)**

```python
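import logging
import urllib.parse

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    for record in event['Records']:
        s3_obj = record['s3']['object']
        # Object keys arrive URL-encoded in S3 events (e.g. spaces become '+')
        file_name = urllib.parse.unquote_plus(s3_obj['key'])
        file_size = s3_obj.get('size', 0)  # fall back to 0 if the event carries no size
        event_time = record['eventTime']   # timestamp set by S3 for the event

        # Log records from the handler are delivered to CloudWatch Logs
        logger.info(f"New S3 Upload: File = {file_name}, Size = {file_size} bytes, Time = {event_time}")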
```

---

### 🔗 **Trigger Setup**

* Go to your **S3 bucket** → **Properties** → **Event notifications**.
* Add a notification:

  * **Event type**: PUT
  * **Destination**: Lambda function (choose the one above)
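
The console steps above correspond to a bucket notification configuration, which can also be set from code. This is a sketch assuming boto3; `build_notification_config` and `apply_notification` are hypothetical helpers, the `.csv` suffix filter is an illustrative choice, and the Lambda function is assumed to already allow invocation by S3.

```python
def build_notification_config(lambda_arn, prefix="", suffix=".csv"):
    """Build an S3 notification configuration that invokes Lambda on PUT uploads."""
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})

    config = {"LambdaFunctionArn": lambda_arn, "Events": ["s3:ObjectCreated:Put"]}
    if rules:
        config["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [config]}


def apply_notification(s3_client, bucket, lambda_arn):
    """Apply the configuration with boto3 (not called here)."""
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```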


## Q 11. Use AWS Glue to crawl your S3 dataset, create a Data Catalog table, and run a Glue job to convert CSV data to parquet. Share job code and output location.

Ans -

### ✅ AWS Glue: CSV ➡️ Parquet Conversion (Short Version)

---

### **1. Upload CSV to S3**

Example:

```
s3://my-data-bucket/raw/customers.csv
```

---

### **2. Create Glue Crawler**

* **Source**: S3 path above
* **Output**: Glue Data Catalog DB (e.g., `raw_db`)
* **Table**: Created automatically (e.g., `customers`)

---

### **3. Create Glue Job to Convert CSV to Parquet**

#### 📝 Glue Job Script (PySpark)

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Initialize the job so Glue can track this run
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Load the CSV table that the crawler registered in the Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="customers"
)

# Write as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={"path": "s3://my-data-bucket/parquet/customers/"},
    format="parquet"
)

# Mark the run as complete
job.commit()
```

---

### 📦 **Output Location**

```
s3://my-data-bucket/parquet/customers/
```
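
Once the job exists, it can be started and monitored from Python. This is a sketch assuming a boto3 Glue client is passed in; `run_glue_job` is a hypothetical helper, and the job name you pass is whatever you called the job in the console.

```python
import time

# States after which a Glue job run will not change any further
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def run_glue_job(glue_client, job_name, poll_seconds=30):
    """Start a Glue job run and poll until it reaches a terminal state."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(run_glue_job(boto3.client("glue"), "<your-job-name>"))
```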

## Q 12. Explain the difference between Kinesis Data Streams, Kinesis Firehose, and Kinesis Data Analytics. Provide a real-world example of how each would be used.
Ans -

### ✅ Kinesis Services – Quick Comparison & Real-World Use Cases

| Service                    | Purpose                                                               | Real-World Example                     |
| -------------------------- | --------------------------------------------------------------------- | -------------------------------------- |
| **Kinesis Data Streams**   | Real-time data ingestion (millisecond latency)                        | Ingest clickstream data from a website |
| **Kinesis Firehose**       | Load streaming data to S3, Redshift, or OpenSearch (fully managed)    | Auto-deliver IoT data to S3 in Parquet |
| **Kinesis Data Analytics** | Run SQL on streaming data (real-time processing)                      | Detect anomalies in stock prices live  |

---

### 🧠 Summary:

* **Data Streams** = Fast, customizable **stream ingestion**
* **Firehose** = Easy **stream-to-storage delivery**
* **Data Analytics** = **SQL + stream processing** in real-time
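
For the Data Streams clickstream example, a producer sends one record per event with a partition key that decides which shard receives it. The sketch below assumes boto3 for the send step; `build_click_record`, `send_click`, the stream name, and the payload fields are hypothetical.

```python
import json


def build_click_record(stream_name, user_id, page):
    """Build the arguments for a Kinesis PutRecord call."""
    payload = {"user_id": user_id, "page": page}
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),  # record data is sent as bytes
        "PartitionKey": user_id,  # same user -> same shard, preserving per-user order
    }


def send_click(kinesis_client, stream_name, user_id, page):
    """Send one click event with boto3 (not called here)."""
    return kinesis_client.put_record(**build_click_record(stream_name, user_id, page))


print(build_click_record("clickstream-demo", "user-42", "/pricing"))
```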

## Q 12. What is columnar storage and how does it benefit Redshift performance for analytics workloads?

Ans -

### ✅ What is Columnar Storage?

**Columnar storage** stores data **by column** (not by row), so all values in a column are stored together.

---

### ⚡ Benefits for Redshift Analytics:

| Benefit                | Why It Matters                             |
| ---------------------- | ------------------------------------------ |
| **Faster queries**     | Reads only needed columns, not entire rows |
| **Better compression** | Similar column values compress more easily |
| **Efficient scans**    | Ideal for large, read-heavy analytics      |

---

### 🎯 Summary:

Columnar storage in Redshift boosts performance by reducing I/O and improving compression—perfect for analytical workloads like BI dashboards or data warehousing.
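
The idea can be illustrated in plain Python. This is a toy sketch, not how Redshift is implemented: it only shows that a columnar layout lets an aggregate touch one column, and that repeated values in a column compress well with a simple run-length encoding. The `region` column is made-up sample data.

```python
# Row layout: each record is stored together.
rows = [(1, "Alice", 30), (2, "Bob", 25), (3, "Carol", 28)]

# Columnar layout: all values of one column are stored together.
names = ("id", "name", "age")
columns = {name: list(values) for name, values in zip(names, zip(*rows))}

# An aggregate over one column reads only that column's values.
average_age = sum(columns["age"]) / len(columns["age"])


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded


region_column = ["us", "us", "us", "eu", "eu"]  # made-up sample column
print(columns["age"], round(average_age, 2))
print(run_length_encode(region_column))
```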

## Q 13.

### ✅ Load CSV from S3 into Redshift (Short)

---

### 1. **Table Schema**

```sql
CREATE TABLE customers (
  id INT,
  name VARCHAR(50),
  age INT
);
```

---

### 2. **COPY Command**

```sql
COPY customers
FROM 's3://my-data-bucket/data/customers.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
IGNOREHEADER 1
REGION 'us-east-1';
```

---

### 3. **Sample Query & Output**

```sql
SELECT * FROM customers LIMIT 3;
```

| id | name  | age |
| -- | ----- | --- |
| 1  | Alice | 30  |
| 2  | Bob   | 25  |
| 3  | Carol | 28  |
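
The COPY statement can also be assembled and submitted from Python. This is a sketch assuming the Redshift Data API client from boto3 (`boto3.client("redshift-data")`); `build_copy_sql` and `run_copy` are hypothetical helpers, and the string formatting is only suitable for trusted, hard-coded values.

```python
def build_copy_sql(table, s3_uri, iam_role_arn, region="us-east-1"):
    """Assemble a COPY statement for a CSV file with one header row."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CSV\n"
        "IGNOREHEADER 1\n"
        f"REGION '{region}';"
    )


def run_copy(redshift_data_client, cluster_id, database, db_user, sql):
    """Submit the statement through the Redshift Data API (not called here)."""
    response = redshift_data_client.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )
    return response["Id"]  # statement ID, usable for checking status later


print(build_copy_sql(
    "customers",
    "s3://my-data-bucket/data/customers.csv",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
))
```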


## Q 14. What is the role of the AWS Glue Data Catalog in Athena? How does schema-on-read work?

Ans -

### ✅ AWS Glue Data Catalog & Athena

* **Glue Data Catalog**: Acts as a **central metadata repository** that stores table definitions, schemas, and locations of your data.
* **Athena** uses this catalog to understand the structure of your data stored in S3 and run SQL queries without moving data.

---

### ⚡ Schema-on-Read

* Data is **not transformed or structured upfront**.
* Schema is applied **only when you run a query**.
* Allows querying **raw data directly**, providing flexibility and faster ingestion.

---

### 🎯 Summary:

Glue Data Catalog tells Athena **how to interpret the data**, while schema-on-read lets you query raw data dynamically without prior loading or transformation.
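
Schema-on-read can be illustrated with a small local analogy. In this sketch the raw text plays the role of a file in S3 and the `SCHEMA` dict plays the role of a catalog table definition; both names, and `read_with_schema`, are hypothetical, and this is not how Athena is implemented.

```python
import csv
import io

# Raw data is stored as-is; nothing is typed or validated when it is written.
RAW_CSV = "id,name,age\n1,Alice,30\n2,Bob,25\n"

# The schema lives separately from the data, like a catalog table definition.
SCHEMA = {"id": int, "name": str, "age": int}


def read_with_schema(raw_text, schema):
    """Apply column types only at read time."""
    reader = csv.DictReader(io.StringIO(raw_text))
    for row in reader:
        yield {column: cast(row[column]) for column, cast in schema.items()}


print(list(read_with_schema(RAW_CSV, SCHEMA)))
# A different schema can be applied to the same stored bytes.
print(list(read_with_schema(RAW_CSV, {"name": str})))
```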






**Create an Athena table from S3 data using Glue Catalog. Run a query and share the SQL + result screenshot.**

The walkthrough below creates an Athena table through the Glue Data Catalog and runs a query against it. The result is shown as a table rather than a screenshot.

---

### 1. **Create Glue Crawler**

* Point Glue Crawler to your S3 data location (e.g., `s3://my-data-bucket/data/`)
* Run the crawler to create a table in Glue Data Catalog (e.g., database `mydb`, table `customers`)

---

### 2. **Create Athena Table (If needed manually)**

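To define the table manually instead of using a crawler, run a DDL statement like the following in Athena; it registers an external table in the Glue Data Catalog. If the CSV files include a header row, add `'skip.header.line.count'='1'` to `TBLPROPERTIES`: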

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.customers (
  id int,
  name string,
  age int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
LOCATION 's3://my-data-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false');
```

---

### 3. **Run a Sample Query**

```sql
SELECT * FROM mydb.customers LIMIT 5;
```

---

### 4. **Sample Query Result**

| id | name  | age |
| -- | ----- | --- |
| 1  | Alice | 30  |
| 2  | Bob   | 25  |
| 3  | Carol | 28  |
| 4  | Dave  | 35  |
| 5  | Eva   | 22  |
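
The same query can be submitted programmatically. This is a sketch assuming a boto3 Athena client; `build_query_request` and `start_query` are hypothetical helpers, and the results location `s3://my-data-bucket/athena-results/` is an assumed path, not one from the steps above.

```python
def build_query_request(sql, database, output_location):
    """Build the arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(athena_client, sql, database, output_location):
    """Submit the query with boto3 and return its execution ID (not called here)."""
    request = build_query_request(sql, database, output_location)
    return athena_client.start_query_execution(**request)["QueryExecutionId"]


request = build_query_request(
    "SELECT * FROM mydb.customers LIMIT 5;",
    "mydb",
    "s3://my-data-bucket/athena-results/",  # assumed results location
)
print(request)
```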


**Describe how Amazon QuickSight supports business intelligence in a serverless data architecture. What are SPICE and embedded dashboards?**

### Amazon QuickSight in Serverless BI (Short)

* **QuickSight** is a fully **serverless BI service** that lets you create interactive dashboards and visualizations without managing infrastructure.
* It integrates seamlessly with serverless data sources like **S3, Athena, and Redshift Spectrum**.
* Automatically scales and charges per usage, making BI accessible and cost-efficient.

---

### What is SPICE?

* **SPICE** (Super-fast, Parallel, In-memory Calculation Engine) is QuickSight’s **in-memory engine**; it speeds up analysis by holding imported data in a fast, distributed store.
* Enables quick, scalable querying without hitting the original data source every time.

---

### What are Embedded Dashboards?

* **Embedded dashboards** allow you to **integrate QuickSight visualizations inside your own apps or portals**.
* Provide seamless, secure BI access to end-users without leaving your application.


**Connect QuickSight to Athena or Redshift and build a dashboard with at least one calculated field and one filter. Share a screenshot of your final dashboard.**

The steps below connect QuickSight to Athena or Redshift and build the dashboard; no screenshot is included.

---

### Quick Steps to Connect QuickSight → Athena/Redshift & Build Dashboard

1. **Connect Data Source**

   * In QuickSight, go to **Manage data** → **New data set**
   * Choose **Athena** or **Redshift**
   * Provide connection details and select your database/table

2. **Prepare Data**

   * Import or Direct Query your data
   * Add a **calculated field**, e.g.
     `Revenue_Per_Customer = total_revenue / customer_count`

3. **Build Analysis**

   * Create visuals (bar charts, tables, etc.)
   * Add a **filter**, e.g., filter by `Region` or `Date`

4. **Publish Dashboard**

   * Save and publish your dashboard for sharing


**Explain how Amazon CloudWatch and AWS CloudTrail differ. In a data analytics pipeline, what role does each play in monitoring, auditing, and troubleshooting?**

### AWS CloudWatch vs CloudTrail (Short)

| Feature                       | CloudWatch                                      | CloudTrail                                                |
| ----------------------------- | ----------------------------------------------- | --------------------------------------------------------- |
| **Purpose**                   | Monitoring & operational metrics                | Auditing API calls & user activity                        |
| **Data**                      | Logs, metrics, alarms                           | Records AWS API calls & events                            |
| **Use case in data pipeline** | Track performance (latency, errors), set alerts | Audit who changed resources, track API usage for security |

---

### Roles in Data Analytics Pipeline

* **CloudWatch**: Monitors pipeline health — CPU usage, Lambda errors, processing delays; triggers alerts for failures.
* **CloudTrail**: Audits user actions and API calls — who started/stopped jobs, data access, configuration changes; helps troubleshoot security or compliance issues.
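
The two roles map to different API calls. The sketch below builds the arguments for publishing a custom CloudWatch metric and for looking up CloudTrail events by name, assuming boto3 for the actual calls; the helper names, the `DataPipeline` namespace, and the metric and dimension names are hypothetical.

```python
def build_metric_request(namespace, metric_name, value, pipeline):
    """Arguments for CloudWatch PutMetricData: one counter-style data point."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                "Value": value,
                "Unit": "Count",
                "Dimensions": [{"Name": "Pipeline", "Value": pipeline}],
            }
        ],
    }


def build_trail_lookup(event_name):
    """Arguments for CloudTrail LookupEvents: filter by API call name."""
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ]
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       **build_metric_request("DataPipeline", "RowsProcessed", 1200, "customers"))
#   boto3.client("cloudtrail").lookup_events(**build_trail_lookup("StartJobRun"))
print(build_metric_request("DataPipeline", "RowsProcessed", 1200, "customers"))
print(build_trail_lookup("StartJobRun"))
```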


**Describe a complete end-to-end data analytics pipeline using AWS services. Include services for data ingestion, storage, transformation, querying, and visualization (example: S3 → Lambda → Glue → QuickSight). Explain why you would choose each service for the stage it’s used in.**

 Here’s a concise end-to-end AWS data analytics pipeline example:

---

### End-to-End Data Analytics Pipeline

| Stage              | AWS Service           | Why Choose It?                                                                              |
| ------------------ | --------------------- | ------------------------------------------------------------------------------------------- |
| **Ingestion**      | **Amazon Kinesis**    | Real-time, scalable streaming ingestion for fast-moving data.                               |
| **Storage**        | **Amazon S3**         | Durable, cost-effective object storage for raw and processed data.                          |
| **Transformation** | **AWS Glue**          | Serverless ETL service to clean, enrich, and catalog data automatically.                    |
| **Querying**       | **Amazon Athena**     | Serverless, pay-per-query SQL analytics directly on S3 data without ETL.                    |
| **Visualization**  | **Amazon QuickSight** | Scalable, serverless BI tool that integrates easily with Athena for interactive dashboards. |

---

### Why this architecture?

* **Kinesis** handles real-time ingestion with low latency.
* **S3** stores data cheaply and reliably at scale.
* **Glue** automates schema discovery & data transformation.
* **Athena** enables flexible, on-demand querying without infrastructure.
* **QuickSight** empowers users with fast, shareable visual insights without setup overhead.
