## How do you ensure the security and compliance of data when using AWS?

- Data Encryption: Use AWS KMS to manage the keys that encrypt data at rest (e.g., in S3, EBS, or RDS), and enforce TLS to protect data in transit.
- IAM Policies: Implement the principle of least privilege by using fine-grained IAM policies.
- VPC Security: Utilize VPCs with security groups, NACLs, and private subnets to control network access.
- Logging and Monitoring: Enable CloudTrail to record API calls, AWS Config to track resource configuration changes, and CloudWatch to collect metrics, logs, and alarms for auditing.
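
To make the least-privilege point concrete, here is a minimal sketch of a fine-grained IAM policy document built in Python. The helper name `read_only_prefix_policy`, the bucket name, and the prefix are illustrative assumptions; a real policy would be tailored to the actions your workload needs.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 prefix.

    Hypothetical helper for illustration; bucket and prefix are placeholders.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is allowed only for keys under the given prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads are limited to the same prefix; no write actions.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_prefix_policy("your-source-bucket", "path")
print(json.dumps(policy, indent=2))
```

The resulting JSON can be attached to a role or user; anything not explicitly allowed (writes, deletes, other prefixes) stays denied by default.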

## How would you set up an ETL pipeline using AWS Glue?

#### Create a Crawler
- Define a data source (e.g., S3, RDS) and set up a Glue Crawler to automatically detect and catalog the schema of the source data.
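
As a rough sketch, the crawler can also be created programmatically with boto3's Glue client. The crawler name, role ARN, and database name below are hypothetical placeholders, and the helper functions are illustrative rather than part of any library.

```python
def crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the Glue CreateCrawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,              # IAM role the crawler assumes
        "DatabaseName": database,      # Data Catalog database for the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        # Optional cron expression, e.g. "cron(0 1 * * ? *)"
        request["Schedule"] = schedule
    return request


def create_and_start_crawler(request):
    """Create the crawler and run it once (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are placeholders for illustration.
request = crawler_request(
    name="source-data-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="raw_db",
    s3_path="s3://your-source-bucket/path/",
)
print(request)
```

Calling `create_and_start_crawler(request)` would register the crawler and trigger a first run, after which the detected tables appear in the Data Catalog database.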

#### Set Up a Glue Job
- Create a Glue job to transform the data. Use the Glue Studio visual interface or write custom PySpark scripts to handle the ETL logic.
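
If you prefer to define the job in code rather than in Glue Studio, a sketch along these lines could work with boto3. The job name, role ARN, script location, Glue version, and worker settings are assumptions chosen for illustration.

```python
def job_request(name, role_arn, script_location, workers=2):
    """Build keyword arguments for the Glue CreateJob API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job type
            "ScriptLocation": script_location,  # S3 path of the PySpark script
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",       # assumed version; pick one your account supports
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
    }


def create_job(request):
    """Register the job with Glue (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("glue").create_job(**request)


# Placeholder values for illustration only.
etl_job = job_request(
    name="filter-active-records",
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",
    script_location="s3://your-scripts-bucket/jobs/filter_active.py",
)
print(etl_job)
```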

#### Define the Data Transformation
- Use the Glue ETL job to clean, enrich, or aggregate the data as needed. You can use built-in transformations (e.g., `ApplyMapping`, `Filter`, `DropNullFields`) or custom scripts.

#### Specify the Destination
- Specify the target data store (e.g., S3, Redshift) for the transformed data. The Glue job will write the processed data to this location.

#### Schedule and Orchestrate
- Schedule the Glue job using triggers or AWS Step Functions to automate the ETL process. You can set the job to run on a schedule or in response to specific events.
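
For the trigger-based option, a minimal sketch using boto3's `create_trigger` might look like the following. The trigger name, job name, and cron expression are hypothetical; Glue schedules are evaluated in UTC.

```python
def scheduled_trigger_request(trigger_name, job_name, cron_expression):
    """Build keyword arguments for a time-based Glue trigger."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],  # job(s) started when the trigger fires
        "StartOnCreation": True,             # activate the trigger immediately
    }


def create_trigger(request):
    """Create the trigger in Glue (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("glue").create_trigger(**request)


# Run the (hypothetical) job every day at 02:00 UTC.
nightly = scheduled_trigger_request(
    trigger_name="nightly-etl",
    job_name="filter-active-records",
    cron_expression="0 2 * * ? *",
)
print(nightly)
```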

#### Monitor and Debug
- Use Amazon CloudWatch Logs and Glue job metrics to monitor the ETL process and troubleshoot any issues that arise.
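
Alongside CloudWatch, job run status can be checked from code. The sketch below filters a `get_job_runs` response for unsuccessful runs; the sample response is made-up data shaped like the API output, and the helper names are illustrative.

```python
def failed_runs(response):
    """Return (run id, error message) pairs for runs that did not succeed."""
    bad_states = {"FAILED", "TIMEOUT", "ERROR"}
    return [
        (run["Id"], run.get("ErrorMessage", ""))
        for run in response.get("JobRuns", [])
        if run.get("JobRunState") in bad_states
    ]


def fetch_failed_runs(job_name):
    """Query Glue for recent runs of a job (requires AWS credentials)."""
    import boto3  # imported lazily so failed_runs works without AWS access

    glue = boto3.client("glue")
    return failed_runs(glue.get_job_runs(JobName=job_name, MaxResults=25))


# Fabricated sample shaped like a get_job_runs response, for illustration only.
sample_response = {
    "JobRuns": [
        {"Id": "jr_1", "JobRunState": "SUCCEEDED"},
        {"Id": "jr_2", "JobRunState": "FAILED", "ErrorMessage": "Out of memory"},
        {"Id": "jr_3", "JobRunState": "RUNNING"},
    ]
}
print(failed_runs(sample_response))
```

The run IDs returned here can then be used to locate the matching log streams in CloudWatch Logs.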

The following Glue job script reads JSON data from S3, keeps only the records whose `status` is `active`, and writes the result to S3 as Parquet:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve job arguments and initialize the Spark and Glue contexts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Step 1: Load data from S3 into a DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-source-bucket/path/"]},
    format="json",  # Example format; adjust according to your data format
)

# Step 2: Transform the data (here, keep only records with status == "active")
active_records = Filter.apply(
    frame=datasource0,
    f=lambda record: record["status"] == "active",
)

# Step 3: Write the transformed data back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=active_records,
    connection_type="s3",
    connection_options={"path": "s3://your-target-bucket/path/"},
    format="parquet",
)

# Commit the job so Glue records the run (and job bookmarks, if enabled)
job.commit()
```
