# AWS Guide

## How do you ensure the security and compliance of data when using AWS?

- Data Encryption: Encrypt data at rest with keys managed in AWS KMS (for example, SSE-KMS on S3, EBS, and RDS), and protect data in transit with TLS.
- IAM Policies: Implement the principle of least privilege by using fine-grained IAM policies.
- VPC Security: Utilize VPCs with security groups, NACLs, and private subnets to control network access.
- Logging and Monitoring: Enable CloudTrail, AWS Config, and CloudWatch to track and audit activities.
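
As a minimal sketch of the least-privilege bullet above, the policy document below grants read-only access to a single S3 prefix and nothing else. The function name `build_read_only_policy`, the bucket name, and the prefix are hypothetical placeholders; the policy would still need to be attached to a role or user in IAM.

```python
import json


def build_read_only_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document granting read-only access to one S3 prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is limited to keys under the given prefix.
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Objects can be read but not written or deleted.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, used only for illustration.
    print(json.dumps(build_read_only_policy("your-source-bucket", "reports"), indent=2))
```

Building the document in code makes it easy to generate one narrow policy per team or pipeline rather than sharing a broad one.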

## How would you set up an ETL pipeline using AWS Glue?

#### Create a Crawler:
- Define a data source (e.g., S3, RDS) and set up a Glue Crawler to automatically detect and catalog the schema of the source data.

#### Set Up a Glue Job:
- Create a Glue job to transform the data. Use the Glue Studio visual interface or write custom PySpark scripts to handle the ETL logic.

#### Define the Data Transformation:
- Use the Glue ETL job to clean, enrich, or aggregate the data as needed. You can use built-in transformations or custom scripts.

#### Specify the Destination:
- Specify the target data store (e.g., S3, Redshift) for the transformed data. The Glue job will write the processed data to this location.

#### Schedule and Orchestrate:
- Schedule the Glue job using triggers or AWS Step Functions to automate the ETL process. You can set the job to run on a schedule or in response to specific events.

#### Monitor and Debug:
- Use AWS CloudWatch Logs and Glue job metrics to monitor the ETL process and troubleshoot any issues that arise.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Initialize the Glue context and the job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Step 1: Load data from S3 into a DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-source-bucket/path/"]},
    format="json"  # Example format; adjust according to your data format
)

# Step 2: Data transformation (here, keep only records whose status is "active")
filtered_frame = Filter.apply(frame=datasource0, f=lambda x: x["status"] == "active")

# Step 3: Write the transformed data back to S3 as Parquet
datasink = glueContext.write_dynamic_frame.from_options(
    frame=filtered_frame,
    connection_type="s3",
    connection_options={"path": "s3://your-target-bucket/path/"},
    format="parquet"
)

# Commit the job so that Glue records its completion (and job bookmarks, if enabled)
job.commit()
```
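
The scheduling step described earlier can be sketched with the Glue `create_trigger` API. This is an illustration under assumptions, not a definitive setup: the helper names, the job name, and the trigger naming scheme are hypothetical, and the actual call needs `boto3` plus valid AWS credentials.

```python
def build_schedule_trigger(job_name: str, cron_expression: str) -> dict:
    """Build keyword arguments for the Glue CreateTrigger API call."""
    return {
        "Name": f"{job_name}-schedule",  # hypothetical naming scheme
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_schedule_trigger(job_name: str, cron_expression: str) -> dict:
    """Create the trigger in AWS; requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    return glue.create_trigger(**build_schedule_trigger(job_name, cron_expression))
```

For example, `create_schedule_trigger("nightly-etl", "0 2 * * ? *")` would ask Glue to run a job named `nightly-etl` every day at 02:00 UTC.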


## What strategies have you implemented to optimize cost while using AWS services?

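- Pipe Mode: Use SageMaker Pipe input mode to stream training data directly from S3 instead of first copying it to the training instance's storage, which shortens startup time and reduces the storage that must be provisioned.
- Data Format: Convert training data to the RecordIO-protobuf format, which SageMaker's built-in algorithms read efficiently in Pipe mode.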
- Auto Scaling: Use Auto Scaling groups to automatically scale resources up or down based on demand.
- Spot Instances: Utilize EC2 Spot Instances for non-critical, flexible workloads to benefit from lower pricing.
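- Elastic Inference: Use Elastic Inference to attach just the right amount of inference acceleration to your EC2 or SageMaker instances, reducing the need for more powerful and expensive GPUs. Note that AWS has stopped onboarding new customers to Elastic Inference, so newer accounts would typically evaluate AWS Inferentia or smaller GPU instances instead.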
- Distributed Training: Optimize distributed training across multiple, lower-cost GPUs instead of using a single, high-cost GPU instance.
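
As a rough sketch of how the Pipe mode, RecordIO-protobuf, and Spot bullets above combine in SageMaker, the helper below builds the relevant part of a `create_training_job` request. The function name, channel name, and S3 URI are hypothetical; the result would have to be merged with the remaining required parameters (job name, algorithm specification, role, output and resource configuration), and not every algorithm supports Pipe mode.

```python
def build_training_overrides(s3_uri: str, max_runtime_s: int, max_wait_s: int) -> dict:
    """Build the cost-related part of a SageMaker CreateTrainingJob request."""
    # With managed Spot training, the wait time covers runtime plus time spent
    # waiting for Spot capacity, so it cannot be shorter than the runtime.
    if max_wait_s < max_runtime_s:
        raise ValueError("max_wait_s must be at least max_runtime_s")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",  # hypothetical channel name
                "InputMode": "Pipe",  # stream from S3 instead of downloading first
                "ContentType": "application/x-recordio-protobuf",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
    }
```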


## Describe your experience with deploying models using AWS Lambda.

- Model Preparation: I started by training a machine learning model on a separate environment. Once trained, I serialized the model into a format suitable for deployment, such as a .pkl or .h5 file.
- Creating a Lambda Layer: I packaged all the necessary dependencies (like numpy, scikit-learn, or tensorflow) along with the serialized model into a ZIP file. This ZIP file was then uploaded to AWS Lambda as a Layer, which allows the code in the Lambda function to use these libraries and model files without bundling them directly with the function's deployment package. Because the function and its layers must fit within Lambda's 250 MB unzipped limit, heavier frameworks such as TensorFlow typically require a container image instead.
- Writing the Lambda Function: The Lambda function was written in Python. It included code to deserialize the model and use it to make predictions. The function handled JSON input, where data to be predicted was passed, and JSON output, which contained the prediction results.
- API Gateway Integration: I used AWS API Gateway to create a RESTful endpoint that triggered the Lambda function. This setup allowed external applications to send data to the Lambda function via HTTP requests and receive predictions in response.
- Deployment and Testing: After deploying the Lambda function and configuring the API Gateway, I tested the setup by sending HTTP requests with test data. I monitored the execution and performance via AWS CloudWatch to ensure that the function was running efficiently and within the resource limits.
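
The function described above might look roughly like the following sketch. It assumes an API Gateway proxy integration, a pickled model shipped at the root of a layer (layer contents are extracted under `/opt`), and a request body of the form `{"features": [[...], ...]}`; the path `MODEL_PATH`, the `features` key, and the helper names are illustrative assumptions rather than the original implementation.

```python
import json
import pickle

MODEL_PATH = "/opt/model.pkl"  # assumption: model file sits at the root of a layer
_model = None


def get_model():
    """Load the model once per execution environment and reuse it on warm invocations."""
    global _model
    if _model is None:
        with open(MODEL_PATH, "rb") as fh:
            _model = pickle.load(fh)
    return _model


def predict_from_event(event: dict, model) -> dict:
    """Turn an API Gateway proxy event into a proxy-style JSON response."""
    try:
        payload = json.loads(event.get("body") or "{}")
        features = payload["features"]
    except (ValueError, KeyError, TypeError):
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "expected a JSON body with a 'features' key"}),
        }
    predictions = model.predict(features)
    # NumPy scalars are not JSON serializable, so convert them to plain Python values.
    plain = [p.item() if hasattr(p, "item") else p for p in predictions]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"predictions": plain}),
    }


def lambda_handler(event, context):
    """Lambda entry point: delegate to the testable helper with the cached model."""
    return predict_from_event(event, get_model())
```

Keeping the prediction logic in `predict_from_event` lets it be tested locally with a stand-in model, while caching the model in a module-level variable avoids reloading it on every warm invocation.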

# Data Management

## Can you discuss a scenario where you had to move data securely between on-premise and AWS cloud?
- We used AWS Direct Connect to establish a dedicated network connection from our on-premise network to AWS. This bypassed the public internet, reducing exposure to security threats and improving transfer speeds.
- We used AWS Database Migration Service (AWS DMS), a managed migration and replication service, to move our databases and analytics workloads to AWS securely.

## How do you handle large datasets in AWS S3?
- Multipart Upload: Use multipart upload for large files to enhance upload performance and reliability.
- Lifecycle Management: Implement S3 lifecycle policies to automatically transition data to more cost-effective storage tiers (like S3 Glacier) and manage data retention.
- S3 Select: Use S3 Select to retrieve specific data from objects, reducing data transfer costs and improving efficiency.
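- Prefixes and Indexing: Organize objects under logical key prefixes (for example, by source and date) so that listings and queries can target a subset of the data, and catalog the data (for example, with the Glue Data Catalog) so that jobs can locate it without listing the whole bucket.
- Data Transfer Tools: Utilize AWS DataSync or S3 Transfer Acceleration for faster data transfer between on-premises systems and S3.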
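
To make the lifecycle bullet above concrete, here is a minimal sketch of a rule that moves objects under one prefix to S3 Glacier and later expires them. The helper names, prefix, and day counts are hypothetical, and applying the rule requires `boto3` and credentials with permission to change the bucket's lifecycle configuration.

```python
def build_lifecycle_configuration(prefix: str, glacier_after_days: int, expire_after_days: int) -> dict:
    """Build a lifecycle configuration that archives, then expires, objects under a prefix."""
    if expire_after_days <= glacier_after_days:
        raise ValueError("expire_after_days must be greater than glacier_after_days")
    rule_id = "archive-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to S3 Glacier once they are old enough.
                "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
                # Delete objects after the retention period.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_lifecycle(bucket: str, configuration: dict) -> None:
    """Apply the configuration to a bucket; requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )
```

For example, `apply_lifecycle("your-target-bucket", build_lifecycle_configuration("logs/", 90, 365))` would archive objects under `logs/` after 90 days and delete them after a year.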

# Python Proficiency