**DATA 612 Project 6: Hands-On with AWS**

**Gullit Navarrete**

**7/13/25**

**Introduction**

This project takes the recommender system I built on the MovieLens 1M ratings and deploys it end-to-end in AWS, moving from a local notebook into a secure, serverless cloud setup. First, I demonstrate long-term storage by uploading the data to S3. Next, I deploy recommendation code as an AWS Lambda function and invoke it straight from Colab; the handler here is a small placeholder that returns dummy recommendations, standing in for code that would pull data from S3, run the scripts, and write outputs. Finally, I set up a VPC with an S3 gateway endpoint so that traffic between Lambda and the bucket can stay inside the AWS network.


**Data Import: via Download and Unpacking**

As in my previous project assignments, I use the MovieLens movie ratings dataset, this time the MovieLens 1M ratings rather than the earlier 100K. The cell below downloads and unpacks the MovieLens zip file so the steps can be replicated.

In [None]:
# Download
!wget -q http://files.grouplens.org/datasets/movielens/ml-1m.zip
# Unpack (-o overwrites existing files so a re-run does not stop at a prompt)
!unzip -q -o ml-1m.zip
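As a quick sanity check on the unpacked files, the sketch below parses lines in the `UserID::MovieID::Rating::Timestamp` layout that the ML-1M README describes for ratings.dat. The `Rating` tuple and the `parse_rating` helper are illustrative names, not part of the original notebook.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line from ratings.dat."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# Example line in the ratings.dat layout
sample = parse_rating("1::1193::5::978300760")
print(sample)
```

Applied to the whole file, this could look like `ratings = [parse_rating(line) for line in open("ml-1m/ratings.dat")]`, assuming the archive was unpacked into `ml-1m/` as above.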

**Long Term Storage:**

For long-term storage, I upload the MovieLens files to an S3 bucket (which I called data612project6movielens1m). The cell installs and imports the AWS SDK for Python (boto3), then sets my AWS credentials, target region, and bucket name. It builds an S3 client with those values and loops over the three MovieLens data files (ratings.dat, users.dat, movies.dat), uploading each one. The output confirms that the raw data now lives in S3 as the project's long-term storage.

In [None]:
!pip install --quiet boto3
import os
import boto3

# Read credentials from the environment instead of hardcoding them in the notebook
AWS_KEY = os.environ["AWS_ACCESS_KEY_ID"]         # Access key ID
AWS_SECRET = os.environ["AWS_SECRET_ACCESS_KEY"]  # Secret access key
REGION = "us-east-2"
BUCKET = "data612project6movielens1m"

s3 = boto3.client(
    "s3",
    aws_access_key_id = AWS_KEY,
    aws_secret_access_key = AWS_SECRET,
    region_name = REGION
)

for fname in ["ratings.dat", "users.dat", "movies.dat"]:
    local_path = f"ml-1m/{fname}"
    print(f"Uploading {local_path} → s3://{BUCKET}/{fname}")
    s3.upload_file(local_path, BUCKET, fname)

print("This shows that MovieLens 1M Ratings were uploaded.")

Uploading ml-1m/ratings.dat → s3://data612project6movielens1m/ratings.dat
Uploading ml-1m/users.dat → s3://data612project6movielens1m/users.dat
Uploading ml-1m/movies.dat → s3://data612project6movielens1m/movies.dat
This shows that MovieLens 1M Ratings were uploaded.
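The printed messages only show that `upload_file` was called, so one way to confirm the objects actually exist is to list the bucket and compare keys. This is a minimal sketch: `missing_keys` is a hypothetical helper, and it assumes the S3 client and bucket name from the cell above.

```python
def missing_keys(s3_client, bucket, expected_keys):
    """Return the expected keys that do not appear in the bucket listing."""
    # list_objects_v2 returns at most 1,000 keys per call, which is enough
    # for a bucket that holds only the three MovieLens files.
    response = s3_client.list_objects_v2(Bucket=bucket)
    present = {obj["Key"] for obj in response.get("Contents", [])}
    return [key for key in expected_keys if key not in present]


# With the notebook's client and bucket (assumed to be defined as above):
# missing_keys(s3, BUCKET, ["ratings.dat", "users.dat", "movies.dat"])
```

An empty list means all three files are in the bucket.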


**Compute Service**

To keep this project within Google Colab for demonstration purposes, I first created an IAM role trusted by AWS Lambda and generated programmatic access keys for an IAM user with S3 and Lambda permissions. I then built a small Lambda handler, deployed it to AWS, and invoked it, all without leaving the Colab notebook. The cell writes out handler.py (which returns a dummy top-3 recommendation list for any user ID passed to it) and zips that file into lambda_package.zip. Next, using the boto3 Lambda client, it checks whether a function named “RecommenderLambda” already exists: if so, it updates the function's code; if not, it creates the function from scratch with my IAM role attached. Finally, it invokes the function with a test payload ({"user_id": 1}) and prints the decoded JSON response. The result demonstrates end-to-end use of AWS Lambda as a compute service directly from the notebook: code authoring, packaging, deployment, and execution, all via Python.

In [None]:
import json, boto3, zipfile, os

# Read credentials from the environment instead of hardcoding them in the notebook
AWS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET = os.environ["AWS_SECRET_ACCESS_KEY"]
REGION = "us-east-2"
LAMBDA_ROLE_ARN  = "arn:aws:iam::851429954156:role/project6roleread612"
FUNCTION_NAME = "RecommenderLambda"

lambda_client = boto3.client(
    "lambda",
    aws_access_key_id     = AWS_KEY,
    aws_secret_access_key = AWS_SECRET,
    region_name           = REGION,
)

handler_code = '''\
import json

def lambda_handler(event, context):
    user = event.get("user_id", "<none>")
    # dummy recommendations
    return {
        "statusCode": 200,
        "body": json.dumps({
            "user": user,
            "recommendations": [42, 317, 108]
        })
    }
'''
with open("handler.py","w") as f:
    f.write(handler_code)

with zipfile.ZipFile("lambda_package.zip","w", zipfile.ZIP_DEFLATED) as z:
    z.write("handler.py")

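# Read the deployment package once so the file handle is closed promptly
with open("lambda_package.zip", "rb") as f:
    zip_bytes = f.read()

# Update the function if it already exists; otherwise create it
try:
    lambda_client.get_function(FunctionName=FUNCTION_NAME)
    print("➤ Updating existing Lambda…")
    lambda_client.update_function_code(
        FunctionName=FUNCTION_NAME,
        ZipFile=zip_bytes
    )
    # Wait for the update to finish before invoking
    lambda_client.get_waiter("function_updated_v2").wait(FunctionName=FUNCTION_NAME)
except lambda_client.exceptions.ResourceNotFoundException:
    print("➤ Creating new Lambda…")
    lambda_client.create_function(
        FunctionName=FUNCTION_NAME,
        Runtime="python3.9",
        Role=LAMBDA_ROLE_ARN,
        Handler="handler.lambda_handler",
        Code={"ZipFile": zip_bytes},
        Timeout=30,
        MemorySize=128,
    )
    # Wait for the new function to become active before invoking
    lambda_client.get_waiter("function_active_v2").wait(FunctionName=FUNCTION_NAME)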

# Invoke
response = lambda_client.invoke(
    FunctionName=FUNCTION_NAME,
    InvocationType="RequestResponse",
    Payload=json.dumps({"user_id": 1})
)
result = json.load(response["Payload"])
print("Lambda returned:", result)

➤ Updating existing Lambda…
Lambda returned: {'statusCode': 200, 'body': '{"user": 1, "recommendations": [42, 317, 108]}'}
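Note that the `body` field in the response above is itself a JSON-encoded string, so it needs a second decode before the recommendation list can be used. A minimal sketch, where `extract_recommendations` is a hypothetical helper name:

```python
import json


def extract_recommendations(result):
    """Decode the JSON string in result["body"] and return the recommendation list."""
    if result.get("statusCode") != 200:
        raise RuntimeError(f"Lambda returned status {result.get('statusCode')}")
    body = json.loads(result["body"])
    return body["recommendations"]


# The response printed by the cell above
result = {"statusCode": 200, "body": '{"user": 1, "recommendations": [42, 317, 108]}'}
print(extract_recommendations(result))  # [42, 317, 108]
```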


**VPC**

A Virtual Private Cloud (VPC) lets me isolate a network inside AWS, giving me control over IP ranges, subnets, and routing so that only my own resources can reach each other. Placing an S3 gateway endpoint in the VPC lets resources inside the VPC reach the MovieLens files without their traffic crossing the public Internet; when the bucket policy also restricts access to that endpoint, only clients inside the network can see and use the files. For example, a bank such as Chase could configure S3 gateway endpoints inside its VPC so that cardholder transaction logs flow directly from its compute instances to S3 without ever reaching the public Internet. To set this up, I went into the AWS Console's VPC service, clicked “Create VPC”, turned on DNS resolution, created public and private subnets, attached an Internet Gateway, and added the S3 endpoint so that all of the bucket traffic stays within my new VPC.
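I did these steps in the console, but the same idea can be sketched in boto3. The block below is an illustration under stated assumptions, not what I ran: `create_s3_gateway_endpoint` and `vpce_only_bucket_policy` are hypothetical helpers, the endpoint ID is a made-up placeholder, and the policy follows the common pattern of denying S3 requests that do not arrive through a given VPC endpoint.

```python
import json


def create_s3_gateway_endpoint(ec2_client, vpc_id, route_table_ids, region):
    """Create an S3 gateway endpoint in the VPC and return its endpoint ID."""
    response = ec2_client.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId=vpc_id,
        ServiceName=f"com.amazonaws.{region}.s3",
        RouteTableIds=route_table_ids,
    )
    return response["VpcEndpoint"]["VpcEndpointId"]


def vpce_only_bucket_policy(bucket, vpce_id):
    """Build a bucket policy that denies S3 access unless it uses the endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyAccessOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
        }],
    }


# Hypothetical endpoint ID, for illustration only
policy = vpce_only_bucket_policy("data612project6movielens1m", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

The policy could then be applied with `s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))`. Two caveats apply: a deny statement like this also blocks requests from outside the VPC, including uploads from Colab and the console, so it should only go on after the data is loaded; and the Lambda function has to be attached to subnets in the VPC for its S3 traffic to use the endpoint.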

**Conclusion**

Throughout this hands-on project with AWS, I demonstrated how to extend a traditional recommender system pipeline into a cloud environment, starting with the long-term storage requirement. I chose Amazon S3 to house the MovieLens 1M dataset because it provides virtually unlimited, durable object storage and integrates seamlessly with the AWS ecosystem. Using wget and unzip and then boto3 in Google Colab, I automated the download, extraction, and upload of the ratings, movies, and users files into a private S3 bucket (data612project6movielens1m), satisfying the “file storage” deliverable.

Next, to fulfill the compute requirement, I opted for AWS Lambda invoked directly from within Google Colab. Outside of Colab, I created an IAM role (named project6roleread612) with the necessary S3 read permissions and attached the AWSLambdaBasicExecutionRole policy. Back in Colab, I wrote a simple Python handler, packaged it into a ZIP, and used boto3 to create (or update) the RecommenderLambda function. Finally, I invoked the function with a sample payload and received a dummy list of recommendations, all without provisioning or managing servers.

For network isolation, I created a dedicated VPC (data612-project6-vpc) in the AWS console. By auto-generating public and private subnets across two AZs, enabling DNS resolution, and adding an S3 gateway endpoint, I made sure that data traffic between Lambda and S3 can remain within the AWS network, fulfilling the VPC deliverable and avoiding exposure to the public Internet.

I chose S3 and Lambda primarily for their ease of use on the AWS Free Tier and their tight integration with Python environments like Google Colab, where I do my assignments and projects. Advantages include S3's scalability and low operational overhead and Lambda's zero server maintenance and automatic scaling. Disadvantages include S3's potential costs at high egress volumes and Lambda's cold-start latency and timeout limits. Nevertheless, this combination offers a lightweight, cost-effective way to prototype a cloud-native recommender pipeline while meeting all three deliverables.
