# Homework 2: BERT on AWS

In this homework, we will apply the BERT algorithm on Amazon Web Services.  

The [DataRec repository](https://github.com/sisinflab/DataRec) provides access to recommendation-system datasets and can be installed with

```bash
pip install datarec-lib
```

**General rules of thumb for homeworks:**
- Read the homework questions carefully.
- Explain your choices.
- Present your findings concisely.
- Use tables, plots, and summary statistics to aid your presentation of findings.
- If you have an idea but could not implement it in code, describe the idea thoroughly and explain how you would have implemented it.

### Tasks:

For all tasks below, create one or more functions for each step so that the full analysis can be run as a sequence of function calls. List the functions in order, with a brief description of each, in the README.

1. Download the MovieLens 1M dataset and save a copy of it to an AWS S3 bucket.
    - Check whether your S3 bucket already contains the dataset. If it does, the script should skip the download.
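One way to keep this existence check cheap is to ask S3 for at most one key under the prefix. The sketch below is illustrative only: `prefix_exists` is a hypothetical helper (not part of boto3), and it assumes a client object that exposes boto3's `list_objects_v2` method. The trailing slash is an assumed convention so that a prefix such as `ml-1m` does not also match a sibling such as `ml-1m-backup`.

```python
def prefix_exists(s3_client, bucket: str, prefix: str) -> bool:
    """Return True if at least one object key starts with `prefix/`.

    `s3_client` is assumed to behave like a boto3 S3 client, i.e. it
    exposes `list_objects_v2(Bucket=..., Prefix=..., MaxKeys=...)`.
    """
    # Normalise to exactly one trailing slash so that only keys "inside"
    # the prefix match (an assumed layout, not a requirement of S3).
    normalized = prefix.rstrip("/") + "/"
    response = s3_client.list_objects_v2(
        Bucket=bucket, Prefix=normalized, MaxKeys=1
    )
    # KeyCount is the number of keys returned in this response.
    return response.get("KeyCount", 0) > 0
```

With a real client this could be called as `prefix_exists(boto3.client("s3"), bucket_name, "datasets/movielens/ml-1m")` before deciding whether to download and upload the dataset.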

In [2]:
# Install datarec
!pip install datarec-lib

Collecting datarec-lib
  Downloading datarec_lib-1.5.4-py3-none-any.whl.metadata (10 kB)
Collecting pandas<3,>=2.3 (from datarec-lib)
  Downloading pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)
Collecting gdown<5,>=4.7 (from datarec-lib)
  Downloading gdown-4.7.3-py3-none-any.whl.metadata (4.4 kB)
Collecting py7zr<1,>=0.22 (from datarec-lib)
  Downloading py7zr-0.22.0-py3-none-any.whl.metadata (16 kB)
Collecting appdirs<2,>=1.4.4 (from datarec-lib)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting python-igraph<2,>=1.0 (from datarec-lib)
  Downloading python_igraph-1.0.0-py3-none-any.whl.metadata (3.1 kB)
Collecting texttable (from py7zr<1,>=0.22->datarec-lib)
  Downloading texttable-1.7.0-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting pyzstd>=0.15.9 (from py7zr<1,>=0.22->datarec-lib)
  Downloa

In [3]:
! pip install boto3

Collecting boto3
  Downloading boto3-1.42.50-py3-none-any.whl.metadata (6.8 kB)
Collecting botocore<1.43.0,>=1.42.50 (from boto3)
  Downloading botocore-1.42.50-py3-none-any.whl.metadata (5.9 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.1.0-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.17.0,>=0.16.0 (from boto3)
  Downloading s3transfer-0.16.0-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.42.50-py3-none-any.whl (140 kB)
Downloading botocore-1.42.50-py3-none-any.whl (14.6 MB)
Downloading jmespath-1.1.0-py3-none-any.whl (20 kB)
Downloading s3transfer-0.16.0-py3-none-any.whl (86 kB)

In [4]:
import os
import boto3
import zipfile
from botocore.exceptions import ClientError

from datarec.datasets import load_dataset

In [53]:
# Function for uploading data to bucket

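# Read AWS credentials from a local file so they are not hard-coded in the notebook.
# Expected format: a single line "access_key,secret_key,session_token,region"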
with open(r"/content/credentials.txt", "r") as f:
  line = f.read().strip()
access_key, secret_key, session_token, region = [
  x.strip() for x in line.split(",")
]

def upload_movielens(bucketname: str,
  s3_prefix: str = "datasets/movielens/ml-1m",
  tmp_dir: str = "/tmp"):

  # Initialize the S3 connection with AWS credentials
  s3 = boto3.client("s3",
    aws_access_key_id = access_key,
    aws_secret_access_key = secret_key,
    aws_session_token = session_token,
    region_name = region
  )

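  # Skip the download if any object already exists under the prefix
  try:
    res = s3.list_objects_v2(Bucket=bucketname, Prefix=s3_prefix)
    if "Contents" in res:
      print("MovieLens already exists in S3. Skipping upload.")
      return
  except ClientError as e:
    raise RuntimeError(f"Could not list s3://{bucketname}/{s3_prefix}") from e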

  # Load data
  dataset = load_dataset("movielens")
  zip_paths = dataset.download()
  zip_file = zip_paths[0]
  print(f"Downloaded zip: {zip_file}")

  # Extract files
  extract_path = os.path.join(tmp_dir, "ml-1m")
  os.makedirs(extract_path, exist_ok=True)
  with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall(extract_path)
  print(f"Extracted files to: {extract_path}")
  # actual files are inside ml-1m/ folder
  inner_path = os.path.join(extract_path, "ml-1m")
  if not os.path.exists(inner_path):
    print(f"expected folder {inner_path} does not exist.")
    return

  # Upload extracted files
  uploaded_files = 0
  for file in os.listdir(inner_path):
    full_local = os.path.join(inner_path, file)
    if os.path.isfile(full_local):
      key = f"{s3_prefix}/{file}"
      s3.upload_file(full_local, bucketname, key)
      print(f"Uploaded: {key}")
      uploaded_files += 1

  if uploaded_files == 0:
    print("No files found to upload.")
  else:
    print(f"Upload complete. Total files uploaded: {uploaded_files}")





In [8]:
# Function Call
upload_movielens(bucketname = "de300-bzl-wi2026", s3_prefix="datasets/movielens/ml-1m")

MovieLens already exists in S3. Skipping upload.


2. Create BERT embeddings for the items (movies) in the dataset. Create the embeddings once and save a copy of the necessary intermediate results to the S3 bucket.
    - This is the *offline* step, where embeddings only need to be created once for the recommendation system.
    - Use a random subset (30%) of users in the available dataset.
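The sampling step can be sketched independently of BERT. The example below is a minimal illustration on toy data, assuming ratings are available as `(user_id, movie_id)` pairs; `sample_users` and `movies_rated_by` are hypothetical helper names, and the standard-library `random` module stands in for whatever random number generator the real pipeline uses.

```python
import random


def sample_users(user_ids, fraction, seed):
    """Return a reproducible random subset of the distinct user IDs."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Sort first so that the seed alone determines which users are drawn.
    unique_users = sorted(set(user_ids))
    k = int(fraction * len(unique_users))
    rng = random.Random(seed)
    return set(rng.sample(unique_users, k))


def movies_rated_by(ratings, users):
    """Collect the movie IDs rated by at least one user in `users`.

    `ratings` is an iterable of (user_id, movie_id) pairs.
    """
    return {movie_id for user_id, movie_id in ratings if user_id in users}


# Toy data: 10 users, each rating two movies that nobody else rated.
ratings = [(u, m) for u in range(1, 11) for m in (u, u + 100)]
sampled = sample_users([u for u, _ in ratings], fraction=0.3, seed=67)
subset_movies = movies_rated_by(ratings, sampled)
```

Fixing the seed means the same users, and therefore the same movies, are selected on every run, which keeps the stored embeddings aligned with the stored movie IDs.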

In [1]:
!pip install torch transformers



In [21]:
import json
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModel

# Movie embeddings function
def bert_movie_embeddings(
    bucketname: str,
    s3_prefix: str = "embeddings/movielens/bert",
    model_name: str = "bert-base-uncased",
    user_fraction: float = 0.3,
    seed: int = 67,
    tmp_dir: str = "/tmp"):

  s3 = boto3.client("s3",
                    aws_access_key_id = access_key,
                    aws_secret_access_key = secret_key,
                    aws_session_token = session_token,
                    region_name = region)

  # check if embeddings exist
  try:
    res = s3.list_objects_v2(Bucket=bucketname, Prefix=s3_prefix)
    if "Contents" in res:
      print("Embeddings already in S3")
      return
  except ClientError as e:
    raise RuntimeError(f"Could not list s3://{bucketname}/{s3_prefix}") from e

  # load data
  dataset = load_dataset("movielens")
  zip_path = dataset.download()[0]

  extract_path = os.path.join(tmp_dir, "ml-1m")
  os.makedirs(extract_path, exist_ok = True)

  with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall(extract_path)

  # get ratings and movies
  ratings_path = os.path.join(extract_path, "ml-1m", "ratings.dat")
  movies_path = os.path.join(extract_path, "ml-1m", "movies.dat")

  ratings = pd.read_csv(
      ratings_path,
      sep = "::",
      engine = "python",
      names = ["user_id", "movie_id", "rating", "timestamp"]
  )

  movies = pd.read_csv(
      movies_path,
      sep = "::",
      engine = "python",
      names=["movie_id", "title", "genres"],
      encoding = "latin-1"
  )

  # Subsample
  rng = np.random.default_rng(seed) # random seed for consistency in testing
  sampled_users = rng.choice(
      ratings["user_id"].unique(),
      size = int(user_fraction * ratings["user_id"].nunique()),
      replace = False
  )

  sampled_movie_ids = ratings[
      ratings["user_id"].isin(sampled_users)
      ]["movie_id"].unique()

  movies = movies[movies["movie_id"].isin(sampled_movie_ids)]

  # handle text inputs
  texts = (
      movies["title"] + " " +
      movies["genres"].str.replace("|", " ", regex=False)
  ).tolist()

  movie_ids = movies["movie_id"].tolist()

  # initialize bert
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModel.from_pretrained(model_name)
  model.eval()

  embeddings = []

  with torch.no_grad():
    for text in texts:
      inputs = tokenizer(
         text,
         padding = True,
         truncation = True,
         max_length = 128,
         return_tensors = "pt"
      )
      outputs = model(**inputs)
      # Use the final hidden state of the [CLS] token (position 0) as the movie embedding
      cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze(0).numpy()
      embeddings.append(cls_embedding)

  embeddings = np.vstack(embeddings)

  # Save locally
  os.makedirs(tmp_dir, exist_ok = True)

  emb_path = os.path.join(tmp_dir, "movie_embeddings.npy")
  meta_path = os.path.join(tmp_dir, "movie_ids.json")

  np.save(emb_path, embeddings)

  with open(meta_path, "w") as f:
    json.dump(movie_ids, f)

  # Upload to S3
  s3.upload_file(emb_path, bucketname, f"{s3_prefix}/movie_embeddings.npy")
  s3.upload_file(meta_path, bucketname, f"{s3_prefix}/movie_ids.json")

  print("Upload complete")


In [22]:
# Call the function
bert_movie_embeddings(
    bucketname = "de300-bzl-wi2026",
    s3_prefix = "embeddings/movielens/bert",
    user_fraction = 0.3
)

/usr/local/lib/python3.12/dist-packages/datarec/registry/metrics/movielens_1m.yml
ml-1m.zip: File already exists, skipping download


BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Upload complete


3. Recommend five movies for each of the following users. Save the recommendations to a file on the S3 bucket containing `User_Type`, `Last_Interaction_Time`, other user summaries from the dataset, and a list of recommended movies:
    - *Cold user*: a user that the system has no data on.
    - *Top user*: a random user who has frequently rated movies (number of interactions among the top 5% of users).
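The two user types call for different strategies, which the following sketch tries to make concrete. It is a pure-Python illustration on toy two-dimensional vectors, not the BERT pipeline itself: `cosine` and `recommend` are hypothetical helpers, and returning an empty list for a cold user is an assumed convention that signals the caller to fall back to a non-personalised strategy such as popularity.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(embeddings, seen, k=5):
    """Rank unseen items by similarity to the mean embedding of seen items.

    `embeddings` maps item ID -> vector; `seen` is the set of item IDs the
    user has interacted with. A cold user (nothing seen) yields [].
    """
    seen_vecs = [embeddings[i] for i in seen if i in embeddings]
    if not seen_vecs:
        return []
    dim = len(seen_vecs[0])
    # The user profile is the unweighted mean of the seen item vectors.
    profile = [sum(v[d] for v in seen_vecs) / len(seen_vecs) for d in range(dim)]
    candidates = [i for i in embeddings if i not in seen]
    ranked = sorted(
        candidates,
        key=lambda i: cosine(profile, embeddings[i]),
        reverse=True,
    )
    return ranked[:k]


toy = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
}
print(recommend(toy, seen={"a"}, k=2))  # ['b', 'd']
print(recommend(toy, seen=set(), k=2))  # []
```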

In [31]:
from sklearn.metrics.pairwise import cosine_similarity

# function for recommending movies
def recommend_movies(
    bucketname: str,
    emb_prefix: str = "embeddings/movielens/bert",
    out_key: str = "recommendations/movielens/recommendations.json",
    tmp_dir: str = "/tmp",
    seed: int = 67):

  rng = np.random.default_rng(seed)

  s3 = boto3.client("s3",
                    aws_access_key_id = access_key,
                    aws_secret_access_key = secret_key,
                    aws_session_token = session_token,
                    region_name = region)

  # load embeddings
  emb_path = os.path.join(tmp_dir, "movie_embeddings.npy")
  id_path = os.path.join(tmp_dir, "movie_ids.json")

  s3.download_file(bucketname, f"{emb_prefix}/movie_embeddings.npy", emb_path)
  s3.download_file(bucketname, f"{emb_prefix}/movie_ids.json", id_path)

  embeddings = np.load(emb_path)

  with open(id_path) as f:
    movie_ids = json.load(f)

  id_to_idx = {m: i for i, m in enumerate(movie_ids)}

  # load data
  dataset = load_dataset("movielens")
  zip_path = dataset.download()[0]

  extract_path = os.path.join(tmp_dir, "ml-1m")
  os.makedirs(extract_path, exist_ok = True)

  with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall(extract_path)

  ratings = pd.read_csv(
      os.path.join(extract_path, "ml-1m", "ratings.dat"),
      sep = "::",
      engine = "python",
      names = ["user_id", "movie_id", "rating", "timestamp"]
  )

  movies = pd.read_csv(
      os.path.join(extract_path, "ml-1m", "movies.dat"),
      sep = "::",
      engine = "python",
      names = ["movie_id", "title", "genres"],
      encoding = "latin-1"
  )

  movie_titles = dict(zip(movies.movie_id, movies.title))

  # Cold User - List popular movies

  popular_movies = (
      ratings.groupby("movie_id")
      .size()
      .sort_values(ascending = False)
      .head(5)
      .index
      .tolist()
  )

  cold_user_rec = {
      "User_Type": "cold",
      "User_ID" : None,
      "Last_Interaction_Time" : None,
      "Num_Interactions" : 0,
      "Recommended_Movies" : [
          {"movie_id": m, "title": movie_titles[m]}
          for m in popular_movies
      ]
  }

  # Top User - recommend by content
  user_counts = ratings.groupby("user_id").size()
  threshold = np.percentile(user_counts.values, 95) # top 5%
  top_users = user_counts[user_counts >= threshold].index.tolist()

  top_user = rng.choice(top_users)
  user_data = ratings[ratings.user_id == top_user]

  # Use a set of plain ints so the membership tests below are fast and unambiguous
  seen_movies = set(user_data.movie_id.unique().tolist())
  seen_idxs = [id_to_idx[m] for m in seen_movies if m in id_to_idx]

  user_embedding = embeddings[seen_idxs].mean(axis=0, keepdims=True)
  sims = cosine_similarity(user_embedding, embeddings).flatten()

  candidate_idxs = [
      i for i, m in enumerate(movie_ids) if m not in seen_movies
  ]

  top_recs = sorted(
      candidate_idxs,
      key = lambda i: sims[i],
      reverse = True
  )[:5]

  top_user_rec = {
      "User_Type" : "top",
      "User_ID" : int(top_user),
      "Last_Interaction_Time": int(user_data.timestamp.max()),
      "Num_Interactions": int(len(user_data)),
      "Recommended_Movies": [
          {
              "movie_id": movie_ids[i],
              "title": movie_titles[movie_ids[i]]
          }
          for i in top_recs
      ]
  }

  # Save recs and upload
  results = [cold_user_rec, top_user_rec]

  out_local = os.path.join(tmp_dir, "recommendations.json")

  with open(out_local, "w") as f:
    json.dump(results, f, indent=2)

  s3.upload_file(out_local, bucketname, out_key)

  print(f"Recs saved to s3://{bucketname}/{out_key}")

In [32]:
# Function call
recommend_movies(
    bucketname = "de300-bzl-wi2026",
    emb_prefix = "embeddings/movielens/bert",
    out_key = "recommendations/movielens/recommendations.json"
)

/usr/local/lib/python3.12/dist-packages/datarec/registry/metrics/movielens_1m.yml
ml-1m.zip: File already exists, skipping download
Recs saved to s3://de300-bzl-wi2026/recommendations/movielens/recommendations.json


4. Repeat steps 2 and 3 but with the full set of data.  You should be able to reuse your work from earlier.

In [33]:
# Full dataset
bert_movie_embeddings(
    bucketname = "de300-bzl-wi2026",
    s3_prefix = "embeddings/movielens/bert_full",
    user_fraction = 1.0
)

/usr/local/lib/python3.12/dist-packages/datarec/registry/metrics/movielens_1m.yml
ml-1m.zip: File already exists, skipping download


BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Upload complete


In [34]:
# Recs on full dataset
recommend_movies(
    bucketname = "de300-bzl-wi2026",
    emb_prefix = "embeddings/movielens/bert_full",
    out_key = "recommendations/movielens/recommendations_full.json"
)

/usr/local/lib/python3.12/dist-packages/datarec/registry/metrics/movielens_1m.yml
ml-1m.zip: File already exists, skipping download
Recs saved to s3://de300-bzl-wi2026/recommendations/movielens/recommendations_full.json


5. Choose and rate 10 movies and create a "user profile" for yourself.  Save your user profile on the S3 bucket.  Recommend 5 movies for yourself and save the results on the S3 bucket.
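A personal profile can weight each rated movie by its rating, so that favourites pull the profile vector harder than lukewarm picks. The sketch below shows one possible weighting on toy vectors; `weighted_profile` is a hypothetical helper, and using the raw rating as the weight is an assumption (centering the ratings first would be a reasonable alternative).

```python
def weighted_profile(embeddings, ratings):
    """Average item vectors, weighting each one by the user's rating.

    `embeddings` maps item ID -> vector and `ratings` maps item ID -> rating.
    Rated items that have no embedding are skipped.
    """
    rated = [(embeddings[i], r) for i, r in ratings.items() if i in embeddings]
    if not rated:
        raise ValueError("none of the rated items has an embedding")
    total = sum(r for _, r in rated)
    dim = len(rated[0][0])
    return [sum(vec[d] * r for vec, r in rated) / total for d in range(dim)]


toy_embeddings = {10: [1.0, 0.0], 20: [0.0, 1.0]}
toy_ratings = {10: 5, 20: 3, 99: 4}  # item 99 has no embedding and is ignored
profile = weighted_profile(toy_embeddings, toy_ratings)
print(profile)  # [0.625, 0.375]
```

The resulting vector can then be compared against every movie embedding with cosine similarity, excluding the movies that were already rated.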

In [54]:
# Create user profile function

def create_user_profile(
  bucketname: str,
  emb_prefix: str = "embeddings/movielens/bert_full",
  profile_key: str = "user_profiles/self_profile.json",
  rec_key: str = "recommendations/self_recommendations.json",
  tmp_dir: str = "/tmp"):

  s3 = boto3.client("s3",
    aws_access_key_id = access_key,
    aws_secret_access_key = secret_key,
    aws_session_token = session_token,
    region_name = region
  )

  # get embeddings
  emb_path = os.path.join(tmp_dir, "movie_embeddings.npy")
  id_path = os.path.join(tmp_dir, "movie_ids.json")

  s3.download_file(bucketname, f"{emb_prefix}/movie_embeddings.npy", emb_path)
  s3.download_file(bucketname, f"{emb_prefix}/movie_ids.json", id_path)

  embeddings = np.load(emb_path)

  with open(id_path) as f:
    movie_ids = json.load(f)

  id_to_idx = {m: i for i, m in enumerate(movie_ids)}

  # load data
  dataset = load_dataset("movielens")
  zip_path = dataset.download()[0]

  extract_path = os.path.join(tmp_dir, "ml-1m")
  os.makedirs(extract_path, exist_ok = True)

  with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall(extract_path)

  movies = pd.read_csv(
    os.path.join(extract_path, "ml-1m", "movies.dat"),
    sep = "::",
    engine = "python",
    names = ["movie_id", "title", "genres"],
    encoding = "latin-1"
  )

  movie_titles = dict(zip(movies.movie_id, movies.title))

  # make my ratings for 10 movies - by movie ID
  user_ratings = {
    848: 5,
    2394: 5,
    934: 4,
    1034: 4,
    2077: 4,
    1233: 5,
    2567: 5,
    3232: 5,
    401: 5,
    900: 3
  }

  rated_idxs = [id_to_idx[m] for m in user_ratings if m in id_to_idx]
  weights = np.array([user_ratings[m] for m in user_ratings if m in id_to_idx])

  # build profile
  user_embedding = np.average(
    embeddings[rated_idxs],
    axis=0,
    weights=weights
  )

  user_profile = {
    "User_Type": "self",
    "Num_Rated_Movies": len(user_ratings),
    "Rated_Movies": [
      {
        "movie_id": m,
        "title": movie_titles[m],
        "rating": r
      }
      for m, r in user_ratings.items()
    ]
  }

  # Save profile
  profile_path = os.path.join(tmp_dir, "self_profile.json")
  with open(profile_path, "w") as f:
    json.dump(user_profile, f, indent=2)

  s3.upload_file(profile_path, bucketname, profile_key)

  # Make recs
  sims = cosine_similarity(
    user_embedding.reshape(1,-1),
    embeddings
  ).flatten()

  unseen = [i for i, m in enumerate(movie_ids) if m not in user_ratings]

  top5 = sorted(unseen, key=lambda i: sims[i], reverse=True)[:5]

  recommendations = {
    "User_Type": "self",
    "Recommended_Movies": [
      {
        "movie_id": movie_ids[i],
        "title": movie_titles[movie_ids[i]],
        "score": float(sims[i])
      }
      for i in top5
    ]
  }

  # Save and upload recs
  rec_path = os.path.join(tmp_dir, "self_recommendations.json")
  with open(rec_path, "w") as f:
    json.dump(recommendations, f, indent=2)

  s3.upload_file(rec_path, bucketname, rec_key)

  print("User profile and recs uploaded to S3")


In [52]:
# Function call
create_user_profile(
    bucketname = "de300-bzl-wi2026",
    emb_prefix = "embeddings/movielens/bert_full",
    profile_key = "user_profiles/self_profile.json",
    rec_key = "recommendations/self_recommendations.json"
)

/usr/local/lib/python3.12/dist-packages/datarec/registry/metrics/movielens_1m.yml
ml-1m.zip: File already exists, skipping download
Rated indices: [795, 2201, 873, 969, 1896, 1142, 2370, 3009, 388, 839]
Weights: [5 5 4 4 4 5 5 5 5 3]
Num rated movies: 10
User profile and recs uploaded to S3


# Submission guidelines
Your submission should be contained in a `homework_2` folder of your GitHub repository, and it should include
- a `readme.md` file including how to run the code and what your expected outputs are (if the code is run),
- your source code, and/or
- a `.pdf` or `.html` file containing any necessary observations and details.
    - If you find your source code self-explanatory, you may opt to skip the `.pdf` or `.html` file in this homework.


# Generative AI disclosure

*Syllabus* policy:

Required disclosure: each submission must include an AI Usage note stating: (1) tool(s) used, (2) the key prompt(s), and (3) what you changed and how you verified the results. If none, write: “AI Usage: None.”

1. The only AI tools I used were ChatGPT and the Gemini assistant built into Colab.
2. I primarily used AI to troubleshoot errors that I wasn't sure how to handle. For example, I got several errors when trying to load my data for parts 2-3, so I copied them into the prompt and asked, "explain what is causing this error and what are some potential ways to address it."
3. One change I made based on AI responses to an error was using the boto3 package to connect to S3. AI was also very helpful for figuring out the id_to_idx lines and extracting the archive with ZipFile, since I wasn't sure how to approach those steps and ran into many errors. It also helped me debug saving files locally within the Colab environment. I verified the results by checking that the files were uploaded to S3 and reviewing their contents.