# SageMaker Setup

## Environment and Data

In [1]:
# import as needed
# if only needed in script, does not need to be here
import os
import shutil
import warnings
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

### sagemaker essentials  - copy ###
import sagemaker
from sagemaker.pytorch import PyTorch # as needed framework
import boto3
import s3fs

# sagemaker initiation
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  
bucket = sagemaker_session.default_bucket()

# extra
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.options.mode.chained_assignment = None


In [34]:
# upload data beforehand
# data should be in bucket
df = pd.read_parquet('s3://capstone-general/text-data/FNSPID_NYT_Combined_Dataset_021125.parquet')
market = pd.read_pickle("s3://capstone-general/NN-related/data_checkpoint1/mkt_daily.pkl") 


## Pre-training Processing

This preprocessing is taken directly from a Colab notebook. The train/validation split can also be done inside the training script (personal preference); if it is, there is no need to pass separate train and validation paths later.

In [35]:
# Shift market returns for prediction alignment (next-day returns)
market["return_sp_lag"] = market["return_sp"].shift(-1)

# Merge market data with news data
df = df.merge(market[['return_sp_lag']], left_on="Date", right_index=True, how="left")

# Clean and prepare the dataframe
df["Date"] = pd.to_datetime(df["Date"])
# df = df[["Date", "Summary", "return_sp_lag"]].dropna()

print("✅ Data loaded and cleaned.")

✅ Data loaded and cleaned.
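
The alignment in the cell above can be checked on a toy example. This is a minimal sketch with made-up dates and returns (the two small frames are assumptions, not the real data): `shift(-1)` moves each row's value up by one, so `return_sp_lag` on a given date holds the *next* trading day's return, and the last date has no target.

```python
import pandas as pd

# Made-up market data: three trading days, indexed by date
market = pd.DataFrame(
    {"return_sp": [0.01, -0.02, 0.03]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)
# Same operation as in the notebook: next-day return as the target
market["return_sp_lag"] = market["return_sp"].shift(-1)

# Made-up news data: two articles on the first day, one on the last
news = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-01-02", "2020-01-02", "2020-01-06"]),
        "Summary": ["a", "b", "c"],
    }
)

# Left merge keeps every article; dates without a target get NaN
merged = news.merge(
    market[["return_sp_lag"]], left_on="Date", right_index=True, how="left"
)
print(merged)
```

Both `Date` and the market index are datetimes here before the merge. Articles dated on the last market day end up with a missing target, which is what a later `dropna()` would remove.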


In [36]:
# Train-Validation Split (Train: <= Dec 31, 2020 | Validation: >= Jan 1, 2021)
train_df = df[df["Date"] <= "2020-12-31"].reset_index(drop=True)
val_df = df[df["Date"] > "2020-12-31"].reset_index(drop=True)

print(f"✅ Train Size: {len(train_df)} | Validation Size: {len(val_df)}")

✅ Train Size: 1237400 | Validation Size: 1427442


## Framework Setup

If splitting outside of the training script: we need to save the train and validation sets (for convenience) and know where in the bucket they are located.

In [2]:
train_s3_path = "s3://capstone-general/text-data/train/"
val_s3_path = "s3://capstone-general/text-data/val/"

In [38]:
# parquet for safekeeping, keep paths in mind
train_df.to_parquet(f"{train_s3_path}train.parquet", index=False)
val_df.to_parquet(f"{val_s3_path}val.parquet", index=False)

Below is the framework for starting a training job. Lines that will likely need changing are marked `!CHANGE!`, with a comment on their purpose.

The most important fields are the following:
- dependencies
    - Any package that is not in the Python standard library should be listed in a requirements.txt, which is installed before the training script runs.
- instance_type
    - When first testing whether a training job will run, the first error you are likely to hit is a dependency error, hence the requirements.txt. Failed jobs are still billed for the seconds they ran, so for the first run it is preferable to pick a non-GPU instance (ml.t3.large, etc.) to check the dependencies. The run will fail quickly, but it should fail with a "GPU needed" error and not a dependency error.
    - Additionally, it is generally better to start with a smaller instance and scale up as needed. If GPU memory is the limiting factor, I would recommend this path, unless you already have a decent idea of how much memory you need:
        - ml.g4dn.xlarge (1x16 GB, T4) -> ml.g4dn.12xlarge (4x16 GB, T4) / ml.p3.2xlarge (1x16 GB, V100) -> ml.p4d.24xlarge (8x40 GB, A100)
- hyperparameters
    - This dictionary defines hyperparameters from "outside" the training script ("inside"). It can hold any number of entries with any values, as long as each key has a corresponding argparse argument in the training script.
- output_path
    - This is the S3 location where the model.tar.gz file will be placed.
- environment
    - This is also a dictionary, for environment variables. I use it for settings that are not hyperparameters but may need to be changed easily without opening train.py. I also struggled to get the job to find the data from the fit call alone, so I recommend setting at least the data paths here. As far as I can tell, the names do not need to come from a fixed set: you read them with os.environ in the script, so they can be whatever you need.
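
One way to make the cheap first run more informative is to check for missing packages at the very top of train.py, before anything heavy is imported. The sketch below is an assumption about how such a check could look, not part of the original script: `missing_modules` and the `REQUIRED` list are hypothetical names, and the list should mirror whatever train.py actually imports.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Hypothetical list of top-level import names used by train.py
REQUIRED = ["pandas", "torch", "transformers"]

missing = missing_modules(REQUIRED)
if missing:
    # Printed to the job logs; add the corresponding packages to requirements.txt
    print(f"Missing modules: {missing}")
else:
    print("All required modules found.")
```

Note that import names and package names can differ, so the entries in requirements.txt may not match this list one to one.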
 

After filling in the estimator's constructor, we can call .fit(), which starts the training job. The paths to the train and validation data should be passed as a dictionary (and train.py needs matching argparse arguments to receive them). Follow the logs to understand the general flow of the job. You can also go to the Studio page (from which your notebook was launched) and click Jobs > Training Jobs on the left. This is a convenient way to check the progress, logs, environment variables, hyperparameters, etc. of the job, or to stop it.
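
On the script side, the hyperparameters arrive as command-line arguments and the environment variables are read with os.environ. A minimal sketch of the argument handling in train.py, assuming the hyperparameter and variable names used in this notebook (the default values below are placeholders, not the values the real script uses):

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters (command line) and paths (environment variables)."""
    parser = argparse.ArgumentParser()

    # One argument per key in the estimator's `hyperparameters` dict
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--weight_decay", type=float, default=0.0)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--warmup_ratio", type=float, default=0.0)
    parser.add_argument("--max_grad_norm", type=float, default=1.0)

    # Data, output and model locations default to the environment variables
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--val", default=os.environ.get("SM_CHANNEL_VAL"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--base_model", default=os.environ.get("BASE_MODEL"))

    # parse_known_args ignores any extra arguments instead of failing on them
    args, _ = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    print(vars(args))
```

For a quick local check, `parse_args(["--epochs", "2", "--learning_rate", "0.02"])` can be called directly from a notebook cell.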

Note that this notebook and the training job run on separate instances (that's why you can pick the instance type for both). The training job therefore does not depend on this notebook's runtime.

In [24]:
estimator = PyTorch( # !CHANGE! appropriate framework (HuggingFace if out of box)
    entry_point="train.py", # !CHANGE! training script
    source_dir=".", # !CHANGE! reference path to files (e.g. requirements.txt, train.py). can and likely should be an s3 path ("s3:// ... /source-code/")
    role=role, # sagemaker execution role
    dependencies=["requirements.txt"], # !CHANGE! (separately) create req.txt from dependencies
    instance_type="ml.g4dn.xlarge",  # !CHANGE! recommended to check for dependency errors with small instance (t, m, c) first. it will fail but you avoid running seconds for script package dependencies
    instance_count=1, # increase for torch.nn.parallel.DistributedDataParallel
    framework_version="1.13", # pytorch/etc version
    py_version="py39", 
    hyperparameters={"epochs": 2,
                     "weight_decay" : 0.02,
                     "gradient_accumulation_steps" : 4,
                     "learning_rate" :0.02,
                     "warmup_ratio" : 0.2,
                     "max_grad_norm" : 2.0
                    }, #!CHANGE! script calls hyperparameters as arguments
    output_path="s3://capstone-general/text-models/output", # !CHANGE! where to put model.tar.gz
    input_mode="File",
    environment={"TOKENIZERS_PARALLELISM": "false", # Avoid tokenizer parallelism warning (chatgpt put this)
                 "SM_MODEL_DIR":"s3://capstone-general/text-models/output", # to set default/expected output directory
                 "SM_CHANNEL_TRAIN": train_s3_path, # to set default/expected train directory
                 "SM_CHANNEL_VAL": val_s3_path, # to set default/expected val directory
                 "BASE_MODEL": "yiyanghkust/finbert-tone"},  # to set default/expected model
)

# Start training
estimator.fit({"train": train_s3_path, "val": val_s3_path}) # !CHANGE! the training script takes train/val directories as arguments

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2025-02-27-05-03-19-990


2025-02-27 05:04:08 Starting - Starting the training job...
2025-02-27 05:04:31 Starting - Preparing the instances for training...
2025-02-27 05:04:59 Downloading - Downloading input data...
2025-02-27 05:05:29 Downloading - Downloading the training image..................
2025-02-27 05:08:40 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2025-02-27 05:08:53,071 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-02-27 05:08:53,094 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-27 05:08:53,109 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-02-27 05:08:53,113 sagemaker_pytorch_container.trainin

After training is successful, check the S3 bucket path.

One thing you may notice in the bucket root is a large number of folders named "pytorch-training-YYYY-MM-DD-HH-MM-SS...", one for every training job that was *initiated* (including failed ones). Each has a "sourcedir.tar.gz" in it. These aren't relevant to us, but they take up a lot of storage, so if you go through a lot of trial and error before a success, go to the bucket once you ARE successful and delete the folders from the failed attempts.

In the specified output path, there should be a folder with an identical name ("pytorch-training..."), which contains output/model.tar.gz. Note this path.
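
Finding the model archive and the leftover source archives can be done by filtering the bucket's object keys. The sketch below works on a plain list of keys and matches by file name only; `find_job_artifacts`, `job_name` and the example keys are made up for illustration and are not real objects in the bucket.

```python
def find_job_artifacts(keys):
    """Split S3 object keys into model archives and source-code archives."""
    models = [k for k in keys if k.rsplit("/", 1)[-1] == "model.tar.gz"]
    sources = [k for k in keys if k.rsplit("/", 1)[-1] == "sourcedir.tar.gz"]
    return models, sources


def job_name(key):
    """Return the 'pytorch-training-...' path component of a key, or None."""
    for part in key.split("/"):
        if part.startswith("pytorch-training-"):
            return part
    return None


# Made-up keys for illustration only
example_keys = [
    "pytorch-training-EXAMPLE-A/source/sourcedir.tar.gz",
    "pytorch-training-EXAMPLE-B/source/sourcedir.tar.gz",
    "text-models/output/pytorch-training-EXAMPLE-B/output/model.tar.gz",
]

models, sources = find_job_artifacts(example_keys)
finished_jobs = {job_name(k) for k in models}
# Source archives whose job never produced a model: candidates for cleanup
stale = [k for k in sources if job_name(k) not in finished_jobs]

print("model archives:", models)
print("source archives from jobs without a model:", stale)
```

In practice the keys could come from boto3's `list_objects_v2` (the `Key` field of each entry in `Contents`). If the estimator object is still in memory after a successful `.fit()`, `estimator.model_data` should also hold the S3 URI of the archive.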

# HF Template (Ignore)

In [None]:
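# HuggingFaceModel is not imported at the top of the notebook
from sagemaker.huggingface import HuggingFaceModel

try:
	role = sagemaker.get_execution_role()
except ValueError:
	# outside SageMaker: look the role up by name via IAM
	iam = boto3.client('iam')
	role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']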

# Hub Model configuration. https://huggingface.co/models
hub = {
	'HF_MODEL_ID':'yiyanghkust/finbert-tone',
	'HF_TASK':'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	transformers_version='4.37.0',
	pytorch_version='2.1.0',
	py_version='py310',
	env=hub,
	role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1, # number of instances
	instance_type='ml.m5.xlarge' # ec2 instance type
)

predictor.predict({
	"inputs": "I like you. I love you",
})