# 03_04_Solution: Customize a Pre-Trained Model

In [1]:
import json

import pandas as pd
from datasets import load_dataset
from IPython.display import display
from IPython.display import HTML
from sagemaker import Session
from sagemaker.jumpstart.estimator import JumpStartEstimator
from sagemaker.s3 import S3Uploader



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


## Preprocess the data

In [2]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('AmznSagemaker_3942119_dataset.csv')

# Preview the column names to confirm format
print(df.columns)

Index(['DocumentID', 'Title', 'LegalText', 'Source'], dtype='object')


In [3]:
# Combine relevant fields into one text field
# The file has columns: DocumentID, Title, LegalText, Source (see the output above)
df['text'] = "Title: " + df['Title'].fillna('') + "\n" + \
             "LegalText: " + df['LegalText'].fillna('') + "\n\n" + \
             df['Source'].fillna('')
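
To make the resulting format concrete, here is a minimal sketch that applies the same concatenation to two made-up rows. The helper name `build_text` and the sample values are hypothetical and are not taken from the dataset; only the column names come from the output above.

```python
import pandas as pd


def build_text(df: pd.DataFrame) -> pd.Series:
    """Join Title, LegalText, and Source into one training string per row."""
    return (
        "Title: " + df["Title"].fillna("") + "\n"
        + "LegalText: " + df["LegalText"].fillna("") + "\n\n"
        + df["Source"].fillna("")
    )


# Hypothetical sample rows; the real dataset is not reproduced here.
sample = pd.DataFrame(
    {
        "Title": ["Patent Act", None],
        "LegalText": ["Section 1 text.", "Section 2 text."],
        "Source": ["Example Gazette", None],
    }
)

texts = build_text(sample)
print(texts[0])
```

Missing values become empty strings, so a row without a title still produces a `Title: ` line rather than a `NaN` entry.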

In [4]:
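# Save to a plain .txt file, with a blank line between entries
output_file = 'formatted_amzn_train_data.txt'
# Write the strings directly; to_csv would wrap multi-line entries in CSV quotes
with open(output_file, 'w', encoding='utf-8') as f:
    f.write("\n\n".join(df['text']))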

In [5]:
print(f"Saved formatted training file to {output_file}")

Saved formatted training file to formatted_amzn_train_data.txt


## Upload the data to S3 for training

In [6]:
session = Session()
output_bucket = session.default_bucket()
default_bucket_prefix = session.default_bucket_prefix

# If a default bucket prefix is specified, prepend it to the output bucket
if default_bucket_prefix:
    default_path = f"{output_bucket}/{default_bucket_prefix}"
else:
    default_path = output_bucket

local_data_file = "formatted_amzn_train_data.txt"
train_data_location = f"s3://{default_path}/legal_dataset"
S3Uploader.upload(local_data_file, train_data_location)
print(f"Training data: {train_data_location}")

Training data: s3://sagemaker-us-east-1-241215432415/legal_dataset
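
The prefix handling above can be sketched as a small pure function, which makes it easy to check both branches without an AWS session. The name `build_s3_uri` and the bucket and prefix values are hypothetical, chosen only for illustration.

```python
from typing import Optional


def build_s3_uri(bucket: str, prefix: Optional[str], key: str) -> str:
    """Build an S3 URI, inserting the optional prefix between bucket and key."""
    parts = [bucket]
    if prefix:
        # Trim slashes so "team-a/" and "team-a" give the same result.
        parts.append(prefix.strip("/"))
    parts.append(key.strip("/"))
    return "s3://" + "/".join(parts)


# Hypothetical bucket and prefix names.
print(build_s3_uri("my-bucket", None, "legal_dataset"))
print(build_s3_uri("my-bucket", "team-a/", "legal_dataset"))
```

In the notebook, `bucket` would correspond to `session.default_bucket()` and `prefix` to `session.default_bucket_prefix`.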


## Train the model using domain adaptation fine-tuning



In [7]:
model_id = "meta-textgeneration-llama-2-7b"

In [8]:
import boto3

estimator = JumpStartEstimator(
    model_id=model_id,
    environment={"accept_eula": "true"},  # accepts the Llama 2 EULA
    instance_type="ml.g5.24xlarge",
)
estimator.set_hyperparameters(instruction_tuned="False", epoch="3")
estimator.fit({"training": train_data_location})

Model 'meta-textgeneration-llama-2-7b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llamaEula.txt for terms of use.


Using model 'meta-textgeneration-llama-2-7b' with wildcard version identifier '*'. You can pin to version '4.15.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


2025-05-11 23:09:21 Starting - Starting the training job
2025-05-11 23:09:21 Pending - Training job waiting for capacity......
2025-05-11 23:10:21 Pending - Preparing the instances for training......
2025-05-11 23:11:03 Downloading - Downloading input data..........................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2025-05-11 23:15:40,738 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-05-11 23:15:40,774 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-05-11 23:15:40,783 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-05-11 23:15:40,785 sagemaker_pytorch_container.training INFO     Invoking user training script.

2025-05-11 23:15:36 Training - Training image download completed. Training in progress.
2025-05-11 23:15:

In [9]:
print(estimator.model_data)

{'S3DataSource': {'S3Uri': 's3://sagemaker-us-east-1-241215432415/meta-textgeneration-llama-2-7b-2025-05-11-23-09-20-724/output/model/', 'S3DataType': 'S3Prefix', 'CompressionType': 'None'}}
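
The training log notes that the model was resolved with the wildcard version `'*'` and suggests pinning to `'4.15.0'`. A minimal sketch of how the estimator arguments could be assembled with a pinned version follows. The helper `estimator_kwargs` is hypothetical; `model_version` is assumed to be accepted by `JumpStartEstimator` in the installed SDK, and the final construction is left commented out because it needs the SageMaker SDK and AWS credentials.

```python
def estimator_kwargs(model_id, model_version="*", instance_type="ml.g5.24xlarge"):
    """Collect JumpStartEstimator arguments so the version pin is explicit."""
    return {
        "model_id": model_id,
        "model_version": model_version,
        "environment": {"accept_eula": "true"},
        "instance_type": instance_type,
    }


# Version string taken from the log message in the training output.
kwargs = estimator_kwargs("meta-textgeneration-llama-2-7b", model_version="4.15.0")
print(kwargs)

# With the SageMaker SDK installed and credentials configured (assumption):
# from sagemaker.jumpstart.estimator import JumpStartEstimator
# estimator = JumpStartEstimator(**kwargs)
```

Pinning trades automatic upgrades for reproducibility: a rerun of this notebook would then use the same model package even after a newer version is published.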


## Deploy and invoke the fine-tuned model

You can deploy the fine-tuned model to an endpoint directly from the estimator.


In [10]:
predictor = estimator.deploy()

No instance type selected for inference hosting endpoint. Defaulting to ml.g5.12xlarge.


------------!

In [12]:
import sagemaker

predictor.content_type = "application/json"
predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

response = predictor.predict({"inputs": "Does unauthorized use of patented technology constitute a violation of intellectual property rights?"})
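
The request above sends only `inputs`, so the endpoint falls back to its default generation settings. As a sketch, generation settings could be passed alongside the prompt. The `parameters` keys shown here (`max_new_tokens`, `temperature`, `top_p`) are an assumption based on the payload format commonly used with JumpStart text generation models, and the helper `build_payload` is hypothetical; check the model's documentation before relying on them.

```python
def build_payload(prompt, max_new_tokens=256, temperature=0.6, top_p=0.9):
    """Build a request body with explicit generation settings (assumed schema)."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }


payload = build_payload(
    "Does unauthorized use of patented technology constitute "
    "a violation of intellectual property rights?",
    max_new_tokens=128,
)
print(payload)

# Against a live endpoint (requires the deployed predictor from above):
# response = predictor.predict(payload)
```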

In [13]:
response = response[0] if isinstance(response, list) else response
print("Output:\n", response["generated_text"].strip(), end="\n\n\n")

Output:
 Does unauthorized use of patented technology constitute a violation of intellectual property rights?
As inventors and innovators, intellectual property plays an important contribution to an economy: it promotes creativity and innovation, and also promotes entrepreneurship and trade and thus makes our society a better place to live in. Any inappropriate use of intellectual property by unauthorized persons not only harms inventors but businesses and customers as well. Besides, it also reduces the credibility of the intellectual property regime, if no proper measures are taken in such
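
The generated text above begins by repeating the prompt. If only the continuation is wanted, the echoed prompt can be removed after the fact. This is a minimal sketch; the helper `strip_prompt` and the sample strings are hypothetical.

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Return the generated text without a leading copy of the prompt."""
    text = generated.strip()
    if text.startswith(prompt):
        # Drop the echoed prompt and any whitespace that follows it.
        return text[len(prompt):].lstrip()
    return text


# Hypothetical strings that mimic the shape of the response above.
prompt = "Is this a question?"
generated = " Is this a question?\nYes, it is."
print(strip_prompt(prompt, generated))
```

In the notebook this would be called as `strip_prompt(prompt, response["generated_text"])`, with `prompt` holding the string passed as `inputs`.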




## Clean up to avoid endpoint charges

In [14]:
predictor.delete_predictor()