# Salary Prediction
- [The Adult Salary Prediction dataset](https://archive.ics.uci.edu/ml/datasets/adult) consists of data from the 1994 US Census. The task is to predict whether a person earns more than $50K a year (`>50K`, Class 1) or $50K or less a year (`<=50K`, Class 0). The columns in the dataset are as follows:

|col name|description|
|:--|:--|
|age| continuous.|
|workclass| Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.|
|fnlwgt| continuous.|
|education| Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.|
|education-num| continuous.|
|marital-status| Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.|
|occupation| Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.|
|relationship| Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.|
|race| White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.|
|sex| Female, Male.|
|capital-gain| continuous.|
|capital-loss| continuous.|
|hours-per-week| continuous.|
|native-country| United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.|
|target| This is the target variable to be predicted. Class 1 for salary >50K and class 0 for salary <=50K|

- The goal of this project is to build and tune a model that predicts the `target` column using Amazon SageMaker, and to deploy the model as a `Serverless Inference Endpoint`.

## Tips:
- You can use the code below to get the S3 bucket to write any artifacts to:
    ```
    import sagemaker
    session = sagemaker.Session()
    bucket = session.default_bucket()
    ```
- Are all the columns necessary, or can some be dropped?
- What ML task is this? Classification? Regression? Clustering?
- How do you determine the best hyperparameters for the model?
- How do you test whether the model was deployed successfully?
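
One way to approach the column question: `education` and `education-num` appear to encode the same information (each education level maps to a single number), so one of them is a candidate to drop. Below is a minimal sketch of checking that claim, using only the five sample rows printed by `train_df.head()` further down. The helper name `is_one_to_one` is hypothetical, and on the real data the same check would need to run over every row.

```python
def is_one_to_one(pairs):
    """Return True if each left value maps to exactly one right value and vice versa."""
    forward, backward = {}, {}
    for left, right in pairs:
        if forward.setdefault(left, right) != right:
            return False
        if backward.setdefault(right, left) != left:
            return False
    return True


# (education, education-num) pairs from the five rows printed by train_df.head()
sample_pairs = [
    ("Bachelors", 13),
    ("Bachelors", 13),
    ("HS-grad", 9),
    ("11th", 7),
    ("Bachelors", 13),
]

print(is_one_to_one(sample_pairs))
```

If the check holds on the full dataset, keeping only one of the two columns loses no information.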

In [3]:
!pip install --upgrade certifi



In [1]:
import pandas as pd

cols = [
    "age", 
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "target"
]

In [5]:
import requests

urls = {
    "adult.data": "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    "adult.test": "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
}

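for fname, url in urls.items():
    # Keep TLS certificate verification on (the default). If it fails, refresh the
    # CA bundle (see the certifi upgrade above) rather than passing verify=False.
    r = requests.get(url, timeout=60)
    r.raise_for_status()  # fail loudly instead of writing an error page to disk
    with open(fname, "wb") as f:
        f.write(r.content)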




In [6]:
# train_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", names=cols)
# test_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", names=cols, skiprows=1)

train_df = pd.read_csv("adult.data", names=cols)
test_df = pd.read_csv("adult.test", names=cols, skiprows=1)

train_df["target"] = train_df["target"].apply(lambda x: 1 if ">50K" in x else 0)
test_df["target"] = test_df["target"].apply(lambda x: 1 if ">50K" in x else 0)

print(train_df.shape, test_df.shape)
train_df.head()

(32561, 15) (16281, 15)


| |age|workclass|fnlwgt|education|education-num|marital-status|occupation|relationship|race|sex|capital-gain|capital-loss|hours-per-week|native-country|target|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|39|State-gov|77516|Bachelors|13|Never-married|Adm-clerical|Not-in-family|White|Male|2174|0|40|United-States|0|
|1|50|Self-emp-not-inc|83311|Bachelors|13|Married-civ-spouse|Exec-managerial|Husband|White|Male|0|0|13|United-States|0|
|2|38|Private|215646|HS-grad|9|Divorced|Handlers-cleaners|Not-in-family|White|Male|0|0|40|United-States|0|
|3|53|Private|234721|11th|7|Married-civ-spouse|Handlers-cleaners|Husband|Black|Male|0|0|40|United-States|0|
|4|28|Private|338409|Bachelors|13|Married-civ-spouse|Prof-specialty|Wife|Black|Female|0|0|40|Cuba|0|
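
The substring test used for the label is worth a note. The raw values carry leading whitespace, and in `adult.test` the labels also end with a period (for example ` >50K.`), so an exact comparison against `">50K"` would mark every test row as class 0. Here is a small sketch of the same logic as a named function (the name `encode_target` is an assumption for illustration):

```python
def encode_target(raw_label: str) -> int:
    """Map a raw income label to 1 (>50K) or 0 (<=50K)."""
    return 1 if ">50K" in raw_label else 0


# Label variants: with leading whitespace, and with the trailing period used in adult.test
for raw in [" >50K", " <=50K", " >50K.", " <=50K."]:
    print(repr(raw), "->", encode_target(raw))
```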


In [7]:
# Write the training and testing datasets to S3
# Access the default S3 bucket
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()
bucket

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


'sagemaker-ap-southeast-2-907808569037'

In [8]:
# Write the files locally (create the target directory if it does not exist)
import os

os.makedirs("../data", exist_ok=True)
train_df.to_csv("../data/train.csv", index=False)
test_df.to_csv("../data/test.csv", index=False)

In [9]:
# Upload files into S3
train_path = session.upload_data(path="../data/train.csv", bucket=bucket, key_prefix="sagemaker/salary_prediction")
test_path = session.upload_data(path="../data/test.csv", bucket=bucket, key_prefix="sagemaker/salary_prediction")

print(f"Train Path: {train_path}")
print(f"Test Path: {test_path}")

Train Path: s3://sagemaker-ap-southeast-2-907808569037/sagemaker/salary_prediction/train.csv
Test Path: s3://sagemaker-ap-southeast-2-907808569037/sagemaker/salary_prediction/test.csv
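
The returned paths follow the pattern `s3://<bucket>/<key_prefix>/<file name>`. Below is a small sketch of that composition, which can be handy when a later cell needs the URI without re-uploading. `expected_s3_uri` is a hypothetical helper, not part of the SageMaker SDK.

```python
import posixpath


def expected_s3_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """Build the S3 URI that an upload of local_path under key_prefix is expected to produce."""
    file_name = posixpath.basename(local_path)
    return f"s3://{bucket}/{posixpath.join(key_prefix, file_name)}"


print(expected_s3_uri(
    "sagemaker-ap-southeast-2-907808569037",
    "sagemaker/salary_prediction",
    "../data/train.csv",
))
```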


In [10]:
!pip install ydata_profiling


In [11]:
!pip install --upgrade scipy



In [12]:
# Exploratory Data Analysis
from ydata_profiling import ProfileReport

In [13]:
profile = ProfileReport(train_df)
profile.to_file("profile_report.html")


In [14]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  target          32561 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 3.7+ MB


In [15]:
# Split features and targets
X_train = train_df.drop("target", axis=1)
y_train = train_df['target']

X_test = test_df.drop("target", axis=1)
y_test = test_df['target']

In [16]:
# Identify categorical and numerical features
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()
num_cols = X_train.select_dtypes(include=['int64']).columns.tolist()

In [17]:
cat_cols

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']

In [18]:
num_cols

['age',
 'fnlwgt',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week']

In [19]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

In [20]:
# One-hot encode the categorical columns
cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

# Scale the numerical features
num_transformer = StandardScaler()

# Combine preprocessing in a ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", num_transformer, num_cols),
    ("cat", cat_transformer, cat_cols)
])

# ML model (recent XGBoost releases ignore use_label_encoder and log a warning)
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Pipeline with preprocessor + ML model
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("xgb_model", xgb)
])
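
Combining `drop='first'` with `handle_unknown='ignore'` has a side effect worth knowing: a category that was not seen during fitting is encoded as all zeros, which is the same encoding as the dropped first category. The sketch below re-implements that behaviour in plain Python to make it visible. It illustrates the idea and is not scikit-learn's code; `one_hot_drop_first` is a made-up name.

```python
def one_hot_drop_first(categories, value):
    """One-hot encode value against the sorted categories, dropping the first column.

    Unknown values are ignored, i.e. encoded as all zeros.
    """
    kept = sorted(categories)[1:]  # the first (sorted) category gets no column
    return [1 if value == cat else 0 for cat in kept]


sexes = ["Male", "Female"]
print(one_hot_drop_first(sexes, "Male"))     # column for "Male" is set
print(one_hot_drop_first(sexes, "Female"))   # dropped category: all zeros
print(one_hot_drop_first(sexes, "Unknown"))  # unseen category: also all zeros
```

If distinguishing unseen categories matters, dropping the `drop='first'` argument is one option, since tree-based models do not need the reference column removed.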

In [21]:
# To view pipeline as a diagram
from sklearn import set_config
set_config(display='diagram')

In [22]:
# Fit the pipeline locally
pipeline.fit(X_train, y_train)

Parameters: { "use_label_encoder" } are not used.



[Output: pipeline diagram. The `preprocessor` ColumnTransformer (`num`: StandardScaler with `with_mean=True`, `with_std=True`; `cat`: OneHotEncoder with `drop='first'`, `handle_unknown='ignore'`, `sparse_output=True`; `remainder='drop'`) feeds the `xgb_model` XGBClassifier (`objective='binary:logistic'`, `enable_categorical=False`).]


In [23]:
y_pred_train = pipeline.predict(X_train)
y_pred_test = pipeline.predict(X_test)

In [24]:
from sklearn.metrics import accuracy_score

# Compute accuracy on training data 
train_acc = accuracy_score(y_train, y_pred_train)
print(f"Train Accuracy: {train_acc:.4f}")

# Compute accuracy on test data
test_acc = accuracy_score(y_test, y_pred_test)
print(f"Test Accuracy: {test_acc:.4f}")

Train Accuracy: 0.9020
Test Accuracy: 0.8718
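
Accuracy is easier to judge next to a trivial baseline, because a classifier that always predicts the more frequent class can already score well on imbalanced labels. Here is a minimal sketch with made-up labels (the lists below are illustrative, not taken from the dataset):

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


y_true = [0, 0, 0, 1, 0, 1, 0, 0]  # hypothetical labels
y_pred = [0, 0, 0, 1, 0, 0, 0, 0]  # hypothetical predictions

print(f"Accuracy: {accuracy(y_true, y_pred):.4f}")
print(f"Majority baseline: {majority_baseline(y_true):.4f}")
```

On the real data, `y_test.value_counts(normalize=True).max()` gives the same baseline for comparison with the test accuracy above.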


## Fit the model on SageMaker

In [7]:
%%writefile train.py

import argparse
import os
import pandas as pd
import joblib
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

model_file_name = "pipeline_model1.joblib"

# Main function
def main():
    # Arguments
    parser = argparse.ArgumentParser()

    # Inbuilt Arguments
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    # Add arguments for data directories
    # SageMaker passes these automatically if you use inputs={...} in the estimator
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))

    # Hyperparameters to Tune
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--n_estimators", type=int, default=100)
    
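    # Custom arguments
    # Note: --use_label_encoder has no type, so a value passed on the command line
    # arrives as a string; it is parsed here but not used below
    parser.add_argument("--use_label_encoder", default=False)
    parser.add_argument("--eval_metric", type=str, default="logloss")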

    args, _ = parser.parse_known_args()

    # Load data
    # Read from local container paths, not S3
    # We join the channel path with the filename
    print(f"Reading training data from: {args.train}")
    train_df = pd.read_csv(os.path.join(args.train, "train.csv"))
    
    print(f"Reading test data from: {args.test}")
    test_df = pd.read_csv(os.path.join(args.test, "test.csv"))

    # Split features and targets
    X_train = train_df.drop("target", axis=1)
    y_train = train_df['target']

    X_test = test_df.drop("target", axis=1)
    y_test = test_df['target']
    
    # Define columns
    cat_cols = X_train.select_dtypes(include=["object"]).columns.tolist()
    num_cols = X_train.select_dtypes(include=["int64"]).columns.tolist()
    
    # One-hot encode the categorical columns
    cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

    # Scale the numerical features
    num_transformer = StandardScaler()

    # Combine preprocessing in a ColumnTransformer
    preprocessor = ColumnTransformer([
        ("num", num_transformer, num_cols),
        ("cat", cat_transformer, cat_cols)
    ])

    # ML model
    xgb = XGBClassifier(
        max_depth=args.max_depth,
        learning_rate=args.learning_rate,
        n_estimators=args.n_estimators,
        use_label_encoder=False,
        eval_metric=args.eval_metric,
    )

    # Pipeline with preprocessor + ML model
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("xgb_model", xgb)
    ])

    # Fit the pipeline inside the training container
    pipeline.fit(X_train, y_train)

    y_pred_train = pipeline.predict(X_train)
    y_pred_test = pipeline.predict(X_test)

    
    # Compute accuracy on training data 
    train_acc = accuracy_score(y_train, y_pred_train)
    print(f"Train Accuracy: {train_acc:.4f}")

    # Compute accuracy on test data
    test_acc = accuracy_score(y_test, y_pred_test)
    print(f"Test Accuracy: {test_acc:.4f}")

    # Save the model
    model_save_path = os.path.join(args.model_dir, model_file_name)
    joblib.dump(pipeline, model_save_path)
    print(f"Model saved at {model_save_path}")

# Run the main function when the script runs
if __name__ == "__main__":
    main()

Writing train.py
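
SageMaker hands hyperparameters to the script as command-line strings, which is why the parser declares types. The sketch below simulates that with an explicit argument list (the values are invented) and also shows a pitfall with untyped flags: the string `"False"` is truthy. `str_to_bool`, `--untyped_flag`, and `--typed_flag` are hypothetical names used only for this illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret common textual booleans; anything else is treated as False."""
    return str(value).strip().lower() in {"1", "true", "yes"}


parser = argparse.ArgumentParser()
parser.add_argument("--max_depth", type=int, default=5)
parser.add_argument("--learning_rate", type=float, default=0.1)
parser.add_argument("--untyped_flag", default=False)
parser.add_argument("--typed_flag", type=str_to_bool, default=False)

# Simulated command line, similar in shape to what the training container passes
argv = ["--max_depth", "7", "--untyped_flag", "False", "--typed_flag", "False", "--extra", "x"]
args, unknown = parser.parse_known_args(argv)

print(args.max_depth, args.learning_rate)         # typed values, default kept for learning_rate
print(bool(args.untyped_flag), args.typed_flag)   # the untyped "False" string is truthy
print(unknown)                                    # arguments the parser does not know about
```

`parse_known_args` is what lets the script tolerate extra arguments instead of exiting with an error.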


In [8]:
%%writefile requirements.txt
pandas
scikit-learn
xgboost==1.7.6
fsspec
s3fs

Writing requirements.txt


In [9]:
# Organize files
# This creates a 'code' folder and moves your files there.
# This ensures requirements.txt is found and installed correctly.
!mkdir -p code
!mv train.py code/
!mv requirements.txt code/

In [38]:
# Train
# Choose instance type
# Choose framework version
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
from sagemaker import get_execution_role

# Define the S3 Paths for your data
train_path = "s3://sagemaker-ap-southeast-2-907808569037/sagemaker/salary_prediction/train.csv"
test_path = "s3://sagemaker-ap-southeast-2-907808569037/sagemaker/salary_prediction/test.csv"

# Configure the Estimator
sklearn_estimator = SKLearn(
    base_job_name="xgb-pipeline-run",
    framework_version="1.2-1",
    
    # source_dir points to the folder containing BOTH script and requirements
    source_dir="code", 
    entry_point="train.py",
    
    # Note: We removed 'dependencies' because source_dir handles requirements.txt automatically
    
    hyperparameters={
        "use_label_encoder": False,
        "eval_metric": "logloss"
    },
    instance_count=1,
    instance_type="ml.m5.large",
    use_spot_instances=True,
    max_wait=600,
    max_run=600,
    role=get_execution_role()
)

# Launch Training with Inputs
# The keys 'train' and 'test' match the arguments in your script!
sklearn_estimator.fit({
    'train': train_path,
    'test': test_path
})

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker:Creating training-job with name: xgb-pipeline-run-2025-11-23-19-21-49-247


2025-11-23 19:21:52 Starting - Starting the training job...
2025-11-23 19:22:05 Starting - Preparing the instances for training...
2025-11-23 19:22:28 Downloading - Downloading input data...
2025-11-23 19:22:59 Downloading - Downloading the training image......
  import pkg_resources
2025-11-23 19:24:06,047 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2025-11-23 19:24:06,051 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2025-11-23 19:24:06,054 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-11-23 19:24:06,070 sagemaker_sklearn_container.training INFO     Invoking user training script.
2025-11-23 19:24:06,422 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt
Collecting xgboost==1.7.6 (from -r requirements.txt (line 3))
  Downloading xgboost-1.7.6-py3-none-manylinux2014_x86_64.whl.

In [2]:
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

# Look for the name in your AWS Console if you lost the previous output
old_training_job_name = "xgb-pipeline-run-2025-11-23-19-21-49-247" 

# Attach to the old job
print(f"Attaching to job: {old_training_job_name}")
sklearn_estimator = SKLearn.attach(old_training_job_name)

# Now sklearn_estimator is back and ready to use!
print(f"Model data is at: {sklearn_estimator.model_data}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Attaching to job: xgb-pipeline-run-2025-11-23-19-21-49-247

2025-11-23 19:24:43 Starting - Preparing the instances for training
2025-11-23 19:24:43 Downloading - Downloading the training image
2025-11-23 19:24:43 Training - Training image download completed. Training in progress.
2025-11-23 19:24:43 Uploading - Uploading generated training model
2025-11-23 19:24:43 Completed - Training job completed
Model data is at: s3://sagemaker-ap-southeast-2-907808569037/xgb-pipeline-run-2025-11-23-19-21-49-247/output/model.tar.gz


## Check the training job name

In [3]:
import boto3
sm_client = boto3.client("sagemaker")

training_job_name = sklearn_estimator.latest_training_job.name

# Location of the model stored in S3
model_artifact = sm_client.describe_training_job(
    TrainingJobName=training_job_name
)["ModelArtifacts"]["S3ModelArtifacts"]

print(f"Training job name: {training_job_name}")
print(f"Model storage location: {model_artifact}")

Training job name: xgb-pipeline-run-2025-11-23-19-21-49-247
Model storage location: s3://sagemaker-ap-southeast-2-907808569037/xgb-pipeline-run-2025-11-23-19-21-49-247/output/model.tar.gz


## Hyperparameter Tuning

In [11]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

# Re-define the estimator (Fresh start ensures it picks up the new train.py)
# We do NOT pass specific hyperparameters here, because the Tuner will inject them.
sklearn_estimator = SKLearn(
    base_job_name="xgb-tuning-job",
    framework_version="1.2-1",
    source_dir="code",
    entry_point="train.py",
    instance_count=1,
    instance_type="ml.m5.large",
    use_spot_instances=True,
    max_wait=600,
    max_run=600,
    role=get_execution_role()
)

# Define the ranges to search
hyperparameter_ranges = {
    'max_depth': IntegerParameter(3, 10),           
    'learning_rate': ContinuousParameter(0.01, 0.3),
    'n_estimators': IntegerParameter(50, 200)      
}

# Define the Metric to Optimize
# This regex matches the print statement in train.py: "Test Accuracy: 0.8543"
objective_metric_name = 'test-accuracy'
metric_definitions = [{'Name': 'test-accuracy', 'Regex': 'Test Accuracy: ([0-9\\.]+)'}]

# Create the Tuner
tuner = HyperparameterTuner(
    estimator=sklearn_estimator,
    objective_metric_name=objective_metric_name,
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=10,           # Total number of training jobs to run (Budget)
    max_parallel_jobs=2,   # How many to run at the same time
    objective_type='Maximize'
)

# Launch the Tuning Job
# We pass the same data inputs as before
tuner.fit({
    'train': "s3://sagemaker-ap-southeast-2-907808569037/sagemaker/salary_prediction/train.csv",
    'test': "s3://sagemaker-ap-southeast-2-907808569037/sagemaker/salary_prediction/test.csv"
})

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


....................................................................................................................................................................................................................!
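
The metric definition works by applying the regex to the training job's log output and reading the capture group. Below is a quick local check of the pattern against the log format printed by `train.py`. Python's `re` is used as a stand-in for the service's matcher, which is an assumption, and `extract_metric` is a hypothetical helper.

```python
import re

# Same pattern as in metric_definitions, written as a raw string
METRIC_REGEX = r"Test Accuracy: ([0-9\.]+)"


def extract_metric(log_text: str):
    """Return the last matched metric value in log_text as a float, or None."""
    matches = re.findall(METRIC_REGEX, log_text)
    return float(matches[-1]) if matches else None


sample_log = "Train Accuracy: 0.9020\nTest Accuracy: 0.8718\n"
print(extract_metric(sample_log))
```

Checking the pattern locally first is cheap compared with discovering a non-matching regex after a tuning job has run.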


In [12]:
# Analyze tuning results
results = tuner.analytics().dataframe()
results.sort_values("FinalObjectiveValue", ascending=False).head()

| |learning_rate|max_depth|n_estimators|TrainingJobName|TrainingJobStatus|FinalObjectiveValue|TrainingStartTime|TrainingEndTime|TrainingElapsedTimeSeconds|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|7|0.142639|3.0|197.0|sagemaker-scikit-lea-251123-2005-003-09e0c1b6|Completed|0.8764|2025-11-23 20:10:32+00:00|2025-11-23 20:12:57+00:00|145.0|
|0|0.182912|3.0|147.0|sagemaker-scikit-lea-251123-2005-010-09aa80c8|Completed|0.8756|2025-11-23 20:21:04+00:00|2025-11-23 20:23:33+00:00|149.0|
|6|0.214578|6.0|76.0|sagemaker-scikit-lea-251123-2005-004-ed377e92|Completed|0.8744|2025-11-23 20:10:41+00:00|2025-11-23 20:13:05+00:00|144.0|
|4|0.083398|8.0|114.0|sagemaker-scikit-lea-251123-2005-006-859c4b4e|Completed|0.8741|2025-11-23 20:14:00+00:00|2025-11-23 20:16:34+00:00|154.0|
|1|0.155947|7.0|145.0|sagemaker-scikit-lea-251123-2005-009-57f9dd79|Completed|0.874|2025-11-23 20:21:03+00:00|2025-11-23 20:23:17+00:00|134.0|


In [13]:
best_job_name = tuner.best_training_job()
print(f"The best performing job was: {best_job_name}")

The best performing job was: sagemaker-scikit-lea-251123-2005-003-09e0c1b6
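
`best_training_job()` should agree with the top row of the sorted results. Here is the same selection sketched over three of the rows above as plain dictionaries (only the two columns needed are copied):

```python
# Three rows copied from the tuning results (job name and objective value only)
results = [
    {"TrainingJobName": "sagemaker-scikit-lea-251123-2005-003-09e0c1b6", "FinalObjectiveValue": 0.8764},
    {"TrainingJobName": "sagemaker-scikit-lea-251123-2005-010-09aa80c8", "FinalObjectiveValue": 0.8756},
    {"TrainingJobName": "sagemaker-scikit-lea-251123-2005-004-ed377e92", "FinalObjectiveValue": 0.8744},
]

# The objective type is 'Maximize', so the best job has the largest objective value
best = max(results, key=lambda row: row["FinalObjectiveValue"])
print(best["TrainingJobName"])
```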


## Create the inference script (`serve.py`)

In [21]:
%%writefile serve.py

import os
import joblib
import pandas as pd

def model_fn(model_dir):
    """Load and return the model"""
    model_file_name = "pipeline_model1.joblib"
    pipeline_model = joblib.load(os.path.join(model_dir, model_file_name))

    return pipeline_model

def input_fn(request_body, request_content_type):
    """Parse the JSON Lines request body and return it as a DataFrame.
    Any additional input preprocessing can also be added in this function."""
    if request_content_type == "application/json":
        input_object = pd.read_json(request_body, lines=True)

        return input_object
    else:
        raise ValueError("Only application/json content type is supported!")

def predict_fn(input_object, pipeline_model):
    """Make predictions on processed input data"""
    predictions = pipeline_model.predict(input_object)
    pred_probs = pipeline_model.predict_proba(input_object)

    prediction_object = pd.DataFrame(
        {
          "prediction": predictions.tolist(),
          "pred_prob_class0": pred_probs[:, 0].tolist(),
          "pred_prob_class1": pred_probs[:, 1].tolist()  
        }
    )
    return prediction_object

def output_fn(prediction_object, request_content_type):
    """Post-process the predictions and return them as JSON Lines.
    The second argument receives the response type requested by the client."""
    return_object = prediction_object.to_json(orient='records', lines=True)

    return return_object

Overwriting serve.py
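
The handler expects JSON Lines: one JSON object per line, one line per row. Below is a dependency-free sketch of that wire format, which can help when building request bodies in clients that do not have pandas. The helper names are hypothetical and the rows are abbreviated to three fields.

```python
import json


def to_json_lines(records):
    """Serialize a list of dicts as JSON Lines (one object per line)."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_json_lines(body):
    """Parse a JSON Lines string back into a list of dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


rows = [
    {"age": 25, "workclass": " Private", "hours-per-week": 40},
    {"age": 38, "workclass": " Private", "hours-per-week": 50},
]
body = to_json_lines(rows)
print(body)
print(from_json_lines(body) == rows)
```

A real request needs every feature column the pipeline was trained on, not just the three shown here.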


In [20]:
%%writefile requirements.txt
pandas
numpy
joblib
xgboost==1.7.6

Overwriting requirements.txt


## Serverless deployment

In [22]:
# Create the deployment
from sagemaker.sklearn.model import SKLearnModel
from sagemaker import Session, get_execution_role

session = Session()
bucket = session.default_bucket()

training_job_name = "sagemaker-scikit-lea-251123-2005-003-09e0c1b6"
model_artifact = f"s3://{bucket}/{training_job_name}/output/model.tar.gz"
# Used as the SageMaker model name below; the actual endpoint name is generated at deploy time
endpoint_name = "salary-prediction-pipeline-real-time"

model = SKLearnModel(
    name=endpoint_name,
    framework_version='1.2-1',
    entry_point='serve.py',
    source_dir='.',
    model_data=model_artifact,
    role=get_execution_role() 
)

In [23]:
# Create a config for serverless inference
from sagemaker.serverless import ServerlessInferenceConfig
serverless_config = ServerlessInferenceConfig(memory_size_in_mb=1024, max_concurrency=4)
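
Serverless endpoints only accept certain memory sizes. To our knowledge these are 1 GB steps from 1024 MB to 6144 MB, but treat the allowed set below as an assumption to verify against the current AWS documentation. Here is a small pre-flight check, with `check_serverless_settings` as a hypothetical helper:

```python
# Assumption: allowed memory sizes at the time of writing; check current AWS limits
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def check_serverless_settings(memory_size_in_mb: int, max_concurrency: int) -> None:
    """Raise ValueError for settings that the endpoint configuration is likely to reject."""
    if memory_size_in_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_size_in_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")


# The values used in the config above
check_serverless_settings(memory_size_in_mb=1024, max_concurrency=4)
print("settings look valid")
```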

In [24]:
# Deploy the model
predictor = model.deploy(serverless_inference_config=serverless_config)

Using already existing model: salary-prediction-pipeline-real-time


------!

In [25]:
endpoint_name = predictor.endpoint_name
print("Endpoint_Name:")
print(f"{endpoint_name}")

Endpoint_Name:
salary-prediction-pipeline-real-time-2025-11-24-17-50-09-719


## Invoke the model

In [26]:
# Load some data that we want to make predictions on
import pandas as pd
import json

test_df = pd.read_csv("s3://sagemaker-ap-southeast-2-907808569037/sagemaker/salary_prediction/test.csv")

X_test = test_df.drop("target", axis=1)
y_test = test_df["target"]

# Get 2 rows to make prediction on
X_pred = X_test.head(2).to_json(orient='records', lines=True)
X_pred

'{"age":25,"workclass":" Private","fnlwgt":226802,"education":" 11th","education-num":7,"marital-status":" Never-married","occupation":" Machine-op-inspct","relationship":" Own-child","race":" Black","sex":" Male","capital-gain":0,"capital-loss":0,"hours-per-week":40,"native-country":" United-States"}\n{"age":38,"workclass":" Private","fnlwgt":89814,"education":" HS-grad","education-num":9,"marital-status":" Married-civ-spouse","occupation":" Farming-fishing","relationship":" Husband","race":" White","sex":" Male","capital-gain":0,"capital-loss":0,"hours-per-week":50,"native-country":" United-States"}\n'

In [27]:
# Submit the request to the serverless endpoint
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

response = sm_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                      Body=X_pred,
                                      ContentType="application/json",
                                      Accept="application/json") 

In [29]:
print(response)

{'ResponseMetadata': {'RequestId': 'e197a241-3dc1-4026-bfd6-21b089052b6e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'e197a241-3dc1-4026-bfd6-21b089052b6e', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Mon, 24 Nov 2025 18:03:07 GMT', 'content-type': 'text/html; charset=utf-8', 'content-length': '161', 'connection': 'keep-alive'}, 'RetryAttempts': 0}, 'ContentType': 'text/html; charset=utf-8', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7f8c27914a00>}


In [30]:
# Decode the response from the endpoint
from io import StringIO

response_body = response["Body"]
response_str = response_body.read().decode('utf-8')
# Wrap the string in StringIO: newer pandas versions warn when read_json receives a literal JSON string
response_df = pd.read_json(StringIO(response_str), lines=True)

print(response_df)

   prediction  pred_prob_class0  pred_prob_class1
0           0          0.996580          0.003420
1           0          0.834408          0.165592


In [32]:
import boto3

def cleanup(endpoint_name):
    sm_client = boto3.client("sagemaker")
    sm_client.delete_endpoint(EndpointName=endpoint_name)

In [33]:
cleanup(endpoint_name)
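
Deleting the endpoint leaves its endpoint configuration and the model object in the account. A fuller cleanup might look like the sketch below. The `describe_endpoint`, `describe_endpoint_config`, `delete_endpoint`, `delete_endpoint_config`, and `delete_model` calls are boto3 SageMaker client operations. `cleanup_all` and the recording stand-in client are hypothetical; the stand-in is only there so the block runs without AWS access.

```python
def cleanup_all(sm_client, endpoint_name):
    """Delete an endpoint, then its endpoint configuration and the models it references."""
    # Look up the dependent resources before anything is deleted
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Dry-run stand-in for boto3.client("sagemaker") that records delete calls."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointConfigName": f"{EndpointName}-config"}

    def describe_endpoint_config(self, EndpointConfigName):
        return {"ProductionVariants": [{"ModelName": "demo-model"}]}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


client = RecordingClient()
cleanup_all(client, "demo-endpoint")
print(client.calls)
```

With real credentials, pass `boto3.client("sagemaker")` and the real endpoint name instead of the stand-in.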