# **Predicting Customer Responses to a Bank Marketing Campaign Using AWS SageMaker’s Built-in XGBoost Algorithm**

### **Running as a Notebook Instance in SageMaker**
- This notebook **runs inside a SageMaker Notebook Instance**, meaning:
  - It is **hosted on AWS SageMaker itself**.
  - It has **direct access** to SageMaker services.
  - No extra configuration is needed to interact with SageMaker.
- You **train the model using SageMaker’s built-in XGBoost container**.
- The notebook uses **`boto3`** to:  
  1. **Upload data to S3.**
- The notebook uses **SageMaker Python SDK** to:  
  1. **Train the model using an XGBoost estimator.**  
  2. **Deploy the trained model as an endpoint.**  
  3. **Perform real-time predictions.**  
- The execution role is resolved with:
  ```python
  sagemaker.get_execution_role()
  ```
  - This call **automatically fetches the IAM role assigned to the notebook instance**.
- No need to manually configure AWS credentials because **the notebook is already inside the AWS environment**.


### **Algorithm for using built-in XGBoost model on AWS SageMaker**

#### **Step 1: Setup AWS SageMaker**
1. Create a **Notebook instance** on AWS SageMaker. 
2. Import necessary libraries such as `boto3`, `sagemaker`, and `numpy`.
3. Retrieve the default SageMaker execution role using `sagemaker.get_execution_role()`.
4. Define the AWS session and S3 bucket where model artifacts and datasets will be stored.

#### **Step 2: Load and Preprocess the Data**
4. Load the dataset from a CSV or any other format.
5. Split the dataset into training, validation, and test sets.
6. Convert data into CSV format if required by SageMaker XGBoost.

#### **Step 3: Upload Data to S3**
7. Define the S3 bucket location where datasets will be stored.
8. Use `sagemaker.Session().upload_data()` to upload the training and validation datasets to S3.

#### **Step 4: Define SageMaker XGBoost Estimator**
9. Retrieve the SageMaker XGBoost container image using `sagemaker.image_uris.retrieve()`.
10. Initialize a SageMaker XGBoost `Estimator`, specifying:
   - Role with necessary permissions.
   - Instance type for training (`ml.m5.large` or any other).
   - Hyperparameters such as learning rate, number of rounds, etc.
   - Input/output paths for model artifacts.

#### **Step 5: Train the XGBoost Model**
11. Define `TrainingInput` sources for the training and validation data (this class replaces `s3_input` from SageMaker Python SDK v1).
12. Call `.fit()` on the `Estimator`, providing the S3 data locations.

#### **Step 6: Deploy the Model**
13. Call `.deploy()` on the trained model with:
   - Instance type for inference (`ml.m5.large`, `ml.t2.medium`, etc.).
   - Endpoint configuration.

#### **Step 7: Make Predictions**
14. Use the SageMaker endpoint to make real-time predictions.
15. Format the input data correctly before sending it to the endpoint.

#### **Step 8: Evaluate the Model**
16. Retrieve predictions from the endpoint.
17. Compute evaluation metrics such as accuracy, RMSE, or F1-score.

#### **Step 9: Clean Up Resources**
18. Delete the SageMaker endpoint after use to avoid unnecessary charges.
19. Optionally remove the trained model from S3 if no longer needed.

### Importing Important Libraries

In [15]:
import sagemaker
import boto3
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput  # SDK v2 name for the former s3_input class

# Define the bucket name (Must be unique globally)
bucket_name = 'bankapplication-eu'  # <-- Change this to a globally unique name

# Set the region to eu-central-1
my_region = 'eu-central-1'
print(my_region)

eu-central-1


In [2]:
s3 = boto3.client('s3', region_name=my_region)  # Explicitly set region

# Check if bucket exists
existing_buckets = [bucket['Name'] for bucket in s3.list_buckets()['Buckets']]
if bucket_name in existing_buckets:
    print(f'S3 bucket "{bucket_name}" already exists.')
else:
    try:
        # Create bucket with the correct LocationConstraint for eu-central-1
        s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': my_region}
        )
        print('S3 bucket created successfully in eu-central-1!')
    except Exception as e:
        print('S3 error:', e)

S3 bucket "bankapplication-eu" already exists.
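
The `LocationConstraint` used above is required for every region except `us-east-1`, where S3 rejects an explicit constraint. The following is a minimal sketch of a helper that builds the right arguments for either case; `create_bucket_kwargs` is a hypothetical name introduced here, not part of `boto3`.

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build keyword arguments for the boto3 S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the S3 default region and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Values used in this notebook
print(create_bucket_kwargs("bankapplication-eu", "eu-central-1"))
```

In the cell above, the result would be passed on as `s3.create_bucket(**create_bucket_kwargs(bucket_name, my_region))`.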


In [3]:
# Set an output path where the trained model will be saved
prefix = 'xgboost-as-a-built-in-algo'
output_path = 's3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)

s3://bankapplication-eu/xgboost-as-a-built-in-algo/output


#### Downloading The Dataset And Storing in S3

In [4]:
import pandas as pd
import urllib.request  # "import urllib" alone does not guarantee that urllib.request is loaded
try:
    urllib.request.urlretrieve("https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv", "bank_clean.csv")
    print('Success: downloaded bank_clean.csv.')
except Exception as e:
    print('Data download error: ', e)

try:
    model_data = pd.read_csv('./bank_clean.csv',index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: downloaded bank_clean.csv.
Success: Data loaded into dataframe.


In [5]:
### Train Test split

import numpy as np
train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data))])
print(train_data.shape, test_data.shape)

(28831, 61) (12357, 61)
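
The shapes printed above follow directly from the cut point `int(0.7 * len(model_data))`. As a rough illustration that does not need the dataset, the sketch below reproduces the same arithmetic on row positions using only the standard library. `split_indices` is a hypothetical helper, and its shuffle order will differ from the one produced by `DataFrame.sample`; only the sizes are comparable.

```python
import random


def split_indices(n_rows: int, train_frac: float = 0.7, seed: int = 1729):
    """Shuffle row positions and cut them into a train part and a test part."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # deterministic for a fixed seed
    cut = int(train_frac * n_rows)          # same cut point as int(0.7 * len(model_data))
    return positions[:cut], positions[cut:]


# 41188 = 28831 + 12357, the total row count implied by the shapes above
train_idx, test_idx = split_indices(41188)
print(len(train_idx), len(test_idx))
```

Because the cut point is truncated with `int()`, the training set receives 28831 rows and the remaining 12357 rows go to the test set.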


### **Understanding the Data Split in Code**

The **train dataset** is being saved in **Amazon S3** with a specific transformation before saving.

**Amazon SageMaker's built-in XGBoost algorithm does not support multiple target columns (multi-output regression or multi-label classification) natively**. It is designed for **single-target supervised learning tasks**, such as:

1. **Regression** (single continuous target variable)
2. **Binary Classification** (one target column with `0` or `1`)
3. **Multi-Class Classification** (one target column with categorical values)
---

### **🔹 Step 1: Transform the Training Data**
```python
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1)
```
#### **🔍 What's Happening?**
- `train_data` originally has **two target columns**: `y_no` and `y_yes` (likely **one-hot encoded labels**).
- The operation **removes** `y_no` and **keeps** `y_yes` as the first column.
- All **other feature columns** remain in the dataset.

#### **💡 Why This?**
- **SageMaker XGBoost requires the label (target) as the first column** in training data.
- Since `y_yes` represents the **positive class (1 for "yes", 0 for "no")**, it is placed **at the beginning**.

---

### **🔹 Step 2: Save as CSV**
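```python
# Chained onto the DataFrame built in Step 1
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
```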
- Saves the **processed training data** as `train.csv`.
- `index=False` → **No row indices** in the CSV file.
- `header=False` → **No column names** in the CSV file (needed for SageMaker training).

---

### **🔹 Step 3: Upload to S3**
```python
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
```
- **Uploads `train.csv` to S3** in:
  ```
  s3://<bucket_name>/<prefix>/train/train.csv
  ```
- `prefix` defines a **structured directory**, grouping related training files under `train/`.
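
S3 object keys always use forward slashes. `os.path.join` produces them on the Linux-based notebook instance, but it would insert backslashes if the same code ran on Windows. A small sketch of an OS-independent alternative, assuming two hypothetical helpers `s3_key` and `s3_uri`:

```python
import posixpath


def s3_key(*parts: str) -> str:
    """Join key components with forward slashes, regardless of the host OS."""
    return posixpath.join(*parts)


def s3_uri(bucket: str, *parts: str) -> str:
    """Build an s3:// URI from a bucket name and key components."""
    return "s3://{}/{}".format(bucket, s3_key(*parts))


print(s3_key("xgboost-as-a-built-in-algo", "train/train.csv"))
print(s3_uri("bankapplication-eu", "xgboost-as-a-built-in-algo", "train"))
```

The first call yields the object key for the upload, the second the folder-level URI that the training input points at.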

---

### **🔹 Step 4: Define S3 Input for SageMaker Training**
```python
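s3_input_train = TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')
```
- **`s3_input_train` is passed to the SageMaker training job**. `TrainingInput` (from `sagemaker.inputs`) replaces the `s3_input` class of SageMaker Python SDK v1.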
- It tells SageMaker **where to find the training data** and **its format (`csv`)**.

---

## **🔥 Summary: Why This Split?**
| Step | Action | Purpose |
|------|--------|---------|
| **1** | Move `y_yes` to first column, remove `y_no` | Ensures **correct target format** for XGBoost |
| **2** | Save as `train.csv` | Creates SageMaker-compatible dataset |
| **3** | Upload to S3 | Makes the data readable by the SageMaker training job |
| **4** | Create `s3_input_train` | Defines dataset location for SageMaker |

---

### **🔹 Example: Original vs Processed Data**
#### **Original `train_data`**
| y_no | y_yes | Feature1 | Feature2 | Feature3 |
|------|------|----------|----------|----------|
| 0    | 1    | 3.5      | 2.1      | 0.7      |
| 1    | 0    | 1.2      | 4.3      | 2.2      |

#### **After Processing**
| y_yes | Feature1 | Feature2 | Feature3 |
|------|----------|----------|----------|
| 1    | 3.5      | 2.1      | 0.7      |
| 0    | 1.2      | 4.3      | 2.2      |

### **💡 Conclusion**
The transformation ensures that **SageMaker XGBoost gets data in the correct format** for training.
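
To make the target format concrete, here is a dependency-free sketch that applies the same transformation to the two example rows from the tables above. The notebook itself uses `pd.concat` and `to_csv`; `to_label_first_csv` is a hypothetical stand-in written with the standard `csv` module.

```python
import csv
import io


def to_label_first_csv(rows, label="y_yes", drop=("y_no",)):
    """Serialize dict rows as headerless CSV with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Keep every column except the label itself and the redundant one-hot twin.
        features = [v for k, v in row.items() if k != label and k not in drop]
        writer.writerow([row[label]] + features)
    return buffer.getvalue()


rows = [
    {"y_no": 0, "y_yes": 1, "Feature1": 3.5, "Feature2": 2.1, "Feature3": 0.7},
    {"y_no": 1, "y_yes": 0, "Feature1": 1.2, "Feature2": 4.3, "Feature3": 2.2},
]
print(to_label_first_csv(rows))
```

Each output line starts with the `y_yes` value, carries no header and no row index, and omits `y_no`.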

In [6]:
### Saving Train And Test Into Buckets
## We start with Train Data
import os
train_csv = pd.concat(
    [train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)],
    axis=1,
)
train_csv.to_csv('train.csv', index=False, header=False)
# UPLOAD TO S3 BUCKET
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
# READ TRAINING DATA FROM S3 BUCKET FOR TRAINING MODEL IN SAGEMAKER
s3_input_train = TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')


In [7]:
# Test Data Into Buckets
pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
s3_input_test = TrainingInput(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')

### Building and Training the Model with SageMaker's Built-in XGBoost Algorithm

### **Explanation of the Code**
```python
from sagemaker.image_uris import retrieve
container = retrieve('xgboost', boto3.Session().region_name, version='1.0-1')
```
This line retrieves the **Amazon Elastic Container Registry (ECR) image URI** (a plain string) for SageMaker's built-in XGBoost algorithm and assigns it to the `container` variable. No image is downloaded or built at this point.

---

### **1️⃣ `from sagemaker.image_uris import retrieve`**
- **`sagemaker.image_uris.retrieve()`** is a built-in SageMaker function that **retrieves the Amazon Elastic Container Registry (ECR) URI** for a given algorithm.
- SageMaker hosts pre-built Docker images for various ML algorithms (like XGBoost, TensorFlow, PyTorch, etc.), and this function helps fetch the correct container image for the selected region and version.

---

### **2️⃣ `retrieve('xgboost', boto3.Session().region_name, version='1.0-1')`**
This function call retrieves the **ECR URI** for **SageMaker's built-in XGBoost algorithm**. Let's break it down:

- **`'xgboost'`** → Specifies that we are retrieving an **XGBoost container**.
- **`boto3.Session().region_name`** → Dynamically fetches the AWS region where the SageMaker instance is running.
  - Example: If you're running in `us-east-1`, this resolves to `'us-east-1'`.
- **`version='1.0-1'`** → Specifies the **version** of the XGBoost container you want to use.
  - `'1.0-1'` refers to a specific SageMaker-managed XGBoost version.
  - Pin the version explicitly so that training runs stay reproducible, rather than relying on whatever default the SDK may resolve.

### **Example Output**
If executed in the **`us-east-1`** region, this returns a container image URI of the form:
```
'683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3'
```
This URI points to the **SageMaker-managed Docker container** that runs XGBoost.
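
An ECR image URI encodes the registry account, the region, the repository, and the tag. The sketch below splits a URI into those parts, which can be handy for checking that the image region matches the training region. The pattern is an assumption based on the standard `amazonaws.com` host format (it does not cover other AWS partitions), and the example URI uses a placeholder account ID and repository name.

```python
import re

# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError("Not a recognised ECR image URI: {!r}".format(uri))
    return match.groupdict()


# Placeholder account ID and repository, for illustration only
example_uri = "123456789012.dkr.ecr.eu-central-1.amazonaws.com/example-repo:1.0-1"
print(parse_ecr_uri(example_uri))
```

In this notebook, a check such as `parse_ecr_uri(container)["region"] == my_region` would confirm that the image comes from `eu-central-1`.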

In [8]:
# Look up the ECR image URI of the built-in XGBoost container.
# Pin the container version explicitly with the `version` argument.

from sagemaker.image_uris import retrieve

# Fetch the XGBoost image URI for the region defined above (eu-central-1)
container = retrieve('xgboost', region=my_region, version="1.0-1")


### **📌 Breakdown of the Hyperparameter Choices**
```python
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "binary:logistic",
    "num_round": 50
}
```

#### **1️⃣ max_depth = 5**  
- **What does it do?**: Limits the depth of each decision tree (controls complexity).  
- **Why 5?**:  
  - A moderate depth limits **overfitting** while capturing enough patterns in the data.  
  - Typically chosen between **3 and 10** based on cross-validation performance.

#### **2️⃣ eta = 0.2** (Learning Rate)  
- **What does it do?**: Scales the contribution of each new tree (shrinkage factor).  
- **Why 0.2?**:  
  - A **moderate eta (roughly 0.1 to 0.3; XGBoost's default is 0.3)** usually generalizes well, at the cost of needing more boosting rounds.  
  - Higher values like **0.5 or 1.0** can lead to faster convergence but may overfit.  

#### **3️⃣ gamma = 4**  
- **What does it do?**: Minimum loss reduction required to make a further split in a tree.  
- **Why 4?**:  
  - **Higher values** reduce overfitting by making the model stricter about adding new splits.  
  - Default is `0`, but values **between 1 and 10** are commonly tested.

#### **4️⃣ min_child_weight = 6**  
- **What does it do?**: Minimum sum of instance weights (hessian) needed in a child node.  
- **Why 6?**:  
  - Limits overfitting by rejecting splits whose child nodes would cover too little data.  
  - Common values are **3 to 10** for classification problems.

#### **5️⃣ subsample = 0.7**  
- **What does it do?**: Fraction of training rows sampled at random for each boosting iteration.  
- **Why 0.7?**:  
  - Reduces overfitting by adding randomness.  
  - Usually, values **between 0.5 and 0.8** are tested.

#### **6️⃣ objective = "binary:logistic"**  
- **What does it do?**: Specifies the learning task and its loss function.  
- **Why binary:logistic?**:  
  - Used for **binary classification problems** (e.g., Yes/No, 0/1); the model outputs a probability.

#### **7️⃣ num_round = 50**  
- **What does it do?**: Number of boosting iterations (trees).  
- **Why 50?**:  
  - A moderate value ensures sufficient training without excessive computation.  
  - Typically tuned in the range of **50-500** based on dataset complexity.
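
`gamma` and `min_child_weight` both act on a candidate split. In XGBoost's formulation, a split with gradient sums $G_L, G_R$ and hessian sums $H_L, H_R$ has gain $\tfrac{1}{2}\left[\tfrac{G_L^2}{H_L+\lambda} + \tfrac{G_R^2}{H_R+\lambda} - \tfrac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$. The sketch below is a simplified illustration of that rule, not the library's implementation: the function names and the gradient sums are made up, and `reg_lambda=1.0` assumes XGBoost's default L2 regularization.

```python
def split_gain(g_left, h_left, g_right, h_right, gamma=4.0, reg_lambda=1.0):
    """Loss reduction of a candidate split, net of the gamma penalty."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    raw_gain = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return raw_gain - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  gamma=4.0, min_child_weight=6.0, reg_lambda=1.0):
    """A split is kept only if both children are heavy enough and the net gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, gamma, reg_lambda) > 0


# Made-up gradient and hessian sums for three candidate splits
print(split_allowed(-10.0, 8.0, 12.0, 9.0))  # strong split, both children heavy enough
print(split_allowed(-10.0, 5.0, 12.0, 9.0))  # left child below min_child_weight
print(split_allowed(-2.0, 8.0, 3.0, 9.0))    # gain too small to pay for gamma
```

For `binary:logistic`, each row contributes a hessian of `p * (1 - p)`, which is at most 0.25, so `min_child_weight = 6` means a child node needs at least 24 rows.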

---

### **🔬 How Were These Hyperparameters Chosen?**

#### **✅ 1. Prior Knowledge / Default Values**
- Some values are standard or close to the defaults: `"objective": "binary:logistic"` is the usual choice for binary classification, and `"eta": 0.2` is slightly below XGBoost's default of `0.3`.
- Based on experience with similar datasets.

#### **✅ 2. Grid Search (Manual Tuning)**
- Trying different combinations and selecting the best-performing one.

#### **✅ 3. Random Search**
- Randomly selecting values within a predefined range and evaluating performance.

#### **✅ 4. Bayesian Optimization (AutoML)**
- Using tools like **Optuna** or SageMaker's **Automatic Model Tuning (HPO)**.

#### **✅ 5. Cross-validation Performance**
- The author likely ran multiple tests and **evaluated accuracy, log loss, or F1-score**.

---

### **🛠️ How to Find the Best Hyperparameters?**
If you are unsure about the best values, you can:
- **Use SageMaker Hyperparameter Tuning Job** to automate the search.
- **Try Grid Search / Random Search** using Scikit-learn or Optuna.
- **Analyze Feature Importance** and Tree Structures in XGBoost.
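
As a rough illustration of random search, the sketch below samples configurations from a search space and keeps the best one. The ranges are illustrative assumptions loosely based on the ranges mentioned above, and `evaluate` is a hypothetical caller-supplied function that would train a model with the given configuration and return a validation score (higher is better).

```python
import random

# Illustrative ranges only; adjust them to the problem at hand.
SEARCH_SPACE = {
    "max_depth": ("int", 3, 10),
    "eta": ("float", 0.05, 0.5),
    "gamma": ("float", 0.0, 10.0),
    "min_child_weight": ("float", 1.0, 10.0),
    "subsample": ("float", 0.5, 1.0),
}


def sample_config(rng):
    """Draw one hyperparameter configuration from SEARCH_SPACE."""
    config = {"objective": "binary:logistic", "num_round": 50}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        config[name] = rng.randint(low, high) if kind == "int" else rng.uniform(low, high)
    return config


def random_search(evaluate, n_trials=20, seed=0):
    """Evaluate n_trials random configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Stand-in objective for demonstration: prefers eta close to 0.2
best, score = random_search(lambda cfg: -abs(cfg["eta"] - 0.2))
print(best["eta"], score)
```

A SageMaker hyperparameter tuning job automates the same loop, running each trial as a separate training job.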


In [9]:
# Initialize hyperparameters.
# They are passed to the Estimator as a dict.
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "binary:logistic",
    "num_round": 50,
}

In [10]:
boto3.Session().client("s3").list_buckets()


{'ResponseMetadata': {'RequestId': '8TA7KBWNKYHCD8H9',
  'HostId': 'dBvLwqVMxg3CdckuM1RWzafasWwlrnT9RrY7bJ5wM/Usy/vYGMQ8yYVOHhlTnuEji6qtDYJkAYTRAde9kSoWsq4bgO62vYOS',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'dBvLwqVMxg3CdckuM1RWzafasWwlrnT9RrY7bJ5wM/Usy/vYGMQ8yYVOHhlTnuEji6qtDYJkAYTRAde9kSoWsq4bgO62vYOS',
   'x-amz-request-id': '8TA7KBWNKYHCD8H9',
   'date': 'Tue, 04 Feb 2025 22:04:00 GMT',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'Buckets': [{'Name': 'avbaws1',
   'CreationDate': datetime.datetime(2025, 2, 2, 20, 21, 6, tzinfo=tzlocal())},
  {'Name': 'avbnewawsbucket1',
   'CreationDate': datetime.datetime(2025, 2, 2, 23, 49, 32, tzinfo=tzlocal())},
  {'Name': 'bankapplication-eu',
   'CreationDate': datetime.datetime(2025, 2, 4, 20, 18, 51, tzinfo=tzlocal())}],
 'Owner': {'ID': '0cca97c7b0c2abb728f58611286458c87a7f051ac9927ce21c5a5bbcd8aa64d1'}}

In [11]:
# Create a SageMaker session explicitly in eu-central-1
boto3_session = boto3.Session(region_name="eu-central-1")
sagemaker_session = sagemaker.Session(boto_session=boto3_session)

In [12]:
from sagemaker.estimator import Estimator

estimator = Estimator(image_uri=container,  # ECR image URI of the XGBoost container
                      hyperparameters=hyperparameters,
                      role=sagemaker.get_execution_role(),  # IAM role of this notebook instance; it must be able to read the S3 data
                      instance_count=1,
                      instance_type='ml.m5.2xlarge',
                      volume_size=5,  # EBS volume size in GB
                      output_path=output_path,
                      use_spot_instances=True,  # managed spot training to reduce cost
                      max_run=300,  # maximum training time in seconds
                      max_wait=600,  # maximum total time in seconds, including waiting for spot capacity (must be >= max_run)
                      sagemaker_session=sagemaker_session)

In [13]:
# # Legacy SageMaker Python SDK v1 equivalent of the cell above (v1 argument names), kept for reference only
# estimator = sagemaker.estimator.Estimator(image_name=container, 
#                                           hyperparameters=hyperparameters,
#                                           role=sagemaker.get_execution_role(), 
#                                           train_instance_count=1, 
#                                           train_instance_type='ml.m5.2xlarge', 
#                                           train_volume_size=5, # 5 GB 
#                                           output_path=output_path,
#                                           train_use_spot_instances=True,
#                                           train_max_run=300,
#                                           train_max_wait=600)

In [14]:
estimator.fit({'train': s3_input_train,'validation': s3_input_test})

2025-02-04 22:04:12 Starting - Starting the training job...
2025-02-04 22:04:27 Starting - Preparing the instances for training.
2025-02-04 22:05:04 Downloading - Downloading the training image.
[2025-02-04 22:05:54.660 ip-10-0-207-160.eu-central-1.compute.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Si

### Deploy the Machine Learning Model as an Endpoint

In [16]:
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

------!

#### Prediction of the Test Data

In [18]:
from sagemaker.serializers import CSVSerializer

# Prepare test data: feature matrix without the label columns, as a NumPy array
test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values

# Configure the predictor: CSVSerializer sends the rows as text/csv,
# so the content type does not need to be set separately
xgb_predictor.serializer = CSVSerializer()

# Make predictions
predictions = xgb_predictor.predict(test_data_array).decode('utf-8')  

# Convert predictions to NumPy array
predictions_array = np.fromstring(predictions[1:], sep=',')  
print(predictions_array.shape)


(12357,)


In [19]:
predictions_array

array([0.05214286, 0.05660191, 0.05096195, ..., 0.03436061, 0.02942475,
       0.03715819])
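
The cell above sends all test rows in a single request. For larger datasets it is common to send the rows in batches instead. The sketch below shows one way to do that with the standard library only; `predict_fn` is an assumption, standing for any callable that accepts a CSV string and returns the endpoint's response body (in this notebook, a thin wrapper around `xgb_predictor.predict`). The parser accepts scores separated by commas or by newlines.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize an iterable of feature rows as headerless CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def parse_scores(body):
    """Parse a response body of comma- or newline-separated scores into floats."""
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    return [float(tok) for tok in text.replace("\n", ",").split(",") if tok.strip()]


def predict_in_batches(rows, predict_fn, batch_size=500):
    """Send rows to predict_fn in batches and collect one score per row."""
    scores = []
    for start in range(0, len(rows), batch_size):
        payload = rows_to_csv(rows[start:start + batch_size])
        scores.extend(parse_scores(predict_fn(payload)))
    return scores


def fake_endpoint(payload):
    """Stand-in for a real endpoint: returns 0.5 for every input row."""
    n_rows = len(payload.strip().split("\n"))
    return ",".join(["0.5"] * n_rows).encode("utf-8")


print(predict_in_batches([[1, 2], [3, 4], [5, 6]], fake_endpoint, batch_size=2))
```

The batch size of 500 is an arbitrary default; the right value depends on the row width and the endpoint's payload limit.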

In [21]:
cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = cm.iloc[0,0]
fn = cm.iloc[1,0]
tp = cm.iloc[1,1]
fp = cm.iloc[0,1]
p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))


Overall Classification Rate: 89.7%

Predicted      No Purchase    Purchase
Observed
No Purchase    91% (10785)    34% (151)
Purchase        9% (1124)     66% (297) 
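
The overall classification rate hides how the model performs on the positive class. The sketch below derives accuracy, precision, recall, and F1 from the four counts in the table above; `classification_metrics` is a hypothetical helper introduced for illustration.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the confusion matrix printed above
metrics = classification_metrics(tn=10785, fp=151, fn=1124, tp=297)
for name, value in metrics.items():
    print("{:<10}{:.3f}".format(name, value))
```

With these counts, accuracy matches the 89.7% reported above and precision matches the 66% in the "Purchase" column, while recall is only about 21%: at the 0.5 threshold the model misses most customers who actually purchased.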



#### Deleting the Endpoint and the S3 Objects

In [22]:
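# Delete the endpoint (and its endpoint configuration) to stop incurring charges
xgb_predictor.delete_endpoint()

# WARNING: this deletes every object in the bucket, including the trained model artifacts
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()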

[{'ResponseMetadata': {'RequestId': 'JFVJYZ5XDDSRBM13',
   'HostId': 'vTRjpo7mtAOc7bUNHh0CJ5i5MOoJIULjBqtgPC2CwoZrljCgPwwSDea1Ec/HSdmKazLMtG28T9QjdOkymCHWea7RuWL3xeLPjmEZ+xXfn7s=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'vTRjpo7mtAOc7bUNHh0CJ5i5MOoJIULjBqtgPC2CwoZrljCgPwwSDea1Ec/HSdmKazLMtG28T9QjdOkymCHWea7RuWL3xeLPjmEZ+xXfn7s=',
    'x-amz-request-id': 'JFVJYZ5XDDSRBM13',
    'date': 'Wed, 05 Feb 2025 12:47:40 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'xgboost-as-a-built-in-algo/output/sagemaker-xgboost-2025-02-04-21-45-48-957/output/model.tar.gz'},
   {'Key': 'xgboost-as-a-built-in-algo/output/sagemaker-xgboost-2025-02-04-22-04-11-570/profiler-output/system/incremental/2025020422/1738706700.algo-1.json'},
   {'Key': 'xgboost-as-a-built-in-algo/output/sagemaker-xgboost-2025-02-04-21-45-48-957/debug-output/index/000000000/0000000