<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# Credit risk using SageMaker linear-learner

Contents
 - Setup
 - Introduction
 - Load and explore data
 - Create logistic regression model using SageMaker linear-learner algorithm
 - Deploy the SageMaker model in the AWS Cloud
 - Score the model

**Note:** This notebook works correctly with the **`IBM Runtime 23.1 on Python 3.10 XS`** kernel in IBM Watson Studio. Outside Watson Studio, use a standard Python 3.10 runtime.

## 1. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks:

- Create a SageMaker service by following the setup steps described here: https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html
- Install the required Python packages from the PyPI repository

### Package installation

In [None]:
# Note: install the packages below if you are not running this notebook in IBM Watson Studio
# !pip install -U boto3 | tail -n 1
# !pip install -U pandas | tail -n 1
# !pip install -U scikit_learn | tail -n 1
# !pip install -U category_encoders | tail -n 1

In [1]:
!pip install -U sagemaker | tail -n 1
!pip install -U category_encoders | tail -n 1

**Action:** Restart the kernel.

## 2. Introduction

This notebook defines, trains, and deploys a model that predicts credit risk.

## 3. Load and explore data

In this section you will prepare your data for training with the SageMaker linear-learner algorithm.

- Load data from GitHub repository
- Explore data
- Store training data in S3 Object Storage

### Load data from GitHub repository

In [None]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/credit_risk/credit_risk_training.csv')

### Explore data

In [3]:
print("Sample records:")
display(data.head())

print("Label column summary:")
display(data.Risk.value_counts())

Sample records:


Unnamed: 0,CheckingStatus,LoanDuration,CreditHistory,LoanPurpose,LoanAmount,ExistingSavings,EmploymentDuration,InstallmentPercent,Sex,OthersOnLoan,...,OwnsProperty,Age,InstallmentPlans,Housing,ExistingCreditsCount,Job,Dependents,Telephone,ForeignWorker,Risk
0,0_to_200,31,credits_paid_to_date,other,1889,100_to_500,less_1,3,female,none,...,savings_insurance,32,none,own,1,skilled,1,none,yes,No Risk
1,less_0,18,credits_paid_to_date,car_new,462,less_100,1_to_4,2,female,none,...,savings_insurance,37,stores,own,2,skilled,1,none,yes,No Risk
2,less_0,15,prior_payments_delayed,furniture,250,less_100,1_to_4,2,male,none,...,real_estate,28,none,own,2,skilled,1,yes,no,No Risk
3,0_to_200,28,credits_paid_to_date,retraining,3693,less_100,greater_7,3,male,none,...,savings_insurance,32,none,own,1,skilled,1,none,yes,No Risk
4,no_checking,28,prior_payments_delayed,education,6235,500_to_1000,greater_7,3,male,none,...,unknown,57,none,own,2,skilled,1,none,yes,Risk


Label column summary:


No Risk    3330
Risk       1670
Name: Risk, dtype: int64

### Store training data in S3 Object Storage

You will use the SageMaker built-in linear-learner algorithm. When training data is in `text/csv` format, this algorithm expects the first column to be the label.

Moreover, the label column has to be numeric, so you will recode it.

#### Save prepared data to local filesystem

In [4]:
target = 'Risk'
# Split feature columns by dtype: object columns hold strings, all others are numeric
string_features = [nm for nm, ty in zip(data.dtypes.index, data.dtypes.values) if (nm != target) and (ty == np.dtype('O'))]
numeric_features = [nm for nm, ty in zip(data.dtypes.index, data.dtypes.values) if (nm != target) and (ty != np.dtype('O'))]

In [5]:
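# One-hot encode the string features; dtype=int keeps the CSV numeric (recent pandas versions default to bool)
data_recoded = pd.concat([data[[target]], pd.get_dummies(data[string_features], dtype=int), data[numeric_features]], axis=1)
# Recode the label to 0/1 because linear-learner needs a numeric label
data_recoded[target] = data_recoded[target].map({'Risk': 1, 'No Risk': 0})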

In [6]:
train_data_filename = 'credit_risk_training_recoded.csv'
data_recoded.to_csv(path_or_buf = train_data_filename, index = False, header = False)

**Note:** The header row has to be omitted, and the first column has to be the target.
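
The two constraints in the note above can be checked locally before uploading. The following is a minimal sketch using only the standard library. The helper name `check_linear_learner_csv` is hypothetical (it is not part of SageMaker or pandas), and it assumes a binary 0/1 label as used in this notebook.

```python
import csv


def check_linear_learner_csv(path):
    """Return (row_count, feature_count) for a headerless, label-first CSV.

    Raises ValueError if a cell is not numeric (for example a header row),
    if a label is not 0 or 1, or if rows differ in length.
    """
    n_rows = 0
    n_cols = None
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # ignore blank lines
            # float() raises ValueError on header text or values like "True"
            values = [float(cell) for cell in row]
            if values[0] not in (0.0, 1.0):
                raise ValueError(f"row {n_rows + 1}: label must be 0 or 1, got {row[0]!r}")
            if n_cols is None:
                n_cols = len(values)
            elif len(values) != n_cols:
                raise ValueError(f"row {n_rows + 1}: expected {n_cols} columns, got {len(values)}")
            n_rows += 1
    if n_rows == 0:
        raise ValueError("file contains no data rows")
    return n_rows, n_cols - 1  # the first column is the label
```

Calling `check_linear_learner_csv(train_data_filename)` after the file is written should return the number of records and the number of feature columns. The second value is the one later passed as `feature_dim`.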

In [7]:
print(data_recoded.columns.tolist())

['Risk', 'CheckingStatus_0_to_200', 'CheckingStatus_greater_200', 'CheckingStatus_less_0', 'CheckingStatus_no_checking', 'CreditHistory_all_credits_paid_back', 'CreditHistory_credits_paid_to_date', 'CreditHistory_no_credits', 'CreditHistory_outstanding_credit', 'CreditHistory_prior_payments_delayed', 'LoanPurpose_appliances', 'LoanPurpose_business', 'LoanPurpose_car_new', 'LoanPurpose_car_used', 'LoanPurpose_education', 'LoanPurpose_furniture', 'LoanPurpose_other', 'LoanPurpose_radio_tv', 'LoanPurpose_repairs', 'LoanPurpose_retraining', 'LoanPurpose_vacation', 'ExistingSavings_100_to_500', 'ExistingSavings_500_to_1000', 'ExistingSavings_greater_1000', 'ExistingSavings_less_100', 'ExistingSavings_unknown', 'EmploymentDuration_1_to_4', 'EmploymentDuration_4_to_7', 'EmploymentDuration_greater_7', 'EmploymentDuration_less_1', 'EmploymentDuration_unemployed', 'Sex_female', 'Sex_male', 'OthersOnLoan_co-applicant', 'OthersOnLoan_guarantor', 'OthersOnLoan_none', 'OwnsProperty_car_other', 'Owns

#### Upload data to S3 Object Storage

In [None]:
import time
import json
import boto3
import sagemaker

In [9]:
aws_credentials = {'access_key': '', 
                   'secret_key': '', 
                   'region_name': ''}

**Note:** You have to provide credentials from your Amazon account.
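
To avoid typing secrets into a notebook cell, the same dictionary can be filled from environment variables. This is a sketch under the assumption that the standard AWS variable names `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_DEFAULT_REGION` are set in your environment. The helper `load_aws_credentials` is hypothetical and not part of boto3.

```python
import os


def load_aws_credentials(env=None):
    """Build the aws_credentials dict used in this notebook from environment variables."""
    env = os.environ if env is None else env
    # Keys match the notebook's dict; values are the standard AWS variable names
    mapping = {
        "access_key": "AWS_ACCESS_KEY_ID",
        "secret_key": "AWS_SECRET_ACCESS_KEY",
        "region_name": "AWS_DEFAULT_REGION",
    }
    missing = [var for var in mapping.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in mapping.items()}
```

With the variables set, `aws_credentials = load_aws_credentials()` would produce a dictionary with the same three keys as the cell above.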

In [10]:
import boto3
import sagemaker

session = boto3.session.Session(
    aws_access_key_id = aws_credentials['access_key'],
    aws_secret_access_key = aws_credentials['secret_key'],
    region_name = aws_credentials['region_name']
)
region = session.region_name
sagemaker_session = sagemaker.Session(session)
bucket = sagemaker_session.default_bucket()

s3 = session.resource('s3')


In [None]:
print('Default bucket: {}'.format(bucket))

**Note:** Replace `bucket_name` with the name of a bucket in your S3 Object Storage, and `train_data_path` with the path where the training data will be stored.

**Tip:** You can run the following code to list all your buckets: `[bkt.name for bkt in s3.buckets.all()]`.

In [12]:
bucket_name = bucket
train_data_path = 'credit_risk'

In [13]:
output_data_path = 's3://{}/credit-risk/output'.format(bucket_name)
time_suffix = time.strftime("%Y-%m-%d-%H-%M", time.gmtime())

In [14]:
train_data_filename

'credit_risk_training_recoded.csv'

In [15]:
s3_bucket = s3.Bucket(bucket_name)
s3_bucket.upload_file(Filename = train_data_filename, Key = '{}/{}'.format(train_data_path, train_data_filename))

Let's check that your data has been uploaded successfully.

In [None]:
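# Print the S3 URI of every uploaded object under the training data path
for s3_obj in s3_bucket.objects.all():
    if (s3_obj.bucket_name == bucket_name) and (train_data_path in s3_obj.key):
        train_data_uri = 's3://{}/{}'.format(s3_obj.bucket_name, s3_obj.key)
        print(train_data_uri)
# Use bucket_name rather than the loop variable, which is undefined when the bucket is empty
train_data_uri_updated = "s3://{}/{}/train/{}".format(bucket_name, train_data_path, train_data_filename)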

<a id="model"></a>
## 4. Create logistic regression model using SageMaker linear-learner algorithm

In this section you will learn how to:

- Set up training parameters
- Start training job

### Set up training parameters

In [17]:
from sagemaker.image_uris import retrieve, get_training_image_uri
sm_client = session.client('sagemaker')


In [18]:
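# Look up the linear-learner container image for the session's region
training_image = get_training_image_uri(session.region_name, 'linear-learner')

# Take the first IAM role whose name looks like a SageMaker execution role;
# the unpacking raises ValueError if the account has no matching role
iam_client = session.client('iam')
[role_arn, *_] = [role['Arn'] for role in iam_client.list_roles()['Roles'] if 'AmazonSageMaker-ExecutionRole' in role['RoleName'] or  'SagemakerFull' in role['RoleName']]

linear_job_name = 'Credit-risk-linear-learner-' + time_suffix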

In [19]:
linear_training_params = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "HyperParameters": {
        "feature_dim": str(data_recoded.shape[1] - 1),
        "mini_batch_size": "100",
        "predictor_type": "binary_classifier",
        "epochs": "10",
        "num_models": "32",
        "loss": "auto"
    },
    "InputDataConfig": [{
        "ChannelName": "train",
        "ContentType": "text/csv", 
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_data_uri,
                "S3DataDistributionType": "ShardedByS3Key"
            }
        }
    }],
    "OutputDataConfig": {"S3OutputPath": output_data_path},
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 2
    },
    "RoleArn": role_arn,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 6 * 60
    },
    "TrainingJobName": linear_job_name

}
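
Before submitting the job, a few properties of the request can be checked locally. The sketch below assumes that hyperparameter values must be strings, that `feature_dim` should equal the number of feature columns, and that job names follow the pattern used in `JOB_NAME_PATTERN`. Treat that pattern as an assumption and check it against the current `CreateTrainingJob` reference. The helper `validate_training_params` is hypothetical.

```python
import re

# Assumed job-name rule: alphanumerics and hyphens, at most 63 characters
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def validate_training_params(params, n_features):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    hyper = params["HyperParameters"]
    # Hyperparameters are sent as a string-to-string map
    for key, value in hyper.items():
        if not isinstance(value, str):
            problems.append(f"hyperparameter {key!r} must be a string, got {type(value).__name__}")
    # feature_dim counts feature columns only, not the label
    if hyper.get("feature_dim") != str(n_features):
        problems.append(f"feature_dim should be {str(n_features)!r}, got {hyper.get('feature_dim')!r}")
    if not JOB_NAME_PATTERN.match(params["TrainingJobName"]):
        problems.append(f"invalid training job name: {params['TrainingJobName']!r}")
    return problems
```

For this notebook, `validate_training_params(linear_training_params, data_recoded.shape[1] - 1)` would be expected to return an empty list.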


### Start training job

In [None]:
sm_client.create_training_job(**linear_training_params)

In [None]:
try:
    sm_client.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName = linear_job_name)
except Exception:
    print('Training job error.')

train_job_details = sm_client.describe_training_job(TrainingJobName = linear_job_name)
train_job_status = train_job_details['TrainingJobStatus']

if train_job_status == 'Failed':
    print(train_job_details['FailureReason'])
else:
    train_job_arn = train_job_details['TrainingJobArn']
    print(train_job_arn)
    trained_model_uri = train_job_details['ModelArtifacts']['S3ModelArtifacts']
    print(trained_model_uri)
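

The waiter used above hides the polling loop. As an illustration of what such a loop does, here is a minimal sketch that polls a describe function until the job reaches a terminal status. The helper `wait_for_training_job` and its parameters are hypothetical. It is not the boto3 waiter implementation, and the set of terminal statuses is an assumption based on the statuses handled in this notebook.

```python
import time

# Statuses after which the job no longer changes (assumed set)
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll describe(TrainingJobName=...) until a terminal status, then return the details."""
    for attempt in range(max_polls):
        details = describe(TrainingJobName=job_name)
        if details["TrainingJobStatus"] in TERMINAL_STATUSES:
            return details
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"training job {job_name!r} did not finish after {max_polls} polls")
```

A call such as `wait_for_training_job(sm_client.describe_training_job, linear_job_name)` would return the same kind of dictionary as `train_job_details`.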



## 5. Deploy the SageMaker model in the AWS Cloud

In this section you will learn how to:

- Set up deployment parameters
- Create deployment configuration endpoint
- Create online scoring endpoint

### Set up deployment parameters

In [None]:
linear_hosting_container = {'Image': training_image, 'ModelDataUrl': trained_model_uri}

create_model_details = sm_client.create_model(
    ModelName = linear_job_name,
    ExecutionRoleArn = role_arn,
    PrimaryContainer = linear_hosting_container)

print(create_model_details['ModelArn'])


### Create deployment configuration endpoint

In [None]:
endpoint_config = 'Credit-risk-linear-endpoint-config-' + time_suffix
print(endpoint_config)

create_endpoint_config_details = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config,
    ProductionVariants = [{
        'InstanceType': 'ml.m4.xlarge',
        'InitialInstanceCount': 1,
        'ModelName': linear_job_name,
        'VariantName': 'AllTraffic'}])

endpoint_config_details = sm_client.describe_endpoint_config(EndpointConfigName = endpoint_config)
print(endpoint_config_details)

### Create online scoring endpoint

In [24]:
scoring_endpoint = 'Credit-risk-endpoint-scoring-' + time_suffix

create_endpoint_details = sm_client.create_endpoint(
    EndpointName = scoring_endpoint,
    EndpointConfigName = endpoint_config)

In [None]:
try:
    sm_client.get_waiter('endpoint_in_service').wait(EndpointName = scoring_endpoint)
except Exception:
    print('Create scoring endpoint error')

scoring_endpoint_details = sm_client.describe_endpoint(EndpointName = scoring_endpoint)
scoring_endpoint_status = scoring_endpoint_details['EndpointStatus']

if scoring_endpoint_status != 'InService':
    # FailureReason is only present when the endpoint has failed, so fall back to the status
    print(scoring_endpoint_details.get('FailureReason', scoring_endpoint_status))
else:
    print(scoring_endpoint_details['EndpointArn'])


## 6. Score the model

In this section you will learn how to score the deployed model.

- Prepare sample data for scoring
- Send payload for scoring

### Prepare sample data for scoring

You will use data in `csv` format as the scoring payload. The first column (label) is removed from the data. The last 10 training records are selected as the scoring payload.

In [26]:
scoring_data_filename = 'credit_risk_scoring_recoded.csv'

In [27]:
with open(train_data_filename) as f_train:
    with open(scoring_data_filename, 'w') as f_score:
        f_score.writelines([','.join(line.split(',')[1:]) for line in f_train.readlines()[-10:]])
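

The same preparation can be done in memory, with a check that every record has the number of features the model was trained on. This is a sketch only. The helper `build_scoring_payload` is hypothetical, and it assumes comma-separated lines with the label in the first position, as written earlier in this notebook.

```python
def build_scoring_payload(training_lines, n_records, n_features):
    """Return a CSV payload built from the last n_records lines, without the label column."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    rows = []
    for line in training_lines[-n_records:]:
        features = line.strip().split(",")[1:]  # drop the leading label
        if len(features) != n_features:
            raise ValueError(f"expected {n_features} features, got {len(features)}")
        rows.append(",".join(features))
    return "\n".join(rows)
```

With the training file read into a list of lines, `build_scoring_payload(lines, 10, data_recoded.shape[1] - 1)` would give a string that can be encoded and sent as the request body.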



### Send payload for scoring

In [28]:
# 'sagemaker-runtime' is the documented boto3 service name for endpoint invocation
sm_runtime = session.client('sagemaker-runtime')

with open(scoring_data_filename) as f_payload:
    scoring_response = sm_runtime.invoke_endpoint(EndpointName = scoring_endpoint,
                                                  ContentType = 'text/csv',
                                                  Body = f_payload.read().encode())
    
    scored_records = scoring_response['Body'].read().decode()
    print(json.loads(scored_records))


{'predictions': [{'score': 0.45812129974365234, 'predicted_label': 0}, {'score': 0.055442217737436295, 'predicted_label': 0}, {'score': 0.3524249792098999, 'predicted_label': 0}, {'score': 0.2956694960594177, 'predicted_label': 0}, {'score': 0.1703130602836609, 'predicted_label': 0}, {'score': 0.4231722950935364, 'predicted_label': 0}, {'score': 0.08539573103189468, 'predicted_label': 0}, {'score': 0.8461574912071228, 'predicted_label': 1}, {'score': 0.5603607892990112, 'predicted_label': 1}, {'score': 0.19936884939670563, 'predicted_label': 0}]}
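
Each prediction carries a `score` and a `predicted_label`, where 1 was recoded from `Risk` and 0 from `No Risk` earlier in this notebook. The sketch below maps the response back to those names and optionally applies a custom decision threshold to the score. The helper `summarize_predictions` and its `threshold` parameter are hypothetical additions for illustration.

```python
import json

# Inverse of the recoding applied to the Risk column before training
LABELS = {0: "No Risk", 1: "Risk"}


def summarize_predictions(response_text, threshold=None):
    """Return (label_name, rounded_score) pairs from a linear-learner JSON response."""
    predictions = json.loads(response_text)["predictions"]
    summary = []
    for prediction in predictions:
        if threshold is None:
            label = int(prediction["predicted_label"])  # use the model's own decision
        else:
            label = int(prediction["score"] >= threshold)  # apply a custom cut-off
        summary.append((LABELS[label], round(prediction["score"], 4)))
    return summary
```

Passing `scored_records` to `summarize_predictions` gives one pair per scored record. Lowering `threshold` flags more records as `Risk`.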
