# MID TERM PROJECT

Problem Statement:

The United States currently faces a significant shortage of skilled workers, and finding the best talent is a persistent challenge for human resources departments. Companies in the USA are open to hiring highly skilled people both locally and internationally.

The Immigration and Nationality Act (INA) of the United States allows foreign workers to come to the country to work on a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by enforcing the safeguards it defines. The Office of Foreign Labor Certification (OFLC) administers this act.

The OFLC has processed millions of applications for temporary and permanent labor certifications across many positions, and the number of applications has increased compared to previous years. Reviewing every case is becoming tedious as the number of applicants grows each year, so a machine learning-based solution is needed to help shortlist candidates with a higher chance of visa approval. EasyVisa has been hired by the OFLC to provide data-driven solutions. As a data scientist at EasyVisa, you must analyze the data provided and use a classification model to make the visa approval process easier.

 


Dataset Description:

The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

* case_id: ID of each visa application
* continent: Continent of the employee
* education_of_employee: Education level of the employee
* has_job_experience: Does the employee have any job experience? Y = Yes; N = No
* requires_job_training: Does the employee require any job training? Y = Yes; N = No
* no_of_employees: Number of employees in the employer's company
* yr_of_estab: Year in which the employer's company was established
* region_of_employment: Information of foreign worker's intended region of employment in the US.
* prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not underpaid compared to other workers offering the same or similar service in the same area of employment.
* unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
* full_time_position: Is the position of work full-time? Y = Full-Time Position; N = Part-Time Position
* case_status: Flag indicating if the Visa was certified or denied

## Data Pre-processing

#### Import all necessary libraries


In [12]:
#Libraries to work with data
import numpy as np
import pandas as pd

#Library to work with SageMaker
import os
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput

#Package for data pre-processing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sagemaker.serializers import CSVSerializer

#### Create the S3 bucket and read the dataset from S3 Bucket

In [13]:
# Get the execution role and the SageMaker session
role = get_execution_role()
sess = sagemaker.Session()
# Use the session's default S3 bucket
bucket = sess.default_bucket()
prefix = "mid-term"
# Set up the S3 client
s3_client = boto3.client("s3")

#### Read data from S3 (upload the data to the S3 bucket manually, then read it from the bucket with code)
Note: 
* Keep the target column in the first position of the dataframe
* Drop the column 'case_id'
* Replace the value of Denied with 0 and Certified with 1 in the case_status column
* Apply label encoding

In [14]:
#Read the dataset from S3
data = pd.read_csv(s3_client.get_object(Bucket=bucket, Key="EasyVisa.csv").get("Body"))
data.head()

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.65,Year,Y,Certified
2,EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.86,Year,Y,Denied
3,EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.03,Year,Y,Denied
4,EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.39,Year,Y,Certified


In [15]:
# drop the case_id column
data.drop('case_id', inplace=True, axis=1)
data.head()

Unnamed: 0,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,Asia,Master's,Y,N,2412,2002,Northeast,83425.65,Year,Y,Certified
2,Asia,Bachelor's,N,Y,44444,2008,West,122996.86,Year,Y,Denied
3,Asia,Bachelor's,N,N,98,1897,West,83434.03,Year,Y,Denied
4,Africa,Master's,Y,N,1082,2005,South,149907.39,Year,Y,Certified


In [16]:
# Keep the target variables in the 1st position of dataframe 
#shift case_status in the first column
first_column = data.pop('case_status')
#insert case_status to column1
data.insert(0,'case_status',first_column)
data.head()

Unnamed: 0,case_status,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position
0,Denied,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y
1,Certified,Asia,Master's,Y,N,2412,2002,Northeast,83425.65,Year,Y
2,Denied,Asia,Bachelor's,N,Y,44444,2008,West,122996.86,Year,Y
3,Denied,Asia,Bachelor's,N,N,98,1897,West,83434.03,Year,Y
4,Certified,Africa,Master's,Y,N,1082,2005,South,149907.39,Year,Y


The target variable, case_status, is now in the first column.

In [17]:
#Replaced the value of Denied with 0 and Certified with 1
data['case_status'] = data['case_status'].replace({'Denied' : 0, 'Certified' : 1})

In [18]:
data.head()

Unnamed: 0,case_status,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position
0,0,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y
1,1,Asia,Master's,Y,N,2412,2002,Northeast,83425.65,Year,Y
2,0,Asia,Bachelor's,N,Y,44444,2008,West,122996.86,Year,Y
3,0,Asia,Bachelor's,N,N,98,1897,West,83434.03,Year,Y
4,1,Africa,Master's,Y,N,1082,2005,South,149907.39,Year,Y


We can now see that "Denied" is replaced by 0 and "Certified" is replaced by 1 in column "case_status"

In [19]:
# Apply label encoding to every column (numeric columns are re-coded too)
en_data = data.apply(LabelEncoder().fit_transform)

In [20]:
en_data.head()

Unnamed: 0,case_status,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position
0,0,1,2,0,0,5853,189,4,1587,0,1
1,1,1,3,1,0,2391,184,2,15091,3,1
2,0,1,0,0,1,6495,190,4,21194,3,1
3,0,1,0,0,0,87,79,4,15095,3,1
4,1,0,3,1,0,1069,187,3,23611,3,1


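After encoding, every column holds numeric values. Because the encoder is applied to every column, the numeric columns (no_of_employees, yr_of_estab, prevailing_wage) are also replaced by integer codes, as the output above shows.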
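Applying the encoder to the whole dataframe re-codes numeric columns as well (compare no_of_employees before and after encoding). A minimal sketch of one alternative, assuming only the text columns should be encoded, is shown below. It uses pandas category codes rather than `LabelEncoder` (a swapped-in technique that assigns codes in the same sorted order), and the `sample` dataframe is a small hypothetical stand-in built from a few values in `data.head()`.

```python
import pandas as pd

# Hypothetical stand-in for the EasyVisa dataframe (values taken from data.head()).
sample = pd.DataFrame({
    "case_status": [0, 1, 1],
    "continent": ["Asia", "Asia", "Africa"],
    "unit_of_wage": ["Hour", "Year", "Year"],
    "no_of_employees": [14513, 2412, 1082],
    "prevailing_wage": [592.2029, 83425.65, 149907.39],
})

encoded = sample.copy()

# Treat every non-numeric column as categorical.
categorical_cols = [
    col for col in encoded.columns
    if not pd.api.types.is_numeric_dtype(encoded[col])
]

for col in categorical_cols:
    # Codes follow the sorted order of the unique values in the column.
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

With this approach the numeric columns keep their original magnitudes, which tree-based models such as XGBoost can split on directly.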

#### Split data and upload back to S3

In [21]:
# Split data into train and validation sets
train_data, validation_data = train_test_split(en_data, test_size=0.3, random_state=1)

# Make sure the local output folder exists before writing
os.makedirs("datasets", exist_ok=True)

# Write the train and validation dataframes to the SageMaker file browser
train_data.to_csv("datasets/train.csv", header=False, index=False)
validation_data.to_csv("datasets/validation.csv", header=False, index=False)

#### Upload into S3 (checkpoint)

Let's checkpoint the train and validation datasets in the S3 bucket, so we can fall back on them if we need them again.

In [22]:
import os
#Upload train.csv to s3
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train/train.csv")
).upload_file("datasets/train.csv")

#Upload validation.csv to s3

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "validation/validation.csv")
).upload_file("datasets/validation.csv")
                                                     

## Training, Tuning and Deploying the model

### Model Training

#### Get the container image for Xgboost

In [23]:
#set the version of xgboost
model_name = "xgboost"
image_version = "1.5-1"

In [24]:
#Create container variable
sess = sagemaker.Session()
container = sagemaker.image_uris.retrieve(model_name, sess.boto_region_name, image_version)
display(container)

'683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1'

#### Set the training configuration

In [25]:
#Input training data
input_train = TrainingInput(
    s3_data="s3://{}/{}/train".format(bucket, prefix), content_type="csv")
# Input validation data
input_validation = TrainingInput(
    s3_data="s3://{}/{}/validation/".format(bucket, prefix), content_type="csv")

#### Train Classification Model (inbuilt XG-Boost model) using the data

In [26]:
xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=sess,
)


In [27]:
xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    objective="binary:logistic", # the output will be a probability score
    num_round=100,
)

In [28]:
#Fit the data to the model
xgb.fit({"train": input_train, "validation": input_validation})

2023-01-05 18:30:42 Starting - Starting the training job...ProfilerReport-1672943442: InProgress
...
2023-01-05 18:31:22 Starting - Preparing the instances for training............
2023-01-05 18:33:31 Downloading - Downloading input data...
2023-01-05 18:34:11 Training - Downloading the training image..
[2023-01-05 18:34:19.622 ip-10-0-176-199.ec2.internal:6 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-01-05:18:34:19:INFO] Imported framework sagemaker_xgboost_container.training
[2023-01-05:18:34:19:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2023-01-05:18:34:19:INFO] No GPUs detected (normal if no gpus installed)
[2023-01-05:18:34:19:INFO] Running XGBoost Sagemaker in algorithm mode
[2023-01-05:18:34:19:INFO] Determined delimiter of CSV input is ','
[2023-01-05:18:34:19:INFO] Determined delimiter of CSV input is ','
[2023-01-05:18:34:1

#### Deploy the model

In [30]:
xgb_predictor = xgb.deploy(
    initial_instance_count=1, instance_type="ml.m4.xlarge", serializer=CSVSerializer()
)

--------!

#### Save the predictions

In [34]:
def predict(en_data, rows=500):
    split_array = np.array_split(en_data, int(en_data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        predictions = "".join([predictions, xgb_predictor.predict(array).decode("utf-8")])

    return predictions.split("\n")[:-1]


predictions = predict(validation_data.drop(['case_status'],axis=1).to_numpy())

In [35]:
predictions = np.array([float(num) for num in predictions])
print(predictions)

[0.77296561 0.74351811 0.58985001 ... 0.95998597 0.71194988 0.39377275]


In [36]:
columns=np.round(predictions,0)
predictions = pd.DataFrame(columns)
test_y = validation_data['case_status']
actual = test_y
test_x = validation_data.drop(columns=['case_status'],axis=1)
#Prediction column called case_pred
predictions.columns =['case_pred']
# Concatenate them into a single dataframe
output = pd.concat([test_x.reset_index(drop=True),actual.reset_index(drop=True),predictions.reset_index(drop=True)],axis=1)
output['case_pred'] = output['case_pred'].astype(int)
output.head()

Unnamed: 0,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status,case_pred
0,1,0,1,0,556,174,1,5248,3,1,1,1
1,4,0,0,0,607,120,1,11997,3,1,1,1
2,1,3,0,0,2614,187,3,2107,0,1,1,1
3,1,0,1,1,3154,168,2,8975,3,1,1,1
4,2,0,1,0,4557,110,4,9052,3,1,0,1
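The rounding step above amounts to a 0.5 cut-off on the predicted probability. As a hedged sketch, the cut-off can be made explicit so that it can be moved when false positives and false negatives carry different costs. The helper `to_labels` and its `threshold` parameter are hypothetical names introduced here, and the scores are the six values visible in the printed prediction array.

```python
# The six probabilities visible in the printed baseline predictions.
probabilities = [0.77296561, 0.74351811, 0.58985001,
                 0.95998597, 0.71194988, 0.39377275]


def to_labels(scores, threshold=0.5):
    """Return 1 (Certified) when a score reaches the threshold, else 0 (Denied)."""
    return [1 if score >= threshold else 0 for score in scores]


default_labels = to_labels(probabilities)
stricter_labels = to_labels(probabilities, threshold=0.6)

print(default_labels)   # [1, 1, 1, 1, 1, 0]
print(stricter_labels)  # [1, 1, 0, 1, 1, 0]
```

Raising the threshold predicts fewer approvals, which tends to trade false positives for false negatives; the right value depends on which error is more costly.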


#### Evaluate the model using performance metrics

Confusion Matrix

In [52]:
# Evaluate the model using a confusion matrix
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(output['case_status'], output['case_pred'])
# Display matrix
print(matrix)

[[1208 1254]
 [ 704 4478]]


In [53]:
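# Calculate TP, FP, FN, TN
# NOTE: for labels (0, 1) sklearn returns [[TN, FP], [FN, TP]], so
# matrix[0][0] below counts correctly predicted Denied cases (class 0).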
true_pos = matrix[0][0]
fals_pos = matrix[0][1]
fals_neg = matrix[1][0]
true_neg = matrix[1][1]

**Accuracy**: 


In [54]:
# Calculate Accuracy Score
accuracy = (true_pos + true_neg) / (true_pos + fals_pos + fals_neg + true_neg)
# Display score
print(accuracy)

0.7438513867085296


**Precision**: 


In [55]:
# Calculate Precision Score
precision = true_pos / (true_pos + fals_pos)
# Display score
print(precision)

0.49065800162469536


**Recall**:


In [56]:
# Calculate Recall Score
recall = true_pos / (true_pos + fals_neg)
# Display score
print(recall)

0.6317991631799164


**F1 score**:


In [57]:
# Calculate F1-score
f1 = (2*precision*recall) / (precision + recall)
# Display score
print(f1)

0.5523548239597622
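
For reference, scikit-learn orders the confusion matrix by actual class in rows and predicted class in columns, so with labels 0 (Denied) and 1 (Certified) the layout is [[TN, FP], [FN, TP]]. The sketch below is an illustration under that convention, not the notebook's own calculation: it recomputes the scores for the Certified class from the baseline matrix printed above, using plain Python and short variable names chosen for this example.

```python
# Baseline confusion matrix printed above: rows = actual, columns = predicted.
matrix = [[1208, 1254],
          [704, 4478]]

# Unpack using the [[TN, FP], [FN, TP]] layout for labels (0, 1).
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted Certified that were Certified
recall = tp / (tp + fn)      # share of actual Certified that were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:  {:.3f}".format(accuracy))
print("precision: {:.3f}".format(precision))
print("recall:    {:.3f}".format(recall))
print("f1:        {:.3f}".format(f1))
```

Accuracy is unaffected by which class is treated as positive, while precision, recall, and F1 depend on it.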


### Observation


1. The confusion matrix shows 1254 false positives (FP): the model predicts that 1254 visas are certified when they are actually denied. It also shows 704 false negatives (FN): the model predicts that 704 visas are denied when they are actually certified. Both errors need to be reduced. If the company hires the 1254 candidates it expects to be approved and their visas are denied, it loses the time and money spent taking them through the process, and it may miss the chance to hire other people who could actually work for the company. If the company passes over the 704 candidates it expects to be denied and their visas are approved, it loses highly skilled people who could have worked for it.  
  
2. Model accuracy is 74%, precision is 49%, recall is 63%, and the F1 score, which balances precision and recall, is 55%. Note that scikit-learn's confusion matrix places class 0 (Denied) in the first row and column, so the cells above, which take matrix[0][0] as the true-positive count, describe the Denied class: the value reported as precision is the share of actually denied cases that the model predicted as denied, and the value reported as recall is the share of predicted denials that were actually denied. Either way, the model needs to be improved.

Next, hyperparameter tuning is used to search for better hyperparameters and retrain the model.
 

### Model Tuning

**Hyper-parameter Tuning** is the art of figuring out what works best for your model - performing well statistically and aligning with your business context & needs

In [63]:
# import the Hyperparameter Tuner
from sagemaker.tuner import IntegerParameter
from sagemaker.tuner import ContinuousParameter
from sagemaker.tuner import HyperparameterTuner

#### Define the parameter space & train the model

Each model is built on a different algorithm, so the hyperparameters that can be tuned vary from model to model. The ranges below are the ones searched for XGBoost.

In [64]:
hyperparameter_ranges = {
    'alpha' : ContinuousParameter(0, 1000, scaling_type = 'Auto'),
    'eta' : ContinuousParameter(0.1, 0.5, scaling_type = 'Logarithmic'),
    'max_depth' : IntegerParameter(0, 10, scaling_type = 'Auto'),
    'min_child_weight' : ContinuousParameter(0, 10, scaling_type = 'Auto')}

objective_metric_name = 'validation:auc'
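
The `eta` range above uses logarithmic scaling, which means values are explored evenly on a log scale rather than a linear one. The sketch below only illustrates that idea with plain random draws; it does not reproduce the tuner's own search strategy, and `sample_log_uniform` is a hypothetical helper introduced for this example.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(1)  # fixed seed so the draws are repeatable

# Same bounds as the eta range in hyperparameter_ranges.
eta_samples = [sample_log_uniform(0.1, 0.5, rng) for _ in range(5)]

print([round(eta, 3) for eta in eta_samples])
```

On a log scale the lower part of the range is sampled more densely than on a linear scale, which is useful when small changes in a small learning rate matter more than the same absolute change near the upper bound.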

In [65]:
tuner = HyperparameterTuner(
    xgb,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs = 20,
    max_parallel_jobs = 3
)

In [66]:
#Start Hyperparameter tuning job
tuner.fit({'train': input_train, 'validation': input_validation}, include_cls_metadata=False)

.................................................................................................................!


In [68]:
#Check status of hyperparameter tuning job
region = boto3.Session().region_name
sage_client = boto3.Session().client("sagemaker")

tuning_job_name = "sagemaker-xgboost-230105-1915"

tuning_job_result = sage_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")

job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)

objective = tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]
is_minimize = objective["Type"] != "Maximize"
objective_name = objective["MetricName"]

20 training jobs have completed


In [69]:
#Finding best model from parameter tuning
from pprint import pprint

if tuning_job_result.get("BestTrainingJob", None):
    print("Best model found so far:")
    pprint(tuning_job_result["BestTrainingJob"])
else:
    print("No training jobs have reported results yet.")

Best model found so far:
{'CreationTime': datetime.datetime(2023, 1, 5, 19, 20, 59, tzinfo=tzlocal()),
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'validation:auc',
                                                 'Value': 0.7786700129508972},
 'ObjectiveStatus': 'Succeeded',
 'TrainingEndTime': datetime.datetime(2023, 1, 5, 19, 21, 41, tzinfo=tzlocal()),
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:160277102220:training-job/sagemaker-xgboost-230105-1915-011-25c0d605',
 'TrainingJobName': 'sagemaker-xgboost-230105-1915-011-25c0d605',
 'TrainingJobStatus': 'Completed',
 'TrainingStartTime': datetime.datetime(2023, 1, 5, 19, 21, 4, tzinfo=tzlocal()),
 'TunedHyperParameters': {'alpha': '0.0',
                          'eta': '0.10000000000000002',
                          'max_depth': '5',
                          'min_child_weight': '9.670827651909446'}}


In [70]:
#Showing results of hyperparameter in dataframe
tuner = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner.dataframe()

if len(full_df) > 0:
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        df = df.sort_values("FinalObjectiveValue", ascending=is_minimize)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

df


Number of training jobs with valid objective: 20
{'lowest': 0.5, 'highest': 0.7786700129508972}


Unnamed: 0,alpha,eta,max_depth,min_child_weight,TrainingJobName,TrainingJobStatus,FinalObjectiveValue,TrainingStartTime,TrainingEndTime,TrainingElapsedTimeSeconds
9,0.0,0.1,5.0,9.670828,sagemaker-xgboost-230105-1915-011-25c0d605,Completed,0.77867,2023-01-05 19:21:04+00:00,2023-01-05 19:21:41+00:00,37.0
7,7.976895,0.138115,5.0,3.800037,sagemaker-xgboost-230105-1915-013-26b35e21,Completed,0.77842,2023-01-05 19:21:49+00:00,2023-01-05 19:22:27+00:00,38.0
4,0.853459,0.117405,5.0,0.180162,sagemaker-xgboost-230105-1915-016-b6dd06b8,Completed,0.77834,2023-01-05 19:22:47+00:00,2023-01-05 19:23:25+00:00,38.0
1,0.0,0.126534,4.0,10.0,sagemaker-xgboost-230105-1915-019-48d68c2e,Completed,0.77685,2023-01-05 19:23:51+00:00,2023-01-05 19:24:23+00:00,32.0
5,20.020887,0.129631,6.0,6.020987,sagemaker-xgboost-230105-1915-015-301c248d,Completed,0.77661,2023-01-05 19:22:54+00:00,2023-01-05 19:23:31+00:00,37.0
10,0.0,0.158999,6.0,6.197071,sagemaker-xgboost-230105-1915-010-e31e7fd0,Completed,0.77517,2023-01-05 19:20:55+00:00,2023-01-05 19:21:32+00:00,37.0
3,0.0,0.1,3.0,7.482663,sagemaker-xgboost-230105-1915-017-a57fc57a,Completed,0.77479,2023-01-05 19:22:57+00:00,2023-01-05 19:23:35+00:00,38.0
0,25.317331,0.115423,5.0,8.094955,sagemaker-xgboost-230105-1915-020-8a5f5a30,Completed,0.77458,2023-01-05 19:23:53+00:00,2023-01-05 19:24:30+00:00,37.0
6,0.0,0.224293,3.0,0.0,sagemaker-xgboost-230105-1915-014-d0f9556e,Completed,0.77401,2023-01-05 19:21:57+00:00,2023-01-05 19:22:34+00:00,37.0
2,0.0,0.5,4.0,10.0,sagemaker-xgboost-230105-1915-018-133affe7,Completed,0.76763,2023-01-05 19:23:43+00:00,2023-01-05 19:24:20+00:00,37.0


#### Re-build the model & predict with the best parameters

In [71]:
#Set the estimator configuration
tune_model = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=sess,
)

We retrain the model with the best hyperparameters found in the previous step (min_child_weight rounded to 9.7).

In [72]:
# Set the best hyperparameters from the tuning job
tune_model.set_hyperparameters(
    max_depth=5,
    eta=0.1,
    min_child_weight=9.7,
    objective = "binary:logistic",
    num_round=100
)

In [73]:
#Fit the model
tune_model.fit({"train": input_train, "validation": input_validation})

2023-01-05 19:30:37 Starting - Starting the training job...
2023-01-05 19:31:02 Starting - Preparing the instances for trainingProfilerReport-1672947036: InProgress
...............
2023-01-05 19:33:25 Downloading - Downloading input data...
2023-01-05 19:34:01 Training - Training image download completed. Training in progress..
[2023-01-05 19:34:02.769 ip-10-2-171-39.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-01-05:19:34:02:INFO] Imported framework sagemaker_xgboost_container.training
[2023-01-05:19:34:02:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2023-01-05:19:34:02:INFO] No GPUs detected (normal if no gpus installed)
[2023-01-05:19:34:02:INFO] Running XGBoost Sagemaker in algorithm mode
[2023-01-05:19:34:02:INFO] Determined delimiter of CSV input is ','
[2023-01-05:19:34:02:INFO] Determined delimiter of CSV input is ','


### Deploy the tuned model

In [74]:
xgb_tuned_predictor = tune_model.deploy(
    initial_instance_count=1, instance_type="ml.m4.xlarge", serializer=CSVSerializer()
)

-------!

In [None]:
#xgb_tuned_predictor.predict(validation_data.drop(['case_status'],axis=1))

#### Save the predictions

In [76]:
def predict(en_data, rows=500):
    split_array = np.array_split(en_data, int(en_data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        predictions = "".join([predictions, xgb_tuned_predictor.predict(array).decode("utf-8")])

    return predictions.split("\n")[:-1]


predictions = predict(validation_data.drop(['case_status'],axis=1).to_numpy())

In [77]:
#Prediction
predictions = np.array([float(num) for num in predictions])
print(predictions)

[0.76051009 0.69261122 0.51291931 ... 0.94161266 0.75041068 0.41827521]


In [78]:
columns=np.round(predictions,0)
predictions = pd.DataFrame(columns)
test_y = validation_data['case_status']
actual = test_y
test_x = validation_data.drop(columns=['case_status'],axis=1)
#Prediction column called case_pred
predictions.columns =['case_pred']
# Concatenate them into a single dataframe
output = pd.concat([test_x.reset_index(drop=True),actual.reset_index(drop=True),predictions.reset_index(drop=True)],axis=1)
output['case_pred'] = output['case_pred'].astype(int)
output.head()

Unnamed: 0,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status,case_pred
0,1,0,1,0,556,174,1,5248,3,1,1,1
1,4,0,0,0,607,120,1,11997,3,1,1,1
2,1,3,0,0,2614,187,3,2107,0,1,1,1
3,1,0,1,1,3154,168,2,8975,3,1,1,1
4,2,0,1,0,4557,110,4,9052,3,1,0,1


#### Evaluate the tuned model (compare performance with baseline model)

Confusion Matrix

In [79]:
matrix = confusion_matrix(output['case_status'], output['case_pred'])
# Display matrix
print(matrix)

[[1204 1258]
 [ 672 4510]]


Accuracy Score


In [80]:
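# Calculate TP, FP, FN, TN
# NOTE: for labels (0, 1) sklearn returns [[TN, FP], [FN, TP]], so
# matrix[0][0] below counts correctly predicted Denied cases (class 0).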
true_pos = matrix[0][0]
fals_pos = matrix[0][1]
fals_neg = matrix[1][0]
true_neg = matrix[1][1]

In [81]:
# Calculate Accuracy Score
accuracy = (true_pos + true_neg) / (true_pos + fals_pos + fals_neg + true_neg)
# Display score
print(accuracy)

0.7475143903715332


Precision Score


In [82]:
# Calculate Precision Score
precision = true_pos / (true_pos + fals_pos)
# Display score
print(precision)

0.4890333062550772


Recall Score


In [83]:
# Calculate Recall Score
recall = true_pos / (true_pos + fals_neg)
# Display score
print(recall)

0.6417910447761194


F1-score


In [84]:
# Calculate F1-score
f1 = (2*precision*recall) / (precision + recall)
# Display score
print(f1)


0.5550945136007376


Calculate all performance metrics - after tuning


In [85]:
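# Calculate all performance metrics - after tuning
# NOTE: accuracy, precision, and recall were reassigned by the tuned-model
# cells above, so the "before" values printed here equal the "after" values.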
accuracy_tuned = (true_pos + true_neg) / (true_pos + fals_pos + fals_neg + true_neg)
precision_tuned = true_pos / (true_pos + fals_pos)
recall_tuned = true_pos / (true_pos + fals_neg)

# Display performance metrics before & after tuning, and the % increment achieved
print("accuracy:  before:{:0.2f}, after:{:0.2f}, %increment:{:0.2f}%".
      format(accuracy,accuracy_tuned,(accuracy_tuned-accuracy)*100))
print("precision: before:{:0.2f}, after:{:0.2f}, %increment:{:0.2f}%".
      format(precision,precision_tuned,(precision_tuned-precision)*100))
print("recall:    before:{:0.2f}, after:{:0.2f}, %increment:{:0.2f}%".
      format(recall,recall_tuned,(recall_tuned-recall)*100))

accuracy:  before:0.75, after:0.75, %increment:0.00%
precision: before:0.49, after:0.49, %increment:0.00%
recall:    before:0.64, after:0.64, %increment:0.00%
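
The printout above shows identical values because the names `accuracy`, `precision`, and `recall` are reused for both models. A minimal sketch of one way to avoid this is to keep each model's scores in its own dictionary. The `baseline` and `tuned` dictionaries are hypothetical names, and the numbers are the scores exactly as printed by the metric cells earlier in this notebook.

```python
# Scores copied from the cell outputs above, as reported by the notebook.
baseline = {
    "accuracy": 0.7438513867085296,
    "precision": 0.49065800162469536,
    "recall": 0.6317991631799164,
}
tuned = {
    "accuracy": 0.7475143903715332,
    "precision": 0.4890333062550772,
    "recall": 0.6417910447761194,
}

changes = {}
for name in baseline:
    # Difference expressed in percentage points.
    changes[name] = (tuned[name] - baseline[name]) * 100
    print("{}: before:{:0.4f}, after:{:0.4f}, change:{:+0.2f} points".format(
        name, baseline[name], tuned[name], changes[name]))
```

Keeping the two sets of scores separate means the comparison no longer depends on the order in which the cells were run.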


### Observation

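From the confusion matrices, the number of false positives increases by 4 (1254 to 1258), while the number of false negatives decreases by 32 (704 to 672). The comparison cell reports no change because the variables accuracy, precision, and recall were overwritten by the tuned-model cells before the comparison ran, so both sides hold the tuned values. Comparing the scores printed for each model instead, accuracy rises from 0.744 to 0.748, the score reported as precision falls from 0.491 to 0.489, and the score reported as recall rises from 0.632 to 0.642. The tuned hyperparameters therefore change model performance only marginally.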

### Delete the Endpoint


In [86]:
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

In [87]:
xgb_tuned_predictor.delete_endpoint(delete_endpoint_config=True)

## Recommendations

To improve model performance:  

1. Re-split the data into training and validation sets using a different proportion.  
2. Train different models on the data, for example classification methods such as logistic regression, AdaBoostClassifier, decision trees, random forests, and neural networks.