# BIA 678 Project

**Purpose:** Use machine learning algorithms to classify high-risk loans by training on a labeled dataset with attributes such as personal income, marital status, job experience, and profession.

**Problem Statement:** Measure how the size of the training set affects the predictive performance and training time of different classification models.

## Overview

1. Load the data into Spark
2. Explore and clean the data
3. Build the data processing pipeline
4. Train the models
5. Evaluate the models

## Loading the Data

In [1]:
from pyspark.sql.functions import isnan, when, count, col, rand, udf
from pyspark.sql.functions import regexp_replace, explode, array, lit
from pyspark.sql.types import DoubleType
from pyspark.mllib.util import MLUtils
from pyspark.ml.feature import StandardScaler, VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline
import math as mth
import time  # used by the timing code in the model-training cells

Starting Spark application

| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|----|---------------------|------|-------|----------|------------|------------------|
| 1 | application_1638772814445_0003 | pyspark | idle | Link | Link | ✔ |

SparkSession available as 'spark'.

In [2]:
schema = """
`id` INT,
`Income` INT,
`Age` INT,
`Experience` INT,
`Married/Single` STRING,
`House_Ownership` STRING,
`Car_Ownership` STRING,
`Profession` STRING,
`CITY` STRING,
`STATE` STRING,
`CURRENT_JOB_YRS` INT,
`CURRENT_HOUSE_YRS` INT,
`Risk_Flag` INT
"""

df1 = spark.read.format("csv").option("header", "true").load("s3://bia678projectteam13/loan_prediction_data.csv", schema=schema)

In [3]:
df1.groupBy("Risk_Flag").count().sort("count", ascending=False).show()

+---------+------+
|Risk_Flag| count|
+---------+------+
|        0|221004|
|        1| 30996|
+---------+------+
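
The counts above show a strong class imbalance, which matters when reading the accuracy figures reported for the models. As a quick sanity check (a sketch that uses only the two counts printed above; the variable names are illustrative), the class shares and the accuracy of a trivial classifier that always predicts `Risk_Flag = 0` can be computed like this:

```python
# Class counts copied from the groupBy output above.
n_low_risk = 221004   # Risk_Flag == 0
n_high_risk = 30996   # Risk_Flag == 1

total = n_low_risk + n_high_risk
minority_share = n_high_risk / total

# A classifier that always predicts the majority class is right on every low-risk row.
majority_baseline_accuracy = n_low_risk / total

print(f"Total rows: {total}")
print(f"High-risk share: {minority_share:.3f}")
print(f"Always-predict-0 accuracy: {majority_baseline_accuracy:.3f}")
```

On data with roughly this class mix, an accuracy below about 0.877 is worse than the trivial baseline on that metric, which is one reason to look at the areas under the ROC and PR curves as well.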

## Data Cleaning + Preprocessing

In [4]:
# Strip bracketed one- or two-digit markers such as "[5]" or "[12]" from city and state names
df1 = df1.withColumn('CITY', regexp_replace('CITY', r'\[\d{1,2}\]', ''))
df1 = df1.withColumn('STATE', regexp_replace('STATE', r'\[\d{1,2}\]', ''))
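
To see what this cleaning step removes without starting a Spark session, the same pattern can be tried with Python's `re` module. This is only a sketch: Spark's `regexp_replace` uses Java regular expressions, although this simple pattern should behave the same way in both engines. The sample strings and the helper name `strip_bracket_suffix` are hypothetical.

```python
import re

# Matches a one- or two-digit number in square brackets, e.g. "[5]" or "[12]".
BRACKET_MARKER = re.compile(r"\[\d{1,2}\]")

def strip_bracket_suffix(value: str) -> str:
    """Remove bracketed one- or two-digit markers from a string."""
    return BRACKET_MARKER.sub("", value)

# Hypothetical sample values, for illustration only.
samples = ["Example_City[12]", "Example_State[3]", "Plain_Name"]
cleaned = [strip_bracket_suffix(s) for s in samples]
print(cleaned)
```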

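Split the data into training and test sets (80/20), then draw 80%, 60%, 40%, and 20% subsamples of the training set to measure the effect of training-set size.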

In [5]:
(training_data_full, test_data) = df1.randomSplit([0.8,0.2], seed=420)
(training_data_08, _) = training_data_full.randomSplit([0.8, 0.2], seed=420)
(training_data_06, _) = training_data_full.randomSplit([0.6, 0.4], seed=420)
(training_data_04, _) = training_data_full.randomSplit([0.4, 0.6], seed=420)
(training_data_02, _) = training_data_full.randomSplit([0.2, 0.8], seed=420)
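
Every subsample above is drawn from the same DataFrame with the same seed, so the smaller subsets are plausibly nested inside the larger ones, and their sizes are only approximately 20/40/60/80% of the training set. The pure-Python sketch below illustrates that idea with a cutoff-based splitter. It is an analogy under that assumption, not Spark's actual implementation, and `seeded_draws` and `subset_by_cutoff` are hypothetical helpers.

```python
import random

def seeded_draws(num_rows, seed):
    """Give every row one uniform draw in [0, 1), reproducible for a fixed seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(num_rows)]

def subset_by_cutoff(draws, fraction):
    """Keep the row indices whose draw falls below the requested fraction."""
    return [i for i, u in enumerate(draws) if u < fraction]

draws = seeded_draws(10_000, seed=420)
subsets = {f: subset_by_cutoff(draws, f) for f in (0.2, 0.4, 0.6, 0.8)}

for f, rows in subsets.items():
    print(f"fraction={f}: {len(rows)} rows")
```

Because each row keeps the same draw for every cutoff, any row in the 20% subset also appears in the 40%, 60%, and 80% subsets.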

In [6]:
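def oversample(data):
    """Balance the classes by duplicating minority (Risk_Flag == 1) rows."""
    major_df = data.filter(col("Risk_Flag") == 0)
    minor_df = data.filter(col("Risk_Flag") == 1)

    # Whole-number ratio of majority to minority rows
    ratio = int(major_df.count() / minor_df.count())

    # Duplicate each minority row `ratio` times by exploding a dummy array column
    oversampled_df = (minor_df
                      .withColumn("dummy", explode(array([lit(x) for x in range(ratio)])))
                      .drop("dummy"))

    # Combine the oversampled minority rows with the original majority rows
    combined_df = major_df.union(oversampled_df)

    # Shuffle so the duplicated rows are not grouped together
    return combined_df.orderBy(rand())

def undersample(data):
    """Balance the classes by randomly dropping majority (Risk_Flag == 0) rows."""
    major_df = data.filter(col("Risk_Flag") == 0)
    minor_df = data.filter(col("Risk_Flag") == 1)

    ratio = int(major_df.count() / minor_df.count())

    # Keep each majority row with probability 1/ratio (sampling without replacement)
    sampled_majority_df = major_df.sample(False, 1 / ratio)
    combined_df = sampled_majority_df.union(minor_df)

    return combined_df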
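
The arithmetic behind both functions can be checked by hand. The sketch below plugs in the training-set class counts that the notebook prints in its class-weight table (176539 low-risk and 24726 high-risk rows). The variable names are illustrative, and the undersampled size is an expectation because the majority rows are sampled at random.

```python
# Training-set class counts taken from the weight table printed later in this notebook.
n_major = 176539  # Risk_Flag == 0
n_minor = 24726   # Risk_Flag == 1

# Same integer ratio as in oversample() and undersample().
ratio = int(n_major / n_minor)

# Oversampling repeats every minority row `ratio` times.
oversampled_total = n_major + ratio * n_minor

# Undersampling keeps each majority row with probability 1/ratio.
expected_undersampled_total = n_major / ratio + n_minor

print(f"ratio = {ratio}")
print(f"oversampled rows = {oversampled_total}")
print(f"expected undersampled rows = {expected_undersampled_total:.0f}")
```

These figures are close to, but not identical to, the sizes the notebook reports (349983 and 50077); the sketch does not try to reproduce the random split or the random sample exactly.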

In [7]:
train_oversample = oversample(training_data_full)
train_undersample = undersample(training_data_full)

In [8]:
def process_data(train, test):
    # Data PreProcessing Pipeline
  
    # Scale all numerical columns to unit standard deviation (withMean=False, so no mean centering)
    num_assembler = VectorAssembler(inputCols=['Income','Age', 'Experience','CURRENT_JOB_YRS', 'CURRENT_HOUSE_YRS'], outputCol='num_features')
    scaler = StandardScaler(inputCol="num_features", outputCol="scaled_num_features", withStd=True, withMean=False)
    
    # Index all categorical variables
    cat_cols = ['Married/Single', 'House_Ownership', 'Car_Ownership', 'Profession', 'CITY', 'STATE']
    cat_replace = ['Married/Single_idx', 'House_Ownership_idx', 'Car_Ownership_idx', 'Profession_idx', 'CITY_idx', 'STATE_idx']
    

    indexers = [StringIndexer(inputCol=cat_cols[i], outputCol=cat_replace[i]) for i in range(len(cat_cols))]

    # One-hot encode categorical variables with greater than 2 categories
    oh_cols = ['House_Ownership_idx', 'Profession_idx', 'CITY_idx', 'STATE_idx']
    oh_cols_replace = ['House_Ownership_ohe', 'Profession_ohe', 'CITY_ohe', 'STATE_ohe']
    encoders = [OneHotEncoder(inputCol=oh_cols[i], outputCol= oh_cols_replace[i]) for i in range(len(oh_cols))]

    # Assemble the transformed columns into one final vector
    final_cols = ['scaled_num_features', 'Married/Single_idx', 'Car_Ownership_idx', 'Profession_ohe', 'House_Ownership_ohe', 'CITY_ohe', 'STATE_ohe']
    full_assembler = [VectorAssembler(inputCols=final_cols, outputCol='features')]
    
    steps = [num_assembler, scaler] + indexers + encoders + full_assembler
    pipeline = Pipeline(stages=steps)
  
    pipe = pipeline.fit(train)
  
    train_trans = pipe.transform(train)
    test_trans = pipe.transform(test)
  
    return train_trans, test_trans
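
For readers unfamiliar with the indexing and encoding stages, the following pure-Python sketch mimics their default behaviour as I understand it: `StringIndexer` assigns index 0 to the most frequent label, and `OneHotEncoder` drops the last category so that it is represented by an all-zero vector. The category values and the helpers `fit_string_index` and `one_hot_drop_last` are hypothetical and only illustrate the idea.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """One-hot encode an index, leaving the last category as the all-zero vector."""
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec

# Hypothetical column values, for illustration only.
house = ["rented", "rented", "rented", "owned", "owned", "neither"]

mapping = fit_string_index(house)
encoded = [one_hot_drop_last(mapping[v], len(mapping)) for v in house]

print(mapping)
print(encoded)
```

This also shows why the two binary columns (`Married/Single_idx` and `Car_Ownership_idx`) can skip the encoder: a single 0/1 index already carries all the information.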

In [9]:
train_full, test_full = process_data(training_data_full, test_data)
train_over, test_over = process_data(train_oversample, test_data)
train_under, test_under = process_data(train_undersample, test_data)
train_80, test_80 = process_data(training_data_08, test_data)
train_60, test_60 = process_data(training_data_06, test_data)
train_40, test_40 = process_data(training_data_04, test_data)
train_20, test_20 = process_data(training_data_02, test_data)

In [11]:
train_full.show()

Output truncated (header only). Columns: `id`, `Income`, `Age`, `Experience`, `Married/Single`, `House_Ownership`, `Car_Ownership`, `Profession`, `CITY`, `STATE`, `CURRENT_JOB_YRS`, `CURRENT_HOUSE_YRS`, `Risk_Flag`, `num_features`, `scaled_num_features`, `Married/Single_idx`, `House_Ownership_idx`, `Car_Ownership_idx`, `Profession_idx`, `CITY_idx`, `STATE_idx`, `House_Ownership_ohe`, `Profession_ohe`, `CITY_ohe`, `STATE_ohe`, `features`

In [10]:
num_samples = train_full.count()
num_oversample = train_over.count()
num_undersample = train_under.count()
print(f'Full training set: {num_samples}')
print(f'Oversampled training set: {num_oversample}')
print(f'Undersampled training set: {num_undersample}')
# Nominal subset sizes: randomSplit is approximate, so train_80.count() etc. give the exact sizes
print(f'80% of full training set: {mth.floor(num_samples*0.8)}')
print(f'60% of full training set: {mth.floor(num_samples*0.6)}')
print(f'40% of full training set: {mth.floor(num_samples*0.4)}')
print(f'20% of full training set: {mth.floor(num_samples*0.2)}')

Full training set: 201235
Oversampled training set: 349983
Undersampled training set: 50077
80% of full training set: 160988
60% of full training set: 120741
40% of full training set: 80494
20% of full training set: 40247

In [11]:
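def add_class_weights(df):
    # Fraction of rows in the minority (high-risk) class
    balancingRatio = df.filter(col('Risk_Flag') == 1).count() / df.count()

    # Weight each class by the other class's share so both classes carry equal total weight;
    # a built-in column expression avoids the overhead of a Python UDF
    df_weights = df.withColumn(
        'weight',
        when(col('Risk_Flag') == 0, lit(balancingRatio)).otherwise(lit(1.0 - balancingRatio)))
    return df_weights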

In [12]:
# Add a class-weight column to the datasets that were not resampled

train_full = add_class_weights(train_full)
train_80 = add_class_weights(train_80)
train_60 = add_class_weights(train_60)
train_40 = add_class_weights(train_40)
train_20 = add_class_weights(train_20)

In [13]:
train_full.groupBy('weight').count().show()

+-------------------+------+
|             weight| count|
+-------------------+------+
| 0.8771470449407498| 24726|
|0.12285295505925024|176539|
+-------------------+------+
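
The two weights in this table follow directly from the class counts. As a sketch (plain Python, variable names illustrative), the same numbers can be reproduced from the counts shown above, along with a check that the scheme gives both classes the same total weight:

```python
# Class counts read from the table above.
n_major = 176539  # Risk_Flag == 0, the rows with the smaller weight
n_minor = 24726   # Risk_Flag == 1, the rows with the larger weight
total = n_major + n_minor

# Same formula as add_class_weights(): each class is weighted by the other class's share.
balancing_ratio = n_minor / total
weight_major = balancing_ratio
weight_minor = 1.0 - balancing_ratio

print(f"weight for Risk_Flag == 0: {weight_major}")
print(f"weight for Risk_Flag == 1: {weight_minor}")
print(f"total weight per class: {n_major * weight_major:.1f} vs {n_minor * weight_minor:.1f}")
```

Since `n_major * n_minor / total` appears on both sides, the two classes contribute equally to a weighted loss, whatever the imbalance.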

## Random Forest Classifier

In [16]:
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression
#from sklearn.metrics import classification_report, confusion_matrix
from pyspark.ml.evaluation import BinaryClassificationEvaluator

def get_classification_results(clf, train, test):
    model = clf.fit(train)
    predictions = model.transform(test)
    return predictions, model

def get_accuracy_score(pred):
    acc = pred.where('prediction == Risk_Flag').count()/pred.count()
    return acc
    
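# NOTE: rawPredictionCol='prediction' scores the hard 0/1 predictions, so the areas reported
# below summarize a single operating point; use 'rawPrediction' to evaluate the full score ranking.
evaluator_ROC = BinaryClassificationEvaluator(
    labelCol='Risk_Flag', 
    rawPredictionCol='prediction', 
    metricName='areaUnderROC')

evaluator_PR = BinaryClassificationEvaluator(
    labelCol='Risk_Flag', 
    rawPredictionCol='prediction', 
    metricName='areaUnderPR')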
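
Both evaluators read the `prediction` column, which holds hard 0/1 labels rather than continuous scores. The pure-Python sketch below, on hypothetical toy data, shows what that implies for the ROC area: computed on hard labels it collapses to the balanced accuracy, the mean of the true-positive and true-negative rates. The `rank_auc` helper is an illustrative pairwise-comparison implementation, not Spark's code.

```python
def rank_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: true labels and continuous model scores.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]

# Hard predictions obtained by thresholding the scores at 0.5.
hard = [1 if s >= 0.5 else 0 for s in scores]

auc_scores = rank_auc(labels, scores)
auc_hard = rank_auc(labels, hard)

tpr = sum(1 for y, h in zip(labels, hard) if y == 1 and h == 1) / labels.count(1)
tnr = sum(1 for y, h in zip(labels, hard) if y == 0 and h == 0) / labels.count(0)
balanced_accuracy = (tpr + tnr) / 2

print(f"AUC on scores: {auc_scores}")
print(f"AUC on hard predictions: {auc_hard}")
print(f"Balanced accuracy: {balanced_accuracy}")
```

The ROC areas reported in this notebook are therefore best read as balanced accuracies at the default decision threshold rather than as threshold-free ranking scores.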

**Implementation on full training set**

In [29]:
rf = RandomForestClassifier(labelCol='Risk_Flag', 
                            featuresCol='features',
                            weightCol='weight',
                            numTrees=300,
                            maxDepth=15,
                            seed=0)

start_time = time.time()

predictions_rf, rf_model = get_classification_results(rf, train_full, test_full)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))
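
The same start/stop timing pattern is repeated for every model and dataset in this notebook. One possible way to factor it out is a small wrapper like the sketch below; `timed` is a hypothetical helper, not part of the original notebook, and `slow_sum` is a stand-in workload so the example runs without a Spark session.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch runs without Spark.
def slow_sum(n):
    return sum(range(n))

value, seconds = timed(slow_sum, 1_000_000)
print(f"Result: {value}, total execution time: {seconds:.4f} seconds")
```

In the notebook this would read `(predictions_rf, rf_model), seconds = timed(get_classification_results, rf, train_full, test_full)`. Note that `model.transform` is lazy in Spark, so a timing taken this way mostly reflects `fit`; the predictions are only computed when an action such as `count()` runs.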

Total execution time: 317.13725185394287 seconds

In [46]:
acc = get_accuracy_score(predictions_rf)
area_ROC_full = evaluator_ROC.evaluate(predictions_rf)
area_PR_full = evaluator_PR.evaluate(predictions_rf)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC_full)
print('Area Under PR Curve = ', area_PR_full)

Accuracy =  0.7597091682922504
Area Under ROC Curve =  0.682412818076296
Area Under PR Curve =  0.23910299566652599

**Implementation of Random Forest Classifier on oversampled training set**

In [47]:
rf_over = RandomForestClassifier(labelCol='Risk_Flag', 
                            featuresCol='features',
                            numTrees=300,
                            maxDepth=15,
                            seed=0)

start_time = time.time()

predictions_over_rf, model_over_rf = get_classification_results(rf_over, train_over, test_over)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 521.0781497955322 seconds

In [48]:
acc = get_accuracy_score(predictions_over_rf)
# Use distinct names so the full-training-set metrics are not overwritten
area_ROC_over = evaluator_ROC.evaluate(predictions_over_rf)
area_PR_over = evaluator_PR.evaluate(predictions_over_rf)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC_over)
print('Area Under PR Curve = ', area_PR_over)

Accuracy =  0.7872554235384525
Area Under ROC Curve =  0.6695216592310268
Area Under PR Curve =  0.24624978416515025

**Implementation of Random Forest on undersampled training set**

In [50]:
rf_under = RandomForestClassifier(labelCol='Risk_Flag', 
                                  featuresCol='features',
                                  numTrees=300,
                                  maxDepth=15,
                                  seed=0)

start_time = time.time()

predictions_under_rf, model_under_rf = get_classification_results(rf_under, train_under, test_under)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 90.91341757774353 seconds

In [51]:
acc = get_accuracy_score(predictions_under_rf)
# Use distinct names so the full-training-set metrics are not overwritten
area_ROC_under = evaluator_ROC.evaluate(predictions_under_rf)
area_PR_under = evaluator_PR.evaluate(predictions_under_rf)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC_under)
print('Area Under PR Curve = ', area_PR_under)

Accuracy =  0.7661969054893072
Area Under ROC Curve =  0.6791981435858642
Area Under PR Curve =  0.24520577158170415

**Implementation on 80% of the training set**

In [52]:
rf = RandomForestClassifier(labelCol='Risk_Flag', 
                            featuresCol='features',
                            weightCol='weight',
                            numTrees=300,
                            maxDepth=15,
                            seed=0)

start_time = time.time()

predictions_80, model_80_rf = get_classification_results(rf, train_80, test_80)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 278.2161240577698 seconds

In [53]:
acc = get_accuracy_score(predictions_80)
area_ROC_80 = evaluator_ROC.evaluate(predictions_80)
area_PR_80 = evaluator_PR.evaluate(predictions_80)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC_80)
print('Area Under PR Curve = ', area_PR_80)

Accuracy =  0.7227443794210952
Area Under ROC Curve =  0.6896723921788804
Area Under PR Curve =  0.22734000467648036

**Implementation on 60% of training set**

In [54]:
rf = RandomForestClassifier(labelCol='Risk_Flag', 
                            featuresCol='features',
                            weightCol='weight',
                            numTrees=300,
                            maxDepth=15,
                            seed=0)

start_time = time.time()

predictions_60_rf, model_60_rf = get_classification_results(rf, train_60, test_60)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 225.52445244789124 seconds

In [55]:
acc = get_accuracy_score(predictions_60_rf)
area_ROC_60 = evaluator_ROC.evaluate(predictions_60_rf)
area_PR_60 = evaluator_PR.evaluate(predictions_60_rf)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC_60)
print('Area Under PR Curve = ', area_PR_60)

Accuracy =  0.7240842544974483
Area Under ROC Curve =  0.6827368523892658
Area Under PR Curve =  0.22624840078435624

**Implementation of Random Forest on 40% of training set**

In [56]:
rf = RandomForestClassifier(labelCol='Risk_Flag', 
                            featuresCol='features',
                            weightCol='weight',
                            numTrees=300,
                            maxDepth=15,
                            seed=0)

start_time = time.time()

predictions_40_rf, model_40_rf = get_classification_results(rf, train_40, test_40)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 154.05084204673767 seconds

In [57]:
acc = get_accuracy_score(predictions_40_rf)
area_ROC_40 = evaluator_ROC.evaluate(predictions_40_rf)
area_PR_40 = evaluator_PR.evaluate(predictions_40_rf)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC_40)
print('Area Under PR Curve = ', area_PR_40)

Accuracy =  0.717365175070442
Area Under ROC Curve =  0.6866324471849079
Area Under PR Curve =  0.22461482385444012

**Implementation of Random Forest on 20% of training set**

In [58]:
rf = RandomForestClassifier(labelCol='Risk_Flag', 
                            featuresCol='features',
                            weightCol='weight',
                            numTrees=300,
                            maxDepth=15,
                            seed=0)

start_time = time.time()

predictions_20_rf, model_20_rf = get_classification_results(rf, train_20, test_20)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 89.08292961120605 seconds

In [59]:
acc = get_accuracy_score(predictions_20_rf)
area_ROC_20 = evaluator_ROC.evaluate(predictions_20_rf)
area_PR_20 = evaluator_PR.evaluate(predictions_20_rf)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC_20)
print('Area Under PR Curve = ', area_PR_20)

Accuracy =  0.7852653149691632
Area Under ROC Curve =  0.6592259161132037
Area Under PR Curve =  0.23956612643918812
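
To relate these runs back to the problem statement, the random forest training times printed above can be put side by side. The sketch below copies the five timings (rounded to two decimals) from the cell outputs and expresses each as a fraction of the full-training-set time; the dictionary layout and variable names are illustrative.

```python
# Random forest execution times in seconds, copied from the outputs above (rounded).
rf_seconds = {
    0.2: 89.08,
    0.4: 154.05,
    0.6: 225.52,
    0.8: 278.22,
    1.0: 317.14,
}

full_time = rf_seconds[1.0]
relative_time = {frac: secs / full_time for frac, secs in rf_seconds.items()}

for frac in sorted(rf_seconds):
    print(f"{frac:>4.0%} of training data: {rf_seconds[frac]:7.2f} s "
          f"({relative_time[frac]:.2f}x the full-set time)")
```

On these figures the 20% subset takes about 28% of the full-set time, so training time grows somewhat less than proportionally with training-set size. Each timing comes from a single run, so the comparison is indicative only.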

## Support Vector Machine

**Implementation on full training set**

In [60]:
from pyspark.ml.classification import LinearSVC
import time

lsvc = LinearSVC(labelCol='Risk_Flag', 
                 featuresCol='features',
                 weightCol='weight',
                 regParam=0.9)

start_time = time.time()

predictions_svc, model_svc = get_classification_results(lsvc, train_full, test_full)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 34.03608989715576 seconds

In [61]:
acc = get_accuracy_score(predictions_svc)
area_ROC = evaluator_ROC.evaluate(predictions_svc)
area_PR = evaluator_PR.evaluate(predictions_svc)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.4355579200409844
Area Under ROC Curve =  0.5859225154407148
Area Under PR Curve =  0.14833434558943642

**Implementation of Support Vector Classifier on oversampled training set**

In [62]:
lsvc_over = LinearSVC(labelCol='Risk_Flag', featuresCol='features', regParam=0.9)

start_time = time.time()

predictions_over_svc, model_over_svc = get_classification_results(lsvc_over, train_over, test_over)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 101.10137629508972 seconds

In [63]:
acc = get_accuracy_score(predictions_over_svc)
area_ROC = evaluator_ROC.evaluate(predictions_over_svc)
area_PR = evaluator_PR.evaluate(predictions_over_svc)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.8297822016359515
Area Under ROC Curve =  0.5452513230573443
Area Under PR Curve =  0.1886066952167489

**Implementation of Support Vector Classifier on undersampled training set**

In [64]:
lsvc_under = LinearSVC(labelCol='Risk_Flag', featuresCol='features', regParam=0.9)

start_time = time.time()

predictions_under_svc, model_under_svc = get_classification_results(lsvc_under, train_under, test_under)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 12.701873064041138 seconds

In [65]:
acc = get_accuracy_score(predictions_under_svc)
area_ROC = evaluator_ROC.evaluate(predictions_under_svc)
area_PR = evaluator_PR.evaluate(predictions_under_svc)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.8231758980118619
Area Under ROC Curve =  0.5465646047826286
Area Under PR Curve =  0.18120309805354212

**Implementation on 80% of training set**

In [66]:
start_time = time.time()

predictions_80_svc, model_80_svc = get_classification_results(lsvc, train_80, test_80)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 22.056426286697388 seconds

In [67]:
acc = get_accuracy_score(predictions_80_svc)
area_ROC = evaluator_ROC.evaluate(predictions_80_svc)
area_PR = evaluator_PR.evaluate(predictions_80_svc)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.26943311461843117
Area Under ROC Curve =  0.5439516220736599
Area Under PR Curve =  0.13314232404447854

**Implementation on 60% of training set**

In [69]:
start_time = time.time()

predictions_60_svc, model_60_svc = get_classification_results(lsvc, train_60, test_60)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 17.00077986717224 seconds

In [70]:
acc = get_accuracy_score(predictions_60_svc)
area_ROC = evaluator_ROC.evaluate(predictions_60_svc)
area_PR = evaluator_PR.evaluate(predictions_60_svc)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.252625564028295
Area Under ROC Curve =  0.5395765213808936
Area Under PR Curve =  0.13189679168998925

**Implementation on 40% of training set**

In [73]:
start_time = time.time()

predictions_40_svc, model_40_svc = get_classification_results(lsvc, train_40, test_40)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 19.105817079544067 seconds

In [74]:
acc = get_accuracy_score(predictions_40_svc)
area_ROC = evaluator_ROC.evaluate(predictions_40_svc)
area_PR = evaluator_PR.evaluate(predictions_40_svc)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.253472837973636
Area Under ROC Curve =  0.5367346633632495
Area Under PR Curve =  0.1311726451445018

**Implementation on 20% of training set**

In [75]:
start_time = time.time()

predictions_20_svc, model_20_svc = get_classification_results(lsvc, train_20, test_20)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 10.209076881408691 seconds

In [76]:
acc = get_accuracy_score(predictions_20_svc)
area_ROC = evaluator_ROC.evaluate(predictions_20_svc)
area_PR = evaluator_PR.evaluate(predictions_20_svc)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.7937183503773325
Area Under ROC Curve =  0.5537271698683495
Area Under PR Curve =  0.17260642610452648

## Logistic Regression

**Implementation of Logistic Regression on full training set**

In [77]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol='Risk_Flag', 
                        featuresCol='features',
                        weightCol='weight', 
                        standardization=False,
                        maxIter=200,
                        regParam=0.1)

start_time = time.time()

predictions_lr, model_lr = get_classification_results(lr, train_full, test_full)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 22.42728352546692 seconds

In [78]:
acc = get_accuracy_score(predictions_lr)
area_ROC = evaluator_ROC.evaluate(predictions_lr)
area_PR = evaluator_PR.evaluate(predictions_lr)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5262950483734311
Area Under ROC Curve =  0.5420526845385911
Area Under PR Curve =  0.13672808341967205

**Implementation of Logistic Regression on oversampled training set**

In [79]:
lr_over = LogisticRegression(labelCol='Risk_Flag', 
                        featuresCol='features', 
                        standardization=False,
                        maxIter=200,
                        regParam=0.1)

start_time = time.time()

predictions_over_lr, model_over_lr = get_classification_results(lr_over, train_over, test_over)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 57.073182106018066 seconds

In [80]:
acc = get_accuracy_score(predictions_over_lr)
area_ROC = evaluator_ROC.evaluate(predictions_over_lr)
area_PR = evaluator_PR.evaluate(predictions_over_lr)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5801123484773825
Area Under ROC Curve =  0.5328949834385808
Area Under PR Curve =  0.13628112650952728

**Implementation of Logistic Regression on undersampled training set**

In [86]:
lr_under = LogisticRegression(labelCol='Risk_Flag', 
                        featuresCol='features', 
                        standardization=False,
                        maxIter=200,
                        regParam=0.1)

start_time = time.time()

predictions_under_lr, model_under_lr = get_classification_results(lr_under, train_under, test_under)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 13.660803318023682 seconds

In [87]:
acc = get_accuracy_score(predictions_under_lr)
area_ROC = evaluator_ROC.evaluate(predictions_under_lr)
area_PR = evaluator_PR.evaluate(predictions_under_lr)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5902434217009954
Area Under ROC Curve =  0.5331263088668381
Area Under PR Curve =  0.1366811126480058

**Implementation of Logistic Regression on 80% of training set**

In [88]:
start_time = time.time()

predictions_80_lr, model_80_lr = get_classification_results(lr, train_80, test_80)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 12.917373657226562 seconds

In [83]:
acc = get_accuracy_score(predictions_80_lr)
area_ROC = evaluator_ROC.evaluate(predictions_80_lr)
area_PR = evaluator_PR.evaluate(predictions_80_lr)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5329747197099565
Area Under ROC Curve =  0.5411375847728928
Area Under PR Curve =  0.13665506770597832

**Implementation of Logistic Regression on 60% of training set**

In [89]:
start_time = time.time()

predictions_60_lr, model_60_lr = get_classification_results(lr, train_60, test_60)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 7.288186311721802 seconds

In [90]:
acc = get_accuracy_score(predictions_60_lr)
area_ROC = evaluator_ROC.evaluate(predictions_60_lr)
area_PR = evaluator_PR.evaluate(predictions_60_lr)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5308761210209915
Area Under ROC Curve =  0.5343289822954491
Area Under PR Curve =  0.13553542550694037

**Implementation of Logistic Regression on 40% of training set**

In [96]:
start_time = time.time()

predictions_40_lr, model_40_lr = get_classification_results(lr, train_40, test_40)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 9.310492515563965 seconds

In [97]:
acc = get_accuracy_score(predictions_40_lr)
area_ROC = evaluator_ROC.evaluate(predictions_40_lr)
area_PR = evaluator_PR.evaluate(predictions_40_lr)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.491478000433489
Area Under ROC Curve =  0.54239320620698
Area Under PR Curve =  0.13598620884570195

**Implementation of Logistic Regression on 20% of training set**

In [99]:
start_time = time.time()

predictions_20_lr, model_20_lr = get_classification_results(lr, train_20, test_20)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 8.076089143753052 seconds

In [100]:
acc = get_accuracy_score(predictions_20_lr)
area_ROC = evaluator_ROC.evaluate(predictions_20_lr)
area_PR = evaluator_PR.evaluate(predictions_20_lr)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5790624815274575
Area Under ROC Curve =  0.541746591549863
Area Under PR Curve =  0.13826145651143884
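
To compare the logistic regression runs on the 80%, 60%, 40% and 20% subsets side by side, the metrics printed above can be collected into a small table. The sketch below is plain Python with the values rounded to four decimals from the cell outputs; `lr_results` and `spread` are hypothetical names used only for this illustration.

```python
# (training fraction, accuracy, ROC AUC, PR AUC), rounded from the outputs above.
lr_results = [
    (0.8, 0.5330, 0.5411, 0.1367),
    (0.6, 0.5309, 0.5343, 0.1355),
    (0.4, 0.4915, 0.5424, 0.1360),
    (0.2, 0.5791, 0.5417, 0.1383),
]

print("{:>8} {:>9} {:>8} {:>8}".format("fraction", "accuracy", "ROC AUC", "PR AUC"))
for frac, acc, roc, pr in lr_results:
    print("{:>8.0%} {:>9.4f} {:>8.4f} {:>8.4f}".format(frac, acc, roc, pr))


def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


acc_spread = spread([row[1] for row in lr_results])
roc_spread = spread([row[2] for row in lr_results])
print("accuracy spread = {:.4f}, ROC AUC spread = {:.4f}".format(acc_spread, roc_spread))
```

Accuracy varies by roughly 0.09 across the subsets while ROC AUC varies by less than 0.01. One possible reading is that the accuracy differences come mostly from where the decision threshold falls rather than from a change in ranking quality.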

## Naive Bayes Classifier

**Implementation of Naive Bayes on full training set**

In [101]:
from pyspark.ml.classification import NaiveBayes

nb = NaiveBayes(smoothing=0.5,
                labelCol='Risk_Flag',
                featuresCol='features',
                weightCol='weight')

start_time = time.time()

predictions_nb, model_nb = get_classification_results(nb, train_full, test_full)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 16.379687309265137 seconds

In [102]:
acc = get_accuracy_score(predictions_nb)
area_ROC = evaluator_ROC.evaluate(predictions_nb)
area_PR = evaluator_PR.evaluate(predictions_nb)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5802250201966463
Area Under ROC Curve =  0.5894141618907046
Area Under PR Curve =  0.15687669259692577

**Implementation of Naive Bayes on oversampled training set**

In [110]:
nb_over = NaiveBayes(smoothing=0.5, labelCol='Risk_Flag', featuresCol='features')

start_time = time.time()

predictions_over_nb, model_over_nb = get_classification_results(nb_over, train_over, test_over)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 15.050861120223999 seconds

In [111]:
acc = get_accuracy_score(predictions_over_nb)
area_ROC = evaluator_ROC.evaluate(predictions_over_nb)
area_PR = evaluator_PR.evaluate(predictions_over_nb)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.591603429585099
Area Under ROC Curve =  0.5857567131182689
Area Under PR Curve =  0.1578546552239959

**Implementation of Naive Bayes on undersampled training set**

In [107]:
nb_under = NaiveBayes(smoothing=0.5, labelCol='Risk_Flag', featuresCol='features')

start_time = time.time()

predictions_under_nb, model_under_nb = get_classification_results(nb_under, train_under, test_under)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 2.6132657527923584 seconds

In [108]:
acc = get_accuracy_score(predictions_under_nb)
area_ROC = evaluator_ROC.evaluate(predictions_under_nb)
area_PR = evaluator_PR.evaluate(predictions_under_nb)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5911607653051172
Area Under ROC Curve =  0.5841352839431843
Area Under PR Curve =  0.15560117591016973

**Implementation of Naive Bayes on 80% of training set**

In [112]:
start_time = time.time()

predictions_80_nb, model_80_nb = get_classification_results(nb, train_80, test_80)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 5.228437900543213 seconds

In [113]:
acc = get_accuracy_score(predictions_80_nb)
area_ROC = evaluator_ROC.evaluate(predictions_80_nb)
area_PR = evaluator_PR.evaluate(predictions_80_nb)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5841135310929338
Area Under ROC Curve =  0.5805931891667568
Area Under PR Curve =  0.15523246334425278

**Implementation of Naive Bayes on 60% of training set**

In [114]:
start_time = time.time()

predictions_60_nb, model_60_nb = get_classification_results(nb, train_60, test_60)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 1.9169542789459229 seconds

In [115]:
acc = get_accuracy_score(predictions_60_nb)
area_ROC = evaluator_ROC.evaluate(predictions_60_nb)
area_PR = evaluator_PR.evaluate(predictions_60_nb)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5836010643539963
Area Under ROC Curve =  0.5776293237105112
Area Under PR Curve =  0.15401999364167673

**Implementation of Naive Bayes on 40% of training set**

In [116]:
start_time = time.time()

predictions_40_nb, model_40_nb = get_classification_results(nb, train_40, test_40)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 1.6204540729522705 seconds

In [117]:
acc = get_accuracy_score(predictions_40_nb)
area_ROC = evaluator_ROC.evaluate(predictions_40_nb)
area_PR = evaluator_PR.evaluate(predictions_40_nb)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5794816201833054
Area Under ROC Curve =  0.5793891616993169
Area Under PR Curve =  0.15443832125302517

**Implementation of Naive Bayes on 20% of training set**

In [118]:
start_time = time.time()

predictions_20_nb, model_20_nb = get_classification_results(nb, train_20, test_20)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 1.3145997524261475 seconds

In [119]:
acc = get_accuracy_score(predictions_20_nb)
area_ROC = evaluator_ROC.evaluate(predictions_20_nb)
area_PR = evaluator_PR.evaluate(predictions_20_nb)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.5718734601360008
Area Under ROC Curve =  0.5687466675849023
Area Under PR Curve =  0.14981649809277667

## Gradient Boosted Tree Classifier

**Implementation of Gradient Boosted Tree on full training set**

In [15]:
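from pyspark.ml.classification import GBTClassifier
import time

# Train a GBT model on the full training set with default hyperparameters.
# get_classification_results is the helper defined earlier in this notebook;
# re-run its defining cell first if the Spark session has been restarted.
gbt = GBTClassifier(labelCol="Risk_Flag", featuresCol="features")

start_time = time.time()

predictions, model_gbt = get_classification_results(gbt, train_full, test_full)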

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

An error was encountered:
name 'get_classification_results' is not defined
Traceback (most recent call last):
NameError: name 'get_classification_results' is not defined



In [123]:
acc = get_accuracy_score(predictions)
area_ROC = evaluator_ROC.evaluate(predictions)
area_PR = evaluator_PR.evaluate(predictions)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.8786230813185947
Area Under ROC Curve =  0.5064898929201874
Area Under PR Curve =  0.4090704621324296
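
An accuracy near 0.88 combined with a ROC AUC near 0.5 is the pattern one would expect from a model that predicts the majority class for almost every row. This has not been checked against the actual predictions here. The sketch below only illustrates the arithmetic with made-up counts; `confusion_metrics` is a hypothetical helper, not part of the notebook.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, recall and balanced accuracy from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical test set: 1,000 loans, 120 of them high risk (made-up counts).
# A model that always predicts "not high risk" gets every positive wrong.
always_negative = confusion_metrics(tp=0, fp=0, tn=880, fn=120)
print(always_negative)
```

To check whether this applies to the GBT model, a cross-tabulation such as `predictions.groupBy('Risk_Flag', 'prediction').count().show()` would show how many high-risk loans are actually flagged, assuming the helper keeps Spark's default `prediction` output column.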

**Implementation of Gradient Boosted Tree on oversampled training set**

In [124]:
gbt_over = GBTClassifier(labelCol="Risk_Flag", featuresCol="features")

start_time = time.time()

predictions_over_gbt, model_over_gbt = get_classification_results(gbt_over, train_over, test_over)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Total execution time: 119.98680782318115 seconds

In [125]:
acc = get_accuracy_score(predictions_over_gbt)
area_ROC = evaluator_ROC.evaluate(predictions_over_gbt)
area_PR = evaluator_PR.evaluate(predictions_over_gbt)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Accuracy =  0.6792943727210012
Area Under ROC Curve =  0.6005073162035763
Area Under PR Curve =  0.17452119391138102

**Implementation of Gradient Boosted Tree on undersampled training set**

In [126]:
gbt_under = GBTClassifier(labelCol="Risk_Flag", featuresCol="features")

start_time = time.time()

predictions_under_gbt, model_under_gbt = get_classification_results(gbt_under, train_under, test_under)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

acc = get_accuracy_score(predictions_under_gbt)
area_ROC = evaluator_ROC.evaluate(predictions_under_gbt)
area_PR = evaluator_PR.evaluate(predictions_under_gbt)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Total execution time: 32.281821966171265 seconds
Accuracy =  0.5588264270654766
Area Under ROC Curve =  0.6037771115273078
Area Under PR Curve =  0.16081552384186906

**Implementation of Gradient Boosted Tree on 80% of training set**

In [18]:
gbt = GBTClassifier(labelCol="Risk_Flag", featuresCol="features")

start_time = time.time()

predictions_80_gbt, model_80_gbt = get_classification_results(gbt, train_80, test_80)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

acc = get_accuracy_score(predictions_80_gbt)
area_ROC = evaluator_ROC.evaluate(predictions_80_gbt)
area_PR = evaluator_PR.evaluate(predictions_80_gbt)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Total execution time: 32.09629416465759 seconds
Accuracy =  0.8783866327757088
Area Under ROC Curve =  0.5054539130076998
Area Under PR Curve =  0.39312846351833197

**Implementation of Gradient Boosted Tree on 60% of training set**

In [19]:
gbt = GBTClassifier(labelCol="Risk_Flag", featuresCol="features")

start_time = time.time()

predictions_60_gbt, model_60_gbt = get_classification_results(gbt, train_60, test_60)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

acc = get_accuracy_score(predictions_60_gbt)
area_ROC = evaluator_ROC.evaluate(predictions_60_gbt)
area_PR = evaluator_PR.evaluate(predictions_60_gbt)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Total execution time: 27.841379404067993 seconds
Accuracy =  0.8784063368209494
Area Under ROC Curve =  0.5053264791506947
Area Under PR Curve =  0.39918855596441194

**Implementation of Gradient Boosted Tree on 40% of training set**

In [21]:
gbt = GBTClassifier(labelCol="Risk_Flag", featuresCol="features")

start_time = time.time()

predictions_40_gbt, model_40_gbt = get_classification_results(gbt, train_40, test_40)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

acc = get_accuracy_score(predictions_40_gbt)
area_ROC = evaluator_ROC.evaluate(predictions_40_gbt)
area_PR = evaluator_PR.evaluate(predictions_40_gbt)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Total execution time: 22.137314081192017 seconds
Accuracy =  0.8782290004137849
Area Under ROC Curve =  0.5051561260871348
Area Under PR Curve =  0.37368943488794293

**Implementation of Gradient Boosted Tree on 20% of training set**

In [24]:
gbt = GBTClassifier(labelCol="Risk_Flag", featuresCol="features")

start_time = time.time()

predictions_20_gbt, model_20_gbt = get_classification_results(gbt, train_20, test_20)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

acc = get_accuracy_score(predictions_20_gbt)
area_ROC = evaluator_ROC.evaluate(predictions_20_gbt)
area_PR = evaluator_PR.evaluate(predictions_20_gbt)

print('Accuracy = ', acc)
print('Area Under ROC Curve = ', area_ROC)
print('Area Under PR Curve = ', area_PR)

Total execution time: 16.11212134361267 seconds
Accuracy =  0.8784457449114303
Area Under ROC Curve =  0.5088847260531769
Area Under PR Curve =  0.3618862999066784
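
Since the project asks how data size affects performance, one rough way to summarise the timings is the ratio between the 80% and 20% runs of each model. The sketch below is plain Python, with times rounded from the "Total execution time" outputs in this notebook; `times` and `slowdown` are hypothetical names used only for this illustration.

```python
# Wall-clock seconds per training fraction, rounded from the outputs above.
times = {
    "Logistic Regression": {0.8: 12.917, 0.6: 7.288, 0.4: 9.310, 0.2: 8.076},
    "Naive Bayes": {0.8: 5.228, 0.6: 1.917, 0.4: 1.620, 0.2: 1.315},
    "Gradient Boosted Tree": {0.8: 32.096, 0.6: 27.841, 0.4: 22.137, 0.2: 16.112},
}


def slowdown(model_times, small=0.2, large=0.8):
    """How many times longer the larger run took than the smaller one."""
    return model_times[large] / model_times[small]


ratios = {name: slowdown(model_times) for name, model_times in times.items()}
for name, ratio in ratios.items():
    print("{:<22} 80% vs 20% time ratio: {:.2f}x".format(name, ratio))
```

These are single-run timings on a shared cluster, so they are noisy (the 40% logistic regression run took longer than the 60% run). Treat the ratios as indicative only; repeating each run several times would give a steadier picture.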
