# Data Exploration
This notebook performs exploratory data analysis on the dataset.
To expand on the analysis, attach this notebook to the **AutoMLCluster** cluster,
edit [the options of pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html), and rerun it.
- Explore completed trials in the [MLflow experiment](#mlflow/experiments/2181109761951328/s?orderByKey=metrics.%60val_f1_score%60&orderByAsc=false)
- Navigate to the parent notebook [here](#notebook/2181109761951318) (If you launched the AutoML experiment using the Experiments UI, this link isn't very useful.)

Runtime Version: _9.1.x-cpu-ml-scala2.12_

In [0]:
import os
import uuid
import shutil
import pandas as pd
import databricks.automl_runtime

from mlflow.tracking import MlflowClient

# Download input data from mlflow into a pandas DataFrame
# create temp directory to download data
temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], str(uuid.uuid4())[:8])
os.makedirs(temp_dir)

# download the artifact and read it
client = MlflowClient()
training_data_path = client.download_artifacts("5bd2893efd1f4668976fca998402268d", "data", temp_dir)
df = pd.read_parquet(os.path.join(training_data_path, "training_data"))

# delete the temp data
shutil.rmtree(temp_dir)

target_col = "Is_Submitted"
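
The download cell above creates and deletes its scratch directory by hand, so the directory is left behind if the download or the parquet read raises. A minimal sketch of the same pattern with `tempfile.TemporaryDirectory`, which cleans up automatically. The names `load_via_tempdir` and `fake_loader` are hypothetical; the loader stands in for the MLflow download plus `pd.read_parquet`.

```python
import os
import tempfile

def load_via_tempdir(loader, base_dir=None):
    """Run loader(temp_dir) inside a scratch directory that is always removed."""
    with tempfile.TemporaryDirectory(dir=base_dir) as temp_dir:
        return loader(temp_dir)

# Stand-in loader (hypothetical): records what it saw and returns a value.
seen = {}

def fake_loader(temp_dir):
    seen["path"] = temp_dir
    seen["existed"] = os.path.isdir(temp_dir)
    return "loaded"

result = load_via_tempdir(fake_loader)
```

In the notebook, `base_dir` could be `os.environ.get("SPARK_LOCAL_DIRS")` so the scratch directory stays where the original cell puts it.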

## Profiling Results

In [0]:
from pandas_profiling import ProfileReport
df_profile = ProfileReport(df, title="Profiling Report", progress_bar=False, infer_dtypes=False)
profile_html = df_profile.to_html()

displayHTML(profile_html)



0,1
Number of variables,9
Number of observations,98768
Missing cells,36864
Missing cells (%),4.1%
Duplicate rows,2727
Duplicate rows (%),2.8%
Total size in memory,6.4 MiB
Average record size in memory,68.0 B
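
The percentages in the overview can be reproduced from the raw counts. This sketch assumes that the missing-cell percentage is taken over all cells (rows times variables) and the duplicate percentage over rows, which is consistent with the figures reported above.

```python
n_rows = 98768
n_cols = 9
missing_cells = 36864
duplicate_rows = 2727

# Share of all cells that are missing, and share of rows that are duplicates.
missing_pct = 100 * missing_cells / (n_rows * n_cols)
duplicate_pct = 100 * duplicate_rows / n_rows

print(f"{missing_pct:.1f}% missing cells, {duplicate_pct:.1f}% duplicate rows")
```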

0,1
Numeric,6
Categorical,3

0,1
Dataset has 2727 (2.8%) duplicate rows,Duplicates
Web_Ad_Outlet__c has a high cardinality: 72 distinct values,High cardinality
LS_GAS is highly correlated with AnnualRevenue,High correlation
Web_Ad_Outlet_Source__c is highly correlated with Web_Ad_Outlet__c and 1 other fields,High correlation
Web_Ad_Outlet__c is highly correlated with Web_Ad_Outlet_Source__c and 1 other fields,High correlation
AnnualRevenue is highly correlated with LS_GAS,High correlation
LS_Entity_Type is highly correlated with Web_Ad_Outlet_Source__c and 1 other fields,High correlation
Web_Ad_Outlet_Source__c is highly correlated with Web_Ad_Outlet__c,High correlation
Web_Ad_Outlet__c is highly correlated with Web_Ad_Outlet_Source__c,High correlation
AnnualRevenue has 36279 (36.7%) missing values,Missing
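
The missing-value alert can be reconciled with the overview. The sketch below sums the per-variable `Missing` counts reported in the variable sections of this report. Pairing each count with a column name is an assumption based on the column order in the sample tables.

```python
# Per-variable missing counts as reported in the variable sections
# (name-to-section pairing inferred from column order, an assumption).
missing_by_variable = {
    "AnnualRevenue": 36279,
    "Is_Submitted": 0,
    "LS_Amount_to_Borrow": 4,
    "LS_Entity_Type": 2,
    "LS_GAS": 3,
    "LS_Months_in_Business": 2,
    "LS_Self_Graded_Credit": 2,
    "Web_Ad_Outlet__c": 2,
    "Web_Ad_Outlet_Source__c": 570,
}

total_missing = sum(missing_by_variable.values())
annual_revenue_share = 100 * missing_by_variable["AnnualRevenue"] / total_missing
```

The total matches the 36864 missing cells in the overview, and almost all of them come from a single variable, so imputing or dropping `AnnualRevenue` would address most of the missingness.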

0,1
Analysis started,2022-02-07 18:47:40.279828
Analysis finished,2022-02-07 18:47:50.542974
Duration,10.26 seconds
Software version,pandas-profiling v3.0.0
Download configuration,config.json

### AnnualRevenue

0,1
Distinct,7145
Distinct (%),11.4%
Missing,36279
Missing (%),36.7%
Infinite,0
Infinite (%),0.0%
Mean,509016.0774

0,1
Minimum,0
Maximum,600000000
Zeros,18336
Zeros (%),18.6%
Negative,0
Negative (%),0.0%
Memory size,771.8 KiB

0,1
Minimum,0
5-th percentile,0
Q1,0
median,180000
Q3,300000
95-th percentile,1300000
Maximum,600000000
Range,600000000
Interquartile range (IQR),300000

0,1
Standard deviation,4922224.699
Coefficient of variation (CV),9.670077071
Kurtosis,5914.059193
Mean,509016.0774
Median Absolute Deviation (MAD),180000
Skewness,64.30482215
Sum,3.180790566 × 10^10
Variance,2.422829599 × 10^13
Monotonicity,Not monotonic
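
Several of the descriptive statistics above are derived from others. As a quick consistency check, assuming the usual definitions (CV as standard deviation over mean, variance as the squared standard deviation, IQR as Q3 minus Q1), the reported values can be recomputed:

```python
import math

# Values copied from the statistics tables above.
mean = 509016.0774
std = 4922224.699
q1, q3 = 0, 300000

cv = std / mean        # coefficient of variation
variance = std ** 2    # variance from the standard deviation
iqr = q3 - q1          # interquartile range

print(round(cv, 6), f"{variance:.6e}", iqr)
```

A CV near 10 together with a mean far above the median (180000) points to a heavily right-skewed variable, which agrees with the reported skewness and kurtosis.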

Value,Count,Frequency (%)
0,18336,18.6%
200000,3288,3.3%
100000,2380,2.4%
250000,2219,2.2%
150000,1954,2.0%
300000,1829,1.9%
240000,1370,1.4%
180000,1323,1.3%
120000,1315,1.3%
360000,892,0.9%

Value,Count,Frequency (%)
0,18336,18.6%
7989,1,< 0.1%
9201,2,< 0.1%
10545,1,< 0.1%
12321,1,< 0.1%
13201,1,< 0.1%
13801,1,< 0.1%
14147,1,< 0.1%
14186,14,< 0.1%
15562,1,< 0.1%

Value,Count,Frequency (%)
600000000,1,< 0.1%
462699996,1,< 0.1%
384000000,1,< 0.1%
300000000,1,< 0.1%
239999988,1,< 0.1%
204988236,1,< 0.1%
186928224,1,< 0.1%
184200000,1,< 0.1%
180000000,1,< 0.1%
161000940,1,< 0.1%

### Is_Submitted

0,1
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,0.1724445164

0,1
Minimum,0
Maximum,1
Zeros,81736
Zeros (%),82.8%
Negative,0
Negative (%),0.0%
Memory size,385.9 KiB

0,1
Minimum,0
5-th percentile,0
Q1,0
median,0
Q3,0
95-th percentile,1
Maximum,1
Range,1
Interquartile range (IQR),0

0,1
Standard deviation,0.3777682492
Coefficient of variation (CV),2.190665479
Kurtosis,1.007456592
Mean,0.1724445164
Median Absolute Deviation (MAD),0
Skewness,1.734196123
Sum,17032
Variance,0.1427088501
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,81736,82.8%
1,17032,17.2%

Value,Count,Frequency (%)
0,81736,82.8%
1,17032,17.2%

Value,Count,Frequency (%)
1,17032,17.2%
0,81736,82.8%
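
The 0/1 variable above, which matches the target `Is_Submitted`, is imbalanced. A small sketch of what that implies, using only the counts from the frequency table: the mean equals the positive rate, and a model that always predicts 0 already reaches the majority-class accuracy. The variance formula assumes the sample (n - 1) definition.

```python
import math

counts = {0: 81736, 1: 17032}
n = sum(counts.values())

positive_rate = counts[1] / n
majority_baseline_accuracy = max(counts.values()) / n

# Sample variance of a 0/1 variable: p * (1 - p) * n / (n - 1).
sample_variance = positive_rate * (1 - positive_rate) * n / (n - 1)
```

Because roughly 83% accuracy is available for free, accuracy is a weak yardstick here. That fits the experiment link at the top of this notebook, which sorts trials by `val_f1_score`.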

### LS_Amount_to_Borrow

0,1
Distinct,567
Distinct (%),0.6%
Missing,4
Missing (%),< 0.1%
Infinite,0
Infinite (%),0.0%
Mean,65739.64353

0,1
Minimum,1
Maximum,30000000
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,771.8 KiB

0,1
Minimum,1
5-th percentile,7500
Q1,17500
median,37500
Q3,75000
95-th percentile,250000
Maximum,30000000
Range,29999999
Interquartile range (IQR),57500

0,1
Standard deviation,183679.3569
Coefficient of variation (CV),2.794042484
Kurtosis,11263.33649
Mean,65739.64353
Median Absolute Deviation (MAD),20000
Skewness,80.93323672
Sum,6492710154
Variance,3.373810615 × 10^10
Monotonicity,Not monotonic

Value,Count,Frequency (%)
37500,12910,13.1%
34000,11875,12.0%
17500,9838,10.0%
75000,7441,7.5%
7500,7337,7.4%
50000,6218,6.3%
100000,6089,6.2%
25000,5657,5.7%
10000,3758,3.8%
20000,2590,2.6%

Value,Count,Frequency (%)
1,2,< 0.1%
10,1,< 0.1%
25,2,< 0.1%
30,2,< 0.1%
50,4,< 0.1%
80,1,< 0.1%
100,5,< 0.1%
150,2,< 0.1%
200,1,< 0.1%
300,3,< 0.1%

Value,Count,Frequency (%)
30000000,1,< 0.1%
25000000,1,< 0.1%
15000000,1,< 0.1%
12000000,1,< 0.1%
6800000,1,< 0.1%
5000000,3,< 0.1%
4500000,1,< 0.1%
4100000,1,< 0.1%
4000000,1,< 0.1%
3600000,2,< 0.1%

### LS_Entity_Type

0,1
Distinct,7
Distinct (%),< 0.1%
Missing,2
Missing (%),< 0.1%
Memory size,771.8 KiB

0,1
Corporation,59924
Limited Liability Company,26238
S-Corporation,7407
Sole Proprietorship,4699
Limited Liability Partnership,461
Other values (2),37

0,1
Max length,29.0
Median length,11.0
Mean length,15.33321183
Min length,5.0

0,1
Total characters,1514400
Distinct characters,24
Distinct categories,4
Distinct scripts,2
Distinct blocks,1

0,1
Unique,0
Unique (%),0.0%

0,1
1st row,Limited Liability Company
2nd row,Limited Liability Company
3rd row,Limited Liability Company
4th row,Limited Liability Company
5th row,Limited Liability Company

Value,Count,Frequency (%)
Corporation,59924,60.7%
Limited Liability Company,26238,26.6%
S-Corporation,7407,7.5%
Sole Proprietorship,4699,4.8%
Limited Liability Partnership,461,0.5%
Non-Profit,32,< 0.1%
Other,5,< 0.1%
(Missing),2,< 0.1%
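
The text statistics for this categorical variable follow directly from the value counts. A short sketch, assuming missing values are excluded from the length statistics, that recomputes the total characters and the mean, minimum and maximum length:

```python
# Value counts copied from the frequency table above (missing rows excluded).
value_counts = {
    "Corporation": 59924,
    "Limited Liability Company": 26238,
    "S-Corporation": 7407,
    "Sole Proprietorship": 4699,
    "Limited Liability Partnership": 461,
    "Non-Profit": 32,
    "Other": 5,
}

n_values = sum(value_counts.values())
total_characters = sum(len(value) * count for value, count in value_counts.items())
mean_length = total_characters / n_values
min_length = min(len(value) for value in value_counts)
max_length = max(len(value) for value in value_counts)
```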

Value,Count,Frequency (%)
corporation,59924,38.2%
liability,26699,17.0%
limited,26699,17.0%
company,26238,16.7%
s-corporation,7407,4.7%
sole,4699,3.0%
proprietorship,4699,3.0%
partnership,461,0.3%
non-profit,32,< 0.1%
other,5,< 0.1%

Value,Count,Frequency (%)
o,242392,16.0%
i,210717,13.9%
r,149718,9.9%
t,125926,8.3%
a,120729,8.0%
p,103428,6.8%
n,94062,6.2%
C,93569,6.2%
,58097,3.8%
L,53398,3.5%

Value,Count,Frequency (%)
Lowercase Letter,1284562,84.8%
Uppercase Letter,164302,10.8%
Space Separator,58097,3.8%
Dash Punctuation,7439,0.5%

Value,Count,Frequency (%)
o,242392,18.9%
i,210717,16.4%
r,149718,11.7%
t,125926,9.8%
a,120729,9.4%
p,103428,8.1%
n,94062,7.3%
m,52937,4.1%
y,52937,4.1%
e,36563,2.8%

Value,Count,Frequency (%)
C,93569,56.9%
L,53398,32.5%
S,12106,7.4%
P,5192,3.2%
N,32,< 0.1%
O,5,< 0.1%

Value,Count,Frequency (%)
,58097,100.0%

Value,Count,Frequency (%)
-,7439,100.0%

Value,Count,Frequency (%)
Latin,1448864,95.7%
Common,65536,4.3%

Value,Count,Frequency (%)
o,242392,16.7%
i,210717,14.5%
r,149718,10.3%
t,125926,8.7%
a,120729,8.3%
p,103428,7.1%
n,94062,6.5%
C,93569,6.5%
L,53398,3.7%
m,52937,3.7%

Value,Count,Frequency (%)
,58097,88.6%
-,7439,11.4%

Value,Count,Frequency (%)
ASCII,1514400,100.0%

Value,Count,Frequency (%)
o,242392,16.0%
i,210717,13.9%
r,149718,9.9%
t,125926,8.3%
a,120729,8.0%
p,103428,6.8%
n,94062,6.2%
C,93569,6.2%
,58097,3.8%
L,53398,3.5%

### LS_GAS

0,1
Distinct,7144
Distinct (%),7.2%
Missing,3
Missing (%),< 0.1%
Infinite,0
Infinite (%),0.0%
Mean,404724.0486

0,1
Minimum,7989
Maximum,600000000
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,771.8 KiB

0,1
Minimum,7989
5-th percentile,60000
Q1,60000
median,150000
Q3,288000
95-th percentile,1000000
Maximum,600000000
Range,599992011
Interquartile range (IQR),228000

0,1
Standard deviation,3919272.954
Coefficient of variation (CV),9.683815348
Kurtosis,9319.068868
Mean,404724.0486
Median Absolute Deviation (MAD),90000
Skewness,80.67394665
Sum,3.997257066 × 10^10
Variance,1.536070049 × 10^13
Monotonicity,Not monotonic

Value,Count,Frequency (%)
60000,24002,24.3%
95000,13251,13.4%
150000,12358,12.5%
350000,4489,4.5%
200000,3288,3.3%
100000,2380,2.4%
250000,2219,2.2%
750000,1996,2.0%
300000,1829,1.9%
240000,1370,1.4%

Value,Count,Frequency (%)
7989,1,< 0.1%
9201,2,< 0.1%
10545,1,< 0.1%
12321,1,< 0.1%
13201,1,< 0.1%
13801,1,< 0.1%
14147,1,< 0.1%
14186,14,< 0.1%
15562,1,< 0.1%
15566,1,< 0.1%

Value,Count,Frequency (%)
600000000,1,< 0.1%
462699996,1,< 0.1%
384000000,1,< 0.1%
300000000,1,< 0.1%
239999988,1,< 0.1%
204988236,1,< 0.1%
186928224,1,< 0.1%
184200000,1,< 0.1%
180000000,1,< 0.1%
161000940,1,< 0.1%

### LS_Months_in_Business

0,1
Distinct,654
Distinct (%),0.7%
Missing,2
Missing (%),< 0.1%
Infinite,0
Infinite (%),0.0%
Mean,34.28511836

0,1
Minimum,0
Maximum,1366
Zeros,25936
Zeros (%),26.3%
Negative,0
Negative (%),0.0%
Memory size,771.8 KiB

0,1
Minimum,0
5-th percentile,0
Q1,0
median,24
Q3,36
95-th percentile,132
Maximum,1366
Range,1366
Interquartile range (IQR),36

0,1
Standard deviation,62.64545802
Coefficient of variation (CV),1.827190951
Kurtosis,65.73002888
Mean,34.28511836
Median Absolute Deviation (MAD),12
Skewness,6.26987275
Sum,3386204
Variance,3924.45341
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,25936,26.3%
24,20908,21.2%
36,12493,12.6%
12,7587,7.7%
6,2583,2.6%
16,1066,1.1%
15,893,0.9%
14,863,0.9%
13,860,0.9%
17,693,0.7%

Value,Count,Frequency (%)
0,25936,26.3%
6,2583,2.6%
8,10,< 0.1%
11,1,< 0.1%
12,7587,7.7%
13,860,0.9%
14,863,0.9%
15,893,0.9%
16,1066,1.1%
17,693,0.7%

Value,Count,Frequency (%)
1366,1,< 0.1%
1292,1,< 0.1%
1276,1,< 0.1%
1263,1,< 0.1%
1249,1,< 0.1%
1240,1,< 0.1%
1236,1,< 0.1%
1235,1,< 0.1%
1233,1,< 0.1%
1224,1,< 0.1%

### LS_Self_Graded_Credit

0,1
Distinct,4
Distinct (%),< 0.1%
Missing,2
Missing (%),< 0.1%
Infinite,0
Infinite (%),0.0%
Mean,2.868497256

0,1
Minimum,1
Maximum,4
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,771.8 KiB

0,1
Minimum,1
5-th percentile,1
Q1,2
median,3
Q3,4
95-th percentile,4
Maximum,4
Range,3
Interquartile range (IQR),2

0,1
Standard deviation,0.9886514579
Coefficient of variation (CV),0.3446583244
Kurtosis,-1.164622079
Mean,2.868497256
Median Absolute Deviation (MAD),1
Skewness,-0.2441060927
Sum,283310
Variance,0.9774317052
Monotonicity,Not monotonic

Value,Count,Frequency (%)
4,34525,35.0%
2,31307,31.7%
3,24831,25.1%
1,8103,8.2%
(Missing),2,< 0.1%

Value,Count,Frequency (%)
1,8103,8.2%
2,31307,31.7%
3,24831,25.1%
4,34525,35.0%

Value,Count,Frequency (%)
4,34525,35.0%
3,24831,25.1%
2,31307,31.7%
1,8103,8.2%

### Web_Ad_Outlet__c

0,1
Distinct,72
Distinct (%),0.1%
Missing,2
Missing (%),< 0.1%
Memory size,771.8 KiB

0,1
1101,27358
5061,15649
1280,11599
5081,7764
1000,6895
Other values (67),29501

0,1
Max length,5.0
Median length,4.0
Mean length,4.005750967
Min length,4.0

0,1
Total characters,395632
Distinct characters,10
Distinct categories,1
Distinct scripts,1
Distinct blocks,1

0,1
Unique,15
Unique (%),< 0.1%

0,1
1st row,1101
2nd row,1101
3rd row,1101
4th row,1101
5th row,1101

Value,Count,Frequency (%)
1101,27358,27.7%
5061,15649,15.8%
1280,11599,11.7%
5081,7764,7.9%
1000,6895,7.0%
1257,4084,4.1%
1011,3927,4.0%
5192,3141,3.2%
5000,2819,2.9%
1119,2246,2.3%

Value,Count,Frequency (%)
1101,27358,27.7%
5061,15649,15.8%
1280,11599,11.7%
5081,7764,7.9%
1000,6895,7.0%
1257,4084,4.1%
1011,3927,4.0%
5192,3141,3.2%
5000,2819,2.9%
1119,2246,2.3%

Value,Count,Frequency (%)
1,163397,41.3%
0,111301,28.1%
5,43555,11.0%
2,23954,6.1%
8,21658,5.5%
6,16859,4.3%
9,7761,2.0%
7,5659,1.4%
4,1205,0.3%
3,283,0.1%

Value,Count,Frequency (%)
Decimal Number,395632,100.0%

Value,Count,Frequency (%)
1,163397,41.3%
0,111301,28.1%
5,43555,11.0%
2,23954,6.1%
8,21658,5.5%
6,16859,4.3%
9,7761,2.0%
7,5659,1.4%
4,1205,0.3%
3,283,0.1%

Value,Count,Frequency (%)
Common,395632,100.0%

Value,Count,Frequency (%)
1,163397,41.3%
0,111301,28.1%
5,43555,11.0%
2,23954,6.1%
8,21658,5.5%
6,16859,4.3%
9,7761,2.0%
7,5659,1.4%
4,1205,0.3%
3,283,0.1%

Value,Count,Frequency (%)
ASCII,395632,100.0%

Value,Count,Frequency (%)
1,163397,41.3%
0,111301,28.1%
5,43555,11.0%
2,23954,6.1%
8,21658,5.5%
6,16859,4.3%
9,7761,2.0%
7,5659,1.4%
4,1205,0.3%
3,283,0.1%

### Web_Ad_Outlet_Source__c

0,1
Distinct,7
Distinct (%),< 0.1%
Missing,570
Missing (%),0.6%
Memory size,771.8 KiB

0,1
Direct,36542
Lending Tree,27358
Strategic Partners,18138
Digital-Organic,7677
Digital-PPC,6136
Other values (2),2347

0,1
Max length,19.0
Median length,12.0
Mean length,11.21454612
Min length,4.0

0,1
Total characters,1101246
Distinct characters,24
Distinct categories,5
Distinct scripts,2
Distinct blocks,1

0,1
Unique,0
Unique (%),0.0%

0,1
1st row,Lending Tree
2nd row,Lending Tree
3rd row,Lending Tree
4th row,Lending Tree
5th row,Lending Tree

Value,Count,Frequency (%)
Direct,36542,37.0%
Lending Tree,27358,27.7%
Strategic Partners,18138,18.4%
Digital-Organic,7677,7.8%
Digital-PPC,6136,6.2%
Digital - Affiliate,2345,2.4%
1000,2,< 0.1%
(Missing),570,0.6%

Value,Count,Frequency (%)
direct,36542,24.6%
tree,27358,18.4%
lending,27358,18.4%
strategic,18138,12.2%
partners,18138,12.2%
digital-organic,7677,5.2%
digital-ppc,6136,4.1%
,2345,1.6%
digital,2345,1.6%
affiliate,2345,1.6%

Value,Count,Frequency (%)
e,157237,14.3%
i,126721,11.5%
r,125991,11.4%
t,109459,9.9%
n,80531,7.3%
g,69331,6.3%
a,62456,5.7%
c,62357,5.7%
D,52700,4.8%
,50186,4.6%

Value,Count,Frequency (%)
Lowercase Letter,862772,78.3%
Uppercase Letter,172122,15.6%
Space Separator,50186,4.6%
Dash Punctuation,16158,1.5%
Decimal Number,8,< 0.1%

Value,Count,Frequency (%)
e,157237,18.2%
i,126721,14.7%
r,125991,14.6%
t,109459,12.7%
n,80531,9.3%
g,69331,8.0%
a,62456,7.2%
c,62357,7.2%
d,27358,3.2%
l,18503,2.1%

Value,Count,Frequency (%)
D,52700,30.6%
P,30410,17.7%
L,27358,15.9%
T,27358,15.9%
S,18138,10.5%
O,7677,4.5%
C,6136,3.6%
A,2345,1.4%

Value,Count,Frequency (%)
0,6,75.0%
1,2,25.0%

Value,Count,Frequency (%)
,50186,100.0%

Value,Count,Frequency (%)
-,16158,100.0%

Value,Count,Frequency (%)
Latin,1034894,94.0%
Common,66352,6.0%

Value,Count,Frequency (%)
e,157237,15.2%
i,126721,12.2%
r,125991,12.2%
t,109459,10.6%
n,80531,7.8%
g,69331,6.7%
a,62456,6.0%
c,62357,6.0%
D,52700,5.1%
P,30410,2.9%

Value,Count,Frequency (%)
,50186,75.6%
-,16158,24.4%
0,6,< 0.1%
1,2,< 0.1%

Value,Count,Frequency (%)
ASCII,1101246,100.0%

Value,Count,Frequency (%)
e,157237,14.3%
i,126721,11.5%
r,125991,11.4%
t,109459,9.9%
n,80531,7.3%
g,69331,6.3%
a,62456,5.7%
c,62357,5.7%
D,52700,4.8%
,50186,4.6%

Unnamed: 0,AnnualRevenue,Is_Submitted,LS_Amount_to_Borrow,LS_Entity_Type,LS_GAS,LS_Months_in_Business,LS_Self_Graded_Credit,Web_Ad_Outlet__c,Web_Ad_Outlet_Source__c
0,250000.0,0,25000.0,Limited Liability Company,250000.0,13.0,1.0,1101,Lending Tree
1,110000.0,0,50000.0,Limited Liability Company,110000.0,12.0,4.0,1101,Lending Tree
2,1000000.0,1,90000.0,Limited Liability Company,1000000.0,12.0,2.0,1101,Lending Tree
3,350000.0,1,40000.0,Limited Liability Company,350000.0,13.0,3.0,1101,Lending Tree
4,1000000.0,0,20000.0,Limited Liability Company,1000000.0,13.0,3.0,1101,Lending Tree
5,100000.0,0,25000.0,Limited Liability Company,100000.0,13.0,4.0,1101,Lending Tree
6,200000.0,0,100000.0,Limited Liability Company,200000.0,14.0,2.0,1101,Lending Tree
7,459000.0,0,150000.0,Limited Liability Company,459000.0,12.0,1.0,1101,Lending Tree
8,100000.0,0,50000.0,Limited Liability Company,100000.0,13.0,1.0,1101,Lending Tree
9,300000.0,0,25000.0,S-Corporation,300000.0,13.0,4.0,1101,Lending Tree

Unnamed: 0,AnnualRevenue,Is_Submitted,LS_Amount_to_Borrow,LS_Entity_Type,LS_GAS,LS_Months_in_Business,LS_Self_Graded_Credit,Web_Ad_Outlet__c,Web_Ad_Outlet_Source__c
98758,102000.0,0,100000.0,Limited Liability Company,352565.0,464.0,1.0,1101.0,Lending Tree
98759,150000.0,0,300000.0,S-Corporation,1000000.0,476.0,2.0,1101.0,Lending Tree
98760,352565.0,0,10000.0,S-Corporation,180000.0,486.0,2.0,1101.0,Lending Tree
98761,1000000.0,0,25000.0,S-Corporation,1350000.0,531.0,3.0,1101.0,Lending Tree
98762,180000.0,0,65000.0,Corporation,175000.0,608.0,4.0,1101.0,Lending Tree
98763,1350000.0,0,35000.0,S-Corporation,5500000.0,718.0,2.0,1101.0,Lending Tree
98764,175000.0,0,25000.0,Limited Liability Company,1000000.0,70.0,4.0,1101.0,Lending Tree
98765,5500000.0,0,15000.0,Limited Liability Company,250536.0,112.0,3.0,1101.0,Lending Tree
98766,1000000.0,0,,,,,,,
98767,250536.0,0,,,,,,,

Unnamed: 0,AnnualRevenue,Is_Submitted,LS_Amount_to_Borrow,LS_Entity_Type,LS_GAS,LS_Months_in_Business,LS_Self_Graded_Credit,Web_Ad_Outlet__c,Web_Ad_Outlet_Source__c,# duplicates
522,0.0,0,34000.0,Corporation,60000.0,0.0,2.0,5061,Direct,799
566,0.0,0,34000.0,Corporation,95000.0,0.0,2.0,5061,Direct,549
524,0.0,0,34000.0,Corporation,60000.0,0.0,2.0,5081,Direct,416
102,0.0,0,7500.0,Corporation,60000.0,24.0,4.0,5061,Direct,314
568,0.0,0,34000.0,Corporation,95000.0,0.0,2.0,5081,Direct,255
675,0.0,0,37500.0,Corporation,60000.0,24.0,4.0,5061,Direct,217
666,0.0,0,37500.0,Corporation,60000.0,24.0,3.0,5061,Direct,198
256,0.0,0,17500.0,Corporation,60000.0,24.0,4.0,5061,Direct,194
61,0.0,0,7500.0,Corporation,60000.0,12.0,4.0,5061,Direct,164
247,0.0,0,17500.0,Corporation,60000.0,24.0,3.0,5061,Direct,157
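
A table like the duplicate-row summary above can be approximated in pandas by grouping on every column and counting. This is a sketch on a toy frame with made-up rows, not the profiling library's own implementation. The `# duplicates` column name is borrowed from the report.

```python
import pandas as pd

# Toy data (illustrative only): the first two rows are identical.
toy = pd.DataFrame({
    "LS_Amount_to_Borrow": [34000.0, 34000.0, 7500.0, 34000.0],
    "Web_Ad_Outlet__c": ["5061", "5061", "5061", "5081"],
})

# Count how often each distinct row occurs, then keep rows seen more than once.
dup_counts = (
    toy.groupby(list(toy.columns))
       .size()
       .reset_index(name="# duplicates")
)
dup_counts = dup_counts[dup_counts["# duplicates"] > 1]

# Dropping repeats keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
```

Whether to drop the 2727 duplicate rows before training is a modelling decision. Identical leads may be legitimate repeat submissions rather than data errors.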


In [0]:
import os
import requests
import numpy as np
import pandas as pd

def create_tf_serving_json(data):
  return {'inputs': {name: data[name].tolist() for name in data.keys()} if isinstance(data, dict) else data.tolist()}

def score_model(dataset):
  url = 'https://adb-8618183333104940.0.azuredatabricks.net/model/BestmodelAuto/1/invocations'
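  # Read the token from the DATABRICKS_TOKEN environment variable; the name passed to os.environ.get must be the variable name, not the token itself
  headers = {'Authorization': f'Bearer {os.environ.get("DATABRICKS_TOKEN")}'}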
  data_json = dataset.to_dict(orient='split') if isinstance(dataset, pd.DataFrame) else create_tf_serving_json(dataset)
  response = requests.request(method='POST', headers=headers, url=url, json=data_json)
  if response.status_code != 200:
    raise Exception(f'Request failed with status {response.status_code}, {response.text}')
  return response.json()

In [0]:
num_predictions = 5
served_predictions = score_model(X_test[:num_predictions])
#model_evaluations = model.predict(X_test[:num_predictions])
# Compare the results from the deployed model and the trained model
pd.DataFrame({
#  "Model Prediction": model_evaluations,
  "Served Model Prediction": served_predictions,
})

In [0]:
import mlflow
import databricks.automl_runtime

# Use MLflow to track experiments
mlflow.set_experiment("/Users/rtavakoli@reliantfunding.com/databricks_automl/Is_Submitted_leadgendatabricks_csv-2022_02_07-10_17")

target_col = "Is_Submitted"

In [0]:
from mlflow.tracking import MlflowClient
import os
import uuid
import shutil
import pandas as pd

# Create temp directory to download input data from MLflow
input_temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], str(uuid.uuid4())[:8])
os.makedirs(input_temp_dir)


# Download the artifact and read it into a pandas DataFrame
input_client = MlflowClient()
input_data_path = input_client.download_artifacts("ed6c7f565ca646a896ba7d4539709e2d", "data", input_temp_dir)

df_loaded = pd.read_parquet(os.path.join(input_data_path, "training_data"))
# Delete the temp data
shutil.rmtree(input_temp_dir)

# Preview data
df_loaded.head(5)

Unnamed: 0,AnnualRevenue,Is_Submitted,LS_Amount_to_Borrow,LS_Entity_Type,LS_GAS,LS_Months_in_Business,LS_Self_Graded_Credit,Web_Ad_Outlet__c,Web_Ad_Outlet_Source__c
0,250000.0,0,25000.0,Limited Liability Company,250000.0,13.0,1.0,1101,Lending Tree
1,110000.0,0,50000.0,Limited Liability Company,110000.0,12.0,4.0,1101,Lending Tree
2,1000000.0,1,90000.0,Limited Liability Company,1000000.0,12.0,2.0,1101,Lending Tree
3,350000.0,1,40000.0,Limited Liability Company,350000.0,13.0,3.0,1101,Lending Tree
4,1000000.0,0,20000.0,Limited Liability Company,1000000.0,13.0,3.0,1101,Lending Tree


In [0]:
from sklearn.model_selection import train_test_split

split_X = df_loaded.drop([target_col], axis=1)
split_y = df_loaded[target_col]

X_train, X_val, y_train, y_val = train_test_split(split_X, split_y, random_state=866429491, stratify=split_y)

In [0]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_loaded, random_state=123)
X_train = train.drop([target_col], axis=1)
X_test = test.drop([target_col], axis=1)
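# Index by the column name stored in target_col; attribute access would look for a column literally named "target_col"
y_train = train[target_col]
y_test = test[target_col]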
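
Selecting the target by the name stored in `target_col` needs bracket indexing. Attribute access such as `train.target_col` looks for a column literally called `target_col` and fails. A small demonstration on a toy frame (the data is made up):

```python
import pandas as pd

target_col = "Is_Submitted"
toy = pd.DataFrame({
    "Is_Submitted": [0, 1, 0],
    "LS_GAS": [60000.0, 95000.0, 150000.0],
})

# Bracket indexing resolves the variable to the column name it holds.
y = toy[target_col]

# Attribute access ignores the variable and looks up "target_col" itself.
try:
    toy.target_col
    attribute_lookup_failed = False
except AttributeError:
    attribute_lookup_failed = True
```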

In [0]:
import os
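# Placeholder: supply the token at runtime (for example from a secret store) instead of committing it to the notebook
os.environ["DATABRICKS_TOKEN"] = "<your-databricks-token>"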
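
If the environment variable is missing, `os.environ.get` returns `None` and the request is sent with the header `Bearer None`, which only surfaces later as an authentication error. A minimal sketch of a helper that fails early instead. The function `auth_headers` and the variable name `EXAMPLE_TOKEN` are hypothetical.

```python
import os

def auth_headers(env_var="DATABRICKS_TOKEN"):
    """Build the Authorization header, failing early if the token is unset."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"Authorization": f"Bearer {token}"}

# Demonstration with a dummy variable, not a real credential.
os.environ["EXAMPLE_TOKEN"] = "example-value"
headers = auth_headers("EXAMPLE_TOKEN")

os.environ.pop("EXAMPLE_MISSING_TOKEN", None)
try:
    auth_headers("EXAMPLE_MISSING_TOKEN")
    missing_raised = False
except RuntimeError:
    missing_raised = True
```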

In [0]:
import os
import requests
import numpy as np
import pandas as pd

def create_tf_serving_json(data):
  return {'inputs': {name: data[name].tolist() for name in data.keys()} if isinstance(data, dict) else data.tolist()}

def score_model(dataset):
  url = 'https://adb-8618183333104940.0.azuredatabricks.net/model/BestmodelAuto/1/invocations'
  headers = {'Authorization': f'Bearer {os.environ.get("DATABRICKS_TOKEN")}'}
  data_json = dataset.to_dict(orient='split') if isinstance(dataset, pd.DataFrame) else create_tf_serving_json(dataset)
  response = requests.request(method='POST', headers=headers, url=url, json=data_json)
  if response.status_code != 200:
    raise Exception(f'Request failed with status {response.status_code}, {response.text}')
  return response.json()
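
To see what `score_model` actually sends, the two payload shapes can be built without calling the endpoint. The sketch below uses made-up rows. Whether a given serving endpoint accepts these exact shapes depends on the MLflow version behind it, so treat the format as an assumption to verify.

```python
import numpy as np
import pandas as pd

# DataFrame input: "split" orientation gives index, columns and row data.
frame = pd.DataFrame({
    "LS_Amount_to_Borrow": [25000.0, 50000.0],
    "LS_Entity_Type": ["Corporation", "S-Corporation"],
})
frame_payload = frame.to_dict(orient="split")

# Array input: wrapped under "inputs", as create_tf_serving_json does.
tensor_payload = {"inputs": np.array([[1.0, 2.0], [3.0, 4.0]]).tolist()}

print(frame_payload)
print(tensor_payload)
```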

In [0]:
num_predictions = 5
served_predictions = score_model(X_test[:num_predictions])
#model_evaluations = model.predict(X_test[:num_predictions])
# Compare the results from the deployed model and the trained model
pd.DataFrame({
#  "Model Prediction": model_evaluations,
  "Served Model Prediction": served_predictions,
})

Unnamed: 0,Served Model Prediction
0,0
1,0
2,1
3,0
4,0
