# Data Exploration
This notebook performs exploratory data analysis on the dataset.
To expand on the analysis, attach this notebook to the **AutoMLCluster** cluster,
edit [the options of pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html), and rerun it.
- Explore completed trials in the [MLflow experiment](#mlflow/experiments/3869966274492205/s?orderByKey=metrics.%60val_f1_score%60&orderByAsc=false)
- Navigate to the parent notebook [here](#notebook/3869966274492195) (this link is not useful if you launched the AutoML experiment from the Experiments UI).

Runtime Version: _9.1.x-cpu-ml-scala2.12_

> **NOTE:** The dataset loaded below is a sample of the original dataset.
Stratified sampling with PySpark's [sampleBy](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameStatFunctions.sampleBy.html)
method ensures that the distribution of the target column is retained.
<br/>
> Rows were sampled with a sampling fraction of **0.7379109669685667**.
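
The idea behind the stratified sample described in the note can be sketched in pandas. This is an analogue for illustration only, not the PySpark `sampleBy` call that AutoML ran; the toy frame and its `feature` column are hypothetical. Sampling the same fraction from each target class keeps the class mix roughly unchanged.

```python
import pandas as pd

# Hypothetical toy data: 1,000 rows with a 14% positive rate in the target column.
toy = pd.DataFrame({
    "Is_Submitted": [1] * 140 + [0] * 860,
    "feature": range(1000),
})

# Draw the same fraction from every target class so the class mix is retained.
fraction = 0.7379109669685667
sampled = toy.groupby("Is_Submitted").sample(frac=fraction, random_state=0)

original_rate = toy["Is_Submitted"].mean()
sampled_rate = sampled["Is_Submitted"].mean()
print(f"positive rate before: {original_rate:.3f}, after: {sampled_rate:.3f}")
```

Unlike this sketch, `sampleBy` samples each row independently, so the per-class counts it returns are only approximately proportional to the requested fraction.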

In [0]:
import os
import uuid
import shutil
import pandas as pd
import databricks.automl_runtime

from mlflow.tracking import MlflowClient

# Download input data from MLflow into a pandas DataFrame
# Create a temp directory to hold the downloaded data
temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], str(uuid.uuid4())[:8])
os.makedirs(temp_dir)

# download the artifact and read it
client = MlflowClient()
training_data_path = client.download_artifacts("a49ada83698c40c7bc1df218835d17bd", "data", temp_dir)
df = pd.read_parquet(os.path.join(training_data_path, "training_data"))

# delete the temp data
shutil.rmtree(temp_dir)

target_col = "Is_Submitted"
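
The cell above creates and removes its temp directory by hand, so the directory is left behind if the download or the read fails. A possible alternative, sketched here under assumptions, is `tempfile.TemporaryDirectory`, which cleans up on exit from the `with` block. The fallback to the system temp directory and the placeholder file are hypothetical stand-ins for the MLflow artifact.

```python
import os
import tempfile

# SPARK_LOCAL_DIRS is set on Databricks clusters; falling back to the system
# temp directory elsewhere is an assumption made for this sketch.
base_dir = os.environ.get("SPARK_LOCAL_DIRS", tempfile.gettempdir())

with tempfile.TemporaryDirectory(dir=base_dir) as temp_dir:
    # Stand-in for the downloaded artifact: write a small file and read it back.
    sample_path = os.path.join(temp_dir, "training_data.txt")
    with open(sample_path, "w") as handle:
        handle.write("placeholder")
    with open(sample_path) as handle:
        contents = handle.read()
    existed_inside = os.path.isdir(temp_dir)

# The directory and its contents are removed when the with-block exits,
# even if reading the data raises an exception.
removed_after = not os.path.exists(temp_dir)
print(contents, existed_inside, removed_after)
```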

## Profiling Results

In [0]:
from pandas_profiling import ProfileReport
df_profile = ProfileReport(df, title="Profiling Report", progress_bar=False, infer_dtypes=False)
profile_html = df_profile.to_html()

displayHTML(profile_html)

0,1
Number of variables,11
Number of observations,19724
Missing cells,58
Missing cells (%),< 0.1%
Duplicate rows,39
Duplicate rows (%),0.2%
Total size in memory,1.2 MiB
Average record size in memory,64.0 B

0,1
Numeric,6
Categorical,5

0,1
"Web_Ad_Outlet__c has constant value ""1101""",Constant
"Web_Ad_Outlet_Source__c has constant value ""Lending Tree""",Constant
Dataset has 39 (0.2%) duplicate rows,Duplicates
City has a high cardinality: 4503 distinct values,High cardinality
Company has a high cardinality: 19307 distinct values,High cardinality
State is highly correlated with PostalCode,High correlation
PostalCode is highly correlated with State,High correlation
State is highly correlated with Web_Ad_Outlet_Source__c,High correlation
Web_Ad_Outlet_Source__c is highly correlated with State and 1 other fields,High correlation
LS_Entity_Type is highly correlated with Web_Ad_Outlet_Source__c,High correlation
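
Several of the warnings listed above can be approximated directly in pandas. The following is a minimal sketch on a hypothetical toy frame (column names mirror the dataset, values are made up); it is not how pandas-profiling computes its report, and its counting rules may differ.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Web_Ad_Outlet__c": [1101, 1101, 1101, 1101],
    "Company": ["A LLC", "B Inc", "A LLC", "C Co"],
    "State": ["NY", "TX", "NY", None],
})

# Constant columns: exactly one distinct non-null value.
distinct_counts = toy.nunique()
constant_cols = distinct_counts[distinct_counts == 1].index.tolist()

# Distinct rows that occur more than once in the frame.
repeated_rows = toy[toy.duplicated(keep=False)].drop_duplicates()
n_duplicated_groups = len(repeated_rows)

# Missing cells across the whole frame.
n_missing = int(toy.isna().sum().sum())

# Cardinality: share of distinct values per column.
distinct_share = (distinct_counts / len(toy)).round(2).to_dict()

print(constant_cols, n_duplicated_groups, n_missing, distinct_share)
```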

0,1
Analysis started,2022-03-01 20:01:40.392358
Analysis finished,2022-03-01 20:01:48.063241
Duration,7.67 seconds
Software version,pandas-profiling v3.0.0
Download configuration,config.json

### LS_Amount_to_Borrow

0,1
Distinct,345
Distinct (%),1.7%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,67919.16655

0,1
Minimum,2000
Maximum,2000000
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,77.2 KiB

0,1
Minimum,2000
5-th percentile,10000
Q1,20000
median,45000
Q3,80000
95-th percentile,225000
Maximum,2000000
Range,1998000
Interquartile range (IQR),60000

0,1
Standard deviation,110278.1862
Coefficient of variation (CV),1.623668131
Kurtosis,117.9982619
Mean,67919.16655
Median Absolute Deviation (MAD),28500
Skewness,8.684725627
Sum,1339637641
Variance,1.216127835 × 10^10
Monotonicity,Not monotonic
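
The derived statistics in the tables above follow from the basic ones: the coefficient of variation is the standard deviation divided by the mean, the variance is the standard deviation squared, and the IQR and range are simple differences. A short check using the reported values:

```python
import math

# Values copied from the quantile and descriptive statistics tables above.
mean = 67919.16655
std = 110278.1862
q1, q3 = 20000, 80000
minimum, maximum = 2000, 2000000

cv = std / mean              # coefficient of variation
variance = std ** 2          # variance is the squared standard deviation
iqr = q3 - q1                # interquartile range
value_range = maximum - minimum

print(f"CV={cv:.6f}, variance={variance:.6e}, IQR={iqr}, range={value_range}")
```

A CV well above 1 together with the large skewness suggests a long right tail, which matches the gap between the median (45000) and the maximum (2000000).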

Value,Count,Frequency (%)
50000,3584,18.2%
10000,2372,12.0%
100000,1920,9.7%
25000,1633,8.3%
20000,1474,7.5%
15000,1189,6.0%
30000,885,4.5%
150000,734,3.7%
250000,567,2.9%
40000,550,2.8%

Value,Count,Frequency (%)
2000,25,0.1%
2017,1,< 0.1%
2158,1,< 0.1%
2400,1,< 0.1%
2500,17,0.1%
2850,1,< 0.1%
3000,15,0.1%
3500,6,< 0.1%
4000,5,< 0.1%
4500,3,< 0.1%

Value,Count,Frequency (%)
2000000,16,0.1%
1800000,2,< 0.1%
1600000,2,< 0.1%
1500000,12,0.1%
1400000,2,< 0.1%
1350000,1,< 0.1%
1300000,2,< 0.1%
1200000,3,< 0.1%
1100000,2,< 0.1%
1000000,24,0.1%

### LS_Entity_Type

0,1
Distinct,5
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,154.2 KiB

0,1
Limited Liability Company,11849
S-Corporation,3010
Sole Proprietorship,2974
Corporation,1836
Limited Liability Partnership,55

0,1
Max length,29.0
Median length,25.0
Mean length,20.97201379
Min length,11.0

0,1
Total characters,413652
Distinct characters,21
Distinct categories,4
Distinct scripts,2
Distinct blocks,1

0,1
Unique,0
Unique (%),0.0%

0,1
1st row,Limited Liability Company
2nd row,Limited Liability Company
3rd row,Limited Liability Company
4th row,Limited Liability Company
5th row,Limited Liability Company

Value,Count,Frequency (%)
Limited Liability Company,11849,60.1%
S-Corporation,3010,15.3%
Sole Proprietorship,2974,15.1%
Corporation,1836,9.3%
Limited Liability Partnership,55,0.3%

Value,Count,Frequency (%)
limited,11904,25.6%
liability,11904,25.6%
company,11849,25.5%
s-corporation,3010,6.5%
sole,2974,6.4%
proprietorship,2974,6.4%
corporation,1836,3.9%
partnership,55,0.1%

Value,Count,Frequency (%)
i,70369,17.0%
o,35309,8.5%
t,31683,7.7%
a,28654,6.9%
,26782,6.5%
L,23808,5.8%
m,23753,5.7%
y,23753,5.7%
p,22698,5.5%
r,18724,4.5%

Value,Count,Frequency (%)
Lowercase Letter,334344,80.8%
Uppercase Letter,49516,12.0%
Space Separator,26782,6.5%
Dash Punctuation,3010,0.7%

Value,Count,Frequency (%)
i,70369,21.0%
o,35309,10.6%
t,31683,9.5%
a,28654,8.6%
m,23753,7.1%
y,23753,7.1%
p,22698,6.8%
r,18724,5.6%
e,17907,5.4%
n,16750,5.0%

Value,Count,Frequency (%)
L,23808,48.1%
C,16695,33.7%
S,5984,12.1%
P,3029,6.1%

Value,Count,Frequency (%)
,26782,100.0%

Value,Count,Frequency (%)
-,3010,100.0%

Value,Count,Frequency (%)
Latin,383860,92.8%
Common,29792,7.2%

Value,Count,Frequency (%)
i,70369,18.3%
o,35309,9.2%
t,31683,8.3%
a,28654,7.5%
L,23808,6.2%
m,23753,6.2%
y,23753,6.2%
p,22698,5.9%
r,18724,4.9%
e,17907,4.7%

Value,Count,Frequency (%)
,26782,89.9%
-,3010,10.1%

Value,Count,Frequency (%)
ASCII,413652,100.0%

Value,Count,Frequency (%)
i,70369,17.0%
o,35309,8.5%
t,31683,7.7%
a,28654,6.9%
,26782,6.5%
L,23808,5.8%
m,23753,5.7%
y,23753,5.7%
p,22698,5.5%
r,18724,4.5%

### LSGAS

0,1
Distinct,1670
Distinct (%),8.5%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,482228.5665

0,1
Minimum,60000
Maximum,99999999
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,77.2 KiB

0,1
Minimum,60000
5-th percentile,100000
Q1,145000
median,200000
Q3,300000
95-th percentile,989850
Maximum,99999999
Range,99939999
Interquartile range (IQR),155000

0,1
Standard deviation,3406149.288
Coefficient of variation (CV),7.063350296
Kurtosis,689.8717099
Mean,482228.5665
Median Absolute Deviation (MAD),70000
Skewness,25.24920483
Sum,9511476246
Variance,1.160185297 × 10^13
Monotonicity,Not monotonic

Value,Count,Frequency (%)
200000,2377,12.1%
100000,1700,8.6%
250000,1500,7.6%
150000,1247,6.3%
120000,864,4.4%
300000,721,3.7%
500000,429,2.2%
350000,357,1.8%
125000,346,1.8%
240000,332,1.7%

Value,Count,Frequency (%)
60000,2,< 0.1%
100000,1700,8.6%
100006,1,< 0.1%
100008,1,< 0.1%
100020,1,< 0.1%
100051,1,< 0.1%
100200,3,< 0.1%
100225,1,< 0.1%
100300,3,< 0.1%
100313,1,< 0.1%

Value,Count,Frequency (%)
99999999,16,0.1%
92000000,1,< 0.1%
80046379,1,< 0.1%
75246431,1,< 0.1%
75000000,2,< 0.1%
67467000,1,< 0.1%
58000000,1,< 0.1%
52845666,1,< 0.1%
50000000,1,< 0.1%
46580000,1,< 0.1%

### Web_Ad_Outlet__c

0,1
Distinct,1
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,1101

0,1
Minimum,1101
Maximum,1101
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,77.2 KiB

0,1
Minimum,1101
5-th percentile,1101
Q1,1101
median,1101
Q3,1101
95-th percentile,1101
Maximum,1101
Range,0
Interquartile range (IQR),0

0,1
Standard deviation,0
Coefficient of variation (CV),0
Kurtosis,0
Mean,1101
Median Absolute Deviation (MAD),0
Skewness,0
Sum,21716124
Variance,0
Monotonicity,Increasing

Value,Count,Frequency (%)
1101,19724,100.0%

Value,Count,Frequency (%)
1101,19724,100.0%

Value,Count,Frequency (%)
1101,19724,100.0%

### Web_Ad_Outlet_Source__c

0,1
Distinct,1
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,154.2 KiB

0,1
Lending Tree,19724

0,1
Max length,12
Median length,12
Mean length,12
Min length,12

0,1
Total characters,236688
Distinct characters,9
Distinct categories,3
Distinct scripts,2
Distinct blocks,1

0,1
Unique,0
Unique (%),0.0%

0,1
1st row,Lending Tree
2nd row,Lending Tree
3rd row,Lending Tree
4th row,Lending Tree
5th row,Lending Tree

Value,Count,Frequency (%)
Lending Tree,19724,100.0%

Value,Count,Frequency (%)
lending,19724,50.0%
tree,19724,50.0%

Value,Count,Frequency (%)
e,59172,25.0%
n,39448,16.7%
L,19724,8.3%
d,19724,8.3%
i,19724,8.3%
g,19724,8.3%
,19724,8.3%
T,19724,8.3%
r,19724,8.3%

Value,Count,Frequency (%)
Lowercase Letter,177516,75.0%
Uppercase Letter,39448,16.7%
Space Separator,19724,8.3%

Value,Count,Frequency (%)
e,59172,33.3%
n,39448,22.2%
d,19724,11.1%
i,19724,11.1%
g,19724,11.1%
r,19724,11.1%

Value,Count,Frequency (%)
L,19724,50.0%
T,19724,50.0%

Value,Count,Frequency (%)
,19724,100.0%

Value,Count,Frequency (%)
Latin,216964,91.7%
Common,19724,8.3%

Value,Count,Frequency (%)
e,59172,27.3%
n,39448,18.2%
L,19724,9.1%
d,19724,9.1%
i,19724,9.1%
g,19724,9.1%
T,19724,9.1%
r,19724,9.1%

Value,Count,Frequency (%)
,19724,100.0%

Value,Count,Frequency (%)
ASCII,236688,100.0%

Value,Count,Frequency (%)
e,59172,25.0%
n,39448,16.7%
L,19724,8.3%
d,19724,8.3%
i,19724,8.3%
g,19724,8.3%
,19724,8.3%
T,19724,8.3%
r,19724,8.3%

### Is_Submitted

0,1
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,0.1357229771

0,1
Minimum,0
Maximum,1
Zeros,17047
Zeros (%),86.4%
Negative,0
Negative (%),0.0%
Memory size,77.2 KiB

0,1
Minimum,0
5-th percentile,0
Q1,0
median,0
Q3,0
95-th percentile,1
Maximum,1
Range,1
Interquartile range (IQR),0

0,1
Standard deviation,0.3425028439
Coefficient of variation (CV),2.523543553
Kurtosis,2.525930082
Mean,0.1357229771
Median Absolute Deviation (MAD),0
Skewness,2.127363148
Sum,2677
Variance,0.1173081981
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,17047,86.4%
1,2677,13.6%

Value,Count,Frequency (%)
0,17047,86.4%
1,2677,13.6%

Value,Count,Frequency (%)
1,2677,13.6%
0,17047,86.4%
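
For a 0/1 column, the mean is the share of positive rows and the variance follows from that share alone. The check below reproduces the reported mean and variance from the two counts in the frequency table; it assumes the report uses the sample variance (dividing by n - 1), which is consistent with the reported figure.

```python
import math

# Counts copied from the frequency table above.
n_zero, n_one = 17047, 2677
n = n_zero + n_one

# Mean of a 0/1 column equals the positive rate.
positive_rate = n_one / n

# Sample variance of a 0/1 column: p * (1 - p) * n / (n - 1).
sample_variance = positive_rate * (1 - positive_rate) * n / (n - 1)
sample_std = math.sqrt(sample_variance)

print(f"n={n}, positive rate={positive_rate:.10f}, "
      f"variance={sample_variance:.10f}, std={sample_std:.10f}")
```

Only about 13.6% of rows are positive, so the target is imbalanced; this is the distribution the stratified splits later in the notebook are meant to preserve.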

### LS_Self_Graded_Credit

0,1
Distinct,4
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,2.950415737

0,1
Minimum,1
Maximum,4
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,77.2 KiB

0,1
Minimum,1
5-th percentile,1
Q1,2
median,3
Q3,4
95-th percentile,4
Maximum,4
Range,3
Interquartile range (IQR),2

0,1
Standard deviation,1.006280198
Coefficient of variation (CV),0.3410638662
Kurtosis,-0.9101440463
Mean,2.950415737
Median Absolute Deviation (MAD),1
Skewness,-0.5122086386
Sum,58194
Variance,1.012599837
Monotonicity,Not monotonic

Value,Count,Frequency (%)
4,7473,37.9%
3,5848,29.6%
2,4355,22.1%
1,2048,10.4%

Value,Count,Frequency (%)
1,2048,10.4%
2,4355,22.1%
3,5848,29.6%
4,7473,37.9%

Value,Count,Frequency (%)
4,7473,37.9%
3,5848,29.6%
2,4355,22.1%
1,2048,10.4%

### City

0,1
Distinct,4503
Distinct (%),22.8%
Missing,0
Missing (%),0.0%
Memory size,154.2 KiB

0,1
Miami,317
Houston,282
Atlanta,249
Chicago,245
Brooklyn,181
Other values (4498),18450

0,1
Max length,25.0
Median length,8.0
Mean length,8.865341716
Min length,3.0

0,1
Total characters,174860
Distinct characters,57
Distinct categories,5
Distinct scripts,2
Distinct blocks,1

0,1
Unique,2152
Unique (%),10.9%

0,1
1st row,Bronx
2nd row,Grand Rapids
3rd row,Gansevoort
4th row,Grenada
5th row,El Paso

Value,Count,Frequency (%)
Miami,317,1.6%
Houston,282,1.4%
Atlanta,249,1.3%
Chicago,245,1.2%
Brooklyn,181,0.9%
New York,166,0.8%
Las Vegas,155,0.8%
Los Angeles,148,0.8%
Orlando,143,0.7%
Philadelphia,141,0.7%

Value,Count,Frequency (%)
city,393,1.6%
beach,375,1.5%
new,366,1.4%
miami,348,1.4%
fort,320,1.3%
san,313,1.2%
houston,283,1.1%
chicago,262,1.0%
atlanta,249,1.0%
saint,238,0.9%

Value,Count,Frequency (%)
a,15971,9.1%
e,15641,8.9%
o,13506,7.7%
n,12877,7.4%
l,12056,6.9%
i,11069,6.3%
r,10469,6.0%
t,9299,5.3%
s,7382,4.2%
,5541,3.2%

Value,Count,Frequency (%)
Lowercase Letter,144028,82.4%
Uppercase Letter,25283,14.5%
Space Separator,5541,3.2%
Decimal Number,7,< 0.1%
Other Punctuation,1,< 0.1%

Value,Count,Frequency (%)
C,2681,10.6%
S,2219,8.8%
B,2016,8.0%
M,2005,7.9%
L,1811,7.2%
P,1707,6.8%
H,1592,6.3%
A,1508,6.0%
W,1152,4.6%
R,1027,4.1%

Value,Count,Frequency (%)
a,15971,11.1%
e,15641,10.9%
o,13506,9.4%
n,12877,8.9%
l,12056,8.4%
i,11069,7.7%
r,10469,7.3%
t,9299,6.5%
s,7382,5.1%
d,4483,3.1%

Value,Count,Frequency (%)
1,3,42.9%
2,3,42.9%
3,1,14.3%

Value,Count,Frequency (%)
,5541,100.0%

Value,Count,Frequency (%)
.,1,100.0%

Value,Count,Frequency (%)
Latin,169311,96.8%
Common,5549,3.2%

Value,Count,Frequency (%)
a,15971,9.4%
e,15641,9.2%
o,13506,8.0%
n,12877,7.6%
l,12056,7.1%
i,11069,6.5%
r,10469,6.2%
t,9299,5.5%
s,7382,4.4%
d,4483,2.6%

Value,Count,Frequency (%)
,5541,99.9%
1,3,0.1%
2,3,0.1%
.,1,< 0.1%
3,1,< 0.1%

Value,Count,Frequency (%)
ASCII,174860,100.0%

Value,Count,Frequency (%)
a,15971,9.1%
e,15641,8.9%
o,13506,7.7%
n,12877,7.4%
l,12056,6.9%
i,11069,6.3%
r,10469,6.0%
t,9299,5.3%
s,7382,4.2%
,5541,3.2%

### Company

0,1
Distinct,19307
Distinct (%),97.9%
Missing,0
Missing (%),0.0%
Memory size,154.2 KiB

0,1
Nertelligence,5
dollarplusfoodmartllc,4
Dips Services,4
Direct Poke,4
Pretend IT,4
Other values (19302),19703

0,1
Max length,76.0
Median length,20.0
Mean length,20.68870412
Min length,2.0

0,1
Total characters,408064
Distinct characters,88
Distinct categories,13
Distinct scripts,2
Distinct blocks,2

0,1
Unique,18932
Unique (%),96.0%

0,1
1st row,Beastea LLC
2nd row,Ilia Transportation LLC
3rd row,Mallis Accessories LLC
4th row,Premir Beauty & stylez
5th row,Lik Yuh Finga Jamaican Restaurant

Value,Count,Frequency (%)
Nertelligence,5,< 0.1%
dollarplusfoodmartllc,4,< 0.1%
Dips Services,4,< 0.1%
Direct Poke,4,< 0.1%
Pretend IT,4,< 0.1%
Re-Nu Property Restoration,4,< 0.1%
Fannin's Home Improvements LLC,4,< 0.1%
Imperial Desert LLC,3,< 0.1%
Double R Trucking LLC,3,< 0.1%
Gamboatransport,3,< 0.1%

Value,Count,Frequency (%)
llc,7062,11.2%
inc,2057,3.3%
and,1078,1.7%
trucking,896,1.4%
,834,1.3%
services,813,1.3%
construction,635,1.0%
the,479,0.8%
transport,475,0.8%
group,390,0.6%

Value,Count,Frequency (%)
,43452,10.6%
e,29503,7.2%
n,25406,6.2%
i,23326,5.7%
a,23025,5.6%
r,22918,5.6%
o,21574,5.3%
t,19934,4.9%
s,19402,4.8%
l,16402,4.0%

Value,Count,Frequency (%)
Lowercase Letter,277469,68.0%
Uppercase Letter,80094,19.6%
Space Separator,43452,10.6%
Other Punctuation,5448,1.3%
Decimal Number,1259,0.3%
Dash Punctuation,249,0.1%
Other Symbol,25,< 0.1%
Close Punctuation,19,< 0.1%
Open Punctuation,18,< 0.1%
Math Symbol,17,< 0.1%

Value,Count,Frequency (%)
L,14214,17.7%
C,11443,14.3%
S,5626,7.0%
T,4873,6.1%
A,4408,5.5%
I,3899,4.9%
R,3574,4.5%
E,3508,4.4%
M,3228,4.0%
B,2945,3.7%

Value,Count,Frequency (%)
e,29503,10.6%
n,25406,9.2%
i,23326,8.4%
a,23025,8.3%
r,22918,8.3%
o,21574,7.8%
t,19934,7.2%
s,19402,7.0%
l,16402,5.9%
c,14808,5.3%

Value,Count,Frequency (%)
",",1552,28.5%
.,1471,27.0%
&,1412,25.9%
',895,16.4%
/,45,0.8%
@,34,0.6%
"""",22,0.4%
:,6,0.1%
#,5,0.1%
!,3,0.1%

Value,Count,Frequency (%)
1,280,22.2%
2,215,17.1%
3,157,12.5%
4,133,10.6%
0,126,10.0%
5,82,6.5%
7,74,5.9%
8,73,5.8%
9,62,4.9%
6,57,4.5%

Value,Count,Frequency (%)
+,13,76.5%
~,2,11.8%
<,1,5.9%
=,1,5.9%

Value,Count,Frequency (%)
(,17,94.4%
[,1,5.6%

Value,Count,Frequency (%)
),18,94.7%
],1,5.3%

Value,Count,Frequency (%)
,43452,100.0%

Value,Count,Frequency (%)
-,249,100.0%

Value,Count,Frequency (%)
�,25,100.0%

Value,Count,Frequency (%)
$,9,100.0%

Value,Count,Frequency (%)
_,4,100.0%

Value,Count,Frequency (%)
`,1,100.0%

Value,Count,Frequency (%)
Latin,357563,87.6%
Common,50501,12.4%

Value,Count,Frequency (%)
e,29503,8.3%
n,25406,7.1%
i,23326,6.5%
a,23025,6.4%
r,22918,6.4%
o,21574,6.0%
t,19934,5.6%
s,19402,5.4%
l,16402,4.6%
c,14808,4.1%

Value,Count,Frequency (%)
,43452,86.0%
",",1552,3.1%
.,1471,2.9%
&,1412,2.8%
',895,1.8%
1,280,0.6%
-,249,0.5%
2,215,0.4%
3,157,0.3%
4,133,0.3%

Value,Count,Frequency (%)
ASCII,408039,> 99.9%
Specials,25,< 0.1%

Value,Count,Frequency (%)
,43452,10.6%
e,29503,7.2%
n,25406,6.2%
i,23326,5.7%
a,23025,5.6%
r,22918,5.6%
o,21574,5.3%
t,19934,4.9%
s,19402,4.8%
l,16402,4.0%

Value,Count,Frequency (%)
�,25,100.0%

### PostalCode

0,1
Distinct,8912
Distinct (%),45.2%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,48825.96907

0,1
Minimum,1027
Maximum,99901
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,77.2 KiB

0,1
Minimum,1027.0
5-th percentile,8012.6
Q1,29306.0
median,40456.0
Q3,76051.0
95-th percentile,94102.0
Maximum,99901.0
Range,98874.0
Interquartile range (IQR),46745.0

0,1
Standard deviation,27905.78901
Coefficient of variation (CV),0.5715357942
Kurtosis,-1.219003809
Mean,48825.96907
Median Absolute Deviation (MAD),21107
Skewness,0.2343445352
Sum,963043414
Variance,778733060.3
Monotonicity,Not monotonic

Value,Count,Frequency (%)
30349,26,0.1%
30331,26,0.1%
30016,23,0.1%
30318,22,0.1%
30253,20,0.1%
30281,19,0.1%
30034,18,0.1%
30038,17,0.1%
34787,17,0.1%
33166,17,0.1%

Value,Count,Frequency (%)
1027,2,< 0.1%
1029,1,< 0.1%
1056,1,< 0.1%
1068,1,< 0.1%
1082,1,< 0.1%
1083,2,< 0.1%
1101,1,< 0.1%
1106,2,< 0.1%
1109,1,< 0.1%
1119,1,< 0.1%

Value,Count,Frequency (%)
99901,2,< 0.1%
99827,1,< 0.1%
99801,1,< 0.1%
99709,2,< 0.1%
99705,1,< 0.1%
99654,3,< 0.1%
99645,2,< 0.1%
99615,1,< 0.1%
99611,1,< 0.1%
99603,1,< 0.1%

### State

0,1
Distinct,48
Distinct (%),0.2%
Missing,58
Missing (%),0.3%
Memory size,154.2 KiB

0,1
FL,2211
TX,2012
CA,1727
GA,1356
NY,1149
Other values (43),11211

0,1
Max length,2
Median length,2
Mean length,2
Min length,2

0,1
Total characters,39332
Distinct characters,24
Distinct categories,1
Distinct scripts,1
Distinct blocks,1

0,1
Unique,1
Unique (%),< 0.1%

0,1
1st row,NY
2nd row,MI
3rd row,NY
4th row,MS
5th row,TX

Value,Count,Frequency (%)
FL,2211,11.2%
TX,2012,10.2%
CA,1727,8.8%
GA,1356,6.9%
NY,1149,5.8%
IL,825,4.2%
OH,719,3.6%
NC,715,3.6%
PA,625,3.2%
NJ,600,3.0%

Value,Count,Frequency (%)
fl,2211,11.2%
tx,2012,10.2%
ca,1727,8.8%
ga,1356,6.9%
ny,1149,5.8%
il,825,4.2%
oh,719,3.7%
nc,715,3.6%
pa,625,3.2%
nj,600,3.1%

Value,Count,Frequency (%)
A,6131,15.6%
L,3790,9.6%
N,3762,9.6%
C,3423,8.7%
T,2698,6.9%
F,2211,5.6%
M,2207,5.6%
I,2149,5.5%
X,2012,5.1%
O,1756,4.5%

Value,Count,Frequency (%)
Uppercase Letter,39332,100.0%

Value,Count,Frequency (%)
A,6131,15.6%
L,3790,9.6%
N,3762,9.6%
C,3423,8.7%
T,2698,6.9%
F,2211,5.6%
M,2207,5.6%
I,2149,5.5%
X,2012,5.1%
O,1756,4.5%

Value,Count,Frequency (%)
Latin,39332,100.0%

Value,Count,Frequency (%)
A,6131,15.6%
L,3790,9.6%
N,3762,9.6%
C,3423,8.7%
T,2698,6.9%
F,2211,5.6%
M,2207,5.6%
I,2149,5.5%
X,2012,5.1%
O,1756,4.5%

Value,Count,Frequency (%)
ASCII,39332,100.0%

Value,Count,Frequency (%)
A,6131,15.6%
L,3790,9.6%
N,3762,9.6%
C,3423,8.7%
T,2698,6.9%
F,2211,5.6%
M,2207,5.6%
I,2149,5.5%
X,2012,5.1%
O,1756,4.5%

Unnamed: 0,LS_Amount_to_Borrow,LS_Entity_Type,LSGAS,Web_Ad_Outlet__c,Web_Ad_Outlet_Source__c,Is_Submitted,LS_Self_Graded_Credit,City,Company,PostalCode,State
0,25000,Limited Liability Company,250000,1101,Lending Tree,0,1,Bronx,Beastea LLC,10473,NY
1,90000,Limited Liability Company,1000000,1101,Lending Tree,1,2,Grand Rapids,Ilia Transportation LLC,49503,MI
2,20000,Limited Liability Company,1000000,1101,Lending Tree,0,3,Gansevoort,Mallis Accessories LLC,12831,NY
3,100000,Limited Liability Company,200000,1101,Lending Tree,0,2,Grenada,Premir Beauty & stylez,38901,MS
4,150000,Limited Liability Company,459000,1101,Lending Tree,0,1,El Paso,Lik Yuh Finga Jamaican Restaurant,79928,TX
5,25000,S-Corporation,300000,1101,Lending Tree,0,4,Sacramento,Thai Street Bistro,95820,CA
6,250000,Limited Liability Company,800000,1101,Lending Tree,0,4,Midland,Renaissance Industrial LLC,79706,TX
7,10000,Corporation,100000,1101,Lending Tree,0,3,Clyde,Intethica inc,79510,MN
8,10000,Limited Liability Company,155000,1101,Lending Tree,0,1,Laurel,EAZY SHIPPING LLC,20707,MD
9,100000,S-Corporation,450000,1101,Lending Tree,0,1,Big Bear City,Big Bear Takeout & Delivery,92314,CA

Unnamed: 0,LS_Amount_to_Borrow,LS_Entity_Type,LSGAS,Web_Ad_Outlet__c,Web_Ad_Outlet_Source__c,Is_Submitted,LS_Self_Graded_Credit,City,Company,PostalCode,State
19714,20000,Limited Liability Company,150000,1101,Lending Tree,0,1,Mount Pleasant,m.evans distribution,29466,SC
19715,50000,S-Corporation,102000,1101,Lending Tree,0,2,Vail,Peter F Pattison Inc,81657,CO
19716,15000,Sole Proprietorship,150000,1101,Lending Tree,0,4,Muskegon,Blessed For All Seasons,49442,MI
19717,100000,Limited Liability Company,352565,1101,Lending Tree,0,1,Houston,Jimmy Mac metal's,77078,TX
19718,300000,S-Corporation,1000000,1101,Lending Tree,0,2,Gaffney,Superior Bag,29341,SC
19719,10000,S-Corporation,180000,1101,Lending Tree,0,2,Chicago,Italian Court Flowers,60647,IL
19720,25000,S-Corporation,1350000,1101,Lending Tree,0,3,Jackson,Us coating specialties and supplies,39213,MS
19721,65000,Corporation,175000,1101,Lending Tree,0,4,National City,ACRO SALES,91950,CA
19722,25000,Limited Liability Company,1000000,1101,Lending Tree,0,4,Westbury,Dakott LLC,11590,NY
19723,15000,Limited Liability Company,250536,1101,Lending Tree,0,3,Fort Lauderdale,Trap house chicken,33312,FL

Unnamed: 0,LS_Amount_to_Borrow,LS_Entity_Type,LSGAS,Web_Ad_Outlet__c,Web_Ad_Outlet_Source__c,Is_Submitted,LS_Self_Graded_Credit,City,Company,PostalCode,State,# duplicates
23,30000,Corporation,200000,1101,Lending Tree,0,3,Saint Petersburg,Nertelligence,33701,FL,5
13,20000,Corporation,150000,1101,Lending Tree,0,1,Denver,Pretend IT,80205,CO,4
19,25000,Limited Liability Company,500000,1101,Lending Tree,0,2,New Orleans,Direct Poke,70130,LA,4
28,50000,Corporation,275000,1101,Lending Tree,0,3,Pleasanton,Dips Services,94566,CA,4
18,25000,Limited Liability Company,100000,1101,Lending Tree,0,3,Minneapolis,House Homecare,55402,MN,3
21,25000,S-Corporation,200000,1101,Lending Tree,0,3,Dallas,Billie Manu,75202,TX,3
38,275000,Sole Proprietorship,106859,1101,Lending Tree,0,4,Weslaco,Gamboatransport,78599,TX,3
0,10000,Corporation,100000,1101,Lending Tree,0,2,Princeton,Electric Professionals,8542,NJ,2
1,10000,Limited Liability Company,200000,1101,Lending Tree,0,4,Columbia,live love and appreciate jd,29204,SC,2
2,10000,Limited Liability Company,500000,1101,Lending Tree,0,4,Cumming,Tounkdeliveryservice LLC,30040,GA,2


In [0]:
import os
import requests
import numpy as np
import pandas as pd

def create_tf_serving_json(data):
  return {'inputs': {name: data[name].tolist() for name in data.keys()} if isinstance(data, dict) else data.tolist()}

def score_model(dataset):
  url = 'https://adb-8618183333104940.0.azuredatabricks.net/model/BestmodelAuto/2/invocations'
  headers = {'Authorization': f'Bearer {os.environ.get("DATABRICKS_TOKEN")}'}
  data_json = dataset.to_dict(orient='split') if isinstance(dataset, pd.DataFrame) else create_tf_serving_json(dataset)
  response = requests.request(method='POST', headers=headers, url=url, json=data_json)
  if response.status_code != 200:
    raise Exception(f'Request failed with status {response.status_code}, {response.text}')
  return response.json()

In [0]:
import mlflow
import databricks.automl_runtime

# Use MLflow to track experiments
mlflow.set_experiment("/Users/rtavakoli@reliantfunding.com/databricks_automl/_lendingtreesample_automl_3-1")

target_col = "Is_Submitted"

In [0]:
from mlflow.tracking import MlflowClient
import os
import uuid
import shutil
import pandas as pd

# Create temp directory to download input data from MLflow
input_temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], str(uuid.uuid4())[:8])
os.makedirs(input_temp_dir)


# Download the artifact and read it into a pandas DataFrame
input_client = MlflowClient()
input_data_path = input_client.download_artifacts("a49ada83698c40c7bc1df218835d17bd", "data", input_temp_dir)

df_loaded = pd.read_parquet(os.path.join(input_data_path, "training_data"))
# Delete the temp data
shutil.rmtree(input_temp_dir)

# Preview data
df_loaded.head(5)

Unnamed: 0,LS_Amount_to_Borrow,LS_Entity_Type,LSGAS,Web_Ad_Outlet__c,Web_Ad_Outlet_Source__c,Is_Submitted,LS_Self_Graded_Credit,City,Company,PostalCode,State
0,25000,Limited Liability Company,250000,1101,Lending Tree,0,1,Bronx,Beastea LLC,10473,NY
1,90000,Limited Liability Company,1000000,1101,Lending Tree,1,2,Grand Rapids,Ilia Transportation LLC,49503,MI
2,20000,Limited Liability Company,1000000,1101,Lending Tree,0,3,Gansevoort,Mallis Accessories LLC,12831,NY
3,100000,Limited Liability Company,200000,1101,Lending Tree,0,2,Grenada,Premir Beauty & stylez,38901,MS
4,150000,Limited Liability Company,459000,1101,Lending Tree,0,1,El Paso,Lik Yuh Finga Jamaican Restaurant,79928,TX


In [0]:
from sklearn.model_selection import train_test_split

split_X = df_loaded.drop([target_col], axis=1)
split_y = df_loaded[target_col]

X_train, X_val, y_train, y_val = train_test_split(split_X, split_y, random_state=79393084, stratify=split_y)

In [0]:
from sklearn.model_selection import train_test_split

# Unstratified split used to score the served model below.
# Note: this reassigns X_train and y_train from the previous cell.
train, test = train_test_split(df_loaded, random_state=123)
X_train = train.drop([target_col], axis=1)
X_test = test.drop([target_col], axis=1)
y_train = train[target_col]
y_test = test[target_col]

In [0]:
import os
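# score_model below reads the access token from this environment variable.
# In shared notebooks, load it from a secret store rather than hardcoding it.
os.environ["DATABRICKS_TOKEN"] = "dapib6f8870f91937873710e749bc9facb1d-2"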

In [0]:
import os
import requests
import numpy as np
import pandas as pd

def create_tf_serving_json(data):
  return {'inputs': {name: data[name].tolist() for name in data.keys()} if isinstance(data, dict) else data.tolist()}

def score_model(dataset):
  url = 'https://adb-8618183333104940.0.azuredatabricks.net/model/BestmodelAuto/2/invocations'
  headers = {'Authorization': f'Bearer {os.environ.get("DATABRICKS_TOKEN")}'}
  data_json = dataset.to_dict(orient='split') if isinstance(dataset, pd.DataFrame) else create_tf_serving_json(dataset)
  response = requests.request(method='POST', headers=headers, url=url, json=data_json)
  if response.status_code != 200:
    raise Exception(f'Request failed with status {response.status_code}, {response.text}')
  return response.json()
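
To see what `score_model` sends to the endpoint, the payload construction can be run on its own without making a request. The sketch below uses a hypothetical two-row frame and hypothetical array inputs; it only mirrors the two branches of the `data_json` expression above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row input; column names mirror the dataset.
frame = pd.DataFrame({
    "LS_Amount_to_Borrow": [25000, 90000],
    "State": ["NY", "MI"],
})

# DataFrame input: pandas "split" orientation with index, columns and data keys.
split_payload = frame.to_dict(orient="split")

# Array input: the conversion create_tf_serving_json applies to an ndarray.
array_payload = {"inputs": np.array([[1.0, 2.0], [3.0, 4.0]]).tolist()}

# Dict-of-arrays input: each named array becomes a plain list.
named_inputs = {"x": np.array([1, 2])}
dict_payload = {"inputs": {name: named_inputs[name].tolist() for name in named_inputs}}

print(split_payload)
print(array_payload)
print(dict_payload)
```

Converting to plain lists matters because NumPy arrays are not JSON serializable, and `requests` serializes the `json=` argument with the standard encoder.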

In [0]:
num_predictions = 10
served_predictions = score_model(X_test[:num_predictions])
# model_evaluations = model.predict(X_test[:num_predictions])
# Show the served model's predictions. To compare them with the trained model,
# load the model locally and uncomment the two "model" lines.
pd.DataFrame({
#  "Model Prediction": model_evaluations,
  "Served Model Prediction": served_predictions,
})

Unnamed: 0,Served Model Prediction
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0
