# Data Profile
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

A DataProfile collects summary statistics on each column of the data produced by a Dataflow. You can use a profile to:
- Understand the input data.
- Determine which columns might need further preparation.
- Verify that data preparation operations produced the desired result.

`Dataflow.get_profile()` executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile.

In [1]:
import azureml.dataprep as dprep

dflow = dprep.auto_read_file('../data/crime-spring.csv')

profile = dflow.get_profile()
profile

Column,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
ID,FieldType.DECIMAL,1.04986e+07,1.05351e+07,10.0,0.0,10.0,0.0,0.0,0.0,10498600.0,10499200.0,10498600.0,10516600.0,10520900.0,10525900.0,10535100.0,10535100.0,10535100.0,10519500.0,12302.7,151358000.0,-0.495701,-1.02814
Case Number,FieldType.STRING,HZ239907,HZ278872,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Date,FieldType.DATE,{'timestamp': 1460694600000},{'timestamp': 1460764560000},10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Block,FieldType.STRING,004XX S KILBOURN AVE,113XX S PRAIRIE AVE,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
IUCR,FieldType.DECIMAL,810,1154,10.0,0.0,10.0,0.0,0.0,0.0,810.0,850.0,810.0,890.0,1136.0,1153.0,1154.0,1154.0,1154.0,1058.5,137.285,18847.2,-0.785501,-1.3543
Primary Type,FieldType.STRING,DECEPTIVE PRACTICE,THEFT,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Description,FieldType.STRING,BOGUS CHECK,OVER $500,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Location Description,FieldType.STRING,,"SCHOOL, PUBLIC, BUILDING",10.0,0.0,10.0,0.0,0.0,1.0,,,,,,,,,,,,,,
Arrest,FieldType.BOOLEAN,False,False,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Domestic,FieldType.BOOLEAN,False,False,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
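
Several of the statistics in this output are related to one another, which gives a quick way to sanity-check a profile. The sketch below is illustrative only: it copies the rounded `ID` and `IUCR` values from the table above into plain dictionaries (the dictionary keys and the `is_consistent` helper are made up for this example and are not part of `azureml.dataprep`) and checks that Count equals Missing Count plus Not Missing Count and that Standard Deviation squared agrees with Variance up to display rounding.

```python
import math

# Rounded values copied from the ID and IUCR rows of the profile output above.
# The dictionary keys are illustrative names, not library attributes.
rows = {
    "ID": {"count": 10.0, "missing": 0.0, "not_missing": 10.0,
           "std": 12302.7, "variance": 151358000.0},
    "IUCR": {"count": 10.0, "missing": 0.0, "not_missing": 10.0,
             "std": 137.285, "variance": 18847.2},
}

def is_consistent(stats, rel_tol=1e-4):
    """Check count bookkeeping and that std ** 2 matches variance within rounding."""
    counts_ok = stats["count"] == stats["missing"] + stats["not_missing"]
    variance_ok = math.isclose(stats["std"] ** 2, stats["variance"], rel_tol=rel_tol)
    return counts_ok and variance_ok

checks = {name: is_consistent(stats) for name, stats in rows.items()}
print(checks)
```

The tolerance is loose on purpose, because the table shows values rounded to about six significant digits.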


A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics. For non-numeric columns, profiles include only basic statistics such as min, max, and error count; the remaining fields are left empty, as in the output above. For numeric columns, profiles also include statistical moments and estimated quantiles.

In [2]:
profile.columns['Beat']

Statistic,Value
Type,FieldType.DECIMAL
Min,531
Max,2433
Count,10
Missing Count,0
Not Missing Count,10
Percent missing,0
Error Count,0
Empty count,0
0.1% Quantile,531


You can also extract and filter data from profiles by using list and dict comprehensions.

In [3]:
variances = [c.variance for c in profile.columns.values() if c.variance]
variances

[151357527.12222305,
 18847.16666666667,
 478994.10000000003,
 48.27777777777778,
 264.49999999999994,
 709.5111111111112,
 5.6000000000000005,
 116500361.33333333,
 1592425833.3333333,
 0.01203298507795059,
 0.0014919951954574125]
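
Note that a truthiness test such as `if c.variance` drops a variance of exactly `0.0` as well as a missing one, because both are falsy in Python. The sketch below shows the difference using a hypothetical `FakeColumnProfile` stand-in rather than real ColumnProfile objects. It assumes that a missing variance is represented as `None`, and the zero variance for `Year` is a made-up value for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a ColumnProfile; only `name` and `variance` are modelled.
FakeColumnProfile = namedtuple("FakeColumnProfile", ["name", "variance"])

columns = {
    "Primary Type": FakeColumnProfile("Primary Type", None),  # assumed: no variance -> None
    "Year": FakeColumnProfile("Year", 0.0),                   # made-up constant column
    "IUCR": FakeColumnProfile("IUCR", 18847.16666666667),
}

# Truthiness drops both None and 0.0.
truthy = [c.name for c in columns.values() if c.variance]

# An explicit None check keeps the zero-variance column.
explicit = [c.name for c in columns.values() if c.variance is not None]

print(truthy)    # ['IUCR']
print(explicit)  # ['Year', 'IUCR']
```

Which form is appropriate depends on whether constant columns should appear in the result.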

In [4]:
column_types = {c.name: c.type for c in profile.columns.values()}
column_types

{'Arrest': <FieldType.BOOLEAN: 1>,
 'Beat': <FieldType.DECIMAL: 3>,
 'Block': <FieldType.STRING: 0>,
 'Case Number': <FieldType.STRING: 0>,
 'Community Area': <FieldType.DECIMAL: 3>,
 'Date': <FieldType.DATE: 4>,
 'Description': <FieldType.STRING: 0>,
 'District': <FieldType.DECIMAL: 3>,
 'Domestic': <FieldType.BOOLEAN: 1>,
 'FBI Code': <FieldType.DECIMAL: 3>,
 'ID': <FieldType.DECIMAL: 3>,
 'IUCR': <FieldType.DECIMAL: 3>,
 'Latitude': <FieldType.DECIMAL: 3>,
 'Location': <FieldType.STRING: 0>,
 'Location Description': <FieldType.STRING: 0>,
 'Longitude': <FieldType.DECIMAL: 3>,
 'Primary Type': <FieldType.STRING: 0>,
 'Updated On': <FieldType.DATE: 4>,
 'Ward': <FieldType.DECIMAL: 3>,
 'X Coordinate': <FieldType.DECIMAL: 3>,
 'Y Coordinate': <FieldType.DECIMAL: 3>,
 'Year': <FieldType.DECIMAL: 3>}
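
A mapping like this is easy to invert when you want to see which columns share a type, for example to pick out the columns that still need conversion. The following is a minimal sketch that does not call the library: the `FieldType` enum here is a local stand-in whose member values are copied from the output above, and only a few of the columns are included.

```python
from collections import defaultdict
from enum import Enum

class FieldType(Enum):
    """Local stand-in; member values copied from the output above."""
    STRING = 0
    BOOLEAN = 1
    DECIMAL = 3
    DATE = 4

# A subset of the name-to-type mapping shown above.
column_types = {
    "Arrest": FieldType.BOOLEAN,
    "Beat": FieldType.DECIMAL,
    "Block": FieldType.STRING,
    "Date": FieldType.DATE,
    "Domestic": FieldType.BOOLEAN,
    "Ward": FieldType.DECIMAL,
}

# Invert the mapping: type -> list of column names.
columns_by_type = defaultdict(list)
for name, field_type in column_types.items():
    columns_by_type[field_type].append(name)

for field_type, names in columns_by_type.items():
    print(field_type.name, names)
```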

If a column has fewer than a thousand unique values, its ColumnProfile contains a summary of values with their respective counts.

In [5]:
profile.columns['Primary Type'].value_counts

[ValueCountEntry(value='DECEPTIVE PRACTICE', count=7.0),
 ValueCountEntry(value='THEFT', count=3.0)]
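
Value counts are often easier to read as shares of the total. The sketch below rebuilds the two entries shown above with a local `ValueCountEntry` namedtuple (a stand-in that mirrors the `value` and `count` fields in the output, not the library class) and converts them to proportions.

```python
from collections import namedtuple

# Local stand-in mirroring the fields shown in the output above.
ValueCountEntry = namedtuple("ValueCountEntry", ["value", "count"])

value_counts = [
    ValueCountEntry(value="DECEPTIVE PRACTICE", count=7.0),
    ValueCountEntry(value="THEFT", count=3.0),
]

total = sum(entry.count for entry in value_counts)

# Share of rows taken by each distinct value.
shares = {entry.value: entry.count / total for entry in value_counts}

# The most frequent value in the column.
most_common = max(value_counts, key=lambda entry: entry.count).value

print(shares)
print(most_common)
```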

Numeric ColumnProfiles include an estimated histogram of the data.

In [6]:
profile.columns['District'].histogram

[HistogramBucket(lower_bound=5.0, upper_bound=6.9, count=2.6800000000000006),
 HistogramBucket(lower_bound=6.9, upper_bound=8.8, count=0.37999999999999945),
 HistogramBucket(lower_bound=8.8, upper_bound=10.7, count=0.379999999999999),
 HistogramBucket(lower_bound=10.7, upper_bound=12.6, count=1.3600000000000008),
 HistogramBucket(lower_bound=12.6, upper_bound=14.5, count=0.8666666666666671),
 HistogramBucket(lower_bound=14.5, upper_bound=16.4, count=0.6333333333333329),
 HistogramBucket(lower_bound=16.4, upper_bound=18.299999999999997, count=0.8499999999999988),
 HistogramBucket(lower_bound=18.299999999999997, upper_bound=20.2, count=0.7500000000000009),
 HistogramBucket(lower_bound=20.2, upper_bound=22.099999999999998, count=0.6500000000000012),
 HistogramBucket(lower_bound=22.099999999999998, upper_bound=24.0, count=1.4499999999999993)]
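
The bucket counts are fractional because the histogram is estimated, but they still add up to 10, the row count reported for the columns shown earlier. The sketch below checks that sum and prints a rough text chart. It uses a local `HistogramBucket` namedtuple as a stand-in for the library class, with bounds and counts copied from the output above and rounded for readability.

```python
from collections import namedtuple

# Local stand-in mirroring the fields shown in the output above.
HistogramBucket = namedtuple("HistogramBucket", ["lower_bound", "upper_bound", "count"])

# Bounds and counts copied from the District histogram, rounded.
buckets = [
    HistogramBucket(5.0, 6.9, 2.68),
    HistogramBucket(6.9, 8.8, 0.38),
    HistogramBucket(8.8, 10.7, 0.38),
    HistogramBucket(10.7, 12.6, 1.36),
    HistogramBucket(12.6, 14.5, 0.8667),
    HistogramBucket(14.5, 16.4, 0.6333),
    HistogramBucket(16.4, 18.3, 0.85),
    HistogramBucket(18.3, 20.2, 0.75),
    HistogramBucket(20.2, 22.1, 0.65),
    HistogramBucket(22.1, 24.0, 1.45),
]

total = sum(bucket.count for bucket in buckets)
print(f"estimated total: {total:.2f}")

# One line per bucket, with a bar scaled to the bucket's share of the total.
for bucket in buckets:
    bar = "#" * round(bucket.count / total * 40)
    print(f"{bucket.lower_bound:5.1f} to {bucket.upper_bound:5.1f}  {bucket.count:5.2f}  {bar}")
```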

For columns containing data of mixed types, the ColumnProfile also provides counts of each type.

In [7]:
profile.columns['X Coordinate'].type_counts

[TypeCountEntry(type=<FieldType.NULL: 7>, count=7.0),
 TypeCountEntry(type=<FieldType.DECIMAL: 3>, count=3.0)]
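
Type counts make it straightforward to measure how much of a column is null. The sketch below is a minimal illustration that does not call the library: `FieldType` and `TypeCountEntry` are local stand-ins whose fields and enum values are copied from the output above.

```python
from collections import namedtuple
from enum import Enum

class FieldType(Enum):
    """Local stand-in; member values copied from the output above."""
    DECIMAL = 3
    NULL = 7

# Local stand-in mirroring the fields shown in the output above.
TypeCountEntry = namedtuple("TypeCountEntry", ["type", "count"])

type_counts = [
    TypeCountEntry(type=FieldType.NULL, count=7.0),
    TypeCountEntry(type=FieldType.DECIMAL, count=3.0),
]

total = sum(entry.count for entry in type_counts)

# Fraction of values in the column that are null.
null_share = sum(e.count for e in type_counts if e.type is FieldType.NULL) / total

# A column has mixed types when more than one type is present.
is_mixed = len({entry.type for entry in type_counts}) > 1

print(null_share, is_mixed)
```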