# Semantic Types
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

Some string values can be recognized as semantic types. For example, email addresses, US zip codes, and IP addresses have specific formats that can be recognized and then split into their component parts.

When getting a DataProfile, you can optionally ask it to collect counts of values recognized as semantic types. [`Dataflow.get_profile()`](./data-profile.ipynb) executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile. To include semantic type counts in the data profile, call `get_profile` with the `include_stype_counts` argument set to `True`.

The `stype_counts` property of the DataProfile then contains an entry for each column in which at least one value was recognized as a semantic type.

In [1]:
import azureml.dataprep as dprep
dflow = dprep.read_json(path='../data/json.json')

profile = dflow.get_profile(include_stype_counts=True)

print("row count: " + str(profile.row_count))
profile.stype_counts

row count: 58


{'inspections.business.business_certificate': [STypeCountEntry(stype=<SType.ZIPCODE: 0>, count=1)],
 'inspections.business.business_id': [STypeCountEntry(stype=<SType.ZIPCODE: 0>, count=31)],
 'inspections.business.postal_code': [STypeCountEntry(stype=<SType.ZIPCODE: 0>, count=57)]}
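
To make the counts easier to compare, it can help to express each one as a fraction of the row count. The following is a minimal sketch in plain Python that uses only the numbers printed above; `match_fractions` is a hypothetical helper written for illustration and is not part of `azureml.dataprep`.

```python
# Counts copied from the profile output shown above (ZIPCODE matches per column).
row_count = 58
zip_counts = {
    "inspections.business.business_certificate": 1,
    "inspections.business.business_id": 31,
    "inspections.business.postal_code": 57,
}


def match_fractions(counts, total_rows):
    """Map each column name to the fraction of all rows that matched."""
    return {column: count / total_rows for column, count in counts.items()}


fractions = match_fractions(zip_counts, row_count)

# Print columns from the highest to the lowest match fraction.
for column, fraction in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{column}: {fraction:.1%}")
```

A high fraction suggests that the column really holds that semantic type, while a low one (such as the single match in `business_certificate`) is more likely a coincidence.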

To see all the supported semantic types, you can examine the `SType` enumeration. More types will be added over time.

In [2]:
[t.name for t in dprep.SType]


['ZIPCODE', 'IPADDRESS']

You can filter the semantic types found down to just those where all non-empty values matched. For each column where at least some matches were found, `DataProfile.stype_counts` gives a list of semantic type counts. Each list is in descending order of count, so here we consider only the first entry in each list, which is the type with the highest count of matching values.

In this example, every non-empty value in the column `inspections.business.postal_code` matches, so the column appears to hold US zip codes.

In [3]:
stypes_counts = profile.stype_counts
all_match = [
    (column, stypes_counts[column][0].stype)
    for column in stypes_counts
    if profile.row_count - profile.columns[column].empty_count == stypes_counts[column][0].count
]
all_match

[('inspections.business.postal_code', <SType.ZIPCODE: 0>)]
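
The same filter can be written as a small function, which makes the comparison explicit. This is a sketch on stand-in data rather than a real DataProfile: `CountEntry` is a hypothetical substitute for the library's `STypeCountEntry`, the semantic type is represented by a plain string, and the empty counts for the first two columns are assumed values (only the count of 1 for `postal_code` is implied by the result above).

```python
from collections import namedtuple

# Hypothetical stand-in for the library's STypeCountEntry, for illustration only.
CountEntry = namedtuple("CountEntry", ["stype", "count"])


def fully_matched(stype_counts, empty_counts, row_count):
    """Return (column, stype) pairs where every non-empty value matched the top stype."""
    result = []
    for column, entries in stype_counts.items():
        if not entries:
            continue
        top = entries[0]  # lists are in descending order of count
        non_empty = row_count - empty_counts.get(column, 0)
        if non_empty > 0 and top.count == non_empty:
            result.append((column, top.stype))
    return result


stype_counts = {
    "inspections.business.business_certificate": [CountEntry("ZIPCODE", 1)],
    "inspections.business.business_id": [CountEntry("ZIPCODE", 31)],
    "inspections.business.postal_code": [CountEntry("ZIPCODE", 57)],
}
empty_counts = {
    "inspections.business.business_certificate": 0,  # assumed value
    "inspections.business.business_id": 0,  # assumed value
    "inspections.business.postal_code": 1,  # 58 rows - 57 matches
}

all_match = fully_matched(stype_counts, empty_counts, row_count=58)
print(all_match)
```

The extra `non_empty > 0` check is a design choice of this sketch: it keeps a column that is entirely empty from being reported as a full match.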

You can use semantic types to compute new columns. The new columns contain the original values either split into their component elements or canonicalized.

Here we reduce the data to just the `inspections.business.postal_code` column, renamed to `postal`, so we can better see what a `split_stype` operation does.

In [4]:
dflow_postal = dflow.keep_columns(['inspections.business.postal_code']).rename_columns({'inspections.business.postal_code': 'postal'})
dflow_postal.head(5)

|   | postal |
|---|--------|
| 0 | 94114.0 |
| 1 |  |
| 2 | 94116.0 |
| 3 | 94116.0 |
| 4 | 94118.0 |


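With `SType.ZIPCODE`, values are split into their basic five-digit zip code and the plus-four add-on of the ZIP+4 format. Values that cannot be parsed, such as the empty value in row 1, produce error values in the new columns.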

In [5]:
dflow_split = dflow_postal.split_stype('postal', dprep.SType.ZIPCODE)
dflow_split.head(5)

|   | postal | postal_zip | postal_plus4 |
|---|--------|------------|--------------|
| 0 | 94114.0 | 94114 |  |
| 1 |  | azureml.dataprep.native.DataPrepError("'Micros... | azureml.dataprep.native.DataPrepError("'Micros... |
| 2 | 94116.0 | 94116 |  |
| 3 | 94116.0 | 94116 |  |
| 4 | 94118.0 | 94118 |  |
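
To illustrate what such a split means, here is a rough sketch in plain Python using a regular expression. It is not how `split_stype` is implemented; the pattern, the function name `split_zip`, and the handling of floats and missing values are all assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: five digits, optionally followed by a hyphen and four digits.
_ZIP_RE = re.compile(r"^(\d{5})(?:-(\d{4}))?$")


def split_zip(value) -> Optional[Tuple[str, Optional[str]]]:
    """Return (zip, plus4) for a recognizable US zip code, otherwise None."""
    if value is None:
        return None
    if isinstance(value, float):
        # NaN, infinity, and non-integral floats cannot be zip codes.
        if not value.is_integer():
            return None
        # Restore leading zeros lost when the code was read as a number.
        text = str(int(value)).zfill(5)
    else:
        text = str(value).strip()
    match = _ZIP_RE.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


for sample in [94114.0, None, "94116-1234", "not a zip"]:
    print(repr(sample), "->", split_zip(sample))
```

Unlike the output above, this sketch returns `None` for values it cannot parse instead of an error value, which is simply a choice made to keep the example short.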


`split_stype` also allows you to specify which fields of the semantic type to use and the names of the new columns. For example, if you only need the five-digit zip code without the plus-four add-on, you could use this.

In [6]:
dflow_no_plus4 = dflow_postal.split_stype('postal', dprep.SType.ZIPCODE, ['zip'], ['zipNoPlus4'])
dflow_no_plus4.head(5)

|   | postal | zipNoPlus4 |
|---|--------|------------|
| 0 | 94114.0 | 94114 |
| 1 |  | azureml.dataprep.native.DataPrepError("'Micros... |
| 2 | 94116.0 | 94116 |
| 3 | 94116.0 | 94116 |
| 4 | 94118.0 | 94118 |
