# Fuzzy Grouping
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.<br>

Unprepared data often represents the same entity with multiple values; examples include different spellings, varying capitalizations, and abbreviations. This is common when working with data gathered from multiple sources or through human input. One way to canonicalize and reconcile these variants is to use Data Prep's `fuzzy_group_column` (also known as "text clustering") functionality.

Data Prep inspects a column to determine clusters of similar values. A new column is added in which clustered values are replaced with the canonical value of their cluster, thus significantly reducing the number of distinct values. You can control the degree of similarity required for values to be clustered together, override the canonical form, and define clusters manually if automatic clustering does not provide the desired results.

Let's explore the capabilities of `fuzzy_group_column` by first reading in a dataset and inspecting it.

In [1]:
import azureml.dataprep as dprep

In [2]:
dflow = dprep.read_json(path='../data/json.json')
dflow.head(5)

Unnamed: 0,inspections.business.business_id,inspections.business.name,inspections.business.address,inspections.business.city,inspections.business.postal_code,inspections.business.latitude,inspections.business.longitude,inspections.business.phone_number,inspections.business.TaxCode,inspections.business.business_certificate,inspections.business.application_date,inspections.business.owner_name,inspections.business.owner_address,inspections.Score,inspections.date,inspections.type,inspections.violations
0,16162,Quick-N-Ezee Indian Foods,3861 24th St,SF,94114.0,,,,H34,467114.0,May 9 2005 12:00AM,Jagpreet Enterprises,23682 Clawiter Road\n Hayward\n CA\n 94545,100.0,20130223,Routine - Unscheduled,[]
1,69707,Little Green Cyclo 2,Off The Grid,,,,,,H79,453248.0,Jul 12 2012 12:00AM,LITTLEGREENCYCLO LLC,"100 Esplanade Ave., Apt. 99\n Pacifica\n CA\n ...",93.0,20130224,Routine - Unscheduled,"[{""description"":""103112: No hot water or runni..."
2,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,79.0,20130225,Routine - Unscheduled,"[{""description"":""103139: Improper food storage..."
3,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,,20130225,Complaint,"[{""description"":""103139: Improper food storage..."
4,68701,Grindz,832 Clement St,SF,94118.0,37.7828,-122.468,,H25,467498.0,Mar 16 2012 12:00AM,"Ono Grindz, LLC",1055 Granada St.\n Vallejo\n CA\n 94591,100.0,20130225,Routine - Unscheduled,[]


As you can see above, the column `inspections.business.city` contains several forms of the city name "San Francisco".
Let's add a column in which these values are replaced by the automatically detected canonical form. To do so, call `fuzzy_group_column()` on an existing Dataflow:

In [3]:
dflow_clean = dflow.fuzzy_group_column(source_column='inspections.business.city',
                                       new_column_name='city_grouped',
                                       similarity_threshold=0.8,
                                       similarity_score_column_name='similarity_score')
dflow_clean.head(5)

Unnamed: 0,inspections.business.business_id,inspections.business.name,inspections.business.address,inspections.business.city,city_grouped,similarity_score,inspections.business.postal_code,inspections.business.latitude,inspections.business.longitude,inspections.business.phone_number,inspections.business.TaxCode,inspections.business.business_certificate,inspections.business.application_date,inspections.business.owner_name,inspections.business.owner_address,inspections.Score,inspections.date,inspections.type,inspections.violations
0,16162,Quick-N-Ezee Indian Foods,3861 24th St,SF,San Francisco,0.814806,94114.0,,,,H34,467114.0,May 9 2005 12:00AM,Jagpreet Enterprises,23682 Clawiter Road\n Hayward\n CA\n 94545,100.0,20130223,Routine - Unscheduled,[]
1,69707,Little Green Cyclo 2,Off The Grid,,,,,,,,H79,453248.0,Jul 12 2012 12:00AM,LITTLEGREENCYCLO LLC,"100 Esplanade Ave., Apt. 99\n Pacifica\n CA\n ...",93.0,20130224,Routine - Unscheduled,"[{""description"":""103112: No hot water or runni..."
2,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,San Francisco,1.0,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,79.0,20130225,Routine - Unscheduled,"[{""description"":""103139: Improper food storage..."
3,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,San Francisco,1.0,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,,20130225,Complaint,"[{""description"":""103139: Improper food storage..."
4,68701,Grindz,832 Clement St,SF,San Francisco,0.814806,94118.0,37.7828,-122.468,,H25,467498.0,Mar 16 2012 12:00AM,"Ono Grindz, LLC",1055 Granada St.\n Vallejo\n CA\n 94591,100.0,20130225,Routine - Unscheduled,[]


The arguments `source_column` and `new_column_name` are required, whereas the others are optional.
If `similarity_threshold` is provided, it sets the minimum similarity required for values to be grouped together.
If `similarity_score_column_name` is provided, a second new column is added that shows the similarity score between each original value and its canonical value.
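To build intuition for what a similarity threshold does, here is a minimal sketch that groups strings using Python's standard `difflib`. It is only an illustration: the metric and the greedy grouping strategy are assumptions made for this example and are not the algorithm Data Prep uses (for instance, this stand-in metric would not score an abbreviation such as "SF" as close to "San Francisco"). The helper names `similarity` and `group_values` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1].

    Stand-in metric for illustration only; not the one Data Prep uses.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_values(values, threshold):
    """Greedy grouping: a value joins the first group whose canonical value
    is at least `threshold` similar; otherwise it starts a new group."""
    groups = {}  # canonical value -> list of member values
    for value in values:
        for canonical in groups:
            if similarity(value, canonical) >= threshold:
                groups[canonical].append(value)
                break
        else:
            # No existing group was similar enough.
            groups[value] = [value]
    return groups


cities = ['San Francisco', 'SAN FRANCISCO', 'San Fransisco', 'Oakland']
loose = group_values(cities, threshold=0.8)    # misspelling joins the group
strict = group_values(cities, threshold=0.95)  # misspelling gets its own group
print(loose)
print(strict)
```

Raising the threshold makes grouping stricter, so fewer values are merged and more distinct groups remain.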

In the resulting data set, you can see that all the different variations of representing "San Francisco" in the data were normalized to the same string, "San Francisco".
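A quick way to confirm the reduction in distinct values is to count them before and after grouping. The sketch below uses plain Python on the city values copied from the five rows shown above (with `None` standing in for the empty cell); the helper name `distinct_counts` is hypothetical, and on a real Dataflow you would run the same kind of check on the full column.

```python
from collections import Counter

# City values from the five rows displayed above (None = missing value).
original = ['SF', None, 'SAN FRANCISCO', 'SAN FRANCISCO', 'SF']
grouped = ['San Francisco', None, 'San Francisco', 'San Francisco', 'San Francisco']


def distinct_counts(values):
    """Count occurrences of each non-missing value."""
    return Counter(v for v in values if v is not None)


before = distinct_counts(original)
after = distinct_counts(grouped)
print(len(before), 'distinct values before:', dict(before))
print(len(after), 'distinct values after:', dict(after))
```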

But what if you want more control over what gets grouped, what doesn't, and what the canonical value should be? 

For that, use the `FuzzyGroupBuilder` class, which you obtain by calling `fuzzy_group_column()` on `dflow.builders`.
Let's see what it has to offer:

In [4]:
builder = dflow.builders.fuzzy_group_column(source_column='inspections.business.city',
                                            new_column_name='city_grouped',
                                            similarity_threshold=0.8,
                                            similarity_score_column_name='similarity_score')

In [5]:
# calling learn() to get fuzzy groups
builder.learn()
builder.groups

[{'canonicalValue': 'San Francisco',
  'duplicates': [{'duplicateValue': 'San Francisco',
    'similarityScore': 1.0,
    'useForReplacement': True},
   {'duplicateValue': 'SAN FRANCISCO',
    'similarityScore': 1.0,
    'useForReplacement': True},
   {'duplicateValue': 'SF',
    'similarityScore': 0.8148061037063599,
    'useForReplacement': True},
   {'duplicateValue': 'S.F.',
    'similarityScore': 0.8148061037063599,
    'useForReplacement': True}]}]

Here you can see that `fuzzy_group_column` detected one group with four values that all map to "San Francisco" as the canonical value.
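The nested structure returned by `builder.groups` can be easier to review as a flat lookup from each original value to its canonical value. The following sketch does that with plain Python on a literal copy of the output above. The helper name `to_mapping` is hypothetical, and treating `useForReplacement` as a per-value on/off switch is an assumption based on the field name.

```python
# Literal copy of the groups shown above.
groups = [{'canonicalValue': 'San Francisco',
           'duplicates': [
               {'duplicateValue': 'San Francisco',
                'similarityScore': 1.0, 'useForReplacement': True},
               {'duplicateValue': 'SAN FRANCISCO',
                'similarityScore': 1.0, 'useForReplacement': True},
               {'duplicateValue': 'SF',
                'similarityScore': 0.8148061037063599, 'useForReplacement': True},
               {'duplicateValue': 'S.F.',
                'similarityScore': 0.8148061037063599, 'useForReplacement': True}]}]


def to_mapping(groups):
    """Flatten groups into {original value: canonical value}.

    Assumption: entries with useForReplacement == False are not replaced,
    so they are left out of the mapping.
    """
    mapping = {}
    for group in groups:
        for dup in group['duplicates']:
            if dup['useForReplacement']:
                mapping[dup['duplicateValue']] = group['canonicalValue']
    return mapping


mapping = to_mapping(groups)
for original, canonical in mapping.items():
    print(f'{original!r} -> {canonical!r}')
```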
You can see the effects of changing the similarity threshold next:

In [6]:
builder.similarity_threshold = 0.9
builder.learn()
builder.groups

[{'canonicalValue': 'SF',
  'duplicates': [{'duplicateValue': 'SF',
    'similarityScore': 1.0,
    'useForReplacement': True},
   {'duplicateValue': 'S.F.',
    'similarityScore': 0.9523809552192688,
    'useForReplacement': True}]},
 {'canonicalValue': 'San Francisco',
  'duplicates': [{'duplicateValue': 'San Francisco',
    'similarityScore': 1.0,
    'useForReplacement': True},
   {'duplicateValue': 'SAN FRANCISCO',
    'similarityScore': 1.0,
    'useForReplacement': True}]}]

With a similarity threshold of `0.9`, the values split into two distinct groups: the abbreviations ("SF" and "S.F.") and the full spellings ("San Francisco" and "SAN FRANCISCO").

Let's tweak some of the detected groups before completing the builder and getting back the Dataflow with the resulting fuzzy grouped column.

In [7]:
builder.similarity_threshold = 0.8
builder.learn()
groups = builder.groups
groups

[{'canonicalValue': 'San Francisco',
  'duplicates': [{'duplicateValue': 'San Francisco',
    'similarityScore': 1.0,
    'useForReplacement': True},
   {'duplicateValue': 'SAN FRANCISCO',
    'similarityScore': 1.0,
    'useForReplacement': True},
   {'duplicateValue': 'SF',
    'similarityScore': 0.8148061037063599,
    'useForReplacement': True},
   {'duplicateValue': 'S.F.',
    'similarityScore': 0.8148061037063599,
    'useForReplacement': True}]}]

In [8]:
# change the canonical value for the first group
groups[0]['canonicalValue'] = 'SANFRAN'
duplicates = groups[0]['duplicates']
# remove the last duplicate value from the cluster
duplicates = duplicates[:-1]
# assign modified duplicate array back
groups[0]['duplicates'] = duplicates
# assign modified groups back to builder
builder.groups = groups
builder.groups

[{'canonicalValue': 'SANFRAN',
  'duplicates': [{'duplicateValue': 'San Francisco',
    'similarityScore': 1.0,
    'useForReplacement': True},
   {'duplicateValue': 'SAN FRANCISCO',
    'similarityScore': 1.0,
    'useForReplacement': True},
   {'duplicateValue': 'SF',
    'similarityScore': 0.8148061037063599,
    'useForReplacement': True}]}]

Here, the canonical value of the single fuzzy group was changed to 'SANFRAN', and 'S.F.' was removed from the group's list of duplicates.

You can mutate the copy of the `groups` list that the builder returns, as long as you keep the structure of the objects inside the list intact. Once the list contains the desired groups, assign it back to the builder.
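Slicing with `duplicates[:-1]` depends on the order in which the duplicates happen to be listed. As a hedged alternative, the sketch below removes a duplicate by value and works on a deep copy so that the input list is left untouched. The helper name `drop_duplicate` is hypothetical and not part of Data Prep; it only relies on the dictionary structure shown in the outputs above.

```python
import copy


def drop_duplicate(groups, canonical, value):
    """Return a deep copy of `groups` in which `value` is removed from the
    duplicates of the group whose canonicalValue equals `canonical`."""
    edited = copy.deepcopy(groups)
    for group in edited:
        if group['canonicalValue'] == canonical:
            group['duplicates'] = [d for d in group['duplicates']
                                   if d['duplicateValue'] != value]
    return edited


# Shortened literal copy of the structure shown above.
groups = [{'canonicalValue': 'San Francisco',
           'duplicates': [
               {'duplicateValue': 'San Francisco',
                'similarityScore': 1.0, 'useForReplacement': True},
               {'duplicateValue': 'SF',
                'similarityScore': 0.8148061037063599, 'useForReplacement': True},
               {'duplicateValue': 'S.F.',
                'similarityScore': 0.8148061037063599, 'useForReplacement': True}]}]

edited = drop_duplicate(groups, 'San Francisco', 'S.F.')
remaining = [d['duplicateValue'] for d in edited[0]['duplicates']]
print(remaining)

# With a real builder, the result would be assigned back as in the cell above:
# builder.groups = drop_duplicate(builder.groups, 'San Francisco', 'S.F.')
```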

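Now you can get a Dataflow with the FuzzyGroup step in it. Because 'S.F.' was removed from the group, rows with that value keep their original city and have no similarity score (see row 5 in the output).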

In [9]:
dflow_clean = builder.to_dataflow()

df = dflow_clean.to_pandas_dataframe()
df

Unnamed: 0,inspections.business.business_id,inspections.business.name,inspections.business.address,inspections.business.city,city_grouped,similarity_score,inspections.business.postal_code,inspections.business.latitude,inspections.business.longitude,inspections.business.phone_number,inspections.business.TaxCode,inspections.business.business_certificate,inspections.business.application_date,inspections.business.owner_name,inspections.business.owner_address,inspections.Score,inspections.date,inspections.type,inspections.violations
0,16162,Quick-N-Ezee Indian Foods,3861 24th St,SF,SANFRAN,0.814806,94114.0,,,,H34,467114.0,May 9 2005 12:00AM,Jagpreet Enterprises,23682 Clawiter Road\n Hayward\n CA\n 94545,100.0,20130223,Routine - Unscheduled,[]
1,69707,Little Green Cyclo 2,Off The Grid,,,,,,,,H79,453248.0,Jul 12 2012 12:00AM,LITTLEGREENCYCLO LLC,"100 Esplanade Ave., Apt. 99\n Pacifica\n CA\n ...",93.0,20130224,Routine - Unscheduled,"[{""description"":""103112: No hot water or runni..."
2,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,SANFRAN,1.0,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,79.0,20130225,Routine - Unscheduled,"[{""description"":""103139: Improper food storage..."
3,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,SANFRAN,1.0,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,,20130225,Complaint,"[{""description"":""103139: Improper food storage..."
4,68701,Grindz,832 Clement St,SF,SANFRAN,0.814806,94118.0,37.7828,-122.468,,H25,467498.0,Mar 16 2012 12:00AM,"Ono Grindz, LLC",1055 Granada St.\n Vallejo\n CA\n 94591,100.0,20130225,Routine - Unscheduled,[]
5,69186,"Premier Catering & Events, Inc.",1255 22nd St,S.F.,S.F.,,94107.0,,,14155530288.0,H30,362812.0,Apr 30 2012 12:00AM,"Premier Catering & Events, Inc.",298 Magellan Ave.\n SF\n CA\n 94116,,20130225,Reinspection/Followup,[]
6,2689,THE BLUE PLATE,3218 MISSION St,SF,SANFRAN,0.814806,94110.0,37.7452,-122.42,14155286777.0,H25,325714.0,,BLUE ENCLAVE LLC,3218 MISSION ST.\n SAN FRANCISCO\n CA\n 94110,98.0,20130225,Routine - Unscheduled,"[{""description"":""103143: Inadequate warewashin..."
7,15806,Vital Tea Leaf,1044 Grant Ave,San Francisco,SANFRAN,1.0,94133.0,37.7966,-122.407,,H24,388301.0,May 23 2005 12:00AM,Minh H. Duong,1044 Grant Ave\n San Francisco\n CA\n 94133,98.0,20130225,Routine - Unscheduled,"[{""description"":""103157: Food safety certifica..."
8,21807,The Front Porch,65 29th St A,SF,SANFRAN,0.814806,94110.0,37.7439,-122.422,,H25,398500.0,Jun 7 2006 12:00AM,Front Porch Restaurant LLC,65A 29th Street\n SF\n CA\n 94110,,20130225,Reinspection/Followup,[]
9,69041,Washington Cafe,826 Washington St,San Francisco,SANFRAN,1.0,94108.0,37.7951,-122.407,,H26,468548.0,Apr 18 2012 12:00AM,"Washington Café, Inc. / Louis Kuang",333 Third Avenue\n Daly City\n CA\n 94014,65.0,20130225,Routine - Unscheduled,"[{""description"":""103120: Moderate risk food ho..."
