# Chicago food inspections project

## Exploratory data analysis

### Prepare data science tools

In [1]:
import numpy as np
import pandas as pd

In [2]:
import dask.dataframe as dd

In [3]:
import pandas_profiling

In [4]:
import plotly.express as px

In [41]:
import pandas_dedupe

### Read data

We first read the Food Inspections dataset (https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5/data). Any supplementary data (e.g. weather conditions or socio-economic data) will be read later on.

In [5]:
df = dd.read_csv('Food_Inspections_general_201911.csv', 
                 dtype={'License #': 'Int64',
                        'Zip': 'Int64'})

# Plain integer columns in pandas cannot hold NaN (a well-known "gotcha"),
# but the nullable Int64 dtype (Int64Dtype) fixes that
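
The effect of the nullable integer dtype can be illustrated on a tiny in-memory CSV. This is a minimal sketch that uses plain pandas instead of dask and two made-up rows; only the column names come from the dataset.

```python
import io

import pandas as pd

# Two made-up rows; the second one has no license number and no ZIP code.
csv_text = "License #,Zip\n2163956,60657\n,\n"

# Default parsing falls back to float64 so that NaN can represent the gap.
as_float = pd.read_csv(io.StringIO(csv_text))

# The nullable 'Int64' dtype keeps the integers and stores the gap as <NA>.
as_int = pd.read_csv(io.StringIO(csv_text),
                     dtype={'License #': 'Int64', 'Zip': 'Int64'})

print(as_float.dtypes)
print(as_int.dtypes)
print(as_int['License #'].isna().sum())
```

With the default dtype the license numbers would be shown as floats (for example `2163956.0`), which is awkward for identifiers.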

### List of columns

In [6]:
df.columns

Index(['Inspection ID', 'DBA Name', 'AKA Name', 'License #', 'Facility Type',
       'Risk', 'Address', 'City', 'State', 'Zip', 'Inspection Date',
       'Inspection Type', 'Results', 'Violations', 'Latitude', 'Longitude',
       'Location'],
      dtype='object')

### View first rows

First glance at the data. Surprisingly, most of the data is not anonymized, so it is possible to identify the exact establishment that was inspected.

In [7]:
df.head(5)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
0,2345699,JET'S PIZZA,JET'S PIZZA,2163956,Restaurant,Risk 2 (Medium),2811 N ASHLAND AVE,CHICAGO,IL,60657,11/15/2019,Canvass,Pass w/ Conditions,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.932766,-87.668262,"(-87.66826200882875, 41.93276573571165)"
1,2345619,JAIPUR,JAIPUR,2694084,Restaurant,Risk 1 (High),738 W RANDOLPH ST,CHICAGO,IL,60661,11/14/2019,Canvass,Pass,,41.884518,-87.647304,"(-87.64730383120978, 41.88451799637232)"
2,2345616,VIP FIT CLUB,VIP FIT CLUB,2446547,Restaurant,Risk 2 (Medium),3426 W DIVERSEY AVE,CHICAGO,IL,60647,11/14/2019,Complaint,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.932069,-87.713294,"(-87.71329441237397, 41.932068626464286)"
3,2345602,VELVET TACO,VELVET TACO,2652941,Restaurant,Risk 3 (Low),2309 N LINCOLN AVE,CHICAGO,IL,60614,11/14/2019,License,Fail,,41.923953,-87.646462,"(-87.6464616293504, 41.9239529807269)"
4,2345603,VELVET TACO,VELVET TACO,2652943,Restaurant,Risk 3 (Low),2309 N LINCOLN AVE,CHICAGO,IL,60614,11/14/2019,License,Fail,,41.923953,-87.646462,"(-87.6464616293504, 41.9239529807269)"


Some of the data is unstructured, especially in text columns such as 'Violations', which leaves room for NLP tools to extract additional information from the text. The data also has an inspection date column, which makes time-series analysis possible.
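
The inspection date is read as text, so a time-series analysis would first need it parsed. The sketch below shows one possible way on a made-up three-row sample; the `MM/DD/YYYY` format is assumed from the rows displayed above.

```python
import pandas as pd

# Made-up sample in the same MM/DD/YYYY format as the 'Inspection Date' column.
sample = pd.DataFrame({
    'Inspection Date': ['11/15/2019', '11/14/2019', '10/09/2019'],
    'Results': ['Pass w/ Conditions', 'Fail', 'Pass'],
})

# Parse the strings into datetimes with an explicit format.
sample['Inspection Date'] = pd.to_datetime(sample['Inspection Date'],
                                           format='%m/%d/%Y')

# Count inspections per calendar month.
per_month = sample.groupby(sample['Inspection Date'].dt.to_period('M')).size()
print(per_month)
```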

### Get data types of columns

In [8]:
df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 17 entries, Inspection ID to Location
dtypes: Int64(2), object(12), float64(2), int64(1)

Many columns are text (represented by the 'object' dtype).

In [9]:
df.dtypes

Inspection ID        int64
DBA Name            object
AKA Name            object
License #            Int64
Facility Type       object
Risk                object
Address             object
City                object
State               object
Zip                  Int64
Inspection Date     object
Inspection Type     object
Results             object
Violations          object
Latitude           float64
Longitude          float64
Location            object
dtype: object

### Number of rows

In [10]:
f'There are {len(df):,} rows in dataset and {len(df.columns)} columns'

'There are 195,979 rows in dataset and 17 columns'

### Basic reconnaissance analysis

#### Use random sample

In [11]:
df_5p = df.sample(frac=0.05).compute()
f'Sample size is {len(df_5p)}'

'Sample size is 9799'
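
The sample above is drawn without a fixed seed, so each run profiles different rows. A minimal sketch of a reproducible sample, shown with plain pandas on a made-up frame; the seed value 42 is an arbitrary assumption.

```python
import pandas as pd

# Made-up frame standing in for the inspections data.
frame = pd.DataFrame({'Inspection ID': range(1000)})

# Passing the same random_state yields the same rows on every run.
first = frame.sample(frac=0.05, random_state=42)
second = frame.sample(frac=0.05, random_state=42)

print(len(first))
print(first.index.equals(second.index))
```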

In [None]:
df_5p.profile_report(minify_html=True, use_local_assets=True)

### Basic cleaning and checks

#### License ID

Q: Does a license ID uniquely identify an establishment?

In [12]:
# df.groupby(['License #'])
df.groupby('License #')['DBA Name'].nunique().nlargest(10).compute()

License #
0          229
14616        7
1354323      5
1514802      4
1196         3
1806         3
1932         3
12141        3
17464        3
21943        3
Name: DBA Name, dtype: int64

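A: Mostly yes, but restaurant names require some fixing, and license number 0 looks like a placeholder shared by many unrelated establishments (229 distinct names). Also, the license type may change over time.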
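
Because license number 0 dominates the output above, the uniqueness check is easier to read when 0 and missing values are left out. A hedged sketch on made-up rows; the license number 1001 and the pairing of names with licenses are hypothetical.

```python
import pandas as pd

# Made-up rows: license 0 and a missing license belong to unrelated places,
# the hypothetical license 1001 appears under two spellings of one name.
toy = pd.DataFrame({
    'License #': pd.array([0, 0, None, 1001, 1001, 2163956], dtype='Int64'),
    'DBA Name': ['SUBWAY', 'LITTLE BLACK PEARL', 'ASADO COFFEE ROASTERS',
                 'LA ESPERANZA RESTAURANT', 'LA ESPERANZA RESTUARANT',
                 "JET'S PIZZA"],
})

# Treat 0 and missing values as "no license" and exclude them from the check.
has_license = toy['License #'].fillna(0) != 0
names_per_license = toy[has_license].groupby('License #')['DBA Name'].nunique()

# Licenses that still map to more than one name need a closer look.
ambiguous = names_per_license[names_per_license > 1]
print(ambiguous)
```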

#### Empty license number

In [13]:
df[df['License #'].isin([0, None, ])].head(50)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
187,2315561,DORE EARLY CHILDHOOD CENTER,DORE EARLY CHILDHOOD CENTER,0,School,Risk 1 (High),6108 S Natoma AVE,CHICAGO,IL,60638,10/09/2019,Canvass,Pass,,41.780927,-87.78764,"(-87.78764008652699, 41.78092716332793)"
893,2300450,SUBWAY,SUBWAY,0,Restaurant,Risk 1 (High),4771 N LINCOLN AVE,CHICAGO,IL,60625,06/28/2019,Canvass,Out of Business,,41.968506,-87.688338,"(-87.6883380552873, 41.96850625784847)"
1049,2293545,SUBWAY,SUBWAY,0,Restaurant,Risk 1 (High),4771 N LINCOLN AVE,CHICAGO,IL,60625,06/13/2019,Canvass,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.968506,-87.688338,"(-87.6883380552873, 41.96850625784847)"
1584,2283080,CHURCH OF THE THREE CROSSES,CHURCH OF THE THREE CROSSES,0,Special Event,Risk 2 (Medium),333 W WISCONSIN,CHICAGO,IL,60614,04/12/2019,Canvass,Pass,"55. PHYSICAL FACILITIES INSTALLED, MAINTAINED ...",41.916419,-87.637601,"(-87.63760096984397, 41.91641905854752)"
3001,2222226,BIRRIA OCOTLAN MEZCAL,BIRRIA OCOTLAN MEZCAL,0,Restaurant,Risk 1 (High),3011 W CERMAK RD,CHICAGO,IL,60623,09/12/2018,Canvass,Fail,2. CITY OF CHICAGO FOOD SERVICE SANITATION CER...,41.85172,-87.700877,"(-87.70087724148631, 41.851719660634274)"
3359,2185028,IMMACULATE CONCEPTION CHURCH,,0,CHURCH/SPECIAL EVENT,Risk 1 (High),8756 S COMMERICAL AVE,CHICAGO,IL,60617,07/25/2018,Canvass,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",,,
3686,2176741,BELMONT PLACE SENIOR HOUSING,BELMONT PLACE SENIOR HOUSING,0,Long Term Care,Risk 1 (High),4645 W BELMONT AVE,CHICAGO,IL,60641,06/05/2018,Short Form Complaint,Pass w/ Conditions,,41.938758,-87.74392,"(-87.74392031202537, 41.93875847343951)"
4092,2159749,LITTLE BLACK PEARL,LITTLE BLACK PEARL,0,School,Risk 2 (Medium),1060 E 47TH ST,CHICAGO,IL,60653,04/10/2018,Canvass,Pass,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",41.809722,-87.599389,"(-87.59938918300288, 41.80972238714215)"
4882,2116707,MRS. T'S SOUTHERN FRIED CHICKEN,MRS. T'S SOUTHERN FRIED CHICKEN,0,Restaurant,Risk 1 (High),3343 N BROADWAY,CHICAGO,IL,60657,12/06/2017,Canvass,Out of Business,,41.943219,-87.64459,"(-87.64459003405148, 41.94321903123653)"
5507,2081887,ASADO COFFEE ROASTERS,ASADO COFFEE ROASTERS,0,Restaurant,Risk 2 (Medium),22 E JACKSON BLVD,CHICAGO,IL,60604,09/12/2017,Canvass Re-Inspection,Pass,,41.878336,-87.626892,"(-87.62689227478914, 41.878336071547935)"


#### City attribute

In [14]:
df.City.value_counts().compute()

CHICAGO               195139
Chicago                  321
chicago                   97
CCHICAGO                  46
SCHAUMBURG                25
                       ...  
COUNTRY CLUB HILLS         1
DES PLAINES                1
GRIFFITH                   1
alsip                      1
GLENCOE                    1
Name: City, Length: 71, dtype: int64

#### Normalize city names

In [15]:
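def fix_city(df):
    # Map known spelling variants of Chicago to one canonical value;
    # work on a copy so the input partition is not modified in place
    df = df.copy()
    df['City'] = df['City'].replace(['Chicago', 'chicago', 'CCHICAGO'], 'CHICAGO')
    return df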
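
A hand-written list of variants has to be extended whenever a new spelling shows up. One possible alternative, sketched on a made-up series, is to normalize case and whitespace first and keep an explicit lookup only for real misspellings. Unlike `fix_city`, this also upper-cases the other city names (e.g. `alsip`).

```python
import pandas as pd

# Made-up values modeled on the City counts shown above.
cities = pd.Series(['CHICAGO', 'Chicago', 'chicago', 'CCHICAGO', 'alsip', None])

# Upper-case and trim first, so case variants collapse without listing each one.
normalized = cities.str.strip().str.upper()

# Then fix known misspellings with an explicit lookup (extend as needed).
normalized = normalized.replace({'CCHICAGO': 'CHICAGO'})

print(normalized.value_counts())
```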

#### State attribute

In [16]:
df.State.value_counts().compute()

IL    195934
WI         1
NY         1
IN         1
Name: State, dtype: int64

#### States other than 'IL' are probably noise and will be removed

In [17]:
incidents = df.map_partitions(fix_city).compute()
incidents = incidents[incidents.State == 'IL']

In [18]:
incidents.head(5)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
0,2345699,JET'S PIZZA,JET'S PIZZA,2163956,Restaurant,Risk 2 (Medium),2811 N ASHLAND AVE,CHICAGO,IL,60657,11/15/2019,Canvass,Pass w/ Conditions,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.932766,-87.668262,"(-87.66826200882875, 41.93276573571165)"
1,2345619,JAIPUR,JAIPUR,2694084,Restaurant,Risk 1 (High),738 W RANDOLPH ST,CHICAGO,IL,60661,11/14/2019,Canvass,Pass,,41.884518,-87.647304,"(-87.64730383120978, 41.88451799637232)"
2,2345616,VIP FIT CLUB,VIP FIT CLUB,2446547,Restaurant,Risk 2 (Medium),3426 W DIVERSEY AVE,CHICAGO,IL,60647,11/14/2019,Complaint,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.932069,-87.713294,"(-87.71329441237397, 41.932068626464286)"
3,2345602,VELVET TACO,VELVET TACO,2652941,Restaurant,Risk 3 (Low),2309 N LINCOLN AVE,CHICAGO,IL,60614,11/14/2019,License,Fail,,41.923953,-87.646462,"(-87.6464616293504, 41.9239529807269)"
4,2345603,VELVET TACO,VELVET TACO,2652943,Restaurant,Risk 3 (Low),2309 N LINCOLN AVE,CHICAGO,IL,60614,11/14/2019,License,Fail,,41.923953,-87.646462,"(-87.6464616293504, 41.9239529807269)"


### Identify duplicated establishments

In [19]:
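# Take an explicit copy so that adding the 'LatLong' column later does not
# trigger pandas' SettingWithCopyWarning
establishments = incidents[['DBA Name', 'AKA Name', 'Facility Type', 'Address', 'City', 'Zip']].copy()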

In [44]:
establishments['LatLong'] = incidents.apply(lambda x: None if any([pd.isna(x['Latitude']), pd.isna(x['Longitude'])]) else (x['Latitude'], x['Longitude']), axis=1)
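
The row-wise `apply` above works, but it calls a Python function for every one of the roughly 196,000 rows. A sketch of an equivalent construction that iterates over the two columns directly, shown on a made-up three-row frame:

```python
import numpy as np
import pandas as pd

# Made-up coordinates; the second row has no location.
coords = pd.DataFrame({
    'Latitude': [41.932766, np.nan, 41.884518],
    'Longitude': [-87.668262, np.nan, -87.647304],
})

# True only where both coordinates are present.
has_both = coords['Latitude'].notna() & coords['Longitude'].notna()

# Build (lat, lon) tuples, using None when either coordinate is missing.
coords['LatLong'] = [
    (lat, lon) if ok else None
    for lat, lon, ok in zip(coords['Latitude'], coords['Longitude'], has_both)
]
print(coords['LatLong'])
```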

In [49]:
len(establishments)

195934

In [50]:
establishments = establishments.drop_duplicates()

In [51]:
len(establishments)

32887

In [52]:
establishments.LatLong.value_counts()

(42.008536400868735, -87.91442843927048)    267
(41.88743405025222, -87.68184949426895)     188
None                                        160
(41.85045102427, -87.65879785567869)        113
(41.83078366228312, -87.6352957830455)       96
                                           ... 
(41.90344939527244, -87.66803165380388)       1
(41.76830760574476, -87.72264592858038)       1
(41.98467406624727, -87.69692510314671)       1
(41.713787168999794, -87.53663320988323)      1
(41.90326646138292, -87.67912669629169)       1
Name: LatLong, Length: 16815, dtype: int64

In [53]:
# Initiate deduplication
locations_dedup = pandas_dedupe.dedupe_dataframe(establishments, 
                                                 ['DBA Name', 'AKA Name', 'Facility Type', 
                                                  'Address', 'City', 
                                                  ('Zip', 'Exact'), ('LatLong', 'LatLong', 'has missing')])

importing data ...


DBA Name : dona maris 2
AKA Name : dona maris 2
Facility Type : grocery store
Address : 3518 w montrose ave
City : chicago
Zip : 60618
LatLong : (41.96119793965089, -87.7164129559992)

DBA Name : wing lee chinese cuisine and cafe corporation
AKA Name : wing lee chinese cuisine and cafe corporation
Facility Type : None
Address : 1810 w montrose ave
City : chicago
Zip : 60613
LatLong : (41.961619281289565, -87.67502882185283)

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


starting active labeling...
n


DBA Name : 7-eleven
AKA Name : 7-eleven
Facility Type : grocery store
Address : 5206 n western ave
City : chicago
Zip : 60625
LatLong : (41.97612112189385, -87.68930731555461)

DBA Name : 7-eleven 36718b
AKA Name : 7-eleven
Facility Type : grocery store
Address : 3600 w belmont ave
City : chicago
Zip : 60618
LatLong : (41.939325192495275, -87.71746162505319)

0/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : ella vanilla llc
AKA Name : ella vanilla llc
Facility Type : shared kitchen user (long term)
Address : 2217 w crystal st apt 2
City : chicago
Zip : 60622
LatLong : None

DBA Name : taqueria guerrero
AKA Name : taqueria guerrero
Facility Type : restaurant
Address : 2551 w cermak rd
City : chicago
Zip : 60608
LatLong : (41.85186827166058, -87.69007512129495)

0/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : pizzeria calzone
AKA Name : pizzeria calzone
Facility Type : restaurant
Address : 5858 n lincoln ave
City : chicago
Zip : 60659
LatLong : (41.98786035580833, -87.70336737996178)

DBA Name : nancy foods inc
AKA Name : nancy foods inc
Facility Type : grocery store
Address : 12315 s state st
City : chicago
Zip : 60628
LatLong : (41.670252399927904, -87.62227195617474)

0/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : mcdonalds 12785
AKA Name : mcdonalds (t5 arrival)
Facility Type : restaurant
Address : 11601 w touhy ave
City : chicago
Zip : 60666
LatLong : (42.008536400868735, -87.91442843927048)

DBA Name : mcdonalds
AKA Name : mcdonalds
Facility Type : None
Address : 6900 s lafayette ave
City : chicago
Zip : 60621
LatLong : (41.76915533533597, -87.6268126762725)

0/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : bridgeport pasty, llc
AKA Name : bridgeport pasty
Facility Type : restaurant
Address : 3142 s morgan st
City : chicago
Zip : 60608
LatLong : (41.83673163109136, -87.65125761223642)

DBA Name : bridgeport pasty, llc
AKA Name : bridgeport pasty
Facility Type : mobile food dispenser
Address : 3142 s morgan st
City : chicago
Zip : 60608
LatLong : (41.83673163109136, -87.65125761223642)

0/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : czerwony kapturek
AKA Name : czerwony kapturek
Facility Type : None
Address : 7628-7630 w foster ave
City : chicago
Zip : 60656
LatLong : (41.974365531990216, -87.81812421807619)

DBA Name : south shore high no.
AKA Name : south shore high north
Facility Type : school
Address : 7529 s constance (1832e)
City : charles a hayes
Zip : 60649
LatLong : (41.75795594904808, -87.57964647686238)

1/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : gateway newstand
AKA Name : gateway newstand
Facility Type : grocery store
Address : 875 n michigan ave
City : chicago
Zip : 60611
LatLong : (41.898948541729865, -87.62397498768969)

DBA Name : aramark
AKA Name : aramark
Facility Type : restaurant
Address : 875 n michigan ave
City : chicago
Zip : 60611
LatLong : (41.898948541729865, -87.62397498768969)

1/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : pulaski mobil
AKA Name : mobil
Facility Type : grocery store
Address : 2801-2811 s pulaski rd
City : chicago
Zip : 60623
LatLong : (41.84047629775851, -87.72429987820476)

DBA Name : freds mobil
AKA Name : mobil
Facility Type : grocery store
Address : 2801-2811 s pulaski rd
City : chicago
Zip : 60623
LatLong : (41.84047629775851, -87.72429987820476)

1/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : quick bite
AKA Name : quick bite
Facility Type : None
Address : 3901 s dr martin luther king jr dr
City : chicago
Zip : 60653
LatLong : (41.82377697840096, -87.61684857582857)

DBA Name : quick bite
AKA Name : quick bite
Facility Type : None
Address : 3901 s dr martin luther king jr
City : chicago
Zip : 60653
LatLong : None

1/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : dianas daycare center
AKA Name : dianas daycare center
Facility Type : daycare above and under 2 years
Address : 5961 n clark st
City : chicago
Zip : 60660
LatLong : (41.990244301033066, -87.66985749853201)

DBA Name : dianas daycare center llc
AKA Name : dianas daycare center
Facility Type : daycare (2 - 6 years)
Address : 5961 n clark st
City : chicago
Zip : 60660
LatLong : (41.990244301033066, -87.66985749853201)

2/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : illinois sportservice inc
AKA Name : dipping dots ice cream 2 (540-542)
Facility Type : kiosk
Address : 333 w 35th st
City : chicago
Zip : 60616
LatLong : (41.83078366228312, -87.6352957830455)

DBA Name : illinois sportservice inc
AKA Name : hot dog vienna beef 7 (557)
Facility Type : restaurant
Address : 333 w 35th st
City : chicago
Zip : 60616
LatLong : (41.83078366228312, -87.6352957830455)

3/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : 7-eleven
AKA Name : 7-eleven
Facility Type : None
Address : 343 n la salle dr
City : chicago
Zip : 60610
LatLong : (41.88863332889772, -87.63236252748243)

DBA Name : 7-eleven
AKA Name : 7-eleven
Facility Type : None
Address : 343 n la salle st
City : chicago
Zip : 60654
LatLong : (41.88863332889772, -87.63236252748243)

3/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : southport city saloon
AKA Name : southport city saloon
Facility Type : restaurant
Address : 2542-2548 n southport ave
City : chicago
Zip : 60614
LatLong : (41.928296636787316, -87.66357131513206)

DBA Name : southport city saloon
AKA Name : southport city saloon
Facility Type : None
Address : 2548 n southport ave
City : chicago
Zip : 60614
LatLong : (41.928485477764006, -87.66357735760441)

4/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : bambi
AKA Name : bambi cart cbh 1383
Facility Type : mobile food dispenser
Address : 2051 w 47th st
City : chicago
Zip : 60609
LatLong : (41.808395917931094, -87.67672122326663)

DBA Name : bambi
AKA Name : bambi cart cbh 1385
Facility Type : mobile food dispenser
Address : 2051 w 47th st
City : chicago
Zip : 60609
LatLong : (41.808395917931094, -87.67672122326663)

5/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


u


DBA Name : papa ts italiano
AKA Name : papa ts italiano truck 5
Facility Type : mobile food dispenser
Address : 2843 - 2847 w 63rd st
City : chicago
Zip : 60629
LatLong : (41.778980820715226, -87.69509004304912)

DBA Name : papa ts italiano
AKA Name : papa ts italiano truck 8
Facility Type : mobile food dispenser
Address : 2843 - 2847 w 63rd st
City : chicago
Zip : 60629
LatLong : (41.778980820715226, -87.69509004304912)

5/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : automatic icemakers
AKA Name : automatic icemakers
Facility Type : wholesale
Address : 3725 n talman ave
City : chicago
Zip : 60618
LatLong : (41.9492777002764, -87.69475171075673)

DBA Name : automatic icemakers holding, inc
AKA Name : automatic icemakers holding, inc
Facility Type : None
Address : 3725 n talman ave
City : chicago
Zip : 60618
LatLong : (41.9492777002764, -87.69475171075673)

5/10 positive, 10/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : kentucky fried chicken 521047
AKA Name : kentucky fried chicken 521047
Facility Type : restaurant
Address : 5230 w madison st
City : chicago
Zip : 60644
LatLong : (41.880465442372845, -87.75634710267644)

DBA Name : mih admin services llc/dba kfc 521047
AKA Name : kfc
Facility Type : restaurant
Address : 5230 w madison st
City : chicago
Zip : 60644
LatLong : (41.880465442372845, -87.75634710267644)

6/10 positive, 10/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : hudson news
AKA Name : hudson news t-1 gate b5
Facility Type : grocery store
Address : 11601 w touhy ave
City : chicago
Zip : 60666
LatLong : (42.008536400868735, -87.91442843927048)

DBA Name : hudson news
AKA Name : hudson news/t1 b-11
Facility Type : store
Address : 11601 w touhy ave
City : chicago
Zip : 60666
LatLong : (42.008536400868735, -87.91442843927048)

7/10 positive, 10/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : harolds 57
AKA Name : harolds 57
Facility Type : restaurant
Address : 7113 s state st
City : chicago
Zip : 60619
LatLong : (41.7650677789878, -87.62475001172706)

DBA Name : harolds chicken shack 6
AKA Name : harolds chicken shack 6
Facility Type : None
Address : 7139 s state st
City : chicago
Zip : 60619
LatLong : (41.76435396706882, -87.62473563971668)

7/10 positive, 11/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : i dream of falafel
AKA Name : i dream of falafel
Facility Type : restaurant
Address : 112 w monroe st
City : chicago
Zip : 60603
LatLong : (41.880801065513246, -87.63126132290381)

DBA Name : i dream of falafel
AKA Name : i dream of falafel
Facility Type : restaurant
Address : 555 w monroe st
City : chicago
Zip : 60661
LatLong : (41.880447908791545, -87.64183897534404)

7/10 positive, 12/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : bias cafeteria marianao ii
AKA Name : bias cafe
Facility Type : restaurant
Address : 6401 w addison st
City : chicago
Zip : 60634
LatLong : (41.94555292295917, -87.78636821200944)

DBA Name : taco madre
AKA Name : None
Facility Type : restaurant
Address : 6401 w addison st
City : chicago
Zip : 60634
LatLong : (41.94555292295917, -87.78636821200944)

7/10 positive, 13/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : burger joint chicago bar grill llc
AKA Name : None
Facility Type : restaurant
Address : 500 w madison st
City : chicago
Zip : 60661
LatLong : (41.88199433820508, -87.6397586848809)

DBA Name : k r lobby shop
AKA Name : k r lobby shop
Facility Type : grocery store
Address : 500 w monroe st
City : chicago
Zip : 60661
LatLong : (41.880699246124394, -87.63972502663869)

7/10 positive, 14/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


DBA Name : the baby academy
AKA Name : the baby academy
Facility Type : childrens services facility
Address : 8605-8607 s cottage grove ave
City : chicago
Zip : 60619
LatLong : (41.738261523797966, -87.6047230763477)

DBA Name : the baby academy
AKA Name : the baby academy
Facility Type : daycare (under 2 years)
Address : 8607 s cottage grove ave
City : chicago
Zip : 60619
LatLong : (41.7382052478746, -87.60472164419079)

7/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : happy panda inc.
AKA Name : 18th st. asian bistro
Facility Type : restaurant
Address : 1343 w 18th st
City : chicago
Zip : 60608
LatLong : (41.85777810023129, -87.6605884280591)

DBA Name : wongs asia cafe
AKA Name : 18th street asian bistro
Facility Type : restaurant
Address : 1343 w 18th st
City : chicago
Zip : 60608
LatLong : (41.85777810023129, -87.6605884280591)

8/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : quest food management services
AKA Name : latin school
Facility Type : restaurant
Address : 59 w north blvd
City : chicago
Zip : 60610
LatLong : (41.91108516003885, -87.63125957941122)

DBA Name : quest food management services
AKA Name : latin school
Facility Type : restaurant
Address : 45 w north blvd
City : chicago
Zip : 60610
LatLong : (41.911085151800144, -87.63063210150685)

9/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : early childhood educare center
AKA Name : None
Facility Type : daycare above and under 2 years
Address : 5055 s state
City : chicago
Zip : 60609
LatLong : (41.80214349010087, -87.62574740239889)

DBA Name : attucks
AKA Name : attucks
Facility Type : school
Address : 5055 s state
City : chicago
Zip : 60609
LatLong : (41.80214349010087, -87.62574740239889)

10/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


u


DBA Name : la esperanza restaurant
AKA Name : la esperanza restaurant
Facility Type : restaurant
Address : 1864 s blue island ave
City : chicago
Zip : 60608
LatLong : (41.856306651100724, -87.66265894498156)

DBA Name : la esperanza restuarant
AKA Name : la esperanza restuarant
Facility Type : None
Address : 1864 s blue island ave
City : chicago
Zip : 60608
LatLong : (41.856306651100724, -87.66265894498156)

10/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : el original gallo bravo
AKA Name : el original gallo bravo
Facility Type : restaurant
Address : 4503-4505 n kedzie ave
City : chicago
Zip : 60625
LatLong : (41.96315679416602, -87.70815959668909)

DBA Name : el original gallo bravo
AKA Name : el gallo bravo
Facility Type : restaurant
Address : 4503 n kedzie ave
City : chicago
Zip : 60625
LatLong : (41.96315679416602, -87.70815959668909)

11/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


DBA Name : smart learning center llc
AKA Name : smart learning center
Facility Type : after school program
Address : 2839-2841 s archer ave
City : chicago
Zip : 60608
LatLong : (41.842695478349206, -87.65550921697735)

DBA Name : smart learning center llc
AKA Name : smart learning center llc
Facility Type : childern activity facility
Address : 2841 s archer ave
City : chicago
Zip : 60608
LatLong : (41.842662462230905, -87.6555778185462)

12/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.



clustering...




# duplicate sets 1627


In [57]:
locations_dedup[~pd.isna(locations_dedup['cluster id'])]

Unnamed: 0,DBA Name,AKA Name,Facility Type,Address,City,Zip,LatLong,cluster id,confidence
13,smart child preschool,smart child preschool,15 monts to 5 years old,5100 w foster ave,chicago,60630,"(41.975354698580844, -87.75534848899322)",0.0,0.619474
17,idof fresh mediterranean,idof fresh mediterranean,restaurant,6558 n sheridan rd,chicago,60626,"(42.001798790886504, -87.66090125470284)",1.0,0.769347
19,gopuff,gopuff,grocery store,1801 w warner ave,chicago,60613,"(41.956845683288854, -87.67439466946578)",2.0,0.789743
19,subway restaurant,subway,restaurant,953 w webster ave,chicago,60614,"(41.921633711055094, -87.65317221940548)",2.0,0.789743
54,the market,the market,restaurant,5700 s cicero ave,chicago,60638,"(41.78932932326538, -87.74164564419638)",3.0,0.800028
...,...,...,...,...,...,...,...,...,...
57536,under the el,mixed greens,restaurant,223 w lake st,chicago,60606,"(41.885621553750475, -87.63480048875809)",1580.0,0.538934
57954,rose food mart inc,rose food mart,grocery store,11300 s wentworth ave,chicago,60628,"(41.6888459802589, -87.62794434817805)",541.0,0.797941
58218,pizza mania,pizza mania,,5777 n ridge ave,chicago,60660,"(41.987061183467176, -87.66544832642705)",499.0,0.807182
58671,dollar deals up,dollar deals up,grocery store,2057- 2059 n pulaski rd,chicago,60639,"(41.918891263940395, -87.72646531498671)",1559.0,0.819209
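
The deduplication result can be summarized per cluster to see how many records each cluster groups and how confident the model is. A minimal sketch on made-up rows that reuse the `cluster id` and `confidence` column names from the output above; the confidence values are invented.

```python
import pandas as pd

# Made-up rows shaped like the deduplication output above.
result = pd.DataFrame({
    'DBA Name': ['quick bite', 'quick bite', 'taco madre', 'bambi'],
    'cluster id': [0.0, 0.0, None, 1.0],
    'confidence': [0.91, 0.91, None, 0.55],
})

# Keep only records that were assigned to a cluster.
clustered = result.dropna(subset=['cluster id'])

# Size and lowest confidence of every cluster.
summary = clustered.groupby('cluster id').agg(
    members=('DBA Name', 'size'),
    min_confidence=('confidence', 'min'),
)

# Clusters with at least two members are the actual duplicate sets.
duplicates = summary[summary['members'] > 1]
print(duplicates)
```

Sorting such a summary by `min_confidence` would be one way to pick clusters for manual review before merging records.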


In [None]:
# ## Writing Results

# Build a lookup from record id to its cluster id, canonical representation
# and confidence score; it can then be joined to the original data before
# writing it back out to a CSV.
# NOTE: this cell follows the lower-level `dedupe` API and was not run here;
# `clustered_dupes` and `data_d` are not defined anywhere in this notebook.
import dedupe

cluster_membership = {}
for cluster_id, (id_set, scores) in enumerate(clustered_dupes):
    cluster_d = [data_d[c] for c in id_set]
    canonical_rep = dedupe.canonicalize(cluster_d)
    for record_id, score in zip(id_set, scores):
        cluster_membership[record_id] = {
            "cluster id": cluster_id,
            "canonical representation": canonical_rep,
            "confidence": score,
        }