# python distance measures

## why important

**data often contains messy strings that you need to "clean" before further analysis**

examples:
* misspellings - someone may type "misisipi" instead of "mississippi"
* alternates - someone may type "ny giants" instead of "new york giants" 

**search engines are "smart" and want to help you find the right thing even if you didn't get an exact match**

blog post from SeatGeek >> http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

```
Of course, a big problem with most corners of the internet is labeling. 
One of our most consistently frustrating issues is trying to figure out 
whether two ticket listings are for the same real-life event 
(that is, without enlisting the help of our army of interns).

```

* abbreviations - someone may type "cal bears football" instead of "california golden bears football" 
  * Seatgeek > https://seatgeek.com/search?f=1&search=cal%20bears%20football


## how 

investigate the usage of the following packages that compute distances
* Levenshtein - https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html
* difflib - https://docs.python.org/2/library/difflib.html
* fuzzywuzzy (uses Levenshtein package) - https://github.com/seatgeek/fuzzywuzzy

## installation

```
pip install python-Levenshtein
pip install fuzzywuzzy

```


In [1]:
!pip install python-Levenshtein
!pip install fuzzywuzzy

Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.0.tar.gz (48kB)
    100% |████████████████████████████████| 51kB 1.3MB/s
Building wheels for collected packages: python-Levenshtein
  Running setup.py bdist_wheel for python-Levenshtein ... done
  Stored in directory: /Users/mango/Library/Caches/pip/wheels/c0/83/e9/b2cc2876e175d04091caf4e9f5de564ff2503b1f1885e7c3ba
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.0
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.15.1-py2.py3-none-any.whl
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.15.1


In [5]:
import Levenshtein
from fuzzywuzzy import fuzz, StringMatcher, process
import difflib # part of standard library
import csv

In [6]:
import pandas

## basic usage 

### example 1

In [10]:
s1 = "foo"
s2 = "Foo"
print(Levenshtein.distance(s1,s2))
print(Levenshtein.ratio(s1,s2))

1
0.6666666666666666


In [12]:

s1 = "misisipi"
s2 = "mississippi"
print(Levenshtein.distance(s1,s2))
print(Levenshtein.ratio(s1,s2))


3
0.8421052631578947
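
To make the two numbers above less of a black box, here is a minimal pure-Python sketch of an edit distance and a ratio derived from it. The helpers `edit_distance` and `similarity_ratio` are hypothetical names, not part of the `Levenshtein` package. The ratio formula (substitutions counted with cost 2, result divided by the combined length of both strings) is an assumption that happens to reproduce the outputs shown above.

```python
def edit_distance(a, b, substitution_cost=1):
    """Dynamic-programming edit distance between two strings (illustrative sketch)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else substitution_cost
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a, b):
    """Assumed ratio: (combined length - distance with substitution cost 2) / combined length."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - edit_distance(a, b, substitution_cost=2)) / total


print(edit_distance("foo", "Foo"))               # 1
print(edit_distance("misisipi", "mississippi"))  # 3
print(similarity_ratio("misisipi", "mississippi"))
```

The package itself is implemented in C and is much faster; the sketch is only meant to show what is being counted.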


### example 2

In [14]:

s1 = "NY Giants"
s2 = "New York Giants"
print(Levenshtein.distance(s1.lower(),s2.lower()))
print(Levenshtein.ratio(s1.lower(),s2.lower()))

print(fuzz.token_sort_ratio(s1, s2))

6
0.75
75


In [15]:
# different way to get the ratio
print(Levenshtein.ratio(s1,s2))
print(fuzz.ratio(s1,s2) / 100.)
print(difflib.SequenceMatcher(None,s1,s2).ratio())

0.75
0.75
0.75
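
The `difflib` ratio can be reproduced by hand from the matching blocks that `SequenceMatcher` finds: twice the number of matched characters divided by the combined length of both strings. The sketch below uses only the standard library; the variable names `matched` and `manual_ratio` are chosen here for illustration.

```python
import difflib

s1 = "NY Giants"
s2 = "New York Giants"

matcher = difflib.SequenceMatcher(None, s1, s2)

# each block is a run of characters that appears in both strings, in order
matched = sum(block.size for block in matcher.get_matching_blocks())

# ratio = 2 * M / T, where T is the combined length of both strings
manual_ratio = 2.0 * matched / (len(s1) + len(s2))

print(matched)          # 9
print(manual_ratio)     # 0.75
print(matcher.ratio())  # 0.75
```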


### example 3


In [17]:
s1 = "fuzzy wuzzy was a bear"
s2 = "wuzzy fuzzy was a bear"

print(fuzz.ratio(s1, s2))

91


In [18]:
s1 = "fuzzy wuzzy was a"  # shorter string this time: the trailing "bear" is dropped
s2 = "wuzzy fuzzy was a bear"
print(fuzz.token_sort_ratio(s1, s2))

87
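
The idea behind a token-sort score is to make word order irrelevant: lowercase both strings, split them into tokens, sort the tokens, join them again, and only then compute a ratio. The sketch below is a simplified stand-in built on `difflib`, not the fuzzywuzzy implementation (the real scorer may apply additional preprocessing, which is left out here). The function names `token_sort_key` and `token_sort_similarity` are hypothetical.

```python
import difflib


def token_sort_key(s):
    """Lowercase, split on whitespace, sort the tokens, and join them back."""
    return " ".join(sorted(s.lower().split()))


def token_sort_similarity(a, b):
    """Similarity (0-100) of the two strings after sorting their tokens."""
    ratio = difflib.SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()
    return int(round(100 * ratio))


print(token_sort_key("wuzzy fuzzy was a bear"))  # a bear fuzzy was wuzzy
print(token_sort_similarity("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
print(token_sort_similarity("fuzzy wuzzy was a", "wuzzy fuzzy was a bear"))
```

Swapping two words no longer costs anything, while a missing word still lowers the score.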


##  find best match

In [169]:
# read data
df_teams = pandas.read_csv('nflTeams.csv')
teams = df_teams.team.tolist()
sorted(teams)

['Arizona Cardinals',
 'Atlanta Falcons',
 'Baltimore Ravens',
 'Buffalo Bills',
 'Carolina Panthers',
 'Chicago Bears',
 'Cincinnati Bengals',
 'Cleveland Browns',
 'Dallas Cowboys',
 'Denver Broncos',
 'Detroit Lions',
 'Green Bay Packers',
 'Houston Texans',
 'Indianapolis Colts',
 'Jacksonville Jaguars',
 'Kansas City Chiefs',
 'Los Angeles Rams',
 'Miami Dolphins',
 'Minnesota Vikings',
 'New England Patriots',
 'New Orleans Saints',
 'New York Giants',
 'New York Jets',
 'Oakland Raiders',
 'Philadelphia Eagles',
 'Pittsburgh Steelers',
 'San Diego Chargers',
 'San Francisco 49ers',
 'Seattle Seahawks',
 'St Louis Rams',
 'Tampa Bay Buccaneers',
 'Tennessee Titans',
 'Washington Redskins']

In [170]:
# use difflib to get best match
s1 = 'NY Giants'
print(s1)

print(difflib.get_close_matches(s1,teams))

NY Giants
['New York Giants']


In [171]:
# use difflib to get best match
s1 = 'New York Giants' 
print(s1)
print(difflib.get_close_matches(s1,teams))

New York Giants
['New York Giants', 'New York Jets']


In [172]:
# which match is better?
close_matches  = difflib.get_close_matches(s1,teams)
for s2 in close_matches:
    print(s1, s2, Levenshtein.ratio(s1,s2))

New York Giants New York Giants 1.0
New York Giants New York Jets 0.7857142857142857
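
`difflib.get_close_matches` takes two optional parameters that control what comes back: `n` (maximum number of matches, default 3) and `cutoff` (minimum ratio, default 0.6). The sketch below wraps it in a hypothetical helper, `scored_matches`, that returns each match together with its `difflib` ratio. The short `sample_teams` list is made up for the example so that it runs without `nflTeams.csv`.

```python
import difflib


def scored_matches(query, choices, n=3, cutoff=0.6):
    """Return (match, ratio) pairs for the close matches, best first."""
    matches = difflib.get_close_matches(query, choices, n=n, cutoff=cutoff)
    return [
        (match, round(difflib.SequenceMatcher(None, match, query).ratio(), 3))
        for match in matches
    ]


# small illustrative list, not the full team file
sample_teams = ["New York Giants", "New York Jets", "New Orleans Saints", "Dallas Cowboys"]

result = scored_matches("New York Giants", sample_teams)
print(result)
```

Raising `cutoff` makes the function stricter; lowering it (or raising `n`) returns more candidates to inspect.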


In [173]:
# use fuzzywuzzy
s1 = 'NY Giants' 
print(s1)
process.extract(s1, teams)

NY Giants


[('New York Giants', 86),
 ('New Orleans Saints', 60),
 ('Tennessee Titans', 53),
 ('Houston Texans', 50),
 ('Carolina Panthers', 50)]

In [174]:
# get only one answer
# use fuzzywuzzy
s1 = 'NY Giants' 
print(s1)
process.extractOne(s1, teams)

NY Giants


('New York Giants', 86)

In [177]:
# use fuzzywuzzy - not the desired result!
s1 = 'NY Gaints' 
print(s1)
process.extract(s1, teams)

NY Gaints


[('New Orleans Saints', 70),
 ('New York Giants', 67),
 ('New England Patriots', 50),
 ('Arizona Cardinals', 50),
 ('Houston Texans', 50)]
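
One way to investigate a surprising ranking like this is to score the same candidates with a plain, case-insensitive ratio and see whether the order changes. The sketch below does that with `difflib` only, so its numbers are not fuzzywuzzy scores. `rank_by_plain_ratio` and the short `sample_teams` list are hypothetical names introduced for the example.

```python
import difflib


def rank_by_plain_ratio(query, choices, limit=5):
    """Rank choices by a plain, case-insensitive similarity ratio (0-100)."""
    scored = []
    for choice in choices:
        ratio = difflib.SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        scored.append((choice, int(round(100 * ratio))))
    # highest score first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# small illustrative list, not the full team file
sample_teams = ["New Orleans Saints", "New York Giants", "New York Jets", "Houston Texans"]

ranked = rank_by_plain_ratio("NY Gaints", sample_teams)
print(ranked)
```

On this small sample the plain ratio puts 'New York Giants' first. With fuzzywuzzy itself, the analogous experiment would be passing `scorer=fuzz.ratio` to `process.extract`.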

##  find best match 2 

In [178]:
# read data - includes Canadian provinces
df_states = pandas.read_csv('state_table.csv')
states = df_states['name'].str.lower().tolist()  # plain list of names, like teams above
sorted(states)

['alabama',
 'alaska',
 'alberta',
 'american samoa',
 'arizona',
 'arkansas',
 'armed forces americas',
 'armed forces europe',
 'armed forces pacific',
 'bajo nuevo bank',
 'baker island',
 'british columbia',
 'california',
 'colorado',
 'connecticut',
 'delaware',
 'district of columbia',
 'florida',
 'georgia',
 'guam',
 'hawaii',
 'howland island',
 'idaho',
 'illinois',
 'indiana',
 'iowa',
 'jarvis island',
 'johnston atoll',
 'kansas',
 'kentucky',
 'kingman reef',
 'louisiana',
 'maine',
 'manitoba',
 'maryland',
 'massachusetts',
 'michigan',
 'midway islands',
 'minnesota',
 'mississippi',
 'missouri',
 'montana',
 'navassa island',
 'nebraska',
 'nevada',
 'new brunswick',
 'new hampshire',
 'new jersey',
 'new mexico',
 'new york',
 'newfoundland and labrador',
 'north carolina',
 'north dakota',
 'northern mariana islands',
 'northwest territories',
 'nova scotia',
 'nunavut',
 'ohio',
 'oklahoma',
 'ontario',
 'oregon',
 'palmyra atoll',
 'pennsylvania',
 'prince edward

In [179]:
# get closest matches
s1 = 'misisipi'
print(s1)

print(difflib.get_close_matches(s1,states))


misisipi
['mississippi', 'missouri']


In [180]:
# which match is better?
close_matches  = difflib.get_close_matches(s1,states)
for s2 in close_matches:
    print(s1, s2, Levenshtein.ratio(s1,s2))


misisipi mississippi 0.8421052631578947
misisipi missouri 0.625


In [182]:
# use fuzzywuzzy this time
process.extract(s1, states)

[('mississippi', 84),
 ('missouri', 63),
 ('wisconsin', 47),
 ('jarvis island', 45),
 ('district of columbia', 43)]

## time to try it for yourself

In [183]:
# DO NOT TRY THIS - unless you have pandas installed
df_districts = pandas.read_csv("districts.csv")

In [185]:
df_districts[5000:5010]

| Unnamed: 0 | Agency Name | State Name [District] Latest available year | Agency ID | County Name [District] 2010-11 | State Abbr [District] Latest available year | Agency Name [District] 2010-11 |
| --- | --- | --- | --- | --- | --- | --- |
| 5000 | ENTREPRENEURSHIP PREPARATORY SCHOOL - WOODLAND... | Ohio | 3901406 | CUYAHOGA COUNTY | OH | CLEVELAND COLLEGIATE PREPARATORY SCHOOL |
| 5001 | ENUMCLAW SCHOOL DISTRICT | Washington | 5300001 | KING COUNTY | WA | ENUMCLAW SCHOOL DISTRICT |
| 5002 | ENVIRONMENT COMMUNITY OPPORTUNITY (ECO) CHARTE... | New Jersey | 3400079 | CAMDEN COUNTY | NJ | ENVIRONMENT COMMUNITY OPP CS |
| 5003 | ENVIRONMENTAL CHARTER SCHOOL AT FRICK PARK | Pennsylvania | 4200812 | ALLEGHENY COUNTY | PA | ENVIRONMENTAL CHARTER SCHOOL AT FRICK PARK |
| 5004 | ENVISIONS LEVEL III SCH PROG | Nebraska | 3100156 | MADISON COUNTY | NE | ENVISIONS LEVEL III SCH PROG |
| 5005 | EPHRATA AREA SD | Pennsylvania | 4209270 | LANCASTER COUNTY | PA | EPHRATA AREA SD |
| 5006 | EPHRATA SCHOOL DISTRICT | Washington | 5302610 | GRANT COUNTY | WA | EPHRATA SCHOOL DISTRICT |
| 5007 | EPPING SAU OFFICE | New Hampshire | 3399914 | ROCKINGHAM COUNTY | NH | EPPING SAU OFFICE |
| 5008 | EPPING SCHOOL DISTRICT | New Hampshire | 3302880 | ROCKINGHAM COUNTY | NH | EPPING SCHOOL DISTRICT |
| 5009 | EPSOM SCHOOL DISTRICT | New Hampshire | 3302910 | MERRIMACK COUNTY | NH | EPSOM SCHOOL DISTRICT |


In [186]:
# DO NOT TRY THIS - unless you have pandas installed
# just for your information 
df_districts[df_districts['Agency Name'].str.lower().str.contains('rochelle')]

| Unnamed: 0 | Agency Name | State Name [District] Latest available year | Agency ID | County Name [District] 2010-11 | State Abbr [District] Latest available year | Agency Name [District] 2010-11 |
| --- | --- | --- | --- | --- | --- | --- |
| 10994 | NEW ROCHELLE CITY SCHOOL DISTRICT | New York | 3620490 | WESTCHESTER COUNTY | NY | NEW ROCHELLE CITY SCHOOL DISTRICT |
| 13606 | ROCHELLE CCSD 231 | Illinois | 1734260 | OGLE COUNTY | IL | ROCHELLE CCSD 231 |
| 13607 | ROCHELLE ISD | Texas | 4837500 | MCCULLOCH COUNTY | TX | ROCHELLE ISD |
| 13608 | ROCHELLE PARK SCHOOL DISTRICT | New Jersey | 3414070 | BERGEN COUNTY | NJ | ROCHELLE PARK |
| 13609 | ROCHELLE TWP HSD 212 | Illinois | 1734290 | OGLE COUNTY | IL | ROCHELLE TWP HSD 212 |


In [187]:
# read the file and get the districts w/o pandas 
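districts = list()
# newline='' lets the csv module handle line endings itself
# (the old 'rU' universal-newline mode is no longer accepted by recent Python 3)
with open("districts.csv", newline='') as f:
    csv_reader = csv.reader(f, dialect='excel')
    for line in csv_reader:
        districts.append(line[0].lower())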


In [189]:
len(districts)

18046

### try to find the school district where Amit went to high school in Moraga

* find the top 5 best matches


In [194]:
s1 = "acalanes"  

# enter your matching function below
close_matches = difflib.get_close_matches(s1,districts)
print(close_matches)



['lane', 'lane esd', 'coalgate']


In [195]:
process.extract(s1, districts)

[('acalanes union high', 90),
 ('lane', 90),
 ('alva', 68),
 ('clay', 68),
 ('gans', 68)]
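
A likely reason 'lane' ties with 'acalanes union high' is partial matching: when two strings differ a lot in length, a scorer can compare the shorter one against the best-fitting piece of the longer one. The sketch below is a simplified sliding-window version built on `difflib`, not the fuzzywuzzy algorithm; `partial_similarity` is a hypothetical name. The scores of 90 above would be consistent with a perfect partial score that is then scaled down, but that scaling is an assumption about the default scorer and is not shown here.

```python
import difflib


def partial_similarity(a, b):
    """Best score (0-100) of the shorter string against any same-length window of the longer one."""
    shorter, longer = sorted((a.lower(), b.lower()), key=len)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, difflib.SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(partial_similarity("acalanes", "acalanes union high"))  # 100
print(partial_similarity("acalanes", "lane"))                 # 100
```

Both candidates contain (or are contained in) the query, so a partial score alone cannot tell them apart.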

### try to find the school district where Amit's kids go to school in New Rochelle

* find the top 5 results
* do they look like the above 5 or not?
* why or why not?
* how would you fix it?  (hint: read the docs, try a different scoring function)

In [196]:
s1 = "new rochelle"  

# enter your matching function below
close_matches = difflib.get_close_matches(s1,districts)
print(close_matches)

['lone rock elem', 'rochelle isd', 'new hope elem']


In [197]:
process.extract(s1, districts)

[('new rochelle city school district', 90),
 ('academy of new media middle', 86),
 ('achievement first east new york charter school', 86),
 ('bedford stuyvesant new beginnings charter school', 86),
 ('dike-new hartford comm school district', 86)]

In [198]:
process.extract(s1, districts, scorer=fuzz.token_sort_ratio)

[('rochelle isd', 75),
 ('newhall', 63),
 ('rochelle ccsd 231', 62),
 ('kane roe', 60),
 ('new west school', 59)]
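
Token sorting penalizes every extra word in the longer name, which is why the full district name drops out of the list. A token-set style score is one candidate for the "different scoring function" in the hint: it compares the shared tokens against each string, so a query whose words all appear in the candidate scores high. The sketch below is a simplified `difflib` version under that assumption, not the fuzzywuzzy implementation; `token_set_similarity` and `_ratio` are hypothetical names.

```python
import difflib


def _ratio(a, b):
    """Plain similarity of two strings on a 0-100 scale."""
    return int(round(100 * difflib.SequenceMatcher(None, a, b).ratio()))


def token_set_similarity(a, b):
    """Score based on shared tokens versus each string's leftover tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()  # shared tokens, then a's extras
    combined_b = (common + " " + only_b).strip()  # shared tokens, then b's extras
    return max(
        _ratio(common, combined_a),
        _ratio(common, combined_b),
        _ratio(combined_a, combined_b),
    )


print(token_set_similarity("new rochelle", "new rochelle city school district"))  # 100
print(token_set_similarity("new rochelle", "rochelle isd"))                       # 80
```

With this kind of score the full New Rochelle name outranks 'rochelle isd', at the cost of treating any candidate that contains all query words as a perfect match.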