<a href="https://colab.research.google.com/github/TurkuNLP/Text_Mining_Course/blob/master/simstring_phon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phonetic mapping of names

* Phonetic encodings capture regularities in name transliterations
* Each name is mapped to a representation that is (hopefully) shared by all of its transliterations
* E.g. for Arabic names, you might want to drop the vowels
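
The vowel-dropping idea from the list above can be sketched in a few lines. This is only an illustration, not the encoder used in the rest of this notebook: `consonant_key` is a hypothetical helper, and treating `Y` as a vowel is an assumption.

```python
def consonant_key(name):
    """Hypothetical helper: uppercase the name, keep letters only, drop vowels."""
    vowels = set("AEIOUY")  # assumption: Y is treated as a vowel
    return "".join(ch for ch in name.upper() if ch.isalpha() and ch not in vowels)

# Two common transliterations of the same name end up with the same key
print(consonant_key("Mohammed"))
print(consonant_key("Muhammad"))
```

Both calls print `MHMMD`, so the two spellings would be indexed under a single key.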





In [1]:
!pip install abydos
# prebuilt simstring wheel for CPython 3.7 on Linux x86_64 (see the filename)
!pip install http://dl.turkunlp.org/textual-data-analysis-course-data/simstring-1.1-cp37-cp37m-linux_x86_64.whl


In [2]:
import simstring
import abydos

In [5]:
import abydos.phonetic
# Metaphone reduces a name to a short, mostly consonantal phonetic code
encoder=abydos.phonetic.Metaphone()
print(encoder.encode("Daniil Shafran"))
print(encoder.encode("Daniel Shafran"))
print(encoder.encode("Taniel Shafran"))
print(encoder.encode("Tamiel Shafran"))


print(encoder.encode("Evgeni Mravinsky"))
print(encoder.encode("Yevgeny Mravinsky"))
print(encoder.encode("Yevgeny Mriavinsky"))

TNLXFRN
TNLXFRN
TNLXFRN
TMLXFRN
EFJNMRFNSK
YFJNMRFNSK
YFJNMRFNSK


In [6]:
!wget -nc http://dl.turkunlp.org/textual-data-analysis-course-data/wikidata.fi.bz2

--2021-03-10 17:13:27--  http://dl.turkunlp.org/textual-data-analysis-course-data/wikidata.fi.bz2
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56380139 (54M) [application/octet-stream]
Saving to: ‘wikidata.fi.bz2’


2021-03-10 17:13:33 (9.54 MB/s) - ‘wikidata.fi.bz2’ saved [56380139/56380139]



In [7]:
from tqdm import tqdm #progress-bar
import pickle
import os
import bz2


os.makedirs("wikidata_phon.db",exist_ok=True)
db=simstring.writer("wikidata_phon.db/wikidata_phon.db")

name_mapping={} #phonetic -> set(names)

with bz2.open("wikidata.fi.bz2","rt") as f:
    for line in tqdm(f):
        line=line.strip()
        # 4-column TSV: string, two URLs, and the official label
        # we index the phonetic encoding of the string
        s,url1,url2,label=line.split("\t")
        encoded_s=encoder.encode(s)
        if encoded_s not in name_mapping: # a new phonetic code: insert it only once
            db.insert(encoded_s)
        name_mapping.setdefault(encoded_s,set()).add(s) # remember the original string
db.close()

#store the name mapping 
with open("wikidata_phon.db/name_mapping.pickle","wb") as f:
    pickle.dump(name_mapping,f)

3995911it [02:31, 26308.03it/s]


In [8]:
import pickle
with open("wikidata_phon.db/name_mapping.pickle","rb") as f:
    name_mapping=pickle.load(f)
    
db=simstring.reader("wikidata_phon.db/wikidata_phon.db")
db.metric=simstring.cosine

In [9]:
print(len(name_mapping))

2271840


In [12]:
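def retrieve_phon(s,db,threshold,encoder,name_mapping):
    # encode the query, retrieve similar phonetic codes, and map them back to names
    db.threshold=threshold
    phon_hits=db.retrieve(encoder.encode(s))
    return [name_mapping[ph] for ph in phon_hits] # one set of names per matching code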
print(retrieve_phon("Tarja Halunen",db,0.9,encoder,name_mapping)) #success
print(retrieve_phon("Vlodymyr Puutin",db,0.9,encoder,name_mapping)) #success
print(retrieve_phon("Oleg Gordyievskyi",db,0.6,encoder,name_mapping)) #fail
print(retrieve_phon("Oleg Kordievski",db,0.8,encoder,name_mapping)) #fail
print(retrieve_phon("Oleg Gordyievskyi",db,0.8,encoder,name_mapping)) #fail
print(retrieve_phon("Oleg Gordievski",db,0.8,encoder,name_mapping)) #success

[{'Tarja Helanen', 'Tarja Halonen'}]
[{'Vladimir Putin'}, {'Vladimir Potanin'}]
[{'Toya fusca'}, {'Oli kord', 'Olokurto', 'Oelgard'}, {'Bradya fusca'}, {'Gordeyevsky'}]
[{'Oleg Ogorodov'}]
[]
[{'Olga Rotševa'}, {'Gratofsky', 'Carte physique', 'Gardavská'}, {'Luokka:Eredivisie'}, {'Oleg Gordievski'}]
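
`retrieve_phon` returns one set of names per matching phonetic code. For inspection, a flat sorted list can be easier to read. A minimal sketch, where `flatten_hits` is a hypothetical helper that is not part of the notebook:

```python
def flatten_hits(hits):
    """Hypothetical helper: merge a list of name sets into one sorted list."""
    merged = set()
    for names in hits:
        merged.update(names)
    return sorted(merged)

# Example input copied from the output above
hits = [{'Vladimir Putin'}, {'Vladimir Potanin'}]
print(flatten_hits(hits))
```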


## Possible improvements

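* Phonetic encodings compress the names by removing most vowels, so the indexed strings are short
* The 3-grams simstring uses as features might be too long for such short strings; shorter n-grams may match better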
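
To see why n-gram length matters for short codes, here is a rough pure-Python sketch of a set-based cosine similarity over character n-grams. It does not call simstring, and simstring's own feature extraction may differ in details (for example in how it treats repeated n-grams), so the numbers are only indicative. `char_ngrams` and `ngram_cosine` are names made up for this example.

```python
import math

def char_ngrams(s, n):
    """Set of character n-grams of s (no begin/end markers)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_cosine(a, b, n):
    """Set-based cosine similarity: |X & Y| / sqrt(|X| * |Y|)."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

# Metaphone codes taken from the encoder output earlier in this notebook
a, b = "TNLXFRN", "TMLXFRN"
for n in (3, 2):
    print(n, round(ngram_cosine(a, b, n), 3))
```

With these two codes, which differ in a single character, the 3-gram similarity is 0.6 and the 2-gram similarity is about 0.667: one substitution destroys a larger share of the features when the n-grams are longer.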