<a href="https://colab.research.google.com/github/TurkuNLP/Text_Mining_Course/blob/master/simstring_phon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phonetic mapping of names

* Phonetic encodings capture regularities in how names are transliterated
* Map each name to a representation that is (hopefully) shared by all of its transliterations
* E.g. for Arabic names, you might want to drop the vowels
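
The vowel-dropping idea from the list above can be sketched in a few lines. This is only a toy illustration, not the encoder used in the rest of the notebook: the function name `consonant_skeleton`, the vowel set `aeiou`, and the choice to collapse doubled letters are all assumptions made for the example.

```python
import unicodedata

VOWELS = set("aeiou")  # assumption: treat only these letters as vowels


def consonant_skeleton(name):
    """Toy phonetic key: lowercase, strip accents, collapse doubled letters, drop vowels."""
    # NFKD splits accented letters into base letter + combining mark;
    # isalpha() then filters out the marks, spaces and punctuation
    letters = [ch for ch in unicodedata.normalize("NFKD", name.lower()) if ch.isalpha()]
    key = []
    prev = None
    for ch in letters:
        # skip vowels and letters that merely repeat the previous letter
        if ch != prev and ch not in VOWELS:
            key.append(ch)
        prev = ch
    return "".join(key)


for variant in ["Mohammed", "Muhammad", "Mohamed"]:
    print(variant, "->", consonant_skeleton(variant))
```

All three spellings end up with the same key, which is the property we want from a phonetic mapping. A real encoder such as Metaphone applies many more rules than this.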





In [2]:
!pip install abydos
!pip install http://dl.turkunlp.org/textual-data-analysis-course-data/simstring-1.1-cp37-cp37m-linux_x86_64.whl

Collecting simstring==1.1
  Downloading http://dl.turkunlp.org/textual-data-analysis-course-data/simstring-1.1-cp37-cp37m-linux_x86_64.whl (893kB)
     |████████████████████████████████| 901kB 261kB/s
Installing collected packages: simstring
Successfully installed simstring-1.1


In [3]:
import simstring
import abydos

In [6]:
import abydos.phonetic
encoder=abydos.phonetic.Metaphone()
print(encoder.encode("Daniil Shafran"))
print(encoder.encode("Daniel Shafran"))
print(encoder.encode("Evgeni Mravinsky"))
print(encoder.encode("Yevgeny Mravinsky"))
print(encoder.encode("Yevgeny Mriavinsky"))

TNLXFRN
TNLXFRN
EFJNMRFNSK
YFJNMRFNSK
YFJNMRFNSK


In [7]:
!wget -nc http://dl.turkunlp.org/textual-data-analysis-course-data/wikidata.fi.bz2

--2021-03-10 16:56:45--  http://dl.turkunlp.org/textual-data-analysis-course-data/wikidata.fi.bz2
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56380139 (54M) [application/octet-stream]
Saving to: ‘wikidata.fi.bz2’


2021-03-10 16:56:49 (15.6 MB/s) - ‘wikidata.fi.bz2’ saved [56380139/56380139]



In [8]:
from tqdm import tqdm #progress-bar
import pickle
import os
import bz2


os.makedirs("wikidata_phon.db",exist_ok=True)
db=simstring.writer("wikidata_phon.db/wikidata_phon.db")

name_mapping={} #phonetic -> set(names)

with bz2.open("wikidata.fi.bz2","rt") as f:
    for line in tqdm(f):
        line=line.strip()
        # each line has 4 tab-separated columns: string, two urls, and official label
        # index the phonetic encoding of the string
        s,url1,url2,label=line.split("\t")
        encoded_s=encoder.encode(s)
        if encoded_s not in name_mapping: #a new phonetic code, not yet in the db
            db.insert(encoded_s)
        name_mapping.setdefault(encoded_s,set()).add(s) #remember which names share this code
db.close()

#store the name mapping 
with open("wikidata_phon.db/name_mapping.pickle","wb") as f:
    pickle.dump(name_mapping,f)

3995911it [02:14, 29650.65it/s]


In [9]:
import pickle
with open("wikidata_phon.db/name_mapping.pickle","rb") as f:
    name_mapping=pickle.load(f)
    
db=simstring.reader("wikidata_phon.db/wikidata_phon.db")
db.metric=simstring.cosine

In [11]:
def retrieve_phon(s,db,threshold,encoder,name_mapping):
    """Return the sets of names whose phonetic code is similar to that of s."""
    db.threshold=threshold
    phon_hits=db.retrieve(encoder.encode(s)) #phonetic codes above the threshold
    return [name_mapping[ph] for ph in phon_hits]

print(retrieve_phon("Tarja Halunen",db,0.9,encoder,name_mapping)) #success
print(retrieve_phon("Vlodymyr Puutin",db,0.9,encoder,name_mapping)) #success
print(retrieve_phon("Oleg Gordyievskyi",db,0.9,encoder,name_mapping)) #fail
print(retrieve_phon("Oleg Kordievski",db,0.8,encoder,name_mapping)) #fail
print(retrieve_phon("Oleg Gordyievskyi",db,0.8,encoder,name_mapping)) #fail
print(retrieve_phon("Oleg Gordievski",db,0.8,encoder,name_mapping)) #found, but with noisy extra hits

[{'Tarja Halonen', 'Tarja Helanen'}]
[{'Vladimir Putin'}, {'Vladimir Potanin'}]
[]
[{'Oleg Ogorodov'}]
[]
[{'Olga Rotševa'}, {'Gardavská', 'Carte physique', 'Gratofsky'}, {'Luokka:Eredivisie'}, {'Oleg Gordievski'}]


## Possible improvements

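* Phonetic encodings compress the names by removing many vowels, so the indexed strings are short
* For such short strings, the 3-grams used in simstring might be too long: a single differing character changes a large share of the features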
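
To get a feeling for the n-gram length issue, here is a minimal pure-Python sketch that compares two of the Metaphone codes printed earlier using 2-grams and 3-grams. The helpers `char_ngrams` and `set_cosine` are written for this example and are not part of simstring; simstring's own feature extraction differs in details (for instance in how it treats repeated n-grams and string boundaries), so the numbers are only illustrative.

```python
import math


def char_ngrams(s, n):
    """Set of character n-grams of s (no padding, repeated n-grams merged)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def set_cosine(x, y):
    """Cosine similarity of two feature sets: |X & Y| / sqrt(|X| * |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


# Metaphone codes of "Evgeni Mravinsky" and "Yevgeny Mravinsky"
a, b = "EFJNMRFNSK", "YFJNMRFNSK"
for n in (2, 3):
    sim = set_cosine(char_ngrams(a, n), char_ngrams(b, n))
    print(f"n={n}: cosine={sim:.3f}")
```

In this simplified setting the single differing first letter costs slightly more similarity with 3-grams than with 2-grams, and the effect grows as the codes get shorter.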