<a href="https://colab.research.google.com/github/TurkuNLP/Text_Mining_Course/blob/master/simstring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simstring

* A very (very!) fast approximate string matching algorithm and its implementation: [paper](https://www.researchgate.net/publication/221101753_Simple_and_Efficient_Algorithm_for_Approximate_Dictionary_Matching), [code](https://github.com/chokkan/simstring)
* Lets you index a large number of strings (100+ million is not a problem)
* The index can then be queried for strings that approximately match your query string
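
To make "approximate hits" more concrete, here is a minimal pure-Python sketch of the general idea: strings are compared through their overlapping character n-grams. This is a simplified illustration only (set-based trigrams, no padding) and not SimString's actual implementation; the helper names `char_ngrams` and `cosine_similarity` are made up for this example.

```python
import math

def char_ngrams(text, n=3):
    """Return the set of character n-grams of text (no padding, duplicates collapsed)."""
    if len(text) < n:
        # Strings shorter than n are treated as a single feature
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def cosine_similarity(a, b, n=3):
    """Set-based cosine similarity: |X & Y| / sqrt(|X| * |Y|)."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return len(x & y) / math.sqrt(len(x) * len(y))

# Two names that differ by one letter share most of their trigrams...
print(cosine_similarity("Tarja Halonen", "Tarja Salonen"))
# ...while two unrelated names share none.
print(cosine_similarity("Tarja Halonen", "Jean Sibelius"))
```

A query then amounts to finding all indexed strings whose similarity to the query reaches a chosen threshold; what the library adds is doing this quickly over millions of strings.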

## Installation

* simstring needs a little effort to get running
    * the necessary build steps are described here: https://github.com/chokkan/simstring/issues/27
    * they produce a file called `simstring-1.1-cp37-cp37m-linux_x86_64.whl` (on current Colab; the name differs with the Python version and platform, and `cp37` stands for CPython 3.7)
    * you don't need to follow these steps yourself; if you want to, it is possible on Colab, but you will first need to run `! apt install automake swig`
* I pre-compiled simstring for you, so you can install it directly from http://dl.turkunlp.org/textual-data-analysis-course-data/simstring-1.1-cp37-cp37m-linux_x86_64.whl
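
Since the pre-compiled wheel is tied to one Python version, it can help to check the running interpreter against the wheel's tag before installing. The sketch below relies only on the standard wheel filename layout (`name-version[-build]-python-abi-platform.whl`); the helper names are hypothetical, and `interpreter_tag` assumes a CPython interpreter.

```python
import sys

def wheel_python_tag(filename):
    """Extract the Python tag (e.g. 'cp37') from a wheel filename."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    # Layout: name-version[-build]-python-abi-platform,
    # so the Python tag is the third field from the end
    return stem.split("-")[-3]

def interpreter_tag(version_info=sys.version_info):
    """Build the matching tag for a CPython interpreter (assumption: CPython)."""
    return "cp{}{}".format(version_info[0], version_info[1])

wheel = "simstring-1.1-cp37-cp37m-linux_x86_64.whl"
print("wheel built for: ", wheel_python_tag(wheel))
print("this interpreter:", interpreter_tag())
```

If the two tags differ, pip will most likely refuse the wheel, and building from source as described in the linked issue is the fallback.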

In [13]:
!pip install http://dl.turkunlp.org/textual-data-analysis-course-data/simstring-1.1-cp37-cp37m-linux_x86_64.whl



In [14]:
import simstring
help(simstring)

Help on module simstring:

NAME
    simstring

DESCRIPTION
    # This file was automatically generated by SWIG (http://www.swig.org).
    # Version 3.0.12
    #
    # Do not make changes to this file unless you know what you are doing--modify
    # the SWIG interface file instead.

CLASSES
    builtins.object
        StringVector
        SwigPyIterator
        reader
        writer
    
    class StringVector(builtins.object)
     |  StringVector(*args)
     |  
     |  Methods defined here:
     |  
     |  __bool__(self)
     |  
     |  __del__ lambda self
     |  
     |  __delitem__(self, *args)
     |  
     |  __delslice__(self, i, j)
     |  
     |  __getattr__ lambda self, name
     |  
     |  __getitem__(self, *args)
     |  
     |  __getslice__(self, i, j)
     |  
     |  __init__(self, *args)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  __iter__(self)
     |  
     |  __len__(self)
     |  
     |  __nonzero__(self)
     | 

# Usage

* Index a number of strings
* Use the index for lookup
* Simple test first:
  * the names of 100 famous Finns
  * (you can pick any other data you want for these tests)

In [15]:
!wget -nc http://dl.turkunlp.org/textual-data-analysis-course-data/names.txt.bz2
!bzcat names.txt.bz2 | head -n 10



File ‘names.txt.bz2’ already there; not retrieving.

Mikael Agricola
Adolf Ehrnrooth
Tarja Halonen
Urho Kekkonen
Aleksis Kivi
Elias Lönnrot
Kalevala
C. G. E. Mannerheim
Risto Ryti
Jean Sibelius


In [16]:
import bz2
# Read the bz2-compressed file in text mode, one name per line
with bz2.open("names.txt.bz2","rt") as f:
    names=f.read().splitlines()
print(len(names))
print(names)

101
['Mikael Agricola', 'Adolf Ehrnrooth', 'Tarja Halonen', 'Urho Kekkonen', 'Aleksis Kivi', 'Elias Lönnrot', 'Kalevala', 'C. G. E. Mannerheim', 'Risto Ryti', 'Jean Sibelius', 'Arvo Ylppö', 'Matti Nykänen', 'Väinö Myllyrinne', 'Ville Valo', 'Lalli', 'Väinö Linna', 'Linus Torvalds', 'Spede Pasanen', 'Pentti Linkola', 'Tove Jansson', 'Veikko Hursti', 'Paavo Nurmi', 'Minna Canth', 'Juho Kusti Paasikivi', 'J. V. Snellman', 'Hertta Kuusinen', 'Arto Saari', 'Miina Sillanpää', 'Väinö Tanner', 'Lucina Hagman', 'Kristfrid Ganander', 'Mika Waltari', 'Mika Häkkinen', 'Alvar Aalto', 'Eugen Schauman', 'Tapio Rautavaara', 'Eino Leino', 'Jaakko Pöyry', 'Otto Wille Kuusinen', 'Juice Leskinen', 'Anders Chydenius', 'Uno Cygnaeus', 'Jari Litmanen', 'Katri Helena Kalaoja', 'Fanni Luukkonen', 'Anneli Jäätteenmäki', 'Karl Fazer', 'K. J. Ståhlberg', 'Mauno Koivisto', 'Helene Schjerfbeck', 'Reino Helismaa', 'Jorma Ollila', 'Lauri Törni', 'Georg Henrik von Wright', 'Arndt Pekurinen', 'Tauno Palo', 'Akseli Gall

In [17]:
# Index
import os
os.makedirs("test_names",exist_ok=True) # the index consists of several files, keep them in their own directory
db=simstring.writer("test_names/test_names.db")
for name in names:
    db.insert(name) # add one string to the index
db.close() # finalize the index and write it to disk


In [18]:
# Query
db=simstring.reader("test_names/test_names.db")
db.measure=simstring.cosine # similarity measure used for matching
db.threshold=0.5 # minimum similarity required for a hit
print(db.retrieve("Karita Matila"))
print(db.retrieve("Arvi Lund")) # short name: a single changed letter lowers the similarity steeply!

('Karita Mattila',)
('Arvi Lind',)
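
The remark about short names can be illustrated with a small sketch. One character edit touches up to three trigrams, which is a large share of a short string and a small share of a long one. The code assumes plain, unpadded, set-based trigrams, which is a simplification of whatever features SimString extracts internally; `trigrams` and `shared_fraction` are hypothetical helpers, and the inputs are assumed to be at least three characters long.

```python
def trigrams(text):
    """Set of character trigrams (assumes len(text) >= 3)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def shared_fraction(a, b):
    """Fraction of a's trigrams that also occur in b."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta)

# Each pair differs by a single character edit
for query, target in [("Arvi Lund", "Arvi Lind"),
                      ("Karita Matila", "Karita Mattila")]:
    print(query, "->", target, round(shared_fraction(query, target), 2))
```

The shorter query keeps a much smaller share of its trigrams, which is why a threshold that works for long names can already be too strict for short ones.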


In [19]:
db.retrieve("Kaarlo Juho Stahlberg")

()

In [20]:
db.retrieve("Kaarlo Juho Ståhlberg")

('K. J. Ståhlberg',)

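* The queries above show the limitations: matching happens purely on the character level, so one changed letter costs a short name a lot of similarity, and writing "Stahlberg" for "Ståhlberg" is enough to lose the hit for the abbreviated "K. J. Ståhlberg"
* Generic string matching needs further customization (normalization, handling of abbreviations, and so on)
* simstring is a great, fast first filter in front of more costly techniques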
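
One possible customization for the "Stahlberg" vs. "Ståhlberg" miss is to fold diacritics away on both sides, that is, to apply the same normalization to every string before indexing and to every query before lookup. The sketch below uses only the standard `unicodedata` module; `fold_diacritics` is a hypothetical helper, not part of simstring.

```python
import unicodedata

def fold_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_diacritics("K. J. Ståhlberg"))  # K. J. Stahlberg
print(fold_diacritics("Elias Lönnrot"))    # Elias Lonnrot
```

Folding is lossy: in Finnish "a" and "ä" are different letters, so distinct names can collapse into one, and you need to keep a mapping from the folded form back to the original strings. It also does nothing for the "Kaarlo Juho" vs. "K. J." mismatch, which needs separate handling.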

# Larger index

* all Finnish wikidata strings

In [21]:
!wget -nc http://dl.turkunlp.org/textual-data-analysis-course-data/wikidata.fi.bz2

--2021-03-10 16:11:46--  http://dl.turkunlp.org/textual-data-analysis-course-data/wikidata.fi.bz2
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56380139 (54M) [application/octet-stream]
Saving to: ‘wikidata.fi.bz2.1’


2021-03-10 16:11:50 (16.6 MB/s) - ‘wikidata.fi.bz2.1’ saved [56380139/56380139]



In [22]:
!bzcat wikidata.fi.bz2 | head -n 30

Belgia	https://fi.wikipedia.org/wiki/Belgia	https://en.wikipedia.org/wiki/Belgium	Belgia
Belgian kuningaskunta	https://fi.wikipedia.org/wiki/Belgia	https://en.wikipedia.org/wiki/Belgium	Belgia
onnellisuus	https://fi.wikipedia.org/wiki/Onnellisuus	https://en.wikipedia.org/wiki/Happiness	onnellisuus
:)	https://fi.wikipedia.org/wiki/Onnellisuus	https://en.wikipedia.org/wiki/Happiness	onnellisuus
George Washington	https://fi.wikipedia.org/wiki/George_Washington	https://en.wikipedia.org/wiki/George_Washington	George Washington
Jack Bauer	https://fi.wikipedia.org/wiki/Jack_Bauer	https://en.wikipedia.org/wiki/Jack_Bauer	Jack Bauer
Douglas Adams	https://fi.wikipedia.org/wiki/Douglas_Adams	https://en.wikipedia.org/wiki/Douglas_Adams	Douglas Adams
Paul Otlet	https://fi.wikipedia.org/wiki/Paul_Otlet	https://en.wikipedia.org/wiki/Paul_Otlet	Paul Otlet
Wikidata	https://fi.wikipedia.org/wiki/Wikidata	https://en.wikipedia.org/wiki/Wikidata	Wikidata
Portugali	https://fi.wikipedia.org/wiki/Portugali	ht

In [23]:
from tqdm import tqdm #progress-bar
os.makedirs("wikidata.db",exist_ok=True)
db=simstring.writer("wikidata.db/wikidata.db")
with bz2.open("wikidata.fi.bz2","rt") as f:
    for line in tqdm(f):
        line=line.strip()
        # 4-col file with string, two urls, and official label
        # let us index the strings
        s,url1,url2,label=line.split("\t")
        db.insert(s)
db.close()

3995911it [01:59, 33546.48it/s]


* About 4M strings indexed in about 2 minutes
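
Only the first column goes into the index, so a hit is just a string. To get from a hit back to its official label, one option is to keep a separate lookup table next to the index. The following is a minimal sketch using a plain dictionary; `build_lookup` is a hypothetical helper, and the sample rows are copied from the file preview above. For the full 4M-line file an in-memory dictionary costs a fair amount of RAM, so treat this as an illustration rather than a recommendation.

```python
from collections import defaultdict

def build_lookup(lines):
    """Map each indexed string to the set of official labels it occurs with."""
    lookup = defaultdict(set)
    for line in lines:
        # 4 tab-separated columns: string, two URLs, official label
        s, url_fi, url_en, label = line.rstrip("\n").split("\t")
        lookup[s].add(label)
    return lookup

sample = [
    "Belgia\thttps://fi.wikipedia.org/wiki/Belgia\thttps://en.wikipedia.org/wiki/Belgium\tBelgia",
    "Belgian kuningaskunta\thttps://fi.wikipedia.org/wiki/Belgia\thttps://en.wikipedia.org/wiki/Belgium\tBelgia",
    ":)\thttps://fi.wikipedia.org/wiki/Onnellisuus\thttps://en.wikipedia.org/wiki/Happiness\tonnellisuus",
]
lookup = build_lookup(sample)
print(lookup["Belgian kuningaskunta"])  # {'Belgia'}
print(lookup[":)"])                     # {'onnellisuus'}
```

A set is used as the value because the same surface string can, in principle, belong to more than one entry.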

In [24]:
db=simstring.reader("wikidata.db/wikidata.db")
# measure and threshold are not set here, so the library defaults apply
db.retrieve("Tarja Halonen")

('Ona Halonen',
 'Esa Halonen',
 'Mia Halonen',
 'Tarja Halonen',
 'Tarja Salonen',
 'Tuija Halonen',
 'Maija Halonen',
 'Luokka:Tarja Halonen',
 'Tarja Kaarina Halonen')

In [30]:
db.retrieve("Turun yliopisto")

('yliopisto',
 'Turun yliopisto',
 'Oulun yliopisto',
 'Oulun yliopisto',
 'Ségoun yliopisto',
 'Tohokun yliopisto',
 'Turun kesäyliopisto',
 'Luokka:Turun yliopisto',
 'Turun yliopiston kuoro',
 'Malline:Turun yliopisto',
 'Turun yliopistosäätiö',
 'Turun yliopiston alue 10',
 'Turun yliopiston kuoro ry',
 'Turun yliopiston kirjasto',
 'Kansatiede Turun yliopisto',
 'Turun yliopiston kasvimuseo',
 'Turun yliopiston eläinmuseo')
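
The hit list above is a good example of simstring acting as a first filter: it returns a short list of candidates, which a slower but more precise method can then re-rank. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in for such a "more costly technique"; this choice, and the helper name `rerank`, are assumptions made for illustration. The candidates are a subset of the output above.

```python
from difflib import SequenceMatcher

def rerank(query, candidates):
    """Sort unique candidates by difflib similarity to the query, best first."""
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    scored = [(SequenceMatcher(None, query, c).ratio(), c) for c in unique]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

candidates = ("yliopisto", "Turun yliopisto", "Oulun yliopisto", "Oulun yliopisto",
              "Turun kesäyliopisto", "Turun yliopiston kirjasto")
for score, name in rerank("Turun yliopisto", candidates)[:3]:
    print(round(score, 2), name)
```

Because the expensive comparison only runs on the handful of candidates simstring returns, its cost stays small even when the index holds millions of strings.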

## Lookup speed

In [33]:
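from tqdm import tqdm
# Time lookups by querying the index with lines from the data file.
# Note: the whole tab-separated line (string, URLs, label) is used as the query.
with bz2.open("wikidata.fi.bz2","rt") as f:
    for i,line in enumerate(tqdm(f)):
        line=line.strip()
        db.retrieve(line)
        if i==10000: # stop after about 10k queries
            break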


6545it [00:07, 887.40it/s]


KeyboardInterrupt: ignored
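
The progress bar above reports roughly 887 lookups per second before the run was interrupted. If you want a number that does not depend on tqdm's display, a small timing helper is enough. The sketch below is generic: `measure_throughput` is a hypothetical helper that times any callable, and `str.lower` is only a stand-in so the example runs without an index. With a real index you would pass `db.retrieve` instead.

```python
import time

def measure_throughput(lookup, queries):
    """Run lookup(q) for every query and return (count, queries per second)."""
    start = time.perf_counter()
    count = 0
    for q in queries:
        lookup(q)
        count += 1
    elapsed = time.perf_counter() - start
    return count, (count / elapsed if elapsed > 0 else float("inf"))

# Stand-in lookup function; replace str.lower with db.retrieve for a real measurement
count, qps = measure_throughput(str.lower, ["Tarja Halonen", "Turun yliopisto"] * 1000)
print(f"{count} queries, {qps:.0f} per second")
```

Throughput depends on the measure, the threshold, and the length of the queries, so it is worth timing with queries that resemble your real data.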