<a href="https://colab.research.google.com/github/TurkuNLP/Text_Mining_Course/blob/master/simstring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simstring

* A very (very!) fast approximate string matching library
* Allows you to index a large number of strings (100+ million not a problem)
* Can be queried for approximate hits of your query string

## Installation

* simstring takes a little effort to get running
    * the build steps are described here: https://github.com/chokkan/simstring/issues/27
    * they produce a wheel named `simstring-1.1-cp37-cp37m-linux_x86_64.whl` (on the current Colab; the name changes with the Python version and platform)
    * you do not need to follow these steps yourself; if you want to, building on Colab works, but you first need to run `! apt install automake swig`
* I pre-compiled simstring for you; you can install it directly: http://dl.turkunlp.org/textual-data-analysis-course-data/simstring-1.1-cp37-cp37m-linux_x86_64.whl
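
The wheel name encodes the interpreter and platform it was built for (`cp37` and `linux_x86_64` here), so installing it on a different Python version is likely to fail. A minimal sketch of a pre-install check follows; the helper name `wheel_matches` is made up for this example, and the comparison is deliberately simplified (it ignores the ABI tag and `manylinux`-style platform tags).

```python
import platform
import sys


def wheel_matches(wheel_name, version_info=None, system=None, machine=None):
    """Return True if a wheel's filename tags fit the given (or running) interpreter."""
    version_info = version_info or sys.version_info
    system = system or platform.system()
    machine = machine or platform.machine()
    # Wheel filename layout: name-version-pythontag-abitag-platformtag.whl
    stem = wheel_name[:-len(".whl")]
    python_tag, _abi_tag, platform_tag = stem.split("-")[-3:]
    expected_python = "cp{}{}".format(version_info[0], version_info[1])
    expected_platform = "{}_{}".format(system.lower(), machine)
    return python_tag == expected_python and platform_tag == expected_platform


wheel = "simstring-1.1-cp37-cp37m-linux_x86_64.whl"
print(wheel_matches(wheel))  # checks against the interpreter running this cell
```

If the check prints `False`, building from source as described in the linked issue is probably the way to go.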

In [1]:
!pip install http://dl.turkunlp.org/textual-data-analysis-course-data/simstring-1.1-cp37-cp37m-linux_x86_64.whl

Collecting simstring==1.1
  Downloading http://dl.turkunlp.org/textual-data-analysis-course-data/simstring-1.1-cp37-cp37m-linux_x86_64.whl (893kB)
     |████████████████████████████████| 901kB 260kB/s 
Installing collected packages: simstring
Successfully installed simstring-1.1


In [17]:
import simstring
help(simstring)

Help on module simstring:

NAME
    simstring

DESCRIPTION
    # This file was automatically generated by SWIG (http://www.swig.org).
    # Version 3.0.12
    #
    # Do not make changes to this file unless you know what you are doing--modify
    # the SWIG interface file instead.

CLASSES
    builtins.object
        StringVector
        SwigPyIterator
        reader
        writer
    
    class StringVector(builtins.object)
     |  StringVector(*args)
     |  
     |  Methods defined here:
     |  
     |  __bool__(self)
     |  
     |  __del__ lambda self
     |  
     |  __delitem__(self, *args)
     |  
     |  __delslice__(self, i, j)
     |  
     |  __getattr__ lambda self, name
     |  
     |  __getitem__(self, *args)
     |  
     |  __getslice__(self, i, j)
     |  
     |  __init__(self, *args)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  __iter__(self)
     |  
     |  __len__(self)
     |  
     |  __nonzero__(self)
     | 

# Usage

1. Index a number of strings
2. Use the index for lookup

In [20]:
import bz2
# read one name per line from the bz2-compressed text file
with bz2.open("names.txt.bz2","rt") as f:
    names=f.read().splitlines()
print(names)

['Mikael Agricola', 'Adolf Ehrnrooth', 'Tarja Halonen', 'Urho Kekkonen', 'Aleksis Kivi', 'Elias Lönnrot', 'Kalevala', 'C. G. E. Mannerheim', 'Risto Ryti', 'Jean Sibelius', 'Arvo Ylppö', 'Matti Nykänen', 'Väinö Myllyrinne', 'Ville Valo', 'Lalli', 'Väinö Linna', 'Linus Torvalds', 'Spede Pasanen', 'Pentti Linkola', 'Tove Jansson', 'Veikko Hursti', 'Paavo Nurmi', 'Minna Canth', 'Juho Kusti Paasikivi', 'J. V. Snellman', 'Hertta Kuusinen', 'Arto Saari', 'Miina Sillanpää', 'Väinö Tanner', 'Lucina Hagman', 'Kristfrid Ganander', 'Mika Waltari', 'Mika Häkkinen', 'Alvar Aalto', 'Eugen Schauman', 'Tapio Rautavaara', 'Eino Leino', 'Jaakko Pöyry', 'Otto Wille Kuusinen', 'Juice Leskinen', 'Anders Chydenius', 'Uno Cygnaeus', 'Jari Litmanen', 'Katri Helena Kalaoja', 'Fanni Luukkonen', 'Anneli Jäätteenmäki', 'Karl Fazer', 'K. J. Ståhlberg', 'Mauno Koivisto', 'Helene Schjerfbeck', 'Reino Helismaa', 'Jorma Ollila', 'Lauri Törni', 'Georg Henrik von Wright', 'Arndt Pekurinen', 'Tauno Palo', 'Akseli Gallen-K

In [21]:
# index
import os
os.makedirs("test_names",exist_ok=True) #produces a lot of files, make dir
db=simstring.writer("test_names/test_names.db") 
for name in names:
    db.insert(name)
db.close()


In [22]:
db=simstring.reader("test_names/test_names.db")
db.measure=simstring.cosine
db.threshold=0.5
print(db.retrieve("Karita Matila"))
print(db.retrieve("Arvi Lund")) #short name, similarity decreases steeply!

('Karita Mattila',)
('Arvi Lind',)
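
To see why a short name loses similarity so quickly, the cosine measure can be reproduced by hand. The sketch below assumes plain character trigrams treated as a set, with no padding marks; simstring's own feature extraction may differ in detail, so the numbers are illustrative rather than the library's exact scores. The helper names `char_ngrams` and `ngram_cosine` are made up for this example.

```python
import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text` (assumption: no padding marks)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_cosine(a, b, n=3):
    """Set-based cosine: shared n-grams divided by sqrt(|A| * |B|)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / math.sqrt(len(grams_a) * len(grams_b))


long_pair = ngram_cosine("Karita Matila", "Karita Mattila")
short_pair = ngram_cosine("Arvi Lund", "Arvi Lind")
print(round(long_pair, 2), round(short_pair, 2))
```

Under these assumptions, one substituted character in "Arvi Lund" changes three of its seven trigrams, so the pair scores 4/7 (about 0.57), only just above the 0.5 threshold, while the longer pair stays near 0.87.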


In [23]:
db.retrieve("Kaarlo Juho Stahlberg")

()

In [24]:
db.retrieve("Kaarlo Juho Ståhlberg")

('K. J. Ståhlberg',)
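
One possible customization is to normalize strings before both indexing and querying, so that `å` and `a` produce the same n-grams. The sketch below is an assumption about how that could be done, not part of simstring: `normalize` and `build_lookup` are hypothetical helpers, and the lookup table is needed because the index would then hold normalized forms rather than the original spellings.

```python
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase and strip combining marks, so 'Ståhlberg' becomes 'stahlberg'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def build_lookup(names):
    """Map each normalized form back to the original spellings that produced it."""
    lookup = defaultdict(list)
    for name in names:
        lookup[normalize(name)].append(name)
    return lookup


lookup = build_lookup(["K. J. Ståhlberg", "Elias Lönnrot"])
print(normalize("K. J. Ståhlberg"), lookup[normalize("Elias Lonnrot")])
```

This only removes the diacritic mismatch; the abbreviated first names in the query above would still differ. In Finnish, `ä` and `ö` are distinct letters, so folding them can also merge strings that should stay apart.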

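* The queries above show the limitations: "Karita Matila" and "Arvi Lund" find their near matches, but "Kaarlo Juho Stahlberg" returns nothing, and only the spelling with `å` retrieves 'K. J. Ståhlberg'
* Generic string matching therefore needs further customization
* simstring serves as a fast first filter for more costly techniques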
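
The "first filter" idea can be sketched as a two-stage lookup: a cheap retrieval step returns a handful of candidates, and a slower measure then scores and orders only those. The example below uses `difflib.SequenceMatcher` from the standard library as a stand-in for the costly step; `rerank` is a made-up helper, and the candidate tuple stands in for what `db.retrieve` would return.

```python
from difflib import SequenceMatcher


def rerank(query, candidates, min_ratio=0.0):
    """Score each candidate against the query and return (score, candidate), best first."""
    scored = [(SequenceMatcher(None, query, c).ratio(), c) for c in candidates]
    scored = [(score, c) for score, c in scored if score >= min_ratio]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Stand-in for db.retrieve(...) output; both names appear in the list indexed above.
candidates = ("Mika Häkkinen", "Mika Waltari")
for score, name in rerank("Mika Waltary", candidates):
    print(round(score, 2), name)
```

Because the expensive comparison runs only on the retrieved candidates, its cost stays small even when the index itself holds millions of strings.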

# Larger index

* all Finnish wikidata strings

In [None]:
from tqdm import tqdm #progress-bar
os.makedirs("wikidata.db",exist_ok=True)
db=simstring.writer("wikidata.db/wikidata.db")
with open("wikidata.fi") as f:
    for line in tqdm(f):
        line=line.strip()
        # 4-col file with string, two urls, and official label
        # let us index the strings
        s,url1,url2,label=line.split("\t")
        db.insert(s)
db.close()

3995911it [01:58, 33791.00it/s]


* About 4M strings are indexed in about 2 minutes
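
The index built above stores only the strings, so the two URLs and the official label are lost at lookup time. One way to keep them, sketched here under the assumption that the metadata fits in memory, is a side dictionary keyed by the indexed string. `load_metadata` is a hypothetical helper, and the sample rows and their URLs are placeholders, not real wikidata entries.

```python
import io
from collections import defaultdict


def load_metadata(lines):
    """Map each indexed string to the (url1, url2, label) rows it came from."""
    metadata = defaultdict(list)
    for line in lines:
        s, url1, url2, label = line.rstrip("\n").split("\t")
        metadata[s].append((url1, url2, label))
    return metadata


# Two made-up rows in the 4-column layout; the URLs are placeholders.
sample = io.StringIO(
    "Tarja Halonen\thttp://example.org/a\thttp://example.org/b\tTarja Halonen\n"
    "Halonen\thttp://example.org/a\thttp://example.org/b\tTarja Halonen\n"
)
metadata = load_metadata(sample)
print(metadata["Halonen"])
```

Each string returned by a lookup can then be used as a key into `metadata` to recover the entities it refers to.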

In [None]:
db=simstring.reader("wikidata.db/wikidata.db")
db.retrieve("Tarja Halonen")

('Ona Halonen',
 'Esa Halonen',
 'Mia Halonen',
 'Tarja Halonen',
 'Tarja Salonen',
 'Tuija Halonen',
 'Maija Halonen',
 'Luokka:Tarja Halonen',
 'Tarja Kaarina Halonen')

In [None]:
db.retrieve("keskusrikospoliisi")

('rikospoliisi',
 'keskusrikospoliisi',
 'Keskusrikospoliisi',
 'Keskusrikospoliisi',
 'Keskusrikospoliisin Rikosmuseo')

## Lookup speed

In [None]:
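from tqdm import tqdm
with open("wikidata.fi") as f:
    for i,line in enumerate(tqdm(f)):
        # note: the whole tab-separated line, not just its first column, is the query
        line=line.strip()
        db.retrieve(line)
        if i==10000: # stop after roughly 10,000 queries
            break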


10000it [00:10, 927.73it/s]
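
The run above reports a little over 900 lookups per second. To get the same kind of number without a progress bar, a small timing helper can wrap any retrieval function. This is a minimal sketch: `queries_per_second` is a made-up name, and the dummy function below stands in for `db.retrieve`.

```python
import time


def queries_per_second(retrieve, queries):
    """Run every query through `retrieve` and return (count, queries per second)."""
    start = time.perf_counter()
    count = 0
    for query in queries:
        retrieve(query)
        count += 1
    elapsed = time.perf_counter() - start
    return count, (count / elapsed if elapsed > 0 else float("inf"))


def dummy_retrieve(query):
    """Stand-in for db.retrieve; always returns an empty result."""
    return ()


count, rate = queries_per_second(dummy_retrieve, ["Tarja Halonen", "keskusrikospoliisi"])
print(count, "queries at", round(rate), "per second")
```

With a real index, passing `db.retrieve` and a list of query strings would give a throughput figure comparable to the tqdm readout.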
