# Levenshtein distance and spelling corrections

This notebook continues the Google Colab notebook `nlp3.ipynb`.

The macOS version used for this task does not support the Morfeusz2 Python bindings.

In [1]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk
import pandas as pd
from tqdm import tqdm

For this notebook to run, disable the security features (SSL) in `elasticsearch.yml`.

In [9]:
es = Elasticsearch("http://localhost:9200/")
if es.ping():
    print('Yay Connected!')
    print(es.info().body)
else:
    print('Awww it could not connect!')
    print(es.ping())

Yay Connected!
{'name': 'MacBook-Pro-5.lan', 'cluster_name': 'elasticsearch', 'cluster_uuid': 'rJQK1DceS2-i7H25KMdmOA', 'version': {'number': '8.5.1', 'build_flavor': 'default', 'build_type': 'tar', 'build_hash': 'c1310c45fc534583afe2c1c03046491efba2bba2', 'build_date': '2022-11-09T21:02:20.169855900Z', 'build_snapshot': False, 'lucene_version': '9.4.1', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}


Load the `df_corrections` DataFrame from its pickle file:

In [3]:
df_corrections = pd.read_pickle("df_corrections.pkl")

12. Load the SGJP dictionary (Słownik SGJP dane tekstowe) into Elasticsearch (one document for each form) and use fuzzy matching to obtain possible corrections for the 30 words with 5 occurrences that do not belong to the dictionary.

In [4]:
es_index = "polish_bills_recommendation"

if es.indices.exists(index=es_index):
    es.indices.delete(index=es_index)

res = es.indices.create(
    index=es_index,
    mappings={
        'properties': {
            'text': {
                'type': "text",
                'analyzer': "keyword"
            }
        }
    }
)
res

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'polish_bills_recommendation'})

In [5]:
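with open("sgjp-20221113.tab", 'r', encoding="utf-8") as f:
    # Drop the first 28 lines (file header) and the empty string left after the final newline.
    sgjp = f.read().split('\n')[28:-1]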

In [6]:
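def map_words_for_bulk_upload(lines):
    # Yield one bulk "create" action per SGJP line, indexing only the word form.
    for i, line in enumerate(lines):
        word, *_ = line.split('\t')  # the form is the first tab-separated column
        yield {
            '_op_type': 'create',
            '_index': es_index,
            '_id': i,
            'text': word
        }

def get_corrections(word):
    # Fuzzy-match the word against the indexed forms, allowing up to 2 edits.
    res = es.search(index=es_index,
                    query={'match': {'text': {'query': word, 'fuzziness': 2}}},
                    filter_path=["hits.hits._source.text"])
    # With filter_path, a search without hits returns an empty body,
    # so res['hits'] would raise a KeyError.
    hits = res.body.get('hits', {}).get('hits', [])
    return ", ".join(hit['_source']['text'] for hit in hits)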
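
The `fuzziness: 2` setting in the query bounds the edit distance between the query word and a matching form. As a rough illustration of what that distance measures, here is a minimal pure-Python sketch of the classic Levenshtein distance. The function name `levenshtein` is introduced here for illustration only and is not part of the notebook. Note also that Elasticsearch, by default, additionally counts a swap of two adjacent characters as a single edit, which this plain version does not.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the insertions, deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP rows short
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


# Token/suggestion pairs taken from the results table in this notebook.
for token, candidate in [("gmo", "emo"), ("ike", "Mike"), ("remediacji", "repudiacji")]:
    print(token, candidate, levenshtein(token, candidate))
```

The three pairs have distances 1, 1 and 2, so all of them fall within the allowed fuzziness of 2.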

In [7]:
progress_bar = tqdm(desc="Uploading SGJP to ES", total=len(sgjp))
for _ in parallel_bulk(client=es, actions=map_words_for_bulk_upload(sgjp)):
    progress_bar.update(1)

Uploading SGJP to ES: 100%|█████████▉| 7409501/7412267 [06:19<00:00, 19601.46it/s]
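
Before uploading millions of documents, it can help to inspect the bulk actions without touching the cluster. The sketch below re-creates the action generator as a standalone function and runs it on two sample lines. The function name `actions_for`, the index name `demo_index`, and the sample lines (including their tags) are hypothetical and only mimic a tab-separated layout with the form in the first column.

```python
from itertools import islice


def actions_for(lines, index_name):
    """Yield one bulk 'create' action per line, keeping only the first column."""
    for i, line in enumerate(lines):
        word, *_ = line.split("\t")  # star-unpacking discards the remaining columns
        yield {"_op_type": "create", "_index": index_name, "_id": i, "text": word}


# Hypothetical sample lines; the real file may carry different columns and tags.
sample_lines = [
    "kot\tkot\tsubst:sg:nom:m2",
    "kota\tkot\tsubst:sg:gen:m2",
]

# islice lets us peek at the first few actions of a lazy generator.
preview = list(islice(actions_for(sample_lines, "demo_index"), 2))
for action in preview:
    print(action)
```

Because the generator is lazy, the same preview works on the full list of lines without building all actions in memory.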

In [8]:
df_corrections['correction_es'] = df_corrections['token'].apply(get_corrections)
df_corrections

| | token | count | rank | correction | correction_es |
|---|---|---|---|---|---|
| 371 | późn | 1065 | 372 | plan | późni, późno, późna, późna, późne, późne, późn... |
| 1435 | gmo | 298 | 1436 | go | dmo, emo, emo, emo, emo, emo, emo, emo, emo, emo |
| 1975 | sww | 216 | 1976 | swe | swa, siw, sów, suw, swe, swe, swe, swe, swą, swą |
| 2174 | skw | 196 | 2175 | kw | sakw, siw, ska, ski, sków, sów, suw, kw, Bokw,... |
| 2501 | ex | 167 | 2502 | ix | Rex, em, eś, ee, ef, ef, eh, ej, el, em |
| 2571 | ike | 162 | 2572 | ile | Mike, Mike, Mike, Mike, Nike, ikr, iks, iks, i... |
| 3350 | remediacji | 120 | 3351 | mediacji | repudiacji, repudiacji, repudiacji, remediach,... |
| 3733 | ure | 103 | 3734 | ue | Bure, Bure, Sure, bure, bure, bure, bure, uje,... |
| 3907 | uke | 97 | 3908 | ue | uje, ule, Bek, Bek, Age, Ale, Ale, Ale, Ale, Buje |
| 3993 | kn | 95 | 3994 | on | en, in, ka, kan, kb, ka, ki, ki, ki, ką |
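
Many suggestion lists above repeat the same form several times, presumably because the same surface form occurs on several dictionary lines and each line was indexed as its own document. A minimal sketch of collapsing such repeats while keeping the original order follows. The helper name `unique_in_order` is an assumption introduced here, not part of the notebook.

```python
def unique_in_order(suggestions: str) -> str:
    """Remove repeated forms from a comma-separated list, keeping first occurrences."""
    forms = [form.strip() for form in suggestions.split(",") if form.strip()]
    # dict.fromkeys keeps insertion order, so the first occurrence of each form wins.
    return ", ".join(dict.fromkeys(forms))


# Suggestion list for the token "gmo", copied from the table above.
print(unique_in_order("dmo, emo, emo, emo, emo, emo, emo, emo, emo, emo"))
# prints: dmo, emo
```

It could be applied with `df_corrections['correction_es'].apply(unique_in_order)`. Since a search returns only 10 hits by default, deduplicating after the fact shortens the lists; requesting more hits through the `size` argument of `es.search` would be one way to compensate.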

