## Detection and elimination of typos in OCR data
Text produced by OCR tools often contains minor misreadings of individual characters; how often depends on the quality of the input. The langdetect package identifies the language of each OCR-extracted text. For texts in a supported language, the spellchecker module (distributed as the pyspellchecker package) can then correct some of these errors, which improves the quality of the data for further analysis.
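
To make the correction step less of a black box, here is a minimal sketch of the idea behind edit-distance spell correction: generate every string one edit away from the OCR token and keep those found in a dictionary, preferring the most frequent. The tiny `WORD_FREQ` table and the helper names `edits1` and `correct_word` are hypothetical and for illustration only. The spellchecker module ships its own frequency dictionary and, by default, also considers candidates two edits away.

```python
import string

# Hypothetical toy frequency table; a real dictionary holds many thousands of words.
WORD_FREQ = {"invoice": 120, "involve": 40, "total": 300, "date": 250}


def edits1(word):
    """Return every string one edit (delete, swap, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def correct_word(word):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    word = word.lower()
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    if not candidates:
        return word  # nothing plausible found, leave the token untouched
    return max(candidates, key=WORD_FREQ.get)


# 'invoicc' mimics a typical OCR confusion of 'e' and 'c'.
print(correct_word("invoicc"))
```

Tokens with no dictionary neighbour are returned unchanged, which is why the method repairs only some OCR errors.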

In [None]:
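import os

from langdetect import detect
import pandas as pd
from spellchecker import SpellChecker  # the class name is SpellChecker (capital C)
from tqdm import tqdm

# Register progress_apply on pandas objects; it is used in the cells below.
tqdm.pandas()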

Loading the analyzed datasets: training and test

In [None]:
df_train = pd.read_csv(os.path.join('..', 'data', 'train_data_complete_fixed.csv'))
df_test = pd.read_csv(os.path.join('..', 'data', 'test_set_fixed.csv'))

In [None]:
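def detect_language(text):
    """Return the detected language code, or 'unknown' if detection fails."""
    try:
        return detect(text)
    except Exception:  # e.g. empty, non-text, or missing input
        return 'unknown'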
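
Language detection tends to be unreliable on very short or non-text cells, which OCR output often contains. The sketch below wraps any detector callable so that such inputs are labelled 'unknown' before detection is attempted. `make_safe_detector`, `min_chars`, and the stub detector are hypothetical names used so the example runs on its own; in this notebook the callable would be `detect` from langdetect, and the threshold of 10 letters is an arbitrary assumption.

```python
def make_safe_detector(detect_fn, min_chars=10):
    """Wrap detect_fn so that unusable input yields 'unknown' instead of an error."""
    def safe_detect(text):
        # Missing values (e.g. NaN from pandas) and other non-strings are not text.
        if not isinstance(text, str):
            return 'unknown'
        # Too few alphabetic characters to support a meaningful guess.
        if sum(ch.isalpha() for ch in text) < min_chars:
            return 'unknown'
        try:
            return detect_fn(text)
        except Exception:
            return 'unknown'
    return safe_detect


# Stub standing in for langdetect.detect, only so this sketch is self-contained.
def stub_detect(text):
    return 'en'


safe_detect = make_safe_detector(stub_detect)
print(safe_detect('Total amount due by the end of the month'))
print(safe_detect('12/03'))
print(safe_detect(float('nan')))
```

Separately, the langdetect README notes that detection is non-deterministic unless `DetectorFactory.seed` is set before the first call, which matters if the `lang` column has to be reproducible.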

Language detection for the training dataset

In [None]:
df_train['lang'] = df_train['text'].progress_apply(detect_language)

Language detection for the test dataset

In [None]:
df_test['lang'] = df_test['text'].progress_apply(detect_language)

Selecting the rows identified as English

In [None]:
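# .copy() yields independent frames, so adding columns later does not trigger SettingWithCopyWarning.
df_train_en = df_train.loc[df_train['lang'] == 'en'].copy()
df_test_en = df_test.loc[df_test['lang'] == 'en'].copy()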

Spell-checking tool for text in the selected language

In [None]:
spell = SpellChecker(language='en')

def correct_text(text_string):
    """Replace each whitespace-separated word with its most likely correction."""
    corrected_text_list = []
    for word in text_string.split():
        # correction() can return None when it finds no candidate; keep the original word then.
        correction = spell.correction(word)
        corrected_text_list.append(correction if correction is not None else word)

    return ' '.join(corrected_text_list)
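
OCR text repeats the same tokens many times, and the per-word dictionary lookup is the costly part of `correct_text`. One possible optimisation, sketched below, is to memoise that lookup so each distinct token is corrected only once. `lookup_correction` and its small `KNOWN_FIXES` table are hypothetical stand-ins for `spell.correction`, used only so the example runs without the real dictionary.

```python
from functools import lru_cache

# Hypothetical stand-in for the spell checker's dictionary lookup.
KNOWN_FIXES = {'tbe': 'the', 'arnount': 'amount'}
lookup_calls = 0


def lookup_correction(word):
    """Pretend to be an expensive call such as spell.correction(word)."""
    global lookup_calls
    lookup_calls += 1
    return KNOWN_FIXES.get(word)  # None when no correction is known


@lru_cache(maxsize=None)
def cached_correction(word):
    """Correct a single token, falling back to the token itself."""
    correction = lookup_correction(word)
    return correction if correction is not None else word


def correct_text_cached(text_string):
    """Correct a string token by token, reusing earlier results for repeated tokens."""
    return ' '.join(cached_correction(word) for word in text_string.split())


print(correct_text_cached('tbe arnount and tbe date'))
print(lookup_calls)  # one lookup per distinct token, not per occurrence
```

The cache is unbounded here for simplicity; on a very large vocabulary a finite `maxsize` may be the safer choice.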

Correction of the English text in the training and test datasets: minor spelling errors are fixed and the result is stored in a new column, text_fixed.

In [None]:
df_train_en['text_fixed'] = df_train_en['text'].progress_apply(correct_text)
df_test_en['text_fixed'] = df_test_en['text'].progress_apply(correct_text)
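
A quick sanity check after correction is to measure how many tokens were changed. A very high share may indicate that the checker is rewriting proper nouns or domain terms rather than fixing OCR noise. The helper below is a hypothetical sketch; it assumes both strings contain the same number of whitespace-separated tokens, which holds for `correct_text` because it replaces words one for one.

```python
def changed_token_share(original, corrected):
    """Share of whitespace-separated tokens that differ between two strings."""
    original_tokens = original.split()
    corrected_tokens = corrected.split()
    if not original_tokens:
        return 0.0
    changed = sum(o != c for o, c in zip(original_tokens, corrected_tokens))
    return changed / len(original_tokens)


print(changed_token_share('tbe total arnount due', 'the total amount due'))
```

On the dataframes this could be applied row by row, for example with `df_train_en.apply(lambda row: changed_token_share(row['text'], row['text_fixed']), axis=1)`.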