<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Extract-texts" data-toc-modified-id="Extract-texts-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Extract texts</a></span></li><li><span><a href="#Knit-a-table" data-toc-modified-id="Knit-a-table-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Knit a table</a></span><ul class="toc-item"><li><span><a href="#Itterate-through-all-lang-directories-to-collect-an-exhaustive-set-of-available-OCRed-issues" data-toc-modified-id="Itterate-through-all-lang-directories-to-collect-an-exhaustive-set-of-available-OCRed-issues-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Itterate through all lang directories to collect an exhaustive set of available OCRed issues</a></span></li><li><span><a href="#Write-a-function-which-reads-OCRed-files-and-transform-it-into-the-dictionary-structure-where-'page'-is-a-key-containing-list-of-lines-as-a-value" data-toc-modified-id="Write-a-function-which-reads-OCRed-files-and-transform-it-into-the-dictionary-structure-where-'page'-is-a-key-containing-list-of-lines-as-a-value-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Write a function which reads OCRed files and transform it into the dictionary structure where 'page' is a key containing list of lines as a value</a></span></li><li><span><a href="#Itterate-over-all-files-available,-make-dictionaries-with-respect-to-each-language-for-each-issue-and-make-one-big-dataframe-where-all-the-issues-on-each-language-are-stored" data-toc-modified-id="Itterate-over-all-files-available,-make-dictionaries-with-respect-to-each-language-for-each-issue-and-make-one-big-dataframe-where-all-the-issues-on-each-language-are-stored-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Itterate over all files available, make dictionaries with respect to each language for each issue and make one big dataframe where all the issues on each language are stored</a></span></li><li><span><a href="#Additional-annotations-and-fixes-for-the-dataframe" data-toc-modified-id="Additional-annotations-and-fixes-for-the-dataframe-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Additional annotations and fixes for the dataframe</a></span><ul class="toc-item"><li><span><a href="#Delete-new-lines-character-at-the-end-of-each-line" data-toc-modified-id="Delete-new-lines-character-at-the-end-of-each-line-2.4.1"><span class="toc-item-num">2.4.1&nbsp;&nbsp;</span>Delete new lines character at the end of each line</a></span></li><li><span><a href="#Add-annotation-if-lines-in-all-languages-are-the-same" data-toc-modified-id="Add-annotation-if-lines-in-all-languages-are-the-same-2.4.2"><span class="toc-item-num">2.4.2&nbsp;&nbsp;</span>Add annotation if lines in all languages are the same</a></span></li></ul></li><li><span><a href="#Saving-the-dataframe-as-files" data-toc-modified-id="Saving-the-dataframe-as-files-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Saving the dataframe as files</a></span></li></ul></li></ul></div>

In [1]:
import os
import re
import pandas as pd
import numpy as np
import pytesseract # note: tesseract utils need to be installed on the machine
from pdf2image import convert_from_path # note: poppler utils need to be installed on the machine
from tqdm.notebook import tqdm


langs = ['rus','avar','kumyk','lezg','tabas','darg','lak'] # list all languages of interest, the same names as directories


## Extract texts

config options for tesseract: https://muthu.co/all-tesseract-ocr-options/ <br>
(write strings like r'--psm 2 --oem 3')


available language models for tesseract (should be downloaded on the machine first):
https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

In [None]:
def extract_texts(lang_dir, tesseract_lang="rus", tesseract_config=r''):
    """Takes as an argument a directory with pdf files to work with
    iterrate over each file, convert its pages to image
    iterrate over each page-image with the tesseract
    Add tesseract output to the OCRed directory"""
    os.chdir(os.getcwd()+'/'+lang_dir) # change directory to where the pdf files are stored
    try: # create a folder for OCRed .txts #error raises if already exists
        os.mkdir(os.getcwd()+"/tesseract_OCRed_texts")
    except:
        pass
    
    files = os.listdir() # list of files in the language directory
    files = list(filter(lambda x: '.pdf' in x, files)) # specify we work only with pdfs
    
    for file in tqdm(files): # tqdm for visualising the progress
        images = convert_from_path(file)
        if  file.endswith("_2016_5_6.pdf") == False: #don't skip the first and the last two pages in the issue "_2016_5_6.pdf"
            images = images[2:-2] # skip the first and last two pages
        
        doc_name = "tesseract_OCRed_" + str(file) + ".txt"
        i = 1 # increment for adding the page number
        # don't do if OCRed file already exists
        if os.path.exists(os.getcwd()+"/tesseract_OCRed_texts/"+doc_name) == False: #don't process with tesseract if the OCRed file already exists
        
            for image in tqdm(images):
                page_number = "######################### Page_№"+str(i)+" _file:_" +str(file) + " _tesseract_language_model:_" + str(tesseract_lang) + " _tesseract_config:_" +str(tesseract_config) + "#########################\n"
                
                OCR_doc = pytesseract.image_to_string(image, lang=tesseract_lang, config=tesseract_config)
            
                f = open(doc_name, "a")
                f.write(page_number+OCR_doc)
                f.close()
                i += 1
            #move file to OCRed dir
            os.rename(src= os.getcwd()+"/"+doc_name,dst=os.getcwd()+"/tesseract_OCRed_texts/"+doc_name)
            #os.remove(os.getcwd()+"/"+doc_name)
        
    os.chdir(os.path.dirname(os.getcwd())) # go to parent directory


In [None]:
# apply the function to all language directories
for lang in langs:
    extract_texts(lang_dir=lang)

## Knit a table

Make a table where each line from OCRed files are considered as one observation.

### Itterate through all lang directories to collect an exhaustive set of available OCRed issues

In [2]:
## Get a set of issues available elsewhere
issues = set() # set for issues
main_dir = os.getcwd() # path to this file directory

for lang_dir in langs:
    os.chdir(main_dir+'/'+lang_dir+'/tesseract_OCRed_texts')
    files = os.listdir()
    for file_name in files:
        if file_name.endswith(".txt"):
            file_name = re.search(pattern=r"\d.+", string=file_name).group() # extract issue number
            issues.add(file_name)
        
    os.chdir(main_dir)
    
print(issues)    

{'2021_2.pdf.txt', '2020_2.pdf.txt', '2018_1.pdf.txt', '2017_3.pdf.txt', '2019_1.pdf.txt', '2020_4.pdf.txt', '2017_4.pdf.txt', '2020_3.pdf.txt', '2018_5.pdf.txt', '2019_3.pdf.txt', '2022_1.pdf.txt', '2020_1.pdf.txt', '2018_4.pdf.txt', '2018_2.pdf.txt', '2017_6.pdf.txt', '2019_6.pdf.txt', '2019_2.pdf.txt', '2016_5_6.pdf.txt', '2020_6.pdf.txt', '2023_1.pdf.txt', '2022_3.pdf.txt', '2022_6.pdf.txt', '2021_1.pdf.txt', '2017_5.pdf.txt', '2022_2.pdf.txt', '2022_4.pdf.txt', '2021_5.pdf.txt', '2018_6.pdf.txt', '2023_2.pdf.txt', '2019_5.pdf.txt', '2020_5.pdf.txt', '2017_1.pdf.txt', '2021_6.pdf.txt', '2018_3.pdf.txt', '2021_4.pdf.txt', '2021_3.pdf.txt', '2017_2.pdf.txt', '2023_3.pdf.txt', '2019_4.pdf.txt', '2022_5.pdf.txt'}


### Write a function which reads OCRed files and transform it into the dictionary structure where 'page' is a key containing list of lines as a value

In [3]:
def dictionarise_issue(file):
    '''Collect a dictionary where keys are page numbers, values are lists with lines'''
    dic_of_pages = {}
    list_of_lines = []
    
    page_count = 0 #page counter
    with open(file) as f:
        for line in f:

            if not line.startswith("##############") and line != '\n': # add line to list # do not add empty lines and lines with metadata
                list_of_lines.append(line)
                
            if line.startswith("##############"): 
                # stop adding lines to list, add list to dictionary with page key, make new list
                if list_of_lines != []:
                    dic_of_pages[page_count] = list_of_lines
                page_count += 1
                list_of_lines = []
    return dic_of_pages

### Itterate over all files available, make dictionaries with respect to each language for each issue and make one big dataframe where all the issues on each language are stored

In [4]:
%%time
# 0) make a big empty dataframe where smaller dataframes dedicated to different issues eventually will be held
# (columns needed: issue, page, line, rus, tabas, ...)
# 1) itterate over possible issues  
# 2) nest loop itterate over languages
# 3) form a dictionary with pages and lines for each language
# 4) transform issue dictionary to the dataframe; 'explode' it appropriately
# 5) concatinate issue dataframe with the all-issues dataframe


general_table = pd.DataFrame(columns = ["issue","page","line", *langs])

for issue in issues:
    issue_dict = {}
    for lang in langs:
        os.chdir(main_dir+'/'+lang+'/tesseract_OCRed_texts')
        try:
            file = os.getcwd() + "/tesseract_OCRed_"+lang+"_"+issue 
            issue_dict[lang] = dictionarise_issue(file)
        except FileNotFoundError:
            issue_dict[lang] = {}
        if "rus" and 'avar'and"kumyk"and"lezg"and"tabas"and'darg'and'lak' in issue_dict.keys():
            
        
            issue_df = pd.DataFrame(issue_dict)
            # replace NaN (product of empty dictionaries,when there is no file) with empty lists # based on: #https://stackoverflow.com/questions/31567218/replace-nan-with-empty-list-in-a-pandas-dataframe
            for col in issue_df.columns:
                issue_df.loc[issue_df[col].isnull(),col] = issue_df.loc[issue_df[col].isnull(),col].apply(lambda x: [])
            
            issue_df['issue'] = issue                  
            
        
        
        # itterate over rows, find maximum possible lines on the page, extend same page of other languages with NaNs # it is done to make the lists the same size, otherwise convenient 'explode' method from pandas does not work
            for index in issue_df.index:
                max_length = max([len(issue_df['rus'][index]),len(issue_df['avar'][index]),len(issue_df['kumyk'][index]),len(issue_df['lezg'][index]),len(issue_df['tabas'][index]),len(issue_df['darg'][index]),len(issue_df['lak'][index])])
                
                for lang in [issue_df['rus'][index],issue_df['avar'][index],issue_df['kumyk'][index],issue_df['lezg'][index],issue_df['tabas'][index],issue_df['darg'][index],issue_df['lak'][index]]:
                    while len(lang) <= max_length:
                        lang.append(np.NaN)
            
            
            issue_df = issue_df.reset_index().rename(columns = {'index':'page'}) # add pages numbering
            
            issue_df = issue_df.explode(langs)\
            .reset_index(drop=True).reset_index().rename(columns = {'index':'line'}) # add line (per issue)
            
        
        # append the issue dataframe to general table
            general_table = pd.concat([general_table,issue_df])
        
        os.chdir(main_dir)
        general_table = general_table.reset_index(drop=True) # making of row id in the table

general_table.tail(10)

CPU times: user 9.88 s, sys: 1.11 s, total: 11 s
Wall time: 13.6 s


Unnamed: 0,issue,page,line,rus,avar,kumyk,lezg,tabas,darg,lak
167394,2022_5.pdf.txt,39,3433,,,,,,,
167395,2022_5.pdf.txt,19,3434,,19\n,,,,,
167396,2022_5.pdf.txt,19,3435,,5/2022 МАПАРУПАЙ\n,,,,,
167397,2022_5.pdf.txt,19,3436,,,,,,,
167398,2022_5.pdf.txt,25,3437,,,,,“А\n,,м }\n
167399,2022_5.pdf.txt,25,3438,,,,,‘ Г 5/2022 дАГЪУФТААУ ри 25. |\n,,1 |. } 5/2022 РУО МИ 25. |\n
167400,2022_5.pdf.txt,25,3439,,,,,\“ > >. ' >)\n,,КИ р >)!\n
167401,2022_5.pdf.txt,25,3440,,,,,лек й\n,,"# А а, „у $;\n"
167402,2022_5.pdf.txt,25,3441,,,,,РТ Я\n,,"""р Я\n"
167403,2022_5.pdf.txt,25,3442,,,,,,,


In [12]:
general_table.head()

Unnamed: 0,issue,page,line,rus,avar,kumyk,lezg,tabas,darg,lak
0,2022_6.pdf.txt,2,0,аканчивается год — начинается\n,ъалеха ра{алде щвана гьаб\n,ыл тамамлана-янгы йылгъа\n,ис акъатна куьтягь жезва —\n,ис ккудубк[Гура — Щийиб\n,ус ахирличи биркули\n,"ъуртал хъанай дур шин,\n"
1,2022_6.pdf.txt,2,1,новый. Обычно принято\n,сонги — т[аде щолеб буго\n,"абат алабыз. Адатлы кюйде,\n",цГийиди алукьзава. Адет\n,"ккебгъра. Аьдат вуйиганси,\n","саби, сагаси сабиули\n",дайдихьлай дур цГусса шин.\n
2,2022_6.pdf.txt,2,2,подводить итоги ушедшего\n,цГиябги. ЦТияб соналда цебе\n,шу гюнлерде биз оьтген замангъа\n,"яз, алатай йисан гьахъ-гьисаб\n",ккудубшубдин натижйир\n,саби. ГШядатлибиубли дусла\n,"'Аьдат дур, ларгмур шинал итогру\n"
3,2022_6.pdf.txt,2,3,и строить планы на будущее;\n,нилъер Падат буго араб соналъул\n,"тёз къаратып, бир тюрлю\n","кьада, къведай йисуз кьиле тухун\n","йивури, гележегдиз планар диври,\n",ахирличир дарибтала итогуни\n,"бищайсса, ялун нанимур шинал\n"
4,2022_6.pdf.txt,2,4,благодарить своих читателей и\n,"хГасилал гьарулел, бач[унеб\n","натижалар чыгъарабыз, гележекге\n","патал рик[ик вуч кват[а, гьадакай\n",урхрудариз ва подписчикариз\n,кайули бирар ва челябкьлалис\n,планну дайсса; цала буккултрахь\n


### Additional annotations and fixes for the dataframe

#### Delete new lines character at the end of each line

In [5]:
for x in langs:
    general_table[x] = general_table[x].str.replace(r'\n', '')
general_table.head(3)

  general_table[x] = general_table[x].str.replace(r'\n', '')


Unnamed: 0,issue,page,line,rus,avar,kumyk,lezg,tabas,darg,lak
0,2021_2.pdf.txt,1,0,СЛОВО,НОМЕРАЛДА ЩАЛЕ:,РЕДАКТОРНУ,«Дагъустандин дишегьли»,Гъи дишагьлийирин «Дагъу-,"Лебжели — мякамхГебирулра,",Ва надирсса журнал
1,2021_2.pdf.txt,1,1,РЕДАКТОРА,ТУБАРАБ ПУМРУГИ ЦО,СЁЗЮ,журнал кхьена ва къе ам,стан дишагьли» аьникьа жур-,бетахъибх/ели - дисулра...,«Дагусттаннал хьами»
2,2021_2.pdf.txt,1,2,"Поздравляем всех,",РЕДАКТОРАСУЛ,«Дагьыстанлы къатынгьа»,гъиле кьунвайбуруз виридаз,нал хилиъ дибиснайидар ва ди-,«Дагъиста хьунул адам»,"чигу-чивчуну, ХТакьину канил"


#### Add +1 to lines id to start from "1" instead of "0"

In [10]:
general_table.line = general_table.line + 1
general_table.head(3)

Unnamed: 0,issue,page,line,rus,avar,kumyk,lezg,tabas,darg,lak,all_matching
0,2021_2.pdf.txt,1,2,СЛОВО,НОМЕРАЛДА ЩАЛЕ:,РЕДАКТОРНУ,«Дагъустандин дишегьли»,Гъи дишагьлийирин «Дагъу-,"Лебжели — мякамхГебирулра,",Ва надирсса журнал,False
1,2021_2.pdf.txt,1,3,РЕДАКТОРА,ТУБАРАБ ПУМРУГИ ЦО,СЁЗЮ,журнал кхьена ва къе ам,стан дишагьли» аьникьа жур-,бетахъибх/ели - дисулра...,«Дагусттаннал хьами»,False
2,2021_2.pdf.txt,1,4,"Поздравляем всех,",РЕДАКТОРАСУЛ,«Дагьыстанлы къатынгьа»,гъиле кьунвайбуруз виридаз,нал хилиъ дибиснайидар ва ди-,«Дагъиста хьунул адам»,"чигу-чивчуну, ХТакьину канил",False


#### Add annotation if lines in all languages are the same

Usually it means that a fragment is written down in the same language in all versions, namely in Russian. It can signalise that a fragment is not interesting for a parallel corpus. For example, it can be a fragment for kids or a cooking recipe. Also, it can be some artifact of OCR or part of interesting fragment like personal name

In [11]:
general_table['all_matching'] = general_table.apply(lambda x: x.rus == \
                                                    x.avar == x.kumyk == x.tabas == \
                                                    x.darg == x.lak, axis = 1)
general_table.head(3)

Unnamed: 0,issue,page,line,rus,avar,kumyk,lezg,tabas,darg,lak,all_matching
0,2021_2.pdf.txt,1,2,СЛОВО,НОМЕРАЛДА ЩАЛЕ:,РЕДАКТОРНУ,«Дагъустандин дишегьли»,Гъи дишагьлийирин «Дагъу-,"Лебжели — мякамхГебирулра,",Ва надирсса журнал,False
1,2021_2.pdf.txt,1,3,РЕДАКТОРА,ТУБАРАБ ПУМРУГИ ЦО,СЁЗЮ,журнал кхьена ва къе ам,стан дишагьли» аьникьа жур-,бетахъибх/ели - дисулра...,«Дагусттаннал хьами»,False
2,2021_2.pdf.txt,1,4,"Поздравляем всех,",РЕДАКТОРАСУЛ,«Дагьыстанлы къатынгьа»,гъиле кьунвайбуруз виридаз,нал хилиъ дибиснайидар ва ди-,«Дагъиста хьунул адам»,"чигу-чивчуну, ХТакьину канил",False


In [12]:
# examples of lines where OCR results are the same
general_table.query("all_matching == True").head(5)

Unnamed: 0,issue,page,line,rus,avar,kumyk,lezg,tabas,darg,lak,all_matching
2028,2021_2.pdf.txt,17,2030,о к,о к,о к,о к,о к,о к,о к,True
2029,2021_2.pdf.txt,17,2031,рнальв ЖУРНЯЛЕ),рнальв ЖУРНЯЛЕ),рнальв ЖУРНЯЛЕ),рнальв ЖУРНЯЛЕ),рнальв ЖУРНЯЛЕ),рнальв ЖУРНЯЛЕ),рнальв ЖУРНЯЛЕ),True
2030,2021_2.pdf.txt,17,2032,-. =,-. =,-. =,-. =,-. =,-. =,-. =,True
2031,2021_2.pdf.txt,17,2033,Ве,Ве,Ве,Ве,Ве,Ве,Ве,True
2032,2021_2.pdf.txt,17,2034,.% %_Х-А ь ь и,.% %_Х-А ь ь и,.% %_Х-А ь ь и,.% %_Х-А ь ь и,.% %_Х-А ь ь и,.% %_Х-А ь ь и,.% %_Х-А ь ь и,True


### Saving the dataframe as files

In [13]:
# save the table
%time general_table.to_excel("parallel_texts_table.xlsx", index_label = "row_id")
%time general_table.to_csv("parallel_texts_table.csv", index_label = "row_id")

CPU times: user 45.3 s, sys: 1.25 s, total: 46.6 s
Wall time: 52.7 s
CPU times: user 1.58 s, sys: 99.4 ms, total: 1.68 s
Wall time: 1.85 s
