# OpenITI customized cleaner | by Maroussia Bednarkiewicz

The official OpenITI helper can be found here: https://openiti.readthedocs.io/en/latest/
and on GitHub: https://github.com/OpenITI/openiti/tree/master/openiti/helper

### Preliminary imports

In [1]:
import os
import re
import openiti_cleaner
import pandas as pd
from itertools import chain

### If you already have OpenITI installed, skip this part 

If you do not have the OpenITI corpus on your computer, you can download its latest and previous versions from Zenodo at the following link:

https://zenodo.org/record/6808108

The files will be located in the folder called `data`.

_To credit the authors please use the following citation:_

`Nigst, Lorenz, Romanov, Maxim, Savant, Sarah Bowen, Seydi, Masoumeh, & Verkinderen, Peter. (2022). OpenITI: a Machine-Readable Corpus of Islamicate Texts (2022.1.6) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6808108`

Note that ReadITI only works on the text files, not on the metadata .yml files.
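
Since only the text files are processed, it can help to separate them from the `.yml` metadata before running anything. The following is a minimal sketch and not part of `openiti_cleaner`: the helper name `list_text_files` is hypothetical, and it assumes that metadata files can be recognised by their `.yml` extension alone.

```python
import os


def list_text_files(root):
    """Return the sorted paths of all files under `root` that are not .yml metadata."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumption: metadata files are identified only by their extension.
            if name.lower().endswith(".yml"):
                continue
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


if __name__ == "__main__":
    # "data" is the folder mentioned above; adjust the path to your own setup.
    for path in list_text_files("data"):
        print(path)
```

If the folder does not exist, `os.walk` yields nothing and the function returns an empty list.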

# Trial with the general functions

### Generate the test_file and test_dir

The test files are very small, to showcase how the code works. Add your own files to the `data` folder for more extensive trials.

In [2]:
test_file = "data/trial_with_file/0303Nasai.IbnAbiCaruba.Shia003077BK2-ara1.mARkdown"

In [3]:
test_dir = "data/trial_with_folder/"

## Example with a whole folder

In [4]:
openiti_cleaner.dir_processor(test_dir, "data/trial_with_folder/output/")

0303Nasai.IbnAbiCaruba.Shamela0005839BK2-ara2.completed
length original:58
number of sentences: 9
number of tokens: 47
0303Nasai.IbnAbiCaruba.Shia003077BK2-ara1.mARkdown
length original:54
number of sentences: 3
number of tokens: 48
0303Nasai.IbnAbiCaruba.JK011235BK2-ara1
length original:60
number of sentences: 9
number of tokens: 48


## Example with a single file

_The output files will be stored in "data/trial_with_file/output/"_

In [5]:
openiti_cleaner.file_processor(test_file, "data/trial_with_file/output/")

length original:54
number of sentences: 3
number of tokens: 48


# Trial with single functions

In [6]:
_, text = openiti_cleaner.get_content(test_file)
print(len(text.split()))
tokens = openiti_cleaner.cleaner(text)
sentences = openiti_cleaner.sentencizer(tokens)
doc = openiti_cleaner.indexizer(sentences)

54


In [7]:
tokens

['\n\n',
 'ذكر',
 'من',
 'حدث',
 'عنه',
 'ابن',
 'أبي',
 'عروبة',
 'ولم',
 'يسمع',
 'منه',
 '\n\n',
 'لم',
 'يسمع',
 'من',
 'عمرو',
 'بن',
 'دينار',
 'ولا',
 'من',
 'هشام',
 'بن',
 'عروة',
 'ولا',
 'من',
 'زيد',
 'بن',
 'أسلم',
 'ولا',
 'من',
 'عبيد',
 'الله',
 'بن',
 'عمر',
 'ولا',
 'من',
 'أبي',
 'الزناد',
 'ولا',
 'من',
 '\n\n',
 'الحكم',
 'ولا',
 'من',
 'حماد',
 'ولا',
 'من',
 'إسماعيل',
 'بن',
 'أبي',
 'خالد']

In [8]:
sentences

[' ذكر من حدث عنه ابن أبي عروبة ولم يسمع منه',
 ' لم يسمع من عمرو بن دينار ولا من هشام بن عروة ولا من زيد بن أسلم ولا من عبيد الله بن عمر ولا من أبي الزناد ولا من',
 ' الحكم ولا من حماد ولا من إسماعيل بن أبي خالد']
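
Comparing `tokens` with `sentences` suggests that the `'\n\n'` entries in the token list act as sentence boundaries. The sketch below illustrates that idea only; it is not the actual `sentencizer` implementation, and the function name `group_sentences` and its `separator` parameter are hypothetical.

```python
def group_sentences(tokens, separator="\n\n"):
    """Group a flat token list into sentence strings, splitting on `separator`."""
    sentences, current = [], []
    for token in tokens:
        if token == separator:
            # A separator closes the sentence collected so far, if there is one.
            if current:
                sentences.append(" ".join(current))
                current = []
        else:
            current.append(token)
    # Keep the last sentence when the list does not end with a separator.
    if current:
        sentences.append(" ".join(current))
    return sentences


example = ["\n\n", "ذكر", "من", "حدث", "\n\n", "لم", "يسمع"]
print(group_sentences(example))
```

Unlike the output shown above, this sketch does not put a leading space in front of each sentence.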

In [9]:
doc

[[{'idx': 0, 'token': 'ذكر', 'start_char': 0, 'end_char': 3},
  {'idx': 1, 'token': 'من', 'start_char': 4, 'end_char': 6},
  {'idx': 2, 'token': 'حدث', 'start_char': 7, 'end_char': 10},
  {'idx': 3, 'token': 'عنه', 'start_char': 11, 'end_char': 14},
  {'idx': 4, 'token': 'ابن', 'start_char': 15, 'end_char': 18},
  {'idx': 5, 'token': 'أبي', 'start_char': 19, 'end_char': 22},
  {'idx': 6, 'token': 'عروبة', 'start_char': 23, 'end_char': 28},
  {'idx': 7, 'token': 'ولم', 'start_char': 29, 'end_char': 32},
  {'idx': 8, 'token': 'يسمع', 'start_char': 33, 'end_char': 37},
  {'idx': 9, 'token': 'منه', 'start_char': 38, 'end_char': 41}],
 [{'idx': 0, 'token': 'لم', 'start_char': 42, 'end_char': 44},
  {'idx': 1, 'token': 'يسمع', 'start_char': 45, 'end_char': 49},
  {'idx': 2, 'token': 'من', 'start_char': 50, 'end_char': 52},
  {'idx': 3, 'token': 'عمرو', 'start_char': 53, 'end_char': 57},
  {'idx': 4, 'token': 'بن', 'start_char': 58, 'end_char': 60},
  {'idx': 5, 'token': 'دينار', 'start_char'
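
In the output above, `idx` restarts at 0 in each sentence, while `start_char` and `end_char` keep counting across sentences (the second sentence starts at 42, right after the first one ends at 41), with one character left between consecutive tokens. A minimal sketch that reproduces this pattern follows; it is an assumption about the logic, not the code of `indexizer`, and the name `index_sentences` is hypothetical.

```python
def index_sentences(sentences):
    """Build one list of token records per sentence, with running character offsets."""
    doc = []
    offset = 0  # running character position across all sentences
    for sentence in sentences:
        entries = []
        for idx, token in enumerate(sentence.split()):
            start = offset
            end = start + len(token)
            entries.append(
                {"idx": idx, "token": token, "start_char": start, "end_char": end}
            )
            # Assumption: exactly one separator character between tokens.
            offset = end + 1
        doc.append(entries)
    return doc


sample = index_sentences(["ذكر من حدث", "لم يسمع"])
for sentence in sample:
    print(sentence)
```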

### The doc indexed as a table

In [10]:
pd.DataFrame(list(chain.from_iterable(doc)))

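|   | idx | token | start_char | end_char |
|---|-----|-------|------------|----------|
| 0 | 0 | ذكر | 0 | 3 |
| 1 | 1 | من | 4 | 6 |
| 2 | 2 | حدث | 7 | 10 |
| 3 | 3 | عنه | 11 | 14 |
| 4 | 4 | ابن | 15 | 18 |
| 5 | 5 | أبي | 19 | 22 |
| 6 | 6 | عروبة | 23 | 28 |
| 7 | 7 | ولم | 29 | 32 |
| 8 | 8 | يسمع | 33 | 37 |
| 9 | 9 | منه | 38 | 41 |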
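
Flattening with `chain.from_iterable` discards the sentence boundaries, and because `idx` restarts in every sentence, rows from different sentences can no longer be told apart by `idx` alone. One possible workaround, sketched here with a hypothetical helper `flatten_with_sentence_id` and a hypothetical `sent_id` column, is to record the sentence number while flattening.

```python
def flatten_with_sentence_id(doc):
    """Flatten a list of sentences into rows, adding the sentence number to each row."""
    rows = []
    for sent_id, sentence in enumerate(doc):
        for entry in sentence:
            # Copy the record so the original `doc` is left untouched.
            rows.append({"sent_id": sent_id, **entry})
    return rows


toy_doc = [
    [{"idx": 0, "token": "ذكر", "start_char": 0, "end_char": 3}],
    [{"idx": 0, "token": "لم", "start_char": 4, "end_char": 6}],
]
rows = flatten_with_sentence_id(toy_doc)
print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` in the same way as the flattened `doc` above.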
