kaz-parallel-corpora

The products must be used under a Creative Commons licence, specifically CC BY-SA (Creative Commons Attribution-ShareAlike).

The following bilingual tools were developed for crawling and cleaning the corpus:

URL collection tool;
News crawling tools;
Data cleaning tool;
Sentence splitting tool;
Bilingual Frequency Lexicon;
Hunalign adapted for the Kazakh-English language pair;
Morphological segmentation tools.

Content of the repository

compare/ - code to compare the corpus with other parallel corpora
corpus/ - corpus files in *.tsv
utils/ - scripts used for crawling and cleaning the corpus

What is needed

● Python 3 (version 3.6 or later)

● lxml

● sacremoses

● jupyter

● numpy

● matplotlib

How it works

The utils/ directory contains the bilingual tools developed for crawling and cleaning Kazakh-English (and English-Kazakh) parallel corpora. It contains the following files:

● _clean_text_in_files.sh

● _gen_langs_lists.py

● _join_similar_urls.py

● align_files.py

● clean_alphabets.py

● clean_text.py

● combine_texts_into_one_file.py

● en_kz.dic

● extract_data_from_xml.py

● sacre_norm_tok.py

● segment.py

● split_direct_speech.py

● split_sentences_eng.py

● split_sentences_kaz.py

● split_text_into_many_files.py

Run the files in the following order:

URL collection tool. Run ‘_gen_langs_lists.py’, which collects the URLs of pages in the language in which the most news is published.
News collection tools. ‘_join_similar_urls.py’ concatenates similar URL addresses. The resources are then collected from the list of URLs in .xml format.
‘extract_data_from_xml.py’ extracts the text from the .xml files and saves the texts into separate file pairs.
‘split_text_into_many_files.py’ saves each file under a matching number and language code, for example 1000.kk and 1000.en. Each file consists of a section, the article title, the publication date and the article text.
Run ‘align_files.py’, which checks that both lists contain an equal number of files and that the file names correspond to each other.
Data cleaning tools. Run ‘split_direct_speech.py’, ‘clean_alphabets.py’ and ‘clean_text.py’. ‘clean_alphabets.py’ cleans and replaces incorrect letters, punctuation and unwanted symbols in a text or file. The ‘_clean_text_in_files.sh’ script cleans each language file using ‘clean_text.py’ and saves the output files.
Sentence splitting tools. Run ‘split_sentences_eng.py’ to split English text into sentences and ‘split_sentences_kaz.py’ to split Kazakh text.
Run ‘combine_texts_into_one_file.py’, which collects all files into one.
Sacremoses tool. Run ‘sacre_norm_tok.py’, which normalizes punctuation and tokenizes the text.
Bilingual Frequency Lexicon. ‘en_kz.dic’ is a dictionary file whose entries have the following format, for example: may @ мамыр.
The adapted Hunalign aligns bilingual texts at the sentence level. The input files are tokenized, sentence-segmented texts in the two languages.
Morphological segmentation tool. Run ‘segment.py’ to segment Kazakh text into stems and endings separated by the symbol @@. Segmentation needs two files: an endings file and a stopwords file. The stopwords file consists of word stems compiled to avoid incorrect segmentation.
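
Two of the data conventions in the steps above can be illustrated with a short sketch: the numbered file pairs (1000.kk, 1000.en) that the file check relies on, and the "english @ kazakh" line format of the lexicon. This is not the code of ‘align_files.py’ or of the adapted Hunalign; the function names unpaired_stems and parse_dic_line are hypothetical, and the exact separator " @ " (with one space on each side) is an assumption based on the single example entry.

```python
from pathlib import Path
import tempfile


def unpaired_stems(directory, src_ext=".kk", tgt_ext=".en"):
    """Return (stems present only as src files, stems present only as tgt files).

    Hypothetical helper: it only illustrates the idea of checking that every
    numbered Kazakh file has an English counterpart and vice versa.
    """
    root = Path(directory)
    src = {p.stem for p in root.glob(f"*{src_ext}")}
    tgt = {p.stem for p in root.glob(f"*{tgt_ext}")}
    return sorted(src - tgt), sorted(tgt - src)


def parse_dic_line(line, sep=" @ "):
    """Split one lexicon line such as 'may @ мамыр' into (english, kazakh).

    The separator is an assumption taken from the example entry; a line
    without it raises ValueError.
    """
    english, kazakh = line.strip().split(sep, 1)
    return english, kazakh


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # 1001.kk deliberately has no English counterpart.
        for name in ("1000.kk", "1000.en", "1001.kk"):
            Path(tmp, name).write_text("", encoding="utf-8")
        print(unpaired_stems(tmp))        # (['1001'], [])
    print(parse_dic_line("may @ мамыр"))  # ('may', 'мамыр')
```

Two empty lists from unpaired_stems mean that both sides contain the same set of file numbers, which also implies an equal number of files.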

With the developed tools, parallel corpora for the Kazakh-English (and English-Kazakh) language pair were collected and processed; the results are presented in the table below.

Parallel Kazakh-English corpus collected from news sections of government websites.

Corpus size

#	Website	Number of sentence pairs
1	http://www.akorda.kz/	35 368
2	https://primeminister.kz/	6 323
3	http://www.mfa.gov.kz/	9 152
4	http://economy.gov.kz/	6 123
5	https://strategy2050.kz/	203 665
6	News titles	41 899
	Total:	302 530
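
As a quick consistency check, the per-source counts in the table can be summed and compared with the reported total. The dictionary name pairs_per_source is only illustrative; the numbers are copied from the table.

```python
# Sentence-pair counts copied from the table above.
pairs_per_source = {
    "http://www.akorda.kz/": 35_368,
    "https://primeminister.kz/": 6_323,
    "http://www.mfa.gov.kz/": 9_152,
    "http://economy.gov.kz/": 6_123,
    "https://strategy2050.kz/": 203_665,
    "News titles": 41_899,
}

total = sum(pairs_per_source.values())
print(total)  # 302530, matching the "Total" row
```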
