UET.PJKC4.0

KC4Align

KC4Align evaluates the similarity of sentences by comparing their multilingual sentence embeddings, which are computed with LASER. In conjunction with deep-translator, KC4Align also works with low-resource languages that LASER does not support, by first translating the text into a supported high-resource language.
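As a rough illustration of embedding-based similarity, the sketch below computes cosine similarities between two small sets of vectors with NumPy. The function name `cosine_similarity_matrix` and the toy 3-dimensional vectors are assumptions made for this example; they stand in for real LASER embeddings, and this is not necessarily the exact scoring used inside KC4Align.

```python
import numpy as np

def cosine_similarity_matrix(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Return the matrix of cosine similarities between two sets of row vectors."""
    # Normalize each row to unit length so the dot product equals the cosine.
    src_norm = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt_norm = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return src_norm @ tgt_norm.T

# Toy vectors standing in for real sentence embeddings (hypothetical data).
src = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
tgt = np.array([[0.0, 2.0, 2.0], [3.0, 0.0, 0.0]])

scores = cosine_similarity_matrix(src, tgt)
# Index of the most similar target vector for each source vector.
best_match = scores.argmax(axis=1)
```

Here the first source vector is closest to the second target vector and vice versa, because cosine similarity ignores vector length and only compares direction.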

Installation KC4Align

You can build an environment using conda as follows:

* Create environment
    conda env create -f environment.yml

Install the Language-Agnostic Sentence Representations (LASER) toolkit from Facebook, which is used to embed the sentences in each document.

Then set the environment variables in your workspace:

* Point each variable to the corresponding project directory in your workspace
    - KC4ALIGN=${HOME}/projects/kc4align
    - LASER=${HOME}/projects/LASER
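A quick way to confirm that both variables are visible before running the tool is a small check like the one below. This is only a sketch: the helper `resolve_tool_dirs` is a hypothetical name introduced here and is not part of KC4Align.

```python
import os
from pathlib import Path

def resolve_tool_dirs(env=os.environ):
    """Return the KC4ALIGN and LASER directories, failing early if one is unset."""
    dirs = {}
    for name in ("KC4ALIGN", "LASER"):
        value = env.get(name)
        if not value:
            raise RuntimeError(f"Environment variable {name} is not set")
        # Expand a leading "~" so the result is a usable path.
        dirs[name] = Path(value).expanduser()
    return dirs

# Example with an explicit mapping instead of the real process environment.
example_dirs = resolve_tool_dirs(
    {"KC4ALIGN": "/tmp/projects/kc4align", "LASER": "/tmp/projects/LASER"}
)
```

Calling `resolve_tool_dirs()` with no argument reads the real environment and raises a `RuntimeError` naming the first variable that is missing.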

Run Tool

conda activate kc4align
cd $KC4ALIGN
python align.py --src_file src_path --tgt_file tgt_path -o output_dir_path

Variables Meaning

--src_file : path to source file where each line is a paragraph
--tgt_file : path to target file where each line is a paragraph
-o : output path where the aligned sentence pairs are written
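To make the meaning of the three options concrete, the sketch below mirrors them with `argparse`. It is an illustrative stand-in, not the actual argument handling in align.py; the `dest="output"` attribute name and the help strings are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the same three options as the command above."""
    parser = argparse.ArgumentParser(description="Illustrative mirror of the align.py options")
    parser.add_argument("--src_file", required=True, help="source file, one paragraph per line")
    parser.add_argument("--tgt_file", required=True, help="target file, one paragraph per line")
    # "output" is a hypothetical attribute name chosen for this sketch.
    parser.add_argument("-o", dest="output", required=True, help="output path for aligned pairs")
    return parser

# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(
    ["--src_file", "./data/sample/vi_test.txt",
     "--tgt_file", "./data/sample/lo_test.txt",
     "-o", "./output"]
)
```

Note that the long options take two leading dashes, while `-o` takes one.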

The alignment output is written in the following format:

score source_sentence target_sentence
...
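Output in this format can be loaded back for filtering by score. The sketch below assumes the three fields are separated by tab characters, which the description above does not specify; adjust the delimiter if the real output differs. The `AlignedPair` type, the `parse_alignment_lines` helper, and the sample lines are all made up for illustration.

```python
from typing import Iterable, List, NamedTuple

class AlignedPair(NamedTuple):
    score: float
    source: str
    target: str

def parse_alignment_lines(lines: Iterable[str]) -> List[AlignedPair]:
    """Parse 'score<TAB>source<TAB>target' lines (tab delimiter is an assumption)."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        score, source, target = line.split("\t", 2)
        pairs.append(AlignedPair(float(score), source, target))
    return pairs

# Hypothetical sample lines, not real tool output.
sample = ["0.91\tXin chào\tສະບາຍດີ\n", "0.42\tCảm ơn\tຂອບໃຈ\n"]
pairs = parse_alignment_lines(sample)
# Keep only pairs whose score reaches an arbitrary example threshold.
confident = [p for p in pairs if p.score >= 0.5]
```

The threshold of 0.5 is arbitrary; a suitable cutoff depends on the data and should be chosen by inspecting real output.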

Run Sample

python align.py --src_file ./data/sample/vi_test.txt --tgt_file ./data/sample/lo_test.txt -o ./output

Lao Sentence Tokenize

Sentence tokenization is the process of splitting text into individual sentences.
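For readers unfamiliar with the task, the sketch below shows a naive rule-based splitter that breaks text after `.`, `!`, or `?` followed by whitespace. It only illustrates the general idea on English text; it is not the method used by this tool, and Lao text may not mark sentence boundaries this way. The name `split_sentences` is an assumption for this example.

```python
import re
from typing import List

# A boundary is whitespace that directly follows sentence-ending punctuation.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text: str) -> List[str]:
    """Split text into sentences at punctuation-plus-whitespace boundaries."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]

sentences = split_sentences("This is one sentence. Is this another? Yes!")
```

A rule like this fails on abbreviations and on scripts without such punctuation, which is why language-specific tokenizers exist.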

Dependencies

NumPy, tested with 1.19.2

Run Tool

python sentence_tokenize -i input_file -o output_file

Variables Meaning

-i : path to the input Lao text file
-o : path to the output text file

Repository: NHDat2/UET.PJKC4.0