NArabizi corpus

This dataset is described in the paper "The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus" by Samia Touileb and Jeremy Barnes, accepted at Findings of ACL: ACL 2021.

This corpus is built on top of the NArabizi treebank by Seddah et al. (2020), which is freely available here.

Format and pre-processing

The extensions to the treebank are added in the .conllu files, which are split into pre-defined train, dev, and test sets (inherited from the original NArabizi treebank).
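As a rough illustration of how the .conllu files could be read, here is a minimal sketch that relies only on the general CoNLL-U conventions (comment lines starting with "#", tab-separated token lines, blank lines between sentences). The function name `read_conllu`, the sample sentence, and the `sent_id` key are assumptions made for this example. This README does not say which fields hold the added annotations, so the sketch returns every column untouched.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# A sentence is a (comments, token_rows) pair; each token row is a list of columns.
Sentence = Tuple[Dict[str, str], List[List[str]]]


def read_conllu(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one (comments, token_rows) pair per sentence (hypothetical helper)."""
    comments: Dict[str, str] = {}
    tokens: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if comments or tokens:
                yield comments, tokens
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines of the form "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                comments[key.strip()] = value.strip()
        else:
            tokens.append(line.split("\t"))
    # Flush the last sentence if the input does not end with a blank line.
    if comments or tokens:
        yield comments, tokens


# Made-up sample, not taken from the corpus.
sample = [
    "# sent_id = example_1\n",
    "1\tword\t_\t_\t_\t_\t_\t_\t_\t_\n",
    "\n",
]
sentences = list(read_conllu(sample))
```

To read a real split, pass an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points at one of the train, dev, or test files.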

The sentiment and topic annotations are presented as one sentence per line, following the format "conllu_ID \t annotation".
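The "conllu_ID \t annotation" layout can be loaded with a few lines of Python. The sketch below is only an illustration: the function name `parse_annotations`, the sample IDs, and the label strings are invented for the example and are not taken from the corpus.

```python
from typing import Dict, Iterable


def parse_annotations(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sentence ID to its label from 'conllu_ID<TAB>annotation' lines."""
    labels: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        # Split on the first tab only, in case the annotation itself contains tabs.
        sent_id, label = line.split("\t", 1)
        labels[sent_id.strip()] = label.strip()
    return labels


# Hypothetical IDs and labels, for illustration only.
sample = ["example_1\tLABEL_A\n", "example_2\tLABEL_B\n"]
annotations = parse_annotations(sample)
```

The resulting dictionary can then be joined with the sentences in the .conllu files through the shared sentence ID.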

Cite

If you use this dataset or code, please cite the following paper:

@misc{touileb2021interplay,
      title={The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus}, 
      author={Samia Touileb and Jeremy Barnes},
      year={2021},
      eprint={2105.07400},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
