SamiaTouileb/NArabizi

NArabizi corpus

This dataset is described in the paper "The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus" by Samia Touileb and Jeremy Barnes, accepted at Findings of ACL: ACL 2021.

This corpus is built on top of the NArabizi treebank by Seddah et al., 2020, which is freely available.

Format and pre-processing

The extensions to the treebank are added in the .conllu files, which are split into pre-defined train, dev, and test sets (as inherited from the original NArabizi treebank).

The sentiment and topic annotations are presented as one sentence per line, following the format "conllu_ID \t annotation".
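
As a rough illustration of the "conllu_ID \t annotation" format, the annotation lines could be read into a mapping from sentence ID to label along the lines below. This is a minimal sketch, not code shipped with the corpus: the function name `parse_annotations` and the sample IDs and labels are hypothetical, and the actual label inventory should be taken from the annotation files themselves.

```python
from typing import Dict, Iterable


def parse_annotations(lines: Iterable[str]) -> Dict[str, str]:
    """Map each CoNLL-U sentence ID to its annotation.

    Assumes one sentence per line in the form "conllu_ID<TAB>annotation".
    """
    annotations: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so the ID and annotation stay separate.
        sent_id, sep, label = line.partition("\t")
        if not sep:
            raise ValueError(
                f"expected 'conllu_ID<TAB>annotation', got: {line!r}"
            )
        annotations[sent_id.strip()] = label.strip()
    return annotations


# Hypothetical sample lines; the IDs and labels are made up for illustration.
sample = ["sent_001\tpositive\n", "sent_002\tnegative\n", "\n"]
parsed = parse_annotations(sample)
```

Because a file object opened in text mode is an iterable of lines, it can be passed directly to `parse_annotations` in place of the sample list.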

Cite

If you use this dataset or code, please cite the following paper:

@misc{touileb2021interplay,
      title={The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus}, 
      author={Samia Touileb and Jeremy Barnes},
      year={2021},
      eprint={2105.07400},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
