Identifying Code-switching in Arabizi

We describe a corpus of social media posts that include utterances in Arabizi, a Roman-script rendering of Arabic, mixed with other languages, notably English, French, and Arabic written in the Arabic script. We manually annotated a subset of the texts with word-level language IDs; this is a non-trivial task due to the nature of mixed-language writing, especially on social media. We developed classifiers that can accurately predict the language ID tags. Then, we extended the word-level predictions to identify sentences that include Arabizi (and code-switching), and applied the classifiers to the raw corpus, thereby harvesting a large number of additional instances. The result is a large-scale dataset of Arabizi, with precise indications of code-switching between Arabizi and English, French, and Arabic.
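The step from word-level tags to sentence-level decisions can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names (`arabizi`, `english`, `french`, `arabic`, plus anything else treated as language-neutral), the function `summarize_sentence`, and the rule that a sentence counts as code-switched when it contains Arabizi and at least one switch between adjacent language-tagged words are all assumptions made for illustration.

```python
from typing import Dict, List, Union

# Hypothetical tag inventory; the tag set used in the released data may differ.
ARABIZI = "arabizi"
LANGUAGE_TAGS = {"arabizi", "english", "french", "arabic"}


def summarize_sentence(word_tags: List[str]) -> Dict[str, Union[bool, int]]:
    """Derive sentence-level flags from a sequence of word-level language tags."""
    # Keep only tags that name a language; neutral tokens (punctuation,
    # URLs, emoji, ...) are skipped so they do not count as switches.
    lang_seq = [tag for tag in word_tags if tag in LANGUAGE_TAGS]

    # A switch point is a pair of adjacent language-tagged words
    # whose languages differ.
    switch_points = sum(1 for a, b in zip(lang_seq, lang_seq[1:]) if a != b)

    has_arabizi = ARABIZI in lang_seq
    return {
        "has_arabizi": has_arabizi,
        "code_switched": has_arabizi and switch_points > 0,
        "switch_points": switch_points,
    }


if __name__ == "__main__":
    tags = ["arabizi", "arabizi", "other", "english", "arabizi"]
    print(summarize_sentence(tags))
```

In practice the input tags would come from a word-level classifier rather than a hand-written list; the sentence-level rule itself could also be stricter or looser than the one assumed here.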

If you use this dataset, please cite:

@inproceedings{shehadi-wintner-2022-identifying,
    title = "Identifying Code-switching in {A}rabizi",
    author = "Shehadi, Safaa and Wintner, Shuly",
    booktitle = "Proceedings of the The Seventh Arabic Natural Language Processing Workshop (WANLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.wanlp-1.18",
    pages = "194--204",
}

Repository files:

Identifying Code-switching in Arabizi.pdf
README.md
arabizi-reddit_new.zip
arabizi_tweet_ids_auto.csv
langdetect_lstm_bert_crf.ipynb
main.py
sen_annotated.csv
utils.py
words_annotated.csv

Repository: HaifaCLG/Arabizi