GitHub - 4mekki4/arabic-nlp-da

Domain Adaptation for Arabic Cross-Domain and Cross-Dialect Sentiment Analysis from Contextualized Word Embedding

Code for the NAACL 2021 paper "Domain Adaptation for Arabic Cross-Domain and Cross-Dialect Sentiment Analysis from Contextualized Word Embedding".

Requirements

Make sure you have pytorch >= 1.8, transformers >= 3.0.0, farasa, and pyarabic installed.
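As a quick sanity check before training, the sketch below reports which of the required packages cannot be imported. The module names (`torch`, `transformers`, `farasa`, `pyarabic`) are assumptions about how these packages are exposed in Python, and the helper `missing_modules` is hypothetical and not part of this repository. It does not verify the minimum versions listed above.

```python
import importlib.util

# Assumed import names for the packages listed above; adjust them if your
# environment exposes the packages under different module names.
REQUIRED_MODULES = ["torch", "transformers", "farasa", "pyarabic"]


def missing_modules(names):
    """Return the names in `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```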

Datasets

The data/ folder includes an example dataset pair used in the paper (TEAD_MSA to BRAD_LEV).

To reproduce the results reported in the paper, use the following datasets:

Scenario 1: Domain adaptation for dialects of the same region.

ArSentD-LEV: Arabic Sentiment Twitter Dataset for LEVantine dialect (Ramy, 2018)

Scenario 2: Domain adaptation across regional dialects.

HARD: Hotel Arabic-Reviews Dataset (Elnagar, 2018)
BRAD: Book reviews in Arabic dataset (Elnagar, 2016)
TEAD: Large Scale Arabic Dataset for Sentiment Analysis (Abdellaoui, 2018)

Scenario 3: Domain adaptation from MSA to Arabic dialects using social media data.

ArSAS: An Arabic Speech-Act and Sentiment Corpus of Tweets (Elmadany, 2018)
MSAC: Arabic Sentiment Analysis corpus (Link)
TSAC: Tunisian Sentiment Analysis Corpus (Medhaffar, 2017)
ASTD: Arabic Sentiment Tweets Dataset (Nabil, 2015)
AJGT: Arabic Jordanian General Tweets (Link)
TweetSYR (Saif, 2018)
AraSenti-Tweet: A Corpus for Arabic Sentiment Analysis of Saudi Tweets (Al-Twairesh, 2017)

Setting Up the Data

Binary sentiment analysis

For binary sentiment analysis, format your data so that every row carries one of two class labels:

Positive: for the positive rows.
Negative: for the negative rows.

Typically, each data folder has two files: {task}_train.csv and {task}_test.csv.
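The label convention above can be checked before training with a small script. This is a minimal sketch, not part of the repository: the column name `label` and the helper `check_labels` are hypothetical, since the README does not state the CSV column layout, so adapt them to your files.

```python
import csv

# The two class labels described above.
ALLOWED_LABELS = {"Positive", "Negative"}


def check_labels(csv_path, label_column="label"):
    """Count the labels in a CSV file and fail on any unexpected value.

    `label_column` is a hypothetical column name; change it to match your data.
    """
    counts = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row_number, row in enumerate(csv.DictReader(handle), start=1):
            label = row[label_column]
            if label not in ALLOWED_LABELS:
                raise ValueError(
                    f"{csv_path}, data row {row_number}: unexpected label {label!r}"
                )
            counts[label] = counts.get(label, 0) + 1
    return counts
```

Running `check_labels` on both the {task}_train.csv and {task}_test.csv files gives a quick view of the class balance and catches stray label spellings early.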

Training

To train a model with a domain adaptation method:

python train.py ALDA \
    --lr 5e-6 \
    --source TEAD_MSA \
    --target BRAD_LEV

Set --source and --target to your source and target datasets. To train with one of the other domain adaptation methods, replace ALDA with MMD, CORAL, or DANN.
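To compare the four methods on the same dataset pair, the command above can be generated programmatically. The following is a minimal sketch that uses only the positional method argument and the --lr, --source, and --target flags shown above. The helpers `build_command` and `run_all` are hypothetical, and reusing the learning rate 5e-6 for every method is an assumption rather than a recommendation from the paper.

```python
import subprocess
import sys

# Method names accepted as the first argument of train.py, per this README.
METHODS = ("ALDA", "MMD", "CORAL", "DANN")


def build_command(method, source, target, lr="5e-6"):
    """Build the train.py argument list for one domain adaptation method."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    return [
        sys.executable, "train.py", method,
        "--lr", str(lr),
        "--source", source,
        "--target", target,
    ]


def run_all(source, target, lr="5e-6"):
    """Run train.py once per method; stops at the first failing run."""
    for method in METHODS:
        subprocess.run(build_command(method, source, target, lr), check=True)
```

For example, `run_all("TEAD_MSA", "BRAD_LEV")` would launch four training runs in sequence from the repository root.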

Citation

If you use this code, please cite this paper:

@inproceedings{el-mekki-etal-2021-domain,
    title = "Domain Adaptation for {A}rabic Cross-Domain and Cross-Dialect Sentiment Analysis from Contextualized Word Embedding",
    author = "El Mekki, Abdellah  and
      El Mahdaouy, Abdelkader  and
      Berrada, Ismail  and
      Khoumsi, Ahmed",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.226",
    pages = "2824--2837",
    abstract = "Finetuning deep pre-trained language models has shown state-of-the-art performances on a wide range of Natural Language Processing (NLP) applications. Nevertheless, their generalization performance drops under domain shift. In the case of Arabic language, diglossia makes building and annotating corpora for each dialect and/or domain a more challenging task. Unsupervised Domain Adaptation tackles this issue by transferring the learned knowledge from labeled source domain data to unlabeled target domain data. In this paper, we propose a new unsupervised domain adaptation method for Arabic cross-domain and cross-dialect sentiment analysis from Contextualized Word Embedding. Several experiments are performed adopting the coarse-grained and the fine-grained taxonomies of Arabic dialects. The obtained results show that our method yields very promising results and outperforms several domain adaptation methods for most of the evaluated datasets. On average, our method increases the performance by an improvement rate of 20.8{\%} over the zero-shot transfer learning from BERT.",
}

Acknowledgment

The structure of this code is largely based on ALDA and CDAN. We are very grateful to their authors for open-sourcing their code.
