T-HSAB-A Tunisian Hate Speech and Abusive Dataset

The first Tunisian Hate Speech and Abusive Dataset

T-HSAB Dataset: Context and Topics

T-HSAB (Tunisian Hate Speech and ABusive) is the first Tunisian Arabic hate speech and abusive language dataset. It was presented at the 7th International Conference on Arabic Language Processing (ICALP 2019), held October 16-17, 2019 in Nancy, France.

Since the "Jasmine Revolution" in 2011, Tunisia has entered a new era of broad freedom of expression with full access to social media. This has been accompanied by an unrestricted spread of toxic content such as abusive and hate speech. T-HSAB combines 6,024 Tunisian comments labeled as normal, abusive, or hate. The collected comments were posted between October 2018 and March 2019.

Data Collection & Resources

T-HSAB was built from Tunisian comments scraped from Facebook and YouTube. We collected the comments using multiple queries built around entities that are commonly targeted by abusive or hate speech, such as “اليهود” (Jews), “الأفارقة” (Africans), and “المساواة في الميراث” (gender equality in inheritance).

Data Annotation Guidelines

The annotation was carried out by three Tunisian-speaking annotators. The annotation instructions defined the three label categories as follows:

• Normal comments are those instances with no offensive, aggressive, insulting, or profane content.

• Abusive comments are those instances that contain offensive, aggressive, insulting, or profane content.

• Hate comments are those instances that: (a) contain abusive language, (b) direct the abusive language at a specific person or group of people, and (c) demean or dehumanize that person or group based on their descriptive identity (race, gender, religion, disability, skin color, belief).

• The annotators were provided with the nicknames commonly used, within hate/abusive contexts, to refer to certain political parties, minorities, and ethnic/religious groups. For example, “كحلوش” (dark-skinned), which refers to the African ethnic group, is usually used within hate speech contexts.
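The three label definitions above can be read as a small decision rule. The sketch below is only an illustration of that reading: the `Judgement` class, its field names, and `assign_label` are hypothetical and are not part of the dataset or of the authors' annotation tooling.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """Hypothetical yes/no answers an annotator might give for one comment."""
    abusive_language: bool         # (a) offensive, aggressive, insulting, or profane content
    targets_person_or_group: bool  # (b) aimed at a specific person or group
    demeans_identity: bool         # (c) demeans/dehumanizes based on identity


def assign_label(j: Judgement) -> str:
    """Map the three criteria to one of the three T-HSAB labels."""
    if not j.abusive_language:
        return "normal"
    # Hate requires all three criteria; abusive language alone is "abusive".
    if j.targets_person_or_group and j.demeans_identity:
        return "hate"
    return "abusive"


label = assign_label(Judgement(True, True, False))  # abusive but not identity-based
```

Under this reading, a comment that insults a specific person without reference to their identity is labeled abusive rather than hate.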

Annotation Evaluation: Methods & Results

Annotation reliability was evaluated with three agreement measures:

1- Pairwise Percent Agreement Measure (PRAM): best value, between annotator 1 and annotator 2 = 97.9%

2- Cohen's Kappa (K): best value, between annotator 1 and annotator 2 = 96.1%

3- Krippendorff’s Alpha (α) = 75%
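As a rough illustration of the first two measures, the following sketch computes pairwise percent agreement and Cohen's kappa for two annotators using their standard definitions. The function names and the toy label lists are assumptions made for this example; they are not taken from the T-HSAB annotation files, and this is not necessarily how the authors computed their figures. Both functions return fractions, so multiplying by 100 gives the percentage form used above.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labels independently according to their own label frequencies.
    p_expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for illustration only (not real T-HSAB annotations).
annotator_1 = ["normal", "normal", "abusive", "hate", "normal", "abusive"]
annotator_2 = ["normal", "normal", "abusive", "abusive", "normal", "abusive"]

pram = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa is always at most the raw percent agreement, because it discounts the agreement that would be expected by chance given each annotator's label distribution.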

T-HSAB: Classification Experiments

1- Binary Classification (Normal, Abusive):

Best performance by Naive Bayes with an F-measure of 92.3%

2- Multi-Class Classification (Normal, Abusive, Hate):

Best performance by Naive Bayes with an F-measure of 83.6%
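This README does not specify the features, preprocessing, or toolkit behind these results, so the following is only a minimal sketch of the general technique: a multinomial Naive Bayes classifier over whitespace tokens with add-one smoothing, plus a per-class F-measure. The class `TinyNaiveBayes`, the helper `f_measure`, and the English placeholder texts are all hypothetical and stand in for the real Tunisian comments.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(text.split())
        self.vocab = {tok for c in self.token_counts.values() for tok in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in text.split():
                if tok in self.vocab:  # tokens unseen in training are ignored
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def f_measure(gold, predicted, positive):
    """F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Placeholder training data (English stand-ins, not T-HSAB comments).
train_texts = ["good nice day", "nice weather today", "stupid idiot", "idiot fool"]
train_labels = ["normal", "normal", "abusive", "abusive"]
model = TinyNaiveBayes().fit(train_texts, train_labels)

test_texts = ["nice day", "idiot"]
predictions = [model.predict(t) for t in test_texts]
score = f_measure(["normal", "abusive"], predictions, positive="abusive")
```

The same code handles the multi-class setting unchanged, since `fit` simply learns one token distribution per distinct label; an overall score could then be obtained by averaging `f_measure` across the three classes.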

Paper Citation

Haddad H., Mulki H., Oueslati A. (2019) T-HSAB: A Tunisian Hate Speech and Abusive Dataset. In: Smaïli K. (eds) Arabic Language Processing: From Theory to Practice. ICALP 2019. Communications in Computer and Information Science, vol 1108. Springer, Cham
