AIRABIC: Arabic Dataset for Performance Evaluation of AI Detectors

Link to the paper:

https://ieeexplore.ieee.org/document/10459781

Dataset Characteristics:

Source Diversity: Our human-written texts are drawn from a wide range of materials to keep the content objective and unbiased. The sources include:

Books: A selection of 40 books covering various topics and periods. Each book contributes unique passages to avoid overlap.

News Articles: Articles from the Aljazeera website published between 2014 and 2016, before generative language models were widely available, to rule out synthetic text.

Sample Size: The dataset is carefully constructed to comply with the character limits of AI detectors. Human-written texts vary in length, with shorter samples padded to meet a 1000-character minimum. AI-generated texts from ChatGPT are tailored to maintain consistency in character count and meaningful content.
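The length constraint above can be checked with a short script. This is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes that Python's len() (which counts Unicode code points, diacritics included) is an acceptable stand-in for how a given AI detector counts characters.

```python
MIN_CHARS = 1000  # minimum length stated in this README


def is_long_enough(text: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True if the text has at least min_chars characters."""
    return len(text) >= min_chars


def short_samples(texts, min_chars: int = MIN_CHARS):
    """Return the indices of samples that fall below the minimum length."""
    return [i for i, t in enumerate(texts) if not is_long_enough(t, min_chars)]
```

Calling short_samples() on a list of loaded texts returns the positions of any samples that would need attention before being submitted to a detector with a 1000-character minimum.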

Text Variations: To test the adaptability of AI detectors, the dataset includes a diverse range of text structures:

Single and multi-paragraph compositions.
Bullet point formats.
Passages with in-text citations.

This diversity ensures a comprehensive evaluation of AI detectors across different text types and structures.

License

The AIRABIC dataset is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

What does this mean?

Attribution: You must give appropriate credit to the creators of the dataset.
Adaptation: You may remix, transform, and build upon the material. CC BY 4.0 does not require you to distribute your contributions under the same license.
Commercial Use: You may use the dataset for any purpose, including commercially.

The AIRABIC dataset comprises two main categories:

AI-generated texts.xlsx - Contains 500 examples of AI-generated Arabic texts.
Human-written texts.xlsx - Contains 500 examples of Arabic texts written by humans.

Ready-to-Use Text Files

The dataset is also available in a ready-to-use text file format. Each category (AI-generated texts and Human-written texts) has been split into individual .txt files for ease of use. The .txt files are organized in the following folders:

output_AI-generated_txt: Contains 500 AI-generated Arabic texts, each saved as an individual .txt file.
output_human-written_txt: Contains 500 human-written Arabic texts, each saved as an individual .txt file.

This makes it straightforward to integrate the dataset into existing natural language processing pipelines or research projects.
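As an illustration, the two folders can be read into a single labeled list with the standard library alone. This is a sketch under stated assumptions: the function names are hypothetical, the files are assumed to be UTF-8 encoded, the folders are assumed to sit directly under the given root, and the label convention (1 for AI-generated, 0 for human-written) is chosen here for the example.

```python
from pathlib import Path


def load_folder(folder, label):
    """Read every .txt file in folder and pair its text with the given label."""
    samples = []
    for path in sorted(Path(folder).glob("*.txt")):
        # UTF-8 is an assumption; adjust if the files use another encoding.
        samples.append((path.read_text(encoding="utf-8"), label))
    return samples


def load_airabic(root="."):
    """Return (text, label) pairs: 1 for AI-generated, 0 for human-written."""
    root = Path(root)
    ai = load_folder(root / "output_AI-generated_txt", 1)
    human = load_folder(root / "output_human-written_txt", 0)
    return ai + human
```

Passing the directory that holds both folders to load_airabic() yields one list that can be shuffled and split for evaluation.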

Codebase

Diacritics.py - This script provides functionality for adding diacritics to Arabic texts.
NonDiacritics.py - This script provides functionality for removing diacritics from Arabic texts.
main.py - Contains the main class.
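To give a sense of what diacritic removal involves, the sketch below strips the common Arabic diacritical marks (tashkeel) with a regular expression. It is not the code in NonDiacritics.py, whose implementation is not described here; the function name is hypothetical, and the character set is an assumption covering U+064B through U+0652 plus the superscript alef U+0670.

```python
import re

# Assumed set of marks: fathatan (U+064B) through sukun (U+0652),
# plus superscript alef (U+0670). The repository script may use a different set.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritical marks listed above removed."""
    return _DIACRITICS.sub("", text)
```

Base letters, digits, punctuation, and whitespace pass through unchanged, so the function can be applied to whole passages before they are sent to a detector.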

Repository: Hamed1Hamed/AIRABIC
