<a href="https://colab.research.google.com/github/MohamedAhmedGalal/human-vs-machine-arabic-nlp/blob/main/last_delivered_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explanation of `drive.mount('/content/drive')`

## Purpose:
- The `drive.mount('/content/drive')` command allows you to connect your Google Drive to your Colab notebook.
- This enables access to files stored in your Google Drive directly from the Colab environment.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
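!pip install gdown
# Note: given the share-page URL, gdown saves the HTML viewer page ("view?usp=drive_link"),
# not the zip archive; the `gdown --id` call in the next cell fetches the actual file.
!gdown https://drive.google.com/file/d/1Wf0J8e4MF7N5M6VcCpxTNIHKIgssaARx/view?usp=drive_link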


Downloading...
From: https://drive.google.com/file/d/1Wf0J8e4MF7N5M6VcCpxTNIHKIgssaARx/view?usp=drive_link
To: /content/view?usp=drive_link
92.4kB [00:00, 2.12MB/s]
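
The output above shows gdown writing `view?usp=drive_link` (92.4 kB), which is the Drive viewer page rather than the archive. One way around this is to pull the file ID out of the share link and build a `uc?id=` URL, the form that appears in the output of the later `gdown --id` call. A minimal sketch, assuming the share link follows the `.../file/d/<id>/...` layout; `drive_file_id` and `direct_download_url` are hypothetical helper names, not part of gdown:

```python
import re

def drive_file_id(share_url: str) -> str:
    """Extract the file ID from a Drive share link of the form .../file/d/<id>/..."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"No Drive file ID found in: {share_url}")
    return match.group(1)

def direct_download_url(share_url: str) -> str:
    """Build the uc?id=<id> form of the URL from a share link."""
    return f"https://drive.google.com/uc?id={drive_file_id(share_url)}"

share_url = (
    "https://drive.google.com/file/d/"
    "1Wf0J8e4MF7N5M6VcCpxTNIHKIgssaARx/view?usp=drive_link"
)
print(direct_download_url(share_url))
```

The printed URL can then be passed to gdown in place of the share link.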


# Explanation of `!cp -r /content/drive/MyDrive/Human_Articles_VS_Machine_Generated_Articles_Detection_datasetV2.zip .`

## Purpose:
- The command copies the specified file or directory from Google Drive to the current working directory in Colab.
- It uses the `cp` command to copy files and folders.

In [None]:
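!gdown --id 1Wf0J8e4MF7N5M6VcCpxTNIHKIgssaARx

# gdown already saves the archive in /content, so this copy has the same source
# and destination (see the cp message in the output below).
!cp -r /content/Copy\ of\ Human_Articles_VS_Machine_Generated_Articles_Detection_datasetV2.zip /content/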


Downloading...
From (original): https://drive.google.com/uc?id=1Wf0J8e4MF7N5M6VcCpxTNIHKIgssaARx
From (redirected): https://drive.google.com/uc?id=1Wf0J8e4MF7N5M6VcCpxTNIHKIgssaARx&confirm=t&uuid=7c96d181-75d5-4c31-ae92-15697cd8de9d
To: /content/Copy of Human_Articles_VS_Machine_Generated_Articles_Detection_datasetV2.zip
100% 35.7M/35.7M [00:01<00:00, 22.7MB/s]
cp: '/content/Copy of Human_Articles_VS_Machine_Generated_Articles_Detection_datasetV2.zip' and '/content/Copy of Human_Articles_VS_Machine_Generated_Articles_Detection_datasetV2.zip' are the same file


In [None]:
# %%capture
!unzip /content/Copy\ of\ Human_Articles_VS_Machine_Generated_Articles_Detection_datasetV2.zip

Streaming output truncated to the last 5000 lines.
  inflating: dataset/articles/real/YEM_Almasdar/YEM_Almasdar_article_83.txt  
  inflating: dataset/articles/real/YEM_Almasdar/YEM_Almasdar_article_121.txt  
  inflating: dataset/articles/real/YEM_Almasdar/YEM_Almasdar_article_156.txt  
  inflating: dataset/summaries/MUR_sahra/MUR_sahra_article_215_summary.txt  
  inflating: dataset/summaries/MUR_sahra/MUR_sahra_article_43_summary.txt  
  inflating: dataset/summaries/MUR_sahra/MUR_sahra_article_198_summary.txt  
  inflating: dataset/summaries/MUR_sahra/MUR_sahra_article_102_summary.txt  
  inflating: dataset/summaries/MUR_sahra/MUR_sahra_article_31_summary.txt  
  inflating: dataset/summaries/MUR_sahra/MUR_sahra_article_232_summary.txt  
  inflating: dataset/summaries/MUR_sahra/MUR_sahra_article_208_summary.txt  
  inflating: dataset/summaries/MUR_sahra/MUR_sahra_article_79_summary.txt  
  inflating: dataset/summaries/MUR_sahra/MUR_sahra_article_254_summary.txt  
  inflati

## Purpose:
- This command installs several Python libraries essential for natural language processing (NLP) tasks, particularly those involving Arabic text.

## Breakdown of Components:

1. **`%%capture`**:
   - A magic command in Jupyter and Colab notebooks that captures the output of the cell, preventing it from displaying in the notebook.
   - Useful for suppressing verbose installation logs.

2. **`!pip install`**:
   - The `!` allows execution of shell commands within a notebook cell.
   - `pip install` is the command to install Python packages.

3. **List of Packages**:
   - **`transformers`**: A library by Hugging Face providing pre-trained models for NLP tasks.
   - **`datasets`**: Offers a collection of ready-to-use datasets and tools for loading and processing them.
   - **`arabic-reshaper`**: Reshapes Arabic text into its connected letter forms for applications that don't support Arabic script.
   - **`python-bidi`**: Implements the Unicode Bidirectional Algorithm necessary for displaying right-to-left languages like Arabic.
   - **`wordcloud`**: Generates word clouds from text data.
   - **`arabert`**: The AraBERT project's package; this notebook uses its `ArabertPreprocessor` to preprocess Arabic text.
   - **`pyarabic`**: A library for Arabic language processing, offering functions to manipulate Arabic text.
   - **`tensorboard`**: A visualization tool for monitoring machine learning experiments.


In [None]:
# %%capture
!pip install transformers datasets arabic-reshaper python-bidi wordcloud arabert pyarabic tensorboard



# Explanation of Imported Libraries and Modules

This section provides an overview of the libraries and modules imported in the code, highlighting their purposes and functionalities.

## Standard Libraries

- **`os`**: Provides a way to interact with the operating system, enabling tasks like file and directory manipulation.
- **`warnings`**: Offers a mechanism to control the display of warning messages.
- **`re`**: Supports regular expression operations for pattern matching and string manipulation.
- **`collections.Counter`**: A subclass of the dictionary object, useful for counting hashable objects.

## Data Manipulation and Analysis

- **`pandas as pd`**: A powerful data manipulation and analysis library, providing data structures like DataFrames.
- **`numpy as np`**: Supports large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.

## Machine Learning and NLP

- **`torch`**: The core library of PyTorch, an open-source machine learning framework.
- **`datasets.Dataset`**: Part of the Hugging Face Datasets library, facilitating the handling of large datasets.
- **`transformers`**: A library by Hugging Face that provides general-purpose architectures for natural language understanding and generation.
  - **`AutoTokenizer`**: Automatically retrieves the appropriate tokenizer for a given model.
  - **`AutoModelForSequenceClassification`**: Fetches a pre-trained model suitable for sequence classification tasks.
  - **`TrainingArguments`**: Specifies hyperparameters for training models.
  - **`Trainer`**: A class that simplifies the training and evaluation of models.
  - **`EarlyStoppingCallback`**: A callback to stop training when a specified metric stops improving.

## Evaluation Metrics

- **`sklearn.metrics`**: Provides functions to assess the performance of machine learning models.
  - **`accuracy_score`**: Calculates the accuracy of predictions.
  - **`precision_recall_fscore_support`**: Computes precision, recall, F1-score, and support.
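  - **`roc_curve`**: Computes the false-positive rates, true-positive rates, and thresholds of the Receiver Operating Characteristic (ROC) curve.
  - **`auc`**: Calculates the Area Under the Curve (AUC) for ROC.
  - **`precision_recall_curve`**: Computes precision-recall pairs across decision thresholds; the curve itself is drawn separately with Matplotlib.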
  - **`average_precision_score`**: Computes the average precision score.
  - **`classification_report`**: Builds a text report showing the main classification metrics.
  - **`confusion_matrix`**: Generates a confusion matrix to evaluate classification accuracy.
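
To make the relationship between these metrics concrete, the following sketch derives accuracy, precision, recall, and F1 from the four confusion-matrix counts in plain Python. It is an illustration with made-up labels, not the notebook's evaluation code; `binary_metrics` is a hypothetical helper, and label `1` is assumed to be the positive (generated) class.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        # Rows are true labels, columns are predicted labels
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 0 = real article, 1 = generated article
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these labels there are three true positives, three true negatives, one false positive, and one false negative, so all four scores come out to 0.75.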

## Visualization

- **`matplotlib.pyplot as plt`**: A plotting library for creating static, animated, and interactive visualizations.
- **`seaborn as sns`**: A data visualization library based on Matplotlib, offering a high-level interface for drawing attractive statistical graphics.

## Arabic Text Processing

- **`arabic_reshaper`**: Reshapes Arabic text for proper display in environments that do not support Arabic script.
- **`bidi.algorithm.get_display`**: Reorders bidirectional text for correct display, essential for languages like Arabic.
- **`arabert.preprocess`**: Provides preprocessing functions tailored for Arabic text, enhancing compatibility with the AraBERT model.
- **`pyarabic.araby`**: Offers utilities for Arabic text processing.
  - **`strip_tashkeel`**: Removes diacritics from Arabic text.
  - **`strip_tatweel`**: Eliminates elongation characters from Arabic text.
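
As a rough illustration of what these two helpers do, the sketch below reimplements them with the standard library only. It is a stand-in for `pyarabic`, not its source: it assumes tashkeel means the eight marks U+064B to U+0652 and tatweel means U+0640, and the `_sketch` function names are hypothetical.

```python
import re

# Assumption: tashkeel = the marks U+064B..U+0652 (fathatan through sukun,
# including shadda); tatweel = the elongation character U+0640.
TASHKEEL = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"

def strip_tashkeel_sketch(text):
    """Remove diacritical marks, keeping the base letters."""
    return TASHKEEL.sub("", text)

def strip_tatweel_sketch(text):
    """Remove the elongation character used to stretch words."""
    return text.replace(TATWEEL, "")

# Fully vocalized word: meem+damma, hah+fatha, meem+shadda+fatha, dal
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
# Word stretched with two tatweel characters: jeem, meem, tatweel x2, ya, lam
stretched = "\u062C\u0645\u0640\u0640\u064A\u0644"

print(strip_tashkeel_sketch(vocalized))
print(strip_tatweel_sketch(stretched))
```

Both calls return only the base letters, which is the form the preprocessing pipeline passes on to the tokenizer.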

## Word Cloud Generation

- **`wordcloud`**: Generates word clouds, which are visual representations of text data where the size of each word indicates its frequency or importance.

## Summary

The imported libraries and modules encompass a wide range of functionalities, including data manipulation, machine learning, evaluation metrics, visualization, and specialized processing for Arabic text. This comprehensive set of tools facilitates the development and evaluation of models, particularly those focused on Arabic language processing.


In [None]:
# Standard library
import os
import re
import warnings
from collections import Counter

# Data handling and modelling
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support,
    classification_report, confusion_matrix, roc_curve, auc,
    precision_recall_curve, average_precision_score,
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Arabic text processing
import arabic_reshaper
from arabic_reshaper import reshape
from bidi.algorithm import get_display
import arabert.preprocess
from pyarabic.araby import strip_tashkeel, strip_tatweel

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# ArabicTextPreprocessor Class:

## Overview
The `ArabicTextPreprocessor` class provides a comprehensive pipeline for cleaning and preprocessing Arabic text, preparing it for natural language processing (NLP) tasks.

## Key Components

### Initialization
- **`__init__`**:
  - Initializes the AraBERT preprocessor (`ArabertPreprocessor`) with a specified model name. Note that the default, `FacebookAI/xlm-roberta-base`, is an XLM-RoBERTa checkpoint rather than an AraBERT one.
  - Compiles regex patterns for efficiency (e.g., URLs, HTML tags, emails).

### Text Cleaning Methods
1. **`normalize_arabic_characters`**:
   - Unifies Arabic character variants: Alef forms (أ, إ, آ → ا), Alef Maqsura (ى → ي), and Ta Marbuta (ة → ه).
2. **`remove_emojis`**:
   - Removes emojis from text using the `emoji` library.
3. **`remove_diacritics`**:
   - Strips diacritical marks (tashkeel) and the elongation character (tatweel) using `pyarabic`.
4. **`basic_clean`**:
   - Returns an empty string for non-string input; otherwise performs the following:
     - Drops the first line when it has at most three words and contains "مقال".
     - Normalizes characters.
     - Removes emojis, diacritics, HTML tags, URLs, email addresses, and punctuation.
     - Strips extra whitespace.

### Preprocessing Pipelines
1. **`preprocess`**:
   - Executes the full cleaning pipeline, including AraBERT-specific preprocessing.
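2. **`advanced_preprocessing`**:
   - Exposes flags for three steps:
     - Normalize characters.
     - Remove diacritics.
     - Remove punctuation.
   - Because it calls `basic_clean` first, which already performs all three steps, the flags currently repeat those steps and cannot switch them off.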

## Benefits
- **Comprehensive Cleaning**: Handles a wide range of Arabic-specific text challenges.
- **Customizable**: Provides options for advanced, configurable preprocessing.
- **AraBERT Integration**: Ensures compatibility with Arabic pre-trained models.



In [None]:
import re
import unicodedata
import emoji
from pyarabic.araby import strip_tashkeel, strip_tatweel
import arabert.preprocess

class ArabicTextPreprocessor:
    def __init__(self, model_name="FacebookAI/xlm-roberta-base"):
        """
        Initialize the Arabic text preprocessor

        Args:
            model_name (str): Name of the AraBERT model to use for preprocessing
        """
        self.arabert_prep = arabert.preprocess.ArabertPreprocessor(
            model_name=model_name
        )

        # Compile regex patterns for efficiency
        self.url_pattern = re.compile(r'http\S+|www\.\S+')
        self.html_tag_pattern = re.compile(r'<[^<]+?>')
        # Note: '|' inside a character class is a literal pipe, so the TLD class is [A-Za-z]
        self.email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

        # Define additional cleaning patterns
        self.punctuation_pattern = re.compile(r'[^\w\s\u0600-\u06FF]')

    def normalize_arabic_characters(self, text):
        """
        Normalize Arabic characters to improve consistency

        Args:
            text (str): Input text

        Returns:
            str: Normalized text
        """
        # Normalize Alef variations
        text = text.replace('أ', 'ا')
        text = text.replace('إ', 'ا')
        text = text.replace('آ', 'ا')

        # Normalize Ya variations
        text = text.replace('ى', 'ي')

        # Normalize Ta Marbuta to Ha
        text = text.replace('ة', 'ه')

        return text

    def remove_emojis(self, text):
        """
        Remove emojis from text

        Args:
            text (str): Input text

        Returns:
            str: Text without emojis
        """
        return emoji.replace_emoji(text, replace='')

    def remove_diacritics(self, text):
        """
        Remove Arabic diacritical marks

        Args:
            text (str): Input text

        Returns:
            str: Text without diacritics
        """
        # Remove tashkeel and tatweel
        text = strip_tashkeel(text)
        text = strip_tatweel(text)

        return text

    def basic_clean(self, text):
        """
        Perform basic text cleaning

        Args:
            text (str): Input text

        Returns:
            str: Cleaned text
        """
        if not isinstance(text, str):
            return ""

        # Split text into lines
        lines = text.split('\n')

        # Remove first line if it contains 'مقال' and has at most 3 words
        if lines and len(lines[0].split()) <= 3 and 'مقال' in lines[0]:
            lines = lines[1:]

        # Rejoin the text
        text = '\n'.join(lines)

        # Normalize characters
        text = self.normalize_arabic_characters(text)

        # Remove emojis
        text = self.remove_emojis(text)

        # Remove diacritics
        text = self.remove_diacritics(text)

        # Remove HTML tags
        text = self.html_tag_pattern.sub('', text)

        # Remove URLs
        text = self.url_pattern.sub('', text)

        # Remove email addresses
        text = self.email_pattern.sub('', text)

        # Remove punctuation (except Arabic characters and whitespace)
        text = self.punctuation_pattern.sub('', text)

        # Remove asterisks and other special characters
        text = text.replace('*', '')

        # Remove extra whitespace
        text = ' '.join(text.split())

        return text

    def preprocess(self, text):
        """
        Full preprocessing pipeline

        Args:
            text (str): Input text

        Returns:
            str: Preprocessed text
        """
        # Basic cleaning
        text = self.basic_clean(text)

        # AraBERT preprocessing
        text = self.arabert_prep.preprocess(text)

        return text

    def advanced_preprocessing(self, text,
                                normalize_chars=True,
                                remove_diacritics=True,
                                remove_punctuation=True):
        """
        Advanced preprocessing with configurable options

        Args:
            text (str): Input text
            normalize_chars (bool): Normalize Arabic characters
            remove_diacritics (bool): Remove diacritical marks
            remove_punctuation (bool): Remove punctuation

        Returns:
            str: Advanced preprocessed text
        """
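        # Basic cleaning. Note: basic_clean already normalizes characters and
        # strips diacritics and punctuation unconditionally, so the optional
        # steps below only repeat that work; a False flag does not skip a step.
        text = self.basic_clean(text)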

        # Optional character normalization
        if normalize_chars:
            text = self.normalize_arabic_characters(text)

        # Optional diacritics removal
        if remove_diacritics:
            text = self.remove_diacritics(text)

        # Optional punctuation removal
        if remove_punctuation:
            text = self.punctuation_pattern.sub('', text)

        # AraBERT preprocessing
        text = self.arabert_prep.preprocess(text)

        return text
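
The regex-based steps of `basic_clean` can be tried out without the `arabert` and `emoji` dependencies. The sketch below is a reduced stand-in, not the class above: `clean_sketch` is a hypothetical name, the patterns are copied from the class, and the emoji, diacritic, email, first-line, and AraBERT steps are left out.

```python
import re

URL = re.compile(r"http\S+|www\.\S+")
HTML_TAG = re.compile(r"<[^<]+?>")
PUNCTUATION = re.compile(r"[^\w\s\u0600-\u06FF]")

# Character map mirroring normalize_arabic_characters
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})

def clean_sketch(text):
    """Apply normalization, tag/URL/punctuation removal, and whitespace cleanup."""
    text = text.translate(NORMALIZE)
    text = HTML_TAG.sub("", text)
    text = URL.sub("", text)
    text = PUNCTUATION.sub("", text)
    return " ".join(text.split())

# "<p>" + three Arabic words + "!" + "</p>" followed by a URL
sample = (
    "<p>\u0623\u062D\u0645\u062F \u0641\u064A "
    "\u0627\u0644\u0645\u062F\u0631\u0633\u0629!</p> https://example.com"
)
print(clean_sketch(sample))
```

Only the three Arabic words remain, with the hamza-carrying alef and the ta marbuta replaced by their normalized forms. Because the punctuation pattern excludes the whole U+0600 to U+06FF block, Arabic punctuation such as the Arabic comma is kept by this pattern.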

# Function: `load_and_preprocess_data`

## Purpose
This function loads Arabic text data from a specified directory, preprocesses it using the `ArabicTextPreprocessor` class, and provides detailed analytics on the preprocessing steps.

## Parameters
- **`split_dir`**: Path to the directory containing text files.
- **`metadata_file`**: Path to an Excel file with metadata (currently read but not utilized in the function).

## Workflow
1. **Initialization**:
   - Instantiate the `ArabicTextPreprocessor` for text preprocessing.
   - Read the metadata Excel file into a DataFrame (though it's not used further in the function).

2. **Data Processing**:
   - Initialize a dictionary to store text data, preprocessed text, labels, text lengths, preprocessed text lengths, and filenames.
   - Iterate over each `.txt` file in the specified directory:
     - Read the file content.
     - Preprocess the text using the `preprocess` method.
     - Determine the label based on the filename (`1` if 'generated' is in the filename; otherwise, `0`).
     - Calculate the word count for both original and preprocessed text.
     - Store the original text, preprocessed text, label, word counts, and filename in the dictionary.

3. **DataFrame Creation**:
   - Convert the dictionary into a Pandas DataFrame for structured data handling.

4. **Preprocessing Statistics**:
   - Compute and store various statistics:
     - Total number of files processed.
     - Average word count of original texts.
     - Average word count of preprocessed texts.
     - Average number of characters removed during preprocessing.
     - Count of files where the first line was removed (if it contained 'مقال').

## Returns
- **`df`**: A DataFrame containing the original text, preprocessed text, labels, word counts, and filenames.
- **`preprocessing_stats`**: A dictionary with statistics summarizing the preprocessing outcomes.


In [None]:
def load_and_preprocess_data(split_dir, metadata_file):
    """Load and preprocess the data with detailed analytics"""
    preprocessor = ArabicTextPreprocessor()

    # Read metadata (loaded for reference only; not used below)
    metadata_df = pd.read_excel(metadata_file)

    # Process files
    data = {
        'text': [],
        'preprocessed_text': [],
        'label': [],
        'length': [],
        'preprocessed_length': [],
        'filename': []
    }

    for filename in os.listdir(split_dir):
        if filename.endswith('.txt'):
            file_path = os.path.join(split_dir, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read().strip()

            # Preprocess text
            preprocessed_text = preprocessor.preprocess(text)

            # Store data
            data['text'].append(text)
            data['preprocessed_text'].append(preprocessed_text)
            data['label'].append(1 if 'generated' in filename else 0)
            data['length'].append(len(text.split()))
            data['preprocessed_length'].append(len(preprocessed_text.split()))
            data['filename'].append(filename)

    df = pd.DataFrame(data)

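    # Flag files whose first line meets the basic_clean removal rule
    # (contains 'مقال' and has at most 3 words)
    first_lines = df['text'].str.split('\n').str[0]
    first_line_removed = (
        first_lines.str.contains('مقال', na=False)
        & (first_lines.str.split().str.len() <= 3)
    )

    # Save preprocessing stats
    preprocessing_stats = {
        'Total Files': len(df),
        'Average Original Length': df['length'].mean(),
        'Average Preprocessed Length': df['preprocessed_length'].mean(),
        'Average Characters Removed': (df['text'].str.len() - df['preprocessed_text'].str.len()).mean(),
        'Files with First Line Removed': int(first_line_removed.sum())
    }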

    return df, preprocessing_stats
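
The file-walking and labelling logic can be checked in isolation, without pandas or the preprocessor. The sketch below builds a temporary directory with made-up file names and applies the same rule as the function above (label `1` when `generated` appears in the file name). `label_from_filename` and `collect_records` are hypothetical helpers, and the sorted listing is an addition for reproducible ordering.

```python
import os
import tempfile

def label_from_filename(filename):
    """Return 1 for machine-generated files, 0 otherwise."""
    return 1 if "generated" in filename else 0

def collect_records(split_dir):
    """Read every .txt file in split_dir and record its label and word count."""
    records = []
    for filename in sorted(os.listdir(split_dir)):
        if not filename.endswith(".txt"):
            continue  # skip anything that is not a text file
        with open(os.path.join(split_dir, filename), "r", encoding="utf-8") as f:
            text = f.read().strip()
        records.append({
            "filename": filename,
            "label": label_from_filename(filename),
            "length": len(text.split()),
        })
    return records

with tempfile.TemporaryDirectory() as tmp:
    # Made-up file names and contents, used only for this illustration
    samples = {
        "site_article_1.txt": "one two three",
        "site_article_1_generated.txt": "one two",
        "notes.md": "ignored",
    }
    for name, content in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(content)
    records = collect_records(tmp)

for record in records:
    print(record)
```

Note that the label depends entirely on the file name, so a real article whose name happens to contain `generated` would be mislabelled.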

# Explanation: Directory Setup

## Purpose
This snippet sets up directories for handling datasets and storing results for an Arabic text classification task.

## Code Breakdown
1. **Define Base Directory**:
   - `base_dir = 'dataset/splits'`
   - Specifies the location of the dataset, particularly in a `splits` subdirectory within a `dataset` folder.

2. **Define Output Directory**:
   - `output_dir = 'arabic_classification_results'`
   - Specifies the directory where results of the classification task (e.g., preprocessed data, metrics) will be saved.

3. **Create Output Directory**:
   - `os.makedirs(output_dir, exist_ok=True)`
   - Ensures the output directory exists.
   - If the directory does not already exist, it will be created.
   - The `exist_ok=True` flag prevents errors if the directory already exists.



## Benefits
- Organized structure for dataset handling and result storage.
- Prevents errors from missing or duplicate directories.
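
The effect of `exist_ok=True` can be seen with a short, self-contained check in a temporary directory. This is a sketch for illustration; the nested `train_analytics` folder name is borrowed from later in the notebook.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "arabic_classification_results", "train_analytics")

    os.makedirs(out, exist_ok=True)  # creates both directory levels
    os.makedirs(out, exist_ok=True)  # second call is a no-op
    created = os.path.isdir(out)

    try:
        os.makedirs(out)  # default exist_ok=False raises on an existing path
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```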





In [None]:
# Set up directories
base_dir = 'dataset/splits'
output_dir = 'arabic_classification_results'
os.makedirs(output_dir, exist_ok=True)

# Explanation: Loading and Preprocessing Data

## Purpose
This code snippet loads text data from training, validation, and test directories, preprocesses it, and provides preprocessing statistics.

## Workflow

1. **Print Status**:
   - `print("Loading and preprocessing data...")`
   - Displays a message to indicate the start of the data loading and preprocessing process.

2. **Load and Preprocess Data**:
   - **`load_and_preprocess_data`**:
     - Processes files in the specified directories (`train`, `val`, `test`) using the `ArabicTextPreprocessor` class.
     - Cleans and prepares text data for model training, validation, and testing.

   - **Arguments**:
     - `os.path.join(base_dir, 'train')`: Path to the training data directory.
     - `'dataset/combined_dataset_metadata.xlsx'`: Path to the metadata file for additional information.
     - Similarly, validation (`val`) and testing (`test`) directories are specified.

3. **Outputs**:
   - **`train_df`, `val_df`, `test_df`**:
     - Pandas DataFrames containing original and preprocessed text, labels, text lengths, and filenames for the respective datasets.
   - **`train_prep_stats`, `val_prep_stats`, `test_prep_stats`**:
     - Dictionaries with preprocessing statistics (e.g., average lengths, number of cleaned files).



In [None]:
# Load and preprocess data
print("Loading and preprocessing data...")

train_df, train_prep_stats = load_and_preprocess_data(
    os.path.join(base_dir, 'train'),
    'dataset/combined_dataset_metadata.xlsx'
)
val_df, val_prep_stats = load_and_preprocess_data(
    os.path.join(base_dir, 'val'),
    'dataset/combined_dataset_metadata.xlsx'
)
test_df, test_prep_stats = load_and_preprocess_data(
    os.path.join(base_dir, 'test'),
    'dataset/combined_dataset_metadata.xlsx'
)



Loading and preprocessing data...




# **Introduction to BiLSTM, Attention, and Retention Mechanisms**
## Proposed Enhancements to Model Performance
To improve model performance, BiLSTM, attention, and retention layers are added on top of the Arabic language models:

1. BiLSTM (Bidirectional Long Short-Term Memory):
Extends the LSTM by processing the input sequence in both the forward and backward directions using two separate hidden layers. The two outputs are then concatenated, so each position captures both past and future context. The LSTM memory cells retain important information, while the bidirectional setup ensures that dependencies between distant words are captured.
2. Attention mechanism:
Calculates a relevance weight for every input token and focuses on the most contextually important tokens when producing the output. In sequence-to-sequence models this is implemented as a weighted sum of the encoder hidden states. In Arabic, where sentence meaning relies heavily on word context and semantics, attention helps the model focus on significant words and down-weight less relevant parts.
3. Retention mechanism:
Extends recurrent networks by preserving long-term dependencies over extended sequences, using enhanced gating to reduce the decay of information over time. This mitigates the loss of information in longer Arabic texts, which often feature nested clauses and complex structures.
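
The weighted-sum view of attention can be illustrated with a few lines of plain Python. This is a toy sketch under stated assumptions, not the notebook's layer: each hidden state is scored with a single weight vector (a linear scorer without bias), the scores are turned into weights with a softmax, and the weights are used to average the hidden states. `attention_pool` and the numbers are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, score_weights):
    """Score each hidden state, normalize the scores, and return the weighted sum."""
    scores = [
        sum(h_i * w_i for h_i, w_i in zip(h, score_weights))
        for h in hidden_states
    ]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context

# Three time steps with two features each (made-up values)
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score_weights = [2.0, 0.0]  # this scorer only looks at the first feature

weights, context = attention_pool(hidden_states, score_weights)
print(weights)
print(context)
```

The first and third steps receive equal, larger weights because their first feature is high, and the second step is down-weighted; the context vector is the corresponding weighted average of the three hidden states.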







### The enhancement and the new layers are introduced in the following steps:
1. Analyze the data (import, preparation, and preprocessing).
2. Tokenization.
3. Conversion to Hugging Face-compatible datasets.
4. Training:
   a. Define the model architecture.
   b. Initialize the model.
   c. Define the evaluation metrics.
   d. Define the training arguments.
   e. Train the model.
### These steps are applied to each of the models below
#### Language models selected for enhancement:
1. aubmindlab/bert-base-arabertv02
2. aubmindlab/bert-large-arabertv2
3. asafaya/bert-base-arabic
4. FacebookAI/xlm-roberta-base
5. araelectra-base-discriminator





### **1. aubmindlab/bert-base-arabertv02 Model Enhancement**

In [None]:
# Analyze datasets
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats as scipy_stats  # Rename to avoid conflict
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from collections import Counter
from wordcloud import WordCloud

def analyze_data(df, output_dir):
    """
    Generate detailed analytics about the dataset with both original and enhanced analyses

    Args:
        df (pd.DataFrame): DataFrame containing the dataset
        output_dir (str): Directory to save the results

    Returns:
        dict: A dictionary with basic statistics
    """
    os.makedirs(output_dir, exist_ok=True)

    # 1. Basic statistics (Preserved from original implementation)
    dataset_stats = {
        'Total Samples': len(df),
        'Real Articles': sum(df['label'] == 0),
        'Generated Articles': sum(df['label'] == 1),
        'Average Original Length': df['length'].mean(),
        'Average Preprocessed Length': df['preprocessed_length'].mean(),
        'Max Length': df['length'].max(),
        'Min Length': df['length'].min(),
        'Median Length': df['length'].median()
    }

    # Save basic stats
    pd.DataFrame([dataset_stats]).to_csv(os.path.join(output_dir, 'basic_stats.csv'))

    # 2. Length distribution plot (Original implementation)
    plt.figure(figsize=(12, 6))
    sns.boxplot(x='label', y='length', data=df)
    plt.title('Article Length Distribution by Class')
    plt.xlabel('Class (0: Real, 1: Generated)')
    plt.ylabel('Word Count')
    plt.savefig(os.path.join(output_dir, 'length_distribution.png'))
    plt.close()

    # 3. Length histogram (Original implementation)
    plt.figure(figsize=(12, 6))
    plt.hist(df['length'], bins=50, alpha=0.5, label='Original')
    plt.hist(df['preprocessed_length'], bins=50, alpha=0.5, label='Preprocessed')
    plt.title('Article Length Distribution')
    plt.xlabel('Word Count')
    plt.ylabel('Frequency')
    plt.legend()
    plt.savefig(os.path.join(output_dir, 'length_histogram.png'))
    plt.close()

    # 4. Enhanced Length Distribution Plots (New addition)
    def plot_length_distribution(df, column, title, filename):
        plt.figure(figsize=(15, 10))
        sns.kdeplot(data=df, x=column, hue='label', fill=True, common_norm=False, palette="crest", alpha=0.5)
        plt.title(title)
        plt.xlabel('Word Count')
        plt.ylabel('Density')
        plt.savefig(os.path.join(output_dir, filename))
        plt.close()

    plot_length_distribution(df, 'length', 'Article Length Distribution by Class (KDE)', 'length_distribution_kde.png')
    plot_length_distribution(df, 'preprocessed_length', 'Preprocessed Article Length Distribution by Class (KDE)', 'preprocessed_length_distribution_kde.png')

    # 5. Box Plot with Outliers (New addition)
    plt.figure(figsize=(15, 10))
    sns.boxplot(x='label', y='length', data=df, showfliers=True)
    sns.stripplot(x='label', y='length', data=df, size=4, color='.3', linewidth=0)
    plt.title('Article Length Distribution by Class with Outliers')
    plt.xlabel('Class (0: Real, 1: Generated)')
    plt.ylabel('Word Count')
    plt.savefig(os.path.join(output_dir, 'length_distribution_boxplot.png'))
    plt.close()

    # 6. Statistical Tests (New addition)
    real_lengths = df[df['label'] == 0]['length']
    gen_lengths = df[df['label'] == 1]['length']

    # Perform t-test
    try:
        t_stat, p_value = scipy_stats.ttest_ind(real_lengths, gen_lengths, equal_var=False)
        dataset_stats['T-test Statistic'] = t_stat
        dataset_stats['T-test P-value'] = p_value
    except Exception as e:
        print(f"T-test calculation error: {e}")
        dataset_stats['T-test Statistic'] = None
        dataset_stats['T-test P-value'] = None

    # 7. Word Frequency Analysis (New addition)
    def get_top_words(texts, n=20):
        words = ' '.join(texts).split()
        word_counts = Counter(words)
        return pd.DataFrame(word_counts.most_common(n), columns=['Word', 'Count'])

    # Top words for each class
    top_words_real = get_top_words(df[df['label'] == 0]['text'])
    top_words_gen = get_top_words(df[df['label'] == 1]['text'])

    top_words_real.to_csv(os.path.join(output_dir, 'top_words_real.csv'), index=False)
    top_words_gen.to_csv(os.path.join(output_dir, 'top_words_generated.csv'), index=False)

    # 8. Correlation Heatmap (New addition)
    correlation = df[['length', 'preprocessed_length', 'label']].corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation, annot=True, cmap='coolwarm')
    plt.title('Correlation Between Features')
    plt.savefig(os.path.join(output_dir, 'correlation_heatmap.png'))
    plt.close()

    # Report completion
    print("Dataset analysis completed successfully.")

    return dataset_stats

# Analyze the train, validation, and test splits
print("Analyzing datasets...")
train_stats = analyze_data(train_df, os.path.join(output_dir, 'train_analytics'))
val_stats = analyze_data(val_df, os.path.join(output_dir, 'val_analytics'))
test_stats = analyze_data(test_df, os.path.join(output_dir, 'test_analytics'))


Analyzing datasets...
Dataset analysis completed successfully.
Dataset analysis completed successfully.
Dataset analysis completed successfully.
Analyzing datasets...
Dataset analysis completed successfully.
Dataset analysis completed successfully.
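
The statistic behind `ttest_ind(..., equal_var=False)` is Welch's t-test, which does not assume equal variances in the two groups. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom with the standard library only, using made-up word counts; it does not compute the p-value, which requires the t-distribution CDF that SciPy provides. `welch_t` is a hypothetical helper name.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors, using the sample variance (n - 1 denominator)
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b

    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, dof

# Made-up word counts for illustration only
real_lengths = [310, 295, 342, 288, 305]
gen_lengths = [250, 262, 240, 255, 248]

t_stat, dof = welch_t(real_lengths, gen_lengths)
print(round(t_stat, 2), round(dof, 2))
```

A large positive statistic here means the first group (real articles) is longer on average relative to the combined spread of both groups.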


In [None]:
# Convert to HuggingFace datasets
train_dataset = Dataset.from_pandas(train_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))
val_dataset = Dataset.from_pandas(val_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))
test_dataset = Dataset.from_pandas(test_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))
# Tokenize datasets (requires `tokenizer` and `tokenize_function` to be defined before this cell runs)
print("Tokenizing datasets...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)



Tokenizing datasets...


Map:   0%|          | 0/6515 [00:00<?, ? examples/s]

Map:   0%|          | 0/1397 [00:00<?, ? examples/s]

Map:   0%|          | 0/1397 [00:00<?, ? examples/s]
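
`tokenize_function` is called in the cell above, but its definition does not appear in this part of the notebook. A plausible shape for it, offered as an assumption rather than the author's code, is a batched callback that passes `examples["text"]` to the tokenizer with padding and truncation. In the sketch below, `make_tokenize_function`, the `max_length=512` default, and `ToyTokenizer` (a stand-in so the example runs without downloading a model) are all hypothetical.

```python
def make_tokenize_function(tokenizer, max_length=512):
    """Build a batched tokenization callback for Dataset.map(..., batched=True)."""
    def tokenize_function(examples):
        # With batched=True, examples["text"] is a list of strings
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )
    return tokenize_function

class ToyTokenizer:
    """Stand-in for a Hugging Face tokenizer: token IDs are just word lengths."""
    def __call__(self, texts, padding="max_length", truncation=True, max_length=8):
        input_ids, attention_mask = [], []
        for text in texts:
            ids = [len(word) for word in text.split()][:max_length]
            pad = max_length - len(ids)
            input_ids.append(ids + [0] * pad)            # pad with ID 0
            attention_mask.append([1] * len(ids) + [0] * pad)
        return {"input_ids": input_ids, "attention_mask": attention_mask}

tokenize_function = make_tokenize_function(ToyTokenizer(), max_length=4)
batch = tokenize_function({"text": ["a bb ccc", "dddd"]})
print(batch["input_ids"])
print(batch["attention_mask"])
```

With a real tokenizer, the same factory would be called after `AutoTokenizer.from_pretrained(...)`, and the resulting function passed to `Dataset.map(tokenize_function, batched=True)` as in the cell above.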

In [None]:
# Tokenizer for "aubmindlab/bert-base-arabertv02"
# Initialize tokenizer
print("Initializing tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")



Initializing tokenizer and model...


In [None]:
import os
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from transformers import AutoTokenizer, AutoModel, Trainer, TrainingArguments, EarlyStoppingCallback
from arabert.preprocess import ArabertPreprocessor
from sklearn.metrics import (
    precision_recall_fscore_support,
    accuracy_score,
    roc_curve,
    auc,
    confusion_matrix
)
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessor for AraBERT (same checkpoint as the tokenizer initialized above)
arabert_prep = ArabertPreprocessor(model_name="aubmindlab/bert-base-arabertv02")

# Define the Arabic Text Classifier
class ArabicTextClassifier(nn.Module):
    def __init__(self, pretrained_model_name="aubmindlab/bert-base-arabertv02",
                 hidden_size=256, num_classes=2):
        super().__init__()

        # Pretrained AraBERT backbone; its checkpoint must match the tokenizer's
        self.backbone = AutoModel.from_pretrained(pretrained_model_name)

        # Freeze backbone layers (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # BiLSTM Layer
        self.bilstm = nn.LSTM(
            input_size=self.backbone.config.hidden_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )

        # Retention Mechanism (custom layer)
        self.retention = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size * 2)
        )

        # Attention Mechanism
        self.attention = nn.Linear(hidden_size * 2, 1)

        # Classification Head
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_size, num_classes)
        )

        # Loss function
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        # Get embeddings from AraBERT
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # Extract last hidden state
        sequence_output = outputs.last_hidden_state

        # Prepare sequence lengths for packed sequence
        lengths = attention_mask.sum(dim=1)

        # Pack sequence for BiLSTM
        packed_sequence = pack_padded_sequence(
            sequence_output,
            lengths.cpu(),
            batch_first=True,
            enforce_sorted=False
        )

        # BiLSTM processing
        packed_output, _ = self.bilstm(packed_sequence)

        # Unpack sequence
        lstm_output, _ = pad_packed_sequence(packed_output, batch_first=True)

        # Apply retention mechanism
        lstm_output = self.retention(lstm_output)

        # Attention mechanism
        attention_weights = torch.softmax(
            self.attention(lstm_output).squeeze(-1),
            dim=1
        )

        # Weighted sum of BiLSTM outputs
        context_vector = torch.bmm(
            attention_weights.unsqueeze(1),
            lstm_output
        ).squeeze(1)

        # Classification
        logits = self.classifier(context_vector)

        # Compute loss if labels are provided
        loss = None
        if labels is not None:
            loss = self.loss_fn(logits, labels)

        return {
            'loss': loss,
            'logits': logits
        }

# Initialize model
model = ArabicTextClassifier(
    pretrained_model_name="aubmindlab/bert-base-arabertv2",
    hidden_size=256,
    num_classes=2
)

# Define tokenizer
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")

# Preprocess and tokenize function
def tokenize_function(examples):
    # Preprocess using ArabertPreprocessor
    examples['text'] = [arabert_prep.preprocess(text) for text in examples['text']]
    return tokenizer(examples['text'], padding="max_length", truncation=True)

# Define compute_metrics
def compute_metrics(pred):
    """Compute metrics for evaluation"""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # Precision, Recall, F1-Score
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')

    # Accuracy
    acc = accuracy_score(labels, preds)

    # Calculate ROC curve and AUC
    probs = torch.nn.functional.softmax(torch.tensor(pred.predictions), dim=-1)[:, 1].numpy()
    fpr, tpr, _ = roc_curve(labels, probs)
    roc_auc = auc(fpr, tpr)

    # Save ROC curve
    plt.figure(figsize=(8, 8))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.savefig('roc_curve.png')
    plt.close()

    # Confusion Matrix
    cm = confusion_matrix(labels, preds)
    plt.figure(figsize=(8, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.savefig('confusion_matrix.png')
    plt.close()

    # Return computed metrics
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall,
        'roc_auc': roc_auc
    }

# Training arguments
output_dir = "./model_output"
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    logging_dir=os.path.join(output_dir, 'logs'),
    logging_steps=10,
    report_to="tensorboard",
    gradient_accumulation_steps=1,
    max_grad_norm=1.0
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=2,
            early_stopping_threshold=0.01
        )
    ]
)

# Train model
print("Training model...")
trainer.train()




Training model...


Epoch,Training Loss,Validation Loss
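
One detail of the forward pass above is worth illustrating: the softmax over attention scores runs across every position returned by `pad_packed_sequence`, so in a batch with mixed lengths the padded positions of shorter sequences can also receive weight. A common alternative is to mask those positions before the softmax. The sketch below shows the idea in plain Python (no PyTorch required). The function name `masked_attention_pool` and the toy numbers are assumptions made for illustration, and this is not what the notebook's model does.

```python
import math


def masked_attention_pool(scores, vectors, mask):
    """Softmax over positions where mask == 1, then a weighted sum of vectors.

    Assumes at least one position is unmasked.
    """
    # Padded positions get -inf so that exp() maps them to a weight of exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)                               # subtract the max for stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    context = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return weights, context


# Toy sequence of three positions; the last one is padding.
scores = [0.5, 0.5, 3.0]
vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
weights, context = masked_attention_pool(scores, vectors, mask)
print(weights, context)
```

In PyTorch the same effect is usually obtained with `scores.masked_fill(mask == 0, float("-inf"))` before `torch.softmax`, with the mask cut to the length of the unpacked LSTM output.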


### **2-aubmindlab/bert-large-arabertv2 Model Enhancement**

In [None]:


# Preprocessor for AraBERT Large
arabert_prep = ArabertPreprocessor(model_name="aubmindlab/bert-large-arabertv2")




In [None]:

# Define the Arabic Text Classifier
class ArabicTextClassifier(nn.Module):
    def __init__(self, pretrained_model_name="aubmindlab/bert-large-arabertv2",
                 hidden_size=256, num_classes=2):
        super(ArabicTextClassifier, self).__init__()

        # Pretrained AraBERTv2 model
        self.backbone = AutoModel.from_pretrained(pretrained_model_name)

        # Freeze backbone layers (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # BiLSTM Layer
        self.bilstm = nn.LSTM(
            input_size=self.backbone.config.hidden_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )

        # Retention Mechanism (custom layer)
        self.retention = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size * 2)
        )

        # Attention Mechanism
        self.attention = nn.Linear(hidden_size * 2, 1)

        # Classification Head
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_size, num_classes)
        )

        # Loss function
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        # Get embeddings from AraBERT
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # Extract last hidden state
        sequence_output = outputs.last_hidden_state

        # Prepare sequence lengths for packed sequence
        lengths = attention_mask.sum(dim=1)

        # Pack sequence for BiLSTM
        packed_sequence = pack_padded_sequence(
            sequence_output,
            lengths.cpu(),
            batch_first=True,
            enforce_sorted=False
        )

        # BiLSTM processing
        packed_output, _ = self.bilstm(packed_sequence)

        # Unpack sequence
        lstm_output, _ = pad_packed_sequence(packed_output, batch_first=True)

        # Apply retention mechanism
        lstm_output = self.retention(lstm_output)

        # Attention mechanism
        attention_weights = torch.softmax(
            self.attention(lstm_output).squeeze(-1),
            dim=1
        )

        # Weighted sum of BiLSTM outputs
        context_vector = torch.bmm(
            attention_weights.unsqueeze(1),
            lstm_output
        ).squeeze(1)

        # Classification
        logits = self.classifier(context_vector)

        # Compute loss if labels are provided
        loss = None
        if labels is not None:
            loss = self.loss_fn(logits, labels)

        return {
            'loss': loss,
            'logits': logits
        }

# Initialize model
model = ArabicTextClassifier(
    pretrained_model_name="aubmindlab/bert-large-arabertv2",
    hidden_size=256,
    num_classes=2
)

config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.48G [00:00<?, ?B/s]
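
Because the backbone is frozen, only the BiLSTM, retention, attention, and classifier layers are trained. The sketch below estimates how many trainable parameters that head has, using the standard parameter-count formulas for LSTM and linear layers. The backbone hidden sizes (768 for a base model, 1024 for a large one) are assumptions here; the notebook reads the real value from `backbone.config.hidden_size`.

```python
def lstm_params(input_size, hidden_size, bidirectional=True):
    # Per direction: 4 gates, each with input weights, recurrent weights,
    # and two bias vectors (input-hidden and hidden-hidden).
    per_direction = 4 * (input_size * hidden_size
                         + hidden_size * hidden_size
                         + 2 * hidden_size)
    return per_direction * (2 if bidirectional else 1)


def linear_params(in_features, out_features):
    return in_features * out_features + out_features


def head_params(backbone_hidden, hidden_size=256, num_classes=2):
    bi = hidden_size * 2  # BiLSTM output width
    return (
        lstm_params(backbone_hidden, hidden_size)
        + linear_params(bi, hidden_size) + linear_params(hidden_size, bi)   # retention
        + linear_params(bi, 1)                                              # attention
        + linear_params(bi, hidden_size)
        + linear_params(hidden_size, num_classes)                           # classifier
    )


base_head = head_params(768)     # assumed hidden size of a base backbone
large_head = head_params(1024)   # assumed hidden size of a large backbone
print(base_head, large_head)
```

Under these assumptions the trainable head stays in the low millions of parameters for either backbone, so switching to the large checkpoint mostly raises the cost of the frozen forward pass rather than the number of weights being updated.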

In [None]:

# Training arguments
output_dir = "./model_output"
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    logging_dir=os.path.join(output_dir, 'logs'),
    logging_steps=10,
    report_to="tensorboard",
    gradient_accumulation_steps=1,
    max_grad_norm=1.0
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=2,
            early_stopping_threshold=0.01
        )
    ]
)


In [None]:
# Train model
print("Training model...")
trainer.train()


Training model...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Roc Auc
1,0.6066,0.563754,0.810308,0.824619,0.721065,0.962906,0.95302
2,0.1489,0.328833,0.851825,0.861908,0.758216,0.998454,0.988511
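
As a quick sanity check on the table above, F1 should equal the harmonic mean of precision and recall. The sketch below recomputes it from the logged precision and recall values. The helper name `f1_from` is made up for this example, and the small tolerance accounts for the logged values being rounded.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1) copied from the training log above.
rows = [
    (1, 0.721065, 0.962906, 0.824619),
    (2, 0.758216, 0.998454, 0.861908),
]

for epoch, precision, recall, reported in rows:
    recomputed = f1_from(precision, recall)
    print(epoch, round(recomputed, 6), reported, abs(recomputed - reported) < 1e-5)
```

The pattern in both epochs is very high recall with noticeably lower precision, which suggests that at this stage the model predicts the class labelled 1 too often.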


### **3-asafaya/bert-base-arabic**




In [None]:
# new for "asafaya/bert-base-arabic"
print("Initializing tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")

Initializing tokenizer and model...


tokenizer_config.json:   0%|          | 0.00/62.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/334k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [None]:
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding="max_length",
        truncation=True,
        max_length=512
    )

In [None]:
# Convert to HuggingFace datasets
train_dataset = Dataset.from_pandas(train_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))
val_dataset = Dataset.from_pandas(val_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))
test_dataset = Dataset.from_pandas(test_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))

In [None]:
# Tokenize datasets
print("Tokenizing datasets...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Tokenizing datasets...


Map:   0%|          | 0/6515 [00:00<?, ? examples/s]

Map:   0%|          | 0/1397 [00:00<?, ? examples/s]

Map:   0%|          | 0/1397 [00:00<?, ? examples/s]

In [None]:
import os
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from transformers import AutoModel, Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import (
    precision_recall_fscore_support,
    accuracy_score,
    roc_curve,
    auc,
    confusion_matrix
)
import matplotlib.pyplot as plt
import seaborn as sns

# Define the Arabic Text Classifier
class ArabicTextClassifier(nn.Module):
    def __init__(self, pretrained_model_name="asafaya/bert-base-arabic",
                 hidden_size=256, num_classes=2):
        super(ArabicTextClassifier, self).__init__()

        # Pretrained Bert-base Arabic model
        self.backbone = AutoModel.from_pretrained(pretrained_model_name)

        # Freeze backbone layers (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # BiLSTM Layer
        self.bilstm = nn.LSTM(
            input_size=self.backbone.config.hidden_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )

        # Retention Mechanism (custom layer)
        self.retention = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size * 2)
        )

        # Attention Mechanism
        self.attention = nn.Linear(hidden_size * 2, 1)

        # Classification Head
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_size, num_classes)
        )

        # Loss function
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        # Get embeddings from Bert-base Arabic
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # Extract last hidden state
        sequence_output = outputs.last_hidden_state

        # Prepare sequence lengths for packed sequence
        lengths = attention_mask.sum(dim=1)

        # Pack sequence for BiLSTM
        packed_sequence = pack_padded_sequence(
            sequence_output,
            lengths.cpu(),
            batch_first=True,
            enforce_sorted=False
        )

        # BiLSTM processing
        packed_output, _ = self.bilstm(packed_sequence)

        # Unpack sequence
        lstm_output, _ = pad_packed_sequence(packed_output, batch_first=True)

        # Apply retention mechanism
        lstm_output = self.retention(lstm_output)

        # Attention mechanism
        attention_weights = torch.softmax(
            self.attention(lstm_output).squeeze(-1),
            dim=1
        )

        # Weighted sum of BiLSTM outputs
        context_vector = torch.bmm(
            attention_weights.unsqueeze(1),
            lstm_output
        ).squeeze(1)

        # Classification
        logits = self.classifier(context_vector)

        # Compute loss if labels are provided
        loss = None
        if labels is not None:
            loss = self.loss_fn(logits, labels)

        return {
            'loss': loss,
            'logits': logits
        }

# Initialize model
model = ArabicTextClassifier(
    pretrained_model_name="asafaya/bert-base-arabic",
    hidden_size=256,
    num_classes=2
)

# Define compute_metrics
def compute_metrics(pred):
    """Compute metrics for evaluation"""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # Precision, Recall, F1-Score
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')

    # Accuracy
    acc = accuracy_score(labels, preds)

    # Calculate ROC curve and AUC
    probs = torch.nn.functional.softmax(torch.tensor(pred.predictions), dim=-1)[:, 1].numpy()
    fpr, tpr, _ = roc_curve(labels, probs)
    roc_auc = auc(fpr, tpr)

    # Save ROC curve
    plt.figure(figsize=(8, 8))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.savefig('roc_curve.png')
    plt.close()

    # Confusion Matrix
    cm = confusion_matrix(labels, preds)
    plt.figure(figsize=(8, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.savefig('confusion_matrix.png')
    plt.close()

    # Return computed metrics
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall,
        'roc_auc': roc_auc
    }

# Training arguments
output_dir = "./model_output"
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    logging_dir=os.path.join(output_dir, 'logs'),
    logging_steps=10,
    report_to="tensorboard",
    gradient_accumulation_steps=1,
    max_grad_norm=1.0
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=2,
            early_stopping_threshold=0.01
        )
    ]
)

# Train model
print("Training model...")
trainer.train()


Training model...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Roc Auc
1,0.3045,0.216834,0.964925,0.963406,0.932081,0.996909,0.99892
2,0.0093,0.013757,0.997137,0.996914,0.995378,0.998454,0.999868
3,0.002,0.009374,0.997137,0.996914,0.995378,0.998454,0.999924
4,0.0018,0.017729,0.996421,0.996145,0.993846,0.998454,0.99992


TrainOutput(global_step=816, training_loss=0.16479119910389486, metrics={'train_runtime': 1057.2585, 'train_samples_per_second': 61.622, 'train_steps_per_second': 1.93, 'total_flos': 0.0, 'train_loss': 0.16479119910389486, 'epoch': 4.0})
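
The `global_step=816` in the output above can be reproduced from the dataset size and batch size, and the same arithmetic shows where the learning-rate schedule stood when training stopped. The sketch below assumes a single device and re-implements a linear warmup and decay schedule by hand. The helper names are hypothetical, and the warmup step count is simply rounded here, which may differ slightly from how the library rounds.

```python
import math


def steps_per_epoch(n_examples, batch_size, grad_accum=1):
    """Optimizer updates per epoch on a single device."""
    batches = math.ceil(n_examples / batch_size)
    return max(1, batches // grad_accum)


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear ramp from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


per_epoch = steps_per_epoch(6515, 32)      # 6515 training examples, batch size 32
planned_total = per_epoch * 10             # num_train_epochs=10
warmup = round(planned_total * 0.1)        # warmup_ratio=0.1 (rounded in this sketch)
steps_run = per_epoch * 4                  # training stopped after epoch 4
lr_at_stop = linear_warmup_lr(steps_run, 2e-5, warmup, planned_total)
print(per_epoch, planned_total, warmup, steps_run, lr_at_stop)
```

Under these assumptions the warmup covers roughly the first epoch, and the learning rate had decayed to about two thirds of its peak when early stopping ended the run.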

### **4-aubmindlab/araelectra-base-discriminator**


In [None]:
# new for "aubmindlab/araelectra-base-discriminator"
print("Initializing tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/araelectra-base-discriminator")

Initializing tokenizer and model...


tokenizer_config.json:   0%|          | 0.00/392 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/503 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/825k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.64M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [None]:
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding="max_length",
        truncation=True,
        max_length=512
    )

In [None]:
# Convert to HuggingFace datasets
train_dataset = Dataset.from_pandas(train_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))
val_dataset = Dataset.from_pandas(val_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))
test_dataset = Dataset.from_pandas(test_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))

In [None]:
# Tokenize datasets
print("Tokenizing datasets...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Tokenizing datasets...


Map:   0%|          | 0/6515 [00:00<?, ? examples/s]

Map:   0%|          | 0/1397 [00:00<?, ? examples/s]

Map:   0%|          | 0/1397 [00:00<?, ? examples/s]

In [None]:
import os
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from transformers import AutoModel, Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import (
    precision_recall_fscore_support,
    accuracy_score,
    roc_curve,
    auc,
    confusion_matrix
)
import matplotlib.pyplot as plt
import seaborn as sns

# Define the Arabic Text Classifier
class ArabicTextClassifier(nn.Module):
    def __init__(self, pretrained_model_name="aubmindlab/araelectra-base-discriminator",
                 hidden_size=256, num_classes=2):
        super(ArabicTextClassifier, self).__init__()

        # Pretrained AraELECTRA discriminator model
        self.backbone = AutoModel.from_pretrained(pretrained_model_name)

        # Freeze backbone layers (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # BiLSTM Layer
        self.bilstm = nn.LSTM(
            input_size=self.backbone.config.hidden_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )

        # Retention Mechanism (custom layer)
        self.retention = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size * 2)
        )

        # Attention Mechanism
        self.attention = nn.Linear(hidden_size * 2, 1)

        # Classification Head
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_size, num_classes)
        )

        # Loss function
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        # Get embeddings from AraELECTRA
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # Extract last hidden state
        sequence_output = outputs.last_hidden_state

        # Prepare sequence lengths for packed sequence
        lengths = attention_mask.sum(dim=1)

        # Pack sequence for BiLSTM
        packed_sequence = pack_padded_sequence(
            sequence_output,
            lengths.cpu(),
            batch_first=True,
            enforce_sorted=False
        )

        # BiLSTM processing
        packed_output, _ = self.bilstm(packed_sequence)

        # Unpack sequence
        lstm_output, _ = pad_packed_sequence(packed_output, batch_first=True)

        # Apply retention mechanism
        lstm_output = self.retention(lstm_output)

        # Attention mechanism
        attention_weights = torch.softmax(
            self.attention(lstm_output).squeeze(-1),
            dim=1
        )

        # Weighted sum of BiLSTM outputs
        context_vector = torch.bmm(
            attention_weights.unsqueeze(1),
            lstm_output
        ).squeeze(1)

        # Classification
        logits = self.classifier(context_vector)

        # Compute loss if labels are provided
        loss = None
        if labels is not None:
            loss = self.loss_fn(logits, labels)

        return {
            'loss': loss,
            'logits': logits
        }

# Initialize model
model = ArabicTextClassifier(
    pretrained_model_name="aubmindlab/araelectra-base-discriminator",
    hidden_size=256,
    num_classes=2
)

# Define compute_metrics
def compute_metrics(pred):
    """Compute metrics for evaluation"""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # Precision, Recall, F1-Score
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')

    # Accuracy
    acc = accuracy_score(labels, preds)

    # Calculate ROC curve and AUC
    probs = torch.nn.functional.softmax(torch.tensor(pred.predictions), dim=-1)[:, 1].numpy()
    fpr, tpr, _ = roc_curve(labels, probs)
    roc_auc = auc(fpr, tpr)

    # Save ROC curve
    plt.figure(figsize=(8, 8))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.savefig('roc_curve.png')
    plt.close()

    # Confusion Matrix
    cm = confusion_matrix(labels, preds)
    plt.figure(figsize=(8, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.savefig('confusion_matrix.png')
    plt.close()

    # Return computed metrics
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall,
        'roc_auc': roc_auc
    }

# Training arguments
output_dir = "./model_output"
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    logging_dir=os.path.join(output_dir, 'logs'),
    logging_steps=10,
    report_to="tensorboard",
    gradient_accumulation_steps=1,
    max_grad_norm=1.0
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=2,
            early_stopping_threshold=0.01
        )
    ]
)

# Train model
print("Training model...")
trainer.train()


model.safetensors:   0%|          | 0.00/541M [00:00<?, ?B/s]

Training model...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Roc Auc
1,0.6256,0.594159,0.924123,0.923851,0.863087,0.993818,0.993875
2,0.0025,0.244841,0.940587,0.939724,0.886301,1.0,0.998739
3,0.0011,0.149622,0.951324,0.950073,0.904895,1.0,0.999365
4,0.0017,0.131395,0.968504,0.967115,0.936324,1.0,0.999624
5,0.0004,0.167224,0.955619,0.954277,0.912553,1.0,0.999633
6,0.0077,0.066993,0.987115,0.98628,0.972932,1.0,0.999829
7,0.0002,0.072759,0.981389,0.980303,0.961367,1.0,0.999848
8,0.0151,0.073368,0.984968,0.98403,0.968563,1.0,0.999856


TrainOutput(global_step=1632, training_loss=0.10732040979900037, metrics={'train_runtime': 2634.0397, 'train_samples_per_second': 24.734, 'train_steps_per_second': 0.774, 'total_flos': 0.0, 'train_loss': 0.10732040979900037, 'epoch': 8.0})
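
Both this run (8 epochs) and the `asafaya/bert-base-arabic` run (4 epochs) ended before the configured 10 epochs because of early stopping with `early_stopping_patience=2` and `early_stopping_threshold=0.01` on F1. The sketch below is a simplified re-implementation of that patience logic, not the library's code: it assumes an epoch counts as an improvement only if F1 beats the best value so far by more than the threshold, while the best value itself is updated on any improvement.

```python
def stop_epoch(f1_by_epoch, patience=2, threshold=0.01):
    """Return the epoch at which training would stop, or None if it never does."""
    best = None
    bad_epochs = 0
    for epoch, f1 in enumerate(f1_by_epoch, start=1):
        improved = best is None or (f1 > best and abs(f1 - best) > threshold)
        bad_epochs = 0 if improved else bad_epochs + 1
        if best is None or f1 > best:
            best = f1                      # track the best F1 seen so far
        if bad_epochs >= patience:
            return epoch
    return None


# Validation F1 per epoch, copied from the two training logs.
araelectra_f1 = [0.923851, 0.939724, 0.950073, 0.967115,
                 0.954277, 0.98628, 0.980303, 0.98403]
bert_base_arabic_f1 = [0.963406, 0.996914, 0.996914, 0.996145]

print(stop_epoch(araelectra_f1), stop_epoch(bert_base_arabic_f1))
```

With `load_best_model_at_end=True` and `metric_for_best_model="f1"`, the checkpoint restored after this run would be the one from epoch 6, which has the highest validation F1 in the table above.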

### **5-FacebookAI/xlm-roberta-base**

In [None]:
# new for "FacebookAI/xlm-roberta-base"
print("Initializing tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")

Initializing tokenizer and model...


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [None]:
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding="max_length",
        truncation=True,
        max_length=512
    )

In [None]:
# Convert to HuggingFace datasets
train_dataset = Dataset.from_pandas(train_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))
val_dataset = Dataset.from_pandas(val_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))
test_dataset = Dataset.from_pandas(test_df[['preprocessed_text', 'label']].rename(columns={'preprocessed_text': 'text'}))

In [None]:
# Tokenize datasets
print("Tokenizing datasets...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Tokenizing datasets...


Map:   0%|          | 0/6515 [00:00<?, ? examples/s]

Map:   0%|          | 0/1397 [00:00<?, ? examples/s]

Map:   0%|          | 0/1397 [00:00<?, ? examples/s]

In [None]:
import os
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from transformers import AutoModel, Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import (
    precision_recall_fscore_support,
    accuracy_score,
    roc_curve,
    auc,
    confusion_matrix
)
import matplotlib.pyplot as plt
import seaborn as sns

# Define the Arabic Text Classifier
class ArabicTextClassifier(nn.Module):
    def __init__(self, pretrained_model_name="FacebookAI/xlm-roberta-base",
                 hidden_size=256, num_classes=2):
        super(ArabicTextClassifier, self).__init__()

        # Pretrained XLM-RoBERTa base model
        self.backbone = AutoModel.from_pretrained(pretrained_model_name)

        # Freeze backbone layers (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # BiLSTM Layer
        self.bilstm = nn.LSTM(
            input_size=self.backbone.config.hidden_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )

        # Retention Mechanism (custom layer)
        self.retention = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size * 2)
        )

        # Attention Mechanism
        self.attention = nn.Linear(hidden_size * 2, 1)

        # Classification Head
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_size, num_classes)
        )

        # Loss function
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        # Get embeddings from XLM-RoBERTa
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # Extract last hidden state
        sequence_output = outputs.last_hidden_state

        # Prepare sequence lengths for packed sequence
        lengths = attention_mask.sum(dim=1)

        # Pack sequence for BiLSTM
        packed_sequence = pack_padded_sequence(
            sequence_output,
            lengths.cpu(),
            batch_first=True,
            enforce_sorted=False
        )

        # BiLSTM processing
        packed_output, _ = self.bilstm(packed_sequence)

        # Unpack sequence
        lstm_output, _ = pad_packed_sequence(packed_output, batch_first=True)

        # Apply retention mechanism
        lstm_output = self.retention(lstm_output)

        # Attention mechanism
        attention_weights = torch.softmax(
            self.attention(lstm_output).squeeze(-1),
            dim=1
        )

        # Weighted sum of BiLSTM outputs
        context_vector = torch.bmm(
            attention_weights.unsqueeze(1),
            lstm_output
        ).squeeze(1)

        # Classification
        logits = self.classifier(context_vector)

        # Compute loss if labels are provided
        loss = None
        if labels is not None:
            loss = self.loss_fn(logits, labels)

        return {
            'loss': loss,
            'logits': logits
        }

# Initialize model
model = ArabicTextClassifier(
    pretrained_model_name="FacebookAI/xlm-roberta-base",
    hidden_size=256,
    num_classes=2
)

# Define compute_metrics
def compute_metrics(pred):
    """Compute metrics for evaluation"""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # Precision, Recall, F1-Score (zero_division=0 returns 0 without a warning when no positives are predicted)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='binary', zero_division=0
    )

    # Accuracy
    acc = accuracy_score(labels, preds)

    # Calculate ROC curve and AUC
    probs = torch.nn.functional.softmax(torch.tensor(pred.predictions), dim=-1)[:, 1].numpy()
    fpr, tpr, _ = roc_curve(labels, probs)
    roc_auc = auc(fpr, tpr)

    # Save ROC curve (the file is overwritten at every evaluation)
    plt.figure(figsize=(8, 8))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.savefig('roc_curve.png')
    plt.close()

    # Confusion Matrix
    cm = confusion_matrix(labels, preds)
    plt.figure(figsize=(8, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.savefig('confusion_matrix.png')
    plt.close()

    # Return computed metrics
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall,
        'roc_auc': roc_auc
    }

# Training arguments
output_dir = "./model_output"
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",  # newer transformers releases name this argument eval_strategy
    save_strategy="epoch",
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    logging_dir=os.path.join(output_dir, 'logs'),
    logging_steps=10,
    report_to="tensorboard",
    gradient_accumulation_steps=1,
    max_grad_norm=1.0
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=2,
            early_stopping_threshold=0.01
        )
    ]
)

# Train model
print("Training model...")
trainer.train()


Training model...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Roc Auc
1,0.6714,0.666307,0.536865,0.0,0.0,0.0,0.968494
2,0.1502,0.12407,0.981389,0.980153,0.968326,0.992272,0.99662
3,0.0289,0.147253,0.970651,0.96915,0.944282,0.995363,0.996437
4,0.0175,0.088758,0.981389,0.980153,0.968326,0.992272,0.998195
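
The epoch 1 row pairs an F1 of 0.0 with a ROC AUC of 0.968. One reading consistent with those numbers (an assumption, not something the log confirms) is that the class-1 probability stayed below 0.5 for every validation sample, so `argmax` never predicted class 1, while the probabilities still ranked most positives above negatives. The sketch below reproduces that pattern on made-up scores in plain Python; `f1_at_argmax` and `rank_auc` are hypothetical helpers, not part of the notebook.

```python
def f1_at_argmax(labels, probs_class1):
    """Binary F1 when class 1 is predicted only if its probability exceeds 0.5."""
    preds = [1 if p > 0.5 else 0 for p in probs_class1]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rank_auc(labels, probs_class1):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(labels, probs_class1) if y == 1]
    neg = [p for y, p in zip(labels, probs_class1) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): every class-1 probability is below 0.5
labels = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.20, 0.41, 0.40, 0.45, 0.48]

print(f1_at_argmax(labels, probs))  # thresholded metric: no positives predicted
print(rank_auc(labels, probs))      # ranking metric: most pairs are ordered correctly
```

Because F1 depends on the 0.5 decision threshold and ROC AUC only on the ordering of scores, the two can disagree sharply early in training.
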


TrainOutput(global_step=816, training_loss=0.28364921684431676, metrics={'train_runtime': 1137.9069, 'train_samples_per_second': 57.254, 'train_steps_per_second': 1.793, 'total_flos': 0.0, 'train_loss': 0.28364921684431676, 'epoch': 4.0})
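
The run ends at epoch 4.0 even though `num_train_epochs=10`, which is consistent with the early-stopping settings (`early_stopping_patience=2`, `early_stopping_threshold=0.01`). The sketch below is a simplified plain-Python stand-in for that rule, not the library's implementation. It assumes an evaluation counts as an improvement only when F1 beats the best value so far by more than the threshold, and that the best value is updated whenever F1 is higher. `epochs_until_stop` is a hypothetical helper; the F1 values are copied from the results table above.

```python
def epochs_until_stop(f1_per_epoch, patience=2, threshold=0.01):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = None
    stale = 0  # consecutive evaluations without a sufficient improvement
    for epoch, f1 in enumerate(f1_per_epoch, start=1):
        improved = best is None or (f1 > best and f1 - best > threshold)
        stale = 0 if improved else stale + 1
        if best is None or f1 > best:
            best = f1
        if stale >= patience:
            return epoch
    return None


# F1 per epoch, taken from the logged results
f1_log = [0.0, 0.980153, 0.96915, 0.980153]
print(epochs_until_stop(f1_log))  # 4
```

Under these assumptions, epoch 3 falls below the epoch 2 best and epoch 4 only ties it, so two evaluations in a row count as non-improving and training stops after epoch 4.
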