This code performs the following key tasks:

1. **Load the Dataset**:  
   - Reads a `.parquet` file into a pandas DataFrame using the `pyarrow` engine.

2. **Remap Labels**:  
   - The `label_mapping` dictionary maps the sentiment labels from their original values (0, 1, 3, 4) to a contiguous range (0, 1, 2, 3).  
   - This is needed because a classification model with four outputs expects class ids 0 to 3.

3. **Train-Test Split**:  
   - Splits the dataset into training and testing sets with an 80-20 ratio.
   - Separates the `text` (features) and `label` (targets) columns using `train_test_split`.

4. **Verify Remapped Labels**:  
   - Prints the unique remapped labels to confirm the transformation.

### Purpose:
- Prepares the dataset for training and evaluation by splitting it into train/test sets and ensuring labels are properly formatted.
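
One caveat with dictionary-based remapping: pandas `Series.map` returns `NaN` for any value that is not a key in the dictionary, so an unexpected label would silently become a missing value. The sketch below shows the same remapping in plain Python with an explicit check. `remap_labels` is a hypothetical helper for illustration, not part of the notebook.

```python
# Mapping taken from the notebook: original labels -> contiguous class ids.
label_mapping = {0: 0, 1: 1, 3: 2, 4: 3}

def remap_labels(labels, mapping):
    """Remap labels, raising if any label has no entry in the mapping."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(f"Labels missing from mapping: {unknown}")
    return [mapping[label] for label in labels]

# Made-up labels for illustration only.
remapped = remap_labels([4, 0, 3, 1, 1], label_mapping)
print(remapped)
```

Failing loudly on an unknown label is a design choice; dropping such rows would be another reasonable option.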

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet(r"C:\Users\saall\Desktop\Arabic Sentiment Analysis for Hotel Reviews Multi-Class Prediction Model\data\train-00000-of-00001.parquet", engine="pyarrow")
# Remap labels to a contiguous range
label_mapping = {0: 0, 1: 1, 3: 2, 4: 3}
df['label'] = df['label'].map(label_mapping)

# Split into train and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

print("Labels after remapping:", df['label'].unique()) 

Labels after remapping: [1 3 0 2]


1. **Define `preprocess_text` Function**:  
   - Removes any leading or trailing whitespace from a given text string using Python's built-in `.strip()` method.  
   - Helps clean the data by ensuring consistency in the text formatting.

2. **Apply the Function to the Dataset**:  
   - Uses the pandas `.apply()` method to apply the `preprocess_text` function to each entry in the `text` column of the DataFrame.  
   - Updates the `text` column in place with the cleaned version of each text entry.

### Purpose:  
This step removes stray leading and trailing whitespace so that otherwise identical reviews are represented consistently before tokenization.
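
Note that `text.strip()` raises an `AttributeError` if a row holds a missing value rather than a string. A defensive variant might look like the sketch below. The name `preprocess_text_safe` and the choice to return an empty string for non-string input are assumptions, not the notebook's behavior.

```python
def preprocess_text_safe(text):
    """Strip leading/trailing whitespace; map non-string values to ''."""
    if not isinstance(text, str):
        # Assumption: treat missing or non-text entries as empty reviews.
        return ""
    return text.strip()

# Made-up samples, including a missing value.
samples = ["  الفندق رائع  ", "\tخدمة سيئة\n", None]
cleaned = [preprocess_text_safe(s) for s in samples]
print(cleaned)
```

It could be applied in the same way as the original function, with `df['text'].apply(preprocess_text_safe)`.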

In [None]:
def preprocess_text(text):
    # Strip extra whitespace
    text = text.strip()
    return text

df['text'] = df['text'].apply(preprocess_text)

1. **Purpose of `train_test_split`**:  
   - Splits the dataset into training and testing subsets for supervised machine learning tasks.  
   - Ensures that the model has separate data for training (to learn patterns) and testing (to evaluate performance).

2. **Parameters**:  
   - `df['text']`: The text data to be used as input features.
   - `df['label']`: The target labels corresponding to each text entry.
   - `test_size=0.2`: Reserves 20% of the data for testing and 80% for training.
   - `random_state=42`: Ensures the split is reproducible, providing consistent results across runs.

3. **Outputs**:  
   - `train_texts`: Text data for training the model.
   - `test_texts`: Text data for evaluating the model.
   - `train_labels`: Corresponding labels for the training data.
   - `test_labels`: Corresponding labels for the testing data.

### Purpose:  
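This step re-creates the 80/20 split after the text has been cleaned, so that `train_texts` and `test_texts` contain the stripped strings. Holding out a test set does not by itself prevent overfitting; it provides unseen data for measuring how well the fine-tuned model generalizes.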
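
If the classes are imbalanced (the supports in this notebook's classification report range from 2846 to 7817), it can be worth checking that the training and test subsets have similar label proportions. The helper below is a hypothetical, library-free sketch of that check on made-up labels.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: fraction of samples with that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Made-up labels for illustration only.
train_labels_demo = [0, 1, 1, 1, 2, 2, 3, 3]
test_labels_demo = [1, 1, 2, 3]

print(class_proportions(train_labels_demo))
print(class_proportions(test_labels_demo))
```

In scikit-learn, passing `stratify=df['label']` to `train_test_split` requests a split that preserves these proportions in both subsets.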

In [None]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42
)

1. **Purpose of `AutoTokenizer`:**  
   - Automatically loads the tokenizer that matches the specified checkpoint (`aubmindlab/bert-base-arabertv02`, an AraBERT model).  
   - AraBERT is a BERT model pre-trained on Arabic text, so its tokenizer uses a vocabulary built for Arabic.

2. **Parameters for Tokenizer**:  
   - `list(train_texts)` and `list(test_texts)`: Converts training and testing text data into lists for processing.  
   - `truncation=True`: Truncates texts longer than `max_length`.  
   - `padding=True`: Pads every sequence in the call to the length of the longest one, so all sequences have the same length. Here each split is tokenized in a single call, so padding is applied across the whole split.  
   - `max_length=128`: Limits the tokenized sequence length to 128 tokens.

3. **Outputs**:  
   - `train_encodings`: Dictionary containing tokenized training data with keys like `input_ids` and `attention_mask`.  
   - `test_encodings`: Dictionary containing tokenized test data, structured similarly to training data.

### Purpose:  
This code prepares the text data for input into the `AraBERT` model by converting raw Arabic text into tokenized sequences that the model can process. Tokenization handles splitting text into tokens, padding, and truncating, ensuring uniformity for training and evaluation.
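
To make the effect of `truncation`, `padding`, and `max_length` concrete, here is a simplified, library-free illustration. It is not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, the pad id of 0 is an assumption, and real BERT tokenizers also keep the special `[CLS]`/`[SEP]` tokens when truncating.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three made-up token-id sequences of different lengths.
batch = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6], [5]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The returned dictionary mirrors the `input_ids` and `attention_mask` keys described above.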

In [None]:
from transformers import AutoTokenizer

# Load AraBERT tokenizer
model_name = "aubmindlab/bert-base-arabertv02"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the data
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, max_length=128)




1. **Purpose of `SentimentDataset` Class**:  
   - Custom PyTorch `Dataset` implementation to handle tokenized data and labels.  
   - Converts tokenized data (from the tokenizer) into a format compatible with PyTorch models.

2. **Key Methods**:  
   - `__init__`:  
     - Accepts `encodings` (tokenized input data) and `labels` (sentiment labels).  
     - Stores them as instance variables for further access.  
   - `__len__`:  
     - Returns the length of the dataset, equivalent to the number of labels.  
   - `__getitem__`:  
     - Retrieves a specific data point by index (`idx`).  
     - Converts each part of the tokenized data and its corresponding label into PyTorch tensors.  
     - Returns a dictionary containing the tokenized inputs (`input_ids`, `attention_mask`, etc.) and the label.

3. **Dataset Creation**:  
   - `train_dataset` and `test_dataset` instances are created using `train_encodings` and `test_encodings` (from tokenization) along with their corresponding labels.

### Purpose:  
The `SentimentDataset` class and its instances (`train_dataset` and `test_dataset`) structure the data for PyTorch's DataLoader. This enables efficient batch processing, shuffling, and iteration during training and evaluation.
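
The tokenizer output is column-oriented (one list per key, covering all samples), whereas a `Dataset` must return one sample at a time. The sketch below shows that conversion with plain lists in place of tensors. `get_item` and the sample values are illustrative only.

```python
def get_item(encodings, labels, idx):
    """Build the per-sample dictionary that __getitem__ would return."""
    item = {key: values[idx] for key, values in encodings.items()}
    item["label"] = labels[idx]
    return item

# Made-up encodings for two samples.
encodings = {
    "input_ids": [[2, 15, 3], [2, 27, 3]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
labels = [3, 0]

print(get_item(encodings, labels, 1))
```

Passing `list(train_labels)` rather than the pandas Series matters here: after the split, the Series keeps its original row index, so `labels[idx]` on the Series would look up by index label instead of by position.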

In [None]:
import torch

class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['label'] = torch.tensor(self.labels[idx])
        return item

# Create PyTorch datasets
train_dataset = SentimentDataset(train_encodings, list(train_labels))
test_dataset = SentimentDataset(test_encodings, list(test_labels))


### Loading AraBERT Model for Sequence Classification

```python
from transformers import AutoModelForSequenceClassification
```
- **Importing `AutoModelForSequenceClassification`**:  
  This imports the `AutoModelForSequenceClassification` class from the Hugging Face `transformers` library. This class provides a convenient way to load pre-trained models for sequence classification tasks, such as sentiment analysis.

```python
# Load AraBERT with 4 output labels
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)
```
- **Loading the Pre-Trained AraBERT Model**:  
  - **`AutoModelForSequenceClassification`**: This class automatically selects and loads a suitable pre-trained model for sequence classification tasks (like sentiment analysis).
  - **`from_pretrained(model_name)`**: This method loads a pre-trained model based on the name provided. In this case, `model_name` is set to `"aubmindlab/bert-base-arabertv02"`, which refers to a pre-trained Arabic BERT model by the AUB Mind Lab.
  - **`num_labels=4`**: This configures the model for a classification task with 4 output classes, indexed 0 to 3, which correspond to the original labels 0, 1, 3, and 4 after remapping. The pre-trained AraBERT checkpoint does not include a classification head, so `num_labels` determines the size of a new, randomly initialized output layer that is learned during fine-tuning.

### Summary:
- The code loads the pre-trained AraBERT model for sequence classification, and it specifies that the model will have 4 possible output labels. This setup is essential for training the model to predict sentiment labels from the input text.
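
With `num_labels=4`, the classification head produces four logits per input, one per remapped class id (0 to 3). The sketch below, using made-up logits and hypothetical helper names, shows how such logits could be turned into probabilities and then into an original dataset label.

```python
import math

# Remapped class id -> original dataset label (inverse of the notebook's mapping).
id_to_original = {0: 0, 1: 1, 2: 3, 3: 4}

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_original_label(logits):
    """Pick the highest-scoring class id and map it to the original label."""
    class_id = max(range(len(logits)), key=lambda i: logits[i])
    return id_to_original[class_id]

logits = [0.1, 0.2, 3.0, 0.5]  # made-up values
print(softmax(logits))
print(predict_original_label(logits))
```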

In [None]:
from transformers import AutoModelForSequenceClassification

# Load AraBERT with 4 output labels
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


1. **Data Loading**:  
   - `DataLoader` is used to iterate through `train_dataset` and `test_dataset` in batches, enabling efficient memory use and faster training.  
   - `train_loader`: Batches are shuffled for randomness.  
   - `test_loader`: Shuffling isn't necessary for evaluation.  

2. **Optimizer Setup**:  
   - `AdamW`: A variant of the Adam optimizer with decoupled weight decay, commonly used for fine-tuning Transformer models.  
   - Learning rate: Set to `5e-5`, a common starting point for fine-tuning BERT-style models.  

3. **Device Configuration**:  
   - Checks for GPU availability (`torch.cuda.is_available()`), defaulting to CPU if none is available.  
   - Model is moved to the selected device for faster computation on compatible hardware.  

4. **Training Loop**:  
   - **Outer Loop (Epochs)**: Iterates through the dataset multiple times to improve performance.  
   - **Inner Loop (Batches)**: Processes one batch at a time:  
     - Clears gradients with `optimizer.zero_grad()`.  
     - Moves data (`input_ids`, `attention_mask`, and `labels`) to the device.  
     - Performs a forward pass to compute predictions and calculate the loss.  
     - Computes gradients via backpropagation (`loss.backward()`) and updates model weights (`optimizer.step()`).  
   - Tracks training progress and calculates metrics like **loss** and **accuracy** after each epoch:  
     - **Loss**: Measures model error. Lower loss indicates better performance.  
     - **Accuracy**: Tracks the percentage of correct predictions.  

5. **Metrics Display**:  
   - Prints batch-level loss every 10 batches for progress monitoring.  
   - Outputs average loss and accuracy at the end of each epoch for performance tracking.  

### Purpose:  
The code fine-tunes a pre-trained Transformer model (e.g., AraBERT) on the dataset by iteratively adjusting model weights to minimize prediction errors, using batches of training data for efficient computation.
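
The zero-gradient, forward, backward, update cycle can be seen in isolation on a toy problem. The sketch below is a torch-free illustration with a single weight, made-up data, and a hand-derived gradient. It is not the notebook's training code.

```python
# Toy data following y = 2 * x, split into two batches of (x, y) pairs.
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

w = 0.0               # the single model weight
learning_rate = 0.01  # made-up value for this toy problem

for epoch in range(50):
    total_loss = 0.0
    for batch in batches:
        grad = 0.0  # clear the gradient (the role of optimizer.zero_grad())
        # Forward pass: prediction errors and mean squared error for the batch.
        errors = [w * x - y for x, y in batch]
        loss = sum(e * e for e in errors) / len(batch)
        # Backward pass: d(loss)/dw derived by hand (the role of loss.backward()).
        grad = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        # Update step (the role of optimizer.step()).
        w -= learning_rate * grad
        total_loss += loss
    # Average of the per-batch losses, as in the notebook's epoch metric.
    epoch_loss = total_loss / len(batches)

print(f"learned w = {w:.4f}, last epoch loss = {epoch_loss:.6f}")
```

The weight converges toward 2, the slope used to generate the toy data.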

In [None]:
from torch.utils.data import DataLoader
from torch.optim import AdamW  # the AdamW class in transformers is deprecated; use PyTorch's implementation
import torch

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

# Set optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Select device (use GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training loop
model.train()
for epoch in range(3):  # Number of epochs
    total_loss = 0
    correct_predictions = 0
    total_samples = 0

    for i, batch in enumerate(train_loader):
        optimizer.zero_grad()

        # Move data to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        # Calculate accuracy
        _, preds = torch.max(logits, dim=1)  # Get the predicted class
        correct_predictions += (preds == labels).sum().item()
        total_samples += labels.size(0)

        # Print progress every 10 batches
        if (i + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}, Batch {i + 1}/{len(train_loader)}, Loss: {loss.item():.4f}")

    # Epoch metrics
    epoch_loss = total_loss / len(train_loader)
    epoch_accuracy = (correct_predictions / total_samples) * 100  # Convert to percentage
    print(f"Epoch {epoch + 1}, Average Loss: {epoch_loss:.4f}, Accuracy: {epoch_accuracy:.2f}%")




Epoch 1, Batch 10/5285, Loss: 1.2008
Epoch 1, Batch 20/5285, Loss: 1.2568
Epoch 1, Batch 30/5285, Loss: 1.1756
Epoch 1, Batch 40/5285, Loss: 0.8731
Epoch 1, Batch 50/5285, Loss: 1.1242
Epoch 1, Batch 60/5285, Loss: 0.3804
Epoch 1, Batch 70/5285, Loss: 1.0238
Epoch 1, Batch 80/5285, Loss: 0.4465
Epoch 1, Batch 90/5285, Loss: 0.6120
Epoch 1, Batch 100/5285, Loss: 0.4893
Epoch 1, Batch 110/5285, Loss: 0.6490
Epoch 1, Batch 120/5285, Loss: 0.5577
Epoch 1, Batch 130/5285, Loss: 0.6667
Epoch 1, Batch 140/5285, Loss: 0.7095
Epoch 1, Batch 150/5285, Loss: 0.6321
Epoch 1, Batch 160/5285, Loss: 0.4536
Epoch 1, Batch 170/5285, Loss: 0.2634
Epoch 1, Batch 180/5285, Loss: 0.3555
Epoch 1, Batch 190/5285, Loss: 0.6381
Epoch 1, Batch 200/5285, Loss: 0.7491
Epoch 1, Batch 210/5285, Loss: 0.6537
Epoch 1, Batch 220/5285, Loss: 0.6732
Epoch 1, Batch 230/5285, Loss: 0.5858
Epoch 1, Batch 240/5285, Loss: 0.5510
Epoch 1, Batch 250/5285, Loss: 0.3905
Epoch 1, Batch 260/5285, Loss: 0.4549
Epoch 1, Batch 270/52

This function evaluates a trained model on a test dataset and reports its performance.

1. **Function Arguments**:  
   - **`model`**: The pre-trained and fine-tuned model to evaluate.  
   - **`test_loader`**: The DataLoader object for the test dataset, which provides batches of test data.  
   - **`label_mapping`**: The original-to-remapped label dictionary used during preprocessing; the function inverts it to report results in terms of the original dataset labels.

2. **Evaluation Mode**:  
   - The model is set to evaluation mode (`model.eval()`), which deactivates layers like dropout to ensure consistent predictions.

3. **Prediction and True Label Collection**:  
   - Iterates through the test dataset using the `test_loader`.
   - Moves inputs (`input_ids`, `attention_mask`, and `labels`) to the selected device (CPU/GPU).
   - Performs forward passes to generate logits and uses `torch.max` to get predicted classes.
   - Appends predictions and true labels to separate lists for further analysis.

4. **Reverse Mapping of Labels**:  
   - Converts predictions and true labels back to their original label format using `label_mapping`.

5. **Metrics Calculation**:  
   - **Accuracy**: The percentage of correctly predicted samples. Calculated using `accuracy_score`.
   - **Classification Report**: A detailed breakdown of metrics such as precision, recall, and F1-score for each class, using `classification_report`.

6. **Outputs**:  
   - Prints accuracy and the classification report.
   - Returns lists of mapped predictions and true labels for further use.

### Purpose:  
This function assesses how well the model performs on unseen data by reporting its accuracy and detailed performance metrics, helping identify areas for improvement.
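
For reference, the per-class numbers in a classification report can be computed directly from the two label lists. The function below is a plain-Python sketch of those definitions on made-up labels (the name `per_class_metrics` is hypothetical); in the notebook itself, `classification_report` does this work.

```python
def per_class_metrics(true_labels, predictions):
    """Return {label: (precision, recall, f1)} for every label present."""
    pairs = list(zip(true_labels, predictions))
    metrics = {}
    for label in sorted(set(true_labels) | set(predictions)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Made-up labels for illustration only.
true_demo = [0, 0, 1, 1, 1, 3]
pred_demo = [0, 1, 1, 1, 3, 3]
for label, (p, r, f1) in per_class_metrics(true_demo, pred_demo).items():
    print(f"label {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```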

In [None]:
from sklearn.metrics import classification_report, accuracy_score

def evaluate_model(model, test_loader, label_mapping):
    """
    Evaluates the model on the test dataset and prints accuracy and classification report.

    Args:
        model: Trained model to evaluate.
        test_loader: DataLoader for the test dataset.
        label_mapping: Mapping from original labels to model labels; inverted internally.

    Returns:
        predictions, true_labels: Lists of predicted and actual labels.
    """
    model.eval()
    predictions = []
    true_labels = []

    with torch.no_grad():
        for batch in test_loader:
            # Move data to device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs.logits, dim=1)

            # Store predictions and true labels
            predictions.extend(preds.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    # Reverse map predictions and true labels to original labels
    reverse_label_mapping = {v: k for k, v in label_mapping.items()}
    mapped_predictions = [reverse_label_mapping[p] for p in predictions]
    mapped_true_labels = [reverse_label_mapping[t] for t in true_labels]

    # Calculate accuracy
    acc = accuracy_score(mapped_true_labels, mapped_predictions) * 100

    # Classification report
    print(f"Accuracy: {acc:.2f}%")
    print("\nClassification Report:")
    print(classification_report(mapped_true_labels, mapped_predictions, digits=4))

    return mapped_predictions, mapped_true_labels


This code will evaluate the trained model on the test dataset, using the specified label mapping, and print out the accuracy and detailed classification report. The evaluation involves the following steps:

1. **Label Mapping**: The `label_mapping` dictionary (original labels to remapped labels) is passed in, and the function inverts it to translate the model's class ids back to the original dataset labels. 
   
2. **Evaluate Function**: The `evaluate_model` function will iterate over the test dataset using the `test_loader`, making predictions on the input text, and comparing these predictions to the actual labels.

3. **Metrics**: After evaluating the model's predictions, the function will output:
   - **Accuracy**: The percentage of correct predictions.
   - **Classification Report**: A detailed performance summary that includes precision, recall, and F1 score for each label.

Once the model is evaluated, you’ll get an output showing its accuracy and the classification report, which will help you understand its performance on each class (label).

In [None]:
# Evaluate the model
label_mapping = {0: 0, 1: 1, 3: 2, 4: 3}  # Original to remapped labels
evaluate_model(model, test_loader, label_mapping)


Accuracy: 81.77%

Classification Report:
              precision    recall  f1-score   support

           0     0.8106    0.5942    0.6857      2846
           1     0.8361    0.8975    0.8657      7817
           3     0.8126    0.7550    0.7827      5249
           4     0.7978    0.8829    0.8382      5228

    accuracy                         0.8177     21140
   macro avg     0.8143    0.7824    0.7931     21140
weighted avg     0.8174    0.8177    0.8141     21140
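
The two average rows of the report differ only in how the classes are weighted. As a quick arithmetic check, the snippet below recomputes both F1 averages from the per-class rows printed above. Because the inputs are already rounded to four decimals, the results agree with the report only up to rounding.

```python
# (f1-score, support) per class, copied from the classification report.
rows = {
    0: (0.6857, 2846),
    1: (0.8657, 7817),
    3: (0.7827, 5249),
    4: (0.8382, 5228),
}

total_support = sum(support for _, support in rows.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(total_support)
print(f"macro F1 = {macro_f1:.4f}, weighted F1 = {weighted_f1:.4f}")
```

The macro average sits below the weighted one here because class 0, the smallest class, has the lowest F1.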





### Saving the Model and Tokenizer:

1. **`model.save_pretrained("arabic_sentiment_model")`**:
   - This saves the trained model to the specified directory, `arabic_sentiment_model`.
   - The model's state dictionary (which contains the weights learned during training) is saved, allowing you to reload the model later for inference or further fine-tuning.

2. **`tokenizer.save_pretrained("arabic_sentiment_model")`**:
   - This saves the tokenizer used for processing input text during training to the same directory.
   - The tokenizer is essential because it ensures that the same text preprocessing steps (tokenization, padding, truncation) are applied when you load the model for inference later.
   - The saved tokenizer includes the vocabulary and configuration needed to correctly tokenize the text, ensuring consistent handling of input during both training and inference.

#### Directory Structure:
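The saved model and tokenizer will be stored in the `arabic_sentiment_model` directory, which will contain, among other files:
- **`config.json`**: The configuration of the model (such as model architecture and settings).
- A model weights file (`model.safetensors` or `pytorch_model.bin`, depending on the `transformers` version).
- **`tokenizer_config.json`**: The tokenizer configuration.
- **`special_tokens_map.json`** and **`tokenizer.json`**: The special-token definitions and the serialized tokenizer.
- **`vocab.txt`**: The vocabulary used by the tokenizer.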

This makes it easier to deploy or reload the model and tokenizer for inference or continued training at a later stage. You can reload the saved model and tokenizer with:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("arabic_sentiment_model")
tokenizer = AutoTokenizer.from_pretrained("arabic_sentiment_model")
```

This code would restore the model and tokenizer from the saved directory so that they are ready to be used again.

In [None]:
model.save_pretrained("arabic_sentiment_model")
tokenizer.save_pretrained("arabic_sentiment_model")


('arabic_sentiment_model/tokenizer_config.json',
 'arabic_sentiment_model/special_tokens_map.json',
 'arabic_sentiment_model/vocab.txt',
 'arabic_sentiment_model/added_tokens.json',
 'arabic_sentiment_model/tokenizer.json')