In [None]:
"""
Install necessary NLP libraries compatible with Google Colab base environment.
These include Arabert, HuggingFace Transformers, Datasets, and Evaluate.
"""
!pip uninstall -y transformers --quiet
!pip install transformers==4.41.1 --quiet
!pip install arabert
!pip install datasets
!pip install evaluate


In [None]:
import torch
import matplotlib
import transformers
import datasets
import evaluate

print("Torch version:", torch.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("Transformers version:", transformers.__version__)
print("Datasets version:", datasets.__version__)


Torch version: 2.6.0+cu124
Matplotlib version: 3.10.0
Transformers version: 4.41.1
Datasets version: 2.14.4


In [None]:
"""
This cell creates a modular folder structure for the NLP Political Bias project,
and generates placeholder Python files for each logical component.
"""

import os

# Define base project folder
project_dir = "/content/nlp_project"
os.makedirs(project_dir, exist_ok=True)

# Define target files
files = {
    "data_loader.py": "# data_loader.py - Loads and prepares the filtered dataset\n\n",
    "preprocessing.py": "# preprocessing.py - Cleans and normalizes Arabic text\n\n",
    "models.py": "# models.py - Loads and configures the desired transformer model\n\n",
    "trainer.py": "# trainer.py - Handles training logic, metrics, and weighted loss\n\n",
    "run_training.py": "# run_training.py - Main execution entry point for training the model\n\n"
}

# Create each file with a basic header
for filename, header in files.items():
    path = os.path.join(project_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(header)

print("Project structure created with the following files:")
for fname in files:
    print("-", fname)


Project structure created with the following files:
- data_loader.py
- preprocessing.py
- models.py
- trainer.py
- run_training.py


In [None]:
# This will execute the data_loader.py script as if it were run directly
!python /content/nlp_project/data_loader.py


✅ Cleaned merged dataset saved to: /content/merged_bias_data_cleaned.csv
📊 Dataset Shape: (11968, 5)

🧾 Columns: ['label', 'text', 'explicit_type', 'bias_domain', 'label_id']

🧹 Missing Values:
 label            0
text             0
explicit_type    0
bias_domain      0
label_id         0
dtype: int64

📋 Label Counts:
 label
Unbiased                    6817
Biased against Palestine    2300
Biased against Israel       1981
Biased against Muslims       542
Biased against Jews          328
Name: count, dtype: int64

📋 Bias Domain Counts:
 bias_domain
Political    11098
Religious      870
Name: count, dtype: int64

📋 Explicit Type Counts:
 explicit_type
neutral     6817
Implicit    4449
Explicit     702
Name: count, dtype: int64

📈 Word Count Stats:
 count    11968.000000
mean        35.214405
std         61.571085
min          1.000000
25%         11.000000
50%         17.000000
75%         34.000000
max       1000.000000
Name: word_count, dtype: float64

📊 Label × Bias Domain:
 bias_doma

# 🧠 Project Dataset Overview: Arabic Bias Detection

This summary documents the key characteristics and imbalances in the dataset used for Arabic bias classification (political and religious), based on the merged file:  
📄 `/content/merged_bias_data_cleaned.csv`

---

## 📊 Dataset Size and Structure

- **Total samples**: `11,968`
- **Columns**:
  - `text`: the input sentence
  - `label`: the bias label (in English)
  - `explicit_type`: how the bias is expressed (`Explicit`, `Implicit`, or `neutral`, as spelled in the data)
  - `bias_domain`: the domain of bias (Political/Religious)
  - `label_id`: numeric encoding of the label

---

## 📋 Label Distribution

| Label                    | Count | Percentage |
|--------------------------|-------|------------|
| Unbiased                 | 6817  | 57.0% 🔺 |
| Biased against Palestine | 2300  | 19.2% |
| Biased against Israel    | 1981  | 16.6% |
| Biased against Muslims   | 542   | 4.5% 🔻 |
| Biased against Jews      | 328   | 2.7% 🔻 |

> ⚠️ **Observation**: Strong class imbalance. About 57% of the data is "Unbiased", while "Biased against Muslims" (4.5%) and "Biased against Jews" (2.7%) together account for only 7.3% of the samples.
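
The percentages in the table can be reproduced from the raw counts. The following is a minimal sketch that uses the counts printed by `data_loader.py`; the variable names (`label_counts`, `label_share`, `imbalance_ratio`) are illustrative and not part of the project code.

```python
# Label counts copied from the data_loader.py output.
label_counts = {
    "Unbiased": 6817,
    "Biased against Palestine": 2300,
    "Biased against Israel": 1981,
    "Biased against Muslims": 542,
    "Biased against Jews": 328,
}

total = sum(label_counts.values())

# Share of each class as a percentage of the whole dataset.
label_share = {label: 100 * n / total for label, n in label_counts.items()}

# Imbalance ratio: size of the largest class divided by the smallest.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for label, pct in label_share.items():
    print(f"{label:<26} {pct:5.1f}%")
print(f"Imbalance ratio (largest/smallest): {imbalance_ratio:.1f}")
```

On the loaded DataFrame itself, `df["label"].value_counts(normalize=True)` would give the same shares, assuming the CSV has been read into a pandas DataFrame named `df`.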

---

## 🧭 Bias Domain Distribution

| Domain     | Count | Percentage |
|------------|--------|------------|
| Political  | 11098  | 92.7% |
| Religious  | 870    | 7.3% 🔻 |

> ⚠️ **Observation**: Religious-domain samples make up only 7.3% of the data (870 of 11,968), so oversampling or augmentation will likely be needed to balance the domains.

---

## 💬 Explicit Type Distribution

| Type      | Count | Percentage |
|-----------|--------|------------|
| Neutral   | 6817   | 57.0% 🔺 |
| Implicit  | 4449   | 37.2% |
| Explicit  | 702    | 5.9% 🔻 |

> ⚠️ **Observation**: The `neutral` samples are exactly the "Unbiased" samples (6,817 each), so `explicit_type` adds no information for that class. Explicit examples are rare, at 5.9% of the data.

---

## 📈 Word Count Statistics

- **Mean**: 35.2 words
- **Standard deviation**: 61.6 words
- **Median**: 17 words
- **75th percentile**: 34 words
- **Min**: 1 word
- **Max**: 1000 words

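> ⚠️ **Recommendation**: Truncate or pad inputs to a fixed length such as `max_length = 256` when training transformer models. Note that `max_length` counts subword tokens, not words. Since 75% of samples have 34 words or fewer, a 256-token budget should cover most inputs, but this is worth confirming on the tokenized lengths.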
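
To make the recommendation concrete, here is a minimal, tokenizer-independent sketch of what truncation and padding do to a list of token ids. The helper names `pad_or_truncate` and `attention_mask`, and the padding id `0`, are assumptions for illustration only; the actual padding id depends on the tokenizer that is used.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], max_length: int = 256, pad_id: int = 0) -> List[int]:
    """Return exactly `max_length` ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def attention_mask(token_ids: List[int], max_length: int = 256) -> List[int]:
    """Return 1 for positions holding real tokens and 0 for padding positions."""
    real = min(len(token_ids), max_length)
    return [1] * real + [0] * (max_length - real)


# Hypothetical token ids: one short input and one very long input.
short_ids = [101, 7, 8, 9, 102]
long_ids = list(range(1000))

print(len(pad_or_truncate(short_ids)), sum(attention_mask(short_ids)))
print(len(pad_or_truncate(long_ids)), sum(attention_mask(long_ids)))
```

In practice the Hugging Face tokenizer can do both steps itself when called with `truncation=True, padding="max_length", max_length=256`, so a helper like this is only needed to understand or verify the behavior.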

---

## 🔀 Cross Analysis: Label × Domain

- The `Political` domain contains only:
  - Biased against Israel / Biased against Palestine / Unbiased
- The `Religious` domain contains only:
  - Biased against Muslims / Biased against Jews
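
Because the raw Label × Bias Domain table is cut off in the script output, the mapping can be cross-checked against the marginal totals instead. The sketch below assumes the label-to-domain mapping described in the bullets above; `label_to_domain` is an illustrative name, not a variable from the project.

```python
# Label counts copied from the data_loader.py output.
label_counts = {
    "Unbiased": 6817,
    "Biased against Palestine": 2300,
    "Biased against Israel": 1981,
    "Biased against Muslims": 542,
    "Biased against Jews": 328,
}

# Assumed mapping, taken from the cross-analysis bullets.
label_to_domain = {
    "Unbiased": "Political",
    "Biased against Palestine": "Political",
    "Biased against Israel": "Political",
    "Biased against Muslims": "Religious",
    "Biased against Jews": "Religious",
}

# Sum the label counts per domain.
domain_counts = {}
for label, n in label_counts.items():
    domain = label_to_domain[label]
    domain_counts[domain] = domain_counts.get(domain, 0) + n

print(domain_counts)  # {'Political': 11098, 'Religious': 870}
```

The totals match the reported Bias Domain counts (11,098 and 870), which is consistent with each label belonging to exactly one domain.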

## 🔀 Cross Analysis: Label × Explicit Type

| Label                    | Explicit | Implicit | Neutral |
|--------------------------|----------|----------|---------|
| Biased against Palestine | 44       | 2256     | 0       |
| Biased against Israel    | 204      | 1777     | 0       |
| Biased against Muslims   | 302      | 240      | 0       |
| Biased against Jews      | 152      | 176      | 0       |
| Unbiased                 | 0        | 0        | 6817 ✅ |

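> ✅ **Pattern**: The `neutral` type occurs only in "Unbiased" samples.
>
> ⚠️ **Explicit bias is rare in the political classes** (44 of 2,300 for Palestine, 204 of 1,981 for Israel) but makes up roughly half of the religious classes (302 of 542 for Muslims, 152 of 328 for Jews).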
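
To see how the explicit share varies by label, the table can be turned into per-class percentages. This is a minimal sketch using the counts from the table; `crosstab` and `explicit_share` are illustrative names.

```python
# Explicit / Implicit counts per biased label, copied from the table.
crosstab = {
    "Biased against Palestine": {"Explicit": 44, "Implicit": 2256},
    "Biased against Israel": {"Explicit": 204, "Implicit": 1777},
    "Biased against Muslims": {"Explicit": 302, "Implicit": 240},
    "Biased against Jews": {"Explicit": 152, "Implicit": 176},
}

# Percentage of each biased class that is labeled Explicit.
explicit_share = {
    label: 100 * row["Explicit"] / (row["Explicit"] + row["Implicit"])
    for label, row in crosstab.items()
}

for label, pct in explicit_share.items():
    print(f"{label:<26} {pct:5.1f}% explicit")
```

With a pandas DataFrame `df` loaded from the CSV (an assumption here), `pd.crosstab(df["label"], df["explicit_type"], normalize="index")` would produce the same row-wise shares.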

---

## 🧩 Summary of Key Issues

- **Label Imbalance**:
  - "Unbiased" dominates (about 57% of all samples).
  - The religious bias categories are rare (7.3% combined).

- **Explicit Type Imbalance**:
  - Only 702 explicit examples (5.9%), of which 454 are in the religious classes.
  - Most biased samples are implicit (4,449 of 5,151).

- **Domain Imbalance**:
  - Political samples (92.7%) far outnumber religious samples (7.3%).

- **Long Tail in Text Length**:
  - The median is 17 words, but some samples run up to 1000 words and will need truncation.

---

## 🛠 Next Steps (Recommended)

1. Downsample the "Unbiased" class (the same 6,817 rows as `neutral`), for example by removing 50% of it.
2. Apply class weights in the training loss, or use weighted sampling.
3. Explore data augmentation for underrepresented classes.
4. Optional: Multi-task learning with both `label` and `explicit_type`.
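
Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the project's `trainer.py`: the helpers `downsample` and `balanced_class_weights` are hypothetical, the rows are toy dictionaries that mirror the reported label counts, and the weights use the common inverse-frequency formula `n_samples / (n_classes * class_count)`.

```python
import random
from collections import Counter


def downsample(rows, label, keep_fraction, seed=42):
    """Keep a random `keep_fraction` of the rows with `label`; keep all other rows."""
    rng = random.Random(seed)
    target = [r for r in rows if r["label"] == label]
    others = [r for r in rows if r["label"] != label]
    kept = rng.sample(target, k=int(len(target) * keep_fraction))
    return others + kept


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Toy rows that mirror the reported label counts (no real text involved).
label_counts = {
    "Unbiased": 6817,
    "Biased against Palestine": 2300,
    "Biased against Israel": 1981,
    "Biased against Muslims": 542,
    "Biased against Jews": 328,
}
rows = [{"label": label} for label, n in label_counts.items() for _ in range(n)]

# Step 1: drop half of the "Unbiased" rows.
reduced = downsample(rows, "Unbiased", keep_fraction=0.5)

# Step 2: compute class weights on the reduced set.
weights = balanced_class_weights([r["label"] for r in reduced])

for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label:<26} weight = {w:.2f}")
```

The resulting weights could be passed, ordered by `label_id`, as the `weight` tensor of `torch.nn.CrossEntropyLoss`. The reduced rows should be shuffled before training, since `downsample` returns them grouped by class.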

