# **Introduction**
**Project Overview:**
This project explores natural language processing (NLP) techniques, specifically classification and text generation, using BERT (Bidirectional Encoder Representations from Transformers) on Arabic text. The tasks involve fine-tuning a BERT model to classify articles from the KALIMAT Multipurpose Arabic Corpus and to generate summaries or extended text.

**Dataset:**
The KALIMAT Corpus consists of Arabic articles in six categories (culture, economy, local news, international news, religion, and sports) and is sourced from the Omani newspaper Alwatan.

In [None]:
!pip install transformers
!pip install arabic_reshaper
!pip install farasa
!pip install torch


# **Data Exploration**

In [None]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('/kaggle/input/twocolumns-dataset/twoColumns.csv')
print(df.head())


# **Preprocessing**

Preprocessing has four steps. First, we clean the text by removing punctuation so that the input is more consistent. Next, we encode the category labels as integers with LabelEncoder, which turns the categorical targets into a machine-readable format. We then split the dataset into training and testing sets (80/20) so that the model can be evaluated on unseen data. Finally, we tokenize both splits with the multilingual BERT tokenizer.

In [None]:
import re

# Function to clean text
def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

df['Text'] = df['Text'].apply(clean_text)
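
To see what this pattern does on Arabic input, the sketch below applies the same regex to a made-up sentence. In Python 3, `\w` is Unicode-aware, so Arabic letters and digits are kept while the Arabic comma (،) and Latin punctuation are removed. The sample string and the `safe_clean_text` wrapper, which guards against missing values, are illustrative assumptions and not part of the original pipeline.

```python
import re

def clean_text(text):
    # Same pattern as above: drop anything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', text)

def safe_clean_text(text):
    # Hypothetical helper: treat missing or non-string cells (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''
    return clean_text(text)

sample = "مرحبا، بالعالم! 2024"
cleaned = safe_clean_text(sample)
print(cleaned)
```

Combining marks such as Arabic diacritics are not word characters either, so this pattern should strip them as well; whether that is desirable depends on the task.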


In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode the labels into numeric values
label_encoder = LabelEncoder()
df['Label'] = label_encoder.fit_transform(df['Label'])
print(label_encoder.classes_)  # To see the mapping of labels to numbers
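
`LabelEncoder` assigns integer ids according to the sorted order of the label strings, which matters later when predictions are mapped back to category names. The following minimal sketch shows the mapping for six assumed label spellings; the actual strings in the CSV may differ, so check `df['Label'].unique()` before relying on them.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label spellings, for illustration only
labels = ['sports', 'culture', 'economy', 'religion', 'local', 'international', 'sports']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the id of a label is its position in this array
id_to_label = {i: str(name) for i, name in enumerate(encoder.classes_)}
print(id_to_label)

# Map an id back to its category name
print(encoder.inverse_transform([5]))
```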


In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(df['Text'], df['Label'], test_size=0.2, random_state=42)
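
The split above is purely random. If the six categories are not equally frequent, a stratified split keeps the class proportions the same in both sets. The sketch below demonstrates the `stratify` argument of `train_test_split` on a made-up toy frame; the toy data and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df: 6 classes, 10 rows each (made-up data)
toy = pd.DataFrame({
    'Text': [f'doc {i}' for i in range(60)],
    'Label': [i % 6 for i in range(60)],
})

train_x, test_x, train_y, test_y = train_test_split(
    toy['Text'], toy['Label'],
    test_size=0.2, random_state=42,
    stratify=toy['Label'],  # keep class proportions equal in both splits
)

# Each class should appear equally often in the test split
print(test_y.value_counts().sort_index())
```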


In [None]:
from transformers import BertTokenizer

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

# Tokenize the text data
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=512)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, max_length=512)
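
The `truncation`, `padding`, and `max_length` arguments control the shape of the encoded batch. The sketch below mimics the idea on plain lists of token ids: sequences longer than `max_length` are cut, shorter ones are padded to the longest sequence in the batch, and an attention mask marks the real tokens. `pad_and_truncate` is a hypothetical helper written for illustration; the real tokenizer also preserves special tokens such as `[CLS]` and `[SEP]`, which this simplification ignores, and the pad id of 0 is an assumption.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    # Cut every sequence to at most max_length ids
    truncated = [seq[:max_length] for seq in batch]
    # padding=True pads to the longest sequence in the batch, not to max_length
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Made-up token ids for three short "documents"
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded['input_ids'])
print(encoded['attention_mask'])
```

One consequence for this project: any article longer than 512 tokens loses its tail, so the classifier only sees the beginning of long articles.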


# **Model Implementation**

# **1- Text Classification**

In this section, we implement a BERT-based classifier that assigns Arabic articles from the KALIMAT dataset to one of six categories: culture, economy, local news, international news, religion, and sports. We use the pre-trained multilingual model `bert-base-multilingual-cased` with a classification head and fine-tune it on the dataset. The text is tokenized with the matching BERT tokenizer, whose vocabulary covers Arabic script, and each article is truncated to at most 512 tokens. The model is fine-tuned for three epochs and then evaluated on the held-out test set. Note that `Trainer.evaluate()` reports only the loss unless a `compute_metrics` function is supplied, so metrics such as accuracy, precision, and recall have to be computed explicitly.

In [None]:
import torch

class ArabicDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example: a tensor for each encoding field (input_ids, attention_mask, ...)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # labels is a pandas Series with a shuffled index, so use positional .iloc
        item['labels'] = torch.tensor(self.labels.iloc[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create dataset objects
train_dataset = ArabicDataset(train_encodings, train_labels)
test_dataset = ArabicDataset(test_encodings, test_labels)


In [None]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load pre-trained BERT model with a classification head
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=6)  # 6 is the number of classes

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

# Initialize the Trainer
trainer = Trainer(
    model=model,                         # the instantiated model to be trained
    args=training_args,                  # training arguments
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset            # evaluation dataset
)
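
The `Trainer` above is created without a `compute_metrics` function, so `trainer.evaluate()` reports only the evaluation loss and runtime statistics. One possible way to obtain accuracy, precision, and recall is sketched below. The function name, the macro averaging, and the toy logits are assumptions, not choices confirmed by the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # Trainer passes (logits, labels); take the highest-scoring class per sample
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='macro', zero_division=0
    )
    return {
        'accuracy': accuracy_score(labels, preds),
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

# Toy check with made-up logits for 3 samples and 6 classes
toy_logits = np.eye(6)[[0, 1, 2]]
toy_labels = np.array([0, 1, 3])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`.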


In [None]:
print(len(train_encodings['input_ids']), len(train_labels))


In [None]:
print(len(train_texts), len(train_labels))  # Both should be the same length


In [None]:
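import os

# Disable Weights & Biases logging. This is most reliable when set before
# TrainingArguments and Trainer are created; passing report_to="none" to
# TrainingArguments is an alternative.
os.environ["WANDB_DISABLED"] = "true"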


In [None]:
trainer.train()


# **Evaluate The Classification Performance**

In [None]:
# Evaluate the model
trainer.evaluate()
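
An aggregate score can hide categories that the model confuses with each other, such as local and international news. The sketch below builds a per-class report and a confusion matrix from logits and true labels. In the notebook, these would come from `trainer.predict(test_dataset)` via its `predictions` and `label_ids` fields. The helper name, the class-name spellings, and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def per_class_report(logits, labels, class_names):
    # Convert logits to predicted class ids
    preds = np.argmax(logits, axis=-1)
    all_ids = list(range(len(class_names)))
    report = classification_report(
        labels, preds, labels=all_ids, target_names=class_names, zero_division=0
    )
    # Rows are true classes, columns are predicted classes
    matrix = confusion_matrix(labels, preds, labels=all_ids)
    return report, matrix

# Assumed class names, in LabelEncoder (alphabetical) order
class_names = ['culture', 'economy', 'international', 'local', 'religion', 'sports']
# Made-up logits standing in for real model output
toy_logits = np.eye(6)[[0, 1, 2, 3, 4, 4]]
toy_labels = np.array([0, 1, 2, 3, 4, 5])

report, matrix = per_class_report(toy_logits, toy_labels, class_names)
print(report)
print(matrix)
```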


In [None]:
model.save_pretrained('./arabic_bert_classifier')
tokenizer.save_pretrained('./arabic_bert_classifier')


In [None]:
# Load the saved model and tokenizer
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('./arabic_bert_classifier')
model = BertForSequenceClassification.from_pretrained('./arabic_bert_classifier')

# Example new text to classify
new_text = ["الرياضة هي جزء مهم من حياة الإنسان، حيث تلعب دورًا كبيرًا في تعزيز الصحة الجسدية والعقلية. ممارسة الرياضة بشكل منتظم تساعد في تقوية العضلات وتحسين اللياقة البدنية، كما تقلل من مخاطر الإصابة بالأمراض المزمنة مثل أمراض القلب والسكري. بالإضافة إلى الفوائد الجسدية، تعزز الرياضة الثقة بالنفس وتساعد على تقليل التوتر والقلق. كما توفر الرياضة فرصة للتفاعل الاجتماعي وتعزز روح الفريق. سواء كنت تمارس رياضة فردية أو جماعية، فإن الرياضة تعد وسيلة فعالة للحفاظ على نمط حياة صحي ونشيط."]

# Tokenize the new text with the same length limit used during training
new_encoding = tokenizer(new_text, truncation=True, padding=True, max_length=512, return_tensors='pt')

# Predict the label
model.eval()
with torch.no_grad():
    outputs = model(**new_encoding)
    predictions = torch.argmax(outputs.logits, dim=1)

# Rebuild the label mapping. If the fitted label_encoder from preprocessing is
# still in memory, reuse it instead of refitting a new one.
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts classes alphabetically, so these strings must match the
# training labels exactly for the ids to line up with the model's outputs
categories = ['culture', 'economy', 'international', 'local', 'religion', 'sports']
label_encoder = LabelEncoder()
label_encoder.fit(categories)

# Convert the predicted label from number to category name
predicted_category = label_encoder.inverse_transform(predictions.cpu().numpy())

# Print the predicted category
print(predicted_category[0])
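
Refitting a `LabelEncoder` on a hand-typed list reproduces the training mapping only if the strings match the training labels exactly. A more robust option is to store the fitted classes next to the model and reload them at inference time. The sketch below uses a JSON file; the file name `label_classes.json`, the helper names, and the temporary directory are illustrative assumptions.

```python
import json
import os
import tempfile

def save_classes(classes, directory):
    # Store class names in encoder order so that list index == label id
    path = os.path.join(directory, 'label_classes.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump([str(c) for c in classes], f, ensure_ascii=False)
    return path

def load_classes(directory):
    path = os.path.join(directory, 'label_classes.json')
    with open(path, encoding='utf-8') as f:
        return json.load(f)

# Demo in a temporary directory; in the notebook this would be the model folder
with tempfile.TemporaryDirectory() as tmp:
    save_classes(['culture', 'economy', 'international', 'local', 'religion', 'sports'], tmp)
    classes = load_classes(tmp)

predicted_id = 5  # stand-in for predictions[0].item()
print(classes[predicted_id])
```

In the notebook, this would amount to calling `save_classes(label_encoder.classes_, './arabic_bert_classifier')` after training and `load_classes('./arabic_bert_classifier')` before prediction.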
