**Pretraining Steps**

We decode the binary-encoded `Category` values back into their original labels so that CatBoost can be used to its full potential. Left as numbers, the codes (1, 10, 100, ...) would be treated by CatBoost as numerical features with an implied order rather than as categories. CatBoost would then skip its specialized handling for categorical data (e.g., ordered target encoding), which could reduce model performance. We also merge the title and description into a single text field to give the model more information per book and improve its accuracy.
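The ordered target encoding mentioned above can be illustrated with a small sketch. This is a simplified, single-pass version for a binary target, not CatBoost's actual implementation (which uses several random permutations and its own priors); the function name and the `prior` and `a` parameters are assumptions made for illustration. Each row is encoded using only the targets of earlier rows with the same category, so a row never sees its own label.

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior=0.5, a=1.0):
    """Encode each category using only the targets of the rows that came before it."""
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running number of rows per category
    encoded = []
    for cat, y in zip(categories, targets):
        # Smoothed mean of the targets seen so far for this category
        encoded.append((sums[cat] + a * prior) / (counts[cat] + a))
        # Only now reveal the current row's target to later rows
        sums[cat] += y
        counts[cat] += 1
    return encoded

cats = ["A", "B", "A", "A", "B"]
ys = [1, 0, 0, 1, 1]
enc = ordered_target_stats(cats, ys)
print(enc)  # [0.5, 0.5, 0.75, 0.5, 0.25]
```

A plain integer code has no such statistic behind it: a split such as `code <= 10` simply groups categories by the size of their code, which has no meaning for book genres.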

**Justification for Choosing CatBoost**

CatBoost is a gradient boosting algorithm specifically designed to handle categorical and structured data efficiently. Since book categorization involves predicting a categorical output (genre) from text and structured data, CatBoost is a strong choice.

Key reasons for selecting CatBoost:

1. Handles Categorical Features Without Extensive Preprocessing: Unlike traditional boosting models (e.g., XGBoost), CatBoost can process categorical data directly without requiring one-hot encoding or label encoding.

2. Works with Arabic Text: CatBoost places no language restriction on its inputs. Here the Arabic text is first converted to numeric embeddings with an Arabic BERT model, so the language of the dataset is not an obstacle.

3. Uses Structured Data Alongside Text: CatBoost can incorporate both structured data (such as author, publisher, price) and text-based features (like book descriptions). This allows the model to draw on a richer set of information when making predictions.

4. Regularization Reduces Overfitting: Techniques such as ordered boosting help prevent the model from memorizing the dataset rather than learning meaningful patterns.

**Reference for CatBoost**

A detailed explanation of CatBoost and its applications is available in the original paper:
  Dorogush, A. V., Ershov, V., & Gulin, A. (2018). CatBoost: gradient boosting with categorical features support. arXiv. https://arxiv.org/abs/1810.11363v1

This paper presents the theoretical framework of CatBoost, its advantages over traditional boosting methods, and its success in classification tasks.



In [6]:
# Install all necessary libraries
%pip install catboost
%pip install pandas openpyxl xlrd
%pip install torch
%pip install transformers
%pip install numpy
%pip install scikit-learn


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.
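The `huggingface/tokenizers` warning above asks for `TOKENIZERS_PARALLELISM` to be set explicitly. A minimal sketch, assuming it is placed in a cell that runs before `transformers` is imported and before any `%pip` call forks the process; the value `"false"` is one of the two options the warning itself names:

```python
import os

# Disable tokenizer parallelism up front so forked subprocesses do not trigger the warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting it to `"true"` instead keeps parallel tokenization, at the risk of the deadlocks the warning describes.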



In [7]:
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import joblib
import pickle

# ----------------- Load dataset and BERT Model for Embeddings -----------------

# NOTE: despite the .xls extension, the file is read as CSV text here
file_path = "/home/nouarif4/Downloads/Book_Cleaned_Dataset_.xls"
df = pd.read_csv(file_path, encoding="utf-8-sig")
# Initialize the tokenizer and the BERT model
bert_model_name = 'asafaya/bert-base-arabic'
tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
bert_model = AutoModel.from_pretrained(bert_model_name)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_model.to(device)
bert_model.eval()

# ----------------- Embedding with batching for faster performance -----------------
class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        encoding = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),  # Remove batch dimension
            'attention_mask': encoding['attention_mask'].squeeze(0)
        }

def mean_pooling(model_output, attention_mask):
    last_hidden_state = model_output.last_hidden_state
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    summed = torch.sum(last_hidden_state * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts  # (batch_size, hidden_dim)

def convert_to_embeddings(df, column_names, model_name="asafaya/bert-base-arabic", max_length=512, batch_size=32, device='cuda' if torch.cuda.is_available() else 'cpu'):
    
    print("Loading model and tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device)
    model.eval()
    
    for column_name in column_names:
        print(f"\nProcessing column: {column_name}")
        
        # Ensure all text data is string format
        texts = df[column_name].astype(str).tolist()
        
        # Create dataset and dataloader
        dataset = TextDataset(texts, tokenizer, max_length)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

        embeddings = []
        with torch.no_grad():
            for batch in tqdm(dataloader, desc="Generating embeddings"):
                # Move input tensors to GPU
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                
                # Get model output
                outputs = model(input_ids, attention_mask=attention_mask)
                
                # Apply mean pooling
                sentence_embeddings = mean_pooling(outputs, attention_mask)
                
                # Move to CPU and store embeddings
                embeddings.append(sentence_embeddings.cpu().numpy())

        # Store embeddings in DataFrame
        df[f"{column_name}_embedded"] = list(np.concatenate(embeddings, axis=0))
        print(f"Completed embedding generation for {column_name}")
    
    return df
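
The `mean_pooling` helper above averages token vectors while ignoring padding positions. The same idea can be checked on a tiny array. This is a sketch in NumPy rather than PyTorch, and the function name `masked_mean` and the toy values are assumptions for illustration:

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors over the sequence axis, skipping positions where mask == 0."""
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # padded positions contribute zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts                           # (batch, hidden_dim)

# One sequence with two real tokens and one padding token
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean(hidden, mask)
print(pooled)  # [[2. 3.]]
```

The padding vector `[100, 100]` has no effect on the result, which is the reason for multiplying by the attention mask before summing.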


In [8]:

# ----------------- Decode the categories -----------------

def decode_categories(df):

    # Define the category mapping
    category_map = {
        "الأدب والخيال": "1",
        "الكتب الإسلامية": "10",
        "الاقتصاد والأعمال": "100",
        "الفلسفة": "1000",
        "الصحافة والإعلام": "10000",
        "الكتب السياسية": "100000",
        "العلوم والطبيعة": "1000000",
        "الأسرة والطفل": "10000000",
        "السير والمذكرات": "100000000",
        "الفنون": "1000000000",
        "التاريخ والجغرافيا": "10000000000",
        "الرياضة والتسلية": "100000000000",
        "الشرع والقانون": "1000000000000"
    }
    
    # Create reversed mapping
    reversed_category_map = {v: k for k, v in category_map.items()}
    
    # Convert category values to string to ensure proper matching
    df['Category'] = df['Category'].astype(str)
    
    # Function to safely map categories
    def safe_map_category(x):
        if pd.isna(x) or x == 'nan':
            return np.nan
        
        # Convert the input to a simple string of the number
        x_str = str(int(x))  # This removes leading zeros and converts to simple number string
        
        return reversed_category_map.get(x_str, x)
    
    # Apply the mapping
    df['Category_original'] = df['Category'].apply(safe_map_category)
    
    return df

print(df[['Category']].head(3))
df = decode_categories(df)
print(df[['Category']].head(3))
print(df[['Category_original']].head(3))


   Category
0     10000
1        10
2  10000000
   Category
0     10000
1        10
2  10000000
  Category_original
0  الصحافة والإعلام
1   الكتب الإسلامية
2     الأسرة والطفل
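
The decoding step above relies on every code parsing cleanly with `int(...)`. If the column were ever read as floats (for example `"10000.0"`), that call would raise an error. A minimal sketch of a more forgiving normalizer follows; the function name `normalize_code` is hypothetical and not part of the notebook:

```python
def normalize_code(value):
    """Return the category code as a plain digit string, or None if missing or unparseable."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "" or text.lower() == "nan":
        return None
    try:
        # Going through float accepts "10000.0"; codes up to 10**12 stay exact
        return str(int(float(text)))
    except (ValueError, OverflowError):
        return None

print(normalize_code("10000.0"))  # 10000
print(normalize_code("0010"))     # 10
print(normalize_code("nan"))      # None
```

The normalized string can then be looked up in `reversed_category_map` exactly as before.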


In [9]:
# ----------------- Merge Title & Description -----------------
# Merge the two text fields so the model sees more text per book, which should improve accuracy
df['Title_Description'] = df['Title'] + " " + df['Description']

# ----------------- max_length decision-making process -----------------

# Choose max_length based on the word-count distribution below
# Calculate the number of words in each text
df['word_count_Title_Description'] = df['Title_Description'].apply(lambda x: len(x.split()))

# Analyze the distribution
print(df['word_count_Title_Description'].describe())
df = df.drop(['word_count_Title_Description'], axis=1)


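# The 75th percentile is 78 words, so max_length=128 should cover most texts while avoiding
# excessive padding for shorter descriptions; very long descriptions are truncated.
# Note that max_length counts subword tokens rather than words, so actual coverage may be lower.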
df = convert_to_embeddings(df, 
                         column_names=['Title_Description'], 
                         max_length=128, 
                         batch_size=32)


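# NOTE: this step does not flatten the embeddings. It drops the merged text column and
# re-adds the same text under a column named 0 (visible in the output of the next cell).
# The embeddings stay in 'Title_Description_embedded', which is what training uses.
df = pd.concat([df.drop(['Title_Description'], axis=1),
                df['Title_Description'].apply(pd.Series)], axis=1)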


count    3299.000000
mean       99.467717
std        95.465457
min         5.000000
25%        60.000000
50%        67.000000
75%        78.000000
max      1377.000000
Name: word_count_Title_Description, dtype: float64
Loading model and tokenizer...

Processing column: Title_Description


Generating embeddings: 100%|██████████████████| 104/104 [05:51<00:00,  3.38s/it]

Completed embedding generation for Title_Description
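
The choice of `max_length=128` above is based on word counts, while the tokenizer truncates on subword tokens. One way to check the cutoff is to measure what fraction of texts fit once special tokens are counted. The sketch below uses made-up token lengths, and the helper name `coverage_at` is hypothetical; in the notebook the lengths could come from something like `len(tokenizer(text, add_special_tokens=False)["input_ids"])` for each text.

```python
import numpy as np

def coverage_at(token_lengths, max_length, special_tokens=2):
    """Fraction of texts that fit within max_length without truncation.

    special_tokens accounts for the [CLS] and [SEP] tokens BERT adds to each sequence.
    """
    lengths = np.asarray(token_lengths)
    return float(np.mean(lengths + special_tokens <= max_length))

# Hypothetical token lengths, for illustration only
token_lengths = [40, 90, 120, 126, 127, 300, 700, 64]
cov = coverage_at(token_lengths, 128)
print(cov)  # 0.625
```

If the measured coverage on the real data is much lower than expected, a larger `max_length` (up to the model's limit of 512) trades speed for less truncation.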





In [10]:
# Final state of the dataset. For training we use Category_original,
# the decoded category labels, as the target.
print(df.head(1))

                                             Title  Author  \
0  التشبيك وميثاق الممارسة في عمل المنظمات الأهلية    2073   

                                         Description  Pages  Publication year  \
0  تقرير يوثق أعمال ورشة عمل 1995 عن محاولة صياغة...     40              2003   

   Publisher Category  Subcategory  Price Page Range Category_original  \
0        145    10000           65  16.88       0-50  الصحافة والإعلام   

                          Title_Description_embedded  \
0  [0.4847845, -0.12686867, 0.12688579, -0.401593...   

                                                   0  
0  التشبيك وميثاق الممارسة في عمل المنظمات الأهلي...  


In [11]:
# ----------------- Train CatBoost Classifier -----------------

# Use only the merged Title + Description embeddings as features and Category_original as the target
X = df['Title_Description_embedded'].apply(np.array).tolist()  # One embedding vector per row
X = np.array(X)  # Convert the list to a 2-D NumPy array of shape (n_samples, hidden_dim)

# The target variable is 'Category_original'
y = df['Category_original']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the CatBoost model
catboost_model = CatBoostClassifier(iterations=500,  # Number of trees
                          depth=6,  # Depth of each tree
                          learning_rate=0.05,  # Learning rate
                          loss_function='MultiClass',  # Multi-class classification
                          cat_features=[],  # No categorical features in this case
                          early_stopping_rounds=5,
                          )

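# Train the model. The test split doubles as the early-stopping validation set here,
# so the reported test accuracy can be slightly optimistic.
catboost_model.fit(X_train, 
                   y_train, 
                   eval_set=(X_test, y_test),  # Validation data used for early stopping
                   verbose=200)  # Print training progress every 200 iterations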

# Calculate training accuracy
train_accuracy = accuracy_score(y_train, catboost_model.predict(X_train))
print(f"Training Accuracy: {train_accuracy:.4f}")

# Make predictions on the test set
y_pred = catboost_model.predict(X_test)
# Save predictions to a CSV file
pd.DataFrame(y_pred, columns=['Predictions']).to_csv("y_pred_catboost.csv", index=False)
# Save trained model
joblib.dump(catboost_model, "catboost_classifier.pkl")

# Evaluate the model (Test Accuracy)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")

0:	learn: 2.5174652	test: 2.5216084	best: 2.5216084 (0)	total: 588ms	remaining: 4m 53s
200:	learn: 0.7639073	test: 1.2565981	best: 1.2565981 (200)	total: 1m 55s	remaining: 2m 51s
400:	learn: 0.4780127	test: 1.1214042	best: 1.1213550 (399)	total: 3m 49s	remaining: 56.6s
499:	learn: 0.4094757	test: 1.0935687	best: 1.0934911 (498)	total: 4m 45s	remaining: 0us

bestTest = 1.09349109
bestIteration = 498

Shrink model to first 499 iterations.
Training Accuracy: 0.9492
Test Accuracy: 0.6742
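
In the training cell above, the test split is also passed as `eval_set`, so the best iteration is selected on the same data that the final accuracy is reported on. A common alternative is a three-way split with a separate validation set. The sketch below uses synthetic data and a 60/20/20 ratio, both of which are assumptions; `stratify` keeps class proportions equal across splits but requires every class to have enough samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the embeddings and labels (illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
y = np.repeat(["a", "b", "c", "d"], 25)

# First hold out 20% as the final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Then carve a validation set out of the remainder (0.25 * 80% = 20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

With this layout, `(X_val, y_val)` would be passed as `eval_set` and the test set would be used only once, for the final accuracy.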


In [12]:
from sklearn.model_selection import cross_val_score

# Cross-validation for CatBoost
cross_val_scores = cross_val_score(catboost_model, X_train, y_train, cv=5)  # 5-fold cross-validation
print(f"Cross-validation scores: {cross_val_scores}")
print(f"Mean CV score: {cross_val_scores.mean()}")

0:	learn: 2.5195859	total: 532ms	remaining: 4m 25s
1:	learn: 2.4826376	total: 1.08s	remaining: 4m 29s
2:	learn: 2.4368976	total: 1.61s	remaining: 4m 26s
3:	learn: 2.4063440	total: 2.17s	remaining: 4m 29s
4:	learn: 2.3689126	total: 2.69s	remaining: 4m 26s
5:	learn: 2.3327883	total: 3.21s	remaining: 4m 24s
6:	learn: 2.2993407	total: 3.82s	remaining: 4m 29s
7:	learn: 2.2710872	total: 4.42s	remaining: 4m 31s
8:	learn: 2.2367428	total: 4.88s	remaining: 4m 26s
9:	learn: 2.2098137	total: 5.36s	remaining: 4m 22s
10:	learn: 2.1788338	total: 5.83s	remaining: 4m 19s
11:	learn: 2.1510802	total: 6.3s	remaining: 4m 16s
12:	learn: 2.1267024	total: 6.77s	remaining: 4m 13s
13:	learn: 2.1005105	total: 7.28s	remaining: 4m 12s
14:	learn: 2.0711184	total: 7.82s	remaining: 4m 12s
15:	learn: 2.0431959	total: 8.34s	remaining: 4m 12s
16:	learn: 2.0209614	total: 8.81s	remaining: 4m 10s
17:	learn: 1.9942894	total: 9.3s	remaining: 4m 9s
18:	learn: 1.9765545	total: 9.76s	remaining: 4m 7s
19:	learn: 1.9547546	total

In [13]:
# Classify a new description (example)
example_description = "قصص مغامرات للأطفال"

# Tokenize using BERT
example_inputs = tokenizer(
    example_description, truncation=True, max_length=128, padding='max_length', return_tensors='pt'
).to(device)

# Generate embedding
with torch.no_grad():
    outputs = bert_model(example_inputs['input_ids'], attention_mask=example_inputs['attention_mask'])

# Mean pooling (reuses the mean_pooling helper defined earlier)
mean_pooled = mean_pooling(outputs, example_inputs['attention_mask'])  # Final sentence embedding

# Convert to NumPy (reshape for CatBoost)
example_embedding = mean_pooled.cpu().numpy().reshape(1, -1)

# Predict category using trained CatBoost model
example_category = catboost_model.predict(example_embedding)[0]

print(f"Predicted Category: {example_category}")


Predicted Category: ['الأدب والخيال']
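
A single predicted label hides how confident the model is. `CatBoostClassifier` also exposes `predict_proba` and a `classes_` attribute, which can be combined to list the most likely categories. The sketch below shows only the ranking step in NumPy; the probability values are made up for illustration and the helper name `top_k` is hypothetical.

```python
import numpy as np

def top_k(classes, probs, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    order = np.argsort(probs)[::-1][:k]
    return [(classes[i], float(probs[i])) for i in order]

# Made-up probabilities for four of the categories (illustration only)
classes = ["الأدب والخيال", "الأسرة والطفل", "الفنون", "الفلسفة"]
probs = np.array([0.55, 0.30, 0.10, 0.05])

ranked = top_k(classes, probs, k=2)
for label, p in ranked:
    print(f"{label}: {p:.2f}")
```

In the notebook, `probs` would be `catboost_model.predict_proba(example_embedding)[0]` and `classes` would be `catboost_model.classes_`.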
