# Step 1: Library Imports and Data Loading

In this block, we import the libraries used throughout the notebook (PyTorch, Transformers, scikit-learn, NLTK, and NumPy). 
We also define `data_folder_path` and loop through each author directory to load the article files, reading each file as UTF-8 and falling back to ISO-8859-9 if decoding fails. 
Each article is stored as a tuple `(author, content)` in the `articles` list.


In [1]:
# Step 1 - Library Imports and Data Loading

import os
import re
import torch
import nltk
import numpy as np

# For creating DataLoader and TensorDataset
from torch.utils.data import DataLoader, TensorDataset

# Transformers: BERT tokenizer and classification model
from transformers import BertTokenizer, BertForSequenceClassification

# AdamW is imported from PyTorch; the copy in Transformers is deprecated
from torch.optim import AdamW

# scikit-learn tools: label encoding, k-fold, metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, precision_recall_fscore_support

# The stopword list is downloaded for optional preprocessing;
# the steps below do not use it
nltk.download('stopwords')
from nltk.corpus import stopwords

# Define the data folder path
data_folder_path = "data/finalDataset/makaleler-yazarlar"

# This list will hold (author, article_content) tuples
articles = []

# Loop through each author's folder and read article files
for author_folder in os.listdir(data_folder_path):
    author_path = os.path.join(data_folder_path, author_folder)
    if os.path.isdir(author_path):
        for article_file in os.listdir(author_path):
            article_path = os.path.join(author_path, article_file)
            
            # Try UTF-8 first; fall back to ISO-8859-9 (Latin-5, Turkish) on a decode error
            try:
                with open(article_path, 'r', encoding='utf-8') as file:
                    content = file.read()
            except UnicodeDecodeError:
                with open(article_path, 'r', encoding='ISO-8859-9') as file:
                    content = file.read()

            articles.append((author_folder, content))

print(f"Total articles loaded: {len(articles)}")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Deder\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Total articles loaded: 1500
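
The UTF-8 then ISO-8859-9 fallback used in the loading loop can be factored into a small helper. The sketch below is one possible way to do that, not the notebook's own code: `read_text_with_fallback` and `count_articles_per_author` are hypothetical names introduced here for illustration.

```python
import os
from collections import Counter


def read_text_with_fallback(path, encodings=("utf-8", "ISO-8859-9")):
    """Return the file's text, trying each encoding in order."""
    for encoding in encodings[:-1]:
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read()
        except UnicodeDecodeError:
            continue
    # Last encoding in the list: let any remaining decode error propagate
    with open(path, "r", encoding=encodings[-1]) as handle:
        return handle.read()


def count_articles_per_author(articles):
    """Count how many (author, content) tuples belong to each author."""
    return Counter(author for author, _ in articles)
```

Calling `count_articles_per_author(articles)` after loading is a quick sanity check. The final report lists a support of 50 for each author it shows, so a count of 50 per author is what one would expect to see here.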


# Step 2: Text Preprocessing

Here, we define a simple `preprocess_text` function. 
It replaces multiple whitespace characters with a single space, and strips leading/trailing spaces.
You could optionally add more steps (stopword removal, punctuation handling, etc.) if desired.
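
As a rough illustration of the optional steps mentioned above, the sketch below adds punctuation and stopword removal behind flags. It is not part of the original pipeline: `preprocess_text_extended` and `EXAMPLE_STOPWORDS` are hypothetical, and the tiny stopword set is only a stand-in for a real list such as NLTK's Turkish stopwords.

```python
import re
import string

# Tiny illustrative stopword set (assumption); a real run might load
# a full Turkish list, for example from nltk.corpus.stopwords.
EXAMPLE_STOPWORDS = {"ve", "de", "da", "bir", "ki"}


def preprocess_text_extended(text, remove_punctuation=False, stopword_set=None):
    """Whitespace cleanup plus optional punctuation and stopword removal."""
    if remove_punctuation:
        # Replace ASCII punctuation with spaces so neighbouring words stay separated
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
        text = text.translate(table)
    text = re.sub(r"\s+", " ", text).strip()
    if stopword_set:
        # str.lower() is not Turkish-aware ("I" becomes "i", not "ı")
        words = [w for w in text.split() if w.lower() not in stopword_set]
        text = " ".join(words)
    return text
```

For authorship attribution, function words and punctuation habits may themselves carry stylistic signal, so it is worth comparing results with and without these steps rather than assuming they help.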


In [2]:
# Step 2 - Text Preprocessing

def preprocess_text(text):
    """
    A simple preprocessing function:
      - Collapse each run of whitespace (spaces, tabs, newlines) into a single space
      - Strip leading/trailing whitespace
    """
    text = re.sub(r'\s+', ' ', text)  # Collapse whitespace runs into one space
    text = text.strip()               # Remove leading/trailing whitespace
    
    # If using a cased model for Turkish, you might keep original casing.
    # You can modify or add steps as needed.
    
    return text

# Apply preprocessing to each article
preprocessed_articles = [(author, preprocess_text(content)) for author, content in articles]

# Print an example preprocessed article
print("Example of a preprocessed article:")
print(preprocessed_articles[0])


Example of a preprocessed article:
('AHMET ÇAKAR', "Fernando Muslera! Önce şunu belirtelim ki; Galatasaray yerli ve yabancı transferlerini mükemmel yapmış. Mesela kaleci Muslera... Kritik anlarda maçı kurtaran adam oldu. Mesela emektar Necati... Galatasaray'ın en faydalı oyuncusu. Gol attı; gol attırdı; ileride top tuttu; mükemmel oynadı. Üstelik attığı ilk gol de, yılın golleri arasına girebilecek güzellikte. 25 metreden ve kalecinin üzerinden öyle bir vurdu ki, maçın tüm stratejisini Galatasaray lehine değiştiriverdi. Skor kimseyi aldatmasın. Sivas, kendi çapında çok iyi bir takım. Üstelik saha ve hava şartları da onların alışkın olduğu cinstendi. Ve daha önemlisi, maçın hemen başında mağlup duruma düşmüş olsalar da, özellikle ilk yarıda disiplini hiç kaybetmediler. Ve çok da önemli pozisyonlar buldular. Fakat Galatasaray pozisyon verse de, önce kaleci Muslera, sonra da oyuncuların panik yapmaması sarı-kırmızılılarda dengeyi getirdi. Çünkü ilk yarıda yenecek bir gol, her şeyi berbat 

# Step 3: Tokenization and Label Encoding

In this block, we use the **BERT tokenizer** (Turkish cased) to tokenize all texts. 
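We also use `LabelEncoder` to convert author names into numeric IDs (0 through 29).
With `truncation=True` and `max_length=256`, articles longer than 256 tokens are cut to that length; `padding=True` then pads shorter articles to the longest sequence in the batch, which is 256 here.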
Finally, we extract `input_ids_all` and `attention_mask_all` from the tokenizer output.
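
To make the relationship between `input_ids` and `attention_mask` concrete, here is a simplified plain-Python sketch of truncation and padding. It is only an illustration: the real tokenizer also inserts special tokens such as `[CLS]` and `[SEP]` and accounts for them when truncating. `pad_or_truncate`, the made-up token IDs, and `pad_id=0` are assumptions for this example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each of length max_length."""
    kept = token_ids[:max_length]      # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)     # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded; a long one is truncated (IDs are made up)
short_ids, short_mask = pad_or_truncate([2, 7, 8, 3], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
```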


In [3]:
# Step 3 - Tokenization and Label Encoding

# Instantiate the Turkish cased BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# Separate authors and texts into two lists
authors = [author for author, _ in preprocessed_articles]
texts = [content for _, content in preprocessed_articles]

# Convert author names to numeric IDs
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(authors)

print(f"Number of classes (authors): {len(label_encoder.classes_)}")  # Expect 30

# Tokenize all texts with BERT
encoded_inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=256,  # Reduced from BERT's 512-token limit because of memory constraints on this machine
    return_tensors="pt",
    return_attention_mask=True
)

# Extract input_ids and attention masks
input_ids_all = encoded_inputs["input_ids"]
attention_mask_all = encoded_inputs["attention_mask"]

print("Tokenization completed.")
print("Shape of input_ids_all:", input_ids_all.shape)


Number of classes (authors): 30
Tokenization completed.
Shape of input_ids_all: torch.Size([1500, 256])
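
`LabelEncoder` assigns IDs according to the sorted order of the distinct class names, and `label_encoder.inverse_transform` maps IDs back to names. The plain-Python sketch below mimics that mapping on a few author names so the ordering is easy to inspect; `encode_labels` and `decode_labels` are hypothetical helpers, not scikit-learn functions.

```python
def encode_labels(names):
    """Map each distinct name to an integer ID, in sorted order."""
    classes = sorted(set(names))
    name_to_id = {name: idx for idx, name in enumerate(classes)}
    return classes, [name_to_id[name] for name in names]


def decode_labels(classes, ids):
    """Map integer IDs back to their names."""
    return [classes[i] for i in ids]


sample_authors = ["CAN ATAKLI", "AHMET ÇAKAR", "AZİZ ÜSTEL", "AHMET ÇAKAR"]
classes, ids = encode_labels(sample_authors)
restored = decode_labels(classes, ids)
```

Sorting is by Unicode code point rather than by Turkish alphabetical rules, which is why the class order in the reports may differ from a dictionary ordering of the names.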


# Step 4: 5-Fold Cross Validation, Model Training, and Evaluation

We use `StratifiedKFold` to split the data into 5 folds while preserving class balance, so each fold trains on 1,200 articles and validates on 300 (10 per author). 
For each fold, we instantiate a fresh `BertForSequenceClassification` model, train it for 4 epochs, 
and evaluate on the validation portion after every epoch. 
We gather predictions and true labels to generate a fold-wise classification report.
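
What stratification guarantees can be shown with a simplified plain-Python stand-in. This is not scikit-learn's algorithm, only a round-robin sketch under the assumption of 30 authors with 50 articles each; `stratified_folds` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Deal each class's shuffled indices round-robin into n_splits folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Synthetic labels: 30 classes x 50 samples, mirroring the dataset's shape
labels = [author_id for author_id in range(30) for _ in range(50)]
folds = stratified_folds(labels)

for fold in folds:
    per_class = Counter(labels[i] for i in fold)
    # Every validation fold holds 300 samples, 10 from each class
    assert len(fold) == 300 and set(per_class.values()) == {10}
```

Because the folds partition the index set, every sample is used for validation exactly once across the five folds.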


In [4]:
# Step 4 - 5-Fold Cross Validation, Model Training, and Evaluation

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_labels = len(label_encoder.classes_)

# StratifiedKFold object for 5-fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Lists to store the true labels and predictions across all folds
all_folds_true = []
all_folds_pred = []

fold_index = 1

# Perform the 5-fold split
for train_index, val_index in skf.split(input_ids_all, labels):
    print(f"\n===== Fold {fold_index} =====")
    
    # Split data into train/validation sets for this fold
    X_train_ids = input_ids_all[train_index]
    X_train_mask = attention_mask_all[train_index]
    y_train = labels[train_index]

    X_val_ids = input_ids_all[val_index]
    X_val_mask = attention_mask_all[val_index]
    y_val = labels[val_index]
    
    # Create TensorDatasets; labels are cast to int64, which CrossEntropyLoss expects
    train_dataset = TensorDataset(X_train_ids, X_train_mask, torch.tensor(y_train, dtype=torch.long))
    val_dataset = TensorDataset(X_val_ids, X_val_mask, torch.tensor(y_val, dtype=torch.long))
    
    # DataLoaders for batch processing
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
    
    # Load the pre-trained BERT model for sequence classification
    model = BertForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-cased",
        num_labels=num_labels
    ).to(device)
    
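    # AdamW optimizer; weight_decay=0.0 is stated explicitly so the setting
    # does not depend on which AdamW implementation is imported
    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)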

    # CrossEntropy loss for multi-class classification
    loss_fn = torch.nn.CrossEntropyLoss()
    
    epochs = 4  # Number of epochs
    
    # Training loop
    for epoch in range(epochs):
        model.train()
        total_train_loss = 0
        correct_train = 0
        
        for batch in train_loader:
            input_ids_batch, attn_mask_batch, labels_batch = [b.to(device) for b in batch]
            
            # Reset gradients
            optimizer.zero_grad()
            
            # Forward pass
            outputs = model(input_ids_batch, attention_mask=attn_mask_batch)
            logits = outputs.logits
            
            # Compute loss
            loss = loss_fn(logits, labels_batch)
            total_train_loss += loss.item()
            
            # Predictions for accuracy
            _, preds = torch.max(logits, dim=1)
            correct_train += torch.sum(preds == labels_batch)
            
            # Backprop
            loss.backward()
            optimizer.step()
        
        avg_train_loss = total_train_loss / len(train_loader)
        train_acc = correct_train.double() / len(train_loader.dataset)
        
        print(f"Epoch {epoch+1}: Train Loss: {avg_train_loss:.4f}, Train Acc: {train_acc:.4f}")
        
        # Validation step
        model.eval()
        val_loss = 0
        correct_val = 0
        
        with torch.no_grad():
            for batch in val_loader:
                input_ids_batch, attn_mask_batch, labels_batch = [b.to(device) for b in batch]
                outputs = model(input_ids_batch, attention_mask=attn_mask_batch)
                logits = outputs.logits
                
                loss = loss_fn(logits, labels_batch)
                val_loss += loss.item()
                
                _, preds = torch.max(logits, dim=1)
                correct_val += torch.sum(preds == labels_batch)
        
        avg_val_loss = val_loss / len(val_loader)
        val_acc = correct_val.double() / len(val_loader.dataset)
        
        print(f"Validation Loss: {avg_val_loss:.4f}, Validation Acc: {val_acc:.4f}")
    
    # After training, collect predictions on the validation set
    model.eval()
    fold_preds = []
    fold_true = []

    with torch.no_grad():
        for batch in val_loader:
            input_ids_batch, attn_mask_batch, labels_batch = [b.to(device) for b in batch]
            outputs = model(input_ids_batch, attention_mask=attn_mask_batch)
            logits = outputs.logits
            
            _, preds = torch.max(logits, dim=1)
            
            fold_preds.extend(preds.cpu().numpy())
            fold_true.extend(labels_batch.cpu().numpy())
    
    # Add results from this fold to the global lists
    all_folds_true.extend(fold_true)
    all_folds_pred.extend(fold_preds)

    # Classification report for this fold
    print("\nClassification report for this fold:")
    fold_report = classification_report(
        fold_true, 
        fold_preds, 
        target_names=label_encoder.classes_, 
        zero_division=0
    )
    print(fold_report)
    
    fold_index += 1
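
One detail of the loss bookkeeping above: `CrossEntropyLoss` returns the mean loss of a batch by default, so dividing the summed batch losses by `len(val_loader)` gives a mean of batch means. With 300 validation articles and `batch_size=16`, the last batch has only 12 samples, so it is weighted slightly more per sample. The sketch below contrasts the two averages using hypothetical loss values; the function names are illustrative only.

```python
def mean_of_batch_means(batch_losses):
    """Average as computed in the loop: every batch counts equally."""
    return sum(batch_losses) / len(batch_losses)


def sample_weighted_mean(batch_losses, batch_sizes):
    """Exact per-sample average: every sample counts equally."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# 300 validation samples at batch_size=16: 18 full batches plus one of 12
batch_sizes = [16] * 18 + [12]
# Hypothetical per-batch mean losses, chosen to make the gap visible
batch_losses = [1.0] * 18 + [2.0]

approx = mean_of_batch_means(batch_losses)
exact = sample_weighted_mean(batch_losses, batch_sizes)
```

The training set of 1,200 articles divides evenly into 75 batches of 16, so the two averages coincide for the training loss. For validation the difference is usually small, which is probably why the simpler form is common.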



===== Fold 1 =====


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-turkish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1: Train Loss: 3.3525, Train Acc: 0.0742
Validation Loss: 2.9317, Validation Acc: 0.2133
Epoch 2: Train Loss: 2.4873, Train Acc: 0.4117
Validation Loss: 1.8534, Validation Acc: 0.6300
Epoch 3: Train Loss: 1.5101, Train Acc: 0.7625
Validation Loss: 1.1356, Validation Acc: 0.8033
Epoch 4: Train Loss: 0.8887, Train Acc: 0.8908
Validation Loss: 0.8070, Validation Acc: 0.8067

Classification report for this fold:
                   precision    recall  f1-score   support

      AHMET ÇAKAR       0.82      0.90      0.86        10
       ALİ SİRMEN       0.80      0.40      0.53        10
 ATAOL BEHRAMOĞLU       0.44      0.40      0.42        10
    ATİLLA DORSAY       0.77      1.00      0.87        10
      AYKAN SEVER       0.89      0.80      0.84        10
       AZİZ ÜSTEL       1.00      1.00      1.00        10
       CAN ATAKLI       1.00      1.00      1.00        10
      DENİZ GÖKÇE       1.00      1.00      1.00        10
      EMRE KONGAR       0.38      0.50      0.43  

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-turkish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1: Train Loss: 3.3959, Train Acc: 0.0558
Validation Loss: 3.1650, Validation Acc: 0.1333
Epoch 2: Train Loss: 2.8402, Train Acc: 0.2492
Validation Loss: 2.4392, Validation Acc: 0.4100
Epoch 3: Train Loss: 2.0762, Train Acc: 0.5517
Validation Loss: 1.6927, Validation Acc: 0.6633
Epoch 4: Train Loss: 1.3346, Train Acc: 0.7683
Validation Loss: 1.0612, Validation Acc: 0.8000

Classification report for this fold:
                   precision    recall  f1-score   support

      AHMET ÇAKAR       0.80      0.80      0.80        10
       ALİ SİRMEN       0.86      0.60      0.71        10
 ATAOL BEHRAMOĞLU       1.00      0.50      0.67        10
    ATİLLA DORSAY       0.64      0.90      0.75        10
      AYKAN SEVER       1.00      0.90      0.95        10
       AZİZ ÜSTEL       0.91      1.00      0.95        10
       CAN ATAKLI       1.00      1.00      1.00        10
      DENİZ GÖKÇE       1.00      1.00      1.00        10
      EMRE KONGAR       0.53      0.90      0.67  

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-turkish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1: Train Loss: 3.3399, Train Acc: 0.0783
Validation Loss: 2.9569, Validation Acc: 0.2767
Epoch 2: Train Loss: 2.4431, Train Acc: 0.4442
Validation Loss: 1.8370, Validation Acc: 0.6100
Epoch 3: Train Loss: 1.5061, Train Acc: 0.7025
Validation Loss: 1.2012, Validation Acc: 0.7833
Epoch 4: Train Loss: 0.9085, Train Acc: 0.8808
Validation Loss: 0.8442, Validation Acc: 0.8300

Classification report for this fold:
                   precision    recall  f1-score   support

      AHMET ÇAKAR       0.67      1.00      0.80        10
       ALİ SİRMEN       0.80      0.80      0.80        10
 ATAOL BEHRAMOĞLU       1.00      0.80      0.89        10
    ATİLLA DORSAY       0.82      0.90      0.86        10
      AYKAN SEVER       1.00      0.70      0.82        10
       AZİZ ÜSTEL       1.00      1.00      1.00        10
       CAN ATAKLI       0.91      1.00      0.95        10
      DENİZ GÖKÇE       1.00      1.00      1.00        10
      EMRE KONGAR       0.70      0.70      0.70  

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-turkish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1: Train Loss: 3.3460, Train Acc: 0.0767
Validation Loss: 2.9600, Validation Acc: 0.2033
Epoch 2: Train Loss: 2.4911, Train Acc: 0.4042
Validation Loss: 1.8815, Validation Acc: 0.6100
Epoch 3: Train Loss: 1.5570, Train Acc: 0.7108
Validation Loss: 1.2803, Validation Acc: 0.7400
Epoch 4: Train Loss: 0.9433, Train Acc: 0.8708
Validation Loss: 0.8327, Validation Acc: 0.8133

Classification report for this fold:
                   precision    recall  f1-score   support

      AHMET ÇAKAR       0.59      1.00      0.74        10
       ALİ SİRMEN       1.00      0.30      0.46        10
 ATAOL BEHRAMOĞLU       0.58      0.70      0.64        10
    ATİLLA DORSAY       0.89      0.80      0.84        10
      AYKAN SEVER       0.80      0.80      0.80        10
       AZİZ ÜSTEL       1.00      1.00      1.00        10
       CAN ATAKLI       1.00      1.00      1.00        10
      DENİZ GÖKÇE       1.00      1.00      1.00        10
      EMRE KONGAR       0.70      0.70      0.70  

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-turkish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1: Train Loss: 3.3757, Train Acc: 0.0592
Validation Loss: 3.0359, Validation Acc: 0.1800
Epoch 2: Train Loss: 2.4907, Train Acc: 0.4067
Validation Loss: 1.8428, Validation Acc: 0.5867
Epoch 3: Train Loss: 1.5171, Train Acc: 0.7258
Validation Loss: 1.1847, Validation Acc: 0.7800
Epoch 4: Train Loss: 0.9182, Train Acc: 0.8800
Validation Loss: 0.7886, Validation Acc: 0.8333

Classification report for this fold:
                   precision    recall  f1-score   support

      AHMET ÇAKAR       1.00      0.90      0.95        10
       ALİ SİRMEN       0.36      0.90      0.51        10
 ATAOL BEHRAMOĞLU       0.00      0.00      0.00        10
    ATİLLA DORSAY       0.83      1.00      0.91        10
      AYKAN SEVER       0.90      0.90      0.90        10
       AZİZ ÜSTEL       1.00      1.00      1.00        10
       CAN ATAKLI       1.00      1.00      1.00        10
      DENİZ GÖKÇE       1.00      1.00      1.00        10
      EMRE KONGAR       1.00      0.20      0.33  
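
The per-fold results can be summarized as a mean and standard deviation. The sketch below uses the final-epoch validation accuracies printed in the fold logs above; the variable names are illustrative, and the sample standard deviation is one reasonable choice among several.

```python
import statistics

# Epoch 4 validation accuracies copied from the five fold logs above
fold_val_acc = [0.8067, 0.8000, 0.8300, 0.8133, 0.8333]

mean_acc = statistics.mean(fold_val_acc)
std_acc = statistics.stdev(fold_val_acc)  # sample standard deviation (n - 1)

print(f"Validation accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
```

In every fold the validation loss is still falling at epoch 4, so additional epochs might improve these numbers; that would need to be checked experimentally.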

# Step 5: Combining All Folds and Final Report

Finally, we combine the predictions from all 5 folds and generate an overall 
classification report. Because every article falls in exactly one validation fold, 
the combined report covers all 1500 articles, each predicted by a model that was not trained on it. 
We also save the final report to a `.txt` file.


In [5]:
# Step 5 - Combining All Folds and Final Report

print("\n=== 5-Fold Cross-Validation Results (All Folds Combined) ===")
final_report = classification_report(
    all_folds_true,
    all_folds_pred,
    target_names=label_encoder.classes_,
    zero_division=0
)
print(final_report)

# Save the final report to a text file
with open("bert_5fold_cv_report.txt", "w", encoding="utf-8") as f:
    f.write(final_report)

print("5-fold CV report has been saved to 'bert_5fold_cv_report.txt'.")



=== 5-Fold Cross-Validation Results (All Folds Combined) ===
                   precision    recall  f1-score   support

      AHMET ÇAKAR       0.74      0.92      0.82        50
       ALİ SİRMEN       0.60      0.60      0.60        50
 ATAOL BEHRAMOĞLU       0.71      0.48      0.57        50
    ATİLLA DORSAY       0.78      0.92      0.84        50
      AYKAN SEVER       0.91      0.82      0.86        50
       AZİZ ÜSTEL       0.98      1.00      0.99        50
       CAN ATAKLI       0.98      1.00      0.99        50
      DENİZ GÖKÇE       1.00      1.00      1.00        50
      EMRE KONGAR       0.58      0.60      0.59        50
  GÖZDE BEDELOĞLU       0.78      0.42      0.55        50
      HASAN PULUR       0.76      1.00      0.86        50
 HİKMET ÇETİNKAYA       0.92      0.98      0.95        50
MEHMET ALİ BİRAND       0.98      1.00      0.99        50
  MEHMET DEMİRKOL       1.00      0.96      0.98        50
     MELTEM GÜRLE       0.66      0.82      0.73    

In [8]:
# Summary table: per-class precision, recall, and F-score, plus macro averages,
# computed from all_folds_true and all_folds_pred

from sklearn.metrics import precision_recall_fscore_support
import numpy as np

# We assume 'num_labels' == 30 and 'label_encoder.classes_' maps 0..29 to your 30 authors.
# First, compute per-class precision, recall, and f1
precisions, recalls, f1_scores, _ = precision_recall_fscore_support(
    all_folds_true, 
    all_folds_pred, 
    labels=range(num_labels),  # ensure we cover all classes in correct order
    zero_division=0            # avoids division-by-zero warnings
)

# Macro-average: unweighted mean of the per-class scores (micro-averaging is an alternative)
avg_precision = np.mean(precisions)
avg_recall = np.mean(recalls)
avg_f1_score = np.mean(f1_scores)

# Print the table header
print("\nPerformance Results")
header_row = "\t".join([f"Class {i+1}" for i in range(num_labels)]) + "\tAverage"
print(" " * 11 + header_row)

# Print precision row
precision_row = "\t".join(f"{p:.2f}" for p in precisions) + f"\t{avg_precision:.2f}"
print(f"Precision  {precision_row}")

# Print recall row
recall_row = "\t".join(f"{r:.2f}" for r in recalls) + f"\t{avg_recall:.2f}"
print(f"Recall     {recall_row}")

# Print F-score row
fscore_row = "\t".join(f"{f:.2f}" for f in f1_scores) + f"\t{avg_f1_score:.2f}"
print(f"F-Score    {fscore_row}")



Performance Results
           Class 1	Class 2	Class 3	Class 4	Class 5	Class 6	Class 7	Class 8	Class 9	Class 10	Class 11	Class 12	Class 13	Class 14	Class 15	Class 16	Class 17	Class 18	Class 19	Class 20	Class 21	Class 22	Class 23	Class 24	Class 25	Class 26	Class 27	Class 28	Class 29	Class 30	Average
Precision  0.74	0.60	0.71	0.78	0.91	0.98	0.98	1.00	0.58	0.78	0.76	0.92	0.98	1.00	0.66	0.80	0.88	0.68	0.83	0.86	0.70	0.97	0.96	0.94	0.92	0.75	0.90	0.57	0.71	0.72	0.82
Recall     0.92	0.60	0.48	0.92	0.82	1.00	1.00	1.00	0.60	0.42	1.00	0.98	1.00	0.96	0.82	0.86	0.98	0.64	0.70	0.60	0.76	0.74	1.00	0.92	0.92	0.92	0.92	0.50	0.94	0.58	0.82
F-Score    0.82	0.60	0.57	0.84	0.86	0.99	0.99	1.00	0.59	0.55	0.86	0.95	0.99	0.98	0.73	0.83	0.92	0.66	0.76	0.71	0.73	0.84	0.98	0.93	0.92	0.83	0.91	0.53	0.81	0.64	0.81
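
Because the classes are balanced, the macro-averaged recall in the table equals the overall accuracy. The sketch below reconstructs this from the Recall row above, assuming a support of 50 articles for each of the 30 authors (as the support column in the combined report suggests); the variable names are illustrative.

```python
# Per-class recall copied from the "Recall" row of the table above
recalls = [
    0.92, 0.60, 0.48, 0.92, 0.82, 1.00, 1.00, 1.00, 0.60, 0.42,
    1.00, 0.98, 1.00, 0.96, 0.82, 0.86, 0.98, 0.64, 0.70, 0.60,
    0.76, 0.74, 1.00, 0.92, 0.92, 0.92, 0.92, 0.50, 0.94, 0.58,
]
SUPPORT_PER_CLASS = 50  # assumed: 50 articles per author

# Recall x support gives the number of correctly classified articles per author
correct_per_class = [round(r * SUPPORT_PER_CLASS) for r in recalls]
total_correct = sum(correct_per_class)
total_articles = SUPPORT_PER_CLASS * len(recalls)

accuracy = total_correct / total_articles
macro_recall = sum(recalls) / len(recalls)

print(f"Correct: {total_correct}/{total_articles}")
print(f"Accuracy: {accuracy:.4f}, macro recall: {macro_recall:.4f}")
```

The same identity does not hold for precision or F-score, which is why the averaged F-score (0.81) need not match the averaged precision and recall (both 0.82).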
