# BERT model

This notebook builds a binary text classification model using the BERT (Bidirectional Encoder Representations from Transformers) architecture and TensorFlow. The dataset is loaded from the CSV file `data/all_social_media_posts.csv`, which contains social media comments. The code performs the following steps:

1. **Imports libraries**: Essential for data manipulation (Pandas, NumPy), deep learning (TensorFlow), NLP (transformers library), and evaluation (scikit-learn and Matplotlib).

2. **Loads the data**: Reads the CSV file into a DataFrame. The code uses the `Content` column as the text and the `Eating_Disorder` column as the label for the classification task.

3. **Initializes BERT tokenizer**: Sets up the tokenizer from the pre-trained 'bert-base-uncased' model to process the text data.

4. **Data tokenization function**: The `tokenize_data` function tokenizes the text data into a format suitable for BERT, padding/truncating sequences to a maximum length of 120 tokens.

5. **Data splitting**: Splits the dataset into an initial 80% training set and a 20% test set, then tokenizes the test set.

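6. **K-Fold Cross-Validation Setup**: Sets up shuffled 5-fold cross-validation (`random_state=1`) to evaluate model performance during training. The folds are drawn from the full DataFrame, so they also contain the rows held out as the test set in step 5.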

7. **Training and Validation Loop**: Trains the model across each fold of the cross-validation, creating separate subsets of the data for training and validation purposes. Training is done in mini-batches of size 16.

8. **Model Initialization**: Initializes a new BERT model for sequence classification for each fold, adjusting for the binary classification task (num_labels=2).

9. **Optimizer and Loss Function**: Sets up the Adam optimizer (learning rate 2e-5) and a sparse categorical crossentropy loss computed from logits. This loss accepts integer class labels directly, so the two-class labels need no one-hot encoding.

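10. **Training with Early Stopping**: Trains the model for at most 3 epochs per fold with an early stopping callback that monitors the validation loss. Because the callback's patience is also 3, it cannot end training early within these 3 epochs, and every fold runs all of them.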

11. **Evaluation on Validation Data**: After training on each fold, the model's predictions are evaluated in terms of accuracy, precision, recall, and F1 score. Confusion matrices are also generated to analyze the performance in more detail.
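
Step 10 combines an epoch limit of 3 with `patience=3` in the code that follows. As a minimal sketch, the patience rule can be modelled in plain Python to see when training would stop for a given sequence of validation losses. The function name `stopping_epoch` and the loss values are hypothetical, and the logic is a simplified model of the Keras callback (no `min_delta`, no weight restoring), not its actual implementation.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which a patience rule would stop training,
    or None if it never triggers within the given epochs."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses that get worse after the first epoch
losses = [0.30, 0.35, 0.40]
print(stopping_epoch(losses, patience=3))  # None
print(stopping_epoch(losses, patience=1))  # 2
```

Even in this worst case, a patience of 3 needs three consecutive epochs without improvement after the first one, which a 3-epoch run cannot provide. A smaller patience or a larger epoch limit would be needed for the callback to have any effect.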


In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from transformers import BertTokenizer, TFBertForSequenceClassification
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load the CSV file into a DataFrame
df = pd.read_csv('data/all_social_media_posts.csv')

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to tokenize the data
def tokenize_data(texts, max_length=120):
    return tokenizer(texts, padding='max_length', truncation=True, max_length=max_length, return_tensors='tf')

# Split the data into initial training and test sets (80% train and 20% test)
initial_train_texts, initial_test_texts, initial_train_labels, initial_test_labels = train_test_split(
    df['Content'].tolist(),
    df['Eating_Disorder'].tolist(),
    test_size=0.2,
    shuffle=True
)

# Tokenize the initial test set
initial_test_encodings = tokenize_data(initial_test_texts)

# Setting up KFold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

# Lists to hold scores
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = []

# Cross-validation loop
for fold, (train_ids, validation_ids) in enumerate(kfold.split(df)):
    print(f"Fold {fold+1}")
    
    # Prepare the datasets for training and validation
    train_encodings = tokenize_data(df.iloc[train_ids]['Content'].tolist())
    validation_encodings = tokenize_data(df.iloc[validation_ids]['Content'].tolist())
    
    train_dataset = tf.data.Dataset.from_tensor_slices((
        dict(input_ids=train_encodings['input_ids'], attention_mask=train_encodings['attention_mask']),
        df.iloc[train_ids]['Eating_Disorder'].tolist()
    )).shuffle(10000).batch(16)
    
    validation_dataset = tf.data.Dataset.from_tensor_slices((
        dict(input_ids=validation_encodings['input_ids'], attention_mask=validation_encodings['attention_mask']),
        df.iloc[validation_ids]['Eating_Disorder'].tolist()
    )).batch(16)
    
    # Initialize the model for each fold
    model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
    
    # Set up the legacy TF-Keras Adam optimizer for M1/M2 Macs
    optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=2e-5, epsilon=1e-8)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
    
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
    
    # Early stopping to prevent overfitting
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
    
    # Train the model on the current fold's data
    model.fit(train_dataset, validation_data=validation_dataset, epochs=3, callbacks=[early_stopping])

    # Save the model after each fold if necessary
    # fold_model_save_path = f'model_save_path_fold_{fold+1}'
    # model.save_pretrained(fold_model_save_path)
    
    # Predict on the validation set and calculate metrics
    logits = model.predict(validation_dataset).logits
    predictions = np.argmax(logits, axis=1)
    
    # Collect true labels for the current fold
    true_labels = df.iloc[validation_ids]['Eating_Disorder'].tolist()
    
    # Calculate and append metrics for the current fold
    accuracy_scores.append(accuracy_score(true_labels, predictions))
    precision_scores.append(precision_score(true_labels, predictions))
    recall_scores.append(recall_score(true_labels, predictions))
    f1_scores.append(f1_score(true_labels, predictions))
    confusion_matrices.append(confusion_matrix(true_labels, predictions))

# Output the scores and confusion matrices
print(f'Accuracy scores for each fold: {accuracy_scores}')
print(f'Precision scores for each fold: {precision_scores}')
print(f'Recall scores for each fold: {recall_scores}')
print(f'F1 scores for each fold: {f1_scores}')
print(f'Confusion matrices for each fold:\n{confusion_matrices}')

# Train final model on the entire dataset
# (note: df still contains the rows held out as the initial test set)
final_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
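# Use a fresh optimizer so no Adam state carries over from the last fold
final_optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=2e-5, epsilon=1e-8)
final_model.compile(optimizer=final_optimizer, loss=loss, metrics=metrics)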

full_encodings = tokenize_data(df['Content'].tolist())
full_dataset = tf.data.Dataset.from_tensor_slices((
    dict(input_ids=full_encodings['input_ids'], attention_mask=full_encodings['attention_mask']),
    df['Eating_Disorder'].tolist()
)).shuffle(10000).batch(16)

final_model.fit(full_dataset, epochs=3)

# Evaluate the final model on the initial test set
initial_test_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': initial_test_encodings['input_ids'], 'attention_mask': initial_test_encodings['attention_mask']},
    initial_test_labels
)).batch(16)

logits = final_model.predict(initial_test_dataset).logits
predictions = np.argmax(logits, axis=1)

# Calculate metrics for the final model
accuracy = accuracy_score(initial_test_labels, predictions)
precision = precision_score(initial_test_labels, predictions)
recall = recall_score(initial_test_labels, predictions)
f1 = f1_score(initial_test_labels, predictions)
confusion_mat = confusion_matrix(initial_test_labels, predictions)

# Output the scores and confusion matrix for the final model
print(f'Final model accuracy: {accuracy}')
print(f'Final model precision: {precision}')
print(f'Final model recall: {recall}')
print(f'Final model F1 score: {f1}')
print(f'Final model confusion matrix:\n{confusion_mat}')

Fold 1


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Cause: for/else statement not yet supported
Epoch 2/3
Epoch 3/3
Fold 2


Epoch 1/3
Epoch 2/3
Epoch 3/3
Fold 3


Epoch 1/3
Epoch 2/3
Epoch 3/3
Fold 4


Epoch 1/3
Epoch 2/3
Epoch 3/3
Fold 5


Epoch 1/3
Epoch 2/3
Epoch 3/3
Accuracy scores for each fold: [0.9833729216152018, 0.9596199524940617, 0.9738717339667459, 0.9809976247030879, 0.9642857142857143]
Precision scores for each fold: [0.9752475247524752, 0.958139534883721, 0.966183574879227, 0.9812206572769953, 0.9575471698113207]
Recall scores for each fold: [0.9899497487437185, 0.9626168224299065, 0.9803921568627451, 0.9812206572769953, 0.9712918660287081]
F1 scores for each fold: [0.9825436408977556, 0.9603729603729604, 0.9732360097323601, 0.9812206572769953, 0.9643705463182898]
Confusion matrices for each fold:
[array([[217,   5],
       [  2, 197]]), array([[198,   9],
       [  8, 206]]), array([[210,   7],
       [  4, 200]]), array([[204,   4],
       [  4, 209]]), array([[202,   9],
       [  6, 203]])]


Epoch 1/3
Epoch 2/3
Epoch 3/3
Final model accuracy: 1.0
Final model precision: 1.0
Final model recall: 1.0
Final model F1 score: 1.0
Final model confusion matrix:
[[218   0]
 [  0 203]]
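
The reported scores follow directly from the confusion matrices. As a quick check, this sketch recomputes accuracy, precision, recall, and F1 from the fold 1 matrix printed above. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and class 1 as the positive class; the helper name `binary_metrics` is an illustrative choice.

```python
def binary_metrics(cm):
    """Compute metrics from a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]] (rows: true class, columns: predicted class)."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1


# Fold 1 confusion matrix from the output above
fold1 = [[217, 5], [2, 197]]
accuracy, precision, recall, f1 = binary_metrics(fold1)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9834 precision=0.9752 recall=0.9899 f1=0.9825
```

These values agree with the first entries of the per-fold score lists.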


In [2]:
# Save the final model with the name "final_bert"
final_model.save_pretrained('final_bert')

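The final model reaches 1.0 for accuracy, precision, recall, and F1 score on the test set. This is not an independent estimate: the final model was trained on the entire DataFrame, which includes the 20% of rows that make up the test set, so every test example had already been seen during training. The cross-validation scores, with accuracy between about 0.96 and 0.98 per fold, are computed on rows excluded from each fold's training and give a more realistic picture.

The high scores may also stem from the problem being too simple. The synthetic training data contains little ambiguity, so the patterns separating the two classes might have been very easy to learn.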
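
Whether test rows also appear in the training data can be checked with plain row indices instead of the real data. The sketch below assumes 2104 rows, a number inferred from the validation-fold sizes in the confusion matrices above, and a simple shuffled 80/20 split. The helper `split_indices` is hypothetical and only stands in for `train_test_split`. It compares two choices of training rows for the final model: all rows, as `full_dataset` does above, and only the rows left after the split.

```python
import math
import random


def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices and split them into (train, test) index sets.
    Hypothetical stand-in for sklearn's train_test_split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)
    return set(indices[n_test:]), set(indices[:n_test])


n_rows = 2104  # assumption: inferred from the fold sizes in the output
train_rows, test_rows = split_indices(n_rows, test_fraction=0.2, seed=0)

# Option A: train on every row, as full_dataset does in the notebook
all_rows = set(range(n_rows))
leaked_a = len(test_rows & all_rows)

# Option B: train only on the rows left after the split
leaked_b = len(test_rows & train_rows)

print(f"test rows: {len(test_rows)}")
print(f"test rows seen in training, option A: {leaked_a}")
print(f"test rows seen in training, option B: {leaked_b}")
```

With option A all 421 test rows are part of the training data, while option B keeps the two sets disjoint. Restricting both the cross-validation and the final training to the training split would be one way to obtain a test score that reflects unseen data.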