# Introduction
This Jupyter notebook is part of a bachelor thesis that explores the capabilities of specialized chatbots, particularly those built on BERT. It preprocesses a set of CSV files containing performance metrics from several machine learning models, including BERT-based chatbots, and consolidates them into a single table so that their effectiveness and efficiency can be compared.

## Import required libraries
In this section, we import all the necessary libraries that will be used throughout the notebook.

In [79]:
# !pip install pandas
import pandas as pd
import glob
import ast

# Data Loading and Preprocessing

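Here, we define the path to the folder containing the CSV files and read each file into a pandas DataFrame. During preprocessing, we add two columns: the learning rate, which is extracted from the filename, and the batch size, which is assigned row by row as 16, 32, and 64 (this assumes each file contains exactly three rows, one per batch size).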

In [None]:
# Define the path to the folder containing the CSV files
path = 'files_transformer'  # Update this path as needed

# Read all CSV files in the folder
all_files = glob.glob(f"{path}/*.csv")

In [None]:
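# Collect one DataFrame per file
data_frames = []

# Loop through each file to read and preprocess the data
for filename in all_files:
    # Read the CSV file into a DataFrame
    temp_df = pd.read_csv(filename)

    # Extract the learning rate from the filename (the token after 'LR ');
    # normalizing separators keeps this working for Windows and POSIX paths
    base_name = filename.replace('\\', '/').split('/')[-1]
    learning_rate = base_name.split('LR ')[1].split(' ')[0]

    # Batch sizes are assigned row by row, which assumes each file
    # holds exactly three rows in the order 16, 32, 64
    batch_size = ['16', '32', '64']

    # Add new columns for learning rate and batch size
    temp_df['learning_rate'] = learning_rate
    temp_df['batch_size'] = batch_size

    # Append the DataFrame to the list
    data_frames.append(temp_df)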
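
The filename parsing in the loop can be isolated into a small helper, which makes it easier to check against real filenames before processing a whole folder. The following is a minimal sketch, assuming the base name contains the token `LR ` followed by the learning rate. The helper name `extract_learning_rate` and the example filenames are hypothetical and do not come from the thesis data.

```python
import re


def extract_learning_rate(filename: str) -> str:
    """Return the token that follows 'LR ' in the file's base name."""
    # Normalize Windows separators so the helper behaves the same on any OS
    base_name = filename.replace('\\', '/').split('/')[-1]

    # Drop a trailing '.csv' so it is not captured as part of the value
    if base_name.lower().endswith('.csv'):
        base_name = base_name[:-4]

    # Capture the run of non-whitespace characters after 'LR '
    match = re.search(r'LR (\S+)', base_name)
    if match is None:
        raise ValueError(f"No learning rate found in {base_name!r}")
    return match.group(1)


# Hypothetical filenames, used only to illustrate the assumed naming scheme
windows_style = extract_learning_rate('files_transformer\\BERT LR 2e-05 results.csv')
posix_style = extract_learning_rate('files_transformer/BERT LR 0.001.csv')
```

Raising a `ValueError` for a filename without the `LR ` token surfaces misnamed files early instead of failing with an `IndexError` in the middle of the loop.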

In [None]:
# Concatenate all DataFrames into a single DataFrame
final_df = pd.concat(data_frames, ignore_index=True)

In [None]:
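# Parse the list-valued cells (read from the CSV as strings), then explode and filter
list_columns = ['Epochs', 'Training Loss', 'Validation Loss', 'Training Accuracy', 'Validation Accuracy', 'Test Loss (every 10 epochs)', 'Test Accuracy (every 10 epochs)']
final_df[list_columns] = final_df[list_columns].apply(lambda col: col.map(ast.literal_eval))
final_df = final_df.explode(list_columns, ignore_index=True)
final_df = final_df[(final_df['Epochs'] % 5 == 0) & (final_df['Epochs'] < 30)]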
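
To make the explode-and-filter step concrete, here is a minimal sketch on a toy DataFrame. It assumes that each cell stores a Python-style list literal as a string, which is how list values come back from `pd.read_csv`, and that the lists within a row have equal length. The values are invented for illustration, and multi-column `explode` requires pandas 1.3 or later.

```python
import ast

import pandas as pd

# Toy frame mimicking a single CSV row; the numbers are made up
toy_df = pd.DataFrame({
    'Epochs': ['[0, 3, 5, 10, 30]'],
    'Training Loss': ['[0.9, 0.7, 0.5, 0.3, 0.1]'],
})

list_columns = ['Epochs', 'Training Loss']

# Turn the string cells into real Python lists
for column in list_columns:
    toy_df[column] = toy_df[column].apply(ast.literal_eval)

# One output row per list element, keeping the columns aligned
toy_df = toy_df.explode(list_columns, ignore_index=True)
toy_df['Epochs'] = toy_df['Epochs'].astype(int)

# Keep every fifth epoch below 30
filtered = toy_df[(toy_df['Epochs'] % 5 == 0) & (toy_df['Epochs'] < 30)]
```

Here `filtered` keeps the rows for epochs 0, 5, and 10: epoch 3 is dropped because it is not a multiple of 5, and epoch 30 is dropped by the strict upper bound.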

# Column Renaming and Saving
Finally, we rename the columns to short, descriptive snake_case names, reorder them, and save the transformed data to a new CSV file.

In [None]:
# Rename columns
column_rename_dict = {
    'Epochs': 'epoch',
    'Training Accuracy': 'train_accuracy',
    'Training Loss': 'train_loss',
    'Validation Accuracy': 'val_accuracy',
    'Validation Loss': 'val_loss',
    'Test Accuracy (every 10 epochs)': 'test_accuracy'
}
# Reassign instead of renaming in place, since final_df is a filtered selection
final_df = final_df.rename(columns=column_rename_dict)

In [None]:
# Reorder columns
final_df = final_df[['batch_size', 'epoch', 'learning_rate', 'train_accuracy', 'train_loss', 'val_accuracy', 'val_loss', 'test_accuracy']]

In [91]:
# Save the transformed data to a new CSV file
final_df.to_csv('transformer_data.csv', index=False)
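
A quick way to confirm that the save step worked is to read the file back and compare the column order. The sketch below does this with a one-row stand-in DataFrame written to a temporary directory. The row values are invented, and the names `toy_df` and `expected_columns` are illustrative only.

```python
import os
import tempfile

import pandas as pd

expected_columns = ['batch_size', 'epoch', 'learning_rate', 'train_accuracy',
                    'train_loss', 'val_accuracy', 'val_loss', 'test_accuracy']

# Stand-in for final_df with a single row of made-up values
toy_df = pd.DataFrame([['16', 0, '2e-05', 0.5, 0.9, 0.4, 1.0, 0.45]],
                      columns=expected_columns)

# Write to a temporary folder so the real output file is left untouched
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'transformer_data.csv')
    toy_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

columns_match = list(reloaded.columns) == expected_columns
```

Note that `pd.read_csv` infers data types on load, so values stored as strings, such as `learning_rate` and `batch_size`, come back as numbers.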