# This notebook tries to run an Optuna hyperparameter search through the Trainer API from the transformers library. The guide I followed was written for PyTorch, so the code did not work for my TensorFlow models

## Step 0: Kaggle Setup

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


## Step 1: pip install dependencies and set global variables

In [2]:
# %pip install pandas numpy tensorflow transformers scikit-learn matplotlib

# # #python.exe -m pip install --upgrade pip



### Set a global random seed so that any model we create can be reproduced

In [3]:
import random
import tensorflow as tf
import numpy as np
import os
from transformers import set_seed



np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)
set_seed(42)

os.environ['TF_DETERMINISTIC_OPS'] = '1'

2024-06-05 14:07:22.666947: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-05 14:07:22.667039: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-05 14:07:22.829458: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Step 2: Exploring and Understanding the data

### Loading Data

In [4]:
import pandas as pd

# Load the training data
train_data = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_data = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

# Display the first few rows of the training data
print(train_data.head())

   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  


### Data Cleaning

Deciding whether to delete hashtags was a hard choice. After inspecting the data, I saw that hashtags are very common and are a crucial part of tweets, so I decided to keep them and handle them later when I preprocess the data.

I might go and remove hashtags in the future, to see how it affects the performance. So, if you see that I decided to remove the hashtags, then now you know why!

In [5]:
import re # Regular Expression

def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+', '', text)     # Remove mentions
    text = re.sub(r'\d+', '', text)      # Remove numbers
    text = re.sub(r'[^\w\s#]', '', text)  # Remove punctuation except hashtags
    text = text.lower()                  # Convert to lowercase
    return text

train_data['clean_text'] = train_data['text'].apply(clean_text) # Apply the data cleaning process to training data
test_data['clean_text'] = test_data['text'].apply(clean_text) # Apply the data cleaning process to testing data

# Display the first few rows of the cleaned data
print(train_data[['text', 'clean_text']].head())


                                                text  \
0  Our Deeds are the Reason of this #earthquake M...   
1             Forest fire near La Ronge Sask. Canada   
2  All residents asked to 'shelter in place' are ...   
3  13,000 people receive #wildfires evacuation or...   
4  Just got sent this photo from Ruby #Alaska as ...   

                                          clean_text  
0  our deeds are the reason of this #earthquake m...  
1              forest fire near la ronge sask canada  
2  all residents asked to shelter in place are be...  
3   people receive #wildfires evacuation orders i...  
4  just got sent this photo from ruby #alaska as ...  
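If I do come back to the hashtag question, one option between keeping and deleting hashtags is to drop only the `#` symbol and keep the word. Below is a minimal sketch of that variant. The names `strip_hashtag_symbol` and `clean_text_no_hash` are made up for illustration, the extra whitespace cleanup is an addition that is not in the cell above, and I have not measured the effect on the score.

```python
import re

def strip_hashtag_symbol(text):
    """Turn '#wildfires' into 'wildfires' while keeping the word itself."""
    return re.sub(r'#(\w+)', r'\1', text)

def clean_text_no_hash(text):
    text = re.sub(r'http\S+', '', text)        # Remove URLs
    text = re.sub(r'@\w+', '', text)           # Remove mentions
    text = re.sub(r'\d+', '', text)            # Remove numbers
    text = strip_hashtag_symbol(text)          # Keep the hashtag word, drop the '#'
    text = re.sub(r'[^\w\s]', '', text)        # Remove remaining punctuation
    text = re.sub(r'\s+', ' ', text).strip()   # Collapse leftover whitespace
    return text.lower()

example = "13,000 people receive #wildfires evacuation orders http://t.co/abc @user"
print(clean_text_no_hash(example))
```

Swapping this in for `clean_text` would make it possible to compare the two variants on the validation split.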


## Step 3: Preprocessing and Tokenization

### Tokenization and Padding/Truncation from bert-base-uncased 

- The bert-base-uncased tokenizer also has a padding/truncation feature built into it so we will use that so that we don't have to manually truncate and pad!

In [6]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_texts(texts):
    return tokenizer(
        texts.tolist(),
        max_length=64,
        padding=True,
        truncation=True,
        return_tensors='tf'
    )

train_encodings = tokenize_texts(train_data['clean_text'])
test_encodings = tokenize_texts(test_data['clean_text'])



### Analyzing the length of tokens to find the optimal maximum length for sequences

Based on the token-length analysis below, the longest sequence (a sequence in this context is a single tweet) was around 40 tokens, so we round up to a power of 2, a common choice for computational performance.

So, I decided on 64!

In [7]:
# import matplotlib.pyplot as plt

# # Tokenize the clean text without padding to get the length of each tweet
# train_data['token_length'] = train_data['clean_text'].apply(lambda x: len(tokenizer.encode(x, add_special_tokens=True)))
# test_data['token_length'] = test_data['clean_text'].apply(lambda x: len(tokenizer.encode(x, add_special_tokens=True)))

# # Plot the distribution of token lengths
# plt.hist(train_data['token_length'], bins=50, alpha=0.7, label='Train')
# plt.hist(test_data['token_length'], bins=50, alpha=0.7, label='Test')
# plt.axvline(x=64, color='r', linestyle='--', label='MAX_LEN = 64')
# plt.xlabel('Token Length')
# plt.ylabel('Frequency')
# plt.legend()
# plt.show()

# # Display some statistics
# print("Train token length statistics:")
# print(train_data['token_length'].describe())

# print("\nTest token length statistics:")
# print(test_data['token_length'].describe())
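The choice of 64 can be written down as a rule: take the longest tokenized tweet and round up to the next power of two. Below is a small sketch of that rule. The helper names are mine, and the lengths in the example call are made-up stand-ins for `train_data['token_length']`.

```python
def next_power_of_two(n):
    """Return the smallest power of two that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()

def pick_max_length(token_lengths):
    """Round the longest observed token length up to a power of two."""
    return next_power_of_two(max(token_lengths))

# Hypothetical lengths; the real values would come from the tokenizer
print(pick_max_length([12, 27, 40]))  # prints 64
```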


### Prepare data for training

This includes:
- Splitting data into training and validation 
- Converting data into a format that BERT can actually train on 

In [8]:
import tensorflow as tf

train_labels = tf.convert_to_tensor(train_data['target'].values)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))

# Create a validation split 
val_size = int(0.3 * len(train_data))
val_dataset = train_dataset.take(val_size)
train_dataset = train_dataset.skip(val_size)

# Batch and shuffle the datasets
batch_size = 32

# train_dataset = train_dataset.shuffle(10000).batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
train_dataset = train_dataset.shuffle(10000) # removed batch and prefetch here since I will do that in my optuna tuning
val_dataset = val_dataset.batch(batch_size) # batch but don't prefetch since it says online that it might cause some randomness
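The split above takes the first 30% of the rows as validation data without shuffling first. If the rows happen to be ordered in some way, a shuffled split may be more representative. Below is a sketch using only the standard library. `split_indices` is a hypothetical helper, not something from TensorFlow or transformers.

```python
import random

def split_indices(n_rows, val_fraction=0.3, seed=42):
    """Shuffle row indices with a fixed seed and split them into train and validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    val_size = int(val_fraction * n_rows)
    return indices[val_size:], indices[:val_size]

train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))  # prints 7 3
```

The index lists could then select rows of `train_data` (for example with `train_data.iloc[train_idx]`) before tokenizing.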



## Step 4: Building a model!!!

### Picking a Model Architecture

I am going to pick BERT but I might play around with other pretrained models later. 

This is an example of transfer learning: I take a pretrained model (e.g. BERT) and fine-tune it on my specific data. There is no need to reinvent the wheel, especially since training a model from scratch would take a long time and would probably give poor results with a training set as small as mine. BERT was pretrained on a very large amount of text, so it already encodes a lot of general language knowledge.

By importing TFBertForSequenceClassification, I am using the BERT base model with a classification head added on top. Adding a layer to a pretrained model is a crucial part of transfer learning: the head starts with newly initialized weights, and training on my data is where the model learns about disaster tweets and how to classify them. By default, fine-tuning also updates the pretrained BERT weights, not only the head.

In [9]:
# from transformers import TFBertForSequenceClassification, BertConfig

# config = BertConfig.from_pretrained('bert-base-uncased', num_labels=2)
# model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)


### Compiling the Model

In [10]:
# model.compile(
#     optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-8),
#     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
#     metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
# )
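To make the compile call above less of a black box, here is what the loss and the prediction amount to for a single tweet. The classification head outputs two logits, softmax turns them into probabilities, and the loss is the negative log of the probability given to the true label. This is a hand-written sketch for intuition with made-up logits, not the Keras implementation.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(logits, label):
    """Loss for one example: negative log-probability of the true integer label."""
    return -math.log(softmax(logits)[label])

def predict_label(logits):
    """Predicted class is the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [-1.2, 2.3]  # hypothetical head output for one tweet
print(predict_label(logits), sparse_categorical_crossentropy(logits, label=1))
```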


## Step 5: Training the Model

### Train the model


In [11]:
# history = model.fit(
#     train_dataset,
#     epochs=3,
#     validation_data=val_dataset
# )


In [15]:
# from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
#                           Trainer, TrainingArguments)

# def optuna_hp_space(trial):
#     return {
#       "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
#       "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32, 64, 128]),
#       "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2),
#     }


# def model_init(trial):
#   # Assuming model_args is defined elsewhere with model details
#   return AutoModelForSequenceClassification.from_pretrained(
#       model_args.model_name_or_path,
#       from_tf=bool(".ckpt" in model_args.model_name_or_path),
#       config=config,
#       cache_dir=model_args.cache_dir,
#       revision=model_args.model_revision,
#       token=True if model_args.use_auth_token else None,
#   )


# # Define your training arguments (assuming training_args is set elsewhere)
# trainer = transformers.Trainer(
#     model=None,  # Set by model_init
#     args=training_args,
#     train_dataset=small_train_dataset,
#     eval_dataset=small_eval_dataset,
#     compute_metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy'), tf.keras.metrics.F1Score(num_classes=2)],
#     tokenizer=tokenizer,
#     model_init=model_init,
#     data_collator=data_collator,
# )

# # Perform hyperparameter search with Optuna
# best_trials = trainer.hyperparameter_search(
#     direction=["maximize"],
#     backend="optuna",
#     hp_space=optuna_hp_space,
#     n_trials=2,
# )

# # Access results from best_trials (if needed)


ModuleNotFoundError: No module named 'Transfomer'

In [25]:
from transformers import (TFBertForSequenceClassification, BertConfig,
                          Trainer, TrainingArguments)

from datasets import load_metric

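def optuna_hp_space(trial):
    # Keys must match TrainingArguments field names for the sampled values to take effect
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32, 64, 128]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
    }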

def model_init():
    """
    Creates a non-compiled TFBertForSequenceClassification model.

    Takes no arguments; the hyperparameters are sampled in optuna_hp_space.

    Returns:
      A TFBertForSequenceClassification model with a 2-label classification head.
    """
    config = BertConfig.from_pretrained('bert-base-uncased', num_labels=2)
    model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)
    return model


# def model_init(trial):
#     return AutoModelForSequenceClassification.from_pretrained(
#         model_args.model_name_or_path,
#         from_tf=bool(".ckpt" in model_args.model_name_or_path),
#         config=config,
#         cache_dir=model_args.cache_dir,
#         revision=model_args.model_revision,
#         token=True if model_args.use_auth_token else None,
#     )

# Improved training arguments for Optuna tuning (consider adjusting eval_steps)
training_args = TrainingArguments(
    output_dir="/kaggle/working/",
    evaluation_strategy="epoch",  # Evaluate after each epoch
    # Consider adjusting eval_steps based on training speed and desired frequency (e.g., for faster training with 10k samples)
    # eval_steps=(len(your_train_dataset) // per_device_train_batch_size),  # Evaluate after each epoch
    disable_tqdm=True,
    # Include Early Stopping for efficiency (optional)
    #early_stopping_patience=3,  # Stop training if validation F1 doesn't improve for 3 epochs
)


# Load the F1 metric
metric = load_metric("f1")

# Define the compute_metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)


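# NOTE: Trainer only supports PyTorch models (it calls model.to(device)),
# so passing a TF model here raises the AttributeError shown below
trainer = Trainer(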
    model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    #tokenizer=tokenizer,
    model_init=model_init,
    #data_collator=data_collator,
)


best_trials = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    hp_space=optuna_hp_space,
    n_trials=2,
    compute_objective=lambda metrics: metrics["eval_f1"],  # maximize validation F1 (Trainer prefixes eval metrics with "eval_")
)


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


AttributeError: 'TFBertForSequenceClassification' object has no attribute 'to'
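Since `Trainer` expects a PyTorch model, one way around the error above is to skip `Trainer` and write the objective function directly, leaving the training itself to Keras. The sketch below shows the shape of that objective with the same search space as above. To keep it runnable without extra packages, it uses a plain random search and a stand-in `RandomTrial` class that mimics only the three `trial.suggest_*` calls used here; it is not Optuna. `train_and_score` is a hypothetical callback that would compile and fit the Keras model and return a validation score.

```python
import math
import random

class RandomTrial:
    """Hypothetical stand-in for an Optuna trial; supports only the calls used below."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.params = {}

    def suggest_float(self, name, low, high, log=False):
        if log:
            # Sample uniformly in log space, then clamp against float rounding
            value = math.exp(self._rng.uniform(math.log(low), math.log(high)))
            value = min(max(value, low), high)
        else:
            value = self._rng.uniform(low, high)
        self.params[name] = value
        return value

    def suggest_categorical(self, name, choices):
        self.params[name] = self._rng.choice(choices)
        return self.params[name]

    def suggest_int(self, name, low, high):
        self.params[name] = self._rng.randint(low, high)  # inclusive on both ends
        return self.params[name]

def make_objective(train_and_score):
    """Wrap a training callback into an objective that samples hyperparameters."""
    def objective(trial):
        learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
        batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
        num_epochs = trial.suggest_int("num_epochs", 1, 3)
        return train_and_score(learning_rate, batch_size, num_epochs)
    return objective

def random_search(objective, n_trials, seed=42):
    """Run n_trials random trials and return the best (score, params), maximizing."""
    best_score, best_params = float("-inf"), None
    for i in range(n_trials):
        trial = RandomTrial(seed + i)
        score = objective(trial)
        if score > best_score:
            best_score, best_params = score, dict(trial.params)
    return best_score, best_params

# Toy callback so the sketch runs; a real one would fit the model and return validation F1
toy_score = lambda lr, bs, ep: -abs(math.log10(lr) + 5)
print(random_search(make_objective(toy_score), n_trials=5))
```

With Optuna installed, the same `objective` could instead be handed to a study: `study = optuna.create_study(direction="maximize")` followed by `study.optimize(objective, n_trials=2)`, with the best values in `study.best_trial.params`.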

In [None]:
# from transformers import TFBertForSequenceClassification, AutoTokenizer
# from transformers import TrainingArguments, Trainer
# from optuna import create_study, Trial
# import tensorflow as tf  # Import tensorflow for strategy

# # Define your objective function (to minimize loss)
# def objective(trial):
#     # Load model and tokenizer (assuming you have model_name defined)
#     model_name = "bert-base-uncased"  # Replace with your desired model
#     tokenizer = AutoTokenizer.from_pretrained(model_name)
#     model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

#     # Suggest hyperparameters from search space
#     learning_rate = trial.suggest_float("learning_rate", low=1e-5, high=5e-5)
#     num_train_epochs = trial.suggest_int("num_train_epochs", low=1, high=2)
#     per_device_train_batch_size = trial.suggest_int("per_device_train_batch_size", low=8, high=32)
#     dropout_rate = trial.suggest_float("dropout_rate", low=0.1, high=0.3)  # Dropout rate search space
#     adam_epsilon = trial.suggest_float("adam_epsilon", low=1e-8, high=1e-6)  # Adam epsilon search space

#     # Define the MirroredStrategy for multi-gpu training on Kaggle
#     strategy = tf.distribute.MirroredStrategy()

#     # Wrap training logic with strategy.scope()
#     with strategy.scope():
#         # Prepare training arguments with suggested hyperparameters
#         training_args = TrainingArguments(
#             output_dir="/kaggle/working/",
#             per_device_train_batch_size=per_device_train_batch_size,
#             learning_rate=learning_rate,
#             num_train_epochs=num_train_epochs,
#             # ... other training arguments
#         )

#         # Create batched and prefetched training dataset using the selected batch size
#         train_dataset2 = train_dataset.batch(per_device_train_batch_size).prefetch(tf.data.experimental.AUTOTUNE)

#         # Train the model using Trainer (replace with your training logic)
#         trainer = Trainer(
#             model=model,
#             args=training_args,
#             train_dataset=train_dataset,  # Replace with your training dataset
#             eval_dataset=val_dataset,  # Replace with your evaluation dataset
#             compute_metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')],  # Replace with your metric function
#         )
#         trainer.train()

#         # Return evaluation metric (e.g., validation loss) to minimize
#         return trainer.evaluate()["eval_loss"]

# # Create Optuna study
# study = create_study(direction="minimize")

# # Optimize hyperparameters over a specified number of trials
# study.optimize(objective, n_trials=2)

# # Get the best trial with its hyperparameters
# best_trial = study.best_trial
# print(f"Best learning rate: {best_trial.params['learning_rate']}")
# print(f"Best num_train_epochs: {best_trial.params['num_train_epochs']}")
# print(f"Best per_device_train_batch_size: {best_trial.params['per_device_train_batch_size']}")
# print(f"Best dropout rate: {best_trial.params['dropout_rate']}")
# print(f"Best adam_epsilon: {best_trial.params['adam_epsilon']}")

# # # Load the model with the best hyperparameters
# # best_model_name = f"best_model_{study.study_id}"  # Create unique name
# # model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
# # model.load_weights(best_model_name)  # Assuming you've saved the best model weights

# # # Prepare your Kaggle test data (preprocessing, tokenization)
# # test_data = your_kaggle_test_data  # Replace with your test data preparation logic
# # test_encodings = tokenizer(test_data, padding="max_length", truncation=True)  # Tokenize test data

# # # Make predictions on Kaggle test data
# # predictions = model.predict(test_encodings)["logits"].argmax(-1)  # Get predicted class indices

# # # Prepare your Kaggle submission format (e.g., pandas dataframe)
# # submission_df = your_submission_df  # Replace with your submission dataframe creation logic
# # submission_df["predicted_label"] = predictions

# # # Submit your predictions to the Kaggle competition
# # submission_df.to_csv("submission.csv", index=False)  # Replace with your submission logic

# # print(f"Submitted predictions using the best model with hyperparameters:")
# # for key, value in best_trial.params.items():
# #   print(f"{key}: {value}")


## Step 6: Evaluating and Submitting the Results

### Predict on the test set

In [None]:
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings)
)).batch(32)

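# NOTE: `model` must be a trained TFBertForSequenceClassification; the cells that build and fit it are commented out above
predictions = model.predict(test_dataset).logits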
predicted_labels = tf.argmax(predictions, axis=1).numpy()


### Prepare the Submission File:

In [None]:
# Create a submission DataFrame
submission = pd.DataFrame({'id': test_data['id'], 'target': predicted_labels})
submission.to_csv('submission_3_kaggle.csv', index=False)


## Step 7: Iterating and Improving

Hyperparameter Tuning:
- Experiment with different hyperparameters such as learning rate, batch size, and the number of epochs to improve the model’s performance.

Data Augmentation:
- Consider using data augmentation techniques to increase the diversity of your training data.
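One simple augmentation to try is random word dropout: make extra copies of a tweet with a few words removed. This is only a sketch of one possible technique. `random_word_dropout` is a hypothetical helper, and I have not tested whether it helps on this dataset.

```python
import random

def random_word_dropout(text, drop_prob=0.1, seed=None):
    """Return a copy of text with each word dropped independently with probability drop_prob."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    if not kept and words:
        kept = [rng.choice(words)]  # never return an empty tweet
    return ' '.join(kept)

print(random_word_dropout("forest fire near la ronge sask canada", drop_prob=0.2, seed=42))
```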

Model Ensembles:
- Combine the predictions from multiple models to improve overall performance.
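A common way to combine models is to average their predicted class probabilities and take the most likely class. Below is a minimal sketch with made-up probabilities for two hypothetical models. The function names are mine, and real inputs would be the softmax of each model's logits.

```python
def average_probabilities(model_probs):
    """Average per-class probabilities across models.

    model_probs[m][i][c] is model m's probability of class c for example i.
    """
    n_models = len(model_probs)
    averaged = []
    for i in range(len(model_probs[0])):
        n_classes = len(model_probs[0][i])
        averaged.append([
            sum(probs[i][c] for probs in model_probs) / n_models
            for c in range(n_classes)
        ])
    return averaged

def ensemble_labels(model_probs):
    """Pick the class with the highest averaged probability for each example."""
    return [max(range(len(p)), key=lambda c: p[c])
            for p in average_probabilities(model_probs)]

model_a = [[0.9, 0.1], [0.4, 0.6]]  # hypothetical outputs for two tweets
model_b = [[0.6, 0.4], [0.7, 0.3]]
print(ensemble_labels([model_a, model_b]))  # prints [0, 0]
```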