
#### Climate Sentiment Classification with BERT

In this notebook, we fine-tune a pre-trained BERT model to classify climate-related text as Risk, Neutral, or Opportunity. We will:
1. Load and prepare the dataset.
2. Tokenize the text data.
3. Split the data into training and evaluation sets.
4. Train the model.
5. Test the model on a set of predefined prompts.

Let's begin!


**Install** the Hugging Face packages for transformer models and datasets.

In [None]:
# Install the necessary libraries for transformers and datasets from Hugging Face.
# 'transformers' provides access to pre-trained models like BERT.
# 'datasets' provides tools to easily load and process datasets.
%pip -q install transformers datasets


#### Step 1: Define Functions

We will define the following functions to organize our code:
*   `load_and_prepare_data`: Handles loading the dataset and getting it ready for tokenization.
*   `tokenize_dataset`: Converts the text data into token IDs and attention masks, the input format that BERT expects (tokenization).
*   `select_subsets`: Helps us split the dataset into smaller portions for training and evaluation.
*   `initialize_trainer`: Sets up the training environment, including the model, training parameters, and datasets.
*   `test_model_performance`: Prints the model's predicted label and confidence score for each example sentence.


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments, pipeline
from datasets import load_dataset

# Load dataset and prepare for training
def load_and_prepare_data(file_path):
    # Load the dataset from the specified file path.
    # We specify 'csv' as the format and provide a dictionary mapping 'train' to the file path.
    dataset = load_dataset('csv', data_files={'train': file_path})
    # Tokenize the loaded dataset using the tokenize_dataset function.
    # The map function applies the tokenize_dataset function to each example in the dataset.
    # batched=True processes examples in batches, which is more efficient.
    processed_dataset = dataset.map(tokenize_dataset, batched=True)
    # Return the tokenized dataset.
    return processed_dataset

# Tokenize text data for BERT
def tokenize_dataset(examples):
    # Tokenize the 'text' field of the examples.
    # padding="max_length": Pads sequences to the maximum length specified by max_length.
    # truncation=True: Truncates sequences that are longer than max_length.
    # max_length=512: The maximum length of the tokenized sequences.
    # Uses the global `tokenizer` created in the main program cell (Step 2).
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=512)

# Select training and evaluation subsets
def select_subsets(processed_dataset, subset_size_ratio=0.8):
    # Determine the size of the subset to use, taking the minimum of 1200 and the total number of training examples.
    subset_size = min(1200, len(processed_dataset['train']))
    # Calculate the size of the training set based on the subset size and ratio.
    train_size = int(subset_size * subset_size_ratio)

    print(f"Training set size: {train_size}")
    print(f"Evaluation set size: {subset_size - train_size}")

    # Shuffle once with a fixed seed (seed=42) so the split is reproducible.
    shuffled = processed_dataset['train'].shuffle(seed=42)
    # Take the first train_size examples for training.
    train_subset = shuffled.select(range(train_size))
    # Take the next (subset_size - train_size) examples for evaluation.
    eval_subset = shuffled.select(range(train_size, subset_size))
    # Return the small training and evaluation datasets.
    return train_subset, eval_subset

# Initialize and return the Trainer
def initialize_trainer(train_dataset, eval_dataset):
    # Define the training arguments.
    training_args = TrainingArguments(
        # Directory to save model checkpoints and outputs.
        output_dir="./results",
        # The learning rate for the optimizer.
        learning_rate=2e-5,
        # Batch size for training on each device.
        per_device_train_batch_size=8,
        # Batch size for evaluation on each device.
        per_device_eval_batch_size=8,
        # Number of training epochs (kept small for demonstration; more epochs may improve results).
        num_train_epochs=2,
        # The weight decay to apply (L2 regularization).
        weight_decay=0.01,
        # Evaluate the model at the end of each epoch.
        eval_strategy="epoch",
        # Disable logging to services like Weights & Biases.
        report_to="none"
    )
    # Initialize the Trainer with the model, arguments, and datasets.
    model_trainer = Trainer(
        model=model, # The global BERT model created in the main program cell (Step 2).
        args=training_args, # The training arguments.
        train_dataset=train_dataset, # The training dataset.
        eval_dataset=eval_dataset, # The evaluation dataset.
    )
    # Return the initialized trainer.
    return model_trainer

# Evaluate model performance on test prompts
def test_model_performance(sentiment_pipeline, prompts):
    # Define a mapping from the model's predicted labels to human-readable labels.
    label_map = {'LABEL_0': 'Risk', 'LABEL_1': 'Neutral', 'LABEL_2': 'Opportunity'}
    # Iterate through each prompt in the list.
    for prompt in prompts:
        # Get the sentiment prediction for the current prompt using the pipeline.
        prediction = sentiment_pipeline(prompt)[0]
        # Get the translated label from the label_map, defaulting to the original label if not found.
        translated_label = label_map.get(prediction['label'], prediction['label'])
        # Print the prompt, the predicted label, and the confidence score.
        print(f"Prompt: {prompt}\nPrediction: Label: {translated_label}, Score: {prediction['score']}\n")
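
The size arithmetic and the shuffle-then-slice pattern in `select_subsets` can be checked without loading any data. The sketch below is a plain-Python stand-in, not the notebook's implementation: `split_indices` is a hypothetical helper, and the standard library's `random` module stands in for the dataset's `shuffle(seed=42)` call, so the actual ordering will differ from what the `datasets` library produces.

```python
import random

def split_indices(num_examples, subset_cap=1200, train_ratio=0.8, seed=42):
    """Hypothetical helper mirroring the size logic of select_subsets."""
    # Cap the working subset at subset_cap examples.
    subset_size = min(subset_cap, num_examples)
    # The first train_ratio share of the subset is used for training.
    train_size = int(subset_size * train_ratio)
    # Shuffle row indices once with a fixed seed, then slice twice.
    order = list(range(num_examples))
    random.Random(seed).shuffle(order)
    train_idx = order[:train_size]
    eval_idx = order[train_size:subset_size]
    return train_idx, eval_idx

train_idx, eval_idx = split_indices(5000)
print(len(train_idx), len(eval_idx))       # sizes of the two subsets
print(set(train_idx) & set(eval_idx))      # empty set: the slices do not overlap
```

Because both slices come from the same shuffled order, no example lands in both the training and evaluation subsets.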


#### Step 2: Main Program

Here, we initialize the tokenizer and model, load the dataset, and proceed with training the model. We will evaluate the model's performance before and after training on a set of predefined climate-related prompts.


1.  **Initialize the Tokenizer and Model:** Load the pre-trained BERT tokenizer and the BERT model configured for our classification task.
2.  **Load and Prepare Data:** Use our helper function to load the dataset and apply tokenization.
3.  **Select Subsets:** Create smaller training and evaluation datasets from the loaded data.
4.  **Initialize Trainer:** Set up the training process using the `Trainer` class and defined training arguments.
5.  **Test Before Training:** Evaluate the model's performance on sample prompts *before* any training occurs to see the initial, untrained results.
6.  **Train the Model:** Run the training process using the prepared data and trainer.
7.  **Test After Training:** Evaluate the model's performance on the same sample prompts *after* training to observe the impact of the training.
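
One way to read the "before" and "after" scores: loading `bert-base-uncased` with `num_labels=3` adds a newly initialized classification layer, so before training the three class scores are likely to sit close to one another, near 1/3. The sketch below illustrates this with a plain-Python softmax; the logit values are made up for illustration and are not outputs of the model.

```python
import math

def softmax(logits):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: tiny values, as an untrained head might produce.
untrained = softmax([0.02, -0.01, 0.03])
# Hypothetical logits: well-separated values, as a trained head might produce.
trained = softmax([-1.5, 0.2, 3.1])

print([round(p, 3) for p in untrained])
print([round(p, 3) for p in trained])
```

The pipeline reports the largest of these probabilities as `score`, so a low score before training mostly reflects the untrained classification layer.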


In [None]:
# Initialize the BERT tokenizer with the 'bert-base-uncased' model.
# This tokenizer is used to preprocess text data for the BERT model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Initialize the BERT model for sequence classification with 'bert-base-uncased'.
# num_labels=3 specifies that the model should predict one of three classes (for sentiment).
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Check if the script is being run directly (not imported as a module).
if __name__ == "__main__":
    # Define the path to the dataset file.
    dataset_path = 'https://jerrycuomo.github.io/Think_Artificial_Intelligence/datasets/climatebert-climate-sentiment.csv'

    # Load and prepare the dataset using the defined function.
    tokenized_datasets = load_and_prepare_data(dataset_path)

    # Select smaller training and evaluation subsets from the tokenized dataset.
    small_train_dataset, small_eval_dataset = select_subsets(tokenized_datasets)

    # Initialize the Trainer with the model and datasets.
    trainer = initialize_trainer(small_train_dataset, small_eval_dataset)
    print("Before Training:")

    # Create a sentiment analysis pipeline using the initialized model and tokenizer.
    sentiment_pipeline_before = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

    # Define a list of test prompts to evaluate the model.
    test_prompts = [
        "The company has achieved a 20% reduction in water usage over the past year through improved conservation efforts.",
        "Recent audits revealed non-compliance with environmental regulations in several of our manufacturing facilities.",
        "Our new product line uses recycled materials, contributing to a circular economy and reducing waste.",
        "Emissions have increased due to expanded operations.",
        "The company is currently evaluating the environmental impact of its operations to better align with sustainability goals."
    ]

    # Test the model's performance on the test prompts before training.
    test_model_performance(sentiment_pipeline_before, test_prompts)

    # Train the model using the initialized trainer.
    trainer.train()

    print("After Training:")

    # Create a new sentiment analysis pipeline with the fine-tuned model held by the trainer.
    sentiment_pipeline_after = pipeline("sentiment-analysis", model=trainer.model, tokenizer=tokenizer)

    # Test the model's performance on the test prompts after training.
    test_model_performance(sentiment_pipeline_after, test_prompts)
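
Five prompts give only a qualitative impression. As configured above, the per-epoch evaluation reports the evaluation loss but no accuracy, because no metric function is supplied. A minimal sketch of one, assuming the evaluation set has a `label` column with class ids 0 to 2, is shown below. `compute_metrics` is not part of the original notebook, and the toy logits are made up.

```python
def compute_metrics(eval_pred):
    """Return accuracy from (logits, labels), the pair the Trainer passes in."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    # Count how many predictions match the reference labels.
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy check with made-up logits for four examples and three classes.
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.1, 0.9, 0.0], [1.2, 0.4, 0.3]]
toy_labels = [0, 2, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))
```

To use it, one would pass `compute_metrics=compute_metrics` when constructing the `Trainer` in `initialize_trainer`; the accuracy should then appear alongside the evaluation loss after each epoch.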


#### Conclusion

In this notebook, we walked through the process of training a BERT model for sentiment analysis on climate-related data. We saw how to:
1. Load and prepare the dataset.
2. Tokenize the data for BERT.
3. Train the model using the `Trainer` class.
4. Evaluate the model's performance before and after training.

Great work!
