
#### Climate Sentiment Classification with BERT

In this notebook, we fine-tune a pretrained BERT model to classify the sentiment of climate-related text. We will:
1. Load and prepare the dataset.
2. Tokenize the text data.
3. Split the data into training and evaluation sets.
4. Train the model.
5. Test the model on a set of predefined prompts.

Let's begin!


**Install** the Hugging Face packages for transformer models and datasets.

In [None]:
%pip install transformers datasets


#### Step 1: Define Functions

We will define the following functions to organize our code:
1. `load_and_prepare_data(file_path)`: Loads and tokenizes the dataset.
2. `tokenize_dataset(examples)`: Tokenizes text data.
3. `select_subsets(tokenized_datasets, subset_size_ratio)`: Shuffles the data, keeps at most 1,000 examples, and splits them into training and evaluation subsets.
4. `initialize_trainer(train_dataset, eval_dataset)`: Initializes the Trainer for BERT.
5. `test_model_performance(sentiment_pipeline, prompts)`: Tests the model on given prompts and prints predictions.
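
To make the split in `select_subsets` concrete, the sketch below reproduces only its size arithmetic in plain Python. The helper name `split_sizes` and its `cap` parameter are hypothetical and exist only for this illustration; the notebook itself hard-codes the cap of 1,000 examples.

```python
def split_sizes(num_rows, subset_size_ratio=0.8, cap=1000):
    """Return (train_size, eval_size) using the same arithmetic as select_subsets.

    `cap` is a hypothetical parameter mirroring the hard-coded 1000 in the notebook.
    """
    subset_size = min(cap, num_rows)                    # never use more than `cap` rows
    train_size = int(subset_size * subset_size_ratio)   # rounded down
    eval_size = subset_size - train_size                # the remainder is held out
    return train_size, eval_size


# Datasets larger than the cap all yield the same subset sizes.
for n in (250, 1000, 5000):
    print(n, split_sizes(n))
```

With the default ratio of 0.8, any dataset of 1,000 rows or more ends up with 800 training and 200 evaluation examples, which keeps the demonstration fast.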


In [None]:

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments, pipeline
from datasets import load_dataset

# Load dataset and prepare for training
def load_and_prepare_data(file_path):
    dataset = load_dataset('csv', data_files={'train': file_path})
    tokenized_datasets = dataset.map(tokenize_dataset, batched=True)
    return tokenized_datasets

# Tokenize text data for BERT (uses the global `tokenizer` created in Step 2)
def tokenize_dataset(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=512)

# Select training and evaluation subsets (at most 1,000 examples in total)
def select_subsets(tokenized_datasets, subset_size_ratio=0.8):
    subset_size = min(1000, len(tokenized_datasets['train']))
    train_size = int(subset_size * subset_size_ratio)
    # Shuffle once with a fixed seed so the two subsets are reproducible and disjoint
    shuffled = tokenized_datasets['train'].shuffle(seed=42)
    small_train_dataset = shuffled.select(range(train_size))
    small_eval_dataset = shuffled.select(range(train_size, subset_size))
    return small_train_dataset, small_eval_dataset

# Initialize and return the Trainer (uses the global `model` created in Step 2)
def initialize_trainer(train_dataset, eval_dataset):
    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        # only one epoch for demonstration purposes
        num_train_epochs=1,
        weight_decay=0.01,
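        # evaluate at the end of each epoch; newer transformers releases
        # rename this argument to `eval_strategy`
        evaluation_strategy="epoch",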
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    return trainer

# Evaluate model performance on test prompts
def test_model_performance(sentiment_pipeline, prompts):
    for prompt in prompts:
        print(f"Prompt: {prompt}\nPrediction: {sentiment_pipeline(prompt)}\n")



#### Step 2: Main Program

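Here, we initialize the tokenizer and model, load the dataset, and train the model. We evaluate the model on a set of predefined climate-related prompts both before and after training. Because the three-class classification head is newly initialized, the predictions before training are essentially arbitrary and serve only as a baseline for comparison.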
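
A text-classification pipeline returns a list of dictionaries, each with a `label` and a `score`, and a model loaded without label names reports generic labels such as `LABEL_0`. The sketch below shows one way to make that output easier to read. The helper `readable_prediction` is hypothetical, the example output is hand-written rather than produced by the model, and the label names are an assumption about how the CSV encodes its classes, so check them against the data before relying on them.

```python
# ASSUMPTION: this label order is illustrative; verify it against the dataset.
ID2LABEL = {0: "risk", 1: "neutral", 2: "opportunity"}


def readable_prediction(pipeline_output, id2label=ID2LABEL):
    """Convert output like [{'label': 'LABEL_2', 'score': 0.87}] to (name, score)."""
    top = pipeline_output[0]                            # pipeline returns one dict per input
    label_id = int(top["label"].rsplit("_", 1)[-1])     # 'LABEL_2' -> 2
    return id2label[label_id], round(top["score"], 3)


# Hand-written stand-in for a pipeline result, not real model output.
example_output = [{"label": "LABEL_2", "score": 0.8712}]
print(readable_prediction(example_output))
```

Inside `test_model_performance`, the same helper could wrap `sentiment_pipeline(prompt)` so the before and after predictions are easier to compare side by side.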


In [None]:

# Initialize tokenizer and model for BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Main flow
if __name__ == "__main__":
    dataset_path = 'https://jerrycuomo.github.io/Think_Artificial_Intelligence/datasets/climatebert-climate-sentiment.csv'
    tokenized_datasets = load_and_prepare_data(dataset_path)
    small_train_dataset, small_eval_dataset = select_subsets(tokenized_datasets)

    trainer = initialize_trainer(small_train_dataset, small_eval_dataset)
    print("Before Training:")
    sentiment_pipeline_before = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

    test_prompts = [
        "The company has achieved a 20% reduction in water usage over the past year through improved conservation efforts.",
        "Recent audits revealed non-compliance with environmental regulations in several of our manufacturing facilities.",
        "Our new product line uses recycled materials, contributing to a circular economy and reducing waste.",
        "Emissions have increased due to expanded operations.",
        "The company is currently evaluating the environmental impact of its operations to better align with sustainability goals."
    ]

    test_model_performance(sentiment_pipeline_before, test_prompts)

    # Train the model
    trainer.train()

    print("After Training:")
    sentiment_pipeline_after = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
    test_model_performance(sentiment_pipeline_after, test_prompts)



#### Conclusion

In this notebook, we walked through the process of training a BERT model for sentiment analysis on climate-related data. We saw how to:
1. Load and prepare the dataset.
2. Tokenize the data for BERT.
3. Train the model using the `Trainer` class.
4. Evaluate the model's performance before and after training.
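
As configured in this notebook, the `Trainer` reports only the evaluation loss at the end of each epoch. One possible extension is an accuracy metric passed through the `compute_metrics` argument of `Trainer`. The function below is a minimal sketch in plain Python; the name `compute_accuracy` and the toy logits are made up for illustration and are not part of the original notebook.

```python
def compute_accuracy(eval_pred):
    """Compute accuracy from a (logits, labels) pair.

    `logits` holds one row of class scores per example; the predicted class
    is the index of the largest score in each row.
    """
    logits, labels = eval_pred
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Toy values for illustration only, not results from the trained model.
toy_logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 1.5], [0.0, 1.0, 0.5], [1.0, 0.9, 0.8]]
toy_labels = [0, 2, 2, 0]
print(compute_accuracy((toy_logits, toy_labels)))
```

Passing `compute_metrics=compute_accuracy` when constructing the `Trainer` in `initialize_trainer` would add this value to each evaluation report.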

Great work!
