
<a href="https://colab.research.google.com/github/percw/Corporate_sustainability/blob/main/Companies_sustainability.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align = "center">
    <h1>Project Title</h1>
</div>

**To-do (Rita)**: write an introduction of our project and add here the review of the existing solutions and research on greenwashing claims.

## Table of Contents

- [0 - Imports and Dependencies](#imports-and-dependencies)
- [1 - Data Loading](#data-loading)
- [2 - Exploratory Data Analysis](#explatory-data-analysis)
- [3 - Training, testing and evaluating ML models](#training-testing-models)
- [4 - Classifiers](#classifiers)

## 0 - Imports and Dependencies<a id="imports-and-dependencies"></a>

In [None]:
requirements_url = 'https://raw.githubusercontent.com/percw/Corporate_sustainability/main/requirements.txt'

### Dependencies

To run this notebook, please make sure you have the required dependencies installed.
The `requirements.txt` file lists all the dependencies required for this notebook.

1. If you have forked or downloaded the repository, you can install the dependencies by running the following command from the repository root:
	```bash
	pip install -r requirements.txt
	```
	or, using conda:
	```bash
	conda install --file requirements.txt
	```
	Splitting imports into three cells to avoid errors.
2. If you're using Google Colab, you can run the following cell to install the required dependencies.

```bash
!pip install -r {requirements_url}
```

Then run the following command to download the English spaCy pipeline:
```bash
python -m spacy download en_core_web_sm
```
If `python` does not point to Python 3 on your system, you might need to run the following command instead:
```bash
python3 -m spacy download en_core_web_sm
```
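
Before loading the pipeline, it can help to confirm that the model package is importable. The snippet below is a minimal sketch, assuming the model is installed as a regular Python package named `en_core_web_sm`; `model_installed` is a hypothetical helper, not part of spaCy.

```python
import importlib.util


def model_installed(package_name: str = "en_core_web_sm") -> bool:
    """Return True if the given top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if not model_installed():
    # Hint only; the download command is the one given above
    print("en_core_web_sm not found; run: python -m spacy download en_core_web_sm")
```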

### Recommendations 
Some of the following code is computationally heavy. We therefore recommend running it in Google Colab with the following runtime adjustments:
1. Go to the menu and select "Runtime" > "Change runtime type."
2. Choose "GPU" as the hardware accelerator and click "Save."

In [None]:
# General imports
import json
import random
import warnings
from collections import Counter

# Importing libraries for data analysis and visualization
import matplotlib.pyplot as plt
import nlpaug.augmenter.word as naw
import numpy as np
import pandas as pd
import plotly.express as px
import requests
import seaborn as sns
import spacy
import torch
from datasets import load_dataset
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import (CountVectorizer,
                                             HashingVectorizer,
                                             TfidfVectorizer)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, precision_recall_fscore_support,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)
# Importing libraries for data preprocessing and modeling
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from textblob import TextBlob
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizerFast
from wordcloud import WordCloud

# Ignore warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ConvergenceWarning)

## 1 - Data Loading<a id="data-loading"></a>

The dataset used to conduct this study is available [at this link](https://huggingface.co/datasets/climatebert/environmental_claims/tree/main). 

The *Environmental Claims* dataset is a collection of textual data expressing commitments, initiatives, or actions made by various entities, such as companies or organizations. Some of these claims concern initiatives towards addressing environmental issues. As such, the dataset includes information on diverse environmental topics, such as renewable energy, carbon footprint reduction, waste management, water conservation, sustainable practices, biodiversity preservation, or eco-friendly products. 

In [None]:
# Load data
dataset = load_dataset('climatebert/environmental_claims')

In [None]:
# Display the dataset to show its format
dataset

In [None]:
# Creating a dataframe for each split
env_claim_train = pd.DataFrame(dataset['train'])
env_claim_test = pd.DataFrame(dataset['test'])
env_claim_val = pd.DataFrame(dataset['validation'])

In [None]:

# Define color codes
class Colors:
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    ENDC = '\033[0m'

# Displaying the characteristics of the data
print(Colors.OKGREEN + "Training data" + Colors.ENDC)
print(f"The shape of the training data is: {env_claim_train.shape}")
display(env_claim_train.head())
display(env_claim_train.label.value_counts())

print(Colors.OKBLUE + "\nTest data" + Colors.ENDC)
print(f"The shape of the test data is: {env_claim_test.shape}")
display(env_claim_test.head())
display(env_claim_test.label.value_counts())

print(Colors.OKCYAN + "\nValidation data" + Colors.ENDC)
print(f"The shape of the validation data is: {env_claim_val.shape}")
display(env_claim_val.head())
display(env_claim_val.label.value_counts())

The *Environmental Claims* dataset is already divided into three subsets:
* **Train set**: it contains 2400 observations, each of them corresponding to a claim. A minority (i.e., 542) of these statements are environmental claims, while the remaining 1858 are non-environmental statements.

* **Test set**: it contains 300 observations, 64 of which are environmental claims.

* **Validation set**: it contains 300 observations, 64 of which are environmental claims, as for the test set.

While the labelled variable will be studied in more detail in the coming sections, we can already note that the dataset is unbalanced towards non-environmental claims.
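
To make the imbalance concrete, the share of environmental claims in each split can be computed directly from the counts quoted above. This is a small sketch in plain Python; `positive_share` is a hypothetical helper and the dictionary simply restates the figures from the list.

```python
# Class counts per split, taken from the figures quoted in the list above
splits = {
    "train": {"total": 2400, "environmental": 542},
    "test": {"total": 300, "environmental": 64},
    "validation": {"total": 300, "environmental": 64},
}


def positive_share(total: int, positives: int) -> float:
    """Fraction of observations that belong to the positive (environmental) class."""
    return positives / total


for name, counts in splits.items():
    share = positive_share(counts["total"], counts["environmental"])
    print(f"{name}: {share:.1%} environmental claims")
```

In every split, roughly one claim in five is labelled as environmental.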

## 2 - Exploratory Data Analysis<a id="explatory-data-analysis"></a>

In [None]:
# Concatenate sets 
claim_dataset = pd.concat([env_claim_train, env_claim_test, env_claim_val], ignore_index = True)
print("Number of claims in the dataset:", claim_dataset.shape[0])    # observations
print("Number of variables in the dataset:", claim_dataset.shape[1]) # variables 

In [None]:
# NaNs 
print("Number of NaNs:")
display(claim_dataset.isna().sum())

# Duplicates
print("Number of duplicates:")
display(claim_dataset.duplicated().sum())

# Variable types
print("Variable types:")
claim_dataset.dtypes

### Word Count by Claim

In [None]:
# Creating a function that shows the average number of words in each claim and a graphical representation by class

def word_count_graph(claim_dataset):
	# Word count
	claim_dataset["word count"] = claim_dataset["text"].apply(lambda x: len(x.split()))
	print("The average number of words in each claim is equal to:", round(claim_dataset["word count"].mean(),0), "words.")

	# Graphical representation by class
	class_1_counts = claim_dataset[claim_dataset["label"] == 1]["word count"]
	class_2_counts = claim_dataset[claim_dataset["label"] == 0]["word count"]

	plt.hist(class_1_counts, bins = range(11, 39), alpha = 0.5, label = "Environmental Claim", color = "#4958B5")
	plt.hist(class_2_counts, bins = range(11, 39), alpha = 0.5, label = "Non-environmental Claim", color = "#8DB8B7")
	plt.xlabel("Word count")
	plt.ylabel("Frequency")
	plt.title("Number of Words in Each Claim by Class")
	plt.legend(loc = "upper right")
	plt.show()

# Applying the function
word_count_graph(claim_dataset)

### Claim Cleaning

In [None]:
# Load English language model
sp = spacy.load('en_core_web_sm')

# Apply the Spacy sp function to each row of the 'text' column
claim_dataset["spacy object"] = claim_dataset["text"].apply(sp)

# Filter stopwords, punctuation and spaces
def filter_tokens(token):
    return not token.is_stop and not token.is_punct and not token.is_space

# Remove stopwords, punctuation, and whitespace from each Spacy object
claim_dataset["filtered tokens"] = claim_dataset["spacy object"].apply(lambda doc: [token.text for token in doc if filter_tokens(token)])

print("This is the first sentence before filtering:", claim_dataset.iloc[0,0])
print("\nThis is the first sentence after filtering:", claim_dataset.iloc[0,4])

# Calculating new average value of words per claim
number_words = [len(x) for x in claim_dataset["filtered tokens"]]
print("\nThe average number of words per claim is now:", round(np.mean(number_words),0))
display(claim_dataset.head())

### Environmental Claims versus Non-Environmental Claims

In [None]:
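# Restrict the aggregation to the numeric "word count" column;
# the other columns hold text, spaCy objects and token lists

# Mean 
print("Average number of words per claim by class:")
display(claim_dataset.groupby("label")["word count"].mean().round())

# Median
print("\nMedian number of words per claim by class:")
display(claim_dataset.groupby("label")["word count"].median())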

In [None]:
# WordCloud hue by class label

# Join the strings in each list into a single string
claim_dataset["joined tokens"] = claim_dataset["filtered tokens"].apply(lambda tokens: ' '.join(tokens))

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(24, 6))

# For Environmental Claims
text = " ".join(word for word in claim_dataset[claim_dataset["label"]==1]["joined tokens"])
wordcloud = WordCloud( background_color = "white", colormap = "Greens").generate(text)

ax1.imshow(wordcloud, interpolation = "bilinear")
ax1.set(title = "WordCloud of Environmental Claims")
ax1.axis("off")

# For Non-Environmental Claims
text = " ".join(word for word in claim_dataset[claim_dataset["label"]==0]["joined tokens"])
wordcloud = WordCloud(background_color = "white", colormap = "Reds").generate(text)

ax2.imshow(wordcloud, interpolation='bilinear')
ax2.set(title = "WordCloud of Non-Environmental Claims")
ax2.axis("off")
plt.show()

In [None]:
# Create a function for the top most frequent words
def top_words(claim_dataset, label, n=10):
    top = Counter([item for sublist in claim_dataset["joined tokens"]   # sublist is a list of words in each claim
                  [claim_dataset["label"] == label] for item in str(sublist).split()])
    temp = pd.DataFrame(top.most_common(n))     # Create a dataframe with the top n words
    temp.columns = ["Common Words", "Count"]    # Naming the columns
    temp.index = np.arange(1, len(temp) + 1)    # Setting first index to 1
    return temp


# Top 10 most frequent words for environmental claims
print("Top 10 most frequent words for environmental claims:")
env_top_words = top_words(claim_dataset, 1)
display(env_top_words.style.background_gradient(cmap="Greens"))

# Top 10 most frequent words for non-environmental claims
print("\nTop 10 most frequent words for non-environmental claims:")
nonenv_top_words = top_words(claim_dataset, 0)
display(nonenv_top_words.style.background_gradient(cmap="Reds"))


### 2.2 Populating with more environmental claims

Since class 1 is heavily underrepresented in our data, we used ChatGPT 3 and 4 to generate additional environmental claims. These can be found in the `env_claims_gpt3.json` and `env_claims_gpt4.json` files in the `./data` folder.

We will append these statements to the ones we already have, and later investigate whether these can improve our performance.
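
The loading cells that follow read each file as a JSON object with a top-level `claims` list whose items carry a `claim` string. The sketch below illustrates that shape; the payload is a made-up placeholder, not content from the real files, and `extract_claims` is a hypothetical helper.

```python
import json

# Hypothetical payload mirroring the structure the loading code expects
raw = '{"claims": [{"claim": "Example claim A"}, {"claim": "Example claim B"}]}'


def extract_claims(payload):
    """Collect the 'claim' string of every item in the top-level 'claims' list."""
    return [item["claim"] for item in payload["claims"]]


claims = extract_claims(json.loads(raw))
print(claims)
```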

In [None]:

# Load the env_claims_gpt4.json file from remote
claims_gpt4 = []
gpt_4_url = 'https://raw.githubusercontent.com/percw/Corporate_sustainability/main/data/env_claims_gpt4.json'
gpt_4 = requests.get(gpt_4_url).json()

for i in range(len(gpt_4['claims'])):
	claims_gpt4.append(gpt_4['claims'][i]['claim'])



In [None]:
# Load the env_claims_gpt3.json file from remote
claims_gpt3 = []
gpt_3_url = 'https://raw.githubusercontent.com/percw/Corporate_sustainability/main/data/env_claims_gpt3.json'
gpt_3 = requests.get(gpt_3_url).json()

for i in range(len(gpt_3['claims'])):
	claims_gpt3.append(gpt_3['claims'][i]['claim'])


In [None]:
# Number of claims
print("Number of claims in the dataset (GPT4):", len(claims_gpt4))    # observations
print("Number of claims in the dataset (GPT3):", len(claims_gpt3))    # observations
print("Total number of claims in the dataset:", len(claims_gpt4)+len(claims_gpt3))    # observations

In [None]:
# Converting the list to a dataframe

claims_gpt4_df = pd.DataFrame(claims_gpt4, columns = ['text'])
claims_gpt3_df = pd.DataFrame(claims_gpt3, columns = ['text'])

# Adding the label column
claims_gpt4_df['label'] = 1
claims_gpt3_df['label'] = 1

# Concatenating the two dataframes
claims_gpt_df = pd.concat([claims_gpt4_df, claims_gpt3_df], ignore_index=True)
display(claims_gpt_df)

If we concatenate the generated data with the original dataset, we can see how the class distribution and imbalance have changed.

In [None]:
# Creating a new df with the original claims and the generated claims
claim_pop = pd.concat([claim_dataset, claims_gpt_df], ignore_index=True)

# Displaying how the generated claims have changed the distribution of the dataset
print("Distribution of the dataset before adding the generated claims:")
display(claim_dataset['label'].value_counts(normalize=True).round(2))

print("\nDistribution of the dataset after adding the generated claims:")
display(claim_pop['label'].value_counts(normalize=True).round(2))

word_count_graph(claim_pop)


### Studying Energy Claims using N-Grams

In [None]:
# TODO: Still need to finish this 

In [None]:
# Subsample of claims with word "energy" inside
energy_df = claim_dataset[claim_dataset["joined tokens"].str.contains("energy")]
print("In the original dataset, there are", len(energy_df), "claims containing the word 'energy'.")

# Subsample of claims with word "energy" inside and label == 1
energy_df_1 = energy_df[energy_df["label"] == 1]

# Subsample of claims with word "energy" inside and label == 0
energy_df_0 = energy_df[energy_df["label"] == 0]

In [None]:
# Function calculating most frequent N-Grams given corpus, n-grams

def top_n_ngram(energy_corpus, ngram = 3):
    vec = CountVectorizer(ngram_range = (ngram,ngram)).fit(energy_corpus)
    words_bag = vec.transform(energy_corpus)  # Count of each n-gram in each claim
    sum_words = words_bag.sum(axis = 0)       # Total count of each n-gram over the whole corpus
    words_freq = [(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq,key = lambda x:x[1],reverse = True)
    return words_freq

# Call function on both datasets 
pop_words_1 = top_n_ngram(energy_df_1["joined tokens"], 3)  
pop_words_0 = top_n_ngram(energy_df_0["joined tokens"], 3)  

# Select top 20 N-Grams having 'energy' in text 
pop_energy_1 = [t for t in pop_words_1 if "energy" in t[0]]
pop_energy_1 = pop_energy_1[:20]
pop_energy_0 = [t for t in pop_words_0 if "energy" in t[0]]
pop_energy_0 = pop_energy_0[:20]

# Graphical representation

# Extract x and y values from each list
x1, y1 = zip(*pop_energy_1)
x2, y2 = zip(*pop_energy_0)

# Set up the figure and axes
fig, (ax1, ax2) = plt.subplots(nrows = 2, figsize = (20,9))
fig.subplots_adjust(hspace = 1.2)

# Create the first bar plot on ax1
ax1.bar(x1, y1, color = "#4958B5")
ax1.set_ylabel("Frequency")
ax1.set_xticks(range(len(x1)))
ax1.set_xticklabels(x1, rotation = 90)
ax1.set_title("Top 20 Energy 3-Grams in Environmental Claims")

# Create the second bar plot on ax2
ax2.bar(x2, y2, color = "#8DB8B7")
ax2.set_ylabel("Frequency")
ax2.set_xticks(range(len(x2)))
ax2.set_xticklabels(x2, rotation = 90)
ax2.set_title("Top 20 Energy 3-Grams in Non-Environmental Claims")

# Add x-axis label
fig.add_subplot(111, frameon = False)
plt.tick_params(labelcolor = "none", top = False, bottom = False, left = False, right = False)

# Show the plot
plt.show()

Analyzing the 3-gram lists, it becomes apparent that the terms in Class 1 are more focused on energy generation, efficiency, and reduction of consumption with a clear emphasis on renewable and clean technology. Key phrases like 'improve energy efficiency', 'renewable energy projects', and 'reduce energy consumption' suggest that Class 1 is associated with environmentally proactive actions or strategies.

On the other hand, Class 0 appears to be more concerned with the management and infrastructure of energy, including the use of renewable sources, but with notable mentions of 'energy management systems', 'incineration energy recovery', and 'energy storage capacity'. This class seems to focus more on the operational aspects and physical assets related to energy.
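
The counting behind these lists can be sketched without scikit-learn. The example below is an illustration only: it splits on whitespace instead of using `CountVectorizer`'s tokenization, and the three toy sentences are invented, not drawn from the dataset.

```python
from collections import Counter


def ngrams(tokens, n=3):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(corpus, n=3, keyword="energy", k=5):
    """Count n-grams over the corpus and keep the k most frequent ones containing keyword."""
    counts = Counter()
    for text in corpus:
        counts.update(ngrams(text.split(), n))
    return [(gram, c) for gram, c in counts.most_common() if keyword in gram][:k]


# Invented toy corpus, for illustration only
toy = [
    "improve energy efficiency buildings",
    "plan improve energy efficiency",
    "energy storage capacity grows",
]
print(top_ngrams(toy))
```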

### Declaring the train and test datasets for X and y

In [None]:
X_train, y_train = env_claim_train['text'], env_claim_train['label']
X_test, y_test = env_claim_test['text'], env_claim_test['label']
X_val, y_val = env_claim_val['text'], env_claim_val['label']

### Exploring Labelled Data and Defining Base Rate

TODO: Include gpt generated claims and show graph for both dataframes and label categories

In [None]:
# Train Set
print(Colors.OKGREEN + "Train set per class:" + Colors.ENDC)
display(y_train.value_counts())


# Test Set
print(Colors.OKBLUE + "\nTest set per class:" + Colors.ENDC)
display(y_test.value_counts())

# Validation Set
print(Colors.OKCYAN + "\nValidation set per class:" + Colors.ENDC)
display(y_val.value_counts())


# Creating a function that takes in the y's and returns a plot of the distribution of the classes
def plot_class_distribution(y_list: list, title: str = ''):
    ''' 
    Function that plots the class distribution
    Input: List of y vars
    Output: Plot of distribution
    '''
    is_list = isinstance(y_list, list)
    if is_list:
        outcome_variable = pd.concat(y_list)

    else:
        outcome_variable = y_list
    outcome_variable.value_counts().plot.bar(
        color=["#4958B5", "#8DB8B7"], grid=False)
    plt.ylabel("Number of observations")
    plt.xlabel("Class")
    plt.title(f"Number of Observations per Class {title}")
    plt.xticks(rotation=0)
    plt.show()


plot_class_distribution(claim_dataset['label'])


In [None]:
# Calculating base rate

outcome_variable = claim_dataset['label']
base_rate = round(len(outcome_variable[outcome_variable == 0]) / len (outcome_variable), 4)
print(f'The base rate is: {base_rate*100:0.2f}%')
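
The base rate is the accuracy of a model that always predicts the majority class, so any classifier should beat it to be useful. The sketch below shows why accuracy alone can mislead on unbalanced data; the labels are a toy example, not the real dataset, and `majority_baseline` is a hypothetical helper.

```python
def majority_baseline(labels):
    """Accuracy and class-1 recall of a model that always predicts the majority class."""
    majority = max(set(labels), key=labels.count)
    predictions = [majority] * len(labels)
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    # Predictions made for the truly positive observations
    preds_for_positives = [p for p, y in zip(predictions, labels) if y == 1]
    recall_pos = sum(preds_for_positives) / len(preds_for_positives) if preds_for_positives else 0.0
    return accuracy, recall_pos


# Toy labels with 3 positives out of 10 (illustrative only)
labels = [0] * 7 + [1] * 3
acc, rec = majority_baseline(labels)
print(f"accuracy = {acc:.2f}, recall on class 1 = {rec:.2f}")
```

The baseline reaches a high accuracy while never detecting a single environmental claim, which is why recall and ROC-AUC are reported alongside accuracy later on.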

Now we can compare the graph above with the populated data from ChatGPT.

In [None]:
# Plotting class distribution for the merged populated and given data

plot_class_distribution(claim_pop['label'], title="(Merged Populated and Given Data)")


### Balancing Labelled Data

TODO: Comment: I'm not sure if we need to balance the test data...

In [None]:
# Get indices of "0" outcomes in the training set
train_zeros_idx = pd.Series(y_train[y_train == 0].index)

# Randomly select a balanced number of "0" outcomes
train_zeros_sample_idx = train_zeros_idx.sample(n = sum(y_train == 1), random_state = 7)

# Use the sampled indices to get the final balanced training set
X_train_bal = pd.concat([X_train[y_train == 1], X_train[train_zeros_sample_idx]])
y_train_bal = pd.concat([y_train[y_train == 1], y_train[train_zeros_sample_idx]])


# Get indices of "0" outcomes in the test set
test_zeros_idx = pd.Series(y_test[y_test == 0].index)

# Randomly select a balanced number of "0" outcomes
test_zeros_sample_idx = test_zeros_idx.sample(n = sum(y_test == 1), random_state = 7)

# Use the sampled indices to get the final balanced test set
X_test_bal = pd.concat([X_test[y_test == 1], X_test[test_zeros_sample_idx]])
y_test_bal = pd.concat([y_test[y_test == 1], y_test[test_zeros_sample_idx]])

# Get indices of "0" outcomes in the validation set
val_zeros_idx = pd.Series(y_val[y_val == 0].index)

# Randomly select a balanced number of "0" outcomes
val_zeros_sample_idx = val_zeros_idx.sample(n = sum(y_val == 1), random_state = 7)

# Use the sampled indices to get the final balanced validation set
X_val_bal = pd.concat([X_val[y_val == 1], X_val[val_zeros_sample_idx]])
y_val_bal = pd.concat([y_val[y_val == 1], y_val[val_zeros_sample_idx]])

In [None]:
print("Number of observations per class after balancing the classes:\n")

# Train Set
print(Colors.OKGREEN + "Train set per class" + Colors.ENDC)
display(y_train_bal.value_counts())
      

# Test Set 
print(Colors.OKBLUE + "\nTest set per class" + Colors.ENDC)
display(y_test_bal.value_counts())

# Validation Set
print(Colors.OKCYAN + "\nValidation set per class" + Colors.ENDC)
display(y_val_bal.value_counts())

# Sum the lengths: adding the Series themselves would align them on their index
print("\nThe new balanced dataset contains", len(y_train_bal) + len(y_test_bal) + len(y_val_bal), "observations.")
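
The balancing above is random undersampling of the majority class. The same idea can be sketched in plain Python; this is an illustration that assumes there are at least as many negatives as positives. The seed mirrors `random_state = 7` but will not reproduce the pandas sample, and `undersample` is a hypothetical helper.

```python
import random


def undersample(texts, labels, seed=7):
    """Keep all positives and an equally sized random subset of negatives."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    keep = pos_idx + rng.sample(neg_idx, k=len(pos_idx))
    return [texts[i] for i in keep], [labels[i] for i in keep]


# Toy data: 2 positives and 5 negatives (illustrative only)
toy_texts = ["p1", "p2", "n1", "n2", "n3", "n4", "n5"]
toy_labels = [1, 1, 0, 0, 0, 0, 0]
bal_texts, bal_labels = undersample(toy_texts, toy_labels)
print(len(bal_labels), "observations,", sum(bal_labels), "of them positive")
```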

## 3 - Training, testing and evaluating ML models<a id="training-testing-models"></a>

#### Strategy

We will use the following strategy to train, test and evaluate our models:
1. Define different tokenization functions
   1. Test different tokenization functions on the Logistic Regression model
2. Define different vectorization functions
   1. Test different vectorization functions on the Logistic Regression model
3. Fine-tune hyperparameters
4. Compare the performance of the different pipelines

In the Classifiers section, we will define different models and compare their performance with the Logistic Regression model.

### 3.1 Logistic Regression with different tokenization techniques

In [None]:
# Creating several helper functions

# Create a spaCy tokenizer
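# Keep the NER component enabled: spacy_tokenizer_ngrams reads tok.ent_type_,
# which is always empty when 'ner' is disabled
nlp = spacy.load('en_core_web_sm', disable=['parser'])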

def simple_spacy_tokenizer(text):
    return [tok.lemma_.lower() for tok in nlp(text) if not tok.is_stop and tok.is_alpha]

def spacy_tokenizer_ngrams(text):
    # Parse the text with spaCy's language model
    doc = nlp(text)

    # Generate n-grams from a list of token strings
    def generate_ngrams(token_list, n):
        return [' '.join(token_list[i:i+n]) for i in range(len(token_list) - n + 1)]

    tokens = []
    for tok in doc:
        # Remove stop words and non-alphabetical tokens
        if tok.is_alpha and not tok.is_stop:
            # Lemmatize and lower case the token
            tokens.append(tok.lemma_.lower().strip())

        # If the token is a named entity, add it to the list
        if tok.ent_type_:
            tokens.append(tok.text)

    # Add bi-grams to the list of tokens
    tokens.extend(generate_ngrams(tokens, 2))

    return tokens


The function `simple_spacy_tokenizer` tokenizes the input text and keeps only tokens that are alphabetic and are not stop words (commonly used words like 'is', 'the', 'and', etc., that do not carry significant meaning on their own). Each remaining token is lemmatized, which means it is converted to its base or dictionary form (for example, 'running' becomes 'run'), and then converted to lowercase. The function `spacy_tokenizer_ngrams` applies the same filtering, strips any leading or trailing white space, and additionally appends bigrams built from the filtered tokens, which gives a machine learning model some word-order context. The final output of both functions is a list of processed tokens.
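
The effect of appending bigrams after stop-word removal can be shown without spaCy. The sketch below is a simplification: it uses a tiny hand-written stop list and no lemmatization, so it only illustrates the shape of the output; `tokens_with_bigrams` is a hypothetical helper.

```python
# Tiny illustrative stop list, not spaCy's
STOP_WORDS = {"is", "the", "and", "a", "to", "of"}


def tokens_with_bigrams(text):
    """Lowercase, drop stop words and non-alphabetic tokens, then append bigrams."""
    unigrams = [w.lower() for w in text.split()
                if w.isalpha() and w.lower() not in STOP_WORDS]
    bigrams = [" ".join(unigrams[i:i + 2]) for i in range(len(unigrams) - 1)]
    return unigrams + bigrams


print(tokens_with_bigrams("The company is reducing carbon emissions"))
```

Because the bigrams are built from the filtered list, they can join words that were separated by a stop word in the original sentence ("company reducing" here).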

In [None]:

def classification_report_prettify(report):
    ''' 
    Creates a df from the precision_recall_fscore_support function
    Input: model performance metrics
    '''
    out_df = pd.DataFrame(report).transpose()
    out_df.columns = ['precision', 'recall', 'f1-score', 'support']
    avg_tot = (out_df.apply(lambda x: round(x.mean(), 2) if x.name!="support" else  round(x.sum(), 2)).to_frame().T)
    avg_tot.index = ["avg/total"]
    out_df = pd.concat([out_df, avg_tot])
    out_df['support'] = out_df['support'].apply(lambda x: int(x))
    return out_df


def plot_roc_curve(y_true, y_score):
    '''
    This function plots the ROC curve.
    '''
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)

    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--', label="Random Classifier")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

def plot_confusion_matrix(y_true, y_pred, classes=['0', '1']):
    '''
    This function prints and plots the confusion matrix.
    '''
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots()
    sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap="Blues", cbar=False)
    
    ax.set(xlabel="Pred", ylabel="True", xticklabels=classes, 
           yticklabels=classes, title="Confusion matrix")
    plt.yticks(rotation=0)
    plt.show()



In [None]:
best_model = [None, 0, None]


def evaluate_model(vectorizer, classifier, X_train, y_train, X_test, y_test, with_confusion_matrix=False):
    ''' 
    Function to evaluate the model performance
    Input: vectorizer, classifier, X_train, y_train, X_test, y_test
    Output: predicted y, y score and the performance dataframe
    '''
    # Create a pipeline with the vectorizer and classifier
    pipe = Pipeline([('vectorizer', vectorizer), ('classifier', classifier)])

    # Train the model
    pipe.fit(X_train, y_train)

    # Test the model
    y_pred = pipe.predict(X_test)
    y_score = pipe.predict_proba(X_test)[:, 1]

    # Adding a title to the dataframe based on the model and vectorizer used and tokenizer
    title = str(classifier).split(
        '(')[0] + ' with ' + str(vectorizer).split('(')[0]
    print("-" * len(title))
    print(Colors.OKGREEN + title + Colors.ENDC)
    print("-" * len(title))

    # Calculate accuracy and print the performance metrics
    performance_df = classification_report_prettify(
        precision_recall_fscore_support(y_test, y_pred))
    display(performance_df)
    # Accuracy is the number of correct predictions divided by the total number of predictions
    accuracy = accuracy_score(y_test, y_pred)
    # Print the accuracy as a percentage with 2 decimal places
    print(Colors.OKBLUE + f'Accuracy: {accuracy*100:0.2f}%' + Colors.ENDC)

    # Plot the confusion matrix
    if with_confusion_matrix:
        plot_confusion_matrix(y_test, y_pred)

    # Updating the best model
    global best_model
    if accuracy > best_model[1]:
        best_model = [pipe, accuracy, performance_df]
        print(Colors.OKGREEN + "\nBest model updated!" + Colors.ENDC)
    else:
        print(Colors.OKBLUE + "\nBest model not updated!" + Colors.ENDC)

    # Return the predicted y's and y score and the performance dataframe
    return y_pred, y_score, performance_df


Finally, we can create a vectorizer and a logistic regression model and evaluate them with the functions defined above, once for each of the two tokenizers.

In [None]:

def plot_roc_curves(names, y_scores, title=""):
    plt.figure(figsize=(8, 6))

    # Plot the ROC curve for each vectorizer/tokenizer/model combination
    for name, y_score in zip(names, y_scores):
        fpr, tpr, thresholds = roc_curve(y_test, y_score)
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label=name+(' (area = %0.3f)' % roc_auc))

    # Plot the ROC curve of a purely random classifier
    plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')

    # Plot the ROC curve of a perfect classifier
    plt.plot([0, 0, 1], [0, 1, 1], 'k:', label='Perfect Classifier')

    # Add labels and legend to the plot
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curves {title}')
    plt.legend()
    plt.show()


In [None]:
# Define the classifier
clf = LogisticRegression(solver='liblinear')
tokenizer_list = [simple_spacy_tokenizer,
                  spacy_tokenizer_ngrams]   # Tokenizer list

tokenizer_y_score = []  # List to store the y_score for each tokenizer

# Going through list of tokenizers and evaluating the model
for tokenizer in tokenizer_list:
    vectorizer = TfidfVectorizer(
        tokenizer=tokenizer, ngram_range=(1, 2), max_df=0.85, min_df=2)
    y_pred, y_score, df = evaluate_model(
        vectorizer, clf, X_train, y_train, X_test, y_test)
    tokenizer_y_score.append(y_score)    # Append y_score to the list

# Plotting the ROC curve for each tokenizer
# One name per tokenizer in tokenizer_list, in the same order
plot_roc_curves(["Simple spaCy Tokenizer", "spaCy Tokenizer with Bigrams"],
                tokenizer_y_score, 'Logistic Regression with different tokenizers')


The second model with a bit more advanced tokenizer has a slightly higher accuracy but the same ROC-AUC score. This means that the second tokenizer is slightly better at correctly classifying instances overall, but both models are almost equally good at distinguishing between the classes, as indicated by the ROC-AUC score.
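
Accuracy and ROC-AUC can diverge because accuracy depends on the decision threshold, while ROC-AUC depends only on how the scores rank positives against negatives. The sketch below illustrates this with invented scores; `roc_auc` and `accuracy` are hypothetical helpers written for this example, not the scikit-learn functions.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of correct predictions when scores >= threshold are labelled 1."""
    return sum((s >= threshold) == bool(y) for s, y in zip(scores, y_true)) / len(y_true)


# Invented scores: both models rank the observations identically
y_true = [0, 0, 1, 1]
scores_a = [0.1, 0.2, 0.3, 0.4]   # all scores fall below the 0.5 threshold
scores_b = [0.3, 0.4, 0.6, 0.7]   # positives fall above the 0.5 threshold

for name, scores in [("A", scores_a), ("B", scores_b)]:
    print(name, "AUC:", roc_auc(y_true, scores), "accuracy:", accuracy(y_true, scores))
```

Both toy models separate the classes perfectly in terms of ranking, yet only the second one is accurate at the default threshold.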

### 3.2 Logistic Regression with different vectorization techniques

In [None]:
# A wrapper for Word2Vec to allow it to be used in a scikit-learn Pipeline
class Word2VecVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, size=100):
        self.size = size
        self.model = None
        
    def fit(self, X, y=None):
        sentences = [doc.split() for doc in X]
        self.model = Word2Vec(sentences, vector_size=self.size, window=5, min_count=1, workers=4)
        return self

    def transform(self, X):
        return np.array([np.mean([self.model.wv[w] for w in doc.split() if w in self.model.wv]
                                or [np.zeros(self.size)], axis=0) for doc in X])

# CountVectorizer
count_vectorizer = CountVectorizer(tokenizer=spacy_tokenizer_ngrams)
y_pred_c_vec, y_score_c_vec, c_vec_perf_df = evaluate_model(count_vectorizer, clf, X_train, y_train, X_test, y_test)

# HashingVectorizer
hashing_vectorizer = HashingVectorizer(tokenizer=spacy_tokenizer_ngrams)
y_pred_h_vec, y_score_h_vec, h_vec_perf_df = evaluate_model(hashing_vectorizer, clf, X_train, y_train, X_test, y_test)

# Word2Vec
w2v_vectorizer = Word2VecVectorizer()
y_pred_w2v, y_score_w2v, w2v_perf_df = evaluate_model(w2v_vectorizer, clf, X_train, y_train, X_test, y_test)

#TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer_ngrams, ngram_range=(1,2), max_df=0.85, min_df=2)
y_pred_tfidf, y_score_tfidf, tfidf_perf_df = evaluate_model(tfidf_vectorizer, clf, X_train, y_train, X_test, y_test)

In [None]:
# Plotting the ROC curve for each vectorizer
plot_roc_curves(["Count Vectorizer", "Hashing Vectorizer", "Word2Vec Vectorizer", 'TF-IDF Vectorizer'],
				[y_score_c_vec, y_score_h_vec, y_score_w2v, y_score_tfidf], 'Logistic Regression with different Vectorizers')


The TF-IDF, CountVectorizer and HashingVectorizer pipelines perform about equally well. We will fine-tune the hyperparameters of the TF-IDF vectorizer and use it in the next section. The Word2Vec vectorizer performs poorly, which might be due to the small size of our dataset. We can check later whether it improves with the validation set, augmented data, and populated data from ChatGPT.
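
For intuition about what the TF-IDF vectorizer computes, here is a plain-Python sketch. It assumes raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which to our understanding matches scikit-learn's default settings; it ignores tokenization details, `min_df` and `max_df`, and the two toy documents are invented.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: weight} dict per document, L2-normalized."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Invented toy documents, for illustration only
toy = ["renewable energy project", "energy costs rose"]
vecs = tfidf(toy)
print(vecs[0])
```

In the first document, "energy" appears in both texts and therefore receives a lower weight than "renewable", which appears in only one.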

### 3.3 TF-IDF Vectorization and Logistic Regression Fine Tuning Hyperparameters

In the initial model, we used a TF-IDF Vectorizer and a Logistic Regression classifier to predict whether a statement is an environmental claim or not. However, the model's performance can often be improved by tuning the hyperparameters of the vectorizer and the classifier.

Hyperparameters are parameters that are not learned from the data. They are set prior to the commencement of the learning process. For instance, in the case of TF-IDF Vectorizer, `ngram_range`, `max_df`, and `min_df` are hyperparameters. For the Logistic Regression classifier, `C`, which is the inverse of regularization strength, is a hyperparameter. 

Hyperparameter tuning involves selecting the combination of hyperparameters for a machine learning model that performs the best on a validation set.

#### Steps for Hyperparameter Tuning

1. **Pipeline Creation**: We first created a pipeline that combines the vectorizer and the classifier. This allows us to jointly optimize the hyperparameters of both.

2. **Define Hyperparameters**: We then defined a list of hyperparameters to tune for both the vectorizer and the classifier. For the vectorizer, we decided to tune `ngram_range`, `max_df`, and `min_df`. For the classifier, we decided to tune `C`.

3. **Grid Search**: Next, we performed a grid search to find the combination of hyperparameters that results in the best cross-validated performance on the training data. Grid search works by training and evaluating a model for each combination of hyperparameters, and selecting the combination that performs best.

4. **Best Parameters**: After the grid search, we printed the combination of hyperparameters that performed the best.

5. **Evaluate the Model**: Finally, we used the best hyperparameters to create a new vectorizer and classifier, and evaluated the performance of the model on the test data.

This process allowed us to optimize the model's performance by finding the best hyperparameters.
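
As a rough sketch of what step 3 costs, the snippet below enumerates a parameter grid the way an exhaustive grid search would. The grid values mirror the Logistic Regression grid used in this notebook; the helper `expand_grid` is a hypothetical name introduced for illustration, and no model is trained here.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of a parameter grid as a list of dicts."""
    names = list(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]

param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__max_df': [0.85, 0.9, 0.95],
    'vectorizer__min_df': [1, 2, 3],
    'classifier__C': [0.1, 1, 10],
}

combinations = expand_grid(param_grid)
n_folds = 5  # number of cross-validation folds, as in cv=5
total_fits = len(combinations) * n_folds
print(len(combinations), "combinations,", total_fits, "model fits")
```

The number of fits is the product of all list lengths times the number of folds, so every additional hyperparameter multiplies the run time by the number of values it can take.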

In [None]:

# Define a new hyperparameter tuning function that doesn't include the tokenizer
def hyperparameter_tuning(X_train, y_train, classifier):
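    '''
    Function to perform hyperparameter tuning for different classifiers
    Input: X_train, y_train, classifier (LogisticRegression, RandomForestClassifier,
           KNeighborsClassifier or MLPClassifier instance)
    Output: best_params_ (dict with the best vectorizer and classifier hyperparameters)
    '''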
    # Create a pipeline with the vectorizer and classifier
    pipe = Pipeline([('vectorizer', TfidfVectorizer()), 
                     ('classifier', classifier)])
    
    # Define the hyperparameters to tune for LogisticRegression
    if isinstance(classifier, LogisticRegression):
        params = {
            'vectorizer__ngram_range': [(1, 1), (1, 2)],
            'vectorizer__max_df': [0.85, 0.9, 0.95],
            'vectorizer__min_df': [1, 2, 3],
            'classifier__C': [0.1, 1, 10],
        }
         # Perform grid search
        grid_search = GridSearchCV(pipe, param_grid=params, cv=5, verbose=1, n_jobs=-1)
        grid_search.fit(X_train, y_train)
        print("Best parameters:", grid_search.best_params_)
        return grid_search.best_params_

    # Define the hyperparameters to tune for RandomForest
    elif isinstance(classifier, RandomForestClassifier):
        params = {
            'vectorizer__ngram_range': [(1, 1), (1, 2)],
            'vectorizer__max_df': [0.85, 0.9, 0.95],
            'vectorizer__min_df': [1, 2, 3],
            'classifier__n_estimators': [100, 200, 300],
            'classifier__max_depth': [None, 10, 20, 30],
            'classifier__min_samples_split': [2, 5, 10],
            'classifier__min_samples_leaf': [1, 2, 4],
            'classifier__bootstrap': [True, False]
        }
        # Perform grid search
        random_search = GridSearchCV(pipe, param_grid=params, cv=5, verbose=1, n_jobs=-1)
        random_search.fit(X_train, y_train)
        print("Best parameters:", random_search.best_params_)
        return random_search.best_params_
    
    # Define the hyperparameters to tune for KNeighborsClassifier
    elif isinstance(classifier, KNeighborsClassifier):
        params = {
            'vectorizer__ngram_range': [(1, 1), (1, 2)],
            'vectorizer__max_df': [0.85, 0.9, 0.95],
            'vectorizer__min_df': [1, 2, 3],
            'classifier__n_neighbors': [3, 5, 7, 9],
            'classifier__weights': ['uniform', 'distance'],
            'classifier__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
            'classifier__leaf_size': [10, 20, 30, 40, 50],
            'classifier__p': [1, 2]
        }
        # Perform grid search
        random_search = GridSearchCV(pipe, param_grid=params, cv=5, verbose=1, n_jobs=-1)
        random_search.fit(X_train, y_train)
        print("Best parameters:", random_search.best_params_)
        return random_search.best_params_
    
    # Define the hyperparameters for NN
    elif isinstance(classifier, MLPClassifier):
        params = {
            'vectorizer__ngram_range': [(1, 1), (1, 2)],
            'vectorizer__max_df': [0.85, 0.9, 0.95],
            'vectorizer__min_df': [1, 2, 3],
            'classifier__hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100,)],
            'classifier__activation': ['tanh', 'relu'],
            'classifier__solver': ['sgd', 'adam'],
            'classifier__alpha': [0.0001, 0.05],
            'classifier__learning_rate': ['constant','adaptive'],
        }
        # Perform randomized search
        random_search = RandomizedSearchCV(pipe, param_distributions=params, n_iter=100, cv=5, verbose=1, n_jobs=-1)
        random_search.fit(X_train, y_train)
        print("Best parameters:", random_search.best_params_)
        return random_search.best_params_
 
# Preprocess the data using the spaCy tokenizer
X_train_tokenized = [' '.join(spacy_tokenizer_ngrams(text)) for text in X_train] 
X_test_tokenized = [' '.join(spacy_tokenizer_ngrams(text)) for text in X_test]   

# Perform hyperparameter tuning on the tokenized data
best_params = hyperparameter_tuning(X_train_tokenized, y_train, LogisticRegression(solver='liblinear'))

# Using the best parameters to create our vectorizer and classifier
vectorizer_tuned = TfidfVectorizer(ngram_range=best_params['vectorizer__ngram_range'], 
                             max_df=best_params['vectorizer__max_df'], min_df=best_params['vectorizer__min_df'])

clf_tuned = LogisticRegression(solver='liblinear', C=best_params['classifier__C'])

# Evaluate the model using previously made function
y_pred_clf_tfidf_tuned, y_score_clf_tfidf_tuned, clf_tfidf_tuned_perf_df = evaluate_model(vectorizer_tuned, clf_tuned, X_train_tokenized, y_train, X_test_tokenized, y_test)


We can see that fine-tuning the hyperparameters improves our accuracy by a little over 1%. However, we still see that the simple CountVectorizer performed slightly better. We will see if including more data changes this.

### 3.4 TF-IDF Vectorization and Logistic Regression with Balanced Data

We saw previously that our data was heavily skewed. Now let's try to run our logistic model with the balanced data created earlier.

In [None]:
# Perform hyperparameter tuning on the balanced data
best_params = hyperparameter_tuning(X_train_bal, y_train_bal, LogisticRegression(solver='liblinear'))

# Using the best parameters to create our vectorizer and classifier
vectorizer_bal_tuned = TfidfVectorizer(ngram_range=best_params['vectorizer__ngram_range'], 
                             max_df=best_params['vectorizer__max_df'], min_df=best_params['vectorizer__min_df'])

clf_bal_tuned = LogisticRegression(solver='liblinear', C=best_params['classifier__C'])

# Evaluate the model using previously made function
y_pred_clf_tfidf_bal, y_score_clf_tfidf_bal, clf_tfidf_bal_perf = evaluate_model(vectorizer_bal_tuned, clf_bal_tuned, X_train_bal, y_train_bal, X_test, y_test)


We see that the reduction in training data evidently led to a decrease in performance.

### 3.5 Text Classification using TF-IDF Vectorization and Logistic Regression with Generated Claims
This section adds populated claims generated with ChatGPT3&4 to the training data.

In [None]:
X_chatgpt_train, y_chatgpt_train = claims_gpt_df['text'], claims_gpt_df['label']

# Concatenate the train, validation and env_claims sets
X_train_val = pd.concat([X_train, X_val]) 
y_train_val = pd.concat([y_train, y_val])
X_train_pop = pd.concat([X_train, X_chatgpt_train])
y_train_pop = pd.concat([y_train, y_chatgpt_train])
X_train_val_pop = pd.concat([X_train, X_val, X_chatgpt_train])
y_train_val_pop = pd.concat([y_train, y_val, y_chatgpt_train])

# Print the number of observations per class
print(Colors.OKBLUE + "\nTrain set per class" + Colors.ENDC)
display(y_train.value_counts())

# Print the number of observations per class
print(Colors.OKBLUE + "\nTrain set with populated claims" + Colors.ENDC)
display(y_train_pop.value_counts())

# Print the number of observations per class
print(Colors.OKBLUE + "\nTrain set with validation claims" + Colors.ENDC)
display(y_train_val.value_counts())

# Print the number of observations per class
print(Colors.OKBLUE + "\nTrain set with validation and populated claims" + Colors.ENDC)
display(y_train_val_pop.value_counts())


#### 3.5.1 Trying our best LGR model with the additional validation data.

In [None]:
# Preprocess the data using the spaCy tokenizer
X_train_val_tokenized = [' '.join(spacy_tokenizer_ngrams(text)) for text in X_train_val]  

# Perform hyperparameter tuning on the tokenized data
best_params = hyperparameter_tuning(X_train_val_tokenized, y_train_val, LogisticRegression(solver='liblinear'))

# Using the best parameters to create our vectorizer and classifier
vectorizer_val_tuned = TfidfVectorizer(ngram_range=best_params['vectorizer__ngram_range'], 
                             max_df=best_params['vectorizer__max_df'], min_df=best_params['vectorizer__min_df'])

clf_val_tuned = LogisticRegression(solver='liblinear', C=best_params['classifier__C'])
# Evaluating the performance of the validated data
y_pred_clf_tfidf_val, y_score_clf_tfidf_val, clf_tfidf_val_perf_df = evaluate_model(vectorizer_val_tuned, clf_val_tuned, X_train_val_tokenized, y_train_val, X_test_tokenized, y_test)

This didn't seem to improve our model at all. The tuned benchmark model still stands.

#### 3.5.2 Trying our best LGR model with the additional populated data (from ChatGPT3&4).

In [None]:
# Preprocess the data using the spaCy tokenizer
X_train_pop_tokenized = [' '.join(spacy_tokenizer_ngrams(text)) for text in X_train_pop]  

# Evaluating the performance of the populated data
y_pred_clf_tfidf_pop, y_score_clf_tfidf_pop, clf_tfidf_pop_perf_df = evaluate_model(vectorizer_tuned, clf_tuned, X_train_pop_tokenized, y_train_pop, X_test_tokenized, y_test)

In [None]:
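# Evaluating the populated data without the spaCy pre-tokenization
# (separate variable names so the validation results above are not overwritten)
y_pred_clf_tfidf_pop_raw, y_score_clf_tfidf_pop_raw, clf_tfidf_pop_raw_perf_df = evaluate_model(vectorizer_tuned, clf_tuned, X_train_pop, y_train_pop, X_test, y_test)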

Let's see if the CountVectorizer performs better with the populated data.

In [None]:
# Preprocess the data using the spaCy tokenizer
X_train_pop_tokenized = [' '.join(spacy_tokenizer_ngrams(text)) for text in X_train_pop]

# Perform hyperparameter tuning on the tokenized populated data
# (hyperparameter_tuning searches with a TfidfVectorizer; its vectorizer settings are reused for the CountVectorizer)
best_params_pop = hyperparameter_tuning(X_train_pop_tokenized, y_train_pop, LogisticRegression(solver='liblinear'))

# Using the best parameters to create our vectorizer and classifier
count_vectorizer_tuned = CountVectorizer(ngram_range=best_params_pop['vectorizer__ngram_range'], 
                             max_df=best_params_pop['vectorizer__max_df'], min_df=best_params_pop['vectorizer__min_df'])

clf_pop_tuned = LogisticRegression(solver='liblinear', C=best_params_pop['classifier__C'])

# Evaluate the CountVectorizer model on the populated data
y_pred_clf_count_pop, y_score_clf_count_pop, clf_count_pop_perf_df = evaluate_model(count_vectorizer_tuned, clf_pop_tuned, X_train_pop_tokenized, y_train_pop, X_test_tokenized, y_test)

With the populated data, the more sophisticated TF-IDF Vectorizer performs better than the simple CountVectorizer.

The performance actually increased quite a lot! Let's try using both the populated and the validation data and see what happens.

#### 3.5.3 Trying our best LGR model with the additional populated and validation data.

In [None]:

# Evaluating the performance of the populated and validated data
y_pred_clf_tfidf_val_pop, y_score_clf_tfidf_val_pop, clf_tfidf_val_pop_per_df = evaluate_model(vectorizer_tuned, clf_tuned, X_train_val_pop, y_train_val_pop, X_test, y_test)

Interestingly, the additional training data from ChatGPT combined with the validation data gave the best performance. We used different prompts, asking for info from websites, annual reports and so on, and varied the length (word count) of the generated text. With both datasets, the model generalizes better and performs better on the test data.

#### 3.5.4 Trying our best LGR model with populated (ChatGPT3&4) and augmented data

In [None]:
def data_augmentation(sentences, labels, augment_times=1):
    '''
    Function to perform data augmentation
    Input: sentences - list of sentences
           labels - list of labels corresponding to the sentences
           augment_times - number of times to augment each sentence
    Output: aug_sentences - list of augmented sentences
            aug_labels - list of labels for the augmented sentences
    '''
    synonym_aug = naw.SynonymAug(aug_src='wordnet')

    aug_sentences = []
    aug_labels = []

    for i in range(augment_times):
        for sentence, label in zip(sentences, labels):
            # Apply synonym augmentation
            new_sentence = synonym_aug.augment(sentence)

            # Only append the new sentence if it's not NaN
            if pd.notnull(new_sentence):
                aug_sentences.append(new_sentence)
                aug_labels.append(label)

    return aug_sentences, aug_labels

# Only augmenting the label 1 sentences
X_train_aug_1, y_train_aug_1 = data_augmentation(X_train[y_train == 1].values.tolist(), y_train[y_train == 1].values.tolist())

# Convert the list of augmented sentences and labels to pandas Series
X_train_aug_1 = pd.Series(X_train_aug_1, index = range(len(X_train), len(X_train) + len(X_train_aug_1)))
y_train_aug_1 = pd.Series(y_train_aug_1, index = range(len(X_train), len(X_train) + len(y_train_aug_1)))

# Feeding the data_augmentation function with the training data
augmented_sentences, augmented_labels = data_augmentation(X_train.values.tolist(), y_train.values.tolist())

# Convert the list of augmented sentences and labels to pandas Series
augmented_sentences = pd.Series(augmented_sentences, index = range(len(X_train), len(X_train) + len(augmented_sentences)))
augmented_labels = pd.Series(augmented_labels, index = range(len(X_train), len(X_train) + len(augmented_labels)))

# Concatenate the augmented sentences and labels with the ChatGPT Populated and original training data
X_train_aug = pd.concat([X_train_pop, augmented_sentences])
y_train_aug = pd.concat([y_train_pop, augmented_labels])

# Print the number of observations per class
print(Colors.OKBLUE + "\nTrain set with augmented data" + Colors.ENDC)
display(y_train_aug.value_counts())

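# Augmented entries may be lists of strings, while the original entries are plain strings.
# Join only the lists so that plain strings are not split into single characters.
X_train_aug = [" ".join(doc) if isinstance(doc, list) else doc for doc in X_train_aug]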

# Now we can evaluate the model
y_pred_clf_tfidf_aug, y_score_clf_tfidf_aug, clf_tfidf_aug_per_df = evaluate_model(vectorizer_tuned, clf_tuned, X_train_aug, y_train_aug, X_test, y_test)


In [None]:
y_scores_data_input = [y_score_clf_tfidf_tuned, y_score_clf_tfidf_bal,
                       y_score_clf_tfidf_val, y_score_clf_tfidf_pop, y_score_clf_tfidf_aug, y_score_clf_tfidf_val_pop]
data_input_names = ['BenchMark', 'Balanced', 'Validation',
                    'ChatGPT', 'ChatGPT + Aug', 'ChatGPT + Val']

plot_roc_curves(data_input_names, y_scores_data_input,
                'TF-IDF Logistic Regression for different Data Input')


The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. By comparing different ROC curves, we can make informed decisions about which data input is best suited for our specific task, based on the trade-off between sensitivity (TPR) and specificity (1 - FPR).

Each line in the plot corresponds to a different data input. The closer a curve follows the left-hand border and then the top border of the ROC space, the more accurate the test. This means the top left corner of the plot is the 'ideal' point - a false positive rate of zero, and a true positive rate of one. Therefore, a model whose ROC curve is closer to the top left corner performs better than a model whose curve is closer to the diagonal line.

The diagonal line in the middle of the plot represents a random classifier (e.g., a coin flip), which has an equal chance of giving a correct or incorrect classification. Any good classifier should have its ROC curve above this line. If a curve is below this line, it means the classifier is worse than random chance. 

The dotted line represents a perfect classifier.

Interestingly, the ChatGPT + Augmented curve reaches 1 on the True Positive Rate axis before it reaches 0.7 on the False Positive Rate axis. In other words, the model identifies every positive instance while the false positive rate (negative instances incorrectly identified as positive) is still below 0.7.

However, on the other performance metrics (accuracy, precision, recall and the total area under the curve), the ChatGPT data performed best.
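
The construction of a ROC curve described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the `plot_roc_curves` helper used in this notebook: `roc_points` and `trapezoid_auc` are hypothetical names, and the four labels and scores are toy values.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) points obtained by sweeping the threshold over every distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy labels and scores (made up for this example)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_score)
toy_auc = trapezoid_auc(points)
print(points)
print(toy_auc)
```

A curve that hugs the top-left corner gives an area of 1.0, while a curve along the diagonal gives 0.5.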

In [None]:
# Create a function to plot the ROC AUC scores as a bar chart
def plot_roc_bars(names, y_scores, title):
    '''
    Function to plot the ROC AUC score of each model as a bar chart
    Input: names - list of names for the models or data inputs
           y_scores - list of y_scores, one per name
           title - title of the plot
    Output: Bar chart of AUC scores (uses the global y_test as ground truth)
    '''
    # Create a list of AUC scores
    auc_scores = [roc_auc_score(y_test, y_score)
                  for y_score in y_scores]

    # Create a DataFrame of the performance per name
    performance = pd.DataFrame(
        {'Data': names, 'AUC Score': auc_scores})

    # Sort the DataFrame by AUC score
    performance.sort_values(
        by='AUC Score', ascending=True, inplace=True, ignore_index=True)

    plt.figure(figsize=(8, 6))
    sns.barplot(x='Data', y='AUC Score', data=performance,
                palette='Blues')  # AUC = Area Under the Curve (ROC)
    plt.title(title)

    # Plotting the AUC scores on each bar
    for index, row in performance.iterrows():
        plt.text(index, row['AUC Score'] + 0.005,
                 round(row['AUC Score'], 3), ha='center', color='black')
    plt.ylim(0.5, 1)
    plt.xticks(fontsize=8)
    plt.show()


# Plotting the ROC curves for the different data inputs
plot_roc_bars(data_input_names, y_scores_data_input,
                "TF-IDF Logistic Regression Model Performance with various Data Input")


The bar chart offers a visual comparison of how our TF-IDF Logistic Regression model performs with the different data inputs, as measured by the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. AUC-ROC is a valuable metric for this purpose, as it provides a comprehensive view of model performance across all possible classification thresholds, unlike accuracy, precision, or recall, which depend on a specific threshold. An AUC-ROC score close to 1.0 indicates that the model has a high ability to distinguish between the classes correctly, regardless of the threshold chosen. Therefore, in this graph, the data input with the highest bar is the one that, on average, best discriminates between the classes in our dataset.
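
The threshold independence mentioned above can be checked on a toy example. The sketch below is an illustration under simple assumptions: `pairwise_auc` and `accuracy_at` are hypothetical helpers, and the labels and scores are made up. It uses the rank interpretation of AUC, which is the share of positive/negative pairs in which the positive example receives the higher score.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_at(y_true, y_score, threshold):
    """Accuracy when every score at or above the threshold is predicted as class 1."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

# Toy labels and scores (made up for this example)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, y_score)
acc_mid = accuracy_at(y_true, y_score, 0.5)
acc_high = accuracy_at(y_true, y_score, 0.9)
print(auc_value, acc_mid, acc_high)
```

The accuracy changes when the threshold moves from 0.5 to 0.9, whereas the AUC is a single number that only depends on how the scores rank the examples.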

In [None]:
# Import precision_recall_curve
from sklearn.metrics import precision_recall_curve

# Creating a function that plots the relationship between the precision and the recall of several models
def plot_precision_recall_curve(model_names, y_scores, title):
	'''
	Function to plot the precision recall curve
	Input: model_names - list of model names
		   y_scores - list of y_scores for each model
		   title - title of the plot
	Output: None
	'''
	plt.figure(figsize=(8, 6))
	for model_name, y_score in zip(model_names, y_scores):
		precision, recall, thresholds = precision_recall_curve(y_test, y_score)
		plt.plot(recall, precision, label=model_name)

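	# Plotting the no-skill baseline: its precision equals the share of positive labels in y_test
	baseline = pd.Series(y_test).mean()
	plt.plot([0, 1], [baseline, baseline], linestyle='--', label='Baseline')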

	plt.xlabel('Recall')
	plt.ylabel('Precision')
	plt.title(title)
	plt.legend()
	plt.xlim(0, 1)
	plt.ylim(0, 1)
	plt.show()

# Plotting the precision recall curve for the different data inputs
plot_precision_recall_curve(data_input_names, y_scores_data_input, 'TF-IDF Logistic Regression Precision Recall Curve for different Data Input')


## 4 - Classifiers<a id="classifiers"></a>

In [None]:
# Create a list of classifiers including KNN, MLPC, LogReg, MultinomialNB, RandomForest
classifiers = [KNeighborsClassifier(), MLPClassifier(),
               clf_tuned, MultinomialNB(), RandomForestClassifier()]

vectorizers = [tfidf_vectorizer, w2v_vectorizer,
               tfidf_vectorizer, count_vectorizer, tfidf_vectorizer]

# Create a list of classifier names containing the names of the classifiers with the most suitable vectorizer
classifier_names = ['KNN', 'Neural Net',
                    'Logistic Regression', 'Naive Bayes', 'Random Forest']

# Create dictionaries that map each classifier name to its classifier and to its most suitable vectorizer
classifiers_dict = dict(zip(classifier_names, classifiers))
vectorizers_dict = dict(zip(classifier_names, vectorizers))

# Create a list of classifier predictions
y_preds = []
y_scores = []

# Feeding all the prediction into the list of classifiers
for classifier_name in classifiers_dict:
    y_pred, y_score, df = evaluate_model(
        vectorizers_dict[classifier_name], classifiers_dict[classifier_name], X_train, y_train, X_test, y_test)

    y_preds.append(y_pred)
    y_scores.append(y_score)


In [None]:

# Plotting the ROC curves for the different classifiers
plot_roc_curves(classifier_names, y_scores, "Different Classifiers ROC Curves")


In [None]:

plot_roc_bars(classifier_names, y_scores, "Different Classifiers ROC")


The random forest seems promising to further develop and tune.

### 4.1 Fine Tuning the Random Forest

In [None]:


best_params_rf = hyperparameter_tuning(X_train, y_train, RandomForestClassifier())


In [None]:
best_params_rf

In [None]:
# Create a new random forest classifier using the best parameters
clf_tuned_rf = RandomForestClassifier(
    n_estimators=best_params_rf['classifier__n_estimators'],
    max_depth=best_params_rf['classifier__max_depth'],
    min_samples_split=best_params_rf['classifier__min_samples_split'],
    min_samples_leaf=best_params_rf['classifier__min_samples_leaf'],
    bootstrap=best_params_rf['classifier__bootstrap'],
    random_state=42)

# Using the best parameters to create our vectorizer and classifier
vectorizer_rf_tuned = TfidfVectorizer(ngram_range=best_params_rf['vectorizer__ngram_range'], 
                             max_df=best_params_rf['vectorizer__max_df'], min_df=best_params_rf['vectorizer__min_df'])

# Evaluate the performance of the tuned random forest classifier
y_pred_clf_tuned_rf, y_score_clf_tuned_rf, rf_tuned_df = evaluate_model(
    vectorizer_rf_tuned, clf_tuned_rf, X_train, y_train, X_test, y_test)


It seems that the hyperparameter search is too coarse and therefore does not improve our model at all.

### 4.2 KNN Model Tuning

In [None]:
%%time

# Creating the optimal KNN classifier
best_params_knn = hyperparameter_tuning(
    X_train_val_pop, y_train_val_pop, KNeighborsClassifier())

# Create a new KNN classifier using the best parameters
clf_tuned_knn = KNeighborsClassifier(
    n_neighbors=best_params_knn['classifier__n_neighbors'],
    weights=best_params_knn['classifier__weights'],
    algorithm=best_params_knn['classifier__algorithm'],
    leaf_size=best_params_knn['classifier__leaf_size'],
    p=best_params_knn['classifier__p'])

# Using the best parameters to create our vectorizer and classifier
vectorizer_knn_tuned = TfidfVectorizer(ngram_range=best_params_knn['vectorizer__ngram_range'],
                                       max_df=best_params_knn['vectorizer__max_df'], min_df=best_params_knn['vectorizer__min_df'])

# Evaluate the performance of the tuned KNN classifier
y_pred_clf_tuned_knn, y_score_clf_tuned_knn, knn_tuned_df = evaluate_model(
    vectorizer_knn_tuned, clf_tuned_knn,  X_train_val_pop,  y_train_val_pop, X_test, y_test)


### 4.3 NN Tuning

In [None]:
%%time

# Creating the optimal Neural Net classifier
best_params_nn = hyperparameter_tuning(X_train_pop, y_train_pop, MLPClassifier())

# Create a new Neural Net classifier using the best parameters
clf_tuned_nn = MLPClassifier(
    hidden_layer_sizes=best_params_nn['classifier__hidden_layer_sizes'],
    activation=best_params_nn['classifier__activation'],
    solver=best_params_nn['classifier__solver'],
    alpha=best_params_nn['classifier__alpha'],
    learning_rate=best_params_nn['classifier__learning_rate'],
    # max_iter is not part of the search grid, so the default value is used
    random_state=42)

# Using the best parameters to create our vectorizer and classifier
vectorizer_nn_tuned = TfidfVectorizer(ngram_range=best_params_nn['vectorizer__ngram_range'],
                                        max_df=best_params_nn['vectorizer__max_df'], min_df=best_params_nn['vectorizer__min_df'])

# Evaluate the performance of the tuned Neural Net classifier
y_pred_clf_tuned_nn, y_score_clf_tuned_nn, tuned_nn_df = evaluate_model(
    vectorizer_nn_tuned, clf_tuned_nn, X_train_pop, y_train_pop, X_test, y_test)

# Print the tuned Neural Net classifier's AUC score
print('Tuned Neural Net Classifier AUC Score: {:.2f}'.format(
    roc_auc_score(y_test, y_score_clf_tuned_nn)))



### 4.4 Voting

In [None]:
from sklearn.ensemble import VotingClassifier

# Implementing voting classifier

# Define a RandomForestClassifier
RF = RandomForestClassifier()

# Define a MLPClassifier
NN = MLPClassifier()

voting_clf = VotingClassifier(estimators=[('RF', RF), ('NN', NN), ('LogReg', clf_tuned)], voting='hard')

y_pred_vote, y_score_vote, vote_perf_df = evaluate_model(
    vectorizer_tuned, voting_clf, X_train_pop, y_train_pop, X_test, y_test)


### 4.5 Stacking

In [None]:
# Define the base models
from sklearn.ensemble import StackingClassifier


level0 = list()
level0.append(('RF', RandomForestClassifier()))
level0.append(('NN', MLPClassifier()))
level0.append(('LogReg', clf_tuned))

# Define meta learner model
level1 = clf_tuned

# Define the stacking ensemble
model_stacking = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)

# Evaluate stacked model
evaluate_model(vectorizer_tuned, model_stacking, X_train_pop, y_train_pop, X_test, y_test)

## 6 - BERT Model<a id="bert"></a>

BERT (Bidirectional Encoder Representations from Transformers) is a powerful pre-trained model developed by Google. It has been widely adopted for many Natural Language Processing (NLP) tasks due to its strong performance. BERT is trained on a large corpus of text and has therefore learned a rich understanding of language, including context, semantics, and grammar. These qualities make it a great choice for our task. Let's see how it performs on our environmental claim dataset.

Before using the heavyweight full BERT model, we can use a lightweight version instead: the **DistilBert** model. DistilBert is a smaller, faster, and lighter version of Bert that retains about 97% of Bert’s language-understanding performance while being 40% smaller and 60% faster.

Step-by-step Breakdown
1. **Import necessary libraries**: Specifically for this task we need PyTorch and Transformers.
2. **Data Loading**: We load the training, validation, and testing data using Pandas. This data is provided in CSV files. Each line in the file represents a text and its corresponding label (1 for climate change claim, 0 for no climate change claim).
3. **Tokenization**: Next, we use the BERT tokenizer to convert the text into tokens that the BERT model can understand. We pad and truncate all sentences to a single constant length.
4. **Dataset and DataLoader**: We create a PyTorch Dataset from the tokenized data. This allows us to use a DataLoader, which makes it easy to efficiently feed data in batches to the neural network.
5. **BERT Model**: We load a pre-trained BERT model for sequence classification from the Hugging Face library. This model is designed to handle tasks like ours, where we have a sequence of tokens as input and a single label as output.
6. **Optimizer**: We choose the AdamW optimizer with a learning rate of 1e-5. This optimizer is known to work well with BERT.
7. **Training Loop**: We train the BERT model for a specified number of epochs. In each epoch, we iterate over batches of data, feed them to the model, and update the model's parameters based on the computed gradients.
8. **Evaluation**: Finally, after training, we evaluate the model on the validation set. We calculate the model's predictions and compare them with the true labels.
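
Step 3 can be illustrated with a simplified, framework-free sketch. The function `pad_and_truncate` and the token ids below are hypothetical; the real Hugging Face tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores. The attention mask marks real tokens with 1 and padding with 0, so that the model can ignore the padded positions.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it with pad_id.

    Returns the fixed-length ids and an attention mask (1 = real token, 0 = padding).
    (Hypothetical helper for illustration only.)
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)
    n_padding = max_length - len(ids)
    return ids + [pad_id] * n_padding, mask + [0] * n_padding

# Toy token ids (made up for this example)
short_ids, short_mask = pad_and_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_and_truncate([1, 2, 3, 4, 5, 6], max_length=4)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every sequence ends up with the same length, the sequences of a batch can be stacked into one tensor before they are fed to the model.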

### 6.0 Light Weight BERT Model

In [None]:
# Importing the libraries needed for the BERT model
import torch
from sklearn.metrics import classification_report
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from torch.utils.data import DataLoader, Dataset

# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Loading data
train_df = env_claim_train
test_df = env_claim_test
val_df = env_claim_val

# Load DistilBert tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Tokenize the data
train_encodings = tokenizer(list(train_df['text']), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(list(val_df['text']), truncation=True, padding=True, max_length=128)

class ClimateDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create a data loader
train_dataset = ClimateDataset(train_encodings, list(train_df['label']))
val_dataset = ClimateDataset(val_encodings, list(val_df['label']))

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=True)

# Load DistilBert model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2).to(device)

# Set up optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5) 
num_epochs = 2

# Train the model
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Evaluation
model.eval()
predictions, true_labels = [], []
for batch in val_loader:
    with torch.no_grad():
        outputs = model(batch['input_ids'].to(device), attention_mask=batch['attention_mask'].to(device))
    predictions.extend(torch.argmax(outputs.logits, dim=1).tolist())
    true_labels.extend(batch['labels'].tolist())

print(classification_report(true_labels, predictions, target_names=['No Climate Claim', 'Climate Claim']))


We found that Google Colab was effective at running the code, so we now try the heavyweight BERT model with more epochs. This might take longer to run, but hopefully we can get a better performance.

### 6.1 BERT Model with 5 Epochs

In [None]:
%pip install torch

import torch
from transformers import BertTokenizerFast, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
import pandas as pd
from sklearn.metrics import classification_report
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Loading data
train_df = env_claim_train
test_df = env_claim_test
val_df = env_claim_val

# Load BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Tokenize the data
train_encodings = tokenizer(list(train_df['text']), truncation=True, padding=True)
val_encodings = tokenizer(list(val_df['text']), truncation=True, padding=True)
test_encodings = tokenizer(list(test_df['text']), truncation=True, padding=True)

# Creating the ClimateDataset class to load the data into PyTorch
class ClimateDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create a data loader
train_dataset = ClimateDataset(train_encodings, list(train_df['label']))
val_dataset = ClimateDataset(val_encodings, list(val_df['label']))
test_dataset = ClimateDataset(test_encodings, list(test_df['label']))

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
# Evaluation loaders do not need shuffling
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Load BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2).to(device)

# Set up optimizer and scheduler
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5) 
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

num_epochs = 5
best_val_loss = float('inf')

# Train the model
for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
    train_loss /= len(train_loader)
    
    val_loss = 0
    model.eval()
    for batch in val_loader:
        with torch.no_grad():
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()
    val_loss /= len(val_loader)
    
    print(f'Epoch {epoch+1}, Train loss: {train_loss}, Val loss: {val_loss}')
    
    scheduler.step(val_loss)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')

# Evaluation
model.load_state_dict(torch.load('best_model.pt'))
model.eval()
predictions, true_labels = [], []
for batch in test_loader:
    with torch.no_grad():
        outputs = model(batch['input_ids'].to(device), attention_mask=batch['attention_mask'].to(device))
    predictions.extend(torch.argmax(outputs.logits, dim=1).tolist())
    true_labels.extend(batch['labels'].tolist())

print(classification_report(true_labels, predictions, target_names=['No Climate Claim', 'Climate Claim']))
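
One detail of the loop above: `ReduceLROnPlateau` is configured with `patience=10`, but training runs for only 5 epochs. The following is a simplified, pure-Python imitation of the plateau rule (it ignores the scheduler's threshold and cooldown options, and the loss values are made up), meant only to illustrate why the learning rate would likely stay at its initial value in that setup:

```python
def simulate_plateau(val_losses, lr, factor=0.1, patience=10):
    """Imitate the plateau rule: cut lr once more than `patience` epochs pass without a new best loss."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

# Made-up validation losses for 5 epochs (not results from this notebook)
made_up_losses = [0.40, 0.35, 0.36, 0.37, 0.38]
lr_patience_10 = simulate_plateau(made_up_losses, lr=1e-5, patience=10)
lr_patience_1 = simulate_plateau(made_up_losses, lr=1e-5, patience=1)
print(lr_patience_10)
print(lr_patience_1)
```

With a patience smaller than the number of epochs (for example 1 or 2), the scheduler would have a chance to act; whether that helps here would need to be checked experimentally.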


### 6.2 Optimizing BERT and using the validation data

In [None]:

# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Loading data
train_df = env_claim_train
test_df = env_claim_test
val_df = env_claim_val

# Load BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Tokenize the data
train_encodings = tokenizer(list(train_df['text']), truncation=True, padding=True)
val_encodings = tokenizer(list(val_df['text']), truncation=True, padding=True)
test_encodings = tokenizer(list(test_df['text']), truncation=True, padding=True)

# Creating the ClimateDataset class to load the data into PyTorch
class ClimateDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create a data loader
train_dataset = ClimateDataset(train_encodings, list(train_df['label']))
val_dataset = ClimateDataset(val_encodings, list(val_df['label']))
test_dataset = ClimateDataset(test_encodings, list(test_df['label']))

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
# Evaluation loaders do not need shuffling
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Load BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2).to(device)

# Set up optimizer and scheduler
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5) 
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

num_epochs = 5
best_val_loss = float('inf')

# Train the model
for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
    train_loss /= len(train_loader)
    
    val_loss = 0
    model.eval()
    for batch in val_loader:
        with torch.no_grad():
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()
    val_loss /= len(val_loader)
    
    print(f'Epoch {epoch+1}, Train loss: {train_loss}, Val loss: {val_loss}')
    
    scheduler.step(val_loss)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')

# Evaluation
model.load_state_dict(torch.load('best_model.pt'))
model.eval()
predictions, true_labels = [], []
for batch in test_loader:
    with torch.no_grad():
        outputs = model(batch['input_ids'].to(device), attention_mask=batch['attention_mask'].to(device))
    predictions.extend(torch.argmax(outputs.logits, dim=1).tolist())
    true_labels.extend(batch['labels'].tolist())

print(classification_report(true_labels, predictions, target_names=['No Climate Claim', 'Climate Claim']))


We can see that the accuracy is now at 91%, which is quite impressive. It will be interesting to see how ClimateBERT's pre-trained climate detector model performs.

### 6.3 BERT Model with Environmental Claim Pre-trained Model

In [None]:
!pip install simpletransformers
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import accuracy_score
import pandas as pd

use_cuda = torch.cuda.is_available()

# Create a ClassificationModel
model = ClassificationModel(
    "roberta", 
    "climatebert/distilroberta-base-climate-detector", 
    num_labels=2, 
    args={"reprocess_input_data": True, "overwrite_output_dir": True},
    use_cuda=use_cuda
)

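# Train the model on our dataset (simpletransformers expects 'text' and 'labels' columns)
model.train_model(env_claim_train[['text', 'label']].rename(columns={'label': 'labels'}))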

# Evaluation
def get_predictions(texts, true_labels=None):
    predictions, raw_outputs = model.predict(texts)

    if true_labels is not None:
        accuracy = accuracy_score(true_labels, predictions)
        print(f'Accuracy: {accuracy}')

    return predictions

texts = list(env_claim_test['text'])
true_labels = list(env_claim_test['label'])
predicted_labels = get_predictions(texts, true_labels)


We obtain an accuracy of 87%, which is lower than our previous best of 91%.
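
Accuracy alone can hide how well the less frequent class is detected, and the earlier BERT runs printed a full classification report rather than a single number. As a minimal, dependency-free sketch (the helper `per_class_metrics` and the toy labels are hypothetical, not part of the notebook), precision and recall for the claim class can be derived from the predictions like this:

```python
def per_class_metrics(true_labels, predicted_labels, positive=1):
    """Return (accuracy, precision, recall), treating `positive` as the target class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Toy example (made-up labels): 8 negatives, 2 positives, one positive missed
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
accuracy, precision, recall = per_class_metrics(toy_true, toy_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In the notebook itself, passing `true_labels` and `predicted_labels` to `classification_report` (already imported in section 6.1) should give the same kind of breakdown and make the comparison with the 91% run more like-for-like.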

**Model citation:** 
Bingler, J., Kraus, M., Leippold, M., & Webersinke, N. (2023). How Cheap Talk in Climate Disclosures Relates to Climate Initiatives, Corporate Emissions, and Reputation Risk. *Working paper*. Available at SSRN 3998435.

## 7 - Fine-Tuning a GPT-3 Model<a id="chatgpt"></a>

The code below is only an outline so far and has not been tested yet.

In [None]:
import json
import os


def create_prompt_json(claims, labels):
    """Build one prompt/completion dict per claim (label 0 = non-environmental)."""
    json_list = []
    for claim, label in zip(claims, labels):
        if label == 0:
            json_list.append(
                {"prompt": f"{claim} ->", "completion": "Non-environmental claim.\n"})
        else:
            json_list.append(
                {"prompt": f"{claim} ->", "completion": "Environmental claim.\n"})
    return json_list

# Create a function that writes one JSON file per claim in the test set


def create_json_files(claims, labels):
    json_list = create_prompt_json(claims, labels)
    # Make sure the output directory exists before writing
    os.makedirs('prompt-data', exist_ok=True)
    for index, json_data in enumerate(json_list):
        with open(f'prompt-data/{index}.json', 'w') as f:
            json.dump(json_data, f)


claims = list(env_claim_test['text'])
labels = list(env_claim_test['label'])
create_json_files(claims, labels)
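
The fine-tuning cell below expects a single JSONL file with one prompt/completion pair per line rather than one JSON file per claim. A minimal sketch of that conversion, assuming the same ` ->` separator and label texts used above (the helper name `write_prompt_jsonl`, the leading space in each completion, and the example claims are illustrative assumptions):

```python
import json
import os
import tempfile


def write_prompt_jsonl(claims, labels, path):
    """Write one {"prompt", "completion"} JSON object per line to `path`."""
    completions = {0: " Non-environmental claim.\n", 1: " Environmental claim.\n"}
    with open(path, "w", encoding="utf-8") as f:
        for claim, label in zip(claims, labels):
            record = {"prompt": f"{claim} ->", "completion": completions[int(label)]}
            f.write(json.dumps(record) + "\n")
    return path


# Made-up example claims, only to show the file layout
example_claims = ["We cut our emissions by 30% last year.", "Revenue grew in the third quarter."]
example_labels = [1, 0]
jsonl_path = write_prompt_jsonl(
    example_claims, example_labels,
    os.path.join(tempfile.gettempdir(), "training_data_example.jsonl"))

# Read the file back to check that every line is a valid JSON record
with open(jsonl_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records[0])
```

For fine-tuning, the training split rather than the test split would normally be passed in, so that the test set stays available for evaluation.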


In [None]:
import json
import openai

# 1. OpenAI Key
api_key ="YOUR_OPENAI_API_KEY"
openai.api_key = api_key

# 2. Create Training Data - this is an example, replace it with your data processing
# Todo: convert our data to the following format
data_file = [{
    "prompt": "Environmental claim ->",
    "completion": " Ideal answer.\n"
},{
    "prompt":"Environmental claim ->",
    "completion": " Ideal answer.\n"
}]

# 3. Save dict as JSONL file
file_name = "training_data.jsonl"
with open(file_name, 'w') as outfile:
    for entry in data_file:
        json.dump(entry, outfile)
        outfile.write('\n')

# Prepare data for fine-tuning

!openai tools fine_tunes.prepare_data -f training_data.jsonl

# 4. Upload file to your OpenAI account (the legacy SDK expects an open file object)
with open(file_name, 'rb') as training_file:
    upload_response = openai.File.create(
        file=training_file,
        purpose='fine-tune'
    )

# Save file ID
file_id = upload_response.id

# 5. Fine-tune a model
# The legacy fine-tunes endpoint used here accepts base models only (ada, babbage, curie, davinci)
model = "ada"
fine_tune_response = openai.FineTune.create(training_file=file_id, model=model)

# 6. Retrieve the fine-tuned model name
# Note: fine_tuned_model stays None until the fine-tuning job has finished
fine_tuned_model = openai.FineTune.retrieve(id=fine_tune_response.id).fine_tuned_model

# 7. Test the fine-tuned model
new_prompt = "NEW ENVIRONMENTAL CLAIM ->"
answer = openai.Completion.create(
  model=fine_tuned_model,
  prompt=new_prompt,
  max_tokens=50,
  temperature=0.5
)
print(answer['choices'][0]['text'])


## Environment description

In [None]:
%load_ext watermark
%watermark -v -p pandas,numpy,sklearn,datasets,spacy,wordcloud