# Using Transformers to Understand Product Labelling

![Image](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41562-024-02087-0/MediaObjects/41562_2024_2087_Fig1_HTML.png?as=webp)

The consumption of many products and services carries carbon costs and, through climate change, associated impacts on human health and mortality. In late 2024 we published an article proposing that accounting for these costs in labels could drive manufacturers towards net zero targets. You can read the paper here: https://www.nature.com/articles/s41562-024-02087-0

This idea is controversial, as you can imagine, and you might wonder how people responded to it. While that is hard to assess directly from the paper, earlier in the year we ran an advocacy campaign that included a brief survey (https://ziamehrabi.medium.com/calculate-the-human-impact-of-your-everyday-decisions-6a65d63efec9). People picked some food items and travel options, and were then told how many minutes those consumption choices would take from another person's life through their climate change impacts. We also asked students in an Introductory Environmental Studies class what they thought.

The students' responses were free-form text written as part of a conversation. In this notebook you'll use pre-trained and off-the-shelf fine-tuned language models to learn what those students thought and how the idea could be improved.

## Packages

In [None]:
# read and manipulate data
import os
import pandas as pd
import seaborn as sns
import numpy as np
import nltk
from sklearn.decomposition import PCA
from tqdm import tqdm  # For progress tracking

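# Ensure necessary NLTK data is downloaded
# ("punkt_tab" is what newer NLTK releases look for; "punkt" covers older ones)
nltk.download("punkt")
nltk.download("punkt_tab")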

# for running llms from hugging face
import torch
from transformers import pipeline

# ignore warnings because they're annoying
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


## Models

We'll be using a couple of models in this notebook: a fine-tuned version of DistilBERT for sentiment analysis (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english), and a fine-tuned version of BART for summarization (https://huggingface.co/facebook/bart-large-cnn). Later on we also load the base BERT model (bert-base-uncased) to compute embeddings.

## Pipeline() API Overview

We'll be using the pipeline() function for this notebook. It is a very powerful abstracted API for running a range of standardized pipelines with the Hugging Face transformers library. You can find out more here: https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/pipelines#transformers.pipeline.

Below we demonstrate the pipeline() function for sentiment analysis. Note that pipeline() downloads the model locally the first time it is used. We first test that it works on a simple piece of text, returning the predicted label and its score.



In [None]:
# set up pipeline
classifier = pipeline(
    task="sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1  # first GPU if there is one, otherwise CPU
)
preds = classifier("Hugging Face is the best thing since sliced bread!")
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds

## Comparison to vanilla transformers

The pipeline() function abstracts a number of individual steps that you can implement yourself with the transformers library: defining and downloading the model, choosing the tokenizer, tokenizing the input, getting predictions, converting the predictions into a form for interpretation, and so on. We have pasted these steps below (commented out) so you can understand a little of what is happening under the hood.

See here for a quick overview of the transformers package: https://huggingface.co/docs/transformers/en/quicktour

In [None]:
# # Note: here we select the specific tokenizer class for this model. AutoTokenizer also exists,
# # and will automatically choose the right one for the model specified.
# from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# # Load model and tokenizer (same checkpoint as in the pipeline above)
# model_name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
# tokenizer = DistilBertTokenizer.from_pretrained(model_name)
# model = DistilBertForSequenceClassification.from_pretrained(model_name)

# # Define input text
# text = "Hugging Face is the best thing since sliced bread!"

# # Tokenize input text
# inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# # Get predictions
# with torch.no_grad():
#     outputs = model(**inputs)
#     logits = outputs.logits

# # Get predicted label and score
# predicted_class = torch.argmax(logits, dim=1).item()
# pred_score = torch.softmax(logits, dim=1)[0, predicted_class].item()

# # Map the predicted class to label
# labels = ["NEGATIVE", "POSITIVE"]
# pred_label = labels[predicted_class]

# # Prepare output in the same format as the pipeline
# preds = [{"score": round(pred_score, 4), "label": pred_label}]
# print(preds)
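
The step that turns raw logits into the `score` reported by the pipeline can be illustrated without downloading a model. This is a minimal sketch in NumPy: the logit values are made up, and the label order `[NEGATIVE, POSITIVE]` is assumed to match the commented cell above.

```python
import numpy as np

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for one input, ordered [NEGATIVE, POSITIVE]
example_logits = np.array([-4.2, 4.5])
labels = ["NEGATIVE", "POSITIVE"]

probs = softmax(example_logits)
predicted_class = int(np.argmax(probs))  # index of the most likely class
pred = {"score": round(float(probs[predicted_class]), 4), "label": labels[predicted_class]}
print(pred)
```

The score is therefore the model's probability for the winning class only, so a NEGATIVE prediction with score 0.99 is a confident negative, not a weak positive.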


## Read in the data 

Now that we have a little understanding of the APIs, we will read in the data. This is a csv with one row per student response. There is no index column, so just use the dataframe index as the student id. There may be some blank rows and some HTML artifacts in the responses (the conversations were originally recorded in HTML documents).

In [None]:
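base_path = "/kaggle/input/labelling/"
file_name = "student_responses.csv"
df = pd.read_csv(os.path.join(base_path, file_name))

# drop blank rows so the models only ever see text, and keep a clean 0..n-1 index
df = df.dropna(subset=["response"]).reset_index(drop=True)
print(df.shape)
df.head()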
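
If the HTML artifacts turn out to matter, one possible cleaning step is sketched below. It is an illustration on a toy dataframe, not part of the original analysis: the helper name `clean_response` and the toy rows are made up, and a regular expression is only a rough way to strip tags.

```python
import html
import re

import pandas as pd

def clean_response(text):
    """Strip HTML tags and entities from one response; return '' for missing values."""
    if pd.isna(text):
        return ""
    text = re.sub(r"<[^>]+>", " ", str(text))  # drop tags such as <p> or <br/>
    text = html.unescape(text)                 # turn entities such as &amp; into characters
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

# Toy stand-in for the real csv (hypothetical rows)
toy_df = pd.DataFrame(
    {"response": ["<p>It made me feel &amp; think.</p>", None, "   ", "Fine as is."]}
)
toy_df["response"] = toy_df["response"].apply(clean_response)

# Keep only rows that still contain text after cleaning
toy_df = toy_df[toy_df["response"] != ""].reset_index(drop=True)
print(toy_df)
```

Applying the same function to `df["response"]` before classification would keep markup from influencing the sentiment scores.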

## What did students think?

Here we classify each response as positive or negative, so that we can count how often each occurs. Overall, what did people think? We'll display the first few results.

In [None]:
inputs = df["response"].to_list()
preds = classifier(inputs) #note we set this above when we set up the pipeline
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds_df = pd.DataFrame(preds)
preds_df.head()

Cool. Now what did everyone think? Let's plot the distribution of responses.

What do we see overall? It looks like there was a marginally net positive response, although it was apparently very mixed.

In [None]:
sns.histplot(
    data=preds_df, 
    x="label", 
    hue="label", 
    multiple="stack",
    palette={"NEGATIVE": "darkred", "POSITIVE": "green"}  # map colors explicitly so they cannot be swapped
)
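
To put numbers on the plot, the label counts and shares can be tabulated directly. The sketch below uses a toy dataframe with the same columns as `preds_df`; the values are invented for illustration.

```python
import pandas as pd

# Toy predictions in the same shape as preds_df (made-up values)
toy_preds = pd.DataFrame({
    "label": ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"],
    "score": [0.99, 0.97, 0.62, 0.88, 0.75],
})

counts = toy_preds["label"].value_counts()                           # absolute frequency
shares = toy_preds["label"].value_counts(normalize=True).round(2)    # relative frequency
mean_confidence = toy_preds.groupby("label")["score"].mean().round(3)  # how sure the model was

print(counts)
print(shares)
print(mean_confidence)
```

Running the same three lines on `preds_df` shows how narrow the positive margin is, and whether one class tends to be predicted with lower confidence.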

## Changing sentiment

Is it possible to pick out something more? One thing to note is that during the conversation students were asked three questions in sequence: How did the survey make you feel? Was it an effective way to communicate this issue? What could be improved? The analysis below splits each response into sentences, identifies the sentiment of each sentence, and then plots the result across a normalized conversation time axis.


In [None]:

# Function to process each response
def process_response(response_text):
    sentences = nltk.sent_tokenize(response_text)  # Split into sentences
    num_sentences = len(sentences)

    # Classify all sentences of this response in a single call
    results = classifier(sentences)

    # Extract sentiment and confidence
    sentiment_scores = [1 if r["label"] == "POSITIVE" else -1 for r in results]
    confidences = [r["score"] for r in results]

    # Normalize sentence positions (0 to 1)
    normalized_time = np.linspace(0, 1, num_sentences)

    # Return DataFrame
    return pd.DataFrame({"normalized_time": normalized_time, 
                         "sentiment": sentiment_scores, 
                         "confidence": confidences, 
                         "response_id": response_text[:30]})  # First 30 chars as ID

# Example data (Replace with your actual DataFrame `df` containing responses)
# df = pd.DataFrame({"response": ["Example response text.", "Another response."]})

# Apply processing to all responses
all_sentences_df = pd.concat(df["response"].apply(process_response).tolist())

Now we plot the results. Remember the conversation had three questions: How did the survey make you feel? Was it an effective way to communicate this issue? What could be improved? There is a lot of noise, but we also find three peaks in the responses, and the sentiment around the middle peak (which we may assume corresponds to the question "Was it an effective way to communicate this issue?") tends to be mostly positive.

In [None]:

import matplotlib.pyplot as plt
# Split the data into positive and negative sentiments
positive_sentiments = all_sentences_df[all_sentences_df['sentiment'] == 1]
negative_sentiments = all_sentences_df[all_sentences_df['sentiment'] == -1]

# Plot KDE for positive and negative sentiment densities
plt.figure(figsize=(10, 5))

# KDE for positive sentiments (blue)
sns.kdeplot(positive_sentiments['normalized_time'], 
            color='blue', 
            label='Positive Sentiment', 
            fill=True, 
            alpha=0.5,  # Transparency to visualize overlap
            bw_adjust=0.5)  # Bandwidth adjustment for smoothness

# KDE for negative sentiments (red)
sns.kdeplot(negative_sentiments['normalized_time'], 
            color='red', 
            label='Negative Sentiment', 
            fill=True, 
            alpha=0.5,  # Transparency to visualize overlap
            bw_adjust=0.5)  # Bandwidth adjustment for smoothness

# Labeling and titles
plt.xlabel("Normalized Time Within Response")
plt.ylabel("Density")
plt.title("Kernel Density Estimation of Sentiment Progression")

# Customize the legend
plt.legend(title="Sentiment")  # reuse the labels already set in the kdeplot calls

plt.grid(True)
plt.show()
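
A rough numeric check on the three peaks is to cut the normalized time axis into thirds and compute the share of positive sentences in each. The sketch below runs on toy data shaped like `all_sentences_df`. Treating each third as one question is an assumption, since students need not spend the same number of sentences on each question, and the segment names are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_sentences_df: two responses with three and six sentences
toy_sentences = pd.DataFrame({
    "normalized_time": np.concatenate([np.linspace(0, 1, 3), np.linspace(0, 1, 6)]),
    "sentiment": [-1, 1, -1, -1, -1, 1, 1, 1, -1],
})

# Cut the time axis into three equal-width bins, one per assumed question
toy_sentences["segment"] = pd.cut(
    toy_sentences["normalized_time"],
    bins=[0, 1 / 3, 2 / 3, 1],
    labels=["feel", "effective", "improve"],
    include_lowest=True,
)

# Share of positive sentences per segment
positive_share = (
    toy_sentences.assign(is_positive=toy_sentences["sentiment"] == 1)
    .groupby("segment", observed=False)["is_positive"]
    .mean()
)
print(positive_share)
```

On the real data, a clearly higher share in the middle segment would support the reading of the density plot above.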


## Summarization

That is cool. But we might also want to pick out some key take-homes in natural language: a summary of what people thought, not just the sentiment. A first step is simply to plot the embeddings of the responses to see how different they are from each other. For this we'll load the base BERT model (not fine-tuned on any downstream task).



In [None]:
# Load data
data = df

# Initialize the Hugging Face feature-extraction pipeline
feature_extractor = pipeline('feature-extraction', model='bert-base-uncased', tokenizer='bert-base-uncased')

# Function to get embeddings for responses using the pipeline
def get_bert_embeddings(texts, batch_size=16):
    all_embeddings = []
    
    for i in tqdm(range(0, len(texts), batch_size), desc="Processing Batches"):
        batch_texts = texts[i:i + batch_size]
        
        # Get embeddings for the batch
        embeddings_batch = feature_extractor(batch_texts.tolist())
        
        # Extract the [CLS] token embedding from each response in the batch
        cls_embeddings = [embedding[0][0] for embedding in embeddings_batch]  # [0][0] gives [CLS] token
        all_embeddings.extend(cls_embeddings)
    
    return all_embeddings

# Get embeddings for all responses
embeddings = get_bert_embeddings(data['response'])

# Perform PCA on the embeddings
pca = PCA(n_components=2)
pca_result = pca.fit_transform(embeddings)

# Collect the two principal components in a DataFrame
pca_df = pd.DataFrame(pca_result, columns=['PC1', 'PC2'])

# Plot the results using Seaborn
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', data=pca_df, color='blue', alpha=0.5)

# Add title and labels
plt.title('PCA of BERT Embeddings', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=14)
plt.ylabel('Principal Component 2', fontsize=14)

# Show the plot
plt.show()
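
The cell above represents each response by its [CLS] token vector. A common alternative is to average over all token vectors (mean pooling). The sketch below contrasts the two on a made-up output with 4 tokens and a hidden size of 3, shaped like what the feature-extraction pipeline is assumed to return for a single text.

```python
import numpy as np

# Hypothetical output for ONE text: [1][num_tokens][hidden_size]
toy_output = [[
    [0.1, 0.2, 0.3],    # [CLS]
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
    [0.0, 0.2, 0.2],    # [SEP]
]]

token_embeddings = np.array(toy_output[0])      # shape (num_tokens, hidden_size)
cls_embedding = token_embeddings[0]             # what embedding[0][0] selects above
mean_embedding = token_embeddings.mean(axis=0)  # alternative: average over all tokens

print(cls_embedding)
print(mean_embedding)
```

Which of the two gives a more useful picture for these responses is an empirical question, and swapping one for the other only changes the line that builds `cls_embeddings`.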

Now we sample some disparate examples from this embedding space. We then run a summarization pipeline on these and return the concatenated results. 

In [None]:
# Merge the data

merged_df = pd.concat([data, pca_df], axis=1)

# Calculate the Euclidean distance from the origin (0, 0) using PC1 and PC2
merged_df['dist_pca'] = np.sqrt(merged_df['PC1']**2 + merged_df['PC2']**2)

# Bin the distances into 5 quantiles using qcut
merged_df['quantile'] = pd.qcut(merged_df['dist_pca'], q=5, labels=False)

# Sample one row from each quantile group
sampled_df = merged_df.groupby('quantile', group_keys=False).apply(lambda x: x.sample(n=1))

# Print the sampled rows (response, PC1, PC2, dist_pca and quantile)
print(sampled_df)
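
Because the sampling above is random, each run can summarize a different set of responses. The sketch below shows the same quantile-then-sample idea on toy data, with a fixed seed so the draw is repeatable. The seed value and the toy distances are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df: ten responses with made-up distances
toy = pd.DataFrame({
    "response": [f"response {i}" for i in range(10)],
    "dist_pca": np.arange(1.0, 11.0),
})

# Five equal-sized bins by distance, then one reproducible draw per bin
toy["quantile"] = pd.qcut(toy["dist_pca"], q=5, labels=False)
toy_sample = toy.groupby("quantile").sample(n=1, random_state=0)
print(toy_sample)
```

Fixing `random_state` makes it easier to compare summaries across runs, at the cost of always looking at the same five responses.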

In [None]:
# Initialize the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

combined_responses = " ".join(sampled_df['response'].tolist())

# Function to generate bullet-point summary
def generate_bullet_summary(text):
    summary = summarizer(text, max_length=200, min_length=100, do_sample=False)[0]['summary_text']
    bullet_points = "• " + summary.replace(". ", ".\n• ")
    return bullet_points

# Generate the bullet-point summary for all responses combined
bullet_summary = generate_bullet_summary(combined_responses)

# Print the final bullet-point summary
print("Bullet-point Summary for All Responses:")
print(bullet_summary)
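
Summarization models accept only a limited input length, so a long concatenation of responses may not fit in one call. One way to handle this is to split the text into overlapping chunks, summarize each chunk, and then summarize the summaries. The helper below sketches only the splitting step. It counts words rather than tokens, and the 400-word budget and 50-word overlap are assumptions, not values tuned for BART.

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into overlapping chunks of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# Toy text of 1000 numbered words
toy_text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(toy_text)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk could then be passed to `summarizer` in turn. A tokenizer-based length check would be more exact than counting words.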



## Assignments

1. Compare the outputs of the last exercise to either an extractive (e.g. BERTSUM) or an abstractive (e.g. BART, T5) summarization model run over the whole dataset (hint: first summarize the individual responses, then concatenate those summaries and summarize the result). Do the take-homes differ from your simple approach of sampling different responses? If so, how?
2. Did the tokenizer truncate any of the contexts? How much were you missing? Update your approach using chunking or an additional model (e.g. Longformer). How did it change your results?
3. Generate your own sub-queries or questions that you want to ask these students, and return the top 5 individual responses that best match each query, ranked by cosine similarity.
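
As a starting point for the ranking step in the third assignment, here is a minimal sketch of cosine-similarity ranking in NumPy. The vectors are made-up 2-D stand-ins for real embeddings, and the function name `top_k_cosine` is hypothetical. How you embed the queries and responses is left to you.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return indices and similarities of the k rows most similar to query_vec."""
    query = query_vec / np.linalg.norm(query_vec)                        # unit-length query
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)  # unit-length rows
    sims = docs @ query                # cosine similarity of every row with the query
    order = np.argsort(-sims)[:k]      # indices sorted from most to least similar
    return order, sims[order]

# Made-up 2-D "embeddings" for four responses and one query
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.1])

idx, scores = top_k_cosine(toy_query, toy_docs, k=2)
print(idx, scores.round(3))
```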

## Bonus Assignment
Let us know some way in which we can improve this notebook for future students to make it more informative and engaging.