This notebook provides end-to-end code for a computational pipeline based on the BERT-NLI approach from the paper "[Building Efficient Universal Classifiers with Natural Language Inference](https://arxiv.org/abs/2312.17543)" by Moritz Laurer, Wouter van Atteveldt, Andreu Casas, and Kasper Welbers (2024). I developed this pipeline as part of my master’s thesis. It is designed to analyze Sociotechnical Imaginaries of AI in Italian online news websites and combines NLI-based zero-shot classification of the articles with topic modeling using [BERTopic](https://maartengr.github.io/BERTopic/index.html) (Grootendorst, 2022). After text preprocessing, the pipeline is divided into two phases, each consisting of two steps. The pipeline is grounded in the [SIPCs (Sociotechnical Imaginaries in Public Communication) framework](https://journals.sagepub.com/doi/full/10.1177/13548565251338192) developed by Brause et al. (2025). To apply the pipeline to news media in other languages, it is sufficient to adapt the text preprocessing and translate the hypotheses for NLI classification.

# Activate a GPU runtime

To run this notebook on a GPU, click "Runtime" > "Change runtime type" and select "GPU" in the menu bar at the top left. Running the notebook on a GPU makes the analysis much faster.

## Install relevant packages

In [None]:
!pip install transformers[sentencepiece]
!pip install bertopic[all] sentence-transformers

# Data download and article preprocessing

The dataset comprises Italian-language articles sourced from news websites. Articles whose headlines contain “IA” or “intelligenza artificiale” were collected over a five-year period. The dataset is not publicly available due to privacy constraints. This does not affect the pipeline architecture presented below, which can be readily adapted to other Italian newspaper corpora for the analysis of AI imaginaries.
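
The headline-based selection described above can be sketched as follows. This is a minimal illustration, not the collection code actually used for the thesis: the function name `headline_mentions_ai` and the sample headlines are hypothetical, and the sketch assumes that "IA" should match only as a whole, upper-case word so that it is not found inside words such as "Italia".

```python
import re

# Hypothetical patterns: "IA" is matched case-sensitively as a whole word,
# "intelligenza artificiale" is matched case-insensitively.
IA_PATTERN = re.compile(r"\bIA\b")
FULL_PATTERN = re.compile(r"\bintelligenza\s+artificiale\b", re.IGNORECASE)

def headline_mentions_ai(headline: str) -> bool:
    """Return True if the headline contains "IA" or "intelligenza artificiale"."""
    return bool(IA_PATTERN.search(headline) or FULL_PATTERN.search(headline))

# Made-up headlines, for illustration only
headlines = [
    "L'IA entra nelle scuole",
    "Intelligenza artificiale e lavoro",
    "Italia, crescita in calo",
]
selected = [h for h in headlines if headline_mentions_ai(h)]
print(selected)
```

With a pandas DataFrame, the same function could be applied to the headline column to build a boolean filter.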

In [None]:
# Load the dataset.
# The dataset is assumed to have several columns, such as "Author" and "Date"; the column containing the article text is called "Content".
import pandas as pd
from google.colab import drive
from google.colab import files
drive.mount('/content/drive')
df_articles = pd.read_csv('path to your dataset')  # Enter the path to your dataset in the drive here

# Minimal preprocessing of the articles (can be adapted and expanded depending on the dataset's requirements)

def clean_italian_articles(text):
    """
    Clean Italian news articles by removing common metadata and stamps
    """
    # Treat missing values (None/NaN) and empty strings as empty text
    if pd.isna(text) or not str(text).strip():
        return ""

    text = str(text).strip()

    # Remove location markers at the beginning of the article if they appear within the first ~30 characters
    if ' - ' in text[:30]:
        parts = text.split(' - ', 1)
        if len(parts[0]) < 25:
            text = parts[1].strip()
    if ' – ' in text[:30]:
        parts = text.split(' – ', 1)
        if len(parts[0]) < 25:
            text = parts[1].strip()

    #Remove author names at the end of the articles if they are in parentheses
    if text.endswith(')'):
        last_paren = text.rfind('(')
        if last_paren != -1:
            author_part = text[last_paren+1:-1].strip()
            if len(author_part) < 50 and author_part.replace(' ', '').replace('.', '').isalpha():
                text = text[:last_paren].strip()

    return text

# Apply the cleaning function to the Content column
df_articles['Content'] = df_articles['Content'].apply(clean_italian_articles)

# Formatting the articles to facilitate classification using NLI models

# Add new column 'Content_formatted' with the quote structure

df_articles['Content_formatted'] = df_articles['Content'].apply(
    lambda text: f'La citazione: "{text}" - fine della citazione.'
)




# First phase step 1 - NLI-based classification of the articles according to the vision dimension

The first phase aims to define the envisioned roles of AI, which form the foundation of imaginaries. The first step consists of classifying the articles through NLI according to the Vision dimension of the SIPCs framework ([Brause et al., 2025](https://journals.sagepub.com/doi/10.1177/13548565251338192)). The multilingual model "MoritzLaurer/bge-m3-zeroshot-v2.0" is used for zero-shot classification, since its 8,192-token context window enables full processing of newspaper texts. Depending on the specific requirements, a different zero-shot classifier can be selected from [Hugging Face](https://huggingface.co/collections/MoritzLaurer/zeroshot-classifiers-6548b4ff407bb19ff5c3ad6f).
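
To make the classification step more concrete, the sketch below shows how a premise and a hypothesis are paired in NLI-based zero-shot classification: the candidate label is inserted into the hypothesis template, and the model then estimates whether the article (the premise) entails that hypothesis. The helpers `build_nli_pair` and `to_binary` and the example sentence are illustrative assumptions; the quote wrapper, template, and label are the ones used in this notebook.

```python
def build_nli_pair(text: str, label: str, template: str) -> tuple[str, str]:
    """Return the (premise, hypothesis) pair that an NLI model would score."""
    premise = f'La citazione: "{text}" - fine della citazione.'
    hypothesis = template.format(label)
    return premise, hypothesis

def to_binary(score: float, threshold: float = 0.5) -> int:
    """Turn an entailment probability into a 0/1 label."""
    return int(score > threshold)

template = "La citazione contiene riferimenti a {}"
label = "il ruolo esplicito o l'uso specifico che l'intelligenza artificiale o una sua tecnologia avrà in futuro"

# Made-up sentence, for illustration only
premise, hypothesis = build_nli_pair("L'IA guiderà le auto entro dieci anni.", label, template)
print(premise)
print(hypothesis)
```

With `multi_label=True`, each label is scored independently, so the score reported for a label can be read as the probability that the article entails the corresponding hypothesis.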

In [None]:


# Load the multilingual zero-shot classification pipeline from Hugging Face
from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1  # GPU=0, CPU=-1
classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/bge-m3-zeroshot-v2.0",
    framework="pt",
    device=device
)


# Define hypothesis template and candidate labels
hypothesis_template = "La citazione contiene riferimenti a {}"
candidate_labels = [
    "il ruolo esplicito o l'uso specifico che l'intelligenza artificiale o una sua tecnologia avrà in futuro"
]

# Apply classification with batch processing for speed
texts = df_articles["Content_formatted"].tolist()
batch_size = 8
all_scores = []

for i in range(0, len(texts), batch_size):
    batch_texts = texts[i:i+batch_size]
    batch_results = classifier(
        batch_texts,
        candidate_labels,
        hypothesis_template=hypothesis_template,
        multi_label=True
    )
    batch_scores = [result["scores"][0] for result in batch_results]
    all_scores.extend(batch_scores)

# Store results
df_articles["probability_score"] = all_scores
df_articles["Role"] = (df_articles["probability_score"] > 0.5).astype(int)


# First phase step 2 - Topic modeling with BERTopic to identify the imaginary contexts and the role of AI within imaginaries

Through a dual clustering approach with BERTopic, combined with the manual reading of a sample of articles, it is possible to analyze the articles previously classified by NLI in order to identify the envisioned role of AI within different imaginaries.

First, BERTopic is applied to the articles to identify the broader imaginary contexts, namely the thematic domains and areas in which AI imaginaries are situated. "Qwen/Qwen3-Embedding-0.6B" is used as the embedding model, since its long context window (32K tokens according to its model card) enables full-article processing of newspaper texts.

In [None]:
import torch
import nltk
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
import pandas as pd
import numpy as np

# Download Italian stopwords
nltk.download('stopwords', quiet=True)
italian_stops = stopwords.words('italian')

# Define custom stopwords. Since we already know that all the articles focus on AI, we can add the following stopwords.
# CountVectorizer lowercases tokens by default, so the stopwords must be lowercase to match.
custom_stops = ["intelligenza", "artificiale", "ai", "ia"]

# Combine standard and custom stopwords into a list
all_stopwords = italian_stops + custom_stops

# Prepare the list of documents - only Role = 1
role_1_articles = df_articles[df_articles['Role'] == 1].copy()
docs = role_1_articles['Content'].astype(str).tolist()

print(f"Total articles: {len(df_articles)}")
print(f"Role = 1 articles: {len(role_1_articles)}")
print(f"Documents to analyze: {len(docs)}")

#Work with GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load the embedding model
embedding_model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-0.6B",
    device=device,
    trust_remote_code=True
)

# Calculate embeddings (conservative batch size)
embeddings = embedding_model.encode(
    docs,
    batch_size=1,
    show_progress_bar=True,
    convert_to_numpy=True,
    device=device
)

# Create a CountVectorizer that filters out our stopwords
vectorizer_model = CountVectorizer(stop_words=all_stopwords)

# Instantiate BERTopic with the custom vectorizer
topic_model = BERTopic(
    vectorizer_model=vectorizer_model,  # remove stopwords at vectorization
    calculate_probabilities=True        # compute topic probabilities
)

# Fit the model and assign each document to a topic, reusing the precomputed embeddings
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)

# Store the topic IDs and max probability back into the DataFrame
role_1_articles['Topic'] = topics
role_1_articles['Topic_Probability'] = [p.max() if p is not None else 0 for p in probs]

# Use BERTopic's visualizations to examine the topics.
print(topic_model.get_topic_info())
topic_model.visualize_barchart().show()  # .show() is needed because only a cell's last expression is displayed automatically
# In particular, visualizing the hierarchical topics is useful for merging clusters, especially when working with many documents
hierarchical_topics = topic_model.hierarchical_topics(docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics).show()

# Add "Imaginary context" column with the BERTopic topic number
df_articles.loc[role_1_articles.index, "Imaginary context"] = role_1_articles["Topic"].values

At the end of the previous step, the DataFrame contains a column with the BERTopic cluster assignments, that is, the imaginary contexts. At this stage, one of these contexts can be selected to identify the role of AI and, subsequently, to analyze the AI imaginary as a whole. As an example, the first imaginary context (Imaginary1), corresponding to BERTopic cluster 0, is used. A second round of clustering with BERTopic is then applied only to the articles within that context, complemented by manual reading of a sample of articles to interpret the results. By the end of this phase, the role of AI within the imaginary has been identified. The next phase analyzes these articles to determine the remaining dimensions of the SIPCs framework, thereby providing a comprehensive account of the imaginary.
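
The sampling of articles for manual reading can be illustrated independently of pandas. The sketch below is a simplified stand-in for this part of the notebook: it keeps half of the articles (rounded down) from each subtopic. The function name `sample_half_per_group` and the records are made up for the example.

```python
import random
from collections import defaultdict

def sample_half_per_group(records, key, seed=42):
    """Randomly keep half (rounded down) of the records in each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    sample = []
    for group_id in sorted(groups):
        members = groups[group_id]
        sample.extend(rng.sample(members, len(members) // 2))
    return sample

# Made-up records: four articles in subtopic 0, three in subtopic 1
records = [{"id": i, "Subtopic": 0} for i in range(4)]
records += [{"id": i, "Subtopic": 1} for i in range(4, 7)]

sample = sample_half_per_group(records, key="Subtopic")
print(len(sample))  # 2 from subtopic 0 and 1 from subtopic 1
```

Sampling within each subtopic, rather than from the whole context at once, ensures that small clusters are also represented in the reading sample.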

In [None]:
from google.colab import files
# Select only articles in Imaginary1
imaginary1_articles = df_articles[df_articles["Imaginary context"] == 0].copy()
docs_imaginary1 = imaginary1_articles["Content"].astype(str).tolist()

# Recompute embeddings for this subset
embeddings_imaginary1 = embedding_model.encode(
    docs_imaginary1,
    batch_size=1,
    show_progress_bar=True,
    convert_to_numpy=True,
    device=device
)

# BERTopic on the selected Imaginary1 context
topic_model_imaginary1 = BERTopic(
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True
)
topics_imaginary1, probs_imaginary1 = topic_model_imaginary1.fit_transform(docs_imaginary1, embeddings=embeddings_imaginary1)

# Store back to the main DataFrame
imaginary1_articles["Subtopic"] = topics_imaginary1
imaginary1_articles["Subtopic_Probability"] = [p.max() if p is not None else 0 for p in probs_imaginary1]

df_articles.loc[imaginary1_articles.index, "Imaginary subcontext"] = imaginary1_articles["Subtopic"]
df_articles.loc[imaginary1_articles.index, "Subtopic_Probability"] = imaginary1_articles["Subtopic_Probability"]

# For each subtopic group, randomly sample about half of the articles
# (DataFrameGroupBy.sample keeps the 'Subtopic' column regardless of the pandas version)
sample_df = (
    imaginary1_articles.groupby('Subtopic')
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)

# Optional: Check how many articles were sampled from each subtopic
print("Sample size per subtopic:")
print(sample_df['Subtopic'].value_counts())

# Export the sampled DataFrame to an Excel file (replace 'path' with your output path; it must end in .xlsx)
sample_df.to_excel('path', index=False)

# Download the Excel file to your computer in order to inspect the clusters and read the articles
files.download('path')

# Second phase step 1 - NLI-based classification of all remaining dimensions of the imaginary

NLI-based classification is applied to identify the remaining dimensions of the SIPCs framework for the imaginary under analysis (Imaginary1).
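
Since several dimensions are scored with a single label each, the shared logic can be sketched as one helper. This is only an illustration under stated assumptions: `score_dimension` and `fake_classifier` are hypothetical names, and the stand-in classifier returns made-up scores so that the data flow can be followed without loading a model. The call mirrors the way this notebook calls the Hugging Face zero-shot pipeline on a list of texts.

```python
def score_dimension(classifier, texts, label, template, threshold=0.5):
    """Score one single-label dimension and return (scores, binary labels)."""
    results = classifier(
        texts,
        [label],
        hypothesis_template=template,
        multi_label=True,
    )
    scores = [result["scores"][0] for result in results]
    binary = [int(score >= threshold) for score in scores]
    return scores, binary

# Stand-in classifier with made-up scores, used only to show the data flow.
def fake_classifier(texts, candidate_labels, hypothesis_template, multi_label):
    return [
        {
            "sequence": text,
            "labels": list(candidate_labels),
            "scores": [0.9 if "ChatGPT" in text else 0.1],
        }
        for text in texts
    ]

template = "La citazione contiene riferimenti a {}"
object_label = "il nome di una specifica tecnologia dell'intelligenza artificiale usata per un determinato scopo"
texts = ["ChatGPT viene usato per scrivere testi.", "Il dibattito resta aperto."]

scores, binary = score_dimension(fake_classifier, texts, object_label, template)
print(binary)
```

Passing the template explicitly in every call matters: if it is omitted, the pipeline falls back to its default English template instead of the Italian one.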

In [None]:
#OBJECT DIMENSION

# Load the multilingual zero-shot classification pipeline
from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1  # GPU=0, CPU=-1
classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/bge-m3-zeroshot-v2.0",
    framework="pt",
    device=device
)
print(f"Running inference on: {'cuda:0' if device==0 else 'cpu'}")

# Define hypothesis template and candidate labels
hypothesis_template = "La citazione contiene riferimenti a {}"
candidate_labels = ["il nome di una specifica tecnologia dell'intelligenza artificiale usata per un determinato scopo"]

# Apply classification with batch processing for speed (Imaginary1 only)
texts = imaginary1_articles["Content_formatted"].tolist()
batch_size = 8
all_scores = []

for i in range(0, len(texts), batch_size):
    batch_texts = texts[i:i+batch_size]
    batch_results = classifier(
        batch_texts,
        candidate_labels,
        hypothesis_template=hypothesis_template,
        multi_label=True
    )
    batch_scores = [result["scores"][0] for result in batch_results]
    all_scores.extend(batch_scores)

# Add scores to the filtered dataframe (Imaginary1)
imaginary1_articles["probability_score"] = all_scores

# Create Object dimension column: 1 if probability > 0.5, otherwise 0
imaginary1_articles["Object dimension"] = (imaginary1_articles["probability_score"] > 0.5).astype(int)

# IMPLICATIONS DIMENSION
# Define the hypothesis template
hypothesis_template = "La citazione contiene riferimenti a {}"

# Function to classify each article and return the entailment score
def classify_implications(text):
    result = classifier(
        text,
        candidate_labels=["raccomandazioni o azioni specifiche, o specifici responsabili per l'uso dell'intelligenza artificiale"],
        hypothesis_template=hypothesis_template,  # without this, the pipeline falls back to its default English template
        multi_label=True
    )
    return result["scores"][0]

# Apply the model to every Imaginary1 article
probability_scores = imaginary1_articles["Content_formatted"].apply(classify_implications)

# Convert the score to a binary label using a threshold (0.5)
imaginary1_articles["Implications"] = (probability_scores >= 0.5).astype(int)

# SPATIO-TEMPORAL FUTURES

# Define the hypothesis template
hypothesis_template = "La citazione contiene riferimenti a {}"

# Function to classify each article and return the entailment score
def classify_spatio_temporal(text):
    result = classifier(
        text,
        candidate_labels=["periodi di tempo in cui l'intelligenza artificiale o una sua tecnologia si svilupperà, oppure luoghi in cui l'intelligenza artificiale o una sua tecnologia si svilupperà."],
        hypothesis_template=hypothesis_template,  # without this, the pipeline falls back to its default English template
        multi_label=True
    )
    return result["scores"][0]

# Apply the model to Imaginary1 articles
probability_scores = imaginary1_articles["Content_formatted"].apply(classify_spatio_temporal)

# Convert the score to a binary label using a threshold (0.5)
imaginary1_articles["Spatio_temporal"] = (probability_scores >= 0.5).astype(int)

#(UN)DESIRABILITY DIMENSION

# Define the hypothesis template and candidate labels
hypothesis_template = "La citazione contiene riferimenti a {}"
candidate_labels = [
    "effetti positivi e vantaggi dell'uso dell'intelligenza artificiale",
    "effetti negativi e rischi dell'uso dell'intelligenza artificiale"
]

# Classifier already instantiated earlier as `classifier`
def classify_desirability(text):
    result = classifier(
        text,
        candidate_labels=candidate_labels,
        hypothesis_template=hypothesis_template,
        multi_label=True
    )
    return dict(zip(result["labels"], result["scores"]))

# Apply to Imaginary1
score_dicts = imaginary1_articles["Content_formatted"].apply(classify_desirability)
positive_scores = score_dicts.apply(lambda d: d[candidate_labels[0]])
negative_scores = score_dicts.apply(lambda d: d[candidate_labels[1]])

# Add one binary column per label
imaginary1_articles["Desirability"] = (positive_scores >= 0.5).astype(int)
imaginary1_articles["Undesirability"] = (negative_scores >= 0.5).astype(int)

# STAKEHOLDER AND SPEAKER DIMENSION
# Define the hypothesis template and labels
hypothesis_template = "La citazione contiene riferimenti a {}"
labels = {
    "Politics": "rappresentanti politici che esprimono dichiarazioni o opinioni sull'intelligenza artificiale",
    "Industry": "rappresentanti del settore industriale, come aziende big tech o altre imprese, che esprimono dichiarazioni o opinioni sull'intelligenza artificiale",
    "Civil society": "membri di ONG o associazioni indipendenti che esprimono dichiarazioni o opinioni sull'intelligenza artificiale",
    "Academia": "membri del settore accademico che esprimono dichiarazioni o opinioni sull'intelligenza artificiale",
    "Media": "giornalisti o operatori dei media che esprimono dichiarazioni o opinioni sull'intelligenza artificiale"
}

def classify_row(text):
    result = classifier(
        text,
        candidate_labels=list(labels.values()),
        hypothesis_template=hypothesis_template,
        multi_label=True
    )
    reverse_map = {v: k for k, v in labels.items()}
    binary_results = {key: 0 for key in labels.keys()}
    for label_desc, score in zip(result["labels"], result["scores"]):
        if label_desc in reverse_map:
            key = reverse_map[label_desc]
            binary_results[key] = int(score >= 0.5)
    return binary_results

# Apply to Imaginary1 articles
score_dicts = imaginary1_articles["Content_formatted"].apply(classify_row)
# Reuse the articles' index so the columns align row by row when assigned below
binary_predictions = pd.DataFrame(list(score_dicts), index=imaginary1_articles.index)

# Add columns to imaginary1_articles
for col in labels.keys():
    imaginary1_articles[col] = binary_predictions[col].astype(int)


In [None]:
from google.colab import files

# Save and download Imaginary1 with all dimension columns as CSV in order to inspect the results manually and with BERTopic
OUTPUT_PATH = "/content/imaginary1_with_dimensions.csv"
imaginary1_articles.to_csv(OUTPUT_PATH, index=False)
files.download(OUTPUT_PATH)

# Second phase step 2 - BERTopic and manual reading to analyze articles within each dimension and identify the full set of characteristics of the imaginary

BERTopic can be applied, as needed, to the articles within each dimension to cluster semantically similar texts and facilitate the researcher’s interpretation of the imaginaries. This enables granular analysis of imaginaries and supports a systematic examination of the SIPCs framework’s elements, with manual reading by the researcher guided by NLI filters and BERTopic-based clustering. BERTopic is advisable when the number of articles in a dimension is sufficiently large; otherwise, the researcher can rely on manual reading. As an illustrative case, the code below applies BERTopic to the articles that NLI classified as belonging to the Object dimension. The same code can be reused for any other dimension by substituting the corresponding column.
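
The choice between BERTopic and manual reading can be made explicit with a small helper. The sketch below is purely illustrative: this notebook does not fix a number of articles above which clustering is advisable, so `min_docs=50` is an arbitrary placeholder, and the function name `choose_analysis` and the counts are made up.

```python
def choose_analysis(dimension_counts, min_docs=50):
    """Map each dimension to "BERTopic" or "manual reading" by article count.

    min_docs is an arbitrary placeholder, not a value taken from the pipeline.
    """
    return {
        dimension: "BERTopic" if count >= min_docs else "manual reading"
        for dimension, count in dimension_counts.items()
    }

# Made-up counts of articles labelled 1 in each dimension
counts = {"Object dimension": 180, "Implications": 35, "Spatio_temporal": 12}

plan = choose_analysis(counts)
for dimension, method in plan.items():
    print(f"{dimension}: {method}")
```

In practice the threshold would be chosen by the researcher, for example by checking whether BERTopic still produces interpretable clusters on the subset.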

In [None]:
# Apply BERTopic only to articles with Object dimension = 1 (within Imaginary1)

# 1) Select subset
object_articles = imaginary1_articles[imaginary1_articles["Object dimension"] == 1].copy()
docs_object = object_articles["Content"].astype(str).tolist()

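# 2) Compute embeddings using the existing embedding_model
# (no device argument: `device` was reassigned to an integer for the classifier above, and the model already sits on its own device)
embeddings_object = embedding_model.encode(
    docs_object,
    batch_size=1,
    show_progress_bar=True,
    convert_to_numpy=True
)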

# 3) BERTopic on the Object subset
topic_model_object = BERTopic(
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True
)
object_topics, object_probs = topic_model_object.fit_transform(docs_object, embeddings=embeddings_object)

# 4) Store results back
object_articles["Object subtopic"] = object_topics
object_articles["Object subtopic probability"] = [p.max() if p is not None else 0 for p in object_probs]

imaginary1_articles.loc[object_articles.index, "Object subtopic"] = object_articles["Object subtopic"]
imaginary1_articles.loc[object_articles.index, "Object subtopic probability"] = object_articles["Object subtopic probability"]

# Use BERTopic's visualizations to examine the clusters of the Object subset.
print(topic_model_object.get_topic_info())
topic_model_object.visualize_barchart().show()

# For each Object cluster, randomly sample about half of the articles
sample_object_df = (
    object_articles.groupby('Object subtopic')
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)
