# This notebook will walk you through the steps carried out in "method D" of Mingyar et al. 2025

First we import the necessary packages. "is_tpt_ref" is not necessary and is not actually used in this implementation, but it was imported on a trial basis. I found it drastically slowed down our pipeline, and generally the juice wasn't worth the squeeze. If you want to try it out, it's on my GitHub.

In [2]:
import random
import torch
import numpy as np 
import pandas as pd
from transformers import AutoModel
from time import perf_counter as timer
from sentence_transformers import util, SentenceTransformer

import textwrap
#from is_ref import ReferenceClassifier


##This runs everything using the GPU (CUDA) if available. If not, use CPU. CPU is quite slow for large datasets. 
device = "cuda" if torch.cuda.is_available() else "cpu"


#Load in the dataframe with the text already parsed from the PDFs and embedded (see embedding notebook for details of how to do this)
full_text_and_embeddings = pd.read_pickle("ajp_perc_prper_tpt_full_text_embeddings_2.pkl")
full_text_and_embeddings=full_text_and_embeddings.rename(columns={"Full Text Embedding":"embedding"})

# Load embeddings onto GPU
embeddings = torch.tensor(np.array(full_text_and_embeddings["embedding"].tolist()), dtype=torch.float32).to(device)

# Using Jina:
embedding_model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True, device_map=device)

df=full_text_and_embeddings



Now we are going to load in the researcher-coded dataset. The replacements below are needed because the researchers were not consistent in how they wrote the category labels (e.g. "Teacher" vs "teacher"); this just normalizes them.

The asterisks are a symbol left over from coding and are not needed in this analysis, so they are stripped from the labels.

In [3]:
csv=pd.read_excel("paper_labels_M_S_2.xlsx")

# regex=False makes "*" a literal character; as a regular expression a bare "*" is invalid
csv['Category (M)']=csv['Category (M)'].str.replace("*", "", regex=False).replace("Journal business", "journal business").replace("Teacher", "teacher").replace('teaching', 'teacher').replace('Content', 'content').replace('content  ', 'content').replace('Student', 'student').replace("Content    ", "content").replace("Content  ", "content").replace("Teacher  ", "teacher").replace("Teacher ", "teacher").replace("Teaching", "teacher")
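
If the label variants differ only by case, stray whitespace, and asterisks, the same normalization can be written as a few string operations. This is a minimal sketch on made-up labels: the values in `raw` are illustrative, and the single `teaching` to `teacher` mapping is an assumption about which synonyms occur in the spreadsheet.

```python
import pandas as pd

# Made-up labels illustrating the kinds of inconsistencies described above
raw = pd.Series(["Teacher ", "teaching", "Content  ", "*student", "Journal business", None])

clean = (
    raw.str.replace("*", "", regex=False)  # drop leftover asterisks (literal match)
       .str.strip()                        # remove stray leading/trailing spaces
       .str.lower()                        # unify capitalization
       .replace({"teaching": "teacher"})   # fold synonyms into one label
)
print(clean.tolist())
```

Missing labels stay missing, so unlabeled rows are not accidentally turned into a category.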


98 papers should have been loaded from the spreadsheet, but let's double check. 

In [4]:
len(csv)

98

It's good to check and make sure there are no accidents in the spreadsheet, such as a DOI that got put in the wrong place and is therefore doubled. Let's check for that. 

In [5]:
def check_doubles(csv):
    # Print each DOI that appears more than once in the spreadsheet (once per DOI)
    dois = csv['Doi'].dropna()
    for doi in dois[dois.duplicated()].unique():
        print(doi)
#check_doubles(csv)

Now that we're sure our human-labeled data look good, let's move on to defining our themes we're going to search over. 

The "t1...t4" labels are for readability and consistency. When implemented for deductive analysis, they wouldn't exist, but because we're cross-referencing with human-labeled data, they are necessary. 

The "s1...s4" labels are the actual "search queries" that you are entering. 

In [6]:

t1 = "Teaching students."
t2 = "Student focus."
t3 = "Physics content."
t4= "Journal business."

query_list = [t1, t2, t3, t4]

# Search query strings (these are encoded in the next cell)
s1="Teaching. Laboratory equipment. Teaching methods." #Teacher
s2="Student belonging. Student focused. Student agency." #Student
s3="Physics content. Physics material. Math. Derivations." #Physics content
s4="Editorials, book reviews, announcements, obituaries. Journal business. Reports on business. "


Now that the search queries are defined, we need to convert them into embeddings. This needs to be redone every time the queries are modified, so the embeddings cannot be computed once and reused unless the same queries are used every time. 

There are a number of embedding models to choose from, some free and some not. Generally speaking, the longer the vectors a model produces, the more context it can capture and the better your results will tend to be. It's also important to be aware of the maximum number of tokens that your model can take, given that we are embedding entire papers. 

This model *must* be the same model used for embedding the text of the paper. You *cannot* mix and match. 
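
One way to act on the token-limit caveat is to flag papers whose estimated length exceeds the model's input limit. The sketch below is an illustration under stated assumptions: `max_tokens = 8192` is a placeholder to be replaced with the limit documented for your model, the DOIs are made up, and the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer count.

```python
import pandas as pd

max_tokens = 8192      # placeholder: check your embedding model's documented limit
chars_per_token = 4    # rough heuristic, not a tokenizer

# Character counts for a few hypothetical papers
papers = pd.DataFrame({"doi": ["paper-a", "paper-b", "paper-c"],
                       "char_count": [7234, 17496, 37322]})
papers["token_estimate"] = papers["char_count"] / chars_per_token
papers["too_long"] = papers["token_estimate"] > max_tokens
print(papers)
```

Flagged papers deserve a closer look, since how over-length input is handled depends on the model and library.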

In [7]:


query_list_verbose=[s1, s2, s3, s4]
query_embedding = embedding_model.encode(query_list_verbose, convert_to_tensor=True)

print(f"Query: {query_list}")

# Compute dot product similarity
start_time = timer()
dot_scores_1 = util.dot_score(a=query_embedding, b=embeddings).T  # Shape: (n_embeddings, 4), one column per query
end_time = timer()

print(f"Time take to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# Add columns for s1, s2, s3, s4 similarity to dataframe
for n, t in enumerate(query_list):
    df[t]= dot_scores_1[:, n].cpu().numpy()


# Load classifier -- not currently used. This slows down batch processing time by a lot, so is recommended for smaller datasets. 
#model_path = "trained_model.safetensors"
#classifier = ReferenceClassifier(model_path)


Query: ['Teaching students.', 'Student focus.', 'Physics content.', 'Journal business.']
Time take to get scores on 43607 embeddings: 0.01640 seconds.
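
The shape bookkeeping in the cell above can be checked on toy data. This sketch uses random unit vectors in place of real embeddings (the sizes are arbitrary, and plain NumPy stands in for `util.dot_score`) to show that the transposed score matrix has one row per paper and one column per query, and that for unit-length vectors the dot product equals the cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_rows(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

queries = unit_rows(rng.normal(size=(4, 8)))   # 4 queries, 8-dim toy embeddings
papers = unit_rows(rng.normal(size=(10, 8)))   # 10 toy papers

dot_scores = (queries @ papers.T).T            # shape (n_papers, n_queries)

# Cosine similarity computed explicitly, for comparison
norms = np.linalg.norm(papers, axis=1)[:, None] * np.linalg.norm(queries, axis=1)[None, :]
cosine = (papers @ queries.T) / norms

print(dot_scores.shape)
print(np.allclose(dot_scores, cosine))
```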


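We calculate the topic score for each paper: a softmax-like normalization of its four similarity scores, with the parameter `a` controlling topic-mixedness. For details, see Odden et al. 2024.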

In [8]:

# Default value of 'a' -- can adjust this from 0-90
a = -10

# Precompute the exponentials for efficiency
exp_values = np.exp(a * (1-df[query_list]))

# Compute the denominator for softmax-like normalization
denominator = exp_values.sum(axis=1)

# Create new score columns

for col in query_list:
    score_col = f"{col.strip()}_score"
    df[score_col] = exp_values[col] / denominator
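
Written out, the cell above computes exp(a(1 - s_i)) / sum_j exp(a(1 - s_j)) for each paper, where s_i is its similarity to query i. The constant factor exp(a) cancels, so this is the same as a softmax over -a * s_i. The sketch below checks that on the four similarity values of the first paper in the dataframe displayed further down in this notebook; the helper names `topic_score` and `softmax` are ours, not taken from Odden et al.

```python
import numpy as np

def topic_score(similarities, a=-10.0):
    """Softmax-like normalization, same form as the cell above."""
    weights = np.exp(a * (1.0 - similarities))
    return weights / weights.sum(axis=-1, keepdims=True)

def softmax(x):
    """Numerically stable softmax along the last axis."""
    shifted = np.exp(x - x.max(axis=-1, keepdims=True))
    return shifted / shifted.sum(axis=-1, keepdims=True)

# Similarities of the first paper to the four queries (from the displayed dataframe)
sims = np.array([0.525292, 0.338646, 0.458282, 0.336486])

scores = topic_score(sims, a=-10.0)
print(scores.round(3))
print(np.allclose(scores, softmax(10.0 * sims)))
```

A larger magnitude of `a` pushes the scores toward a single dominant topic, while values near zero flatten them toward an even split across the four topics.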


We currently have two dataframes moving around, one from the human-labeled dataset and one from the papers' embeddings. We need to merge the two together on their identifying column (doi).

In [9]:
#Grabs labels from the CSV and merges them into the dataframe. The merged dataframe is named df_full_text

weighted_score_ql=[]
for query in query_list:
    weighted_score_ql.append(f"{query}_score_weighted")


csv=csv.rename(columns={"Doi": "doi"})
csv['doi']=csv['doi'].str.replace(" ", "")
csv['doi']=csv['doi'].str.replace("\'", "")
csv['doi']=csv['doi'].str.replace("https://doi.org/", "")

df_full_text=pd.merge(df, csv[['Category (M)', 'doi']], on='doi', how="left")
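
As an optional safeguard, `pd.merge` can report on the join itself. The sketch below uses made-up DOIs and is not part of the original pipeline: `validate="one_to_one"` raises an error if a DOI is duplicated on either side, and `indicator=True` adds a `_merge` column recording whether each row found a label.

```python
import pandas as pd

# Made-up DOIs standing in for the embedded papers and the hand-labeled subset
papers = pd.DataFrame({"doi": ["10.0000/a", "10.0000/b", "10.0000/c"]})
labels = pd.DataFrame({"doi": ["10.0000/a", "10.0000/c"], "Category (M)": ["t", "c"]})

merged = pd.merge(
    papers, labels, on="doi", how="left",
    validate="one_to_one",  # raise if a DOI is duplicated on either side
    indicator=True,         # add a "_merge" column: "both" or "left_only"
)
counts = merged["_merge"].value_counts()
print(counts)
```

Because this is a left merge, a labeled DOI that is absent from the larger dataframe is dropped silently, so comparing the two sets of DOIs remains a useful separate check.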



Let's check to see if everything merged as expected. 

Here we're making sure that none of the hand-labeled items were missed or dropped by cross checking it with the larger dataset. 

In [11]:

csv_dois = set(csv['doi'].dropna())
labels_dois = set(df_full_text['doi'].dropna())

# Find DOIs unique to each dataframe
dois_only_in_csv = csv_dois - labels_dois
dois_only_in_labels = labels_dois - csv_dois

# Filter rows based on unique DOIs
csv_unique_rows = csv[csv['doi'].isin(dois_only_in_csv)]
labels_unique_rows = df_full_text[df_full_text['doi'].isin(dois_only_in_labels)]

# Display results
print(f"Found {len(csv_unique_rows)} rows unique to 'csv' dataframe")
print(f"Found {len(labels_unique_rows)} rows unique to 'df_labels' dataframe")

print("\nRows unique to 'csv' dataframe:")
display(csv_unique_rows)

print("\nRows unique to 'df_labels' dataframe:")
display(labels_unique_rows)

Found 0 rows unique to 'csv' dataframe
Found 43509 rows unique to 'df_labels' dataframe

Rows unique to 'csv' dataframe:


Unnamed: 0,Title,Abstract,Journal,Category (M),doi,Category (S)



Rows unique to 'df_labels' dataframe:


Unnamed: 0,full_text,doi,year,journal,char_count,token_count,sentences,sentences_count,sentence_chunks,num_chunks,embedding,Teaching students.,Student focus.,Physics content.,Journal business.,Teaching students._score,Student focus._score,Physics content._score,Journal business._score,Category (M)
0,Make a Mystery Circuit with a Bar Light Fixtur...,10.1119/1.2715425,,tpt,7234,1808.50,[Make a Mystery Circuit with a Bar Light Fixtu...,81,[[Make a Mystery Circuit with a Bar Light Fixt...,17,"[tensor(-0.0012), tensor(0.0354), tensor(0.108...",0.525292,0.338646,0.458282,0.336486,0.550147,0.085091,0.281489,0.083273,
1,AGOLDEN OLDIE-ABLAOK BOX OIROUIT \r\nClifton K...,10.1119/1.2343976,,tpt,4633,1158.25,[AGOLDEN OLDIE-ABLAOK BOX OIROUIT \r\nClifton ...,58,[[AGOLDEN OLDIE-ABLAOK BOX OIROUIT \r\nClifton...,12,"[tensor(0.0039), tensor(-0.0006), tensor(0.114...",0.547897,0.314111,0.479536,0.343349,0.577820,0.055779,0.291679,0.074722,
2,Modeling Electricity: Model-Based Inquiry with...,10.1119/1.4745686,,tpt,17496,4374.00,[Modeling Electricity: Model-Based Inquiry wit...,140,[[Modeling Electricity: Model-Based Inquiry wi...,28,"[tensor(0.0564), tensor(-0.0816), tensor(0.150...",0.533192,0.390003,0.500644,0.263966,0.492910,0.117736,0.355970,0.033384,
3,"Two Approaches to Learning Physics \r\n""I look...",10.1119/1.2342910,,tpt,37322,9330.50,"[Two Approaches to Learning Physics \r\n""I loo...",324,"[[Two Approaches to Learning Physics \r\n""I lo...",65,"[tensor(0.0499), tensor(-0.1132), tensor(0.178...",0.538308,0.392010,0.579177,0.325154,0.350258,0.081101,0.527082,0.041559,
4,"\r\nJochen Kuhn and Patrik Vogt, Column Editor...",10.1119/1.4865529,,tpt,6621,1655.25,"[\r\nJochen Kuhn and Patrik Vogt, Column Edito...",76,"[[\r\nJochen Kuhn and Patrik Vogt, Column Edit...",16,"[tensor(0.0292), tensor(-0.0519), tensor(0.055...",0.378926,0.262477,0.467860,0.276191,0.243691,0.076051,0.593028,0.087230,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43602,Examining faculty choices while implementing t...,10.1119/perc.2023.pr.Willison,2023.0,perc,24881,6220.25,[Examining faculty choices while implementing ...,203,[[Examining faculty choices while implementing...,41,"[tensor(0.1635), tensor(-0.1263), tensor(0.057...",0.378620,0.383887,0.397787,0.285023,0.273406,0.288194,0.331168,0.107231,
43603,Analyzing the dimensionality of the Energy and...,10.1119/perc.2023.pr.Wu,2023.0,perc,24460,6115.00,[Analyzing the dimensionality of the Energy an...,227,[[Analyzing the dimensionality of the Energy a...,46,"[tensor(0.0763), tensor(-0.1726), tensor(0.054...",0.371197,0.347483,0.457647,0.243488,0.225147,0.177614,0.534458,0.062781,
43604,Students’ use of symmetry as a tool for sensem...,10.1119/perc.2023.pr.Young,2023.0,perc,27598,6899.50,[Students’ use of symmetry as a tool for sense...,264,[[Students’ use of symmetry as a tool for sens...,53,"[tensor(0.1260), tensor(-0.0853), tensor(0.105...",0.415883,0.419996,0.427131,0.264395,0.295781,0.308200,0.330995,0.065023,
43605,Analyzing Physics Majors’ Specialization Low I...,10.1119/perc.2023.pr.Zohrabi_Alaee,2023.0,perc,27954,6988.50,[Analyzing Physics Majors’ Specialization Low ...,222,[[Analyzing Physics Majors’ Specialization Low...,45,"[tensor(0.0705), tensor(-0.1829), tensor(0.015...",0.390673,0.501591,0.496636,0.326901,0.134306,0.407200,0.387513,0.070980,


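There are zero unique rows in the csv, as expected, so every hand-labeled paper was matched. The 43509 rows unique to the larger dataframe are consistent with the 43607 embedded papers minus the 98 hand-labeled ones. At this point, the merged dataframe carries the researcher-defined labels in its 'Category (M)' column. 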

Now we need to define the function that calculates the refined centroids. 

For more detail about the technique, see the associated paper. 

In [18]:
from sklearn.metrics.pairwise import cosine_similarity
import scipy.ndimage as nd
import numpy as np

  

def centroid_labeling_whole_text(df, K=1, whole_text=True):
    """
    Perform advanced centroid-based topic labeling using full-text embeddings.
    
    This function implements an enhanced centroid-based classification approach that:
    1. Identifies archetypal papers for each topic category based on raw scores
    2. Creates topic centroids from full-text embeddings of archetypal papers
    3. Computes cosine similarity between all papers and topic centroids
    4. Applies Topic Score to generate topic probabilities
    5. Assigns the most likely topic label to each paper
    
    This approach differs from sentence-based methods by using document-level 
    embeddings, potentially capturing broader semantic themes and overall document structure.
    
    Parameters
    ----------
    df : pandas.DataFrame
        Input dataframe containing papers to be labeled. Must include:
        - 'doi': Document identifiers for each paper
        - '{label}_score' columns for each label in query_list (raw scores)
        - 'embedding': Full-text embedding of each paper (used when whole_text=True)
    K : int, optional
        Number of top-scoring archetypal papers to use per topic category for 
        centroid creation. Automatically adjusted if fewer papers are available
        (default: 1)
    whole_text : bool, optional
        Whether to use full-text embeddings. Currently only True is supported
        (default: True)
        
    Returns
    -------
    pandas.DataFrame
        Input dataframe enhanced with additional columns:
        - '{label}_Advanced Centroid dotp': Cosine similarity scores to each topic centroid
        - '{label}_Advanced Centroid TS': Topic probabilities
        - 'Main Group Advanced Centroid': Predicted topic label (highest scoring category)
        
    Global Dependencies
    -------------------
    Requires the following global variables to be defined:
    - query_list : list
        List of topic labels/categories to classify papers into
    - cosine_similarity : function
        Imported from sklearn.metrics.pairwise; used to compare papers to centroids
    - a : float
        Temperature scaling factor for topic score (controls topic-mixedness)
    - nd : module
        Numerical computation module with rotate function (scipy.ndimage)
        
    Notes
    -----
    - Uses raw scores ('{label}_score') rather than weighted scores for archetypal selection
    - Automatically handles cases where K exceeds the number of available papers
    - Cosine similarity is used instead of dot products for better normalized comparison
    - The topic score transformation includes array rotation operations for proper alignment
    - Currently only supports whole_text=True mode; sentence-level fallback not implemented
    
    Algorithm Steps
    ---------------
    1. **Archetypal Selection**: Select top K papers per category based on raw scores
    2. **Centroid Creation**: Average full-text embeddings of archetypal papers 
       to create topic centroids
    3. **Similarity Calculation**: Compute cosine similarity between all paper 
       embeddings and topic centroids
    4. **Topic Scoring**: Topic score to convert similarities 
       to probabilities with array transformations
    5. **Label Assignment**: Assign topic with highest probability as main group
    
    Raises
    ------
    ValueError
        If whole_text=False (not currently supported)
    KeyError
        If required columns are missing from the input dataframe
        
    Examples
    --------
    >>> # Assuming global variables are properly set up
    >>> labeled_df = centroid_labeling_whole_text(papers_df, K=3, whole_text=True)
    >>> print(labeled_df['Main Group Advanced Centroid'].value_counts())
    >>> # Check similarity scores
    >>> similarity_cols = [col for col in labeled_df.columns if 'dotp' in col]
    >>> print(labeled_df[similarity_cols].describe())
    
    See Also
    --------
    cosine_similarity : Used for measuring embedding similarity
    numpy.mean : Used for averaging embeddings
    centroid_labeling_sentences : Alternative sentence-level approach
    """
    df_labels = df.copy()
    print(f"K: {K} whole_text: {whole_text}")

    # Step 1: Identify archetypal papers
    archetypal_papers = {}
    for label in query_list:
        k_actual = min(K, len(df_labels))
        top_k_dois = df_labels.sort_values(by=f"{label}_score", ascending=False).head(k_actual)['doi'].tolist()
        archetypal_papers[label] = top_k_dois
    
    # Step 2: Compute centroids
    if not whole_text:
        raise ValueError("Only whole_text=True is currently supported")
    centroids = {}
    for label, dois in archetypal_papers.items():
        top_embeddings = []
        for doi in dois:
            # Use full-text embedding directly
            paper_embedding = df_labels[df_labels['doi'] == doi]['embedding'].values
            if len(paper_embedding) > 0:
                top_embeddings.append(paper_embedding[0])
            else:
                print(f"No full-text embedding found for {doi}")
        # Average once, after all archetypal embeddings for this label are collected
        centroids[label] = np.mean(np.stack(top_embeddings), axis=0)

    # Step 3: Cosine similarity to centroids
    # This uses cosine similarity instead of dot product. The vectors used here are normalized so that isn't a problem
    paper_emb = np.array(df_labels['embedding'].tolist())
    for label in query_list:
        centroid = np.asarray(centroids[label]).reshape(1, -1)
        # Compare every paper to the centroid directly; this avoids building an n_papers x n_papers matrix
        df_labels[label+"_Advanced Centroid dotp"] = cosine_similarity(paper_emb, centroid).ravel()

    # Step 4: Topic scoring
    topic_scores=[]
    ac_dotp_labels=[label+"_Advanced Centroid dotp" for label in query_list]
    for index, row in df_labels.iterrows():
        exp_values =np.exp(a* (1-row[ac_dotp_labels].values.astype(float)))
        denominator = exp_values.sum(axis=0)
        topic_scores.append( exp_values / denominator )
    # Reorient from (n_papers, n_topics) to (n_topics, n_papers) so that each row holds one topic's scores
    topic_scores=np.flip(nd.rotate(np.array(topic_scores),90), axis=0)
    ac_ts_labels=[label+"_Advanced Centroid TS" for label in query_list]
    for n, label in enumerate(ac_ts_labels):
        df_labels[label]=topic_scores[n]
    
    # Step 5: Identify main group
    df_labels['Main Group Advanced Centroid'] = df_labels[[f"{label}_Advanced Centroid TS" for label in query_list]].idxmax(axis=1)
    df_labels['Main Group Advanced Centroid'] = df_labels['Main Group Advanced Centroid'].str.replace("_Advanced Centroid TS", "")
    
    
    return df_labels
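
The steps in the function above can also be traced on a tiny example. The sketch below is a vectorized NumPy illustration of the same idea, not the notebook's function: the name `refine_with_centroids`, the six two-dimensional "papers", and their initial scores are all made up for the purpose.

```python
import numpy as np

def refine_with_centroids(embeddings, initial_scores, k=2, a=-10.0):
    """Toy version of the refinement: centroids from top-k papers, then relabel."""
    # Normalize rows so that dot products are cosine similarities
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Row indices of the k best-scoring papers for each topic: shape (k, n_topics)
    top_k = np.argsort(-initial_scores, axis=0)[:k]

    # Average the archetypal embeddings per topic: shape (n_topics, dim)
    centroids = emb[top_k].mean(axis=0)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    # Cosine similarity of every paper to every centroid: shape (n_papers, n_topics)
    cos = emb @ centroids.T

    # Topic score, same form as earlier in the notebook
    weights = np.exp(a * (1.0 - cos))
    topic_scores = weights / weights.sum(axis=1, keepdims=True)
    return cos, topic_scores, topic_scores.argmax(axis=1)

# Six made-up papers in a 2-D embedding space, two topics
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                       [0.0, 1.0], [0.1, 0.9], [0.4, 0.6]])
initial_scores = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4],
                           [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]])

cos, topic_scores, labels = refine_with_centroids(embeddings, initial_scores, k=2)
print(labels)
```

The last paper has tied initial scores, yet it is assigned to the second topic because its embedding lies closer to that topic's centroid; this is the kind of case the refinement is meant to resolve.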


Now we need to define the function that evaluates the model on multiple metrics. This is where the hand-labeled data are used. 

In [19]:


def evaluate_labeling_recall(df, teacher_col, student_col, content_col, journal_col, main_group):
    # Mapping of category labels to full column names
    category_map = {
        't': teacher_col.strip(),
        's': student_col.strip(),
        'c': content_col.strip(),
        'jb': journal_col.strip()
    }

    def compare_labels(row):
        category = row['Category (M)']
        if pd.isna(category):
            return np.nan
        expected_group = category_map.get(category.strip().lower())
        return expected_group == row[main_group]

    df['Correctly labeled'] = df.apply(compare_labels, axis=1)
    valid = df[~df['Category (M)'].isna()].copy()

    print("Here:")
    print(len(valid))

    total = len(valid)
    correct = valid['Correctly labeled'].sum()
    percent_correct = 100 * correct / total if total > 0 else 0

    print(f"1. Total number of entries with a value in 'Category (M)': {total}")
    print(f"2. Percent correctly labeled (accuracy): {percent_correct:.2f}%")

    # Per-category accuracy (recall), precision, false positive rate
    print("3. Detailed metrics per category:")
    for cat, expected_val in category_map.items():
        # True Positives: predicted = expected = this category
        tp = valid[(valid['Category (M)'].str.lower() == cat) & (valid[main_group] == expected_val)]
        
        # False Negatives: actual is this category, but predicted is not
        fn = valid[(valid['Category (M)'].str.lower() == cat) & (valid[main_group] != expected_val)]

        # False Positives: predicted is this category, but actual is not
        fp = valid[(valid['Category (M)'].str.lower() != cat) & (valid[main_group] == expected_val)]

        # True Negatives: actual and predicted are both *not* this category
        tn = valid[(valid['Category (M)'].str.lower() != cat) & (valid[main_group] != expected_val)]

        tp_count = len(tp)
        fn_count = len(fn)
        fp_count = len(fp)
        tn_count = len(tn)

        recall = tp_count / (tp_count + fn_count) if (tp_count + fn_count) > 0 else 0
        precision = tp_count / (tp_count + fp_count) if (tp_count + fp_count) > 0 else 0
        fpr = fp_count / (fp_count + tn_count) if (fp_count + tn_count) > 0 else 0
        acc = (tp_count + tn_count) / (tp_count + tn_count + fp_count + fn_count)

        print(f"   {cat.title()}:")
        print(f"      Recall (TP / TP + FN): {recall:.2f}")
        print(f"      Precision (TP / TP + FP): {precision:.2f}")
        print(f"      False Positive Rate (FP / FP + TN): {fpr:.2f}")
        print(f"      Accuracy (TP + TN / All): {acc:.2f}")

    # Label distributions
    print("4. Label distribution:")
    cat_counts = valid['Category (M)'].str.lower().value_counts(normalize=True)
    main_counts = valid[main_group].value_counts(normalize=True)
    
    print("Category (M):")
    for k, v in cat_counts.items():
        print(f"     {k.title()}: {v:.2%}")
    print("Automated:")
    for k, v in main_counts.items():
        print(f"     {k[:40].strip()}: {v:.2%}")

    # Return doi and correctness
    result = valid[['doi', 'Correctly labeled']].reset_index(drop=True)
    return result
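
A confusion matrix is a compact way to see where the misclassifications go, and recall and precision can be read off its rows and columns. The sketch below uses six made-up label pairs (the `human` and `auto` values are illustrative, not results from this study) and `pd.crosstab`; it assumes every category occurs at least once in both series, since an empty row or column would lead to a division by zero.

```python
import numpy as np
import pandas as pd

labels = ["t", "s", "c", "jb"]
human = pd.Series(["t", "t", "s", "c", "c", "jb"], name="human")
auto = pd.Series(["t", "c", "s", "c", "t", "jb"], name="automated")

# Rows: human label, columns: automated label
confusion = (
    pd.crosstab(human, auto)
    .reindex(index=labels, columns=labels, fill_value=0)
)

tp = np.diag(confusion.to_numpy())                    # correct predictions per category
recall = tp / confusion.sum(axis=1).to_numpy()        # TP / (TP + FN): row totals
precision = tp / confusion.sum(axis=0).to_numpy()     # TP / (TP + FP): column totals

summary = pd.DataFrame({"recall": recall, "precision": precision}, index=labels)
print(confusion)
print(summary)
```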

In [20]:
df1=centroid_labeling_whole_text(df_full_text, K=4) #This processes the centroids and labels the papers by their most likely group

K: 4 whole_text: True


In [22]:
mg1="Main Group Advanced Centroid" #Advanced centroids based on full text

#mg="MG Score Full Embedding"

##This is method D, the one that matters
results = evaluate_labeling_recall(
    df1,
    query_list[0], #teacher_col
    query_list[1], #student_col
    query_list[2], #content_col
    query_list[3], #journal_col
    mg1)


Here:
98
1. Total number of entries with a value in 'Category (M)': 98
2. Percent correctly labeled (accuracy): 60.20%
3. Detailed metrics per category:
   T:
      Recall (TP / TP + FN): 0.78
      Precision (TP / TP + FP): 0.44
      False Positive Rate (FP / FP + TN): 0.31
      Accuracy (TP + TN / All): 0.71
   S:
      Recall (TP / TP + FN): 0.88
      Precision (TP / TP + FP): 0.88
      False Positive Rate (FP / FP + TN): 0.02
      Accuracy (TP + TN / All): 0.96
   C:
      Recall (TP / TP + FN): 0.64
      Precision (TP / TP + FP): 0.66
      False Positive Rate (FP / FP + TN): 0.17
      Accuracy (TP + TN / All): 0.77
   Jb:
      Recall (TP / TP + FN): 0.23
      Precision (TP / TP + FP): 0.67
      False Positive Rate (FP / FP + TN): 0.04
      Accuracy (TP + TN / All): 0.77
4. Label distribution:
Category (M):
     C: 33.67%
     Jb: 26.53%
     T: 23.47%
     S: 16.33%
Automated:
     Teaching students.: 41.84%
     Physics content.: 32.65%
     Student focus.: 16.33%
   

import joblib

joblib.dump(df1, "full_text_refined_df.pkl")