<a href="https://colab.research.google.com/github/BarbaraMcG/darwin-semantic-change/blob/main/Semantic_change_Darwin_BMcG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic change in Darwin letters using contextualised embeddings



Barbara McGillivray

This notebook contains the code for processing the corpus of Darwin's letters and extracting contextualised embeddings from it. The methods used here focus on contextualised embeddings to answer the question: How does Darwin’s conceptual vocabulary evolve across the major phases of his scientific career, as reflected in his personal correspondence?

NB: As some of the steps can take a long time to run, it is advisable to only run the first part of this notebook at the beginning of the project.

NB: remember to select darwin-env as the Python environment!

## Note on data

We will use the letters from the Darwin Correspondence Project (https://www.darwinproject.ac.uk/), which can be freely downloaded from https://github.com/cambridge-collection/darwin-correspondence-data ("xml" folder).

# Corpus processing of Darwin letters

## 1. Initialisation

In [1]:
import sys
sys.executable

'/opt/anaconda3/envs/darwin-env/bin/python'

Download spaCy's English language model:

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 8.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Import libraries

In [25]:
import os 
from bs4 import BeautifulSoup
#from google.colab import drive
import csv
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from scipy import spatial
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import time
from sklearn.decomposition import PCA
import re
import spacy
from statistics import mean
from langdetect import detect
# to make our plot outputs appear and be stored within the notebook:
%matplotlib inline 
from transformers import AutoTokenizer, AutoModel
import torch
from collections import defaultdict
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/barbaramcgillivray/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/barbaramcgillivray/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Reading the files

I define a dataframe from the CSV file of Darwin's processed letters:

In [4]:
in_folder = "/Users/barbaramcgillivray/OneDrive - King's College London/Research/2024/Darwin/data/preprocessed_text/" 
df = pd.read_csv(os.path.join(in_folder, 'transcription_tokens_onlyDarwin.csv'), sep = "\t")
out_folder = "/Users/barbaramcgillivray/OneDrive - King's College London/Research/2024/Darwin/Semantic_change_output/BERT/" 

In [5]:
df.shape

(8136, 11)

In [6]:
df.head()

Unnamed: 0,File,Year,Sender,Receiver,Transcription,Transcription_spacy,Tokens,Lemmas,Lemmas_clean,Lemmas_nostop,language
0,DCP-LETT-12349.xml,1879,"Darwin, C. R.","Payne, A. H.",I have no objection to express my opinion on t...,I have no objection to express my opinion on t...,"['I', 'have', 'no', 'objection', 'to', 'expres...","['I', 'have', 'no', 'objection', 'to', 'expres...","['I', 'have', 'no', 'objection', 'to', 'expres...","['objection', 'express', 'opinion', 'subject',...",en
1,DCP-LETT-7806.xml,1871,"Darwin, C. R.","Darwin, Francis",Very many thanks for all that you have done fo...,Very many thanks for all that you have done fo...,"['Very', 'many', 'thanks', 'for', 'all', 'that...","['very', 'many', 'thank', 'for', 'all', 'that'...","['very', 'many', 'thank', 'for', 'all', 'that'...","['many', 'thank', '.', '—', 'earth', 'Mivart',...",en
2,DCP-LETT-11640.xml,1878,"Darwin, C. R.","Flower, W. H.",You will remember the dried wings of the goose...,You will remember the dried wings of the goose...,"['You', 'will', 'remember', 'the', 'dried', 'w...","['you', 'will', 'remember', 'the', 'dry', 'win...","['you', 'will', 'remember', 'the', 'dry', 'win...","['remember', 'dry', 'wing', 'goose', '&', 'wis...",en
3,DCP-LETT-8650F.xml,1874,"Darwin, C. R.","Spengel, J. W.",I thank you most sincerely for your kindness i...,I thank you most sincerely for your kindness i...,"['I', 'thank', 'you', 'most', 'sincerely', 'fo...","['I', 'thank', 'you', 'most', 'sincerely', 'fo...","['I', 'thank', 'you', 'most', 'sincerely', 'fo...","['thank', 'sincerely', 'kindness', 'send', 'Fo...",en
4,DCP-LETT-13731.xml,1882,"Darwin, C. R.","Jenner, William",I am much obliged for the honour of your invit...,I am much obliged for the honour of your invit...,"['I', 'am', 'much', 'obliged', 'for', 'the', '...","['I', 'be', 'much', 'oblige', 'for', 'the', 'h...","['I', 'be', 'much', 'oblige', 'for', 'the', 'h...","['much', 'oblige', 'honour', 'invitation', 'at...",en


## 3. Using contextualised word embeddings

I extract a list of lexical items with their frequencies:

In [7]:
import ast
from collections import Counter

# Convert Lemmas_nostop column from string to list, if needed
df['Lemmas_nostop'] = df['Lemmas_nostop'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Flatten the list of lemmas
all_lemmas = [lemma for row in df['Lemmas_nostop'] if isinstance(row, list) for lemma in row]

# Filter out non-alphabetic or too-short tokens
filtered_lemmas = [lemma for lemma in all_lemmas if lemma.isalpha() and len(lemma) > 1]

# Count frequencies
lemma_freq = Counter(filtered_lemmas)

# Create sorted frequency DataFrame
lemma_freq_df = (
    pd.DataFrame(lemma_freq.items(), columns=['Lemma', 'Frequency'])
    .sort_values(by='Frequency', ascending=False)
    .reset_index(drop=True)
)

lemma_freq_df.head(20)


Unnamed: 0,Lemma,Frequency
0,much,8241
1,Darwin,7797
2,think,7268
3,one,6594
4,would,6374
5,see,6052
6,send,5495
7,good,5217
8,say,4917
9,know,4604


I load spaCy's stop-word list, which is used below to filter out function words:

In [8]:
nlp = spacy.load("en_core_web_sm")
stopWords = nlp.Defaults.stop_words  # a set of lowercase stopwords
#additional_stopwords = {'much', 'one', 'would', 'shall'}
#stopWords.update(additional_stopwords)

I remove stop words and very short lemmas, then set a frequency threshold:

In [9]:
# Remove lemmas that are in the stopword list (assumes stopWords is defined in lowercase)
content_lemmas_df = lemma_freq_df[~lemma_freq_df['Lemma'].str.lower().isin(stopWords)].copy()

# Remove very short lemmas (≤2 characters)
content_lemmas_df = content_lemmas_df[content_lemmas_df['Lemma'].str.len() > 2]

# Reset index
content_lemmas_df.reset_index(drop=True, inplace=True)

# Extract frequent lemmas (frequency ≥ 50) from the filtered set
frequent_lemmas = [
    row['Lemma'] for _, row in content_lemmas_df.iterrows()
    if row['Frequency'] >= 50
]

# Display result
frequent_lemmas[:20]


['Darwin',
 'think',
 'send',
 'good',
 'know',
 'write',
 'thank',
 'letter',
 'shall',
 'believe',
 'work',
 'hear',
 'great',
 'time',
 'case',
 'dear',
 'like',
 'sincerely',
 'hope',
 'plant']

Periodisation:

In [10]:
period_labels = {
    '1822–1859': (1822, 1859),
    '1860–1870': (1860, 1870),
    '1871–1882': (1871, 1882)
}

# Create a dictionary of filtered dataframes for each period (bounds are inclusive)
period_dfs = {label: df[df['Year'].between(start, end)]
              for label, (start, end) in period_labels.items()}

# Preview counts per period
for label, subdf in period_dfs.items():
    print(f"{label}: {len(subdf)} letters")

# Shorthand names (t1, t2, t3) used in the rest of the notebook
df_t1 = df[df['Year'].between(1822, 1859)]
df_t2 = df[df['Year'].between(1860, 1870)]
df_t3 = df[df['Year'].between(1871, 1882)]

1822–1859: 2146 letters
1860–1870: 2503 letters
1871–1882: 3487 letters
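
The periodisation can also be expressed as a single lookup from a year to its period, which makes it easy to check that boundary years fall where intended. The following is a minimal, dependency-free sketch; the helper name `assign_period` and the `PERIODS` constant are illustrative assumptions and are not part of the notebook.

```python
# Illustrative helper (not in the notebook): map a year to a period label.
PERIODS = {
    "t1": (1822, 1859),
    "t2": (1860, 1870),
    "t3": (1871, 1882),
}

def assign_period(year, periods=PERIODS):
    """Return the label whose inclusive (start, end) range contains year, else None."""
    for label, (start, end) in periods.items():
        if start <= year <= end:
            return label
    return None

# Boundary years should fall in the expected periods
print(assign_period(1859), assign_period(1860), assign_period(1882))
```

With the notebook's dataframe, something like `df['Year'].map(assign_period)` should add a period column that can then be used for grouping.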


I load the sciBERT model and tokenizer:

In [11]:
model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
device = torch.device("cpu")
model.to(device)
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(31090, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

Function to get the mean SciBERT embedding of a text (in the cells below, a lemma is represented by the letters in which it occurs):

In [12]:
def get_mean_embedding(text, tokenizer, model):
    """Returns the mean contextual embedding of a given text (token-level average)."""
    # Texts longer than 512 tokens are truncated to the model's maximum input length
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Average over all tokens (including [CLS] and [SEP]) and move the result to the CPU
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()


Sample list:

In [13]:
sample_lemmas = ['evolution', 'species', 'origin', 'variation', 'selection']

Check if these lemmas are frequent enough:

In [14]:
from collections import Counter

# Flatten all lemmas
all_lemmas = [lemma for row in df['Lemmas_nostop'] if isinstance(row, list) for lemma in row]
lemma_counts = Counter(all_lemmas)

# Show frequencies for sample lemmas
for lemma in sample_lemmas:
    print(f"{lemma}: {lemma_counts[lemma]}")


evolution: 77
species: 82
origin: 226
variation: 452
selection: 458


Run embedding extraction across time bins:

In [22]:
# Initialise embedding storage
embeddings = {lemma: {} for lemma in sample_lemmas}

# Loop through time periods
for period_name, df_period in zip(['t1', 't2', 't3'], [df_t1, df_t2, df_t3]):
    print(f"Processing period: {period_name}")
    for lemma in tqdm(sample_lemmas):
        # Get letters where lemma occurs (based on Lemmas_nostop)
        matched_texts = df_period[df_period['Lemmas_nostop'].apply(
            lambda lst: lemma in lst if isinstance(lst, list) else False
        )]['Transcription'].tolist()
        
        vectors = []
        for text in matched_texts:
            try:
                vec = get_mean_embedding(text, tokenizer, model)
                vectors.append(vec)
            except Exception as e:
                continue  # skip bad examples silently

        # Save average vector if we got any
        if vectors:
            mean_vec = np.mean(vectors, axis=0)
            embeddings[lemma][period_name] = mean_vec


Processing period: t1


100%|█████████████████████████████████████████████| 5/5 [00:44<00:00,  8.84s/it]


Processing period: t2


100%|█████████████████████████████████████████████| 5/5 [01:14<00:00, 14.83s/it]


Processing period: t3


100%|█████████████████████████████████████████████| 5/5 [00:36<00:00,  7.33s/it]


Check the embeddings are there:

In [23]:
# Example: number of lemmas with embeddings in each time period
for period in ['t1', 't2', 't3']:
    count = sum(1 for lemma in embeddings if period in embeddings[lemma])
    print(f"Lemmas with embeddings in {period}: {count}")


Lemmas with embeddings in t1: 4
Lemmas with embeddings in t2: 5
Lemmas with embeddings in t3: 5
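
The counts show that one sample lemma has no embedding in t1. A small check such as the following can list which lemma and which period are missing. It is a sketch on a toy dictionary with the same shape as `embeddings`; the placeholder vectors and the helper name `missing_periods` are assumptions for illustration.

```python
def missing_periods(embeddings, periods=("t1", "t2", "t3")):
    """Map each incomplete lemma to the list of periods for which it has no vector."""
    return {
        lemma: [p for p in periods if p not in by_period]
        for lemma, by_period in embeddings.items()
        if any(p not in by_period for p in periods)
    }

# Toy stand-in for the notebook's `embeddings` dict (vectors replaced by placeholders)
toy_embeddings = {
    "evolution": {"t2": [0.1, 0.2], "t3": [0.1, 0.3]},
    "species": {"t1": [0.2, 0.1], "t2": [0.2, 0.2], "t3": [0.3, 0.2]},
}
print(missing_periods(toy_embeddings))
```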


### Measure change

Compute cosine similarities between periods:

In [26]:
results = []
for lemma in sample_lemmas:
    if all(k in embeddings[lemma] for k in ['t1', 't2', 't3']):
        sim_t1_t2 = cosine_similarity([embeddings[lemma]['t1']], [embeddings[lemma]['t2']])[0][0]
        sim_t2_t3 = cosine_similarity([embeddings[lemma]['t2']], [embeddings[lemma]['t3']])[0][0]
        sim_t1_t3 = cosine_similarity([embeddings[lemma]['t1']], [embeddings[lemma]['t3']])[0][0]
        results.append({
            "Lemma": lemma,
            "Sim_t1_t2": sim_t1_t2,
            "Sim_t2_t3": sim_t2_t3,
            "Sim_t1_t3": sim_t1_t3
        })

results_df = pd.DataFrame(results)
print(results_df)



       Lemma  Sim_t1_t2  Sim_t2_t3  Sim_t1_t3
0    species   0.995748   0.981593   0.981416
1     origin   0.996708   0.995121   0.994874
2  variation   0.998142   0.994573   0.991467
3  selection   0.997554   0.995562   0.992499
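
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below implements it without dependencies on made-up toy vectors. It also illustrates that two vectors sharing a large common component score close to 1 even when their distinctive parts are orthogonal, which may be part of the reason the scores above are so high, since each period vector is an average over many whole-letter embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two small, orthogonal "period-specific" components (made up for illustration) ...
delta_1 = [0.1, 0.0, 0.0]
delta_2 = [0.0, 0.1, 0.0]
# ... added to a large component shared by both periods
shared = [1.0, 1.0, 1.0]
vec_1 = [s + d for s, d in zip(shared, delta_1)]
vec_2 = [s + d for s, d in zip(shared, delta_2)]

print(round(cosine(delta_1, delta_2), 4))  # the differences alone are orthogonal
print(round(cosine(vec_1, vec_2), 4))      # the full vectors are nearly parallel
```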


### Neighbours

To complement cosine similarity scores and gain qualitative insight into contextual change, we extract the top nearest neighbours for selected lemmas in each period using SciBERT embeddings. This helps interpret how a word’s meaning or usage evolves over time by observing its most similar contexts.



The following function normalises both the lemma and the tokens for matching by lowercasing them and removing non-word characters. It reconstructs whole words from WordPiece subtokens so that words split into several subtokens can be matched correctly, and it averages the subtoken embeddings to represent the full word. It also reports the number of texts in which the lemma was not found, for debugging.

In [59]:
import re

def normalize_text(text):
    # Basic normalisation: lowercase and remove all non-word characters (punctuation and whitespace)
    return re.sub(r'\W+', '', text.lower())

def extract_token_contexts(df_period, lemma):
    lemma_norm = normalize_text(lemma)
    token_vectors = []
    tokens_all = []
    not_found_count = 0

    for i, text in enumerate(df_period['Transcription']):
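        # Cheap pre-filter: skip texts whose normalised form does not contain the lemma
        # as a substring (exact word matching happens further down)
        if lemma_norm not in normalize_text(text):
            # Count skipped texts so the total can be reported after the loop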
            not_found_count += 1
            continue

        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
        with torch.no_grad():
            outputs = model(**inputs)
            embeddings = outputs.last_hidden_state.squeeze(0).cpu().numpy()
            tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze(0))

            # Reconstruct words from subtokens
            reconstructed_words = []
            current_word = ''
            current_vecs = []
            for tok, vec in zip(tokens, embeddings):
                if tok.startswith('##'):
                    current_word += tok[2:]
                    current_vecs.append(vec)
                else:
                    if current_word:
                        reconstructed_words.append((current_word, current_vecs))
                    current_word = tok
                    current_vecs = [vec]
            if current_word:
                reconstructed_words.append((current_word, current_vecs))

            # Match lemma with normalized reconstructed words
            for word, vecs in reconstructed_words:
                if normalize_text(word) == lemma_norm:
                    # Average vectors if multiple subtokens
                    avg_vec = np.mean(vecs, axis=0) if len(vecs) > 1 else vecs[0]
                    token_vectors.append(avg_vec)
                    tokens_all.append(word)

    if not_found_count > 0:
        print(f"Note: lemma '{lemma}' not found in {not_found_count} texts out of {len(df_period)}")

    return token_vectors, tokens_all
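
The word-reconstruction step inside `extract_token_contexts` can be seen in isolation in the sketch below, which merges WordPiece continuation pieces (those starting with `##`) into the preceding token. The helper name `merge_wordpieces` and the example split are assumptions for illustration; how SciBERT actually splits a given word depends on its vocabulary.

```python
def merge_wordpieces(tokens):
    """Merge WordPiece tokens into words; '##' marks a continuation piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: append it to the previous word without the marker
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

# Hypothetical tokenisation, for illustration only
pieces = ["[CLS]", "in", "##sect", "##ivor", "##ous", "plants", "[SEP]"]
print(merge_wordpieces(pieces))
```

In the notebook's function the same grouping is applied to the vectors as well, so that each reconstructed word carries the list of its subtoken embeddings.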


Compute most similar tokens in each period:

In [66]:
import re

def normalize_token(token):
    # Lowercase and remove every character that is not a letter a-z
    return re.sub(r'[^a-z]', '', token.lower())

def get_top_neighbours(lemma, df_period, top_n=10):
    target_vectors, _ = extract_token_contexts(df_period, lemma)
    if not target_vectors:
        print(f"[SKIP] No vectors found for {lemma} in this period.")
        return []

    # Collect token embeddings in this period
    token_embeddings = defaultdict(list)
    
    for text in df_period['Transcription']:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
        with torch.no_grad():
            outputs = model(**inputs)
            vectors = outputs.last_hidden_state.squeeze(0).cpu().numpy()
            tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze(0))

            for tok, vec in zip(tokens, vectors):
                norm_tok = normalize_token(tok)
                # Filter: alphabetic only, length > 3
                if norm_tok.isalpha() and len(norm_tok) > 3:
                    token_embeddings[norm_tok].append(vec)

    # Average embeddings for tokens with enough occurrences (>=2)
    token_avg_vecs = {tok: np.mean(vecs, axis=0) for tok, vecs in token_embeddings.items() if len(vecs) >= 2}

    # Mean vector for the target lemma
    target_mean = np.mean(target_vectors, axis=0)

    # Compute cosine similarity to all tokens
    similarities = {
        tok: cosine_similarity([target_mean], [vec])[0][0]
        for tok, vec in token_avg_vecs.items()
    }

    # Sort top neighbours by similarity
    top_neighbors = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:top_n]
    return top_neighbors
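
The final ranking step of `get_top_neighbours` can be illustrated on its own. The sketch below ranks a few made-up two-dimensional vectors by cosine similarity to a target and keeps the top n with `heapq.nlargest`. The vectors, the helper name `top_n_neighbours` and its `exclude` parameter are assumptions for illustration; the notebook itself uses `sorted` with scikit-learn's `cosine_similarity` and does not exclude the target word from its own neighbour list.

```python
import heapq
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_n_neighbours(target, candidates, n=2, exclude=()):
    """Return the n (token, similarity) pairs most similar to target."""
    scored = (
        (tok, cosine(target, vec))
        for tok, vec in candidates.items()
        if tok not in exclude
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Made-up vectors, for illustration only
target = [1.0, 0.0]
candidates = {
    "selection": [1.0, 0.0],
    "variation": [0.9, 0.1],
    "descent": [0.5, 0.5],
    "letter": [0.0, 1.0],
}
for token, score in top_n_neighbours(target, candidates, exclude=("selection",)):
    print(f"{token}: {score:.4f}")
```

Excluding the target is optional; it simply frees one slot in the top-n list for a more informative neighbour.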


Try on one lemma:

In [67]:
sample_lemma = "selection"  # change as needed

neighbors_by_period = {}
for label, df_period in zip(['t1', 't2', 't3'], [df_t1, df_t2, df_t3]):
    print(f"Finding neighbours for '{sample_lemma}' in {label}...")
    neighbours = get_top_neighbours(sample_lemma, df_period, top_n=10)
    neighbors_by_period[label] = neighbours

# Display results
for period, neighbours in neighbors_by_period.items():
    print(f"\n{sample_lemma} – Top neighbours in {period}:")
    for word, score in neighbours:
        print(f"  {word}: {score:.4f}")


Finding neighbours for 'selection' in t1...
Note: lemma 'selection' not found in 2100 texts out of 2146
Finding neighbours for 'selection' in t2...
Note: lemma 'selection' not found in 2285 texts out of 2503
Finding neighbours for 'selection' in t3...
Note: lemma 'selection' not found in 3394 texts out of 3487

selection – Top neighbours in t1:
  selection: 1.0000
  select: 0.7073
  selected: 0.6927
  variation: 0.6506
  selecting: 0.6480
  descent: 0.6409
  history: 0.6351
  science: 0.6190
  inheritance: 0.6174
  migration: 0.6172

selection – Top neighbours in t2:
  selection: 1.0000
  selected: 0.7075
  evolution: 0.7045
  descent: 0.7036
  selecting: 0.6747
  variation: 0.6650
  history: 0.6573
  reproduction: 0.6546
  generation: 0.6501
  natural: 0.6349

selection – Top neighbours in t3:
  selection: 1.0000
  evolution: 0.7149
  selected: 0.6827
  reproduction: 0.6343
  competition: 0.6293
  variation: 0.6284
  evolutionary: 0.6237
  natural: 0.6212
  history: 0.6208
  selecting

Over the three time periods, the semantic neighbourhood of selection shifts in a way that reflects Darwin's increasing theoretical engagement with evolution.

In t1 (1822–1859), selection is most closely associated with morphological variants (select, selected, selecting) and biological concepts such as variation, inheritance, migration. The presence of terms like history and science hints at early discussions of mechanisms of inheritance and population change, but the focus is still exploratory and not yet unified under evolutionary theory.

In t2 (1860–1870), immediately after the publication of On the Origin of Species, the neighbourhood of selection becomes more conceptually dense: evolution, descent, reproduction, and natural enter the top neighbours, aligning selection with core components of Darwin’s theory of natural selection. This reflects the consolidation of selection into the explanatory apparatus of evolutionary biology.

In t3 (1871–1882), the cluster intensifies around evolutionary theory: evolution, evolutionary, competition, natural, and reproduction now appear alongside selection. This suggests that selection had become firmly embedded in the lexicon of Darwin's mature theoretical system. It also shows a shift from general scientific terms (science, history) to technical vocabulary reflecting mechanisms and consequences of selection.
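
One way to put a number on this qualitative reading is the overlap between the neighbour sets of two periods, for instance the Jaccard index (size of the intersection divided by size of the union). The sketch below is an illustrative addition rather than part of the original analysis; the neighbour lists are copied from the output above, and the helper name `jaccard` is an assumption.

```python
from itertools import combinations

# Top-10 neighbours of 'selection' per period, copied from the output above
neighbours = {
    "t1": ["selection", "select", "selected", "variation", "selecting",
           "descent", "history", "science", "inheritance", "migration"],
    "t2": ["selection", "selected", "evolution", "descent", "selecting",
           "variation", "history", "reproduction", "generation", "natural"],
    "t3": ["selection", "evolution", "selected", "reproduction", "competition",
           "variation", "evolutionary", "natural", "history", "selecting"],
}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

overlaps = {
    (p, q): jaccard(neighbours[p], neighbours[q])
    for p, q in combinations(neighbours, 2)
}
for (p, q), score in overlaps.items():
    print(f"{p} vs {q}: {score:.3f}")
```

On these lists, t1 and t3 share the fewest neighbours and t2 and t3 the most, which is in line with the reading above.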



Now I focus on a set of words that are of particular interest in Darwin's work.

NB: The cell that computes the embeddings and neighbours for these words takes about 3 hours to run.

In [136]:
selected_lemmas = ['origin', 'variation', 'species', 'evolution', 'selection', 'modification', 'insectivorous', 'protoplasm', 'fertilization', 'pollen', 'propagation', 'curious', 'scientific', 'science', 'transmutation']


In [140]:
embeddings = {lemma: {} for lemma in selected_lemmas}

# Loop through time periods
for period_name, df_period in zip(['t1', 't2', 't3'], [df_t1, df_t2, df_t3]):
    print(f"Processing period: {period_name}")
    for lemma in tqdm(selected_lemmas):
        # Get letters where lemma occurs (based on Lemmas_nostop)
        matched_texts = df_period[df_period['Lemmas_nostop'].apply(
            lambda lst: lemma in lst if isinstance(lst, list) else False
        )]['Transcription'].tolist()
        
        vectors = []
        for text in matched_texts:
            try:
                vec = get_mean_embedding(text, tokenizer, model)
                vectors.append(vec)
            except Exception as e:
                continue  # skip bad examples silently

        # Save average vector if we got any
        if vectors:
            mean_vec = np.mean(vectors, axis=0)
            embeddings[lemma][period_name] = mean_vec
            
results = []
all_neighbour_rows = []  # Collect rows here

for lemma in selected_lemmas:
    if all(k in embeddings[lemma] for k in ['t1', 't2', 't3']):
        sim_t1_t2 = cosine_similarity([embeddings[lemma]['t1']], [embeddings[lemma]['t2']])[0][0]
        sim_t2_t3 = cosine_similarity([embeddings[lemma]['t2']], [embeddings[lemma]['t3']])[0][0]
        sim_t1_t3 = cosine_similarity([embeddings[lemma]['t1']], [embeddings[lemma]['t3']])[0][0]
        results.append({
            "Lemma": lemma,
            "Sim_t1_t2": sim_t1_t2,
            "Sim_t2_t3": sim_t2_t3,
            "Sim_t1_t3": sim_t1_t3
        })

    # neighbours:
    neighbors_by_period = {}
    print(lemma)
    for label, df_period in zip(['t1', 't2', 't3'], [df_t1, df_t2, df_t3]):
        print(f"Finding neighbours in {label}...")
        neighbours = get_top_neighbours(lemma, df_period, top_n=10)
        neighbors_by_period[label] = neighbours

        for neighbour, score in neighbours:
            all_neighbour_rows.append({
                'Lemma': lemma,
                'Period': label,
                'Neighbour': neighbour,
                'Similarity': round(score, 4)
            })

results_df = pd.DataFrame(results)
results_df.to_csv(os.path.join(out_folder, "darwin_lemmas_BERT_results.csv"), index=False)
# Create a DataFrame from all rows
neighbours_df = pd.DataFrame(all_neighbour_rows)

# Save to CSV
neighbours_df.to_csv(os.path.join(out_folder, "darwin_lemmas_BERT_neighbours.csv"), index=False)

print(results_df)
print(neighbours_df)

Processing period: t1


100%|███████████████████████████████████████████| 15/15 [02:06<00:00,  8.45s/it]


Processing period: t2


100%|███████████████████████████████████████████| 15/15 [03:52<00:00, 15.49s/it]


Processing period: t3


100%|███████████████████████████████████████████| 15/15 [01:57<00:00,  7.83s/it]


origin
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
variation
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
species
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
evolution
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
selection
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
modification
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
insectivorous
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
protoplasm
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
fertilization
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
pollen
Finding neighbours in t1...
Finding neighbours in t2...
Finding neighbours in t3...
propagation
Finding neighbours in t1...
Finding neighbou

In [141]:
results_df

Unnamed: 0,Lemma,Sim_t1_t2,Sim_t2_t3,Sim_t1_t3
0,origin,0.996708,0.995121,0.994874
1,variation,0.998142,0.994573,0.991467
2,species,0.995748,0.981593,0.981416
3,selection,0.997554,0.995562,0.992499
4,modification,0.993124,0.989055,0.98792
5,insectivorous,0.933851,0.947329,0.952745
6,pollen,0.995259,0.996493,0.99278
7,propagation,0.973796,0.957761,0.95604
8,curious,0.997677,0.995548,0.991993
9,scientific,0.997828,0.996733,0.994944


In [143]:
pd.set_option('display.max_rows', 310) 
neighbours_df

Unnamed: 0,Lemma,Period,Neighbour,Similarity
0,origin,t1,origin,1.0
1,origin,t1,creation,0.6875
2,origin,t1,descent,0.6734
3,origin,t1,distribution,0.6733
4,origin,t1,affinities,0.6684
5,origin,t1,identity,0.6682
6,origin,t1,source,0.6681
7,origin,t1,nature,0.667
8,origin,t1,existence,0.6584
9,origin,t1,character,0.6535


The similarity scores for the selected lemmas are all very high. Let's extract the scores for all lexical words with a frequency of at least 200 to see if any of them have low scores:

In [148]:
highly_frequent_lemmas = [
    row['Lemma'] for _, row in content_lemmas_df.iterrows()
    if row['Frequency'] >= 200
]

embeddings = {lemma: {} for lemma in highly_frequent_lemmas}

# Loop through time periods
for period_name, df_period in zip(['t1', 't2', 't3'], [df_t1, df_t2, df_t3]):
    print(f"Processing period: {period_name}")
    for lemma in tqdm(highly_frequent_lemmas):
        # Get letters where lemma occurs (based on Lemmas_nostop)
        matched_texts = df_period[df_period['Lemmas_nostop'].apply(
            lambda lst: lemma in lst if isinstance(lst, list) else False
        )]['Transcription'].tolist()
        
        vectors = []
        for text in matched_texts:
            try:
                vec = get_mean_embedding(text, tokenizer, model)
                vectors.append(vec)
            except Exception as e:
                continue  # skip bad examples silently

        # Save average vector if we got any
        if vectors:
            mean_vec = np.mean(vectors, axis=0)
            embeddings[lemma][period_name] = mean_vec
            
results = []
#all_neighbour_rows = []  # Collect rows here

for lemma in highly_frequent_lemmas:
    if all(k in embeddings[lemma] for k in ['t1', 't2', 't3']):
        sim_t1_t2 = cosine_similarity([embeddings[lemma]['t1']], [embeddings[lemma]['t2']])[0][0]
        sim_t2_t3 = cosine_similarity([embeddings[lemma]['t2']], [embeddings[lemma]['t3']])[0][0]
        sim_t1_t3 = cosine_similarity([embeddings[lemma]['t1']], [embeddings[lemma]['t3']])[0][0]
        results.append({
            "Lemma": lemma,
            "Sim_t1_t2": sim_t1_t2,
            "Sim_t2_t3": sim_t2_t3,
            "Sim_t1_t3": sim_t1_t3
        })

    # neighbours:
    #neighbors_by_period = {}
    #print(lemma)
    #for label, df_period in zip(['t1', 't2', 't3'], [df_t1, df_t2, df_t3]):
    #    print(f"Finding neighbours in {label}...")
    #    neighbours = get_top_neighbours(lemma, df_period, top_n=10)
    #    neighbors_by_period[label] = neighbours

    #    for neighbour, score in neighbours:
    #        all_neighbour_rows.append({
    #            'Lemma': lemma,
    #            'Period': label,
    #            'Neighbour': neighbour,
    #            'Similarity': round(score, 4)
    #        })

results_df = pd.DataFrame(results)
results_df.to_csv(os.path.join(out_folder, "darwin_lemmas_BERT_results_allwords.csv"), index=False)

# Create a DataFrame from all rows
#neighbours_df = pd.DataFrame(all_neighbour_rows)

# Save to CSV
#neighbours_df.to_csv(os.path.join(out_folder, "darwin_lemmas_BERT_neighbours_allwords.csv"), index=False)

print(results_df)
#print(neighbours_df)

Processing period: t1


100%|███████████████████████████████████████| 645/645 [4:55:23<00:00, 27.48s/it]


Processing period: t2


100%|███████████████████████████████████████| 645/645 [4:16:58<00:00, 23.90s/it]


Processing period: t3


100%|███████████████████████████████████████| 645/645 [3:14:58<00:00, 18.14s/it]

          Lemma  Sim_t1_t2  Sim_t2_t3  Sim_t1_t3
0        Darwin   0.998196   0.997460   0.993926
1         think   0.998036   0.997611   0.994374
2          send   0.997758   0.996365   0.991796
3          good   0.997119   0.997512   0.992303
4          know   0.997876   0.997758   0.994426
..          ...        ...        ...        ...
638      remove   0.993912   0.994893   0.990978
639       carry   0.989857   0.991824   0.993385
640    astonish   0.996033   0.995995   0.991140
641  experience   0.994411   0.996467   0.991150
642  Shrewsbury   0.989631   0.986374   0.982158

[643 rows x 4 columns]





The similarities are all very high (>0.9). I try again with slightly less frequent lemmas (frequency of at least 50):

In [29]:
frequent_lemmas = [
    row['Lemma'] for _, row in content_lemmas_df.iterrows()
    if row['Frequency'] >= 50
]

embeddings = {lemma: {} for lemma in frequent_lemmas}

# Loop through time periods
for period_name, df_period in zip(['t1', 't2', 't3'], [df_t1, df_t2, df_t3]):
    print(f"Processing period: {period_name}")
    for lemma in tqdm(frequent_lemmas):
        # Get letters where lemma occurs (based on Lemmas_nostop)
        matched_texts = df_period[df_period['Lemmas_nostop'].apply(
            lambda lst: lemma in lst if isinstance(lst, list) else False
        )]['Transcription'].tolist()
        
        vectors = []
        for text in matched_texts:
            try:
                vec = get_mean_embedding(text, tokenizer, model)
                vectors.append(vec)
            except Exception as e:
                continue  # skip bad examples silently

        # Save average vector if we got any
        if vectors:
            mean_vec = np.mean(vectors, axis=0)
            embeddings[lemma][period_name] = mean_vec
            

Processing period: t1


  0%|                                                  | 0/2107 [00:33<?, ?it/s]


KeyboardInterrupt: 

In [62]:
results = []
#all_neighbour_rows = []  # Collect rows here

for lemma in frequent_lemmas:
    if all(k in embeddings[lemma] for k in ['t1', 't2', 't3']):
        sim_t1_t2 = cosine_similarity([embeddings[lemma]['t1']], [embeddings[lemma]['t2']])[0][0]
        sim_t2_t3 = cosine_similarity([embeddings[lemma]['t2']], [embeddings[lemma]['t3']])[0][0]
        sim_t1_t3 = cosine_similarity([embeddings[lemma]['t1']], [embeddings[lemma]['t3']])[0][0]
        results.append({
            "Lemma": lemma,
            "Sim_t1_t2": sim_t1_t2,
            "Sim_t2_t3": sim_t2_t3,
            "Sim_t1_t3": sim_t1_t3
        })

    # neighbours:
    #neighbors_by_period = {}
    #print(lemma)
    #for label, df_period in zip(['t1', 't2', 't3'], [df_t1, df_t2, df_t3]):
    #    print(f"Finding neighbours in {label}...")
    #    neighbours = get_top_neighbours(lemma, df_period, top_n=10)
    #    neighbors_by_period[label] = neighbours

    #    for neighbour, score in neighbours:
    #        all_neighbour_rows.append({
    #            'Lemma': lemma,
    #            'Period': label,
    #            'Neighbour': neighbour,
    #            'Similarity': round(score, 4)
    #        })

results_df = pd.DataFrame(results)
results_df.to_csv(os.path.join(out_folder, "darwin_lemmas_BERT_results_allwords_freq50.csv"), index=False)

# Create a DataFrame from all rows
#neighbours_df = pd.DataFrame(all_neighbour_rows)

# Save to CSV
#neighbours_df.to_csv(os.path.join(out_folder, "darwin_lemmas_BERT_neighbours_allwords.csv"), index=False)

print(results_df)
#print(neighbours_df)

        Lemma  Sim_t1_t2  Sim_t2_t3  Sim_t1_t3
0      Darwin   0.998194   0.997466   0.993926
1       think   0.998036   0.997611   0.994374
2        send   0.997758   0.996365   0.991796
3        good   0.997119   0.997512   0.992303
4        know   0.997876   0.997758   0.994426
...       ...        ...        ...        ...
2050      joy   0.987482   0.990990   0.980622
2051  mankind   0.980246   0.988728   0.978296
2052   Ceylon   0.986289   0.983974   0.969843
2053   alpine   0.988736   0.986596   0.987154
2054    Plate   0.984050   0.977589   0.984507

[2055 rows x 4 columns]
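
For reference, the similarity scores in the table are cosine similarities, cos(u, v) = u·v / (‖u‖ ‖v‖). scikit-learn's `cosine_similarity` returns a matrix of pairwise scores, which is why the cell above indexes the result with `[0][0]`. The sketch below recomputes the same quantity with NumPy only; the vectors `vec_t1`, `vec_t2` and `vec_t3` are hypothetical two-dimensional stand-ins for the period embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical period vectors for one lemma
vec_t1 = [1.0, 0.0]
vec_t2 = [1.0, 1.0]
vec_t3 = [0.0, 1.0]

print(round(cosine(vec_t1, vec_t2), 4))  # vectors 45 degrees apart
print(round(cosine(vec_t1, vec_t3), 4))  # orthogonal vectors
```

A value close to 1 means the two period vectors point in almost the same direction; lower values indicate a larger difference between periods.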


I extract the lemmas whose similarity is below 0.9 for at least one pair of periods:

In [63]:
# Define the file path
file_path = os.path.join(out_folder, "darwin_lemmas_BERT_results_allwords_freq50.csv")

# Load the DataFrame
results_df = pd.read_csv(file_path)

# Confirm it loaded correctly
print(results_df.head())
# Filter lemmas with any similarity under 0.9
low_similarity_lemmas = results_df[
    (results_df["Sim_t1_t2"] < 0.9) |
    (results_df["Sim_t2_t3"] < 0.9) |
    (results_df["Sim_t1_t3"] < 0.9)
]["Lemma"].tolist()

# Display the list
print(low_similarity_lemmas)


    Lemma  Sim_t1_t2  Sim_t2_t3  Sim_t1_t3
0  Darwin   0.998194   0.997466   0.993926
1   think   0.998036   0.997611   0.994374
2    send   0.997758   0.996365   0.991796
3    good   0.997119   0.997512   0.992303
4    know   0.997876   0.997758   0.994426
['Primula', 'Ammonia', 'Balanus', 'filament', 'Watson', 'Gower']
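
The same filter can also be written with a row-wise minimum, which additionally records which pair of periods has the lowest similarity. This may help when deciding whether a change looks gradual or abrupt. The sketch below is one possible formulation, not the method used in this notebook; `toy_results`, `Min_sim` and `Min_pair` are hypothetical names and the numbers are invented.

```python
import pandas as pd

# Hypothetical similarity table with the same columns as results_df
toy_results = pd.DataFrame({
    "Lemma": ["alpha", "beta", "gamma"],
    "Sim_t1_t2": [0.99, 0.85, 0.97],
    "Sim_t2_t3": [0.98, 0.96, 0.95],
    "Sim_t1_t3": [0.97, 0.88, 0.89],
})

sim_cols = ["Sim_t1_t2", "Sim_t2_t3", "Sim_t1_t3"]

# Lowest similarity per lemma and the period pair where it occurs
toy_results["Min_sim"] = toy_results[sim_cols].min(axis=1)
toy_results["Min_pair"] = toy_results[sim_cols].idxmin(axis=1)

# Equivalent to OR-ing the three "< 0.9" conditions
toy_low = toy_results[toy_results["Min_sim"] < 0.9]
print(toy_low[["Lemma", "Min_sim", "Min_pair"]])
```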


Now I extract their neighbours:

In [68]:
import ast

def parse_lemmas(val):
    if isinstance(val, str):
        try:
            return ast.literal_eval(val)
        except Exception:
            return val.split()
    elif isinstance(val, list):
        return val
    else:
        return []

# Apply parsing to lemmas columns so they are lists
df_t1['Lemmas_clean'] = df_t1['Lemmas_clean'].apply(parse_lemmas)
df_t2['Lemmas_clean'] = df_t2['Lemmas_clean'].apply(parse_lemmas)
df_t3['Lemmas_clean'] = df_t3['Lemmas_clean'].apply(parse_lemmas)

changed_neighbour_rows = []

for lemma in low_similarity_lemmas:
    lemma_norm = lemma.lower()
    print(f"\nProcessing lemma: {lemma_norm}")
    for label, df_period in zip(['t1', 't2', 't3'], [df_t1, df_t2, df_t3]):
        print(f"Checking lemma '{lemma_norm}' in {label} lemmas")

        # Filter rows containing lemma (case insensitive)
        df_filtered = df_period[df_period['Lemmas_clean'].apply(lambda lemmas: lemma_norm in [l.lower() for l in lemmas])]

        if df_filtered.empty:
            print(f"Skipping {label} – no texts contain lemma")
            continue

        neighbours = get_top_neighbours(lemma_norm, df_filtered, top_n=10)
        print(f"Found {len(neighbours)} neighbours in {label}")

        for neighbour, score in neighbours:
            changed_neighbour_rows.append({
                'Lemma': lemma,
                'Period': label,
                'Neighbour': neighbour,
                'Similarity': round(score, 4)
            })

neighbours_df = pd.DataFrame(changed_neighbour_rows)
print("\nFinal neighbour DataFrame:")
print(neighbours_df)

neighbours_df.to_csv(os.path.join(out_folder, "darwin_lemmas_BERT_neighbours_mostchangedwords.csv"), index=False)



Processing lemma: primula
Checking lemma 'primula' in t1 lemmas
Found 10 neighbours in t1
Checking lemma 'primula' in t2 lemmas
Found 10 neighbours in t2
Checking lemma 'primula' in t3 lemmas
Found 10 neighbours in t3

Processing lemma: ammonia
Checking lemma 'ammonia' in t1 lemmas
Found 10 neighbours in t1
Checking lemma 'ammonia' in t2 lemmas
Found 10 neighbours in t2
Checking lemma 'ammonia' in t3 lemmas
Found 10 neighbours in t3

Processing lemma: balanus
Checking lemma 'balanus' in t1 lemmas
Found 10 neighbours in t1
Checking lemma 'balanus' in t2 lemmas
[SKIP] No vectors found for balanus in this period.
Found 0 neighbours in t2
Checking lemma 'balanus' in t3 lemmas
Found 2 neighbours in t3

Processing lemma: filament
Checking lemma 'filament' in t1 lemmas
[SKIP] No vectors found for filament in this period.
Found 0 neighbours in t1
Checking lemma 'filament' in t2 lemmas
Found 10 neighbours in t2
Checking lemma 'filament' in t3 lemmas
Found 10 neighbours in t3

Processing lemma:

1. Primula (Type 1: Gradual conceptual change)
t1: Neighbours are botanical and folk-taxonomic (offic, berry, wort, cent), with terms like laure and common suggesting naturalist or herbalist registers.

t2: More scientific and Latinised neighbours appear (spec, haben, ulas, aria), with cows and capsules indicating specific plant parts and reproductive features.

t3: Shift to biological processes and plant anatomy (pollen, seeds, visiting, insects), signalling a transition from naming plants to discussing their role in cross-fertilisation and evolutionary biology.

There is a clear progression from vernacular, specimen-level naming to botanical classification, and then to evolutionary and reproductive functions. This is consistent with Darwin’s increasing focus on cross-fertilisation and reproduction in plants.

2. Ammonia (Type 1: Gradual conceptual shift toward chemistry)
t1: Broad, vague neighbours (half, white, again, become), with appearance hinting at surface-level discussion.

t2: Chemistry-oriented terms emerge strongly (acid, iodine, nitrate, carbonate, starch), focusing on compounds and reactions.

t3: Further deepening into chemical composition (urea, phosphate, alkali, alcohol, chlorophyll), with ammonia as a component in physiological and experimental contexts.

The word shifts from a general mention to a technical term within chemical and physiological analysis. This reflects Darwin’s later engagement with fertilisers and plant nutrition.

3. Balanus (Type 2: Abrupt disappearance and reduced salience)
t1: Dense cluster of taxonomic and geographic terms (acea, american, scalp, islands, beds), suggesting detailed discussion of barnacle species.

t2: Absent.

t3: Sparse and generic neighbours (which, your).

This reflects an abrupt drop in attention to barnacles post-Cirripedia. Once central, Balanus becomes peripheral or obsolete in Darwin’s later writing.

4. Filament (Type 1: Gradual shift with extended metaphor)
t2: Botanical terms (disc, tube, gland, column, stigma), consistent with flower structure.

t3: Mix of botanical (flower, pollen), metaphorical (moving, bloom), and physical objects (glass, worm).

The concept expands slightly from botanical anatomy to metaphorical and comparative usage (e.g. glass filament, worm filament). Darwin begins relating plant anatomy to broader mechanical or evolutionary metaphors.

5. Watson (Type 2: Shift from correspondent to genericised educational context)
t1–t2: Personal and professional associates (darwin, gray, holland, phillips).

t3: Institutional neighbours (university, college, school, examination), without personal context.

Watson changes from being a named interlocutor to a role embedded in institutional references. This possibly reflects changes in Darwin’s professional network and the abstraction of individual identities into broader academic structures.

6. Gower (Type 2: Address shift to historical or indirect reference)
t1–t2: Strong address-based neighbours (street, upper, charles, london), suggesting that it refers to Darwin’s residence.

t3: Broader and more narrative or historical neighbours (arth, bury, public, john, this).

“Gower” transitions from a concrete physical address to part of narrative or historical references, perhaps in published letters or third-person retellings. The word’s indexical function fades.
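
The qualitative reading above could be complemented with a simple overlap score between the neighbour sets of two periods. The sketch below uses Jaccard overlap, which is an assumption of mine and not something computed elsewhere in this notebook. The helper `jaccard` is hypothetical, and the `ammonia` lists contain only the example neighbours quoted in the discussion, not the full top-10 lists saved in `neighbours_df`.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of neighbours."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Only the example neighbours quoted in the discussion above,
# not the full top-10 lists stored in neighbours_df
ammonia = {
    "t1": ["half", "white", "again", "become", "appearance"],
    "t2": ["acid", "iodine", "nitrate", "carbonate", "starch"],
    "t3": ["urea", "phosphate", "alkali", "alcohol", "chlorophyll"],
}

for p, q in [("t1", "t2"), ("t2", "t3"), ("t1", "t3")]:
    print(p, q, jaccard(ammonia[p], ammonia[q]))
```

An overlap near 0 means the two periods share almost no neighbours, while an overlap near 1 means the neighbourhood is stable. With the full neighbour lists, a low score for a single pair of periods would point to an abrupt change and low scores across all pairs to a more gradual one.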

