A notebook to explore availability of data for LinkBERT style pretraining.

## 0 Data an Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re

In [2]:
df = pd.read_pickle("DATASET/ED4RE_2503/ED4RE_2603.pickle")

## 1 Initial analysis

In [5]:

# --- Step 1: Calculate the number of references for each document ---
def get_reference_count(ref_entry):
    if isinstance(ref_entry, list):
        return len(ref_entry)
    elif ref_entry == "no_references":
        return 0
    else: # This handles strings and other non-list types
        return 1

# Apply this function to create a new column with the counts
df['num_references'] = df['References'].apply(get_reference_count)


# --- Step 2: Get high-level statistics ---

# First, let's count how many are neatly structured vs. single blobs
is_list_mask = df['References'].apply(lambda x: isinstance(x, list))
list_count = is_list_mask.sum()
string_blob_count = len(df) - list_count

print("="*60)
print("Analysis of Reference Column Structure")
print("="*60)
print(f"Total documents with potential references: {len(df):,}")
print(f"Documents with references as a list (good format): {list_count:,} ({list_count/len(df):.2%})")
print(f"Documents with references as a single string (needs parsing): {string_blob_count:,} ({string_blob_count/len(df):.2%})")
print("\n")


# # Now, let's get descriptive statistics on the 'num_references' column
# print("="*60)
# print("Descriptive Statistics for Number of References per Document")
# print("="*60)
# # The describe() output will be heavily influenced by the single-string blobs (value=1)
# # so we'll show stats for both all data and just the list-formatted data.
# print("--- Overall (including single-string blobs as 1) ---")
# print(df['num_references'].describe())
# print("\n")

# Filter for only the documents that had a list to get a cleaner distribution
df_lists_only = df[is_list_mask]
print("--- For List-Formatted References Only ---")
print(df_lists_only['num_references'].describe())
print("="*60)


Analysis of Reference Column Structure
Total documents with potential references: 171,880
Documents with references as a list (good format): 162,726 (94.67%)
Documents with references as a single string (needs parsing): 9,154 (5.33%)


--- For List-Formatted References Only ---
count    162726.000000
mean         52.317902
std         583.275877
min           0.000000
25%          33.000000
50%          45.000000
75%          60.000000
max      169759.000000
Name: num_references, dtype: float64


**Observation 1**:

Total documents with potential references: 171,880
Documents with references as a list (good format): 162,726 (94.67%)
Documents with references as a single string (needs parsing): 9,154 (5.33%)

count    162726
mean         52.317902
std         583.275877
min           0
25%          33
50%          45
75%          60
max      169759

75% of references are lists that contain up to 60 references, this seems like a reasonable maximum number? Let's look at a random sample to feel the data.

In [83]:
a = df_lists_only[df_lists_only.num_references <= 10].sample(100)["Title"].to_list()

def fix_title(title):
    """
    Normalizes a paper title string, using a simple rule to detect spaced-out text.

    The logic is:
    1. Check if the title contains a double-space ("  ").
    2. IF IT DOES: Assume it's "spaced-out" text (e.g., "T i t l e   W o r d").
       - Mark the real word breaks (the double-spaces) with a placeholder.
       - Remove all single spaces (the junk between letters).
       - Restore the real word breaks.
    3. IF IT DOES NOT: Treat it as a normal title.
    4. Finally, apply standard cleaning (lowercase, remove punctuation, etc.) to all titles.

    Args:
        title (str or any): The input title to be cleaned.

    Returns:
        str: The normalized title string. Returns an empty string if input is not a string.
    """

    if not isinstance(title, str):
        return ""

    if "  " in title:
        placeholder = "@@@"
        
        title_fixed = re.sub(r'\s{2,}', placeholder, title)
        
        title_fixed = title_fixed.replace(' ', '')
        
        title = title_fixed.replace(placeholder, ' ')
        
    title = title.lower()
    title = re.sub(r'[^\w\s]', '', title)
    title = " ".join(title.split())
    
    return title

for title in a:
    print(title)
    print(fix_title(title))
    print()


Extreme enrichment in atmosphericNN
extreme enrichment in atmosphericnn

V a r i a t i o n s   i n   T I M P 3   a r e   a s s o c i a t e d   w i t h   a g e - r e l a t e d   m a c u l a r   d e g e n e r a t i o n
variations in timp3 are associated with agerelated macular degeneration

H i g h - p r e c i s i o n   d a t i n g   o f   c o l o n i z a t i o n   a n d   s e t t l e m e n t   i n   E a s t   P o l y n e s i a
highprecision dating of colonization and settlement in east polynesia

E d i t o r i a l — M o d e l l i n g   o f   F l o o d s   i n   U r b a n   A r e a s
editorialmodelling of floods in urban areas

S m a l l e r   h u m a n   p o p u l a t i o n   i n   2 1 0 0   c o u l d   i m p o r t a n t l y   r e d u c e   t h e   r i s k   o f   c l i m a t e   c a t a s t r o p h e
smaller human population in 2100 could importantly reduce the risk of climate catastrophe

T h e   p a n d e m i c   i s   p r o m p t i n g   w i d e s p r e a d   u s e — a n d   m i s u

This function seems to be working for titles, next, reference titles should be tackled.

In [96]:
def extract_title_from_reference(ref_string):
    """
    Extracts the most likely title from a scientific reference string.

    This uses a simple but robust heuristic:
    1. Removes bracketed metadata (e.g., [CrossRef]).
    2. Splits the reference by periods ('.').
    3. Assumes the longest resulting segment is the title.

    Args:
        ref_string (str): The reference string to parse.

    Returns:
        str: The extracted title, or None if no suitable title is found.
    """
    if not isinstance(ref_string, str):
        return None

    # 1. Pre-clean: Remove bracketed metadata like [Google Scholar] [CrossRef]
    cleaned_ref = re.sub(r'\[.*?\]', '', ref_string).strip()
    
    # 2. Split the string into major parts using the period as a delimiter
    parts = cleaned_ref.split('.')
    
    # 3. Filter out empty or very short parts that can't be titles
    # A title usually has some substance. We also strip whitespace from each part.
    potential_titles = [part.strip() for part in parts if len(part.strip()) > 10]
    
    # If, after filtering, there are no candidates, we can't find a title
    if not potential_titles:
        return None
        
    # 4. Select the longest part from the candidates
    # The title is almost always the longest meaningful segment of a reference.
    title = max(potential_titles, key=len)
    
    return title


In [None]:
for ref in df_lists_only[df_lists_only.num_references <= 100].sample(1)["References"].to_list()[0]:
    print(ref)
    print(extract_title_from_reference(ref))
    print(fix_title(extract_title_from_reference(ref)))
    print()


This setup seems to work good for references.

In [143]:
def extract_year(date_string):
    """
    Extracts the most likely publication year from a messy date string.

    This function tries several strategies in order:
    1. Looks for a year in a "Published: ..." line.
    2. If not found, it searches for any four-digit number starting with 19 or 20.
    3. Returns the last year found in the string, as 'Published' is usually last.

    Args:
        date_string (str): The messy string from the 'Date' column.

    Returns:
        int or None: The extracted four-digit year as an integer, or None if no
                     plausible year can be found.
    """
    if not isinstance(date_string, str):
        return None

    published_match = re.search(r'Published: .*?(\b(19|20)\d{2}\b)', date_string)
    if published_match:
        return int(published_match.group(1))

    all_years = re.findall(r'\b(?:19|20)\d{2}\b', date_string)
    
    if all_years:
        return int(all_years[-1])

    # If no year could be found, return None
    return "no_date"

for date in df_lists_only.Date.sample(100).tolist():
    print(date)
    print(extract_year(date))
    print()

May 17, 2004
2004

May 1, 2012
2012


Received: 1 June 2017
/
Revised: 13 June 2017
/
Accepted: 14 June 2017
/
Published: 17 June 2017

2017

2022-07-01
2022


Received: 1 July 2023
/
Revised: 4 August 2023
/
Accepted: 5 August 2023
/
Published: 8 August 2023

2023

2011
2011

February 4, 2013
2013

February 19, 2002
2002

November 13, 2001
2001


Received: 18 May 2022
/
Revised: 1 July 2022
/
Accepted: 6 July 2022
/
Published: 11 July 2022

2022

July 22, 2022
2022

December 23, 2013
2013

February 17, 2009
2009

2019
2019


Received: 18 October 2021
/
Revised: 12 November 2021
/
Accepted: 6 December 2021
/
Published: 13 December 2021

2021


Received: 11 May 2018
/
Revised: 11 July 2018
/
Accepted: 18 July 2018
/
Published: 25 July 2018

2018


Received: 22 March 2023
/
Revised: 9 April 2023
/
Accepted: 13 April 2023
/
Published: 14 April 2023

2023


Received: 8 July 2019
/
Revised: 25 July 2019
/
Accepted: 25 July 2019
/
Published: 28 July 2019

2019

June 18, 2013
2013

July 8, 20

## 2 Implementation

In [148]:
import pandas as pd
import re
from rapidfuzz import fuzz, process
from tqdm.auto import tqdm # For a nice progress bar!


def fix_title(title):
    """
    Normalizes a paper title string, using a simple rule to detect spaced-out text.
    Args:
        title (str or any): The input title to be cleaned.

    Returns:
        str: The normalized title string. Returns an empty string if input is not a string.
    """

    if not isinstance(title, str):
        return ""

    if "  " in title:
        placeholder = "@@@"
        
        title_fixed = re.sub(r'\s{2,}', placeholder, title)
        
        title_fixed = title_fixed.replace(' ', '')
        
        title = title_fixed.replace(placeholder, ' ')
        
    title = title.lower()
    title = re.sub(r'[^\w\s]', '', title)
    title = " ".join(title.split())
    
    return title

def extract_title_from_reference(ref_string):
    """
    Extracts the most likely title from a scientific reference string.
    Args:
        ref_string (str): The reference string to parse.

    Returns:
        str: The extracted title, or None if no suitable title is found.
    """
    if not isinstance(ref_string, str):
        return None

    # 1. Pre-clean: Remove bracketed metadata like [Google Scholar] [CrossRef]
    cleaned_ref = re.sub(r'\[.*?\]', '', ref_string).strip()
    
    # 2. Split the string into major parts using the period as a delimiter
    parts = cleaned_ref.split('.')
    
    # 3. Filter out empty or very short parts that can't be titles
    # A title usually has some substance. We also strip whitespace from each part.
    potential_titles = [part.strip() for part in parts if len(part.strip()) > 10]
    
    # If, after filtering, there are no candidates, we can't find a title
    if not potential_titles:
        return None
        
    # 4. Select the longest part from the candidates
    # The title is almost always the longest meaningful segment of a reference.
    title = max(potential_titles, key=len)
    
    return title

def extract_year(date_string):
    """
    Extracts the most likely publication year from a messy date string.
    Args:
        date_string (str): The messy string from the 'Date' column.

    Returns:
        int or None: The extracted four-digit year as an integer, or None if no
                     plausible year can be found.
    """
    if not isinstance(date_string, str):
        return None

    published_match = re.search(r'Published: .*?(\b(19|20)\d{2}\b)', date_string)
    if published_match:
        return int(published_match.group(1))

    all_years = re.findall(r'\b(?:19|20)\d{2}\b', date_string)
    
    if all_years:
        return int(all_years[-1])

    # If no year could be found, return None
    return "no_date"

def find_best_match_for_reference_v2(ref_title, corpus_df, titles_map, ref_year=None, threshold=85, year_tol=5):
    if not ref_title: return None, None, None
    if ref_title in titles_map: return ref_title, titles_map[ref_title], 100
    
    candidate_corpus = corpus_df
    if ref_year is not None and 'year' in corpus_df.columns and pd.api.types.is_numeric_dtype(corpus_df['year']):
        filtered_view = corpus_df[corpus_df['year'].between(ref_year - year_tol, ref_year + year_tol)]
        if not filtered_view.empty: candidate_corpus = filtered_view
            
    candidate_titles = list(candidate_corpus['normalized_title'])
    candidate_indices = list(candidate_corpus.index)
    if not candidate_titles: return None, None, None

    best_match = process.extractOne(ref_title, candidate_titles, scorer=fuzz.WRatio, score_cutoff=threshold)

    if best_match:
        matched_title_str, score, original_list_index = best_match
        corpus_idx = candidate_indices[original_list_index]
        return matched_title_str, corpus_idx, score
    
    return None, None, None


df = df_lists_only[df_lists_only.num_references <= 100] # Use your actual DataFrame here

print("Preparing the corpus...")
# Use tqdm to see progress on large dataframes
tqdm.pandas(desc="Normalizing Titles")
df['normalized_title'] = df['Title'].progress_apply(fix_title)

tqdm.pandas(desc="Extracting Years")
df['year'] = df['Date'].progress_apply(extract_year)

# Create the fast lookup map for exact matches
print("Creating title-to-index map...")
corpus_titles_map = {title: idx for idx, title in df['normalized_title'].items() if title}


# --- Step 3: The Main Linking Loop ---

print("\nStarting the reference linking process...")
citation_links = []
# We iterate through each row of the DataFrame
for source_index, row in tqdm(df.iterrows(), total=len(df), desc="Linking Documents"):
    references = row['References']
    source_title = row['Title']
    
    # Ensure references is a list of strings
    if not isinstance(references, list):
        continue

    for ref_string in references:
        # Extract and normalize title from the reference string
        raw_ref_title = extract_title_from_reference(ref_string)
        norm_ref_title = fix_title(raw_ref_title)
        
        # Extract year from the reference string
        ref_year = extract_year(ref_string)

        # Find the best match in the entire corpus
        matched_title, target_index, score = find_best_match_for_reference_v2(
            norm_ref_title,
            df,
            corpus_titles_map,
            ref_year=ref_year,
            threshold=85, # You can tune this!
            year_tol=2
        )

        # If a good match was found, store it
        if target_index is not None:
            citation_links.append({
                'source_index': source_index,
                'source_title': source_title,
                'reference_string': ref_string,
                'target_index': target_index,
                'matched_corpus_title': df.loc[target_index, 'Title'],
                'match_score': score
            })

# --- Step 4: Create and Analyze the Results DataFrame ---

print(f"\nProcess complete. Found {len(citation_links)} potential links.")

# Create a new DataFrame from the results
links_df = pd.DataFrame(citation_links)

print("\nSample of created links:")
print(links_df.head().to_string())

# Analyze the distribution of match scores
if not links_df.empty:
    print("\nDistribution of match scores:")
    print(links_df['match_score'].describe())
    
    # It's also very useful to look at the borderline cases
    print("\nExamples of borderline matches (score between 85 and 90):")
    borderline_cases = links_df[(links_df['match_score'] > 85) & (links_df['match_score'] < 90)]
    print(borderline_cases.head(5).to_string())

Preparing the corpus...


Normalizing Titles:   0%|          | 0/156237 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['normalized_title'] = df['Title'].progress_apply(fix_title)


Extracting Years:   0%|          | 0/156237 [00:00<?, ?it/s]

Creating title-to-index map...

Starting the reference linking process...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['year'] = df['Date'].progress_apply(extract_year)


Linking Documents:   0%|          | 0/156237 [00:00<?, ?it/s]

KeyboardInterrupt: 

## 3 Tryout

In [22]:
def fix_title(title):
    """
    Normalizes a paper title string, using a simple rule to detect spaced-out text.
    Args:
        title (str or any): The input title to be cleaned.

    Returns:
        str: The normalized title string. Returns an empty string if input is not a string.
    """

    if not isinstance(title, str):
        return ""

    if "  " in title:
        placeholder = "@@@"
        
        title_fixed = re.sub(r'\s{2,}', placeholder, title)
        
        title_fixed = title_fixed.replace(' ', '')
        
        title = title_fixed.replace(placeholder, ' ')
        
    title = title.lower()
    title = re.sub(r'[^\w\s]', '', title)
    title = " ".join(title.split())
    
    return title

def extract_year(date_string):
    """
    Extracts the most likely publication year from a messy date string.
    Args:
        date_string (str): The messy string from the 'Date' column.

    Returns:
        int or None: The extracted four-digit year as an integer, or None if no
                     plausible year can be found.
    """
    if not isinstance(date_string, str):
        return None

    published_match = re.search(r'Published: .*?(\b(19|20)\d{2}\b)', date_string)
    if published_match:
        return int(published_match.group(1))

    all_years = re.findall(r'\b(?:19|20)\d{2}\b', date_string)
    
    if all_years:
        return int(all_years[-1])

    # If no year could be found, return None
    return "no_date"


In [None]:
links = pd.read_pickle("citation_links_38652.pickle")

In [None]:

df = pd.read_pickle("DATASET/ED4RE_2503/ED4RE_2603.pickle")
is_list_mask = df['References'].apply(lambda x: isinstance(x, list))
df_lists_only = df[is_list_mask]
df = df_lists_only.copy()

In [25]:

import networkx as nx
from tqdm.auto import tqdm

tqdm.pandas(desc="Normalizing Titles")
df['normalized_title'] = df['Title'].progress_apply(fix_title)

tqdm.pandas(desc="Extracting Years")
df['year'] = df['Date'].progress_apply(extract_year)


# Let's use the high-confidence links for a cleaner graph
CONFIDENCE_THRESHOLD = 90
if 'match_score' in links.columns:
    graph_df = links[links['match_score'] >= CONFIDENCE_THRESHOLD].copy()
else:
    graph_df = links.copy()

print(f"Building graph from {len(graph_df)} high-confidence links.")

# 1. Identify all unique nodes from your links
# These are all the paper indices that are either a source or a target.
all_node_indices = pd.concat([
    graph_df['source_index'], 
    graph_df['target_index']
]).unique()

print(f"Found {len(all_node_indices)} unique papers (nodes) in the network.")

# 2. Get the attributes for these nodes (the titles) from the main DataFrame
# We select only the rows from the main 'df' that correspond to our nodes.
node_attributes_df = df.loc[all_node_indices, ['normalized_title', 'year']]

# 3. Create the list of nodes in the format networkx expects:
# [(node_id, {attribute_dict}), (node_id_2, {attribute_dict_2}), ...]
nodes_with_attrs = [
    (index, {"title": row["normalized_title"], "year": row["year"]}) 
    for index, row in node_attributes_df.iterrows()
]

# 4. Create the list of edges from your links DataFrame
# This is a simple list of (source, target) tuples.
edges = list(graph_df[['source_index', 'target_index']].to_records(index=False))

# 1. Create a new directed graph
G = nx.DiGraph()

# 2. Add the nodes and edges
G.add_nodes_from(nodes_with_attrs)
G.add_edges_from(edges)

print("\nGraph created successfully!")
print("--- Basic Graph Statistics ---")
print(f"Number of nodes: {G.number_of_nodes():,}")
print(f"Number of edges: {G.number_of_edges():,}")

# The density of a graph is the ratio of actual edges to all possible edges.
# Citation networks are extremely sparse, so this number will be tiny.
density = nx.density(G)
print(f"Graph density: {density:.6f}")

# --- Find the most influential papers (most cited) ---
# In a citation network, this means finding the nodes with the highest "in-degree".
in_degrees = G.in_degree() # This returns a list of (node, degree) tuples
top_10_cited = sorted(in_degrees, key=lambda x: x[1], reverse=True)[:10]

print("\n--- Top 10 Most Cited Papers in Your Corpus ---")
for node_id, degree in top_10_cited:
    # Use the node attribute 'title' that we added earlier
    title = G.nodes[node_id]['title']
    print(f"Citations: {degree:<5} | Title: {title}")

Normalizing Titles:   0%|          | 0/162726 [00:00<?, ?it/s]

Extracting Years:   0%|          | 0/162726 [00:00<?, ?it/s]

Building graph from 46 high-confidence links.
Found 22 unique papers (nodes) in the network.

Graph created successfully!
--- Basic Graph Statistics ---
Number of nodes: 22
Number of edges: 28
Graph density: 0.060606

--- Top 10 Most Cited Papers in Your Corpus ---
Citations: 4     | Title: business structure of electricity distribution system operator and effect on solar photovoltaic uptake an empirical case study for switzerland
Citations: 3     | Title: are electric vehicle drivers willing to participate in vehicletogrid contracts a contextdependent stated choice experiment
Citations: 2     | Title: recycled text and risk communication in natural gas pipeline environmental impact assessments
Citations: 2     | Title: efficient floating offshore wind realization a comparative legal analysis of france norway and the united kingdom
Citations: 2     | Title: time history and meaningmaking in research on peoples relations with renewable energy technologies retsa conceptual proposal
Citat

268

In [None]:
nx.write_graphml(G, f"citation_graph_{len(G.nodes)}.graphml")

: 