# Corpus Characterization & Candidate Selection

This notebook implements the first phase of the analytical framework: corpus characterization and candidate selection. The objective is to move from the full corpus of papers to a defensible, balanced subset of approximately 50-75 key papers for in-depth qualitative analysis. Candidates are drawn from the "Foundational Pillars" and "Rising Stars" buckets of a "Two-Bucket" strategy, supplemented by a third bucket of recent preprints (Bucket C).

In [15]:
import pandas as pd
from neo4j import GraphDatabase
import os
from dotenv import load_dotenv
import networkx as nx
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# --- Connect to Neo4j ---
load_dotenv()
URI = os.getenv("NEO4J_URI")
AUTH = (os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD"))
driver = GraphDatabase.driver(URI, auth=AUTH)

# --- Helper Function to run queries and return a DataFrame ---
def query_to_dataframe(driver, query, **params):
    """
    This function executes a Cypher query against the database
    and returns the results as a pandas DataFrame.
    """
    with driver.session() as session:
        result = session.run(query, **params)
        return pd.DataFrame([r.data() for r in result])

print("Setup complete. Connected to Neo4j.")



Setup complete. Connected to Neo4j.


## 1. Fetching and Preparing the Corpus Data
The process begins by fetching all papers from the graph, along with their associated in-corpus citation counts. To measure network influence, the PageRank algorithm is then run on the full citation graph. The resulting scores are merged to create a master DataFrame, which forms the base dataset upon which all subsequent scoring will be performed.

PageRank: Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
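
As a rough illustration of what the PageRank step computes, the sketch below runs a plain power iteration on a tiny, made-up citation graph. The function name `pagerank_sketch`, the toy edges, and the tolerance are assumptions for illustration only. The notebook itself relies on `nx.pagerank`, whose default damping factor of 0.85 is mirrored here.

```python
def pagerank_sketch(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over (citing, cited) pairs. Illustrative only."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Papers that cite nothing spread their rank evenly over all papers.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each citing paper splits its rank equally among the papers it cites.
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        converged = sum(abs(new_rank[k] - rank[k]) for k in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical toy graph: B, C and D cite A; D also cites C.
toy_edges = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "C")]
toy_pagerank = pagerank_sketch(toy_edges)
toy_order = sorted(toy_pagerank, key=toy_pagerank.get, reverse=True)
print(toy_order)
```

In this toy graph the most-cited paper ranks first, and a paper cited once outranks papers that are never cited, which is the behaviour the network-influence metric is meant to capture.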

In [17]:
# This query fetches all papers and their in-corpus citation counts.
corpus_query = """
MATCH (p:Paper)
OPTIONAL MATCH (p)-[:PUBLISHED_IN]->(v:Venue)
OPTIONAL MATCH (p)<-[:CITES]-(citer)
WITH p, v, count(citer) as in_corpus_citations
RETURN
    p.paperId AS paperId,
    p.title AS title,
    p.year AS year,
    p.citation_count AS overall_citations,
    in_corpus_citations,
    v.name AS venue
"""
print("Fetching full corpus data from Neo4j...")
corpus_df = query_to_dataframe(driver, corpus_query)
corpus_df.set_index('paperId', inplace=True)


# This query fetches the citation network to calculate PageRank.
citation_network_query = """
MATCH (p1:Paper)-[:CITES]->(p2:Paper)
RETURN p1.paperId AS source, p2.paperId AS target
"""
print("Fetching citation network for PageRank calculation...")
citation_df = query_to_dataframe(driver, citation_network_query)

# Build a directed NetworkX graph (citing -> cited) and compute PageRank.
G = nx.from_pandas_edgelist(citation_df, 'source', 'target', create_using=nx.DiGraph())
pagerank = nx.pagerank(G)

# Add the PageRank scores to the main DataFrame.
# Papers with no CITES edge are absent from the graph, so they are assigned 0.
corpus_df['pagerank'] = corpus_df.index.map(pagerank)
corpus_df['pagerank'] = corpus_df['pagerank'].fillna(0)

print(f"Corpus prepared with {len(corpus_df)} papers.")
display(corpus_df.sort_values(by='pagerank', ascending=False).head(75))

Fetching full corpus data from Neo4j...
Fetching citation network for PageRank calculation...
Corpus prepared with 2055 papers.


paperId,title,year,overall_citations,in_corpus_citations,venue,pagerank
8db33a0a1c3ab2b45f9229896e9e2a02e309bab8,Demand response of a heterogeneous cluster of ...,2014,74,7,Power Systems Computation Conference,0.019176
745a134eca192982e8e0c16d6f36cfe24f9bdd08,"Woulda, Coulda, Shoulda: Counterfactually-Guid...",2018,143,11,International Conference on Learning Represent...,0.018275
648ea87fe7f99ca8ea5090cb1ba40242299ef4c4,Reinforcement learning for demand response: A ...,2019,604,31,Applied Energy,0.016781
59021ecd3fd15c59b8b774c87ae974c9fffe9fa5,Model-Free Real-Time EV Charging Scheduling Ba...,2019,393,39,IEEE Transactions on Smart Grid,0.015975
0aa23eca1bf2302bd358b7d31e76987a0b6fb1b0,Optimal Demand Response Using Device-Based Rei...,2014,236,18,IEEE Transactions on Smart Grid,0.013682
...,...,...,...,...,...,...
3afbead850747d4a98c24cca0af1f5e78a128396,An effective energy management Layout-Based re...,2023,39,1,Solar Energy,0.002737
8addbf4b72e2e7c8ee85f8b1d03b3a2298b741a6,Power Flow Management in Electric Vehicles Cha...,2020,16,1,IEEE Congress on Evolutionary Computation,0.002737
267ada92e896d061d401665ab1443b570132ad49,A Multi-Level Reinforcement-Learning Model of ...,2019,3,1,2019 Conference on Cognitive Computational Neu...,0.002737
387a17823d7c47c0bd3390a124708933032989e0,Generalized Decision Transformer for Offline H...,2021,107,2,International Conference on Learning Represent...,0.002737


## 2. Bucket A: Identifying Foundational Pillars
The "Foundational Pillars" are identified using a composite "Foundational Score." This score is a weighted average of three normalized metrics: overall citations (external influence), in-corpus citations (domain centrality), and PageRank (network influence). This multi-faceted approach provides a robust measure of a paper's established importance.
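
The arithmetic behind the composite score can be checked by hand on a few rows. The sketch below uses three hypothetical papers with made-up metric values. Only the weights (0.5, 0.3, 0.2) and the divide-by-maximum normalization come from this notebook. The names `toy_papers` and `foundational_score_sketch` are illustrative.

```python
# Hypothetical metrics per paper: (overall_citations, in_corpus_citations, pagerank).
toy_papers = {
    "paper_a": (600, 30, 0.016),
    "paper_b": (400, 40, 0.020),
    "paper_c": (50, 2, 0.004),
}
toy_weights = {"in_corpus": 0.5, "pagerank": 0.3, "overall": 0.2}

# Each metric is scaled by its maximum so that all three lie between 0 and 1.
max_overall = max(metrics[0] for metrics in toy_papers.values())
max_in_corpus = max(metrics[1] for metrics in toy_papers.values())
max_pagerank = max(metrics[2] for metrics in toy_papers.values())


def foundational_score_sketch(overall, in_corpus, pagerank):
    """Weighted average of the three max-normalized metrics."""
    return (
        toy_weights["in_corpus"] * in_corpus / max_in_corpus
        + toy_weights["pagerank"] * pagerank / max_pagerank
        + toy_weights["overall"] * overall / max_overall
    )


toy_foundational = {
    paper_id: foundational_score_sketch(*metrics)
    for paper_id, metrics in toy_papers.items()
}
for paper_id, score in sorted(toy_foundational.items(), key=lambda kv: -kv[1]):
    print(paper_id, round(score, 3))
```

Because the weights sum to 1 and each normalized metric is at most 1, the score is bounded by 1. In this toy data `paper_b` outranks `paper_a` despite fewer overall citations, because in-corpus citations carry the largest weight.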

In [None]:
# A copy of the main dataframe is created for this analysis.
foundational_df = corpus_df.copy()

# The metrics are normalized to a 0-1 scale to allow for fair comparison.
foundational_df['norm_overall'] = foundational_df['overall_citations'] / foundational_df['overall_citations'].max()
foundational_df['norm_in_corpus'] = foundational_df['in_corpus_citations'] / foundational_df['in_corpus_citations'].max()
foundational_df['norm_pagerank'] = foundational_df['pagerank'] / foundational_df['pagerank'].max()

# The weighted Foundational Score is calculated.
# In-corpus citations are weighted most heavily to prioritize domain relevance.
weights = {'in_corpus': 0.5, 'pagerank': 0.3, 'overall': 0.2}
foundational_df['foundational_score'] = (
    weights['in_corpus'] * foundational_df['norm_in_corpus'] +
    weights['pagerank'] * foundational_df['norm_pagerank'] +
    weights['overall'] * foundational_df['norm_overall']
)

# The top 100 foundational papers are selected.
foundational_papers = foundational_df.sort_values('foundational_score', ascending=False).head(100)

print(f"Identified {len(foundational_papers)} foundational papers.")
display(foundational_papers[['title', 'year', 'foundational_score']].head(75))

Identified 100 foundational papers.


paperId,title,year,foundational_score
59021ecd3fd15c59b8b774c87ae974c9fffe9fa5,Model-Free Real-Time EV Charging Scheduling Ba...,2019,0.794918
648ea87fe7f99ca8ea5090cb1ba40242299ef4c4,Reinforcement learning for demand response: A ...,2019,0.729113
bd7638ddbbe249c3e6070f951b39f961b5e61cb5,Incentive-based demand response for smart grid...,2019,0.498753
cd3c9cba90f778ada660a462d58573756c60fb27,Reinforcement Learning-Based Plug-in Electric ...,2017,0.480211
0aa23eca1bf2302bd358b7d31e76987a0b6fb1b0,Optimal Demand Response Using Device-Based Rei...,2014,0.471835
...,...,...,...
816ea64552b56ca901939e8506ff7802251dd783,Dynamic Pricing Strategy of Electric Vehicle A...,2021,0.087352
8605228dc596ca1701a3e19d6eff49225e399b05,Dynamic pricing and energy management for prof...,2021,0.087345
95900962a264ce89610877a323334c441c93c3b2,Demand Response Management for Industrial Faci...,2019,0.087227
b2c70c4d23c98dd4e77234fe0720595d3d565a12,Systematic Evaluation of Causal Discovery in V...,2021,0.086073


## 3. Bucket B: Identifying Rising Stars

The "Rising Stars" are identified to counteract the citation bias towards older papers. The method first filters the corpus for recent publications (the last three calendar years, i.e. 2023 onwards with `current_year = 2025`). It then calculates a "Citation Velocity" for each recent paper: overall citations divided by the number of years since publication, counting the publication year itself. Ranking by this rate surfaces the new papers that are gaining impact most quickly.
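
A small worked example may make the velocity calculation concrete. The citation counts below are invented, and the names `citation_velocity_sketch`, `toy_corpus`, and `WINDOW_YEARS` are illustrative rather than part of the notebook.

```python
CURRENT_YEAR = 2025
WINDOW_YEARS = 3  # keep papers from CURRENT_YEAR - 2 through CURRENT_YEAR


def citation_velocity_sketch(overall_citations, year, current_year=CURRENT_YEAR):
    """Citations per year since publication, counting the publication year."""
    return overall_citations / (current_year - year + 1)


# Hypothetical papers: (overall_citations, publication year).
toy_corpus = {
    "p_2020": (500, 2020),  # highly cited, but outside the recency window
    "p_2023": (60, 2023),
    "p_2024": (50, 2024),
    "p_2025": (30, 2025),
}

toy_velocity = {
    paper_id: citation_velocity_sketch(citations, year)
    for paper_id, (citations, year) in toy_corpus.items()
    if year >= CURRENT_YEAR - (WINDOW_YEARS - 1)
}
toy_velocity_order = sorted(toy_velocity, key=toy_velocity.get, reverse=True)
print(toy_velocity_order)
```

Here the 2025 paper ranks first even though it has the fewest raw citations, which is the intended correction for age. The `+ 1` in the denominator also means a paper published in the current year is credited with its full citation count.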

In [13]:
# A copy of the main dataframe is created and filtered for recent papers.
current_year = 2025
rising_stars_df = corpus_df[corpus_df['year'] >= (current_year - 2)].copy()

# Citation velocity is calculated.
# 1 is added to the denominator to avoid division by zero for papers published in the current year.
rising_stars_df['citation_velocity'] = rising_stars_df['overall_citations'] / (current_year - rising_stars_df['year'] + 1)

# The top 100 rising stars are selected.
rising_stars = rising_stars_df.sort_values('citation_velocity', ascending=False).head(100)

print(f"Identified {len(rising_stars)} rising star papers.")
display(rising_stars[['title', 'year', 'citation_velocity']].head(75))

Identified 100 rising star papers.


paperId,title,year,citation_velocity
1b1efa2f9731ab3801c46bfc877695d41e437406,An Online Reinforcement Learning-Based Energy ...,2025,60.000000
7158277c0361f15a7c67621f5940c4208c9c46ce,Optimizing renewable energy systems through ar...,2024,52.000000
5432468c6ab917ae8540be8e2086da301b59461e,Deep Learning and Artificial Intelligence in S...,2023,40.000000
ba54f632c4edf6aaa44bbdfc31d1942a379352c3,Asynchronous Deep Reinforcement Learning for C...,2023,36.333333
14fbad4644fcefc0bb2ea5314bc65387a11b7b21,AI and human-robot interaction: A review of re...,2024,29.500000
...,...,...,...
87ad649e29687275b1bd288104ec0238c321a72f,Reinforcement Learning-Based Demand Response M...,2023,7.000000
dcc4cc6d47fd5cb583bc3ce7f36c2b6ce1610866,AI-DRIVEN OPTIMIZATION IN RENEWABLE HYDROGEN P...,2025,7.000000
8322a4ac24b1e43ae684e0d66ae67e6e256bf377,A Reinforcement Learning Approach for Integrat...,2023,7.000000
c33a21e4215a5cd690ca8e2bb0d770b45adf5c89,Reinforcing the Diffusion Chain of Lateral Tho...,2025,7.000000


## Bucket C: Identifying the Pre-publication Frontier

To capture the most recent, pre-publication research that has not yet had time to accumulate citations, this step identifies promising preprints from arXiv. The methodology uses semantic similarity, computed here from title embeddings, to find the current-year preprints most closely related to the core themes of the established "Rising Stars."
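
The prototype-based similarity ranking can be sketched without an embedding model. The three-dimensional vectors below are hypothetical stand-ins for sentence embeddings, and the helpers `mean_vector` and `cosine` are illustrative. The notebook itself uses `SentenceTransformer` and scikit-learn's `cosine_similarity`.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the 'prototype')."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of two reference ("rising star") papers.
reference_vectors = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
# Hypothetical embeddings of two candidate preprints.
candidate_vectors = {
    "on_topic": [0.9, 0.3, 0.0],
    "off_topic": [0.0, 0.0, 1.0],
}

prototype = mean_vector(reference_vectors)
toy_similarity = {
    name: cosine(vector, prototype) for name, vector in candidate_vectors.items()
}
toy_similarity_order = sorted(toy_similarity, key=toy_similarity.get, reverse=True)
print(toy_similarity_order)
```

One design consequence is that averaging collapses all reference papers into a single direction. A preprint that closely matches one niche theme among the references can still score lower than one that is moderately close to all of them.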

In [22]:
# --- BUCKET C: PRE-PUBLICATION FRONTIER ---
print("\nIdentifying Pre-publication Frontier papers...")

# It first filters the corpus for papers from the current year and from arXiv.
recent_arxiv_df = corpus_df[(corpus_df['year'] == 2025) & (corpus_df['venue'] == 'arXiv.org')].copy()
print(f"Found {len(recent_arxiv_df)} recent arXiv papers.")

if not recent_arxiv_df.empty:
    # An embedding model is initialized.
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

    # It generates embeddings for the titles of the "Rising Stars" and recent arXiv papers.
    print("Generating embeddings for similarity search...")
    rising_star_embeddings = embedding_model.encode(rising_stars['title'].tolist())
    arxiv_embeddings = embedding_model.encode(recent_arxiv_df['title'].tolist())

    # It creates a single "prototype" vector representing the core of the rising star research.
    prototype_embedding = np.mean(rising_star_embeddings, axis=0).reshape(1, -1)

    # It calculates the cosine similarity between each arXiv paper and the prototype.
    # ravel() flattens the (n, 1) result into one score per paper.
    similarities = cosine_similarity(arxiv_embeddings, prototype_embedding).ravel()
    recent_arxiv_df['similarity_score'] = similarities

    # The top 25 most similar preprints are selected.
    pre_publication_papers = recent_arxiv_df.sort_values('similarity_score', ascending=False).head(25)

    print(f"Identified {len(pre_publication_papers)} pre-publication papers.")
    display(pre_publication_papers[['title', 'similarity_score']].head(25))
else:
    print("No recent arXiv papers found to form Bucket C.")
    pre_publication_papers = pd.DataFrame()


Identifying Pre-publication Frontier papers...
Found 56 recent arXiv papers.
Generating embeddings for similarity search...
Identified 25 pre-publication papers.


paperId,title,similarity_score
dd3c8ed71b80fb7d4dd6fb68b409e21f984e2f9b,Deep Reinforcement Learning-Based Optimization...,0.805666
f34925ad0d98a506eec2934e3435eef12554969e,A Generative Model Enhanced Multi-Agent Reinfo...,0.775052
e9977d1686710191032d176ff32f1f5bb12c0704,Replicating the behaviour of electric vehicle ...,0.714626
83e899f71cf32a769de2bc786c63810a2e1320db,Integration of Multi-Mode Preference into Home...,0.714096
d137a77a7abb3c19ff678f234756f2b6a00b6b71,LLM-Enhanced Multi-Agent Reinforcement Learnin...,0.666148
df96e0f65c5743c9459c4d5ae3e022741cdb8f2c,Deep Learning Innovations for Energy Efficienc...,0.646962
75bf60502ab1e9a1325f817c37ab83c4ee129037,RAD: Training an End-to-End Driving Policy via...,0.621505
4376e282954ec59eaeca345ce4ec99219a075670,A Unified Pairwise Framework for RLHF: Bridgin...,0.621246
3429d178fc117ccef92fbf6910ad4cff34290094,Reinforcement Learning-Driven Plant-Wide Refin...,0.620789
1dce46450c1a08aa4cc0da964162733b67909fc0,A Roadmap Towards Improving Multi-Agent Reinfo...,0.598535


## 4. Candidate Selection

The final step combines the papers from the "Foundational Pillars," "Rising Stars," and "Pre-publication Frontier" buckets. The list is de-duplicated to create the final set of candidate papers. This balanced set, representing established, emerging, and pre-publication research, will be the focus of the in-depth content analysis in Phase 2.
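
De-duplication across buckets is easiest to reason about when it is keyed on the paper ID alone, since each bucket carries its own score column and rows for the same paper therefore differ. A minimal sketch with hypothetical IDs, where `combine_buckets` and `toy_buckets` are illustrative names:

```python
def combine_buckets(buckets):
    """Merge (bucket_name, paper_ids) pairs, keeping each paper's first bucket."""
    first_bucket = {}
    for bucket_name, paper_ids in buckets:
        for paper_id in paper_ids:
            # setdefault leaves an existing entry untouched, so the first
            # bucket a paper appears in wins.
            first_bucket.setdefault(paper_id, bucket_name)
    return first_bucket


# Hypothetical buckets with overlapping paper IDs.
toy_buckets = [
    ("foundational", ["p1", "p2", "p3"]),
    ("rising_star", ["p3", "p4"]),
    ("pre_publication", ["p4", "p5"]),
]
toy_candidates = combine_buckets(toy_buckets)
print(len(toy_candidates), toy_candidates)
```

Recording which bucket first contributed each paper also makes it possible to check the balance of the final set.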

In [23]:
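# The three buckets are combined into a single DataFrame.
final_candidates = pd.concat([foundational_papers, rising_stars, pre_publication_papers])

# De-duplicate on the paperId index, keeping the first occurrence.
# drop_duplicates() compares column values only, and the bucket-specific
# score columns differ between buckets, so it would not catch repeated papers.
final_candidates = final_candidates[~final_candidates.index.duplicated(keep='first')]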

# The paperId index is reset to a column.
final_candidates.reset_index(inplace=True)

print(f"Total unique candidate papers for deep analysis: {len(final_candidates)}")

# The final list is saved to a new CSV file for the next phase.
final_candidates.to_csv('./data/processed/candidate_papers_for_analysis.csv', index=False)
print("Candidate list saved.")

display(final_candidates[['title', 'year']].head())

Total unique candidate papers for deep analysis: 225
Candidate list saved.


,title,year
0,Model-Free Real-Time EV Charging Scheduling Ba...,2019
1,Reinforcement learning for demand response: A ...,2019
2,Incentive-based demand response for smart grid...,2019
3,Reinforcement Learning-Based Plug-in Electric ...,2017
4,Optimal Demand Response Using Device-Based Rei...,2014
