
Dimitrios Georgitsis 4334

Kaliopi Oikonomou 5099



Kaggle Names: Dimitris Georgitsis, kaliopiOiko

# Link Prediction in Citation Networks

This notebook extracts features and builds a model that predicts citation links between research papers, optimized to minimize log loss.

It is a supervised link prediction model that combines NLP (textual similarity) and graph-based features to predict edges in a citation network.

**Note**

The code was developed and tested using LightGBM version 3.3.5.
Please make sure this version is installed before running the code, either via the terminal or directly inside the notebook.

In [3]:
import re
import random
import numpy as np
import pandas as pd
import networkx as nx

from tqdm import tqdm

from collections import defaultdict

from gensim.models import Word2Vec
from node2vec import Node2Vec

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import log_loss, jaccard_score

import lightgbm as lgb


Loading raw data files into pandas DataFrames.


Each file contains structured information about papers, including authorship, abstracts, citation edges, and test pairs.


The data is parsed and stored in corresponding DataFrames for further processing.


In [4]:
# Load authors data
authors_data = []
with open("authors.txt", 'r', encoding='utf-8') as file:
    for line in file:
        paper_id, _, author_list = line.strip().partition('|--|')
        authors_data.append((int(paper_id), author_list))
authors_df = pd.DataFrame(authors_data, columns=["paper_id", "authors"])

# Load abstracts data
abstracts_data = []
with open("abstracts.txt", 'r', encoding='utf-8') as file:
    for line in file:
        paper_id, _, abstract_text = line.strip().partition('|--|')
        abstracts_data.append((int(paper_id), abstract_text))
abstracts_df = pd.DataFrame(abstracts_data, columns=["paper_id", "abstract"])


# Load edges data
edges_data = []
with open("edgelist.txt", 'r', encoding='utf-8') as file:
    for line in file:
        node1, node2 = map(int, line.strip().split(','))
        edges_data.append((node1, node2))
edges_df = pd.DataFrame(edges_data, columns=["source", "target"])

# Load test data
test_data = []
with open("test.txt", 'r', encoding='utf-8') as file:
    for line in file:
        node1, node2 = map(int, line.strip().split(','))
        test_data.append((node1, node2))
test_df = pd.DataFrame(test_data, columns=["source", "target"])


# The following data cleaning steps (e.g., removing duplicates and NaNs) were initially considered, 
# but later deemed unnecessary since the data appeared to be already clean and consistent.



# Data Preprocessing
#authors_df.drop_duplicates(inplace = True)
#authors_df.dropna(inplace = True)

#abstracts_df.drop_duplicates(inplace = True)
#abstracts_df.dropna(inplace = True)

#edges_df.drop_duplicates(inplace = True)
#edges_df.dropna(inplace = True)

# edgelist:
# paper id 0 -----> paper id 1
# (node) ---> (node)    [citation]



print(f"Loaded {authors_df.shape[0]} papers with author information.")
print(f"Loaded {abstracts_df.shape[0]} papers with abstracts.")
print(f"Loaded {edges_df.shape[0]} citation edges.")
print(f"Loaded {test_df.shape[0]} test pairs.")

print(authors_df.head())
print(abstracts_df.head())
print(edges_df.head())
print(test_df.head())


Loaded 138499 papers with author information.
Loaded 138499 papers with abstracts.
Loaded 1091955 citation edges.
Loaded 106692 test pairs.
   paper_id                                            authors
0         0  James H. Niblock,Jian-Xun Peng,Karen R. McMene...
1         1              Jian-Xun Peng,Kang Li,De-Shuang Huang
2         2                                        J. Heikkila
3         3         L. Teslic,B. Hartmann,O. Nelles,I. Skrjanc
4         4      Long Zhang,Kang Li,Er-Wei Bai,George W. Irwin
   paper_id                                           abstract
0         0  The development of an automated system for the...
1         1  This paper proposes a novel hybrid forward alg...
2         2  Modern CCD cameras are usually capable of a sp...
3         3  This paper deals with the problem of fuzzy non...
4         4  A number of neural networks can be formulated ...
   source  target
0       0       1
1       0       2
2       1       3
3       1       5
4       1     

## A. Feature Extraction 

### From Abstracts

We extract textual similarity features between pairs of papers by processing their abstracts using two common techniques:

- TF-IDF Cosine Similarity:

Each abstract is converted into a numerical vector using TF-IDF. We then compute the cosine similarity between vectors of two papers. A value closer to 1 indicates higher similarity.


- Word2Vec Cosine Similarity:

We train a Word2Vec model on all abstracts. Each abstract is then represented by the average of its word vectors. Cosine similarity is computed between two such averaged vectors.
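
To illustrate the second feature, here is a minimal sketch that averages word vectors per abstract and compares the averages with cosine similarity. The 2-dimensional `toy_vectors` table and the helper names `average_vector` and `cosine` are hypothetical stand-ins for the trained Word2Vec model, chosen so the example runs with the standard library only.

```python
import math

# Hypothetical 2-d "word vectors"; a trained Word2Vec model would supply these.
toy_vectors = {
    "graph": [1.0, 0.0],
    "network": [0.8, 0.2],
    "protein": [0.0, 1.0],
}

def average_vector(tokens, vectors):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

doc_a = average_vector("graph network".split(), toy_vectors)
doc_b = average_vector("network graph unknown".split(), toy_vectors)
doc_c = average_vector("protein".split(), toy_vectors)

sim_ab = cosine(doc_a, doc_b)  # same known words, so the averages coincide
sim_ac = cosine(doc_a, doc_c)  # different vocabulary, so the similarity is lower
```

Because averaging ignores word order and unknown tokens, `doc_a` and `doc_b` get the same vector, and an abstract with no known tokens gets the zero vector and a similarity of 0.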

------

To train a binary classification model, we need both positive and negative examples:

- Positive examples are paper pairs that actually have a citation link (i.e., they appear in the citation graph edges_df).

- Negative examples are randomly sampled pairs of papers that do not have a citation link between them.
    We ensure:

    - The two papers are not the same (paper1 ≠ paper2)

    - The pair doesn't already exist as a real citation (to avoid false negatives)

    - The number of negative samples is equal to the number of positive samples to maintain a balanced dataset
        
        
We defined the positive examples with a label of 1, and the negative examples with a label of 0.  
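
The sampling rules above can be sketched on a toy graph as follows. The function name `sample_negative_pairs`, its parameters, and the toy edge set are illustrative assumptions, not part of the notebook. The sketch also assumes the graph is sparse enough that the requested number of unconnected pairs exists.

```python
import random

def sample_negative_pairs(num_nodes, positive_edges, num_samples, seed=0):
    """Draw distinct (u, v) pairs with u != v that are not in positive_edges."""
    rng = random.Random(seed)
    seen = set(positive_edges)   # copy, so the caller's set is not modified
    negatives = []
    while len(negatives) < num_samples:
        u = rng.randrange(num_nodes)
        v = rng.randrange(num_nodes)
        if u != v and (u, v) not in seen:
            negatives.append((u, v))
            seen.add((u, v))     # prevents drawing the same pair twice
    return negatives

# Hypothetical toy data: 6 papers and 3 citation edges.
toy_positives = {(0, 1), (0, 2), (1, 3)}
toy_negatives = sample_negative_pairs(
    num_nodes=6,
    positive_edges=toy_positives,
    num_samples=len(toy_positives),  # balanced: one negative per positive
    seed=42,
)
```

The membership check is directional, so a pair (v, u) can still be sampled as a negative when (u, v) is a real citation.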

#### Code for Generating Positive Examples for TF-IDF and Word2Vec

In [3]:
# TF-IDF Feature Extraction
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(abstracts_df["abstract"]) # article abstracts -> numerical vectors (using TF-IDF)

# Function to compute TF-IDF similarity between two papers
def get_tfidf_similarity(paper1_id, paper2_id, tfidf_matrix):
    return cosine_similarity(tfidf_matrix[paper1_id], tfidf_matrix[paper2_id])[0][0] # similarity score closer to 1 -> higher similarity between the abstracts.
    

# Word2Vec Feature Extraction
# Tokenize abstracts into words
tokenized_abstracts = [abstract.split() for abstract in abstracts_df["abstract"]] # each abstract -> list of tokens (words)

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_abstracts, vector_size=300, window=5, min_count=2, workers=2) # Each word is represented by a 300-dimensional vector
# We experimented with different Word2Vec hyperparameters such as vector_size, window size, and min_count.
# After evaluating performance, we selected: vector_size=300, window=5, min_count=2 as they provided a good trade-off between quality and computational efficiency.



# Function to compute the average Word2Vec vector for an abstract (given as a list of tokens)
def get_abstract_vector(tokens, model):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)     # For each abstract, compute the average of its word vectors
    else:
        return np.zeros(model.vector_size)

# Compute Word2Vec vectors for all abstracts
abstract_vectors = [get_abstract_vector(abstract, word2vec_model) for abstract in tokenized_abstracts]

# Function to compute Word2Vec similarity between two papers
def get_word2vec_similarity(paper1_id, paper2_id, abstract_vectors):
    return cosine_similarity([abstract_vectors[paper1_id]], [abstract_vectors[paper2_id]])[0][0]


textual_features = [] # list to store textual similarity features for connected papers


# Iterate through the edgelist and compute textual similarity for each pair
for _, row in edges_df.iterrows():   # for every edge 'paperid1, paperid2'
    paper1_id = row["source"]
    paper2_id = row["target"]
    
    # Ensure paper IDs are within the range of abstracts
    if paper1_id < len(abstracts_df) and paper2_id < len(abstracts_df):
        # Compute TF-IDF similarity
        tfidf_sim = get_tfidf_similarity(paper1_id, paper2_id, tfidf_matrix)
        
        # Compute Word2Vec similarity
        word2vec_sim = get_word2vec_similarity(paper1_id, paper2_id, abstract_vectors)

        
        # Store textual similarity features for each positive (citation) pair
        # Append to features
        textual_features.append({
            "source": paper1_id,
            "target": paper2_id,
            "tfidf_similarity": tfidf_sim,
            "word2vec_similarity": word2vec_sim,
            "label": 1  # positive example
        })
        # format of textual_features:
        #[  {"source": 0, "target": 1, "tfidf_similarity": 0.45, "word2vec_similarity": 0.61, "label": 1},
        #  {"source": 0, "target": 2, "tfidf_similarity": 0.12, "word2vec_similarity": 0.37, "label": 1},
        #  ...]


#textual_features_df = pd.DataFrame(textual_features)
#print(textual_features_df.head())


#### Code for Negative Examples for TF-IDF and Word2Vec

In [4]:
# Generate negative examples (non-citation pairs) by randomly sampling unconnected paper pairs
# For each sampled pair, compute TF-IDF and Word2Vec similarities and assign label = 0
# The number of negative samples matches the number of positive samples to maintain class balance


# set of positive edges
positive_edges = set((row["source"], row["target"]) for _, row in edges_df.iterrows())
num_papers = abstracts_df.shape[0]
num_negative_samples = len(edges_df)  # same length with positive samples

negative_samples = []

while len(negative_samples) < num_negative_samples:
    paper1_id = random.randint(0, num_papers - 1)
    paper2_id = random.randint(0, num_papers - 1)

    # pairs must be: (different papers) AND (edges that don't already exist)
    if paper1_id != paper2_id and (paper1_id, paper2_id) not in positive_edges:
        negative_samples.append((paper1_id, paper2_id))
        positive_edges.add((paper1_id, paper2_id))  # Add to the set to avoid selecting this pair again

negative_textual_features = []

for paper1_id, paper2_id in tqdm(negative_samples, desc="Processing negative samples"):
    tfidf_sim = get_tfidf_similarity(paper1_id, paper2_id, tfidf_matrix)
    word2vec_sim = get_word2vec_similarity(paper1_id, paper2_id, abstract_vectors)

    negative_textual_features.append({
        "source": paper1_id,
        "target": paper2_id,
        "tfidf_similarity": tfidf_sim,
        "word2vec_similarity": word2vec_sim,
        "label": 0  # negative example
    })

    
    
# Merging positive and negative feature sets into a single DataFrame
# DataFrame of (positive + negative) textual features
all_textual_features_df = pd.DataFrame(textual_features + negative_textual_features)

print(all_textual_features_df.head())

# all_textual_features_df.to_csv("all_textual_features_df.csv", index=False)


Processing negative samples: 100%|██████████| 1091955/1091955 [19:43<00:00, 922.41it/s]


   source  target  tfidf_similarity  word2vec_similarity  label
0       0       1          0.089054             0.806190      1
1       0       2          0.145861             0.906708      1
2       1       3          0.125001             0.889072      1
3       1       5          0.095378             0.839317      1
4       1       6          0.286036             0.922433      1


### From Authors

In this section, we extract features based on the authorship information of the papers. The intuition is that shared or similar authorship patterns may indicate a higher likelihood of citation between two papers. We compute:

- Jaccard Similarity:
    Measures the overlap between the sets of authors for two papers. A higher value suggests more shared authors, which may imply topical or collaborative proximity.

- Common Authors (Count):
    Counts how many authors appear in both papers. This raw count serves as a simple but effective signal of author-level connection.

These features aim to capture structural and collaborative similarity between papers beyond their textual content.

We use the same set of positive and negative paper pairs (source-target pairs) as in the textual features extraction. This ensures consistency when combining features later.
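
As a small worked example of both author features, the sketch below uses two author strings from the `authors_df` preview shown earlier (papers 1 and 4). The helper names `author_set` and `author_features` are hypothetical. Unlike the notebook's `compute_author_overlap`, this version also trims whitespace around names, which is an assumption about how the strings should be cleaned.

```python
def author_set(authors):
    """Split a comma-separated author string into a set of trimmed names."""
    return {name.strip() for name in authors.split(",") if name.strip()}

def author_features(authors1, authors2):
    """Return (jaccard_similarity, common_author_count) for two author strings."""
    set1, set2 = author_set(authors1), author_set(authors2)
    common = len(set1 & set2)
    union = len(set1 | set2)
    jaccard = common / union if union else 0.0
    return jaccard, common

jaccard, common = author_features(
    "Jian-Xun Peng,Kang Li,De-Shuang Huang",
    "Long Zhang,Kang Li,Er-Wei Bai,George W. Irwin",
)
# One shared author (Kang Li) out of six distinct names: jaccard = 1/6, common = 1
```

The two features carry different information: the count ignores team size, while the Jaccard value shrinks as either author list grows.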

#### Code for Generating Positive Examples for Jaccard Similarity

In [5]:

# Function to compute Jaccard similarity between two sets of authors
def compute_author_overlap(authors1, authors2):   # Handling multiple authors per paper as comma-separated lists for overlap calculation
    set1 = set(authors1.split(','))
    set2 = set(authors2.split(','))
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union > 0 else 0

author_overlap_features = [] # to store author overlap features

# Iterate through the edgelist and compute author overlap for each pair
for _, row in edges_df.iterrows():
    paper1_id = row["source"]
    paper2_id = row["target"]
    
    # Get authors for both papers
    authors1 = authors_df[authors_df["paper_id"] == paper1_id]["authors"].values[0]
    authors2 = authors_df[authors_df["paper_id"] == paper2_id]["authors"].values[0]
    
    # Compute author overlap
    overlap = compute_author_overlap(authors1, authors2)
    
    # Append to features
    author_overlap_features.append({
        "source": paper1_id,
        "target": paper2_id,
        "author_overlap": overlap,
        "label" : 1 # positive example
    })

# Convert to DataFrame
#author_overlap_features_df = pd.DataFrame(author_overlap_features)

#print(author_overlap_features_df.head())

#### Code for Negative Examples for Jaccard Similarity

In [6]:
negative_author_overlap_features = []

# Computes the 'author_overlap' feature on the already generated negative_samples
for paper1_id, paper2_id in tqdm(negative_samples, desc="Author overlap for negatives"):
    try:
        authors1 = authors_df[authors_df["paper_id"] == paper1_id]["authors"].values[0]
        authors2 = authors_df[authors_df["paper_id"] == paper2_id]["authors"].values[0]
        
        # authors_df layout:
        #   paper_id  authors
        #   3         Alice,Bob,Charlie
        # so for paper 3, authors1 would be 'Alice,Bob,Charlie'
        
        
        overlap = compute_author_overlap(authors1, authors2)

        negative_author_overlap_features.append({
            "source": paper1_id,
            "target": paper2_id,
            "author_overlap": overlap,
            "label": 0  # negative example
        })
    except IndexError:
        # if a paper_id is not in authors_df (.values[0] -> error)
        continue

# DataFrame with the negative author overlap features
#negative_author_df = pd.DataFrame(negative_author_overlap_features)


# Merging positive and negative feature sets into a single DataFrame
#DataFrame of (positive + negative) author_overlap features
all_author_overlap_features_df = pd.DataFrame(author_overlap_features + negative_author_overlap_features)

print(all_author_overlap_features_df.head())

#all_author_overlap_features_df.to_csv("authors_features_labeled.csv", index=False)

# Combine the textual similarity features (TF-IDF, Word2Vec) with the author overlap feature for each paper pair


merged_textual_authors_overlap_df = all_textual_features_df.merge(all_author_overlap_features_df, on=["source", "target", "label"], how="inner")



Author overlap for negatives: 100%|██████████| 1091955/1091955 [11:17<00:00, 1611.23it/s]


   source  target  author_overlap  label
0       0       1        0.166667      1
1       0       2        0.000000      1
2       1       3        0.000000      1
3       1       5        0.000000      1
4       1       6        0.200000      1


#### Code for Generating Positive Examples for Common Authors


Author names in the dataset vary in format (capitalization, punctuation, use of initials). To handle this, we normalized names by converting to lowercase, removing special characters, and standardizing them to an “initials + last name” format. We mapped all variations to a canonical form and created author sets per paper. This allowed accurate calculation of features like common author counts, improving the quality of author-based similarity metrics.

In [7]:
# Function to normalize author names
def normalize_author_name(name):
    name = name.lower().strip()
    name = re.sub(r"[^\w\s.]", "", name)  # Keeps only letters, numbers, spaces, and dots
    return name

# Function to convert to 'initials + last name' format
def get_initials_lastname(full_name):
    words = full_name.split()
    if len(words) > 1:
        initials = ".".join([w[0] for w in words[:-1]]) + "."
        last_name = words[-1]
        return initials + " " + last_name
    return full_name

# Dictionary of name variations (normalized name -> set of its variants)
author_variants = defaultdict(set)

for authors in authors_df["authors"]:
    for name in authors.split(","):
        norm_name = normalize_author_name(name)
        initials_last = get_initials_lastname(norm_name)
        author_variants[norm_name].add(norm_name)
        author_variants[norm_name].add(initials_last)

# Dictionary: variation -> canonical name
canonical_lookup = {}
for canonical_name, variations in author_variants.items():
    for var in variations:
        canonical_lookup[var] = canonical_name

# author sets for each paper
paper_authors = {}

for _, row in authors_df.iterrows():
    paper_id = row["paper_id"]
    author_list = row["authors"]
    canonical_authors = set()
    for name in author_list.split(","):
        norm_name = normalize_author_name(name)
        initials_last = get_initials_lastname(norm_name)
        canonical_name = canonical_lookup.get(norm_name, canonical_lookup.get(initials_last, norm_name))
        canonical_authors.add(canonical_name)
    paper_authors[paper_id] = canonical_authors

#Calculate the 'common_authors_count' feature for positive examples
common_authors_features = []

for _, row in tqdm(edges_df.iterrows(), total=len(edges_df), desc="Processing common authors"):
    paper1_id = row["source"]
    paper2_id = row["target"]

    authors1 = paper_authors.get(paper1_id, set())
    authors2 = paper_authors.get(paper2_id, set())

    common_count = len(authors1.intersection(authors2))

    common_authors_features.append({
        "source": paper1_id,
        "target": paper2_id,
        "common_authors_count": common_count,
        "label": 1 # positive example
    })

    
#common_authors_features_df = pd.DataFrame(common_authors_features)
#print(common_authors_features_df.head())

Processing common authors: 100%|██████████| 1091955/1091955 [00:50<00:00, 21596.15it/s]


#### Code for Generating Negative Examples for Common Authors

In [9]:
# Calculate feature 'common_authors_count' for negative samples

negative_common_authors_features = []

for paper1_id, paper2_id in tqdm(negative_samples, desc="Processing common authors (negatives)"):
    authors1 = paper_authors.get(paper1_id, set())
    authors2 = paper_authors.get(paper2_id, set())

    common_count = len(authors1.intersection(authors2))

    negative_common_authors_features.append({
        "source": paper1_id,
        "target": paper2_id,
        "common_authors_count": common_count,
        "label": 0 # negative example
    })



# Merging positive and negative feature sets into a single DataFrame
#DataFrame of (positive + negative) common_authors features
all_common_authors_features_df = pd.DataFrame(common_authors_features + negative_common_authors_features)


# Data Frame of merged (textual + author_overlap + common) features, based on keys: (source, target, label)
merged_textual_author_overlap_common_df = merged_textual_authors_overlap_df.merge(
    all_common_authors_features_df,
    on=["source", "target", "label"],
    how="left"
)
cols = [
    "source",
    "target",
    "tfidf_similarity",
    "word2vec_similarity",
    "author_overlap",
    "common_authors_count",
    "label"
]

merged_textual_author_overlap_common_df = merged_textual_author_overlap_common_df[cols]



#print(merged_textual_author_overlap_common_df.head())

merged_textual_author_overlap_common_df.to_csv("features_labeled.csv", index=False)


Processing common authors (negatives): 100%|██████████| 1091955/1091955 [00:01<00:00, 566980.76it/s]


### From Graph
We create a directed graph (DiGraph), so edges have a direction: an edge paper1 → paper2 means that paper1 cites paper2.

For each pair (paper1, paper2), we compute the following graph-based features:


##### 1. Common Neighbors
The number of nodes that are directly connected to both paper1 and paper2.
To calculate this, we temporarily treat the graph as undirected.


##### 2. Jaccard Coefficient -  Neighborhood Similarity
Measures how similar the sets of neighbors of paper1 and paper2 are.

If two papers share many of the same neighbors, the Jaccard value is high, meaning the papers are more closely related in the graph.

Example:

- Paper A neighbors: {1, 2, 3, 4}
- Paper B neighbors: {3, 4, 5, 6}
- Common neighbors = {3, 4} ➔ size = 2
- Union of neighbors = {1, 2, 3, 4, 5, 6} ➔ size = 6
- Jaccard = 2 / 6 = 0.333


##### 3. Adamic-Adar Index

Like common neighbors, but each shared neighbor is weighted by the inverse logarithm of its degree, so neighbors with few connections count more.

Rare neighbors (nodes with fewer connections) are considered more informative.
A higher Adamic-Adar score → stronger implicit connection between papers.
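
The first three measures can be checked by hand on the Paper A / Paper B example from the Jaccard section. In the sketch below, the neighbor sets of A and B come from that example. The remaining adjacency entries, including the extra 4-5 edge, are hypothetical and only give the shared neighbors different degrees. The final code uses the NetworkX functions instead.

```python
import math

# Toy undirected graph as an adjacency dict (hypothetical data).
toy_graph = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5, 6},
    1: {"A"}, 2: {"A"},
    3: {"A", "B"},        # degree 2
    4: {"A", "B", 5},     # degree 3
    5: {"B", 4},
    6: {"B"},
}

def common_neighbors(graph, u, v):
    """Nodes adjacent to both u and v."""
    return graph[u] & graph[v]

def jaccard_coefficient(graph, u, v):
    """Size of the shared neighborhood divided by the size of the combined one."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0

def adamic_adar(graph, u, v):
    """Sum of 1 / log(degree) over the common neighbors of u and v."""
    return sum(1.0 / math.log(len(graph[w])) for w in graph[u] & graph[v])

shared = common_neighbors(toy_graph, "A", "B")     # {3, 4}
jac = jaccard_coefficient(toy_graph, "A", "B")     # 2 / 6
aa = adamic_adar(toy_graph, "A", "B")              # 1/ln(2) + 1/ln(3)
```

A common neighbor of two distinct nodes always has degree at least 2, so the logarithm in the Adamic-Adar sum is never zero. Node 3 (degree 2) contributes more than node 4 (degree 3).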


##### 4. Degree 
In-Degree: Number of papers that cite this paper (higher in-degree = more popular paper).

Out-Degree: Number of papers this paper cites (higher out-degree = more active paper).


##### 5. Node2Vec Embeddings 
We generate Node2Vec embeddings by performing biased random walks on the undirected version of the citation graph, learning 16-dimensional vector representations for each paper. These embeddings encode both local neighborhood structure and broader graph connectivity patterns that are not captured by simple graph metrics.

For each pair of papers, we compute similarity features using the dot product and cosine similarity of their Node2Vec vectors. These embedding-based similarities are added to the final feature set, improving the model’s ability to capture complex relationships within the citation network.


We include two similarity measures between Node2Vec embeddings for each paper pair:

- Dot Product (n2v_dot): Measures the raw similarity in vector space, capturing the overall magnitude and alignment of the embeddings. It can be sensitive to vector length.

- Cosine Similarity (n2v_cosine): Measures the angle between the two vectors, focusing on their directional similarity regardless of magnitude. This normalizes for vector length differences.
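
The difference between the two measures shows up when one embedding is rescaled. The sketch below uses hypothetical 3-dimensional vectors (the notebook's embeddings are 16-dimensional) and standard-library helpers named `dot` and `cosine` for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

# Hypothetical embeddings; vec_v_scaled points the same way as vec_v but is 10x longer.
vec_u = [1.0, 2.0, 2.0]
vec_v = [2.0, 1.0, 2.0]
vec_v_scaled = [10 * x for x in vec_v]

dot_small = dot(vec_u, vec_v)            # 8.0
dot_large = dot(vec_u, vec_v_scaled)     # 80.0
cos_small = cosine(vec_u, vec_v)         # 8/9
cos_large = cosine(vec_u, vec_v_scaled)  # 8/9 as well
```

Rescaling changes the dot product tenfold and leaves the cosine unchanged, which is why the two features can carry complementary information.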


-----
We also experimented with extracting additional graph-based features:
- Preferential Attachment:
Measures the product of the degrees of two nodes, reflecting their "popularity." More popular nodes are more likely to connect.

- Resource Allocation Index:
Estimates how much "resource" two nodes share via common neighbors. It is often useful in link prediction tasks.

- Shortest Path Length:
Measures the minimum number of edges between two nodes in the directed citation graph. It reflects how "reachable" one paper is from another.

However, these features resulted in worse performance (higher log loss) and took too long to compute, so we removed them from the final code.
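
For reference, the three measures described above can be sketched on a toy graph. This is not the code we experimented with. The adjacency dict and function names are hypothetical, and the graph is treated as undirected for simplicity. The same breadth-first search works on a directed graph if each entry lists only successors.

```python
from collections import deque

# Toy undirected graph as an adjacency dict (hypothetical data).
toy_graph = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5, 6},
    1: {"A"}, 2: {"A"}, 3: {"A", "B"},
    4: {"A", "B", 5}, 5: {"B", 4}, 6: {"B"},
}

def preferential_attachment(graph, u, v):
    """Product of the degrees of u and v."""
    return len(graph[u]) * len(graph[v])

def resource_allocation(graph, u, v):
    """Sum of 1 / degree over the common neighbors of u and v."""
    return sum(1.0 / len(graph[w]) for w in graph[u] & graph[v])

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns None when target is unreachable."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distances[node]
        for neighbor in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return None

pa = preferential_attachment(toy_graph, "A", "B")   # 4 * 4 = 16
ra = resource_allocation(toy_graph, "A", "B")       # 1/2 + 1/3
dist = shortest_path_length(toy_graph, "A", "B")    # 2, for example A - 3 - B
```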



In [10]:

## Create the Directed Graph
citation_graph = nx.DiGraph()
citation_graph.add_edges_from(edges_df.itertuples(index=False, name=None))

print("Citation graph created ")

# Prepare the list of (source, target) pairs
edge_pairs = list(zip(merged_textual_author_overlap_common_df["source"], merged_textual_author_overlap_common_df["target"]))

print(f"Total pairs to process: {len(edge_pairs)}")





# Node2Vec Embeddings

print("Training Node2Vec embeddings...")

# Use an undirected version of the graph (experimentally better for Node2Vec)
n2v_graph = citation_graph.to_undirected()

# Initialize Node2Vec model
#node2vec = Node2Vec(n2v_graph, dimensions=32, walk_length=15, num_walks=30, workers=1, seed=42)  # experimenting with num_walks 
node2vec = Node2Vec(n2v_graph, dimensions=16, walk_length=20, num_walks=50, workers=1, seed=42)  


# Train the Node2Vec model (Word2Vec embedding training)
n2v_model = node2vec.fit(window=10, min_count=1, batch_words=4)

# Create embedding dictionary: paper_id -> embedding vector
node2vec_embeddings = {int(node): n2v_model.wv.get_vector(str(node)) for node in n2v_graph.nodes}

print("Node2Vec training complete ")



# Compute graph-based features in bulk

print("Calculating Common Neighbors...")
common_neighbors_dict = {}
undirected_graph = citation_graph.to_undirected()
for u, v in tqdm(edge_pairs, desc="Common Neighbors"):
    try:
        common = len(list(nx.common_neighbors(undirected_graph, u, v)))
    except nx.NetworkXError:
        # Raised when u or v is not a node of the graph
        common = 0
    common_neighbors_dict[(u, v)] = common

print("Calculating Jaccard Coefficient...")
jaccard_scores = nx.jaccard_coefficient(undirected_graph, edge_pairs)
jaccard_dict = {(u, v): score for u, v, score in tqdm(jaccard_scores, desc="Jaccard Coefficients")}

print("Calculating Adamic-Adar Index...")
adamic_scores = nx.adamic_adar_index(undirected_graph, edge_pairs)
adamic_dict = {(u, v): score for u, v, score in tqdm(adamic_scores, desc="Adamic-Adar Scores")}

print("Calculating Degrees...")
in_degrees = dict(citation_graph.in_degree())
out_degrees = dict(citation_graph.out_degree())


# Combine all features into a final list
graph_features = []

for u, v in tqdm(edge_pairs, desc="Building final feature list"):


    # Embedding-based similarity metrics
    if u in node2vec_embeddings and v in node2vec_embeddings:
        vec_u = node2vec_embeddings[u]
        vec_v = node2vec_embeddings[v]

        n2v_dot = np.dot(vec_u, vec_v)
        n2v_cosine = cosine_similarity([vec_u], [vec_v])[0][0]
    else:
        n2v_dot = 0.0
        n2v_cosine = 0.0

    
    graph_features.append({
        "source": u,
        "target": v,
        "common_neighbors": common_neighbors_dict.get((u, v), 0),
        "jaccard_coeff": jaccard_dict.get((u, v), 0.0),
        "adamic_adar": adamic_dict.get((u, v), 0.0),
        "in_degree_source": in_degrees.get(u, 0),
        "out_degree_source": out_degrees.get(u, 0),
        "in_degree_target": in_degrees.get(v, 0),
        "out_degree_target": out_degrees.get(v, 0),
        "n2v_dot": n2v_dot,          # Node2Vec dot product similarity
        "n2v_cosine": n2v_cosine,    # Node2Vec cosine similarity

    })

graph_features_df = pd.DataFrame(graph_features)

print("Graph features DataFrame created ")

# Merge graph features with existing textual + author dataset

graph_textual_author_overlap_common_df = merged_textual_author_overlap_common_df.merge(
    graph_features_df,
    on=["source", "target"],
    how="left"
)


# Move 'label' column to the end 
# (for aesthetic and organizational purposes only — does not affect model performance)

cols = [col for col in graph_textual_author_overlap_common_df.columns if col != "label"] + ["label"]
graph_textual_author_overlap_common_df = graph_textual_author_overlap_common_df[cols]



# Standardize the Node2Vec similarity features (zero mean, unit variance).
# The scaler is fitted here on the training pairs; the same fitted scaler should be
# reused (transform only) on the test pairs so that both share the same scale.

scaler = StandardScaler()

# Normalize only the Node2Vec similarity features
graph_textual_author_overlap_common_df[["n2v_dot", "n2v_cosine"]] = scaler.fit_transform(
    graph_textual_author_overlap_common_df[["n2v_dot", "n2v_cosine"]]
)


# Shuffling the dataset before training helps to randomize the order of the samples. 
# This prevents the model from learning any unintended patterns related to the original data sequence and ensures that training batches are more representative and diverse. 
# It generally leads to better generalization and more stable training.

graph_textual_author_overlap_common_df = shuffle(graph_textual_author_overlap_common_df).reset_index(drop=True)


# Checking the dataset’s dimensions and sample rows helps verify that all features have been correctly merged and that the data is ready for training or analysis.
print("Final dataset shape:", graph_textual_author_overlap_common_df.shape)
print(graph_textual_author_overlap_common_df.head())

# Save final dataset to CSV (it is read back in the training cell)
graph_textual_author_overlap_common_df.to_csv("final_dataset_with_graph_features.csv", index=False)


Citation graph created 
Total pairs to process: 2183910
Training Node2Vec embeddings...


Computing transition probabilities:   0%|          | 0/138499 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 50/50 [39:23<00:00, 47.28s/it] 


Node2Vec training complete 
Calculating Common Neighbors...


Common Neighbors: 100%|██████████| 2183910/2183910 [00:14<00:00, 150214.64it/s]


Calculating Jaccard Coefficient...


Jaccard Coefficients: 2183910it [00:28, 77307.73it/s] 


Calculating Adamic-Adar Index...


Adamic-Adar Scores: 2183910it [00:14, 154039.19it/s]


Calculating Degrees...


Building final feature list: 100%|██████████| 2183910/2183910 [08:11<00:00, 4443.75it/s]


Graph features DataFrame created 
Final dataset shape: (2183910, 16)
   source  target  tfidf_similarity  word2vec_similarity  author_overlap  \
0   16646   73007          0.407265             0.914058             0.0   
1  101985  108800          0.059412             0.846719             0.0   
2    3212   17042          0.209458             0.920713             0.0   
3   31188   32764          0.000000             0.000000             0.0   
4   12509   71431          0.050083             0.855473             0.0   

   common_authors_count  common_neighbors  jaccard_coeff  adamic_adar  \
0                     0                 2       0.086957     0.916328   
1                     0                 0       0.000000     0.000000   
2                     0                 2       0.038462     0.441081   
3                     0                 0       0.000000     0.000000   
4                     0                 0       0.000000     0.000000   

   in_degree_source  out_degree_sou

## B. Training LightGBM Model with Stratified 10-Fold Cross-Validation

This code trains a LightGBM binary classifier on the final dataset containing graph-based, textual, and author-related features. It uses Stratified K-Fold Cross-Validation (with n=10 folds) to ensure balanced label distribution across splits. For each fold, the model is trained and evaluated using log loss as the performance metric. Early stopping and logging callbacks are used during training for efficiency and monitoring. At the end, the average cross-validation log loss is reported, and trained models are stored for future use or ensemble methods.
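
For intuition about the metric, the following sketch computes binary log loss by hand and averages the predictions of several fold models. The probabilities are made up for illustration. Averaging is only one possible way to ensemble the stored models and is an assumption here, not necessarily what the notebook does later.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
fold_predictions = [          # hypothetical outputs of three fold models
    [0.90, 0.10, 0.80, 0.30],
    [0.80, 0.20, 0.90, 0.20],
    [0.85, 0.15, 0.70, 0.10],
]

# Ensemble by averaging the per-fold probabilities for each pair.
ensemble = [sum(col) / len(col) for col in zip(*fold_predictions)]
ensemble_loss = binary_log_loss(y_true, ensemble)

# A single confident mistake is penalized heavily: -ln(0.01) is about 4.605.
confident_wrong = binary_log_loss([1], [0.01])
```

Because the penalty grows without bound as a wrong prediction approaches 0 or 1, well-calibrated probabilities matter more for this metric than hard class decisions.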

------

We also experimented with the XGBoost model, but it did not yield better results.

In [12]:
# Load the dataset containing features and labels
df = pd.read_csv("final_dataset_with_graph_features.csv")

# Separate features (X) and target label (y)
X = df.drop(columns=["label"]) # all remaining columns are used as features, including the source and target paper IDs
y = df["label"]                # target: 1 = citation, 0 = no citation


# Define LightGBM hyperparameters
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'max_depth': -1,
    'min_child_samples': 20,
    'subsample': 0.8,  # alias of bagging_fraction; only takes effect when bagging_freq (subsample_freq) > 0
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 0.1,
    'n_jobs': -1,
    'verbose': -1,
    'random_state': 42
}



# Number of cross-validation folds
n_folds = 10

# Create Stratified K-Fold object to maintain label balance across folds
cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

models = []    # to store trained models
cv_scores = [] # to store log loss scores for each fold



# Train and evaluate model on each fold
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    print(f"Training fold {fold+1}/{n_folds}...")
    
    X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
    y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]

    
    # Create LightGBM Datasets
    train_data = lgb.Dataset(X_train_fold, label=y_train_fold)
    val_data = lgb.Dataset(X_val_fold, label=y_val_fold)

    
    
    # Train the model with early stopping.
    # Early stopping monitors the validation loss and stops training if it has not
    # improved for 50 consecutive rounds, which limits overfitting and wasted rounds.
    # num_boost_round is not passed, so LightGBM's default of 100 rounds is the upper limit.
    # Logging is handled by the log_evaluation callback (printed every 50 rounds).
    model = lgb.train(
        params,
        train_data,
        valid_sets=[val_data],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)]
    )
    
    # Predict on validation set and compute log loss
    y_val_pred = model.predict(X_val_fold)
    score = log_loss(y_val_fold, y_val_pred)
    cv_scores.append(score)
    print(f"Fold {fold+1} Log Loss: {score:.5f}")

    
    # Store the trained model
    models.append(model)

    

# Print average log loss across all folds
print(f"\nAverage CV Log Loss across {n_folds} folds: {np.mean(cv_scores):.5f}")


Training fold 1/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0768035
[100]	valid_0's binary_logloss: 0.0366219
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.0366219
Fold 1 Log Loss: 0.03662
Training fold 2/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0767561
[100]	valid_0's binary_logloss: 0.0370133
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.0370133
Fold 2 Log Loss: 0.03701
Training fold 3/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0760788
[100]	valid_0's binary_logloss: 0.0359873
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.0359873
Fold 3 Log Loss: 0.03599
Training fold 4/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0762631
[100]	valid_0's binary_logloss: 0.0361037
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.0361037
Fold 4 Log Loss: 0.03610
Training fold 5/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0748327
[100]	valid_0's binary_logloss: 0.0345864
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.0345864
Fold 5 Log Loss: 0.03459
Training fold 6/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0768169
[100]	valid_0's binary_logloss: 0.0370194
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.0370194
Fold 6 Log Loss: 0.03702
Training fold 7/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0761297
[100]	valid_0's binary_logloss: 0.0359406
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.0359406
Fold 7 Log Loss: 0.03594
Training fold 8/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0760844
[100]	valid_0's binary_logloss: 0.0360771
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.0360771
Fold 8 Log Loss: 0.03608
Training fold 9/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0754101
[100]	valid_0's binary_logloss: 0.0352512
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.0352512
Fold 9 Log Loss: 0.03525
Training fold 10/10...




Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.0765832
[100]	valid_0's binary_logloss: 0.036782
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.036782
Fold 10 Log Loss: 0.03678

Average CV Log Loss across 10 folds: 0.03614



We chose 10 folds (k=10) to balance reliable performance estimation against computational cost. With 10 folds, each run trains on 90% of the data and validates on the remaining 10%, and averaging over ten validation sets reduces the variance of the estimated log loss. Cross-validation does not by itself make the model generalize, but it gives a more trustworthy estimate of how well it does. In the run shown above, the fold scores were consistent, ranging from 0.03459 to 0.03702 with a mean of 0.03614.

### Log Loss Formula

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

$N$: Number of examples (e.g., number of test pairs)

$y_i$: True label of example $i$ (0 or 1)

$p_i$: Model prediction for example $i$ (probability of class 1, e.g., 0.82)

If the true label is 1, the model is penalized when $p_i$ is close to 0.

If the true label is 0, the model is penalized when $p_i$ is close to 1.

Log Loss = 0: Perfect prediction (e.g., $p_i = 1$ when $y_i = 1$)

High Log Loss: When the model is very confident but wrong (e.g., $p_i = 0.99$ while $y_i = 0$)


### Examples

| True Label (y) | Prediction (p) | Log Loss Contribution (natural log) | Interpretation |
|----------------|----------------|-------------------------------------|----------------|
| 1              | 0.99           | Low (≈ 0.010)                       | Very good      |
| 1              | 0.60           | Moderate (≈ 0.511)                  | Okay           |
| 1              | 0.10           | High (≈ 2.303)                      | Poor           |
| 0              | 0.01           | Low (≈ 0.010)                       | Very good      |
| 0              | 0.90           | High (≈ 2.303)                      | Poor           |
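
The five rows of the table can be checked by applying the formula to each one. This is a minimal sketch using only the standard library. The helper name `log_loss_contribution` is ours and not part of the notebook, and the natural logarithm is assumed, as in scikit-learn's `log_loss`.

```python
import math

def log_loss_contribution(y_true, p):
    """Per-example binary log loss: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# (true label, predicted probability of class 1) for each table row
rows = [(1, 0.99), (1, 0.60), (1, 0.10), (0, 0.01), (0, 0.90)]

contributions = [log_loss_contribution(y, p) for y, p in rows]

# The overall log loss is the mean of the per-example contributions.
mean_log_loss = sum(contributions) / len(contributions)

for (y, p), c in zip(rows, contributions):
    print(f"y={y}, p={p:.2f} -> contribution {c:.4f}")
print(f"Mean log loss over the five rows: {mean_log_loss:.4f}")
```

The two confident but wrong predictions dominate the mean, which is why log loss rewards well-calibrated probabilities rather than only correct class labels.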



## C. Feature Extraction from the Test Set

In this section, we extract the same set of features for the test pairs as we did for the training data. Keeping the features identical means the trained models receive test input that is directly comparable to what they were trained on.


In [13]:

# Load test pairs
test_pairs = pd.read_csv("test.txt", names=["source", "target"])

# Extract features for test pairs
test_features = []
for _, row in tqdm(test_pairs.iterrows(), total=len(test_pairs), desc="Extracting test features"):
    paper1_id, paper2_id = row["source"], row["target"]

    # Compute textual similarities
    tfidf_sim = get_tfidf_similarity(paper1_id, paper2_id, tfidf_matrix)
    word2vec_sim = get_word2vec_similarity(paper1_id, paper2_id, abstract_vectors)

    # Compute authorship features 
    authors1 = authors_df[authors_df["paper_id"] == paper1_id]["authors"]
    authors2 = authors_df[authors_df["paper_id"] == paper2_id]["authors"]

    if not authors1.empty and not authors2.empty:
        author_overlap = compute_author_overlap(authors1.values[0], authors2.values[0])
    else:
        author_overlap = 0.0

    # Common authors 
    common_authors = len(paper_authors.get(paper1_id, set()).intersection(paper_authors.get(paper2_id, set())))

    # Graph-based features
    # Each lookup falls back to 0 if it fails (e.g., a paper is missing from the graph).
    try:
        common_neighbors = len(list(nx.common_neighbors(undirected_graph, paper1_id, paper2_id)))
    except Exception:
        common_neighbors = 0

    try:
        jaccard = list(nx.jaccard_coefficient(undirected_graph, [(paper1_id, paper2_id)]))[0][2]
    except Exception:
        jaccard = 0.0

    try:
        adamic = list(nx.adamic_adar_index(undirected_graph, [(paper1_id, paper2_id)]))[0][2]
    except Exception:
        adamic = 0.0

    # Node2Vec similarity features
    if paper1_id in node2vec_embeddings and paper2_id in node2vec_embeddings:
        vec_u = node2vec_embeddings[paper1_id]
        vec_v = node2vec_embeddings[paper2_id]
        n2v_dot = np.dot(vec_u, vec_v)
        n2v_cosine = cosine_similarity([vec_u], [vec_v])[0][0]
    else:
        n2v_dot = 0.0
        n2v_cosine = 0.0


    test_features.append({
        "source": paper1_id,
        "target": paper2_id,
        "tfidf_similarity": tfidf_sim,
        "word2vec_similarity": word2vec_sim,
        "author_overlap": author_overlap,
        "common_authors_count": common_authors,
        "common_neighbors": common_neighbors,
        "jaccard_coeff": jaccard,
        "adamic_adar": adamic,
        "in_degree_source": in_degrees.get(paper1_id, 0),
        "out_degree_source": out_degrees.get(paper1_id, 0),
        "in_degree_target": in_degrees.get(paper2_id, 0),
        "out_degree_target": out_degrees.get(paper2_id, 0),
        
        "n2v_dot": n2v_dot,
        "n2v_cosine": n2v_cosine
    })

test_features_df = pd.DataFrame(test_features)


# Scale the Node2Vec similarity features using the previously fitted scaler
test_features_df[["n2v_dot", "n2v_cosine"]] = scaler.transform(
    test_features_df[["n2v_dot", "n2v_cosine"]]
)



Extracting test features: 100%|██████████| 106692/106692 [03:41<00:00, 480.92it/s]


Now we use all ten trained fold models to predict on the test data and average their outputs, which gives more stable probabilities than any single model. Finally, we save the predictions to a CSV file for submission.

In [14]:
# Predict using all models and average the results (ensemble method)
test_preds = np.mean([model.predict(test_features_df) for model in models], axis=0)

# Save predictions to submission file
submission = pd.DataFrame({
    "ID": test_pairs.index,
    "Label": test_preds
})
submission.to_csv("submission.csv", index=False)
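
To our understanding, a LightGBM `Booster` does not check column names at prediction time by default, so the test frame has to contain the same columns in the same order as the training matrix `X`. The following is a minimal sketch of aligning the columns before averaging fold predictions. All names and values (`train_columns`, `test_features_demo`, `fold_preds`) are hypothetical stand-ins, and the fold predictions are made-up numbers rather than model output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training column order (X.columns in the notebook).
train_columns = ["tfidf_similarity", "common_neighbors", "n2v_cosine"]

# Hypothetical test frame with a different column order and one extra column.
test_features_demo = pd.DataFrame({
    "n2v_cosine": [0.2, 0.9],
    "extra_column": [1, 2],
    "tfidf_similarity": [0.1, 0.8],
    "common_neighbors": [0, 5],
})

# Fail loudly if a training feature is missing from the test frame.
missing = [c for c in train_columns if c not in test_features_demo.columns]
if missing:
    raise ValueError(f"Test features are missing columns: {missing}")

# Select and reorder the columns to match the training matrix.
aligned = test_features_demo[train_columns]

# Made-up predictions from three fold models for the two test rows.
fold_preds = np.array([
    [0.10, 0.90],
    [0.20, 0.80],
    [0.15, 0.85],
])

# Averaging over the model axis gives one probability per test row.
ensemble = fold_preds.mean(axis=0)
print(list(aligned.columns), ensemble)
```

In the notebook itself, the equivalent step would be passing `test_features_df[X.columns]` to each model, assuming the training and test frames use the same column names.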
