# Task 2 — Co-citation and Bibliographic Coupling
This notebook computes pairwise similarity between papers in `dblp_subset.json` using two measures:

1. Co-citation score: number of papers that cite both paper A and paper B.
2. Bibliographic coupling: number of references that two papers share (i.e., how many papers they both cite).

The notebook reports the top-10 most similar paper pairs for each measure (showing titles and the score).
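
Both measures can also be written in matrix form, which may help when reading the pair-counting code further down. As a sketch, assume a binary citation matrix $A$ (a symbol introduced here for illustration; the notebook code never builds it), with $A_{ki} = 1$ when paper $k$ cites paper $i$ and $0$ otherwise. Then, for $i \neq j$:

$$
C_{ij} = \sum_{k} A_{ki} A_{kj} = (A^{\top} A)_{ij},
\qquad
B_{ij} = \sum_{k} A_{ik} A_{jk} = (A A^{\top})_{ij}
$$

Here $C_{ij}$ counts the papers $k$ that cite both $i$ and $j$ (co-citation), and $B_{ij}$ counts the papers $k$ that are cited by both $i$ and $j$ (bibliographic coupling). Both matrices are symmetric, which is why each unordered pair only needs to be counted once.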

In [1]:
# Imports
import json
import os
import time
from collections import Counter, defaultdict
from itertools import combinations

In [2]:
# Load the subset file (assumed to be in the current working directory)
subset_path = os.path.join(os.getcwd(), 'dblp_subset.json')
print('Reading', subset_path)
start = time.time()
papers = []
with open(subset_path, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            papers.append(json.loads(line))
        except json.JSONDecodeError:
            # Skip malformed lines instead of aborting the load
            continue
print(f'Read {len(papers)} papers in {time.time()-start:.2f}s')

# Build mappings: id -> index, index -> title, and index -> raw reference ids
# (reference ids are converted to indices in the next cell)
id_to_idx = {}
titles = []
refs_by_index = []
for idx, p in enumerate(papers):
    pid = p.get('id')
    id_to_idx[pid] = idx
    titles.append(p.get('title', '') or '')
    refs_by_index.append(p.get('references', []) or [])

n = len(papers)
print('n =', n)

Reading /Users/ankushchhabra/Downloads/Data Mining Assignment2/dblp_subset.json
Read 49572 papers in 0.72s
n = 49572


In [3]:
# Helper: convert reference id lists to indices, dropping ids that are not in the subset
refs_idx = [
    [id_to_idx[r] for r in refs if r in id_to_idx]
    for refs in refs_by_index
]

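# Basic stats about (in-subset) references per paper
num_refs = [len(r) for r in refs_idx]
if num_refs:
    sorted_refs = sorted(num_refs)
    # Upper median: the element at position len // 2 of the sorted counts
    print('refs per paper: min', sorted_refs[0], 'median', sorted_refs[len(sorted_refs) // 2], 'max', sorted_refs[-1])
else:
    print('refs per paper: min', 0, 'median', 0, 'max', 0)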

refs per paper: min 0 median 1 max 161


## Co-citation (C_ij) — number of papers that cite both i and j
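Approach: iterate over each *citing* paper k, take its reference list (as indices), and increment the count for every unordered pair (i, j) among those references. Each citing paper contributes at most 1 to a given pair, so the final count is the number of papers that cite both i and j. Only references that resolve to papers inside the subset are considered, because the previous cell drops ids that are missing from `dblp_subset.json`.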
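
To make the counting rule concrete, here is a minimal sketch on a hypothetical toy dataset. The paper labels and the names `toy_refs` and `co_citation_counts` are illustrative assumptions, not part of the notebook; the loop follows the same pair-counting idea as the cell below.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: citing paper -> list of papers it cites.
toy_refs = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C"],
    "P4": ["A"],          # cites a single paper, so it forms no pair
}

def co_citation_counts(refs):
    """Count, for each unordered pair of cited papers, how many papers cite both."""
    counts = Counter()
    for cited in refs.values():
        # set() removes duplicate references; sorted() gives each pair a canonical order
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts

toy_co = co_citation_counts(toy_refs)
print(toy_co.most_common())
```

In this toy data, A and B are cited together by P1 and P2, B and C by P1 and P3, and A and C only by P1.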

In [5]:
start = time.time()
co_counter = Counter()
# For each citing paper (row), add combinations of its referenced papers
for k, ref_list in enumerate(refs_idx):
    if len(ref_list) < 2:
        continue
    # iterate over unordered pairs of references
    for a, b in combinations(sorted(set(ref_list)), 2):
        co_counter[(a, b)] += 1

print('Co-citation counts built in', time.time()-start, 's; unique pairs =', len(co_counter))

# Top-10 co-cited pairs
top_k = 10
print('\nTop-{} pairs by Co-citation score:'.format(top_k))
for rank, ((i, j), score) in enumerate(co_counter.most_common(top_k), start=1):
    print(f'{rank}. (score={score})')
    print('   Paper A:', titles[i])
    print('   Paper B:', titles[j])
    print()

Co-citation counts built in 0.3212602138519287 s; unique pairs = 636478

Top-10 pairs by Co-citation score:
1. (score=154)
   Paper A: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
   Paper B: ImageNet Classification with Deep Convolutional Neural Networks

2. (score=148)
   Paper A: The Pascal Visual Object Classes (VOC) Challenge
   Paper B: Object Detection with Discriminatively Trained Part-Based Models

3. (score=122)
   Paper A: Real-time human pose recognition in parts from single depth images
   Paper B: Real-time human pose recognition in parts from single depth images

4. (score=95)
   Paper A: Very Deep Convolutional Networks for Large-Scale Image Recognition
   Paper B: ImageNet Classification with Deep Convolutional Neural Networks

5. (score=91)
   Paper A: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
   Paper B: ImageNet Classification with Deep Convolutional Neural Networks

6. (score=91)
   Paper A: Ca

## Bibliographic coupling (B_ij) — number of shared references between papers i and j
Approach: for each *cited* paper r, collect the papers that cite r. For each unordered pair of citing papers (i, j) in that list, increment the coupling count. Each shared reference contributes exactly 1, so the final count is the number of references that i and j have in common.
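
The inverted-index idea can be checked on a small example. The sketch below uses hypothetical toy data (the labels and the names `toy_refs` and `coupling_counts` are assumptions made for illustration) and cross-checks each score against a direct set intersection of the two reference lists.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: citing paper -> list of papers it cites.
toy_refs = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C"],
    "P4": ["D"],          # shares no reference with any other paper
}

def coupling_counts(refs):
    """Count, for each unordered pair of citing papers, their shared references."""
    # Invert the mapping: cited paper -> papers that cite it
    cited_to_citers = defaultdict(list)
    for citer, cited_list in refs.items():
        for cited in set(cited_list):
            cited_to_citers[cited].append(citer)

    counts = Counter()
    for citers in cited_to_citers.values():
        for a, b in combinations(sorted(citers), 2):
            counts[(a, b)] += 1
    return counts

toy_bib = coupling_counts(toy_refs)

# Cross-check: every score equals the size of the intersection of the reference sets
for (a, b), score in toy_bib.items():
    assert score == len(set(toy_refs[a]) & set(toy_refs[b]))
print(toy_bib.most_common())
```

Going through the inverted index means that pairs with no shared reference (such as any pair involving P4) are never touched, which is what keeps the counter sparse.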

In [6]:
start = time.time()
# Build mapping: cited index -> list of citing paper indices
cited_to_citers = defaultdict(list)
for citer_idx, ref_list in enumerate(refs_idx):
    for cited in set(ref_list):
        cited_to_citers[cited].append(citer_idx)

# Now, for each cited paper, increment pairs among its citers
bib_counter = Counter()
for cited, citers in cited_to_citers.items():
    if len(citers) < 2:
        continue
    for a, b in combinations(sorted(set(citers)), 2):
        bib_counter[(a, b)] += 1

print('Bibliographic coupling counts built in', time.time()-start, 's; unique pairs =', len(bib_counter))

# Top-10 bibliographic coupling pairs
print('\nTop-{} pairs by Bibliographic Coupling score:'.format(top_k))
for rank, ((i, j), score) in enumerate(bib_counter.most_common(top_k), start=1):
    print(f'{rank}. (score={score})')
    print('   Paper A:', titles[i])
    print('   Paper B:', titles[j])
    print()

Bibliographic coupling counts built in 0.666895866394043 s; unique pairs = 1219299

Top-10 pairs by Bibliographic Coupling score:
1. (score=43)
   Paper A: Salient Object Detection: A Benchmark
   Paper B: Salient Object Detection: A Survey

2. (score=40)
   Paper A: Software-Defined Networking: A Comprehensive Survey
   Paper B: Security in Software Defined Networks: A Survey

3. (score=40)
   Paper A: Software-Defined Networking: A Comprehensive Survey
   Paper B: A Survey and a Layered Taxonomy of Software-Defined Networking

4. (score=38)
   Paper A: Design Guidelines for Spatial Modulation
   Paper B: Spatial Modulation for Generalized MIMO: Challenges, Opportunities, and Implementation

5. (score=37)
   Paper A: Urban Computing: Concepts, Methodologies, and Applications
   Paper B: Trajectory Data Mining: An Overview

6. (score=37)
   Paper A: Software-Defined Networking: A Comprehensive Survey
   Paper B: A Survey on Software-Defined Networking

7. (score=34)
   Paper A: Salient

### Notes
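- Both measures are computed by counting pairs within reference lists and citer lists, so no dense n-by-n matrix is built. On this subset the two passes took about 0.3 s (co-citation) and 0.7 s (bibliographic coupling). The cost grows with the sum of k(k-1)/2 over all lists of length k, so long lists contribute disproportionately to the runtime.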
- The final cell saves the top-10 pairs for each measure (paper ids, titles, score) to CSV files.
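
The runtime of both counting passes is driven by the number of pair updates they perform. As a rough sketch, that number can be estimated before running the loops; the helper name `pair_updates` and the toy input are assumptions made for illustration, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def pair_updates(lists):
    """Total number of counter increments: sum of C(k, 2) over lists with k unique items."""
    return sum(comb(len(set(lst)), 2) for lst in lists)

# Hypothetical toy input: four reference lists of sizes 3, 2, 2 and 0
toy_reference_lists = [[1, 2, 3], [1, 2], [2, 3], []]
toy_updates = pair_updates(toy_reference_lists)
print(toy_updates)
```

In the notebook, `pair_updates(refs_idx)` would give the work done by the co-citation pass and `pair_updates(cited_to_citers.values())` the work done by the coupling pass. The number of unique pairs reported by each cell can be no larger than the corresponding total.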

In [7]:
# Save top-10 results to CSV files
import csv

# Build idx -> id mapping by inverting id_to_idx
idx_to_id = [None] * n
for pid, idx in id_to_idx.items():
    idx_to_id[idx] = pid

def write_top_pairs(path, top_pairs):
    """Write ranked ((i, j), score) entries with paper ids and titles to a CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(['rank', 'paperA_id', 'paperA_title', 'paperB_id', 'paperB_title', 'score'])
        for rank, ((i, j), score) in enumerate(top_pairs, start=1):
            w.writerow([rank, idx_to_id[i], titles[i], idx_to_id[j], titles[j], score])

top_k = 10
co_top = co_counter.most_common(top_k)
bib_top = bib_counter.most_common(top_k)

co_path = os.path.join(os.getcwd(), 'co_top10.csv')
bib_path = os.path.join(os.getcwd(), 'bib_top10.csv')

write_top_pairs(co_path, co_top)
write_top_pairs(bib_path, bib_top)

print('Wrote', co_path)
print('Wrote', bib_path)

Wrote /Users/ankushchhabra/Downloads/Data Mining Assignment2/co_top10.csv
Wrote /Users/ankushchhabra/Downloads/Data Mining Assignment2/bib_top10.csv
