# Citation Extractor

Given a PDF file and an extract of text from that file, fetch all the papers cited in the extract.


TODO:

- [ ] The DOIs are not accurate
- [ ] Make it work for name-based citation

In [1]:
import fitz  # PyMuPDF
import re
from rapidfuzz import fuzz
from tqdm import tqdm
from time import sleep

In [2]:
file = "../test2.pdf"
doc = fitz.open(file)

In [3]:
extract = """
This is the same objective optimized in prior works [49, 38, 1, 26] using the DPO-equivalent reward
for the reward class of rφ . In this setting, we can interpret the normalization term in f (rφ, πref , β)
as the soft value function of the reference policy πref . While this term does not affect the optimal
solution, without it, the policy gradient of the objective could have high variance, making learning
unstable. We can accommodate for the normalization term using a learned value function, but that
can also be difficult to optimize. Alternatively, prior works have normalized rewards using a human
completion baseline, essentially a single sample Monte-Carlo estimate of the normalizing term. In
contrast the DPO reparameterization yields a reward function that does not require any baselines.
"""

extract = extract.strip()

## Find the extract text in the document

In [4]:
THRESHOLD = 95

In [5]:
matches = []

for page_num in tqdm(range(len(doc)), leave=False):
    page = doc.load_page(page_num)  # load the current page
    text_blocks = page.get_text("blocks")  # text blocks as tuples: (x0, y0, x1, y1, text, block_no, block_type)
    for block in text_blocks:
        text = block[4]
        
        match_score = fuzz.partial_ratio(extract, text)

        if match_score >= THRESHOLD:
            matches.append((block, page_num, match_score))

matches = sorted(matches, key=lambda x: x[2], reverse=True)

In [6]:
matches

[((107.69100189208984,
   634.4305419921875,
   505.1564636230469,
   722.7994995117188,
   'This is the same objective optimized in prior works [49, 38, 1, 26] using the DPO-equivalent reward\nfor the reward class of rϕ. In this setting, we can interpret the normalization term in f(rϕ, πref, β)\nas the soft value function of the reference policy πref. While this term does not affect the optimal\nsolution, without it, the policy gradient of the objective could have high variance, making learning\nunstable. We can accommodate for the normalization term using a learned value function, but that\ncan also be difficult to optimize. Alternatively, prior works have normalized rewards using a human\ncompletion baseline, essentially a single sample Monte-Carlo estimate of the normalizing term. In\ncontrast the DPO reparameterization yields a reward function that does not require any baselines.\n',
   25,
   0),
  5,
  99.24812030075188)]
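
The score falls short of 100 partly because the pasted extract and the block text appear to differ in small ways, such as the phi glyph and the spacing around it. A minimal sketch of normalising both strings before scoring, assuming Unicode NFKC folding and whitespace collapsing are acceptable for matching purposes; the helper name `normalise` is hypothetical and not part of the notebook:

```python
import re
import unicodedata


def normalise(text):
    """Fold compatibility characters and collapse runs of whitespace."""
    # NFKC maps compatibility forms to their plain equivalents,
    # e.g. the phi symbol U+03D5 to U+03C6 and the "fi" ligature to "fi".
    text = unicodedata.normalize("NFKC", text)
    # Line breaks and repeated spaces become a single space.
    return re.sub(r"\s+", " ", text).strip()
```

Both `extract` and each block's text could be passed through this helper before calling `fuzz.partial_ratio`.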

For now, keep only the top match.

In [7]:
matched_block = matches[0][0]
matched_page = matches[0][1]

## Get all citation numbers in the text and the corresponding links

In [8]:
matched_bbox = fitz.Rect(matched_block[:4])

Get the citation links that overlap the matched block.

In [9]:
matched_links = []

for link in doc[matched_page].get_links():
    if link['kind'] == fitz.LINK_NAMED:   # kind 4: links to named destinations inside the document
        link_bbox = link['from']
        if matched_bbox.intersects(link_bbox):
            matched_links.append(link)
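
Conceptually, the overlap test used here is an axis-aligned rectangle intersection. The sketch below is plain Python for illustration only; it is not PyMuPDF's implementation, and edge cases such as touching edges or empty rectangles may be treated differently by `fitz.Rect.intersects`. The names `Box` and `boxes_overlap` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float


def boxes_overlap(a, b):
    """True if the two rectangles share some interior area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1
```

A link rectangle that lies inside the matched block overlaps it, while a link elsewhere on the page does not.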

Get the citation number for each link.

Here we also filter out the citation links that are not part of the original extract.

In [10]:
matched_links_filtered = []

page = doc[matched_page]
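for link in matched_links:
    link_text = page.get_text('text', clip=link['from'])
    numbers = re.findall(r'\d+', link_text)
    if not numbers:
        continue  # the link does not cover a numeric citation

    citation_num = numbers[0]
    # match the whole number, so that e.g. '4' is not accepted just because '49' occurs in the extract
    if not re.search(rf'\b{citation_num}\b', extract):
        continue

    link['citation_number'] = citation_num
    matched_links_filtered.append(link)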

In [11]:
[m['citation_number'] for m in matched_links_filtered]

['49', '38', '1', '26']
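
As a cross-check on the link-based result, the citation numbers can also be read directly from the extract. This is a minimal sketch, assuming numeric citations in square brackets; the handling of ranges such as `[3-5]` is an assumption about the citation style, and `parse_numeric_citations` is a hypothetical helper, not part of the notebook.

```python
import re


def parse_numeric_citations(text):
    """Return the citation numbers found in bracketed groups, in order of appearance."""
    numbers = []
    # A group is a bracket containing only digits, commas, spaces and dashes.
    for group in re.findall(r"\[([\d,\s\u2013-]+)\]", text):
        for part in group.split(","):
            part = part.strip()
            if not part:
                continue
            bounds = re.split(r"\s*[\u2013-]\s*", part)
            if len(bounds) == 2 and all(b.isdigit() for b in bounds):
                # Expand a range such as "3-5" into 3, 4, 5.
                numbers.extend(range(int(bounds[0]), int(bounds[1]) + 1))
            elif part.isdigit():
                numbers.append(int(part))
    return numbers
```

Comparing this list with the `citation_number` values collected from the links would flag citations whose link was missed or fell outside the matched block.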

## Get the references for these citations

In [12]:
matched_references = []

for link in matched_links_filtered:
    linked_page = doc.load_page(link['page'])
    text_blocks = linked_page.get_text("blocks")
    citation_num = link['citation_number']
    num_pat = r'\b' + citation_num + r'\b'
    
    for text in text_blocks:
        # citation number should be present in the initial section of the reference
        # if citation_num in text[4][:15]:
        if re.search(num_pat, text[4][:15]):
            matched_references.append(text[4].strip())

In [13]:
matched_references = list(map(lambda x: x.replace('\n', ''), matched_references))
matched_references

['[49] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, andG. Irving. Fine-tuning language models from human preferences, 2020.',
 '[38] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, andP. Christiano. Learning to summarize from human feedback, 2022.',
 '[1] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli,T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson,D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. Training ahelpful and harmless assistant with reinforcement learning from human feedback, 2022.',
 '[26] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder,P. F. Christiano, J
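
The output above shows words fused where line breaks were removed (for example `andG. Irving`). A minimal sketch of an alternative lookup, assuming reference entries begin with a `[n]` label: it anchors the match on that label and joins wrapped lines with a single space. The helper name `find_reference` is hypothetical, and words hyphenated across a line break are not repaired.

```python
import re


def find_reference(block_texts, citation_num):
    """Return the first block that starts with "[citation_num]", with whitespace collapsed."""
    label = re.compile(rf"^\s*\[{re.escape(str(citation_num))}\]")
    for text in block_texts:
        if label.match(text):
            # split() + join turns line breaks and repeated spaces into single spaces
            return " ".join(text.split())
    return None
```

In the notebook, `block_texts` would be the fifth element of each tuple returned by `linked_page.get_text("blocks")`.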

## Get the metadata of these references

### Habanero

Get the DOIs

In [14]:
from habanero import Crossref
cr = Crossref()

In [15]:
matched_references_meta = []

for ref in tqdm(matched_references):
    results = cr.works(query = ref, limit = 1)
    meta = results['message']['items'][0]
    matched_references_meta.append(meta)


In [16]:
[m['DOI'] for m in matched_references_meta]

['10.31235/osf.io/sthwk',
 '10.3102/1892071',
 '10.1145/3531146.3533229',
 '10.47205/jdss.2021(2-iv)74']
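
Since the TODO notes that the DOIs are not accurate, one possible safeguard is to compare the title of each Crossref result with the reference string before accepting its DOI. This is a sketch under assumptions: the token-overlap measure and the 0.8 threshold are arbitrary choices, and the helper names are hypothetical. It relies only on the Crossref `title` field, which is a list of strings.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def title_overlap(reference, item):
    """Fraction of the Crossref title's tokens that also occur in the reference string."""
    titles = item.get("title") or []
    title_tokens = _tokens(titles[0]) if titles else []
    if not title_tokens:
        return 0.0
    ref_tokens = set(_tokens(reference))
    hits = sum(token in ref_tokens for token in title_tokens)
    return hits / len(title_tokens)


def is_plausible_match(reference, item, threshold=0.8):
    """Accept a Crossref item only if most of its title appears in the reference."""
    return title_overlap(reference, item) >= threshold
```

Applied to pairs from `zip(matched_references, matched_references_meta)`, this would let the loop keep a DOI only when the check passes and record `None` otherwise.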

### OpenAccessButton

Get the open access PDF urls

In [17]:
import requests

In [18]:
for meta in matched_references_meta:
    doi = meta['DOI']
    # let requests URL-encode the DOI; the timeout avoids hanging on a stalled connection
    response = requests.get("https://api.openaccessbutton.org/find", params={"id": doi}, timeout=30)
    response.raise_for_status()  # fail loudly on an error response
    data = response.json()

    pdf_url = data.get('url')
    publisher = data.get('metadata', {}).get('publisher')
    if pdf_url is None and publisher == 'ACM':
        # fallback: build the ACM Digital Library PDF URL from the DOI
        pdf_url = f"https://dl.acm.org/doi/pdf/{doi}"

    meta['pdf_url'] = pdf_url

In [19]:
[m['pdf_url'] for m in matched_references_meta]

['https://osf.io/sthwk/download',
 None,
 'https://dl.acm.org/doi/pdf/10.1145/3531146.3533229',
 'https://jdss.org.pk/issues/v2/4/water-sharing-issues-in-pakistan-impacts-on-inter-provincial-relations.pdf']

## Download the PDFs

In [20]:
from pathlib import Path

for meta in tqdm(matched_references_meta):
    doi = meta['DOI']
    pdf = meta['pdf_url']

    file = Path(f"papers/{doi.replace('/', '_')}.pdf")
    
    # download
    # response = requests.get(pdf)
    # file.write_bytes(response.content)

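
The download itself is left commented out in the cell above. A minimal sketch of the file-handling side, assuming that any character outside a conservative set should be replaced in file names and that a payload is only worth saving if it starts with the `%PDF-` signature; both helper names are hypothetical.

```python
import re
from pathlib import Path


def pdf_path_for(doi, root="papers"):
    """Build a file path for a DOI, replacing characters that are unsafe in file names."""
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", doi)
    return Path(root) / f"{safe}.pdf"


def save_pdf(content, path):
    """Write the bytes to disk only if they look like a PDF; return True on success."""
    if not content.startswith(b"%PDF-"):
        return False  # e.g. an HTML login or error page served instead of the paper
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return True
```

Inside the loop, entries whose `pdf_url` is `None` would be skipped, and `save_pdf(response.content, pdf_path_for(doi))` would replace the commented-out `write_bytes` call.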

# Problems

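The extract may span several paragraphs or pages, while the matching step above scores one text block at a time.

Paragraphs interrupted by images or tables may be split into separate blocks, so no single block contains the whole extract.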
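
One possible way to approach both problems is to score windows of consecutive blocks rather than single blocks. The sketch below assumes the block texts have been collected in reading order (across pages if needed). It uses `difflib` as a stand-in scorer so that it runs without extra dependencies; `fuzz.partial_ratio` could be passed as `scorer` instead. The names `similarity` and `best_window` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer in the range 0 to 100."""
    return 100 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_window(extract, block_texts, max_blocks=3, scorer=similarity):
    """Return (score, start, end) for the best run of consecutive blocks; end is exclusive."""
    best = (0.0, 0, 0)
    for start in range(len(block_texts)):
        last = min(start + max_blocks, len(block_texts))
        for end in range(start + 1, last + 1):
            # Join the candidate blocks as if they were one paragraph.
            joined = " ".join(t.strip() for t in block_texts[start:end])
            score = scorer(extract, joined)
            if score > best[0]:
                best = (score, start, end)
    return best
```

The returned indices identify which blocks to use when collecting links, so the bounding-box test would then run once per block in the window.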