move code to dedicated repo

IllDepence committed Jan 29, 2019
0 parents commit 8cb23200868f1a014d1ee9adea1004e09c205fb9
21 LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2018 Tarek Saier

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,38 @@
# unarXive

Code for generating a data set for citation based tasks using arXiv.org submissions.

## Prerequisites
* software
    * Tralics (Ubuntu: `# apt install tralics`)
    * latexpand (Ubuntu: `# apt install texlive-extra-utils`)
* data
    * arXiv source files: see [arXiv.org help - arXiv Bulk Data Access](https://arxiv.org/help/bulk_data)
    * MAG DB: **#TODO** add link/section describing MAG data base
    * arXiv title lookup DB: see file `aid_title.db.placeholder`

## Setup
* create virtual environment: `$ python3 -m venv venv`
* activate virtual environment: `$ source venv/bin/activate`
* install requirements: `$ pip install -r requirements.txt`
* in `match_bibitems_mag.py`
    * adjust line `mag_db_uri = 'postgresql+psycopg2://XXX:YYY@localhost:5432/MAG'`
    * adjust line `doi_headers = { [...] working on XXX; mailto: XXX [...] }`
    * depending on your arXiv title lookup DB, adjust line `aid_db_uri = 'sqlite:///aid_title.db'`

## Usage
1. Extract plain texts and reference items with: `prepare.py` (or `normalize_arxiv_dump.py` + `parse_latex_tralics.py`)
2. Match reference items with: `match_bibitems_mag.py`
3. (optional) Clean txt output with: `clean_txt_output.py`
4. (optional) Adjust parameters in `extract_contexts.py` at `def generate(...)`
5. (optional) Extract citation contexts with: `extract_contexts.py`
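
The cleaning step (3) relies on citation markers of the form `{{cite:<UUID>}}` in the plain text output; files without any marker are moved to a `no_cit` subdirectory. The following is a minimal sketch of that check, reusing the pattern from `clean_txt_output.py`. The helper `cited_uuids` and the sample UUID are hypothetical and only serve as illustration.

```python
import re

# Same marker format as CITE_PATT in clean_txt_output.py: {{cite:<UUID v4>}}
CITE_PATT = re.compile(
    r'\{\{cite:([0-9A-F]{8}-[0-9A-F]{4}-4[0-9A-F]{3}'
    r'-[89AB][0-9A-F]{3}-[0-9A-F]{12})\}\}', re.I)


def cited_uuids(text):
    """Hypothetical helper: return the UUIDs of all citation markers in text."""
    return CITE_PATT.findall(text)


# Made-up sample text; the UUID below is not taken from real output.
sample = 'as shown in {{cite:3f2b8c1e-9d4a-4c6b-8e21-0a1b2c3d4e5f}} and others.'
print(cited_uuids(sample))
print(cited_uuids('a paragraph without any markers'))
```

A file for which this list is empty is one that step 3 would move to `no_cit`.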

### Example
* `$ source venv/bin/activate`
* `$ python3 prepare.py /tmp/arxiv-sources /tmp/arxiv-txt`
* `$ python3 match_bibitems_mag.py path /tmp/arxiv-txt 10`
* `$ python3 clean_txt_output.py /tmp/arxiv-txt`
* `$ python3 extract_contexts.py /tmp/arxiv-txt`

## Matching evaluation
We manually evaluated the reference resolution (`match_bibitems_mag.py`) on a sample of 300 matchings; see `doc/matching_evaluation/`.
@@ -0,0 +1,26 @@
The title lookup data base for arXiv.org submission IDs has to look as
indicated below.

SQLite example:

$ sqlite3 aid_title.db
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE paper (
id INTEGER NOT NULL,
aid VARCHAR(36),
title TEXT,
PRIMARY KEY (id)
);
sqlite> select * from paper limit 1;
1|1103.3880|C*-algebras associated with some second order differential operators


Note 1: Because SQLAlchemy is used to access the data base, you're relatively
free in your choice of data base system and only have to adjust the data base
URL in the code. For reference, see:
https://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls

Note 2: For bulk access to arXiv metadata see:
https://arxiv.org/help/bulk_data
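
Note 3: As a rough sketch, a lookup data base with the schema above could be
created with Python's built-in sqlite3 module as shown below. The helper
title_for and the in-memory data base are assumptions for illustration only;
how you obtain the (aid, title) pairs from the arXiv metadata is up to you.

```python
import sqlite3

# In-memory data base for illustration; pass a file path such as
# 'aid_title.db' instead to create a persistent lookup data base.
conn = sqlite3.connect(':memory:')
conn.execute(
    """
    CREATE TABLE paper (
        id INTEGER NOT NULL,
        aid VARCHAR(36),
        title TEXT,
        PRIMARY KEY (id)
    )
    """
)

# Row taken from the SQLite example above.
rows = [
    ('1103.3880',
     'C*-algebras associated with some second order differential operators'),
]
conn.executemany('INSERT INTO paper (aid, title) VALUES (?, ?)', rows)
conn.commit()


def title_for(aid):
    """Hypothetical helper: return the title stored for an arXiv ID, or None."""
    row = conn.execute(
        'SELECT title FROM paper WHERE aid = ?', (aid,)).fetchone()
    return row[0] if row else None


print(title_for('1103.3880'))
```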
@@ -0,0 +1,43 @@
""" Clean arXiv dump txt output
"""

import os
import re
import shutil
import sys

# Citation marker of the form {{cite:<UUID v4>}}
CITE_PATT = re.compile((r'\{\{cite:([0-9A-F]{8}-[0-9A-F]{4}-4[0-9A-F]{3}'
                        r'-[89AB][0-9A-F]{3}-[0-9A-F]{12})\}\}'), re.I)


def clean(in_dir):
    """ Separate output files with no citations in them.
    """

    no_cit_dir = os.path.join(in_dir, 'no_cit')
    if not os.path.isdir(no_cit_dir):
        os.makedirs(no_cit_dir)

    file_names = os.listdir(in_dir)
    for file_idx, fn in enumerate(file_names):
        if file_idx % 100 == 0:
            print('{}/{}'.format(file_idx, len(file_names)))
        path = os.path.join(in_dir, fn)
        aid, ext = os.path.splitext(fn)
        if ext != '.txt':
            continue
        with open(path) as f:
            text = f.read()
        if not re.search(CITE_PATT, text):
            # No citation marker found: move the file to no_cit/
            new_path = os.path.join(no_cit_dir, fn)
            shutil.move(path, new_path)


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('usage: python3 clean_txt_output.py </path/to/input/dir>')
        sys.exit()
    in_dir = sys.argv[1]
    clean(in_dir)
@@ -0,0 +1,38 @@
from sqlalchemy import Column, Integer, String, UnicodeText, ForeignKey
from sqlalchemy.schema import UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Bibitem(Base):
    __tablename__ = 'bibitem'
    uuid = Column(String(36), primary_key=True)
    in_doc = Column(String(36))
    bibitem_string = Column(UnicodeText())


class BibitemLinkMap(Base):
    __tablename__ = 'bibitemlinkmap'
    # __table_args__ = (UniqueConstraint(
    #     'uuid', 'link', name='uid_link_uniq'),)
    id = Column(Integer(), autoincrement=True, primary_key=True)
    uuid = Column(String(36), ForeignKey('bibitem.uuid'))
    link = Column(UnicodeText())


class BibitemArxivIDMap(Base):
    __tablename__ = 'bibitemarxividmap'
    # __table_args__ = (UniqueConstraint(
    #     'uuid', 'arxiv_id', name='uid_aid_uniq'),)
    id = Column(Integer(), autoincrement=True, primary_key=True)
    uuid = Column(String(36), ForeignKey('bibitem.uuid'))
    arxiv_id = Column(String(36))


class BibitemMAGIDMap(Base):
    __tablename__ = 'bibitemmagidmap'
    # __table_args__ = (UniqueConstraint(
    #     'uuid', 'mag_id', name='uid_mid_uniq'),)
    id = Column(Integer(), autoincrement=True, primary_key=True)
    uuid = Column(String(36), ForeignKey('bibitem.uuid'))
    mag_id = Column(String(36))
