# ArXiv Data Processing

The purpose of this file is to process the maths submissions of the [ArXiv dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv) from Kaggle into tables more suitable for data analysis (e.g. to add to a database or open in Power BI).

The data begins as a single table with the ArXiv ID, submitter, authors, title, comments, journal reference, doi, report-no, categories, license, abstract, versions, update date, and a list of the authors in a parsed format.

This data is processed into five tables:
- Submission, to record the submissions themselves
- Category, to record the categories a submission can be labelled as
- Author, to record the authors of the papers
- Submission-Category, to capture many-to-many relationship between a Submission and its Categories.
- Submission-Author, to capture the many-to-many realtionship between Submission and Author.

![database](data/arxiv-db.png)

The dataset is quite large, so some of the computations take some time (but not 'leave your computer on overnight' time).

In [3]:
# Import required packages
import pandas as pd
import csv
from datetime import datetime
from dateutil import parser

from collections import defaultdict
import gender_guesser.detector as gender

## Filtering out the maths submissions

In [None]:
# Determine whether a paper is a maths paper
def isMathCategory(categories):
    cats = categories.split(" ")
    for cat in cats:
        if cat[:cat.find(".")] == "math":
            return True
    return False

# Load the relevant data from the ArXiv file with a given filter on the categories.
# There is also the option to only load a selection of columns.
def loadArxivData(file, catFilter = lambda x : True, cols=None):
    f = open(file)
    data = []
    for line in f:
        doc = json.loads(line)
        if cols:
            lst = [doc[col] for col in cols]
        else:
            lst = doc
        if catFilter(doc['categories']):
            data.append(lst)
    f.close()
    return pd.DataFrame(data=data, columns=cols)

In [None]:
# Load all the maths ArXiv data
fileName = 'data/arxiv-metadata-oai-snapshot.json'
arxiv_math = loadArxivData(fileName, catFilter = isMathCategory)

# Save to csv so that we don't need to re-collect the data
arxiv_math.to_csv('data/arxiv-math.csv', index=False)

Start from here if you already have the arxiv-math csv:

In [4]:
# Load the ArXiv data for math submissions
arxiv_math = pd.read_csv('data/arxiv-math.csv')

print(arxiv_math.shape)
arxiv_math.info()

  arxiv_math = pd.read_csv('data/arxiv-math.csv')


(593560, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593560 entries, 0 to 593559
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   id              593560 non-null  object
 1   submitter       591905 non-null  object
 2   authors         593560 non-null  object
 3   title           593560 non-null  object
 4   comments        412682 non-null  object
 5   journal-ref     138202 non-null  object
 6   doi             145397 non-null  object
 7   report-no       22328 non-null   object
 8   categories      593560 non-null  object
 9   license         519282 non-null  object
 10  abstract        593560 non-null  object
 11  versions        593560 non-null  object
 12  update_date     593560 non-null  object
 13  authors_parsed  593560 non-null  object
dtypes: object(14)
memory usage: 63.4+ MB


## Submission

In [5]:
# Get date of first submisson
arxiv_math['versions'] = arxiv_math['versions'].apply(eval)
arxiv_math['submission_date'] = arxiv_math['versions'].apply(lambda x : datetime.date(parser.parse(x[0]['created'])))

arxiv_math.sample(5)[['versions', 'submission_date']]

Unnamed: 0,versions,submission_date
180168,"[{'version': 'v1', 'created': 'Mon, 19 Jan 201...",2015-01-19
125985,"[{'version': 'v1', 'created': 'Wed, 24 Apr 201...",2013-04-24
370100,"[{'version': 'v1', 'created': 'Thu, 19 Dec 201...",2019-12-19
193100,"[{'version': 'v1', 'created': 'Wed, 3 Jun 2015...",2015-06-03
279764,"[{'version': 'v1', 'created': 'Sat, 14 Oct 201...",2017-10-14


In [None]:
# Change journal column to a binary value
arxiv_math['journal-ref'] = 1 - arxiv_math['journal-ref'].isnull().astype('int')

In [6]:
# Create submissions dataframe
submission_cols = ['id', 'submission_date', 'title', 'abstract', 'journal-ref', 'comments']
submission = arxiv_math[submission_cols].copy()
submission.rename(columns={'id':'arxiv_id'})

print(submission.shape)
submission.info()

(593560, 6)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593560 entries, 0 to 593559
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   id               593560 non-null  object
 1   submission_date  593560 non-null  object
 2   title            593560 non-null  object
 3   abstract         593560 non-null  object
 4   journal-ref      138202 non-null  object
 5   comments         412682 non-null  object
dtypes: object(6)
memory usage: 27.2+ MB


In [65]:
# Save submission table
submission.to_csv('data/submission.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)

## Category

In [64]:
# Create the category table, starting with the category names and acronyms (to be used as a primary key)
category_dict = {'AC':'Commutative Algebra',
                'AG':'Algebraic Geometry',
                'AP':'Analysis of PDEs',
                'AT':'Algebraic Topology',
                'CA':'Classicial Analysis and ODES',
                'CO':'Combinatorics',
                'CT':'Category Theory',
                'CV':'Complex Variables',
                'DG':'Differential Geometry',
                'DS':'Dynamical Systems',
                'FA':'Functional Analysis',
                'GM':'General Mathematics',
                'GN':'General Topology',
                'GR':'Group Theory',
                'GT':'Geometric Topology',
                'HO':'History and Overview',
                'IT':'Information Theory', 
                'KT':'K-Theory and Homology',
                'LO':'Logic',
                'MG':'Metric Geometry',
                'MP':'Mathematical Physics',
                'NA':'Numerical Analysis',
                'NT':'Number Theory',
                'OA':'Operator Algebras',
                'OC':'Optimization and Control',
                'PR':'Probability',
                'QA':'Quantum Algebra',
                'RA':'Rings and Algebras',
                'RT':'Representation Theory',
                'SG':'Symplectic Geometry',
                'SP':'Spectral Theory',
                'ST':'Statistics Theory'}


category = pd.DataFrame(category_dict.items(), columns = ['category_id', 'category_name'])
category.to_csv('data/category.csv', index=False)

print(category.shape)
category.head()

(32, 2)


Unnamed: 0,category_id,category_name
0,AC,Commutative Algerba
1,AG,Algebraic Geometry
2,AP,Analysis of PDEs
3,AT,Algebraic Topology
4,CA,Classicial Analysis and ODES


## Author

In [20]:
# Create author table
arxiv_math['authors_parsed'] = arxiv_math['authors_parsed'].apply(eval)

names_list = arxiv_math['authors_parsed'].explode().to_list()
unique_names = [name for name in set(tuple(x[:2]) for x in names_list)]

author = pd.DataFrame()
author['surname'] = pd.Series([name[0] for name in unique_names])
author['first_name'] = pd.Series([name[1] for name in unique_names])

# Make the index the author id
author['author_id'] = author.index
author = author[['author_id', 'surname', 'first_name']]

print(author.shape)
author.head()

(272979, 3)


Unnamed: 0,author_id,surname,first_name
0,0,Coclite,Giuseppe Maria
1,1,Rabee,Khalid Bou
2,2,panhuis,J. in 't
3,3,Nakakita,Shogo H.
4,4,Andersen,Per Kragh


In [21]:
# Add genders
names = [x.split(' ')[0] for x in author['first_name'].to_list()]
unique_names2 = list(set(names))

# Guess the genders of all the names and construct a dictionary
d = gender.Detector()
genders = [d.get_gender(name) for name in unique_names2]

gender_dict = defaultdict(lambda :"unknown")
for name, gender_ in zip(unique_names2, genders):
    gender_dict[name] = gender_

genders = [gender_dict[name] for name in names]

author['gender'] = pd.Series(genders)
author.loc[(author['gender'] != 'male') & (author['gender'] != 'female'),'gender'] = 'unknown'

author.head()

Unnamed: 0,author_id,surname,first_name,gender
0,0,Coclite,Giuseppe Maria,male
1,1,Rabee,Khalid Bou,male
2,2,panhuis,J. in 't,unknown
3,3,Nakakita,Shogo H.,male
4,4,Andersen,Per Kragh,male


In [18]:
# Save the dataframe
author.to_csv('author.csv', index=False)

## Submission-Author

In [23]:
# Create a dict to map author name to index
author_dict = dict(zip(unique_names, author['author_id'].to_list()))

for name in unique_names[:5]:
    print(name, author_dict[name])

('Coclite', 'Giuseppe Maria') 0
('Rabee', 'Khalid Bou') 1
('panhuis', "J. in 't") 2
('Nakakita', 'Shogo H.') 3
('Andersen', 'Per Kragh') 4


In [24]:
# Create submission-author dataframe

submission_author = arxiv_math[['id', 'authors_parsed']].copy().explode('authors_parsed')
submission_author['author_id'] = submission_author['authors_parsed'].apply(lambda x : author_dict[tuple(x[:2])])
submission_author.drop('authors_parsed', axis=1, inplace=True)
submission_author.rename(columns={'id':'arxiv_id'}, inplace=True)

submission_author.to_csv('submission_author.csv', index=False)

submission_author.head()

Unnamed: 0,arxiv_id,author_id
0,704.0002,94229
0,704.0002,49170
1,704.0004,178374
2,704.0005,220029
2,704.0005,225925


## Submission-Category

In [84]:
# Create submission-category table

submission_category = arxiv_math[['id']].copy()
submission_category['category_id'] = arxiv_math['categories'].apply(lambda x : x.split(' '))

submission_category = submission_category.explode('category_id')

# Only want math categories
submission_category = submission_category[submission_category['category_id'].str.startswith('math')]
submission_category['category_id'] = submission_category['category_id'].apply(lambda x : x[5:])

submission_category.rename(columns={'id':'arxiv_id'}, inplace=True)

submission_category.to_csv('submission_category.csv', index=False)

submission_category.head()

Unnamed: 0,arxiv_id,category_id
0,704.0002,CO
1,704.0004,CO
2,704.0005,CA
2,704.0005,FA
3,704.001,CO


In [1]:
import pandas as pd
import csv

types = {"arxiv_id":'str', 'submission_date':'str', 'title':'str', 'abstract':'str', 'journal-ref':'str', 'comments':'str'}
submission = pd.read_csv('data/submission.csv', dtype=types, parse_dates=True)

submission['journal-ref'] = 1 - submission['journal-ref'].isnull().astype('int')

submission.to_csv('data/submission.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)


In [22]:
row = submission.iloc[0]

for entry in row:
    print(type(entry), entry)

<class 'str'> 704.0002
<class 'str'> 2007-03-31
<class 'str'> Sparsity-certifying Graph Decompositions
<class 'str'>   We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use
it obtain a characterization of the family of $(k,\ell)$-sparse graphs and
algorithmic solutions to a family of problems concerning tree decompositions of
graphs. Special instances of sparse graphs appear in rigidity theory and have
received increased attention in recent years. In particular, our colored
pebbles generalize and strengthen the previous results of Lee and Streinu and
give a new proof of the Tutte-Nash-Williams characterization of arboricity. We
also present a new decomposition that certifies sparsity based on the
$(k,\ell)$-pebble game with colors. Our work also exposes connections between
pebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and
Westermann and Hendrickson.

<class 'numpy.int32'> 0
<class 'str'> To appear in Graphs and Combinatorics
