For this project I looked at the articles published on arxiv.org between May 2017 and May 2018. To get the data, I built a custom metadata harvester that collects the records in a Pandas dataframe. For every article it stores the link, the title, the authors, the categories, the summary and the publishing date.
First, I import the needed packages: Pandas, regular expressions, time and the metadata harvester Sickle.
You can get Sickle here:
https://github.com/mloesch/sickle

Install it with !pip install sickle
Sickle needs a base URL from which to retrieve the metadata, so it is initialized with the arXiv OAI endpoint.

In [0]:
import re
import time

import pandas as pd
from sickle import Sickle

# Harvester bound to the arXiv OAI endpoint
sickle = Sickle('http://export.arxiv.org/oai2')

The statistics below use a frame that I built and saved to a CSV file beforehand.
The file was stored including the index. Since I don't need that information again, I drop the first column.

In [0]:
frame=pd.read_csv('framemai17.csv')
frame=frame.drop(frame.columns[0], axis=1)

This part shows how the frame was built.
The identifiers are constructed in the style of oai:arXiv.org:1708.00001.
17 represents the year: 2017
08 represents the month: August
Then follows a five-digit, consecutively increasing number: 00001 for the first article of the month, 00002 for the second and so on.
So to get all the information, I need to loop through all the numbers of a month and through all months.
A request for a single record looks like this:
http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:0804.2273&metadataPrefix=oai_dc
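
The identifier scheme described above can be sketched with two small helpers. This is only an illustration, not the harvester used for the project: the names `build_identifier` and `month_range` are made up for this example, and the zero padding follows the example identifier shown above.

```python
def build_identifier(year, month, number):
    """Build an OAI identifier such as oai:arXiv.org:1708.00001."""
    return "oai:arXiv.org:{:02d}{:02d}.{:05d}".format(year % 100, month, number)


def month_range(start_year, start_month, end_year, end_month):
    """Return all (year, month) pairs from start to end, both inclusive."""
    pairs = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        pairs.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return pairs


months = month_range(2017, 5, 2018, 5)
print(len(months))                    # 13 months from May 2017 to May 2018
print(build_identifier(2017, 8, 1))   # oai:arXiv.org:1708.00001
```

An outer loop over `months` and an inner loop over the running number would then produce every identifier to request.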

Using regular expressions, I look for all the tags I need and extract the information between them. I construct a dataframe from this information and append it to the existing one.
Sometimes a request fails with an error:
In the better case the error says that the ID does not exist. Since the articles are numbered consecutively, it is then safe to assume that there are no more articles for that month, so I can stop the loop and go on to the next month.
In the other case I got an error whose cause I couldn't figure out. Whenever that happens, the script waits 60 seconds and then tries again. That always solved the problem.
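
The wait-and-retry idea can be sketched independently of Sickle. The helper `fetch_with_retry` and its parameters are hypothetical names for this example; `fetch` stands for any function that performs one request. In the real loop, an "ID does not exist" error would still have to end the month instead of being retried.

```python
import time


def fetch_with_retry(fetch, wait_seconds=60, max_retries=3):
    """Call fetch() and retry after a pause if it raises an exception.

    The last exception is re-raised once max_retries retries have failed.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except Exception:
            if attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(wait_seconds)


# Stand-in for a request that fails once and then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("temporary failure")
    return "record"


result = fetch_with_retry(flaky, wait_seconds=0)
print(result)
```

Limiting the number of retries avoids an endless loop if the error turns out to be permanent.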

<b>Fun fact</b>: It took 3 days of continuous querying to construct the whole frame.

In [0]:
from sickle.oaiexceptions import IdDoesNotExist

columns = ["Identifier", "Authors", "Title", "Subjects", "Description", "Date"]

def fetch_row(identifier):
    # Request one record and pull the needed fields out of its raw XML
    raw = sickle.GetRecord(metadataPrefix='oai_dc', identifier=identifier).raw
    return pd.DataFrame([[re.findall('<dc:identifier>(.+?)</dc:identifier>', raw),
                          re.findall('<dc:creator>(.+?)</dc:creator>', raw),
                          re.search('<dc:title>(.+?)</dc:title>', raw, re.DOTALL).group(1),
                          re.findall('<dc:subject>(.+?)</dc:subject>', raw),
                          re.search('<dc:description>(.+?)</dc:description>', raw, re.DOTALL).group(1),
                          re.findall('<dc:date>(.+?)</dc:date>', raw)]], columns=columns)

for i in range(1, 10):
    print(i)
    identifier = "oai:arXiv.org:1708.0000" + str(i)
    try:
        frame = pd.concat([frame, fetch_row(identifier)], ignore_index=True)
    except IdDoesNotExist as e:
        # No record with this number: the month is finished
        print(e)
        print("Done with this month")
        break
    except Exception as e:
        # Unknown error: wait a minute, then try the same record once more
        print(e)
        time.sleep(60)
        print(i)
        frame = pd.concat([frame, fetch_row(identifier)], ignore_index=True)

1
2
3
4
5
6
7
8
9
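
The regular expressions from the loop can be tried out without any network access. The sample below is hand-written in the style of an oai_dc record and is not real arXiv output; `extract_all` and `extract_first` are hypothetical helper names. Unlike `re.search(...).group(1)` in the loop, which raises an error when a tag is missing, `extract_first` returns None in that case.

```python
import re

# Hand-written sample in the style of an oai_dc record (not real arXiv output)
sample = """<dc:title>An example
  title</dc:title>
<dc:creator>Doe, Jane</dc:creator>
<dc:creator>Roe, Richard</dc:creator>
<dc:subject>Quantum Physics</dc:subject>
<dc:date>2017-08-01</dc:date>"""


def extract_all(tag, xml):
    """Return the text of every <dc:tag> element."""
    return re.findall("<dc:{0}>(.+?)</dc:{0}>".format(tag), xml, re.DOTALL)


def extract_first(tag, xml):
    """Return the text of the first <dc:tag> element, or None if it is missing."""
    found = extract_all(tag, xml)
    return found[0] if found else None


creators = extract_all("creator", sample)
title = extract_first("title", sample)
description = extract_first("description", sample)

print(creators)      # ['Doe, Jane', 'Roe, Richard']
print(description)   # None, because the sample has no description
```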


Here you can see what the frame looks like at the end.

In [0]:
frame

Unnamed: 0,Identifier,Authors,Title,Subjects,Description,Date
0,[http://arxiv.org/abs/1805.00001],"[Xin, Lin]",Spin-1 Bosons in the Presence of Spin-orbit Co...,"[Condensed Matter - Quantum Gases, Quantum Phy...","In this paper, I'm going to talk about the t...",[2018-04-28]
1,[http://arxiv.org/abs/1805.00002],"[Graber-Mitchell, Nicolas]",Finding the period of a simple pendulum,[Physics - Classical Physics],Pendulums have long fascinated humans ever s...,[2018-04-28]
2,[http://arxiv.org/abs/1805.00003],"[Hansraj, Sudan, Banerjee, Ayan, Channuie, Pho...",Impact of the Rastall parameter on perfect flu...,"[General Relativity and Quantum Cosmology, Hig...",We examine the effects of the Rastall parame...,[2018-04-29]
3,[http://arxiv.org/abs/1805.00004],"[Gunathillake, Ashanie]",Maximum Likelihood Coordinate Systems for Wire...,[Computer Science - Networking and Internet Ar...,Many WSN protocols require the location coor...,[2018-04-29]
4,"[http://arxiv.org/abs/1805.00005, Phys. Rev. D...","[Turimov, Bobur, Ahmedov, Bobomurat, Abdujabba...",Electromagnetic fields of slowly rotating magn...,[General Relativity and Quantum Cosmology],The exact analytical solutions for vacuum el...,[2018-04-29]
5,[http://arxiv.org/abs/1805.00006],"[Mutuk, Halil]",Asymptotic Iteration and Variational Methods f...,[Quantum Physics],In this paper we studied approximate solutio...,"[2018-04-30, 2018-06-01]"
6,[http://arxiv.org/abs/1805.00007],"[Zhang, Chunhua, Guo, Zhaoli, Liang, Hong]",A high-order lattice Boltzmann model for the C...,[Physics - Computational Physics],"In this paper, a lattice Boltzmann model wit...",[2018-04-30]
7,[http://arxiv.org/abs/1805.00008],"[Defenu, Nicolo, Enss, Tilman, Morigi, Giovanna]",Dynamical critical scaling of long-range inter...,"[Condensed Matter - Quantum Gases, Quantum Phy...",Slow variations (quenches) of the magnetic f...,"[2018-04-30, 2018-05-02]"
8,[http://arxiv.org/abs/1805.00009],"[Ferreira, Bruno L M, Guzzo Jr., Henrique, Fer...",Generalized Jordan derivations on semiprime rings,"[Mathematics - Rings and Algebras, 16W25, 47B47]",The purpose of this note is to prove the fol...,[2018-04-30]
9,"[http://arxiv.org/abs/1805.00010, doi:10.1093/...","[Dong, Subo, Katz, Boaz, Kollmeier, Juna A., K...",A Significantly off-center Ni56 Distribution f...,[Astrophysics - High Energy Astrophysical Phen...,We present nebular-phase spectra of the Type...,"[2018-04-30, 2018-05-31]"


For the statistics on how many papers there were per month, I loop over the URLs and count how often each year and month number comes up.

In [0]:
# Map the YYMM prefix of an arXiv ID to a readable month label
months={"1705":"May 17", "1706":"June 17", "1707":"July 17", "1708":"August 17", "1709":"September 17", "1710":"October 17", "1711":"November 17", "1712":"December 17", "1801":"January 18", "1802":"February 18", "1803":"March 18", "1804":"April 18", "1805":"May 18"}
counting=dict.fromkeys(months.values(), 0)
for identifier in frame["Identifier"]:
    # The first arXiv URL in the identifier string carries the YYMM prefix
    match=re.search(r"abs/(\d{4})", identifier)
    if match and match.group(1) in months:
        counting[months[match.group(1)]]+=1

In [0]:
counting

{'April 18': 11351,
 'August 17': 9854,
 'December 17': 10332,
 'February 18': 10592,
 'January 18': 10608,
 'July 17': 9980,
 'June 17': 10297,
 'March 18': 11560,
 'May 17': 11194,
 'May 18': 12595,
 'November 17': 11589,
 'October 17': 11627,
 'September 17': 10517}
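
The same count can also be written with `collections.Counter` from the standard library. The following is a minimal sketch on made-up identifier strings in the style of the "Identifier" column; `year_month` is a hypothetical helper name.

```python
import re
from collections import Counter

# Made-up identifier strings in the style of the "Identifier" column
identifiers = [
    "[http://arxiv.org/abs/1805.00001]",
    "[http://arxiv.org/abs/1805.00005, Phys. Rev. D...]",
    "[http://arxiv.org/abs/1708.00001]",
]


def year_month(identifier):
    """Return the YYMM part of the first arXiv URL, or None if there is none."""
    match = re.search(r"abs/(\d{4})\.", identifier)
    return match.group(1) if match else None


per_month = Counter(year_month(identifier) for identifier in identifiers)
print(per_month["1805"])   # 2
print(per_month["1708"])   # 1
```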

For the statistics on the titles, I loop through the frame and store each title together with its length in words in a dictionary.

In [0]:
titleentity=dict()
for title in frame["Title"]:
    # Map each title to its number of whitespace-separated words
    titleentity[title]=len(title.split())

Longest title

In [0]:
sorted(titleentity, key=titleentity.__getitem__, reverse=True)[0]

'The Intensities of Cosmic Ray H and He Nuclei at ~250 MeV/nuc Measured\r\r\n  by Voyagers 1 and 2 - Using these Intensities to Determine the Solar\r\r\n  Modulation Parameter in the Inner Heliosphere and the Heliosheath Over a 40\r\r\n  Year Time Period'

In [0]:
titleentity["The Intensities of Cosmic Ray H and He Nuclei at ~250 MeV/nuc Measured\r\r\n  by Voyagers 1 and 2 - Using these Intensities to Determine the Solar\r\r\n  Modulation Parameter in the Inner Heliosphere and the Heliosheath Over a 40\r\r\n  Year Time Period"]

41
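
The titles still contain the line breaks and indentation of the original records ("\r\r\n  "). This does not affect the word counts, because `split()` ignores any whitespace, but it does add to the character counts. A small sketch of how the whitespace could be normalized (the function name is made up for this example):

```python
def normalize_whitespace(text):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join(text.split())


raw_title = "The Intensities of Cosmic Ray H and He Nuclei at ~250 MeV/nuc Measured\r\r\n  by Voyagers 1 and 2"
clean_title = normalize_whitespace(raw_title)
print(clean_title)
print(len(raw_title) - len(clean_title))   # characters saved by the cleanup
```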

Number of titles with just one word

In [0]:
counter=0
for key in titleentity:
    if titleentity[key]==1:
        counter+=1

In [0]:
counter

38

Total words

In [0]:
gesamttitle=0
for key in titleentity:
    gesamttitle+=titleentity[key]

In [0]:
gesamttitle

1392402

Average title length in words (Python 2 integer division, so the result is rounded down)

In [0]:
gesamttitle/len(titleentity)

9

The same for the number of characters

In [0]:
zeichentitle=0
for key in titleentity:
    zeichentitle+=len(key)

In [0]:
zeichentitle

10911885

In [0]:
zeichentitle/len(titleentity)

76
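
The averages above are whole numbers because dividing two integers in Python 2 drops the fractional part. A minimal sketch on made-up titles shows the same statistics with true division; all names here are chosen for the example only.

```python
# Made-up titles for illustration
titles = ["Quantum Physics", "Finding the period of a simple pendulum", "Gravity"]

word_counts = [len(title.split()) for title in titles]
total_words = sum(word_counts)

# float() forces true division in both Python 2 and Python 3
average_words = float(total_words) / len(titles)
one_word_titles = sum(1 for count in word_counts if count == 1)

print(total_words)                # 10
print(one_word_titles)            # 1
print(round(average_words, 2))
```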

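And the same for the summaries: the shortest and the longest summary, the total and average number of words, and the total and average number of characters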

In [0]:
descriptionentity=dict()
for description in frame["Description"]:
    # Map each summary to its number of whitespace-separated words
    descriptionentity[description]=len(description.split())

In [0]:
sorted(descriptionentity, key=descriptionentity.__getitem__)[0]

'  The main theorem is incorrectly stated.\r\r\n'

In [0]:
descriptionentity["  The main theorem is incorrectly stated.\r\r\n"]

6

In [0]:
sorted(descriptionentity, key=descriptionentity.__getitem__, reverse=True)[0]

'  -Complex manufacturing systems are subject to high levels of variability that\r\r\r\ndecrease productivity, increase cycle times and severely impact the systems\r\r\r\ntractability. As accurate modelling of the sources of variability is a\r\r\r\ncornerstone to intelligent decision making, we investigate the consequences of\r\r\r\nthe assumption of independent and identically distributed variables that is\r\r\r\noften made when modelling sources of variability such as down-times, arrivals,\r\r\r\nor process-times. We first explain the experiment setting that allows, through\r\r\r\nsimulations and statistical tests, to measure the variability potential stored\r\r\r\nin a specific sequence of data. We show from industrial data that dependent\r\r\r\nbehaviors might actually be the rule with potentially considerable consequences\r\r\r\nin terms of cycle time. As complex industries require strong levers to allow\r\r\r\ntheir tractability, this work underlines the need for a richer and mor

In [0]:
len(sorted(descriptionentity, key=descriptionentity.__getitem__, reverse=True)[0].split())

655

In [0]:
gesamtdescription=0
for key in descriptionentity:
    gesamtdescription+=descriptionentity[key]

In [0]:
gesamtdescription

20614232

In [0]:
gesamtdescription/len(descriptionentity)

145

In [0]:
zeichendescription=0
for key in descriptionentity:
    zeichendescription+=len(key)

In [0]:
zeichendescription

146107589

In [0]:
zeichendescription/len(descriptionentity)

1028

For the author statistics I had to preprocess the frame to bring the authors into a "Last name, First name" format.

In [0]:
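span=2  # every author consists of two comma-separated parts
authorlists=[]
for s in range(0,len(frame)):
    words = str(frame["Authors"][s]).strip(",").split(",")
    # Re-join the parts pairwise into "Last name, First name"
    liste=[",".join(words[i:i+span]) for i in range(0, len(words), span)]
    # Remove surrounding whitespace and the list brackets
    liste=[element.strip().strip("[").strip("]") for element in liste]
    authorlists.append(liste)
# Assign the whole column at once instead of writing cell by cell
frame["Authors"]=authorlists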

Counting the authors.
I accidentally initialized the first occurrence of an author with 0 instead of 1. Therefore 1 has to be added to every count shown below.

In [0]:
authors={}
for authorlist in frame["Authors"]:
    for author in authorlist:
        if author in authors:
            authors[author]+=1
        else:
            # The first occurrence is stored as 0, so every count is one too low
            authors[author]=0
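
For comparison, a sketch of the same count with `collections.Counter`, which starts every author at 1 on the first occurrence, so no correction is needed afterwards. The author lists are made up for this example.

```python
from collections import Counter

# Made-up author lists in the "Last name, First name" format
author_lists = [
    ["Doe, Jane", "Roe, Richard"],
    ["Doe, Jane"],
]

author_counts = Counter()
for author_list in author_lists:
    # update() adds 1 for every name in the list
    author_counts.update(author_list)

print(author_counts["Doe, Jane"])     # 2
print(author_counts.most_common(1))   # [('Doe, Jane', 2)]
```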

The most common authors.
Note the artifact at the top of the list: "...", which occurs when there are too many authors on a paper.

In [0]:
sorted(authors, key=authors.__getitem__, reverse=True)

['...',
 'CMS Collaboration',
 'ATLAS Collaboration',
 'Taniguchi, Takashi',
 'Wang, Wei',
 'Watanabe, Kenji',
 'Liu, Wei',
 'Liu, Yang',
 'Li, Wei',
 'Chen, Xi',
 'Li, Bo',
 'V., Bay',
 'Chen, Wei',
 'Zhang, Rui',
 'Liu, Chang',
 'S., Albrecht',
 'Zhang, Yi',
 'Wang, Xin',
 'Guo, Guang-Can',
 'Bengio, Yoshua',
 'Li, Jun',
 'A., Bondar',
 'Zhang, Wei',
 'Gruendl, R. A.',
 'Van Gool, Luc',
 'Levine, Sergey',
 'Kuehn, K.',
 'Gruen, D.',
 'T., Bondar',
 'Wang, Yi',
 'Poor, H. Vincent',
 'A., Betancourt',
 'Zhang, Yu',
 'Smith, M.',
 'V., Belloli',
 'J. E., Appleby',
 'R., Adeva',
 'A. A., Amato',
 'A., Artuso',
 'Ia., Bifani',
 'M., Ajaltouni',
 'M., Bossu',
 'Liu, Yu',
 'F., Bettler',
 'S., Borisyak',
 'S., Bocci',
 'G., Benson',
 "F., d'Argent",
 'C., Betti',
 'S., Beranek',
 'S., Barter',
 'K., Belyaev',
 'Z., Akar',
 'A., Bizzeti',
 'S., Amerio',
 'M., Bezshyiko',
 'E., Bencivenni',
 'S., Amhis',
 'T. J. V., Bowen',
 'S., Balagura',
 'W., Baryshnikov',
 'F., Boubdir',
 'G., Andreotti'
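
Entries such as 'R., Adeva' look like a side effect of the pairwise joining rather than real names. The sketch below uses a made-up author string (the input format and the name `pair_authors` are assumptions for this example) to show how a single entry without a comma, for instance a collaboration name, shifts all following pairs by one.

```python
def pair_authors(author_string, span=2):
    """Split on commas and re-join the parts in groups of `span`."""
    words = author_string.strip("[]").split(",")
    pairs = [",".join(words[i:i + span]) for i in range(0, len(words), span)]
    return [pair.strip() for pair in pairs]


regular = pair_authors("[Doe, Jane, Roe, Richard]")
shifted = pair_authors("[Example Collaboration, Doe, J., Roe, R.]")

print(regular)   # ['Doe, Jane', 'Roe, Richard']
print(shifted)   # ['Example Collaboration, Doe', 'J., Roe', 'R.']
```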

In [0]:
authors["CMS Collaboration"]

149

In [0]:
authors["ATLAS Collaboration"]

116

In [0]:
authors["Taniguchi, Takashi"]

100

In [0]:
authors["Wang, Wei"]

99

In [0]:
authors["Watanabe, Kenji"]

96

In [0]:
len(authors)

280206

Statistics for the categories. As with the authors, some preprocessing is necessary.

In [0]:
# Split every subject string into a list of subjects
frame["Subjects"]=frame["Subjects"].str.split(",")

In [0]:
for i in range(0,len(frame)):
    subjectlist=frame["Subjects"][i]
    for s in range(0,len(subjectlist)):
        # Remove the list brackets and surrounding spaces from every subject
        subjectlist[s]=subjectlist[s].strip("[").strip("]").strip(" ")

Statistics including subcategories

In [0]:
subjects={}
for subjectlist in frame["Subjects"]:
    for subject in subjectlist:
        # Start at 0 for a new subject, then add this occurrence
        subjects[subject]=subjects.get(subject,0)+1

In [0]:
sorted(subjects, key=subjects.__getitem__, reverse=True)

['Quantum Physics',
 'Computer Science - Computer Vision and Pattern Recognition',
 'Computer Science - Learning',
 'High Energy Physics - Phenomenology',
 'High Energy Physics - Theory',
 'Statistics - Machine Learning',
 'Condensed Matter - Mesoscale and Nanoscale Physics',
 'Condensed Matter - Materials Science',
 'Astrophysics - Astrophysics of Galaxies',
 'General Relativity and Quantum Cosmology',
 'Mathematics - Analysis of PDEs',
 'Mathematics - Combinatorics',
 'Astrophysics - Solar and Stellar Astrophysics',
 'Astrophysics - High Energy Astrophysical Phenomena',
 'Mathematical Physics',
 'Computer Science - Information Theory',
 'Computer Science - Artificial Intelligence',
 'Condensed Matter - Strongly Correlated Electrons',
 'Mathematics - Probability',
 'Astrophysics - Cosmology and Nongalactic Astrophysics',
 'Mathematics - Optimization and Control',
 'Condensed Matter - Statistical Mechanics',
 'Computer Science - Computation and Language',
 'Physics - Optics',
 'Mathema

In [0]:
subjects["Quantum Physics"]

7786

In [0]:
subjects["Computer Science - Computer Vision and Pattern Recognition"]

7361

In [0]:
subjects["Computer Science - Learning"]

7163

In [0]:
subjects["High Energy Physics - Phenomenology"]

7076

In [0]:
subjects["High Energy Physics - Theory"]

6874

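Statistics for the main categories. I only looked at the ones that had a realistic chance of containing the most articles. The numbers are the summed counts of all matching subject entries, so an article listed in several subcategories of the same main category is counted more than once.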

In [0]:
maincategories=["Quantum Physics", "Computer Science", "High Energy Physics", "Statistics", "Condensed Matter", "Astrophysics", "Mathematics"]
subjectsnew={}
for key in subjects:
    for main in maincategories:
        # Add the count to the first main category the subject starts with
        if key.startswith(main):
            subjectsnew[main]=subjectsnew.get(main,0)+subjects[key]
            break

In [0]:
subjectsnew

{'Astrophysics': 21288,
 'Computer Science': 51486,
 'Condensed Matter': 24599,
 'High Energy Physics': 17726,
 'Mathematics': 48337,
 'Quantum Physics': 7786,
 'Statistics': 10425}
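
Instead of a hand-picked list, the main category could also be derived from the subject name itself, since subcategories are written as "Main - Sub". This is a sketch on made-up counts, not the method used above; it would also group free-text subjects that the hand-picked list ignores.

```python
from collections import defaultdict

# Made-up counts per subject in the style of the `subjects` dictionary
subject_counts = {
    "Computer Science - Learning": 3,
    "Computer Science - Information Theory": 2,
    "Quantum Physics": 4,
}

main_counts = defaultdict(int)
for subject, count in subject_counts.items():
    # Everything before " - " is the main category; subjects without
    # a subcategory are their own main category
    main = subject.split(" - ")[0]
    main_counts[main] += count

print(main_counts["Computer Science"])   # 5
print(main_counts["Quantum Physics"])    # 4
```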