In [None]:
import urllib, urllib.request
import xmltodict
import json
import pymongo
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

# **Overview**

I enjoy following the research community and the new methods that researchers and developers around the world create and apply. When I am in the mood, I often go to [arXiv](https://arxiv.org), an open-access archive of scholarly articles in the sciences created and maintained by Cornell University, to read up on interesting new methods in Computer Science, Statistics, Mathematics, and Economics. arXiv is vast, with nearly two million documents in its database. Thankfully, it offers not only a search engine for finding new papers but also an API for retrieving articles by category according to its [taxonomy](https://arxiv.org/category_taxonomy).


<center><img src="/Users/wastechs/Documents/git-repos/arxiv_mongo/images/arxiv.png" width="512" height="174"/></center>


As stated above, I am interested in the following categories of the taxonomy:

**Computer Science**
 - cs.AI = Artificial Intelligence
 - cs.CE = Computational Engineering
 - cs.DB = Databases
 - cs.ET = Emerging Technologies
 - cs.DC = Distributed Computing
 - cs.LG = Machine Learning
 - cs.IT = Information Theory

**Statistics**
 - stat.AP = Statistical Applications
 - stat.ML = Machine Learning
 - stat.TH = Theory
 - stat.ME = Methodology

**Mathematics**
 - math.PR = Probability Theory
 - math.ST = Mathematical Statistics

**Economics**
 - econ.EM = Econometrics

The goal of this project is not only to learn about MongoDB but also to keep myself informed of relevant research articles and researchers in the areas I am interested in. However, because of the storage limit on the free tier of MongoDB Atlas, only the top 1000 documents, **by relevance**, were retrieved for each category.

# **ETL**

In [None]:
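# Database information: credentials are read from the environment instead of being hard-coded
# (MONGO_USER and MONGO_PASSWORD are assumed environment variable names)
cnx = 'mongodb+srv://{}:{}@arxiv.xawxi.mongodb.net/test'.format(
    os.environ['MONGO_USER'], os.environ['MONGO_PASSWORD'])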
# Connection to MongoDB
client = pymongo.MongoClient(cnx)

In [None]:
# Access 'arXiv' database
db = client['arxiv']

## **Fetch Data**

In [None]:
# arXiv category taxonomy for the "for loop"
csCats = ['cs.AI', 'cs.CE', 'cs.DB', 'cs.ET', 'cs.DC', 'cs.LG', 'cs.IT']
statCats = ['stat.AP', 'stat.ML', 'stat.TH', 'stat.ME']
mathCats = ['math.PR', 'math.ST']
econCats = ['econ.EM']

In [None]:
def get_arxiv(db, collection, category, file=False):
    """Fetch up to 1000 arXiv entries per category and insert them into a MongoDB collection."""

    if collection not in ('Math', 'ComputerScience', 'Economics', 'Statistics'):
        raise ValueError('Collection not in MongoDB')
    col = db[collection]

    if not isinstance(category, list):
        raise TypeError('Category is not in list format')

    arxiv_final = None  # stays None if the category list is empty
    for cat in category:
        url = 'http://export.arxiv.org/api/query?search_query=cat:{}&start=0&max_results=1000&sortBy=relevance&sortOrder=ascending'.format(cat)
        with urllib.request.urlopen(url) as data:
            arxiv_data = data.read().decode('utf-8')

        # Returned data is an "Atom Document" - convert to ordered dictionary
        arxiv_dict = xmltodict.parse(arxiv_data)

        # Round-trip through JSON to get plain dictionaries and lists
        arxiv_final = json.loads(json.dumps(arxiv_dict))

        # Insert the entries of this category into the collection
        try:
            col.insert_many(arxiv_final['feed']['entry'])
            print('Document {} inserted into {} collection'.format(cat, collection))
        except (KeyError, TypeError, pymongo.errors.PyMongoError) as err:
            print('An error has occurred: {}'.format(err))

        # Optional - write and save to file
        if file:
            with open('{}.json'.format(cat.replace('.', '_')), 'w') as write_file:
                json.dump(arxiv_dict, write_file, indent=4)

    # Parsed feed of the last category requested
    return arxiv_final

In [None]:
#math = get_arxiv(db, 'Math', mathCats, False)
#cs = get_arxiv(db, 'ComputerScience', csCats, False)
#stat = get_arxiv(db, 'Statistics', statCats, False)
#econ = get_arxiv(db, 'Economics', econCats, False)
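
The query URL used in `get_arxiv` can also be assembled from its parameters, which makes it easier to change the page size or starting offset. The sketch below is only an illustration: `build_query_url` and `ARXIV_API` are hypothetical names that are not part of the original function, and the parameters are simply the ones that already appear in the URL above.

```python
from urllib.parse import urlencode

# Same endpoint as in get_arxiv; the constant name is an assumption for this sketch
ARXIV_API = 'http://export.arxiv.org/api/query'


def build_query_url(cat, start=0, max_results=1000):
    """Hypothetical helper: build the arXiv API query URL for one category."""
    params = {
        'search_query': 'cat:{}'.format(cat),
        'start': start,
        'max_results': max_results,
        'sortBy': 'relevance',
        'sortOrder': 'ascending',
    }
    # safe=':' keeps the 'cat:' prefix readable instead of percent-encoding the colon
    return '{}?{}'.format(ARXIV_API, urlencode(params, safe=':'))


url = build_query_url('cs.AI')
print(url)
```

With the default arguments this reproduces the URL format used in the fetch loop; passing a different `start` would request a later slice of the results.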

## **Document Structure**

The documents returned by the arXiv API were converted to JSON and imported directly into their respective collections in the arXiv database. The class diagram below shows what a _single collection_ looks like; all collections share the same structure. The '{}' indicates a nested substructure, where the additional data is linked below the main document (publication). Note that the '@' prefix in the nested substructures comes from the XML-to-dictionary conversion with xmltodict, which marks XML attributes this way. As will be seen in the analysis below, such fields can be accessed with ordinary dot notation, e.g., 'arxiv:journal_ref.@xmlns:arxiv'.

![class diagram](/Users/wastechs/Documents/git-repos/arxiv_mongo/images/class_diagram.png)
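
To make the nested structure and the dot notation concrete, here is a small sketch in plain Python. The entry is made up for illustration and is not real arXiv data, and `get_path` is a hypothetical helper that only mimics how a dotted field name walks down nested dictionaries. By default, xmltodict prefixes XML attributes with '@' and stores an element's text under '#text'.

```python
# Illustrative entry only: the values are invented, not taken from the database
entry = {
    'title': 'An example title',
    'author': [{'name': 'A. Author'}, {'name': 'B. Author'}],
    'arxiv:journal_ref': {
        '@xmlns:arxiv': 'http://arxiv.org/schemas/atom',  # XML attribute
        '#text': 'Example Journal 1 (2020)',               # element text
    },
}


def get_path(doc, path):
    """Hypothetical helper: follow a dotted path through nested dictionaries."""
    value = doc
    for key in path.split('.'):
        value = value[key]
    return value


namespace = get_path(entry, 'arxiv:journal_ref.@xmlns:arxiv')
journal = get_path(entry, 'arxiv:journal_ref.#text')
print(namespace, '|', journal)
```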

# **Analysis**

To make it clearer how the MongoDB aggregation pipelines are used in the analysis, the system architecture below was drawn up. The analyst writes aggregation queries in an IDE or text editor of choice and sends them to the arXiv database using the Python library _pymongo_. The queries are processed on the MongoDB server, and the results are returned to the client and displayed to the analyst.

<center><img src="/Users/wastechs/Documents/git-repos/arxiv_mongo/images/system-arch-flow-3.png" width="1058" height="554"/></center>

The queries designed here sometimes reflect my personal interests. For example, in statistics I consider myself a Bayesian, so I have constructed a regex pipeline for finding any document whose title contains Bayes, Bayesian, Bayesianism, and so on.

In [None]:
# View all of the collections in the MongoDB
db = client['arxiv']
collections = db.list_collection_names()
collections

In [None]:
# Number of documents in each collection
db.Math.count_documents({}), db.ComputerScience.count_documents({}), db.Economics.count_documents({}), db.Statistics.count_documents({})

In [None]:
# Authors with the most relevant papers in Statistics
project = {'$project': {'_id': 0, 'author.name':1}}
unwind = {'$unwind': '$author.name'}
groupby = {'$group': {'_id': '$author.name', 'count': {'$sum': 1}}}

pipeline = [project, unwind, groupby]

statAuthors = db.Statistics.aggregate(pipeline)

statAuthors = pd.DataFrame(statAuthors)
statAuthors.sort_values(by=['count'], ascending=False)
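
For intuition about what counting papers per author involves, the sketch below does the same thing in plain Python on a few made-up documents. It is an emulation for illustration, not how MongoDB executes the pipeline; `count_authors` is a hypothetical name. One assumption worth flagging: xmltodict produces a list for repeated `author` elements but a single dictionary when a paper has only one author, so the sketch handles both shapes.

```python
from collections import Counter

# Made-up documents shaped like the entries stored in a collection
docs = [
    {'title': 'Paper 1', 'author': [{'name': 'A. Author'}, {'name': 'B. Author'}]},
    {'title': 'Paper 2', 'author': [{'name': 'A. Author'}]},
    {'title': 'Paper 3', 'author': {'name': 'C. Author'}},  # single author: a dict, not a list
]


def count_authors(documents):
    """Count how many documents each author name appears in."""
    counts = Counter()
    for doc in documents:
        authors = doc['author']
        if isinstance(authors, dict):
            authors = [authors]
        for author in authors:           # one row per author, like an unwind step
            counts[author['name']] += 1  # like grouping by name with {'$sum': 1}
    return counts


author_counts = count_authors(docs)
print(author_counts.most_common())
```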

In [None]:
# One of my favorite machine learning researchers at the moment - Does he have any relevant papers?
for doc in db.ComputerScience.aggregate([
    {'$match': {'author.name': 'Kilian Q. Weinberger'}},
    {'$project': {'title': 1, 'author.name': 1, '_id': 0}}]):

    print(doc)

In [None]:
# Statistics papers with "Bayes" anywhere in the title (Bayes, Bayesian, Bayesianism, ...)
for doc in db.Statistics.aggregate([
    {'$match': {'title': {'$regex': 'Bayes'}}},
    {'$project': {'_id': 0,
                  'title': 1,
                  'author.name': 1}}
]):
    print(doc)
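
Whether the pattern is anchored matters for this query: `^Bayes` matches only titles that begin with the word, while `Bayes` matches it anywhere in the title. The sketch below shows the difference with Python's `re` module on made-up titles. MongoDB evaluates `$regex` with its own regex engine, so this is an illustration of the pattern semantics rather than of the server's behavior in every detail.

```python
import re

# Made-up titles for illustration
titles = [
    'Bayesian Optimization of Something',
    'A Tutorial on Variational Bayes',
    'Empirical Risk Minimization',
]

# Anchored: the title must start with "Bayes"
anchored = [t for t in titles if re.search(r'^Bayes', t)]

# Unanchored: "Bayes" may appear anywhere in the title
anywhere = [t for t in titles if re.search(r'Bayes', t)]

print(anchored)
print(anywhere)
```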

In [None]:
# Mathematics and Computer Science complement each other very well
stage_lookup = {
    '$lookup': {
        'from': 'Math',
        'localField': 'author.name',
        'foreignField': 'author.name',
        'as': 'same_author'
    }
}

match = {'$match': {'same_author.0': {'$exists': True}}}

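# Copy the author names and title into flat fields; the '$' prefix refers to a field's value,
# without it the literal strings 'author.name' and 'title' would be stored instead
add_fields = {'$addFields': {
    'author_name': '$author.name',
    'paper_title': '$title'
}}

project = {'$project': {'_id': 0, 'author.name':1, 'title': 1}}

unwind = {'$unwind': '$author.name'}

group_by = {'$group': {'_id': '$author.name', 'count': {'$sum': 1}}}

limit = {'$limit': 3}

# Project first to drop the bulky 'same_author' array, then add the flat fields so they stay in the output
pipeline = [stage_lookup, match, project, add_fields, limit]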
#pipeline = [stage_lookup, match, project, unwind, group_by, limit]

for doc in db.ComputerScience.aggregate(pipeline):
    print(doc)
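
The lookup above keeps Computer Science papers that share at least one author name with a paper in the Math collection (a non-empty `same_author` array). A plain-Python sketch of that idea on made-up documents is given below. It only illustrates the matching logic; `names` is a hypothetical helper, and the real join runs on the MongoDB server.

```python
def names(doc):
    """Hypothetical helper: return the set of author names of one document."""
    authors = doc['author']
    if isinstance(authors, dict):  # a single author is stored as a dict, not a list
        authors = [authors]
    return {a['name'] for a in authors}


# Made-up documents for illustration
cs_docs = [
    {'title': 'CS Paper 1', 'author': [{'name': 'A. Author'}, {'name': 'B. Author'}]},
    {'title': 'CS Paper 2', 'author': {'name': 'C. Author'}},
]
math_docs = [
    {'title': 'Math Paper 1', 'author': [{'name': 'B. Author'}, {'name': 'D. Author'}]},
]

# All author names that appear in the "Math" documents
math_authors = set().union(*(names(d) for d in math_docs))

# Keep CS papers with at least one author who also appears in Math
shared = [d['title'] for d in cs_docs if names(d) & math_authors]
print(shared)
```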

In [None]:
# Given the advancements in Computer Science and AI over the past decade(s), I wonder how many relevant papers
# were published in the more "recent" years?
def art_by_year(col, cat):
    project = {'$project': {'_id': 0}}
    group_by = {'$group': {'_id': {'year': {'$year': '$formatted_date'}},
                'count': {'$sum': 1}}}
    
    group_by_date = col.aggregate([project, group_by])
    byYear = pd.DataFrame(group_by_date)
    byYear['_id'] = pd.json_normalize(byYear['_id'])

    plt.figure(figsize=(9, 6))
    sns.barplot(x='_id', y='count', data=byYear)
    plt.xticks(rotation=45)
    plt.xlabel('Year')
    plt.title('{} Articles'.format(cat))
    plt.show()

    return byYear

In [None]:
csYear = art_by_year(db.ComputerScience_Clean, 'CS')
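
The grouping in `art_by_year` relies on a `formatted_date` field holding a real date, which the `$year` operator requires; how that field was created is not shown here. As a rough sketch of the same count in plain Python, assuming timestamps arrive as ISO 8601 strings ending in `Z` (the sample values below are made up, and `count_by_year` is a hypothetical name):

```python
from collections import Counter
from datetime import datetime

# Made-up publication timestamps; the exact string format is an assumption
published = [
    '2015-03-02T10:15:00Z',
    '2019-07-21T08:00:00Z',
    '2019-11-05T17:30:00Z',
]


def count_by_year(timestamps):
    """Parse each timestamp and count documents per publication year."""
    years = (datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ').year for ts in timestamps)
    # Sort by year so the result reads chronologically
    return dict(sorted(Counter(years).items()))


by_year = count_by_year(published)
print(by_year)
```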