## Data Pre-Processing:

ℹ️ This notebook reads data from a JSON file, preprocesses it, and stores the result in a new JSON file.

✅ It keeps only papers categorized under topics related to *Computer Science*, which reduces the resources and processing capacity needed to index and search the papers.

🚀 The notebook is intended to prepare the data for use in the Parec project, particularly for indexing and searching in Elasticsearch.

***
⚠️ Please note that the original file (`arxiv-metadata-oai-snapshot.json`) could not be uploaded to this repository due to its large size. If you want to re-run this code, you need to download it from the [Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv) website. ⚠️

***


In [1]:
import itertools
import json
import pickle
import re
import sys

import requests
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)

In [2]:
data_file = 'arxiv-metadata-oai-snapshot.json'
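The loader defined further down reads the snapshot one line at a time and parses each line separately, which implies one JSON object per line. A minimal sketch for peeking at the first few records without loading the whole file follows. It runs on a tiny stand-in file; the helper name `peek`, the sample ids, and the titles are invented for illustration.

```python
import itertools
import json
import os
import tempfile


def peek(path, n):
    """Parse only the first n lines of a one-JSON-object-per-line file."""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in itertools.islice(f, n)]


# Build a small stand-in for the snapshot (made-up records, not real arXiv data).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.json')
    with open(sample_path, 'w', encoding='utf-8') as f:
        for i in range(5):
            f.write(json.dumps({'id': f'demo.{i}', 'title': f'Paper {i}'}) + '\n')

    first_two = peek(sample_path, 2)

print([record['id'] for record in first_two])  # ['demo.0', 'demo.1']
```

Passing `data_file` instead of `sample_path` should work the same way once the snapshot has been downloaded, assuming it is UTF-8 encoded.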

In [3]:
# Filter for categories
# see https://arxiv.org/help/api/user-manual --> only keep categories related to Computer Science

category_map = {
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CG': 'Computational Geometry',
'cs.CL': 'Computation and Language',
'cs.CR': 'Cryptography and Security',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.DM': 'Discrete Mathematics',
'cs.DS': 'Data Structures and Algorithms',
'cs.ET': 'Emerging Technologies',
'cs.FL': 'Formal Languages and Automata Theory',
'cs.GL': 'General Literature',
'cs.GR': 'Graphics',
'cs.GT': 'Computer Science and Game Theory',
'cs.HC': 'Human-Computer Interaction',
'cs.IR': 'Information Retrieval',
'cs.IT': 'Information Theory',
'cs.LG': 'Machine Learning',
'cs.LO': 'Logic in Computer Science',
'cs.MA': 'Multiagent Systems',
'cs.MM': 'Multimedia',
'cs.MS': 'Mathematical Software',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',
'cs.PF': 'Performance',
'cs.PL': 'Programming Languages',
'cs.RO': 'Robotics',
'cs.SC': 'Symbolic Computation',
'cs.SD': 'Sound',
'cs.SE': 'Software Engineering',
'cs.SI': 'Social and Information Networks',
'cs.SY': 'Systems and Control'}
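The filter defined later looks up only the first code in each record's space-separated `categories` string, so a paper is kept only when its first listed category is one of the `cs.*` codes above. A small sketch of that lookup, assuming a shortened map (`demo_map`) and made-up category strings; `primary_category_name` is a hypothetical helper, not part of the notebook:

```python
# Shortened stand-in for the full category map; the name demo_map is an assumption.
demo_map = {
    'cs.IT': 'Information Theory',
    'cs.LG': 'Machine Learning',
}


def primary_category_name(categories, mapping):
    """Return the name of the first listed category, or None if it is not in the map."""
    first_code = categories.split(' ')[0]
    return mapping.get(first_code)


# Made-up examples of the space-separated 'categories' field.
examples = ['cs.IT math.IT', 'math.IT cs.IT', 'cs.LG stat.ML']
names = [primary_category_name(c, demo_map) for c in examples]
print(names)  # ['Information Theory', None, 'Machine Learning']
```

The second example shows that a paper cross-listed in Computer Science is still dropped when a non-CS code comes first.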

In [6]:
# with open("cs_categories.json", "w") as outfile:      #save categories as json
#     json.dump(category_map, outfile)

In [4]:
def get_metadata(path_to_dataset):
    """Reads in a dataset file and returns an iterable generator.
    
    Args:
        path_to_dataset (str): A string representing the path to the dataset file.
    
    Yields:
        str: A string containing each line of the dataset file.
    """
    with open(path_to_dataset, 'r', encoding='utf-8') as f:     # stream the original data set line by line
        for line in f:
            yield line

# Strip leading/trailing whitespace and collapse internal whitespace runs (including \n) into a single space
def clean_strings(strings):
    cleaned = strings.strip()
    return re.sub(r'\s+', ' ', cleaned)
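To illustrate the effect of `clean_strings`, the sketch below re-declares the function with the same logic and applies it to a made-up string that mimics a raw abstract with hard line breaks. The sample text is invented.

```python
import re


def clean_strings(text):
    """Trim both ends and collapse every whitespace run to a single space."""
    return re.sub(r'\s+', ' ', text.strip())


# Made-up sample with leading spaces, a tab, repeated spaces, and newlines.
raw = "  We study\n  random vector\tquantization   schemes.\n"
cleaned = clean_strings(raw)
print(repr(cleaned))  # 'We study random vector quantization schemes.'
```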

In [8]:
def filter_dataset(path_to_dataset, category_map):
    """
    Filter the data set to include only papers whose first listed category is a Computer Science
    category and whose journal reference ends in a year from 2004 to 2023.

    Args:
        path_to_dataset (str): path to the dataset file.
        category_map (dict): mapping of category codes to names.

    Returns:
        A dictionary with a single key "root" holding a list of paper records, each containing
        abstract, title, author, year, category, and paper_id.
    """

    authors = []
    titles = []
    abstracts = []
    years = []
    categories = []
    ids = []
    metadata = get_metadata(path_to_dataset)

    for paper in metadata:
        paper_dict = json.loads(paper)
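        ref = paper_dict.get('journal-ref')
        try:
            # Assumes the journal reference ends in a four-digit year
            year = int(ref[-4:])
            if 2003 < year <= 2023:
                # Category lookup comes first: a non-CS paper raises KeyError before any list grows
                categories.append(category_map[paper_dict.get('categories').split(" ")[0]])
                authors.append(paper_dict.get('authors'))
                years.append(year)
                titles.append(paper_dict.get('title'))
                abstracts.append(paper_dict.get('abstract'))
                ids.append(paper_dict.get('id'))
        except (TypeError, ValueError, KeyError, AttributeError):
            # Missing journal-ref, non-numeric year, non-CS first category, or missing categories
            continue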
    #print("Check length: ", len(titles), len(abstracts), len(years), len(authors), len(categories))

    cleaned_abstracts = [clean_strings(abstract) for abstract in abstracts]
    cleaned_titles = [clean_strings(title) for title in titles]
    cleaned_authors = [clean_strings(author) for author in authors]

    reduced = []
    for author, title, abstract, year, category, paper_id in zip(cleaned_authors, cleaned_titles, cleaned_abstracts, years, categories, ids):
        reduced.append({"abstract": abstract, "title": title, "author": author, "year": year, "category": category, "paper_id": paper_id})
    
    return {"root": reduced}        #add root
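The year is taken from the last four characters of `journal-ref`, so a record is kept only if that field exists and ends in a plain four-digit year. The sketch below mirrors that step in a hypothetical helper, `year_from_ref`, and runs it on invented journal references to show which shapes are skipped.

```python
def year_from_ref(ref):
    """Hypothetical helper mirroring the `int(ref[-4:])` step in filter_dataset.

    Returns None in the cases where the notebook's try/except skips the paper.
    """
    try:
        return int(ref[-4:])
    except (TypeError, ValueError):
        return None


# Invented journal references, not taken from the data set.
samples = [
    "Journal of Examples 12, 2019",     # ends in a four-digit year
    "Journal of Examples 12 (2019)",    # last four characters are '019)'
    None,                               # record without a journal reference
]
years = [year_from_ref(s) for s in samples]
print(years)  # [2019, None, None]
```

Under this logic, papers without a journal reference, or with one that does not end in the year, never reach the output file.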

In [9]:
data = filter_dataset(data_file, category_map)

In [10]:
print(data["root"][0:1])

[{'abstract': 'Given a multiple-input multiple-output (MIMO) channel, feedback from the receiver can be used to specify a transmit precoding matrix, which selectively activates the strongest channel modes. Here we analyze the performance of Random Vector Quantization (RVQ), in which the precoding matrix is selected from a random codebook containing independent, isotropically distributed entries. We assume that channel elements are i.i.d. and known to the receiver, which relays the optimal (rate-maximizing) precoder codebook index to the transmitter using B bits. We first derive the large system capacity of beamforming (rank-one precoding matrix) as a function of B, where large system refers to the limit as B and the number of transmit and receive antennas all go to infinity with fixed ratios. With beamforming RVQ is asymptotically optimal, i.e., no other quantization scheme can achieve a larger asymptotic rate. The performance of RVQ is also compared with that of a simpler reduced-rank

In [12]:
with open('../runtime/arxiv_large.json', 'w') as fp:    #save reduced data set
    json.dump(data, fp)
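A quick sanity check after saving is to reload the file and count papers per category. The sketch below does this on a small stand-in with the same `{"root": [...]}` layout; the records and the file name `arxiv_sample.json` are made up, and the real output path would be `../runtime/arxiv_large.json`.

```python
import json
import os
import tempfile
from collections import Counter

# Stand-in for the saved structure, with invented records.
sample = {'root': [
    {'abstract': 'First abstract.', 'title': 'T1', 'author': 'A. Author',
     'year': 2019, 'category': 'Machine Learning', 'paper_id': 'demo.1'},
    {'abstract': 'Second abstract.', 'title': 'T2', 'author': 'B. Author',
     'year': 2020, 'category': 'Information Theory', 'paper_id': 'demo.2'},
    {'abstract': 'Third abstract.', 'title': 'T3', 'author': 'C. Author',
     'year': 2021, 'category': 'Machine Learning', 'paper_id': 'demo.3'},
]}

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'arxiv_sample.json')
    with open(out_path, 'w', encoding='utf-8') as fp:
        json.dump(sample, fp)
    with open(out_path, 'r', encoding='utf-8') as fp:
        reloaded = json.load(fp)

# Count how many papers fall into each category.
per_category = Counter(paper['category'] for paper in reloaded['root'])
print(len(reloaded['root']), per_category.most_common())
```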