## Task Definition

Task:
- Create a new dataset in CSV format that keeps only records which have comments and whose title does not contain the word COVID.

- The new dataset has only 5 columns: id, title, comments, journal-ref and categories.

## Imports

In [1]:
import pandas as pd
import numpy as np
import dask.bag as db
import dask.dataframe as dd
import json

In [2]:
from dask.distributed import Client, LocalCluster

# Creating a local cluster
cluster = LocalCluster()

# Start with 8 workers
cluster.scale(8)

# Enable adaptive scaling, which then adjusts the worker count between 1 and 8 based on workload
cluster.adapt(minimum=1, maximum=8)

# Creating a client to connect to the cluster
client = Client(cluster)

## Read JSON File

In [3]:
# Read the JSON file line by line with dask.bag and parse each line into a dict
docs = db.read_text('../data/raw_data/arxiv-metadata-oai-snapshot.json').map(json.loads)
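
The call above reads the file as lines of text and parses each line with `json.loads`, which implies one JSON object per line. A minimal sketch of that per-line parsing using only the standard library is shown below; the two sample records and the helper name `parse_json_lines` are made up for illustration and are not part of the dataset or of Dask.

```python
import io
import json

# Two made-up records in JSON Lines form (one JSON object per line).
sample = io.StringIO(
    '{"id": "0000.0001", "title": "First example", "comments": "5 pages"}\n'
    '{"id": "0000.0002", "title": "Second example", "comments": null}\n'
)


def parse_json_lines(lines):
    """Yield one dict per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(parse_json_lines(sample))
```

Note that a JSON `null` becomes Python `None`, which is why the comment filter later in this notebook checks for `None` as well as for the empty string.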

In [4]:
# Preview the first record in the data
docs.take(1)

({'id': '0704.0001',
  'submitter': 'Pavel Nadolsky',
  'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
  'comments': '37 pages, 15 figures; published version',
  'journal-ref': 'Phys.Rev.D76:013009,2007',
  'doi': '10.1103/PhysRevD.76.013009',
  'report-no': 'ANL-HEP-PR-07-12',
  'categories': 'hep-ph',
  'license': None,
  'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with d

In [5]:
# Calculate the total number of records
docs.count().compute()



2223467

## Filter Data
- Keep only records that have comments
- Keep only records whose title does not contain 'covid' (case-insensitive)

In [6]:
# Select records with comments
docs = docs.filter(lambda record: record['comments'] is not None and record['comments'] != '')

In [7]:
# Select records whose title does not contain 'covid' (case-insensitive, via .lower())
docs = docs.filter(lambda record: 'covid' not in record['title'].lower())
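
The two filters can also be written as one named predicate, which is easier to test on plain dicts before running it on the full bag. The sketch below is a hypothetical variant, not the notebook's exact logic: the name `keep_record` is an assumption, and unlike the lambdas above it treats whitespace-only comments as empty and tolerates a missing or `None` title.

```python
def keep_record(record):
    """Return True if the record has a non-empty comment and no 'covid' in its title."""
    comments = record.get("comments")
    title = record.get("title") or ""  # assumption: a missing title counts as empty
    has_comments = bool(comments and comments.strip())
    return has_comments and "covid" not in title.lower()


# Quick check on made-up records
examples = [
    {"title": "Graph decompositions", "comments": "11 pages"},
    {"title": "Modelling COVID-19 spread", "comments": "11 pages"},
    {"title": "No comments here", "comments": None},
]
kept = [r["title"] for r in examples if keep_record(r)]
```

A predicate like this could be passed to `docs.filter(keep_record)` in place of the two separate lambda filters.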

In [8]:
# Preview 10 filtered records 
docs.take(10)

({'id': '0704.0001',
  'submitter': 'Pavel Nadolsky',
  'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
  'comments': '37 pages, 15 figures; published version',
  'journal-ref': 'Phys.Rev.D76:013009,2007',
  'doi': '10.1103/PhysRevD.76.013009',
  'report-no': 'ANL-HEP-PR-07-12',
  'categories': 'hep-ph',
  'license': None,
  'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with d

In [9]:
def flatten(record):
    """Flatten a record by extracting specific fields"""
    
    return {
        'id': record['id'],
        'title': record['title'],
        'comments': record['comments'],
        'journal-ref': record['journal-ref'],
        'categories': record['categories']
    }

# Apply the 'flatten' function to each record in 'docs',
# then convert the resulting Dask Bag to a Dask DataFrame
ddf = docs.map(flatten).to_dataframe()
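
As a sketch of an alternative to listing each key by hand, the same column selection can be driven by a tuple of column names. The names `COLUMNS` and `select_fields` are hypothetical, and using `dict.get` is an assumption that a record lacking one of the keys should yield `None` rather than raise a `KeyError`.

```python
# The five columns required by the task definition
COLUMNS = ("id", "title", "comments", "journal-ref", "categories")


def select_fields(record, columns=COLUMNS):
    """Keep only the requested keys; keys missing from the record map to None."""
    return {name: record.get(name) for name in columns}


# Made-up record with an extra field and without 'journal-ref'
row = select_fields({
    "id": "0000.0001",
    "title": "Example title",
    "comments": "5 pages",
    "categories": "math.CO",
    "abstract": "Not needed in the output.",
})
```

Keeping the column list in one place means the task's five-column requirement is stated once and reused.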

In [10]:
# Preview dask dataframe
ddf.head()

,id,title,comments,journal-ref,categories
0,0704.0001,Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",hep-ph
1,0704.0002,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,math.CO cs.CG
2,0704.0003,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,physics.gen-ph
3,0704.0004,A determinant of Stirling cycle numbers counts...,11 pages,,math.CO
4,0704.0006,Bosonic characters of atomic Cooper pairs acro...,"6 pages, 4 figures, accepted by PRA",,cond-mat.mes-hall


In [11]:
# Count the rows that remain after filtering
len(ddf)

1704135

In [12]:
# Drop rows with missing comments (a safeguard; the bag filter above already removed them)
ddf = ddf.dropna(subset=['comments'])

In [13]:
# Print length of dataframe
len(ddf)

1704135

In [14]:
# Materialize the Dask DataFrame as an in-memory pandas DataFrame
df = ddf.compute()
len(df)

1704135

In [None]:
# Save the dataframe to a CSV file; index=False keeps the output to the 5 required columns
df.to_csv('../data/processed_data/filtered_arxiv_dataset.csv', index=False)
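
Two details matter when the CSV is read back with pandas: a saved index comes back as an extra unnamed column, and arXiv ids such as `0704.0001` are parsed as floats unless a string dtype is requested. The sketch below illustrates both on a tiny made-up frame and an in-memory buffer; the variable names and sample rows are illustrative only, and it does not touch the real output file.

```python
import io

import pandas as pd

# Tiny made-up frame with the same five columns as the task
sample_df = pd.DataFrame({
    "id": ["0704.0001", "0704.0002"],
    "title": ["Example title A", "Example title B"],
    "comments": ["37 pages", "11 pages"],
    "journal-ref": ["Phys.Rev.D76:013009,2007", None],
    "categories": ["hep-ph", "math.CO"],
})

# Write without the index so the file holds exactly five columns
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Naive reload: the id column is inferred as a float and loses its leading zero
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reload with an explicit string dtype to keep the ids intact
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={"id": str})
```

Passing `dtype={"id": str}` when reloading is a choice made by the reader of the file, so it may be worth noting alongside the dataset.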

In [None]:
# Close the client and shut down the cluster
client.close()
cluster.close()