# **arXiv Metadata Analytics with PySpark RDD: A JSON Case Study**

### Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark

### Author: Amin Karami (PhD, FHEA)
#### email: amin.karami@ymail.com

In [1]:
########## ONLY in Colab ##########
!pip3 install pyspark
########## ONLY in Colab ##########

Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)

In [2]:
########## ONLY in Ubuntu Machine ##########
# Load Spark engine
!pip3 install -q findspark
import findspark
findspark.init()
########## ONLY in Ubuntu Machine ##########

In [3]:
# Initializing Spark
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("Archive_PySpark").setMaster("local[*]")


In [5]:
sc = SparkContext(conf=conf)
print(sc)

<SparkContext master=local[*] appName=Archive_PySpark>


In [6]:
# Read the data into Spark and parse each line as JSON
# Data source: https://www.kaggle.com/Cornell-University/arxiv/version/62
import json

# Each line of the file is one JSON record; ask for at least 100 partitions
rdd_json = sc.textFile("archive/arxiv-metadata-oai-snapshot.json", 100)
rdd = rdd_json.map(lambda x: json.loads(x))
rdd.persist()


PythonRDD[2] at RDD at PythonRDD.scala:53
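
The cell above assumes every line of the file is valid JSON. As a minimal sketch, a more defensive parser could skip blank or malformed lines instead of failing the whole job. The helper name `parse_json_line` is hypothetical and not part of the course material.

```python
import json


def parse_json_line(line):
    """Return [record] for a valid JSON line, or [] for a blank or malformed one."""
    line = line.strip()
    if not line:
        return []
    try:
        return [json.loads(line)]
    except json.JSONDecodeError:
        # Malformed line: drop it rather than abort the Spark job
        return []


# Hypothetical usage with the RDD defined in the cell above:
# rdd = rdd_json.flatMap(parse_json_line)
```

Returning a list lets `flatMap` drop bad lines naturally, since an empty list contributes no elements to the resulting RDD.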

In [7]:
# Check the default parallelism and the number of partitions:
print(sc.defaultParallelism)
print(rdd.getNumPartitions())

4
100


## Question 1: Count elements

In [8]:
rdd.count()

2011231

## Question 2: Get the first two records


In [9]:
rdd.take(2)

[{'id': '0704.0001',
  'submitter': 'Pavel Nadolsky',
  'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
  'comments': '37 pages, 15 figures; published version',
  'journal-ref': 'Phys.Rev.D76:013009,2007',
  'doi': '10.1103/PhysRevD.76.013009',
  'report-no': 'ANL-HEP-PR-07-12',
  'categories': 'hep-ph',
  'license': None,
  'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with d

## Question 3: Get all attributes


In [12]:
# rdd.flatMap(lambda x: x.keys()).distinct().count()
rdd.flatMap(lambda x: x.keys()).distinct().collect()

['authors',
 'comments',
 'title',
 'id',
 'journal-ref',
 'versions',
 'submitter',
 'categories',
 'update_date',
 'authors_parsed',
 'report-no',
 'license',
 'abstract',
 'doi']

## Question 4: Get the names of the licenses

In [13]:
rdd.map(lambda x:x["license"]).distinct().collect()

[None,
 'http://creativecommons.org/licenses/publicdomain/',
 'http://creativecommons.org/licenses/by-nc-nd/4.0/',
 'http://creativecommons.org/licenses/by-nc-sa/4.0/',
 'http://creativecommons.org/licenses/by-nc-sa/3.0/',
 'http://creativecommons.org/licenses/by/3.0/',
 'http://creativecommons.org/licenses/by/4.0/',
 'http://creativecommons.org/publicdomain/zero/1.0/',
 'http://arxiv.org/licenses/nonexclusive-distrib/1.0/',
 'http://creativecommons.org/licenses/by-sa/4.0/']

## Question 5: Get the shortest and the longest titles

In [15]:
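# Note: `<` and `>` on strings compare lexicographically, so these are the
# alphabetically first and last titles; comparing by length needs len(x) and len(y).
shortest_title = rdd.map(lambda x: x["title"]).reduce(lambda x, y: x if x < y else y)
longest_title = rdd.map(lambda x: x["title"]).reduce(lambda x, y: x if x > y else y)
print(shortest_title)
print(longest_title)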

!-Graphs with Trivial Overlap are Context-Free
Weyl formula for the negative dissipative eigenvalues of Maxwell's
  equations
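
The comparison in the cell above orders titles as strings, which yields the alphabetically first and last titles. If the goal is the shortest and longest titles by character count, one possible sketch is below. The helper names `shorter_title` and `longer_title` are hypothetical; ties are broken alphabetically so that the reducer stays commutative and associative, which `reduce` on an RDD needs for a deterministic result.

```python
def shorter_title(a, b):
    """Pick the title with fewer characters; break ties alphabetically."""
    return min(a, b, key=lambda t: (len(t), t))


def longer_title(a, b):
    """Pick the title with more characters; break ties alphabetically."""
    return max(a, b, key=lambda t: (len(t), t))


# Hypothetical usage with the RDD from this notebook:
# titles = rdd.map(lambda x: x["title"])
# print(titles.reduce(shorter_title))
# print(titles.reduce(longer_title))
```

Note that arXiv titles can contain embedded newlines and indentation (as in the record shown under Question 2), so the character count includes that whitespace unless the titles are normalized first.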


## Question 6: Find abbreviations with 5 or more letters in the abstract
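The notebook gives no code for this question. A minimal sketch follows, assuming that an "abbreviation" means a standalone run of five or more uppercase ASCII letters; that definition, the pattern, and the helper name `find_abbreviations` are assumptions rather than the course's solution.

```python
import re

# Assumption: an abbreviation is a whole word made of 5+ uppercase letters
ABBREVIATION = re.compile(r"\b[A-Z]{5,}\b")


def find_abbreviations(record):
    """Return all 5+ letter uppercase words found in the record's abstract."""
    abstract = record.get("abstract") or ""  # tolerate a missing or None abstract
    return ABBREVIATION.findall(abstract)


# Hypothetical usage with the RDD from this notebook:
# rdd.flatMap(find_abbreviations).distinct().collect()
```

Using `flatMap` flattens the per-record lists into one RDD of abbreviations, and `distinct()` removes repeats across abstracts.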

## Question 7: Get the number of archive records per month ('update_date' attribute)
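No code is given for this question either. The sketch below assumes that `update_date` is a string in `YYYY-MM-DD` form and that "per month" means the calendar month (1 to 12) across all years; both are assumptions, and `month_key` is a hypothetical helper name.

```python
from datetime import datetime


def month_key(record):
    """Map a record to (month_number, 1) so the pairs can be summed by key."""
    # Assumption: update_date looks like "2008-11-26"
    date = datetime.strptime(record["update_date"], "%Y-%m-%d")
    return (date.month, 1)


# Hypothetical usage with the RDD from this notebook:
# (rdd.map(month_key)
#     .reduceByKey(lambda a, b: a + b)
#     .sortByKey()
#     .collect())
```

To count per year and month instead, the key could be `date.strftime("%Y-%m")` rather than `date.month`.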

## Question 8: Get the average number of pages
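
The notebook stops before answering this question. One possible sketch, assuming the page count is taken from the free-text `comments` field (for example `'37 pages, 15 figures; published version'` in the record shown under Question 2) and that records without such a mention are simply skipped. The pattern and the helper name `page_count` are assumptions.

```python
import re

# Assumption: page counts appear in comments as "<number> pages" or "<number> page"
PAGES = re.compile(r"(\d+)\s*pages?\b", re.IGNORECASE)


def page_count(record):
    """Return [n] if the comments mention 'n pages', otherwise []."""
    comments = record.get("comments") or ""  # comments may be missing or None
    match = PAGES.search(comments)
    return [int(match.group(1))] if match else []


# Hypothetical usage with the RDD from this notebook:
# pages = rdd.flatMap(page_count)
# print(pages.mean())
```

Only the first page mention in each comment is used, and the average is taken over records that report a page count, not over all records.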