# Predicting Citation Count by Keywords using Regression Models

Models adapted from https://towardsdatascience.com/building-a-linear-regression-with-pyspark-and-mllib-d065c3ba246a 

Start Spark Session

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import format_number, mean, min, max, corr, stddev
from pyspark.sql.functions import (dayofmonth, hour, dayofyear, month, year, weekofyear, format_number, date_format, asc, desc)
from pyspark.sql.functions import explode, col, element_at, size, split
from pyspark.sql.functions import udf


In [2]:
# Build a SparkSession named "test_123"
spark = SparkSession.builder \
    .appName('test_123') \
    .master('local[*]') \
    .config('spark.sql.execution.arrow.pyspark.enabled', True) \
    .config('spark.sql.session.timeZone', 'UTC') \
    .config('spark.driver.memory','12g') \
    .config('spark.ui.showConsoleProgress', True) \
    .config('spark.sql.repl.eagerEval.enabled', True) \
    .getOrCreate()

In [3]:
paps = spark.read.json("/home/jovyan/work/dummy.json/")

The data used in this model is from 2019, gathered in the same manner as in the AL_read_papers notebook.
I used `wget "https://inspirehep.net/api/literature?sort=mostrecent&size=1000&page=1&q=date%202019&subject=Phenomenology-HEP" -O papers_2019.json` to download these papers, and extracted the keywords using Aleksei's notebook.
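
The query string in the `wget` command above can also be assembled programmatically, which makes it easier to change the page or year. The following is a minimal sketch using only the standard library; the helper name `build_query_url` and its default arguments are illustrative assumptions, not part of the original workflow, and the sketch only builds the URL (it does not download anything).

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the wget command above.
BASE_URL = "https://inspirehep.net/api/literature"


def build_query_url(page, size=1000, year=2019, subject="Phenomenology-HEP"):
    """Build the literature query URL for one page of results.

    Hypothetical helper: parameter names mirror the query string used above.
    """
    params = {
        "sort": "mostrecent",
        "size": size,
        "page": page,
        "q": f"date {year}",  # the space is percent-encoded as %20
        "subject": subject,
    }
    # quote_via=quote encodes spaces as %20 rather than '+'
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


first_page_url = build_query_url(page=1)
print(first_page_url)
```

Calling `build_query_url(page=2)` would give the URL for the next 1000 results, assuming the API pages in the way the `page` parameter suggests.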

In [4]:
paps.columns

['Beyond_Standard_Model',
 'Beyond_the_standard_model',
 'CERN_LHC_Coll',
 'CP__violation',
 'None',
 'Strong_Interactions',
 'citation_count',
 'dark_matter',
 'effective_field_theory',
 'energy__high',
 'lattice',
 'lattice_field_theory',
 'neutrino__mass',
 'neutrino__oscillation',
 'new_physics',
 'num_refs',
 'number_of_pages',
 'numerical_calculations',
 'p_p__scattering',
 'quantum_chromodynamics',
 'sensitivity',
 'statistical_analysis',
 'structure',
 'supersymmetry',
 'title']

**Remove rows with missing values, and rows where the `None` column is nonzero, so that the model can be fitted without errors**

In [5]:
# Add an ID to each paper so the abstract data analysis can be attributed to a paper
from pyspark.sql.functions import monotonically_increasing_id
papersWIDs = paps.withColumn("id", monotonically_increasing_id())
# Remove rows containing null or NaN values, printing the row count before and after
print(papersWIDs.count())
papersWIDs_woNA = papersWIDs.dropna()
print(papersWIDs_woNA.count())
# Keep only the papers whose "None" column is 0
papersWIDs_woNA_woNone = papersWIDs_woNA.filter("None == 0")
print(papersWIDs_woNA_woNone.count())

1000
992
655


# Create a features column so that we can fit a regression model
Title is not included since it is the only non-numeric column. citation_count is also left out because it is the label, and id because it is only a row identifier.

This means that the models we construct will attempt to predict citation count from the keywords, together with num_refs and number_of_pages, which are also included as features. In practice, this could show which topics tend to be cited more, and help authors write more influential papers on well-documented topics.

In [6]:
from pyspark.ml.feature import VectorAssembler

numericCols = ['Beyond_Standard_Model',
 'Beyond_the_standard_model',
 'CERN_LHC_Coll',
 'CP__violation',
 'None',
 'Strong_Interactions',
 'dark_matter',
 'effective_field_theory',
 'energy__high',
 'lattice',
 'lattice_field_theory',
 'neutrino__mass',
 'neutrino__oscillation',
 'new_physics',
 'num_refs',
 'number_of_pages',
 'numerical_calculations',
 'p_p__scattering',
 'quantum_chromodynamics',
 'sensitivity',
 'statistical_analysis',
 'structure',
 'supersymmetry']
assembler = VectorAssembler(inputCols=numericCols, outputCol="features")
df = assembler.transform(papersWIDs_woNA_woNone)
df.show()

+---------------------+-------------------------+-------------+-------------+----+-------------------+--------------+-----------+----------------------+------------+-------+--------------------+--------------+---------------------+-----------+--------+---------------+----------------------+---------------+----------------------+-----------+--------------------+---------+-------------+--------------------+---+--------------------+
|Beyond_Standard_Model|Beyond_the_standard_model|CERN_LHC_Coll|CP__violation|None|Strong_Interactions|citation_count|dark_matter|effective_field_theory|energy__high|lattice|lattice_field_theory|neutrino__mass|neutrino__oscillation|new_physics|num_refs|number_of_pages|numerical_calculations|p_p__scattering|quantum_chromodynamics|sensitivity|statistical_analysis|structure|supersymmetry|               title| id|            features|
+---------------------+-------------------------+-------------+-------------+----+-------------------+--------------+-----------+---

In [25]:
from pyspark.ml.feature import StringIndexer

label_stringIdx = StringIndexer(inputCol = 'citation_count', outputCol = 'labelIndex')
df = label_stringIdx.fit(df).transform(df)
df.show()

+---------------------+-------------------------+-------------+-------------+----+-------------------+--------------+-----------+----------------------+------------+-------+--------------------+--------------+---------------------+-----------+--------+---------------+----------------------+---------------+----------------------+-----------+--------------------+---------+-------------+--------------------+---+--------------------+----------+
|Beyond_Standard_Model|Beyond_the_standard_model|CERN_LHC_Coll|CP__violation|None|Strong_Interactions|citation_count|dark_matter|effective_field_theory|energy__high|lattice|lattice_field_theory|neutrino__mass|neutrino__oscillation|new_physics|num_refs|number_of_pages|numerical_calculations|p_p__scattering|quantum_chromodynamics|sensitivity|statistical_analysis|structure|supersymmetry|               title| id|            features|labelIndex|
+---------------------+-------------------------+-------------+-------------+----+-------------------+--------

# Split Data into Training and Testing Set
70% of the papers are used for training, and the model's performance is evaluated on the remaining 30% (the test set).
Future work could replace the random split with a temporal one, using 20xx data for training and 20xx+1 data for testing.
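
The temporal split mentioned as future work could look roughly like the sketch below. It is plain Python over a list of dictionaries, and it assumes each paper carries a `year` field, which the current dataset does not have; the function name `temporal_split` and the toy records are hypothetical.

```python
def temporal_split(records, train_year, year_key="year"):
    """Split records into (train, test): train_year for training, the next year for testing.

    Assumes every record is a dict with an integer year under `year_key`
    (a hypothetical field; the dataset in this notebook has no such column).
    """
    train = [r for r in records if r[year_key] == train_year]
    test = [r for r in records if r[year_key] == train_year + 1]
    return train, test


# Toy records for illustration only, not real papers.
papers = [
    {"title": "A", "year": 2019, "citation_count": 4},
    {"title": "B", "year": 2019, "citation_count": 0},
    {"title": "C", "year": 2020, "citation_count": 7},
    {"title": "D", "year": 2021, "citation_count": 1},
]

train_papers, test_papers = temporal_split(papers, train_year=2019)
print(len(train_papers), len(test_papers))
```

On a Spark DataFrame the same idea would amount to two filters on the year column instead of a call to `randomSplit`.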

In [7]:
train, test = df.randomSplit([0.7, 0.3], seed = 2018)


# Fit a Standard Linear Regression Model
Regressing on citation count, using keywords as features

In [8]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol = 'features', labelCol='citation_count', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [2.8157774002119016,0.7983470337372263,-0.6617708201185918,-1.9339701256333166,0.0,3.6539254880560703,4.299078253174551,0.8043891515041933,-2.2080765914293936,0.0,0.0,0.6907061632722683,0.0,-3.619500241295578,0.19139255586649162,-0.18526214976722516,-0.08713131967368194,0.0,3.0649095738380203,7.334928577818144,13.102483421456945,-1.7230704451081877,-5.542330182306755]
Intercept: 4.018998518064363
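
The coefficient vector is hard to read without names. Since `VectorAssembler` concatenates its input columns in the order given, each coefficient can be paired with the corresponding entry of `numericCols`. The sketch below does this in plain Python with the values copied from the output above. The variable names are illustrative. The magnitudes are not directly comparable, since `num_refs` and `number_of_pages` are on a different scale from the keyword columns.

```python
# Column order copied from numericCols above.
feature_names = [
    "Beyond_Standard_Model", "Beyond_the_standard_model", "CERN_LHC_Coll",
    "CP__violation", "None", "Strong_Interactions", "dark_matter",
    "effective_field_theory", "energy__high", "lattice", "lattice_field_theory",
    "neutrino__mass", "neutrino__oscillation", "new_physics", "num_refs",
    "number_of_pages", "numerical_calculations", "p_p__scattering",
    "quantum_chromodynamics", "sensitivity", "statistical_analysis",
    "structure", "supersymmetry",
]

# Coefficients copied from the printed model output above.
coefficients = [
    2.8157774002119016, 0.7983470337372263, -0.6617708201185918,
    -1.9339701256333166, 0.0, 3.6539254880560703, 4.299078253174551,
    0.8043891515041933, -2.2080765914293936, 0.0, 0.0, 0.6907061632722683,
    0.0, -3.619500241295578, 0.19139255586649162, -0.18526214976722516,
    -0.08713131967368194, 0.0, 3.0649095738380203, 7.334928577818144,
    13.102483421456945, -1.7230704451081877, -5.542330182306755,
]

named = dict(zip(feature_names, coefficients))

# Features whose coefficient is exactly zero (dropped by the model).
zeroed = [name for name, coef in named.items() if coef == 0.0]

# Remaining features, largest absolute coefficient first.
ranked = sorted(
    ((name, coef) for name, coef in named.items() if coef != 0.0),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

print("Zeroed:", zeroed)
for name, coef in ranked[:5]:
    print(f"{name:>24s} {coef:+.3f}")
```

The exactly-zero coefficients are presumably an effect of the L1 part of the elastic-net penalty (`elasticNetParam=0.8`). The `None` column is constant after filtering, so it carries no information either way.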


In [9]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 18.112322
r2: 0.268493
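
For reference, the two metrics printed above follow the usual definitions, where $y_i$ is the actual citation count, $\hat{y}_i$ the prediction and $\bar{y}$ the mean citation count over the $n$ training papers. Combining them gives a rough sanity check on the spread of the training labels. The value of $\sigma_y$ below is derived from the two printed numbers, not read from the data, and it is the population standard deviation, so it may differ slightly from the sample value reported by `describe()`.

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]
% Dividing the numerator and denominator of the R^2 fraction by n:
\[
R^2 = 1 - \frac{\mathrm{RMSE}^2}{\sigma_y^2}
\quad\Longrightarrow\quad
\sigma_y = \frac{\mathrm{RMSE}}{\sqrt{1 - R^2}}
\approx \frac{18.11}{\sqrt{1 - 0.268}} \approx 21.2
\]
```

In other words, a training RMSE of about 18.1 is only modestly below the implied spread of about 21.2 citations that a constant prediction of the mean would leave.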


In [10]:
train.describe().show()

+-------+---------------------+-------------------------+-------------------+-------------------+----+--------------------+------------------+-------------------+----------------------+-------------------+--------------------+--------------------+-------------------+---------------------+------------------+-----------------+-----------------+----------------------+--------------------+----------------------+-------------------+--------------------+-------------------+--------------------+--------------------+------------------+
|summary|Beyond_Standard_Model|Beyond_the_standard_model|      CERN_LHC_Coll|      CP__violation|None| Strong_Interactions|    citation_count|        dark_matter|effective_field_theory|       energy__high|             lattice|lattice_field_theory|     neutrino__mass|neutrino__oscillation|       new_physics|         num_refs|  number_of_pages|numerical_calculations|     p_p__scattering|quantum_chromodynamics|        sensitivity|statistical_analysis|          stru

In [12]:
lr_predictions = lr_model.transform(test)
lr_predictions.select("prediction","citation_count","features").show(20)
from pyspark.ml.evaluation import RegressionEvaluator
lr_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="citation_count",metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

+--------------------+--------------+--------------------+
|          prediction|citation_count|            features|
+--------------------+--------------+--------------------+
|   4.345304340619997|             0|(23,[14,15,16],[7...|
|   9.794055904529598|             0|(23,[14,15,18],[1...|
|  11.899374019061007|             0|(23,[14,15,18],[3...|
|  11.158325419992106|             0|(23,[14,15,18],[3...|
|   6.041693192113198|             0|(23,[14,15,18,22]...|
|  12.697792364864465|             0|(23,[14,15,16],[7...|
|-0.09498411154022612|             0|(23,[13,14,15],[1...|
|   3.146428525991599|             0|(23,[13,14,15],[1...|
|    2.91081749314629|             0|(23,[13,14,15,16]...|
|   4.806351833513407|             0|(23,[13,14,15,16]...|
|    9.41203398650774|             0|(23,[13,14,15,16]...|
| -0.2127674617102402|             0|(23,[12,13,14,15,...|
|   6.548581688893582|             0|(23,[12,13,14,15]...|
|   8.110374166321847|             0|(23,[12,13,14,15]..

In [13]:
test_result = lr_model.evaluate(test)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_result.rootMeanSquaredError)

Root Mean Squared Error (RMSE) on test data = 14.4411


In [14]:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()

numIterations: 10
objectiveHistory: [0.5, 0.4795067148990139, 0.4088534387611, 0.4015884537212795, 0.3951838671759592, 0.3891911512386198, 0.38699867674982447, 0.38547261811477584, 0.38472510007121224, 0.38446853505766815, 0.3843724318849703]




+-------------------+
|          residuals|
+-------------------+
| -5.965667344890003|
|-3.2835428827849658|
|-5.3145279321509875|
| -5.505920488017479|
|-13.675159390677473|
|  -9.22600864302939|
|-13.044502433900648|
| -6.289881929781245|
|-10.374363978228338|
|-11.911634831259539|
| -6.369304648952784|
|-12.034287285750148|
| -9.426495888810496|
| -8.870709439508822|
| -28.23588312870352|
|-18.970982629891118|
|-0.2448881574978916|
|-1.9490299419985169|
|-1.7011580969546758|
|-1.9919427846933822|
+-------------------+
only showing top 20 rows



In [15]:
predictions = lr_model.transform(test)
predictions.select("prediction","citation_count","features").show(20)

+--------------------+--------------+--------------------+
|          prediction|citation_count|            features|
+--------------------+--------------+--------------------+
|   4.345304340619997|             0|(23,[14,15,16],[7...|
|   9.794055904529598|             0|(23,[14,15,18],[1...|
|  11.899374019061007|             0|(23,[14,15,18],[3...|
|  11.158325419992106|             0|(23,[14,15,18],[3...|
|   6.041693192113198|             0|(23,[14,15,18,22]...|
|  12.697792364864465|             0|(23,[14,15,16],[7...|
|-0.09498411154022612|             0|(23,[13,14,15],[1...|
|   3.146428525991599|             0|(23,[13,14,15],[1...|
|    2.91081749314629|             0|(23,[13,14,15,16]...|
|   4.806351833513407|             0|(23,[13,14,15,16]...|
|    9.41203398650774|             0|(23,[13,14,15,16]...|
| -0.2127674617102402|             0|(23,[12,13,14,15,...|
|   6.548581688893582|             0|(23,[12,13,14,15]...|
|   8.110374166321847|             0|(23,[12,13,14,15]..

## Evaluation of Linear Regression
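This model achieved an RMSE of about 14.4 on the test data (18.1 on the training data), with a training R² of about 0.27. It is not a highly accurate model: some predictions are negative, which a citation count cannot be. However, it does a passable job of predicting citations.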
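
One way to put an RMSE figure in context is to compare it with a trivial baseline that always predicts the mean citation count of the training set. The sketch below shows the computation in plain Python. The function names and the toy citation counts are hypothetical, and the real comparison would use the `citation_count` values from `train` and `test`.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_baseline_rmse(train_labels, test_labels):
    """RMSE on the test labels when always predicting the training mean."""
    train_mean = sum(train_labels) / len(train_labels)
    return rmse(test_labels, [train_mean] * len(test_labels))


# Toy citation counts for illustration only, not taken from the dataset.
train_citations = [0, 1, 3, 10, 40]
test_citations = [0, 2, 5, 30]

baseline = mean_baseline_rmse(train_citations, test_citations)
print("Baseline RMSE: %g" % baseline)
```

A model is only adding value to the extent that its test RMSE falls below this baseline.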

# Decision Tree Regression

In [16]:
from pyspark.ml.regression import DecisionTreeRegressor
dt = DecisionTreeRegressor(featuresCol ='features', labelCol = 'citation_count')
dt_model = dt.fit(train)
dt_predictions = dt_model.transform(test)
dt_evaluator = RegressionEvaluator(
    labelCol="citation_count", predictionCol="prediction", metricName="rmse")
rmse = dt_evaluator.evaluate(dt_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 15.0328


In [17]:
dt_model.featureImportances


SparseVector(23, {1: 0.003, 5: 0.2754, 6: 0.0074, 7: 0.0295, 9: 0.0029, 10: 0.0074, 13: 0.0148, 14: 0.497, 15: 0.0698, 18: 0.0341, 19: 0.02, 20: 0.0117, 21: 0.027})
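
The indices in the `SparseVector` above refer to positions in `numericCols`. A small sketch in plain Python, with the values copied from the output above and illustrative variable names, maps them back to column names and sorts them by importance.

```python
# Column order copied from numericCols above.
feature_names = [
    "Beyond_Standard_Model", "Beyond_the_standard_model", "CERN_LHC_Coll",
    "CP__violation", "None", "Strong_Interactions", "dark_matter",
    "effective_field_theory", "energy__high", "lattice", "lattice_field_theory",
    "neutrino__mass", "neutrino__oscillation", "new_physics", "num_refs",
    "number_of_pages", "numerical_calculations", "p_p__scattering",
    "quantum_chromodynamics", "sensitivity", "statistical_analysis",
    "structure", "supersymmetry",
]

# Non-zero entries copied from dt_model.featureImportances above.
importances = {
    1: 0.003, 5: 0.2754, 6: 0.0074, 7: 0.0295, 9: 0.0029, 10: 0.0074,
    13: 0.0148, 14: 0.497, 15: 0.0698, 18: 0.0341, 19: 0.02, 20: 0.0117,
    21: 0.027,
}

# Pair each index with its column name, most important first.
ranked_importances = sorted(
    ((feature_names[index], weight) for index, weight in importances.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, weight in ranked_importances[:5]:
    print(f"{name:>24s} {weight:.4f}")
```

Read this way, the tree relies mostly on `num_refs` (about 0.50) and `Strong_Interactions` (about 0.28), with every other feature contributing less than 0.07.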

This model had an RMSE of about 15.0 on the test data, which is roughly the same accuracy as the standard linear regression model (about 14.4).

# GBT Regression

In [18]:
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(featuresCol = 'features', labelCol = 'citation_count', maxIter=10)
gbt_model = gbt.fit(train)
gbt_predictions = gbt_model.transform(test)
gbt_predictions.select('prediction', 'citation_count', 'features').sort('prediction').show(20)

+-------------------+--------------+--------------------+
|         prediction|citation_count|            features|
+-------------------+--------------+--------------------+
| -4.758025729493576|            19|(23,[1,13,14,15,1...|
|0.43929852742762365|             0|(23,[2,14,15],[1....|
| 0.8477842724012375|             1|(23,[13,14,15],[1...|
| 1.0574511190273665|             3|(23,[14,15,18,21]...|
| 1.4119738664636814|            10|(23,[2,14,15,17],...|
| 1.6797149286371216|             0|(23,[14,15,16],[7...|
| 1.6797149286371216|             1|(23,[14,15,16],[7...|
| 1.6797149286371216|             4|(23,[14,15,18],[8...|
| 2.0412499417114693|             0|(23,[12,13,14,15,...|
| 2.1118824056366816|             0|(23,[13,14,15],[1...|
|   2.11771230729489|             1|(23,[13,14,15],[1...|
|   2.11771230729489|             0|(23,[13,14,15],[1...|
|   2.11771230729489|             1|(23,[13,14,15],[1...|
| 2.1361174476605007|             2|(23,[14,15,16],[1...|
| 2.1361174476

In [19]:
gbt_evaluator = RegressionEvaluator(
    labelCol="citation_count", predictionCol="prediction", metricName="rmse")
rmse = gbt_evaluator.evaluate(gbt_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 15.2337


This model also had an RMSE of about 15 on the test data (15.2), so its predictive power is roughly the same as that of the other models.