# Scientific concepts project report

## The problem

In the scientific community it is important to stay up to date with recent work and with the concepts that are currently **state of the art**. As more and more papers are published every day, it is hard to follow all of them, even within a particular area of interest. With this project we want to provide academic researchers with a personalized list of scientific concepts whose study would enhance their research capabilities.

**Model user** is an academic researcher in the field of biomedicine. This person has
some prior (domain) knowledge, which I treat as a set of concepts that the person self-reports being comfortable with.

**Major assumption** that I make is that the "scientific concepts whose study would
enhance their research capabilities" are the ones that the respective community of
researchers deems important, and that this notion of importance can be expressed as a
numeric score based on what the researchers write in scientific publications.

## The data

The data for this project is taken from the **Semantic Scholar Open Research Corpus**. Semantic Scholar is an open scientific paper search engine that connects relevant papers and extracts useful information such as abstracts, figures and entities from each paper. Since 2018 it includes more than 40 million papers on Biomedicine, Neuroscience and Computer Science. The data is available for download via the [public API](https://api.semanticscholar.org/corpus/).

The data is provided in JSON format with the following fields:
* id
* title
* paperAbstract
* entities
* s2Url
* s2PdfUrl
* pdfUrls
* authors
* inCitations
* outCitations
* year
* venue
* journalName
* journalVolume
* journalPages
* sources
* doi
* doiUrl
* pmid

For this project we only use the id, title, paperAbstract, year and entities fields.
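
As a small illustration of the field selection, the sketch below parses one record and keeps only the five fields used in this project. It assumes that each line of the downloaded archives holds one JSON object; the sample record and the helper name `select_fields` are hypothetical and only show the shape of the data.

```python
import json

# Fields kept for this project (see the list above).
KEPT_FIELDS = ('id', 'title', 'paperAbstract', 'year', 'entities')


def select_fields(json_line):
    """Parse one JSON record and keep only the fields used in the project."""
    record = json.loads(json_line)
    # .get() leaves a field as None when the record does not carry it
    return {field: record.get(field) for field in KEPT_FIELDS}


# Hypothetical record, made up for illustration only.
sample_line = json.dumps({
    'id': 'abc123',
    'title': 'An example title',
    'paperAbstract': 'An example abstract.',
    'year': 2017,
    'entities': ['Sea bream', 'Hay fever'],
    'venue': 'Example venue',
})

print(select_fields(sample_line))
```

The actual conversion in this project is done with Spark; the snippet is only meant to make the record layout concrete.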

## Preliminary steps recap

In order to execute the code presented in this report, all (!) preliminary steps listed in readme.md must be completed first.

### Installed and configured

* Java 8
* Hadoop 3.1.2
* Spark 2.4.3 (Pre-built with user-provided Apache Hadoop)
* Python 3.6.8

As well as Python libraries listed under **requirements.txt**

### Downloaded

* Semantic Scholar Open Research Corpus
* FastText pretrained model

### HDFS and Yarn

* Name-nodes formatted
* Daemons started (start-dfs.sh, start-yarn.sh)
* Data converted to parquet and pushed to HDFS under scholar_data/base.parquet

**Note**: you will need daemons up and running while executing code in this notebook.

### Get data

#### If you have done the preliminary steps you can skip this part (you should have the parquet data saved).

Download by running the following from the project folder (it creates a data/ subfolder and stores the files there):
```
sh bash/get_data.sh
```

Or by following the instructions on the website. In the latter case, make sure to remove the sample file sample-S2-records.gz in order to avoid data duplication.

Now you can convert the data from a set of compressed JSON files to parquet and store it in HDFS. This is by far the most time-consuming step and may take several hours to complete. On the bright side, thanks to the conversion, all subsequent operations become rather fast.
```
python src/0_convert_to_parquet_store_to_hdfs.py
```

The **0_convert_to_parquet_store_to_hdfs.py** file selects the 'id', 'title', 'year', 'entities' and 'paperAbstract' fields and converts the data to parquet, a columnar format that works well with large tables in PySpark.

## Setup

### Global variables

In [1]:
preprocess_data = True  # set to False to rerun all cells for the demo when preprocessing has already been done

In [None]:
import os  # needed here because this cell comes before the Libraries section

dirpath = os.getcwd()
model_bin_path = f'{dirpath}/fasttext/crawl-300d-2M-subword.bin'

### Libraries

In [2]:
import findspark
findspark.init()
import pyspark
findspark.find() # solely to check whether path is correct

'/usr/local/spark/spark-2.4.3-bin-without-hadoop'

In [3]:
from pyspark.sql import SparkSession
from pyspark import SQLContext

import pyspark.sql.functions as F
import pyspark.sql.types as T

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.feature import BucketedRandomProjectionLSH, BucketedRandomProjectionLSHModel

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq

In [4]:
import os
import fasttext
import numpy as np
import pandas as pd

### Functions

In [5]:
def pivot_keyterms(df):
    # year - single integer, entities - list of strings (keyterms)
    # keep only papers for periods (years) that fully passed
    # make separate row for each keyterm
    df = df.select('year', 'entities').\
        filter("year < 2019").\
        withColumn('entities', F.explode('entities'))
    
    # turn all keyterms to lowercase
    df = df.withColumn('entities', F.lower(F.col('entities')))
    
    # cut number formatting: 10,000 -> 10000
    df = df.withColumn('entities', F.regexp_replace(F.col('entities'), r'(\d)[, ](\d{3})', '$1$2'))
    
    # treat comma-separated values as separate keyterms (even though they were posed as a unit)
    df = df.withColumn('entities', F.explode(F.split('entities', '[,]')))
    
    # substitute underscore with a space as a means of separation
    # get rid of excessive spaces
    df = df.withColumn('entities', F.regexp_replace(F.col('entities'), '_', ' '))
    df = df.withColumn('entities', F.trim(F.col('entities')))
    df = df.withColumn('entities', F.regexp_replace(F.col('entities'), r'\s+', ' '))

    # keep only keyterms that contain alpha-numeric values and spaces
    # retrieve counts for each keyterm for each year
    # \w also includes underscore, but we handled it earlier
    df = df.filter(~(F.col('entities').rlike(r'[^\w\s]'))).\
        groupby('entities').\
        pivot('year').\
        count().\
        sort('entities')
    
    return df
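
To make the cleaning rules easier to follow, here is a plain-Python sketch of what happens to a single raw entity string. It is not the Spark code above: the helper name `normalize_keyterm` is hypothetical, and `re.ASCII` and `strip(' ')` are assumptions chosen to approximate how the Spark functions treat word characters and spaces.

```python
import re


def normalize_keyterm(raw):
    """Return the cleaned keyterms derived from one raw entity string."""
    text = raw.lower()
    # drop thousands separators: "10,000" -> "10000"
    text = re.sub(r'(\d)[, ](\d{3})', r'\1\2', text, flags=re.ASCII)
    cleaned = []
    # remaining commas separate individual keyterms
    for part in text.split(','):
        # underscores act as spaces; trim and collapse repeated whitespace
        part = part.replace('_', ' ').strip(' ')
        part = re.sub(r'\s+', ' ', part)
        # keep only terms made of word characters and whitespace
        if not re.search(r'[^\w\s]', part, flags=re.ASCII):
            cleaned.append(part)
    return cleaned


print(normalize_keyterm('Sea_Bream,  Hay   Fever'))
print(normalize_keyterm('10,000 Genomes'))
print(normalize_keyterm('HIV-1'))
```

The last call returns an empty list because the hyphen is neither a word character nor whitespace, so such terms are dropped rather than repaired.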

In [6]:
def get_tokens_frequency(df, token_min_length=3, token_min_count=11, enhance_factor=100000.0):
    # discard suspiciously short tokens 
    df = df.filter(f'LENGTH(entities) >= {token_min_length}')

    # gather column names linked to years (every column except 'entities')
    col_years = [col_name for col_name in df.columns if col_name != 'entities']

    # Find peak usage of token across the years
    # https://stackoverflow.com/questions/40874657/pyspark-compute-row-maximum-of-the-subset-of-columns-and-add-to-an-exisiting-da
    minf = F.lit(float("-inf"))
    df = df.withColumn("year_max", F.greatest(*[F.coalesce(F.col(year), minf) for year in col_years]))

    # forget about tokens that have never been really used
    df = df.filter(f'year_max >= {token_min_count}').drop('year_max')

    # find total number of "valid" tokens used in each year
    df = df.join(df.groupby().sum(*col_years))

    # retrieve token frequency (times a common coefficient) for each year
    # the coefficient keeps values away from the limits of float precision
    for year in col_years:
        df = df.withColumn(year, enhance_factor*F.col(year) / F.col(f'sum({year})')).drop(f'sum({year})')
    
    return df 
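
The same computation can be sketched on a small dictionary, which may help when checking the logic by hand. This is a simplified stand-in, not the Spark implementation: the function name `tokens_frequency` and the toy counts are hypothetical, and years in which a term was not seen are simply omitted instead of being stored as nulls.

```python
def tokens_frequency(counts, token_min_length=3, token_min_count=11,
                     enhance_factor=100000.0):
    """counts maps keyterm -> {year: count}; returns keyterm -> {year: score}."""
    # drop short tokens and tokens whose peak yearly count is too low
    kept = {
        term: by_year
        for term, by_year in counts.items()
        if len(term) >= token_min_length
        and max(by_year.values(), default=0) >= token_min_count
    }
    # total number of "valid" token occurrences in each year
    years = {year for by_year in kept.values() for year in by_year}
    totals = {
        year: sum(by_year.get(year, 0) for by_year in kept.values())
        for year in years
    }
    # share of each token in its year, scaled by the common coefficient
    return {
        term: {year: enhance_factor * count / totals[year]
               for year, count in by_year.items()}
        for term, by_year in kept.items()
    }


# Toy counts, made up for illustration only.
toy_counts = {
    'sea bream': {2017: 30, 2018: 10},
    'hay fever': {2017: 10, 2018: 30},
    'ab': {2018: 500},        # too short, dropped
    'rare term': {2018: 2},   # never reaches token_min_count, dropped
}
toy_freq = tokens_frequency(toy_counts)
print(toy_freq)
```

Note that the yearly totals are computed after filtering, so dropped tokens do not contribute to the denominator.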

In [7]:
def get_embeddings(pd_df, emb_length=300):
    # retrieve embedding for each keyterm
    model = fasttext.load_model(model_bin_path)
    pd_df['embeddings'] = pd_df['entities'].apply(model.__getitem__)
    del model

    # expand embeddings into separate columns v0, v1, ... and append them to the data frame
    emb_components = pd.DataFrame(pd_df['embeddings'].tolist(), columns=[f'v{i}' for i in range(emb_length)])
    pd_df['embeddings'] = pd_df['embeddings'].apply(lambda x: list(x))
    pd_df = pd.concat([pd_df[:], emb_components[:]], axis=1) 
    return pd_df

In [8]:
def fit_lsh_model(df, seed=42, bucketLength=12.0, numHashTables=20):
    def make_normed_vector(x):
        # scale the vector to unit L2 norm
        x_np = np.array(x, dtype=np.float64)
        x_np = x_np / np.linalg.norm(x_np)
        return Vectors.dense(x_np)
    
    df = df.select('entities', 'embeddings')

    to_vector = F.udf(lambda x: Vectors.dense(x), VectorUDT())
    to_normed_vector = F.udf(make_normed_vector, VectorUDT())

    df = df.withColumn('normed_embeddings', to_normed_vector('embeddings'))
    df = df.withColumn('embeddings', to_vector('embeddings'))

    # even though the method is designed for Euclidean distances, we can use it for cosine distances, as done here
    # to do so, we normalize input vectors first; then Euclidean distance equals sqrt(2)*sqrt(cosine_distance), and sqrt is a monotone transformation
    brpLSH = BucketedRandomProjectionLSH(inputCol="normed_embeddings", outputCol="hashes", seed=seed, bucketLength=bucketLength, numHashTables=numHashTables)
    brpLSHmodel = brpLSH.fit(df)

    return brpLSHmodel, df
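
The relation used in the comment above follows from ||a - b||^2 = 2 - 2*cos(a, b) for unit vectors a and b. The sketch below checks it numerically on two arbitrary vectors; the helper names are hypothetical and the code uses only the standard library.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Arbitrary example vectors.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Euclidean distance between the normalized vectors ...
lhs = euclidean(l2_normalize(a), l2_normalize(b))
# ... versus sqrt(2) * sqrt(cosine distance) of the original vectors
rhs = math.sqrt(2.0) * math.sqrt(cosine_distance(a, b))

print(lhs, rhs)
```

Under this identity, a Euclidean cutoff of 1 on unit vectors (as used in the demo) corresponds to a cosine similarity above 0.5.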

In [9]:
def norm_np_array(x):
    return Vectors.dense(x/np.linalg.norm(x))

## Execution

### Preprocessing

In [11]:
if preprocess_data:
    spark = SparkSession.builder.\
        master('yarn').\
        appName('scholar').\
        getOrCreate()

    sc = spark.sparkContext
    sqlContext = SQLContext(sc)

    # enables cartesian (!) join
    spark.conf.set("spark.sql.crossJoin.enabled", "true")
    spark.conf.set("spark.sql.broadcastTimeout", "1800")   # ensure that stage 11 does not terminate on timeout 

    df = sqlContext.read.format('parquet').load('hdfs:/scholar_data/base.parquet')
    df_pivoted = pivot_keyterms(df)
    df_pivoted.write.save('hdfs:/scholar_data/tokens_count_by_year.parquet', format='parquet', mode='overwrite')

    df_token_freq = get_tokens_frequency(df_pivoted)
    df_token_freq.write.save('hdfs:/scholar_data/tokens_freq_by_year.parquet', format='parquet', mode='overwrite')

    spark.stop()

In [12]:
if preprocess_data:
    spark = SparkSession.builder.\
        master('local[2]').\
        appName('scholar').\
        getOrCreate()

    sc = spark.sparkContext
    sqlContext = SQLContext(sc)

    df_token_freq = sqlContext.read.format('parquet').load('hdfs:/scholar_data/tokens_freq_by_year.parquet')
    pd_df_entities = df_token_freq.select('entities').toPandas()

    spark.stop()

In [13]:
if preprocess_data:
    pd_df_embeddings = get_embeddings(pd_df_entities)
    pa_df_embeddings = pa.Table.from_pandas(pd_df_embeddings)

    fs = pa.hdfs.connect()
    with fs.open('hdfs:/scholar_data/token_embeddings.parquet', 'wb') as target:
        pq.write_table(pa_df_embeddings, target)

In [14]:
if preprocess_data:
    spark = SparkSession.builder.\
        master('local[2]').\
        appName('scholar').\
        getOrCreate()

    sc = spark.sparkContext
    sqlContext = SQLContext(sc)

    df_embeddings = sqlContext.read.format('parquet').load('hdfs:/scholar_data/token_embeddings.parquet')
    brpLSHmodel, df_normed_embeddings = fit_lsh_model(df_embeddings)
    brpLSHmodel.write().overwrite().save('hdfs:/scholar_model/brpLSH_model')
    df_normed_embeddings.write.save('hdfs:/scholar_data/token_normed_vector_embeddings.parquet', format='parquet', mode='overwrite')

    spark.stop()

### Forecast

In [None]:
# TODO: move forecast step from tsa notebook here and add comparison 

## Demo

Once loaded, you can enter your search queries (a word or several words). To exit gracefully, type 'stop'.

If you want to run the demo after reloading the notebook with the data already preprocessed, set **preprocess_data=False** at the top; that way you skip an unnecessary repetition of the data preprocessing.

In [16]:
spark = SparkSession.builder.\
    master('yarn').\
    appName('scholar').\
    getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext(sc)

print("Loading...", end='')
df_normed_embeddings = sqlContext.read.format('parquet').load('hdfs:/scholar_data/token_normed_vector_embeddings.parquet').select('entities', 'normed_embeddings')
df_token_freq = sqlContext.read.format('parquet').load('hdfs:/scholar_data/tokens_freq_by_year.parquet').select('entities', F.col('2018').alias('score'))
ftmodel = fasttext.load_model(model_bin_path)
brpLSHmodel = BucketedRandomProjectionLSHModel.load('hdfs:/scholar_model/brpLSH_model')
print("Completed! You can start!")

try:
    neighbors_num = 20
    dist_cutoff = 1
    token = ''
    while token != 'stop':
        token = input()
        token = token.lower()
        if token == 'stop':
            break
        token_vector = norm_np_array(ftmodel[token])

        search_result = brpLSHmodel.approxNearestNeighbors(df_normed_embeddings, token_vector, neighbors_num).select('entities', 'distCol').filter(f'distCol < {dist_cutoff}')
        search_result = search_result.join(df_token_freq, 'entities', how='left')
        search_result = search_result.select('entities', *[F.round(F.col(c), 3).alias(c) for c in ['distCol', 'score']])
        search_result.orderBy('score', ascending=False).show(n=neighbors_num, truncate=False)
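except (Exception, KeyboardInterrupt) as err:
    # report the problem, then fall through to the cleanup below
    print(f"Error occurred while executing demo loop ({err!r}), stopping spark session gracefully")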
    
del df_normed_embeddings    
del df_token_freq    
del ftmodel
del brpLSHmodel
    
spark.stop()

Loading...
Completed! You can start!
i have a bream
+----------------------+-------+-----+
|entities              |distCol|score|
+----------------------+-------+-----+
|hay fever             |0.757  |1.346|
|sea bream             |0.59   |1.127|
|tobacco mosaic virus  |0.78   |0.652|
|pancreatic juice      |0.78   |0.474|
|lingual thyroid       |0.77   |0.332|
|grape juice           |0.772  |0.255|
|pituitary thyroid axis|0.781  |0.148|
|vaginal atrophy       |0.764  |0.113|
|orange juice          |0.771  |0.113|
|grapefruit juice      |0.774  |0.101|
|cranberry juice       |0.773  |0.095|
|raspberry juice       |0.749  |0.071|
|pineapple juice       |0.777  |0.065|
|carrot juice          |0.757  |0.059|
|stoma cap             |0.776  |0.047|
|iris atrophy          |0.778  |0.047|
|centriacinar emphysema|0.777  |0.036|
|mango juice           |0.738  |0.036|
|moloney leukemia virus|0.777  |0.024|
|dnajb1 protein        |0.781  |0.024|
+----------------------+-------+-----+

stop
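
The ranking logic of the demo can be summarised as: take the nearest neighbours of the query, drop those beyond the distance cutoff, attach the 2018 score and sort by score. The sketch below reproduces these steps with a brute-force search in place of the LSH model. Everything in it is hypothetical (function names, two-dimensional toy vectors and toy scores), and unlike the notebook it normalizes the stored vectors on the fly.

```python
import math


def unit(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_neighbors(query, embeddings, scores, neighbors_num=20, dist_cutoff=1.0):
    """Brute-force stand-in for the nearest-neighbour search plus score join."""
    q = unit(query)
    dists = [(term, euclidean(q, unit(vec))) for term, vec in embeddings.items()]
    # nearest first, keep at most neighbors_num candidates
    nearest = sorted(dists, key=lambda item: item[1])[:neighbors_num]
    # apply the distance cutoff and attach the score (None if missing)
    kept = [(term, dist, scores.get(term))
            for term, dist in nearest if dist < dist_cutoff]
    # highest score first; terms without a score go last
    return sorted(kept, key=lambda row: (row[2] is None, -(row[2] or 0.0)))


# Toy data, made up for illustration only.
toy_embeddings = {
    'sea bream': [1.0, 0.1],
    'hay fever': [0.8, 0.6],
    'unrelated': [-1.0, 0.0],
}
toy_scores = {'sea bream': 1.0, 'hay fever': 2.0}

for row in rank_neighbors([1.0, 0.0], toy_embeddings, toy_scores):
    print(row)
```

As in the demo output, a term that is further from the query can still be listed first when its score is higher, because the final ordering uses the score and the distance only acts as a filter.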
