<a href="https://colab.research.google.com/github/Justin-Jonany/SciDigest/blob/main/Usage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SciDigest**
As research fields grow, more papers are published every year. Researchers rely on abstracts to decide which sources are relevant to their work, but reading through hundreds of unstructured abstracts is time-consuming and tedious.

**SciDigest** is a deep learning model that aims to help people, especially researchers, digest abstracts more easily. It takes an unstructured abstract as input and turns it into a structured abstract, labeling each sentence as background, objective, methods, results, or conclusions.

**SciDigest** is trained on the [PubMed 200k and 20k RCT dataset](https://github.com/Franck-Dernoncourt/pubmed-rct).

Parts of the model architecture are based on:
* [Paper 1](https://arxiv.org/pdf/1710.06071.pdf)
* [Paper 2](https://arxiv.org/pdf/1612.05251.pdf)

**Note:**
Throughout this notebook and the other notebooks, these papers are referred to as **[Paper 1](https://arxiv.org/pdf/1710.06071.pdf)** and **[Paper 2](https://arxiv.org/pdf/1612.05251.pdf)**.

## Goal
The goals of this project are to:
1. Replicate the model architecture in **Paper 2**
2. Beat the F1-score of the model in **Paper 1**, which is **91.6**


## Notebook Goal
This notebook aims to:
1. Demonstrate how to use the model

## How (step-by-step):
1. Get an example abstract.
2. Pass it to the `data_preprocess` function, which returns the **Formatted Data**.
3. Call the `strucurizer` function with the model and the formatted data.
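
The formatted data from step 2 can be pictured with a small, TensorFlow-free sketch. It assumes the abstract has already been split into sentences (the notebook uses `nltk` for that) and uses plain Python lists instead of tensors. The helper names `one_hot_row` and `sketch_features` and the two sample sentences are hypothetical; the depths 15 and 20 are the defaults set in `data_preprocess`.

```python
import re

def one_hot_row(index, depth):
    """Hypothetical stand-in for tf.one_hot applied to a single index.

    Indices outside [0, depth) give an all-zero row, as tf.one_hot does.
    """
    return [1.0 if i == index else 0.0 for i in range(depth)]

def sketch_features(sentences, ln_depth=15, tnl_depth=20):
    """Builds plain-list versions of the model inputs for pre-split sentences."""
    # Mask every digit with '@', as data_preprocess does
    masked = [re.sub(r"\d", "@", s) for s in sentences]
    total = len(masked)
    # One row per sentence: its position in the abstract
    line_numbers = [one_hot_row(i, ln_depth) for i in range(total)]
    # One row per sentence: the total number of sentences, repeated
    total_lines = [one_hot_row(total, tnl_depth) for _ in range(total)]
    # Characters of each masked sentence separated by spaces
    chars = [" ".join(s) for s in masked]
    return line_numbers, total_lines, masked, chars

# Made-up sentences, for illustration only
sentences = ["We enrolled 87 patients.", "Pain scores fell by 30%."]
line_numbers, total_lines, masked, chars = sketch_features(sentences)

print(masked)               # ['We enrolled @@ patients.', 'Pain scores fell by @@%.']
print(line_numbers[1][:3])  # [0.0, 1.0, 0.0]
print(total_lines[0][:4])   # [0.0, 0.0, 1.0, 0.0]
```

One consequence of these defaults: a sentence at index 15 or later gets an all-zero line-number row, and an abstract with 20 or more sentences gets all-zero total-lines rows.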

## All In One Go
Call `preprocess_and_strucurizer` to pass in a paragraph and get the structured abstract in a single step.

## Libraries

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import nltk

import os
import re

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Helper Function

In [None]:
def split_chars(text):
  '''
  Adds a space (' ') in between every character in text
  '''
  return " ".join(list(text))
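
A quick check of `split_chars` on a short, made-up string. The function is repeated here so the snippet runs on its own. Note that an existing space becomes three consecutive spaces in the output, because it is treated as a character like any other.

```python
def split_chars(text):
    """Same one-liner as above: joins the characters of text with spaces."""
    return " ".join(list(text))

print(split_chars("RCT @@"))  # R C T   @ @
```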

## Preprocessing and Structurizer Functions

In [None]:
def data_preprocess(paragraph, verbose=0):
  """
  Preprocesses the paragraph into the correct format for the model

  Args:
    paragraph: the abstract in form of a string
    verbose: 0 for no information, 1 for information while preprocessing

  Returns:
    A list of data with 5 items in the following order:
      1. The one-hot encoded line number of each sentence in the abstract
      2. The one-hot encoded total number of lines, repeated once per sentence
      3. A tensor of every sentence in the abstract (digits replaced with '@')
      4. A tensor of every sentence with its characters separated by spaces
      5. A list of every sentence without any preprocessing
  """

  # From the paragraph we need to get
  # 1. A list of line number of each sentence in the abstract one hot encoded
  # 2. A list with the same object, the total number of lines, as long as the number
  #    of sentences
  # 3. A list of every sentence in the abstract
  # 4. A list of list of characters of each sentences
  # 5. A list of every sentence without any preprocessing


  # 5. A list of every sentence without any preprocessing
  list_sentences_original = nltk.tokenize.sent_tokenize(paragraph)

  # We now need to replace every number in the sentence with '@'
  paragraph = re.sub(r'\d', '@', paragraph)

  # 3. A list of every sentences in the abstract
  # Using the nltk library to effectively separate the sentences
  list_sentences = nltk.tokenize.sent_tokenize(paragraph)

  # 1. A list of line number of each sentence in the abstract one hot encoded
  NUM_CATEGORIES_LN = 15 # set to 15 by default based on the training data
  list_ln = [i for i in range(len(list_sentences))]
  list_ln = tf.one_hot(list_ln, depth=NUM_CATEGORIES_LN)


  # 2. A list with the same object, the total number of lines, as long as the number
  #    of sentences
  NUM_CATEGORIES_TNL = 20 # set to 20 by default based on the training data
  list_total_lines = [len(list_sentences)] * len(list_sentences)
  list_total_lines = tf.one_hot(list_total_lines, depth=NUM_CATEGORIES_TNL)


  # 4. A list of list of characters of each sentences
  list_char_sentences = [split_chars(sentence) for sentence in list_sentences]

  if verbose == 1:
    print('=========================')
    print('\nlist of sentences:')
    for sentence in list_sentences:
      print(sentence)
    print('\nlist of one hot encoded line numbers:')
    print(list_ln)
    print('\nlist of one hot encoded total number of lines:')
    print(list_total_lines)
    print('\nlist of sentences with characters separated:')
    for char_sentence in list_char_sentences:
      print(char_sentence)
    print('=========================')

  return [list_ln, list_total_lines, tf.constant(list_sentences), tf.constant(list_char_sentences),
          list_sentences_original]

def strucurizer(model, data):
  """
  Turns data into a structured abstract

  Args:
    model: A machine learning model
    data: list of data with 5 items in the following order:
      1. The one-hot encoded line number of each sentence in the abstract
      2. The one-hot encoded total number of lines, repeated once per sentence
      3. A tensor of every sentence in the abstract (digits replaced with '@')
      4. A tensor of every sentence with its characters separated by spaces
      5. A list of every sentence without any preprocessing

  Returns:
    A dataframe in which each run of consecutive sentences sharing a predicted label is
    combined into one row, labeled as background, objective, methods, results, or conclusions.
  """
  class_names = ['BACKGROUND', 'CONCLUSIONS', 'METHODS', 'OBJECTIVE', 'RESULTS', ]

  # predict
  pred_probs = model.predict(x=tuple(data[:-1]), verbose=0)
  preds = [class_names[i] for i in tf.argmax(pred_probs, axis=1)]

  # format data and prediction
  list_sentences = [sentence.numpy().decode('utf-8') for sentence in data[2]]
  structured_abstract = pd.DataFrame({'Sentences': list_sentences, 'Target': preds})

  # Print the structured abstract and merge consecutive sentences that share a label
  structured_abstract_combined = pd.DataFrame({'Sentences': [], 'Target': []})
  original_sentences = data[-1]
  i = 0
  print('')
  print('Structured Abstract')
  print('')
  while i < len(list_sentences):
    print(preds[i])
    # Start the group from the original (unmasked) sentence
    temporary_sentence = original_sentences[i]
    j = i
    print(original_sentences[j])
    # Extend the group while the next sentence has the same predicted label
    while (j < len(list_sentences) - 1) and (preds[j] == preds[j+1]):
      print(original_sentences[j+1])
      # Separate merged sentences with a space
      temporary_sentence += ' ' + original_sentences[j+1]
      j += 1
    i = j + 1
    print('\n')
    structured_abstract_combined.loc[len(structured_abstract_combined)] = [temporary_sentence, preds[j]]

  return structured_abstract_combined

def preprocess_and_strucurizer(paragraph, model, verbose=0):
  '''
  Preprocesses the paragraph (the abstract) and structures it.

  Args:
    paragraph: the abstract in form of a string
    model: A machine learning model
    verbose: 0 for no information, 1 for information while preprocessing

  Returns:
    A dataframe in which each run of consecutive sentences sharing a predicted label is
    combined into one row, labeled as background, objective, methods, results, or conclusions.
  '''
  data = data_preprocess(paragraph, verbose)
  structured_abstract = strucurizer(model, data)
  return structured_abstract
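
The grouping step inside `strucurizer` merges consecutive sentences that share a predicted label. The same idea can be sketched without a model using `itertools.groupby`. This is an illustrative alternative, not the notebook's implementation; `group_by_label` and the sample sentences and labels are hypothetical.

```python
from itertools import groupby

def group_by_label(sentences, labels):
    """Merges runs of adjacent sentences that share a label.

    Returns a list of (label, combined_text) tuples in abstract order.
    """
    groups = []
    pairs = zip(labels, sentences)
    # groupby only merges adjacent items, so a label that reappears
    # later in the abstract starts a new group
    for label, run in groupby(pairs, key=lambda pair: pair[0]):
        groups.append((label, " ".join(sentence for _, sentence in run)))
    return groups

# Made-up sentences and labels, for illustration only
sentences = ["Aim stated.", "Patients randomized.", "Dose was fixed.", "Pain fell."]
labels = ["OBJECTIVE", "METHODS", "METHODS", "RESULTS"]

for label, text in group_by_label(sentences, labels):
    print(label, "->", text)
```

Because only adjacent sentences are merged, the same label can appear in more than one row, as it does for `RESULTS` in Example 1 below.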


## Loading


In [None]:
PATH = '/content/drive/MyDrive/ColabNotebooks/projects/SciDigest/'

In [None]:
# best_model = tf.keras.models.load_model(PATH + 'best_model')
# best_model.save(PATH + 'best_model.keras')

In [None]:
best_model = tf.keras.models.load_model(PATH + 'best_model.keras')



## Usage

**Example 1**

Source: https://iopscience.iop.org/article/10.1088/1751-8121/ac4b13

We classify four qubit states under SLOCC operations, that is, we classify the orbits of the group on the Hilbert space . We approach the classification by realising this representation as a symmetric space of maximal rank. We first describe general methods for classifying the orbits of such a space. We then apply these methods to obtain the orbits in our special case, resulting in a complete and irredundant classification of -orbits on . It follows that an element of is conjugate to an element of precisely 87 classes of elements. Each of these classes either consists of one element or of a parameterised family of elements, and the elements in the same class all have equal stabiliser in . We also present a complete and irredundant classification of elements and stabilisers up to the action of where Sym4 permutes the four tensor factors of .

In [None]:
example_abstract = 'We classify four qubit states under SLOCC operations, that is, we classify the orbits of the group on the Hilbert space . We approach the classification by realising this representation as a symmetric space of maximal rank. We first describe general methods for classifying the orbits of such a space. We then apply these methods to obtain the orbits in our special case, resulting in a complete and irredundant classification of -orbits on . It follows that an element of is conjugate to an element of precisely 87 classes of elements. Each of these classes either consists of one element or of a parameterised family of elements, and the elements in the same class all have equal stabiliser in . We also present a complete and irredundant classification of elements and stabilisers up to the action of where Sym4 permutes the four tensor factors of .'

In [None]:
df_example_abstract = preprocess_and_strucurizer(example_abstract, best_model, 1)


list of sentences:
We classify four qubit states under SLOCC operations, that is, we classify the orbits of the group on the Hilbert space .
We approach the classification by realising this representation as a symmetric space of maximal rank.
We first describe general methods for classifying the orbits of such a space.
We then apply these methods to obtain the orbits in our special case, resulting in a complete and irredundant classification of -orbits on .
It follows that an element of is conjugate to an element of precisely @@ classes of elements.
Each of these classes either consists of one element or of a parameterised family of elements, and the elements in the same class all have equal stabiliser in .
We also present a complete and irredundant classification of elements and stabilisers up to the action of where Sym@ permutes the four tensor factors of .

list of one hot encoded line numbers:
tf.Tensor(
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0

In [None]:
df_example_abstract

|   | Sentences | Target |
|---|-----------|--------|
| 0 | We classify four qubit states under SLOCC oper... | BACKGROUND |
| 1 | It follows that an element of is conjugate to ... | RESULTS |
| 2 | Each of these classes either consists of one e... | METHODS |
| 3 | We also present a complete and irredundant cla... | RESULTS |


**Example 2**

Source: https://pubmed.ncbi.nlm.nih.gov/20232240/

This RCT examined the efficacy of a manualized social intervention for children with HFASDs. Participants were randomly assigned to treatment or wait-list conditions. Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language. A response-cost program was applied to reduce problem behaviors and foster skills acquisition. Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures). Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents. High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity. Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.



In [None]:
example_abstract_2 = 'This RCT examined the efficacy of a manualized social intervention for children with HFASDs. Participants were randomly assigned to treatment or wait-list conditions. Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language. A response-cost program was applied to reduce problem behaviors and foster skills acquisition. Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures). Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents. High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity. Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.'

In [None]:
%%time
df_example_abstract_2 = preprocess_and_strucurizer(example_abstract_2, best_model, 0)


Structured Abstract

OBJECTIVE
This RCT examined the efficacy of a manualized social intervention for children with HFASDs.


METHODS
Participants were randomly assigned to treatment or wait-list conditions.
Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language.
A response-cost program was applied to reduce problem behaviors and foster skills acquisition.


RESULTS
Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures).
Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents.
High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity.
Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.


CPU times: user 319 ms, sys: 4.33 ms, total: 323 ms

In [None]:
df_example_abstract_2

|   | Sentences | Target |
|---|-----------|--------|
| 0 | This RCT examined the efficacy of a manualized... | OBJECTIVE |
| 1 | Participants were randomly assigned to treatme... | METHODS |
| 2 | Significant treatment effects were found for f... | RESULTS |


**Example 3**

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6078146/

The aim of this paper is to map the scientific landscape related to cancer research worldwide between 2012 and 2017. We use scientific publication data from Web of Science Core Collection and combine bibliometrics and social network analysis techniques to identify the most relevant journals, research areas, countries and research organizations in cancer scientific landscape. The results show: Oncotarget as the journal with most publications; a significant increase in China’s publications, reaching United States’ publications in 2017; MD Cancer Center, University of California and Harvard University as organizations with most publications; cell biology as the most frequent research area; breast, lung and colorectal cancer as the most frequent keywords; high density of co-authorship between organizations in the West, especially in the US, and low density between organizations in Asian and lower and medium income countries. Our findings can be used to guide a global knowledge platform guiding policy, planning and funding decisions as well as to establish new institutional collaborations.

In [None]:
example_abstract_3 = "The aim of this paper is to map the scientific landscape related to cancer research worldwide between 2012 and 2017. We use scientific publication data from Web of Science Core Collection and combine bibliometrics and social network analysis techniques to identify the most relevant journals, research areas, countries and research organizations in cancer scientific landscape. The results show: Oncotarget as the journal with most publications; a significant increase in China’s publications, reaching United States’ publications in 2017; MD Cancer Center, University of California and Harvard University as organizations with most publications; cell biology as the most frequent research area; breast, lung and colorectal cancer as the most frequent keywords; high density of co-authorship between organizations in the West, especially in the US, and low density between organizations in Asian and lower and medium income countries. Our findings can be used to guide a global knowledge platform guiding policy, planning and funding decisions as well as to establish new institutional collaborations."

In [None]:
%%time
df_example_abstract_3 = preprocess_and_strucurizer(example_abstract_3, best_model, 0)


Structured Abstract

BACKGROUND
The aim of this paper is to map the scientific landscape related to cancer research worldwide between 2012 and 2017.
We use scientific publication data from Web of Science Core Collection and combine bibliometrics and social network analysis techniques to identify the most relevant journals, research areas, countries and research organizations in cancer scientific landscape.


RESULTS
The results show: Oncotarget as the journal with most publications; a significant increase in China’s publications, reaching United States’ publications in 2017; MD Cancer Center, University of California and Harvard University as organizations with most publications; cell biology as the most frequent research area; breast, lung and colorectal cancer as the most frequent keywords; high density of co-authorship between organizations in the West, especially in the US, and low density between organizations in Asian and lower and medium income countries.


CONCLUSIONS
Our find

In [None]:
df_example_abstract_3

|   | Sentences | Target |
|---|-----------|--------|
| 0 | The aim of this paper is to map the scientific... | BACKGROUND |
| 1 | The results show: Oncotarget as the journal wi... | RESULTS |
| 2 | Our findings can be used to guide a global kno... | CONCLUSIONS |
