# **Problem Statement**
Given a research paper in PDF (use this paper: https://aclanthology.org/P19-1106/), how would
you find the “contributing statements” of the paper? For the definition of a contributing statement
and the associated task, please refer to https://ncg-task.github.io
1. Train a model to extract the contributing statements from a paper. Combine the
contributing statements smartly to form a paper summary.
Read about contributing statements from research papers here:
https://ceur-ws.org/Vol-2658/paper2.pdf
The dataset for training/fine-tuning and testing your model is here:
https://zenodo.org/record/4737071#.Y7UiQy0RpQI
2. Evaluate your summary against the abstract of the paper (use this paper:
https://aclanthology.org/P19-1106/) using ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and BARTScore, taking the abstract of the paper as the reference summary.
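
To make the ROUGE-N metrics named above concrete, here is a minimal sketch of how ROUGE-1 and ROUGE-2 can be computed from clipped n-gram overlap. It assumes plain whitespace tokenization and lower-casing, and the helper names `ngrams` and `rouge_n` are illustrative only; an established ROUGE package may apply stemming and other normalization and so report different numbers.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N precision, recall and F1 from whitespace tokens."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # which is the "clipped" overlap used by ROUGE.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy strings, not taken from the paper
scores = rouge_n("the model extracts contribution sentences",
                 "the model finds contribution sentences", n=1)
print(scores)
```

Here the generated summary would be passed as `candidate` and the paper abstract as `reference`.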

# **Dataset**

NLPContributionGraph was first introduced as Task 11 at SemEval-2021. The task is defined on a dataset of Natural Language Processing (NLP) scholarly articles whose contributions are structured so that they can be integrated into Knowledge Graph infrastructures such as the Open Research Knowledge Graph.

# **Proposed Solution**

For every research paper in the training dataset, the raw text has been extracted from the PDF using [Grobid](https://github.com/kermitt2/grobid) and passed to [Stanza](https://github.com/stanfordnlp/stanza), which produces preprocessed plain text in a text file. The contribution sentences of each paper have been annotated and stored in a separate text file.


Our task is to build a model that classifies the contribution sentences of a paper and to generate a summary from those sentences.
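
One simple way to turn classified sentences into a summary is sketched below. This is an assumption about how the combination step could work, not the method used later in this notebook: `build_summary`, `threshold` and `max_sentences` are hypothetical names, and the probabilities are assumed to be the classifier's confidence that each sentence is a contribution.

```python
def build_summary(sentences, probabilities, threshold=0.5, max_sentences=5):
    """Keep the most confident contribution sentences, in document order."""
    # Candidate sentences are those the classifier scores at or above the threshold
    scored = [(p, i) for i, p in enumerate(probabilities) if p >= threshold]
    # Take the highest-scoring candidates, then restore their original order
    top = sorted(scored, reverse=True)[:max_sentences]
    chosen = sorted(i for _, i in top)
    return " ".join(sentences[i].strip() for i in chosen)


# Toy input, not taken from the dataset
toy_sentences = ["we propose a new model .", "related work is vast .",
                 "we release our code .", "our model beats the baseline ."]
toy_probs = [0.90, 0.20, 0.60, 0.95]
print(build_summary(toy_sentences, toy_probs, max_sentences=2))
```

Restoring document order keeps the summary readable, since the selected sentences then follow the paper's own flow.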


**Required Dataset**

├── [articlename].pdf                      # scholarly article PDF
│
├── [articlename]-Grobid-out.txt           # plaintext output from the [Grobid parser](https://github.com/kermitt2/grobid)
│
├── [articlename]-Stanza-out.txt           # plaintext preprocessed output from [Stanza](https://github.com/stanfordnlp/stanza)
│
└── sentences.txt                          # line numbers of the annotated contribution sentences
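
The sketch below illustrates how the two text files relate, as the loading code in this notebook reads them: `sentences.txt` holds one 1-based line number per line, each pointing at a line of the Stanza output. The function name `label_sentences` and the toy strings are illustrative assumptions, not part of the dataset.

```python
def label_sentences(stanza_text, sentences_txt):
    """Pair every line of a Stanza-out file with a 0/1 contribution label."""
    # sentences.txt: one 1-based line number per line
    contribution_lines = {int(token) for token in sentences_txt.split()}
    return [(line, 1 if number in contribution_lines else 0)
            for number, line in enumerate(stanza_text.splitlines(), start=1)]


# Toy file contents, not taken from the dataset
toy_stanza = "title of the paper\nwe propose a new model .\nrelated work .\n"
toy_sentences_txt = "2\n"
for sentence, label in label_sentences(toy_stanza, toy_sentences_txt):
    print(label, sentence)
```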

### **Download and unzip the Training dataset from [SemEval-2021 Task 11: NLPContributionGraph](https://zenodo.org/record/4737071#.ZALbCtJBw3F)**

In [None]:
!wget https://zenodo.org/record/4737071/files/training-set.zip

--2023-03-04 09:21:31--  https://zenodo.org/record/4737071/files/training-set.zip
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 159144523 (152M) [application/octet-stream]
Saving to: ‘training-set.zip’


2023-03-04 09:23:14 (1.50 MB/s) - ‘training-set.zip’ saved [159144523/159144523]



In [None]:
!unzip training-set.zip

Archive:  training-set.zip
   creating: training-set/
 extracting: training-set/desktop.ini  
   creating: training-set/natural_language_inference/
   creating: training-set/natural_language_inference/0/
  inflating: training-set/natural_language_inference/0/1606.01549v3-Grobid-out.txt  
  inflating: training-set/natural_language_inference/0/1606.01549v3-Stanza-out.txt  
  inflating: training-set/natural_language_inference/0/1606.01549v3.pdf  
  inflating: training-set/natural_language_inference/0/entities.txt  
   creating: training-set/natural_language_inference/0/info-units/
  inflating: training-set/natural_language_inference/0/info-units/ablation-analysis.json  
  inflating: training-set/natural_language_inference/0/info-units/code.json  
  inflating: training-set/natural_language_inference/0/info-units/model.json  
  inflating: training-set/natural_language_inference/0/info-units/research-problem.json  
  inflating: training-set/natural_language_inference/0/info-units/results.jso

In [None]:
!rm training-set/desktop.ini

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1


In [None]:
from transformers import logging
logging.set_verbosity_info()

In [None]:
import os
import glob
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

In [None]:
EPOCHS = 10
BATCH_SIZE = 8

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## **Dataset Preparation**

In [None]:
def load_articles(data_dir):
    """Read every article under data_dir/<category>/<index>/.

    Returns the lower-cased Stanza text of each article and, per article,
    the list of 1-based line numbers annotated as contribution sentences.
    """
    articles = []
    contributions = []

    for category in sorted(os.listdir(data_dir)):
        article_category = os.path.join(data_dir, category)
        # Skip stray files such as README.md or desktop.ini, and the .git folder
        if category == '.git' or not os.path.isdir(article_category):
            continue

        for foldname in sorted(os.listdir(article_category)):
            article_index = os.path.join(article_category, foldname)

            stanza_path = glob.glob(os.path.join(article_index, '*-Stanza-out.txt'))[0]
            with open(stanza_path, encoding='utf-8') as f:
                articles.append(f.read().lower())

            with open(os.path.join(article_index, 'sentences.txt'), encoding='utf-8') as f:
                # One line number per line; blank lines are ignored
                contributions.append([int(line) for line in f if line.strip()])
    return articles, contributions

def article2sentence_and_labels(articles, contributions):
    """Split each article into sentences (one per line) and label a sentence 1
    if its 1-based line number is annotated as a contribution, else 0."""
    sentences = []
    labels = []
    for article, contribution in zip(articles, contributions):
        contribution = set(contribution)  # constant-time membership tests

        # Drop the last element of the split (the text after the final newline)
        sents = article.split('\n')[0:-1]
        for row, sent in enumerate(sents, start=1):
            sentences.append(sent)
            labels.append(1 if row in contribution else 0)
    return sentences, labels

In [None]:
train_data_dir = 'training-set/'
train_articles, train_contributions = load_articles(train_data_dir)
train_sentences, train_labels = article2sentence_and_labels(train_articles, train_contributions)
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_sentences, train_labels, test_size=.2)
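
Before training, it can help to check how the two classes are distributed, because the accuracy figures reported later are only meaningful relative to the accuracy of always predicting the most frequent class. The sketch below assumes nothing about the actual class balance of this dataset; `label_distribution` and `majority_baseline_accuracy` are hypothetical helpers and the labels are toy values.

```python
from collections import Counter


def label_distribution(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        return 0.0
    return max(Counter(labels).values()) / len(labels)


toy_labels = [0, 0, 0, 0, 1]  # hypothetical labels, not from the dataset
print(label_distribution(toy_labels))
print(majority_baseline_accuracy(toy_labels))
```

In the notebook, the same functions could be called on `train_labels` and `val_labels` to see whether the random split preserved the class ratio.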

# **Pretrained Model**

### **SCIBERT**

SciBERT is the pretrained model presented in SciBERT: A Pretrained Language Model for Scientific Text. It is a BERT model trained on scientific text.

Its training corpus consists of papers taken from Semantic Scholar: 1.14M papers, 3.1B tokens. The full text of the papers was used in training, not just the abstracts.

Because the model was pretrained specifically on scientific papers, it is a natural choice for classifying sentences from research articles.

In [None]:
# BertTokenizer is already imported above; load the SciBERT uncased vocabulary
tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')

# Load the pre-trained model with a classification head on top
model = BertForSequenceClassification.from_pretrained('allenai/scibert_scivocab_uncased', num_labels=2)

model.to(device)


loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--allenai--scibert_scivocab_uncased/snapshots/24f92d32b1bfb0bcaf9ab193ff3ad01e87732fc1/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None



loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--allenai--scibert_scivocab_uncased/snapshots/24f92d32b1bfb0bcaf9ab193ff3ad01e87732fc1/config.json
Model config BertConfig {
  "_name_or_path": "allenai/scibert_scivocab_uncased",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 31090
}

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--allenai--scibert_scivocab_uncased/snapshots/24f92d32b1bfb0bcaf9ab193ff3ad01e87732fc1/pytorch_model.bin
Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of 

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31090, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

### **Tokenize the Train and validation sentences**

In [None]:
# max_length is passed explicitly: with truncation=True alone, this tokenizer warns that
# no maximum length is defined and falls back to no truncation. 512 is the model's
# max_position_embeddings.
train_encodings = tokenizer(train_sentences, truncation=True, padding=True, max_length=512, return_tensors='pt')
val_encodings = tokenizer(val_sentences, truncation=True, padding=True, max_length=512, return_tensors='pt')


In [None]:
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], torch.tensor(train_labels))
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)

val_dataset = TensorDataset(val_encodings['input_ids'], val_encodings['attention_mask'], torch.tensor(val_labels))
val_sampler = SequentialSampler(val_dataset)
val_dataloader = DataLoader(val_dataset, sampler=val_sampler, batch_size=BATCH_SIZE)

In [None]:
print("Training Dataset :", len(train_dataset))
print("Validation Dataset:",len(val_dataset))

Training Dataset : 44160
Validation Dataset: 11041


In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, eps=1e-8)
loss_fn = nn.CrossEntropyLoss()

In [17]:
for epoch in range(EPOCHS):
    model.train()
    train_loss, train_acc = 0, 0
    for batch in train_dataloader:
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}
        optimizer.zero_grad()
        outputs = model(**inputs)
        loss = loss_fn(outputs.logits, inputs['labels'])
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        train_acc += (outputs.logits.argmax(dim=1) == inputs['labels']).float().mean().item()

    train_loss /= len(train_dataloader)
    train_acc /= len(train_dataloader)

    model.eval()
    val_loss, val_acc = 0, 0
    with torch.no_grad():
        for batch in val_dataloader:
            batch = tuple(t.to(device) for t in batch)
            inputs = {'input_ids': batch[0],
                      'attention_mask': batch[1],
                      'labels': batch[2]}
            outputs = model(**inputs)
            loss = loss_fn(outputs.logits, inputs['labels'])
            val_loss += loss.item()
            val_acc += (outputs.logits.argmax(dim=1) == inputs['labels']).float().mean().item()

    val_loss /= len(val_dataloader)
    val_acc /= len(val_dataloader)

    

    print(f'Epoch {epoch + 1}/{EPOCHS}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.3f}, Val Loss: {val_loss:.3f}, Val Acc: {val_acc:.3f}')

Epoch 1/10, Train Loss: 0.233, Train Acc: 0.913, Val Loss: 0.290, Val Acc: 0.871
Epoch 2/10, Train Loss: 0.189, Train Acc: 0.927, Val Loss: 0.284, Val Acc: 0.878
Epoch 3/10, Train Loss: 0.148, Train Acc: 0.941, Val Loss: 0.279, Val Acc: 0.893
Epoch 4/10, Train Loss: 0.107, Train Acc: 0.958, Val Loss: 0.267, Val Acc: 0.899
Epoch 5/10, Train Loss: 0.078, Train Acc: 0.970, Val Loss: 0.247, Val Acc: 0.908
Epoch 6/10, Train Loss: 0.058, Train Acc: 0.979, Val Loss: 0.234, Val Acc: 0.898
Epoch 7/10, Train Loss: 0.047, Train Acc: 0.984, Val Loss: 0.223, Val Acc: 0.910
Epoch 8/10, Train Loss: 0.042, Train Acc: 0.985, Val Loss: 0.229, Val Acc: 0.905
Epoch 9/10, Train Loss: 0.035, Train Acc: 0.988, Val Loss: 0.210, Val Acc: 0.910
Epoch 10/10, Train Loss: 0.034, Train Acc: 0.989, Val Loss: 0.207, Val Acc: 0.916


## **Model Evaluation**

Model evaluation using the test dataset provided by [SemEval-2021 Task 11: NLPContributionGraph](https://zenodo.org/record/4737071#.ZAMIttJBw3H)

In [18]:
!wget https://zenodo.org/record/4737071/files/test-set.zip
!unzip test-set.zip

--2023-03-04 11:58:48--  https://zenodo.org/record/4737071/files/test-set.zip
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 225826758 (215M) [application/octet-stream]
Saving to: ‘test-set.zip’


2023-03-04 12:01:26 (1.38 MB/s) - ‘test-set.zip’ saved [225826758/225826758]

Archive:  test-set.zip
   creating: test-set/
   creating: test-set/constituency_parsing/
   creating: test-set/constituency_parsing/0/
  inflating: test-set/constituency_parsing/0/1602.07776v4-Grobid-out.txt  
  inflating: test-set/constituency_parsing/0/1602.07776v4-Stanza-out.txt  
  inflating: test-set/constituency_parsing/0/1602.07776v4.pdf  
  inflating: test-set/constituency_parsing/0/entities.txt  
   creating: test-set/constituency_parsing/0/info-units/
  inflating: test-set/constituency_parsing/0/info-units/hyperparameters.json  
  inflating: test-set/constituency_parsing/0/in

In [None]:
test_data_dir = '/content/test-set'
test_articles, test_contributions = load_articles(test_data_dir)
test_sentences, test_labels = article2sentence_and_labels(test_articles, test_contributions)

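# Pass max_length explicitly so that truncation=True actually truncates (512 = max_position_embeddings)
test_encodings = tokenizer(test_sentences, truncation=True, padding=True, max_length=512, return_tensors='pt')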
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], torch.tensor(test_labels))
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=BATCH_SIZE)

In [20]:
model.eval()
test_loss, test_acc = 0, 0
with torch.no_grad():
    for batch in test_dataloader:
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}
        outputs = model(**inputs)
        loss = loss_fn(outputs.logits, inputs['labels'])
        test_loss += loss.item()
        test_acc += (outputs.logits.argmax(dim=1) == inputs['labels']).float().mean().item()

test_loss /= len(test_dataloader)
test_acc /= len(test_dataloader)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc:.3f}')

Test Loss: 0.469, Test Acc: 0.898
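
Accuracy alone can be hard to interpret if contribution sentences turn out to be the minority class (an assumption about this dataset, not something measured above). A minimal sketch of precision, recall and F1 for the positive class follows; `binary_prf` is a hypothetical helper and the label lists are toy values, not model outputs.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class of a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


toy_true = [1, 0, 1, 1, 0]  # hypothetical gold labels
toy_pred = [1, 1, 0, 1, 0]  # hypothetical predictions
print(binary_prf(toy_true, toy_pred))
```

To apply this to the test set, the per-batch `outputs.logits.argmax(dim=1)` values and the gold labels would first need to be collected into two flat Python lists.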


### **Save the Trained model weights**

In [None]:
!mkdir my_model

In [None]:
model.save_pretrained('my_model')
tokenizer.save_pretrained('my_model')

### **Push the model to the Hugging Face Hub**

Model Instance URL - https://huggingface.co/GouthamVicky/ContributionSentClassification-bert

In [25]:
# Set up the Hugging Face API token (read it at runtime; never hard-code a real token)
from getpass import getpass
os.environ['HUGGINGFACE_TOKEN'] = getpass('Hugging Face token: ')

In [24]:
# Push the model and tokenizer to the model hub
model_name = 'Goutham-Vignesh/ContributionSentClassification-scibert'
model.push_to_hub(model_name, use_auth_token=os.getenv('HUGGINGFACE_TOKEN'))
tokenizer.push_to_hub(model_name, use_auth_token=os.getenv('HUGGINGFACE_TOKEN'))

Configuration saved in /tmp/tmp6e0cvyot/config.json
Model weights saved in /tmp/tmp6e0cvyot/pytorch_model.bin
Uploading the following files to Goutham-Vignesh/ContributionSentClassification-scibert: pytorch_model.bin,config.json


tokenizer config file saved in /tmp/tmpmni0s0tq/tokenizer_config.json
Special tokens file saved in /tmp/tmpmni0s0tq/special_tokens_map.json
Uploading the following files to Goutham-Vignesh/ContributionSentClassification-scibert: vocab.txt,tokenizer_config.json,special_tokens_map.json


CommitInfo(commit_url='https://huggingface.co/Goutham-Vignesh/ContributionSentClassification-scibert/commit/cb78e8a79cb9844e62604285f6c5d4bf7a79b8fa', commit_message='Upload tokenizer', commit_description='', oid='cb78e8a79cb9844e62604285f6c5d4bf7a79b8fa', pr_url=None, pr_revision=None, pr_num=None)