In [None]:
!pip install arxiv transformers datasets torch sentencepiece rouge-score nltk spacy fastapi uvicorn

Collecting arxiv
  Downloading arxiv-2.2.0-py3-none-any.whl.metadata (6.3 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting feedparser~=6.0.10 (from arxiv)
  Downloading feedparser-6.0.12-py3-none-any.whl.metadata (2.7 kB)
Collecting sgmllib3k (from feedparser~=6.0.10->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Downloading arxiv-2.2.0-py3-none-any.whl (11 kB)
Downloading feedparser-6.0.12-py3-none-any.whl (81 kB)
Building wheels for collected packages: rouge-score, sgmllib3k
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=a20d3da003795975b325108ed24daac27a2a645dc616dad4bee2968b84ead9b8
  Stored in director

In [None]:
# Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Dataset loading: load_dataset comes from the Hugging Face "datasets" library,
# which provides many preprocessed datasets for ML and NLP tasks
from datasets import load_dataset

# Data Processing
import pandas as pd
import numpy as np

# NLP utilities
import nltk
import spacy

# For Evaluation
from rouge_score import rouge_scorer

# Download and load the arXiv summarization dataset and store the result in the variable "dataset"
# 1st param is the dataset repository on the Hugging Face Hub
# 2nd param is the configuration name: "section" selects one of the configurations this repository publishes
dataset = load_dataset("ccdv/arxiv-summarization", "section")
# See which splits exist
# print(dataset.keys())  # e.g., dict_keys(['train', 'validation', 'test'])

# See the columns in the train split
# print(dataset['train'].column_names)  # e.g., ['article', 'abstract']


train_texts = dataset['train']['article']   # Training split, 'article' column (the paper body)
train_summs = dataset['train']['abstract']  # Training split, 'abstract' column (the reference summary)

# dataset now holds all the data in a structured format. Nothing is being trained. It is ready for later use.


In machine learning and NLP, a dataset is divided into splits so that the model learns from one part and is evaluated fairly on data it has not seen.

One portion of the dataset, the training split, is what the model learns from.

Another portion, say 10% of the papers, is held out for validation. The model is never trained on this portion; it is used during training to monitor how well the model generalizes.

With the keys "train", "validation", and "test", we access the different slices of the dataset, each of which serves a different purpose.

- "train" = the model learns
- "validation" = monitor the learning
- "test" = final evaluation

As an analogy: the train split is the lessons you study, the validation split is the quizzes you take along the way to see if you are learning, and the test split is the final exam.
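
To make the idea of splits concrete, here is a minimal sketch of cutting a collection into three parts. It is not how the arXiv dataset was split (that dataset ships with its splits already made). The function name `three_way_split`, the fractions, and the `paper_i` stand-in documents are all assumptions chosen for illustration.

```python
import random

def three_way_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }

papers = [f"paper_{i}" for i in range(20)]  # hypothetical stand-in documents
splits = three_way_split(papers)
print({name: len(part) for name, part in splits.items()})
# {'train': 16, 'validation': 2, 'test': 2}
```

Every item lands in exactly one split, which is what keeps the validation and test scores honest: the model is never graded on text it was trained on.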

In [None]:
# After loading the dataset, the next steps are about exploring and validating the data,
#  making sure it's in the format we expect...
print(dataset.keys())


dict_keys(['train', 'validation', 'test'])


- 'train' -> training split
- 'validation' -> validation split
- 'test' -> test split

Each of the splits is a Dataset object that contains the actual text data. The split sizes are fixed by the dataset publisher; print(dataset) shows the row count of each split.

The output matches what we expect.
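
The share of rows in each split can be measured rather than assumed. The sketch below uses a hypothetical helper, `split_fractions`, and made-up row counts so it runs without downloading anything; with the real data, the commented line builds the same dictionary from the loaded `dataset`.

```python
def split_fractions(sizes):
    """Return each split's share of the total row count."""
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("all splits are empty")
    return {name: count / total for name, count in sizes.items()}

# With the loaded data, the sizes would come from:
#   sizes = {name: len(split) for name, split in dataset.items()}
sizes = {"train": 800, "validation": 100, "test": 100}  # hypothetical counts, not the real ones

for name, frac in split_fractions(sizes).items():
    print(f"{name}: {frac:.1%}")
```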

In [None]:
print(dataset['train'].column_names)  # usually article and abstract

['article', 'abstract']


In [None]:
print(dataset['train'][0])  # first sample paper, not yet summarized
# The sample also includes the abstract, which is the "target": the human-written summary.

{'article': 'additive models @xcite provide an important family of models for semiparametric regression or classification . some reasons for the success of additive models are their increased flexibility when compared to linear or generalized linear models and their increased interpretability when compared to fully nonparametric models . \n it is well - known that good estimators in additive models are in general less prone to the curse of high dimensionality than good estimators in fully nonparametric models . \n many examples of such estimators belong to the large class of regularized kernel based methods over a reproducing kernel hilbert space @xmath0 , see e.g. @xcite . in the last years \n many interesting results on learning rates of regularized kernel based models for additive models have been published when the focus is on sparsity and when the classical least squares loss function is used , see e.g. @xcite , @xcite , @xcite , @xcite , @xcite , @xcite and the references therein
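
The sample above contains placeholder tokens such as @xcite and @xmath0 where citations and formulas used to be. Whether to remove them before training is a judgment call that the notebook does not make. The sketch below assumes you want them stripped; the pattern and the helper name `strip_placeholders` are illustrative and based only on the tokens visible in this one sample.

```python
import re

# Matches "@xcite" and "@xmath" followed by optional digits, as seen in the sample
PLACEHOLDER = re.compile(r"@x(?:cite|math\d*)")

def strip_placeholders(text):
    """Drop placeholder tokens, then collapse runs of whitespace (including newlines)."""
    without_tags = PLACEHOLDER.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

sample = ("additive models @xcite provide an important family of models "
          "over a reproducing kernel hilbert space @xmath0 , see e.g. @xcite .")
print(strip_placeholders(sample))
```

Removing the tokens leaves stray punctuation behind (for example ", see e.g. ."), so a real cleaning pass would probably need more rules than this.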

At this point in the notebook, the workflow is:
- Load the dataset ✅
- Inspect and validate the dataset ✅
- Preprocess and tokenize the dataset (preparing for model training)
- Define the model and training arguments
- Train the model

In [None]:
# Check for missing data: drop training samples that are missing an article or an abstract
dataset['train'] = dataset['train'].filter(lambda x: bool(x['article'] and x['abstract']))

# Before training, we must convert text into token IDs that the model can understand, using a tokenizer
from transformers import AutoTokenizer

# ASSUMPTION: the original call passed no checkpoint name, which raises an error.
# "t5-small" is only a placeholder; substitute the checkpoint you plan to fine-tune.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
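
To show what "converting text into token IDs" means, here is a toy sketch that maps whitespace-separated words to integers, then truncates and pads to a fixed length. It is not the subword tokenizer that Transformers loads; the functions `build_vocab` and `encode` and the special tokens `<pad>` and `<unk>` are assumptions made for illustration.

```python
def build_vocab(texts, specials=("<pad>", "<unk>")):
    """Assign an integer ID to each special token and each distinct word."""
    vocab = {token: i for i, token in enumerate(specials)}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))  # new words get the next free ID
    return vocab

def encode(text, vocab, max_length):
    """Map words to IDs, truncate to max_length, and pad the remainder."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.split()]
    ids = ids[:max_length]
    return ids + [vocab["<pad>"]] * (max_length - len(ids))

vocab = build_vocab(["additive models provide flexibility"])
print(encode("additive models scale", vocab, max_length=5))
# [2, 3, 1, 0, 0]
```

The word "scale" is not in the vocabulary, so it becomes the `<unk>` ID (1), and the two trailing zeros are padding. A real tokenizer avoids most unknown words by splitting text into subword pieces.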