## Dataset preparation

In this notebook, we will build and pre-process the arXiv paper dataset.

In [1]:
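# imports
import sys

import nltk
from PyPDF2 import PdfReader

# make the parent directory importable so that `dataset` can be found
sys.path.insert(0, "../")
from dataset import ArXivDataset

# stopword list used during pre-processing
nltk.download('stopwords')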

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/vitoriano/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We pre-process the text with several techniques, including removal of LaTeX equations, tokenization, stopword removal, n-gram phrase detection, and lemmatization.
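
To make the first of these steps concrete, here is a minimal sketch using only the standard library. It is not the `ArXivDataset` implementation, which is not shown in this notebook: the function names `strip_latex`, `tokenize`, and `preprocess` are hypothetical, the stopword set is a tiny stand-in for NLTK's list, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stopword set for illustration; the notebook uses NLTK's list.
STOPWORDS = {"the", "of", "a", "is", "and", "in", "to", "we"}

def strip_latex(text):
    """Remove display math ($$...$$), inline math ($...$), then LaTeX commands."""
    text = re.sub(r"\$\$.*?\$\$", " ", text, flags=re.DOTALL)
    text = re.sub(r"\$[^$]*\$", " ", text)
    text = re.sub(r"\\[a-zA-Z]+(\{[^}]*\})?", " ", text)
    return text

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(text):
    """Strip LaTeX, tokenize, and drop stopwords."""
    tokens = tokenize(strip_latex(text))
    return [t for t in tokens if t not in STOPWORDS]

sample = "We minimize the cost $J = \\int x^2 dt$ of a linear controller."
print(preprocess(sample))
# ['minimize', 'cost', 'linear', 'controller']
```

The regular expressions only cover simple cases (no nested braces, no escaped dollar signs), which is enough to show the idea.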

In [2]:
metadata_filepath = "../data/CAT2023-2AEN-V1.pdf"


In [3]:
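# read the PDF and extract the text of one page (index 19 is the 20th page)
reader = PdfReader(metadata_filepath)
page = reader.pages[19]
text = page.extract_text()

# drop newlines and split on periods for a quick look at the content
text.replace("\n", "").split(".")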

[' 20 temporal approach  c',
 ' Application examples  d',
 ' General theorems: intrinsic limitations of the ideal loop (needs to formalize, trade -off to be made)  2',
 ' State -space representation  a',
 ' Reminders  b',
 ' Pro perties (controllability, observability)  c',
 ' Linearized / nonlinear links - implementation of the control law on the nonlinear model  3',
 ' State feedback control law design  a',
 ' Control by pole placement in the monovariable case, reference tracking and accuracy  b',
 ' Linear Quadratic controller (LQ control)  c',
 ' Case of measurable disturbances (disturbance rejection)  d',
 ' LQ control with integral action  4',
 ' Estimated state feedback control  a',
 ' Observer by pole placement  b',
 ' Kalman filter (LQ control duality)  c',
 ' Linear Quadratic Gaussian controller (LQG) – Separation theorem  5',
 ' Performance and robustness analysis of a control law  a',
 ' Reminders: links with the transfer function  b',
 ' Equivalent controller for the LQ an
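
Splitting on every period leaves each outline marker (`a`, `b`, `2`, ...) attached to the end of the previous title, as the output above shows. One possible alternative, sketched here on a hand-written sample string rather than the real page text, is to split on the markers themselves. The helper name `split_outline` is hypothetical, and the pattern assumes markers are a number or a single lowercase letter followed by a period.

```python
import re

def split_outline(text):
    """Pair each outline marker ('2', 'a', ...) with the title that follows it."""
    # The capturing group keeps the markers in the result of re.split.
    parts = re.split(r"\s*\b(\d+|[a-z])\.\s+", text)
    markers, titles = parts[1::2], parts[2::2]
    return [(m, t.strip()) for m, t in zip(markers, titles)]

sample = (
    "2. State-space representation a. Reminders "
    "b. Properties (controllability, observability) "
    "3. State feedback control law design"
)
for marker, title in split_outline(sample):
    print(marker, "|", title)
```

Any text before the first marker ends up in `parts[0]` and is ignored by this sketch.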

In [4]:
# build and pre-process texts
metadata_filepath = "../CAT2023-2AEN-V1.pdf"
dataset = ArXivDataset.from_metadata(metadata_filepath)
print(f"# papers: {len(dataset)}")

 [1/6] Removing LaTex equations...
 [2/6] Removing newlines and extra spaces...
 [3/6] Tokenizing documents...
 [4/6] Removing stopwords...
 [5/6] Identifying n-gram phrases...
 [6/6] Lemmatizing...
 Done.
# papers: 2
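
Step 5 in the log above identifies n-gram phrases. How `ArXivDataset` does this is not shown here, so the following is only a simplified sketch of the idea for bigrams: count adjacent token pairs across documents and join the frequent ones into a single token. The names `find_bigrams` and `merge_bigrams`, the `min_count` threshold, and the toy documents are all assumptions made for illustration. Dedicated phrase detectors such as gensim's `Phrases` rank pairs with a scoring function rather than raw counts.

```python
from collections import Counter

def find_bigrams(docs, min_count=2):
    """Return the set of adjacent token pairs seen at least min_count times."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}

def merge_bigrams(doc, phrases):
    """Greedily join detected pairs into single 'a_b' tokens, left to right."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
            merged.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged

docs = [
    ["state", "feedback", "control", "law", "design"],
    ["estimated", "state", "feedback", "control"],
    ["observer", "pole", "placement"],
    ["control", "pole", "placement"],
]
phrases = find_bigrams(docs)
print([merge_bigrams(d, phrases) for d in docs])
```

Because the merge is greedy, `state feedback control` becomes `state_feedback` followed by `control`, even though `feedback control` is also a frequent pair.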


Now that the texts have been pre-processed, we can inspect the tokenized documents and then export them as a dataset object.

In [5]:
dataset.documents

[['temporal',
  'approach',
  'application',
  'example',
  'general',
  'intrinsic',
  'limitation',
  'ideal',
  'loop',
  'need',
  'trade',
  'state',
  'space',
  'representation',
  'pro',
  'pertie',
  'controllability',
  'observability',
  'nonlinear',
  'link',
  'implementation',
  'control',
  'law',
  'nonlinear',
  'model',
  'state',
  'feedback',
  'control',
  'law',
  'design',
  'control',
  'pole',
  'placement',
  'monovariable',
  'case',
  'reference',
  'tracking',
  'accuracy',
  'linear',
  'quadratic',
  'controller',
  'lq',
  'control',
  'case',
  'measurable',
  'disturbance',
  'disturbance',
  'rejection',
  'control',
  'integral',
  'action',
  'state',
  'feedback',
  'control',
  'observer',
  'pole',
  'placement',
  'kalman',
  'filter',
  'control',
  'duality',
  'quadratic',
  'gaussian',
  'controller',
  'lqg',
  'separation',
  'performance',
  'robustness',
  'analysis',
  'control',
  'law',
  'link',
  'function',
  'equivalent',
  'contr

In [7]:
# export dataset
dataset.save("../data/dataset.obj")
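
The format written by `dataset.save` is not shown in this notebook. If it is a plain pickle, which the `.obj` extension suggests but does not confirm, the round trip would look roughly like the sketch below. It uses a stand-in list of tokens and a temporary file rather than the real `ArXivDataset` object or path.

```python
import os
import pickle
import tempfile

# Stand-in for the tokenized documents; not the real ArXivDataset object.
documents = [["state", "feedback", "control"], ["kalman", "filter"]]

# Write to a temporary directory so the sketch does not touch ../data.
path = os.path.join(tempfile.mkdtemp(), "dataset.obj")

with open(path, "wb") as f:
    pickle.dump(documents, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == documents)
```

Unpickling a custom class requires that class to be importable, so a later notebook would still need the same `sys.path` setup before loading the file.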