## Model inference

In this notebook, we will load a previously trained model, explore the learned topics, and predict topics for a paper on arXiv.

In [5]:
# imports
import sys
sys.path.insert(0, "../")
from utils import scrape_arxiv_abstract
from model import TopicModel
from dataset import ArXivDataset
from gensim.models import LdaModel
from pprint import pprint
import PyPDF2

### Build topic model

To build a `TopicModel` object, we pass two paths as arguments: the path to the trained model (used to predict topics for new instances) and the path to the saved dataset the model was built from (used to process new instances).

In [2]:
# create topic model
model_path = "../models/lda_n38_p5_r929_c37.1"
dataset_path = "../object/dataset.obj"
model = TopicModel(model_path, dataset_path)

### Investigate topics

Next, let us explore the different topics learned by the model so that we can assign understandable topic names to each cluster.

In [3]:
# print topics
pprint(model.topics)

[(0,
  '0.025*"control" + 0.017*"model" + 0.017*"system" + 0.014*"course" + '
  '0.011*"work" + 0.011*"skill" + 0.010*"law" + 0.010*"lab" + 0.009*"student" '
  '+ 0.008*"problem"'),
 (1,
  '0.039*"system" + 0.020*"course" + 0.012*"model" + 0.011*"student" + '
  '0.011*"project" + 0.010*"design" + 0.010*"campus" + 0.008*"engineering" + '
  '0.007*"problem" + 0.007*"component"'),
 (2,
  '0.032*"course" + 0.013*"information" + 0.011*"communication" + '
  '0.011*"digital" + 0.010*"software" + 0.009*"student" + 0.009*"system" + '
  '0.009*"channel" + 0.009*"application" + 0.009*"exam"'),
 (3,
  '0.033*"system" + 0.021*"course" + 0.014*"control" + 0.011*"application" + '
  '0.010*"student" + 0.009*"design" + 0.008*"campus" + 0.007*"modeling" + '
  '0.007*"development" + 0.007*"android"'),
 (4,
  '0.007*"project" + 0.004*"course" + 0.004*"thay" + 0.002*"resource" + '
  '0.002*"student" + 0.002*"design" + 0.002*"work" + 0.002*"cluster" + '
  '0.002*"economic" + 0.002*"campus"'),
 (5,
  '0.036*

We can see that some clusters seem to refer to specific topics in machine learning. One of them is topic 7, which seems to relate directly to sequential and time-series data. Another is topic 10, which seems to be related to reinforcement learning.

To make it easier to refer to these topic clusters, we will assign (tentative) names to each of them. Note that these names are subject to interpretation and are only assigned to help "summarize" each cluster.
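When choosing names, it can help to look at each topic's terms as (word, weight) pairs rather than as one long string. The sketch below parses a topic string in the format printed above (`weight*"word"` terms joined by `+`). The helper name `parse_topic_terms` is hypothetical and is not part of the `TopicModel` class.

```python
import re

# Matches one term of the form 0.025*"control" in a topic string.
TERM_PATTERN = re.compile(r'(\d*\.?\d+)\*"([^"]+)"')


def parse_topic_terms(topic_string):
    """Return (word, weight) pairs from a topic string, heaviest first."""
    terms = [
        (word, float(weight))
        for weight, word in TERM_PATTERN.findall(topic_string)
    ]
    # sorted() is stable, so terms with equal weights keep their original order.
    return sorted(terms, key=lambda pair: pair[1], reverse=True)


# The first three terms of topic 0, copied from the output above.
example = '0.025*"control" + 0.017*"model" + 0.017*"system"'
top_terms = parse_topic_terms(example)
print(top_terms)
```

Reading the heaviest words side by side for each topic makes it easier to judge whether a proposed name actually fits the cluster.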

In [18]:
# set topic names
topic_names = [
  "1. Acoustics",
  "2. Aerospace Engineering",
  "3. Agricultural Engineering",
  "4. Automation and Control Systems",
  "5. Automotive Engineering",
  "6. Biomechanics",
  "7. Chemical Engineering",
  "8. Civil Engineering",
  "9. Communications Engineering",
  "10. Computer Engineering",
  "11. Construction Engineering",
  "12. Control Systems Engineering",
  "13. Electrical Engineering",
  "14. Energy Systems Engineering",
  "15. Environmental Engineering",
  "16. Fluid Mechanics",
  "17. Geotechnical Engineering",
  "18. Industrial Engineering",
  "19. Instrumentation and Measurement",
  "20. Manufacturing Engineering",
  "21. Marine Engineering",
  "22. Materials Science and Engineering",
  "23. Mechanical Engineering",
  "24. Mechatronics",
  "25. Microelectronics",
  "26. Mining Engineering",
  "27. Nanotechnology",
  "28. Nuclear Engineering",
  "29. Optics and Photonics",
  "30. Petroleum Engineering",
  "31. Power Engineering",
  "32. Process Engineering",
  "33. Robotics",
  "34. Structural Engineering",
  "35. Systems Engineering",
  "36. Telecommunications Engineering",
  "37. Thermal Sciences",
  "38. Transportation Engineering"
]


model.set_topic_names(topic_names)
pprint(model.topics)

[('1. Acoustics',
  '0.025*"control" + 0.017*"model" + 0.017*"system" + 0.014*"course" + '
  '0.011*"work" + 0.011*"skill" + 0.010*"law" + 0.010*"lab" + 0.009*"student" '
  '+ 0.008*"problem"'),
 ('2. Aerospace Engineering',
  '0.039*"system" + 0.020*"course" + 0.012*"model" + 0.011*"student" + '
  '0.011*"project" + 0.010*"design" + 0.010*"campus" + 0.008*"engineering" + '
  '0.007*"problem" + 0.007*"component"'),
 ('3. Agricultural Engineering',
  '0.032*"course" + 0.013*"information" + 0.011*"communication" + '
  '0.011*"digital" + 0.010*"software" + 0.009*"student" + 0.009*"system" + '
  '0.009*"channel" + 0.009*"application" + 0.009*"exam"'),
 ('4. Automation and Control Systems',
  '0.033*"system" + 0.021*"course" + 0.014*"control" + 0.011*"application" + '
  '0.010*"student" + 0.009*"design" + 0.008*"campus" + 0.007*"modeling" + '
  '0.007*"development" + 0.007*"android"'),
 ('5. Automotive Engineering',
  '0.007*"project" + 0.004*"course" + 0.004*"thay" + 0.002*"resource" + '
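
The output above shows the names being attached to topics in list order (topic 0 receives the first name, topic 1 the second, and so on). Under that assumption, a missing or duplicated entry would shift or blur the labels, so a quick check before calling `set_topic_names()` may be worthwhile. The helper `check_topic_names` below is a hypothetical sketch, not part of the `TopicModel` class.

```python
def check_topic_names(names, num_topics):
    """Raise ValueError unless names maps one-to-one onto num_topics topics."""
    if len(names) != num_topics:
        raise ValueError(f"expected {num_topics} names, got {len(names)}")
    # Collect every name that appears more than once.
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicate topic names: {duplicates}")
    return True


# Small illustrative call with two names for two topics.
ok = check_topic_names(["1. Acoustics", "2. Aerospace Engineering"], num_topics=2)
```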
 

### Predict topics for a paper

Let us see how our model predicts a paper taken directly from arXiv. Using the `scrape_arxiv_abstract()` function, we can extract the title and the abstract of any paper on arXiv given its URL. Once scraped, this title and abstract can be passed into our topic model's `predict()` method.

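For example, the title and abstract of the seminal paper ["Attention Is All You Need" (2017)](https://arxiv.org/abs/1706.03762) could be scraped this way. The cells below skip the scraping step and pass two course descriptions to the model directly as plain strings, to see what topics it detects.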

In [19]:
text = "The physical systems generally rely on the fundamental concept of a feedback loop, allowing them to be controlled and giving them a behavior that is as insensitive as possible to environmental disturbances. The general objective of this course is to provide students with the concepts and skills enabling them to understand the structure and interactions within existing dynamic systems or along with their design. They will also be able to process information, design a control law to meet specifications, and analyze its performance and robustness. To achieve this, the students must first be able to define a model (or a set of models), highlighting the variables influencing the state of this system (inputs), the measures allowing access to this state, and variables to which the specifications relate (outputs), as well as the relationships between these variables. Then, in a second step, and from the analysis of the inputs that can be controlled (commands) or those that are undergone (disturbances), students will have to design a control law in order to ensure the expected performances. The last step in this course will concern the analysis of the robustness of the determined control law."

In [20]:
# get predictions
model.predict(text)

[('1. Acoustics', 0.9858876)]

In [21]:
text2 = "The goal of this training is to let you discover the different stages of an aircraft design process in both a theoretical and a practical perspective. You will be introduced to the typical methods used in an aircraft design office, and apply this knowledge by doing the preliminary design of your own aircraft. After completing this training course, you will have acquired knowledge and skills that will enable you to work out the main aircraft characteristics and layout in a very short time frame."


In [22]:
# get predictions
model.predict(text2)

[('38. Transportation Engineering', 0.77075034),
 ('17. Geotechnical Engineering', 0.19963962)]
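
The predictions above come back as a list of (topic name, probability) pairs. Assuming that shape holds in general, a small helper can pick the single most likely topic and discard it when the model is not confident enough. The function name `top_topic` and the 0.5 threshold are illustrative assumptions, not part of the `TopicModel` class.

```python
def top_topic(predictions, min_probability=0.5):
    """Return the most probable (name, probability) pair, or None.

    None is returned when the list is empty or when the best
    probability falls below min_probability.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda pair: pair[1])
    return best if best[1] >= min_probability else None


# The prediction for text2, copied from the output above.
predictions = [
    ("38. Transportation Engineering", 0.77075034),
    ("17. Geotechnical Engineering", 0.19963962),
]
best = top_topic(predictions)
print(best)
```

Raising `min_probability` makes the helper stricter: with a threshold of 0.9, the same prediction would be rejected and `None` returned.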