## Model inference

In this notebook, we will load a previously trained model, explore the learned topics, and predict topics for a paper on arXiv.

In [1]:
# imports
import sys
import os
import re
sys.path.insert(0, "../")
from utils import scrape_arxiv_abstract
from model import TopicModel
from dataset import ArXivDataset
from gensim.models import LdaModel
from pprint import pprint
from PyPDF2 import PdfReader
import plotly.express as px
import pandas as pd

from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
# pass the API key (read from the OPENAI_API_KEY environment variable) to the LLM wrapper
llm = OpenAI(temperature=0.9, openai_api_key=os.getenv("OPENAI_API_KEY"))


### Build topic model

To build a `TopicModel` object, we pass two arguments: the path to the trained model (used to predict topics for new instances) and the path to the dataset the model was built from (used to preprocess new instances).

In [2]:
# create topic model
model_path = "../models/lda_n105_p10_r929_c39.1"
dataset_path = "../object/dataset.obj"
model = TopicModel(model_path, dataset_path)

In [3]:
prompt = PromptTemplate(
    input_variables=["values"],
    template="Give me only the topic name, knowing that these are the words and the pertinence of each word to the topic: {values}",
)

### Investigate topics

Next, let us explore the different topics learned by the model so that we can assign understandable topic names to each cluster.

In [4]:
# print topics
pprint(model.topics)

[(0,
  [('system', 0.00064308546),
   ('control', 0.00046642686),
   ('problem', 0.0003609612),
   ('tool', 0.00031240965),
   ('model', 0.00027746803),
   ('description', 0.0002760079),
   ('complexity', 0.00027373905),
   ('logic', 0.00027366143),
   ('design', 0.0002652074),
   ('project', 0.00024900687)]),
 (1,
  [('system', 0.023838768),
   ('energy', 0.0150708025),
   ('design', 0.012751054),
   ('electronic', 0.012170386),
   ('problem', 0.009324392),
   ('tutorial', 0.008552619),
   ('model', 0.008287635),
   ('basic', 0.0075536296),
   ('lecture', 0.007428441),
   ('description', 0.007147056)]),
 (2,
  [('game', 0.052449316),
   ('strategy', 0.026261821),
   ('strategic', 0.021482978),
   ('games', 0.016608218),
   ('model', 0.0142783895),
   ('delivery', 0.013869685),
   ('player', 0.013686951),
   ('game_theory', 0.012409714),
   ('algorithm', 0.010465735),
   ('agent', 0.010398761)]),
 (3,
  [('system', 0.046363305),
   ('safety', 0.03733902),
   ('formal', 0.030734418),
  

In [5]:
# ask the LLM for a short name for each topic, based on its top 10 words
TopicsNames = []
for _, words in model.topics:
    # build a "word weight word weight ..." string for the prompt
    topicName = " ".join(word + " " + str(prob) for word, prob in words[:10])
    text = prompt.format(values=topicName)
    TopicsNames.append(llm(text).strip())
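The string-building step in the loop above can be tried out without calling the LLM. The following is a minimal sketch, assuming topics have the `(id, [(word, weight), ...])` shape printed earlier; `format_topic_words` and the example topic are hypothetical names and values introduced only for illustration.

```python
from typing import List, Tuple


def format_topic_words(words: List[Tuple[str, float]], top_n: int = 10) -> str:
    """Join the first top_n (word, weight) pairs into one space-separated string."""
    return " ".join(f"{word} {weight}" for word, weight in words[:top_n])


# hypothetical topic, in the same (id, [(word, weight), ...]) shape as model.topics
example_topic = (2, [("game", 0.052), ("strategy", 0.026), ("player", 0.014)])
print(format_topic_words(example_topic[1]))
# prints: game 0.052 strategy 0.026 player 0.014
```

The resulting string is what would be substituted for `{values}` in the prompt template.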

In [6]:
TopicsNames

['System Design',
 'Electronic System Design',
 'Game Theory',
 'System Safety Verification for Air Traffic Control',
 'Health Information System Integration Project',
 'Computer Network Security',
 'Dynamic Surface Material System',
 'Management',
 'Financial Risk Management',
 'Energy Systems.',
 'Reliability Analysis of Systems',
 'Energy Consumption',
 'Bayesian Inference',
 'Corporate Finance',
 'Hybrid System Model Team Tool Presentation',
 'Algebraic Theory and its Applications',
 'Compression',
 'Blockchain Technology',
 'Control System Design',
 'Linux Operating System Kernel',
 'Electronic Circuit Design',
 'Optical Spectroscopy',
 'Fluid Mechanics',
 'Teaching Physics Tutorials',
 'Numerical Analysis.',
 'Project Management',
 'Project Optimization',
 'Systems Modeling',
 'Software Development',
 'Numerical Optimization Methods for Energy System Analysis and Simulation',
 'Data Analysis',
 'Thermal Energy Balance Model',
 'Economic Model Design for Software and Complex Netwo

In [7]:
model.set_topic_names(TopicsNames)
pprint(model.topics)

[('System Design',
  '0.001*"system" + 0.000*"control" + 0.000*"problem" + 0.000*"tool" + '
  '0.000*"model" + 0.000*"description" + 0.000*"complexity" + 0.000*"logic" + '
  '0.000*"design" + 0.000*"project"'),
 ('Electronic System Design',
  '0.024*"system" + 0.015*"energy" + 0.013*"design" + 0.012*"electronic" + '
  '0.009*"problem" + 0.009*"tutorial" + 0.008*"model" + 0.008*"basic" + '
  '0.007*"lecture" + 0.007*"description"'),
 ('Game Theory',
  '0.052*"game" + 0.026*"strategy" + 0.021*"strategic" + 0.017*"games" + '
  '0.014*"model" + 0.014*"delivery" + 0.014*"player" + 0.012*"game_theory" + '
  '0.010*"algorithm" + 0.010*"agent"'),
 ('System Safety Verification for Air Traffic Control',
  '0.046*"system" + 0.037*"safety" + 0.031*"formal" + 0.027*"critical" + '
  '0.024*"model" + 0.024*"verification" + 0.014*"method" + 0.014*"time" + '
  '0.012*"st" + 0.011*"air_traffic"'),
 ('Health Information System Integration Project',
  '0.049*"datum" + 0.030*"system" + 0.014*"project" + 0.

We can see that several clusters map onto recognizable subject areas. One of them is topic 2, whose top words (`game`, `strategy`, `player`, `game_theory`) relate directly to game theory. Another example is topic 3, whose top words (`safety`, `formal`, `verification`, `air_traffic`) point to the verification of safety-critical systems.

To make it easier to refer to these topic clusters, we assigned a (tentative) name to each of them above, generated by the LLM from the top words of each topic. Note that these names are subject to interpretation and only serve to "summarize" each cluster.

### Predict topics for a paper

Let us see how our model predicts a paper taken directly from arXiv. Using the `scrape_arxiv_abstract()` function, we can extract the title and the abstract of any paper on arXiv given its URL. Once scraped, this title and abstract can be passed into our topic model's `predict()` method.

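For example, the title and abstract of the seminal paper ["Attention Is All You Need" (2017)](https://arxiv.org/abs/1706.03762) can be scraped and passed to the model in this way. The cells below apply the same approach to course descriptions: each course PDF in `../data/ByCourse/` is read, its description section is extracted, and the model predicts topics for that text.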

In [8]:
dir_path = "../data/ByCourse/"

courses = {"texts": [], "course": [], "topics": []}

for file in os.listdir(dir_path):
    file_path = os.path.join(dir_path, file)
    # skip anything that is not a regular file (e.g. subdirectories)
    if not os.path.isfile(file_path):
        continue
    # concatenate the extracted text of every page in the PDF
    reader = PdfReader(file_path)
    text_of_file = "".join((page.extract_text() or "") for page in reader.pages)
    # replace non-breaking spaces and newlines with plain spaces
    text_of_file = text_of_file.replace("\xa0", " ").replace("\n", " ")
    courses["texts"].append(text_of_file)
    courses["course"].append(file.replace(".pdf", ""))
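The clean-up applied to the extracted text can be isolated in a small helper. This is only a sketch of one possible variant, not the notebook's own code: in addition to replacing non-breaking spaces and newlines, it collapses runs of whitespace and trims the ends. The name `normalize_text` is hypothetical.

```python
import re


def normalize_text(raw: str) -> str:
    """Replace non-breaking spaces with spaces, collapse whitespace runs, trim the ends."""
    return re.sub(r"\s+", " ", raw.replace("\xa0", " ")).strip()


print(normalize_text("Intro\xa0to\ncontrol  systems\n"))
# prints: Intro to control systems
```

Collapsing repeated spaces makes the later regular-expression match less sensitive to how the PDF laid out its text.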

In [9]:
print(courses["course"][5])

2EL2220


In [10]:
for text in courses["texts"]:
    match = re.findall(r".Description.+Quarter number ", text)
    if match:
        prediction = model.predict(match[0])
        courses["topics"].append(prediction)
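The pattern used above is greedy, so if "Quarter number " occurs more than once in a document, the match runs to the last occurrence. A hedged alternative, shown here on a made-up sample string, is a non-greedy pattern that stops at the first occurrence; `extract_description` is a hypothetical helper, not part of the project.

```python
import re
from typing import Optional

# same anchors as the pattern above, but ".+?" stops at the first "Quarter number "
DESCRIPTION_RE = re.compile(r".Description.+?Quarter number ")


def extract_description(text: str) -> Optional[str]:
    """Return the description section of a course text, or None if it is absent."""
    match = DESCRIPTION_RE.search(text)
    return match.group(0) if match else None


# made-up sample text, for illustration only
sample = "Course X Description Intro to control. Quarter number 2 ... Quarter number 3"
print(repr(extract_description(sample)))
# prints: ' Description Intro to control. Quarter number '
```

Returning `None` when the section is missing makes it explicit which courses were skipped, instead of silently producing a shorter list of predictions.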


In [11]:
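# shallow copy, so that resetting the lists below does not modify `courses`
courses2 = dict(courses)
courses2["course"] = []
courses2["topics"] = []
for course, text in zip(courses["course"], courses["texts"]):
    match = re.findall(r".Description.+Quarter number ", text)
    if match:
        # record the course name together with its prediction so both lists stay aligned
        prediction = model.predictTopTopics(match[0], numberOfTopics=5)
        courses2["course"].append(course)
        courses2["topics"].append(prediction)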
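`predictTopTopics()` is defined in the project's `model` module and is not shown in this notebook. Judging from how its result is plotted later (index 0 as the angular labels, index 1 as the radial values), it appears to return the topic names and their weights. A minimal sketch of such a top-k selection, under that assumption, could look as follows; `top_topics` and the sample distribution are hypothetical.

```python
from typing import List, Sequence, Tuple


def top_topics(
    distribution: Sequence[Tuple[str, float]], k: int = 5
) -> Tuple[List[str], List[float]]:
    """Return the names and weights of the k highest-weighted topics."""
    ranked = sorted(distribution, key=lambda pair: pair[1], reverse=True)[:k]
    names = [name for name, _ in ranked]
    weights = [weight for _, weight in ranked]
    return names, weights


# made-up topic distribution, for illustration only
sample_distribution = [("Game Theory", 0.1), ("Control System Design", 0.6), ("Data Analysis", 0.3)]
print(top_topics(sample_distribution, k=2))
# prints: (['Control System Design', 'Data Analysis'], [0.6, 0.3])
```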


In [13]:
name = courses2["course"][1] + ".png"

In [12]:
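# save one radar chart per course (static image export requires the kaleido package)
for i in range(len(courses2["topics"])):
    # index 0 holds the topic names (angular axis), index 1 their weights (radial axis)
    topic_names, topic_weights = courses2["topics"][i][0], courses2["topics"][i][1]
    fig = px.line_polar(r=topic_weights, theta=topic_names, line_close=True)

    name = courses2["course"][i] + ".png"
    fig.write_image("../data/plots/" + name)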

  trace_data = trace_data.append(trace_data.iloc[0])

The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

