# Topic Extraction

Topics are extracted by sending each question's title and description to the [Wikifier](https://wikifier.org/info.html) API and collecting the entities it annotates in each text. The result is a JSON dataset containing the extracted topics for each question.

An example of the annotations Wikifier returns for the description of this Metaculus [question](https://www.metaculus.com/questions/504/how-many-subscribers-will-netflix-have-by-2022/) is provided in `data/annotations_example.json`.

In [1]:
# Import libraries
import pandas as pd
import requests
import json

### Page rank and Wiki data classes
 * Set the page rank threshold used for disambiguation and filtering. A greater value means more detected entities.
 * Set the number of Wikidata classes. Every entity has many associated Wikidata classes; this parameter keeps only the top X most important ones.
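
To make the two parameters concrete, the sketch below shows how they might map onto a Wikifier request. It assumes that `page_rank` is sent as Wikifier's `pageRankSqThreshold` field and that `wiki_data_classes` is applied client-side by truncating each annotation's class list. Neither mapping is confirmed by the `EntityExtractor` source, which is not shown here. `build_wikifier_params`, `top_classes`, and the placeholder user key are hypothetical names used only for illustration.

```python
# Sketch only: the parameter mapping is an assumption, not EntityExtractor's code.
WIKIFIER_URL = "http://www.wikifier.org/annotate-article"


def build_wikifier_params(text, user_key, page_rank=0.006, lang="en"):
    """Build the form fields for a Wikifier annotate-article request."""
    return {
        "text": text,
        "lang": lang,
        "userKey": user_key,
        # Assumed mapping: page_rank -> pageRankSqThreshold
        # (a higher value keeps more annotations).
        "pageRankSqThreshold": "%g" % page_rank,
        "applyPageRankSqThreshold": "true",
        # Ask Wikifier to include the Wikidata classes of each annotation.
        "wikiDataClasses": "true",
        "wikiDataClassIds": "false",
        "support": "false",
        "ranges": "false",
    }


def top_classes(annotation, wiki_data_classes=3):
    """Keep the first N Wikidata classes of one annotation (ordering is assumed)."""
    return annotation.get("wikiDataClasses", [])[:wiki_data_classes]


params = build_wikifier_params("Netflix subscribers", user_key="YOUR_KEY")
```

With a valid user key, the request itself could then be sent with `requests.post(WIKIFIER_URL, data=params, timeout=60)` and decoded with `.json()`.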



In [2]:
# Import class for entity extraction and create instance
from EntityExtractor import EntityExtractor

page_rank = 0.006
wiki_data_classes = 3

extractor = EntityExtractor()

### Extraction of topics in binary questions

In [3]:
# Read the binary question dataset (Change path to dataset)
binary_questions = pd.read_json(
    "data/questions-binary-hackathon.json",
    orient="records",
    convert_dates=False,
)

# Keep only the first 5 questions for this demonstration
binary_questions = binary_questions.iloc[:5]

In [4]:
# Generate the topic dataset for all the questions in the binary questions dataset
# using the Wikifier demo tool as the entity extraction method.

starts_at = 0
finishes_at = len(binary_questions)
topic_dataset_name = 'data/binary_topics.json'

extractor.generate_topic_dataset(
    question_dataset=binary_questions,
    topic_dataset_name=topic_dataset_name,
    starts_at=starts_at,
    finishes_at=finishes_at,
)

Topic extraction failed: 'wikiDataItemId'
Topic extraction failed: 'wikiDataItemId'
Processed questions:  1
Processed questions:  2
Processed questions:  3
Topic extraction failed: 'wikiDataItemId'
Processed questions:  4
Processed questions:  5
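
The `Topic extraction failed: 'wikiDataItemId'` messages above look like a `KeyError` raised when an annotation has no `wikiDataItemId` field. This is an assumption, since the `EntityExtractor` source is not shown here. A minimal sketch of a more tolerant conversion follows: it skips such annotations and counts them instead of failing. The function name `annotations_to_topics` is hypothetical, the response layout (an `annotations` list whose items carry `title`, `wikiDataItemId`, and `wikiDataClasses` with an `enLabel` per class) is assumed, and the output keys are only loosely modeled on the `title`/`id`/`types` entries visible in the enriched dataset.

```python
def annotations_to_topics(response, max_classes=3):
    """Convert a Wikifier-style response into topic dicts.

    Annotations without a Wikidata item ID are skipped rather than
    raising KeyError. Returns (topics, number_of_skipped_annotations).
    """
    topics = []
    skipped = 0
    for annotation in response.get("annotations", []):
        item_id = annotation.get("wikiDataItemId")
        if item_id is None:
            skipped += 1
            continue
        classes = annotation.get("wikiDataClasses", [])
        topics.append(
            {
                "title": annotation.get("title"),
                "id": item_id,
                # Assumed: each class is a dict with an English label.
                "types": [c.get("enLabel") for c in classes[:max_classes]],
            }
        )
    return topics, skipped


# Hand-written sample, not real Wikifier output.
sample_response = {
    "annotations": [
        {"title": "Star", "wikiDataItemId": "Q523", "wikiDataClasses": []},
        {"title": "Entity without an ID"},
    ]
}
topics, skipped = annotations_to_topics(sample_response)
```

Returning the skip count alongside the topics keeps the loss visible, so a run can report how many annotations were dropped per question.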


In [5]:
# Load the generated topic dataset, merge it into the original question dataset,
# and write the enriched dataset to a new JSON file.
# This only works if the question dataset and the topic dataset have the same length.

with open(topic_dataset_name, "r") as f:
    binary_topics = json.load(f)

EntityExtractor.join_datasets(binary_questions, binary_topics, 12, 'data/enriched-binary-questions.json')

Unnamed: 0,question_id,title,title_short,description,created_time,publish_time,close_time,resolve_time,resolution,resolution_comment,split,categories,topics
0,10756,Will the 7-day moving average of current confi...,VA COVID Hospitalizations >750 2022-07-01,"Nationwide, new COVID-19 hospitalizations have...",1650485000.0,1650485000.0,1651248000.0,1657133000.0,0.0,resolved,test,[Tournament -- Real-time Pandemic Decision Mak...,"[{'title': 'Moving average', 'id': 'Q1130194',..."
1,8049,Will the existence and smoothness properties o...,N-S existence & smoothness and compactness,The [Navier-Stokes existence and smoothness co...,1632582000.0,1633410000.0,2840130000.0,4765122000.0,,not yet resolved,train,"[Mathematics – Pure Mathematics, Mathematics –...",[{'title': 'Navier–Stokes existence and smooth...
2,110,Will a consensus explanation of the strange be...,,NASA’s Kepler Mission revealed that the star [...,1453507000.0,1456849000.0,1464782000.0,1483301000.0,0.0,resolved,train,[Physical Sciences – Astrophysics and Cosmolog...,"[{'title': 'Star', 'id': 'Q523', 'types': [], ..."
3,4554,Will more than one entrant achieve a perfect s...,,The International Mathematical Olympiad (IMO) ...,1591097000.0,1591434000.0,1600124000.0,1601243000.0,0.0,resolved,train,"[Mathematics, Academy Series]",[{'title': 'International Mathematical Olympia...
4,1639,British Pound / US Dollar parity before 2020?,,The British Pound is currently trading at appr...,1544478000.0,1544659000.0,1546301000.0,1577789000.0,0.0,resolved,train,"[Finance, Finance – markets, Economy]","[{'title': 'United Kingdom', 'id': 'Q145', 'ty..."
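
The merge step requires one topic entry per question, so checking the lengths up front gives a clearer error than a failed join. The sketch below illustrates this on plain records (lists of dicts). It is not the `join_datasets` implementation: `join_topics` is a hypothetical helper, the topic file is assumed to hold one list of topics per question in the same order, and the `position` argument reflects a guess that the numeric argument passed to `join_datasets` is the column position of the new `topics` field.

```python
def join_topics(questions, topics, position):
    """Insert a 'topics' field into each question record at a given position."""
    if len(questions) != len(topics):
        raise ValueError(
            f"Length mismatch: {len(questions)} questions vs {len(topics)} topic entries"
        )
    enriched = []
    for question, question_topics in zip(questions, topics):
        # Rebuild the record so that 'topics' lands at the requested position.
        items = list(question.items())
        items.insert(position, ("topics", question_topics))
        enriched.append(dict(items))
    return enriched


# Hand-written sample records, not taken from the real dataset.
sample_questions = [
    {"question_id": 1, "title": "First question"},
    {"question_id": 2, "title": "Second question"},
]
sample_topics = [[{"title": "Star", "id": "Q523"}], []]
enriched = join_topics(sample_questions, sample_topics, position=2)
```

The enriched records could then be written out with `json.dump`, mirroring the JSON file produced in the cell above.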


### Extraction of topics in continuous questions

In [6]:
# Read the continuous question dataset (Change path to dataset)
continuous_questions = pd.read_json(
    "data/questions-continuous-hackathon.json",
    orient="records",
    convert_dates=False, 
)

# Keep only the first 10 questions for this demonstration
continuous_questions = continuous_questions.iloc[:10]

In [7]:
# Generate the topic dataset for all the questions in the continuous questions dataset
# using the Wikifier demo tool as the entity extraction method.

starts_at = 0
finishes_at = len(continuous_questions)
topic_dataset_name = 'data/continuous_topics.json'

extractor.generate_topic_dataset(
    question_dataset=continuous_questions,
    topic_dataset_name=topic_dataset_name,
    starts_at=starts_at,
    finishes_at=finishes_at,
)

Topic extraction failed: 'wikiDataItemId'
Processed questions:  1
Processed questions:  2
Processed questions:  3
Topic extraction failed: 'wikiDataItemId'
Processed questions:  4
Processed questions:  5
Topic extraction failed: 'wikiDataItemId'
Processed questions:  6
Processed questions:  7
Processed questions:  8
Processed questions:  9
Topic extraction failed: 'wikiDataItemId'
Processed questions:  10


In [8]:
# Load the generated topic dataset, merge it into the original question dataset,
# and write the enriched dataset to a new JSON file.
# This only works if the question dataset and the topic dataset have the same length.

with open(topic_dataset_name, "r") as f:
    continuous_topics = json.load(f)

EntityExtractor.join_datasets(continuous_questions, continuous_topics, 19, 'data/enriched-continuous-questions.json')

Unnamed: 0,question_id,title,title_short,description,created_time,publish_time,close_time,resolve_time,lower_bound_type,upper_bound_type,format,scale_min,scale_max,scale_deriv_ratio,resolution_comment,resolution,x_grid,split,categories,topics
0,6577,What will the combined sector weighting of Inf...,IT & Comms sector weighting 2030-01-01,"Electricity, internal combustion engines, and ...",1613251000.0,1613257000.0,1618438000.0,1893452000.0,open,open,num,20.0,75.0,1.0,not yet resolved,,"[20.0, 20.275, 20.55, 20.825, 21.1, 21.375, 21...",train,"[Economy – US, Deep Learning Round, Finance]","[{'title': 'Information technology', 'id': 'Q1..."
1,6157,"How many e-prints on AI Safety, Interpretabili...",AI Safety & other: 2021-01-14 to 2022-01-14,<small>\nThis question is part of the Hill Cli...,1609881000.0,1610665000.0,1615728000.0,1642632000.0,open,open,num,150.0,1700.0,1.0,resolved,560.0,"[150.0, 157.75, 165.5, 173.25, 181.0, 188.75, ...",train,"[Hill Climbing Round, Series — Forecasting AI ...","[{'title': 'Artificial intelligence', 'id': 'Q..."
2,5704,What will the mean level of transit activity b...,Transit activity in Phoenix for December,All 50 states in the U.S. have begun to reopen...,1605188000.0,1605049000.0,1608851000.0,1609073000.0,open,open,num,15.0,145.0,1.0,resolved,54.04,"[15.0, 15.65, 16.3, 16.95, 17.6, 18.25, 18.9, ...",train,[Series — Road to recovery],"[{'title': 'United States', 'id': 'Q30', 'type..."
3,6985,What will be the number of new incident U.S. a...,New US COVID hospital admissions 25 Apr-1 May,Changes in the number of hospitalizations due ...,1617803000.0,1617818000.0,1618942000.0,1623951000.0,closed,open,num,0.0,75000.0,1.0,resolved,34263.0,"[0.0, 375.0, 750.0, 1125.0, 1500.0, 1875.0, 22...",train,[Consensus Forecasting to Improve Public Health],"[{'title': 'Influenza', 'id': 'Q2840', 'types'..."
4,5948,What will the value of the herein defined Imag...,Image Classification Index 2026-12-14,<small>\nThis question is part of the Maximum ...,1607806000.0,1607980000.0,1739488000.0,1797203000.0,open,open,num,120.0,350.0,1.0,not yet resolved,,"[120.0, 121.15, 122.3, 123.45, 124.6, 125.75, ...",train,"[Maximum Likelihood Round, Computing – Artific...","[{'title': 'Artificial intelligence', 'id': 'Q..."
5,7908,"How many flights will the Mars Helicopter, Ing...",Total Number of Ingenuity Flights,Inspired by [this question](https://www.metacu...,1630995000.0,1631592000.0,1672600000.0,1893524000.0,closed,open,num,12.0,300.0,25.0,not yet resolved,,"[12.0, 12.1946950952, 12.3925490388, 12.593613...",train,"[Industry – Space, Technology – Space]","[{'title': 'Flight', 'id': 'Q206021', 'types':..."
6,7173,Kolik tu bude otázek (accepted) po 7 dnech pro...,Počet otázek tady za týden.,Rád bych věděl jak moc bude lidi bavit metacul...,1620305000.0,1620252000.0,1620943000.0,1620980000.0,open,open,num,5.0,100.0,1.0,resolved,9.0,"[5.0, 5.475, 5.95, 6.425, 6.9, 7.375, 7.85, 8....",train,,"[{'title': 'Otázka', 'id': 'Q189756', 'types':..."
7,3091,What will the U.S. market for plant-based meat...,,Data from [SPINS](https://www.spins.com/) summ...,1568152000.0,1568412000.0,1625177000.0,1743632000.0,open,open,num,800.0,20000.0,25.0,not yet resolved,,"[800.0, 812.9796730139, 826.1699359172, 839.57...",train,"[Animal Welfare Series, Animal welfare -- plan...","[{'title': 'United States', 'id': 'Q30', 'type..."
8,6232,What will be the sum of the performance (in ex...,Total compute TOP500 supercomputers Nov 21,<small>\nThis question is part of the Hill Cli...,1610403000.0,1610665000.0,1615676000.0,1637067000.0,open,open,num,2.4,7.5,1.0,resolved,3.04,"[2.4, 2.4255, 2.451, 2.4765, 2.502, 2.5275, 2....",train,"[Hill Climbing Round, Series — Forecasting AI ...","[{'title': 'FLOPS', 'id': 'Q188768', 'types': ..."
9,9326,Flu Hospitalizations for NY (Feb 19),Feb 19,Confirmed hospitalizations due to influenza is...,1641926000.0,1642003000.0,1644880000.0,1646658000.0,open,open,num,4.0,1600.0,400.0,resolved,30.0,"[4.0, 4.1216422316, 4.2469836714, 4.3761368143...",train,[Tournament -- 2022 Influenza],"[{'title': 'Influenza', 'id': 'Q2840', 'types'..."
