# Topic Extraction

Topics are extracted by sending each question's title and description to the [Wikifier](https://wikifier.org/info.html) API and collecting the entity annotations it returns for each text. The result is a JSON dataset containing the extracted topics for every question.

An example of the annotations Wikifier produces for the description of this Metaculus [question](https://www.metaculus.com/questions/504/how-many-subscribers-will-netflix-have-by-2022/) is provided in `data/annotations_example.json`.
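
For orientation, a single annotation request might look like the sketch below. It is not the `EntityExtractor` implementation: the helper names are hypothetical, `YOUR_USER_KEY` is a placeholder, and the endpoint and parameter names are taken from the public Wikifier documentation and should be checked against it before use.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint, as given in the public Wikifier documentation.
WIKIFIER_URL = "https://www.wikifier.org/annotate-article"


def build_wikifier_params(text, user_key, page_rank_threshold, lang="en"):
    """Build the form fields for one annotation request (hypothetical helper)."""
    return {
        "text": text,
        "lang": lang,
        "userKey": user_key,
        "pageRankSqThreshold": "%g" % page_rank_threshold,
        "applyPageRankSqThreshold": "true",
        "wikiDataClasses": "true",
        "wikiDataClassIds": "false",
        "support": "false",
        "ranges": "false",
    }


def annotate(text, user_key, page_rank_threshold):
    """POST the text to Wikifier and return the decoded JSON response."""
    fields = build_wikifier_params(text, user_key, page_rank_threshold)
    data = urllib.parse.urlencode(fields).encode("utf-8")
    request = urllib.request.Request(WIKIFIER_URL, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def annotation_titles(response):
    """Return the Wikipedia page title of every annotation in a response."""
    return [annotation["title"] for annotation in response.get("annotations", [])]
```

Calling `annotate(description, "YOUR_USER_KEY", 0.006)` performs the network request; `annotation_titles` then lists the page titles of the detected entities.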

In [2]:
# Import libraries
import pandas as pd
import requests
import json

### Page rank and Wiki data classes
 * Set the page rank threshold used for disambiguation and filtering. A greater value means more detected entities.
 * Set the number of Wikidata classes. Every entity has many associated Wikidata classes; this parameter keeps only the top X most important.
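
The effect of both parameters can be illustrated with a small sketch. It assumes the pruning rule described in the Wikifier documentation (annotations are sorted by decreasing page rank and kept until their squared page ranks sum to the chosen fraction of the total) and assumes each annotation carries a `pageRank` value and a `wikiDataClasses` list with `enLabel` entries. Both function names are hypothetical, and taking the first N classes in response order is an assumption about what "most important" means here.

```python
def prune_by_page_rank(annotations, threshold):
    """Keep the highest-ranked annotations until their squared page ranks
    reach `threshold` times the total sum of squares."""
    ranked = sorted(annotations, key=lambda a: a["pageRank"], reverse=True)
    total = sum(a["pageRank"] ** 2 for a in ranked)
    kept, running = [], 0.0
    for annotation in ranked:
        if running >= threshold * total:
            break
        kept.append(annotation)
        running += annotation["pageRank"] ** 2
    return kept


def top_classes(annotation, n):
    """Return the labels of the first `n` Wikidata classes of an annotation."""
    return [c["enLabel"] for c in annotation.get("wikiDataClasses", [])[:n]]
```

Under this rule a larger threshold can only keep the same annotations or more, which matches the note above that a greater value yields more detected entities.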



In [3]:
# Import class for entity extraction and create instance
from EntityExtractor import EntityExtractor

page_rank = 0.006
wiki_data_classes = 3

extractor = EntityExtractor()

### Extraction of topics in binary questions

In [5]:
# Read the binary question dataset (Change path to dataset)
binary_questions = pd.read_json(
    "data/questions-binary-hackathon.json",
    orient="records",
    convert_dates=False,
)

In [6]:
# Generate the topic dataset for all the questions in the binary questions dataset
# using the Wikifier demo tool as entity extraction method.

starts_at = 0
finishes_at = len(binary_questions)
topic_dataset_name = 'data/binary_topics.json'

extractor.generate_topic_dataset(
    question_dataset=binary_questions,
    topic_dataset_name=topic_dataset_name,
    starts_at=starts_at,
    finishes_at=finishes_at,
)

2622


In [None]:
# Open the generated topic dataset, merge it with the original question dataset,
# and write the enriched dataset to a new JSON file.
# This only works if the question dataset and the topic dataset have the same length.

with open(topic_dataset_name, "r") as f:
    binary_topics = json.load(f)

EntityExtractor.join_datasets(binary_questions, binary_topics, 12, 'data/enriched-binary-questions.json')
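
Because the merge only works when both datasets have the same length, a guard run before `join_datasets` can fail early with a clear message. This is a minimal sketch; `check_same_length` is a hypothetical helper and not part of `EntityExtractor`.

```python
def check_same_length(questions, topics):
    """Raise ValueError when the two datasets differ in length."""
    if len(questions) != len(topics):
        raise ValueError(
            f"question dataset has {len(questions)} records, "
            f"topic dataset has {len(topics)}"
        )
```

It could be called as `check_same_length(binary_questions, binary_topics)` right after loading the topic file.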

### Extraction of topics in continuous questions

In [None]:
# Read the continuous question dataset (Change path to dataset)
continuous_questions = pd.read_json(
    "data/questions-continuous-hackathon.json",
    orient="records",
    convert_dates=False, 
)

In [None]:
# Generate the topic dataset for all the questions in the continuous questions dataset
# using the Wikifier demo tool as entity extraction method.

starts_at = 0
finishes_at = len(continuous_questions)
topic_dataset_name = 'data/continuous_topics.json'

extractor.generate_topic_dataset(
    question_dataset=continuous_questions,
    topic_dataset_name=topic_dataset_name,
    starts_at=starts_at,
    finishes_at=finishes_at,
)

In [None]:
# Open the generated topic dataset, merge it with the original question dataset,
# and write the enriched dataset to a new JSON file.
# This only works if the question dataset and the topic dataset have the same length.

with open(topic_dataset_name, "r") as f:
    continuous_topics = json.load(f)

EntityExtractor.join_datasets(continuous_questions, continuous_topics, 19, 'data/enriched-continuous-questions.json')