In [None]:
# TopicGPT_Python package

`topicgpt_python` consists of five modules in total: 
- `generate_topic_lvl1` generates high-level and generalizable topics. 
- `generate_topic_lvl2` generates low-level, specific subtopics for each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.
- `correct_topics` corrects the topic assignments by reprompting the model so that each assignment is grounded in the topic list. 

![topicgpt_python](assets/img/pipeline.png)

## Setup
1. Make a new Python 3.9+ environment using virtualenv or conda. 
2. Install the required packages: `pip install --upgrade topicgpt_python`.
- The package supports the OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run. 
- See https://openai.com/pricing/ for OpenAI API pricing and https://cloud.google.com/vertex-ai/pricing for Vertex AI pricing. 
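
The generation and assignment logs later in this notebook print an approximate dollar cost beside each token count. As a rough sketch of that arithmetic, the hypothetical helper below multiplies token counts by per-million-token rates. The rates used here ($5 per million prompt tokens, $15 per million response tokens) are assumptions back-calculated from the logged figures (949 prompt tokens ~$0.004745, 45 response tokens ~$0.000675); they are not taken from the package or from the pricing pages, and current prices may differ.

```python
# Hypothetical helper, not part of topicgpt_python.
# Both rates are assumptions inferred from the costs logged in this notebook.
PROMPT_USD_PER_MILLION = 5.0
RESPONSE_USD_PER_MILLION = 15.0


def estimate_cost(prompt_tokens: int, response_tokens: int) -> tuple[float, float]:
    """Return (prompt_cost, response_cost) in US dollars."""
    prompt_cost = prompt_tokens * PROMPT_USD_PER_MILLION / 1_000_000
    response_cost = response_tokens * RESPONSE_USD_PER_MILLION / 1_000_000
    return prompt_cost, response_cost


# Token counts copied from the first logged generation step.
prompt_cost, response_cost = estimate_cost(949, 45)
print(f"Prompt: ~${prompt_cost:.6f}, response: ~${response_cost:.6f}")
```

Summing these per-document estimates over a small sample gives a quick upper-level budget check before running the pipeline on a full dataset.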

In [2]:
import os
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # placeholder: replace with your own key

## Usage
1. First, define the necessary file paths for I/O operations in `config.yml`. 
2. Then, import the necessary modules and functions from `topicgpt_python`.
3. Store your data in `data/input` and modify the `data_sample` path in `config.yml`. 

- Prepare your `.jsonl` data file with one JSON object per line, each with the following fields:
    ```
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    ```
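
As a minimal sketch of producing such a file with the standard library: each record is serialized as a single JSON object on its own line, which is what the `.jsonl` extension conventionally means. The `write_jsonl` helper, the example records, and the temporary output path are all hypothetical and for illustration only.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Illustrative records; "id" and "label" are optional in the format above.
records = [
    {"id": "doc-1", "text": "First example document.", "label": "example"},
    {"text": "Second example document without id or label."},
]

# Hypothetical location; point `data_sample` in config.yml at your real file.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(records, sample_path)

# Read the file back to confirm that every line parses on its own.
with open(sample_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), "records written")
```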

In [None]:
import os
import sys

import yaml

# Make the local `topicgpt_python` folder importable
folder_path = os.path.abspath("topicgpt_python")
sys.path.append(folder_path)

# Import the pipeline functions directly from the modules in that folder
from data_sample import sample_data
from generation_1 import generate_topic_lvl1
from generation_2 import generate_topic_lvl2
from refinement import refine_topics
from assignment import assign_topics
from correction import correct_topics
from comparison import comparison
from baseline import baseline
from evaluate import evaluate

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)
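
The cells in this notebook read a number of keys from `config`. As a hedged sketch, the check below lists the keys this notebook actually indexes and reports any that are missing, so that an incomplete `config.yml` fails early rather than partway through a run. The `REQUIRED_KEYS` table and the `missing_keys` helper are hypothetical and derived only from the calls in this notebook, not from the package's documentation. The example dictionary reuses paths that appear in the logged output; its boolean values are placeholders.

```python
# Hypothetical check, derived from the keys this notebook indexes.
# Top-level keys map to the sub-keys that are read from them.
REQUIRED_KEYS = {
    "data_sample": (),
    "verbose": (),
    "generate_subtopics": (),
    "generate_comparison": (),
    "generation": ("prompt", "seed", "output", "topic_output"),
    "generation_2": ("prompt", "output", "topic_output"),
    "assignment": ("prompt", "output"),
    "comparison": ("prompt", "output"),
    "baseline": ("prompt", "output"),
    "evaluate": ("prompt", "output"),
}


def missing_keys(config: dict) -> list[str]:
    """Return dotted names of required keys that are absent from config."""
    missing = []
    for key, subkeys in REQUIRED_KEYS.items():
        if key not in config:
            missing.append(key)
            continue
        section = config[key]
        for sub in subkeys:
            if not isinstance(section, dict) or sub not in section:
                missing.append(f"{key}.{sub}")
    return missing


# Deliberately incomplete example to show the check reporting problems.
example = {
    "data_sample": "true_inputs/run.jsonl",
    "verbose": True,
    "generate_subtopics": True,
    "generate_comparison": True,
    "generation": {
        "prompt": "prompt/generation_1.txt",
        "seed": "prompt/seed_1.md",
        "output": "true_outputs/1/generation_1.jsonl",
    },
}
print(missing_keys(example))
```

In the notebook itself, calling `missing_keys(config)` right after loading `config.yml` would be the natural place for this check.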

In [None]:

# Generation: high-level topics
generate_topic_lvl1(
    "openai",
    "gpt-4o",
    config["data_sample"],
    config["generation"]["prompt"],
    config["generation"]["seed"],
    config["generation"]["output"],
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic generation...
Model: gpt-4o
Data file: true_inputs/run.jsonl
Prompt file: prompt/generation_1.txt
Seed file: prompt/seed_1.md
Output file: true_outputs/1/generation_1.jsonl
Topic file: true_outputs/1/generation_1.md
-------------------


 50%|█████     | 1/2 [00:02<00:02,  2.05s/it]

Prompt token usage: 949 ~$0.004745
Response token usage: 45 ~$0.000675
Topics: [1] Urban and Regional Planning: Involves the development and design of land use and the built environment, including the infrastructure passing into and out of urban areas, such as transportation, communications, and distribution networks.
--------------------


100%|██████████| 2/2 [00:03<00:00,  1.83s/it]

Prompt token usage: 937 ~$0.004685
Response token usage: 50 ~$0.00075
Topics: [1] Urban and Regional Planning: Involves the development and design of land use and the built environment, including air, water, and the infrastructure passing into and out of urban areas, such as transportation, communications, and distribution networks.
--------------------





<topicgpt_python.utils.TopicTree at 0x7f344cc7d240>

In [5]:
# Optional: Generate subtopics
if config["generate_subtopics"]:
    generate_topic_lvl2(
        "openai",
        "gpt-4o",
        config["generation"]["topic_output"],
        config["generation"]["output"],
        config["generation_2"]["prompt"],
        config["generation_2"]["output"],
        config["generation_2"]["topic_output"],
        verbose=config["verbose"],
    )

-------------------
Initializing topic generation (lvl 2)...
Model: gpt-4o
Data file: true_outputs/1/generation_1.jsonl
Prompt file: prompt/generation_2.txt
Seed file: true_outputs/1/generation_1.md
Output file: true_outputs/1/generation_2.jsonl
Topic file: true_outputs/1/generation_2.md
-------------------
Number of remaining documents for prompting: 2


  0%|          | 0/1 [00:00<?, ?it/s]

Current topic: [1] Urban and Regional Planning


100%|██████████| 1/1 [00:03<00:00,  3.02s/it]

Subtopics: [1] Urban and Regional Planning
   [2] National Physical Planning (Document: 1, 2): Involves the development and implementation of national-level physical planning documents such as the National Physical Plan-3 (NPP-3), focusing on strategic and sectoral policies translated into spatial and physical dimensions.
   [2] Urbanisation Policy (Document: 1, 2): Pertains to policies guiding sustainable urban planning and development, emphasizing balanced development across physical, environmental, social, and economic dimensions.
   [2] Rural Development Planning (Document: 1, 2): Concerns the planning and implementation of strategies for rural development, as outlined in documents like the National Rural Physical Plan 2030, focusing on spatial rural development and policy measures.
Not a match: [2] National Physical Planning (Document: 1, 2): Involves the development and implementation of national-level physical planning documents such as the National Physical Plan-3 (NPP-3), focu
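
The topic lines printed above follow a recognizable shape: a bracketed level, a topic name, an optional parenthetical such as `(Document: 1, 2)`, and an optional description after a colon. As a sketch under that assumption, the hypothetical parser below splits one such line into its parts. It is based only on the lines shown in this log; the format of the generated topic files may differ, and this is not the package's own parsing code.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape: "[level] Name (optional note): optional description"
TOPIC_LINE = re.compile(
    r"^\s*\[(?P<level>\d+)\]\s+(?P<name>.+?)"
    r"(?:\s+\((?P<meta>[^)]*)\))?"
    r"(?::\s*(?P<description>.*))?$"
)


class Topic(NamedTuple):
    level: int
    name: str
    meta: Optional[str]
    description: Optional[str]


def parse_topic_line(line: str) -> Optional[Topic]:
    """Parse one topic line; return None if the line has a different shape."""
    match = TOPIC_LINE.match(line.rstrip())
    if match is None:
        return None
    return Topic(
        int(match["level"]), match["name"], match["meta"], match["description"]
    )


# Lines copied from the subtopic output above.
parent = parse_topic_line("[1] Urban and Regional Planning")
child = parse_topic_line(
    "   [2] Urbanisation Policy (Document: 1, 2): Pertains to policies guiding "
    "sustainable urban planning and development, emphasizing balanced "
    "development across physical, environmental, social, and economic dimensions."
)
print(parent)
print(child.level, child.name, child.meta)
```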




In [6]:
# Assignment
assign_topics(
    "openai",
    "gpt-4o-mini",
    config["data_sample"],
    config["assignment"]["prompt"],
    config["assignment"]["output"],
    # Topic file to assign from. This run uses the level-2 topics; use
    # config["generation"]["topic_output"] if you did not generate subtopics, or
    # config["refinement"]["topic_output"] if you refined topics.
    config["generation_2"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic assignment...
Model: gpt-4o-mini
Data file: true_inputs/run.jsonl
Prompt file: prompt/assignment.txt
Output file: true_outputs/1/assignment_2.jsonl
Topic file: true_outputs/1/generation_2.md
-------------------


 50%|█████     | 1/2 [00:03<00:03,  3.66s/it]

Prompt token usage: 924 ~$0.00462
Response token usage: 140 ~$0.0021
Response: [2] Urbanisation Policy: The document discusses the Second National Urbanisation Policy (NUP-2), which provides guidance on sustainable urban planning and emphasizes balanced development across various dimensions. (Supporting quote: "The NUP-2 provides guidance on sustainable urban planning and development with an emphasis for balanced development physically, environmentally, socially and economically.")

[2] Rural Development Planning: The document mentions the National Rural Physical Plan 2030 (NRPP 2030), which outlines strategies and implementation measures for rural development. (Supporting quote: "The NRPP 2030 is the first spatial rural development document that outlines policy statements, strategies and implementation measures towards materialising the rural development vision.")
--------------------


100%|██████████| 2/2 [00:06<00:00,  3.19s/it]

Prompt token usage: 904 ~$0.00452
Response token usage: 166 ~$0.00249
Response: [2] Urbanisation Policy: The document discusses the Second National Urbanisation Policy (NUP2), which is focused on guiding and coordinating sustainable urban planning and development, emphasizing balanced development across various dimensions. (Supporting quote: "The NUP2 is a policy to guide and coordinate sustainable urban planning and development with emphasis on balanced development physically, environmentally, socially and economically.")

[2] Rural Development Planning: The document mentions the National Rural Physical Plan 2030 (NRPP 2030), which is described as the nation’s first spatial rural development document outlining strategies and implementation measures for rural development. (Supporting quote: "The NRPP 2030 is the nation’s first spatial rural development document that outlines policy statements, strategies and implementation measures according to specific themes and thrusts towards mater
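
Each assignment in the responses above pairs a topic with an explanation and a supporting quote. As a sketch, assuming every assignment sits on one line in the form `[level] Topic: explanation (Supporting quote: "...")` and that assignments are separated by blank lines, the hypothetical helpers below extract those fields. The pattern is inferred from the logged responses only; model output can deviate from it, in which case `parse_assignment` returns `None` rather than guessing.

```python
import re
from typing import Optional

# Assumed shape: [level] Topic: explanation (Supporting quote: "quote")
ASSIGNMENT = re.compile(
    r'^\[(?P<level>\d+)\]\s+(?P<topic>.+?):\s+(?P<reason>.*?)\s*'
    r'\(Supporting quote:\s*"(?P<quote>.*)"\)\s*$'
)


def parse_assignment(block: str) -> Optional[dict]:
    """Parse one assignment; return None if it does not fit the pattern."""
    match = ASSIGNMENT.match(block.strip())
    if match is None:
        return None
    return {
        "level": int(match["level"]),
        "topic": match["topic"],
        "reason": match["reason"],
        "quote": match["quote"],
    }


def parse_response(text: str) -> list:
    """Split a response on blank lines and keep the assignments that parse."""
    parsed = []
    for block in text.split("\n\n"):
        result = parse_assignment(block)
        if result is not None:
            parsed.append(result)
    return parsed


# First assignment copied from the logged response above.
response = (
    '[2] Urbanisation Policy: The document discusses the Second National '
    'Urbanisation Policy (NUP-2), which provides guidance on sustainable urban '
    'planning and emphasizes balanced development across various dimensions. '
    '(Supporting quote: "The NUP-2 provides guidance on sustainable urban '
    'planning and development with an emphasis for balanced development '
    'physically, environmentally, socially and economically.")'
)
parsed = parse_assignment(response)
print(parsed["topic"], "|", parsed["quote"])
```

Returning `None` for lines that do not fit makes it easy to collect the responses that need a second look, for example output that was cut off mid-quote.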




In [7]:
# Optional: Generate comparisons
if config["generate_comparison"]:
    comparison(
        "openai",
        "gpt-4o",
        config["assignment"]["output"],
        config["comparison"]["prompt"],
        config["comparison"]["output"],
    )

Final prompt sent to API Agent:
As a climate scientist and specialized Q&A bot with expertise in climate change, climate science, environmental science, physics, and energy science, your primary objective is: 
1. Provide an accurate and comprehensive comparison from the two documents inputted by the user.  
2. Provide detailed discussions on both similarities of the points, and differences, additionally discuss when only one topic appears in one document only 
3. In cases where sufficient information is lacking to address the comparison, reply with ’There is not enough info to answer the question.’ 
4. It’s imperative to maintain accuracy and refrain from creating information. If any aspect is unclear, do not create answers about that aspect.

[Instructions]
Here is two documents, as well as key discussion points, provide a comparison:

[Document]
Document 1:
Text: ,For the period until 2020, national physical development in Peninsular Malaysia is guided by three physical planning docu

In [None]:
# Baseline: produce baseline output from the assignment results
baseline(
    "openai",
    "gpt-4o",
    config["assignment"]["output"],
    config["baseline"]["prompt"],
    config["baseline"]["output"],
)

In [None]:
# Evaluation: takes the comparison output and the baseline output as inputs
evaluate(
    "openai",
    "gpt-4o",
    config["comparison"]["output"],
    config["baseline"]["output"],
    config["evaluate"]["prompt"],
    config["evaluate"]["output"],
)