In [None]:
# TopicGPT_Python package

`topicgpt_python` consists of five modules in total: 
- `generate_topic_lvl1` generates high-level and generalizable topics. 
- `generate_topic_lvl2` generates low-level, specific topics for each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.
- `correct_topics` corrects the topic assignments by reprompting the model so that each assigned topic is grounded in the topic list.

![topicgpt_python](assets/img/pipeline.png)

## Setup
1. Make a new Python 3.9+ environment using virtualenv or conda. 
2. Install the required packages: `pip install --upgrade topicgpt_python`.
- Our package supports OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run. 
- Please refer to https://openai.com/pricing/ for OpenAI API pricing or to https://cloud.google.com/vertex-ai/pricing for Vertex API pricing. 

In [None]:
# Run in a shell (the `export` lines below are shell syntax, not Python)
pip install --upgrade topicgpt_python

# Needed only for the OpenAI API deployment
export OPENAI_API_KEY={your_openai_api_key}

# Needed only for the Vertex AI deployment
export VERTEX_PROJECT={your_vertex_project}   # e.g. my-project
export VERTEX_LOCATION={your_vertex_location} # e.g. us-central1

# Needed only for Gemini deployment
export GEMINI_API_KEY={your_gemini_api_key}

# Needed only for the Azure API deployment
export AZURE_OPENAI_API_KEY={your_azure_api_key}
export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}

## Usage
1. First, define the necessary file paths for I/O operations in `config.yml`. 
2. Then, import the necessary modules and functions from `topicgpt_python`.
3. Store your data in `data/input` and modify the `data_sample` path in `config.yml`. 

- Prepare your `.jsonl` data file in the following format:
    ```
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    ```
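The format above is shown pretty-printed for readability; a `.jsonl` file conventionally holds one JSON object per line. A minimal sketch of writing and reading such a file with the standard library follows. The document contents, the helper names `write_jsonl` and `read_jsonl`, and the temporary file path are all illustrative assumptions, not part of `topicgpt_python`.

```python
import json
import os
import tempfile

# Hypothetical example documents; "id" and "label" are optional in the format above.
documents = [
    {"id": "doc-1", "text": "The city approved a new light-rail line.", "label": "Transport"},
    {"text": "Farmers reported record wheat yields this season."},
]


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Written to a temporary location here; in this workflow it would live under data/input.
sample_path = os.path.join(tempfile.gettempdir(), "sample.jsonl")
write_jsonl(documents, sample_path)
loaded = read_jsonl(sample_path)
```

Reading the file back is a quick way to confirm that every line parses and carries a `text` field before starting a run.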

In [17]:
from topicgpt_python import *
import yaml

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)
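
The cells below index into `config` directly, so a missing key only surfaces as a `KeyError` partway through a run. As a rough sketch, the check below lists the keys that the cells in this notebook read and reports any that are absent. The key list is collected from the calls in this notebook and may not be exhaustive; `EXPECTED_KEYS` and `missing_config_keys` are hypothetical names, not part of the package.

```python
# Keys read by the cells in this notebook (assumption: collected by hand,
# the real config.yml may define more). Scalars map to an empty list.
EXPECTED_KEYS = {
    "data_sample": [],
    "verbose": [],
    "refining_topics": [],
    "generate_subtopics": [],
    "generation": ["prompt", "seed", "output", "topic_output"],
    "refinement": ["prompt", "output", "topic_output", "remove", "mapping_file"],
    "generation_2": ["prompt", "output", "topic_output"],
    "assignment": ["prompt", "output"],
    "correction": ["prompt", "output"],
}


def missing_config_keys(cfg):
    """Return dotted names of expected keys that are absent from `cfg`."""
    missing = []
    for section, fields in EXPECTED_KEYS.items():
        if section not in cfg:
            missing.append(section)
            continue
        section_value = cfg[section]
        for field in fields:
            if not isinstance(section_value, dict) or field not in section_value:
                missing.append(f"{section}.{field}")
    return missing


# Hypothetical, deliberately incomplete config for illustration.
example_config = {
    "data_sample": "data/input/sample.jsonl",
    "verbose": True,
    "generation": {"prompt": "prompt/generation_1.txt", "seed": "prompt/seed_1.md"},
}
missing = missing_config_keys(example_config)
print(missing)
```

Calling `missing_config_keys(config)` on the loaded configuration should return an empty list if every key used in this notebook is present.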

### Topic Generation 
Generate high-level topics using `generate_topic_lvl1`. 
- Define the API type and model.
- Define your seed topics in `prompt/seed_1.md`.
- (Optional) Modify few-shot examples in `prompt/generation_1.txt`.
- Expect the generated topics in `data/output/{data_name}/generation_1.md` and `data/output/{data_name}/generation_1.jsonl`.
- Right now, early stopping is set to 100, meaning that if no new topic has been generated in the last 100 iterations, the generation process will stop.
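
The early-stopping rule described above can be sketched as a counter that resets whenever a new topic appears. This is an illustration of the idea only, not the package's implementation; `generate_with_early_stopping` and the `batches` input (which stands in for per-document model responses) are hypothetical.

```python
def generate_with_early_stopping(batches, patience=100):
    """Collect topics until `patience` consecutive iterations add nothing new.

    `batches` is a hypothetical iterable that yields, per document, the list
    of topic names a model proposed; it stands in for the real model calls.
    """
    seen = []
    stale_iterations = 0
    for proposed in batches:
        added = False
        for topic in proposed:
            if topic not in seen:
                seen.append(topic)
                added = True
        if added:
            stale_iterations = 0  # a new topic resets the counter
        else:
            stale_iterations += 1
            if stale_iterations >= patience:
                break  # no new topic in the last `patience` iterations
    return seen


# Small patience so the effect is visible on a toy stream.
stream = [["Trade"], ["Trade"], ["Health"], ["Trade"], ["Health"], ["Energy"]]
topics = generate_with_early_stopping(stream, patience=2)
```

With `patience=2` the loop stops after the fourth and fifth iterations add nothing, so `"Energy"` is never reached; with the default of 100 the whole toy stream is consumed.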

In [18]:
import os
os.environ["OPENAI_API_KEY"] = "{your_openai_api_key}"  # placeholder: replace with your key
generate_topic_lvl1(
    "openai",
    "gpt-4o",
    config["data_sample"],
    config["generation"]["prompt"],
    config["generation"]["seed"],
    config["generation"]["output"],
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic generation...
Model: gpt-4o
Data file: true_inputs/run.jsonl
Prompt file: prompt/generation_1.txt
Seed file: prompt/seed_1.md
Output file: true_outputs/1/generation_1.jsonl
Topic file: true_outputs/1/generation_1.md
-------------------


 50%|█████     | 1/2 [00:02<00:02,  2.05s/it]

Prompt token usage: 949 ~$0.004745
Response token usage: 45 ~$0.000675
Topics: [1] Urban and Regional Planning: Involves the development and design of land use and the built environment, including the infrastructure passing into and out of urban areas, such as transportation, communications, and distribution networks.
--------------------


100%|██████████| 2/2 [00:03<00:00,  1.83s/it]

Prompt token usage: 937 ~$0.004685
Response token usage: 50 ~$0.00075
Topics: [1] Urban and Regional Planning: Involves the development and design of land use and the built environment, including air, water, and the infrastructure passing into and out of urban areas, such as transportation, communications, and distribution networks.
--------------------





<topicgpt_python.utils.TopicTree at 0x7f344cc7d240>
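
The cost estimates printed in the log above are consistent with a flat per-token rate. As a worked check, the figures match roughly $5 per million prompt tokens and $15 per million response tokens. These rates are inferred from the log output, not taken from the package or from current provider pricing, and `estimate_cost` is a hypothetical helper.

```python
# Rates inferred from the log output above (USD per 1M tokens).
# Assumptions for illustration, not official pricing.
PROMPT_RATE_PER_MILLION = 5.0
RESPONSE_RATE_PER_MILLION = 15.0


def estimate_cost(prompt_tokens, response_tokens):
    """Return (prompt_cost, response_cost) in USD for one model call."""
    prompt_cost = prompt_tokens * PROMPT_RATE_PER_MILLION / 1_000_000
    response_cost = response_tokens * RESPONSE_RATE_PER_MILLION / 1_000_000
    return prompt_cost, response_cost


# Token counts taken from the first iteration in the log.
prompt_cost, response_cost = estimate_cost(949, 45)
print(f"Prompt: ~${prompt_cost:.6f}, response: ~${response_cost:.6f}")
```

Summing these per-call estimates over the number of documents gives a rough budget for a full run before launching it.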

### Topic Refinement
Topics generated by a weaker model are sometimes irrelevant or redundant. This module: 
- Merges similar topics.
- Removes overly specific or redundant topics that occur less than 1% of the time (you can skip this by setting `remove` to `False` in `config.yml`).

Notes:
- Expect the refined topics in `data/output/{data_name}/refinement_1.md` and `data/output/{data_name}/refinement_1.jsonl`. If nothing changes, the topic list is already coherent.
- If you're unsatisfied with the refined topics, call the function again, passing the refined topic file and the refined output file from the previous iteration as inputs.
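
The 1% removal rule can be illustrated with a simple frequency count. The sketch below assumes a flat list with one topic name per assignment; `drop_rare_topics` and the toy counts are hypothetical and do not reflect how `refine_topics` is implemented internally.

```python
from collections import Counter


def drop_rare_topics(assignments, min_share=0.01):
    """Keep topics whose share of all assignments is at least `min_share`.

    `assignments` is a hypothetical flat list with one topic name per
    assigned document.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: n for topic, n in counts.items() if n / total >= min_share}


# Toy data: "Misc" accounts for 1 of 200 assignments (0.5%), below the 1% threshold.
toy_assignments = ["Trade"] * 150 + ["Health"] * 49 + ["Misc"] * 1
kept = drop_rare_topics(toy_assignments)
```

Here `kept` retains `"Trade"` and `"Health"` and drops `"Misc"`.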

In [19]:
# Optional: Refine topics if needed
if config["refining_topics"]:
    refine_topics(
        "openai",
        "gpt-4o",
        config["refinement"]["prompt"],
        config["generation"]["output"],
        config["generation"]["topic_output"],
        config["refinement"]["topic_output"],
        config["refinement"]["output"],
        verbose=config["verbose"],
        remove=config["refinement"]["remove"],
        mapping_file=config["refinement"]["mapping_file"]
    )

### Subtopic Generation 
Generate subtopics using `generate_topic_lvl2`.
- This function iterates over each high-level topic and generates subtopics based on a few example documents associated with the high-level topic.
- Expect the generated topics in `data/output/{data_name}/generation_2.md` and `data/output/{data_name}/generation_2.jsonl`.
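
The per-topic iteration described above amounts to grouping documents by their high-level topic and picking a few examples from each group. The sketch below shows that grouping step only, under the assumption that each document is already paired with one topic name; `sample_documents_per_topic`, its parameters, and the toy records are hypothetical and not the package's API.

```python
import random
from collections import defaultdict


def sample_documents_per_topic(records, per_topic=3, seed=0):
    """Group documents by high-level topic and sample a few from each group.

    `records` is a hypothetical list of (topic_name, document_text) pairs.
    """
    grouped = defaultdict(list)
    for topic, text in records:
        grouped[topic].append(text)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        topic: rng.sample(texts, min(per_topic, len(texts)))
        for topic, texts in grouped.items()
    }


toy_records = [
    ("Urban and Regional Planning", "doc A"),
    ("Urban and Regional Planning", "doc B"),
    ("Urban and Regional Planning", "doc C"),
    ("Urban and Regional Planning", "doc D"),
    ("Agriculture", "doc E"),
]
examples = sample_documents_per_topic(toy_records, per_topic=2)
```

Each sampled group would then be placed into the subtopic prompt for its high-level topic.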

In [20]:
# Optional: Generate subtopics
if config["generate_subtopics"]:
    generate_topic_lvl2(
        "openai",
        "gpt-4o",
        config["generation"]["topic_output"],
        config["generation"]["output"],
        config["generation_2"]["prompt"],
        config["generation_2"]["output"],
        config["generation_2"]["topic_output"],
        verbose=config["verbose"],
    )

-------------------
Initializing topic generation (lvl 2)...
Model: gpt-4o
Data file: true_outputs/1/generation_1.jsonl
Prompt file: prompt/generation_2.txt
Seed file: true_outputs/1/generation_1.md
Output file: true_outputs/1/generation_2.jsonl
Topic file: true_outputs/1/generation_2.md
-------------------
Number of remaining documents for prompting: 2


  0%|          | 0/1 [00:00<?, ?it/s]

Current topic: [1] Urban and Regional Planning


100%|██████████| 1/1 [00:04<00:00,  4.76s/it]

Subtopics: [1] Urban and Regional Planning
   [2] National Physical Planning (Document: 1, 2): Involves the creation and implementation of national-level physical planning documents such as the National Physical Plan-3 (NPP-3), which guide development with a focus on sustainability and resilience.
   [2] Urbanisation Policy (Document: 1, 2): Pertains to policies like the Second National Urbanisation Policy (NUP-2) that guide sustainable urban planning and development, emphasizing balanced physical, environmental, social, and economic growth.
   [2] Rural Development Planning (Document: 1, 2): Concerns the planning and implementation of strategies for rural areas, as outlined in documents like the National Rural Physical Plan 2030 (NRPP 2030), focusing on spatial rural development and policy measures.
Not a match: [2] National Physical Planning (Document: 1, 2): Involves the creation and implementation of national-level physical planning documents such as the National Physical Plan-3 (N




### Topic Assignment
Assign the generated topics to the input text using `assign_topics`. Each assignment is supported by a quote from the input text.
- Expect the assigned topics in `data/output/{data_name}/assignment.jsonl`. 
- A weaker model is often used here to save cost, so some assigned topics may not appear in the topic list. To correct this, use the `correct_topics` module. If errors or hallucinations remain, run `correct_topics` again.
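
To see what "grounded in the topic list" means in practice, the sketch below parses a response line of the form `[level] Topic name: explanation` and checks whether the name is in the allowed set. The line format is inferred from the log output in this notebook; `parse_topic`, `is_grounded`, and the regular expression are hypothetical and are not how `correct_topics` is implemented.

```python
import re

# Matches "[1] Topic name: ..." or "[2] Topic name (Document: 1, 2): ...".
# The format is an assumption inferred from the logs in this notebook.
TOPIC_PATTERN = re.compile(r"^\[(\d+)\]\s*([^:(]+?)\s*(?:\(|:)")


def parse_topic(line):
    """Return (level, name) for a topic line, or None if it does not match."""
    match = TOPIC_PATTERN.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def is_grounded(response, allowed_names):
    """True if the response names a topic that exists in `allowed_names`."""
    parsed = parse_topic(response)
    return parsed is not None and parsed[1] in allowed_names


allowed = {"Urban and Regional Planning"}
good = "[1] Urban and Regional Planning: The document discusses national physical development."
bad = "[1] City Design: The document discusses zoning."
results = [is_grounded(good, allowed), is_grounded(bad, allowed)]
```

Responses that fail this kind of check are the ones a correction pass would need to reprompt.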

In [None]:
# Assignment
assign_topics(
    "openai",
    "gpt-4o-mini",
    config["data_sample"],
    config["assignment"]["prompt"],
    config["assignment"]["output"],
    # Topic file: generation_2 assumes you generated subtopics. Otherwise use
    # config["generation"]["topic_output"], or config["refinement"]["topic_output"]
    # if you refined topics.
    config["generation_2"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic assignment...
Model: gpt-4o-mini
Data file: true_inputs/run.jsonl
Prompt file: prompt/assignment.txt
Output file: true_outputs/1/assignment.jsonl
Topic file: true_outputs/1/generation_1.md
-------------------


 50%|█████     | 1/2 [00:01<00:01,  1.89s/it]

Prompt token usage: 850 ~$0.0042499999999999994
Response token usage: 74 ~$0.0011099999999999999
Response: [1] Urban and Regional Planning: The document discusses national physical development in Peninsular Malaysia, focusing on various planning documents that guide urban and rural development, emphasizing sustainability and land use planning. (Supporting quote: "The NPP-3 is the highest-ranking planning document in the national development framework which translates strategic and sectoral policies into spatial and physical dimensions.")
--------------------


100%|██████████| 2/2 [00:03<00:00,  1.75s/it]

Prompt token usage: 830 ~$0.00415
Response token usage: 125 ~$0.001875
Response: [1] Urban and Regional Planning: The document discusses national physical development in Peninsular Malaysia, focusing on various planning documents that guide urban and rural development. It emphasizes sustainable urban planning and development, which aligns with the topic of urban and regional planning. Supporting quotes include references to the National Physical Plan-3 (NPP-3) and the Second National Urbanisation Policy (NUP2), which are key documents in guiding urban planning efforts. 

Supporting quote: "The NUP2 is a policy to guide and coordinate sustainable urban planning and development with emphasis on balanced development physically, environmentally, socially and economically."
--------------------





In [None]:
# Optional: Generate comparisons
if config["generate_comparison"]:
    generate_comparison(
        "openai",
        "gpt-4o",
        config["assignment"]["output"],  # same assignment output key used by the other cells
        config["comparison"]["prompt"],
        config["comparison"]["output"],
        verbose=config["verbose"],
    )

In [None]:
# Correction
correct_topics(
    "openai",
    "gpt-4o-mini",
    config["assignment"]["output"],
    config["correction"]["prompt"],
    # Topic file: use config["generation_2"]["topic_output"] if you have subtopics,
    # or config["refinement"]["topic_output"] if you refined topics.
    config["generation"]["topic_output"],
    config["correction"]["output"],
    verbose=config["verbose"],
)