# TopicGPT_Python package

`topicgpt_python` consists of five modules in total: 
- `generate_topic_lvl1` generates high-level and generalizable topics. 
- `generate_topic_lvl2` generates low-level, specific subtopics for each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.
- `correct_topics` corrects topic assignments by reprompting the model so that each assignment is grounded in the topic list.

![topicgpt_python](assets/img/pipeline.png)

## Setup
1. Make a new Python 3.9+ environment using virtualenv or conda. 
2. Install the required packages: `pip install --upgrade topicgpt_python`.
- The package supports the OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run.
- For pricing, see https://openai.com/pricing/ (OpenAI API) and https://cloud.google.com/vertex-ai/pricing (Vertex AI API).

## Usage
1. First, define the necessary file paths for I/O operations in `config.yml`. 
2. Then, import the necessary modules and functions from `topicgpt_python`.
3. Store your data in `data/input` and modify the `data_sample` path in `config.yml`. 
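
The cells in this notebook read a fixed set of keys from `config.yml`. As a minimal sketch, the check below reports which of those keys are absent from a loaded config before any API call is made. The helper name `missing_config_keys` and the example paths are illustrative and are not part of `topicgpt_python`; only the core generation and assignment keys are covered.

```python
# Keys read by the generation and assignment cells in this notebook.
# Extend the tables if your config.yml uses the optional steps as well.
TOP_LEVEL_KEYS = ["data_sample", "verbose", "generate_subtopics"]
SECTION_KEYS = {
    "generation": ["prompt", "seed", "output", "topic_output"],
    "generation_2": ["prompt", "output", "topic_output"],
    "assignment": ["prompt", "output"],
}


def missing_config_keys(config):
    """Return dotted names of expected keys that are absent from `config`."""
    missing = [key for key in TOP_LEVEL_KEYS if key not in config]
    for section, keys in SECTION_KEYS.items():
        section_values = config.get(section) or {}
        missing.extend(
            f"{section}.{key}" for key in keys if key not in section_values
        )
    return missing


# Hypothetical config, shaped like the dict returned by yaml.safe_load.
example_config = {
    "data_sample": "data/input/sample.jsonl",
    "verbose": True,
    "generate_subtopics": True,
    "generation": {
        "prompt": "prompt/generation_1.txt",
        "seed": "prompt/seed_1.md",
        "output": "data/output/generation_1.jsonl",
        "topic_output": "data/output/generation_1.md",
    },
    "generation_2": {
        "prompt": "prompt/generation_2.txt",
        "output": "data/output/generation_2.jsonl",
        "topic_output": "data/output/generation_2.md",
    },
    "assignment": {"prompt": "prompt/assignment.txt"},  # "output" left out on purpose
}

print(missing_config_keys(example_config))
```

Running a check like this right after loading the config surfaces a mistyped or missing path before any tokens are spent.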

- Prepare your `.jsonl` data file in the following format:
    ```
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    ```
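
A `.jsonl` file stores one JSON object per line, so the record above is shown expanded only for readability. The following is a minimal sketch for writing and reading such a file with the standard library. The documents, the file name `sample.jsonl`, and the temporary directory are placeholders; in practice the file would go under `data/input`.

```python
import json
import os
import tempfile

# Placeholder documents; "id" and "label" are optional fields.
documents = [
    {"id": "doc-1", "text": "First document text.", "label": "climate"},
    {"id": "doc-2", "text": "Second document text."},
]


def write_jsonl(records, path):
    """Write each record as a single JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A temporary directory keeps the sketch runnable anywhere.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(documents, path)
print(read_jsonl(path))
```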

In [8]:
import os
import sys

import yaml

# Add the local `topicgpt_python` folder to sys.path so its modules can be imported directly
folder_path = os.path.abspath("topicgpt_python")
sys.path.append(folder_path)

# Import modules directly
from data_sample import sample_data
from generation_1 import generate_topic_lvl1
from generation_2 import generate_topic_lvl2
from refinement import refine_topics
from assignment import assign_topics
from correction import correct_topics
from comparison import comparison
from baseline import baseline
from evaluate import evaluate

# Load the file paths and pipeline options
with open("config.yml", "r") as f:
    config = yaml.safe_load(f)

INFO 03-21 02:01:38 [__init__.py:256] Automatically detected platform cuda.


In [9]:

generate_topic_lvl1(
    "openai",
    "gpt-4o",
    config["data_sample"],
    config["generation"]["prompt"],
    config["generation"]["seed"],
    config["generation"]["output"],
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic generation...
Model: gpt-4o
Data file: true_inputs/run3.jsonl
Prompt file: prompt/generation_1.txt
Seed file: prompt/seed_1.md
Output file: true_outputs/1/generation_1_3.jsonl
Topic file: true_outputs/1/generation_1_3.md
-------------------


 50%|█████     | 1/2 [00:04<00:04,  4.02s/it]

Prompt token usage: 1451 ~$0.007255
Response token usage: 32 ~$0.00047999999999999996
Topics: [1] Climate Change: Addresses the impacts, adaptation strategies, and policies related to climate change and its effects on the environment, economy, and society.
--------------------


100%|██████████| 2/2 [00:05<00:00,  2.70s/it]

Prompt token usage: 1479 ~$0.0073950000000000005
Response token usage: 32 ~$0.00047999999999999996
Topics: [1] Climate Change: Discusses the impacts, adaptation measures, and resilience strategies related to climate change, particularly in vulnerable regions like small island states.
--------------------





<topicgpt_python.utils.TopicTree at 0x7fc9e16b8ee0>
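
The dollar figures in the log above are consistent with a rate of $5 per million prompt tokens and $15 per million response tokens (for example, 1451 × 5 / 1,000,000 = 0.007255). The sketch below reproduces that arithmetic. The rates are inferred from the log, not taken from the package, so check them against current pricing before relying on them.

```python
# Rates inferred from the log output above (an assumption, not a package constant).
PROMPT_USD_PER_MILLION = 5.0
RESPONSE_USD_PER_MILLION = 15.0


def estimate_cost(prompt_tokens, response_tokens):
    """Return (prompt_cost, response_cost) in USD for one API call."""
    prompt_cost = prompt_tokens * PROMPT_USD_PER_MILLION / 1_000_000
    response_cost = response_tokens * RESPONSE_USD_PER_MILLION / 1_000_000
    # Rounding hides floating-point noise such as 0.00047999999999999996.
    return round(prompt_cost, 6), round(response_cost, 6)


# Token counts taken from the two documents in the log above.
calls = [(1451, 32), (1479, 32)]
total = sum(sum(estimate_cost(p, r)) for p, r in calls)
print(estimate_cost(1451, 32), round(total, 6))
```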

In [10]:
# Optional: Generate subtopics
if config["generate_subtopics"]:
    generate_topic_lvl2(
        "openai",
        "gpt-4o",
        config["generation"]["topic_output"],
        config["generation"]["output"],
        config["generation_2"]["prompt"],
        config["generation_2"]["output"],
        config["generation_2"]["topic_output"],
        verbose=config["verbose"],
    )

-------------------
Initializing topic generation (lvl 2)...
Model: gpt-4o
Data file: true_outputs/1/generation_1_3.jsonl
Prompt file: prompt/generation_2.txt
Seed file: true_outputs/1/generation_1_3.md
Output file: true_outputs/1/generation_2_3.jsonl
Topic file: true_outputs/1/generation_2_3.md
-------------------
Number of remaining documents for prompting: 2


  0%|          | 0/1 [00:00<?, ?it/s]

Current topic: [1] Climate Change


100%|██████████| 1/1 [00:01<00:00,  1.30s/it]

Subtopics: [1] Climate Change
   [2] Climate Change Adaptation (Document: 1, 2): Focuses on strategies and measures to adjust to the effects of climate change, enhancing resilience and reducing vulnerability in various sectors.
   [2] Climate Change Mitigation (Document: 1): Involves efforts to reduce or prevent the emission of greenhouse gases, aiming to limit the magnitude of future climate change.
Climate Change Adaptation (Count: 0): Focuses on strategies and measures to adjust to the effects of climate change, enhancing resilience and reducing vulnerability in various sectors.
Climate Change Mitigation (Count: 0): Involves efforts to reduce or prevent the emission of greenhouse gases, aiming to limit the magnitude of future climate change.
--------------------------------------------------
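
The printed tree lines follow a regular pattern: a bracketed level, a topic name, an optional parenthesised `Count` or `Document` field, and an optional description. A minimal sketch of a parser for that pattern is shown below. The pattern is taken from the output above; whether the generated `.md` topic file uses exactly the same layout is an assumption, and `parse_topic_line` is an illustrative helper, not a function from `topicgpt_python`.

```python
import re

# [level] Name (Count: n | Document: ids): Description
TOPIC_LINE = re.compile(
    r"^\s*\[(?P<level>\d+)\]\s+(?P<name>.+?)"
    r"(?:\s+\((?:Count|Document):\s*(?P<meta>[^)]*)\))?"
    r"(?::\s*(?P<description>.*))?$"
)


def parse_topic_line(line):
    """Parse one topic line into a dict, or return None if it does not match."""
    match = TOPIC_LINE.match(line.rstrip())
    if match is None:
        return None
    return {
        "level": int(match["level"]),
        "name": match["name"],
        "meta": match["meta"],  # raw text after "Count:" or "Document:", if present
        "description": match["description"],
    }


# Lines copied from the subtopic output above.
lines = [
    "[1] Climate Change",
    "   [2] Climate Change Mitigation (Document: 1): Involves efforts to reduce "
    "or prevent the emission of greenhouse gases, aiming to limit the magnitude "
    "of future climate change.",
]
for line in lines:
    print(parse_topic_line(line))
```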





In [11]:
# Assignment
assign_topics(
    "openai",
    "gpt-4o-mini",
    config["data_sample"],
    config["assignment"]["prompt"],
    config["assignment"]["output"],
    # Use config["generation"]["topic_output"] if you skipped subtopic generation,
    # or config["refinement"]["topic_output"] if you refined the topics.
    config["generation_2"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic assignment...
Model: gpt-4o-mini
Data file: true_inputs/run3.jsonl
Prompt file: prompt/assignment.txt
Output file: true_outputs/1/assignment_2_3.jsonl
Topic file: true_outputs/1/generation_2_3.md
-------------------


 50%|█████     | 1/2 [00:04<00:04,  4.17s/it]

Prompt token usage: 1403 ~$0.0070149999999999995
Response token usage: 146 ~$0.00219
Response: [2] Climate Change Adaptation: The document discusses the importance of adaptation strategies in the UAE to address the impacts of climate change, highlighting the development of policies and initiatives aimed at enhancing resilience. (Supporting quote: "Considering the anticipated climate change challenges, the UAE is actively rolling out a comprehensive set of policies and initiatives at both the national and local levels. These efforts are aimed at enhancing resilience to climate change and mitigating its impacts...") 

[2] Climate Change Mitigation: The document also mentions managing greenhouse gas emissions as part of the UAE's climate strategy, indicating efforts to mitigate climate change. (Supporting quote: "The National Climate Change Plan focuses on three key objectives (i) manage GHG emissions...")
--------------------


100%|██████████| 2/2 [00:05<00:00,  2.87s/it]

Prompt token usage: 1425 ~$0.007125
Response token usage: 92 ~$0.00138
Response: [2] Climate Change Adaptation: The document discusses the need for Singapore to prepare for and adapt to the impacts of climate change, highlighting the establishment of a multi-agency Resilience Working Group to coordinate adaptation efforts. (Supporting quote: "Even with international efforts to limit the rise in global temperatures, there is a need to prepare Singapore for the impacts of climate change. Some adaptation measures require longer lead times to implement and have to be undertaken early.")
--------------------
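
Each assignment in the responses above has the shape `[level] Topic: reasoning (Supporting quote: "...")`. The sketch below pulls those parts out of a response string with a regular expression. It assumes the model keeps to that shape, which is not guaranteed, and `parse_assignments` is an illustrative helper rather than part of `topicgpt_python`. The first assignment in the sample is copied from the output above; the second is a made-up placeholder.

```python
import re

ASSIGNMENT = re.compile(
    r'\[(?P<level>\d+)\]\s+(?P<topic>[^:\n]+):\s*(?P<reason>.*?)\s*'
    r'\(Supporting quote:\s*"(?P<quote>.*?)"\)',
    re.DOTALL,
)


def parse_assignments(response):
    """Return one dict per topic assignment found in a model response."""
    return [
        {
            "level": int(m["level"]),
            "topic": m["topic"].strip(),
            "reason": m["reason"],
            "quote": m["quote"],
        }
        for m in ASSIGNMENT.finditer(response)
    ]


sample_response = """
[2] Climate Change Mitigation: The document also mentions managing greenhouse gas emissions as part of the UAE's climate strategy, indicating efforts to mitigate climate change. (Supporting quote: "The National Climate Change Plan focuses on three key objectives (i) manage GHG emissions...")

[2] Climate Change Adaptation: Placeholder reasoning. (Supporting quote: "Placeholder quote.")
"""

for assignment in parse_assignments(sample_response):
    print(assignment["topic"], "->", assignment["quote"])
```

A response that yields no matches is a useful signal that the assignment did not follow the expected format and may need the correction step.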





In [12]:
# Optional: Generate comparisons
if config["generate_comparison"]:
    comparison(
        "openai",
        "gpt-4o",
        config["assignment"]["output"],
        config["comparison"]["prompt"],
        config["comparison"]["output"],
    )

Final prompt sent to API Agent:
As a climate scientist and specialized Q&A bot with expertise in climate change, climate science, environmental science, physics, and energy science, your primary objective is: 
1. Provide an accurate and comprehensive comparison from the two documents inputted by the user.  
2. Provide detailed discussions on both similarities of the points, and differences, additionally discuss when only one topic appears in one document only 
3. In cases where sufficient information is lacking to address the comparison, reply with ’There is not enough info to answer the question.’ 
4. It’s imperative to maintain accuracy and refrain from creating information. If any aspect is unclear, do not create answers about that aspect.

[Instructions]
Here is two documents, as well as key discussion points, provide a comparison:

[Document]
Document 1:
Text: ,Introduction,Climate change adaptation holds equal significance to mitigation in the UAE, given that the,country is susce

In [22]:
baseline(
    "openai",
    "gpt-4o",
    config["assignment"]["output"],
    config["baseline"]["prompt"],
    config["baseline"]["output"],
)

Final prompt sent to API Agent:
[Instructions]
Here are two documents. Provide a comparison:

[Document]
Document 1:
Text: ,The window for decisive international action,on climate change is narrowing. The recently,completed Intergovernmental Panel on Climate,Change (IPCC) Sixth Assessment Report (AR6),cycle concluded that the effects of climate change,are widespread, rapid and intensifying.,As a low-lying island city-state, climate change,is an existential threat for Singapore. While,we account for only 0.1% of global emissions,,Singapore has taken important steps to contribute,to the global effort to tackle climate change and is,continually working to overcome our constraints to,raise our climate ambition.,At the Copenhagen Conference in 2009, Singapore,pledged to reduce emissions by 16% below our,business-as-usual (BAU) level in 2020. We are,happy to announce that Singapore has achieved,this target. This was achieved through sustained,efforts in improving energy efficiency across,var

In [23]:
evaluate(
    "openai",
    "gpt-4o",
    config["comparison"]["output"],
    config["baseline"]["output"],
    config["evaluate"]["prompt"],
    config["evaluate"]["output"],
)

Final prompt sent to API Agent:
You are an expert evaluator. Two comparisons are provided below, each generated following specific instructions. Your task is to evaluate them based on accuracy, comprehensiveness, clarity, and adherence to the guidelines provided in the comparison prompt.

Comparison One:
Comparison 1:
 **Comparison of Climate Change Strategies: Singapore and Saudi Arabia**

**1. Emissions Reduction:**

- **Singapore:** Singapore has committed to achieving net zero emissions by 2050. The country has set specific targets, such as reducing emissions to around 60 MtCO2e by 2030, and has implemented a carbon tax to support these goals. Singapore's strategy includes improving energy efficiency and transitioning to cleaner energy sources, with a focus on solar energy.

- **Saudi Arabia:** Saudi Arabia's updated Nationally Determined Contribution (NDC) aims to reduce greenhouse gas emissions by 278 million tons of CO2 equivalent by 2030. The Kingdom's approach includes the Cir