In [None]:
# TopicGPT_Python package

`topicgpt_python` consists of five modules:
- `generate_topic_lvl1` generates high-level and generalizable topics. 
- `generate_topic_lvl2` generates low-level, specific subtopics for each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.
- `correct_topics` corrects the topic assignments by reprompting the model so that each assignment is grounded in the topic list.

![topicgpt_python](assets/img/pipeline.png)

## Setup
1. Make a new Python 3.9+ environment using virtualenv or conda. 
2. Install the required packages: `pip install --upgrade topicgpt_python`.
- Our package supports the OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run.
- For pricing, refer to https://openai.com/pricing/ (OpenAI API) or https://cloud.google.com/vertex-ai/pricing (Vertex AI API).

## Usage
1. First, define the necessary file paths for I/O operations in `config.yml`. 
2. Then, import the necessary modules and functions from `topicgpt_python`.
3. Store your data in `data/input` and modify the `data_sample` path in `config.yml`. 

- Prepare your `.jsonl` data file with one JSON object per line, each using the following fields (shown expanded here for readability):
    ```
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    ```
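
As a minimal sketch of the format above, the snippet below writes and re-reads a small `.jsonl` file using only the standard library. The record contents, the `write_jsonl`/`read_jsonl` helper names, and the temporary path are illustrative assumptions and are not part of `topicgpt_python`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample records: "id" and "label" are optional, "text" is not.
records = [
    {"id": "doc-1", "text": "First document body.", "label": "energy"},
    {"text": "Second document body."},
]


def write_jsonl(path, rows):
    """Write one compact JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            if "text" not in row:
                raise ValueError("every record needs a 'text' field")
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "data" / "input" / "sample.jsonl"
    write_jsonl(sample_path, records)
    loaded = read_jsonl(sample_path)

print(loaded)
```

In a real run you would write to a path under `data/input` and point `data_sample` in `config.yml` at it.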

In [5]:
import os
import sys

import yaml

# Make the local `topicgpt_python` folder importable.
folder_path = os.path.abspath("topicgpt_python")
sys.path.append(folder_path)

# Import the pipeline functions directly from the module files.
from data_sample import sample_data
from generation_1 import generate_topic_lvl1
from generation_2 import generate_topic_lvl2
from refinement import refine_topics
from assignment import assign_topics
from correction import correct_topics
from comparison import comparison
from baseline import baseline
from evaluate import evaluate

# Load the file paths and run options used by the cells below.
with open("config.yml", "r") as f:
    config = yaml.safe_load(f)
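
Before running the pipeline it can help to confirm that `config.yml` defines every key the cells in this notebook read. The sketch below is one way to do that; the key list is collected from the calls in this notebook only, and `missing_config_keys` is a hypothetical helper, not a function of `topicgpt_python`.

```python
# Keys read by the calls in this notebook (an assumption drawn from those
# calls; the real config.yml may define more).
TOP_LEVEL_KEYS = ["data_sample", "verbose", "generate_subtopics", "generate_comparison"]
SECTION_KEYS = {
    "generation": ["prompt", "seed", "output", "topic_output"],
    "generation_2": ["prompt", "output", "topic_output"],
    "assignment": ["prompt", "output"],
    "comparison": ["prompt", "output"],
    "baseline": ["prompt", "output"],
    "evaluate": ["prompt", "output"],
}


def missing_config_keys(cfg):
    """Return dotted names of expected keys that are absent from cfg."""
    missing = [key for key in TOP_LEVEL_KEYS if key not in cfg]
    for section, keys in SECTION_KEYS.items():
        block = cfg.get(section)
        if not isinstance(block, dict):
            block = {}
        missing += [f"{section}.{key}" for key in keys if key not in block]
    return missing


# Deliberately incomplete example config with hypothetical values.
example_cfg = {
    "data_sample": "data/input/sample.jsonl",
    "verbose": True,
    "generate_subtopics": True,
    "generation": {"prompt": "prompt/generation_1.txt", "seed": "prompt/seed_1.md"},
}
missing = missing_config_keys(example_cfg)
print(missing)
```

Calling `missing_config_keys(config)` on the loaded configuration and checking for an empty list gives an early failure instead of a `KeyError` partway through a paid API run.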

In [6]:

generate_topic_lvl1(
    "openai",
    "gpt-4o",
    config["data_sample"],
    config["generation"]["prompt"],
    config["generation"]["seed"],
    config["generation"]["output"],
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic generation...
Model: gpt-4o
Data file: true_inputs/run4.jsonl
Prompt file: prompt/generation_1.txt
Seed file: prompt/seed_1.md
Output file: true_outputs/1/generation_1_4.jsonl
Topic file: true_outputs/1/generation_1_4.md
-------------------


 50%|█████     | 1/2 [00:02<00:02,  3.00s/it]

Prompt token usage: 9210 ~$0.046049999999999994
Response token usage: 169 ~$0.002535
Topics: [1] Trade: Mentions the exchange of capital, goods, and services.  
[1] Agriculture: Mentions policies relating to agricultural practices and products.  
[1] Climate: Discusses climate conditions, climate change impacts, and related policies.  
[1] Energy: Covers topics related to energy production, consumption, and resources.  
[1] Water Resources: Involves the management and use of water resources, including desalination and water conservation.  
[1] Tourism: Pertains to the development and promotion of tourism and related infrastructure.  
[1] Education: Relates to educational systems, policies, and development.  
[1] Health: Concerns healthcare systems, policies, and services.  
[1] Industrial Development: Involves the growth and management of industrial sectors and economic zones.
--------------------


100%|██████████| 2/2 [00:10<00:00,  5.46s/it]

Prompt token usage: 13177 ~$0.065885
Response token usage: 537 ~$0.008055000000000001
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Topics: [1] Climate: The document discusses the UAE's comprehensive approach to climate policies, including the establishment of the UAE Council on Climate Action, the National Climate Change Plan, and the UAE Net Zero strategy. It highlights the UAE's commitment to addressing climate change, reducing emissions, and adapting to its effects through various strategies and initiatives.

[1] Energy: The document mentions the UAE's efforts to expand clean energy generation, explore eco-friendly technologies, and develop a carbon capture and storage industry. It also discusses the UAE's ro




<topicgpt_python.utils.TopicTree at 0x7fb53bca4820>
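
The topic lines printed above follow the pattern `[level] Name: description`, optionally with a parenthesised annotation such as `(Count: 0)` or `(Document: 1)` after the name. The sketch below parses a single line of that shape. The regular expression and the `parse_topic_line` name are assumptions made for illustration; this is not the parser the package uses, and it only shows why an empty string cannot be read as a topic.

```python
import re

# [level] Name (optional annotation): description
TOPIC_LINE = re.compile(
    r"^\s*\[(\d+)\]\s*(.+?)"
    r"(?:\s*\(((?:Count|Document):[^)]*)\))?"
    r"\s*:\s*(.+)$"
)


def parse_topic_line(line):
    """Return a dict for a well-formed topic line, or None otherwise."""
    match = TOPIC_LINE.match(line)
    if match is None:
        return None
    level, name, annotation, description = match.groups()
    return {
        "level": int(level),
        "name": name.strip(),
        "annotation": annotation,
        "description": description.strip(),
    }


top = parse_topic_line("[1] Trade: Mentions the exchange of capital, goods, and services.  ")
sub = parse_topic_line("   [2] Exports (Document: 1): Discusses the promotion and regulation of exports.")
empty = parse_topic_line("")
print(top, sub, empty, sep="\n")
```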

In [7]:
# Optional: Generate subtopics
if config["generate_subtopics"]:
    generate_topic_lvl2(
        "openai",
        "gpt-4o",
        config["generation"]["topic_output"],
        config["generation"]["output"],
        config["generation_2"]["prompt"],
        config["generation_2"]["output"],
        config["generation_2"]["topic_output"],
        verbose=config["verbose"],
    )

-------------------
Initializing topic generation (lvl 2)...
Model: gpt-4o
Data file: true_outputs/1/generation_1_4.jsonl
Prompt file: prompt/generation_2.txt
Seed file: true_outputs/1/generation_1_4.md
Output file: true_outputs/1/generation_2_4.jsonl
Topic file: true_outputs/1/generation_2_4.md
-------------------
Number of remaining documents for prompting: 2


  0%|          | 0/12 [00:00<?, ?it/s]

Current topic: [1] Trade


  8%|▊         | 1/12 [00:01<00:19,  1.81s/it]

Subtopics: [1] Trade
   [2] Exports (Document: 1): Discusses the promotion and regulation of exports, including policies and coordination efforts.
   [2] Economic Development (Document: 1): Mentions economic diversification and development plans, such as Saudi Vision 2030, which includes trade-related initiatives.
   [2] Industrial Development (Document: 1): Covers the development of industrial zones and economic cities, which are relevant to trade and economic growth.
Exports (Count: 0): Discusses the promotion and regulation of exports, including policies and coordination efforts.
Economic Development (Count: 0): Mentions economic diversification and development plans, such as Saudi Vision 2030, which includes trade-related initiatives.
Industrial Development (Count: 0): Covers the development of industrial zones and economic cities, which are relevant to trade and economic growth.
--------------------------------------------------
Current topic: [1] Agriculture


 17%|█▋        | 2/12 [00:26<02:29, 15.00s/it]

Subtopics: [1] Agriculture
   [2] Climate and Environment (Document: 1, 2): Discusses the impact of climate change on agriculture and the environment, including water scarcity and sustainable practices.
   [2] Water Resources (Document: 1, 2): Covers the management and use of water resources, including surface water, groundwater, and desalination, which are crucial for agriculture.
   [2] Sustainable Agriculture Practices (Document: 1, 2): Focuses on innovative agricultural technologies and practices, such as hydroponics and organic farming, to enhance food security and sustainability.
   [2] Economic Development and Diversification (Document: 1, 2): Relates to the role of agriculture in economic diversification and development, including initiatives to boost agricultural productivity and sustainability.
Climate and Environment (Count: 0): Discusses the impact of climate change on agriculture and the environment, including water scarcity and sustainable practices.
Water Resources (Coun

 25%|██▌       | 3/12 [01:10<04:15, 28.41s/it]

Subtopics: [1] Climate
   [2] Climate Change Adaptation (Document: 1, 2): Discusses strategies and actions taken to adapt to the impacts of climate change, including policies and initiatives aimed at managing climate risks and enhancing resilience.
   [2] Climate Change Mitigation (Document: 1, 2): Focuses on efforts to reduce greenhouse gas emissions and implement strategies to mitigate the effects of climate change, such as the UAE's Net Zero strategy and Saudi Arabia's Vision 2030.
   [2] Climate and Geography (Document: 1, 2): Covers the relationship between climate and geographical features, including how topography and location influence climate patterns and conditions in Saudi Arabia and the UAE.
Climate Change Adaptation (Count: 0): Discusses strategies and actions taken to adapt to the impacts of climate change, including policies and initiatives aimed at managing climate risks and enhancing resilience.
Climate Change Mitigation (Count: 0): Focuses on efforts to reduce greenho

 33%|███▎      | 4/12 [01:55<04:39, 34.98s/it]

Subtopics: [1] Energy
   [2] Oil and Gas (Document: 1): Discusses the reserves, production, and export of oil and gas in Saudi Arabia.
   [2] Electricity (Document: 1): Covers the power generation capacity, types of power plants, and electricity consumption in Saudi Arabia.
   [2] Renewable Energy (Document: 2): Mentions the UAE's efforts in expanding clean energy generation and renewable energy investments.
   [2] Climate Change and Energy Policy (Document: 2): Discusses the UAE's climate policies, including the National Climate Change Plan and the Net Zero by 2050 Strategic Initiative.
Oil and Gas (Count: 0): Discusses the reserves, production, and export of oil and gas in Saudi Arabia.
Electricity (Count: 0): Covers the power generation capacity, types of power plants, and electricity consumption in Saudi Arabia.
Renewable Energy (Count: 0): Mentions the UAE's efforts in expanding clean energy generation and renewable energy investments.
Climate Change and Energy Policy (Count: 0): 

 42%|████▏     | 5/12 [02:38<04:25, 37.92s/it]

Subtopics: [1] Water Resources
   [2] Surface Water (Document: 1): Discusses the management and storage of surface water through dams and runoff collection.
   [2] Groundwater (Document: 1): Covers the use and categorization of groundwater resources, including shallow and deep aquifers.
   [2] Desalinated Water (Document: 1): Focuses on the production and technologies used for desalinating water to meet potable water demands.
   [2] Reclaimed Wastewater (Document: 1): Describes the treatment and reuse of wastewater for various purposes.
   [2] Water Demand and Management (Document: 1): Addresses the water demand across different sectors and strategies for efficient water use.
Surface Water (Count: 0): Discusses the management and storage of surface water through dams and runoff collection.
Groundwater (Count: 0): Covers the use and categorization of groundwater resources, including shallow and deep aquifers.
Desalinated Water (Count: 0): Focuses on the production and technologies used 

 50%|█████     | 6/12 [03:23<04:01, 40.17s/it]

Subtopics: [1] Tourism
   [2] Mega Projects (Document: 1): Discusses large-scale development projects aimed at boosting tourism and economic diversification in Saudi Arabia.
   [2] Cultural Heritage (Document: 1): Mentions historical and cultural sites that attract tourists, such as UNESCO World Heritage Sites.
   [2] Economic Diversification (Document: 1): Relates to efforts to diversify the economy through tourism as part of Saudi Vision 2030.
   [2] Infrastructure Development (Document: 1): Covers the development of infrastructure to support tourism, including new hotels, airports, and entertainment facilities.
Mega Projects (Count: 0): Discusses large-scale development projects aimed at boosting tourism and economic diversification in Saudi Arabia.
Cultural Heritage (Count: 0): Mentions historical and cultural sites that attract tourists, such as UNESCO World Heritage Sites.
Economic Diversification (Count: 0): Relates to efforts to diversify the economy through tourism as part of 

 58%|█████▊    | 7/12 [03:35<02:34, 30.94s/it]

Subtopics: [1] Education
    [2] Educational Development (Document: 1): Focuses on the expansion and enhancement of education systems and policies in Saudi Arabia, including government initiatives and private sector involvement.
    [2] Vision 2030 and Education (Document: 1): Discusses the role of education within the broader context of Saudi Vision 2030, highlighting goals for literacy, gender equality in education, and the continuous education program.
Educational Development (Count: 0): Focuses on the expansion and enhancement of education systems and policies in Saudi Arabia, including government initiatives and private sector involvement.
Vision 2030 and Education (Count: 0): Discusses the role of education within the broader context of Saudi Vision 2030, highlighting goals for literacy, gender equality in education, and the continuous education program.
--------------------------------------------------
Current topic: [1] Health


 67%|██████▋   | 8/12 [04:26<02:30, 37.50s/it]

Subtopics: [1] Health
    [2] Healthcare System (Document: 1): Discusses the development and quality of healthcare services in Saudi Arabia, including government initiatives and budget allocations.
    [2] Climate Change and Health (Document: 2): Covers the impact of climate change on health, including strategies for adaptation and mitigation in the UAE.
    [2] Health and Economic Development (Document: 1): Explores the role of health in economic development plans, such as Saudi Vision 2030, and the integration of health services in national development strategies.
Healthcare System (Count: 0): Discusses the development and quality of healthcare services in Saudi Arabia, including government initiatives and budget allocations.
Climate Change and Health (Count: 0): Covers the impact of climate change on health, including strategies for adaptation and mitigation in the UAE.
Health and Economic Development (Count: 0): Explores the role of health in economic development plans, such as Sau

 75%|███████▌  | 9/12 [05:11<01:59, 39.88s/it]

Subtopics: [1] Industrial Development
   [2] Vision 2030 (Document: 1): Discusses Saudi Arabia's Vision 2030, focusing on economic diversification and sustainable development.
   [2] Energy (Document: 1, 2): Covers the energy sector, including oil, gas, and renewable energy initiatives in Saudi Arabia and the UAE.
   [2] Water Resources (Document: 1): Discusses water scarcity, desalination, and water management strategies in Saudi Arabia.
   [2] Agriculture (Document: 1, 2): Covers agricultural development, food security, and sustainable farming practices in Saudi Arabia and the UAE.
   [2] Climate Change and Environment (Document: 1, 2): Discusses climate change impacts, environmental policies, and conservation efforts in Saudi Arabia and the UAE.
Vision 2030 (Count: 0): Discusses Saudi Arabia's Vision 2030, focusing on economic diversification and sustainable development.
Energy (Count: 0): Covers the energy sector, including oil, gas, and renewable energy initiatives in Saudi Arabia

 83%|████████▎ | 10/12 [05:35<01:09, 34.87s/it]

Subtopics: [1] Governance
    [2] Federal Structure (Document: 1): Discusses the federal governance system of the UAE, including the roles of the Federal Supreme Council, Federal National Council, and the Cabinet.
    [2] Climate Policy Governance (Document: 1): Covers the UAE's approach to climate policies through a comprehensive governance framework, including the UAE Council on Climate Action and the National Climate Change Plan.
    [2] Local Government Structures (Document: 1): Describes the variations in local government structures across the seven emirates, including executive councils and autonomous agencies.
    [2] Ministries and Their Roles (Document: 1): Details the functions and responsibilities of various UAE ministries, such as the Ministry of Defence, Ministry of Finance, and Ministry of Climate Change and Environment.
Federal Structure (Count: 0): Discusses the federal governance system of the UAE, including the roles of the Federal Supreme Council, Federal National Co

 92%|█████████▏| 11/12 [06:02<00:32, 32.44s/it]

Subtopics: [1] Environment
   [2] Climate Change (Document: 1): Discusses the UAE's strategies and initiatives to address climate change, including the National Climate Change Plan and the Net Zero by 2050 Strategic Initiative.
   [2] Biodiversity Conservation (Document: 1): Covers the UAE's efforts in conserving biodiversity, including federal laws, international commitments, and specific projects like the National Red List Project and mangrove rehabilitation.
   [2] Sustainable Agriculture (Document: 1): Describes the UAE's initiatives to promote sustainable agricultural practices, including Ag Tech Accelerators, hydroponics, and the Food Tech Valley project.
Climate Change (Count: 0): Discusses the UAE's strategies and initiatives to address climate change, including the National Climate Change Plan and the Net Zero by 2050 Strategic Initiative.
Biodiversity Conservation (Count: 0): Covers the UAE's efforts in conserving biodiversity, including federal laws, international commitment

100%|██████████| 12/12 [06:28<00:00, 32.41s/it]

Subtopics: [1] Economy
   [2] Governance (Document: 1): Discusses the governance structure and political framework of the UAE, including federal and emirate-level policies and initiatives.
   [2] Climate Change and Environment (Document: 1): Covers the UAE's strategies and initiatives for addressing climate change, environmental protection, and sustainability.
   [2] Energy (Document: 1): Focuses on the UAE's energy sector, including diversification efforts, renewable energy initiatives, and energy security.
   [2] Agriculture (Document: 1): Describes the UAE's agricultural sector, including sustainable practices, technological advancements, and food security initiatives.
Governance (Count: 0): Discusses the governance structure and political framework of the UAE, including federal and emirate-level policies and initiatives.
Climate Change and Environment (Count: 0): Covers the UAE's strategies and initiatives for addressing climate change, environmental protection, and sustainability.
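
Several names in the output above appear under more than one parent or at both levels (for example `Water Resources` and `Agriculture`), which is the kind of overlap the refinement step is meant to merge. The sketch below lists such repeated names from lines in the printed format. The `find_repeated_names` helper and the abbreviated sample lines are illustrative assumptions, not part of the package.

```python
import re
from collections import defaultdict

# Captures the level and the name, stopping before "(" or ":".
TOPIC_RE = re.compile(r"^\s*\[(\d+)\]\s*([^:(]+)")


def find_repeated_names(lines):
    """Map each name that occurs more than once to its (level, parent) placements."""
    placements = defaultdict(list)
    parent = None
    for line in lines:
        match = TOPIC_RE.match(line)
        if match is None:
            continue
        level, name = int(match.group(1)), match.group(2).strip()
        if level == 1:
            parent = name
            placements[name].append((1, None))
        else:
            placements[name].append((level, parent))
    return {name: seen for name, seen in placements.items() if len(seen) > 1}


# Abbreviated lines in the shape of the output above (descriptions shortened).
sample = [
    "[1] Agriculture",
    "   [2] Water Resources (Document: 1, 2): Covers water management.",
    "[1] Water Resources",
    "   [2] Groundwater (Document: 1): Covers groundwater resources.",
    "[1] Industrial Development",
    "   [2] Water Resources (Document: 1): Discusses water scarcity.",
    "   [2] Agriculture (Document: 1, 2): Covers agricultural development.",
]
repeated = find_repeated_names(sample)
for name, seen in repeated.items():
    print(name, seen)
```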




In [8]:
# Assignment
assign_topics(
    "openai",
    "gpt-4o-mini",
    config["data_sample"],
    config["assignment"]["prompt"],
    config["assignment"]["output"],
    # Topic file: the generation_2 output when subtopics were generated;
    # use config["refinement"]["topic_output"] instead if you refined topics.
    config["generation_2"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic assignment...
Model: gpt-4o-mini
Data file: true_inputs/run4.jsonl
Prompt file: prompt/assignment.txt
Output file: true_outputs/1/assignment_2_4.jsonl
Topic file: true_outputs/1/generation_2_4.md
-------------------


 50%|█████     | 1/2 [00:06<00:06,  6.16s/it]

Prompt token usage: 10632 ~$0.053160000000000006
Response token usage: 476 ~$0.0071400000000000005
Response: [2] Economic Development: The document discusses Saudi Vision 2030, which aims to diversify the economy and includes various development initiatives. "The main goal of Vision 2030 is 'to raise the share of non-oil exports in non-oil GDP from 16% to 50%.'"

[2] Industrial Development: The document mentions the development of industrial zones and economic cities as part of economic diversification efforts. "The Kingdom of Saudi Arabia has been developing a number of industrial zones and economic cities to achieve economic development and diversification of the economy."

[2] Water Resources: The document addresses the management and use of water resources, including groundwater and desalination, which are crucial for agriculture and overall sustainability. "The Kingdom of Saudi Arabia is one of the world’s most water scarce country with an average rainfall of approximately 100-150

100%|██████████| 2/2 [00:11<00:00,  5.64s/it]

Prompt token usage: 14552 ~$0.07276
Response token usage: 353 ~$0.005295
Response: [2] Climate Change Mitigation: The document discusses the UAE's commitment to addressing climate change through various strategies and initiatives, including the development of the UAE Net Zero strategy and the National Climate Change Plan. ("In 2017, the UAE introduced the National Climate Change Plan 2017-2050, which serves as a blueprint for managing GHG emissions, climate adaptation strategies, and promoting economic diversification through innovation in the private sector.")

[2] Climate Change Adaptation: The document highlights the UAE's actions to adapt to the effects of climate change, including the establishment of the UAE Council on Climate Action and various policies aimed at enhancing resilience. ("The UAE has demonstrated its commitment to addressing climate change and adapting to its effects through a series of decisive actions.")

[2] Economic Development and Diversification: The document
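
The dollar figures in the usage lines logged above can be reproduced with simple per-token arithmetic. The rates below ($5 per million prompt tokens, $15 per million response tokens) are inferred from the logged numbers, not taken from the package source or a pricing page. The same rates also reproduce the figures logged for the `gpt-4o-mini` assignment run, so the printed estimate may not track the actual price of the model in use; treat it as a rough guide and check the provider's pricing page.

```python
# Rates inferred from the logs above (assumption, USD per million tokens).
PROMPT_RATE_PER_M = 5.00
RESPONSE_RATE_PER_M = 15.00


def estimate_cost(tokens, rate_per_million):
    """Estimated cost in USD for a token count at a per-million-token rate."""
    return tokens * rate_per_million / 1_000_000


# Token counts copied from the logged usage lines.
logged_usage = [
    ("generation, prompt", 9210, PROMPT_RATE_PER_M),
    ("generation, response", 169, RESPONSE_RATE_PER_M),
    ("assignment, prompt", 10632, PROMPT_RATE_PER_M),
    ("assignment, response", 476, RESPONSE_RATE_PER_M),
]
for label, tokens, rate in logged_usage:
    print(f"{label}: {tokens} tokens ~${estimate_cost(tokens, rate):.6f}")
```

Rounding the result for display, as done here, avoids the long floating-point tails seen in the raw log lines.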




In [12]:
# Optional: Generate comparisons
if config["generate_comparison"]:
    comparison(
        "openai",
        "gpt-4o",
        config["assignment"]["output"],
        config["comparison"]["prompt"],
        config["comparison"]["output"],
    )

Final prompt sent to API Agent:
As a climate scientist and specialized Q&A bot with expertise in climate change, climate science, environmental science, physics, and energy science, your primary objective is: 
1. Provide an accurate and comprehensive comparison from the two documents inputted by the user.  
2. Provide detailed discussions on both similarities of the points, and differences, additionally discuss when only one topic appears in one document only 
3. In cases where sufficient information is lacking to address the comparison, reply with ’There is not enough info to answer the question.’ 
4. It’s imperative to maintain accuracy and refrain from creating information. If any aspect is unclear, do not create answers about that aspect.

[Instructions]
Here is two documents, as well as key discussion points, provide a comparison:

[Document]
Document 1:
Text: ,Introduction,Climate change adaptation holds equal significance to mitigation in the UAE, given that the,country is susce

In [22]:
baseline(
    "openai",
    "gpt-4o",
    config["assignment"]["output"],
    config["baseline"]["prompt"],
    config["baseline"]["output"],
)

Final prompt sent to API Agent:
[Instructions]
Here are two documents. Provide a comparison:

[Document]
Document 1:
Text: ,The window for decisive international action,on climate change is narrowing. The recently,completed Intergovernmental Panel on Climate,Change (IPCC) Sixth Assessment Report (AR6),cycle concluded that the effects of climate change,are widespread, rapid and intensifying.,As a low-lying island city-state, climate change,is an existential threat for Singapore. While,we account for only 0.1% of global emissions,,Singapore has taken important steps to contribute,to the global effort to tackle climate change and is,continually working to overcome our constraints to,raise our climate ambition.,At the Copenhagen Conference in 2009, Singapore,pledged to reduce emissions by 16% below our,business-as-usual (BAU) level in 2020. We are,happy to announce that Singapore has achieved,this target. This was achieved through sustained,efforts in improving energy efficiency across,var

In [23]:
evaluate(
    "openai",
    "gpt-4o",
    config["comparison"]["output"],
    config["baseline"]["output"],
    config["evaluate"]["prompt"],
    config["evaluate"]["output"],
)

Final prompt sent to API Agent:
You are an expert evaluator. Two comparisons are provided below, each generated following specific instructions. Your task is to evaluate them based on accuracy, comprehensiveness, clarity, and adherence to the guidelines provided in the comparison prompt.

Comparison One:
Comparison 1:
 **Comparison of Climate Change Strategies: Singapore and Saudi Arabia**

**1. Emissions Reduction:**

- **Singapore:** Singapore has committed to achieving net zero emissions by 2050. The country has set specific targets, such as reducing emissions to around 60 MtCO2e by 2030, and has implemented a carbon tax to support these goals. Singapore's strategy includes improving energy efficiency and transitioning to cleaner energy sources, with a focus on solar energy.

- **Saudi Arabia:** Saudi Arabia's updated Nationally Determined Contribution (NDC) aims to reduce greenhouse gas emissions by 278 million tons of CO2 equivalent by 2030. The Kingdom's approach includes the Cir