# TopicGPT

In [31]:
def qry1(document: str) -> str:
    """
    This function generates a query to identify generalizable topics from a document based on a given topic hierarchy.
    
    Args:
    document (str): The text of the document to analyze for topics.
    
    Returns:
    str: A formatted query string that includes the document and instructions for identifying topics.
    """
    query_template = f"""
    You will receive a document and a set of top-level topics from a topic hierarchy. Your task is to identify generalizable topics
    within the document that can act as top-level topics in the hierarchy. If any relevant topics are missing from the provided set,
    please add them. Otherwise, output the existing top-level topics as identified in the document.
    [Top-level topics]
    [1] World news
    [1] Weather
    [1] Estonian local news

    [Examples]
    Example 1: Adding "[1] Estonian local news"
    Document:
    Osaühisus „Mootor“ teatas teedeministeeriumile, et täistuisanud teede tõttu on katkestatud ühendus järgmiste liinide vahel:\n\n- Tallinn—Märjamaa\n- Tallinn—Loksa—Viinistu\n- Tallinn—Tsitri—Leesi\n\nSõidugraafikute täitmine oli seetõttu ajutiselt wõimatu
    Your response:
    [1] Estonian local news: Mentions local transportation disruptions due to weather conditions.
    Example 2: Duplicate "[1] Weather", returning the existing topic
    Document:
    Eile hommikpoolsel ööl sadas kogu EeStis lund. Madalrõhkkond on lõuna ja kagu poole liikunud. Teine madalrõhkkond asub Põhja…b metobs.: merel paiguti kõva, maal keskmise kiirusega põhjakaarte tuuled. Pilvine. Paiguti lumesajud. Temperatuur madalam.
    Your response:
    [1] Weather: Mentions atmospheric conditions, precipitation, and temperature variations.
    [Instructions]
    Step 1: Determine topics mentioned in the document.
    - The topic labels must be as GENERALIZABLE as possible. They must not be document-specific.
    - The topics must reflect a SINGLE topic instead of a combination of topics.
    - The new topics must have a level number, a short general label, and a topic description.
    - The topics must be broad enough to accommodate future subtopics.
    Step 2: Perform ONE of the following operations:
    1. If there are already duplicates or relevant topics in the hierarchy, output those topics and stop here.
    2. If the document contains no topic, return "None".
    3. Otherwise, add your topic as a top-level topic. Stop here and output the added topic(s). DO NOT add any additional levels.
    [Document]
    {document}
    Please ONLY return the relevant or modified topics at the top level in the hierarchy.
    [Your response]
    """
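    # The f-string above has already interpolated `document`; calling .format() on the
    # result would raise if the document text contains literal braces.
    return query_template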
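
The prompt above asks the model to answer with lines such as `[1] Weather: Mentions ...`. Below is a minimal sketch of turning such a reply into structured records, assuming the model actually follows that format. `parse_topic_lines` is a hypothetical helper, not part of this notebook or of TopicGPT, and it only covers the first-level format of `qry1` (the `(Document: 1, 3)` annotations requested by `qry2` would need a different pattern).

```python
import re

# Matches lines like "[1] Weather: Mentions atmospheric conditions."
# The description is optional, so a bare "[1] Estonian local news" also parses.
TOPIC_LINE = re.compile(r"^\s*\[(\d+)\]\s*([^:]+?)\s*(?::\s*(.*))?$")


def parse_topic_lines(reply: str) -> list:
    """Hypothetical helper: extract level, label and description from a model reply."""
    topics = []
    for line in reply.splitlines():
        match = TOPIC_LINE.match(line)
        if match is None:
            continue  # skip blank lines, "None", or any free-form text
        level, label, description = match.groups()
        topics.append({
            "level": int(level),
            "label": label.strip(),
            "description": (description or "").strip(),
        })
    return topics


reply = "[1] Weather: Mentions atmospheric conditions.\n[1] Estonian local news"
print(parse_topic_lines(reply))
```

Lines that do not start with a level marker are dropped silently, so a reply of `None` yields an empty list.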

In [39]:
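def qry2(document: str, topic: str) -> str:
    """
    Build the prompt that proposes second-level subtopics for one top-level topic.

    Args:
    document (str): The document text assigned to the top-level topic.
    topic (str): The top-level topic branch, e.g. "[1] Estonian local news".

    Returns:
    str: The prompt with the topic branch and document filled in.
    """
    query_template = f"""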
    You will receive a branch from a topic hierarchy along with some documents assigned to the top-level topic of that branch. Your
    task is to identify generalizable second-level topics that can act as subtopics to the top-level topic in the provided branch. Add
    your topic(s) if they are missing from the provided branch. Otherwise, return the existing relevant or duplicate topics.
    [Example] (Return "[2] Infrastructure" (new) and "[2] Social Event" (existing) as the subtopics of "[1] Estonian local news" (provided).)
    Topic branch:
    [1] Estonian local news
    [2] Social Event
    [2] Education
    Document 1:
    Praegu on juba vundament valmis. 
    Kiriku plaan on konsistooriumi poolt kinnitatud. 
    Kogudus on juba kaunis rohkesti ehitusmaterjali kokku vedanud, teliskiva, laudu, palke 250.000 mk. eest. 
    Siiamaalne ehitus on juba 11/»l 1 /» milj. mk. maksnud. 
    Sel suvel loodab kogudus ehitust jätkata, ehk ta küll ühe kolmandiku võrra väiksemaks on jäänud selle tõttu, et piiri tõmbamise läbi suur osa maad on Lätimaale jäänud. 
    Document 2: 
    Esimese üritusena pidustuste sarjas oli mälestusrännak V.Maarja kalmistule esimese ärijuhi G. Nosenbergi kalmukünkale,
    Document 3:
    alalisele tänavavalgustusele pandi alus alles 1793. aastal, mil 3. juunil nõudis Tartu politseiülem linnanõukogult, et asehaldurkonna valitsuse määruse kohaselt celseisvast sügisest alates valgustataks linna ja eeslinna laternatega.
    Your response:
    [1] Estonian local news
    [2] Infrastructure (Document: 1, 3): Mentions building of a church and street lighting.
    [2] Social Event (Document: 2): Mentions a memorial event at a cemetery.
    [Instructions]
    Step 1: Determine PRIMARY and GENERALIZABLE topics mentioned in the documents.
    - The topics must be generalizable among the provided documents.
    - Each topic must not be too specific so that it can accommodate future subtopics.
    - Each topic must reflect a SINGLE topic instead of a combination of topics.
    - Each top-level topic must have a level number and a short label. Second-level topics should also include the original documents
    associated with these topics (separated by commas) as well as a short description of the topic.
    - The number of topics proposed cannot exceed the number of documents provided.
    Step 2: Perform ONE of the following operations:
    1. If the provided top-level topic is specific enough, DO NOT add any subtopics. Return the provided top-level topic.
    2. If your topic is duplicate or relevant to the provided topics, DO NOT add any subtopics. Return the existing relevant topic.
    3. If your topic is relevant to and more specific than the provided top-level topic, add your topic as a second-level topic. DO
    NOT add to the first or third level of the hierarchy.
    [Topic branch]
    {topic}
    [Documents]
    {document}
    DO NOT add first- or third-level topics.
    [Your response]
    """
    # `topic` and `document` were already interpolated by the f-string, so no .format() call is needed.
    return query_template

In [None]:
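def qry3(topic: list) -> str:
    """
    Build the prompt that merges near-duplicate topics on one level of the hierarchy.

    Args:
    topic (list): Topics of the same level; the list is inserted into the prompt as-is.

    Returns:
    str: The prompt with the topic list filled in.
    """
    query_template = f"""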
    You will receive a list of topics that belong to the same level of a topic hierarchy. Your task is to merge topics that are paraphrases
    or near duplicates of one another. Return "None" if no modification is needed.
    [Examples]
    Example 1: Merging topics ("[1] Employer Taxes" and "[1] Employment Tax Reporting" into "[1] Employment Taxes")
    Topic List:
    [1] Employer Taxes: Mentions taxation policy for employer
    [1] Employment Tax Reporting: Mentions reporting requirements for employer
    [1] Immigration: Mentions policies and laws on the immigration process
    [1] Voting: Mentions rules and regulation for the voting process
    Your response:
    [1] Employment Taxes: Mentions taxation report and requirement for employer ([1] Employer Taxes, [1] Employment Tax
    Reporting)
    Example 2: Merging topics ("[2] Digital Literacy" and "[2] Telecommunications" into "[2] Technology")
    Topic List:
    [2] Mathematics: Discuss mathematical concepts, figures and breakthroughs.
    [2] Digital Literacy: Discuss the ability to use technology to find, evaluate, create, and communicate information.
    [2] Telecommunications: Mentions policies and regulations related to the telecommunications industry, including wireless
    service providers and consumer rights.
    Your response:
    [2] Technology: Discuss technology and its impact on society. ([2] Digital Literacy, [2] Telecommunications)
    [Rules]
    - Each line represents a topic, with a level indicator and a topic label.
    - Perform the following operations as many times as needed:
    - Merge relevant topics into a single topic.
    - Do nothing and return "None" if no modification is needed.
    - When merging, the output format should contain a level indicator, the updated label and description, followed by the original
    topics.
    [Topic List]
    {topic}
    Output the modification or "None" where appropriate. Do not output anything else.
    [Your response]
    """
    # `topic` was already interpolated by the f-string; the template has no `Topics` placeholder.
    return query_template
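
The rules in this prompt say that each line represents a topic with a level indicator and a label, while a Python list placed in an f-string is rendered as a single bracketed line. A minimal sketch of producing the line-per-topic form first, assuming all topics share one level; `format_topic_list` is a hypothetical helper, and whether this changes the model's merges has not been tested here.

```python
def format_topic_list(labels, level):
    """Hypothetical helper: render labels as '[level] label', one topic per line."""
    return "\n".join(f"[{level}] {label}" for label in labels)


# The resulting string could be passed to qry3 in place of the raw list.
print(format_topic_list(["Military", "Media", "Crime"], level=2))
```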

In [54]:
def qry4(topic_hierarchy: list, document: str) -> str:
    """Build the prompt that assigns a document to topics from an existing hierarchy."""
    query_template = f"""
    You will receive a document and a topic hierarchy. Assign the document to the most relevant topics in the hierarchy. Then, output
    the topic labels, assignment reasoning and supporting quotes from the document. DO NOT make up new topics or quotes.
    Here is the topic hierarchy:
    {topic_hierarchy}
    [Examples]
    Example 1: Assign "[1] Estonian local news" to the document
    Document:
    Osaühisus „Mootor“ teatas teedeministeeriumile, et täistuisanud teede tõttu on katkestatud ühendus järgmiste liinide vahel:\n\n- Tallinn—Märjamaa\n- Tallinn—Loksa—Viinistu\n- Tallinn—Tsitri—Leesi\n\nSõidugraafikute täitmine oli seetõttu ajutiselt wõimatu
    Your response:
    [1] Estonian local news: Mentions local transportation disruptions due to weather conditions.
    Example 2: Assign "[2] Social event" to the document
    Document:
    Esimese üritusena pidustuste sarjas oli mälestusrännak V.Maarja kalmistule esimese ärijuhi G. Nosenbergi kalmukünkale,
    Your response:
    [1] Estonian local news
    [2] Social event: Mentions a memorial event at a cemetery.
    [Instructions]
    1. Topic labels must be present in the provided topic hierarchy. You MUST NOT make up new topics.
    2. The quote must be taken from the document. You MUST NOT make up quotes.
    3. If the assigned topic is not on the top level, you must also output the path from the top-level topic to the assigned topic.
    [Document]
    {document}
    Double check that your assignment exists in the hierarchy!
    [Your response]
    """
    # Both placeholders were already interpolated by the f-string; the template has no `Topics` field.
    return query_template
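
When `topic_hierarchy` is a nested Python list, the f-string inserts its `repr`, brackets and quotes included. A minimal sketch of rendering the hierarchy as indented text instead, assuming every entry starts with a `[level]` marker as in this notebook; `render_hierarchy` is a hypothetical helper and the four-space indent is an arbitrary choice.

```python
import re


def render_hierarchy(branches):
    """Hypothetical helper: one topic per line, indented according to its level."""
    lines = []
    for branch in branches:
        for topic in branch:
            match = re.match(r"\[(\d+)\]", topic)
            level = int(match.group(1)) if match else 1  # default to top level
            lines.append("    " * (level - 1) + topic)
    return "\n".join(lines)


hierarchy = [
    ["[1] Estonian local news", "[2] Social Event", "[2] Education"],
    ["[1] World news", "[2] Military"],
]
print(render_hierarchy(hierarchy))
```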

In [66]:
import os
import requests
import json
from dotenv import load_dotenv

load_dotenv()
GEMINI_API_KEY = os.getenv("GEM")
url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-lite:generateContent?key={GEMINI_API_KEY}"

headers = {
    "Content-Type": "application/json"
}

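# Load the input documents. json.load() expects a single JSON value, so despite the
# .jsonl extension this file is assumed to hold one JSON array.
with open('../../src/datasets/silver/scanned/full_scope_data/text_and_label.jsonl', encoding='utf-8') as f:
    documents = json.load(f)

# Collect the ids that already have a response so that reruns skip them.
scanned_docs = []
with open('../../src/datasets/silver/scanned/full_scope_data/responses.jsonl', 'r', encoding='utf-8') as g:
    for line in g:
        scanned_docs.append(json.loads(line.strip()))
scanned_ids = {doc['id'] for doc in scanned_docs}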


# Open the output file in append mode so that earlier responses are kept
with open('../../src/datasets/silver/scanned/full_scope_data/responses.jsonl', 'a') as outfile:
    nbr_of_files = 0
    for i, document in enumerate(documents):
        #if not document['kp']:
        #    continue
        if document['id'] in scanned_ids:
            continue
        nbr_of_files += 1
        if nbr_of_files > 20:
            break
        


        question_txt = qry1(document['modernized_text'])

        data = {
            "contents": [
                {
                    "parts": [
                        {
                            "text": question_txt
                        }
                    ]
                }
            ]
        }

        response = requests.post(url, headers=headers, data=json.dumps(data))

        if response.status_code == 200:
            response_data = response.json()
            extracted_data = {
                "qry_id": 1,
                "id": document.get("id", ""),
                "content": response_data.get("candidates", [{}])[0].get("content", {}).get("parts", [{}])[0].get("text", ""),
                "avgLogprobs": response_data.get("candidates", [{}])[0].get("avgLogprobs", None),
                "modelVersion": response_data.get("modelVersion", "")
            }

            # Write extracted data as a new line in the output file
            outfile.write(json.dumps(extracted_data) + '\n')
            print(f"Response for document {i + 1} successfully written.")
        else:
            print(f"Error for document {i + 1}: {response.status_code}, {response.text}")
            break

print("All responses have been successfully written to a JSON Lines file.")


Response for document 128 successfully written.
Response for document 129 successfully written.
Response for document 130 successfully written.
Response for document 131 successfully written.
Response for document 132 successfully written.
Response for document 133 successfully written.
Response for document 134 successfully written.
Response for document 135 successfully written.
Response for document 136 successfully written.
Response for document 137 successfully written.
Response for document 138 successfully written.
Response for document 139 successfully written.
Response for document 140 successfully written.
Response for document 141 successfully written.
Response for document 142 successfully written.
Response for document 143 successfully written.
Response for document 144 successfully written.
Response for document 145 successfully written.
Response for document 146 successfully written.
Response for document 147 successfully written.
All responses have been successfully written to a JSON Lines file.

In [74]:

documents = []
url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={GEMINI_API_KEY}"

with open('../../src/datasets/silver/scanned/full_scope_data/only_topic_1.jsonl') as f:
    for line in f:
        documents.append(json.loads(line.strip()))


# Only responses to query 2 count as already processed in this cell.
scanned_docs = []

with open('../../src/datasets/silver/scanned/full_scope_data/responses.jsonl', 'r', encoding='utf-8') as g:
    for line in g:
        scanned_docs.append(json.loads(line.strip()))
scanned_ids = {doc['id'] for doc in scanned_docs if doc['qry_id'] == 2}

with open('../../src/datasets/silver/scanned/full_scope_data/responses.jsonl', 'a') as outfile:
    nbr_of_files = 0
    for i, document in enumerate(documents):
        if document['id'] in scanned_ids:
            continue
        nbr_of_files += 1
        if nbr_of_files > 20:
            break
        #print(document)
        #print(document['modernized_text'],document['topic_lvl1'])

        question_txt = qry2(document['modernized_text'],document['topic_lvl1'])
        #break

        data = {
            "contents": [
                {
                    "parts": [
                        {
                            "text": question_txt
                        }
                    ]
                }
            ]
        }

        response = requests.post(url, headers=headers, data=json.dumps(data))

        if response.status_code == 200:
            response_data = response.json()
            extracted_data = {
                "qry_id": 2,
                "id": document.get("id", ""),
                "content": response_data.get("candidates", [{}])[0].get("content", {}).get("parts", [{}])[0].get("text", ""),
                "avgLogprobs": response_data.get("candidates", [{}])[0].get("avgLogprobs", None),
                "modelVersion": response_data.get("modelVersion", "")
            }

            # Write extracted data as a new line in the output file
            outfile.write(json.dumps(extracted_data) + '\n')
            print(f"Response for document {i + 1} successfully written.")
        else:
            print(f"Error for document {i + 1}: {response.status_code}, {response.text}")
            break

print("All responses have been successfully written to a JSON Lines file.")


Response for document 8 successfully written.
Response for document 9 successfully written.
Response for document 10 successfully written.
Response for document 12 successfully written.
Response for document 16 successfully written.
Response for document 18 successfully written.
Response for document 21 successfully written.
Response for document 22 successfully written.
Response for document 23 successfully written.
Response for document 26 successfully written.
Response for document 28 successfully written.
Response for document 30 successfully written.
Response for document 33 successfully written.
Response for document 34 successfully written.
Response for document 38 successfully written.
Response for document 39 successfully written.
Error for document 40: 429, {
  "error": {
    "code": 429,
    "message": "You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.",

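The loop above stops at the first non-200 status, which here was a 429. A minimal sketch of retrying with exponential backoff instead of stopping, assuming the limit is a short-lived rate limit; if the quota in the message is a daily one, waiting a few seconds will not help. `call_with_backoff` and its parameters are hypothetical and not part of this notebook.

```python
import time


def call_with_backoff(send, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Hypothetical helper: call send() again while it answers with HTTP 429.

    `send` is a zero-argument callable returning an object with `.status_code`,
    for example a lambda wrapping requests.post(...).
    """
    for attempt in range(max_retries + 1):
        response = send()
        # Return on any status other than 429, or once the retries are used up.
        if response.status_code != 429 or attempt == max_retries:
            return response
        # Wait base_delay, then twice that, then four times that, and so on.
        sleep(base_delay * 2 ** attempt)
```

In the cells of this notebook the call could be wrapped as `call_with_backoff(lambda: requests.post(url, headers=headers, data=json.dumps(data)))`.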
In [63]:
# Merge near-duplicate second-level topics with qry3 and append the reply to topic_refin.jsonl.
with open('../../src/datasets/silver/scanned/full_scope_data/topic_refin.jsonl', 'a') as outfile:

    #list_of_topics = ['Social Event', 'Student Affairs', 'Scouting', 'Economy', 'Land Disputes', 'Newspaper Contact Information', 'Politics', 'Business', 'Infrastructure', 'Announcements', 'Religious Services', 'Crime', 'Education']
    list_of_topics = ['Military', 'Media', 'Crime', 'Political conflict', 'Politics', 'Religion', 'Political Affairs', 'Film review', 'Business', 'Education', 'Entertainment', 'Infrastructure', 'Land reform']
    url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-lite:generateContent?key={GEMINI_API_KEY}"

    question_txt = qry3(list_of_topics)

    data = {
        "contents": [
            {
                "parts": [
                    {
                        "text": question_txt
                    }
                ]
            }
        ]
    }

    response = requests.post(url, headers=headers, data=json.dumps(data))

    # NOTE: `document` and `i` are not defined in this cell. They are leftovers from the
    # loop in the previous cell, so the stored id and the printed document number do not
    # identify this merge request.
    if response.status_code == 200:
        response_data = response.json()
        extracted_data = {
            "qry_id": 3,  # this cell runs qry3
            "id": document.get("id", ""),
            "content": response_data.get("candidates", [{}])[0].get("content", {}).get("parts", [{}])[0].get("text", ""),
            "avgLogprobs": response_data.get("candidates", [{}])[0].get("avgLogprobs", None),
            "modelVersion": response_data.get("modelVersion", "")
        }
        # Write extracted data as a new line in the output file
        outfile.write(json.dumps(extracted_data) + '\n')
        print(f"Response for document {i + 1} successfully written.")
    else:
        print(f"Error for document {i + 1}: {response.status_code}, {response.text}")
        

Response for document 37 successfully written.


In [65]:
Hierarchy = [['[1] Estonian local news','[2] Social Event', '[2] Economy', '[2] Newspaper Contact Information', '[2] Infrastructure', '[2] Land Disputes', '[2] Education', '[2] Business', '[2] Crime', '[2] Student Affairs', '[2] Politics', '[2] Announcements', '[2] Religious Services', '[2] Scouting']
             ,['[1] World news', '[2] Military', '[2] Media', '[2] Crime', '[2] Political conflict', '[2] Politics', '[2] Religion', '[2] Political Affairs', '[2] Film review', '[2] Business', '[2] Education', '[2] Entertainment', '[2] Infrastructure', '[2] Land reform']]

documents = []
url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-05-20:generateContent?key={GEMINI_API_KEY}"

# NOTE: the records loaded here are never used, because `documents` is replaced
# below by the contents of text_and_label.jsonl.
with open('../../src/datasets/silver/scanned/full_scope_data/world_news.jsonl') as f:
    for line in f:
        documents.append(json.loads(line.strip()))


scanned_docs = []

with open('../../src/datasets/silver/scanned/full_scope_data/text_and_label.jsonl') as f:
    documents = json.load(f)

with open('../../src/datasets/silver/scanned/full_scope_data/responses.jsonl', 'r', encoding='utf-8') as g:
    for line in g:
        scanned_docs.append(json.loads(line.strip()))
# Skip every document that already has a response from query 1, 2 or 4.
scanned_ids = {doc['id'] for doc in scanned_docs if doc['qry_id'] in (1, 2, 4)}

with open('../../src/datasets/silver/scanned/full_scope_data/responses.jsonl', 'a') as outfile:
    nbr_of_files = 0
    for i, document in enumerate(documents):
        if not document['kp']:
            continue
        if document['id'] in scanned_ids:
            continue
        nbr_of_files += 1
        if nbr_of_files > 20:
            break
        #print(document)
        #print(document['modernized_text'],document['topic_lvl1'])

        question_txt = qry4(Hierarchy,document['modernized_text'])
        #break

        data = {
            "contents": [
                {
                    "parts": [
                        {
                            "text": question_txt
                        }
                    ]
                }
            ]
        }

        response = requests.post(url, headers=headers, data=json.dumps(data))

        if response.status_code == 200:
            response_data = response.json()
            extracted_data = {
                "qry_id": 4,
                "id": document.get("id", ""),
                "content": response_data.get("candidates", [{}])[0].get("content", {}).get("parts", [{}])[0].get("text", ""),
                "avgLogprobs": response_data.get("candidates", [{}])[0].get("avgLogprobs", None),
                "modelVersion": response_data.get("modelVersion", "")
            }

            # Write extracted data as a new line in the output file
            outfile.write(json.dumps(extracted_data) + '\n')
            print(f"Response for document {i + 1} successfully written.")
        else:
            print(f"Error for document {i + 1}: {response.status_code}, {response.text}")
            break

print("All responses have been successfully written to a JSON Lines file.")

Response for document 3044 successfully written.
Response for document 3046 successfully written.
Response for document 3050 successfully written.
Response for document 3051 successfully written.
Response for document 3053 successfully written.
Response for document 3054 successfully written.
All responses have been successfully written to a JSON Lines file.
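
Every request cell repeats the same chained lookup, `response_data.get("candidates", [{}])[0]...`. The default `[{}]` only applies when the key is missing, so a `candidates` or `parts` key that is present but holds an empty list would raise an `IndexError`. A minimal sketch of a more defensive lookup; `extract_text` is a hypothetical helper, and whether the API ever returns such empty lists is an assumption rather than something observed in this notebook.

```python
def extract_text(response_data):
    """Hypothetical helper: return the first candidate's text, or '' if it is absent."""
    # `or [{}]` also covers a key that is present but empty or None.
    candidates = response_data.get("candidates") or [{}]
    parts = candidates[0].get("content", {}).get("parts") or [{}]
    return parts[0].get("text", "")


sample = {"candidates": [{"content": {"parts": [{"text": "[1] Weather"}]}}]}
print(extract_text(sample))
```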
