In [99]:
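from typing import Tuple, Type, Union
from pydantic import BaseModel

def generate_field_description(model: Type[BaseModel], indent: int = 0) -> Union[str, Tuple[str, str]]:
    """
    Recursively builds a formatted field description and a JSON example
    for a Pydantic model, including nested models.

    At the top level (indent == 0) the full prompt string is returned; for
    nested models a (field_description, json_example) tuple is returned.
    """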
    indent_str = '  ' * indent
    field_descriptions = []
    json_example_lines = []

    for field_name, field_info in model.model_fields.items():
        annotation = field_info.annotation
        field_desc = field_info.description or ""

        # Detect if it's a nested Pydantic model
        if isinstance(annotation, type) and issubclass(annotation, BaseModel):
            field_descriptions.append(
                f"{indent_str}- {field_name} [object]: {field_desc}"
            )
            # Recurse for nested model
            nested_desc, nested_json = generate_field_description_with_json(annotation, indent + 1)
            field_descriptions.append(nested_desc)
            json_example_lines.append(f'{indent_str}"{field_name}": {{\n{nested_json}\n{indent_str}}}')
        else:
            type_name = annotation.__name__ if hasattr(annotation, "__name__") else str(annotation)
            field_descriptions.append(f"{indent_str}- {field_name} [{type_name}]: {field_desc}")
            json_example_lines.append(f'{indent_str}"{field_name}": <{type_name}>')

    field_description_str = "\n".join(field_descriptions)
    json_example_str = ",\n".join(json_example_lines)

    if indent == 0:
        # Top-level output
        return (
            "Generate a JSON response. The fields should include:\n" +
            field_description_str +
            "\n\nPlease provide the next response **only** in the following valid JSON format, without any additional text:\n\n" +
            "{\n" + json_example_str + "\n}"
        )
    else:
        # Return components for recursive case
        return field_description_str, json_example_str

def generate_field_description_with_json(model: Type[BaseModel], indent: int = 0):
    """
    Thin wrapper used for the recursive call. When called with indent > 0 it
    returns a (field_description, json_example) tuple for the nested model.
    """
    return generate_field_description(model, indent)
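
The type label above relies on `annotation.__name__`, which works for plain classes such as `str`. For generic annotations such as `Optional[int]` or `List[str]`, it may drop the type parameters or fall back to the full repr, depending on the Python version. Below is a minimal sketch of one way to render such annotations with `get_origin` and `get_args`. The helper name `readable_type_name` and the `(optional)` suffix are assumptions for illustration and are not part of the notebook.

```python
import types
from typing import List, Optional, Union, get_args, get_origin

def readable_type_name(annotation) -> str:
    """Hypothetical helper: render an annotation such as Optional[int] compactly."""
    origin = get_origin(annotation)
    if origin is None:
        # Plain classes (str, int, a model class) expose __name__
        return getattr(annotation, "__name__", str(annotation))

    args = get_args(annotation)
    # typing.Union covers Optional[X]; types.UnionType covers `X | None` (3.10+)
    if origin is Union or origin is getattr(types, "UnionType", None):
        non_none = [a for a in args if a is not type(None)]
        inner = " | ".join(readable_type_name(a) for a in non_none)
        return f"{inner} (optional)" if len(non_none) < len(args) else inner

    # Other generics, e.g. List[str] or Dict[str, int]
    inner = ", ".join(readable_type_name(a) for a in args)
    return f"{getattr(origin, '__name__', str(origin))}[{inner}]"

print(readable_type_name(str))
print(readable_type_name(Optional[int]))
print(readable_type_name(List[str]))
```

If something like this were adopted, the `type_name = ...` line in the `else` branch would be the natural place to call it.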

In [100]:
from pydantic import BaseModel, Field

class Topic(BaseModel):  # Inherit from BaseModel
    topic: str = Field(..., description="The topic")
    topic_reason: str = Field(..., description="Reasoning for choosing the topic for the given input, explain what is in the input and how it is related to the topic.")

class row_topic(BaseModel):
    topic1: Topic = Field(..., description="Best fit topic out of the provided topic list, make sure it is different from the other two topics. This should be the most relevant topic to the input out of the given list.")
    topic2: Topic = Field(..., description="Second best fit topic out of the provided topic list, make sure it is different from the other two topics. This should be the second most relevant topic to the input out of the given list.")
    topic3: Topic = Field(..., description="Third best fit topic out of the provided topic list, make sure it is different from the other two topics. This should be the third most relevant topic to the input out of the given list.")





In [101]:
print(generate_field_description(row_topic))

Generate a JSON response. The fields should include:
- topic1 [object]: Best fit topic out of the provided topic list, make sure it is different from the other two topics. This should be the most relavant topic to the input out of the given list.
  - topic [str]: The topic
  - topic_reason [str]: Reasoning for choosing the topic for the given input, explain what is in the input and how it is related to the topic.
- topic2 [object]: Second best fit topic out of the provided topic list, make sure it is different from the other two topics. This should be the second relavant topic to the input out of the given list.
  - topic [str]: The topic
  - topic_reason [str]: Reasoning for choosing the topic for the given input, explain what is in the input and how it is related to the topic.
- topic3 [object]: Third best fit topic out of the provided topic list, make sure it is different from the other two topics. This should be the third relavant topic to the input out of the given list.
  - topic

In [102]:
import requests

# Define the server URL (update if needed)
SERVER_URL = "http://localhost:5000"

try:
    # Send a GET request
    response = requests.get(SERVER_URL+'/health')

    # Check if request succeeded (HTTP status code 200)
    if response.status_code == 200:
        print(f"Success: Server responded with {response.status_code}")
        print("Response Content:", response.text)
    else:
        print(f"Warning: Server responded with status code {response.status_code}")

except requests.exceptions.RequestException as e:
    print(f"Error: Could not connect to the server.\nDetails: {e}")

Success: Server responded with 200
Response Content: {"status":"running"}



In [103]:
# Corrected candidate topics list 
candidate_topics = [
    "Auschwitz",
    "Baltic states and Sweden",
    "Boder trying to understand reasons, skepticism",
    "Christians, church, religion",
    "France, French",
    "Germany, Germans",
    "Interactions with SS and SS actions",
    "Interactions with and feelings about Germans, Aryan identities",
    "Italy",
    "Jews and Jewish identity, false papers",
    "Jews and fate of Jewish community",
    "Palestine, Judaism, kibbutzim",
    "Paris, France",
    "Poland, Polish",
    "Russia, Russians",
    "Switzerland",
    "To and from cities and countries",
    "Actions, choices, and intentions",
    "Arrest, beating",
    "Behavior, treatment, and aftermath",
    "Birth, living situation, camp life",
    "Children",
    "Clarifications, mostly geographic and people",
    "Closed-ended, factual questions, mostly yes-no questions",
    "Clothing and material objects",
    "Collective actions vs. individual agency",
    "Countryside (forest, fields, and farms)",
    "Days, daily life, daily routines",
    "Encounters with Germans",
    "Escaping, leaving, being evacuated, fleeing",
    "Factory work, forced labor, business",
    "Factual clarification (name, address, number, meaning)",
    "Family separation",
    "Family (background, what happened to)",
    "Food, eating, and drinking",
    "Fuel",
    "Geographic locations, camp names/locations",
    "Ghetto experiences",
    "Hospital, doctor",
    "How long (lengths of time)",
    "How many (people)",
    "How many people in family",
    "How old (age)",
    "Hygiene, toilet facilities, body violations",
    "Introductions, storytelling",
    "Killing, shooting, and death",
    "Lager (names, conditions, factual details)",
    "Liberation, end of war",
    "Living quarters (homes, barracks)",
    "Locations and activities in streets, cities",
    "Longer questions about unique Holocaust experiences",
    "Marriage, marital relations, family",
    "Men (fathers, sons, brothers, husbands, uncles)",
    "Men and women",
    "Military, soldiers, and construction",
    "Mistreatment and death, corpses, gassings, crematoria",
    "Mistreatment and reactions",
    "Music, song",
    "Night",
    "Numerical quantities (mostly people)",
    "Numerical quantities and dates",
    "Outbreak of the war, events of war",
    "Payments and wages",
    "Prisoners",
    "Professional work and industry",
    "Quantities (size, weight, amounts, distance)",
    "Questions about men",
    "Questions of meaning",
    "References to women",
    "Schooling",
    "Sexual violence and assault",
    "Short negation questions (factual)",
    "Short questions about processes",
    "Shorter durations of time (hours, days, weeks, months)",
    "Sickness, disease, illness",
    "Sleeping arrangements, beds",
    "Smoking, cigarettes, bartering",
    "Studying, learning",
    "Tattoos",
    "Tell me/describe",
    "Time (month, year), when",
    "Transportation and deportation",
    "War destruction",
    "What",
    "What happened (next)",
    "What, how",
    "Where",
    "Who",
    "Why, how",
    "Witnessing events",
    "Women (mothers, daughters, sisters, wives, aunts)",
    "Writing/language/inscriptions"
]

In [None]:
import requests
import json


FLASK_API_URL = "http://127.0.0.1:5000/chat"

def generate_payload(question, answer):
    # Define the request payload (matches the curl command)
    payload = {
        "model": "DeepSeek-R1:latest",
        "stream": False,
        "messages": [
            {
                "role": "system",
                "content": (
                generate_field_description(row_topic)
                ),
            },
            {
                "role": "system",
                "content": "You are informed about the Holocaust during World War II. You are a helpful assistant tasked with tagging conversations related to the Holocaust based on their content."
            },
            {
                "role": "system",
                "content": "these are the topics you can choose from: " + str(candidate_topics)
            },
            {
                "role": "user",
                "content": f'''
                    The interview question is: {question}
                    The given response and answer is: {answer}
                '''
            }
        ]
    }
    return payload


def content_request(payload):
    """POST the payload to the Flask API and return the model's raw text, or None on failure."""
    # Set the headers for a JSON request
    headers = {"Content-Type": "application/json"}

    try:
        # Send POST request to the Flask API
        response = requests.post(FLASK_API_URL, headers=headers, json=payload)

        # Check if the request was successful
        if response.status_code == 200:
            response_json = response.json()
            return response_json["message"]["content"]

        print(f"Error {response.status_code}: {response.text}")

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")

    # Make the failure case explicit for callers
    return None



def content_request2(payload):
    # Placeholder: this function has no body yet
    pass


In [105]:
import json

temp = json.loads('''{
"topic1": {
  "topic": "Interactions with SS and SS actions",
  "topic_reason": "The speaker mentions that their father was a prisoner in World War I taken by the Germans, which implies interaction with the German authorities."
},
"topic2": {
  "topic": "Actions of Nazis during World War II",
  "topic_reason": "Although not explicitly stated, the mention of Dora's family being affected by the Germans' actions during World War I and their subsequent birth in Germany implies involvement of Nazi policies and practices."
},
"topic3": {
  "topic": "Family separation due to war",
  "topic_reason": "The speaker describes a situation where her parents were separated from each other due to German captivity, which resulted in the birth of two children. This suggests a family has been torn apart by war-related circumstances."
}
                  }''')

print(type(temp))



import json

valid_json_str = '''
{
  "topic1": {
    "topic": "Test Topic",
    "topic_reason": "Because it fits"
  }
}
'''

parsed = json.loads(valid_json_str)
print(type(parsed))  # Should output: <class 'dict'>



<class 'dict'>
<class 'dict'>
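
The first parsed example above contains topics such as "Actions of Nazis during World War II" and "Family separation due to war", neither of which appears verbatim in `candidate_topics`. A minimal sketch of one way to catch this is shown below, using `difflib.get_close_matches` from the standard library. The helper name `normalize_topic`, the 0.6 cutoff, and the three-item `allowed_topics` stand-in list are assumptions for illustration only.

```python
from difflib import get_close_matches

# Short stand-in for the notebook's candidate_topics list
allowed_topics = [
    "Interactions with SS and SS actions",
    "Family separation",
    "Germany, Germans",
]

def normalize_topic(topic, allowed, cutoff=0.6):
    """Hypothetical helper: return an allowed topic, its closest match, or None."""
    if topic in allowed:
        return topic
    # Fuzzy match against the allowed list; an empty result means no close match
    matches = get_close_matches(topic, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(normalize_topic("Family separation due to war", allowed_topics))
print(normalize_topic("Interactions with SS and SS actions", allowed_topics))
```

Topics that come back as `None` could be logged and reviewed by hand rather than counted, since a fuzzy match may not reflect what the model intended.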


In [107]:
import pandas as pd
from tqdm import tqdm
from pathlib import Path
import os
import json


# Point to the folder path
input_folder = Path("Phase2_Folders/030_processed_html")

output_folder= "/Users/xiyuancao/Desktop/USHMM-PDF-cleansing/Gender_family_topics_identification/topic_labeling"
# Ensure output folder exists
os.makedirs(output_folder, exist_ok=True)

class File:
    def __init__(self, file_path):
        self.file_path = file_path
        self.topics = {}


# Dictionary to store results
files = {}


def safe_json_loads(s):
    """Attempts to fix and parse a possibly incomplete JSON string."""
    s = s.strip()
    
    # Basic brace balancing
    open_braces = s.count('{')
    close_braces = s.count('}')
    
    if close_braces < open_braces:
        s += '}' * (open_braces - close_braces)
    
    try:
        return json.loads(s)
    except json.JSONDecodeError as e:
        print(f"JSON parsing failed: {e}")
        return None

def single_csv_classification(file_obj, output_path):
    """Process a single CSV file and classify topics for each row."""
    df = pd.read_csv(file_obj.file_path)
    df["topics"] = ""

    for i, row in tqdm(df.iterrows(), total=len(df)):
        try:
            question = row["Question"]
            answer = row["Answer"]
            payload = generate_payload(question, answer)
            topics = content_request(payload)
            if topics is None:
                # Request failed; leave the cell empty and move on
                continue
            df.at[i, "topics"] = topics

            print(topics)

            temp = safe_json_loads(topics)
            if not isinstance(temp, dict):
                # Response was not a parseable JSON object; keep the raw text only
                continue

            # Count topic frequencies using only the topic strings
            # (e.g., temp["topic1"]["topic"])
            for key, value in temp.items():
                print(f"{key}: {value}")
                if isinstance(value, dict) and "topic" in value:
                    topic_str = value["topic"]
                    file_obj.topics[topic_str] = file_obj.topics.get(topic_str, 0) + 1

        except Exception as e:
            print(f"Error at row {i}: {e}")
            df.at[i, "topics"] = ""

    # Save to file
    filename = Path(file_obj.file_path).stem
    df.to_csv(f"{output_path}/{filename}_labeled.csv", index=False)



# Set to True to classify every CSV in input_folder; it is off here so that
# only the single file below is processed.
RUN_ALL_FILES = False

if RUN_ALL_FILES:
    for file_path in input_folder.glob("*.csv"):
        file_obj = File(file_path)
        single_csv_classification(file_obj, output_folder)
        files[file_path.name] = file_obj  # Save the object to access topic counts later


file_obj = File("/Users/xiyuancao/Desktop/USHMM-PDF-cleansing/Phase2_Folders/030_processed_html/RG-50.030.0038_trs_en_cleaned.csv")
single_csv_classification(file_obj, output_folder)
files[Path(file_obj.file_path).name] = file_obj  # Key by this file's own name


# Example: Print topic counts from all files
for fname, file_obj in list(files.items())[:1]:
    print(f"\n {fname}")
    #print(file_obj.topics)
    for topic, count in file_obj.topics.items():
        print(f"  {topic}: {count}")

  1%|▏         | 1/75 [00:43<54:10, 43.93s/it]

<think>
Okay, so the user has provided a JSON structure with three topic slots. They want me to fill these based on an interview response about someone's name and birthplace. The topics available are a bunch of historical ones related to the Holocaust.

First, looking at the response: "My name is Leo Bretholz. Do I have to spell that? I usually do have to. And I was born in Vienna, Austria." This seems like personal information about someone's name and place of birth. 

I need to figure out which topics this fits into. The first topic should be the most relevant. Since the user asked for a name and birthplace, it directly relates to "Biography or life story" but that's not in the provided list. Wait, let me check the available topics again.

Looking through the list: 'Jews and Jewish identity...' could be related if Leo Bretholz was Jewish, which I'm assuming given the context of Holocaust-related conversations. But without knowing for sure, it might be a stretch. Alternatively, "Where

  3%|▎         | 2/75 [00:53<28:58, 23.81s/it]

<think>
Okay, let me try to figure this out. So the user has provided some data about Holocaust education and a JSON response based on an interview question asking for the year when something happened. Specifically, they mentioned that a witness named Anna survived Auschwitz in 1943.

Now, I need to create another JSON response following the same structure but with different topics. The task is to generate a response where topic1 is different from the previous one and then include two more relevant topics.

First, looking at the previous example: topic1 was about Auschwitz, which made sense because it's directly related to the event. So for this new question about concentration camps in general, topic1 should be something like "Concentration camps (general)" or another broad category that includes all types of camps used by Nazi Germany.

Next, topic2 and topic3 need to cover other relevant areas. Since the user has a long list of topics, I can pick another area, maybe "Numbers" becaus

  4%|▍         | 3/75 [01:07<22:50, 19.04s/it]

<think>
Alright, so I'm trying to help this user by generating a JSON response based on their query. The task is to evaluate the provided answer about a family's history during the Holocaust and assign three topics: topic1, topic2, and topic3. Each should be from the given list of topics.

First, let me read through the provided answer carefully. The answer talks about a family in Austria before World War II, mentioning their background, father as a tailor from Poland, parents moving after the first World War, marriage in Austria, three children, one child dying shortly after birth, and the father's death when the person was nine. It also mentions the mother being a young widow at 35 who didn't remarry and coped with supporting the family.

Looking at the topics list: 'Baltic states and Sweden', 'Christians, church, religion', 'Germany, Germans', etc., but none of these directly relate to Austria or the family's history in Poland before the war. So topic1 needs to be something that rel

  5%|▌         | 4/75 [01:12<16:13, 13.71s/it]

<think>
Alright, I'm looking at this query where someone is asking about a Holocaust-related interview response. The user provided an input and output example to clarify what they're expecting.

So, the task here is to create JSON responses for topic1, topic2, and topic3 based on given topics. Each field should have an object with 'topic' (a string) and 'topic_reason' (a string). The response needs to be in valid JSON without any extra text.

First, I need to analyze the user's provided example. In that case, the input was about economic struggles leading up to the Holocaust. The output correctly identified 'Auschwitz' as topic1 because it relates to the Nazi concentration camps where many died during the Holocaust. Topic2 is 'Jews and Jewish identity' since the story mentions a family with a mother who was a widow and involved in helping others, highlighting their experiences. Topic3 is 'Interactions with SS and SS actions,' pointing to the family being affected by the SS, like hiding

  7%|▋         | 5/75 [01:27<16:35, 14.22s/it]

<think>
Okay, so I need to figure out how to tag this interview response according to the given topics. The user provided a detailed analysis of the response and identified it as related to 'Vienna' because the setting is Vienna with mentions of antisemitism in Austria.

First, looking at the content, the respondent talks about their experience growing up in Vienna, discussing antisemitic attitudes and practices like being labeled during holidays. Vienna itself isn't one of the provided topics, but the context ties into a broader European antisemitism issue which is covered under 'Auschwitz' or other Nazi-associated topics.

Wait, but the topic list includes several options like 'Baltic states and Sweden', 'Christians...', etc., and 'Vienna' isn't listed. Hmm. So maybe the best fit would be a more general category. Alternatively, if 'Vienna' isn't an option, perhaps it's under a broader category.

Looking back at the topic list: there's no Vienna mentioned, so I have to choose from the

  8%|▊         | 6/75 [01:35<13:49, 12.03s/it]

<think>
Alright, so I need to figure out which topics from the given list apply to this interview response. Let me read through the provided content first.

The user is talking about their mother helping them escape from Austria during the Anschluss in 1938. They mention how their family members were arrested and the dangers they faced. Their mother made efforts to send them away, which led to feelings of guilt. Later, she's described as "Eshet Khayil," meaning a woman of valor.

Now, looking at the topics:

- **Auschwitz**: This is a concentration camp. The text mentions "ashes" from people being cremated, but it doesn't specifically name Auschwitz.
  
- **Baltic states and Sweden**: Not relevant here; no mention of Sweden or Baltic regions.

- **Boder trying to understand reasons, skepticism**: The response discusses the mother's role but isn't skeptical about the events. It talks about her efforts and feelings, so this might not fit.

- **Christians, church, religion**: No religious

  9%|▉         | 7/75 [01:48<14:07, 12.46s/it]

<think>
Okay, so I need to help create a JSON response based on an interview question about someone named her doing something specific. The user is asking me to tag conversations related to the Holocaust using predefined topics.

First, let's understand the input: The interview question was "What did she do?" and the answer talks about her helping others by painting, especially assisting a poor painter who couldn't afford the job himself because his family needed it. She always helped those less fortunate.

Now, I need to map this response to the provided topics list. Let me look at the topics again: they include things like "Christians, church, religion," "Jews and Jewish identity..." but also more general topics like "Helping others" or specific actions like painting. However, looking closely, one of the options is "Boder trying to understand reasons, skepticism." That seems a bit off because it's about understanding reasons rather than a direct action.

Wait, maybe I should check if

 11%|█         | 8/75 [01:59<13:09, 11.79s/it]

<think>
Alright, so the user provided a specific query about tagging conversations related to the Holocaust based on content. They gave a JSON format with three topics to choose from: Auschwitz, Baltic states and Sweden, and Boder trying to understand reasons, skepticism.

The example given was an interview question where someone asked about their mother in the context of the Holocaust. The response was that the mother urged leaving, which is part of Auschwitz experiences.

So, first, I need to analyze this input to determine the best topics. The user's instructions were clear: topic1 should be the most relevant, followed by topic2 and topic3, all different from each other.

Looking at the content, it clearly relates to the Holocaust because it mentions leaving, which is a significant event in that context. The mother urging to leave points towards a family separation scenario, which ties into Auschwitz experiences where families were separated.

For topic1, "Auschwitz" makes sense bec

 12%|█▏        | 9/75 [02:02<10:11,  9.26s/it]

<think>
Alright, let's tackle this query step by step. The user has provided a JSON structure that includes three topics related to the Holocaust and has asked for a response tagging their conversation based on these topics.

First, I need to understand the context. The interview question is about where someone went, specifically mentioning Trier in Germany. This falls under 'Germany, Germans' as it's directly referring to a location associated with the Holocaust.

Next, I should consider why this location might be significant. Trier was part of the German territories that were occupied and subjected to forced labor during the Nazi era. This makes it highly relevant to the topic of 'Germany, Germans'.

The third topic needs to be something else but still related. Perhaps a nearby place like Koblenz could make sense here since it's also part of the occupied region and was used in concentration camps.

I should ensure that each topic is distinct from the others, so I'll assign different 

 13%|█▎        | 10/75 [02:11<09:46,  9.02s/it]

<think>
Okay, so I need to create a JSON response based on this user's query. Let me read through the entire text first to understand what it's about.

The user provided a detailed narrative starting with a request for a JSON response in the form of a story about a young man escaping Nazi Germany and making his way to Israel. It mentions various steps like moving from Luxembourg to Belgium, crossing into France, and ending up in Baltimore. There are also references to events like Kristallnacht.

Hmm, the user's query seems to be asking for a JSON response based on this narrative but it doesn't explicitly state what exactly they want in the JSON format. They just wrote "Create a JSON response" after providing the story. I need to figure out what information should be included in the JSON.

Looking at the structure of typical JSON responses, it's usually an array or an object containing key-value pairs. Since this is a narrative with multiple entities like names, places, events, etc., pe

 13%|█▎        | 10/75 [02:11<14:17, 13.19s/it]


KeyboardInterrupt: 
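
The responses printed above all begin with a `<think>` block, and the printed text stops before any JSON appears. If the full response does contain a JSON object after the reasoning, passing the raw string to `json.loads` would still fail because of the leading `<think>` text. Below is a minimal sketch of one way to strip the reasoning block and parse what remains. The helper name `extract_json_object` and the sample strings are assumptions for illustration and are not taken from the run.

```python
import json
import re

def extract_json_object(raw):
    """Hypothetical helper: drop <think>...</think> and parse the outermost {...} span."""
    # Remove a complete reasoning block if one is present
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)

    # Take everything from the first "{" to the last "}"
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

# Made-up sample inputs
complete = '<think>pick {a} topic</think>\n{"topic1": {"topic": "Where", "topic_reason": "asks for a place"}}'
truncated = "<think>Okay, so the user has provided a JSON structure"

print(extract_json_object(complete))
print(extract_json_object(truncated))
```

A response whose `<think>` block is never closed returns `None` here, so rows where the model ran out of output before producing JSON can be retried or flagged instead of counted.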