Step2: LLM-based structured extraction (ollama gemma3:4b)
- Using ollama gemma3:4b as it is goog to perform speech-level structured extraction
- After that save the output as CSV file to continue on COLAB for KG and network analysis.

In [1]:
import pandas as pd
import os
import json
import subprocess #from chatGPT

In [2]:
df = pd.read_csv("eu_debates_2020_clean.csv")
print("Total speeches:", len(df)) #after cleaning all the N/A rows
df.head()


Total speeches: 7012


Unnamed: 0,speaker_name,speaker_role,speaker_party,intervention_language,original_language,date,year,debate_title,text,translated_text,analysis_text,mentions_health_kw
0,Rasa Juknevičienė,MEP,PPE,lt,lt,2020-01-13,2020,Statement by the President,– 1991-ųjų metų Sausio 13-oji iš esmės buvo p...,"January 13, 1991 was essentially the last of ...","January 13, 1991 was essentially the last of ...",False
1,Tomas Tobé,MEP,PPE,en,en,2020-01-13,2020,Order of business,"on behalf of the PPE Group. – Mr President, Sw...",,"on behalf of the PPE Group. – Mr President, Sw...",False
2,Iratxe García Pérez,MEP,S&D,es,es,2020-01-13,2020,Order of business,"– Señor presidente, a mí lo que me parece lam...","“Mr. President, what seems regrettable to me ...","“Mr. President, what seems regrettable to me ...",False
3,Joëlle Mélin,MEP,ID,fr,fr,2020-01-13,2020,Order of business,"au nom du groupe ID. – Monsieur le Président, ...","on behalf of the ID group. “Mr. President, my ...","on behalf of the ID group. “Mr. President, my ...",False
4,Petros Kokkalis,MEP,GUE/NGL,en,en,2020-01-13,2020,Order of business,"– Mr President, unfortunately the nightmare t...",,"– Mr President, unfortunately the nightmare t...",False


In [3]:
texts = df["analysis_text"].dropna().tolist()
texts[:20]

[' January 13, 1991 was essentially the last of the Battle of the Second World War. It was delayed because it had been almost fifty years since the Ribentropo-Molotov Pact, but won.  Thank you, Mr. President, that you entered the unforgettable, which many of my colleagues have entered today. Today, January 13 is not just the day of tragic losses, although after the tanks, the defenders were killed or wounded. Today is a day of victory for us, setting the way to the European Union and NATO. Finally, the Second World War ended two years later – the first of the ninety-third of September, when the Soviet army was taken out of Lithuania. This memorandum is a sign of respect and remembrance to all the victims of Soviet communism: from the Baltics, Poland to Romania, Bulgaria, Albania, and also to millions of killed in Russia itself. Today I am a good opportunity to thank all the colleagues who voted for an important resolution on the actions of the Russian Federation against the investigato

- Select sample size at 1000 speeches to be good anough to see patterns but runable locally

In [4]:
df_sample = df.sample(1000, random_state=42).reset_index(drop=True)
print("sample size:", len(df_sample))

sample size: 1000


- Built the prompt for LLM to extract the scpeeches mentioned COVID-19 or health issues

In [13]:
def build_prompt(text):
    return f"""
You are an expert in political science and natural language processing. Your task is to analyze the following speech text from a European Parliament debate and extract,
Task: Decide wether the sceeph discusses 
- COVID-19
- public health issues
- vaccination
- healthcare systems
- hospitals
- public health emergency
- dicease control
- pandemic
- health policies

Do not include other topics e.g. environment or climate change unless they are explicitly linked to COVID-19 or public health.

Return ONLY valid JSON in the following format:
{{
  "health_mentioned": true or false,
  "health_topics": [list of short topic phrase]
}}
Speech:
\"\"\"{text[:3000]}\"\"\"
"""

#code suggested and debug (uft-8) by chat gpt

In [12]:
def query_ollama(prompt, model="gemma3:4b"):
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        text=True,
        encoding="utf-8",        
        capture_output=True
    )
    return result.stdout.strip()


In [14]:
test_text = df_sample.loc[0, "analysis_text"]
prompt = build_prompt(test_text)
response = query_ollama(prompt)
print(response)


```json
{
  "health_mentioned": true,
  "health_topics": [
    "public health",
    "healthcare systems",
    "health policies",
    "disease control",
    "pandemic",
    "health crisis"
  ]
}
```


#code parse output debug by chatgpt

In [19]:
def parse_output(output):
    try:
        cleaned = re.sub(r"```json|```", "", output).strip()
        data = json.loads(cleaned)

        discusses = (
            data.get("health_mentioned")
            or data.get("health_mention")
            or data.get("discusses_health_issues")
        )

        topics = data.get("health_topics", [])

        if not isinstance(discusses, bool):
            discusses = None
        if not isinstance(topics, list):
            topics = []

        return {
            "health_mentioned": discusses,
            "health_topics": topics
        }

    except Exception as e:
        print("PARSING ERROR:", e)
        print("RAW OUTPUT:", output)
        return {
            "health_mentioned": None,
            "health_topics": []
        }


In [20]:
row = df_sample.iloc[0]
raw = query_ollama(build_prompt(row["analysis_text"]))
print("RAW:")
print(raw)

print("PARSED:")
print(parse_output(raw))


RAW:
```json
{
  "health_mentioned": true,
  "health_topics": [
    "public health",
    "healthcare systems",
    "health policies",
    "disease control",
    "pandemic",
    "COVID-19"
  ]
}
```
PARSED:
{'health_mentioned': True, 'health_topics': ['public health', 'healthcare systems', 'health policies', 'disease control', 'pandemic', 'COVID-19']}


In [22]:
results = []

for i, row in df_sample.iterrows():
    raw = query_ollama(build_prompt(row["analysis_text"]))
    parsed = parse_output(raw)

    results.append({
        "speaker_party": row["speaker_party"],
        "health_mentioned": parsed["health_mentioned"],
        "health_topics": parsed["health_topics"]
    })

    if i % 50 == 0:
        print(f"Processed {i} / {len(df_sample)}")

llm_df = pd.DataFrame(results)


Processed 0 / 1000
PARSING ERROR: Extra data: line 7 column 1 (char 57)
RAW OUTPUT: ```json
{
  "health_mentioned": false,
  "health_topics": []
}
```

**Explanation:**

This speech focuses solely on border security, migration control, and crime prevention. It does not discuss any aspects of COVID-19, public health, vaccination, healthcare systems, hospitals, public health emergencies, disease control, pandemics, or health policies. Therefore, the `health_mentioned` flag is set to `false`, and the `health_topics` list is empty.
Processed 50 / 1000
Processed 100 / 1000
Processed 150 / 1000
PARSING ERROR: Extra data: line 7 column 1 (char 57)
RAW OUTPUT: ```json
{
  "health_mentioned": false,
  "health_topics": []
}
```

**Explanation:**

This speech does not contain any explicit references to health-related topics. It focuses solely on economic comparisons related to Brexit and the performance of the UK and EU economies. Therefore, all the requested topic flags are set to `false`.
PARSI

In [23]:
llm_df

Unnamed: 0,speaker_party,health_mentioned,health_topics
0,S&D,True,"[public health, healthcare systems, health pol..."
1,Greens/EFA,True,"[economic impact, social inequality, public he..."
2,S&D,,[]
3,PPE,True,"[COVID-19, public health issues, healthcare sy..."
4,ID,,[]
...,...,...,...
995,PPE,True,"[COVID-19, public health issues, vaccination, ..."
996,PPE,,[]
997,S&D,,[]
998,ID,,[]


In [24]:
llm_df.to_csv(
    "eu_debates_2020_llm_health_sample1000.csv",
    index=False
)