# Preparing Debate Data for Streamlit 🔥

This notebook performs the data-preparation steps that transform raw parliamentary debate CSVs (2009–2025) into cleaned, filtered, sampled, anonymized, and exportable formats suitable for the Streamlit app (built in the other repo). In brief, it goes through the following steps:

- **1. Importing libraries & defining file paths**: Load the necessary Python packages and define absolute file paths to the five cleaned CSVs.
- **2. Loading & filtering by turn count**: Read each CSV into a DataFrame and remove “short” exchanges, i.e., debates whose maximum `TurnSequence` is 2 or lower.
- **3. Excluding already-annotated debates & chunking for annotation**: Between workshops, new debate units were extracted for annotation, so the already annotated ones had to be excluded (this was before I had a well-working system). The function `split_fix_and_save_debate_chunks()` randomly groups the remaining debates into batches of 25, fixes missing Danish role labels (mapping from the English role when `TurnRole_Danish` is “Ukendt”), and saves each chunk to its own CSV.
- **4. Inspection: longest debates & topic coverage**: For each debate category (Reading of Bill, Deliberation, Question-Answering, Other), I inspect the five longest debates by maximum `TurnSequence`, concatenate them into a summary DataFrame, and print each dataset's unique `AgendaCategory` values to verify topic representation (this was before I turned to using only PLDs, which have a single topic).
- **5. Sampling debates by topic across categories**: `sample_debates()` filters debates by turn count (`min_turns=2` and `max_turns=25` by default, i.e., debates long enough to contain argumentation) and samples entire debates per `AgendaCategory` (defaulting to six policy topics when none are provided). It is applied to each filtered DataFrame (rob, db, pld, qa) with `sample_n=25` per topic, the samples are concatenated into `final_sampled_df`, and the row counts per sampled `DebateUnitID` are checked to verify that each debate appears in full.
- **6. Exporting debate exchanges to text & randomization**: `export_debate_exchanges_to_txt()` groups each debate by `DebateUnitID`, sorts by `TurnSequence`, formats each line as `[DebateUnitID] **TurnRole_Danish**: Utterance`, and writes the result to a `.txt` file; `randomize_lines()` shuffles the debate lines. The manually corrected chunks are then loaded and inspected, and a sample chunk is exported and randomized. Finally, the full manually corrected party-leader dataset is loaded, parties are pseudonymized (mapping known party names to “Parti\_A”, “Parti\_B”, etc.), speakers are anonymized (replacing names with the role-based labels “Spørgeren”, “Ordføreren”, or “Taleren”), erroneous rows (`TurnSequence == "tale"`) are dropped, and the final randomized text file is saved for annotation in Streamlit.
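
The line format described in step 6 can be illustrated with a minimal sketch. This is not the notebook's `export_debate_exchanges_to_txt()`; `format_debate_lines` and the toy rows are hypothetical and only show the grouping, sorting, and formatting on a tiny DataFrame, without writing any file.

```python
import pandas as pd

def format_debate_lines(df):
    """Return one formatted line per turn, debate by debate, in turn order (hypothetical helper)."""
    lines = []
    # sort=False keeps debates in order of first appearance
    for debate_id, debate in df.groupby("DebateUnitID", sort=False):
        for _, row in debate.sort_values("TurnSequence").iterrows():
            lines.append(f"[{debate_id}] **{row['TurnRole_Danish']}**: {row['Utterance']}")
    return lines

# Toy data (invented for illustration); note the out-of-order turns in debate 7
toy = pd.DataFrame({
    "DebateUnitID": [7, 7, 9],
    "TurnSequence": [1, 0, 0],
    "TurnRole_Danish": ["Minister", "Spørger", "Ordfører"],
    "Utterance": ["Svar.", "Spørgsmål?", "Tale."],
})
lines = format_debate_lines(toy)
print("\n".join(lines))
```

Keeping the `DebateUnitID` prefix on every line is what allows the lines to be shuffled later and still be traced back to their debate.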



In [2]:
import os
import pandas as pd

ROB_FILE_PATH = "/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/output/clean/reading_of_bill/reading_of_bill_nochair_data_2009_2025.csv"
DB_FILE_PATH = "/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/output/clean/deliberation/deliberation_nochair_data_2009_2025.csv"
PLD_FILE_PATH = "/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/output/clean/party_leader_debate/party_leader_debate_nochair_data_2009_2025.csv"
QA_FILE_PATH = "/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/output/clean/question_answering/question_answering_nochair_data_2009_2025.csv"
OTHER_FILE_PATH = "/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/output/clean/other/other_nochair_data_2009_2025.csv"

# Load data
rob_df = pd.read_csv(ROB_FILE_PATH)
db_df = pd.read_csv(DB_FILE_PATH)
pld_df = pd.read_csv(PLD_FILE_PATH)
qa_df = pd.read_csv(QA_FILE_PATH)
other_df = pd.read_csv(OTHER_FILE_PATH)

# Keep only debates whose max TurnSequence exceeds 2 (TurnSequence starts at 0, so at least 4 turns)
rob_df_over_2 = rob_df[rob_df.groupby("DebateUnitID")["TurnSequence"].transform("max") > 2].copy()
db_df_over_2 = db_df[db_df.groupby("DebateUnitID")["TurnSequence"].transform("max") > 2].copy()
pld_df_over_2 = pld_df[pld_df.groupby("DebateUnitID")["TurnSequence"].transform("max") > 2].copy()
qa_df_over_2 = qa_df[qa_df.groupby("DebateUnitID")["TurnSequence"].transform("max") > 2].copy()
other_df_over_2 = other_df[other_df.groupby("DebateUnitID")["TurnSequence"].transform("max") > 2].copy()
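
To see why `transform("max")` works as a row filter, here is a small sketch on invented data. It assumes, as the rows displayed in this notebook suggest, that `TurnSequence` starts at 0 within each debate; the `toy` frame is hypothetical.

```python
import pandas as pd

# Debate 1 has three turns (0..2), debate 2 has four turns (0..3)
toy = pd.DataFrame({
    "DebateUnitID": [1, 1, 1, 2, 2, 2, 2],
    "TurnSequence": [0, 1, 2, 0, 1, 2, 3],
})

# transform("max") broadcasts each debate's maximum back onto every one of its rows,
# so the comparison yields a boolean mask aligned with the original frame
max_per_row = toy.groupby("DebateUnitID")["TurnSequence"].transform("max")
kept = toy[max_per_row > 2].copy()
kept_ids = kept["DebateUnitID"].unique().tolist()
print(kept_ids)
```

Because the mask is computed per debate, a debate is always kept or dropped as a whole, never row by row.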

## How many debates are in the dfs? 


In [3]:
print("Full data:")
print(f"Reading of bill debate exchanges: {len(rob_df.DebateUnitID.unique())}")
print(f"Deliberation debate exchanges: {len(db_df.DebateUnitID.unique())}")
print(f"Party-leader debate exchanges: {len(pld_df.DebateUnitID.unique())}")
print(f"Question-answering debate exchanges: {len(qa_df.DebateUnitID.unique())}")
print(f"Other-debate exchanges: {len(other_df.DebateUnitID.unique())}")

print("\nLonger debates:")
print(f"Reading of bill debate exchanges: {len(rob_df_over_2.DebateUnitID.unique())}")
print(f"Deliberation debate exchanges: {len(db_df_over_2.DebateUnitID.unique())}")
print(f"Party-leader debate exchanges: {len(pld_df_over_2.DebateUnitID.unique())}")
print(f"Question-answering debate exchanges: {len(qa_df_over_2.DebateUnitID.unique())}")
print(f"Other-debate exchanges: {len(other_df_over_2.DebateUnitID.unique())}")


Full data:
Reading of bill debate exchanges: 64302
Deliberation debate exchanges: 20554
Party-leader debate exchanges: 457
Question-answering debate exchanges: 10126
Other-debate exchanges: 1951

Longer debates:
Reading of bill debate exchanges: 40059
Deliberation debate exchanges: 14877
Party-leader debate exchanges: 406
Question-answering debate exchanges: 8349
Other-debate exchanges: 36


In [None]:
# Combine all debate types, then keep the 2020-2025 subset
import pandas as pd

# Concatenate the filtered DataFrames (the date filter follows below)
combined_df = pd.concat([rob_df_over_2,
                         db_df_over_2,
                         pld_df_over_2,
                         qa_df_over_2,
                         other_df_over_2], axis=0, ignore_index=True)


# Convert to datetime if not already
combined_df['Date'] = pd.to_datetime(combined_df['Date'])

# Saving
combined_df.to_csv("all_debate_types_over_2_turns_all_years.csv")

# Get min and max date
date_min = combined_df['Date'].min()
date_max = combined_df['Date'].max()

print(f"Date range: {date_min} to {date_max}")

# Filter rows from 2020-01-01 onward
combined_df_2020_2025 = combined_df[combined_df['Date'] >= pd.to_datetime("2020-01-01")]

# Gonna use these on the 18th and test it
combined_df_2020_2025.to_csv("all_types_2020_2025.csv")

Date range: 2009-10-07 13:00:00 to 2025-02-20 10:00:00


In [None]:
pld_debates_2020_2025 = combined_df_2020_2025[combined_df_2020_2025['DebateType'] == "party_leader_debate"]
print(len(pld_debates_2020_2025['DebateUnitID'].unique()))

# Remove the debates that are already annotated; the remainder is split into chunks of 25 in the next cell
already_annotated = [
    101091, 79870, 79862, 101079, 92435, 101076, 92447,
    101112, 92403, 89394, 114406, 101062, 101129, 114445,
    92428, 79845, 114428, 101133, 79832, 101134, 106920,
    101104
]

# Keep only rows whose DebateUnitID is not in the removal list
pld_debates_2020_2025_filt = pld_debates_2020_2025[~pld_debates_2020_2025["DebateUnitID"].isin(already_annotated)]

pld_debates_2020_2025_filt

366


Unnamed: 0,SessionID,MeetingNumber,Date,Location,AgendaItemNo,AgendaTitle,DebateType,TurnNo,Speaker,Party,Role,TurnRole,Time,Utterance,AgendaCategory,MeetingDateID,AgendaTitleDateID,TurnSequence,DebateUnitID,TurnRole_Danish
242148,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,2,Mette Frederiksen,,minister,minister,,På vegne af kollektivet af partiledere skal je...,Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00,0,79830,Minister
242149,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,4,Jakob Ellemann-Jensen,V,medlem,asker,,"Tak for det, og tak til i denne forbindelse pa...",Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00,1,79830,Spørger
242150,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,5,Mette Frederiksen,,minister,minister,,"Først og fremmest vil jeg sige, at det undrer ...",Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00,2,79830,Minister
242151,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,7,Jakob Ellemann-Jensen,V,medlem,member,,"Jo, men den er jo ikke rigtig nok, hvis resten...",Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00,3,79830,Ukendt
242152,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,9,Mette Frederiksen,,minister,minister,,Nu kender jeg jo spørgeren som en glødende eur...,Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00,4,79830,Minister
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243667,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,386,Martin Lidegaard,RV,medlem,member,,"Jeg synes, det er spændende tanker, og jeg vil...",Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00,3,114448,Ukendt
243668,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,394,Pelle Dragsted,EL,medlem,asker,,Tak for det. Nu opfattede partilederen jo krit...,Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00,0,114450,Spørger
243669,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,396,Peter Kofod,DF,medlem,member,,Tak for roserne. Jeg har irettesat vores ordfø...,Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00,1,114450,Ukendt
243670,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,398,Pelle Dragsted,EL,medlem,member,,Det synes jeg lyder rigtig godt. Vi har jo at ...,Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00,2,114450,Ukendt


In [None]:
# Shuffle the 2020-2025 party-leader debates and save them in chunks of 25
import pandas as pd
import random

def split_fix_and_save_debate_chunks(df, 
                                     debate_id_col="DebateUnitID",
                                     turn_seq_col="TurnSequence",
                                     speaker_col="Speaker",
                                     turnrole_col="TurnRole",
                                     turnrole_dk_col="TurnRole_Danish",
                                     chunk_size=25, 
                                     file_prefix="debates_chunk",
                                     random_seed=None):
    """
    Splits the df into chunks of up to chunk_size unique DebateUnitIDs,
    in a random order, fixes 'Ukendt' roles by remembering each speaker's first 
    known role in that debate, and saves each chunk to a separate CSV.
    """

    # 1) Set a random seed 
    if random_seed is not None:
        random.seed(random_seed)

    # 2) Role mapping from the English TurnRole to the correct Danish TurnRole
    role_mapping_en_to_da = {
        "asker":    "Spørger",
        "minister": "Minister",
        "member":   "Medlem",
        "medlem":   "Medlem",  # or unify as needed
    }

    # 3) Get unique DebateUnitIDs and shuffle them
    unique_ids = df[debate_id_col].unique().tolist()
    random.shuffle(unique_ids)

    # 4) Iterate over IDs in steps of chunk_size
    for i in range(0, len(unique_ids), chunk_size):
        chunk_ids = unique_ids[i : i + chunk_size]
        
        # Filter the df for these DebateUnitIDs
        chunk_df = df[df[debate_id_col].isin(chunk_ids)].copy()

        # 5) Fix roles inside this chunk by grouping each DebateUnitID separately
        fixed_dfs = []
        for debate_id, debate_df in chunk_df.groupby(debate_id_col):
            # Sort to ensure we process in ascending TurnSequence
            debate_df = debate_df.sort_values(by=turn_seq_col).copy()

            # A dict mapping speaker -> TurnRole_Danish for this DebateUnitID
            speaker_role_map = {}

            # Iterate over each turn in this debate
            for idx, row in debate_df.iterrows():
                speaker = row[speaker_col]
                curr_dk = row[turnrole_dk_col]   # Current TurnRole_Danish
                curr_en = row[turnrole_col]      # Current TurnRole (English)

                # Have we seen this speaker before in this debate?
                if speaker in speaker_role_map:
                    # Overwrite TurnRole_Danish with the stored role if needed
                    if pd.isna(curr_dk) or curr_dk == "Ukendt":
                        debate_df.at[idx, turnrole_dk_col] = speaker_role_map[speaker]
                else:
                    # If 'Ukendt' or NaN, try to map from TurnRole (English)
                    if pd.isna(curr_dk) or curr_dk == "Ukendt":
                        mapped_dk = role_mapping_en_to_da.get(curr_en, "Ukendt")
                        debate_df.at[idx, turnrole_dk_col] = mapped_dk
                        speaker_role_map[speaker] = mapped_dk
                    else:
                        # If it already has a valid Danish role, store it
                        speaker_role_map[speaker] = curr_dk

            fixed_dfs.append(debate_df)

        # Merge all DebateUnitIDs in this chunk back together
        chunk_df_fixed = pd.concat(fixed_dfs, ignore_index=True)

        # 6) Save this chunk to CSV
        chunk_num = (i // chunk_size) + 1
        out_file_name = f"{file_prefix}_2020_2025_pld_{chunk_num}.csv"
        chunk_df_fixed.to_csv(out_file_name, index=False)

        print(f"Saved chunk {chunk_num} with {len(chunk_ids)} DebateUnitIDs -> {out_file_name}")

split_fix_and_save_debate_chunks(pld_debates_2020_2025_filt, debate_id_col="DebateUnitID", chunk_size=25, file_prefix="debates_chunk")


Saved chunk 1 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_1.csv
Saved chunk 2 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_2.csv
Saved chunk 3 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_3.csv
Saved chunk 4 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_4.csv
Saved chunk 5 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_5.csv
Saved chunk 6 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_6.csv
Saved chunk 7 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_7.csv
Saved chunk 8 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_8.csv
Saved chunk 9 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_9.csv
Saved chunk 10 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_10.csv
Saved chunk 11 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_11.csv
Saved chunk 12 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_12.csv
Saved chunk 13 with 25 DebateUnitIDs -> debates_chunk_2020_2025_pld_13.csv
Saved chunk 14 with 19 DebateUnitIDs -> deb
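
The role-fixing rule inside `split_fix_and_save_debate_chunks()` can be tried on a single toy debate. The sketch below is a stripped-down, hypothetical helper (`fix_unknown_roles`) that mirrors the same logic without shuffling or file output; the toy rows are invented and loosely modelled on the pattern visible above, where a speaker is an `asker` on the first turn and a generic `member` afterwards.

```python
import pandas as pd

# English -> Danish role mapping, as in the function above
ROLE_MAP = {"asker": "Spørger", "minister": "Minister", "member": "Medlem"}

def fix_unknown_roles(debate_df):
    """Fill 'Ukendt'/NaN Danish roles within ONE debate (hypothetical helper)."""
    debate_df = debate_df.sort_values("TurnSequence").copy()
    seen = {}  # speaker -> role remembered from that speaker's first turn
    for idx, row in debate_df.iterrows():
        role = row["TurnRole_Danish"]
        unknown = pd.isna(role) or role == "Ukendt"
        if row["Speaker"] in seen:
            # Known speaker: reuse the remembered role if the current one is unknown
            if unknown:
                debate_df.at[idx, "TurnRole_Danish"] = seen[row["Speaker"]]
        else:
            # New speaker: map from the English role if needed, then remember it
            if unknown:
                role = ROLE_MAP.get(row["TurnRole"], "Ukendt")
                debate_df.at[idx, "TurnRole_Danish"] = role
            seen[row["Speaker"]] = role
    return debate_df

toy = pd.DataFrame({
    "TurnSequence": [0, 1, 2, 3],
    "Speaker": ["A", "B", "A", "B"],
    "TurnRole": ["asker", "member", "member", "member"],
    "TurnRole_Danish": ["Spørger", "Ukendt", "Ukendt", "Ukendt"],
})
fixed_roles = fix_unknown_roles(toy)["TurnRole_Danish"].tolist()
print(fixed_roles)
```

Note the design consequence: speaker A stays “Spørger” on the third turn even though the English role there is `member`, because the first known role wins over the per-turn mapping.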

In [None]:
# How many unique debates do we have now?
print(combined_df_2020_2025['DebateUnitID'].nunique())

# Now make pieces of 25 debates from different topics
# Corrected function to retain all rows of a sampled DebateUnitID
import pandas as pd

def sample_debates(df, df_name="Dataset", topic_col="AgendaCategory", topics=None, min_turns=2, max_turns=25, sample_n=1):
    """
    Filters debates based on the number of turns and samples entire debates (all rows) per specified topic.
    """

    if df.empty:
        print(f"{df_name} is empty. No debates to sample.")
        return pd.DataFrame()

    if topics is None:
        topics = ["Health Care", "Environment and Energy", "Education", "Immigration", "Justice", "Culture"]

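    # Keep debates whose max TurnSequence lies in [min_turns - 1, max_turns] (inclusive).
    # TurnSequence starts at 0, so this admits debates with min_turns to max_turns + 1 turns.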
    filtered_df = df[df.groupby("DebateUnitID")["TurnSequence"].transform("max").between(min_turns - 1, max_turns)]

    sampled_debates_list = []

    for topic in topics:
        topic_df = filtered_df[filtered_df[topic_col] == topic]
        if topic_df.empty:
            print(f"There are no entries for topic '{topic}' in {df_name}.")
        else:
            # Sample `sample_n` DebateUnitIDs
            sampled_ids = topic_df["DebateUnitID"].drop_duplicates().sample(min(len(topic_df["DebateUnitID"].unique()), sample_n), random_state=42)
            sampled_debate = topic_df[topic_df["DebateUnitID"].isin(sampled_ids)]
            sampled_debates_list.append(sampled_debate)

    # Combine sampled debates
    sampled_debates = pd.concat(sampled_debates_list, ignore_index=True) if sampled_debates_list else pd.DataFrame()

    return sampled_debates

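# NOTE: `df` is not defined in this cell; this assumes the 2020-2025 frame was the intended input.
# The function hardcodes "_2020_2025_pld_" in its output names, so a different prefix is used
# here to avoid overwriting the party-leader chunks saved above.
split_fix_and_save_debate_chunks(combined_df_2020_2025, debate_id_col="DebateUnitID", chunk_size=25, file_prefix="all_types_chunk")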



23642

## What percentage of each dataset has an unknown TurnRole?

In [None]:
# Function to check the percentage of unknown TurnRole values in each df
def check_unknown_turnrole(df, df_name):
    total_rows = len(df)
    # Build the mask once so the row count and the debate count treat NaN and 'unknown' alike
    unknown_mask = df["TurnRole"].isna() | (df["TurnRole"].str.lower() == "unknown")
    unknown_count = unknown_mask.sum()
    unknown_percentage = (unknown_count / total_rows) * 100 if total_rows > 0 else 0

    # Count DebateUnitIDs where at least one row has an unknown TurnRole
    debate_units_with_unknown = df.loc[unknown_mask, "DebateUnitID"].nunique()
    total_debate_units = df["DebateUnitID"].nunique()
    debate_units_percentage = (debate_units_with_unknown / total_debate_units) * 100 if total_debate_units > 0 else 0
    
    print(f"{df_name}: {unknown_count} out of {total_rows} rows ({unknown_percentage:.2f}%) have 'unknown' TurnRole.")
    print(f"{df_name}: {debate_units_with_unknown} out of {total_debate_units} DebateUnitIDs ({debate_units_percentage:.2f}%) contain at least one 'unknown' TurnRole.\n")

# Check for each dataset
check_unknown_turnrole(rob_df_over_2, "Reading of bill debate exchanges")
check_unknown_turnrole(db_df_over_2, "Deliberation debate exchanges")
check_unknown_turnrole(pld_df_over_2, "Party leader debate exchanges")
check_unknown_turnrole(qa_df_over_2, "Question-answering debate exchanges")
check_unknown_turnrole(other_df_over_2, "Other-debate exchanges")

Reading of bill debate exchanges: 14105 out of 178033 rows (7.92%) have 'unknown' TurnRole.
Reading of bill debate exchanges: 6070 out of 40059 DebateUnitIDs (15.15%) contain at least one 'unknown' TurnRole.

Deliberation debate exchanges: 5504 out of 63947 rows (8.61%) have 'unknown' TurnRole.
Deliberation debate exchanges: 2297 out of 14877 DebateUnitIDs (15.44%) contain at least one 'unknown' TurnRole.

Party leader debate exchanges: 0 out of 1692 rows (0.00%) have 'unknown' TurnRole.
Party leader debate exchanges: 0 out of 406 DebateUnitIDs (0.00%) contain at least one 'unknown' TurnRole.

Question-answering debate exchanges: 0 out of 57309 rows (0.00%) have 'unknown' TurnRole.
Question-answering debate exchanges: 0 out of 8349 DebateUnitIDs (0.00%) contain at least one 'unknown' TurnRole.

Other-debate exchanges: 22 out of 224 rows (9.82%) have 'unknown' TurnRole.
Other-debate exchanges: 7 out of 36 DebateUnitIDs (19.44%) contain at least one 'unknown' TurnRole.



### Taking a look at one

In [5]:
pld_df_over_2

Unnamed: 0,SessionID,MeetingNumber,Date,Location,AgendaItemNo,AgendaTitle,DebateType,TurnNo,Speaker,Party,Role,TurnRole,Time,Utterance,AgendaCategory,MeetingDateID,AgendaTitleDateID,TurnSequence,DebateUnitID,TurnRole_Danish
0,20181,45,2019-01-15 13:00:00,Folketingssalen,1,Partilederdebat,party_leader_debate,2,Mette Frederiksen,S,medlem,asker,,Tak for det. Jeg kan jo allerede glæde mig ove...,Elections & Parliamentary Processes,45_2019-01-15 13:00:00,Partilederdebat_2019-01-15 13:00:00,0,72395,Spørger
1,20181,45,2019-01-15 13:00:00,Folketingssalen,1,Partilederdebat,party_leader_debate,4,Kristian Thulesen Dahl,DF,medlem,asker,,"Tak for det. I den finanslovsaftale, Parti_F l...",Elections & Parliamentary Processes,45_2019-01-15 13:00:00,Partilederdebat_2019-01-15 13:00:00,1,72395,Spørger
2,20181,45,2019-01-15 13:00:00,Folketingssalen,1,Partilederdebat,party_leader_debate,6,Mette Frederiksen,S,medlem,member,,Vi støtter tanken – fuldstændig – og den gule ...,Elections & Parliamentary Processes,45_2019-01-15 13:00:00,Partilederdebat_2019-01-15 13:00:00,2,72395,Ukendt
3,20181,45,2019-01-15 13:00:00,Folketingssalen,1,Partilederdebat,party_leader_debate,8,Kristian Thulesen Dahl,DF,medlem,member,,"Det er jo en bekymring, jeg sagtens kan forstå...",Elections & Parliamentary Processes,45_2019-01-15 13:00:00,Partilederdebat_2019-01-15 13:00:00,3,72395,Ukendt
4,20181,45,2019-01-15 13:00:00,Folketingssalen,1,Partilederdebat,party_leader_debate,10,Mette Frederiksen,S,medlem,member,,"Altså, jeg synes jo, den bedste løsning på den...",Elections & Parliamentary Processes,45_2019-01-15 13:00:00,Partilederdebat_2019-01-15 13:00:00,4,72395,Ukendt
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1799,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,386,Martin Lidegaard,RV,medlem,member,,"Jeg synes, det er spændende tanker, og jeg vil...",Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00,3,114448,Ukendt
1803,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,394,Pelle Dragsted,EL,medlem,asker,,Tak for det. Nu opfattede partilederen jo krit...,Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00,0,114450,Spørger
1804,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,396,Peter Kofod,DF,medlem,member,,Tak for roserne. Jeg har irettesat vores ordfø...,Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00,1,114450,Ukendt
1805,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,398,Pelle Dragsted,EL,medlem,member,,Det synes jeg lyder rigtig godt. Vi har jo at ...,Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00,2,114450,Ukendt


## What are the ranges of the debates? 
Show the longest debates (the `DebateUnitID`s with the highest `TurnSequence`), the distribution of debate lengths, and the average `TurnSequence` length per unit.

In [7]:
# Check out the longest ones
import pandas as pd

# Define the list of dataframes and their corresponding labels
dataframes = {
    "Reading of Bill (rob_df)": rob_df_over_2,
    "Deliberation (db_df)": db_df_over_2,
    "Question-Answering (qa_df)": qa_df_over_2,
    "Other (other_df)": other_df_over_2,
}

# Create an empty list to store the longest debates for each dataframe
longest_debates_list = []

# Iterate over each dataframe and find the longest debates
for label, df in dataframes.items():
    debate_lengths = df.groupby("DebateUnitID")["TurnSequence"].max()
    longest_debates = debate_lengths.nlargest(5)  # Get top 5 longest debates

    # Retrieve full rows corresponding to the longest DebateUnitIDs
    longest_df = df[df["DebateUnitID"].isin(longest_debates.index)].copy()
    longest_df["Dataset"] = label  # Add a column to indicate which dataset it came from

    longest_debates_list.append(longest_df)

# Concatenate all results into a single dataframe for display
longest_debates_df = pd.concat(longest_debates_list, ignore_index=True)
longest_debates_df

Unnamed: 0,SessionID,MeetingNumber,Date,Location,AgendaItemNo,AgendaTitle,DebateType,TurnNo,Speaker,Party,...,TurnRole,Time,Utterance,AgendaCategory,MeetingDateID,AgendaTitleDateID,TurnSequence,DebateUnitID,TurnRole_Danish,Dataset
0,20222,62,2023-05-16 13:00:00,Folketingssalen,22,1. behandling af B 63: Om udvidet producentans...,reading of bill,2,Signe Munk,SF,...,proponent,,Mange tak for det. I Danmark har vi en stærk m...,Business,62_2023-05-16 13:00:00,1. behandling af B 63: Om udvidet producentans...,0,109736,Ordfører,Reading of Bill (rob_df)
1,20222,62,2023-05-16 13:00:00,Folketingssalen,22,1. behandling af B 63: Om udvidet producentans...,reading of bill,2,Signe Munk,SF,...,unknown,,Mange tak for det. I Danmark har vi en stærk m...,Business,62_2023-05-16 13:00:00,1. behandling af B 63: Om udvidet producentans...,1,109736,Ukendt,Reading of Bill (rob_df)
2,20222,62,2023-05-16 13:00:00,Folketingssalen,22,1. behandling af B 63: Om udvidet producentans...,reading of bill,4,Magnus Heunicke,,...,minister,,Produktion og forbrug af tekstiler er miljø- o...,Business,62_2023-05-16 13:00:00,1. behandling af B 63: Om udvidet producentans...,2,109736,Minister,Reading of Bill (rob_df)
3,20222,62,2023-05-16 13:00:00,Folketingssalen,22,1. behandling af B 63: Om udvidet producentans...,reading of bill,4,Magnus Heunicke,,...,minister,,Produktion og forbrug af tekstiler er miljø- o...,Business,62_2023-05-16 13:00:00,1. behandling af B 63: Om udvidet producentans...,3,109736,Minister,Reading of Bill (rob_df)
4,20222,62,2023-05-16 13:00:00,Folketingssalen,22,1. behandling af B 63: Om udvidet producentans...,reading of bill,6,Signe Munk,SF,...,asker,,"Tak for ministerens tale. Det var lige ved, at...",Business,62_2023-05-16 13:00:00,1. behandling af B 63: Om udvidet producentans...,4,109736,Spørger,Reading of Bill (rob_df)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1162,20121,68,2013-03-13 13:00:00,Folketingssalen,2,Besvarelse af oversendte spørgsmål til ministrene,other,145,Manu Sareen,,...,minister,,"Jeg synes, det lige er at stramme den en anels...",Other,68_2013-03-13 13:00:00,Besvarelse af oversendte spørgsmål til ministr...,44,27880,Minister,Other (other_df)
1163,20121,68,2013-03-13 13:00:00,Folketingssalen,2,Besvarelse af oversendte spørgsmål til ministrene,other,147,Fatma Øktem,V,...,asker,,"Det er imponerende, at ligestillingsministeren...",Other,68_2013-03-13 13:00:00,Besvarelse af oversendte spørgsmål til ministr...,45,27880,Spørger,Other (other_df)
1164,20121,68,2013-03-13 13:00:00,Folketingssalen,2,Besvarelse af oversendte spørgsmål til ministrene,other,149,Manu Sareen,,...,minister,,"Jeg synes, det er et problem for de kvinder og...",Other,68_2013-03-13 13:00:00,Besvarelse af oversendte spørgsmål til ministr...,46,27880,Minister,Other (other_df)
1165,20121,68,2013-03-13 13:00:00,Folketingssalen,2,Besvarelse af oversendte spørgsmål til ministrene,other,151,Fatma Øktem,V,...,asker,,"Det lød meget flot, og det er fuldstændig korr...",Other,68_2013-03-13 13:00:00,Besvarelse af oversendte spørgsmål til ministr...,47,27880,Spørger,Other (other_df)
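
The cell above covers only the longest debates. The distribution and the average length per debate could be computed along the following lines. This is a sketch on invented data: `summarize_debate_lengths` is a hypothetical helper, and it assumes `TurnSequence` starts at 0 and has no gaps, so that the turn count equals the maximum plus one.

```python
import pandas as pd

def summarize_debate_lengths(df):
    """Return turns per debate, their mean, and the length distribution (hypothetical helper)."""
    # Assumption: TurnSequence is 0-based and gap-free, so count = max + 1
    turns = df.groupby("DebateUnitID")["TurnSequence"].max() + 1
    distribution = turns.value_counts().sort_index()  # number of debates per length
    return turns, turns.mean(), distribution

# Toy data: debate 1 has 4 turns, debate 2 has 6 turns
toy = pd.DataFrame({
    "DebateUnitID": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "TurnSequence": [0, 1, 2, 3, 0, 1, 2, 3, 4, 5],
})
turns, mean_turns, distribution = summarize_debate_lengths(toy)
print(f"Average turns per debate: {mean_turns}")
print(distribution)
```

On the real data, the same helper could be applied to each entry of the `dataframes` dict defined above to compare debate types.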


# Take a look at 'other_df' types

## Getting debates into a format that makes sense


In [5]:
# Print unique topics for each dataset to check representation
datasets_to_check = {
    "rob_df_over_2": rob_df_over_2,
    "db_df_over_2": db_df_over_2,
    "pld_df_over_2": pld_df_over_2,
    "qa_df_over_2": qa_df_over_2,
    "other_df_over_2": other_df_over_2,
}

for name, frame in datasets_to_check.items():
    if "AgendaCategory" in frame.columns:
        print(f"Topics in {name}: {frame['AgendaCategory'].unique()}")
    else:
        print(f"AgendaCategory column is missing in {name}")


Topics in rob_df_over_2: ['Foreign Affairs' 'Justice' 'Infrastructure' 'Other' 'Economy'
 'Health Care' 'Culture' 'Labour' 'Agriculture' 'Environment and Energy'
 'Defence' 'Business' 'Housing' 'Local and Regional Affairs'
 'Social Affairs' 'Immigration' 'Education'
 'Elections & Parliamentary Processes' 'Territories']
Topics in db_df_over_2: ['Other' 'Environment and Energy' 'Immigration' 'Defence'
 'Local and Regional Affairs' 'Business' 'Education' 'Culture'
 'Health Care' 'Justice' 'Infrastructure' 'Social Affairs'
 'Foreign Affairs' 'Economy' 'Agriculture' 'Territories' 'Labour'
 'Elections & Parliamentary Processes' 'Housing']
Topics in pld_df_over_2: ['Elections & Parliamentary Processes']
Topics in qa_df_over_2: ['Elections & Parliamentary Processes' 'Immigration' 'Foreign Affairs'
 'Economy' 'Defence']
Topics in other_df_over_2: ['Other']


## Sampling function

In [None]:
# Corrected function to retain all rows of a sampled DebateUnitID
import pandas as pd

def sample_debates(df, df_name="Dataset", topic_col="AgendaCategory", topics=None, min_turns=2, max_turns=25, sample_n=1):
    """
    Filters debates based on the number of turns and samples entire debates (all rows) per specified topic.
    """

    if df.empty:
        print(f"{df_name} is empty. No debates to sample.")
        return pd.DataFrame()

    if topics is None:
        topics = ["Health Care", "Environment and Energy", "Education", "Immigration", "Justice", "Culture"]

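    # Keep debates whose max TurnSequence lies in [min_turns - 1, max_turns] (inclusive).
    # TurnSequence starts at 0, so this admits debates with min_turns to max_turns + 1 turns.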
    filtered_df = df[df.groupby("DebateUnitID")["TurnSequence"].transform("max").between(min_turns - 1, max_turns)]

    sampled_debates_list = []

    for topic in topics:
        topic_df = filtered_df[filtered_df[topic_col] == topic]
        if topic_df.empty:
            print(f"There are no entries for topic '{topic}' in {df_name}.")
        else:
            # Sample `sample_n` DebateUnitIDs
            sampled_ids = topic_df["DebateUnitID"].drop_duplicates().sample(min(len(topic_df["DebateUnitID"].unique()), sample_n), random_state=42)
            sampled_debate = topic_df[topic_df["DebateUnitID"].isin(sampled_ids)]
            sampled_debates_list.append(sampled_debate)

    # Combine sampled debates
    sampled_debates = pd.concat(sampled_debates_list, ignore_index=True) if sampled_debates_list else pd.DataFrame()

    return sampled_debates

# Define the list of topics to sample from
rob_topics = rob_df_over_2['AgendaCategory'].unique()
db_topics = db_df_over_2['AgendaCategory'].unique()
pld_topics = pld_df_over_2['AgendaCategory'].unique()
qa_topics = qa_df_over_2['AgendaCategory'].unique()

# Apply the function to each dataset and sample debates
rob_sampled = sample_debates(rob_df_over_2, df_name="rob_df_over_2", topics=rob_topics, sample_n=25) # samples from each topic
db_sampled = sample_debates(db_df_over_2, df_name="db_df_over_2", topics=db_topics, sample_n=25)
pld_sampled = sample_debates(pld_df_over_2, df_name="pld_df_over_2", topics=pld_topics, sample_n=25)
qa_sampled = sample_debates(qa_df_over_2, df_name="qa_df_over_2", topics=qa_topics, sample_n=25)
#other_sampled = sample_debates(other_df_over_2, df_name="other_df_over_2", topics=general_topics) # Excluding for now

# Concatenate all sampled debates into one DataFrame
final_sampled_df = pd.concat([rob_sampled, db_sampled, pld_sampled, qa_sampled], ignore_index=True)
final_sampled_df

# Verifying if multiple rows exist per sampled DebateUnitID
sampled_df_grouped_counts = final_sampled_df.groupby("DebateUnitID").size()

# Count unique DebateUnitID values
unique_debate_ids = final_sampled_df["DebateUnitID"].nunique()

# Checking if multiple rows exist per sampled DebateUnitID
{
    "Total unique DebateUnitID": unique_debate_ids,
    "Total rows in final_sampled_df": len(final_sampled_df),
    "DebateUnitID rows count": sampled_df_grouped_counts.describe()
}


{'Total unique DebateUnitID': 995,
 'Total rows in final_sampled_df': 4464,
 'DebateUnitID rows count': count    995.000000
 mean       4.486432
 std        0.781171
 min        4.000000
 25%        4.000000
 50%        4.000000
 75%        5.000000
 max       10.000000
 dtype: float64}
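
The summary above shows how many rows each sampled debate has, but on its own it does not confirm that no turns were lost along the way. A minimal sketch of a stricter check, assuming the sampled frame and the frame it was drawn from share the `DebateUnitID` column. The helper name `find_incomplete_debates` and the toy data are hypothetical and not part of the original notebook:

```python
import pandas as pd

def find_incomplete_debates(sampled, source, id_col="DebateUnitID"):
    """Return IDs whose row count in `sampled` differs from `source` (hypothetical helper)."""
    sampled_counts = sampled.groupby(id_col).size()
    # Align the source counts on the sampled IDs; IDs missing from source become NaN
    source_counts = source.groupby(id_col).size().reindex(sampled_counts.index)
    mismatched = sampled_counts[sampled_counts != source_counts]
    return mismatched.index.tolist()

# Toy data: debate 1 has three turns, debate 2 has two
source = pd.DataFrame({"DebateUnitID": [1, 1, 1, 2, 2], "TurnSequence": [0, 1, 2, 0, 1]})
complete = source[source["DebateUnitID"] == 1]
truncated = source.iloc[[0, 1, 3, 4]]  # drops the last turn of debate 1

print(find_incomplete_debates(complete, source))   # []
print(find_incomplete_debates(truncated, source))  # [1]
```

An empty list means every sampled debate carries all of its turns from the source frame.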

In [10]:
len(final_sampled_df['DebateUnitID'].unique())

995

In [11]:
final_sampled_df

Unnamed: 0,SessionID,MeetingNumber,Date,Location,AgendaItemNo,AgendaTitle,DebateType,TurnNo,Speaker,Party,Role,TurnRole,Time,Utterance,AgendaCategory,MeetingDateID,AgendaTitleDateID,TurnSequence,DebateUnitID,TurnRole_Danish
0,20111,91,2012-05-31 10:00:00,Folketingssalen,19,1. behandling af B 85: Om udbetaling af EU-opl...,reading of bill,58,Lisbeth Bech Poulsen,SF,medlem,proponent,,Med det fremsatte forslag ønsker forslagsstill...,Foreign Affairs,91_2012-05-31 10:00:00,1. behandling af B 85: Om udbetaling af EU-opl...,0,23139,Ordfører
1,20111,91,2012-05-31 10:00:00,Folketingssalen,19,1. behandling af B 85: Om udbetaling af EU-opl...,reading of bill,60,Lene Espersen,KF,medlem,asker,,"Jeg vil gerne starte med at spørge Ordføreren,...",Foreign Affairs,91_2012-05-31 10:00:00,1. behandling af B 85: Om udbetaling af EU-opl...,1,23139,Spørger
2,20111,91,2012-05-31 10:00:00,Folketingssalen,19,1. behandling af B 85: Om udbetaling af EU-opl...,reading of bill,62,Lisbeth Bech Poulsen,SF,medlem,proponent,,Jeg har ikke taget beslutningsforslaget med he...,Foreign Affairs,91_2012-05-31 10:00:00,1. behandling af B 85: Om udbetaling af EU-opl...,2,23139,Ordfører
3,20111,91,2012-05-31 10:00:00,Folketingssalen,19,1. behandling af B 85: Om udbetaling af EU-opl...,reading of bill,64,Lene Espersen,KF,medlem,asker,,Jeg vil bare bede Ordføreren svare på mit spør...,Foreign Affairs,91_2012-05-31 10:00:00,1. behandling af B 85: Om udbetaling af EU-opl...,3,23139,Spørger
4,20111,91,2012-05-31 10:00:00,Folketingssalen,19,1. behandling af B 85: Om udbetaling af EU-opl...,reading of bill,66,Lisbeth Bech Poulsen,SF,medlem,proponent,,"Jeg beklager, hvis den Parti_D ordfører blev s...",Foreign Affairs,91_2012-05-31 10:00:00,1. behandling af B 85: Om udbetaling af EU-opl...,4,23139,Ordfører
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4459,20091,91,2010-05-12 13:00:00,Folketingssalen,1,Spørgsmål til ministrene til umiddelbar besvar...,question-answering,84,Gitte Lillelund Bech,,minister,minister,,"Jeg var enormt ked af den historie, der dér fo...",Defence,91_2010-05-12 13:00:00,Spørgsmål til ministrene til umiddelbar besvar...,1,7563,Minister
4460,20091,91,2010-05-12 13:00:00,Folketingssalen,1,Spørgsmål til ministrene til umiddelbar besvar...,question-answering,86,John Dyrby Paulsen,S,medlem,asker,,Tak for svaret. Jeg er meget enig et langt sty...,Defence,91_2010-05-12 13:00:00,Spørgsmål til ministrene til umiddelbar besvar...,2,7563,Spørger
4461,20091,91,2010-05-12 13:00:00,Folketingssalen,1,Spørgsmål til ministrene til umiddelbar besvar...,question-answering,88,Gitte Lillelund Bech,,minister,minister,,Det vil jeg jo sådan set sige at jeg ikke kan ...,Defence,91_2010-05-12 13:00:00,Spørgsmål til ministrene til umiddelbar besvar...,3,7563,Minister
4462,20091,91,2010-05-12 13:00:00,Folketingssalen,1,Spørgsmål til ministrene til umiddelbar besvar...,question-answering,90,John Dyrby Paulsen,S,medlem,asker,,"Tak. Det, jeg hører forsvarsministeren sige, e...",Defence,91_2010-05-12 13:00:00,Spørgsmål til ministrene til umiddelbar besvar...,4,7563,Spørger


## Format and export (incl. random order)

In [None]:
import re

def export_debate_exchanges_to_txt(df, output_filename):
    """
    Groups the DataFrame by DebateUnitID, sorts each group by TurnSequence,
    formats the exchanges, and writes the result to a .txt file.
    
    Each line in the output file will be of the form:
    [DebateUnitID] **TurnRole_Danish**: Utterance **TurnRole_Danish**: Utterance ...
    
    Parameters:
        df (pandas.DataFrame): Input DataFrame containing at least the columns
            DebateUnitID, TurnSequence, TurnRole_Danish, and Utterance.
        output_filename (str): The name of the output .txt file.
        
    Returns:
        output_text (str): The complete text that was written to the file.
    """
    
    # Check if DataFrame is empty
    if df.empty:
        print("⚠️ No valid exchanges found.")
        return ""
    
    grouped_exchanges = []
    
    # Group by DebateUnitID and process each group
    for debate_id, group in df.groupby("DebateUnitID", as_index=False):
        # Sort each group by TurnSequence
        group_sorted = group.sort_values("TurnSequence")
        # Format each utterance as: **TurnRole_Danish**: Utterance
        formatted_utterances = " ".join(
            f"**{row['TurnRole_Danish']}**: {row['Utterance']}" 
            for _, row in group_sorted.iterrows()
        )
        # Prepend the DebateUnitID (only once per debate) to the formatted utterances
        formatted_exchange = f"[{debate_id}] {formatted_utterances}"
        grouped_exchanges.append(formatted_exchange)
    
    # Join each debate's formatted exchange as separate lines
    output_text = "\n".join(grouped_exchanges)
        
    # Write the result to a text file
    with open(output_filename, "w", encoding="utf-8") as f:
        f.write(output_text)
    
    print(f"Exchanges exported to {output_filename}")
    return output_text

# Example usage: rerun with a new output file name for each workshop to create new batches as we go.
#output_text = export_debate_exchanges_to_txt(final_sampled_df, "debate_exchanges_18.txt")
#output_text
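
To make the line format concrete, here is a small sketch that builds the same `[DebateUnitID] **TurnRole_Danish**: Utterance ...` lines for a toy DataFrame, without writing a file and without `iterrows()`. The helper name `format_exchanges` and the toy rows are hypothetical; this is an illustration of the format, not a replacement for `export_debate_exchanges_to_txt()`:

```python
import pandas as pd

def format_exchanges(df):
    """Return one formatted line per debate (hypothetical helper, no file I/O)."""
    # Order turns within each debate before joining them
    ordered = df.sort_values(["DebateUnitID", "TurnSequence"])
    turns = "**" + ordered["TurnRole_Danish"].astype(str) + "**: " + ordered["Utterance"].astype(str)
    # Join all turns of a debate into one space-separated string
    lines = turns.groupby(ordered["DebateUnitID"]).agg(" ".join)
    return [f"[{debate_id}] {text}" for debate_id, text in lines.items()]

toy = pd.DataFrame({
    "DebateUnitID": [7, 7, 3],
    "TurnSequence": [1, 0, 0],
    "TurnRole_Danish": ["Minister", "Spørger", "Ordfører"],
    "Utterance": ["Svar.", "Spørgsmål?", "Tale."],
})

for line in format_exchanges(toy):
    print(line)
# [3] **Ordfører**: Tale.
# [7] **Spørger**: Spørgsmål? **Minister**: Svar.
```

Note that the turns of debate 7 come out in `TurnSequence` order even though the toy rows were entered in reverse.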


In [None]:
import random

def randomize_lines(output_text):
    """
    Randomizes the order of the lines in a given multi-line text.

    Parameters:
        output_text (str): Multi-line string to be randomized.
    
    Returns:
        str: The randomized multi-line string.
    """
    # Split the text into lines, filtering out empty ones
    lines = [line for line in output_text.strip().split("\n") if line.strip()]
    # Shuffle the list of lines in place
    random.shuffle(lines)
    # Rejoin the lines into a single string with newline separation
    randomized_output = "\n".join(lines)
    return randomized_output


import pandas as pd

# A few of the generated chunks had mistakes that needed manual edits, so the corrected versions are loaded here.
file_2025_pld_1 = pd.read_csv("/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/debates_chunk_2020_2025_pld_1_manually_corrected.csv", sep=";")
print(len(file_2025_pld_1['DebateUnitID'].unique()))
file_2025_pld_1

file_2025_pld_2 = pd.read_csv("/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/debates_chunk_2020_2025_pld_2_manually_corrected.csv", sep=";")
print(len(file_2025_pld_2['DebateUnitID'].unique()))
file_2025_pld_2

file_2025_pld_3 = pd.read_csv("/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/debates_chunk_2020_2025_pld_3_manually_corrected.csv", sep=";")
print(len(file_2025_pld_3['DebateUnitID'].unique()))
file_2025_pld_3

file_2025_pld_4 = pd.read_csv("/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/debates_chunk_2020_2025_pld_4_manually_corrected.csv", sep=";")
print(len(file_2025_pld_4['DebateUnitID'].unique()))
file_2025_pld_4

file_2025_pld_5 = pd.read_csv("/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/debates_chunk_2020_2025_pld_5_manually_corrected.csv", sep=";")
print(len(file_2025_pld_5['DebateUnitID'].unique()))
file_2025_pld_5

file_2025_pld_6 = pd.read_csv("/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/debates_chunk_2020_2025_pld_6_manually_corrected.csv", sep=";")
print(len(file_2025_pld_6['DebateUnitID'].unique()))
file_2025_pld_6

file_2025_pld_7 = pd.read_csv("/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/thesis_work/data_cleaning/debates_chunk_2020_2025_pld_7_manually_corrected.csv", sep=";")
print(len(file_2025_pld_7['DebateUnitID'].unique()))
file_2025_pld_7


# randomized_text = randomize_lines(output_text_filtered) # 2020-2025 (300 or so debates)

output_text = export_debate_exchanges_to_txt(file_2025_pld_7, "debate_exchanges_pld_2020_2025_7_27th_march_3.txt")
randomized_output_text = randomize_lines(output_text) # shuffle the debates in chunk 7 (25 debates)

# Save the randomized text to a file
output_filename_random = "r_debate_exchanges_pld_2020_2025_7_27th_march_3.txt"
with open(output_filename_random, "w", encoding="utf-8") as f:
    f.write(randomized_output_text)


25
25
25
25
25
25
25
Exchanges exported to debate_exchanges_pld_2020_2025_7_27th_march_3.txt
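
`randomize_lines()` draws on the module-level random generator, so every run produces a different order. If an annotation file ever needs to be regenerated in exactly the same order, a seeded variant is one option. The following is a sketch under that assumption; the name `randomize_lines_seeded` and the seed value are hypothetical:

```python
import random

def randomize_lines_seeded(text, seed):
    """Shuffle non-empty lines with a private, seeded RNG (hypothetical helper)."""
    lines = [line for line in text.strip().split("\n") if line.strip()]
    # A dedicated Random instance leaves the global random state untouched
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)

sample = "[1] a\n[2] b\n[3] c\n[4] d"
first = randomize_lines_seeded(sample, seed=42)
second = randomize_lines_seeded(sample, seed=42)

print(first == second)  # True: the same seed gives the same order
```

Storing the seed next to the output file name would be enough to reproduce a given batch later.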


In [None]:
# Annotators flagged anonymization errors in the PLD debates that slipped through the earlier filters. They are fixed here.

# Load the manually corrected file (with edits to the turn roles)
pld_2020_2025 = pd.read_csv("partilederdebatter_2020_2025_manually_corrected.csv", sep=";")

# Re-run the replacements that did not fully work the first time
import re

# Define a mapping of party names (including historical names) to pseudonyms
party_pseudonyms = {
    # Socialdemokratiet
    "Socialdemokratiet": "Parti_A",
    "Socialdemokraterne": "Parti_A",
    "Socialdemokraternes": "Parti_As",
    "Socialdemokratiets": "Parti_As",
    "Socialdemokratisk": "Parti_As",
    "Socialdemokrater": "Parti_A",

    # Venstre
    "Venstre": "Parti_B",
    "Venstres": "Parti_Bs",

    # Radikale Venstre
    "Radikale Venstre": "Parti_C",
    "Det Radikale Venstre": "Parti_C",
    #"Radikale": "Parti_C",
    "Radikales": "Parti_Cs",
    "De Radikale": "Parti_C",
    "De Radikales": "Parti_Cs",
    "Radikale": "Parti_C",

    # Konservative Folkeparti
    "Konservative Folkeparti": "Parti_D",
    "Det Konservative Folkeparti": "Parti_D",
    "Konservative": "Parti_D",
    "Konservatives": "Parti_Ds",
    "De Konservative": "Parti_D",
    "De Konservatives": "Parti_Ds",
    "konservativ side": "Parti_Ds side",

    # Socialistisk Folkeparti
    "Socialistisk Folkeparti": "Parti_E",
    "Socialistisk Folkepartis": "Parti_Es",
    "Socialistiske Folkeparti": "Parti_E",
    "Socialistiskes": "Parti_Es",
    "SF": "Parti_E",
    "SFs": "Parti_Es",
    "SF's": "Parti_Es",

    # Dansk Folkeparti
    "Dansk Folkeparti": "Parti_F",
    "Dansk Folkepartis": "Parti_Fs",

    # Fremskridtspartiet (historical DF name)
    "Fremskridtspartiet": "Parti_F",
    "Fremskridtspartiets": "Parti_Fs",

    # Enhedslisten
    "Enhedslisten": "Parti_G",
    "Enhedslistens": "Parti_Gs",
    "Rød-Grøn Alliance": "Parti_G",
    "Rød-Grønne Alliance": "Parti_G",

    # Liberal Alliance
    "Liberal Alliance": "Parti_H",
    "Liberale Alliance": "Parti_H",
    "Liberal Alliances": "Parti_Hs",
    "Liberales": "Parti_Hs",  # Genitive form

    # Ny Alliance (historical name before LA)
    "Ny Alliance": "Parti_H",
    "Ny Alliances": "Parti_Hs",

    # Alternativet
    "Alternativet": "Parti_I",
    "Alternativets": "Parti_Is",

    # Danmarksdemokraterne
    "Danmarksdemokraterne": "Parti_J",
    "Danmarksdemokraternes": "Parti_Js",

    # Nye Borgerlige
    "Nye Borgerlige": "Parti_K",
    "Nye Borgerliges": "Parti_Ks",

    # Frie Grønne
    "Frie Grønne": "Parti_L",
    "De Frie Grønne": "Parti_L",
    "Frie Grønnes": "Parti_Ls",

    # Kristendemokraterne
    "Kristendemokraterne": "Parti_M",
    "Kristendemokraternes": "Parti_Ms",
    "De Kristne Demokrater": "Parti_M",
    "Kristendemokratiet": "Parti_M",
    "Kristendemokratiets": "Parti_Ms",
}


# Compile a case-insensitive regex matching any party name as a whole word.
# Longer names come first so that e.g. "SF's" is tried before "SF".
party_names_longest_first = sorted(party_pseudonyms, key=len, reverse=True)
party_pattern = re.compile(
    r'\b(' + '|'.join(re.escape(party) for party in party_names_longest_first) + r')\b',
    re.IGNORECASE
)

# Lowercased lookup so that every case variant the regex matches gets a pseudonym
party_lookup = {party.lower(): pseudonym for party, pseudonym in party_pseudonyms.items()}

# Function to replace party names with pseudonyms
def replace_party_names(text):
    if pd.isna(text):  # Handle missing values
        return text
    return party_pattern.sub(
        lambda match: party_lookup.get(match.group(0).lower(), match.group(0)),
        str(text)
    )

# Apply function to the "Utterance" column (missing values stay missing)
pld_2020_2025["Utterance"] = pld_2020_2025["Utterance"].apply(replace_party_names)

print("✅ Party names replaced with pseudonyms, including historical names.")

import re
import pandas as pd

def replace_mentioned_names(row, name_list):
    text = row["Utterance"]
    role = row["TurnRole"]
    
    if pd.isna(text):  # Handle missing values
        return text

    # Set replacement term based on current speaker's role
    if role in ["minister", "proponent"]:
        replacement = "Spørgeren"
    elif role == "asker":
        replacement = "Ordføreren"
    else:
        replacement = "Taleren"

    # Replace all names found in the name_list
    for name in name_list:
        if isinstance(name, str):  # Ensure it's a valid string
            # Regex to match:
            #   - "hr." or "fru" (case-insensitive)
            #   - The speaker name
            #   - An optional genitive 's, ’s, or plain s (to catch e.g. "Mercados" or "Mercado’s")
            name_pattern = re.compile(
                rf"\b(?:hr\.|fru)\s*{re.escape(name)}(?:[’']s|s)?\b",
                re.IGNORECASE
            )
            text = name_pattern.sub(replacement, text)

            # Also catch the name on its own (without "hr."/"fru"), with or without a genitive s
            name_pattern_no_honorific = re.compile(
                rf"\b{re.escape(name)}(?:[’']s|s)?\b",
                re.IGNORECASE
            )
            text = name_pattern_no_honorific.sub(replacement, text)

    return text

# Process each DebateUnitID group
for debateunitid, group in pld_2020_2025.groupby("DebateUnitID"):
    unique_names = group["Speaker"].dropna().unique().tolist()  # Get unique speaker names

    pld_2020_2025.loc[group.index, "Utterance"] = group.apply(
        lambda row: replace_mentioned_names(row, unique_names),
        axis=1
    )


print("✅ Speaker names in utterances replaced with generic references.")

# Display a sample of the updated dataset
pld_2020_2025.head(30)

# Drop erroneous rows where TurnSequence holds the string "tale" instead of a turn number
pld_2020_2025 = pld_2020_2025[pld_2020_2025["TurnSequence"] != "tale"]

# Take a look
pld_2020_2025

✅ Party names replaced with pseudonyms, including historical names.
✅ Speaker names in utterances replaced with generic references.


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,SessionID,MeetingNumber,Date,Location,AgendaItemNo,AgendaTitle,DebateType,TurnNo,...,Role,TurnRole,TurnSequence,DebateUnitID,TurnRole_Danish,Time,Utterance,AgendaCategory,MeetingDateID,AgendaTitleDateID
0,1374,3978,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,19,...,medlem,asker,0,79832,Spørger,,Når jeg bevæger mig på gader og stræder og kig...,Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00
1,1375,3979,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,21,...,minister,minister,1,79832,Minister,,Først og fremmest vil jeg gerne takke Parti_C ...,Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00
2,1376,3980,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,23,...,medlem,asker,2,79832,Spørger,,"Jeg synes jo, at risikoen ved den tilgang er, ...",Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00
3,1377,3981,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,25,...,minister,minister,3,79832,Minister,,"Jeg synes ikke, der er grund til at vente med ...",Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00
4,1378,3982,20191,50,2020-01-21 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,125,...,medlem,asker,0,79845,Spørger,,Parti_B har som bekendt tilsluttet sig klimalo...,Elections & Parliamentary Processes,50_2020-01-21 13:00:00,Partilederdebat._2020-01-21 13:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92,1466,4070,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,232,...,medlem,member,3,114428,Medlem,,"Det er jeg sådan set enig i, men det handler n...",Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00
93,1467,4071,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,356,...,medlem,asker,0,114445,Spørger,,Jeg har også været ude at besøge en del unge o...,Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00
94,1468,4072,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,358,...,medlem,member,1,114445,Medlem,,"Vi er glade for, at de uddannelser bliver løft...",Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00
95,1469,4073,20231,44,2024-01-16 13:00:00,Folketingssalen,1,Partilederdebat.,party_leader_debate,360,...,medlem,asker,2,114445,Spørger,,"Men man har jo ikke ønsket sig det så meget, a...",Elections & Parliamentary Processes,44_2024-01-16 13:00:00,Partilederdebat._2024-01-16 13:00:00
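
Since annotators had already caught party names that slipped through, a quick scan for leftover mentions after the replacements can help before exporting. The following is a minimal sketch, assuming the terms to look for are passed in as a list (for example the keys of `party_pseudonyms`). The helper name `find_residual_mentions`, the `Matches` column, and the toy rows are hypothetical:

```python
import re
import pandas as pd

def find_residual_mentions(df, terms, text_col="Utterance"):
    """Return rows whose text still contains any of `terms` as whole words (hypothetical helper)."""
    # Longest terms first so multi-word names are preferred over their parts
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    hits = df[text_col].astype(str).apply(pattern.findall)
    mask = hits.map(bool)  # True where at least one term was found
    flagged = df.loc[mask, ["DebateUnitID", text_col]].copy()
    flagged["Matches"] = hits[mask]
    return flagged

toy = pd.DataFrame({
    "DebateUnitID": [1, 1, 2],
    "Utterance": [
        "Parti_B har tilsluttet sig aftalen.",
        "Det mener Venstre ikke.",
        "Tak til Spørgeren.",
    ],
})

print(find_residual_mentions(toy, ["Venstre", "Radikale Venstre"]))
```

Only the second toy row is flagged. Because the match is case-insensitive, ordinary words such as "venstre" (left) would be flagged too, so the result is a list for manual review rather than a definitive error count.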


In [13]:
# Now export and randomize 
output_text_2020_2025_pld = export_debate_exchanges_to_txt(pld_2020_2025, "pld_short_2020_2025_manually_corrected.txt")

randomized_text = randomize_lines(output_text_2020_2025_pld)

# Save the randomized text to a file
output_filename_random = "randomized_pld_short_2020_2025_7.txt" # print more and upload in new folder with username_short
with open(output_filename_random, "w", encoding="utf-8") as f:
    f.write(randomized_text)

print(f"Randomized exchanges saved to {output_filename_random}")


Exchanges exported to pld_short_2020_2025_manually_corrected.txt
Randomized exchanges saved to randomized_pld_short_2020_2025_7.txt
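
Before uploading, it can be worth confirming that the randomized file holds exactly the same debates as the ordered export, only in a different order. A small sketch of such a check, assuming one debate per line as produced above. The helper name `same_lines` and the temporary toy files are hypothetical stand-ins for the real `.txt` files:

```python
from collections import Counter
from pathlib import Path
import tempfile

def same_lines(path_a, path_b):
    """True if both files hold the same non-empty lines, ignoring order (hypothetical helper)."""
    def load(path):
        text = Path(path).read_text(encoding="utf-8")
        # Counter keeps duplicates, so a repeated or missing line is detected
        return Counter(line for line in text.split("\n") if line.strip())
    return load(path_a) == load(path_b)

# Toy demonstration with temporary files standing in for the exported files
with tempfile.TemporaryDirectory() as tmp:
    ordered = Path(tmp) / "ordered.txt"
    shuffled = Path(tmp) / "shuffled.txt"
    truncated = Path(tmp) / "truncated.txt"
    ordered.write_text("[1] a\n[2] b\n[3] c", encoding="utf-8")
    shuffled.write_text("[3] c\n[1] a\n[2] b", encoding="utf-8")
    truncated.write_text("[3] c\n[1] a", encoding="utf-8")
    same = same_lines(ordered, shuffled)
    lost = same_lines(ordered, truncated)

print(same, lost)  # True False
```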
