## 📙 00 Preprocessing

This notebook prepares the Hellenic Parliament Proceedings dataset for analysis. The preprocessing steps follow best practices in political text analysis (Denny & Spirling 2017) and are tailored to the downstream analysis required by each Research Question.

In [1]:
#data-cleaning-processing-core-libs
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import time
tqdm.pandas()
import ipywidgets as widgets
widgets.IntSlider()

#text-processing
import re
import string
import unicodedata

#greek-nlp-toolkit-Loukas(2024)
from gr_nlp_toolkit import Pipeline

#greek-stopwords
import stopwordsiso as stopwords_iso
greek_stopwords = list(stopwords_iso.stopwords("el"))

In [35]:
df = pd.read_csv('par10_20c.csv')

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536446 entries, 0 to 536445
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   member_name            522524 non-null  object
 1   sitting_date           536446 non-null  object
 2   parliamentary_period   536446 non-null  object
 3   parliamentary_session  536446 non-null  object
 4   parliamentary_sitting  536446 non-null  object
 5   political_party        536327 non-null  object
 6   government             536446 non-null  object
 7   member_region          513121 non-null  object
 8   roles                  522524 non-null  object
 9   member_gender          522524 non-null  object
 10  speech                 536446 non-null  object
dtypes: object(11)
memory usage: 45.0+ MB


### Data handling & Cleaning

Columns that are not needed for the textual or contextual analysis are dropped:

In [37]:
columns_to_drop = ["parliamentary_session", "parliamentary_sitting", "parliamentary_period","member_region"]
df = df.drop(columns=columns_to_drop)

Rows with missing values in the critical columns member_name, roles, member_gender, and political_party are dropped. The share of missing values per column is checked first:

In [38]:
missing_percent = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_percent.round(2))

member_name        2.60
roles              2.60
member_gender      2.60
political_party    0.02
sitting_date       0.00
government         0.00
speech             0.00
dtype: float64


In [39]:
df = df.dropna(subset=["member_name", "roles", "member_gender", "political_party"])

A new column year is extracted from sitting_date to facilitate time-based analysis.

In [40]:
df['sitting_date'] = pd.to_datetime(df['sitting_date'], errors='coerce')

In [41]:
df['year'] = df['sitting_date'].dt.year

In [42]:
df['political_party'].unique()

array(['πανελληνιο σοσιαλιστικο κινημα', 'νεα δημοκρατια',
       'εξωκοινοβουλευτικός', 'κομμουνιστικο κομμα ελλαδας',
       'λαικος ορθοδοξος συναγερμος',
       'συνασπισμος ριζοσπαστικης αριστερας',
       'ανεξαρτητοι (εκτος κομματος)', 'δημοκρατικη αριστερα',
       'ανεξαρτητοι ελληνες - πανος καμμενος',
       'λαικος συνδεσμος - χρυση αυγη',
       'ανεξαρτητοι δημοκρατικοι βουλευτες', 'το ποταμι',
       'ανεξαρτητοι ελληνες εθνικη πατριωτικη δημοκρατικη συμμαχια',
       'λαικη ενοτητα',
       'δημοκρατικη συμπαραταξη (πανελληνιο σοσιαλιστικο κινημα - δημοκρατικη αριστερα)',
       'ενωση κεντρωων', 'κινημα αλλαγης',
       'ελληνικη λυση - κυριακος βελοπουλος',
       'μετωπο ευρωπαικης ρεαλιστικης ανυπακοης (μερα25)'], dtype=object)

In [43]:
df['government'].unique()

array(["['παπανδρεου α. γεωργιου(06/10/2009-11/11/2011)']",
       "['παπαδημου λουκα δ.(11/11/2011-17/05/2012)']",
       "['πικραμμενου παναγιωτη οθ. (υπηρεσιακη)(17/05/2012-21/06/2012)']",
       "['σαμαρα κ. αντωνιου(21/06/2012-26/01/2015)']",
       "['τσιπρα π. αλεξιου(26/01/2015-27/08/2015)']",
       "['τσιπρα π. αλεξιου(21/09/2015-08/07/2019)']",
       "['μητσοτακη κυριακου(08/07/2019-28/07/2020)']"], dtype=object)

A custom mapping is created that links each prime minister's name to the years in office and the governing coalition parties.

In [44]:
# Define the government coalition mapping (PM name substring, years, and parties)
gov_coalitions = [
    {
        'pm': 'παπανδρεου α. γεωργιου',
        'years': [2009, 2010, 2011],
        'parties': ['πανελληνιο σοσιαλιστικο κινημα']
    },
    {
        'pm': 'παπαδημου λουκα δ.',
        'years': [2011, 2012],
        'parties': ['πανελληνιο σοσιαλιστικο κινημα', 'νεα δημοκρατια', 'λαικος ορθοδοξος συναγερμος']
    },
    {
        'pm': 'σαμαρα κ. αντωνιου',
        'years': [2012, 2013, 2014, 2015],
        'parties': ['πανελληνιο σοσιαλιστικο κινημα', 'νεα δημοκρατια']
    },
    {
        'pm': 'τσιπρα π. αλεξιου',
        'years': [2015, 2016, 2017, 2018, 2019],
        'parties': ['συνασπισμος ριζοσπαστικης αριστερας', 'ανεξαρτητοι ελληνες εθνικη πατριωτικη δημοκρατικη συμμαχια']
    }
]

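A function is defined to check whether a speaker's party is part of the governing coalition. Governments that do not appear in the mapping (the Pikrammenos caretaker cabinet and the Mitsotakis cabinet) yield 0 for every speaker.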

In [45]:
# Function to check if a speaker was in government
def is_government(row):
    gov_name = row['government'].lower()
    party = row['political_party'].lower()
    year = row['year']

    for coalition in gov_coalitions:
        if (coalition['pm'] in gov_name) and (year in coalition['years']) and (party in coalition['parties']):
            return 1
    return 0

In [46]:
df['is_government'] = df.apply(is_government, axis=1)
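
As a quick sanity check, the matching logic can be exercised on a few hand-written rows. The sketch below is self-contained: plain dictionaries stand in for DataFrame rows (both support `row['column']` access), only one coalition entry is copied from the mapping above, and the three sample rows are hypothetical.

```python
# One coalition entry copied from the mapping above
gov_coalitions = [
    {
        'pm': 'σαμαρα κ. αντωνιου',
        'years': [2012, 2013, 2014, 2015],
        'parties': ['πανελληνιο σοσιαλιστικο κινημα', 'νεα δημοκρατια'],
    },
]

def is_government(row):
    gov_name = row['government'].lower()
    party = row['political_party'].lower()
    year = row['year']
    for coalition in gov_coalitions:
        # All three conditions must hold: PM name, year, and party
        if (coalition['pm'] in gov_name) and (year in coalition['years']) and (party in coalition['parties']):
            return 1
    return 0

# Hypothetical rows: coalition party, opposition party, unmapped government
rows = [
    {'government': "['σαμαρα κ. αντωνιου(21/06/2012-26/01/2015)']",
     'political_party': 'νεα δημοκρατια', 'year': 2013},
    {'government': "['σαμαρα κ. αντωνιου(21/06/2012-26/01/2015)']",
     'political_party': 'κομμουνιστικο κομμα ελλαδας', 'year': 2013},
    {'government': "['μητσοτακη κυριακου(08/07/2019-28/07/2020)']",
     'political_party': 'νεα δημοκρατια', 'year': 2020},
]

flags = [is_government(r) for r in rows]
print(flags)  # [1, 0, 0]
```

The last row returns 0 only because no mapping entry matches its government string, not because the party was in opposition.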

A function is defined to extract the speaker's role in government.

In [47]:
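# Titles are matched as substrings in list order, so longer titles that
# contain 'υπουργος' (e.g. 'υφυπουργος') must come before it.
gov_titles = [
    'πρωθυπουργος',
    'αντιπροεδρος της κυβερνησης',
    'αναπληρωτης υπουργος',
    'υφυπουργος',
    'υπουργος'
]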


In [48]:
def extract_speaker_role(role_text):
    role_text = str(role_text).lower()
    for title in gov_titles:
        if title in role_text:
            return title
    return None

In [49]:
df['speaker_gov_role'] = df['roles'].apply(extract_speaker_role)
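
Because the role is found by substring containment, the order in which titles are tried affects the result: `'υπουργος'` is itself a substring of `'υφυπουργος'` and `'πρωθυπουργος'`. The sketch below illustrates this with a hypothetical ordering (`titles_short_first`) and a hypothetical role string; sorting by length is one possible safeguard, not necessarily the approach used in this notebook.

```python
def first_match(role_text, titles):
    """Return the first title found as a substring of role_text, else None."""
    role_text = str(role_text).lower()
    for title in titles:
        if title in role_text:
            return title
    return None

# Hypothetical ordering in which the short title is tried before the long one
titles_short_first = ['πρωθυπουργος', 'αναπληρωτης υπουργος', 'υπουργος', 'υφυπουργος']

# Hypothetical role string for a deputy minister
role = "['υφυπουργος οικονομικων']"

print('υπουργος' in 'υφυπουργος')             # True
print(first_match(role, titles_short_first))  # υπουργος

# Trying longer titles first avoids the shadowing
longest_first = sorted(titles_short_first, key=len, reverse=True)
print(first_match(role, longest_first))       # υφυπουργος
```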

In [50]:
leadership_titles = [
    'αρχηγος κομματος',
    'αρχηγος αξιωματικης αντιπολιτευσης'
]

A function is defined to extract the speaker's leadership role.

In [51]:
def extract_leadership_role(role_text):
    role_text = str(role_text).lower()
    for title in leadership_titles:
        if title in role_text:
            return title
    return None

In [52]:
df['leadership_role'] = df['roles'].apply(extract_leadership_role)

In [53]:
# Drop speeches made by speakers holding a chair role (President or Vice-President of Parliament)
df = df[~df['roles'].str.contains('αντιπροεδρος βουλης|προεδρος βουλης|αντιπροεδρος', case=False, na=False)]
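
The filter pattern is treated as a regular expression and searched anywhere in the role string, which is the same behaviour as `re.search`. A minimal sketch with three hypothetical role strings shows which rows the pattern would flag for removal. Note that the bare `αντιπροεδρος` alternative also matches `αντιπροεδρος της κυβερνησης`.

```python
import re

pattern = 'αντιπροεδρος βουλης|προεδρος βουλης|αντιπροεδρος'

# Hypothetical role strings in the list-like format of the roles column
roles = [
    "['προεδρος βουλης']",
    "['αντιπροεδρος της κυβερνησης']",
    "['βουλευτης']",
]

# True means the row would be dropped by the filter
dropped = [bool(re.search(pattern, r, flags=re.IGNORECASE)) for r in roles]
print(dropped)  # [True, True, False]
```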

### Speech Text Preprocessing Pipeline

In [11]:
p_df = pd.read_csv('processed01_par10-20.csv')

In [12]:
p_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341805 entries, 0 to 341804
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Unnamed: 0        341805 non-null  int64 
 1   member_name       341805 non-null  object
 2   sitting_date      341805 non-null  object
 3   political_party   341805 non-null  object
 4   government        341805 non-null  object
 5   roles             341805 non-null  object
 6   member_gender     341805 non-null  object
 7   speech            341805 non-null  object
 8   year              341805 non-null  int64 
 9   is_government     341805 non-null  int64 
 10  speaker_gov_role  60473 non-null   object
 11  leadership_role   10689 non-null   object
 12  speech_clean      341668 non-null  object
dtypes: int64(3), object(10)
memory usage: 33.9+ MB


In [13]:
p_df.head()

| | Unnamed: 0 | member_name | sitting_date | political_party | government | roles | member_gender | speech | year | is_government | speaker_gov_role | leadership_role | speech_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | τσιαρας αλεξανδρου κωνσταντινος | 2010-01-11 | νεα δημοκρατια | ['παπανδρεου α. γεωργιου(06/10/2009-11/11/2011)'] | ['βουλευτης'] | male | Σας ευχαριστώ πολύ, κύριε Πρόεδρε. Κυρίες και... | 2010 | 0 | NaN | NaN | σας πολυ, . και , μιας και ειναι η πρωτη μερα ... |
| 1 | 1 | ζωης κωνσταντινου χρηστος | 2010-01-11 | νεα δημοκρατια | ['παπανδρεου α. γεωργιου(06/10/2009-11/11/2011)'] | ['βουλευτης'] | male | Ευχαριστώ, κύριε Πρόεδρε.Επιτρέψτε μου κι εμέ... | 2010 | 0 | NaN | NaN | , .επιτρεψτε μου κι εμενα, πριν απ’ ολα, να απ... |
| 2 | 2 | ζωης κωνσταντινου χρηστος | 2010-01-11 | νεα δημοκρατια | ['παπανδρεου α. γεωργιου(06/10/2009-11/11/2011)'] | ['βουλευτης'] | male | Εσείς δυστυχώς πρέπει να αναθεωρήσετε τις θέσ... | 2010 | 0 | NaN | NaN | εσεις δυστυχως πρεπει να αναθεωρησετε τις θεσε... |
| 3 | 3 | ταλιαδουρος αθανασιου σπυριδων | 2010-01-11 | νεα δημοκρατια | ['παπανδρεου α. γεωργιου(06/10/2009-11/11/2011)'] | ['βουλευτης'] | male | Κύριε Πρόεδρε, όπως επισημάνθηκε και από τους... | 2010 | 0 | NaN | NaN | , οπως επισημανθηκε και απο τους συναδελφους μ... |
| 4 | 4 | χαρακοπουλος παντελη μαξιμος | 2010-01-11 | νεα δημοκρατια | ['παπανδρεου α. γεωργιου(06/10/2009-11/11/2011)'] | ['βουλευτης'] | male | Ευχαριστώ πολύ, κύριε Πρόεδρε.Κυρίες και κύρι... | 2010 | 0 | NaN | NaN | πολυ, . και , η ερωτηση που καταθεσαμε οι βουλ... |


Diacritics (accent marks) are removed from Greek characters using Unicode normalization to standardize the text.

In [11]:
# Remove intonation marks/diacritics (accent marks) used in written Greek
def remove_greek_diacritics(text):
    text = str(text)
    text = unicodedata.normalize("NFD", text)
    text = ''.join([char for char in text if not unicodedata.combining(char)])
    return text
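
To make the mechanism concrete: NFD normalization splits each accented letter into a base letter followed by a separate combining mark, and `unicodedata.combining` returns a non-zero class for those marks, so filtering them out leaves only the base letters. The sketch below uses a hypothetical helper `strip_diacritics` that mirrors the function above, applied to two sample words.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits each accented letter into base letter + combining mark(s)
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters whose combining class is 0 (i.e. base characters)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Start from the composed (NFC) form so the length comparison is well defined
word = unicodedata.normalize("NFC", "Ευχαριστώ")
decomposed = unicodedata.normalize("NFD", word)

# The accent on the final letter becomes its own code point after NFD
print(len(word), len(decomposed))          # 9 10
print(strip_diacritics(word))              # Ευχαριστω
print(strip_diacritics("προϋπολογισμός"))  # προυπολογισμος
```

Both the tonos (´) and the dialytika (¨) are combining marks, so both are removed. Case is left untouched at this step.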

The speech text is cleaned by removing diacritics, lowercasing, and stripping formal expressions, punctuation, numbers, and excess whitespace.

In [14]:
def basic_cleaning(text):
    # Remove Greek diacritics
    text = remove_greek_diacritics(text)
    
    # Lowercase
    text = text.lower()
    
    # Remove formal expressions and honorifics
    patterns_to_remove = [
        r'κυριε\s+(υπουργε|υφυπουργε|προεδρε|αντιπροεδρε)',
        r'\bκ\.\s*',
        r'\bευχαριστω\b',
        r'\bαγαπητοι συναδελφοι\b',
        r'\bκυριοι συναδελφοι\b',
        r'\bκυριες\b'
    ]
    for pattern in patterns_to_remove:
        text = re.sub(pattern, '', text)
    
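    # Replace punctuation with a space so that words joined only by a
    # punctuation mark (e.g. "προεδρε.επιτρεψτε") are not merged
    text = re.sub(r'[^\w\s]', ' ', text)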
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text
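
The sample rows show that the transcripts sometimes omit the space after a full stop (for example "Πρόεδρε.Επιτρέψτε"), so the replacement string used for punctuation matters. A minimal sketch on a hypothetical, already lowercased and accent-free sentence compares deleting punctuation with replacing it by a space:

```python
import re

# Hypothetical sentence pair with no space after the full stop
sample = "τελειωσα.επιτρεψτε μου μια παρατηρηση"

# Deleting punctuation glues the two neighbouring words together
merged = re.sub(r'[^\w\s]', '', sample)

# Replacing punctuation with a space keeps them apart; the whitespace
# collapse afterwards removes any doubled spaces this introduces
spaced = re.sub(r'\s+', ' ', re.sub(r'[^\w\s]', ' ', sample)).strip()

print(merged)  # τελειωσαεπιτρεψτε μου μια παρατηρηση
print(spaced)  # τελειωσα επιτρεψτε μου μια παρατηρηση
```

In Python 3, `\w` matches Unicode word characters for `str` patterns, so Greek letters are preserved by both variants.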

Greek stopwords are removed from the text to reduce noise and retain meaningful content.

In [8]:
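# Stopword removal. The stopwords are normalized like the speech text
# (diacritics stripped, lowercased) so they match the cleaned tokens;
# a set gives fast membership tests.
greek_stopwords_norm = {remove_greek_diacritics(w).lower() for w in greek_stopwords}

def stopwords_removal(text):
    words = text.split()
    words = [word for word in words if word not in greek_stopwords_norm]
    return ' '.join(words)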
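
Stopword matching is an exact string comparison, so the stopword list and the text need to be in the same normalized form. The sketch below uses a hypothetical three-word list written with accents. Whether the `stopwordsiso` Greek list contains accented forms is an assumption here, and the helper names are illustrative.

```python
import unicodedata

def strip_diacritics(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def remove_stopwords(text, stopwords):
    return " ".join(w for w in text.split() if w not in stopwords)

# Hypothetical stopword list, written with accents as in ordinary Greek
raw_stopwords = ["και", "από", "να"]

# Same normalization as the speech text: strip diacritics, lowercase
normalized_stopwords = {strip_diacritics(w).lower() for w in raw_stopwords}

# Text as it looks after diacritic stripping and lowercasing
cleaned = "απο την αρχη και να"

print(remove_stopwords(cleaned, raw_stopwords))         # απο την αρχη
print(remove_stopwords(cleaned, normalized_stopwords))  # την αρχη
```

With the raw list, the accented entry "από" never matches the accent-free token "απο", so that stopword survives.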

In [17]:
p_df['speech_clean'] = p_df['speech'].progress_apply(basic_cleaning)
