This notebook applies basic preprocessing actions to the final AgoraSpeech dataset.

* Input: 'AgoraSpeech.csv'
* Output: 'AgoraSpeech_preprocessed.csv'
* Actions: 
    1. Convert continuous features (sentiment, polarization, populism) into categorical features based on specific mapping in each case
    2. Factorize 'elections'
    3. From each named-entities column (a list of lists), extract the recognized entities and their categories into two new columns, once for the human annotations and once for the GPT annotations

In [1]:
import ast
import numpy as np
import pandas as pd

In [2]:
# read the final dataset
data = pd.read_csv('AgoraSpeech.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5279 entries, 0 to 5278
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   elections                  5279 non-null   object 
 1   speech_id                  5279 non-null   object 
 2   politician                 5279 non-null   object 
 3   date (YYYY-MM-DD)          5279 non-null   object 
 4   location                   5279 non-null   object 
 5   paragraph                  5279 non-null   int64  
 6   text                       5279 non-null   object 
 7   text_el                    5279 non-null   object 
 8   criticism_or_agenda_gpt    5139 non-null   object 
 9   topic_gpt                  5180 non-null   object 
 10  sentiment_gpt              5185 non-null   float64
 11  polarization_gpt           5184 non-null   float64
 12  populism_gpt               5184 non-null   float64
 13  named entities_gpt         5236 non-null   objec

Convert continuous features (sentiment, polarization, populism) into categorical features based on specific mapping in each case

In [3]:
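# reusable mapping function: return the label of the first boundary that `value`
# does not exceed, or the last label if `value` is above every boundary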
def mapping_function(value, boundaries, values):
    for i, boundary in enumerate(boundaries):
        if value <= boundary:
            return values[i]
    return values[-1]
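
As a quick check of how the boundaries behave, the sketch below repeats the function and applies it to a few hand-picked values. The sample inputs are illustrative and are not taken from the dataset. Each boundary is inclusive on its upper side, and any value above the last boundary falls into the final category.

```python
def mapping_function(value, boundaries, values):
    # return the label of the first boundary that value does not exceed
    for i, boundary in enumerate(boundaries):
        if value <= boundary:
            return values[i]
    # value is above every boundary
    return values[-1]

boundaries = [-0.34, 0.33]
categories = ['negative', 'neutral', 'positive']

# illustrative inputs only
samples = [-1.0, -0.34, -0.33, 0.0, 0.33, 0.34, 1.0]
labels = [mapping_function(v, boundaries, categories) for v in samples]

for value, label in zip(samples, labels):
    print(value, '->', label)
```

Note that `len(values)` is expected to be `len(boundaries) + 1`, since the last label covers everything above the final boundary.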

In [4]:
# SENTIMENT: [-1, -0.34] -> negative, [-0.33, 0.33] -> neutral, [0.34, 1] -> positive
boundaries = [-0.34, 0.33]
categories = ['negative', 'neutral', 'positive']
data['sentiment_human_category'] = data['sentiment_human'].apply(lambda x: mapping_function(x, boundaries, categories) if not pd.isna(x) else np.nan)
data['sentiment_gpt_category'] = data['sentiment_gpt'].apply(lambda x: mapping_function(x, boundaries, categories) if not pd.isna(x) else np.nan)

In [5]:
# POLARIZATION: [0, 0.5] -> low, [0.51, 0.8] -> medium, [0.81, 1] -> high
categories = ['low', 'medium', 'high']
boundaries = [0.5, 0.8]
data['polarization_human_category'] = data['polarization_human'].apply(lambda x: mapping_function(x, boundaries, categories) if not pd.isna(x) else np.nan)
data['polarization_gpt_category'] = data['polarization_gpt'].apply(lambda x: mapping_function(x, boundaries, categories) if not pd.isna(x) else np.nan)

In [6]:
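# POPULISM: [0, 0.5] -> low, [0.51, 0.8] -> medium, [0.81, 1] -> high
# same thresholds as polarization, restated so this cell does not depend on In [5]
boundaries = [0.5, 0.8]
categories = ['low', 'medium', 'high']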
data['populism_human_category'] = data['populism_human'].apply(lambda x: mapping_function(x, boundaries, categories) if not pd.isna(x) else np.nan)
data['populism_gpt_category'] = data['populism_gpt'].apply(lambda x: mapping_function(x, boundaries, categories) if not pd.isna(x) else np.nan)
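
The same binning can also be expressed without a row-wise `apply`. The following is a minimal sketch using `pd.cut`, offered as an alternative and not as the notebook's method. It assumes right-closed bins, which matches the `value <= boundary` test in `mapping_function`, and the scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# illustrative scores only, not taken from the dataset
scores = pd.Series([0.0, 0.5, 0.51, 0.8, 0.81, 1.0, np.nan])

# right-closed bins: (-inf, 0.5], (0.5, 0.8], (0.8, inf)
binned = pd.cut(
    scores,
    bins=[-np.inf, 0.5, 0.8, np.inf],
    labels=['low', 'medium', 'high'],
)

# missing scores stay missing, so no explicit pd.isna check is needed
print(binned.tolist())
```

`pd.cut` returns a categorical column; calling `.astype(object)` on the result gives plain string labels like those produced by the cells above.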

Factorize 'elections'

In [7]:
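# replace each election label with an integer code;
# unique_labels keeps the original labels, indexed by code
data['elections'], unique_labels = pd.factorize(data['elections'])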
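
To illustrate what `pd.factorize` returns, here is a small sketch on hypothetical election labels (the label strings are assumptions, not values from the dataset). Codes are assigned in order of first appearance, and the second return value can be used to map the codes back.

```python
import pandas as pd

# hypothetical labels, used only for illustration
elections = pd.Series(['national', 'european', 'national', 'local'])

codes, unique_labels = pd.factorize(elections)
print(codes.tolist())
print(list(unique_labels))

# map the integer codes back to the original labels
restored = unique_labels.take(codes)
print(list(restored))
```

Keeping `unique_labels` around (or saving it next to the output file) is what makes the integer codes interpretable later.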

From each named-entities column (a list of lists), extract the recognized entities and their categories into two new columns, once for the human annotations and once for the GPT annotations

In [8]:
def named_entities_extraction(df, col_name, new_col_name_chat_data, ith_el):
    entities_list = ['organization', 'group of people', 'location', 'person', 'country', 'date', 'political party']

    # each cell is parsed with ast.literal_eval inside the loop below,
    # so the column itself is left untouched here

    df[new_col_name_chat_data] = ''

    for i, row in df.iterrows():
        if pd.isna(row[col_name]):
            continue
        entity_lists = ast.literal_eval(row[col_name])
        new_el = []
        for sublist in entity_lists:
            if len(sublist) >= ith_el:
                new_el.append(sublist[ith_el - 1])
        df.at[i, new_col_name_chat_data] = new_el

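    # the output columns are named 'entity_category_human' / 'entity_category_gpt',
    # so match on the prefix; rows without entities (empty string) are left untouched
    if new_col_name_chat_data.startswith('entity_category'):
        df[new_col_name_chat_data] = df[new_col_name_chat_data].apply(lambda x: [val if val in entities_list else 'other' for val in x] if isinstance(x, list) else x)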
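
The extraction is easier to follow on a single cell. The sketch below parses one hypothetical cell value (the entity names and the 'alliance' category are made up for illustration) and splits it into the entity names and their categories, mapping any category outside the known list to 'other' as the function intends.

```python
import ast

# hypothetical cell value, in the string form it has after reading the CSV
raw = "[['Athens', 'location'], ['New Democracy', 'political party'], ['NATO', 'alliance']]"

known_categories = ['organization', 'group of people', 'location', 'person',
                    'country', 'date', 'political party']

# turn the string back into a list of [entity, category] pairs
entity_lists = ast.literal_eval(raw)

# first element of each pair: the entity itself
specific = [pair[0] for pair in entity_lists if len(pair) >= 1]

# second element of each pair: its category, or 'other' if not in the known list
category = [pair[1] if pair[1] in known_categories else 'other'
            for pair in entity_lists if len(pair) >= 2]

print(specific)
print(category)
```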

In [9]:
named_entities_extraction(data, 'named entities_human', 'entity_specific_human', 1)
named_entities_extraction(data, 'named entities_human', 'entity_category_human', 2)
named_entities_extraction(data, 'named entities_gpt', 'entity_specific_gpt', 1)
named_entities_extraction(data, 'named entities_gpt', 'entity_category_gpt', 2)

In [10]:
data.to_csv('AgoraSpeech_preprocessed.csv', index=False)
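
One thing to keep in mind when loading the saved file: CSV has no list type, so the extracted entity columns come back as strings, and the empty-string placeholder comes back as NaN. The sketch below shows one way to restore the lists. It uses an in-memory buffer and a two-row toy frame (both assumptions for illustration) instead of the real file.

```python
import ast
import io
import pandas as pd

# toy frame standing in for the preprocessed data
df = pd.DataFrame({
    'speech_id': ['s1', 's2'],
    'entity_specific_gpt': [['Athens', 'NATO'], ''],
})

# write to and read from an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# lists are now strings and the empty placeholder is NaN; parse them back
restored = reloaded['entity_specific_gpt'].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else []
)
print(restored.tolist())
```

Whether missing rows should become an empty list, as here, or stay NaN is a choice for the downstream analysis.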

In [11]:
# find number of missing data per column
data.isnull().sum()

elections                        0
speech_id                        0
politician                       0
date (YYYY-MM-DD)                0
location                         0
paragraph                        0
text                             0
text_el                          0
criticism_or_agenda_gpt        140
topic_gpt                       99
sentiment_gpt                   94
polarization_gpt                95
populism_gpt                    95
named entities_gpt              43
criticism_or_agenda_human        0
topic_human                      2
sentiment_human                343
polarization_human               2
populism_human                   7
named entities_human             0
sentiment_human_category       343
sentiment_gpt_category          94
polarization_human_category      2
polarization_gpt_category       95
populism_human_category          7
populism_gpt_category           95
entity_specific_human            0
entity_category_human            0
entity_specific_gpt 