# Preprocessing of manifestos

This notebook contains the preprocessing pipeline for manifestos from Danish political parties 1997-2022.

The output is a single merged DataFrame containing all manifestos from all parties in this period, preprocessed so that it is ready for generating BERT embeddings.

The end of this notebook also includes some light visualisations for exploration purposes.

In [1]:
directory_path = "/Users/pbrams/Desktop/AARHUS_UNIVERSITY/kandidat/data_sci/data_sci_project/predicting_manifesto_alignment"

In [2]:
import pandas as pd
import numpy as np

from utils import find_files, read_and_concatenate_manif

# Read in the raw manifesto CSV files and concatenate them into one DataFrame
path_to_manifs = f"{directory_path}/data/raw/manifestos/raw"

fl = find_files(path_to_manifs, ".csv")

manifestos = read_and_concatenate_manif(fl)

manifestos

Unnamed: 0,text,text_en,cmp_code,eu_code,Source_File,Party_Number,Year,Party_Name
0,Klar besked om SF’s valggrundlag. 12 konk...,A clear message about SF's election platform. ...,,,13230_201109.csv,13230,2011,Socialistisk Folkeparti
1,Når vi danskere inden længe skal i stemmebokse...,When we Danes go to the ballot box in the near...,0.0,,13230_201109.csv,13230,2011,Socialistisk Folkeparti
2,"For den kurs, vi nu vælger for Danmark, afgør ...",Because the course we choose for Denmark now w...,0.0,,13230_201109.csv,13230,2011,Socialistisk Folkeparti
3,Skal vi spare os ud af krisen – eller skal vi ...,Should we save our way out of the crisis - or ...,409.0,,13230_201109.csv,13230,2011,Socialistisk Folkeparti
4,Skal vi give de rigeste flere skattelettelser ...,Should we give the richest more tax breaks - o...,503.0,,13230_201109.csv,13230,2011,Socialistisk Folkeparti
...,...,...,...,...,...,...,...,...
18921,"Vi så, hvordan det gik, da de sidst havde rege...",We saw how it went when they were last in gove...,305.0,,13320_200111.csv,13320,2001,Socialdemokratiet
18922,Det har taget os 75 år at bygge velfærdssamfun...,It has taken us 75 years to build the welfare ...,504.0,,13320_200111.csv,13320,2001,Socialdemokratiet
18923,Det kan rives ned på meget kortere tid.,It can be demolished in much less time.,305.0,,13320_200111.csv,13320,2001,Socialdemokratiet
18924,Valget er dit.,The choice is yours.,0.0,,13320_200111.csv,13320,2001,Socialdemokratiet


In [3]:
# Check which parties and election years are present in the merged data
print(manifestos['Party_Name'].unique())
print(manifestos['Year'].unique())

['Socialistisk Folkeparti' 'Konservative Folkeparti' 'Dansk Folkeparti'
 'Socialdemokratiet' 'Venstre' nan 'Enhedslisten' 'Liberal Alliance'
 'Alternativet' 'Det Radikale Venstre' 'Nye Borgerlige']
['2011' '2005' '2007' '2001' '1998' '1994' '2019' '2015']
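
The `nan` in the party list above means that some rows have no `Party_Name`. pandas `groupby` drops missing keys by default, so such rows would silently fall out of any per-party aggregation. A minimal sketch for locating them is shown here on a toy frame that reuses the notebook's column names; the rows and the file name `99999_200111.csv` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (invented) with the same column names as `manifestos`
toy = pd.DataFrame({
    "Source_File": ["13230_201109.csv", "13230_201109.csv", "99999_200111.csv"],
    "Party_Name": ["Socialistisk Folkeparti", "Socialistisk Folkeparti", np.nan],
    "text": ["a", "b", "c"],
})

# Rows without a party name, and how many come from each source file
missing = toy[toy["Party_Name"].isna()]
missing_per_file = missing.groupby("Source_File").size()
print(missing_per_file)

# groupby drops NaN keys by default (dropna=True), so the third row disappears
grouped = toy.groupby("Party_Name")["text"].apply(" ".join)
print(grouped)
```

Running the same `isna()` filter on `manifestos` should show which source files lack party metadata, which may help decide whether to fix the mapping or drop those rows deliberately.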


## Grouping and preprocessing
Before embedding, we remove non-informative text (such as HTML tags, if any), lowercase the text, strip punctuation and numbers, and optionally remove common stopwords (though in political texts some 'stopwords' can carry significant meaning).

To ensure embeddings focus more on the ideological content, we also remove the party names.
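
The actual cleaning is done by helpers in `utils`, whose source is not shown in this notebook. As a rough illustration of the steps described above, a minimal sketch might look like the following. The names `sketch_update_stopwords` and `sketch_preprocess_text` and the tiny stopword list are hypothetical stand-ins, not the project's implementation.

```python
import re

# Hypothetical, deliberately tiny Danish stopword list (illustration only)
BASE_STOPWORDS = {"og", "i", "at", "det", "en", "er", "til", "for", "vi"}

def sketch_update_stopwords(party_names, base=BASE_STOPWORDS):
    """Return the base stopwords plus every lowercased word from the party names."""
    extended = set(base)
    for name in party_names:
        if isinstance(name, str):  # skip NaN / missing names
            extended.update(name.lower().split())
    return extended

def sketch_preprocess_text(text, stopwords, remove_stopwords=True):
    """Strip HTML tags, lowercase, drop digits and punctuation, optionally drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation (keeps æ, ø, å)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return " ".join(tokens)

stop = sketch_update_stopwords(["Venstre", "Dansk Folkeparti"])
print(sketch_preprocess_text("<p>Venstre vil sænke skatten i 2015!</p>", stop))
```

One design choice to be aware of in this sketch: splitting party names into single words also removes ordinary words such as "venstre" (left) and "dansk" (Danish) wherever they occur, which may or may not be desirable for ideological content.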

In [4]:
# Group by 'Party_Name' and 'Year' and join all 'text' entries into a single document
df_doc = manifestos.groupby(['Party_Name', 'Year'])['text'].apply(' '.join).reset_index()

df_doc

Unnamed: 0,Party_Name,Year,text
0,Alternativet,2015,Alternativet Alternativet er klar til valg Dan...
1,Alternativet,2019,Vores Politik På denne side finder du Alternat...
2,Dansk Folkeparti,1998,UDLÆNDINGEPOLITIKKEN UD med særlove og hovsa-l...
3,Dansk Folkeparti,2001,Fælles værdier – fælles ansvar Arbejdsprogram...
4,Dansk Folkeparti,2005,Vi vil have et trygt land Danmark skal være et...
...,...,...,...
57,Venstre,2005,Valgløfter Danmark skal være verdensmestre i v...
58,Venstre,2007,Valggrundlag Folketingsvalg 13. november 2007 ...
59,Venstre,2011,Nye tider. Varig velfærd Der er udskrevet folk...
60,Venstre,2015,DET VIL VENSTRE Ydelser til flygtninge skal ne...


In [None]:
from utils import update_stopwords, preprocess_text

# Calculate extended stopwords
party_names = df_doc['Party_Name'].unique()
extended_stopwords = update_stopwords(party_names)

# Apply preprocessing
df_doc['processed_text'] = df_doc['text'].apply(lambda x: preprocess_text(x, extended_stopwords, remove_stopwords=True))

# Get token count
df_doc['token_count'] = df_doc['processed_text'].str.split().apply(len)

# Take a look
df_doc
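
Note that `token_count` above counts whitespace-separated words, which is generally not the same as the number of subword tokens a BERT tokenizer would produce. A small sketch of the same counting expression on invented toy data shows how it behaves, including for an empty document:

```python
import pandas as pd

# Toy data (invented) using the notebook's `processed_text` column name
toy = pd.DataFrame({"processed_text": ["vil sænke skatten", "", "velfærd"]})

# Same expression as in the cell above: split on whitespace, then count the pieces
toy["token_count"] = toy["processed_text"].str.split().apply(len)
print(toy)
```

An empty string yields a count of 0, so documents that end up empty after stopword removal can be spotted by filtering on `token_count == 0`.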

In [None]:
# Save
df_doc.to_csv(f"{directory_path}/data/preprocessed/clean/manifestos/clean_manifestos.csv")