# Text Analysis: Irish Parliament Debates


## Overview

This code conducts an analysis of the debates of the 33rd
session of the Dáil Éireann (Irish Parliament) that was in sitting
between 2020 and 2024. 

The dataset is structured as follows:

| dail | vol | no  | date | speaker_name | speaker_role | constituency | party | text |
|------|-----|-----|------|--------------|--------------|--------------|-------|------|

where:

`dail` - is the number of the Dáil (e.g. 33rd Dáil)

`vol` - is the volume number of the debates (e.g. 1000)

`no` - is the number of the debate in the volume (e.g. 1)

`date` - is the date of the debate (in YYYY-MM-DD form, e.g. 2020-01-01)

`speaker_name` - is the name of the speaker

`speaker_role` - is the role of the speaker (e.g. TD, Minister, etc.)

`constituency` - is the constituency of the speaker

`party` - is the party of the speaker

`text` - is the text of the speech

Note that some of the texts belong to the outside speakers, such as,
e.g. external experts, witnesses, etc. Another aspect of this data to
keep in mind is that some of the recorded speeches are in Irish. You can
choose to use those in your analysis or exclude them.

## Part 1: Modelling Topics

In this part we model the topics of the speeches in the Dáil. 

## Part 2: Modelling Ideology

In this part we model the ideology of the speakers in the Dáil.

In [3]:
# Installing packages, loading libraries, dictionaries 

#install.packages("quanteda")
#install.packages("textcat")
#install.packages("cld2")

#install.packages("topicmodels")
#install.packages("devtools")
#devtools::install_github("kbenoit/quanteda.dictionaries") 
#remotes::install_github("kbenoit/quanteda.dictionaries")

library(dplyr)
library(textcat)
library(quanteda)
library(cld2)
library(quanteda.dictionaries)
library(topicmodels)
library(tidyr)
library(ggplot2)

# Load the dictionaries
data("data_dictionary_LaverGarry", package = "quanteda.dictionaries")

# Check the structure of the dictionaries
#str(data_dictionary_LaverGarry)

In [4]:
#Uploading the speeches datafile
speeches <- readr::read_csv(
  "/path/to/your/dail_33_small.csv"
)
head(speeches, 3)

[1mRows: [22m[34m591949[39m [1mColumns: [22m[34m9[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (5): speaker_name, speaker_role, constituency, party, text
[32mdbl[39m  (3): dail, vol, no
[34mdate[39m (1): date

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


dail,vol,no,date,speaker_name,speaker_role,constituency,party,text
<dbl>,<dbl>,<dbl>,<date>,<chr>,<chr>,<chr>,<chr>,<chr>
33,1061,3,2024-11-07,Catherine Connolly,An Leas-Cheann Comhairle,Galway West,Independent,"Seanad Éireann has accepted the Finance Bill 2024, without recommendation."
33,1061,3,2024-11-07,Catherine Connolly,An Leas-Cheann Comhairle,Galway West,Independent,"I wish to advise the House of the following matters in respect of which notice has been given under Standing Order 37 and the name of the Member in each case: Deputy Claire Kerrane - to discuss the need for a rural Garda plan. Deputy Brendan Smith - to discuss the need to approve DEIS status for St Patrick's National School, Shercock, County Cavan. Deputy Marian Harkin - to discuss the interim accommodation arrangement for the post-primary school at Grange, County Sligo. Deputy David Stanton - to discuss the anomaly of business owners who are tenants being ineligible for supports such as the power up grant. Deputy Pearse Doherty - to discuss the need to engage with Transport Infrastructure Ireland regarding speed limits on national secondary roads. Deputy Donnchadh Ó Laoghaire - to discuss accommodation conditions of international protection applicants. Deputy Gino Kenny - to discuss future funding for Scoil Mochua, Clondalkin. Deputy Éamon Ó Cuív - to discuss upgrading the facilities at Dún Aengus in Árann. Deputy Sean Sherlock - to discuss the future-proofing and staffing of Mallow General Hospital. Deputy James O'Connor - to discuss water quality issues in the Cork East constituency. Deputy Pauline Tully - to discuss the gaps in home care provision in Cavan. Deputy Bernard Durkan - to discuss the situation of ensuring that access to the State-owned historical residence, Castletown House, and lands in Kildare is guaranteed and that maintenance and protection of the property is reinstated. Deputy Gary Gannon - to discuss public safety in Dublin city. The matters raised by Deputies David Stanton, Éamon Ó Cuív, Pearse Doherty, Gary Gannon, Brendan Smith, Claire Kerrane and Bernard Durkan have been selected for discussion."
33,1061,3,2024-11-07,David Stanton,Deputy David Stanton,Cork East,Fine Gael,"I thought I was finished yesterday, but I am glad to be here this morning. I thank the Minister of State for being here. The various grants the Government has made available to support small businesses are extraordinarily welcome and very important. The increased cost of business grant and the power up grant are the two most recent ones. In every scheme, anomalies will always arise. A number of anomalies has arisen here, which I want to draw attention to this morning, and that I ask the Minister of State and the Department to look at. One is the issue whereby a business is running an operation in a property but it is renting the property and the property owner is paying the tax. This happens because sometimes property owners were left with the rates bill in the past if the business was not paying the rates. The arrangements were made in many cases where the rates cost would be included in the rental cost, and the landlord would pay the rates and would ensure that it was paid. What has happened now is, because the business itself is not paying the rates, it is deemed ineligible for the power up grant and the increased cost of business grant. That is quite unfair. I have been looking through some of the documentation the Department has put together. In fairness to the Department and the Government, they have done great work in this particular area. The Minister, Deputy Peter Burke, stated that ""the priority has been to ensure that as many businesses as possible"" in the hospitality and retail sectors that are facing great difficulty due to increased costs of running a business ""receive the money as quickly as possible"". However, these particular small businesses are now excluded. Another frequently asked question is who is eligible for the grant. The eligibility criteria states: ""Your business must be a commercially ... [traded entity] currently operating from a property that is commercially rateable."" If the rates are being paid, even indirectly, I contend that we can find a mechanism whereby the business can actually receive the grant and it can be kept in business and keep people employed. It should not be beyond the bounds of possibility to do that. Other issues have arisen as well. Somebody applied for the increased cost of business grant and did not tick the fact he or she was a retailer. The person might be a retailer anyway, but, for example, if the person was involved in art or something else instead, if he or she goes for the power up grant, the algorithm in the computer the person applied through will not allow him or her go any further and he or she is stopped. The person contacts the local authority, which says it cannot talk about that as it is a Department issue, but there is no one to talk to. There is a number of business owners around the country doing the best they could, were honest and straightforward and did not tick the correct box the first time and now cannot go forward for the power up grant. Again, I ask the Minister, and the Department, if it is listening, to have a look at that and see if there is a way of actually sorting that out. A third anomaly is if somebody was in a rateable property and rates were being paid, and the person moved to a different property during the course of this particular grant being administered. The person is paying rates on the second property as well. That person also cannot draw down the power up grant because he or she moved properties and the property he or she is in now is different. The person has not had the time involved in paying rates through that property in 2023 to claim it back. The ratepayer needs to talk a person, not a computer, in order to fix that. It is unfair on the people if they have paid the rates and did everything upfront, lawfully and so on. I ask that it be looked at and fixed. The final issue is slightly unrelated to the point this morning is that the Government has also made grants available for businesses to put solar panels on the roofs of its properties to cut down the cost of electricity. If the property is rented, the landlord or property owner cannot apply because he is not the business owner, and the business owner cannot apply because he does not own the premises. Again, there is a lacuna or anomaly there where nobody wins. Perhaps the Minister of State can shed some light on this."


In [5]:
# Exploring the datafile 
str(speeches)
summary(speeches)
table(speeches$constituency)
table(speeches$party)

#Checking for NAs, dropping NAs for text 
colSums(is.na(speeches))
speeches <- speeches[!is.na(speeches$text), ]

# Subsetting a smaller sample, since the operations are too slow. 
set.seed(123)
speeches <- speeches[sample(nrow(speeches), 50000), ]
summary(speeches)

# Note: Speeches where party and constituency are missing texts belong to the outside speakers, such as, e.g. external experts, witnesses. 

spc_tbl_ [591,949 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ dail        : num [1:591949] 33 33 33 33 33 33 33 33 33 33 ...
 $ vol         : num [1:591949] 1061 1061 1061 1061 1061 ...
 $ no          : num [1:591949] 3 3 3 3 3 3 3 3 3 3 ...
 $ date        : Date[1:591949], format: "2024-11-07" "2024-11-07" ...
 $ speaker_name: chr [1:591949] "Catherine Connolly" "Catherine Connolly" "David Stanton" "Thomas Byrne" ...
 $ speaker_role: chr [1:591949] "An Leas-Cheann Comhairle" "An Leas-Cheann Comhairle" "Deputy David Stanton" "Minister of State at the Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media (Deputy Thomas Byrne)" ...
 $ constituency: chr [1:591949] "Galway West" "Galway West" "Cork East" "Meath East" ...
 $ party       : chr [1:591949] "Independent" "Independent" "Fine Gael" "Fianna Fáil" ...
 $ text        : chr [1:591949] "Seanad Éireann has accepted the Finance Bill 2024, without recommendation." "I wish to advise the House of the following matters in res

      dail         vol               no              date           
 Min.   :33   Min.   : 992     Min.   : 1.0     Min.   :2020-02-20  
 1st Qu.:33   1st Qu.:1009     1st Qu.: 2.0     1st Qu.:2021-11-11  
 Median :33   Median :1025     Median : 4.0     Median :2022-10-26  
 Mean   :33   Mean   :1026     Mean   : 3.8     Mean   :2022-10-06  
 3rd Qu.:33   3rd Qu.:1043     3rd Qu.: 5.0     3rd Qu.:2023-10-19  
 Max.   :33   Max.   :1061     Max.   :10.0     Max.   :2024-11-07  
              NA's   :424058   NA's   :424058                       
 speaker_name       speaker_role       constituency          party          
 Length:591949      Length:591949      Length:591949      Length:591949     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                               


           Administrative Panel              Agricultural Panel 
                           2673                            9489 
                Carlow-Kilkenny                  Cavan-Monaghan 
                           8544                           13450 
                          Clare                       Cork East 
                           8897                            6909 
             Cork North-Central                 Cork North-West 
                           5228                            2625 
             Cork South-Central                 Cork South-West 
                          20347                            5026 
 Cultural and Educational Panel                         Donegal 
                          10707                           13163 
             Donegal North-East                Dublin Bay North 
                            174                            6163 
               Dublin Bay South                  Dublin Central 
                        


Anti-Austerity Alliance - People Before Profit 
                                         14969 
                                   Fianna Fáil 
                                        105068 
                                     Fine Gael 
                                         85354 
                                   Green Party 
                                         38014 
                                   Independent 
                                         49295 
                         Independents 4 Change 
                                           920 
                                  Labour Party 
                                         16414 
                                     Sinn Féin 
                                        101320 
                              Social Democrats 
                                         20163 

      dail         vol              no             date           
 Min.   :33   Min.   : 992    Min.   : 1.00   Min.   :2020-02-20  
 1st Qu.:33   1st Qu.:1009    1st Qu.: 2.00   1st Qu.:2021-11-10  
 Median :33   Median :1025    Median : 4.00   Median :2022-10-26  
 Mean   :33   Mean   :1025    Mean   : 3.81   Mean   :2022-10-04  
 3rd Qu.:33   3rd Qu.:1042    3rd Qu.: 5.00   3rd Qu.:2023-10-18  
 Max.   :33   Max.   :1061    Max.   :10.00   Max.   :2024-11-07  
              NA's   :35791   NA's   :35791                       
 speaker_name       speaker_role       constituency          party          
 Length:50000       Length:50000       Length:50000       Length:50000      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
  

In [6]:
# Detect the language of each speech in the filtered data
speeches$language <- cld2::detect_language(speeches$text)

# View the detected language distribution
table(speeches$language)


   de    en    ga    gd    mg    pt 
    1 47042  1105     6     1     2 

In [7]:
# Filter the speeches to keep only those detected as Irish (ga)
speeches_ga <- speeches[speeches$language == "ga", ]

# Display the first 5 rows of speeches where the language is Irish (ga)
#head(speeches_ga, 20)

# Note: 

# 1. Texts containing both English and Irish are detected as Gaelic if there is a substantial amount of Irish text. 
#For this assignment, such mixed texts will be dropped. A potential solution for future cases could involve sentence-level language detection, 
#where the text is split into sentences, the language of each sentence is identified, and Irish sentences are removed while keeping the English ones. 
#Another approach would be to apply Irish language dictionaries during tokenization and stopword removal to better handle Irish text.

# 2. TEXCAT function performed poorly for short texts and gave worse results (multiple languages detected). CLD2 performed better. 

# Keeping speeches where language is English
speeches_en <- speeches[speeches$language == "en", ]

# Verify the new dataset
summary(speeches_en)

      dail           vol              no             date           
 Min.   :33     Min.   : 992    Min.   : 1.00   Min.   :2020-02-20  
 1st Qu.:33     1st Qu.:1009    1st Qu.: 2.00   1st Qu.:2021-11-10  
 Median :33     Median :1025    Median : 4.00   Median :2022-10-25  
 Mean   :33     Mean   :1025    Mean   : 3.82   Mean   :2022-10-01  
 3rd Qu.:33     3rd Qu.:1042    3rd Qu.: 5.00   3rd Qu.:2023-10-12  
 Max.   :33     Max.   :1061    Max.   :10.00   Max.   :2024-11-07  
 NA's   :1843   NA's   :35371   NA's   :35371   NA's   :1843        
 speaker_name       speaker_role       constituency          party          
 Length:48885       Length:48885       Length:48885       Length:48885      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                               

In [8]:
# Part 1: Modelling Topics

# Creating DFM
speeches_dfm <- speeches_en |>
  quanteda::corpus(text_field = "text") |>
  quanteda::tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
  quanteda::tokens_remove(pattern = stopwords::stopwords("english")) |>
  quanteda::tokens_wordstem() |>
  quanteda::dfm()

“NA is replaced by empty string”


In [9]:
# Remove rows with no tokens (empty rows)
speeches_dfm <- speeches_dfm[quanteda::ntoken(speeches_dfm) > 0, ]
speeches_dfm

Document-feature matrix of: 46,878 documents, 33,097 features (99.86% sparse) and 9 docvars.
       features
docs    youth justic strategi quit posit document young peopl ireland front
  text1     2      1        4    2     1        1     2     2       1     1
  text2     0      0        0    0     0        0     0     0       0     0
  text3     0      0        0    0     0        0     0     0       0     0
  text4     0      0        0    0     0        0     0     0       0     0
  text5     0      0        0    0     0        0     0     2       1     0
  text6     0      0        0    0     0        0     0     1       1     0
[ reached max_ndoc ... 46,872 more documents, reached max_nfeat ... 33,087 more features ]

In [10]:
# Perform LDA topic modelling for 5 topics
speeches_lda_5 <- topicmodels::LDA(
  speeches_dfm,
  k = 5,
  method = "Gibbs"
)

# Checking Log-likelihood for speeches_lda_5
log_lik_5 <- topicmodels::logLik(speeches_lda_5)
log_lik_5

# Printing top 10 terms across topics
terms_top10_lda5 <- topicmodels::terms(speeches_lda_5, 10)
terms_top10_lda5

'log Lik.' -21691373 (df=165485)

Topic 1,Topic 2,Topic 3,Topic 4,Topic 5
peopl,committe,hous,servic,ireland
get,deputi,€,need,govern
go,member,year,work,state
can,mr,local,support,right
one,report,plan,health,countri
come,legisl,govern,school,issu
want,bill,scheme,peopl,irish
know,make,fund,children,import
look,may,cost,provid,must
need,question,develop,depart,also


In [11]:
# Perform LDA topic modelling for 10 topics
speeches_lda_10 <- topicmodels::LDA(
  speeches_dfm,
  k = 10,
  method = "Gibbs"
)

# Checking Log-likelihood for speeches_lda_10
log_lik_10 <- topicmodels::logLik(speeches_lda_5)
log_lik_10

# Printing top 10 terms across topics
terms_top10_lda10 <- topicmodels::terms(speeches_lda_10, 10)
terms_top10_lda10

'log Lik.' -21691373 (df=165485)

Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9,Topic 10
can,servic,ireland,peopl,hous,depart,€,deputi,legisl,get
use,need,right,minist,local,work,year,question,bill,can
transport,health,countri,govern,plan,programm,cost,committe,amend,go
energi,school,european,go,author,support,increas,mr,case,look
farmer,peopl,state,state,area,develop,million,ask,report,one
climat,children,issu,time,home,nation,busi,thank,act,come
need,care,eu,get,deputi,also,scheme,member,section,peopl
industri,hospit,irish,mani,develop,provid,pay,meet,inform,want
ireland,support,govern,know,build,includ,tax,make,relat,point
product,disabl,also,happen,counti,new,budget,wit,person,work


In [12]:
# Perform LDA topic modelling for 10 topics
speeches_lda_7 <- topicmodels::LDA(
  speeches_dfm,
  k = 7,
  method = "Gibbs"
)

# Checking Log-likelihood for speeches_lda_10
log_lik_7 <- topicmodels::logLik(speeches_lda_7)
log_lik_7

# Printing top 10 terms across topics
terms_top10_lda7 <- topicmodels::terms(speeches_lda_7, 10)
terms_top10_lda7

'log Lik.' -21049908 (df=231679)

Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7
€,ireland,servic,legisl,plan,peopl,deputi
hous,govern,need,bill,develop,get,question
year,state,health,report,local,go,minist
govern,countri,school,amend,area,one,committe
scheme,right,work,case,work,can,ask
cost,irish,support,process,author,want,mr
increas,european,peopl,act,project,know,thank
home,eu,children,inform,fund,come,member
busi,issu,care,provid,build,need,make
million,us,provid,section,depart,look,meet


In [18]:
# Note: Will choose the model with 7 topics, since it offers
# comparatively well-separated, interpretable topics without overfitting. Log-likelihood does not differ much. 
# Some other methods may give more robust results. 

# Topic 1: Housing crisis
# Topic 2: EU relations 
# Topic 3: Social services 
# Topic 4: Legislation and law 
# Topic 5: Development and infrastructure 
# Topic 6: Social needs and engagement 
# Topic 7: Parliamentary proceedings 

# Extract the topic distribution (gamma values) from the LDA model
topics_gamma <- posterior(speeches_lda_7)$topics

# Checking the speeches LDA class and length 
class(speeches_lda_7)
nrow(speeches_en)
nrow(topics_gamma)                                                    

In [19]:
# Ensure both have the same number of rows and remove rows where topic is missing
if (nrow(speeches_en) != nrow(topics_gamma)) {
  # Assign topics only for rows that have corresponding topics
  speeches_en$dominant_topic <- NA
  speeches_en$dominant_topic[1:nrow(topics_gamma)] <- apply(topics_gamma, 1, function(x) which.max(x))
  
  # Remove rows with missing topics (NA values)
  speeches_en <- speeches_en %>%
    filter(!is.na(dominant_topic))
}

# Define the topic labels
topic_labels <- c(
  "Housing crisis",
  "EU relations",
  "Social services",
  "Legislation and law",
  "Development and infrastructure",
  "Social needs and engagement",
  "Parliamentary proceedings"
)

# Map the topic labels to the corresponding dominant topics
speeches_en$topic_label <- topic_labels[speeches_en$dominant_topic]
                                                            
head(speeches_en[, c("date", "party", "dominant_topic", "topic_label")])                                                            

date,party,dominant_topic,topic_label
<date>,<chr>,<int>,<chr>
2023-06-21,Fianna Fáil,2,EU relations
2023-11-16,Fine Gael,4,Legislation and law
2023-12-05,Fianna Fáil,6,Social needs and engagement
2023-03-22,Fianna Fáil,1,Housing crisis
2022-05-12,,4,Legislation and law
2023-06-13,,3,Social services


In [20]:
# Replacing missing party names 
speeches_en <- speeches_en %>%
  mutate(party = tidyr::replace_na(party, "External"))

# Counting topics by party
party_topic <- speeches_en %>%
  count(party, topic_label) %>%
  pivot_wider(
    names_from = topic_label,
    values_from = n,
    values_fill = list(n = 0)
  ) %>%
  arrange(party)

# Creating percentage table
party_topic_pct <- party_topic %>%
  mutate(
    total = rowSums(select(., -party)),
    across(-c(party, total), ~ round(.x / total * 100, 1))
  ) %>%
  select(-total)

# Printing results
cat("\nParty by Topic Frequency Table (Counts)\n")
print(as.data.frame(party_topic), row.names = FALSE)

cat("\nParty by Topic Frequency Table (Percentages)\n")
print(as.data.frame(party_topic_pct), row.names = FALSE)

# The table below shows that there are no significant differences in the topics discussed (measured by the most dominant topic per speech) 
# across parties. 
# While the percentage distribution reveals some variation in topic coverage for the Independents for Change party, 
# this is likely due to the small sample size. 
# To obtain more robust results (possible), alternative modeling techniques should be explored.
# The results will not be plotted further. 


Party by Topic Frequency Table (Counts)
                                          party Development and infrastructure
 Anti-Austerity Alliance - People Before Profit                            159
                                       External                           1763
                                    Fianna Fáil                            931
                                      Fine Gael                            827
                                    Green Party                            349
                                    Independent                            466
                          Independents 4 Change                              5
                                   Labour Party                            162
                                      Sinn Féin                            944
                               Social Democrats                            190
 EU relations Housing crisis Legislation and law Parliamentary proceedings
          129  

In [21]:
# Topics by Year

# Extracting year from date and counting topics by year
year_topic <- speeches_en %>%
  mutate(year = format(date, "%Y")) %>%
  filter(!is.na(year)) %>%
  count(year, topic_label) %>%
  pivot_wider(
    names_from = topic_label,
    values_from = n,
    values_fill = 0
  ) %>%
  arrange(year)

# Creating percentage table
year_topic_pct <- year_topic %>%
  mutate(
    total = rowSums(select(., -year)),
    across(-c(year, total), ~ round(.x / total * 100, 1))
  ) %>%
  select(-total)

# Printing results 
cat("\nYear by Topic Frequency Table (Counts)\n")
print(as.data.frame(year_topic), row.names = FALSE)

cat("\nYear by Topic Frequency Table (Percentages)\n")
print(as.data.frame(year_topic_pct), row.names = FALSE)

# The results of this analysis demonstrate that there is no major change in the discussed topics across years.
# Legislation and law, followed by Houseing crisis and EU relations remain the most discussed ones from 2020-2024. 
# To obtain more robust results (possible), alternative modeling techniques should be explored.


Year by Topic Frequency Table (Counts)
 year Development and infrastructure EU relations Housing crisis
 2020                            615          480            758
 2021                            978          897           1293
 2022                           1526         1357           1916
 2023                           1449         1264           1690
 2024                           1026          967           1388
 Legislation and law Parliamentary proceedings Social needs and engagement
                 670                       896                         729
                1157                      1483                        1286
                1738                      2200                        1926
                1619                      2136                        1702
                1214                      1545                        1353
 Social services
             634
            1058
            1569
            1472
            1127

Year by Topic Fre

In [22]:
# Part 2: Modelling Ideology

#Modelling ideology with LG dictionary 
ideology_dfm <- dfm_lookup(speeches_dfm, data_dictionary_LaverGarry)
ideology_dfm

Document-feature matrix of: 46,878 documents, 20 features (90.85% sparse) and 9 docvars.
       features
docs    CULTURE.CULTURE-HIGH CULTURE.CULTURE-POPULAR CULTURE.SPORT CULTURE
  text1                    0                       0             0       0
  text2                    0                       0             0       0
  text3                    0                       0             0       0
  text4                    0                       0             0       0
  text5                    0                       0             0       0
  text6                    0                       0             0       0
       features
docs    ECONOMY.+STATE+ ECONOMY.=STATE= ECONOMY.-STATE-
  text1              10               5               1
  text2               0               0               0
  text3               1               0               0
  text4               0               0               0
  text5              14               3               5
  text6           

In [23]:
# Checking for empty texts
sum(rowSums(ideology_dfm) > 0)

# Printing top features 
topfeatures(ideology_dfm, 15)

In [24]:
# Convert to data frame 
ideology_scores <- quanteda::convert(ideology_dfm, to = "data.frame")

# Extracting column names and grouping into ideologies (selected topics only)

# Note: A more optimal code to group ideologies can be applied, at this point, this version worked. 

ideology_scores$left <- 

  (if("ECONOMY.=STATE=" %in% names(ideology_scores)) ideology_scores$"ECONOMY.=STATE=" else 0) +
  (if("ECONOMY.+STATE+" %in% names(ideology_scores)) ideology_scores$"ECONOMY.+STATE+" else 0) +
  (if("INSTITUTIONS.RADICAL" %in% names(ideology_scores)) ideology_scores$"INSTITUTIONS.RADICAL" else 0) +
  (if("ENVIRONMENT.PRO ENVIRONMENT" %in% names(ideology_scores)) ideology_scores$"ENVIRONMENT.PRO ENVIRONMENT" else 0) +
  (if("GROUPS_ETHNIC" %in% names(ideology_scores)) ideology_scores$"GROUPS_ETHNIC" else 0) +
  (if("GROUPS_WOMEN" %in% names(ideology_scores)) ideology_scores$"GROUPS_WOMEN" else 0) +
  (if("VALUES_LIBERAL" %in% names(ideology_scores)) ideology_scores$"VALUES_LIBERAL" else 0) +
  (if("LAW_AND_ORDER.LAW-LIBERAL" %in% names(ideology_scores)) ideology_scores$"LAW_AND_ORDER.LAW-LIBERAL" else 0)

ideology_scores$right <- 
   
  (if("ECONOMY.-STATE-" %in% names(ideology_scores)) ideology_scores$"ECONOMY.-STATE-" else 0) +
  (if("INSTITUTIONS.CONSERVATIVE" %in% names(ideology_scores)) ideology_scores$"INSTITUTIONS.CONSERVATIVE" else 0) +
  (if("LAW_AND_ORDER.LAW-CONSERVATIVE" %in% names(ideology_scores)) ideology_scores$"LAW_AND_ORDER.LAW-CONSERVATIVE" else 0) +
  (if("VALUES.CONSERVATIVE" %in% names(ideology_scores)) ideology_scores$"VALUES.CONSERVATIVE" else 0) +
  (if("CULTURE.CULTURE-HIGH" %in% names(ideology_scores)) ideology_scores$"CULTURE.CULTURE-HIGH" else 0)

# Calculating net score
ideology_scores$net_score <- ideology_scores$left - ideology_scores$right

# Printing results
head(ideology_scores[order(-ideology_scores$net_score), c("doc_id", "left", "right", "net_score")], 10)

Unnamed: 0_level_0,doc_id,left,right,net_score
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>
12981,text13495,286,38,248
17442,text18164,233,23,210
45477,text47417,228,18,210
6457,text6706,228,29,199
3140,text3262,221,29,192
40818,text42560,201,22,179
11816,text12286,191,20,171
45481,text47421,165,8,157
38209,text39822,169,16,153
31336,text32662,156,9,147


In [25]:
# Merging speeches_en with ideology_scores (ensure they have the same number of rows and same order)
ideology_results <- cbind(
  speeches_en,  # Your original data with metadata (e.g., party)
  ideology_scores  # Converted ideology scores
)

# Printing the first few rows of the merged results and checking the right order
#head(ideology_results)
nrow(speeches_en)  # Check the number of rows in speeches_en
nrow(ideology_scores)  # Check the number of rows in ideology_scores

# Adding token counts using speeches_dfm
ideology_results <- ideology_results %>%
  mutate(ntokens = quanteda::ntoken(speeches_dfm))  # Add token counts using speeches_dfm

# Performing transformations to calculate relative ideology scores
ideology_results <- ideology_results %>%
  mutate(
    rel_ideology_mp = (right - left) / ntokens,                # Method 1: Proportional difference
    rel_ideology_laver = (right - left) / (right + left),      # Method 2: Laver-style scaling
    rel_ideology_lowe = log((right + 1) / (left + 1))          # Method 3: Log-ratio (Lowe)
  ) 

In [26]:
# Selecting columns and printing the result 
ideology_results <- ideology_results %>%
  select(party, left, right, ntokens, rel_ideology_mp, rel_ideology_laver, rel_ideology_lowe)
head(ideology_results, 10)

Unnamed: 0_level_0,party,left,right,ntokens,rel_ideology_mp,rel_ideology_laver,rel_ideology_lowe
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>
1,Fianna Fáil,16,6,188,-0.053191489,-0.45454545,-0.8873032
2,Fine Gael,0,0,4,0.0,,0.0
3,Fianna Fáil,1,0,8,-0.125,-1.0,-0.69314718
4,Fianna Fáil,0,0,3,0.0,,0.0
5,External,17,8,439,-0.020501139,-0.36,-0.69314718
6,External,4,0,64,-0.0625,-1.0,-1.60943791
7,Green Party,0,0,18,0.0,,0.0
8,Fine Gael,17,16,301,-0.003322259,-0.03030303,-0.05715841
9,Fianna Fáil,7,2,77,-0.064935065,-0.55555556,-0.98082925
10,External,6,3,148,-0.02027027,-0.33333333,-0.55961579


In [27]:
# Dropping NA and zero values 
filtered_ideology_results <- ideology_results %>%
  filter(!is.na(rel_ideology_mp) & rel_ideology_mp != 0,
         !is.na(rel_ideology_laver) & rel_ideology_laver != 0,
         !is.na(rel_ideology_lowe) & rel_ideology_lowe != 0)

# Calculating mean summary 
mean_summary <- filtered_ideology_results %>%
  group_by(party) %>%
  summarise(
    mean_rel_ideology_mp = mean(rel_ideology_mp, na.rm = TRUE),
    mean_rel_ideology_laver = mean(rel_ideology_laver, na.rm = TRUE),
    mean_rel_ideology_lowe = mean(rel_ideology_lowe, na.rm = TRUE)
  )

# Calculating standard deviation summary 
sd_summary <- filtered_ideology_results %>%
  group_by(party) %>%
  summarise(
    sd_rel_ideology_mp = sd(rel_ideology_mp, na.rm = TRUE),
    sd_rel_ideology_laver = sd(rel_ideology_laver, na.rm = TRUE),
    sd_rel_ideology_lowe = sd(rel_ideology_lowe, na.rm = TRUE)
  )

# Printing the results
print("Mean Summary:")
print(mean_summary)

print("Standard Deviation Summary:")
print(sd_summary)

# The analysis of ideology measures reveals that most parties in the dataset lean left, with no major differences.  
# While the mean scores suggest a left-leaning trend across parties, 
# standard deviations show varying levels of ideological diversity, particularly within parties like Labour and 
# Independents 4 Change (which has a small n). 
# The findings highlight a generally consistent left-leaning stance, with some internal variability in ideological positions.

# While these methods offer some insights, other approaches might provide a more nuanced understanding of party ideologies.

[1] "Mean Summary:"
[90m# A tibble: 10 × 4[39m
   party      mean_rel_ideology_mp mean_rel_ideology_la…¹ mean_rel_ideology_lowe
   [3m[90m<chr>[39m[23m                     [3m[90m<dbl>[39m[23m                  [3m[90m<dbl>[39m[23m                  [3m[90m<dbl>[39m[23m
[90m 1[39m Anti-Aust…              -[31m0[39m[31m.[39m[31m0[39m[31m50[4m2[24m[39m                 -[31m0[39m[31m.[39m[31m573[39m                 -[31m0[39m[31m.[39m[31m876[39m
[90m 2[39m External                -[31m0[39m[31m.[39m[31m0[39m[31m48[4m8[24m[39m                 -[31m0[39m[31m.[39m[31m553[39m                 -[31m0[39m[31m.[39m[31m860[39m
[90m 3[39m Fianna Fá…              -[31m0[39m[31m.[39m[31m0[39m[31m51[4m2[24m[39m                 -[31m0[39m[31m.[39m[31m554[39m                 -[31m0[39m[31m.[39m[31m860[39m
[90m 4[39m Fine Gael               -[31m0[39m[31m.[39m[31m0[39m[31m48[4m7[24m[39m                 -[