1. [Working directory and packages](#chapter1)
2. [Data](#chapter2)
3. [Preprocessing](#chapter3)
4. [Wrapper function](#chapter4)
5. [Rooduijn & Pauwels](#chapter5)
   1. [Construct validity](#subparagraph1)
   2. [Face validity](#subparagraph2)
   3. [External validity](#subparagraph3)
       1. [CHES](#subparagraph4)
       2. [PopuList](#subparagraph5)
6. [Decadri & Boussalis](#chapter6)
   1. [Construct validity](#subparagraph6)
   2. [Face validity](#subparagraph7)
   3. [External validity](#subparagraph8)
       1. [CHES](#subparagraph9)
       2. [PopuList](#subparagraph10)

# Working directory and packages <a class="anchor" id="chapter1"></a>

Setting the working directory

In [1]:
setwd("C:/Users/jacop/Tesi/")

Loading the libraries

In [15]:
suppressWarnings(suppressPackageStartupMessages(library(dtplyr)))
suppressWarnings(suppressPackageStartupMessages(library(tidyverse)))
suppressWarnings(suppressPackageStartupMessages(library(data.table)))
suppressWarnings(suppressPackageStartupMessages(library(quanteda)))
suppressWarnings(suppressPackageStartupMessages(library(manifestoR)))

The 'tokens_group' function in the latest version of quanteda often returns an error when grouping tokens by more than one variable, so we'll need to install an earlier version of quanteda. Let's check which version is currently installed.

In [3]:
sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] quanteda_2.1.2    data.table_1.14.2 forcats_0.5.1     stringr_1.4.0    
 [5] dplyr_1.0.7       purrr_0.3.4       readr_2.1.0       tidyr_1.1.4      
 [9] tibble_3.1.6      ggplot2_3.3.5     tidyverse_1.3.1   dtplyr_1.1.0     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7         lubridate_1.8.0    lattice_0.20-44    assertthat_0.2.1  
 [5] digest_0.6.29      utf8_1.2.2         IRdisplay_1.0      R6_2.5.1          
 [9] cellranger_1.1.0   repr_1.1.3         backports_1.4.0    reprex_2.0.

If it's the latest one, we'll need to uninstall it and replace it with an earlier version (2.1.2 in this case, but others may work as well).

In [None]:
# remove.packages('quanteda')
# devtools::install_version("quanteda", version = "2.1.2", repos = "http://cran.us.r-project.org")
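The version check can also be done programmatically rather than by reading the `sessionInfo()` output. The following is a minimal sketch in base R; `needs_downgrade` is a hypothetical helper (not part of quanteda), and 2.1.2 is simply the version pinned in this notebook.

```r
# Hypothetical helper: TRUE when the installed version is newer than the pinned one
needs_downgrade <- function(installed, pinned = "2.1.2") {
  package_version(installed) > package_version(pinned)
}

# Comparing plain version strings, so the sketch runs even without quanteda installed
needs_downgrade("3.2.0")
needs_downgrade("2.1.2")

# With quanteda installed, the real version could be passed in:
# needs_downgrade(as.character(packageVersion("quanteda")))
```

Comparing `package_version` objects rather than raw strings avoids mistakes such as treating "2.10.0" as older than "2.9.0".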

# Data <a class="anchor" id="chapter2"></a>

## Speeches dataset

Loading the data

In [None]:
load("data/parliamentary_groups2.rds")

Creating a lazy data.table out of our dataframe so that we can use dtplyr on it

In [None]:
texts <- lazy_dt(Texts)

Casting the "legislatura" variable as integer

In [None]:
texts <- texts %>% mutate(legislatura = as.integer(legislatura)) %>% as_tibble()

Filtering the dataset by focusing on the last seven legislatures

In [None]:
texts <- texts %>% filter(legislatura >= 12) %>% as_tibble()

## Manifesto Project dataset

Setting the API key in our work environment

In [17]:
mp_setapikey("data/manifesto_apikey.txt")

## Stopwords

Decadri and Boussalis' additional stopwords

In [None]:
db_additional_stopwords  <- suppressMessages(read_csv("data/it_stopwords_new_list.csv")) %>% 
                            pull(stopwords)

Procedural stopwords

In [None]:
procedural_stopwords <- suppressMessages(read_csv("data/it_stopwords_procedural.csv")) %>% 
                        pull(it_stopwords_procedural)

## Dictionaries

Rooduijn and Pauwels' dictionary

In [None]:
anti_elitism <- c("elit*", "consens*", "antidemocratic*", "referend*", "corrot*", "propagand*", 
                  "politici*","ingann*", "tradi*", "vergogn*", "scandal*", "verita", "disonest*", 
                  "partitocrazia", "menzogn*", "mentir*")

rp_dictionary <- dictionary(list(anti_elitism = anti_elitism))

Decadri and Boussalis' dictionary

In [None]:
anti_elitism <- c("antidemocratic*", "casta", "consens*", "corrot*", "disonest*", "elit*", 
                  "establishment", "ingann*", "mentir*", "menzogn*", "partitocrazia", "propagand*", 
                  "scandal*", "tradim*", "tradir*", "tradit*", "vergogn*", "verita")

people_centrism  <- c("abitant*", "cittadin*", "consumator*", "contribuent*", "elettor*", "gente", "popol*")

db_dictionary <- dictionary(list(anti_elitism = anti_elitism, 
                                 people_centrism = people_centrism))
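The trailing `*` in entries such as "elit*" is a glob wildcard, so one entry covers every token that starts with that stem, while entries without it (such as "casta") only match the exact word. The following base R sketch illustrates these semantics on made-up tokens. It uses `utils::glob2rx` and `grepl` as a stand-in, so it should be read as an approximation of how the lookup behaves, not as quanteda's own matching code.

```r
# A few patterns taken from the dictionaries above, and made-up tokens
patterns <- c("elit*", "casta", "popol*")
toy_tokens <- c("elite", "elitario", "casta", "castagna", "popolo", "governo")

# Translating the glob patterns into anchored regular expressions
regexes <- utils::glob2rx(patterns)

# A token counts as a hit if it matches at least one pattern (case-insensitively)
is_hit <- vapply(
  toy_tokens,
  function(tok) {
    any(vapply(regexes, function(rx) grepl(rx, tok, ignore.case = TRUE), logical(1)))
  },
  logical(1)
)

# Tokens that would be counted as dictionary hits
toy_tokens[is_hit]
```

In this toy example "castagna" is not a hit, because "casta" has no wildcard and therefore has to match the whole token.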

Gründl's dictionary

In [44]:
grundl <- readxl::read_xlsx("data/gruendl_terms_Fedra_Silvia.xlsx", sheet = 2)  %>% 
filter(!is.na(terms) & !str_detect(string = terms, pattern = "\\?+"))  %>% 
mutate(terms = str_split(terms, ', ')) %>% 
unnest(cols = c(terms)) %>% 
distinct(terms)

## External validity datasets

Let's load the two datasets we'll be using to test the dictionaries' external validity: the Chapel Hill Expert Survey and the PopuList dataset.

In [None]:
ches <- read_csv("data/1999-2019_CHES_dataset_means(v2).csv", show_col_types = FALSE)

populist <- readxl::read_xlsx("data/populist-version-2-20200626.xlsx")

# Preprocessing <a class="anchor" id="chapter3"></a>

Creating the corpus

In [None]:
my_corpus <- corpus(texts, text_field = "textclean")

Tokenizing the corpus, removing stopwords and grouping the tokens by the 'year' and 'gruppoP' variables

In [None]:
toks <- my_corpus %>% 
        tokens(., remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, remove_separators = TRUE)  %>% 
        tokens_remove(., pattern = stopwords("it"), padding = TRUE) %>% 
        tokens_remove(., pattern = db_additional_stopwords) %>% 
        tokens_remove(., pattern = procedural_stopwords) %>% 
        quanteda:::tokens_group(x = ., groups = c('year', 'gruppoP'))
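To make the grouping step more concrete, here is a minimal base R sketch of what grouping by two variables amounts to: the tokens of all documents sharing the same 'year' and 'gruppoP' values are concatenated into a single unit. The data are made up, and the sketch does not reproduce quanteda's internal implementation.

```r
# Made-up document-level metadata and tokens
docs <- data.frame(year = c(2018, 2018, 2019),
                   gruppoP = c("PD", "PD", "M5S"),
                   stringsAsFactors = FALSE)
doc_tokens <- list(c("a", "b"), c("c"), c("d", "e"))

# One key per 'year' x 'gruppoP' combination
group_key <- paste(docs$year, docs$gruppoP, sep = ".")

# Concatenating the tokens of all documents that share a key
grouped <- lapply(split(doc_tokens, group_key), unlist, use.names = FALSE)

# Number of tokens in each party-year group
lengths(grouped)
```

After this step each element corresponds to one party-year combination, which is the unit of analysis used in the rest of the notebook.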

# Wrapper function <a class="anchor" id="chapter4"></a>

In [None]:
dict_analysis <- function(tokens, dictionary) {
    
  # Extracting the number of tokens for each group
    
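  total_toks <- ntoken(tokens)

  # Stopping early if the dictionary name is not one of the two supported ones

  if (!dictionary %in% c("Rooduijn_Pauwels", "Decadri_Boussalis")) {
    stop("dictionary must be 'Rooduijn_Pauwels' or 'Decadri_Boussalis'")
  }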
    
  # Applying Rooduijn and Pauwels' dictionary  
  
  if (dictionary == "Rooduijn_Pauwels") {
  
  my_dict_lookup <- tokens_lookup(x = tokens, dictionary = rp_dictionary)
  
  my_dfm <- dfm(my_dict_lookup)  %>% 
            convert(., to = "data.frame") %>% 
            mutate(year = docvars(tokens)$year,
                   party = docvars(tokens)$gruppoP,
                   cluster = docvars(tokens)$group_cluster,
                   total_toks = total_toks,
                   perc_of_populist_toks = anti_elitism / total_toks,
                   standardized_perc_of_populist_toks = as.double(scale(perc_of_populist_toks))) %>% 
            relocate(doc_id, year, party, cluster, anti_elitism, total_toks, perc_of_populist_toks, 
                     standardized_perc_of_populist_toks) %>% 
            as_tibble()

  }
    
  # Applying Decadri and Boussalis' dictionary
  
  if (dictionary == "Decadri_Boussalis") {
    
    my_dict_lookup <- tokens_lookup(x = tokens, dictionary = db_dictionary)
    
    my_dfm <- dfm(my_dict_lookup) %>% 
              convert(., to = "data.frame") %>% 
              mutate(year = docvars(tokens)$year,
                     party = docvars(tokens)$gruppoP,
                     cluster = docvars(tokens)$group_cluster,
                     populist_toks = anti_elitism + people_centrism,
                     total_toks = total_toks,
                     perc_of_populist_toks = populist_toks / total_toks,
                     standardized_perc_of_populist_toks = as.double(scale(perc_of_populist_toks))) %>% 
              relocate(doc_id, year, party, cluster, anti_elitism, people_centrism, populist_toks,
                       total_toks, perc_of_populist_toks, standardized_perc_of_populist_toks) %>% 
              as_tibble()
    
  }
  
  return(my_dfm)
  
}
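The two derived columns can be illustrated on made-up counts. In this sketch, which assumes nothing beyond base R, the score is the number of dictionary hits divided by the total number of tokens in the group (a proportion, despite the 'perc_' prefix), and the standardized score is its z-score across all groups.

```r
# Made-up counts for three groups
anti_elitism <- c(12, 3, 7)          # dictionary hits per group
total_toks <- c(4000, 3000, 3500)    # tokens per group

# Share of dictionary hits over all tokens in the group
perc_of_populist_toks <- anti_elitism / total_toks

# z-scores: subtract the mean across groups, then divide by the standard deviation
standardized_perc_of_populist_toks <- as.double(scale(perc_of_populist_toks))

standardized_perc_of_populist_toks
```

Because the standardization is computed across all party-year combinations passed to the function, a positive value means "more populist than the average group in this corpus", not populist in an absolute sense.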


# Rooduijn & Pauwels <a class="anchor" id="chapter5"></a>

Let's run the dictionary analysis using Rooduijn and Pauwels' dictionary

In [None]:
df_rp <- dict_analysis(tokens = toks, dictionary = "Rooduijn_Pauwels")

The first rows of the dataframe

In [None]:
head(df_rp)

## Construct validity <a class="anchor" id="subparagraph1"></a>

Rooduijn and Pauwels' dictionary captures the "anti-elitism" component of populism, but not the "people-centrism" one. As a result, from a construct validity standpoint, it is only partially valid. The authors justified leaving out the "people-centrism" dimension by pointing out that the "people" is often referred to by words such as "us", "we" and "our", which are also used to refer to entities other than the people (such as political parties). Including these words in the dictionary, they argue, would result in a large number of false positives.

## Face validity <a class="anchor" id="subparagraph2"></a>

A populism dictionary has face validity if the parties it scores as most populist are the ones generally regarded as populist. In the Italian case, we would expect populism scores to be higher for parties that the literature deems populist (i.e. Five Star Movement, Lega Nord, Forza Italia and Il Popolo della Libertà).

The following are the 20 party-year combinations with the highest populist scores in the 1994-2021 period. Consistent with our expectations, we find populist parties such as FdI-AN (2014), Forza Italia (2019), FdI-AN (2013) and FdI-AN (2017). However, we also find mainstream parties such as UDC (2009), SI-SEL-POS-LU (2018), IV (2018), PD (2018) and PD (2019). These results could be interpreted as evidence of either populist contagion or a lack of face validity. The absence of M5S and Lega among the most populist parties makes me lean towards the latter.

In [None]:
df_rp %>% 
arrange(desc(standardized_perc_of_populist_toks)) %>% 
head(20)

The following are the party-year combinations with the lowest populist scores. Again we see a mixture of both mainstream and populist parties. Interestingly, LNA (2018), FdI-AN (2018) and PdL (2013) are ranked among the least populist parties. This might be further evidence of lack of face validity in Rooduijn and Pauwels' dictionary.

In [None]:
df_rp %>% 
arrange(desc(standardized_perc_of_populist_toks)) %>% 
tail(20) %>% 
arrange(standardized_perc_of_populist_toks)

## External validity <a class="anchor" id="subparagraph3"></a>

### Chapel Hill Expert Survey <a class="anchor" id="subparagraph4"></a>

As Rooduijn and Pauwels' dictionary only captures the anti-elite dimension of populism, the external validity check will be carried out against the anti-elite salience variable from the CHES dataset, which was introduced in 2014.

The CHES country code for Italy is 8. The following is a list of all Italian parties in the CHES dataset in the 2014-2019 time period.

In [None]:
ches %>% filter(country == 8 & year >= 2014 & year <= 2019) %>% distinct(party)

While these are the parties included in our dataset in the same timeframe

In [None]:
df_rp %>% filter(year >= 2014 & year <= 2019) %>% distinct(party)

Let's now compare how R&P's dictionary and the CHES dataset rank party-year combinations by populism in 2014 and 2019. We'll drop all parties that are not present in both datasets.

The difference between the two rankings is stark. According to the dictionary analysis, PD (2019) ranks among the most populist party-year combinations and M5S (2019) among the least populist ones, while the opposite is true in the CHES dataset. Moreover, Lega (2019), one of the most populist party-year combinations according to CHES, is only slightly populist according to R&P's dictionary.

In [None]:
df_rp %>% 
filter((year == 2014 | year == 2019) & party != "MISTO" & party != "IV") %>% 
arrange(desc(standardized_perc_of_populist_toks))

In [None]:
to_drop <- c('VdA', 'SVP', 'RI')

ches %>% 
filter(country == 8 & year >= 2014 & year <= 2019 & (!party %in% to_drop))  %>% 
group_by(party, year) %>% 
summarize(mean_anti_elite_salience = mean(antielite_salience), .groups = "keep") %>% 
arrange(desc(mean_anti_elite_salience))
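The disagreement between two rankings can also be summarized with a single number. The following sketch is not part of the original analysis: it assumes the dictionary scores and the CHES values have already been matched by party and year into one table, and it uses made-up values for four hypothetical parties to show how Spearman's rank correlation would be computed in base R.

```r
# Made-up scores for four hypothetical parties (not results from this analysis)
comparison <- data.frame(
  party = c("A", "B", "C", "D"),
  dictionary_score = c(0.5, 0.1, -0.3, -0.8),
  ches_anti_elite = c(2, 4, 6, 9)
)

# Spearman's rho compares the two rankings: 1 = identical order, -1 = reversed order
rho <- cor(comparison$dictionary_score, comparison$ches_anti_elite, method = "spearman")

rho
```

In this toy example the two orderings are exact opposites, so the coefficient sits at its lower bound. With real data, the main practical hurdle is harmonizing the party labels across the two sources before joining them.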

### The PopuList <a class="anchor" id="subparagraph5"></a>

All the Italian parties in the PopuList dataset

In [None]:
populist %>% filter(country_name == "Italy") %>% distinct(party_name)

Let's compare the populism scores from PopuList and R&P's dictionary by focusing on parties that are present in both datasets.

According to the dictionary analysis, FI-PDL, FdI-AN, Lega and M5S have higher populism scores than most parties. These parties are all coded as populist in the PopuList dataset. The two measures can thus be considered similar.

In [None]:
to_keep <- c("F-ITA", "FI", "PDL", "FI-PDL", "FDI-AN", "FDI", "LEGA-N", "LEGA-NORD-P", "LNA", "LEGA", "LNP", "M5S", 
             "RC-PROGR", "COMUNISTA", "RC", "COM/IT/", "RC-SE", "SI-SEL-POS-LU")

df_rp %>% 
filter(party %in% to_keep) %>% 
arrange(desc(perc_of_populist_toks)) %>% 
head(20)

In [None]:
to_drop <- c("Fiamma Tricolore", "Lega d'Azione Meridionale", "Movimento Sociale Italiano")

populist %>% 
filter(country_name == "Italy" & (!party_name %in% to_drop)) %>% 
select(party_name, populist) %>% 
arrange(desc(populist))
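Since the dictionary yields a continuous score while PopuList provides a classification, one simple way to compare them is to average the dictionary score within each PopuList category. The sketch below is only an illustration on made-up values: it assumes the 'populist' variable is a 0/1 indicator and that party labels have already been matched across the two sources.

```r
# Made-up scores and a 0/1 populist indicator for six hypothetical parties
toy <- data.frame(
  party = c("A", "B", "C", "D", "E", "F"),
  perc_of_populist_toks = c(0.004, 0.003, 0.005, 0.001, 0.002, 0.003),
  populist = c(1, 1, 1, 0, 0, 0)
)

# Mean dictionary score for parties coded 0 and for parties coded 1
group_means <- tapply(toy$perc_of_populist_toks, toy$populist, mean)

group_means
```

If the two measures agree, the mean for the parties coded 1 should be clearly higher than the mean for those coded 0.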

# Decadri & Boussalis <a class="anchor" id="chapter6"></a>

Let's run the dictionary analysis with Decadri and Boussalis' dictionary

In [None]:
df_db <- dict_analysis(tokens = toks, dictionary = "Decadri_Boussalis")

The first rows of the dataframe

In [None]:
head(df_db)

## Construct validity <a class="anchor" id="subparagraph6"></a>

Decadri and Boussalis' dictionary captures both the "anti-elitism" and "people-centrism" dimensions of populist ideology, and it thus constitutes an improvement over Rooduijn and Pauwels' dictionary in terms of construct validity.

## Face validity <a class="anchor" id="subparagraph7"></a>

To assess the face validity of Decadri and Boussalis' dictionary, we'll look at the share of populist tokens (anti-elitism and people-centrism combined) for each party-year combination.

As was the case for R&P's dictionary, both mainstream (UDC, UDEUR, PPI) and populist (Lega, M5S, FDI-AN) party-year combinations received high populist scores.

In [None]:
df_db %>% 
arrange(desc(standardized_perc_of_populist_toks)) %>% 
head(20)

Similarly, when we look at the party-year combinations with the lowest populist scores, we find both mainstream and populist parties. This seems to suggest that D&B's dictionary lacks face validity.

In [None]:
df_db %>% 
arrange(desc(standardized_perc_of_populist_toks)) %>% 
tail(20) %>% 
arrange(standardized_perc_of_populist_toks)

## External validity <a class="anchor" id="subparagraph8"></a>

### Chapel Hill Expert Survey <a class="anchor" id="subparagraph9"></a>

As Decadri and Boussalis' dictionary captures both dimensions of populism, we will validate it against a combination of two variables from the CHES dataset, i.e. "anti-élite salience" and "people_vs_élite". We'll use the former as a proxy for the anti-establishment component and the latter as a proxy for the people-centrist one. The "people_vs_élite" variable was introduced in the 2019 edition of the dataset, so we'll only work with observations from that year.

The following are the Italian parties in the CHES dataset for the year 2019

In [None]:
ches %>% filter(country == 8 & year == 2019) %>% select(party, antielite_salience, people_vs_elite)

The parties in our dataset in the same year

In [None]:
df_db %>% filter(year == 2019) %>% distinct(party)

Let's compute the average populist value for each party in the CHES dataset by summing the people vs elite and the anti-elite salience variables and then taking the mean. "Radicali Italiani" and "Südtiroler Volkspartei" are not in our dataset so we'll drop them from CHES.

In [None]:
to_drop <- c("RI", "SVP")

ches %>% 
filter(country == 8 & year == 2019 & (!party %in% to_drop)) %>% 
group_by(party) %>% 
summarize(mean_populism = mean(people_vs_elite + antielite_salience)) %>% 
arrange(desc(mean_populism))
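The "sum, then average by party" logic of the cell above can be reproduced on made-up values with base R. The column names follow the CHES variables used in this notebook, while the parties and numbers are purely illustrative.

```r
# Made-up CHES-style rows: party "A" appears twice, party "B" once
ches_toy <- data.frame(
  party = c("A", "A", "B"),
  people_vs_elite = c(8, 6, 2),
  antielite_salience = c(9, 7, 3)
)

# Summing the two variables row by row
ches_toy$populism <- ches_toy$people_vs_elite + ches_toy$antielite_salience

# Averaging the sum within each party, then sorting in descending order
mean_populism <- aggregate(populism ~ party, data = ches_toy, FUN = mean)
mean_populism <- mean_populism[order(-mean_populism$populism), ]

mean_populism
```

If the data hold exactly one row per party for the selected year, the mean leaves the sum unchanged and the grouping only serves as a safeguard.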

The two rankings are rather different. According to CHES, M5S and Lega rank as the two most populist parties, whereas in the results of the dictionary analysis they turned out to be the least populist ones.

In [None]:
to_drop <- c("IV", "MISTO")

df_db %>% 
filter(year == 2019 & (! party %in% to_drop)) %>% 
arrange(desc(perc_of_populist_toks))

### The PopuList <a class="anchor" id="subparagraph10"></a>

Let's now compare D&B's dictionary with the PopuList dataset.

Lega, FdI, FI/PdL and M5S rank among the most populist parties according to D&B's dictionary. These parties have all been coded as populist by PopuList. The two measures can thus be considered similar.

In [None]:
populist %>% 
filter(country_name == "Italy") %>%
select(party_name, populist) %>% 
arrange(desc(populist))

In [None]:
to_keep <- c("F-ITA", "FI", "PDL", "FI-PDL", "FDI-AN", "FDI", "LEGA-N", "LEGA-NORD-P", "LNA", "LEGA", "LNP", "M5S", 
             "RC-PROGR", "COMUNISTA", "RC", "COM/IT/", "RC-SE", "SI-SEL-POS-LU")

df_db %>% 
filter(party %in% to_keep) %>% 
arrange(desc(perc_of_populist_toks)) %>% 
head(20)