# Concept Network Tool: Data Preparation Tutorial

In this notebook, we show how to prepare a survey with open-ended questions for use with the Concept Network Tool.

At the moment, preparing the data requires a few manual steps that have to be run externally. These steps include:

1. Preprocessing – check that there are no empty strings or stray whitespace and that your data is imported correctly into R.
2. Text annotation – tokenize, split into sentences, tag, parse, and lemmatize the text data.

In [1]:
pacman::p_load(tidyverse, udpipe, stopwords) # load R packages

# Helsingin Sanomat Loneliness Survey 2014

The following example data set is publicly available for research, teaching and study at the Finnish Social Science Data Archive (FSD). In order to run this notebook, you have to register and download the data. You will be asked to supply the purpose of data use for the download.


Data Citation:

- Saari, Juho (Tampere University) & Helsingin Sanomat & Kauhanen, Jussi (University of Eastern Finland) & Karhunen, Leila (University of Eastern Finland) & Lagus, Krista (Aalto University) & Kainulainen, Sakari (Diaconia University of Applied Sciences) & Pantzar, Mika (University of Helsinki) & Erola, Jani (University of Turku) & Junttila, Niina (University of Turku) & Müller, Kiti (Finnish Institute of Occupational Health) & Huhta, Jaana (Finnish Institute of Occupational Health): Helsingin Sanomat Loneliness Survey 2014 [dataset]. Version 1.0 (2020-09-01). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD3360

## Preprocessing

- Import the original data file downloaded from the Finnish Social Science Data Archive (FSD).
- Trim whitespace and convert empty strings to missing values.
- Create an indicator that identifies respondents who replied to the (only) open-ended question.
- Add columns for word and character counts.
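
To make these steps concrete, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are hypothetical stand-ins for the survey file; only the column name `q9` follows the survey example in this notebook.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: two real answers, one empty string, one whitespace-only string
toy <- data.frame(
  fsd_id = 1:4,
  q9 = c("  Olen yksin.  ", "", "   ", "Kaipaan seuraa usein"),
  stringsAsFactors = FALSE
)

toy.clean <- toy %>%
  mutate(q9 = trimws(q9),                        # strip leading and trailing whitespace
         q9 = na_if(q9, ""),                     # empty strings become NA
         grp = ifelse(!is.na(q9), "Yes", "No"),  # did the respondent answer?
         n_words = str_count(q9, "\\w+"),        # word count (NA for missing answers)
         n_chars = nchar(q9))                    # character count (NA for missing answers)

print(toy.clean)
```

Whitespace-only answers end up as `NA` because trimming happens before the empty-string check, so they are counted as non-responses in `grp`.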

In [2]:
df.raw <- read.csv("data/daF3360.csv", sep = ";") %>%
  mutate(q9 = trimws(q9)) %>% # remove leading and trailing whitespace
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>% # replace empty strings with NA (character columns only)
  mutate(grp = ifelse(!is.na(q9), "Yes", "No"), # indicator: did the respondent reply to the open-ended question?
         n_words = str_count(q9, "\\w+"), # add word count column
         n_chars = nchar(q9)) # add character count column


## Lemmatization, morphological tagging and dependency parsing 
The [Turku Neural Parser](https://turkunlp.org/Turku-neural-parser-pipeline/) can tokenize, split into sentences, tag, parse, and lemmatize Finnish text. Alternatively, the UDPipe R package includes two Finnish language models that you can use instead:

- UD Finnish FTB - FinnTreeBank 1 based model
- UD Finnish TDT - Turku Dependency Treebank (TDT) based model


All of these methods produce [CoNLL-U](https://universaldependencies.org/format.html)-formatted output, which is the format the Concept Network Tool uses as input.
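
If you prefer to stay in R, the UDPipe route could look roughly like the sketch below. It is only an illustration: the example responses, the `doc` prefix, and the use of `tempdir()` as the model directory are assumptions, and downloading the model requires internet access.

```r
library(udpipe)

# Download the Finnish TDT model once; "finnish-ftb" would select the FTB model instead.
# tempdir() is used here only to keep the sketch self-contained.
model_info <- udpipe_download_model(language = "finnish-tdt", model_dir = tempdir())
ud_model <- udpipe_load_model(file = model_info$file_model)

# Hypothetical responses standing in for the open-ended survey answers
responses <- c("Olen yksin.", "Kaipaan seuraa usein.")

# Tokenize, tag, lemmatize and parse; one document per response
annotation <- udpipe_annotate(ud_model, x = responses,
                              doc_id = paste0("doc", seq_along(responses)))

# Token-level data frame (doc_id, token, lemma, upos, dep_rel, ...)
df.udpipe <- as.data.frame(annotation)

# The same annotation as CoNLL-U text is available in annotation$conllu
head(df.udpipe[, c("doc_id", "token", "lemma", "upos", "dep_rel")])
```

Because `doc_id` is set per response here, the output can be linked to the survey rows without the paragraph-based bookkeeping needed for the Turku Neural Parser.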

Preparing survey data for the Turku Neural Parser pipeline:
- Export the open-ended survey responses as a text file (`txt`).
- Separate responses with double line breaks (`\n\n`) so that each response is interpreted as a paragraph.
- Use the row number as an identification variable (`conllu_id`) in order to link the parsed output back to the survey data.
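
Linking by row number only works if every response becomes exactly one paragraph in the exported file. The sketch below checks this on hypothetical responses. Collapsing line breaks inside a response is an assumption that may or may not suit your analysis. Note also that `writeLines()` writes missing values as the literal text `NA`.

```r
# Hypothetical responses; the second one contains a blank line inside the answer
responses <- c("Olen yksin.", "Kaipaan seuraa.\n\nUsein.", "Ei kommenttia.")

# Assumption: internal line breaks can be collapsed into a single space,
# so that a blank line inside an answer does not start a new paragraph
responses_flat <- gsub("[\r\n]+", " ", responses)

# Write to a temporary file with a blank line after each response
txt_file <- tempfile(fileext = ".txt")
writeLines(responses_flat, txt_file, sep = "\n\n")

# Read the file back: each non-empty line is one paragraph
lines <- readLines(txt_file)
paragraphs <- lines[nzchar(lines)]

# The number of paragraphs must equal the number of responses
stopifnot(length(paragraphs) == length(responses))
```

Without the `gsub()` step, the second response would be split into two paragraphs and every later paragraph index would be shifted by one.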

In [3]:
df.export <- df.raw %>%
  select(fsd_id, q9) %>%
  mutate(conllu_id = row_number()) # add conllu_id (row number) to link parsed paragraphs back to survey rows

# separate responses with double line breaks so that the Neural Parser treats each response as a paragraph
writeLines(df.export$q9, "text_data/hs2014_q9.txt", sep = "\n\n")

### Running Turku Neural Parser pipeline

Resources, tutorials and documentation:
- Neural Parser documentation: [Turku neural parser pipeline](https://turkunlp.org/Turku-neural-parser-pipeline/)
- Running the parser in CSC server environments: [GPU-accelerated machine learning - Docs CSC](https://docs.csc.fi/support/tutorials/gpu-ml/#turku-neural-parser)
- [UDPipe - Text Annotation with UDPipe models](https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html) (with R)


### Annotated output
- After successfully running the parser, import the [CoNLL-U](https://universaldependencies.org/format.html)-formatted output file back into R using the [`udpipe`](https://cran.r-project.org/web/packages/udpipe/index.html) package.
- Edit `doc_id` to include the index from `paragraph_id`, then reset `paragraph_id`.


In [4]:
df.annotated <- udpipe_read_conllu("text_data/hs2014_q9.conllu") %>%
  mutate(sentence_id = as.numeric(sentence_id),
         doc_id = paste0("doc", paragraph_id)) %>% # one document per survey response
  mutate(paragraph_id = 1) # reset paragraph_id now that doc_id carries the index
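
Since `doc_id` now carries the paragraph index, the annotated tokens can be linked back to the survey rows through `conllu_id`. The following is a minimal sketch with hypothetical toy data frames (`survey` and `annotated`); it assumes the paragraph index in the parser output matches the row number used at export time.

```r
library(dplyr)

# Hypothetical survey rows with the row-number identifier used at export time
survey <- data.frame(fsd_id = c(101, 102),
                     q9 = c("Olen yksin.", "Kaipaan seuraa."),
                     stringsAsFactors = FALSE) %>%
  mutate(conllu_id = row_number())

# Hypothetical annotated tokens in the same shape as the parser output
annotated <- data.frame(doc_id = c("doc1", "doc1", "doc2", "doc2"),
                        token = c("Olen", "yksin", "Kaipaan", "seuraa"),
                        lemma = c("olla", "yksin", "kaivata", "seura"),
                        stringsAsFactors = FALSE)

# Recover the numeric index from doc_id and join the respondent id onto each token
linked <- annotated %>%
  mutate(conllu_id = as.integer(sub("^doc", "", doc_id))) %>%
  left_join(select(survey, fsd_id, conllu_id), by = "conllu_id")

print(linked)
```

Any token whose `fsd_id` is `NA` after the join points to a mismatch between paragraph indices and survey rows, which is worth checking before further analysis.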

In [5]:
# Set output path to export data
output_dir <- "processed_data"

if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

write.csv(df.annotated, file.path(output_dir, "hs2014_processed.csv"), row.names = FALSE)