# Data preprocessing - Creation of the final data frame for data analysis

## Introduction

This document contains the code for the item creation corresponding to the master thesis "...". 

The research question of the thesis is: Does the similarity of the first three noun phrase antecedents of double embedded sentences have an effect on the reading times of grammatical and ungrammatical double embedded sentences?

To address this question, the self-paced reading (SPR) and eye-tracking data of Vasishth et al. (2010) are analysed. For each sentence, the similarity of the three noun phrase antecedents is calculated with the machine learning model Word2vec. A linear mixed model with varying intercepts for subjects is then fitted to estimate the interaction between the similarity and the grammaticality of the sentences.
The following paragraphs present the exact procedure for the creation of items together with the actual code.

End product of this notebook: A data frame that contains the subject ID, the sentence number, the condition of the sentence, the reading time of the post V1 region (log transformed) and the corresponding similarity value (centred). The data was empirically collected for the Vasishth et al. (2010) study.

## Coding part

First, the required packages have to be imported. Pandas and numpy are Python libraries for working with tabular data and numerical arrays, respectively. Gensim and matplotlib are imported as well.

In [1]:
import gensim
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Next, the data from the first German SPR experiment of the Vasishth et al. (2010) study is loaded into the notebook.

In [2]:
df = pd.read_csv('e3_de_spr_data_single_space.txt', sep=' ', names=['Subject_ID', 'Experiment', 'Sentence_Number', 'Condition', 'Word_Position','Word', 'RT','Similarity', 'Grammaticality'])
df

Unnamed: 0,Subject_ID,Experiment,Sentence_Number,Condition,Word_Position,Word,RT,Similarity,Grammaticality
0,10,inter,10,c,1,"Der_Pianist,",957,similar,ungrammatical
1,10,inter,10,c,2,den,823,similar,ungrammatical
2,10,inter,10,c,3,"der_Cellist,",1456,similar,ungrammatical
3,10,inter,10,c,4,den,1057,similar,ungrammatical
4,10,inter,10,c,5,der_Hausmeister,1415,similar,ungrammatical
...,...,...,...,...,...,...,...,...,...
4466,9,inter,12,d,4,den,507,dissimilar,ungrammatical
4467,9,inter,12,d,5,der_Zuschauer,557,dissimilar,ungrammatical
4468,9,inter,12,d,6,"bewunderte,",619,dissimilar,ungrammatical
4469,9,inter,12,d,7,beobachtete,1108,dissimilar,ungrammatical


### Extraction of relevant data 

Many data points are not needed for the purpose of this study. Only the post V1 region of the 'gug' condition (the experimental condition) is needed. The relevant data has to be extracted from the data frame.

In [3]:
# Extract the experimental condition 'gug'
df_gug = df[df.Experiment=='gug']
df_gug

Unnamed: 0,Subject_ID,Experiment,Sentence_Number,Condition,Word_Position,Word,RT,Similarity,Grammaticality
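
A filter on a label that does not occur in a column returns an empty data frame without raising an error, so it can help to list the labels before filtering. The following is a minimal sketch on invented toy data; the frame `toy` and all of its values are assumptions for illustration, not the study data.

```python
import pandas as pd

# Toy stand-in for the experiment data; all values are invented
toy = pd.DataFrame({
    "Experiment": ["inter", "inter", "filler"],
    "RT": [957, 823, 640],
})

# List every label in the column together with its row count
label_counts = toy["Experiment"].value_counts()
print(label_counts)

# A label that does not occur yields an empty frame rather than an error
subset = toy[toy["Experiment"] == "gug"]
print(len(subset))
```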


The filter above is meant to store all words of the "gug" condition in a new data frame called df_gug. Note, however, that the output above is empty: in the rows displayed earlier the Experiment column reads 'inter', so no row matches 'gug'. The following steps therefore operate on the full data frame df.
Since only the post V1 region is relevant for the data analysis, the rows corresponding to that region have to be extracted. The post V1 region is always the second-to-last word of a sentence.

How to get the second-to-last word of every sentence:
Since the sentences differ in length, the second-to-last word cannot be accessed through a fixed word position. We therefore first determine the positions of the last words of all sentences and derive the second-to-last words from them.

Our procedure to find the last word of every sentence works as follows: every sentence ends with a dot, so the most straightforward way to find out where sentences end is to find the words that contain a dot. To do so, we create a Boolean series that is True when a word contains a dot and False otherwise.

In [4]:
# True for every word that contains a dot, i.e. the last word of a sentence
last_word = df.Word.apply(lambda x: '.' in x)
last_word

0       False
1       False
2       False
3       False
4       False
        ...  
4466    False
4467    False
4468    False
4469    False
4470     True
Name: Word, Length: 4471, dtype: bool

Next, we create an array 'index_last_word' that holds the row positions of all entries that are True in our Boolean series (i.e. of the words that occur with a dot). As a result, we have the positions of the last words of all sentences.

In [5]:
index_last_word = np.where(last_word)[0]

Since we need the second-to-last word rather than the last word, we subtract 1 from each position and store the results in an array index_second_last. We then extract all rows of df at the positions in index_second_last.

In [6]:
# Row positions of the second-to-last words
index_second_last = index_last_word - 1

In [7]:
second_last_word = df.iloc[index_second_last]
second_last_word

Unnamed: 0,Subject_ID,Experiment,Sentence_Number,Condition,Word_Position,Word,RT,Similarity,Grammaticality
6,10,inter,10,c,7,ersetzte,762,similar,ungrammatical
14,10,inter,11,d,7,verarztete,792,dissimilar,ungrammatical
23,10,inter,9,b,8,bestahl,598,dissimilar,grammatical
32,10,inter,4,a,8,hasste,752,similar,grammatical
40,10,inter,6,c,7,beschuldigte,720,similar,ungrammatical
...,...,...,...,...,...,...,...,...,...
4435,9,inter,4,d,7,hasste,558,dissimilar,ungrammatical
4444,9,inter,5,a,8,besuchte,557,similar,grammatical
4452,9,inter,8,d,7,beschimpfte,752,dissimilar,ungrammatical
4461,9,inter,2,b,8,verärgerte,680,dissimilar,grammatical
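
The same selection can also be written without explicit index arithmetic. The sketch below is an alternative formulation, not the code used in this notebook: it marks sentence-final words and shifts that mask up by one row, so that the row before each sentence-final word is selected. The frame `toy` and its words and reading times are invented for illustration.

```python
import pandas as pd

# Toy data: two short "sentences", each ending in a word with a dot (invented values)
toy = pd.DataFrame({
    "Word": ["Der_Pianist,", "ersetzte", "schlief.", "Der_Arzt,", "hasste", "ging."],
    "RT": [957, 762, 640, 810, 752, 598],
})

# True for the last word of each sentence
is_last = toy["Word"].str.endswith(".")

# Shift the mask up by one row: True now marks the row before each last word
is_second_last = is_last.shift(-1, fill_value=False).astype(bool)

second_last = toy[is_second_last]
print(second_last)
```

Like the subtraction of 1 from the positions, this assumes that the rows of a sentence are stored consecutively and in reading order.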


### The similarity values

In this part we first import the noun triples of the two conditions, the similar and the contrast condition, together with their similarity values. Then the concept of centred similarity is explained and applied.

We now import the similarity values and store them in a data frame called Simval_df. 
We have two similarity values for each sentence: one for the similar condition with three animate nouns, and one for the contrast condition with two animate nouns and one inanimate noun in the middle. 
Each similarity value is the mean similarity among the three nouns of the respective condition.
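
To illustrate what a mean similarity among three nouns can look like, here is a minimal sketch. It assumes that the value is the average of the three pairwise cosine similarities between the word vectors; this notebook does not show how the values in the CSV file were actually computed, so that reading is an assumption. The two-dimensional vectors and the helper names `cosine` and `mean_pairwise_similarity` are hypothetical; real Word2vec vectors have many more dimensions.

```python
from itertools import combinations

import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Invented stand-ins for the word vectors of three nouns
toy_vectors = [
    np.array([1.0, 0.0]),
    np.array([1.0, 1.0]),
    np.array([0.0, 1.0]),
]

value = mean_pairwise_similarity(toy_vectors)
print(value)
```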

In [9]:
Simval_df = pd.read_csv('df_SimilarityValues_ger.csv', sep=',', names=['Sentence_Number', 'Similar_condition', 'Contrast_condition', 'SimVal_Similar', 'SimVal_Contrast'])
Simval_df

Unnamed: 0,Sentence_Number,Similar_condition,Contrast_condition,SimVal_Similar,SimVal_Contrast
0,0,"('Anwalt', 'Zeuge', 'Spion')","('Anwalt', 'Saebel', 'Spion')",0.490121,0.217141
1,1,"('Beamte', 'Buerokrat', 'Besucher')","('Beamte', 'Tisch', 'Besucher')",0.19958,0.243456
2,2,"('Braeutigam', 'Schwiegervater', 'Musiker')","('Braeutigam', 'Bilderrahmen', 'Musiker')",0.413416,0.310512
3,3,"('Bruder', 'Cousin', 'Bauer')","('Bruder', 'Schmuck', 'Bauer')",0.62518,0.218115
4,4,"('Zauberer', 'Akrobat', 'Zuschauer')","('Zauberer', 'Hut', 'Zuschauer')",0.456369,0.308386
5,5,"('Einbrecher', 'Dieb', 'Mann')","('Einbrecher', 'Stein', 'Mann')",0.721395,0.321455
6,6,"('Neurotiker', 'Exzentriker', 'Psychiater')","('Neurotiker', 'Dolch', 'Psychiater')",0.449083,0.226837
7,7,"('Arbeiter', 'Monteur', 'Vorarbeiter')","('Arbeiter', 'Eimer', 'Vorarbeiter')",0.601547,0.379001
8,8,"('Banker', 'Kreditgeber', 'Kunde')","('Banker', 'Geldautomat', 'Kunde')",0.539573,0.435449
9,9,"('Pianist', 'Cellist', 'Hausmeister')","('Pianist', 'Ball', 'Hausmeister')",0.578632,0.25438


Centring the similarity values is the continuous counterpart of the sum contrast coding that will be used for the data analysis later on. 
With sum contrast coding, the conditions are coded as deviations around the overall mean of all data points.
For the condition of grammaticality, the sum contrast coding is +1 for grammatical and -1 for ungrammatical sentences.
To address the research question, we have to analyse the interaction between similarity and grammaticality, that is, whether the effect of grammaticality on the reading times depends on the similarity value. 
For this purpose, the similarity values have to be adapted to the sum contrast coding concept. This is done by subtracting the mean similarity value from each individual similarity value.

First we calculate the mean similarity value across both the similar and the contrast condition.

In [10]:
Simval_Similar = list(Simval_df['SimVal_Similar'])
Simval_Contrast = list(Simval_df['SimVal_Contrast'])

In [11]:
added_sim = Simval_Similar + Simval_Contrast

In [12]:
mean_sim = np.mean(added_sim)
mean_sim

0.38956377375870943

Next, we need to subtract the mean from each similarity value.

This is done with a for loop over the Simval_Similar list (the similarity values of the noun triples of the similar condition) and a second for loop over the Simval_Contrast list (the similarity values of the noun triples of the contrast condition). For each value in the list, we subtract the mean and store the result in a list called Centred_simval_similar or Centred_simval_contrast, respectively.

In [13]:
Centred_simval_similar = []
for value in Simval_Similar:
    Centred_simval_similar.append(value - mean_sim)
print(Centred_simval_similar)

[0.10055741202086205, -0.18998370785266158, 0.023852135054767132, 0.23561581503599882, 0.06680520903319115, 0.3318315101787448, 0.05951962899416685, 0.21198287140578032, 0.15000962745398283, 0.1890681041404605, -0.05884656775742775, -0.0122876511886717]


In [15]:
Centred_simval_contrast = []
for value in Simval_Contrast:
    Centred_simval_contrast.append(value - mean_sim)
print(Centred_simval_contrast)

[-0.17242250312119722, -0.14610760379582644, -0.07905130553990602, -0.17144855577498674, -0.08117786515504122, -0.06810848880559206, -0.16272658575326204, -0.010562394745647907, 0.04588540922850365, -0.13518370408564806, -0.05449009407311678, -0.07273069489747286]
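
The centring step can also be expressed with NumPy arrays instead of explicit loops. The sketch below is an alternative formulation under the same definition (subtract the mean over both conditions from every value). It uses only the first three rows of the similarity table shown earlier, so its mean differs from the mean_sim computed above; the variable names are chosen for this example.

```python
import numpy as np

# First three similarity values per condition, taken from the table shown earlier
similar = np.array([0.490121, 0.19958, 0.413416])
contrast = np.array([0.217141, 0.243456, 0.310512])

# Mean over both conditions together
grand_mean = np.concatenate([similar, contrast]).mean()

# Broadcasting subtracts the scalar mean from every element
centred_similar = similar - grand_mean
centred_contrast = contrast - grand_mean

# After centring, all values together sum to (numerically) zero
total = centred_similar.sum() + centred_contrast.sum()
print(centred_similar, centred_contrast, total)
```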


For our final data frame we need log-transformed reading times.

In [16]:
logRT = list(np.log(second_last_word['RT']))

The next step is to match each similarity value with its corresponding sentence number, condition and reading time. 

To do that, we first create a data frame containing all the relevant information for our final data frame, except for the similarity values and the contrast coding for the condition of grammaticality (+1 and -1).

In [17]:
df_WithoutSim = pd.DataFrame(list(zip(second_last_word.Subject_ID, second_last_word.Sentence_Number, second_last_word.Condition, logRT)),
                            columns = ['Subject_ID', 'Sentence_Number', 'Condition', 'logRT'])
df_WithoutSim

Unnamed: 0,Subject_ID,Sentence_Number,Condition,logRT
0,10,10,c,6.635947
1,10,11,d,6.674561
2,10,9,b,6.393591
3,10,4,a,6.622736
4,10,6,c,6.579251
...,...,...,...,...
521,9,4,d,6.324359
522,9,5,a,6.322565
523,9,8,d,6.622736
524,9,2,b,6.522093


The data frame we need for the statistical analysis has one more column containing the centred similarity values. Each sentence has two similarity values (the output above lists 12 centred values per condition). Conditions a and c need the similarity value of the similar condition; conditions b and d need the similarity value of the contrast condition.

To do that, we write a for loop over the column “Sentence_Number” of our data frame “df_WithoutSim”. Note that this column does not simply list each sentence number once! It has one entry per extracted reading time (525 rows in the output above) and contains the number of each sentence that has been read by a specific participant.

The function enumerate yields two values on every pass through the loop. The ‘iterater’ variable is a running index that starts at zero and increases by one with every entry. The variable ‘i’ is the entry at that position in “Sentence_Number”, i.e. the sentence number itself.

Take the first row as an example. It holds sentence 10, read by participant 10, so ‘iterater’ is 0 and ‘i’ is 10. Before moving on to the next entry, we need to know which condition this entry has in order to assign the correct similarity value. We therefore look at the “Condition” column of “df_WithoutSim” in the same row (row zero). The condition there is 'c', which belongs to the similar condition, so we append the centred similarity value for the similar condition of sentence 10 (about 0.189) to the initially empty list “Similarity_Values_dat”.

To find that value we take entry i-1 of the “Centred_simval_similar” list. The offset of one is needed because Python starts counting at zero while the sentence numbers start at one: the list holds the values of all sentences in order, so the value of the tenth sentence sits at index 9. If the condition is not ‘a’ or ‘c’ (meaning it has to be ‘b’ or ‘d’), we instead take entry i-1 of the “Centred_simval_contrast” list and append it to “Similarity_Values_dat”. The loop then moves on to the next entry of “Sentence_Number”, and so on.

In [18]:
Similarity_Values_dat = []

for iterater, i in enumerate(df_WithoutSim.Sentence_Number):
    if df_WithoutSim.Condition[iterater] in ['a', 'c']:
        simVal_similar = Centred_simval_similar[i-1]
        Similarity_Values_dat.append(simVal_similar)
    else:
        simVal_Con = Centred_simval_contrast[i-1]
        Similarity_Values_dat.append(simVal_Con)
print(Similarity_Values_dat)

[0.1890681041404605, -0.05449009407311678, 0.04588540922850365, 0.23561581503599882, 0.3318315101787448, -0.0122876511886717, -0.17242250312119722, 0.21198287140578032, -0.07905130553990602, -0.18998370785266158, -0.16272658575326204, -0.08117786515504122, -0.05449009407311678, -0.08117786515504122, -0.0122876511886717, -0.16272658575326204, 0.04588540922850365, 0.23561581503599882, 0.3318315101787448, 0.21198287140578032, -0.17242250312119722, 0.1890681041404605, -0.18998370785266158, -0.07905130553990602, -0.13518370408564806, 0.06680520903319115, -0.010562394745647907, 0.05951962899416685, -0.05884656775742775, -0.17144855577498674, -0.06810848880559206, 0.10055741202086205, -0.14610760379582644, -0.07273069489747286, 0.15000962745398283, 0.023852135054767132, 0.04588540922850365, 0.21198287140578032, -0.17242250312119722, 0.3318315101787448, 0.1890681041404605, -0.16272658575326204, -0.18998370785266158, -0.05449009407311678, 0.23561581503599882, -0.07905130553990602, -0.0122876511
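
The same lookup can be written without a loop. The sketch below is an alternative formulation, not the code used above: it converts the 1-based sentence numbers to 0-based positions and lets np.where pick, row by row, from the similar or the contrast values. The short lists and the frame `trials` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented centred values for three sentences (sentence 1 sits at index 0)
centred_similar = np.array([0.10, -0.19, 0.02])
centred_contrast = np.array([-0.17, -0.15, -0.08])

# Invented trial list: one row per sentence read
trials = pd.DataFrame({
    "Sentence_Number": [3, 1, 2, 1],
    "Condition": ["a", "b", "c", "d"],
})

# Sentence numbers start at 1, array positions at 0
positions = trials["Sentence_Number"].to_numpy() - 1

# Conditions a and c use the similar values, b and d the contrast values
is_similar = trials["Condition"].isin(["a", "c"]).to_numpy()

trials["Centred_Sim_Val"] = np.where(
    is_similar,
    centred_similar[positions],
    centred_contrast[positions],
)
print(trials)
```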

For the final data frame we need the grammaticality condition to be coded as +1 and -1. Conditions a and b are grammatical sentences; conditions c and d are ungrammatical sentences. We iterate over the column "Condition" of our data frame 'df_WithoutSim'. If the condition is 'a' or 'b', we append 1 to the list 'Contrast_Coding'; otherwise (the condition is 'c' or 'd') we append -1. This list is later added to the data frame as a new column called "Contrast_Coding".

In [19]:
Contrast_Coding = []

for cond in df_WithoutSim.Condition:
    if cond in ['a','b']:
        Contrast_Coding.append(1)
    else:
        Contrast_Coding.append(-1)
print(Contrast_Coding)

[-1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, 1, -1, -1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 1, -1, 1, 1, 1, -1, 1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1, 1, -1, 1, -1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, 1, -1, 1, -1, -1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1, -1, 1, 1, -1, 1, -1, 1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1, 1, 1, -1, -1, 1, -1, -1, 1, 1, -1, -1, 1, -1, 1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, -1, -1, -1, 1, -1, 1, -1, 1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, -1, 1, -1, -1, -1, 1, 1, -1, 1, -1, 1, -1, -1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, 1, 1, 1, 1, -1, -1, 1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, 1, -1, 1, 1, -1, 1, -1, -1, -1, -1, 1, 1, -1, -1, 1, -1, -1, -1, 1, 1, 1, 1, -1
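
The coding rule can also be stated as an explicit mapping from condition to code. This is a minimal sketch of an alternative to the loop above; the series `conditions` holds invented example values and the dictionary name `coding` is chosen for this example.

```python
import pandas as pd

# Invented example conditions
conditions = pd.Series(["c", "d", "b", "a"])

# Grammatical conditions map to +1, ungrammatical conditions to -1
coding = {"a": 1, "b": 1, "c": -1, "d": -1}

contrast_coding = conditions.map(coding)
print(contrast_coding.tolist())
```

One difference from the if/else version: a condition label that is missing from the dictionary becomes NaN here instead of being coded as -1, which makes unexpected labels visible.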

Now we have everything we need to create the final data frame, which adds one column for the contrast coding and one for the centred similarity values.

In [20]:
df_WithSim = pd.DataFrame(list(zip(df_WithoutSim.Subject_ID, df_WithoutSim.Sentence_Number, df_WithoutSim.Condition, Contrast_Coding, logRT, Similarity_Values_dat)),
                            columns = ['Subject_ID', 'Sentence_Number', 'Condition', 'Contrast_Coding','logRT', 'Centered_Sim_Val'])
df_WithSim

Unnamed: 0,Subject_ID,Sentence_Number,Condition,Contrast_Coding,logRT,Centered_Sim_Val
0,10,10,c,-1,6.635947,0.189068
1,10,11,d,-1,6.674561,-0.054490
2,10,9,b,1,6.393591,0.045885
3,10,4,a,1,6.622736,0.235616
4,10,6,c,-1,6.579251,0.331832
...,...,...,...,...,...,...
521,9,4,d,-1,6.324359,-0.171449
522,9,5,a,1,6.322565,0.066805
523,9,8,d,-1,6.622736,-0.010562
524,9,2,b,1,6.522093,-0.146108


In [21]:
# Export the data frame for the data analysis in R (header=False: no column-name row is written)
df_WithSim.to_csv('Final_DataFrame_e3.csv', sep=',', header=False, index=True)
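
Because the file is written with header=False and index=True, whoever reads it back has to supply the column names, including one for the leading index column. The sketch below shows such a round trip in pandas on two rows copied from the final table above; it writes to an in-memory buffer instead of a file, and the index name "Row" is an assumption made for this example.

```python
import io

import pandas as pd

columns = ["Subject_ID", "Sentence_Number", "Condition",
           "Contrast_Coding", "logRT", "Centered_Sim_Val"]

# Two rows taken from the final data frame shown above
toy = pd.DataFrame(
    [[10, 10, "c", -1, 6.635947, 0.189068],
     [10, 9, "b", 1, 6.393591, 0.045885]],
    columns=columns,
)

# Write without a header row but with the index, as in the export cell
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", header=False, index=True)
buffer.seek(0)

# Read back: no header in the file, so all names are given explicitly
restored = pd.read_csv(buffer, sep=",", header=None,
                       names=["Row"] + columns, index_col="Row")
print(restored)
```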