# Data preprocessing - Creation of the final data frame for data analysis

## Introduction

This document contains the code for the item creation corresponding to the master thesis "...". 

The research question of the thesis is: Does the similarity of the first three noun phrase antecedents of double embedded sentences have an effect on the reading times of grammatical and ungrammatical double embedded sentences?

To examine this question, the self-paced-reading (SPR) and eye-tracking data of Vasishth et al. (2010) are analysed. For each sentence, the similarity of the three noun phrase antecedents is calculated with the machine learning model Word2vec. A linear mixed model with varying intercepts for subjects is then fitted to estimate the interaction between the similarity and the grammaticality of the sentences.
The following sections present the exact procedure for the creation of the items together with the actual code.
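
As a rough sketch, the model described above could be written as follows. The notation is an assumption made for illustration and is not taken from the thesis: $i$ indexes observations, $j$ indexes subjects, $\mathrm{gram}$ is the sum-coded grammaticality (+1/-1) and $\mathrm{sim}$ is the centred similarity. Using the log-transformed TFT as the dependent variable and including both main effects are also assumptions, based on the columns of the final data frame.

```latex
\log(\mathrm{TFT}_{ij}) = \beta_0 + u_{0j}
  + \beta_1\,\mathrm{gram}_{ij}
  + \beta_2\,\mathrm{sim}_{ij}
  + \beta_3\,(\mathrm{gram}_{ij} \times \mathrm{sim}_{ij})
  + \varepsilon_{ij},
\qquad u_{0j} \sim \mathcal{N}(0, \sigma_u^2),
\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $u_{0j}$ is the varying intercept for subject $j$, and $\beta_3$ is the interaction term that the research question is about.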

End product of this notebook: A data frame that contains the subject ID, the sentence number, the condition of the sentence, the sum contrast coding of grammaticality, the log-transformed reading time of the post-V1 region (TFT) and the corresponding centred similarity value. The data was collected empirically for the Vasishth et al. (2010) study.

## Coding part

First, the required packages are imported. pandas and NumPy are Python libraries for working with tabular data and numerical arrays. pyreadr reads R data files (such as .Rda files) into pandas data frames, and matplotlib provides plotting functions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyreadr

Next, the data of the Vasishth et al. (2010) study is loaded into the notebook. Judging by the file name and the German nouns used further below, this is the German eye-tracking data set.

In [2]:
result = pyreadr.read_r('e4_de_et_data.Rda')
result

OrderedDict([('d.rs',
                        subject  trial  version  response  roi condition   c1   c2  itemnum  \
              rownames                                                                        
              1              46     63        1         1    1         a -1.0 -1.0      1.0   
              2              35     63        1         1    5         a -1.0 -1.0      1.0   
              3             104     63        1         0   11         a -1.0 -1.0      1.0   
              4              50     63        1         1   11         a -1.0 -1.0      1.0   
              5              50     63        1         1    9         a -1.0 -1.0      1.0   
              ...           ...    ...      ...       ...  ...       ...  ...  ...      ...   
              85676         113     16        1         0   12         d  1.0  1.0     12.0   
              85677          39     16        1         0    1         d  1.0  1.0     12.0   
              85678         

In [3]:
df = result["d.rs"] # extract the pandas data frame for object df
df

rownames,subject,trial,version,response,roi,condition,c1,c2,itemnum,times,value
1,46,63,1,1,1,a,-1.0,-1.0,1.0,SFD,0
2,35,63,1,1,5,a,-1.0,-1.0,1.0,SFD,0
3,104,63,1,0,11,a,-1.0,-1.0,1.0,SFD,0
4,50,63,1,1,11,a,-1.0,-1.0,1.0,SFD,239591
5,50,63,1,1,9,a,-1.0,-1.0,1.0,SFD,0
...,...,...,...,...,...,...,...,...,...,...,...
85676,113,16,1,0,12,d,1.0,1.0,12.0,LPRT,12606
85677,39,16,1,0,1,d,1.0,1.0,12.0,LPRT,365686
85678,121,16,1,0,11,d,1.0,1.0,12.0,LPRT,193331
85679,121,16,1,0,12,d,1.0,1.0,12.0,LPRT,54650


In [4]:
df.times=="TFT"

rownames
1        False
2        False
3        False
4        False
5        False
         ...  
85676    False
85677    False
85678    False
85679    False
85680    False
Name: times, Length: 85680, dtype: bool

In [5]:
# Keep only the TFT rows; .copy() avoids pandas' SettingWithCopyWarning when a column is added
df_TFT = df[df.times=="TFT"].copy()

In [6]:
# Scale the raw values by 0.001 to obtain milliseconds
df_TFT["TFT_milliseconds"] = df_TFT["value"]*0.001
df_TFT["TFT_milliseconds"]

rownames
61201    353.018
61202    298.432
61203     201.76
61204    239.591
61205    920.424
          ...   
67316    218.561
67317     739.76
67318    1630.72
67319    1147.42
67320    311.028
Name: TFT_milliseconds, Length: 6120, dtype: object
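
Filtering a data frame and then adding a column to the result can trigger pandas' SettingWithCopyWarning, because pandas cannot tell whether the filtered object is a view or a copy. The following is a minimal sketch of one common way to avoid this, using made-up toy data. The names `toy` and `toy_tft` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the eye-tracking frame (values are made up)
toy = pd.DataFrame({
    "times": ["SFD", "TFT", "TFT", "LPRT"],
    "value": [0, 201760, 239591, 12606],
})

# .copy() makes the filtered frame independent of `toy`,
# so adding a column to it is unambiguous
toy_tft = toy[toy["times"] == "TFT"].copy()

# Scale the raw values by 0.001, as in the cell above
toy_tft["TFT_milliseconds"] = toy_tft["value"] * 0.001

print(toy_tft)
```

The original frame `toy` is left untouched; only `toy_tft` receives the new column.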

In [7]:
# Build the mask from df_TFT itself so that it matches the rows being filtered
d = df_TFT[(df_TFT.roi==11)|(df_TFT.roi==12)]
d


rownames,subject,trial,version,response,roi,condition,c1,c2,itemnum,times,value,TFT_milliseconds
61203,104,63,1,0,11,a,-1.0,-1.0,1.0,TFT,201760,201.76
61204,50,63,1,1,11,a,-1.0,-1.0,1.0,TFT,239591,239.591
61206,42,63,1,1,11,a,-1.0,-1.0,1.0,TFT,886855,886.855
61210,35,63,1,1,11,a,-1.0,-1.0,1.0,TFT,722899,722.899
61214,104,63,1,0,12,a,-1.0,-1.0,1.0,TFT,84072,84.072
...,...,...,...,...,...,...,...,...,...,...,...,...
67310,35,16,1,0,12,d,1.0,1.0,12.0,TFT,874261,874.261
67314,46,16,1,0,12,d,1.0,1.0,12.0,TFT,277380,277.38
67316,113,16,1,0,12,d,1.0,1.0,12.0,TFT,218561,218.561
67318,121,16,1,0,11,d,1.0,1.0,12.0,TFT,1630721,1630.72


### Extraction of relevant data 

In [8]:
d[['subject', 'condition', 'itemnum', 'TFT_milliseconds']]

rownames,subject,condition,itemnum,TFT_milliseconds
61203,104,a,1.0,201.76
61204,50,a,1.0,239.591
61206,42,a,1.0,886.855
61210,35,a,1.0,722.899
61214,104,a,1.0,84.072
...,...,...,...,...
67310,35,d,12.0,874.261
67314,46,d,12.0,277.38
67316,113,d,12.0,218.561
67318,121,d,12.0,1630.72


The rows corresponding to the post-V1 region were selected above by keeping only the TFT rows whose region of interest (column roi) is 11 or 12.
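
Selecting the rows of particular regions of interest can be sketched on toy data as follows. This is only an illustration: the data values and the names `toy`, `wanted_rois` and `post_v1` are made up, and `isin` is used here as a compact alternative to combining two comparisons with `|`.

```python
import pandas as pd

# Toy data: one row per subject and region of interest (values are made up)
toy = pd.DataFrame({
    "subject": [104, 50, 42, 35],
    "roi": [11, 9, 12, 1],
    "value": [201760, 0, 84072, 365686],
})

# Regions of interest to keep (11 and 12, as in the selection above)
wanted_rois = [11, 12]

# isin() returns True for every row whose roi is in the list
post_v1 = toy[toy["roi"].isin(wanted_rois)]

print(post_v1)
```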

Next, we create a data frame that has all relevant columns (except for the similarity values).

In [9]:
df_WithoutSim = pd.DataFrame(list(zip(d["subject"], d["itemnum"], d["condition"], d["TFT_milliseconds"])),
                            columns = ['subject', 'item', 'level', 'TFT'])
df_WithoutSim

index,subject,item,level,TFT
0,104,1.0,a,201.760
1,50,1.0,a,239.591
2,42,1.0,a,886.855
3,35,1.0,a,722.899
4,104,1.0,a,84.072
...,...,...,...,...
1219,35,12.0,d,874.261
1220,46,12.0,d,277.380
1221,113,12.0,d,218.561
1222,121,12.0,d,1630.721


In [10]:
# np.float was removed in NumPy 1.24; the built-in float does the same job
df_WithoutSim["TFT"] = df_WithoutSim["TFT"].astype(float)

In [24]:
# Log-transform the reading times (natural logarithm)
log_RT = np.log(df_WithoutSim.TFT)

### The similarity values

In this part we first import the nouns of the two conditions, the similar and the contrast condition, together with their similarity values. Then the concept of centred similarity is explained and applied.

We now import the similarity values and store them in a data frame called Simval_df.
We have two similarity values for each sentence: one for the similar condition with three animate noun phrases, and one for the contrast condition with two animate nouns and one inanimate noun in the middle.
Each similarity value is the mean similarity among the three nouns of the respective condition.

In [12]:
Simval_df = pd.read_csv('df_SimilarityValues_ger.csv', sep=',', names=['Sentence_Number', 'Similar_condition', 'Contrast_condition', 'SimVal_Similar', 'SimVal_Contrast'])
Simval_df

index,Sentence_Number,Similar_condition,Contrast_condition,SimVal_Similar,SimVal_Contrast
0,0,"('Anwalt', 'Zeuge', 'Spion')","('Anwalt', 'Saebel', 'Spion')",0.490121,0.217141
1,1,"('Beamte', 'Buerokrat', 'Besucher')","('Beamte', 'Tisch', 'Besucher')",0.19958,0.243456
2,2,"('Braeutigam', 'Schwiegervater', 'Musiker')","('Braeutigam', 'Bilderrahmen', 'Musiker')",0.413416,0.310512
3,3,"('Bruder', 'Cousin', 'Bauer')","('Bruder', 'Schmuck', 'Bauer')",0.62518,0.218115
4,4,"('Zauberer', 'Akrobat', 'Zuschauer')","('Zauberer', 'Hut', 'Zuschauer')",0.456369,0.308386
5,5,"('Einbrecher', 'Dieb', 'Mann')","('Einbrecher', 'Stein', 'Mann')",0.721395,0.321455
6,6,"('Neurotiker', 'Exzentriker', 'Psychiater')","('Neurotiker', 'Dolch', 'Psychiater')",0.449083,0.226837
7,7,"('Arbeiter', 'Monteur', 'Vorarbeiter')","('Arbeiter', 'Eimer', 'Vorarbeiter')",0.601547,0.379001
8,8,"('Banker', 'Kreditgeber', 'Kunde')","('Banker', 'Geldautomat', 'Kunde')",0.539573,0.435449
9,9,"('Pianist', 'Cellist', 'Hausmeister')","('Pianist', 'Ball', 'Hausmeister')",0.578632,0.25438


Centring the similarity values is the counterpart of the sum contrast coding that will be used for the data analysis later on.
In sum contrast coding, the levels of a factor are coded so that they are centred around zero; effects are then estimated relative to the overall mean of all data points.
For the condition of grammaticality, the sum contrast coding is +1 for grammatical and -1 for ungrammatical sentences.
To answer the research question, we have to analyse the interaction between similarity and grammaticality, i.e. whether the effect of grammaticality changes with the similarity values.
For this purpose, the similarity values have to be put on a comparable, zero-centred scale. This is done by subtracting the mean similarity value from each individual similarity value.
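
Written as a formula, the centring step looks roughly as follows. The symbols are introduced here for illustration only: $N$ is the number of sentences, and $s_i^{\mathrm{sim}}$ and $s_i^{\mathrm{con}}$ are the similarity values of sentence $i$ in the similar and the contrast condition.

```latex
\bar{s} = \frac{1}{2N} \sum_{i=1}^{N} \left( s_i^{\mathrm{sim}} + s_i^{\mathrm{con}} \right),
\qquad
\tilde{s}_i^{\mathrm{sim}} = s_i^{\mathrm{sim}} - \bar{s},
\qquad
\tilde{s}_i^{\mathrm{con}} = s_i^{\mathrm{con}} - \bar{s}
```

By construction, the centred values $\tilde{s}$ sum to zero over all sentences and both conditions.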

First we calculate the mean similarity value from the similar and the contrast condition.

In [13]:
Simval_Similar = list(Simval_df['SimVal_Similar'])
Simval_Contrast = list(Simval_df['SimVal_Contrast'])

In [14]:
added_sim = Simval_Similar + Simval_Contrast

In [15]:
mean_sim = np.mean(added_sim)
mean_sim

0.38956377375870943

Next, we need to subtract the mean from each similarity value.

This is done with a for loop over the list Simval_Similar (containing the similarity values of the noun triples in the similar condition) and a second for loop over the list Simval_Contrast (containing the similarity values of the noun triples in the contrast condition). For each value in the list, we subtract the mean and append the result to a list called Centred_simval_similar or Centred_simval_contrast, respectively.

In [16]:
Centred_simval_similar = []
for value in Simval_Similar:
    Centred_simval_similar.append(value - mean_sim)
print(Centred_simval_similar)

[0.10055741202086205, -0.18998370785266158, 0.023852135054767132, 0.23561581503599882, 0.06680520903319115, 0.3318315101787448, 0.05951962899416685, 0.21198287140578032, 0.15000962745398283, 0.1890681041404605, -0.05884656775742775, -0.0122876511886717]


In [17]:
Centred_simval_contrast = []
for value in Simval_Contrast:
    Centred_simval_contrast.append(value - mean_sim)
print(Centred_simval_contrast)

[-0.17242250312119722, -0.14610760379582644, -0.07905130553990602, -0.17144855577498674, -0.08117786515504122, -0.06810848880559206, -0.16272658575326204, -0.010562394745647907, 0.04588540922850365, -0.13518370408564806, -0.05449009407311678, -0.07273069489747286]
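
The same centring can also be done without explicit loops. The following is a minimal sketch with NumPy arrays and made-up similarity values; the names `sim_similar`, `sim_contrast` and `grand_mean` are hypothetical and not used elsewhere in this notebook.

```python
import numpy as np

# Made-up similarity values for three sentences
sim_similar = np.array([0.49, 0.20, 0.41])
sim_contrast = np.array([0.22, 0.24, 0.31])

# Grand mean over both conditions, analogous to mean_sim above
grand_mean = np.concatenate([sim_similar, sim_contrast]).mean()

# Subtracting a scalar from an array centres every element at once
centred_similar = sim_similar - grand_mean
centred_contrast = sim_contrast - grand_mean

print(centred_similar)
print(centred_contrast)
```

A quick sanity check for this step is that all centred values together sum to (approximately) zero.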


The next section is about merging the correct similarity value with its corresponding sentence number, condition and reading time (TFT).

To do that, we use the data frame df_WithoutSim created above, which contains all the relevant information for our final data frame except for the similarity values and the contrast coding for the condition of grammaticality (+1 and -1).

The data frame we need for the statistical analysis has one more column containing the centred similarity values. We have two similarity values for every sentence (1-12). Conditions a and c need the similarity value of the similar condition, conditions b and d need the similarity value of the contrast condition.

To do that, we write a for loop over the column “item” of our data frame “df_WithoutSim”. Note that this column does not simply contain the numbers from 1 to 12! It has one entry per row of the data frame, i.e. the sentence number of every sentence that has been read by a specific participant.

The function enumerate yields two values in every step of the loop. The variable ‘iterater’ is a counter that starts at zero and increases by one each time we move on to the next entry, so it is the row index of the current entry. The variable ‘i’ is the entry itself, i.e. the sentence number in that row.

Assume we start at the first entry of “item”. It belongs to sentence 1, read by participant 104. Thus, ‘iterater’ is 0 and ‘i’ is 1. Before jumping to the next entry, we need to know which condition this entry has, in order to assign the correct similarity value to it. Therefore, we look at the entry in the “level” column of “df_WithoutSim” in the same row (in our example row zero). If the entry is either ‘a’ or ‘c’ (similar condition), we append the centred similarity value of the similar condition for that sentence to the initially empty list “Similarity_Values_dat”. In our example the condition is ‘a’, so we append the centred similarity value for the similar condition of sentence 1, approximately 0.101.

To look up that value, we take entry i-1 of the “Centred_simval_similar” list. We subtract one because Python starts counting at zero, whereas the sentence numbers start at one: the list holds the values for all sentences in order at the positions 0 to 11, so the value for sentence 1 is stored at position 0. Because the sentence numbers are stored as floats (1.0, 2.0, ...), they are converted with int before being used as an index.

If the condition is not ‘a’ or ‘c’ (meaning it has to be ‘b’ or ‘d’), we want to add the corresponding centred similarity value of the contrast condition. We then take entry i-1 of the “Centred_simval_contrast” list and append it to the “Similarity_Values_dat” list. After that, the loop moves on to the next entry of “item”, and so on.

In [21]:
Similarity_Values_dat = []

for iterater, i in enumerate(df_WithoutSim.item):
    if df_WithoutSim.level[iterater] in ['a', 'c']:
        simVal_similar = Centred_simval_similar[int(i)-1]
        Similarity_Values_dat.append(simVal_similar)
    else:
        simVal_Con = Centred_simval_contrast[int(i)-1]
        Similarity_Values_dat.append(simVal_Con)
print(Similarity_Values_dat)

[0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, 0.10055741202086205, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.17242250312119722, -0.1724225031
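
The lookup described above can also be written without an explicit loop. The following sketch uses `np.where` on made-up toy data; the names `toy`, `positions` and `is_similar` as well as the values are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Made-up centred values for two sentences (position 0 = sentence 1)
centred_similar = np.array([0.10, -0.19])
centred_contrast = np.array([-0.17, -0.15])

# Toy rows: sentence number and condition per observation
toy = pd.DataFrame({
    "item": [1.0, 1.0, 2.0, 2.0],
    "level": ["a", "b", "c", "d"],
})

# Sentence numbers start at 1, array positions start at 0
positions = toy["item"].astype(int).to_numpy() - 1

# True for the similar conditions a and c, False for b and d
is_similar = toy["level"].isin(["a", "c"]).to_numpy()

# Pick the similar value where is_similar is True, else the contrast value
toy["centred_similarity"] = np.where(
    is_similar, centred_similar[positions], centred_contrast[positions]
)

print(toy)
```

Compared with the loop, this version handles all rows at once, but the logic is the same: the condition decides which list is used, and the sentence number minus one decides the position in that list.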

For the final data frame for data analysis we need the grammaticality condition to be coded as +1 and -1. Conditions a and b refer to grammatical sentences, conditions c and d to ungrammatical sentences. We iterate over the column "level" of our data frame 'df_WithoutSim'. If the condition is equal to 'a' or 'b', we append 1 to the list 'Contrast_Coding'; otherwise (if the condition is equal to 'c' or 'd') we append -1. This list is later added to the final data frame as a new column called "contrast gram.".

In [23]:
Contrast_Coding = []

for cond in df_WithoutSim.level:
    if cond in ['a','b']:
        Contrast_Coding.append(1)
    else:
        Contrast_Coding.append(-1)
print(Contrast_Coding)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
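
As an alternative to the loop, the contrast coding could be expressed as a dictionary lookup. This is a sketch on toy data; the names `levels`, `coding` and `contrast` are hypothetical.

```python
import pandas as pd

# Toy condition labels
levels = pd.Series(["a", "b", "c", "d", "a"])

# Sum contrast coding: grammatical (a, b) -> +1, ungrammatical (c, d) -> -1
coding = {"a": 1, "b": 1, "c": -1, "d": -1}

# map() replaces every label by its code
contrast = levels.map(coding)

print(contrast.tolist())
```

One difference to the loop is worth noting: a label that is missing from the dictionary becomes NaN with `map`, whereas the `else` branch of the loop would silently code it as -1.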

Now we have everything we need to create the final data frame, which adds one column for the contrast coding and one for the centred similarity values.

In [26]:
df_WithSim = pd.DataFrame(list(zip(df_WithoutSim.subject, df_WithoutSim.item, df_WithoutSim.level, Contrast_Coding, log_RT, Similarity_Values_dat)),
                            columns = ['subject', 'item', 'level', 'contrast gram.','log_TFT', 'centered similarity'])
df_WithSim

index,subject,item,level,contrast gram.,log_TFT,centered similarity
0,104,1.0,a,1,5.307079,0.100557
1,50,1.0,a,1,5.478933,0.100557
2,42,1.0,a,1,6.787681,0.100557
3,35,1.0,a,1,6.583270,0.100557
4,104,1.0,a,1,4.431674,0.100557
...,...,...,...,...,...,...
1219,35,12.0,d,-1,6.773379,-0.072731
1220,46,12.0,d,-1,5.625388,-0.072731
1221,113,12.0,d,-1,5.387065,-0.072731
1222,121,12.0,d,-1,7.396778,-0.072731


In [27]:
# We are now ready to export the data frame for the data analysis in R
df_WithSim.to_csv('Final_DataFrame_ger4.csv', sep=',', header=False, index=True)
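
Because the file is written with header=False, it contains no column names, so they have to be supplied again when the file is read. The following sketch shows such a round trip in memory with toy data; the column name `row` and the use of `io.StringIO` instead of a real file are assumptions made for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the final data frame (values are made up)
toy = pd.DataFrame({"subject": [104, 50], "log_TFT": [5.307, 5.479]})

# Write without header but with the row index, as in the cell above
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", header=False, index=True)

# Read it back: header=None because the first line is already data,
# and names= supplies the column names (the first one is the row index)
buffer.seek(0)
restored = pd.read_csv(buffer, header=None, names=["row", "subject", "log_TFT"])

print(restored)
```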