# Data preprocessing - Creation of the final data frame for data analysis

## Introduction

This document contains the code for the item creation corresponding to the master thesis "...". 

The research question of the thesis is: Does the similarity of the first three noun phrase antecedents of double embedded sentences have an effect on the reading times of grammatical and ungrammatical double embedded sentences?

To address this question, the self-paced reading (SPR) and eye-tracking data of Vasishth et al. (2010) are analysed. For each sentence, the similarity of the three noun phrase antecedents is calculated with the machine learning model Word2vec. A linear mixed model with varying intercepts for subjects is then fitted to estimate the interaction between similarity and grammaticality.
The following sections describe the procedure for creating the items and present the code used to do so.

End product of this notebook: a data frame that contains the subject ID, the sentence number, the condition of the sentence, the contrast coding for grammaticality, the log-transformed reading time of the post-V1 region (TFT) and the corresponding centred similarity value. The reading times were collected empirically for the Vasishth et al. (2010) study.

## Coding part

First, the required packages are imported. pandas and NumPy are Python libraries for working with tabular data and numerical arrays, and matplotlib is a plotting library.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Next, the data of the English eye-tracking experiment of the Vasishth et al. (2010) study (file e2_en_et_data.txt) is loaded into the notebook.

In [2]:
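# names= is passed without header=0, so the file's own header line is read
# as data row 0 and the columns keep dtype object (strings)
col_names = ['Subject', 'Trial', 'Experiment', 'Item', 'Level', 'Accuracy', 'roi',
             'FFD', 'FFP', 'SFD', 'FPRT', 'RBRT', 'TFT', 'RPD', 'CRPD', 'RRT',
             'RRTP', 'RRTR', 'FPRC', 'TRC', 'LPRT']
df = pd.read_csv('e2_en_et_data.txt', sep=' ', names=col_names)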
df

Unnamed: 0,Subject,Trial,Experiment,Item,Level,Accuracy,roi,FFD,FFP,SFD,...,RBRT,TFT,RPD,CRPD,RRT,RRTP,RRTR,FPRC,TRC,LPRT
0,subject,trial,experiment,item,condition,accuracy,roi,FFD,FFP,SFD,...,RBRT,TFT,RPD,CRPD,RRT,RRTP,RRTR,FPRC,TRC,LPRT
1,1,3,gug,16,d,1,1,144,1,0,...,260,888,260,260,628,0,628,0,0,376
2,1,3,gug,16,d,1,2,152,1,0,...,304,1672,304,564,1368,0,1368,0,0,456
3,1,3,gug,16,d,1,3,48,1,0,...,256,1620,408,972,1544,180,1364,1,4,304
4,1,3,gug,16,d,1,4,152,1,0,...,724,1064,1408,2380,700,360,340,1,3,140
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8984,48,76,gug,7,b,1,11,208,1,0,...,208,800,208,4840,592,0,592,0,0,592
8985,48,76,gug,7,b,1,12,56,1,0,...,644,644,2332,7172,384,384,0,2,2,384
8986,48,76,gug,7,b,1,13,,,,...,,,,,,,,,,
8987,48,76,gug,7,b,1,14,,,,...,,,,,,,,,,
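
Row 0 of the output above repeats the column names, which suggests that the file's header line was read as data. The following is a minimal sketch of that behaviour on a hypothetical three-line excerpt (the excerpt and the shortened column list are assumptions for illustration, not the real file). It compares `names=` alone with `names=` plus `header=0`.

```python
import io
import pandas as pd

# Hypothetical excerpt in the same space-separated layout as the data file
raw = "subject trial roi TFT\n1 3 1 888\n1 3 2 1672\n"
cols = ["Subject", "Trial", "roi", "TFT"]

# names= alone: the header line of the file is kept as data row 0,
# so every column contains strings
with_names = pd.read_csv(io.StringIO(raw), sep=" ", names=cols)

# names= with header=0: the header line is replaced by the given names,
# so the columns can be parsed as numbers
replaced = pd.read_csv(io.StringIO(raw), sep=" ", names=cols, header=0)

print(with_names)
print(replaced.dtypes)
```

With the first variant, comparisons such as `roi == "11"` have to use strings, which is what the later cells of this notebook do.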


### Extraction of relevant data 

Many of the data points are not needed for the purpose of this study. Only the post-V1 region of the 'gug' condition (the experimental condition) is needed, so the relevant data has to be extracted from the data frame.

In [3]:
# Extract the experimental condition 'gug'
df_gug = df[df.Experiment=='gug']
df_gug

Unnamed: 0,Subject,Trial,Experiment,Item,Level,Accuracy,roi,FFD,FFP,SFD,...,RBRT,TFT,RPD,CRPD,RRT,RRTP,RRTR,FPRC,TRC,LPRT
1,1,3,gug,16,d,1,1,144,1,0,...,260,888,260,260,628,0,628,0,0,376
2,1,3,gug,16,d,1,2,152,1,0,...,304,1672,304,564,1368,0,1368,0,0,456
3,1,3,gug,16,d,1,3,48,1,0,...,256,1620,408,972,1544,180,1364,1,4,304
4,1,3,gug,16,d,1,4,152,1,0,...,724,1064,1408,2380,700,360,340,1,3,140
5,1,3,gug,16,d,1,5,332,1,0,...,332,1420,332,2712,1088,0,1088,0,2,156
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8984,48,76,gug,7,b,1,11,208,1,0,...,208,800,208,4840,592,0,592,0,0,592
8985,48,76,gug,7,b,1,12,56,1,0,...,644,644,2332,7172,384,384,0,2,2,384
8986,48,76,gug,7,b,1,13,,,,...,,,,,,,,,,
8987,48,76,gug,7,b,1,14,,,,...,,,,,,,,,,


In [4]:
# Keep only the columns needed for the analysis
df_gug_TFT = df_gug[['Subject', 'Item', 'Level', 'roi', 'TFT']]
df_gug_TFT

Unnamed: 0,Subject,Item,Level,roi,TFT
1,1,16,d,1,888
2,1,16,d,2,1672
3,1,16,d,3,1620
4,1,16,d,4,1064
5,1,16,d,5,1420
...,...,...,...,...,...
8984,48,7,b,11,800
8985,48,7,b,12,644
8986,48,7,b,13,
8987,48,7,b,14,


Next, the rows belonging to regions of interest (roi) 11 and 12 are extracted.

In [5]:
# Build the mask from df_gug_TFT itself (not from df), so that the mask and
# the frame share the same index; the roi values are strings here
df_gug_TFT_roi = df_gug_TFT[df_gug_TFT.roi.isin(["11", "12"])]
df_gug_TFT_roi


Unnamed: 0,Subject,Item,Level,roi,TFT
8,1,16,d,11,188
9,1,16,d,12,308
20,1,13,a,11,1944
21,1,13,a,12,324
32,1,7,c,11,1524
...,...,...,...,...,...
8961,48,8,c,12,272
8972,48,13,d,11,640
8973,48,13,d,12,156
8984,48,7,b,11,800
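
The two filtering steps (experiment, then region of interest) can be sketched on a toy frame. The values below are hypothetical and only the column names follow the notebook. The mask is built from the frame that is being filtered, so pandas does not need to realign a mask taken from a different frame.

```python
import pandas as pd

# Toy frame (hypothetical values); roi and TFT are strings, as in the notebook
toy = pd.DataFrame({
    "Experiment": ["gug", "gug", "filler", "gug"],
    "roi": ["10", "11", "11", "12"],
    "TFT": ["500", "188", "300", "308"],
})

# Step 1: keep the experimental items only
toy_gug = toy[toy["Experiment"] == "gug"]

# Step 2: keep roi 11 and 12, with a mask derived from toy_gug itself
toy_roi = toy_gug[toy_gug["roi"].isin(["11", "12"])]
print(toy_roi)
```

The original row labels are kept after filtering, which is why the index of the output above has gaps.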


All rows of the "gug" condition are now stored in the data frame df_gug, and the columns needed for the analysis in df_gug_TFT.
Since only the post-V1 region is relevant for the data analysis, the rows corresponding to that region have to be extracted. The post-V1 region is always the second-to-last word of every sentence.

In this notebook, these rows are selected through the roi column: cell 5 above keeps the rows with roi 11 or 12.

Next, a data frame is created that has all relevant columns except for the similarity values.

In [6]:
df_gug_TFT_roi["Subject"]

8        1
9        1
20       1
21       1
32       1
        ..
8961    48
8972    48
8973    48
8984    48
8985    48
Name: Subject, Length: 1498, dtype: object

In [7]:
# Collect the relevant columns in a new data frame with a fresh 0-based index
df_WithoutSim = pd.DataFrame(
    list(zip(df_gug_TFT_roi["Subject"], df_gug_TFT_roi["Item"],
             df_gug_TFT_roi["Level"], df_gug_TFT_roi["TFT"])),
    columns=['Subject_ID', 'Sentence_Number', 'Level', 'TFT'])
df_WithoutSim

Unnamed: 0,Subject_ID,Sentence_Number,Level,TFT
0,1,16,d,188
1,1,16,d,308
2,1,13,a,1944
3,1,13,a,324
4,1,7,c,1524
...,...,...,...,...
1493,48,8,c,272
1494,48,13,d,640
1495,48,13,d,156
1496,48,7,b,800


In [8]:
# The np.float alias was removed in NumPy 1.24; astype(float) does the same conversion
df_WithoutSim["TFT"] = df_WithoutSim["TFT"].astype(float)
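
The conversion of the reading times from strings to floats can be sketched on a toy column (hypothetical values). `astype(float)` is the direct conversion. `pd.to_numeric` with `errors="coerce"` is an alternative that turns entries it cannot parse into NaN instead of raising an error. Which of the two is preferable depends on whether unparseable entries should stop the notebook.

```python
import numpy as np
import pandas as pd

# Toy column: two reading times stored as strings and one missing value
tft = pd.Series(["188", "308", np.nan], name="TFT")

# Direct conversion; raises ValueError if an entry is not a valid number
as_float = tft.astype(float)

# Lenient conversion; entries that cannot be parsed become NaN
coerced = pd.to_numeric(tft, errors="coerce")

print(as_float)
print(coerced)
```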

### The similarity values

In this part, we first import the nouns for the two conditions, the similar and the contrast condition, together with their similarity values. The concept of centred similarity is then explained and applied.

We import the similarity values and store them in a data frame called Simval_df.
There are two similarity values for each sentence: one for the similar condition with three animate noun phrases, and one for the contrast condition with two animate nouns and one inanimate noun in the middle.
Each similarity value is the mean similarity of the three nouns in the respective condition.

In [9]:
Simval_df = pd.read_csv('df_SimilarityValues.csv', sep=',',
                        names=['Sentence_Number', 'Similar_condition', 'Contrast_condition',
                               'SimVal_Similar', 'SimVal_Contrast'])
Simval_df

Unnamed: 0,Sentence_Number,Similar_condition,Contrast_condition,SimVal_Similar,SimVal_Contrast
0,0,"('carpenter', 'craftsman', 'peasant')","('carpenter', 'pillar', 'peasant')",0.391551,0.137571
1,1,"('mother', 'daughter', 'sister')","('mother', 'gun', 'sister')",0.810291,-0.069852
2,2,"('worker', 'tenant', 'foreman')","('worker', 'bucket', 'foreman')",0.369354,0.225532
3,3,"('trader', 'businessman', 'professor')","('trader', 'computer', 'professor')",0.623997,0.101565
4,4,"('painter', 'musician', 'father')","('painter', 'hut', 'father')",0.423884,0.271721
5,5,"('saxophonist', 'trumpeter', 'conductor')","('saxophonist', 'baton', 'conductor')",0.703727,0.308906
6,6,"('pharmacist', 'optician', 'stranger')","('pharmacist', 'button', 'stranger')",0.295159,0.233564
7,7,"('cleaner', 'janitor', 'doctor')","('cleaner', 'ball', 'doctor')",0.315141,0.207383
8,8,"('dancer', 'singer', 'bystander')","('dancer', 'shoe', 'bystander')",0.568988,0.452954
9,9,"('artist', 'sportsman', 'guard')","('artist', 'computer', 'guard')",0.253536,0.001793


Centring the similarity values is the counterpart, for a continuous predictor, of the sum contrast coding that will be used for the data analysis later on.
With sum contrast coding, the levels of a factor are coded so that the codes sum to zero; in a balanced design, effects are then expressed relative to the overall mean rather than to a baseline level.
For the grammaticality condition, the sum contrast coding is +1 for grammatical and -1 for ungrammatical sentences.
To answer the research question, we have to analyse the interaction between similarity and grammaticality, that is, how the similarity values modulate the effect of grammaticality.
For this purpose, the similarity values are adapted to the sum contrast coding by subtracting the mean similarity value from each individual similarity value, so that the centred values have a mean of zero.

First, we calculate the mean similarity value across the similar and the contrast condition.

In [10]:
Simval_Similar = list(Simval_df['SimVal_Similar'])
Simval_Contrast = list(Simval_df['SimVal_Contrast'])

In [11]:
added_sim = Simval_Similar + Simval_Contrast

In [12]:
mean_sim = np.mean(added_sim)
mean_sim

0.3123188643658068

Next, we need to subtract the mean from each similarity value.

This is done with a for loop over the Simval_Similar list (all similarity values of the noun triples in the similar condition) or the Simval_Contrast list (all similarity values of the noun triples in the contrast condition). For each value in the list, we subtract the mean and store the result in a list called Centred_simval_similar or Centred_simval_contrast, respectively.

In [13]:
Centred_simval_similar = []
for value in Simval_Similar:
    Centred_simval_similar.append(value - mean_sim)
print(Centred_simval_similar)

[0.0792321980989073, 0.497972247103462, 0.05703535387874581, 0.3116784066951368, 0.1115649611747358, 0.39140784452320077, -0.017159967770567164, 0.0028219790256117094, 0.2566690714156721, -0.0587831987941172, -0.08934552324353714, 0.38843387077213265, -0.039501111110439524, 0.14478000710369088, 0.20878489772439934, 0.23193129608989693]


In [14]:
Centred_simval_contrast = []
for value in Simval_Contrast:
    Centred_simval_contrast.append(value - mean_sim)
print(Centred_simval_contrast)

[-0.1747479225450661, -0.38217081417678855, -0.0867871894442942, -0.21075361417024396, -0.04059784402488731, -0.003413009544601664, -0.07875486006378196, -0.10493575324653648, 0.140635010699043, -0.3105253755056765, -0.23551733707427047, -0.2871846380585339, -0.007859083911171183, -0.2122081099951174, -0.36469443430542015, -0.11800735731958412]
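
The centring step can also be written without explicit loops. The following sketch uses NumPy arrays and toy similarity values (hypothetical, not the thesis data). The variable names are chosen for illustration only. It follows the same logic as the cells above: one grand mean over both conditions, subtracted from every value.

```python
import numpy as np

# Toy similarity values for three sentences (hypothetical)
similar = np.array([0.40, 0.80, 0.30])
contrast = np.array([0.10, -0.10, 0.30])

# Grand mean over both conditions
grand_mean = np.concatenate([similar, contrast]).mean()

# Subtracting a scalar from an array centres every element at once
centred_similar = similar - grand_mean
centred_contrast = contrast - grand_mean

# After centring, the pooled values have a mean of (numerically) zero
pooled_mean = np.concatenate([centred_similar, centred_contrast]).mean()
print(grand_mean, pooled_mean)
```

Note that each condition on its own does not have a mean of zero after this step; only the pooled values do.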


The next section is about assigning the correct similarity value to its corresponding sentence number, condition and reading time (TFT).

To do that, we use the data frame df_WithoutSim created above, which contains all the relevant information for the final data frame except for the similarity values and the contrast coding for the condition of grammaticality (+1 and -1).

The data frame needed for the statistical analysis has one more column containing the centred similarity values. There are two similarity values for each sentence (1-16). Conditions a and c need the similarity value of the similar condition; conditions b and d need the similarity value of the contrast condition.

To do that, we write a for loop over the column “Sentence_Number” of the data frame “df_WithoutSim”. Note that this column does not simply contain the numbers from 1 to 16: it has 1498 entries, one for each row of the data frame, and each entry is the number of the sentence that a specific participant read.

The function enumerate yields two values in each pass of the loop. The variable ‘iterater’ is the position in the column, counting from zero, and the variable ‘i’ is the entry at that position, that is, the sentence number.

Assume we start at the first entry of “Sentence_Number”. It is sentence 16, read by participant 1, so ‘iterater’ is 0 and ‘i’ is 16. To assign the correct similarity value, we need to know the condition of this entry, so we look at the “Level” column of “df_WithoutSim” in the same row (here row zero). If the entry is ‘a’ or ‘c’ (similar condition), we append the centred similarity value of the similar condition for that sentence to a list called “Similarity_Values_dat”, which is empty at the start. In our example the level is ‘d’, so we append the centred similarity value of the contrast condition for sentence 16, approximately -0.118.

In both cases we take the entry at position i-1 of the list, because Python starts counting at zero while the sentence numbers start at one. The lists “Centred_simval_similar” and “Centred_simval_contrast” hold the values for all sentences in order, at positions 0 to 15, so the value for sentence 16 is at position 15. If the level is not ‘a’ or ‘c’ (so it must be ‘b’ or ‘d’), we take entry i-1 of the “Centred_simval_contrast” list and append it to “Similarity_Values_dat”. The loop then moves on to the next entry of “Sentence_Number”, and so on.

In [15]:
Similarity_Values_dat = []

for iterater, i in enumerate(df_WithoutSim.Sentence_Number):
    if df_WithoutSim.Level[iterater] in ['a', 'c']:
        simVal_similar = Centred_simval_similar[int(i)-1]
        Similarity_Values_dat.append(simVal_similar)
    else:
        simVal_Con = Centred_simval_contrast[int(i)-1]
        Similarity_Values_dat.append(simVal_Con)
print(Similarity_Values_dat)

[-0.11800735731958412, -0.11800735731958412, -0.039501111110439524, -0.039501111110439524, -0.017159967770567164, -0.017159967770567164, 0.1115649611747358, 0.1115649611747358, -0.3105253755056765, -0.3105253755056765, 0.05703535387874581, 0.05703535387874581, 0.2566690714156721, 0.2566690714156721, -0.08934552324353714, -0.08934552324353714, -0.10493575324653648, -0.10493575324653648, -0.38217081417678855, -0.38217081417678855, 0.20878489772439934, 0.20878489772439934, -0.2871846380585339, -0.2871846380585339, 0.0792321980989073, 0.0792321980989073, -0.21075361417024396, -0.21075361417024396, -0.003413009544601664, -0.003413009544601664, -0.2122081099951174, -0.2122081099951174, 0.23193129608989693, 0.23193129608989693, 0.39140784452320077, 0.39140784452320077, -0.04059784402488731, -0.04059784402488731, 0.497972247103462, 0.497972247103462, 0.140635010699043, 0.140635010699043, -0.007859083911171183, -0.007859083911171183, -0.36469443430542015, -0.36469443430542015, 0.002821979025611
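
As an alternative to the loop, the lookup can be sketched with NumPy indexing. The inputs below are toy values (hypothetical), and `pos`, `is_similar` and the column name `centred_similarity` are names introduced for this illustration. The sketch makes the two decisions of the loop explicit: the shift from 1-based sentence numbers to 0-based positions, and the choice between the similar and the contrast list.

```python
import numpy as np
import pandas as pd

# Toy centred values for two sentences (hypothetical)
centred_similar = np.array([0.09, 0.50])
centred_contrast = np.array([-0.16, -0.37])

# Toy rows; sentence numbers are 1-based strings, as in the notebook
rows = pd.DataFrame({
    "Sentence_Number": ["2", "1", "2"],
    "Level": ["d", "a", "c"],
})

# Sentence number i is stored at position i-1
pos = rows["Sentence_Number"].astype(int).to_numpy() - 1

# Levels a and c use the similar values, b and d the contrast values
is_similar = rows["Level"].isin(["a", "c"]).to_numpy()

rows["centred_similarity"] = np.where(
    is_similar, centred_similar[pos], centred_contrast[pos]
)
print(rows)
```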

For the final data frame we need the grammaticality condition to be coded as +1 and -1. Conditions a and b are grammatical sentences, conditions c and d are ungrammatical sentences. We iterate over the column "Level" of the data frame 'df_WithoutSim'. If the level is 'a' or 'b', we append 1 to the list 'Contrast_Coding'; otherwise (level 'c' or 'd') we append -1. This list is later added to the final data frame as the column 'contrast gram.'.

In [16]:
Contrast_Coding = []

for cond in df_WithoutSim.Level:
    if cond in ['a','b']:
        Contrast_Coding.append(1)
    else:
        Contrast_Coding.append(-1)
print(Contrast_Coding)

[-1, -1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 
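
The same coding can be sketched with a dictionary and `Series.map`. The toy levels and the name `mapping` are assumptions for this example. One difference from the if/else loop is worth noting: a level that is not in the dictionary becomes NaN instead of being coded as -1, which makes unexpected levels visible.

```python
import pandas as pd

# Sum contrast coding for grammaticality:
# a, b = grammatical (+1); c, d = ungrammatical (-1)
mapping = {"a": 1, "b": 1, "c": -1, "d": -1}

# Toy levels (hypothetical)
level = pd.Series(["d", "a", "c", "b"])
coding = level.map(mapping)
print(coding.tolist())

# A level outside the mapping is not silently coded as -1
unexpected = pd.Series(["a", "x"]).map(mapping)
print(unexpected)
```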

Now we have everything we need to create the final data frame, that is, to add columns for the contrast coding and the centred similarity values. The reading times (TFT) are log-transformed first.

In [17]:
log_RT = np.log(df_WithoutSim.TFT)
log_RT

0       5.236442
1       5.730100
2       7.572503
3       5.780744
4       7.329094
          ...   
1493    5.605802
1494    6.461468
1495    5.049856
1496    6.684612
1497    6.467699
Name: TFT, Length: 1498, dtype: float64
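
A short sketch of how the log transform treats missing reading times, using toy values (the 188 ms entry corresponds to row 0 above, the rest is hypothetical). `np.log` is the natural logarithm and passes NaN through unchanged, so rows without a TFT value stay missing. This is consistent with the summary at the end of the notebook, which reports fewer non-missing log_TFT values (1206) than rows (1498).

```python
import numpy as np
import pandas as pd

# Toy reading times in ms; the middle value is missing
tft = pd.Series([188.0, np.nan, 1944.0])

# Natural logarithm, applied element-wise; NaN stays NaN
log_tft = np.log(tft)
print(log_tft)

# count() only counts the non-missing entries
print(log_tft.count())
```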

In [18]:
df_WithSim = pd.DataFrame(list(zip(df_WithoutSim.Subject_ID, df_WithoutSim.Sentence_Number, df_WithoutSim.Level, Contrast_Coding, log_RT, Similarity_Values_dat)),
                            columns = ['subject', 'item', 'level', 'contrast gram.','log_TFT', 'centered similarity'])
df_WithSim

Unnamed: 0,subject,item,level,contrast gram.,log_TFT,centered similarity
0,1,16,d,-1,5.236442,-0.118007
1,1,16,d,-1,5.730100,-0.118007
2,1,13,a,1,7.572503,-0.039501
3,1,13,a,1,5.780744,-0.039501
4,1,7,c,-1,7.329094,-0.017160
...,...,...,...,...,...,...
1493,48,8,c,-1,5.605802,0.002822
1494,48,13,d,-1,6.461468,-0.007859
1495,48,13,d,-1,5.049856,-0.007859
1496,48,7,b,1,6.684612,-0.078755


In [26]:
# Export the data frame for the data analysis in R;
# header=False writes no column names, index=True writes the row index as the first column
df_WithSim.to_csv('Final_DataFrame_e2.csv', sep=',', header=False, index=True)
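
Because the file is written with `header=False`, it contains no column names, and whoever reads it back has to supply them. The following sketch shows this round trip in memory with a toy frame. The frame, the buffer and the name `row` for the index column are assumptions for illustration.

```python
import io
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"subject": ["1", "1"], "log_TFT": [5.24, 5.73]})

# Write without a header line but with the row index as first column
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", header=False, index=True)
print(buffer.getvalue())

# Read back: the column names have to be given explicitly
buffer.seek(0)
back = pd.read_csv(buffer, names=["row", "subject", "log_TFT"])
print(back)
```

Writing with `header=True` instead would store the column names in the file, at the cost of a header line that the reading side then has to expect.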

In [20]:
df_WithSim.describe()

Unnamed: 0,contrast gram.,log_TFT,centered similarity
count,1498.0,1206.0,1498.0
mean,0.001335,6.313073,0.001204
std,1.000333,0.870136,0.221936
min,-1.0,3.78419,-0.382171
25%,-1.0,5.63479,-0.118007
50%,1.0,6.306275,-0.01716
75%,1.0,6.972605,0.14478
max,1.0,8.833463,0.497972
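
The mean of the centred similarity column in the summary is close to zero but not exactly zero. One possible explanation, offered here as an assumption rather than a checked result, is that the centring was done over the item-level values, while the final data frame contains each item a different number of times. The toy sketch below (hypothetical values and counts) shows the effect.

```python
import numpy as np

# Two item-level values, centred so that their mean is zero
item_values = np.array([0.2, -0.2])

# In the row-level data the first item occurs 3 times, the second 2 times
row_values = np.repeat(item_values, [3, 2])

print(item_values.mean())  # mean over items
print(row_values.mean())   # mean over rows, weighted by how often each item occurs
```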
