# Item creation - Creation of similarity values corresponding to the experimental items of Vasishth et al. (2010)

## Introduction

This document contains the code for the item creation corresponding to the master thesis "...". 

The research question of the thesis is: Does the similarity of the first three noun phrase antecedents of doubly embedded sentences affect the reading times of grammatical and ungrammatical doubly embedded sentences?

To address this question, the self-paced reading (SPR) and eye-tracking data of Vasishth et al. (2010) are analyzed. For each sentence, the similarity of the three noun phrase antecedents is calculated with the word embedding model Word2vec. A linear mixed model with varying intercepts for subjects is then fitted to estimate the effect of the interaction between similarity and grammaticality.
The following sections describe the exact procedure for creating the items and present the code used to do so.

### Description of the English items of Vasishth et al. (2010)

The English items of Vasishth et al. (2010) are 16 doubly embedded sentences.
There are two different conditions for each sentence:

First condition: (a) grammatical and (b) ungrammatical sentences.

(a) The grammatical sentences are correct doubly embedded sentences. The grammatical condition is the baseline condition.
(b) The ungrammatical versions are doubly embedded sentences in which the second verb, corresponding to the first relative clause, is left out. 

Second condition: (c) three animate antecedents (similar condition) and (d) two animate antecedents with one inanimate antecedent in the middle (contrast condition).

(c) The sentences corresponding to the similar condition have three noun phrases with animate antecedents. The similar condition is the baseline condition.
(d) The sentences corresponding to the contrast condition have an animate antecedent in the matrix sentence, an inanimate antecedent in the middle (first relative clause) and again an animate antecedent in the second relative clause. 

This makes 64 sentences in total, since each of the 16 sentences has four different variants.

### Description of the German items of Vasishth et al. (2010)

The German items are constructed in the same way as the English sentences. For the German version, only twelve items were constructed.

### The similarity values

The similarity values are calculated with word2vec, a model based on a shallow neural network. Word2vec belongs to the family of embedding models, in which words are represented as vectors. Similar words lie close together in the vector space; in other words, similar words have similar embedding vectors. The similarity between two words is measured by the cosine similarity of their vectors, that is, the dot product of the two vectors divided by the product of their lengths. The result is a single number between -1 and 1.
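As a minimal sketch of the cosine similarity measure, the following pure-Python function computes it for two equal-length vectors. The function name `cosine_similarity` and the toy vectors are made up for illustration; they are not real word2vec embeddings and not part of the thesis code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (illustrative only, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # -1
```

Because the dot product is divided by both vector lengths, only the direction of the vectors matters, not their magnitude.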

Examining the research question requires one similarity value per sentence that expresses how similar the antecedents of the three noun phrases are to each other. In word2vec, similarity can only be measured between two words at a time. Each experimental sentence has three relevant nouns: one in the matrix sentence and one in each of the two relative clauses. To obtain a single representative value per sentence, the mean of the pairwise similarity values of these nouns is calculated. Two mean similarity values are calculated for each sentence: one for the similar condition (three animate nouns) and one for the contrast condition (one inanimate noun in the middle).

## Coding part for the English sentences

In this section we present the code that calculates the similarity values for the nouns of the English sentences. Each step is explained in detail.

### The word2vec model

First, the required packages are imported. The Word2vec implementation is taken from gensim, and the model is trained on the text8 corpus. The data frames are built with pandas, an open source Python library for managing and analyzing data. The statistics module provides functions for basic statistical calculations.

In [5]:
# import the required packages (or modules)
import pandas as pd
import gensim
from gensim.models.word2vec import Text8Corpus
from gensim.models import Word2Vec
import statistics as st
from statistics import mean

As a next step, the Word2vec models are trained.
Since the training algorithm of Word2vec involves randomization, the similarity value of two words differs slightly from one training run to the next. This is the case even when the model is trained on exactly the same corpus with the same hyperparameters. To obtain more stable similarity values, two models are trained on the text8 corpus.
For each sentence, the similarity values are calculated with both models, and the mean over both models is taken as the representative similarity value.
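The averaging idea can be sketched independently of gensim. In the sketch below, each "model" is simply a dictionary that maps a word to a toy vector; the names `averaged_similarity`, `toy_model_1` and `toy_model_2` as well as the vectors are hypothetical and only stand in for two separately trained embedding models.

```python
import math
from statistics import mean

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def averaged_similarity(models, x, y):
    """Mean cosine similarity of the words x and y over several embedding tables."""
    return mean(cosine(m[x], m[y]) for m in models)

# Two hypothetical "models": the same words receive different vectors in each one.
toy_model_1 = {"mother": [1.0, 0.0], "daughter": [1.0, 1.0]}
toy_model_2 = {"mother": [0.0, 2.0], "daughter": [0.0, 3.0]}

avg = averaged_similarity([toy_model_1, toy_model_2], "mother", "daughter")
```

Passing the models as a list means the same function would also work for more than two training runs.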

In [2]:
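w2v_model_1 = Word2Vec(
    Text8Corpus(r'text8'),  # text8 corpus file in the working directory
    size=100,      # embedding dimensionality (named vector_size in gensim >= 4.0)
    window=5,      # size of the context window
    min_count=5,   # ignore words that occur fewer than 5 times
    workers=3)     # number of training threads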

In [3]:
w2v_model_2 = Word2Vec(
    Text8Corpus(r'text8'),
    size=100,
    window=5,
    min_count=5,
    workers=3)

Now the nouns of the English experimental sentences of Vasishth et al. (2010) are loaded. 
The "num" column contains the sentence numbers. The "w1eng" column contains the first noun of each sentence, the "w2eng" column the second noun of the similar condition, the "w3eng" column the third noun, and the "w2eng_i" column the second (inanimate) noun of the contrast condition.
The nouns and their sentence numbers are loaded as a pandas data frame called df_eng (df stands for data frame). The leftmost column can be ignored: it is the row index, which runs from zero to fifteen because Python counts from zero.

In [57]:
# load the nouns that correspond to the experimental sentences
df_eng = pd.read_csv('vasishth_2010_Eng.csv', sep=';')
df_eng

Unnamed: 0,num,w1eng,w2eng,w3eng,w2eng_i
0,1,carpenter,craftsman,peasant,pillar
1,2,mother,daughter,sister,gun
2,3,worker,tenant,foreman,bucket
3,4,trader,businessman,professor,computer
4,5,painter,musician,father,hut
5,6,saxophonist,trumpeter,conductor,baton
6,7,pharmacist,optician,stranger,button
7,8,cleaner,janitor,doctor,ball
8,9,dancer,singer,bystander,shoe
9,10,artist,sportsman,guard,computer


In this data frame the word "walking-stick", which occurs in the English experimental items of Vasishth et al. (2010), is replaced with the synonym "cane", because "walking-stick" is not contained in the text8 corpus. When a word does not occur in the training corpus, no word embedding is available for it, so no calculations can be done with that word. The solution in this case is to use a synonym that does occur in the training data, since synonyms have similar embedding vectors. A word cannot simply be added to an already trained model afterwards.
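The substitution step can be sketched as a small lookup that falls back to a listed synonym when a word is missing from the vocabulary. The helper `resolve_word`, the set `toy_vocabulary` and the `synonyms` mapping are hypothetical names for illustration. With a gensim model, a membership test such as `word in w2v_model_1.wv` could presumably take the place of the set lookup.

```python
def resolve_word(word, vocabulary, synonyms):
    """Return the word if it is in the vocabulary, otherwise its listed synonym."""
    if word in vocabulary:
        return word
    replacement = synonyms.get(word)
    if replacement in vocabulary:
        return replacement
    raise KeyError(f"no embedding available for {word!r}")

toy_vocabulary = {"cane", "carpenter", "pillar"}  # stand-in for the model vocabulary
synonyms = {"walking-stick": "cane"}              # the substitution described above

resolved = [resolve_word(w, toy_vocabulary, synonyms)
            for w in ["carpenter", "walking-stick"]]
```

Raising an error for words without a usable synonym makes missing embeddings visible before any similarity value is calculated.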

### Calculation of similarity values

Next, we will calculate the similarity value for each sentence.
Similarity can only be measured between two words at a time. We have three different nouns per sentence, but need a single similarity value that represents the similarity between all three words.

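The approach: 
1. we calculate the similarity between noun 1 and noun 2 (or 2_inanimate)
2. we calculate the similarity between noun 2 (or 2_inanimate) and noun 3
3. we take the mean of these two pairwise similarity values as the similarity value of the sentence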
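
Written as a formula, the value per sentence in the approach above would be as follows. The notation is introduced here for illustration and is not taken from the thesis: $w_1$, $w_2$, $w_3$ denote the embedding vectors of the three nouns, $w_2^{i}$ the vector of the inanimate second noun, and $\cos(\cdot,\cdot)$ the cosine similarity.

$$
s_{\text{similar}} = \tfrac{1}{2}\bigl[\cos(w_1, w_2) + \cos(w_2, w_3)\bigr], \qquad
s_{\text{contrast}} = \tfrac{1}{2}\bigl[\cos(w_1, w_2^{i}) + \cos(w_2^{i}, w_3)\bigr]
$$

Note that the pair (noun 1, noun 3) does not enter the mean; only the two pairs of adjacent nouns are used.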

To calculate the similarity values of all sentences automatically, we first define a function sim(x,y) that calculates the similarity between two words x and y. The function calculates the similarity value of the two words with model 1 and with model 2 and returns the mean of both values.

The nouns must be passed to our function sim(x,y). To do that, we apply a function to the data frame that takes two nouns at a time, row by row. Four different types of tuples are created, where [i] refers to the row number:
(w1eng[i], w2eng[i])
(w2eng[i], w3eng[i])
(w1eng[i], w2eng_i[i])
(w2eng_i[i], w3eng[i])
These tuples are inserted in our function sim(x,y). The calculated similarity value is directly stored in a new column of the data frame. “w1_w2_sim” contains the similarity value of the tuples (w1eng[i], w2eng[i]). “w2_w3_sim” contains the similarity value of the tuples (w2eng[i], w3eng[i]). “w1_w2_i_sim” contains the similarity value of the tuples (w1eng[i], w2eng_i[i]). “w2_i_w3_sim” contains the similarity value of the tuples (w2eng_i[i], w3eng[i]).


In [58]:
def sim(x,y):
    return (w2v_model_1.wv.similarity(x,y) + w2v_model_2.wv.similarity(x,y))/2

In [59]:
df_eng['w1_w2_sim'] = df_eng.apply(lambda row: sim(row.w1eng, row.w2eng), axis=1) #https://www.w3schools.com/python/python_lambda.asp

In [60]:
df_eng['w2_w3_sim'] = df_eng.apply(lambda row: sim(row.w2eng, row.w3eng), axis=1)

In [61]:
df_eng['w1_w2_i_sim'] = df_eng.apply(lambda row: sim(row.w1eng, row.w2eng_i), axis=1)

In [62]:
df_eng['w2_i_w3_sim'] = df_eng.apply(lambda row: sim(row.w2eng_i, row.w3eng), axis=1)

In [63]:
df_eng

Unnamed: 0,num,w1eng,w2eng,w3eng,w2eng_i,w1_w2_sim,w2_w3_sim,w1_w2_i_sim,w2_i_w3_sim
0,1,carpenter,craftsman,peasant,pillar,0.404821,0.35037,0.048433,0.269812
1,2,mother,daughter,sister,gun,0.79979,0.799172,-0.036155,-0.105432
2,3,worker,tenant,foreman,bucket,0.257354,0.513957,0.094218,0.320811
3,4,trader,businessman,professor,computer,0.755615,0.509984,0.019602,0.199334
4,5,painter,musician,father,hut,0.655421,0.181854,0.360938,0.119007
5,6,saxophonist,trumpeter,conductor,baton,0.867773,0.581504,0.37534,0.300734
6,7,pharmacist,optician,stranger,button,0.350721,0.140072,0.057956,0.243833
7,8,cleaner,janitor,doctor,ball,0.269843,0.329229,0.17938,0.224646
8,9,dancer,singer,bystander,shoe,0.739673,0.348316,0.607813,0.251723
9,10,artist,sportsman,guard,computer,0.433199,0.133064,0.220077,-0.174135


In the end, we need the mean similarity value of each condition for each sentence.
For the similar condition we need the mean similarity value of the first, second and third noun. 
For the contrast condition we need the mean similarity value of the first, the second inanimate and the third noun.

The mean per row is calculated with the pandas method .mean(axis=1) over the columns that contain numbers. For this, we first extract the relevant columns of our data frame. We start with the columns for the similar condition: the first, second and third noun of each sentence and the corresponding similarity values. We create a new data frame called df_eng_123 that contains the extracted columns plus a new column with the mean similarity values.
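The row-wise mean can be illustrated without pandas. The sketch below uses the first two rows of the data frame shown above and averages only the numeric entries of each row; the helper name `row_mean` is an assumption for illustration, and `statistics.mean` stands in for the pandas method.

```python
from statistics import mean

# First two rows of the English data frame (similar condition)
rows = [
    {"w1eng": "carpenter", "w2eng": "craftsman", "w3eng": "peasant",
     "w1_w2_sim": 0.404821, "w2_w3_sim": 0.35037},
    {"w1eng": "mother", "w2eng": "daughter", "w3eng": "sister",
     "w1_w2_sim": 0.79979, "w2_w3_sim": 0.799172},
]

def row_mean(row):
    """Mean over the numeric entries of one row; the noun strings are skipped."""
    return mean(v for v in row.values() if isinstance(v, float))

mean_similar = [row_mean(r) for r in rows]
```

Skipping the string entries explicitly mirrors the requirement that only columns containing numbers enter the mean.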

In [74]:
#extract relevant columns and create a new data frame
w1_w2_sim = df_eng['w1_w2_sim']
w2_w3_sim = df_eng['w2_w3_sim']
w1eng = df_eng['w1eng']
w2eng = df_eng['w2eng']
w3eng = df_eng['w3eng']

In [75]:
df_eng_123 = pd.DataFrame(list(zip(w1eng, w2eng, w3eng, w1_w2_sim, w2_w3_sim)),
                          columns =["w1eng", "w2eng", "w3eng", "w1_w2_sim", "w2_w3_sim"])

In [76]:
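# average only the two similarity columns (the noun columns are strings)
df_eng_123['mean_similar'] = df_eng_123[['w1_w2_sim', 'w2_w3_sim']].mean(axis=1)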

Next, we do the same for the contrast condition. We first extract the relevant columns of the original data frame: the first noun, the second inanimate noun and the third noun plus the corresponding similarity values. We create a new data frame called df_eng_12i3 and insert the extracted columns plus the mean similarity values for these nouns.

In [77]:
#extract relevant columns
w2eng_i = df_eng['w2eng_i']
w1_w2_i_sim = df_eng['w1_w2_i_sim']
w2_i_w3_sim = df_eng['w2_i_w3_sim']

In [78]:
df_eng_12i3 = pd.DataFrame(list(zip(w1eng, w2eng_i, w3eng, w1_w2_i_sim, w2_i_w3_sim)),
                          columns =["w1eng", "w2eng_i", "w3eng", "w1_w2_i_sim", "w2_i_w3_sim"])

In [79]:
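# average only the two similarity columns (the noun columns are strings)
df_eng_12i3['mean_contrast'] = df_eng_12i3[['w1_w2_i_sim', 'w2_i_w3_sim']].mean(axis=1)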

In [80]:
mean_similar_eng = df_eng_123[['w1_w2_sim', 'w2_w3_sim']].mean(axis=1)
mean_contrast_eng = df_eng_12i3[['w1_w2_i_sim', 'w2_i_w3_sim']].mean(axis=1)

To get a better overview of the result, we create a data frame that contains all the relevant information at once. This data frame is called df_eng_mean. The "condition_similar" column contains the three animate nouns of each sentence, and the "condition_contrast" column contains the three nouns of the contrast condition (with one inanimate noun in the middle). The "mean_similar" column contains the mean similarity value of the nouns of the similar condition, and the "mean_contrast" column the mean similarity value of the nouns of the contrast condition.

We start by grouping the corresponding noun combinations together. The noun combination of the similar condition contains the first, the second and the third noun of each sentence. The noun combination of the contrast condition contains the first, the second inanimate and the third noun of each sentence.
We use a list comprehension that iterates over the w1eng, w2eng and w3eng columns and creates a tuple of the form (w1eng[i], w2eng[i], w3eng[i]) for each row, where i refers to the row number. These tuples are stored in a list called "condition_similar_eng". We do the same for the contrast condition, except that the w2eng noun is replaced with the inanimate version w2eng_i. These tuples are stored in a list called "condition_contrast_eng".
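As an alternative sketch, the same row-wise tuples can be built with `zip`, which avoids indexing by row number. Plain lists with the first two rows of the data frame are used here for illustration; `zip` also accepts pandas columns, since it iterates over their values.

```python
# First two rows of the noun columns (taken from the data frame shown above)
w1eng = ["carpenter", "mother"]
w2eng = ["craftsman", "daughter"]
w2eng_i = ["pillar", "gun"]
w3eng = ["peasant", "sister"]

# One tuple per row: (noun 1, noun 2, noun 3)
condition_similar = list(zip(w1eng, w2eng, w3eng))
# Same, with the inanimate second noun
condition_contrast = list(zip(w1eng, w2eng_i, w3eng))
```

Both variants produce the same tuples; `zip` stops at the shortest input, so the columns are expected to have equal length.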

In [81]:
condition_similar_eng=[(w1eng[i],w2eng[i],w3eng[i]) for i in range(0,len(w1eng))]
condition_contrast_eng=[(w1eng[i],w2eng_i[i],w3eng[i]) for i in range(0,len(w1eng))]

In [82]:
df_eng_mean = pd.DataFrame(list(zip(condition_similar_eng,condition_contrast_eng, mean_similar_eng, mean_contrast_eng)),
                          columns =['condition_similar', 'condition_contrast', 'mean_similar','mean_contrast'])
df_eng_mean

Unnamed: 0,condition_similar,condition_contrast,mean_similar,mean_contrast
0,"(carpenter, craftsman, peasant)","(carpenter, pillar, peasant)",0.377595,0.159123
1,"(mother, daughter, sister)","(mother, gun, sister)",0.799481,-0.070793
2,"(worker, tenant, foreman)","(worker, bucket, foreman)",0.385656,0.207514
3,"(trader, businessman, professor)","(trader, computer, professor)",0.6328,0.109468
4,"(painter, musician, father)","(painter, hut, father)",0.418638,0.239972
5,"(saxophonist, trumpeter, conductor)","(saxophonist, baton, conductor)",0.724639,0.338037
6,"(pharmacist, optician, stranger)","(pharmacist, button, stranger)",0.245396,0.150894
7,"(cleaner, janitor, doctor)","(cleaner, ball, doctor)",0.299536,0.202013
8,"(dancer, singer, bystander)","(dancer, shoe, bystander)",0.543995,0.429768
9,"(artist, sportsman, guard)","(artist, computer, guard)",0.283132,0.022971


In [83]:
df_eng_mean.describe()

Unnamed: 0,mean_similar,mean_contrast
count,16.0,16.0
mean,0.467509,0.151981
std,0.176603,0.142525
min,0.245396,-0.070793
25%,0.296355,0.047706
50%,0.440883,0.155009
75%,0.566196,0.217705
max,0.799481,0.429768


In [84]:
df_eng_mean.to_csv('df_SimilarityValues_eng.csv', sep=',', header=False, index=True)

## Coding part for the German sentences

In this section we show the coding for the similarity values that correspond to the German sentences/nouns. Each coding action is explained in detail.

### The word2vec model

In [6]:
trained_model = gensim.models.KeyedVectors.load_word2vec_format('german.model', binary=True)

Now the German nouns of the experimental sentences of Vasishth et al. (2010) are loaded. 
The "nom" column contains the sentence numbers. The "w1ger" column contains the first noun of each German sentence, the "w2ger" column the second noun of the similar condition, the "w3ger" column the third noun, and the "w2ger_i" column the second (inanimate) noun of the contrast condition.
The nouns and their sentence numbers are loaded as a pandas data frame called df_ger (df stands for data frame). The leftmost column can again be ignored.

In [14]:
# load the nouns that correspond to the German experimental sentences
df_ger = pd.read_csv('vasishth_2010_Ger.csv', sep=';')
df_ger

Unnamed: 0,nom,w1ger,w2ger,w3ger,w2ger_i
0,1,Anwalt,Zeuge,Spion,Saebel
1,2,Beamte,Buerokrat,Besucher,Tisch
2,3,Braeutigam,Schwiegervater,Musiker,Bilderrahmen
3,4,Bruder,Cousin,Bauer,Schmuck
4,5,Zauberer,Akrobat,Zuschauer,Hut
5,6,Einbrecher,Dieb,Mann,Stein
6,7,Neurotiker,Exzentriker,Psychiater,Dolch
7,8,Arbeiter,Monteur,Vorarbeiter,Eimer
8,9,Banker,Kreditgeber,Kunde,Geldautomat
9,10,Pianist,Cellist,Hausmeister,Ball


No words are missing from the vocabulary of this model.

Next, we will again calculate the similarity value for each sentence.

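The approach: 
1. we calculate the similarity between noun 1 and noun 2 (or 2_inanimate)
2. we calculate the similarity between noun 2 (or 2_inanimate) and noun 3
3. we take the mean of these two pairwise similarity values as the similarity value of the sentence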

This time the model is fully pretrained, which means that we cannot run the training again. We therefore always get the same word vector for a given word. Consequently, it is not possible to create two German models and take the mean of both; we have to use the one model we have. 
We apply a function that takes one tuple of nouns per row i, with the two nouns coming from two different columns. For our data frame we have the following tuples:
(w1ger[i], w2ger[i])
(w2ger[i], w3ger[i])
(w1ger[i], w2ger_i[i])
(w2ger_i[i], w3ger[i])
We calculate the similarity value for each tuple. The calculated similarity value is directly stored in a new column of the data frame. “w1_w2_sim” contains the similarity value of the tuples (w1ger[i], w2ger[i]). “w2_w3_sim” contains the similarity value of the tuples (w2ger[i], w3ger[i]). “w1_w2_i_sim” contains the similarity value of the tuples (w1ger[i], w2ger_i[i]). “w2_i_w3_sim” contains the similarity value of the tuples (w2ger_i[i], w3ger[i]).

In [11]:
word_dict = trained_model.vocab  # vocabulary of the loaded vectors (gensim 3.x; gensim 4.x uses key_to_index)


In [15]:
df_ger['w1_w2_sim'] = df_ger.apply(lambda row: trained_model.similarity(row.w1ger, row.w2ger), axis=1)


In [16]:
df_ger['w2_w3_sim'] = df_ger.apply(lambda row: trained_model.similarity(row.w2ger, row.w3ger), axis=1)


In [17]:
df_ger['w1_w2_i_sim'] = df_ger.apply(lambda row: trained_model.similarity(row.w1ger, row.w2ger_i), axis=1)


In [18]:
df_ger['w2_i_w3_sim'] = df_ger.apply(lambda row: trained_model.similarity(row.w2ger_i, row.w3ger), axis=1)


In [19]:
df_ger

Unnamed: 0,nom,w1ger,w2ger,w3ger,w2ger_i,w1_w2_sim,w2_w3_sim,w1_w2_i_sim,w2_i_w3_sim
0,1,Anwalt,Zeuge,Spion,Saebel,0.556733,0.423509,0.16913,0.265152
1,2,Beamte,Buerokrat,Besucher,Tisch,0.278928,0.120232,0.199348,0.287564
2,3,Braeutigam,Schwiegervater,Musiker,Bilderrahmen,0.517236,0.309596,0.364921,0.256104
3,4,Bruder,Cousin,Bauer,Schmuck,0.83033,0.420029,0.20292,0.23331
4,5,Zauberer,Akrobat,Zuschauer,Hut,0.585162,0.327576,0.344346,0.272425
5,6,Einbrecher,Dieb,Mann,Stein,0.787015,0.655775,0.271649,0.371261
6,7,Neurotiker,Exzentriker,Psychiater,Dolch,0.49094,0.407227,0.243915,0.209759
7,8,Arbeiter,Monteur,Vorarbeiter,Eimer,0.569527,0.633567,0.37718,0.380822
8,9,Banker,Kreditgeber,Kunde,Geldautomat,0.610209,0.468938,0.354249,0.516649
9,10,Pianist,Cellist,Hausmeister,Ball,0.854085,0.303179,0.247347,0.261414


We again calculate the mean similarity value for all sentences.

In [20]:
#extract relevant columns and create a new data frame
w1_w2_sim = df_ger['w1_w2_sim']
w2_w3_sim = df_ger['w2_w3_sim']
w1ger = df_ger['w1ger']
w2ger = df_ger['w2ger']
w3ger = df_ger['w3ger']

In [22]:
df_ger_123 = pd.DataFrame(list(zip(w1ger, w2ger, w3ger, w1_w2_sim, w2_w3_sim)),
                          columns =["w1ger", "w2ger", "w3ger", "w1_w2_sim", "w2_w3_sim"])

In [23]:
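# average only the two similarity columns (the noun columns are strings)
df_ger_123['mean_similar'] = df_ger_123[['w1_w2_sim', 'w2_w3_sim']].mean(axis=1)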

In [25]:
#extract relevant columns
w2ger_i = df_ger['w2ger_i']
w1_w2_i_sim = df_ger['w1_w2_i_sim']
w2_i_w3_sim = df_ger['w2_i_w3_sim']

In [26]:
df_ger_12i3 = pd.DataFrame(list(zip(w1ger, w2ger_i, w3ger, w1_w2_i_sim, w2_i_w3_sim)),
                          columns =["w1ger", "w2ger_i", "w3ger", "w1_w2_i_sim", "w2_i_w3_sim"])

In [27]:
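# average only the two similarity columns (the noun columns are strings)
df_ger_12i3['mean_contrast'] = df_ger_12i3[['w1_w2_i_sim', 'w2_i_w3_sim']].mean(axis=1)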

In [28]:
mean_similar_ger = df_ger_123[['w1_w2_sim', 'w2_w3_sim']].mean(axis=1)
mean_contrast_ger = df_ger_12i3[['w1_w2_i_sim', 'w2_i_w3_sim']].mean(axis=1)

We again create our data frame that has all the relevant information in it.
For each row we create a tuple of the form (w1ger[i], w2ger[i], w3ger[i]), where i refers to the row number. These tuples are stored in a list called "condition_similar_ger". We do the same for the contrast condition, except that the w2ger noun is replaced with the inanimate version w2ger_i. These tuples are stored in a list called "condition_contrast_ger".

In [29]:
condition_similar_ger=[(w1ger[i],w2ger[i],w3ger[i]) for i in range(0,len(w1ger))]
condition_contrast_ger=[(w1ger[i],w2ger_i[i],w3ger[i]) for i in range(0,len(w1ger))]

In [30]:
df_ger_mean = pd.DataFrame(list(zip(condition_similar_ger,condition_contrast_ger, mean_similar_ger, mean_contrast_ger)),
                          columns =['condition_similar', 'condition_contrast', 'mean_similar','mean_contrast'])
df_ger_mean

Unnamed: 0,condition_similar,condition_contrast,mean_similar,mean_contrast
0,"(Anwalt, Zeuge, Spion)","(Anwalt, Saebel, Spion)",0.490121,0.217141
1,"(Beamte, Buerokrat, Besucher)","(Beamte, Tisch, Besucher)",0.19958,0.243456
2,"(Braeutigam, Schwiegervater, Musiker)","(Braeutigam, Bilderrahmen, Musiker)",0.413416,0.310512
3,"(Bruder, Cousin, Bauer)","(Bruder, Schmuck, Bauer)",0.62518,0.218115
4,"(Zauberer, Akrobat, Zuschauer)","(Zauberer, Hut, Zuschauer)",0.456369,0.308386
5,"(Einbrecher, Dieb, Mann)","(Einbrecher, Stein, Mann)",0.721395,0.321455
6,"(Neurotiker, Exzentriker, Psychiater)","(Neurotiker, Dolch, Psychiater)",0.449083,0.226837
7,"(Arbeiter, Monteur, Vorarbeiter)","(Arbeiter, Eimer, Vorarbeiter)",0.601547,0.379001
8,"(Banker, Kreditgeber, Kunde)","(Banker, Geldautomat, Kunde)",0.539573,0.435449
9,"(Pianist, Cellist, Hausmeister)","(Pianist, Ball, Hausmeister)",0.578632,0.25438


In [31]:
df_ger_mean.describe()

Unnamed: 0,mean_similar,mean_contrast
count,12.0,12.0
mean,0.481907,0.29722
std,0.142981,0.067992
min,0.19958,0.217141
25%,0.404381,0.239301
50%,0.473245,0.309449
75%,0.584361,0.32486
max,0.721395,0.435449


In [32]:
df_ger_mean.to_csv('df_SimilarityValues_ger.csv', sep=',', header=False, index=True)