The goal is to create tuples of nouns with the same animacy (similar group) and with different animacy (contrast group), to calculate a similarity value for each tuple, and to export the resulting dataframe for later data analysis.

In [1]:
import csv
import pandas as pd
from gensim.models.word2vec import Text8Corpus
from gensim.models import Word2Vec

First, a CSV file containing animate and inanimate nouns is loaded. The nouns stem from the Brown corpus: using nltk, all nouns were extracted from the Brown corpus and randomized, and the first 150 animate and the first 150 inanimate nouns were then selected. Pseudowords, names, words that could not be clearly assigned to the animate or inanimate category (the definition of animacy is explained in detail in the prestudy), and words that were not contained in the training dataset of the word2vec model were excluded.

The path points to the location where the CSV file is stored.
The file is opened for reading.
Within each line, the items are separated by ';'. Because csv.reader splits on ',' by default, each line arrives as a single field (index 0), which has to be split at every ';' to make the items accessible.
All animate and inanimate nouns are extracted and stored in their corresponding lists.

In [2]:
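animate=[]
inanimate=[]
# Open the file in a with-block so that it is closed again after reading
with open("Nouns_studyProposal.CSV", "r") as file:
    noun_file = csv.reader(file)
    for line in noun_file:
        # each row arrives as one field; split it at ';' into the two nouns
        splitted_file=line[0].split(sep=";")
        animate.append(splitted_file[0])
        inanimate.append(splitted_file[1])
print(animate, inanimate)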

['Animate', 'guard', 'peoples', 'uncle', 'men', 'workers', 'fish', 'teacher', 'battalion', 'liberal', 'commentators', 'attorney', 'founders', 'beef', 'partner', 'principal', 'victim', 'girls', 'communists', 'coach', 'president', 'governance', 'peasant', 'daughters', 'employee', 'lady', 'master', 'commanders', 'delegates', 'tourists', 'mother', 'parent', 'commissioner', 'bird', 'dolphin', 'viewers', 'editor', 'individual', 'sons', 'scholars', 'chancellor', 'prisoners', 'wife', 'couple', 'lover', 'musicians', 'veterans', 'elite', 'couple', 'commanders', 'dealer', 'swan', 'researcher', 'bully', 'tiger', 'confessor', 'grandmother', 'shopkeeper', 'neighbor', 'militarist', 'supervisor', 'fishermen', 'lifeguard', 'catcher', 'maid', 'dispenser', 'sociologist', 'banker', 'piers', 'landlord', 'sinner', 'boss', 'proprietor', 'council', 'postmaster', 'sister', 'committee', 'eyewitness', 'developer', 'referent', 'boa', 'junior', 'amateur', 'misses', 'husband', 'shipmates', 'senator', 'pilot', 'pick
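
As a side note, the same result can probably be reached by telling csv.reader about the separator directly. The following is only a minimal sketch: the in-memory sample text and the names `sample`, `animate_demo` and `inanimate_demo` are hypothetical stand-ins, not the real noun file.

```python
import csv
import io

# Hypothetical stand-in for the noun file: one heading row, ';' as separator.
sample = io.StringIO("Animate;Inanimate\nguard;suffrage\nuncle;sea\n")

# With delimiter=";" every row is already split into its two items.
reader = csv.reader(sample, delimiter=";")
header = next(reader)  # consume the heading row up front

animate_demo, inanimate_demo = [], []
for row in reader:
    animate_demo.append(row[0])
    inanimate_demo.append(row[1])

print(header)          # ['Animate', 'Inanimate']
print(animate_demo)    # ['guard', 'uncle']
print(inanimate_demo)  # ['suffrage', 'sea']
```

Reading the heading row with `next(reader)` keeps it out of the noun lists from the start.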

Since the first item of each list is the column heading, we remove it. The headings should not end up in the tuples later on.

In [3]:
animate.pop(0)
inanimate.pop(0)

'Inanimate'

We create our first data frame df which only contains our animate and inanimate items.

In [4]:
df = pd.DataFrame(list(zip(animate, inanimate)),
                          columns =['Animate', 'Inanimate'])
df

Unnamed: 0,Animate,Inanimate
0,guard,suffrage
1,peoples,incest
2,uncle,sea
3,men,demonstrations
4,workers,definition
...,...,...
145,contender,plane
146,farmer,ear
147,descendant,finality
148,dancers,quote


We now want to check whether all nouns are in the training dataset, because no similarity value can be calculated for a missing noun.
To do so, we need our word2vec model. We use word vectors trained on the text8 corpus. For the prestudy we use the same min_count as in our main experiment later (min_count=5).

In [5]:
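# Train word2vec on the text8 corpus.
# Note: `size` is the gensim 3.x parameter name; gensim 4.x calls it `vector_size`.
w2v_model = Word2Vec(
    Text8Corpus("text8"),
    size=100,
    window=5,
    min_count=5,
    workers=3)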

To check whether all words in our two lists are contained in the training dataset, we add two new columns to our dataframe that contain the Boolean value True when a word is contained in the dataset and False when it is not.
We use a lambda function to check each row of the Animate column and store the result in the new column 'animate_inW2v'. We do the same for the Inanimate column. To be able to see the whole output, we set the display option so that all 150 rows are shown.

In [6]:
#Check whether all words are in the training dataset
# gensim 3.x: wv.vocab holds the vocabulary (in gensim 4.x use wv.key_to_index instead)
word_dict = w2v_model.wv.vocab
df['animate_inW2v'] = df.Animate.apply(lambda x: x in word_dict)
df['inanimate_inW2V'] = df.Inanimate.apply(lambda x: x in word_dict)
pd.set_option('display.max_rows', 150)
df

Unnamed: 0,Animate,Inanimate,animate_inW2v,inanimate_inW2V
0,guard,suffrage,True,True
1,peoples,incest,True,True
2,uncle,sea,True,True
3,men,demonstrations,True,True
4,workers,definition,True,True
5,fish,experiment,True,True
6,teacher,cosmology,True,True
7,battalion,canon,True,True
8,liberal,portrait,True,True
9,commentators,listing,True,True
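
The membership test itself does not depend on pandas. As a minimal sketch, assuming a small hypothetical vocabulary (`toy_vocab`) in place of the model's real one, missing words could also be listed directly:

```python
# Hypothetical toy vocabulary standing in for the model's vocabulary.
toy_vocab = {"guard", "uncle", "sea", "fish"}
nouns = ["guard", "suffrage", "sea", "blorp"]

# One Boolean per noun, analogous to the two new dataframe columns.
in_vocab = [word in toy_vocab for word in nouns]

# The nouns for which no similarity value could be calculated.
missing = [word for word in nouns if word not in toy_vocab]

print(in_vocab)  # [True, False, True, False]
print(missing)   # ['suffrage', 'blorp']
```

An empty `missing` list would correspond to a dataframe in which both check columns contain only True.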


We can see that every word is indeed contained in the training dataset, so we can continue.

The next goal is to create a dataframe that contains tuples (pairs of nouns). One column should contain tuples of nouns with the same animacy (similar_tuples) and one column should contain tuples of nouns with different animacy (contrast_tuples).

To create those tuples, we start by extracting the 'Animate' and 'Inanimate' columns from our dataframe df into two separate lists containing the corresponding items.

In [7]:
Animate = list(df['Animate'])
Inanimate = list(df['Inanimate'])

Each tuple of the form (a,b), in both the similar and the contrast group, has a target item ‘a’ in first position. This target item ‘a’ is compared with another item ‘b’, where ‘b’ is either an item of the same animacy as ‘a’ (similar group) or an item of different animacy (contrast group).

To create the tuples for the similar group, every fourth word, starting at the first item of the animate and the inanimate list, is stored in a separate list ‘list_a’. All items of ‘list_a’ are target items and stand in first position in the tuples of both the similar and the contrast group.
Every fourth word, starting at the third word of the animate and the inanimate list, is stored in another list ‘similar_list_b’. The tuples of the similar group have the format (list_a[i], similar_list_b[i]), where i is the row number in both lists. For each tuple a similarity value is calculated.

In [9]:
# create the similar group: pair each target item (every fourth word, starting at
# the first) with the word two positions below it in the same list

list_Animate_a = Animate[0::4]
similar_list_Animate_b = Animate[2::4]
list_Inanimate_a = Inanimate[0::4]
similar_list_Inanimate_b = Inanimate[2::4]

# animate items first, then inanimate items
list_a = list_Animate_a + list_Inanimate_a
# cut the concatenated target list to 74 items, the length of similar_list_b
list_a = list_a[0:74]
similar_list_b = similar_list_Animate_b + similar_list_Inanimate_b
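
To make the step slicing easier to follow, here is a small sketch on a hypothetical nine-item list (`words`); the item names only encode their index and are not taken from the noun file.

```python
# Hypothetical list whose items are named after their index.
words = ["w0", "w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8"]

targets = words[0::4]   # every fourth word, starting at index 0
partners = words[2::4]  # every fourth word, starting at index 2

print(targets)   # ['w0', 'w4', 'w8']
print(partners)  # ['w2', 'w6']

# zip stops at the shorter list, so the last target stays unpaired.
pairs = list(zip(targets, partners))
print(pairs)     # [('w0', 'w2'), ('w4', 'w6')]
```

Depending on the length of the source list, the target slice can be one item longer than the partner slice, as it is here.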

In [11]:
len(list_a)

74

In [12]:
len(similar_list_b)

74
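
When the sliced lists differ in length, it can matter whether the target list is shortened before or after the animate and inanimate halves are joined. The sketch below illustrates this on hypothetical five-item toy lists; whether the real lists are affected depends on their lengths, and the helper `same_group` is an assumption made only for this illustration.

```python
# Hypothetical toy lists; the first letter encodes the group (a = animate, i = inanimate).
animate_toy = ["a0", "a1", "a2", "a3", "a4"]
inanimate_toy = ["i0", "i1", "i2", "i3", "i4"]

a_targets, a_partners = animate_toy[0::4], animate_toy[2::4]      # ['a0', 'a4'], ['a2']
i_targets, i_partners = inanimate_toy[0::4], inanimate_toy[2::4]  # ['i0', 'i4'], ['i2']
partners = a_partners + i_partners                                # ['a2', 'i2']

# Variant 1: join first, then trim to the length of the partner list.
late_trim = (a_targets + i_targets)[:len(partners)]
pairs_late = list(zip(late_trim, partners))

# Variant 2: trim each group to its partner list, then join.
early_trim = a_targets[:len(a_partners)] + i_targets[:len(i_partners)]
pairs_early = list(zip(early_trim, partners))

def same_group(pair):
    """Return True if both items of the pair start with the same group letter."""
    return pair[0][0] == pair[1][0]

print(pairs_late)   # [('a0', 'a2'), ('a4', 'i2')]
print(pairs_early)  # [('a0', 'a2'), ('i0', 'i2')]
print(all(same_group(p) for p in pairs_late))   # False
print(all(same_group(p) for p in pairs_early))  # True
```

A check along the lines of `same_group` is a cheap way to confirm that every tuple in the similar group really has matching animacy.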

Now that we have the two lists needed for our similar tuples, we store them in a new dataframe called df_similar.
Next, we want to calculate the similarity between the two words in each row (word x of list a with word x of list b).
To do so, we assign the similarity function to the variable sim. We then again use a lambda function to apply the similarity function to each row of the dataframe and store the results in a new column called 'simval_similar'.

In [13]:
#Create a pandas dataframe for the similar dataset
df_similar = pd.DataFrame(list(zip(list_a, similar_list_b)),
                          columns =['list_a', 'similar_list_b'])
sim = w2v_model.wv.similarity 
df_similar['simval_similar'] = df_similar.apply(lambda row: sim(row.list_a, row.similar_list_b), axis=1)
df_similar

Unnamed: 0,list_a,similar_list_b,simval_similar
0,guard,uncle,0.18277
1,workers,teacher,0.001084
2,liberal,attorney,0.242627
3,beef,principal,0.146274
4,girls,coach,0.340212
5,governance,daughters,-0.098016
6,lady,commanders,0.081941
7,tourists,parent,0.028724
8,bird,viewers,0.045432
9,individual,scholars,0.045291
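
The similarity value returned by the model's similarity function is the cosine similarity of the two word vectors, which is why it lies between -1 and 1 and can be negative. A minimal pure-Python sketch of that measure follows; the two-dimensional vectors `x`, `y` and `z` are hypothetical and far smaller than the 100-dimensional vectors of the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors.
x = [1.0, 0.0]
y = [1.0, 1.0]
z = [-1.0, 0.0]

print(cosine_similarity(x, y))  # about 0.7071 (45 degree angle)
print(cosine_similarity(x, z))  # -1.0 (opposite directions)
```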


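We do the same for the contrast tuples: each target item in list_a is paired with an item from the list of the opposite animacy, taken as every fourth word starting at the second word (contrast_list_b).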

In [14]:
# create the contrast group: pair each target item with the item one row below it in the other list (animate with inanimate and vice versa)

Inanimate_contrast_list_b = Inanimate[1::4]
Animate_contrast_list_b = Animate[1::4]

contrast_list_b = Inanimate_contrast_list_b+Animate_contrast_list_b

df_contrast = pd.DataFrame(list(zip(list_a, contrast_list_b)),
                          columns =['list_a', 'contrast_list_b'])
df_contrast['simval_contrast'] = df_contrast.apply(lambda row: sim(row.list_a, row.contrast_list_b), axis=1)
df_contrast

Unnamed: 0,list_a,contrast_list_b,simval_contrast
0,guard,incest,-0.146781
1,workers,experiment,-0.077382
2,liberal,listing,-0.161327
3,beef,studio,0.103131
4,girls,convention,-0.255836
5,governance,boat,-0.1349
6,lady,factors,-0.220688
7,tourists,switches,0.029916
8,bird,needs,-0.017414
9,individual,hotel,-0.205426
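
The contrast pairing can be sketched on hypothetical eight-item toy lists, where the first letter of each item encodes its group and the number its index; none of these items come from the noun file.

```python
# Hypothetical toy lists (a = animate, i = inanimate, number = index).
animate_toy = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
inanimate_toy = ["i0", "i1", "i2", "i3", "i4", "i5", "i6", "i7"]

# Targets: every fourth word starting at index 0, animate first.
targets = animate_toy[0::4] + inanimate_toy[0::4]

# Partners: every fourth word starting at index 1, taken from the
# opposite list, so the order is inanimate first.
contrast_partners = inanimate_toy[1::4] + animate_toy[1::4]

contrast_pairs = list(zip(targets, contrast_partners))
print(contrast_pairs)
# [('a0', 'i1'), ('a4', 'i5'), ('i0', 'a1'), ('i4', 'a5')]
```

Swapping the order of the two halves in `contrast_partners` is what turns the same-row concatenation into a cross-animacy pairing.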


For the data analysis we do not want the words that belong to one tuple in separate columns. We want one column for all tuples of the similar group and one for all tuples of the contrast group. Therefore, we create tuples of the format (list_a[i], similar_list_b[i]) for the similar group and tuples of the format (list_a[i], contrast_list_b[i]) for the contrast group.

In [75]:
# list_a holds the target items for both groups; zip pairs the lists row by row
similar_tuples = list(zip(list_a, similar_list_b))

contrast_tuples = list(zip(list_a, contrast_list_b))

Now we want to create the final dataframe. It should contain one column with the similar tuples, then a column with the corresponding similarity values, followed by a column with the contrast tuples and again the corresponding similarity values.

To do so, we extract the already calculated similarity values from the similar and the contrast dataframe and create the new dataframe called df_simval with all needed columns.

In [78]:
simval_similar = list(df_similar['simval_similar'])
simval_contrast = list(df_contrast['simval_contrast'])
df_simval = pd.DataFrame(list(zip(similar_tuples, simval_similar, contrast_tuples, simval_contrast)),
                          columns =['Similar_tuples', 'Simval_similar', 'Contrast_tuples', 'Simval_contrast'])
df_simval

Unnamed: 0,Similar_tuples,Simval_similar,Contrast_tuples,Simval_contrast
0,"(guard, uncle)",0.171724,"(guard, incest)",-0.119413
1,"(workers, teacher)",0.044274,"(workers, experiment)",-0.099778
2,"(liberal, attorney)",0.234067,"(liberal, listing)",-0.101899
3,"(beef, principal)",0.138217,"(beef, studio)",0.037278
4,"(girls, coach)",0.293362,"(girls, convention)",-0.264332
5,"(governance, daughters)",-0.098058,"(governance, boat)",-0.177869
6,"(lady, commanders)",0.124951,"(lady, factors)",-0.191033
7,"(tourists, parent)",0.005909,"(tourists, switches)",0.045088
8,"(bird, viewers)",0.068633,"(bird, needs)",-0.015839
9,"(individual, scholars)",0.010562,"(individual, hotel)",-0.214935


Now we export the dataframe to be able to use it for data analysis.

In [79]:
df_simval.to_csv('df_simval.csv', sep=',', header=False, index=True)
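
When the exported file is read again for the analysis, two things are worth keeping in mind: it has no header row (header=False), and the tuples are stored as text. The sketch below imitates such a file in memory with the csv module instead of using the real export; the row values and the names `buffer`, `fields` and `similar_tuple` are hypothetical.

```python
import ast
import csv
import io

# Imitate one exported row: index, similar tuple, value, contrast tuple, value.
# The tuples are written as their string representation.
buffer = io.StringIO()
csv.writer(buffer).writerow([0, ("w0", "w2"), 0.5, ("w0", "i1"), -0.25])
buffer.seek(0)

# There is no header row, so the columns are addressed by position.
fields = next(csv.reader(buffer))
print(fields)  # ['0', "('w0', 'w2')", '0.5', "('w0', 'i1')", '-0.25']

# ast.literal_eval turns the stored text back into a real tuple.
similar_tuple = ast.literal_eval(fields[1])
simval_similar = float(fields[2])
print(similar_tuple, simval_similar)  # ('w0', 'w2') 0.5
```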