The goal is to create 150 tuples of nouns with the same animacy (similar group) and 150 tuples of nouns with different animacy (contrast group), to calculate their similarity values, and to export the resulting dataframe for later data analysis.

In [3]:
import csv
import pandas as pd
from gensim.models.word2vec import Text8Corpus
from gensim.models import Word2Vec

First of all, a csv file containing animate and inanimate nouns is loaded. The nouns stem from the Brown corpus. Using nltk, all nouns were extracted from the Brown corpus and randomized. After that, the first 150 animate and the first 150 inanimate nouns were selected. Pseudowords, names, words that could not be clearly assigned to the animate or inanimate category, and words that were not contained in the training dataset of the word2vec model were excluded.

The path points to the location where the csv file is stored.
The file is opened for reading.
Within each line, the items are separated by ';'. Since csv.reader splits on ',' by default, each line arrives as a single field (line[0]), which has to be split at each ';' so that the items become accessible.
All animate and inanimate nouns are extracted and stored in their corresponding lists.

In [4]:
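# Open the noun file; the with-block closes it again after reading
with open("Nouns_prestudy.CSV", "r") as file:
    noun_file = csv.reader(file)

    animate=[]
    inanimate=[]
    for line in noun_file:
        # The whole row is in line[0]; split it into its two ';'-separated items
        splitted_file=line[0].split(sep=";")
        animate.append(splitted_file[0])
        inanimate.append(splitted_file[1])
print(animate, inanimate)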

['Animate', 'guard', 'peoples', 'uncle', 'men', 'workers', 'fish', 'teacher', 'battalion', 'liberal', 'commentators', 'attorney', 'founders', 'beef', 'partner', 'principal', 'victim', 'girls', 'communists', 'coach', 'president', 'governance', 'peasant', 'daughters', 'employee', 'lady', 'master', 'commanders', 'delegates', 'tourists', 'mother', 'parent', 'commissioner', 'bird', 'dolphin', 'viewers', 'editor', 'individual', 'sons', 'scholars', 'chancellor', 'prisoners', 'wife', 'couple', 'lover', 'musicians', 'veterans', 'elite', 'couple', 'commanders', 'dealer', 'swan', 'researcher', 'bully', 'tiger', 'confessor', 'grandmother', 'shopkeeper', 'neighbor', 'militarist', 'supervisor', 'fishermen', 'lifeguard', 'catcher', 'maid', 'dispenser', 'sociologist', 'banker', 'piers', 'landlord', 'sinner', 'boss', 'proprietor', 'council', 'postmaster', 'sister', 'committee', 'eyewitness', 'developer', 'referent', 'boa', 'junior', 'amateur', 'misses', 'husband', 'shipmates', 'senator', 'pilot', 'pick

Since the first item of each list is the column heading, we delete it. We do not want the headings to appear in the tuples later on.

In [4]:
animate.pop(0)
inanimate.pop(0)

'Inanimate'
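
As an alternative to splitting line[0] by hand and popping the headings afterwards, csv.reader accepts a delimiter argument, and next() can consume the header row. The following is only a minimal sketch, not the code used in this pre-study. The in-memory text `sample` is a hypothetical stand-in for Nouns_prestudy.CSV, assuming two ';'-separated columns, and contains just a few rows.

```python
import csv
import io

# Hypothetical stand-in for Nouns_prestudy.CSV (assumed layout: two ';'-separated columns)
sample = "Animate;Inanimate\nguard;suffrage\nuncle;sea\nmen;demonstrations\n"

animate, inanimate = [], []
with io.StringIO(sample) as file:
    noun_file = csv.reader(file, delimiter=";")
    next(noun_file)  # skip the header row, so no pop(0) is needed later
    for animate_noun, inanimate_noun in noun_file:
        animate.append(animate_noun)
        inanimate.append(inanimate_noun)

print(animate, inanimate)
```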

We create our first data frame df which only contains our animate and inanimate items.

In [5]:
df = pd.DataFrame(list(zip(animate, inanimate)),
                          columns =['Animate', 'Inanimate'])
df

Unnamed: 0,Animate,Inanimate
0,guard,suffrage
1,peoples,incest
2,uncle,sea
3,men,demonstrations
4,workers,definition
...,...,...
145,contender,plane
146,farmer,ear
147,descendant,finality
148,dancers,quote


We now want to check whether all nouns are in the training dataset. For nouns that are missing, no similarity values can be calculated.
To do so, we need our word2vec model. We train it on the text8 corpus. For the pre-study we use the same min_count as in our main experiment later (min_count=5).

In [27]:
w2v_model = Word2Vec(
    Text8Corpus(r'C:\Users\marie\Dropbox\Uni_Potsdam\Vasishth\Thesis\Word2Vec\text8'),
    size=100,      # vector dimensionality (gensim < 4.0; renamed to vector_size in gensim >= 4.0)
    window=5,
    min_count=5,   # same value as in the main experiment
    workers=3)

To check whether all words of our two lists are contained in the training dataset, we add two new columns to our dataframe. They contain the Boolean value True when a word is contained in the dataset and False when it is not.
We use a lambda function to check each row of the 'Animate' column and store the result in the new column 'animate_inW2v'. We do the same for the 'Inanimate' column. To be able to see the whole output, we specify that we want to see all 150 rows in the output window.

In [9]:
#Check whether all words are in the training dataset
#(wv.vocab is the gensim < 4.0 API; gensim >= 4.0 uses wv.key_to_index instead)
word_dict = w2v_model.wv.vocab
df['animate_inW2v'] = df.Animate.apply(lambda x: x in word_dict)
df['inanimate_inW2V'] = df.Inanimate.apply(lambda x: x in word_dict)
pd.set_option('display.max_rows', 150)
df

Unnamed: 0,Animate,Inanimate,animate_inW2v,inanimate_inW2V
0,guard,suffrage,True,True
1,peoples,incest,True,True
2,uncle,sea,True,True
3,men,demonstrations,True,True
4,workers,definition,True,True
5,fish,experiment,True,True
6,teacher,cosmology,True,True
7,battalion,canon,True,True
8,liberal,portrait,True,True
9,commentators,listing,True,True


We can see that every word is indeed contained in the training dataset, which means that we can continue.

The next goal is to create a dataframe that contains tuples (pairs of nouns). One column should contain tuples of nouns with the same animacy (similar_tuples) and one column should contain tuples of nouns with different animacy (contrast_tuples).

To create those tuples, we start by extracting the 'Animate' and 'Inanimate' columns from our dataframe df as two separate lists containing the corresponding items.

In [10]:
Animate = list(df['Animate'])
Inanimate = list(df['Inanimate'])

For the similar group, we always want to pair the two items that stand one below the other within each animacy category. Therefore, we split the animate list and the inanimate list each into two lists that contain every second entry (one starts at index 0, the other at index 1). We then concatenate the animate and inanimate parts (a with a and b with b).

In [20]:
# create the similar group: always pair the two items that are one below the other (just within the groups)

similar_list_Animate_a = Animate[0::2]
similar_list_Animate_b = Animate[1::2]
similar_list_Inanimate_a = Inanimate[0::2]
similar_list_Inanimate_b = Inanimate[1::2]

similar_list_a = similar_list_Animate_a+similar_list_Inanimate_a
similar_list_b = similar_list_Animate_b+similar_list_Inanimate_b
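
To make the slicing concrete, here is a small sketch with toy lists. The nouns are the first four rows of the table above; the lists are shortened for illustration, and the names `list_a` and `list_b` are used only in this example.

```python
animate = ["guard", "peoples", "uncle", "men"]
inanimate = ["suffrage", "incest", "sea", "demonstrations"]

# Even positions (0, 2, ...) supply the first word of each pair,
# odd positions (1, 3, ...) supply the second word
list_a = animate[0::2] + inanimate[0::2]
list_b = animate[1::2] + inanimate[1::2]

print(list_a)  # ['guard', 'uncle', 'suffrage', 'sea']
print(list_b)  # ['peoples', 'men', 'incest', 'demonstrations']
```

With 150 nouns per category, this yields 75 animate and 75 inanimate pairs, i.e. 150 similar pairs in total. The scheme relies on each list having an even length; with an odd length, the last noun would have no partner.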

Now that we have the two lists that we need to create our similar tuples, we store them in a new dataframe called df_similar.
Next, we want to calculate the similarity between the two words in each row (word x of list a with word x of list b).
To do so, we assign the similarity function to the variable sim. We then again use a lambda function to apply the similarity function to each row of the dataframe and store the results in a new column called 'simval_similar'.

In [23]:
#Create a pandas dataframe for the similar dataset
df_similar = pd.DataFrame(list(zip(similar_list_a, similar_list_b)),
                          columns =['similar_list_a', 'similar_list_b'])
sim = w2v_model.wv.similarity 
df_similar['simval_similar'] = df_similar.apply(lambda row: sim(row.similar_list_a, row.similar_list_b), axis=1)
df_similar

Unnamed: 0,similar_list_a,similar_list_b,simval_similar
0,guard,peoples,0.072658
1,uncle,men,0.199144
2,workers,fish,0.185286
3,teacher,battalion,0.016067
4,liberal,commentators,0.369242
5,attorney,founders,0.305781
6,beef,partner,0.116828
7,principal,victim,-0.119173
8,girls,communists,0.003628
9,coach,president,0.456646
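
The value returned by wv.similarity is the cosine similarity of the two word vectors, which is why it lies between -1 and 1 and can be negative, as for (principal, victim). The sketch below computes it by hand. The function `cosine_similarity` and the three-dimensional vectors are made up for illustration and are not taken from the trained model, which uses 100 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not taken from the trained model
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 1.0, 0.0]
vec_c = [-1.0, 0.0, -1.0]

sim_ab = cosine_similarity(vec_a, vec_b)  # 0.5 up to floating-point rounding
sim_ac = cosine_similarity(vec_a, vec_c)  # -1.0 up to rounding: opposite direction
print(sim_ab, sim_ac)
```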


We do the same for the contrast tuples.

In [19]:
# create the contrast group: always pair the two items that are in the same row (animate with inanimate)

df_contrast = pd.DataFrame(list(zip(Animate, Inanimate)),
                          columns =['Animate', 'Inanimate'])
df_contrast['simval_contrast'] = df_contrast.apply(lambda row: sim(row.Animate, row.Inanimate), axis=1)
df_contrast

Unnamed: 0,Animate,Inanimate,simval_contrast
0,guard,suffrage,0.007539
1,peoples,incest,0.08135
2,uncle,sea,-0.005888
3,men,demonstrations,0.152362
4,workers,definition,-0.126815
5,fish,experiment,-0.135816
6,teacher,cosmology,0.136031
7,battalion,canon,-0.072913
8,liberal,portrait,0.030814
9,commentators,listing,-0.04476


For the data analysis, we do not want the two words of a tuple in separate columns; we want one column containing all tuples. Therefore, we take for each row the corresponding items of list a and list b and store both words as a tuple (word_a, word_b) in the similar_tuples and contrast_tuples lists.

In [24]:
similar_tuples=[(similar_list_a[i],similar_list_b[i]) for i in range(0,len(similar_list_a))]
contrast_tuples=[(Animate[i],Inanimate[i]) for i in range(0,len(Animate))]
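
The index-based comprehensions can also be written with zip, which pairs the items of two lists position by position. A short sketch with toy lists (the variable names are used only in this example):

```python
list_a = ["guard", "uncle"]
list_b = ["peoples", "men"]

# Index-based version
tuples_by_index = [(list_a[i], list_b[i]) for i in range(len(list_a))]

# Equivalent version with zip
tuples_by_zip = list(zip(list_a, list_b))

print(tuples_by_zip)  # [('guard', 'peoples'), ('uncle', 'men')]
```

The two versions differ only when the lists have different lengths: the index-based version raises an IndexError if the second list is shorter, whereas zip silently stops at the end of the shorter list.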

Now we want to create the final dataframe. It should contain one column with the similar tuples, then a column with the corresponding similarity values, after that a column with the contrast tuples, and again a column with the corresponding similarity values.

To do so, we extract the already calculated similarity values from the similar and the contrast dataframes and create the new dataframe called df_simval with all needed columns.

In [26]:
simval_similar = list(df_similar['simval_similar'])
simval_contrast = list(df_contrast['simval_contrast'])
df_simval = pd.DataFrame(list(zip(similar_tuples, simval_similar, contrast_tuples, simval_contrast)),
                          columns =['Similar_tuples', 'Simval_similar', 'Contrast_tuples', 'Simval_contrast'])
df_simval

Unnamed: 0,Similar_tuples,Simval_similar,Contrast_tuples,Simval_contrast
0,"(guard, peoples)",0.072658,"(guard, suffrage)",0.007539
1,"(uncle, men)",0.199144,"(peoples, incest)",0.08135
2,"(workers, fish)",0.185286,"(uncle, sea)",-0.005888
3,"(teacher, battalion)",0.016067,"(men, demonstrations)",0.152362
4,"(liberal, commentators)",0.369242,"(workers, definition)",-0.126815
5,"(attorney, founders)",0.305781,"(fish, experiment)",-0.135816
6,"(beef, partner)",0.116828,"(teacher, cosmology)",0.136031
7,"(principal, victim)",-0.119173,"(battalion, canon)",-0.072913
8,"(girls, communists)",0.003628,"(liberal, portrait)",0.030814
9,"(coach, president)",0.456646,"(commentators, listing)",-0.04476


Now we export the dataframe to be able to use it for data analysis.

In [43]:
df_simval.to_csv('df_simval.csv', sep=',', header=False, index=True)
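
Because the file is written with header=False, it contains no column names, and the tuples end up as text. The sketch below shows one way such a file could be read back with the standard library. It assumes that each exported row has the shape shown in `exported` (index, quoted tuple text, value, quoted tuple text, value). The actual file was not inspected here, so its content should be checked first. The names `exported`, `columns` and `rows` are used only in this example.

```python
import ast
import csv
import io

# Assumed content of df_simval.csv (first two rows only)
exported = '''0,"('guard', 'peoples')",0.072658,"('guard', 'suffrage')",0.007539
1,"('uncle', 'men')",0.199144,"('peoples', 'incest')",0.08135
'''

# The file has no header row, so the column names are reattached by hand
columns = ["Similar_tuples", "Simval_similar", "Contrast_tuples", "Simval_contrast"]

rows = []
for index, sim_pair, sim_val, con_pair, con_val in csv.reader(io.StringIO(exported)):
    values = (
        ast.literal_eval(sim_pair),  # tuple text -> Python tuple
        float(sim_val),
        ast.literal_eval(con_pair),
        float(con_val),
    )
    rows.append(dict(zip(columns, values)))

print(rows[0])
```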