# Progress Report 2

- Author: Teresa Davison
- Email: tid30@pitt.edu
- Course: LING 1340 taught by Dr. Na-Rae Han
- Date: Mar 21, 2024

This is a continuation from progress report 1.

## Load Data

In [1]:
#load necessary packages
import pickle
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt

In [130]:
#Load the dataframes from the pickle object (in my private directory)
l_df = pd.read_pickle("private/l_df.pkl")
p_df = pd.read_pickle("private/p_df.pkl")
f_df = pd.read_pickle("private/f_df.pkl")

## Work with one language
Here I am working out the workflow and a good way to represent colexifications within a single language. I will start with German, since I am familiar with the language.

In [3]:
l_df[l_df.Name == 'German']

Unnamed: 0,ID,Name,Glottocode,Glottolog_Name,ISO639P3code,Macroarea,Latitude,Longitude,Family,dataset_ID,...,Type,Date,Transcriber,Short_Name,Region,Culture,Description,Orthography,Comment,Family_factor
396,194,German,stan1295,German,deu,Eurasia,48.649,12.4676,Indo-European,ids,...,,,,,,,,,,Indo-European
747,deu,German,stan1295,German,deu,Eurasia,48.649,12.4676,Indo-European,northeuralex,...,,,,,,,,,,Indo-European
1525,41700,German,stan1295,German,ger,Eurasia,48.649,12.4676,Indo-European,diacl,...,,,,,,,,,,Indo-European


I'm going to clean up the form df: add the Glottocode and a Colex_Key column, and drop rows with a NaN value in the Concepticon_ID column.

In [110]:
#Add the Glottocode for each language to the form dataframe
f_df = pd.merge(l_df[['ID', 'Glottocode']], f_df, left_on=['ID'], right_on=['Language_ID'], how='right')
#f_df = f_df.drop(columns=['Glottocode_x'])
#f_df = f_df.rename(columns={"Glottocode_y": "Glottocode"})


#Lowercase the values because German capitalizes nouns but not other word categories
f_df['Value'] = f_df['Value'].str.lower()
#Build the Colex key from Glottocode + value so it isn't dependent on the dataset the data is originally from
f_df["Colex_Key"] = f_df['Glottocode'] +'_'+ f_df['Value']
f_df = f_df.dropna(subset=["Concepticon_ID"])

In [97]:
f_df.head()

Unnamed: 0,ID_x,ID_y,Glottocode,Concepticon_ID,dataset_ID,ID_y.1,Local_ID,Language_ID,Parameter_ID,Value,...,effect,contact_situation,original_script,diacl_id,meaning,meaning_note,transliteration,ipa,Local_Orthography,Colex_Key
0,Venetianstd,Venetianstd,vene1258,1692,logos,Venetianstd-350_friday-1,,Venetianstd,350_friday,divendres,...,,,,,,,,,,vene1258_divendres
1,Hindistd,Hindistd,hind1269,1692,logos,Hindistd-350_friday-1,,Hindistd,350_friday,शुक्रवार,...,,,,,,,,,,hind1269_शुक्रवार
2,Romagnolstd,Romagnolstd,roma1328,1692,logos,Romagnolstd-350_friday-1,,Romagnolstd,350_friday,vèner,...,,,,,,,,,,roma1328_vèner
3,Latinstd,Latinstd,lati1261,1692,logos,Latinstd-350_friday-1,,Latinstd,350_friday,Venerisdies,...,,,,,,,,,,lati1261_venerisdies
4,Galicianstd,Galicianstd,gali1258,1692,logos,Galicianstd-350_friday-1,,Galicianstd,350_friday,venres,...,,,,,,,,,,gali1258_venres


In [98]:
#Clean up columns a little
f_df = f_df.drop(columns=['ID_x'])

In [99]:
f_df = f_df.rename(columns={"ID_y": "ID"})
f_df.head()

Unnamed: 0,ID,Glottocode,Concepticon_ID,dataset_ID,ID.1,Local_ID,Language_ID,Parameter_ID,Value,Form,...,effect,contact_situation,original_script,diacl_id,meaning,meaning_note,transliteration,ipa,Local_Orthography,Colex_Key
0,Venetianstd,vene1258,1692,logos,Venetianstd-350_friday-1,,Venetianstd,350_friday,divendres,divendres,...,,,,,,,,,,vene1258_divendres
1,Hindistd,hind1269,1692,logos,Hindistd-350_friday-1,,Hindistd,350_friday,शुक्रवार,शुक्रवार,...,,,,,,,,,,hind1269_शुक्रवार
2,Romagnolstd,roma1328,1692,logos,Romagnolstd-350_friday-1,,Romagnolstd,350_friday,vèner,vèner,...,,,,,,,,,,roma1328_vèner
3,Latinstd,lati1261,1692,logos,Latinstd-350_friday-1,,Latinstd,350_friday,Venerisdies,venerisdies,...,,,,,,,,,,lati1261_venerisdies
4,Galicianstd,gali1258,1692,logos,Galicianstd-350_friday-1,,Galicianstd,350_friday,venres,venres,...,,,,,,,,,,gali1258_venres


Now I will make a dataframe of all of the forms for German.

In [30]:
german_colex = pd.DataFrame(f_df.loc[f_df.Glottocode == 'stan1295',['Colex_Key', 'dataset_ID', 'Form']])
#drop duplicates of the same form
german_colex = german_colex.drop_duplicates(subset=["Colex_Key"])
german_colex.head()

Unnamed: 0,Colex_Key,dataset_ID,Form
102856,stan1295_oft,ids,oft
102857,stan1295_häufig,ids,häufig
103202,stan1295_immer,ids,immer
103203,stan1295_stets,ids,stets
103567,stan1295_bereit,ids,bereit


In [31]:
german_colex.shape

(2917, 3)

There are 2917 unique forms. Next, collect the set of unique concepts associated with each form (this takes a while).

In [37]:
#took a couple minutes
german_colex['Colex_IDs'] = german_colex.Colex_Key.map(lambda x: set(f_df.Concepticon_ID[f_df.Colex_Key == x].to_list()))
german_colex.head()                                                       

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts
102856,stan1295_oft,ids,oft,{1852},1
102857,stan1295_häufig,ids,häufig,{1852},1
103202,stan1295_immer,ids,immer,{1676},1
103203,stan1295_stets,ids,stets,{1676},1
103567,stan1295_bereit,ids,bereit,{790},1
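
The lookup above rescans all of f_df once per form, which is probably why it takes minutes. A possible faster route, sketched here on a made-up toy frame (`toy_f_df` and its rows are hypothetical stand-ins for f_df, not real data), is to group by Colex_Key once and then map the result back onto the forms:

```python
import pandas as pd

# Toy stand-in for f_df (hypothetical rows; the real frame has many more columns)
toy_f_df = pd.DataFrame({
    "Colex_Key": ["stan1295_boden", "stan1295_boden", "stan1295_oft", "stan1295_boden"],
    "Concepticon_ID": [1632, 690, 1852, 1632],
})

# One pass over the frame: collect the set of concept IDs for every key
ids_by_key = toy_f_df.groupby("Colex_Key")["Concepticon_ID"].apply(set)

# One row per form, then look up its concept set by key
toy_forms = toy_f_df[["Colex_Key"]].drop_duplicates().reset_index(drop=True)
toy_forms["Colex_IDs"] = toy_forms["Colex_Key"].map(ids_by_key)
toy_forms["Num_concepts"] = toy_forms["Colex_IDs"].map(len)
print(toy_forms)
```

I have not timed this against the real data, so treat the speed-up as an expectation rather than a measurement.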


In [38]:
#count the concepts in Colex_IDs for each form
#many forms don't have colexifications but some have more than 2 colexified concepts!
german_colex['Num_concepts'] = german_colex.Colex_IDs.map(len)
german_colex.Num_concepts.value_counts()

Num_concepts
1    2767
2     136
3      12
5       1
4       1
Name: count, dtype: int64

In [39]:
#choose only the word forms that are actually colexified (num_concepts > 1)
german_co = pd.DataFrame(german_colex[german_colex.Num_concepts > 1])
#sort df and reset index
german_co = german_co.sort_values('Colex_Key')
german_co = german_co.reset_index(drop=True)
german_co.head()

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts
0,stan1295_Achsel,ids,Achsel,"{1482, 1769}",2
1,stan1295_Alter,ids,Alter,"{406, 1789}",2
2,stan1295_Base,ids,Base,"{1272, 1643}",2
3,stan1295_Bauch,ids,Bauch,"{2114, 819}",2
4,stan1295_Berg,ids,Berg,"{639, 2118}",2


In [40]:
german_co[german_co.Form == 'Fleisch']

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts
22,stan1295_Fleisch,ids,Fleisch,"{2615, 2259, 634}",3


Here we have a dataframe with the IDs for each concept that corresponds to the particular form in German. It's hard to interpret those IDs though, so let's add the gloss.

In [46]:
german_co['Concept_names'] = german_co.Colex_IDs.map(lambda x: 
                                                     [p_df.Concepticon_Gloss[(p_df.Concepticon_ID == y)].iloc[0] for y in x])
                                                                                        

In [47]:
german_co.head()

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names
0,stan1295_Achsel,ids,Achsel,"{1482, 1769}",2,"[SHOULDER, AXLE]"
1,stan1295_Alter,ids,Alter,"{406, 1789}",2,"[OLD MAN, AGE]"
2,stan1295_Base,ids,Base,"{1272, 1643}",2,"[AUNT, COUSIN]"
3,stan1295_Bauch,ids,Bauch,"{2114, 819}",2,"[BELLY OR STOMACH, STOMACH]"
4,stan1295_Berg,ids,Berg,"{639, 2118}",2,"[MOUNTAIN, MOUNTAIN OR HILL]"


In [48]:
#Let's take a look at the ones with more than two concepts
german_co[german_co.Num_concepts > 2]

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names
7,stan1295_Boden,ids,Boden,"{1632, 690, 1228}",3,"[FLOOR, BOTTOM, EARTH (SOIL)]"
16,stan1295_Farbe,ids,Farbe,"{1613, 647, 157}",3,"[COLOR, PAINT (PIGMENTS), DYE]"
19,stan1295_Fell,ids,Fell,"{763, 189, 629}",3,"[SKIN, HAIR (BODY), LEATHER]"
22,stan1295_Fleisch,ids,Fleisch,"{2615, 2259, 634}",3,"[FLESH OR MEAT, FLESH, MEAT]"
30,stan1295_Haut,ids,Haut,"{580, 763, 629, 2613}",4,"[FUR, SKIN, LEATHER, SKIN (HUMAN)]"
35,stan1295_Jahr,ids,Jahr,"{1276, 1318, 1226}",3,"[AUTUMN, CHICKEN, YEAR]"
50,stan1295_Oheim,ids,Oheim,"{1985, 1327, 1984}",3,"[FATHER'S BROTHER, UNCLE, MOTHER'S BROTHER]"
51,stan1295_Onkel,ids,Onkel,"{1985, 1327, 1984}",3,"[FATHER'S BROTHER, UNCLE, MOTHER'S BROTHER]"
60,stan1295_Schere,ids,Schere,"{356, 341, 3327}",3,"[SHEARS, SICKLE, AUTUMN OR HARVEST SEASON]"
61,stan1295_Schnake,ids,Schnake,"{1509, 730, 95}",3,"[MOSQUITO, SNAKE, SANDFLY]"


In [50]:
#Add the semantic field for each concept
german_co['Semantic_field'] = german_co.Colex_IDs.map(lambda x: 
                                                     [p_df.Semantic_Field[(p_df.Concepticon_ID == y)].iloc[0] for y in x])
german_co.tail()                                                                  

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field
145,stan1295_weiblich,ids,weiblich,"{2260, 2262}",2,"[FEMALE (OF PERSON), FEMALE (OF ANIMAL)]","[Kinship, Animals]"
146,stan1295_wieviel?,ids,wieviel?,"{531, 3450}",2,"[HOW MUCH, HOW MANY PIECES]","[Cognition, Cognition]"
147,stan1295_wir,ids,wir,"{1131, 1212, 1130}",3,"[WE (INCLUSIVE), WE, WE (EXCLUSIVE)]","[Kinship, Kinship, Kinship]"
148,stan1295_ɛsən,northeuralex,ɛsən,"{1336, 1526}",2,"[EAT, MEAL]","[Food and drink, Food and drink]"
149,stan1295_ʃaɪ̯nən,northeuralex,ʃaɪ̯nən,"{1521, 708}",2,"[SEEM, SHINE]","[Cognition, Sense perception]"


In [52]:
#I may get rid of the northeuralex or any other dataset that puts the form in IPA because
#it's harder to get rid of duplicates that way
german_co[german_co.Form == 'scheinen']

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field
129,stan1295_scheinen,ids,scheinen,"{1521, 708}",2,"[SEEM, SHINE]","[Cognition, Sense perception]"


I fixed this problem by building the Colex_Key from the Value column rather than the Form column, because only the Form column is sometimes written in IPA.

In [54]:
#Semantic field may not match for colexification, ontological category probably does though
#Let's add a column for it
german_co['Ontological_category'] = german_co.Colex_IDs.map(lambda x: 
                                                     [p_df.Ontological_Category[(p_df.Concepticon_ID == y)].iloc[0] for y in x])
german_co.head()    

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field,Ontological_category
0,stan1295_Achsel,ids,Achsel,"{1482, 1769}",2,"[SHOULDER, AXLE]","[The body, Motion]","[Person/Thing, Person/Thing]"
1,stan1295_Alter,ids,Alter,"{406, 1789}",2,"[OLD MAN, AGE]","[Kinship, Time]","[Person/Thing, Person/Thing]"
2,stan1295_Base,ids,Base,"{1272, 1643}",2,"[AUNT, COUSIN]","[Kinship, Kinship]","[Person/Thing, Person/Thing]"
3,stan1295_Bauch,ids,Bauch,"{2114, 819}",2,"[BELLY OR STOMACH, STOMACH]","[The body, The body]","[Person/Thing, Person/Thing]"
4,stan1295_Berg,ids,Berg,"{639, 2118}",2,"[MOUNTAIN, MOUNTAIN OR HILL]","[The physical world, The physical world]","[Person/Thing, Person/Thing]"


In [55]:
german_co.tail()  

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field,Ontological_category
145,stan1295_weiblich,ids,weiblich,"{2260, 2262}",2,"[FEMALE (OF PERSON), FEMALE (OF ANIMAL)]","[Kinship, Animals]","[Property, Property]"
146,stan1295_wieviel?,ids,wieviel?,"{531, 3450}",2,"[HOW MUCH, HOW MANY PIECES]","[Cognition, Cognition]","[Other, Other]"
147,stan1295_wir,ids,wir,"{1131, 1212, 1130}",3,"[WE (INCLUSIVE), WE, WE (EXCLUSIVE)]","[Kinship, Kinship, Kinship]","[Other, Other, Other]"
148,stan1295_ɛsən,northeuralex,ɛsən,"{1336, 1526}",2,"[EAT, MEAL]","[Food and drink, Food and drink]","[Action/Process, Person/Thing]"
149,stan1295_ʃaɪ̯nən,northeuralex,ʃaɪ̯nən,"{1521, 708}",2,"[SEEM, SHINE]","[Cognition, Sense perception]","[Action/Process, Action/Process]"


Here, in the final version of the dataframe for German, we have each form that has colexified meanings, the concepts being colexified (as IDs and as glosses), the number of concepts colexified for that form, and the semantic field and ontological category of each colexified concept.

The semantic field has more categories, but with the ontological category we can see when nouns are colexified with verbs, as with 'ɛsən' (essen/Essen). In German, nouns are capitalized while verbs are not, so in the future I will make sure to lowercase the form before making the Colex_Key to standardize for this.

In [73]:
#only use the forms with exactly two colexified concepts (136 of them)
#count how many of those have the same semantic field for both concepts
german_co2 = german_co[german_co.Num_concepts == 2]
print(german_co2.shape)
match_sem = sum(german_co2.Semantic_field.map(lambda x: x[0] == x[1]))
match_sem

(136, 8)


82

In [72]:
#out of 136
match_ont = sum(german_co2.Ontological_category.map(lambda x: x[0] == x[1]))
match_ont

123
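
To make the two match counts easier to compare, they can be turned into shares of the 136 two-concept colexifications. This is only arithmetic on the numbers printed above; the variable names `n_pairs`, `share_sem`, and `share_ont` are my own for illustration.

```python
# Counts reported above for the two-concept German colexifications
n_pairs = 136
match_sem = 82   # both concepts share a semantic field
match_ont = 123  # both concepts share an ontological category

share_sem = match_sem / n_pairs
share_ont = match_ont / n_pairs
print(f"same semantic field: {share_sem:.1%}")
print(f"same ontological category: {share_ont:.1%}")
```

So roughly 60% of the pairs stay within one semantic field, while about 90% stay within one ontological category.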

In [76]:
#I'm going to look at the counts for each combination of semantic fields (only using colexifications with two concepts)
from collections import defaultdict
sem_counts = defaultdict(int)
for colex in german_co2.Semantic_field.to_list(): 
    #sort it alphabetically
    colex.sort()
    sem_counts[colex[0] + ':' + colex[1]] += 1
sem_counts

defaultdict(int,
            {'Motion:The body': 1,
             'Kinship:Time': 1,
             'Kinship:Kinship': 14,
             'The body:The body': 5,
             'The physical world:The physical world': 6,
             'Agriculture and vegetation:Agriculture and vegetation': 7,
             'Animals:Animals': 13,
             'The house:Warfare and hunting': 1,
             'Basic actions and technology:Basic actions and technology': 3,
             'Spatial relations:Time': 2,
             'Speech and language:The body': 3,
             'Agriculture and vegetation:The physical world': 2,
             'Agriculture and vegetation:Food and drink': 1,
             'Cognition:Religion and belief': 1,
             'Agriculture and vegetation:The body': 4,
             'Religion and belief:The physical world': 1,
             'The house:The house': 1,
             'Basic actions and technology:The physical world': 1,
             'Social and political relations:The physical world': 1

In [77]:
#make sure they're all there
sum(sem_counts.values())

136

In [85]:
#get the top five combos
sorted(sem_counts, key=sem_counts.get, reverse=True)[:5]

['Kinship:Kinship',
 'Animals:Animals',
 'Agriculture and vegetation:Agriculture and vegetation',
 'The physical world:The physical world',
 'Spatial relations:Spatial relations']

In [88]:
ont_counts = defaultdict(int)
for colex in german_co2.Ontological_category.to_list(): 
    #sort it alphabetically
    colex.sort()
    ont_counts[colex[0] + ':' + colex[1]] += 1

In [89]:
ont_counts

defaultdict(int,
            {'Person/Thing:Person/Thing': 75,
             'Action/Process:Person/Thing': 7,
             'Other:Other': 6,
             'Person/Thing:Property': 4,
             'Other:Property': 1,
             'Action/Process:Action/Process': 29,
             'Property:Property': 13,
             'Other:Person/Thing': 1})

In [90]:
#all there :)
sum(ont_counts.values())

136

These counts could be used for analysis, for example to see whether certain languages tend to colexify kinship terms with each other, or colexify verbs with nouns (Action/Process:Person/Thing) more often. So I would need to make a dictionary or set of columns like this for each language. But because some languages will have more colexifications than others, I will need to account for that when feature engineering.

## Make function

I now want to simplify the process of making a df of colexifications for a language by writing a function that takes the language's Glottocode as input.

In [111]:
#glottocode should be a string of the glottocode for the target language
def build_lang_df(glottocode):
    lang_forms = pd.DataFrame(f_df.loc[f_df.Glottocode == glottocode,['Colex_Key', 'dataset_ID', 'Form']])
    #drop duplicates of the same form
    lang_forms = lang_forms.drop_duplicates(subset=["Colex_Key"])
    #the part that takes a while
    lang_forms['Colex_IDs'] = lang_forms.Colex_Key.map(lambda x: set(f_df.Concepticon_ID[f_df.Colex_Key == x].to_list()))

    lang_forms['Num_concepts'] = lang_forms.Colex_IDs.map(lambda x: len(x))
    #get only colexifications
    lang_forms = pd.DataFrame(lang_forms[lang_forms.Num_concepts > 1])
    #sort df and reset index
    lang_forms = lang_forms.sort_values('Colex_Key')
    lang_forms = lang_forms.reset_index(drop=True)
    #get the glosses for the concepts
    lang_forms['Concept_names'] = lang_forms.Colex_IDs.map(lambda x: 
                                                     [p_df.Concepticon_Gloss[(p_df.Concepticon_ID == y)].iloc[0] for y in x])
    #get semantic field, ontological category
    lang_forms['Semantic_field'] = lang_forms.Colex_IDs.map(lambda x: 
                                                     [p_df.Semantic_Field[(p_df.Concepticon_ID == y)].iloc[0] for y in x])
    lang_forms['Ontological_category'] = lang_forms.Colex_IDs.map(lambda x: 
                                                     [p_df.Ontological_Category[(p_df.Concepticon_ID == y)].iloc[0] for y in x])
    return lang_forms

### Russian

In [101]:
l_df[l_df.Name == 'Russian']

Unnamed: 0,ID,Name,Glottocode,Glottolog_Name,ISO639P3code,Macroarea,Latitude,Longitude,Family,dataset_ID,...,Type,Date,Transcriber,Short_Name,Region,Culture,Description,Orthography,Comment,Family_factor
406,204,Russian,russ1263,Russian,rus,Eurasia,59.0,50.0,Indo-European,ids,...,,,,,,,,,,Indo-European
740,rus,Russian,russ1263,Russian,rus,Eurasia,59.0,50.0,Indo-European,northeuralex,...,,,,,,,,,,Indo-European
1483,37000,Russian,russ1263,Russian,rus,Eurasia,59.0,50.0,Indo-European,diacl,...,,,,,,,,,,Indo-European


In [112]:
#Let's try with Russian (should take a while)
russ_co = build_lang_df('russ1263')
russ_co.head()

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field,Ontological_category
0,russ1263_batat,ids,batat,"{410, 159}",2,"[YAM, SWEET POTATO]","[Agriculture and vegetation, Agriculture and v...","[Person/Thing, Person/Thing]"
1,russ1263_bedro,ids,bedro,"{800, 1745}",2,"[THIGH, HIP]","[The body, The body]","[Person/Thing, Person/Thing]"
2,russ1263_čto,ids,čto,"{1236, 1157}",2,"[WHAT, BECAUSE]","[Cognition, Cognition]","[Other, Other]"
3,russ1263_den,ids,den,"{1807, 1225}",2,"[AFTERNOON, DAY (NOT NIGHT)]","[Time, Time]","[Person/Thing, Person/Thing]"
4,russ1263_derevo,ids,derevo,"{906, 1803}",2,"[TREE, WOOD]","[Agriculture and vegetation, The physical world]","[Person/Thing, Person/Thing]"


In [113]:
russ_co.sample(n=10)

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field,Ontological_category
55,russ1263_pridti,ids,pridti,"{1387, 1446}",2,"[ARRIVE, COME]","[Motion, Motion]","[Action/Process, Action/Process]"
73,russ1263_vejat,ids,vejat,"{360, 175}",2,"[FAN ACTION, BLOW (OF WIND)]","[Basic actions and technology, Motion]","[Action/Process, Action/Process]"
83,russ1263_zjat,ids,zjat,"{2267, 2266}",2,"[SON-IN-LAW (OF MAN), SON-IN-LAW (OF WOMAN)]","[Kinship, Kinship]","[Person/Thing, Person/Thing]"
35,russ1263_maz,ids,maz,"{1926, 323}",2,"[OINTMENT, FAT (ORGANIC SUBSTANCE)]","[Clothing and grooming, The body]","[Person/Thing, Person/Thing]"
95,russ1263_любить,northeuralex,lubʲiˑtʲ,"{1117, 923}",2,"[LIKE, LOVE]","[Emotions and values, Emotions and values]","[Action/Process, Action/Process]"
96,russ1263_масло,northeuralex,maˑsɫə,"{1484, 1245}",2,"[OIL (ORGANIC SUBSTANCE), BUTTER]","[Food and drink, Food and drink]","[Person/Thing, Person/Thing]"
52,russ1263_podnjat ́ sja,ids,podnjat ́ sja,"{705, 568}",2,"[GO UP (ASCEND), RISE (MOVE UPWARDS)]","[Motion, The physical world]","[Action/Process, Action/Process]"
30,russ1263_les,ids,les,"{420, 1803}",2,"[FOREST, WOOD]","[The physical world, The physical world]","[Person/Thing, Person/Thing]"
53,russ1263_pole,ids,pole,"{652, 212}",2,"[PLAIN, FIELD]","[The physical world, Agriculture and vegetation]","[Person/Thing, Person/Thing]"
19,russ1263_ključ,ids,ključ,"{1472, 2144}",2,"[KEY, SPRING OR WELL]","[The house, The physical world]","[Person/Thing, Person/Thing]"


The lack of distinction between 'liking' and 'loving' could be interesting.

In [131]:
#more than two concepts colexified for a form
russ_co[russ_co.Num_concepts > 2]

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field,Ontological_category
6,russ1263_djadja,ids,djadja,"{1985, 1327, 1984}",3,"[FATHER'S BROTHER, UNCLE, MOTHER'S BROTHER]","[Kinship, Kinship, Kinship]","[Person/Thing, Person/Thing, Person/Thing]"
8,russ1263_dérevo,diacl,dérevo,"{906, 945, 1803}",3,"[TREE, SPEAR, WOOD]","[Agriculture and vegetation, The physical worl...","[Person/Thing, Person/Thing, Person/Thing]"
12,russ1263_idti,ids,idti,"{705, 1443, 695}",3,"[GO UP (ASCEND), WALK, GO]","[Motion, Motion, Motion]","[Action/Process, Action/Process, Action/Process]"
23,russ1263_korá,diacl,korá,"{2409, 629, 1204}",3,"[BROWN, LEATHER, BARK]","[Agriculture and vegetation, Clothing and groo...","[Person/Thing, Person/Thing, Property]"
41,russ1263_my,ids,my,"{1131, 1212, 1130}",3,"[WE (INCLUSIVE), WE, WE (EXCLUSIVE)]","[Kinship, Kinship, Kinship]","[Other, Other, Other]"
47,russ1263_peč,ids,peč,"{1143, 183, 859}",3,"[OVEN, BAKE, STOVE]","[Food and drink, Food and drink, The house]","[Action/Process, Person/Thing, Person/Thing]"
65,russ1263_syroj,ids,syroj,"{1958, 1959, 1726}",3,"[UNRIPE, RAW, WET]","[Food and drink, Food and drink, Sense percept...","[Property, Property, Property]"
74,russ1263_vernyj,ids,vernyj,"{100, 509, 1657}",3,"[FAITHFUL, CERTAIN, TRUE]","[Cognition, Emotions and values, Emotions and ...","[Property, Property, Property]"


Neither German nor Russian distinguishes between WE (INCLUSIVE), WE, and WE (EXCLUSIVE).

In [116]:
#now make a function for the semantic field and ontological category dictionaries
#inputs are the column (the string name) and the df to count for
def build_dict_counts(column, df):
    counts = defaultdict(int)
    for colex in df[column].to_list(): 
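        #sort it alphabetically (sorted() leaves the list stored in the df unchanged)
        pair = sorted(colex)
        #only the first two sorted entries are counted if a form has more than two concepts
        counts[pair[0] + ':' + pair[1]] += 1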
    return counts
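
Because only the first two entries of each list are counted, a form with three or more concepts contributes a single pair. One possible alternative, sketched below with a hypothetical helper `build_pair_counts` and made-up toy lists, is to count every unordered pair using `itertools.combinations`:

```python
from collections import defaultdict
from itertools import combinations

def build_pair_counts(label_lists):
    """Count every unordered pair of labels within each list (hypothetical helper)."""
    counts = defaultdict(int)
    for labels in label_lists:
        # sorted() leaves the original list untouched;
        # combinations() then yields each alphabetically ordered pair once
        for a, b in combinations(sorted(labels), 2):
            counts[a + ':' + b] += 1
    return counts

# Toy input: one two-concept form and one three-concept form
toy_fields = [
    ["Kinship", "Kinship"],
    ["Motion", "The body", "Motion"],
]
pair_counts = build_pair_counts(toy_fields)
print(dict(pair_counts))
```

With this approach a three-concept form adds three pairs, so the total no longer equals the number of forms; that would need to be kept in mind when normalizing.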

In [118]:
sem_counts_r = build_dict_counts('Semantic_field', russ_co)
sem_counts_r

defaultdict(int,
            {'Agriculture and vegetation:Agriculture and vegetation': 6,
             'The body:The body': 7,
             'Cognition:Cognition': 3,
             'Time:Time': 2,
             'Agriculture and vegetation:The physical world': 3,
             'Possession:Possession': 2,
             'Kinship:Kinship': 4,
             'Emotions and values:Emotions and values': 3,
             'Food and drink:Food and drink': 9,
             'The physical world:The physical world': 4,
             'Motion:Motion': 7,
             'Food and drink:The body': 3,
             'Animals:Time': 1,
             'Speech and language:The body': 2,
             'The house:The physical world': 1,
             'Animals:Animals': 4,
             'Spatial relations:Time': 1,
             'Basic actions and technology:Warfare and hunting': 1,
             'Agriculture and vegetation:Clothing and grooming': 1,
             'Basic actions and technology:Clothing and grooming': 3,
            

In [123]:
sorted(sem_counts_r, key=sem_counts_r.get, reverse=True)[:5]

['Food and drink:Food and drink',
 'The body:The body',
 'Motion:Motion',
 'Agriculture and vegetation:Agriculture and vegetation',
 'Kinship:Kinship']

In [122]:
#german, for comparison
sorted(sem_counts, key=sem_counts.get, reverse=True)[:5]

['Kinship:Kinship',
 'Animals:Animals',
 'Agriculture and vegetation:Agriculture and vegetation',
 'The physical world:The physical world',
 'Spatial relations:Spatial relations']

In [121]:
ont_counts_r = build_dict_counts('Ontological_category', russ_co)
ont_counts_r

defaultdict(int,
            {'Person/Thing:Person/Thing': 66,
             'Other:Other': 4,
             'Action/Process:Action/Process': 19,
             'Property:Property': 9,
             'Number:Other': 1,
             'Action/Process:Person/Thing': 2,
             'Person/Thing:Property': 2,
             'Other:Person/Thing': 1,
             'Action/Process:Other': 1,
             'Number:Property': 1})

In [124]:
ont_counts

defaultdict(int,
            {'Person/Thing:Person/Thing': 75,
             'Action/Process:Person/Thing': 7,
             'Other:Other': 6,
             'Person/Thing:Property': 4,
             'Other:Property': 1,
             'Action/Process:Action/Process': 29,
             'Property:Property': 13,
             'Other:Person/Thing': 1})

Already we can see slight differences between German and Russian in which semantic-field combinations show up, as well as in which ones are most common. The ontological category shows less difference. Russian and German belong to the same language family (Indo-European), but Russian is Slavic while German is Germanic, which could account for these differences, as could the different languages that surround them.

Let's try one more language that is not Indo-European.

### Tamil

In [125]:
l_df[l_df.Name == 'Tamil']

Unnamed: 0,ID,Name,Glottocode,Glottolog_Name,ISO639P3code,Macroarea,Latitude,Longitude,Family,dataset_ID,...,Type,Date,Transcriber,Short_Name,Region,Culture,Description,Orthography,Comment,Family_factor
586,708,Tamil,tami1289,Tamil,tam,Eurasia,10.520219,78.825989,Dravidian,ids,...,,,,,,,,,,Dravidian
788,tam,Tamil,tami1289,Tamil,tam,Eurasia,10.520219,78.825989,Dravidian,northeuralex,...,,,,,,,,,,Dravidian


In [126]:
#build df
tamil_co = build_lang_df('tami1289')
tamil_co.head()

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field,Ontological_category
0,tami1289_amaiti,ids,amaiti,"{3458, 1338}",2,"[SILENCE, PEACE]","[Sense perception, Warfare and hunting]","[Person/Thing, Person/Thing]"
1,tami1289_arukē,ids,arukē,"{1942, 1816}",2,"[NEAR, BESIDE]","[Spatial relations, Spatial relations]","[Property, Other]"
2,tami1289_avarkaḷ,ids,avarkaḷ,"{817, 1130}",2,"[THEY, WE (EXCLUSIVE)]","[Kinship, Kinship]","[Other, Other]"
3,tami1289_aṭittal,ids,aṭittal,"{2133, 175}",2,"[STRIKE OR BEAT, BLOW (OF WIND)]","[Basic actions and technology, Motion]","[Action/Process, Action/Process]"
4,tami1289_aṭuppu,ids,aṭuppu,"{1143, 859}",2,"[OVEN, STOVE]","[Food and drink, The house]","[Person/Thing, Person/Thing]"


In [132]:
#more than two concepts colexified for a form
tamil_co[tamil_co.Num_concepts > 2]

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field,Ontological_category
24,tami1289_peṇ,ids,peṇ,"{2260, 962, 2262}",3,"[FEMALE (OF PERSON), WOMAN, FEMALE (OF ANIMAL)]","[Animals, Kinship, Kinship]","[Property, Person/Thing, Property]"
53,tami1289_கட்டு,northeuralex,kʌʈːu,"{1917, 1840, 2891}",3,"[TIE, BUILD, TIE UP (TETHER)]","[Basic actions and technology, Basic actions a...","[Action/Process, Action/Process, Action/Process]"
68,tami1289_செய்தி,northeuralex,t͡ɕejt̪i,"{2961, 2214, 2964}",3,"[MESSAGE, AFFAIR, NEWS]","[Basic actions and technology, Basic actions a...","[Person/Thing, Person/Thing, Person/Thing]"
70,tami1289_சொல்,northeuralex,t͡ɕol,"{2898, 1458, 1711}",3,"[CONVEY (A MESSAGE), SAY, TELL]","[Basic actions and technology, Speech and lang...","[Action/Process, Action/Process, Action/Process]"
73,tami1289_நகம்,northeuralex,n̪ʌhʌm,"{1896, 72, 2965}",3,"[FINGERNAIL OR TOENAIL, CLAW, PAW]","[Animals, Animals, The body]","[Person/Thing, Person/Thing, Person/Thing]"
78,tami1289_பை,northeuralex,pʌi,"{2990, 2951, 2468}",3,"[SACK, POUCH, BAG]","[Basic actions and technology, Basic actions a...","[Person/Thing, Person/Thing, Person/Thing]"
80,tami1289_முடி,northeuralex,muɖi,"{1040, 1766, 1928}",3,"[HAIR, FINISH, CEASE]","[The body, Time, Time]","[Person/Thing, Action/Process, Action/Process]"


In [143]:
tamil_co.sample(n=10)

Unnamed: 0,Colex_Key,dataset_ID,Form,Colex_IDs,Num_concepts,Concept_names,Semantic_field,Ontological_category
34,tami1289_āṟu,ids,āṟu,"{666, 1703}",2,"[RIVER, SIX]","[Quantity, The physical world]","[Person/Thing, Number]"
24,tami1289_peṇ,ids,peṇ,"{2260, 962, 2262}",3,"[FEMALE (OF PERSON), WOMAN, FEMALE (OF ANIMAL)]","[Animals, Kinship, Kinship]","[Property, Person/Thing, Property]"
69,tami1289_சேர்,northeuralex,t͡ɕeːɾ,"{2862, 2882}",2,"[CONNECT, ADD]","[Basic actions and technology, Motion]","[Action/Process, Action/Process]"
48,tami1289_எழுப்பு,northeuralex,eɻupːu,"{3174, 1602}",2,"[WAKE UP (SOMEONE), WAKE UP]","[The body, The body]","[Action/Process, Action/Process]"
38,tami1289_ஆறு,northeuralex,ɑːru,"{666, 1703}",2,"[RIVER, SIX]","[Quantity, The physical world]","[Person/Thing, Number]"
43,tami1289_உயர எழு,northeuralex,ujʌɾʌ eɻu,"{705, 568}",2,"[GO UP (ASCEND), RISE (MOVE UPWARDS)]","[Motion, The physical world]","[Action/Process, Action/Process]"
31,tami1289_vari,ids,vari,"{873, 1048}",2,"[TAX, LINE]","[Possession, Spatial relations]","[Person/Thing, Person/Thing]"
49,tami1289_ஏமாற்று,northeuralex,eːmɑːtːu,"{3230, 3055}",2,"[DECEIVE, SWINDLE]","[Cognition, Law]","[Action/Process, Action/Process]"
50,tami1289_ஒரே முறை,northeuralex,oɾeː murʌi,"{2914, 374}",2,"[ONCE (IN THE PAST), ONE TIME]","[Time, Time]","[Other, Other]"
4,tami1289_aṭuppu,ids,aṭuppu,"{1143, 859}",2,"[OVEN, STOVE]","[Food and drink, The house]","[Person/Thing, Person/Thing]"


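Combining data from separate datasets once again poses a problem, since northeuralex uses the Tamil script while ids does not. The same word can therefore end up under two different keys, for example tami1289_āṟu and tami1289_ஆறு, which both colexify RIVER and SIX.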

In [127]:
sem_counts_t = build_dict_counts('Semantic_field', tamil_co)
sem_counts_t

defaultdict(int,
            {'Sense perception:Warfare and hunting': 1,
             'Spatial relations:Spatial relations': 5,
             'Kinship:Kinship': 3,
             'Basic actions and technology:Motion': 3,
             'Food and drink:The house': 1,
             'Possession:Spatial relations': 2,
             'Sense perception:The physical world': 4,
             'Food and drink:Food and drink': 2,
             'Basic actions and technology:Basic actions and technology': 9,
             'Sense perception:Spatial relations': 1,
             'The physical world:The physical world': 4,
             'Speech and language:Speech and language': 2,
             'Motion:The body': 1,
             'Spatial relations:Time': 2,
             'Cognition:Emotions and values': 1,
             'Motion:The physical world': 2,
             'Cognition:Cognition': 1,
             'Emotions and values:Sense perception': 1,
             'Animals:Animals': 2,
             'Cognition:Speech and lan

In [128]:
sorted(sem_counts_t, key=sem_counts_t.get, reverse=True)[:5]

['Basic actions and technology:Basic actions and technology',
 'Spatial relations:Spatial relations',
 'Sense perception:The physical world',
 'The physical world:The physical world',
 'The body:The body']

In [129]:
#ontological category counts for Tamil
#(the earlier run passed russ_co by mistake, so the output needs to be regenerated)
ont_counts_t = build_dict_counts('Ontological_category', tamil_co)
ont_counts_t

The ontological category still needs to be compared for Tamil once this cell is re-run. The semantic field, however, already looks quite different from both Russian and German, which is encouraging since Tamil is neither related to them nor close geographically. Kinship and agriculture showed up in the top five for both Russian and German, but agriculture barely appears for Tamil; instead, Tamil's colexifications seem to be more spatial or motion-related in nature.

### Data Sharing

To add to the samples of the l_df, p_df, and f_df that I've shared, I will also share the dfs for the languages that I make as I go. Since I built these myself, there should be no problem sharing them. People can also recreate them using the function I made earlier.

### Next Steps

Next, I will have to decide which languages to use in my analysis. Because it takes a few minutes to get the colexifications for each language, I will need to narrow down the number a lot. I may choose some areas of the world that have languages of different families or sub-families close together, like parts of Europe, Southern India, or accepted Sprachbund areas, and see whether geographic proximity or language family is more strongly linked to the types of colexifications.

Since I can create a feature like the dictionary with the two semantic fields being colexified and the number per language, I could use vector representations for all the possible combinations as a feature. I will have to standardize for the number of colexifications per language, however. Semantic field seemed to be more informative at a glance, but I may retain the ontological category because it can tell us if a language tends to colexify noun meanings with verb meanings, which could be helpful. Using the actual concepts would be interesting, but they are often phrases, rather than singular words, so I do not think there would be a way I could use them.
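
As a rough sketch of what that feature could look like, the ontological-category counts printed earlier for German and Russian can be lined up in one table and divided by each language's total. The dictionaries below are copied from those outputs; the names `raw` and `features` are placeholders, and proportions are just one possible way to standardize.

```python
import pandas as pd

# Ontological-category counts copied from the German and Russian outputs above
german_ont = {
    'Person/Thing:Person/Thing': 75, 'Action/Process:Person/Thing': 7,
    'Other:Other': 6, 'Person/Thing:Property': 4, 'Other:Property': 1,
    'Action/Process:Action/Process': 29, 'Property:Property': 13,
    'Other:Person/Thing': 1,
}
russian_ont = {
    'Person/Thing:Person/Thing': 66, 'Other:Other': 4,
    'Action/Process:Action/Process': 19, 'Property:Property': 9,
    'Number:Other': 1, 'Action/Process:Person/Thing': 2,
    'Person/Thing:Property': 2, 'Other:Person/Thing': 1,
    'Action/Process:Other': 1, 'Number:Property': 1,
}

# One row per language, one column per combination; combinations a language lacks become 0
raw = pd.DataFrame([german_ont, russian_ont], index=['stan1295', 'russ1263']).fillna(0)

# Divide each row by its total so languages with more colexifications are comparable
features = raw.div(raw.sum(axis=1), axis=0)
print(features.round(3))
```

Each row then sums to 1, and the same layout would work for the semantic-field dictionaries, only with many more columns.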

In [145]:
#pickle dfs since it takes so long to make them
pd.to_pickle(f_df, "private/f_df.pkl")
#share the dfs I made for specific languages as csv files
german_co.to_csv('data_samples/german.csv', index=False)
russ_co.to_csv('data_samples/russian.csv', index=False)
tamil_co.to_csv('data_samples/tamil.csv', index=False)
pd.to_pickle(german_co, "private/german_df.pkl")
pd.to_pickle(russ_co, "private/russian_df.pkl")
pd.to_pickle(tamil_co, "private/tamil_df.pkl")