# Progress Report 3

- Author: Teresa Davison
- Email: tid30@pitt.edu
- Class/Professor: LING1340 Na-Rae Han
- Date: Apr 4, 2024

This is a continuation from progress reports 1 and 2.

## Load data & packages

In [1]:
#load necessary packages
import pickle
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt

In [59]:
#Load the dataframes from the pickle object (in my private directory)
l_df = pd.read_pickle("../private/l_df.pkl")
p_df = pd.read_pickle("../private/p_df.pkl")
f_df = pd.read_pickle("../private/f_df.pkl")

In [61]:
#I cleaned the dataframes to drop the unused columns and save space.
l_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3248 entries, 0 to 3247
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   ID              3248 non-null   object 
 1   Name            3248 non-null   object 
 2   Glottocode      3164 non-null   object 
 3   Glottolog_Name  2614 non-null   object 
 4   ISO639P3code    2820 non-null   object 
 5   Macroarea       2961 non-null   object 
 6   Latitude        2684 non-null   float64
 7   Longitude       2684 non-null   float64
 8   Family          3163 non-null   object 
 9   dataset_ID      3248 non-null   object 
dtypes: float64(2), object(8)
memory usage: 253.9+ KB


## Create Features

Since creating the features takes a long time, I'm narrowing my analysis to a random sample of 100 languages that have info for Family, Macroarea, and latitude/longitude.

In [65]:
#Let's see if we can use languages that have above 800 forms
form_count_dict = f_df.Glottocode.value_counts().to_dict()

In [69]:
#add the form counts to the language dataframe for sampling
l_df['Form_count'] = l_df.Glottocode.map(form_count_dict)
l_df.head()

| | ID | Name | Glottocode | Glottolog_Name | ISO639P3code | Macroarea | Latitude | Longitude | Family | dataset_ID | Form_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Venetianstd | Venetian-std | vene1258 | Venetian | vec | Eurasia | 45.503581 | 12.214167 | Indo-European | logos | 625.0 |
| 1 | Hindistd | Hindi-std | hind1269 | Hindi | hin | Eurasia | 25.0 | 77.0 | Indo-European | logos | 3831.0 |
| 2 | Romagnolstd | Romagnol-std | roma1328 | Romagnol | rgn | Eurasia | 44.2349 | 11.7189 | Indo-European | logos | 627.0 |
| 3 | Latinstd | Latin-std | lati1261 | Latin | lat | Eurasia | 41.9026 | 12.4502 | Indo-European | logos | 4304.0 |
| 4 | Galicianstd | Galician-std | gali1258 | Galician | glg | Eurasia | 42.2446 | -7.5343 | Indo-European | logos | 620.0 |


In [70]:
#check that it worked correctly, should be 627
form_count_dict['roma1328']

627
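
The counts show up as floats (625.0, 627.0) because `map` leaves NaN for any language whose Glottocode has no forms, and NaN forces a float column. A minimal sketch of that behavior on toy data, assuming the same column names as above; the toy frames and the optional `Int64` cast are illustrative and not part of the original analysis:

```python
import pandas as pd

# Toy stand-ins for f_df and l_df (hypothetical data, same column names)
toy_forms = pd.DataFrame({"Glottocode": ["aaaa1111", "aaaa1111", "bbbb2222"]})
toy_langs = pd.DataFrame({"Glottocode": ["aaaa1111", "bbbb2222", None]})

# value_counts() gives forms per Glottocode; map() aligns them to the languages
toy_counts = toy_forms.Glottocode.value_counts()
toy_langs["Form_count"] = toy_langs.Glottocode.map(toy_counts)

# Languages without a match get NaN, which forces a float column;
# the nullable Int64 dtype keeps whole numbers and marks the gap as <NA>
toy_langs["Form_count_int"] = toy_langs["Form_count"].astype("Int64")
print(toy_langs)
```

Either dtype works with the `Form_count > 800` filter used later, since comparisons against a missing value never select the row.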

Papunesian languages are heavily represented in this dataset, so I'm capping them at 10 of the 100 sampled languages so that we can explore more areas of the world.

In [83]:
#take a random sample of 90 non-Papunesian languages that have lat/longitude, macroarea, and family info
#and more than 800 forms (the 10 Papunesian languages are sampled in the next cell)
l_df_sample = l_df[(~l_df.Latitude.isnull()) & (~l_df.Family.isnull()) & 
        (~l_df.Macroarea.isnull()) & (l_df.Form_count >800) & (l_df.Macroarea != 'Papunesia')].sample(90)

In [84]:
l_df_sample_pn = l_df[(~l_df.Latitude.isnull()) & (~l_df.Family.isnull()) & 
        (~l_df.Macroarea.isnull()) & (l_df.Form_count >800) & (l_df.Macroarea == 'Papunesia')].sample(10)
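
The two sampling cells share the same filter and differ only in the Macroarea test and the sample size. One way to keep them in sync is sketched below on toy data. The helper name `capped_sample`, its parameters, and the `seed` argument are assumptions for illustration; the original cells draw a fresh, unseeded sample each run.

```python
import pandas as pd

def capped_sample(df, capped_area="Papunesia", n_capped=10, n_other=90,
                  min_forms=800, seed=0):
    """Sample languages with complete metadata, capping one macroarea."""
    # Same eligibility filter as the cells above, built once and reused
    eligible = df[df.Latitude.notna() & df.Family.notna()
                  & df.Macroarea.notna() & (df.Form_count > min_forms)]
    is_capped = eligible.Macroarea == capped_area
    # random_state makes the draw repeatable across notebook runs
    other = eligible[~is_capped].sample(n_other, random_state=seed)
    capped = eligible[is_capped].sample(n_capped, random_state=seed)
    return pd.concat([other, capped], ignore_index=True)

# Toy data (hypothetical): 8 Papunesian and 4 Eurasian languages
toy = pd.DataFrame({
    "Latitude": [0.0] * 12,
    "Family": ["F"] * 12,
    "Macroarea": ["Papunesia"] * 8 + ["Eurasia"] * 4,
    "Form_count": [900] * 12,
})
toy_sample = capped_sample(toy, n_capped=2, n_other=3)
print(toy_sample.Macroarea.value_counts())
```

With a fixed seed, rerunning the notebook would reproduce the same 100 languages, which may matter if later cells are compared across runs.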

In [85]:
#combine the two samples; ignore_index=True renumbers the scrambled indexes
#then look at the head of the new df
sample_df = pd.concat([l_df_sample, l_df_sample_pn], ignore_index=True, sort=False)
sample_df.head()

| | ID | Name | Glottocode | Glottolog_Name | ISO639P3code | Macroarea | Latitude | Longitude | Family | dataset_ID | Form_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jinghpaw | Jinghpo | kach1280 | Jinghpaw | kac | Eurasia | 25.4634 | 97.3278 | Sino-Tibetan | satterthwaitetb | 2054.0 |
| 1 | Kanuri | Kanuri | cent2050 | Central Kanuri | knc | Africa | 11.8863 | 16.3918 | Saharan | wold | 1590.0 |
| 2 | spa | Spanish | stan1288 | Spanish | spa | Eurasia | 40.4414 | -1.11788 | Indo-European | northeuralex | 3206.0 |
| 3 | 221 | Polci | polc1243 | Polci | plj | Africa | 10.0258 | 9.53362 | Afro-Asiatic | ids | 1990.0 |
| 4 | BijiangNusu | Nusu (Bijiang) | nusu1239 | Nusu | nuf | Eurasia | 26.7292 | 98.8801 | Sino-Tibetan | suntb | 1290.0 |


In [86]:
#let's look at the distribution
sample_df.Macroarea.value_counts()

Macroarea
Eurasia          61
Africa           14
South America    11
Papunesia        10
North America     4
Name: count, dtype: int64

In [87]:
sample_df.Family.value_counts()

Family
Indo-European               25
Nakh-Daghestanian           10
Atlantic-Congo              10
Sino-Tibetan                 8
Nuclear Trans New Guinea     6
Uralic                       5
Hmong-Mien                   5
Tai-Kadai                    4
Turkic                       2
Afro-Asiatic                 2
Austronesian                 2
Cariban                      2
Pano-Tacanan                 2
Ambakich                     1
Otomanguean                  1
South Bird's Head Family     1
Chapacuran                   1
Páez                         1
Dravidian                    1
Matacoan                     1
Waorani                      1
Tucanoan                     1
Eskimo-Aleut                 1
Seri                         1
Saharan                      1
Chibchan                     1
Tsimshian                    1
Basque                       1
Puinave                      1
Timor-Alor-Pantar            1
Name: count, dtype: int64

In [None]:
#Plot each language at its recorded location, colored by language family
#Hover over a dot to see the language name; you can also pan and zoom
fig = px.scatter_geo(l_df, lat='Latitude', lon='Longitude', color='Family',
                     hover_name='Name', #size=... could use to show number of concepts per lang,
                     title='Languages by location', width=870, height=600)
fig.show()

In [23]:
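#leftover output from an earlier run (note the execution count), apparently sampled before the Papunesia cap
l_df_sample.Macroarea.value_counts()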

Macroarea
Papunesia        61
Eurasia          16
Africa           12
South America     6
Australia         4
North America     1
Name: count, dtype: int64

In [None]:
#glottocode should be a string of the glottocode for the target language
def build_lang_df(glottocode):
    #copy the target language's rows so the new columns don't modify f_df
    lang_forms = f_df.loc[f_df.Glottocode == glottocode, ['Colex_Key', 'dataset_ID', 'Form']].copy()
    #drop duplicates of the same form
    lang_forms = lang_forms.drop_duplicates(subset=["Colex_Key"])

    #the part that takes a while: every Colex_Key triggers a full scan of f_df
    lang_forms['Colex_IDs'] = lang_forms.Colex_Key.map(lambda x: set(f_df.Concepticon_ID[f_df.Colex_Key == x].to_list()))

    #count the concepts that share each form
    lang_forms['Num_concepts'] = lang_forms.Colex_IDs.map(len)
    #get only colexifications (forms with more than one concept)
    lang_forms = lang_forms[lang_forms.Num_concepts > 1].copy()
    #sort df and reset index
    lang_forms = lang_forms.sort_values('Colex_Key')
    lang_forms = lang_forms.reset_index(drop=True)
    #get the glosses for the concepts
    lang_forms['Concept_names'] = lang_forms.Colex_IDs.map(lambda x:
                                                     [p_df.Concepticon_Gloss[(p_df.Concepticon_ID == y)].iloc[0] for y in x])
    #get semantic field, ontological category
    lang_forms['Semantic_field'] = lang_forms.Colex_IDs.map(lambda x:
                                                     [p_df.Semantic_Field[(p_df.Concepticon_ID == y)].iloc[0] for y in x])
    lang_forms['Ontological_category'] = lang_forms.Colex_IDs.map(lambda x:
                                                     [p_df.Ontological_Category[(p_df.Concepticon_ID == y)].iloc[0] for y in x])
    return lang_forms
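
The slow step in `build_lang_df` filters all of `f_df` once for every `Colex_Key`. A possible speedup, sketched here on toy data, is to group `f_df` a single time and reuse the result as a lookup for every language. The helper name `build_colex_lookup` is hypothetical, and the sketch assumes `f_df` has the `Colex_Key` and `Concepticon_ID` columns used above; it has not been timed on the real data.

```python
import pandas as pd

def build_colex_lookup(forms):
    """Map each Colex_Key to the set of Concepticon_IDs that share it."""
    # One pass over the whole frame instead of one boolean scan per key
    return forms.groupby("Colex_Key")["Concepticon_ID"].agg(set)

# Toy stand-in for f_df (hypothetical keys and IDs)
toy_forms = pd.DataFrame({
    "Colex_Key": ["k1", "k1", "k2", "k3", "k3", "k3"],
    "Concepticon_ID": [10, 11, 12, 13, 14, 13],
})
colex_lookup = build_colex_lookup(toy_forms)

# Inside build_lang_df the slow line could then become a plain lookup:
# lang_forms['Colex_IDs'] = lang_forms.Colex_Key.map(colex_lookup)
toy_keys = pd.Series(["k1", "k2", "k3"])
toy_ids = toy_keys.map(colex_lookup)
print(toy_ids.map(len).tolist())
```

Because the lookup is built once outside the function, its cost is shared across all 100 sampled languages instead of being paid per form.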

In [None]:
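#now make a function for the semantic field and ontological category dictionaries
#inputs are the column (the string name) and the df to count for
from collections import defaultdict

def build_dict_counts(column, df):
    counts = defaultdict(int)
    for colex in df[column].to_list():
        #sort alphabetically into a copy so the lists stored in df aren't changed
        labels = sorted(colex)
        #note: only the first two labels are counted, even when more concepts colexify
        counts[labels[0] + ':' + labels[1]] += 1
    return counts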
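
`build_dict_counts` keys each colexification on its first two labels after sorting, so a form that covers three or more concepts contributes only one pair. If every pair should count, one option is `itertools.combinations`, sketched below. The function name `count_all_pairs` and the toy labels are hypothetical, and whether all pairs should be counted is a design decision this sketch does not settle.

```python
from collections import Counter
from itertools import combinations

def count_all_pairs(label_lists):
    """Count every unordered pair of labels within each colexification."""
    counts = Counter()
    for labels in label_lists:
        # sorted() returns a copy, so the caller's lists stay untouched
        for first, second in combinations(sorted(labels), 2):
            counts[first + ':' + second] += 1
    return counts

# Toy semantic-field lists (hypothetical): one 2-way and one 3-way colexification
toy_fields = [["Kinship", "Body"], ["Motion", "Body", "Food"]]
toy_pair_counts = count_all_pairs(toy_fields)
print(toy_pair_counts)
```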

#TODO: build a list of every possible combo, then make a vector representation indexed by that list for each df
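
The note above describes turning each language's combo counts into a vector over one shared list of combos. A minimal sketch of that idea follows, assuming the per-language counts are dictionaries like the ones `build_dict_counts` returns. The function name `counts_to_vectors` and the toy labels are hypothetical, and the shared list here holds only combos observed in at least one language, not every theoretically possible combo.

```python
from collections import Counter

def counts_to_vectors(counts_by_lang):
    """Turn per-language combo counts into vectors over one shared key list."""
    # Shared, sorted list of every combo seen in any language
    all_combos = sorted(set().union(*counts_by_lang.values()))
    # One vector per language; combos a language lacks count as 0
    vectors = {lang: [counts.get(combo, 0) for combo in all_combos]
               for lang, counts in counts_by_lang.items()}
    return all_combos, vectors

# Toy counts for two made-up languages
toy_counts = {
    "lang_a": Counter({"Body:Body": 3, "Body:Kinship": 1}),
    "lang_b": Counter({"Body:Kinship": 2, "Food:Motion": 4}),
}
combos, vectors = counts_to_vectors(toy_counts)
print(combos)
print(vectors)
```

Using `.get(combo, 0)` instead of indexing avoids adding new zero entries to a `defaultdict` while the vectors are built.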