Setup:

Setting up the conda environment:
1. Install Python 3.6.8 and Anaconda.
2. Run "conda create -n example_name python=3.6.8" to create a conda environment.
3. Activate the environment with "conda activate example_name".
4. Run "conda install pytorch-cpu==1.1.0 torchvision-cpu==0.3.0 cpuonly -c pytorch".
5. Run "conda install allennlp==0.8.5 seqeval six tqdm lang2vec overrides==3.1.0".

The code for the project itself can be found on GitHub: https://github.com/NotJona/DAP-Project/tree/jona

Important: this project worked on both a Windows 11 PC and a Linux machine, but NOT on a Mac!

Let's look at the data we are using. The treebanks are all taken from the Universal Dependencies website. Make sure to download them yourself and save them under udapter/data (the folder has to be created).
1. For English: English EWT
2. For Japanese: Japanese GSD
3. For Vietnamese: Vietnamese VTB
4. For Chinese: Chinese GSD

Let's look at the different treebanks. How many sentences are in each set? How many tokens?

In [1]:
def count_conllu_items_sentences(file_path):
    """Counts segments in a conllu file, which (mostly) correspond to sentences"""
    item_count = 0
    inside_item = False
    
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line.startswith('# sent_id'):
                if inside_item:
                    item_count += 1
                inside_item = True
            elif line == '' and inside_item:
                item_count += 1
                inside_item = False

    return item_count

In [2]:
def count_conllu_items_tokens(file_path):
    """Counts 'sentence elements'. This is not the same as tokens, since sometimes a 'sentence element' is made of two words"""
    item_count = 0
    
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Strip leading/trailing whitespace
            line = line.strip()
            # Ignore empty lines and comment lines
            if line and not line.startswith('#'):
                item_count += 1

    return item_count
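
The count above includes every non-comment line. In CoNLL-U such lines can also be multiword-token ranges (IDs like `1-2`) or empty nodes (IDs like `8.1`). The following is a minimal sketch that separates these cases, assuming tab-separated, well-formed CoNLL-U input; the function name `count_conllu_lines_by_kind` and the toy sample are hypothetical and not part of the project code.

```python
def count_conllu_lines_by_kind(lines):
    """Count sentences, syntactic words, multiword-token ranges and empty nodes.

    `lines` is any iterable of strings, e.g. an open file object.
    """
    counts = {"sentences": 0, "words": 0, "multiword_ranges": 0, "empty_nodes": 0}
    in_sentence = False
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                counts["sentences"] += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            continue
        in_sentence = True
        token_id = line.split("\t")[0]
        if "-" in token_id:
            counts["multiword_ranges"] += 1
        elif "." in token_id:
            counts["empty_nodes"] += 1
        else:
            counts["words"] += 1
    # Count a final sentence that is not followed by a blank line.
    if in_sentence:
        counts["sentences"] += 1
    return counts


# Toy sentence (made up for illustration): one multiword token and two words.
sample = [
    "# sent_id = 1\n",
    "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_\n",
    "1\tdo\tdo\tAUX\t_\t_\t0\troot\t_\t_\n",
    "2\tn't\tnot\tPART\t_\t_\t1\tadvmod\t_\t_\n",
    "\n",
]
print(count_conllu_lines_by_kind(sample))
```

For a real treebank the function could be called with an open file, for example `count_conllu_lines_by_kind(open(file_path, encoding='utf-8'))`.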

Let's look at English:

In [3]:
file_path_en_dev = "data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu"
item_count_en_dev = count_conllu_items_sentences(file_path_en_dev)
token_count_en_dev = count_conllu_items_tokens(file_path_en_dev)
file_path_en_test = "data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-test.conllu"
item_count_en_test = count_conllu_items_sentences(file_path_en_test)
token_count_en_test = count_conllu_items_tokens(file_path_en_test)
file_path_en_train = "data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-train.conllu"
item_count_en_train = count_conllu_items_sentences(file_path_en_train)
token_count_en_train = count_conllu_items_tokens(file_path_en_train)

print(f'The English training CoNLL-U file contains {item_count_en_train} items and {token_count_en_train} tokens.')
print(f'The English development CoNLL-U file contains {item_count_en_dev} items and {token_count_en_dev} tokens.')
print(f'The English test CoNLL-U file contains {item_count_en_test} items and {token_count_en_test} tokens.')
print(f'In total the English CoNLL-U files contain {item_count_en_dev + item_count_en_test + item_count_en_train} items and {token_count_en_dev + token_count_en_test + token_count_en_train} tokens.')

The English training CoNLL-U file contains 12543 items and 204607 tokens.
The English development CoNLL-U file contains 2002 items and 25150 tokens.
The English test CoNLL-U file contains 2077 items and 25097 tokens.
In total the English CoNLL-U files contain 16622 items and 254854 tokens.


Let's look at Japanese:

In [4]:
file_path_ja_dev = "data/ud-treebanks-v2.3/UD_Japanese-GSD/ja_gsd-ud-dev.conllu"
item_count_ja_dev = count_conllu_items_sentences(file_path_ja_dev)
token_count_ja_dev = count_conllu_items_tokens(file_path_ja_dev)
file_path_ja_test = "data/ud-treebanks-v2.3/UD_Japanese-GSD/ja_gsd-ud-test.conllu"
item_count_ja_test = count_conllu_items_sentences(file_path_ja_test)
token_count_ja_test = count_conllu_items_tokens(file_path_ja_test)
file_path_ja_train = "data/ud-treebanks-v2.3/UD_Japanese-GSD/ja_gsd-ud-train.conllu"
item_count_ja_train = count_conllu_items_sentences(file_path_ja_train)
token_count_ja_train = count_conllu_items_tokens(file_path_ja_train)

print(f'The Japanese training CoNLL-U file contains {item_count_ja_train} items and {token_count_ja_train} tokens.')
print(f'The Japanese development CoNLL-U file contains {item_count_ja_dev} items and {token_count_ja_dev} tokens.')
print(f'The Japanese test CoNLL-U file contains {item_count_ja_test} items and {token_count_ja_test} tokens.')
print(f'In total the Japanese CoNLL-U files contain {item_count_ja_dev + item_count_ja_test + item_count_ja_train} items and {token_count_ja_dev + token_count_ja_test + token_count_ja_train} tokens.')

The Japanese training CoNLL-U file contains 7133 items and 160419 tokens.
The Japanese development CoNLL-U file contains 511 items and 11491 tokens.
The Japanese test CoNLL-U file contains 551 items and 12438 tokens.
In total the Japanese CoNLL-U files contain 8195 items and 184348 tokens.


Let's look at Vietnamese:

In [5]:
file_path_vi_dev = "data/ud-treebanks-v2.3/UD_Vietnamese-VTB/vi_vtb-ud-dev.conllu"
item_count_vi_dev = count_conllu_items_sentences(file_path_vi_dev)
token_count_vi_dev = count_conllu_items_tokens(file_path_vi_dev)
file_path_vi_test = "data/ud-treebanks-v2.3/UD_Vietnamese-VTB/vi_vtb-ud-test.conllu"
item_count_vi_test = count_conllu_items_sentences(file_path_vi_test)
token_count_vi_test = count_conllu_items_tokens(file_path_vi_test)
file_path_vi_train = "data/ud-treebanks-v2.3/UD_Vietnamese-VTB/vi_vtb-ud-train.conllu"
item_count_vi_train = count_conllu_items_sentences(file_path_vi_train)
token_count_vi_train = count_conllu_items_tokens(file_path_vi_train)

print(f'The Vietnamese training CoNLL-U file contains {item_count_vi_train} items and {token_count_vi_train} tokens.')
print(f'The Vietnamese development CoNLL-U file contains {item_count_vi_dev} items and {token_count_vi_dev} tokens.')
print(f'The Vietnamese test CoNLL-U file contains {item_count_vi_test} items and {token_count_vi_test} tokens.')
print(f'In total the Vietnamese CoNLL-U files contain {item_count_vi_dev + item_count_vi_test + item_count_vi_train} items and {token_count_vi_dev + token_count_vi_test + token_count_vi_train} tokens.')

The Vietnamese training CoNLL-U file contains 1400 items and 20285 tokens.
The Vietnamese development CoNLL-U file contains 800 items and 11514 tokens.
The Vietnamese test CoNLL-U file contains 800 items and 11955 tokens.
In total the Vietnamese CoNLL-U files contain 3000 items and 43754 tokens.


Let's now generate subsets. For the different models I generated random subsets of the training, dev and/or test sets for English, Japanese and Vietnamese. The smallest models had 100 sentences in each set. The "big" models had 400 sentences in each training set, while the dev and test sets remained the same. For the "ultimate" models, the training, dev and test sets of the 3 languages have the sizes seen below:

In [None]:
import random
from conllu import parse  # parse() comes from the conllu package

def generate_rand_values(file_path, n):
    """Returns n distinct random sentence indices for the given CoNLL-U file"""
    with open(file_path, 'r', encoding='utf-8') as file:
        file_content = file.read()
        length = len(parse(file_content))

    # valid indices are 0..length-1; random.randint(0, length) could also return length
    return random.sample(range(length), n)

def choose_segments(file_path, n):
    """Returns the lines of n randomly chosen sentences of the given CoNLL-U file"""
    result_data = []
    index = set(generate_rand_values(file_path, n))
    item_count = 0
    inside_item = False

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line.startswith('# sent_id'):
                inside_item = True
                if item_count in index:
                    result_data.append(line)
            elif line == '' and inside_item:
                # keep the blank line that closes a chosen sentence, then move on to the next sentence
                if item_count in index:
                    result_data.append(line)
                item_count += 1
                inside_item = False
            else:
                if item_count in index:
                    result_data.append(line)
    return result_data
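
Because the subsets are drawn at random, rerunning the notebook produces different files each time. The following is a minimal sketch of a reproducible variant, assuming that sentences are separated by exactly one empty line and that the whole file fits into memory; the names `sample_sentences`, `to_conllu` and the `seed` parameter are hypothetical and not part of the project code.

```python
import random


def sample_sentences(conllu_text, n, seed=0):
    """Return n randomly chosen sentence blocks, in their original order."""
    # Sentences in CoNLL-U are separated by blank lines.
    blocks = [b for b in conllu_text.strip().split("\n\n") if b.strip()]
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    chosen = sorted(rng.sample(range(len(blocks)), n))
    return [blocks[i] for i in chosen]


def to_conllu(blocks):
    """Join sentence blocks so that every sentence ends with a blank line."""
    return "".join(block + "\n\n" for block in blocks)


# Toy input (made up for illustration): five one-token sentences.
toy = "".join(
    f"# sent_id = {i}\n1\tw{i}\tw{i}\tX\t_\t_\t0\troot\t_\t_\n\n" for i in range(5)
)
subset = sample_sentences(toy, 3, seed=42)
print(to_conllu(subset))
```

With a fixed seed the same sentences are selected on every run, which makes the resulting train, dev and test files easier to compare across experiments.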

In [None]:
train_data_en = choose_segments(file_path_en_train, 12000) 
train_data_ja = choose_segments(file_path_ja_train, 6000) 
train_data_vi = choose_segments(file_path_vi_train, 1200) 
test_data_en = choose_segments(file_path_en_test, 740) 
test_data_ja = choose_segments(file_path_ja_test, 470) 
test_data_vi = choose_segments(file_path_vi_test, 590) 
dev_data_en = choose_segments(file_path_en_dev, 740) 
dev_data_ja = choose_segments(file_path_ja_dev, 470) 
dev_data_vi = choose_segments(file_path_vi_dev, 590) 

In [None]:
with open("en_ud-train.conllu", "w", encoding="utf-8") as new_file:
    for line in train_data_en:
        new_file.write(line+"\n")
with open("ja_ud-train.conllu", "w", encoding="utf-8") as new_file:
    for line in train_data_ja:
        new_file.write(line+"\n")
with open("vi_ud-train.conllu", "w", encoding="utf-8") as new_file:
    for line in train_data_vi:
        new_file.write(line+"\n")
with open("en_ud-test.conllu", "w", encoding="utf-8") as new_file:
    for line in test_data_en:
        new_file.write(line+"\n")
with open("ja_ud-test.conllu", "w", encoding="utf-8") as new_file:
    for line in test_data_ja:
        new_file.write(line+"\n")
with open("vi_ud-test.conllu", "w", encoding="utf-8") as new_file:
    for line in test_data_vi:
        new_file.write(line+"\n")
with open("en_ud-dev.conllu", "w", encoding="utf-8") as new_file:
    for line in dev_data_en:
        new_file.write(line+"\n")
with open("ja_ud-dev.conllu", "w", encoding="utf-8") as new_file:
    for line in dev_data_ja:
        new_file.write(line+"\n")
with open("vi_ud-dev.conllu", "w", encoding="utf-8") as new_file:
    for line in dev_data_vi:
        new_file.write(line+"\n")

Next the different datasets have to be sorted into folders for the different models, depending on the proportions required for the specific model. For example, let's take the 'ultimate_en33_ja33_vi33' model:
1. Under udapter/data create the folder 'ud-treebanks-v2.3_ultimate_en33_ja33_vi33' (the first part just sticks with the naming of the original udapter code).
2. In this folder create the 3 folders 'UD_English-EWT', 'UD_Japanese-GSD' and 'UD_Vietnamese-VTB'.
3. In each folder put the matching train, dev and test set.
4. Now copy the folders to get the correct proportions. In this case there should be one folder for English, two for Japanese and ten for Vietnamese (1 x 12000, 2 x 6000 and 10 x 1200 training sentences, so 12000 per language). Make sure to delete the dev and test set from the copied folders (we want the dev and test set to stay the same).
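
The steps above can also be scripted. The following is a minimal sketch under several assumptions: the subset files end in `train.conllu`, `dev.conllu` and `test.conllu` (as the files written earlier do), and the copied folders get a `_copyN` suffix. That suffix and the function name `build_model_folder` are hypothetical; the source does not say how the copies were named, and whether concat_ud_data.sh picks up folders with this naming has not been verified here.

```python
import shutil
from pathlib import Path


def build_model_folder(model_dir, sources, copies):
    """Create the treebank folders for one model.

    sources: {folder_name: directory holding that language's train/dev/test files}
    copies:  {folder_name: total number of folders wanted for that language}
    """
    model_dir = Path(model_dir)
    for name, src in sources.items():
        # The first folder keeps the train, dev and test files.
        first = model_dir / name
        first.mkdir(parents=True, exist_ok=True)
        for f in Path(src).glob("*.conllu"):
            shutil.copy(str(f), str(first / f.name))
        # Additional folders only carry the training file.
        for k in range(2, copies.get(name, 1) + 1):
            extra = model_dir / f"{name}_copy{k}"  # hypothetical naming scheme
            extra.mkdir(parents=True, exist_ok=True)
            for f in Path(src).glob("*train.conllu"):
                shutil.copy(str(f), str(extra / f.name))


# Example call for the model described above (the source directories are placeholders):
# build_model_folder(
#     "data/ud-treebanks-v2.3_ultimate_en33_ja33_vi33",
#     {"UD_English-EWT": "subsets/en", "UD_Japanese-GSD": "subsets/ja", "UD_Vietnamese-VTB": "subsets/vi"},
#     {"UD_English-EWT": 1, "UD_Japanese-GSD": 2, "UD_Vietnamese-VTB": 10},
# )
```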

After organizing the folders, run "scripts/concat_ud_data.sh --add_lang_id" in Git Bash (the script is found under scripts). Under udapter/data/ud you can now find the datasets for the different models. For our example the folder is named 'multilingual_ultimate_en33_ja33_vi33'.

Next the config files have to be adapted: depending on the model, the number of epochs has to be changed, and for each model the paths to the training, dev and test data have to be adjusted. The config files for all the models can be found under config/ud/name_of_the_specific_model (for our example from above that is config/ud/multilingual_ultimate_en33_ja33_vi33).

Now the models can be run. To run a model use the command "python train.py --config config/ud/name_of_model/udapter-test.json --name udapter" (for our example from above that is "python train.py --config config/ud/multilingual_ultimate_en33_ja33_vi33/udapter-test.json --name udapter").

Let us look at the results of the models trained on 100/400 sentences per language:

In [6]:
import json
import plotly.graph_objects as go

file_path_ultimate_en33_ja33_vi33 = "logs/udapter/2024.06.12_22.28.41_ultimate_en33_ja33_vi33"
file_path_ultimate_en50_ja25_vi25 = "logs/udapter/2024.06.14_09.49.50_ultimate_en50_ja25_vi25"
file_path_ultimate_en33_ja50_vi17 = "logs/udapter/2024.06.16_00.52.40_ultimate_en33_ja50_vi17"
file_path_ultimate_en33_ja17_vi50 = "logs/udapter/2024.06.17_18.59.22_ultimate_en33_ja17_vi50"

Now let's look at the performance of the 4 "ultimate" models. We are interested in their LAS scores, so let's see how they compare and how they improved over time:

In [7]:
file_path_ultimate_en33_ja33_vi33 = "logs/udapter/2024.06.12_22.28.41_ultimate_en33_ja33_vi33"
file_path_ultimate_en50_ja25_vi25 = "logs/udapter/2024.06.14_09.49.50_ultimate_en50_ja25_vi25"
file_path_ultimate_en33_ja50_vi17 = "logs/udapter/2024.06.16_00.52.40_ultimate_en33_ja50_vi17"
file_path_ultimate_en33_ja17_vi50 = "logs/udapter/2024.06.17_18.59.22_ultimate_en33_ja17_vi50"

data_ultimate_en33_ja33_vi33 = {}
for i in range(10):
    f = open(file_path_ultimate_en33_ja33_vi33+"/metrics_epoch_"+str(i)+".json")
    data = json.load(f)
    f.close()
    data_ultimate_en33_ja33_vi33[(i+1)*36000] = [data["training_.run/deps/LAS"], data["validation_.run/deps/LAS"]]

data_ultimate_en50_ja25_vi25 = {}
for i in range(15):
    f = open(file_path_ultimate_en50_ja25_vi25+"/metrics_epoch_"+str(i)+".json")
    data = json.load(f)
    f.close()
    data_ultimate_en50_ja25_vi25[(i+1)*24000] = [data["training_.run/deps/LAS"], data["validation_.run/deps/LAS"]]

data_ultimate_en33_ja50_vi17 = {}
for i in range(10):
    f = open(file_path_ultimate_en33_ja50_vi17+"/metrics_epoch_"+str(i)+".json")
    data = json.load(f)
    f.close()
    data_ultimate_en33_ja50_vi17[(i+1)*36000] = [data["training_.run/deps/LAS"], data["validation_.run/deps/LAS"]]

data_ultimate_en33_ja17_vi50 = {}
for i in range(10):
    f = open(file_path_ultimate_en33_ja17_vi50+"/metrics_epoch_"+str(i)+".json")
    data = json.load(f)
    f.close()
    data_ultimate_en33_ja17_vi50[(i+1)*36000] = [data["training_.run/deps/LAS"], data["validation_.run/deps/LAS"]]
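
The four loops above differ only in the log folder, the number of epochs and the per-epoch constant (36000 or 24000). The following is a minimal sketch of a helper that covers all of them, assuming the same `metrics_epoch_<i>.json` files and metric keys as in the cell above; the name `load_las_per_epoch` and its parameter names are hypothetical, and what the per-epoch constant represents is not stated in the source.

```python
import json
from pathlib import Path


def load_las_per_epoch(log_dir, n_epochs, step_per_epoch):
    """Map (epoch + 1) * step_per_epoch -> [training LAS, validation LAS]."""
    results = {}
    for epoch in range(n_epochs):
        path = Path(log_dir) / f"metrics_epoch_{epoch}.json"
        # The with-block closes the file even if json.load raises an error.
        with open(str(path), encoding="utf-8") as f:
            metrics = json.load(f)
        results[(epoch + 1) * step_per_epoch] = [
            metrics["training_.run/deps/LAS"],
            metrics["validation_.run/deps/LAS"],
        ]
    return results


# Example call, equivalent to the first loop above:
# data_ultimate_en33_ja33_vi33 = load_las_per_epoch(file_path_ultimate_en33_ja33_vi33, 10, 36000)
```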

In [132]:
# solid lines: validation LAS (x[1]); dotted lines: training LAS (x[0])
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(data_ultimate_en33_ja33_vi33.keys()), y=[x[1] for x in data_ultimate_en33_ja33_vi33.values()], name='Ultimate_en33_ja33_vi33 (validation)',
                         line=dict(color='darkslategray', width=1)))
fig.add_trace(go.Scatter(x=list(data_ultimate_en33_ja33_vi33.keys()), y=[x[0] for x in data_ultimate_en33_ja33_vi33.values()], name='Ultimate_en33_ja33_vi33 (training)',
                         line=dict(color='darkslategray', width=1, dash='dot')))
fig.add_trace(go.Scatter(x=list(data_ultimate_en50_ja25_vi25.keys()), y=[x[1] for x in data_ultimate_en50_ja25_vi25.values()], name='Ultimate_en50_ja25_vi25 (validation)',
                         line=dict(color='steelblue', width=1)))
fig.add_trace(go.Scatter(x=list(data_ultimate_en50_ja25_vi25.keys()), y=[x[0] for x in data_ultimate_en50_ja25_vi25.values()], name='Ultimate_en50_ja25_vi25 (training)',
                         line=dict(color='steelblue', width=1, dash='dot')))
fig.add_trace(go.Scatter(x=list(data_ultimate_en33_ja50_vi17.keys()), y=[x[1] for x in data_ultimate_en33_ja50_vi17.values()], name='Ultimate_en33_ja50_vi17 (validation)',
                         line=dict(color='yellowgreen', width=1)))
fig.add_trace(go.Scatter(x=list(data_ultimate_en33_ja50_vi17.keys()), y=[x[0] for x in data_ultimate_en33_ja50_vi17.values()], name='Ultimate_en33_ja50_vi17 (training)',
                         line=dict(color='yellowgreen', width=1, dash='dot')))
fig.add_trace(go.Scatter(x=list(data_ultimate_en33_ja17_vi50.keys()), y=[x[1] for x in data_ultimate_en33_ja17_vi50.values()], name='Ultimate_en33_ja17_vi50 (validation)',
                         line=dict(color='gold', width=1)))
fig.add_trace(go.Scatter(x=list(data_ultimate_en33_ja17_vi50.keys()), y=[x[0] for x in data_ultimate_en33_ja17_vi50.values()], name='Ultimate_en33_ja17_vi50 (training)',
                         line=dict(color='gold', width=1, dash='dot')))
fig.show()

Let's see how well the ultimate models perform (dev and test set)

In [8]:
l = [
"logs/udapter/2024.06.12_22.28.41_ultimate_en33_ja33_vi33",
"logs/udapter/2024.06.14_09.49.50_ultimate_en50_ja25_vi25",
"logs/udapter/2024.06.16_00.52.40_ultimate_en33_ja50_vi17",
"logs/udapter/2024.06.17_18.59.22_ultimate_en33_ja17_vi50"]

results_dev_test = []
for path in l:
    f = open(path+"/test_results.json")
    data = json.load(f)
    f.close()
    test_set = data["LAS"]["precision"]
    f = open(path+"/dev_results.json")
    data = json.load(f)
    f.close()
    dev_set = data["LAS"]["precision"]
    results_dev_test.append([dev_set, test_set])

index = ["en33_ja33_vi33", "en50_ja25_vi25", "en33_ja50_vi17", "en33_ja17_vi50"]
fig = go.Figure(data=[
    go.Bar(name='dev set', x=index, y=[d[0] for d in results_dev_test], marker_color = 'maroon'),
    go.Bar(name='test set', x=index, y=[d[1] for d in results_dev_test], marker_color = 'plum')
])

fig.update_layout(barmode='group')
fig.update_layout(yaxis_range=[0,1])
fig.show()


Now let's see how well the 4 ultimate models perform on the test set of the models that were trained on 100 and 400 sentences per language. This test set is called test_small!

We have to run the following commands from PowerShell:

python predict.py logs/udapter/2024.06.12_22.28.41_ultimate_en33_ja33_vi33/model.tar.gz data/ud/multilingual_ultimate_en33_ja33_vi33/test_small.conllu output_test_small_ultimate_en33_ja33_vi33.conllu --eval_file results_small_ultimate_en33_ja33_vi33.json ;

python predict.py logs/udapter/2024.06.14_09.49.50_ultimate_en50_ja25_vi25/model.tar.gz data/ud/multilingual_ultimate_en50_ja25_vi25/test_small.conllu output_test_small_ultimate_en50_ja25_vi25.conllu --eval_file results_small_ultimate_en50_ja25_vi25.json ;

python predict.py logs/udapter/2024.06.16_00.52.40_ultimate_en33_ja50_vi17/model.tar.gz data/ud/multilingual_ultimate_en33_ja50_vi17/test_small.conllu output_test_small_ultimate_en33_ja50_vi17.conllu --eval_file results_small_ultimate_en33_ja50_vi17.json ;

python predict.py logs/udapter/2024.06.17_18.59.22_ultimate_en33_ja17_vi50/model.tar.gz data/ud/multilingual_ultimate_en33_ja17_vi50/test_small.conllu output_test_small_ultimate_en33_ja17_vi50.conllu --eval_file results_small_ultimate_en33_ja17_vi50.json
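
Since the four commands above differ only in the model name and its timestamp, they can be generated instead of typed by hand. The following is a minimal sketch that only builds the command strings from the values shown above; the names `MODELS` and `predict_command` are hypothetical and not part of the project code.

```python
# Model name -> timestamp prefix of its log folder (taken from the commands above).
MODELS = {
    "ultimate_en33_ja33_vi33": "2024.06.12_22.28.41",
    "ultimate_en50_ja25_vi25": "2024.06.14_09.49.50",
    "ultimate_en33_ja50_vi17": "2024.06.16_00.52.40",
    "ultimate_en33_ja17_vi50": "2024.06.17_18.59.22",
}


def predict_command(model, timestamp, test_name, output_tag, result_tag):
    """Build one predict.py call in the same shape as the commands above."""
    return (
        f"python predict.py logs/udapter/{timestamp}_{model}/model.tar.gz "
        f"data/ud/multilingual_{model}/{test_name}.conllu "
        f"output_{output_tag}_{model}.conllu "
        f"--eval_file results_{result_tag}_{model}.json"
    )


for model, timestamp in MODELS.items():
    print(predict_command(model, timestamp, "test_small", "test_small", "small"))
```

The printed lines can be pasted into the shell one by one, which avoids copy-and-paste mistakes in the long paths.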

In [9]:
l = ["results_small_ultimate_en33_ja33_vi33.json",
"results_small_ultimate_en50_ja25_vi25.json",
"results_small_ultimate_en33_ja50_vi17.json",
"results_small_ultimate_en33_ja17_vi50.json"]

results_dev_test_testsmall = []
n = 0
for file in l:
    f = open(file)
    data = json.load(f)
    f.close()
    results_dev_test_testsmall.append(results_dev_test[n]+[data["LAS"]["precision"]])
    n +=1

fig = go.Figure(data=[
    go.Bar(name='dev set', x=index, y=[d[0] for d in results_dev_test_testsmall], marker_color = 'maroon'),
    go.Bar(name='test set', x=index, y=[d[1] for d in results_dev_test_testsmall], marker_color = 'plum'),
    go.Bar(name='small test set', x=index, y=[d[2] for d in results_dev_test_testsmall], marker_color = 'pink'),
])

fig.update_layout(barmode='group')
fig.update_layout(yaxis_range=[0,1])
fig.show()


Now let's see how well the 4 ultimate models perform for Chinese! This test set is called test_chinese


We have to run the following commands from PowerShell:

python predict.py logs/udapter/2024.06.12_22.28.41_ultimate_en33_ja33_vi33/model.tar.gz data/ud/multilingual_ultimate_en33_ja33_vi33/test_chinese.conllu output_chinese_ultimate_en33_ja33_vi33.conllu --eval_file results_chinese_ultimate_en33_ja33_vi33.json ;

python predict.py logs/udapter/2024.06.14_09.49.50_ultimate_en50_ja25_vi25/model.tar.gz data/ud/multilingual_ultimate_en50_ja25_vi25/test_chinese.conllu output_chinese_ultimate_en50_ja25_vi25.conllu --eval_file results_chinese_ultimate_en50_ja25_vi25.json ;

python predict.py logs/udapter/2024.06.16_00.52.40_ultimate_en33_ja50_vi17/model.tar.gz data/ud/multilingual_ultimate_en33_ja50_vi17/test_chinese.conllu output_chinese_ultimate_en33_ja50_vi17.conllu --eval_file results_chinese_ultimate_en33_ja50_vi17.json ; 

python predict.py logs/udapter/2024.06.17_18.59.22_ultimate_en33_ja17_vi50/model.tar.gz data/ud/multilingual_ultimate_en33_ja17_vi50/test_chinese.conllu output_chinese_ultimate_en33_ja17_vi50.conllu --eval_file results_chinese_ultimate_en33_ja17_vi50.json

In [10]:
l = ["results_chinese_ultimate_en33_ja33_vi33.json",
"results_chinese_ultimate_en50_ja25_vi25.json",
"results_chinese_ultimate_en33_ja50_vi17.json",
"results_chinese_ultimate_en33_ja17_vi50.json"]

results_dev_test_testsmall_testchinese = []
n = 0
for file in l:
    f = open(file)
    data = json.load(f)
    f.close()
    results_dev_test_testsmall_testchinese.append(results_dev_test_testsmall[n]+[data["LAS"]["precision"]])
    n +=1

fig = go.Figure(data=[
    go.Bar(name='dev set', x=index, y=[d[0] for d in results_dev_test_testsmall_testchinese], marker_color = 'maroon'),
    go.Bar(name='test set', x=index, y=[d[1] for d in results_dev_test_testsmall_testchinese], marker_color = 'plum'),
    go.Bar(name='small test set', x=index, y=[d[2] for d in results_dev_test_testsmall_testchinese], marker_color = 'pink'),
    go.Bar(name='chinese test set', x=index, y=[d[3] for d in results_dev_test_testsmall_testchinese], marker_color = 'orange')
])

fig.update_layout(barmode='group')
fig.update_layout(yaxis_range=[0,1])
fig.show()

The results for Chinese all look similar. Is that correct? Let's look at the absolute number of correct LAS predictions instead:

In [11]:
l = ["results_chinese_ultimate_en33_ja33_vi33.json",
"results_chinese_ultimate_en50_ja25_vi25.json",
"results_chinese_ultimate_en33_ja50_vi17.json",
"results_chinese_ultimate_en33_ja17_vi50.json"]

results_testchinese_in_numbers = []
for file in l:
    f = open(file)
    data = json.load(f)
    f.close()
    results_testchinese_in_numbers.append(data["LAS"]["correct"])

fig = go.Figure(data=[
    go.Bar(name='chinese test set', x=index, y=results_testchinese_in_numbers, marker_color = 'orange')
])
fig.update_layout(width= 1000)
fig.add_hline(y=3441, line_width=1.8, line_dash="dash", line_color="maroon")
fig.show()

Let us check whether the differences are statistically significant. To do so, we need to compare the gold annotations with the predictions of each model:

In [133]:
def get_data(file_path, number):
    """collects the value of column `number` (0-based) for every token line and returns them as a flat list"""
    column_values = []
    
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line.startswith('#') or line == '':
                continue
            # CoNLL-U columns are tab-separated; a plain split() would also break on spaces inside a field
            column_values.append(line.split('\t')[number])
    return column_values


First we get the data for UAS, i.e. the head index of every token (column 6, HEAD):

In [134]:
real_chinese_UAS_data = get_data('data/ud/multilingual_ultimate_en33_ja33_vi33/test_chinese.conllu', 6)
ultimate_en33_ja33_vi33_chinese_UAS_data = get_data('output_chinese_ultimate_en33_ja33_vi33.conllu', 6)
ultimate_en50_ja25_vi25_chinese_UAS_data = get_data('output_chinese_ultimate_en50_ja25_vi25.conllu', 6)
ultimate_en33_ja50_vi17_chinese_UAS_data = get_data('output_chinese_ultimate_en33_ja50_vi17.conllu', 6)
ultimate_en33_ja17_vi50_chinese_UAS_data = get_data('output_chinese_ultimate_en33_ja17_vi50.conllu', 6)

In [136]:
data_length = len(real_chinese_UAS_data)

In [137]:
def correct_prediction(data, prediction):
    """compares the gold values of the test data with the predictions of a model, element by element"""
    results = []
    for entry in range(len(data)):
        results.append(data[entry]==prediction[entry])
    return results

In [138]:
ultimate_en33_ja33_vi33_chinese_prediction_UAS_data = correct_prediction(real_chinese_UAS_data, ultimate_en33_ja33_vi33_chinese_UAS_data)
ultimate_en50_ja25_vi25_chinese_prediction_UAS_data = correct_prediction(real_chinese_UAS_data, ultimate_en50_ja25_vi25_chinese_UAS_data)
ultimate_en33_ja50_vi17_chinese_prediction_UAS_data = correct_prediction(real_chinese_UAS_data, ultimate_en33_ja50_vi17_chinese_UAS_data)
ultimate_en33_ja17_vi50_chinese_prediction_UAS_data = correct_prediction(real_chinese_UAS_data, ultimate_en33_ja17_vi50_chinese_UAS_data)

In [140]:
print(len([x for x in ultimate_en33_ja33_vi33_chinese_prediction_UAS_data if x == True]))
print(len([x for x in ultimate_en50_ja25_vi25_chinese_prediction_UAS_data if x == True]))
print(len([x for x in ultimate_en33_ja50_vi17_chinese_prediction_UAS_data if x == True]))
print(len([x for x in ultimate_en33_ja17_vi50_chinese_prediction_UAS_data if x == True]))
print(len([x for x in ultimate_en33_ja33_vi33_chinese_prediction_UAS_data if x == True])/data_length)
print(len([x for x in ultimate_en50_ja25_vi25_chinese_prediction_UAS_data if x == True])/data_length)
print(len([x for x in ultimate_en33_ja50_vi17_chinese_prediction_UAS_data if x == True])/data_length)
print(len([x for x in ultimate_en33_ja17_vi50_chinese_prediction_UAS_data if x == True])/data_length)

6428
6289
6496
6145
0.5351315351315351
0.5235597735597736
0.5407925407925408
0.5115717615717615


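Now let's get the dependency relation labels (column 7, DEPREL), which we need for LAS. Note that the variables below are named POS, although they hold dependency relations and not part-of-speech tags: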

In [141]:
real_chinese_POS_data = get_data('data/ud/multilingual_ultimate_en33_ja33_vi33/test_chinese.conllu', 7)
ultimate_en33_ja33_vi33_chinese_POS_data = get_data('output_chinese_ultimate_en33_ja33_vi33.conllu', 7)
ultimate_en50_ja25_vi25_chinese_POS_data = get_data('output_chinese_ultimate_en50_ja25_vi25.conllu', 7)
ultimate_en33_ja50_vi17_chinese_POS_data = get_data('output_chinese_ultimate_en33_ja50_vi17.conllu', 7)
ultimate_en33_ja17_vi50_chinese_POS_data = get_data('output_chinese_ultimate_en33_ja17_vi50.conllu', 7)

Language-specific relation subtypes have to be ignored (for example, 'acl:relcl' counts as 'acl'), hence we have to clean the labels:

In [142]:
def strip_subtype(labels):
    """removes the language-specific subtype from every label, e.g. 'acl:relcl' becomes 'acl'"""
    return [label.split(':')[0] for label in labels]

real_chinese_POS_data = strip_subtype(real_chinese_POS_data)
ultimate_en33_ja33_vi33_chinese_POS_data = strip_subtype(ultimate_en33_ja33_vi33_chinese_POS_data)
ultimate_en50_ja25_vi25_chinese_POS_data = strip_subtype(ultimate_en50_ja25_vi25_chinese_POS_data)
ultimate_en33_ja50_vi17_chinese_POS_data = strip_subtype(ultimate_en33_ja50_vi17_chinese_POS_data)
ultimate_en33_ja17_vi50_chinese_POS_data = strip_subtype(ultimate_en33_ja17_vi50_chinese_POS_data)

In [143]:
ultimate_en33_ja33_vi33_chinese_prediction_POS_data = correct_prediction(real_chinese_POS_data, ultimate_en33_ja33_vi33_chinese_POS_data)
ultimate_en50_ja25_vi25_chinese_prediction_POS_data = correct_prediction(real_chinese_POS_data, ultimate_en50_ja25_vi25_chinese_POS_data)
ultimate_en33_ja50_vi17_chinese_prediction_POS_data = correct_prediction(real_chinese_POS_data, ultimate_en33_ja50_vi17_chinese_POS_data)
ultimate_en33_ja17_vi50_chinese_prediction_POS_data = correct_prediction(real_chinese_POS_data, ultimate_en33_ja17_vi50_chinese_POS_data)

In [144]:
print(len([x for x in ultimate_en33_ja33_vi33_chinese_prediction_POS_data if x == True]))
print(len([x for x in ultimate_en50_ja25_vi25_chinese_prediction_POS_data if x == True]))
print(len([x for x in ultimate_en33_ja50_vi17_chinese_prediction_POS_data if x == True]))
print(len([x for x in ultimate_en33_ja17_vi50_chinese_prediction_POS_data if x == True]))
print(len([x for x in ultimate_en33_ja33_vi33_chinese_prediction_POS_data if x == True])/data_length)
print(len([x for x in ultimate_en50_ja25_vi25_chinese_prediction_POS_data if x == True])/data_length)
print(len([x for x in ultimate_en33_ja50_vi17_chinese_prediction_POS_data if x == True])/data_length)
print(len([x for x in ultimate_en33_ja17_vi50_chinese_prediction_POS_data if x == True])/data_length)

5490
5259
5268
5385
0.45704295704295705
0.43781218781218784
0.4385614385614386
0.4483016983016983


Now we have the per-token results for the head (UAS) and for the relation label (named POS in the code), so we can calculate the LAS flags: a flag is True if both the head and the label are correct, and False otherwise.

In [145]:
LAS_data = []
for index in range(data_length):
    if ultimate_en33_ja33_vi33_chinese_prediction_UAS_data[index]== True and ultimate_en33_ja33_vi33_chinese_prediction_POS_data[index] == True:
        LAS_data.append(True)
    else:
        LAS_data.append(False)
ultimate_en33_ja33_vi33_chinese_prediction_LAS_data = LAS_data

LAS_data = []
for index in range(data_length):
    if ultimate_en50_ja25_vi25_chinese_prediction_UAS_data[index]== True and ultimate_en50_ja25_vi25_chinese_prediction_POS_data[index] == True:
        LAS_data.append(True)
    else:
        LAS_data.append(False)
ultimate_en50_ja25_vi25_chinese_prediction_LAS_data = LAS_data

LAS_data = []
for index in range(data_length):
    if ultimate_en33_ja50_vi17_chinese_prediction_UAS_data[index]== True and ultimate_en33_ja50_vi17_chinese_prediction_POS_data[index] == True:
        LAS_data.append(True)
    else:
        LAS_data.append(False)
ultimate_en33_ja50_vi17_chinese_prediction_LAS_data = LAS_data

LAS_data = []
for index in range(data_length):
    if ultimate_en33_ja17_vi50_chinese_prediction_UAS_data[index]== True and ultimate_en33_ja17_vi50_chinese_prediction_POS_data[index] == True:
        LAS_data.append(True)
    else:
        LAS_data.append(False)
ultimate_en33_ja17_vi50_chinese_prediction_LAS_data = LAS_data

In [146]:
counter = 0
for zahl in ultimate_en33_ja33_vi33_chinese_prediction_LAS_data:
    if zahl == True:
        counter += 1
print(counter)

counter = 0
for zahl in ultimate_en50_ja25_vi25_chinese_prediction_LAS_data:
    if zahl == True:
        counter += 1
print(counter)

counter = 0
for zahl in ultimate_en33_ja50_vi17_chinese_prediction_LAS_data:
    if zahl == True:
        counter += 1
print(counter)

counter = 0
for zahl in ultimate_en33_ja17_vi50_chinese_prediction_LAS_data:
    if zahl == True:
        counter += 1
print(counter)

3441
3276
3279
3319
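
The same per-token flags can be computed in a single pass over paired (HEAD, DEPREL) values. The following is a minimal sketch, assuming that gold and predicted tokens are aligned one to one and that the lists are not empty; the function name `attachment_scores` and the toy data are hypothetical and not part of the project code.

```python
def attachment_scores(gold, predicted):
    """gold/predicted: lists of (head, deprel) pairs, one pair per token.

    Returns (UAS, LAS, per-token LAS flags).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    uas_flags, las_flags = [], []
    for (g_head, g_rel), (p_head, p_rel) in zip(gold, predicted):
        head_ok = g_head == p_head
        # Compare only the universal part of the relation, e.g. 'acl:relcl' -> 'acl'.
        rel_ok = g_rel.split(":")[0] == p_rel.split(":")[0]
        uas_flags.append(head_ok)
        # LAS requires both the head and the relation label to be correct.
        las_flags.append(head_ok and rel_ok)
    n = len(gold)
    return sum(uas_flags) / n, sum(las_flags) / n, las_flags


# Toy data (made up for illustration): three tokens.
gold = [("2", "nsubj"), ("0", "root"), ("2", "obj:x")]
pred = [("2", "obj"), ("0", "root"), ("2", "obj")]
uas, las, flags = attachment_scores(gold, pred)
print(uas, las, flags)
```

Keeping head and label together per token avoids maintaining two separate lists that have to stay aligned.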


Nice! Now we know for every token whether its LAS prediction is correct, so we can test the statistical significance: Is the model ultimate_en33_ja33_vi33 significantly better than the other models in analyzing the Chinese treebank?

To this end, we have to calculate the following for each of the three other models (ultimate_en50_ja25_vi25, ultimate_en33_ja50_vi17 and ultimate_en33_ja17_vi50):
1. The "trials" - the number of cases where the ultimate_en33_ja33_vi33 model and the other model differ (one is correct, the other is not)
2. The "successes" - the number of cases where the ultimate_en33_ja33_vi33 model is better than the other model (it is correct, the other model is not)

Then we can use the binomial test to see whether these differences are likely to be a coincidence or are statistically significant.

In [152]:
trial_1 = 0
for index in range(data_length):
    if ultimate_en33_ja33_vi33_chinese_prediction_LAS_data[index] != ultimate_en50_ja25_vi25_chinese_prediction_LAS_data[index]:
        trial_1 += 1
print(trial_1)
successes_1 = 0
for index in range(data_length):
    if ultimate_en33_ja33_vi33_chinese_prediction_LAS_data[index] == True and ultimate_en50_ja25_vi25_chinese_prediction_LAS_data[index] == False:
        successes_1 += 1
print(successes_1)

trial_2 = 0
for index in range(data_length):
    if ultimate_en33_ja33_vi33_chinese_prediction_LAS_data[index] != ultimate_en33_ja50_vi17_chinese_prediction_LAS_data[index]:
        trial_2 += 1
print(trial_2)
successes_2 = 0
for index in range(data_length):
    if ultimate_en33_ja33_vi33_chinese_prediction_LAS_data[index] == True and ultimate_en33_ja50_vi17_chinese_prediction_LAS_data[index] == False:
        successes_2 += 1
print(successes_2)

trial_3 = 0
for index in range(data_length):
    if ultimate_en33_ja33_vi33_chinese_prediction_LAS_data[index] != ultimate_en33_ja17_vi50_chinese_prediction_LAS_data[index]:
        trial_3 += 1
print(trial_3)
successes_3 = 0
for index in range(data_length):
    if ultimate_en33_ja33_vi33_chinese_prediction_LAS_data[index] == True and ultimate_en33_ja17_vi50_chinese_prediction_LAS_data[index] == False:
        successes_3 += 1
print(successes_3)


961
563
1102
632
1072
597
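
The trials and successes above can be obtained from one small helper. The following is a minimal sketch, assuming two equally long lists of True/False flags (one flag per token); the function name `discordant_counts` and the toy flags are hypothetical and not part of the project code.

```python
def discordant_counts(flags_a, flags_b):
    """Return (trials, successes) for a paired comparison of model A and model B.

    trials:    tokens where exactly one of the two models is correct
    successes: tokens where model A is correct and model B is not
    """
    if len(flags_a) != len(flags_b):
        raise ValueError("both flag lists must have the same length")
    trials = sum(a != b for a, b in zip(flags_a, flags_b))
    successes = sum(a and not b for a, b in zip(flags_a, flags_b))
    return trials, successes


# Toy flags (made up for illustration).
model_a = [True, True, False, True, False]
model_b = [True, False, True, False, False]
print(discordant_counts(model_a, model_b))
```

Tokens on which both models are right or both are wrong carry no information about which model is better, which is why only the differing cases count as trials.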


Now let's calculate the p-value:

In [161]:
import scipy.stats
print(scipy.stats.binom_test(successes_1, trial_1))
print(scipy.stats.binom_test(successes_2, trial_2))
print(scipy.stats.binom_test(successes_3, trial_3))


1.14017773454234e-07
1.1842786994659202e-06
0.00021621297060019082
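
As a cross-check that does not depend on SciPy (scipy.stats.binom_test is no longer available in recent SciPy releases, where scipy.stats.binomtest replaces it), the two-sided p-value can be computed exactly with integer arithmetic. The following is a minimal sketch, assuming the null hypothesis that both models are equally likely to win a differing case (success probability 0.5); the function name `two_sided_sign_test` is hypothetical.

```python
def two_sided_sign_test(successes, trials):
    """Exact two-sided binomial p-value for H0: success probability = 0.5."""
    if trials == 0:
        return 1.0
    # Under H0 the distribution is symmetric, so both tails have the same mass.
    k = max(successes, trials - successes)
    coeff = 1  # binomial coefficient C(trials, 0)
    tail = 0   # sum of C(trials, i) for i >= k
    for i in range(trials + 1):
        if i >= k:
            tail += coeff
        coeff = coeff * (trials - i) // (i + 1)  # becomes C(trials, i + 1)
    # P(X >= k) = tail / 2**trials; doubled for the two-sided test, capped at 1.
    return min(1.0, 2 * tail / 2 ** trials)


print(two_sided_sign_test(563, 961))
print(two_sided_sign_test(632, 1102))
print(two_sided_sign_test(597, 1072))
```

Python integers have arbitrary precision, so the tail sum is exact and only the final division is rounded to a float.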


Interesting! In all 3 cases the p-value is smaller than 0.05. Hence we can say that the ultimate_en33_ja33_vi33 model is significantly better than the other three on the Chinese treebank (at the 5% significance level).