# 1. Introduction

The aim of this exercise is to implement an HMM PoS tagger. To measure its performance, several experiments will be carried out on two datasets, one per language, extracted from Universal Dependencies. The languages we have chosen are Basque (BDT Treebank) and Japanese (GSD Treebank).

# 2. Dataset External Structure

As the output of the commands below shows, each corpus mainly consists of data already split into smaller datasets (train, dev and test).

In [1]:
!tree ../data/UD_Basque-BDT/

../data/UD_Basque-BDT/
├── eu_bdt-ud-dev.conllu
├── eu_bdt-ud-dev.txt
├── eu_bdt-ud-test.conllu
├── eu_bdt-ud-test.txt
├── eu_bdt-ud-train.conllu
├── eu_bdt-ud-train_mini.conllu
├── eu_bdt-ud-train.txt
├── LICENSE.txt
├── README.md
└── stats.xml

0 directories, 10 files


In [2]:
!tree ../data/UD_Japanese-GSD/

../data/UD_Japanese-GSD/
├── ja_gsd-ud-dev.conllu
├── ja_gsd-ud-dev.txt
├── ja_gsd-ud-test.conllu
├── ja_gsd-ud-test.txt
├── ja_gsd-ud-train.conllu
├── ja_gsd-ud-train.txt
├── LICENSE.txt
├── README.md
└── stats.xml

0 directories, 9 files


Each split comprises a plain text file, which contains only the raw sentences of the split, and a *conllu* file in which every sentence is annotated following the *Universal Dependencies* guidelines.

### Plain text files

In [3]:
!head ../data/UD_Basque-BDT/eu_bdt-ud-train.txt

Gero, lortutako masa molde batean jarri. Bestalde, "herri palestinarrari
laguntza tekniko eta ekonomikoa ematen jarraitzeko eta Estatu baketsu eta
demokratiko baten ordezkari diren erakunde palestinarrak indartzeko lanean
jarraitzeko konpromisoa" baieztatu zuen EBk. Tour hartan bistaratu zitzaizkon
lehendabiziko aldiz, mendian zituen benetako zailtasunak. Nik ere ez eta hasten
naiz inketatzen. Zidane, Henry, Barthez, Deschamps, Blanc eta enparauek
Eurokopako talde sendoena osatzen dute aditu gehienentzat. Napoliko erregeordea
zen Lemosko kondearen idazkari izan zen eta, bere anaia Bartolomek esaten
duenez, erre egin zituen bere poemak hiri hartan. Guk errespetu handia diogu
Alavesi, eta espero dugu partidu ona egitea, hiru puntuak irabazteko. Beraz,


In [4]:
!head ../data/UD_Japanese-GSD/ja_gsd-ud-train.txt

ホッケーにはデンジャラスプレーの反則があるので、膝より上にボールを浮かすことは基本的に反則になるが、その例外の一つがこのスクープである。 

また行きたい、そんな気持ちにさせてくれるお店です。 

手に持った特殊な刃物を使ったアクロバティックな体術や、揚羽と薄羽同様にクナイや忍具を使って攻撃してくる。 

3年次にはトータルオフェンスで2,892ヤードを獲得し、これは大学記録となった。 

葬儀の最中ですよ! 



### CONLLU files

In [5]:
!head ../data/UD_Basque-BDT/eu_bdt-ud-train.conllu

# sent_id = train-s1
# text = Gero, lortutako masa molde batean jarri.
1	Gero	gero	ADV	_	_	7	advmod	_	SpaceAfter=No
2	,	,	PUNCT	_	_	1	punct	_	_
3	lortutako	lortu	VERB	_	Case=Loc|VerbForm=Part	4	advcl	_	_
4	masa	masa	NOUN	_	Animacy=Inan|Case=Abs|Definite=Def|Number=Sing	7	obj	_	_
5	molde	molde	NOUN	_	_	7	obl	_	_
6	batean	bat	NUM	_	NumType=Card	5	nummod	_	_
7	jarri	jarri	VERB	_	VerbForm=Part	0	root	_	SpaceAfter=No
8	.	.	PUNCT	_	_	7	punct	_	_


In [6]:
!head ../data/UD_Japanese-GSD/ja_gsd-ud-train.conllu

# newdoc id = train-s1
# sent_id = train-s1
# text = ホッケーにはデンジャラスプレーの反則があるので、膝より上にボールを浮かすことは基本的に反則になるが、その例外の一つがこのスクープである。
1	ホッケー	ホッケー	NOUN	名詞-普通名詞-一般	_	9	obl	_	BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|LUWBILabel=B|LUWPOS=名詞-普通名詞-一般|SpaceAfter=No|UnidicInfo=,ホッケー,ホッケー,ホッケー,ホッケー,,,ホッケー,ホッケー,ホッケー
2	に	に	ADP	助詞-格助詞	_	1	case	_	BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|LUWBILabel=B|LUWPOS=助詞-格助詞|SpaceAfter=No|UnidicInfo=,に,に,に,ニ,,,ニ,ニ,に
3	は	は	ADP	助詞-係助詞	_	1	case	_	BunsetuBILabel=I|BunsetuPositionType=FUNC|LUWBILabel=B|LUWPOS=助詞-係助詞|SpaceAfter=No|UnidicInfo=,は,は,は,ワ,,,ハ,ハ,は
4	デンジャラス	デンジャラス	NOUN	名詞-普通名詞-一般	_	5	compound	_	BunsetuBILabel=B|BunsetuPositionType=CONT|LUWBILabel=B|LUWPOS=名詞-普通名詞-一般|SpaceAfter=No|UnidicInfo=,デンジャラス,デンジャラス,デンジャラス,デンジャラス,,,デンジャラス,デンジャラスプレー,デンジャラスプレー
5	プレー	プレー	NOUN	名詞-普通名詞-サ変可能	_	7	nmod	_	BunsetuBILabel=I|BunsetuPositionType=SEM_HEAD|LUWBILabel=I|LUWPOS=名詞-普通名詞-一般|SpaceAfter=No|UnidicInfo=,プレー,プレー,プレー,プレー,,,プレー,デンジャラスプレー,デンジャラスプレー
6	の	の	ADP	助詞-格助詞	_	5	ca

# 3. Dataset loading

To implement the HMM PoS tagger, we need to process the *conllu* files, because they contain the PoS tag of each token in every sentence of a data split. For this task, we only need to extract each token (second column) and its corresponding PoS tag (fourth column) from every sentence. This is done by instances of the **Dataset** class, which extract the information we need from each data split and store it in Python objects that are much easier to work with. In addition, **Dataset** instances compute some statistics that are useful both for implementing the HMM PoS tagger and for observing the distribution of sentences and PoS tags across the dataset.

In [6]:
from src.dataset_loader import Dataset
from pathlib import Path

basque_dataset = Dataset(
    dataset_name='UD_Basque-BDT',
    train_path=Path('../data/UD_Basque-BDT/eu_bdt-ud-train.conllu'),
    dev_path=Path('../data/UD_Basque-BDT/eu_bdt-ud-dev.conllu'),
    test_path=Path('../data/UD_Basque-BDT/eu_bdt-ud-test.conllu'),
)

japanese_dataset = Dataset(
    dataset_name='UD_Japanese-GSD',
    train_path=Path('../data/UD_Japanese-GSD/ja_gsd-ud-train.conllu'),
    dev_path=Path('../data/UD_Japanese-GSD/ja_gsd-ud-dev.conllu'),
    test_path=Path('../data/UD_Japanese-GSD/ja_gsd-ud-test.conllu'),
)
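
The `Dataset` class itself lives in `src.dataset_loader` and is not shown in this notebook. As a rough illustration of the extraction step described above, the following sketch reads CoNLL-U lines and keeps only the second and fourth columns. The function name `read_conllu` is hypothetical and this is not necessarily how `Dataset` is implemented; the sample lines are taken from the first Basque training sentence.

```python
def read_conllu(lines):
    """Return a list of sentences, each a list of (token, tag) tuples.

    Hypothetical helper for illustration; not the actual Dataset code.
    """
    sentences = []
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line marks the end of a sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines (sent_id, text, ...) carry no token information.
            continue
        columns = line.split("\t")
        token_id = columns[0]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        # Column 2 is the word form, column 4 is the universal PoS tag.
        current.append((columns[1], columns[3]))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "# sent_id = train-s1\n"
    "# text = Gero, lortutako masa molde batean jarri.\n"
    "1\tGero\tgero\tADV\t_\t_\t7\tadvmod\t_\tSpaceAfter=No\n"
    "2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)
parsed = read_conllu(sample.splitlines())
print(parsed)
```

Passing an open file handle instead of `sample.splitlines()` should work the same way, since the function only iterates over lines.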

# 4. Dataset Internal Structure

## 4.1 Sentence representation

After processing the *conllu* files, each sentence is represented as a list of tuples, where the first element of each tuple is a token and the second element is its corresponding PoS tag.

In [8]:
basque_dataset.train.data[0]

[('Gero', 'ADV'),
 (',', 'PUNCT'),
 ('lortutako', 'VERB'),
 ('masa', 'NOUN'),
 ('molde', 'NOUN'),
 ('batean', 'NUM'),
 ('jarri', 'VERB'),
 ('.', 'PUNCT')]

In [9]:
japanese_dataset.train.data[0]

[('ホッケー', 'NOUN'),
 ('に', 'ADP'),
 ('は', 'ADP'),
 ('デンジャラス', 'NOUN'),
 ('プレー', 'NOUN'),
 ('の', 'ADP'),
 ('反則', 'NOUN'),
 ('が', 'ADP'),
 ('ある', 'VERB'),
 ('の', 'SCONJ'),
 ('で', 'AUX'),
 ('、', 'PUNCT'),
 ('膝', 'NOUN'),
 ('より', 'ADP'),
 ('上', 'NOUN'),
 ('に', 'ADP'),
 ('ボール', 'NOUN'),
 ('を', 'ADP'),
 ('浮かす', 'VERB'),
 ('こと', 'NOUN'),
 ('は', 'ADP'),
 ('基本', 'NOUN'),
 ('的', 'PART'),
 ('に', 'AUX'),
 ('反則', 'NOUN'),
 ('に', 'ADP'),
 ('なる', 'VERB'),
 ('が', 'SCONJ'),
 ('、', 'PUNCT'),
 ('その', 'DET'),
 ('例外', 'NOUN'),
 ('の', 'ADP'),
 ('一', 'NUM'),
 ('つ', 'NOUN'),
 ('が', 'ADP'),
 ('この', 'DET'),
 ('スクープ', 'NOUN'),
 ('で', 'AUX'),
 ('ある', 'VERB'),
 ('。', 'PUNCT')]
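
Because each sentence is a list of `(token, tag)` pairs, the observations (tokens) and the hidden states (tags) an HMM works with can be separated with a single `zip`. The sketch below uses the first Basque sentence shown above; the variable names are illustrative only and are not part of the `Dataset` API.

```python
# First Basque training sentence, copied from the output above.
sentence = [
    ("Gero", "ADV"),
    (",", "PUNCT"),
    ("lortutako", "VERB"),
    ("masa", "NOUN"),
    ("molde", "NOUN"),
    ("batean", "NUM"),
    ("jarri", "VERB"),
    (".", "PUNCT"),
]

# Unzip the pairs into two parallel tuples.
tokens, tags = zip(*sentence)

# Consecutive tag pairs, the raw material for HMM transition counts.
tag_bigrams = list(zip(tags, tags[1:]))

print(tokens)
print(tags)
print(tag_bigrams)
```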

## 4.2 Split sizes

The number of sentences in each data split is distributed as follows for each language.

In [7]:
import plotly.express as px
import pandas as pd

basque_train_length = len(basque_dataset.train.data)
basque_dev_length = len(basque_dataset.dev.data)
basque_test_length = len(basque_dataset.test.data)

basque_split_lengths = pd.DataFrame(
    {
        'length': [basque_train_length, basque_dev_length, basque_test_length],
        'name': ['Train', 'Dev', 'Test']
    }
) 
basque_split_lengths_pie_chart = px.pie(
    data_frame=basque_split_lengths,
    names='name',
    values='length',
    title='Basque dataset split lengths'
)
basque_split_lengths_pie_chart.show()

In [8]:
japanese_train_length = len(japanese_dataset.train.data)
japanese_dev_length = len(japanese_dataset.dev.data)
japanese_test_length = len(japanese_dataset.test.data)

japanese_split_lengths = pd.DataFrame(
    {
        'length': [japanese_train_length, japanese_dev_length, japanese_test_length],
        'name': ['Train', 'Dev', 'Test']
    }
) 
japanese_split_lengths_pie_chart = px.pie(
    data_frame=japanese_split_lengths,
    names='name',
    values='length',
    title='Japanese dataset split lengths'
)
japanese_split_lengths_pie_chart.show()

## 4.3 Sentences

In the box plots below it can be observed that sentences in Japanese usually have more tokens than those in Basque. **THIS CAN BE DEVELOPED FURTHER**

In [38]:
basque_sentences_length_dataframe = pd.DataFrame(
    {
        'Sentence length': basque_dataset.train.statistics.sentences_length,
        'Language': ['Basque'] * len(basque_dataset.train.statistics.sentences_length)
    }
)
japanese_sentences_length_dataframe = pd.DataFrame(
    {
        'Sentence length': japanese_dataset.train.statistics.sentences_length,
        'Language': ['Japanese'] * len(japanese_dataset.train.statistics.sentences_length)
    }
)
sentences_length_dataframe = pd.concat([basque_sentences_length_dataframe, japanese_sentences_length_dataframe])

sentences_length_box_plot = px.box(
    data_frame=sentences_length_dataframe,
    x="Language",
    y="Sentence length",
    title="Sentence length distribution in train split"
)
sentences_length_box_plot.show()
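
A box plot can be complemented with a few summary numbers. The sketch below shows one way sentence lengths and their summary statistics might be computed from the list-of-tuples representation. The helper names are hypothetical, and the input is a toy example rather than real corpus data, so the printed values say nothing about Basque or Japanese.

```python
from statistics import median


def sentence_lengths(sentences):
    """Number of tokens in each sentence (hypothetical helper)."""
    return [len(sentence) for sentence in sentences]


def summarize(lengths):
    """Minimum, median, mean and maximum of a non-empty list of lengths."""
    return {
        "min": min(lengths),
        "median": median(lengths),
        "mean": sum(lengths) / len(lengths),
        "max": max(lengths),
    }


# Toy input: three made-up "sentences" with 2, 3 and 4 tokens.
toy_sentences = [
    [("a", "DET"), ("b", "NOUN")],
    [("a", "DET"), ("b", "NOUN"), (".", "PUNCT")],
    [("a", "DET"), ("b", "NOUN"), ("c", "VERB"), (".", "PUNCT")],
]
toy_lengths = sentence_lengths(toy_sentences)
toy_summary = summarize(toy_lengths)
print(toy_lengths, toy_summary)
```

Applied to `basque_dataset.train.data` and `japanese_dataset.train.data`, the same two functions would give per-language figures to quote next to the plot.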

## 4.4 Tags

**CONCLUSIONS ABOUT DATA**

In [14]:
import plotly.graph_objects as go

fig = go.Figure(
    layout=go.Layout(
        title=go.layout.Title(
            text="Individual PoS tag frequencies"
        )
    )
)
fig.add_trace(
    go.Bar(
        x=list(basque_dataset.train.statistics.individual_tag_frequencies.keys()),
        y=list(basque_dataset.train.statistics.individual_tag_frequencies.values()),
        name="Basque"
    )
)
fig.add_trace(
    go.Bar(
        x=list(japanese_dataset.train.statistics.individual_tag_frequencies.keys()),
        y=list(japanese_dataset.train.statistics.individual_tag_frequencies.values()),
        name="Japanese"
    )
)

fig.show()
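
The chart above plots raw counts, so a larger training split will show taller bars regardless of how its tags are distributed. One possible way to make the two languages more comparable is to plot relative frequencies instead. The sketch below illustrates how tag counts and their normalized version could be derived; `count_tag_frequencies` and `to_relative` are hypothetical names, not part of the `Dataset` statistics, and the input is the first Basque sentence only.

```python
from collections import Counter


def count_tag_frequencies(sentences):
    """Count how often each PoS tag occurs across all sentences."""
    return Counter(tag for sentence in sentences for _, tag in sentence)


def to_relative(frequencies):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(frequencies.values())
    return {tag: count / total for tag, count in frequencies.items()}


# First Basque training sentence, as shown in section 4.
toy_corpus = [
    [
        ("Gero", "ADV"),
        (",", "PUNCT"),
        ("lortutako", "VERB"),
        ("masa", "NOUN"),
        ("molde", "NOUN"),
        ("batean", "NUM"),
        ("jarri", "VERB"),
        (".", "PUNCT"),
    ]
]
frequencies = count_tag_frequencies(toy_corpus)
relative = to_relative(frequencies)
print(frequencies)
print(relative)
```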