Workflow:
1. Extract information from the CoNLL-U files
2. Translate
3. Tokenize the English translations with Stanza
4. Align words, replace the English translations of named entities (NE) with lemmas from the source, and transfer the source NE annotations to the aligned translated words
5. Linguistically process the English translation with Stanza (lemmas, POS)
6. Parse the CoNLL-U file and add the additional information (sentence ids, alignments, NER annotations)

In [85]:
from conllu import parse
import pandas as pd
import os

In [131]:
# Get a list of CoNLL-U files

path = "ParlaMint-SI/ParlaMint-SI.conllu"
dir_list = os.listdir(path)

# Keep only files with parliamentary sessions:
parl_list = [i for i in dir_list if "ParlaMint-SI_" in i and ".conllu" in i]

len(parl_list)

414

In [132]:
parl_list[0]

'ParlaMint-SI_2020-05-27-SDZ8-Redna-17.conllu'

## Extract information from CoNLL-U files

In [101]:
# Import a test file - one that was also used in the big sample
with open("ParlaMint-SI/ParlaMint-SI.conllu/ParlaMint-SI_2019-10-23-SDZ8-Redna-12.conllu", "r", encoding="utf-8") as f:
	data = f.read()

In [None]:
"""
Format:

# newdoc id = ParlaMint-SI_2014-08-01-SDZ7-Redna-01.u1
# newpar id = ParlaMint-SI_2014-08-01-SDZ7-Redna-01.seg1
# sent_id = ParlaMint-SI_2014-08-01-SDZ7-Redna-01.seg1.1
# text = Spoštovani, prosim, da zasedete svoja mesta.
1	Spoštovani	spoštovan	ADJ	Appmpn	Case=Nom|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part	3	discourse	_	NER=O|SpaceAfter=No
2	,	,	PUNCT	Z	_	1	punct	_	NER=O
3	prosim	prositi	VERB	Vmpr1s	Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	0	root	_	NER=O|SpaceAfter=No
4	,	,	PUNCT	Z	_	6	punct	_	NER=O
5	da	da	SCONJ	Cs	_	6	mark	_	NER=O
6	zasedete	zasesti	VERB	Vmer2p	Aspect=Perf|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin	3	ccomp	_	NER=O
7	svoja	svoj	DET	Px-npa	Case=Acc|Gender=Neut|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes	8	det	_	NER=O
8	mesta	mesto	NOUN	Ncnpa	Case=Acc|Gender=Neut|Number=Plur	6	obj	_	NER=O|SpaceAfter=No
9	.	.	PUNCT	Z	_	3	punct	_	NER=O
"""

In [None]:
# CONLLU parser cheatsheet
"""
# Find which words are proper names with the filtering function
sentence.filter(misc__NER=lambda x: x != "O")

# Adding new metadata to the file
sentences[0].metadata["alignment"] =  "1-1"

# To turn back into conll-u
print(sentences[1].serialize())
"""

Extract from each sentence in the CoNLL-U file:
- sent_id (in metadata) (# sent_id = ParlaMint-SI_2014-08-01-SDZ7-Redna-01.seg1.1)
- "text" (in metadata): to be fed into the MT system (# text = Spoštovani, prosim, da zasedete svoja mesta.)
- tokenized text (punctuation separated from words by a space): iterate through the tokens in the sentence, collect them in a list and join them into a string (["Spoštovani", ",", "prosim", ",", "da"] -> "Spoštovani , prosim , da")
- NE annotations for all tokens, keyed by the 0-based token index: {0: "O", 1: "O", 2: "B-PER", 3: "I-PER"}
- for tokens annotated as personal names (PER), the word form and the lemma, keyed by the same index: {2: ["Marjanu", "Marjan"], 3: ["Maučecu", "Maučec"]}

In [134]:
# Create a df with a single placeholder row (removed after the concatenation)
df = pd.DataFrame({"file": [""], "sentence_id": [""], "text": [""], "tokenized_text": [""], "NER": [""], "proper_nouns": [""]})

In [135]:
# Parse the data with the CoNLL-U parser

for doc in parl_list[:3]:
	# Open the file (the context manager closes it again after reading)
	with open(os.path.join(path, doc), "r", encoding="utf-8") as f:
		data = f.read()

	sentences = parse(data)

	sentence_id_list = []
	text_list = []
	tokenized_text_list = []
	NER_list = []
	proper_noun_list = []

	for sentence in sentences:
		# Find sentence ids
		current_sentence_id = sentence.metadata["sent_id"]
		sentence_id_list.append(current_sentence_id)

		# Find text
		current_text = sentence.metadata["text"]
		text_list.append(current_text)

		# Create a string out of tokens
		current_token_list = []
		current_ner_dict = {}
		word_dict = {}

		for token in sentence:
			current_token_list.append(token["form"])

			# Create a dict of NE annotations keyed by word index.
			# Subtract one from the word index, because indexing in the CoNLL-U file starts at 1, not 0
			current_index = int(token["id"]) - 1

			current_ner_dict[current_index] = token["misc"]["NER"]

			# Add the word form and the lemma if the NE is a personal name
			if "PER" in token["misc"]["NER"]:
				word_dict[current_index] = [token["form"], token["lemma"]]

		proper_noun_list.append(word_dict)

		current_string = " ".join(current_token_list)

		tokenized_text_list.append(current_string)
		NER_list.append(current_ner_dict)
	
	new_df = pd.DataFrame({"sentence_id": sentence_id_list, "text": text_list, "tokenized_text": tokenized_text_list, "NER": NER_list, "proper_nouns": proper_noun_list})

	new_df["file"] = doc

	# Merge df to the previous df
	df = pd.concat([df, new_df])

In [137]:
# Reset index
df = df.reset_index(drop=True)

# Remove the first row
df = df.drop([0], axis="index")

# Show the results
df.describe(include="all")

Unnamed: 0,file,sentence_id,text,tokenized_text,NER,proper_nouns
count,8220,8220,8220,8220,8220,8220
unique,3,8220,7388,7388,2184,722
top,ParlaMint-SI_2020-05-27-SDZ8-Redna-17.conllu,ParlaMint-SI_2020-05-27-SDZ8-Redna-17.seg1.1,Hvala lepa.,Hvala lepa .,"{0: 'O', 1: 'O'}",{}
freq,3487,1,211,211,329,7404


In [156]:
df.head(2)

Unnamed: 0,file,sentence_id,text,tokenized_text,NER,proper_nouns
1,ParlaMint-SI_2020-05-27-SDZ8-Redna-17.conllu,ParlaMint-SI_2020-05-27-SDZ8-Redna-17.seg1.1,Nadaljujemo s prekinjeno 17. sejo zbora.,Nadaljujemo s prekinjeno 17. sejo zbora .,"{0: 'O', 1: 'O', 2: 'O', 3: 'O', 4: 'O', 5: 'O...",{}
2,ParlaMint-SI_2020-05-27-SDZ8-Redna-17.conllu,ParlaMint-SI_2020-05-27-SDZ8-Redna-17.seg2.1,"Prehajamo na 2. TOČKO DNEVNEGA REDA, TO JE NA ...","Prehajamo na 2. TOČKO DNEVNEGA REDA , TO JE NA...","{0: 'O', 1: 'O', 2: 'O', 3: 'O', 4: 'O', 5: 'O...",{}


In [155]:
# Inspect an example
df.iloc[23].to_dict()

{'file': 'ParlaMint-SI_2020-05-27-SDZ8-Redna-17.conllu',
 'sentence_id': 'ParlaMint-SI_2020-05-27-SDZ8-Redna-17.seg11.1',
 'text': 'Besedo dajem Marjanu Maučecu, predstavniku Državnega sveta kot predlagatelja predloga zakona za predstavitev stališča do predloga matičnega delovnega telesa.',
 'tokenized_text': 'Besedo dajem Marjanu Maučecu , predstavniku Državnega sveta kot predlagatelja predloga zakona za predstavitev stališča do predloga matičnega delovnega telesa .',
 'NER': {0: 'O',
  1: 'O',
  2: 'B-PER',
  3: 'I-PER',
  4: 'O',
  5: 'O',
  6: 'B-ORG',
  7: 'I-ORG',
  8: 'O',
  9: 'O',
  10: 'O',
  11: 'O',
  12: 'O',
  13: 'O',
  14: 'O',
  15: 'O',
  16: 'O',
  17: 'O',
  18: 'O',
  19: 'O',
  20: 'O'},
 'proper_nouns': {2: ['Marjanu', 'Marjan'], 3: ['Maučecu', 'Maučec']}}

In [None]:
# Save the dataframe
df.to_csv("Parlamint-SI-sentences-conllu-workflow-sample.csv", sep="\t")

## Translate

## Word alignment

### Tokenization with Stanza

- We apply Stanza tokenization to the translation, with tokenize_no_ssplit enabled so that a translated sentence is not split into multiple sentences.
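
The tokenization step could look roughly like the sketch below. It assumes the English Stanza models are already downloaded; the helper names `build_tokenizer` and `tokenize_translation` are hypothetical and not part of the notebook. Stanza is imported inside `build_tokenizer` so that the second helper can also be tried out with any object that behaves like a Stanza pipeline.

```python
def build_tokenizer():
    """Create an English Stanza pipeline that tokenizes without sentence splitting."""
    import stanza  # imported lazily; needs the English models, e.g. stanza.download("en")

    # tokenize_no_ssplit=True keeps each input text as a single sentence
    # (Stanza then only splits sentences on blank lines).
    return stanza.Pipeline(lang="en", processors="tokenize", tokenize_no_ssplit=True)


def tokenize_translation(nlp, translation):
    """Return the translation as a space-separated string of tokens."""
    doc = nlp(translation)
    return " ".join(token.text for sentence in doc.sentences for token in sentence.tokens)


# Usage (hypothetical input sentence):
# nlp = build_tokenizer()
# tokenized = tokenize_translation(nlp, "Thank you very much.")
```

The space-joined string mirrors the `tokenized_text` column created for the source sentences, so both sides of a sentence pair can be passed to the aligner in the same format.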

### Alignment

- Perform word alignment.
- Save forward and reverse alignment information for each sentence (2 additional columns).
- Transfer NE annotations to the translated sentence based on the alignment: add a column that records which English token each annotation belongs to (e.g. [{3: "B-PER", 5: "I-LOC"}]).
- Substitute the translated NE words with the source lemmas based on the annotation, and save the new translation to a new column.
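
A minimal sketch of the last two steps, under some assumptions: the aligner returns Pharaoh-style pairs ("source-target") with 0-based indices, and the `NER` and `proper_nouns` dicts have the shape produced in the extraction step. The function names, the English tokens and the alignment string are made up for illustration.

```python
def parse_alignment(alignment):
    """Turn a Pharaoh-style string such as "0-0 1-2" into (source, target) index pairs."""
    pairs = []
    for pair in alignment.split():
        src, tgt = pair.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def project_ner(alignment, source_ner):
    """Map each aligned target index to the non-"O" NE tag of its source token."""
    target_ner = {}
    for src, tgt in parse_alignment(alignment):
        tag = source_ner.get(src, "O")
        if tag != "O":
            # If several source tokens align to one target token, the first tag wins.
            target_ner.setdefault(tgt, tag)
    return target_ner


def substitute_lemmas(alignment, proper_nouns, target_tokens):
    """Replace target tokens aligned to personal names with the source lemma."""
    new_tokens = list(target_tokens)
    for src, tgt in parse_alignment(alignment):
        if src in proper_nouns and tgt < len(new_tokens):
            new_tokens[tgt] = proper_nouns[src][1]  # [form, lemma] -> lemma
    return " ".join(new_tokens)


# Shortened source sentence: "Besedo dajem Marjanu Maučecu ."
source_ner = {0: "O", 1: "O", 2: "B-PER", 3: "I-PER", 4: "O"}
proper_nouns = {2: ["Marjanu", "Marjan"], 3: ["Maučecu", "Maučec"]}

# Hypothetical translation and forward alignment, for illustration only.
target_tokens = ["I", "give", "the", "floor", "to", "Marjan", "Maucec", "."]
alignment = "0-3 1-1 2-5 3-6 4-7"

target_ner = project_ner(alignment, source_ner)
new_translation = substitute_lemmas(alignment, proper_nouns, target_tokens)
print(target_ner)
print(new_translation)
```

Letting the first aligned tag win is only one possible way to resolve many-to-one alignments; unaligned target tokens are simply absent from `target_ner` and can be treated as "O" later.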

## Linguistic processing of translated text

- We use Stanza to get POS tags and lemmas, passing in the pre-tokenized text created in the previous steps.
- Transform the result into CoNLL-U (which should contain tokens, lemmas and POS). Parse the CoNLL-U file and add: 1) sentence_id as metadata, 2) forward and reverse alignment as metadata (# align_s = 1-1 2-2... and # align_t = 1-1 2-2...), 3) based on the alignment, NER information for each token (misc = {NER:} field).
- Save the file as CoNLL-U with the same name as the source CoNLL-U file (so each file is saved separately). The number of sentences should be the same as in the source CoNLL-U and ANA files.
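
To make the target format concrete, here is a dependency-free sketch that writes one sentence block by hand. It is not the notebook's method (the cells below use the conllu parser for this); the function name `to_conllu_sentence`, the sentence id and all example values are hypothetical. Unknown columns (XPOS, FEATS, HEAD, DEPREL, DEPS) are filled with "_".

```python
def to_conllu_sentence(sent_id, text, tokens, lemmas, upos, target_ner, align_s, align_t):
    """Build one CoNLL-U sentence block with alignment metadata and NER in MISC."""
    lines = [
        "# sent_id = {}".format(sent_id),
        "# text = {}".format(text),
        "# align_s = {}".format(align_s),
        "# align_t = {}".format(align_t),
    ]
    for i, (form, lemma, pos) in enumerate(zip(tokens, lemmas, upos)):
        # Tokens without a projected annotation get the tag "O".
        ner = target_ner.get(i, "O")
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        columns = [str(i + 1), form, lemma, pos, "_", "_", "_", "_", "_", "NER={}".format(ner)]
        lines.append("\t".join(columns))
    # A blank line terminates every sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


# Hypothetical example values.
block = to_conllu_sentence(
    sent_id="example.seg1.1",
    text="Thank you, Marjan.",
    tokens=["Thank", "you", ",", "Marjan", "."],
    lemmas=["thank", "you", ",", "Marjan", "."],
    upos=["VERB", "PRON", "PUNCT", "PROPN", "PUNCT"],
    target_ner={3: "B-PER"},  # 0-based target token index -> NE tag
    align_s="0-0 1-2 2-3 3-4",
    align_t="0-0 2-1 3-2 4-3",
)
print(block)
```

Writing the blocks for all sentences of a document into one file, in source order, keeps the sentence count identical to the source file.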

In [None]:
# Adding new metadata to the file
sentences[0].metadata["alignment"] =  "1-1"
sentences[0].metadata

In [None]:
# To turn back into conll-u
print(sentences[1].serialize())