Workflow:
1. Extract information from the CONLL-U
2. Translate
3. Tokenize English translations with Stanza
4. Word alignment, substitute English NE translations with lemmas from the source, get information on NE annotations for each translated word from the source annotations
5. Linguistically process English translation with Stanza (lemmas, POS)
6. Parse CONLL-u file and add additional information (sentence ids, alignments, NER annotations)

In [1]:
from conllu import parse
import pandas as pd
import os
import zipfile

In [None]:
# Unzip the folder with the files
#with zipfile.ZipFile("/home/tajak/Parlamint-translation/ParlaMint-CZ/ParlaMint-CZ.conllu.zip", 'r') as zip_ref:
#    zip_ref.extractall("/home/tajak/Parlamint-translation/ParlaMint-CZ/ParlaMint-CZ.conllu")

In [2]:
# Define the main information
lang_code = "CZ"
path = "/home/tajak/Parlamint-translation/ParlaMint-CZ/ParlaMint-CZ.conllu/ParlaMint-CZ.conllu"
opus_lang_code = "cs"

# Create a folder named as the lang_code under results beforehand defining the following paths
extracted_dataframe_path = "/home/tajak/Parlamint-translation/results/{}/ParlaMint-{}-extracted-source-data.csv".format(lang_code, lang_code)

translated_dataframe_path = "/home/tajak/Parlamint-translation/results/{}/ParlaMint-{}-translated.csv".format(lang_code, lang_code)
translated_tokenized_dataframe_path = "/home/tajak/Parlamint-translation/results/{}/ParlaMint-{}-translated.csv".format(lang_code, lang_code)
final_dataframe = "/home/tajak/Parlamint-translation/results/{}/ParlaMint-{}-final-dataframe.csv".format(lang_code, lang_code)

In [3]:
parl_list = []
file_name_list = []

for dir1 in os.listdir(path):
    full_path = os.path.join(path, dir1)
    if os.path.isdir(full_path):
        current = os.listdir(full_path)
        # Keep only files with parliamentary sessions:
        for file in current:
            if "ParlaMint-{}_".format(lang_code) in file:
                if ".conllu" in file:
                    final_path = "{}/{}".format(full_path, file)
                    parl_list.append(final_path)
                    file_name_list.append(file)

print(parl_list[:2])
print(file_name_list[:2])

['/home/tajak/Parlamint-translation/ParlaMint-CZ/ParlaMint-CZ.conllu/ParlaMint-CZ.conllu/2013/ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.conllu', '/home/tajak/Parlamint-translation/ParlaMint-CZ/ParlaMint-CZ.conllu/ParlaMint-CZ.conllu/2013/ParlaMint-CZ_2013-12-10-ps2013-004-01-004-003.conllu']
['ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.conllu', 'ParlaMint-CZ_2013-12-10-ps2013-004-01-004-003.conllu']


In [4]:
# See how many files we have:
len(parl_list)

6328

## Extract information from CONLL-U files

In [None]:
"""
Format:

# sent_id = ParlaMint-CZ_2013-11-25-ps2013-001-01-000-000.u1.p4.s3
# text = Dovolte mi tedy, abych vás seznámila s omluvami, které předložili členové vlády.
1	Dovolte	dovolit	VERB	_	Aspect=Perf|Mood=Imp|Number=Plur|Person=2|Polarity=Pos|VerbForm=Fin	0	root	_	NER=O
2	mi	já	PRON	_	Case=Dat|Number=Sing|Person=1|PronType=Prs|Variant=Short	1	obl:arg	_	NER=O
3	tedy	tedy	ADV	_	_	1	advmod	_	NER=O|SpaceAfter=No
4	,	,	PUNCT	_	_	8	punct	_	NER=O
5-6	abych	_	_	_	_	_	_	_	NER=O
5	aby	aby	SCONJ	_	_	8	mark	_	_
6	bych	být	AUX	_	Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin	8	aux	_	_
7	vás	ty	PRON	_	Case=Acc|Number=Plur|Person=2|PronType=Prs	8	obj	_	NER=O
8	seznámila	seznámit	VERB	_	Aspect=Perf|Gender=Fem,Neut|Number=Plur,Sing|Polarity=Pos|Tense=Past|VerbForm=Part|Voice=Act	1	ccomp	_	NER=O
9	s	s	ADP	_	AdpType=Prep|Case=Ins	10	case	_	NER=O
10	omluvami	omluva	NOUN	_	Case=Ins|Gender=Fem|Number=Plur|Polarity=Pos	8	obl:arg	_	NER=O|SpaceAfter=No
11	,	,	PUNCT	_	_	13	punct	_	NER=O
12	které	který	DET	_	Case=Acc|Gender=Fem|Number=Plur|PronType=Int,Rel	13	obj	_	NER=O
13	předložili	předložit	VERB	_	Animacy=Anim|Aspect=Perf|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Past|VerbForm=Part|Voice=Act	10	acl:relcl	_	NER=O
14	členové	člen	NOUN	_	Animacy=Anim|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos	13	nsubj	_	NER=O
15	vlády	vláda	NOUN	_	Case=Gen|Gender=Fem|Number=Sing|Polarity=Pos	14	nmod	_	NER=O|SpaceAfter=No
16	.	.	PUNCT	_	_	1	punct	_	NER=O
"""

In [None]:
"""
Format of the parsed info (with conll-u parser):

token inside a token list:

{'id': 1,
 'form': '3',
 'lemma': '3',
 'upos': 'NUM',
 'xpos': None,
 'feats': {'NumForm': 'Digit', 'NumType': 'Card'},
 'head': 0,
 'deprel': 'root',
 'deps': None,
 'misc': {'NER': 'O', 'SpaceAfter': 'No'}}


 metadata:
 
metadata={newdoc id: "ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.u1",
newpar id: "ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.u1.p1",
sent_id: "ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.u1.p1.s1",
text: "3."}

"""

The problem with this language is that we have multiword parts where one word is segmented into two syntactic units, as we can see in the case 5-6.  I need to transfer the NER annotation to the appropriate words. I will do this by applying the following rule: if a word does not have NER annotation, it should take the annotation from the word above it (the number-number word.) Then I will discard words that appear under indices in a form "number-number" from the tokenized text (to assure proper alignment).

I found out that the parts of the multiword token do not have misc and NER annotation (misc is None), while the annotation is in the multiword token (which does not have annotations for lemmas, pos tags and all other annotations). We can find multiword token by searching for the type of the token id - in all other cases, the type is integer, while here it is a tuple (it is printed as (5, -, 6)).

In [None]:
# CONLLU parser cheatsheet
"""
# Find which words are proper names with the filtering function
sentence.filter(misc__NER=lambda x: x != "O")

# Adding new metadata to the file
sentences[0].metadata["alignment"] =  "1-1"

# To turn back into conll-u
print(sentences[1].serialize())
"""

Extract from each sentence in the CONLL-u file:
- sent_id (in metadata) (# sent_id = ParlaMint-SI_2014-08-01-SDZ7-Redna-01.seg1.1)
- "text" (in metadata): to be feed into the MT system (# text = Spoštovani, prosim, da zasedete svoja mesta.)
- tokenized text (punctuation separated from words by space): by iterating through the tokens in the sentence - create a list of tokens and join them into a string (["Spoštovani", "prosim", ",", "da"] -> "Spoštovani prosim , da). In case of multiword tokens, we will add the subword tokens to the tokenized text and skip the multiword token. We will also get all necessary information about the ids and lemmas from the subword tokens. The subword tokens do not have the NER annotation, so we will use the multiword annotation for all of its subparts.
- information on the proper nouns: if the word is annotated as a proper noun (has "PER" in ner attribute), take its index, form and lemma and save it into a dictionary for each sentence ({0: (Taje, Taja), 1: (Kuzman, Kuzman)})

In [21]:
# Create an empty df
df = pd.DataFrame({"file_path": [""],"file": [""], "sentence_id": [""], "text": [""], "tokenized_text": [""], "NER": [""], "proper_nouns": [""]})

In [9]:
# Check whether there are any problems with parsing the documents
"""
error_count = 0
problematic_doc_list = []

for doc in parl_list:
	try:
		# Open the file
		data = open("{}".format(doc), "r").read()

		sentences = parse(data)
	except:
		error_count += 1
		problematic_doc_list.append(doc)

print(error_count)
print(problematic_doc_list)
"""

0
[]


In [22]:
# Parse the data with CONLL-u parser - to check if everything works, I will parse 100 files for now

for doc in parl_list[:100]:
	# Open the file
	data = open("{}".format(doc), "r").read()
	
	sentences = parse(data)

	sentence_id_list = []
	text_list = []
	tokenized_text_list = []
	proper_noun_list = []

	for sentence in sentences:
		# Find sentence ids
		current_sentence_id = sentence.metadata["sent_id"]
		sentence_id_list.append(current_sentence_id)

		# Find text - if texts consists of multiword tokens, these tokens will appear as they are,
		# not separated into subwords
		current_text = sentence.metadata["text"]
		text_list.append(current_text)

		# Create a string out of tokens
		current_token_list = []
		word_dict = {}

		for token in sentence:
			# Find multiword tokens and take their NER
			if type(token["id"]) != int:
				multiword_ner = token["misc"]["NER"]
			
			else:
			# Append to the tokenized text tokens that are not multiword tokens
			# (we append subtokens to the tokenized texts, not multiword tokens)
				current_token_list.append(token["form"])

				# Create a list of NE annotations with word indices.
				# I'll substract one from the word index,
				# because indexing in the CONLLU file starts with 1, not 0
				current_index = int(token["id"]) - 1

				# If the word does not have NER annotation,
				# take the annotation from the multiword token
				if token["misc"] is None:
					current_ner = multiword_ner
				else:
					current_ner = token["misc"]["NER"]

				# Add information on the lemma if the NE is personal name
				# and if the word is a PROPN
				if token["upos"] == "PROPN":
					if "PER" in current_ner:
						word_dict[current_index] = [token["form"], token["lemma"]]

		proper_noun_list.append(word_dict)

		current_string = " ".join(current_token_list)

		tokenized_text_list.append(current_string)
	
	new_df = pd.DataFrame({"sentence_id": sentence_id_list, "text": text_list, "tokenized_text": tokenized_text_list, "proper_nouns": proper_noun_list})

	new_df["file_path"] = doc

	# Get the file name
	file_name = file_name_list[parl_list.index(doc)]
	new_df["file"] = file_name

	# Merge df to the previous df
	df = pd.concat([df, new_df])

Extracting text from all 6000+ files of the CZ corpora took 25 minutes.

In [11]:
"""
# Parse the data with CONLL-u parser - code for parsing for NER annotations as well

for doc in parl_list:
	# Open the file
	data = open("{}".format(doc), "r").read()
	
	sentences = parse(data)

	sentence_id_list = []
	text_list = []
	tokenized_text_list = []
	NER_list = []
	proper_noun_list = []

	for sentence in sentences[:10]:
		# Find sentence ids
		current_sentence_id = sentence.metadata["sent_id"]
		sentence_id_list.append(current_sentence_id)

		# Find text
		current_text = sentence.metadata["text"]
		text_list.append(current_text)

		# Create a string out of tokens
		current_token_list = []
		current_ner_dict = {}
		word_dict = {}

		for token in sentence:
			# Find multiword tokens and take their NER
			if type(token["id"]) != int:
				multiword_ner = token["misc"]["NER"]
			
			else:
				current_token_list.append(token["form"])

				# Create a list of NE annotations with word indices.
				# I'll substract one from the word index, because indexing in the CONLLU file starts with 1, not 0
				current_index = int(token["id"]) - 1

				# If the word does not have NER annotation, take the annotation from the multiword token
				if token["misc"] is None:
					current_ner = multiword_ner
				else:
					current_ner = token["misc"]["NER"]
				
				current_ner_dict[current_index] = current_ner

				# Add information on the lemma if the NE is personal name
				# if there will be a case where a multiword token is annotated with PER, this will break, but I assume this won't happen
				if "PER" in current_ner:
					word_dict[current_index] = [token["form"], token["lemma"]]

		proper_noun_list.append(word_dict)

		current_string = " ".join(current_token_list)

		tokenized_text_list.append(current_string)
		NER_list.append(current_ner_dict)
	
	new_df = pd.DataFrame({"sentence_id": sentence_id_list, "text": text_list, "tokenized_text": tokenized_text_list, "NER": NER_list, "proper_nouns": proper_noun_list})

	new_df["file_path"] = doc

	# Get the file name
	file_name = file_name_list[parl_list.index(doc)]
	new_df["file"] = file_name

	# Merge df to the previous df
	df = pd.concat([df, new_df])
"""

In [23]:
# Reset index
df = df.reset_index(drop=True)

# Remove the first row
df = df.drop([0], axis="index")

# Reset index
df = df.reset_index(drop=True)

# Show the results
df.describe(include="all")

Unnamed: 0,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns
count,15117,15117,15117,15117,15117,0.0,15117
unique,100,100,15117,12796,12796,0.0,1610
top,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-16-ps2013-004-03-003-018....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Děkuji.,Děkuji .,,{}
freq,1506,1506,1,516,516,,13281


In [24]:
# Inspect what happened with sentences that contain multiword tokens
df[df["sentence_id"] == "ParlaMint-CZ_2013-11-25-ps2013-001-01-000-000.u1.p4.s3"].to_dict()

{'file_path': {10147: '/home/tajak/Parlamint-translation/ParlaMint-CZ/ParlaMint-CZ.conllu/ParlaMint-CZ.conllu/2013/ParlaMint-CZ_2013-11-25-ps2013-001-01-000-000.conllu'},
 'file': {10147: 'ParlaMint-CZ_2013-11-25-ps2013-001-01-000-000.conllu'},
 'sentence_id': {10147: 'ParlaMint-CZ_2013-11-25-ps2013-001-01-000-000.u1.p4.s3'},
 'text': {10147: 'Dovolte mi tedy, abych vás seznámila s omluvami, které předložili členové vlády.'},
 'tokenized_text': {10147: 'Dovolte mi tedy , aby bych vás seznámila s omluvami , které předložili členové vlády .'},
 'NER': {10147: nan},
 'proper_nouns': {10147: {}}}

In [25]:
df.head()

Unnamed: 0,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns
0,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,3.,3 .,,{}
1,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Návrh zasedacího pořádku poslanců v jednacím s...,Návrh zasedacího pořádku poslanců v jednacím s...,,{}
2,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Podle § 52 odst. 1 našeho jednacího řádu Posla...,Podle § 52 odst . 1 našeho jednacího řádu Posl...,,{}
3,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Já bych k této věci otevřel rozpravu.,Já bych k této věci otevřel rozpravu .,,{}
4,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Pan poslanec a předseda klubu ODS Zbyněk Stanj...,Pan poslanec a předseda klubu ODS Zbyněk Stanj...,,"{6: ['Zbyněk', 'Zbyněk'], 7: ['Stanjura', 'Sta..."


In [26]:
df.proper_nouns.value_counts()

TypeError: unhashable type: 'dict'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 5231, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'dict'


{}                                                                                  13281
{2: ['Kalousek', 'Kalousek']}                                                          10
{5: ['Martina', 'Martin'], 6: ['Kolovratníka', 'Kolovratník']}                          8
{7: ['Kalousek', 'Kalousek']}                                                           6
{6: ['Martina', 'Martin'], 7: ['Kolovratníka', 'Kolovratník']}                          6
                                                                                    ...  
{4: ['Helena', 'Helena'], 5: ['Válková', 'Válková']}                                    1
{3: ['Stanjurovi', 'Stanjur'], 28: ['Jeroným', 'Jeroným'], 29: ['Tejc', 'Tejc']}        1
{2: ['Stanjura', 'Stanjura'], 17: ['Tejc', 'Tejc']}                                     1
{5: ['Milan', 'Milan'], 6: ['Štěch', 'Štěch']}                                          1
{14: ['Sklenáka', 'Sklenák']}                                                           1
Name: prop

In [27]:
# Add information on length
df["length"] = df["text"].str.split().str.len()

print("Number of words in the corpora: {}".format(df["length"].sum()))

df.describe()

Number of words in the corpora: 228433


Unnamed: 0,length
count,15117.0
mean,15.111001
std,13.67416
min,1.0
25%,5.0
50%,11.0
75%,21.0
max,193.0


In [28]:
# Save the dataframe
df.to_csv("{}".format(extracted_dataframe_path), sep="\t")

## Translate

In [3]:
# Open the df
df = pd.read_csv("{}".format(extracted_dataframe_path), sep="\t", index_col = 0)
df.head(2)

Unnamed: 0,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns,length
0,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,3.,3 .,,{},1
1,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Návrh zasedacího pořádku poslanců v jednacím s...,Návrh zasedacího pořádku poslanců v jednacím s...,,{},9


We need to translate the following corpora into English:
- Belgian (BE) - which language??
- Bulgarian (BG)
- Croatian (HR) - We will use "South Slavic MT" based on the manual analysis
- Czech (CZ)
- Danish (DK)
- Dutch (NL)
- French (FR)
- Hungarian (HU) - multilingual model only
- Icelandic (IS)
- Italian (IT)
- Latvian (LV)
- Lithuanian (LT)
- Polish (PL)
- Slovenian (SI) - We will use "Slavic MT" based on the results of the manual analysis
- Spanish? (ES)
- Turkish (TR)
- Austrian (AT)
- Basque (ES-PV)
- Bosnian (BA)
- Catalan (ES-CT)
- Estonian (EE)
- Finnish (FI)
- Galician (ES-GA)
- Greek (GR)
- Norwegian (NO) - NO OPUS-MT model (!) - we can use GT or eTranslation
- Portuguese (PT)
- Romanian (RO)
- Serbian (RS)
- Swedish (SE)
- Ukrainian (UA)

Explanation of language codes:
- sla = Slavic
- zls = South Slavic
- zlw = West Slavic
- zle = East Slavic
- gmq = North Germanic
- gem = Germanic
- gmw = West Germanic
- roa = Romance
- itc = Italic
- bat = Baltic
- trk = Turkic
- urj = Uralic
- fiu = Finno-Ugrian

choose_model()

In [4]:
def choose_model(lang_code):
	"""
	Compare a small sample of translations of all OPUS-MT models that are available
	for the language, to decide which one to use. The function prints out a dataframe with all translations of the sample and saves it as ParlaMint-{lang_code}-sample-model-comparison.csv.

	Args:
	- lang_code: the lang code that is used in the names of the files, it should be the same as for extract_text()
	"""
	import pandas as pd
	import regex as re
	from easynmt import EasyNMT
	from IPython.display import display
	
	lang_models_dict = {"BG": ["bg", "sla", "zls"], "HR": ["zls"], "CZ": ["cs", "sla", "zlw" ], "DK": ["da", "gmq", "gem"], "NL": ["nl", "gem", "gmw"], "FR": ["fr", "itc","roa"], "HU": ["mul"], "IS": ["is","gmq", "gem"], "IT": ["it", "roa", "itc"], "LV": ["lv","bat"], "LT": ["bat"], "PL": ["pl", "sla", "zlw"], "SI": ["sla", "zls"], "ES": ["es", "roa", "itc"], "TR": ["tr", "trk" ], "AT": ["de", "gem", "gmw"], "ES-PV": ["eu", "mul"], "BA": ["sla", "zls"], "ES-CT": ["ca", "roa", "itc"], "EE": ["et", "urj", "fiu"], "FI": ["fi", "urj", "fiu"], "ES-GA": ["gl", "roa", "itc"], "GR": ["grk"], "PT": ["roa", "itc"], "RO":["roa", "itc"], "RS": ["zls", "sla"], "SE": ["sv", "gmq", "gem"], "UA":["uk", "sla", "zle"]}

	# Open the file, created in the previous step
	df = pd.read_csv("{}".format(extracted_dataframe_path), sep="\t", index_col=0)

	# Define the model
	model = EasyNMT('opus-mt')

	print("Entire corpus has {} sentences and {} words.".format(df["text"].count(), df["length"].sum()))

	# Create a smaller sample - just a couple of sentences from one file
	df = df[df.file == list(df["file"].unique())[0]][:20]

	print("Sample files has {} sentences and {} words.".format(df["text"].count(), df["length"].sum()))

	# Create a list of sentences from the df
	sentence_list = df.text.to_list()

	# Translate the sample using all available models for this language
	for opus_lang_code in lang_models_dict[lang_code]:
		translation_list = model.translate(sentence_list, source_lang = "{}".format(opus_lang_code), target_lang='en')

		# Add the translations to the df
		df["translation-{}".format(opus_lang_code)] = translation_list
	
	df = df.drop(columns=["file", "sentence_id", "tokenized_text", "NER", "proper_nouns", "length"])

	# Save the df
	df.to_csv("/home/tajak/Parlamint-translation/results/{}/ParlaMint-{}-sample-model-comparison.csv".format(lang_code, lang_code))

	print("The file is saved as/home/tajak/Parlamint-translation/ results/{}/ParlaMint-{}-sample-model-comparison.csv. ".format(lang_code, lang_code))

	return df


In [5]:
df = choose_model(lang_code)

Entire corpus has 15117 sentences and 228433 words.
Sample files has 20 sentences and 245 words.
The file is saved as/home/tajak/Parlamint-translation/ results/CZ/ParlaMint-CZ-sample-model-comparison.csv. 


In [27]:
# Open the analysed sample

sample = pd.read_csv("/home/tajak/Parlamint-translation/results/{}/ParlaMint-{}-sample-model-comparison.csv".format(lang_code, lang_code), index_col = 0)
sample.head(2)

Unnamed: 0,file_path,text,translation-cs,translation-sla,translation-zlw
0,/home/tajak/Parlamint-translation/ParlaMint-CZ...,3.,3.,3.,3.
1,/home/tajak/Parlamint-translation/ParlaMint-CZ...,Návrh zasedacího pořádku poslanců v jednacím s...,Proposal for a sitting order of Members in the...,Draft meeting order of Members in the Chamber ...,Proposal for a meeting order of Members in the...


In [None]:
sample.comparison.value_counts()

The best model for Czech was shown to be cs.

translate()

In [29]:
def translate(lang_code, opus_lang_code):
	"""
	This function translates the text from the dataframe, created with the extract_text() function
	with OPUS-MT models using EasyNMT. It returns a dataframe with the translation.

	Args:
	- lang_code: the lang code that is used in the names of the files, it should be the same as for extract_text()
	- opus_lang_code: the lang code to be used in the OPUS-MT model - use the one that performed the best in the comparison (see function choose_model())
	"""
	import pandas as pd
	import regex as re
	from easynmt import EasyNMT
	from IPython.display import display

	# Open the file, created in the previous step
	df = pd.read_csv("{}".format(extracted_dataframe_path), sep="\t", index_col=0)

	# Define the model
	model = EasyNMT('opus-mt')

	print("Entire corpus has {} sentences and {} words.".format(df["text"].count(), df["length"].sum()))

	# Create a list of sentences from the df
	sentence_list = df.text.to_list()

	#Translate the list of sentences - you need to provide the source language as it is in the name of the model - the opus_lang_code
	translation_list = model.translate(sentence_list, source_lang = "{}".format(opus_lang_code), target_lang='en')

	# Add the translations to the df
	df["translation"] = translation_list

	# Display the df
	display(df[:3])

	# Save the df
	df.to_csv("{}".format(translated_dataframe_path), sep="\t")

	return df

In [30]:
df = translate(lang_code, opus_lang_code)

Entire corpus has 15117 sentences and 228433 words.


In [10]:
df.tail(2)

Unnamed: 0,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns,length,translation
15115,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2016-07-01-ps2013-048-04-000-000....,ParlaMint-CZ_2016-07-01-ps2013-048-04-000-000....,Tento návrh byl přijat.,Tento návrh byl přijat .,,{},4,This proposal was adopted.
15116,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2016-07-01-ps2013-048-04-000-000....,ParlaMint-CZ_2016-07-01-ps2013-048-04-000-000....,Tím jsme se vypořádali s pořadem schůze a může...,Tím jsme se vypořádali s pořadem schůze a může...,,{},12,We've dealt with the agenda of the meeting and...


In [11]:
df.translation.to_list()[:3]

['3.',
 "Proposal for a sitting order of Members in the Chamber of Deputies' Chamber of Deputies",
 "According to Article 52 (2) (a) of the basic Regulation, the Commission considers that the aid is compatible with the internal market under Article 107 (3) (c) of the Treaty on the Functioning of the European Union and Article 108 (3) of the Treaty on the Functioning of the European Union. 1 of our Rules of Procedure The Chamber of Deputies approves the sitting arrangements of Members in the Chamber of Deputies' Chamber of Deputies' Chamber of Deputies' Chamber of Deputies' Chamber of Deputies' Chamber of Deputies' Chamber of Deputies, and a proposal has been given to you as agreed upon by the various Members' Clubs."]

## Word alignment

### Tokenization with Stanza

- We apply the stanza tokenization over the translation; use tokenize_no_ssplit to avoid splitting sentences in multiple sentences.

In [5]:
# Open the translated df

df = pd.read_csv("{}".format(translated_dataframe_path), sep="\t", index_col = 0)
df.head(2)

Unnamed: 0,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns,length,translation
0,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,3.,3 .,,{},1,3.
1,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Návrh zasedacího pořádku poslanců v jednacím s...,Návrh zasedacího pořádku poslanců v jednacím s...,,{},9,Proposal for a sitting order of Members in the...


In [4]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_no_ssplit = True)

2023-01-17 13:15:10 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

2023-01-17 13:15:11 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2023-01-17 13:15:11 INFO: Use device: gpu
2023-01-17 13:15:11 INFO: Loading: tokenize
2023-01-17 13:15:13 INFO: Done loading processors!


In [6]:
# Inspect how the output of Stanza pipeline looks like
En_sentences = df.translation.to_list()

tokenized_sentences = []
space_after_list = []

for i in En_sentences[:2]:
	doc = nlp(i).to_dict()

doc

[[{'id': 1, 'text': 'Proposal', 'start_char': 0, 'end_char': 8},
  {'id': 2, 'text': 'for', 'start_char': 9, 'end_char': 12},
  {'id': 3, 'text': 'a', 'start_char': 13, 'end_char': 14},
  {'id': 4, 'text': 'sitting', 'start_char': 15, 'end_char': 22},
  {'id': 5, 'text': 'order', 'start_char': 23, 'end_char': 28},
  {'id': 6, 'text': 'of', 'start_char': 29, 'end_char': 31},
  {'id': 7, 'text': 'Members', 'start_char': 32, 'end_char': 39},
  {'id': 8, 'text': 'in', 'start_char': 40, 'end_char': 42},
  {'id': 9, 'text': 'the', 'start_char': 43, 'end_char': 46},
  {'id': 10, 'text': 'Chamber', 'start_char': 47, 'end_char': 54},
  {'id': 11, 'text': 'of', 'start_char': 55, 'end_char': 57},
  {'id': 12, 'text': 'Deputies', 'start_char': 58, 'end_char': 66},
  {'id': 13, 'text': "'", 'start_char': 66, 'end_char': 67},
  {'id': 14, 'text': 'Chamber', 'start_char': 68, 'end_char': 75},
  {'id': 15, 'text': 'of', 'start_char': 76, 'end_char': 78},
  {'id': 16, 'text': 'Deputies', 'start_char': 

In [7]:
# Apply tokenization to English translation and add the sentences to the df
# Open the df
df = pd.read_csv("{}".format(translated_dataframe_path), sep="\t")


# Save also the information on whether there is a space after or before punctuation
# which we will need later, to remove unnecessary spaces

En_sentences = df.translation.to_list()

tokenized_sentences = []
space_after_list = []

for i in En_sentences:
	doc = nlp(i).to_dict()
	current_sentence_list = []
	current_space_after_list = []

	# Define a list of start_char and end_char
	start_chars = []
	end_chars = []

	# Loop through the tokens in the sentence and add them to a current sentence list
	for sentence in doc:
		for word in sentence:
			current_sentence_list.append(word["text"])

			# Add information on start and end chars to the list
			start_chars.append(word["start_char"])
			end_chars.append(word["end_char"])
		
	# Now loop through the start_char and end_char lists and find instances
	# where the end_char of one word is the same as the start_char of the next one
	# this means there is no space between them
	for char_index in range(len(start_chars)-1):
		if end_chars[char_index] == start_chars[(char_index+1)]:
			current_space_after_list.append("No")
		else:
			current_space_after_list.append("Yes")

	# This loop is not possible for the end token, so let's add information for the last token
	# just to avoid errors due to different lengths of lists
	current_space_after_list.append("Last")

	# Join the list into a space-separated string
	current_string = " ".join(current_sentence_list)

	tokenized_sentences.append(current_string)

	space_after_list.append(current_space_after_list)

# Add the result to the df
df["translation-tokenized"] = tokenized_sentences
df["space-after-information"] = space_after_list

df.head()

Unnamed: 0.1,Unnamed: 0,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns,length,translation,translation-tokenized,space-after-information
0,0,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,3.,3 .,,{},1,3.,3 .,"[No, Last]"
1,1,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Návrh zasedacího pořádku poslanců v jednacím s...,Návrh zasedacího pořádku poslanců v jednacím s...,,{},9,Proposal for a sitting order of Members in the...,Proposal for a sitting order of Members in the...,"[Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ..."
2,2,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Podle § 52 odst. 1 našeho jednacího řádu Posla...,Podle § 52 odst . 1 našeho jednacího řádu Posl...,,{},33,According to Article 52 (2) (a) of the basic R...,According to Article 52 ( 2 ) ( a ) of the bas...,"[Yes, Yes, Yes, Yes, No, No, Yes, No, No, Yes,..."
3,3,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Já bych k této věci otevřel rozpravu.,Já bych k této věci otevřel rozpravu .,,{},7,I would like to open a debate on this matter.,I would like to open a debate on this matter .,"[Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ..."
4,4,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Pan poslanec a předseda klubu ODS Zbyněk Stanj...,Pan poslanec a předseda klubu ODS Zbyněk Stanj...,,"{6: ['Zbyněk', 'Zbyněk'], 7: ['Stanjura', 'Sta...",8,Member and Chairman of the ODS Club Zbyněk Sta...,Member and Chairman of the ODS Club Zbyněk Sta...,"[Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, L..."


In [8]:
# Save the df
df.to_csv("{}".format(translated_tokenized_dataframe_path), sep="\t")

In [9]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns,length,translation,translation-tokenized,space-after-information
0,0,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,3.0,3 .,,{},1,3.0,3 .,"[No, Last]"


## Alignment

- Perform word alignment.
- Save forward and reverse alignment information for each sentence (2 additional columns).
- Substitute translated NE words with lemmas based on the annotation, save new translation to a new column.

In [None]:
""" # Code if we would add the NER information from the source to target
# Substitute words in the translation based on alignments
intermediate_list = list(zip(df["translation-tokenized"], df["proper_nouns"], df["alignments"], df["NER"]))

new_translations = []
substituted_all_info = []
substituted_only = []
NER_alignments = []

# Add information whether an error occurred
error_list = []

for i in intermediate_list[:3]:
	current_substituted_list = []
	current_substituted_only = []
	current_error = "No"

	# Create a list of NER alignments - at first, let all elements be "not NE", then we will substitute elements with appropriate tags
	# "O" is repeated as many times as there are tokens in the translation
	current_NER_list = ["O"] * len(intermediate_list[0].split())

	# Loop through the NER list for the source of this sentence
	for NER_pair in list(i[3].items()):
		# If the pair is not "O", get the word index
		if NER_pair[1] != "O":
			source_NE_index = NER_pair[0]

			# Find to which target index it corresponds:
			substituted_word_index = i[2][source_NE_index]

			# Substitute the element it the NER list under this index with the NE tag
			current_NER_list[substituted_word_index] = NER_pair[1]

	# If no proper names were detected, do not change the translation
	if i[1] == 0:
		new_translations.append(i[0])
	
	else:
		current_translation = i[0]

		# Substitute the word with the Slovene lemma based on the index - loop through the proper nouns to be changed
		for word_index in list(i[1].keys()):
			try:
				# split the translation into list of words
				word_list = current_translation.split()

				# Get index of the substituted word
				substituted_word_index = i[2][word_index]

				# Get the lemma to substitute the word with
				correct_lemma = i[1][word_index][1]

				# If the substitute word and lemma are not the same, get substituted word and its match
				if word_list[substituted_word_index] != correct_lemma:
					current_substituted_list.append((word_list[substituted_word_index], correct_lemma))
					current_substituted_only.append((word_list[substituted_word_index], correct_lemma))

					# Substitute the word in the word list
					word_list[substituted_word_index] = correct_lemma
				
				else:
					# Add information that substitution was not performed
					current_substituted_list.append(f"No substitution: {word_list[substituted_word_index], correct_lemma}")
				
				# Change the translation by merging the words back into a string
				current_translation = " ".join(word_list)

			except:
				print(f"Issue: index {word_index}: {i[1][word_index]}")
				current_error = f"Issue: index {word_index}: {i[1][word_index]}"

		# After the loop through proper nouns, save the new translation
		new_translations.append(current_translation)
	
	# Add information on what was substituted
	substituted_all_info.append(current_substituted_list)
	substituted_only.append(current_substituted_only)
	error_list.append(current_error)
"""

In [10]:
def correct_proper_nouns(lang_code):
	"""
	This function takes the translated text and the source text, aligns words with eflomal and corrects proper nouns.
	It takes the dataframe that was created in the function extract_text() and to which the translation was added
	in the function translate().

	To use eflomal, you need to install it first:
	!git clone https://github.com/robertostling/eflomal
	%cd eflomal
	!make
	!sudo make install
	!python3 setup.py install

	Args:
	- lang_code: the lang code that is used in the names of the files, it should be the same as for extract_text()
	"""
	import pandas as pd
	import re
	import ast
	from IPython.display import display

	# Open the file, created in the previous step
	df = pd.read_csv("{}".format(translated_dataframe_path), sep="\t", index_col=0)

	# Move into the eflomal folder
	%cd /home/tajak/Parlamint-translation/eflomal

	# Then we need to create files for all texts and all translations
	source_sentences = open("source_sentences.txt", "w")
	English_sentences = open("English_sentences.txt", "w")

	for i in df["tokenized_text"].to_list():
		source_sentences.write(i)
		source_sentences.write("\n")

	for i in df["translation-tokenized"].to_list():
		English_sentences.write(i)
		English_sentences.write("\n")

	source_sentences.close()
	English_sentences.close()

	# Align sentences with eflomal and get out a file with alignments
	!python3 align.py -s source_sentences.txt -t English_sentences.txt --model 3 -r source-en.rev -f source-en.fwd

	# Create a list of alignments from the returned files which will be added to the final conllu

	# Create target alignments from the source alignment direction (by changing the direction in the fwd file)
	aligns_list_target = open("source-en.fwd", "r").readlines()
	aligns_list_target = [i.replace("\n", "") for i in aligns_list_target]
	aligns_list_target = [i.split(" ") for i in aligns_list_target]

	aligns_list_target_final = []

	for i in aligns_list_target:
		current_sentence_align = ""
		for pair in i:
			current_pair = pair.split("-")
			current_sentence_align += "{}-{}".format(current_pair[1], current_pair[0])
			current_sentence_align += " "
	
		aligns_list_target_final.append(current_sentence_align)
	
	# Add aligns_list to the df
	df["aligns-target"] = aligns_list_target_final

	# Create a list of alignments for the source file
	aligns_list = open("source-en.rev", "r").readlines()
	aligns_list = [i.replace("\n", "") for i in aligns_list]

	# Add information to be added to the conllu
	df["aligns-source"] = aligns_list

	# Continue with processing the list to create the final alignments format which I'll use to correct proper names
	aligns_list = [i.split(" ") for i in aligns_list]

	for i in aligns_list:
		for pair in i:
			current_pair = pair.split("-")
			i[i.index(pair)] = {int(current_pair[0]): int(current_pair[1])}
	
	final_aligns = []

	# Create a dictionary out of the rev alignments
	for i in aligns_list:
		current_line = {}

		try:
			for element in i:
				a = list(element.items())[0][0]
				b = list(element.items())[0][1]
				current_line[a] = b
		
			# Check whether the number of pairs in the list is the same as number of items
			if len(i) != len(list(current_line.items())):
				print("Not okay:")
				print(i)
				print(current_line)

			final_aligns.append(current_line)
		
		except:
			print("error")
			print(aligns_list.index(i))
			print(i)
			final_aligns.append("Error")
		
	print("Number of aligned sentences: {}".format(len(final_aligns)))

	# Add a to the df
	df["alignments"] = final_aligns

	# Remove the rev and fwd file
	%rm source-en.rev
	%rm source-en.fwd

	# When we open the dataframe file, the dictionaries with proper names changed into strings - Change strings in the column proper_nouns into dictionaries

	df["proper_nouns"] = df.proper_nouns.astype("str")
	df["proper_nouns"] = df.proper_nouns.apply(lambda x: ast.literal_eval(x))

	# Change nan values in the proper_nouns columns
	df = df.fillna(0)

	# Substitute words in the translation based on alignments
	intermediate_list = list(zip(df["translation-tokenized"], df["proper_nouns"], df["alignments"]))

	new_translations = []
	substituted_all_info = []
	substituted_only = []

	# Add information whether an error occurred
	error_list = []

	for i in intermediate_list:
		current_substituted_list = []
		current_substituted_only = []
		current_error = "No"

		# If no proper names were detected, do not change the translation
		if i[1] == 0:
			new_translations.append(i[0])
		
		else:
			current_translation = i[0]

			# Substitute the word with the source lemma based on the index - loop through the proper nouns to be changed
			for word_index in list(i[1].keys()):
				try:
					# split the translation into list of words
					word_list = current_translation.split()

					# Get index of the substituted word
					substituted_word_index = i[2][word_index]

					# Get the lemma to substitute the word with
					correct_lemma = i[1][word_index][1]

					# If the substitute word and lemma are not the same, get substituted word and its match
					if word_list[substituted_word_index] != correct_lemma:
						current_substituted_list.append((word_list[substituted_word_index], correct_lemma))
						current_substituted_only.append((word_list[substituted_word_index], correct_lemma))

						# Substitute the word in the word list
						word_list[substituted_word_index] = correct_lemma
					
					else:
						# Add information that substitution was not performed
						current_substituted_list.append(f"No substitution: {word_list[substituted_word_index], correct_lemma}")
					
					# Change the translation by merging the words back into a string
					current_translation = " ".join(word_list)

				except:
					print(f"Issue: index {word_index}: {i[1][word_index]}")
					current_error = f"Issue: index {word_index}: {i[1][word_index]}"

			# After the loop through proper nouns, save the new translation
			new_translations.append(current_translation)
		
		# Add information on what was substituted
		if len(substituted_all_info) != 0:
			substituted_all_info.append(current_substituted_list)
		else:
			substituted_all_info.append(0)

		if len(current_substituted_only) != 0:
			substituted_only.append(current_substituted_only)
		else:
			substituted_only.append(0)

		error_list.append(current_error)


	# Add to the df
	df["new_translations"] = new_translations
	df["substitution_info"] = substituted_all_info
	df["substituted_words"] = substituted_only
	df["errors"] = error_list

	# Change the working directory once again
	%cd ..

	# Save the df
	df.to_csv("{}".format(final_dataframe), sep="\t")

	# Display most common substitutions
	df_substituted = df[df["proper_nouns"] != "0"]
	display(df_substituted.substituted_words.value_counts()[:20])

	return df

In [11]:
df = correct_proper_nouns(lang_code)

/home/tajak/Parlamint-translation/eflomal
Number of aligned sentences: 15117
/home/tajak/Parlamint-translation


TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 5231, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


0                               14527
[(Faltynek, Faltýnek)]             26
[(Laudat, Laudát)]                 15
[(Vladimir, Vladimír)]             12
[(Philip, Filip)]                  11
[(Shincl, Šincl)]                   9
[(Peter, Petr)]                     8
[(Semel, Semelová)]                 8
[(Jerome, Jeroným)]                 7
[(Mark, Marková)]                   7
[(Mark, Marek)]                     7
[(Wark, Válková)]                   7
[(Zaoralek, Zaorálek)]              6
[(Zlatuska, Zlatuška)]              6
[(Putn, Putnová)]                   6
[(Free, Volný)]                     5
[(Sedya, Seďa)]                     5
[(Jerman, Jermanová)]               5
[(Vanya, Váňa)]                     5
[(Kolorodětík, Kolovratník)]        5
Name: substituted_words, dtype: int64

In [12]:
df[df["substituted_words"]!= 0][:5]

Unnamed: 0,Unnamed: 0.1,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns,length,translation,translation-tokenized,space-after-information,aligns-target,aligns-source,alignments,new_translations,substitution_info,substituted_words,errors
96,96,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024....,ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024....,Děkuji panu ministru Martinu Pecinovi a nyní p...,Děkuji panu ministru Martinu Pecinovi a nyní p...,0.0,"{3: ['Martinu', 'Martin'], 4: ['Pecinovi', 'Pe...",22,Thank you to Minister Martin Pecin and I will ...,Thank you to Minister Martin Pecin and I will ...,"['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Ye...",0-0 1-0 2-1 3-2 4-3 5-4 6-5 9-6 10-7 12-8 15-9...,0-0 1-1 2-3 3-4 4-5 5-6 6-9 7-10 8-12 9-15 10-...,"{0: 0, 1: 1, 2: 3, 3: 4, 4: 5, 5: 6, 6: 9, 7: ...",Thank you to Minister Martin Pecin and I will ...,"[No substitution: ('Martin', 'Martin'), No sub...","[(Koniček, Koníček)]",No
98,98,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024....,ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024....,"Vážený pane místopředsedo, dovoluji si navrhno...","Vážený pane místopředsedo , dovoluji si navrhn...",0.0,"{9: ['Vladimíra', 'Vladimír'], 10: ['Koníčka',...",10,"Mr Vice-President, I would like to propose Mr ...","Mr Vice - President , I would like to propose ...","['Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes',...",0-1 1-2 2-2 3-2 4-3 6-4 7-5 9-6 10-8 11-9 12-1...,1-0 2-1 3-4 4-7 5-8 6-9 7-10 9-11 10-12 11-13,"{1: 0, 2: 1, 3: 4, 4: 7, 5: 8, 6: 9, 7: 10, 9:...","Mr Vice - President , I would like to propose ...","[No substitution: ('Vladimír', 'Vladimír'), (K...","[(Koniček, Koníček)]",No
101,101,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024....,ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024....,"Ptám se, kdo je pro to, aby zpravodajem pro pr...","Ptám se , kdo je pro to , aby by zpravodajem p...",0.0,"{17: ['Vladimír', 'Vladimír'], 18: ['Koníček',...",16,I ask who is in favour of the rapporteur for f...,I ask who is in favour of the rapporteur for f...,"['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Ye...",0-0 1-1 2-3 3-4 5-5 7-9 8-10 9-11 10-12 11-13 ...,0-1 3-2 4-3 5-5 9-7 10-8 11-9 12-10 13-11 14-1...,"{0: 1, 3: 2, 4: 3, 5: 5, 9: 7, 10: 8, 11: 9, 1...",I ask who is in favour of the rapporteur for f...,"[(Vladimir, Vladimír), No substitution: ('Koní...","[(Vladimir, Vladimír)]",No
106,106,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024....,ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024....,Pan Vladimír Koníček se stal zpravodajem pro p...,Pan Vladimír Koníček se stal zpravodajem pro p...,0.0,"{1: ['Vladimír', 'Vladimír'], 2: ['Koníček', '...",9,Mr Vladimir Koníček became rapporteur for firs...,Mr Vladimir Koníček became rapporteur for firs...,"['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Ye...",0-0 1-1 2-2 3-4 4-5 5-6 6-7 7-8 8-9,0-0 1-1 2-2 4-3 5-4 6-5 7-6 8-7 9-8,"{0: 0, 1: 1, 2: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: ...",Mr Vladimír Koníček became rapporteur for firs...,"[(Vladimir, Vladimír), No substitution: ('Koní...","[(Vladimir, Vladimír)]",No
174,174,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-10-ps2013-004-01-019-021....,ParlaMint-CZ_2013-12-10-ps2013-004-01-019-021....,Nyní prosím předsedu výboru pro zdravotnictví ...,Nyní prosím předsedu výboru pro zdravotnictví ...,0.0,"{7: ['Rostislava', 'Rostislav'], 8: ['Vyzulu',...",16,I now ask the chairman of the Committee on Hea...,I now ask the chairman of the Committee on Hea...,"['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Ye...",1-0 2-1 4-2 7-3 8-4 9-5 11-6 12-7 13-8 14-9 15...,0-1 1-2 2-4 3-7 4-8 5-9 6-11 7-12 8-13 9-14 10...,"{0: 1, 1: 2, 2: 4, 3: 7, 4: 8, 5: 9, 6: 11, 7:...",I now ask the chairman of the Committee on Hea...,"[No substitution: ('Rostislav', 'Rostislav'), ...","[(Vyzulu, Vyzula)]",No


In [13]:
df[df["errors"]!="No"].shape

(0, 19)

In [21]:
# Analyse errors
df[df["errors"] != "No"].iloc[9].to_dict()

{'Unnamed: 0.1': 11481,
 'file_path': '/home/tajak/Parlamint-translation/ParlaMint-CZ/ParlaMint-CZ.conllu/ParlaMint-CZ.conllu/2013/ParlaMint-CZ_2013-12-10-ps2013-004-01-001-001.conllu',
 'file': 'ParlaMint-CZ_2013-12-10-ps2013-004-01-001-001.conllu',
 'sentence_id': 'ParlaMint-CZ_2013-12-10-ps2013-004-01-001-001.u2.p2.s3',
 'text': 'Význam Nelsona R. Mandely přesáhl hranice Jihoafrické republiky a oslovil globální společenství svým nezaměnitelným poselstvím lidské důstojnosti, svobody, statečnosti a státnické rozvahy.',
 'tokenized_text': 'Význam Nelsona R . Mandely přesáhl hranice Jihoafrické republiky a oslovil globální společenství svým nezaměnitelným poselstvím lidské důstojnosti , svobody , statečnosti a státnické rozvahy .',
 'NER': 0.0,
 'proper_nouns': {1: ['Nelsona', 'Nelson'],
  2: ['R', 'R'],
  3: ['.', '.'],
  4: ['Mandely', 'Mandela']},
 'length': 22,
 'translation': 'The importance of Nelson R. Mandela exceeded the borders of South Africa and addressed the global communit

In [16]:
df.describe(include="all")

Unnamed: 0,Unnamed: 0.1,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns,length,translation,translation-tokenized,space-after-information,aligns-target,aligns-source,alignments,new_translations,substitution_info,substituted_words,errors
count,15117.0,15117,15117,15117,15117,15117,15117.0,15117,15117.0,15117,15117,15117,15117,15117,15117,15117,15117,15117.0,15117
unique,,100,100,15117,12796,12796,,1610,,12660,12660,5826,11276,11221,11221,12659,894,330.0,1
top,,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-16-ps2013-004-03-003-018....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Děkuji.,Děkuji .,,{},,Thank you.,Thank you .,"['Yes', 'No', 'Last']",0-0 1-0 2-1,0-0 1-2,"{0: 0, 1: 2}",Thank you .,[],0.0,No
freq,,1506,1506,1,516,516,,13281,,601,601,825,524,538,538,601,13280,14527.0,15117
mean,7558.0,,,,,,0.0,,15.111001,,,,,,,,,,
std,4364.046345,,,,,,0.0,,13.67416,,,,,,,,,,
min,0.0,,,,,,0.0,,1.0,,,,,,,,,,
25%,3779.0,,,,,,0.0,,5.0,,,,,,,,,,
50%,7558.0,,,,,,0.0,,11.0,,,,,,,,,,
75%,11337.0,,,,,,0.0,,21.0,,,,,,,,,,


In [17]:
df.length.sum()

228433

In [20]:
df.substituted_words.value_counts()[:20]

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 5231, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


0                               14527
[(Faltynek, Faltýnek)]             26
[(Laudat, Laudát)]                 15
[(Vladimir, Vladimír)]             12
[(Philip, Filip)]                  11
[(Shincl, Šincl)]                   9
[(Peter, Petr)]                     8
[(Semel, Semelová)]                 8
[(Jerome, Jeroným)]                 7
[(Mark, Marková)]                   7
[(Mark, Marek)]                     7
[(Wark, Válková)]                   7
[(Zaoralek, Zaorálek)]              6
[(Zlatuska, Zlatuška)]              6
[(Putn, Putnová)]                   6
[(Free, Volný)]                     5
[(Sedya, Seďa)]                     5
[(Jerman, Jermanová)]               5
[(Vanya, Váňa)]                     5
[(Kolorodětík, Kolovratník)]        5
Name: substituted_words, dtype: int64

## Linguistic processing of translated text

This will have to be done for each file separately - from now onwards, we need to separate the df into files.

In [21]:
# Open df
df = pd.read_csv("{}".format(final_dataframe), sep="\t")

In [22]:
# Create a list of files
files = list(df.file.unique())
files

['ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.conllu',
 'ParlaMint-CZ_2013-12-10-ps2013-004-01-004-003.conllu',
 'ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024.conllu',
 'ParlaMint-CZ_2013-12-10-ps2013-004-01-019-021.conllu',
 'ParlaMint-CZ_2013-11-27-ps2013-001-02-015-020.conllu',
 'ParlaMint-CZ_2013-12-12-ps2013-004-02-002-029.conllu',
 'ParlaMint-CZ_2013-11-27-ps2013-001-02-001-006.conllu',
 'ParlaMint-CZ_2013-12-10-ps2013-004-01-013-028.conllu',
 'ParlaMint-CZ_2013-12-19-ps2013-004-04-008-038.conllu',
 'ParlaMint-CZ_2013-12-19-ps2013-004-04-009-042.conllu',
 'ParlaMint-CZ_2013-12-04-ps2013-002-01-006-006.conllu',
 'ParlaMint-CZ_2013-12-06-ps2013-002-02-001-007.conllu',
 'ParlaMint-CZ_2013-12-10-ps2013-004-01-006-005.conllu',
 'ParlaMint-CZ_2013-12-19-ps2013-004-04-004-034.conllu',
 'ParlaMint-CZ_2013-11-25-ps2013-001-01-001-001.conllu',
 'ParlaMint-CZ_2013-11-27-ps2013-001-02-010-015.conllu',
 'ParlaMint-CZ_2013-12-12-ps2013-004-02-005-045.conllu',
 'ParlaMint-CZ_2013-12-10-ps201

### Linguistically process with Stanza

- We use Stanza to get POS, lemmas and ner. Send in the "pre-tokenized text" (created in previous steps).
- Transform the result into CONLL-u (which should contain tokens, lemmas, pos).

- Parse the CONLL-u file and add:
	1) sentence_id as metadata
	2) forward and reverse alignment as metadata (# align_s = 1-1 2-2... and #align_t = 1-1 2-2...),
	3) based on alignment, add SpaceAfter information to each token
	4) source text ("source")
	5) initial translation (#initial_translation metadata)
	6) improved translated text (#text metadata): based on SpaceAfter information, remove spaces around punctuation
	7) Delete startchar and endchar information from ["misc"] metadata element
- Save the file as CONLLU with the same name as the source CONLLU file (so each file will be saved separately). The number of sentences should be the same as in the source CONLLU and ANA file.

In [23]:
def create_conllu(file, lang_code):
	"""
	The function takes the dataframe (df), created in previous steps and takes only the instances from the df that belong
	to the file that is in the argument. It linguistically processes the translated sentences from the file and saves the file.
	Then we add additional information (metadata and NER annotations) to it with the conllu parser and save the final conllu file.

	Args:
		- file (str): file name from the files list (see above)
		- lang_code (str): the lang code that is used in the names of the files, it should be the same as for extract_text()
	"""

	# Process all sentences in the dataframe and save them to a conllu file
	from stanza.utils.conll import CoNLL
	from conllu import parse
	import ast
	import regex as re

	# Use the dataframe, created in previous steps
	df = pd.read_csv("{}".format(final_dataframe), sep="\t")

	# Filter out only instances from the file in question
	df = df[df["file"] == file]

	# When we open the dataframe file, the list of strings in "space-after-information"
	# is a string - change it back to a list
	df["space-after-information"] = df["space-after-information"].astype("str")
	df["space-after-information"] = df["space-after-information"].apply(lambda x: ast.literal_eval(x))
	
	# Create lists of information that we need to add to the conllu file
	ids_list = df.sentence_id.to_list()
	source_text = df.text.to_list()
	initial_translation = df.translation.to_list()
	aligns_source = df["aligns-source"].to_list()
	aligns_target = df["aligns-target"].to_list()
	space_after_list = df["space-after-information"].to_list()
	
	sentence_list = df.new_translations.to_list()

	# To feed the entire list into the pipeline, we need to create lists of tokens, split by space
	sentence_list = [x.split(" ") for x in sentence_list]

	# Linguistically process the list
	doc = nlp(sentence_list)

	# Save the conllu file
	CoNLL.write_doc2conll(doc, "/home/tajak/Parlamint-translation/results/{}/temp/{}".format(lang_code, file))

	print("{} processed and saved.".format(file))

	# Open the CONLL-u file with the CONLL-u parser

	data = open("/home/tajak/Parlamint-translation/results/{}/temp/{}".format(lang_code, file), "r").read()

	sentences = parse(data)

	# Adding additional information to the conllu
	for sentence in sentences:
		# Get the sentence index
		sentence_index = sentences.index(sentence)

		# Add metadata
		sentence.metadata["sent_id"] = ids_list[sentence_index]
		sentence.metadata["align_s"] = aligns_source[sentence_index]
		sentence.metadata["align_t"] = aligns_target[sentence_index]
		sentence.metadata["source"] = source_text[sentence_index]
		sentence.metadata["initial_translation"] = initial_translation[sentence_index]

		# Delete the current metadata for text
		del sentence.metadata["text"]

		new_translation_text = ""

		# Iterate through tokens and add SpaceAfter information if SpaceAfter is "No"
		for word in sentence:
			word_index = sentence.index(word)

			# Remove information on start_char and end_char from the annotation
			del word["misc"]["start_char"]
			del word["misc"]["end_char"]
			
			# Change the NER tags so that they are the same as in the source
			current_ner = word["misc"]["ner"]
			del word["misc"]["ner"]
			
			# Substitute parts of the tags so that they are tha same as in source
			current_ner = re.sub("S-", "B-", current_ner)
			current_ner = re.sub("E-", "I-", current_ner)

			word["misc"]["NER"] = current_ner

			# Get information about the space after based on the index
			current_space_after = space_after_list[sentence_index][word_index]

		# Create new text from translation, correcting the spaces around words
		# based on the SpaceAfter information
			if current_space_after == "No":
				word["misc"]["SpaceAfter"] = "No"
				new_translation_text += word["form"]
			elif current_space_after == "Last":
				new_translation_text += word["form"]
			else:
				new_translation_text += word["form"]
				new_translation_text += " "
		
		sentence.metadata["text"] = new_translation_text
	
	# Create a new conllu file with the updated information

	final_file = open("/home/tajak/Parlamint-translation/results/{}/final_translated_conllu/{}".format(lang_code, file), "w")

	for sentence in sentences:
		final_file.write(sentence.serialize())
	
	final_file.close()

	print("Final file {} is saved.".format(file))

In [None]:
"""
The code if we want to add NER elements too
def create_conllu(file, lang_code):

	# Process all sentences in the dataframe and save them to a conllu file
	from stanza.utils.conll import CoNLL
	from conllu import parse
	import ast

	# Use the dataframe, created in previous steps
	df = pd.read_csv("{}".format(final_dataframe), sep="\t")

	# When we open the df, the NER list turns into a string - we need to change it into a list
	df["target-NER-annotations"] = df["target-NER-annotations"].apply(lambda x: ast.literal_eval(x))
	
	# Filter out only instances from the file in question
	df = df[df["file"] == file]

	# Create lists of information that we need to add to the conllu file
	ids_list = df.sentence_id.to_list()
	aligns_source = df["aligns-source"].to_list()
	aligns_target = df["aligns-target"].to_list()
	ner_list = df["target-NER-annotations"].to_list()
	
	sentence_list = df.new_translations.to_list()

	# To feed the entire list into the pipeline, we need to create lists of tokens, split by space
	sentence_list = [x.split(" ") for x in sentence_list]

	# Linguistically process the list
	doc = nlp(sentence_list)

	# Save the conllu file
	CoNLL.write_doc2conll(doc, "results/{}/ParlaMint-{}-translated.conllu/temp/{}".format(lang_code, lang_code, file))

	print("{} processed and saved.".format(file))

	# Open the CONLL-u file with the CONLL-u parser

	data = open("results/{}/ParlaMint-{}-translated.conllu/temp/{}".format(lang_code, lang_code, file), "r").read()

	sentences = parse(data)

	# Adding additional information to the conllu
	for sentence in sentences:
		# Get the sentence index
		sentence_index = sentences.index(sentence)

		# Add metadata
		sentence.metadata["sent_id"] = ids_list[sentence_index]
		sentence.metadata["align_s"] = aligns_source[sentence_index]
		sentence.metadata["align_t"] = aligns_target[sentence_index]

		# Make the # text element be the last 
		current_text = sentence.metadata["text"]
		del sentence.metadata["text"]

		sentence.metadata["text"] = current_text

		# Iterate through tokens and add NER information to each
		for word in sentence:
			word_index = sentence.index(word)
			# Add NER information based on the word index
			word["misc"]["NER"] = ner_list[sentence_index][word_index]
		
	# Create a new conllu file with the updated information

	final_file = open("results/{}/ParlaMint-{}-translated.conllu/{}".format(lang_code, lang_code, file), "w")

	for sentence in sentences:
		final_file.write(sentence.serialize())
	
	final_file.close()

	print("Final file {} is saved.".format(file))
"""

In [24]:
import stanza

# Now, let's feed the changed translation to the Stanza pipeline to create the final format
#nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,ner', tokenize_pretokenized=True)

# Instruct it to use a specific package: 	CoNLL03
nlp = stanza.Pipeline(lang='en', processors="tokenize,mwt,pos,lemma,ner", package={"ner": ["conll03"]}, tokenize_pretokenized=True)

for file in files[:11]:
	create_conllu(file, lang_code)

2023-01-17 13:29:37 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

2023-01-17 13:29:38 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |
| ner       | conll03  |

2023-01-17 13:29:38 INFO: Use device: gpu
2023-01-17 13:29:38 INFO: Loading: tokenize
2023-01-17 13:29:38 INFO: Loading: pos
2023-01-17 13:29:38 INFO: Loading: lemma
2023-01-17 13:29:38 INFO: Loading: ner
2023-01-17 13:29:38 INFO: Done loading processors!


ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.conllu processed and saved.
Final file ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.conllu is saved.
ParlaMint-CZ_2013-12-10-ps2013-004-01-004-003.conllu processed and saved.
Final file ParlaMint-CZ_2013-12-10-ps2013-004-01-004-003.conllu is saved.
ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024.conllu processed and saved.
Final file ParlaMint-CZ_2013-12-10-ps2013-004-01-015-024.conllu is saved.
ParlaMint-CZ_2013-12-10-ps2013-004-01-019-021.conllu processed and saved.
Final file ParlaMint-CZ_2013-12-10-ps2013-004-01-019-021.conllu is saved.
ParlaMint-CZ_2013-11-27-ps2013-001-02-015-020.conllu processed and saved.
Final file ParlaMint-CZ_2013-11-27-ps2013-001-02-015-020.conllu is saved.
ParlaMint-CZ_2013-12-12-ps2013-004-02-002-029.conllu processed and saved.
Final file ParlaMint-CZ_2013-12-12-ps2013-004-02-002-029.conllu is saved.
ParlaMint-CZ_2013-11-27-ps2013-001-02-001-006.conllu processed and saved.
Final file ParlaMint-CZ_2013-11-27-ps2

In [26]:
# Check whether the translated and source file have the same no. of sentences
from conllu import parse

source = open("/home/tajak/Parlamint-translation/ParlaMint-CZ/ParlaMint-CZ.conllu/ParlaMint-CZ.conllu/2013/ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.conllu", "r").read()
source_sen = parse(source)

translation = open("/home/tajak/Parlamint-translation/results/CZ/final_translated_conllu/ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.conllu", "r").read()

translations_sen = parse(translation)

In [27]:
# Check if number of sentences match
print(len(source_sen))
print(len(translations_sen))

53
53


In [28]:
# Check if ids match
for i in [50,51,52]:
	print(source_sen[i].metadata["sent_id"])
	print(translations_sen[i].metadata["sent_id"])

ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.u12.p1.s2
ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.u12.p1.s2
ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.u13.p1.s1
ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.u13.p1.s1
ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.u13.p2.s1
ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003.u13.p2.s1


In [29]:
# Check if content matches
for i in [50,51,52]:
	print(source_sen[i].metadata["text"])
	print(translations_sen[i].metadata["text"])

Takže přerušuji jednání Sněmovny do 15.30 hodin.
So I interrupt the House meeting by 3:30 p.m.
Dámy a pánové, čas, který byl vyhrazen na poradu poslaneckého klubu TOP 09, uplynul, a já vás tedy prosím, abyste zaujali svá místa v jednacím sále, a budeme pokračovat.
Ladies and gentlemen, the time allocated to the meeting of the MEP's TOP 09 club is over, and I therefore ask you to take your seats in the Chamber of Commerce, and we will continue.
Dalším bodem, který budeme projednávat, je
The next item we're going to discuss is


In [4]:
# Inspect translations with possessive adjectives
df = pd.read_csv("{}".format(translated_dataframe_path), sep="\t")

In [7]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,file_path,file,sentence_id,text,tokenized_text,NER,proper_nouns,length,translation,translation-tokenized,space-after-information
0,0,0,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,3.,3 .,,{},1,3.,3 .,"['No', 'Last']"
1,1,1,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Návrh zasedacího pořádku poslanců v jednacím s...,Návrh zasedacího pořádku poslanců v jednacím s...,,{},9,Proposal for a sitting order of Members in the...,Proposal for a sitting order of Members in the...,"['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Ye..."
2,2,2,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Podle § 52 odst. 1 našeho jednacího řádu Posla...,Podle § 52 odst . 1 našeho jednacího řádu Posl...,,{},33,According to Article 52 (2) (a) of the basic R...,According to Article 52 ( 2 ) ( a ) of the bas...,"['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes'..."
3,3,3,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Já bych k této věci otevřel rozpravu.,Já bych k této věci otevřel rozpravu .,,{},7,I would like to open a debate on this matter.,I would like to open a debate on this matter .,"['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Ye..."
4,4,4,/home/tajak/Parlamint-translation/ParlaMint-CZ...,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,ParlaMint-CZ_2013-12-04-ps2013-002-01-003-003....,Pan poslanec a předseda klubu ODS Zbyněk Stanj...,Pan poslanec a předseda klubu ODS Zbyněk Stanj...,,"{6: ['Zbyněk', 'Zbyněk'], 7: ['Stanjura', 'Sta...",8,Member and Chairman of the ODS Club Zbyněk Sta...,Member and Chairman of the ODS Club Zbyněk Sta...,"['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Ye..."


In [16]:
file_list = list(df.file.unique())


for file in file_list:
	for i in instances:
		if file in i:
			print(file)

ParlaMint-CZ_2013-12-12-ps2013-004-02-005-045.conllu
ParlaMint-CZ_2013-12-06-ps2013-002-02-002-008.conllu
ParlaMint-CZ_2013-12-06-ps2013-002-02-002-008.conllu
ParlaMint-CZ_2013-12-06-ps2013-002-02-002-008.conllu
ParlaMint-CZ_2013-12-06-ps2013-002-02-002-008.conllu
ParlaMint-CZ_2013-12-06-ps2013-002-02-002-008.conllu
ParlaMint-CZ_2013-12-16-ps2013-004-03-003-018.conllu
ParlaMint-CZ_2013-12-16-ps2013-004-03-003-018.conllu
ParlaMint-CZ_2013-12-16-ps2013-004-03-003-018.conllu
ParlaMint-CZ_2013-12-19-ps2013-004-04-005-035.conllu
ParlaMint-CZ_2013-12-19-ps2013-004-04-005-035.conllu
ParlaMint-CZ_2013-12-06-ps2013-003-01-001-001.conllu
ParlaMint-CZ_2013-12-06-ps2013-003-01-001-001.conllu
ParlaMint-CZ_2013-12-06-ps2013-003-01-001-001.conllu
ParlaMint-CZ_2013-12-06-ps2013-003-01-001-001.conllu
ParlaMint-CZ_2013-12-06-ps2013-003-01-001-001.conllu
ParlaMint-CZ_2013-12-06-ps2013-003-01-001-001.conllu
ParlaMint-CZ_2013-12-06-ps2013-003-01-001-001.conllu
ParlaMint-CZ_2013-11-27-ps2013-001-02-013-019.

In [12]:
file = open("possessive_adjectives.txt", "r")

instances = file.readlines()

len(instances)

157