In this notebook, we work with files created by automatically annotating the MaCoCu corpora with genre labels. The files are in TSV or JSON Lines (JSONL) format.

In [1]:
# Select the GPU to use on the GPU machine
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=0


In [2]:
import gzip
import json
import sys

import pandas as pd
import regex as re
from tqdm import tqdm

## Analyze the file created with automatic genre identification

In [30]:
suffix = "uk-1.0"
path = f"/cache/tajak/macocu-mt/datasets/annotated/MaCoCu-{suffix}.tsv-genre-annotated.jsonl"
lang = "uk"

# Open tsv (created by above process) or json (created with the python_extended.py script)

if ".tsv" in path[-6:]:
    df = pd.read_csv(path, sep="\t", index_col = 0)
elif ".json" in path[-6:]:
    df = pd.read_json(path, orient="records", lines=True)

display(df)

Unnamed: 0,document_id,text,genre,logit
0,macocu.uk.1,"1. Хлюпотить стрімка водичка, В’ється, ллється...",Prose/Lyrical,"[-0.46740186214447005, -1.549988031387329, -0...."
1,macocu.uk.2,"Лопух угору звівши вуха, Уранці Півня пісню сл...",Prose/Lyrical,"[-0.325899422168731, -1.684817552566528, -0.97..."
2,macocu.uk.3,*** Я кохала Тебе норовливо червнево-серпнево....,Forum,"[1.543481707572937, -1.662479043006897, -2.043..."
3,macocu.uk.4,Інформація про роботу <p>Міністерство освіти і...,Information/Explanation,"[-0.656804740428924, 8.27550983428955, -0.6994..."
4,macocu.uk.5,Інформація про роботу <p>Контрольна робота із ...,Information/Explanation,"[-0.7865087389945981, 8.310611724853516, -0.67..."
...,...,...,...,...
13465417,macocu.uk.21471609,"протоієрей Павло МЕЛЬНИК, Духовність як відхід...",Information/Explanation,"[-0.25334486365318304, 7.606439113616943, -0.9..."
13465418,macocu.uk.21471610,"А. Л. Дудніков, доцент кафедри криміналістики ...",Information/Explanation,"[-0.7717455029487611, 8.313455581665039, -0.78..."
13465419,macocu.uk.21471611,Ключові слова: управління персоналом органів в...,Information/Explanation,"[-0.9155022501945491, 8.2792329788208, -0.6855..."
13465420,macocu.uk.21471612,"Ж.В. ЗАВАЛЬНА, докт. юрид. наук, доц., Сумська...",Information/Explanation,"[-0.809449970722198, 8.302559852600098, -0.776..."
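
The extension check in the cell above leaves `df` undefined when the path matches neither branch. As a minimal sketch of a stricter dispatch, the helper below reads either format into plain dictionaries and raises an error for anything else. It uses only the standard library so that it runs on its own; `read_records` and the demo file names are hypothetical, and for corpora of this size the `pd.read_csv` / `pd.read_json` calls above remain the practical choice.

```python
import csv
import json
import os
import tempfile
from pathlib import Path


def read_records(path):
    """Yield one dict per document from a .tsv or a .jsonl/.json (JSON Lines) file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".tsv":
        with open(path, encoding="utf-8", newline="") as handle:
            yield from csv.DictReader(handle, delimiter="\t")
    elif suffix in {".jsonl", ".json"}:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip empty lines
                    yield json.loads(line)
    else:
        raise ValueError(f"Unsupported file type: {path}")


# Demonstration on two tiny temporary files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "demo.jsonl")
    with open(jsonl_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps({"document_id": "doc.1", "genre": "News"}) + "\n")

    tsv_path = os.path.join(tmp, "demo.tsv")
    with open(tsv_path, "w", encoding="utf-8") as handle:
        handle.write("document_id\tgenre\ndoc.2\tForum\n")

    from_jsonl = list(read_records(jsonl_path))
    from_tsv = list(read_records(tsv_path))
```

Because `read_records` is a generator, the `ValueError` for an unsupported extension is raised when iteration starts, not when the function is called.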


In [8]:
print(df.genre.value_counts(normalize=True).to_markdown())
print("\n\n")
print(df.genre.value_counts().to_markdown())

| genre                   |   proportion |
|:------------------------|-------------:|
| News                    |   0.61676    |
| Opinion/Argumentation   |   0.0735929  |
| Promotion               |   0.0685441  |
| Information/Explanation |   0.0673885  |
| Instruction             |   0.0658549  |
| Forum                   |   0.0487328  |
| Mix                     |   0.0331768  |
| Other                   |   0.0115414  |
| Legal                   |   0.00858311 |
| Prose/Lyrical           |   0.00582556 |



| genre                   |   count |
|:------------------------|--------:|
| News                    |  893978 |
| Opinion/Argumentation   |  106671 |
| Promotion               |   99353 |
| Information/Explanation |   97678 |
| Instruction             |   95455 |
| Forum                   |   70637 |
| Mix                     |   48089 |
| Other                   |   16729 |
| Legal                   |   12441 |
| Prose/Lyrical           |    8444 |


In [5]:
df.genre.describe()

count     3990599
unique         10
top          News
freq      1747488
Name: genre, dtype: object

## Create a genre sample

For the sample, we take 10 random instances from each category except Other and Mix.
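
The sampling rule can be sketched independently of pandas as follows. This is a minimal illustration under stated assumptions, not the implementation used in this notebook: `sample_per_label` and the toy records are hypothetical, and a label with fewer than `n` documents contributes all of its documents instead of raising an error.

```python
import random
from collections import defaultdict

EXCLUDED = {"Mix", "Other"}  # labels that are left out of the sample


def sample_per_label(records, n=10, seed=0):
    """Pick up to n random records per genre label, skipping EXCLUDED labels."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        if record["genre"] not in EXCLUDED:
            by_label[record["genre"]].append(record)

    sample = []
    for group in by_label.values():
        # Take the whole group when it has fewer than n records
        sample.extend(rng.sample(group, min(n, len(group))))

    rng.shuffle(sample)  # mix the labels so they do not appear in blocks
    return sample


# Toy data: 15 "News", 3 "Legal" and 20 "Mix" documents
records = (
    [{"document_id": f"news.{i}", "genre": "News"} for i in range(15)]
    + [{"document_id": f"legal.{i}", "genre": "Legal"} for i in range(3)]
    + [{"document_id": f"mix.{i}", "genre": "Mix"} for i in range(20)]
)
sample = sample_per_label(records)
```

Fixing the seed makes the sample reproducible, which helps when the sample has to be regenerated after a failed translation run.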

In [6]:
def extract_genre_sample(df, source_lang):
	import googletrans
	from googletrans import Translator

	# We will extract all labels, except Mix and Other
	labels_list=['Information/Explanation', 'News', 'Instruction', 'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion']

	# First create the initial df to which all others in the loop will be added
	final_sample = df[df["genre"] == labels_list[0]].sample(n=10)

	# Add all other genres
	remaining_list = labels_list[1:]

	for i in remaining_list:
		try:
			added_instances = df[df["genre"] == i].sample(n=10)
			final_sample = pd.concat([final_sample, added_instances])
		except ValueError:
			# Fewer than 10 instances are available for this label; show what exists
			print(df[df["genre"] == i][:2].to_markdown())

	# Shuffle rows
	final_sample = final_sample.sample(frac=1)

	# Discard logit information
	final_sample = final_sample.drop(columns="logit")

	# Change <p> signs to actual new lines
	final_sample["text"] = final_sample["text"].str.replace("<p>", "\n\n")

	sentence_list = final_sample["text"].to_list()

	# Apply Google Translate and machine translate the data - documentation: https://py-googletrans.readthedocs.io/en/latest/

	# Define the translation model
	translator = Translator()

	# Create the final list
	translation_GT = []

	print("Starting translation.")

	# The suffix that GT uses for all languages is the same as the suffix used in the dataset names, except for Montenegrin
	# GT does not have a special model for Montenegrin, so we will use the Serbian model
	if source_lang == "cnr":
		lang = "sr"
	else:
		lang = source_lang

	# Loop through the list of original sentences,
	# translate each and append the translation to the final list
	for i in sentence_list:
		# Translate the sentence from source language, e.g. Slovene (src = "sl") to English (dest = "en")
		current_translation = translator.translate(i, src = lang, dest='en')
		# Append the translated sentence to the final list
		translation_GT.append(current_translation.text)

	print("Translation finished.")

	# Append translations to the sample

	final_sample["translation"] = translation_GT

	# Save to JSON lines
	final_sample.to_json("/cache/tajak/macocu-mt/datasets/samples/MaCoCu-{}-genre-sample.json".format(source_lang), orient="records", lines=True)

	print("Final file saved as MaCoCu-{}-genre-sample.json".format(source_lang))

	# Create also a version for the annotation tool, only with translation and labels
	ann_df = final_sample[["translation","genre"]]

	# For annotation, each label should be in a list
	ann_df["genre"] = ann_df["genre"].apply(lambda x:[x])

	# Rename df
	ann_df.columns = ["text", "label"]

	# Add metadata
	text_ids = final_sample["document_id"].to_list()
	#domains = final_sample["domain"].to_list()

	metadata_list = []

	for i in list(zip(text_ids)):#,domains)):
		metadata = {"text_id": i[0]}#, "domain": i[1]}
		metadata_list.append(metadata)

	ann_df["metadata"] = metadata_list

	# Save to JSON lines
	ann_df.to_json("/cache/tajak/macocu-mt/datasets/samples/MaCoCu-{}-genre-sample-for-annotation-tool.json".format(source_lang), orient="records", lines=True)

	print("File for annotation saved as MaCoCu-{}-genre-sample-for-annotation-tool.json".format(source_lang))
	
	return final_sample

In [32]:
sample = extract_genre_sample(df, lang)

Starting translation.
Translation finished.
Final file saved as MaCoCu-uk-genre-sample.json
File for annotation saved as MaCoCu-uk-genre-sample-for-annotation-tool.json


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ann_df["genre"] = ann_df["genre"].apply(lambda x:[x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ann_df["metadata"] = metadata_list
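
The translation step (including the Montenegrin-to-Serbian fallback) is repeated in several functions in this notebook. A minimal sketch of how it could be factored out is shown below. The names `gt_language_code`, `translate_texts` and `fake_translate` are hypothetical, and the translator is passed in as a plain function so that the sketch runs without network access.

```python
def gt_language_code(source_lang):
    """Map a corpus language code to the code passed to the translator."""
    # Montenegrin ("cnr") has no dedicated model, so the Serbian one ("sr") is used
    return "sr" if source_lang == "cnr" else source_lang


def translate_texts(texts, source_lang, translate_fn):
    """Translate each text to English with the supplied translate_fn(text, src, dest)."""
    lang = gt_language_code(source_lang)
    return [translate_fn(text, lang, "en") for text in texts]


# Stand-in translator for demonstration: it only tags the text with the language pair
def fake_translate(text, src, dest):
    return f"[{src}->{dest}] {text}"


translations = translate_texts(["Dobar dan"], "cnr", fake_translate)
```

With the `translator` object used in the function above, `translate_fn` could be `lambda text, src, dest: translator.translate(text, src=src, dest=dest).text`.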


In [28]:
genre_file = pd.read_json(f"/cache/tajak/macocu-mt/datasets/samples/MaCoCu-{lang}-genre-sample-for-annotation-tool.json", orient="records", lines=True)

display(genre_file.head())

Unnamed: 0,text,label,metadata
0,AÖL warned of frequent negligence and errors o...,[Instruction],{'text_id': 'macocu.tr.15851513'}
1,Criteria to be applied in the tender for bank ...,[Legal],{'text_id': 'macocu.tr.12699738'}
2,"all\n\nFriday, March 22, 2019, the Qur'an and ...",[Information/Explanation],{'text_id': 'macocu.tr.833490'}
3,----------------------------------------------...,[Opinion/Argumentation],{'text_id': 'macocu.tr.9263252'}
4,Sender Subject: First Step to Beekeeping (Read...,[Forum],{'text_id': 'macocu.tr.12349626'}
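
Each line of the annotation-tool file holds the translated text, the predicted genre wrapped in a list, and a metadata dictionary with the text id. A minimal sketch of building one such line is shown below; `to_annotation_record` and the example values are hypothetical.

```python
import json


def to_annotation_record(translation, genre, text_id):
    """Build one annotation-tool record: text, list-valued label and metadata."""
    return {
        "text": translation,
        "label": [genre],  # the annotation tool expects labels as a list
        "metadata": {"text_id": text_id},
    }


record = to_annotation_record("An example translated text.", "News", "doc.1")
# One JSON object per line; ensure_ascii=False keeps non-ASCII characters readable
line = json.dumps(record, ensure_ascii=False)
```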


In [33]:
genre_file.label.value_counts()

label
[Instruction]                10
[Legal]                      10
[Information/Explanation]    10
[Opinion/Argumentation]      10
[Forum]                      10
[News]                       10
[Prose/Lyrical]              10
[Promotion]                  10
Name: count, dtype: int64

In [None]:
genre_file.shape

(1303402, 2)

## Add additional instances to the sample to achieve 10 instances per label

In [3]:
# Open the entire corpus, annotated with genres

#suffix = "uk-1.0"
#path = f"/cache/tajak/macocu-mt/datasets/annotated/MaCoCu-{suffix}.tsv-genre-annotated.jsonl"
#path = f"datasets/annotated/MaCoCu-sq-texts-with-genres.tsv"
path = "datasets/annotated/CLASSLA-web-sl-genre-annotated.jsonl"
#lang = suffix[:2]
lang = "sl"
source_lang = lang

# Open tsv (created by above process) or json (created with the python_extended.py script)

if ".tsv" in path[-6:]:
    df = pd.read_csv(path, sep="\t", index_col = 0)
elif ".json" in path[-6:]:
    df = pd.read_json(path, orient="records", lines=True)

display(df)

Unnamed: 0,document_id,genre,text
0,CLASSLA-web.sl.36,Promotion,Nakit /<p>Študij smeri NAKIT bo študentom nudi...
1,CLASSLA-web.sl.181,Information/Explanation,"Modna krila<p>Krila so oblačila, ki so namenje..."
2,CLASSLA-web.sl.203,Instruction,Električna kitara – dobro je vedeti<p>Električ...
3,CLASSLA-web.sl.259,Information/Explanation,Nacionalni inštitut za biologijo Oddelek za bi...
4,CLASSLA-web.sl.335,Instruction,Kako opremiti hodnik v stanovanju?<p>Hodnik la...
...,...,...,...
3958835,CLASSLA-web.sl.4062607,Information/Explanation,Uporaba: Za Land Rover Range Rover III SUV (LM...
3958836,CLASSLA-web.sl.4062894,Promotion,Pošlji prijatelju<p>100Sets 8 mm Multicolor St...
3958837,CLASSLA-web.sl.4063133,Promotion,Prednosti v primerjavi s HID & Halogenske žarn...
3958838,CLASSLA-web.sl.4063346,Promotion,"2020 Novih Moških Svile, Svileni Mreži majice ..."


Open the manually annotated sample as well.

In [6]:
# In case of first run of adding instances
#ann_df = pd.read_json(f"datasets/manually-evaluated/MaCoCu-{lang}-genre-sample-evaluated.jsonl", lines=True)

# In case of second run of adding instances
ann_df = pd.read_json(f"datasets/manually-evaluated/MaCoCu-{lang}-genre-sample-evaluated-complete-sample.jsonl", lines=True)

ann_df.describe(include="all")

Unnamed: 0,text_id,y_pred,text,translation,metadata,y_true
count,96,96,96,96,96,96
unique,96,9,96,96,96,10
top,CLASSLA-web.sl.1022390,News,Znova je čas za poletne bralne podvige – poziv...,It is time for summer reading feats - we urge ...,"CLASSLA-web.sl.1022390', 'domain': 'log.sik.si'}",Promotion
freq,1,12,1,1,1,13


In [7]:
# Filter out "Other" instances and "Problematic" instances so that we will do analysis on clear examples only
df_test_clean = ann_df[ann_df["y_pred"] != "Other"]
df_test_clean = df_test_clean[df_test_clean["y_true"] != "Multiple texts"]
df_test_clean = df_test_clean[df_test_clean["y_true"] != "Incomprehensible"]
df_test_clean = df_test_clean[df_test_clean["y_true"] != "Other"]

label_count_dict = df_test_clean["y_pred"].value_counts().to_dict()
label_count_dict

{'Opinion/Argumentation': 10,
 'Instruction': 10,
 'Promotion': 10,
 'Legal': 10,
 'Information/Explanation': 10,
 'Prose/Lyrical': 10,
 'Forum': 10,
 'News': 9}

In [8]:
# Calculate how many texts are missing
label_missing_count = {}

for item in list(label_count_dict.items()):
	if item[1] != 10:
		label_missing_count[item[0]] = int(10-item[1])

label_missing_count

{'News': 1}

In [9]:
# Get information about text ids so that we won't extract the same id
text_ids = df_test_clean["text_id"].to_list()
text_ids[:3]

['CLASSLA-web.sl.1087171', 'CLASSLA-web.sl.1215246', 'CLASSLA-web.sl.1230602']

For each label with fewer than 10 instances, take additional instances from the corpus. Make sure that the text ids of the new instances differ from those already in the sample.

In [10]:
# We will extract all labels that are missing, except Mix and Other
labels_list = list(label_missing_count.keys())

# First create the initial df to which all others in the loop will be added
final_sample = df[df["genre"] == labels_list[0]].sample(n=label_missing_count[labels_list[0]])

# Add all other genres
remaining_list = labels_list[1:]

for i in remaining_list:
	try:
		added_instances = df[df["genre"] == i].sample(n=label_missing_count[i])
		final_sample = pd.concat([final_sample, added_instances])
	except ValueError:
		# Fewer instances than requested are available for this label; show what exists
		print(df[df["genre"] == i][:2].to_markdown())

# Check whether any of the text ids are duplicated
new_text_ids = final_sample["document_id"].to_list()
# Modification to the code for sq
#new_text_ids = final_sample["text_id"].to_list()

combined_list = new_text_ids + text_ids
if len(combined_list) == len(set(combined_list)):
	print("No duplicated texts.")
else:
	print("There are duplicated texts, repeat the process.")

# Shuffle rows
final_sample = final_sample.sample(frac=1)

# Discard logit information
#final_sample = final_sample.drop(columns="logit")

# Change <p> signs to actual new lines
final_sample["text"] = final_sample["text"].str.replace("<p>", "\n\n")

sentence_list = final_sample["text"].to_list()

final_sample

No duplicated texts.


Unnamed: 0,document_id,genre,text
564682,CLASSLA-web.sl.907174,News,Tunel pod središčem Maribora? Prva poročila so...
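
The cell above samples first and checks for duplicates afterwards, so the process has to be repeated when a duplicate is drawn. A minimal sketch of an alternative is to exclude the already annotated ids before sampling. The helper names and toy data below are hypothetical. As in the cells above, a label with no remaining clean instances does not appear in the counts and is therefore not topped up.

```python
import random
from collections import Counter


def missing_per_label(predicted_labels, target=10):
    """Return {label: number of instances still needed to reach target}."""
    counts = Counter(predicted_labels)
    return {label: target - n for label, n in counts.items() if n < target}


def sample_additional(corpus, missing, used_ids, seed=0):
    """Sample the missing instances per label, skipping ids that are already used."""
    rng = random.Random(seed)
    extra = []
    for label, n_missing in missing.items():
        candidates = [
            doc for doc in corpus
            if doc["genre"] == label and doc["document_id"] not in used_ids
        ]
        extra.extend(rng.sample(candidates, min(n_missing, len(candidates))))
    return extra


# Toy data: the clean sample has 9 "News" and 10 "Legal" instances
kept_labels = ["News"] * 9 + ["Legal"] * 10
missing = missing_per_label(kept_labels)
used_ids = {"doc.1", "doc.2"}
corpus = [
    {"document_id": "doc.1", "genre": "News"},
    {"document_id": "doc.2", "genre": "News"},
    {"document_id": "doc.3", "genre": "News"},
    {"document_id": "doc.4", "genre": "Legal"},
]
extra = sample_additional(corpus, missing, used_ids)
```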


In [11]:
def extract_additional_genre_sample(sentence_list, source_lang, final_sample):
	import googletrans
	from googletrans import Translator

	# Apply Google Translate and machine translate the data - documentation: https://py-googletrans.readthedocs.io/en/latest/

	# Define the translation model
	translator = Translator()

	# Create the final list
	translation_GT = []

	print("Starting translation.")

	# The suffix that GT uses for all languages is the same as the suffix used in the dataset names, except for Montenegrin
	# GT does not have a special model for Montenegrin, so we will use the Serbian model
	if source_lang == "cnr":
		lang = "sr"
	else:
		lang = source_lang

	# Loop through the list of original sentences,
	# translate each and append the translation to the final list
	for i in sentence_list:
		# Translate the sentence from source language, e.g. Slovene (src = "sl") to English (dest = "en")
		current_translation = translator.translate(i, src = lang, dest='en')
		# Append the translated sentence to the final list
		translation_GT.append(current_translation.text)

	print("Translation finished.")

	# Append translations to the sample

	final_sample["translation"] = translation_GT

	# Save to JSON lines

	# In case of first run
	#final_sample.to_json("/cache/tajak/macocu-mt/datasets/samples/MaCoCu-{}-genre-sample-additional-instances.json".format(source_lang), orient="records", lines=True)

	#print("Final file saved as MaCoCu-{}-genre-sample-additional-instances.json".format(source_lang))

	# In case of second run
	final_sample.to_json("/cache/tajak/macocu-mt/datasets/samples/MaCoCu-{}-genre-sample-additional-instances-run2.json".format(source_lang), orient="records", lines=True)

	print("Final file saved as MaCoCu-{}-genre-sample-additional-instances-run2.json".format(source_lang))

	# Create also a version for the annotation tool, only with translation and labels
	ann_df = final_sample[["translation","genre"]]

	# For annotation, each label should be in a list
	ann_df["genre"] = ann_df["genre"].apply(lambda x:[x])

	# Rename df
	ann_df.columns = ["text", "label"]

	# Add metadata
	text_ids = final_sample["document_id"].to_list()
	#text_ids = final_sample["text_id"].to_list()
	#domains = final_sample["domain"].to_list()

	metadata_list = []

	for i in list(zip(text_ids)):#,domains)):
		metadata = {"text_id": i[0]}#, "domain": i[1]}
		metadata_list.append(metadata)

	ann_df["metadata"] = metadata_list

	# Save to JSON lines
	# In case of first run
	#ann_df.to_json("/cache/tajak/macocu-mt/datasets/samples/MaCoCu-{}-genre-sample-for-annotation-tool-additional-instances.json".format(source_lang), orient="records", lines=True)

	# In case of second run of adding instances
	ann_df.to_json("/cache/tajak/macocu-mt/datasets/samples/MaCoCu-{}-genre-sample-for-annotation-tool-additional-instances-run2.json".format(source_lang), orient="records", lines=True)

	print("File for annotation saved as MaCoCu-{}-genre-sample-for-annotation-tool-additional-instances-run2.json".format(source_lang))
	
	return final_sample

In [12]:
sample = extract_additional_genre_sample(sentence_list, source_lang, final_sample)

Starting translation.
Translation finished.
Final file saved as MaCoCu-sl-genre-sample-additional-instances-run2.json
File for annotation saved as MaCoCu-sl-genre-sample-for-annotation-tool-additional-instances-run2.json


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ann_df["genre"] = ann_df["genre"].apply(lambda x:[x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ann_df["metadata"] = metadata_list


In [13]:
genre_file = pd.read_json(f"/cache/tajak/macocu-mt/datasets/samples/MaCoCu-{lang}-genre-sample-for-annotation-tool-additional-instances-run2.json", orient="records", lines=True)

display(genre_file.head())

Unnamed: 0,text,label,metadata
0,Tunnel below the center of Maribor?The first r...,[News],{'text_id': 'CLASSLA-web.sl.907174'}


In [13]:
genre_file.label.value_counts()

label
[Prose/Lyrical]    1
Name: count, dtype: int64

# Prepare similar JSONL corpora from CLASSLA corpora

This is needed so that we can extract additional instances easily.

In [4]:
lang = "hr"
corpus_path = f"/cache/tajak/macocu-mt/datasets/initial/CLASSLA-web.{lang}.1.0.vert.gz"

if ".gz" in corpus_path:
	corpus = gzip.open(corpus_path, "rt")
else:
	corpus = open(corpus_path, "r")

In [5]:

# Open a new file to which we will append each json line
new_path = f"/cache/tajak/macocu-mt/datasets/annotated/CLASSLA-web-{lang}-genre-annotated.jsonl"
new_file = open(f"{new_path}", "wt")
#new_file.close()
#new_file = open("{}".format(new_path), "a")

text_id_re = re.compile('id="(.+?)"')
genre_re = re.compile('genre="(.+?)"')

text_counter = 0

In [6]:
for line in corpus:
	if line.startswith("<text"):
		current_text = {}
		text_string = ""
		current_text["document_id"] = text_id_re.search(line).group(1)
		current_text["genre"] = genre_re.search(line).group(1)
		current_text["text"] = ""
		current_length = 0
	elif line.startswith("<p"):
		continue
	elif line.startswith("<s"):
		continue
	elif line.startswith("</p"):
		text_string = text_string.rstrip()
		text_string += "<p>"
	elif line.startswith("</s"):
		continue
	elif line.startswith("<g"):
		# <g/> is a glue tag: there is no space between the previous and the next token,
		# so remove the trailing space after the last added word
		text_string = text_string.rstrip()
	elif line.startswith("</text>"):
		# Shorten the texts to first 512 tokens
		space_re=re.compile(r'\s+',re.UNICODE)
		text = ' '.join(space_re.sub(' ',text_string).split(' ')[:512])
		current_text["text"] = text
		current_length = len(text.split())
		if current_length > 75:
			#new_file.write("{}".format(current_text))
			#new_file.write("\n")
			new_file.write(json.dumps(current_text)+'\n')
			new_file.flush()
		text_counter += 1
		if text_counter%10 == 0:
			print("Processed {} files.".format(text_counter))
	else:
		current_line = line.split("\t")
		current_word = current_line[0]
		text_string += current_word
		text_string += " "

new_file.close()
print("Processing completed. The new file is saved as {}".format(new_path))

Processed 10 files.
Processed 20 files.
Processed 30 files.
...

Now you should be able to create a genre sample with the same code as for the other MaCoCu corpora.
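
The conversion loop above can also be sketched as a small generator that is easy to try on a few lines of VERT input. This is a minimal sketch that roughly mirrors the loop (token column first, `<g/>` as glue, `<p>` as paragraph marker, truncation to 512 whitespace-separated tokens, texts of 75 tokens or fewer dropped). The function name `vert_to_records`, its parameters and the toy input are hypothetical.

```python
import re

TEXT_ID_RE = re.compile(r'id="(.+?)"')
GENRE_RE = re.compile(r'genre="(.+?)"')


def vert_to_records(lines, max_tokens=512, min_tokens=76):
    """Yield one dict per <text> element found in VERT-formatted lines."""
    record, text = None, ""
    for line in lines:
        if line.startswith("<text"):
            record = {
                "document_id": TEXT_ID_RE.search(line).group(1),
                "genre": GENRE_RE.search(line).group(1),
            }
            text = ""
        elif line.startswith("</text"):
            tokens = text.split()[:max_tokens]  # keep the first max_tokens tokens
            if len(tokens) >= min_tokens:
                record["text"] = " ".join(tokens)
                yield record
        elif line.startswith("</p"):
            text = text.rstrip() + "<p>"  # mark the paragraph boundary
        elif line.startswith("<g"):
            text = text.rstrip()  # glue: no space before the next token
        elif line.startswith(("<p", "<s", "</s")):
            continue  # other structural tags carry no text
        else:
            text += line.split("\t")[0].strip() + " "  # first column is the word form


# Toy input with one short text; min_tokens is lowered so that it is kept
vert = [
    '<text id="demo.1" genre="News">',
    "<p>",
    "<s>",
    "Hello\tHello-n",
    "<g/>",
    ",\t,-p",
    "world\tworld-n",
    "</s>",
    "</p>",
    "</text>",
]
records = list(vert_to_records(vert, min_tokens=1))
```

In the toy input, the glue tag attaches the comma to `Hello`, and the closing paragraph tag appends `<p>` directly to the last word.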

# Old code

In [None]:
# Extract texts

texts = []
text_ids = []

with open("datasets/MaCoCu-sq.tsv") as f:
    for line in f:
        did, text = line.strip().split('\t')
        texts.append(text)
        text_ids.append(did)

print(len(texts), len(text_ids))

513904 513904


In [None]:
# Create a df out of the extracted texts and ids
text_file = pd.DataFrame({"text_id":text_ids,"text":texts})

text_file.head(3)

Unnamed: 0,text_id,text
0,macocu.mt.1,Bil-Filmat: Diversi Żgħażagħ Barranin Qaqoċċa ...
1,macocu.mt.2,Itlob biex jiġi jżurek Xhud taʼ Ġeħova biex ti...
2,macocu.mt.5,"tal-Għaqda Dilettanti Knisja ta' Lapsi, San Ġi..."


In [None]:
# Merge the dfs

df = text_file.merge(genre_file, on="text_id")

df.head()


Unnamed: 0,text_id,text,genre
0,macocu.mt.1,Bil-Filmat: Diversi Żgħażagħ Barranin Qaqoċċa ...,Other
1,macocu.mt.2,Itlob biex jiġi jżurek Xhud taʼ Ġeħova biex ti...,Prose/Lyrical
2,macocu.mt.5,"tal-Għaqda Dilettanti Knisja ta' Lapsi, San Ġi...",Information/Explanation
3,macocu.mt.7,"Intant, ħu pjaċir. U jekk tkun trid tgħaddilna...",Prose/Lyrical
4,macocu.mt.9,Qegħdin hawn biex ngħinuk. <p>Il-Family Planni...,Mix


In [None]:
df.shape

(513904, 3)

In [None]:
# Add text length information
df["text_length"] = df.text.apply(lambda x:len(x.split()))

df.head()

Unnamed: 0,text_id,text,genre,text_length
0,macocu.mt.1,Bil-Filmat: Diversi Żgħażagħ Barranin Qaqoċċa ...,Other,102
1,macocu.mt.2,Itlob biex jiġi jżurek Xhud taʼ Ġeħova biex ti...,Prose/Lyrical,124
2,macocu.mt.5,"tal-Għaqda Dilettanti Knisja ta' Lapsi, San Ġi...",Information/Explanation,297
3,macocu.mt.7,"Intant, ħu pjaċir. U jekk tkun trid tgħaddilna...",Prose/Lyrical,413
4,macocu.mt.9,Qegħdin hawn biex ngħinuk. <p>Il-Family Planni...,Mix,367


In [None]:
df.describe(include="all")

Unnamed: 0,text_id,text,genre,text_length
count,513904,513904,513904,513904.0
unique,513904,513904,10,
top,macocu.mt.1,Bil-Filmat: Diversi Żgħażagħ Barranin Qaqoċċa ...,Information/Explanation,
freq,1,1,210002,
mean,,,,423.462176
std,,,,126.537873
min,,,,76.0
25%,,,,366.0
50%,,,,512.0
75%,,,,512.0


In [None]:
print(df.genre.value_counts(normalize=True).to_markdown())

| genre                   |   proportion |
|:------------------------|-------------:|
| Information/Explanation |  0.408641    |
| Mix                     |  0.242082    |
| Forum                   |  0.101274    |
| Opinion/Argumentation   |  0.0943892   |
| Other                   |  0.0712001   |
| Prose/Lyrical           |  0.0700773   |
| Instruction             |  0.00454365  |
| Promotion               |  0.00443857  |
| News                    |  0.00332747  |
| Legal                   |  2.72424e-05 |


In [None]:
print(df.genre.value_counts().to_markdown())

| genre                   |   count |
|:------------------------|--------:|
| Information/Explanation |  210002 |
| Mix                     |  124407 |
| Forum                   |   52045 |
| Opinion/Argumentation   |   48507 |
| Other                   |   36590 |
| Prose/Lyrical           |   36013 |
| Instruction             |    2335 |
| Promotion               |    2281 |
| News                    |    1710 |
| Legal                   |      14 |


In [None]:
# Save the file as it is
df.to_csv("/cache/tajak/macocu-mt/MaCoCu-sq-texts-with-genres.tsv", sep="\t")

In [None]:
# Compare the file created with original code and the file created with the extended improved code
old_file = pd.read_csv("/cache/tajak/macocu-mt/MaCoCu-mt-texts-with-genres.tsv", sep="\t", index_col = 0)

display(old_file)

new_file = pd.read_json("/cache/tajak/macocu-mt/MaCoCu-mt-2.0.tsv-genre-annotated.jsonl", orient="records", lines=True)

display(new_file)

print(old_file.shape, new_file.shape)

Unnamed: 0,text_id,text,genre,text_length
0,macocu.mt.1,Bil-Filmat: Diversi Żgħażagħ Barranin Qaqoċċa ...,Other,102
1,macocu.mt.2,Itlob biex jiġi jżurek Xhud taʼ Ġeħova biex ti...,Prose/Lyrical,124
2,macocu.mt.5,"tal-Għaqda Dilettanti Knisja ta' Lapsi, San Ġi...",Information/Explanation,297
3,macocu.mt.7,"Intant, ħu pjaċir. U jekk tkun trid tgħaddilna...",Prose/Lyrical,413
4,macocu.mt.9,Qegħdin hawn biex ngħinuk. <p>Il-Family Planni...,Mix,367
...,...,...,...,...
513899,macocu.mt.537330,Events info <p>Aħbarijiet <p>Għaliex xogħol st...,Mix,454
513900,macocu.mt.537331,"Il-manifatturi, l-importaturi u l-utenti downs...",Information/Explanation,283
513901,macocu.mt.537332,Fażijiet Ħarsa ġenerali Proċess ta' valutazzjo...,Information/Explanation,153
513902,macocu.mt.537334,Id-deċiżjonijiet kollha ta' awtorizzazzjoni għ...,Information/Explanation,139


Unnamed: 0,document_id,text,genre,logit
0,macocu.mt.1,Bil-Filmat: Diversi Żgħażagħ Barranin Qaqoċċa ...,Other,"[6.211301326751709, -1.123962044715881, -2.108..."
1,macocu.mt.2,Itlob biex jiġi jżurek Xhud taʼ Ġeħova biex ti...,Prose/Lyrical,"[1.873501896858215, -0.670612275600433, -1.697..."
2,macocu.mt.5,"tal-Għaqda Dilettanti Knisja ta' Lapsi, San Ġi...",Information/Explanation,"[-0.45298385620117104, 8.026633262634277, -1.0..."
3,macocu.mt.7,"Intant, ħu pjaċir. U jekk tkun trid tgħaddilna...",Prose/Lyrical,"[0.431536048650741, -1.36893618106842, -1.2726..."
4,macocu.mt.9,Qegħdin hawn biex ngħinuk. <p>Il-Family Planni...,Mix,"[3.972864389419555, -0.131693825125694, -2.675..."
...,...,...,...,...
513899,macocu.mt.537330,Events info <p>Aħbarijiet <p>Għaliex xogħol st...,Mix,"[2.540858268737793, 4.555283546447754, -1.7210..."
513900,macocu.mt.537331,"Il-manifatturi, l-importaturi u l-utenti downs...",Information/Explanation,"[-0.7114518880844111, 8.28370189666748, -0.954..."
513901,macocu.mt.537332,Fażijiet Ħarsa ġenerali Proċess ta' valutazzjo...,Information/Explanation,"[-0.7427962422370911, 8.293700218200684, -0.53..."
513902,macocu.mt.537334,Id-deċiżjonijiet kollha ta' awtorizzazzjoni għ...,Information/Explanation,"[0.496592551469802, 7.515917778015137, -1.6420..."


(513904, 4) (513904, 4)


In [None]:
display(new_file.genre.value_counts())

display(old_file.genre.value_counts())

genre
Information/Explanation    210002
Mix                        124407
Forum                       52045
Opinion/Argumentation       48507
Other                       36590
Prose/Lyrical               36013
Instruction                  2335
Promotion                    2281
News                         1710
Legal                          14
Name: count, dtype: int64

genre
Information/Explanation    210002
Mix                        124407
Forum                       52045
Opinion/Argumentation       48507
Other                       36590
Prose/Lyrical               36013
Instruction                  2335
Promotion                    2281
News                         1710
Legal                          14
Name: count, dtype: int64
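
Identical value counts show that the two label distributions match, but not that every document received the same label in both files. A per-document check could look like the minimal sketch below. `label_disagreements` and the toy mappings are hypothetical; in practice the mappings would be built from the id and `genre` columns of the two files.

```python
def label_disagreements(old_labels, new_labels):
    """Return {id: (old, new)} for ids whose genre differs between the two mappings."""
    shared = old_labels.keys() & new_labels.keys()
    return {
        doc_id: (old_labels[doc_id], new_labels[doc_id])
        for doc_id in sorted(shared)
        if old_labels[doc_id] != new_labels[doc_id]
    }


# Toy example: both files have one "News" and one "Forum" text,
# yet the labels of the individual documents are swapped
old_labels = {"doc.1": "News", "doc.2": "Forum"}
new_labels = {"doc.1": "Forum", "doc.2": "News"}
diff = label_disagreements(old_labels, new_labels)
```

An empty result means that every shared document has the same label in both files.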

In [1]:
# Extract the ZIP archive with the files
import zipfile

folder = "datasets/MaCoCu-sq-1.0.xml.zip"

with zipfile.ZipFile(folder, 'r') as zip_ref:
    zip_ref.extractall()

In [46]:
def merge_prevert_with_sample(sample_path, prevert_path):
    from prevert import dataset

    # Open the file that has the texts with genres sample
    sample = pd.read_csv(sample_path, sep="\t", index_col = 0)

    # Extract the set of all text ids (set membership tests are fast)
    text_ids = set(sample.text_id)

    prevert_texts = {}
    domains = {}

    # Open the dataset with the prevert parser
    dset = dataset(prevert_path)

    # Loop through the documents in the MaCoCu prevert corpus and keep
    # each document whose id is in the genre sample
    for doc in tqdm(dset): # iterating through documents of a dataset
        current_text = ""
        current_text_id = doc.meta["id"]
        current_domain = doc.meta["domain"]
        if current_text_id in text_ids:
            for par in doc: # iterating through paragraphs of a document
                current_text += str(par)
                current_text += "\n"
            prevert_texts[current_text_id] = current_text
            domains[current_text_id] = current_domain
        else:
            continue

    print("Processing finished.")

    # Then append the new information to a df
    prevert_df = pd.DataFrame(list(zip(domains.keys(), domains.values(), prevert_texts.values())), columns=["text_id", 'domain', 'text'])

    # Merge the new dataframe to the sample dataframe
    final_df = sample.merge(prevert_df, on="text_id")

    # Remove unnecessary columns
    final_df = final_df.drop(columns=["Unnamed: 0", "text_x"])

    # Rename the text_y column
    final_df.rename(columns={"text_y":"text"}, inplace=True)

    # Save the file
    final_df.to_csv("{}-extracted-text-from-prevert.csv".format(sample_path))

    print("File created and saved as {}-extracted-text-from-prevert.csv".format(sample_path))

    return final_df

In [None]:
merge_prevert_with_sample("datasets/MaCoCu-sq-texts-with-genres.tsv-genre-sample.txt", "datasets/MaCoCu-sq-1.0.xml")

In [44]:
def prevert_enriched_sample_to_json(file_path, lang):
	"""Convert the genre sample, enriched with the paragraph-structured texts from the prevert corpus, to JSON Lines files and machine-translate the texts to English. This function is meant for corpora that were not extracted from VERT files, e.g. the SQ corpus.

		Args:
		- file_path: path to the genre sample created with the function merge_prevert_with_sample()
		- lang: language code of the source texts, e.g. "sq"
	"""

	sample_df = pd.read_csv(file_path, index_col = 0)

	# Apply Google Translate and machine translate the data
	import googletrans
	from googletrans import Translator

	# Define the translation model
	translator = Translator()

	# Create the final list
	translation_GT = []

	sentence_list = sample_df["text"].to_list()

	print("Starting translation.")

	# Loop through the list of original sentences,
	# translate each and append the translation to the final list
	for i in sentence_list:
		# Translate the sentence from source language, e.g. Slovene (src = "sl") to English (dest = "en")
		current_translation = translator.translate(i, src = lang, dest='en')
		# Append the translated sentence to the final list
		translation_GT.append(current_translation.text)

	print("Translation finished.")

	# Append translations to the sample

	sample_df["translation"] = translation_GT

	# Save to JSON lines
	sample_df.to_json("datasets/CLASSLA-web.{}.1.0.-translated-genre-sample.jsonl".format(lang), orient="records", lines=True)

	print("Final file saved as datasets/CLASSLA-web.{}.1.0.-translated-genre-sample.jsonl".format(lang))

	# Create also a version for the annotation tool, only with translation and labels
	ann_df = sample_df[["translation","genre"]]

	# For annotation, each label should be in a list
	ann_df["genre"] = ann_df["genre"].apply(lambda x:[x])

	# Rename df
	ann_df.columns = ["text", "label"]

	# Add metadata
	text_ids = sample_df["text_id"].to_list()
	domains = sample_df["domain"].to_list()

	metadata_list = []

	for i in list(zip(text_ids,domains)):
		metadata = {"text_id": i[0], "domain": i[1]}
		metadata_list.append(metadata)

	ann_df["metadata"] = metadata_list

	# Save to JSON lines
	ann_df.to_json("datasets/CLASSLA-web.{}.1.0.-translated-genre-sample-for-annotation.jsonl".format(lang), orient="records", lines=True)

	print("File for annotation saved as datasets/CLASSLA-web.{}.1.0.-translated-genre-sample-for-annotation.jsonl".format(lang))

	return sample_df


In [47]:
prevert_enriched_sample_to_json("datasets/MaCoCu-sq-texts-with-genres.tsv-genre-sample.txt-extracted-text-from-prevert.csv", "sq")

Starting translation.
Translation finished.
Final file saved as datasets/CLASSLA-web.sq.1.0.-translated-genre-sample.jsonl
File for annotation saved as datasets/CLASSLA-web.sq.1.0.-translated-genre-sample-for-annotation.jsonl


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ann_df["genre"] = ann_df["genre"].apply(lambda x:[x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ann_df["metadata"] = metadata_list


Unnamed: 0,text_id,genre,text_length,domain,text,translation
0,macocu.sq.1061396,Opinion/Argumentation,341,fjalaejetes.org,Blog\n\n“Unë të kam dashur me një dashuri të p...,"Blog\n\n""I loved you with eternal love.""Jer 31..."
1,macocu.sq.1408163,Promotion,231,zemrashqiptare.net,"Fronti Bashkimit Kombëtar Shqiptar (FBKSH), or...","The Albanian National Union Front (FBKSH), the..."
2,macocu.sq.183383,Legal,140,eukos.org,Liria nga keqtrajtimi\n\nKonventa e të Drejtav...,Freedom from mistreatment\n\nStudent Rights Co...
3,macocu.sq.1191613,News,115,sportekspres.com,"Një milimetër larg golit, “VAR” bëhet makth pë...","A millimeter away from goal, ""Var"" becomes nig..."
4,macocu.sq.1104611,Other,342,burimijetes.com,Ç’MENDONI JU?\n\nPyetje: Nena ime ka pare ende...,What do you think?\n\nQuestion: My mother has ...
...,...,...,...,...,...,...
85,macocu.sq.915191,Forum,235,intervista.al,Gazeta Intervista\n\nGazeta Intervista\n\nPas ...,Interview newspapers\n\nInterview newspapers\n...
86,macocu.sq.1060790,Information/Explanation,121,radioiliria.net,Aleksandra Stan dhe superhiti i saj “Mr.Saxobe...,"Alexandra Stan and her superhiti ""Mr.Axobeat""\..."
87,macocu.sq.581026,Opinion/Argumentation,437,novamedia.al,Përfundon operacioni i larjes se halesë se Edv...,The operation of Edvin Czech's dump mask is co...
88,macocu.sq.1118037,Other,223,burimijetes.com,A LEJOHET TË THEM “EL HAMDU LILAH” KUR TESHTIJ...,"Is it permissible to say ""Al Hamdu lilah"" when..."
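
The warning printed above appears because `ann_df` is created by selecting columns from `sample_df` and is then modified in place. A minimal sketch of one way to avoid it, assuming the intention is to leave `sample_df` untouched: take an explicit `.copy()` before adding or changing columns. The toy data frame below is only a stand-in that reuses the column names and the first two rows of the table above; the translation strings are placeholders.

```python
import pandas as pd

# Toy stand-in for sample_df; column names follow the table above,
# the translation strings are placeholders
sample_df = pd.DataFrame({
    "text_id": ["macocu.sq.1061396", "macocu.sq.1408163"],
    "genre": ["Opinion/Argumentation", "Promotion"],
    "domain": ["fjalaejetes.org", "zemrashqiptare.net"],
    "translation": ["first translated text", "second translated text"],
})

# .copy() makes ann_df an independent DataFrame, so the assignments
# below no longer target a slice of sample_df
ann_df = sample_df[["translation", "genre"]].copy()

# For annotation, each label should be in a list
ann_df["genre"] = ann_df["genre"].apply(lambda x: [x])
ann_df.columns = ["text", "label"]

# Build the metadata dictionaries in one pass
ann_df["metadata"] = [
    {"text_id": text_id, "domain": domain}
    for text_id, domain in zip(sample_df["text_id"], sample_df["domain"])
]

print(ann_df)
```

The original `sample_df` keeps its plain string labels, which matters because it is the object returned by the function.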


In [4]:
# Open the annotated part

annotated = pd.read_json("datasets/CLASSLA-sq-samo-oznaceni-primeri.jsonl", lines=True)
annotated.head(2)

Unnamed: 0,id,text,metadata,label,Comments
0,1,"Blog\n\n""I loved you with eternal love.""Jer 31...","{'text_id': 'macocu.sq.1061396', 'domain': 'fj...",[Opinion/Argumentation],[]
1,2,"The Albanian National Union Front (FBKSH), the...","{'text_id': 'macocu.sq.1408163', 'domain': 'ze...",[Other],[]


In [6]:
annotated.tail()

Unnamed: 0,id,text,metadata,label,Comments
49,50,"Altin Toska, CV\n\nThis is the vital curriculu...","{'text_id': 'macocu.sq.1616592', 'domain': 'pu...",[Information/Explanation],[]
50,51,Dermocosmetics\n\nLA ROCHE POSAY - TOLERIAN CA...,"{'text_id': 'macocu.sq.1486735', 'domain': 'fi...",[Promotion],[]
51,52,Does men's sleep worse with full moon?The stud...,"{'text_id': 'macocu.sq.797062', 'domain': 'val...",[Information/Explanation],[]
52,53,Dear reader!\n\nTitle: Dear Reader!14.09.10 6:...,"{'text_id': 'macocu.sq.1576314', 'domain': 'yo...",[Information/Explanation],[]
53,54,"For example, it is preparing to take responsib...","{'text_id': 'macocu.sq.463621', 'domain': 'rre...",[Opinion/Argumentation],[]


In [11]:
annotated.label.value_counts()

label
[Forum]                      8
[Information/Explanation]    8
[Prose/Lyrical]              7
[Instruction]                7
[Opinion/Argumentation]      6
[Legal]                      5
[Promotion]                  5
[News]                       4
[Other]                      2
[Incomprehensible]           1
[Multiple texts]             1
Name: count, dtype: int64
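
Each label is stored as a one-element list, which is why the counts above are keyed by `[Forum]` rather than `Forum`. A small sketch of how plain label strings could be counted instead, assuming that `Series.explode()` is acceptable here; it would also count every label separately if a text carried more than one. The labels below are toy values, not the annotated data.

```python
import pandas as pd

# Toy stand-in for annotated["label"]: every label wrapped in a list
labels = pd.Series([["Forum"], ["News"], ["Forum"], ["Legal"]])

# explode() turns each list element into its own row,
# so value_counts() works on plain strings
label_counts = labels.explode().value_counts()

print(label_counts)
```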

In [5]:
# Open the entire genre sample for annotation
sample = pd.read_json("datasets/CLASSLA-web.sq.1.0.-translated-genre-sample-for-annotation.jsonl", lines=True)

sample.head(2)

Unnamed: 0,text,label,metadata
0,"Blog\n\n""I loved you with eternal love.""Jer 31...",[Opinion/Argumentation],"{'text_id': 'macocu.sq.1061396', 'domain': 'fj..."
1,"The Albanian National Union Front (FBKSH), the...",[Promotion],"{'text_id': 'macocu.sq.1408163', 'domain': 'ze..."


In [8]:
sample[52:55]

Unnamed: 0,text,label,metadata
52,Dear reader!\n\nTitle: Dear Reader!14.09.10 6:...,[Other],"{'text_id': 'macocu.sq.1576314', 'domain': 'yo..."
53,"For example, it is preparing to take responsib...",[Opinion/Argumentation],"{'text_id': 'macocu.sq.463621', 'domain': 'rre..."
54,"Edi Rama's star, here's what symbolizes\n\nEdi...",[Opinion/Argumentation],"{'text_id': 'macocu.sq.372579', 'domain': 'jav..."
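
Comparing the outputs above shows that the manual labels in `annotated` do not always match the predicted labels in `sample` (for instance rows 1 and 52). A minimal sketch of how the two could be compared by `text_id` follows. It only uses the four rows visible in the outputs above, and the variable names are illustrative, not part of the notebook.

```python
# Rows copied from the outputs above: (text_id, manual label, predicted label)
rows = [
    ("macocu.sq.1061396", "Opinion/Argumentation", "Opinion/Argumentation"),
    ("macocu.sq.1408163", "Other", "Promotion"),
    ("macocu.sq.1576314", "Information/Explanation", "Other"),
    ("macocu.sq.463621", "Opinion/Argumentation", "Opinion/Argumentation"),
]

# Index both label sets by text_id so the comparison does not depend on row order
manual = {text_id: label for text_id, label, _ in rows}
predicted = {text_id: label for text_id, _, label in rows}

# Compare only the texts present in both sets
shared_ids = sorted(set(manual) & set(predicted))
disagreements = [
    (text_id, manual[text_id], predicted[text_id])
    for text_id in shared_ids
    if manual[text_id] != predicted[text_id]
]
agreement = 1 - len(disagreements) / len(shared_ids)

print(f"Agreement on {len(shared_ids)} texts: {agreement:.2f}")
for text_id, manual_label, predicted_label in disagreements:
    print(text_id, manual_label, predicted_label)
```

On the full data, the two dictionaries could be built from the `metadata` and `label` columns of `annotated` and `sample`.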


In [7]:
unanno = sample[54:]
unanno.head()

Unnamed: 0,text,label,metadata
54,"Edi Rama's star, here's what symbolizes\n\nEdi...",[Opinion/Argumentation],"{'text_id': 'macocu.sq.372579', 'domain': 'jav..."
55,"If you are always tired, you can actually suff...",[Instruction],"{'text_id': 'macocu.sq.1144450', 'domain': 'ar..."
56,"Enion Cala\n\nProduction, sale, import -export...",[Information/Explanation],"{'text_id': 'macocu.sq.1325438', 'domain': 'op..."
57,The secret falls!Now transparency on Albanians...,[News],"{'text_id': 'macocu.sq.959055', 'domain': 'shq..."
58,Dermedic Sunbrella Sun Protection Spf 50+\n\nP...,[Promotion],"{'text_id': 'macocu.sq.296193', 'domain': 'far..."


In [10]:
# Save the unannotated part
unanno.to_json("datasets/CLASSLA-sq-remaining-part-to-be-annotated.jsonl", orient="records", lines=True)
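
The slice `sample[54:]` relies on the annotated file containing exactly the first 54 rows of the sample, in the same order. A sketch of a split that does not depend on row positions, assuming the `text_id` in the `metadata` column identifies each text; the data frames and ids below are toy stand-ins.

```python
import pandas as pd

# Toy stand-ins for the full sample and its annotated part
sample = pd.DataFrame({
    "text": ["first text", "second text", "third text"],
    "label": [["News"], ["Forum"], ["Legal"]],
    "metadata": [
        {"text_id": "doc.1", "domain": "a.example"},
        {"text_id": "doc.2", "domain": "b.example"},
        {"text_id": "doc.3", "domain": "c.example"},
    ],
})
annotated = sample.iloc[:2].copy()

# Collect the ids that have already been annotated
annotated_ids = set(annotated["metadata"].apply(lambda m: m["text_id"]))

# Keep only the rows whose id is not among them
is_annotated = sample["metadata"].apply(lambda m: m["text_id"] in annotated_ids)
unanno = sample[~is_annotated]

print(unanno)
```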

In [None]:
# Open samples

sample_paths = {
    "hr": "datasets/CLASSLA-web.hr.1.0.vert.gz-sample.txt-genre-sample.txt",
    "mk": "datasets/CLASSLA-web.mk.1.0.vert.gz-sample.txt-genre-sample.txt",
    "sl": "datasets/CLASSLA-web.sl.1.0.vert.gz-sample.txt-genre-sample.txt",
    "sq": "datasets/MaCoCu-sq-texts-with-genres.tsv-genre-sample.txt",
}

In [None]:
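def genre_sample_to_json(sample_paths, lang):
	"""Convert a genre sample in TXT (tab-separated) to JSON Lines, restore the paragraph structure and add machine translations to English.

		Args:
		- sample_paths: dictionary that maps a language code to the path of the file created with the function extract_genre_sample
		- lang: language code used as the key in sample_paths: hr, mk or sl
	"""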
	
	sample_df = pd.read_csv(sample_paths[lang], sep="\t", index_col = 0)

	# Change <p> signs to actual new lines
	sample_df["text"] = sample_df["text"].str.replace("<p>", "\n\n")


	# Apply Google Translate and machine translate the data
	import googletrans
	from googletrans import Translator

	# Define the translation model
	translator = Translator()

	# Create the final list
	translation_GT = []

	sentence_list = sample_df["text"].to_list()

	print("Starting translation.")

	# Loop through the list of original texts,
	# translate each and append the translation to the final list
	for i in sentence_list:
		# Translate the text from the source language (src = lang) to English (dest = "en")
		current_translation = translator.translate(i, src = lang, dest='en')
		# Append the translated text to the final list
		translation_GT.append(current_translation.text)

	print("Translation finished.")

	# Append translations to the sample

	sample_df["translation"] = translation_GT

	# Save to JSON lines
	sample_df.to_json("datasets/CLASSLA-web.{}.1.0.-translated-genre-sample.jsonl".format(lang), orient="records", lines=True)

	print("Final file saved as datasets/CLASSLA-web.{}.1.0.-translated-genre-sample.jsonl".format(lang))

	# Create also a version for the annotation tool, only with translation and labels
	ann_df = sample_df[["translation","genre"]]

	# For annotation, each label should be in a list
	ann_df["genre"] = ann_df["genre"].apply(lambda x:[x])

	# Rename df
	ann_df.columns = ["text", "label"]

	# Add metadata
	text_ids = sample_df["text_id"].to_list()
	domains = sample_df["domain"].to_list()

	metadata_list = []

	for i in list(zip(text_ids,domains)):
		metadata = {"text_id": i[0], "domain": i[1]}
		metadata_list.append(metadata)

	ann_df["metadata"] = metadata_list

	# Save to JSON lines
	ann_df.to_json("datasets/CLASSLA-web.{}.1.0.-translated-genre-sample-for-annotation.jsonl".format(lang), orient="records", lines=True)

	print("File for annotation saved as datasets/CLASSLA-web.{}.1.0.-translated-genre-sample-for-annotation.jsonl".format(lang))

	return sample_df
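
Both conversion functions repeat the same translation loop, and a single failed request would discard all translations collected so far. The sketch below shows how the loop could be factored into a helper that records `None` for a failed text and continues. The names `translate_texts` and `translate_fn` are hypothetical, and the stand-in translator exists only to make the example runnable without a network connection.

```python
from typing import Callable, List, Optional


def translate_texts(
    texts: List[str], translate_fn: Callable[[str], str]
) -> List[Optional[str]]:
    """Translate each text with translate_fn; store None where a call fails."""
    translations: List[Optional[str]] = []
    for text in texts:
        try:
            translations.append(translate_fn(text))
        except Exception as error:  # e.g. a network error from the translation service
            print(f"Translation failed: {error}")
            translations.append(None)
    return translations


# Stand-in translator used only to demonstrate the helper
def fake_translate(text: str) -> str:
    if not text:
        raise ValueError("empty text")
    return text.upper()


result = translate_texts(["dober dan", "", "hvala"], fake_translate)
print(result)
```

With the `translator` object used in the functions above, `translate_fn` could be `lambda text: translator.translate(text, src=lang, dest="en").text`, which mirrors the call already made in the loop.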


In [None]:
sample_mk = genre_sample_to_json(sample_paths, "mk")

sample_mk.head()

Starting translation.
Translation finished.
Final file saved as datasets/CLASSLA-web.mk.1.0.-translated-genre-sample.jsonl
File for annotation saved as datasets/CLASSLA-web.mk.1.0.-translated-genre-sample-for-annotation.jsonl


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ann_df["genre"] = ann_df["genre"].apply(lambda x:[x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ann_df["metadata"] = metadata_list


Unnamed: 0,text_id,url,domain,genre,text,text_length,translation
15197,CLASSLA-web.mk.18600,https://radar.mk/?p=29847,radar.mk,Information/Explanation,Топењето на мразот е посебен феномен од 21-от ...,269,Melting ice is a special 21st century phenomen...
46057,CLASSLA-web.mk.138126,https://emiter.com.mk/napis/10922,emiter.com.mk,Information/Explanation,Широката распространетост на РЕЛ-заварувањето ...,204,The widespread distribution of the relief and ...
38999,CLASSLA-web.mk.921342,http://www.interesno.mk/nauka/38-nauka/50238-k...,interesno.mk,Instruction,Кои хороскопски знаци ќе ги погоди Розевата По...,195,What zodiac signs would hit the pink full moon...
48491,CLASSLA-web.mk.384653,https://www.mn.mk/pesni-za-makedonija/6075-Dob...,mn.mk,Prose/Lyrical,Валентина Ѓоргиевска Парго Добриот човек го им...,373,Valentina Gjorgievska Pargo The good man has s...
99412,CLASSLA-web.mk.1050389,"http://forum.carclub.mk/index.php/topic,102.ms...",forum.carclub.mk,Forum,Провери си тука http://www.autobulbsdirect.co....,440,Check out here http://www.autobulbsdirect.co.u...
