# Post-processing of predicted data

Based on the findings of the manual analysis, we post-process the predictions to get reliable results:
* predictions are removed from texts labeled as "Other" or "Forum"
* predictions are removed from texts where the prediction confidence is below 0.9

The analysis of the results (the connection between genres and language varieties) is then done on the post-processed data.
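
The two rules above can also be expressed as a single boolean mask. The following is a minimal sketch, not the notebook's actual code: the helper name `postprocess_labels` and the toy data are hypothetical, while the column names `X-GENRE` and `chosen_category_distr` are taken from the corpus file.

```python
import pandas as pd


def postprocess_labels(df, label_col="X-GENRE", conf_col="chosen_category_distr",
                       excluded=("Other", "Forum"), threshold=0.9):
    """Return the final labels, with NaN wherever the prediction is discarded.

    Hypothetical helper: a prediction is discarded if its label is in
    `excluded` or if its confidence is strictly below `threshold`.
    """
    discard = df[label_col].isin(list(excluded)) | (df[conf_col] < threshold)
    # Series.mask replaces the values where the condition is True with NaN
    return df[label_col].mask(discard)


# Toy data for illustration only (not taken from the corpus)
toy = pd.DataFrame({
    "X-GENRE": ["News", "Other", "Forum", "Legal", "Promotion"],
    "chosen_category_distr": [0.99, 0.95, 0.97, 0.42, 0.90],
})
toy["final-X-GENRE"] = postprocess_labels(toy)
print(toy)
```

Note that a confidence of exactly 0.9 is kept, because the rule only removes predictions below the threshold.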

In [1]:
import pandas as pd
import numpy as np

In [4]:
corpus = pd.read_csv("Macocu-sl-en-predicted.csv", index_col = 0)
corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,sl_source,en_domain,sl_domain,average_score,en_doc,sl_doc,en_length,sl_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
2584979,No,sl-orig,http://15.liffe.si/?lang_chg=en,B,B,http://15.liffe.si/?lang_chg=sl,15.liffe.si,15.liffe.si,0.936808,It went out with a bang. The evening sparkled ...,Končalo se je razburljivo in z razkošjem. Veče...,574,463,0.103501,Opinion/Argumentation,"{'Other': 0.0003, 'Information/Explanation': 0...",0.988794,Opinion/Argumentation
1212933,No,sl-orig,http://16.liffe.si/?lang_chg=en,B,B,http://16.liffe.si/index.php?menu_item=domov,16.liffe.si,16.liffe.si,0.9,Some days ago the organisers of the 17th Liffe...,Pred dnevi smo se iz 59. mednarodnega filmskeg...,293,184,0.07622,News,"{'Other': 0.0009, 'Information/Explanation': 0...",0.9616,News
598330,Yes,sl-orig,http://17.liffe.si/?lang_chg=en,B,B,http://17.liffe.si/?lang_chg=sl,17.liffe.si,17.liffe.si,0.957875,17th LIFFe was brought to an end with the best...,S podelitvijo nagrad in predvajanjem Režiserja...,445,418,0.07393,News,"{'Other': 0.0001, 'Information/Explanation': 0...",0.997264,News


In [3]:
# See initial number of texts
corpus.shape

(101807, 17)

In [4]:
corpus["X-GENRE"].value_counts()

Information/Explanation    32368
Promotion                  31384
News                       13605
Instruction                10846
Legal                       5866
Opinion/Argumentation       4863
Other                       2194
Forum                        405
Prose/Lyrical                276
Name: X-GENRE, dtype: int64

In [5]:
# Post-process the data

# Copy all predicted labels to a new column, except if the label is "Other"
corpus["final-X-GENRE"] = np.where(corpus["X-GENRE"] == "Other", np.nan, corpus["X-GENRE"])

corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,sl_source,en_domain,sl_domain,average_score,en_doc,sl_doc,en_length,sl_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
2584979,No,sl-orig,http://15.liffe.si/?lang_chg=en,B,B,http://15.liffe.si/?lang_chg=sl,15.liffe.si,15.liffe.si,0.936808,It went out with a bang. The evening sparkled ...,Končalo se je razburljivo in z razkošjem. Veče...,574,463,0.103501,Opinion/Argumentation,"{'Other': 0.0003, 'Information/Explanation': 0...",0.988794,Opinion/Argumentation
1212933,No,sl-orig,http://16.liffe.si/?lang_chg=en,B,B,http://16.liffe.si/index.php?menu_item=domov,16.liffe.si,16.liffe.si,0.9,Some days ago the organisers of the 17th Liffe...,Pred dnevi smo se iz 59. mednarodnega filmskeg...,293,184,0.07622,News,"{'Other': 0.0009, 'Information/Explanation': 0...",0.9616,News
598330,Yes,sl-orig,http://17.liffe.si/?lang_chg=en,B,B,http://17.liffe.si/?lang_chg=sl,17.liffe.si,17.liffe.si,0.957875,17th LIFFe was brought to an end with the best...,S podelitvijo nagrad in predvajanjem Režiserja...,445,418,0.07393,News,"{'Other': 0.0001, 'Information/Explanation': 0...",0.997264,News


In [6]:
corpus.describe(include="all")

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,sl_source,en_domain,sl_domain,average_score,en_doc,sl_doc,en_length,sl_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
count,101807,101807,101807,101807,101807,101807,101807,101807,101807.0,101807,101807,101807.0,101807.0,101807.0,101807,101807,101807.0,99613
unique,2,2,101807,4,4,92708,6066,6066,,101807,92544,,,,9,42318,,8
top,No,sl-orig,http://15.liffe.si/?lang_chg=en,B,B,https://www.sofascore.com/sl/ekipa/nogomet/vik...,oblacila.si,oblacila.si,,It went out with a bang. The evening sparkled ...,"Ali se strinjate, da na vaš računalnik namesti...",,,,Information/Explanation,"{'Other': 0.0001, 'Information/Explanation': 0...",,Information/Explanation
freq,89024,90537,1,42890,57737,9,3600,3600,,1,30,,,,32368,5988,,32368
mean,,,,,,,,,0.897452,,,428.811084,495.158761,0.092997,,,0.970066,
std,,,,,,,,,0.063443,,,1694.062268,2320.090506,0.027555,,,0.089027,
min,,,,,,,,,0.502,,,75.0,2.0,0.015,,,0.247184,
25%,,,,,,,,,0.868429,,,119.0,93.0,0.07483,,,0.995622,
50%,,,,,,,,,0.913667,,,190.0,165.0,0.089552,,,0.998666,
75%,,,,,,,,,0.942684,,,346.0,324.0,0.106952,,,0.998966,


In [7]:
# Copy all predicted labels to a column "final-X-GENRE", except if the label is "Forum"
corpus["final-X-GENRE"] = np.where(corpus["final-X-GENRE"] == "Forum", np.nan, corpus["final-X-GENRE"])

In [8]:
# Copy all predicted labels to a column "final-X-GENRE", except if the prediction confidence is lower than 0.9
corpus["final-X-GENRE"] = np.where(corpus["chosen_category_distr"] < 0.9, np.nan, corpus["final-X-GENRE"])

In [9]:
# See the final distribution
corpus.describe(include="all")

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,sl_source,en_domain,sl_domain,average_score,en_doc,sl_doc,en_length,sl_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
count,101807,101807,101807,101807,101807,101807,101807,101807,101807.0,101807,101807,101807.0,101807.0,101807.0,101807,101807,101807.0,91459
unique,2,2,101807,4,4,92708,6066,6066,,101807,92544,,,,9,42318,,7
top,No,sl-orig,http://15.liffe.si/?lang_chg=en,B,B,https://www.sofascore.com/sl/ekipa/nogomet/vik...,oblacila.si,oblacila.si,,It went out with a bang. The evening sparkled ...,"Ali se strinjate, da na vaš računalnik namesti...",,,,Information/Explanation,"{'Other': 0.0001, 'Information/Explanation': 0...",,Information/Explanation
freq,89024,90537,1,42890,57737,9,3600,3600,,1,30,,,,32368,5988,,30307
mean,,,,,,,,,0.897452,,,428.811084,495.158761,0.092997,,,0.970066,
std,,,,,,,,,0.063443,,,1694.062268,2320.090506,0.027555,,,0.089027,
min,,,,,,,,,0.502,,,75.0,2.0,0.015,,,0.247184,
25%,,,,,,,,,0.868429,,,119.0,93.0,0.07483,,,0.995622,
50%,,,,,,,,,0.913667,,,190.0,165.0,0.089552,,,0.998666,
75%,,,,,,,,,0.942684,,,346.0,324.0,0.106952,,,0.998966,


In [4]:
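# Share of all texts whose prediction was removed because of low confidence (< 0.9),
# not counting the texts already removed as "Other" or "Forum"
7749/101807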

0.07611460901509719
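
The hard-coded 7749 is consistent with the counts shown earlier: 99613 labels remain after removing "Other", 405 texts are labeled "Forum", and 91459 labels remain at the end, so 99613 − 405 − 91459 = 7749. As a hedged sketch, the same breakdown could be derived directly from the data instead of being typed in by hand. The helper name `removal_breakdown` and the toy data are hypothetical.

```python
import pandas as pd


def removal_breakdown(df, label_col="X-GENRE", conf_col="chosen_category_distr",
                      threshold=0.9):
    """Count how many predictions each post-processing rule removes (hypothetical helper)."""
    is_other = df[label_col] == "Other"
    is_forum = df[label_col] == "Forum"
    # Low-confidence texts that were not already removed by the label rules
    low_conf = (df[conf_col] < threshold) & ~is_other & ~is_forum
    return {
        "Other": int(is_other.sum()),
        "Forum": int(is_forum.sum()),
        "low confidence": int(low_conf.sum()),
        "share low confidence": float(low_conf.sum() / len(df)),
    }


# Toy data for illustration only (not taken from the corpus)
toy = pd.DataFrame({
    "X-GENRE": ["News", "Other", "Forum", "Legal", "Forum"],
    "chosen_category_distr": [0.5, 0.3, 0.99, 0.95, 0.2],
})
breakdown = removal_breakdown(toy)
print(breakdown)
```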

In [5]:
print(corpus["final-X-GENRE"].value_counts(normalize=True).to_markdown())

|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Information/Explanation |      0.331373   |
| Promotion               |      0.323959   |
| News                    |      0.13347    |
| Instruction             |      0.107163   |
| Legal                   |      0.0581353  |
| Opinion/Argumentation   |      0.0435168  |
| Prose/Lyrical           |      0.00238358 |


In [11]:
LABELS = list(corpus["final-X-GENRE"].unique())
print(LABELS)

['Opinion/Argumentation', 'News', nan, 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']
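
The list above contains `nan` because `unique()` also returns missing values. If only the actual genre labels are needed (for instance to loop over them later instead of typing the list by hand), one option is to drop the missing values first. A small sketch on toy data; the variable names are hypothetical.

```python
import pandas as pd

# Toy column for illustration only; None and NaN stand for removed predictions
final_labels = pd.Series(["News", None, "Legal", "News", float("nan")],
                         name="final-X-GENRE")

# dropna() removes the missing values before the unique labels are collected
labels = sorted(final_labels.dropna().unique())
print(labels)
```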


In [12]:
# Save the new file
corpus.to_csv("MaCoCu-sl-en-data/Macocu-sl-en-predicted.csv")

In [22]:
# Analyze English domains in the corpus: absolute count and percentage of texts per domain
count = pd.DataFrame({"Count": list(corpus.en_domain.value_counts()), "Percentage": list(corpus.en_domain.value_counts(normalize=True)*100)}, index = corpus.en_domain.value_counts().index)

print(count.to_markdown())

|                                            |   Count |   Percentage |
|:-------------------------------------------|--------:|-------------:|
| oblacila.si                                |    3600 |  3.5361      |
| europarl.europa.eu                         |    2444 |  2.40062     |
| eur-lex.europa.eu                          |    2128 |  2.09023     |
| eu2008.si                                  |    1355 |  1.33095     |
| gov.si                                     |    1087 |  1.06771     |
| support.apple.com                          |    1001 |  0.983233    |
| ricinus2.mf.uni-lj.si                      |     974 |  0.956712    |
| ung.si                                     |     908 |  0.891884    |
| kibla.org                                  |     825 |  0.810357    |
| policija.si                                |     747 |  0.733741    |
| dk.um.si                                   |     746 |  0.732759    |
| ec.europa.eu                               |     740 |  0.7268

In [19]:
# See the distribution of genres in the most frequent domains:
frequent_domains = ["oblacila.si", "europarl.europa.eu", "eur-lex.europa.eu", "eu2008.si","gov.si"]

for domain in frequent_domains:
	print(domain)
	filtered_corpus = corpus[corpus["en_domain"] == domain]
	print(filtered_corpus["final-X-GENRE"].value_counts(normalize=True).to_markdown())


oblacila.si
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Promotion               |     0.952122    |
| Information/Explanation |     0.0444381   |
| Instruction             |     0.00258028  |
| News                    |     0.000286697 |
| Opinion/Argumentation   |     0.000286697 |
| Prose/Lyrical           |     0.000286697 |
europarl.europa.eu
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Legal                   |      0.403617   |
| News                    |      0.387435   |
| Information/Explanation |      0.190386   |
| Instruction             |      0.0118991  |
| Promotion               |      0.00666349 |
eur-lex.europa.eu
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Legal                   |      0.843859   |
| Information/Explanation |      0.103928   |
| News                    |      0.0452511  |
| Instruction             |    

In [24]:
# Analyze which domains are the most frequent in each genre (top 5 domains per genre)

for genre in ['Opinion/Argumentation', 'News', 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']:
    print(genre)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == genre]
    print(filtered_corpus["en_domain"].value_counts(normalize=True)[:5].to_markdown())

Opinion/Argumentation
|                       |   en_domain |
|:----------------------|------------:|
| ourspace.si           |   0.0494975 |
| kibla.org             |   0.0268844 |
| cityofwomen.org       |   0.0238693 |
| bsf.si                |   0.021608  |
| mrezni-muzej.mg-lj.si |   0.0193467 |
News
|                    |   en_domain |
|:-------------------|------------:|
| eu2008.si          |   0.0825756 |
| europarl.europa.eu |   0.0666831 |
| policija.si        |   0.0470222 |
| gov.si             |   0.0465307 |
| lek.si             |   0.0319489 |
Legal
|                    |   en_domain |
|:-------------------|------------:|
| eur-lex.europa.eu  |   0.319165  |
| europarl.europa.eu |   0.159488  |
| gov.si             |   0.0253903 |
| us-rs.si           |   0.0199361 |
| fu.gov.si          |   0.0126011 |
Information/Explanation
|                       |   en_domain |
|:----------------------|------------:|
| ricinus2.mf.uni-lj.si |   0.0321378 |
| dk.um.si              |

In [5]:
# Analyze differences in genres based on language varieties - print raw scores
print("\n\nDistribution of English varieties in genres - raw scores:\n\n")

print("Labels in this order: 'News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical'\n\n")

for genre in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical']:
	filtered_corpus = corpus[corpus["final-X-GENRE"] == genre]
	# Raw counts of English varieties, sorted alphabetically by variety code
	print(dict(sorted(filtered_corpus["en_var_doc"].value_counts().to_dict().items())))




Distribution of English varieties in genres - raw scores:


Labels in this order: 'News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical'


{'A': 1122, 'B': 6761, 'MIX': 541, 'UNK': 3783}
{'A': 686, 'B': 1716, 'MIX': 262, 'UNK': 1316}
{'A': 6608, 'B': 10522, 'MIX': 2327, 'UNK': 10172}
{'A': 2112, 'B': 2526, 'MIX': 333, 'UNK': 4830}
{'A': 4361, 'B': 13091, 'MIX': 1932, 'UNK': 10923}
{'A': 317, 'B': 3667, 'MIX': 196, 'UNK': 1137}
{'A': 54, 'B': 72, 'MIX': 20, 'UNK': 72}


In [6]:

# Analyze differences in genres based on language varieties - print normalized scores

print("\n\nDistribution of English varieties in genres - normalized scores:\n\n")

for genre in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical']:
	filtered_corpus = corpus[corpus["final-X-GENRE"] == genre]
	# Percentages of English varieties (rounded to 2 decimals), sorted alphabetically by variety code
	print((dict(sorted((filtered_corpus["en_var_doc"].value_counts(normalize = True)*100).round(2).to_dict().items()))))

print("Differences in language varieties distribution per genres analyzed.")




Distribution of English varieties in genres - normalized scores:


{'A': 9.19, 'B': 55.39, 'MIX': 4.43, 'UNK': 30.99}
{'A': 17.24, 'B': 43.12, 'MIX': 6.58, 'UNK': 33.07}
{'A': 22.3, 'B': 35.51, 'MIX': 7.85, 'UNK': 34.33}
{'A': 21.55, 'B': 25.77, 'MIX': 3.4, 'UNK': 49.28}
{'A': 14.39, 'B': 43.19, 'MIX': 6.37, 'UNK': 36.04}
{'A': 5.96, 'B': 68.97, 'MIX': 3.69, 'UNK': 21.38}
{'A': 24.77, 'B': 33.03, 'MIX': 9.17, 'UNK': 33.03}
Differences in language varieties distribution per genres analyzed.
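
The raw and normalized distributions printed above come as unlabeled dictionaries, one per genre. As an alternative, a contingency table keeps the genre names next to the numbers. This is a minimal sketch using `pd.crosstab` on toy data; the values below are invented for illustration and the variable names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only (not taken from the corpus)
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "Legal", "Legal", "Legal", None],
    "en_var_doc": ["A", "B", "B", "B", "UNK", "A"],
})

# Keep only texts that still have a label after post-processing
labeled = toy.dropna(subset=["final-X-GENRE"])

# Raw counts: one row per genre, one column per English variety
raw = pd.crosstab(labeled["final-X-GENRE"], labeled["en_var_doc"])

# Row-normalized percentages: each genre row sums to 100 (up to rounding)
pct = (pd.crosstab(labeled["final-X-GENRE"], labeled["en_var_doc"],
                   normalize="index") * 100).round(2)

print(raw)
print(pct)
```

With `normalize="index"`, each row is divided by its own total, which corresponds to calling `value_counts(normalize=True)` separately for every genre.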


In [6]:
# Analyze differences in genres based on language varieties (normalized, as markdown tables)

for genre in ['Opinion/Argumentation', 'News', 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']:
    print(genre)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == genre]
    print(filtered_corpus["en_var_doc"].value_counts(normalize=True).to_markdown())

Opinion/Argumentation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.431156  |
| UNK |    0.330653  |
| A   |    0.172362  |
| MIX |    0.0658291 |
News
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.553863  |
| UNK |    0.309904  |
| A   |    0.0919145 |
| MIX |    0.0443188 |
Legal
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.689675  |
| UNK |    0.213842  |
| A   |    0.0596201 |
| MIX |    0.0368629 |
Information/Explanation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.431946  |
| UNK |    0.360412  |
| A   |    0.143894  |
| MIX |    0.0637476 |
Promotion
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.355125  |
| UNK |    0.343312  |
| A   |    0.223025  |
| MIX |    0.0785379 |
Instruction
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.492807  |
| B   |    0.257729  |
| A   |    0.215488  |
| MIX |    0.0339761 |
Prose/Lyrical
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.330275  |
| UNK |   

In [7]:
# Length distribution of the entire corpus
print(corpus["en_length"].describe().to_markdown())

|       |   en_length |
|:------|------------:|
| count |  101807     |
| mean  |     428.811 |
| std   |    1694.06  |
| min   |      75     |
| 25%   |     119     |
| 50%   |     190     |
| 75%   |     346     |
| max   |   98761     |
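
The same length summary can also be broken down by genre in a single table. The sketch below uses `groupby(...).describe()` on toy data, under the assumption that texts without a final label should be left out (pandas drops missing group keys by default). The values are invented for illustration.

```python
import pandas as pd

# Toy data for illustration only (not taken from the corpus)
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "Legal", "Legal", None],
    "en_length": [100, 300, 200, 1000, 50],
})

# One row per genre, with count, mean, std, min, quartiles and max as columns;
# the row with a missing label is excluded automatically
summary = toy.groupby("final-X-GENRE")["en_length"].describe()
print(summary)
```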


In [24]:
# Analyze differences in genres based on text length

for i in ['Opinion/Argumentation', 'News', 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_length"].describe().to_markdown())

Opinion/Argumentation
|       |   en_length |
|:------|------------:|
| count |    3980     |
| mean  |     459.014 |
| std   |     931.279 |
| min   |      75     |
| 25%   |     132     |
| 50%   |     230     |
| 75%   |     473.25  |
| max   |   25411     |
News
|       |   en_length |
|:------|------------:|
| count |   12207     |
| mean  |     428.744 |
| std   |    1417.89  |
| min   |      75     |
| 25%   |     136     |
| 50%   |     232     |
| 75%   |     426     |
| max   |   75277     |
Legal
|       |   en_length |
|:------|------------:|
| count |     5317    |
| mean  |     2164.62 |
| std   |     5914.44 |
| min   |       75    |
| 25%   |      191    |
| 50%   |      429    |
| 75%   |     1299    |
| max   |    98761    |
Information/Explanation
|       |   en_length |
|:------|------------:|
| count |   30307     |
| mean  |     333.969 |
| std   |    1020.25  |
| min   |      75     |
| 25%   |     115     |
| 50%   |     179     |
| 75%   |     305     |
| max  