# Post-processing of predicted data

Based on the findings of the manual analysis, we post-process the data to obtain more reliable results:
* predictions are removed from texts labeled "Other" or "Forum"
* predictions are removed from texts where the prediction confidence is below 0.9

The results (the relationship between genres and language varieties) are then analysed on the post-processed data.
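
The two rules above can be sketched as a single helper. This is a minimal illustration on toy data, not the notebook's own code: the function name `postprocess_genres` and its parameters are hypothetical, while the column names `X-GENRE` and `chosen_category_distr` are the ones used in the corpus loaded below.

```python
import pandas as pd


def postprocess_genres(df, label_col="X-GENRE", conf_col="chosen_category_distr",
                       drop_labels=("Other", "Forum"), min_confidence=0.9):
    """Return the predicted labels with unreliable predictions set to NaN.

    Hypothetical helper: a prediction counts as unreliable if its label is in
    `drop_labels` or its confidence is strictly below `min_confidence`.
    """
    unreliable = df[label_col].isin(drop_labels) | (df[conf_col] < min_confidence)
    # Series.mask replaces the values where the condition is True with NaN
    return df[label_col].mask(unreliable)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "X-GENRE": ["News", "Other", "Forum", "Legal", "Promotion"],
    "chosen_category_distr": [0.99, 0.95, 0.97, 0.85, 0.90],
})
toy["final-X-GENRE"] = postprocess_genres(toy)
print(toy)
```

Note that a confidence of exactly 0.9 is kept, because the rule only removes predictions strictly below the threshold.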

In [1]:
import pandas as pd
import numpy as np

In [3]:
corpus = pd.read_csv("MaCoCu-sl-en-data/Macocu-sl-en-predicted.csv", sep = "\t", index_col = 0)
corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,sl_source,en_domain,sl_domain,average_score,en_doc,sl_doc,en_length,sl_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr
2584979,No,sl-orig,http://15.liffe.si/?lang_chg=en,B,B,http://15.liffe.si/?lang_chg=sl,15.liffe.si,15.liffe.si,0.936808,It went out with a bang. The evening sparkled ...,Končalo se je razburljivo in z razkošjem. Veče...,574,463,0.103501,Opinion/Argumentation,"{'Other': 0.0003, 'Information/Explanation': 0...",0.988794
1212933,No,sl-orig,http://16.liffe.si/?lang_chg=en,B,B,http://16.liffe.si/index.php?menu_item=domov,16.liffe.si,16.liffe.si,0.9,Some days ago the organisers of the 17th Liffe...,Pred dnevi smo se iz 59. mednarodnega filmskeg...,293,184,0.07622,News,"{'Other': 0.0009, 'Information/Explanation': 0...",0.9616
598330,Yes,sl-orig,http://17.liffe.si/?lang_chg=en,B,B,http://17.liffe.si/?lang_chg=sl,17.liffe.si,17.liffe.si,0.957875,17th LIFFe was brought to an end with the best...,S podelitvijo nagrad in predvajanjem Režiserja...,445,418,0.07393,News,"{'Other': 0.0001, 'Information/Explanation': 0...",0.997264


In [4]:
# See initial number of texts
corpus.shape

(101807, 17)

In [6]:
corpus["X-GENRE"].value_counts()

Information/Explanation    32368
Promotion                  31384
News                       13605
Instruction                10846
Legal                       5866
Opinion/Argumentation       4863
Other                       2194
Forum                        405
Prose/Lyrical                276
Name: X-GENRE, dtype: int64
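
Of the nine predicted labels, "Other" (2,194 texts) and "Forum" (405 texts) are the two that the post-processing discards. As a quick check of how much data this affects, the sketch below recomputes their combined share from the counts printed above; the variable names are illustrative only.

```python
import pandas as pd

# Label counts copied from the value_counts() output above
counts = pd.Series({
    "Information/Explanation": 32368,
    "Promotion": 31384,
    "News": 13605,
    "Instruction": 10846,
    "Legal": 5866,
    "Opinion/Argumentation": 4863,
    "Other": 2194,
    "Forum": 405,
    "Prose/Lyrical": 276,
})

drop_labels = ["Other", "Forum"]
n_dropped = counts[drop_labels].sum()
share = n_dropped / counts.sum()
print(f"{n_dropped} of {counts.sum()} texts ({share:.2%}) carry a label that is removed")
```

The label-based rule therefore touches only a small fraction of the corpus; the confidence threshold is applied separately.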

In [8]:
# Post-process the data

# Copy all predicted labels to a new column, except if the label is "Other"
corpus["final-X-GENRE"] = np.where(corpus["X-GENRE"] == "Other", np.nan, corpus["X-GENRE"])

corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,sl_source,en_domain,sl_domain,average_score,en_doc,sl_doc,en_length,sl_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
2584979,No,sl-orig,http://15.liffe.si/?lang_chg=en,B,B,http://15.liffe.si/?lang_chg=sl,15.liffe.si,15.liffe.si,0.936808,It went out with a bang. The evening sparkled ...,Končalo se je razburljivo in z razkošjem. Veče...,574,463,0.103501,Opinion/Argumentation,"{'Other': 0.0003, 'Information/Explanation': 0...",0.988794,Opinion/Argumentation
1212933,No,sl-orig,http://16.liffe.si/?lang_chg=en,B,B,http://16.liffe.si/index.php?menu_item=domov,16.liffe.si,16.liffe.si,0.9,Some days ago the organisers of the 17th Liffe...,Pred dnevi smo se iz 59. mednarodnega filmskeg...,293,184,0.07622,News,"{'Other': 0.0009, 'Information/Explanation': 0...",0.9616,News
598330,Yes,sl-orig,http://17.liffe.si/?lang_chg=en,B,B,http://17.liffe.si/?lang_chg=sl,17.liffe.si,17.liffe.si,0.957875,17th LIFFe was brought to an end with the best...,S podelitvijo nagrad in predvajanjem Režiserja...,445,418,0.07393,News,"{'Other': 0.0001, 'Information/Explanation': 0...",0.997264,News


In [12]:
corpus.describe(include="all")

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,sl_source,en_domain,sl_domain,average_score,en_doc,sl_doc,en_length,sl_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
count,101807,101807,101807,101807,101807,101807,101807,101807,101807.0,101807,101807,101807.0,101807.0,101807.0,101807,101807,101807.0,99208
unique,2,2,101807,4,4,92708,6066,6066,,101807,92544,,,,9,42318,,7
top,No,sl-orig,http://15.liffe.si/?lang_chg=en,B,B,https://www.sofascore.com/sl/ekipa/nogomet/vik...,oblacila.si,oblacila.si,,It went out with a bang. The evening sparkled ...,"Ali se strinjate, da na vaš računalnik namesti...",,,,Information/Explanation,"{'Other': 0.0001, 'Information/Explanation': 0...",,Information/Explanation
freq,89024,90537,1,42890,57737,9,3600,3600,,1,30,,,,32368,5988,,32368
mean,,,,,,,,,0.897452,,,428.811084,495.158761,0.092997,,,0.970066,
std,,,,,,,,,0.063443,,,1694.062268,2320.090506,0.027555,,,0.089027,
min,,,,,,,,,0.502,,,75.0,2.0,0.015,,,0.247184,
25%,,,,,,,,,0.868429,,,119.0,93.0,0.07483,,,0.995622,
50%,,,,,,,,,0.913667,,,190.0,165.0,0.089552,,,0.998666,
75%,,,,,,,,,0.942684,,,346.0,324.0,0.106952,,,0.998966,


In [10]:
# Additionally set "final-X-GENRE" to NaN where the label is "Forum"
corpus["final-X-GENRE"] = np.where(corpus["final-X-GENRE"] == "Forum", np.nan, corpus["final-X-GENRE"])

In [13]:
# Additionally set "final-X-GENRE" to NaN where the prediction confidence is lower than 0.9
corpus["final-X-GENRE"] = np.where(corpus["chosen_category_distr"] < 0.9, np.nan, corpus["final-X-GENRE"])

In [15]:
# See the final distribution
corpus.describe(include="all")

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,sl_source,en_domain,sl_domain,average_score,en_doc,sl_doc,en_length,sl_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
count,101807,101807,101807,101807,101807,101807,101807,101807,101807.0,101807,101807,101807.0,101807.0,101807.0,101807,101807,101807.0,91459
unique,2,2,101807,4,4,92708,6066,6066,,101807,92544,,,,9,42318,,7
top,No,sl-orig,http://15.liffe.si/?lang_chg=en,B,B,https://www.sofascore.com/sl/ekipa/nogomet/vik...,oblacila.si,oblacila.si,,It went out with a bang. The evening sparkled ...,"Ali se strinjate, da na vaš računalnik namesti...",,,,Information/Explanation,"{'Other': 0.0001, 'Information/Explanation': 0...",,Information/Explanation
freq,89024,90537,1,42890,57737,9,3600,3600,,1,30,,,,32368,5988,,30307
mean,,,,,,,,,0.897452,,,428.811084,495.158761,0.092997,,,0.970066,
std,,,,,,,,,0.063443,,,1694.062268,2320.090506,0.027555,,,0.089027,
min,,,,,,,,,0.502,,,75.0,2.0,0.015,,,0.247184,
25%,,,,,,,,,0.868429,,,119.0,93.0,0.07483,,,0.995622,
50%,,,,,,,,,0.913667,,,190.0,165.0,0.089552,,,0.998666,
75%,,,,,,,,,0.942684,,,346.0,324.0,0.106952,,,0.998966,


In [19]:
print(corpus["final-X-GENRE"].value_counts(normalize=True).to_markdown())

|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Information/Explanation |      0.331373   |
| Promotion               |      0.323959   |
| News                    |      0.13347    |
| Instruction             |      0.107163   |
| Legal                   |      0.0581353  |
| Opinion/Argumentation   |      0.0435168  |
| Prose/Lyrical           |      0.00238358 |


In [20]:
LABELS = list(corpus["final-X-GENRE"].unique())
print(LABELS)

['Opinion/Argumentation', 'News', nan, 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']
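
The list printed above still contains `nan`, which stands for the removed predictions. One possible way to obtain only the genre labels, and to compare genres with language varieties in a single table, is sketched below on toy data (the values are invented; only the column names come from the corpus).

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", np.nan, "Legal", "Legal", "Legal"],
    "en_var_doc": ["B", "A", "B", "B", "B", "UNK"],
})

# Drop the missing values before collecting the unique labels
labels = toy["final-X-GENRE"].dropna().unique().tolist()
print(labels)

# Row-normalised cross-tabulation: each row (genre) sums to 1.
# Rows with a missing genre are left out by default.
variety_by_genre = pd.crosstab(
    toy["final-X-GENRE"], toy["en_var_doc"], normalize="index"
)
print(variety_by_genre)
```

A cross-tabulation gives the same per-genre proportions as filtering the corpus once per genre, but in one table that is easier to compare across rows.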


In [21]:
# Analyze differences in genres based on language varieties

for i in ['Opinion/Argumentation', 'News', 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_var_doc"].value_counts(normalize=True).to_markdown())

Opinion/Argumentation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.431156  |
| UNK |    0.330653  |
| A   |    0.172362  |
| MIX |    0.0658291 |
News
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.553863  |
| UNK |    0.309904  |
| A   |    0.0919145 |
| MIX |    0.0443188 |
Legal
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.689675  |
| UNK |    0.213842  |
| A   |    0.0596201 |
| MIX |    0.0368629 |
Information/Explanation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.431946  |
| UNK |    0.360412  |
| A   |    0.143894  |
| MIX |    0.0637476 |
Promotion
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.355125  |
| UNK |    0.343312  |
| A   |    0.223025  |
| MIX |    0.0785379 |
Instruction
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.492807  |
| B   |    0.257729  |
| A   |    0.215488  |
| MIX |    0.0339761 |
Prose/Lyrical
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.330275  |
| UNK |   

In [23]:
# Length distribution of the entire corpus
print(corpus["en_length"].describe().to_markdown())

|       |   en_length |
|:------|------------:|
| count |  101807     |
| mean  |     428.811 |
| std   |    1694.06  |
| min   |      75     |
| 25%   |     119     |
| 50%   |     190     |
| 75%   |     346     |
| max   |   98761     |
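
The mean length (428.8) is more than twice the median (190), and the maximum is 98,761, so the length distribution is strongly right-skewed and the median is the more informative summary. A minimal sketch of a per-genre summary in a single table, using `groupby` on invented toy data, could look like this:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "Legal", "Legal", "Legal", np.nan],
    "en_length": [100, 300, 200, 400, 9000, 150],
})

# groupby leaves out rows with a missing genre by default
length_by_genre = (
    toy.groupby("final-X-GENRE")["en_length"]
       .agg(["count", "median", "mean", "max"])
       .sort_values("median", ascending=False)
)
print(length_by_genre)
```

Sorting by the median rather than the mean keeps a few very long texts from dominating the comparison.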


In [24]:
# Analyze differences in genres based on text length

for i in ['Opinion/Argumentation', 'News', 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_length"].describe().to_markdown())

Opinion/Argumentation
|       |   en_length |
|:------|------------:|
| count |    3980     |
| mean  |     459.014 |
| std   |     931.279 |
| min   |      75     |
| 25%   |     132     |
| 50%   |     230     |
| 75%   |     473.25  |
| max   |   25411     |
News
|       |   en_length |
|:------|------------:|
| count |   12207     |
| mean  |     428.744 |
| std   |    1417.89  |
| min   |      75     |
| 25%   |     136     |
| 50%   |     232     |
| 75%   |     426     |
| max   |   75277     |
Legal
|       |   en_length |
|:------|------------:|
| count |     5317    |
| mean  |     2164.62 |
| std   |     5914.44 |
| min   |       75    |
| 25%   |      191    |
| 50%   |      429    |
| 75%   |     1299    |
| max   |    98761    |
Information/Explanation
|       |   en_length |
|:------|------------:|
| count |   30307     |
| mean  |     333.969 |
| std   |    1020.25  |
| min   |      75     |
| 25%   |     115     |
| 50%   |     179     |
| 75%   |     305     |
| max  