# Post-processing of predicted data

Based on the findings of the manual analysis, we post-process the data to get more reliable results:
* predictions are removed from texts labeled as "Other" or "Forum"
* predictions are removed from texts where the prediction confidence is below 0.9

The analysis of the results (the relationship between genres and language varieties) is then done on the post-processed data.
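
The two rules listed above can also be read as a single boolean mask. The following is only a minimal sketch on a toy DataFrame, not the code used in this notebook: the names `toy`, `EXCLUDED_LABELS` and `MIN_CONFIDENCE` are hypothetical, and only the column names `X-GENRE`, `chosen_category_distr` and `final-X-GENRE` come from the corpus.

```python
import pandas as pd

# Toy stand-in for the corpus (hypothetical values); only the two columns
# that the post-processing rules depend on are included.
toy = pd.DataFrame({
    "X-GENRE": ["News", "Other", "Forum", "Legal", "News"],
    "chosen_category_distr": [0.95, 0.99, 0.97, 0.85, 0.90],
})

EXCLUDED_LABELS = ["Other", "Forum"]  # hypothetical constant name
MIN_CONFIDENCE = 0.9                  # hypothetical constant name

# Keep a prediction only if its label is not excluded
# and its confidence is not below the threshold.
keep = (
    ~toy["X-GENRE"].isin(EXCLUDED_LABELS)
    & (toy["chosen_category_distr"] >= MIN_CONFIDENCE)
)

# Series.where keeps values where the mask is True and sets the rest to NaN.
toy["final-X-GENRE"] = toy["X-GENRE"].where(keep)
print(toy)
```

In this toy example only the first and the last row keep their label; a confidence of exactly 0.9 is retained, since the rule removes only values below 0.9.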

In [1]:
import pandas as pd
import numpy as np

In [2]:
corpus = pd.read_csv("Macocu-mt-en-predicted.csv", sep = "\t", index_col = 0)
corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,mt_source,en_domain,mt_domain,average_score,en_doc,mt_doc,en_length,mt_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr
1129134,No,en-orig,http://amberalert.com.mt/amber-alert-malta/,B,B,http://www.amberalert.com.mt/mt/amber-alert-ma...,amberalert.com.mt,amberalert.com.mt,0.8833,AMBER Alert Europe connects law enforcement wi...,L-AMBER Alert Europe iqarreb lill-forzi tal-or...,223,207,0.083665,News,"{'Other': 0.001, 'Information/Explanation': 0....",0.953956
840121,No,en-orig,http://catholicvoices.mt/a-number-of-reactions...,B,B,http://catholicvoices.mt/numru-ta-reazzjonijie...,catholicvoices.mt,catholicvoices.mt,0.902923,A number of reactions to the Manifesto “Rebuil...,Ir-reazzjonijiet li waslu minn numru ta' kandi...,288,227,0.094801,Opinion/Argumentation,"{'Other': 0.0007, 'Information/Explanation': 0...",0.998019
4581,Yes,en-orig,http://catholicvoices.mt/pope-francis-says-pan...,B,B,http://catholicvoices.mt/papa-frangisku-jghid-...,catholicvoices.mt,catholicvoices.mt,0.876624,In an exclusive interview recorded for The Tab...,F'intervista esklussiva għal TheTablet — l-eww...,2243,1834,0.103216,Other,"{'Other': 0.9955, 'Information/Explanation': 0...",0.995455


In [3]:
# See initial number of texts
corpus.shape

(23999, 17)

In [6]:
print(corpus["X-GENRE"].value_counts(normalize=True).to_markdown())
print("\n")
print(corpus["X-GENRE"].value_counts().to_markdown())

|                         |    X-GENRE |
|:------------------------|-----------:|
| News                    | 0.335264   |
| Legal                   | 0.26847    |
| Information/Explanation | 0.194883   |
| Instruction             | 0.0864619  |
| Opinion/Argumentation   | 0.0427101  |
| Promotion               | 0.0286262  |
| Prose/Lyrical           | 0.0272095  |
| Other                   | 0.0143756  |
| Forum                   | 0.00200008 |


|                         |   X-GENRE |
|:------------------------|----------:|
| News                    |      8046 |
| Legal                   |      6443 |
| Information/Explanation |      4677 |
| Instruction             |      2075 |
| Opinion/Argumentation   |      1025 |
| Promotion               |       687 |
| Prose/Lyrical           |       653 |
| Other                   |       345 |
| Forum                   |        48 |


In [7]:
# Post-process the data

# Copy all predicted labels to a new column, except if the label is "Other"
corpus["final-X-GENRE"] = np.where(corpus["X-GENRE"] == "Other", np.nan, corpus["X-GENRE"])

corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,mt_source,en_domain,mt_domain,average_score,en_doc,mt_doc,en_length,mt_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
1129134,No,en-orig,http://amberalert.com.mt/amber-alert-malta/,B,B,http://www.amberalert.com.mt/mt/amber-alert-ma...,amberalert.com.mt,amberalert.com.mt,0.8833,AMBER Alert Europe connects law enforcement wi...,L-AMBER Alert Europe iqarreb lill-forzi tal-or...,223,207,0.083665,News,"{'Other': 0.001, 'Information/Explanation': 0....",0.953956,News
840121,No,en-orig,http://catholicvoices.mt/a-number-of-reactions...,B,B,http://catholicvoices.mt/numru-ta-reazzjonijie...,catholicvoices.mt,catholicvoices.mt,0.902923,A number of reactions to the Manifesto “Rebuil...,Ir-reazzjonijiet li waslu minn numru ta' kandi...,288,227,0.094801,Opinion/Argumentation,"{'Other': 0.0007, 'Information/Explanation': 0...",0.998019,Opinion/Argumentation
4581,Yes,en-orig,http://catholicvoices.mt/pope-francis-says-pan...,B,B,http://catholicvoices.mt/papa-frangisku-jghid-...,catholicvoices.mt,catholicvoices.mt,0.876624,In an exclusive interview recorded for The Tab...,F'intervista esklussiva għal TheTablet — l-eww...,2243,1834,0.103216,Other,"{'Other': 0.9955, 'Information/Explanation': 0...",0.995455,


In [8]:
corpus["final-X-GENRE"].value_counts()

News                       8046
Legal                      6443
Information/Explanation    4677
Instruction                2075
Opinion/Argumentation      1025
Promotion                   687
Prose/Lyrical               653
Forum                        48
Name: final-X-GENRE, dtype: int64

In [9]:
# Remove the label "Forum" from the column "final-X-GENRE" as well (set it to NaN)
corpus["final-X-GENRE"] = np.where(corpus["final-X-GENRE"] == "Forum", np.nan, corpus["final-X-GENRE"])

In [10]:
corpus["final-X-GENRE"].value_counts()

News                       8046
Legal                      6443
Information/Explanation    4677
Instruction                2075
Opinion/Argumentation      1025
Promotion                   687
Prose/Lyrical               653
Name: final-X-GENRE, dtype: int64

In [11]:
corpus["final-X-GENRE"].describe()

count     23606
unique        7
top        News
freq       8046
Name: final-X-GENRE, dtype: object

In [12]:
# Remove the remaining labels in "final-X-GENRE" if the prediction confidence is lower than 0.9 (set them to NaN)
corpus["final-X-GENRE"] = np.where(corpus["chosen_category_distr"] < 0.9, np.nan, corpus["final-X-GENRE"])

In [13]:
# See the final distribution
corpus["final-X-GENRE"].describe()

count     21376
unique        7
top        News
freq       7481
Name: final-X-GENRE, dtype: object
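
The effect of each filtering step can be worked out from the counts printed above. The snippet below is a small arithmetic sketch that uses only those reported numbers; the variable names are hypothetical.

```python
# Counts taken from the notebook outputs above
total_texts = 23999          # number of rows in the corpus
after_label_filter = 23606   # labels left after removing "Other" and "Forum"
after_confidence = 21376     # labels left after also removing confidence < 0.9

# Texts that lost their label in each step
removed_by_label = total_texts - after_label_filter
removed_by_confidence = after_label_filter - after_confidence

# Share of texts that still have a label after post-processing
retained_share = after_confidence / total_texts

print(removed_by_label, removed_by_confidence, f"{retained_share:.1%}")
# prints: 393 2230 89.1%
```

The 393 labels removed in the first step match the 345 "Other" and 48 "Forum" predictions. The 2230 removed in the second step count only low-confidence predictions that had not already been removed because of their label.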

In [16]:
print(corpus["final-X-GENRE"].value_counts().to_markdown())
print("\n")
print(corpus["final-X-GENRE"].value_counts(normalize=True).to_markdown())

|                         |   final-X-GENRE |
|:------------------------|----------------:|
| News                    |            7481 |
| Legal                   |            5962 |
| Information/Explanation |            4107 |
| Instruction             |            1829 |
| Opinion/Argumentation   |             820 |
| Prose/Lyrical           |             589 |
| Promotion               |             588 |


|                         |   final-X-GENRE |
|:------------------------|----------------:|
| News                    |       0.349972  |
| Legal                   |       0.278911  |
| Information/Explanation |       0.192131  |
| Instruction             |       0.0855632 |
| Opinion/Argumentation   |       0.0383608 |
| Prose/Lyrical           |       0.0275543 |
| Promotion               |       0.0275075 |


In [17]:
LABELS = list(corpus["final-X-GENRE"].unique())
print(LABELS)

['News', 'Opinion/Argumentation', nan, 'Promotion', 'Instruction', 'Information/Explanation', 'Legal', 'Prose/Lyrical']


In [18]:
# Analyze differences in genres based on language varieties

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction', 'Information/Explanation', 'Legal', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_var_doc"].value_counts(normalize=True).to_markdown())

News
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.675444  |
| UNK |    0.277904  |
| A   |    0.0239273 |
| MIX |    0.0227242 |
Opinion/Argumentation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.403659  |
| A   |    0.342683  |
| UNK |    0.212195  |
| MIX |    0.0414634 |
Promotion
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.522109  |
| UNK |    0.384354  |
| A   |    0.0680272 |
| MIX |    0.0255102 |
Instruction
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.509021  |
| UNK |    0.323674  |
| A   |    0.136687  |
| MIX |    0.0306178 |
Information/Explanation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.587047  |
| UNK |    0.218164  |
| A   |    0.160214  |
| MIX |    0.0345751 |
Legal
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.751593  |
| UNK |    0.169406  |
| MIX |    0.0424354 |
| A   |    0.0365649 |
Prose/Lyrical
|     |   en_var_doc |
|:----|-------------:|
| A   |    0.500849  |
| UNK |   
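
As an alternative to the per-genre loop, the same row-wise proportions could be collected in one table with `pd.crosstab`. This is a sketch on hypothetical toy data, not the notebook's own approach; only the column names `final-X-GENRE` and `en_var_doc` are taken from the corpus.

```python
import pandas as pd

# Toy stand-in for the post-processed corpus (hypothetical values).
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "News", "Legal", "Legal", None],
    "en_var_doc": ["B", "B", "UNK", "B", "A", "B"],
})

# Texts whose prediction was removed have no genre; drop them explicitly.
labeled = toy.dropna(subset=["final-X-GENRE"])

# normalize="index" divides each row by its total, so every genre row
# holds the proportions of language varieties within that genre.
shares = pd.crosstab(
    labeled["final-X-GENRE"], labeled["en_var_doc"], normalize="index"
)
print(shares)
```

Each row of `shares` sums to 1, which makes the genres directly comparable side by side instead of across several separate tables.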

In [19]:
# Length distribution of the entire corpus
print(corpus["en_length"].describe().to_markdown())

|       |   en_length |
|:------|------------:|
| count |    23999    |
| mean  |     1290.69 |
| std   |     3911.68 |
| min   |       79    |
| 25%   |      153    |
| 50%   |      300    |
| 75%   |      853    |
| max   |   123935    |


In [20]:
# Analyze differences in genres based on text length

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction', 'Information/Explanation', 'Legal', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_length"].describe().to_markdown())

News
|       |   en_length |
|:------|------------:|
| count |    7481     |
| mean  |     592.741 |
| std   |    1451.67  |
| min   |      79     |
| 25%   |     128     |
| 50%   |     213     |
| 75%   |     447     |
| max   |   37591     |
Opinion/Argumentation
|       |   en_length |
|:------|------------:|
| count |     820     |
| mean  |     720.037 |
| std   |     758.044 |
| min   |      81     |
| 25%   |     235     |
| 50%   |     498     |
| 75%   |    1112.5   |
| max   |   10791     |
Promotion
|       |   en_length |
|:------|------------:|
| count |     588     |
| mean  |     257.4   |
| std   |     609.635 |
| min   |      79     |
| 25%   |     113     |
| 50%   |     171.5   |
| 75%   |     271.25  |
| max   |   12086     |
Instruction
|       |   en_length |
|:------|------------:|
| count |    1829     |
| mean  |     525.561 |
| std   |    1440.09  |
| min   |      79     |
| 25%   |     167     |
| 50%   |     284     |
| 75%   |     522     |
| max   |   461