# Post-processing of predicted data

Based on the findings of the manual analysis, we post-process the data to obtain reliable results:
* predictions are removed from texts labeled as "Other" or "Forum"
* predictions are removed from texts where the prediction confidence is below 0.9

The results are then analyzed on the post-processed data, focusing on the connection between genres and language varieties.
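
The two rules above can be sketched as a single helper. This is a minimal illustration rather than the notebook's exact code: the function name `postprocess_genres`, its parameters, and the toy DataFrame are assumptions made for the example; only the column names follow the corpus file.

```python
import pandas as pd

# Hypothetical helper (not part of the original notebook): applies both
# post-processing rules in one pass.
def postprocess_genres(df, label_col="X-GENRE", conf_col="chosen_category_distr",
                       drop_labels=("Other", "Forum"), min_conf=0.9):
    # Keep a label unless it is in drop_labels or its confidence is below
    # min_conf; every other label becomes missing (NaN).
    keep = ~df[label_col].isin(drop_labels) & ~(df[conf_col] < min_conf)
    return df[label_col].where(keep)

# Toy data for illustration only.
toy = pd.DataFrame({
    "X-GENRE": ["News", "Other", "Forum", "Legal"],
    "chosen_category_distr": [0.95, 0.99, 0.97, 0.62],
})
toy["final-X-GENRE"] = postprocess_genres(toy)
print(toy)
```

Here only the first row keeps its label: the second and third are dropped because of their genre, the fourth because its confidence is below the threshold.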

In [1]:
import pandas as pd
import numpy as np

In [2]:
corpus = pd.read_csv("Macocu-mt-en-predicted.csv", index_col = 0)
corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,mt_source,en_domain,mt_domain,average_score,en_doc,mt_doc,en_length,mt_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
1129134,No,en-orig,http://amberalert.com.mt/amber-alert-malta/,B,B,http://www.amberalert.com.mt/mt/amber-alert-ma...,amberalert.com.mt,amberalert.com.mt,0.8833,AMBER Alert Europe connects law enforcement wi...,L-AMBER Alert Europe iqarreb lill-forzi tal-or...,223,207,0.083665,News,"{'Other': 0.001, 'Information/Explanation': 0....",0.953956,News
840121,No,en-orig,http://catholicvoices.mt/a-number-of-reactions...,B,B,http://catholicvoices.mt/numru-ta-reazzjonijie...,catholicvoices.mt,catholicvoices.mt,0.902923,A number of reactions to the Manifesto “Rebuil...,Ir-reazzjonijiet li waslu minn numru ta' kandi...,288,227,0.094801,Opinion/Argumentation,"{'Other': 0.0007, 'Information/Explanation': 0...",0.998019,Opinion/Argumentation
4581,Yes,en-orig,http://catholicvoices.mt/pope-francis-says-pan...,B,B,http://catholicvoices.mt/papa-frangisku-jghid-...,catholicvoices.mt,catholicvoices.mt,0.876624,In an exclusive interview recorded for The Tab...,F'intervista esklussiva għal TheTablet — l-eww...,2243,1834,0.103216,Other,"{'Other': 0.9955, 'Information/Explanation': 0...",0.995455,


In [21]:
# See initial number of texts
corpus.shape

(23999, 18)

In [4]:
print(corpus["X-GENRE"].value_counts(normalize=True).to_markdown())
print("\n")
print(corpus["X-GENRE"].value_counts().to_markdown())

|                         |    X-GENRE |
|:------------------------|-----------:|
| News                    | 0.335264   |
| Legal                   | 0.26847    |
| Information/Explanation | 0.194883   |
| Instruction             | 0.0864619  |
| Opinion/Argumentation   | 0.0427101  |
| Promotion               | 0.0286262  |
| Prose/Lyrical           | 0.0272095  |
| Other                   | 0.0143756  |
| Forum                   | 0.00200008 |


|                         |   X-GENRE |
|:------------------------|----------:|
| News                    |      8046 |
| Legal                   |      6443 |
| Information/Explanation |      4677 |
| Instruction             |      2075 |
| Opinion/Argumentation   |      1025 |
| Promotion               |       687 |
| Prose/Lyrical           |       653 |
| Other                   |       345 |
| Forum                   |        48 |


In [5]:
# Post-process the data

# Copy all predicted labels to a new column, except if the label is "Other"
corpus["final-X-GENRE"] = np.where(corpus["X-GENRE"] == "Other", np.nan, corpus["X-GENRE"])

corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,mt_source,en_domain,mt_domain,average_score,en_doc,mt_doc,en_length,mt_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
1129134,No,en-orig,http://amberalert.com.mt/amber-alert-malta/,B,B,http://www.amberalert.com.mt/mt/amber-alert-ma...,amberalert.com.mt,amberalert.com.mt,0.8833,AMBER Alert Europe connects law enforcement wi...,L-AMBER Alert Europe iqarreb lill-forzi tal-or...,223,207,0.083665,News,"{'Other': 0.001, 'Information/Explanation': 0....",0.953956,News
840121,No,en-orig,http://catholicvoices.mt/a-number-of-reactions...,B,B,http://catholicvoices.mt/numru-ta-reazzjonijie...,catholicvoices.mt,catholicvoices.mt,0.902923,A number of reactions to the Manifesto “Rebuil...,Ir-reazzjonijiet li waslu minn numru ta' kandi...,288,227,0.094801,Opinion/Argumentation,"{'Other': 0.0007, 'Information/Explanation': 0...",0.998019,Opinion/Argumentation
4581,Yes,en-orig,http://catholicvoices.mt/pope-francis-says-pan...,B,B,http://catholicvoices.mt/papa-frangisku-jghid-...,catholicvoices.mt,catholicvoices.mt,0.876624,In an exclusive interview recorded for The Tab...,F'intervista esklussiva għal TheTablet — l-eww...,2243,1834,0.103216,Other,"{'Other': 0.9955, 'Information/Explanation': 0...",0.995455,


In [6]:
corpus["final-X-GENRE"].value_counts()

News                       8046
Legal                      6443
Information/Explanation    4677
Instruction                2075
Opinion/Argumentation      1025
Promotion                   687
Prose/Lyrical               653
Forum                        48
Name: final-X-GENRE, dtype: int64

In [7]:
# Copy all predicted labels to a column "final-X-GENRE", except if the label is "Forum"
corpus["final-X-GENRE"] = np.where(corpus["final-X-GENRE"] == "Forum", np.nan, corpus["final-X-GENRE"])

In [8]:
corpus["final-X-GENRE"].value_counts()

News                       8046
Legal                      6443
Information/Explanation    4677
Instruction                2075
Opinion/Argumentation      1025
Promotion                   687
Prose/Lyrical               653
Name: final-X-GENRE, dtype: int64

In [9]:
corpus["final-X-GENRE"].describe()

count     23606
unique        7
top        News
freq       8046
Name: final-X-GENRE, dtype: object

In [10]:
# Copy all predicted labels to a column "final-X-GENRE", except if the prediction confidence is lower than 0.9
corpus["final-X-GENRE"] = np.where(corpus["chosen_category_distr"] < 0.9, np.nan, corpus["final-X-GENRE"])

In [11]:
# See the final distribution
corpus["final-X-GENRE"].describe()

count     21376
unique        7
top        News
freq       7481
Name: final-X-GENRE, dtype: object

In [22]:
print(corpus["final-X-GENRE"].value_counts().to_markdown())
print("\n")
print(corpus["final-X-GENRE"].value_counts(normalize=True).to_markdown())

|                         |   final-X-GENRE |
|:------------------------|----------------:|
| News                    |            7481 |
| Legal                   |            5962 |
| Information/Explanation |            4107 |
| Instruction             |            1829 |
| Opinion/Argumentation   |             820 |
| Prose/Lyrical           |             589 |
| Promotion               |             588 |


|                         |   final-X-GENRE |
|:------------------------|----------------:|
| News                    |       0.349972  |
| Legal                   |       0.278911  |
| Information/Explanation |       0.192131  |
| Instruction             |       0.0855632 |
| Opinion/Argumentation   |       0.0383608 |
| Prose/Lyrical           |       0.0275543 |
| Promotion               |       0.0275075 |


In [23]:
LABELS = list(corpus["final-X-GENRE"].unique())
print(LABELS)

['News', 'Opinion/Argumentation', nan, 'Promotion', 'Instruction', 'Information/Explanation', 'Legal', 'Prose/Lyrical']


In [24]:
corpus.describe(include="all")

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,mt_source,en_domain,mt_domain,average_score,en_doc,mt_doc,en_length,mt_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
count,23999,23999,23999,23999,23999,23999,23999,23999,23999.0,23999,23999,23999.0,23999.0,23999.0,23999,23999,23999.0,21376
unique,2,2,23999,4,4,20183,310,310,,23999,20181,,,,9,13498,,7
top,No,en-orig,http://amberalert.com.mt/amber-alert-malta/,B,B,https://eur-lex.europa.eu/legal-content/MT/ALL...,europarl.europa.eu,europarl.europa.eu,,AMBER Alert Europe connects law enforcement wi...,"L-għan ta' dan ir-Regolament, skond il-prinċip...",,,,News,"{'Other': 0.0001, 'Information/Explanation': 0...",,News
freq,19579,14181,1,15212,21149,8,5589,5589,,1,8,,,,8046,1172,,7481
mean,,,,,,,,,0.91717,,,1290.687237,1936.903371,0.077335,,,0.965693,
std,,,,,,,,,0.063733,,,3911.675265,5800.820802,0.022741,,,0.092546,
min,,,,,,,,,0.5,,,79.0,4.0,0.015152,,,0.289614,
25%,,,,,,,,,0.883162,,,153.0,151.0,0.063008,,,0.991647,
50%,,,,,,,,,0.929562,,,300.0,346.0,0.073628,,,0.99704,
75%,,,,,,,,,0.962129,,,853.0,1196.5,0.088496,,,0.998829,


In [15]:
# Save the new file
corpus.to_csv("Macocu-mt-en-predicted.csv")

In [16]:
# Analyze English domains in the corpus
count = pd.DataFrame({"Count": list(corpus.en_domain.value_counts()), "Percentage": list(corpus.en_domain.value_counts(normalize=True)*100)}, index = corpus.en_domain.value_counts().index)

print(count.to_markdown())

|                                    |   Count |   Percentage |
|:-----------------------------------|--------:|-------------:|
| europarl.europa.eu                 |    5589 |  23.2885     |
| newsbook.com.mt                    |    3139 |  13.0797     |
| eur-lex.europa.eu                  |    3101 |  12.9214     |
| wol.jw.org                         |    1632 |   6.80028    |
| dg-justice-portal-demo.eurodyn.com |    1255 |   5.22938    |
| jw.org                             |     749 |   3.12096    |
| europa.eu                          |     617 |   2.57094    |
| ec.europa.eu                       |     528 |   2.20009    |
| tvm.com.mt                         |     445 |   1.85424    |
| cor.europa.eu                      |     422 |   1.75841    |
| weekly.uhm.org.mt                  |     384 |   1.60007    |
| cnimalta.org                       |     267 |   1.11255    |
| ecb.europa.eu                      |     241 |   1.00421    |
| centralbankmalta.org               |  

In [17]:
# See the distribution of genres in the most frequent domains:
frequent_domains = ["europarl.europa.eu","newsbook.com.mt","eur-lex.europa.eu","wol.jw.org","dg-justice-portal-demo.eurodyn.com","jw.org","europa.eu","ec.europa.eu","tvm.com.mt","cor.europa.eu","weekly.uhm.org.mt","cnimalta.org","ecb.europa.eu"]

for i in frequent_domains:
	print(i)
	filtered_corpus = corpus[corpus["en_domain"] == i]
	print(filtered_corpus["final-X-GENRE"].value_counts(normalize=True).to_markdown())

europarl.europa.eu
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Legal                   |      0.41502    |
| News                    |      0.411483   |
| Information/Explanation |      0.163304   |
| Instruction             |      0.00520075 |
| Promotion               |      0.00270439 |
| Opinion/Argumentation   |      0.00228833 |
newsbook.com.mt
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| News                    |     0.937521    |
| Information/Explanation |     0.0206049   |
| Legal                   |     0.0192755   |
| Promotion               |     0.00930542  |
| Opinion/Argumentation   |     0.00830841  |
| Instruction             |     0.00432037  |
| Prose/Lyrical           |     0.000664673 |
eur-lex.europa.eu
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Legal                   |     0.83756     |
| Information/Explanation |

In [25]:
# Analyze differences in genres based on domain frequency

for i in ['Opinion/Argumentation', 'News', 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_domain"].value_counts(normalize=True)[:5].to_markdown())

Opinion/Argumentation
|                         |   en_domain |
|:------------------------|------------:|
| cnimalta.org            |   0.184146  |
| wol.jw.org              |   0.170732  |
| churchofjesuschrist.org |   0.112195  |
| jw.org                  |   0.106098  |
| bekids.mt               |   0.0414634 |
News
|                    |   en_domain |
|:-------------------|------------:|
| newsbook.com.mt    |   0.377089  |
| europarl.europa.eu |   0.264403  |
| tvm.com.mt         |   0.0566769 |
| cor.europa.eu      |   0.0457158 |
| weekly.uhm.org.mt  |   0.03422   |
Legal
|                                    |   en_domain |
|:-----------------------------------|------------:|
| eur-lex.europa.eu                  |   0.40993   |
| europarl.europa.eu                 |   0.334619  |
| dg-justice-portal-demo.eurodyn.com |   0.109695  |
| parlament.mt                       |   0.0109024 |
| electoral.gov.mt                   |   0.0102315 |
Information/Explanation
|                  

In [3]:
# Analyze differences in genres based on language varieties - print raw scores
print("\n\nDistribution of English varieties in genres - raw scores:\n\n")

print("Labels in this order: 'News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical'\n\n")

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical']:
	#results_file.write("\n\n")
	#results_file.write(i)
	#results_file.write("\n\n")
	filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
	print(dict(sorted(filtered_corpus["en_var_doc"].value_counts().to_dict().items())))




Distribution of English varieties in genres - raw scores:


Labels in this order: 'News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical'


{'A': 179, 'B': 5053, 'MIX': 170, 'UNK': 2079}
{'A': 281, 'B': 331, 'MIX': 34, 'UNK': 174}
{'A': 40, 'B': 307, 'MIX': 15, 'UNK': 226}
{'A': 250, 'B': 931, 'MIX': 56, 'UNK': 592}
{'A': 658, 'B': 2411, 'MIX': 142, 'UNK': 896}
{'A': 218, 'B': 4481, 'MIX': 253, 'UNK': 1010}
{'A': 295, 'B': 22, 'MIX': 7, 'UNK': 265}


In [4]:

# Analyze differences in genres based on language varieties - print normalized scores

print("\n\nDistribution of English varieties in genres - normalized scores:\n\n")

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical']:
	#results_file.write("\n\n")
	#results_file.write(i)
	#results_file.write("\n\n")
	filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
	print((dict(sorted((filtered_corpus["en_var_doc"].value_counts(normalize = True)*100).round(2).to_dict().items()))))

print("Differences in language varieties distribution per genres analyzed.")




Distribution of English varieties in genres - normalized scores:


{'A': 2.39, 'B': 67.54, 'MIX': 2.27, 'UNK': 27.79}
{'A': 34.27, 'B': 40.37, 'MIX': 4.15, 'UNK': 21.22}
{'A': 6.8, 'B': 52.21, 'MIX': 2.55, 'UNK': 38.44}
{'A': 13.67, 'B': 50.9, 'MIX': 3.06, 'UNK': 32.37}
{'A': 16.02, 'B': 58.7, 'MIX': 3.46, 'UNK': 21.82}
{'A': 3.66, 'B': 75.16, 'MIX': 4.24, 'UNK': 16.94}
{'A': 50.08, 'B': 3.74, 'MIX': 1.19, 'UNK': 44.99}
Differences in language varieties distribution per genres analyzed.
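
The raw and normalized loops above can also be written as one cross-tabulation. A minimal sketch with `pandas.crosstab`, using a small made-up DataFrame in place of the corpus (the toy values are assumptions; only the column names come from the corpus):

```python
import pandas as pd

# Toy data for illustration only; the last row has no final genre label.
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "News", "Legal", "Legal", None],
    "en_var_doc":    ["B",    "B",    "UNK",  "B",     "A",     "B"],
})

# Raw counts: one row per genre, one column per variety.
raw = pd.crosstab(toy["final-X-GENRE"], toy["en_var_doc"])

# Row-normalized percentages: each genre's row sums to (about) 100.
pct = (pd.crosstab(toy["final-X-GENRE"], toy["en_var_doc"], normalize="index") * 100).round(2)

print(raw)
print(pct)
```

Rows without a final genre label are left out of the table, which matches the loops above, where unlabeled texts never pass the equality filter.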


In [26]:
# Analyze differences in genres based on language varieties

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction', 'Information/Explanation', 'Legal', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_var_doc"].value_counts(normalize=True).to_markdown())

News
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.675444  |
| UNK |    0.277904  |
| A   |    0.0239273 |
| MIX |    0.0227242 |
Opinion/Argumentation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.403659  |
| A   |    0.342683  |
| UNK |    0.212195  |
| MIX |    0.0414634 |
Promotion
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.522109  |
| UNK |    0.384354  |
| A   |    0.0680272 |
| MIX |    0.0255102 |
Instruction
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.509021  |
| UNK |    0.323674  |
| A   |    0.136687  |
| MIX |    0.0306178 |
Information/Explanation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.587047  |
| UNK |    0.218164  |
| A   |    0.160214  |
| MIX |    0.0345751 |
Legal
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.751593  |
| UNK |    0.169406  |
| MIX |    0.0424354 |
| A   |    0.0365649 |
Prose/Lyrical
|     |   en_var_doc |
|:----|-------------:|
| A   |    0.500849  |
| UNK |   

In [27]:
# Length distribution of the entire corpus
print(corpus["en_length"].describe().to_markdown())

|       |   en_length |
|:------|------------:|
| count |    23999    |
| mean  |     1290.69 |
| std   |     3911.68 |
| min   |       79    |
| 25%   |      153    |
| 50%   |      300    |
| 75%   |      853    |
| max   |   123935    |
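
To break the same summary down by genre, a `groupby` followed by `describe` gives one row per genre in a single call. A sketch on made-up lengths (the values are illustrative assumptions, not corpus data):

```python
import pandas as pd

# Toy data for illustration only; the last row has no final genre label.
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "News", "Promotion", "Promotion", None],
    "en_length":     [100,    200,    300,    80,          120,         5000],
})

# groupby skips rows whose genre is missing, so unlabeled texts are excluded.
length_stats = toy.groupby("final-X-GENRE")["en_length"].describe()
print(length_stats)
```

The result has the same statistics as `describe()` on the full column (count, mean, std, min, quartiles, max), with genres as the row index.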


In [28]:
# Analyze differences in genres based on text length

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction', 'Information/Explanation', 'Legal', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_length"].describe().to_markdown())

News
|       |   en_length |
|:------|------------:|
| count |    7481     |
| mean  |     592.741 |
| std   |    1451.67  |
| min   |      79     |
| 25%   |     128     |
| 50%   |     213     |
| 75%   |     447     |
| max   |   37591     |
Opinion/Argumentation
|       |   en_length |
|:------|------------:|
| count |     820     |
| mean  |     720.037 |
| std   |     758.044 |
| min   |      81     |
| 25%   |     235     |
| 50%   |     498     |
| 75%   |    1112.5   |
| max   |   10791     |
Promotion
|       |   en_length |
|:------|------------:|
| count |     588     |
| mean  |     257.4   |
| std   |     609.635 |
| min   |      79     |
| 25%   |     113     |
| 50%   |     171.5   |
| 75%   |     271.25  |
| max   |   12086     |
Instruction
|       |   en_length |
|:------|------------:|
| count |    1829     |
| mean  |     525.561 |
| std   |    1440.09  |
| min   |      79     |
| 25%   |     167     |
| 50%   |     284     |
| 75%   |     522     |
| max   |   461