# Post-processing of predicted data

Based on the findings of the manual analysis, we post-process the data to obtain more reliable results:
* predictions are removed from texts labeled as "Other" or "Forum"
* predictions are removed from texts where the prediction confidence is below 0.9

The analysis of the results (the relationship between genres and language varieties) is then carried out on the post-processed data.
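
The two rules above can also be expressed as a single filtering step. The following is a minimal sketch on toy data, not the code used in the cells below: the helper name `postprocess_labels` and the toy data frame are hypothetical, while the column names `X-GENRE` and `chosen_category_distr` and the 0.9 threshold are the ones used in this notebook.

```python
import pandas as pd

def postprocess_labels(df, label_col="X-GENRE", conf_col="chosen_category_distr",
                       discarded=("Other", "Forum"), threshold=0.9):
    """Return the labels with unreliable predictions replaced by missing values."""
    # Keep a label only if it is not a discarded genre and is confident enough
    keep = ~df[label_col].isin(discarded) & (df[conf_col] >= threshold)
    # Series.where keeps values where the condition holds and sets NaN elsewhere
    return df[label_col].where(keep)

# Toy data for illustration only (not taken from the corpus)
toy = pd.DataFrame({
    "X-GENRE": ["News", "Other", "Forum", "Promotion"],
    "chosen_category_distr": [0.99, 0.95, 0.97, 0.63],
})
toy["final-X-GENRE"] = postprocess_labels(toy)
print(toy["final-X-GENRE"].tolist())
```

Only the first toy row keeps its label: the second and third are discarded genres, and the fourth falls below the confidence threshold.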

In [1]:
import pandas as pd
import numpy as np

In [2]:
corpus = pd.read_csv("Macocu-mk-en-predicted.csv", sep="\t", index_col=0)
corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,mk_source,en_domain,mk_domain,average_score,en_doc,mk_doc,en_length,mk_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr
165536,No,en-orig,http://6maj.mk/Dealer.aspx?d=78&amp;language=2,B,B,http://6maj.mk/Dealer.aspx?d=10&amp;language=1,6maj.mk,6maj.mk,0.804333,"UPM has 19 modern paper mills in Finland, Germ...",Располага со 19 модерни претставништва во Финс...,119,124,0.154839,Promotion,"{'Other': 0.0008, 'Information/Explanation': 0...",0.629879
438380,Yes,mk-orig,http://6maj.mk/Pages.aspx?page=451&amp;language=2,UNK,B,http://www.6maj.mk/Pages.aspx?page=22&amp;lang...,6maj.mk,6maj.mk,0.8788,"Back to year 1991, when the turbulence of the ...","На 21-ви февруари 1991-та година, во едно врем...",140,154,0.144578,Information/Explanation,"{'Other': 0.0001, 'Information/Explanation': 0...",0.998885
182636,No,en-orig,http://a-construction.com.mk/index.php?option=...,A,A,http://a-construction.com.mk/mk/index.php?opti...,a-construction.com.mk,a-construction.com.mk,0.8028,Home design These two separate apartment desig...,Дизајн на домот Овие двa одделни дизајни на ст...,272,238,0.089701,Promotion,"{'Other': 0.0002, 'Information/Explanation': 0...",0.997809


In [3]:
# See initial number of texts
corpus.shape

(22055, 17)

In [5]:
# Analyze genre distribution
count = pd.DataFrame({"Count": list(corpus["X-GENRE"].value_counts()), "Percentage": list(corpus["X-GENRE"].value_counts(normalize=True)*100)}, index = corpus["X-GENRE"].value_counts().index)

print(count.to_markdown())

|                         |   Count |   Percentage |
|:------------------------|--------:|-------------:|
| News                    |    9695 |    43.9583   |
| Information/Explanation |    5794 |    26.2707   |
| Promotion               |    3336 |    15.1258   |
| Legal                   |     875 |     3.96735  |
| Opinion/Argumentation   |     861 |     3.90388  |
| Instruction             |     830 |     3.76332  |
| Other                   |     382 |     1.73203  |
| Prose/Lyrical           |     249 |     1.129    |
| Forum                   |      33 |     0.149626 |


In [6]:
# Post-process the data

# Copy all predicted labels to a new column, except if the label is "Other"
corpus["final-X-GENRE"] = np.where(corpus["X-GENRE"] == "Other", np.nan, corpus["X-GENRE"])

corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,mk_source,en_domain,mk_domain,average_score,en_doc,mk_doc,en_length,mk_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
165536,No,en-orig,http://6maj.mk/Dealer.aspx?d=78&amp;language=2,B,B,http://6maj.mk/Dealer.aspx?d=10&amp;language=1,6maj.mk,6maj.mk,0.804333,"UPM has 19 modern paper mills in Finland, Germ...",Располага со 19 модерни претставништва во Финс...,119,124,0.154839,Promotion,"{'Other': 0.0008, 'Information/Explanation': 0...",0.629879,Promotion
438380,Yes,mk-orig,http://6maj.mk/Pages.aspx?page=451&amp;language=2,UNK,B,http://www.6maj.mk/Pages.aspx?page=22&amp;lang...,6maj.mk,6maj.mk,0.8788,"Back to year 1991, when the turbulence of the ...","На 21-ви февруари 1991-та година, во едно врем...",140,154,0.144578,Information/Explanation,"{'Other': 0.0001, 'Information/Explanation': 0...",0.998885,Information/Explanation
182636,No,en-orig,http://a-construction.com.mk/index.php?option=...,A,A,http://a-construction.com.mk/mk/index.php?opti...,a-construction.com.mk,a-construction.com.mk,0.8028,Home design These two separate apartment desig...,Дизајн на домот Овие двa одделни дизајни на ст...,272,238,0.089701,Promotion,"{'Other': 0.0002, 'Information/Explanation': 0...",0.997809,Promotion


In [7]:
corpus["final-X-GENRE"].value_counts()

News                       9695
Information/Explanation    5794
Promotion                  3336
Legal                       875
Opinion/Argumentation       861
Instruction                 830
Prose/Lyrical               249
Forum                        33
Name: final-X-GENRE, dtype: int64

In [9]:
# Copy all predicted labels to a column "final-X-GENRE", except if the label is "Forum"
corpus["final-X-GENRE"] = np.where(corpus["final-X-GENRE"] == "Forum", np.nan, corpus["final-X-GENRE"])

In [10]:
corpus["final-X-GENRE"].value_counts()

News                       9695
Information/Explanation    5794
Promotion                  3336
Legal                       875
Opinion/Argumentation       861
Instruction                 830
Prose/Lyrical               249
Name: final-X-GENRE, dtype: int64

In [11]:
corpus["final-X-GENRE"].describe()

count     21640
unique        7
top        News
freq       9695
Name: final-X-GENRE, dtype: object

In [12]:
# Copy all predicted labels to a column "final-X-GENRE", except if the prediction confidence is lower than 0.9
corpus["final-X-GENRE"] = np.where(corpus["chosen_category_distr"] < 0.9, np.nan, corpus["final-X-GENRE"])

In [13]:
# See the final distribution
corpus["final-X-GENRE"].describe()

count     20108
unique        7
top        News
freq       9225
Name: final-X-GENRE, dtype: object

In [14]:
# Number of discarded labels
21640 - 20108

1532

In [15]:
1532/21640

0.07079482439926063
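
The number and share of discarded labels can also be derived from the data rather than typed in by hand. This is a minimal sketch on toy data: the helper name `discard_stats` is hypothetical, and `before` and `after` stand for the label column before and after the confidence filter.

```python
import numpy as np
import pandas as pd

def discard_stats(before, after):
    """Count labels that are present in `before` but missing in `after`."""
    n_before = before.notna().sum()
    n_after = after.notna().sum()
    discarded = n_before - n_after
    # Return the absolute number and the share relative to the labels before filtering
    return int(discarded), float(discarded / n_before)

# Toy data for illustration only
before = pd.Series(["News", "Legal", "News", np.nan])
after = pd.Series(["News", np.nan, "News", np.nan])
n_discarded, share = discard_stats(before, after)
print(n_discarded, round(share, 4))
```

In this notebook, `before` would have to be a copy of `corpus["final-X-GENRE"]` taken before the confidence filter is applied, since that cell overwrites the column in place.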

In [16]:
# Analyze final genre distribution
count = pd.DataFrame({"Count": list(corpus["final-X-GENRE"].value_counts()), "Percentage": list(corpus["final-X-GENRE"].value_counts(normalize=True)*100)}, index = corpus["final-X-GENRE"].value_counts().index)

print(count.to_markdown())

|                         |   Count |   Percentage |
|:------------------------|--------:|-------------:|
| News                    |    9225 |     45.8773  |
| Information/Explanation |    5298 |     26.3477  |
| Promotion               |    3140 |     15.6157  |
| Legal                   |     775 |      3.85419 |
| Instruction             |     718 |      3.57072 |
| Opinion/Argumentation   |     713 |      3.54585 |
| Prose/Lyrical           |     239 |      1.18858 |
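
Since the same count and percentage table is built several times in this notebook, it could be wrapped in a small helper. The sketch below is one possible way to do this; the function name `genre_distribution` and the toy labels are hypothetical.

```python
import pandas as pd

def genre_distribution(labels):
    """Build a table of absolute counts and percentages for a label column."""
    counts = labels.value_counts()  # missing values are excluded by default
    return pd.DataFrame({
        "Count": counts,
        "Percentage": counts / counts.sum() * 100,
    })

# Toy data for illustration only
toy_labels = pd.Series(["News", "News", "Legal", None])
table = genre_distribution(toy_labels)
print(table)
```

Calling `value_counts()` once and deriving the percentages from it avoids recomputing the counts three times per table.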


In [17]:
LABELS = list(corpus["final-X-GENRE"].unique())
print(LABELS)

[nan, 'Information/Explanation', 'Promotion', 'Instruction', 'Opinion/Argumentation', 'News', 'Legal', 'Prose/Lyrical']


In [18]:
corpus.describe(include="all")

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,mk_source,en_domain,mk_domain,average_score,en_doc,mk_doc,en_length,mk_length,punct_ratio,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
count,22055,22055,22055,22055,22055,22055,22055,22055,22055.0,22055,22055,22055.0,22055.0,22055.0,22055,22055,22055.0,20108
unique,2,2,22055,4,4,21801,1152,1152,,22055,21796,,,,9,9492,,7
top,No,mk-orig,http://6maj.mk/Dealer.aspx?d=78&amp;language=2,UNK,A,https://www.samsung.com/mk/info/privacy/,stat.gov.mk,stat.gov.mk,,"UPM has 19 modern paper mills in Finland, Germ...",Samsung Electronics е контролорот на податоцит...,,,,News,"{'Other': 0.0001, 'Information/Explanation': 0...",,News
freq,18577,12953,1,9848,10858,5,1264,1264,,1,5,,,,9695,1870,,9225
mean,,,,,,,,,0.918045,,,323.598322,301.580503,0.090507,,,0.973208,
std,,,,,,,,,0.05468,,,540.894229,526.082911,0.027523,,,0.083511,
min,,,,,,,,,0.5185,,,79.0,2.0,0.016667,,,0.295794,
25%,,,,,,,,,0.892667,,,125.0,109.0,0.072464,,,0.996209,
50%,,,,,,,,,0.93,,,194.0,173.0,0.087719,,,0.998766,
75%,,,,,,,,,0.957333,,,330.0,307.0,0.104023,,,0.998973,


In [19]:
# Save the new file (tab-separated, so that it can be read back with the read_csv call in cell 2)
corpus.to_csv("Macocu-mk-en-predicted.csv", sep="\t")

In [20]:
# Analyze English domains in the corpus
count = pd.DataFrame({"Count": list(corpus.en_domain.value_counts()), "Percentage": list(corpus.en_domain.value_counts(normalize=True)*100)}, index = corpus.en_domain.value_counts().index)

print(count.to_markdown())

|                                  |   Count |   Percentage |
|:---------------------------------|--------:|-------------:|
| stat.gov.mk                      |    1264 |   5.73113    |
| meta.mk                          |    1216 |   5.51349    |
| seeu.edu.mk                      |     981 |   4.44797    |
| finance.gov.mk                   |     668 |   3.02879    |
| ssm.org.mk                       |     598 |   2.7114     |
| sobranie.mk                      |     586 |   2.65699    |
| loging.mk                        |     474 |   2.14917    |
| eprints.ugd.edu.mk               |     410 |   1.85899    |
| ckrm.org.mk                      |     373 |   1.69123    |
| rkmetalurg.mk                    |     337 |   1.528      |
| customs.gov.mk                   |     315 |   1.42825    |
| mcms.mk                          |     270 |   1.22421    |
| alkaloid.com.mk                  |     263 |   1.19247    |
| atamacedonia.org.mk              |     251 |   1.13806    |
| bujink

In [23]:
print(count.index.to_list()[:16])

['stat.gov.mk', 'meta.mk', 'seeu.edu.mk', 'finance.gov.mk', 'ssm.org.mk', 'sobranie.mk', 'loging.mk', 'eprints.ugd.edu.mk', 'ckrm.org.mk', 'rkmetalurg.mk', 'customs.gov.mk', 'mcms.mk', 'alkaloid.com.mk', 'atamacedonia.org.mk', 'bujinkan.koryu.mk', 'clp.mk']
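
The list of the most frequent domains does not have to be copied by hand from the printed output. A minimal sketch on toy data follows; the helper name `top_values` and the toy domain names are hypothetical.

```python
import pandas as pd

def top_values(column, n):
    """Return the n most frequent values of a column, most frequent first."""
    return column.value_counts().index[:n].tolist()

# Toy data for illustration only
toy_domains = pd.Series(["a.mk", "b.mk", "a.mk", "c.mk", "a.mk", "b.mk"])
print(top_values(toy_domains, 2))
```

Applied to this notebook, `top_values(corpus["en_domain"], 16)` would be expected to give the same list as the one printed above.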


In [24]:
# See the distribution of genres in the most frequent domains:
frequent_domains = ['stat.gov.mk', 'meta.mk', 'seeu.edu.mk', 'finance.gov.mk', 'ssm.org.mk', 'sobranie.mk', 'loging.mk', 'eprints.ugd.edu.mk', 'ckrm.org.mk', 'rkmetalurg.mk', 'customs.gov.mk', 'mcms.mk', 'alkaloid.com.mk', 'atamacedonia.org.mk', 'bujinkan.koryu.mk', 'clp.mk']

for i in frequent_domains:
    print(i)
    filtered_corpus = corpus[corpus["en_domain"] == i]
    print(filtered_corpus["final-X-GENRE"].value_counts(normalize=True).to_markdown())

stat.gov.mk
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Information/Explanation |     0.631678    |
| News                    |     0.360825    |
| Legal                   |     0.00374883  |
| Instruction             |     0.00187441  |
| Opinion/Argumentation   |     0.000937207 |
| Promotion               |     0.000937207 |
meta.mk
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| News                    |     0.956926    |
| Information/Explanation |     0.0253378   |
| Opinion/Argumentation   |     0.00929054  |
| Promotion               |     0.00422297  |
| Legal                   |     0.00337838  |
| Instruction             |     0.000844595 |
seeu.edu.mk
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| News                    |      0.777426   |
| Information/Explanation |      0.187764   |
| Promotion               |      0.0168776  |
| 

In [26]:
# For each genre, show the five most frequent English domains

for i in ['Opinion/Argumentation', 'News', 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_domain"].value_counts(normalize=True)[:5].to_markdown())

Opinion/Argumentation
|                   |   en_domain |
|:------------------|------------:|
| bujinkan.koryu.mk |   0.117812  |
| sobranie.mk       |   0.0925666 |
| prohor.org        |   0.0617111 |
| mpc.org.mk        |   0.0546985 |
| marxists.org      |   0.0406732 |
News
|                |   en_domain |
|:---------------|------------:|
| meta.mk        |   0.122818  |
| seeu.edu.mk    |   0.0798916 |
| finance.gov.mk |   0.0688347 |
| ssm.org.mk     |   0.0525745 |
| stat.gov.mk    |   0.0417344 |
Legal
|                |   en_domain |
|:---------------|------------:|
| ustavensud.mk  |   0.122581  |
| sobranie.mk    |   0.0593548 |
| kb.com.mk      |   0.0425806 |
| customs.gov.mk |   0.0322581 |
| ssm.org.mk     |   0.0309677 |
Information/Explanation
|                    |   en_domain |
|:-------------------|------------:|
| stat.gov.mk        |   0.127218  |
| eprints.ugd.edu.mk |   0.0734239 |
| mcms.mk            |   0.0386938 |
| seeu.edu.mk        |   0.0335976 |
| iph.m

In [27]:
# For each genre, show the distribution of English language varieties

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction', 'Information/Explanation', 'Legal', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_var_doc"].value_counts(normalize=True).to_markdown())

News
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.452141  |
| A   |    0.293333  |
| B   |    0.200759  |
| MIX |    0.0537669 |
Opinion/Argumentation
|     |   en_var_doc |
|:----|-------------:|
| A   |    0.357644  |
| UNK |    0.297335  |
| B   |    0.253857  |
| MIX |    0.0911641 |
Promotion
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.444904  |
| A   |    0.354777  |
| B   |    0.149682  |
| MIX |    0.0506369 |
Instruction
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.480501  |
| A   |    0.32312   |
| B   |    0.160167  |
| MIX |    0.0362117 |
Information/Explanation
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.452812  |
| A   |    0.317101  |
| B   |    0.173273  |
| MIX |    0.0568139 |
Legal
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.461935  |
| B   |    0.249032  |
| A   |    0.230968  |
| MIX |    0.0580645 |
Prose/Lyrical
|     |   en_var_doc |
|:----|-------------:|
| A   |    0.393305  |
| UNK |   

In [28]:
# Length distribution of the entire corpus
print(corpus["en_length"].describe().to_markdown())

|       |   en_length |
|:------|------------:|
| count |   22055     |
| mean  |     323.598 |
| std   |     540.894 |
| min   |      79     |
| 25%   |     125     |
| 50%   |     194     |
| 75%   |     330     |
| max   |   16139     |
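
The same summary can be computed per genre in a single call with `groupby`, as an alternative to looping over the labels. This is a minimal sketch on toy data; the toy values are hypothetical, while the column names `final-X-GENRE` and `en_length` are the ones used in this notebook.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "Legal", None],
    "en_length": [100, 200, 400, 90],
})
# groupby drops rows whose key is missing, so discarded predictions are skipped
per_genre = toy.groupby("final-X-GENRE")["en_length"].describe()
print(per_genre[["count", "mean", "50%"]])
```

The result has one row per genre and one column per statistic, which makes the genres easier to compare side by side than separate tables.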


In [29]:
# Analyze differences in genres based on text length

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction', 'Information/Explanation', 'Legal', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_length"].describe().to_markdown())

News
|       |   en_length |
|:------|------------:|
| count |    9225     |
| mean  |     269.074 |
| std   |     254.154 |
| min   |      79     |
| 25%   |     134     |
| 50%   |     201     |
| 75%   |     313     |
| max   |    6104     |
Opinion/Argumentation
|       |   en_length |
|:------|------------:|
| count |     713     |
| mean  |     708.867 |
| std   |    1044.54  |
| min   |      79     |
| 25%   |     194     |
| 50%   |     399     |
| 75%   |     828     |
| max   |   14660     |
Promotion
|       |   en_length |
|:------|------------:|
| count |    3140     |
| mean  |     223.256 |
| std   |     226.403 |
| min   |      79     |
| 25%   |     110     |
| 50%   |     155     |
| 75%   |     244     |
| max   |    3970     |
Instruction
|       |   en_length |
|:------|------------:|
| count |     718     |
| mean  |     344.571 |
| std   |     372.876 |
| min   |      79     |
| 25%   |     137     |
| 50%   |     222.5   |
| 75%   |     406.75  |
| max   |    49