# Post-processing of predicted data

Based on the findings of the manual analysis, we post-process the data to get reliable results:
* predictions are removed from texts labeled as "Other" or "Forum"
* predictions are removed from texts where the prediction confidence is below 0.9

The analysis of the results (the relationship between genres and language varieties) is then done on the post-processed data.
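
The two rules above can also be expressed as a single boolean mask. The following is a minimal sketch, not the notebook's own code: the helper name `postprocess_genres`, the constants, and the toy data are assumptions made for illustration, while the column names `X-GENRE`, `chosen_category_distr` and `final-X-GENRE` are those used in this notebook.

```python
import pandas as pd

# Labels treated as unreliable and the confidence cut-off, as stated in the list above.
UNRELIABLE_LABELS = ["Other", "Forum"]
MIN_CONFIDENCE = 0.9


def postprocess_genres(df, label_col="X-GENRE", conf_col="chosen_category_distr"):
    """Hypothetical helper: return a copy of df with a 'final-X-GENRE' column."""
    # Keep a prediction only if its label is reliable AND its confidence is high enough.
    keep = ~df[label_col].isin(UNRELIABLE_LABELS) & (df[conf_col] >= MIN_CONFIDENCE)
    out = df.copy()
    # Series.where keeps values where the mask is True and writes NaN elsewhere.
    out["final-X-GENRE"] = out[label_col].where(keep)
    return out


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "X-GENRE": ["News", "Other", "Forum", "Legal", "Promotion"],
    "chosen_category_distr": [0.99, 0.95, 0.99, 0.85, 0.90],
})
result = postprocess_genres(toy)
print(result[["X-GENRE", "chosen_category_distr", "final-X-GENRE"]])
```

One difference to keep in mind: this sketch drops the label for rows whose confidence is missing, whereas a `< 0.9` comparison evaluates to False for missing values and would therefore keep such labels.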

In [1]:
import pandas as pd
import numpy as np

In [2]:
corpus = pd.read_csv("Macocu-is-en-predicted.csv", sep = "\t", index_col = 0)
corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,is_source,en_domain,is_domain,average_score,en_doc,is_doc,en_length,is_length,X-GENRE,label_distribution,chosen_category_distr
321109,No,en-orig,http://2way.is/,A,A,http://2way.is/is/,2way,2way,0.808083,Members can provide constructive feedback for ...,Notendur geta veitt endurgjöf á gististaði og ...,385,383,Instruction,"{'Other': 0.0001, 'Information/Explanation': 0...",0.997964
267802,No,is-orig,http://aflafrettir.com/en/frettir/flokkur/19,UNK,UNK,http://aflafrettir.is/frettir/flokkur/19,aflafrettir,aflafrettir,0.787,"Skreigrunn with 41 tons in 4 trips,. Akom came...",Skreigrunn með 41 tonn í 4 róðrum . Ingvaldson...,230,214,Forum,"{'Other': 0.0005, 'Information/Explanation': 0...",0.997729
207459,No,is-orig,http://aflafrettir.com/en/frettir/grein/16-cre...,UNK,UNK,http://aflafrettir.is/frettir/grein/16-manns-s...,aflafrettir,aflafrettir,0.841167,There have been quite a number of vessels fish...,Það hefur verið þónokkur fjöldi skipa á veiðum...,108,105,News,"{'Other': 0.0001, 'Information/Explanation': 0...",0.99815


In [3]:
# See initial number of texts
corpus.shape

(13174, 16)

In [6]:
print(corpus["X-GENRE"].value_counts(normalize=True).to_markdown())

|                         |    X-GENRE |
|:------------------------|-----------:|
| Information/Explanation | 0.305526   |
| News                    | 0.239866   |
| Instruction             | 0.156445   |
| Promotion               | 0.151359   |
| Legal                   | 0.0575376  |
| Opinion/Argumentation   | 0.0538181  |
| Other                   | 0.024518   |
| Forum                   | 0.00698345 |
| Prose/Lyrical           | 0.00394717 |


In [7]:
# Post-process the data

# Copy all predicted labels to a new column, except if the label is "Other"
corpus["final-X-GENRE"] = np.where(corpus["X-GENRE"] == "Other", np.nan, corpus["X-GENRE"])

corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,is_source,en_domain,is_domain,average_score,en_doc,is_doc,en_length,is_length,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
321109,No,en-orig,http://2way.is/,A,A,http://2way.is/is/,2way,2way,0.808083,Members can provide constructive feedback for ...,Notendur geta veitt endurgjöf á gististaði og ...,385,383,Instruction,"{'Other': 0.0001, 'Information/Explanation': 0...",0.997964,Instruction
267802,No,is-orig,http://aflafrettir.com/en/frettir/flokkur/19,UNK,UNK,http://aflafrettir.is/frettir/flokkur/19,aflafrettir,aflafrettir,0.787,"Skreigrunn with 41 tons in 4 trips,. Akom came...",Skreigrunn með 41 tonn í 4 róðrum . Ingvaldson...,230,214,Forum,"{'Other': 0.0005, 'Information/Explanation': 0...",0.997729,Forum
207459,No,is-orig,http://aflafrettir.com/en/frettir/grein/16-cre...,UNK,UNK,http://aflafrettir.is/frettir/grein/16-manns-s...,aflafrettir,aflafrettir,0.841167,There have been quite a number of vessels fish...,Það hefur verið þónokkur fjöldi skipa á veiðum...,108,105,News,"{'Other': 0.0001, 'Information/Explanation': 0...",0.99815,News


In [8]:
corpus["final-X-GENRE"].value_counts()

Information/Explanation    4025
News                       3160
Instruction                2061
Promotion                  1994
Legal                       758
Opinion/Argumentation       709
Forum                        92
Prose/Lyrical                52
Name: final-X-GENRE, dtype: int64

In [9]:
# Also remove the "Forum" label from the "final-X-GENRE" column
corpus["final-X-GENRE"] = np.where(corpus["final-X-GENRE"] == "Forum", np.nan, corpus["final-X-GENRE"])

In [10]:
corpus["final-X-GENRE"].value_counts()

Information/Explanation    4025
News                       3160
Instruction                2061
Promotion                  1994
Legal                       758
Opinion/Argumentation       709
Prose/Lyrical                52
Name: final-X-GENRE, dtype: int64

In [11]:
corpus["final-X-GENRE"].describe()

count                       12759
unique                          7
top       Information/Explanation
freq                         4025
Name: final-X-GENRE, dtype: object

In [12]:
# Remove the labels whose prediction confidence is lower than 0.9
corpus["final-X-GENRE"] = np.where(corpus["chosen_category_distr"] < 0.9, np.nan, corpus["final-X-GENRE"])

In [13]:
# See the final distribution
corpus["final-X-GENRE"].describe()

count                       11639
unique                          7
top       Information/Explanation
freq                         3753
Name: final-X-GENRE, dtype: object

In [17]:
print(corpus["final-X-GENRE"].value_counts().to_markdown())

|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Information/Explanation |            3753 |
| News                    |            2916 |
| Instruction             |            1851 |
| Promotion               |            1806 |
| Legal                   |             672 |
| Opinion/Argumentation   |             595 |
| Prose/Lyrical           |              46 |


In [18]:
LABELS = list(corpus["final-X-GENRE"].unique())
print(LABELS)

['Instruction', nan, 'News', 'Promotion', 'Information/Explanation', 'Legal', 'Opinion/Argumentation', 'Prose/Lyrical']
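
A genre-by-variety breakdown can also be computed in one call instead of filtering genre by genre. The sketch below uses `pd.crosstab` on toy data invented for illustration; only the column names `final-X-GENRE` and `en_var_doc` come from this notebook. Rows with a missing genre are left out of the table, so the `nan` entry seen in the label list needs no special handling.

```python
import pandas as pd

# Toy data, invented for illustration; the real columns live in `corpus`.
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "News", "News", "Legal", "Legal", None],
    "en_var_doc": ["B", "B", "UNK", "A", "B", "UNK", "A"],
})

# normalize="index" makes every row (genre) sum to 1,
# which corresponds to value_counts(normalize=True) computed per genre.
variety_by_genre = pd.crosstab(
    toy["final-X-GENRE"], toy["en_var_doc"], normalize="index"
)
print(variety_by_genre.round(3))
```

On the real data, the same call would take `corpus["final-X-GENRE"]` and `corpus["en_var_doc"]` as its two arguments.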


In [19]:
# Analyze differences in genres based on language varieties

for i in ['Instruction','News', 'Promotion', 'Information/Explanation', 'Legal', 'Opinion/Argumentation', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_var_doc"].value_counts(normalize=True).to_markdown())

Instruction
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.379795  |
| B   |    0.349001  |
| A   |    0.212858  |
| MIX |    0.0583468 |
News
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.498971  |
| UNK |    0.351509  |
| A   |    0.107339  |
| MIX |    0.0421811 |
Promotion
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.399779  |
| A   |    0.284053  |
| B   |    0.245293  |
| MIX |    0.0708749 |
Information/Explanation
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.404476  |
| B   |    0.398348  |
| A   |    0.145217  |
| MIX |    0.0519584 |
Legal
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.502976  |
| UNK |    0.313988  |
| A   |    0.125     |
| MIX |    0.0580357 |
Opinion/Argumentation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.363025  |
| UNK |    0.305882  |
| A   |    0.248739  |
| MIX |    0.0823529 |
Prose/Lyrical
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.369565  |
| B   |   

In [20]:
# Length distribution of the entire corpus
print(corpus["en_length"].describe().to_markdown())

|       |   en_length |
|:------|------------:|
| count |   13174     |
| mean  |     346.647 |
| std   |     502.707 |
| min   |      79     |
| 25%   |     124     |
| 50%   |     201     |
| 75%   |     380     |
| max   |   11125     |
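
The same summary can be split by genre with a single `groupby`, as an alternative to describing one filtered subset at a time. This is a minimal sketch on toy data invented for illustration; only the column names `final-X-GENRE` and `en_length` are taken from this notebook.

```python
import pandas as pd

# Toy data, invented for illustration; the real columns live in `corpus`.
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "Promotion", "Promotion", "Promotion", None],
    "en_length": [100, 300, 80, 120, 160, 500],
})

# groupby skips rows whose genre is missing, so removed predictions are ignored.
length_by_genre = toy.groupby("final-X-GENRE")["en_length"].describe()
print(length_by_genre[["count", "mean", "50%", "max"]])
```

Each row of the result holds the statistics for one genre, which makes medians and means easier to compare side by side.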


In [21]:
# Analyze differences in genres based on text length

for i in ['Instruction','News', 'Promotion', 'Information/Explanation', 'Legal', 'Opinion/Argumentation', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_length"].describe().to_markdown())

Instruction
|       |   en_length |
|:------|------------:|
| count |    1851     |
| mean  |     451.455 |
| std   |     655.151 |
| min   |      79     |
| 25%   |     146.5   |
| 50%   |     248     |
| 75%   |     487.5   |
| max   |    8663     |
News
|       |   en_length |
|:------|------------:|
| count |    2916     |
| mean  |     345.765 |
| std   |     354.283 |
| min   |      79     |
| 25%   |     141     |
| 50%   |     243     |
| 75%   |     432     |
| max   |    6054     |
Promotion
|       |   en_length |
|:------|------------:|
| count |    1806     |
| mean  |     209.564 |
| std   |     302.288 |
| min   |      79     |
| 25%   |     102     |
| 50%   |     140     |
| 75%   |     222.75  |
| max   |    6234     |
Information/Explanation
|       |   en_length |
|:------|------------:|
| count |    3753     |
| mean  |     274.125 |
| std   |     389.098 |
| min   |      79     |
| 25%   |     112     |
| 50%   |     170     |
| 75%   |     290     |
| max   |   1