# Post-processing of predicted data

Based on the findings of the manual analysis, we post-process the data to get reliable results:
* predictions are removed from texts labeled as "Other" or "Forum"
* predictions are removed from texts where the prediction confidence is below 0.9

The analysis of the results (the connection between genres and language varieties) is then done on the post-processed data.
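The two filtering rules can also be expressed as a single step. The following is a minimal sketch, not the code used in this notebook: the helper name `postprocess_genres` and the toy data are assumptions made for illustration, while the column names `X-GENRE` and `chosen_category_distr` are the ones used in the corpus file.

```python
import pandas as pd


def postprocess_genres(df, excluded=("Other", "Forum"), min_confidence=0.9):
    """Return the genre labels with unreliable predictions replaced by NaN.

    Hypothetical helper: a label is dropped if it is one of the excluded
    genres or if the prediction confidence is below min_confidence.
    """
    unreliable = df["X-GENRE"].isin(excluded) | (
        df["chosen_category_distr"] < min_confidence
    )
    # Series.mask replaces the values where the condition is True with NaN
    return df["X-GENRE"].mask(unreliable)


# Toy data, made up for illustration (not taken from the corpus)
toy = pd.DataFrame({
    "X-GENRE": ["News", "Other", "Forum", "Legal", "News"],
    "chosen_category_distr": [0.99, 0.98, 0.97, 0.85, 0.90],
})
toy["final-X-GENRE"] = postprocess_genres(toy)
print(toy)
```

Note that a confidence of exactly 0.9 is kept, because only values strictly below the threshold are removed.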

In [1]:
import pandas as pd
import numpy as np

In [2]:
corpus = pd.read_csv("Macocu-is-en-predicted.csv", index_col = 0)
corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,is_source,en_domain,is_domain,average_score,en_doc,is_doc,en_length,is_length,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
321109,No,en-orig,http://2way.is/,A,A,http://2way.is/is/,2way,2way,0.808083,Members can provide constructive feedback for ...,Notendur geta veitt endurgjöf á gististaði og ...,385,383,Instruction,"{'Other': 0.0001, 'Information/Explanation': 0...",0.997964,Instruction
267802,No,is-orig,http://aflafrettir.com/en/frettir/flokkur/19,UNK,UNK,http://aflafrettir.is/frettir/flokkur/19,aflafrettir,aflafrettir,0.787,"Skreigrunn with 41 tons in 4 trips,. Akom came...",Skreigrunn með 41 tonn í 4 róðrum . Ingvaldson...,230,214,Forum,"{'Other': 0.0005, 'Information/Explanation': 0...",0.997729,
207459,No,is-orig,http://aflafrettir.com/en/frettir/grein/16-cre...,UNK,UNK,http://aflafrettir.is/frettir/grein/16-manns-s...,aflafrettir,aflafrettir,0.841167,There have been quite a number of vessels fish...,Það hefur verið þónokkur fjöldi skipa á veiðum...,108,105,News,"{'Other': 0.0001, 'Information/Explanation': 0...",0.99815,News


In [4]:
# See initial number of texts
corpus.shape

(13174, 17)

In [5]:
print(corpus["X-GENRE"].value_counts(normalize=True).to_markdown())

|                         |    X-GENRE |
|:------------------------|-----------:|
| Information/Explanation | 0.305526   |
| News                    | 0.239866   |
| Instruction             | 0.156445   |
| Promotion               | 0.151359   |
| Legal                   | 0.0575376  |
| Opinion/Argumentation   | 0.0538181  |
| Other                   | 0.024518   |
| Forum                   | 0.00698345 |
| Prose/Lyrical           | 0.00394717 |


In [6]:
# Post-process the data

# Copy all predicted labels to a new column, except if the label is "Other"
corpus["final-X-GENRE"] = np.where(corpus["X-GENRE"] == "Other", np.nan, corpus["X-GENRE"])

corpus.head(3)

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,is_source,en_domain,is_domain,average_score,en_doc,is_doc,en_length,is_length,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
321109,No,en-orig,http://2way.is/,A,A,http://2way.is/is/,2way,2way,0.808083,Members can provide constructive feedback for ...,Notendur geta veitt endurgjöf á gististaði og ...,385,383,Instruction,"{'Other': 0.0001, 'Information/Explanation': 0...",0.997964,Instruction
267802,No,is-orig,http://aflafrettir.com/en/frettir/flokkur/19,UNK,UNK,http://aflafrettir.is/frettir/flokkur/19,aflafrettir,aflafrettir,0.787,"Skreigrunn with 41 tons in 4 trips,. Akom came...",Skreigrunn með 41 tonn í 4 róðrum . Ingvaldson...,230,214,Forum,"{'Other': 0.0005, 'Information/Explanation': 0...",0.997729,Forum
207459,No,is-orig,http://aflafrettir.com/en/frettir/grein/16-cre...,UNK,UNK,http://aflafrettir.is/frettir/grein/16-manns-s...,aflafrettir,aflafrettir,0.841167,There have been quite a number of vessels fish...,Það hefur verið þónokkur fjöldi skipa á veiðum...,108,105,News,"{'Other': 0.0001, 'Information/Explanation': 0...",0.99815,News


In [7]:
corpus["final-X-GENRE"].value_counts()

Information/Explanation    4025
News                       3160
Instruction                2061
Promotion                  1994
Legal                       758
Opinion/Argumentation       709
Forum                        92
Prose/Lyrical                52
Name: final-X-GENRE, dtype: int64

In [8]:
# Additionally remove the label from "final-X-GENRE" if it is "Forum"
corpus["final-X-GENRE"] = np.where(corpus["final-X-GENRE"] == "Forum", np.nan, corpus["final-X-GENRE"])

In [9]:
corpus["final-X-GENRE"].value_counts()

Information/Explanation    4025
News                       3160
Instruction                2061
Promotion                  1994
Legal                       758
Opinion/Argumentation       709
Prose/Lyrical                52
Name: final-X-GENRE, dtype: int64

In [10]:
corpus["final-X-GENRE"].describe()

count                       12759
unique                          7
top       Information/Explanation
freq                         4025
Name: final-X-GENRE, dtype: object

In [11]:
# Additionally remove the label from "final-X-GENRE" if the prediction confidence is lower than 0.9
corpus["final-X-GENRE"] = np.where(corpus["chosen_category_distr"] < 0.9, np.nan, corpus["final-X-GENRE"])

In [12]:
# See the final distribution
corpus["final-X-GENRE"].describe()

count                       11639
unique                          7
top       Information/Explanation
freq                         3753
Name: final-X-GENRE, dtype: object

In [5]:
print(corpus["final-X-GENRE"].value_counts().to_markdown())

|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Information/Explanation |            3753 |
| News                    |            2916 |
| Instruction             |            1851 |
| Promotion               |            1806 |
| Legal                   |             672 |
| Opinion/Argumentation   |             595 |
| Prose/Lyrical           |              46 |


In [6]:
LABELS = list(corpus["final-X-GENRE"].unique())
print(LABELS)

['Instruction', nan, 'News', 'Promotion', 'Information/Explanation', 'Legal', 'Opinion/Argumentation', 'Prose/Lyrical']


In [15]:
# Save the post-processed data (this overwrites the input file)
corpus.to_csv("Macocu-is-en-predicted.csv")

In [7]:
corpus.describe(include="all")

Unnamed: 0,biroamer_entities,translation_direction,en_source,en_var_doc,en_var_dom,is_source,en_domain,is_domain,average_score,en_doc,is_doc,en_length,is_length,X-GENRE,label_distribution,chosen_category_distr,final-X-GENRE
count,13174,13174,13174,13174,13174,13174,13174,13174,13174.0,13174,13174,13174.0,13174.0,13174,13174,13174.0,11639
unique,2,2,13174,4,4,12328,1112,1112,,13174,12328,,,9,7079,,7
top,No,is-orig,http://2way.is/,B,B,https://bookingauto.com/is/ireland,norden,norden,,Members can provide constructive feedback for ...,"Til dæmis, ef einn manneskja eða lítill hópur ...",,,Information/Explanation,"{'Other': 0.0001, 'Information/Explanation': 0...",,Information/Explanation
freq,11158,10152,1,5163,7745,9,913,913,,1,9,,,4025,566,,3753
mean,,,,,,,,,0.865217,,,346.646956,326.917034,,,0.965008,
std,,,,,,,,,0.058979,,,502.70679,528.200731,,,0.096853,
min,,,,,,,,,0.512,,,79.0,2.0,,,0.284306,
25%,,,,,,,,,0.836195,,,124.0,105.0,,,0.993842,
50%,,,,,,,,,0.875971,,,201.0,183.0,,,0.99854,
75%,,,,,,,,,0.905872,,,380.0,360.75,,,0.998932,


In [16]:
# Analyze English domains in the corpus
count = pd.DataFrame({"Count": list(corpus.en_domain.value_counts()), "Percentage": list(corpus.en_domain.value_counts(normalize=True)*100)}, index = corpus.en_domain.value_counts().index)

print(count.to_markdown())

|                                  |   Count |   Percentage |
|:---------------------------------|--------:|-------------:|
| norden                           |     913 |   6.93032    |
| eso                              |     528 |   4.00789    |
| landssjodir                      |     373 |   2.83133    |
| rnh                              |     336 |   2.55048    |
| lhi                              |     320 |   2.42903    |
| booking                          |     310 |   2.35312    |
| neway                            |     274 |   2.07985    |
| efling                           |     264 |   2.00395    |
| garnstudio                       |     251 |   1.90527    |
| laeknabladid                     |     219 |   1.66237    |
| skaftfell                        |     170 |   1.29042    |
| linde-gas                        |     147 |   1.11583    |
| land                             |     140 |   1.0627     |
| landsbokasafn                    |     138 |   1.04752    |
| arionb
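The count and percentage table can also be derived from a single `value_counts()` call. This is a minimal sketch on made-up domain names; `domain_table` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def domain_table(domains):
    """Build a table of absolute counts and percentages per domain."""
    counts = domains.value_counts()  # sorted by frequency, NaN excluded
    return pd.DataFrame({
        "Count": counts,
        "Percentage": counts / counts.sum() * 100,
    })


# Toy data, made up for illustration (not taken from the corpus)
toy_domains = pd.Series(["site-a", "site-a", "site-b", "site-c"])
table = domain_table(toy_domains)
print(table)
```

Because both columns share the index of `counts`, no separate `index` argument is needed.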

In [19]:
# See the distribution of genres in the most frequent domains:
frequent_domains = ["norden","eso","landssjodir" ,"rnh","lhi","booking","neway","efling","garnstudio","laeknabladid",
"skaftfell","linde-gas","land","landsbokasafn","arionbanki","borgarbokasafn"]

for i in frequent_domains:
    print(i)
    filtered_corpus = corpus[corpus["en_domain"] == i]
    print(filtered_corpus["final-X-GENRE"].value_counts(normalize=True).to_markdown())

norden
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| News                    |       0.461728  |
| Information/Explanation |       0.246914  |
| Opinion/Argumentation   |       0.117284  |
| Instruction             |       0.0975309 |
| Promotion               |       0.0493827 |
| Legal                   |       0.0271605 |
eso
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| Information/Explanation |      0.6893     |
| News                    |      0.304527   |
| Promotion               |      0.00411523 |
| Prose/Lyrical           |      0.00205761 |
landssjodir
|                         |   final-X-GENRE |
|:------------------------|----------------:|
| News                    |      0.961644   |
| Information/Explanation |      0.0219178  |
| Instruction             |      0.0109589  |
| Opinion/Argumentation   |      0.00273973 |
| Promotion               |      0.00273973 |
rnh
|      

In [20]:
# Analyze differences in genres based on domain frequency

for i in ['Opinion/Argumentation', 'News', 'Legal', 'Information/Explanation', 'Promotion', 'Instruction', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_domain"].value_counts(normalize=True)[:5].to_markdown())

Opinion/Argumentation
|                |   en_domain |
|:---------------|------------:|
| norden         |   0.159664  |
| lhi            |   0.10084   |
| flora-utgafa   |   0.0638655 |
| studentabladid |   0.0571429 |
| gangmyllan     |   0.0336134 |
News
|             |   en_domain |
|:------------|------------:|
| norden      |   0.128258  |
| landssjodir |   0.12037   |
| rnh         |   0.0737311 |
| efling      |   0.0524691 |
| eso         |   0.0507545 |
Legal
|        |   en_domain |
|:-------|------------:|
| randa  |   0.0833333 |
| utl    |   0.0625    |
| land   |   0.0610119 |
| fme    |   0.0431548 |
| norden |   0.0327381 |
Information/Explanation
|                |   en_domain |
|:---------------|------------:|
| eso            |   0.0892619 |
| laeknabladid   |   0.0583533 |
| norden         |   0.0532907 |
| lhi            |   0.0327738 |
| visar.hagstofa |   0.0285105 |
Promotion
|                |   en_domain |
|:---------------|------------:|
| booking        |  

In [24]:
# Analyze differences in genres based on language varieties - print raw scores

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical']:
    #print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(dict(sorted(filtered_corpus["en_var_doc"].value_counts().to_dict().items())))

{'A': 313, 'B': 1455, 'MIX': 123, 'UNK': 1025}
{'A': 148, 'B': 216, 'MIX': 49, 'UNK': 182}
{'A': 513, 'B': 443, 'MIX': 128, 'UNK': 722}
{'A': 394, 'B': 646, 'MIX': 108, 'UNK': 703}
{'A': 545, 'B': 1495, 'MIX': 195, 'UNK': 1518}
{'A': 84, 'B': 338, 'MIX': 39, 'UNK': 211}
{'A': 13, 'B': 14, 'MIX': 2, 'UNK': 17}


In [23]:
# Analyze differences in genres based on language varieties - print normalized scores

for i in ['News', 'Opinion/Argumentation', 'Promotion', 'Instruction','Information/Explanation', 'Legal','Prose/Lyrical']:
    #print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(dict(sorted((filtered_corpus["en_var_doc"].value_counts(normalize = True)*100).round(2).to_dict().items())))

{'A': 10.73, 'B': 49.9, 'MIX': 4.22, 'UNK': 35.15}
{'A': 24.87, 'B': 36.3, 'MIX': 8.24, 'UNK': 30.59}
{'A': 28.41, 'B': 24.53, 'MIX': 7.09, 'UNK': 39.98}
{'A': 21.29, 'B': 34.9, 'MIX': 5.83, 'UNK': 37.98}
{'A': 14.52, 'B': 39.83, 'MIX': 5.2, 'UNK': 40.45}
{'A': 12.5, 'B': 50.3, 'MIX': 5.8, 'UNK': 31.4}
{'A': 28.26, 'B': 30.43, 'MIX': 4.35, 'UNK': 36.96}
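The raw and the normalized counts per genre can also be collected in one table each with `pd.crosstab`, which avoids looping over the labels. This is a minimal sketch on made-up data, assuming the column names `final-X-GENRE` and `en_var_doc` from the corpus; it is an alternative, not the method used in this notebook.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the corpus)
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "News", "Legal", "Legal", None],
    "en_var_doc": ["A", "B", "B", "B", "UNK", "A"],
})

# Keep only the texts that still have a genre label after post-processing
labelled = toy.dropna(subset=["final-X-GENRE"])

# Raw counts: one row per genre, one column per language variety
raw = pd.crosstab(labelled["final-X-GENRE"], labelled["en_var_doc"])

# Percentages within each genre (each row sums to 100)
shares = (
    pd.crosstab(labelled["final-X-GENRE"], labelled["en_var_doc"], normalize="index")
    * 100
).round(2)

print(raw)
print(shares)
```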


In [8]:
# Analyze differences in genres based on language varieties

for i in ['Instruction','News', 'Promotion', 'Information/Explanation', 'Legal', 'Opinion/Argumentation', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_var_doc"].value_counts(normalize=True).to_markdown())

Instruction
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.379795  |
| B   |    0.349001  |
| A   |    0.212858  |
| MIX |    0.0583468 |
News
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.498971  |
| UNK |    0.351509  |
| A   |    0.107339  |
| MIX |    0.0421811 |
Promotion
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.399779  |
| A   |    0.284053  |
| B   |    0.245293  |
| MIX |    0.0708749 |
Information/Explanation
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.404476  |
| B   |    0.398348  |
| A   |    0.145217  |
| MIX |    0.0519584 |
Legal
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.502976  |
| UNK |    0.313988  |
| A   |    0.125     |
| MIX |    0.0580357 |
Opinion/Argumentation
|     |   en_var_doc |
|:----|-------------:|
| B   |    0.363025  |
| UNK |    0.305882  |
| A   |    0.248739  |
| MIX |    0.0823529 |
Prose/Lyrical
|     |   en_var_doc |
|:----|-------------:|
| UNK |    0.369565  |
| B   |   

In [9]:
# Length distribution of the entire corpus
print(corpus["en_length"].describe().to_markdown())

|       |   en_length |
|:------|------------:|
| count |   13174     |
| mean  |     346.647 |
| std   |     502.707 |
| min   |      79     |
| 25%   |     124     |
| 50%   |     201     |
| 75%   |     380     |
| max   |   11125     |


In [10]:
# Analyze differences in genres based on text length

for i in ['Instruction','News', 'Promotion', 'Information/Explanation', 'Legal', 'Opinion/Argumentation', 'Prose/Lyrical']:
    print(i)
    filtered_corpus = corpus[corpus["final-X-GENRE"] == i]
    print(filtered_corpus["en_length"].describe().to_markdown())

Instruction
|       |   en_length |
|:------|------------:|
| count |    1851     |
| mean  |     451.455 |
| std   |     655.151 |
| min   |      79     |
| 25%   |     146.5   |
| 50%   |     248     |
| 75%   |     487.5   |
| max   |    8663     |
News
|       |   en_length |
|:------|------------:|
| count |    2916     |
| mean  |     345.765 |
| std   |     354.283 |
| min   |      79     |
| 25%   |     141     |
| 50%   |     243     |
| 75%   |     432     |
| max   |    6054     |
Promotion
|       |   en_length |
|:------|------------:|
| count |    1806     |
| mean  |     209.564 |
| std   |     302.288 |
| min   |      79     |
| 25%   |     102     |
| 50%   |     140     |
| 75%   |     222.75  |
| max   |    6234     |
Information/Explanation
|       |   en_length |
|:------|------------:|
| count |    3753     |
| mean  |     274.125 |
| std   |     389.098 |
| min   |      79     |
| 25%   |     112     |
| 50%   |     170     |
| 75%   |     290     |
| max   |   1
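The per-genre length statistics can also be obtained in a single table with `groupby(...).describe()`. This is a minimal sketch on made-up data, assuming the column names `final-X-GENRE` and `en_length` from the corpus.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the corpus)
toy = pd.DataFrame({
    "final-X-GENRE": ["News", "News", "Legal", "Legal", None],
    "en_length": [100, 300, 80, 120, 500],
})

# groupby skips rows whose genre label is missing (NaN) by default
length_by_genre = toy.groupby("final-X-GENRE")["en_length"].describe()
print(length_by_genre)
```

Each row of `length_by_genre` holds the same statistics (count, mean, std, min, quartiles, max) that the loop prints genre by genre.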

In [3]:
corpus.shape

(13174, 17)

In [4]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
filtered_df = corpus.dropna(subset=["final-X-GENRE"]).copy()
print(filtered_df.shape)

print(filtered_df.en_var_doc.value_counts(normalize=True))

# Add combined language variety labels (if en_var_doc is MIX or UNK, use en_var_dom)

filtered_df["combined_en_var"] = np.where(filtered_df["en_var_doc"].isin(["UNK", "MIX"]), filtered_df["en_var_dom"], filtered_df["en_var_doc"])

print(filtered_df[["en_var_doc", "en_var_dom", "combined_en_var"]].head(10).to_markdown())

print(filtered_df["combined_en_var"].value_counts(normalize=True).to_markdown())

(11639, 17)
B      0.395824
UNK    0.376149
A      0.172695
MIX    0.055331
Name: en_var_doc, dtype: float64
|        | en_var_doc   | en_var_dom   | combined_en_var   |
|-------:|:-------------|:-------------|:------------------|
| 321109 | A            | A            | A                 |
| 207459 | UNK          | UNK          | UNK               |
| 302539 | UNK          | UNK          | UNK               |
| 325966 | A            | A            | A                 |
| 280580 | B            | MIX          | B                 |
|  26997 | B            | B            | B                 |
| 293926 | A            | B            | A                 |
| 322699 | UNK          | UNK          | UNK               |
| 241536 | B            | MIX          | B                 |
| 257101 | A            | MIX          | A                 |
|     |   combined_en_var |
|:----|------------------:|
| B   |         0.632357  |
| A   |         0.228714  |
| MIX |         0.124409  |
| UNK |         0.0
