## Compare FTD labels with GINCO

Based on the FTD predictions made on the GINCO dataset, I'll analyse how the FTD labels relate to the GINCO labels.

In [1]:
import pandas as pd
import numpy as np

In [3]:
# Import the GINCO dataset
ginco = pd.read_csv("final_data/GINCO-MT-GINCO-keeptext-split-file-with-all-information.csv", sep="\t", index_col=0)
ginco.head(3)

Unnamed: 0,id,url,crawled,hard,primary_level_1,primary_level_2,primary_level_3,secondary_level_1,secondary_level_2,secondary_level_3,...,tertiary_level_3,split,domain,GINCORE,Slovene_text,MT_text,text_length,FTD_pred_on_SL,FTD_pred_on_MT,split-without-rare-categories
0,3949,http://www.pomurje.si/aktualno/sport/zimska-li...,2014,False,News/Reporting,News/Reporting,News/Reporting,,,,...,,test,www.pomurje.si,News,"Šport <p/> Zimska liga malega nogometa sobota,...",Sport <p/> Winter Little League Football Satur...,93,A8 (news),A8 (news),test
1,3726,http://www.ss-sezana.si/sss/index.php?option=c...,2014,False,Information/Explanation,Information/Explanation,Information/Explanation,,,,...,,train,www.ss-sezana.si,Information/Explanation,JEDILNIK <p/> Iskalnik <p/> Poglavitni cilj pr...,JEDILNIK <p/> Search <p/> The main objective o...,76,A16 (information),A16 (information),test
2,5621,http://www.kamnik-starejsi.si/novice/144-sodel...,2014,False,Promotion of Services,Promotion of Services,Promotion,Opinion/Argumentation,Opinion/Argumentation,Opinion/Argumentation,...,Information/Explanation,train,www.kamnik-starejsi.si,Promotion,Projekt INNOVAge in zavod Oreli <p/> Zavod Ore...,Project INNOVAge and the Oreli Institute <p/> ...,232,A12 (promotion),A12 (promotion),train


In [6]:
# See label distribution in GINCO
ginco.primary_level_1.value_counts()

Information/Explanation       130
News/Reporting                115
Promotion of a Product        115
Opinion/Argumentation         114
List of Summaries/Excerpts    106
Opinionated News               89
Forum                          52
Instruction                    38
Other                          34
Invitation                     32
Promotion of Services          32
Promotion                      30
Legal/Regulation               17
Announcement                   17
Review                         17
Correspondence                 16
Call                           11
Research Article                9
Interview                       8
Recipe                          6
Prose                           6
Lyrical                         4
FAQ                             3
Script/Drama                    1
Name: primary_level_1, dtype: int64

In [9]:
# Analyse how GINCO primary labels are connected with FTD labels based on the prediction
pd.crosstab(ginco["primary_level_1"], ginco["FTD_pred_on_MT"], normalize="index")


primary_level_1,A1 (argumentative),A11 (personal),A12 (promotion),A14 (academic),A16 (information),A17 (review),A4 (fiction),A7 (instruction),A8 (news),A9 (legal)
Announcement,0.0,0.0,0.411765,0.0,0.117647,0.0,0.0,0.058824,0.411765,0.0
Call,0.272727,0.0,0.272727,0.0,0.0,0.0,0.0,0.0,0.181818,0.272727
Correspondence,0.3125,0.1875,0.125,0.0,0.0,0.0625,0.0,0.25,0.0625,0.0
FAQ,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.666667,0.0,0.0
Forum,0.384615,0.403846,0.057692,0.0,0.038462,0.019231,0.0,0.096154,0.0,0.0
Information/Explanation,0.046154,0.007692,0.253846,0.0,0.569231,0.015385,0.007692,0.007692,0.030769,0.061538
Instruction,0.0,0.026316,0.131579,0.0,0.0,0.0,0.0,0.842105,0.0,0.0
Interview,0.25,0.375,0.25,0.0,0.0,0.125,0.0,0.0,0.0,0.0
Invitation,0.0,0.0,0.8125,0.0,0.0,0.0,0.0,0.0,0.1875,0.0
Legal/Regulation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [None]:
pd.crosstab(ginco["primary_level_1"], ginco["FTD_pred_on_MT"], normalize="index").to_dict("index")
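The nested dictionary returned by `to_dict("index")` contains every FTD label for every GINCO label, including the zeros, which makes it hard to read. A minimal sketch of how it could be condensed into the non-zero shares per label is given below. The dataframe `toy` and the helper `nonzero_shares` are hypothetical and only stand in for the real `ginco` dataframe; this is not necessarily how the figures in the notes below were produced.

```python
import pandas as pd

# Toy stand-in for the GINCO dataframe (hypothetical data, same column names)
toy = pd.DataFrame({
    "primary_level_1": ["FAQ", "FAQ", "FAQ",
                        "Invitation", "Invitation", "Invitation", "Invitation"],
    "FTD_pred_on_MT": ["A7 (instruction)", "A7 (instruction)", "A12 (promotion)",
                       "A12 (promotion)", "A12 (promotion)", "A12 (promotion)",
                       "A8 (news)"],
})

# Row-normalised crosstab as a nested dict: {GINCO label: {FTD label: share}}
shares = pd.crosstab(
    toy["primary_level_1"], toy["FTD_pred_on_MT"], normalize="index"
).to_dict("index")


def nonzero_shares(row, digits=2):
    """Return the non-zero shares of one row, largest first, rounded."""
    ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
    return {label: round(value, digits) for label, value in ranked if value > 0}


condensed = {gold: nonzero_shares(row) for gold, row in shares.items()}

for gold, row in condensed.items():
    print(gold, row)
```

Rounding is done with `round`, so a share of 2/3 is reported as 0.67 rather than being truncated to 0.66.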

Based on the FTD predictions for each GINCO label, we can see the following (for each label, the result on the Slovene text is given first, followed by what differs in the predictions on the MT text):

1. Some labels are very well connected (similar):
    - Instruction (38): very well identified as 'A7 (instruction)': 0.79; on MT text even better identified: 'A7 (instruction)': 0.84
    - Invitation (32): very well identified as 'A12 (promotion)': 0.875; on MT slightly worse identified as 'A12 (promotion)': 0.81, with about 19% identified as news: 'A8 (news)': 0.1875
    - Legal/Regulation (17): very well identified as 'A9 (legal)': 1.0; the same on MT
    - Promotion of Services (32): very well predicted as 'A12 (promotion)': 0.94; a bit worse predicted on MT: 'A12 (promotion)': 0.91
    - Promotion of a Product (115): very well predicted as 'A12 (promotion)': 0.85; a bit worse predicted on MT: 'A12 (promotion)': 0.83
    - Promotion (30): very well predicted as 'A12 (promotion)': 0.8; the same on MT
    - Prose (6): very well predicted as 'A4 (fiction)': 1.0; on MT a bit worse predicted: 'A4 (fiction)': 0.83, 'A11 (personal)': 0.17
    - Recipe (6): very well predicted as 'A7 (instruction)': 0.83; the same on MT
    - Research Article (9): well predicted as 'A14 (academic)': 0.67, otherwise as 'A16 (information)': 0.22; on MT even better prediction: 'A14 (academic)': 0.78, otherwise 'A16 (information)': 0.11
    - Review (17): predicted as 'A17 (review)': 0.65, otherwise 'A12 (promotion)': 0.29; on MT similar prediction of review, but less promotion: 'A12 (promotion)': 0.18, and some 'A11 (personal)': 0.12 instead

2. Some seem to be less connected:
    - Information/Explanation (130 instances): relatively well identified as 'A16 (information)': 0.6, a third identified as 'A12 (promotion)': 0.32; on MT text similarly identified as 'A16 (information)': 0.57, with 25% identified as 'A12 (promotion)'.
    - News/Reporting (115): half identified as 'A8 (news)': 0.55, otherwise also 'A1 (argumentative)': 0.10, 'A12 (promotion)': 0.13, 'A16 (information)': 0.10 and other categories; on MT text much better identification of news: 'A8 (news)': 0.79
    - Opinion/Argumentation (114): less than half identified as 'A1 (argumentative)': 0.39, otherwise 'A11 (personal)': 0.14, 'A12 (promotion)': 0.12, 'A16 (information)': 0.17, 'A17 (review)': 0.13 and others; on MT text similar: mostly 'A1 (argumentative)': 0.46 and 'A11 (personal)': 0.18
    - Opinionated News (89): half identified as 'A8 (news)': 0.53, otherwise mostly 'A1 (argumentative)': 0.13, 'A12 (promotion)': 0.12 or 'A17 (review)': 0.17; on MT even more identified as news: 'A8 (news)': 0.63, with 20% as 'A1 (argumentative)'


3. Some GINCO labels were predicted as a variety of FTD labels, i.e. they are not well connected:
    - Announcement (17 instances): not closely connected to just one FTD label - instances were predicted as 'A12 (promotion)': 0.47 (of all Announcement instances), 'A16 (information)': 0.18, 'A7 (instruction)': 0.06 and 'A8 (news)': 0.29; on MT text, the main labels are 'A8 (news)': 0.41 and 'A12 (promotion)': 0.41
    - Call (11 instances): mostly connected to 'A12 (promotion)': 0.73; on MT text, it is divided between 'A1 (argumentative)', 'A12 (promotion)' and 'A9 (legal)' (0.27 each)
    - Correspondence (16 instances): not well predicted with FTDs - mostly split between 'A1 (argumentative)': 0.375, 'A7 (instruction)': 0.125 and 'A8 (news)': 0.0625; on MT similar, but there is more 'A7 (instruction)': 0.25 and also 13% 'A12 (promotion)'
    - FAQ (3 instances): mostly predicted as 'A12 (promotion)': 0.67 - note that there are very few FAQ instances in the corpus. On MT text, it is mostly predicted as 'A7 (instruction)': 0.67, with 33% as 'A12 (promotion)'.
    - Forum (52 instances): not well connected with FTD categories - identified as 'A1 (argumentative)': 0.38, 'A11 (personal)': 0.19, 'A17 (review)': 0.15, and also some 'A12 (promotion)', 'A14 (academic)', 'A16 (information)' and 'A7 (instruction)'. On MT text it is mostly 'A11 (personal)': 0.40 and 'A1 (argumentative)': 0.38.
    - Interview (8): not well connected to just one category: 'A1 (argumentative)': 0.125, 'A11 (personal)': 0.125, 'A12 (promotion)': 0.375, 'A16 (information)': 0.125, 'A17 (review)': 0.25. On MT text, predictions are scattered between 'A1 (argumentative)': 0.25, 'A11 (personal)': 0.375, 'A12 (promotion)': 0.25 and 'A17 (review)': 0.125.
    - List of Summaries/Excerpts (106): the vaguest category, not well connected to the FTD categories - its instances were predicted as every one of the FTD categories; the same on MT.
    - Lyrical (4): half identified as 'A11 (personal)': 0.5, half as 'A4 (fiction)': 0.5; on MT the same combination of labels, but more identified as 'A11 (personal)': 0.75
    - Other (34): as expected, the label was predicted as various FTD labels, mostly 'A12 (promotion)': 0.44, 'A16 (information)': 0.15 or 'A17 (review)': 0.21; on MT even more scattered, with less promotion ('A12 (promotion)': 0.29)
    - Script/Drama (1): not well predicted, but there is only one instance. Predicted as 'A16 (information)': 1.0; the same on MT
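
The three groups above were assigned by reading the crosstab. As a rough sketch, a similar grouping could be derived automatically from the share of the most frequent FTD label per GINCO label. The thresholds (0.65 and 0.45), the function name `summarise_mapping` and the toy data are assumptions made for illustration; they are not the criteria used for the lists above.

```python
import pandas as pd

# Toy stand-in for the GINCO dataframe (hypothetical data, same column names)
toy = pd.DataFrame({
    "primary_level_1": ["Instruction"] * 5 + ["Forum"] * 5 + ["News/Reporting"] * 4,
    "FTD_pred_on_MT": (
        ["A7 (instruction)"] * 4 + ["A12 (promotion)"]
        + ["A11 (personal)"] * 2 + ["A1 (argumentative)", "A7 (instruction)", "A12 (promotion)"]
        + ["A8 (news)"] * 2 + ["A1 (argumentative)", "A16 (information)"]
    ),
})


def summarise_mapping(df, gold_col, pred_col, strong=0.65, weak=0.45):
    """Per gold label: instance count, most frequent prediction and its share."""
    shares = pd.crosstab(df[gold_col], df[pred_col], normalize="index")
    summary = pd.DataFrame({
        "n": df[gold_col].value_counts(),
        "top_label": shares.idxmax(axis=1),
        "top_share": shares.max(axis=1),
    })
    # Illustrative thresholds only: (0, weak], (weak, strong], (strong, 1]
    summary["group"] = pd.cut(
        summary["top_share"],
        bins=[0, weak, strong, 1.0],
        labels=["scattered", "less connected", "well connected"],
    )
    return summary.sort_values("top_share", ascending=False)


summary = summarise_mapping(toy, "primary_level_1", "FTD_pred_on_MT")
print(summary)
```

Keeping the instance count `n` next to the share matters for small categories: with only a handful of instances, a single prediction shifts the share by a large step.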

A comparison of the predictions on the Slovene and the MT text shows that, for most labels, there is no big difference between predicting on Slovene or on English text. For some labels the predictions are worse on MT (the Promotion labels, Prose), for others they are better (Instruction, Research Article). For Information/Explanation, the share of 'A16 (information)' stays about the same (0.6 versus 0.57), but fewer instances are predicted as 'A12 (promotion)' on MT. The identification of News/Reporting and Opinionated News as 'A8 (news)' is much better on MT. In the other cases, there is no difference or only a slight one.
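
The comparison above can also be computed directly instead of reading two crosstabs side by side. The sketch below uses a small hypothetical dataframe with the same column names as `ginco` (`FTD_pred_on_SL`, `FTD_pred_on_MT`); it subtracts the two row-normalised crosstabs and also reports how often the two predictions agree per GINCO label. This is one possible approach, not necessarily how the figures above were obtained.

```python
import pandas as pd

# Toy stand-in for the GINCO dataframe (hypothetical data, same column names)
toy = pd.DataFrame({
    "primary_level_1": ["News/Reporting"] * 4 + ["Prose"] * 2,
    "FTD_pred_on_SL": ["A8 (news)", "A8 (news)", "A1 (argumentative)",
                       "A12 (promotion)", "A4 (fiction)", "A4 (fiction)"],
    "FTD_pred_on_MT": ["A8 (news)", "A8 (news)", "A8 (news)",
                       "A12 (promotion)", "A4 (fiction)", "A11 (personal)"],
})

sl = pd.crosstab(toy["primary_level_1"], toy["FTD_pred_on_SL"], normalize="index")
mt = pd.crosstab(toy["primary_level_1"], toy["FTD_pred_on_MT"], normalize="index")

# Positive values: the FTD label is predicted more often on MT than on Slovene.
# fill_value=0 covers FTD labels that occur in only one of the two tables.
diff = mt.sub(sl, fill_value=0)

# Share of instances per GINCO label where both predictions are identical
agreement = (
    (toy["FTD_pred_on_SL"] == toy["FTD_pred_on_MT"])
    .groupby(toy["primary_level_1"])
    .mean()
)

print(diff.round(2))
print(agreement)
```

Note that a difference close to zero does not imply that the same instances received the same label, which is why the per-label agreement is reported separately.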

## Compare FTD labels with CORE

In [14]:
# Import the CORE dataset
core_df = pd.read_csv("data-sheets-with-all-info/CORE-all-information.csv", index_col = 0)

core_df.head(3)

Unnamed: 0,label,text,split,main_labels,sublabels,Len,main_len,sub_len,GINCORE,full_names,main_labels_full_names,FTD_pred
0,NA OP SR OB,The Top TEN 'Whiniest Sets of Fans' in English...,train,NA OP,SR OB,4,2,2,NA OP SR OB,SR OB,NA OP,
1,NA NE,"Ferry consultation needs deeper questions, say...",train,,NE,2,1,1,News,News Report/Blog,Narrative,
2,ID DF,I'v been recording and mixing music for about ...,train,ID,DF,2,1,1,Forum,Discussion Forum,Interactive Discussion,


In [15]:
# Keep only instances that have an FTD prediction
core_df = core_df.dropna(subset=["FTD_pred"])

core_df.shape

(7970, 12)

In [16]:
core_df.columns

Index(['label', 'text', 'split', 'main_labels', 'sublabels', 'Len', 'main_len',
       'sub_len', 'GINCORE', 'full_names', 'main_labels_full_names',
       'FTD_pred'],
      dtype='object')

In [17]:
# Discard uninteresting columns
core_df = core_df[['GINCORE', 'sub_len', 'full_names', 'main_len', 'main_labels_full_names','FTD_pred']]

core_df

Unnamed: 0,GINCORE,sub_len,full_names,main_len,main_labels_full_names,FTD_pred
6,Forum,1,Discussion Forum,1,Interactive Discussion,A11 (personal)
14,News,1,Sports Report,1,Narrative,A1 (argumentative)
16,News,1,News Report/Blog,1,Narrative,A8 (news)
23,News,1,News Report/Blog,1,Narrative,A1 (argumentative)
25,NA IN HA,1,Historical Article,2,NA IN,A1 (argumentative)
...,...,...,...,...,...,...
48419,News,1,News Report/Blog,1,Narrative,A8 (news)
48420,ID HI QA,1,Question/Answer Forum,2,ID HI,A7 (instruction)
48423,News,1,News Report/Blog,1,Narrative,A8 (news)
48429,Information/Explanation,1,Encyclopedia Article,1,Informational Description/Explanation,A16 (information)


In [18]:
# Analyse how CORE main labels are connected with FTD labels based on the prediction

# Keep only instances that have a single main label
main_labels = core_df[core_df["main_len"] == 1]

pd.crosstab(main_labels["main_labels_full_names"], main_labels["FTD_pred"], normalize="index")

main_labels_full_names,A1 (argumentative),A11 (personal),A12 (promotion),A14 (academic),A16 (information),A17 (review),A4 (fiction),A7 (instruction),A8 (news),A9 (legal)
How-To/Instructional,0.043011,0.11828,0.09319,0.0,0.003584,0.021505,0.0,0.713262,0.0,0.007168
Informational Description/Explanation,0.189058,0.020776,0.228532,0.052632,0.213989,0.089335,0.004848,0.106648,0.059557,0.034626
Informational Persuasion,0.185345,0.021552,0.400862,0.0,0.030172,0.258621,0.012931,0.077586,0.00431,0.008621
Interactive Discussion,0.315702,0.289256,0.047934,0.009917,0.028099,0.041322,0.008264,0.239669,0.014876,0.004959
Lyrical,0.04065,0.577236,0.04878,0.0,0.03252,0.121951,0.089431,0.081301,0.00813,0.0
Narrative,0.228284,0.133191,0.032002,0.002743,0.029259,0.067662,0.017373,0.0064,0.480951,0.002133
Opinion,0.467799,0.101203,0.061571,0.000708,0.012739,0.230715,0.012739,0.08351,0.029016,0.0
Spoken,0.304348,0.278261,0.06087,0.0,0.0,0.252174,0.017391,0.008696,0.078261,0.0
