## Compare GINCO and MT-GINCO labels with FTD

Using the predictions of the GINCO and MT-GINCO classifiers on the FTD dataset, I analyse how the FTD labels map onto the GINCO labels.

In [1]:
import pandas as pd
import numpy as np

In [11]:
# Import the FTD dataset
ftd = pd.read_csv("data-sheets-with-all-info/FTD-dataset-with-all-information.csv", sep="\t", index_col=0)
ftd.head(3)

Unnamed: 0,ID,labels,Multiple labels,text,length,GINCO_downcast_pred,MT-GINCO_downcast_pred
0,__id__1-syndicate,A1 (argumentative),,BMW's and Chinese Justice * * * * * In most pl...,975,News/Reporting,News/Reporting
1,__id__2-syndicate,A1 (argumentative),,China and a New Balance of Power SHANGHAI – Th...,956,Information/Explanation,Opinion/Argumentation
2,__id__3-syndicate,A1 (argumentative),,China and Russia in the New World Disorder Can...,978,Opinion/Argumentation,Opinion/Argumentation


In [12]:
# Analyse agreement of the labels
ftd["agreement"] = np.where(ftd['GINCO_downcast_pred'] == ftd['MT-GINCO_downcast_pred'], "yes", "no")

ftd["agreement"].value_counts()

yes    1206
no      347
Name: agreement, dtype: int64

In [13]:
# Share of disagreements among all instances (agreements + disagreements)
round(347 / (1206 + 347), 4)

0.2234

The predictions of the GINCO and MT-GINCO classifiers differ in 347 of the 1553 instances (22%).
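The share of disagreeing instances can also be read directly from `value_counts(normalize=True)`, which always divides by the total number of rows. The snippet below is a minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not part of the FTD dataset); only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the two prediction columns (values are made up for illustration)
toy = pd.DataFrame({
    "GINCO_downcast_pred": ["News/Reporting", "Forum", "Promotion", "Other"],
    "MT-GINCO_downcast_pred": ["News/Reporting", "Forum", "Instruction", "Other"],
})

# Boolean mask: True where both classifiers predict the same label
agree = toy["GINCO_downcast_pred"] == toy["MT-GINCO_downcast_pred"]

# normalize=True returns shares of all rows, so the denominator is the total
shares = agree.map({True: "yes", False: "no"}).value_counts(normalize=True)
print(shares)
```

Applied to `ftd`, the same two lines would give the agreement and disagreement shares without a separate division cell.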

In [15]:
ftd.columns

Index(['ID', 'labels', 'Multiple labels', 'text', 'length',
       'GINCO_downcast_pred', 'MT-GINCO_downcast_pred', 'agreement'],
      dtype='object')

In [16]:
ftd["Multiple labels"].value_counts()

y    139
Name: Multiple labels, dtype: int64

In [18]:
# Discard texts with multiple labels
ftd = ftd[ftd["Multiple labels"] != "y"]
ftd.describe(include="all")

Unnamed: 0,ID,labels,Multiple labels,text,length,GINCO_downcast_pred,MT-GINCO_downcast_pred,agreement
count,1414,1414,0.0,1414,1414.0,1414,1414,1414
unique,1414,10,0.0,1414,,9,9,2
top,__id__1-syndicate,A1 (argumentative),,BMW's and Chinese Justice * * * * * In most pl...,,Information/Explanation,Information/Explanation,yes
freq,1,296,,1,,407,311,1118
mean,,,,,1445.212164,,,
std,,,,,4989.570842,,,
min,,,,,31.0,,,
25%,,,,,224.0,,,
50%,,,,,495.0,,,
75%,,,,,1144.25,,,


In [19]:
# Analyse how GINCO primary labels are connected with FTD labels based on the prediction
pd.crosstab(ftd['labels'], ftd["GINCO_downcast_pred"], normalize="index")


GINCO_downcast_pred,Forum,Information/Explanation,Instruction,Legal/Regulation,List of Summaries/Excerpts,News/Reporting,Opinion/Argumentation,Other,Promotion
labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
A1 (argumentative),0.013514,0.263514,0.0,0.010135,0.091216,0.246622,0.22973,0.094595,0.050676
A11 (personal),0.088608,0.037975,0.012658,0.0,0.037975,0.025316,0.696203,0.101266,0.0
A12 (promotion),0.011719,0.15625,0.023438,0.0,0.070312,0.09375,0.015625,0.035156,0.59375
A14 (academic),0.0,0.810127,0.0,0.012658,0.012658,0.037975,0.088608,0.037975,0.0
A16 (information),0.0,0.815476,0.0,0.005952,0.053571,0.065476,0.02381,0.005952,0.029762
A17 (review),0.088235,0.191176,0.0,0.0,0.147059,0.132353,0.132353,0.014706,0.294118
A4 (fiction),0.0,0.202128,0.0,0.0,0.0,0.021277,0.234043,0.542553,0.0
A7 (instruction),0.0,0.139394,0.612121,0.0,0.090909,0.030303,0.030303,0.036364,0.060606
A8 (news),0.0,0.095588,0.0,0.014706,0.110294,0.742647,0.007353,0.029412,0.0
A9 (legal),0.0,0.232877,0.013699,0.643836,0.0,0.013699,0.0,0.09589,0.0


In [20]:
pd.crosstab(ftd['labels'], ftd["GINCO_downcast_pred"], normalize="index").to_dict("index")

{'A1 (argumentative)': {'Forum': 0.013513513513513514,
  'Information/Explanation': 0.2635135135135135,
  'Instruction': 0.0,
  'Legal/Regulation': 0.010135135135135136,
  'List of Summaries/Excerpts': 0.09121621621621621,
  'News/Reporting': 0.24662162162162163,
  'Opinion/Argumentation': 0.22972972972972974,
  'Other': 0.0945945945945946,
  'Promotion': 0.05067567567567568},
 'A11 (personal)': {'Forum': 0.08860759493670886,
  'Information/Explanation': 0.0379746835443038,
  'Instruction': 0.012658227848101266,
  'Legal/Regulation': 0.0,
  'List of Summaries/Excerpts': 0.0379746835443038,
  'News/Reporting': 0.02531645569620253,
  'Opinion/Argumentation': 0.6962025316455697,
  'Other': 0.10126582278481013,
  'Promotion': 0.0},
 'A12 (promotion)': {'Forum': 0.01171875,
  'Information/Explanation': 0.15625,
  'Instruction': 0.0234375,
  'Legal/Regulation': 0.0,
  'List of Summaries/Excerpts': 0.0703125,
  'News/Reporting': 0.09375,
  'Opinion/Argumentation': 0.015625,
  'Other': 0.03515

What the GINCO predictions on the FTD labels show (for each point, the result for the GINCO-downcast model is given first, followed by what differs in the predictions of the MT-GINCO-downcast model):


## Compare GINCO labels with CORE

In [3]:
# Import the CORE dataset
core_df = pd.read_csv("data-sheets-with-all-info/CORE-all-information.csv", index_col = 0, sep="\t")

core_df.head(3)

Unnamed: 0,label,text,split,main_labels,sublabels,Len,main_len,sub_len,GINCORE,full_names,main_labels_full_names,FTD_pred,GINCO_downcast_pred,MT-GINCO_downcast_pred
0,NA OP SR OB,The Top TEN 'Whiniest Sets of Fans' in English...,train,NA OP,SR OB,4,2,2,NA OP SR OB,SR OB,NA OP,,,
1,NA NE,"Ferry consultation needs deeper questions, say...",train,,NE,2,1,1,News,News Report/Blog,Narrative,,,
2,ID DF,I'v been recording and mixing music for about ...,train,ID,DF,2,1,1,Forum,Discussion Forum,Interactive Discussion,,,


In [4]:
# Keep only instances that have a GINCO prediction
core_df = core_df.dropna(subset = ["GINCO_downcast_pred"])

core_df.shape

(1500, 14)

In [5]:
core_df.columns

Index(['label', 'text', 'split', 'main_labels', 'sublabels', 'Len', 'main_len',
       'sub_len', 'GINCORE', 'full_names', 'main_labels_full_names',
       'FTD_pred', 'GINCO_downcast_pred', 'MT-GINCO_downcast_pred'],
      dtype='object')

In [6]:
# Discard uninteresting columns
core_df = core_df[['GINCORE', 'sub_len', 'full_names', 'main_len', 'main_labels_full_names','FTD_pred', 'GINCO_downcast_pred', 'MT-GINCO_downcast_pred']]

core_df

Unnamed: 0,GINCORE,sub_len,full_names,main_len,main_labels_full_names,FTD_pred,GINCO_downcast_pred,MT-GINCO_downcast_pred
14,News,1,Sports Report,1,Narrative,A1 (argumentative),Opinion/Argumentation,Opinion/Argumentation
16,News,1,News Report/Blog,1,Narrative,A8 (news),News/Reporting,News/Reporting
23,News,1,News Report/Blog,1,Narrative,A1 (argumentative),Opinion/Argumentation,Opinion/Argumentation
54,News,1,News Report/Blog,1,Narrative,A17 (review),News/Reporting,News/Reporting
57,News,1,News Report/Blog,1,Narrative,A17 (review),List of Summaries/Excerpts,List of Summaries/Excerpts
...,...,...,...,...,...,...,...,...
48378,Opinion/Argumentation,1,Opinion Blog,1,Opinion,A1 (argumentative),Information/Explanation,Information/Explanation
48403,News,1,News Report/Blog,1,Narrative,A1 (argumentative),News/Reporting,News/Reporting
48405,Opinion/Argumentation,1,Opinion Blog,1,Opinion,A1 (argumentative),News/Reporting,News/Reporting
48419,News,1,News Report/Blog,1,Narrative,A8 (news),News/Reporting,News/Reporting


In [7]:
# Analyse in how many instances the GINCO and MT-GINCO predictions differ

core_df["agreement"] = np.where(core_df['GINCO_downcast_pred'] == core_df['MT-GINCO_downcast_pred'], "yes", "no")

core_df

Unnamed: 0,GINCORE,sub_len,full_names,main_len,main_labels_full_names,FTD_pred,GINCO_downcast_pred,MT-GINCO_downcast_pred,agreement
14,News,1,Sports Report,1,Narrative,A1 (argumentative),Opinion/Argumentation,Opinion/Argumentation,yes
16,News,1,News Report/Blog,1,Narrative,A8 (news),News/Reporting,News/Reporting,yes
23,News,1,News Report/Blog,1,Narrative,A1 (argumentative),Opinion/Argumentation,Opinion/Argumentation,yes
54,News,1,News Report/Blog,1,Narrative,A17 (review),News/Reporting,News/Reporting,yes
57,News,1,News Report/Blog,1,Narrative,A17 (review),List of Summaries/Excerpts,List of Summaries/Excerpts,yes
...,...,...,...,...,...,...,...,...,...
48378,Opinion/Argumentation,1,Opinion Blog,1,Opinion,A1 (argumentative),Information/Explanation,Information/Explanation,yes
48403,News,1,News Report/Blog,1,Narrative,A1 (argumentative),News/Reporting,News/Reporting,yes
48405,Opinion/Argumentation,1,Opinion Blog,1,Opinion,A1 (argumentative),News/Reporting,News/Reporting,yes
48419,News,1,News Report/Blog,1,Narrative,A8 (news),News/Reporting,News/Reporting,yes


In [8]:
core_df.agreement.value_counts()

yes    1235
no      265
Name: agreement, dtype: int64

In [9]:
265/1500

0.17666666666666667

The GINCO and MT-GINCO predictions differ in only 265 of the 1500 instances (18%).
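Beyond the overall count, it can help to see which label pairs account for the disagreements. The following is a minimal sketch on a made-up toy frame (the `toy` rows and the `pair_counts` name are assumptions for illustration); it counts each (GINCO, MT-GINCO) pair among the rows where the two predictions differ.

```python
import pandas as pd

# Toy stand-in for the prediction columns (values are made up for illustration)
toy = pd.DataFrame({
    "GINCO_downcast_pred": ["News/Reporting", "Information/Explanation",
                            "Information/Explanation", "Forum", "Promotion"],
    "MT-GINCO_downcast_pred": ["News/Reporting", "Opinion/Argumentation",
                               "Opinion/Argumentation", "Forum",
                               "Information/Explanation"],
})

# Keep only the rows where the two classifiers disagree
disagree = toy[toy["GINCO_downcast_pred"] != toy["MT-GINCO_downcast_pred"]]

# Count every (GINCO, MT-GINCO) label pair, most frequent first
pair_counts = (
    disagree.groupby(["GINCO_downcast_pred", "MT-GINCO_downcast_pred"])
    .size()
    .sort_values(ascending=False)
)
print(pair_counts)
```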

### CORE-main

In [6]:
# Analyse how CORE main labels are connected with FTD labels based on the prediction

# Keep only instances that have a single main label
main_labels = core_df[core_df["main_len"] == 1]

pd.crosstab(main_labels["main_labels_full_names"], main_labels["FTD_pred"], normalize="index")

FTD_pred,A1 (argumentative),A11 (personal),A12 (promotion),A14 (academic),A16 (information),A17 (review),A4 (fiction),A7 (instruction),A8 (news),A9 (legal)
main_labels_full_names,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
How-To/Instructional,0.043011,0.11828,0.09319,0.0,0.003584,0.021505,0.0,0.713262,0.0,0.007168
Informational Description/Explanation,0.189058,0.020776,0.228532,0.052632,0.213989,0.089335,0.004848,0.106648,0.059557,0.034626
Informational Persuasion,0.185345,0.021552,0.400862,0.0,0.030172,0.258621,0.012931,0.077586,0.00431,0.008621
Interactive Discussion,0.315702,0.289256,0.047934,0.009917,0.028099,0.041322,0.008264,0.239669,0.014876,0.004959
Lyrical,0.04065,0.577236,0.04878,0.0,0.03252,0.121951,0.089431,0.081301,0.00813,0.0
Narrative,0.228284,0.133191,0.032002,0.002743,0.029259,0.067662,0.017373,0.0064,0.480951,0.002133
Opinion,0.467799,0.101203,0.061571,0.000708,0.012739,0.230715,0.012739,0.08351,0.029016,0.0
Spoken,0.304348,0.278261,0.06087,0.0,0.0,0.252174,0.017391,0.008696,0.078261,0.0


In [10]:
main_labels["main_labels_full_names"].value_counts()

Narrative                                3281
Informational Description/Explanation    1444
Opinion                                  1413
Interactive Discussion                    605
How-To/Instructional                      279
Informational Persuasion                  232
Lyrical                                   123
Spoken                                    115
Name: main_labels_full_names, dtype: int64

In [11]:
main_labels["main_labels_full_names"].unique()

array(['Interactive Discussion', 'Narrative', 'Opinion',
       'Informational Description/Explanation', 'How-To/Instructional',
       'Lyrical', 'Informational Persuasion', 'Spoken'], dtype=object)

In [12]:
pd.crosstab(main_labels["main_labels_full_names"], main_labels["FTD_pred"], normalize="index").to_dict("index")

{'How-To/Instructional': {'A1 (argumentative)': 0.043010752688172046,
  'A11 (personal)': 0.11827956989247312,
  'A12 (promotion)': 0.0931899641577061,
  'A14 (academic)': 0.0,
  'A16 (information)': 0.0035842293906810036,
  'A17 (review)': 0.021505376344086023,
  'A4 (fiction)': 0.0,
  'A7 (instruction)': 0.7132616487455197,
  'A8 (news)': 0.0,
  'A9 (legal)': 0.007168458781362007},
 'Informational Description/Explanation': {'A1 (argumentative)': 0.18905817174515235,
  'A11 (personal)': 0.02077562326869806,
  'A12 (promotion)': 0.22853185595567868,
  'A14 (academic)': 0.05263157894736842,
  'A16 (information)': 0.21398891966759004,
  'A17 (review)': 0.08933518005540166,
  'A4 (fiction)': 0.004847645429362881,
  'A7 (instruction)': 0.10664819944598337,
  'A8 (news)': 0.05955678670360111,
  'A9 (legal)': 0.03462603878116344},
 'Informational Persuasion': {'A1 (argumentative)': 0.1853448275862069,
  'A11 (personal)': 0.021551724137931036,
  'A12 (promotion)': 0.40086206896551724,
  'A14 

Comparison of main CORE labels and FTD labels:

1. Well-connected:
* 'How-To/Instructional': 'A7 (instruction)': 0.713

2. Not well connected (no clear majority label, or the majority label does not seem appropriate):
* 'Interactive Discussion': 'A1 (argumentative)': 0.316 (share of Interactive Discussion instances predicted as A1), 'A11 (personal)': 0.289, 'A7 (instruction)': 0.240
* 'Narrative': 'A8 (news)': 0.48, 'A1 (argumentative)': 0.228
* 'Opinion': 'A1 (argumentative)': 0.468, 'A17 (review)': 0.231
* 'Informational Description/Explanation': 'A12 (promotion)': 0.229, 'A16 (information)': 0.21, 'A1 (argumentative)': 0.189
* 'Lyrical': 'A11 (personal)': 0.577
* 'Informational Persuasion': 'A12 (promotion)': 0.40, 'A17 (review)': 0.259, 'A1 (argumentative)': 0.185
* 'Spoken': 'A1 (argumentative)': 0.30, 'A11 (personal)': 0.278, 'A17 (review)': 0.252
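The grouping above can be approximated with a simple cut-off on the share of the majority label. The sketch below uses the majority shares from the crosstab output (rounded to three decimals); the 0.6 threshold is a hypothetical value chosen for illustration, not one stated in the notebook, and it cannot capture the judgement of whether a majority label is appropriate.

```python
import pandas as pd

# Share of the most frequent FTD prediction per CORE main label,
# rounded from the crosstab output in this section
majority_share = pd.Series({
    "How-To/Instructional": 0.713,
    "Interactive Discussion": 0.316,
    "Narrative": 0.481,
    "Opinion": 0.468,
    "Informational Description/Explanation": 0.229,
    "Lyrical": 0.577,
    "Informational Persuasion": 0.401,
    "Spoken": 0.304,
})

THRESHOLD = 0.6  # hypothetical cut-off, not taken from the notebook

# Labels whose majority share reaches the cut-off
well_connected = majority_share[majority_share >= THRESHOLD].index.tolist()

# Remaining labels, strongest majority first
not_well_connected = majority_share[majority_share < THRESHOLD].sort_values(ascending=False)

print(well_connected)
print(not_well_connected)
```

With this cut-off only 'How-To/Instructional' lands in the first group; a lower threshold would also move 'Lyrical' there, which is why the manual check of appropriateness still matters.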




### CORE-sub

In [13]:
core_df.head(2)

Unnamed: 0,GINCORE,sub_len,full_names,main_len,main_labels_full_names,FTD_pred
6,Forum,1,Discussion Forum,1,Interactive Discussion,A11 (personal)
14,News,1,Sports Report,1,Narrative,A1 (argumentative)


In [14]:
# Analyse how CORE sub labels are connected with FTD labels based on the prediction

# Keep only instances that have a single sub label
sub_labels = core_df[core_df["sub_len"] == 1]

pd.crosstab(sub_labels["full_names"], sub_labels["FTD_pred"], normalize="index")

FTD_pred,A1 (argumentative),A11 (personal),A12 (promotion),A14 (academic),A16 (information),A17 (review),A4 (fiction),A7 (instruction),A8 (news),A9 (legal)
full_names,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Advertisement,0.0,0.0,0.666667,0.0,0.0,0.0,0.0,0.333333,0.0,0.0
Advice,0.172043,0.139785,0.145161,0.0,0.005376,0.016129,0.0,0.516129,0.005376,0.0
Course Materials,0.173913,0.0,0.130435,0.043478,0.304348,0.043478,0.0,0.304348,0.0,0.0
Description of a Person,0.098039,0.058824,0.143791,0.0,0.339869,0.228758,0.013072,0.0,0.117647,0.0
Description of a Thing,0.1398,0.021398,0.370899,0.007133,0.152639,0.126961,0.002853,0.079886,0.081312,0.017118
Description with Intent to Sell,0.082192,0.018265,0.465753,0.0,0.031963,0.283105,0.013699,0.086758,0.009132,0.009132
Discussion Forum,0.374359,0.320513,0.041026,0.002564,0.007692,0.04359,0.005128,0.179487,0.020513,0.005128
Editorial,0.923077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0
Encyclopedia Article,0.095238,0.009524,0.0,0.0,0.819048,0.038095,0.009524,0.0,0.019048,0.009524
FAQ about How-to,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.666667,0.0,0.0


In [15]:
sub_labels["full_names"].value_counts()

News Report/Blog                   2101
Opinion Blog                        827
Description of a Thing              701
Sports Report                       564
Personal Blog                       554
Discussion Forum                    390
Reviews                             360
Information Blog                    306
How-to                              263
Description with Intent to Sell     219
Question/Answer Forum               210
Advice                              186
Research Article                    165
Description of a Person             153
Religious Blogs/Sermons             140
Song Lyrics                         108
Encyclopedia Article                105
Interview                            94
Historical Article                   85
Travel Blog                          57
Short Story                          56
FAQ about Information                50
Legal terms                          38
Recipe                               34
Other Information                    27


In [16]:
len(sub_labels.full_names.unique())

44

In [18]:
pd.crosstab(sub_labels["full_names"], sub_labels["FTD_pred"], normalize="index").to_dict("index")

{'Advertisement': {'A1 (argumentative)': 0.0,
  'A11 (personal)': 0.0,
  'A12 (promotion)': 0.6666666666666666,
  'A14 (academic)': 0.0,
  'A16 (information)': 0.0,
  'A17 (review)': 0.0,
  'A4 (fiction)': 0.0,
  'A7 (instruction)': 0.3333333333333333,
  'A8 (news)': 0.0,
  'A9 (legal)': 0.0},
 'Advice': {'A1 (argumentative)': 0.17204301075268819,
  'A11 (personal)': 0.13978494623655913,
  'A12 (promotion)': 0.14516129032258066,
  'A14 (academic)': 0.0,
  'A16 (information)': 0.005376344086021506,
  'A17 (review)': 0.016129032258064516,
  'A4 (fiction)': 0.0,
  'A7 (instruction)': 0.5161290322580645,
  'A8 (news)': 0.005376344086021506,
  'A9 (legal)': 0.0},
 'Course Materials': {'A1 (argumentative)': 0.17391304347826086,
  'A11 (personal)': 0.0,
  'A12 (promotion)': 0.13043478260869565,
  'A14 (academic)': 0.043478260869565216,
  'A16 (information)': 0.30434782608695654,
  'A17 (review)': 0.043478260869565216,
  'A4 (fiction)': 0.0,
  'A7 (instruction)': 0.30434782608695654,
  'A8 (ne

1. Categories that match well:
* 'Advertisement': 'A12 (promotion)': 0.67
* 'Editorial': 'A1 (argumentative)': 0.92
* 'Encyclopedia Article': 'A16 (information)': 0.82
* 'FAQ about How-to': 'A7 (instruction)': 0.67
* 'How-to':  'A7 (instruction)': 0.74
* 'Formal Speech': 'A1 (argumentative)': 0.78
* 'Legal terms': 'A9 (legal)': 0.789
* 'Letter to Editor': 'A1 (argumentative)': 0.875
* 'Opinion Blog': 'A1 (argumentative)': 0.66
* 'Personal Blog': 'A11 (personal)': 0.677
* 'Persuasive Article or Essay': 'A1 (argumentative)': 0.75
* 'Religious Blogs/Sermons': 'A1 (argumentative)': 0.67
* 'Reviews': 'A17 (review)': 0.72
* 'Short Story': 'A4 (fiction)': 0.80
* 'Sports Report': 'A8 (news)': 0.759
* 'Technical Support': 'A7 (instruction)': 0.875
* 'Other Informational Persuasion': 'A1 (argumentative)': 1.0
* 'Other Opinion': 'A1 (argumentative)': 0.67
* 'Other Spoken': 'A11 (personal)': 0.67


2. Categories that match, but less well:
* 'Advice': 'A7 (instruction)': 0.52
* 'Description with Intent to Sell': 'A12 (promotion)': 0.466, 'A17 (review)': 0.283
* 'News Report/Blog': 'A8 (news)': 0.55, 'A1 (argumentative)': 0.28
* 'Recipe': 'A7 (instruction)': 0.47, 'A11 (personal)': 0.35

3. CORE subcategories with no (appropriate) majority FTD label:
* 'Course Materials': 'A7 (instruction)': 0.30 or 'A16 (information)': 0.30
* 'Description of a Person': 'A16 (information)': 0.34, 'A17 (review)': 0.229
* 'Description of a Thing': 'A12 (promotion)': 0.37
* 'Discussion Forum': 'A1 (argumentative)': 0.37, 'A11 (personal)': 0.32
* 'FAQ about Information': 'A7 (instruction)': 0.44, 'A12 (promotion)': 0.28
* 'Historical Article': 'A16 (information)': 0.423, 'A1 (argumentative)': 0.36
* 'Information Blog': 'A1 (argumentative)': 0.29, 'A7 (instruction)': 0.21
* 'Interview': 'A17 (review)': 0.36, 'A11 (personal)': 0.265, 'A1 (argumentative)': 0.18
* 'Magazine Article': 'A1 (argumentative)': 0.47
* 'Poem': 'A4 (fiction)': 0.33, 'A17 (review)': 0.27, 'A11 (personal)': 0.2
* 'Prayer': 'A16 (information)': 0.67
* 'Question/Answer Forum': 'A7 (instruction)': 0.357, 'A11 (personal)': 0.23, 'A1 (argumentative)': 0.209
* 'Reader/Viewer Responses': 'A17 (review)': 0.3, 'A12 (promotion)': 0.3, 'A7 (instruction)': 0.2
* 'Research Article': 'A1 (argumentative)': 0.39, 'A14 (academic)': 0.376
* 'Song Lyrics': 'A11 (personal)': 0.63 (matches well, but it is debatable whether this is an appropriate category)
* 'TV/Movie Script': 'A11 (personal)': 0.6
* 'Technical Report': 'A1 (argumentative)': 0.375, 'A7 (instruction)': 0.25
* 'Transcript of Video/Audio': 'A1 (argumentative)': 0.83
* 'Travel Blog': 'A11 (personal)': 0.438, 'A17 (review)': 0.25, 'A12 (promotion)': 0.19
* 'Other Information': 'A7 (instruction)': 0.30, 'A16 (information)': 0.185
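Lists like the ones above can be drafted from the row-normalised crosstab by keeping the top few FTD labels per CORE category. The sketch below is one possible way to do this; it uses the shares of two sub-labels from the crosstab output (four of the ten columns only), and the helper `top_labels`, the limit of three labels and the 0.15 minimum share are assumptions introduced for illustration.

```python
import pandas as pd

# Shares copied from the crosstab output for two CORE sub-labels
# (only four of the ten FTD columns are included here)
table = pd.DataFrame.from_dict({
    "Advice": {"A1 (argumentative)": 0.172043, "A11 (personal)": 0.139785,
               "A12 (promotion)": 0.145161, "A7 (instruction)": 0.516129},
    "Discussion Forum": {"A1 (argumentative)": 0.374359, "A11 (personal)": 0.320513,
                         "A12 (promotion)": 0.041026, "A7 (instruction)": 0.179487},
}, orient="index")

MIN_SHARE = 0.15  # hypothetical minimum share for a label to be listed


def top_labels(row, k=3, min_share=MIN_SHARE):
    """Return up to k labels with the largest shares, dropping small ones."""
    top = row.nlargest(k)
    return {label: round(float(share), 3)
            for label, share in top.items() if share >= min_share}


# One dict of {FTD label: share} per CORE sub-label, largest share first
summary = {category: top_labels(row) for category, row in table.iterrows()}
for category, labels in summary.items():
    print(category, labels)
```

Rounding with `round` rather than truncating keeps the listed values consistent with the crosstab output.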