
## Cosine similarity of body text (evaluation) <br>

In what follows we use the translated and preprocessed body text of the evaluation articles to give a first similarity score.<br><br>

The texts are embedded as vectors by using *tf-idf* and the similarity score is given by *cosine similarity*. For both tasks we rely on the eponymous tools from [scikit-learn](https://scikit-learn.org/stable/).
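To make the two steps concrete, here is a minimal pure-Python sketch of tf-idf weighting followed by cosine similarity. It is only an illustration under stated assumptions: the three toy documents are invented, tokenization is a plain whitespace split (scikit-learn's tokenizer behaves differently), and the idf formula is the smoothed variant that scikit-learn documents as its default. The helper names `tfidf_vectors` and `cosine` are hypothetical and not part of any library.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised tf-idf dict per whitespace-tokenised document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # an empty document has no terms and stays an empty (all-zero) vector
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Cosine similarity of two already L2-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# toy documents, invented for illustration only
toy_docs = [
    "police need help catch two thieves",
    "police ask public help catch thieves",
    "no swim advisory lift beach pier",
]
vecs = tfidf_vectors(toy_docs)
print(cosine(vecs[0], vecs[1]))  # documents sharing several terms
print(cosine(vecs[0], vecs[2]))  # documents sharing no terms
```

Because the vectors are normalised to unit length, the cosine similarity reduces to a dot product over the shared terms, so two texts without any common term always score 0.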

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np 
import math


In [50]:
# read in our translated preprocessed text in pairs in a dataframe
processed_df = pd.read_csv('eval/_EVAL_preprocessed_text.csv')

In [51]:
processed_df

Unnamed: 0.1,Unnamed: 0,pair_id,url1_lang,url2_lang,title1,title2,keywords1,keywords2,text1,text2,...,Entities,Time,Narrative,Overall,Style,Tone,translated_body1,translated_body2,preprocessed_1,preprocessed_2
0,0,1484189203_1484121193,en,en,Police: 2 men stole tools from Lowe’s in Davie,No-swim advisory lifted for Deerfield Beach Pier,"More Trending Stories,Trending","More Trending Stories,Trending","DAVIE, FLA. (WSVN) - Police need help catching...","DEERFIELD BEACH, FLA. (WSVN) - A no-swim advis...",...,4.0,2.0,4.0,3.5,1.0,1.5,"DAVIE, FLA. (WSVN) - Police need help catching...","DEERFIELD BEACH, FLA. (WSVN) - A no-swim advis...",davie fla wsvn police need help catch two croo...,deerfield beach fla wsvn noswim advisory lift ...
1,1,1484011097_1484011106,en,en,"Open database leaked 179GB in customer, US gov...",Best Western’s Massive Data Leak: 179GB Amazon...,"AI,Hardware,Executive Guides,Cloud,Microsoft,S...","Best Western,Information Security,Privacy,aws,...",Govt officials confirm Trump can block US comp...,The latest huge unsecured cloud storage find i...,...,2.0,1.0,1.0,1.0,3.5,2.5,Govt officials confirm Trump can block US comp...,The latest huge unsecured cloud storage find i...,govt official confirm trump block we company o...,late huge unsecured cloud storage find autocle...
2,2,1484039488_1484261803,en,en,Ducks are own worst enemies in sloppy loss in ...,Woody Guthrie's 1943 New Year's Resolutions ar...,anaheim-ducks,"keep the hope machine running,new year's resol...","Ducks defenseman Erik Gudbranson, left, knocks...",Woody Guthrie's 1943 New Year's Resolutions ar...,...,4.0,3.0,4.0,4.0,4.0,36.666.666.666.666.600,"Ducks defenseman Erik Gudbranson, left, knocks...",Woody Guthrie's 1943 New Year's Resolutions ar...,ducks defenseman erik gudbranson leave knock p...,woody guthries 1943 new year resolution powerf...
3,3,1484332324_1484796748,en,en,Another Bengal vs Centre tussle? Govt rejects ...,Congress Rejected 7 Times': BJP's Reminder as ...,"bengal latest news,mamata banrjee,republicday,...","tableaux,WestBengal,Mamata Banerjee,West Benga...",The West Bengal government’s proposal was reje...,Mumbai: The NCP and Shiv Sena on Thursday targ...,...,1.5,1.0,1.5,1.5,1.5,2.0,The West Bengal government’s proposal was reje...,Mumbai: The NCP and Shiv Sena on Thursday targ...,west bengal government ' proposal reject exper...,mumbai ncp shiv sena thursday target centre re...
4,4,1484012256_1484419682,en,en,Bars and clubs you loved and lost this decade ...,Top 20 films of the 2010s,"Broad Street Birmingham,ThingsToDoInBirmingham...","Marvel Cinematic Universe,People's Struggles,a...",The video will start in 8 Cancel\n\nSign up to...,"Jacksonville, FL - I'm not sure how we'll look...",...,4.0,1.0,2.5,4.0,2.5,2.5,The video will start in 8 Cancel Sign up to F...,"Jacksonville, FL - I'm not sure how we'll look...",video start 8 cancel sign free daily email ale...,jacksonville fl I m sure well look back film 2...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4897,4897,1553907621_1553488848,es,it,Denver Nuggets reporta que “un miembro de la o...,"Coronavirus, un caso anche fra i Denver Nuggets","Denver,NBA","DenverNuggets,coronavirus,Denver Nuggets,Coron...",Los Denver Nuggets de la NBA reportaron este j...,Nato ad Alatri (Fr) nel ’93 e qui diplomato al...,...,1.0,1.0,1.0,1.0,2.0,1.0,The Denver Nugets of the NBA reported this Thu...,Born in Alatri (FR) in '93 and here you gradua...,denver nugets nba report thursday member organ...,bear alatri fr 93 graduate classic high school...
4898,4898,1646957948_1643667075,es,it,Vivir en España es más barato que en la media ...,"Coronavirus, in Europa 140mila morti in più in...",EUROSTAT,Estero,El estudio realizado por Eurostat muestra que ...,"Nei mesi di marzo e aprile 2020, dalla decima ...",...,2.0,1.0,4.0,4.0,1.0,3.0,The study carried out by Eurostat shows that l...,"In the months of March and April 2020, from th...",study carry eurostat show live spain cheap ave...,month march april 2020 tithing seventeenth wee...
4899,4899,1504063453_1502866628,es,it,Activan sistema de vigilancia epidemiológica e...,Coronavirus in Cina: consumo di serpenti e zup...,"vigilancia epidemiológica,ticker,puertos",serpenti,Foto: Archivo Foto: Archivo\n\nEl sistema de v...,La diffusione del mortale Coronavirus potrebbe...,...,4.0,1.0,4.0,4.0,2.0,2.0,Photo: File Photo: Archive\n\nThe epidemiologi...,The diffusion of the coronavirus mortal could ...,photo file photo archive epidemiological surve...,diffusion coronavirus mortal could depend cons...
4900,4900,1647862428_1647712939,es,it,Emite Irán orden de arresto contra Tump por as...,Iran emette mandato di arresto per Trump per l...,"ESTADOS UNIDOS,ESTADOSUNIDOS,DONALDTRUMP,IRÁN,...","corpodelleguardie,magistratura,iran,qasemsolei...",CDMX.- Anunció Irán este lunes que ha emitido ...,"Il procuratore di Teheran, Ali Alqasi Mehr, ha...",...,2.0,1.0,1.0,2.0,1.0,1.0,CDMX.- Announced Iran this Monday that has iss...,"The attorney of Tehran, Ali Alqasi Mehr, confi...",cdmx announce iran monday issue arrest warrant...,attorney tehran ali alqasi mehr confirm irania...


In [52]:
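# some texts are missing and were read in as NaN; replace them with empty strings
# so that they do not interfere with vectorization
processed_df = processed_df.replace(np.nan, '', regex=True)

# fillna returns a new Series rather than modifying in place, so assign the result back
processed_df['preprocessed_1'] = processed_df['preprocessed_1'].fillna('')
processed_df['preprocessed_2'] = processed_df['preprocessed_2'].fillna('')
processed_df['preprocessed_2']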

0       deerfield beach fla wsvn noswim advisory lift ...
1       late huge unsecured cloud storage find autocle...
2       woody guthries 1943 new year resolution powerf...
3       mumbai ncp shiv sena thursday target centre re...
4       jacksonville fl I m sure well look back film 2...
                              ...                        
4897    bear alatri fr 93 graduate classic high school...
4898    month march april 2020 tithing seventeenth wee...
4899    diffusion coronavirus mortal could depend cons...
4900    attorney tehran ali alqasi mehr confirm irania...
4901    new york june 17th askanew former vicepresiden...
Name: preprocessed_2, Length: 4902, dtype: object

In [53]:
# then join them together to form a complete corpus
corpus = processed_df['preprocessed_1'].tolist() + processed_df['preprocessed_2'].tolist()
len(corpus)


9804

In [54]:
# spot-check the boundary: index 4902 should hold the first entry of preprocessed_2
corpus[4902]

'deerfield beach fla wsvn noswim advisory lift deerfield beach pier florida department health broward county issue advisory friday water pier meet state requirement official lift advisory tuesday water sample test meet state requirement copyright 2019 sunbeam television corp right reserve material may publish broadcast rewrite redistribute'


### Generate tf-idf vectors and calculate cosine similarity <br>

After collecting all the preprocessed article texts, replacing the texts that were not available with empty strings, and concatenating them into a single corpus, we encode each text as a tf-idf vector.

In [55]:
# then generate tfidf vectors and calculate cosine similarity with the matrix 
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# an overview of the cosine similarity matrix
print(cosine_sim.shape)
print(cosine_sim)

(9804, 9804)
[[1.         0.00321175 0.01078851 ... 0.02200145 0.00534774 0.01185675]
 [0.00321175 1.         0.00565664 ... 0.03780077 0.05845428 0.02461147]
 [0.01078851 0.00565664 1.         ... 0.00440796 0.00279291 0.0060509 ]
 ...
 [0.02200145 0.03780077 0.00440796 ... 1.         0.02714095 0.01692379]
 [0.00534774 0.05845428 0.00279291 ... 0.02714095 1.         0.05434344]
 [0.01185675 0.02461147 0.0060509  ... 0.01692379 0.05434344 1.        ]]
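
The full matrix holds the similarity of every text with every other text, although only the score of each text with its own partner is needed later. As a hedged alternative, the sketch below computes just those pairwise scores on a toy corpus. It assumes the default `norm='l2'` setting of `TfidfVectorizer`, under which the dot product of two rows equals their cosine similarity. The names `first_texts`, `second_texts` and `pair_scores` are made up for this example, and the texts are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy stand-ins for the two text columns, invented for illustration
first_texts = ["police catch thieves davie", "data leak cloud storage", ""]
second_texts = ["police seek thieves", "swim advisory beach pier", "empty partner text"]

n_pairs = len(first_texts)
toy_matrix = TfidfVectorizer().fit_transform(first_texts + second_texts)

# rows are L2-normalised, so an element-wise product summed per row
# gives the cosine similarity of row i with row n_pairs + i
first_half = toy_matrix[:n_pairs]
second_half = toy_matrix[n_pairs:]
pair_scores = np.asarray(first_half.multiply(second_half).sum(axis=1)).ravel()

# cross-check against the corresponding entries of the full matrix
full = cosine_similarity(toy_matrix)
expected = np.array([full[i, n_pairs + i] for i in range(n_pairs)])
assert np.allclose(pair_scores, expected)
print(pair_scores)
```

This avoids materialising the dense square matrix, which grows quadratically with the corpus size. Empty texts yield all-zero rows and therefore a score of 0.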


In [58]:
# because of how we formed the corpus, indices 0-4901 hold the first text of each pair and 4902-9803 the second text,
# so the score of pair i is the entry at row i, column len(processed_df) + i; we store it in the dataframe
switch_col = len(processed_df)

for i in range(0,switch_col):
    cos_sim = cosine_sim[i][switch_col+i]
    print(cos_sim)
    processed_df.at[i,'text_cos_sim'] = cos_sim

0.264686270135833
0.5413204925727716
0.022303404800409012
0.467639285859653
0.07755024541491856
...

In [68]:
# print(processed_df.iloc[4897]['translated_body1'])
# print(processed_df.iloc[4897]['translated_body2'])
# print(processed_df.iloc[4897]['preprocessed_1'])
# print(processed_df.iloc[4897]['preprocessed_2'])
processed_df


Unnamed: 0.1,Unnamed: 0,pair_id,url1_lang,url2_lang,title1,title2,keywords1,keywords2,text1,text2,...,Time,Narrative,Overall,Style,Tone,translated_body1,translated_body2,preprocessed_1,preprocessed_2,text_cos_sim
0,0,1484189203_1484121193,en,en,Police: 2 men stole tools from Lowe’s in Davie,No-swim advisory lifted for Deerfield Beach Pier,"More Trending Stories,Trending","More Trending Stories,Trending","DAVIE, FLA. (WSVN) - Police need help catching...","DEERFIELD BEACH, FLA. (WSVN) - A no-swim advis...",...,2.0,4.0,3.5,1.0,1.5,"DAVIE, FLA. (WSVN) - Police need help catching...","DEERFIELD BEACH, FLA. (WSVN) - A no-swim advis...",davie fla wsvn police need help catch two croo...,deerfield beach fla wsvn noswim advisory lift ...,0.264686
1,1,1484011097_1484011106,en,en,"Open database leaked 179GB in customer, US gov...",Best Western’s Massive Data Leak: 179GB Amazon...,"AI,Hardware,Executive Guides,Cloud,Microsoft,S...","Best Western,Information Security,Privacy,aws,...",Govt officials confirm Trump can block US comp...,The latest huge unsecured cloud storage find i...,...,1.0,1.0,1.0,3.5,2.5,Govt officials confirm Trump can block US comp...,The latest huge unsecured cloud storage find i...,govt official confirm trump block we company o...,late huge unsecured cloud storage find autocle...,0.541320
2,2,1484039488_1484261803,en,en,Ducks are own worst enemies in sloppy loss in ...,Woody Guthrie's 1943 New Year's Resolutions ar...,anaheim-ducks,"keep the hope machine running,new year's resol...","Ducks defenseman Erik Gudbranson, left, knocks...",Woody Guthrie's 1943 New Year's Resolutions ar...,...,3.0,4.0,4.0,4.0,36.666.666.666.666.600,"Ducks defenseman Erik Gudbranson, left, knocks...",Woody Guthrie's 1943 New Year's Resolutions ar...,ducks defenseman erik gudbranson leave knock p...,woody guthries 1943 new year resolution powerf...,0.022303
3,3,1484332324_1484796748,en,en,Another Bengal vs Centre tussle? Govt rejects ...,Congress Rejected 7 Times': BJP's Reminder as ...,"bengal latest news,mamata banrjee,republicday,...","tableaux,WestBengal,Mamata Banerjee,West Benga...",The West Bengal government’s proposal was reje...,Mumbai: The NCP and Shiv Sena on Thursday targ...,...,1.0,1.5,1.5,1.5,2.0,The West Bengal government’s proposal was reje...,Mumbai: The NCP and Shiv Sena on Thursday targ...,west bengal government ' proposal reject exper...,mumbai ncp shiv sena thursday target centre re...,0.467639
4,4,1484012256_1484419682,en,en,Bars and clubs you loved and lost this decade ...,Top 20 films of the 2010s,"Broad Street Birmingham,ThingsToDoInBirmingham...","Marvel Cinematic Universe,People's Struggles,a...",The video will start in 8 Cancel\n\nSign up to...,"Jacksonville, FL - I'm not sure how we'll look...",...,1.0,2.5,4.0,2.5,2.5,The video will start in 8 Cancel Sign up to F...,"Jacksonville, FL - I'm not sure how we'll look...",video start 8 cancel sign free daily email ale...,jacksonville fl I m sure well look back film 2...,0.077550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4897,4897,1553907621_1553488848,es,it,Denver Nuggets reporta que “un miembro de la o...,"Coronavirus, un caso anche fra i Denver Nuggets","Denver,NBA","DenverNuggets,coronavirus,Denver Nuggets,Coron...",Los Denver Nuggets de la NBA reportaron este j...,Nato ad Alatri (Fr) nel ’93 e qui diplomato al...,...,1.0,1.0,1.0,2.0,1.0,The Denver Nugets of the NBA reported this Thu...,Born in Alatri (FR) in '93 and here you gradua...,denver nugets nba report thursday member organ...,bear alatri fr 93 graduate classic high school...,0.000000
4898,4898,1646957948_1643667075,es,it,Vivir en España es más barato que en la media ...,"Coronavirus, in Europa 140mila morti in più in...",EUROSTAT,Estero,El estudio realizado por Eurostat muestra que ...,"Nei mesi di marzo e aprile 2020, dalla decima ...",...,1.0,4.0,4.0,1.0,3.0,The study carried out by Eurostat shows that l...,"In the months of March and April 2020, from th...",study carry eurostat show live spain cheap ave...,month march april 2020 tithing seventeenth wee...,0.229793
4899,4899,1504063453_1502866628,es,it,Activan sistema de vigilancia epidemiológica e...,Coronavirus in Cina: consumo di serpenti e zup...,"vigilancia epidemiológica,ticker,puertos",serpenti,Foto: Archivo Foto: Archivo\n\nEl sistema de v...,La diffusione del mortale Coronavirus potrebbe...,...,1.0,4.0,4.0,2.0,2.0,Photo: File Photo: Archive\n\nThe epidemiologi...,The diffusion of the coronavirus mortal could ...,photo file photo archive epidemiological surve...,diffusion coronavirus mortal could depend cons...,0.109186
4900,4900,1647862428_1647712939,es,it,Emite Irán orden de arresto contra Tump por as...,Iran emette mandato di arresto per Trump per l...,"ESTADOS UNIDOS,ESTADOSUNIDOS,DONALDTRUMP,IRÁN,...","corpodelleguardie,magistratura,iran,qasemsolei...",CDMX.- Anunció Irán este lunes que ha emitido ...,"Il procuratore di Teheran, Ali Alqasi Mehr, ha...",...,1.0,1.0,2.0,1.0,1.0,CDMX.- Announced Iran this Monday that has iss...,"The attorney of Tehran, Ali Alqasi Mehr, confi...",cdmx announce iran monday issue arrest warrant...,attorney tehran ali alqasi mehr confirm irania...,0.492072


### Compare text cosine similarity score <br>

We compare the annotated overall score of pairs with the similarity score obtained by 
cosine similarity on the translated and preprocessed body texts extracted from the articles in the evaluation dataset. <br><br>

This comparison is analogous to the one performed on the training dataset. In principle, this step is not relevant for the final evaluation of our solution: it can be thought of as a double check that the choice of features makes sense, and it is motivated by curiosity rather than necessity.
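
One possible way to summarise such a comparison is sketched below, using only the first five label and score pairs that appear in the outputs of this notebook. The choice of a per-label mean and a Pearson correlation is an assumption made for illustration and not necessarily the analysis used elsewhere. The name `sample_df` is hypothetical.

```python
import pandas as pd

# first five (label, score) pairs as shown in this notebook's outputs
sample_df = pd.DataFrame({
    "overall": [4, 1, 4, 2, 4],
    "text_cos_sim": [0.264686, 0.541320, 0.022303, 0.467639, 0.077550],
})

# mean cosine similarity for each annotated label
mean_by_label = sample_df.groupby("overall")["text_cos_sim"].mean()

# Pearson correlation between label and score; a negative value means
# that the score tends to fall as the label grows
pearson = sample_df["overall"].corr(sample_df["text_cos_sim"])

print(mean_by_label)
print(pearson)
```

On these five rows, lower labels go together with higher cosine scores. Five rows are far too few to draw conclusions, so the same calls would need to be run on the full comparison dataframe.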

In [79]:
# build a dataframe pairing the label (annotated overall similarity, rounded) with our text cosine similarity score

eval_df = pd.read_csv('eval/_EVAL_details_in_df.csv')

def normal_round(n):
    # round half up (Python's built-in round() rounds half to even)
    if n - math.floor(n) < 0.5:
        return int(math.floor(n))
    return int(math.ceil(n))

entries = []

for i, row in processed_df.iterrows():
    # just to check if dataframes align
#     if row["pair_id"] != eval_df.iloc[i]['pair_id']:
#         print('---------------------',i,'-----------------------')
#         print(row["pair_id"])
    print('---------------------',i,'-----------------------')
    label = normal_round(float(eval_df.iloc[i]['Overall']))
    print('label:',label)
    score = row['text_cos_sim']
    print('score:',score)

    # fix processed_df overall mishaps
    processed_df.at[i,'Overall'] = label
    entries.append({"overall": label, "text_cos_sim": score})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
compare_df = pd.DataFrame(entries, columns=["overall", "text_cos_sim"])
    

--------------------- 0 -----------------------
label: 4
score: 0.264686270135833
--------------------- 1 -----------------------
label: 1
score: 0.5413204925727716
--------------------- 2 -----------------------
label: 4
score: 0.022303404800409012
--------------------- 3 -----------------------
label: 2
score: 0.467639285859653
--------------------- 4 -----------------------
label: 4
score: 0.07755024541491856
...

--------------------- 1957 -----------------------
label: 4
score: 0.013818918337053971
--------------------- 1958 -----------------------
label: 2
score: 0.68605693730998
--------------------- 1959 -----------------------
label: 1
score: 0.6308819705404376
--------------------- 1960 -----------------------
label: 4
score: 0.23416764883256816
--------------------- 1961 -----------------------
label: 1
score: 0.7094358502293892
--------------------- 1962 -----------------------
label: 4
score: 0.09332620601922226
--------------------- 1963 -----------------------
label: 3
score: 0.18140865335380316
--------------------- 1964 -----------------------
label: 3
score: 0.09252447900454952
--------------------- 1965 -----------------------
label: 1
score: 0.8396479945221001
--------------------- 1966 -----------------------
label: 2
score: 0.4146615398695219
--------------------- 1967 -----------------------
label: 4
score: 0.05050768286517518
--------------------- 1968 ----------------------

--------------------- 2081 -----------------------
label: 3
score: 0.14893439208621365
--------------------- 2082 -----------------------
label: 2
score: 0.3896569330626719
--------------------- 2083 -----------------------
label: 2
score: 0.014469880582581298
--------------------- 2084 -----------------------
label: 3
score: 0.3532087395010482
--------------------- 2085 -----------------------
label: 3
score: 0.0007802122116991899
--------------------- 2086 -----------------------
label: 2
score: 0.310555149586302
--------------------- 2087 -----------------------
label: 3
score: 0.15611110578794213
--------------------- 2088 -----------------------
label: 4
score: 0.01720711809607579
--------------------- 2089 -----------------------
label: 1
score: 0.5064420417185753
--------------------- 2090 -----------------------
label: 1
score: 0.0
--------------------- 2091 -----------------------
label: 3
score: 0.13724793934306348
--------------------- 2092 -----------------------
label: 4
s

--------------------- 2204 -----------------------
label: 2
score: 0.6322864869327626
--------------------- 2205 -----------------------
label: 3
score: 0.3442020030218359
--------------------- 2206 -----------------------
label: 4
score: 0.3082100467827714
--------------------- 2207 -----------------------
label: 4
score: 0.2321059901365842
--------------------- 2208 -----------------------
label: 1
score: 1.0000000000000002
--------------------- 2209 -----------------------
label: 2
score: 0.31872919594091437
--------------------- 2210 -----------------------
label: 2
score: 0.717751755618968
--------------------- 2211 -----------------------
label: 4
score: 0.1787390164553994
--------------------- 2212 -----------------------
label: 3
score: 0.3108170568539532
--------------------- 2213 -----------------------
label: 2
score: 0.1982168233552819
--------------------- 2214 -----------------------
label: 2
score: 0.6440408398094213
--------------------- 2215 -----------------------
lab

--------------------- 2336 -----------------------
label: 2
score: 0.20403181277270085
--------------------- 2337 -----------------------
label: 3
score: 0.3247641068134574
--------------------- 2338 -----------------------
label: 2
score: 0.25044642095679454
--------------------- 2339 -----------------------
label: 4
score: 0.39148958688944235
--------------------- 2340 -----------------------
label: 3
score: 0.6059568295783098
--------------------- 2341 -----------------------
label: 2
score: 0.027140383094333115
--------------------- 2342 -----------------------
label: 1
score: 0.5607001496968685
--------------------- 2343 -----------------------
label: 4
score: 0.09586121625457486
--------------------- 2344 -----------------------
label: 1
score: 0.690029466197082
--------------------- 2345 -----------------------
label: 1
score: 0.0
--------------------- 2346 -----------------------
label: 2
score: 0.5459632255560086
--------------------- 2347 -----------------------
label: 4
scor

--------------------- 2479 -----------------------
label: 3
score: 0.33947138608181876
--------------------- 2480 -----------------------
label: 3
score: 0.6272833358133559
--------------------- 2481 -----------------------
label: 4
score: 0.26905079201766574
--------------------- 2482 -----------------------
label: 3
score: 0.5760185971625549
--------------------- 2483 -----------------------
label: 1
score: 0.6945674195761739
--------------------- 2484 -----------------------
label: 4
score: 0.2182248074788781
--------------------- 2485 -----------------------
label: 3
score: 0.7445361277918737
--------------------- 2486 -----------------------
label: 1
score: 0.4221191933541428
--------------------- 2487 -----------------------
label: 1
score: 0.6007136778659625
--------------------- 2488 -----------------------
label: 1
score: 0.6611187446495536
--------------------- 2489 -----------------------
label: 1
score: 0.42357850383444423
--------------------- 2490 -----------------------


--------------------- 2622 -----------------------
label: 2
score: 0.543449879814203
--------------------- 2623 -----------------------
label: 2
score: 0.2602747217986991
--------------------- 2624 -----------------------
label: 3
score: 0.29466194482361996
--------------------- 2625 -----------------------
label: 2
score: 0.6288508340185943
--------------------- 2626 -----------------------
label: 3
score: 0.048107493934865346
--------------------- 2627 -----------------------
label: 1
score: 0.6118548848615621
--------------------- 2628 -----------------------
label: 3
score: 0.49718772002608774
--------------------- 2629 -----------------------
label: 1
score: 0.6274525015356389
--------------------- 2630 -----------------------
label: 3
score: 0.11700660627620728
--------------------- 2631 -----------------------
label: 1
score: 0.1642002898404958
--------------------- 2632 -----------------------
label: 1
score: 0.794223116139479
--------------------- 2633 -----------------------


--------------------- 2770 -----------------------
label: 3
score: 0.41484833042603825
--------------------- 2771 -----------------------
label: 3
score: 0.2267760130990139
--------------------- 2772 -----------------------
label: 2
score: 0.5268934394375351
--------------------- 2773 -----------------------
label: 4
score: 0.0698980883218309
--------------------- 2774 -----------------------
label: 4
score: 0.041239880711470056
--------------------- 2775 -----------------------
label: 1
score: 0.48516308068701586
--------------------- 2776 -----------------------
label: 1
score: 0.6059464761448706
--------------------- 2777 -----------------------
label: 1
score: 0.7596862833221468
--------------------- 2778 -----------------------
label: 4
score: 0.09719176893204956
--------------------- 2779 -----------------------
label: 1
score: 0.3189403156941826
--------------------- 2780 -----------------------
label: 4
score: 0.13033998151024662
--------------------- 2781 ---------------------

--------------------- 2903 -----------------------
label: 3
score: 0.30418074149292934
--------------------- 2904 -----------------------
label: 4
score: 0.3090036851561051
--------------------- 2905 -----------------------
label: 1
score: 0.3219039979272131
--------------------- 2906 -----------------------
label: 2
score: 0.31328641241939376
--------------------- 2907 -----------------------
label: 2
score: 0.03899543008134668
--------------------- 2908 -----------------------
label: 1
score: 0.10615095624417571
--------------------- 2909 -----------------------
label: 1
score: 0.7242879204528518
--------------------- 2910 -----------------------
label: 3
score: 0.3331486398716526
--------------------- 2911 -----------------------
label: 2
score: 0.30653548772121064
--------------------- 2912 -----------------------
label: 3
score: 0.2983762137563055
--------------------- 2913 -----------------------
label: 1
score: 0.7182992848390276
--------------------- 2914 ----------------------

--------------------- 3003 -----------------------
label: 1
score: 0.5783993643294851
--------------------- 3004 -----------------------
label: 2
score: 0.5687356189868616
--------------------- 3005 -----------------------
label: 3
score: 0.24496975568678017
--------------------- 3006 -----------------------
label: 2
score: 0.4175055185740699
--------------------- 3007 -----------------------
label: 3
score: 0.08312859411239965
--------------------- 3008 -----------------------
label: 3
score: 0.335498184559472
--------------------- 3009 -----------------------
label: 1
score: 0.6386558222295234
--------------------- 3010 -----------------------
label: 3
score: 0.3799546830784489
--------------------- 3011 -----------------------
label: 1
score: 0.5075929041996405
--------------------- 3012 -----------------------
label: 4
score: 0.19115918514774116
--------------------- 3013 -----------------------
label: 1
score: 0.4442354916122025
--------------------- 3014 -----------------------
l

--------------------- 3128 -----------------------
label: 1
score: 0.4278744557880225
--------------------- 3129 -----------------------
label: 2
score: 0.3881286502564985
--------------------- 3130 -----------------------
label: 1
score: 0.6819157067260214
--------------------- 3131 -----------------------
label: 3
score: 0.1822457247164777
--------------------- 3132 -----------------------
label: 3
score: 0.12806993852405477
--------------------- 3133 -----------------------
label: 2
score: 0.08966387878243233
--------------------- 3134 -----------------------
label: 1
score: 0.6220639924191057
--------------------- 3135 -----------------------
label: 4
score: 0.2516472464566111
--------------------- 3136 -----------------------
label: 2
score: 0.018847328339057234
--------------------- 3137 -----------------------
label: 1
score: 0.47291510058701913
--------------------- 3138 -----------------------
label: 1
score: 0.5222687761846612
--------------------- 3139 ----------------------

--------------------- 3245 -----------------------
label: 2
score: 0.7861733813598003
--------------------- 3246 -----------------------
label: 2
score: 0.4422017669489904
--------------------- 3247 -----------------------
label: 3
score: 0.7152566541693999
--------------------- 3248 -----------------------
label: 2
score: 0.4099830578718248
--------------------- 3249 -----------------------
label: 4
score: 0.08504218623495717
--------------------- 3250 -----------------------
label: 4
score: 0.009389830933862744
--------------------- 3251 -----------------------
label: 3
score: 0.2274885537396441
--------------------- 3252 -----------------------
label: 3
score: 0.2728624077232572
--------------------- 3253 -----------------------
label: 3
score: 0.23097019397723056
--------------------- 3254 -----------------------
label: 1
score: 0.8668655154987269
--------------------- 3255 -----------------------
label: 2
score: 0.0
--------------------- 3256 -----------------------
label: 3
score

--------------------- 3403 -----------------------
label: 3
score: 0.070970137535725
--------------------- 3404 -----------------------
label: 1
score: 0.8016633822613304
--------------------- 3405 -----------------------
label: 1
score: 0.8437092238867568
--------------------- 3406 -----------------------
label: 2
score: 0.4715332795838705
--------------------- 3407 -----------------------
label: 3
score: 0.18070940711953162
--------------------- 3408 -----------------------
label: 2
score: 0.7262705614481659
--------------------- 3409 -----------------------
label: 4
score: 0.02402612532535288
--------------------- 3410 -----------------------
label: 3
score: 0.1207107510323324
--------------------- 3411 -----------------------
label: 2
score: 0.16333725188767795
--------------------- 3412 -----------------------
label: 3
score: 0.053366811841524975
--------------------- 3413 -----------------------
label: 4
score: 0.02644100674931003
--------------------- 3414 ----------------------

--------------------- 3539 -----------------------
label: 4
score: 0.0564611271265295
--------------------- 3540 -----------------------
label: 4
score: 0.042289515838309706
--------------------- 3541 -----------------------
label: 1
score: 0.2891212425229759
--------------------- 3542 -----------------------
label: 1
score: 0.5318103627435732
--------------------- 3543 -----------------------
label: 1
score: 0.21309490867931138
--------------------- 3544 -----------------------
label: 1
score: 0.42931769895150534
--------------------- 3545 -----------------------
label: 4
score: 0.006584323914362295
--------------------- 3546 -----------------------
label: 4
score: 0.04281656020733564
--------------------- 3547 -----------------------
label: 4
score: 0.44834674915967704
--------------------- 3548 -----------------------
label: 4
score: 0.2156211732996877
--------------------- 3549 -----------------------
label: 1
score: 0.4914299320407472
--------------------- 3550 -------------------

--------------------- 3692 -----------------------
label: 2
score: 0.13466516635957226
--------------------- 3693 -----------------------
label: 1
score: 0.011188093315951459
--------------------- 3694 -----------------------
label: 1
score: 0.5633021703967865
--------------------- 3695 -----------------------
label: 2
score: 0.20994141082261142
--------------------- 3696 -----------------------
label: 4
score: 0.054628668295388455
--------------------- 3697 -----------------------
label: 1
score: 0.4589576208403177
--------------------- 3698 -----------------------
label: 2
score: 0.2532981243127131
--------------------- 3699 -----------------------
label: 1
score: 0.6382993985926568
--------------------- 3700 -----------------------
label: 1
score: 0.14215400332207592
--------------------- 3701 -----------------------
label: 1
score: 0.3069655957291814
--------------------- 3702 -----------------------
label: 1
score: 0.2246404792962116
--------------------- 3703 --------------------

--------------------- 3852 -----------------------
label: 3
score: 0.29115100466521304
--------------------- 3853 -----------------------
label: 2
score: 0.2720517580788282
--------------------- 3854 -----------------------
label: 2
score: 0.29026039159132583
--------------------- 3855 -----------------------
label: 2
score: 0.48623512414634845
--------------------- 3856 -----------------------
label: 2
score: 0.08437734540308343
--------------------- 3857 -----------------------
label: 1
score: 0.5168791680329257
--------------------- 3858 -----------------------
label: 2
score: 0.5488666429516467
--------------------- 3859 -----------------------
label: 2
score: 0.4805406131130041
--------------------- 3860 -----------------------
label: 1
score: 0.598640190604811
--------------------- 3861 -----------------------
label: 1
score: 0.2869060008397088
--------------------- 3862 -----------------------
label: 1
score: 0.6205479649807347
--------------------- 3863 -----------------------


--------------------- 4008 -----------------------
label: 1
score: 0.35046502431699383
--------------------- 4009 -----------------------
label: 1
score: 0.7406558268099993
--------------------- 4010 -----------------------
label: 2
score: 0.07357418224793358
--------------------- 4011 -----------------------
label: 1
score: 0.5306822880704574
--------------------- 4012 -----------------------
label: 2
score: 0.2777948693127557
--------------------- 4013 -----------------------
label: 1
score: 0.42336739891359726
--------------------- 4014 -----------------------
label: 1
score: 0.596017713059778
--------------------- 4015 -----------------------
label: 1
score: 0.3295658874596383
--------------------- 4016 -----------------------
label: 1
score: 0.20181543991792494
--------------------- 4017 -----------------------
label: 1
score: 0.4654801004986752
--------------------- 4018 -----------------------
label: 2
score: 0.23360544674549524
--------------------- 4019 -----------------------

--------------------- 4154 -----------------------
label: 1
score: 0.055002234463177244
--------------------- 4155 -----------------------
label: 3
score: 0.3179790561993124
--------------------- 4156 -----------------------
label: 1
score: 0.7081060235073107
--------------------- 4157 -----------------------
label: 1
score: 0.8946472896766374
--------------------- 4158 -----------------------
label: 2
score: 0.43030352820820095
--------------------- 4159 -----------------------
label: 1
score: 0.6954909393674183
--------------------- 4160 -----------------------
label: 2
score: 0.07418234267922423
--------------------- 4161 -----------------------
label: 4
score: 0.1252285532016261
--------------------- 4162 -----------------------
label: 2
score: 0.6861294392445985
--------------------- 4163 -----------------------
label: 1
score: 0.9058855440760106
--------------------- 4164 -----------------------
label: 3
score: 0.14874602518732694
--------------------- 4165 ----------------------

--------------------- 4271 -----------------------
label: 1
score: 0.6799728336394648
--------------------- 4272 -----------------------
label: 1
score: 0.6389617971924796
--------------------- 4273 -----------------------
label: 1
score: 0.6723719919121713
--------------------- 4274 -----------------------
label: 2
score: 0.5531953604342105
--------------------- 4275 -----------------------
label: 1
score: 0.5832951198094251
--------------------- 4276 -----------------------
label: 1
score: 0.0
--------------------- 4277 -----------------------
label: 4
score: 0.03350597216624855
--------------------- 4278 -----------------------
label: 4
score: 0.21795068400116668
--------------------- 4279 -----------------------
label: 1
score: 0.5301760271895277
--------------------- 4280 -----------------------
label: 1
score: 0.7818397895482619
--------------------- 4281 -----------------------
label: 1
score: 0.547324905358817
--------------------- 4282 -----------------------
label: 3
score: 0

--------------------- 4382 -----------------------
label: 1
score: 0.6895515057589966
--------------------- 4383 -----------------------
label: 4
score: 0.08391820278775758
--------------------- 4384 -----------------------
label: 1
score: 0.7801425340361432
--------------------- 4385 -----------------------
label: 1
score: 0.3139665657399053
--------------------- 4386 -----------------------
label: 1
score: 0.6112961993167401
--------------------- 4387 -----------------------
label: 4
score: 0.2075622680461694
--------------------- 4388 -----------------------
label: 1
score: 0.6146226230215055
--------------------- 4389 -----------------------
label: 3
score: 0.0
--------------------- 4390 -----------------------
label: 3
score: 0.4011287620549097
--------------------- 4391 -----------------------
label: 3
score: 0.0
--------------------- 4392 -----------------------
label: 4
score: 0.38062899915665427
--------------------- 4393 -----------------------
label: 1
score: 0.6099868578153

--------------------- 4571 -----------------------
label: 2
score: 0.3250906614298888
--------------------- 4572 -----------------------
label: 1
score: 0.4085119093380787
--------------------- 4573 -----------------------
label: 1
score: 0.6096001832733587
--------------------- 4574 -----------------------
label: 3
score: 0.2811116966975069
--------------------- 4575 -----------------------
label: 1
score: 0.7628216013101289
--------------------- 4576 -----------------------
label: 1
score: 0.6628170628748326
--------------------- 4577 -----------------------
label: 1
score: 0.3905115779382814
--------------------- 4578 -----------------------
label: 3
score: 0.015134122031688284
--------------------- 4579 -----------------------
label: 4
score: 0.45635898792640417
--------------------- 4580 -----------------------
label: 1
score: 0.6488240029971526
--------------------- 4581 -----------------------
label: 2
score: 0.07575562605715681
--------------------- 4582 -----------------------

--------------------- 4718 -----------------------
label: 2
score: 0.0
--------------------- 4719 -----------------------
label: 1
score: 0.5283280818983138
--------------------- 4720 -----------------------
label: 1
score: 0.6533902577107213
--------------------- 4721 -----------------------
label: 2
score: 0.534918576778485
--------------------- 4722 -----------------------
label: 1
score: 0.8753961950856063
--------------------- 4723 -----------------------
label: 1
score: 0.7426615607638841
--------------------- 4724 -----------------------
label: 2
score: 0.7353874994252075
--------------------- 4725 -----------------------
label: 1
score: 0.5507137271264483
--------------------- 4726 -----------------------
label: 2
score: 0.5291970396894311
--------------------- 4727 -----------------------
label: 2
score: 0.5062287585363249
--------------------- 4728 -----------------------
label: 1
score: 0.6931874840827906
--------------------- 4729 -----------------------
label: 3
score: 0.2

--------------------- 4868 -----------------------
label: 3
score: 0.5653111089911727
--------------------- 4869 -----------------------
label: 4
score: 0.11897595984985614
--------------------- 4870 -----------------------
label: 2
score: 0.6583321842312377
--------------------- 4871 -----------------------
label: 1
score: 0.6182823496779605
--------------------- 4872 -----------------------
label: 1
score: 0.5196119021488053
--------------------- 4873 -----------------------
label: 2
score: 0.4348825533865445
--------------------- 4874 -----------------------
label: 2
score: 0.44090703825512145
--------------------- 4875 -----------------------
label: 4
score: 0.060991380091938115
--------------------- 4876 -----------------------
label: 3
score: 0.3279114988967924
--------------------- 4877 -----------------------
label: 2
score: 0.39791621928031595
--------------------- 4878 -----------------------
label: 1
score: 0.5666830942818455
--------------------- 4879 ----------------------

In [80]:
compare_df

Unnamed: 0,overall,text_cos_sim
0,4.0,0.264686
1,1.0,0.541320
2,4.0,0.022303
3,2.0,0.467639
4,4.0,0.077550
...,...,...
4897,1.0,0.000000
4898,4.0,0.229793
4899,4.0,0.109186
4900,2.0,0.492072


In [84]:
# Split the pairs by their annotated overall score (4.0 = very dissimilar)
compare_4 = compare_df[compare_df['overall']==4.0]
compare_3 = compare_df[compare_df['overall']==3.0]
compare_2 = compare_df[compare_df['overall']==2.0]
compare_1 = compare_df[compare_df['overall']==1.0]
# Sanity check: the four groups together should cover every row of compare_df
len(compare_df)
len(compare_4) + len(compare_3) + len(compare_2) +len(compare_1)

i_list = [1.0,2.0,3.0,4.0]
compare = [compare_1,compare_2,compare_3,compare_4]
col = 'text_cos_sim'
# For each label, count the pairs whose score falls at or above a few thresholds
for i, c in zip(i_list, compare):
    print('Number of pairs that are rated as ',i,':')
    print(len(c))
    print('Pairs that are rated as ',i,', text cosine similarity is 0: (value, percentage)')
    print(len(c[c[col]==0]), ', ', len(c[c[col]==0])/len(c)*100,'%')
    print('Pairs that are rated as ',i,', text cosine similarity is > 0.3:')
    print(len(c[c[col]>0.3]), ', ', len(c[c[col]>0.3])/len(c)*100,'%')
    print('Pairs that are rated as ',i,', text cosine similarity is > 0.5:')
    print(len(c[c[col]>0.5]), ', ', len(c[c[col]>0.5])/len(c)*100,'%')
    # Exact float comparison: a score such as 1.0000000000000002 is not counted here
    print('Pairs that are rated as ',i,', text cosine similarity is 1.0:')
    print(len(c[c[col]==1.0]), ', ', len(c[c[col]==1.0])/len(c)*100,'%')

Number of pairs that are rated as  1.0 :
1443
Pairs that are rated as  1.0 , text cosine similarity is 0: (value, percentage)
29 ,  2.0097020097020097 %
Pairs that are rated as  1.0 , text cosine similarity is > 0.3:
1278 ,  88.56548856548856 %
Pairs that are rated as  1.0 , text cosine similarity is > 0.5:
930 ,  64.44906444906445 %
Pairs that are rated as  1.0 , text cosine similarity is 1.0:
1 ,  0.0693000693000693 %
Number of pairs that are rated as  2.0 :
1124
Pairs that are rated as  2.0 , text cosine similarity is 0: (value, percentage)
20 ,  1.7793594306049825 %
Pairs that are rated as  2.0 , text cosine similarity is > 0.3:
794 ,  70.64056939501779 %
Pairs that are rated as  2.0 , text cosine similarity is > 0.5:
369 ,  32.829181494661924 %
Pairs that are rated as  2.0 , text cosine similarity is 1.0:
1 ,  0.0889679715302491 %
Number of pairs that are rated as  3.0 :
959
Pairs that are rated as  3.0 , text cosine similarity is 0: (value, percentage)
25 ,  2.6

In [92]:
path = 'eval/_EVAL_text_cosSim_score.csv'
processed_df.to_csv(path,index=False)


### Remarks: <br>

As can already be observed, the cosine similarity score is consistent with the annotated *Overall* score. <br>

For instance, consider the pairs annotated as 4.0 (very dissimilar): around 82% of them have a similarity score between 0 and 0.3.
On the other hand, 64% of the pairs annotated as 1.0 received a similarity score > 0.5. <br><br>

This suggests that, for this dataset, cosine similarity agrees closely with the annotations on **dissimilar** pairs. Identifying similar articles, however, seems to be a subtler task for human annotators, and this is reflected in the less clear-cut cosine similarity scores for those pairs.
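
The percentages quoted in these remarks are shares of pairs per annotated label that fall inside a score range. A minimal sketch of that computation follows, assuming a dataframe shaped like `compare_df` (columns `overall` and `text_cos_sim`). The toy values and the helper name `share_per_label` are hypothetical and only illustrate the idea; they are not the evaluation data.

```python
import pandas as pd

# Toy stand-in for compare_df (hypothetical values, not the evaluation data)
toy_df = pd.DataFrame({
    "overall": [1.0, 1.0, 1.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0],
    "text_cos_sim": [0.72, 0.55, 0.28, 0.41, 0.12, 0.05, 0.00, 0.33, 0.19],
})


def share_per_label(df, mask, label_col="overall"):
    """Percentage of rows per label for which the boolean mask is True."""
    # The mean of a boolean Series is the fraction of True values
    return mask.groupby(df[label_col]).mean() * 100


# Share of pairs per label with a score between 0 and 0.3
low_share = share_per_label(toy_df, toy_df["text_cos_sim"] <= 0.3)
# Share of pairs per label with a score above 0.5
high_share = share_per_label(toy_df, toy_df["text_cos_sim"] > 0.5)

print(low_share)
print(high_share)
```

Applied to `compare_df` itself, the same two calls should give the per-label shares discussed above in one pass, without building a separate dataframe per label.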