
## Cosine similarity of body text (training)

In what follows we use the translated and preprocessed body text of the training articles to compute a first similarity score for each article pair.

The texts are embedded as *tf-idf* vectors, and the similarity score of a pair is the *cosine similarity* between its two vectors. For both steps we rely on the corresponding tools from [scikit-learn](https://scikit-learn.org/stable/), `TfidfVectorizer` and `cosine_similarity`.
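
As a minimal sketch of these two steps, the snippet below runs them on three toy documents. The documents and the `toy_*` names are made up for illustration and are not part of the training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy documents (illustrative only, not taken from the training set)
docs = [
    "india approve third moon mission",
    "india plan new moon mission",
    "fire kill animal zoo germany",
]

# each document becomes one row of a sparse tf-idf matrix
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)

# entry [i, j] is the cosine similarity between document i and document j
toy_sim = cosine_similarity(toy_matrix)

print(toy_sim.shape)
print(toy_sim[0, 1])  # documents 0 and 1 share words, so this is positive
print(toy_sim[0, 2])  # documents 0 and 2 share no words, so this is 0.0
```

Documents that share no vocabulary get a score of exactly zero, which is also why pairs with a missing (empty) text end up with a score of `0.0` later on.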

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np 
import math

In [2]:
# read in our translated preprocessed text in pairs in a dataframe
processed_df = pd.read_csv('train/_TRAIN_preprocessed_text.csv')

In [3]:
processed_df

Unnamed: 0.1,Unnamed: 0,pair_id,url1_lang,url2_lang,title1,title2,keywords1,keywords2,text1,text2,...,Entities,Time,Narrative,Overall,Style,Tone,translated_body1,translated_body2,preprocessed_1,preprocessed_2
0,0,1484084337_1484110209,en,en,Virginia man arrested in fatal DUI crash in We...,Haiti’s leader marks independence day amid sec...,"law and order,reckless endangerment,transporta...","port au prince,latinamericaandcaribbean,jean,c...","MARTINSBURG, W.Va. — A suspected drunken drive...","PORT-AU-PRINCE, Haiti — Haitian President Jove...",...,4.0,1.0,4.0,4.0,16666666666666600,2.0,"MARTINSBURG, W.Va. — A suspected drunken drive...","PORT-AU-PRINCE, Haiti — Haitian President Jove...",martinsburg wva — suspect drunken driver arres...,portauprince haiti — haitian president jovenel...
1,1,1484396422_1483924666,en,en,Guyana: Three injured after car crashes into u...,Fire kills more than 30 animals at zoo in west...,,"smg2_world,smg_europe,smg2_news",Share This On:\n\nPin 11 Shares\n\n(NEWS ROOM ...,BERLIN - A fire at a zoo in western Germany in...,...,4.0,1.0,4.0,36666666666666600,16666666666666600,13333333333333300,Share This On: Pin 11 Shares (NEWS ROOM GUYA...,BERLIN - A fire at a zoo in western Germany in...,share pin 11 share news room guyana — three pe...,berlin fire zoo western germany first minute 2...
2,2,1484698254_1483758694,en,en,Trump Brings In 2020 At Mar-a-Lago: ‘We’re Goi...,"Trump says he does not expect war with Iran, ‘...",,"full coverage 2020 us presidential elections,f...",(Breitbart) – President Donald Trump welcomed ...,"PALM BEACH, United States — US President Donal...",...,2.0,1.0,2333333333333330,2333333333333330,1.0,13333333333333300,(Breitbart) – President Donald Trump welcomed ...,"PALM BEACH, United States — US President Donal...",breitbart – president donald trump welcome gue...,palm beach united states — we president donald...
3,3,1576314516_1576455088,en,en,Zomato Buys Uber’s Food Delivery Business in I...,Indian Online Food Delivery Market to Hit $8 B...,zomatoubereatsbusinessacquisitionindiaallstock...,"swiggy,ber,indian online food delivery market ...",Uber has sold its online food-ordering busines...,Rapid digitisation and growth in both online b...,...,2333333333333330,26666666666666600,16666666666666600,2.0,16666666666666600,16666666666666600,Uber has sold its online food-ordering busines...,Rapid digitisation and growth in both online b...,uber sell online foodordere business india loc...,rapid digitisation growth online buyer base sp...
4,4,1484036253_1483894099,en,en,"India approves third moon mission, months afte...",India targets new moon mission in 2020,"india,lunarorbiter,isro,landonthemoon","india,space",BENGALURU (Reuters) - India has approved its t...,BANGALORE: India plans to make a fresh attempt...,...,1.25,1.0,1.25,1.25,1.0,1.0,BENGALURU (Reuters) - India has approved its t...,BANGALORE: India plans to make a fresh attempt...,bengaluru reuters india approve third lunar mi...,bangalore india plan make fresh attempt land u...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4959,4959,1586195445_1598778991,tr,tr,"BM, Aden'de 2 bini aşkın iç göçmenin selden za...",BM'den Yemen'de kadınların doğumda ölüm riski ...,"twitter,yemen,güncel,birleşmişmilletler,birleş...","haber,yemen,güncel,birleşmişmilletler,birleşmi...","BM, Aden'de 2 bini aşkın iç göçmenin selden za...",BM'den Yemen'de kadınların doğumda ölüm riski ...,...,2.0,2.0,4.0,3.0,1.0,1.0,"The UN announced that more than 2,000 domestic...","In Yemen, Women's Death Risk Warning Explanati...",un announce 2000 domestic migrant aden explain...,yemen women death risk warning explanation bir...
4960,4960,1590915424_1590940388,tr,tr,Kovid-19'dan dolayı La Liga kulüplerinde hayat...,Fabio Capello: Koronavirüs sonrası La Liga'da ...,"laliga,la liga,i̇spanya,spor,futbol,realmadrid...","kovid 19,i̇spanya1futbolligi,la liga,laliga,ko...",Kovid-19'dan dolayı La Liga kulüplerinde hayat...,Yeni tip koronavirüs (Kovid-19) salgınının eko...,...,1.0,1.0,1.0,1.0,1.0,1.0,"Because of the Kovid-19, the Survival Struggle...",The new type of coronavirus (Kovid-19) is cons...,kovid19 survival struggle la liga club discuss...,new type coronavirus kovid19 consider severely...
4961,4961,1526157103_1492737005,tr,tr,Saray da çare olmadı: 'Borca boğulan dev kulüp...,TFF’den jet yanıt! ''Bizi hedef gösteriyorlar'',"satiş,'borca,olmadi:,kulüpler,çare,da,masasınd...","tff,ahmet nur çebi,türkiye futbol federasyonu,...",\n\n\n\n\n\n\n\nİflas noktasındaki kulüplerin ...,"TFF, resmi internet sitesinden Beşiktaş'ın fai...",...,2.0,3.0,4.0,3.0,1.0,2.0,It is stated that the sales of the clubs at th...,TFF has published an explanation on the offici...,state sale club point bankrupt agenda besiktas...,tff publish explanation official website besik...
4962,4962,1603274500_1618292937,tr,tr,Ergene Belediyesi yol çalışmalarına aksatmadan...,Ergene'de Ahimehmet ve Yeşiltepe mahallelerind...,"tekirdağ,ergene,rasimyüksel,güncel,koronavirüs...","yeşiltepe,yaşam,koronavirüs,haber",Ergene Belediyesi yol çalışmalarına aksatmadan...,Ergene'de Ahimehmet ve Yeşiltepe mahallelerind...,...,2.0,3.0,3.0,3.0,1.0,1.0,Ergene Municipality continues without disrupti...,"In Ergene, the mask was distributed in Ahimehm...",ergene municipality continue without disrupt r...,ergene mask distribute ahimehmet yeşiltepe nei...


In [4]:
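# some texts are missing and were read in as NaN; replace them with empty strings
# so that they do not interfere with vectorization
processed_df = processed_df.replace(np.nan, '', regex=True)

# fillna returns a new Series, so assign the result back to the column
processed_df['preprocessed_1'] = processed_df['preprocessed_1'].fillna('')
processed_df['preprocessed_2'] = processed_df['preprocessed_2'].fillna('')
processed_df['preprocessed_2']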

0       portauprince haiti — haitian president jovenel...
1       berlin fire zoo western germany first minute 2...
2       palm beach united states — we president donald...
3       rapid digitisation growth online buyer base sp...
4       bangalore india plan make fresh attempt land u...
                              ...                        
4960    new type coronavirus kovid19 consider severely...
4961    tff publish explanation official website besik...
4962    ergene mask distribute ahimehmet yeşiltepe nei...
4963    criminal court group commentation member terro...
Name: preprocessed_2, Length: 4964, dtype: object
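
A point worth keeping in mind with this kind of cleaning: `Series.fillna` returns a new Series and leaves the dataframe untouched unless the result is assigned back. The small sketch below shows the difference on a made-up frame (`toy` is a hypothetical name used only here).

```python
import numpy as np
import pandas as pd

# made-up frame with one missing text
toy = pd.DataFrame({"preprocessed_1": ["berlin fire zoo", np.nan]})

# calling fillna without assignment does not change the frame
toy["preprocessed_1"].fillna("")
missing_before = int(toy["preprocessed_1"].isna().sum())

# assigning the result back replaces the missing value with an empty string
toy["preprocessed_1"] = toy["preprocessed_1"].fillna("")
missing_after = int(toy["preprocessed_1"].isna().sum())

print(missing_before, missing_after)
```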

In [5]:
# then join them together to form a complete corpus
corpus = processed_df['preprocessed_1'].tolist() + processed_df['preprocessed_2'].tolist()
len(corpus)


9928

In [10]:
# spot-check the layout: corpus[0 + 4964] should be the second text of the first pair
corpus[0+4964]

'portauprince haiti — haitian president jovenel moïse break tradition wednesday celebrate country \' independence day capital security reason follow month political turmoil moïse whose government accuse corruption denounce graft speech national palace portauprince urge haiti \' elite work government help create employment " \' still extremely poor " say " continue get rich find normal pay taxis find normal competition find normal set price consumer especially consumer state " moïse also apologize country \' ongoing power outage renew 2016 campaign pledge provide electricity 24 hour day say hard accomplish imagine speech mark 216th anniversary world \' first black republic originally slate take place northern coastal town gonaive jeanjacque dessaline declare haiti \' independence town like many other hit violent protest begin september amid anger corruption fuel shortage dwindle food supply opposition leader supporter demand resignation moïse 40 people kill dozen injured largescale prot


### Generate tf-idf vectors and calculate cosine similarity

After collecting all the preprocessed article texts, replacing unavailable texts with empty strings, and joining both columns into one text corpus, we encode the corpus as tf-idf vectors.

In [11]:
# generate tf-idf vectors for the whole corpus, then compute all pairwise cosine similarities
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# a quick look at the cosine similarity matrix
print(cosine_sim.shape)
print(cosine_sim)

(9928, 9928)
[[1.         0.0639595  0.04137205 ... 0.02714676 0.0120541  0.00926632]
 [0.0639595  1.         0.01252504 ... 0.02336032 0.00337145 0.01006014]
 [0.04137205 0.01252504 1.         ... 0.03889902 0.00681368 0.01049759]
 ...
 [0.02714676 0.02336032 0.03889902 ... 1.         0.0055339  0.01052353]
 [0.0120541  0.00337145 0.00681368 ... 0.0055339  1.         0.00371053]
 [0.00926632 0.01006014 0.01049759 ... 0.01052353 0.00371053 1.        ]]


In [12]:
# because of how we formed the corpus, indices 0-4963 hold the first text of each pair and 4964-9927 the second
# so the score of pair i sits at cosine_sim[i][len(processed_df) + i]; we store it in our dataframe
switch_col = len(processed_df)

for i in range(0,switch_col):
    cos_sim = cosine_sim[i][switch_col+i]
    print(cos_sim)
    processed_df.at[i,'text_cos_sim'] = cos_sim

0.06417263315038343
0.02177422125698962
0.2922464437390565
0.413752104340453
0.5098179976865087
0.3577997516807665
0.06163473735520945
0.03417475399974342
0.16815497114919364
0.08639938084820331
0.0
0.06822159505363207
0.006032058963242328
0.08382534524689099
0.05820996358562497
0.07218782151645319
0.0
0.0971263887933748
0.011382695990251473
0.4307271619260126
0.03205004464885682
0.0
0.0
0.5057071375252934
0.03099007651887247
0.6867107927942305
0.039821805994658954
0.01713709832487599
0.45911679794930793
0.02235813761180428
0.2945523102692112
0.008158706174462954
0.31271908493442674
0.01443785547576841
0.044574897499833906
0.026290553239222948
0.0
0.49688427365002696
0.018290218655549664
0.4032627646459601
0.4911534821041509
0.028536564637417692
0.33938492908815837
0.38479299444266063
0.03628489659710795
0.5369604285141993
0.545682251953029
0.0
0.4369381071629575
0.1221932829093317
0.0811516265391405
0.6369111215637959
0.044317907125761245
0.2915010415888149
0.3668433688437983
0.023452

0.034347184635002526
0.01378592103754947
0.04147806321949519
0.0
0.3997041124766862
0.1963173279544361
0.01886816900108305
0.06185856936155003
0.0
0.2264662423071479
0.014439691440438199
0.0632596312301373
0.05081697638781543
0.2448318492338783
0.20647396797513884
0.5218599604359726
0.014667914841505526
0.07697984865542551
0.29788704979654335
0.08451688999149616
0.2281116035727558
0.5537777505324373
0.008945909467022546
0.012071389319547121
0.012609685506286836
0.04434540846371651
0.01240488933141826
0.2728369321380744
0.6540643670535762
0.10618889221699065
0.008934929826196043
0.0
0.507665969114313
0.3452380278811689
0.0
0.08597022769642912
0.4263017332355238
0.027935205133696343
0.07400389704298362
0.260380675653017
0.09976546952018286
0.004694181211970762
0.21768210649816783
0.04516885676252888
0.6295464167433203
0.008351822959055658
0.2873159690429847
0.04928923896365929
0.07868626788571054
0.021885104103671393
0.011480129045376512
0.6159745431710533
0.47980331164613826
0.093107557

0.6068310297445723
0.07130557173476092
0.590472476167582
0.6331209790729296
0.057733083517699305
0.17271513424674828
0.5124720391306844
0.5023581256549853
0.2877105675210284
0.16597192142080874
0.189595176608513
0.41042077899512724
0.13183346240536198
0.6890262943861165
0.08048634030830426
0.07418171849275719
0.3237611226135224
0.30192629676593785
0.036218573675022674
0.20610132934220626
0.17523091240656147
0.35974470270775616
0.4638797847569418
0.25072071368484655
0.20694858822764287
0.23052480058224814
0.15803146489137054
0.05162967210857855
0.6262137445005783
0.38409391892464395
0.41997964892436496
0.17276239529268103
0.3757216125677176
0.15216769835062632
0.3007602293354821
0.047614658330741876
0.19594432782830234
0.16323730092663194
0.38803566358936376
0.727052193714639
0.27943046017916556
0.41433647080687697
0.7593447463237661
0.007784703386072544
0.14732825815317527
0.42149894716050934
0.2429665504089509
0.6101952528957664
0.1915891502512754
0.049336711764215
0.6363147025440246


0.15949598210194008
0.03896710236611689
0.22112659364802129
0.0
0.0
0.040515502769323315
0.04954748074855361
0.07294545207514928
0.02047575182689361
0.07240253611374897
0.47740631248703674
0.07563037597867983
0.026479260388655404
0.0
0.05384495195690906
0.03241611421585404
0.066362218675074
0.0
0.2582791169079321
0.47883411762008155
0.04629679993392737
0.5918029410707105
0.11974617679414792
0.15140303318336598
0.25003891040198134
0.0
0.014000201324900895
0.11726352652324067
0.040957075330633606
0.007634613130843686
0.18369142934198399
0.0
0.008124009003350965
0.25774965874105993
0.0
0.019428410909214554
0.43451923540996573
0.23287865980924033
0.0
0.5050477190924777
0.2924348388582926
0.0031780190148576943
0.04751433813120745
0.0
0.20885756917997478
0.20885756917997478
0.05059550403167795
0.012432569699815285
0.32868926337577103
0.12511332797946237
0.1660401113437067
0.0
0.060271704747031425
0.27301692959659696
0.021229893935655393
0.20950670470259347
0.006563127768181968
0.0
0.02191504
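
Only one entry per pair is read from the 9928 x 9928 matrix, so the full matrix is not strictly needed. A possible alternative, sketched here on toy data and not the approach used in this notebook, computes just the per-pair scores. It assumes the default `norm='l2'` of `TfidfVectorizer`, under which the cosine similarity of two rows equals their dot product; `first_texts`, `second_texts` and the other names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy stand-ins for the preprocessed_1 and preprocessed_2 columns
first_texts = ["uber sell food business india", "berlin zoo fire", ""]
second_texts = ["india food delivery market grow", "trump expect no war iran", "ergene mask distribute"]

n_pairs = len(first_texts)
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(first_texts + second_texts)

# rows 0..n_pairs-1 hold the first texts, rows n_pairs..2*n_pairs-1 the second
first_half = matrix[:n_pairs]
second_half = matrix[n_pairs:]

# rows are L2-normalised by default, so the row-wise dot product is the cosine similarity
pair_scores = np.asarray(first_half.multiply(second_half).sum(axis=1)).ravel()

# cross-check against the corresponding entries of the full similarity matrix
full = cosine_similarity(matrix)
expected = np.array([full[i, n_pairs + i] for i in range(n_pairs)])
print(np.allclose(pair_scores, expected))
```

On the real corpus the resulting array could be assigned to the `text_cos_sim` column in one step instead of looping over rows.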

### Compare text cosine similarity score

We compare the annotated overall score of each pair with the cosine similarity score computed on the translated and preprocessed body texts of the articles in the training dataset.
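
One way to summarise such a comparison, shown only as a sketch on made-up values, is to look at the score distribution per label and at a rank correlation. The frame below mimics the two columns of the `compare_df` built in the next cell; its numbers are illustrative and are not results.

```python
import pandas as pd

# made-up rows in the shape of compare_df (illustrative values only)
toy_compare = pd.DataFrame({
    "overall": [4, 4, 2, 1, 3, 1],
    "text_cos_sim": [0.06, 0.02, 0.29, 0.51, 0.09, 0.62],
})

# mean, spread and count of the cosine score within each annotated label
per_label = toy_compare.groupby("overall")["text_cos_sim"].agg(["mean", "std", "count"])
print(per_label)

# Spearman rank correlation, computed as the Pearson correlation of the ranks
rho = toy_compare["overall"].rank().corr(toy_compare["text_cos_sim"].rank())
print(rho)
```

In the printed output of this notebook, label 4 appears mostly with low scores and label 1 with higher ones, so a negative correlation would be the expected outcome on the real data.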

In [27]:
# build a dataframe pairing each label (annotated overall similarity) with our text cosine similarity score

train_df = pd.read_csv('train/_TRAIN_details_in_df.csv')

def normal_round(n):
    # round half up (the built-in round() rounds halves to the nearest even number)
    if n - math.floor(n) < 0.5:
        return int(math.floor(n))
    return int(math.ceil(n))

# collect one entry per pair; DataFrame.append was removed in pandas 2.0
entries = []

for i, row in processed_df.iterrows(): 
    # just to check if dataframes align
#     if row["pair_id"] != train_df.iloc[i]['pair_id']:
#         print('---------------------',i,'-----------------------')
#         print(row["pair_id"])
    print('---------------------',i,'-----------------------')
    label = normal_round(float(train_df.iloc[i]['Overall']))
    print('label:',label)
    score = row['text_cos_sim']
    print('score:',score)
    
    # fix processed_df overall mishaps
    processed_df.at[i,'Overall'] = label
    entries.append({"overall":label,"text_cos_sim":score})

compare_df = pd.DataFrame(entries, columns = ["overall","text_cos_sim"])
    

--------------------- 0 -----------------------
label: 4
score: 0.06417263315038343
--------------------- 1 -----------------------
label: 4
score: 0.02177422125698962
--------------------- 2 -----------------------
label: 2
score: 0.2922464437390565
--------------------- 3 -----------------------
label: 2
score: 0.413752104340453
--------------------- 4 -----------------------
label: 1
score: 0.5098179976865087
--------------------- 5 -----------------------
label: 2
score: 0.3577997516807665
--------------------- 6 -----------------------
label: 4
score: 0.06163473735520945
--------------------- 7 -----------------------
label: 3
score: 0.03417475399974342
--------------------- 8 -----------------------
label: 4
score: 0.16815497114919364
--------------------- 9 -----------------------
label: 3
score: 0.08639938084820331
--------------------- 10 -----------------------
label: 3
score: 0.0
--------------------- 11 -----------------------
label: 4
score: 0.06822159505363207
-----------

--------------------- 146 -----------------------
label: 4
score: 0.025325468852070016
--------------------- 147 -----------------------
label: 4
score: 0.03734760634556585
--------------------- 148 -----------------------
label: 4
score: 0.028998327134824822
--------------------- 149 -----------------------
label: 4
score: 0.0005958187647002499
--------------------- 150 -----------------------
label: 3
score: 0.4331736439666263
--------------------- 151 -----------------------
label: 2
score: 0.2132266615091241
--------------------- 152 -----------------------
label: 2
score: 0.30314157856111007
--------------------- 153 -----------------------
label: 2
score: 0.4457895737404415
--------------------- 154 -----------------------
label: 2
score: 0.2653327910845074
--------------------- 155 -----------------------
label: 4
score: 0.34581132330612996
--------------------- 156 -----------------------
label: 3
score: 0.24908804036250953
--------------------- 157 -----------------------
labe

score: 0.19775764863033685
--------------------- 289 -----------------------
label: 1
score: 0.5372345134633647
--------------------- 290 -----------------------
label: 2
score: 0.2671080534760943
--------------------- 291 -----------------------
label: 4
score: 0.0014467651971222537
--------------------- 292 -----------------------
label: 4
score: 0.024045721094944763
--------------------- 293 -----------------------
label: 4
score: 0.01069978953912983
--------------------- 294 -----------------------
label: 1
score: 0.3360962205914697
--------------------- 295 -----------------------
label: 4
score: 0.0
--------------------- 296 -----------------------
label: 2
score: 0.21856946016371784
--------------------- 297 -----------------------
label: 4
score: 0.04034756123245038
--------------------- 298 -----------------------
label: 4
score: 0.03812337809970153
--------------------- 299 -----------------------
label: 4
score: 0.03136754058878063
--------------------- 300 -----------------

--------------------- 436 -----------------------
label: 3
score: 0.08537599757262247
--------------------- 437 -----------------------
label: 2
score: 0.4300547140775196
--------------------- 438 -----------------------
label: 4
score: 0.07173838002442935
--------------------- 439 -----------------------
label: 4
score: 0.06371804920052393
--------------------- 440 -----------------------
label: 1
score: 0.5204409140776274
--------------------- 441 -----------------------
label: 2
score: 0.1720331753172668
--------------------- 442 -----------------------
label: 1
score: 0.3793070351089032
--------------------- 443 -----------------------
label: 3
score: 0.38821474642991627
--------------------- 444 -----------------------
label: 1
score: 0.4128510568854492
--------------------- 445 -----------------------
label: 4
score: 0.0
--------------------- 446 -----------------------
label: 4
score: 0.05873367352176144
--------------------- 447 -----------------------
label: 4
score: 0.0996897

--------------------- 581 -----------------------
label: 3
score: 0.04583852386336232
--------------------- 582 -----------------------
label: 4
score: 0.014315900836466201
--------------------- 583 -----------------------
label: 4
score: 0.025478606365859097
--------------------- 584 -----------------------
label: 4
score: 0.0
--------------------- 585 -----------------------
label: 3
score: 0.0
--------------------- 586 -----------------------
label: 4
score: 0.010945780662834803
--------------------- 587 -----------------------
label: 1
score: 0.41617975805923896
--------------------- 588 -----------------------
label: 4
score: 0.09861565516394855
--------------------- 589 -----------------------
label: 4
score: 0.020827404631523577
--------------------- 590 -----------------------
label: 3
score: 0.43920990706882057
--------------------- 591 -----------------------
label: 4
score: 0.03266179611988449
--------------------- 592 -----------------------
label: 4
score: 0.02340701040634

--------------------- 744 -----------------------
label: 3
score: 0.002352126672087339
--------------------- 745 -----------------------
label: 3
score: 0.21060596385838282
--------------------- 746 -----------------------
label: 4
score: 0.02136056646782033
--------------------- 747 -----------------------
label: 4
score: 0.025930765942280255
--------------------- 748 -----------------------
label: 4
score: 0.03353201183551161
--------------------- 749 -----------------------
label: 4
score: 0.030640947859667984
--------------------- 750 -----------------------
label: 3
score: 0.20223583976770204
--------------------- 751 -----------------------
label: 3
score: 0.08512444821059796
--------------------- 752 -----------------------
label: 2
score: 0.333372982376062
--------------------- 753 -----------------------
label: 4
score: 0.033423241886997156
--------------------- 754 -----------------------
label: 1
score: 0.5621640013674362
--------------------- 755 -----------------------
lab

--------------------- 884 -----------------------
label: 2
score: 0.5135915192850851
--------------------- 885 -----------------------
label: 4
score: 0.009070305401922434
--------------------- 886 -----------------------
label: 4
score: 0.04718131977098263
--------------------- 887 -----------------------
label: 1
score: 0.4327582449644088
--------------------- 888 -----------------------
label: 4
score: 0.013047912524979442
--------------------- 889 -----------------------
label: 1
score: 0.5372216588535397
--------------------- 890 -----------------------
label: 4
score: 0.0552546063126825
--------------------- 891 -----------------------
label: 4
score: 0.0
--------------------- 892 -----------------------
label: 1
score: 0.5907983552206103
--------------------- 893 -----------------------
label: 4
score: 0.22737723706217133
--------------------- 894 -----------------------
label: 4
score: 0.10485013317309781
--------------------- 895 -----------------------
label: 1
score: 0.72559

score: 0.05482244214131694
--------------------- 1015 -----------------------
label: 4
score: 0.020067062206270973
--------------------- 1016 -----------------------
label: 4
score: 0.045694996502110966
--------------------- 1017 -----------------------
label: 4
score: 0.017492241831306574
--------------------- 1018 -----------------------
label: 4
score: 0.052571680560338355
--------------------- 1019 -----------------------
label: 3
score: 0.06677538692481183
--------------------- 1020 -----------------------
label: 2
score: 0.458202581141334
--------------------- 1021 -----------------------
label: 4
score: 0.04940064965416041
--------------------- 1022 -----------------------
label: 4
score: 0.15525231036950227
--------------------- 1023 -----------------------
label: 4
score: 0.05617466400193858
--------------------- 1024 -----------------------
label: 2
score: 0.2957206487398417
--------------------- 1025 -----------------------
label: 1
score: 0.6363722076387015
----------------

--------------------- 1190 -----------------------
label: 4
score: 0.06026048585979179
--------------------- 1191 -----------------------
label: 3
score: 0.03091259527250084
--------------------- 1192 -----------------------
label: 3
score: 0.14581785013476534
--------------------- 1193 -----------------------
label: 2
score: 0.7483007696680117
--------------------- 1194 -----------------------
label: 1
score: 0.4308777353272028
--------------------- 1195 -----------------------
label: 2
score: 0.5972751784923818
--------------------- 1196 -----------------------
label: 4
score: 0.012854251174276944
--------------------- 1197 -----------------------
label: 1
score: 0.46403132565012845
--------------------- 1198 -----------------------
label: 4
score: 0.01690274401166645
--------------------- 1199 -----------------------
label: 2
score: 0.2100968357959886
--------------------- 1200 -----------------------
label: 4
score: 0.022813889284232322
--------------------- 1201 ------------------

--------------------- 1335 -----------------------
label: 3
score: 0.14848499324310555
--------------------- 1336 -----------------------
label: 4
score: 0.00712300258790019
--------------------- 1337 -----------------------
label: 1
score: 0.41707829351881753
--------------------- 1338 -----------------------
label: 2
score: 0.6801952592502636
--------------------- 1339 -----------------------
label: 4
score: 0.009727580447040632
--------------------- 1340 -----------------------
label: 1
score: 0.6670286065884183
--------------------- 1341 -----------------------
label: 2
score: 0.5177030958886111
--------------------- 1342 -----------------------
label: 3
score: 0.29183540514896433
--------------------- 1343 -----------------------
label: 4
score: 0.02139840841381831
--------------------- 1344 -----------------------
label: 2
score: 0.22528996289174358
--------------------- 1345 -----------------------
label: 4
score: 0.047952181582445755
--------------------- 1346 -----------------

--------------------- 1483 -----------------------
label: 2
score: 0.44084500289014955
--------------------- 1484 -----------------------
label: 2
score: 0.4286154856081916
--------------------- 1485 -----------------------
label: 4
score: 0.05513692125820429
--------------------- 1486 -----------------------
label: 3
score: 0.10759004676035785
--------------------- 1487 -----------------------
label: 4
score: 0.0
--------------------- 1488 -----------------------
label: 1
score: 0.7740722065973764
--------------------- 1489 -----------------------
label: 4
score: 0.02356476417955073
--------------------- 1490 -----------------------
label: 4
score: 0.06587864692031753
--------------------- 1491 -----------------------
label: 3
score: 0.13080475712937897
--------------------- 1492 -----------------------
label: 2
score: 0.4533774017041893
--------------------- 1493 -----------------------
label: 2
score: 0.36072863500345914
--------------------- 1494 -----------------------
label: 1
sc

--------------------- 1635 -----------------------
label: 4
score: 0.06652345430309203
--------------------- 1636 -----------------------
label: 3
score: 0.15659962155130563
--------------------- 1637 -----------------------
label: 3
score: 0.20856144458397335
--------------------- 1638 -----------------------
label: 3
score: 0.16154006598830753
--------------------- 1639 -----------------------
label: 4
score: 0.061264952611374324
--------------------- 1640 -----------------------
label: 4
score: 0.08180441423898366
--------------------- 1641 -----------------------
label: 4
score: 0.05571516668054906
--------------------- 1642 -----------------------
label: 2
score: 0.5703343383978918
--------------------- 1643 -----------------------
label: 4
score: 0.09026165337207444
--------------------- 1644 -----------------------
label: 1
score: 0.5581137735822366
--------------------- 1645 -----------------------
label: 2
score: 0.1086919072903826
--------------------- 1646 ------------------

--------------------- 1752 -----------------------
label: 1
score: 0.5815536490075802
--------------------- 1753 -----------------------
label: 1
score: 0.5198391605881066
--------------------- 1754 -----------------------
label: 1
score: 0.6150607760474243
--------------------- 1755 -----------------------
label: 3
score: 0.2116504424162527
--------------------- 1756 -----------------------
label: 1
score: 0.38908551473678266
--------------------- 1757 -----------------------
label: 4
score: 0.06691162683916457
--------------------- 1758 -----------------------
label: 4
score: 0.03093390101049873
--------------------- 1759 -----------------------
label: 4
score: 0.012224186012167126
--------------------- 1760 -----------------------
label: 2
score: 0.20565418157735801
--------------------- 1761 -----------------------
label: 4
score: 0.001980460715317239
--------------------- 1762 -----------------------
label: 4
score: 0.004916238478791937
--------------------- 1763 -----------------

--------------------- 1902 -----------------------
label: 2
score: 0.36231764492514135
--------------------- 1903 -----------------------
label: 2
score: 0.37857578235163536
--------------------- 1904 -----------------------
label: 2
score: 0.5763592296904079
--------------------- 1905 -----------------------
label: 4
score: 0.0999306993790995
--------------------- 1906 -----------------------
label: 1
score: 0.37030180125454404
--------------------- 1907 -----------------------
label: 1
score: 0.37200942632388345
--------------------- 1908 -----------------------
label: 4
score: 0.08851656661102408
--------------------- 1909 -----------------------
label: 3
score: 0.06026607947086206
--------------------- 1910 -----------------------
label: 3
score: 0.042231235317344025
--------------------- 1911 -----------------------
label: 4
score: 0.055270930875849875
--------------------- 1912 -----------------------
label: 4
score: 0.08960850589348104
--------------------- 1913 ----------------

--------------------- 2023 -----------------------
label: 4
score: 0.04105186519559109
--------------------- 2024 -----------------------
label: 3
score: 0.12780841505338256
--------------------- 2025 -----------------------
label: 1
score: 0.6217116901982386
--------------------- 2026 -----------------------
label: 1
score: 0.23035481480797
--------------------- 2027 -----------------------
label: 4
score: 0.05952577405867773
--------------------- 2028 -----------------------
label: 1
score: 0.37864682304721975
--------------------- 2029 -----------------------
label: 1
score: 0.372451440134463
--------------------- 2030 -----------------------
label: 4
score: 0.015388404177750028
--------------------- 2031 -----------------------
label: 2
score: 0.3928302326264958
--------------------- 2032 -----------------------
label: 2
score: 0.4246233474913436
--------------------- 2033 -----------------------
label: 4
score: 0.0037843015435357393
--------------------- 2034 ---------------------

--------------------- 2199 -----------------------
label: 4
score: 0.0
--------------------- 2200 -----------------------
label: 4
score: 0.0
--------------------- 2201 -----------------------
label: 4
score: 0.21955854204221556
--------------------- 2202 -----------------------
label: 1
score: 0.3888495687225069
--------------------- 2203 -----------------------
label: 4
score: 0.023745919009749587
--------------------- 2204 -----------------------
label: 4
score: 0.0
--------------------- 2205 -----------------------
label: 3
score: 0.0
--------------------- 2206 -----------------------
label: 3
score: 0.20404989835668955
--------------------- 2207 -----------------------
label: 3
score: 0.0014349176348208087
--------------------- 2208 -----------------------
label: 4
score: 0.29575447350577366
--------------------- 2209 -----------------------
label: 3
score: 0.02030277597682231
--------------------- 2210 -----------------------
label: 1
score: 0.02194357155075415
------------------

--------------------- 2355 -----------------------
label: 3
score: 0.21989864100796602
--------------------- 2356 -----------------------
label: 3
score: 0.1289282105564124
--------------------- 2357 -----------------------
label: 2
score: 0.5235852689194403
--------------------- 2358 -----------------------
label: 3
score: 0.0
--------------------- 2359 -----------------------
label: 4
score: 0.23426218119082257
--------------------- 2360 -----------------------
label: 3
score: 0.0707582839364876
--------------------- 2361 -----------------------
label: 3
score: 0.37895982377036125
--------------------- 2362 -----------------------
label: 4
score: 0.031524390245340365
--------------------- 2363 -----------------------
label: 4
score: 0.0
--------------------- 2364 -----------------------
label: 1
score: 0.3597994943478969
--------------------- 2365 -----------------------
label: 2
score: 0.18829804269904957
--------------------- 2366 -----------------------
label: 3
score: 0.148133633

--------------------- 2496 -----------------------
label: 3
score: 0.054500021514218636
--------------------- 2497 -----------------------
label: 4
score: 0.025481189726065605
--------------------- 2498 -----------------------
label: 3
score: 0.03227173016546927
--------------------- 2499 -----------------------
label: 1
score: 0.013279285582940819
--------------------- 2500 -----------------------
label: 2
score: 0.2628439269485254
--------------------- 2501 -----------------------
label: 1
score: 0.1652895563556848
--------------------- 2502 -----------------------
label: 4
score: 0.20051772075953767
--------------------- 2503 -----------------------
label: 2
score: 0.29211345514383863
--------------------- 2504 -----------------------
label: 3
score: 0.011511519269746332
--------------------- 2505 -----------------------
label: 2
score: 0.46788770414460756
--------------------- 2506 -----------------------
label: 1
score: 0.4135717551470385
--------------------- 2507 ---------------

--------------------- 2599 -----------------------
label: 1
score: 0.1983650280113043
--------------------- 2600 -----------------------
label: 3
score: 0.516156408222473
--------------------- 2601 -----------------------
label: 2
score: 0.21463580460561474
--------------------- 2602 -----------------------
label: 2
score: 0.36386525583017437
--------------------- 2603 -----------------------
label: 1
score: 0.564268748739076
--------------------- 2604 -----------------------
label: 2
score: 0.2810630204109626
--------------------- 2605 -----------------------
label: 1
score: 0.4645488871366637
--------------------- 2606 -----------------------
label: 4
score: 0.07682901341612709
--------------------- 2607 -----------------------
label: 4
score: 0.1711572651458615
--------------------- 2608 -----------------------
label: 1
score: 0.4393617626884689
--------------------- 2609 -----------------------
label: 3
score: 0.06725039007740179
--------------------- 2610 -----------------------
l

--------------------- 2704 -----------------------
label: 1
score: 0.47882002272033203
--------------------- 2705 -----------------------
label: 2
score: 0.5107552048528363
--------------------- 2706 -----------------------
label: 3
score: 0.4020241480154234
--------------------- 2707 -----------------------
label: 3
score: 0.06255893815926264
--------------------- 2708 -----------------------
label: 1
score: 0.5814576927925404
--------------------- 2709 -----------------------
label: 1
score: 0.5243883380594949
--------------------- 2710 -----------------------
label: 1
score: 0.7830902085890277
--------------------- 2711 -----------------------
label: 1
score: 0.6233044310619539
--------------------- 2712 -----------------------
label: 2
score: 0.8013999008876795
--------------------- 2713 -----------------------
label: 3
score: 0.5219849576865256
--------------------- 2714 -----------------------
label: 2
score: 0.4731155464711581
--------------------- 2715 -----------------------
l

--------------------- 2818 -----------------------
label: 4
score: 0.39447733886764313
--------------------- 2819 -----------------------
label: 3
score: 0.18923106566586267
--------------------- 2820 -----------------------
label: 1
score: 0.5316830791574938
--------------------- 2821 -----------------------
label: 4
score: 0.026091732175433126
--------------------- 2822 -----------------------
label: 2
score: 0.3572208570669475
--------------------- 2823 -----------------------
label: 1
score: 0.5138914168414259
--------------------- 2824 -----------------------
label: 2
score: 0.4096409651673466
--------------------- 2825 -----------------------
label: 2
score: 0.27668176957671503
--------------------- 2826 -----------------------
label: 3
score: 0.23055652271118787
--------------------- 2827 -----------------------
label: 2
score: 0.1856371458764413
--------------------- 2828 -----------------------
label: 3
score: 0.17667846560111616
--------------------- 2829 --------------------

--------------------- 2919 -----------------------
label: 4
score: 0.30192629676593785
--------------------- 2920 -----------------------
label: 4
score: 0.036218573675022674
--------------------- 2921 -----------------------
label: 2
score: 0.20610132934220626
--------------------- 2922 -----------------------
label: 3
score: 0.17523091240656147
--------------------- 2923 -----------------------
label: 2
score: 0.35974470270775616
--------------------- 2924 -----------------------
label: 2
score: 0.4638797847569418
--------------------- 2925 -----------------------
label: 4
score: 0.25072071368484655
--------------------- 2926 -----------------------
label: 2
score: 0.20694858822764287
--------------------- 2927 -----------------------
label: 3
score: 0.23052480058224814
--------------------- 2928 -----------------------
label: 3
score: 0.15803146489137054
--------------------- 2929 -----------------------
label: 4
score: 0.05162967210857855
--------------------- 2930 ----------------

label: 1
score: 0.702752893743698
--------------------- 3061 -----------------------
label: 2
score: 0.3633005203591333
--------------------- 3062 -----------------------
label: 3
score: 0.37760171789859787
--------------------- 3063 -----------------------
label: 1
score: 0.697350013877963
--------------------- 3064 -----------------------
label: 2
score: 0.5920089592107808
--------------------- 3065 -----------------------
label: 1
score: 0.7182906427017949
--------------------- 3066 -----------------------
label: 4
score: 0.07903902919809197
--------------------- 3067 -----------------------
label: 3
score: 0.5155463853247657
--------------------- 3068 -----------------------
label: 1
score: 0.04522725287988014
--------------------- 3069 -----------------------
label: 3
score: 0.1689279934787432
--------------------- 3070 -----------------------
label: 4
score: 0.08000207507498923
--------------------- 3071 -----------------------
label: 3
score: 0.18627408485893449
----------------

label: 4
score: 0.1860087458061626
--------------------- 3164 -----------------------
label: 4
score: 0.018794081523276968
--------------------- 3165 -----------------------
label: 2
score: 0.0
--------------------- 3166 -----------------------
label: 3
score: 0.3178724836007249
--------------------- 3167 -----------------------
label: 4
score: 0.2845106120287708
--------------------- 3168 -----------------------
label: 3
score: 0.5328480794606586
--------------------- 3169 -----------------------
label: 2
score: 0.6120087473970672
--------------------- 3170 -----------------------
label: 2
score: 0.5508320113222176
--------------------- 3171 -----------------------
label: 1
score: 0.5614924689186621
--------------------- 3172 -----------------------
label: 3
score: 0.41213397363864174
--------------------- 3173 -----------------------
label: 2
score: 0.3842084198536414
--------------------- 3174 -----------------------
label: 4
score: 0.4764266059054336
--------------------- 3175 ----

--------------------- 3266 -----------------------
label: 4
score: 0.09571397580531064
--------------------- 3267 -----------------------
label: 3
score: 0.5128770550038828
--------------------- 3268 -----------------------
label: 4
score: 0.0
--------------------- 3269 -----------------------
label: 1
score: 0.6615479680097188
--------------------- 3270 -----------------------
label: 1
score: 0.554044790956489
--------------------- 3271 -----------------------
label: 4
score: 0.03559482498133338
--------------------- 3272 -----------------------
label: 4
score: 0.06962221730825183
--------------------- 3273 -----------------------
label: 2
score: 0.27743257279989103
--------------------- 3274 -----------------------
label: 4
score: 0.18941326302588724
--------------------- 3275 -----------------------
label: 1
score: 0.5718358324843287
--------------------- 3276 -----------------------
label: 4
score: 0.4251008879940652
--------------------- 3277 -----------------------
label: 4
score

--------------------- 3377 -----------------------
label: 4
score: 0.2828258711286305
--------------------- 3378 -----------------------
label: 4
score: 0.16208890927794806
--------------------- 3379 -----------------------
label: 1
score: 0.7682152523263263
--------------------- 3380 -----------------------
label: 1
score: 0.530297224768104
--------------------- 3381 -----------------------
label: 4
score: 0.21596378584747342
--------------------- 3382 -----------------------
label: 1
score: 0.4652345631627517
--------------------- 3383 -----------------------
label: 2
score: 0.21266640066465733
--------------------- 3384 -----------------------
label: 4
score: 0.22290363971931865
--------------------- 3385 -----------------------
label: 4
score: 0.4286777229602216
--------------------- 3386 -----------------------
label: 4
score: 0.08948993159465964
--------------------- 3387 -----------------------
label: 1
score: 0.6334019207069974
--------------------- 3388 -----------------------

--------------------- 3570 -----------------------
label: 2
score: 0.5041956890129843
--------------------- 3571 -----------------------
label: 2
score: 0.6316017884709876
--------------------- 3572 -----------------------
label: 3
score: 0.264119229545265
--------------------- 3573 -----------------------
label: 4
score: 0.009688748447548825
--------------------- 3574 -----------------------
label: 3
score: 0.28984505258199966
--------------------- 3575 -----------------------
label: 2
score: 0.3110480894149746
--------------------- 3576 -----------------------
label: 4
score: 0.10090802541627608
--------------------- 3577 -----------------------
label: 2
score: 0.48192745731397035
--------------------- 3578 -----------------------
label: 3
score: 0.325409659859462
--------------------- 3579 -----------------------
label: 1
score: 0.7056053713617945
--------------------- 3580 -----------------------
label: 3
score: 0.02533075879106028
--------------------- 3581 -----------------------

--------------------- 3750 -----------------------
label: 2
score: 0.4337613304086829
--------------------- 3751 -----------------------
label: 3
score: 0.5329154905858633
--------------------- 3752 -----------------------
label: 2
score: 0.3473881885978552
--------------------- 3753 -----------------------
label: 1
score: 0.47749860515222636
--------------------- 3754 -----------------------
label: 3
score: 0.25573724577055856
--------------------- 3755 -----------------------
label: 3
score: 0.10123669270562752
--------------------- 3756 -----------------------
label: 4
score: 0.014710293577740249
--------------------- 3757 -----------------------
label: 3
score: 0.5059529602154639
--------------------- 3758 -----------------------
label: 3
score: 0.1629112081090205
--------------------- 3759 -----------------------
label: 4
score: 0.19353062744694846
--------------------- 3760 -----------------------
label: 1
score: 0.42661156515242155
--------------------- 3761 --------------------

score: 0.19586604075691544
--------------------- 3928 -----------------------
label: 4
score: 0.5428587733565382
--------------------- 3929 -----------------------
label: 2
score: 0.5365594753466608
--------------------- 3930 -----------------------
label: 3
score: 0.13783482806482183
--------------------- 3931 -----------------------
label: 1
score: 0.5324784593020695
--------------------- 3932 -----------------------
label: 3
score: 0.2711251001892321
--------------------- 3933 -----------------------
label: 3
score: 0.4477842970755697
--------------------- 3934 -----------------------
label: 3
score: 0.5691864001077757
--------------------- 3935 -----------------------
label: 3
score: 0.3532078750268207
--------------------- 3936 -----------------------
label: 2
score: 0.7795385868032998
--------------------- 3937 -----------------------
label: 3
score: 0.1582100977873303
--------------------- 3938 -----------------------
label: 3
score: 0.3388637302196452
--------------------- 3939

--------------------- 4110 -----------------------
label: 2
score: 0.614790285087132
--------------------- 4111 -----------------------
label: 4
score: 0.2365714465629172
--------------------- 4112 -----------------------
label: 1
score: 0.6394355407540406
--------------------- 4113 -----------------------
label: 4
score: 0.34856121296179415
--------------------- 4114 -----------------------
label: 2
score: 0.5724897783150653
--------------------- 4115 -----------------------
label: 4
score: 0.07719186231737059
--------------------- 4116 -----------------------
label: 4
score: 0.04828277085179257
--------------------- 4117 -----------------------
label: 4
score: 0.4810201404119915
--------------------- 4118 -----------------------
label: 2
score: 0.650317656204312
--------------------- 4119 -----------------------
label: 3
score: 0.23448411886344617
--------------------- 4120 -----------------------
label: 1
score: 0.6777689396791505
--------------------- 4121 -----------------------
l

--------------------- 4240 -----------------------
label: 2
score: 0.1332327758633312
--------------------- 4241 -----------------------
label: 4
score: 0.006655932071591418
--------------------- 4242 -----------------------
label: 2
score: 0.6221721626128314
--------------------- 4243 -----------------------
label: 4
score: 0.5948848046228689
--------------------- 4244 -----------------------
label: 1
score: 0.2505925638581496
--------------------- 4245 -----------------------
label: 1
score: 0.4049708938524057
--------------------- 4246 -----------------------
label: 1
score: 0.003218967952471546
--------------------- 4247 -----------------------
label: 3
score: 0.2527264202373031
--------------------- 4248 -----------------------
label: 4
score: 0.04625691429050335
--------------------- 4249 -----------------------
label: 2
score: 0.7198644614557348
--------------------- 4250 -----------------------
label: 2
score: 0.2625404463320882
--------------------- 4251 ----------------------

--------------------- 4406 -----------------------
label: 4
score: 0.040515502769323315
--------------------- 4407 -----------------------
label: 4
score: 0.04954748074855361
--------------------- 4408 -----------------------
label: 4
score: 0.07294545207514928
--------------------- 4409 -----------------------
label: 4
score: 0.02047575182689361
--------------------- 4410 -----------------------
label: 3
score: 0.07240253611374897
--------------------- 4411 -----------------------
label: 2
score: 0.47740631248703674
--------------------- 4412 -----------------------
label: 4
score: 0.07563037597867983
--------------------- 4413 -----------------------
label: 3
score: 0.026479260388655404
--------------------- 4414 -----------------------
label: 4
score: 0.0
--------------------- 4415 -----------------------
label: 4
score: 0.05384495195690906
--------------------- 4416 -----------------------
label: 4
score: 0.03241611421585404
--------------------- 4417 -----------------------
label:

--------------------- 4517 -----------------------
label: 4
score: 0.0
--------------------- 4518 -----------------------
label: 4
score: 0.0017778026206578086
--------------------- 4519 -----------------------
label: 3
score: 0.2765590977659088
--------------------- 4520 -----------------------
label: 4
score: 0.023310896239417688
--------------------- 4521 -----------------------
label: 4
score: 0.016262649105867694
--------------------- 4522 -----------------------
label: 4
score: 0.10686673665277213
--------------------- 4523 -----------------------
label: 3
score: 0.186418616563947
--------------------- 4524 -----------------------
label: 4
score: 0.049740643838897314
--------------------- 4525 -----------------------
label: 2
score: 0.1665972488676925
--------------------- 4526 -----------------------
label: 4
score: 0.008470080265288992
--------------------- 4527 -----------------------
label: 4
score: 0.0064818861022767765
--------------------- 4528 -----------------------
labe

--------------------- 4684 -----------------------
label: 4
score: 0.06334628626195524
--------------------- 4685 -----------------------
label: 3
score: 0.4158824734589932
--------------------- 4686 -----------------------
label: 1
score: 0.17280277093936178
--------------------- 4687 -----------------------
label: 2
score: 0.33885970922105324
--------------------- 4688 -----------------------
label: 4
score: 0.20579993986558273
--------------------- 4689 -----------------------
label: 2
score: 0.0
--------------------- 4690 -----------------------
label: 3
score: 0.18511268165507114
--------------------- 4691 -----------------------
label: 4
score: 0.3133562723160654
--------------------- 4692 -----------------------
label: 2
score: 0.2681869589117347
--------------------- 4693 -----------------------
label: 3
score: 0.10408731365503013
--------------------- 4694 -----------------------
label: 4
score: 0.0
--------------------- 4695 -----------------------
label: 1
score: 0.523697371

--------------------- 4820 -----------------------
label: 2
score: 0.31032705056463294
--------------------- 4821 -----------------------
label: 4
score: 0.13715701047313342
--------------------- 4822 -----------------------
label: 1
score: 0.4725374747614135
--------------------- 4823 -----------------------
label: 4
score: 0.1480128993358518
--------------------- 4824 -----------------------
label: 3
score: 0.31997349728047586
--------------------- 4825 -----------------------
label: 2
score: 0.4273341571608537
--------------------- 4826 -----------------------
label: 3
score: 0.23457943979816498
--------------------- 4827 -----------------------
label: 3
score: 0.36772859640345856
--------------------- 4828 -----------------------
label: 4
score: 0.12226862789686437
--------------------- 4829 -----------------------
label: 4
score: 0.021532190421893183
--------------------- 4830 -----------------------
label: 3
score: 0.2776188323691778
--------------------- 4831 -------------------

--------------------- 4943 -----------------------
label: 1
score: 0.570697703058467
--------------------- 4944 -----------------------
label: 2
score: 0.7975408859763827
--------------------- 4945 -----------------------
label: 2
score: 0.32897186329973316
--------------------- 4946 -----------------------
label: 1
score: 0.6982598492473218
--------------------- 4947 -----------------------
label: 1
score: 0.45569211780764884
--------------------- 4948 -----------------------
label: 1
score: 0.1452119111819586
--------------------- 4949 -----------------------
label: 1
score: 0.5798048004573527
--------------------- 4950 -----------------------
label: 2
score: 0.3437130691826319
--------------------- 4951 -----------------------
label: 3
score: 0.12063447302294074
--------------------- 4952 -----------------------
label: 2
score: 0.8136296720959686
--------------------- 4953 -----------------------
label: 4
score: 0.21435945722154895
--------------------- 4954 -----------------------


In [33]:
path = 'train/_TRAIN_text_cosSim_score.csv'
processed_df.to_csv(path,index=False)

processed_df

Unnamed: 0.1,Unnamed: 0,pair_id,url1_lang,url2_lang,title1,title2,keywords1,keywords2,text1,text2,...,Time,Narrative,Overall,Style,Tone,translated_body1,translated_body2,preprocessed_1,preprocessed_2,text_cos_sim
0,0,1484084337_1484110209,en,en,Virginia man arrested in fatal DUI crash in We...,Haiti’s leader marks independence day amid sec...,"law and order,reckless endangerment,transporta...","port au prince,latinamericaandcaribbean,jean,c...","MARTINSBURG, W.Va. — A suspected drunken drive...","PORT-AU-PRINCE, Haiti — Haitian President Jove...",...,1.0,4.0,4,16666666666666600,2.0,"MARTINSBURG, W.Va. — A suspected drunken drive...","PORT-AU-PRINCE, Haiti — Haitian President Jove...",martinsburg wva — suspect drunken driver arres...,portauprince haiti — haitian president jovenel...,0.064173
1,1,1484396422_1483924666,en,en,Guyana: Three injured after car crashes into u...,Fire kills more than 30 animals at zoo in west...,,"smg2_world,smg_europe,smg2_news",Share This On:\n\nPin 11 Shares\n\n(NEWS ROOM ...,BERLIN - A fire at a zoo in western Germany in...,...,1.0,4.0,4,16666666666666600,13333333333333300,Share This On: Pin 11 Shares (NEWS ROOM GUYA...,BERLIN - A fire at a zoo in western Germany in...,share pin 11 share news room guyana — three pe...,berlin fire zoo western germany first minute 2...,0.021774
2,2,1484698254_1483758694,en,en,Trump Brings In 2020 At Mar-a-Lago: ‘We’re Goi...,"Trump says he does not expect war with Iran, ‘...",,"full coverage 2020 us presidential elections,f...",(Breitbart) – President Donald Trump welcomed ...,"PALM BEACH, United States — US President Donal...",...,1.0,2333333333333330,2,1.0,13333333333333300,(Breitbart) – President Donald Trump welcomed ...,"PALM BEACH, United States — US President Donal...",breitbart – president donald trump welcome gue...,palm beach united states — we president donald...,0.292246
3,3,1576314516_1576455088,en,en,Zomato Buys Uber’s Food Delivery Business in I...,Indian Online Food Delivery Market to Hit $8 B...,zomatoubereatsbusinessacquisitionindiaallstock...,"swiggy,ber,indian online food delivery market ...",Uber has sold its online food-ordering busines...,Rapid digitisation and growth in both online b...,...,26666666666666600,16666666666666600,2,16666666666666600,16666666666666600,Uber has sold its online food-ordering busines...,Rapid digitisation and growth in both online b...,uber sell online foodordere business india loc...,rapid digitisation growth online buyer base sp...,0.413752
4,4,1484036253_1483894099,en,en,"India approves third moon mission, months afte...",India targets new moon mission in 2020,"india,lunarorbiter,isro,landonthemoon","india,space",BENGALURU (Reuters) - India has approved its t...,BANGALORE: India plans to make a fresh attempt...,...,1.0,1.25,1,1.0,1.0,BENGALURU (Reuters) - India has approved its t...,BANGALORE: India plans to make a fresh attempt...,bengaluru reuters india approve third lunar mi...,bangalore india plan make fresh attempt land u...,0.509818
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4959,4959,1586195445_1598778991,tr,tr,"BM, Aden'de 2 bini aşkın iç göçmenin selden za...",BM'den Yemen'de kadınların doğumda ölüm riski ...,"twitter,yemen,güncel,birleşmişmilletler,birleş...","haber,yemen,güncel,birleşmişmilletler,birleşmi...","BM, Aden'de 2 bini aşkın iç göçmenin selden za...",BM'den Yemen'de kadınların doğumda ölüm riski ...,...,2.0,4.0,3,1.0,1.0,"The UN announced that more than 2,000 domestic...","In Yemen, Women's Death Risk Warning Explanati...",un announce 2000 domestic migrant aden explain...,yemen women death risk warning explanation bir...,0.259547
4960,4960,1590915424_1590940388,tr,tr,Kovid-19'dan dolayı La Liga kulüplerinde hayat...,Fabio Capello: Koronavirüs sonrası La Liga'da ...,"laliga,la liga,i̇spanya,spor,futbol,realmadrid...","kovid 19,i̇spanya1futbolligi,la liga,laliga,ko...",Kovid-19'dan dolayı La Liga kulüplerinde hayat...,Yeni tip koronavirüs (Kovid-19) salgınının eko...,...,1.0,1.0,1,1.0,1.0,"Because of the Kovid-19, the Survival Struggle...",The new type of coronavirus (Kovid-19) is cons...,kovid19 survival struggle la liga club discuss...,new type coronavirus kovid19 consider severely...,0.988634
4961,4961,1526157103_1492737005,tr,tr,Saray da çare olmadı: 'Borca boğulan dev kulüp...,TFF’den jet yanıt! ''Bizi hedef gösteriyorlar'',"satiş,'borca,olmadi:,kulüpler,çare,da,masasınd...","tff,ahmet nur çebi,türkiye futbol federasyonu,...",\n\n\n\n\n\n\n\nİflas noktasındaki kulüplerin ...,"TFF, resmi internet sitesinden Beşiktaş'ın fai...",...,3.0,4.0,3,1.0,2.0,It is stated that the sales of the clubs at th...,TFF has published an explanation on the offici...,state sale club point bankrupt agenda besiktas...,tff publish explanation official website besik...,0.248100
4962,4962,1603274500_1618292937,tr,tr,Ergene Belediyesi yol çalışmalarına aksatmadan...,Ergene'de Ahimehmet ve Yeşiltepe mahallelerind...,"tekirdağ,ergene,rasimyüksel,güncel,koronavirüs...","yeşiltepe,yaşam,koronavirüs,haber",Ergene Belediyesi yol çalışmalarına aksatmadan...,Ergene'de Ahimehmet ve Yeşiltepe mahallelerind...,...,3.0,3.0,3,1.0,1.0,Ergene Municipality continues without disrupti...,"In Ergene, the mask was distributed in Ahimehm...",ergene municipality continue without disrupt r...,ergene mask distribute ahimehmet yeşiltepe nei...,0.742618
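
The `text_cos_sim` column holds one tf-idf cosine similarity score per article pair. As a minimal sketch of how such a per-pair score can be obtained with the scikit-learn tools imported at the top of the notebook: the helper name `pair_cosine_scores`, the toy texts, and the choice to fit a single vectorizer on both text columns are assumptions for illustration, not necessarily the exact procedure used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pair_cosine_scores(texts_1, texts_2):
    """Return one tf-idf cosine similarity per (texts_1[i], texts_2[i]) pair.

    Hypothetical helper: fits one vectorizer on all texts so that both
    sides of every pair share the same vocabulary and idf weights.
    """
    vectorizer = TfidfVectorizer()
    vectorizer.fit(list(texts_1) + list(texts_2))
    m1 = vectorizer.transform(texts_1)
    m2 = vectorizer.transform(texts_2)
    # compare row i of the first matrix with row i of the second only
    return [
        float(cosine_similarity(m1[i], m2[i])[0, 0])
        for i in range(m1.shape[0])
    ]


# toy preprocessed texts, for illustration only (not from the dataset)
toy_1 = ["fire zoo germany animal", "india moon mission"]
toy_2 = ["fire zoo germany animal", "football club debt"]
scores = pair_cosine_scores(toy_1, toy_2)
print(scores)
```

Identical texts score (up to rounding) 1, and texts with no shared vocabulary score 0, which is why many clearly unrelated pairs in the listing above show `score: 0.0`.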


In [31]:
# split the pairs by their annotated overall score (4.0 = very dissimilar)
compare_4 = compare_df[compare_df['overall']==4.0]
compare_3 = compare_df[compare_df['overall']==3.0]
compare_2 = compare_df[compare_df['overall']==2.0]
compare_1 = compare_df[compare_df['overall']==1.0]
# sanity check: the two lengths should agree if every pair has an integer rating
len(compare_df)
len(compare_4) + len(compare_3) + len(compare_2) +len(compare_1)

i_list = [1.0,2.0,3.0,4.0]
compare = [compare_1,compare_2,compare_3,compare_4]
col = 'text_cos_sim'
# for each rating, count the pairs whose cosine similarity falls in each bucket
for i in i_list:
    index = int(i)-1
    c = compare[index]
    print('Number of pairs that are rated as ',i,':')
    print(len(c))
    print('Pairs that are rated as ',i,', text cosine similarity score is 0: (value, percentage)')
    print(len(c[c[col]==0]), ', ', len(c[c[col]==0])/len(c)*100,'%')
    print('Pairs that are rated as ',i,', text cosine similarity score is > 0.3:')
    print(len(c[c[col]>0.3]), ', ', len(c[c[col]>0.3])/len(c)*100,'%')
    print('Pairs that are rated as ',i,', text cosine similarity score is > 0.5:')
    print(len(c[c[col]>0.5]), ', ', len(c[c[col]>0.5])/len(c)*100,'%')
    print('Pairs that are rated as ',i,', text cosine similarity score is 1.0:')
    print(len(c[c[col]==1.0]), ', ', len(c[c[col]==1.0])/len(c)*100,'%')

Number of pairs that are rated as  1.0 :
963
Pairs that are rated as  1.0 , text cosine similarity score is 0: (value, percentage)
35 ,  3.6344755970924196 %
Pairs that are rated as  1.0 , text cosine similarity score is > 0.3:
823 ,  85.46209761163031 %
Pairs that are rated as  1.0 , text cosine similarity score is > 0.5:
507 ,  52.64797507788161 %
Pairs that are rated as  1.0 , text cosine similarity score is 1.0:
1 ,  0.10384215991692627 %
Number of pairs that are rated as  2.0 :
957
Pairs that are rated as  2.0 , text cosine similarity score is 0: (value, percentage)
32 ,  3.343782654127482 %
Pairs that are rated as  2.0 , text cosine similarity score is > 0.3:
591 ,  61.75548589341693 %
Pairs that are rated as  2.0 , text cosine similarity score is > 0.5:
217 ,  22.675026123301986 %
Pairs that are rated as  2.0 , text cosine similarity score is 1.0:
0 ,  0.0 %
Number of pairs that are rated as  3.0 :
1067
Pairs that are rated as  3.0 , text cosine similarity score is 0: (value, percentage)
54 ,  5.060918462980318 %


### Remarks: <br>

As can already be observed, the cosine similarity score is broadly consistent with the annotated *Overall* score. <br>

For instance, consider the pairs whose similarity was annotated as 4.0 (very dissimilar): around 93% of them have a cosine similarity score between 0 and 0.3.
On the other hand, about 53% of the pairs annotated as 1.0 received a similarity score > 0.5, and about 85% of them a score > 0.3. <br><br>

One interpretation is that, for this dataset, cosine similarity agrees very closely with the annotations on **dissimilar** pairs. Identifying similar articles, however, seems to be a more subtle task for human annotators, and this is reflected in the wider spread of cosine similarity scores for those pairs.
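
The per-rating percentages discussed in these remarks can also be collected in a single table. The following is a minimal sketch, assuming a dataframe with an `overall` rating column and a `text_cos_sim` score column as in `compare_df`; the helper name `summarize_by_rating` and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd


def summarize_by_rating(df, label_col="overall", score_col="text_cos_sim"):
    """Per rating: number of pairs and share (in %) of pairs per score bucket."""
    grouped = df.groupby(label_col)[score_col]
    return pd.DataFrame({
        "n_pairs": grouped.size(),
        "pct_zero": grouped.apply(lambda s: (s == 0).mean() * 100),
        "pct_le_0.3": grouped.apply(lambda s: (s <= 0.3).mean() * 100),
        "pct_gt_0.5": grouped.apply(lambda s: (s > 0.5).mean() * 100),
    })


# toy data, for illustration only (not taken from the training set)
toy_df = pd.DataFrame({
    "overall": [1.0, 1.0, 4.0, 4.0, 4.0, 4.0],
    "text_cos_sim": [0.7, 0.4, 0.0, 0.1, 0.2, 0.6],
})
summary = summarize_by_rating(toy_df)
print(summary)
```

Calling `summarize_by_rating(compare_df)` would give one row per annotated rating, so the share of pairs rated 4.0 with a score of at most 0.3 and the share of pairs rated 1.0 with a score above 0.5 can be read off side by side.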