
## Extracting Named Entities (evaluation) <br>

We use named entities (NEs) extracted from both the translated title and the translated body text of each article in the evaluation dataset.

The entities are extracted with [spaCy](https://spacy.io/).


### Extracting NEs from translated titles <br>

As one of our features, we extract named entities from each translated title with [spaCy](https://spacy.io/) and measure how many of them the two titles of a pair have in common. However, title NE recognition has a few problems:

>**Issue 1.**
spaCy relies heavily on capitalization, so the first word of a title is often not recognised as a named entity:
e.g. the first word in _"Taoiseach makes first general election campaign pitch in New Year video"_ is not extracted.

>**Solution:**
We add "In fact, " at the beginning of every title. The drawback is that when the first word of the title is not an NE, its capital letter can lead spaCy to recognise it as one anyway. We preferred extracting a few spurious NEs over missing important ones.
---------------------------
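
The prefixing workaround for Issue 1 can be sketched as a small helper. This is only an illustration: `title_entities` and `PREFIX` are hypothetical names, and the pipeline is passed in as a parameter so that any callable behaving like a loaded spaCy model (returning an object with `.ents`, each entity exposing `.text` and `.label_`) can be used.

```python
from typing import Callable, List, Tuple

# Filler phrase from the solution above; it moves the real first word of the
# title out of sentence-initial position.
PREFIX = "In fact, "


def title_entities(nlp: Callable, title: str, prefix: str = PREFIX) -> List[Tuple[str, str]]:
    """Run an NER pipeline on the prefixed title and return (text, label) pairs."""
    doc = nlp(prefix + title)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Usage with spaCy (assumes the en_core_web_sm model is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   title_entities(nlp, "Taoiseach makes first general election campaign pitch in New Year video")
```

Which entities come back depends on the model and its version, so no output is shown here.
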
>**Issue 2.**
Some titles capitalize every word except prepositions and conjunctions, e.g. _"Zomato Buys Uber's Food Delivery Business in India in an All-Stock Deal"_. Because spaCy relies heavily on capitalization, it extracts _"Zomato Buys Uber's"_, _"Food Delivery Business"_ and _"India"_ from this title.

>**Solution:**
We tried truecasing the titles with **Stanford NLP's Stanza**, but decided against it for three reasons. First, from the truecased _"In fact, Zomato buys Uber's Food Delivery Business in India in an all - stock deal"_, spaCy can still only extract "India". Second, running the truecaser on every title, including those that are already truecased, can leave all nouns capitalized, which affects spaCy greatly. Third, when a fully capitalized title contains no NE (e.g. "Warm For The Season For Now, Some Wet Snow This Weekend"), no unwanted NE is extracted anyway (we chose NOT to extract certain entity classes: 'CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'PERCENT', 'QUANTITY', 'TIME'). We therefore decided not to truecase, as we would gain very little while creating new issues.
-------------------------
>**Issue 3.**
The extracted NEs need to be standardized before they can be compared.

>**Solution:**
We lowercase them, remove "'s" and "." (so that "US" and "U.S." match), and replace "-" with a space.
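
As a minimal sketch of this normalization, the helper below applies the same string operations in the same order as the extraction cell further down. The function name `normalize_entity` is introduced here for illustration only.

```python
def normalize_entity(text: str) -> str:
    """Lowercase an entity, drop "'s" and ".", and replace "-" with a space."""
    return text.lower().replace("'s", "").replace("-", " ").replace(".", "")


for raw in ["U.S.", "US", "Kim Jong-a", "Uber's"]:
    print(repr(raw), "->", repr(normalize_entity(raw)))
# 'U.S.' -> 'us'
# 'US' -> 'us'
# 'Kim Jong-a' -> 'kim jong a'
# "Uber's" -> 'uber'
```

Only the straight apostrophe form `'s` is removed; a typographic apostrophe (`’s`) is left untouched by these replacements.
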
    

In [7]:
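import pandas as pd
import spacy   # pip install spacy
import stanza  # pip install stanza

stanza.download('en')  # Stanza English models (used for the truecasing experiment in Issue 2)
# The spaCy model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')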


2022-03-17 15:48:01 INFO: Downloading default packages for language: en (English)...
2022-03-17 15:48:04 INFO: File exists: /Users/chung-fantsai/stanza_resources/en/default.zip.
2022-03-17 15:48:10 INFO: Finished downloading models and saved to /Users/chung-fantsai/stanza_resources.


In [6]:
title_df = pd.read_csv('eval/_EVAL_title_translated.csv')
title_df

Unnamed: 0,pair_id,url1_lang,google_lang1,url2_lang,google_lang2,title1,title2,translated_title1,translated_title2
0,1484189203_1484121193,en,en,en,en,Police: 2 men stole tools from Lowe’s in Davie,No-swim advisory lifted for Deerfield Beach Pier,Police: 2 men stole tools from Lowe’s in Davie,No-swim advisory lifted for Deerfield Beach Pier
1,1484011097_1484011106,en,en,en,en,"Open database leaked 179GB in customer, US gov...",Best Western’s Massive Data Leak: 179GB Amazon...,"Open database leaked 179GB in customer, US gov...",Best Western’s Massive Data Leak: 179GB Amazon...
2,1484039488_1484261803,en,en,en,en,Ducks are own worst enemies in sloppy loss in ...,Woody Guthrie's 1943 New Year's Resolutions ar...,Ducks are own worst enemies in sloppy loss in ...,Woody Guthrie's 1943 New Year's Resolutions ar...
3,1484332324_1484796748,en,en,en,en,Another Bengal vs Centre tussle? Govt rejects ...,'Congress Rejected 7 Times': BJP's Reminder as...,Another Bengal vs Centre tussle? Govt rejects ...,'Congress Rejected 7 Times': BJP's Reminder as...
4,1484012256_1484419682,en,en,en,en,Bars and clubs you loved and lost this decade ...,Top 20 films of the 2010s,Bars and clubs you loved and lost this decade ...,Top 20 films of the 2010s
...,...,...,...,...,...,...,...,...,...
4897,1553907621_1553488848,es,es,it,it,Denver Nuggets reporta que “un miembro de la o...,"Coronavirus, un caso anche fra i Denver Nuggets","Denver Nuggets reports that ""a member of the o...","Coronavirus, a case also among the Denver Nuggets"
4898,1646957948_1643667075,es,es,it,it,Vivir en España es más barato que en la media ...,"Coronavirus, in Europa 140mila morti in più in...",Living in Spain is cheaper than in the Europea...,"Coronavirus, in Europe 140 thousand dead in Ma..."
4899,1504063453_1502866628,es,es,it,it,Activan sistema de vigilancia epidemiológica e...,Coronavirus in Cina: consumo di serpenti e zup...,Activate epidemiological surveillance system i...,Coronavirus in China: consumption of snakes an...
4900,1647862428_1647712939,es,es,it,it,Emite Irán orden de arresto contra Tump por as...,Iran emette mandato di arresto per Trump per l...,Emite Iran Arrest Order against Tump by Genera...,Iran emits stop mandate for trump for the murd...


We exclude the following named-entity classes: _"CARDINAL"_, _"DATE"_, _"MONEY"_, _"ORDINAL"_, _"PERCENT"_, _"QUANTITY"_, and _"TIME"_. <br><br>

The reason is that NEs in these categories are hard to normalize. For instance, recognizing that _"20/3/22"_ and _"the previous Sunday"_ refer to the same date is not straightforward. Moreover, we agreed that the remaining named entities, such as the proper nouns of people or places, should be enough to capture the basic facts of the story.
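
The label filter can be illustrated without running spaCy at all. In the sketch below, the `(text, label)` pairs are hypothetical stand-ins for spaCy's `ent.text` and `ent.label_`, and `IGNORE_LABELS` and `keep_entity` are names introduced only for this example.

```python
# Entity classes that are dropped because they are hard to normalize
IGNORE_LABELS = {"CARDINAL", "DATE", "MONEY", "ORDINAL", "PERCENT", "QUANTITY", "TIME"}


def keep_entity(label: str) -> bool:
    """Return True if an entity with this label should be kept."""
    return label not in IGNORE_LABELS


# Hypothetical (text, label) pairs, not actual model output
entities = [
    ("Brazil", "GPE"),
    ("January", "DATE"),
    ("1.1%", "PERCENT"),
    ("Bill Gates", "PERSON"),
]

kept = [text for text, label in entities if keep_entity(label)]
print(kept)  # ['Brazil', 'Bill Gates']
```
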

In [8]:

ignore_labels = ['CARDINAL','DATE', 'MONEY', 'ORDINAL', 'PERCENT', 'QUANTITY', 'TIME']

def overlapScore(key1, key2):
    """Overlap coefficient of two entity lists: |A ∩ B| / min(|A|, |B|)."""
    key1 = set(key1)
    key2 = set(key2)
    interset = key1.intersection(key2)

    # An empty intersection also covers the case where either set is empty
    if not interset:
        return 0
    # Divide by the smaller set, so a title whose entities all appear in the other title scores 1.0
    return len(interset) / min(len(key1), len(key2))

title_key_columns = ["pair_id", "url1_lang", "google_lang1", "url2_lang", "google_lang2", "translated_title1", "translated_title2", "key1", "key2", "key_score"]
entries = []  # rows are collected here; DataFrame.append was removed in pandas 2.0

def normalize_entity(text):
    # Issue 3: lowercase, drop "'s" and ".", and replace "-" with a space
    return text.lower().replace("'s", "").replace("-", " ").replace(".", "")

for index, row in title_df.iterrows():
    issue1_solve = 'In fact, '  # Issue 1: prefix so the title's first word is no longer sentence-initial
    title1 = issue1_solve + row['translated_title1']
    title2 = issue1_solve + row['translated_title2']
    doc1 = nlp(title1)
    doc2 = nlp(title2)
    ner1 = [normalize_entity(ent.text) for ent in doc1.ents if ent.label_ not in ignore_labels]
    ner2 = [normalize_entity(ent.text) for ent in doc2.ents if ent.label_ not in ignore_labels]
    score = overlapScore(ner1, ner2)
    pair = row['pair_id']
    if score > 0:
        # Print only the pairs that share at least one entity
        print('---------------------', index, '-----------------------')
        print(pair)
        print(title1)
        print(ner1)
        print(title2)
        print(ner2)
        print(score)

    entries.append({"pair_id": pair, "url1_lang": row['url1_lang'], "google_lang1": row['google_lang1'], "url2_lang": row['url2_lang'], "google_lang2": row['google_lang2'], "translated_title1": row["translated_title1"], "translated_title2": row["translated_title2"], "key1": ner1, "key2": ner2, "key_score": score})

title_key_df = pd.DataFrame(entries, columns=title_key_columns)
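
The score computed by `overlapScore` is the overlap coefficient: the number of shared entities divided by the size of the smaller entity set. The standalone sketch below (the name `overlap_coefficient` is introduced here for illustration) recomputes the score for two of the pairs printed in the output that follows, using the entity lists exactly as they appear there.

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|), or 0.0 when either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Pair at index 49: only 'brighton' is shared, and the smaller set has 3 entities
ner1 = ['jahanbakhsh', 'brighton', 'chelsea 1 1']
ner2 = ['usmnt', 'christian pulisic', 'chelsea', 'brighton']
print(overlap_coefficient(ner1, ner2))  # 0.3333333333333333

# Pair at index 21: the single entity of the first title also appears in the second
print(overlap_coefficient(['telegram'],
                          ['telegram', 'theme editor', 'online', 'technology news']))  # 1.0
```

Because the denominator is the smaller set, a title with a single entity scores 1.0 as soon as that entity appears in the other title. Converting to sets also means repeated entities such as `['china', 'china']` count once.
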
    

--------------------- 21 -----------------------
1484190460_1483806119
In fact, Telegram release a major update for the new year (changelog)
['telegram']
In fact, Telegram adds bunch of features including new Theme Editor, Send When Online, Podcast support and much more- Technology News, Firstpost
['telegram', 'theme editor', 'online', 'technology news']
1.0
--------------------- 49 -----------------------
1484037143_1484201620
In fact, Jahanbakhsh nets overhead kick as Brighton holds Chelsea 1-1
['jahanbakhsh', 'brighton', 'chelsea 1 1']
In fact, USMNT's Christian Pulisic back in Chelsea XI for 1-1 draw away to Brighton
['usmnt', 'christian pulisic', 'chelsea', 'brighton']
0.3333333333333333
--------------------- 53 -----------------------
1483806302_1483770632
In fact, Rep. John Lewis Of Atlanta Battling Stage 4 Pancreatic Cancer
['john lewis', 'atlanta battling stage 4 pancreatic cancer']
In fact, John Lewis represents an entire generation America can't afford to lose
['john lewis',

--------------------- 217 -----------------------
1562483638_1611624784
In fact, Zelensky voiced three scenarios with coronavirus in Ukraine.Video
['zelensky', 'ukraine']
In fact, Mortgage under 10%: Zelensky announced cheap loans for Ukrainians
['mortgage', 'zelensky', 'ukrainians']
0.5
--------------------- 222 -----------------------
1642220033_1618977537
In fact, In Krasnodar, found the missing Sherzod of Igamediev
['krasnodar', 'sherzod of igamediev']
In fact, In an opponent's opposite: In Krasnodar, the police are looking for a malicious violator of traffic rules
['krasnodar']
1.0
--------------------- 223 -----------------------
1609501071_1609409437
In fact, In the Ministry of Defense of the Russian Federation, they told how preparing for the Army 2020
['the ministry of defense', 'the russian federation', 'army']
In fact, The Ministry of Defense announced the most ambitious exercises in 2020
['the ministry of defense']
1.0
--------------------- 235 -----------------------
16253

--------------------- 321 -----------------------
1605048410_1594000116
In fact, Council's ceasefire agreement compromise China attitude into key
['council', 'china']
In fact, China-US contradictions do not solve the United Nations draft aunsence
['china', 'us', 'united nations']
0.5
--------------------- 334 -----------------------
1643509381_1619126052
In fact, Lin Zheng Yue: Hong Kong is still the ideal place for multinational corporate development after the implementation of national security law
['lin zheng yue', 'hong kong']
In fact, Lin Zheng Yue: Welcomes the decision of the National Congress through the establishment of a legal system and implementation mechanism for maintaining national security in Hong Kong SAR
['lin zheng yue', 'the national congress', 'hong kong']
1.0
--------------------- 335 -----------------------
1632977953_1630872412
In fact, Taiwan is surprisingly surprising to the US plane Beijing protests illegal
['taiwan', 'us', 'beijing']
In fact, Taiwan National

--------------------- 503 -----------------------
1514562858_1538268930
In fact, Consumer prices in Brazil go up in January, but less than expected
['brazil']
In fact, GDP of Brazil loses strength in fourth quarter but close 2019 with progress of 1.1%
['brazil']
1.0
--------------------- 504 -----------------------
1611623641_1614996201
In fact, Pillad two men disguised as Boya swimming in the sea in full confinement
['boya']
In fact, Pillad two men disguised as Boya to swim at sea during confinement
['boya']
1.0
--------------------- 509 -----------------------
1504781389_1504973190
In fact, Ministry of Health reported on the situation with Coronavirus near the border of Russia
['ministry of health', 'coronavirus', 'russia']
In fact, Moscow Department: Coronavirus has no found from China from China
['moscow department', 'coronavirus', 'china', 'china']
0.3333333333333333
--------------------- 512 -----------------------
1636095223_1637064680
In fact, What kind of conflict between poli

--------------------- 583 -----------------------
1646745400_1646298579
In fact, Medvedev voted by amendments to the Constitution without mask and gloves
['medvedev', 'constitution']
In fact, Medvedev urged the youth of the Russian Federation to participate in the voting on the Constitution
['medvedev', 'the russian federation', 'the constitution']
0.5
--------------------- 585 -----------------------
1527110798_1623237051
In fact, In Chechnya will open a monument to the dead health workers
['chechnya']
In fact, In Chechnya, the mosques will open and will introduce funerals for rites
['chechnya', 'mosques']
1.0
--------------------- 588 -----------------------
1497287604_1497506528
In fact, Under Kiev, a man attacks people: the inhabitant of Buchie with an iron hook broke his head
['kiev', 'buchie']
In fact, Under Kiev, an inadequate man attacks people (photo)
['kiev']
1.0
--------------------- 591 -----------------------
1629620910_1606103662
In fact, Anabolics, transported under the 

--------------------- 705 -----------------------
1566332247_1489125178
In fact, USA deals with conflict in Iraq
['usa', 'iraq']
In fact, Iranian Revolutionary Guards attack US military in Iraq
['iranian revolutionary guards', 'us', 'iraq']
0.5
--------------------- 722 -----------------------
1531750952_1525081230
In fact, Parties home associations of the CDU top competitors are positioning themselves
['cdu']
In fact, Röttgen optimistic about membership survey on CDU chair
['röttgen', 'cdu']
1.0
--------------------- 725 -----------------------
1581654542_1619154724
In fact, Success in Hong Kong - Million metropolis handles pandemic without complete lockdown
['success', 'hong kong']
In fact, China threatens USA with countermeasures due to interference in Hong Kong
['china', 'usa', 'hong kong']
0.5
--------------------- 726 -----------------------
1549534843_1503362459
In fact, How South Korea defeated the virus
['south korea']
In fact, Second case with novel coronavirus in South Korea

--------------------- 833 -----------------------
1606294927_1606041212
In fact, Philippsburg: So the blasting of the AKW cooling towers (plus video / update)
['philippsburg', 'akw']
In fact, Blowing in Philippsburg: The cooling towers of the nuclear power plant are no longer
['blowing', 'philippsburg']
0.5
--------------------- 842 -----------------------
1639022132_1639205246
In fact, Rainer Langhans pleads for "spiritual sex"
['rainer langhans']
In fact, Rainer Langhans: "At least people stop now"
['rainer langhans']
1.0
--------------------- 845 -----------------------
1618017338_1617888514
In fact, Bayern wants to target anti-Semitic offenses
['bayern', 'anti semitic']
In fact, New guideline for detection of anti-Semitic crimes
['anti semitic']
1.0
--------------------- 848 -----------------------
1551309853_1550872830
In fact, The EU decides instant entry stop
['eu']
In fact, France takes a forward EU decision on the entry restriction
['france', 'eu']
1.0
--------------------- 85

--------------------- 1090 -----------------------
1572079895_1572741006
In fact, Cuba passes to the limited autochthonous transmission stage of the new coronavirus
['cuba']
In fact, Cuba has Reached Limited Local Transmission Stage of COVID-19
['cuba', 'reached limited local transmission stage']
1.0
--------------------- 1094 -----------------------
1622566508_1623715440
In fact, The Liverpool campus dedicated an emotional homage to George Floyd
['liverpool', 'george floyd']
In fact, Liverpool take a knee to protest George Floyd's death
['george floyd']
1.0
--------------------- 1101 -----------------------
1573959882_1642941640
In fact, Fire in the attic of a villa in Monfalcone, the firefighters intervene
['monfalcone']
In fact, Monfalcone, elderly invested by a car: it is serious
['monfalcone']
1.0
--------------------- 1106 -----------------------
1544911607_1544697909
In fact, Coronavirus, rixi: "No in half measures: total closure only road to contain viruses"
['coronavirus']
In 

--------------------- 1209 -----------------------
1569652448_1569668614
In fact, Reopen churches at Easter?All against Salvini (including Catholics)
['easter?all', 'salvini', 'catholics']
In fact, Salvini proposes the churches open at Easter, the reaction of the mayor of Milan
['salvini', 'easter', 'milan']
0.3333333333333333
--------------------- 1211 -----------------------
1588247062_1584872661
In fact, Kim Jong-a, mystery succession: who leaders North Korea after him?
['kim jong a', 'north korea']
In fact, Kim Jong-a "in serious condition".Cautious the reactions of Beijing and Seoul
['kim jong a', 'condition"', 'beijing', 'seoul']
0.5
--------------------- 1219 -----------------------
1579626390_1579715693
In fact, Coronavirus, Salvini pushes Lombardy to reopen: "The government listens to requests".Zingaretti: "National rules and times, enough with crafts"
['coronavirus', 'salvini', 'lombardy', 'requests"', 'zingaretti']
In fact, Step 2, the wrath of zingaretti: "No to cunning, th

--------------------- 1327 -----------------------
1568908955_1558994751
In fact, We don't do it like Italy.Russian chronicle from the front of Bergamo
['italy', 'russian', 'bergamo']
In fact, Russian Ambassador insists, Italy does nothing for aid against Coronavirus
['russian', 'italy']
1.0
--------------------- 1331 -----------------------
1608458444_1573637125
In fact, Varchi and parking in the center, VCS returns to charge
['varchi', 'vcs']
In fact, VCS: "The tables of the premises?For us they must be removed not to give the possibility of assembling "
['vcs']
1.0
--------------------- 1332 -----------------------
1585470371_1609591753
In fact, Borgo San Dalmazzo prepares to celebrate April 25th
['borgo san dalmazzo']
In fact, Borgo San Dalmazzo: New for the non-food market coming
['borgo san dalmazzo']
1.0
--------------------- 1333 -----------------------
1505010765_1491456788
In fact, Carlo Tansi / who is the independent candidate in the Calabria regional elections
['carlo tansi

--------------------- 1431 -----------------------
1543125435_1557534283
In fact, Covid-19 in Neptune, the mayor to citizens: "Enough of irresponsible behavior"
['neptune']
In fact, Third death to Coronavirus in Neptune.The mayor: "It is the blackest day, we resist"
['coronavirus', 'neptune']
1.0
--------------------- 1435 -----------------------
1633060840_1639577708
In fact, Covid, the hundred days of Naples in the book of "Repubblica" free on newsstands
['covid', 'naples', 'repubblica']
In fact, "Covid, the hundred days of Naples": Saturday the book as a gift with "Repubblica"
['covid', 'naples', 'repubblica']
1.0
--------------------- 1436 -----------------------
1499605250_1516119414
In fact, Favara, fire bursts in a house: fear and damage
['favara']
In fact, Favara, overturned by car: 33-year-old wounded in the hospital
['favara']
1.0
--------------------- 1438 -----------------------
1602441998_1608003130
In fact, "Phase2, government fixed general lines"
['phase2']
In fact, Phas

--------------------- 1521 -----------------------
1569318796_1566410351
In fact, American scholar: Coronary virus exposes the misjudgment of the US government on national security issues
['american', 'us']
In fact, Op-Ed: Coronavirus makes clear how the U.S. government has focused on the wrong threats
['op ed: coronavirus', 'us']
0.5
--------------------- 1522 -----------------------
1499576580_1499757755
In fact, Vietnam + (VietnamPlus)
['vietnam', 'vietnamplus']
In fact, Programme highlights outcomes of Vietnam’s participation in UN peacekeeping
['programme', 'vietnam', 'un']
0.5
--------------------- 1544 -----------------------
1487397562_1487009926
In fact, The United States was attacked by three Americans in Kenya.
['the united states', 'americans', 'kenya']
In fact, Terrorists attack military base in Kenya that houses U.S. soldiers
['kenya', 'us']
0.5
--------------------- 1555 -----------------------
1599293910_1532123241
In fact, Beijing has no new diagnosis case in 20 consec

--------------------- 1651 -----------------------
1504633843_1487664388
In fact, "As long as I am alive and want" - the fight of Evo Morales around his political heritage
['evo morales']
In fact, Zaffaroni and Ferreyra to act as legal advisors to Evo Morales
['zaffaroni', 'ferreyra', 'evo morales']
1.0
--------------------- 1663 -----------------------
1525066448_1525118734
In fact, #twitterbeef - Bill Gates upset with new electric car the billionaire colleagues Elon Musk
['bill gates', 'elon musk']
In fact, Bill Gates says not fond of Tesla cars and won't buy one, Elon Musk says he finds Gates underwhelming
['bill gates', 'elon musk', 'gates']
1.0
--------------------- 1666 -----------------------
1589915984_1588790103
In fact, Share in focus: planned rescue package helps Lufthansa to recovery attempt
['lufthansa']
In fact, Merkel, Politicians to Discuss Rescue Package for Lufthansa ⋆
['merkel', 'politicians', 'discuss rescue package', 'lufthansa']
1.0
--------------------- 1685 ----

--------------------- 1808 -----------------------
1602229722_1601664650
In fact, Rams always wait public for opening their stadium
['rams']
In fact, Rams still hoping for guests at SoFi Stadium opening party
['rams', 'sofi stadium']
1.0
--------------------- 1817 -----------------------
1584972828_1592245127
In fact, Galicia reaches 472 deceased with Covid-19
['galicia']
In fact, Doctors in Galicia surprised to find very few cases of Covid-19 after rapid testing in North-West Spain
['galicia', 'north west spain']
1.0
--------------------- 1819 -----------------------
1584006797_1548280049
In fact, TRump will nominate the exrespable of the BM as an Ambassador of the USA. In Panama
['trump', 'bm', 'usa', 'panama']
In fact, Trump tested for coronavirus
['trump']
1.0
--------------------- 1820 -----------------------
1618301181_1618253552
In fact, Larry Kramer, famous writer and activist against AIDS died at 84 years
['larry kramer']
In fact, Larry Kramer, longtime AIDS activist, dies at 

--------------------- 1926 -----------------------
1606884717_1603601202
In fact, They stop 39 military deserters in Venezuela
['venezuela']
In fact, Guaidó advisers quit following bungled Venezuela raid
['guaidó', 'venezuela']
1.0
--------------------- 1929 -----------------------
1611146782_1555055620
In fact, Venezuela and Russia review progress from your cooperation projects
['venezuela', 'russia']
In fact, Maduro Says Russia Will Send Humanitarian Aid to Venezuela Next Week Amid COVID-19 Outbreak
['maduro says', 'russia']
0.5
--------------------- 1939 -----------------------
1541865610_1542610534
In fact, ATP and WTA announce prevention measures against Coronavirus
['atp', 'coronavirus']
In fact, ATP, WTA tennis tournament cancelled over coronavirus concerns
['atp']
1.0
--------------------- 1940 -----------------------
1554210406_1554349331
In fact, Closing Bangladesh's largest brothel by the coronavirus
['bangladesh']
In fact, Bangladesh shuts largest brothel over coronavirus f

--------------------- 2046 -----------------------
1508030664_1532100968
In fact, They detect in Germany the first case in Europe of Wuhan Coronavirus in a person who did not travel to China
['germany', 'europe', 'wuhan coronavirus', 'china']
In fact, Germany Confirms First Coronavirus Case, May Be the First Human-To-Human Transmission in Europe
['germany confirms', 'europe']
0.5
--------------------- 2053 -----------------------
1533893476_1515814278
In fact, Greece: antimigrant riots on the islands
['greece']
In fact, UNHCR urges Greece to improve refugee conditions
['unhcr', 'greece']
1.0
--------------------- 2057 -----------------------
1506275423_1495356119
In fact, The officer pretended for 18 years and worked in NATO headquarters.Scandal in Sweden
['nato', 'sweden']
In fact, 'Not Good': Sweden Admits Sending Fake Officer to NATO
['sweden', 'nato']
1.0
--------------------- 2058 -----------------------
1628537385_1621281855
In fact, Visit little-known tourist destinations in Jap

--------------------- 2176 -----------------------
1592655619_1498998087
In fact, The first Jordan in the region is in the uniform of transparency in the open budget index
['jordan']
In fact, Discussion of policies paper for leadership and emerging companies in Jordan
['jordan']
1.0
--------------------- 2178 -----------------------
1590093007_1590312339
In fact, As part of the "Protection" initiative for the Corona respondents in the lake .. Distribution of 1473 kitted food
['protection', 'corona']
In fact, «Kratin, Antihysemics and Financial Aid» for 1100 families from Corona in the lake
['«kratin, antihysemics and financial aid»', 'corona']
0.5
--------------------- 2182 -----------------------
1558315635_1560709888
In fact, "Higher Education" launches an initiative "Dish your idea" .. and 30 million pounds to support innovations to confront Corona
['higher education', 'dish']
In fact, Scientific research: The door of progress for the initiative of "Dish your idea" for April 2
['dis

--------------------- 2291 -----------------------
1494597539_1492788958
In fact, Politisprominence travels to the WEF - probably no Iranian representatives in Davos
['wef', 'iranian', 'davos']
In fact, Merkel: Germany will endeavor to maintain the Iranian atom agreement
['merkel', 'germany', 'iranian']
0.3333333333333333
--------------------- 2298 -----------------------
1583440515_1580886735
In fact, Brazil: Government pendants demand military intervention
['brazil']
In fact, Dispute over Corona Course Brazil: Bolsonaro fires his Minister of Health
['corona', 'brazil', 'bolsonaro']
1.0
--------------------- 2300 -----------------------
1624266854_1592059212
In fact, DGAP-News: German Living SE: German Wohnen Published Sustainability Report 2019 (German)
['dgap news', 'german', 'german wohnen published sustainability report', 'german']
In fact, DGAP-News: German living SE: Annual General Meeting is carried out on June 5, 2020 virtual (German)
['dgap news', 'german', 'german']
1.0
----

--------------------- 2437 -----------------------
1521660733_1542050751
In fact, Erdogan wants to get rid of Syrian refugees, but Putin knows how to prevent that
['erdogan', 'syrian', 'putin']
In fact, Meeting between Putin and Erdogan to the Syria Conflict - what was agreed?
['putin', 'erdogan', 'the syria conflict  ']
0.6666666666666666
--------------------- 2442 -----------------------
1518939923_1519833328
In fact, Crypto Leaks: Historian Thomas Buomberger criticizes authorities
['crypto leaks', 'historian', 'thomas buomberger']
In fact, Crypto Leaks: Ex-NBD boss still defended Crypto AG 2016
['crypto leaks', 'ex nbd', 'crypto ag 2016']
0.3333333333333333
--------------------- 2458 -----------------------
1484633289_1643303349
In fact, POL-MS: Unknown offenders steal scaffolding components worth more than 50,000 euros
['pol ms']
In fact, POL-MS: Car bends - child brakes and crashes - police are looking for witnesses
['pol ms']
1.0
--------------------- 2459 -----------------------

--------------------- 2548 -----------------------
1618774655_1618772368
In fact, Jean-Jacques Bourdin: The journalist arrested by the police!
['jean jacques bourdin']
In fact, Journalist Jean-Jacques Bourdin verbalized for speeding (186 km / h) and exceeding the 100 km rule
['jean jacques bourdin']
1.0
--------------------- 2549 -----------------------
1615280442_1615135236
In fact, Popularity: Macron down, Philippe up
['macron', 'philippe']
In fact, The popularity of Macron down, that of Philippe up
['macron', 'philippe']
1.0
--------------------- 2552 -----------------------
1527074223_1529766070
In fact, Municipal 2020. Regional Councilor Claude Taleb presents a list in Hauville
['councilor claude taleb', 'hauville']
In fact, Municipal 2020: Regional Advisor Claude Taleb presents his colicketeers in Hauville
['claude taleb', 'hauville']
0.5
--------------------- 2555 -----------------------
1614641165_1517692565
In fact, Reservations, Hygiene, Home ... Why do tourist sites sail in 

--------------------- 2624 -----------------------
1586653930_1554127284
In fact, S. Małgorzata Borkowska OSB: About spiritual development
['s małgorzata borkowska']
In fact, S. Małgorzata Borkowska: about silence
['s małgorzata borkowska']
1.0
--------------------- 2626 -----------------------
1529169576_1529642564
In fact, Austrians did not let the train from Italy, with fear of coronavirus
['austrians', 'italy']
In fact, A gigantic increase in prices of Maseczek and resources for disinfection in Italy
['maseczek', 'italy']
0.5
--------------------- 2627 -----------------------
1570154819_1570128432
In fact, PiS loses in the Sejm regarding correspondence voting
['pis', 'sejm']
In fact, 3 deputies from the PiS club against the extension of the agenda
['pis']
1.0
--------------------- 2628 -----------------------
1587069958_1543692294
In fact, Street lighting in Słupsk works too long [PHOTOS]
['słupsk']
In fact, Petrified housing estate in Słupsk.Old energy poles will be opposed to new

--------------------- 2751 -----------------------
1548312639_1568655284
In fact, Tatras.Rescue action on Świcy.Tourists despite the ban go to the mountains
['tatras']
In fact, No entry into the Tatras beneficial for nature
['tatras']
1.0
--------------------- 2752 -----------------------
1517783286_1518290061
In fact, Military do not want Huawei to build a 5G network
['huawei']
In fact, Canada: Military want the government to block Huawei the possibility of building a network 5g
['canada', 'huawei']
1.0
--------------------- 2765 -----------------------
1496643461_1496613308
In fact, The European Parliament adopted a resolution on the rule of law in Poland
['the european parliament', 'poland']
In fact, The European Parliament adopted a resolution on the rule of law in Poland
['the european parliament', 'poland']
1.0
--------------------- 2772 -----------------------
1531674925_1620636118
In fact, Asteroid approaches today.Nasa admits: this is not the end of the world, but we observe a

--------------------- 2832 -----------------------
1548604467_1548571990
In fact, The main sanitation of Uzbekistan made a statement in connection with the detection of coronavirus in the country
['uzbekistan']
In fact, The first case of coronavirus is registered in Uzbekistan
['uzbekistan']
1.0
--------------------- 2834 -----------------------
1621124980_1621555894
In fact, Crew Dragon ship with astronauts went into orbit
['crew dragon']
In fact, Crew Dragon spacecraft close to the ISS before the upcoming docking
['crew dragon']
1.0
--------------------- 2835 -----------------------
1568051061_1512431682
In fact, Putin: Saudi Arabia is trying to get rid of the mining oil of competitors
['putin', 'saudi arabia']
In fact, Putin and the King of Saudi Arabia discussed further actions in OPEC + format
['putin', 'the king of', 'saudi arabia', 'opec']
1.0
--------------------- 2836 -----------------------
1514489958_1532711600
In fact, In green and white colors.At the airport Kurgan, they t

--------------------- 2914 -----------------------
1603759756_1603235759
In fact, The harsh reaction to the German judgment from the EU
['german', 'eu']
['legal process', 'eu', 'germany']
0.5
--------------------- 2919 -----------------------
1485418993_1595165552
In fact, The Iran, who promised Süleymani's revenge, shipped the battle aircraft to the limit
['iran', 'süleymani']
In fact, Iran Foreign Minister Elegant: The biggest weapon seller is concerned from the US, Iran
['iran', 'elegant', 'us', 'iran']
0.5
--------------------- 2923 -----------------------
1522381580_1527105014
In fact, The number of decades from Kovid-19 in China rose to 1666
['china']
In fact, The number of kovid-19 outbreaks in China came out to 2 thousand 238
['china']
1.0
--------------------- 2927 -----------------------
1642450531_1642504732
In fact, The extreme right-wing organization was prohibited in Germany
['germany']
In fact, Forbidden to the Organization of Excessive Realist "Nordads" in Germany
['for

--------------------- 3042 -----------------------
1491412905_1491425949
In fact, Retirement Private Director to Canca last task
['canca']
In fact, Older Turkey Judo Federation President Canca's funeral was given to the land
['turkey judo federation', 'canca']
1.0
--------------------- 3046 -----------------------
1585685493_1609095688
In fact, Coronavirus: Losing the revenue in the UK and the Taraway to Turkey Ankara negotiated what are they live?
['uk', 'taraway', 'turkey']
In fact, Turkey's Foreign Ministry of Turkey Alert
['turkey', 'foreign ministry of turkey alert']
0.5
--------------------- 3053 -----------------------
1566734358_1583622671
In fact, Paid permit time in Russia extended until April 30
['russia']
In fact, Russia President Putin: The epidemic did not reach the peak point
['russia', 'putin']
1.0
--------------------- 3060 -----------------------
1487434280_1595528307
In fact, Fenerbahce has established first official contact for Marcos Rojo Transfer
['fenerbahce', 'm

--------------------- 3188 -----------------------
1640803995_1640937871
In fact, Huawei will expect to spend 400 million pounds in the UK new chip base
['huawei', 'uk']
In fact, Huawei will pre-funded 400 million pounds in the UK
['huawei', 'uk']
1.0
--------------------- 3189 -----------------------
1644291819_1644711003
In fact, The Dragon Boat Festival, perceived the deep family of General Secretary Xi Jinping
['the dragon boat festival', 'xi jinping']
In fact, Dragon Boat Festival, feeling of Xi Jinping's traditional cultural complex
['dragon boat festival', 'xi jinping']
0.5
--------------------- 3202 -----------------------
1634280264_1632690669
In fact, Wonderful!"Culture and Natural Heritage Day" Henan home activities held in Luoyang
['henan', 'luoyang']
In fact, The theme activity of "Culture and Natural Heritage Day" in Henan will be held in Luoyang
['culture and natural heritage day"', 'henan', 'luoyang']
1.0
--------------------- 3211 -----------------------
1584173636_158

--------------------- 3285 -----------------------
1489121308_1489121310
In fact, US Department of Defense: Iranian is an assessment situation and response measures to the US base in Rockets - International - News
['us department of defense:', 'iranian', 'us', 'rockets   international   news']
In fact, White House spokesperson: Trump has known the US Air Force Base by Rockets - International - People's Network
['white house', 'us', 'rockets   international   people network']
0.3333333333333333
--------------------- 3288 -----------------------
1603187388_1580352127
In fact, Morocco new crown confirmed cases more than 6000 cases - International - People's Network
['morocco', 'international   people network']
In fact, Morocco added 259 cases of diagnosis cases - Xinhuanet
['morocco']
1.0
--------------------- 3289 -----------------------
1591920962_1592061832
In fact, 9 "Taiwanese" dialysis US, the US politicians attack China's three sets of roads - Time Administrative News - People's Ne

--------------------- 3359 -----------------------
1544755802_1558409011
In fact, Voice Diary from Wuhan: I hope that a heavy rain will have all unfortunately - Xinhuanet
['wuhan']
In fact, Voice Diary from Wuhan: Thank you, big white!Xinhuanet
['wuhan', 'white!xinhuanet']
1.0
--------------------- 3362 -----------------------
1594125435_1628203991
In fact, Malaysia announced the relaxation limit measures to gradually restart economic activities - Xinhuanet
['malaysia']
In fact, Malaysia announced further relaxing new crown epidemic management measures - Xinhuanet
['malaysia']
1.0
--------------------- 3368 -----------------------
1501997981_1514744335
In fact, London stock market falls on the 22nd - Xinhuanet
['london']
In fact, London stock market fell on the 7th - Xinhuanet
['london']
1.0
--------------------- 3369 -----------------------
1617127210_1563902505
In fact, Vietnamese government prime minister, Chunfu, President, President of the Philippine
['vietnamese', 'chunfu', 'phil

--------------------- 3533 -----------------------
1573316128_1600984105
In fact, Guangdong announced the 2020 spring semester students return school time - Xinhuanet
['guangdong']
In fact, Shanghai, Guangdong to begin reopening school from April 27
['shanghai', 'guangdong']
1.0
--------------------- 3536 -----------------------
1561900090_1532123241
In fact, In Beijing, the new report of the new report, there is no new report, local case
['beijing']
In fact, Feeling Feverish? The List of 101 Designated Fever Clinics in Beijing
['designated fever clinics', 'beijing']
1.0
--------------------- 3538 -----------------------
1512813502_1499817917
In fact, The diagnosis of Wuhan, the diagnosis of Wuhan, China has exceeded more than 400 deaths.
['wuhan', 'wuhan', 'china']
In fact, Coronavirus: Toll rises, infection spreads; China claims ‘active’ response
['china']
1.0
--------------------- 3542 -----------------------
1501253200_1501200897
In fact, I understand how to block and conceal the t

--------------------- 3640 -----------------------
1643848349_1640130837
In fact, Pandemic is still in phase "worrying and intense" in Latin America: WHO
['latin america']
In fact, Worry about the situation in Latin America, where the numbers are very high
['worry', 'latin america']
1.0
--------------------- 3643 -----------------------
1589689553_1589534405
In fact, Borussia Dortmund: Bundesliga will sink if the season is not restarted
['bundesliga']
In fact, Shooting Bundesliga, Watzke Freme: "Or we start or collapse".Heeness: "No hurry"
['bundesliga', 'watzke freme', 'collapse"']
1.0
--------------------- 3644 -----------------------
1509675722_1509007704
In fact, United Kingdom will isolate 14 days to those who arrive from Wuhan due to the coronavirus
['united kingdom', 'wuhan']
In fact, Great Britain announces a quarantine for those who come from Wuhan
['great britain', 'wuhan']
0.5
--------------------- 3647 -----------------------
1578527730_1576222329
In fact, Last time Coronav

--------------------- 3732 -----------------------
1527224754_1527082318
In fact, Film Critic: Donald Trump Wetter against Oscar winner "Parasite"
['film critic', 'donald trump wetter', 'oscar']
In fact, Trump criticizes the attribution of the Oscar of the best parasite film
['trump', 'oscar']
0.5
--------------------- 3733 -----------------------
1626509167_1626408793
In fact, Racism: March and millions / donation: Kanye West sets sign against racism
['kanye west']
In fact, Kanye West launches a background to pay for George Floyd's daughter's studies
['kanye west', 'george floyd']
1.0
--------------------- 3735 -----------------------
1531866001_1528045007
In fact, Intelligence supervision takes over Crypto examination of the Federal Council
['crypto', 'the federal council']
In fact, Crypto case: Switzerland challenged by countries
['crypto', 'switzerland']
0.5
--------------------- 3739 -----------------------
1613743735_1616875587
In fact, Venezuela: 66 arrests after alleged circums

--------------------- 3830 -----------------------
1597101694_1597320784
In fact, The United Kingdom registers the lowest number of deaths since the end of March
['the united kingdom']
In fact, In the United Kingdom the epidemic has done more victims than in Italy
['the united kingdom', 'italy']
1.0
--------------------- 3833 -----------------------
1543439170_1500398715
In fact, China announces that a fourth person died by new pneumonia virus
['china']
In fact, World Health Organization convenes meeting for viral pneumonia in China
['world health organization', 'china']
1.0
--------------------- 3834 -----------------------
1557754472_1553304051
In fact, Italy reloads an increase in the number of dead, with another 743 deceased
['italy']
In fact, Coronavirus, to date there are 3405 deaths in Italy: more victims of China
['coronavirus', 'italy', 'china']
1.0
--------------------- 3835 -----------------------
1496986187_1497808289
In fact, Rhodium is the most expensive metal in the worl

--------------------- 3910 -----------------------
1569383985_1570970455
In fact, Coronavirus: Spain slows the curve with 674 dead in a day, the greatest descent of the last week
['coronavirus', 'spain']
In fact, Coronavirus Spain: 13,798 dead / trend growing deaths, +743: but there is a reason
['spain']
1.0
--------------------- 3912 -----------------------
1512248299_1515464051
In fact, "Everything starts at Iowa!"
['iowa']
In fact, DEM primaries, in Iowa Buttigieg wins with a detachment of 0.1%: "No to extreme recipes".But Sanders: "redistribute wealth"
['dem', 'iowa', 'buttigieg']
1.0
--------------------- 3914 -----------------------
1569709010_1563813512
In fact, The last thing about Coronavirus: Louisiana reports 68 deaths
['coronavirus', 'louisiana']
In fact, Coronavirus, over 809 thousand cases in the world: in the US double of China
['coronavirus', 'us', 'china']
0.5
--------------------- 3919 -----------------------
1509274703_1507096400
In fact, Formally accuse Netanyahu by

--------------------- 3993 -----------------------
1597905792_1597590962
In fact, United States Fans arrive in Mexico
['united states fans', 'mexico']
In fact, Mexico: Coronavirus, landloded with 211 respirators from the USA
['mexico', 'coronavirus', 'usa']
0.5
--------------------- 3997 -----------------------
1596079028_1594186139
In fact, Japan prepares for an extension of the state of emergency sanitary
['japan']
In fact, Japan: premier Abe, government ready to extend the state of emergency one month
['japan', 'abe']
1.0
--------------------- 4001 -----------------------
1588015300_1588184387
In fact, Corona Pandemic in Europe: Like Italians and Spaniards for their bathing season
['corona pandemic in europe', 'italians', 'spaniards']
In fact, Italy will leave the Union?Almost half of Italians support italexit
['italy', 'italians']
0.5
--------------------- 4002 -----------------------
1515565065_1514797340
In fact, Thuringia choice: FDP loses clearly in Germany survey
['thuringia',

--------------------- 4098 -----------------------
1602077848_1601925305
In fact, The advice of PiS politicians and agreement ended
['pis']
In fact, Governmental council at the PiS headquarters
['governmental council', 'pis']
1.0
--------------------- 4100 -----------------------
1570482173_1570128432
In fact, Krzysztof Bosak remaxes dirty PiS logs: "This trick, having to circumvent the results of a loser vote"
['pis']
In fact, 3 deputies from the PiS club against the extension of the agenda
['pis']
1.0
--------------------- 4102 -----------------------
1557149437_1557274362
In fact, Ministry of Foreign Affairs: Up again urged the US to stop politicization to the epidemic, stop stainization of China
['ministry of foreign affairs', 'us', 'china']
In fact, The US said China concealed the epidemic department: the US rumors show the thief shouting to catch thieves
['us', 'china', 'us']
1.0
--------------------- 4103 -----------------------
1648316931_1648280037
In fact, [忌 comment] The US-

--------------------- 4150 -----------------------
1543846939_1543374251
In fact, All Countries, Wuhan District Hospital - China News
['wuhan', 'district hospital   china news']
In fact, Wuhan accumulated admission and discharge patient "apartment" hospital "apartment"
['wuhan']
1.0
--------------------- 4153 -----------------------
1576601445_1576611416
In fact, Beijing: Volunteer Park to discourage unclean behavior - Xinhuanet
['beijing', 'volunteer park']
In fact, Beijing: Do not wear a mask is included in the "black reputation" in the garden - Xinhuanet
['beijing']
1.0
--------------------- 4154 -----------------------
1634984464_1635576664
In fact, Vietnam has no new local new crown pneumonia in 60 days.
['vietnam']
In fact, Vietnam + (VietnamPlus)
['vietnam', 'vietnamplus']
1.0
--------------------- 4155 -----------------------
1489077936_1488505446
In fact, Iran launches missile to the US military station - Xinhuanet
['iran', 'us']
In fact, 35 people died in Iran's stamping inci

--------------------- 4225 -----------------------
1629144713_1629497314
In fact, China refused to restore Ladakh primary India refused border road negotiations in a deadlock
['china', 'ladakh', 'india']
In fact, The CCP refused to restore the Ladakh primary China-Indian border negotiations in the deadlock
['ccp', 'ladakh', 'china', 'indian']
0.6666666666666666
--------------------- 4227 -----------------------
1633655297_1633671134
In fact, Beijing announced the list of 98 new crown virus nucleic acid detection services
['beijing']
In fact, Beijing announced the list of medical and health institutions for new crown virus nucleic acid detection services - Xinhuanet
['beijing']
1.0
--------------------- 4229 -----------------------
1627281330_1626626962
In fact, 170 million mu!The national wheat harvest has passed over half Huang Huaihai main product area has over 98%.
['huang huaihai']
In fact, National wheat harvest half Huang Huaihai main product area machine rate over 98% - Xinhuane

--------------------- 4328 -----------------------
1609241667_1609165974
In fact, Lebanon: Proposal to cancel the secondary school exams and complete the study by Corona
['lebanon', 'corona']
In fact, Cancellation of secondary school exams in Lebanon and completion of the academic year
['lebanon']
1.0
--------------------- 4332 -----------------------
1641108223_1640820693
In fact, The Kingdom supports Egypt's right to defend itself .. and thank you: We will refund any threat
['the kingdom supports', 'egypt']
In fact, Shukri: We will refund any threat to the security of Egypt
['shukri', 'egypt']
0.5
--------------------- 4333 -----------------------
1551201460_1552548593
In fact, Turkish deputy warns Erdogan from killing thousands of detainees in Ankara prisons because of Corona
['turkish', 'erdogan', 'ankara', 'corona']
In fact, Erdogan: The West deals with a negative with Corona!
['erdogan', 'west', 'corona']
0.6666666666666666
--------------------- 4337 -----------------------
16280

--------------------- 4431 -----------------------
1490252360_1489858288
In fact, Conflicts: Maas: Location in Iraq has "considerably relaxed"
['maas', 'iraq']
In fact, Maas evaluates Trump's statements for Iran conflict as a "good signal"
['maas', 'trump', 'iran']
0.5
--------------------- 4440 -----------------------
1618167813_1617908602
In fact, DGAP-Adhoc: LPKF Laser & Electronics Aktiengesellschaft: Bantleon Group announces its intention to place its LPKF shares at the capital market (German)
['lpkf laser & electronics aktiengesellschaft', 'bantleon group', 'german']
In fact, DGAP-Adhoc: Easy Software AG: Former Management Board Dieter White Hair complains about severance pay (German)
['easy software ag', 'dieter white hair', 'german']
0.3333333333333333
--------------------- 4445 -----------------------
1584787116_1581158412
In fact, POL-MA: Mannheim-Käfertal: arson to church tower, witnesses wanted!
['pol ma', 'mannheim käfertal']
In fact, POL-MA: Mannheim-Käfertal: Milling bin

--------------------- 4506 -----------------------
1531972652_1529880359
In fact, 22 studies agree: 'Medicare for All' saves money
['medicare']
In fact, Here's What 22 Separate Studies Found: Medicare for All Would Cost Less Than the For-Profit Status Quo
['medicare']
1.0
--------------------- 4508 -----------------------
1592091128_1591556033
In fact, Cheryl's show The Greatest Dancer axed by BBC
['cheryl', 'the greatest dancer', 'bbc']
In fact, BBC cancels The Greatest Dancer after two series
['bbc', 'the greatest dancer']
1.0
--------------------- 4509 -----------------------
1571274943_1576594517
In fact, Who is Dominic Raab, the man running the country in Boris Johnson’s absence?
['dominic raab', 'boris johnson’s']
In fact, Who is Dominic Raab?
['dominic raab']
1.0
--------------------- 4511 -----------------------
1557010702_1579667345
In fact, Tom Hanks’ wife Rita Wilson raps Naughty By Nature while in quarantine
['tom hanks', 'rita wilson', 'nature']
In fact, Tom Hanks Gives Ma

--------------------- 4609 -----------------------
1530933267_1530959859
In fact, Prosecutor's Office clarifies that there is no recall record against Mon Laverte by Carabineros
['prosecutor office', 'mon laverte']
In fact, Public Ministry clarifies that there is no complaint against Mon Laverte and that Carabineros has not requested its statement
['public ministry', 'mon laverte', 'carabineros']
0.5
--------------------- 4610 -----------------------
1625621666_1625520166
In fact, Summit between Espinoza and Valenzuela: "This virus does not understand avenues, or streets that divide us"
['espinoza']
In fact, Espinoza and Valenzuela coordinated strategies to optimize tasks "without partisisms"
['espinoza', 'valenzuela']
1.0
--------------------- 4612 -----------------------
1640484317_1637791958
In fact, Pandemic has knees to Panama: Registered 25 222 infected and 493 dead
['panama']
In fact, Virus does not have a compassion of Panama: they register 470 dead and 22 597 infected
['virus'

--------------------- 4715 -----------------------
1487725219_1485057505
In fact, Several damage is reported in Puerto Rico after Sismo
['puerto rico']
In fact, New earthquake, now of 4.5, remits to Puerto Rico
['puerto rico']
1.0
--------------------- 4716 -----------------------
1551240100_1547419464
In fact, Aislinn Derbez talks about the treatment he received from Victoria Ruffo when he was with Eugenio
['aislinn derbez', 'victoria ruffo']
In fact, Aislinn Derbez Manda Message to Victoria Ruffo
['victoria ruffo']
1.0
--------------------- 4719 -----------------------
1586981410_1584233290
In fact, Video: They captured a jellyfish swimming in the canals of Venice
['venice']
In fact, Jellyfish slid through the clean waters of Venice canals
['jellyfish', 'venice']
1.0
--------------------- 4722 -----------------------
1602161279_1601966801
In fact, A robot maintains the social distance in Singapore
['singapore']
In fact, 'Spot', the dog-robot that guarantees the distancing in Singapor

--------------------- 4793 -----------------------
1505428056_1504214905
In fact, 18 cases of new new pneumonia in Chongqing have a total of 75 cases of diagnosis - News - Global IC Trade Starts Here.
['chongqing']
In fact, Pneumonia epidemic situation in Chongqing new coronavirus infection on January 25, 2020 - China News Net
['chongqing', 'china news net']
1.0
--------------------- 4794 -----------------------
1606379213_1605424353
In fact, Cumulative cure cases of Kazakhstan new crown pneumonia reached 2476 cases
['kazakhstan']
In fact, 363 cases of cure cases in Kazakhstan new crown pneumonia
['kazakhstan']
1.0
--------------------- 4795 -----------------------
1535894441_1535843341
In fact, Hijacking incident in Manila, Philippines, is trapped
['hijacking', 'manila', 'philippines']
In fact, The Philippines took a gunman's hosted personality incident about 30 people trapped shopping malls
['philippines']
1.0
--------------------- 4801 -----------------------
1489641413_1489827732
I

--------------------- 4892 -----------------------
1519429984_1518245422
In fact, Climb at 1,100 the number of dead by Coronavirus in China - so it's margarita
['coronavirus', 'china']
In fact, The Coronavirus is called Covid-19
['coronavirus']
1.0
--------------------- 4896 -----------------------
1529909197_1529885148
In fact, Weinstein, guilty of two sexual crimes in historical trial for methoo
['weinstein', 'methoo']
In fact, Harassment: Weinstein guilty of sexual violence and rape
['weinstein']
1.0


In [17]:
title_key_df

Unnamed: 0,pair_id,url1_lang,google_lang1,url2_lang,google_lang2,translated_title1,translated_title2,key1,key2,key_score
0,1484189203_1484121193,en,en,en,en,Police: 2 men stole tools from Lowe’s in Davie,No-swim advisory lifted for Deerfield Beach Pier,[lowe],[deerfield beach],0
1,1484011097_1484011106,en,en,en,en,"Open database leaked 179GB in customer, US gov...",Best Western’s Massive Data Leak: 179GB Amazon...,[us],[best western’s],0
2,1484039488_1484261803,en,en,en,en,Ducks are own worst enemies in sloppy loss in ...,Woody Guthrie's 1943 New Year's Resolutions ar...,[las vegas],"[woody guthrie, new year]",0
3,1484332324_1484796748,en,en,en,en,Another Bengal vs Centre tussle? Govt rejects ...,'Congress Rejected 7 Times': BJP's Reminder as...,[govt],"[congress rejected 7 times', bjp, maharashtra ...",0
4,1484012256_1484419682,en,en,en,en,Bars and clubs you loved and lost this decade ...,Top 20 films of the 2010s,[],[],0
...,...,...,...,...,...,...,...,...,...,...
4897,1553907621_1553488848,es,es,it,it,"Denver Nuggets reports that ""a member of the o...","Coronavirus, a case also among the Denver Nuggets",[denver nuggets],"[coronavirus, denver]",0
4898,1646957948_1643667075,es,es,it,it,Living in Spain is cheaper than in the Europea...,"Coronavirus, in Europe 140 thousand dead in Ma...","[spain, european]","[coronavirus, europe]",0
4899,1504063453_1502866628,es,es,it,it,Activate epidemiological surveillance system i...,Coronavirus in China: consumption of snakes an...,"[activate, venezuelan]","[coronavirus, china]",0
4900,1647862428_1647712939,es,es,it,it,Emite Iran Arrest Order against Tump by Genera...,Iran emits stop mandate for trump for the murd...,"[emite iran arrest order against, tump, soleim...",[iran],0


In [18]:
path = 'eval/_EVAL_title_named-entity_score.csv'
title_key_df.to_csv(path,index=False)

### Compare named entity similarity score <br>

We compare the annotated overall score of each pair with the similarity score obtained by applying the 
overlap coefficient to the named entities extracted from the articles. <br><br>
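The scoring function itself is not shown in this section, so the following is only a minimal sketch of an overlap coefficient over two entity lists. The function name `overlap_coefficient` and the handling of empty lists (score 0) are assumptions; the two example inputs are taken from the printed output above, where they are listed with scores of 0.5 and 0.6666666666666666.

```python
def overlap_coefficient(entities_a, entities_b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|), computed on sets."""
    set_a, set_b = set(entities_a), set(entities_b)
    if not set_a or not set_b:
        # Assumption: a pair where either title has no entities scores 0.
        return 0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Entity lists copied from two of the printed pairs above.
print(overlap_coefficient(['german', 'eu'],
                          ['legal process', 'eu', 'germany']))
print(overlap_coefficient(['china', 'ladakh', 'india'],
                          ['ccp', 'ladakh', 'china', 'indian']))
```

Because the inputs are converted to sets, a repeated entity such as `['wuhan', 'wuhan', 'china']` counts only once, and dividing by the size of the smaller set means a short title is not penalised for being paired with a longer one.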

This comparison is analogous to the one performed on the training dataset. In principle, this step is not relevant for the final evaluation of our solution: it can be thought of as a double check that the choice of features makes sense, and it is motivated by curiosity rather than necessity.
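One way such a double check could be summarised is the mean named-entity score for each rounded overall label. The sketch below uses only the standard library; the helper name `mean_score_per_label` and the sample pairs are hypothetical and are not results from the evaluation dataset.

```python
from collections import defaultdict


def mean_score_per_label(pairs):
    """Average the named-entity score within each rounded overall label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, score in pairs:
        totals[label] += score
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in sorted(totals)}


# Illustrative (label, score) pairs only, not real results.
sample_pairs = [(1, 1.0), (1, 0.5), (2, 0.0), (4, 0.0), (4, 1.0), (4, 0.0)]
print(mean_score_per_label(sample_pairs))
```

If the feature is informative, the per-label means should change steadily from one label to the next; roughly flat means would suggest that the title entities alone say little about the annotated score.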

In [13]:
import math

import pandas as pd

# Check how well the title named-entity score indicates the similarity of two articles
eval_df = pd.read_csv('eval/_EVAL_details_in_df.csv')

def normal_round(n):
    # Round half up; Python's built-in round() rounds halves to the nearest even integer
    if n - math.floor(n) < 0.5:
        return int(math.floor(n))
    return int(math.ceil(n))

# Collect one dict per pair and build the DataFrame once at the end
rows = []

for i, row in eval_df.iterrows():
    # Optional sanity check that both frames are still in the same row order:
    # if row["pair_id"] != title_key_df.iloc[i]['pair_id']:
    #     print('---------------------',i,'-----------------------')
    #     print(row["pair_id"])
    print('---------------------',i,'-----------------------')
    label = normal_round(row['Overall'])
    print('label:',label)
    score = title_key_df.iloc[i]['key_score']
    print('score:',score)

    rows.append({"overall": label, "title_ne_score": score})

# DataFrame.append was removed in pandas 2.0, so the rows are not appended one by one
compare_df = pd.DataFrame(rows, columns=["overall", "title_ne_score"])
    

--------------------- 0 -----------------------
label: 4
score: 0
--------------------- 1 -----------------------
label: 1
score: 0
--------------------- 2 -----------------------
label: 4
score: 0
--------------------- 3 -----------------------
label: 2
score: 0
--------------------- 4 -----------------------
label: 4
score: 0
--------------------- 5 -----------------------
label: 4
score: 0
--------------------- 6 -----------------------
label: 3
score: 0
--------------------- 7 -----------------------
label: 3
score: 0
--------------------- 8 -----------------------
label: 2
score: 0
--------------------- 9 -----------------------
label: 4
score: 0
--------------------- 10 -----------------------
label: 4
score: 0
--------------------- 11 -----------------------
label: 3
score: 0
--------------------- 12 -----------------------
label: 4
score: 0
--------------------- 13 -----------------------
label: 4
score: 0
--------------------- 14 -----------------------
label: 3
score: 0
-----

--------------------- 197 -----------------------
label: 3
score: 0
--------------------- 198 -----------------------
label: 1
score: 0
--------------------- 199 -----------------------
label: 3
score: 0
--------------------- 200 -----------------------
label: 3
score: 1.0
--------------------- 201 -----------------------
label: 4
score: 0
--------------------- 202 -----------------------
label: 3
score: 0
--------------------- 203 -----------------------
label: 1
score: 0
--------------------- 204 -----------------------
label: 3
score: 0
--------------------- 205 -----------------------
label: 2
score: 1.0
--------------------- 206 -----------------------
label: 4
score: 1.0
--------------------- 207 -----------------------
label: 3
score: 0
--------------------- 208 -----------------------
label: 2
score: 1.0
--------------------- 209 -----------------------
label: 2
score: 0
--------------------- 210 -----------------------
label: 4
score: 0
--------------------- 211 --------------

--------------------- 348 -----------------------
label: 1
score: 0
--------------------- 349 -----------------------
label: 4
score: 0
--------------------- 350 -----------------------
label: 3
score: 0
--------------------- 351 -----------------------
label: 4
score: 0
--------------------- 352 -----------------------
label: 1
score: 1.0
--------------------- 353 -----------------------
label: 4
score: 0
--------------------- 354 -----------------------
label: 4
score: 0
--------------------- 355 -----------------------
label: 4
score: 0
--------------------- 356 -----------------------
label: 1
score: 0
--------------------- 357 -----------------------
label: 3
score: 0
--------------------- 358 -----------------------
label: 3
score: 0
--------------------- 359 -----------------------
label: 4
score: 0
--------------------- 360 -----------------------
label: 2
score: 0
--------------------- 361 -----------------------
label: 4
score: 0
--------------------- 362 --------------------

--------------------- 509 -----------------------
label: 3
score: 0.3333333333333333
--------------------- 510 -----------------------
label: 3
score: 0
--------------------- 511 -----------------------
label: 2
score: 0
--------------------- 512 -----------------------
label: 1
score: 1.0
--------------------- 513 -----------------------
label: 4
score: 0
--------------------- 514 -----------------------
label: 3
score: 0.5
--------------------- 515 -----------------------
label: 4
score: 1.0
--------------------- 516 -----------------------
label: 4
score: 0
--------------------- 517 -----------------------
label: 4
score: 0
--------------------- 518 -----------------------
label: 1
score: 1.0
--------------------- 519 -----------------------
label: 4
score: 0
--------------------- 520 -----------------------
label: 3
score: 0
--------------------- 521 -----------------------
label: 4
score: 0
--------------------- 522 -----------------------
label: 3
score: 1.0
---------------------

label: 2
score: 1.0
--------------------- 677 -----------------------
label: 4
score: 0
--------------------- 678 -----------------------
label: 4
score: 0
--------------------- 679 -----------------------
label: 3
score: 0
--------------------- 680 -----------------------
label: 4
score: 0
--------------------- 681 -----------------------
label: 3
score: 0
--------------------- 682 -----------------------
label: 4
score: 0
--------------------- 683 -----------------------
label: 2
score: 0
--------------------- 684 -----------------------
label: 4
score: 0
--------------------- 685 -----------------------
label: 3
score: 1.0
--------------------- 686 -----------------------
label: 2
score: 0
--------------------- 687 -----------------------
label: 3
score: 0
--------------------- 688 -----------------------
label: 1
score: 0
--------------------- 689 -----------------------
label: 4
score: 1.0
--------------------- 690 -----------------------
label: 4
score: 1.0
--------------------- 

--------------------- 836 -----------------------
label: 3
score: 0
--------------------- 837 -----------------------
label: 4
score: 0
--------------------- 838 -----------------------
label: 2
score: 0
--------------------- 839 -----------------------
label: 1
score: 0
--------------------- 840 -----------------------
label: 3
score: 0
--------------------- 841 -----------------------
label: 4
score: 0
--------------------- 842 -----------------------
label: 1
score: 1.0
--------------------- 843 -----------------------
label: 2
score: 0
--------------------- 844 -----------------------
label: 1
score: 0
--------------------- 845 -----------------------
label: 1
score: 1.0
--------------------- 846 -----------------------
label: 1
score: 0
--------------------- 847 -----------------------
label: 1
score: 0
--------------------- 848 -----------------------
label: 2
score: 1.0
--------------------- 849 -----------------------
label: 1
score: 0
--------------------- 850 ----------------

--------------------- 994 -----------------------
label: 4
score: 0
--------------------- 995 -----------------------
label: 4
score: 0
--------------------- 996 -----------------------
label: 4
score: 0
--------------------- 997 -----------------------
label: 4
score: 1.0
--------------------- 998 -----------------------
label: 4
score: 0
--------------------- 999 -----------------------
label: 4
score: 0
--------------------- 1000 -----------------------
label: 4
score: 0
--------------------- 1001 -----------------------
label: 4
score: 0
--------------------- 1002 -----------------------
label: 4
score: 0
--------------------- 1003 -----------------------
label: 4
score: 0
--------------------- 1004 -----------------------
label: 3
score: 0
--------------------- 1005 -----------------------
label: 1
score: 0
--------------------- 1006 -----------------------
label: 4
score: 0
--------------------- 1007 -----------------------
label: 4
score: 0
--------------------- 1008 -----------

--------------------- 1168 -----------------------
label: 4
score: 0
--------------------- 1169 -----------------------
label: 1
score: 0
--------------------- 1170 -----------------------
label: 2
score: 0
--------------------- 1171 -----------------------
label: 4
score: 0
--------------------- 1172 -----------------------
label: 4
score: 0.5
--------------------- 1173 -----------------------
label: 1
score: 0
--------------------- 1174 -----------------------
label: 2
score: 0
--------------------- 1175 -----------------------
label: 1
score: 0
--------------------- 1176 -----------------------
label: 2
score: 0
--------------------- 1177 -----------------------
label: 3
score: 0
--------------------- 1178 -----------------------
label: 4
score: 0
--------------------- 1179 -----------------------
label: 2
score: 0
--------------------- 1180 -----------------------
label: 3
score: 0
--------------------- 1181 -----------------------
label: 3
score: 0
--------------------- 1182 -----

--------------------- 1332 -----------------------
label: 4
score: 1.0
--------------------- 1333 -----------------------
label: 3
score: 0.5
--------------------- 1334 -----------------------
label: 4
score: 0
--------------------- 1335 -----------------------
label: 2
score: 0
--------------------- 1336 -----------------------
label: 2
score: 0
--------------------- 1337 -----------------------
label: 4
score: 0
--------------------- 1338 -----------------------
label: 1
score: 1.0
--------------------- 1339 -----------------------
label: 2
score: 0
--------------------- 1340 -----------------------
label: 2
score: 0
--------------------- 1341 -----------------------
label: 2
score: 0
--------------------- 1342 -----------------------
label: 3
score: 0
--------------------- 1343 -----------------------
label: 2
score: 0
--------------------- 1344 -----------------------
label: 1
score: 0
--------------------- 1345 -----------------------
label: 3
score: 0
--------------------- 1346 -

--------------------- 1512 -----------------------
label: 3
score: 0
--------------------- 1513 -----------------------
label: 2
score: 1.0
--------------------- 1514 -----------------------
label: 4
score: 0
--------------------- 1515 -----------------------
label: 2
score: 0
--------------------- 1516 -----------------------
label: 4
score: 0
--------------------- 1517 -----------------------
label: 2
score: 0
--------------------- 1518 -----------------------
label: 3
score: 0
--------------------- 1519 -----------------------
label: 4
score: 0
--------------------- 1520 -----------------------
label: 3
score: 0
--------------------- 1521 -----------------------
label: 1
score: 0.5
--------------------- 1522 -----------------------
label: 1
score: 0.5
--------------------- 1523 -----------------------
label: 4
score: 0
--------------------- 1524 -----------------------
label: 4
score: 0
--------------------- 1525 -----------------------
label: 3
score: 0
--------------------- 1526 -

--------------------- 1650 -----------------------
label: 3
score: 0
--------------------- 1651 -----------------------
label: 3
score: 1.0
--------------------- 1652 -----------------------
label: 4
score: 0
--------------------- 1653 -----------------------
label: 3
score: 0
--------------------- 1654 -----------------------
label: 1
score: 0
--------------------- 1655 -----------------------
label: 4
score: 0
--------------------- 1656 -----------------------
label: 2
score: 0
--------------------- 1657 -----------------------
label: 3
score: 0
--------------------- 1658 -----------------------
label: 3
score: 0
--------------------- 1659 -----------------------
label: 3
score: 0
--------------------- 1660 -----------------------
label: 3
score: 0
--------------------- 1661 -----------------------
label: 1
score: 0
--------------------- 1662 -----------------------
label: 3
score: 0
--------------------- 1663 -----------------------
label: 1
score: 1.0
--------------------- 1664 ---

--------------------- 1797 -----------------------
label: 3
score: 0
--------------------- 1798 -----------------------
label: 3
score: 0
--------------------- 1799 -----------------------
label: 1
score: 0
--------------------- 1800 -----------------------
label: 1
score: 0.5
--------------------- 1801 -----------------------
label: 2
score: 0
--------------------- 1802 -----------------------
label: 1
score: 0
--------------------- 1803 -----------------------
label: 4
score: 0
--------------------- 1804 -----------------------
label: 4
score: 1.0
--------------------- 1805 -----------------------
label: 4
score: 0
--------------------- 1806 -----------------------
label: 2
score: 0
--------------------- 1807 -----------------------
label: 4
score: 0
--------------------- 1808 -----------------------
label: 1
score: 1.0
--------------------- 1809 -----------------------
label: 3
score: 0
--------------------- 1810 -----------------------
label: 4
score: 0
--------------------- 1811 -

--------------------- 1952 -----------------------
label: 2
score: 0
--------------------- 1953 -----------------------
label: 1
score: 0
--------------------- 1954 -----------------------
label: 4
score: 0
--------------------- 1955 -----------------------
label: 2
score: 0
--------------------- 1956 -----------------------
label: 1
score: 0
--------------------- 1957 -----------------------
label: 4
score: 0
--------------------- 1958 -----------------------
label: 2
score: 0
--------------------- 1959 -----------------------
label: 1
score: 0
--------------------- 1960 -----------------------
label: 4
score: 0.5
--------------------- 1961 -----------------------
label: 1
score: 0
--------------------- 1962 -----------------------
label: 4
score: 0
--------------------- 1963 -----------------------
label: 3
score: 0.5
--------------------- 1964 -----------------------
label: 3
score: 1.0
--------------------- 1965 -----------------------
label: 1
score: 0
--------------------- 1966 -

--------------------- 2094 -----------------------
label: 2
score: 0
--------------------- 2095 -----------------------
label: 1
score: 1.0
--------------------- 2096 -----------------------
label: 3
score: 0
--------------------- 2097 -----------------------
label: 4
score: 0
--------------------- 2098 -----------------------
label: 2
score: 0
--------------------- 2099 -----------------------
label: 3
score: 0
--------------------- 2100 -----------------------
label: 2
score: 0.5
--------------------- 2101 -----------------------
label: 1
score: 0
--------------------- 2102 -----------------------
label: 1
score: 0
--------------------- 2103 -----------------------
label: 2
score: 0
--------------------- 2104 -----------------------
label: 2
score: 0
--------------------- 2105 -----------------------
label: 1
score: 0.5
--------------------- 2106 -----------------------
label: 4
score: 0
--------------------- 2107 -----------------------
label: 4
score: 0
--------------------- 2108 -

--------------------- 2221 -----------------------
label: 3
score: 0
--------------------- 2222 -----------------------
label: 1
score: 0
--------------------- 2223 -----------------------
label: 4
score: 0
--------------------- 2224 -----------------------
label: 1
score: 0
--------------------- 2225 -----------------------
label: 4
score: 0
--------------------- 2226 -----------------------
label: 1
score: 0
--------------------- 2227 -----------------------
label: 4
score: 1.0
--------------------- 2228 -----------------------
label: 3
score: 0
--------------------- 2229 -----------------------
label: 2
score: 0
--------------------- 2230 -----------------------
label: 1
score: 0
--------------------- 2231 -----------------------
label: 1
score: 0
--------------------- 2232 -----------------------
label: 1
score: 0
--------------------- 2233 -----------------------
label: 1
score: 1.0
--------------------- 2234 -----------------------
label: 1
score: 0
--------------------- 2235 ---

--------------------- 2344 -----------------------
label: 1
score: 0
--------------------- 2345 -----------------------
label: 1
score: 1.0
--------------------- 2346 -----------------------
label: 2
score: 1.0
--------------------- 2347 -----------------------
label: 4
score: 0
--------------------- 2348 -----------------------
label: 4
score: 0
--------------------- 2349 -----------------------
label: 3
score: 1.0
--------------------- 2350 -----------------------
label: 1
score: 1.0
--------------------- 2351 -----------------------
label: 1
score: 1.0
--------------------- 2352 -----------------------
label: 1
score: 0
--------------------- 2353 -----------------------
label: 3
score: 0
--------------------- 2354 -----------------------
label: 4
score: 0
--------------------- 2355 -----------------------
label: 1
score: 0
--------------------- 2356 -----------------------
label: 1
score: 0
--------------------- 2357 -----------------------
label: 2
score: 0
--------------------- 23

--------------------- 2485 -----------------------
label: 3
score: 0
--------------------- 2486 -----------------------
label: 1
score: 1.0
--------------------- 2487 -----------------------
label: 1
score: 0
--------------------- 2488 -----------------------
label: 1
score: 0
--------------------- 2489 -----------------------
label: 1
score: 0
--------------------- 2490 -----------------------
label: 3
score: 0
--------------------- 2491 -----------------------
label: 1
score: 0.5
--------------------- 2492 -----------------------
label: 2
score: 0
--------------------- 2493 -----------------------
label: 1
score: 0
--------------------- 2494 -----------------------
label: 3
score: 0
--------------------- 2495 -----------------------
label: 2
score: 0
--------------------- 2496 -----------------------
label: 1
score: 1.0
--------------------- 2497 -----------------------
label: 1
score: 1.0
--------------------- 2498 -----------------------
label: 1
score: 1.0
--------------------- 24

--------------------- 2629 -----------------------
label: 1
score: 1.0
--------------------- 2630 -----------------------
label: 3
score: 0
--------------------- 2631 -----------------------
label: 1
score: 0
--------------------- 2632 -----------------------
label: 1
score: 0
--------------------- 2633 -----------------------
label: 1
score: 0
--------------------- 2634 -----------------------
label: 2
score: 0.5
--------------------- 2635 -----------------------
label: 1
score: 0
--------------------- 2636 -----------------------
label: 1
score: 0
--------------------- 2637 -----------------------
label: 3
score: 0.5
--------------------- 2638 -----------------------
label: 3
score: 0
--------------------- 2639 -----------------------
label: 3
score: 0
--------------------- 2640 -----------------------
label: 4
score: 0
--------------------- 2641 -----------------------
label: 2
score: 0
--------------------- 2642 -----------------------
label: 1
score: 1.0
--------------------- 2643

--------------------- 2757 -----------------------
label: 4
score: 0
--------------------- 2758 -----------------------
label: 2
score: 0
--------------------- 2759 -----------------------
label: 1
score: 0
--------------------- 2760 -----------------------
label: 1
score: 0
--------------------- 2761 -----------------------
label: 1
score: 0
--------------------- 2762 -----------------------
label: 3
score: 0
--------------------- 2763 -----------------------
label: 2
score: 0
--------------------- 2764 -----------------------
label: 2
score: 0
--------------------- 2765 -----------------------
label: 1
score: 1.0
--------------------- 2766 -----------------------
label: 3
score: 0
--------------------- 2767 -----------------------
label: 3
score: 0
--------------------- 2768 -----------------------
label: 2
score: 0
--------------------- 2769 -----------------------
label: 3
score: 0
--------------------- 2770 -----------------------
label: 3
score: 0
--------------------- 2771 -----

--------------------- 2922 -----------------------
label: 1
score: 0
--------------------- 2923 -----------------------
label: 1
score: 1.0
--------------------- 2924 -----------------------
label: 2
score: 0
--------------------- 2925 -----------------------
label: 3
score: 0
--------------------- 2926 -----------------------
label: 2
score: 0
--------------------- 2927 -----------------------
label: 1
score: 1.0
--------------------- 2928 -----------------------
label: 2
score: 1.0
--------------------- 2929 -----------------------
label: 4
score: 0
--------------------- 2930 -----------------------
label: 1
score: 1.0
--------------------- 2931 -----------------------
label: 2
score: 0
--------------------- 2932 -----------------------
label: 3
score: 0.3333333333333333
--------------------- 2933 -----------------------
label: 2
score: 0
--------------------- 2934 -----------------------
label: 3
score: 0
--------------------- 2935 -----------------------
label: 4
score: 0
---------

--------------------- 3069 -----------------------
label: 3
score: 0
--------------------- 3070 -----------------------
label: 4
score: 0
--------------------- 3071 -----------------------
label: 4
score: 0
--------------------- 3072 -----------------------
label: 2
score: 0
--------------------- 3073 -----------------------
label: 4
score: 0
--------------------- 3074 -----------------------
label: 1
score: 1.0
--------------------- 3075 -----------------------
label: 1
score: 1.0
--------------------- 3076 -----------------------
label: 3
score: 0
--------------------- 3077 -----------------------
label: 2
score: 0
--------------------- 3078 -----------------------
label: 3
score: 0
--------------------- 3079 -----------------------
label: 4
score: 1.0
--------------------- 3080 -----------------------
label: 2
score: 0
--------------------- 3081 -----------------------
label: 2
score: 0
--------------------- 3082 -----------------------
label: 4
score: 0
--------------------- 3083 -

--------------------- 3224 -----------------------
label: 1
score: 0
--------------------- 3225 -----------------------
label: 1
score: 0
--------------------- 3226 -----------------------
label: 1
score: 1.0
--------------------- 3227 -----------------------
label: 2
score: 0
--------------------- 3228 -----------------------
label: 2
score: 0.5
--------------------- 3229 -----------------------
label: 1
score: 1.0
--------------------- 3230 -----------------------
label: 2
score: 1.0
--------------------- 3231 -----------------------
label: 1
score: 0
--------------------- 3232 -----------------------
label: 2
score: 1.0
--------------------- 3233 -----------------------
label: 3
score: 0.5
--------------------- 3234 -----------------------
label: 4
score: 0
--------------------- 3235 -----------------------
label: 4
score: 0
--------------------- 3236 -----------------------
label: 4
score: 0
--------------------- 3237 -----------------------
label: 3
score: 0
--------------------- 

--------------------- 3389 -----------------------
label: 1
score: 0
--------------------- 3390 -----------------------
label: 4
score: 0
--------------------- 3391 -----------------------
label: 2
score: 0
--------------------- 3392 -----------------------
label: 3
score: 1.0
--------------------- 3393 -----------------------
label: 2
score: 0
--------------------- 3394 -----------------------
label: 2
score: 0
--------------------- 3395 -----------------------
label: 1
score: 0
--------------------- 3396 -----------------------
label: 4
score: 0
--------------------- 3397 -----------------------
label: 4
score: 0
--------------------- 3398 -----------------------
label: 3
score: 0.5
--------------------- 3399 -----------------------
label: 1
score: 0
--------------------- 3400 -----------------------
label: 2
score: 1.0
--------------------- 3401 -----------------------
label: 2
score: 1.0
--------------------- 3402 -----------------------
label: 1
score: 1.0
--------------------- 34

--------------------- 3564 -----------------------
label: 4
score: 0
--------------------- 3565 -----------------------
label: 3
score: 0
--------------------- 3566 -----------------------
label: 4
score: 0
--------------------- 3567 -----------------------
label: 4
score: 0
--------------------- 3568 -----------------------
label: 2
score: 1.0
--------------------- 3569 -----------------------
label: 1
score: 0
--------------------- 3570 -----------------------
label: 4
score: 0
--------------------- 3571 -----------------------
label: 4
score: 0
--------------------- 3572 -----------------------
label: 3
score: 0
--------------------- 3573 -----------------------
label: 4
score: 0
--------------------- 3574 -----------------------
label: 2
score: 0
--------------------- 3575 -----------------------
label: 2
score: 0
--------------------- 3576 -----------------------
label: 4
score: 0
--------------------- 3577 -----------------------
label: 2
score: 1.0
--------------------- 3578 ---

--------------------- 3726 -----------------------
label: 1
score: 0
--------------------- 3727 -----------------------
label: 1
score: 0
--------------------- 3728 -----------------------
label: 1
score: 0.5
--------------------- 3729 -----------------------
label: 3
score: 0
--------------------- 3730 -----------------------
label: 1
score: 0
--------------------- 3731 -----------------------
label: 3
score: 0.3333333333333333
--------------------- 3732 -----------------------
label: 1
score: 0.5
--------------------- 3733 -----------------------
label: 1
score: 1.0
--------------------- 3734 -----------------------
label: 2
score: 0
--------------------- 3735 -----------------------
label: 1
score: 0.5
--------------------- 3736 -----------------------
label: 2
score: 0
--------------------- 3737 -----------------------
label: 1
score: 0
--------------------- 3738 -----------------------
label: 4
score: 0
--------------------- 3739 -----------------------
label: 2
score: 1.0
-------

--------------------- 3884 -----------------------
label: 3
score: 0.5
--------------------- 3885 -----------------------
label: 2
score: 1.0
--------------------- 3886 -----------------------
label: 1
score: 0
--------------------- 3887 -----------------------
label: 2
score: 0
--------------------- 3888 -----------------------
label: 2
score: 0
--------------------- 3889 -----------------------
label: 2
score: 1.0
--------------------- 3890 -----------------------
label: 3
score: 1.0
--------------------- 3891 -----------------------
label: 2
score: 0
--------------------- 3892 -----------------------
label: 2
score: 1.0
--------------------- 3893 -----------------------
label: 2
score: 0
--------------------- 3894 -----------------------
label: 3
score: 0
--------------------- 3895 -----------------------
label: 2
score: 0
--------------------- 3896 -----------------------
label: 3
score: 0
--------------------- 3897 -----------------------
label: 1
score: 0
--------------------- 38

label: 2
score: 0
--------------------- 4024 -----------------------
label: 1
score: 0.5
--------------------- 4025 -----------------------
label: 2
score: 0
--------------------- 4026 -----------------------
label: 2
score: 0
--------------------- 4027 -----------------------
label: 1
score: 0
--------------------- 4028 -----------------------
label: 2
score: 1.0
--------------------- 4029 -----------------------
label: 1
score: 1.0
--------------------- 4030 -----------------------
label: 3
score: 0
--------------------- 4031 -----------------------
label: 3
score: 0
--------------------- 4032 -----------------------
label: 3
score: 0
--------------------- 4033 -----------------------
label: 4
score: 0
--------------------- 4034 -----------------------
label: 1
score: 0
--------------------- 4035 -----------------------
label: 1
score: 0.5
--------------------- 4036 -----------------------
label: 1
score: 0
--------------------- 4037 -----------------------
label: 1
score: 0
--------

--------------------- 4149 -----------------------
label: 2
score: 0
--------------------- 4150 -----------------------
label: 1
score: 1.0
--------------------- 4151 -----------------------
label: 1
score: 0
--------------------- 4152 -----------------------
label: 3
score: 0
--------------------- 4153 -----------------------
label: 1
score: 1.0
--------------------- 4154 -----------------------
label: 1
score: 1.0
--------------------- 4155 -----------------------
label: 3
score: 1.0
--------------------- 4156 -----------------------
label: 1
score: 0
--------------------- 4157 -----------------------
label: 1
score: 1.0
--------------------- 4158 -----------------------
label: 2
score: 0
--------------------- 4159 -----------------------
label: 1
score: 1.0
--------------------- 4160 -----------------------
label: 2
score: 0
--------------------- 4161 -----------------------
label: 4
score: 1.0
--------------------- 4162 -----------------------
label: 2
score: 0
--------------------

--------------------- 4334 -----------------------
label: 1
score: 0
--------------------- 4335 -----------------------
label: 3
score: 0
--------------------- 4336 -----------------------
label: 1
score: 0
--------------------- 4337 -----------------------
label: 1
score: 1.0
--------------------- 4338 -----------------------
label: 2
score: 1.0
--------------------- 4339 -----------------------
label: 1
score: 0
--------------------- 4340 -----------------------
label: 1
score: 0
--------------------- 4341 -----------------------
label: 4
score: 0
--------------------- 4342 -----------------------
label: 2
score: 0
--------------------- 4343 -----------------------
label: 1
score: 0.6666666666666666
--------------------- 4344 -----------------------
label: 1
score: 0
--------------------- 4345 -----------------------
label: 1
score: 1.0
--------------------- 4346 -----------------------
label: 1
score: 0
--------------------- 4347 -----------------------
label: 1
score: 1.0
---------

--------------------- 4496 -----------------------
label: 2
score: 1.0
--------------------- 4497 -----------------------
label: 1
score: 1.0
--------------------- 4498 -----------------------
label: 1
score: 0
--------------------- 4499 -----------------------
label: 2
score: 0
--------------------- 4500 -----------------------
label: 4
score: 0
--------------------- 4501 -----------------------
label: 1
score: 1.0
--------------------- 4502 -----------------------
label: 2
score: 0
--------------------- 4503 -----------------------
label: 3
score: 0
--------------------- 4504 -----------------------
label: 1
score: 0.6666666666666666
--------------------- 4505 -----------------------
label: 3
score: 1.0
--------------------- 4506 -----------------------
label: 1
score: 1.0
--------------------- 4507 -----------------------
label: 1
score: 0
--------------------- 4508 -----------------------
label: 1
score: 1.0
--------------------- 4509 -----------------------
label: 1
score: 1.0
---

--------------------- 4636 -----------------------
label: 1
score: 1.0
--------------------- 4637 -----------------------
label: 2
score: 1.0
--------------------- 4638 -----------------------
label: 2
score: 1.0
--------------------- 4639 -----------------------
label: 2
score: 0
--------------------- 4640 -----------------------
label: 2
score: 0
--------------------- 4641 -----------------------
label: 1
score: 0
--------------------- 4642 -----------------------
label: 1
score: 0
--------------------- 4643 -----------------------
label: 1
score: 0
--------------------- 4644 -----------------------
label: 1
score: 0
--------------------- 4645 -----------------------
label: 2
score: 0
--------------------- 4646 -----------------------
label: 2
score: 1.0
--------------------- 4647 -----------------------
label: 1
score: 0
--------------------- 4648 -----------------------
label: 2
score: 0
--------------------- 4649 -----------------------
label: 1
score: 0
--------------------- 4650

--------------------- 4795 -----------------------
label: 1
score: 1.0
--------------------- 4796 -----------------------
label: 1
score: 0
--------------------- 4797 -----------------------
label: 4
score: 0
--------------------- 4798 -----------------------
label: 2
score: 0
--------------------- 4799 -----------------------
label: 1
score: 0
--------------------- 4800 -----------------------
label: 2
score: 0
--------------------- 4801 -----------------------
label: 1
score: 0.6666666666666666
--------------------- 4802 -----------------------
label: 4
score: 0.5
--------------------- 4803 -----------------------
label: 3
score: 0
--------------------- 4804 -----------------------
label: 2
score: 0
--------------------- 4805 -----------------------
label: 2
score: 0
--------------------- 4806 -----------------------
label: 1
score: 0
--------------------- 4807 -----------------------
label: 4
score: 0
--------------------- 4808 -----------------------
label: 1
score: 0
-------------

In [14]:
# Split the pairs by their overall rating (1.0 = most similar, 4.0 = least similar).
compare_4 = compare_df[compare_df['overall']==4.0]
compare_3 = compare_df[compare_df['overall']==3.0]
compare_2 = compare_df[compare_df['overall']==2.0]
compare_1 = compare_df[compare_df['overall']==1.0]
# Sanity check: the two counts agree only if every pair is rated exactly 1.0, 2.0, 3.0 or 4.0.
# Neither value is displayed, because a notebook cell only echoes its last expression.
len(compare_df)
len(compare_4) + len(compare_3) + len(compare_2) +len(compare_1)

i_list = [1.0,2.0,3.0,4.0]
compare = [compare_1,compare_2,compare_3,compare_4]
# For each rating, print the number of pairs and the count and percentage of pairs
# whose title NE score is 0, above 0.3, above 0.5, and exactly 1.0.
for i in i_list:
    index = int(i)-1  # rating 1.0 -> compare[0], ..., rating 4.0 -> compare[3]
    c = compare[index]
    print('Number of pairs that are rated as ',i,':')
    print(len(c))
    print('Pairs that are rated as ',i,', title named entity score is 0: (value, percentage)')
    print(len(c[c['title_ne_score']==0]), ', ', len(c[c['title_ne_score']==0])/len(c)*100,'%')
    print('Pairs that are rated as ',i,', title named entity score is > 0.3:')
    print(len(c[c['title_ne_score']>0.3]), ', ', len(c[c['title_ne_score']>0.3])/len(c)*100,'%')
    print('Pairs that are rated as ',i,', title named entity score is > 0.5:')
    print(len(c[c['title_ne_score']>0.5]), ', ', len(c[c['title_ne_score']>0.5])/len(c)*100,'%')
    print('Pairs that are rated as ',i,', title named entity score is 1.0:')
    print(len(c[c['title_ne_score']==1.0]), ', ', len(c[c['title_ne_score']==1.0])/len(c)*100,'%')

Number of pairs that are rated as  1.0 :
1443
Pairs that are rated as  1.0 , title named entity score is 0: (value, percentage)
930 ,  64.44906444906445 %
Pairs that are rated as  1.0 , title named entity score is > 0.3:
512 ,  35.48163548163548 %
Pairs that are rated as  1.0 , title named entity score is > 0.5:
380 ,  26.334026334026333 %
Pairs that are rated as  1.0 , title named entity score is 1.0:
362 ,  25.086625086625087 %
Number of pairs that are rated as  2.0 :
1124
Pairs that are rated as  2.0 , title named entity score is 0: (value, percentage)
792 ,  70.46263345195729 %
Pairs that are rated as  2.0 , title named entity score is > 0.3:
332 ,  29.537366548042705 %
Pairs that are rated as  2.0 , title named entity score is > 0.5:
227 ,  20.195729537366546 %
Pairs that are rated as  2.0 , title named entity score is 1.0:
220 ,  19.572953736654807 %
Number of pairs that are rated as  3.0 :
959
Pairs that are rated as  3.0 , title named entity score is 0: (value, percentage)
711 


### Remarks: <br>

As we expected, it is hard to tell whether NEs extracted from titles are a good feature for deciding whether two articles describe the same event. Some titles are not very descriptive: they tend to be sensationalistic or provocative and leave out the facts from which NEs could be extracted. Even when two titles refer to the same event, different news platforms may adopt different titling strategies: a title may be very short or very long, or may use an idiom, a catchphrase, a joke, or a pun, which makes titling very much a stylistic choice. This is understandable, since a successful title has to be striking and distinct from competing titles. <br><br>

One argument against using title NEs as a feature is that named entities appearing in the title are also likely to be mentioned in the text. Since we also extract NEs from the text and compare them pairwise, the title NE similarity score may be a largely redundant feature.
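
One way to probe this redundancy argument is to measure, per article, what share of the title NEs also appear among the body-text NEs. The snippet below is a minimal sketch, not part of the original pipeline: the helper name `title_ne_coverage`, the case-insensitive exact-string matching, and the example entity lists are all assumptions made for illustration.

```python
def title_ne_coverage(title_nes, text_nes):
    """Share of distinct title NEs that also occur among the body-text NEs.

    Hypothetical helper: entities are compared as lowercased, stripped strings,
    which is an assumption and ignores aliases or partial matches.
    Returns None when the title has no NEs, since the share is undefined.
    """
    title_set = {ne.strip().lower() for ne in title_nes}
    text_set = {ne.strip().lower() for ne in text_nes}
    if not title_set:
        return None
    return len(title_set & text_set) / len(title_set)


# Illustrative entity lists only (not taken from the evaluation dataset).
coverage = title_ne_coverage(
    ["Zomato", "Uber", "India"],
    ["zomato", "India", "Uber Eats"],
)
print(coverage)
```

A coverage close to 1 across most articles would support the view that the title score adds little beyond the body-text score, whereas a low coverage would suggest the two features capture different information.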

Nevertheless, the short analysis above shows a tendency: the more similar a pair is (rated 1.0 or close to it), the more often its title NE similarity score is high. About 25% of the most similar pairs (rated 1.0) have a title NE score of 1.0, compared with about 8% of the least similar pairs (rated 4.0).<br><br>
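
The per-rating comparison can also be laid out as a single table, which makes the trend across ratings easier to read. The following is a minimal sketch using pandas: the column names `overall` and `title_ne_score` follow the analysis cell above, while the helper name `summarize_title_ne` and the toy data are assumptions introduced for illustration only.

```python
import pandas as pd


def summarize_title_ne(df, rating_col="overall", score_col="title_ne_score"):
    """Per rating: number of pairs and percentage of pairs in each score band."""
    flags = pd.DataFrame({
        rating_col: df[rating_col],
        # Boolean masks cast to float so the group mean is the share of pairs.
        "pct_score_0": df[score_col].eq(0).astype(float),
        "pct_score_gt_0.5": df[score_col].gt(0.5).astype(float),
        "pct_score_1": df[score_col].eq(1.0).astype(float),
    })
    summary = flags.groupby(rating_col).mean() * 100
    summary.insert(0, "pairs", df.groupby(rating_col).size())
    return summary


# Toy data for illustration only (not the evaluation dataset).
toy = pd.DataFrame({
    "overall": [1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0],
    "title_ne_score": [1.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 1.0],
})
summary = summarize_title_ne(toy)
print(summary)
```

Calling `summarize_title_ne(compare_df)` on the real data would give one row per distinct rating value, so any ratings other than 1.0 to 4.0 would show up as extra rows rather than being silently dropped.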

To better judge whether this feature is necessary, we will also examine the similarity score of the named entities extracted from the body text of the articles (**insert DIRECTORY 17**).