In [1]:
import os

import pandas as pd

from preprocessing.InputDataset import BaseArticleDataset, FramingArticleDataset

DATA_DIR = os.path.join('..', '..', '..', 'data')



# Are there duplicates?

Load the datasets for all available languages.

In [2]:
languages = ('en', 'ru', 'it', 'fr', 'po', 'ge')

In [3]:
datasets = {}

for language in languages:
    datasets[language] = BaseArticleDataset(
        data_dir=DATA_DIR,
        language=language, subtask=2, split='train')
    datasets[language].separate_title_content()

433it [00:00, 19587.92it/s]
143it [00:00, 15318.23it/s]
227it [00:00, 22644.95it/s]
158it [00:00, 14515.71it/s]
145it [00:00, 14954.61it/s]
132it [00:00, 20547.34it/s]


## Summary of findings

 * Exact duplicates exist only in the English and German train datasets, and only for the following pairs of documents:
    * English:
        * 698092698 and 999000878
        * 832917532 (actual article) and 833032367 (comment on the article, probably best to remove this one)
    * German:
        * 224 has the wrong content. Its title is different, but it has the same content as document 225.
 * No other exact duplicates were detected at the other 4 units of analysis (title_and_5_sentences, title_and_10_sentences, title_and_first_paragraph, title_and_first_sentence_each_paragraph).
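
The per-language checks that follow can also be run in a single pass. The sketch below is a minimal illustration on toy data, not the pipeline used in this notebook: `count_exact_duplicates` and `toy_frames` are hypothetical names, and each frame is only assumed to have `title` and `content` columns like `datasets[language].df`.

```python
import pandas as pd


def count_exact_duplicates(frames: dict) -> pd.DataFrame:
    """Count rows whose title or content is shared with at least one other row."""
    rows = []
    for language, df in frames.items():
        rows.append({
            "language": language,
            # keep=False flags every member of a duplicate group, not just the repeats
            "duplicated_content": int(df["content"].duplicated(keep=False).sum()),
            "duplicated_title": int(df["title"].duplicated(keep=False).sum()),
        })
    return pd.DataFrame(rows).set_index("language")


# Toy stand-ins for datasets[language].df (hypothetical data)
toy_frames = {
    "en": pd.DataFrame(
        {"title": ["A", "A", "B"], "content": ["same text", "same text", "other"]},
        index=[1, 2, 3],
    ),
    "it": pd.DataFrame(
        {"title": ["C", "D"], "content": ["uno", "due"]},
        index=[4, 5],
    ),
}

summary = count_exact_duplicates(toy_frames)
print(summary)
```

With the real data, passing `{language: datasets[language].df for language in languages}` should give one row per language, assuming the datasets were loaded as in the cells above.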

## Check for duplicates in the datasets

### Check for exact content or title duplicates
#### English

In [4]:
datasets['en'].df[datasets['en'].df.content.duplicated(keep=False)]

id,raw_text,title,content
999000878,Las Vegas Shooting: A THIRD Timeline Emerges E...,Las Vegas Shooting: A THIRD Timeline Emerges E...,probably a reason why reporters Laura Loomer a...
698092698,Las Vegas Shooting: A THIRD Timeline Emerges E...,Las Vegas Shooting: A THIRD Timeline Emerges E...,probably a reason why reporters Laura Loomer a...


In [5]:
datasets['en'].df[datasets['en'].df.title.duplicated(keep=False)]

id,raw_text,title,content
765982381,Julian Assange\n\nDuring World War II Cardinal...,Julian Assange,During World War II Cardinal Jozsef Mindszenty...
723793978,America's Immigration Voice.\n\nSwedish PM doe...,America's Immigration Voice.,Swedish PM does not rule out use of army to en...
700461600,America's Immigration Voice.\n\nThe Kurds have...,America's Immigration Voice.,"The Kurds have no friends but the mountains, i..."
999000878,Las Vegas Shooting: A THIRD Timeline Emerges E...,Las Vegas Shooting: A THIRD Timeline Emerges E...,probably a reason why reporters Laura Loomer a...
706661242,America's Immigration Voice.\n\nEver thought t...,America's Immigration Voice.,Ever thought that the academic discipline of h...
765385479,Julian Assange\n\nThe persecution of Julian As...,Julian Assange,The persecution of Julian Assange must end. Or...
730269378,America's Immigration Voice.\n\nThanks for pub...,America's Immigration Voice.,Thanks for publicizing the race of Quentin Lam...
706088110,America's Immigration Voice.\n\nYears ago nume...,America's Immigration Voice.,Years ago numerous Indonesian Christians came ...
698092698,Las Vegas Shooting: A THIRD Timeline Emerges E...,Las Vegas Shooting: A THIRD Timeline Emerges E...,probably a reason why reporters Laura Loomer a...
832917532,Robert Mueller Not Recommending Any More Indic...,Robert Mueller Not Recommending Any More Indic...,Special counsel Robert Mueller will not recomm...


Some articles share a title because they come from the same blog or source, but their content differs.

The only real duplicates seem to be:
    - 698092698 and 999000878
    - 832917532 (actual article) and 833032367 (comment on the article, probably best to remove this one)
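
To separate shared blog titles from true duplicates automatically, one option is to test title and content together. This is a sketch on hypothetical toy data (`toy_df` is not part of the dataset); it relies only on the `subset` and `keep` parameters of `DataFrame.duplicated`.

```python
import pandas as pd

# Toy data (hypothetical): two posts share a blog title, two rows are true duplicates
toy_df = pd.DataFrame(
    {
        "title": ["Blog name", "Blog name", "Timeline", "Timeline"],
        "content": ["first post", "second post", "same body", "same body"],
    },
    index=pd.Index([10, 11, 12, 13], name="id"),
)

# Same title only: flags all four rows, including the two distinct blog posts
same_title = toy_df[toy_df["title"].duplicated(keep=False)]

# Same title AND same content: flags only the true duplicates
same_both = toy_df[toy_df.duplicated(subset=["title", "content"], keep=False)]

print(same_both.index.tolist())
```

A joint check like this would not catch a pair such as the article and its comment, whose contents differ; those still need manual inspection or fuzzy matching.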


In [6]:
print(datasets['en'].df.content.loc[832917532])

Special counsel Robert Mueller will not recommend any more indictments as part of his investigation, the Justice Department announced Friday evening.
A senior Justice Department official announced the development shortly after the special counsel submitted its final report to U.S. Attorney General William Barr.
Barr will now review the report and write his own report on Mueller’s findings and present them to Congress as soon as this weekend.
“I am reviewing the report and anticipate that I may be in a position to advise you of the special counsel’s principal conclusions as soon as this weekend,” the attorney general wrote in a letter to Republican and Democrat leaders on the House and Senate Judiciary Committees.
He also said at no time did the Justice Department prevent Mueller from any actions he sought to make during the course of his investigation.
President Trump’s initial reaction to news of the report’s delivery is that he is “glad it’s over,” reported ABC News.
In a separate st

In [7]:
print(datasets['en'].df.content.loc[833032367])

But of course, this makes no difference to the party of treason.
The coup will continue.
The New York Times instructs its goosestepping goons that:
the delivery of a report late Friday afternoon from Robert S. Mueller III, the special counsel, to Attorney General William P. Barr might seem like the conclusion of a long-running drama , but it is only the end of the beginning.Two and half years and billions of dollars because these thumbsucking traitors lost the election.
ROBERT MUELLER NOT RECOMMENDING ANY MORE INDICTMENTS IN RUSSIA PROBE take our poll - story continues below
Do you think Democrats will push out Representative Ilhan Omar over her anti-Semitism?
Do you think Democrats will push out Representative Ilhan Omar over her anti-Semitism?
Do you think Democrats will push out Representative Ilhan Omar over her anti-Semitism?
* Yes, they're supposedly the party against hate, so they have to.
No, intersectional politics rule the Democrat party, so Omar wins.
I don't really care wha

#### Italian

In [8]:
datasets['it'].df[datasets['it'].df.content.duplicated(keep=False)]

id,raw_text,title,content


In [9]:
datasets['it'].df[datasets['it'].df.title.duplicated(keep=False)]

id,raw_text,title,content


#### Russian

In [10]:
datasets['ru'].df[datasets['ru'].df.content.duplicated(keep=False)]

id,raw_text,title,content


In [11]:
datasets['ru'].df[datasets['ru'].df.title.duplicated(keep=False)]

id,raw_text,title,content


Neither Italian nor Russian contains exact content or title duplicates.

#### French

In [12]:
datasets['fr'].df[datasets['fr'].df.content.duplicated(keep=False)]

id,raw_text,title,content


In [13]:
datasets['fr'].df[datasets['fr'].df.title.duplicated(keep=False)]

id,raw_text,title,content


#### Polish

In [14]:
datasets['po'].df[datasets['po'].df.content.duplicated(keep=False)]

id,raw_text,title,content


In [15]:
datasets['po'].df[datasets['po'].df.title.duplicated(keep=False)]

id,raw_text,title,content
25151,Srebro na niedzielę (bo milczenie jest złotem)...,Srebro na niedzielę (bo milczenie jest złotem),Francja zapłaci\n\nLiderka Zjednoczenia Narodo...
25152,Srebro na niedzielę (bo milczenie jest złotem)...,Srebro na niedzielę (bo milczenie jest złotem),Zaradny Richard Henry\n\nEuroposeł Ryszard Cza...


#### German

In [16]:
datasets['ge'].df[datasets['ge'].df.content.duplicated(keep=False)]

id,raw_text,title,content
225,Europol ist jetzt vollautorisiert zur EU-weite...,Europol ist jetzt vollautorisiert zur EU-weite...,Am Mittwoch verabschiedete das EU-Parlament mi...
224,AfD: Allein in Hamburg: Eine halbe Milliarde E...,AfD: Allein in Hamburg: Eine halbe Milliarde E...,Am Mittwoch verabschiedete das EU-Parlament mi...


In [17]:
datasets['ge'].df[datasets['ge'].df.title.duplicated(keep=False)]

id,raw_text,title,content


In [18]:
print(datasets['ge'].df.loc[225].title)
print(datasets['ge'].df.loc[225].content)


Europol ist jetzt vollautorisiert zur EU-weiten Massenüberwachung
Am Mittwoch verabschiedete das EU-Parlament mit 480 zu 143 Stimmen bei 20 Enthaltungen einen Entwurf zur Reform des europäischen Polizeiamts Europol. Dieses hat fortan die Befugnis, umfangreiche und komplexe Datensätze zu verarbeiten, um die Mitgliedstaaten beim Kampf „gegen Schwerkriminalität und Terrorismus” zu unterstützen – bzw. gegen alles, was nach „Landessitte“ und nationaler Auslegung darunter fällt. In Deutschland kann man eingedenk der Prioritäten einer linksradikalen Bundesinnenministerin also gewiss sein, dass fortan auch Meinungsverbrecher und als „Rechte“ kriminalisierte Dissidenten in diese Kategorien fallen – mit den entsprechenden, nunmehr legalisierten Entrechtungsfolgen.

Damit wird die Macht der Behörde mit Sitz in Den Haag noch einmal erheblich gesteigert: Denn Europol wird vor allem von den nationalen Strafverfolgungsbehörden mit riesigen Datenmengen beliefert. Dem EU-Datenschutzbeauftragten Wojciec

In [19]:
print(datasets['ge'].df.loc[224].title)

AfD: Allein in Hamburg: Eine halbe Milliarde Euro für die Gesundheitsversorgung von Asylbewerbern!


In [20]:
print(datasets['ge'].df.loc[224].content)

Am Mittwoch verabschiedete das EU-Parlament mit 480 zu 143 Stimmen bei 20 Enthaltungen einen Entwurf zur Reform des europäischen Polizeiamts Europol. Dieses hat fortan die Befugnis, umfangreiche und komplexe Datensätze zu verarbeiten, um die Mitgliedstaaten beim Kampf „gegen Schwerkriminalität und Terrorismus” zu unterstützen – bzw. gegen alles, was nach „Landessitte“ und nationaler Auslegung darunter fällt. In Deutschland kann man eingedenk der Prioritäten einer linksradikalen Bundesinnenministerin also gewiss sein, dass fortan auch Meinungsverbrecher und als „Rechte“ kriminalisierte Dissidenten in diese Kategorien fallen – mit den entsprechenden, nunmehr legalisierten Entrechtungsfolgen.

Damit wird die Macht der Behörde mit Sitz in Den Haag noch einmal erheblich gesteigert: Denn Europol wird vor allem von den nationalen Strafverfolgungsbehörden mit riesigen Datenmengen beliefert. Dem EU-Datenschutzbeauftragten Wojciech Wiewiórowski sind die ungeheuren Informationsströme, die bei Eur

The content of article 224 appears to be wrong: it somehow contains the text of article 225. Let's check whether the two articles have the same labels.

In [21]:
ge_dataset = FramingArticleDataset(
    data_dir=DATA_DIR,
    language='ge', subtask=2, split='train',
    remove_duplicates=False
)

132it [00:00, 17101.63it/s]


In [22]:
print(ge_dataset.df.loc[224].frames)
print(ge_dataset.df.loc[225].frames)
ge_dataset.df.loc[224].frames == ge_dataset.df.loc[225].frames

Legality_Constitutionality_and_jurisprudence,Policy_prescription_and_evaluation,External_regulation_and_reputation,Capacity_and_resources,Health_and_safety,Political,Security_and_defense
Legality_Constitutionality_and_jurisprudence,Policy_prescription_and_evaluation,External_regulation_and_reputation,Security_and_defense,Political,Crime_and_punishment


False

Their labels are indeed different. It seems document 224 was simply parsed incorrectly.
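
The same check can be generalised to find every group of rows with identical content but conflicting labels. The following is a minimal sketch on hypothetical toy data; it only assumes a frame with `content` and `frames` columns, where `frames` is a comma-separated string as printed above.

```python
import pandas as pd

# Toy data (hypothetical), mimicking the `content` and `frames` columns
toy_df = pd.DataFrame(
    {
        "content": ["text a", "text a", "text b", "text c", "text c"],
        "frames": [
            "Political",
            "Political,Health_and_safety",
            "Political",
            "Crime_and_punishment",
            "Crime_and_punishment",
        ],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="id"),
)

# Keep only rows whose content appears more than once
dupes = toy_df[toy_df["content"].duplicated(keep=False)]

# Number of distinct label strings within each group of identical content
labels_per_content = dupes.groupby("content")["frames"].transform("nunique")

# Rows with identical content but conflicting labels
conflicting = dupes[labels_per_content > 1]
print(conflicting.index.tolist())
```

Because the labels are compared as plain strings, two rows listing the same frames in a different order would also be reported; splitting on `,` and comparing sets would avoid that.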

## Let's look for duplicates with fuzzy matching

In [23]:
from fuzzywuzzy import fuzz

def naive_pairwise_comparisson(df: pd.DataFrame, col_name: str, comp_fn: callable) -> pd.DataFrame:
    # Compare every ordered pair of distinct rows: O(n^2) calls to comp_fn
    duplicates = []
    for id_i, row_i in df.iterrows():
        for id_j, row_j in df.drop(id_i).iterrows():
            if comp_fn(row_i[col_name], row_j[col_name]):
                # list.append takes a single argument, so store one tuple per match
                duplicates.append((id_i, row_i[col_name], id_j, row_j[col_name]))

    return pd.DataFrame(duplicates, columns=['id_1', 'raw_text_1', 'id_2', 'raw_text_2'])



In [24]:
# Assumption: `semeval_data` is not defined anywhere in this notebook,
# so the English dataset loaded above is used instead.
naive_pairwise_comparisson(df=datasets['en'].df, col_name="raw_text",
                           comp_fn=lambda str1, str2: fuzz.partial_ratio(str1, str2) > 90)
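
As a dependency-free alternative, the pairwise search can be sketched with the standard library. This is an illustration under stated assumptions, not the method used above: `similar_pairs` and `toy_df` are hypothetical names, and `difflib.SequenceMatcher.ratio` (a 0 to 1 score over the whole strings) is swapped in for `fuzz.partial_ratio` (a 0 to 100 score over the best-matching substring), so the two will not flag exactly the same pairs. `itertools.combinations` visits each unordered pair once instead of twice.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd


def similar_pairs(df: pd.DataFrame, col_name: str, threshold: float = 0.9) -> pd.DataFrame:
    """Return pairs of rows whose texts have a similarity ratio above threshold."""
    matches = []
    # combinations yields each unordered pair once, so (a, b) and (b, a) are not both tested
    for (id_1, text_1), (id_2, text_2) in combinations(df[col_name].items(), 2):
        ratio = SequenceMatcher(None, text_1, text_2).ratio()
        if ratio > threshold:
            matches.append((id_1, id_2, ratio))
    return pd.DataFrame(matches, columns=["id_1", "id_2", "ratio"])


# Toy data (hypothetical): rows 1 and 2 differ only in the final character
toy_df = pd.DataFrame(
    {"raw_text": [
        "The parliament adopted the draft reform on Wednesday.",
        "The parliament adopted the draft reform on Wednesday!",
        "A completely unrelated sentence about football.",
    ]},
    index=pd.Index([1, 2, 3], name="id"),
)

near_duplicates = similar_pairs(toy_df, "raw_text")
print(near_duplicates)
```

The comparison is still quadratic in the number of documents, and `SequenceMatcher` is slow on full-length articles, so on the real data it may be worth comparing only a prefix of each text.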
