# Are there duplicates?

In [16]:
import os

from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from InputDataset import BaseArticleDataset, FramingArticleDataset


semeval_data_en = BaseArticleDataset(
    data_dir=os.path.join('..', '..', 'data'),
    language='en', subtask=2, split='train')
semeval_data_en.separate_title_content()

semeval_data_ru = BaseArticleDataset(
    data_dir=os.path.join('..', '..', 'data'),
    language='ru', subtask=2, split='train')
semeval_data_ru.separate_title_content()

semeval_data_it = BaseArticleDataset(
    data_dir=os.path.join('..', '..', 'data'),
    language='it', subtask=2, split='train')
semeval_data_it.separate_title_content()

433it [00:00, 29436.81it/s]
143it [00:00, 16141.06it/s]
227it [00:00, 21674.26it/s]


## Summary of findings

 * Duplicates were found only in the English train dataset, in two pairs of documents:
    * 698092698 and 999000878
    * 832917532 (the actual article) and 833032367 (a comment on the article; probably best to remove this one)
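
If the duplicates are removed before training, the step could look like the sketch below. The DataFrame is a toy stand-in for `semeval_data_en.df` with placeholder texts, and the choice of which id to drop from the first pair is an assumption, not something the analysis decides.

```python
import pandas as pd

# Toy stand-in for semeval_data_en.df (indexed by article id); texts are placeholders.
df = pd.DataFrame(
    {"content": ["article", "comment on the article", "", "\n", "other"]},
    index=pd.Index(
        [832917532, 833032367, 698092698, 999000878, 700461600], name="id"
    ),
)

# Assumption: drop the comment (833032367) and, arbitrarily, the second
# document of the other pair (999000878).
ids_to_drop = [833032367, 999000878]
deduplicated = df.drop(index=ids_to_drop)

print(deduplicated.index.tolist())
```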

## Check for duplicates in the datasets

### Check for exact content or title duplicates
#### English

In [10]:
semeval_data_en.df[semeval_data_en.df.content.duplicated(keep=False)]

id,raw_text,title,content


In [11]:
semeval_data_en.df[semeval_data_en.df.title.duplicated(keep=False)]

id,raw_text,title,content
765982381,Julian Assange\n\nDuring World War II Cardinal...,Julian Assange,During World War II Cardinal Jozsef Mindszenty...
723793978,America's Immigration Voice.\n\nSwedish PM doe...,America's Immigration Voice.,Swedish PM does not rule out use of army to en...
700461600,America's Immigration Voice.\n\nThe Kurds have...,America's Immigration Voice.,"The Kurds have no friends but the mountains, i..."
999000878,Las Vegas Shooting: A THIRD Timeline Emerges E...,Las Vegas Shooting: A THIRD Timeline Emerges E...,
706661242,America's Immigration Voice.\n\nEver thought t...,America's Immigration Voice.,Ever thought that the academic discipline of h...
765385479,Julian Assange\n\nThe persecution of Julian As...,Julian Assange,The persecution of Julian Assange must end. Or...
730269378,America's Immigration Voice.\n\nThanks for pub...,America's Immigration Voice.,Thanks for publicizing the race of Quentin Lam...
706088110,America's Immigration Voice.\n\nYears ago nume...,America's Immigration Voice.,Years ago numerous Indonesian Christians came ...
698092698,Las Vegas Shooting: A THIRD Timeline Emerges E...,Las Vegas Shooting: A THIRD Timeline Emerges E...,\n
832917532,Robert Mueller Not Recommending Any More Indic...,Robert Mueller Not Recommending Any More Indic...,Special counsel Robert Mueller will not recomm...


Some articles share a title because they come from the same blog or source, but their content differs.

The only real duplicates seem to be:
 * 698092698 and 999000878
 * 832917532 (the actual article) and 833032367 (a comment on the article; probably best to remove this one)
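
The title table shows the content of 999000878 as empty and that of 698092698 as a single newline, which would explain why `content.duplicated` returns nothing for this pair. A minimal sketch of a whitespace-insensitive check, on a toy DataFrame with made-up ids and texts:

```python
import pandas as pd

# Toy data: rows 1 and 2 share a title and differ only by whitespace in content;
# rows 3 and 4 share a title (same blog) but have different content.
df = pd.DataFrame(
    {
        "title": ["Same headline", "Same headline", "Blog name", "Blog name"],
        "content": ["", "\n", "First post.", "Second post."],
    },
    index=pd.Index([1, 2, 3, 4], name="id"),
)

# Exact comparison misses the pair because "" != "\n".
exact = df[df.content.duplicated(keep=False)]

# Strip surrounding whitespace, then require both title and content to match.
stripped = df.assign(content=df.content.str.strip())
near_exact = df[stripped.duplicated(subset=["title", "content"], keep=False)]

print(near_exact.index.tolist())
```

Requiring the title to match as well keeps documents that merely share an empty body from being flagged.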


In [14]:
print(semeval_data_en.df.content.loc[832917532])

Special counsel Robert Mueller will not recommend any more indictments as part of his investigation, the Justice Department announced Friday evening.
A senior Justice Department official announced the development shortly after the special counsel submitted its final report to U.S. Attorney General William Barr.
Barr will now review the report and write his own report on Mueller’s findings and present them to Congress as soon as this weekend.
“I am reviewing the report and anticipate that I may be in a position to advise you of the special counsel’s principal conclusions as soon as this weekend,” the attorney general wrote in a letter to Republican and Democrat leaders on the House and Senate Judiciary Committees.
He also said at no time did the Justice Department prevent Mueller from any actions he sought to make during the course of his investigation.
President Trump’s initial reaction to news of the report’s delivery is that he is “glad it’s over,” reported ABC News.
In a separate st

In [13]:
print(semeval_data_en.df.content.loc[833032367])

But of course, this makes no difference to the party of treason.
The coup will continue.
The New York Times instructs its goosestepping goons that:
the delivery of a report late Friday afternoon from Robert S. Mueller III, the special counsel, to Attorney General William P. Barr might seem like the conclusion of a long-running drama , but it is only the end of the beginning.Two and half years and billions of dollars because these thumbsucking traitors lost the election.
ROBERT MUELLER NOT RECOMMENDING ANY MORE INDICTMENTS IN RUSSIA PROBE take our poll - story continues below
Do you think Democrats will push out Representative Ilhan Omar over her anti-Semitism?
Do you think Democrats will push out Representative Ilhan Omar over her anti-Semitism?
Do you think Democrats will push out Representative Ilhan Omar over her anti-Semitism?
* Yes, they're supposedly the party against hate, so they have to.
No, intersectional politics rule the Democrat party, so Omar wins.
I don't really care wha

#### Italian

In [17]:
semeval_data_it.df[semeval_data_it.df.content.duplicated(keep=False)]

id,raw_text,title,content


In [19]:
semeval_data_it.df[semeval_data_it.df.title.duplicated(keep=False)]

id,raw_text,title,content


#### Russian

In [21]:
semeval_data_ru.df[semeval_data_ru.df.content.duplicated(keep=False)]

id,raw_text,title,content


In [22]:
semeval_data_ru.df[semeval_data_ru.df.title.duplicated(keep=False)]

id,raw_text,title,content


The Italian and Russian datasets do not seem to contain exact duplicates, in either title or content.
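
The per-language checks above repeat the same two calls, so they could be collected into one summary table. This is only a sketch: `count_exact_duplicates` and the `frames` dictionary are hypothetical names, and the toy frames stand in for `semeval_data_en.df`, `semeval_data_it.df` and `semeval_data_ru.df`.

```python
import pandas as pd

def count_exact_duplicates(df, columns=("title", "content")):
    """Count the rows whose value in each column also appears in another row."""
    return {col: int(df[col].duplicated(keep=False).sum()) for col in columns}

# Toy frames; in the notebook these would be the three loaded datasets.
frames = {
    "en": pd.DataFrame({"title": ["A", "A", "B"], "content": ["x", "y", "z"]}),
    "it": pd.DataFrame({"title": ["C", "D"], "content": ["u", "v"]}),
}

# One row per language, one column per checked field.
summary = pd.DataFrame.from_dict(
    {lang: count_exact_duplicates(frame) for lang, frame in frames.items()},
    orient="index",
)
print(summary)
```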

## Let's look for duplicates with fuzzy matching

In [37]:
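from typing import Callable

from fuzzywuzzy import fuzz

def naive_pairwise_comparisson(df: pd.DataFrame, col_name: str,
                               comp_fn: Callable[[str, str], bool]) -> pd.DataFrame:
    # Compares every ordered pair of distinct rows: O(n^2) calls to comp_fn.
    duplicates = []
    for id_i, row_i in df.iterrows():
        for id_j, row_j in df.drop(id_i).iterrows():
            if comp_fn(row_i[col_name], row_j[col_name]):
                # list.append takes a single argument, so store each match as one tuple
                duplicates.append((id_i, row_i[col_name], id_j, row_j[col_name]))
                print(id_i, id_j)  # report each match as soon as it is found

    return pd.DataFrame(duplicates, columns=['id_1', 'raw_text_1', 'id_2', 'raw_text_2'])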



In [38]:
naive_pairwise_comparisson(df=semeval_data_en.df, col_name="raw_text",
                           comp_fn=lambda str1, str2: fuzz.partial_ratio(str1, str2) > 90)

KeyboardInterrupt: 
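
The run was interrupted because the naive loop calls `fuzz.partial_ratio` on full article texts for every ordered pair of rows. If the 433 documents reported by the first progress bar are the English set, that is 433 × 432 = 187,056 calls, and each pair is compared twice. One possible cheaper first pass, sketched here with the standard library only, is to compare sets of word trigrams and keep pairs where the smaller document is almost entirely contained in the larger one. This is a different technique from `fuzz.partial_ratio`, not a drop-in replacement; the function names, the trigram size and the 0.9 threshold are all assumptions, and the texts are made up.

```python
from itertools import combinations

def shingles(text, n=3):
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(a, b):
    """Share of the smaller shingle set that also appears in the larger one."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def candidate_pairs(texts, threshold=0.9, n=3):
    """Return unordered id pairs whose shingle containment reaches the threshold."""
    sets = {doc_id: shingles(text, n) for doc_id, text in texts.items()}
    return [
        (id_1, id_2)
        for id_1, id_2 in combinations(sets, 2)  # each pair is visited once
        if containment(sets[id_1], sets[id_2]) >= threshold
    ]

# Toy documents: 2 quotes 1 in full, 3 is unrelated.
texts = {
    1: "the special counsel will not recommend any more indictments",
    2: "breaking news the special counsel will not recommend any more "
       "indictments says report",
    3: "a completely unrelated story about the weather in rome today",
}
print(candidate_pairs(texts))
```

In the notebook, `texts` could be built with `semeval_data_en.df.raw_text.to_dict()`, and the surviving candidate pairs could then be checked with `fuzz.partial_ratio`, which keeps the expensive comparison to a short list.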
