# Demo
## Quickstart

In [3]:
from pyonion.remover import ListCorpusProvider
from pyonion.remover import DuplicateRemover, CleaningMode

documents = [
    'The cat sat on the large mat',
    'The cat sat on the large rug',
]
corpus = ListCorpusProvider(documents)

remover = DuplicateRemover(n_gram=5)
duplicated_ngrams = remover.find_duplicated_ngrams(corpus)
duplicated_ngrams

{'The_cat_sat_on_the', 'cat_sat_on_the_large'}

In [4]:
iter_clean_corpus = remover.iter_clean_text(corpus, duplicated_ngrams, threshold=.2, mode=CleaningMode.FIRST)
clean_corpus = list(iter_clean_corpus)
clean_corpus

[('The cat sat on the large mat', 0.0), ('', 0.6666666666666666)]
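
To see where the 0.67 comes from, here is a minimal pure-Python sketch of the idea. It is not pyonion's implementation: the helper names (`ngrams`, `find_duplicated`, `resemblance`) are made up for illustration, and I'm assuming whitespace tokenisation and that resemblance is the fraction of a document's n-grams that are duplicated, which is consistent with the output above.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of `text`, joined with underscores."""
    tokens = text.split()  # assumption: plain whitespace tokenisation
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def find_duplicated(docs, n):
    """Return the set of n-grams that occur more than once in the corpus."""
    counts = Counter(gram for doc in docs for gram in ngrams(doc, n))
    return {gram for gram, count in counts.items() if count > 1}


def resemblance(doc, duplicated, n):
    """Fraction of the document's n-grams that are in `duplicated`."""
    grams = ngrams(doc, n)
    if not grams:
        return 0.0
    return sum(gram in duplicated for gram in grams) / len(grams)


docs = [
    'The cat sat on the large mat',
    'The cat sat on the large rug',
]
dup = find_duplicated(docs, 5)
print(sorted(dup))                   # ['The_cat_sat_on_the', 'cat_sat_on_the_large']
print(resemblance(docs[1], dup, 5))  # 0.6666666666666666
```

Each document has three 5-grams. Two of the second document's 5-grams also occur in the first, hence 2/3.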

## Using the document provider
I've collated some recent Guardian articles into a short list to demonstrate the process in a more realistic scenario.

Note that documents 3 and 5 are near duplicates: only a couple of words have been changed. Documents 2 and 4 are exact duplicates.

In [5]:
from pyonion.remover import FileCorpusProvider, CleaningMode

In [7]:
corpus = FileCorpusProvider('pyonion/data/demo_data.txt')
for i, doc in enumerate(corpus.iter_docs()):
    print(f"Document {i}\n*******************")
    print(doc, end='\n\n')

Document 0
*******************
Japan’s biggest airline has started offering luxury dining aboard a parked airplane titled the “winged restaurant,” for £390 a meal.

Diners rushed to relive the cabin dining experience on Wednesday, despite being unable to travel due to the pandemic.

All Nippon Airways (ANA) is offering “passengers” a choice between a first-class seat with a meal for 59,800 yen (£391) and a business-class option for about half the price, at 29,800 yen, on board a stationary Boeing-777 at Haneda airport in Tokyo. They are asked to choose in advance from three menus: Japanese-style, western-style beef or western-style fish, served with wine.

The chef speaks with a customer on a parked All Nippon Airways plane at Haneda airport in Tokyo.
The chef speaks with a customer on a parked All Nippon Airways plane at Haneda airport in Tokyo. Photograph: All Nippon AIrways (ANA)/AFP/Getty Images
Yosuke Kimoto, 42, who had a business-class meal with his 14-year-old son, told Kyodo N

Let's find the duplicated 10-grams and have a look at some of them.

In [8]:
remover = DuplicateRemover(n_gram=10)
duplicated = remover.find_duplicated_ngrams(corpus)
list(duplicated)[:10]

['to_international_criticism_As_China_engages_in_international_disputes_ranging',
 'diplomats_to_international_criticism_As_China_engages_in_international_disputes',
 'injustice_in_which_the_lamb_is_falsely_accused_and_killed',
 'Aesop_s_Fable_from_the_Twitter_account_of_China_s',
 'over_two_weeks_When_police_arrived_at_the_two_story',
 'fervently_conducted_online_Thursday_s_tweet_pushed_back_on_such',
 'at_the_two_story_structure_around_5_30pm_shots_were',
 'belligerent_and_aggressive_style_of_communication_that_is_most_fervently',
 'and_the_gunman_critically_wounded_police_said_The_violence_in',
 '30pm_shots_were_being_fired_Orange_police_lieutenant_Jennifer_Amat']
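
The underscores where apostrophes and hyphens used to be (`Aesop_s_Fable`, `two_story`, `5_30pm`) suggest that punctuation is dropped before the n-grams are built. A sketch that reproduces this, assuming a token is any run of word characters. The regex and the `tokenize` helper are my assumptions, not taken from pyonion, and the sample is a fragment of document 3 with its apostrophes restored by guesswork.

```python
import re


def tokenize(text):
    # Assumption: a token is a run of letters, digits or underscores,
    # so apostrophes, hyphens and full stops act as separators.
    return re.findall(r"\w+", text)


def ngrams(text, n):
    """Return the word n-grams of `text`, joined with underscores."""
    tokens = tokenize(text)
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "Aesop's Fable from the Twitter account of China's embassy"
print(ngrams(sample, 10)[0])
# Aesop_s_Fable_from_the_Twitter_account_of_China_s
```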

Let's now use our duplicate remover to strip the set down to only high-quality documents.

In [9]:
clean_docs = list(remover.iter_clean_text(
    corpus,
    duplicated_ngrams=duplicated,
    threshold=.9,
    mode=CleaningMode.FIRST,
))

If we look at document 5, we see it has been removed because its resemblance is very high, well above the 0.9 threshold.

In [10]:
clean_docs[5]

('', 0.9942528735632183)

Document 3 has not been touched and has a resemblance of 0. This is because the cleaning mode was set to `CleaningMode.FIRST`, meaning the first occurrence isn't treated as a duplicate.

In [11]:
clean_docs[3]

('A butchered Aesop s Fable from the Twitter account of China s embassy in Ireland has drawn mirth from observers and highlighted the growing sensitivity of Chinese diplomats to international criticism As China engages in international disputes ranging from fist fights with Taiwanese officials to trade sanctions to threats of conflict the behaviour of its foreign officials has earned the nickname wolf warrior diplomacy a belligerent and aggressive style of communication that is most fervently conducted online Thursday s tweet pushed back on such accusations but appeared to lose something in translation as the author navigated English allegories and the need to maintain an image of Chinese strength Advertisement Riffing on the fable of the Wolf and the Lamb a story of tyrannical injustice in which the lamb is falsely accused and killed Thursday s post queried Who is the wolf It continued Some people accused China for so called wolf warrior diplomacy In his well known fable Aesop describ
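
The behaviour of the `FIRST` mode described above can be sketched as follows. Again, this is an illustration under assumptions, not pyonion's code: `iter_clean_first` is a hypothetical helper, tokens are split on whitespace, and I assume a document is blanked when its score is strictly greater than the threshold. The key idea is that an n-gram only counts against a document if an earlier document has already contained it.

```python
def iter_clean_first(docs, duplicated, n, threshold):
    """Yield (text, score) pairs; only later occurrences count as duplicates."""
    seen = set()  # n-grams found in the documents processed so far
    for doc in docs:
        tokens = doc.split()
        grams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        # A duplicated n-gram is only penalised if an earlier document had it
        repeats = sum(1 for gram in grams if gram in duplicated and gram in seen)
        score = repeats / len(grams) if grams else 0.0
        seen.update(grams)
        # Assumption: blank the document when the score exceeds the threshold
        yield ("" if score > threshold else doc, score)


docs = [
    'The cat sat on the large mat',
    'The cat sat on the large rug',
]
duplicated = {'The_cat_sat_on_the', 'cat_sat_on_the_large'}
result = list(iter_clean_first(docs, duplicated, n=5, threshold=.2))
print(result)
# [('The cat sat on the large mat', 0.0), ('', 0.6666666666666666)]
```

On the Quickstart corpus this gives the same result as `iter_clean_text`: the first document scores 0 because nothing has been seen before it, and the second is blanked.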