# Document Similarity (English)

In this notebook, you will use the DocumentSimilarity tool to identify similar documents in the English language and decide whether to keep or remove them from the corpus.  

**Note:** this tool uses [MinHash](https://ekzhu.com/datasketch/minhash.html) to estimate the Jaccard similarity between sets of documents. MinHash is introduced by Andrei Z. Broder in this [paper](https://cs.brown.edu/courses/cs253/papers/nearduplicate.pdf).

<div class="alert alert-block alert-warning">
<b>Jupyter Notebook User Guide</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

### Document Similarity User Guide

For instructions on how to use the Document Similarity tool, please refer to the [Document Similarity User Guide](documents/docsim-help-pages.pdf).

## 1. Setup
Before you begin, you need to import the DocumentSimilarity package and the necessary libraries and initiate them to run in this notebook.

In [1]:
# import the DocumentSimilarity tool
print('Loading DocumentSimilarity...')
from atap_corpus_loader import CorpusLoader
from document_similarity import DocumentSimilarity

# initialize the DocumentSimilarity
ds = DocumentSimilarity()
print('Finished loading.')

Loading DocumentSimilarity...


Finished loading.


## 2. Load the data
This notebook will allow you to upload text data in a text file (or a number of text files). Alternatively, you can also upload text inside a text column inside your excel spreadsheet ([see an example here](https://github.com/Sydney-Informatics-Hub/HASS-29_Quotation_Tool/blob/main/documents/sample_texts.xlsx)).  

<table style='margin-left: 10px'><tr>
<td> <img src='./img/txt_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/xlsx_icon.png' style='width: 55px'/> </td>
<td> <img src='./img/csv_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/zip_icon.png' style='width: 45px'/> </td>
</tr></table>

<div class="alert alert-block alert-warning">
    <td> <img src='./img/file_pane.jpg' style='width: 300px'/> </td>
    <td> <img src='./img/corpus_loader.png' style='width: 600px'/> </td>
    
<b>Uploading your text files</b> 
    
You can now simply **drag and drop** your file to the top left pane (file explorer pane) of the JupyterLab (this notebook interface), in order to upload to the cloud instance, then the supported file types will be displayed in the **ATAP corpus loader UI** after executing the following cell. Select the file(s) you want to build as a corpus, click on the button "Load as corpus", then select the right data label for your text contents before building the selected files as a corpus. Once a corpus is successfully built, you can continue with the rest of the notebook to run the Document Similarity Tool with your corpus.
</div>

In [2]:
corpus_loader = CorpusLoader(".")
corpus_loader.set_build_callback(ds.set_text_df, corpus_loader)
corpus_loader.servable()

<div class="alert alert-block alert-warning">
    
**Automatic deduplication of identical documents within the corpus**
    
The Document Similarity Tool is designed to find the documents in your corpus that are similar, but not 100% identical. As a first step, the tool will therefore aim to identify all identical documents in the corpus and undertake an automatic deduplication. For these identical documents, only the first document (according to alphabetical order by “text_name” or filename) will be retained in the corpus. The Jaccard-based similarity analysis below is then only run on the deduplicated version of the corpus to avoid including identical documents in the pairwise display. You can see the names of all identical documents in your corpus by executing the following cell, which allows you to export the relevant table as a CSV file. This table provides the filename of the retained file in the ‘kept’ column and the file names of the relevant identical (excluded) files in subsequent numbered columns. For example, the column ‘1’ contains the file name of the first duplicate of the file in the ‘kept’ column, and so on – this depends on the number of duplicates identified.

</div>

In [3]:
ds.identical_docs()

1779 duplicated files in 791 groups are found. The first file of each group 791 are kept in the corpus and all other 988 files are removed and the results can be checked in the following spreadsheet.


In [4]:
# display uploaded text
n=5

ds.text_df.head()

Unnamed: 0,text,text_name,text_id
0,Bet on collette IF anyone can get Australian f...,AD080100672,a755d6d87d299fe03829cd8b2d0e071f
1,No headline in original Second-hand DVD trader...,AD080100673,556bbe5a9932ef549fbff071937dc218
2,Obese aussies don't believe they're fat MORE t...,AD080100674,6eee7e429fa9ceb4b12d5939e7e86bf9
3,Fitness fighting the flab Baby boomers across ...,AD080100675,c9177ea1413136ca2d8fa149799a8dce
4,"Saving money, lives FURTHER to ""Fast-food addi...",AD080100676,4a669596ca788fc99b40b6682b74345c


## 3. Calculate Document Similarity
Once your texts have been uploaded, you can begin to calculate the similarity between documents in the corpus. 

<div class="alert alert-block alert-info">
<b>Tools:</b>    

- MinHash: fast implementation of estimating Jaccard similarity between documents in the corpus.  
- Gensim: to tokenize the text.  
    
<b>Note:</b> in general, Gensim splits the text whenever whitespace or punctuation is encountered and digits are excluded, e.g., the text "Here's to victory no 2" will be tokenized into five tokens: "Here", "s", "to", "victory" and "no". For more information, please visit [this page](https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize).
</div>

<div class="alert alert-block alert-danger">
<b>Memory limitation in Binder</b> 
    
The free Binder deployment is only guaranteed a maximum of 2GB memory. Processing very large text files may cause the session (kernel) to re-start due to insufficient memory. Check [the user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for more info. 
</div>

<div class="alert alert-block alert-warning">
<b>Parameters for calculating similarity</b> 
    
The DocumentSimilarity tool uses Jaccard similarity to measure the similarity between documents. In the code below, we have specified and explained the default parameters for calculating the Jaccard similarity. However, you can also change these parameters should you wish. 
</div>

In [5]:
# USER SPECIFIES THESE VARIABLES
# set the n-gram size (the number of words used to detect similarity), 
# e.g., n-gram=1 means compare every word ('apple' and 'orange'), 
# n-gram=2 means compare every pair of words ('one apple' and 'two oranges'), etc.
ngram_value = 1

# select whether to calculate actual or estimated Jaccard similarity 
# to measure the similarity between documents 
# we recommend using estimated Jaccard similarity for large corpus of documents (faster)
actual_jaccard = False # True or False

# whether to exclude punctuations when calculating Jaccard similarity
ds.exclude_punc = True # True or False

# set the number of permutation functions (num_perm) parameter for estimating Jaccard similarity
# higher permutation functions improves the accuracy, but also increases query cost
num_perm = 256

# anything with >= the cutoff will be identified as similar documents
similarity_cutoff = 0.6 # value should be between 0-1

In [6]:
# begin the process of calculating similarity and identify similar documents
ds.calculate_similarity(ngram_value, num_perm, similarity_cutoff, actual_jaccard)

4199 pair of similar documents found in the corpus.


## Test discrepency

#### Check if minHash is identical on different runs


In [7]:
# Old
#import pickle
print(ds.deduplicated_text_df.shape)

(22099, 8)


In [7]:
# New

print(ds.deduplicated_text_df.shape)

(22099, 8)


In [8]:
import pickle
with open('run_6990.pkl', 'wb') as f:
    pickle.dump(ds, f)


In [8]:
import pickle
with open('run_6990.pkl', 'rb') as f:
    ds_1 = pickle.load(f)

In [9]:
set(ds_1.deduplicated_text_df.text_name).difference(ds.deduplicated_text_df.text_name)

set()

In [31]:
# Check if each file has same paired text id
import pandas as pd
df_1 = ds_1.text_df
df_2 = ds.text_df


In [39]:
#df = df_1.join(df_2, on='text_id')
df = pd.merge(df_1, df_2, on='text_id')
df = df.assign(id_hash = df.apply(lambda row: all(row.hash_x.hashvalues == row.hash_y.hashvalues), axis=1))
df = df.assign(id_match = df.apply(lambda row: row.matched_list_x == row.matched_list_y, axis=1))
df.head()

Unnamed: 0,text_x,text_name_x,text_id,text_with_punc_x,word_count_x,hash_x,matched_list_x,jaccards_x,text_y,text_name_y,text_with_punc_y,word_count_y,hash_y,matched_list_y,jaccards_y,id_hash,id_match
0,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,2015_august0001,ecc6f182b30d0393a2fd14bb79260f27,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,138,<datasketch.minhash.MinHash object at 0x7fffa5...,"[a45b8988a4466132635b6bdb6fc4dedd, 5d6edf035bd...","[0.6133, 0.6133, 0.8945, 0.625]",THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,2015_august0001,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,138,<datasketch.minhash.MinHash object at 0x7fffa2...,"[8f3544389222dfb902e2a93e4eae57fe, f544019720f...","[0.8945, 0.625, 0.6133, 0.6133]",True,False
1,AUSTRALIA faces being saddled with the bulk of...,2015_august0002,247a30c096487cadac0f5865231afad6,AUSTRALIA faces being saddled with the bulk of...,484,<datasketch.minhash.MinHash object at 0x7fffa5...,[d6dd0c1192406fa163b9ae8313ef983e],[0.9219],AUSTRALIA faces being saddled with the bulk of...,2015_august0002,AUSTRALIA faces being saddled with the bulk of...,484,<datasketch.minhash.MinHash object at 0x7fffa2...,[d6dd0c1192406fa163b9ae8313ef983e],[0.9219],True,True
2,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,2015_august0003,8f3544389222dfb902e2a93e4eae57fe,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,166,<datasketch.minhash.MinHash object at 0x7fffa5...,"[dbec3069d7ecb5576d40098d14fba0a1, a45b8988a44...","[0.6172, 0.6328, 0.6328, 0.6797, 0.8945]",THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,2015_august0003,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,166,<datasketch.minhash.MinHash object at 0x7fffa2...,"[f544019720f8608c40fcbf921da63412, ecc6f182b30...","[0.6797, 0.8945, 0.6328, 0.6328, 0.6172]",True,False
3,AUSTRALIA faces being saddled with the bulk of...,2015_august0004,d6dd0c1192406fa163b9ae8313ef983e,AUSTRALIA faces being saddled with the bulk of...,461,<datasketch.minhash.MinHash object at 0x7fffa5...,[247a30c096487cadac0f5865231afad6],[0.9219],AUSTRALIA faces being saddled with the bulk of...,2015_august0004,AUSTRALIA faces being saddled with the bulk of...,461,<datasketch.minhash.MinHash object at 0x7fffa2...,[247a30c096487cadac0f5865231afad6],[0.9219],True,True
4,Firebrand Liberal Senator Cory Bernardi has se...,2015_august0005,eaa60b92b9f888f264c3437653b89dc5,Firebrand Liberal Senator Cory Bernardi has se...,371,<datasketch.minhash.MinHash object at 0x7fffa5...,[],[],Firebrand Liberal Senator Cory Bernardi has se...,2015_august0005,Firebrand Liberal Senator Cory Bernardi has se...,371,<datasketch.minhash.MinHash object at 0x7fffa2...,[],[],True,True


In [45]:
diff_minHash = df[~df['id_hash']]
diff_minHash.shape

(0, 17)

In [41]:
diff_df = df[~df['id_match']]
df.apply(lambda row: set(row.matched_list_x).difference(row.matched_list_y), axis=1)

0       {}
1       {}
2       {}
3       {}
4       {}
        ..
8496    {}
8497    {}
8498    {}
8499    {}
8500    {}
Length: 8501, dtype: object

In [43]:
df.matched_list_x.head().tolist()

[['a45b8988a4466132635b6bdb6fc4dedd',
  '5d6edf035bd7db540a0f25ba8b60bcdc',
  '8f3544389222dfb902e2a93e4eae57fe',
  'f544019720f8608c40fcbf921da63412'],
 ['d6dd0c1192406fa163b9ae8313ef983e'],
 ['dbec3069d7ecb5576d40098d14fba0a1',
  'a45b8988a4466132635b6bdb6fc4dedd',
  '5d6edf035bd7db540a0f25ba8b60bcdc',
  'f544019720f8608c40fcbf921da63412',
  'ecc6f182b30d0393a2fd14bb79260f27'],
 ['247a30c096487cadac0f5865231afad6'],
 []]

In [44]:
df.matched_list_y.head().tolist()

[['8f3544389222dfb902e2a93e4eae57fe',
  'f544019720f8608c40fcbf921da63412',
  '5d6edf035bd7db540a0f25ba8b60bcdc',
  'a45b8988a4466132635b6bdb6fc4dedd'],
 ['d6dd0c1192406fa163b9ae8313ef983e'],
 ['f544019720f8608c40fcbf921da63412',
  'ecc6f182b30d0393a2fd14bb79260f27',
  '5d6edf035bd7db540a0f25ba8b60bcdc',
  'a45b8988a4466132635b6bdb6fc4dedd',
  'dbec3069d7ecb5576d40098d14fba0a1'],
 ['247a30c096487cadac0f5865231afad6'],
 []]

In [24]:
m1 = ds_1.text_df[ds_1.text_df.text_name == '2015_december0045']['hash'].tolist()[0]
m2 = ds.text_df[ds.text_df.text_name == '2015_december0045']['hash'].tolist()[0]
m1.jaccard(m2)

1.0

In [36]:
m1.hashvalues == m2.hashvalues

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

In [35]:
m2.hashvalues

array([19894258, 38167608,  9024081,  6864922, 14019373,  7829423,
        4801307, 11841380, 11338106,  1606348, 18427543,  6951828,
        4691642, 13920403, 21373960, 79953024, 22573117,  8106186,
        1701365, 22579370, 26304246, 32850825, 32248502,  9683852,
       73598365,  6968936,  1324353, 10591022, 16651970,    46833,
        3661179,  2877764,  7113228,  6952233, 22427269,  5086582,
        9121476, 33093107,  9985536, 36700884, 65058177, 16699632,
        4528959,  8553173,  1994469, 10151165, 21595824,  2257890,
        4656625, 20409292, 21004811, 36602426,  1827125, 11660676,
       36212708, 10654555, 42397331,  6044690,  6183873,  4635727,
       44159153,  7656216, 34900825,  2330906, 15229800,  4770883,
        2055674, 20876282, 17798266, 11128502, 18298946,  6038789,
        6828299, 23316344, 20588532,  4803979, 51997502, 14091892,
       68760743,  1081391, 45404420,    99339, 24766906, 14250187,
         604504, 14097842,  6180455, 37241603,  1725669,  7763

In [19]:
df_2 = ds.text_df.sort_values('text_name').reset_index()

In [20]:
df_1.head()

Unnamed: 0,index,text,text_name,text_id,text_with_punc,word_count,hash,matched_list,jaccards
0,4266,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,2015_august0001,7828417993793090164,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,138,<datasketch.minhash.MinHash object at 0x7fff6c...,"[-4216157355293154410, 6754295732006624618, -7...","[0.8945, 0.625, 0.6133, 0.6133]"
1,4062,AUSTRALIA faces being saddled with the bulk of...,2015_august0002,7613468115692745676,AUSTRALIA faces being saddled with the bulk of...,484,<datasketch.minhash.MinHash object at 0x7fff6c...,[5127103057632798102],[0.9219]
2,960,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,2015_august0003,-4216157355293154410,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,166,<datasketch.minhash.MinHash object at 0x7fff6c...,"[3839124784264248608, 6754295732006624618, -70...","[0.6172, 0.6797, 0.6328, 0.8945, 0.6328]"
3,12741,AUSTRALIA faces being saddled with the bulk of...,2015_august0004,5127103057632798102,AUSTRALIA faces being saddled with the bulk of...,461,<datasketch.minhash.MinHash object at 0x7fff6c...,[7613468115692745676],[0.9219]
4,2147,Firebrand Liberal Senator Cory Bernardi has se...,2015_august0005,5915666267027126909,Firebrand Liberal Senator Cory Bernardi has se...,371,<datasketch.minhash.MinHash object at 0x7fff6c...,[],[]


In [21]:
df_2.head()

Unnamed: 0,index,text,text_name,text_id,text_with_punc,word_count,hash,matched_list,jaccards
0,2997,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,2015_august0001,-7256051956056480776,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,138,<datasketch.minhash.MinHash object at 0x7fff7c...,"[8020305585274922749, -371310788376854805, 505...","[0.6133, 0.625, 0.8945, 0.6133]"
1,9992,AUSTRALIA faces being saddled with the bulk of...,2015_august0002,-6528799633353284092,AUSTRALIA faces being saddled with the bulk of...,484,<datasketch.minhash.MinHash object at 0x7fff6f...,[-1866417032301314761],[0.9219]
2,432,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,2015_august0003,5054373459852396794,THIRD TEST AT EDGBASTON AUSTRALIA First Inning...,166,<datasketch.minhash.MinHash object at 0x7fff6f...,"[8020305585274922749, -371310788376854805, -33...","[0.6328, 0.6797, 0.6172, 0.8945, 0.6328]"
3,11389,AUSTRALIA faces being saddled with the bulk of...,2015_august0004,-1866417032301314761,AUSTRALIA faces being saddled with the bulk of...,461,<datasketch.minhash.MinHash object at 0x7fff7d...,[-6528799633353284092],[0.9219]
4,12833,Firebrand Liberal Senator Cory Bernardi has se...,2015_august0005,5259326324612749517,Firebrand Liberal Senator Cory Bernardi has se...,371,<datasketch.minhash.MinHash object at 0x7fff7c...,[],[]


In [22]:
diff_df = list()
for idx in df_1.index:
    sim = df_1.loc[idx,'hash'].jaccard(df_2.loc[idx,'hash'])
    if sim < 1:
        temp = {'index': idx, 'score': sim, 'name_old':df_1.loc[idx,'text_name'], 'name_new': df_2.loc[idx,'text_name']}
        diff_df.append(temp)
len(diff_df)

0

In [24]:
matched = list()
for idx in df_1.index:
    diff_match = set(df_1.loc[idx,'matched_list']).difference(df_2.loc[idx,'matched_list'])
    if matched:
        temp = {'index': idx, 'diff':diff_match, 'match_old':df_1.loc[idx,'matched_list'], 'match_new': df_1.loc[idx,'matched_list']}
        matched.append(temp)
len(matched)

0

## 3. Analyse similar documents
Once the tool has finished calculating the document similarity, you can begin to analyse the outcome.  

The graph below is a histogram of the count of similar documents in the corpus as measured by their Jaccard similarity. In this histogram, you can identify how many documents are found at different level of similarity measures.

<div class="alert alert-block alert-warning">
<b>Histogram of similar documents</b> 
    
The x-axis on the histogram shows the Jaccard similarity scores for every document in the corpus, and the y-axis (the height of the bar) tells us how many similar documents are found at those Jaccard similarity score ranges. 
</div>

In [None]:
# plot the similarity count accross the entire corpus
ds.plot_hash_similarity_by_source(ds.deduplication_df)

<div class="alert alert-block alert-warning">
<b>Heatmap of similar documents</b> 
    
The below heatmap shows the Jaccard similarity scores between pair of similar documents, with the x- and y-axes showing the text_id of the similar document pairs (you can hover over the similar nodes to display the text name pairs). Please note that the heatmap only displays pair of similar documents with scores above the similarity cut-off, as defined earlier.  
</div>  

<div class="alert alert-block alert-danger">
<b>Large number of similar documents</b> 
    
You can resize the heatmap, adjust the font size or the font color to better visualize your data by specifying the below parameters. You can also zoom in/out of the heatmap, move it around, save and download it to your local computer using the interactive tool on the right hand-side of the heatmap.  

<b>Note:</b> visualizing a large number of similar document pairs (**>500**) may slow down the notebook.   
</div>
<div class="alert alert-block alert-info">
<b>Input before plotting</b> 
    
To avoid plotting oversized figure, the user is asked to **specify the range** of matched documents to be included in the heatmap.
Entering **'n'** will cancel the figure generation.
Entering **'y'** will proceed with **all pairs** of similar documents.
Entering an **integer number**, such 30, will generate the figure with the top-30 pairs of the similar documents.
Entering a number range like **15-45** will generate the figure with the selected range (15 to 45) of the document pairs.
    
**Press Enter Key after inputting.**
</div>

In [None]:
# define the plot width, height, font size and color
plot_width = 900 # increase plot width if necessary
plot_height = 800 # increase plot height if necessary
font_size = '14px'
text_color = 'white' # 'black' or 'white' would usually work for most scenarios

print('\033[1mVisualizing a large number of similar document pairs (>500) may slow down the notebook.\033[0m')
print('There are \033[1m{}\033[0m document pairs in the current process'.format(ds.deduplication_df.shape[0]))
plot_range = input("""Enter the range of documents pairs to be plotted, e.g. y, n, 10-25, or 30.""")

# plot heatmap of Jaccard similarity
ds.plot_heatmap_similarity(similarity_cutoff,
                                plot_width,
                                plot_height,
                                font_size,
                                text_color,
                                plot_range)

<div class="alert alert-block alert-warning">
<b>Analyse similar documents</b> 

Below you can generate a list of similar documents (in pairs) found by the tool, based on the similarity cutoff specified earlier. By default, the tool makes recommendations on whether to 'keep' or 'remove' each similar document (the tool will recommend to remove the document with the lower word count, if the Jaccard similarity is above the specified threshold). However, using the below tool, you can generate each pair of similar documents (by specifying the row index you wish to analyse), analyse them, and update the action/recommendation as you see fit.
</div>

<div class="alert alert-block alert-danger">
<b>Similar documents table</b> 

The table below displays only those texts identified as similar based on the Jaccard similarity cut-off selected earlier and the number of texts included in the table display therefore also informs you how many texts in your corpus are identified as within the cut-off threshold.
</div>

In [None]:
ds.display_deduplication_text()

<div class="alert alert-block alert-warning">
<b>What information is included in the above table?</b> 

**text_id1/2:** the text id of the pair of similar documents.
    
**text_name1/2:** the text name of the pair of similar documents.
   
**word_count1/2:** the word count of the pair of similar documents.

**status1/2:** whether to 'keep' or 'remove' each similar document.

**similarity:** the Jaccard similarity between the pair of similar documents.
</div>

<div class="alert alert-block alert-danger">
    
**Caveat: Discrepancies in the highlighted side-by-side comparison**

In the display of document pairs where differences between texts are highlighted for checking by users, only document pairs based on the Jaccard similarity parameters are included. However, this visualisation uses the python function difflib which is independent from the Jaccard calculation and may thus highlight differences in punctuation (regardless of previous settings) and this function may also at times contain incorrectly highlighted text blocks. Despite this caveat, the visualisation should still be helpful in allowing you to decide which of the two texts you want to ‘keep’ or ‘remove’.
 
</div>

## 5. Save duplicated/non-duplicated texts
Once you are happy with the list of texts that you want to keep, you can run the below code to save the non-duplicated texts (those with 'keep' status) or the duplicated ones (those with 'remove' status) into a zip of text (.txt) files and download them to your local computer.

In [41]:
rows_to_display=5

ds.finalise_and_save(rows_to_display)

VBox(children=(Output(), Button(description='Save non-duplicated texts', layout=Layout(margin='20px 0px 10px 0…

In [4]:
ds.text_df.head()

In [25]:
ds.deduplicated_text_df.shape

(6972, 7)

In [26]:
dup_index1 = ds.duplicated_text_df.index
dedup_index1 = ds.deduplicated_text_df.index

In [5]:
a = dict()
a['b'] = 2
a['c'] = 1
a['a'] = 5
a.get('b')

2

In [8]:
list(a.values())[-1]

5

In [5]:
ds.deduplicated_text_df.shape

(6974, 8)

In [30]:
dup_index2 = ds.duplicated_text_df.index
dedup_index2 = ds.deduplicated_text_df.index

In [31]:
set(dup_index1).difference(dup_index2)

set()

In [18]:
ds.deduplicated_text_df.to_excel('dedup_2.xlsx')

In [7]:
import pandas as pd
df_1 = pd.read_excel('dedup_1.xlsx')

set(df_1.text_name).difference(ds.deduplicated_text_df.text_name)


{'2015_december0045',
 '2015_december0118',
 '2015_november0005',
 '2015_november0031',
 '2015_november0480',
 '2015_november0528',
 '2015_november1135',
 '2015_november1501',
 '2015_november1924',
 '2015_november2015',
 '2015_september0060',
 '2015_september0683',
 '2016_february0680',
 '2016_january0055',
 '2016_january0879',
 '2016_july0117',
 '2016_july0668',
 '2016_july1010',
 '2016_june0371',
 '2016_march0054',
 '2016_march0132',
 '2016_may0308'}

In [6]:
# Actual 
import pandas as pd
df_2 = pd.read_excel('dedup_2.xlsx')

set(df_2.text_name).difference(ds.deduplicated_text_df.text_name)

{'2015_august0230',
 '2015_august0485',
 '2015_december0363',
 '2015_november0242',
 '2015_november0518',
 '2015_november0848',
 '2015_november1597',
 '2015_october0400',
 '2015_october1081',
 '2015_september0505',
 '2016_february0179',
 '2016_february0327',
 '2016_february0537',
 '2016_february0600',
 '2016_february0704',
 '2016_february0831',
 '2016_january0443',
 '2016_january0531',
 '2016_january0660',
 '2016_january1220',
 '2016_january1283',
 '2016_january1335',
 '2016_july0351',
 '2016_june0207',
 '2016_march0926',
 '2016_may0577'}

In [14]:
set(df_1.text_name).difference(ds.deduplicated_text_df.text_name)

{'2015_december0045',
 '2015_december0118',
 '2015_november0005',
 '2015_november0031',
 '2015_november0480',
 '2015_november0528',
 '2015_november1135',
 '2015_november1501',
 '2015_november1924',
 '2015_november2015',
 '2015_september0060',
 '2015_september0683',
 '2016_february0680',
 '2016_january0055',
 '2016_january0879',
 '2016_july0117',
 '2016_july0668',
 '2016_july1010',
 '2016_june0371',
 '2016_march0054',
 '2016_march0132',
 '2016_may0308'}

In [42]:
print(ds.duplicated_text_df.shape, ds.deduplicated_text_df.shape)
dup_index2 = ds.duplicated_text_df.index
dedup_index2 = ds.deduplicated_text_df.index
print(set(dup_index1).difference(dup_index2))

(1529, 7) (6972, 7)
set()


In [48]:
idx = ds.duplicated_text_df.text_id.tolist()
with open('output/dup_id.csv', 'w') as f:
    for i in idx:
        f.write('{}\n'.format(i))
        

In [None]:
ds.duplicated_text_df.text_id