# Seed Sieve
***2018.05 IsthmusCrypto: isthmuscrypto@protonmail.com***

This seedSieve function is a tool for stripping cryptocurrency seed mnemonic phrases from text fields or databases to protect users from theft. 

Allowing users to send or store seed mnemonic phrases in your data sets or streams, whether accidentally or deliberately, exposes them to a high risk of theft by malicious parties. For almost all services, this increases liability and offers no benefit.

Here is an easy method for removing such data, implemented with the BIP39 English wordlist, pending inclusion of others. It scans the text for strings with a high density of seed words and redacts them.

Any service that transmits user data and is NOT suitable for sending sensitive financial information (e.g. user logs, message boards, chat services) should implement the seed sieve on the device, before transmitting the field. 

Any entity passing user data along to marketing, analysis, or any other third party should apply the seed sieve first to reduce risk and liability. If the data set is to be used for machine learning, this may even improve accuracy by removing red herrings (e.g. "apple" being treated as a fruit) and properly labeling the text as cryptocurrency-related.

2018-05 known bugs:

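-  "if wordOnly in wordlist" matches substrings whenever the wordlist is held as raw file text (one long string), so any fragment of a seed word, and even an empty string, counts as a hit. Holding the wordlist as a list or set of words gives exact matches.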
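
The difference is easy to reproduce in isolation. The sketch below uses a made-up three-word list (an assumption for illustration, not the real BIP39 file), held once as raw text and once as a set:

```python
# Toy stand-in for the contents of a wordlist file (hypothetical, not the real BIP39 list)
wordlistText = "apple\nplanet\nsoccer\n"

# Raw text: the `in` operator performs a substring search
substringHit = "plan" in wordlistText   # True, because "plan" occurs inside "planet"
emptyHit = "" in wordlistText           # True, the empty string is a substring of any string

# Set of words: the `in` operator performs an exact membership test
wordSet = set(wordlistText.split())
exactMiss = "plan" in wordSet           # False
exactHit = "planet" in wordSet          # True

print(substringHit, emptyHit, exactMiss, exactHit)
```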

Wishlist:

- support for other dictionary files besides BIP39 English (other languages & cryptocurrencies)

    -  Could be done by looping over multiple dictionaries (cleaner)
    
    -  Could be done by making a de-duplicated union of multiple dictionaries (lighter)
    
- Sliding seedSieve wrapper: scan a window of fixed size (e.g. N_w = 36 words) across large inputs, and apply seed-sieve redaction to each snippet. Imagine a user who pastes a 24-word seed into the middle of the 272 words of the Gettysburg Address. The current implementation would not trigger, because the seed words are diluted by the surrounding text; the sliding seed sieve would trigger once the window covers the phrase, with 24+ of the N_w = 36 words returning dictionary matches.

- Implementation in other languages. Java*? C*?

- Parameters here (default_minimumSeedHits = 7; default_seedRatio = 0.3;) were selected as educated guesses after ~ 20 minutes of glancing through various types of English writing. Open to suggestions for improvements, especially if backed by testing on some seed-spiked corpus, quantified by error metrics (accuracy, false +/- rates, etc)
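
As a rough illustration of the sliding-window idea from the wishlist, here is a minimal sketch. The name `slidingSeedSieve`, its `windowSize` parameter, and the toy wordlist are assumptions made for this example and are not part of the current code. It also simplifies redaction by replacing whole tokens (attached punctuation included) and re-joining on single spaces.

```python
import re

def slidingSeedSieve(rawStr, wordSet, windowSize=36, minimumSeedHits=7,
                     seedRatio=0.3, replaceWith="[#BIP39]"):
    """Hypothetical wrapper: redact seed words inside any densely seeded window."""
    words = rawStr.split()
    # True where the token, lowercased and stripped of non-letters, is a seed word
    isSeed = [re.sub('[^a-z]+', '', w.lower()) in wordSet for w in words]
    redact = [False] * len(words)

    # Slide a window of up to windowSize words across the input
    lastStart = max(len(words) - windowSize, 0)
    for start in range(lastStart + 1):
        window = isSeed[start:start + windowSize]
        hits = sum(window)
        # Same two criteria as seedSieve, applied per window
        if window and hits > minimumSeedHits and hits / len(window) > seedRatio:
            for i in range(start, start + len(window)):
                redact[i] = redact[i] or isSeed[i]

    # Re-joining on single spaces does not preserve the original whitespace
    return " ".join(replaceWith if r else w for w, r in zip(words, redact))

# Toy wordlist: just the 15 words of the first example phrase in this notebook
toySeed = ("retreat brisk ball dirt cushion skill catalog afford "
           "explain pigeon mail few elegant avoid gallery")
toyWordSet = set(toySeed.split())

# Bury the phrase between 100 filler words on each side
buried = " ".join(["lorem"] * 100 + toySeed.split() + ["ipsum"] * 100)
redacted = slidingSeedSieve(buried, toyWordSet)
print(redacted.split().count("[#BIP39]"))
```

Over the whole 215-word input the seed ratio is only 15/215, below the default 0.3, so a single pass over the full text would not trigger. Any 36-word window containing the full phrase has 15 hits, which satisfies both criteria.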

In [1]:
import re  # regular expressions

def seedSieve(rawStr, minimumSeedHits, seedRatio, wordlist, replaceWith):

    # Accept the wordlist as raw file text or as an iterable of words.
    # A set gives exact whole-word matches (raw text would match substrings).
    if isinstance(wordlist, str):
        wordlist = wordlist.split()
    seedWords = set(wordlist)

    # initialization
    counts = 0  # number of seed words seen
    foundThese = set()  # distinct seed words seen
    rawStrWords = rawStr.split()

    # count (loop) over words in the input string
    for word in rawStrWords:
        wordOnly = re.sub('[^a-z]+', '', word.lower())  # lowercase & remove non-letters
        if wordOnly in seedWords:
            counts = counts + 1
            foundThese.add(wordOnly)

    # Calculate ratio of seed words to all words (0 for empty input)
    numWordsTotal = len(rawStrWords)
    actualRatio = counts / numWordsTotal if numWordsTotal else 0

    # set triggered only if both criteria are met
    qTriggered = (actualRatio > seedRatio) and (counts > minimumSeedHits)

    outputString = rawStr
    if qTriggered:
        for word in foundThese:
            # \b restricts the replacement to whole words; re.escape guards the pattern
            outputString = re.sub(r'\b' + re.escape(word) + r'\b', replaceWith,
                                  outputString, flags=re.I)

    return [outputString, qTriggered]

## Default parameters:

In [2]:
# Import a wordlist
import os # for proper file separator to wordlist subdirectory

default_minimumSeedHits = 7  # Redaction requires MORE than this many seed words
default_seedRatio = 0.3  # Redaction requires a seed:all word ratio above this value (between 0 and 1)
with open(os.path.join('wordlists', 'bip39.txt'), 'r') as wordlistFile:  # open wordlist, e.g. BIP39
    default_wordlist = set(wordlistFile.read().split())  # set of words, so membership tests are exact
default_replaceWith = "[#BIP39]"; # string to mark redacted words
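
If more dictionaries are added later (see the Wishlist), the two options mentioned there could look roughly like this. The wordlist contents and names below are toy stand-ins invented for the example, not real files from this project:

```python
# Toy stand-ins for the contents of two wordlist files (hypothetical, not real lists)
englishText = "apple\nplanet\nsoccer\n"
otherText = "planet\nzebra\n"

# Option 1 (cleaner): keep the dictionaries separate and loop over them,
# which also reveals which dictionary produced a match
wordlists = {"english": set(englishText.split()), "other": set(otherText.split())}
matches = [name for name, words in wordlists.items() if "planet" in words]

# Option 2 (lighter): merge everything into one de-duplicated set
unionWordlist = set().union(*wordlists.values())

print(matches, sorted(unionWordlist))
```

The union is a single lookup per word, while the loop keeps the information needed to label the redaction by dictionary.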

## Input string(s):

In [3]:
# multiStr contains several example strings
multiStr =  [
"retreat brisk ball dirt cushion skill catalog afford explain pigeon mail few elegant avoid gallery",
"A man, a plan, a canal, panama",
"Apple farm near upstate New York.",
"Lorem ipsum dolor sit amet, consectetur adipiscing",
"Lorem Ipsum is simply dummy text of the printing and typesetting industry.",
"R4ndom padding words won't stop hackers, income spoil awake soccer action twist sadness able client topple stairs nice industry labor spice, But the seed sieve will",
"Please help improve this project at github.com/Mitchellpkt/SeedSieve"];


## Demonstrate seedSieve:

In [4]:
# Here is actual function call, with several examples
for testString in multiStr:
    [outputString,qTriggered] = seedSieve(testString, default_minimumSeedHits, default_seedRatio, default_wordlist, default_replaceWith);
    
    # display the results
    print('\n\n'+20*'*'+'\n'+20*'*')
    print('Input string:')
    print('\"'+testString+'\"')
    print(15*'-'+'\nOutput string:')
    print('\"'+outputString+'\"')

    
    



********************
********************
Input string:
"retreat brisk ball dirt cushion skill catalog afford explain pigeon mail few elegant avoid gallery"
---------------
Output string:
"[#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39] [#BIP39]"


********************
********************
Input string:
"A man, a plan, a canal, panama"
---------------
Output string:
"A man, a plan, a canal, panama"


********************
********************
Input string:
"Apple farm near upstate New York."
---------------
Output string:
"Apple farm near upstate New York."


********************
********************
Input string:
"Lorem ipsum dolor sit amet, consectetur adipiscing"
---------------
Output string:
"Lorem ipsum dolor sit amet, consectetur adipiscing"


********************
********************
Input string:
"Lorem Ipsum is simply dummy text of the printing and typesetting industry."
---------------
Output stri