Using the spaCy large English model (en_core_web_lg):

python -m spacy download en_core_web_lg


In [13]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [67]:
w1 = "red"
w2 = "blue"

w1 = nlp.vocab[w1]
w2 = nlp.vocab[w2]
w1.similarity(w2)

0.8438411951065063

In [34]:
w1 = "labor"
w2 = "lorem"

w1 = nlp.vocab[w1]
w2 = nlp.vocab[w2]
w1.similarity(w2)

-0.07253860682249069

In [51]:
w1 = "the"
w2 = "a"

w1 = nlp.vocab[w1]
w2 = nlp.vocab[w2]
w1.similarity(w2)

0.5925824642181396

In [68]:
w1 = "."
w2 = "?"

w1 = nlp.vocab[w1]
w2 = nlp.vocab[w2]
w1.similarity(w2)

0.5152676105499268

In [69]:
s1 = "This is a sentence"
s2 = "This is another sentence"

s1 = nlp(s1)
s2 = nlp(s2)
s1.similarity(s2)

0.9810225367546082

In [70]:
s1 = "Sigmund Freud"
s2 = "Psychology"

s1 = nlp(s1)
s2 = nlp(s2)
s1.similarity(s2)

0.3366757333278656

In [50]:
s1 = "Lorem ipsum"
s2 = "Placeholder"

s1 = nlp(s1)
s2 = nlp(s2)
s1.similarity(s2)

0.3752337098121643

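Some word pairs are too dissimilar for the embedding model to relate (for example, "labor" and "lorem" score below zero). If every semantic similarity score comes out negative, we will fall back to string similarity using the rapidfuzz library: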

pip install rapidfuzz



In [35]:
from rapidfuzz import fuzz

In [49]:
w1 = "labor"
w2 = "lorem"

print(fuzz.ratio(w1, w2) / 100)

0.6


In [59]:
w1 = "Sigmund Freud"
w2 = "Psychology"

print(fuzz.ratio(w1, w2) / 100)

0.08695652173913049
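
The fallback described above could be sketched roughly as follows. This is only one possible reading of the rule (use string similarity when no candidate has a positive semantic score). The names `score_candidates`, `lexical_similarity`, and `toy_semantic` are hypothetical, the toy scores are rounded from the outputs above, and `difflib` stands in for rapidfuzz so that the sketch runs with the standard library alone.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def lexical_similarity(a: str, b: str) -> float:
    """Stand-in for fuzz.ratio(a, b) / 100 (assumption: difflib scores are
    similar to rapidfuzz scores, but not guaranteed to be identical)."""
    return SequenceMatcher(None, a, b).ratio()


def score_candidates(
    guess: str,
    candidates: List[str],
    semantic: Similarity,
    lexical: Similarity = lexical_similarity,
) -> Dict[str, float]:
    """Score every candidate against the guess.

    Semantic scores are used when at least one of them is positive;
    otherwise all candidates are re-scored with lexical similarity.
    """
    scores = {word: semantic(guess, word) for word in candidates}
    if any(score > 0 for score in scores.values()):
        return scores
    return {word: lexical(guess, word) for word in candidates}


# Toy semantic scorer standing in for nlp.vocab[a].similarity(nlp.vocab[b]).
TOY_SCORES = {("red", "blue"): 0.84, ("labor", "lorem"): -0.07}


def toy_semantic(a: str, b: str) -> float:
    return TOY_SCORES.get((a, b), 0.0)


print(score_candidates("red", ["blue"], toy_semantic))     # semantic score is kept
print(score_candidates("labor", ["lorem"], toy_semantic))  # falls back to lexical score
```

In the notebook itself, `semantic` would wrap the spaCy call and `lexical` would wrap `fuzz.ratio`; passing them in as callables keeps the fallback rule separate from either library.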


## Designing a scrambling and unscrambling procedure

In [54]:
maintext = '''
Lorem ipsum (/ˌlɔː.rəm ˈɪp.səm/ LOR-əm IP-səm) is a dummy or placeholder text commonly used in graphic design, publishing, and web development. Its purpose is to permit a page layout to be designed, independently of the copy that will subsequently populate it, or to demonstrate various fonts of a typeface without meaningful text that could be distracting. \n

    Lorem ipsum is typically a corrupted version of De finibus bonorum et malorum, a 1st-century BC text by the Roman statesman and philosopher Cicero, with words altered, added, and removed to make it nonsensical and improper Latin. The first two words themselves are a truncation of dolorem ipsum ("pain itself"). \n

    Versions of the Lorem ipsum text have been used in typesetting at least since the 1960s, when it was popularized by advertisements for Letraset transfer sheets.[1] Lorem ipsum was introduced to the digital world in the mid-1980s, when Aldus employed it in graphic and word-processing templates for its desktop publishing program PageMaker. Other popular word processors, including Pages and Microsoft Word, have since adopted Lorem ipsum,[2] as have many LaTeX packages,[3][4][5] web content managers such as Joomla! and WordPress, and CSS libraries such as Semantic UI.
'''

header = '''
A common form of Lorem ipsum reads:
'''

headertext = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
'''

captions = ["Using Lorem ipsum to focus attention on graphic elements in a webpage design proposal"]

In [None]:
# TOKENIZATION: breaking down text into individual words
# Process the text using spaCy
doc = nlp(maintext)

# Extract words (tokens)
words = [token.text for token in doc]

# Extract lemmas
lemmas = [token.lemma_ for token in doc]

# Extract part-of-speech tags: https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean
tags = [token.pos_ for token in doc]

# Print the results
print("Words:", words)
print("Lemmas:", lemmas)
print("Tags:", tags)

Words: ['\n', 'Lorem', 'ipsum', '(', '/ˌlɔː.rəm', 'ˈɪp.səm/', 'LOR', '-', 'əm', 'IP', '-', 'səm', ')', 'is', 'a', 'dummy', 'or', 'placeholder', 'text', 'commonly', 'used', 'in', 'graphic', 'design', ',', 'publishing', ',', 'and', 'web', 'development', '.', 'Its', 'purpose', 'is', 'to', 'permit', 'a', 'page', 'layout', 'to', 'be', 'designed', ',', 'independently', 'of', 'the', 'copy', 'that', 'will', 'subsequently', 'populate', 'it', ',', 'or', 'to', 'demonstrate', 'various', 'fonts', 'of', 'a', 'typeface', 'without', 'meaningful', 'text', 'that', 'could', 'be', 'distracting', '.', '\n\n\n    ', 'Lorem', 'ipsum', 'is', 'typically', 'a', 'corrupted', 'version', 'of', 'De', 'finibus', 'bonorum', 'et', 'malorum', ',', 'a', '1st', '-', 'century', 'BC', 'text', 'by', 'the', 'Roman', 'statesman', 'and', 'philosopher', 'Cicero', ',', 'with', 'words', 'altered', ',', 'added', ',', 'and', 'removed', 'to', 'make', 'it', 'nonsensical', 'and', 'improper', 'Latin', '.', 'The', 'first', 'two', 'words

In [57]:
for token in doc:
    print(f"{token.text} | {token.lemma_} | {token.pos_}")


 | 
 | SPACE
Lorem | Lorem | PROPN
ipsum | ipsum | NOUN
( | ( | PUNCT
/ˌlɔː.rəm | /ˌlɔː.rəm | PUNCT
ˈɪp.səm/ | ˈɪp.səm/ | DET
LOR | LOR | PROPN
- | - | PUNCT
əm | əm | PROPN
IP | IP | PROPN
- | - | PUNCT
səm | səm | NOUN
) | ) | PUNCT
is | be | AUX
a | a | DET
dummy | dummy | ADJ
or | or | CCONJ
placeholder | placeholder | NOUN
text | text | NOUN
commonly | commonly | ADV
used | use | VERB
in | in | ADP
graphic | graphic | ADJ
design | design | NOUN
, | , | PUNCT
publishing | publishing | NOUN
, | , | PUNCT
and | and | CCONJ
web | web | NOUN
development | development | NOUN
. | . | PUNCT
Its | its | PRON
purpose | purpose | NOUN
is | be | AUX
to | to | PART
permit | permit | VERB
a | a | DET
page | page | NOUN
layout | layout | NOUN
to | to | PART
be | be | AUX
designed | design | VERB
, | , | PUNCT
independently | independently | ADV
of | of | ADP
the | the | DET
copy | copy | NOUN
that | that | PRON
will | will | AUX
subsequently | subsequently | ADV
populate | populate | VERB
it | 

In [64]:
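# Convert tokens back into text.
# This heuristic only approximates the original spacing: it attaches every
# punctuation token to the preceding word, so "ipsum (" becomes "ipsum(" and
# "LOR-əm" becomes "LOR- əm". Joining token.text_with_ws instead would
# reproduce the original text exactly.
temp = ""

for token in doc:
    if token.pos_ in ("SPACE", "PUNCT"):
        # No leading space before whitespace or punctuation tokens
        temp += token.text
    else:
        temp += " " + token.text

print(temp)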


 Lorem ipsum(/ˌlɔː.rəm ˈɪp.səm/ LOR- əm IP- səm) is a dummy or placeholder text commonly used in graphic design, publishing, and web development. Its purpose is to permit a page layout to be designed, independently of the copy that will subsequently populate it, or to demonstrate various fonts of a typeface without meaningful text that could be distracting.


     Lorem ipsum is typically a corrupted version of De finibus bonorum et malorum, a 1st- century BC text by the Roman statesman and philosopher Cicero, with words altered, added, and removed to make it nonsensical and improper Latin. The first two words themselves are a truncation of dolorem ipsum(" pain itself").


     Versions of the Lorem ipsum text have been used in typesetting at least since the 1960s, when it was popularized by advertisements for Letraset transfer sheets.[1 ] Lorem ipsum was introduced to the digital world in the mid-1980s, when Aldus employed it in graphic and word- processing templates for its desktop 

Current Plan:
- Take the tokenized version and convert it into a dictionary mapping each word to its scrambled version
- For each guess the user makes, unscramble the words that are semantically closest to it (and update the dictionary)
- Reconstruct the scrambled text by iterating over the tokens and referencing the dictionary
- This dictionary can then represent the user's "state" of unscrambling the text
- We can also precompute the dictionary of all scrambled text so that all users get the same starting scrambled text
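
The plan above could be sketched along these lines. Everything here is an assumption made for illustration: the function names (`scramble_word`, `build_state`, `reveal`, `render`), the fixed seed, the 0.5 threshold, and the representation of tokens as `(text, trailing_whitespace)` pairs, which mirrors spaCy's `token.text` and `token.whitespace_`. The similarity function is passed in, so the spaCy or rapidfuzz scorer can be plugged in later.

```python
import random
import string
from typing import Callable, Dict, Iterable, List, Tuple


def scramble_word(word: str, alphabet: str, rng: random.Random) -> str:
    """Replace each letter with a random letter from alphabet; keep other characters."""
    return "".join(rng.choice(alphabet) if ch.isalpha() else ch for ch in word)


def build_state(
    tokens: Iterable[str],
    alphabet: str = string.ascii_lowercase,
    seed: int = 0,
) -> Dict[str, str]:
    """Map each distinct word to a scrambled form.

    A fixed seed means every user starts from the same scrambled text.
    """
    rng = random.Random(seed)
    state: Dict[str, str] = {}
    for tok in tokens:
        if tok not in state and any(ch.isalpha() for ch in tok):
            state[tok] = scramble_word(tok, alphabet, rng)
    return state


def reveal(
    state: Dict[str, str],
    guess: str,
    similarity: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Unscramble every still-scrambled word whose similarity to the guess
    reaches the threshold, and return the words revealed."""
    revealed = []
    for word in state:
        if state[word] != word and similarity(guess.lower(), word.lower()) >= threshold:
            state[word] = word
            revealed.append(word)
    return revealed


def render(tokens: List[Tuple[str, str]], state: Dict[str, str]) -> str:
    """Rebuild the text, looking each token up in the state dictionary."""
    return "".join(state.get(text, text) + ws for text, ws in tokens)


# Hypothetical mini-document as (text, trailing whitespace) pairs.
tokens = [("Lorem", " "), ("ipsum", " "), ("is", " "), ("a", " "),
          ("placeholder", " "), ("text", ""), (".", "")]
state = build_state(text for text, _ in tokens)
print(render(tokens, state))  # fully scrambled text


def exact_match(a: str, b: str) -> float:
    """Toy similarity: 1.0 for identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


print(reveal(state, "lorem", exact_match))
print(render(tokens, state))  # "Lorem" is now readable
```

Because the state is a plain dictionary, it can be serialized per user, and the precomputed starting dictionary is just the result of `build_state` with an agreed seed.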

Small Details:
- We could use the set of all letters that appear in the article as the list of possible random letters for scrambling the text
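
A minimal sketch of that detail, assuming "letters" means upper- and lower-case letters only. The helper name `article_alphabet` and the sample string are hypothetical. Unicode modifier letters (such as the IPA stress and length marks in the pronunciation guide) are deliberately excluded, since Python's `str.isalpha()` would otherwise count them as letters.

```python
import random
import unicodedata


def article_alphabet(text: str) -> str:
    """Return the distinct cased letters in text, sorted so the result is reproducible."""
    letters = {ch for ch in text if unicodedata.category(ch) in ("Lu", "Ll")}
    return "".join(sorted(letters))


sample = "Lorem ipsum (/ˌlɔː.rəm ˈɪp.səm/) is a dummy or placeholder text."
alphabet = article_alphabet(sample)

# Scramble one word using only letters that occur in the sample.
rng = random.Random(0)
scrambled = "".join(rng.choice(alphabet) if ch.isalpha() else ch for ch in "placeholder")

print(alphabet)
print(scrambled)
```

Drawing from the article's own letters keeps the scrambled text visually close to the source (including characters such as "ə" when they occur), at the cost of a different alphabet for every article.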