##### Algorithms and Data Structures (Winter - Spring 2022)

* [Colab view](https://colab.research.google.com/github/4dsolutions/elite_school/blob/master/ADS_research_1.ipynb)
* [nbviewer view](https://nbviewer.org/github/4dsolutions/elite_school/blob/master/ADS_research_1.ipynb)
* [ADS Page 1](ADS_intro_1.ipynb)
* [ADS Page 2](ADS_intro_2.ipynb)
* [ACSL](Exercises.ipynb)
* [Repo](https://github.com/4dsolutions/elite_school/)

## Five Letter Land:  A Semantic Space

Around the time of our first *Algorithms and Data Structures* meetups, [the word game Wordle](ADS_project_1.ipynb) became popular.  

The game focuses on accepted English dictionary words of five letters. Of course the rules could be extended or refocused. Our research went with that definition and with one particular data file, structured as plain text:

https://www-cs-faculty.stanford.edu/~knuth/sgb-words.txt
(don't click the link unless you wish to eyeball the file)

Beginning with exactly this list of 5-letter words, we embarked upon some research adventures bringing in our sense of Graph Theory.  The words are nodes and we consider two of them connected by an edge if they differ in one and only one letter position.

For example, neighbors of "caked" would be "naked" and "baked".  We will need an algorithm that gives us a complete list of neighbors.

Example research questions:

* given "only one letter different" defines an edge between words, does our net of connected words encompass all the words?

* if the answer to the above question is "no" then does the set consist of "islands" of words with definite boundaries?

But first, we need the set of words:

In [1]:
import requests

response = requests.get("https://www-cs-faculty.stanford.edu/~knuth/sgb-words.txt")
print(response.status_code)
# build the set of words from the downloaded text
words = set(response.text.split("\n")) # <-- linebreak delimited

200


The classroom toolkit includes Replits in the Cloud.  We welcome a "many languages" approach.  C++, Julia... Replits come in many flavors.

Through Replits ([example](https://replit.com/@kurner/fiveland#main.py)), we might use Python only because requests is so convenient and the Replit makes it easy to install as a package.  Once we harvest our data, we might switch to some other language or tool.  It's up to us to design the pipelines.

Having done the work to retrieve the text file from [The Stanford Graphbase](https://www-cs-faculty.stanford.edu/~knuth/sgb.html), why not keep it locally?  It's a tiny file, but requesting it over the internet every time adds an unnecessary dependency and overhead.

In [2]:
with open("wordkeep.txt","w") as the_file:
    the_file.write(response.text)

In [3]:
import os.path

if not os.path.exists("./wordkeep.txt"):
    print("No wordkeep.txt on file")

else:

    with open("wordkeep.txt","r") as the_file:
        for _ in range(10):
            print(the_file.readline(), end="")

    if "words" not in globals():
        with open("wordkeep.txt","r") as the_file:
            words = set(the_file.read().split("\n"))

which
there
their
about
would
these
other
words
could
write


Let's continue our work in Python because of its set type, which gives us fast lookup and prevents inadvertent duplication as a built-in feature.  Sets do not allow duplicates.  As we find all words reachable by single hops (one-letter changes) from a given word, we don't want to include the same word more than once.  Sets will take care of that.

One detail though. As a result of splitting and converting to a set, an empty string will have snuck into our set, and should be removed if found.

In [4]:
'' in words

True

In [5]:
try:
    words.remove('')
except KeyError:
    print("No empty string present")
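The empty string appears because the text ends with a line break, so splitting on "\n" leaves one empty piece at the end. As a minimal sketch (the three-word `sample` string below is a hypothetical stand-in for the downloaded text), splitting on whitespace with no argument avoids the problem up front:

```python
# Hypothetical sample standing in for the downloaded text,
# assuming (as observed above) that it ends with a line break.
sample = "which\nthere\ntheir\n"

with_empty = set(sample.split("\n"))  # trailing newline leaves '' as the last piece
without_empty = set(sample.split())   # whitespace split drops empty pieces

print('' in with_empty)     # True
print('' in without_empty)  # False
```

Either approach ends with the same set of words; removing the empty string afterwards, as done above, works just as well.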

As a checklist item, double-check that your lookup table does indeed hold all 5757 words.

In [6]:
len(words)

5757

*The Stanford GraphBase: A Platform for Combinatorial Computing* is close to 600 pages and consists of numerous examples of "literate programming", i.e. the kind of programming we do in Jupyter Notebooks.

However, for this particular research project, that volume was not consulted.  Many of the results may nevertheless overlap.  Finding those connections is an exercise left to the reader.

The research project outlined above:


* given "only one letter different" defines an edge between words, does our net of connected words encompass all the words?

* if the answer to the above question is "no" then does the set consist of "islands" of words with definite boundaries?

becomes interesting because it's easy to prove the answer is "no" to our first question.  We are able to find words that have no neighbors in the sense defined.  A good example would be "spasm".

In [8]:
import five_land  # needed for the five_land.words reference below
from five_land import initialize, roll_alpha

In [9]:
len(five_land.words)

0

In [10]:
initialize()

5757


In [11]:
neighbors = roll_alpha('spasm')
neighbors  # no neighbors, empty set

set()

In [12]:
neighbors = roll_alpha('opera')
neighbors  # likewise

set()

OK, so we know we have solitary "word islands" that consist of a single word.  Let's take another example, where the "word island" is larger.

In [13]:
neighbors = roll_alpha('logic', True)
neighbors

logic
yogic
logic
logic
logic
logic
login


{'login', 'yogic'}

Turning on printing (2nd argument True) reveals how often the algorithm rediscovers the word itself, as roll_alpha substitutes a-z for each letter in turn, meaning it always rolls through the word itself at some point.  However, upon eliminating the word itself in the end (the last thing roll_alpha does), the algorithm confines itself to providing neighbors one letter away, and no more.
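The substitution idea can be sketched in a few lines. This is not the code in five_land.py, only an illustration under assumptions: the function name `one_letter_neighbors` and the tiny `vocabulary` set are hypothetical.

```python
from string import ascii_lowercase

def one_letter_neighbors(word, vocabulary):
    """Return the words in vocabulary that differ from word in exactly one position."""
    found = set()
    for i in range(len(word)):
        for letter in ascii_lowercase:
            candidate = word[:i] + letter + word[i + 1:]
            if candidate in vocabulary:
                found.add(candidate)
    found.discard(word)  # the word itself is rediscovered once per position
    return found

# Tiny hypothetical vocabulary, not the full 5757-word list
vocabulary = {"logic", "login", "yogic", "yogis", "spasm"}

print(sorted(one_letter_neighbors("logic", vocabulary)))  # ['login', 'yogic']
print(sorted(one_letter_neighbors("spasm", vocabulary)))  # []
```

Each word costs 5 x 26 set lookups, which is why the fast membership test of a set matters here.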

What we need next is an algorithm to keep growing the pool of words, starting from any pool, to find all words ultimately reachable in one letter jumps.  A single iteration of this algorithm will add neighbors for those currently in the pool.

In [14]:
from five_land import fish_pond

In [15]:
new_neighbors = fish_pond(neighbors)
new_neighbors

{'logic', 'login', 'yogic', 'yogis'}

The crystallizing word "logic" is now back in the pool, as a neighbor of its neighbors, along with "yogis" which is two hops from "logic".

We may continue cycling the growing list through fish_pond...

In [16]:
neighbors = new_neighbors
new_neighbors = fish_pond(neighbors)
new_neighbors

{'logic', 'login', 'yogas', 'yogic', 'yogis'}

This time the set gained just one word, "yogas".  Cycling until the set stops growing gives us an island, whereon the words are reachable, one to another, by one-letter changes, until "shores are reached" and no more may be added.

Now let's start with a more typical example:

In [17]:
from five_land import grow_pool
print(grow_pool.__doc__)


    find all the words reachable from p by means 
    of *any number* of one letter legal word hops,
    and stop when p stops growing.
    


In [18]:
final_set = grow_pool({"caked"}, True)

15 105
105 366
366 816
816 1346
1346 1922
1922 2487
2487 3006
3006 3451
3451 3877
3877 4200
4200 4349
4349 4426
4426 4462
4462 4480
4480 4486
4486 4489
4489 4490
4490 4492
4492 4493
4493 4493


Each line shows the size of the pool before and after one application of fish_pond( ).  The algorithm keeps cycling the growing pond through fish_pond until it stops growing, as in the last line (4493 4493). At that point, we know that no more "fish" (legal five-letter words) will be found.
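The grow-until-stable loop can be sketched as follows. The names `expand_once` and `grow_until_stable` are hypothetical stand-ins, not the five_land functions, and the six-word `vocabulary` is a toy example.

```python
from string import ascii_lowercase

def expand_once(pool, vocabulary):
    """Return pool plus every vocabulary word one letter away from a pool member."""
    grown = set(pool)
    for word in pool:
        for i in range(len(word)):
            for letter in ascii_lowercase:
                candidate = word[:i] + letter + word[i + 1:]
                if candidate in vocabulary:
                    grown.add(candidate)
    return grown

def grow_until_stable(pool, vocabulary):
    """Apply expand_once repeatedly and stop when the pool stops growing."""
    while True:
        grown = expand_once(pool, vocabulary)
        if len(grown) == len(pool):
            return grown
        pool = grown

# Tiny hypothetical vocabulary, not the full 5757-word list
vocabulary = {"logic", "login", "yogic", "yogis", "yogas", "spasm"}

print(sorted(grow_until_stable({"logic"}, vocabulary)))
# ['logic', 'login', 'yogas', 'yogic', 'yogis']
```

Because the pool only ever gains words, comparing lengths before and after one pass is enough to detect that nothing new was found.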

What is the length of the final set?

In [19]:
len(final_set)

4493

This turns out to be a central fracture in our semantic graph.  Any word in this set of 4493 words can reach every other word in it.  That's the giant island amidst an archipelago of smaller ones, down to the solitary islands such as "spasm".

In [20]:
len(words) - len(final_set)

1264

In [21]:
five_land.never_reached = words - final_set
len(five_land.never_reached)

1264

We learn that the archipelago consists of 1264 words (not necessarily all connected to each other), while the main island consists of 4493 words.  The total is 5757, as expected.

A last research question for this Notebook is:  what is the make-up of this archipelago? How many islands are there, and how big are they?
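One way to approach that question, sketched here under assumptions, is to peel off one island at a time until no words remain, tallying island sizes as we go. The helper names `island_of` and `island_sizes` and the ten-word `vocabulary` are hypothetical. Note that this sketch counts islands per size, whereas the `survey()` output further down reports total words per size.

```python
from collections import Counter
from string import ascii_lowercase

def island_of(start, vocabulary):
    """Collect every word reachable from start by one-letter hops (depth-first)."""
    seen = {start}
    stack = [start]
    while stack:
        word = stack.pop()
        for i in range(len(word)):
            for letter in ascii_lowercase:
                candidate = word[:i] + letter + word[i + 1:]
                if candidate in vocabulary and candidate not in seen:
                    seen.add(candidate)
                    stack.append(candidate)
    return seen

def island_sizes(vocabulary):
    """Map island size -> number of islands of that size."""
    remaining = set(vocabulary)
    sizes = Counter()
    while remaining:
        island = island_of(next(iter(remaining)), remaining)
        sizes[len(island)] += 1
        remaining -= island  # islands never overlap, so remove it whole
    return sizes

# Tiny hypothetical vocabulary, not the full 5757-word list
vocabulary = {"logic", "login", "yogic", "yogis", "yogas",
              "caked", "baked", "naked", "spasm", "opera"}

print(dict(sorted(island_sizes(vocabulary).items())))  # {1: 2, 3: 1, 5: 1}
```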

Here's some background blogging about Donald Knuth's overall plan:
[Knuth (Donald E. Knuth) two decades plan](https://blog.krybot.com/a?ID=01800-af2c69d8-40ba-4b02-9cdc-aca2c205be68)

In [22]:
import five_land
from importlib import reload  # imp is deprecated and removed in Python 3.12
reload(five_land)

<module 'five_land' from '/Users/mac/Documents/elite_school/five_land.py'>

In [23]:
from five_land import survey
initialize()

5757


In [24]:
survey()

({1: 671,
  7: 42,
  2: 206,
  3: 126,
  15: 45,
  6: 24,
  4: 52,
  5: 30,
  24: 24,
  19: 19,
  17: 17,
  8: 8},
 24,
 'duffs',
 set())

In [25]:
survey(24)

({1: 671,
  7: 42,
  2: 206,
  3: 126,
  15: 45,
  6: 24,
  4: 52,
  5: 30,
  24: 24,
  19: 19,
  17: 17,
  8: 8},
 24,
 'duffs',
 {'biffs',
  'biffy',
  'boffo',
  'boffs',
  'buffa',
  'buffo',
  'buffs',
  'cuffs',
  'daffy',
  'doffs',
  'duffs',
  'huffs',
  'huffy',
  'jiffs',
  'jiffy',
  'miffs',
  'muffs',
  'puffs',
  'puffy',
  'ruffs',
  'taffy',
  'tiffs',
  'toffs',
  'toffy'})

The dictionary in this survey maps island size to the total number of words on islands of that size. It tells us there are a lot of 3- and 2-word islands, 42 and 103 respectively, involving 126 and 206 total words. One island consists of 24 words, and is the biggest aside from the Big Island of 4493. 

Islands do not have overlapping membership.  The item "15: 45", for example, suggests 45 words participate in three disjoint subnets of 15 words each.
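The same division applies to every entry. As a small worked check, using only the numbers copied from the survey() output above (the variable names are ours):

```python
# Island size -> total words, copied from the survey() output above
words_by_size = {1: 671, 7: 42, 2: 206, 3: 126, 15: 45, 6: 24,
                 4: 52, 5: 30, 24: 24, 19: 19, 17: 17, 8: 8}

# Dividing each total by the island size gives the number of islands of that size
islands_by_size = {size: total // size
                   for size, total in sorted(words_by_size.items())}

print(islands_by_size)
print(sum(words_by_size.values()))    # 1264 words off the main island
print(sum(islands_by_size.values()))  # 852 islands besides the main one
```

Every total divides evenly by its island size, which is consistent with islands not sharing members.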

Two words on the same island will have a path between them consisting only of one-letter-change hops.  Two words on different islands never will.

To figure out whether any two five-letter words have a path of edge hops between them, it should be sufficient to discover whether they are on the same island.  If both are on an island of size greater than 24, then we know both are on the main island.  Otherwise, within very few iterations, it should be possible to decide if they're on the same smaller island.
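A plain breadth-first search answers the same question without the island-size shortcut. The sketch below is not the `path_exists` function from five_land.py. The name `connected` and the toy `vocabulary` are hypothetical.

```python
from collections import deque
from string import ascii_lowercase

def connected(start, goal, vocabulary):
    """Breadth-first search: True if goal is reachable from start by one-letter hops."""
    if start not in vocabulary or goal not in vocabulary:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == goal:
            return True
        for i in range(len(word)):
            for letter in ascii_lowercase:
                candidate = word[:i] + letter + word[i + 1:]
                if candidate in vocabulary and candidate not in seen:
                    seen.add(candidate)
                    queue.append(candidate)
    return False  # the whole island of start was explored without meeting goal

# Tiny hypothetical vocabulary, not the full 5757-word list
vocabulary = {"logic", "login", "yogic", "yogis", "yogas", "spasm"}

print(connected("logic", "yogis", vocabulary))  # True
print(connected("logic", "spasm", vocabulary))  # False
```

On the full word list, the size shortcut described above saves work when both words sit on the main island, since the search can stop as soon as each pool outgrows 24 words.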

In [26]:
from five_land import path_exists
five_land.initialize()

5757


In [27]:
path_exists("caked", "fluid")

False

In [28]:
path_exists("hello", "norms")

Both on big island


True

In [29]:
path_exists("logic", "yogis")

Same small island


True

### Wordle

Check the [Project Page](ADS_project_1.ipynb) for links between Five Letter Land and Wordle.  The 5757 list we're using is reportedly much larger than the one hard-coded into Wordle.

## Pi Day (3 - 14)

More projects suggest themselves around Pi Day.

Here's a link to a friendly repo:

[Pi Day at Python5](https://nbviewer.org/github/4dsolutions/Python5/blob/master/Pi%20Day%20Fun.ipynb)

And a friendly repl:

[pi_day on Repl.it](https://replit.com/@kurner/piday#main.py)

The Ramanujan expression for $(1/\pi)$ is also taken up in [Exercises](Exercises.ipynb) here at our EliteSchool.

From Python Docs:  [itertools library](https://docs.python.org/3/library/itertools.html)