# Chapter xx

*Data Structures and Information Retrieval in Python*

Copyright 2021 Allen Downey

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)

[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/chapters/chap01.ipynb)

[The New York Times Spelling Bee](https://www.nytimes.com/puzzles/spelling-bee) is a daily puzzle where the goal is to spell as many words as possible using only the given set of seven letters. 
For example, in a recent Spelling Bee, the available letters were `dehiklo`,
so you could spell "like" and "hold".

You can use each of the letters more than once, so "hook" and "deed" would be allowed, too.

To make it a little more interesting, one of the letters is special and must be included in every word.
In this example, the special letter is `o`, so "hood" would be allowed, but not "like".

Each word you find scores points depending on its length, which must be at least four letters.
A word that uses all of the letters is called a "pangram" and scores extra points.

We'll use this puzzle to explore the use of Python sets.

## Sets

Suppose we're given a word and we would like to know whether it can be spelled using only a given set of letters.
The following function solves this problem using list operations.

In [1]:
def uses_only(word, letters):
    for letter in word:
        if letter not in letters:
            return False
    return True

If we find a letter in `word` that is not in `letters`, we can return `False` immediately.
If we get through the word without finding any unavailable letters, we can return `True`.

Let's try it out with some examples, using the available letters `dehiklo`.
Let's see what we can spell with them.

In [2]:
letters = "dehiklo"
uses_only('lode', letters)

True

In [3]:
uses_only('implode', letters)

False

**Exercise:** It is possible to implement `uses_only` more concisely using set operations rather than list operations. Read the documentation of the `set` class and rewrite `uses_only` using sets.

In [4]:
# Solution

def uses_only(word, letters):
    return set(word) <= set(letters)

In [5]:
letters = "dehiklo"
uses_only('lode', letters)

True

In [6]:
uses_only('implode', letters)

False
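
On sets, the `<=` operator tests whether the left operand is a subset of the right. As a small sketch of an equivalent spelling, the `issubset` method does the same test and also accepts any iterable as its argument, so `letters` does not have to be converted to a set first. The name `uses_only_method` is hypothetical, chosen here only to avoid replacing `uses_only`.

```python
def uses_only_method(word, letters):
    # issubset accepts any iterable, so the string `letters` works directly
    return set(word).issubset(letters)

letters = "dehiklo"
print(uses_only_method('lode', letters))      # True
print(uses_only_method('implode', letters))   # False
```

Which form to use is mostly a matter of taste; the operator is shorter, while the method name may be easier to read for someone new to sets.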

## Word list

The following function downloads a list of about 100,000 words in American English, which is the word list installed by default on a recent Unix operating system.

In [7]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://github.com/AllenDowney/DSIRP/raw/main/american-english')

The file contains one word per line, so we can read the file and split it into a list of words like this:

In [8]:
word_list = open('american-english').read().split()
len(word_list)

102401
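
The one-liner above leaves it to Python to close the file at some later point. A slightly more careful variant, sketched here under the assumption that the file is plain text with whitespace-separated words, uses a `with` statement so the file is closed as soon as it has been read. The function name `read_words` is hypothetical.

```python
def read_words(filename):
    # The with statement closes the file even if reading raises an error
    with open(filename) as fp:
        # split() with no arguments splits on any whitespace, including newlines
        return fp.read().split()
```

With the download in place, `word_list = read_words('american-english')` should produce the same list as before.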

**Exercise:** Write a loop that iterates through this word list and prints only words 

* With at least four letters,

* That can be spelled using only the letters `dehiklo`, and

* That include the letter `o`.

In [9]:
for word in word_list:
    if len(word) < 4:
        continue
        
    if 'o' not in word:
        continue
            
    if uses_only(word, letters):
        print(word)

diode
dodo
dole
doled
doll
dolled
doodle
doodled
hellhole
hello
hoed
hold
hole
holed
hood
hooded
hoodie
hoodoo
hoodooed
hook
hooked
idol
kiddo
kilo
kook
kookie
likelihood
lode
loll
lolled
look
looked
oiled
oldie
oleo
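
The same filter can be packaged as a function and a list comprehension, which is handy if we want to collect the words rather than print them. This is only a sketch, not part of the solution above: the names `is_valid` and `center` are hypothetical, and the short `sample` list stands in for `word_list`.

```python
def uses_only(word, letters):
    return set(word) <= set(letters)

def is_valid(word, letters, center):
    # at least four letters, contains the special letter,
    # and uses only the available letters
    return len(word) >= 4 and center in word and uses_only(word, letters)

letters = "dehiklo"
sample = ['hood', 'like', 'old', 'implode', 'kiddo']
valid = [word for word in sample if is_valid(word, letters, 'o')]
print(valid)   # ['hood', 'kiddo']
```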


**Exercise:** Now let's check for pangrams. 
Write a function called `uses_all` that takes two strings and returns `True` if the first string uses all of the letters in the second string.
Think about how to express this computation using set operations.

Test your function with at least one case that returns `True` and one that returns `False`.

In [10]:
# Solution

def uses_all(word, letters):
    return set(letters) <= set(word)

In [11]:
# Solution

letters = "dehiklo"
uses_all('likelihood', letters)

True

In [12]:
# Solution

uses_all('hellhole', letters)

False

**Exercise:** Modify the previous loop to search the word list for pangrams using `uses_only` and `uses_all`.

Or, as a bonus, write a function called `uses_all_and_only` that checks both conditions using a single `set` operation.

In [13]:
# Solution

for word in word_list:
    if len(word) < 4:
        continue
        
    if 'o' not in word:
        continue
            
    if uses_only(word, letters) and uses_all(word, letters):
        print(word)

likelihood


In [14]:
# Solution

def uses_all_and_only(word, letters):
    return set(letters) == set(word)

In [15]:
# Solution

for word in word_list:
    if len(word) < 4:
        continue
        
    if 'o' not in word:
        continue
            
    if uses_all_and_only(word, letters):
        print(word)

likelihood
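
Testing equality works because two sets are equal exactly when each is a subset of the other. The following sketch checks that the two formulations agree on a few words; the sample words are picked for illustration only.

```python
def uses_only(word, letters):
    return set(word) <= set(letters)

def uses_all(word, letters):
    return set(letters) <= set(word)

def uses_all_and_only(word, letters):
    return set(letters) == set(word)

letters = "dehiklo"
for word in ['likelihood', 'hellhole', 'oilfield']:
    both = uses_only(word, letters) and uses_all(word, letters)
    # the two formulations agree on every word
    assert both == uses_all_and_only(word, letters)
    print(word, both)
```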


## Leftovers

So far we've been writing Boolean functions that test specific conditions, but if they return `False`, they don't explain why.
As an alternative to `uses_only`, we could write a function that takes a word and a set of letters and returns the set of letters in the word that are not available. Let's call it `bad_letters`.

In [16]:
def bad_letters(word, letters):
    return set(word) - set(letters)

Now if we run this function with an illegal word, it will tell us which letters in the word are not available.

In [17]:
bad_letters('oilfield', letters)

{'f'}
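
Because sets are unordered, a result with more than one element may display its elements in any order. If a stable, readable result is preferred, one option is to sort the set and join it into a string. The function `bad_letters_sorted` below is a hypothetical variant written for this sketch.

```python
def bad_letters_sorted(word, letters):
    # set difference, then sort so the result is deterministic
    return ''.join(sorted(set(word) - set(letters)))

letters = "dehiklo"
print(bad_letters_sorted('implode', letters))   # mp
```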

**Exercise:** Write a function called `unused_letters` that takes a word and a set of letters and returns the subset of the letters that are not used in `word`.

In [18]:
# Solution

def unused_letters(word, letters):
    return set(letters) - set(word)

In [19]:
# Solution

unused_letters('looked', letters)

{'h', 'i'}

**Exercise:** Write a function called `no_duplicates` that takes a string and returns `True` if each letter appears only once.

In [20]:
# Solution

def no_duplicates(word):
    return len(set(word)) == len(word)

In [21]:
# Solution

no_duplicates('oiled')

True

In [22]:
# Solution

no_duplicates('doodled')

False
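
In the spirit of this section, `no_duplicates` says whether a word repeats a letter but not which one. One possible way to report the repeated letters, sketched here with `collections.Counter` from the standard library, is shown below; the function name `duplicated_letters` is hypothetical.

```python
from collections import Counter

def duplicated_letters(word):
    # count each letter, then keep those that appear more than once
    counts = Counter(word)
    return {letter for letter, n in counts.items() if n > 1}

print(duplicated_letters('oiled'))             # set()
print(sorted(duplicated_letters('doodled')))   # ['d', 'o']
```

An empty result corresponds to `no_duplicates` returning `True`.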