# Algorithms

[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/notebooks/algorithms.ipynb)

## Searching for anagrams

In this notebook we'll implement algorithms for two tasks:

* Testing a pair of words to see if they are anagrams of each other, that is, if you can rearrange the letters in one word to spell the other.

* Searching a list of words for all pairs that are anagrams of each other.

There is a point to these examples, which I will explain at the end.

**Exercise 1:** Write a function that takes two words and returns `True` if they are anagrams. Test your function with the examples below.

In [1]:
def is_anagram(word1, word2):
    
    # sort the letters in each word
    word1 = sorted(word1)
    word2 = sorted(word2)
    
    # compare the sorted words
    return word1 == word2

In [2]:
is_anagram('tachymetric', 'mccarthyite') # True

True

In [3]:
is_anagram('post', 'top') # False, letter not present

False

In [4]:
is_anagram('pott', 'top') # False, letter present but not enough copies

False

In [5]:
is_anagram('top', 'post') # False, letters left over at the end

False

In [6]:
is_anagram('topss', 'postt') # False

False
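
Sorting is not the only way to test for anagrams. As a minimal sketch of an alternative, the following version compares letter counts using `collections.Counter` from the standard library. The name `is_anagram_counter` is hypothetical; it is not part of the original notebook.

```python
from collections import Counter

def is_anagram_counter(word1, word2):
    """Return True if both words contain the same letters with the same counts."""
    # Counter maps each letter to the number of times it appears
    return Counter(word1) == Counter(word2)

print(is_anagram_counter('tachymetric', 'mccarthyite'))  # True
print(is_anagram_counter('pott', 'top'))                 # False
```

For short words it is hard to predict whether this version or the sorting version is faster, so it is worth timing both.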

**Exercise 2:** Use `timeit` to see how fast your function is for these examples:

In [7]:
%timeit is_anagram('tops', 'spot')

755 ns ± 107 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [8]:
%timeit is_anagram('tachymetric', 'mccarthyite')

1.22 µs ± 98.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
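
The `%timeit` magic only works in IPython and Jupyter. Outside a notebook, a rough equivalent is the standard library's `timeit` module. The following is a sketch, assuming 100,000 repetitions is enough for a stable average; the variable names are my own, and the result will differ from machine to machine.

```python
import timeit

def is_anagram(word1, word2):
    # same approach as above: compare the sorted letters
    return sorted(word1) == sorted(word2)

number = 100_000  # ASSUMPTION: enough repetitions for a stable average
total = timeit.timeit(lambda: is_anagram('tops', 'spot'), number=number)

# convert total seconds to nanoseconds per call
per_call_ns = total / number * 1e9
print(f'{per_call_ns:.0f} ns per call')
```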


How can we compare algorithms running on different computers?

## Searching for anagram pairs

**Exercise 3:** Write a function that takes a word list and returns a list of all anagram pairs.

In [9]:
short_word_list = ['proudest', 'stop', 'pots', 'tops', 'sprouted']

In [10]:
#v0 basic function - does not check for identity or reverse pairs

def all_anagram_pairs_v0(word_list):
    
    # initialize an empty list to hold the anagram pairs
    anagram_pairs = []
    
    # loop over each word in the list
    for word1 in word_list:
        
        # loop over each word in the list again
        for word2 in word_list:
            
            # check if the two words are anagrams
            if is_anagram(word1, word2):
                
                # add the pair to the list
                anagram_pairs.append((word1, word2))
                
    return anagram_pairs




In [11]:
#v1 updated function - checks for duplicates and reverse pairs

def all_anagram_pairs_v1(word_list):

    # initialize an empty list to hold the anagram pairs
    anagram_pairs = []
    
    # loop over each word in the list
    for word1 in word_list:
        
        # loop over each word in the list again
        for word2 in word_list:
            
            # check if the two words are distinct anagrams (skip a word paired with itself)
            if word1 != word2 and is_anagram(word1, word2):
                
                # check if the reversed pair is already in the list
                if (word2, word1) not in anagram_pairs:
                    
                    # add the pair to the list
                    anagram_pairs.append((word1, word2))
                
    return anagram_pairs


    



In [12]:
#v2 updated function - uses enumerate so each pair is compared only once
def all_anagram_pairs_v2(word_list):
    # initialize an empty list to hold the anagram pairs
    anagram_pairs = []
    
    for index, word1 in enumerate(word_list):
        for word2 in word_list[index+1:]:
            if is_anagram(word1, word2):
                anagram_pairs.append((word1, word2))

    return anagram_pairs

In [13]:
#v3 updated function - uses itertools.combinations so each pair is compared only once

import itertools

def all_anagram_pairs_v3(word_list):
    # initialize an empty list to hold the anagram pairs
    anagram_pairs = []
    
    for word1, word2 in itertools.combinations(word_list, 2):
        if is_anagram(word1, word2):
            anagram_pairs.append((word1, word2))

    return anagram_pairs

In [14]:
# choose the version of the function to use

all_anagram_pairs = all_anagram_pairs_v3
all_anagram_pairs(short_word_list)


[('proudest', 'sprouted'),
 ('stop', 'pots'),
 ('stop', 'tops'),
 ('pots', 'tops')]
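
To see how much work the nested loops waste, here is a small sketch that counts the calls to `is_anagram` each approach makes for the five-word list. The variable names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import itertools
import math

short_word_list = ['proudest', 'stop', 'pots', 'tops', 'sprouted']
n = len(short_word_list)

# two nested loops over the whole list compare every word with every word
nested_loop_calls = n * n

# combinations yields each unordered pair exactly once
unique_pairs = sum(1 for _ in itertools.combinations(short_word_list, 2))

print(nested_loop_calls, unique_pairs)  # 25 10

# the number of unordered pairs is n choose 2, that is, n * (n - 1) / 2
assert unique_pairs == math.comb(n, 2) == n * (n - 1) // 2
```

Skipping the redundant comparisons cuts the work roughly in half, but the number of comparisons still grows with the square of the list length.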

The following cell downloads a file containing a list of English words.

In [15]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://github.com/AllenDowney/DSIRP/raw/main/american-english')

The following function reads a file and returns a set of words. I used a set because, after we convert words to lower case, there are some repeats.

In [16]:
def read_words(filename):
    """Read lines from a file and split them into words."""
    res = set()
    # use a context manager so the file is closed after reading
    with open(filename) as fp:
        for line in fp:
            for word in line.split():
                res.add(word.strip().lower())
    return res

In [17]:
word_list = read_words('american-english')
len(word_list)

100781

**Exercise 4:** Loop through the word list and print all words that are anagrams of `stop`.

In [18]:
for word in word_list:
    if is_anagram('stop', word):
        print(word)

tops
spot
stop
pots
post
opts


Now run `all_anagram_pairs` with the full `word_list`:

In [19]:
#pairs = all_anagram_pairs(word_list)
#print(pairs)

#runs for more than 45 minutes

#reduced the word list - only words with 5 letters
word_list_5 = [word for word in word_list if len(word) == 5]
len(word_list_5)

pairs = all_anagram_pairs(word_list_5)
print(pairs)

#runtime is 18.7s

[('valid', 'advil'), ('valid', 'vidal'), ('gnats', 'angst'), ('gnats', 'tangs'), ('cures', 'curse'), ('cures', 'sucre'), ('untie', 'unite'), ('purls', 'slurp'), ('barer', 'berra'), ('tills', 'still'), ('tills', 'lilts'), ('trust', 'strut'), ('ethan', 'neath'), ('algol', 'gallo'), ('smear', 'mares'), ('smear', 'reams'), ('dimes', 'deism'), ('tyros', 'troys'), ('tyros', 'story'), ('scull', 'culls'), ('anita', 'tania'), ('quiet', 'quite'), ('dates', 'sated'), ('dates', 'stead'), ('ghost', 'goths'), ('mario', 'maori'), ('mario', 'moira'), ('askew', 'wesak'), ('askew', 'wakes'), ('mable', 'mabel'), ('mable', 'amble'), ('mable', 'melba'), ('mable', 'blame'), ('valor', 'orval'), ('rares', 'rears'), ('rares', 'serra'), ('scuff', 'cuffs'), ('libra', 'blair'), ('aisle', 'elias'), ('aisle', 'elisa'), ("nat's", "ant's"), ("nat's", "tan's"), ('slake', 'lakes'), ('slake', 'leaks'), ('trope', 'perot'), ('taxes', 'texas'), ("aol's", "lao's"), ("aol's", "ola's"), ('acton', 'canto'), ('deers', 'reeds'),

**Exercise 5:** While that's running, let's estimate how long it's going to take.
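
One way to make this estimate is to multiply the number of word pairs by the time for one call to `is_anagram`. The following is a back-of-the-envelope sketch, assuming about 1 µs per call, which is roughly what `%timeit` reported above. It ignores the cost of appending results and the fact that longer words take longer to sort.

```python
import math

n_words = 100781          # len(word_list) reported above
seconds_per_call = 1e-6   # ASSUMPTION: about 1 microsecond per is_anagram call

# number of unordered pairs: n choose 2
n_pairs = math.comb(n_words, 2)
estimated_seconds = n_pairs * seconds_per_call

print(f'{n_pairs:,} pairs')                            # 5,078,354,590 pairs
print(f'about {estimated_seconds / 60:.0f} minutes')   # about 85 minutes
```

Under that assumption the full search takes well over an hour, which is consistent with the run above that had not finished after 45 minutes.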

## A better algorithm

**Exercise 6:** Write a better algorithm! Hint: make a dictionary. How much faster is your algorithm?

In [20]:
def all_anagram_lists(word_list):
    """Finds all anagrams in a list of words.

    word_list: sequence of strings
    """
    return {}

In [21]:
#rewrite the function to return a dictionary with the keys as the sorted words and the values as the list of words that
# are anagrams of the key

# "".join to concatenate list ['a', 'b', 'c', 'd', 'e', 'f'] into a string 'abcdef'
def all_anagram_lists(word_list):
    word_dict = {}
    for word in word_list:
        key = ''.join(sorted(word))
        word_dict.setdefault(key, []).append(word)
    return word_dict
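
To see what the resulting dictionary looks like, here is a sketch that applies the same idea to the five words in `short_word_list`. It uses `collections.defaultdict` in place of `setdefault`, which is a swapped-in but equivalent way to start each key with an empty list. The name `all_anagram_lists_dd` is hypothetical.

```python
from collections import defaultdict

def all_anagram_lists_dd(word_list):
    """Map each sorted-letter key to the list of words that share it."""
    # missing keys are created automatically with an empty list
    word_dict = defaultdict(list)
    for word in word_list:
        key = ''.join(sorted(word))
        word_dict[key].append(word)
    return dict(word_dict)

short_word_list = ['proudest', 'stop', 'pots', 'tops', 'sprouted']
print(all_anagram_lists_dd(short_word_list))
# {'deoprstu': ['proudest', 'sprouted'], 'opst': ['stop', 'pots', 'tops']}
```

Each word is sorted once and inserted once, so the work grows in proportion to the number of words rather than the number of pairs.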


In [22]:
# explaining the code: word_dict.setdefault(key, []).append(word)
# word_dict.setdefault(key, []) returns the list stored under key, first inserting an empty list if the key is missing
# .append(word) then adds the word to that list

pairs = all_anagram_lists(word_list_5)
print(pairs)

In [None]:
word ="fedcba"
key = ''.join(sorted(word))
print(key)

key2 = sorted(word)
print(key2)

type(key2)

abcdef
['a', 'b', 'c', 'd', 'e', 'f']


list

In [None]:
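# all_anagram_lists1 is an earlier draft that is not shown in this notebook.
# The TypeError below is what Python raises when a list, such as the result of
# sorted(word), is used as a dictionary key; that is why the key is built with ''.join
all_anagram_lists1(word_list_5)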

TypeError: unhashable type: 'list'

In [None]:
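# the output recorded below (a few microseconds, and a length of 0) matches
# the placeholder version of all_anagram_lists that returns {}
%time anagram_map = all_anagram_lists(word_list)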

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 11.9 µs


In [None]:
len(anagram_map)

0

## Summary

What is the point of the examples in this notebook?

* The different versions of `is_anagram` show that, when inputs are small, it is hard to say which algorithm will be the fastest. It often depends on details of the implementation. Anyway, the differences tend to be small, so it might not matter much in practice.

* The different algorithms we used to search for anagram pairs show that, when inputs are large, we can often tell which algorithm will be fastest. And the difference between a fast algorithm and a slow one can be huge!
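
To make the second point concrete, here is a sketch that compares how the amount of work grows for the two approaches: one call to `is_anagram` per pair of words, versus one dictionary insertion per word. The function name `work` is hypothetical, and the counts ignore the cost of sorting each word.

```python
import math

def work(n):
    """Return (pair checks, dictionary insertions) for a list of n words."""
    pair_checks = math.comb(n, 2)  # grows with the square of n
    dict_inserts = n               # grows in proportion to n
    return pair_checks, dict_inserts

for n in [100, 10_000, 100_781]:
    pair_checks, dict_inserts = work(n)
    print(f'{n:>7,} words: {pair_checks:>13,} pair checks vs {dict_inserts:>7,} insertions')
```

Multiplying the list length by 100 multiplies the insertions by 100 but the pair checks by about 10,000.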

## Exercises

Before you work on these exercises, you might want to read the Python [Sorting How-To](https://docs.python.org/3/howto/sorting.html). It uses `lambda` to define an anonymous function, which [you can read about here](https://www.w3schools.com/python/python_lambda.asp).
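
As a quick illustration of the `key` argument, here is a sketch that sorts a few words from longest to shortest. The list `words` is made up for the example.

```python
words = ['stop', 'proudest', 'top', 'sprouted']

# key is called on each element; the results determine the sort order
by_length = sorted(words, key=lambda word: len(word), reverse=True)
print(by_length)  # ['proudest', 'sprouted', 'stop', 'top']
```

In this case `key=len` would work just as well; a `lambda` is useful when the key is not already a named function.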

**Exercise 7:**
Make a dictionary like `anagram_map` that contains only keys that map to a list with more than one element. You can use a `for` loop to make a new dictionary, or a [dictionary comprehension](https://www.freecodecamp.org/news/dictionary-comprehension-in-python-explained-with-examples/).

**Exercise 8:**
Find the longest word with at least one anagram. Suggestion: use the `key` argument of `sort` or `sorted` ([see here](https://stackoverflow.com/questions/8966538/syntax-behind-sortedkey-lambda)).

**Exercise 9:**
Find the largest list of words that are anagrams of each other.

**Exercise 10:**
Write a function that takes an integer `word_length` and finds the longest list of words with the given length that are anagrams of each other.

**Exercise 11:**
At this point we have a data structure that contains lists of words that are anagrams, but we have not actually enumerated all pairs.
Write a function that takes `anagram_map` and returns a list of all anagram pairs.
How many are there?

*Data Structures and Information Retrieval in Python*

Copyright 2021 Allen Downey

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)