## Words

In the previous notebook we used `set` to make sets of letters, `len` to count the number of letters in a set, and `<=` to check whether one set is a subset of another.

In this notebook, we'll download a list of words and we'll use it to search for words that have various characteristics, like containing all of the vowels.

The following cell downloads a list of the 10,000 most common English words, curated by [Josh Kaufman](https://github.com/first20hours/google-10000-english) on GitHub.

It uses an `if` statement to download the file only if it does not already exist.

In [1]:
import os

filename = 'google-10000-english-usa-no-swears.txt'

if not os.path.exists(filename):
    !wget https://github.com/first20hours/google-10000-english/raw/master/google-10000-english-usa-no-swears.txt
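
The `!wget` line relies on IPython's shell escape and on `wget` being installed. As a rough alternative, assuming a plain Python environment, the same download-only-if-missing logic could be sketched with the standard library's `urllib.request.urlretrieve`. The helper name `download_if_missing` is hypothetical and not part of this notebook, and the actual download call is commented out so the cell runs without a network connection.

```python
import os
from urllib.request import urlretrieve

def download_if_missing(url, filename):
    """Download url to filename unless the file already exists.

    Returns True if a download happened, False otherwise.
    (Hypothetical helper, for illustration only.)
    """
    if os.path.exists(filename):
        return False
    urlretrieve(url, filename)
    return True

url = ('https://github.com/first20hours/google-10000-english/raw/master/'
       'google-10000-english-usa-no-swears.txt')

# Uncomment to fetch the file (requires network access):
# download_if_missing(url, 'google-10000-english-usa-no-swears.txt')
```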

Now we can open the file, read the contents, split them into lines (the file has one word per line), and save the result in a list of words.

In [2]:
word_list = open(filename).read().splitlines()
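
The one-liner above leaves it to Python to close the file eventually. A common variation, shown here as a sketch rather than as the notebook's own method, uses a `with` statement so the file is closed as soon as the reading is done. To keep the example self-contained, it writes a tiny stand-in file first; the names `path` and `sample_words` are made up for illustration.

```python
import os
import tempfile

# Write a tiny stand-in word file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
with open(path, 'w') as f:
    f.write('the\nof\nand\n')

# The `with` block closes the file once the indented code finishes.
with open(path) as f:
    sample_words = f.read().splitlines()

print(sample_words)   # ['the', 'of', 'and']
```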

We can use `len` to see how many words there are.

In [4]:
len(word_list)

9884

It turns out that we have not quite 10,000 words because the swears have been removed.

The following cell uses a `for` loop to print the words, which appear in order from most common to least common.

In [5]:
for word in word_list:
    print(word)

the
of
and
to
a
in
for
is
on
that
by
this
with
i
you
it
not
or
be
are
from
at
as
your
all
have
new
more
an
was
we
will
home
can
us
about
if
page
my
has
search
free
but
our
one
other
do
no
information
time
they
site
he
up
may
what
which
their
news
out
use
any
there
see
only
so
his
when
contact
here
business
who
web
also
now
help
get
pm
view
online
c
e
first
am
been
would
how
were
me
s
services
some
these
click
its
like
service
x
than
find
price
date
back
top
people
had
list
name
just
over
state
year
day
into
email
two
health
n
world
re
next
used
go
b
work
last
most
products
music
buy
data
make
them
should
product
system
post
her
city
t
add
policy
number
such
please
available
copyright
support
message
after
best
software
then
jan
good
video
well
d
where
info
rights
public
books
high
school
through
m
each
links
she
review
years
order
very
privacy
book
items
company
r
read
group
need
many
user
said
de
does
set
under
general
research
university
january
mail
full
map
reviews
program
life
kno

partnership
editorial
nt
expression
es
equity
provisions
speech
wire
principles
suggestions
rural
shared
sounds
replacement
tape
strategic
judge
spam
economics
acid
bytes
cent
forced
compatible
fight
apartment
height
null
zero
speaker
filed
gb
netherlands
obtain
bc
consulting
recreation
offices
designer
remain
managed
pr
failed
marriage
roll
korea
banks
fr
participants
secret
bath
aa
kelly
leads
negative
austin
favorites
toronto
theater
springs
missouri
andrew
var
perform
healthy
translation
estimates
font
assets
injury
mt
joseph
ministry
drivers
lawyer
figures
married
protected
proposal
sharing
philadelphia
portal
waiting
birthday
beta
fail
gratis
banking
officials
brian
toward
won
slightly
assist
conduct
contained
lingerie
legislation
calling
parameters
jazz
serving
bags
profiles
miami
comics
matters
houses
doc
postal
relationships
tennessee
wear
controls
breaking
combined
ultimate
wales
representative
frequency
introduced
minor
finish
departments
residents
noted
displayed
mom
reduce

candy
pills
tiger
donald
folks
sensor
exposed
telecom
hunt
angels
deputy
indicators
sealed
thai
emissions
physicians
loaded
fred
complaint
scenes
experiments
afghanistan
dd
boost
spanking
scholarship
governance
mill
founded
supplements
chronic
icons
moral
den
catering
aud
finger
keeps
pound
locate
camcorder
pl
trained
burn
implementing
roses
labs
ourselves
bread
tobacco
wooden
motors
tough
roberts
incident
gonna
dynamics
lie
crm
rf
conversation
decrease
chest
pension
billy
revenues
emerging
worship
capability
ak
fe
craig
herself
producing
churches
precision
damages
reserves
contributed
solve
shorts
reproduction
minority
td
diverse
amp
ingredients
sb
ah
johnny
sole
franchise
recorder
complaints
facing
sm
nancy
promotions
tones
passion
rehabilitation
maintaining
sight
laid
clay
defence
patches
weak
refund
usc
towns
environments
trembl
divided
blvd
reception
amd
wise
emails
cyprus
wv
odds
correctly
insider
seminars
consequences
makers
hearts
geography
appearing
integrity
worry
ns
discrimi

sage
knives
eco
vulnerable
arrange
artistic
bat
honors
booth
indie
reflected
unified
bones
breed
detector
ignored
polar
fallen
precise
sussex
respiratory
notifications
msgid
transexual
mainstream
invoice
evaluating
lip
subcommittee
sap
gather
suse
maternity
backed
alfred
colonial
mf
carey
motels
forming
embassy
cave
journalists
danny
rebecca
slight
proceeds
indirect
amongst
wool
foundations
msgstr
arrest
volleyball
mw
adipex
horizon
nu
deeply
toolbox
ict
marina
liabilities
prizes
bosnia
browsers
decreased
patio
dp
tolerance
surfing
creativity
lloyd
describing
optics
pursue
lightning
overcome
eyed
ou
quotations
grab
inspector
attract
brighton
beans
bookmarks
ellis
disable
snake
succeed
leonard
lending
oops
reminder
xi
searched
behavioral
riverside
bathrooms
plains
sku
ht
raymond
insights
abilities
initiated
sullivan
za
midwest
karaoke
trap
lonely
fool
ve
nonprofit
lancaster
suspended
hereby
observe
julia
containers
attitudes
karl
berry
collar
simultaneously
racial
integrate
bermuda
aman

The following `for` loop considers the words one at a time and uses an `if` statement to print only the words that have 15 or more letters.

In [6]:
for word in word_list:
    if len(word) >= 15:
        print(word)

recommendations
characteristics
representatives
telecommunications
responsibilities
sublimedirectory
pharmaceuticals
congratulations
representations
troubleshooting
internationally
characterization
confidentiality
instrumentation
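
If we want to keep the matching words instead of just printing them, one option is a list comprehension, which combines the `for` and the `if` in a single expression. This is a sketch using a small stand-in list (`sample_words`, a made-up name) rather than the full `word_list`.

```python
# A small stand-in for word_list so the sketch runs on its own.
sample_words = ['the', 'recommendations', 'size', 'troubleshooting', 'maximize']

# Keep only the words with 15 or more letters.
long_words = [word for word in sample_words if len(word) >= 15]

print(long_words)        # ['recommendations', 'troubleshooting']
print(len(long_words))   # 2
```

With the result in a list, we can count the matches with `len` or loop over them again later.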


The following loop prints only the words that have the letter `z` in them.

In [7]:
for word in word_list:
    if 'z' in word:
        print(word)

size
z
magazine
organization
zip
amazon
organizations
az
zone
magazines
zealand
arizona
zoom
zero
jazz
oz
brazil
amazing
zum
elizabeth
citizens
verzeichnis
sizes
authorized
crazy
switzerland
czech
recognized
organized
biz
citizen
prize
optimization
nz
pizza
recognize
plaza
realize
puzzle
customize
sized
mhz
personalized
wizard
customized
zus
quiz
zoo
ghz
organizational
realized
hazardous
specialized
venezuela
zdnet
horizontal
mozilla
zimbabwe
authorization
analyzes
bronze
puzzles
dozen
zones
lopez
verizon
brazilian
frozen
bizrate
tanzania
analyze
cruz
shopzilla
thumbzilla
hazard
bizarre
suzuki
buzz
organize
zope
zambia
horizon
prizes
za
mazda
hz
zoning
minimize
belize
zen
mozambique
benz
hazards
organizing
kazakhstan
quizzes
uzbekistan
organizer
enzyme
zshops
ez
citizenship
gazette
utilization
muze
azerbaijan
civilization
analyzed
utilize
dozens
liz
zu
characterized
unauthorized
ozone
lazy
specializing
zinc
freeze
maximize
characterization
optimize
prozac
fuzzy
zoloft
cz
gzip
swaziland
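
When both sides are strings, the `in` operator checks whether the left string appears somewhere inside the right one. With a single letter that is the same as asking whether the word contains the letter; with several letters, they have to appear together and in the same order. A quick sketch with one word from the list:

```python
# A single letter: True if the letter appears anywhere in the word.
print('z' in 'amazon')      # True

# Several letters: True only if they appear together, in this order.
print('maz' in 'amazon')    # True
print('mza' in 'amazon')    # False

# All three letters of 'amn' are in 'amazon', but not consecutively.
print('amn' in 'amazon')    # False
```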

**Exercise** Modify the previous example to print all words that contain the letters `'bing'`, in that order.

The following loop searches for words that contain both `z` and `x`; it uses `and` to check whether both letters appear.

In [8]:
for word in word_list:
    if 'z' in word and 'x' in word:
        print(word)

maximize


**Exercise** Use `and` to find all words that contain `a`, `b`, and `c`. 

Now let's search for words that contain all of the vowels.  We could use `and` again to check them one at a time.  But there's a better way!

## Subsets

To start, I'll make a set that contains the vowels.

In [9]:
vowels = set('aeiou')

We can use this set to check whether a word contains all of the vowels.  If it does, then `vowels` is a subset of the letters in the word.

In [10]:
vowels <= set('education')

True

If the word does not contain all of the vowels, then `vowels` is not a subset:

In [11]:
vowels <= set('educated')

False
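
When the subset test fails, it can be useful to see which vowels are missing. One way to do that, sketched here with the set difference operator `-` (which is not used elsewhere in this notebook), is to subtract the letters of the word from `vowels`. The variable name `missing` is made up for illustration.

```python
vowels = set('aeiou')

# Set difference: the vowels that do not appear in the word.
missing = vowels - set('educated')
print(sorted(missing))                    # ['i', 'o']

# For a word that contains every vowel, nothing is missing.
print(sorted(vowels - set('education')))  # []
```

Sets have no fixed order, so `sorted` is used to print the result in a predictable way.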

**Exercise** Write a loop that prints the words that contain all of the vowels.

In [12]:
for word in word_list:
    if vowels <= set(word):
        print(word)

education
educational
evaluation
regulations
documentation
automotive
miscellaneous
regulation
authorities
authorized
telecommunications
authentication
communicate
equation
dialogue
reputation
automobile
boundaries
equations
simultaneously
obituaries
encouraging
mozambique
revolutionary
questionnaire
unauthorized
evaluations
automobiles
instrumentation


**Exercise** Now suppose you have the letters `hpacnoy`.  How can you check whether a word contains only these letters?

In the next cell, write a test that checks whether the word `canyon` can be spelled with these letters.

**Exercise** Now write a loop that prints all words in the list that contain only these letters.

Modify it so it prints only words with 4 or more letters.

Finally, modify it so it also requires the word to contain the letter `y`.