# Deconstructing BERT's Vocabulary

BERT and BERT-like models almost always have a vocabulary of around 30k words. We'll get to what this really means later in the course. For now, let's just assume it means that the model has a form of meaning associated with each of the 30k entries in the lexicon. Intuitively, this aligns well with our notions of how many words fluent English speakers know.

The file BERT-vocab.txt lists the vocabulary of the "BERT-base" lower-cased (uncased) model, with one "word" per line. Let's see what that looks like.

In [2]:
wc -l BERT-vocab.txt

   30522 BERT-vocab.txt


So we're in the right ballpark with 30522 lines (words). Let's see what's in there.

In [3]:
head BERT-vocab.txt

[PAD]
[unused0]
[unused1]
[unused2]
[unused3]
[unused4]
[unused5]
[unused6]
[unused7]
[unused8]


OK, those don't look like the words we want. BERT reserves a number of special symbols for its own workings, including [PAD], [CLS], [SEP] and a couple of others. These aren't really words. It also looks like it sets aside some entries ([unused\*]) for future use (typically for adaptation to specialized domains). These aren't the words we're looking for either. Let's see how many of these there are.

In [4]:
grep '^\[' < BERT-vocab.txt

[PAD]
[unused0]
[unused1]
[unused2]
[unused3]
[unused4]
[unused5]
[unused6]
[unused7]
[unused8]
[unused9]
[unused10]
[unused11]
[unused12]
[unused13]
[unused14]
[unused15]
[unused16]
[unused17]
[unused18]
[unused19]
[unused20]
[unused21]
[unused22]
[unused23]
[unused24]
[unused25]
[unused26]
[unused27]
[unused28]
[unused29]
[unused30]
[unused31]
[unused32]
[unused33]
[unused34]
[unused35]
[unused36]
[unused37]
[unused38]
[unused39]
[unused40]
[unused41]
[unused42]
[unused43]
[unused44]
[unused45]
[unused46]
[unused47]
[unused48]
[unused49]
[unused50]
[unused51]
[unused52]
[unused53]
[unused54]
[unused55]
[unused56]
[unused57]
[unused58]
[unused59]
[unused60]
[unused61]
[unused62]
[unused63]
[unused64]
[unused65]
[unused66]
[unused67]
[unused68]
[unused69]
[unused70]
[unused71]
[unused72]
[unused73]
[unused74]
[unused75]
[unused76]
[unused77]
[unused78]
[unused79]
[unused80]
[unused81]
[unused82]
[unused83]
[unused84]
[unused85]
[unused86]
[unused87]
[unused88]
[unused89]
[unused90]
[un

[unused690]
[unused691]
[unused692]
[unused693]
[unused694]
[unused695]
[unused696]
[unused697]
[unused698]
[unused699]
[unused700]
[unused701]
[unused702]
[unused703]
[unused704]
[unused705]
[unused706]
[unused707]
[unused708]
[unused709]
[unused710]
[unused711]
[unused712]
[unused713]
[unused714]
[unused715]
[unused716]
[unused717]
[unused718]
[unused719]
[unused720]
[unused721]
[unused722]
[unused723]
[unused724]
[unused725]
[unused726]
[unused727]
[unused728]
[unused729]
[unused730]
[unused731]
[unused732]
[unused733]
[unused734]
[unused735]
[unused736]
[unused737]
[unused738]
[unused739]
[unused740]
[unused741]
[unused742]
[unused743]
[unused744]
[unused745]
[unused746]
[unused747]
[unused748]
[unused749]
[unused750]
[unused751]
[unused752]
[unused753]
[unused754]
[unused755]
[unused756]
[unused757]
[unused758]
[unused759]
[unused760]
[unused761]
[unused762]
[unused763]
[unused764]
[unused765]
[unused766]
[unused767]
[unused768]
[unused769]
[unused770]
[unused771]
[unused772]
[unu

In [5]:
grep '^\[' < BERT-vocab.txt | wc -l

    1000
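
The bracketed entries fall into two groups: the [unused\*] placeholders and everything else that starts with a bracket. A minimal sketch for separating them, assuming the same BERT-vocab.txt file; the function names are hypothetical helpers, not part of any tool.

```shell
# Hypothetical helpers; VOCAB defaults to the file used in this notebook.
VOCAB=${VOCAB:-BERT-vocab.txt}

# Count the [unusedN] placeholder entries.
count_unused() { grep -c '^\[unused[0-9]*]$' "$1"; }

# List every other entry that starts with "[": the named special tokens,
# plus the single character "[" if the vocabulary contains it.
list_other_bracketed() { grep '^\[' "$1" | grep -v '^\[unused[0-9]*]$'; }

if [ -f "$VOCAB" ]; then
    count_unused "$VOCAB"
    list_other_bracketed "$VOCAB"
fi
```

Splitting the list this way makes it easy to see which bracketed entries the model actually uses and which are just reserved slots.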


In [6]:
grep -v '^\[' < BERT-vocab.txt | wc -l

   29522


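Looks like we're down to about 29.5k. (Strictly speaking, the pattern also catches the lone "[" character, which shows up again in the single-character list below.) Let's see what else is in there if we skip over the [] items.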

In [7]:
grep -v '^\[' < BERT-vocab.txt | head

!
"
#
$
%
&
'
(
)
*


Remember we said that subword algorithms start with an initial vocabulary of characters. In class we took that to be letters, numbers and punctuation. That's not quite right: if you're using arbitrary web docs and things like Wikipedia, you're going to run into a lot of odd characters. It's better to just use all the Unicode characters that occur in the training text. Let's see what we get when we look at all the single-character entries in the list.

In [26]:
grep '^.$' < BERT-vocab.txt

!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
[
\
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~
¡
¢
£
¤
¥
¦
§
¨
©
ª
«
¬
®
°
±
²
³
´
µ
¶
·
¹
º
»
¼
½
¾
¿
×
ß
æ
ð
÷
ø
þ
đ
ħ
ı
ł
ŋ
œ
ƒ
ɐ
ɑ
ɒ
ɔ
ɕ
ə
ɛ
ɡ
ɣ
ɨ
ɪ
ɫ
ɬ
ɯ
ɲ
ɴ
ɹ
ɾ
ʀ
ʁ
ʂ
ʃ
ʉ
ʊ
ʋ
ʌ
ʎ
ʐ
ʑ
ʒ
ʔ
ʰ
ʲ
ʳ
ʷ
ʸ
ʻ
ʼ
ʾ
ʿ
ˈ
ː
ˡ
ˢ
ˣ
ˤ
α
β
γ
δ
ε
ζ
η
θ
ι
κ
λ
μ
ν
ξ
ο
π
ρ
ς
σ
τ
υ
φ
χ
ψ
ω
а
б
в
г
д
е
ж
з
и
к
л
м
н
о
п
р
с
т
у
ф
х
ц
ч
ш
щ
ъ
ы
ь
э
ю
я
ђ
є
і
ј
љ
њ
ћ
ӏ
ա
բ
գ
դ
ե
թ
ի
լ
կ
հ
մ
յ
ն
ո
պ
ս
վ
տ
ր
ւ
ք
־
א
ב
ג
ד
ה
ו
ז
ח
ט
י
ך
כ
ל
ם
מ
ן
נ
ס
ע
ף
פ
ץ
צ
ק
ר
ש
ת
،
ء
ا
ب
ة
ت
ث
ج
ح
خ
د
ذ
ر
ز
س
ش
ص
ض
ط
ظ
ع
غ
ـ
ف
ق
ك
ل
م
ن
ه
و
ى
ي
ٹ
پ
چ
ک
گ
ں
ھ
ہ
ی
ے
अ
आ
उ
ए
क
ख
ग
च
ज
ट
ड
ण
त
थ
द
ध
न
प
ब
भ
म
य
र
ल
व
श
ष
स
ह
ा
ि
ी
ो
।
॥
ং
অ
আ
ই
উ
এ
ও
ক
খ
গ
চ
ছ
জ
ট
ড
ণ
ত
থ
দ
ধ
ন
প
ব
ভ
ম
য
র
ল
শ
ষ
স
হ
া
ি
ী
ে
க
ச
ட
த
ந
ன
ப
ம
ய
ர
ல
ள
வ
ா
ி
ு
ே
ை
ನ
ರ
ಾ
ක
ය
ර
ල
ව
ා
ก
ง
ต
ท
น
พ
ม
ย
ร
ล
ว
ส
อ
า
เ
་
།
ག
ང
ད
ན
པ
བ
མ
འ
ར
ལ
ས
မ
ა
ბ
გ
დ
ე
ვ
თ
ი
კ
ლ
მ
ნ
ო
რ
ს
ტ
უ
ᄀ
ᄂ
ᄃ
ᄅ
ᄆ
ᄇ
ᄉ
ᄊ
ᄋ
ᄌ
ᄎ
ᄏ
ᄐ
ᄑ
ᄒ
ᅡ
ᅢ
ᅥ
ᅦ
ᅧ
ᅩ
ᅪ
ᅭ
ᅮ
ᅯ
ᅲ
ᅳ
ᅴ
ᅵ
ᆨ
ᆫ
ᆯ
ᆷ
ᆸ
ᆼ
ᴬ
ᴮ
ᴰ
ᴵ
ᴺ
ᵀ
ᵃ
ᵇ
ᵈ


In [27]:
grep '^.$' < BERT-vocab.txt | wc -l 

     997


OK. We've just knocked another 1000 entries off BERT's word list. Down to roughly 28,500.

Now, the WordPiece algorithm used in BERT employs a ## prefix to mark subword units that continue a word rather than start one. Let's see what they look like.

In [28]:
grep '^##' < BERT-vocab.txt

##s
##a
##e
##i
##ing
##n
##o
##d
##ed
##r
##y
##t
##er
##ly
##l
##m
##u
##es
##h
##on
##k
##us
##c
##g
##an
##p
##en
##in
##man
##al
##ia
##2
##z
##is
##1
##b
##3
##ra
##na
##ers
##f
##4
##le
##6
##7
##ic
##x
##v
##te
##8
##5
##ne
##ie
##ton
##9
##0
##ta
##th
##la
##ness
##ch
##um
##da
##ry
##w
##ma
##rs
##el
##re
##os
##ar
##ka
##ist
##ian
##or
##ism
##ling
##ity
##as
##ley
##ted
##ng
##ville
##able
##ri
##ies
##land
##ur
##ya
##ine
##de
##ki
##ts
##ro
##less
##ey
##ion
##ha
##am
##ter
##ge
##ll
##se
##st
##ation
##nt
##son
##et
##ce
##to
##ting
##ble
##ke
##ni
##j
##tion
##ham
##ive
##do
##ca
##men
##ized
##ous
##va
##id
##co
##ck
##ns
##no
##ga
##li
##ment
##ba
##ner
##ko
##ate
##io
##wood
##led
##ty
##ve
##sa
##by
##ier
##ti
##field
##ford
##ja
##ler
##ally
##ina
##ization
##ful
##go
##il
##at
##hi
##berg
##der
##sh
##rd
##lin
##lo
##ot
##za
##q
##me
##ius
##line
##den
##it
##wa
##ad
##ite
##que
##ard
##les
##ff
##tor
##age
##di
##ir
##mi
##est
##ria
##ze
##well
##ated
##ee
##ah
#

##lina
##zzi
##late
##nga
##ake
##ido
##haus
##anda
##lal
##uan
##gg
##type
##pt
##trom
##hman
##ght
##used
##elia
##eg
##alis
##ages
##uded
##ppa
##lton
##cock
##worthy
##fall
##yon
##hine
##vers
##igo
##ways
##some
##atory
##tered
##uda
##rrell
##ame
##bby
##fest
##ast
##ented
##ided
##fying
##star
##ost
##rod
##uru
##yard
##owing
##dd
##30
##ifies
##ying
##combe
##о
##fly
##flower
##ه
##tail
##nese
##nz
##form
##uc
##hian
##fies
##raj
##xton
##hm
##uki
##dley
##shu
##haw
##icus
##wise
##isa
##kis
##zie
##eld
##lp
##urn
##pu
##lov
##uth
##cle
##kins
##aid
##jon
##him
##rre
##nagar
##pling
##lier
##vier
##mouth
##pf
##top
##how
##graph
##ssen
##bone
##dling
##ime
##lah
##park
##bil
##sby
##bat
##rial
##cian
##hoe
##ي
##usion
##mir
##uation
##lby
##oll
##rman
##ott
##11
##holder
##lake
##rp
##sl
##rer
##ema
##ively
##vor
##culture
##tead
##oth
##ttes
##hof
##oro
##tics
##α
##rid
##iard
##tera
##sies
##tly
##aan
##jin
##iss
##ear
##dock
##haven
##tical
##ook
##rata
##uit
##rama
##biliti

##mur
##gil
##anne
##xes
##llus
##pathy
##hue
##eit
##bate
##lore
##itch
##hea
##phobic
##vati
##sport
##dation
##eyer
##otic
##udy
##kari
##sier
##sei
##gor
##isman
##kling
##ego
##utz
##chule
##nesian
##iol
##market
##xin
##egan
##chet
##user
##ddle
##illon
##xx
##finger
##ulator
##wire
##mour
##atin
##chrome
##ester
##rates
##yria
##llation
##tom
##ulu
##uze
##raz
##ako
##lev
##gau
##bourg
##lles
##rya
##nius
##fight
##hak
##cl
##nham
##iac
##lab
##rber
##sner
##isto
##aran
##mt
##tana
##acies
##atz
##gement
##thest
##ej
##fusion
##orum
##stra
##rred
##vine
##hini
##bies
##eering
##hui
##kee
##nl
##aus
##dition
##notes
##iology
##mology
##isk
##zione
##illy
##naire
##sler
##arts
##imated
##uate
##wley
##ject
##dio
##ods
##ricted
##eti
##ntly
##lane
##ggio
##torm
##oting
##liner
##ush
##ooped
##lage
##rdo
##yen
##zak
##pose
##tur
##enity
##gat
##bara
##zza
##kaya
##raphic
##zam
##ogical
##phine
##fide
##thing
##cars
##ptic
##ibe
##chu
##sio
##aly
##rano
##tious
##oman
##pire
##dable


##ptive
##dur
##antes
##rral
##ggles
##omba
##ament
##uen
##rrick
##lase
##jic
##tonic
##promising
##cala
##sle
##lang
##dication
##fed
##rh
##oza
##woods
##linson
##mming
##ouin
##bala
##dda
##eased
##oides
##rdial
##rke
##thesis
##nob
##tically
##mined
##iti
##tler
##iente
##ulum
##tip
##lley
##iam
##dson
##ower
##anger
##laise
##bour
##icle
##urity
##lux
##yad
##bang
##claim
##erving
##uing
##amps
##sund
##xious
##tops
##icative
##iot
##dberg
##nified
##adia
##vite
##yme
##lino
##hosis
##lick
##ophone
##arable
##jure
##esian
##phus
##brates
##ritan
##erative
##zai
##hae
##imov
##mini
##rso
##taken
##nh
##crest
##ntino
##chester
##optera
##dara
##esthesia
##ior
##basket
##umatic
##cek
##mps
##orous
##omp
##ports
##tream
##deh
##ocks
##yson
##nad
##cius
##gli
##rook
##anov
##acker
##lika
##alla
##som
##national
##umb
##agne
##nessy
##iani
##osphere
##champ
##itan
##athi
##hab
##kong
##oia
##nail
##vc
##dity
##riated
##mission
##tort
##caster
##gman
##khov
##tively
##vio
##eak
##kt
##d

Some of these are recognizable as English suffixes (-ed, -ing, -ly, etc.). Along with these we have a lot of single-character "subwords". Let's stipulate that none of these are what we had in mind for "words". That's not to say they aren't useful or don't carry meaning.
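
In WordPiece vocabularies, a leading ## marks a piece that attaches to the piece before it. A minimal sketch of gluing pieces back together with sed; the function name join_pieces is hypothetical, and the piece sequence in the example is made up for illustration rather than produced by BERT's tokenizer.

```shell
# Hypothetical helper: rejoin space-separated WordPiece-style tokens by
# deleting each " ##" boundary.
join_pieces() { sed 's/ ##//g'; }

# Made-up piece sequence, for illustration only.
echo 'play ##ing with token ##s' | join_pieces
# prints: playing with tokens
```

This also shows why the ## entries are not words in their own right: they only make sense attached to whatever precedes them.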

In [29]:
grep '^##' < BERT-vocab.txt | wc -l


    5828


OK, we just lost nearly another 6k entries. It's starting to sound like maybe BERT's vocab isn't all it's cracked up to be. More like 22,700.
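
The running tally can be checked with shell arithmetic. This is a sketch that plugs in the counts reported by the cells above; the variable names are made up for illustration. One wrinkle: the lone "[" entry matches both the bracket pattern and the single-character pattern, so it is added back once.

```shell
total=30522      # wc -l BERT-vocab.txt
bracketed=1000   # grep '^\['
single=997       # grep '^.$'
pieces=5828      # grep '^##'
overlap=1        # "[" is counted by both the bracket and single-character patterns

remaining=$(( total - bracketed - single - pieces + overlap ))
echo "$remaining"
# prints: 22698
```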

Let's take a look at what's left.


In [23]:
grep -v '^\[' < BERT-vocab.txt | grep -v '^.$' | grep -v '^##'


the
of
and
in
to
was
he
is
as
for
on
with
that
it
his
by
at
from
her
she
you
had
an
were
but
be
this
are
not
my
they
one
which
or
have
him
me
first
all
also
their
has
up
who
out
been
when
after
there
into
new
two
its
time
would
no
what
about
said
we
over
then
other
so
more
can
if
like
back
them
only
some
could
where
just
during
before
do
made
school
through
than
now
years
most
world
may
between
down
well
three
year
while
will
later
city
under
around
did
such
being
used
state
people
part
know
against
your
many
second
university
both
national
these
don
known
off
way
until
re
how
even
get
head
...
didn
team
american
because
de
born
united
film
since
still
long
work
south
us
became
any
high
again
day
family
see
right
man
eyes
house
season
war
states
including
took
life
north
same
each
called
name
much
place
however
go
four
group
another
found
won
area
here
going
10
away
series
left
home
music
best
make
hand
number
company
several
never
last
john
000
very
album
take
end
good
too
following
r

arrived
minute
believed
sorry
complex
beautiful
victory
associated
temple
1968
1973
chance
perhaps
metal
1945
bishop
lee
launched
particularly
tree
le
retired
subject
prize
contains
yeah
theory
empire
suddenly
waiting
trust
recording
happy
terms
camp
champion
1971
religious
pass
zealand
names
2nd
port
ancient
tom
corner
represented
watch
legal
anti
justice
cause
watched
brothers
45
material
changes
simply
response
louis
fast
answer
60
historical
1969
stories
straight
create
feature
increased
rate
administration
virginia
el
activities
cultural
overall
winner
programs
basketball
legs
guard
beyond
cast
doctor
mm
flight
results
remains
cost
effect
winter
larger
islands
problems
chairman
grew
commander
isn
1967
pay
failed
selected
hurt
fort
box
regiment
majority
journal
35
edward
plans
shown
pretty
irish
characters
directly
scene
likely
operated
allow
spring
junior
matches
looks
mike
houses
fellow
beach
marriage
rules
oil
65
florida
expected
nearby
congress
sam
peace
recent
iii
wait
subsequ

send
bowl
plus
enter
catch
economy
duty
1929
speech
authorities
princess
performances
versions
shall
graduate
pictures
effective
remembered
poetry
desk
crossed
starring
starts
passenger
sharp
acres
ass
weather
falling
rank
fund
supporting
check
adult
publishing
heads
cm
southeast
lane
application
bc
les
condition
transfer
prevent
display
ex
regions
earl
federation
cool
relatively
answered
besides
1928
obtained
portion
mix
reaction
liked
dean
express
peak
1932
counter
religion
chain
rare
miller
convention
aid
lie
vehicles
mobile
perform
squad
wonder
lying
crazy
sword
attempted
centuries
weren
philosophy
category
anna
interested
47
sweden
wolf
frequently
abandoned
kg
literary
alliance
task
entitled
threw
promotion
factory
tiny
soccer
visited
matt
fm
achieved
52
defence
internal
persian
43
methods
arrested
otherwise
cambridge
programming
villages
elementary
districts
rooms
criminal
conflict
worry
trained
1931
attempts
waited
signal
bird
truck
subsequent
programme
ad
49
communist
details
f

offensive
shell
shouldn
waist
plain
ross
organ
resolution
manufacturing
adding
relative
kennedy
98
whilst
moth
marketing
gardens
crash
72
heading
partners
credited
carlos
moves
cable
marshall
depending
bottle
represents
rejected
responded
existed
04
jobs
denmark
lock
treated
graham
routes
talent
commissioner
drugs
secure
tests
reign
restored
photography
contributions
oklahoma
designer
disc
grin
seattle
robin
paused
atlanta
unusual
praised
las
laughing
satellite
hungary
visiting
interesting
factors
deck
poems
norman
stuck
speaker
rifle
domain
premiered
dc
comics
actors
01
reputation
eliminated
8th
ceiling
prisoners
script
leather
austin
mississippi
rapidly
admiral
parallel
charlotte
guilty
tools
gender
divisions
fruit
laboratory
nelson
fantasy
marry
rapid
aunt
tribe
requirements
aspects
suicide
amongst
adams
bone
ukraine
abc
kick
sees
edinburgh
clothing
column
rough
gods
hunting
broadway
gathered
concerns
spending
ty
12th
snapped
requires
solar
bones
cavalry
iowa
drinking
waste
index
fr

1881
lion
traded
photographs
writes
craig
trials
generated
beth
noble
debt
percentage
yorkshire
erected
ss
viewed
grades
confidence
ceased
islam
telephone
retail
chile
m²
roberts
sixteen
commented
hampshire
innocent
dual
pounds
checked
regulations
afghanistan
sung
rico
liberty
assets
bigger
options
angels
relegated
tribute
wells
attending
leaf
butler
romanian
forum
monthly
lisa
patterns
gmina
madison
hurricane
rev
bristol
elite
valuable
disaster
democracy
awareness
germans
freyja
loop
absolutely
paying
populations
maine
sole
prayer
spencer
releases
doorway
bull
lover
midnight
conclusion
thirteen
lily
mediterranean
nhl
proud
sample
drummer
guinea
murphy
climb
instant
attributed
horn
ain
railways
steven
autumn
ferry
opponent
root
traveling
secured
corridor
stretched
tales
sheet
trinity
cattle
helps
indicates
manhattan
murdered
fitted
1882
gentle
grandmother
mines
shocked
vegas
produces
caribbean
belong
continuous
desperate
drunk
historically
trio
waved
raf
dealing
nathan
bat
murmured
int

exhibited
armor
twins
divorce
abraham
reviewed
jo
temporarily
matrix
physically
pulse
curled
difficulties
bengal
usage
annie
riders
certificate
holes
warsaw
distinctive
jessica
mutual
1857
customs
circular
eugene
removal
loaded
mere
vulnerable
depicted
generations
dame
heir
enormous
lightly
climbing
pitched
lessons
pilots
nepal
ram
google
preparing
brad
louise
renowned
liam
plaza
shaw
sophie
brilliant
bills
fucking
mainland
server
pleasant
seized
veterans
jerked
fail
beta
brush
radiation
stored
warmth
southeastern
nate
sin
raced
berkeley
joke
athlete
designation
trunk
roland
qualification
archives
heels
artwork
receives
judicial
reserves
woke
installation
abu
floating
fake
lesser
excitement
interface
concentrated
addressed
characteristic
amanda
saxophone
monk
auto
releasing
egg
dies
interaction
defender
ce
outbreak
glory
loving
sequel
consciousness
http
awake
ski
enrolled
handling
rookie
brow
somebody
biography
warfare
amounts
contracts
presentation
fabric
dissolved
challenged
meter
ps

kidnapped
accommodation
emigrated
knockout
correspondent
violation
profits
peaks
lang
specimen
agenda
ancestry
pottery
spelling
equations
obtaining
ki
linking
1825
debris
asylum
buddhism
teddy
gazette
dental
eligibility
utc
fathers
averaged
zimbabwe
francesco
coloured
hissed
translator
lynch
mandate
humanities
mackenzie
uniforms
lin
asset
mhz
fitting
samantha
genera
wei
rim
beloved
shark
riot
entities
expressions
indo
carmen
slipping
owing
abbot
neighbor
sidney
rats
recommendations
encouraging
squadrons
anticipated
commanders
conquered
donations
diagnosed
divide
guessed
decoration
vernon
auditorium
revelation
conversations
herzegovina
dash
alike
protested
lateral
herman
accredited
mg
freeman
mel
fiji
crow
crimson
livestock
humanitarian
bored
oz
whip
legitimate
alter
grinning
spelled
anxious
oriental
wesley
carnival
controller
detect
bowed
educator
kosovo
macedonia
occupy
mastering
stephanie
janeiro
para
unaware
nurses
noon
135
cam
hopefully
ranger
combine
sociology
polar
rica
neill
hol

tracing
brig
afb
pathways
utilizing
mod
mb
disturbance
kneeling
100th
pune
decreasing
168
manipulation
miriam
academia
ecosystem
occupational
rbi
rift
rotary
stacked
incorporation
awakening
generators
guerrero
racist
cyber
derivatives
culminated
allie
annals
panzer
sainte
wikipedia
pops
zu
austro
algerian
politely
nicholson
mornings
educate
tastes
thrill
dartmouth
db
regan
differing
concentrating
choreography
divinity
pledged
alexandre
routing
gregor
madeline
apocalypse
gunfire
culminating
elves
fined
liang
lam
programmed
tar
guessing
transparency
gabrielle
cancellation
flexibility
accession
shea
stronghold
nets
specializes
abused
hasan
sgt
ling
exceeding
admiration
supermarket
photographers
specialised
tilt
resonance
hmm
perfume
380
sami
threatens
garland
botany
guarding
boiled
greet
puppy
russo
supplier
wilmington
vibrant
vijay
paralympic
grumbled
paige
faa
licking
margins
hurricanes
fest
grenade
ripping
counseling
weigh
needles
wiltshire
edison
costly
fulton
tramway
redesigned
staff

tatum
vittorio
cholera
bracing
indifference
projectile
superliga
realises
upgrading
299
porte
retribution
nk
stil
ama
bureaucracy
blackberry
bosch
testosterone
collapses
greer
ioc
fifties
malls
bao
baskets
adolescents
siegfried
mantra
detecting
existent
fledgling
dissatisfied
gan
telecommunication
mingled
sobbed
6000
controversies
outdated
taxis
fright
slams
detectors
fetal
tanned
fray
goth
olympian
skipping
mandates
scratches
sheng
unspoken
hyundai
tracey
hotspur
restrictive
americana
mundo
burroughs
diva
vulcan
distinctions
thumping
mikey
sheds
fide
rescues
springsteen
vested
valuation
pinnacle
rake
sylvie
almond
quivering
alteration
faltered
51st
hydra
ticked
recommends
antigua
arjun
stagecoach
wilfred
trickle
pronouns
aryan
nighttime
gall
pea
stitch
leung
milos
eritrea
nexus
starved
snowfall
kant
parasitic
cot
discus
hana
strikers
appleton
kitchens
disclose
metis
1701
tesla
fitch
1735
blooded
decimal
cyclones
eun
bottled
peas
pensacola
basha
bolivian
crabs
boil
lanterns
partridge
r

handheld
intersecting
stimulating
crate
fellowships
hemingway
casinos
climatic
fordham
copeland
drip
beatty
leaflets
robber
brothel
madeira
sphinx
ultrasound
valor
forbade
leonid
villas
duane
marquez
disadvantaged
forearms
kawasaki
reacts
consular
lax
uncles
uphold
concepcion
dorsey
lass
arching
passageway
1708
researches
tia
internationals
distinguishes
javanese
divert
plotted
affirmative
signifies
validation
kari
felicity
georgina
zulu
overcoming
argyll
1734
chiba
ratification
windy
earls
parapet
hunan
pristine
astrid
punta
brodie
malaga
minerva
rouse
bellowed
pagoda
portals
reclamation
parentheses
quoting
allergic
palette
showcases
benefactor
heartland
nonlinear
bladed
cheerfully
scans
1666
girlfriends
pedersen
hiram
sous
1683
bobo
primaries
smiley
unearthed
uniformly
fis
metadata
1635
ind
recoil
406
hilbert
jamestown
mcmillan
tulane
seychelles
antics
coli
fated
stucco
1654
bulky
accolades
arrays
caledonian
carnage
optimism
puebla
enforcing
rotherham
seo
dunlop
aeronautics
chimed
in

Although it's not stated, this appears to be a roughly frequency-ordered list, with "the" at the top.

Let's just sort it alphanumerically to see what's in there.

In [24]:
grep -v '^\[' < BERT-vocab.txt | grep -v '^.$' | grep -v '^##' | sort


£1
£10
£100
£2
£3
£5
...
00
000
001
00pm
01
02
03
04
05
050
06
07
08
09
10
100
1000
100th
101
1016
102
103
104
105
106
107
108
1086
109
10th
11
110
1100
111
112
113
114
115
116
117
118
119
11th
12
120
1200
121
122
123
124
125
126
127
128
129
12th
13
130
1300
131
132
133
134
135
136
137
138
139
13th
14
140
1400
141
142
143
144
145
146
147
148
149
14th
15
150
1500
151
152
153
154
1540
155
1550
156
1560
157
1570
158
1580
159
15th
16
160
1600
1603
1604
1605
1609
161
1610
1611
1612
1618
162
1620
1621
1622
1623
1624
1625
1626
1628
1629
163
1630
1632
1634
1635
1638
164
1640
1641
1642
1643
1644
1645
1646
1648
1649
165
1650
1651
1652
1653
1654
1655
1656
1658
1659
166
1660
1661
1662
1663
1664
1665
1666
1667
167
1670
1672
1675
1679
168
1680
1682
1683
1685
1688
1689
169
1690
1692
1695
1697
1699
16th
17
170
1700
1701
1702
1703
1704
1705
1707
1708
1709
171
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
172
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
173
1730
1731
1732
1733
1734
1735
1736
17

airplane
airplanes
airplay
airport
airports
airs
airship
airspace
airways
aisle
aisles
aix
aj
ajax
ak
aka
akbar
akin
akira
akron
al
ala
alabama
alain
alam
alan
alarm
alarmed
alarms
alaska
alba
albania
albanian
albanians
albans
albany
albeit
albert
alberta
alberto
albion
albrecht
album
albums
albuquerque
alcohol
alcoholic
alcoholism
alderman
aldo
ale
alec
alejandro
aleksandr
aleppo
alert
alerted
alessandro
alex
alexa
alexander
alexandra
alexandre
alexandria
alexei
alexia
alexis
alf
alfa
alfonso
alfred
alfredo
algae
algebra
algebraic
algeria
algerian
algiers
algorithm
algorithms
ali
alias
alice
alicia
alien
aliens
align
aligned
alignment
alike
alison
alistair
alive
all
alla
allah
allan
allegations
alleged
allegedly
allegheny
allegiance
alleging
allegro
allen
allergic
alleviate
alley
alliance
alliances
allie
allied
allies
alligator
allison
allmusic
allocated
allocation
allotted
allow
allowance
allowed
allowing
allows
alloy
alloys
ally
alma
almeida
almighty
almond
almost
alone
along
alongs

award
awarded
awarding
awards
aware
awareness
away
awe
awesome
awful
awhile
awkward
awkwardly
awoke
ax
axe
axel
axes
axial
axis
axle
aye
az
azerbaijan
azerbaijani
aziz
azores
aztec
azure
b1
ba
baba
babe
babies
babu
baby
babylon
babylonian
baccalaureate
bach
bachelor
back
backbone
backdrop
backed
background
backgrounds
backing
backlash
backpack
backs
backseat
backside
backstage
backstroke
backup
backward
backwards
backyard
bacon
bacteria
bacterial
bacterium
bad
baden
badge
badger
badges
badly
badminton
bae
baffled
bafta
bag
baggage
baghdad
bags
baha
bahadur
bahamas
bahia
bahn
bahrain
bai
bail
bailey
bain
baird
bait
baja
baked
baker
bakery
baking
baku
bal
bala
balance
balanced
balancing
balcony
bald
baldwin
balfour
bali
balkan
balkans
ball
ballad
ballads
ballard
ballast
ballet
ballistic
balloon
balloons
ballot
ballots
ballroom
balls
balthazar
baltic
baltimore
bam
bamboo
ban
banana
bananas
banco
bancroft
band
banda
bandage
bandages
banded
bandit
bandits
bands
bandwidth
bane
bang
bangalore

brothers
brought
broughton
brow
brown
browne
browning
brownish
browns
brows
browser
bruce
bruins
bruise
bruised
bruises
brunei
brunette
bruno
brunswick
brush
brushed
brushes
brushing
brussels
brutal
brutality
brutally
brute
bryan
bryant
bryce
bryn
bryson
bs
bsc
bt
bu
bubba
bubble
bubbles
bubbling
buccaneers
buchanan
bucharest
buck
bucket
buckingham
buckinghamshire
buckle
buckled
buckley
bucks
bud
budapest
buddha
buddhism
buddhist
buddies
buddy
budge
budget
budgets
buds
buena
buenos
buff
buffalo
buffer
buffet
buffy
bug
bugs
buick
build
builder
builders
building
buildings
builds
built
bukit
bulb
bulbs
bulgaria
bulgarian
bulge
bulging
bulk
bulky
bull
bulldog
bulldogs
bullet
bulletin
bullets
bullock
bulls
bullshit
bully
bullying
bum
bump
bumped
bumper
bumps
bun
bunch
bundesliga
bundle
bundled
bundles
bungalow
bunk
bunker
bunny
burden
bureau
bureaucracy
buren
burger
burgess
burgundy
burial
burials
buried
burke
burkina
burlington
burma
burmese
burn
burned
burnett
burnham
burning
burnley
burn

cleared
clearer
clearing
clearly
clears
cleavage
clemens
clement
clements
clemson
clench
clenched
clenching
cleopatra
clergy
clergyman
cleric
clerical
clerk
clerks
clermont
cleveland
clever
click
clicked
clicking
clicks
client
clients
cliff
clifford
cliffs
clifton
climate
climates
climatic
climax
climb
climbed
climbing
climbs
clinch
clinched
clinging
clinic
clinical
clinics
clint
clinton
clip
clipped
clips
clit
clive
cloak
clock
clocks
clockwise
clone
clones
close
closed
closely
closeness
closer
closes
closest
closet
closing
closure
cloth
clothed
clothes
clothing
cloud
clouded
clouds
cloudy
clover
clown
club
clube
clubhouse
clubs
clue
clues
clumsy
clung
cluster
clustered
clusters
clutch
clutched
clutches
clutching
clyde
cm
cmll
cn
cnn
co
coa
coach
coached
coaches
coaching
coal
coalition
coarse
coast
coastal
coaster
coastline
coasts
coat
coated
coating
coats
cobalt
cobb
cobra
coca
cocaine
cochin
cochran
cochrane
cock
cocked
cockpit
cocktail
cocky
coco
cocoa
coconut
cod
code
coded
codes


cord
cordoba
cords
core
cores
corey
corinne
cork
corn
cornelius
cornell
corner
cornerback
cornered
corners
cornerstone
cornice
cornish
cornwall
corona
coronation
coroner
corp
corporal
corporate
corporation
corporations
corps
corpse
corpses
corpus
correct
corrected
correction
correctional
corrections
correctly
correlated
correlation
correspond
corresponded
correspondence
correspondent
corresponding
corresponds
corridor
corridors
corrosion
corrugated
corrupt
corrupted
corruption
corsica
cortes
cortex
corvette
cory
cosmetic
cosmetics
cosmic
cosmopolitan
cosmos
cost
costa
costello
costing
costly
costs
costume
costumes
cot
cote
cottage
cottages
cotton
couch
cougars
cough
coughed
coughing
could
couldn
coulter
council
councillor
councillors
councils
counsel
counseling
counselor
count
countdown
counted
counter
counterattack
countered
counterpart
counterparts
counters
countess
counties
counting
countless
countries
country
countryside
counts
county
coup
coupe
couple
coupled
couples
coupling
cour

dickens
dickinson
dickson
dictated
dictator
dictatorship
dictionary
did
didn
die
died
diego
dies
diesel
diet
dietary
dieter
dietrich
differ
differed
difference
differences
different
differential
differentiate
differentiated
differentiation
differently
differing
differs
difficult
difficulties
difficulty
diffuse
diffusion
dig
digest
digger
digging
digit
digital
digitally
digits
dignitaries
dignity
dil
dilapidated
dilemma
dillon
dim
dime
dimension
dimensional
dimensions
diminished
dimitri
dimly
din
dina
dinah
dinamo
diner
ding
dining
dinner
dinners
dino
dinosaur
dinosaurs
diocesan
diocese
figuring
fiji
file
filed
files
filing
filipino
filippo
fill
filled
filling
fills
filly
film
filmed
filmfare
filming
filmmaker
filmmakers
filmmaking
films
filter
filtered
filtering
filters
filthy
fin
final
finale
finalist
finalists
finalized
finally
finals
finance
financed
finances
financial
financially
financing
finch
find
finding
findings
finds
fine
fined
finely
finer
fines
finest
finger
fingered
finger

golden
martina
martinez
martini
martinique
martins
marty
martyr
martyrs
maru
marvel
marvelous
marvin
marx
marxism
marxist
mary
maryland
marylebone
mas
mascara
mascot
masculine
masjid
mask
masked
masks
mason
masonic
masonry
masovian
mass
massachusetts
massacre
massage
masses
massey
massif
massimo
massive
mast
master
mastered
mastering
masterpiece
masters
mastery
mat
mata
match
matched
matches
matching
mate
mated
mateo
mater
material
materialized
materials
maternal
maternity
mates
math
mathematical
mathematician
mathematicians
mathematics
mathew
mathews
mathias
matilda
mating
matrices
matrix
mats
matt
matteo
matter
mattered
matters
matthew
matthews
matthias
mattress
mature
maturity
maud
maui
maureen
maurice
mauritius
mausoleum
maverick
mavericks
max
maxi
maxim
maximal
maximilian
maximize
maximum
maximus
maxwell
may
maya
maybe
mayer
mayfair
mayfield
mayhem
maynard
mayo
mayor
mayoral
mayors
mazda
maze
mb
mba
mbc
mbe
mc
mca
mcbride
mcc
mccain
mccall
mccann
mccarthy
mccartney
mcconnell
mccor

supervised
supervising
supervision
supervisor
supervisors
supervisory
supper
supplement
supplemental
supplementary
supplemented
supplements
supplied
supplier
suppliers
supplies
supply
supplying
support
supported
supporter
supporters
supporting
supportive
supports
suppose
supposed
supposedly
suppress
suppressed
suppression
supremacy
supreme
sur
sure
surely
surf
surface
surfaced
surfaces
surfer
surfing
surge
surged
surgeon
surgeons
surgery
surgical
suriname
surname
surpassed
surpassing
surplus
surprise
surprised
surprises
surprising
surprisingly
surreal
surrender
surrendered
surrey
surround
surrounded
surrounding
surroundings
surrounds
surveillance
survey
surveyed
surveying
surveyor
surveys
survival
survive
survived
survives
surviving
survivor
survivors
susan
susanna
susannah
susceptible
susie
suspect
suspected
suspects
suspend
suspended
suspense
suspension
suspicion
suspicions
suspicious
suspiciously
susquehanna
sussex
sustain
sustainability
sustainable
sustained
sustaining
sutherland
s

transmitter
transmitters
transmitting
transparency
transparent
transplant
transport
transportation
transported
transporting
transports
transvaal
transverse
transylvania
trap
trapped
trapping
traps
trash
trauma
traumatic
travel
traveled
traveler
travelers
traveling
travelled
traveller
travellers
travelling
travels
travers
traverse
traversed
travis
tray
tre
treacherous
tread
treason
treasure
treasurer
treasures
treasury
treat
treated
treaties
treating
treatise
treatment
treatments
treats
treaty
tree
trees
trek
tremble
trembled
trembling
tremendous
tremor
trench
trenches
trend
trends
trent
trenton
tres
trevor
trey
tri
triad
trial
trials
triangle
triangles
triangular
triassic
tribal
tribe
tribes
tribunal
tribune
tributaries
tributary
tribute
trick
tricked
trickle
tricks
tricky
trident
tried
trier
tries
trieste
trigger
triggered
triggering
triggers
trillion
trilogy
trim
trimmed
trinidad
trinity
trio
trip
triple
tripoli
tripped
trips
tristan
triumph
triumphant
trivial
trois
trojan
troll
trol

Hmm.  Numbers, lots of numbers, time expressions, etc.  Also not words. 

In [8]:
egrep -o '[[:digit:]]+' < BERT-vocab.txt

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [12]:
egrep -o '[0-9]+' < BERT-vocab.txt | wc

    2072    2072    8239


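That's about 2000 matches, but it overstates the number of entries that are just plain numbers. With -o and an unanchored pattern, grep prints every run of digits wherever it occurs, so the count also includes the indices inside the [unused\*] entries (the 0, 1, 2, ... at the top of the output), digits inside ## pieces, and the digits in entries like 10th. The number of purely numeric entries is smaller than this.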
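
To count only the entries that consist entirely of digits, the pattern can be anchored at both ends. A minimal sketch contrasting the two counts; the function names are hypothetical, and the file name is the one used in this notebook.

```shell
VOCAB=${VOCAB:-BERT-vocab.txt}

# Every run of digits anywhere in any entry (what the unanchored -o pattern counts).
count_digit_runs() { grep -Eo '[0-9]+' "$1" | wc -l | tr -d ' '; }

# Only entries that are digits from start to end.
count_pure_numbers() { grep -Ec '^[0-9]+$' "$1"; }

if [ -f "$VOCAB" ]; then
    count_digit_runs "$VOCAB"
    count_pure_numbers "$VOCAB"
fi
```

The gap between the two numbers is the contribution of entries such as [unused0], ##2 and 10th, which contain digits but are not numbers on their own.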

What about a bit of morphology? BERT should get credit for knowing "look", but it shouldn't get credit for knowing four words just because it knows all the inflected forms of this regular verb.

In [15]:
grep '^look' < BERT-vocab.txt

looked
look
looking
looks
lookout
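
One crude way to approximate "count a regular verb once" is to strip a few common suffixes and deduplicate. This is only a sketch: the suffix list and the function name strip_suffixes are assumptions, and real lemmatization needs far more than this (the same rule would turn "red" into "r", for instance).

```shell
# Hypothetical helper: strip a trailing -ing, -ed or -s, then deduplicate.
strip_suffixes() { sed -E 's/(ing|ed|s)$//' | sort -u; }

# The five entries returned by the grep above.
printf '%s\n' looked look looking looks lookout | strip_suffixes
# prints:
# look
# lookout
```

On this small example the five entries collapse to two, which is closer to the number of distinct words a person would say are in that list.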
