# Deconstructing BERT's Vocabulary

BERT and BERT-like models almost always have a vocabulary of around 30k words. We'll get to what this really means later in the course. For now, let's just assume it means that the model has a form of meaning associated with each of the 30k entries in the lexicon. Intuitively, this aligns well with our notions of how many words fluent English speakers know.

Here we have a list of the words that are in the "BERT-base" lower-case model in the file BERT-vocab.txt, with one "word" per line.  Let's see what that looks like.

In [2]:
wc -l BERT-vocab.txt

   30522 BERT-vocab.txt


So we're in the right ballpark with 30522 lines (words). Let's see what's in there.

In [3]:
head BERT-vocab.txt

[PAD]
[unused0]
[unused1]
[unused2]
[unused3]
[unused4]
[unused5]
[unused6]
[unused7]
[unused8]


Ok, those don't look like the words we want.  BERT uses a number of unique symbols in its workings, including symbols like [PAD], [CLS], [SEP] and a couple of others.  These aren't really words. And it looks like it reserves some entries ([unused\*]) for future work (typically for adaptation to specialized domains). These aren't the words we're looking for.  Let's see how many of these there are. 

In [16]:
grep '^\[' < BERT-vocab.txt | head

[PAD]
[unused0]
[unused1]
[unused2]
[unused3]
[unused4]
[unused5]
[unused6]
[unused7]
[unused8]


In [5]:
grep '^\[' < BERT-vocab.txt | wc -l

    1000


In [6]:
grep -v '^\[' < BERT-vocab.txt | wc -l

   29522
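
The same split can be sketched in Python, which makes it easier to reuse the two halves later. This is only a sketch: the names `BRACKETED` and `split_special` and the tiny sample list are made up for illustration. The pattern is a little stricter than grep '^\[', which would also match an entry that is just a bare "[" if the file contains one.

```python
import re

# Matches entries that are wholly of the form [something], e.g. [PAD] or [unused0].
# Assumption: this is stricter than grep '^\[', which also matches a bare "[".
BRACKETED = re.compile(r"^\[.+\]$")

def split_special(vocab):
    """Return (special, rest): bracketed marker entries and everything else."""
    special = [tok for tok in vocab if BRACKETED.match(tok)]
    rest = [tok for tok in vocab if not BRACKETED.match(tok)]
    return special, rest

# Tiny hypothetical sample, not the real file contents.
sample = ["[PAD]", "[unused0]", "[CLS]", "[", "!", "the", "##ing"]
special, rest = split_special(sample)
print(len(special), len(rest))  # 3 4
```

To try it on the real file, one option is to read the entries with `[line.rstrip("\n") for line in open("BERT-vocab.txt", encoding="utf-8")]` and pass that list in.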


Looks like we're down to 29k. Let's see what else is in there if we skip over the [] items.

In [7]:
grep -v '^\[' < BERT-vocab.txt | head

!
"
#
$
%
&
'
(
)
*


Remember we said that subword algorithms start with an initial vocabulary of characters. In class we took that to be letters, numbers and punctuation.  That's not quite right: if you're using arbitrary web docs and things like Wikipedia, you're going to run into a lot of odd characters.  It's better to just use all the Unicode characters that occur in the training text. Let's see what we get when we look at all the single-character entries in the list.

In [17]:
grep '^.$' < BERT-vocab.txt | head 

!
"
#
$
%
&
'
(
)
*


In [27]:
grep '^.$' < BERT-vocab.txt | wc -l 

     997


Ok. We've just knocked roughly another 1000 entries off BERT's word list. Down to roughly 28,500.
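
To get a feel for what kinds of characters those single-character entries are, here is a small Python sketch that buckets them by Unicode major category. The function names and the sample list are assumptions for illustration only. Python's `len` counts code points regardless of the shell's settings, whereas what grep's `.` treats as one character depends on the locale.

```python
import unicodedata

def single_chars(vocab):
    """Entries that are exactly one Unicode code point long."""
    return [tok for tok in vocab if len(tok) == 1]

def by_category(chars):
    """Count characters by Unicode major category (L letter, N number, P punctuation, S symbol, ...)."""
    counts = {}
    for ch in chars:
        major = unicodedata.category(ch)[0]
        counts[major] = counts.get(major, 0) + 1
    return counts

# Hypothetical sample: "!", "a", "7", pound sign, e-acute, "the", "##s", a CJK character.
sample = ["!", "a", "7", "\u00a3", "\u00e9", "the", "##s", "\u4e2d"]
chars = single_chars(sample)
print(len(chars))          # 6
print(by_category(chars))  # {'P': 1, 'L': 3, 'N': 1, 'S': 1}
```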

Now the WordPiece algorithm used in BERT prefixes ## to the subword units that continue a word rather than start one. Let's see what they look like.

In [20]:
grep '^##'< BERT-vocab.txt | head -100

##s
##a
##e
##i
##ing
##n
##o
##d
##ed
##r
##y
##t
##er
##ly
##l
##m
##u
##es
##h
##on
##k
##us
##c
##g
##an
##p
##en
##in
##man
##al
##ia
##2
##z
##is
##1
##b
##3
##ra
##na
##ers
##f
##4
##le
##6
##7
##ic
##x
##v
##te
##8
##5
##ne
##ie
##ton
##9
##0
##ta
##th
##la
##ness
##ch
##um
##da
##ry
##w
##ma
##rs
##el
##re
##os
##ar
##ka
##ist
##ian
##or
##ism
##ling
##ity
##as
##ley
##ted
##ng
##ville
##able
##ri
##ies
##land
##ur
##ya
##ine
##de
##ki
##ts
##ro
##less
##ey
##ion
##ha
##am
##ter


Some of these are recognizable as English suffixes (-ed, -ing, -ly, etc).  Along with these we have a lot of single character "subwords".  Let's stipulate that none of these are what we had in mind for 'words'.  That's not to say they aren't useful or don't carry meaning.
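
To make the role of the ## prefix concrete, here is a minimal sketch of how a sequence of pieces gets glued back into words. The function name `join_pieces` and the example segmentation are hypothetical; a real tokenizer may well split these words differently.

```python
def join_pieces(pieces):
    """Rebuild words from WordPiece-style tokens: '##' glues a piece onto the previous one."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            # Continuation piece: drop the marker and append to the current word.
            words[-1] += piece[2:]
        else:
            # Word-initial piece (or a stray continuation with nothing before it).
            words.append(piece)
    return words

# Hypothetical segmentation, for illustration only.
print(join_pieces(["snow", "##board", "##ing", "team"]))
# ['snowboarding', 'team']
```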

In [29]:
grep '^##' < BERT-vocab.txt | wc -l


    5828


Ok, we just lost nearly another 6k entries. Starting to sound like maybe BERT's vocab isn't all it's cracked up to be.  More like 22k.
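
Taking the counts reported so far at face value, the arithmetic is 30522 - 1000 - 997 - 5828 = 22697. That assumes the three greps never match the same line. A rough way to avoid double counting is to put each entry into exactly one bucket, as in this sketch (the bucket names and the sample list are made up for illustration):

```python
def tally(vocab):
    """Bucket each entry into the first class it matches, so nothing is counted twice."""
    counts = {"bracketed": 0, "single_char": 0, "continuation": 0, "other": 0}
    for tok in vocab:
        if tok.startswith("["):        # same test as grep '^\['
            counts["bracketed"] += 1
        elif len(tok) == 1:            # same idea as grep '^.$'
            counts["single_char"] += 1
        elif tok.startswith("##"):     # same test as grep '^##'
            counts["continuation"] += 1
        else:
            counts["other"] += 1
    return counts

# Hypothetical sample; note the bare "[" lands in the bracketed bucket only.
sample = ["[PAD]", "[", "!", "a", "##s", "##ing", "the", "look"]
print(tally(sample))
# {'bracketed': 2, 'single_char': 2, 'continuation': 2, 'other': 2}
```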

Let's take a look at what's left.


In [21]:
grep -v '^\[' < BERT-vocab.txt | grep -v '^.$' | grep -v '^##' | head -200


the
of
and
in
to
was
he
is
as
for
on
with
that
it
his
by
at
from
her
she
you
had
an
were
but
be
this
are
not
my
they
one
which
or
have
him
me
first
all
also
their
has
up
who
out
been
when
after
there
into
new
two
its
time
would
no
what
about
said
we
over
then
other
so
more
can
if
like
back
them
only
some
could
where
just
during
before
do
made
school
through
than
now
years
most
world
may
between
down
well
three
year
while
will
later
city
under
around
did
such
being
used
state
people
part
know
against
your
many
second
university
both
national
these
don
known
off
way
until
re
how
even
get
head
...
didn
team
american
because
de
born
united
film
since
still
long
work
south
us
became
any
high
again
day
family
see
right
man
eyes
house
season
war
states
including
took
life
north
same
each
called
name
much
place
however
go
four
group
another
found
won
area
here
going
10
away
series
left
home
music
best
make
hand
number
company
several
never
last
john
000
very
album
take
end
good
too
following
r

Although it's not stated, this looks very much like a frequency-ordered list: "the" is at the top, followed by the other usual high-frequency suspects.

Let's just sort it alphanumerically to see what's in there.

In [22]:
grep -v '^\[' < BERT-vocab.txt | grep -v '^.$' | grep -v '^##'  | sort | head -200


£1
£10
£100
£2
£3
£5
...
00
000
001
00pm
01
02
03
04
05
050
06
07
08
09
10
100
1000
100th
101
1016
102
103
104
105
106
107
108
1086
109
10th
11
110
1100
111
112
113
114
115
116
117
118
119
11th
12
120
1200
121
122
123
124
125
126
127
128
129
12th
13
130
1300
131
132
133
134
135
136
137
138
139
13th
14
140
1400
141
142
143
144
145
146
147
148
149
14th
15
150
1500
151
152
153
154
1540
155
1550
156
1560
157
1570
158
1580
159
15th
16
160
1600
1603
1604
1605
1609
161
1610
1611
1612
1618
162
1620
1621
1622
1623
1624
1625
1626
1628
1629
163
1630
1632
1634
1635
1638
164
1640
1641
1642
1643
1644
1645
1646
1648
1649
165
1650
1651
1652
1653
1654
1655
1656
1658
1659
166
1660
1661
1662
1663
1664
1665
1666
1667
167
1670
1672
1675
1679
168
1680
1682
1683
1685
1688
1689
169
1690
1692
1695
1697
1699
16th
17
170
1700
1701
1702
1703
1704
1705
1707
1708
1709
171
1710
1711
1712
1713
1714


Hmm.  Numbers, lots of numbers, time expressions, etc.  Also not words.

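Let's pull out the digit strings. (Strictly, egrep -o prints every run of digits it finds anywhere on a line, so this is broader than the lines that consist only of a number.)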

In [23]:
egrep -o '[[:digit:]]+' < BERT-vocab.txt | head -200

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199


In [12]:
egrep -o '[0-9]+' < BERT-vocab.txt | wc

    2072    2072    8239


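That's about 2,000 matches, but they aren't all entries that are just plain numbers. Because the pattern is unanchored and -o prints each matching run of digits, it also picks up the digits inside entries such as [unused0], ##2 and 100th; the 0 through 199 at the top of the listing above come from the [unused\*] entries at the start of the file. Counting only whole-line numbers would need an anchored pattern such as '^[0-9]+$'.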
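
The difference between "a run of digits somewhere in the entry" and "the entry is a number" is easy to see on a small example. The sketch below contrasts the two; the function names and the sample list are assumptions for illustration, and the entries are only meant to resemble the kinds of lines seen in the file.

```python
import re

DIGITS = re.compile(r"[0-9]+")

def plain_numbers(vocab):
    """Entries consisting solely of ASCII digits (the whole entry, not a substring)."""
    return [tok for tok in vocab if DIGITS.fullmatch(tok)]

def digit_runs(vocab):
    """Every run of digits anywhere in any entry, which is what grep -o '[0-9]+' prints."""
    runs = []
    for tok in vocab:
        runs.extend(DIGITS.findall(tok))
    return runs

# Hypothetical sample ("\u00a3" is the pound sign).
sample = ["[unused0]", "##2", "100th", "00pm", "\u00a310", "1066", "7", "the"]
print(plain_numbers(sample))  # ['1066', '7']
print(digit_runs(sample))     # ['0', '2', '100', '00', '10', '1066', '7']
```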

What about a bit of morphology?  BERT should get credit for knowing "look". But it shouldn't get credit for knowing four words just because it knows all the inflected forms of this regular verb.

In [15]:
grep '^look' < BERT-vocab.txt

looked
look
looking
looks
lookout
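
One crude way to stop giving credit for inflected forms is to fold each entry into a shorter entry that is also in the list. The sketch below does this with a deliberately naive, hypothetical suffix list; it is not a real stemmer and it will over-merge (for example, it would fold "as" into "a" if both were present), so treat it as an illustration of the idea rather than a way to get a trustworthy count.

```python
from collections import defaultdict

# Hypothetical suffix list for regular English inflection; deliberately naive.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, vocab_set):
    """Strip one inflectional suffix if what remains is itself in the vocabulary."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in vocab_set:
            return stem
    return word

def group_by_stem(vocab):
    """Map each stem to the list of entries that fold into it."""
    vocab_set = set(vocab)
    groups = defaultdict(list)
    for word in vocab:
        groups[naive_stem(word, vocab_set)].append(word)
    return dict(groups)

sample = ["looked", "look", "looking", "looks", "lookout"]
print(group_by_stem(sample))
# {'look': ['looked', 'look', 'looking', 'looks'], 'lookout': ['lookout']}
```

On this sample the five entries collapse to two groups, with "lookout" staying on its own because it is a compound rather than an inflected form.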
