# SCOWL wordlist

Create a definitive list of 'real' words that we can use across all our projects to decide whether a token counts as a word or a non-word.

The wordlist is based on the SCOWL (Spell Checker Oriented Word Lists) collection at http://wordlist.aspell.net/.

In [12]:
# Import necessary modules
import pandas as pd
import pickle as pkl
import glob

# Set preferred notebook format
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_columns', 999)

## Import necessary files
All the dictionary files are in the `SCOWL/final` folder from the SCOWL download (link above). Here we use glob to list them, check which ones to include, then concatenate those into a final .txt file, i.e. a single long list of words.

In [4]:
# Checking the files
path = "../../../Documents/SCOWL/final/" # Set path to wherever you have downloaded the SCOWL files
all_file_names = glob.glob(path+"*")
print(len(all_file_names))
all_file_names[:20]

347


['../../../Documents/SCOWL/final/english-words.20',
 '../../../Documents/SCOWL/final/british-words.80',
 '../../../Documents/SCOWL/final/canadian-words.10',
 '../../../Documents/SCOWL/final/english-abbreviations.20',
 '../../../Documents/SCOWL/final/canadian_variant_2-words.10',
 '../../../Documents/SCOWL/final/british_variant_1-words.20',
 '../../../Documents/SCOWL/final/australian-words.40',
 '../../../Documents/SCOWL/final/variant_1-words.80',
 '../../../Documents/SCOWL/final/australian_variant_1-contractions.50',
 '../../../Documents/SCOWL/final/variant_2-abbreviations.70',
 '../../../Documents/SCOWL/final/variant_1-proper-names.80',
 '../../../Documents/SCOWL/final/canadian-abbreviations.95',
 '../../../Documents/SCOWL/final/canadian_variant_1-proper-names.95',
 '../../../Documents/SCOWL/final/variant_1-words.20',
 '../../../Documents/SCOWL/final/australian-upper.50',
 '../../../Documents/SCOWL/final/british_variant_1-words.80',
 '../../../Documents/SCOWL/final/english-abbreviatio

In [5]:
# Strip the directory prefix and the 3-character size suffix (e.g. '.80')
file_types = set([x[len(path):-3] for x in all_file_names])
print(len(file_types))
file_types

66


{'american-abbreviations',
 'american-proper-names',
 'american-upper',
 'american-words',
 'australian-abbreviations',
 'australian-contractions',
 'australian-proper-names',
 'australian-upper',
 'australian-words',
 'australian_variant_1-abbreviations',
 'australian_variant_1-contractions',
 'australian_variant_1-proper-names',
 'australian_variant_1-upper',
 'australian_variant_1-words',
 'australian_variant_2-abbreviations',
 'australian_variant_2-contractions',
 'australian_variant_2-proper-names',
 'australian_variant_2-upper',
 'australian_variant_2-words',
 'british-abbreviations',
 'british-proper-names',
 'british-upper',
 'british-words',
 'british_variant_1-abbreviations',
 'british_variant_1-contractions',
 'british_variant_1-upper',
 'british_variant_1-words',
 'british_variant_2-abbreviations',
 'british_variant_2-contractions',
 'british_variant_2-upper',
 'british_variant_2-words',
 'british_z-abbreviations',
 'british_z-proper-names',
 'british_z-upper',
 'british_z-

In [6]:
file_categories = [x.split('-') for x in file_types]
file_categories = set([x[-1] for x in file_categories])
print(file_categories)

{'names', 'numerals', 'upper', 'hacker', 'contractions', 'words', 'abbreviations'}


There are 7 different categories, so we will look into each one manually, checking different regional varieties, to decide whether to include it in the final SCOWL list.
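As a rough sketch of how the file names break down, the helper below splits a name into the part before the first hyphen, the category after it, and the numeric suffix. The name `parse_scowl_name` is hypothetical and not part of this notebook, and it assumes every file follows the `<variety>-<category>.<size>` pattern seen in the listing above. In SCOWL the numeric suffix is a size level, with smaller numbers holding the more common words.

```python
import os

def parse_scowl_name(file_path):
    """Split a SCOWL file name into (variety, category, size).

    Hypothetical helper: assumes names look like 'british_variant_1-words.80'.
    """
    base = os.path.basename(file_path)       # drop the directory prefix
    stem, size = base.rsplit(".", 1)         # 'british_variant_1-words', '80'
    variety, category = stem.split("-", 1)   # split on the FIRST hyphen only
    return variety, category, int(size)

print(parse_scowl_name("../../../Documents/SCOWL/final/british_variant_1-words.80"))
# ('british_variant_1', 'words', 80)
print(parse_scowl_name("../../../Documents/SCOWL/final/special-roman-numerals.35"))
# ('special', 'roman-numerals', 35)
```

Splitting on the first hyphen keeps `proper-names` and `roman-numerals` intact, whereas taking the last hyphen-separated piece (as in the cell above) shortens them to `names` and `numerals`.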

#### Abbreviations

In [13]:
pd.read_csv(path+'american-abbreviations.70',names=['word'])
pd.read_csv(path+'australian-abbreviations.70',names=['word'])
pd.read_csv(path+'british-abbreviations.95',names=['word'])
pd.read_csv(path+'canadian-abbreviations.80',names=['word'])
pd.read_csv(path+'english-abbreviations.95',names=['word']) # The 'english' lists are the big ones

Unnamed: 0,word
0,bor
1,eq
2,ger
3,therap


Unnamed: 0,word
0,Ire
1,aet
2,archaeol
3,jour
4,palaeontol
5,ret


Unnamed: 0,word
0,Ire's
1,gynaecol


Unnamed: 0,word
0,sae


Unnamed: 0,word
0,AAAA
1,AAAAAA
2,AAAL
3,AAAS
4,AAE
...,...
5108,wrnt
5109,xd
5110,xdiv
5111,yrbk


My vote is to remove all of the abbreviations from our list - these are more likely to be misspellings than real words when used by the students.

#### contractions

In [14]:
# Weirdly, no list for US English - maybe just part of the big 'English' list
pd.read_csv(path+'australian-contractions.35',names=['word']) #hahaha love it
pd.read_csv(path+'british_variant_2-contractions.70',names=['word'])
pd.read_csv(path+'canadian_variant_1-contractions.35',names=['word']) #Interesting
pd.read_csv(path+'english-contractions.80',names=['word']).head(10) # The 'english' lists are the big ones

Unnamed: 0,word
0,g'day


Unnamed: 0,word
0,howe'er


Unnamed: 0,word
0,Qur'an


Unnamed: 0,word
0,a'body
1,a'thing
2,cowslip'd
3,d'Indy
4,d'accord
5,dog'sbane
6,entr'actes
7,fa'ard
8,fatwa'd
9,freez'd


Keep contractions. Probably not many occurrences, but shouldn't be corrected and are unlikely to be errors.

#### hacker (special hacker words!)

In [15]:
# Just one file
pd.read_csv(path+'special-hacker.50',names=['word']).head(20)
pd.read_csv(path+'special-hacker.50',names=['word']).tail(20)

Unnamed: 0,word
0,AFAIK
1,AI
2,AIDS
3,AIs
4,ANSI
5,ANSIs
6,ASCII
7,ASCIIs
8,Ada
9,Adas


Unnamed: 0,word
1629,zap
1630,zapped
1631,zapping
1632,zaps
1633,zen
1634,zenned
1635,zens
1636,zero
1637,zeroed
1638,zeroes


I'm in favour of keeping these but am ambivalent - there is a mix of abbreviations and real words. Even though we didn't include other abbreviations, these are all pretty common, and there may be technology-related words that do not appear in other lists.

#### names

In [16]:
pd.read_csv(path+'american-proper-names.80',names=['word']).sample(5)
len(pd.read_csv(path+'american-proper-names.80',names=['word']))

pd.read_csv(path+'australian-proper-names.50',names=['word']).sample(5)
len(pd.read_csv(path+'australian-proper-names.50',names=['word']))

pd.read_csv(path+'british-proper-names.95',names=['word']).sample(5)
len(pd.read_csv(path+'british-proper-names.95',names=['word']))

pd.read_csv(path+'canadian-proper-names.80',names=['word']).sample(5)
len(pd.read_csv(path+'canadian-proper-names.80',names=['word']))

pd.read_csv(path+'english-proper-names.40',names=['word'])
len(pd.read_csv(path+'english-proper-names.40',names=['word']))

Unnamed: 0,word
251,Timonizes
148,Philippized
192,Romanizers
218,Septembrizers
101,Madera


268

Unnamed: 0,word
54,Kingaroy's
82,Nowra's
63,Mildura
113,Yeppoon
74,Morwell's


115

Unnamed: 0,word
438,Saeed
462,Sinae
56,Babelise
192,Goetae's
22,Anchinoae's


520

Unnamed: 0,word
39,Goetz
98,Londonizes
112,Mohammedanizing
120,Mongolizing's
45,Guebre


268

Unnamed: 0,word
0,Irisher
1,Terr
2,Terr's


3

Definitely keep - proper names are an important element lacking in some other word lists, and our POS tagging definitely hasn't caught all the proper nouns in our texts. We will need to remove the possessives (reducing the lists significantly). There are some words here which are not names, e.g. verbs, but that's fine.

#### numerals

In [17]:
#Just one file

pd.read_csv(path+'special-roman-numerals.35',names=['word']).sample(5)
len(pd.read_csv(path+'special-roman-numerals.35',names=['word']))

Unnamed: 0,word
55,xxxii
25,vii
16,lvi
20,lxiv
24,vi


63

Keep roman numerals - we may end up with a few false negatives where it's actually a spelling mistake we miss, but better that than accidentally correcting an intended roman numeral.

#### upper

In [18]:
pd.read_csv(path+'american-upper.95',names=['word']).sample(5)
len(pd.read_csv(path+'american-upper.95',names=['word']))

pd.read_csv(path+'australian-upper.80',names=['word']).sample(5)
len(pd.read_csv(path+'australian-upper.80',names=['word']))

pd.read_csv(path+'british_z-upper.50',names=['word']).sample(5)
len(pd.read_csv(path+'british_z-upper.50',names=['word']))

pd.read_csv(path+'canadian_variant_2-upper.40',names=['word']).sample(5)
len(pd.read_csv(path+'canadian_variant_2-upper.40',names=['word']))

pd.read_csv(path+'english-upper.10',names=['word']).sample(5)
len(pd.read_csv(path+'english-upper.10',names=['word']))

Unnamed: 0,word
14,Italianization's
26,Teutonization's
15,Japanization's
13,Islamization's
16,Judaization's


27

Unnamed: 0,word
164,Prussianising
140,Nabataeans
71,Grecised
136,Manichaeanism
155,Phocaean


189

Unnamed: 0,word
3,Americanize
15,Manilla's
0,Americanization
18,Roumania's
10,Baeyer's


21

Unnamed: 0,word
5,Presidents
0,Cabinet
3,President
6,Senator
4,President's


7

Unnamed: 0,word
3,Congress
8,French
0,American
10,I
12,Mister


13

Not sure how these are different than the 'name' category, but we can definitely keep.
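One way to probe how two categories differ is to compare them as sets. The sketch below is only an illustration: `compare_wordlists` is a hypothetical helper, and the two sample lists are toy data, not the real contents of any SCOWL file.

```python
def compare_wordlists(first, second):
    """Return the words shared by two lists and the words unique to each."""
    first, second = set(first), set(second)
    return {
        "shared": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }

# Toy data for illustration only (real membership not checked)
upper_sample = ["Congress", "French", "Mister"]
names_sample = ["Irisher", "Terr", "French"]

result = compare_wordlists(upper_sample, names_sample)
print(result["shared"])       # ['French']
print(result["only_first"])   # ['Congress', 'Mister']
print(result["only_second"])  # ['Irisher', 'Terr']
```

Running the same comparison on the full `upper` and `proper-names` files would show whether the two categories overlap at all.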

#### words

In [19]:
pd.read_csv(path+'american-words.95',names=['word']).sample(5) # All the 'z' spellings - good
len(pd.read_csv(path+'american-words.95',names=['word']))

pd.read_csv(path+'australian_variant_2-words.40',names=['word']).sample(5)
len(pd.read_csv(path+'australian_variant_2-words.40',names=['word']))

pd.read_csv(path+'british-words.55',names=['word']).sample(5)
len(pd.read_csv(path+'british-words.55',names=['word']))

pd.read_csv(path+'canadian-words.70',names=['word']).sample(5)
len(pd.read_csv(path+'canadian-words.70',names=['word']))

pd.read_csv(path+'english-words.80',names=['word'], encoding = 'latin-1').sample(5) # Big files
len(pd.read_csv(path+'english-words.80',names=['word'], encoding = 'latin-1'))

pd.read_csv(path+'variant_3-words.60',names=['word']).sample(5) # Generic 'variant' whatever that is!
len(pd.read_csv(path+'variant_3-words.60',names=['word']))

Unnamed: 0,word
2579,unconventionalize
823,feudalization's
2453,theaterless
1318,mentalize
319,cephalization's


2946

Unnamed: 0,word
0,anaesthetize
115,welched
85,pigmy
103,skillfully
26,consortiums


127

Unnamed: 0,word
70,denationalise
196,sodomises
42,colourise
151,paedophilia
13,behaviourists


219

Unnamed: 0,word
917,parasitize
1209,territorialize
276,crenelled
1233,tricoloured
818,mutualize


1316

Unnamed: 0,word
74087,naturalisms
37704,epinicion
138472,zax
5157,anureses
126993,tzaddiqs


139223

Unnamed: 0,word
1,coenobite
30,reflexion's
18,hydrolyse
23,merchandizers
7,cyders


33

These are the main lists that need to be included.

#### Summary of checking:
Keep all except the abbreviation files.

In [20]:
exclude = [x for x in all_file_names if 'abbreviations' in x]
include = [x for x in all_file_names if 'abbreviations' not in x]

In [22]:
len(all_file_names)
len(exclude)
len(include)

347

51

296

## Compiling files
#### Reading in SCOWL files

In [23]:
%%capture

# Read in and combine all of our text files

with open("SCOWL_wordlist.txt", "wb") as outfile:
    for f in include:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
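An alternative to concatenating the raw bytes is to decode each file as it is read and collect the words straight into a set. This is a minimal sketch, not the method used in this notebook: `read_words` is a hypothetical helper, and it assumes the files are latin-1 encoded, which is how they are read elsewhere in this notebook.

```python
def read_words(file_paths, encoding="latin-1"):
    """Collect the distinct non-blank lines from several word list files."""
    words = set()
    for file_path in file_paths:
        with open(file_path, "r", encoding=encoding) as infile:
            for line in infile:
                word = line.strip()
                if word:  # skip blank lines
                    words.add(word)
    return words
```

Calling `read_words(include)` would then replace the concatenate, re-read and de-duplicate steps. Reading line by line also avoids two words being glued together if a file does not end with a newline.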

In [24]:
# Read in the new SCOWL_wordlist to manipulate
with open("SCOWL_wordlist.txt", "r", encoding = 'latin-1') as text_file:
    SCOWL = text_file.read().split('\n')
SCOWL[:10]
len(SCOWL)
# Output = a single tokenized list

['aardvark',
 'abandon',
 'abandoned',
 'abandoning',
 'abandons',
 'abbreviate',
 'abbreviated',
 'abbreviates',
 'abbreviating',
 'abbreviation']

726845

In [27]:
# Turn into set to save space and remove duplicates
SCOWL = set(SCOWL)
len(SCOWL)

# Removed ~59k duplicate items

668130

In [17]:
'poutine' in SCOWL

# Good - the Canadian dict worked!

True

#### Remove any blanks

In [29]:
SCOWL = [x for x in SCOWL if len(x) > 0]
len(SCOWL)

#1 blank removed

668129

#### 's
From earlier, we know that there are a lot of possessives of nouns. Because our tokenization splits off the 's, we don't need these in the list.

In [19]:
"Ben's" in SCOWL
"Ben" in SCOWL

True

True

In [31]:
[x for x in SCOWL if x.endswith("'s")][:10]
len([x for x in SCOWL if x.endswith("'s")])

# Interesting proper names!

["tantarara's",
 "Nafud's",
 "Pallaton's",
 "Achakzai's",
 "digitiser's",
 "cumfrey's",
 "bugloss's",
 "Wake's",
 "Cree's",
 "cupping's"]

147872

In [33]:
# Remove these items
SCOWL = {x for x in SCOWL if not x.endswith("'s")}
len(SCOWL)

520257

#### Capitalization
It is probably simpler to have everything lowercase and remove the resulting duplicates, as we often lowercase everything anyway.

In [35]:
SCOWL = set([x.lower() for x in SCOWL])
len(SCOWL)

497551

The final list contains 497,551 words, just under 500k.
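The three cleaning steps above (dropping blanks, dropping entries that end in 's, lowercasing) can be gathered into one function so they are easy to re-run on a fresh SCOWL download. This is a sketch only; `clean_wordlist` is a hypothetical name and the sample list is made up for illustration.

```python
def clean_wordlist(words):
    """Drop blanks and entries ending in 's, lowercase the rest, de-duplicate."""
    cleaned = set()
    for word in words:
        if not word:
            continue  # blank entry
        if word.endswith("'s"):
            continue  # possessive (also catches contractions such as "it's")
        cleaned.add(word.lower())
    return sorted(cleaned)

sample = ["Ben", "Ben's", "ben", "", "g'day", "it's", "Qur'an"]
print(clean_wordlist(sample))
# ['ben', "g'day", "qur'an"]
```

Note that the 's filter cannot tell possessives from contractions, so an entry like "it's" is removed as well. That is consistent with a tokenizer that splits off 's, but worth keeping in mind if the tokenization changes.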

## Write out SCOWL_condensed.txt

In [37]:
#Change SCOWL back to a sorted list
SCOWL = sorted(SCOWL)

In [39]:
# Pickle it for convenience
with open('SCOWL_condensed.pkl', 'wb') as f:
    pkl.dump(SCOWL, f)

In [40]:
%%capture

# Create a text file - takes a while
# Write with an explicit encoding so the file reads the same on every platform
with open('SCOWL_condensed.txt', 'w', encoding='utf-8') as f:
    for item in SCOWL:
        f.write("%s\n" % item)
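For downstream projects, a lookup along the following lines may be convenient. It is a minimal sketch under stated assumptions: `load_wordlist` and `is_word` are hypothetical helpers, and the text file is assumed to be UTF-8 with one word per line (adjust `encoding` if it was written with a different platform default).

```python
def load_wordlist(txt_path, encoding="utf-8"):
    """Load a one-word-per-line text file into a set for fast lookups."""
    with open(txt_path, "r", encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}

def is_word(token, wordlist):
    """Check a token against the lowercased wordlist."""
    return token.lower() in wordlist
```

For example, `wordlist = load_wordlist('SCOWL_condensed.txt')` followed by `is_word('Poutine', wordlist)`. A set is used rather than the sorted list because membership tests on a set take roughly constant time, which matters with about 500k entries.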

[Back to top](#SCOWL-wordlist)