# SCOWL wordlist

Create a definitive list of 'real' words that we can use across all our projects for deciding what to consider a word/non-word.

The wordlist is based on the SCOWL set of word lists at http://wordlist.aspell.net/.

In [1]:
# Importing necessary modules

import pandas as pd
import pickle as pkl
import glob

## Import necessary files
All the dictionary files are in the `SCOWL/final` folder from the SCOWL download (link above). Here we check them to see which ones to include, then use glob to create a final .txt file, i.e. a single long list of words.

In [2]:
# Checking the files
path = "../../../Documents/SCOWL/final/" # Set path to wherever you have downloaded the SCOWL files
all_file_names = glob.glob(path+"*")
print(len(all_file_names))
all_file_names[:20]

347


['../../../Documents/SCOWL/final/english-words.20',
 '../../../Documents/SCOWL/final/british-words.80',
 '../../../Documents/SCOWL/final/canadian-words.10',
 '../../../Documents/SCOWL/final/english-abbreviations.20',
 '../../../Documents/SCOWL/final/canadian_variant_2-words.10',
 '../../../Documents/SCOWL/final/british_variant_1-words.20',
 '../../../Documents/SCOWL/final/australian-words.40',
 '../../../Documents/SCOWL/final/variant_1-words.80',
 '../../../Documents/SCOWL/final/australian_variant_1-contractions.50',
 '../../../Documents/SCOWL/final/variant_2-abbreviations.70',
 '../../../Documents/SCOWL/final/variant_1-proper-names.80',
 '../../../Documents/SCOWL/final/canadian-abbreviations.95',
 '../../../Documents/SCOWL/final/canadian_variant_1-proper-names.95',
 '../../../Documents/SCOWL/final/variant_1-words.20',
 '../../../Documents/SCOWL/final/australian-upper.50',
 '../../../Documents/SCOWL/final/british_variant_1-words.80',
 '../../../Documents/SCOWL/final/english-abbreviatio

In [3]:
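# Strip the 31-character directory prefix (the `path` above) and the
# 3-character size suffix (e.g. '.20') to leave just the list type
file_types = set([x[31:-3] for x in all_file_names])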
print(len(file_types))
file_types

66


{'american-abbreviations',
 'american-proper-names',
 'american-upper',
 'american-words',
 'australian-abbreviations',
 'australian-contractions',
 'australian-proper-names',
 'australian-upper',
 'australian-words',
 'australian_variant_1-abbreviations',
 'australian_variant_1-contractions',
 'australian_variant_1-proper-names',
 'australian_variant_1-upper',
 'australian_variant_1-words',
 'australian_variant_2-abbreviations',
 'australian_variant_2-contractions',
 'australian_variant_2-proper-names',
 'australian_variant_2-upper',
 'australian_variant_2-words',
 'british-abbreviations',
 'british-proper-names',
 'british-upper',
 'british-words',
 'british_variant_1-abbreviations',
 'british_variant_1-contractions',
 'british_variant_1-upper',
 'british_variant_1-words',
 'british_variant_2-abbreviations',
 'british_variant_2-contractions',
 'british_variant_2-upper',
 'british_variant_2-words',
 'british_z-abbreviations',
 'british_z-proper-names',
 'british_z-upper',
 'british_z-

In [4]:
file_categories = [x.split('-') for x in file_types]
file_categories = set([x[-1] for x in file_categories])
print(file_categories)

{'words', 'hacker', 'abbreviations', 'contractions', 'names', 'numerals', 'upper'}


There are 7 different categories, so we look at each one manually, checking the different regional varieties, to decide whether to include it in the final SCOWL list.
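The fixed-position slice used above depends on the exact length of the download path. As a minimal sketch, the same information can be pulled from the file name alone. The labels `spelling`, `category` and `size` are names chosen here for illustration, not terms taken from the notebook. Note that this version keeps `proper-names` and `roman-numerals` whole, whereas the cell above shortens them to `names` and `numerals`.

```python
import os
from collections import Counter

def parse_scowl_name(file_path):
    """Split a SCOWL file path into (spelling, category, size).

    The three labels are illustrative names, not official SCOWL terms.
    """
    base = os.path.basename(file_path)        # e.g. 'british_variant_1-words.80'
    stem, size = base.rsplit(".", 1)          # 'british_variant_1-words', '80'
    spelling, category = stem.split("-", 1)   # 'british_variant_1', 'words'
    return spelling, category, int(size)

# A few paths copied from the listing above
sample = [
    "../../../Documents/SCOWL/final/english-words.20",
    "../../../Documents/SCOWL/final/variant_1-proper-names.80",
    "../../../Documents/SCOWL/final/special-roman-numerals.35",
]
parsed = [parse_scowl_name(p) for p in sample]

# Count how many sample files fall into each category
category_counts = Counter(category for _, category, _ in parsed)
```

Because the parsing works on the base name, it should give the same result wherever the SCOWL folder is stored.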

#### Abbreviations

In [5]:
pd.read_csv(path+'american-abbreviations.70',names=['word'])
pd.read_csv(path+'australian-abbreviations.70',names=['word'])
pd.read_csv(path+'british-abbreviations.95',names=['word'])
pd.read_csv(path+'canadian-abbreviations.80',names=['word'])
pd.read_csv(path+'english-abbreviations.95',names=['word']) # The 'english' lists are the big ones

Unnamed: 0,word
0,AAAA
1,AAAAAA
2,AAAL
3,AAAS
4,AAE
...,...
5108,wrnt
5109,xd
5110,xdiv
5111,yrbk


My vote is to remove all of the abbreviations from our list - these are more likely to be misspellings than real words when used by the students.

#### contractions

In [6]:
# Weirdly, no list for US English - maybe just part of the big 'English' list
pd.read_csv(path+'australian-contractions.35',names=['word']) #hahaha love it
pd.read_csv(path+'british_variant_2-contractions.70',names=['word'])
pd.read_csv(path+'canadian_variant_1-contractions.35',names=['word']) #Interesting
pd.read_csv(path+'english-contractions.80',names=['word']).head(10) # The 'english' lists are the big ones

Unnamed: 0,word
0,a'body
1,a'thing
2,cowslip'd
3,d'Indy
4,d'accord
5,dog'sbane
6,entr'actes
7,fa'ard
8,fatwa'd
9,freez'd


Keep contractions. Probably not many occurrences, but shouldn't be corrected and are unlikely to be errors.

#### hacker (special hacker words!)

In [7]:
# Just one file
pd.read_csv(path+'special-hacker.50',names=['word']).head(20)
pd.read_csv(path+'special-hacker.50',names=['word']).tail(20)

Unnamed: 0,word
1629,zap
1630,zapped
1631,zapping
1632,zaps
1633,zen
1634,zenned
1635,zens
1636,zero
1637,zeroed
1638,zeroes


I'm in favour of keeping these but am ambivalent - there is a mix of abbreviations and real words. Even though we didn't include the other abbreviations, these are all pretty common, and there may be technology-related words here that do not appear in the other lists.

#### names

In [8]:
pd.read_csv(path+'american-proper-names.80',names=['word']).sample(5)
len(pd.read_csv(path+'american-proper-names.80',names=['word']))

pd.read_csv(path+'australian-proper-names.50',names=['word']).sample(5)
len(pd.read_csv(path+'australian-proper-names.50',names=['word']))

pd.read_csv(path+'british-proper-names.95',names=['word']).sample(5)
len(pd.read_csv(path+'british-proper-names.95',names=['word']))

pd.read_csv(path+'canadian-proper-names.80',names=['word']).sample(5)
len(pd.read_csv(path+'canadian-proper-names.80',names=['word']))

pd.read_csv(path+'english-proper-names.40',names=['word'])
len(pd.read_csv(path+'english-proper-names.40',names=['word']))

3

Definitely keep - proper names are one important element lacking in some other word lists, and our POS tagging definitely hasn't caught all the proper nouns in our texts. We will need to remove the possessives (reducing the lists significantly). There are some words here which are not names, e.g. verbs, but that's fine.

#### numerals

In [9]:
#Just one file

pd.read_csv(path+'special-roman-numerals.35',names=['word']).sample(5)
len(pd.read_csv(path+'special-roman-numerals.35',names=['word']))

63

Keep Roman numerals - we may end up with a few false negatives, where we miss an actual spelling mistake, but that is better than accidentally correcting an intended Roman numeral.

#### upper

In [10]:
pd.read_csv(path+'american-upper.95',names=['word']).sample(5)
len(pd.read_csv(path+'american-upper.95',names=['word']))

pd.read_csv(path+'australian-upper.80',names=['word']).sample(5)
len(pd.read_csv(path+'australian-upper.80',names=['word']))

pd.read_csv(path+'british_z-upper.50',names=['word']).sample(5)
len(pd.read_csv(path+'british_z-upper.50',names=['word']))

pd.read_csv(path+'canadian_variant_2-upper.40',names=['word']).sample(5)
len(pd.read_csv(path+'canadian_variant_2-upper.40',names=['word']))

pd.read_csv(path+'english-upper.10',names=['word']).sample(5)
len(pd.read_csv(path+'english-upper.10',names=['word']))

13

Not sure how these differ from the 'names' category, but we can definitely keep them.

#### words

In [11]:
pd.read_csv(path+'american-words.95',names=['word']).sample(5) # All the 'z' spellings - good
len(pd.read_csv(path+'american-words.95',names=['word']))

pd.read_csv(path+'australian_variant_2-words.40',names=['word']).sample(5)
len(pd.read_csv(path+'australian_variant_2-words.40',names=['word']))

pd.read_csv(path+'british-words.55',names=['word']).sample(5)
len(pd.read_csv(path+'british-words.55',names=['word']))

pd.read_csv(path+'canadian-words.70',names=['word']).sample(5)
len(pd.read_csv(path+'canadian-words.70',names=['word']))

pd.read_csv(path+'english-words.80',names=['word'], encoding = 'latin-1').sample(5) # Big files
len(pd.read_csv(path+'english-words.80',names=['word'], encoding = 'latin-1'))

pd.read_csv(path+'variant_3-words.60',names=['word']).sample(5) # Generic 'variant' whatever that is!
len(pd.read_csv(path+'variant_3-words.60',names=['word']))

33

These are the main lists that need to be included.

#### Summary of checking:
Keep all except the abbreviation files.

In [12]:
exclude = [x for x in all_file_names if 'abbreviations' in x]
include = [x for x in all_file_names if 'abbreviations' not in x]

In [13]:
len(all_file_names)
len(exclude)
len(include)

296
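One caveat with the substring test above: it looks at the whole path, so a parent folder with 'abbreviations' in its name would put every file in the exclude list. A minimal sketch that checks only the file name is shown below. The helper name `is_abbreviation_file` and the demo paths are made up for illustration.

```python
import os

def is_abbreviation_file(file_path):
    """True if the file name (not its directory) marks an abbreviations list."""
    base = os.path.basename(file_path)   # drop the directory part
    stem = base.rsplit(".", 1)[0]        # drop the size suffix, e.g. '.20'
    return stem.endswith("-abbreviations")

# Hypothetical paths whose parent folder happens to contain 'abbreviations'
paths = [
    "abbreviations_project/final/english-words.20",
    "abbreviations_project/final/english-abbreviations.20",
    "abbreviations_project/final/canadian-proper-names.10",
]
exclude_demo = [p for p in paths if is_abbreviation_file(p)]
include_demo = [p for p in paths if not is_abbreviation_file(p)]

# The two lists should always partition the input
partition_ok = len(exclude_demo) + len(include_demo) == len(paths)
```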

## Compiling files
#### Reading in SCOWL files

In [14]:
%%capture

# Read in and combine all of our text files

with open("SCOWL_wordlist.txt", "wb") as outfile:
    for f in include:
        with open(f, "rb") as infile:
            outfile.write(infile.read())

In [15]:
# Read in the new SCOWL_wordlist to manipulate
# Use a context manager so the file is closed once it has been read
with open("SCOWL_wordlist.txt", "r", encoding = 'latin-1') as text_file:
    SCOWL = text_file.read().split('\n')
SCOWL[:10]
len(SCOWL)
# Output = a single tokenized list

726845

In [16]:
# Turn into set to save space and remove duplicates
SCOWL = set(SCOWL)
len(SCOWL)

# Removed roughly 59k duplicate items (726,845 -> 668,130)

668130

In [17]:
'poutine' in SCOWL

# Good - the Canadian dict worked!

True

#### Remove any blanks

In [18]:
SCOWL = [x for x in SCOWL if len(x) > 0]
len(SCOWL)

#1 blank removed

668129
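The read, de-duplicate and blank-removal steps above could also be done in a single pass without the intermediate file. The sketch below is one way to do that, not the method used in this notebook. It also sidesteps a possible edge case of byte-level concatenation: if a source file did not end with a newline, its last word would fuse with the first word of the next file. Whether any SCOWL file lacks a trailing newline was not checked here. The function name `load_words` and the demo files are hypothetical.

```python
import os
import tempfile

def load_words(file_paths, encoding="latin-1"):
    """Read every file line by line into one de-duplicated set of words."""
    words = set()
    for file_path in file_paths:
        with open(file_path, "r", encoding=encoding) as infile:
            # strip() drops the line ending and any surrounding whitespace
            words.update(line.strip() for line in infile)
    words.discard("")  # remove the blank entry, if any
    return words

# Demo with two small stand-in files rather than the real SCOWL download
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "demo-words.10")
    second = os.path.join(tmp, "demo-words.20")
    with open(first, "w", encoding="latin-1") as f:
        f.write("apple\nbanana")       # deliberately no trailing newline
    with open(second, "w", encoding="latin-1") as f:
        f.write("banana\ncafé\n")
    demo_words = load_words([first, second])
```

In the demo, 'banana' appears once in the result even though the first file ends without a newline.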

#### 's
From earlier, we know that the lists contain a lot of possessive forms of nouns. Because our tokenization splits off the 's, we don't need these in this list as well.

In [19]:
"Ben's" in SCOWL
"Ben" in SCOWL

True

In [20]:
[x for x in SCOWL if x[-2:] == "'s"][:10]
len([x for x in SCOWL if x[-2:] == "'s"])

# Interesting proper names!

147872

In [21]:
# Remove these items
SCOWL = set([x for x in SCOWL if x[-2:] != "'s"])
len(SCOWL)

520257

The final list contains approximately 520k words (520,257).
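The possessive filter above can be written as a small helper, sketched here on a made-up sample. One thing it makes visible is that the rule matches on the ending only, so any contraction ending in 's (such as "it's") would be dropped along with the possessives. Whether such entries occur in the SCOWL contraction files was not checked here. The helper name `strip_possessives` is hypothetical.

```python
def strip_possessives(words):
    """Return the words that do not end in 's, as a set."""
    return {w for w in words if not w.endswith("'s")}

# Made-up sample: a name, its possessive, a contraction in 's,
# a plain word ending in 's', and a contraction in 'd
demo = {"Ben", "Ben's", "it's", "boss", "fatwa'd"}
kept = strip_possessives(demo)
```

Given that the tokenization already splits off the 's, losing these contractions should be harmless, but it is worth knowing that the filter is broader than possessives alone.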

## Write out SCOWL.txt

In [22]:
#Change SCOWL back to a sorted list
SCOWL = sorted(list(SCOWL))

In [23]:
# Pickle it for convenience
with open('SCOWL_condensed.pkl', 'wb') as f:
    pkl.dump(SCOWL, f)

In [24]:
%%capture

# Create a text file - takes a while
with open('SCOWL_condensed.txt', 'w') as f:
    for item in SCOWL:
        f.write("%s\n" % item)
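As a rough sketch of how the saved list might be used for the word/non-word decision, the pickle can be loaded back into a set (membership tests on a set are much faster than on a list) and wrapped in a small lookup. The helper names and the rule of also trying the lower-cased token are assumptions made for illustration, not something decided in this notebook.

```python
import os
import pickle
import tempfile

def load_wordlist(pickle_path):
    """Load the pickled word list and return it as a set for fast lookups."""
    with open(pickle_path, "rb") as f:
        return set(pickle.load(f))

def is_known_word(token, wordlist):
    """Accept a token as written or lower-cased (hypothetical matching rule)."""
    return token in wordlist or token.lower() in wordlist

# Demo with a tiny stand-in for SCOWL_condensed.pkl
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_wordlist.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump(sorted(["Ben", "poutine", "zap"]), f)
    demo_wordlist = load_wordlist(demo_path)

checks = {tok: is_known_word(tok, demo_wordlist)
          for tok in ["Poutine", "ben", "zapp"]}
```

With this rule a sentence-initial 'Poutine' is accepted, while 'ben' is not, because only the capitalised form is in the list.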

## Names experiment
Creating a list of names to add to our 'safe' list when spell checking

Names from 1990 US census data, collected by https://pypi.org/project/names/

In [25]:
path = "../../../Documents/names_dataset/names/" # Set path to wherever you have downloaded the names files

In [26]:
male_first = pd.read_csv(path+'dist.male.first',names=['male_first'])
print(len(male_first))
male_first.head()

# 1219 male first names

1219


Unnamed: 0,male_first
0,JAMES 3.318 3.318 1
1,JOHN 3.271 6.589 2
2,ROBERT 3.143 9.732 3
3,MICHAEL 2.629 12.361 4
4,WILLIAM 2.451 14.812 5


In [27]:
female_first = pd.read_csv(path+'dist.female.first',names=['female_first'])
print(len(female_first))
female_first.head()

# 4275 female first names

4275


Unnamed: 0,female_first
0,MARY 2.629 2.629 1
1,PATRICIA 1.073 3.702 2
2,LINDA 1.035 4.736 3
3,BARBARA 0.980 5.716 4
4,ELIZABETH 0.937 6.653 5


In [28]:
last_name = pd.read_csv(path+'dist.all.last',names=['last_name'])
print(len(last_name))
last_name.head()

# 88799 last names

88799


Unnamed: 0,last_name
0,SMITH 1.006 1.006 1
1,JOHNSON 0.810 1.816 2
2,WILLIAMS 0.699 2.515 3
3,JONES 0.621 3.136 4
4,BROWN 0.621 3.757 5


In [29]:
# Combining all these names into one list
male_first_list = [x[0] for x in male_first.male_first.apply(lambda x: x.split(' '))]
female_first_list = [x[0] for x in female_first.female_first.apply(lambda x: x.split(' '))]
last_name_list = [x[0] for x in last_name.last_name.apply(lambda x: x.split(' '))]
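Each census line holds a name followed by three numbers. A minimal sketch of parsing a line into typed fields is shown below, in case the numbers are wanted later (for example, to keep only the more common names). The field labels `freq`, `cum_freq` and `rank` are an interpretation of the sample rows above, not something stated in this notebook.

```python
def parse_census_line(line):
    """Split one census line into (name, freq, cum_freq, rank).

    The field labels are assumed from the sample rows shown above.
    """
    # split() with no argument copes with any run of spaces between fields
    name, freq, cum_freq, rank = line.split()
    return name, float(freq), float(cum_freq), int(rank)

# Rows copied from the outputs above
sample_lines = [
    "JAMES 3.318 3.318 1",
    "JOHN 3.271 6.589 2",
    "MARY 2.629 2.629 1",
]
records = [parse_census_line(line) for line in sample_lines]
names_only = [name for name, _, _, _ in records]
```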

In [30]:
all_names = set(male_first_list + female_first_list + last_name_list)

In [31]:
len(all_names)

91910

In [32]:
%%capture

# Create a text file
with open('all_names.txt', 'w') as f:
    for item in all_names:
        f.write("%s\n" % item)
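Unlike the SCOWL list, `all_names` is written straight from a set, so the line order in the file can differ between runs. If a stable file is wanted (for example, to diff two versions), sorting before writing is one option. The sketch below writes to an in-memory buffer for demonstration, and the helper name `write_wordlist` is hypothetical.

```python
import io

def write_wordlist(words, handle):
    """Write one word per line, in sorted order, to an open text handle."""
    for word in sorted(words):
        handle.write(f"{word}\n")

# Demo with an in-memory buffer instead of a file on disk
buffer = io.StringIO()
write_wordlist({"SMITH", "JAMES", "MARY"}, buffer)
lines = buffer.getvalue().splitlines()
```

The same function works with a real file handle, e.g. one opened with `open('all_names.txt', 'w')`.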

[Back to top](#SCOWL-wordlist)