# Stopwords and Filtering

## Using stopwords

In natural language processing, we frequently work with a list of stopwords: words that occur very often in almost any text in a language. We might want to exclude the words on this list from our larger body of texts before analysis, add to the list, or use only the words on our stopword list as part of our project.

NLTK has a handy way of pulling all those stopwords into your program.

In [1]:
import nltk
print(nltk.corpus.stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

The words here are probably unsurprising: prepositions, pronouns, and similar words that show up frequently across English-language corpora. In most cases you will want to supplement such a list with your own words to create a stopwords list that makes sense for your corpus. If you work with early modern texts, for example, you probably have a whole range of words and phrases particular to your period that you would want to keep in mind. NLTK comes with lists for a range of different languages, and there are other options available [online](https://github.com/Alir3z4/stop-words). [CLTK](http://cltk.org/), a toolkit similar to NLTK for working with ancient languages, comes with its own lists as well.

It's common in NLP to use lists like these to create filters for your text. Let's take a piece of the first chunk of Jacob's Room, using the get_chunks() function we developed in the unit on [dividing your text](dividing.ipynb):

In [2]:
import math
import nltk

def get_chunks(text, num_chunks):
    text_length = len(text)
    text_chunks = []
    number_of_chunks = num_chunks
    for i in range(number_of_chunks):
        chunk_size = text_length/number_of_chunks
        chunk_start = math.floor(chunk_size * i)
        chunk_end = math.floor(chunk_size * (i +1))
        text_chunks.append(text[chunk_start:chunk_end])
    return text_chunks

filename = 'corpus/woolf/1922_jacobs_room.txt'
with open(filename, 'r') as fin:
    raw_text = fin.read()

chunked_text = get_chunks(raw_text, 100)
tokenized_text = [nltk.word_tokenize(chunk) for chunk in chunked_text]
tokenized_text[0][:50]

['CHAPTER',
 'ONE',
 "''",
 'So',
 'of',
 'course',
 ',',
 "''",
 'wrote',
 'Betty',
 'Flanders',
 ',',
 'pressing',
 'her',
 'heels',
 'rather',
 'deeper',
 'in',
 'the',
 'sand',
 ',',
 '``',
 'there',
 'was',
 'nothing',
 'for',
 'it',
 'but',
 'to',
 'leave',
 '.',
 "''",
 'Slowly',
 'welling',
 'from',
 'the',
 'point',
 'of',
 'her',
 'gold',
 'nib',
 ',',
 'pale',
 'blue',
 'ink',
 'dissolved',
 'the',
 'full',
 'stop',
 ';']

We can use our stopwords list from above to filter out common words. This code removes stopwords from the first chunk and prints the first 50 words.

In [3]:
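# load the stopword list once, rather than once for every token
english_stopwords = nltk.corpus.stopwords.words('english')
filtered_chunk = [token for token in tokenized_text[0] if token not in english_stopwords]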
print(filtered_chunk[:50])

['CHAPTER', 'ONE', "''", 'So', 'course', ',', "''", 'wrote', 'Betty', 'Flanders', ',', 'pressing', 'heels', 'rather', 'deeper', 'sand', ',', '``', 'nothing', 'leave', '.', "''", 'Slowly', 'welling', 'point', 'gold', 'nib', ',', 'pale', 'blue', 'ink', 'dissolved', 'full', 'stop', ';', 'pen', 'stuck', ';', 'eyes', 'fixed', ',', 'tears', 'slowly', 'filled', '.', 'The', 'entire', 'bay', 'quivered', ';']


Notice the words that are gone now: 'of', 'her', 'in', 'the'. Filtering stopwords often gives us a sense of the words that are more likely to be meaningful for a particular text. Here, we get character names, as well as vocabulary that might be thought of as more particular to Woolf. These words reflect the content of her text: we get a sense of what she is writing about and how she is describing it. You could imagine comparing this list of vocabulary to that of another author and finding it to be quite different!
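
Notice also that 'So' and 'The' survive in the filtered output above. The stopwords printed earlier are all lowercase, and the comparison is case-sensitive. A minimal sketch of one way to handle this, assuming you want capitalized stopwords removed as well, is to compare the lowercase form of each token against a set. The `filter_stopwords` helper and the small `sample_stopwords` set are hypothetical stand-ins so the example runs without NLTK; in practice you would pass in the NLTK list.

```python
# A tiny hand-picked stopword set, standing in for NLTK's list (assumption).
sample_stopwords = {'so', 'of', 'her', 'in', 'the', 'there', 'was', 'for', 'it', 'but', 'to'}

def filter_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is a stopword; keep the original casing of the rest."""
    stopword_set = set(stopwords)  # set membership checks are faster than list checks
    return [token for token in tokens if token.lower() not in stopword_set]

# A few tokens taken from the first chunk shown above.
sample_tokens = ['So', 'of', 'course', 'wrote', 'Betty', 'Flanders',
                 'The', 'entire', 'bay', 'quivered']

filtered_sample = filter_stopwords(sample_tokens, sample_stopwords)
print(filtered_sample)
```

Whether to fold case is a research decision: lowercasing before comparison removes sentence-initial stopwords, but it would also remove a capitalized word that happens to match a stopword.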

## Adding to your stopwords list

In the previous example, even though we filtered out common words, we still have quite a lot of unwanted tokens - punctuation, for example. These grammatical markings are often filtered out in much the same way, by expanding our stopwords list. There are a couple of different ways to do this. The simplest draws upon Python's built-in string module, which includes a string containing the common ASCII punctuation characters:

In [4]:
import string
string.punctuation

# create an initial stopword list by loading in the one from nltk

stopword_list = nltk.corpus.stopwords.words('english')

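# add punctuation to that list and print out our new list. extend() walks
# through string.punctuation one character at a time, so each mark is added
# as its own entry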

stopword_list.extend(string.punctuation)
print(stopword_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

We can use this expanded list to filter again and get a more refined list.

In [5]:
filtered_chunk = [token for token in tokenized_text[0] if token not in stopword_list]
print(filtered_chunk[:50])

['CHAPTER', 'ONE', "''", 'So', 'course', "''", 'wrote', 'Betty', 'Flanders', 'pressing', 'heels', 'rather', 'deeper', 'sand', '``', 'nothing', 'leave', "''", 'Slowly', 'welling', 'point', 'gold', 'nib', 'pale', 'blue', 'ink', 'dissolved', 'full', 'stop', 'pen', 'stuck', 'eyes', 'fixed', 'tears', 'slowly', 'filled', 'The', 'entire', 'bay', 'quivered', 'lighthouse', 'wobbled', 'illusion', 'mast', 'Mr.', 'Connor', "'s", 'little', 'yacht', 'bending']


That gets us closer - we pulled out a few punctuation marks. But notice that some still made it through: the tokenizer represents quotation marks as the two-character tokens `''` and ` `` `, and these do not match any of the single characters we added from string.punctuation. So we'll want to further extend our stopwords list based on the things that got left out. You can use the same approach to add corpus-specific stopwords if you need to do so. Your research questions will ultimately drive your decisions about what words to include or not.

In [6]:
# make a custom stopwords list and add it to our general stopwords list.
custom_stop_list = ["''", "``"]
stopword_list.extend(custom_stop_list)
filtered_chunk = [token for token in tokenized_text[0] if token not in stopword_list]
print(filtered_chunk[:50])

['CHAPTER', 'ONE', 'So', 'course', 'wrote', 'Betty', 'Flanders', 'pressing', 'heels', 'rather', 'deeper', 'sand', 'nothing', 'leave', 'Slowly', 'welling', 'point', 'gold', 'nib', 'pale', 'blue', 'ink', 'dissolved', 'full', 'stop', 'pen', 'stuck', 'eyes', 'fixed', 'tears', 'slowly', 'filled', 'The', 'entire', 'bay', 'quivered', 'lighthouse', 'wobbled', 'illusion', 'mast', 'Mr.', 'Connor', "'s", 'little', 'yacht', 'bending', 'like', 'wax', 'candle', 'sun']


That looks more like it. We've successfully massaged out the standalone punctuation tokens in this initial chunk, though similar situations might come up later on. The lesson here is that textual data is messy, and even the most sophisticated natural language processing setups require a good deal of massaging and careful modification in light of particular situations. Each text presents its own set of problems, and only through familiarity with your objects of study will you know what exactly needs to be accounted for. In short, there is no substitute for knowing your corpus. But methods like this can help you control what is in your texts by the time you get to analyzing them.
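
Rather than listing every stray punctuation token by hand, one alternative is to keep only the tokens that contain at least one letter or digit. This is offered as a rough sketch, not as the approach used in this unit. The `has_alphanumeric` helper is a hypothetical name, and the sample tokens are copied from the outputs above plus the '--' token that appears in the custom list later on.

```python
def has_alphanumeric(token):
    """Return True if the token contains at least one letter or digit."""
    return any(char.isalnum() for char in token)

# Tokens of the kind word_tokenize produced above, including multi-character
# punctuation tokens that string.punctuation does not match.
sample_tokens = ["''", 'So', 'of', 'course', ',', "''", 'wrote', '``',
                 'Mr.', 'Connor', "'s", '--', ';']

word_tokens = [token for token in sample_tokens if has_alphanumeric(token)]
print(word_tokens)
```

Tokens such as 'Mr.' and "'s" pass this test because they contain letters, so you would still need to decide separately how to treat them.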

## Why you might want to leave stopwords in

There are good reasons why you might actually want to leave those stopwords in for your analysis. If these words are statistically the most common in English-language texts, then they can serve as meaningful points of comparison among many different texts. Put another way, any two authors might take different objects of study in their texts, and filtering out common words can give a better sense of their particular interests. But the common words that they share might give a good sense of the particulars of their literary style, of the ways in which they write. And stopwords might be especially necessary for more advanced kinds of [machine learning](https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52). Let's use this principle to compare Jacob's Room with The Voyage Out. Even though both of these texts are by Virginia Woolf, they have very different styles. Let's see if those differences are reflected in word counts, with or without stopwords. In what follows, we'll compare the texts in two ways, first with stopwords taken out and second using _only_ the stopwords.

In [7]:
# store the filenames of both texts

fn_one = 'corpus/woolf/1915_the_voyage_out.txt'
fn_two = 'corpus/woolf/1922_jacobs_room.txt'

# read in the text of the_voyage_out and assign it a variable text_voyage

with open(fn_one, 'r') as fin:
    text_voyage = fin.read()

# read in the text of jacobs_room and assign it a variable text_jacob
    
with open(fn_two, 'r') as fin:
    text_jacob = fin.read()

# read in the English-language stopwords list provided by nltk and store it in 
# a variable stopword_list

stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.extend(string.punctuation)
custom_stop_list = ["''", "``", '--']
stopword_list.extend(custom_stop_list)
# use nltk's built-in tokenizer to get a list of the tokens in each text. store
# these in variables.

tokens_voyage = [token.lower() for token in nltk.word_tokenize(text_voyage)]
tokens_jacob = [token.lower() for token in nltk.word_tokenize(text_jacob)]


# look at the tokens we've got and keep only the stopwords by comparing tokens
# against the stored list of stopwords.

stop_tokens_voyage = [token for token in tokens_voyage if token in stopword_list]
stop_tokens_jacob = [token for token in tokens_jacob if token in stopword_list]

# create frequency distributions of the stopword tokens in each text. since we
# just filtered, these distributions will contain stopwords only. the most
# common tokens are printed at the end of the cell.

stop_freq_voyage = nltk.FreqDist(stop_tokens_voyage)
stop_freq_jacob = nltk.FreqDist(stop_tokens_jacob)


# now do the reverse: remove the stopwords so that only the other tokens remain.
tokens_voyage = [token for token in tokens_voyage if token not in stopword_list]
tokens_jacob = [token for token in tokens_jacob if token not in stopword_list]

freq_voyage = nltk.FreqDist(tokens_voyage)
freq_jacob = nltk.FreqDist(tokens_jacob)


print('=====')
print('Comparison of Texts with Stopwords Excluded')
print('The Voyage Out')
print(freq_voyage.most_common(10))
print("Jacob's Room")
print(freq_jacob.most_common(10))

print('=====')
print('Comparison of Stopwords in the texts')
print('The Voyage Out')
print(stop_freq_voyage.most_common(15))
print("Jacob's Room")
print(stop_freq_jacob.most_common(15))

=====
Comparison of Texts with Stopwords Excluded
The Voyage Out
[("'s", 1007), ('said', 874), ('one', 801), ('rachel', 579), ("n't", 513), ('mrs.', 437), ('like', 392), ('helen', 392), ('could', 380), ('people', 379)]
Jacob's Room
[('said', 425), ("'s", 411), ('jacob', 390), ('one', 291), ('mrs.', 225), ('like', 165), ('would', 150), ('little', 137), ('mr.', 134), ('flanders', 128)]
=====
Comparison of Stopwords in the texts
The Voyage Out
[(',', 10332), ('the', 7209), ('.', 7193), ('and', 4478), ('of', 3723), ('to', 3640), ('a', 2980), ("''", 2866), ('``', 2590), ('she', 2548), ('in', 2249), ('was', 2113), ('her', 1979), ('he', 1722), ('that', 1667)]
Jacob's Room
[(',', 4604), ('the', 3920), ('.', 2964), ('and', 1816), ('of', 1330), ('a', 1144), ('to', 1016), ('in', 997), ("''", 939), ('``', 805), (';', 790), ('was', 686), ('her', 605), ('he', 582), ('it', 574)]


At first blush, it's clear that The Voyage Out is much longer than Jacob's Room. You get a clue to that by looking at the word counts in general or even just the counts of the different punctuation marks. In the results that exclude stopwords, note that Jacob's Room has a male name in the top ten most common words, while The Voyage Out does not. On the one hand, this makes sense - Jacob's Room is at least nominally about a character named Jacob. But this character is presented in an elliptical way, primarily through the descriptions and recollections of other characters. Looking at the results that are based solely on stopwords, we can see this reflected to some degree. The stopword counts for Jacob's Room are a bit more evenly balanced in gendered pronouns, which gives a different picture of gender prominence in the novel than we would get by just noticing the frequent use of Jacob's name. You could use this information in conjunction with other metrics about the text to make larger arguments about the representation of character and gender.
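
To look at gendered pronouns more directly, you could total the counts for each group of pronouns instead of reading them off the top of a frequency list. The sketch below is only illustrative: the `pronoun_totals` helper is a hypothetical name, the two pronoun groups are an assumption drawn from words that appear in NLTK's English stopword list, and the sample sentence is invented rather than taken from either novel.

```python
from collections import Counter

# Pronoun groups (an assumption); all of these appear in the stopword list above.
FEMININE = {'she', 'her', 'hers', 'herself'}
MASCULINE = {'he', 'him', 'his', 'himself'}

def pronoun_totals(tokens):
    """Count how many tokens fall into each pronoun group, ignoring case."""
    counts = Counter(token.lower() for token in tokens)
    return {
        'feminine': sum(counts[word] for word in FEMININE),
        'masculine': sum(counts[word] for word in MASCULINE),
    }

# Invented sample tokens, for illustration only.
sample_tokens = ['She', 'gave', 'him', 'her', 'pen', 'and', 'he', 'took', 'it']

totals = pronoun_totals(sample_tokens)
print(totals)
```

Run on the full token lists for each novel, totals like these would let you compare the balance of pronouns without depending on which words happen to reach the top fifteen.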

We could combine this sort of analysis with one that takes into account punctuation use to get a better sense of an author's particular style. But, on first read, there doesn't actually look to be that much difference between the texts when looking solely at stopwords. This might be interesting, as upon _reading_ them Jacob's Room feels like a significant departure in style. We might ask why that style difference is not represented at the level of the word. These counts are difficult to compare directly, though: the two lists overlap almost entirely when we consider only stopwords, and the raw numbers come from texts of very different lengths. Some knowledge of statistics could help us to pursue this question further.
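
One simple first step toward a fairer comparison is to convert raw counts into relative frequencies, so that a longer text does not automatically show larger numbers. The sketch below is a minimal illustration under stated assumptions: `per_thousand` is a hypothetical helper, and the two toy token lists are invented stand-ins for the novels, not real data.

```python
from collections import Counter

def per_thousand(tokens, words):
    """Return occurrences per 1,000 tokens for each word in words."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        # avoid dividing by zero for an empty text
        return {word: 0.0 for word in words}
    return {word: 1000 * counts[word] / total for word in words}

# Invented toy "texts" of different lengths, for illustration only.
toy_long = ['the', 'sea', 'and', 'the', 'sky', 'and', 'the', 'shore'] * 10
toy_short = ['the', 'room', 'and', 'the', 'door']

rates_long = per_thousand(toy_long, ['the', 'and'])
rates_short = per_thousand(toy_short, ['the', 'and'])
print(rates_long)
print(rates_short)
```

With the real data, you would pass in the full lowercased token list for each novel before any stopwords are removed, so that the total reflects the length of the whole text.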