 *Artificial Intelligence for Vision & NLP* &nbsp; | &nbsp;  *ATU Donegal - MSc in Big Data Analytics & Artificial Intelligence*
 
# Stop Words

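In this practical we will explore removing stop words using the NLTK and spaCy libraries. Stop words are very common words (such as 'the', 'is' and 'of') that are often removed before further text processing.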

In [1]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
text = "Paul likes to write code in Python, however he is not too fond of C++."
text_tokens = word_tokenize(text)

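# Note: stopwords.words() with no language argument returns the stop words of every language in the corpus
all_languages_sw = stopwords.words()
tokens_without_sw = [word for word in text_tokens if word not in all_languages_sw]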

print(tokens_without_sw)

['Paul', 'likes', 'write', 'code', 'Python', ',', 'fond', 'C++', '.']


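In this example you can see that 'to', 'in', 'however', 'he', 'is', 'not', 'too' and 'of' have been removed from the text, while punctuation tokens such as ',' and '.' are kept. Note that 'however' is not in NLTK's English list (it survives in the later outputs); it was most likely removed here because `stopwords.words()` was called without a language argument, which returns the stop words for all languages. The remaining tokens can also be joined to output a full sentence: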

In [3]:
filtered_sentence = (" ").join(tokens_without_sw)
print(filtered_sentence)

Paul likes write code Python , fond C++ .
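One detail the filtering above depends on is letter case: NLTK's stop word lists are lowercase, while the tokenizer keeps the original casing, so a capitalised 'The' or 'He' would not be matched. The following is a minimal sketch of case-insensitive filtering. The stop word set and token list are small hypothetical stand-ins, not the real NLTK data.

```python
# Hypothetical subset of a stop word list (all lowercase, like NLTK's lists)
stop_words = {"the", "is", "not", "too", "of", "to", "in", "he"}

# Hypothetical tokenizer output; note the capitalised "The" and "He"
tokens = ["The", "code", "is", "in", "Python", ".", "He", "likes", "it"]

# Case-sensitive filtering misses capitalised stop words
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing only for the comparison keeps the original casing in the output
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Storing the stop words in a set rather than a list also makes each membership test faster, which matters once whole datasets are filtered.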


You can print the current list of stop words, and it is also possible to add words to this list.

In [4]:
print(len(stopwords.words('english')))
print(stopwords.words('english'))

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

In the example provided, if we want to treat the word 'write' as a stop word, we can append it to the list using the following code:

In [5]:
all_stopwords = stopwords.words('english')
all_stopwords.append('write')

text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]

print(tokens_without_sw)

['Paul', 'likes', 'code', 'Python', ',', 'however', 'fond', 'C++', '.']


We can also add several stop words at once by extending the list with another list:

In [6]:
sw_list = ['likes', 'write']
all_stopwords.extend(sw_list)

text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]

print(tokens_without_sw)

['Paul', 'code', 'Python', ',', 'however', 'fond', 'C++', '.']


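In this case, removing the word 'not' changes the meaning of the sentence: 'not too fond of C++' becomes 'fond C++'. To keep such a word in the text, we can remove it from the stop word list using the following code. Note that reloading the list with `stopwords.words('english')` also discards the words we added earlier.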

In [7]:
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')

text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]

print(tokens_without_sw)

['Paul', 'likes', 'write', 'code', 'Python', ',', 'however', 'not', 'fond', 'C++', '.']
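NLTK's English list shown earlier also contains the negations 'no' and 'nor', so it can be convenient to keep several of them in one step. The sketch below uses a set difference for this. The short `base_stopwords` list is a hypothetical stand-in for `stopwords.words('english')`, and the variable names are illustrative only.

```python
# Hypothetical subset of a stop word list; in the notebook this would be
# stopwords.words('english')
base_stopwords = ["he", "is", "not", "no", "nor", "too", "of", "to", "in"]

# Negations we want to keep in the text
negations = {"not", "no", "nor"}

# Set difference builds a new collection and leaves base_stopwords untouched
custom_stopwords = set(base_stopwords) - negations

tokens = ["Paul", "is", "not", "too", "fond", "of", "C++", "."]
filtered = [t for t in tokens if t.lower() not in custom_stopwords]
print(filtered)
```

Unlike `list.remove`, which raises a `ValueError` if the word is missing, the set difference silently ignores words that are not in the list.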


Now we will try completing the same tasks using spaCy.

In [8]:
# Import spaCy and load the small English pipeline
import spacy
# This will take a while to load initially
sp = spacy.load("en_core_web_sm")
# stop_words is a Python set; this assigns a reference to it, not a copy
all_stopwords = sp.Defaults.stop_words



In [9]:
text = "Paul likes to write code in Python, however he is not too fond of C++."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]

print(tokens_without_sw)

['Paul', 'likes', 'write', 'code', 'Python', ',', 'fond', 'C++', '.']


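spaCy defines more stop words than NLTK: 326 compared to 179. This is why 'however' is removed in the output above, whereas NLTK's English list keeps it. Note that the tokens were still produced by NLTK's `word_tokenize`; only the stop word list comes from spaCy.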

In [10]:
print(len(all_stopwords))
print(all_stopwords)

326
{'using', 'per', 're', 'sometimes', 'about', 'well', 'something', 'very', 'he', 'quite', 'everything', 'then', 'out', 'below', 'eleven', 'noone', 'go', 'so', 'they', 'whence', '’m', 'next', 'whither', 'more', 'an', 'n‘t', 'most', 'along', 'why', 'if', '‘m', 'many', 'whole', 'third', 'another', 'done', 'sixty', 'had', 'via', 'except', 'name', 'beside', 'other', 'take', '‘d', 'because', 'moreover', 'somehow', 'own', 'sometime', 'therein', 'never', 'former', 'before', 'someone', 'yourself', 'five', 'thus', 'serious', 'thence', 'alone', 'ten', 'amongst', 'themselves', 'over', 'can', 'forty', 'have', 'herself', 'one', 'see', 'toward', 'and', 'itself', 'with', 'now', 'thereupon', 'such', 'none', 'was', 'for', 'either', 'wherein', 'empty', "'re", 'may', 'nobody', '’ve', 'anywhere', 'every', 'others', 'due', 'i', 'when', 'regarding', 'hers', 'though', 'show', 'must', 'n’t', 'without', '‘ve', 'seems', 'six', 'ca', 'throughout', 'yours', 'my', 'whoever', 'everywhere', 'those', 'indeed', 'off
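To see which words account for the difference between two stop word collections, Python's set operations are convenient. The following is a minimal sketch using small hypothetical stand-ins (`nltk_like`, `spacy_like`). In the notebook the real inputs would be `stopwords.words('english')` and `sp.Defaults.stop_words`.

```python
# Hypothetical small stand-ins for the two stop word collections
nltk_like = ["he", "is", "not", "too", "of", "ours"]              # a list, as NLTK returns
spacy_like = {"he", "is", "not", "too", "of", "however", "via"}   # a set, as spaCy stores

nltk_set = set(nltk_like)

only_in_spacy = spacy_like - nltk_set   # removed by the second list only
only_in_nltk = nltk_set - spacy_like    # removed by the first list only
shared = nltk_set & spacy_like          # removed by both

print(sorted(only_in_spacy))
print(sorted(only_in_nltk))
print(len(shared))
```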

It is also possible to add stop words, either individually with `add` or several at once with `update` (the spaCy stop words are stored in a Python set, not an array):

In [11]:
all_stopwords = sp.Defaults.stop_words
all_stopwords.add("C++")

text = "Paul likes to write code in Python, however he is not too fond of C++."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]

print(tokens_without_sw)

['Paul', 'likes', 'write', 'code', 'Python', ',', 'fond', '.']


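We can also remove stop words from the set. Because `sp.Defaults.stop_words` is a shared set that is modified in place, the 'C++' entry added in the previous cell is still present, so 'C++' remains filtered out in the output below: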

In [12]:
all_stopwords = sp.Defaults.stop_words
all_stopwords.remove('not')

text = "Paul likes to write code in Python, however he is not too fond of C++."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]

print(tokens_without_sw)

['Paul', 'likes', 'write', 'code', 'Python', ',', 'not', 'fond', '.']
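The output above still lacks 'C++' because assigning `sp.Defaults.stop_words` to a variable does not copy the set, so earlier changes persist. The sketch below illustrates the difference between an alias and a copy using a hypothetical `Defaults` class as a stand-in. It does not use spaCy itself.

```python
# Hypothetical stand-in for a library's shared default stop word set
class Defaults:
    stop_words = {"he", "is", "not", "too", "of"}

# Plain assignment creates an alias: both names refer to the same set object
alias = Defaults.stop_words
alias.add("C++")
print("C++" in Defaults.stop_words)  # the shared defaults were changed too

# set(...) creates an independent copy that can be edited safely
custom = set(Defaults.stop_words)
custom.discard("not")  # discard() does not raise if the word is absent
print("not" in custom, "not" in Defaults.stop_words)
```

Working on a copy also makes a notebook cell safe to re-run: calling `remove('not')` a second time on the same set raises a `KeyError`, whereas `discard` does not.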


## Exercise

We'll be looking at sentiment analysis on the IMDB movie review dataset later. To prepare for this: 
 - Try doing some frequency analysis on the dataset, to see how frequently stop words occur in reviews
 - Remove the stop words from the reviews and check a few reviews to verify this has been done.
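As a possible starting point for both tasks, here is a minimal sketch that works on a single review once it has been decoded back to text. The review string and the stop word set are short hypothetical stand-ins. In the notebook they would be a decoded review and, for example, `stopwords.words('english')`.

```python
from collections import Counter

# Hypothetical stand-ins for a decoded review and a stop word list
decoded_review = "? this is not that movie the people tell the story"
stop_words = {"this", "is", "not", "that", "the"}

tokens = decoded_review.split()  # decoded reviews are already space-separated

# Frequency analysis: how often does each stop word occur?
stop_counts = Counter(t for t in tokens if t in stop_words)
stop_fraction = sum(stop_counts.values()) / len(tokens)
print(stop_counts.most_common(3))
print(f"{stop_fraction:.0%} of tokens are stop words")

# Removal: keep only the tokens that are not stop words
filtered_review = " ".join(t for t in tokens if t not in stop_words)
print(filtered_review)
```

The same `Counter` can be updated in a loop over many reviews to get dataset-level frequencies.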

In [13]:
from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=30000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


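Note that the data is already tokenised and indexed: each review is a list of integer word indices. The indices are offset by 3 because 0, 1 and 2 are reserved for padding, start of sequence and unknown words, which is why the code subtracts 3 and why some words appear as '?'. We can decode a single sample (one review) as follows: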

In [14]:
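word_index = imdb.get_word_index()
# Invert the mapping so that a word can be looked up from its integer index
reverse_word_index = {value: key for (key, value) in word_index.items()}
# Subtract the offset of 3 (indices 0-2 are reserved); unknown indices become "?"
decoded_review = " ".join(
    reverse_word_index.get(i - 3, "?") for i in train_data[8])
decoded_review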

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


"? just got out and cannot believe what a brilliant documentary this is rarely do you walk out of a movie theater in such awe and amazement lately movies have become so over hyped that the thrill of discovering something truly special and unique rarely happens amores perros did this to me when it first came out and this movie is doing to me now i didn't know a thing about this before going into it and what a surprise if you hear the concept you might get the feeling that this is one of those touchy movies about an amazing triumph covered with over the top music and trying to have us fully convinced of what a great story it is telling but then not letting us in ? this is not that movie the people tell the story this does such a good job of capturing every moment of their involvement while we enter their world and feel every second with them there is so much beyond the climb that makes everything they go through so much more tense touching the void was also a great doc about mountain cli

In [28]:
i = 7
print(train_labels[i])  # 0 = negative review, 1 = positive review
word_index = imdb.get_word_index()
reverse_word_index = {value: key for (key, value) in word_index.items()}
# Use a separate loop variable (idx) so that it does not shadow the review index i
decoded_review = " ".join(
    reverse_word_index.get(idx - 3, "?") for idx in train_data[i])
decoded_review

0


"? the hamiltons tells the story of the four hamilton siblings teenager francis cory ? twins ? joseph ? darlene mackenzie ? the eldest david samuel who is now the surrogate parent in charge the hamilton's move house a lot ? is unsure why is unhappy with the way things are the fact that his brother's sister kidnap ? murder people in the basement doesn't help relax or calm ? nerves either francis ? something just isn't right when he eventually finds out the truth things will never be the same again br br co written co produced directed by mitchell ? phil ? as the butcher brothers who's only other film director's credit so far is the april fool's day 2008 remake enough said this was one of the ? to die ? at the 2006 after dark horrorfest or whatever it's called in keeping with pretty much all the other's i've seen i thought the hamiltons was complete total utter crap i found the character's really poor very unlikable the slow moving story failed to capture my imagination or sustain my int