# Removing common and unwanted words

So far we have tidied up our text and split it into smaller chunks. Before we analyse it, there are some further steps we may want to take. For statistical analysis, we often remove joining words and other very common words, because they can skew the results.
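
As a rough illustration of that skew, the sketch below counts word frequencies in a short made-up sentence. The `sample` string is an assumption for illustration only and is not the text used later in this section. The most frequent tokens turn out to be joining words rather than the words that carry the meaning.

```python
from collections import Counter

# Hypothetical sample text, used only to illustrate the point
sample = "the cat sat on the mat and the dog lay on the rug"

# Lowercase and split on whitespace, then count each token
counts = Counter(sample.lower().split())

# The two most frequent tokens
top_two = counts.most_common(2)
print(top_two)
```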

These words are often called **stopwords**. You can create your own list of stopwords based on the text you have, or start from a standard list and add or remove words to suit your needs.
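
A minimal sketch of adapting a ready-made list follows. The `base_stopwords` set is a small hypothetical stand-in for a standard list (real ones are much longer), and the names `extra` and `keep` are made up for this example. Set operations then add or remove words.

```python
# Hypothetical stand-in for a standard stopword list
base_stopwords = {"the", "of", "and", "it", "on", "my", "i"}

# Words to add because they are uninformative in this particular text
extra = {"just", "keep"}

# Words to retain in the analysis even though the base list would drop them
keep = {"my"}

# Union adds the extra words; difference removes the ones we want to retain
custom_stopwords = (base_stopwords | extra) - keep
print(sorted(custom_stopwords))
```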

In [2]:
text_o="""
First time, I did it for the hell of it 
Stuck it on the back of my tongue 
And swallowed it 
Second time, things are
Getting easier 
Blow me down this wind
Is getting breezier 
Third time lucky, maybe feelin' fuzzy 
Oh my God! we're getting hippy-dippy!
5-6-7 man I'm in heaven 
And I'm growing my beard 
Until it gets sheared
You're on my mind 
Every day and every night 
And you'll never go away 
Cause I know you're here to stay 
For the rest of my mind
I just keep repeating myself
I just keep repeating myself
I just keep repeating myself
"""

## Custom Stopword List

Simply create a list containing each word you want to remove. Then clean up the text, split it into word tokens, and use a list comprehension to drop any token that appears in the list.

First, define the stopword list.

In [3]:
# The empty string '' removes empty tokens left over after splitting
stopwords = ['the', 'did', 'it', 'of', 'are', 'and', 'on', '']

Then preprocess the text and check against the stopword list:

In [14]:
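# Process the text
import re

# Remove full stops, commas, question marks, colons and apostrophes
text_processed = re.sub(r'[.,?:\']', "", text_o)
# Replace newlines with spaces
text_processed = re.sub(r'\n', " ", text_processed)
# Collapse runs of two or more spaces into a single space
text_processed = re.sub(r'  +', " ", text_processed)
# Split into word tokens on single spaces
tokens = text_processed.split(" ")

# Remove stopwords
tokens_sw = [t for t in tokens if t not in stopwords]
print(tokens_sw)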

['First', 'time', 'I', 'for', 'hell', 'Stuck', 'back', 'my', 'tongue', 'And', 'swallowed', 'Second', 'time', 'things', 'Getting', 'easier', 'Blow', 'me', 'down', 'this', 'wind', 'Is', 'getting', 'breezier', 'Third', 'time', 'lucky', 'maybe', 'feelin', 'fuzzy', 'Oh', 'my', 'God!', 'were', 'getting', 'hippy-dippy!', '5-6-7', 'man', 'Im', 'in', 'heaven', 'And', 'Im', 'growing', 'my', 'beard', 'Until', 'gets', 'sheared', 'Youre', 'my', 'mind', 'Every', 'day', 'every', 'night', 'And', 'youll', 'never', 'go', 'away', 'Cause', 'I', 'know', 'youre', 'here', 'to', 'stay', 'For', 'rest', 'my', 'mind', 'I', 'just', 'keep', 'repeating', 'myself', 'I', 'just', 'keep', 'repeating', 'myself', 'I', 'just', 'keep', 'repeating', 'myself']


### Can you see any issues with the processed text? 
#### What other pre-processing would you do? 
#### Are there any symbols or digits you'd think about removing? 
#### How would you do this?


In [None]:
#Try your ideas here
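
One possible approach, offered as a sketch rather than the only answer: lowercase the text so that `And` matches `and`, then keep only letters and whitespace, which drops digits and leftover punctuation such as `!` and `-`. The `clean_and_filter` helper and the short `sample` string are made up for illustration. Deleting apostrophes and turning hyphens into spaces are assumptions you may want to change.

```python
import re

def clean_and_filter(text, stopwords):
    """Lowercase, strip symbols and digits, tokenise, and drop stopwords."""
    text = text.lower()
    # Delete apostrophes so "I'm" becomes "im" rather than two tokens
    text = text.replace("'", "")
    # Replace anything that is not a lowercase letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no argument handles newlines and repeated spaces
    tokens = text.split()
    stop_set = set(stopwords)  # set membership tests are fast
    return [t for t in tokens if t not in stop_set]

# A few lines in the style of the text above, used only as a small example
sample = "Oh my God! we're getting hippy-dippy!\n5-6-7 man I'm in heaven\nAnd I'm growing my beard"
result = clean_and_filter(sample, ['the', 'did', 'it', 'of', 'are', 'and', 'on'])
print(result)
```

Because `split()` never produces empty tokens, the empty string is no longer needed in the stopword list with this approach.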