![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Text-mining: Basics

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites, social media platorms, text data, conducting simulations (agent based modelling), to name a few. We provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Retrieval" data-toc-modified-id="Retrieval-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Retrieval</a></span></li><li><span><a href="#Processing" data-toc-modified-id="Processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Processing</a></span><ul class="toc-item"><li><span><a href="#Tokenisation" data-toc-modified-id="Tokenisation-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Tokenisation</a></span></li><li><span><a href="#Remove-uppercase-letters" data-toc-modified-id="Remove-uppercase-letters-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Remove uppercase letters</a></span></li><li><span><a href="#Spelling-correction" data-toc-modified-id="Spelling-correction-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Spelling correction</a></span></li><li><span><a href="#Stopwords" data-toc-modified-id="Stopwords-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Stopwords</a></span></li></ul></li><li><span><a href="#Basic-Natural-Language-Processing" data-toc-modified-id="Basic-Natural-Language-Processing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Basic Natural Language Processing</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Bibliography" data-toc-modified-id="Bibliography-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Bibliography</a></span></li></ul></div>


There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars, if unsure hover your mouse over the buttons). 

## Introduction



## Retrieval


The first step in text-mining, or any form of data-mining, is retrieving a data set to work with. Within text-mining, or any language analysis context, one data set is usually referred to as 'a corpus' while multiple data sets are referred to as 'corpora' because it is a latin word and therefore has a funny plural. 

For text-mining, a corpus can be:
- a set of tweets, 
- the full text of an 18th centrury novel,
- the contents of a page in the dictionary, 
- random gibberish letters and numbers, or
- just about anything else in text format. 


Retrieval is a very important step, but it is not the focus of this training series. If you are particularly interested in creating a corpus from internet adat, then we recommend you check out our previous training sessions on Web-scraping (recording or jupyter notebook) and API's (recording or jupyter notebook) Both of these demonstrate and discuss ways to get data from the internet that you could use to build a corpus. 

Instead, for the purposes of this session, we will assume that you already have a corpus to analyse. This is easy for us to assume, because we have provided a sample text file that we can use as a corpus for these exercises. 

First, let's check that it is there. To do that, click in the code cell below and hit the 'Run' button at the top of this page or by holding down the 'Shift' key and hitting the 'Enter' key. 

For the rest of this notebook, I will use 'Run/Shift+Enter' as short hand for 'click in the code cell below and hit the 'Run' button at the top of this page or by hold down the 'Shift' key while hitting the 'Enter' key'. 


In [19]:
# It is good practice to always start by importing the modules and packages you will need. 
# os is a module for navigating your machine (e.g., file directories).
# nltk stands for natural language tool kit and is useful for text-mining. 
# The print statement is just a bit of encouragement!

import os
import nltk
import nltk.corpus

print("1. Succesfully imported necessary modules")    
print("")

# List all of the files in the "data" folder that is provided to you
for file in os.listdir("./data"):
   print("2. One of the files in ./data is...", file)
print("")


1. Succesfully imported necessary modules

2. One of the files in ./data is... sample_text.txt



_______________________________________________________________________________________________________________________________
Great! We have imported a useful module and used it to check that we have access to the sample_text file. 

Now we need to load that sample_text file into a variable that we can work with in python. Time to Run/Shift+Enter again!

In [4]:
# Open the "sample_text" file and read (import) its contents to a variable called "corpus"
with open("./data/sample_text.txt", "r") as f:
    corpus = f.read()
    
    print(corpus)

This is a sample corpus. It haz some spelling errors and has numbers written two ways. For example, it has both 1972 and ninety-six. 

This sample corpus also uses abbreviations sometimes, but not always. California is spelled out once but also written CA. 

To really complicate things, another country name is written as the U.K., the UK, the United Kingdom, the United Kingdom of Great Britain and The United Kingdom of Great Britain and Northern Ireland becuase sometimes full names are important. 

Further, here is a bunch of unrelated toxt just to fill up the space. 

This privacy policy (“Privacy Policy”) is intended to inform you of some policies and practices regarding the collection, use, and disclosure of your Personal Information through our site and any other sites that links to this Privacy Policy (the “Site”). We define “Personal Information” as information that allows someone to identify you personally or contact you, including for example your name, address, telephone numbe

_______________________________________________________________________________________________________________________________
Hmm. Not excellent literature, but it will do for our purposes. 

A quick look tells us that there are capital letters, contractions, punctuation, numbers as digits, numbers written out, abbreviations, and other things that, as humans, we know are equivalent but that computers do not know about. 

Before we go further, it helps to know what kind of variable corpus is. Run/Shift+Enter the next code block to find out!

In [5]:
type(corpus)

str

_______________________________________________________________________________________________________________________________
This tells us that 'corpus' is one very long string of text characters.  

Congratulations! We are done with the retreival portion of this process. They won't all be this easy, but today, we can put our feet up. 

Next up... Processing, which is about cleaning, correcting, standardizing and formatting. 

## Processing



_______________________________________________________________________________________________________________________________
The string we have as our corpus is a good starting point, but it is not perfect. It has a bunch of errors and punctuation which need to be corrected. But even worse, it is 'one long thing' when statistical analysis works on 'lots of short things'. 

So, clearly, we have a few steps to go through with our raw text. 
- Tokenisation, (or splitting text into various kinds of 'short things' that can be statistically analysed).
- Convert to lowercase (so that 'United' and 'united' are counted as the same word).
- Spell check (should be obvious).
- Remove punctuation from each token (so that 'u.k.' and 'uk' are counted as the same word).
- Filter out stop words ( like 'the' or 'to' that are consistent across all English texts, so are unhelpful for text-mining). 



### Tokenisation

Our first step is to cut our 'one big thing' into tokens, or 'lots of little things'. In reality, this might actually be a multi-stage step because some of the data sets you will want to analyse will need to be cut into individual newspaper articles, into tweets, etc. These then, may need to be broken down further into chapters, sections, paragraphs, sentences, words, or something else. 

Fortunately for us, we can skip right to breaking a text into sentences and words. These are both useful tokens in their own way, so we will see how to produce both kinds. 

As a side note, it is good practice to create new variables whenever you manipulate an existing variable rather than write over the original. This means that you keep the original and can go back to it anytime you need to if you want to try a different manipulation or correct an error.
 
We start by dividing our corpus into words, splitting the string into substrings every time it finds a white space (including tabs and new lines). 

Let's try that. But this time, let's just have a look at the first 100 things it finds instead of the entire text.
Run/Shift+Enter.

In [9]:
# importing word_tokenize from nltk
from nltk import word_tokenize

# Passing the string corpus into word tokenize to be broken into words
corpus_words = word_tokenize(corpus)
print(corpus_words[:10])                                                  # the [:100] within the print statement says 
                                                                           # to print only the first 100 items in the list  
print("...")                                                               # the print("...") just improves output readability
type(corpus_words)                                                         # Always good to know your variable type!


['This', 'is', 'a', 'sample', 'corpus', '.', 'It', 'haz', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', '.', 'For', 'example', ',', 'it', 'has', 'both', '1972', 'and', 'ninety-six', '.', 'This', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', ',', 'but', 'not', 'always', '.', 'California', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'CA', '.', 'To', 'really', 'complicate', 'things', ',', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'U.K.', ',', 'the', 'UK', ',', 'the', 'United', 'Kingdom', ',', 'the', 'United', 'Kingdom', 'of', 'Great', 'Britain', 'and', 'The', 'United', 'Kingdom', 'of', 'Great', 'Britain', 'and', 'Northern', 'Ireland', 'becuase', 'sometimes', 'full', 'names', 'are', 'important', '.', 'Further', ',', 'here', 'is', 'a', 'bunch', 'of', 'unrelated', 'toxt', 'just', 'to', 'fill', 'up', 'the', 'space', '.', 'This', 'privacy', 'policy', '(', '“', 'Privacy', 'Policy', '”', ')', 'is', 'intended',

list

Right. Let's have a look. 

We can see that corpus_words is a list of strings. We know it is a list because it starts and ends with square brackets and we know the things in that list are surrounded by single quotes. 

However, we can also see that we still have some problems with spelling errors, capital letters and puctuation. For example, each full stop at the end of a sentence appears as its own token. Interestingly, 'U.K.' is all one token, despite having full stops in. Clever stuff, this tokenisation function!

This 'bag of words' reduces a lot of the contextual information within the original corpus because it ignores how the words were used or in what order they originally appeared. There is a suprising amount of insight to be gained from just counting word occurrences, but it does mean that 'building' as a verb in "He is building a diorama for a school project." will be counted as a the same word as 'building' as a noun in "The building is a clear example of brutalist architecture." 

There are other kinds of analyses that you could do, if you don't want verb-building and noun-building to be counted as the same. Let's see what one might look like by running the same basic analysis again, but this time with sentence-token things instead of word-token things. 

Do that funky Run/Shift+Enter thing! 

In [8]:
# importing sent_tokenize from nltk
from nltk import sent_tokenize

# Same again, but this time broken into sentences
corpus_sentences = sent_tokenize(corpus)
print(corpus_sentences[:10])                                                  # Since these are sentences instead of words, 
                                                                              # we only want the first 10 items instead of 100.
print("...")                                                                  
type(corpus_sentences)

['This is a sample corpus.', 'It haz some spelling errors and has numbers written two ways.', 'For example, it has both 1972 and ninety-six.', 'This sample corpus also uses abbreviations sometimes, but not always.', 'California is spelled out once but also written CA.', 'To really complicate things, another country name is written as the U.K., the UK, the United Kingdom, the United Kingdom of Great Britain and The United Kingdom of Great Britain and Northern Ireland becuase sometimes full names are important.', 'Further, here is a bunch of unrelated toxt just to fill up the space.', 'This privacy policy (“Privacy Policy”) is intended to inform you of some policies and practices regarding the collection, use, and disclosure of your Personal Information through our site and any other sites that links to this Privacy Policy (the “Site”).', 'We define “Personal Information” as information that allows someone to identify you personally or contact you, including for example your name, addres

list

_______________________________________________________________________________________________________________________________

Corpus_sentences is also a list of strings (starts and ends with square brackets, each item is surrounded by single quotes). 

The same spelling errors, capital letters and puctuation problems are here too, but the full stop at the end of each sentence is now part of the sentence. Now, we could carry on doing two separate analyses, but let's concentrate on the words for now. 

### Remove uppercase letters

Since we are focussing on the 'bag of words' approach, which means we don't really care about uppercase or lowercase distinctions. We want to count 'Privacy' as the same as 'privacy', rather than as two different words. 

We can remove all uppercase letters with a built in python command on corpus_words. Do this in the next code cell, again returning just the first 100 items instead of the whole thing. 

Do the Run/Shift+Enter thing. 

In [11]:
# You can see that I created a new variable called corpus_lower rather than edit corpus_words directly.
# This means I can easily compare two different processes or correct something without going back and re-running earlier steps. 

corpus_lower = [word.lower() for word in corpus_words]
print(corpus_lower[:100])

['this', 'is', 'a', 'sample', 'corpus', '.', 'it', 'haz', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', '.', 'for', 'example', ',', 'it', 'has', 'both', '1972', 'and', 'ninety-six', '.', 'this', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', ',', 'but', 'not', 'always', '.', 'california', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'ca', '.', 'to', 'really', 'complicate', 'things', ',', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'u.k.', ',', 'the', 'uk', ',', 'the', 'united', 'kingdom', ',', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'northern', 'ireland', 'becuase', 'sometimes', 'full', 'names', 'are', 'important', '.', 'further', ',', 'here', 'is', 'a', 'bunch']


_______________________________________________________________________________________________________________________________
Great! This is another step in the right direction. 

If you want a bit more practice, you can copy/paste/edit the command above to create a second version that applies to corpus_sentences instead of corpus_words. It doesn't make a whole lot of sense to do this, because uppercase letters are potentially useful in an analysis that looks at sentences. But we are just playing around here, so go ahead. Knock yourself out! 


Forging ahead, let's filter out punctuation. We can define a string that includes all the standard English language punctuation, and then use that to iterate over corpus_words, removing anything that matches.

But wait... Do we really want to remove the:
- dash in 'ninety-six'? 
- full stops in 'u.k.'? 
- the apostrophe in contractions or possessives?

There are no right or wrong answers here. Every project will have to decide, based on the research questions, what is the right choice for the specific context. In this case, we want to remove the full stops, even from 'u.k.' becomes identical to 'uk'. 

But, at the same time, we don't want to remove dashes or apostrophes because I decided that 'ninetysix' and 'whats' are not words.

Run/Shift+Enter, as is tradition. 

In [57]:
# First, we want to define a variable with all the punctuation to remove.
# Print that defined variable, just to check it is correct.

English_punctuation = "!\"#$%&()*+,./:;<=>?@[\]^_`{|}~“”[”]"  
punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~”“'''
print(punctuations)
print(English_punctuation)                                    
print("...")
table_punctuation = str.maketrans('','', punctuations) # The python function 'maketrans' creates a table that maps
print(table_punctuation)                                       # the punctation marks to 'None'. Print the table to check. 
print("...")                                                   # Just to be clear, '!' is 33 in Unicode, and '\' is 34, etc.
                                                               # 'None' is python for nothing, not a string of the word "none".
    
corpus_no_punct = [w.translate(table) for w in corpus_lower]   # Iterate over corpus_lower, turning punctuation to nothing.
print(corpus_no_punct[:100])                                   # Print the 1st 100 items in corpus_no_punct to check.

!()-[]{};:'"\,<>./?@#$%^&*_~
!"#$%&()*+,./:;<=>?@[\]^_`{|}~“”[”]
...
{33: None, 40: None, 41: None, 45: None, 91: None, 93: None, 123: None, 125: None, 59: None, 58: None, 39: None, 34: None, 92: None, 44: None, 60: None, 62: None, 46: None, 47: None, 63: None, 64: None, 35: None, 36: None, 37: None, 94: None, 38: None, 42: None, 95: None, 126: None}
...
['this', 'is', 'a', 'sample', 'corpus', '', 'it', 'haz', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', '', 'for', 'example', '', 'it', 'has', 'both', '1972', 'and', 'ninety-six', '', 'this', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', '', 'but', 'not', 'always', '', 'california', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'ca', '', 'to', 'really', 'complicate', 'things', '', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'uk', '', 'the', 'uk', '', 'the', 'united', 'kingdom', '', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'the', 'unite

_______________________________________________________________________________________________________________________________
Super! 

Do you want to try something else? How about you create a version that *does* filter out dashes and apostrophes. 

C'mon. You know you can do it. 

Take each of the steps above and copy/paste/edit them as needed. 
- Create a copy of the line that defines the English_punctuation variable and edit it to define an All_English_Punctuation variable that includes more punctuation.
- Then create a copy of the line that defines the table_punctuation variable and have it create a table_all_punctuation variable.
- Then create a copy of the line that creates the corpus_no_punct variable and have it create an absolutely_no_punct variable.
- Then ask for the first 100 items of absolutely_no_punct. 

Feel free to change the variable names as you like. I am going for clarity, but you might prefer brevity. 


A point you might notice here... Removing the punctuation has left list items that are empty strings. Between 'corpus' and 'it', for example, is an item shown as ''. This is an empty string item. Can you think of how we might remove this? Don't worry too much about it now, because I will show you a method later. 

### Spelling correction

_______________________________________________________________________________________________________________________________
Everybody loves spelling... RIGHT?!?

Fortunately, there are several decent spellchecking packages written for python. They are not automatically installed and ready to import in the same way that the 'os' or 'nltk' packages were, but we just need to install the packages and import the functions we need through an installer called 'pip'. You will see 'pip' in the next code block, but since this is in jupyter notebook rather than directly in a python shell, we need to put a '!' in front of the 'pip' function. Don't worry too much about that now, I just include it here in case you find it interesting to know. 

The next code cell:
- installs the 'autocorrect' package,
- imports the Speller function, and
- creates a one-word command that specifies that the Speller function should use English language. 

Run/Shift+Enter, as per usual. 

In [39]:
!pip install autocorrect
from autocorrect import Speller
check = Speller(lang='en')



_______________________________________________________________________________________________________________________________
Super. Creating that one-word command saves us some time, which is maybe less important here but is a good skill to be aware of if you are working on text-mining every day for weeks on end. Always be on the look out for good ways to save time. 

Moving on, we need to iterate over our corpus, checking and correcting each token in our punctuation free bag of words. This is easy to do if you start with a new, empty list (I called mine 'corpus_correct_spell'). As I work through corpus_no_punct, one token at a time, we append (which is just fancy for 'add to the end') the corrected word to our new blank list. 

Then, as usual, we have a quick look at the first 100 entries in the new 'corpus_correct_spell'. 

Run/Shift+Enter. You know how to do it. Don't worry if it takes a while... Checking the spelling on each word is not a cakewalk. 

In [58]:
corpus_correct_spell = []

for word in corpus_no_punct:
    corpus_correct_spell.append(check(word))    

print(corpus_correct_spell[:100])

['this', 'is', 'a', 'sample', 'corpus', '', 'it', 'had', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', '', 'for', 'example', '', 'it', 'has', 'both', '1972', 'and', 'ninety-six', '', 'this', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', '', 'but', 'not', 'always', '', 'california', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'ca', '', 'to', 'really', 'complicate', 'things', '', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'uk', '', 'the', 'uk', '', 'the', 'united', 'kingdom', '', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'northern', 'ireland', 'because', 'sometimes', 'full', 'names', 'are', 'important', '', 'further', '', 'here', 'is', 'a', 'bunch']


_______________________________________________________________________________________________________________________________
How did it do? Well, this spell-checker replaced 'haz' with 'had' rather than 'has'. That is ok, I guess. No automatic spelling correction programme will get it 100% right 100% of the time. Maybe your project has specific research questions that won't work with this decision. 

In that case, you would have to check out some other spell-checkers like textblob or pyspellchecker. You might even want to custom build or adapt your own spell-checker, especially if you were working with very non-standard text, like comment boards that use a bunch of slang, common typos, or specific terms. 

Next up, stopwords. 

### Stopwords

Stopwords are typically conjunctions ('and', 'or'), prepositions ('to', 'around'), determiners ('the', 'an'), possessives ('s) and the like. The are **REALLY** common in all language, and tend to occur at about the same ratio in all kinds of writing, regardless of who did the writing or what it is about. These words are definitely important for structure as they make all the difference between "Freeze *or* I'll shoot!" and "Freeze *and* I'll shoot!". 

Buuuut... In the bag of words approach, these words don't have a whole lot of meaning in and of themselves. Thus, we want to remove them. 

Let's start by downloading the basic stopwords function built into nltk and storing the English language ones in a list called, appropriately enough, 'stop_words'. 

Then let's have a look at what is in that list with a print command by doing the whole Run/Shift+Enter thing in the next two (two?!?) code cells. 

In [41]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [47]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(sorted(stop_words))


['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

_____________________________________________________________________________________________________________________________
Great. Now let's remove those stop_words by creating another list called corpus_no_stop_words. Then, we iterate over corpus_correct_spell, looking at them one by one and appending them to corpus_no_stop_words if and only if they do not match any of the items in the stop_words list. 

As you might expect, you should do the whole Run/Shift+Enter thing. Again. (I know, I know...)

In [60]:
corpus_no_stop_words = []

for word in corpus_correct_spell:
    if word not in stop_words:
        corpus_no_stop_words.append(word)
        
        
print(corpus_no_stop_words[:100])

['sample', 'corpus', '', 'spelling', 'errors', 'numbers', 'written', 'two', 'ways', '', 'example', '', '1972', 'ninety-six', '', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', '', 'always', '', 'california', 'spelled', 'also', 'written', 'ca', '', 'really', 'complicate', 'things', '', 'another', 'country', 'name', 'written', 'uk', '', 'uk', '', 'united', 'kingdom', '', 'united', 'kingdom', 'great', 'britain', 'united', 'kingdom', 'great', 'britain', 'northern', 'ireland', 'sometimes', 'full', 'names', 'important', '', '', 'bunch', 'unrelated', 'text', 'fill', 'space', '', 'privacy', 'policy', '', '', 'privacy', 'policy', '”', '', 'intended', 'inform', 'policies', 'practices', 'regarding', 'collection', '', 'use', '', 'disclosure', 'personal', 'information', 'site', 'sites', 'links', 'privacy', 'policy', '', '', 'site', '”', '', '', 'define', '']


_______________________________________________________________________________________________________________________________
Hey now! That looks pretty good. 

But do you remember those empty strings? They are not filtered out by the stop_words because... Well. They don't match any of the items on the stop_words list. But python is aware that empty strings exist and can filter them out. 

Let's give it a try. Run/Shift+Enter. Do it!

In [52]:
corpus_no_space = list(filter(None, corpus_no_stop_words))  # This adds the empty string to the stop words list.

print(corpus_no_space[:100])                           # Will we see the same result? One way to find out!

['sample', 'corpus', 'spelling', 'errors', 'numbers', 'written', 'two', 'ways', 'example', '1972', 'ninety-six', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', 'always', 'california', 'spelled', 'also', 'written', 'ca', 'really', 'complicate', 'things', 'another', 'country', 'name', 'written', 'uk', 'uk', 'united', 'kingdom', 'united', 'kingdom', 'great', 'britain', 'united', 'kingdom', 'great', 'britain', 'northern', 'ireland', 'sometimes', 'full', 'names', 'important', 'bunch', 'unrelated', 'text', 'fill', 'space', 'privacy', 'policy', 'privacy', 'policy', '”', 'intended', 'inform', 'policies', 'practices', 'regarding', 'collection', 'use', 'disclosure', 'personal', 'information', 'site', 'sites', 'links', 'privacy', 'policy', 'site', '”', 'define', 'personal', 'information', '”', 'information', 'allows', 'someone', 'identify', 'personally', 'contact', 'including', 'example', 'name', 'address', 'telephone', 'number', 'email', 'address', 'registering', 'us', 'using'

At this point, we also need to decide whether to dive into the text-mining or whether to carry on correcting and cleaning the text using downloaded packages that could:
- replace contractions so 'haven't' would become 'have' and 'not',
- replace abbreviations, so 'ca' becomes 'california' or 'uk' becomes 'united' and 'kingdom', or
- anything else that might be specific to the text.

Buuuuuuuuuuuuuut... those steps take some time and at this point, it is not clear that that time would be well spent. Let's dive in for now and we can always come back and add more correcting and cleaning steps if we decide they are needed. 

## Basic Natural Language Processing



Commentary

## Conclusion

Hopefully this chapter has demystified aspects of CSS and whetted your appetite for some applied work. The subsequent chapters provide plenty of opportunity to practice CSS with various forms of data. For now I wanted to reflect on some outstanding issues.

<!-- #### Python vs R vs Julia vs ....

[Perhaps a table with some properties of each?] The general point is it's your choice.
 -->

## Bibliography


<!-- ## Further Reading and Resources

[Copy AQMEN reading lists] -->