# UK Data Service - Text Mining Basic Processes Tutorial

I am working through the code files at my own pace to grasp the processing steps and reflect on how I can apply them to my own data.
https://github.com/UKDataServiceOpen/text-mining/blob/master/code/tm-processing-2020-06-16.ipynb

Key learning points:
   * Tokenisation, standardisation, removing irrelevancies and consolidation (i.e. lemmatisation or stemming) are the key stages of processing. However, the order will vary depending on the research question and the text.
   * It is really important to keep track of variables, particularly if you are experimenting with different token sizes.
   * Need to get better at regular expressions (RegEx)!

In [3]:
# Import the modules and packages 

import os   # os is a module for navigating your machine (e.g., file directories).
import nltk # nltk stands for natural language tool kit and is useful for text-mining. 
import re  #  re is for regular expressions, which we use later 

print("1. Successfully imported necessary modules")    # The print statement is just a bit of encouragement!

print("")



1. Successfully imported necessary modules



# Retrieval

In [None]:
# Open the "sample_text" file and read (import) its contents to a variable called "corpus"
with open("C:/Users/sonja/Desktop/TfL/Furlough Learning/sample_text.txt", "r", encoding = "ISO-8859-1") as f:
    corpus = f.read()
    
print(corpus)

In [None]:
#Hmm this all looks a bit dull. Never mind; it's an example.
#Let's see what datatype this is

In [5]:
type(corpus)# it's one long string....

str

# Processing

We have a few steps of processing to do:

* Tokenisation, (or splitting text into various kinds of 'short things' that can be statistically analysed).
* Standardising the text (including converting uppercase to lower, correcting spelling, find-and-replace operations to remove abbreviations, etc.).
* Removing irrelevancies (anything from punctuation to stopwords like 'the' or 'to' that are unhelpful for many kinds of analysis).
* Consolidating (including stemming and lemmatisation that strip words back to their 'root').
* Basic NLP (that put some of the small things back together into logically useful medium things, like multi-word noun or verb phrases and proper names).

In practice, most text-mining work will require that any given corpus undergo multiple steps, but the exact steps and the exact order of steps depends on the desired analysis to be done. 

Also - it is important to create new variables at the different stages of the process so that it is possible to return to previous stages easily.


## Tokenization

Our first step is to cut our 'one big thing' into 'lots of little things', known as tokens. These tokens are the unit of analysis, which might be chapters, sections, paragraphs, sentences, words, or something else.

As this corpus is already in the form of a string, we can tokenize it into both words and sentences.

### Word tokenizing

In [11]:
nltk.download('punkt')

from nltk import word_tokenize  # importing the word_tokenize function from nltk
corpus_words = word_tokenize(corpus) # Pass the corpus through word tokenize 
print(corpus_words[:100])    # [:100] slices the list so only the first 100 tokens are printed
type(corpus_words)      # Always good to know your variable type!

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sonja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['This', 'is', 'a', 'sample', 'corpus', '.', 'It', 'haz', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', '.', 'For', 'example', ',', 'it', 'has', 'both', '1972', 'and', 'ninety-six', '.', 'This', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', ',', 'but', 'not', 'always', '.', 'California', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'CA', '.', 'To', 'really', 'complicate', 'things', ',', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'U.K.', ',', 'the', 'UK', ',', 'the', 'United', 'Kingdom', ',', 'the', 'United', 'Kingdom', 'of', 'Great', 'Britain', 'and', 'The', 'United', 'Kingdom', 'of', 'Great', 'Britain', 'and', 'Northern', 'Ireland', 'becuase', 'sometimes', 'full', 'names', 'are', 'important', '.', 'Further', ',', 'here', 'is', 'a', 'bunch']


list

 - Each punctuation mark is its own token
 - Interestingly, UK and U.K. are each kept as a single token

word_tokenize is a useful function if you want to take a 'bag of words' approach to text-mining. This ignores how the words were used or in what order they originally appeared, making it easy to count how often each word occurs. You can glean a lot of info from this. It is important to note that this approach does not distinguish between parts of speech: e.g. "building" the verb is treated in the same way as "building" the noun.
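To make the counting idea concrete, here is a minimal sketch using `collections.Counter` from the standard library. The `tokens` list is a hypothetical stand-in for a lower-cased version of `corpus_words`; it is not taken from the notebook's output.

```python
from collections import Counter

# Hypothetical token list standing in for a lower-cased corpus_words
tokens = ["this", "is", "a", "sample", "corpus", ".",
          "this", "sample", "corpus", "also", "uses", "abbreviations", "."]

# Counter maps each distinct token to the number of times it appears
word_counts = Counter(tokens)

# most_common(n) returns the n most frequent tokens with their counts
top_three = word_counts.most_common(3)
print(top_three)

# Tokens that never appear simply count as zero
print(word_counts["building"])
```

Note that punctuation tokens are counted like any other token, which is one reason to filter them out before counting.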

### Sentence tokenizing

In [12]:
# importing sent_tokenize from nltk
from nltk import sent_tokenize

# Same again, but this time broken into sentences
corpus_sentences = sent_tokenize(corpus)
print(corpus_sentences[:10])        # Since these are sentences instead of words, we'll take the first 10

#check type
type(corpus_sentences)


['This is a sample corpus.', 'It haz some spelling errors and has numbers written two ways.', 'For example, it has both 1972 and ninety-six.', 'This sample corpus also uses abbreviations sometimes, but not always.', 'California is spelled out once but also written CA.', 'To really complicate things, another country name is written as the U.K., the UK, the United Kingdom, the United Kingdom of Great Britain and The United Kingdom of Great Britain and Northern Ireland becuase sometimes full names are important.', 'Further, here is a bunch of unrelated toxt just to fill up the space.', 'This privacy policy (â\x80\x9cPrivacy Policyâ\x80\x9d) is intended to inform you of some policies and practices regarding the collection, use, and disclosure of your Personal Information through our site and any other sites that links to this Privacy Policy (the â\x80\x9cSiteâ\x80\x9d).', 'We define â\x80\x9cPersonal Informationâ\x80\x9d as information that allows someone to identify you personally or cont

list

corpus_sentences is also a list of strings: the list starts and ends with square brackets and each item is wrapped in single quotes. Here the full stop at the end of each sentence is included within the sentence token.
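The odd `â\x80\x9c` sequences in the output look like curly quotation marks that were stored as UTF-8 but decoded as ISO-8859-1. That is an assumption about how sample_text.txt was saved, which I have not checked. The sketch below reproduces the effect on a single character rather than on the real file.

```python
# The left double quotation mark, as it would be stored in a UTF-8 file
curly_quote = "\u201c"
utf8_bytes = curly_quote.encode("utf-8")

# Decoding those bytes with the wrong codec gives three separate characters
garbled = utf8_bytes.decode("iso-8859-1")
print(repr(garbled))

# Decoding with the matching codec recovers the original character
recovered = utf8_bytes.decode("utf-8")
print(recovered == curly_quote)
```

If the assumption holds, opening the file with `encoding="utf-8"` instead would probably avoid the garbling, though the later outputs in these notes would then look different.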

### Standardisation

For a "bag of words" approach we don't need to distinguish between lower case and capitals, so we create a new variable, corpus_lower, holding the lower-cased tokens.

In [13]:
corpus_lower = [word.lower() for word in corpus_words]
print(corpus_lower[:100])

['this', 'is', 'a', 'sample', 'corpus', '.', 'it', 'haz', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', '.', 'for', 'example', ',', 'it', 'has', 'both', '1972', 'and', 'ninety-six', '.', 'this', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', ',', 'but', 'not', 'always', '.', 'california', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'ca', '.', 'to', 'really', 'complicate', 'things', ',', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'u.k.', ',', 'the', 'uk', ',', 'the', 'united', 'kingdom', ',', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'northern', 'ireland', 'becuase', 'sometimes', 'full', 'names', 'are', 'important', '.', 'further', ',', 'here', 'is', 'a', 'bunch']


In [18]:
#let's try the same thing for corpus_sentences and see whether it makes sense

corpus_sentences_lower = [sentence.lower() for sentence in corpus_sentences]
print(corpus_sentences_lower[:5])

['this is a sample corpus.', 'it haz some spelling errors and has numbers written two ways.', 'for example, it has both 1972 and ninety-six.', 'this sample corpus also uses abbreviations sometimes, but not always.', 'california is spelled out once but also written ca.']


In [None]:
#doable, not sure how useful this is.

### Spelling correction

Next we need to correct some spelling. There are several spellchecking packages written for Python, so we're using one of those. We're also experimenting with a pip install inside a Jupyter Notebook, which seems to work!

In [19]:
!pip install autocorrect
from autocorrect import Speller
check = Speller(lang='en')

Collecting autocorrect
  Downloading https://files.pythonhosted.org/packages/a1/83/9cecf8ea84b964b80205a081b808cc262160d6b1d6dc5c3dacd8c8e10b20/autocorrect-1.3.0.tar.gz (1.8MB)
Building wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py): started
  Building wheel for autocorrect (setup.py): finished with status 'done'
  Stored in directory: C:\Users\sonja\AppData\Local\pip\Cache\wheels\c6\de\f7\d24f92af3335a698d38a54a43b8b40dcb3e8168a18a7f6f8c1
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-1.3.0


You are using pip version 19.0.1, however version 20.2b1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


Next we iterate over our corpus (we'll take corpus_lower), checking and correcting each token where needed. Start with a new, empty list (I called mine 'corpus_correct_spell'). As we work through the list, each corrected word is appended to the new list.
Then, let's look at the first 100 words as before.

#### .....for word tokens

In [22]:
corpus_correct_spell = []

for word in corpus_lower:
    corpus_correct_spell.append(check(word))    

print(corpus_correct_spell[:100])

['this', 'is', 'a', 'sample', 'corpus', '.', 'it', 'had', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', '.', 'for', 'example', ',', 'it', 'has', 'both', '1972', 'and', 'ninety-six', '.', 'this', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', ',', 'but', 'not', 'always', '.', 'california', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'ca', '.', 'to', 'really', 'complicate', 'things', ',', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'u.k.', ',', 'the', 'uk', ',', 'the', 'united', 'kingdom', ',', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'northern', 'ireland', 'because', 'sometimes', 'full', 'names', 'are', 'important', '.', 'further', ',', 'here', 'is', 'a', 'bunch']


In [None]:
#not bad - haz turned to had, not has. Nothing is 100%
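One possible way to handle known misses like this is a small dictionary of manual corrections applied alongside the spellchecker. This is only a sketch: `manual_fixes` and the short `tokens` list are hypothetical and not part of the tutorial.

```python
# Hypothetical hand-made corrections for tokens the spellchecker gets wrong
manual_fixes = {"haz": "has"}

# Stand-in for a few tokens from corpus_lower
tokens = ["it", "haz", "some", "spelling", "errors"]

# Apply the manual fixes; anything not in the dictionary is left unchanged
tokens_fixed = [manual_fixes.get(token, token) for token in tokens]
print(tokens_fixed)
```

The trade-off is that such a list has to be built by reading the output, so it only suits corpora small enough to inspect.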

#### .... for sentence tokens

In [24]:
corpus_correct_sentence_spell = []

for sentence in corpus_sentences_lower:
    corpus_correct_sentence_spell.append(check(sentence))    

print(corpus_correct_sentence_spell[:10])

['this is a sample corpus.', 'it had some spelling errors and has numbers written two ways.', 'for example, it has both 1972 and ninety-six.', 'this sample corpus also uses abbreviations sometimes, but not always.', 'california is spelled out once but also written ca.', 'to really complicate things, another country name is written as the u.k., the uk, the united kingdom, the united kingdom of great britain and the united kingdom of great britain and northern ireland because sometimes full names are important.', 'further, here is a bunch of unrelated text just to fill up the space.', 'this privacy policy (â\x80\x9cprivacy policyâ\x80\x9d) is intended to inform you of some policies and practices regarding the collection, use, and disclosure of your personal information through our site and any other sites that links to this privacy policy (the â\x80\x9csiteâ\x80\x9d).', 'we define â\x80\x9cpersonal informationâ\x80\x9d as information that allows someone to identify you personally or cont

In [None]:
#Output comparable, just "sliced up" differently.

### Regular Expressions (RegEx) to replace specific terms

Regular Expressions are how search & replace works in text documents. But RegEx is actually more powerful than that, because you can use it to identify combinations of letters, numbers, symbols, spaces and more, some of which can be repeated more than once or can be optional. Lots more to learn on RegEx.
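As a small illustration of the "optional" part, the sketch below uses `\.?` (a full stop that may or may not be present) to treat 'u.k.' and 'uk' as the same token. The pattern and the token list are my own examples, not from the tutorial.

```python
import re

# 'u', an optional full stop, 'k', an optional full stop
uk_pattern = re.compile(r"u\.?k\.?")

# Hypothetical lower-cased tokens
tokens = ["u.k.", "uk", "ukulele", "united"]

# fullmatch only succeeds when the whole token fits the pattern,
# so 'ukulele' is left alone even though it starts with 'uk'
tokens_uk = ["uk" if uk_pattern.fullmatch(token) else token for token in tokens]
print(tokens_uk)
```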

In [25]:
corpus_numbers = [re.sub(r"ninety-six", "96", word) for word in corpus_correct_spell]  # Defines a new variable created by substituting
                                                    # '96' for 'ninety-six' in corpus_correct_spell

print(corpus_numbers[:100])                                            # Prints the first 100 items in the newly created corpus

['this', 'is', 'a', 'sample', 'corpus', '.', 'it', 'had', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', '.', 'for', 'example', ',', 'it', 'has', 'both', '1972', 'and', '96', '.', 'this', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', ',', 'but', 'not', 'always', '.', 'california', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'ca', '.', 'to', 'really', 'complicate', 'things', ',', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'u.k.', ',', 'the', 'uk', ',', 'the', 'united', 'kingdom', ',', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'northern', 'ireland', 'because', 'sometimes', 'full', 'names', 'are', 'important', '.', 'further', ',', 'here', 'is', 'a', 'bunch']


In [None]:
#Super but only works on 96. So we need to find a multiple replace option.
#Note: this function works on strings, so I applied it to 'corpus' our original 
#raw text. We can either put a step like this as the first step in a pipeline, 
#or we can adapt the code to iterate over a list of strings.

In [27]:
def multiple_replace(replacements, text):
  # Create a regular expression from the dictionary keys.
  # Alternatives are tried in dictionary order, so a short key listed first
  # (e.g. "United Kingdom") wins over a longer key that starts the same way.
  regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))

  # For each match, look up the corresponding value in the dictionary
  return regex.sub(lambda mo: replacements[mo.group(0)], text)

# Named 'replacements' rather than 'dict' so the built-in dict type is not shadowed
replacements = {
    "CA" : "California",
    "United Kingdom" : "U.K.",
    "United Kingdom of Great Britain and Northern Ireland" : "U.K.",
    "United Kingdom of Great Britain" : "U.K.",
    "UK" : "U.K.",
    "Privacy Policy" : "noodle soup",
}


In [30]:
#Now apply the function
#We have options about which variable to apply it to. I would like to apply it
#to the spell corrected, lower case, word tokenised corpus. However that gives
#me an error, I think because that variable is a list rather than a string.
#So I am going with the flow and applying it to a string - the raw corpus.

corpus_replace = multiple_replace(replacements, corpus)
print(corpus_replace)

This is a sample corpus. It haz some spelling errors and has numbers written two ways. For example, it has both 1972 and ninety-six. 

This sample corpus also uses abbreviations sometimes, but not always. California is spelled out once but also written California. 

To really complicate things, another country name is written as the U.K., the U.K., the U.K., the U.K. of Great Britain and The U.K. of Great Britain and Northern Ireland becuase sometimes full names are important. 

Further, here is a bunch of unrelated toxt just to fill up the space. 

This privacy policy (ânoodle soupâ) is intended to inform you of some policies and practices regarding the collection, use, and disclosure of your Personal Information through our site and any other sites that links to this noodle soup (the âSiteâ). We define âPersonal Informationâ as information that allows someone to identify you personally or contact you, including for example your name, address, telephone number, and email a
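In the output above, the longer country names were only partly replaced ("the U.K. of Great Britain"), because the shorter key "United Kingdom" is matched first. One way around this, sketched below, is to sort the keys longest first before building the pattern. The function name `multiple_replace_longest_first` and the sample string are my own; this is a variation on the tutorial's function, not the tutorial's method.

```python
import re

def multiple_replace_longest_first(replacements, text):
    # Sort keys so longer phrases are tried before their shorter prefixes
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # Look up each matched phrase in the dictionary
    return pattern.sub(lambda mo: replacements[mo.group(0)], text)

replacements = {
    "United Kingdom": "U.K.",
    "United Kingdom of Great Britain and Northern Ireland": "U.K.",
    "United Kingdom of Great Britain": "U.K.",
    "UK": "U.K.",
}

sample = ("the UK, the United Kingdom, the United Kingdom of Great Britain "
          "and The United Kingdom of Great Britain and Northern Ireland")

result = multiple_replace_longest_first(replacements, sample)
print(result)
```

Replaced text is not scanned again, so the 'U.K.' values that have just been inserted are not touched by the "UK" key.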

### Filtering out irrelevancies

Let's filter out punctuation. We can define a string that includes all the standard English language punctuation, and then use that to iterate over corpus_words, removing anything that matches.

But wait... Do we really want to remove the hyphen in 'ninety-six' or words like 'lactose-free'? Or the apostrophe in contractions or possessives?

There are no right or wrong answers here. Every project will have to decide, based on the research questions, what is the right choice for the specific context. 

Here we want to remove the full stops, even from 'u.k.', so that it becomes identical to 'uk'. However, we don't want to remove dashes or apostrophes: those are punctuation marks that occur in the middle of words and add meaning to the word.

In [31]:
English_punctuation = '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~“”'      # Define a variable with all the punctuation to remove
                                                               # (the backslash is escaped as '\\'; no hyphen or apostrophe).
print(English_punctuation)                                     # Print that defined variable, just to check it is correct.
print("...")                                                   # Print an ellipsis, just to make the output more readable.

table_punctuation = str.maketrans('','', English_punctuation)  # The python function 'maketrans' creates a table that maps
print(table_punctuation)                                       # the punctuation marks to 'None'. Print the table to check.
print("...")                                                   # Just to be clear, '!' is 33 in Unicode, '"' is 34, etc.
                                                               # 'None' is python for nothing, not a string of the word "none".
    
corpus_no_punct = [w.translate(table_punctuation) for w in corpus_words]  
                                                               # Iterate over corpus_words, turning punctuation to nothing.
print(corpus_no_punct[:100])                                   # Print the 1st 100 items in corpus_no_punct to check.

!"#$%&()*+,./:;<=>?@[\]^_`{|}~“”
...
{33: None, 34: None, 35: None, 36: None, 37: None, 38: None, 40: None, 41: None, 42: None, 43: None, 44: None, 46: None, 47: None, 58: None, 59: None, 60: None, 61: None, 62: None, 63: None, 64: None, 91: None, 92: None, 93: None, 94: None, 95: None, 96: None, 123: None, 124: None, 125: None, 126: None, 8220: None, 8221: None}
...
['This', 'is', 'a', 'sample', 'corpus', '', 'It', 'haz', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', '', 'For', 'example', '', 'it', 'has', 'both', '1972', 'and', 'ninety-six', '', 'This', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', '', 'but', 'not', 'always', '', 'California', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'CA', '', 'To', 'really', 'complicate', 'things', '', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'UK', '', 'the', 'UK', '', 'the', 'United', 'Kingdom', '', 'the', 'United', 'Kingdom', 'of', 'Great', 'Britain', 'and', 

In [32]:
corpus_no_space = list(filter(None, corpus_no_punct))     # This filters out the empty string from the no_punct list.

print(corpus_no_space[:100])

['This', 'is', 'a', 'sample', 'corpus', 'It', 'haz', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways', 'For', 'example', 'it', 'has', 'both', '1972', 'and', 'ninety-six', 'This', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', 'but', 'not', 'always', 'California', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'CA', 'To', 'really', 'complicate', 'things', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'UK', 'the', 'UK', 'the', 'United', 'Kingdom', 'the', 'United', 'Kingdom', 'of', 'Great', 'Britain', 'and', 'The', 'United', 'Kingdom', 'of', 'Great', 'Britain', 'and', 'Northern', 'Ireland', 'becuase', 'sometimes', 'full', 'names', 'are', 'important', 'Further', 'here', 'is', 'a', 'bunch', 'of', 'unrelated', 'toxt', 'just', 'to', 'fill', 'up', 'the', 'space', 'This', 'privacy', 'policy', 'â\x80\x9cPrivacy']


In [None]:
#hmm more to learn here on regex and applying.
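For comparison, here is a sketch of the same punctuation step applied to lower-cased tokens, so that 'u.k.' and 'uk' really do end up identical. The token list is a hypothetical sample, and `punctuation_to_remove` mirrors English_punctuation above without the curly quotes.

```python
# No hyphen and no apostrophe in the list, so those survive inside words
punctuation_to_remove = '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~'
table = str.maketrans("", "", punctuation_to_remove)

# Hypothetical lower-cased tokens
tokens = ["u.k.", ",", "uk", "ninety-six", "don't", "."]

# Strip the punctuation, then drop tokens that became empty strings
stripped = [token.translate(table) for token in tokens]
tokens_clean = [token for token in stripped if token]
print(tokens_clean)
```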

### Stopwords

From the tutorial:
     - Stopwords are typically conjunctions ('and', 'or'), prepositions ('to', 'around'), determiners ('the', 'an'), possessives ('s) and the like. They are REALLY common in all languages, and tend to occur at about the same ratio in all kinds of writing, regardless of who did the writing or what it is about. These words are definitely important for structure as they make all the difference between "Freeze or I'll shoot!" and "Freeze and I'll shoot!".

Buuuut... for many text-mining analyses, especially those that take the bag of words approach, these words don't have a whole lot of meaning in and of themselves. Thus, we want to remove them.

In [33]:
#Download the basic stopword list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sonja\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [34]:
#Let's see what they look like
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(sorted(stop_words))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

In [36]:
#Now let's remove those stop_words
#create another list called corpus_no_stop_words. 
#iterate over corpus_correct_spell, looking at them one by one and appending them
#to corpus_no_stop_words if and only if they do not match any of the items in the 
#stop_words list.

corpus_no_stop_words = []

for word in corpus_correct_spell:
    if word not in stop_words:
        corpus_no_stop_words.append(word)
        
        
print(corpus_no_stop_words[:100])

['sample', 'corpus', '.', 'spelling', 'errors', 'numbers', 'written', 'two', 'ways', '.', 'example', ',', '1972', 'ninety-six', '.', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', ',', 'always', '.', 'california', 'spelled', 'also', 'written', 'ca', '.', 'really', 'complicate', 'things', ',', 'another', 'country', 'name', 'written', 'u.k.', ',', 'uk', ',', 'united', 'kingdom', ',', 'united', 'kingdom', 'great', 'britain', 'united', 'kingdom', 'great', 'britain', 'northern', 'ireland', 'sometimes', 'full', 'names', 'important', '.', ',', 'bunch', 'unrelated', 'text', 'fill', 'space', '.', 'privacy', 'policy', '(', 'â\x80\x9cprivacy', 'policyâ\x80\x9d', ')', 'intended', 'inform', 'policies', 'practices', 'regarding', 'collection', ',', 'use', ',', 'disclosure', 'personal', 'information', 'site', 'sites', 'links', 'privacy', 'policy', '(', 'â\x80\x9csiteâ\x80\x9d', ')', '.', 'define', 'â\x80\x9cpersonal', 'informationâ\x80\x9d', 'information', 'allows', 'someone']


In [37]:
#Let's try this on corpus words
corpus_no_stop_words1 = []

for word in corpus_words:
    if word not in stop_words:
        corpus_no_stop_words1.append(word)
        
        
print(corpus_no_stop_words1[:100])

['This', 'sample', 'corpus', '.', 'It', 'haz', 'spelling', 'errors', 'numbers', 'written', 'two', 'ways', '.', 'For', 'example', ',', '1972', 'ninety-six', '.', 'This', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes', ',', 'always', '.', 'California', 'spelled', 'also', 'written', 'CA', '.', 'To', 'really', 'complicate', 'things', ',', 'another', 'country', 'name', 'written', 'U.K.', ',', 'UK', ',', 'United', 'Kingdom', ',', 'United', 'Kingdom', 'Great', 'Britain', 'The', 'United', 'Kingdom', 'Great', 'Britain', 'Northern', 'Ireland', 'becuase', 'sometimes', 'full', 'names', 'important', '.', 'Further', ',', 'bunch', 'unrelated', 'toxt', 'fill', 'space', '.', 'This', 'privacy', 'policy', '(', 'â\x80\x9cPrivacy', 'Policyâ\x80\x9d', ')', 'intended', 'inform', 'policies', 'practices', 'regarding', 'collection', ',', 'use', ',', 'disclosure', 'Personal', 'Information', 'site', 'sites', 'links', 'Privacy', 'Policy']


In [None]:
#Doesn't work so well - as stopwords may be lower or upper case and this is case
#sensitive
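One way to make the comparison case-insensitive without losing the original capitalisation is to lower-case each token only for the lookup. This is a sketch: `example_stop_words` is a small hypothetical set standing in for the nltk `stop_words`, and the token list is made up.

```python
# Small hypothetical stopword set; in the notebook this would be stop_words from nltk
example_stop_words = {"this", "is", "a", "it", "the"}

# Hypothetical tokens with mixed case, standing in for corpus_words
tokens = ["This", "is", "a", "sample", "corpus", ".", "It", "haz", "The", "UK"]

# Compare the lower-cased token against the stopword set, but keep the original token
tokens_no_stop = [token for token in tokens if token.lower() not in example_stop_words]
print(tokens_no_stop)
```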

### Consolidation - stemming



Stemming strips word endings so that related forms collapse into one token. In the output below, 'sample' has become 'sampl', which collapses 'sampled' together with 'samples', 'sampling' and 'sample'. This puts plurals and verb tenses all in the same form so they can be counted as instances of the "same" word.

We could run with this or decide to do more cleaning. If we want to keep nouns and verbs separate, we need lemmatisation.


In [39]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
corpus_stemmed = [porter.stem(word) for word in corpus_no_stop_words]
print(corpus_stemmed[:100])

['sampl', 'corpu', '.', 'spell', 'error', 'number', 'written', 'two', 'way', '.', 'exampl', ',', '1972', 'ninety-six', '.', 'sampl', 'corpu', 'also', 'use', 'abbrevi', 'sometim', ',', 'alway', '.', 'california', 'spell', 'also', 'written', 'ca', '.', 'realli', 'complic', 'thing', ',', 'anoth', 'countri', 'name', 'written', 'u.k.', ',', 'uk', ',', 'unit', 'kingdom', ',', 'unit', 'kingdom', 'great', 'britain', 'unit', 'kingdom', 'great', 'britain', 'northern', 'ireland', 'sometim', 'full', 'name', 'import', '.', ',', 'bunch', 'unrel', 'text', 'fill', 'space', '.', 'privaci', 'polici', '(', 'â\x80\x9cprivaci', 'policyâ\x80\x9d', ')', 'intend', 'inform', 'polici', 'practic', 'regard', 'collect', ',', 'use', ',', 'disclosur', 'person', 'inform', 'site', 'site', 'link', 'privaci', 'polici', '(', 'â\x80\x9csiteâ\x80\x9d', ')', '.', 'defin', 'â\x80\x9cperson', 'informationâ\x80\x9d', 'inform', 'allow', 'someon']


### Lemmatisation
The results below show that our examples produce good output: 'rocks', 'corpora' and 'cares' are all de-pluralised correctly. The examples with part of speech tags also show that 'caring' and 'cared' are both correctly converted to 'care' as the base verb.

In [40]:
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 
 
print('rocks :', lemmatizer.lemmatize('rocks'))              #a few examples of lemmatising as a de-pluraliser
print('corpora :', lemmatizer.lemmatize('corpora'))
print('cares :', lemmatizer.lemmatize('cares'))              #no part of speech tag supplied, so 'cares' is treated as noun
print('caring :', lemmatizer.lemmatize('caring', pos = "v")) #when part of speech tag added, 'caring' is treated as verb             
print('cared :', lemmatizer.lemmatize('cared', pos = "v"))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sonja\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
rocks : rock
corpora : corpus
cares : care
caring : care
cared : care


In [41]:
corpus_lemmed = [lemmatizer.lemmatize(word) for word in corpus_correct_spell]

print(corpus_lemmed[:100])

['this', 'is', 'a', 'sample', 'corpus', '.', 'it', 'had', 'some', 'spelling', 'error', 'and', 'ha', 'number', 'written', 'two', 'way', '.', 'for', 'example', ',', 'it', 'ha', 'both', '1972', 'and', 'ninety-six', '.', 'this', 'sample', 'corpus', 'also', 'us', 'abbreviation', 'sometimes', ',', 'but', 'not', 'always', '.', 'california', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'ca', '.', 'to', 'really', 'complicate', 'thing', ',', 'another', 'country', 'name', 'is', 'written', 'a', 'the', 'u.k.', ',', 'the', 'uk', ',', 'the', 'united', 'kingdom', ',', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'northern', 'ireland', 'because', 'sometimes', 'full', 'name', 'are', 'important', '.', 'further', ',', 'here', 'is', 'a', 'bunch']


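Results are mixed.
No part of speech tags were supplied, so every token was treated as a noun.
The corpus has been effectively de-pluralised, but the different verb tenses remain (e.g. 'written', 'spelled'), and some words were wrongly trimmed as if they were plurals ('has' became 'ha', 'uses' became 'us', 'as' became 'a').
Part of speech tags will be covered in the next Jupyter Notebook!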