# I.  Text Cleaning and Preparation:  Introduction

This workbook presents some basic code for text cleaning and preparation.  While some of the code looks for text patterns specific to the Project Gutenberg EBook used as an example, the process presented is generic and applies to most text cleaning and preparation workflows.

# II.  Set Up the Environment

We'll focus on using the NLTK package to clean and process our text for final analysis.  Comments in the code identify each of the packages being loaded.  In each case, you can refer to the package documentation for more specific information about the package being used.  You must run the code cells below to prepare your environment to perform the text mining and analysis tasks presented in this module.

In [None]:
# update Colab environment to latest version of NLTK
# documentation: https://www.nltk.org/
!pip install nltk -U

In [None]:
# update to the latest version of spaCy
# https://spacy.io/
!pip install spacy -U

In [None]:
# import the base nltk package
# and required modules
# https://www.nltk.org/
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

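# download nltk language models and data
nltk.download('wordnet')
nltk.download('punkt')
# newer NLTK releases (3.9+) load the tokenizer data from 'punkt_tab' rather than 'punkt'
nltk.download('punkt_tab')
nltk.download('stopwords')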

# import the python regular expression package
# https://docs.python.org/3/library/re.html
import re

# import the string package
# https://docs.python.org/3/library/string.html
import string

# import Spacy NLP package
# https://spacy.io/
import spacy

In [None]:
# download and install the spaCy language model
!python -m spacy download en_core_web_sm
# load the model into a pipeline object named sp
sp = spacy.load('en_core_web_sm')

# III.  Load the Text File

In [None]:
from google.colab import drive
drive.mount('/gdrive/')

In [6]:
working_file_path = "/gdrive/MyDrive/rbs_digital_approaches_2021/s2_data_class/melville.txt"

Now that you've defined a file to load, we can open the file and read its contents into a string variable.

In [7]:
# open a text file for processing and read its contents into a string variable;
# the with-block closes the file automatically once the read is done
with open(working_file_path, "r") as working_file:
    working_text = working_file.read()

You can check that your file loaded by checking the length and examining the opening characters of the working_text variable.

In [None]:
# print the character length of our working text
# and the first several characters
print('Characters in string:', len(working_text))
print(working_text[:600:1])

Now let's print a "representation" version of the string, which shows hidden characters such as newlines as escape sequences:

In [None]:
print(repr(working_text[:600:1]))
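
The difference between the two outputs is easiest to see on a short string.  A minimal sketch, using a made-up `sample` string (not taken from the Melville file) that contains a tab, a Windows-style line ending, and a blank line:

```python
# hypothetical sample with hidden characters: carriage return, newlines, tab
sample = "MOBY-DICK;\r\n\tor, THE WHALE.\n\nBy Herman Melville"

# print() renders the control characters as layout
print(sample)

# repr() shows them as escape sequences such as \r, \n and \t
shown = repr(sample)
print(shown)
```

Seeing the escape sequences makes it clear which characters the cleaning patterns later in this workbook need to target.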

# IV.  Convert to Lowercase
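
Lowercasing is a one-line operation in Python.  A minimal sketch on a made-up string; `sample_text` and `lowered` are illustrative names and are not used elsewhere in this workbook:

```python
# hypothetical sample; `sample_text` is an illustrative name only
sample_text = "CHAPTER 1. Loomings. Call me Ishmael."

# str.lower() returns a new lowercase string; the original is unchanged
lowered = sample_text.lower()
print(lowered)
```

Whether to lowercase the full text is a judgment call.  NLTK's English stopword list is all lowercase, so a capitalized token such as "The" will not match it unless the text has been lowercased first.  On the other hand, lowercasing erases the distinction between proper nouns and common words.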

# V.  Remove Newline Characters and Strip Spaces

In [10]:
# define a pattern for finding newlines
pattern = re.compile(r"\n")
# run the replacement
working_text = re.sub(pattern, " ", working_text)

In [11]:
# define a pattern for finding runs of whitespace
pattern = re.compile(r"\s+")
# run the replacement
working_text = re.sub(pattern, " ", working_text)

In [12]:
# strip leading and trailing spaces
working_text = working_text.strip()
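
The effect of these steps can be checked on a small string.  A minimal sketch with a made-up `sample`; note that `\s` also matches newlines, tabs and carriage returns, so the whitespace pattern on its own is enough for this sample:

```python
import re

# hypothetical sample with mixed whitespace
sample = "  Call me\nIshmael.\r\n\r\nSome   years\tago  "

# collapse every run of whitespace to one space, then trim both ends
collapsed = re.sub(r"\s+", " ", sample).strip()
print(collapsed)
```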

Now let's look at the state of the text.

In [None]:
print(working_text)

# VI.  Remove Paratext

In [14]:
# define a pattern and remove the opening text
pattern = re.compile(r"^.*chapter 1\. Loomings\.?", re.IGNORECASE)
working_text = re.sub(pattern, "", working_text)

In [15]:
# define a pattern and remove the closing text
pattern = re.compile(r"End of Project Gutenberg's.*", re.IGNORECASE)
working_text = re.sub(pattern, "", working_text)
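
The opening pattern relies on `.*` being greedy: it consumes as much text as it can, so the match ends at the last occurrence of the chapter title (the chapter heading itself) rather than the first (the table of contents entry).  A minimal sketch on a made-up miniature of a Gutenberg file; the `sample` text is invented for illustration:

```python
import re

# hypothetical miniature of a Gutenberg file after whitespace normalization
sample = (
    "The Project Gutenberg EBook of Moby Dick CONTENTS CHAPTER 1. Loomings. "
    "CHAPTER 2. The Carpet-Bag. CHAPTER 1. Loomings. Call me Ishmael. "
    "It is done. End of Project Gutenberg's Moby Dick License text here"
)

opening = re.compile(r"^.*chapter 1\. Loomings\.?", re.IGNORECASE)
closing = re.compile(r"End of Project Gutenberg's.*", re.IGNORECASE)

# greedy .* runs through the contents entry up to the second chapter title
body = opening.sub("", sample)
# everything from the closing marker to the end of the string is dropped
body = closing.sub("", body).strip()
print(body)
```

Because `.` does not match a newline unless `re.DOTALL` is set, the opening pattern only reaches across the whole front matter once the newlines have been replaced with spaces.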

In [None]:
# look at the first 50 characters of the string
print(working_text[0:50:1])

In [None]:
# look at the last 50 characters of the string
print(working_text[-50:])

# VII.  Save Clean Blob Version

It's a good idea to set aside a version of the minimally cleaned text as a single blob for later use.

In [21]:
blob_text = working_text
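
Plain assignment is enough to set the text aside, because Python strings are immutable: later steps that assign a new value to `working_text` create a new string and leave the saved one untouched.  A minimal sketch with a made-up string:

```python
working_text = "Call me Ishmael."
blob_text = working_text  # both names refer to the same string

# a later cleaning step rebinds working_text to a new string...
working_text = working_text.lower()

# ...while blob_text still holds the earlier version
print(blob_text)
```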

# Tokenize and Clean

For the rest of this stage of cleaning, we'll tokenize first and then clean, because NLTK likes it that way.

In [22]:
# tokenize on words
tokens = word_tokenize(working_text)


In [None]:
# look at the results
print(tokens[:10])
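
`word_tokenize` returns a flat list of strings in which punctuation marks appear as tokens of their own.  A rough approximation of that behavior with a regular expression, offered only as an illustration (it is not NLTK's algorithm, and it handles contractions and abbreviations differently):

```python
import re

# hypothetical sample sentence
sample = "Call me Ishmael. Some years ago, never mind."

# a run of word characters is one token; any other non-space character
# (punctuation) becomes a token by itself
rough_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(rough_tokens)
```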

In [24]:
# remove punctuation from each word
table = str.maketrans('', '', string.punctuation)
filtered_tokens = [w.translate(table) for w in tokens]

In [None]:
# look at the results
print(filtered_tokens[:10])

In [26]:
# remove remaining tokens that are not purely alphabetic
# (this also drops numbers and the empty strings left by the previous step)
filtered_tokens = [word for word in filtered_tokens if word.isalpha()]
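
The two filtering steps work as a pair: stripping punctuation turns a token such as "." into an empty string, and the `isalpha()` check then discards empty strings and numbers.  A minimal sketch on a hand-written token list (not produced by `word_tokenize`):

```python
import string

# hypothetical hand-written tokens
sample_tokens = ["Call", "me", "Ishmael", ".", "whale-ship", "1851", "--"]

# step 1: delete every ASCII punctuation character inside each token
table = str.maketrans("", "", string.punctuation)
stripped = [t.translate(table) for t in sample_tokens]
print(stripped)

# step 2: keep only tokens made up entirely of letters
alpha_only = [t for t in stripped if t.isalpha()]
print(alpha_only)
```

`string.punctuation` covers ASCII punctuation only, so typographic characters such as curly quotes pass through step 1 untouched; a token consisting only of such characters is still removed by step 2.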

In [None]:
# look at the results
print(filtered_tokens[:50])

In [29]:
# load the NLTK stopword list
stop_words = set(stopwords.words('english'))


In [None]:
# review the stopwords
print(stop_words)

Note that, depending on your research question, you might want to modify the stopword list.  You can reuse code from above (for removing spaces, etc.) to create a list of words you want to remove from the stopword list.  Alternatively, you can add other words to this list or build your own from scratch.
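
Since the stopword list is held in a Python set, set operations are a convenient way to customize it.  A minimal sketch, using a small stand-in set so the example runs on its own; `stop_words_custom` and the words chosen are illustrative assumptions:

```python
# small stand-in for the NLTK stopword set (illustrative only)
stop_words = {"the", "a", "of", "and", "not", "is"}

# keep "not" in the text by removing it from the stopwords,
# and treat two archaic pronouns as stopwords
stop_words_custom = (stop_words - {"not"}) | {"thee", "thou"}

sample_tokens = ["the", "whale", "is", "not", "thee"]
kept = [w for w in sample_tokens if w not in stop_words_custom]
print(kept)
```

Building a new set rather than changing `stop_words` in place leaves the original list available for comparison.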

In [31]:
# remove the stopwords
filtered_tokens = [w for w in filtered_tokens if w not in stop_words]

In [None]:
# look at the results
print(filtered_tokens[:50])

# Stemming



Stemming uses a rules-based algorithm to remove plural endings, "ing" endings and the like from words as a means of reducing variation.  [See the Wikipedia article here for a good overview](https://en.wikipedia.org/wiki/Stemming).

In [34]:
# instantiate a porter stemmer class object
p_stemmer = PorterStemmer()

#  run the stemmer on our list of filtered words
stemmed_tokens = [p_stemmer.stem(word) for word in filtered_tokens]

In [None]:
# view the results
print(stemmed_tokens[:100])
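
To make the idea of rule-based suffix stripping concrete, here is a toy stemmer.  It is a deliberately crude sketch and not the Porter algorithm, which applies a much longer, ordered set of rules; `toy_stem` is a hypothetical helper written for this illustration:

```python
def toy_stem(word):
    """Strip a few common suffixes; a toy illustration, not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["whales", "sailing", "hunted", "sea", "during"]
stems = [toy_stem(w) for w in words]
print(stems)
```

The last word shows the weakness of purely mechanical rules: "during" loses its ending even though it is not a verb form, and the result is not a word.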

# Lemmatizing

Lemmatization is an NLP-based reduction method that uses language models to reduce all variants of a word (is, are, am, etc.) to a common linguistic root (be).  It generally improves the quality of semantic models because it reduces lexical variation in favor of semantic sameness.  However, in many cases we care a lot about particular word forms, and lemmatizing, like stemming, erases these differences, so think carefully before applying either.  Also note that the spaCy lemmatizer operates on a full-text blob, not on a tokenized list of words.  (Remember, we saved this above for future use in a variable named blob_text.)

In [None]:
sp_text = sp(blob_text[:100:1])
for word in sp_text:
  print(word.text, word.lemma_)
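
Irregular forms are where lemmatization differs most from stemming: no suffix rule turns "am" or "are" into "be".  A toy sketch of the idea using a hand-written lookup table; `TOY_LEMMAS` and `toy_lemma` are hypothetical names, and spaCy's actual lemmatizer is considerably more sophisticated than a table lookup:

```python
# toy lookup table standing in for a real language model (illustrative only)
TOY_LEMMAS = {"is": "be", "are": "be", "am": "be", "was": "be", "whales": "whale"}

def toy_lemma(word):
    """Return the lemma from the toy table, or the word itself if unknown."""
    return TOY_LEMMAS.get(word.lower(), word)

words = ["I", "am", "here", "whales", "are", "swimming"]
lemmas = [toy_lemma(w) for w in words]
print(lemmas)
```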

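Before we can lemmatize the entire text, we have to raise spaCy's max_length setting, the maximum number of characters the pipeline will accept in a single text (1,000,000 by default).  The limit guards against running out of memory, and our text is longer than the default allows.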

In [None]:
# get the character length of the text blob
len(blob_text)

In [44]:
# set the max character length of the spacy object so the whole blob fits
sp.max_length = max(sp.max_length, len(blob_text))
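
An alternative to raising the limit is to split the text into pieces that each stay under it and process them one at a time.  A minimal sketch of such a splitter, assuming the text has already been normalized to single spaces as above; `split_into_chunks` is a hypothetical helper, not part of spaCy:

```python
def split_into_chunks(text, max_chars):
    """Split text into pieces of at most max_chars, breaking only at spaces.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "Call me Ishmael. Some years ago never mind how long precisely"
chunks = split_into_chunks(sample, 20)
print(chunks)
```

Each chunk could then be passed to `sp` separately.  Splitting at arbitrary spaces can cut a sentence in two, which may affect the results near chunk boundaries, so raising `max_length` is the simpler choice when memory allows.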

Here we run the code to perform the lemmatization.  Note that this will take several minutes to run.

In [47]:
lemma_tokens = [word.lemma_ for word in sp(blob_text)]

Now look at the output.

In [None]:
print(lemma_tokens[:50])

Now we need to remove spaces, non-alphabetic tokens, etc. from our lemmatized list to get to clean text.  Note that this is exactly the same process we ran on our original list of word tokens above.

In [49]:
# remove punctuation from the list
table2 = str.maketrans('', '', string.punctuation)
filtered_lem_tokens = [w.translate(table2) for w in lemma_tokens]

# remove remaining tokens that are not purely alphabetic
filtered_lem_tokens = [word for word in filtered_lem_tokens if word.isalpha()]

# remove the stopwords
filtered_lem_tokens = [w for w in filtered_lem_tokens if w not in stop_words]
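
Because the same three steps are applied to both token lists, they could be wrapped in one function.  A minimal sketch; `clean_tokens` is a hypothetical helper, and the small stopword set stands in for the NLTK list so the example runs on its own:

```python
import string

def clean_tokens(tokens, stop_words):
    """Strip punctuation, then drop non-alphabetic tokens and stopwords."""
    table = str.maketrans("", "", string.punctuation)
    stripped = (t.translate(table) for t in tokens)
    return [t for t in stripped if t.isalpha() and t not in stop_words]

# hypothetical lemma list, including a punctuation token and a whitespace token
sample_lemmas = ["call", "be", ",", "the", "whale", "1851", " "]
cleaned = clean_tokens(sample_lemmas, {"the", "be"})
print(cleaned)
```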

Now let's look at our lemmatized token list.



In [None]:
print(filtered_lem_tokens[0:50:1])