# I.  Text Cleaning and Preparation:  Introduction

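This notebook walks through basic text cleaning and preparation: loading a text file, normalizing whitespace, removing editorial text, tokenizing, removing stopwords, stemming, and lemmatizing.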

# II.  Setup the Environment

We'll focus on using the NLTK package to clean and process our text for final analysis, with spaCy used for lemmatization.  Comments in the code identify each of the packages being loaded.  In each case, you can refer to the package documentation for more specific information about the package being used.  You must run the code cells below to properly prepare your environment to perform the text mining and analysis tasks presented in this module.

In [None]:
# update the Colab environment to the latest version of NLTK
# documentation: https://www.nltk.org/
!pip install nltk -U

In [None]:
# update to the latest version of spaCy
# documentation: https://spacy.io/
!pip install spacy -U

In [None]:
# import the base nltk package
import nltk

# load the nltk tokenize module
from nltk.tokenize import word_tokenize

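# download the punkt tokenizer models
# (recent NLTK releases look for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')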

# import nltk stopword module
from nltk.corpus import stopwords

# download the stopword list
nltk.download('stopwords')

# import the nltk porter stemmer
from nltk.stem.porter import PorterStemmer

# import the nltk lemmatizer
from nltk.stem import WordNetLemmatizer

# download nltk wordnet model
nltk.download('wordnet')

# import the regular expressions package
import re

# import the string package
import string

# import spaCy
import spacy

In [None]:
# download and install the spaCy English language model
!python -m spacy download en_core_web_sm

# load the model into a pipeline object
sp = spacy.load('en_core_web_sm')

# III.  Load the Text File

In [None]:
# mount Google Drive so the notebook can read files stored there
from google.colab import drive
drive.mount('/gdrive/')

Mounted at /gdrive/


In [None]:
working_file_path = "/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/melville.txt"

Now that you've defined a file to load, we can open the file and read its contents into a string variable.

In [None]:
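# open the text file and read its contents into a string variable
# (assumes the file is UTF-8 encoded); the with-block closes the
# file automatically once the read is done
with open(working_file_path, "r", encoding="utf-8") as working_file:
    working_text = working_file.read()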

You can check that your file loaded by checking the length and examining the opening characters of the working_text variable.

In [None]:
# print the character length of our working text
# and the first 600 characters
print('Characters in string:', len(working_text))
print(working_text[:600])

# IV.  Convert to Lowercase
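
A minimal sketch of this step, assuming case distinctions are not needed for your analysis.  The `sample_text` string is a hypothetical stand-in for `working_text`.  Lowercasing matters for later steps because NLTK's English stopword list is all lowercase, but it also discards cues such as capitalized proper nouns, so whether to apply it depends on your research question.

```python
# hypothetical sample standing in for working_text
sample_text = "Call me Ishmael. Some years ago"

# str.lower() returns a new lowercase string; the original is unchanged
lowered_text = sample_text.lower()

print(lowered_text)
```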

# V.  Remove Newline Characters and Strip Spaces

In [None]:
# define a pattern for finding newlines
pattern = re.compile(r"\n")
# run the replacement: each newline becomes a space
working_text = re.sub(pattern, " ", working_text)

In [None]:
# define a pattern for finding runs of whitespace
pattern = re.compile(r"\s+")
# run the replacement
working_text = re.sub(pattern, " ", working_text)

In [None]:
# strip leading and trailing spaces
working_text = working_text.strip()
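
The three cells above can also be folded into one helper.  This is a sketch rather than a replacement for the cells: `normalize_whitespace` and the sample string are hypothetical names introduced for illustration.  Because `\s` already matches newlines, a single pattern covers both replacements.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one
    space, then strip leading and trailing spaces."""
    return re.sub(r"\s+", " ", text).strip()

# hypothetical sample with newlines, repeated spaces, and padding
sample = "  Call me\nIshmael.\r\n\r\nSome   years ago  "
cleaned = normalize_whitespace(sample)
print(repr(cleaned))
```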

# VI. Remove Editorial Text

In [None]:
# print the length and the first 600 characters of the string
print(len(working_text))
print(working_text[:600])

In [None]:
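# define a pattern and remove the opening text;
# the greedy .* runs through the last "Chapter 1. Loomings" heading,
# so any earlier occurrence (e.g. in a table of contents) is removed too
pattern = re.compile(r"^.*chapter 1\. Loomings\.?", re.IGNORECASE)
working_text = re.sub(pattern, "", working_text)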

In [None]:
# define a pattern and remove the closing text
pattern = re.compile(r"End of Project Gutenberg's.*", re.IGNORECASE)
working_text = re.sub(pattern, "", working_text)

In [None]:
# print the length and the closing characters of the string
print(len(working_text))
print(working_text[-46:])
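
To see how the two patterns behave, here is a sketch on a hypothetical miniature of a Project Gutenberg file (the `sample` string is invented for illustration and already has its newlines removed).  The chapter heading appears twice, once in the contents list and once at the start of the chapter, and the greedy `.*` removes everything through the second one.

```python
import re

# hypothetical miniature of a Gutenberg file after whitespace cleanup
sample = (
    "Produced by a volunteer. CONTENTS CHAPTER 1. Loomings. "
    "CHAPTER 2. The Carpet-Bag. "
    "CHAPTER 1. Loomings. Call me Ishmael. "
    "End of Project Gutenberg's sample text. License follows."
)

# greedy .* backtracks from the end, so it stops at the LAST chapter heading
opening = re.compile(r"^.*chapter 1\. Loomings\.?", re.IGNORECASE)
# matches from the closing marker through the end of the string
closing = re.compile(r"End of Project Gutenberg's.*", re.IGNORECASE)

body = closing.sub("", opening.sub("", sample)).strip()
print(body)
```

Note that `.` does not match newlines by default, so these patterns rely on the newlines having been removed first.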

# VII. Save Clean Blob Version

It's a good idea to set aside a version of the minimally cleaned text as a single blob for later use.

In [None]:
blob_text = working_text

# Tokenize and Clean

For the rest of the cleaning, we tokenize first and then clean the tokens, because NLTK's functions generally operate on tokens rather than on one long string.

In [None]:
# tokenize on words
tokens = word_tokenize(working_text)


In [None]:
# look at the results
print(tokens[:10])

In [None]:
# remove punctuation from each word
table = str.maketrans('', '', string.punctuation)
filtered_tokens = [w.translate(table) for w in tokens]

In [None]:
# look at the results
print(filtered_tokens[:10])

In [None]:
# remove remaining tokens that are not alphabetic
filtered_tokens = [word for word in filtered_tokens if word.isalpha()]

In [None]:
# look at the results
print(filtered_tokens[:10])
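
The two filtering steps above interact in a way that is easy to miss, so here is a sketch on a hypothetical token list (`sample_tokens` is invented, shaped roughly like tokenizer output).  Stripping punctuation turns punctuation-only tokens into empty strings and joins hyphenated words, and the `isalpha()` filter then drops the empty strings along with anything containing digits.

```python
import string

# hypothetical tokens, similar in shape to word_tokenize output
sample_tokens = ["Call", "me", "Ishmael", ".", "It", "'s", "1851", "whale-ship", ","]

# translation table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)
stripped = [w.translate(table) for w in sample_tokens]

# isalpha() is False for empty strings and for strings containing digits
alphabetic = [w for w in stripped if w.isalpha()]

print(stripped)
print(alphabetic)
```

Note that `string.punctuation` covers only ASCII punctuation, so characters such as curly quotes are not stripped by the translation table.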

In [None]:
# load the NLTK stopword list
stop_words = set(stopwords.words('english'))


In [None]:
# review the stopwords
print(stop_words)

Note that, depending on your research question, you might want to modify the stopword list.  You can define a list of words you want to remove from the stopword list and filter them out, add other words to the list, or build your own list from scratch.
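
One way to make such changes is with set operations.  The sketch below uses a small hypothetical set in place of `set(stopwords.words('english'))`, and the words kept and added are arbitrary choices for illustration, not recommendations.

```python
# small hypothetical stand-in for set(stopwords.words('english'))
stop_words = {"the", "a", "of", "and", "he", "she", "not"}

# words to keep in the text even though they are on the default list
words_to_keep = {"he", "she", "not"}
# extra words to treat as stopwords for this project (hypothetical choices)
words_to_add = {"chapter", "thou"}

# set difference removes entries; set union adds them
custom_stop_words = (stop_words - words_to_keep) | words_to_add

# compare in lowercase so capitalized tokens are matched as well
tokens = ["Chapter", "the", "whale", "and", "he", "thou"]
kept = [w for w in tokens if w.lower() not in custom_stop_words]
print(kept)
```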

In [None]:
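# remove the stopwords; compare in lowercase because the NLTK
# stopword list is all lowercase and our tokens are not
filtered_tokens = [w for w in filtered_tokens if w.lower() not in stop_words]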

In [None]:
# look at the results
print(filtered_tokens[:10])

# Stemming



Stemming uses a rules-based algorithm to remove plural endings, "ing" endings, and the like from words as a means of reducing variation.  [See the Wikipedia article on stemming for a good overview](https://en.wikipedia.org/wiki/Stemming).
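
As a small illustration, the sketch below runs NLTK's `PorterStemmer` on a hypothetical word list (`sample_words` is invented for this example).  Several inflected forms collapse to one stem, and a stem is not guaranteed to be a dictionary word.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# hypothetical words chosen to show inflected forms sharing a stem
sample_words = ["sailing", "sailed", "sails", "whales", "ponies"]

# map each word to its stem
stems = {word: stemmer.stem(word) for word in sample_words}
for word, stem in stems.items():
    print(word, "->", stem)
```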

In [None]:
# instantiate a Porter stemmer object
p_stemmer = PorterStemmer()

# run the stemmer on our list of filtered words
stemmed_tokens = [p_stemmer.stem(word) for word in filtered_tokens]

In [None]:
# view the results
print(stemmed_tokens[:100])

# Lemmatizing

Lemmatization is an NLP-based reduction method that uses language models to reduce all variants of a word (is, are, am, etc.) to a common linguistic root (be).  It generally improves the quality of semantic models because it reduces lexical variation in favor of semantic sameness.  However, in many cases we care a lot about particular word forms, and lemmatizing, like stemming, often erases these differences, so think carefully before applying either.

In [None]:
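# run the spaCy pipeline on the first 100 characters of the blob text
sp_text = sp(blob_text[:100])
# print each token, its lemma hash ID (lemma), and its lemma string (lemma_)
for word in sp_text:
    print(word.text, word.lemma, word.lemma_)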
