# Preprocessing text

We are going to clean a text as a preprocessing step before using machine learning algorithms. 

The selected text is "Pride and Prejudice" by Jane Austen, downloaded from Project Gutenberg as UTF-8 plain text (http://www.gutenberg.org/ebooks/1342).

The first step is to load the text as data.

In [1]:
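# load text (the Project Gutenberg file is UTF-8, so state the encoding explicitly)
filename = 'PrideandPrejudice_clean.txt'
# the with block closes the file even if reading fails
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()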
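
The file name suggests that the Project Gutenberg license header and footer were already removed by hand. When starting from the raw download instead, a minimal sketch of that step could look like the following. It assumes the raw file delimits the book with lines beginning with `*** START OF` and `*** END OF`; the helper name `strip_gutenberg_boilerplate` is hypothetical, and the marker wording should be checked against your own copy.

```python
def strip_gutenberg_boilerplate(raw, start_marker='*** START OF', end_marker='*** END OF'):
    """Return the text between the start and end marker lines.

    Assumption: the markers are the usual Project Gutenberg delimiters.
    If a marker is missing, nothing is cut on that side.
    """
    lines = raw.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith(start_marker):
            start = i + 1          # the book begins after the start marker
        elif line.startswith(end_marker):
            end = i                # the book stops before the end marker
            break
    return '\n'.join(lines[start:end]).strip()


# Tiny synthetic sample standing in for the raw download.
sample = (
    'License header\n'
    '*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n'
    'PRIDE AND PREJUDICE\n'
    '\n'
    'By Jane Austen\n'
    '*** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n'
    'License footer\n'
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```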

### Manual cleaning

Now we are going to clean this text manually; later we will use the NLTK package to do the same automatically.

First we split the text on whitespace.

In [2]:
# split into words by white space
words = text.split()
print(words[:100])

['PRIDE', 'AND', 'PREJUDICE', 'By', 'Jane', 'Austen', 'Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged,', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune,', 'must', 'be', 'in', 'want', 'of', 'a', 'wife.', 'However', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood,', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families,', 'that', 'he', 'is', 'considered', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters.', '“My', 'dear', 'Mr.', 'Bennet,”', 'said', 'his', 'lady', 'to', 'him', 'one', 'day,', '“have', 'you', 'heard', 'that', 'Netherfield', 'Park', 'is', 'let', 'at', 'last?”', 'Mr.']


Using split we get a list of words with the punctuation still attached (e.g., 'wife.'). Instead of this approach we could use a regular expression, or we could remove the punctuation marks. We will do the latter, using the fact that Python gives us a list of such marks.

In [3]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [4]:
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:100])

['PRIDE', 'AND', 'PREJUDICE', 'By', 'Jane', 'Austen', 'Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', 'However', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families', 'that', 'he', 'is', 'considered', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters', '“My', 'dear', 'Mr', 'Bennet”', 'said', 'his', 'lady', 'to', 'him', 'one', 'day', '“have', 'you', 'heard', 'that', 'Netherfield', 'Park', 'is', 'let', 'at', 'last”', 'Mr']


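Some punctuation marks have not been removed, such as the curly quotes “ and ”. They are not ASCII characters, so they are not part of string.punctuation and we have to strip them separately.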

In [5]:
table = str.maketrans('', '', '“”')
stripped_full = [w.translate(table) for w in stripped]
print(stripped_full[:100])

['PRIDE', 'AND', 'PREJUDICE', 'By', 'Jane', 'Austen', 'Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', 'However', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families', 'that', 'he', 'is', 'considered', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters', 'My', 'dear', 'Mr', 'Bennet', 'said', 'his', 'lady', 'to', 'him', 'one', 'day', 'have', 'you', 'heard', 'that', 'Netherfield', 'Park', 'is', 'let', 'at', 'last', 'Mr']
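
For comparison, the regular-expression route mentioned earlier can be sketched in a few lines. This is an illustration on a short sample string (the variable names `sample` and `regex_words` are ours), not a replacement for the cells above.

```python
import re

sample = 'must be in want of a wife. “My dear Mr. Bennet,” said his lady'

# \w+ matches runs of letters, digits and underscores. For str patterns
# in Python 3 it is Unicode-aware, so curly quotes act as separators
# just like ASCII punctuation.
regex_words = re.findall(r'\w+', sample)
print(regex_words)
```

One difference from the translate approach: `\w` includes the underscore, so emphasis markup such as `_You_` in this text would keep its underscores, whereas string.punctuation contains `_` and removes them.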


Next we normalize the text. There are many normalization approaches, but the most common is converting everything to lowercase.

In [6]:
# convert to lower case
words_normalized = [word.lower() for word in stripped_full]
print(words_normalized[:100])

['pride', 'and', 'prejudice', 'by', 'jane', 'austen', 'chapter', '1', 'it', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', 'however', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families', 'that', 'he', 'is', 'considered', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters', 'my', 'dear', 'mr', 'bennet', 'said', 'his', 'lady', 'to', 'him', 'one', 'day', 'have', 'you', 'heard', 'that', 'netherfield', 'park', 'is', 'let', 'at', 'last', 'mr']
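
The manual steps so far (split, strip punctuation, lowercase) can be collected into one helper. The sketch below is one possible arrangement; the function name `clean_manual` and its `extra_punctuation` parameter are hypothetical, and dropping tokens that end up empty is an extra step that the cells above do not perform.

```python
import string


def clean_manual(text, extra_punctuation='“”'):
    """Split on whitespace, strip punctuation, convert to lowercase."""
    # One translation table covers ASCII punctuation and the curly quotes.
    table = str.maketrans('', '', string.punctuation + extra_punctuation)
    stripped = (w.translate(table) for w in text.split())
    # A token made only of punctuation becomes '' after stripping; drop it.
    return [w.lower() for w in stripped if w]


sample = '“My dear Mr. Bennet,” said his lady to him one day, “have you heard'
print(clean_manual(sample))
```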


### Cleaning using NLTK

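NLTK makes it easy to split the text into sentences with sent_tokenize. It relies on NLTK's Punkt tokenizer models, which have to be downloaded once through nltk.download().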

In [7]:
# split into sentences
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[:100])

['PRIDE AND PREJUDICE\n\nBy Jane Austen\n\n\n\nChapter 1\n\n\nIt is a truth universally acknowledged, that a single man in possession\nof a good fortune, must be in want of a wife.', 'However little known the feelings or views of such a man may be on his\nfirst entering a neighbourhood, this truth is so well fixed in the minds\nof the surrounding families, that he is considered the rightful property\nof some one or other of their daughters.', '“My dear Mr. Bennet,” said his lady to him one day, “have you heard that\nNetherfield Park is let at last?”\n\nMr. Bennet replied that he had not.', '“But it is,” returned she; “for Mrs. Long has just been here, and she\ntold me all about it.”\n\nMr. Bennet made no answer.', '“Do you not want to know who has taken it?” cried his wife impatiently.', '“_You_ want to tell me, and I have no objection to hearing it.”\n\nThis was invitation enough.', '“Why, my dear, you must know, Mrs. Long says that Netherfield is taken\nby a young man of large fortun
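
The sentences keep the hard line breaks of the source file, and some items still contain a paragraph break. If single-line sentences are needed, one option is to collapse every run of whitespace into a single space. The sketch below uses two sentences copied from the output above; the names `raw_sentences` and `flat_sentences` are ours.

```python
# Two items copied from the sent_tokenize output, line breaks included.
raw_sentences = [
    '“But it is,” returned she; “for Mrs. Long has just been here, and she\n'
    'told me all about it.”\n\nMr. Bennet made no answer.',
    '“_You_ want to tell me, and I have no objection to hearing it.”\n\n'
    'This was invitation enough.',
]

# str.split() with no argument splits on any run of whitespace, so
# joining with one space collapses '\n' and '\n\n' alike.
flat_sentences = [' '.join(s.split()) for s in raw_sentences]
for s in flat_sentences:
    print(s)
```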

To split the text into words we use word_tokenize.

In [8]:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])

['PRIDE', 'AND', 'PREJUDICE', 'By', 'Jane', 'Austen', 'Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', ',', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', ',', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', '.', 'However', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood', ',', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families', ',', 'that', 'he', 'is', 'considered', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters', '.', '“', 'My', 'dear', 'Mr.', 'Bennet', ',', '”', 'said', 'his', 'lady', 'to', 'him', 'one', 'day', ',', '“']


Here we can see that word_tokenize separates punctuation marks from words and returns them as tokens of their own.

To remove the punctuation marks we can use the Python string method isalpha, which keeps only alphabetic tokens, or isalnum, which also keeps numbers (for instance, the chapter number).

In [9]:
# keep only alphanumeric tokens (this drops the punctuation tokens)
words = [word for word in tokens if word.isalnum()]
print(words[:100])

['PRIDE', 'AND', 'PREJUDICE', 'By', 'Jane', 'Austen', 'Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', 'However', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families', 'that', 'he', 'is', 'considered', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters', 'My', 'dear', 'Bennet', 'said', 'his', 'lady', 'to', 'him', 'one', 'day', 'have', 'you', 'heard', 'that', 'Netherfield', 'Park', 'is', 'let', 'at', 'last', 'Bennet', 'replied']


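As we can see, this removes every punctuation token, not only the ones in string.punctuation. Note that it also drops whole tokens that contain a punctuation character: 'Mr.' no longer appears in the output.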
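
The difference between the two filters, and their effect on a token such as 'Mr.', is easy to check on a handful of tokens. The sketch below uses a hand-picked list in the style of the word_tokenize output; the third variant, which strips surrounding ASCII punctuation before filtering, is an alternative we add for illustration and is not what the cell above does.

```python
import string

tokens = ['Chapter', '1', 'Mr.', 'Bennet', ',', '”', 'wife', '.']

# isalpha: letters only, so the chapter number is dropped as well.
alphabetic = [t for t in tokens if t.isalpha()]

# isalnum: letters and digits, but 'Mr.' fails because of its period.
alphanumeric = [t for t in tokens if t.isalnum()]

# Alternative: strip leading/trailing ASCII punctuation first, then filter.
stripped_then_filtered = [
    s for s in (t.strip(string.punctuation) for t in tokens) if s.isalnum()
]

print(alphabetic)
print(alphanumeric)
print(stripped_then_filtered)
```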

Next we normalize to lowercase, as before.

In [10]:
words = [w.lower() for w in words]
print(words[:100])

['pride', 'and', 'prejudice', 'by', 'jane', 'austen', 'chapter', '1', 'it', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', 'however', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families', 'that', 'he', 'is', 'considered', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters', 'my', 'dear', 'bennet', 'said', 'his', 'lady', 'to', 'him', 'one', 'day', 'have', 'you', 'heard', 'that', 'netherfield', 'park', 'is', 'let', 'at', 'last', 'bennet', 'replied']


One of the most useful resources in the NLTK package is its stopwords corpus. Stopwords are very common words that contribute little to the meaning of a sentence, such as "the", "a", and "and" in English. We can see the list of English stopwords as follows:

In [11]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words[:100])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once']


We can also list the languages for which stopwords are available:

In [12]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

For example, the Spanish stopwords:

In [13]:
palabras = stopwords.words('spanish')
print(palabras[:100])

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se', 'las', 'por', 'un', 'para', 'con', 'no', 'una', 'su', 'al', 'lo', 'como', 'más', 'pero', 'sus', 'le', 'ya', 'o', 'este', 'sí', 'porque', 'esta', 'entre', 'cuando', 'muy', 'sin', 'sobre', 'también', 'me', 'hasta', 'hay', 'donde', 'quien', 'desde', 'todo', 'nos', 'durante', 'todos', 'uno', 'les', 'ni', 'contra', 'otros', 'ese', 'eso', 'ante', 'ellos', 'e', 'esto', 'mí', 'antes', 'algunos', 'qué', 'unos', 'yo', 'otro', 'otras', 'otra', 'él', 'tanto', 'esa', 'estos', 'mucho', 'quienes', 'nada', 'muchos', 'cual', 'poco', 'ella', 'estar', 'estas', 'algunas', 'algo', 'nosotros', 'mi', 'mis', 'tú', 'te', 'ti', 'tu', 'tus', 'ellas', 'nosotras', 'vosotros', 'vosotras', 'os', 'mío', 'mía', 'míos', 'mías', 'tuyo']


In [14]:
len(stop_words)

179

In [15]:
len(palabras)

313

Using these lists of stopwords we can easily remove words that do not give extra information.

In [16]:
words = [w for w in words if w not in stop_words]
print(words[:100])

['pride', 'prejudice', 'jane', 'austen', 'chapter', '1', 'truth', 'universally', 'acknowledged', 'single', 'man', 'possession', 'good', 'fortune', 'must', 'want', 'wife', 'however', 'little', 'known', 'feelings', 'views', 'man', 'may', 'first', 'entering', 'neighbourhood', 'truth', 'well', 'fixed', 'minds', 'surrounding', 'families', 'considered', 'rightful', 'property', 'one', 'daughters', 'dear', 'bennet', 'said', 'lady', 'one', 'day', 'heard', 'netherfield', 'park', 'let', 'last', 'bennet', 'replied', 'returned', 'long', 'told', 'bennet', 'made', 'answer', 'want', 'know', 'taken', 'cried', 'wife', 'impatiently', 'want', 'tell', 'objection', 'hearing', 'invitation', 'enough', 'dear', 'must', 'know', 'long', 'says', 'netherfield', 'taken', 'young', 'man', 'large', 'fortune', 'north', 'england', 'came', 'monday', 'chaise', 'four', 'see', 'place', 'much', 'delighted', 'agreed', 'morris', 'immediately', 'take', 'possession', 'michaelmas', 'servants', 'house', 'end', 'next']
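
Two details of this step are worth spelling out. The NLTK list is all lowercase, so the filter only works because the words were lowercased first, and membership tests are faster against a set than against a list. The sketch below shows the same filter on the opening sentence with a hypothetical mini stopword set (`stop_set`) standing in for the NLTK list, so it runs without NLTK.

```python
# Hypothetical mini stopword list; in the notebook this would be
# set(stopwords.words('english')).
stop_set = {'it', 'is', 'a', 'that', 'in', 'of', 'be'}

tokens = ['it', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that',
          'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good',
          'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife']

# Set membership is a hash lookup, so the filter stays fast on a whole novel.
content_words = [w for w in tokens if w not in stop_set]
removed = len(tokens) - len(content_words)

print(content_words)
print(removed, 'of', len(tokens), 'tokens removed')

# Case matters: a capitalised token would slip through the filter.
print('It' in stop_set, 'it' in stop_set)
```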


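Finally, we can use normalization techniques to reduce every word to a base form. The first one is stemming, which strips suffixes with heuristic rules, so the result is not always a real word (e.g., 'prejudic').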

In [17]:
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
print(stemmed[:100])

['pride', 'prejudic', 'jane', 'austen', 'chapter', '1', 'truth', 'univers', 'acknowledg', 'singl', 'man', 'possess', 'good', 'fortun', 'must', 'want', 'wife', 'howev', 'littl', 'known', 'feel', 'view', 'man', 'may', 'first', 'enter', 'neighbourhood', 'truth', 'well', 'fix', 'mind', 'surround', 'famili', 'consid', 'right', 'properti', 'one', 'daughter', 'dear', 'bennet', 'said', 'ladi', 'one', 'day', 'heard', 'netherfield', 'park', 'let', 'last', 'bennet', 'repli', 'return', 'long', 'told', 'bennet', 'made', 'answer', 'want', 'know', 'taken', 'cri', 'wife', 'impati', 'want', 'tell', 'object', 'hear', 'invit', 'enough', 'dear', 'must', 'know', 'long', 'say', 'netherfield', 'taken', 'young', 'man', 'larg', 'fortun', 'north', 'england', 'came', 'monday', 'chais', 'four', 'see', 'place', 'much', 'delight', 'agre', 'morri', 'immedi', 'take', 'possess', 'michaelma', 'servant', 'hous', 'end', 'next']


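Another technique is lemmatization, which maps each word to its dictionary form (lemma). WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, which is why plurals such as 'feelings' are reduced while verb forms such as 'acknowledged' are left unchanged.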

In [18]:
# lemmatization of words
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized[:100])

['pride', 'prejudice', 'jane', 'austen', 'chapter', '1', 'truth', 'universally', 'acknowledged', 'single', 'man', 'possession', 'good', 'fortune', 'must', 'want', 'wife', 'however', 'little', 'known', 'feeling', 'view', 'man', 'may', 'first', 'entering', 'neighbourhood', 'truth', 'well', 'fixed', 'mind', 'surrounding', 'family', 'considered', 'rightful', 'property', 'one', 'daughter', 'dear', 'bennet', 'said', 'lady', 'one', 'day', 'heard', 'netherfield', 'park', 'let', 'last', 'bennet', 'replied', 'returned', 'long', 'told', 'bennet', 'made', 'answer', 'want', 'know', 'taken', 'cried', 'wife', 'impatiently', 'want', 'tell', 'objection', 'hearing', 'invitation', 'enough', 'dear', 'must', 'know', 'long', 'say', 'netherfield', 'taken', 'young', 'man', 'large', 'fortune', 'north', 'england', 'came', 'monday', 'chaise', 'four', 'see', 'place', 'much', 'delighted', 'agreed', 'morris', 'immediately', 'take', 'possession', 'michaelmas', 'servant', 'house', 'end', 'next']
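
To make the contrast between the two techniques easier to see, the sketch below lines up a few words with their stem and lemma. The triples are copied from the outputs of cells 16, 17 and 18, so nothing is recomputed and NLTK is not needed to run it; the name `rows` is ours.

```python
# (word, Porter stem, WordNet lemma) copied from the outputs above.
rows = [
    ('prejudice',    'prejudic',   'prejudice'),
    ('universally',  'univers',    'universally'),
    ('acknowledged', 'acknowledg', 'acknowledged'),
    ('feelings',     'feel',       'feeling'),
    ('entering',     'enter',      'entering'),
    ('families',     'famili',     'family'),
]

changed_by_stemmer = [w for w, stem, _ in rows if stem != w]
changed_by_lemmatizer = [w for w, _, lemma in rows if lemma != w]

# Side-by-side view: word, stem, lemma.
for w, stem, lemma in rows:
    print(f'{w:<14}{stem:<14}{lemma}')

print('stemmer changed:   ', changed_by_stemmer)
print('lemmatizer changed:', changed_by_lemmatizer)
```

In this sample the stemmer alters every word, while the lemmatizer with its default settings only touches the two plural nouns.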
