```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.2 Simple topic identification
```

# Simple text preprocessing

## Why preprocess?

* Preprocessing helps produce better input data:

	* *for machine learning or other statistical methods*

* Examples:

	* *tokenization to create a bag of words*

	* *lowercasing words*

* Lemmatization/Stemming:

	* *reduce words to a root form (a stem, or a dictionary lemma)*

* Remove stop words, punctuation, or unwanted tokens.

* Good to experiment with different approaches.

## Text preprocessing code in Python:

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

In [2]:
from nltk.corpus import stopwords

text = """The cat is in the box. The cat likes the box.
The box is over the cat."""
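# Lowercase, tokenize, and keep only alphabetic tokens (drops punctuation)
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
# Build the stop-word set once instead of reloading the list for every token
english_stops = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in english_stops]
Counter(no_stops).most_common(2)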

[('cat', 3), ('box', 3)]

In [3]:
from nltk.stem import WordNetLemmatizer

text = """Cats, dogs and birds are common pets. So are fish."""
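# Lowercase, tokenize, and keep only alphabetic tokens
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
# Remove English stop words, using a set for fast membership tests
english_stops = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in english_stops]
# Reduce each remaining token to its WordNet lemma
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized_output = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
print(lemmatized_output)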

['cat', 'dog', 'bird', 'common', 'pet', 'fish']
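
The cell above shows lemmatization, while stemming is listed under "Why preprocess?" but not demonstrated. As a rough illustration only, the sketch below strips a few suffixes by rule. The function `toy_stem` and its suffix list are hypothetical and are not NLTK's stemmer or the Porter algorithm; they only show that rule-based stems need not be dictionary words, whereas the lemmatizer above returns real words.

```python
def toy_stem(word):
    """Strip a few common English suffixes; a toy rule set, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["cats", "dogs", "likes", "jumping", "fish"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['cat', 'dog', 'lik', 'jump', 'fish']
```

Note how `likes` becomes `lik`: crude suffix rules are fast but can produce non-words, which is one reason to compare stemming and lemmatization on your own data.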


## Practice question for text preprocessing steps:

* Which of the following are useful text preprocessing steps?
    
    $\Box$ Stems, spelling corrections, lowercase.
    
    $\boxtimes$ Lemmatization, lowercasing, removing unwanted tokens.

    $\Box$ Removing stop words, leaving in capital words.
    
    $\Box$ Strip stop words, word endings and digits.

$\blacktriangleright$ **Question-solving method:**

In [4]:
from nltk.tokenize import word_tokenize
from collections import Counter

Counter(word_tokenize("The cat is in the box. The cat box."))

Counter({'The': 2, 'cat': 2, 'is': 1, 'in': 1, 'the': 1, 'box': 2, '.': 2})
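
In the output above, `'The'` and `'the'` are counted as different tokens. A minimal sketch of how lowercasing merges them, using only the standard library and the token list copied from that output (so no NLTK call is needed here):

```python
from collections import Counter

# Tokens as they appear in the Counter output above.
tokens = ["The", "cat", "is", "in", "the", "box", ".", "The", "cat", "box", "."]

case_sensitive = Counter(tokens)
case_folded = Counter(t.lower() for t in tokens)

print(case_sensitive["The"], case_sensitive["the"])  # 2 1
print(case_folded["the"])                            # 3
print(case_folded.most_common(1))                    # [('the', 3)]
```

After lowercasing, `the` becomes the single most common token, which is why lowercasing is usually paired with stop-word removal.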

## Practice exercises for word counts with bag-of-words:

$\blacktriangleright$ **Package pre-loading:**

In [5]:
from nltk import word_tokenize

$\blacktriangleright$ **Data pre-loading:**

In [6]:
# Read the article text; the context manager closes the file afterwards
with open('ref1. Wikipedia article - Debugging.txt') as f:
    article = f.read()
article_title = 'Debugging'

$\blacktriangleright$ **Bag-of-words `Counter` building practice:**

In [7]:
# Import Counter
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 66), ('to', 63), ('a', 60), ('``', 47), ('in', 44), ('and', 41)]
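
The ten most common tokens above are all punctuation or stop words, so they say little about the article's topic. The sketch below shows one way to filter an existing `Counter`. The helper `filter_bow`, the short `STOP_WORDS` set, and the sample counts are all hypothetical stand-ins; in practice one would likely use `stopwords.words('english')` from NLTK and the real `bow_simple`.

```python
from collections import Counter

# Hypothetical mini stop list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"the", "of", "to", "a", "in", "and", "is"}


def filter_bow(bow, stop_words=STOP_WORDS):
    """Return a new Counter without non-alphabetic tokens and stop words."""
    return Counter({tok: n for tok, n in bow.items()
                    if tok.isalpha() and tok not in stop_words})


# Made-up counts standing in for bow_simple; not taken from the article.
sample = Counter({",": 5, "the": 4, "debugging": 3, ".": 3, "of": 2, "bug": 2, "``": 1})
filtered = filter_bow(sample)
print(filtered.most_common(2))  # [('debugging', 3), ('bug', 2)]
```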
