```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.2 Simple topic identification
```

# Simple text preprocessing

## What is bag-of-words?

* Bag-of-words is a basic method for finding topics in a text.

* First, split the text into tokens using tokenization.

* Then count how often each token occurs.

* The more frequent a word is, the more important it might be.

* This makes it a quick way to find the significant words in a text.

## Bag-of-words in Python:

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

Counter(
    word_tokenize("""The cat is in the box. The cat likes the box. \
The box is over the cat."""))

Counter({'The': 3,
         'cat': 3,
         'is': 2,
         'in': 1,
         'the': 3,
         'box': 3,
         '.': 3,
         'likes': 1,
         'over': 1})

In [2]:
counter = Counter(
    word_tokenize("""The cat is in the box. The cat likes the box. \
The box is over the cat."""))
counter.most_common(2)

[('The', 3), ('cat', 3)]
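
Note that `'The'` and `'the'` are counted as separate tokens above, because `Counter` keys are case-sensitive. The following is a minimal sketch of how lowercasing merges them. To stay runnable without NLTK data, it assumes a simple regular-expression tokenizer (`simple_tokenize`, a hypothetical stand-in for `word_tokenize` that yields the same tokens for this simple sentence, but not in general).

```python
import re
from collections import Counter


def simple_tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Lowercase each token before counting so 'The' and 'the' share one entry
lower_counter = Counter(t.lower() for t in simple_tokenize(text))
print(lower_counter.most_common(2))  # [('the', 6), ('cat', 3)]
```

After lowercasing, `'the'` has a single count of 6 instead of two separate counts of 3.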

## Practice question: picking the bag-of-words:

* Time for a quick check of your understanding of bag-of-words. Which of the options below is the bag-of-words for the following text, using basic NLTK tokenization?

    > "The cat is in the box. The cat box."
    
    $\Box$ `('the', 3), ('box.', 2), ('cat', 2), ('is', 1)`.

    $\Box$ `('The', 3), ('box', 2), ('cat', 2), ('is', 1), ('in', 1), ('.', 1)`.
    
    $\Box$ `('the', 3), ('cat box', 1), ('cat', 1), ('box', 1), ('is', 1), ('in', 1)`.
        
    $\boxtimes$ `('The', 2), ('box', 2), ('.', 2), ('cat', 2), ('is', 1), ('in', 1), ('the', 1)`.

$\blacktriangleright$ **Question-solving method:**

In [3]:
from nltk.tokenize import word_tokenize
from collections import Counter

Counter(word_tokenize("The cat is in the box. The cat box."))

Counter({'The': 2, 'cat': 2, 'is': 1, 'in': 1, 'the': 1, 'box': 2, '.': 2})

## Practice exercises for word counts with bag-of-words:

$\blacktriangleright$ **Package pre-loading:**

In [4]:
from nltk import word_tokenize

$\blacktriangleright$ **Data pre-loading:**

In [5]:
# Read the article inside a context manager so the file is closed afterwards
with open('ref1. Wikipedia article - Debugging.txt') as f:
    article = f.read()
article_title = 'Debugging'

$\blacktriangleright$ **Bag-of-words `Counter` building practice:**

In [6]:
# Import Counter
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 66), ('to', 63), ('a', 60), ('``', 47), ('in', 44), ('and', 41)]
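
The most common tokens above are mostly punctuation and function words, which say little about the topic. One possible next step, sketched here under assumptions, is to drop non-alphabetic tokens and a small set of stop words before counting. The sample text, the regular-expression tokenizer, and the stop-word set are all illustrative stand-ins, not the article or NLTK's own resources.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the loaded article
sample = ("Debugging is the process of finding and resolving bugs. "
          "The process of debugging can be simple, or it can be complex.")

# Stand-in tokenizer (assumption): words or single punctuation marks, lowercased
sample_tokens = re.findall(r"\w+|[^\w\s]", sample.lower())

# A tiny hand-picked stop-word set, for illustration only
stop_words = {"the", "of", "to", "a", "in", "and", "is", "or", "it", "can", "be"}

# Keep alphabetic tokens that are not stop words
content_tokens = [t for t in sample_tokens if t.isalpha() and t not in stop_words]

bow_filtered = Counter(content_tokens)
print(bow_filtered.most_common(2))  # [('debugging', 2), ('process', 2)]
```

With punctuation and stop words removed, the top tokens are words that hint at the subject of the text.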
