# Text Normalization

## Install NLTK

```
!pip install -U nltk
```
If you are not in a virtual environment:
```
sudo pip install -U nltk
```

## Simple text normalization with Python

### Load data
Upload `nlp_history_wikipedia.txt`; it's an excerpt from the Wikipedia page about NLP.

**Tutorial question 1**:
What Python two-liner can you use to load the file into a string called `text`?

In [82]:
# Read
with open("nlp_history_wikipedia.txt") as f:
    text = f.read()
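When no encoding is given, `open` falls back to the platform default, which can differ between machines. A minimal sketch of the same load with `pathlib` and an explicit UTF-8 encoding follows. The file name `sample.txt` and its contents are placeholders written to a temporary directory so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for nlp_history_wikipedia.txt, so the example is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample.txt"
sample_path.write_text("Symbolic NLP (1950s – early 1990s)\n", encoding="utf-8")

# Read the whole file into one string, stating the encoding explicitly.
sample_text = sample_path.read_text(encoding="utf-8")
print(len(sample_text))
```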

**Tutorial question 2**: How many characters are there in the string `text`?

In [83]:
# Count the characters instead of hard-coding the number
print(f"There are {len(text)} characters")

There are 3474 characters


### Case normalization
**Tutorial question 3**: Make all characters lower case using only a string method.

In [84]:
# Lower case entire string
new_text = text.lower()

new_text

'history\nfurther information: history of natural language processing\n\nnatural language processing has its roots in the 1950s.[1] already in 1950, alan turing published an article titled "computing machinery and intelligence" which proposed what is now called the turing test as a criterion of intelligence, though at the time that was not articulated as a problem separate from artificial intelligence. the proposed test includes a task that involves the automated interpretation and generation of natural language.\n\nsymbolic nlp (1950s – early 1990s)\n\nthe premise of symbolic nlp is well-summarized by john searle\'s chinese room experiment: given a collection of rules (e.g., a chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other nlp tasks) by applying those rules to the data it confronts.\n\n * 1950s: the georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into engli

In [85]:
# Replacement
new_text = new_text.replace("\n", " ")
new_text

'history further information: history of natural language processing  natural language processing has its roots in the 1950s.[1] already in 1950, alan turing published an article titled "computing machinery and intelligence" which proposed what is now called the turing test as a criterion of intelligence, though at the time that was not articulated as a problem separate from artificial intelligence. the proposed test includes a task that involves the automated interpretation and generation of natural language.  symbolic nlp (1950s – early 1990s)  the premise of symbolic nlp is well-summarized by john searle\'s chinese room experiment: given a collection of rules (e.g., a chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other nlp tasks) by applying those rules to the data it confronts.   * 1950s: the georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into english. the a

In [86]:
# Remove all non-alphanumeric characters
# non_alpha_list = [".", "[", "]", "?", ",", "$", "%", "^", "@", "&", ":", ";"]

# Remove them

In [87]:
import re

In [88]:
a = ' '
a.isalnum()

False

In [89]:
# Alternative (not used): delete every character that is not an ASCII letter, digit or space
# text_2 = re.sub(r'[^a-zA-Z0-9 ]', '', new_text)

In [90]:
# Replace each run of non-word characters with a single space
text_2 = re.sub(r'\W+', " ", new_text)

In [91]:
text_2

'history further information history of natural language processing natural language processing has its roots in the 1950s 1 already in 1950 alan turing published an article titled computing machinery and intelligence which proposed what is now called the turing test as a criterion of intelligence though at the time that was not articulated as a problem separate from artificial intelligence the proposed test includes a task that involves the automated interpretation and generation of natural language symbolic nlp 1950s early 1990s the premise of symbolic nlp is well summarized by john searle s chinese room experiment given a collection of rules e g a chinese phrasebook with questions and matching answers the computer emulates natural language understanding or other nlp tasks by applying those rules to the data it confronts 1950s the georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into english the authors claimed that within three 
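The two patterns tried above behave differently on hyphens and apostrophes. The sketch below uses a made-up sample string, not the Wikipedia excerpt. Deleting characters fuses neighbouring words, while replacing each run of non-word characters with a space keeps them apart. Note that `\W` treats the underscore as a word character, so underscores survive the second pattern.

```python
import re

sample = "well-summarized by john searle's room (e.g., a phrasebook)"

# Delete everything except ASCII letters, digits and spaces.
deleted = re.sub(r"[^a-zA-Z0-9 ]", "", sample)

# Replace each run of non-word characters with a single space.
spaced = re.sub(r"\W+", " ", sample)

print(deleted)
print(spaced)
```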

1. Replace all newline characters with a space.
2. Use a regular expression to replace every run of non-alphanumeric characters (`.`, `[`, `]`, `,`, `?`, etc.) with a single space.

**Tutorial question 4**: How long is `text` after the two steps above?

In [98]:
len(text_2)

# 3320 characters, down from 3474 before normalization

3320
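The steps above can be collected into one helper. This is a sketch, not the notebook's own code: the function name `normalize` and the sample string are made up for illustration, and the final `strip()` is an extra step that removes the leading or trailing space the substitution can leave behind.

```python
import re

def normalize(raw: str) -> str:
    """Lower-case, flatten newlines, and replace non-word runs with one space."""
    lowered = raw.lower()
    flattened = lowered.replace("\n", " ")
    cleaned = re.sub(r"\W+", " ", flattened)
    return cleaned.strip()  # extra step: drop leading/trailing spaces

sample = "History\nAlan Turing published an article in 1950.[1]\n"
print(normalize(sample))
```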

### Tokenization

Use only Python string method(s) to split the normalized text (`text_2`) by whitespace and call the output `tokens`.

**Tutorial question 5**: How many tokens are in `tokens`?

In [100]:
# Split on single spaces and keep the result as `tokens`
tokens = text_2.split(" ")
len(tokens)

521
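`str.split(" ")` and `str.split()` are not interchangeable. Splitting on a literal space yields empty strings wherever spaces are doubled or sit at the edges, whereas splitting with no argument splits on any run of whitespace and drops empty strings. A short illustration on a made-up string:

```python
sample = " natural  language processing "

on_space = sample.split(" ")    # literal single-space separator
on_whitespace = sample.split()  # any run of whitespace

print(on_space)
print(on_whitespace)
```

If the count above includes such empty strings, `split()` would report fewer tokens.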

## Text normalization with NLTK

[NLTK](https://www.nltk.org/), the Natural Language Toolkit, has built-in loaders for some datasets.

### Interactive with a GUI

In [101]:
import nltk

In [102]:
# Interactive download (opens a GUI); uncomment to use
# nltk.download()

### Direct download

In [103]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/christine/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

#### The Brown corpus

In [104]:
nltk.download('brown')

[nltk_data] Downloading package brown to /Users/christine/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

### Word tokenization – basic

Tokenize the sentence `"Your computer is getting hacked!"` using NLTK's `word_tokenize`.

**Tutorial question 6**: How many tokens were returned?

In [114]:
# Tokenize the sentence and count the tokens
a = nltk.word_tokenize("Your computer is getting hacked!")
len(a)  # 6

6

In [115]:
a

['Your', 'computer', 'is', 'getting', 'hacked', '!']
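The exclamation mark is returned as its own token, which is why six tokens come back rather than five. The sketch below is a rough standard-library approximation of that behaviour. The regular expression is an assumption made here for illustration and is not NLTK's actual algorithm.

```python
import re

sentence = "Your computer is getting hacked!"

# Whitespace splitting leaves the punctuation glued to the last word.
whitespace_tokens = sentence.split()

# Words are runs of word characters; any other non-space character is its own token.
approx_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens, len(whitespace_tokens))
print(approx_tokens, len(approx_tokens))
```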

## Stemming using the Porter Stemmer

In [106]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
doc = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(doc)
print('Stemming results:\n-----------------')
for w in words:
    print(w, ps.stem(w))

Stemming results:
-----------------
It it
is is
important import
to to
be be
very veri
pythonly pythonli
while while
you you
are are
pythoning python
with with
python python
. .
All all
pythoners python
have have
pythoned python
poorly poorli
at at
least least
once onc
. .
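Stemming strips suffixes by rule, so related word forms collapse to a shared stem that need not be a dictionary word (`poorli`, `veri`). The sketch below uses the same `PorterStemmer`. The word list is chosen here for illustration, and the stemmer itself needs no NLTK data download.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
variants = ["python", "pythoning", "pythoned", "pythoners"]

# Map every variant to its stem and collect the distinct stems.
stems = {ps.stem(w) for w in variants}
print(stems)
```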


## Lemmatization using the WordNetLemmatizer

Download `wordnet` and `averaged_perceptron_tagger`.

In [116]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/christine/nltk_data...


True

In [117]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/christine/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [118]:
from nltk import WordNetLemmatizer
from collections import defaultdict
from nltk import pos_tag
from nltk.corpus import wordnet as wn

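# Map the first letter of a Penn Treebank tag to a WordNet POS; default to noun
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
tag_map['N'] = wn.NOUN

class LemmaTokenizer:
    """Tokenize a text, POS-tag it, and lemmatize each lower-cased token."""

    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, text):
        lemmatized = []
        for token, tag in pos_tag(word_tokenize(text)):
            # The tag's first letter selects the WordNet POS passed to the lemmatizer
            lemmatized.append(self.wnl.lemmatize(token.lower(), tag_map[tag[0]]))

        return lemmatized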

text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
lt = LemmaTokenizer()
lemmatized = lt(text)
words = word_tokenize(text)

print("\nLemmatizing results:\n--------------------")
for word, lemma in zip(words, lemmatized):
    print(word, lemma)


Lemmatizing results:
--------------------
It it
is be
important important
to to
be be
very very
pythonly pythonly
while while
you you
are be
pythoning pythoning
with with
python python
. .
All all
pythoners pythoners
have have
pythoned pythoned
poorly poorly
at at
least least
once once
. .
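In the cell above, `tag_map` looks up only the first letter of each Penn Treebank tag and falls back to noun for anything else. The sketch below isolates that lookup without NLTK. The single-letter strings stand in for `wn.ADJ`, `wn.VERB`, `wn.ADV` and `wn.NOUN` (assumed here to be `'a'`, `'v'`, `'r'` and `'n'`), and the tag list is made up for illustration.

```python
from collections import defaultdict

# Unknown first letters fall back to noun, mirroring defaultdict(lambda: wn.NOUN).
pos_for_letter = defaultdict(lambda: "n", {"J": "a", "V": "v", "R": "r", "N": "n"})

penn_tags = ["VBZ", "JJ", "RB", "NNS", "DT"]
wordnet_pos = [pos_for_letter[tag[0]] for tag in penn_tags]

print(list(zip(penn_tags, wordnet_pos)))
```

This is why `is` and `are` become `be` once they are tagged as verbs, while words unknown to WordNet, such as `pythoning`, come back unchanged.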


## Remove stop words

Download `stopwords`.

In [119]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/christine/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True


**Tutorial question 7**: How many English stopwords are there in `stopwords` from `nltk.corpus`?

In [121]:
from nltk.corpus import stopwords

In [122]:
stoplist = stopwords.words('english')

In [124]:
len(stoplist)

179

Remove the stopwords from the text below, and print the new text.

**Tutorial question 8**: How many tokens remain?

In [125]:
text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

In [126]:
# Iterating over a string yields single characters, so tokenize first
# and compare lower-cased tokens against the stop list.
tokens = word_tokenize(text)
new_list = []

for word in tokens:
    if word.lower() not in stoplist:
        new_list.append(word)

In [127]:
# Print the filtered text and count the remaining tokens
print(" ".join(new_list))
len(new_list)
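A sketch of stop-word filtering at the token level follows. To stay runnable without NLTK data, it uses a tiny hand-written stop set and a whitespace split. Both are placeholders, not the 179-word NLTK list or `word_tokenize`. Storing the stop words in a `set` makes each membership test constant time on average.

```python
# Hypothetical miniature stop list; the real one comes from stopwords.words('english').
mini_stoplist = {"it", "is", "to", "be", "very", "you", "are", "with"}

sentence = "It is important to be very careful with python"
sample_tokens = sentence.split()

# Compare lower-cased tokens so that "It" matches "it".
kept = [tok for tok in sample_tokens if tok.lower() not in mini_stoplist]

print(kept)
print(len(kept))
```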

## Extra: visualize a corpus using WordCloud

### Install
```
pip install -U wordcloud
```

In [120]:
from wordcloud import WordCloud
import nltk
import matplotlib.pyplot as plt
%matplotlib inline
plt.ion()

wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=['nn', 'pp', 'cc'])
wordcloud.generate(nltk.corpus.brown.raw())
fig = plt.figure(figsize=[10, 16])
ax = fig.add_subplot(111)
ax.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")

ModuleNotFoundError: No module named 'wordcloud'
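A word cloud scales each word by how often it occurs. The counting step behind it can be sketched with `collections.Counter`. This is an illustration on a made-up string with a made-up ignore list, not the `wordcloud` package's internal code.

```python
from collections import Counter

sample = "language models process language and language is data"
ignored = {"and", "is"}  # hypothetical stop words for this example

# Count every word that is not in the ignore list.
counts = Counter(w for w in sample.split() if w not in ignored)
print(counts.most_common(1))
```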