
<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preprocessing-Text" data-toc-modified-id="Preprocessing-Text-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocessing Text</a></span><ul class="toc-item"><li><span><a href="#Where-did-the-text-originate?" data-toc-modified-id="Where-did-the-text-originate?-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Where did the <em>text</em> originate?</a></span></li><li><span><a href="#Removing-irrelevant-information" data-toc-modified-id="Removing-irrelevant-information-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Removing irrelevant information</a></span></li><li><span><a href="#Useful-tools" data-toc-modified-id="Useful-tools-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Useful tools</a></span><ul class="toc-item"><li><span><a href="#Introducing-Natural-Language-Toolkit-(NLTK)" data-toc-modified-id="Introducing-Natural-Language-Toolkit-(NLTK)-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Introducing Natural Language Toolkit (NLTK)</a></span></li><li><span><a href="#Regular-Expression-(Regex)" data-toc-modified-id="Regular-Expression-(Regex)-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Regular Expression (Regex)</a></span></li></ul></li></ul></li><li><span><a href="#Steps-to-Processing-Text" data-toc-modified-id="Steps-to-Processing-Text-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Steps to Processing Text</a></span><ul class="toc-item"><li><span><a href="#Cleaning" data-toc-modified-id="Cleaning-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Cleaning</a></span></li><li><span><a href="#Normalization" data-toc-modified-id="Normalization-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Normalization</a></span><ul class="toc-item"><li><span><a href="#Capitalization" data-toc-modified-id="Capitalization-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Capitalization</a></span></li><li><span><a href="#Punctuation" 
data-toc-modified-id="Punctuation-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Punctuation</a></span></li></ul></li><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Stopword-removal" data-toc-modified-id="Stopword-removal-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Stopword removal</a></span></li><li><span><a href="#Stemming" data-toc-modified-id="Stemming-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Stemming</a></span></li><li><span><a href="#Lemmatization" data-toc-modified-id="Lemmatization-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Lemmatization</a></span></li><li><span><a href="#Note-on-Part-of-Speech-(POS)-Tagging" data-toc-modified-id="Note-on-Part-of-Speech-(POS)-Tagging-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Note on Part-of-Speech (POS) Tagging</a></span></li></ul></li></ul></div>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Preprocessing Text

## Where did the _text_ originate?

Where the text came from changes how we preprocess it.

Examples:
- Speech --> Must first be converted into text/words
- Web pages --> HTML tags need to be stripped
- Word docs, other text formats --> More "junk" to consider

## Removing irrelevant information

> The dogs in Alaska are cold, hungry, and lonely.

- Punctuation likely can be removed without drastically changing the meaning
- Capitalization rarely changes meaning
- Some common words really don't add to meaning: "a", "the", "are", "of"
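
As a rough illustration of the three points above, the sketch below applies them to the example sentence using only the standard library. The stop-word set is hand-picked for this example (an assumption, not NLTK's list).

```python
import string

sentence = "The dogs in Alaska are cold, hungry, and lonely."

# Hand-picked for this example only (hypothetical, not NLTK's list)
tiny_stopwords = {"a", "the", "are", "of", "in", "and"}

# 1. Remove punctuation, 2. lowercase, 3. split on whitespace
no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
words = no_punct.lower().split()

# 4. Drop the common words that add little meaning
content_words = [w for w in words if w not in tiny_stopwords]
print(content_words)  # ['dogs', 'alaska', 'cold', 'hungry', 'lonely']
```

The five words that remain still carry most of the sentence's meaning.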


## Useful tools 

### Introducing Natural Language Toolkit (NLTK)

NLTK is a Python library that helps with text preprocessing as well as feature extraction.

> Documentation: https://www.nltk.org/
>
> Book: https://www.nltk.org/book/

### Regular Expression (Regex)

Regex is a useful way to match and move through structured patterns in language (we won't go through it here; there are lots of resources).

<img src='https://imgs.xkcd.com/comics/regular_expressions.png' width=60%/>

I personally like this web app for testing out pattern matching: [Regexr](https://regexr.com/)

Regex Crosswords! https://regexcrossword.com/
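
For a small taste, here is a minimal sketch using Python's built-in `re` module. The sample string is one line of the Project Gutenberg header; the two patterns are only illustrative choices.

```python
import re

sample = "Release Date: August 11, 2004 [EBook #46]"

# \d{4} matches exactly four digits in a row (here, the year)
years = re.findall(r"\d{4}", sample)

# [A-Za-z]+ matches runs of letters, skipping digits and punctuation
letter_runs = re.findall(r"[A-Za-z]+", sample)

print(years)        # ['2004']
print(letter_runs)  # ['Release', 'Date', 'August', 'EBook']
```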

# Steps to Processing Text

In [2]:
import nltk

In [3]:
fname = "/content/drive/MyDrive/a_christmas_carol.txt"

# Read the first n - 20 lines of the file
# (range(20, n) yields n - 20 items; it does not skip the first 20 lines)
n = 500
with open(fname) as myfile:
    lines = [next(myfile) for x in range(20,n)]
print(lines)



In [4]:
# Create one large string to simulate a full text
text_christmas_carol = ''.join(lines)
print(text_christmas_carol)

﻿The Project Gutenberg EBook of A Christmas Carol, by Charles Dickens

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: A Christmas Carol
       A Ghost Story of Christmas

Author: Charles Dickens

Release Date: August 11, 2004 [EBook #46]
Last Updated: March 4, 2018

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL ***




Produced by Jose Menendez




A CHRISTMAS CAROL

IN PROSE
BEING
A Ghost Story of Christmas

by Charles Dickens



PREFACE

I HAVE endeavoured in this Ghostly little book,
to raise the Ghost of an Idea, which shall not put my
readers out of humour with themselves, with each other,
with the season, or with me.  May it haunt their houses
pleasantly, and no one wish to lay it.

Their faithful Friend and S

In [5]:
words_christmas_carol = text_christmas_carol.split()
words_christmas_carol

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'A',
 'Christmas',
 'Carol,',
 'by',
 'Charles',
 'Dickens',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever.',
 'You',
 'may',
 'copy',
 'it,',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at',
 'www.gutenberg.net',
 'Title:',
 'A',
 'Christmas',
 'Carol',
 'A',
 'Ghost',
 'Story',
 'of',
 'Christmas',
 'Author:',
 'Charles',
 'Dickens',
 'Release',
 'Date:',
 'August',
 '11,',
 '2004',
 '[EBook',
 '#46]',
 'Last',
 'Updated:',
 'March',
 '4,',
 '2018',
 'Language:',
 'English',
 'Character',
 'set',
 'encoding:',
 'UTF-8',
 '***',
 'START',
 'OF',
 'THIS',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 'A',
 'CHRISTMAS',
 'CAROL',
 '***',
 'Produced',
 'by',
 'Jose',
 'Mene

## Cleaning

Regex ([Python doc](https://docs.python.org/3/library/re.html)) and packages like [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) can strip out the extra junk so that only the natural-language material remains.
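
For Project Gutenberg files, one possible cleaning step is to drop the license header that precedes the `*** START OF ... ***` marker. The sketch below assumes that marker is present; `strip_gutenberg_header` is a hypothetical helper name and `raw` is a shortened stand-in for the real file contents.

```python
import re

# Shortened stand-in for the raw file contents (illustrative only)
raw = (
    "The Project Gutenberg EBook of A Christmas Carol, by Charles Dickens\n"
    "Language: English\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL ***\n"
    "\n"
    "A CHRISTMAS CAROL\n"
    "by Charles Dickens\n"
)

def strip_gutenberg_header(text):
    """Return the text after the '*** START OF ...' line, if there is one."""
    # re.MULTILINE lets ^ and $ match at the start and end of each line
    match = re.search(r"^\*\*\* START OF .*$", text, flags=re.MULTILINE)
    if match is None:
        return text  # no marker found: leave the text untouched
    return text[match.end():].lstrip()

body = strip_gutenberg_header(raw)
print(body)
```

Returning the text unchanged when the marker is missing keeps the helper safe to run on files that have already been cleaned.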

## Normalization

### Capitalization

In [6]:
text_christmas_carol = text_christmas_carol.lower()
text_christmas_carol



### Punctuation 

Whether to remove punctuation depends on the task

> Useful for text document as a whole

In [7]:
import re
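# Get rid of punctuation: replace every character that is not a letter or the
# digit 0 with a space. Note that the class `0-0` matches only '0', so all
# other digits are removed as well; `0-9` would keep them.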
text_christmas_carol_clean = re.sub(r'[^a-zA-Z0-0]', " ", text_christmas_carol)
text_christmas_carol_clean



## Tokenization

A token is a symbol that holds meaning and can't be meaningfully split any further (in English, tokens are usually words)

`nltk.tokenize` has a variety of tokenizers (http://www.nltk.org/api/nltk.tokenize.html): 

- `sent_tokenize` finds sentences (often done for translation)
- `word_tokenize` is like `split` but is a little smarter in how it tokenizes the text
- `RegexpTokenizer` gives more advanced control, such as tokenizing the words and removing punctuation in one step (http://www.nltk.org/api/nltk.tokenize.html?highlight=regexp#module-nltk.tokenize.regexp)
- `TweetTokenizer` is designed specifically for tweets from Twitter (http://www.nltk.org/api/nltk.tokenize.html?highlight=regexp#nltk.tokenize.casual.TweetTokenizer)
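
To see why plain `split` is often not enough, the sketch below compares it with a simple regex-based tokenizer. It uses only the standard library as a stand-in for the idea; it is not how NLTK's tokenizers are implemented.

```python
import re

text = "You may copy it, give it away or re-use it."

# Whitespace splitting keeps punctuation glued to the words ('it,' and 'it.')
split_tokens = text.split()

# Keeping only runs of word characters drops the punctuation,
# but it also breaks 're-use' into two tokens
regex_tokens = re.findall(r"\w+", text)

print(split_tokens)
print(regex_tokens)
```

Neither result is "right" in general: which behaviour you want for hyphens, contractions, and URLs depends on the task.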

In [8]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [9]:
from nltk.tokenize import word_tokenize

# Split the cleaned, lowercased text into word tokens
words_christmas_carol_tokens = word_tokenize(text_christmas_carol_clean)
words_christmas_carol_tokens

['the',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'a',
 'christmas',
 'carol',
 'by',
 'charles',
 'dickens',
 'this',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 'you',
 'may',
 'copy',
 'it',
 'give',
 'it',
 'away',
 'or',
 're',
 'use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'project',
 'gutenberg',
 'license',
 'included',
 'with',
 'this',
 'ebook',
 'or',
 'online',
 'at',
 'www',
 'gutenberg',
 'net',
 'title',
 'a',
 'christmas',
 'carol',
 'a',
 'ghost',
 'story',
 'of',
 'christmas',
 'author',
 'charles',
 'dickens',
 'release',
 'date',
 'august',
 '00',
 'ebook',
 'last',
 'updated',
 'march',
 '0',
 'language',
 'english',
 'character',
 'set',
 'encoding',
 'utf',
 'start',
 'of',
 'this',
 'project',
 'gutenberg',
 'ebook',
 'a',
 'christmas',
 'carol',
 'produced',
 'by',
 'jose',
 'menendez',
 'a',
 'christmas',
 'carol',
 'in',
 'prose

## Stopword removal

- Makes set smaller and still mostly readable
- Usually these common stop words dominate the list of words
- Dependent on context of task
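
A quick way to check how much the common words dominate is to count token frequencies. This is a minimal sketch with `collections.Counter` on a short lowercased excerpt of the header text; the excerpt is chosen only for illustration.

```python
from collections import Counter

text = (
    "this ebook is for the use of anyone anywhere at no cost and with "
    "almost no restrictions whatsoever you may copy it give it away or "
    "re use it under the terms of the project gutenberg license"
)

# Count how often each whitespace-separated token occurs
counts = Counter(text.split())

# The two most frequent tokens and their counts
top_two = counts.most_common(2)
print(top_two)
```

On this excerpt the top two tokens are "the" and "it", both of which are in NLTK's English stop-word list.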

In [10]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english') + ['christmas']
print(eng_stopwords)


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [12]:
words_christmas_carol_tokens = [
    w for w in words_christmas_carol_tokens 
        if w not in eng_stopwords
]

words_christmas_carol_tokens

['project',
 'gutenberg',
 'ebook',
 'carol',
 'charles',
 'dickens',
 'ebook',
 'use',
 'anyone',
 'anywhere',
 'cost',
 'almost',
 'restrictions',
 'whatsoever',
 'may',
 'copy',
 'give',
 'away',
 'use',
 'terms',
 'project',
 'gutenberg',
 'license',
 'included',
 'ebook',
 'online',
 'www',
 'gutenberg',
 'net',
 'title',
 'carol',
 'ghost',
 'story',
 'author',
 'charles',
 'dickens',
 'release',
 'date',
 'august',
 '00',
 'ebook',
 'last',
 'updated',
 'march',
 '0',
 'language',
 'english',
 'character',
 'set',
 'encoding',
 'utf',
 'start',
 'project',
 'gutenberg',
 'ebook',
 'carol',
 'produced',
 'jose',
 'menendez',
 'carol',
 'prose',
 'ghost',
 'story',
 'charles',
 'dickens',
 'preface',
 'endeavoured',
 'ghostly',
 'little',
 'book',
 'raise',
 'ghost',
 'idea',
 'shall',
 'put',
 'readers',
 'humour',
 'season',
 'may',
 'haunt',
 'houses',
 'pleasantly',
 'one',
 'wish',
 'lay',
 'faithful',
 'friend',
 'servant',
 'c',
 'december',
 'contents',
 'stave',
 'marley'

## Stemming 

- Reducing to root form --> reduce complexity but still have meaning
- fast and crude
- not all stemmed words are _words_
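
To make "fast and crude" concrete, here is a toy suffix stripper. It is NOT the Porter algorithm, just a hypothetical `crude_stem` helper with a hand-picked suffix list, but it shows how chopping endings can produce stems that are not real words.

```python
def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix; a toy stand-in, not the Porter stemmer."""
    for suffix in suffixes:
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["houses", "readers", "haunted", "charles"]:
    print(w, "->", crude_stem(w))
```

"readers" and "haunted" reduce to real words, while "houses" and "charles" become "hous" and "charl".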

In [13]:
# Create the stemmer once and reuse it for every token
stemmer = nltk.stem.porter.PorterStemmer()
words_christmas_carol_tokens_stemmed = [
    stemmer.stem(w) for w in words_christmas_carol_tokens
]

print(words_christmas_carol_tokens_stemmed)

['project', 'gutenberg', 'ebook', 'carol', 'charl', 'dicken', 'ebook', 'use', 'anyon', 'anywher', 'cost', 'almost', 'restrict', 'whatsoev', 'may', 'copi', 'give', 'away', 'use', 'term', 'project', 'gutenberg', 'licens', 'includ', 'ebook', 'onlin', 'www', 'gutenberg', 'net', 'titl', 'carol', 'ghost', 'stori', 'author', 'charl', 'dicken', 'releas', 'date', 'august', '00', 'ebook', 'last', 'updat', 'march', '0', 'languag', 'english', 'charact', 'set', 'encod', 'utf', 'start', 'project', 'gutenberg', 'ebook', 'carol', 'produc', 'jose', 'menendez', 'carol', 'prose', 'ghost', 'stori', 'charl', 'dicken', 'prefac', 'endeavour', 'ghostli', 'littl', 'book', 'rais', 'ghost', 'idea', 'shall', 'put', 'reader', 'humour', 'season', 'may', 'haunt', 'hous', 'pleasantli', 'one', 'wish', 'lay', 'faith', 'friend', 'servant', 'c', 'decemb', 'content', 'stave', 'marley', 'ghost', 'stave', 'ii', 'first', 'three', 'spirit', 'stave', 'iii', 'second', 'three', 'spirit', 'stave', 'iv', 'last', 'spirit', 'stave',

## Lemmatization

- Uses a dictionary to map variants to their root word
- Converts to an actual _word_
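
The dictionary idea can be sketched with a tiny hand-written lookup table. The table and the `toy_lemmatize` name are made up for this example; a real lemmatizer relies on a far larger dictionary such as WordNet.

```python
# Tiny hand-written lookup table (hypothetical; WordNet is far larger)
lemma_table = {
    "houses": "house",
    "readers": "reader",
    "mice": "mouse",
    "geese": "goose",
}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return lemma_table.get(word, word)

print([toy_lemmatize(w) for w in ["houses", "mice", "marley"]])
# ['house', 'mouse', 'marley']
```

Unlike suffix stripping, a lookup can handle irregular forms such as "mice", and every result is a real word.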

In [14]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [15]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [16]:
# Create the lemmatizer once and reuse it for every token
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words_christmas_carol_tokens_lemmed = [
    lemmatizer.lemmatize(w) for w in words_christmas_carol_tokens
]

print(words_christmas_carol_tokens_lemmed)



## Note on Part-of-Speech (POS) Tagging

> A simple but limited solution, since someone has to laboriously label the entire corpus. This process is extremely error-prone.
>
> There are other strategies to learn sentence structure and tags (HMMs & RNNs)

In [17]:
import os
from IPython.display import Image, display
from nltk.draw import TreeWidget
from nltk.draw.util import CanvasFrame

# Save the tree as a PostScript (.ps) file instead of opening a window with tree.draw()
def jupyter_draw_nltk_tree(tree, fn):
    cf = CanvasFrame()
    tc = TreeWidget(cf.canvas(), tree)
    tc['node_font'] = 'arial 13 bold'
    tc['leaf_font'] = 'arial 14'
    tc['node_color'] = '#005990'
    tc['leaf_color'] = '#3F8F57'
    tc['line_color'] = '#175252'
    cf.add_widget(tc, 10, 10)
    cf.print_to_file(f'{fn}.ps')
    cf.destroy()

![](images/I_shot_an_elephant_in_my_pajamas_0.png)

![](images/I_shot_an_elephant_in_my_pajamas_1.png)