# Text Processing

Here, we'll learn how to read text data from different sources and prepare it for feature extraction.

* [Udacity NLP Nanodegree Repository](https://github.com/udacity/cd0377-Introduction-to-Natural-Language-Processing)

ASIDE: Working with Udacity GPU Enabled Notebooks

Context Manager Example:
```python
from workspace_utils import active_session

with active_session():
    # do long-running work here
    pass  # placeholder so the block is valid Python
```

Iterator Wrapper Example:
```python
from workspace_utils import keep_awake

for i in keep_awake(range(5)):
    # do iteration with lots of work here
    pass  # placeholder so the loop body is valid Python
```

Reference Module - `workspace_utils.py`:
```python
import signal

from contextlib import contextmanager

import requests


DELAY = INTERVAL = 4 * 60  # interval time in seconds
MIN_DELAY = MIN_INTERVAL = 2 * 60
KEEPALIVE_URL = "https://nebula.udacity.com/api/v1/remote/keep-alive"
TOKEN_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token"
TOKEN_HEADERS = {"Metadata-Flavor":"Google"}


def _request_handler(headers):
    def _handler(signum, frame):
        requests.request("POST", KEEPALIVE_URL, headers=headers)
    return _handler


@contextmanager
def active_session(delay=DELAY, interval=INTERVAL):
    """
    Example:

    from workspace_utils import active_session

    with active_session():
        # do long-running work here
    """
    token = requests.request("GET", TOKEN_URL, headers=TOKEN_HEADERS).text
    headers = {'Authorization': "STAR " + token}
    delay = max(delay, MIN_DELAY)
    interval = max(interval, MIN_INTERVAL)
    original_handler = signal.getsignal(signal.SIGALRM)
    try:
        signal.signal(signal.SIGALRM, _request_handler(headers))
        signal.setitimer(signal.ITIMER_REAL, delay, interval)
        yield
    finally:
        # Stop the timer before restoring the handler so no pending alarm reaches the old handler
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, original_handler)


def keep_awake(iterable, delay=DELAY, interval=INTERVAL):
    """
    Example:

    from workspace_utils import keep_awake

    for i in keep_awake(range(5)):
        # do iteration with lots of work here
    """
    with active_session(delay, interval):
        yield from iterable
```
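
The reference module relies on `signal.setitimer` to raise `SIGALRM` repeatedly while the body of the `with` block runs. The sketch below illustrates only that mechanism: the keep-alive request is swapped for a counter, the intervals are shortened for demonstration, and `ticking`, `ticks`, and `_on_alarm` are hypothetical names. It assumes a Unix-like system and must run in the main thread.

```python
import signal
import time
from contextlib import contextmanager

ticks = []  # one entry per timer firing


def _on_alarm(signum, frame):
    # Stand-in for the keep-alive POST request
    ticks.append(time.monotonic())


@contextmanager
def ticking(delay=0.05, interval=0.05):
    """Raise SIGALRM every `interval` seconds while the block runs."""
    original_handler = signal.getsignal(signal.SIGALRM)
    try:
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.setitimer(signal.ITIMER_REAL, delay, interval)
        yield
    finally:
        # Stop the timer, then restore whatever handler was installed before
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, original_handler)


with ticking():
    time.sleep(0.3)  # simulated long-running work

print(len(ticks), "timer ticks recorded")
```

The handler runs between bytecode instructions of the main thread, so the work inside the block does not need to cooperate with the timer.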

# Capturing Text Data

The processing stage begins with reading text data.

Common sources:
1. Plain text file on your local machine, which can be read with Python's built-in `open()` function (usually inside a `with` block)
2. CSV file, which can be read with Pandas
3. Online data accessed via an API (application programming interface)

## Exercise on Text Processing:

1. Plain Text

In [1]:
ud_folder = 'cd0377-Introduction-to-Natural-Language-Processing'

In [2]:
import sys
from pathlib import Path
import os
import nltk
import pandas as pd
import requests
import json
import re
from bs4 import BeautifulSoup

In [3]:
dir_items = ['cd0377-Introduction-to-Natural-Language-Processing']
data_path = Path.cwd().parents[0].joinpath(*dir_items)
sys.path.append(str(data_path))  # sys.path entries must be strings, not Path objects

In [4]:
dir_items = [ud_folder, 'data', 'hieroglyph.txt']
data_path = Path.cwd().parents[0].joinpath(*dir_items)
with open(data_path, 'r') as f:
    text = f.read()
print(text)

Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.
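
When a text file may contain non-ASCII characters, it can help to state the encoding explicitly instead of relying on the platform default. The following is a minimal sketch, assuming UTF-8 content; the file name `sample.txt` and the sample string are made up for illustration and are not part of the course data.

```python
import tempfile
from pathlib import Path

# Made-up sample with non-ASCII characters (written with escapes for clarity)
sample = "Hieroglyphs were a formal script. Caf\u00e9, na\u00efve."

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"  # hypothetical file name
    sample_path.write_text(sample, encoding="utf-8")

    # Explicit encoding avoids depending on the operating system's default
    text_read = sample_path.read_text(encoding="utf-8")

print(text_read)
```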



2. Tabular Data

In [5]:
dir_items = [ud_folder, 'data', 'news.csv']
data_path = Path.cwd().parents[0].joinpath(*dir_items)
df = pd.read_csv(data_path)
display(df[['publisher', 'title']].head())
print()
print("Convert text column to lowercase")
df['title'] = df['title'].str.lower()
display(df[['publisher', 'title']].head())

| | publisher | title |
|---|---|---|
| 0 | Livemint | Fed's Charles Plosser sees high bar for change... |
| 1 | IFA Magazine | US open: Stocks fall after Fed official hints ... |
| 2 | IFA Magazine | Fed risks falling 'behind the curve', Charles ... |
| 3 | Moneynews | Fed's Plosser: Nasty Weather Has Curbed Job Gr... |
| 4 | NASDAQ | Plosser: Fed May Have to Accelerate Tapering Pace |



Convert text column to lowercase


| | publisher | title |
|---|---|---|
| 0 | Livemint | fed's charles plosser sees high bar for change... |
| 1 | IFA Magazine | us open: stocks fall after fed official hints ... |
| 2 | IFA Magazine | fed risks falling 'behind the curve', charles ... |
| 3 | Moneynews | fed's plosser: nasty weather has curbed job gr... |
| 4 | NASDAQ | plosser: fed may have to accelerate tapering pace |
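
If Pandas is not available, the same lowercasing step can be sketched with the standard library's `csv` module instead. This is a swapped-in technique rather than what the notebook does, and the two rows are copied from the output above (the first title is truncated there, so it is truncated here too).

```python
import csv
import io

# Two rows copied from the displayed output; the "..." is part of the source text
raw_csv = """publisher,title
Livemint,Fed's Charles Plosser sees high bar for change...
NASDAQ,Plosser: Fed May Have to Accelerate Tapering Pace
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))
for row in rows:
    # Mirrors df['title'].str.lower() for plain string values
    row["title"] = row["title"].lower()

for row in rows:
    print(row["publisher"], "|", row["title"])
```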


3. Online Source

In [7]:
# Keep TLS certificate verification on (the default) and bound the wait time
r = requests.get('https://quotes.rest/qod.json', timeout=10)
res = r.json()
print(json.dumps(res, indent=4))
print()
print("Get the quote of the day and the author")
qod_obj = res["contents"]["quotes"][0]
qod = qod_obj['quote']
auth = qod_obj['author']
print(qod, "\n--", auth)



{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "The determination to win is the better part of winning.",
                "length": "55",
                "author": "Daisaku Ikeda",
                "tags": {
                    "0": "buddhism",
                    "1": "determination",
                    "2": "doingyourbest",
                    "3": "humanism",
                    "4": "inspire",
                    "6": "winning"
                },
                "category": "inspire",
                "language": "en",
                "date": "2022-08-30",
                "permalink": "https://theysaidso.com/quote/daisaku-ikeda-the-determination-to-win-is-the-better-part-of-winning",
                "id": "N8DuJUQlTaAm_hHViB2sTAeF",
                "background": "https://theysaidso.com/img/qod/qod-inspire.jpg",
                "title": "Inspiring Quote of the day"
            }
        ]
    },
    "baseurl":

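Indexing straight into `res["contents"]["quotes"][0]` raises an exception whenever the API returns an error payload or an empty list. One possible way to guard against that is sketched below, using a trimmed literal payload that mirrors the response shape shown above; the variable names are illustrative only.

```python
import json

# Trimmed payload mirroring the structure of the response printed above
payload = """
{
    "success": {"total": 1},
    "contents": {
        "quotes": [
            {
                "quote": "The determination to win is the better part of winning.",
                "author": "Daisaku Ikeda"
            }
        ]
    }
}
"""

res_sample = json.loads(payload)

# .get() with defaults avoids a KeyError when a level is missing
quotes = res_sample.get("contents", {}).get("quotes", [])
if quotes:
    first = quotes[0]
    print(first.get("quote"), "\n--", first.get("author"))
else:
    print("No quote in response")
```
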
4. Cleaning

In [8]:
# Fetch a web page
r = requests.get("https://news.ycombinator.com")
print(r.text)

<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?CC3O3g8R3a6BgIE1hihI">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
                            <a href="newest">new</a> | <a href="front">past<

In [9]:
# Remove HTML tags using RegEx
pattern = re.compile(r'<.*?>')  # tags look like <...>
print(pattern.sub('', r.text))  # replace them with blank


        
          
        Hacker News
        
                  Hacker News
                            new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      The Big [Censored] Theory (pudding.cool)
        811 points by feross 7 hours ago  | hide | 464&nbsp;comments              
      
                
      2.      Stable Diffusion Textual Inversion (github.com/hlky)
        161 points by antman 3 hours ago  | hide | 73&nbsp;comments              
      
                
      3.      90s Cursor Effects (tholman.com)
        353 points by lysergia 7 hours ago  | hide | 119&nbsp;comments              
      
                
      4.      Don’t think to write, write to think (herbertlui.net)
        104 points by herbertl 4 hours ago  | hide | 34&nbsp;comments              
      
                
      5.      Why are you so busy? (tomlingham.com)
        56 point
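
Note that the regular expression strips the tags but leaves HTML entities such as `&nbsp;` untouched, which is why they still appear in the output above. A small sketch of one way to decode them with the standard library's `html.unescape`, using a made-up fragment rather than the live page:

```python
import html
import re

snippet = '<span class="subtext">464&nbsp;comments</span>'  # hypothetical fragment

no_tags = re.sub(r"<.*?>", "", snippet)  # strips tags, keeps entities
decoded = html.unescape(no_tags)         # &nbsp; becomes U+00A0 (non-breaking space)
cleaned = decoded.replace("\xa0", " ")   # normalize to a regular space

print(repr(no_tags))
print(repr(cleaned))
```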

In [10]:
# Remove HTML tags using Beautiful Soup library
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())


        
          
        Hacker News
        
                  Hacker News
                            new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      The Big [Censored] Theory (pudding.cool)
        811 points by feross 7 hours ago  | hide | 464 comments              
      
                
      2.      Stable Diffusion Textual Inversion (github.com/hlky)
        161 points by antman 3 hours ago  | hide | 73 comments              
      
                
      3.      90s Cursor Effects (tholman.com)
        353 points by lysergia 7 hours ago  | hide | 119 comments              
      
                
      4.      Don’t think to write, write to think (herbertlui.net)
        104 points by herbertl 4 hours ago  | hide | 34 comments              
      
                
      5.      Why are you so busy? (tomlingham.com)
        56 points by tjlingham 2 hou

In [11]:
summaries = soup.find_all("tr", class_="athing")
summaries[0]

<tr class="athing" id="32641028">
      <td align="right" class="title" valign="top"><span class="rank">1.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=32641028&amp;how=up&amp;goto=news" id="up_32641028"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="titlelink" href="https://pudding.cool/2022/08/censorship/">The Big [Censored] Theory</a><span class="sitebit comhead"> (<a href="from?site=pudding.cool"><span class="sitestr">pudding.cool</span></a>)</span></td></tr>

In [12]:
# Extract title
summaries[0].find("a", class_="titlelink").get_text().strip()

'The Big [Censored] Theory'

In [13]:
# Find all articles, extract titles
articles = []
summaries = soup.find_all("tr", class_="athing")
for summary in summaries:
    title = summary.find("a", class_="titlelink").get_text().strip()
    articles.append(title)

print(len(articles), "Article summaries found. Sample:")
print(articles[0])

30 Article summaries found. Sample:
The Big [Censored] Theory


## Normalization

5. Case Normalization

In [14]:
# Sample text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


In [15]:
text = text.lower()
print(text)

the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?


6. Punctuation Removal

In [19]:
# Replace every character that is not an ASCII letter or digit with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
print(text)

the first time you see the second renaissance it may look boring  look at it at least twice and definitely watch part 2  it will change your view of the matrix  are the human people the ones who started the war   is ai a bad thing  
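
The character class `[^a-zA-Z0-9]` only keeps ASCII letters and digits, so accented letters are replaced as if they were punctuation. The sketch below contrasts it with a `\w`-based pattern on a made-up sample; whether the Unicode-aware variant is preferable depends on the data, and note that `\w` also keeps underscores.

```python
import re

# Made-up sample; accented letters are written as escapes to avoid encoding ambiguity
sample = "caf\u00e9 d\u00e9j\u00e0-vu, na\u00efve!"

ascii_only = re.sub(r"[^a-zA-Z0-9]", " ", sample)  # pattern used in the cell above
unicode_aware = re.sub(r"[^\w\s]", " ", sample)    # keeps any Unicode letter or digit

print(ascii_only.split())
print(unicode_aware.split())
```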


7. Tokenization

In [20]:
# Split text into tokens (words)
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']
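
`str.split()` works here only because the punctuation was already replaced with spaces. On raw text it leaves punctuation attached to the neighbouring word, as this small sketch on a made-up sentence suggests; the regular expression shown is just one simple alternative, not a full tokenizer.

```python
import re

raw = "Is AI a bad thing? It's complicated."  # made-up sample

whitespace_tokens = raw.split()
# One token per run of word characters, or per single punctuation mark
regex_tokens = re.findall(r"\w+|[^\w\s]", raw)

print(whitespace_tokens)
print(regex_tokens)
```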


## NLTK: Natural Language ToolKit

In [24]:
import nltk
nltk.data.path.append(os.path.join(os.getcwd(), "nltk_data"))
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/daiglechris/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [29]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/daiglechris/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [36]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/daiglechris/nltk_data...


True

In [38]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/daiglechris/nltk_data...


True

In [25]:
# Another sample text
text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.


In [26]:
from nltk.tokenize import word_tokenize

# Split text into words using NLTK
words = word_tokenize(text)
print(words)

['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']


In [27]:
from nltk.tokenize import sent_tokenize

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences)

['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']
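
For comparison, a naive split on a period followed by a space breaks the abbreviation "Dr." into its own piece, which is the kind of case `sent_tokenize` handles above. A quick sketch using the same sample text:

```python
text_sample = (
    "Dr. Smith graduated from the University of Washington. "
    "He later started an analytics firm called Lux, which catered to enterprise customers."
)

# Naive approach: treat every ". " as a sentence boundary
naive_sentences = text_sample.split(". ")

for piece in naive_sentences:
    print(repr(piece))
```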


In [30]:
# List stop words
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [31]:
# Reset text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# Normalize it
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize it
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [32]:
# Remove stop words (build the set once for fast membership checks)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']
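
The NLTK stop word list printed earlier includes negations such as `'no'`, `'nor'`, and `'not'`, and dropping them can change the meaning of a sentence. For tasks like sentiment analysis it may be worth keeping them. The sketch below uses a tiny hand-written subset of the list and a made-up token sequence, purely for illustration.

```python
# Tiny hand-written subset of the NLTK list shown above (illustration only)
stop_subset = {"the", "is", "a", "not", "no", "nor"}
negations = {"not", "no", "nor"}              # words we choose to keep
custom_stop_words = stop_subset - negations

tokens = ["the", "film", "is", "not", "a", "bad", "thing"]  # made-up example

default_filtered = [t for t in tokens if t not in stop_subset]
custom_filtered = [t for t in tokens if t not in custom_stop_words]

print(default_filtered)
print(custom_filtered)
```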


## Part of Speech Tagging / Sentence Parsing

In [33]:
# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


## Stemming & Lemmatization

1. Stemming

In [34]:
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems (reuse a single stemmer instance)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

['first', 'time', 'see', 'second', 'renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definit', 'watch', 'part', '2', 'chang', 'view', 'matrix', 'human', 'peopl', 'one', 'start', 'war', 'ai', 'bad', 'thing']


2. Lemmatization

In [39]:
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form (nouns by default; reuse a single lemmatizer instance)
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing']


In [40]:
# Lemmatize verbs by specifying pos
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'start', 'war', 'ai', 'bad', 'thing']
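
The outputs above hint at the difference between the two approaches: stemming applies rules and can produce non-words such as 'renaiss', while lemmatization looks words up in a dictionary. The toy sketch below illustrates that contrast only. It is neither the Porter algorithm nor WordNet, and `SUFFIXES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical names.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list, checked in this order
LEMMA_TABLE = {"started": "start", "ones": "one", "boring": "bore"}  # hypothetical lookup


def toy_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)


sample_words = ["started", "boring", "ones", "thing"]
toy_stems = [toy_stem(w) for w in sample_words]
toy_lemmas = [toy_lemmatize(w) for w in sample_words]

print(toy_stems)
print(toy_lemmas)
```

The rule-based version needs no vocabulary but can yield a fragment like 'bor', whereas the lookup version only changes words it knows about.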


### Data Sources:

* `hieroglyph.txt`: [Wikipedia: Ancient Egyptian Writing](https://en.wikipedia.org/wiki/Ancient_Egypt#Writing)
* `news.csv`: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/News+Aggregator)

### Additional Resources:
* [Pandas: Working with Text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), technical documentation on working with text data
* [Quote of the Day API](http://quotes.rest/), a platform for accessing famous quotes on a daily basis
