# Text Processing

Here, we'll learn how to read text data from different sources and prepare it for feature extraction.

* [Udacity NLP Nanodegree Repository](https://github.com/udacity/cd0377-Introduction-to-Natural-Language-Processing)

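ASIDE: Working with Udacity GPU-Enabled Notebooks

The helpers in `workspace_utils` keep a workspace session alive during long-running work by sending periodic keep-alive requests.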

Context Manager Example:
```python
from workspace_utils import active_session

with active_session():
    # do long-running work here
    pass  # placeholder so the example is valid Python
```

Iterator Wrapper Example:
```python
from workspace_utils import keep_awake

for i in keep_awake(range(5)):
    # do iteration with lots of work here
    pass  # placeholder so the example is valid Python
```

Reference Module - `workspace_utils.py`:
```python
import signal

from contextlib import contextmanager

import requests


DELAY = INTERVAL = 4 * 60  # interval time in seconds
MIN_DELAY = MIN_INTERVAL = 2 * 60
KEEPALIVE_URL = "https://nebula.udacity.com/api/v1/remote/keep-alive"
TOKEN_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token"
TOKEN_HEADERS = {"Metadata-Flavor":"Google"}


def _request_handler(headers):
    def _handler(signum, frame):
        requests.request("POST", KEEPALIVE_URL, headers=headers)
    return _handler


@contextmanager
def active_session(delay=DELAY, interval=INTERVAL):
    """
    Example:

    from workspace_utils import active_session

    with active_session():
        # do long-running work here
    """
    token = requests.request("GET", TOKEN_URL, headers=TOKEN_HEADERS).text
    headers = {'Authorization': "STAR " + token}
    delay = max(delay, MIN_DELAY)
    interval = max(interval, MIN_INTERVAL)
    original_handler = signal.getsignal(signal.SIGALRM)
    try:
        signal.signal(signal.SIGALRM, _request_handler(headers))
        signal.setitimer(signal.ITIMER_REAL, delay, interval)
        yield
    finally:
        signal.signal(signal.SIGALRM, original_handler)
        signal.setitimer(signal.ITIMER_REAL, 0)


def keep_awake(iterable, delay=DELAY, interval=INTERVAL):
    """
    Example:

    from workspace_utils import keep_awake

    for i in keep_awake(range(5)):
        # do iteration with lots of work here
    """
    with active_session(delay, interval):
        yield from iterable
```
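The timer mechanism behind `active_session` can be tried without touching the Udacity endpoints. The following is a minimal sketch, assuming a Unix-like system (where `SIGALRM` is available) and code running in the main thread. `periodic_alarm` and `ticks` are hypothetical names, and the callback only records timestamps instead of sending a keep-alive request.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def periodic_alarm(callback, delay=0.05, interval=0.05):
    """Call `callback` repeatedly while the with-block runs (hypothetical helper)."""
    original_handler = signal.getsignal(signal.SIGALRM)
    try:
        # Install a handler that ignores the signal details and runs the callback
        signal.signal(signal.SIGALRM, lambda signum, frame: callback())
        # First alarm after `delay` seconds, then one every `interval` seconds
        signal.setitimer(signal.ITIMER_REAL, delay, interval)
        yield
    finally:
        # Stop the timer first, then restore the handler that was installed before
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, original_handler)


ticks = []
with periodic_alarm(lambda: ticks.append(time.monotonic())):
    deadline = time.monotonic() + 0.3
    while time.monotonic() < deadline:
        time.sleep(0.01)  # stand-in for long-running work

print(f"callback fired {len(ticks)} times")
```

Unlike the reference module, this sketch cancels the timer before restoring the original handler, so a late alarm cannot reach the restored handler.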

# Capturing Text Data

The processing stage begins with reading text data.

Common sources:
1. A plain text file on your local machine, read with Python's built-in `open()` (typically inside a `with` block)
2. A CSV file, read with Pandas
3. Online data accessed via an API (application programming interface)

## Exercise on Text Processing:

1. Plain Text

```python
import sys
from pathlib import Path
import os
import nltk
import pandas as pd
import requests
import json
import re
from bs4 import BeautifulSoup
```

```python
ud_folder = 'cd0377-Introduction-to-Natural-Language-Processing'

# Add the course repository to the import path; sys.path entries should be strings
dir_items = [ud_folder]
data_path = Path.cwd().parents[0].joinpath(*dir_items)
sys.path.append(str(data_path))
```

```python
# Build the path to the sample file and read its full contents
dir_items = [ud_folder, 'data', 'hieroglyph.txt']
data_path = Path.cwd().parents[0].joinpath(*dir_items)
with open(data_path, 'r') as f:
    text = f.read()
print(text)
```

Output:
```text
Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.
```
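Calling `open()` without an `encoding` argument falls back to a platform-dependent default, which can garble non-ASCII text. Here is a minimal sketch of reading a file with an explicit encoding, assuming UTF-8 input. The temporary file and the `sample` string are stand-ins for `hieroglyph.txt`, not the real data.

```python
import tempfile
from pathlib import Path

sample = "Hieroglyphic writing dates from c. 3000 BC.\n"  # stand-in text

with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = Path(tmp_dir) / "sample.txt"  # hypothetical file name
    text_path.write_text(sample, encoding="utf-8")

    # Option 1: the open() context manager, as in the exercise
    with open(text_path, "r", encoding="utf-8") as f:
        text_from_open = f.read()

    # Option 2: pathlib's one-call shortcut
    text_from_path = text_path.read_text(encoding="utf-8")

print(text_from_open == text_from_path)
```

Both options close the file automatically; `Path.read_text` is simply shorter when the whole file is needed at once.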



2. Tabular Data

```python
dir_items = [ud_folder, 'data', 'news.csv']
data_path = Path.cwd().parents[0].joinpath(*dir_items)
df = pd.read_csv(data_path)
display(df[['publisher', 'title']].head())  # display() is provided by Jupyter/IPython
print()
print("Convert text column to lowercase")
df['title'] = df['title'].str.lower()
display(df[['publisher', 'title']].head())
```

Output:

| | publisher | title |
|---|---|---|
| 0 | Livemint | Fed's Charles Plosser sees high bar for change... |
| 1 | IFA Magazine | US open: Stocks fall after Fed official hints ... |
| 2 | IFA Magazine | Fed risks falling 'behind the curve', Charles ... |
| 3 | Moneynews | Fed's Plosser: Nasty Weather Has Curbed Job Gr... |
| 4 | NASDAQ | Plosser: Fed May Have to Accelerate Tapering Pace |

Convert text column to lowercase

| | publisher | title |
|---|---|---|
| 0 | Livemint | fed's charles plosser sees high bar for change... |
| 1 | IFA Magazine | us open: stocks fall after fed official hints ... |
| 2 | IFA Magazine | fed risks falling 'behind the curve', charles ... |
| 3 | Moneynews | fed's plosser: nasty weather has curbed job gr... |
| 4 | NASDAQ | plosser: fed may have to accelerate tapering pace |
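The same lowercasing step can be sketched without a CSV on disk. This illustration uses the standard library's `csv` module instead of the `pd.read_csv` call from the exercise, so it runs without pandas. The in-memory `raw_csv` string is made-up sample data and `lowercase_column` is a hypothetical helper.

```python
import csv
import io

# In-memory stand-in for news.csv (made-up sample, not the real file)
raw_csv = (
    "publisher,title\n"
    "NASDAQ,Plosser: Fed May Have to Accelerate Tapering Pace\n"
    'Example Wire,"Stocks Fall, Then Recover"\n'
)


def lowercase_column(rows, column):
    """Return copies of the rows with one text column lowercased."""
    return [{**row, column: row[column].lower()} for row in rows]


# DictReader maps each data row to the header names on the first line
rows = list(csv.DictReader(io.StringIO(raw_csv)))
lowered = lowercase_column(rows, "title")

for row in lowered:
    print(row["publisher"], "|", row["title"])
```

Returning new dictionaries leaves `rows` untouched, which makes it easy to compare the text before and after normalization.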


3. Online Source

```python
# Note: verify=False disables TLS certificate verification for this request
r = requests.get('https://quotes.rest/qod.json', verify=False)
res = r.json()
print(json.dumps(res, indent=4))
print()
print("Get the quote of the day and the author")
qod_obj = res["contents"]["quotes"][0]
qod = qod_obj['quote']
auth = qod_obj['author']
print(qod, "\n--", auth)
```



Output:
```json
{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "The man who removes a mountain begins by carrying away small stones..",
                "length": "71",
                "author": "Chinese Proverb",
                "tags": [
                    "inspire",
                    "moving-mountains"
                ],
                "category": "inspire",
                "language": "en",
                "date": "2022-08-26",
                "permalink": "https://theysaidso.com/quote/chinese-proverb-the-man-who-removes-a-mountain-begins-by-carrying-away-small-sto",
                "id": "h7Gyu282q_vzBWvn2zdmtweF",
                "background": "https://theysaidso.com/img/qod/qod-inspire.jpg",
                "title": "Inspiring Quote of the day"
            }
        ]
    },
    "baseurl": "https://theysaidso.com",
    "copyright": {
        "year": 2024,
        "url": "https://theysaidso.com"
    }
}
```
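Indexing straight into `res["contents"]["quotes"][0]` raises an exception if the API returns an error payload or an empty list. Here is a minimal sketch of a more defensive lookup, assuming the response shape shown above. `extract_quote` is a hypothetical helper and `payload` is a trimmed copy of that response, so no network call is made.

```python
def extract_quote(payload):
    """Return (quote, author) from a quote-of-the-day payload, or None if absent."""
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")


# Trimmed copy of the response shown above
payload = {
    "contents": {
        "quotes": [
            {
                "quote": "The man who removes a mountain begins by carrying away small stones..",
                "author": "Chinese Proverb",
            }
        ]
    }
}

result = extract_quote(payload)
if result is not None:
    quote, author = result
    print(quote, "\n--", author)
```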

Get the quote 