## Text Processing

### Capturing Text Data

#### Plain Text

In [1]:
# Read in plain text file
with open("hieroglyph.txt", 'r') as f:
    text = f.read()
    
print(text)

Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.



#### Tabular Data

In [2]:
import pandas as pd

# Extract text column from a dataframe
df = pd.read_csv('news.csv')
df.head()

Unnamed: 0,id,title,url,publisher,category,story,hostname,timestamp
0,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
1,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
2,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
3,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027
4,6,Plosser: Fed May Have to Accelerate Tapering Pace,http://www.nasdaq.com/article/plosser-fed-may-...,NASDAQ,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.nasdaq.com,1394470372212


In [3]:
df['title'] = df['title'].str.lower()
df.head()[['publisher', 'title']]

Unnamed: 0,publisher,title
0,Livemint,fed's charles plosser sees high bar for change...
1,IFA Magazine,us open: stocks fall after fed official hints ...
2,IFA Magazine,"fed risks falling 'behind the curve', charles ..."
3,Moneynews,fed's plosser: nasty weather has curbed job gr...
4,NASDAQ,plosser: fed may have to accelerate tapering pace


#### Online Resources

In [4]:
import requests

# Fetch the quote of the day as JSON
r = requests.get("https://quotes.rest/qod.json", timeout=10)
r.raise_for_status()  # fail early on HTTP errors
res = r.json()

In [5]:
q = res['contents']['quotes'][0]
print(q['quote'], '\n--', q['author'])

The man who removes a mountain begins by carrying away small stones.. 
-- Chinese Proverb


#### Cleaning

In [6]:
import requests

r = requests.get("https://news.ycombinator.com")
print(r.text)

<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?ByzR2VCazcHqBPYc6t4C">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
              <a href="newest">new</a> | <a href="front">past</a> | <a href=

In [7]:
import re

# Strip HTML tags with a non-greedy regex (entities such as &nbsp; are left as-is)
cleaned = re.sub(r'<.*?>', '', r.text)
cleaned

"\n        \n          \n        Hacker News\n        \n                  Hacker News\n              new | past | comments | ask | show | jobs | submit            \n                              login\n                          \n              \n\n              \n      1.      My simple GitHub project went Viral (gourav.io)\n        67 points by deadcoder0904 1 hour ago  | hide | 16&nbsp;comments              \n      \n                \n      2.      Yamauchi No.10 Family Office (y-n10.com)\n        326 points by cmod 5 hours ago  | hide | 95&nbsp;comments              \n      \n                \n      3.      Is WebAssembly magic performance pixie dust? (surma.dev)\n        146 points by pimterry 5 hours ago  | hide | 78&nbsp;comments              \n      \n                \n      4.      Facebook let fake engagement distort global politics: a whistleblower's account (theguardian.com)\n        78 points by annadane 1 hour ago  | hide | 11&nbsp;comments              \n      \n         

In [8]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())


        
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      My simple GitHub project went Viral (gourav.io)
        67 points by deadcoder0904 1 hour ago  | hide | 16 comments              
      
                
      2.      Yamauchi No.10 Family Office (y-n10.com)
        326 points by cmod 5 hours ago  | hide | 95 comments              
      
                
      3.      Is WebAssembly magic performance pixie dust? (surma.dev)
        146 points by pimterry 5 hours ago  | hide | 78 comments              
      
                
      4.      Facebook let fake engagement distort global politics: a whistleblower's account (theguardian.com)
        78 points by annadane 1 hour ago  | hide | 11 comments              
      
                
      5.      Effort to disrupt exploita

In [9]:
summaries = soup.find_all('tr', class_='athing')
summaries[0]

<tr class="athing" id="26804555">
      <td align="right" class="title" valign="top"><span class="rank">1.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=26804555&amp;how=up&amp;goto=news" id="up_26804555"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="storylink" href="https://gourav.io/blog/my-simple-github-project-went-viral">My simple GitHub project went Viral</a><span class="sitebit comhead"> (<a href="from?site=gourav.io"><span class="sitestr">gourav.io</span></a>)</span></td></tr>

In [10]:
summaries[0].find('a', class_='storylink').get_text().strip()

'My simple GitHub project went Viral'

In [11]:
# Find all articles, extract titles
articles = []
summaries = soup.find_all("tr", class_="athing")
for summary in summaries:
    title = summary.find("a", class_='storylink').get_text().strip()
    articles.append(title)

print(len(articles), "Article summaries found, \nSample: ")
print(articles[0])

30 Article summaries found, 
Sample: 
My simple GitHub project went Viral
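
The regex pass in `In [7]` strips tags but leaves HTML entities such as `&nbsp;` and long runs of whitespace in the result. A minimal sketch of a follow-up cleaning step, using only the standard library; the helper name `clean_html_text` and the sample string are assumptions for illustration, not part of the notebook:

```python
import html
import re

def clean_html_text(raw):
    """Strip tags, decode HTML entities, and collapse whitespace (hypothetical helper)."""
    no_tags = re.sub(r"<.*?>", "", raw, flags=re.DOTALL)  # non-greedy tag match
    decoded = html.unescape(no_tags)                       # e.g. &nbsp; becomes a non-breaking space
    return re.sub(r"\s+", " ", decoded).strip()            # \s also matches non-breaking spaces

sample = '<td class="subtext">67 points | <a href="item?id=1">16&nbsp;comments</a></td>'
print(clean_html_text(sample))
```

For real pages, a parser such as BeautifulSoup (as in `In [8]`) is likely more robust than a regex, since regexes do not understand nested or malformed markup.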


### Normalization

#### Case Normalization

In [12]:
# Sample text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


In [13]:
text = text.lower()

In [14]:
text

'the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?'

#### Punctuation Removal

In [15]:
import re

# Remove punctuation characters
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
print(text)

the first time you see the second renaissance it may look boring  look at it at least twice and definitely watch part 2  it will change your view of the matrix  are the human people the ones who started the war   is ai a bad thing  
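
The character class `[^a-zA-Z0-9]` treats every non-ASCII character as punctuation, so accented letters are replaced as well. A sketch of a Unicode-aware variant, assuming the goal is to keep letters and digits in any script; the function name `remove_punctuation` and the sample text are illustrative:

```python
import re

def remove_punctuation(text):
    """Replace anything that is not a word character or whitespace with a space."""
    # For str patterns, \w matches Unicode letters, digits, and underscore;
    # the underscore is listed separately so that it is removed too.
    return re.sub(r"[^\w\s]|_", " ", text)

sample = "café déjà-vu: naïve_approach, 100%!"
print(remove_punctuation(sample).split())
```

Which variant is appropriate depends on the corpus: the ASCII-only pattern is fine for the English sample above, while the Unicode-aware one avoids splitting words in other languages.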


### Tokenization

In [16]:
words = text.split()
words

['the',
 'first',
 'time',
 'you',
 'see',
 'the',
 'second',
 'renaissance',
 'it',
 'may',
 'look',
 'boring',
 'look',
 'at',
 'it',
 'at',
 'least',
 'twice',
 'and',
 'definitely',
 'watch',
 'part',
 '2',
 'it',
 'will',
 'change',
 'your',
 'view',
 'of',
 'the',
 'matrix',
 'are',
 'the',
 'human',
 'people',
 'the',
 'ones',
 'who',
 'started',
 'the',
 'war',
 'is',
 'ai',
 'a',
 'bad',
 'thing']
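
`str.split()` works here only because punctuation was already replaced by spaces. A sketch of a regex tokenizer that can run on raw text and keep punctuation marks as separate tokens; the pattern and the name `simple_tokenize` are simple assumptions, not a replacement for a full tokenizer:

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ grabs runs of letters/digits; [^\w\s] grabs one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Is AI a bad thing ?"))
print(simple_tokenize("It's part 2."))
```

Note that this naive pattern splits contractions such as `It's` into three tokens, which is one reason to prefer a dedicated tokenizer for real text.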

### NLTK: Natural Language Toolkit

In [17]:
# Another sample text
text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.


In [18]:
from nltk.tokenize import word_tokenize

word_tokenize(text)

['Dr.',
 'Smith',
 'graduated',
 'from',
 'the',
 'University',
 'of',
 'Washington',
 '.',
 'He',
 'later',
 'started',
 'an',
 'analytics',
 'firm',
 'called',
 'Lux',
 ',',
 'which',
 'catered',
 'to',
 'enterprise',
 'customers',
 '.']

In [19]:
from nltk.tokenize import sent_tokenize

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences)

['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']


In [20]:
# Reset text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# Normalize it
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize it
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [21]:
from nltk.corpus import stopwords

# Remove stopwords (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
words

['first',
 'time',
 'see',
 'second',
 'renaissance',
 'may',
 'look',
 'boring',
 'look',
 'least',
 'twice',
 'definitely',
 'watch',
 'part',
 '2',
 'change',
 'view',
 'matrix',
 'human',
 'people',
 'ones',
 'started',
 'war',
 'ai',
 'bad',
 'thing']

### Sentence Parsing

In [23]:
import nltk

# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)
# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


### Stemming and Lemmatization

#### Stemming

In [24]:
from nltk.stem import PorterStemmer

# Reduce words to their stems (stems need not be dictionary words)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
stemmed

['first',
 'time',
 'see',
 'second',
 'renaiss',
 'may',
 'look',
 'bore',
 'look',
 'least',
 'twice',
 'definit',
 'watch',
 'part',
 '2',
 'chang',
 'view',
 'matrix',
 'human',
 'peopl',
 'one',
 'start',
 'war',
 'ai',
 'bad',
 'thing']

#### Lemmatization

In [25]:
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form (the default part of speech is noun)
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing']


In [27]:
# Lemmatize the original words as verbs by passing pos='v'
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'start', 'war', 'ai', 'bad', 'thing']
