### [Speech: "Friends, Romans, countrymen, lend me your ears" BY WILLIAM SHAKESPEARE](https://www.poetryfoundation.org/poems/56968/speech-friends-romans-countrymen-lend-me-your-ears)


In [None]:
speech = 'Friends, Romans, countrymen, lend me your ears; \
I come to bury Caesar, not to praise him. \
The evil that men do lives after them; \
The good is oft interred with their bones; \
So let it be with Caesar. The noble Brutus \
Hath told you Caesar was ambitious: \
If it were so, it was a grievous fault, \
And grievously hath Caesar answer’d it. \
Here, under leave of Brutus and the rest– \
For Brutus is an honourable man; \
So are they all, all honourable men– \
Come I to speak in Caesar’s funeral. \
He was my friend, faithful and just to me: \
But Brutus says he was ambitious; \
And Brutus is an honourable man. \
He hath brought many captives home to Rome \
Whose ransoms did the general coffers fill: \
Did this in Caesar seem ambitious? \
When that the poor have cried, Caesar hath wept: \
Ambition should be made of sterner stuff: \
Yet Brutus says he was ambitious; \
And Brutus is an honourable man. \
You all did see that on the Lupercal \
I thrice presented him a kingly crown, \
Which he did thrice refuse: was this ambition? \
Yet Brutus says he was ambitious; \
And, sure, he is an honourable man. \
I speak not to disprove what Brutus spoke, \
But here I am to speak what I do know. \
You all did love him once, not without cause: \
What cause withholds you then, to mourn for him? \
O judgment! thou art fled to brutish beasts, \
And men have lost their reason. Bear with me; \
My heart is in the coffin there with Caesar, \
And I must pause till it come back to me.'

### Word Frequency

In [None]:
from collections import Counter

word_freq = Counter(speech.split())

word_freq

In [None]:
# sort from high to low frequency words

sorted(word_freq.items(), key=lambda x: x[1], reverse=True)

## Text Processing

The most common way to deal with text documents is to first convert them into numeric vectors (typically stored as a sparse matrix), and then perform additional analysis, such as clustering, classification, and visualization, using those vectors. This representation is usually referred to as the 'Bag-of-Words' or 'Vector Space' model.

But before we convert the text into a numeric vector form, we should clean it up.
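
To make the bag-of-words idea concrete, here is a minimal sketch that uses only the standard library. The two toy documents are hypothetical and exist only for illustration; each document becomes a vector of word counts aligned with a shared vocabulary.

```python
from collections import Counter

# Two toy documents (hypothetical, for illustration only)
docs = ["the noble brutus", "brutus says caesar was ambitious"]

# Vocabulary: every distinct word across all documents, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned with the vocabulary
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)
for vec in vectors:
    print(vec)
```

Most entries are zero because each document uses only a small part of the vocabulary, which is why these vectors are usually stored as a sparse matrix.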

### 1. Remove Punctuations

In [None]:
import string

string.punctuation

In [None]:
all_punctuations = set(string.punctuation)

all_punctuations

In [None]:
speech_clean = ''.join(l for l in speech if l not in all_punctuations)

speech_clean

### 2. Convert to Upper/Lower-case

In [None]:
speech_clean = speech_clean.lower()

speech_clean

### 3. Remove Stop Words

There's no standard definition of "stop words", but the term usually refers to the most common words in a language, such as 'a', 'the', and 'at'.

The `scikit-learn` package provides a list of stop words. 

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS

Let's discard all stop words from the text.

In [None]:
speech_words = [word for word in speech_clean.split() if word not in ENGLISH_STOP_WORDS]

set(speech_words)

### 4. Stemming

Stemming is the process of reducing inflected or derived words to their word stem, base or root form. There are several stemming algorithms available; we will use *Porter* and *Lancaster* stemmers in this exercise.

Let's first take a look at an example.

In [None]:
# import
from nltk.stem import PorterStemmer

# initialize the stemmer
stemmer = PorterStemmer()

# example 1
for word in ['Play', 'Playing', 'Played']:
    
    stem = stemmer.stem(word)
    
    print ('Word:', word, '\t --> Stem:', stem)

In [None]:
# example 2

for word in ['grievous', 'grievously']:
    
    stem = stemmer.stem(word)
    
    print ('Word:', word, ' \t --> Stem:', stem)

Let's try another stemmer.

In [None]:
# import
from nltk.stem import LancasterStemmer

# initialize
stemmer = LancasterStemmer()

for word in ['grievous', 'grievously']:
    
    stem = stemmer.stem(word)
    
    print ('Word:', word, ' \t --> Stem:', stem)

Now apply the stemmer to the speech text. Note that `stemmer` currently refers to the Lancaster stemmer created above.

In [None]:
# create an empty array to store the results (i.e., stems)
stems = []

for word in speech_words:
    
    # check if it's a stop word
    if word not in ENGLISH_STOP_WORDS:
        
        # append the stem for each word to the output array
        stems.append(stemmer.stem(word))
   
set(stems)

*Note: [Julie Beth Lovins](https://en.wikipedia.org/wiki/Julie_Beth_Lovins), a computational linguist, published the first-ever stemming algorithm in 1968.*

### 5. Lemmatization

_Stemming_ usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. _Lemmatization_ usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. [[source]](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
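
The contrast can be sketched with two toy functions. This is not how the Porter, Lancaster, or WordNet implementations work; the suffix list and the lookup table below are hypothetical and deliberately tiny. The "stemmer" blindly chops suffixes, while the "lemmatizer" looks words up in a vocabulary.

```python
# Hypothetical suffix list and lookup table, for illustration only
SUFFIXES = ("ing", "ed", "ly", "s")
LEMMA_TABLE = {"knew": "know", "knowing": "know", "played": "play"}

def toy_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Return the dictionary form if the word is in the table."""
    return LEMMA_TABLE.get(word, word)

for word in ["knowing", "knew", "grievous"]:
    print(word, "--> stem:", toy_stem(word), "--> lemma:", toy_lemma(word))
```

The suffix rule cannot relate 'knew' to 'know', and it over-trims 'grievous' to 'grievou'. The lookup handles the irregular form but only for words it already contains.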

In [None]:
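# import
import nltk
from nltk.stem import WordNetLemmatizer

# the lemmatizer relies on the WordNet corpus (only needs to be downloaded once)
nltk.download('wordnet')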

# initialize
lemmatizer = WordNetLemmatizer()

for word in ['know', 'knowing', 'knew', 'knowledge']:
    
    lemma = lemmatizer.lemmatize(word)
    stem = stemmer.stem(word)
    
    print ('Word:', word, '--> Stem:', stem, '--> Lemma:', lemma)

By default, `lemmatize()` assumes that each word is a noun. To lemmatize verbs correctly, we must provide the context in which the word is used. This is referred to as the Part-of-Speech (POS).

In [None]:
for word in ['know', 'knowing', 'knew', 'knowledge']:
    
    # adding pos argument to lemmatize()
    lemma = lemmatizer.lemmatize(word, pos='v')
    stem = stemmer.stem(word)
    
    print ('Word:', word, '--> Stem:', stem, '--> Lemma:', lemma)

In [None]:
# create an empty array to store the results (i.e., lemmas)
lemmas = []

for word in speech_words:
    
    # check if it's a stop word
    if word not in ENGLISH_STOP_WORDS:
        
        # append the lemma for each word to the output list
        lemmas.append(lemmatizer.lemmatize(word, 'v'))
    
set(lemmas)

In [None]:
len(set(lemmas))

In [None]:
len(set(stems))

In [None]:
len(speech_clean.split())

Stemming and lemmatization are closely related. Unlike lemmatization, stemming doesn't take the context (part of speech) into account, but stemmers typically run faster. In Information Retrieval applications, stemming tends to improve the True Positive Rate (recall) but reduce the True Negative Rate (specificity), because merging more word forms retrieves more relevant documents along with more irrelevant ones.
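
Here is a small worked example of that trade-off. The confusion-matrix counts are made up for illustration and do not come from any real retrieval experiment; they only show how the two rates are computed and how they can move in opposite directions.

```python
def recall(tp, fn):
    """True Positive Rate: share of relevant documents that were retrieved."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True Negative Rate: share of irrelevant documents that were left out."""
    return tn / (tn + fp)

# Hypothetical counts for the same query, without and with stemming
no_stem = dict(tp=30, fn=20, fp=10, tn=940)
with_stem = dict(tp=45, fn=5, fp=40, tn=910)

for label, m in [("no stemming", no_stem), ("stemming", with_stem)]:
    print(label,
          "recall:", round(recall(m["tp"], m["fn"]), 3),
          "specificity:", round(specificity(m["tn"], m["fp"]), 3))
```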

### Bringing it all together

For the next part of this exercise, let's analyze transcripts from a couple of US presidential inaugural addresses.

In [None]:
# change the filepath to your local machine where the speech text files are located

trump_speech_transcript = r"C:\Users\visha\derive Dropbox\clients\vcu\python\misc\inaugural_speech_trump.txt"
obama_speech_transcript = r"C:\Users\visha\derive Dropbox\clients\vcu\python\misc\inaugural_speech_obama.txt"

We will use the NLTK tokenizer to split the text into words. More details about the tokenizer can be found [here](https://www.nltk.org/_modules/nltk/tokenize/punkt.html).

In [None]:
import string
import nltk

nltk.download('punkt')

In [None]:
def create_tokens(infile):
    
    with open(infile) as f:
        
        # read the entire file and convert it into lowercase
        line = f.read().lower()

        # remove all punctuations
        line_clean = ''.join(l for l in line if l not in all_punctuations)
        
        # remove all stop words (this will create a list of words)
        line_words = [word for word in line_clean.split() if word not in ENGLISH_STOP_WORDS]

        # join all those words to create a line (of text) again
        line_clean = ' '.join(line_words)

        # tokenize
        tokens = nltk.word_tokenize(line_clean)
        
        return tokens

tokens = create_tokens(trump_speech_transcript)

tokens[:5]

In [None]:
count = Counter(tokens)

print (count.most_common(10))

There are some non-alphabetical characters that were not captured by the `all_punctuations` filter.
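
One likely cause is that `string.punctuation` only contains ASCII punctuation. The sketch below uses a made-up sample string containing typographic characters similar to the curly apostrophe and dash in the Shakespeare speech above; whether the transcripts contain these exact characters is an assumption.

```python
import string

# Hypothetical sample containing typographic (non-ASCII) punctuation
sample = "Caesar’s funeral – “applause”"

ascii_punct = set(string.punctuation)

# Characters that are neither letters/digits, whitespace, nor ASCII punctuation
leftover = [ch for ch in sample
            if not ch.isalnum() and not ch.isspace() and ch not in ascii_punct]

# Filtering on ASCII punctuation alone leaves the sample unchanged
cleaned = ''.join(ch for ch in sample if ch not in ascii_punct)

print(leftover)
print(cleaned == sample)
```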

In [None]:
import re

def create_tokens(infile):
    
    with open(infile) as f:
        
        # read the entire file and convert it into lowercase
        line = f.read().lower()

        # remove all punctuations
        line_clean = ''.join(l for l in line if l not in all_punctuations)
        
        # remove all stop words (this will create a list of words)
        # in addition, use regex to replace non-alphabetic characters with an empty string
        line_words = [re.sub("[^a-zA-Z]", '', word) for word in line_clean.split() if word not in ENGLISH_STOP_WORDS]
        
        # join all those words to create a line (of text) again
        line_clean = ' '.join(line_words)

        # tokenize
        tokens = nltk.word_tokenize(line_clean)
        
        return tokens

tokens_trump = create_tokens(trump_speech_transcript)

token_count_trump = Counter(tokens_trump)

print (token_count_trump.most_common(10))

*Note: For an explanation of how that regular expression replaces all non-letter characters with '' (an empty string), please follow this [link](https://stackoverflow.com/questions/47561298/python-regex-remove-escape-characters-and-punctuation-except-for-apostrophe?rq=1).*
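
As a quick illustration, the snippet below applies the same patterns to a few made-up words. The character class `[^a-zA-Z]` matches any single character that is *not* an ASCII letter, and `re.sub` replaces each match with an empty string.

```python
import re

# Hypothetical words, for illustration only
words = ["america’s", "–", "2017", "people"]

# Remove every character that is not an ASCII letter
cleaned = [re.sub("[^a-zA-Z]", "", w) for w in words]

# Words made up entirely of non-letters become empty strings
non_empty = [w for w in cleaned if w]

# A variant that also keeps apostrophes and spaces
kept = re.sub("[^a-zA-Z' ]+", "", "o'er–land")

print(cleaned)
print(non_empty)
print(kept)
```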

In [None]:
# let's create tokens from Obama's speech

tokens_obama = create_tokens(obama_speech_transcript)

token_count_obama = Counter(tokens_obama)

print (token_count_obama.most_common(10))

We need to remove 'applause' from this list as it's not part of the speech.

In [None]:
def create_tokens(infile):
    
    with open(infile) as f:
        
        # read the entire file and convert it into lowercase
        line = f.read().lower()

        # remove all punctuations
        line_clean = ''.join(l for l in line if l not in all_punctuations)
        
        # remove all stop words (this will create a list of words)
        # in addition, use regex to strip everything except letters, apostrophes, and spaces
        line_words = [re.sub("[^a-zA-Z' ]+", '', word) for word in line_clean.split() if word not in ENGLISH_STOP_WORDS
                     and word != 'applause']

        # join all those words to create a line (of text) again
        line_clean = ' '.join(line_words)

        # tokenize
        tokens = nltk.word_tokenize(line_clean)
        
        return tokens

tokens_obama = create_tokens(obama_speech_transcript)

token_count_obama = Counter(tokens_obama)

print (token_count_obama.most_common(10))

### TF-IDF Vectorization

TF-IDF stands for Term Frequency – Inverse Document Frequency. The idea behind this metric is to rescale the frequency of each word by how often it appears across all documents. Words that are common across all documents are penalized, and as a result, the words that are most distinctive (and frequent) within a document are emphasized. Read more about TF-IDF [here](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting).
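
To see the arithmetic, here is a hand-rolled sketch on two made-up documents. It assumes the smoothed formula that the scikit-learn user guide describes as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The documents and variable names are hypothetical.

```python
import math
from collections import Counter

# Two toy documents (hypothetical, for illustration only)
docs = {
    "a": "america great america jobs",
    "b": "freedom journey america",
}

n_docs = len(docs)
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({w for c in counts.values() for w in c})

# Document frequency: in how many documents does each word appear?
df = {w: sum(1 for c in counts.values() if w in c) for w in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

tfidf = {}
for name, c in counts.items():
    raw = [c[w] * idf[w] for w in vocab]          # term count times idf
    norm = math.sqrt(sum(v * v for v in raw))     # L2 norm of the row
    tfidf[name] = {w: v / norm for w, v in zip(vocab, raw)}

for name, weights in tfidf.items():
    print(name, {w: round(v, 3) for w, v in weights.items() if v > 0})
```

Because 'america' appears in both documents, it gets the lowest possible IDF (1.0). In document "b" it therefore ends up with a smaller weight than 'freedom' or 'journey', even though all three words occur once.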

First, let's store the cleaned text of each speech in a dictionary.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tokens_all = {}

with open(trump_speech_transcript) as f:

    # read the entire file and convert it into lowercase
    line = f.read().lower()

    # remove all punctuations
    line_clean = ''.join(l for l in line if l not in all_punctuations)

    # remove all stop words (this will create a list of words)
    # in addition, use regex to strip everything except letters, apostrophes, and spaces
    line_words = [re.sub("[^a-zA-Z' ]+", '', word) for word in line_clean.split() if word not in ENGLISH_STOP_WORDS
                 and word != 'applause']

    # join all those words to create a line (of text) again
    line_clean = ' '.join(line_words)

    tokens_all['trump'] = line_clean
    
with open(obama_speech_transcript) as f:

    # read the entire file and convert it into lowercase
    line = f.read().lower()

    # remove all punctuations
    line_clean = ''.join(l for l in line if l not in all_punctuations)

    # remove all stop words (this will create a list of words)
    # in addition, use regex to strip everything except letters, apostrophes, and spaces
    line_words = [re.sub("[^a-zA-Z' ]+", '', word) for word in line_clean.split() if word not in ENGLISH_STOP_WORDS
                 and word != 'applause']

    # join all those words to create a line (of text) again
    line_clean = ' '.join(line_words)

    tokens_all['obama'] = line_clean

In [None]:
tokens_all.keys()

In [None]:
tokens_all

In [None]:
tfidf = TfidfVectorizer()

tfs_matrix = tfidf.fit_transform(tokens_all.values())

print(tfs_matrix)

This is how a sparse matrix is printed in Python: only the non-zero entries are listed, each as a `(row, column)` coordinate followed by its value.
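
The idea can be sketched in plain Python with a made-up dense matrix. This only mimics the coordinate-style listing; the matrix returned by `TfidfVectorizer` is a SciPy sparse matrix with its own compressed storage format.

```python
# A small dense matrix with mostly zero entries (hypothetical values)
dense = [
    [0.0, 0.8, 0.0, 0.6],
    [0.5, 0.0, 0.0, 0.0],
]

# Keep only the non-zero entries as (row, column, value) triples
triples = [(r, c, v)
           for r, row in enumerate(dense)
           for c, v in enumerate(row)
           if v != 0.0]

for r, c, v in triples:
    print((r, c), v)
```

Storing three entries instead of eight matters little here, but a vocabulary with thousands of words makes the savings substantial.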

In [None]:
# feature ("column") names

print(tfidf.get_feature_names_out()[:10])

In [None]:
# convert from sparse matrix to dense matrix

tfs_matrix.todense()

Let's create a `pandas` dataframe.

In [None]:
import pandas as pd

feature_names = tfidf.get_feature_names_out()

scores = tfs_matrix.todense().tolist()

df = pd.DataFrame(scores, columns=feature_names, index=['trump', 'obama'])

df.head()

Let's select the top 10 most _common_ words from Trump's speech and then we will select those columns from the above data frame.

In [None]:
token_count_trump.most_common(10)

In [None]:
# we can iterate through this list of words
for w, c in token_count_trump.most_common(10):
    print (w, c)

In [None]:
# let's store these words in an array
words_trump = []

for w, c in token_count_trump.most_common(10):
    words_trump.append(w)
    
words_trump

In [None]:
df[words_trump].T

The words 'america' and 'american' were the most commonly used words in Trump's inaugural speech, and they were also _distinctive_ to Trump as compared to Obama. (In other words, Obama didn't use these two words as frequently as Trump did.)

Similarly, let's select the top 10 most common words from Obama's speech and then select those columns.

In [None]:
words_obama = []

for w, c in token_count_obama.most_common(10):
    words_obama.append(w)
    
words_obama

df[words_obama].T

Words such as 'freedom', 'equal', and 'journey' were some of the most commonly used words by Obama in his inaugural speech, while Trump did not use these words even once.

### Reading Text from Web-pages

The following web-site contains US Presidential inauguration speeches: http://avalon.law.yale.edu/subject_menus/inaug.asp

We will use `requests` and `BeautifulSoup` packages to read data directly from this web-site.

In [None]:
from bs4 import BeautifulSoup
import requests

url = 'http://avalon.law.yale.edu/21st_century/obama.asp'

Step 1: Ping the web-page for information. This is called making a request.

In [None]:
source_code = requests.get(url)

Step 2: Use Beautiful Soup to parse the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser.

In [None]:
soup = BeautifulSoup(source_code.content)

In [None]:
# view the content

soup

Step 3: Extract the text part of the page.

In [None]:
speech_obama = soup.get_text()

speech_obama

Step 4: Extract a specific portion of the text chunk which contains the actual speech.

Note that the speech starts with 'My fellow citizens', which is immediately preceded by the following: `\r\n\n\n\n`. Let's split the text chunk using `\r\n\n\n\n` as the separator, and then take the second piece of the result.

In [None]:
speech_obama = speech_obama.split('\r\n\n\n\n')[1]

speech_obama

Now the speech actually ends with 'And God bless the United States of America.', which is immediately followed by a run of newlines (`\n\n\n\n\n`). Let's split the text chunk using `\n\n\n\n` as the separator, which matches the start of that run, and then take the *first* piece of the result.

In [None]:
speech_obama = speech_obama.split('\n\n\n\n')[0]

speech_obama
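
The two splits can be tried on a short made-up string first. The `page_text` value and the `between()` helper below are hypothetical; the helper sketches an alternative that slices on the known opening and closing phrases instead of counting newline characters, which may be less fragile if the page's whitespace changes.

```python
# Hypothetical stand-in for the text extracted from the page
page_text = "Title\r\n\n\n\nMy fellow citizens ... America.\n\n\n\n\nFooter"

# Same two-step approach as above: drop the header, then drop the footer
body = page_text.split('\r\n\n\n\n')[1]
body = body.split('\n\n\n\n')[0]

def between(text, start, end):
    """Return the slice of text from `start` through `end` (inclusive)."""
    i = text.index(start)
    j = text.index(end, i) + len(end)
    return text[i:j]

print(body)
print(between(page_text, "My fellow citizens", "America."))
```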

Step 5: Clean and Tokenize!

In [None]:
# convert the speech text into lowercase
line = speech_obama.lower()

# remove all punctuations
line_clean = ''.join(l for l in line if l not in all_punctuations)

# remove all stop words (this will create a list of words)
line_words = [word for word in line_clean.split() if word not in ENGLISH_STOP_WORDS]

# join all those words to create a line (of text) again
line_clean = ' '.join(line_words)

# tokenize
tokens = nltk.word_tokenize(line_clean)

tokens[:5]

___

**Applications of Text Mining:**

1. Text (or Document) Categorization
2. Text Clustering
3. Sentiment Analysis
4. Document Summarization
5. Topic Extraction
6. Document Associations
7. Etc.

**Resources:**
    
1. [NLTK 3.4 documentation](http://www.nltk.org/index.html)
2. [spaCy API](https://spacy.io/api)
3. [scikit-learn TF-IDF Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
4. [NLTK's WordNet Interface](http://www.nltk.org/howto/wordnet.html)
5. [Modern NLP in Python by Patrick Harrison | PyData DC 2016](https://www.youtube.com/watch?v=6zm9NC9uRkk) (YouTube)
6. [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/) (Book)
7. [Text Feature Extraction using scikit-learn](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)