# Day 4

## Natural Language Processing with NLTK and scikit-learn

# Preparation for Today

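Download the "stopwords", "names", "gutenberg", and "movie_reviews" corpora through the NLTK downloader (all of them are included in the "all-corpora" collection). The tokenizer and part-of-speech tagger models used later are downloaded the same way.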

~~~python
import nltk
nltk.download()
~~~
<img src="images/nltk-downloader.png" style="width: 500px;" align="middle"/>

In [None]:
import nltk
nltk.download()

# Agenda

* What is Natural Language Processing?
* Structuring Data
* Exploring Data Using NLTK
* Classification Using NLTK
* Topic Modeling in scikit-learn

<h1> What is Natural Language Processing? </h1>

<img src="images/watson.jpg" style="width: 500px;" align="middle"/>

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof. (Straight from Wikipedia...)

## What we are going to cover today:
### Document Level Exploration
### Categorizing Words
### Supervised Categorization
### Corpus Level Exploration

Today we will be working primarily with NLTK (http://www.nltk.org/), a tool designed initially for students of computational linguistics to lower the barrier to entry for natural language processing.  It is a great tool for teaching the initial concepts, but we will also be using sklearn in one of our later exercises.

# Structuring Data for Processing

In [None]:
import nltk
%matplotlib inline

The first step in getting natural language into a structured format is to "tokenize" it.  By default, tokens are produced by splitting on whitespace and separating the punctuation at the ends of words into tokens of their own.  NLTK's default word tokenizer also splits contractions at the apostrophe, so "don't" becomes the two tokens "do" and "n't".

In [None]:
sentence = """The morning had dawned clear and cold, 
              with a crispness that hinted at the end
              of summer."""

In [None]:
tokens = nltk.word_tokenize(sentence)
tokens[:5]
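
To make the splitting rule concrete, here is a rough approximation written with a regular expression. `simple_tokenize` is a hypothetical helper for illustration only, not part of NLTK, and it handles far fewer cases than `nltk.word_tokenize` (for one, it keeps contractions whole).

```python
import re

def simple_tokenize(text):
    """Rough approximation of word tokenization.

    Runs of word characters (with internal apostrophes kept) become tokens,
    and every other non-space character becomes a token of its own.
    """
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

example = "The morning had dawned clear and cold, with a crispness."
print(simple_tokenize(example))
```

Comparing this output with `nltk.word_tokenize` on the same sentence is a quick way to see which extra rules the real tokenizer applies.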

# NLTK for Basic Exploration

In [None]:
#load up some data and tokenize it
with open('data/Bran.txt') as b:
    data = b.read()
bran_token = nltk.word_tokenize(data) #tokenizes string data
bran_text = nltk.Text(bran_token) #passing a tokenized set of data to nltk.Text creates a Text object in NLTK 
bran_text #this object has lots of different functions we can call on it

## Quick Document Characterization

In [None]:
#how many tokens (words) are in this document?
len(bran_text)

In [None]:
#how many unique words?
len(set(bran_text))

In [None]:
#what are some of those words?
sorted(set(bran_text))[:10]

In [None]:
#calculating "lexical diversity"
len(set(bran_text))/len(bran_text)

In [None]:
#counting specific words
bran_text.count("snow")

In [None]:
#percentage of total words
100 * bran_text.count('the') / len(bran_text)
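
The last few calculations are easy to wrap in small reusable functions, which is handy for the exercise that follows. This is only a sketch: the function names and the toy token list are made up for illustration.

```python
def lexical_diversity(tokens):
    """Share of tokens that are distinct: unique tokens / total tokens."""
    return len(set(tokens)) / len(tokens)

def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total

# A made-up token list standing in for a real text.
toy_tokens = ["the", "snow", "fell", "on", "the",
              "wall", "and", "the", "snow", "stayed"]

print(lexical_diversity(toy_tokens))
print(percentage(toy_tokens.count("the"), len(toy_tokens)))
```

Both functions only need `len()`, `set()` and `.count()`, so they should work on an NLTK Text object such as `bran_text` as well as on a plain list.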

# Exercise

<p>Read in the "Storm of Swords" file (nl_sos_all.txt), tokenize it and create a Text object. Answer the following questions:</p>
1. How long is "Storm of Swords"?
2. How many unique words are in "Storm of Swords"?
3. What is the lexical diversity of "Storm of Swords"?
4. How many times does the word "king" appear in "Storm of Swords"?
5. What percentage of the total words is the word "a"?

In [None]:
with open('data/nl_ffc_all.txt') as p:
    data = p.read()
FFC_token = nltk.word_tokenize(data)
FFC_text = nltk.Text(FFC_token)
with open('data/nl_sos_all.txt') as p:
    data = p.read()
SoS_token = nltk.word_tokenize(data)
SoS_text = nltk.Text(SoS_token)

# Frequency Distributions

<p>How can we automatically identify the words of a text that are most informative about the topic and genre of the text?</p>

In [None]:
#the nltk.FreqDist() generates a frequency distribution for any Text object
FFC_dist = nltk.FreqDist(FFC_text)
FFC_dist.most_common(5)

In [None]:
len(FFC_text)

In [None]:
#plot the cumulative frequency of the 50 most common words
FFC_dist.plot(50, cumulative=True)

In [None]:
#these top 50 words are not very informative of the text
#If the frequent words don't help us, 
#how about the words that occur once only?
FFC_dist.hapaxes()[:10]
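
`nltk.FreqDist` is built on Python's `collections.Counter`, so the same ideas can be tried on a toy list without loading a book. The token list below is made up for illustration.

```python
from collections import Counter

toy_tokens = ["the", "king", "is", "dead", ",", "long", "live", "the", "king"]
counts = Counter(toy_tokens)

# Same idea as FreqDist.most_common(n): the n highest counts.
top_two = counts.most_common(2)

# Same idea as FreqDist.hapaxes(): tokens that occur exactly once.
hapaxes = [word for word, n in counts.items() if n == 1]

print(top_two)
print(hapaxes)
```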

# Fine-grained Selection of Words

<p>Next, let's look at the long words of a text; perhaps these will be more characteristic and informative. For this we adapt some notation from set theory. We would like to find the words from the vocabulary of the text that are more than 15 characters long. </p>


In [None]:
V = set(FFC_text)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)[:7]

In [None]:
#these very long words are often hapaxes
#it would be better to find frequently occurring 
#long words
sorted(w for w in set(FFC_text) 
       if len(w) > 8 and FFC_dist[w] > 35)

# Collocations and Bigrams
<p>A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not.</p>

In [None]:
list(nltk.bigrams(["easier", "said", "than", "done"]))

In [None]:
FFC_text.collocations()
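
One way to put a number on "unusually often" is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. The sketch below is an illustration of that idea only; `bigram_pmi` is a hypothetical helper, and NLTK's own `collocations()` may score and filter pairs differently.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Pairs seen fewer than min_count times are dropped, because PMI
    tends to overrate pairs built from very rare words.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)
    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# A made-up token stream in which "red wine" recurs.
toy_tokens = ("red wine the cup red wine the door "
              "the wine the king red wine the wall").split()

for pair, score in bigram_pmi(toy_tokens):
    print(pair, round(score, 2))
```

Calling `bigram_pmi(toy_tokens, min_count=1)` also scores the pairs that occur only once. In this toy data "the wine" then comes out with a negative score, meaning it occurs less often than chance would predict.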

# Exercise

1. Create a Frequency Distribution for Storm of Swords.
2. Identify 10 hapaxes for the text.
3. Identify some distinctive words in this text (tweaking the word length and frequency).
4. Generate collocations for Storm of Swords. How different are they from "Feast for Crows"?

# Searching Text

<p>There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context.</p>

In [None]:
FFC_text.concordance("king")
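
A concordance is simple to approximate by hand, which helps show what the view contains. `keyword_in_context` is a hypothetical helper written for this sketch; it is not how NLTK implements `concordance()`.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():  # ignore case when matching
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

toy_tokens = "the king is dead , long live the king".split()
for line in keyword_in_context(toy_tokens, "king", window=2):
    print(line)
```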

<p>A concordance permits us to see words in context. For example, we saw that king occurred in contexts such as "King Balon" and "the king is dead". What other words appear in a similar range of contexts?</p>

In [None]:
FFC_text.similar("bloody")

In [None]:
SoS_text.similar("bloody")

<p>The common_contexts method lets us examine just the contexts that are shared by two or more words, such as bloody and red.</p>

In [None]:
FFC_text.common_contexts(["bloody", "red"])
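
Here a "context" can be read as the pair of words immediately before and after a word. Under that assumption, shared contexts are just a set intersection. The helper names and the toy sentence below are made up for illustration, and NLTK's implementation may differ in its details.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each word to the set of (previous word, next word) pairs around it."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word.lower()].add((prev.lower(), nxt.lower()))
    return contexts

def shared_contexts(tokens, words):
    """Contexts in which every one of the given words appears."""
    contexts = contexts_by_word(tokens)
    return set.intersection(*(contexts[w.lower()] for w in words))

toy_tokens = "a bloody sword and a red sword lay by the red door".split()
print(shared_contexts(toy_tokens, ["bloody", "red"]))
```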

## Identifying the location of data within text

~~~python
FFC_text.dispersion_plot(["List","of","words","to","map"])
~~~

In [None]:
FFC_text.dispersion_plot(["Arya","Sam","Brienne", "death"])

# Exercise
<p>Using the "Storm of Swords" Text:</p>
1. Search for some of the distinctive words you've discovered using the concordance tool.
2. Experiment with the similar function.
3. Find some interesting pairs of words using the "common context" function.
4. Create a dispersion plot for the characters "Brienne", "Dany", "Catelyn", and "Jorah". Using this plot, can you identify which characters travel together?

# Part of Speech Tagging


<h4>Clean dishes are in the cabinet.</h4>
<h4>Clean dishes before going to work!</h4>


<img src="images/pos_tagger.png" style="width: 500px;" align="middle"/>

# NLTK's POS Tagger

~~~python
text = nltk.word_tokenize("And now for something completely different")
~~~

In [None]:
text = nltk.word_tokenize("And now for something completely different")

In [None]:
nltk.pos_tag(text)

In [None]:
nltk.help.upenn_tagset('RB')

## POS Tagging Can Use Context

In [None]:
confusing_text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")

In [None]:
nltk.pos_tag(confusing_text)

# Characterizing Tagged Data

In [None]:
tagged_FFC = nltk.pos_tag(FFC_text)

In [None]:
proper_names = [k for k,v 
                in tagged_FFC 
                if v == 'NNP']

In [None]:
proper_names[5:15]
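
Tagged data is a list of (word, tag) pairs, so ordinary Python tools are enough to summarize it. The pairs below are hand-written stand-ins for `nltk.pos_tag` output, not real tagger results.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
toy_tagged = [("Arya", "NNP"), ("watched", "VBD"), ("the", "DT"),
              ("Lannisters", "NNPS"), ("leave", "VB"), ("Harrenhal", "NNP")]

# How often each tag occurs.
tag_counts = Counter(tag for _, tag in toy_tagged)

# Words tagged as singular (NNP) or plural (NNPS) proper nouns.
proper_nouns = [word for word, tag in toy_tagged if tag in ("NNP", "NNPS")]

print(tag_counts.most_common())
print(proper_nouns)
```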

# Exercise
1. Run the Storm of Swords tokens through a tagger. (This could take a while.)
2. Extract all of the singular proper nouns (NNP) and plural proper nouns (NNPS).
3. What is the distribution of the different tags across this data set?

# Word Level Classification

### Names Dataset

In [None]:
#we will be using the "names" dataset included in NLTK
names = nltk.corpus.names
names.fileids()

In [None]:
male_names = names.words('male.txt') #text file containing a list of male names
female_names = names.words('female.txt') #text file containing a list of female names
[w for w in male_names if w in female_names][:5]

In [None]:
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()

Observe the clear peaks in the plot above: certain letters are far more common as the last letter of a female name than of a male name, and vice versa.
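
A conditional frequency distribution is, in effect, one frequency count per condition. The sketch below rebuilds the last-letter count with plain dictionaries, using a few made-up names in place of the NLTK names corpus.

```python
from collections import Counter, defaultdict

# Hypothetical miniature name lists standing in for the NLTK names corpus.
toy_names = {
    "female.txt": ["Anna", "Maria", "Sansa", "Ellen"],
    "male.txt": ["John", "Brandon", "Edward", "Luca"],
}

# One Counter per condition (file), keyed by each name's last letter.
last_letters = defaultdict(Counter)
for fileid, names_in_file in toy_names.items():
    for name in names_in_file:
        last_letters[fileid][name[-1]] += 1

print(last_letters["female.txt"].most_common(1))
print(last_letters["male.txt"].most_common(1))
```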

## Step 1: Create a Feature Extractor


In [None]:
def gender_features(word):
    return {'last_letter': word[-1]}


In [None]:
#this returned dictionary is our feature set
gender_features("Laura")

## Step 2: Create Labeled Test and Train Sets

In [None]:
import random
labeled_names = ([(name, 'male') for name 
                  in names.words('male.txt')] +
                 [(name, 'female') for name 
                  in names.words('female.txt')])
random.shuffle(labeled_names)
featuresets = [(gender_features(n), gender) for 
               (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

## Step 3: Train and test the classifier

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
classifier.classify(gender_features('Bran'))

In [None]:
classifier.classify(gender_features('Arya'))

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
classifier.show_most_informative_features(5)
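
To see roughly what a Naive Bayes classifier does with these feature dictionaries, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not NLTK's implementation: the function names are hypothetical, the training data is made up, and it uses simple add-one smoothing, whereas NLTK's classifier handles smoothing in its own way.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count labels and (label, feature name) -> feature value occurrences."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    seen_values = defaultdict(set)       # feature -> set of observed values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            seen_values[name].add(value)
    return label_counts, value_counts, seen_values

def classify(model, features):
    """Pick the label with the highest log prior plus log likelihoods."""
    label_counts, value_counts, seen_values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            # Add-one smoothing so unseen values never get probability zero.
            numerator = value_counts[(label, name)][value] + 1
            denominator = count + len(seen_values[name]) + 1
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data in the same shape as train_set above.
toy_train = [({"last_letter": "a"}, "female"), ({"last_letter": "a"}, "female"),
             ({"last_letter": "e"}, "female"), ({"last_letter": "n"}, "male"),
             ({"last_letter": "n"}, "male"), ({"last_letter": "d"}, "male")]

model = train_naive_bayes(toy_train)
print(classify(model, {"last_letter": "a"}))
print(classify(model, {"last_letter": "n"}))
```

Each label's score is its prior probability multiplied by the probability of every observed feature value given that label. The logarithms turn the product into a sum, which avoids numerical underflow when there are many features.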

# Exercise
<p>Modify the gender_features() function to provide the classifier with features encoding the length of the name, its first letter, and any other features that seem like they might be informative. Retrain the classifier with these new features, and test its accuracy.</p>

# Document Level Classification
<p>Now that we understand how to classify items at the word level, let's move up to an entire document.</p>
<p>First, we construct a list of documents, labeled with the appropriate categories. For this example, we've chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.</p>

In [None]:
from nltk.corpus import movie_reviews #another data set provided by NLTK, tags movie reviews as positive or negative
import random
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [None]:
print(documents[0][0][:100])
print(documents[0][1])

<p>Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. For document topic identification, we can define a feature for each word, indicating whether the document contains that word.</p>

In [None]:
all_words = nltk.FreqDist(w.lower() for w 
                          in movie_reviews.words())
#keep the 2000 most frequent words as features
#(list(all_words)[:2000] would not be sorted by frequency)
word_features = [w for w, count in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [None]:
featuresets = [(document_features(d), c) 
               for (d,c) in documents]
train_set, test_set = featuresets[100:],featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier,
                             test_set))
classifier.show_most_informative_features(5)

# Corpus Level Classification

We can also do this sort of thing at the corpus level.  (Corpus == a collection of related documents.)

In [None]:
doc_names = nltk.corpus.gutenberg.fileids()
doc_names

In [None]:
print(nltk.corpus.gutenberg.raw("whitman-leaves.txt")[:145])

# Topic Modeling Steps

1. Tokenize each document.
2. Remove stop words.
3. Remove infrequent terms.
4. Construct the document-term matrix.
5. Apply TF-IDF term weighting.
6. Apply the NMF model to the matrix.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import numpy as np

<h4> What is a Document Term Matrix? </h4>

<img src="images/doc-term-matrix.png" style="width: 500px;" align="middle"/>
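
A tiny document-term matrix can be built by hand to make the picture concrete. The three "documents" below are made up, and each cell holds a raw count with no weighting applied.

```python
toy_docs = ["the cat sat", "the dog sat", "the cat chased the dog"]

# Vocabulary: every distinct term, sorted so each term gets a fixed column.
vocabulary = sorted({term for doc in toy_docs for term in doc.split()})

# One row per document, one column per term, each cell a raw count.
doc_term_matrix = [[doc.split().count(term) for term in vocabulary]
                   for doc in toy_docs]

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```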

In [None]:
#Read in each document.
documents = [nltk.corpus.gutenberg.raw(fn) for 
             fn in doc_names]

In [None]:
#Use scikit-learn to apply tokenization and vectorization 
#to build a document-term matrix A for the corpus of documents:
tfidf = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS), 
                        lowercase=True, 
                        strip_accents="unicode", 
                        use_idf=True, norm="l2", 
                        min_df = 5) 
DTM = tfidf.fit_transform(documents)
DTM
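
TF-IDF weighting boosts terms that are frequent in one document but rare across the corpus. The sketch below computes the weights by hand for three made-up documents. It assumes raw counts for term frequency, the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, and L2 row normalization, which is how scikit-learn documents its default settings; treat it as an illustration, not a replacement for `TfidfVectorizer`.

```python
import math

toy_docs = [["cat", "sat"], ["dog", "sat"], ["cat", "chased", "dog"]]
vocabulary = sorted({term for doc in toy_docs for term in doc})
n_docs = len(toy_docs)

# Document frequency: in how many documents does each term appear?
doc_freq = {t: sum(t in doc for doc in toy_docs) for t in vocabulary}

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(doc):
    """Raw term counts times idf, scaled to unit Euclidean (L2) length."""
    weights = [doc.count(t) * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

toy_matrix = [tfidf_row(doc) for doc in toy_docs]
for row in toy_matrix:
    print([round(w, 3) for w in row])
```

In the third document "chased" receives a larger weight than "cat", because "chased" appears in only one document while "cat" appears in two.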

#### Store the list of terms for later use, whose indices correspond to the columns of the document-term matrix.

In [None]:
#Store the list of terms for later use, whose indices correspond 
#to the columns of the document-term matrix.
num_terms = len(tfidf.vocabulary_)
terms = [""] * num_terms
for term in tfidf.vocabulary_.keys():
    terms[ tfidf.vocabulary_[term] ] = term

<h4> What is Decomposition? </h4>

<img src="images/decomp_matrix.png" style="width: 800px;" align="middle"/>
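
Non-negative matrix factorization (NMF) looks for two smaller non-negative matrices, W (documents by topics) and H (topics by terms), whose product approximates the original matrix. The pure-Python sketch below uses the classic multiplicative update rules with random initialization. Both are assumptions of this illustration: scikit-learn's `NMF` uses its own solver and initialization options, and all helper names here are hypothetical.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

def frobenius_error(A, W, H):
    """Sum of squared differences between A and the product W H."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))

def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix A into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T A) / (W^T W H), element by element
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * x / (y + eps) for h, x, y in zip(rh, rx, ry)]
             for rh, rx, ry in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element by element
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * x / (y + eps) for w, x, y in zip(rw, rx, ry)]
             for rw, rx, ry in zip(W, num, den)]
    return W, H

# A made-up 4 x 4 matrix with two obvious blocks ("topics").
toy_A = [[2, 1, 0, 0],
         [4, 2, 0, 0],
         [0, 0, 1, 3],
         [0, 0, 2, 6]]

W_toy, H_toy = nmf(toy_A, k=2)
print(frobenius_error(toy_A, W_toy, H_toy))
```

Each row of H can be read as a topic (a weighting over terms), and each row of W says how strongly a document draws on each topic.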

In [None]:
model = decomposition.NMF(init="nndsvd", n_components=10, max_iter=200)
W = model.fit_transform(DTM)
H = model.components_

In [None]:
for topic_index in range( H.shape[0] ):
    top_indices = np.argsort( H[topic_index,:] )[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print( "Topic %d: %s" % ( topic_index, ", ".join( term_ranking ) ))

In [None]:
for i, fn in enumerate(doc_names):
    print(fn, W[i].argmax())

# Exercise

<p>Using the same methodology, perform NMF topic modeling on the bbc_news dataset (bbc_small.zip).</p>

<p>Hint: Try using fewer topics</p>