# Text Processing with Python

Packages Discussed:

- [readability-lxml](https://github.com/buriy/python-readability) and [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)
- [Pattern](http://www.clips.ua.ac.be/pattern)
- [NLTK](http://www.nltk.org/)
- [TextBlob](http://textblob.readthedocs.org/en/dev/)
- [spaCy](https://honnibal.github.io/spaCy/updates.html)
- [gensim](https://radimrehurek.com/gensim/)

Other packages:

- [MITIE](https://github.com/mit-nlp/MITIE)

## NLP in Context

> The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited.
>
> &mdash; Ferdinand de Saussure

### The State of the Art

- Academic design for use alongside intelligent agents (AI discipline)
- Relies on formal models or representations of knowledge & language
- Models are adapted and augmented through probabilistic methods and machine learning.
- A small number of algorithms comprise the standard framework.

Required:

- Domain Knowledge
- A Corpus in the Domain
- Methods

### The Data Science Pipeline

![The Data Science Pipeline](figures/data_science_pipeline.png)

### The NLP Pipeline

![The NLP Pipeline](figures/nlp_pipeline.png)

#### Morphology

The study of the forms of things, words in particular.

Consider pluralization for English:

- Orthographic Rules: puppy → puppies
- Morphological Rules: goose → geese, fish → fish

Major parsing tasks: 

- stemming
- lemmatization
- tokenization
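As a rough illustration of the two kinds of pluralization rules, the sketch below checks a small lookup table of irregular (morphological) forms first and then falls back to a few spelling (orthographic) rules. The `IRREGULAR` table and the `pluralize` function are hypothetical names made up for this example and cover only a handful of cases; the code targets Python 3 and uses no libraries.

```python
# Hypothetical, deliberately tiny rule set: real pluralization has many more cases.
IRREGULAR = {"goose": "geese", "fish": "fish", "man": "men"}

def pluralize(noun):
    """Return a plural form using a lookup table, then spelling rules."""
    # Morphological rule: irregular forms simply have to be memorized.
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    # Orthographic rule: consonant + y becomes -ies (puppy -> puppies).
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    # Orthographic rule: sibilant endings take -es (box -> boxes).
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Default: just add -s.
    return noun + "s"
```

Stemming and lemmatization run this kind of rule in the opposite direction, mapping inflected forms back to a common base.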

#### Syntax

The study of the rules for the formation of sentences.

Major tasks:

- chunking
- parsing
- feature parsing
- grammars
- NGram Models (perplexity)
- Language generation
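To make the n-gram idea concrete, here is a minimal sketch that slides a window of size `n` over a token list and counts the resulting bigrams. The `ngrams` helper is written for this example (NLTK ships its own utilities for this), the whitespace tokenization is a simplification, and the code assumes Python 3 with only the standard library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization, for illustration only.
tokens = "i see what i eat and i eat what i see".split()

# Count how often each adjacent word pair occurs.
bigram_counts = Counter(ngrams(tokens, 2))
```

Counts like these are the raw material for an n-gram language model, which estimates the probability of a word given the words that precede it.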

#### Semantics

The study of meaning.

    I see what I eat.
    I eat what I see.
    He poached salmon.

Major tasks:

- Frame extraction
- creation of TMRs
- Question and answer systems

#### Machine Learning

Solve **Clustering Problems**:

- Topic Modeling
- Language Similarity
- Document Association (authorship)

Solve **Classification Problems**:

- Language Detection
- Sentiment Analysis
- Part of Speech Tagging
- Statistical Parsing
- Much more

Word _vectors_ are used to implement distance-based metrics.
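One common distance-based metric is cosine similarity between bag-of-words vectors. The sketch below represents each document as a `Counter` of word counts; the `cosine_similarity` function and the sample documents are written for this example, and the code assumes Python 3 with only the standard library.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is similar to nothing.
    return dot / (norm_a * norm_b)

doc1 = Counter("i see what i eat".split())
doc2 = Counter("i eat what i see".split())
doc3 = Counter("he poached salmon".split())
```

Here `doc1` and `doc2` score essentially 1.0 because a bag of words ignores word order, even though the two sentences mean different things, while `doc1` and `doc3` share no words and score 0.0.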

## Setup and Dataset

To install the required packages (ideally into a virtual environment), download `requirements.txt` and run:

    $ pip install -r requirements.txt

Alternatively, `pip install` each dependency as you need it.

### Corpus Organization

## Preprocessing HTML and XML Documents to Text

Much of the text that we're interested in is available on the web and formatted as either HTML or XML. It's not just web pages, however: eReader formats such as ePub are actually zip files containing XHTML. These semi-structured documents contain a lot of information, usually structural in nature. However, we want to get to the main body of the content, disregarding anything else that might be included, such as navigation headers, sidebars, ads, and other extraneous content.

On the web, there are several services that provide web pages in a "readable" fashion like [Instapaper](https://www.instapaper.com/) and [Clearly](https://evernote.com/clearly/). Some browsers even come with a clutter- and distraction-free "reading mode" that seems to give us exactly the content that we're looking for. One option that I've used in the past is to access these renderers programmatically; Instapaper even provides an [API](https://www.instapaper.com/api). However, for large corpora, we need to perform extraction quickly and repeatably while maintaining the original documents.

> Corpus management requires that the original documents be stored alongside preprocessed documents - do not make changes to the originals in place! See discussions of data lakes and data pipelines for more on ingesting to WORM (write once, read many) storage.
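One way to follow this advice is to write extracted text into a directory tree that mirrors the raw one, so the originals are never overwritten. The sketch below assumes a hypothetical layout with `corpus/raw` and `corpus/processed` roots; the function names are illustrative, and the code targets Python 3 with only the standard library.

```python
import os

def processed_path(raw_path, raw_root="corpus/raw", out_root="corpus/processed"):
    """Map a raw document path to a parallel path for its extracted text."""
    relative = os.path.relpath(raw_path, raw_root)
    base, _ = os.path.splitext(relative)
    return os.path.join(out_root, base + ".txt")

def save_processed(raw_path, text, raw_root="corpus/raw", out_root="corpus/processed"):
    """Write extracted text to the parallel tree, leaving the original untouched."""
    target = processed_path(raw_path, raw_root, out_root)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w", encoding="utf-8") as f:
        f.write(text)
    return target
```

Because the processed path is derived from the raw path, any document can be re-extracted later without guessing where its source lives.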

In Python, the fastest way to process HTML and XML text is with the [`lxml`](http://lxml.de/) library - a superfast XML parser that binds the C libraries `libxml2` and `libxslt`. However, the API for using `lxml` is a bit tricky, so instead use friendlier wrappers, `readability-lxml` and `BeautifulSoup`. 

For example, consider the following code to fetch an HTML web article from The Washington Post:

In [1]:
import codecs
import requests

from urlparse import urljoin
from contextlib import closing

chunk_size = 10**6  # Download 1 MB at a time.
wpurl = "http://wpo.st/"  # Washington Post provides short links

def fetch_webpage(url, path):
    # Open up a stream request (to download large documents)
    # Ensure that we will close when complete using contextlib
    with closing(requests.get(url, stream=True)) as response:

        # Check that the response was successful
        if response.status_code == 200:
            
            # Write each chunk to disk with the correct encoding
            with codecs.open(path, 'w', response.encoding) as f:
                for chunk in response.iter_content(chunk_size,  decode_unicode=True):
                    f.write(chunk)

def fetch_wp_article(article_id):
    path = "%s.html" % article_id
    url  = urljoin(wpurl, article_id)
    return fetch_webpage(url, path)

In [2]:
fetch_webpage("http://www.koreadaily.com/news/read.asp?art_id=3283896", "korean.html")

In [3]:
fetch_wp_article("nrRB0")

In [4]:
fetch_wp_article("uyRB0")

`BeautifulSoup` allows us to search the DOM to extract particular elements. For example, to load our document and find all the `<p>` tags, we would do the following:

In [2]:
import bs4

def get_soup(path):
    with open(path, 'r') as f:
        return bs4.BeautifulSoup(f, "lxml") # Note the use of the lxml parser

for p in get_soup("nrRB0.html").find_all('p'):
    print p

<p class="category-desc"> The inside track on Washington politics. </p>
<p class="invalid-email">*Invalid email address</p>
<p class="category-desc"> The inside track on Washington politics. </p>
<p class="invalid-email">*Invalid email address</p>
<p>Sign in or create an account so we can save this story to your Reading List. You'll be able to access the story from your Reading List on any computer, tablet or smartphone.</p>
<p class="top-header-message">Sign in to your account to save this article.</p>
<p id="U9001274173114EdC"></p>
<p id="U1000696839467p6H"> <i>It’s lowbrow. It’s messy. It could never be accused of being healthful. But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken. Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings. Here are some of the most irresistible.</i> </p>
<p id="U1000696839467ntG"></p>


In order to print out only the text with no nodes, do the following:

In [3]:
for p in get_soup("nrRB0.html").find_all('p'):
    print p.text
    print

 The inside track on Washington politics. 

*Invalid email address

 The inside track on Washington politics. 

*Invalid email address

Sign in or create an account so we can save this story to your Reading List. You'll be able to access the story from your Reading List on any computer, tablet or smartphone.

Sign in to your account to save this article.



 It’s lowbrow. It’s messy. It could never be accused of being healthful. But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken. Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings. Here are some of the most irresistible. 



Forget the cronut. Our newest favorite food chimera is the “rotissi-fried” chicken at the Partisan. Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two an

While this allows us to easily traverse the DOM and find specific elements by their id, class, or element type, we still have a lot of cruft in the document. This is where `readability-lxml` comes in. This library is a Python port of a Ruby port of arc90's [readability project](http://lab.arc90.com/experiments/readability/), which was inspired by Instapaper. It applies the heuristics from readability.js, along with some other helper functions, to extract the main body and even the title of the document you're working with.

In [4]:
from readability.readability import Document

def get_paper(path):
    with codecs.open(path, 'r', encoding='utf-8') as f:
        return Document(f.read())

paper = get_paper("nrRB0.html")
print paper.title()

A crisp and juicy bucket list of D.C.’s best fried chicken - The Washington Post


In [5]:
with codecs.open("nrRB0-clean.html", "w", encoding='utf-8') as f:
    f.write(paper.summary())

Combine readability and BeautifulSoup as follows:

In [6]:
def get_text(path):
    with open(path, 'r') as f:
        paper = Document(f.read())
        soup = bs4.BeautifulSoup(paper.summary(), "lxml")  # Name the parser explicitly
        output = [paper.title()]
        for p in soup.find_all('p'):
            output.append(p.text)
        return "\n\n".join(output)

In [7]:
print get_text("nrRB0.html")

A crisp and juicy bucket list of D.C.’s best fried chicken - The Washington Post



 It’s lowbrow. It’s messy. It could never be accused of being healthful. But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken. Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings. Here are some of the most irresistible. 



‘Rotissi-fried’ chicken at the Partisan

Forget the cronut. Our newest favorite food chimera is the “rotissi-fried” chicken at the Partisan. Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two and a half minutes. Why both? “Everything is better once it’s fried in beef fat,” Anda said. We have to agree. Whether white or dark, the meat is succulent throughout. The batter-free frying leaves the simply seasoned skin rendered perfe

### A note on binary formats

In order to transform PDF documents to XML, the best solution is currently [PDFMiner](http://www.unixuser.org/~euske/python/pdfminer/index.html), specifically its [pdf2txt](https://github.com/euske/pdfminer/blob/master/tools/pdf2txt.py) tool. Note that this tool can output multiple formats like XML or HTML, which is often better than the direct text export. Because of this, it's often useful to convert PDF to XHTML and then use readability or BeautifulSoup to extract the text from the document.

Unfortunately, the conversion from PDF to text is often not great, though statistical methodologies can help ease some of the errors in transformation. If PDFMiner is not sufficient, you can use tools like [PyPDF2](https://github.com/mstamy2/PyPDF2) to work directly on the PDF file, or write Python code to wrap other tools in Java and C like [PDFBox](https://pdfbox.apache.org/). 

Older binary formats like pre-2007 Microsoft Word documents (.doc) require special tools. Again, the best bet is to use Python to call another command line tool like [antiword](http://www.winfield.demon.nl/). Newer Microsoft formats (.docx) are actually zipped XML files and can either be unzipped and handled using the XML tools mentioned above, or read using Python packages like [python-docx](https://github.com/mikemaccana/python-docx) and [python-excel](http://www.python-excel.org/).
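Since a .docx file is a zip archive, the standard library alone is enough to peek at its text. The sketch below reads the main document part, `word/document.xml`, and joins the text runs in each paragraph. `docx_paragraphs` is a hypothetical helper name, and the in-memory archive built at the end is a stand-in that contains only that one part, not a complete Word file. The code assumes Python 3.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs

# Build a minimal in-memory archive to exercise the function (not a full .docx).
xml = ('<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
       '<w:body><w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
       '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p></w:body></w:document>')
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", xml)
buffer.seek(0)
paragraphs = docx_paragraphs(buffer)
```

For real documents a dedicated package is usually the better choice, since it also handles tables, headers, and styles.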

## Pattern

The `pattern` library by the CLiPS lab at the University of Antwerp is designed specifically for language processing of web data and contains a toolkit for fetching data via web APIs: Google, Gmail, Bing, Twitter, Facebook, Wikipedia, and more. It supports HTML DOM parsing and even includes a web crawler!

For example to ingest Twitter data:

In [11]:
from pattern.web import Twitter, plaintext

In [12]:
twitter = Twitter(language='en')
for tweet in twitter.search("#DataDC", cached=False):
    print tweet.text

RT @MicrosoftR: MT @DataEducationDC: Register for our #rstats special event w/ @RevoJoe 4/12 @WeWork in Dupont: https://t.co/2L75cA781v #da…
RT @wahalulu: More #datadc at #rstatsnyc. @robertvesco @HarlanH https://t.co/1PfLS9w351
RT @wahalulu: More #datadc at #rstatsnyc. @robertvesco @HarlanH https://t.co/1PfLS9w351
More #datadc at #rstatsnyc. @robertvesco @HarlanH https://t.co/1PfLS9w351
RT @wahalulu: Getting the #datadc gang together at #rstatsnyc  @robertvesco https://t.co/qebNLGWFFd
Getting the #datadc gang together at #rstatsnyc  @robertvesco https://t.co/qebNLGWFFd
RT @robertvesco: Screw teddy bears and dolls -&gt; Awesome plush statistical distributions via @NausicaaDist https://t.co/zHJiLQMfba #rstatsny…
RT @tonyojeda3: Natural Language Processing with Python Workshop on 4/9 https://t.co/3gzxbXW2Jw #DataScience #BigData #NLProc #DataDC #DCTe…
Screw teddy bears and dolls -&gt; Awesome plush statistical distributions via @NausicaaDist https://t.co/zHJiLQMfba #rstatsnyc #datadc
Nat

Pattern also contains an NLP toolkit for English in the `pattern.en` module that utilizes statistical approaches and regular expressions. Other supported languages include Spanish, French, Italian, German, and Dutch.

The pattern parser will identify word classes (part-of-speech tagging) and perform morphological inflection analysis, and it includes a WordNet API for lemmatization.

In [13]:
from pattern.en import parse, parsetree

s = "The man hit the building with a baseball bat."
print parse(s, relations=True, lemmata=True)
print
for clause in parsetree(s):
    for chunk in clause.chunks:
        for word in chunk.words:
            print word,
        print

The/DT/B-NP/O/NP-SBJ-1/the man/NN/I-NP/O/NP-SBJ-1/man hit/VBD/B-VP/O/VP-1/hit the/DT/O/O/O/the building/VBG/B-VP/O/O/build with/IN/B-PP/B-PNP/O/with a/DT/B-NP/I-PNP/O/a baseball/NN/I-NP/I-PNP/O/baseball bat/NN/I-NP/I-PNP/O/bat ././O/O/O/.

Word(u'The/DT') Word(u'man/NN')
Word(u'hit/VBD')
Word(u'building/VBG')
Word(u'with/IN')
Word(u'a/DT') Word(u'baseball/NN') Word(u'bat/NN')


The `pattern.search` module allows you to retrieve N-Grams from text based on phrasal patterns, and can be used to mine dependencies from text, e.g.

In [14]:
from pattern.search import search

s = "The man hit the building with a baseball bat."
pt = parsetree(s, relations=True, lemmata=True)
for match in search('NP VP', pt):
    print match

Match(words=[Word(u'The/DT'), Word(u'man/NN'), Word(u'hit/VBD')])


Lastly, the `pattern.vector` module has a toolkit for distance-based, bag-of-words machine learning, including clustering (K-Means, Hierarchical Clustering) and classification.

## NLTK

Suite of libraries for a variety of academic text processing tasks:

    tokenization, stemming, tagging,
    chunking, parsing, classification,
    language modeling, logical semantics

Pedagogical resources for teaching NLP theory in Python ...

- Python interface to over 50 corpora and lexical resources
- Focus on Machine Learning with specific domain knowledge
- Free and Open Source
- Numpy and Scipy under the hood
- Fast and Formal

What is NLTK not?

- Production ready out of the box*
- Lightweight
- Generally applicable
- Magic

*There are actually a few things that are production ready right out of the box*.

**The Good Parts**:

- Preprocessing
    - segmentation
    - tokenization
    - PoS tagging
- Word level processing
    - WordNet
    - Lemmatization
    - Stemming
    - NGram
- Utilities
    - Tree
    - FreqDist
    - ConditionalFreqDist
- Streaming CorpusReader objects
- Classification
    - Maximum Entropy (Megam Algorithm)
    - Naive Bayes
    - Decision Tree
- Chunking, Named Entity Recognition
- Parsers Galore!

**The Bad Parts**:

- Syntactic Parsing
    - No included grammar (not a black box)
- Feature/Dependency Parsing
    - No included feature grammar
- The sem package
    - Toy only (lambda-calculus & first order logic)
- Lots of extra stuff
    - papers, chat programs, alignments, etc.



In [87]:
import nltk

text = get_text("nrRB0.html")
for idx, s in enumerate(nltk.sent_tokenize(text)): # Segmentation
    words = nltk.wordpunct_tokenize(s)  # Tokenization
    tags  = nltk.pos_tag(words)    # Part of Speech tagging
    print tags
    print
    if idx > 5:
        break

[(u'A', 'DT'), (u'crisp', 'NN'), (u'and', 'CC'), (u'juicy', 'NN'), (u'bucket', 'NN'), (u'list', 'NN'), (u'of', 'IN'), (u'D', 'NNP'), (u'.', '.'), (u'C', 'NNP'), (u'.\u2019', 'NNP'), (u's', 'VBZ'), (u'best', 'JJS'), (u'fried', 'VBN'), (u'chicken', 'NN'), (u'-', ':'), (u'The', 'DT'), (u'Washington', 'NNP'), (u'Post', 'NNP'), (u'It', 'NNP'), (u'\u2019', 'NNP'), (u's', 'VBZ'), (u'lowbrow', 'NN'), (u'.', '.')]

[(u'It', 'PRP'), (u'\u2019', 'VBP'), (u's', 'NNS'), (u'messy', 'JJ'), (u'.', '.')]

[(u'It', 'PRP'), (u'could', 'MD'), (u'never', 'RB'), (u'be', 'VB'), (u'accused', 'VBN'), (u'of', 'IN'), (u'being', 'VBG'), (u'healthful', 'JJ'), (u'.', '.')]

[(u'But', 'CC'), (u'we', 'PRP'), (u'\u2019', 'VBP'), (u'd', 'VBN'), (u'never', 'RB'), (u'let', 'VB'), (u'those', 'DT'), (u'formalities', 'NNS'), (u'get', 'VBP'), (u'between', 'IN'), (u'us', 'PRP'), (u'and', 'CC'), (u'an', 'DT'), (u'order', 'NN'), (u'of', 'IN'), (u'crispy', 'NN'), (u',', ','), (u'crackly', 'RB'), (u',', ','), (u'delicious', 'JJ')

In [90]:
from nltk import FreqDist
from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

text  = get_text("nrRB0.html")
vocab = FreqDist()
words = FreqDist()
for s in nltk.sent_tokenize(text): 
    for word in nltk.wordpunct_tokenize(s):
        words[word] += 1
        lemma = lemmatizer.lemmatize(word)
        vocab[lemma] += 1

print words
print vocab

<FreqDist with 1072 samples and 3084 outcomes>
<FreqDist with 1032 samples and 3084 outcomes>


The first thing you needed to do was create a corpus reader that could read the RSS feeds and their topics by extending one of the built-in corpus readers:

In [16]:
import os
import nltk
import time
import random
import pickle
import string

from bs4 import BeautifulSoup
from nltk.corpus import CategorizedPlaintextCorpusReader

# The first group captures the category folder, docs are any HTML file.
CORPUS_ROOT = './corpus'
DOC_PATTERN = r'(?!\.).*\.html'
CAT_PATTERN = r'([a-z_]+)/.*'

# Specialized Corpus Reader for HTML documents
class CategorizedHTMLCorpusreader(CategorizedPlaintextCorpusReader):
    """
    Reads only the HTML body for the words and strips any tags.
    """

    def _read_word_block(self, stream):
        soup = BeautifulSoup(stream, 'lxml')
        return self._word_tokenizer.tokenize(soup.get_text())

    def _read_para_block(self, stream):
        soup  = BeautifulSoup(stream, 'lxml')
        paras = []
        piter = soup.find_all('p') if soup.find('p') else self._para_block_reader(stream)

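        for para in piter:
            # find_all yields Tag objects; the fallback reader yields strings
            text = para.get_text() if hasattr(para, 'get_text') else para
            paras.append([self._word_tokenizer.tokenize(sent)
                          for sent in self._sent_tokenizer.tokenize(text)])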

        return paras

# Create our corpus reader
rss_corpus = CategorizedHTMLCorpusreader(CORPUS_ROOT, DOC_PATTERN,
                    cat_pattern=CAT_PATTERN, encoding='utf-8')

Just to make things easy, I've also included all of the imports at the top of this snippet in case you're just copying and pasting. This should give you a corpus that is easily readable with the following properties:

> RSS Corpus contains 5506 files in 11 categories
> Vocab: 69642 in 1920455 words for a lexical diversity of 27.576
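The lexical diversity figure quoted here is consistent with dividing the total number of words by the size of the vocabulary, that is, the average number of times each distinct word is used. A minimal sketch of that arithmetic follows; the `lexical_diversity` function name is an assumption for illustration, and the code targets Python 3.

```python
def lexical_diversity(token_count, vocab_size):
    """Average number of times each distinct word is used (tokens per type)."""
    return token_count / vocab_size

# Figures taken from the corpus summary above.
diversity = lexical_diversity(1920455, 69642)
```

Rounded to three decimal places, `diversity` comes out to 27.576, matching the reported value.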

This snippet demonstrates a choice I made - to override the `_read_word_block` and the `_read_para_block` functions in the `CategorizedPlaintextCorpusReader`, but of course you could have created your own `HTMLCorpusReader` class that implemented the categorization features.

The next thing to do is to figure out how you will generate your featuresets; I hope that you used unigrams, bigrams, TF-IDF and others. The simplest approach is a bag of words. However, I have ensured that this bag of words does not contain punctuation or stopwords, has been normalized to all lowercase, and has been lemmatized to reduce the number of word forms:

In [17]:
# Create feature extractor methodology
def normalize_words(document):
    """
    Expects as input a list of words that make up a document. This will
    yield only lowercase significant words (excluding stopwords and
    punctuation) and will lemmatize all words to ensure that we have word
    forms that are standardized.
    """
    stopwords  = set(nltk.corpus.stopwords.words('english'))
    lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
    for token in document:
        token = token.lower()
        if token in string.punctuation: continue
        if token in stopwords: continue
        yield lemmatizer.lemmatize(token)

def document_features(document):
    words = nltk.FreqDist(normalize_words(document))
    feats = {}
    for word in words.keys():
        feats['contains(%s)' % word] = True
    return feats
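The same `contains(...)` convention extends naturally to the bigrams mentioned earlier. The sketch below is a hypothetical companion to `document_features`, not part of the original solution: it emits one binary feature per adjacent pair of tokens, assumes the tokens have already been normalized, and uses only built-in Python.

```python
def bigram_features(tokens):
    """Binary features for each adjacent pair of tokens in a document."""
    tokens = list(tokens)  # Accept generators such as normalize_words output.
    feats = {}
    for first, second in zip(tokens, tokens[1:]):
        feats["contains(%s %s)" % (first, second)] = True
    return feats
```

Merging this dictionary with the unigram features gives the classifier some local word-order information, at the cost of a much larger feature space.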

You should save training, devtest and test sets as pickles to disk so that you can easily work on your classifier without having to worry about the overhead of randomization. I went ahead and saved the features to disk, but if you're still developing features then you should save only the word lists. Here are the functions for both generating and loading the data sets:

In [19]:
def timeit(func):
    def wrapper(*args, **kwargs):
        start  = time.time()
        result = func(*args, **kwargs)
        delta  = time.time() - start
        return result, delta
    return wrapper

@timeit
def generate_datasets(test_size=550, pickle_dir="."):
    """
    Creates three data sets; a test set and dev test set of 550 documents
    then a training set with the rest of the documents in the corpus. It
    will then write the data sets to disk at the pickle_dir.
    """
    documents = [(document_features(rss_corpus.words(fileid)), category)
                    for category in rss_corpus.categories()
                    for fileid in rss_corpus.fileids(category)]

    random.shuffle(documents)

    datasets = {
        'test':     documents[0:test_size],
        'devtest':  documents[test_size:test_size*2],
        'training': documents[test_size*2:],
    }

    for name, data in datasets.items():
        with open(os.path.join(pickle_dir, name+".pickle"), 'wb') as out:
            pickle.dump(data, out)

def load_datasets(pickle_dir="."):
    """
    Loads the randomly shuffled data sets from their pickles on disk.
    """

    def loader(name):
        path = os.path.join(pickle_dir, name+".pickle")
        with open(path, 'rb') as f:
            data = pickle.load(f)

        return name, data

    return dict(loader(name) for name in ('test', 'devtest', 'training'))

# Using the timeit decorator you can see that loading from pickles saves you quite a few seconds:

_, delta = generate_datasets(pickle_dir='datasets')
print "Took %0.3f seconds to generate datasets" % delta

Took 26.951 seconds to generate datasets
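The slicing logic inside `generate_datasets` can be looked at in isolation. The sketch below shuffles with a seeded `random.Random` so that the split is reproducible without pickling; the seed and the `split_documents` name are assumptions for illustration (the original relies on the saved pickles instead), and integer ids stand in for the 5506 corpus documents. It targets Python 3 with only the standard library.

```python
import random

def split_documents(documents, test_size, seed=42):
    """Shuffle a copy of the documents and slice it into three data sets."""
    shuffled = list(documents)
    # A dedicated, seeded generator keeps the shuffle repeatable.
    random.Random(seed).shuffle(shuffled)
    return {
        "test": shuffled[:test_size],
        "devtest": shuffled[test_size:test_size * 2],
        "training": shuffled[test_size * 2:],
    }

# Integer ids stand in for the (features, category) pairs of the real corpus.
splits = split_documents(range(5506), test_size=550)
sizes = {name: len(data) for name, data in splits.items()}
```

With 5506 documents and a test size of 550, this leaves 4406 documents for training.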


Last up is the building of the classifier. I used a maximum entropy classifier with the lemmatized word level features. Also note that I used the MEGAM algorithm to significantly speed up my training time:

In [20]:
@timeit
def train_classifier(training, path='classifier.pickle'):
    """
    Trains the classifier and saves it to disk.
    """
    classifier = nltk.MaxentClassifier.train(training,
                algorithm='megam', trace=2, gaussian_prior_sigma=1)

    with open(path, 'wb') as out:
        pickle.dump(classifier, out)

    return classifier

datasets = load_datasets(pickle_dir='datasets')
classifier, delta = train_classifier(datasets['training'])
print "trained in %0.3f seconds" % delta

testacc    = nltk.classify.accuracy(classifier, datasets['test']) * 100
print "test accuracy %0.2f%%" % testacc

classifier.show_most_informative_features(30)

[Found megam: /Users/benjamin/bin/megam]
[Found megam: /Users/benjamin/bin/megam]
trained in 133.205 seconds
test accuracy 82.36%
   3.917 contains(comment)==True and label is 'data_science'
   3.599 contains(...)==True and label is 'gaming'
   3.573 contains(data)==True and label is 'data_science'
   3.248 contains(book)==True and label is 'books'
   2.984 contains(wired)==True and label is 'tech'
   2.970 label is 'business'
   2.836 contains(»)==True and label is 'business'
   2.667 contains(game)==True and label is 'gaming'
   2.481 contains(entrepreneur)==True and label is 'business'
  -2.418 label is 'essays'
   2.342 contains(facebook)==True and label is 'tech'
   2.259 contains(read)==True and label is 'tech'
   2.255 contains(...)==True and label is 'cinema'
   2.229 contains(adafruit)==True and label is 'do_it_yourself'
   2.186 contains(recipe)==True and label is 'cooking'
   2.166 contains(film)==True and label is 'cinema'
  -2.101 contains(read)==True and label is 'busines

Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object find_file_iter at 0x1246070a0> ignored


In [21]:
from operator import itemgetter

def classify(text, explain=False):
    
    classifier = None
    with open('classifier.pickle', 'rb') as f:
        classifier = pickle.load(f)
    
    document = nltk.wordpunct_tokenize(text)
    features = document_features(document)
    
    pd = classifier.prob_classify(features)
    for result in sorted([(s,pd.prob(s)) for s in pd.samples()], key=itemgetter(1), reverse=True):
        print "%s: %0.4f" % result

    print
    if explain:
        classifier.explain(features)

classify(get_text("nrRB0.html"), True)

cooking: 1.0000
essays: 0.0000
books: 0.0000
do_it_yourself: 0.0000
gaming: 0.0000
design: 0.0000
tech: 0.0000
cinema: 0.0000
data_science: 0.0000
sports: 0.0000
business: 0.0000

  Feature                                          cooking  essays   books do_it_y
  --------------------------------------------------------------------------------
  contains(recipe)==True (1)                         2.186
  contains(dish)==True (1)                           1.839
  contains(food)==True (1)                           1.073
  contains(chef)==True (1)                           1.026
  contains(classic)==True (1)                        0.980
  contains(spicy)==True (1)                          0.881
  contains(flavor)==True (1)                         0.853
  contains(stuffed)==True (1)                        0.845
  contains(served)==True (1)                         0.841
  contains(bar)==True (1)                            0.823
  contains(fresh)==True (1)                          0.787
  con

In [22]:
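# Caution: get_text returns a string, so document_features iterates over
# single characters here; tokenize first (as classify does) for word features.
classifier.explain(document_features(get_text("nrRB0.html")))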

  Feature                                           design   books data_sc    tech
  --------------------------------------------------------------------------------
  contains(’)==True (1)                              1.444
  contains(2)==True (1)                              0.709
  contains(x)==True (1)                              0.607
  contains(4)==True (1)                              0.513
  contains(h)==True (1)                              0.475
  contains(“)==True (1)                              0.368
  label is 'design' (1)                              0.362
  contains(r)==True (1)                              0.331
  contains(0)==True (1)                             -0.296
  contains(7)==True (1)                             -0.290
  contains(8)==True (1)                             -0.266
  contains(v)==True (1)                             -0.261
  contains(q)==True (1)                              0.255
  contains(9)==True (1)                             -0.237
  contai

The classifier did well: it trained in a little over 2 minutes and reached an initial accuracy of about 82%, a pretty good start!
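Accuracy here is simply the fraction of test documents whose predicted label matches the true label. The sketch below shows that computation on toy labels; the `accuracy` function is written for this example rather than taken from NLTK, the labels are made up, and the code assumes Python 3.

```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

# Toy labels for illustration only; these are not the RSS corpus results.
gold      = ["cooking", "tech", "books", "tech", "gaming"]
predicted = ["cooking", "tech", "tech",  "tech", "gaming"]
toy_accuracy = accuracy(gold, predicted)
```

By the same arithmetic, the reported 82.36% on the 550-document test set would correspond to 453 correct predictions (453 / 550 ≈ 0.8236).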

### Parsing with Stanford Parser and NLTK

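NLTK's built-in parsing is notoriously bad because the library is pedagogical and ships without a grammar. However, NLTK can wrap the Stanford parser and named entity recognizer, as follows: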

In [23]:
import os

from nltk.tag.stanford import NERTagger
from nltk.parse.stanford import StanfordParser

## NER JAR and Models
STANFORD_NER_MODEL = os.path.expanduser("~/Development/stanford-ner-2014-01-04/classifiers/english.all.3class.distsim.crf.ser.gz")
STANFORD_NER_JAR   = os.path.expanduser("~/Development/stanford-ner-2014-01-04/stanford-ner-2014-01-04.jar")

## Parser JAR and Models
STANFORD_PARSER_MODELS = os.path.expanduser("~/Development/stanford-parser-full-2014-10-31/stanford-parser-3.5.0-models.jar")
STANFORD_PARSER_JAR    = os.path.expanduser("~/Development/stanford-parser-full-2014-10-31/stanford-parser.jar")

def create_tagger(model=None, jar=None, encoding='ASCII'):
    model = model or STANFORD_NER_MODEL
    jar   = jar or STANFORD_NER_JAR

    return NERTagger(model, jar, encoding)

def create_parser(models=None, jar=None, **kwargs):
    models = models or STANFORD_PARSER_MODELS
    jar   = jar or STANFORD_PARSER_JAR

    return StanfordParser(jar, models, **kwargs)

class NER(object):

    tagger = None

    @classmethod
    def initialize_tagger(klass, model=None, jar=None, encoding='ASCII'):
        klass.tagger = create_tagger(model, jar, encoding)

    @classmethod
    def tag(klass, sent):
        if klass.tagger is None:
            klass.initialize_tagger()

        sent = nltk.word_tokenize(sent)
        return klass.tagger.tag(sent)

class Parser(object):

    parser = None

    @classmethod
    def initialize_parser(klass, models=None, jar=None, **kwargs):
        klass.parser = create_parser(models, jar, **kwargs)

    @classmethod
    def parse(klass, sent):
        if klass.parser is  None:
            klass.initialize_parser()

        return klass.parser.raw_parse(sent)

def tag(sent):
    return NER.tag(sent)

def parse(sent):
    return Parser.parse(sent)

In [24]:
tag("The man hit the building with the bat.")

[(u'The', u'O'),
 (u'man', u'O'),
 (u'hit', u'O'),
 (u'the', u'O'),
 (u'building', u'O'),
 (u'with', u'O'),
 (u'the', u'O'),
 (u'bat', u'O'),
 (u'.', u'O')]

In [25]:
for p in parse("The man hit the building with the bat."):
    print p


(ROOT
  (S
    (NP (DT The) (NN man))
    (VP
      (VBD hit)
      (NP (DT the) (NN building))
      (PP (IN with) (NP (DT the) (NN bat))))
    (. .)))


## TextBlob

A lightweight wrapper around NLTK that provides a simple "Blob" interface for working with text.

In [23]:
from textblob import TextBlob
from bs4 import BeautifulSoup

text = TextBlob(get_text("nrRB0.html"))

print text.sentences

[Sentence("A crisp and juicy bucket list of D.C.’s best fried chicken - The Washington Post

 It’s lowbrow."), Sentence("It’s messy."), Sentence("It could never be accused of being healthful."), Sentence("But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken."), Sentence("Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings."), Sentence("Here are some of the most irresistible."), Sentence("‘Rotissi-fried’ chicken at the Partisan

Forget the cronut."), Sentence("Our newest favorite food chimera is the “rotissi-fried” chicken at the Partisan."), Sentence("Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two and a half minutes."), Sentence("Why both?"), Sentence("“Everything is better once it’s fried in beef fat,” Anda said."), Senten

In [25]:
import nltk

In [26]:
np = nltk.FreqDist(text.noun_phrases)
print np.most_common(10)

[(u'hot sauce', 6), (u'popeyes', 5), (u'washington', 5), (u'it\u2019s', 4), (u'gbd', 4), (u'd.c.\u2019s', 4), (u'st. nw', 4), (u'bonchon', 3), (u'maryland', 3), (u'hank\u2019s oyster', 3)]


In [27]:
print text.sentiment

Sentiment(polarity=-0.0025676717918097208, subjectivity=0.5856343297507093)


In [28]:
review = TextBlob("Harrison Ford would be the most amazing, most wonderful, most handsome actor - the greatest that ever lived, if only he didn't have that silly earing.")
print review.sentiment

Sentiment(polarity=0.4555555555555555, subjectivity=0.8083333333333333)


Language detection and translation using TextBlob:

In [29]:
b = TextBlob(u"بسيط هو أفضل من مجمع")
b.detect_language()

u'ar'

In [32]:
chinese_blob = TextBlob(u"美丽优于丑陋")
chinese_blob.translate(from_lang="zh-CN", to='en')

TextBlob("")

In [33]:
en_blob = TextBlob(u"Simple is better than complex.")
en_blob.translate(to="es")

TextBlob("")

## spaCy

Industrial-strength NLP in Python with a strong Cython backend. It is super fast, but check the licensing terms before you adopt it.

In [34]:
from __future__ import unicode_literals 
from spacy.en import English

nlp = English()

tokens = nlp(u'The man hit the building with the baseball bat.')

baseball = tokens[7]
print (baseball.orth, baseball.orth_, baseball.head.lemma, baseball.head.lemma_)

(2303, u'baseball', 4193, u'bat')


In [139]:
tokens = nlp(u'The man hit the building with the baseball bat.', parse=True)
for token in tokens:
    print token.prob

-5.02773189545
-8.16621112823
-8.3605670929
-3.07847452164
-8.67186450958
-5.23164892197
-3.07847452164
-9.61269950867
-10.9683980942
-3.17597317696


## gensim

Library for bag of words clustering - LSA and LDA. 

Also implements word2vec - Google's word vectorizer: something that was explored in a previous post.