In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

# Model building. "NLP" case study: Apple/apple.

## Starting point
It turns out that there are several common types of features and approaches that form the "starting point" for NLP and text-based data problems.  We'll talk about a few of the common ones:

## Goals / Topics
 - Bag of words
 - TF/IDF
 - n-grams
 - Stemming / part of speech tagging / etc.
 - Feature hashing
 - Topics not covered
 
Some useful tools:
 - http://scikit-learn.org/stable/modules/feature_extraction.html
 - http://www.nltk.org/
 - http://www.nltk.org/howto/wordnet.html
 
If you're in a hurry just head there (with the caveat that nltk is huge).

If you have more time, check out this comprehensive resource for using nltk with Python: http://www.nltk.org/book/

It is a good idea to start with a well-defined problem in mind.  So, let us pick:

## The problem: Word sense disambiguation

In a given source block of (prose) text, you want to be able to tell apart Apple (the company) vs. apple (the fruit).  Ideally you would also be able to tell apart Ford vs ford, Windows vs windows, etc. via similar examples.


**The type of learner**: We're going to choose to look at this as a _supervised classification_ problem.  There are also unsupervised approaches, but you have to make choices sometimes.  This means we need some "marked up" data:

**The training dataset**: Having limited resources at our disposal, we're going to try to use Wikipedia's articles on the given topics as our chosen "corpus" of text.

**The test dataset**: A good idea would be to mark up a small corpus by hand, or to use sentences culled from Wikipedia with the disambiguation coming from looking at the target of outgoing links.  We're not going to be that scientific for lack of time.

Okay, so you might be guessing that for "Apple"/"apple" simple heuristics (e.g., capitalization, presence of a possessive) would do quite well.  That is certainly true in this example!  To do better, you'd have to be more clever: An idea would be to look at nearby words (e.g., "software" and "computer" certainly hint one way, while "flavor" hints the other).  Figuring out how to turn this idea into an actual implementation takes us to our first two topics: **bag of words** and **TF/IDF**.

## Brainstorming:

How might we do this?
  1. We need to clean and tokenize the text data: __Tokenization__ refers to splitting the text into pieces, in this case into sentences and into words.  Cleaning can also include things like __stemming__ or __lemmatizing__ (reducing related words like "computer" and "computers" to a common stem).
  2. We need to develop __features__.  We're going to think of our input as a sentence, and try to develop features of that sentence.  In our example application, we might try to use:
   - Capitalization of the word apple? (_a_pple vs _A_pple)
   - Pluralization of the word apple? (apples)
   - Possessive form of the word apple? (Apple's)
   - Presence (or frequency) of certain well-chosen words : Does (e.g.,) the word "computer" or "fruit" occur in the sentence?  (This feature regards the sentence as a simple __bag of words__ without regard to trying to parse its structure.)
   - In addition to single words, we can also look for __n-grams__: Strings of n consecutive words.
   - There are common techniques for determining which words / n-grams to look for.  One of them is called __tf-idf__.
  3. Finally, we'll run some sort of classifier on the features.
  
We'll mostly focus on general NLP techniques in 1 and 2, rather than diving deeply into techniques for word disambiguation.

## Step 0: Cleaning and tokenizing the data

Our first step will be to pull in the training (and test) data.  We will want to clean both data sets on the way in: our goal is to have each text as a list of strings, one string for each sentence.

We'll be using nltk already (to split things into sentences). We've already downloaded the nltk data to your box.

### Splitting into words/sentences:
NLTK has convenient presets for sentence and word tokenization (i.e., splitting a document into sentences, resp. splitting a sentence into words).
>        
       my_list_of_sentences = nltk.tokenize.sent_tokenize(my_long_string) 
       words = [ nltk.tokenize.word_tokenize(sent) for sent in my_list_of_sentences]
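
To make the idea of tokenization concrete without needing the NLTK data files, here is a minimal regex-based sketch. It is a crude stand-in, not NLTK's algorithm: the function names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this illustration, and the sentence splitter will wrongly split after abbreviations such as "Inc." when a capitalized word follows.

```python
import re

def simple_sent_tokenize(text):
    # Split on whitespace that follows ., ! or ? and precedes an uppercase letter.
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in pieces if p.strip()]

def simple_word_tokenize(sentence):
    # Emit a possessive 's as its own token, then alphanumeric runs,
    # then any other single non-space character (punctuation).
    return re.findall(r"'s\b|[A-Za-z0-9]+|[^\sA-Za-z0-9]", sentence)

sents = simple_sent_tokenize("Apple makes phones. I ate an apple! Was it good?")
words = [simple_word_tokenize(s) for s in sents]
```

For anything beyond a toy example, the NLTK tokenizers above handle far more edge cases.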

In [None]:
import urllib2
from bs4 import BeautifulSoup
import re
import nltk.tokenize

# Spit out (slightly cleaned up) sentences from a Wikipedia article.
def wikipedia_to_sents(url):
    soup = BeautifulSoup(urllib2.urlopen(url)).find(attrs={'id':'mw-content-text'})
    
    # The text is littered with references like [n].  Drop them.
    def drop_refs(s):
        return ''.join( re.split(r'\[\d+\]', s) )
    
    paragraphs = [drop_refs(p.text) for p in soup.find_all('p')]
    
    raw_sents = reduce(lambda x, y: x + y, [nltk.tokenize.sent_tokenize(p.strip()) for p in paragraphs if p.strip()!=''])
    return filter(lambda s: len(s.split(" "))>2, raw_sents)

fruit_sents = wikipedia_to_sents("http://en.wikipedia.org/wiki/Apple")
company_sents = wikipedia_to_sents("http://en.wikipedia.org/wiki/Apple_Inc.")

In [None]:
company_sents[-105:-100]

In [None]:
fruit_sents[-105:-100]

# Step 1.1.  Features: Bag of words (and variants)

Learning algorithms like vectors of numbers, not text.  The simplest way to turn a text into a vector of numbers is to treat the text as a "bag of words."  That is, you:

  - Split the text into words
  - Count how many times each word (/each word in some fixed vocabulary) occurs
  - _(Optionally)_ normalize the counts against some baseline
  - _(Variant)_ Just do a binary "yes / no" for whether each word (/.. in some vocabulary) is contained in the material
  
The output is a very large, but usually sparse, vector: The number of coordinates is the number of words in our dictionary, and the $i$-th coordinate entry is the number of occurrences of the $i$-th word.
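
The steps above can be sketched with nothing but the standard library. This is an illustration of the idea, not scikit-learn's implementation: `tokenize` and `bag_of_words` are hypothetical helper names, the two example sentences are made up, and the rows are dense lists rather than a sparse matrix. The token pattern (lowercased runs of two or more word characters) is chosen to resemble what `CountVectorizer` does by default.

```python
from collections import Counter
import re

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    token_lists = [tokenize(d) for d in docs]
    vocabulary = sorted(set(tok for toks in token_lists for tok in toks))
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["The apple tree bears fruit.",
        "Apple sells the iPhone; the iPhone sells well."]
vocabulary, rows = bag_of_words(docs)

# The binary "yes / no" variant just records presence.
binary_rows = [[1 if c > 0 else 0 for c in row] for row in rows]
```

Even in this tiny example most entries are zero, which is why real implementations store the result sparsely.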

There's a reasonable implementation of this in the CountVectorizer class in sklearn.feature_extraction.text.  See http://scikit-learn.org/stable/modules/classes.html#text-feature-extraction-ref for more detail on the options.

For instance, here we'll apply this to each sentence (as a separate bag).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

bag_of_words_vectorizer = CountVectorizer()

counts = bag_of_words_vectorizer.fit_transform( fruit_sents + company_sents  )
print counts.shape

In [None]:
# Note that counts is a **sparse** matrix.
print counts.toarray() #This is what it actually looks like.. there are non-zero entries, really!
print counts           # .. this is just describing the non-zero entries

In [None]:
bag_of_words_vectorizer.get_feature_names()[1000:1005]

### Stop words
It's common to want to __omit__ certain common words when doing these counts -- "a", "an", and "the" are common enough so that their counts do not tend to give us any hints as to the meaning of documents.  Such words that we want to omit are called __stop words__ (they don't stop anything, though).

NLTK contains a standard list of such stop words for English in `nltk.corpus.stopwords.words('english')`.  In our application, we'd also want to include "apple" -- it is certainly not going to help us distinguish our two meanings!
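
As a small sketch of what stop-word removal does to the counts, assuming a tiny hand-written stop list (the real NLTK list is much longer; `STOP_WORDS` and `count_without_stop_words` are names invented here):

```python
from collections import Counter

# A tiny, hypothetical stop list with "apple" added, as suggested above.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "apple"}

def count_without_stop_words(sentence, stop_words=STOP_WORDS):
    # Whitespace tokenization is enough for this illustration.
    tokens = sentence.lower().split()
    return Counter(tok for tok in tokens if tok not in stop_words)

counts = count_without_stop_words("The apple is the fruit of an apple tree")
```

Only "fruit" and "tree" survive, which are exactly the words that hint at the meaning.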

In [None]:
nltk.corpus.stopwords.words('english')

In [None]:
# The vocabulary *can* be built for you.  
#
# For instance, here we'll compute and then use the top 300 words by frequency -- *ignoring*
# the so-called "stopwords": these are words like "a", "and", "the" that are very common.
# "apple" is not useful for distinguishing the two, but is common, so add it as a stopword.
#
# Nevertheless, this method is probably NOT GOOD.  See tf-idf below instead.
counter=CountVectorizer(max_features=300,
                        stop_words=nltk.corpus.stopwords.words('english') + ['apple'])
counter=counter.fit( fruit_sents + company_sents )
print counter.get_feature_names()

# Now we can use it with that vectorizer, like so...
counter.transform(company_sents)
counter.transform(fruit_sents)

## n-grams

Instead of looking at just single words, it is also useful to look at **n-grams**: These are n-word long sequences of words (i.e., each of "farmer's market", "market share", and "farm share" is a 2-gram).

The exact same sort of counting techniques apply.  The `CountVectorizer` class has built-in support for this, too:

If you pass it the `ngram_range=(m, M)` then it will count $n$-grams with  $m \leq n \leq M$.
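
To see what is being counted, here is a minimal sketch that enumerates the $n$-grams of a token list for $m \leq n \leq M$. The helper name `ngrams` is hypothetical, and joining tokens with a space is simply one convenient way to represent an $n$-gram.

```python
def ngrams(tokens, m, M):
    """Return all n-grams (as space-joined strings) with m <= n <= M."""
    result = []
    for n in range(m, M + 1):
        # A window of length n can start at positions 0 .. len(tokens) - n.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

tokens = "apple pie recipe".split()
bigrams = ngrams(tokens, 2, 2)      # ['apple pie', 'pie recipe']
uni_and_bi = ngrams(tokens, 1, 2)   # single words first, then the 2-grams
```

Counting these strings instead of single words gives the n-gram version of the bag of words.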

In [None]:
ng_counter=CountVectorizer(max_features=300, 
                           ngram_range=(2,2), 
                           stop_words=nltk.corpus.stopwords.words('english') + ['apple', 'Apple'])
ng_counter=ng_counter.fit( fruit_sents + company_sents  )
print ng_counter.get_feature_names()

# Now we can use it with that vectorizer, like so...
ng_counter.transform(company_sents)
ng_counter.transform(fruit_sents)

## TF-IDF: term frequency–inverse document frequency

With single word vocabularies, we can probably do an okay job of coming up with a reasonable (if short) list of words that distinguish between the two documents.  With n-grams, even for $n=2$, it is better to let a computer help us.  

Just using frequencies, as above, is clearly not great.  Both apples the fruit and Apple the company are enjoyed around the world (one of the 2-grams that came up above!).  We would like to find words that are common in one document, but not common in all of them.  This is the goal of the __tf-idf weighting__.  A precise definition is:


  1. If $d$ denotes a document and $t$ denotes a term, then the _raw term frequency_ $\mathrm{tf}^{raw}(t,d)$ is
  $$ \mathrm{tf}^{raw}(t,d) = \text{the number of times the term $t$ occurs in the document $d$} $$
  The vector of all term frequencies can optionally be _normalized_ either by dividing by the sum of the word occurrence counts ($L^1$) or by the Euclidean length of the vector of word occurrence counts ($L^2$).  Scikit-learn's default is the $L^2$ norm (strictly, it normalizes the finished tf-idf vector rather than the counts alone); for the term frequencies this reads:
  $$ \mathrm{tf}(t,d) = \mathrm{tf}^{L^2}(t,d) = \frac{\mathrm{tf}^{raw}(t,d)}{\sqrt{\sum_t \mathrm{tf}^{raw}(t,d)^2}} $$
  2. If $D$ is the set of possible documents, then the _inverse document frequency_ is
  $$ \mathrm{idf}^{naive}(t,D) = \log \frac{\# D}{\# \{d \in D : t \in d\}} \\
  = \log \frac{\text{count of all documents}}{\text{count of those documents containing the term $t$}} $$
  with a common variant being
  $$ \mathrm{idf}(t, D) = \log \frac{\# D}{1 + \# \{d \in D : t \in d\}} \\
   = \log \frac{\text{count of all documents}}{1 + \text{count of those documents containing the term $t$}} $$
  (The $1+$ in the denominator avoids dividing by zero when $t$ is not found in any document.  Scikit-learn's default is a further smoothed variant, $\ln \frac{1 + \# D}{1 + \# \{d \in D : t \in d\}} + 1$.)
  3. Finally, the weight that we assign to the term $t$ appearing in document $d$ and depending on the corpus of all documents $D$ is
  $$ \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \mathrm{idf}(t,D) $$
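
The definitions above translate almost line for line into code. The following is a sketch of the formulas as written here, not of scikit-learn's implementation; the function names and the four-document toy corpus are made up for illustration, and each document is represented as a list of tokens.

```python
import math

def tf_raw(term, doc):
    # Number of times the term occurs in the document.
    return doc.count(term)

def tf_l2(term, doc):
    # Raw count divided by the Euclidean length of the document's count vector.
    norm = math.sqrt(sum(doc.count(t) ** 2 for t in set(doc)))
    return doc.count(term) / norm

def idf_naive(term, corpus):
    # log(#documents / #documents containing the term);
    # raises ZeroDivisionError if no document contains the term.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(float(len(corpus)) / containing)

def idf_plus_one(term, corpus):
    # The variant with 1 + ... in the denominator.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(float(len(corpus)) / (1 + containing))

def tfidf(term, doc, corpus, tf=tf_raw, idf=idf_naive):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "apple pie with apple sauce".split(),
    "apple stock price".split(),
    "stock market news".split(),
    "pie recipe".split(),
]
weight = tfidf("apple", corpus[0], corpus)   # 2 * log(4 / 2)
```

Swapping `tf=tf_l2` or `idf=idf_plus_one` into `tfidf` is a convenient way to experiment with the exercises below.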
  
  
  
### Exercises:
  1. Imagine that $D$ consists of just two documents $D = \{ d_1, d_2 \}$ and that the word "cultivar" occurs in $d_1$ but not in $d_2$.  What is 
  $$ \mathrm{tfidf}(\mathrm{"cultivar"}, d_i, D)$$
for each $i=1,2$?  For simplicity, use $\mathrm{tf}^{raw}$ and the version of `idf` without the $1+$.  

  2. Same question as 1, but now use $\mathrm{tf}^{L^2}$ and the version of `idf` with the $1+$.  

  3. What happens to the tf-idf weighting of a word if it occurs in all (or all but one) documents?  Consider both forms of `idf`.
  
  4. In the example below, we consider each sentence as a separate document for the purpose of tf-idf.  What happens if you instead treat the input as just two documents, one for each starting article?
  
### Hints/Answers:
  1. For $i=2$ it is zero.  For $i=1$, it is the number of occurrences of "cultivar" in $d_1$ multiplied by $\log 2$.
  2. For $i=2$, it is zero.  For $i=1$, it is .. also zero.
  3. Answer: If it occurs in all documents, and there are $N$ of them, then the $1+$ form gives `idf` $= \log \frac{N}{N+1} < 0$ while the other form gives `idf` $= \log \frac{N}{N} = 0$.  If it occurs in all but one document, then the $1+$ form gives `idf` $= \log \frac{N}{N} = 0$ while the other form gives `idf` $= \log \frac{N}{N-1} \approx 1/N$.
  4. It works less well, because of what's discussed in 3.  tf-idf doesn't work so well with few documents and where relevant words  occur (even if with wildly different frequencies!) in both.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

ng_tfidf=TfidfVectorizer(max_features=300, 
                         ngram_range=(1,2), 
                         stop_words=nltk.corpus.stopwords.words('english') + ["apple", "apples"])
ng_tfidf=ng_tfidf.fit( fruit_sents + company_sents )
print ng_tfidf.get_feature_names()[100:105]
print ng_tfidf.transform( fruit_sents + company_sents )

## Aside: Document similarity

A common problem is looking up a document similar to a given snippet, or relatedly comparing two documents for similarity.  The above provides a simple method for this called __cosine similarity__:
  - To each of the two documents $d_1, d_2$ in a corpus of documents $D$, assign its tf or tf-idf vector $$ (v_i)_{j} = \mathrm{tfidf}( t_{j}, d_i, D ) $$
  where $i$ ranges over indices for documents, and $j$ ranges over indices for terms in the vocabulary.
  - To compare two documents, simply find the cosine of the angle between the vectors:
  $$ \frac{v_i \cdot v_{i'}}{|v_i| |v_{i'}|} $$
  
(There's also a variant using binary vectors and Jaccard distance.)
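
Both comparisons are short enough to write out directly. This is a minimal sketch on plain Python lists; the function names are chosen here, and returning 0 for an all-zero vector is a convention picked for the sketch rather than something the definition dictates.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: (u . v) / (|u| |v|).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def jaccard_distance(u, v):
    # For binary vectors: 1 - |terms in both| / |terms in either|.
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    if either == 0:
        return 0.0  # two empty documents are treated as identical here
    return 1.0 - float(both) / either

parallel = cosine_similarity([1, 2], [2, 4])      # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])    # no terms in common
```

Note that cosine similarity ignores vector length, so a document and the same document repeated twice count as identical.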

## Stemming

In our original hand-built vocabulary, we had to include both "computer" and "computers".  It would have been useful to identify them as one word.

This is not limited to just trailing "s" characters: e.g., the words "carry", "carries", "carrying", and "carried" all carry -- roughly -- the same meaning.  The process of replacing them by a common root, or **stem**, is called stemming -- the stem will not, in general, be a full word itself.

There's a related process called **lemmatization**: The analog of the "stem" here _is_ an actual word.  We can choose to first stem our words before counting them:


In [None]:
stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)
print "stemming carry", [stemmer.stem(s) for s in ["carry", "carries", "carrying", "carried"]]
print "stemming eat", [stemmer.stem(s) for s in ["eat", "eating", "eaten", "ate"]]
                                
# More examples 
print stemmer.stem("The quick brown fox jumped over the lazy dog.  I can't believe it's not butter.  I tried to ford the river and my unfortunate oxen died.")
print " ".join(map(stemmer.stem, "The quick brown fox jumped over the lazy dog.  I can't believe it's not butter.  I tried to ford the river and my unfortunate oxen died.".split(" ")))

In [None]:
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
print "lemma carry", [lemmatizer.lemmatize(s) for s in ["carry", "carries", "carrying", "carried"]]
print "lemma eat", [lemmatizer.lemmatize(s) for s in ["eat", "eating", "eaten", "ate"]]

We can tell our bag-of-words counters (/tf-idf) to first run their input through the stemmer.  This way they won't have to include both, e.g., 'computer' and 'computers':

In [None]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

default_tokenizer = TfidfVectorizer().build_tokenizer()
stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)
    
def tokenize_stem(text):
    """
    We will use the default tokenizer from TfidfVectorizer, combined with the nltk SnowballStemmer.
    """
    tokens = default_tokenizer(text)
    stemmed = map(stemmer.stem, tokens)
    return stemmed

ng_stem_tfidf = TfidfVectorizer(max_features=300, 
                         ngram_range=(1,2), 
                         stop_words=map(stemmer.stem, nltk.corpus.stopwords.words('english') + ["apple"]),
                         tokenizer=tokenize_stem)
ng_stem_tfidf = ng_stem_tfidf.fit( fruit_sents + company_sents )

ng_stem_vocab = ng_stem_tfidf.get_feature_names()
print ng_stem_vocab

## Variation: Feature hashing

When doing "bag of words" type techniques on a *large* corpus and without an existing vocabulary, there is a simple trick that is often useful.  The issue (and solution) is as follows: 

 - The output is a feature vector, so that whenever we encounter a word we must look up which coordinate slot it is in.  A naive way would be to keep a list of all the words encountered so far, and look up each word when it is encountered.  Whenever we encounter a new word, we see if we've already seen it before and if not -- assign it a new number.  This requires storing all the words that we have seen in memory, cannot be done in parallel (because we'd have to share the hash-table of seen words), etc.
 - A **hash function** takes as input something complicated (like a string) and spits out a number, with the desired property being that different inputs *usually* produce different outputs.  (This is how hash tables are implemented, as the name suggests.)
 - So -- rather than exactly looking up the coordinate of a given word, we can just use its hash value (modulo a big size that we choose).  This is fast and parallelizes easily.  (There are some downsides: You cannot tell, after the fact, what word each of your features actually corresponds to!)
 
Scikit-learn includes `sklearn.feature_extraction.text.HashingVectorizer` to do this.  It behaves as almost a drop-in replacement for `CountVectorizer`.  It can be used with tf-idf by combining it with the `TfidfTransformer` (the `TfidfVectorizer` is the `CountVectorizer` together with the `TfidfTransformer`). For our application (where the training and test data is small), we may as well just use `TfidfVectorizer` -- but it is good to know that `HashingVectorizer` is there.
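
The trick itself fits in a few lines. The sketch below is only an illustration of the idea: `hashed_bag_of_words` is a hypothetical helper, MD5 is used simply because it gives the same value on every run (it is not the hash function scikit-learn uses), and `n_features` is kept tiny so collisions are easy to observe.

```python
import hashlib

def hashed_bag_of_words(tokens, n_features=16):
    """Count tokens into a fixed-length vector, using a hash instead of a vocabulary."""
    vector = [0] * n_features
    for tok in tokens:
        # Stable hash of the token, reduced modulo the chosen vector size.
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector

vec = hashed_bag_of_words("apple pie apple".split(), n_features=8)
```

No vocabulary is stored anywhere, so two workers processing different documents will still agree on which slot "apple" lands in. The price is that distinct words can collide in the same slot.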

# Step 1.2: Some other features.

## Part of speech tagging.

Consider the "Ford" vs "ford" example.  As a human being, the easiest way to tell these apart is that Ford is a __noun__ while ford is a __verb__.

Fortunately, NLTK also has a part-of-speech tagger: You give it a sentence, and it tries to tag the parts of speech (e.g., noun, verb, adjective, etc.).  The command is `nltk.pos_tag` and for documentation on the tags either search around online, or use `nltk.help.upenn_tagset`:

(N.B. Nothing's perfect -- the tagger makes mistakes in each of the examples below!)

In [None]:
s1 = "I tried to ford the river, and my unfortunate oxen died"
s2 = "Henry Ford built factories to facilitate the construction of the Ford automobile."

In [None]:
nltk.pos_tag(nltk.tokenize.word_tokenize(s1))

In [None]:
nltk.pos_tag(nltk.tokenize.word_tokenize(s2))

In [None]:
nltk.help.upenn_tagset('NN.*')

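## Capitalization, punctuation, etc.
These are the obvious features that we had in mind during brainstorming: whether the keyword is capitalized, plural, or possessive, plus (for cases like Ford/ford) whether the tagger marks it as a verb.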

In [None]:
import numpy as np

def feature_verbs(words, positions):
    pos_tag = nltk.pos_tag(words)
    return len( [ i for i in positions if pos_tag[i][1] == 'VB'] )

def is_cap(word):
    return word[0] == word[0].capitalize()

def feature_caps(words, positions):
    return len ( [ i for i in positions if is_cap(words[i]) ] )

def feature_plural(words, positions):
    def is_plural(word):
        return re.match( ".*s$", word )
    return len ( [ i for i in positions if is_plural(words[i]) ] )

## N.B. The nltk word tokenizer will tokenize "Apple's" as ["Apple", "'s"]
def feature_posessive(words, positions):
    l = len(words)
    return len ( [ i for i in positions if i+1 < l and words[i+1]=="'s" ] )

def ad_hoc_features(keyword, strs):
    """
    Given a keyword (e.g., "apple") and a list of strings;
    Returns a numpy ndarray encoding several ad hoc features of the string that are local
    near occurrences of the keyword:
        - If the keyword is capitalized
        - If it is plural (in the stupid sense of ending in s.. good enough for 'apple')
        - If it is possessive (in the stupid sense of being followed by 's)
        - If the keyword is a verb (e.g., for Ford vs ford)
    """
    stemmed_word = stemmer.stem(keyword)
    def feature_one(s):
        words = nltk.tokenize.word_tokenize(s)
        hits = [ i for i in range(len(words)) if stemmer.stem(words[i]) == stemmed_word ]
        return np.array([ feature_caps(words, hits), 
                          feature_plural(words, hits), 
                          feature_posessive(words, hits), 
                          feature_verbs(words, hits) ])
    return np.asarray(map(feature_one, strs))

In [None]:
ad_hoc_features("ford", ["I drive a Ford.", "I tried to ford the river.", "That's not Ford's."])

In [None]:
ad_hoc_features("apple", ["Have you eaten your apple?", "How is Apple's stock doing?", "Apples are tasty."])

## The actual application
Disclaimer: This version is actually pretty bad -- it uses many of the right ideas, but puts them together pretty poorly (and with fairly little available data).

In [None]:
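import urllib2
import re
import numpy as np
import nltk
import nltk.tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from bs4 import BeautifulSoup
from sklearn.naive_bayes import MultinomialNB

# Module-level stemmer, used by ad_hoc_features below
stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)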

def wikipedia_to_sents(url):
    """
    Retrieves a URL from wikipedia, and returns a list of sentences (of at least 3 words) in the body text.
    """
    soup = BeautifulSoup(urllib2.urlopen(url)).find(attrs={'id':'mw-content-text'})
    
    # The text is littered with references like [n].  Drop them.
    def drop_refs(s):
        return ''.join( re.split(r'\[\d+\]', s) )
    
    paragraphs = [drop_refs(p.text) for p in soup.find_all('p')]
    
    raw_sents = reduce(lambda x, y: x + y, [nltk.tokenize.sent_tokenize(p.strip()) for p in paragraphs if p.strip()!=''])
    return filter(lambda s: len(s.split(" "))>2, raw_sents)


#### Bag-of-words features, using tf-idf
def make_ng_stem_vectorizer(texts, extra_stop_words, max_features):
    """
    Given 
        - a list of texts ("documents");
        - a list of extra stop words (in addition to the standard NLTK English ones); and,
        - a number of features to remember
    Returns the tf-idf feature extractor with this number of features, based on these documents.
    """
    default_tokenizer = TfidfVectorizer().build_tokenizer()
    stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)
    def tokenize_stem(text):
        return map(stemmer.stem, default_tokenizer(text))
    ng_stem_tfidf=TfidfVectorizer(#max_features=max_features, 
                             ngram_range=(1,2),
                             stop_words=map(stemmer.stem, nltk.corpus.stopwords.words('english') + extra_stop_words),
                             tokenizer = tokenize_stem)
    return ng_stem_tfidf.fit( texts )


#### Ad hoc features
def feature_verbs(words, positions):
    pos_tag = nltk.pos_tag(words)
    return len( [ i for i in positions if pos_tag[i][1] == 'VB'] )

def is_cap(word):
    return word[0] == word[0].capitalize()
def feature_caps(words, positions):
    return len ( [ i for i in positions if is_cap(words[i]) ] )

def feature_plural(words, positions):
    def is_plural(word):
        return re.match( ".*s$", word )
    return len ( [ i for i in positions if is_plural(words[i]) ] )

## N.B. The nltk word tokenizer will tokenize "Apple's" as ["Apple", "'s"]
def feature_posessive(words, positions):
    l = len(words)
    return len ( [ i for i in positions if i+1 < l and words[i+1]=="'s" ] )

def ad_hoc_features(keyword, strs, use_verbs=False):
    """
    Given a keyword (e.g., "apple") and a list of strings;
    Returns a numpy ndarray encoding several ad hoc features of the string that are local
    near occurrences of the keyword:
        - If the keyword is capitalized
        - If it is plural (in the stupid sense of ending in s.. good enough for 'apple')
        - If it is possessive (in the stupid sense of being followed by 's)
        - If the keyword is a verb (e.g., for Ford vs ford)
    """
    stemmed_word = stemmer.stem(keyword)
    def feature_one(s):
        words = nltk.tokenize.word_tokenize(s)
        hits = [ i for i in range(len(words)) if stemmer.stem(words[i]) == stemmed_word ]
        ret_list = [ feature_caps(words, hits), 
                     feature_plural(words, hits), 
                     feature_posessive(words, hits) ]
        if use_verbs:  # This is slow, so only use it sometimes
            ret_list.append( feature_verbs(words, hits)  )
        return np.array(ret_list)
    return np.asarray(map(feature_one, strs))

####
def make_classifier(base_word, meaning1, meaning2, use_verbs=False):
    """
    Given
        - a base word (e.g., "apple", "ford") that can have ambiguous meaning
        - a pair meaning1 = (name1, url1) of a label for the first meaning, and a Wikipedia URL for it
        - a pair meaning2 = ... for the other meaning
    Returns a tuple (make_features, classifier) where
        - make_features is a function taking in a string text, and returns a feature vector
        - classifier takes in a feature vector (output by make_features) and predicts the meaning
    """
    name1, url1 = meaning1
    name2, url2 = meaning2
    sents1 = wikipedia_to_sents(url1)
    sents2 = wikipedia_to_sents(url2)
    tfidf_vect = make_ng_stem_vectorizer(sents1 + sents2,
                                        [base_word],
                                        100000)
    def make_features(sents, use_verbs=False):
        a = ad_hoc_features(base_word, sents, use_verbs)
        t = tfidf_vect.transform(sents).toarray()
        return np.hstack((a,t))

    # Build the training data
    train_feat = make_features(sents1 + sents2, use_verbs)
    train_res  = np.array( [0] * len(sents1) + [1] * len(sents2) )
    
    classifier = MultinomialNB()
    classifier = classifier.fit(train_feat, train_res)
    return (lambda x: make_features(x, use_verbs), classifier)

In [None]:
#### Now we actually run our code for Apple
base_word="apple"
options = [ ("fruit", "http://en.wikipedia.org/wiki/Apple"),
            ("company", "http://en.wikipedia.org/wiki/Apple_Inc.") ]
(make_features, classifier) = make_classifier(base_word, *options)
print map(lambda x: options[x][0], classifier.predict(make_features([
    "I'm baking a pie with my granny smith apples.",
    "I looked up the recipe on my Apple iPhone.",
    "The apple pie recipe is on my desk.",
    "How is Apple's stock doing?",
    "I'm drinking apple juice.",
    "I have three apples.",
    "Steve Jobs is the CEO of apple.",
    "Steve Jobs likes to eat apples."
])))

In [None]:
#### Now we actually run our code for Windows
base_word="windows"
options = [ ("building", "http://en.wikipedia.org/wiki/Window"),
            ("software", "http://en.wikipedia.org/wiki/Microsoft_Windows") ]
(make_features, classifier) = make_classifier(base_word, use_verbs=True, *options)
print map(lambda x: options[x][0], classifier.predict(make_features([
    "Bill Gates was involved with Windows.",
    "Could you open the window?",
    "The 'broken window' theory related broken windows to increases in crime rate.",
    "The windows are all made of shatter-proof glass.",
    "Could you install windows on your computer?",
    "Could you install windows on your house?"
])))

In [None]:
#### Now we actually run our code for Ford
base_word="ford"
options = [ ("crossing", "http://en.wikipedia.org/wiki/Ford_(crossing)"),
            ("company", "http://en.wikipedia.org/wiki/Ford") ]
(make_features, classifier) = make_classifier(base_word, *options)
print map(lambda x: options[x][0], classifier.predict(make_features([
    "I tried to ford the river and my unfortunate oxen died.",
    "Ford makes cars, though their quality is sometimes in dispute.",
    "The Ford Mustang is an iconic automobile.",
    "The river crossing was shallow, but we could not ford it."
])))

## Exercises / Brainstorming for Improvement:

1. Change the code to use just the ad hoc features.  How does this change the results?  Why do you think this is?

2. Same question as 1, but for the tf-idf features.

3. Change the formation of tf-idf features as follows -- when doing the tf-idf weighting (in the call to fit appearing in make_ng_stem_vectorizer) we pass in the sentences as separate documents.  How do the results change if we pass them in as just two documents?

4. What ideas do you think could improve the performance of this model?

#### Exit Tickets
1. What are some other options for modeling with text data besides bag of words?
1. How would you account for the fact that word meanings change over time?
1. How do stopwords, stemming, and limiting the # features affect variance-bias?

##Spoiler

##...

##...

##...

Some ideas / hints for the exercises:
  - A key problem with the model is the small amount of training data.  At the least, we could follow links from the given Wikipedia articles.  Better would be to find other sources that directly use the words Apple/apple.
  - In this specific case (apple/Apple) we would do better by using a few human created absolute rules _first_: e.g., typos aside -- apple's will always refer to the company and apples to the fruit, so we do not need to run a more complicated learner. 
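
The "rules first" idea in the last hint can be sketched as follows. Everything here is illustrative: `classify_apple` is a hypothetical wrapper, the `fallback` argument stands for whatever learned classifier you already have (any callable taking a sentence and returning a label), and only the straight ASCII apostrophe is handled.

```python
import re

def classify_apple(sentence, fallback=None):
    # Absolute rule 1: the possessive "apple's" is taken to mean the company.
    if re.search(r"\bapple's\b", sentence, flags=re.IGNORECASE):
        return "company"
    # Absolute rule 2: the plural "apples" is taken to mean the fruit.
    if re.search(r"\bapples\b", sentence, flags=re.IGNORECASE):
        return "fruit"
    # No rule fired: defer to the learned model, if one was supplied.
    if fallback is not None:
        return fallback(sentence)
    return None

labels = [classify_apple(s) for s in [
    "How is Apple's stock doing?",
    "I have three apples.",
    "I'm drinking apple juice.",
]]
```

The third sentence matches neither rule, so it is the kind of input that would be handed to the more complicated learner.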

## Topics not covered

Natural language processing is a big field.  We only (really) talked about a few tools and techniques.  Here are some other terms that are relevant:

 - Context free grammars (and probabilistic context free grammars): This is a simple and basic technique for parsing.  
 - [Word2vec](https://code.google.com/p/word2vec/) is a popular tool for creating a vectorized representation of a text corpus. The learned vectors can then be used to identify/predict words similar to a target, or even (weakly) to reason by analogy. For example, vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen').
 
 To use Word2vec in Python (and get the computation speed improvements) look at [gensim](https://radimrehurek.com/gensim/models/word2vec.html) and [cython](http://docs.cython.org/src/quickstart/install.html). This [Git repo](https://github.com/danielfrg/word2vec) is an alternative way to access the algorithm. You might also use this [Kaggle competition](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors) as a reference.


## Proviso (old)
Before running some of what's below, you will want to _disable_ `pylab` (which is enabled by default in the `ipython notebook` configuration I set up for our DO instances).   The easiest way to do this is to run the following code (or just change the indicated line in the configuration file), and then to restart ipython notebook:

```bash
patch ~/.ipython/profile_nbserver/ipython_notebook_config.py <<EOF
643c643
< c.IPKernelApp.pylab = 'inline'
---
> c.IPKernelApp.matplotlib = 'inline'
EOF
```

(we talked about this in lecture)

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*