# NLP example case study: Apple/apple.

The notebook documents an NLP case study. 

Useful tools (all - Python libraries):
 - http://scikit-learn.org/stable/modules/feature_extraction.html
 - http://www.nltk.org/
 - http://www.nltk.org/howto/wordnet.html
 
The main Python package used for test mining is **nltk** 

##Overview
 - Bag of words
 - TF/IDF
 - n-grams
 - Stemming / part of speech tagging / etc.
 - Feature hashing
 - Other topics

## The problem: Word sense disambiguation.  

In a given block of text, we need to be able to distinguish between the meaning of Apple (the company) vs. apple (the fruit).  Ideally one would also be able to tell apart Windows vs windows, etc. via similar examples.


**The type of learner**: going to choose to look at this as a _supervised classification_ problem. This means we need some labeled data.

**The training dataset**: going to try to use Wikipedia's articles on the given topics as a chosen "corpus" of text.

**The test dataset**: Possible option would be to mark up a small corpus by hand, or to use sentences taken from Wikipedia with the disambiguation coming from looking at the target of outgoing links.

## Brainstorming:

For "Apple"/"apple" case simple capitalization, presence of a possessive, could be the main discriminants. To do better one could look at nearby words (e.g., "software" and "computer" hint one way, while "flavor" hints the other).  Figuring out how to turn this idea into an actual implementation takes us to our first two topics: **bag of words** and **TF/IDF**.

How might we do this?
  1. We need to clean and tokenize the text data: __Tokenization__ refers to splitting the text into pieces, in this case into sentences and into words.  Cleaning can also include things like __stemming__ or __lemmatizing__ (identifying similar words like "computer" and "computers" to their stem).
  2. We need to extract __features__.  We're going to think of our input as a sentence, and try to develop features of that sentence.  In this example application, I try to use:
   - Capitalized of the word apple? (_a_pple vs _A_pple)    
   - Pluralization of the word apple? (apples)
   - Possessive form of the word apple? (Apple's)
   - Presence (or frequency) of certain well-chosen words : Does (e.g.,) the word "computer" or "fruit" occur in the sentence?  (This feature regards the sentence as a simple __bag of words__ without regard to trying to parse its structure.)
   - In addition to single words, I also try looking for __n-grams__: Strings of n consecutive words.
   - There are common techniques for determining which words / n-grams to look for.  One of them is called __tf-idf__.
  3. Finally, run some sort of classifier on the features.
  
I mostly focused on general NLP techniques in 1 and 2, rather than diving deeply into techniques for word disambiguation.

## Step 0: Cleaning and tokenizing the data

The goal of the tokenization step is to clean the text and load it one string per sentence. I am going to use the nltk for that purpose.

###Splitting into words/sentences:
NLTK has convenient presets for sentence and word tokenization (i.e., splitting a document into sentences, resp. splitting a sentence into words).

In [5]:
import nltk

my_long_string = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
my_list_of_sentences = nltk.tokenize.sent_tokenize(my_long_string) 
words = [ nltk.tokenize.word_tokenize(sent) for sent in my_list_of_sentences]
print words

[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]


In [6]:
import urllib2
from bs4 import BeautifulSoup
import re
import nltk.tokenize

# Spit out (slightly cleaned up) sentences from a Wikipedia article.
def wikipedia_to_sents(url):
    soup = BeautifulSoup(urllib2.urlopen(url)).find(attrs={'id':'mw-content-text'})

    # The text is litered by references like [n].  Drop them.
    def drop_refs(s):
        return ''.join( re.split('\[\d+\]', s) )
    
    paragraphs = [drop_refs(p.text) for p in soup.find_all('p')]
    
    raw_sents = reduce(lambda x, y: x + y, [nltk.tokenize.sent_tokenize(p.strip()) for p in paragraphs if p.strip()!=''])    
    return filter(lambda s: len(s.split(" "))>2, raw_sents)

fruit_sents = wikipedia_to_sents("http://en.wikipedia.org/wiki/Apple")
company_sents = wikipedia_to_sents("http://en.wikipedia.org/wiki/Apple_Inc.")

In [7]:
#company_sents[-110:-100]
company_sents[1:3]

[u'Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, online services, and personal computers.',
 u'Its best-known hardware products are the Mac line of computers, the iPod media player, the iPhone smartphone, the iPad tablet computer, and the Apple Watch smartwatch.']

In [8]:
fruit_sents[1:3]

[u'Malus pumila auct.', u'Pyrus malus L.']

#Step 1.1.  Features: Bag-of-words

Using text for learning algorithms is not good, need to use vectors of numbers instead. 
The simplest way to turn a text into a vector of number is to treat the text as a "bag of words."  Namely

  - Split the text into words
  - Count how many times each word (/each word in some fixed vocabulary) occurs
  - _(Variant)_ Just do a binary "yes / no" for whether each word (/.. in some vocabulary) is contained in the material
  
The output is a very large, but usually sparse, vector: The number of coordinates is the number of words in our dictionary, and the $i$-th coordinate entry is the number of occurances of the $i$-th word.

There's a reasonable implementation of this in the CountVectorizer class in sklearn.feature_extraction.text.  See http://scikit-learn.org/stable/modules/classes.html#text-feature-extraction-ref for more detail on the options.

For instance, here we'll apply this to each sentence (as a separate bag).

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# *temporarily* use a small hand built vocabulary of words.
vocabulary = "fruit eat tasty pie leaf cook tree computer computers laptop tech technology ceo jobs ipad iphone announce announced mac company companies employee employees user software released".split(" ")

bag_of_words_vectorizer = CountVectorizer(vocabulary=vocabulary)

counts = bag_of_words_vectorizer.fit_transform( fruit_sents[1:3] + company_sents[1:3])
print counts.shape

(4, 26)


In [10]:
# Note that this is a **sparse** matrix.
print counts.toarray() #This is what it actually looks like..
print counts           # .. this is just describing the non-zero entries

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0]]
  (2, 7)	1
  (2, 8)	1
  (2, 11)	1
  (2, 19)	1
  (2, 24)	1
  (3, 7)	1
  (3, 8)	1
  (3, 14)	1
  (3, 15)	1
  (3, 18)	1


In [11]:
bag_of_words_vectorizer.get_feature_names()

['fruit',
 'eat',
 'tasty',
 'pie',
 'leaf',
 'cook',
 'tree',
 'computer',
 'computers',
 'laptop',
 'tech',
 'technology',
 'ceo',
 'jobs',
 'ipad',
 'iphone',
 'announce',
 'announced',
 'mac',
 'company',
 'companies',
 'employee',
 'employees',
 'user',
 'software',
 'released']

### Stop words
It's common to want to __omit__ certain common words when doing these counts -- "a", "an", and "the" are common enough so that their counts do not tend to give us any hints as to the meaning of documents.  Such words that we want to omit are called __stop words__ (they don't stop anything, though).

NLTK contains a standard list of such stop words for English in `nltk.corpus.stopwords.words('english')`.  In our application, we'd also want to include "apple" -- it is certainly not going to help us distinguish our two meanings!

In [12]:
nltk.corpus.stopwords.words('english')[1:10]

[u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your']

In [13]:
counter=CountVectorizer(max_features=300,
                        stop_words=nltk.corpus.stopwords.words('english') + ['apple'])
counter=counter.fit( fruit_sents + company_sents )
print counter.get_feature_names()[1:10]

# Now we can use it with that vectorizer, like so...
counter.transform(company_sents)
counter.transform(fruit_sents)

[u'10', u'100', u'16', u'19', u'1997', u'1999', u'20', u'2001', u'2006']


<210x300 sparse matrix of type '<type 'numpy.int64'>'
	with 751 stored elements in Compressed Sparse Row format>

In [14]:
#print counter.transform(company_sents)[1:2]
print counter.transform(fruit_sents)

  (0, 162)	1
  (1, 162)	1
  (2, 162)	1
  (3, 44)	1
  (3, 104)	1
  (3, 139)	1
  (3, 162)	1
  (3, 269)	2
  (4, 104)	1
  (4, 111)	1
  (4, 162)	1
  (4, 269)	1
  (4, 295)	1
  (5, 37)	1
  (5, 52)	1
  (5, 100)	1
  (5, 162)	1
  (5, 250)	1
  (5, 269)	1
  (5, 288)	1
  (6, 32)	1
  (6, 37)	1
  (6, 48)	1
  (6, 92)	1
  (6, 111)	1
  :	:
  (203, 234)	1
  (203, 244)	1
  (204, 141)	1
  (204, 234)	1
  (204, 244)	1
  (205, 139)	1
  (205, 185)	2
  (205, 234)	2
  (206, 167)	1
  (206, 222)	1
  (206, 239)	1
  (206, 257)	1
  (207, 25)	1
  (207, 53)	2
  (207, 75)	3
  (207, 76)	1
  (207, 88)	1
  (207, 97)	1
  (207, 104)	1
  (207, 113)	1
  (207, 190)	1
  (207, 193)	1
  (207, 250)	1
  (207, 278)	1
  (207, 291)	1


##n-grams

Instead of looking at just single words, it is also useful to look at **n-grams**: These are n-word long sequences of words (i.e., each of "farmer's market", "market share", and "farm share" is a 2-gram).

The exact same sort of counting techniques apply.  The `CountVectorizer` function has built in support for this, too:

If you pass it the `ngram_range=(m, M)` then it will count $n$-grams with  $m \leq n \leq M$.

In [15]:
ng_counter=CountVectorizer(max_features=300, 
                           ngram_range=(2,2), 
                           stop_words=nltk.corpus.stopwords.words('english') + ['apple', 'Apple'])
ng_counter=ng_counter.fit( fruit_sents + company_sents  )
print ng_counter.get_feature_names()[1:10]

# Now we can use it with that vectorizer, like so...
ng_counter.transform(company_sents)
ng_counter.transform(fruit_sents)

[u'2011 jobs', u'2012 update', u'2014 update', u'24 2012', u'27 2010', u'30 000', u'30 2012', u'33182 122', u'37 33182']


<210x300 sparse matrix of type '<type 'numpy.int64'>'
	with 191 stored elements in Compressed Sparse Row format>

In [16]:
#print counter.transform(company_sents)[1:2]
print ng_counter.transform(fruit_sents)[0:4]

  (3, 165)	1


##TF-IDF: term frequency–inverse document frequency

With single word vocabularies, we can probably do an okay job of coming up with a reasonable list of words that distinguish between the two documents.  With n-grams, even for $n=2$, it is better to let a computer help us.  

Just using frequencies, as above, is clearly not great.  Both apples the fruit and Apple the company are enjoyed around the world (one of the 2-grams that came up above!).  We would like to find words that are common in one document, but not common in all of them.  This is the goal of the __td-idf weighting__.  A precise definition is:


  1. If $d$ denotes a document and $t$ denotes a term, then the _raw term frequency_ $\mathrm{tf}^{raw}(t,d)$ is
  $$ \mathrm{tf}^{raw}(t,d) = \text{the number of times the term $t$ occurs in the document $d$} $$
  The vector of all term frequencies can optionally be _normalized_ either by dividing by the maximum of ny single word's occurance count ($L^1$) or by the Euclidean length of the vector of word occurance counts ($L^2$).  Scikit-learn by defaults does this second one:
  $$ \mathrm{tf}(t,d) = \mathrm{tf}^{L^2}(t,d) = \frac{\mathrm{tf}^{raw}(t,d)}{\sqrt{\sum_t \mathrm{tf}^{raw}(t,d)^2}} $$
  2. If $$ D = \left\{ d : d \in D \right\} $$ is the set of possible documents, then  the _inverse document frequency_ is
  $$ \mathrm{idf}^{naive}(t,D) = \log \frac{\# D}{\# \{d \in D : t \in d\}} \\
  = \log \frac{\text{count of all documents}}{\text{count of those documents containing the term $t$}} $$
  with a common variant being
  $$ \mathrm{idf}(t, D) = \log \frac{\# D}{1 + \# \{d \in D : t \in d\}} \\
   = \log \frac{\text{count of all documents}}{1 + \text{count of those documents containing the term $t$}} $$
  (This second one is the default in scikit-learn. Without this tweak we would omit the $1+$ in the denominator and have to worry about dividing by zero if $t$ is not found in any documents.)
  3. Finally, the weight that we assign to the term $t$ appearing in document $d$ and depending on the corpus of all documents $D$ is
  $$ \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \mathrm{idf}(t,D) $$

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

ng_tfidf=TfidfVectorizer(max_features=300, 
                         ngram_range=(1,2), 
                         stop_words=nltk.corpus.stopwords.words('english') + ["apple", "apples"],
                         )#token_pattern=u'\w') #u'[^0-9]+')
ng_tfidf=ng_tfidf.fit( fruit_sents + company_sents )
print ng_tfidf.get_feature_names()[1:30]

[u'10', u'100', u'16', u'19', u'20', u'2001', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015', u'24', u'30', u'500', u'800', u'access', u'according', u'added', u'almost', u'also', u'america', u'american', u'announced', u'app']


## In addition: Document similarity

A common problem is looking up a document similar to a given snippet, or relatedly comparing two documents for similarity.  The above provides a simple method for this called __cosine similarity__:
  - To each of the two douments $d_1, d_2$ in a corpus of documents $D$, assign its tf or tf-idf vector $$ (v_i)_{j} = \mathrm{tfidf}( t_{j}, d_i, D ) $$
  where $i$ ranges over indices for documents, and $j$ ranges over indices for terms in the vocabulary.
  - To compare two documents, simply find the cosine of the angle between the vectors:
  $$ \frac{v_i \cdot v_{i'}}{|v_i| |v_{i'}|} $$

## Stemming

In our original hand-built vocabulary, we had to include both "computer" and "computers".  It would have been useful to identify them as one word.

This is not limited to just trailing "s" characters: e.g., the words "carry", "carries", "carrying", and "carried" all carry -- roughly -- the same meaning.  The process of replacing them by a common root, or **stem**, is called stemming -- the stem will not, in general, be a full word itself.

There's a related process called **lemmatization**: The analog of the "stem" here _is_ an actual word.  We can choose to first stem our words before counting them:


In [18]:
stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)
print "stemming carry", [stemmer.stem(s) for s in ["carry", "carries", "carrying", "carried"]]
print "stemming eat", [stemmer.stem(s) for s in ["eat", "eating", "eaten", "ate"]]
                                
# More examples 
print stemmer.stem("The quick brown fox jumped over the lazy dog.  I can't believe it's not butter.  I tried to ford the river and my unfortunate oxen died.")
print " ".join(map(stemmer.stem, "The quick brown fox jumped over the lazy dog.  I can't believe it's not butter.  I tried to ford the river and my unfortunate oxen died.".split(" ")))

stemming carry [u'carri', u'carri', u'carri', u'carri']
stemming eat [u'eat', u'eat', u'eaten', u'ate']
the quick brown fox jumped over the lazy dog.  i can't believe it's not butter.  i tried to ford the river and my unfortunate oxen died.
the quick brown fox jump over the lazi dog.  i can't believ it not butter.  i tri to ford the river and my unfortun oxen died.


In [19]:
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
print "lemma carry", [lemmatizer.lemmatize(s) for s in ["carry", "carries", "carrying", "carried"]]
print "lemma eat", [lemmatizer.lemmatize(s) for s in ["eat", "eating", "eaten", "ate"]]

lemma carry ['carry', u'carry', 'carrying', 'carried']
lemma eat ['eat', 'eating', 'eaten', 'ate']


We can tell our bag-of-words counters (/tf-idf) to first run its input through the stemmer.  This way it won't have to include both e.g., 'computer' and 'computers':

In [20]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

default_tokenizer = TfidfVectorizer().build_tokenizer()
stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)
    
def tokenize_stem(text):
    """
    We will use the default tokenizer from TfidfVectorizer, combined with the nltk SnowballStemmer.
    """
    tokens = default_tokenizer(text)
    stemmed = map(stemmer.stem, tokens)
    return stemmed

ng_stem_tfidf=TfidfVectorizer(max_features=300, 
                         ngram_range=(1,2), 
                         stop_words=map(stemmer.stem, nltk.corpus.stopwords.words('english') + ["apple"]),
                         tokenizer = tokenize_stem)
ng_stem_tfidf=ng_stem_tfidf.fit( fruit_sents + company_sents )

ng_stem_vocab = ng_stem_tfidf.get_feature_names()
print ng_stem_vocab

[u'000', u'10', u'100', u'16', u'19', u'20', u'2001', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015', u'30', u'800', u'access', u'accord', u'ad', u'addit', u'aim', u'allow', u'also', u'american', u'announc', u'app', u'app store', u'applic', u'april', u'around', u'august', u'avail', u'averag', u'back', u'base', u'becam', u'began', u'best', u'billion', u'brand', u'brought', u'build', u'call', u'camera', u'campus', u'centuri', u'ceo', u'chang', u'china', u'climat', u'color', u'commerci', u'common', u'compani', u'comput', u'condit', u'consum', u'continu', u'control', u'cook', u'corpor', u'countri', u'creat', u'cultiv', u'cultivar', u'davidson', u'day', u'decemb', u'design', u'develop', u'devic', u'differ', u'digit', u'direct', u'diseas', u'display', u'download', u'due', u'dwarf', u'earli', u'effect', u'electron', u'employe', u'end', u'europ', u'even', u'event', u'facil', u'factori', u'featur', u'first', u'focus', u'follow', u'form', u'found', u'fox

##Variation: Feature hashing

When doing "bag of words" type techniques on a *large* corpus and without an existing vocabulary, there is a simple trick that can often be useful.  The issue (and solution) is as follows: 

 - The output is a feature vector, so that whenever we encounter a word we must look up which coordinate slot it is in.  A naive way would be to keep a list of all the words encoutered so far, and look up each word when it is encountered.  Whenever we encounter a new word, we see if we've already seen it before and if not -- assign it a new number.  This requires storing all the words that we have seen in memory, cannot be done in parallel (because we'd have to share the hash-table of seen words), etc.
 - A **hash function** takes as input something complicated (like a string) and spits out a number, with the desired property being that different inputs *usually* produce different outputs.  (This is how hash tables are implemented, as the name suggests.)
 - So -- rather than exactly looking up the coordinate of a given word, we can just use its hash value (modulo a big size that we choose).  This is fast and parallelizes easily.  (There are some downsides: You cannot tell, after the fact, what word each of your feature actually corresponds to!)
 
Scikit-learn includes `sklearn.feature_extraction.text.HashingVectorizer` to do this.  It behaves as almost a drop-in replacement for `CountVectorizer`.  It can be used with tf-idf by combining it with the `TfidfTransformer` (the `TfidfVectorizer` is the `CountVectorizer` together with the `TfidfTransformer`). For our application (where the training and test data is small), we may as well just use `TfidfVectorizer` -- but it is good to know that `HashingVectorizer` is there.

#Step 1.2: Some other features.

##Part of speech tagging.

Consider the "Ford" vs "ford" example.  As a human being, the easiest way to tell these apart is that Ford is a __noun__ while ford is a __verb__.

Fortunately, NLTK also has a part-of-speech tagger: You give it a sentence, and it tries to tag the parts of speech (e.g., noun, verb, adjective, etc.).  The command is `nltk.pos_tag` and for documentation on the tags either search around online, or use `nltk.help.upenn_tagset`:

In [21]:
s1 = "I tried to ford the river, and my unfortunate oxen died"
s2 = "Henry Ford built factories to facilitate the construction of the Ford automobile."

In [22]:
nltk.pos_tag(nltk.tokenize.word_tokenize(s1))

[('I', 'PRP'),
 ('tried', 'VBD'),
 ('to', 'TO'),
 ('ford', 'VB'),
 ('the', 'DT'),
 ('river', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('my', 'PRP$'),
 ('unfortunate', 'NN'),
 ('oxen', 'NN'),
 ('died', 'VBD')]

In [23]:
nltk.pos_tag(nltk.tokenize.word_tokenize(s2))

[('Henry', 'NNP'),
 ('Ford', 'NNP'),
 ('built', 'VBD'),
 ('factories', 'NNS'),
 ('to', 'TO'),
 ('facilitate', 'VB'),
 ('the', 'DT'),
 ('construction', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('Ford', 'NNP'),
 ('automobile', 'NN'),
 ('.', '.')]

In [24]:
nltk.help.upenn_tagset('NN.*')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...


##Capitalization, punctuation, etc.
There are the obvious features that we had in mind... lorum ipsum est.

In [25]:
import numpy as np

def feature_verbs(words, positions):
    pos_tag = nltk.pos_tag(words)
    return len( [ i for i in positions if pos_tag[i][1] == 'VB'] )

def is_cap(word):
    return word[0] == word[0].capitalize()
def feature_caps(words, positions):
    return len ( [ i for i in positions if is_cap(words[i]) ] )

def feature_plural(words, positions):
    def is_plural(word):
        return re.match( ".*s$", word )
    return len ( [ i for i in positions if is_plural(words[i]) ] )

## N.B. The nltk word tokenizer will tokenize "Apple's" as ["Apple", "'s"]
def feature_posessive(words, positions):
    l = len(words)
    return len ( [ i for i in positions if i+1 < l and words[i+1]=="'s" ] )

def ad_hoc_features(keyword, strs):
    """
    Given a keyword (e.g., "apple") and a list of strings;
    Returns a numpy ndarray encoding several ad hoc features of the string that are local
    near occurances of the keyword:
        - If the keyword is capitalized
        - If it is plural (in the stupid sense of ending in s.. good enough for 'apple')
        - If it is possessive in the stupid sense of being followed by 's)
        - If the keyword is a verb (e.g., for Ford vs ford)
    """
    stemmed_word = stemmer.stem(keyword)
    def feature_one(s):
        words = nltk.tokenize.word_tokenize(s)
        hits = [ i for i in range(len(words)) if stemmer.stem(words[i]) == stemmed_word ]
        return np.array([ feature_caps(words, hits), 
                          feature_plural(words, hits), 
                          feature_posessive(words, hits), 
                          feature_verbs(words, hits) ])
    return np.asarray(map(feature_one, strs))

In [26]:
ad_hoc_features("ford", ["I drive a Ford.", "I tried to ford the river.", "That's not Ford's."])

array([[1, 0, 0, 0],
       [0, 0, 0, 1],
       [1, 0, 1, 0]])

In [27]:
ad_hoc_features("apple", ["Have you eaten your apple?", "How is Apple's stock doing?", "Apples are tasty."])

array([[0, 0, 0, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 0]])

## The actual application
Diclaimer: This version is actually pretty bad -- it uses many of the right ideas, but puts them together pretty poorly (and with fairly little available data).

In [32]:
import urllib2
import re
import nltk.tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from bs4 import BeautifulSoup
from sklearn.naive_bayes import MultinomialNB


def wikipedia_to_sents(url):
    """
    Retrieves a URL from wikipedia, and returns a list of sentences (of at least 3 words) in the body text.
    """
    soup = BeautifulSoup(urllib2.urlopen(url)).find(attrs={'id':'mw-content-text'})
    
    # The text is litered by references like [n].  Drop them.
    def drop_refs(s):
        return ''.join( re.split('\[\d+\]', s) )
    
    paragraphs = [drop_refs(p.text) for p in soup.find_all('p')]
    
    raw_sents = reduce(lambda x, y: x + y, [nltk.tokenize.sent_tokenize(p.strip()) for p in paragraphs if p.strip()!=''])
    return filter(lambda s: len(s.split(" "))>2, raw_sents)


#### Bag-of-words features, using tf-idf
def make_ng_stem_vectorizer(texts, extra_stop_words, max_features):
    """
    Given 
        - a list of texts ("documents");
        - a list of extra stop words (in addition to the standard NLTK English ones); and,
        - a number of features to remember
    Returns the tf-idf feature extractor with this number of features, based on these documents.
    """
    default_tokenizer = TfidfVectorizer().build_tokenizer()
    stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)
    def tokenize_stem(text):
        return map(stemmer.stem, default_tokenizer(text))
    ng_stem_tfidf=TfidfVectorizer(#max_features=max_features, 
                             ngram_range=(1,2),
                             stop_words=map(stemmer.stem, nltk.corpus.stopwords.words('english') + extra_stop_words),
                             tokenizer = tokenize_stem)
    return ng_stem_tfidf.fit( texts )


#### Ad hoc features
def feature_verbs(words, positions):
    pos_tag = nltk.pos_tag(words)
    return len( [ i for i in positions if pos_tag[i][1] == 'VB'] )

def is_cap(word):
    return word[0] == word[0].capitalize()
def feature_caps(words, positions):
    return len ( [ i for i in positions if is_cap(words[i]) ] )

def feature_plural(words, positions):
    def is_plural(word):
        return re.match( ".*s$", word )
    return len ( [ i for i in positions if is_plural(words[i]) ] )

## N.B. The nltk word tokenizer will tokenize "Apple's" as ["Apple", "'s"]
def feature_posessive(words, positions):
    l = len(words)
    return len ( [ i for i in positions if i+1 < l and words[i+1]=="'s" ] )

def ad_hoc_features(keyword, strs, use_verbs=False):
    """
    Given a keyword (e.g., "apple") and a list of strings;
    Returns a numpy ndarray encoding several ad hoc features of the string that are local
    near occurances of the keyword:
        - If the keyword is capitalized
        - If it is plural (in the stupid sense of ending in s.. good enough for 'apple')
        - If it is possessive in the stupid sense of being followed by 's)
        - If the keyword is a verb (e.g., for Ford vs ford)
    """
    stemmed_word = stemmer.stem(keyword)
    def feature_one(s):
        words = nltk.tokenize.word_tokenize(s)
        hits = [ i for i in range(len(words)) if stemmer.stem(words[i]) == stemmed_word ]
        ret_list = [ feature_caps(words, hits), 
                     feature_plural(words, hits), 
                     feature_posessive(words, hits) ]
        if use_verbs:  # This is slow, so only use it sometimes
            ret_list.append( feature_verbs(words, hits)  )
        return np.array(ret_list)
    return np.asarray(map(feature_one, strs))

####
def make_classifier(base_word, meaning1, meaning2, use_verbs=False):
    """
    Given
        - a base word (e.g., "apple", "ford") that can have ambiguous meaning
        - a pair meaning1 = (name1, url1) of a label for the first meaning, and a Wikipedia URL for it
        - a pair meaning2 = ... for the other meaning
    Returns a tuple (make_features, classifier) where
        - make_features is a function taking in a string text, and returns a feature vector
        - classifier takes in a feature vector (output by make_features) and predicts the meaning
    """
    name1, url1 = meaning1
    name2, url2 = meaning2
    sents1 = wikipedia_to_sents(url1)
    sents2 = wikipedia_to_sents(url2)
    tfidf_vect = make_ng_stem_vectorizer(sents1 + sents2,
                                        [base_word],
                                        100000)
    def make_features(sents, use_verbs=False):
        a = ad_hoc_features(base_word, sents, use_verbs)
        t = tfidf_vect.transform(sents).toarray()
        return np.hstack((a,t))

    # Build the training data
    train_feat = make_features(sents1 + sents2, use_verbs)
    train_res  = np.array( [0] * len(sents1) + [1] * len(sents2) )
    
    classifier = MultinomialNB()
    classifier = classifier.fit(train_feat, train_res)
    return (lambda x: make_features(x, use_verbs), classifier)

In [33]:
#### Now we actually run our code for Apple
base_word="apple"
options = [ ("fruit", "http://en.wikipedia.org/wiki/Apple"),
            ("company", "http://en.wikipedia.org/wiki/Apple_Inc.") ]
(make_features, classifier) = make_classifier("apple", *options)
print map(lambda x: options[x][0], classifier.predict(make_features([
    "I'm baking a pie with my granny smith apples.",
    "I looked up the recipe on my Apple iPhone.",
    "The apple pie recipe is on my desk.",
    "How is Apple's stock doing?",
    "I'm drinking apple juice.",
    "I have three apples.",
    "Steve Jobs is the CEO of apple.",
    "Steve Jobs likes to eat apples."
])))

['fruit', 'company', 'company', 'company', 'fruit', 'fruit', 'company', 'fruit']


In [30]:
#### Now we actually run our code for Apple
base_word="windows"
options = [ ("building", "http://en.wikipedia.org/wiki/Window"),
            ("software", "http://en.wikipedia.org/wiki/Microsoft_Windows") ]
(make_features, classifier) = make_classifier("apple", use_verbs=True, *options)
print map(lambda x: options[x][0], classifier.predict(make_features([
    "Bill Gates was involved with Windows.",
    "Could you open the window?",
    "The 'broken window' theory related broken windows to increases in crime rate.",
    "The windows are all made of shatter-proof glass.",
    "Could you install windows on your computer?",
    "Could you install windows on your house?"
])))

['software', 'building', 'software', 'building', 'software', 'software']


In [31]:
#### Now we actually run our code for Ford
base_word="ford"
options = [ ("crossing", "http://en.wikipedia.org/wiki/Ford_(crossing)"),
            ("company", "http://en.wikipedia.org/wiki/Ford") ]
(make_features, classifier) = make_classifier("apple", *options)
print map(lambda x: options[x][0], classifier.predict(make_features([
    "I tried to ford the river and my unfortunate oxen died.",
    "Ford makes cars, though their quality is sometimes in dispute.",
    "The Ford Mustang is an iconic automobile.",
    "The river crossing was shallow, but we could not ford it."
])))

['company', 'company', 'company', 'company']
