
You have text data and want to tag each word or character with its part of
speech.

Use NLTK’s pre-trained part-of-speech tagger:


In [20]:
# Load libraries
from nltk import pos_tag
from nltk import word_tokenize
from sklearn.preprocessing import MultiLabelBinarizer


In [21]:
# Create text
text_data = "Chris loved outdoor running"
# Use pre-trained part of speech tagger
text_tagged = pos_tag(word_tokenize(text_data))
# Show parts of speech
text_tagged


[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

The output is a list of tuples, each holding a word and its part-of-speech tag.
NLTK uses the Penn Treebank part-of-speech tags. Some examples of the Penn
Treebank tags are:

![](./NLTKtag.jpg)
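
To make the tags in the output easier to read, here is a minimal sketch that maps a handful of Penn Treebank tags to plain-English descriptions. The dictionary `PENN_TAGS` and the helper `describe` are hypothetical names introduced for illustration and cover only a small subset of the tag set; the tagged list is copied from the output above rather than recomputed.

```python
# A small, hand-written subset of the Penn Treebank tag set (illustrative only)
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "JJ": "adjective",
    "RP": "particle",
    "PRP": "personal pronoun",
    "DT": "determiner",
}

# Tagged output copied from the cell above
text_tagged = [("Chris", "NNP"), ("loved", "VBD"),
               ("outdoor", "RP"), ("running", "VBG")]


def describe(tagged):
    """Pair each word with a readable description of its tag."""
    return [(word, PENN_TAGS.get(tag, "unknown tag")) for word, tag in tagged]


for word, description in describe(text_tagged):
    print(f"{word}: {description}")
```

Tags missing from the dictionary fall back to "unknown tag", so the helper does not fail on tags outside this subset.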

In [22]:
# Keep only the words tagged as nouns
[word for word, tag in text_tagged if tag in ['NN', 'NNS', 'NNP', 'NNPS']]

['Chris']

In [23]:
# Keep the noun words together with their tags
[(word, tag) for word, tag in text_tagged if tag in ['NN', 'NNS', 'NNP', 'NNPS']]

[('Chris', 'NNP')]

A more realistic situation would be that we have data where every observation
contains a tweet and we want to convert those tweets into features for
individual parts of speech (e.g., a feature with 1 if a proper noun is present, and 0
otherwise):

In [24]:
# Create text
tweets = ["I am eating a burrito for breakfast",
          "Political science is an amazing field",
          "San Francisco is an awesome city"]

In [25]:
# Create list
tagged_tweets = []
# Tag each word in each tweet, keeping only the tags
for tweet in tweets:
    tweet_tag = pos_tag(word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])
# Use one-hot encoding to convert the tags into features
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)

array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 0, 1]])

In [26]:
# Show feature names
one_hot_multi.classes_


array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'],
      dtype=object)
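
To see what the binarizer is doing, the same matrix can be rebuilt in plain Python. This is a sketch, not scikit-learn's implementation: the tag lists below are hard-coded assumptions chosen to be consistent with the matrix above (the exact tag sequence for each tweet is not shown in the output), and `one_hot_tags` is a hypothetical helper name.

```python
# Tag lists assumed for the three tweets (consistent with the matrix above)
tagged_tweets = [
    ["PRP", "VBP", "VBG", "DT", "NN", "IN", "NN"],
    ["JJ", "NN", "VBZ", "DT", "JJ", "NN"],
    ["NNP", "NNP", "VBZ", "DT", "JJ", "NN"],
]


def one_hot_tags(tag_lists):
    """Return the sorted tag names and one 0/1 row per observation."""
    # Every distinct tag becomes one feature column, in sorted order
    classes = sorted({tag for tags in tag_lists for tag in tags})
    rows = []
    for tags in tag_lists:
        present = set(tags)
        # 1 if the tag occurs at least once in this observation, else 0
        rows.append([int(tag in present) for tag in classes])
    return classes, rows


classes, rows = one_hot_tags(tagged_tweets)
print(classes)
for row in rows:
    print(row)
```

The `NNP` column is the proper-noun indicator mentioned earlier: it is 1 only for the third tweet. Repeated tags within a tweet do not change the row, because the encoding records presence and not counts.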

If our text is English and not on a specialized topic (e.g., medicine), the simplest
solution is to use NLTK’s pre-trained part-of-speech tagger. However, if
pos_tag is not accurate enough, NLTK also lets us train our own
tagger. The major downside of training a tagger is that we need a large corpus of
text where the tag of each word is known. Constructing such a tagged corpus is
labor intensive and is probably a last resort.

All that said, if we had a tagged corpus and wanted to train a tagger, the
following is an example of how we could do it. The corpus we are using is the
Brown Corpus, one of the most popular sources of tagged text. Here we use a
backoff n-gram tagger, which predicts a word’s part-of-speech tag from the word
itself and the tags of the previous n-1 words. First we take into account
the tags of the previous two words using TrigramTagger; if that context was not
seen in training, we “back off” and take into account the tag of the previous
one word using BigramTagger, and finally if that fails we only look at the word
itself using UnigramTagger. To examine the accuracy of our tagger, we split our
text data into two parts, train our tagger on one part, and test how well it
predicts the tags of the second part:

In [27]:
# Load libraries
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
# Get some text from the Brown Corpus, broken into sentences
sentences = brown.tagged_sents(categories='news')
# Split into 4000 sentences for training and 623 for testing
train = sentences[:4000]
test = sentences[4000:]
# Create backoff tagger
unigram = UnigramTagger(train)
bigram = BigramTagger(train, backoff=unigram)
trigram = TrigramTagger(train, backoff=bigram)
# Show accuracy (accuracy() replaces the deprecated evaluate() in recent NLTK releases)
trigram.accuracy(test)


0.8174734002697437
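
The backoff chain above can be illustrated with a toy re-implementation in plain Python. This is a sketch of the idea, not NLTK's code: `ToyNgramTagger`, its methods, and the three-sentence training set are hypothetical, and each tagger keys on the current word plus the tags of the previous n-1 words.

```python
from collections import Counter, defaultdict


class ToyNgramTagger:
    """Tag a word from (tags of the previous n-1 words, word).

    Contexts never seen in training are passed to the backoff tagger.
    """

    def __init__(self, n, train, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sentence in train:
            words = [word for word, _ in sentence]
            tags = [tag for _, tag in sentence]
            for i, tag in enumerate(tags):
                counts[self._context(words, i, tags)][tag] += 1
        # Keep only the most frequent tag for each context
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, words, i, history):
        # Tags of up to n-1 preceding words, plus the current word
        prev_tags = tuple(history[max(0, i - self.n + 1):i])
        return (prev_tags, words[i])

    def tag_one(self, words, i, history):
        ctx = self._context(words, i, history)
        if ctx in self.table:
            return self.table[ctx]
        if self.backoff is not None:
            return self.backoff.tag_one(words, i, history)
        return None  # no tagger in the chain knows this context

    def tag(self, words):
        history = []
        for i in range(len(words)):
            history.append(self.tag_one(words, i, history))
        return list(zip(words, history))


# Tiny hand-made tagged corpus (hypothetical data)
train = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("runs", "NNS"), ("ended", "VBD")],
    [("she", "PRP"), ("runs", "VBZ")],
]

unigram = ToyNgramTagger(1, train)
bigram = ToyNgramTagger(2, train, backoff=unigram)
trigram = ToyNgramTagger(3, train, backoff=bigram)

print(unigram.tag(["the", "runs", "ended"]))
print(trigram.tag(["the", "runs", "ended"]))
print(trigram.tag(["a", "dog", "runs"]))
```

On "the runs ended", the unigram tagger alone picks the most frequent tag for "runs", while the trigram chain uses the preceding determiner to choose the plural-noun reading. On "a dog runs", the unseen word "a" gets no tag, so the later words fall back to shorter contexts.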