# NLTK: Natural Language Made Easy

Dealing with text is hard! Thankfully, it's hard for everyone, so tools exist to make it easier.

NLTK, the Natural Language Toolkit, is a Python package "for building Python programs to work with human language data". It has many tools for basic language processing (e.g. tokenization and $n$-grams) as well as tools for more complicated language processing (e.g. part-of-speech tagging and parse trees).

NLTK has an [associated book about NLP](http://www.nltk.org/book/) that provides some context for the corpora and models.

## Installing NLTK, or "why do I need to download so much data?"
We can `conda install nltk` to get the package. Then we need to do something somewhat strange: we have to download data.



In [1]:
import nltk
nltk.download()

This pops up a GUI where we can choose what data to download.

---
![nltk download](fig/nltk_download.png)

---

What is this stuff? The data is separated into two categories:

1. Corpora
    - These are collections of text.
1. Models
    - These are data (e.g. weights, etc.) for trained models.

NLTK provides several collections of data to make installing easier.

- `all`: All corpora and models
- `all-corpora`: All corpora, no models
- `all-nltk`: Everything plus more data from the website
- `book`: Data to run the associated book
- `popular`: The most popular packages
- `third-party`: Extra data from third parties

Downloading the `popular` collection is recommended.
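On a machine without a display (a remote server, say) the GUI isn't an option. `nltk.download` also accepts a package or collection id directly, so a minimal sketch of a non-interactive download could look like the following. The helper name `download_nltk_collection` is made up for this example and is not part of NLTK.

```python
def download_nltk_collection(name="popular"):
    """Download an NLTK package or collection by id, without opening the GUI.

    `download_nltk_collection` is a hypothetical helper name, not an NLTK function.
    """
    # Imported inside the function so the helper can be defined
    # even before NLTK itself has been installed.
    import nltk

    # With an id argument, nltk.download fetches that package or collection
    # and returns a boolean indicating whether it succeeded.
    return nltk.download(name)
```

Calling `download_nltk_collection("popular")` once should be enough; the data is stored on disk and reused in later sessions.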

## Analyzing tweets
### First pass
Let's take a look at one corpus in particular: positive and negative tweets.

In [2]:
# read some Twitter data; fileids() lists the negative tweets file first, then the positive one
neg_id = nltk.corpus.twitter_samples.fileids()[0]
neg_tweets = nltk.corpus.twitter_samples.strings(neg_id)
pos_id = nltk.corpus.twitter_samples.fileids()[1]
pos_tweets = nltk.corpus.twitter_samples.strings(pos_id)

In [3]:
print(pos_tweets[10])
print()
print(neg_tweets[10])

#FollowFriday @wncer1 @Defense_gouv for being top influencers in my community this week :)

I have a really good m&amp;g idea but I'm never going to meet them :(((


How does the language in positive and negative tweets differ?

We can start by looking at how the words differ. NLTK provides tools for tokenization.

In [4]:
def tokenize_tweets1(tweets):
    """Get all of the tokens in a set of tweets"""
    tokens = [token for tweet in tweets for token in nltk.word_tokenize(tweet)]
    return tokens

What does this output?

In [5]:
pos_tokens = tokenize_tweets1(pos_tweets)
neg_tokens = tokenize_tweets1(neg_tweets)
print(pos_tokens[:10])

['#', 'FollowFriday', '@', 'France_Inte', '@', 'PKuchly57', '@', 'Milipol_Paris', 'for', 'being']


We can look at the most common tokens (like in the first homework) using Python's `Counter` class.

In [6]:
from collections import Counter

pos_count = Counter(pos_tokens)
neg_count = Counter(neg_tokens)
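If `Counter` is new to you, here is a small sketch of what it does, using a hand-made token list rather than the tweet corpus. Passing a number to `most_common` keeps only the top entries, which is handy because the full list for a real corpus gets very long.

```python
from collections import Counter

# A tiny hand-made token list (illustrative, not taken from the corpus)
tokens = ["good", "day", ":)", "good", ":)", ":)"]

counts = Counter(tokens)

# most_common(n) returns only the n most frequent (token, count) pairs
top_two = counts.most_common(2)
print(top_two)  # [(':)', 3), ('good', 2)]

# Tokens that never appeared simply count as zero
print(counts["bad"])  # 0
```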

In [7]:
pos_count.most_common()

[(':', 6667),
 (')', 5165),
 ('@', 5119),
 ('!', 1920),
 ('you', 1427),
 ('.', 1323),
 ('#', 1292),
 ('I', 1176),
 ('to', 1063),
 ('the', 997),
 (',', 964),
 ('a', 881),
 ('-', 863),
 ('http', 856),
 ('for', 749),
 ('D', 662),
 ('and', 656),
 ('?', 582),
 ('it', 566),
 ('my', 484),
 ('in', 481),
 (';', 449),
 ("'s", 423),
 ('is', 420),
 ('of', 403),
 ('&', 400),
 ('have', 356),
 ('https', 336),
 ('me', 333),
 ('your', 318),
 ('on', 312),
 ('...', 305),
 ('follow', 286),
 ('that', 286),
 ('this', 263),
 ('be', 249),
 ('i', 239),
 ('so', 234),
 ('u', 226),
 ("n't", 224),
 ('with', 221),
 ('like', 209),
 ('Thanks', 209),
 ('day', 199),
 ('all', 197),
 ('do', 192),
 ('are', 188),
 ('love', 185),
 ('we', 182),
 ('thanks', 182),
 ("'m", 182),
 ('amp', 174),
 ('will', 167),
 ('at', 166),
 ('3', 162),
 ('good', 162),
 ('back', 154),
 ('just', 151),
 ('lt', 148),
 ("''", 148),
 ("'ll", 144),
 ('can', 143),
 ('but', 140),
 ('know', 140),
 ('Hi', 140),
 ('p', 139),
 ('get', 139),
 ('great', 138),

In [8]:
neg_count.most_common()

[('(', 7076),
 (':', 5959),
 ('@', 3181),
 ('I', 1986),
 ('.', 1078),
 ('to', 1067),
 ('#', 913),
 ('!', 895),
 ('the', 846),
 (',', 733),
 ('you', 707),
 ('i', 684),
 ('?', 650),
 ('my', 629),
 ('a', 626),
 ("n't", 614),
 ('and', 613),
 ('-', 600),
 ('it', 591),
 ('me', 520),
 ('is', 487),
 ('so', 464),
 ("'s", 449),
 ('in', 420),
 ('for', 391),
 ('http', 381),
 ('but', 378),
 ('...', 361),
 ('of', 352),
 ("'m", 339),
 ('have', 327),
 ('do', 311),
 ('that', 310),
 ('on', 297),
 ('not', 273),
 ('this', 270),
 ('was', 253),
 (';', 250),
 ('be', 241),
 ('no', 229),
 ('&', 216),
 ('miss', 212),
 ('just', 207),
 ('want', 201),
 ('like', 192),
 ('all', 183),
 ('https', 180),
 ('at', 179),
 ('with', 172),
 ('get', 171),
 ('ca', 167),
 ('na', 165),
 ('ME', 165),
 ('u', 164),
 ('are', 164),
 ('too', 161),
 ('up', 160),
 ('now', 146),
 ('one', 137),
 ('we', 136),
 ('time', 136),
 ('go', 131),
 ('PLEASE', 131),
 ('your', 128),
 ('know', 124),
 ('did', 123),
 ('can', 122),
 ('they', 121),
 ('why'

The two most common tokens for positive tweets are ":" and ")" and the two most common tokens for negative tweets are "(" and ":". These are smiley and frowny faces! The basic word tokenizer treats each character as a separate token, which makes sense for most text but not for text from social media.

### A better tokenizer
We're not the first people to see this problem, and NLTK actually has a wide set of tokenizers in the [`nltk.tokenize` module](http://www.nltk.org/api/nltk.tokenize.html). In particular, there's a [tokenizer that's optimized for tweets](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual).

In [9]:
def tokenize_tweets2(tweets):
    """Get all of the tokens in a set of tweets"""
    twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)
    tokens = [token for tweet in tweets for token in twt.tokenize(tweet)]
    return tokens
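To get a feel for why a tweet-aware tokenizer behaves differently, here is a deliberately tiny regex sketch that tries emoticons, handles and hashtags before falling back to words and single punctuation marks. This is not the pattern NLTK uses; the pattern and the name `toy_tweet_tokenize` are assumptions made for illustration.

```python
import re

# Alternatives are tried left to right, so emoticons win over plain punctuation.
TOY_PATTERN = re.compile(
    r"""
    [:;=]-?[()DPp]      # a few western emoticons such as :) :-( ;D :p
    | <3                # heart
    | @\w+              # handles
    | \#\w+             # hashtags
    | \w+(?:'\w+)?      # words, keeping contractions such as I'm together
    | [^\w\s]           # any other single non-space character
    """,
    re.VERBOSE,
)

def toy_tweet_tokenize(text):
    """Split text into tokens with the toy pattern above (hypothetical helper)."""
    return TOY_PATTERN.findall(text)

print(toy_tweet_tokenize("@bob I'm so happy :) #win"))
# ['@bob', "I'm", 'so', 'happy', ':)', '#win']
```

The emoticon stays in one piece because its alternative is listed first; move `[^\w\s]` to the top and ":)" falls apart into ":" and ")" again.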

In [10]:
pos_tokens = tokenize_tweets2(pos_tweets)
neg_tokens = tokenize_tweets2(neg_tweets)
pos_count = Counter(pos_tokens)
neg_count = Counter(neg_tokens)

In [11]:
pos_count.most_common()

[(':)', 3691),
 ('!', 1844),
 ('you', 1341),
 ('.', 1341),
 ('to', 1065),
 ('the', 999),
 (',', 964),
 ('I', 890),
 ('a', 888),
 ('for', 749),
 (':-)', 701),
 ('and', 660),
 (':D', 658),
 ('?', 581),
 (')', 525),
 ('my', 484),
 ('in', 481),
 ('it', 460),
 ('is', 418),
 ('of', 403),
 ('have', 342),
 ('me', 330),
 ('your', 320),
 ('on', 313),
 ('...', 290),
 ('follow', 284),
 ('"', 264),
 ('this', 263),
 ('be', 249),
 (':', 249),
 ('that', 246),
 ('so', 234),
 ('u', 228),
 ('with', 221),
 ('-', 213),
 ('like', 209),
 ('Thanks', 209),
 ('i', 203),
 ('day', 202),
 ('all', 197),
 ('are', 184),
 ('love', 184),
 ('thanks', 182),
 ('&', 174),
 ('will', 168),
 ('at', 167),
 ('good', 162),
 ("I'm", 161),
 ('back', 154),
 ('just', 152),
 ('we', 146),
 ('but', 141),
 ('know', 141),
 ('Hi', 141),
 ('can', 139),
 ('get', 139),
 (':p', 138),
 ('great', 138),
 ('up', 138),
 ('<3', 135),
 ('was', 133),
 ('Thank', 131),
 ('do', 131),
 ('our', 130),
 ('if', 130),
 ('..', 129),
 ('too', 127),
 ('new', 126

In [12]:
neg_count.most_common()

[(':(', 4585),
 ('I', 1587),
 ('(', 1180),
 ('.', 1092),
 ('to', 1068),
 ('the', 846),
 ('!', 831),
 (',', 734),
 ('you', 660),
 ('?', 644),
 ('my', 629),
 ('a', 627),
 ('i', 620),
 ('and', 614),
 ('me', 524),
 (':-(', 501),
 ('so', 466),
 ('is', 456),
 ('it', 449),
 ('in', 421),
 ('for', 391),
 ('but', 384),
 ('of', 352),
 ('...', 332),
 ('have', 306),
 ('on', 297),
 ("I'm", 295),
 ('this', 270),
 ('not', 268),
 ('that', 263),
 ('be', 240),
 ('was', 232),
 ('no', 231),
 ('"', 215),
 ('miss', 212),
 (':', 211),
 ('♛', 210),
 ('》', 210),
 ('just', 207),
 ('want', 201),
 ('like', 193),
 ('all', 183),
 ('at', 179),
 ('with', 172),
 ('get', 171),
 ("can't", 167),
 ('ME', 165),
 ('u', 164),
 ('too', 163),
 ('up', 160),
 ('are', 157),
 ('do', 156),
 ("don't", 154),
 ('now', 148),
 ("it's", 140),
 ('time', 136),
 ('one', 136),
 ('go', 131),
 ('PLEASE', 131),
 ('your', 129),
 ('-', 125),
 ('know', 124),
 ('please', 122),
 ('why', 120),
 ('really', 120),
 ('can', 117),
 ('back', 115),
 ('out', 

Much better! This tokenizer got rid of Twitter handles for us, so no more "@" tokens, and it catches emoticons. However, there are still some questions:

1. Should we count a capitalized word differently from a non-capitalized word? e.g. should "Thanks" be different from "thanks"?
1. Do we want to be counting punctuation?
1. Do we want to count words like "I", "me", etc.?

Using a combination of NLTK and basic Python string tools we can address these concerns.

We can easily take a string and get a lowercase version of it.

In [13]:
"ThIS IS a cRaZy sTRing".lower()

'this is a crazy string'

The `string` module in the Python standard library has a constant containing the ASCII punctuation characters.

In [14]:
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
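Since `string.punctuation` is just a string of single characters, filtering against it only removes tokens that are exactly one punctuation character long. A small sketch with a hand-made token list shows what survives:

```python
import string

# Build a set once so each membership test is a quick hash lookup
punctuation = set(string.punctuation)

# Hand-made tokens (illustrative, not taken from the corpus)
tokens = ["Thanks", "!", ":)", "...", "so", ",", "much"]

# Drop single-character punctuation, lowercase everything else
kept = [tok.lower() for tok in tokens if tok not in punctuation]
print(kept)  # ['thanks', ':)', '...', 'so', 'much']
```

Multi-character tokens such as ":)" and "..." are not in the set, so they pass through. That is what we want for emoticons, and also why "..." can still show up in counts after punctuation filtering.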

NLTK has a collection of "stop words" for many languages, including English. This is one of the corpora we downloaded.

In [15]:
from nltk.corpus import stopwords

stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

We can combine all of these into our tokenizer.

In [16]:
def tokenize_tweets3(tweets):
    """Get all of the tokens in a set of tweets"""
    twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)
    # combine stop words and punctuation
    stop = stopwords.words("english") + list(string.punctuation)
    # filter out stop words and punctuation and send to lower case
    tokens = [token.lower() for tweet in tweets 
              for token in twt.tokenize(tweet) 
              if token.lower() not in stop]
    return tokens

In [17]:
pos_tokens = tokenize_tweets3(pos_tweets)
neg_tokens = tokenize_tweets3(neg_tweets)
pos_count = Counter(pos_tokens)
neg_count = Counter(neg_tokens)

In [18]:
pos_count.most_common()

[(':)', 3691),
 (':-)', 701),
 (':d', 658),
 ('thanks', 392),
 ('follow', 304),
 ('...', 290),
 ('love', 273),
 ('thank', 247),
 ('u', 245),
 ('good', 234),
 ('like', 218),
 ('day', 209),
 ('happy', 191),
 ("i'm", 183),
 ('hi', 173),
 ('great', 172),
 ('get', 168),
 ('see', 167),
 ('back', 162),
 ("it's", 162),
 ('know', 148),
 ('new', 146),
 (':p', 139),
 ('<3', 135),
 ('..', 129),
 ('one', 127),
 ('hope', 123),
 ('us', 115),
 ('time', 112),
 ('today', 112),
 ('friday', 100),
 ('nice', 99),
 ('morning', 98),
 ('please', 96),
 ("you're", 94),
 ("i'll", 91),
 ('much', 89),
 ('via', 85),
 ('would', 84),
 ('go', 82),
 ('well', 81),
 ("don't", 80),
 ('really', 79),
 ('hey', 77),
 ('lot', 77),
 ('yes', 74),
 ('want', 74),
 ('x', 72),
 ('week', 71),
 ('1', 71),
 ('birthday', 71),
 ('weekend', 70),
 ('going', 69),
 ('welcome', 69),
 ('got', 68),
 ('let', 68),
 ('make', 67),
 ("that's", 67),
 ('always', 67),
 ('work', 67),
 ('arrived', 65),
 ('best', 64),
 ('night', 63),
 ('http://t.co/rcvcyyo

In [19]:
neg_count.most_common()

[(':(', 4585),
 (':-(', 501),
 ("i'm", 343),
 ('...', 332),
 ('please', 274),
 ('miss', 238),
 ('want', 218),
 ('♛', 210),
 ('》', 210),
 ('like', 206),
 ('u', 193),
 ('get', 180),
 ("can't", 180),
 ("it's", 178),
 ("don't", 176),
 ('sorry', 149),
 ('one', 144),
 ('follow', 142),
 ('time', 141),
 ('much', 139),
 ('go', 137),
 ('really', 133),
 ('love', 132),
 ('know', 129),
 ('im', 128),
 ('still', 124),
 ('sad', 121),
 ('back', 121),
 ('followed', 110),
 ('see', 108),
 ('..', 108),
 ('today', 108),
 ('got', 102),
 ('good', 99),
 ('feel', 99),
 ('day', 96),
 ('need', 95),
 ('wanna', 94),
 ('oh', 92),
 ('work', 91),
 ('wish', 88),
 ('going', 87),
 ('sleep', 82),
 ("i've", 77),
 ('thanks', 77),
 ('people', 75),
 ('hope', 72),
 ('would', 70),
 ('last', 70),
 ("didn't", 69),
 ('could', 69),
 ('bad', 68),
 ('even', 67),
 ('think', 66),
 ('omg', 63),
 ("that's", 61),
 ('come', 61),
 ('home', 60),
 ('never', 57),
 ('someone', 57),
 ('though', 57),
 ('make', 56),
 ('well', 56),
 ('always', 56),

### Additional processing
How we pre-process text is very important. NLTK provides more tools for pre-processing.

One popular method of pre-processing is **stemming**. The idea here is to find the "root" of each word. 

In [20]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmer.stem("actually")

'actual'

Does this always work how we want?

In [21]:
print(stemmer.stem("please"), stemmer.stem("pleasing"))

pleas pleas
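Stems like "pleas" look odd because a stemmer applies suffix-stripping rules rather than looking words up in a dictionary, so the result does not have to be a real word. The sketch below is a crude illustration of that idea. It is not the Porter algorithm, and the suffix list and the name `toy_stem` are made up for this example.

```python
# Suffixes to strip, tried in order; a made-up list for illustration only
SUFFIXES = ("ing", "ly", "ed", "e", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, if at least `min_stem` characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["actually", "please", "pleasing", "thanks"]:
    print(w, "->", toy_stem(w))
# actually -> actual
# please -> pleas
# pleasing -> pleas
# thanks -> thank
```

The upside is that "please" and "pleasing" end up as the same token; the downside is that the token is no longer readable English.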


Let's update the tokenizer.

In [22]:
def tokenize_tweets4(tweets):
    """Get all of the tokens in a set of tweets"""
    twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)
    # combine stop words and punctuation
    stop = stopwords.words("english") + list(string.punctuation)
    # create the stemmer
    stemmer = PorterStemmer()
    # filter out stop words and punctuation, then stem what is left
    # (note: tokens are not explicitly lowercased before stemming)
    tokens = [stemmer.stem(token) for tweet in tweets
              for token in twt.tokenize(tweet)
              if token.lower() not in stop]
    return tokens
pos_tokens = tokenize_tweets4(pos_tweets)
neg_tokens = tokenize_tweets4(neg_tweets)
pos_count = Counter(pos_tokens)
neg_count = Counter(neg_tokens)

In [23]:
pos_count.most_common()

[(':)', 3691),
 (':-)', 701),
 (':D', 658),
 ('thank', 643),
 ('follow', 443),
 ('love', 398),
 ('...', 290),
 ('day', 245),
 ('good', 238),
 ('like', 232),
 ('u', 228),
 ('get', 209),
 ('happi', 206),
 ('see', 186),
 ("i'm", 183),
 ('great', 172),
 ('back', 163),
 ("it'", 162),
 ('know', 155),
 ('new', 153),
 ('hope', 143),
 ('Hi', 141),
 ('go', 140),
 ('look', 139),
 (':p', 138),
 ('<3', 135),
 ('one', 131),
 ('..', 129),
 ('time', 128),
 ('today', 112),
 ('work', 111),
 ('us', 109),
 ('friday', 104),
 ('make', 100),
 ('pleas', 99),
 ('nice', 99),
 ('want', 98),
 ('morn', 98),
 ("you'r", 94),
 ("i'll", 91),
 ('much', 89),
 ('lot', 87),
 ('via', 85),
 ('would', 84),
 ('week', 83),
 ('let', 83),
 ('well', 81),
 ("don't", 80),
 ('realli', 79),
 ('enjoy', 78),
 ('need', 78),
 ('hey', 77),
 ('ye', 76),
 ('welcom', 73),
 ('birthday', 73),
 ('weekend', 72),
 ('come', 71),
 ('1', 71),
 ('think', 71),
 ('wait', 70),
 ('thing', 69),
 ('night', 68),
 ('got', 68),
 ('keep', 68),
 ('arriv', 67),


In [24]:
neg_count.most_common()

[(':(', 4585),
 (':-(', 501),
 ("i'm", 343),
 ('...', 332),
 ('miss', 301),
 ('pleas', 275),
 ('follow', 263),
 ('want', 246),
 ('get', 233),
 ('like', 223),
 ('go', 218),
 ('♛', 210),
 ('》', 210),
 ("can't", 180),
 ("it'", 178),
 ("don't", 176),
 ('time', 166),
 ('u', 164),
 ('feel', 158),
 ('love', 151),
 ('day', 150),
 ('sorri', 149),
 ('one', 149),
 ('much', 139),
 ('work', 133),
 ('realli', 133),
 ('know', 133),
 ('see', 125),
 ('still', 124),
 ('back', 122),
 ('sad', 121),
 ('..', 108),
 ('today', 108),
 ('thank', 107),
 ('need', 107),
 ('make', 102),
 ('got', 102),
 ('hope', 102),
 ('good', 101),
 ('look', 99),
 ('im', 94),
 ('wanna', 94),
 ('come', 92),
 ('wish', 91),
 ('sleep', 90),
 ("i'v", 77),
 ('watch', 77),
 ('peopl', 75),
 ('think', 75),
 ('last', 73),
 ('would', 70),
 ('even', 70),
 ("didn't", 69),
 ('could', 69),
 ('bad', 68),
 ('tri', 65),
 ('say', 63),
 ('home', 63),
 ('omg', 63),
 ('guy', 62),
 ("that'", 61),
 ('hate', 57),
 ('never', 57),
 ('someon', 57),
 ('though

### Runtime and optimizations
How does the runtime change as we add all of these complications?

In [25]:
small_twt =  pos_tweets[:2000]

In [26]:
%%time
# Base NLTK tokenizer
_ = tokenize_tweets1(small_twt)

CPU times: user 531 ms, sys: 15.6 ms, total: 547 ms
Wall time: 568 ms


In [27]:
%%time
# Twitter optimized tokenizer
_ = tokenize_tweets2(small_twt)

CPU times: user 125 ms, sys: 15.6 ms, total: 141 ms
Wall time: 136 ms


In [28]:
%%time
# Get rid of stop words and lowercase
_ = tokenize_tweets3(small_twt)

CPU times: user 203 ms, sys: 31.2 ms, total: 234 ms
Wall time: 205 ms


In [29]:
%%time
# Also stemming
_ = tokenize_tweets4(small_twt)

CPU times: user 578 ms, sys: 0 ns, total: 578 ms
Wall time: 592 ms


Takeaways:
- The general NLTK word tokenizer works on many problems, but that generality makes it slow
  - A tokenizer tailored to your problem can be faster
- Every extra processing step (stop word filtering, stemming) adds time
  - Sometimes these steps need to be optimized as well
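As one example of optimizing a step: in `tokenize_tweets3` and `tokenize_tweets4` the stop words live in a list, so every `not in stop` test may scan the whole list. Converting the list to a set is a cheap change that could help. The sketch below compares the two lookups with `timeit`, using a made-up stand-in stop list rather than NLTK's; actual timings will vary by machine.

```python
import timeit

# A stand-in stop list; the names here are made up and NLTK's real list differs
stop_list = [f"word{i}" for i in range(200)]
stop_set = set(stop_list)

# A token that is NOT a stop word is the worst case for a list:
# every entry is compared before the answer is known
token = "birthday"

list_time = timeit.timeit(lambda: token in stop_list, number=20_000)
set_time = timeit.timeit(lambda: token in stop_set, number=20_000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both lookups give the same answer, so swapping `stop` for `set(stop)` would not change which tokens are kept.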

This optimization really does matter. Here's a "fast" version of tokenization made for a specific project.

In [33]:
import re

def word_tokenize(words):
    """Faster word tokenization than nltk.word_tokenize
    Input:
        words: a string to be tokenized
    Output:
        tokens: tokenized words
    """
    tokens = re.findall(r"[a-z]+-?[a-z]+", words.lower())
    return tokens

In [40]:
small_twt = " ".join(pos_tweets[:10000])
twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)

In [42]:
%%time
_ = nltk.word_tokenize(small_twt)

CPU times: user 797 ms, sys: 0 ns, total: 797 ms
Wall time: 806 ms


In [43]:
%%time
_ = twt.tokenize(small_twt)

CPU times: user 359 ms, sys: 15.6 ms, total: 375 ms
Wall time: 357 ms


In [44]:
%%time
_ = word_tokenize(small_twt)

CPU times: user 31.2 ms, sys: 0 ns, total: 31.2 ms
Wall time: 34.7 ms


We can see that optimizing our tokenization can really help the speed. But this tokenizer isn't optimized for this problem. For instance, it doesn't pick up emoticons.
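Running the same regex on a single made-up tweet makes its blind spots easy to see. The pattern needs at least two letters per token, so one-letter words vanish along with emoticons, and URLs and contractions get chopped into pieces.

```python
import re

def word_tokenize(words):
    """Same regex as the fast tokenizer defined earlier."""
    return re.findall(r"[a-z]+-?[a-z]+", words.lower())

# A made-up tweet, not taken from the corpus
sample = "I love it :) http://t.co/abc we'll"
result = word_tokenize(sample)
print(result)
# ['love', 'it', 'http', 'co', 'abc', 'we', 'll']
```

"I" and ":)" are gone, the URL turns into "http", "co" and "abc", and "we'll" is split into "we" and "ll". Whether that trade-off is acceptable depends on the project the tokenizer was written for.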

In [47]:
Counter(word_tokenize(small_twt)).most_common()

[('you', 1591),
 ('co', 1196),
 ('the', 1096),
 ('to', 1094),
 ('http', 856),
 ('for', 772),
 ('and', 706),
 ('it', 681),
 ('my', 560),
 ('in', 505),
 ('have', 436),
 ('is', 434),
 ('of', 413),
 ('thanks', 393),
 ('me', 364),
 ('that', 343),
 ('https', 336),
 ('your', 333),
 ('on', 326),
 ('follow', 308),
 ('this', 303),
 ('we', 289),
 ('so', 288),
 ('love', 277),
 ('be', 264),
 ('thank', 248),
 ('can', 241),
 ('good', 236),
 ('with', 228),
 ('all', 223),
 ('like', 220),
 ('day', 212),
 ('just', 198),
 ('happy', 197),
 ('are', 195),
 ('if', 180),
 ('will', 179),
 ('at', 179),
 ('but', 176),
 ('amp', 174),
 ('hi', 174),
 ('great', 173),
 ('no', 171),
 ('get', 170),
 ('see', 167),
 ('back', 162),
 ('do', 152),
 ('new', 148),
 ('know', 148),
 ('lt', 148),
 ('ll', 147),
 ('re', 145),
 ('up', 144),
 ('not', 142),
 ('was', 142),
 ('our', 141),
 ('what', 137),
 ('here', 132),
 ('too', 131),
 ('now', 129),
 ('one', 128),
 ('hope', 123),
 ('an', 120),
 ('out', 117),
 ('today', 117),
 ('us', 116

So we see that NLTK has some pros and cons:
- Pros
  - Easy to use
  - Fast enough for a one-off analysis on small(ish) data
  - Great when (time to code solution) > (time to run NLTK)
- Cons
  - Much slower than optimized solutions 
  - Really feel the crunch on larger corpora or large analyses

### More involved processing
NLTK has many other modules to perform more complicated text processing.

We can get the parts of speech for each word in a sentence:

In [98]:
# `tokens` is a list of word tokens built in a cell that is not shown here
nltk.pos_tag(tokens)

[('followfriday', 'JJ'),
 ('france', 'NN'),
 ('inte', 'NN'),
 ('pkuchly', 'RB'),
 ('milipol', 'JJ'),
 ('paris', 'NN'),
 ('for', 'IN'),
 ('being', 'VBG'),
 ('top', 'JJ'),
 ('engaged', 'VBN'),
 ('members', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('community', 'NN'),
 ('this', 'DT'),
 ('week', 'NN'),
 ('lamb', 'NN'),
 ('ja', 'NN'),
 ('hey', 'NN'),
 ('james', 'VBZ'),
 ('how', 'WRB'),
 ('odd', 'JJ'),
 ('please', 'NN'),
 ('call', 'VB'),
 ('our', 'PRP$'),
 ('contact', 'NN'),
 ('centre', 'NN'),
 ('on', 'IN'),
 ('and', 'CC'),
 ('we', 'PRP'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('able', 'JJ'),
 ('to', 'TO'),
 ('assist', 'VB'),
 ('you', 'PRP'),
 ('many', 'JJ'),
 ('thanks', 'NNS'),
 ('despiteofficial', 'JJ'),
 ('we', 'PRP'),
 ('had', 'VBD'),
 ('listen', 'VBN'),
 ('last', 'JJ'),
 ('night', 'NN'),
 ('as', 'IN'),
 ('you', 'PRP'),
 ('bleed', 'VBP'),
 ('is', 'VBZ'),
 ('an', 'DT'),
 ('amazing', 'JJ'),
 ('track', 'NN'),
 ('when', 'WRB'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('in', 'IN'),
 ('scotland', 'NN'),
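The tagger returns plain `(token, tag)` tuples, so the usual Python tools work on the result. As a small sketch, here are a few pairs copied by hand from the output above, with a tag count and a noun filter on top. The tags follow the Penn Treebank convention, where noun tags start with "NN".

```python
from collections import Counter

# A few (token, tag) pairs copied from the output above
tagged = [
    ("for", "IN"),
    ("being", "VBG"),
    ("top", "JJ"),
    ("engaged", "VBN"),
    ("members", "NNS"),
    ("in", "IN"),
    ("my", "PRP$"),
    ("community", "NN"),
    ("this", "DT"),
    ("week", "NN"),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the nouns (NN, NNS, ...)
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts["IN"], tag_counts["NN"])  # 2 2
print(nouns)  # ['members', 'community', 'week']
```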