## Basic Natural Language Processing Tasks

In this notebook we'll run through some of the basic (but very important) tasks carried out during NLP. To produce a successful model, it is critical to put care and thought into the way you pre-process the text data.

We are going to use a [dataset of jokes](https://github.com/taivop/joke-dataset) scraped from the website [stupidstuff.org](http://stupidstuff.org). The data should be flexible enough for us to try lots of different NLP techniques and achieve some reasonable results. 

We'll be using [NLTK (the Natural Language Toolkit)](https://www.nltk.org/), an easy-to-use Python library for working with text data and NLP.

### Import the data and inspect it

Our data is stored in a JSON file. We'll need to open it and convert it to a Python list of dictionaries (one per joke) before we can work with it. Let's do that and then inspect the first entry!

In [66]:
import json
with open('stupidstuff.json') as json_file:
    data = json.load(json_file)
data[0]

{'body': 'A blackjack dealer and a player with a thirteen count in his hand\nwere arguing about whether or not it was appropriate to tip the\ndealer.\n\nThe player said, "When I get bad cards, it\'s not the dealer\'s fault.\nAccordingly, when I get good cards, the dealer obviously had nothing\nto do with it so, why should I tip him?"\n\nThe dealer said, "When you eat out do you tip the waiter?"\n\n"Yes."\n\n"Well then, he serves you food, I\'m serving you cards, so you should\ntip me."\n\n"Okay, but, the waiter gives me what I ask for. I\'ll take an eight."',
 'category': 'Children',
 'id': 1,
 'rating': 2.63}

For now, we are only interested in the text stored in the `body` of each entry.

We define a variable called `jokes` which stores all of the individual jokes in a list. We'll come back to this later when we want to work with the whole dataset at once.

In [67]:
jokes = [entry["body"] for entry in data]
jokes[0]

'A blackjack dealer and a player with a thirteen count in his hand\nwere arguing about whether or not it was appropriate to tip the\ndealer.\n\nThe player said, "When I get bad cards, it\'s not the dealer\'s fault.\nAccordingly, when I get good cards, the dealer obviously had nothing\nto do with it so, why should I tip him?"\n\nThe dealer said, "When you eat out do you tip the waiter?"\n\n"Yes."\n\n"Well then, he serves you food, I\'m serving you cards, so you should\ntip me."\n\n"Okay, but, the waiter gives me what I ask for. I\'ll take an eight."'
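To make the structure concrete, here is a minimal sketch that parses a small inline JSON string shaped like the file above. The three records are made up for illustration, and the use of `dict.get` is an assumption about how you might guard against entries with no `body` key; the real file may never need it.

```python
import json

# Made-up records shaped like the entries in the dataset (illustrative only)
raw = '''[
  {"id": 1, "category": "Sample", "rating": 2.63, "body": "First joke."},
  {"id": 2, "category": "Sample", "rating": 1.0, "body": ""},
  {"id": 3, "category": "Sample", "rating": 3.5}
]'''

records = json.loads(raw)  # a JSON array becomes a Python list of dicts

# dict.get returns a default instead of raising KeyError when "body" is absent
bodies = [record.get("body", "") for record in records]
print(bodies)
```

Indexing with `record["body"]`, as in the cell above, fails loudly on a missing key, which can be preferable when you want to find out about malformed entries early.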

### Noise Removal

Let's just inspect the length of our data and check for empty values. In reality, we'd probably want to do a lot more data validation than this.

In [69]:
print(len(jokes))
jokes.count('')

3773


573

**Oh No!** It appears there are 573 jokes that are completely empty. Those entries won't be any use to us so let's remove them from our dataset.

In [70]:
jokes_clean = [joke for joke in jokes if joke != '']
len(jokes_clean)

3200

Ok, that looks better. We've now got 3200 jokes to work with. That should be enough! Let's pick an individual joke to work with for now...

In [71]:
joke = jokes_clean[2353]
joke

'A budding actor: "Dad guess what? I\'ve got my first part in a , I play the part of a man who has been maried for 25 years."\nFather: "That\'s a good start son, just keep at it and one of these days you\'ll get a speaking part."'

Let's strip out the punctuation. We're only interested in the words for now. Let's also make the whole thing lowercase; otherwise the words `Hello` and `hello` might be treated as separate words.

In [72]:
import string
# Build a table mapping every punctuation character to a space, then apply it
clean = joke.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation)))
clean = clean.lower()
clean

'a budding actor   dad guess what  i ve got my first part in a   i play the part of a man who has been maried for 25 years  \nfather   that s a good start son  just keep at it and one of these days you ll get a speaking part  '
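Notice that replacing the apostrophe with a space splits contractions, so `I've` becomes `i ve`. The sketch below wraps the same `str.maketrans` / `str.translate` approach in a helper with an optional `keep` argument; the helper name `strip_punctuation` and the `keep` parameter are hypothetical, introduced only to show one way of leaving selected characters alone.

```python
import string

def strip_punctuation(text, keep=""):
    """Replace each punctuation character with a space, except those in `keep`."""
    to_replace = "".join(ch for ch in string.punctuation if ch not in keep)
    table = str.maketrans(to_replace, " " * len(to_replace))
    return text.translate(table).lower()

sample = "Dad, guess what? I've got my first part!"
print(strip_punctuation(sample).split())            # apostrophe becomes a space
print(strip_punctuation(sample, keep="'").split())  # contraction stays intact
```

Whether to keep apostrophes depends on the downstream steps: the NLTK stop-word list shown later contains both fragments such as `ll` and full forms such as `don't`, so either choice can work.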

### Tokenization

Now let's tokenize the joke to turn it into a list of words.

In [73]:
import nltk
clean_tokens = nltk.word_tokenize(clean)
clean_tokens[:20]

['a',
 'budding',
 'actor',
 'dad',
 'guess',
 'what',
 'i',
 've',
 'got',
 'my',
 'first',
 'part',
 'in',
 'a',
 'i',
 'play',
 'the',
 'part',
 'of',
 'a']

### Stop-word Removal

Looks like there are plenty of words we don't need in there. Words like `a`, `of` and `the` will show up frequently in any model we make but aren't very relevant to the content of the joke. Let's do some stop-word removal.

In [74]:
# Requires the stopwords corpus: run nltk.download('stopwords') once if it is missing
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

The **NLTK Corpus** contains a list of stopwords you can use out of the box. You can see them printed above. Depending on what type of NLP you are trying to do, it might be better to design your own list of stop-words so that you don't lose any important information when removing them.

Let's go ahead and remove those stopwords from our joke...

In [75]:
joke_no_stops = [word for word in clean_tokens if word not in stop_words]
joke_no_stops[:20]

['budding',
 'actor',
 'dad',
 'guess',
 'got',
 'first',
 'part',
 'play',
 'part',
 'man',
 'maried',
 '25',
 'years',
 'father',
 'good',
 'start',
 'son',
 'keep',
 'one',
 'days']
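The earlier remark about designing your own stop-word list can be sketched as follows. The small `base_stop_words` set is a stand-in for the NLTK list (every word in it appears in the output printed above), and the decision to keep negations is a hypothetical choice made for illustration, not a recommendation for every task.

```python
# A small stand-in for the NLTK list, so the sketch runs without any downloads
base_stop_words = {"a", "i", "in", "is", "it", "my", "no", "not", "of"}

# Hypothetical choice: keep negations, since they can flip the meaning of a sentence
keep = {"no", "not"}
custom_stop_words = base_stop_words - keep

tokens = ["it", "is", "not", "my", "fault"]
filtered = [t for t in tokens if t not in custom_stop_words]
print(filtered)  # ['not', 'fault']
```

With the real list you would start from `set(stopwords.words('english'))` in place of `base_stop_words` and subtract or add words in the same way.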

### Stemming

Now that we have removed the stopwords, let's break each word down to its stem using the [Porter Stemmer](https://www.nltk.org/howto/stem.html).

In [76]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in joke_no_stops]
stemmed

['bud',
 'actor',
 'dad',
 'guess',
 'got',
 'first',
 'part',
 'play',
 'part',
 'man',
 'mari',
 '25',
 'year',
 'father',
 'good',
 'start',
 'son',
 'keep',
 'one',
 'day',
 'get',
 'speak',
 'part']

### Part of Speech Tagging

Time to tag each word with its part of speech. We need this to get decent results when we lemmatize our joke.

In [77]:
pos_tagged = nltk.pos_tag(joke_no_stops)
pos_tagged


[('budding', 'VBG'),
 ('actor', 'NN'),
 ('dad', 'NN'),
 ('guess', 'NN'),
 ('got', 'VBD'),
 ('first', 'JJ'),
 ('part', 'NN'),
 ('play', 'VB'),
 ('part', 'NN'),
 ('man', 'NN'),
 ('maried', 'VBD'),
 ('25', 'CD'),
 ('years', 'NNS'),
 ('father', 'RB'),
 ('good', 'JJ'),
 ('start', 'NN'),
 ('son', 'NN'),
 ('keep', 'VB'),
 ('one', 'CD'),
 ('days', 'NNS'),
 ('get', 'VBP'),
 ('speaking', 'JJ'),
 ('part', 'NN')]

Now that we have the part of speech for each word in our joke, we can feed these through a lemmatizer to get the lemma for each word. You'll notice that the part-of-speech tags returned by **NLTK** are different from those used by [WordNet](https://wordnet.princeton.edu/), so we've written a handy function `get_wordnet_pos` that converts one type to the other.

In [78]:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Convert a tag returned by nltk.pos_tag into a WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Anything else (e.g. 'CD' for numbers) falls back to noun
        return wordnet.NOUN

pos_tagged_wn = [(word, get_wordnet_pos(tag)) for word, tag in pos_tagged]
pos_tagged_wn

[('budding', 'v'),
 ('actor', 'n'),
 ('dad', 'n'),
 ('guess', 'n'),
 ('got', 'v'),
 ('first', 'a'),
 ('part', 'n'),
 ('play', 'v'),
 ('part', 'n'),
 ('man', 'n'),
 ('maried', 'v'),
 ('25', 'n'),
 ('years', 'n'),
 ('father', 'r'),
 ('good', 'a'),
 ('start', 'n'),
 ('son', 'n'),
 ('keep', 'v'),
 ('one', 'n'),
 ('days', 'n'),
 ('get', 'v'),
 ('speaking', 'a'),
 ('part', 'n')]
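The same conversion can also be expressed as a dictionary lookup on the first letter of the tag. This is only a sketch: the WordNet constants are written out as plain strings (`'a'`, `'v'`, `'n'`, `'r'`, matching the output above) so that it runs without NLTK, and the names `TREEBANK_TO_WORDNET` and `to_wordnet_pos` are hypothetical.

```python
# WordNet's single-letter POS codes as plain strings, keyed by the first
# letter of the tag produced by nltk.pos_tag
TREEBANK_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag, default="n"):
    """Map a POS tag to a WordNet POS letter via its first character."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1], default)

tagged = [("budding", "VBG"), ("first", "JJ"), ("25", "CD"), ("father", "RB")]
print([(word, to_wordnet_pos(tag)) for word, tag in tagged])
```

A lookup table makes the fallback explicit in one place, while the `if`/`elif` chain in `get_wordnet_pos` is arguably easier to read for a handful of cases.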

### Lemmatization

Now we have our joke in the correct format to feed through the `WordNetLemmatizer`. Let's do that...

In [79]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatized = [lemmatizer.lemmatize(word, pos=pos) for word, pos in pos_tagged_wn]
lemmatized

['bud',
 'actor',
 'dad',
 'guess',
 'get',
 'first',
 'part',
 'play',
 'part',
 'man',
 'maried',
 '25',
 'year',
 'father',
 'good',
 'start',
 'son',
 'keep',
 'one',
 'day',
 'get',
 'speaking',
 'part']

**Awesome!** We've just learnt how to clean our text, break it into tokens, remove stop words, tag it with the parts of speech and then lemmatize (or stem) each token. Our data is now in a really useful format for building models.

### N-Grams

To inspect the n-gram tokens within our jokes, it'll be more useful to use the whole dataset than a single joke. Let's quickly merge all of our jokes into one long string so that we can tokenize it. We'll also remove the punctuation and make it lowercase like we did before.

In [80]:
# Join every joke into one string, separated by single spaces
all_jokes = " ".join(jokes_clean)
clean_jokes = all_jokes.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation)))
clean_jokes = clean_jokes.lower()
clean_jokes



Time to tokenize the whole corpus and count the total number of tokens and the number of unique tokens.

In [82]:
all_joke_tokens = nltk.word_tokenize(clean_jokes)
print(len(all_joke_tokens))
print(len(set(all_joke_tokens)))
all_joke_tokens[:50]

412931
18735


['a',
 'blackjack',
 'dealer',
 'and',
 'a',
 'player',
 'with',
 'a',
 'thirteen',
 'count',
 'in',
 'his',
 'hand',
 'were',
 'arguing',
 'about',
 'whether',
 'or',
 'not',
 'it',
 'was',
 'appropriate',
 'to',
 'tip',
 'the',
 'dealer',
 'the',
 'player',
 'said',
 'when',
 'i',
 'get',
 'bad',
 'cards',
 'it',
 's',
 'not',
 'the',
 'dealer',
 's',
 'fault',
 'accordingly',
 'when',
 'i',
 'get',
 'good',
 'cards',
 'the',
 'dealer',
 'obviously']

### Bi-Grams

Bi-grams are n-grams where n=2. They are essentially word pairs. Let's use NLTK to convert our data into all of the bi-grams.

In [83]:
bigrams = list(nltk.bigrams(all_joke_tokens))
bigrams[:5]

[('a', 'blackjack'),
 ('blackjack', 'dealer'),
 ('dealer', 'and'),
 ('and', 'a'),
 ('a', 'player')]
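For intuition, the pairs above can be reproduced without NLTK by zipping the token list against shifted copies of itself. The helper `simple_ngrams` below is a hypothetical name for this sketch and is not part of NLTK.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so no partial n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["a", "blackjack", "dealer", "and", "a", "player"]
print(simple_ngrams(tokens, 2))
print(simple_ngrams(tokens, 3)[:2])
```

A list of `k` tokens yields `k - n + 1` n-grams, which is why the bi-gram list is always one element shorter than the token list.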

We can then use this data to inspect the most common bi-grams.

In [84]:
import collections
# Alias the Counter class so we can reuse it for other n-grams later
c = collections.Counter
c(bigrams).most_common(10)

[(('to', 'the'), 1544),
 (('in', 'the'), 1508),
 (('of', 'the'), 1308),
 (('on', 'the'), 1021),
 (('don', 't'), 830),
 (('i', 'm'), 829),
 (('the', 'man'), 796),
 (('and', 'the'), 733),
 (('at', 'the'), 678),
 (('in', 'a'), 638)]

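Oh dear! These bi-grams don't look very useful: almost all of them are combinations of short, common words that carry very little meaning. Pairs such as `('don', 't')` and `('i', 'm')` are contractions that were split when we replaced apostrophes with spaces. Perhaps we should have removed the stop-words earlier?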
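One way to act on that idea is to discard any bi-gram that contains a stop-word before counting. The sketch below uses made-up tokens and a tiny stand-in stop list (`toy_tokens` and `toy_stop_words` are hypothetical names) so that it runs on its own.

```python
from collections import Counter

# Toy tokens and a tiny stand-in stop list (both made up for illustration)
toy_tokens = "the man walked to the bar and the man walked in the rain".split()
toy_stop_words = {"the", "to", "and", "in"}

all_bigrams = list(zip(toy_tokens, toy_tokens[1:]))

# Keep a pair only if neither of its words is a stop-word
content_bigrams = [bg for bg in all_bigrams if not (set(bg) & toy_stop_words)]
print(Counter(content_bigrams).most_common(1))  # [(('man', 'walked'), 2)]
```

Filtering the pairs, rather than removing stop-words from the token list first, avoids creating bi-grams from words that were never adjacent in the original text.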

Let's have a look at a longer n-gram and see if we get better results.

In [87]:
n_grams = list(nltk.ngrams(all_joke_tokens, 10))
c(n_grams).most_common(10)

[(('does', 'it', 'take', 'to', 'screw', 'in', 'a', 'light', 'bulb', 'a'), 52),
 (('it', 'take', 'to', 'screw', 'in', 'a', 'light', 'bulb', 'a', 'none'), 14),
 (('a', 'pick', 'it', 'up', 'and', 'shake', 'it', 'q', 'how', 'do'), 11),
 (('pick', 'it', 'up', 'and', 'shake', 'it', 'q', 'how', 'do', 'i'), 11),
 (('a', 'lone', 'nut', 'by', 'the', 'name', 'of', 'lee', 'on', 'november'),
  11),
 (('lone', 'nut', 'by', 'the', 'name', 'of', 'lee', 'on', 'november', '22nd'),
  11),
 (('nut', 'by', 'the', 'name', 'of', 'lee', 'on', 'november', '22nd', 'who'),
  11),
 (('by',
   'the',
   'name',
   'of',
   'lee',
   'on',
   'november',
   '22nd',
   'who',
   'killed'),
  11),
 (('the',
   'name',
   'of',
   'lee',
   'on',
   'november',
   '22nd',
   'who',
   'killed',
   'kennedy'),
  11),
 (('2', 'mafia', 'thugs', 'and', 'a', 'lone', 'nut', 'by', 'the', 'name'), 11)]

Looks like there are a lot of light bulb jokes in our dataset! Come back for the next session to find out if that's true when we classify our jokes into categories.
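In the meantime, a rough check is to count how many jokes contain the phrase at all, since a frequent n-gram could in principle come from a few very repetitive jokes. This sketch uses made-up stand-ins for the entries of `jokes_clean`; the simple substring match is an assumption and would miss spellings such as "lightbulb".

```python
# Made-up stand-ins for entries of jokes_clean
sample_jokes = [
    "Q: How many programmers does it take to screw in a light bulb?",
    "A man walks into a bar.",
    "How many editors does it take to change a Light Bulb?",
]

phrase = "light bulb"
# Lowercase each joke so the match is case-insensitive
matches = [j for j in sample_jokes if phrase in j.lower()]
print(f"{len(matches)} of {len(sample_jokes)} jokes mention '{phrase}'")
```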