## Introduction to Natural Language Processing

This notebook is a brief overview of some common techniques in NLP.  

NLP is the field of using machine learning to extract insights or make predictions with text data.  The main challenge of NLP is converting text data into numerical forms that a machine learning algorithm can work with.  The main part of this introduction covers various techniques for transforming text data for input to machine learning algorithms.  The main library we use is the Natural Language Toolkit, or NLTK, with scikit-learn for the data set and the classification step.

In [1]:
import nltk
import pandas as pd
import re
from sklearn.datasets import fetch_20newsgroups

We will practice on a classic NLP data set, the 20newsgroups data set, which contains text posts from an early internet forum that had 20 topic boards.  The data contains the text of the post and the label of the topic.

In [2]:
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_train.target_names

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [3]:
newsgroups_train.data[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

## Tokenization

The first thing we'll do is split our data up into its components.  The most natural separations are into sentences and into individual words.  This process is called tokenization. 

Note that splitting the data into sentences is not as simple as splitting on periods because of the occurrence of words like "Mrs."  Luckily the nltk package includes tokenizers that can do this for us.
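To see why periods alone are not enough, here is a minimal sketch using only the standard library.  The sample text is made up for illustration and does not come from the data set.

```python
# Naive sentence splitting: break wherever a period is followed by a space.
# The sample text is hypothetical, chosen because it contains an abbreviation.
text = "Mrs. Smith sold her car. It was a Bricklin."

naive_sentences = [part.strip() for part in text.split(". ") if part.strip()]
print(naive_sentences)  # ['Mrs', 'Smith sold her car', 'It was a Bricklin.']
```

The text has two sentences, but the naive split returns three pieces because it treats the period in "Mrs." as a sentence boundary.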

In [4]:
sentences = nltk.sent_tokenize(newsgroups_train.data[0])
# note: this raises a LookupError because the punkt resource has not been downloaded yet; the error is left in as an example
sentences

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/carlostavarez/nltk_data'
    - '/Users/carlostavarez/opt/anaconda3/nltk_data'
    - '/Users/carlostavarez/opt/anaconda3/share/nltk_data'
    - '/Users/carlostavarez/opt/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


In [5]:
nltk.download('punkt')
sentences = nltk.sent_tokenize(newsgroups_train.data[0])
sentences

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/carlostavarez/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day.',
 'It was a 2-door sports car, looked to be from the late 60s/\nearly 70s.',
 'It was called a Bricklin.',
 'The doors were really small.',
 'In addition,\nthe front bumper was separate from the rest of the body.',
 'This is \nall I know.',
 'If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.']

We can also use the package to split on words.  Again, it's sometimes not as simple as splitting on spaces, so it's good to leverage the work already put into the package.  Notice that it also takes out the new lines that were making the text hard to read.

In [6]:
words = []
for sentence in sentences:
    for word in nltk.word_tokenize(sentence):
        words.append(word)
print(words)

['I', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'I', 'saw', 'the', 'other', 'day', '.', 'It', 'was', 'a', '2-door', 'sports', 'car', ',', 'looked', 'to', 'be', 'from', 'the', 'late', '60s/', 'early', '70s', '.', 'It', 'was', 'called', 'a', 'Bricklin', '.', 'The', 'doors', 'were', 'really', 'small', '.', 'In', 'addition', ',', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', '.', 'This', 'is', 'all', 'I', 'know', '.', 'If', 'anyone', 'can', 'tellme', 'a', 'model', 'name', ',', 'engine', 'specs', ',', 'years', 'of', 'production', ',', 'where', 'this', 'car', 'is', 'made', ',', 'history', ',', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', ',', 'please', 'e-mail', '.']
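To see what goes wrong when we split on spaces, here is a small standard-library sketch.  The sentence is adapted from the sample post, and the regular expression is a rough stand-in written for this example; it is not what NLTK does internally.

```python
import re

# Sentence adapted from the sample post above.
text = "It was a 2-door sports car, looked to be from the late 60s."

# Splitting on whitespace leaves punctuation glued to the neighbouring word.
naive_tokens = text.split()

# A rough regex tokenizer: runs of word characters (allowing internal hyphens),
# or any single character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

print(naive_tokens)
print(regex_tokens)
```

The whitespace split produces tokens like `'car,'` and `'60s.'`, while the regex version separates the punctuation and keeps `'2-door'` together.  Real tokenizers also handle contractions, abbreviations and other cases that this pattern ignores.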


## Stop Words, Stemming, Lemmatization

Now that we have the words in our data, we want to reduce some of the noise in the data.  One example of this noise is stop words, which are common words like "the", "of", "and", etc. that do not add any relevant information to our data.  It's common practice to simply remove these words.  There is no universal list of stop words; the right list can depend somewhat on the application.

In [7]:
from nltk.corpus import stopwords
nltk.download('stopwords')
print(stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/carlostavarez/nltk_data...


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data]   Unzipping corpora/stopwords.zip.


In [8]:
stop_words = set(stopwords.words("english"))
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

['I', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'I', 'saw', 'day', '.', 'It', '2-door', 'sports', 'car', ',', 'looked', 'late', '60s/', 'early', '70s', '.', 'It', 'called', 'Bricklin', '.', 'The', 'doors', 'really', 'small', '.', 'In', 'addition', ',', 'front', 'bumper', 'separate', 'rest', 'body', '.', 'This', 'I', 'know', '.', 'If', 'anyone', 'tellme', 'model', 'name', ',', 'engine', 'specs', ',', 'years', 'production', ',', 'car', 'made', ',', 'history', ',', 'whatever', 'info', 'funky', 'looking', 'car', ',', 'please', 'e-mail', '.']
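Notice that 'I', 'It', 'The' and 'This' survived the filter above.  The NLTK list is all lowercase and the membership test is case-sensitive.  Here is a minimal sketch of a case-insensitive filter; the short stop list is hand-written for illustration and is not NLTK's list.

```python
# A small hand-picked stop list for illustration only; NLTK's list is much longer.
toy_stop_words = {"i", "it", "the", "was", "a", "this", "is", "all"}

tokens = ["It", "was", "called", "a", "Bricklin", ".", "This", "is", "all", "I", "know", "."]

# Lowercase each token before the membership test so that "It" matches "it".
filtered = [tok for tok in tokens if tok.lower() not in toy_stop_words]
print(filtered)  # ['called', 'Bricklin', '.', 'know', '.']
```

Whether to lowercase first depends on the application, since capitalization can carry information (proper nouns, for example).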


Stemming and Lemmatization are two methods to transform a word into its root word.  This is another way to reduce noise in the data.  For example, "automobile" and "automobiles" can be treated as the same thing, or "wonder" and "wondering."

Stemming is a crude way to do this by simply chopping off the end of the word.  Lemmatization is a little more sophisticated and works from a dictionary lookup, but may be a little slower.  In general lemmatization is the recommended way.
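To make the difference concrete, here is a toy sketch of both ideas in plain Python.  The suffix list and the lookup table are invented for this example; this is not the Porter algorithm or the WordNet lemmatizer.

```python
def crude_stem(word, suffixes=("ing", "es", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A tiny hand-written lookup table standing in for a real dictionary.
TOY_LEMMAS = {"automobiles": "automobile", "wondering": "wonder", "better": "good"}


def toy_lemmatize(word):
    """Return the dictionary form if we know it, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ["automobiles", "wondering", "better"]:
    print(word, "| stem:", crude_stem(word), "| lemma:", toy_lemmatize(word))
```

The stemmer turns "automobiles" into "automobil", which is not a real word, and it has no way to connect "better" to "good".  The lookup handles both, but only for words that are in its dictionary.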

In [9]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

print(lemmatizer.lemmatize('better'))
print(stemmer.stem('better'))

# sometimes you need to pass a part of speech as well
print(lemmatizer.lemmatize('better', pos='a'))

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/carlostavarez/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


better
better
good


In [10]:
for word in without_stop_words:
    print(word, lemmatizer.lemmatize(word))

I I
wondering wondering
anyone anyone
could could
enlighten enlighten
car car
I I
saw saw
day day
. .
It It
2-door 2-door
sports sport
car car
, ,
looked looked
late late
60s/ 60s/
early early
70s 70
. .
It It
called called
Bricklin Bricklin
. .
The The
doors door
really really
small small
. .
In In
addition addition
, ,
front front
bumper bumper
separate separate
rest rest
body body
. .
This This
I I
know know
. .
If If
anyone anyone
tellme tellme
model model
name name
, ,
engine engine
specs spec
, ,
years year
production production
, ,
car car
made made
, ,
history history
, ,
whatever whatever
info info
funky funky
looking looking
car car
, ,
please please
e-mail e-mail
. .


This didn't really do much.  Let's try passing the part of speech to the lemmatizer as well and see if it does more.  

Part-of-speech tagging is another subfield in NLP, in which we train a model to label words with their part of speech, such as 'noun', 'verb', 'pronoun', etc.  A lot of work goes into making these taggers, so we'll use the one that comes with the library.

In [11]:
nltk.download('averaged_perceptron_tagger')
tagged_words = nltk.pos_tag(without_stop_words)
tagged_words

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/carlostavarez/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('I', 'PRP'),
 ('wondering', 'VBG'),
 ('anyone', 'NN'),
 ('could', 'MD'),
 ('enlighten', 'VB'),
 ('car', 'NN'),
 ('I', 'PRP'),
 ('saw', 'VBD'),
 ('day', 'NN'),
 ('.', '.'),
 ('It', 'PRP'),
 ('2-door', 'JJ'),
 ('sports', 'NNS'),
 ('car', 'NN'),
 (',', ','),
 ('looked', 'VBD'),
 ('late', 'JJ'),
 ('60s/', 'CD'),
 ('early', 'JJ'),
 ('70s', 'CD'),
 ('.', '.'),
 ('It', 'PRP'),
 ('called', 'VBD'),
 ('Bricklin', 'NNP'),
 ('.', '.'),
 ('The', 'DT'),
 ('doors', 'NNS'),
 ('really', 'RB'),
 ('small', 'JJ'),
 ('.', '.'),
 ('In', 'IN'),
 ('addition', 'NN'),
 (',', ','),
 ('front', 'JJ'),
 ('bumper', 'NN'),
 ('separate', 'JJ'),
 ('rest', 'NN'),
 ('body', 'NN'),
 ('.', '.'),
 ('This', 'DT'),
 ('I', 'PRP'),
 ('know', 'VBP'),
 ('.', '.'),
 ('If', 'IN'),
 ('anyone', 'NN'),
 ('tellme', 'NN'),
 ('model', 'NN'),
 ('name', 'NN'),
 (',', ','),
 ('engine', 'NN'),
 ('specs', 'NN'),
 (',', ','),
 ('years', 'NNS'),
 ('production', 'NN'),
 (',', ','),
 ('car', 'NN'),
 ('made', 'VBD'),
 (',', ','),
 ('history', 'N

In [12]:
for word in tagged_words:
    print(word[0], ' | ', lemmatizer.lemmatize(word[0], pos=word[1]))

KeyError: 'PRP'

In [13]:
# for some reason the tagger and lemmatizer use different abbreviations for part of speech, which is really annoying
# here's the relevant piece from the documentation

# { Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v"
# }

def convert_pos(pos):
    if pos.startswith('V'):
        return wordnet.VERB
    elif pos.startswith('J'):
        return wordnet.ADJ
    elif pos.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

for word in tagged_words:
    print(word[0], ' | ', lemmatizer.lemmatize(word[0], pos=convert_pos(word[1])))

I  |  I
wondering  |  wonder
anyone  |  anyone
could  |  could
enlighten  |  enlighten
car  |  car
I  |  I
saw  |  saw
day  |  day
.  |  .
It  |  It
2-door  |  2-door
sports  |  sport
car  |  car
,  |  ,
looked  |  look
late  |  late
60s/  |  60s/
early  |  early
70s  |  70
.  |  .
It  |  It
called  |  call
Bricklin  |  Bricklin
.  |  .
The  |  The
doors  |  door
really  |  really
small  |  small
.  |  .
In  |  In
addition  |  addition
,  |  ,
front  |  front
bumper  |  bumper
separate  |  separate
rest  |  rest
body  |  body
.  |  .
This  |  This
I  |  I
know  |  know
.  |  .
If  |  If
anyone  |  anyone
tellme  |  tellme
model  |  model
name  |  name
,  |  ,
engine  |  engine
specs  |  spec
,  |  ,
years  |  year
production  |  production
,  |  ,
car  |  car
made  |  make
,  |  ,
history  |  history
,  |  ,
whatever  |  whatever
info  |  info
funky  |  funky
looking  |  look
car  |  car
,  |  ,
please  |  please
e-mail  |  e-mail
.  |  .


That still didn't do much on our sentences, but we picked up a couple of the verbs and learned a little about parts of speech.

## Bag of Words and N-Grams

Two other related concepts you'll encounter in NLP are bag of words and N-Grams.  Bag of words is essentially what we've been doing here: you take text data and turn it into a collection of the words that make up the data.  The words are unordered, so you lose a lot of the information present in the text, but in return you have a representation that's easier to work with.
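As a minimal sketch, a bag of words can be represented with a `Counter` from the standard library.  The token list here is made up for illustration.

```python
from collections import Counter

tokens = ["the", "car", "was", "a", "sports", "car"]

# Count how often each word occurs; the order of the tokens is discarded.
bag = Counter(tokens)
print(bag["car"])  # 2

# Shuffling the tokens gives exactly the same bag.
same_bag = Counter(reversed(tokens))
print(bag == same_bag)  # True
```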

N-Grams are a representation of the text that keeps some of the ordering.  A bigram is a pair of adjacent words, and a trigram is a run of three adjacent words.  So in the example sentence "We are learning NLP." the bigrams would be "we are", "are learning" and "learning NLP", and the trigrams would be "we are learning" and "are learning NLP".
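Here is a short sketch of how N-Grams can be generated from a list of tokens.  The helper name `ngrams` is our own for this example.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["we", "are", "learning", "NLP"]
print(ngrams(tokens, 2))  # ['we are', 'are learning', 'learning NLP']
print(ngrams(tokens, 3))  # ['we are learning', 'are learning NLP']
```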

## Classification with TF-IDF

Now we'll try to classify these posts into the topic boards that they came from.  This is called document classification or topic labeling.

A standard technique for this kind of problem is TF-IDF, which stands for term frequency-inverse document frequency.  TF-IDF assigns a weight to each word in a document, and we can then use these weights as inputs to other algorithms.  The weight is calculated from two numbers:
1.  The term frequency of the word, which is simply the number of times that word appears in the document divided by the total number of words in the document.
2.  The inverse document frequency, which scales up the importance of rare words by taking the log of the total number of documents divided by the number of documents that contain the word.

We multiply these two numbers together to get the TF-IDF weight for each word.
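The calculation can be sketched in a few lines of plain Python, following the two definitions above literally.  The three tiny documents are made up for illustration, and the sketch assumes every term we look up appears in at least one document.

```python
import math

# Three toy documents, already tokenized.
docs = [
    ["car", "engine", "car"],
    ["engine", "specs"],
    ["hockey", "game"],
]


def term_frequency(term, doc):
    """Occurrences of the term divided by the number of words in the document."""
    return doc.count(term) / len(doc)


def inverse_document_frequency(term, docs):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)


for term in ["car", "engine"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In the first document "car" gets a higher weight than "engine": it appears twice and occurs in only one of the three documents.  Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed variant of the inverse document frequency and normalizes each row, so its numbers will not match this textbook version exactly.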

In [14]:
# scikit learn has a tfidf vectorizer so we will use that
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(newsgroups_train.data)
print(train_vectors.shape)
print(len(newsgroups_train.data))
train_vectors

(11314, 101631)
11314


<11314x101631 sparse matrix of type '<class 'numpy.float64'>'
	with 1103627 stored elements in Compressed Sparse Row format>

We now have a sparse matrix with a row for each post in our training data.  The matrix has 101,631 columns, one for each word that appears in any of our training documents.  The matrix is sparse because most of the entries are zero: most documents only contain a small subset of these 101,631 words.
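To get a feel for how sparse this is, we can do a little arithmetic with the numbers printed above (the shape and the count of stored elements).

```python
# Numbers copied from the output above.
n_rows, n_cols = 11314, 101631
stored = 1103627

# Fraction of all cells in the matrix that hold a stored value.
density = stored / (n_rows * n_cols)

# Average number of stored entries per row, i.e. distinct terms per post.
avg_terms_per_post = stored / n_rows

print(f"{density:.4%} of the entries are stored")
print(f"about {avg_terms_per_post:.1f} distinct terms per post on average")
```

Just under 0.1% of the cells are stored, which is why a sparse format is so much cheaper here than a dense array.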

In [15]:
pd.DataFrame(train_vectors[0:5].todense())

,0,1,2,3,4,5,6,7,8,9,...,101621,101622,101623,101624,101625,101626,101627,101628,101629,101630
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


If we look at the feature names the vectorizer extracted, there's a lot of weird stuff in there, such as long runs of digits and misspelled words.  We won't worry about that now.

In [16]:
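# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
vectorizer.get_feature_names()[0:10]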

['00',
 '000',
 '0000',
 '00000',
 '000000',
 '00000000',
 '0000000004',
 '00000000b',
 '00000001',
 '00000001b']

In [17]:
# on scikit-learn 1.2 or later, use get_feature_names_out() here instead
vectorizer.get_feature_names()[50000:50010]

['ingests',
 'ingezet',
 'ingilizce',
 'ingles',
 'ingleside',
 'inglis',
 'ingolf',
 'ingoring',
 'ingr',
 'ingrained']

We can now feed these train vectors into any classifier we want.  A common one for NLP is Naive Bayes.

In [18]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(newsgroups_train.data)

clf = MultinomialNB(alpha=.01)
clf.fit(train_vectors, newsgroups_train.target)

newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
test_vectors = vectorizer.transform(newsgroups_test.data)

pred = clf.predict(test_vectors)
metrics.accuracy_score(newsgroups_test.target, pred)

0.7002124269782263
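The vectorize, fit and predict steps in the cell above can also be bundled into a single scikit-learn pipeline, which is one way to keep experiments tidy.  Here is a minimal sketch on a toy corpus; the documents and labels are made up for illustration and are not the 20newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for this example.
train_docs = [
    "the engine and the car doors",
    "car bumper and engine specs",
    "the hockey game went to overtime",
    "goalie saved the puck in the game",
]
train_labels = ["autos", "autos", "hockey", "hockey"]

# The pipeline applies the vectorizer first, then passes the vectors to the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.01))
model.fit(train_docs, train_labels)

predictions = model.predict(["new car engine", "hockey puck"])
print(list(predictions))
```

Because the pipeline owns the vectorizer, `predict` transforms new text with the vocabulary learned during `fit`, which mirrors the separate `fit_transform` and `transform` calls above.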

We can use some of the things we talked about in the first part of the lesson to see if we can improve on our accuracy.  For example, what happens if we use bigrams alongside single words?

In [19]:
vectorizer = TfidfVectorizer(ngram_range=(1,2)) # ngram_range lets us include bigrams as well
train_vectors = vectorizer.fit_transform(newsgroups_train.data)
print(train_vectors.shape) # note that the number of columns is much greater

clf = MultinomialNB(alpha=.01)
clf.fit(train_vectors, newsgroups_train.target)

newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
test_vectors = vectorizer.transform(newsgroups_test.data)

pred = clf.predict(test_vectors)
metrics.accuracy_score(newsgroups_test.target, pred)

(11314, 948675)


0.6829527349973447

Our accuracy went down, from about 0.700 to 0.683.  It doesn't always work out how we want.

We can also try some of our techniques from earlier in the lesson, first removing stop words and then adding lemmatization.

In [20]:
vectorizer = TfidfVectorizer(stop_words = 'english') # removing stop words
train_vectors = vectorizer.fit_transform(newsgroups_train.data)
print(train_vectors.shape)

clf = MultinomialNB(alpha=.01)
clf.fit(train_vectors, newsgroups_train.target)

newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
test_vectors = vectorizer.transform(newsgroups_test.data)

pred = clf.predict(test_vectors)
metrics.accuracy_score(newsgroups_test.target, pred)

(11314, 101322)


0.7010090281465746

In [21]:
lemmatizer = WordNetLemmatizer()

def convert_pos(pos):
    if pos.startswith('V'):
        return wordnet.VERB
    elif pos.startswith('J'):
        return wordnet.ADJ
    elif pos.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

stop_words = set(stopwords.words("english"))

def lemma_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    pos_words = nltk.pos_tag([word for word in words if word not in stop_words])
    words = [lemmatizer.lemmatize(word[0], pos=convert_pos(word[1])) for word in pos_words]
    return words

# the tokenizer as written is slow: with train_data_size set to the full training set (as below)
# this cell takes about 5 minutes to run
# for a quicker experiment, set train_data_size to a smaller number
# this gives an example of what's possible; we'd have to come back and refactor the code to make it faster

train_data_size = len(newsgroups_train.data)

vectorizer = TfidfVectorizer(tokenizer=lemma_tokenizer) # removing stop words and lemmatizing
train_vectors = vectorizer.fit_transform(newsgroups_train.data[0:train_data_size])
print(train_vectors.shape)

clf = MultinomialNB(alpha=.01)
clf.fit(train_vectors, newsgroups_train.target[0:train_data_size])

newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
test_vectors = vectorizer.transform(newsgroups_test.data)

pred = clf.predict(test_vectors)
metrics.accuracy_score(newsgroups_test.target, pred)

(11314, 104746)


0.7040626659585767

Running the lemmatizing tokenizer on the whole data set gives an accuracy of 0.704, about the same as the 0.701 we got from removing stop words alone.  The gain is small, but the cell shows how to combine and experiment with these different pieces.

This was a very surface-level overview of NLP, but hopefully it gives you a sense of what is possible and what to look into further!