# Introduction to NLP
## In This Lesson
* Tokenization
* Vectorization
* Topic Modelling
* Predictive Modelling

# What is NLP?

Natural Language Processing (NLP) is the field of study concerned with how computers process free-text data. It is a continuously evolving field, and some of the most cutting-edge machine learning and artificial intelligence programs are NLP focused. In this lesson, I will cover the basics of simple NLP.

# Data

Before we get started, let's download a dataset. For this lesson, I will be using The Complete Works of William Shakespeare from [Project Gutenberg](https://www.gutenberg.org/ebooks/100).


In [22]:

# Import requests to download the data
import requests

# Download shakespeare
response = requests.get("https://www.gutenberg.org/files/100/100-0.txt")
raw_text = response.text

# Trim first three characters (formatting)
trim_text = raw_text[3:]

# Split up into individual lines
# In this dataset, individual lines are split by two new line characters
lines = trim_text.replace("\r", "").split("\n\n")

# Print the first few lines to check this has worked
print(lines[:5])


['The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare', 'This eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org. If you are not located in the United States, you\nwill have to check the laws of the country where you are located before\nusing this eBook.', 'Title: The Complete Works of William Shakespeare', 'Author: William Shakespeare', 'Release Date: January 1994 [eBook #100]\n[Most recently updated: April 25, 2021]']
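
The three trimmed characters are most likely a UTF-8 byte order mark (BOM) that has been decoded with the wrong encoding. The lesson only calls them "formatting", so treat this as an assumption. The sketch below uses a hand-made byte string rather than the real download to show the effect, and why curly apostrophes can come out garbled in the same situation.

```python
# Hand-made bytes: a UTF-8 BOM followed by "I’ll" (with a curly apostrophe)
raw_bytes = b"\xef\xbb\xbfI\xe2\x80\x99ll"

# Decoding UTF-8 bytes as Latin-1 turns every byte into its own character,
# so the BOM becomes three stray characters and the apostrophe becomes three more
wrong = raw_bytes.decode("latin-1")

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if there is one
right = raw_bytes.decode("utf-8-sig")

print(repr(wrong))
print(repr(right))
```

If this is what is happening, setting `response.encoding = "utf-8-sig"` before reading `response.text` would be one way to avoid the manual trim. That is an untested suggestion rather than part of the lesson's code.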


# Tokenization
In order to 'read' text, the first thing a computer needs to know is where each word starts and ends.

A string is just a sequence of characters; there is nothing inherently special about spaces or punctuation.

Let's take a look at the first line from our dataset and try to split it up into words.

In [23]:

first_line = lines[0]
print(first_line)


The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare



## Exercise
Write a function to split up the first line into words.


## Solution

In [24]:

def split_into_words(line):
    return line.split(" ")

print(split_into_words(first_line))


['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Complete', 'Works', 'of', 'William', 'Shakespeare,', 'by', 'William', 'Shakespeare']


Whilst this solution does work, it can be improved. We don't really want punctuation characters to appear as part of words, e.g. `'Shakespeare,'` should be `'Shakespeare'`.

We can also improve this by enforcing lower case, so it does not matter if a word is at the start of a sentence or not.


In [70]:

# Import the regular expression library to pick out punctuation characters
import re

def split_into_words(line):
    # Remove punctuation (anything that is not a word character or whitespace)
    line_no_punct = re.sub(r"[^\w\s]", "", line)
    # Collapse runs of whitespace into a single space
    line_no_long_space = re.sub(r"\s+", " ", line_no_punct)
    # Convert to lower case
    line_lower = line_no_long_space.lower()
    return line_lower.split(" ")

print(split_into_words(first_line))


['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'complete', 'works', 'of', 'william', 'shakespeare', 'by', 'william', 'shakespeare']


Here I have simply removed all punctuation. While punctuation does carry meaning, simpler NLP solutions typically ignore it.
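
One side effect of `split(" ")` is worth knowing about: if a line starts or ends with whitespace, the result contains an empty string as a token. Below is a minimal sketch of a variant that avoids this by calling `split()` with no argument. The name `split_into_words_strict` is made up for illustration, and the rest of the lesson keeps using the original function.

```python
import re

def split_into_words_strict(line):
    # Remove punctuation and convert to lower case, as in split_into_words
    line_no_punct = re.sub(r"[^\w\s]", "", line)
    line_lower = line_no_punct.lower()
    # split() with no argument splits on any run of whitespace
    # and never returns empty strings
    return line_lower.split()

example = "\nTitle: War and Peace"
print(split_into_words_strict(example))
```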

Now that we have a tokenization function, we can use it to tokenize all of our lines, giving us a list of lists of tokens:

In [71]:

# Tokenize all lines
tokenized_lines = [split_into_words(line) for line in lines]

# Print the first few lines to check this has worked
print(tokenized_lines[:5])


[['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'complete', 'works', 'of', 'william', 'shakespeare', 'by', 'william', 'shakespeare'], ['this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 'reuse', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'wwwgutenbergorg', 'if', 'you', 'are', 'not', 'located', 'in', 'the', 'united', 'states', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'ebook'], ['title', 'the', 'complete', 'works', 'of', 'william', 'shakespeare'], ['author', 'william', 'shakespeare'], ['release', 'date', 'january', '1994', 'ebook', '100', 'most', 'rec


# Vectorization

Even though tokenization has broken down our lines into groups of words, these are still not ideal for us to process in a machine learning algorithm. Computers like numbers. Vectorization is the conversion of these words to numbers, such that each unique word can be identified by its own number.
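
To make the idea concrete, here is a hand-rolled sketch of what vectorizing a single line involves, using only the standard library. It is an illustration, not how scikit-learn is implemented; the alphabetical numbering is chosen to mirror the ordering of `CountVectorizer`'s vocabulary.

```python
from collections import Counter

# Tokens of the first line, as produced by split_into_words
tokens = ["the", "project", "gutenberg", "ebook", "of", "the", "complete",
          "works", "of", "william", "shakespeare", "by", "william", "shakespeare"]

# Give every unique word its own number (alphabetical order)
vocabulary = {word: index for index, word in enumerate(sorted(set(tokens)))}

# Count how often each word occurs, then lay the counts out by word number
counts = Counter(tokens)
vector = [0] * len(vocabulary)
for word, count in counts.items():
    vector[vocabulary[word]] = count

print(vocabulary)
print(vector)
```

With a real corpus the vocabulary covers every word in every line, so each line's vector is mostly zeros. That is why a sparse matrix is the natural way to store the result.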

Rather than program this ourselves, let's take advantage of [Scikit-learn](https://scikit-learn.org/), which has [an algorithm for this very purpose](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).


In [72]:

# Import count vectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate our vectorizer (it works just like any other sklearn class)
# We need to provide our tokenizer function as a parameter
vectorizer = CountVectorizer(tokenizer=split_into_words)

# Fit vectorizer
vectorizer.fit(lines)

# Vectorize (transform) our data
vectorized_lines = vectorizer.transform(lines)

# Print the first line, to check it has worked
print(vectorized_lines[0])


  (0, 4843)	1
  (0, 6513)	1
  (0, 10044)	1
  (0, 14023)	1
  (0, 20945)	2
  (0, 23800)	1
  (0, 26939)	2
  (0, 30289)	2
  (0, 34190)	2
  (0, 34514)	1


The data type is now a sparse matrix. Each printed entry is a `(row, column)` position followed by a count: we can check that the number of entries matches the number of unique words in the line, and that the entries with the value `2` match the words that appear twice.

Or even better, map them back to the original words!

In [90]:

# Create a word lookup dictionary.
# The vocabulary_ dictionary from the vectorizer is the wrong way round
word_lookup = {value: key for key, value in vectorizer.vocabulary_.items()}

# Loop over all entries for the first document and print details
for key, frequency in zip(vectorized_lines[0].indices, vectorized_lines[0].data):
    print("{}: {}".format(word_lookup[key], frequency))


by: 1
complete: 1
ebook: 1
gutenberg: 1
of: 2
project: 1
shakespeare: 2
the: 2
william: 2
works: 1


# Topic Modelling

Now we have our vectorized matrix, we can start to use machine learning models.

Let's start with topic modelling.

Topic modelling is essentially running a clustering algorithm over your vectorized dataset to look for natural groupings.

We are going to use [Latent Dirichlet Allocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) (LDA), a dimensionality reduction algorithm typically used for topic modelling. It assigns each document a score for every dimension (topic); we can then use the highest-scoring dimensions to cluster documents.


In [92]:

# Import LDA class
from sklearn.decomposition import LatentDirichletAllocation

# Create topic model class (10 topics)
topic_model = LatentDirichletAllocation(n_components=10)

# Fit topic model
topic_model.fit(vectorized_lines)

# Calculate topic scores
line_topic_scores = topic_model.transform(vectorized_lines)

# Print the first line, to check it has worked
print(line_topic_scores[0])


[0.00666706 0.00666759 0.62162255 0.00666785 0.00666759 0.00666719
 0.0066668  0.00666729 0.00666841 0.32503766]


As you can see, there are ten numbers, which correspond to the ten topics. The simplest way to assign a document to a topic is to pick the topic with the highest score.
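
The scores in each row are proportions: they add up to one, which you can confirm from the output above. The sketch below checks this on a tiny made-up count matrix. `random_state` is set only so that repeated runs give the same numbers, which the lesson's own code does not do.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up word counts: 4 documents, 6 words
toy_counts = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 1],
])

toy_model = LatentDirichletAllocation(n_components=2, random_state=0)
toy_scores = toy_model.fit_transform(toy_counts)

print(toy_scores.shape)        # one row per document, one column per topic
print(toy_scores.sum(axis=1))  # each row sums to (approximately) 1
```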

## Exercise

Use `line_topic_scores` to calculate a topic number (0-9) for each line, using the maximum score.

## Solution



In [95]:

topics = [list(scores).index(max(scores)) for scores in line_topic_scores]

# Print first few to check
print(topics[:10])


[2, 2, 2, 9, 9, 5, 7, 2, 9, 9]
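
Since `line_topic_scores` is a NumPy array, `argmax` gives the same answer without a Python loop. Here is a short sketch using the first row printed earlier plus one made-up row:

```python
import numpy as np

example_scores = np.array([
    # First row of line_topic_scores as printed earlier
    [0.00666706, 0.00666759, 0.62162255, 0.00666785, 0.00666759,
     0.00666719, 0.0066668, 0.00666729, 0.00666841, 0.32503766],
    # Made-up second row for illustration
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.55, 0.05, 0.05],
])

# Index of the largest score in each row
example_topics = example_scores.argmax(axis=1)
print(example_topics)
```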


Now we can view documents for each topic to get a feel for what it might be!

The easiest way to do this is with a pandas data frame:

In [101]:

# Import pandas
import pandas as pd

# Display function to view data (jupyter notebooks only)
from IPython.display import display

# Create a data frame with the original lines and their topics
df = pd.DataFrame({"Line": lines, "Topic": topics})

# Display dataframe to show
display(df)


| | Line | Topic |
|---|---|---|
| 0 | The Project Gutenberg eBook of The Complete Wo... | 2 |
| 1 | This eBook is for the use of anyone anywhere i... | 2 |
| 2 | Title: The Complete Works of William Shakespeare | 2 |
| 3 | Author: William Shakespeare | 9 |
| 4 | Release Date: January 1994 [eBook #100]\n[Most... | 9 |
| ... | ... | ... |
| 27765 | Professor Michael S. Hart was the originator o... | 4 |
| 27766 | Project Gutenberg-tm eBooks are often created ... | 3 |
| 27767 | Most people start at our Web site which has th... | 2 |
| 27768 | This Web site includes information about Proje... | 2 |


Let's view the first few documents in topic 6 as an example:

In [102]:

display(df[df["Topic"] == 6])


| | Line | Topic |
|---|---|---|
| 48 | TWELFTH NIGHT; OR, WHAT YOU WILL | 6 |
| 391 | LAFEW.\nHow called you the man you speak of, m... | 6 |
| 397 | LAFEW.\nI would it were not notorious. Was thi... | 6 |
| 401 | HELENA.\nI do affect a sorrow indeed, but I ha... | 6 |
| 405 | LAFEW.\nHow understand we that? | 6 |
| ... | ... | ... |
| 27097 | PAULINA.\nI am sorry, sir, I have thus far sti... | 6 |
| 27103 | LEONTES.\nWhat you can make her do\nI am conte... | 6 |
| 27104 | PAULINA.\nIt is requir’d\nYou do awake your ... | 6 |
| 27114 | [_Presenting Perdita who kneels to Hermione._] | 6 |


Unfortunately this isn't always hugely useful. We can also look at the top words used per topic using the `topic_model` object.

In [107]:

topic_components = topic_model.components_

# Show components for topic 6
print(topic_components[6])


[0.10000807 1.74640525 0.10000371 ... 0.10011028 0.1        0.10001298]


This components array has an entry for every word in the vocabulary, so we need a list of word indices sorted by weight, highest first.

In [112]:

# I Googled this one, a trick of the trade is being proficient at Google

# It's good practice to always include a link when you do use code you find online for future reference
# https://stackoverflow.com/questions/7851077/how-to-return-index-of-a-sorted-list

my_list = topic_components[6]
sorted_indicies = sorted(range(len(my_list)), key=lambda k: my_list[k], reverse=True)

# Print top ten indices
print(sorted_indicies[:10])


[34817, 15517, 20611, 30902, 34837, 16498, 33911, 14395, 9389, 16453]
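
NumPy's `argsort` is another way to get the same ranking. The sketch below uses a short made-up weight array; with tied weights the two approaches may order the tied indices differently.

```python
import numpy as np

# Made-up topic weights for a five-word vocabulary
weights = np.array([0.1, 1.7, 0.4, 2.5, 0.9])

# argsort sorts ascending, so reverse it to get the highest weight first
top_indices = np.argsort(weights)[::-1][:3]
print(top_indices)
```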


We can then map these top indices back to words:

In [113]:
print([word_lookup[i] for i in sorted_indicies[:10]])

['you', 'i', 'not', 'to', 'your', 'it', 'what', 'have', 'do', 'is']


Oh dear! There are a lot of short words here.

Words like "you", "I", "to" and "it" are considered stop words: very common words that carry little meaning on their own. Typically in NLP we ignore stop words.
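
Filtering them out is a simple membership test. Here is a minimal sketch with a tiny hand-written stop list (an illustration only; a real stop list is much longer):

```python
# A tiny hand-written stop list, for illustration only
tiny_stopwords = {"you", "i", "to", "it", "the", "of", "by"}

tokens = ["the", "complete", "works", "of", "william", "shakespeare"]

# Keep only the tokens that are not stop words
content_tokens = [t for t in tokens if t not in tiny_stopwords]
print(content_tokens)
```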

Fortunately we can get a list of English language stop words from the [NLTK](https://www.nltk.org/) package. 

Below I re-run the steps to this point with stop words excluded:


In [131]:

# Code to download stopwords if required
# import nltk
# nltk.download('stopwords')

# Import stopword list (stored as a set for fast membership tests)
from nltk.corpus import stopwords
english_stopwords = set(stopwords.words('english'))

# Create a new tokenizer to remove stopwords
def new_tokenizer(line):
    first_stage = split_into_words(line)
    return [w for w in first_stage if w not in english_stopwords]

# Instantiate our vectorizer (it works just like any other sklearn class)
# We need to provide our tokenizer function as a parameter
vectorizer = CountVectorizer(tokenizer=new_tokenizer)

# Fit vectorizer
vectorizer.fit(lines)

# Vectorize (transform) our data
vectorized_lines = vectorizer.transform(lines)

# Create topic model class (10 topics)
topic_model = LatentDirichletAllocation(n_components=10)

# Fit topic model
topic_model.fit(vectorized_lines)

# Calculate topic scores
line_topic_scores = topic_model.transform(vectorized_lines)

# Topic components
topic_components = topic_model.components_

# Create a word lookup dictionary.
# The vocabulary_ dictionary from the vectorizer is the wrong way round
word_lookup = {value: key for key, value in vectorizer.vocabulary_.items()}

# A function to pull out top words for each topic
def top_words(components):
    sorted_indicies = sorted(range(len(components)), key=lambda k: components[k], reverse=True)
    return [word_lookup[i] for i in sorted_indicies[:10]]

# Print top words for topic 6
print(top_words(topic_components[6]))


['would', 'fair', 'like', 'yet', 'hath', 'doth', 'eyes', 'night', 'shall', 'make']


Much better!

In [132]:

# Top words for all topics
for topic in range(10):
    print("Topic {}: {}".format(topic, top_words(topic_components[topic])))


Topic 0: ['scene', 'lord', 'hamlet', 'act', 'ii', 'iago', 'good', 'room', 'house', 'iii']
Topic 1: ['shall', 'thou', 'thy', 'let', 'us', 'hath', 'make', 'would', 'upon', 'may']
Topic 2: ['shall', 'come', 'must', 'palamon', 'like', 'mrs', 'arcite', 'see', 'let', 'othello']
Topic 3: ['love', 'would', 'valentine', 'hath', 'upon', 'proteus', 'sweet', 'shall', 'time', 'much']
Topic 4: ['sir', 'good', 'well', 'lord', '', 'come', 'would', 'master', 'duke', 'man']
Topic 5: ['', '_exit_', '_exeunt_', '_exeunt', '_exit', 'claudio', 'benedick', 'reenter', 'leonato', 'beatrice']
Topic 6: ['would', 'fair', 'like', 'yet', 'hath', 'doth', 'eyes', 'night', 'shall', 'make']
Topic 7: ['enter', '', 'antony', 'caesar', 'king', 'cleopatra', 'brutus', 'lord', 'duke', 'two']
Topic 8: ['thou', 'thee', 'thy', 'art', 'iâll', 'hast', 'man', 'dost', 'romeo', 'come']
Topic 9: ['king', 'thy', 'thou', 'thee', 'queen', 'lord', 'shall', 'gloucester', 'henry', 'richard']


# Predictive Modelling

Not only can we do unsupervised learning for NLP, but also supervised. For example, let's train a model to differentiate between lines from Shakespeare's complete works and [War and Peace](https://www.gutenberg.org/ebooks/2600).

In [137]:

# Download War and Peace
response_wp = requests.get("https://www.gutenberg.org/files/2600/2600-0.txt")
raw_text_wp = response_wp.text

# Trim first three characters (formatting)
trim_text_wp = raw_text_wp[3:]

# Split up into individual lines
# In this dataset, individual lines are split by two new line characters
lines_wp = trim_text_wp.replace("\r", "").split("\n\n")

# Print the first few lines to check this has worked
print(lines_wp[:5])


['\nThe Project Gutenberg EBook of War and Peace, by Leo Tolstoy', 'This eBook is for the use of anyone anywhere at no cost and with almost\nno restrictions whatsoever. You may copy it, give it away or re-use\nit under the terms of the Project Gutenberg License included with this\neBook or online at www.gutenberg.org', '\nTitle: War and Peace', 'Author: Leo Tolstoy', 'Translators: Louise and Aylmer Maude']


In order to train and assess a predictive model, we must be able to train and test on two different sets, containing samples from both Shakespeare and Tolstoy. Producing a Pandas data frame will make this easier!

In [140]:

# Produce Shakespeare data frame
shakespeare_df = pd.DataFrame({"Lines": lines})
shakespeare_df["Shakespeare"] = 1

# Produce War and Peace data frame
wp_df = pd.DataFrame({"Lines": lines_wp})
wp_df["Shakespeare"] = 0

# Combine the two data frames
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
combined_df = pd.concat([shakespeare_df, wp_df], ignore_index=True)

# Display to check it has worked
display(combined_df)


| | Lines | Shakespeare |
|---|---|---|
| 0 | The Project Gutenberg eBook of The Complete Wo... | 1 |
| 1 | This eBook is for the use of anyone anywhere i... | 1 |
| 2 | Title: The Complete Works of William Shakespeare | 1 |
| 3 | Author: William Shakespeare | 1 |
| 4 | Release Date: January 1994 [eBook #100]\n[Most... | 1 |
| ... | ... | ... |
| 40715 | \nMost people start at our Web site which has ... | 0 |
| 40716 | http://www.gutenberg.org | 0 |
| 40717 | This Web site includes information about Proje... | 0 |
| 40718 | | 0 |


Perfect, now we can train a machine learning model. [Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) models are commonly used for NLP, so we will use one of them.

In [142]:

# Import sklearn train test split function
from sklearn.model_selection import train_test_split

# Split data into a test and training set
train, test = train_test_split(combined_df, test_size=0.2)

# Train vectorizer
vectorizer = CountVectorizer(tokenizer=new_tokenizer)
vectorizer.fit(train["Lines"])

# Vectorize our training set
train_vectorized = vectorizer.transform(train["Lines"])

# Use this to train a machine learning model

# Import a Naive Bayes model
from sklearn.naive_bayes import GaussianNB

# Train model
predictive_model = GaussianNB()
predictive_model.fit(train_vectorized.toarray(), train["Shakespeare"])

# Now we can test!

# Vectorize test set
test_vectorized = vectorizer.transform(test["Lines"])

# Predict
predictions = predictive_model.predict(test_vectorized.toarray())

# Print first few predictions to check they are as expected
print(predictions[:10])


[0 1 0 0 0 0 0 1 1 1]
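
As a small aside, `train_test_split` accepts `random_state` (for a repeatable split) and `stratify` (to keep the class balance the same in both sets). The lesson's code uses neither, so the sketch below, on made-up data, is an optional refinement rather than what is run here.

```python
from sklearn.model_selection import train_test_split

# Made-up data: ten lines, half labelled 1 and half labelled 0
toy_lines = ["line {}".format(i) for i in range(10)]
toy_labels = [1] * 5 + [0] * 5

train_lines, test_lines, train_labels, test_labels = train_test_split(
    toy_lines, toy_labels,
    test_size=0.2,        # hold out 20% for testing
    stratify=toy_labels,  # same share of each class in train and test
    random_state=0,       # same split every run
)

print(len(train_lines), len(test_lines))
print(sorted(test_labels))
```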


Now let's calculate the accuracy (the fraction of test lines classified correctly):

In [144]:

from sklearn.metrics import accuracy_score

print(accuracy_score(test["Shakespeare"], predictions))


0.9209233791748527


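This is very encouraging! We would expect Shakespeare and Tolstoy to be easy to tell apart. For context, roughly 68% of the combined lines are Shakespeare (27,769 of 40,719), so always guessing Shakespeare would already score about 0.68.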
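
`GaussianNB` needs a dense array, which is why the code above calls `.toarray()`. For word counts, `MultinomialNB` is a commonly used alternative that accepts the sparse matrix directly. This is a different model from the one used in this lesson; the sketch below, on a made-up four-line corpus with the default tokenizer, only shows the shape of the code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training lines: 1 = Shakespeare-like, 0 = Tolstoy-like
toy_lines = ["thou art fair", "thy love doth shine",
             "prince andrew rode on", "pierre looked at natasha"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_train = toy_vectorizer.fit_transform(toy_lines)  # stays sparse

toy_model = MultinomialNB()
toy_model.fit(toy_train, toy_labels)

# Words the vectorizer has not seen (such as "and") are simply ignored
toy_test = toy_vectorizer.transform(["thou doth love", "natasha and pierre"])
toy_predictions = toy_model.predict(toy_test)
print(toy_predictions)
```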

## Exercise

Write your own code to classify user-entered strings as either Shakespeare or Tolstoy.


## Solution


In [151]:

# Function to test a line
def shakespeare_or_tolstoy(line):
    vectorized = vectorizer.transform([line])
    pred = predictive_model.predict(vectorized.toarray())[0]
    if pred == 1:
        print("Shakespeare!")
    else:
        print("Tolstoy!")

# Take an input
my_input = input()

# Test input
shakespeare_or_tolstoy(my_input)


Beware the ides of March
Shakespeare!


# Recap

We have covered:
* Tokenization
* Stop words
* Vectorization
* Topic Modelling
* Predictive Modelling


# Homework

Write your own predictive model to differentiate between two open datasets.