# Common Text Parsing Techniques

Working with text data is annoying and hard because text is generated by people, and people are awful. We use all kinds of dumb words and symbols and complicated stuff that makes everything very confusing. Text parsing is the process of trying to make text data more like other kinds of data - numbers, categories, etc. Generally, this falls into two distinct but related processes - text "cleaning", which is about removing extraneous or confusing details to standardise the text and reduce "noise", and text "representation", which is about turning text data into a numeric representation to make it accessible to both simple analysis (counting, scoring, etc.) and complex algorithms (similarity scores, categorisation, prediction, etc.).

In this tutorial I'll go through some common techniques for cleaning up text. These are often used as the precursor to more complex analysis of text data. As usual, I'll try to show you what I think is the _one, best way_ to do this, as well as give a bit of insight into how this works behind the scenes. This is a bit trickier for this tutorial though, because the right way to do this is very dependent on the eventual goal of your analysis, and a lot of these techniques use very complex processes to achieve something that looks quite simple.

In [253]:
import pandas as pd
import string
import unicodedata
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
import nltk
import re

The `nltk` module is one of the most common tools used for working with text data. Thing is, it's _huge_. It includes huge amounts of text "corpora", and other data sets. So when you first install it, it comes with only the bare minimum of these. The extra stuff you have to download as you need it. Fortunately it provides a convenient method for doing that.

In [254]:
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\simon.carryer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\simon.carryer\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\simon.carryer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\simon.carryer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

The data we're going to use is several years' worth of entries in Wellington, New Zealand's "Burger Wellington" competition, where restaurants create and sell a fancy burger for the event. The dataset contains the burger name, the restaurant name, and a description of the burger, as well as some other details.

In [255]:
df = pd.read_csv("https://raw.githubusercontent.com/SimonCarryer/burgerwellington/master/burgers.csv")

In [413]:
df.columns

Index(['Burger Name', 'Burger Description', 'Restaurant', 'Price', 'Year',
       'Beef', 'Chicken', 'Duck', 'Lamb', 'Pork', 'Seafood', 'Sweet',
       'Vegetarian', 'Venison', 'Not Your Usual', 'Finalist', 'Winner',
       'Restaurant_cleaned'],
      dtype='object')

## String cleaning

String cleaning means removing "noise" from text. Generally, it's all about making the text more "regular" - reducing the number of different characters and the variety of representations of the same information. The challenge is that all "noise" is also information. Consider - removing capital letters removes the distinction between "Grey" the name, and "grey" the colour. Removing punctuation removes the distinction between "its" and "it's", and removes the breaks between sentences.

In the case of this burger data, that's not a problem, but for other uses, it might be very important to keep some of that information. You should fit your string cleaning approach to the purpose of your analysis.

String cleaning can also be very time-consuming. If you have millions of rows of text, or very long documents, you might look for more efficient approaches than what I use here.

Here are a bunch of functions that take a single `string` input, and clean that `string` in one specific way.

In [363]:
def lowercase(text):
    "Convert BIG BOYS to wee chaps"
    return text.lower()

def remove_text_in_parentheses(text):
    """Removes any text (enclosed in parentheses)"""
    return re.sub(r"\(.*?\)", "", text)  # raw string, so the backslashes reach the regex engine unchanged

def convert_special_characters(text):
    "Replace fancy characters with the nearest ascii equivalent"
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode() # https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize

def normalise_whitespace(text):
    "Tidy up multiple spaces, newlines, and leading/trailing spaces."
    return ' '.join(text.split())

def hyphens_to_spaces(text):
    """Replaces hyphens and slashes with spaces."""
    return text.replace("-", " ").replace("/", " ")

def remove_punctuation(text):
    """Removes all punctuation (as defined by string.punctuation)"""
    return text.translate(str.maketrans('', '', string.punctuation)) # https://python-reference.readthedocs.io/en/latest/docs/str/translate.html

Depending on what you need, you might only want some of the above functions, or you might want to write new functions or your own versions of the above. For example, you might want to remove only some kinds of punctuation, or you might want to preserve newline characters.
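As one sketch of that kind of variation, here's a version of `remove_punctuation` that leaves a chosen set of characters alone. The name `remove_punctuation_except` and its `keep` parameter are my own invention for illustration, so adapt them to whatever you actually need.

```python
import string

def remove_punctuation_except(text, keep="'"):
    """Remove everything in string.punctuation except the characters in `keep`."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_remove))

print(remove_punctuation_except("It's a burger, isn't it?"))
# It's a burger isn't it
```

Keeping apostrophes like this preserves the difference between "its" and "it's", at the cost of a slightly larger vocabulary later on.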

When you know exactly which functions you need (and importantly the order in which to apply them), it's handy to have a single function that applies them all in the correct order.

In [364]:
def clean_text(text, functions=[remove_text_in_parentheses, hyphens_to_spaces, convert_special_characters, remove_punctuation, lowercase, normalise_whitespace]):
    for function in functions:
        text = function(text)
    return text

In [367]:
clean_text("well — hello theré. 😋  how are you? (we don't want this bit) ")

'well hello there how are you'

In `pandas` you can apply that function to a `Series` using the `Series.apply` method.

In [366]:
df["Restaurant"].apply(clean_text).head()

0           1815 cafe bar
1        annam restaurant
2                  apache
3    artisan dining house
4           aston norwood
Name: Restaurant, dtype: object

__Exercises:__
* Apply the `clean_text` function to the `Burger Name` column.
* How many unique restaurant names are there in the dataset before cleaning, compared to after?
* Run the `clean_text` function without removing punctuation.
* __HARD MODE__: Make a function that replaces numbers with a dummy value, so "1815 cafe bar" becomes "XXXX cafe bar", and include it in the `clean_text` function.

## Bag of Words

"Bag of Words" is a form of text "representation" or "embedding". In other words, it's a way of turning your text data into numeric data. Bag of words is one of the most basic ways of doing this, but it's also extremely useful. Essentially it means learning a "vocabulary" of all the words in your text, and then making a dataset that's got one column for every word, one row for every document, and a count of how many times that word appears in the document.

For example, "the cat sat on the mat" would be represented as:

|the|cat|sat|on|mat|
|-|-|-|-|-|
|2|1|1|1|1|

If we had another document in our corpus, say "the cat ate the rat", our bag of words representation would look like this:

|the|cat|sat|on|mat|ate|rat|
|-|-|-|-|-|-|-|
|2|1|1|1|1|0|0|
|2|1|0|0|0|1|1|

A few things worth noting here:
* Bag of words does not preserve the order of words. "The cat sat on the mat" and "the mat sat on the cat" would be represented the same way. 
* Bag of words quickly grows to have a very wide dataset, with a lot of columns. String cleaning can help reduce this (so "It's", "it's", "its" and "Its" all get one column, rather than four columns, for example), but you will still end up with very wide tables.
* You also get a very sparse dataset, where most rows contain mostly zeros. Most words don't show up in most documents.
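To make the idea concrete, here's a minimal pure-Python sketch that builds the tables above by hand. The function `bag_of_words` is hypothetical and only for illustration. It splits on spaces and keeps words in order of first appearance, which is simpler than what a real vectoriser does.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, rows), where each row holds the word counts for one document."""
    tokenised = [doc.lower().split() for doc in documents]
    # Learn the vocabulary: every distinct word, in order of first appearance.
    vocabulary = []
    for tokens in tokenised:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    # Count each vocabulary word in each document.
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(["the cat sat on the mat", "the cat ate the rat"])
print(vocabulary)  # ['the', 'cat', 'sat', 'on', 'mat', 'ate', 'rat']
print(rows)        # [[2, 1, 1, 1, 1, 0, 0], [2, 1, 0, 0, 0, 1, 1]]

# Word order is lost: these two documents get identical rows.
_, (first, second) = bag_of_words(["The cat sat on the mat", "the mat sat on the cat"])
print(first == second)  # True
```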

To create a bag of words embedding, we use the `CountVectorizer` class from the `sklearn` module.

This class does two main things:

* `fit` takes a column of text (a set of "documents") and learns a vocabulary from them (a list of all the words in the documents).
* `transform` takes a set of documents and converts them into a bag-of-words embedding, with one row for each document and one column for each word.
* It also has a `fit_transform` method which conveniently lets you do both of the above things in one step.
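If the split between `fit` and `transform` seems abstract, this toy class may help. It's a rough sketch of the idea only: `TinyCountVectorizer` is a made-up name, it splits on spaces, and it returns plain lists rather than anything the real `CountVectorizer` would give you.

```python
from collections import Counter

class TinyCountVectorizer:
    """A toy stand-in for the fit/transform idea. Not the real sklearn class."""

    def fit(self, documents):
        # Learn the vocabulary: every distinct word, sorted alphabetically.
        words = {word for doc in documents for word in doc.split()}
        self.vocabulary_ = sorted(words)
        return self

    def transform(self, documents):
        # Count only the words learnt during fit; unseen words are ignored.
        rows = []
        for doc in documents:
            counts = Counter(doc.split())
            rows.append([counts[word] for word in self.vocabulary_])
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)

tiny = TinyCountVectorizer()
print(tiny.fit_transform(["the cat sat"]))  # [[1, 1, 1]]
print(tiny.vocabulary_)                     # ['cat', 'sat', 'the']
print(tiny.transform(["the dog sat sat"]))  # [[0, 2, 1]]
```

Notice that "dog" vanishes in the last call, because it wasn't in the documents the vocabulary was learnt from.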

By default the `CountVectorizer` will lower-case the text, and there's a `strip_accents` argument you can pass to make it normalise unusual characters too. For more fine-grained control you can pass your own text-cleaning function to the `preprocessor` keyword (this replaces the built-in lower-casing and accent-stripping, so your function needs to do that work itself). We can use the `clean_text` function we wrote.

I strongly suggest checking out the documentation for the `CountVectorizer` class, as it has a lot of useful functions.

__NOTE:__ We _could_ create new columns in our `DataFrame` holding the cleaned version of each of the text columns, and do all our analysis on those. That would be a good option if we expected to do a lot of exploratory analysis on those columns, and it would be faster to clean the text only once. But! Another common use of the `CountVectorizer` is to use it as the first step in building some kind of text classification model. In those cases it's important that the training data and the test data are cleaned exactly the same way, and for that purpose having everything contained within the `CountVectorizer` object is very convenient.

In [313]:
vec = CountVectorizer(preprocessor=clean_text)

bag_of_words = vec.fit_transform(df["Restaurant"])

In [314]:
bag_of_words

<788x526 sparse matrix of type '<class 'numpy.int64'>'
	with 2054 stored elements in Compressed Sparse Row format>

What's this? The bag of words that the `CountVectorizer` has returned isn't a `DataFrame`, it's a sparse matrix (from the `scipy.sparse` module). What the heck is that? Remember how bag of words returns a very wide table, where most of the values are zero? To save on space and to make a big dataset like this easier to work with, the `CountVectorizer` returns a sparse matrix format, which represents this data more efficiently. Instead of storing a value for every row and column, it only stores information about the non-zero values. Conceptually, you can think of each stored value as a tuple formatted like "(<count>,<row_index>,<col_index>)". For example `(1, 0, 2)` means "There was 1 instance in the first document of the third word in the vocabulary".

If you're working with large datasets, you might need to keep your data in sparse matrix format. But for smaller datasets, or where you really need to see what's going on better, you can turn this representation back into a `DataFrame` pretty easily.
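Here's a small sketch of that "only store the non-zero values" idea, using plain Python lists. The helper `to_triples` is hypothetical, and the real matrix uses a more compact layout internally (Compressed Sparse Row, as the output above says), so treat this as a picture of the concept rather than of the actual storage.

```python
def to_triples(dense_rows):
    """List the non-zero cells of a dense table as (count, row_index, col_index)."""
    return [
        (count, row_index, col_index)
        for row_index, row in enumerate(dense_rows)
        for col_index, count in enumerate(row)
        if count != 0
    ]

dense = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 1],
]
triples = to_triples(dense)
print(triples)  # [(1, 0, 2), (2, 2, 0), (1, 2, 5)]
print(f"{len(triples)} stored values instead of {len(dense) * len(dense[0])}")
# 3 stored values instead of 18
```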

In [311]:
df_of_words = pd.DataFrame(bag_of_words.todense(), columns=vec.get_feature_names())

In [312]:
df_of_words.head()

| |1154|169|1815|20|44|88|absolute|ale|ales|alicetown|...|willis|wilson|wine|wood|woodfire|woods|woodshed|yard|zake|zibibbo|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|0|1|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|1|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|2|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|3|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|4|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|


Now we can see the structure of the data much more easily. As expected, most of the values are zero.

The instance of the `CountVectorizer` class that we created, `vec`, has been `fit` with the cleaned restaurant names. We can see the vocabulary it learnt by calling its `get_feature_names` method (renamed `get_feature_names_out` in scikit-learn 1.0, with the old name removed in 1.2). If you call `transform` on a new set of documents, it will use the vocabulary it already learnt (in other words, it will return the same columns, even if there are new words in the new set of documents).

In [282]:
len(vec.get_feature_names())

526

It doesn't know anything about the frequencies of those words though. To see that, we have to look at the bag of words `DataFrame`. 

In [283]:
df_of_words.sum().sort_values(ascending=False)

the           148
bar           130
cafe           80
restaurant     75
and            37
             ... 
jones           1
kk              1
lab             1
lady            1
1154            1
Length: 526, dtype: int64

With this `DataFrame`, you can now do anything you'd do with any other `DataFrame`. For example, grouping, taking sums, etc.

In [284]:
df_of_words.groupby(df["Year"]).sum()

|Year|1154|169|1815|20|44|88|absolute|ale|ales|alicetown|...|willis|wilson|wine|wood|woodfire|woods|woodshed|yard|zake|zibibbo|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|2014|0|0|0|0|1|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|2015|0|0|0|0|1|1|1|0|0|0|...|0|0|0|0|0|0|0|0|0|1|
|2016|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|1|1|0|0|1|
|2017|0|0|1|0|0|0|0|0|0|0|...|0|0|0|0|1|1|0|1|0|1|
|2018|0|0|1|0|0|0|0|0|1|0|...|1|1|0|1|1|1|0|1|0|1|
|2019|1|1|1|1|0|0|0|1|1|1|...|1|2|1|1|0|1|0|1|1|0|


__Exercises:__
* Check out the documentation for the `CountVectorizer` class. Are there any other useful arguments you can pass it?
* Create a bag of words embedding of the `Burger Description` column. Don't forget to clean the strings first.
* What's the most common word in the burger names for each year?
* __HARD MODE__: Create a bag of words embedding for all the burgers, but using only words that appear in burgers from the first year of data.

## Tokenising

Remember how when we made that bag of words embedding, it made one column for every word in the corpus? How did it know what a "word" is? By default, it treats any run of two or more word characters (letters, digits and underscores) as a word, and uses white space and punctuation as the breaks between words. This process of splitting documents up into smaller chunks is called "tokenising", and there are a lot of different ways to do it. Splitting into words like this is the simplest way to do it, but there are other ways.
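Here's a quick sketch comparing a plain white-space split with the regular expression that, as far as I know, is the default `token_pattern` of the `CountVectorizer`. The example sentence is made up.

```python
import re

text = "a cat sat on the mat, again"

# Naive approach: split on white space. Punctuation stays glued to the words.
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['a', 'cat', 'sat', 'on', 'the', 'mat,', 'again']

# Runs of two or more word characters, so "a" is dropped and "mat," loses its comma.
pattern_tokens = re.findall(r"(?u)\b\w\w+\b", text)
print(pattern_tokens)
# ['cat', 'sat', 'on', 'the', 'mat', 'again']
```

If you've already cleaned the text thoroughly, the main visible difference is that single-character words like "a" disappear.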

### Character Embeddings

Rather than splitting text up into its constituent words, you can split it up into characters. This substantially reduces the number of tokens (if you've cleaned your text thoroughly, you'll have at most the 26 letters of the English alphabet, plus numbers maybe). On the other hand, since it doesn't preserve order, you lose a _lot_ of information.

A cool thing about the `CountVectorizer` is that it comes pre-built with a few different ways of doing tokenization. You control that by passing a different magic word to the `analyzer` keyword argument.
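Before handing this over to the vectoriser, it may help to see how little is going on underneath. This is a hand-rolled sketch of character counting (the helper name `char_counts` is mine), which also shows how much information gets thrown away.

```python
from collections import Counter

def char_counts(text):
    """Count every character in the text, spaces included."""
    return Counter(text)

counts = char_counts("beef burger")
print(counts["e"], counts["b"], counts[" "])  # 3 2 1

# Because order is ignored, anagrams are indistinguishable.
print(char_counts("listen") == char_counts("silent"))  # True
```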

In [316]:
vec = CountVectorizer(analyzer="char", preprocessor=clean_text)
bag_of_words = vec.fit_transform(df["Burger Name"])
df_of_words = pd.DataFrame(bag_of_words.todense(), columns=vec.get_feature_names())
df_of_words.head()

| |(space)|0|1|2|3|4|5|7|8|a|...|q|r|s|t|u|v|w|x|y|z|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|2|0|0|0|0|0|0|0|0|0|...|0|0|2|1|2|0|0|0|0|0|
|1|2|0|0|0|0|0|0|0|0|0|...|0|0|1|2|1|0|0|0|1|0|
|2|2|0|0|0|0|0|0|0|0|0|...|0|0|0|0|2|0|0|0|0|0|
|3|2|0|0|0|0|0|0|0|0|1|...|0|0|1|2|0|0|0|0|0|0|
|4|2|0|0|0|0|0|0|0|0|2|...|0|3|1|1|1|0|0|0|0|0|


### N-Grams

That character embedding above doesn't seem very useful, but there's a way to make it much more useful, which is to employ "n-grams". "N-Grams" means analysing overlapping _sets_ of tokens, rather than single tokens. A set of two tokens is called a "2-gram", a set of three is a "3-gram" and so on. For example, if we were breaking the word "burger" into character 2-grams, we'd get the following 2-grams: "bu", "ur", "rg", "ge" and "er".

This is neat because it retains more information about the order of the characters, while still keeping the number of distinct tokens fairly low.
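As a minimal sketch of the idea, here's a hand-written function that slides a window of `n` characters along a string. The name `char_ngrams` is just for illustration.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("burger", 2))  # ['bu', 'ur', 'rg', 'ge', 'er']
print(char_ngrams("burger", 3))  # ['bur', 'urg', 'rge', 'ger']
```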

Again, the `CountVectorizer` has a built-in method for creating n-grams, controlled by the `ngram_range` keyword argument. It expects a tuple in the format `(<minimum_gram_length>,<maximum_gram_length>)`. So, for example, passing `(1, 3)` means that you want to break your documents into all their constituent 1-grams, 2-grams, _and_ 3-grams. Passing `(3, 3)` means getting just the 3-grams.

Don't forget that spaces are still characters, so by default you'll get character n-grams with spaces in them from the start and end of words, like " a" and "t ". Usually that's a good thing.
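To see what a range of n-gram lengths looks like, spaces and all, here's a small sketch. The function `char_ngrams_in_range` is a hypothetical stand-in that mimics the `(min, max)` idea and isn't how the library implements it.

```python
def char_ngrams_in_range(text, min_n, max_n):
    """Return all character n-grams with lengths from min_n to max_n inclusive."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print(char_ngrams_in_range("a cat", 1, 2))
# ['a', ' ', 'c', 'a', 't', 'a ', ' c', 'ca', 'at']
```

The "a " and " c" tokens carry a bit of information about where words start and end, which single characters can't.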

In [317]:
vec = CountVectorizer(analyzer="char", ngram_range=(1, 2), preprocessor=clean_text)
bag_of_words = vec.fit_transform(df["Burger Description"])
df_of_words = pd.DataFrame(bag_of_words.todense(), columns=vec.get_feature_names())
df_of_words.head()

| |(space)|1|2|3|4|5|7|a|b|c|...|z|za|ze|zi|zl|zo|zs|zu|zy|zz|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|22|0|0|0|0|0|0|3|2|3|...|0|1|1|0|0|0|0|0|0|0|
|1|17|0|0|0|0|0|0|1|0|3|...|0|0|0|0|0|0|0|0|0|0|
|2|27|0|0|0|0|0|0|3|1|7|...|0|0|0|0|0|0|0|0|0|0|
|3|21|0|0|0|0|0|0|4|3|2|...|0|0|0|0|0|0|0|0|0|0|
|4|19|0|0|0|0|0|0|2|3|3|...|0|0|0|0|0|0|0|0|0|0|


Word tokens can also be turned into n-grams, though be aware that this will often substantially increase the size of your vocabulary.
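The same sliding-window trick works on words. This is a minimal sketch with a made-up helper, `word_ngrams`, that splits on spaces.

```python
def word_ngrams(text, n):
    """Return the overlapping word n-grams of text, joined with spaces."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("the cat sat on the mat", 2))
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```

Note that "the" appears twice as a single word, but "the cat" and "the mat" are two different 2-grams, which is how the vocabulary ends up growing.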

In [318]:
vec = CountVectorizer(analyzer="word", ngram_range=(2, 2), preprocessor=clean_text)
bag_of_words = vec.fit_transform(df["Burger Description"])
df_of_words = pd.DataFrame(bag_of_words.todense(), columns=vec.get_feature_names())
df_of_words.head()

| |12 hour|12 pound|14 lb|15 hour|18 hour|200 gram|200g beef|200gm medium|4s chicken|72 dark|...|zeus smoky|zeus southland|zeus tzatziki|zeus yoghurt|zucchini and|zucchini citrus|zucchini cucumber|zucchini fries|zucchini pickle|zucchini pickles|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|1|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|2|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|3|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|4|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|


__Exercises:__
* Make a `DataFrame` of the character 2-grams and 3-grams in all the burger names.
* What's the most commonly occurring word 2-gram in burger descriptions?

### Stop-words

Remember back when we looked at the most-common words in the burger descriptions, and the top few words were things like "with" and "and"? That's not very exciting, right? Usually we don't care about these kinds of "filler" words. A common technique is to just remove them. When you do that, the list of words you remove is called your "stop-words".

Conveniently, the `CountVectorizer` comes with a pre-configured set of English stop words, which you can exclude automatically.

In [319]:
vec = CountVectorizer(stop_words="english", preprocessor=clean_text)
bag_of_words = vec.fit_transform(df["Burger Name"])
df_of_words = pd.DataFrame(bag_of_words.todense(), columns=vec.get_feature_names())
df_of_words.sum().sort_values(ascending=False)[:10]

burger     196
chicken     22
beef        19
lamb        16
sweet       12
big         11
deer        11
la          10
pig         10
piggy       10
dtype: int64

You can also pass `stop_words` your own list of words. If you need access to your own list of stop-words, `nltk` has one which you can use. Here's some good advice on doing that, from the `sklearn` library: https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words

In [304]:
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Any cleaning you do on your text you should also do on your stop words, to ensure that they match correctly (for example, removing punctuation turns "you're" into "youre", so the stop word needs the same treatment).

In [329]:
custom_stops = [clean_text(word) for word in stopwords.words('english')]
custom_stops[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'youre']
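Here's a tiny sketch of why that matters. The stop list and sentence are invented for the example, and `strip_apostrophes` stands in for whatever cleaning you actually apply.

```python
raw_stops = ["you're", "with", "and"]  # hypothetical, uncleaned stop list
tokens = "youre welcome to a burger with cheese and bacon".split()  # already cleaned

def strip_apostrophes(word):
    """Stand-in for the real cleaning step."""
    return word.replace("'", "")

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop list."""
    stop_set = set(stop_words)
    return [token for token in tokens if token not in stop_set]

# With the raw list, "youre" slips through because it doesn't match "you're".
print(remove_stop_words(tokens, raw_stops))
# ['youre', 'welcome', 'to', 'a', 'burger', 'cheese', 'bacon']

# Cleaning the stop words the same way as the text fixes the mismatch.
cleaned_stops = [strip_apostrophes(word) for word in raw_stops]
print(remove_stop_words(tokens, cleaned_stops))
# ['welcome', 'to', 'a', 'burger', 'cheese', 'bacon']
```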

### Custom Tokenising

Maybe the way you want to split up your text isn't by word _or_ by character. Maybe you've got your own way you want to do that which uses your own particular logic. We can support that too!

The `CountVectorizer` takes a `tokenizer` keyword argument, which should be a function that takes a string and returns a list of tokens.

In [337]:
def tokenise(text):
    return [text[i:i+3] for i in range(0, len(text), 3)] # Split text into non-overlapping 3-character tokens for some reason

vec = CountVectorizer(tokenizer=tokenise, preprocessor=clean_text)
bag_of_words = vec.fit_transform(df["Burger Name"])
df_of_words = pd.DataFrame(bag_of_words.todense(), columns=vec.get_feature_names())
df_of_words.head()

| |20|3|44|4t|a|ag|al|an|ap|ar|...|yu|z|z b|za|zen|zer|zeu|zil|zo|zo.1|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|1|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|2|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|3|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|4|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|


### Stemming and Lemmatising

Stemming and Lemmatising are two complex subjects that I'll only touch on briefly here. They're both ways of reducing the complexity of text by attempting to trim words down to a "root" - removing all the inflection that languages add - "Jump", "jumping", "jumps", "jumped" all become "jump" for example.

"Stemming" is a kind of brute-force approach that looks for common inflection endings ("-e", "-ing", "-ed") and snips them off. 

"Lemmatising" is a more complex (and therefore slower) approach that more accurately reduces words to their roots (or "lemmas"), by using information about what part of speech the word represents.

In [406]:
example_text = clean_text("Here's an example of a sentence that get some benefits from stemming/lemmatising - or it might not!")
example_text

'heres an example of a sentence that get some benefits from stemming lemmatising or it might not'

`nltk` provides a few stemmer classes, of which `PorterStemmer` is probably the most widely-used. You can apply stemming as part of your tokenising step.

In [407]:
porter = PorterStemmer()

def tokenise(text):
    return [porter.stem(w) for w in text.split()]

In [408]:
tokenise(example_text)

['here',
 'an',
 'exampl',
 'of',
 'a',
 'sentenc',
 'that',
 'get',
 'some',
 'benefit',
 'from',
 'stem',
 'lemmatis',
 'or',
 'it',
 'might',
 'not']

You can see that the stemmer has been a bit over-zealous, snipping the "e" off the end of "example" and "sentence", but it's turned "heres" into "here" and "stemming" into "stem", which is good.
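To get a feel for why stemmers behave like this, here's a deliberately naive suffix-stripper. It is _not_ the Porter algorithm, which has many more rules. The function `naive_stem`, its suffix list, and the minimum stem length of 3 are all assumptions made up for this sketch.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, as long as at least 3 characters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["jump", "jumping", "jumps", "jumped"]])
# ['jump', 'jump', 'jump', 'jump']

# Simple rules go wrong quickly: the doubled "m" is left behind here.
print(naive_stem("stemming"))  # stemm
```

Real stemmers add extra rules to handle cases like doubled consonants, but they're still working from spelling alone, so some odd results are unavoidable.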

We can use stemming in our `CountVectorizer` by passing the new `tokenise` method to its `tokenizer` keyword.

In [409]:
vec = CountVectorizer(tokenizer=tokenise, preprocessor=clean_text)

In [410]:
bag_of_words = vec.fit_transform(df["Burger Description"])
pd.DataFrame(bag_of_words.todense(), columns=vec.get_feature_names()).head()

| |1|10|12|15|18|2|200|200g|200gm|25|...|yum|yuzu|zaatar|zaida|zani|zealand|zelati|zeppelin|zeu|zucchini|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|1|0|0|0|1|0|
|1|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|2|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|3|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|4|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|


Lemmatising is a bit more complicated, because we need to get the "part of speech" for each word that we want to lemmatise and, frustratingly, the `nltk` part-of-speech tagger returns tags in a different format than the lemmatiser expects. We have to do a bunch of faffing around.

In [411]:
lemmy = WordNetLemmatizer()

def get_wordnet_pos(tag):
    """Translate NLTK POS into wordnet POS"""
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag[0].upper(), wordnet.NOUN)

def tokenise(text):
    tagged_tokens = nltk.pos_tag(nltk.word_tokenize(text)) # Runs the text through the nltk pos_tagger
    return [lemmy.lemmatize(word, get_wordnet_pos(pos_tag)) for word, pos_tag in tagged_tokens] # Passes each word and translated pos tag to lemmatiser

In [412]:
tokenise(example_text)

['here',
 'an',
 'example',
 'of',
 'a',
 'sentence',
 'that',
 'get',
 'some',
 'benefit',
 'from',
 'stem',
 'lemmatising',
 'or',
 'it',
 'might',
 'not']

The results are pretty similar, but "example" and "sentence" have been spared the chop, since the lemmatiser knows they're nouns and don't get inflected like verbs do.
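If you're wondering why the part of speech matters so much, this toy example might help. It uses a tiny hand-written lookup table (`TOY_LEMMAS`, entirely made up), whereas the real lemmatiser draws on the WordNet database, but the principle is the same: the same spelling can have different roots depending on how it's used.

```python
# Hypothetical, hand-written lookup table: (word, part of speech) -> lemma.
TOY_LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("leaves", "v"): "leave",
    ("leaves", "n"): "leaf",
}

def toy_lemmatise(word, pos):
    """Look up the lemma for a word and part of speech, falling back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatise("saw", "v"))     # see
print(toy_lemmatise("saw", "n"))     # saw
print(toy_lemmatise("leaves", "n"))  # leaf
print(toy_lemmatise("burger", "n"))  # burger
```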

__Exercises:__
* What are the ten most-used words in burger descriptions, after applying lemmatisation?
* Make a `CountVectorizer` that makes character 1-grams of the first letter of each word in the text.
* __HARD MODE__: Make a `CountVectorizer` that makes character 2-grams, but _also_ removes English stop-words before making those tokens.