 *Artificial Intelligence for Vision & NLP* &nbsp; | &nbsp;  *ATU Donegal - MSc/PGDip in Big Data Analytics & Artificial Intelligence*
 
 # Lemmatisation

In contrast to stemming, lemmatisation looks beyond word reduction, and considers a language's full vocabulary to apply a *morphological analysis* to words. This is the grammatical analysis of the formation of words from morphemes.

For example, the lemma of *was* is *be* and the lemma of *mice* is *mouse*. The lemma of *meeting* might be *meet* or *meeting* depending on its use in a sentence.
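To make the dependence on usage concrete, here is a toy sketch in which the lemma is looked up by the pair (word, part of speech). It is not how `spaCy` works internally; the table `TOY_LEMMAS` and the function `toy_lemma` are hypothetical names invented for this illustration.

```python
# Toy illustration only: a real lemmatiser uses a full vocabulary and rules.
# TOY_LEMMAS and toy_lemma are hypothetical names, not part of spaCy.
TOY_LEMMAS = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",     # "We are meeting tomorrow."
    ("meeting", "NOUN"): "meeting",  # "The meeting ran late."
}


def toy_lemma(word, pos):
    """Return the lemma for (word, pos), falling back to the lower-cased word."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())


for word, pos in [("was", "AUX"), ("mice", "NOUN"),
                  ("meeting", "VERB"), ("meeting", "NOUN")]:
    print(f"{word:{10}} {pos:{6}} {toy_lemma(word, pos)}")
```

The same string *meeting* yields two different lemmas, which is why a lemmatiser needs to know the part of speech before it can choose.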

First we import the `spaCy` library and load the relevant language model. In this example, I'm using the small **English language** model, `en_core_web_sm`. The loaded model is stored in the variable `nlp`.

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

# Alternatively we could load the English language library as follows
# from spacy.lang.en import English
# nlp = English()



Next we create a short sentence and assign it to a `spaCy` document object.

In [2]:
doc_object = nlp(u"I ate an apple before I went to class. I'm still feeling hungry though.")

Let's see what tokens we have in the document, as well as the associated *parts-of-speech (POS)* tags using the `.pos_` attribute:

In [3]:
for token in doc_object:
    print(token, token.pos_)

I PRON
ate VERB
an DET
apple NOUN
before SCONJ
I PRON
went VERB
to ADP
class NOUN
. PUNCT
I PRON
'm AUX
still ADV
feeling VERB
hungry ADJ
though ADV
. PUNCT


Now I'd like to examine all of the root words from my sentence. 

In [4]:
for word in doc_object:
    print(word.text + " -----> " + word.lemma_)

I -----> I
ate -----> eat
an -----> an
apple -----> apple
before -----> before
I -----> I
went -----> go
to -----> to
class -----> class
. -----> .
I -----> I
'm -----> be
still -----> still
feeling -----> feel
hungry -----> hungry
though -----> though
. -----> .


Let's see what happens with the words we used in the stemming example, and compare the root words identified by lemmatisation versus stemming.

In [5]:
sample_words = nlp(u"caresses ponies pony cats running runner climber easily quickly")

for word in sample_words:
    print(word.text + " -----> " + word.lemma_, "\t", word.pos_)

caresses -----> caress 	 VERB
ponies -----> pony 	 NOUN
pony -----> pony 	 ADJ
cats -----> cat 	 NOUN
running -----> run 	 VERB
runner -----> runner 	 NOUN
climber -----> climber 	 NOUN
easily -----> easily 	 ADV
quickly -----> quickly 	 ADV


The stemming output was as follows:

```
caresses ------> caress
ponies ------> poni
pony ------> poni
cats ------> cat
running ------> run
runner ------> runner
climber ------> climber
easily ------> easili
quickly ------> quickli
```

We can see that unlike stemming where the root we got was *poni* for *ponies*, the roots that we arrived at through lemmatisation are actual words in the English dictionary.

Lemmatisation converts inflected forms of a word, such as the past tense and past participle of a verb (sometimes called its second and third forms), to the base (first) form.

Next we create a simple function to input some NLP text and then use f-string formatting to tidy the output.

In [6]:
def create_lemmatization(text_to_convert):
    for token in text_to_convert:
        print(f"{token.text:{15}} {token.lemma_:{30}}")
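
The nested braces in a format specification such as `{token.text:{15}}` set a minimum field width. The short sketch below shows the effect on plain strings; the `rows` list is made up for illustration and stands in for `spaCy` tokens.

```python
# Plain strings stand in for spaCy tokens here, so no language model is needed
rows = [("is", "be"), ("jumping", "jump")]

for text, lemma in rows:
    # {text:{15}} pads the string with trailing spaces to a minimum width of 15
    line = f"{text:{15}} {lemma:{30}}"
    print(repr(line))

# The nested form {15} is equivalent to writing the width directly
print(f"{'is':{15}}" == f"{'is':15}" == "is".ljust(15))
```

Strings are left-aligned by default, whereas integers are right-aligned, so numeric values printed with a width specification appear pushed to the right of their column.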

Let's use this on an NLP sentence:

In [7]:
doc_object = nlp(u"The brown fox is quick and he is jumping over the lazy dog")
# Call the lemmatization function with my sentence
create_lemmatization(doc_object)

The             the                           
brown           brown                         
fox             fox                           
is              be                            
quick           quick                         
and             and                           
he              he                            
is              be                            
jumping         jump                          
over            over                          
the             the                           
lazy            lazy                          
dog             dog                           


Note that the lemma of *is* is *be*, the base form of the verb.

# More Part-of-Speech Tagging

Processing raw text intelligently is difficult. Lots of words that look completely different can mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what `spaCy` is designed to do: you put in raw text, and get back a `Doc` object, that comes with a variety of annotations.

As we've seen previously, a *Part-Of-Speech Tagger (POS Tagger)* reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word (and other token). Computational applications generally use more fine-grained POS tags such as 'noun-plural'.

In this section we'll look at POS tags in more detail.

### View Token Tags
We can obtain a particular token by its index position.
* To view the coarse POS tag use `token.pos_`
* To view the fine-grained tag use `token.tag_`
* To view the description of either type of tag use `spacy.explain(tag)`

Note that the `token.pos` and `token.tag` attributes return integer IDs (hash values in the case of `token.tag`). By adding the underscore we get the text (string) equivalent that lives in `doc.vocab`, which is more meaningful to us than the integer value alone.
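
The idea of storing an integer and looking up its string on demand can be sketched with a small two-way table. This is a toy model only: `ToyStringStore` is a hypothetical class, and CRC32 is used as the hash simply because it is in the standard library, whereas `spaCy` uses its own 64-bit hash.

```python
import zlib


class ToyStringStore:
    """Hypothetical stand-in for a string table: maps integer IDs back to strings."""

    def __init__(self):
        self._strings = {}

    def add(self, text):
        # Assumption: CRC32 replaces spaCy's own 64-bit hash for this illustration
        key = zlib.crc32(text.encode("utf-8"))
        self._strings[key] = text
        return key

    def __getitem__(self, key):
        return self._strings[key]


store = ToyStringStore()
tag_id = store.add("VBD")
print(tag_id, "->", store[tag_id])
```

Working with integers internally keeps comparisons cheap, while the string form remains available whenever a human-readable label is needed.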

Before we begin, let's have a look at the full line of text that we'll now work on.

In [8]:
doc_object = nlp(u"I ate an apple before I went to class. I'm still feeling hungry though.")
print(doc_object)

I ate an apple before I went to class. I'm still feeling hungry though.


The `doc` object contains information on the *predicted POS tag* as well as other interesting attributes. We can easily show the POS tags generated by `spaCy` through a loop.

In [9]:
for token in doc_object:
    print(f" {token.text:{20}} {token.pos_:{10}} {token.tag_:{5}} {spacy.explain(token.tag_)}")

 I                    PRON       PRP   pronoun, personal
 ate                  VERB       VBD   verb, past tense
 an                   DET        DT    determiner
 apple                NOUN       NN    noun, singular or mass
 before               SCONJ      IN    conjunction, subordinating or preposition
 I                    PRON       PRP   pronoun, personal
 went                 VERB       VBD   verb, past tense
 to                   ADP        IN    conjunction, subordinating or preposition
 class                NOUN       NN    noun, singular or mass
 .                    PUNCT      .     punctuation mark, sentence closer
 I                    PRON       PRP   pronoun, personal
 'm                   AUX        VBP   verb, non-3rd person singular present
 still                ADV        RB    adverb
 feeling              VERB       VBG   verb, gerund or present participle
 hungry               ADJ        JJ    adjective (English), other noun-modifier (Chinese)
 though               ADV        RB    adverb
 .                    PUNCT      .     punctuation mark, sentence closer

The `pos_` attribute holds the *part-of-speech* tag for each word. For example, let's examine the word *went*.

We can see that the POS tag (the coarse-grained POS tag) for *went* is VERB, and this is correct since *went* is a verb.

The *fine-grained POS tag* for the word *went* is VBD. We can use the `spacy.explain()` function to show more information on what the VBD tag means.

The full explanation of *VBD* is *verb, past tense*.

Every token is assigned a coarse-grained POS tag from the following list:


<table><tr><th>POS</th><th>DESCRIPTION</th><th>EXAMPLES</th></tr>
    
<tr><td>ADJ</td><td>adjective</td><td>*big, old, green, incomprehensible, first*</td></tr>
<tr><td>ADP</td><td>adposition</td><td>*in, to, during*</td></tr>
<tr><td>ADV</td><td>adverb</td><td>*very, tomorrow, down, where, there*</td></tr>
<tr><td>AUX</td><td>auxiliary</td><td>*is, has (done), will (do), should (do)*</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>*and, or, but*</td></tr>
<tr><td>CCONJ</td><td>coordinating conjunction</td><td>*and, or, but*</td></tr>
<tr><td>DET</td><td>determiner</td><td>*a, an, the*</td></tr>
<tr><td>INTJ</td><td>interjection</td><td>*psst, ouch, bravo, hello*</td></tr>
<tr><td>NOUN</td><td>noun</td><td>*girl, cat, tree, air, beauty*</td></tr>
<tr><td>NUM</td><td>numeral</td><td>*1, 2017, one, seventy-seven, IV, MMXIV*</td></tr>
<tr><td>PART</td><td>particle</td><td>*'s, not*</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>*I, you, he, she, myself, themselves, somebody*</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>*Mary, John, London, NATO, HBO*</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>*., (, ), ?*</td></tr>
<tr><td>SCONJ</td><td>subordinating conjunction</td><td>*if, while, that*</td></tr>
<tr><td>SYM</td><td>symbol</td><td>*$, %, §, ©, +, −, ×, ÷, =, :), 😝*</td></tr>
<tr><td>VERB</td><td>verb</td><td>*run, runs, running, eat, ate, eating*</td></tr>
<tr><td>X</td><td>other</td><td>*sfpksdpsxmsa*</td></tr>
<tr><td>SPACE</td><td>space</td><td></td></tr>
</table>


Tokens are subsequently given a fine-grained tag as determined by morphology:
<table>
<tr><th>POS</th><th>Description</th><th>Fine-grained Tag</th><th>Description</th><th>Morphology</th></tr>
<tr><td>ADJ</td><td>adjective</td><td>AFX</td><td>affix</td><td>Hyph=yes</td></tr>
<tr><td>ADJ</td><td></td><td>JJ</td><td>adjective</td><td>Degree=pos</td></tr>
<tr><td>ADJ</td><td></td><td>JJR</td><td>adjective, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADJ</td><td></td><td>JJS</td><td>adjective, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADJ</td><td></td><td>PDT</td><td>predeterminer</td><td>AdjType=pdt PronType=prn</td></tr>
<tr><td>ADJ</td><td></td><td>PRP\$</td><td>pronoun, possessive</td><td>PronType=prs Poss=yes</td></tr>
<tr><td>ADJ</td><td></td><td>WDT</td><td>wh-determiner</td><td>PronType=int rel</td></tr>
<tr><td>ADJ</td><td></td><td>WP\$</td><td>wh-pronoun, possessive</td><td>Poss=yes PronType=int rel</td></tr>
<tr><td>ADP</td><td>adposition</td><td>IN</td><td>conjunction, subordinating or preposition</td><td></td></tr>
<tr><td>ADV</td><td>adverb</td><td>EX</td><td>existential there</td><td>AdvType=ex</td></tr>
<tr><td>ADV</td><td></td><td>RB</td><td>adverb</td><td>Degree=pos</td></tr>
<tr><td>ADV</td><td></td><td>RBR</td><td>adverb, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADV</td><td></td><td>RBS</td><td>adverb, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADV</td><td></td><td>WRB</td><td>wh-adverb</td><td>PronType=int rel</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>CC</td><td>conjunction, coordinating</td><td>ConjType=coor</td></tr>
<tr><td>DET</td><td>determiner</td><td>DT</td><td>determiner</td><td></td></tr>
<tr><td>INTJ</td><td>interjection</td><td>UH</td><td>interjection</td><td></td></tr>
<tr><td>NOUN</td><td>noun</td><td>NN</td><td>noun, singular or mass</td><td>Number=sing</td></tr>
<tr><td>NOUN</td><td></td><td>NNS</td><td>noun, plural</td><td>Number=plur</td></tr>
<tr><td>NOUN</td><td></td><td>WP</td><td>wh-pronoun, personal</td><td>PronType=int rel</td></tr>
<tr><td>NUM</td><td>numeral</td><td>CD</td><td>cardinal number</td><td>NumType=card</td></tr>
<tr><td>PART</td><td>particle</td><td>POS</td><td>possessive ending</td><td>Poss=yes</td></tr>
<tr><td>PART</td><td></td><td>RP</td><td>adverb, particle</td><td></td></tr>
<tr><td>PART</td><td></td><td>TO</td><td>infinitival to</td><td>PartType=inf VerbForm=inf</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>PRP</td><td>pronoun, personal</td><td>PronType=prs</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>NNP</td><td>noun, proper singular</td><td>NounType=prop Number=sing</td></tr>
<tr><td>PROPN</td><td></td><td>NNPS</td><td>noun, proper plural</td><td>NounType=prop Number=plur</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>-LRB-</td><td>left round bracket</td><td>PunctType=brck PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>-RRB-</td><td>right round bracket</td><td>PunctType=brck PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>,</td><td>punctuation mark, comma</td><td>PunctType=comm</td></tr>
<tr><td>PUNCT</td><td></td><td>:</td><td>punctuation mark, colon or ellipsis</td><td></td></tr>
<tr><td>PUNCT</td><td></td><td>.</td><td>punctuation mark, sentence closer</td><td>PunctType=peri</td></tr>
<tr><td>PUNCT</td><td></td><td>''</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>""</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>``</td><td>opening quotation mark</td><td>PunctType=quot PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>HYPH</td><td>punctuation mark, hyphen</td><td>PunctType=dash</td></tr>
<tr><td>PUNCT</td><td></td><td>LS</td><td>list item marker</td><td>NumType=ord</td></tr>
<tr><td>PUNCT</td><td></td><td>NFP</td><td>superfluous punctuation</td><td></td></tr>
<tr><td>SYM</td><td>symbol</td><td>#</td><td>symbol, number sign</td><td>SymType=numbersign</td></tr>
<tr><td>SYM</td><td></td><td>\$</td><td>symbol, currency</td><td>SymType=currency</td></tr>
<tr><td>SYM</td><td></td><td>SYM</td><td>symbol</td><td></td></tr>
<tr><td>VERB</td><td>verb</td><td>BES</td><td>auxiliary "be"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>HVS</td><td>forms of "have"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>MD</td><td>verb, modal auxiliary</td><td>VerbType=mod</td></tr>
<tr><td>VERB</td><td></td><td>VB</td><td>verb, base form</td><td>VerbForm=inf</td></tr>
<tr><td>VERB</td><td></td><td>VBD</td><td>verb, past tense</td><td>VerbForm=fin Tense=past</td></tr>
<tr><td>VERB</td><td></td><td>VBG</td><td>verb, gerund or present participle</td><td>VerbForm=part Tense=pres Aspect=prog</td></tr>
<tr><td>VERB</td><td></td><td>VBN</td><td>verb, past participle</td><td>VerbForm=part Tense=past Aspect=perf</td></tr>
<tr><td>VERB</td><td></td><td>VBP</td><td>verb, non-3rd person singular present</td><td>VerbForm=fin Tense=pres</td></tr>
<tr><td>VERB</td><td></td><td>VBZ</td><td>verb, 3rd person singular present</td><td>VerbForm=fin Tense=pres Number=sing Person=3</td></tr>
<tr><td>X</td><td>other</td><td>ADD</td><td>email</td><td></td></tr>
<tr><td>X</td><td></td><td>FW</td><td>foreign word</td><td>Foreign=yes</td></tr>
<tr><td>X</td><td></td><td>GW</td><td>additional word in multi-word expression</td><td></td></tr>
<tr><td>X</td><td></td><td>XX</td><td>unknown</td><td></td></tr>
<tr><td>SPACE</td><td>space</td><td>_SP</td><td>space</td><td></td></tr>
<tr><td></td><td></td><td>NIL</td><td>missing tag</td><td></td></tr>
</table>

Here's a link to all the part-of-speech tags available through `spaCy`: https://spacy.io/api/annotation#pos-tagging
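
The relationship between the two tag levels in the table above can be sketched as a simple lookup. The dictionary `FINE_TO_COARSE` and the function `coarse_tag` are hypothetical names for this illustration, and only a handful of rows are copied from the table.

```python
# A few rows copied from the table above: fine-grained tag -> (coarse tag, description)
FINE_TO_COARSE = {
    "JJ": ("ADJ", "adjective"),
    "NN": ("NOUN", "noun, singular or mass"),
    "NNS": ("NOUN", "noun, plural"),
    "RB": ("ADV", "adverb"),
    "VBD": ("VERB", "verb, past tense"),
    "VBG": ("VERB", "verb, gerund or present participle"),
    "VBP": ("VERB", "verb, non-3rd person singular present"),
}


def coarse_tag(fine_tag):
    """Look up the coarse tag for a fine-grained tag, or 'X' (other) if not listed."""
    coarse, _description = FINE_TO_COARSE.get(fine_tag, ("X", "other"))
    return coarse


for tag in ["VBD", "NNS", "JJ"]:
    print(f"{tag:{5}} {coarse_tag(tag):{6}} {FINE_TO_COARSE[tag][1]}")
```

A fixed lookup like this is only a first approximation. In the earlier output, *'m* has the fine-grained tag VBP but the coarse tag AUX, and the tag IN appears with both SCONJ and ADP, so the coarse tag a model assigns also depends on context.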

### Working with POS Tags
In the English language, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. spaCy uses machine learning algorithms to best predict the use of a token in a sentence. Is "*I read library books*" present or past tense? Is *read* a verb or a noun?

Let's look at an example.

In [10]:
doc_object = nlp(u"I read library books.")
for word in doc_object:
    print(f" {word.text:{10}} {word.pos_:{10}} {word.tag_:{10}} {spacy.explain(word.tag_):{20}}")

 I          PRON       PRP        pronoun, personal   
 read       VERB       VBP        verb, non-3rd person singular present
 library    NOUN       NN         noun, singular or mass
 books      NOUN       NNS        noun, plural        
 .          PUNCT      .          punctuation mark, sentence closer


`spaCy` detects that the word *read* in this sentence is in the *present* tense.

In [11]:
doc_object = nlp(u"I recently read a library book.")
for word in doc_object:
    print(f" {word.text:{10}} {word.pos_:{10}} {word.tag_:{10}} {spacy.explain(word.tag_):{20}}")

 I          PRON       PRP        pronoun, personal   
 recently   ADV        RB         adverb              
 read       VERB       VBD        verb, past tense    
 a          DET        DT         determiner          
 library    NOUN       NN         noun, singular or mass
 book       NOUN       NN         noun, singular or mass
 .          PUNCT      .          punctuation mark, sentence closer


In this sentence, `spaCy` detects that *read* is in the past tense.

POS tagging can be useful if we have words or tokens that can have multiple POS tags. For instance, the word "google" can be used as both a noun and verb, depending upon the context of the sentence. While processing natural language, it is important to identify this difference. 

Fortunately, `spaCy` comes pre-built with machine learning models that use the context (the surrounding words) to predict the most likely POS tag for each word.

Let's have a look at the **google** example.

In [12]:
doc_object = nlp(u"Can you google it?")

Let's examine the POS tags for the word *google* in this sentence.

In [13]:
word = doc_object[2]
print(word)

google


In [14]:
print(f" {word.text:{10}} {word.pos_:{10}} {word.tag_:{10}} {spacy.explain(word.tag_):{20}}")

 google     VERB       VB         verb, base form     


From the output, the word "google" has been correctly identified as a verb.

Now let's change the context of the sentence.

In [15]:
doc_object = nlp(u"Can you search for it on google?")
word = doc_object[6]
print(word)

google


In [16]:
print(f" {word.text:{10}} {word.pos_:{10}} {word.tag_:{10}} {spacy.explain(word.tag_):{20}}")

 google     PROPN      NNP        noun, proper singular


Now the word **google** is being used as a noun, and `spaCy` tags it as a proper noun (PROPN).
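
To spot words that receive more than one tag across different sentences, the observed (word, tag) pairs can be grouped by word. The sketch below uses only the standard library; the `observations` list is typed in by hand from the outputs above rather than produced by `spaCy`.

```python
from collections import defaultdict

# (word, fine-grained tag) pairs copied by hand from the outputs above
observations = [
    ("read", "VBP"), ("read", "VBD"),
    ("google", "VB"), ("google", "NNP"),
    ("library", "NN"), ("library", "NN"),
]

# Collect the set of distinct tags seen for each word
tags_by_word = defaultdict(set)
for word, tag in observations:
    tags_by_word[word].add(tag)

# Keep only the words that were tagged in more than one way
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1}
print(ambiguous)
```

With real documents, the pairs could instead be gathered from `token.text` and `token.tag_` while looping over each `Doc`.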

## Counting POS Tags

The `.count_by()` method of a document object accepts a token attribute ID as its argument and returns a dictionary of frequency counts for that attribute. Here we pass `spacy.attrs.POS`.

Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.

`spacy.attrs.POS` is the attribute ID for the **coarse-grained** POS tag of each token in the document object.

In [17]:
doc_object = nlp(u"I ate an apple before I went to class. I'm still feeling hungry though.")

# Count the frequencies of different coarse-grained POS tags in the sentence
POS_frequency = doc_object.count_by(spacy.attrs.POS)
POS_frequency

{95: 3, 100: 3, 90: 1, 92: 2, 98: 1, 85: 1, 97: 2, 87: 1, 86: 2, 84: 1}

In the output, we can see the ID of the POS tags along with their frequencies of occurrence. The text of the POS tag can be displayed by passing the ID of the tag to the vocabulary of the `spaCy` document.

In [18]:
doc_object.vocab[97].text

'PUNCT'

This means that there are *two* punctuation tokens in the sentence, since the ID 97 has a count of 2.

We can use a loop to display the frequency of each POS tag in the sentence. `POS_frequency` is a dictionary, so we use its `.items()` method to access each key-value pair: the POS tag ID and the number of occurrences of that ID. We assign each of these to a loop variable. As before, the text of the POS tag is obtained by passing the ID to the vocabulary of the `spaCy` document object. The `vocab` object provides a lookup table containing lexemes and related data; see https://spacy.io/api/vocab for more information.

In [19]:
# Using the ".items()" command accesses each item in the dictionary
for tag_id, occurrences in POS_frequency.items():
    print(f" {tag_id:{10}} {doc_object.vocab[tag_id].text:{10}} {occurrences}")

         95 PRON       3
        100 VERB       3
         90 DET        1
         92 NOUN       2
         98 SCONJ      1
         85 ADP        1
         97 PUNCT      2
         87 AUX        1
         86 ADV        2
         84 ADJ        1
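
If the integer IDs are not needed, a similar tally can be built directly from the tag strings with `collections.Counter`. In this sketch the `pos_tags` list is typed in by hand from the earlier output so that it runs without a language model; with a real document it could be replaced by `Counter(token.pos_ for token in doc_object)`.

```python
from collections import Counter

# Coarse tags copied by hand from the earlier output for the example sentence
pos_tags = ["PRON", "VERB", "DET", "NOUN", "SCONJ", "PRON", "VERB", "ADP", "NOUN",
            "PUNCT", "PRON", "AUX", "ADV", "VERB", "ADJ", "ADV", "PUNCT"]

# Counter tallies how often each distinct tag occurs
pos_counts = Counter(pos_tags)

for tag, occurrences in pos_counts.items():
    print(f" {tag:{10}} {occurrences}")
```

The counts agree with the `count_by()` output above, for example three PRON tokens and two PUNCT tokens.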


We can count the fine-grained POS tags too.
`spacy.attrs.TAG` is the attribute ID for the *fine-grained* POS tag of each token in the document object.

In [20]:
# Get counts on each fine-grained POS tag for the document object
fine_grained_POS = doc_object.count_by(spacy.attrs.TAG)

for tag_id, occurrences in sorted(fine_grained_POS.items()):
    print(f" {tag_id:{20}} {doc_object.vocab[tag_id].text:{10}} {occurrences}")

   164681854541413346 RB         2
  1292078113972184607 IN         2
  1534113631682161808 VBG        1
  9188597074677201817 VBP        1
 10554686591937588953 JJ         1
 12646065887601541794 .          2
 13656873538139661788 PRP        3
 15267657372422890137 DT         1
 15308085513773655218 NN         2
 17109001835818727656 VBD        2


The output is sorted on the tag ID.
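
Sorting on the tag ID is rarely the most readable order. As a sketch of one alternative, the counts can be sorted by frequency using a `key` function; the `fine_counts` dictionary below is copied by hand from the output above, keyed on the tag text rather than the ID.

```python
# Fine-grained tag counts copied by hand from the output above (tag text -> occurrences)
fine_counts = {"RB": 2, "IN": 2, "VBG": 1, "VBP": 1, "JJ": 1,
               ".": 2, "PRP": 3, "DT": 1, "NN": 2, "VBD": 2}

# Sort by descending count, then alphabetically by tag to break ties
by_frequency = sorted(fine_counts.items(), key=lambda item: (-item[1], item[0]))

for tag, occurrences in by_frequency:
    print(f" {tag:{10}} {occurrences}")
```

Negating the count in the key gives descending order for frequency while keeping the tie-break on the tag text ascending.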

We can look at the *syntactic dependencies* of the document object using the `spacy.attrs.DEP` attribute ID.

In [21]:
# Count the different dependencies:
syn_dep_count = doc_object.count_by(spacy.attrs.DEP)

for tag_id, occurrences in sorted(syn_dep_count.items()):
    print(f'{tag_id:{10}} {doc_object.vocab[tag_id].text:{10}}: {occurrences}')

       398 acomp     : 1
       399 advcl     : 1
       400 advmod    : 2
       405 aux       : 1
       415 det       : 1
       416 dobj      : 1
       423 mark      : 1
       429 nsubj     : 3
       439 pobj      : 1
       443 prep      : 1
       445 punct     : 2
8206900633647566924 ROOT      : 2


## Visualising POS Tags

Visualising POS tags graphically is quite easy using the `displacy` module from the `spaCy` library.

To visualise the POS tags inside the Jupyter notebook, we call the `render` function from the `displacy` module and pass the `spaCy` document object to it. We must set the `style` of the visualisation and set the `jupyter` argument to `True`. I've already covered visualisation in the **Tokenisation** lecture, so refer to it for further information.

In [22]:
from spacy import displacy

doc_object = nlp(u"I ate an apple before I went to class. I'm still feeling hungry though.")
displacy.render(doc_object, style='dep', jupyter=True, options={'distance': 90})

## Exercise: Lemmatisation and Sentiment Analysis of IMDB Data

To get some hands-on experience of applying the above techniques, we'll lemmatise the IMDB database we saw in the previous class. Then we'll use the Scattertext library to visualise the sentiment of words in the corpus based on the positive/negative classification of each review.

First, install Scattertext and examine the interactive plot produced in `demo.html` by the code below. This uses a sample corpus of US Democratic and Republican political speeches. 

Note that Scattertext takes a Pandas DataFrame as input.

In [23]:
%%capture
!pip install scattertext

In [24]:
import pandas as pd
import scattertext as st
from scattertext import produce_scattertext_explorer

data = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
data.head()

Unnamed: 0,party,text,speaker,parse
0,democrat,Thank you. Thank you. Thank you. Thank you so ...,BARACK OBAMA,"(thank, you, ., thank, you, ., thank, you, ., ..."
1,democrat,"Thank you so much. Tonight, I am so thrilled a...",MICHELLE OBAMA,"(thank, you, so, much, .)"
2,democrat,Thank you. It is a singular honor to be here t...,RICHARD DURBIN,"(thank, you, ., it, is, a, singular, honor, to..."
3,democrat,"Hey, Delaware. \nAnd my favorite Democrat, Jil...",JOSEPH BIDEN,"(hey, ,, delaware, ., and, my, favorite, democ..."
4,democrat,"Hello. \nThank you, Angie. I'm so proud of how...",JILL BIDEN,"(hello, ., thank, you, ,, angie, ., i, ', m, s..."


In [25]:
corpus = st.CorpusFromParsedDocuments(data, category_col='party', 
                                      parsed_col='parse').build()
html = st.produce_scattertext_explorer(corpus, category='democrat', 
                                       category_name='Democratic', 
                                       not_category_name='Republican', 
                                       minimum_term_frequency=5, 
                                       width_in_pixels=1000)
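# Write the HTML to disk; the context manager closes the file afterwards
with open('./demo.html', 'w', encoding='utf-8') as f:
    f.write(html)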

Now, create a similar visualisation for the positive and negative reviews contained in the IMDB database. To see the difference made by lemmatisation, try creating one visualisation before lemmatising the text and one after. What other techniques could improve the output?

To ensure your code doesn't take too long to run, restrict your analysis to the first 1000 reviews.

Examine the output.
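
As a starting point for the preprocessing step, here is a minimal sketch of lemmatising only the first 1000 reviews. All names in it are hypothetical, and a tiny stand-in lemmatiser with made-up data is used so that it runs without a language model. In the notebook, the stand-in could be swapped for something like `lambda text: " ".join(token.lemma_ for token in nlp(text))`.

```python
# Stand-in lemma table so the sketch runs without a language model (hypothetical data)
STAND_IN_LEMMAS = {"movies": "movie", "was": "be", "loved": "love"}


def stand_in_lemmatise(text):
    """Lower-case the text, split on whitespace and replace known words with their lemma."""
    return " ".join(STAND_IN_LEMMAS.get(word, word) for word in text.lower().split())


def lemmatise_reviews(reviews, lemmatise, limit=1000):
    """Apply `lemmatise` (a callable mapping text to text) to at most `limit` reviews."""
    return [lemmatise(text) for text in reviews[:limit]]


reviews = ["I loved both movies", "It was dull"]  # made-up examples, not IMDB data
print(lemmatise_reviews(reviews, stand_in_lemmatise))
```

Passing the lemmatiser in as an argument makes it easy to produce the "before" version as well, by supplying a function that returns the text unchanged.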