# PB016: Artificial Intelligence I, labs 10 - Natural language processing

This week's topic is natural language processing (NLP). We'll focus mainly on:
1. __Text acquisition and pre-processing__
2. __Tokenization, tagging and stemming (dummy pipeline)__
3. __The NLTK NLP library, shallow syntactic analysis (smarter pipeline)__
4. __Sentiment analysis__

---

## 1. Text acquisition and preprocessing

__Basic facts__
- Usually a necessary step before applying the natural language processing methods themselves.
- It consists mainly of preparing texts for machine processing (e.g., removal of OCR noise, markup, splitting into segments, normalisation, etc.).

__Examples of typical tasks__
- Conversion and cleaning of text - obtaining material (e.g., by downloading), encoding transformations, conversion from the original format to plain text, possible removal of noise in the form of OCR errors, formatting and other marks, annotation passages, etc.
- Removal of "irrelevant" words - filtering of [stop-list](https://en.wikipedia.org/wiki/Stop_word) expressions that are a valid part of the language, but introduce noise in the context of a given NLP task (e.g., articles, prepositions and even certain verbs and nouns in English are noise for most tasks based on the "bag of words" approach, where only statistical parameters of the text matter regardless of its explicit syntactic structure).
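
The stop-list filtering described above can be sketched in a few lines. This is only a minimal illustration: the `STOP_WORDS` set and the `remove_stop_words` helper are assumptions made up for this example, and real stop lists are much longer and depend on the task at hand.

```python
# A tiny, hand-picked stop list (an assumption for illustration only;
# real stop lists are much longer and task-dependent).
STOP_WORDS = {"a", "an", "the", "in", "of", "and", "was", "were", "it"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words that are not in the stop list (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]

words = "It was a bright cold day in April".split()
content_words = remove_stop_words(words)
print(content_words)
```

Only the words that carry most of the content remain, which is typically what a "bag of words" method needs.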

### Downloading a sample text

![orwell](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/1984first.jpg)

In [1]:
# import library for opening URLs, etc.
import urllib.request

# open a link to sample text

sample_text_link = "http://gutenberg.net.au/ebooks01/0100021.txt"
f = urllib.request.urlopen(sample_text_link)

# decoding the contents of the link (just convert the binary string to text -
# it's already in a relatively clean plain text format)

sample_text = f.read().decode("utf-8")

# print the beginning and ending of the text

beginning = sample_text[:4115]
ending = sample_text[-6315:]

print('***** The beginning of the "raw version" of the 1984 novel *****\n')
print(beginning)

print('\n***** The end of the "raw version" of the 1984 novel *****\n')
print(ending)

***** The beginning of the "raw version" of the 1984 novel *****



Project Gutenberg Australia



Title: Nineteen eighty-four
Author: George Orwell (pseudonym of Eric Blair) (1903-1950)
* A Project Gutenberg of Australia eBook *
eBook No.:  0100021.txt
Language:   English
Date first posted: August 2001
Date most recently updated: November 2008

Project Gutenberg of Australia eBooks are created from printed editions
which are in the public domain in Australia, unless a copyright notice
is included. We do NOT keep any eBooks in compliance with a particular
paper edition.

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing this
file.

This eBook is made available at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg of Australia License which may be viewed online at
gutenberg.net.au/licence.html

To contact Project Gut

### Cleaning the sample text
- Removal of meta-data about the publication and of the appendices after the story, plus other adjustments aimed at obtaining the text itself without structural annotations (i.e., annotation of parts, chapters, etc.).
- Notes on the solution:
  - The procedure is often very arbitrary, depending on the source text and what we want to do with it.
  - A good start is to look at parts of the text, e.g., with `print(sample_text[:K])` and `print(sample_text[-K:])`, where `K` is a reasonably small number of characters (see above).
  - From what we see, we decide what to delete, replace, etc.
  - Substitutions using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) are often useful for text cleanup (without the use of specialized NLP libraries, but also with them). For details, see the [re](https://docs.python.org/3/library/re.html) module in the standard Python library, specifically the `re.sub()` function.
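
As a small, self-contained illustration of `re.sub()`-based cleanup, the sketch below works on a made-up string (`raw` is an assumption for this example, not the actual novel text). It removes carriage returns and chapter headings, then normalises the whitespace.

```python
import re

# A made-up raw snippet with Windows line endings, headings and stray spaces.
raw = (
    "Chapter 1\r\n\r\n"
    "It  was a bright\r\ncold day.\r\n\r\n"
    "Chapter 2\r\n\r\n"
    "The hallway smelt of cabbage."
)

text = raw.replace("\r", "")                                      # drop carriage returns
text = re.sub(r"^Chapter [0-9]+$", "", text, flags=re.MULTILINE)  # drop heading lines
text = re.sub(r"[ \t]+", " ", text)                               # collapse runs of spaces/tabs
text = re.sub(r"\n{3,}", "\n\n", text).strip()                    # at most one empty line in a row

print(text)
```

Anchoring the heading pattern with `^` and `$` (together with `re.MULTILINE`) keeps the substitution from touching the word "Chapter" if it appears in the middle of a sentence.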

In [2]:
# cutting the metadata in the beginning

cleaner_text = sample_text.split('PART ONE')[1]

# cutting the appendix after the main story

cleaner_text = cleaner_text.split('APPENDIX')[0]

# deleting the '\r' characters

cleaner_text = cleaner_text.replace('\r','')

# removing structural annotations using the RE module

import re

cleaner_text = re.sub('PART [A-Z]+','',cleaner_text)
cleaner_text = re.sub('Chapter [0-9]+','',cleaner_text)
cleaner_text = re.sub('THE END','',cleaner_text)

# cutting the whitespace around the text

cleaner_text = cleaner_text.strip()

# printing the beginning and ending

cleaner_beginning = cleaner_text[:3010]
cleaner_ending = cleaner_text[-412:]

print('***** Cleaner beginning of the 1984 novel *****\n')
print(cleaner_beginning)

print('\n***** Cleaner ending of the 1984 novel *****\n')
print(cleaner_ending)

***** Cleaner beginning of the 1984 novel *****

It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.

The hallway smelt of boiled cabbage and old rag mats. At one end of it a
coloured poster, too large for indoor display, had been tacked to the wall.
It depicted simply an enormous face, more than a metre wide: the face of a
man of about forty-five, with a heavy black moustache and ruggedly handsome
features. Winston made for the stairs. It was no use trying the lift. Even
at the best of times it was seldom working, and at present the electric
current was cut off during daylight hours. It was part of the economy drive
in preparation for Hate Week. The flat was seven flights up, and Winston,
who was thirty-nine and had a varicose ulc

---
## 2. Tokenization, tagging and stemming (dummy pipeline)

__Basic facts - tokenization__
- Creating lists of individual paragraphs, sentences and words from the sample text.

__Basic facts - tagging__
- Assignment of grammatical or other categories to individual parts of the text.
- One of the most common methods of tagging text is to assign corresponding part of speech tags to individual words (POS - part of speech tagging).
- However, more complex units can also be tagged, e.g., noun and verb phrases, modifier phrases, etc., other, more abstract parts of syntactic trees, semantic tags, etc.

__Basic facts - stemming (and lemmatisation)__
- The process of reducing inflected (or sometimes derived) words to their word stem, base or root form, or some other sort of a canonical form.
 - Stemming is reduction of the word to a form that is most common among all its morphological variants.
 - Lemmatization is another (though often very related) form of normalisation by grouping together the inflected forms of a word so they can be analysed as a single canonical item that still bears the meaning of the original word - a lemma, or also a dictionary form.
- An example in English: `wait, waits, waiting, ...` $\rightarrow$ `wait` (stem), `wait` (lemma).
- An example in Czech: `čekat, čeká, čekající, ...` $\rightarrow$ `ček` (stem), `čekat` (lemma).
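
The English example above can be reproduced with an off-the-shelf stemmer. The sketch below assumes NLTK is installed and uses its `PorterStemmer`, which is rule-based and needs no additional data download. Lemmatization, in contrast, typically relies on a dictionary resource.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# morphological variants of the same verb
variants = ["wait", "waits", "waiting", "waited"]
stems = [stemmer.stem(w) for w in variants]
print(stems)
```

All four variants are reduced to the same stem, so they can be counted as one item in later statistics.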

### Naive tokenization of the sample text
- Notes on the solution:
  - For paragraphs, one can take advantage of the fact that they are separated by double new lines. However, it might also be good to deal with the individual lines in the paragraphs themselves so that they become a uniform, unwrapped text.
  - For splitting the text into sentences and words it is possible to use the punctuation marks or spaces, respectively, and the `split()` function (either from the standard Python library or from the `re` module).
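
Before applying these ideas to the whole novel, it may help to see them on a toy input. The strings below are made up for illustration. The sketch also shows a typical weakness of punctuation-based splitting: abbreviations such as "Mr." are cut off as separate "sentences".

```python
import re

# paragraphs: split on empty lines, then unwrap the lines inside each paragraph
wrapped = "First line\nwrapped.\n\nSecond paragraph."
paragraphs = [p.replace("\n", " ") for p in wrapped.split("\n\n")]
print(paragraphs)

# sentences: split on sentence-final punctuation and drop empty leftovers
text = "Mr. Smith went home. Was the lift working? No!"
naive_sentences = [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]
print(naive_sentences)
```

The first "sentence" is just `Mr`, which is one of the reasons why more robust tokenizers are preferred in practice.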

In [3]:
# function to create a paragraph list from the input text
def paragraph_tokenizer(text,min_wlen=5):
  """Tokenizing the text to paragraphs.

  Parameters
  ----------
  text : str
      The input text to be tokenized.
  min_wlen : int, optional (default is 5)
      The minimum number of words in a paragraph for the paragraph
      to be included in the tokenized output list.

  Returns
  -------
  list
      List of paragraph strings contained in the input text.
  """

  splits = [x.replace('\n', ' ') for x in text.split('\n\n')]
  return [x for x in splits if len(x.split()) >= min_wlen]

# function to create a sentence list from the input text
def sentence_tokenizer(text):
  """Tokenizing the text to sentences.

  Parameters
  ----------
  text : str
      The input text to be tokenized.

  Returns
  -------
  list
      List of sentence strings contained in the input text.
  """

  return re.split(r'[.?!]', text)

# function to create a word list from the input text
def word_tokenizer(text):
  """Tokenizing the text to words.

  Parameters
  ----------
  text : str
      The input text to be tokenized.

  Returns
  -------
  list
      List of word strings contained in the input text.
  """

  return [x.strip('.,;!') for x in text.split()]

In [4]:
# creating a list of paragraphs of the novel 1984 and listing the first 4
paragraphs = paragraph_tokenizer(cleaner_text)
print('The first 4 paragraphs of 1984:\n', paragraphs[:4])
# creating a list of sentences of the novel 1984 and listing the first 2
sentences = []
for paragraph in paragraphs:
  sentences += sentence_tokenizer(paragraph)
print('\nThe first 2 sentences of 1984:\n', sentences[:2])
# creating a list of words of the novel 1984 and listing the first 8
words = []
for sentence in sentences:
  words += word_tokenizer(sentence)
print('\nThe first 8 words of 1984:\n', words[:8])

The first 4 paragraphs of 1984:
 ['It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.', 'The hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It depicted simply an enormous face, more than a metre wide: the face of a man of about forty-five, with a heavy black moustache and ruggedly handsome features. Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours. It was part of the economy drive in preparation for Hate Week. The flat was seven flights up, and Winston, who was thirty-nine and had a varicose ulcer above his

### Naive tagging and stemming of the sample text
- Tagging the first sentence of the 1984 novel with symbols of the relevant parts of speech (POS).
- Subsequent stemming depending on the specific POS tags.
- Notes on the solution:
  - One can simply assign tags to individual words according to a hard-coded look-up dictionary (which is not very smart or scalable in practice, but it's OK for the purposes of this exercise).
  - One can assume that the tags for each relevant word type are as follows: `DT` for articles ("a", "an", "the"), `NN` for nouns or numerals, `VB` for verbs, `PR` for pronouns, `JJ` for adjectives, `IN` for prepositions, `CC` for conjunctions, and `??` for unknown words.
  - Stemming then simply strips (some of) the word suffixes based on their POS tags.
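
One detail worth keeping in mind when stripping suffixes in Python: `str.rstrip()` takes a *set of characters*, not a suffix, so it may remove more than intended. The sketch below contrasts it with slicing. The `strip_suffix` helper is a hypothetical name introduced only for this example.

```python
def strip_suffix(word, suffix):
    """Remove `suffix` from the end of `word` if present (hypothetical helper)."""
    if suffix and word.endswith(suffix):
        return word[:-len(suffix)]
    return word

# rstrip removes any trailing run of the characters 'i', 'n' and 'g'
print("singing".rstrip("ing"))

# slicing removes exactly one occurrence of the whole suffix
print(strip_suffix("singing", "ing"))
```

On Python 3.9 and newer, the built-in `str.removesuffix()` does the same job as the helper.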

In [5]:
# naive POS tagger

def dummy_pos_tagger(words):
  """Adding POS tags to the list of input words.

  Parameters
  ----------
  words : list
      The input list of words to be tagged.

  Returns
  -------
  list
      List of `(word, POS_tag)` pairs.
  """

  # list of tagged words
  tagged_words = []
  # look-up dictionary with tags assigned to the words of interest
  tag_dict = {
    'it' : 'PR',
    'was' : 'VB',
    'a' : 'DT',
    'bright' : 'JJ',
    'cold' : 'JJ',
    'day' : 'NN',
    'in' : 'IN',
    'april' : 'NN',
    'and' : 'CC',
    'the' : 'DT',
    'clocks' : 'NN',
    'were' : 'VB',
    'striking' : 'VB',
    'thirteen' : 'NN'
  }
  # the tagging itself
  for word in words:
    tag = '??' # tag for unknown words
    if word.lower() in tag_dict:
      tag = tag_dict[word.lower()]
    tagged_words.append((word,tag))
  return tagged_words

# naive stemmer

def dummy_stemmer(tagged_words):
  """Computing the stems of the words in the list of input
  `(word, POS_tag)` pairs.

  Parameters
  ----------
  tagged_words : list
      The input list of pairs containing words and their POS tags.

  Returns
  -------
  list
      List of the stems of the words in the input list.
  """

  stemmed_words = []
  for word, tag in tagged_words:
    stemmed = word
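    if tag == 'NN':
      # strip a single plural 's' (str.rstrip('s') would remove every trailing 's')
      if word.endswith('s'):
        stemmed = word[:-1]
    elif tag == 'VB':
      # slice off the whole suffix (str.rstrip() strips a set of characters instead)
      if word.endswith('ed'):
        stemmed = word[:-2]
      elif word.endswith('ing'):
        stemmed = word[:-3]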
    stemmed_words.append(stemmed)
  return stemmed_words

In [6]:
# printing the words of the first sentence
words = word_tokenizer(sentences[0])
print('RAW WORDS   :\n'+'\n'.join(words))
tagged_words = dummy_pos_tagger(words)
# printing the tagged words of the first sentence
print('\nTAGGED WORDS:\n'+'\n'.join([str(x) for x in tagged_words]))
stemmed_words = dummy_stemmer(tagged_words)
# printing the stemmed words of the first sentence
print('\nSTEMMED WORDS:\n'+'\n'.join([str(x) for x in stemmed_words]))

RAW WORDS   :
It
was
a
bright
cold
day
in
April
and
the
clocks
were
striking
thirteen

TAGGED WORDS:
('It', 'PR')
('was', 'VB')
('a', 'DT')
('bright', 'JJ')
('cold', 'JJ')
('day', 'NN')
('in', 'IN')
('April', 'NN')
('and', 'CC')
('the', 'DT')
('clocks', 'NN')
('were', 'VB')
('striking', 'VB')
('thirteen', 'NN')

STEMMED WORDS:
It
was
a
bright
cold
day
in
April
and
the
clock
were
strik
thirteen


---

## 3. The NLTK NLP library, shallow syntactic analysis (smarter pipeline)

__Basic facts__
- There are a number of mature tools for doing NLP.
- One of the widely used libraries in Python is the Natural Language Toolkit - [NLTK](https://www.nltk.org/).
- NLTK contains easy-to-use implementations of a number of state of the art techniques, algorithms, corpora, etc.
 - It can thus solve all the above tasks (tokenization, tagging, stemming), among other things, which you will experiment with yourselves.
- Once these tasks are solved, one can focus on one of the "holy grails" of NLP - the syntactic analysis (also called parsing) of sentences in natural language.
  - Essential for determining grammatical correctness of sentences and also a prerequisite to various methods of determining meaning of sentences in as broad a range of contexts as possible (semantic and discourse analysis - arguably the ultimate, and so far rather unattainable, goals of NLP).
  - Parsing can be relatively [shallow](https://en.wikipedia.org/wiki/Shallow_parsing), or more complex, such as [dependency](https://en.wikipedia.org/wiki/Dependency_grammar) and/or [probabilistic](https://en.wikipedia.org/wiki/Statistical_parsing) (for more details, see the lecture and/or specialized NLP courses).
  - However, in these labs we will only deal with the shallow one.

### __Exercise 3.1: Tokenization, tagging and stemming using NLTK__
- Implement a simple NLP pipeline solving the tasks presented in the previous sections using the [NLTK](https://www.nltk.org/) library (which will lead to a much more systematic and robust solution).
- More specifically, check the NLTK docs and other relevant online materials to find out how to do the following:
 - Tokenize the cleaned text of the 1984 novel to sentences.
 - Tokenize the first sentence of the novel to words.
 - POS-tag the first sentence of the novel.
 - Stem the words in the first sentence of the novel.

In [7]:
# importing NLTK
import nltk
import pycrfsuite
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# downloading the NLTK resources used below
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('rslp')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/jindmen/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jindmen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package rslp to /home/jindmen/nltk_data...
[nltk_data]   Package rslp is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jindmen/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [8]:
## NLTK pipeline
# paragraph tokenize
# sentence tokenize
sentences = nltk.tokenize.sent_tokenize(cleaner_text)
# word tokenize
words = nltk.tokenize.word_tokenize(cleaner_text)
print(sentences[:2])
print(words[:20])

"""
POS tagging

The Hunpos tagger uses trigrams (3-item sequences) for its tagging.
The Perceptron tagger is based on an averaged perceptron classifier.
BERT is a family of language models that can do much more than just POS tagging.

Pre-trained versions of all three are used here.
"""

hunpos_tagger = nltk.tag.hunpos.HunposTagger('en_wsj.model').tag
perceptron_tagger = nltk.tag.perceptron.PerceptronTagger().tag
bert_tagger = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

taggers = [
    (hunpos_tagger, 'Hunpos'),
    (perceptron_tagger, 'Perceptron'),
    (lambda x: list(map(lambda z: (z[0]['word'], z[0]['entity']), bert_tagger(x))), 'BERT')
]

for tagger, name in taggers:
    print(f"Tagger {name}:")
    print(tagger(words[:20]))

'''
Mapping of both BERT model and NLTK models to closer
(and smaller) space of tags for analysis.
'''
bert_mapping = {
    'ADJ': 'JJ',
    'ADV': 'RB',
    'ADP': 'IN',
    'AUX': 'VB',
    'CCONJ': 'CC',
    'DET': 'DT',
    'INTJ': 'UH',
    'NOUN': 'NN',
    'NUM': 'NN',
    'PART': 'RP',
    'PRON': 'PRP',
    'PROPN': 'NN',
    'PUNCT': 'PUNCT',
    'SCONJ': 'IN',
    'SYM': 'X',
    'VERB': 'VB',
    'X': 'X',
}

others_mapping = {
    ',': 'PUNCT',
    '.': 'PUNCT',
    'NNS': 'NN',
    'NNP': 'NN',
    'NNPS': 'NN',
    'VBP': 'VB',
    'VBN': 'VB',
    'VBG': 'VB',
    'VBD': 'VB',
    'VBZ': 'VB',
    'PRP$': 'PRP',
    'RBS': 'RB',
    'RBR': 'RB',
}


"""
Analysis of similarity of taggers:
Generates tags for all three taggers, then compares them.
Where taggers differ, print neighborhood of the differing word.
"""
size = 100
padding = 3

def generate_tags(tagger, words, modifier=lambda x: x[1]):
    return map(modifier, tagger(words))

def generate_tags_from_sentences(tagger, sentences, size=200, modifier=lambda x: x[1],
                                 filter_func=lambda x: True):
    tags = []
    sentidx = 0
    while len(tags) < size and sentidx < len(sentences):
        tags.extend(filter(filter_func, tagger(sentences[sentidx])))
        sentidx += 1
    tags = tags[:size]
    tags = map(modifier, tags)
    return tags

def map_tag(tag, mapping):
    if tag in mapping:
        return mapping[tag]
    return tag

tags_hunpos = map(
    lambda x: map_tag(x, others_mapping),
    generate_tags(taggers[0][0], words[:size], lambda x: x[1].decode('utf-8'))
)
tags_perceptron = map(
    lambda x: map_tag(x, others_mapping),
    generate_tags(taggers[1][0], words[:size])
)
tags_bert = map(
    lambda x: map_tag(x, bert_mapping),
    generate_tags_from_sentences(bert_tagger, sentences, size,
                                 modifier=lambda x: x['entity'],
                                 filter_func=lambda x: len(x['word']) == 0 or x['word'][0] != '#'
                                )
)

print('\nAnalysis of dissimilarity of taggers:\n')
for i, (tag1, tag2, tag3) in enumerate(zip(tags_hunpos, tags_perceptron, tags_bert)):
    if tag1 != tag2 or tag2 != tag3:
        print(f"Mismatch on word {i}: {words[i]} {tag1} {tag2} {tag3}")
        print(' '.join(words[max(0, i-padding):i+padding+1]))
        print()

'''
All three models are mostly the same. The difference between them
is not that big. The sets of tags are very different between NLTK and BERT.
This difference may have introduced part of the difference after mapping.
'''

['It was a bright cold day in April, and the clocks were striking thirteen.', 'Winston Smith, his chin nuzzled into his breast in an effort to escape the\nvile wind, slipped quickly through the glass doors of Victory Mansions,\nthough not quickly enough to prevent a swirl of gritty dust from entering\nalong with him.']
['It', 'was', 'a', 'bright', 'cold', 'day', 'in', 'April', ',', 'and', 'the', 'clocks', 'were', 'striking', 'thirteen', '.', 'Winston', 'Smith', ',', 'his']


Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Tagger Hunpos:
[('It', b'PRP'), ('was', b'VBD'), ('a', b'DT'), ('bright', b'JJ'), ('cold', b'JJ'), ('day', b'NN'), ('in', b'IN'), ('April', b'NNP'), (',', b','), ('and', b'CC'), ('the', b'DT'), ('clocks', b'NNS'), ('were', b'VBD'), ('striking', b'VBG'), ('thirteen', b'NN'), ('.', b'.'), ('Winston', b'NNP'), ('Smith', b'NNP'), (',', b','), ('his', b'PRP$')]
Tagger Perceptron:
[('It', 'PRP'), ('was', 'VBD'), ('a', 'DT'), ('bright', 'JJ'), ('cold', 'JJ'), ('day', 'NN'), ('in', 'IN'), ('April', 'NNP'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('clocks', 'NNS'), ('were', 'VBD'), ('striking', 'VBG'), ('thirteen', 'NN'), ('.', '.'), ('Winston', 'NNP'), ('Smith', 'NNP'), (',', ','), ('his', 'PRP$')]
Tagger BERT:
[('it', 'PRON'), ('was', 'VERB'), ('a', 'PROPN'), ('bright', 'ADJ'), ('cold', 'ADJ'), ('day', 'NOUN'), ('in', 'ADP'), ('april', 'PROPN'), (',', 'PUNCT'), ('and', 'CCONJ'), ('the', 'DET'), ('clocks', 'NOUN'), ('were', 'VERB'), ('striking', 'VERB'), ('thirteen', 'NUM'), ('.', 'PUNCT'), 

'\nAll three models are mostly the same. The difference between them\nis not that big. The sets of tags are very different between NLTK and BERT.\nThis difference may have introduced part of the difference after mapping.\n'

In [9]:
# Stemming
stemmers = [
    (nltk.stem.wordnet.WordNetLemmatizer().lemmatize, 'WordNet'),
    (nltk.stem.lancaster.LancasterStemmer().stem, 'Lancaster'),
    (nltk.stem.porter.PorterStemmer().stem, 'Porter'),
    (nltk.stem.regexp.RegexpStemmer('ing$|s$|e$|able$').stem, 'RegexpStemmer'),
    (nltk.stem.rslp.RSLPStemmer().stem, 'RSLP'),
    (nltk.stem.snowball.EnglishStemmer().stem, 'Snowball English')
]

for stemmer, name in stemmers:
    print(f"Stemmer {name}:")
    print(list(map(lambda x: stemmer(x), words[:20])))

'''
Each stemmer does something a little bit different.
One difference is in the size of letters, WordNet and Regexp allow for capital letters by default.
Another difference is in the stemming of verbs. Some do the stemming of verbs into infinitive
form, others rip off suffixes.
'''

Stemmer WordNet:
['It', 'wa', 'a', 'bright', 'cold', 'day', 'in', 'April', ',', 'and', 'the', 'clock', 'were', 'striking', 'thirteen', '.', 'Winston', 'Smith', ',', 'his']
Stemmer Lancaster:
['it', 'was', 'a', 'bright', 'cold', 'day', 'in', 'april', ',', 'and', 'the', 'clock', 'wer', 'striking', 'thirteen', '.', 'winston', 'smi', ',', 'his']
Stemmer Porter:
['it', 'wa', 'a', 'bright', 'cold', 'day', 'in', 'april', ',', 'and', 'the', 'clock', 'were', 'strike', 'thirteen', '.', 'winston', 'smith', ',', 'hi']
Stemmer RegexpStemmer:
['It', 'wa', 'a', 'bright', 'cold', 'day', 'in', 'April', ',', 'and', 'th', 'clock', 'wer', 'strik', 'thirteen', '.', 'Winston', 'Smith', ',', 'hi']
Stemmer RSLP:
['it', 'wa', 'a', 'bright', 'cold', 'day', 'in', 'april', ',', 'and', 'the', 'clock', 'wer', 'striking', 'thirteen', '.', 'winston', 'smith', ',', 'hi']
Stemmer Snowball English:
['it', 'was', 'a', 'bright', 'cold', 'day', 'in', 'april', ',', 'and', 'the', 'clock', 'were', 'strike', 'thirteen', '.', '

'\nEach stemmer does something a little bit different.\nOne difference is in the size of letters, WordNet and Regexp allow for capital letters by default.\nAnother difference is in the stemming of verbs. Some do the stemming of verbs into infinitive\nform, others rip off suffixes.\n'

### __Exercise 3.2: Shallow parsing using NLTK__
- Use the list of tagged words from the previous example to create a shallow syntactic tree for the first sentence of the 1984 novel.
- In this tree, deal primarily with simple noun and verb phrases, but you may also deal with other, more complex syntactic structures if you wish.
- Notes on the solution:
 - NLTK can do most of the work for you (search for the corresponding docs or StackOverflow questions).
 - For example, the class `RegexpParser` can come in handy (as included in the pre-filled code cell below). It only needs a grammar, which defines individual syntactic groups (noun phrases, verb phrases, etc.) based on the sequence of POS tags of the respective words contained in them.
 - A grammar is simply a string defining rules for regular expression analysis (see pre-defined code below).
 - E.g., a sequence of two or more adjectives would match the following rule: `JJ2: {<JJ><JJ>+}`.
 - You can assume that noun phrases typically consist of nouns (`<NN.*>`) and their modification by adjectives (`<JJ>`) with the possible use of articles (`<DT>`), while verb phrases consist of some unbroken sequence of verb forms (`<VB.*>`).
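
To make the notes above more concrete, here is a minimal sketch of `RegexpParser` with a two-rule grammar. It is not the intended solution of the exercise, only an illustration. The tagged words are written out by hand (punctuation omitted) so that the example runs without any tagger, and the `chunks` helper is a name introduced just for this sketch.

```python
import nltk

# hand-written (word, POS tag) pairs for the first sentence of the novel
tagged = [
    ("It", "PRP"), ("was", "VBD"), ("a", "DT"), ("bright", "JJ"),
    ("cold", "JJ"), ("day", "NN"), ("in", "IN"), ("April", "NNP"),
    ("and", "CC"), ("the", "DT"), ("clocks", "NNS"), ("were", "VBD"),
    ("striking", "VBG"), ("thirteen", "NN"),
]

grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # optional article, adjectives, one or more nouns
  VP: {<VB.*>+}             # unbroken sequence of verb forms
"""

parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)

def chunks(tree, label):
    """Collect the text of all chunks with the given label (helper for this sketch)."""
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]

np_chunks = chunks(tree, "NP")
vp_chunks = chunks(tree, "VP")
print(np_chunks)
print(vp_chunks)
```

The resulting tree is flat: every chunk hangs directly under the root. Nested structures need rules that refer to previously created chunk labels, possibly combined with the `loop` argument of `RegexpParser`.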

In [10]:
# TODO - complete the grammar for shallow parsing
'''
This is a readable grammar for the tag set produced by the BERT model.
See also the RegexpParser documentation. As the parser applies the rules
as regular expressions, it needs multiple passes (the `loop` argument) to
build nested structures.
(Basically it needs up to one pass for each layer of the syntactic tree.)
'''
chunk_grammar = """
  NP: {<PRON|NOUN|PROPN|NUM>}
  NP: {<ADJ><NP>}
  NP: {<NP><ADV>}
  ADV: {<ADP><NP>}
  NPD: {<DET><NP>}
  VP: {<VERB>|<AUX><VERB>?<NP|NPD>}
  CCONJ: {<PUNCT><CCONJ>?}
  SENT: {<NP|NPD><VP>}
  S: {<SENT><CCONJ><S>|<SENT><PUNCT>}
  NP: {<NP><CCONJ><SENT><CCONJ>}
"""

'''
I have also tried parsing sentences other than the first one,
but as the RegexpParser does not allow backtracking, I have failed to do so.
'''

def remove_split_words(words):
    # The BERT model sometimes splits some words with the second part
    # starting with '##', this function counteracts this behavior
    no_splits = []
    last_word = words[0]
    for word, tag in words[1:]:
        assert len(word) >= 1
        if word[:2] == '##':
            last_word = (last_word[0] + word[2:], tag)
        else:
            no_splits.append(last_word)
            last_word = (word, tag)
    no_splits.append(last_word)
    return no_splits

# the actual syntactic analysis
tagged_words = list(map(
    lambda x: (x['word'], x['entity']),
    bert_tagger(sentences[0])
))
tagged_words = remove_split_words(tagged_words)
chunk_parser = nltk.RegexpParser(chunk_grammar, root_label='S', loop=20)
chunked = chunk_parser.parse(tagged_words)

# printing the syntactic tree
print(tagged_words)
print('A shallow parse tree of the first sentence of 1984:')
print(chunked)

[('it', 'PRON'), ('was', 'AUX'), ('a', 'DET'), ('bright', 'ADJ'), ('cold', 'ADJ'), ('day', 'NOUN'), ('in', 'ADP'), ('april', 'PROPN'), (',', 'PUNCT'), ('and', 'CCONJ'), ('the', 'DET'), ('clocks', 'NOUN'), ('were', 'AUX'), ('striking', 'VERB'), ('thirteen', 'NUM'), ('.', 'PUNCT')]
A shallow parse tree of the first sentence of 1984:
(S
  (SENT
    (NP it/PRON)
    (VP
      was/AUX
      (NPD
        a/DET
        (NP
          (NP bright/ADJ (NP cold/ADJ (NP day/NOUN)))
          (ADV in/ADP (NP april/PROPN))))))
  (CCONJ ,/PUNCT and/CCONJ)
  (SENT
    (NPD the/DET (NP clocks/NOUN))
    (VP were/AUX striking/VERB (NP thirteen/NUM)))
  (CCONJ ./PUNCT))


---

## 4. Sentiment analysis

__Basic facts__
- Analysis of emotional polarity of texts, or possibly multimodal data (reviews, news, political debates, social networks, etc.).
- A technique with a number of practical applications (product development, marketing, market analysis and forecasting, public opinion mining, disease outbreak detection and management, crime detection, etc.).

__Your task__
- Split into groups (min 2, max 4 people).
- Check the NLTK documentation, StackOverflow, etc., to find out which NLTK module(s) can be used for sentiment analysis and how they work. Feel free to use other sentiment analysis tools, though, should you find anything else that feels more appropriate and/or convenient.
- Choose the approach that works best for you and try to answer the following questions:
  - What is the overall tone of the 1984 novel? Is it a glorious utopia, or rather a gloomy dystopia?
  - What is the most cheerful sentence (and optionally also a paragraph) of the novel?
  - What is the most desperate sentence (and optionally also a paragraph) of the novel?
- Take your time, play around (no need to have a complete solution - what's important is to search for possible approaches, play and learn). Feel free to ask for help anytime.
- Discuss your analysis of the 1984 novel, and your general observations with the lab tutor. The collaborating members of the group with the best relative results and/or an interesting/elegant/efficient/unusual solution can earn bonus points.
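
To get a feeling for how a lexicon-based approach may work, consider the toy sketch below. It is *not* how any particular NLTK module is implemented: the `LEXICON` values, the `toy_sentiment` function and the sample sentences are all assumptions made up for illustration. It does show the general shape of the task, though: score each sentence, aggregate the scores for the overall tone, and pick the extremes.

```python
import re

# Hypothetical mini-lexicon mapping words to polarity scores
# (made up for this sketch; real lexicons contain thousands of entries).
LEXICON = {"good": 1.0, "love": 1.5, "funny": 1.0,
           "cruel": -1.5, "blind": -1.0, "hate": -1.5}

def toy_sentiment(sentence, lexicon=LEXICON):
    """Average lexicon score of the known words in the sentence (0.0 if none)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

sample_sentences = [
    "Good.",
    "O cruel, needless misunderstanding!",
    "The clocks were striking thirteen.",
]

scored = [(toy_sentiment(s), s) for s in sample_sentences]

# overall tone: mean score over all sentences
overall = sum(score for score, _ in scored) / len(scored)
# extremes: the sentences with the highest and the lowest score
most_positive = max(scored, key=lambda pair: pair[0])[1]
most_negative = min(scored, key=lambda pair: pair[0])[1]

print(overall, most_positive, most_negative, sep="\n")
```

Note that very short sentences tend to get extreme scores in this kind of averaging, which is worth keeping in mind when interpreting the "most cheerful" and "most desperate" candidates.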

__Optional task__
- Try to solve the same task using a [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) (a state of the art deep learning architecture, especially for NLP tasks).
 - You may get an off-the-shelf model, such as a [HuggingFace](https://huggingface.co/transformers/) one, as described for instance [here](https://satish1v.medium.com/sentiment-analysis-with-hugging-face-4b080d0cf34d).
 - You may also try to train (or rather [fine-tune](https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning))) your own transformer-based model, as described for instance [here](https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671) or [here](https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/).
- Compare the results with the "classical" sentiment analysis method you implemented before - which one works better, and why (or rather: does such a question even make sense at all)?

In [11]:
# TODO - analysis of the sentiment of the novel 1984
#        (the actual solution is totally up to you)

from nltk.sentiment.vader import SentimentIntensityAnalyzer as vader_analyzer
from transformers import pipeline

'''
I have used two models: Vader, provided by NLTK, and DistilBERT, the default
model for sentiment analysis provided by HuggingFace.

As can be seen in the output, Vader yields "vectors" of sentiment, i.e.,
how positive, negative or neutral the sentence feels, whereas DistilBERT
only classifies the input sentence as either `POSITIVE` or `NEGATIVE` and
reports how certain it is of that decision.
The output of Vader can thus be used as is, while the output of DistilBERT
should be taken with a grain of salt, as the sentences cannot be sorted by
anything other than the certainty. This should, however, still be quite
a good measure.
'''

classifier = pipeline('sentiment-analysis')
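# the Vader lexicon has to be available locally (no-op if already downloaded)
nltk.download('vader_lexicon', quiet=True)
vader = vader_analyzer().polarity_scores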

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [12]:
print("Computing the analysis of sentiment using transformer...")
# each item is a (sentence, [{'label': ..., 'score': ...}]) pair
sentiment = [(s, classifier(s)) for s in sentences]
print("Done")
print("Computing the analysis of sentiment using Vader...")
# each item is a (sentence, {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}) pair
sentiment_vader = [(s, vader(s)) for s in sentences]
print("Done")

# split the transformer results by the predicted label
positive = [item for item in sentiment if item[1][0]['label'] == 'POSITIVE']
negative = [item for item in sentiment if item[1][0]['label'] == 'NEGATIVE']

Computing the analysis of sentiment using transformer...
Done
Computing the analysis of sentiment using Vader...
Done


In [13]:
# Using Vader:

# 'pos' and 'neg' are the proportions of the sentence rated positive and negative
positive_sorted = sorted(sentiment_vader, key=lambda x: x[1]['pos'], reverse=True)
negative_sorted = sorted(sentiment_vader, key=lambda x: x[1]['neg'], reverse=True)

# print the 10 most positive sentences, then the 10 most negative ones
for sentence, scores in positive_sorted[:10]:
    print(sentence.replace('\n', ' '))
    print(scores)
    print()

for sentence, scores in negative_sorted[:10]:
    print(sentence.replace('\n', ' '))
    print(scores)
    print()

Good.
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4404}

Good.
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4404}

Thanks, comrade!'
{'neg': 0.0, 'neu': 0.239, 'pos': 0.761, 'compound': 0.4926}

Yes, YOU!
{'neg': 0.0, 'neu': 0.251, 'pos': 0.749, 'compound': 0.4574}

And, yes!
{'neg': 0.0, 'neu': 0.251, 'pos': 0.749, 'compound': 0.4574}

'Thass funny.
{'neg': 0.0, 'neu': 0.256, 'pos': 0.744, 'compound': 0.4404}

It showed respect, like.
{'neg': 0.0, 'neu': 0.263, 'pos': 0.737, 'compound': 0.6808}

Yes, I would.'
{'neg': 0.0, 'neu': 0.27, 'pos': 0.73, 'compound': 0.4019}

Yes, even...
{'neg': 0.0, 'neu': 0.27, 'pos': 0.73, 'compound': 0.4019}

'The truth, please, Winston.
{'neg': 0.0, 'neu': 0.303, 'pos': 0.697, 'compound': 0.5574}

A riot!
{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.5983}

Damn!
{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.4574}

O cruel, needless misunderstanding!
{'neg': 0.873, 'neu': 0.127, 'pos': 0.0, 'compound': -0.784}

He was blind, he
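
The VADER ranking above is dominated by very short sentences, because `pos` and `neg` are proportions of the text: a one-word sentence such as "Good." trivially reaches 1.0. A possible alternative, sketched below under stated assumptions, is to rank by the `compound` score and to skip very short sentences. The helper name `rank_by_compound` and the `min_words=3` cut-off are hypothetical choices, while the sample scores are copied from the output above.

```python
# a few (sentence, scores) pairs copied from the VADER output above
vader_sample = [
    ("Good.", {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4404}),
    ("It showed respect, like.", {'neg': 0.0, 'neu': 0.263, 'pos': 0.737, 'compound': 0.6808}),
    ("A riot!", {'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.5983}),
    ("O cruel, needless misunderstanding!", {'neg': 0.873, 'neu': 0.127, 'pos': 0.0, 'compound': -0.784}),
]


def rank_by_compound(results, min_words=3, reverse=True):
    """Sort (sentence, scores) pairs by the compound score.

    Sentences with fewer than `min_words` whitespace-separated tokens are
    dropped first; reverse=True puts the most positive sentences on top.
    """
    kept = [(s, sc) for s, sc in results if len(s.split()) >= min_words]
    return sorted(kept, key=lambda item: item[1]['compound'], reverse=reverse)


most_positive = rank_by_compound(vader_sample)
most_negative = rank_by_compound(vader_sample, reverse=False)
print(most_positive[0][0])
print(most_negative[0][0])
```

In the notebook itself, the same function could be applied to `sentiment_vader` in place of the two `sorted` calls.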

In [14]:
# Using transformer:

# the only available ranking criterion is the classifier's confidence in its label
positive_sorted = sorted(positive, key=lambda x: x[1][0]['score'], reverse=True)
negative_sorted = sorted(negative, key=lambda x: x[1][0]['score'], reverse=True)

# print the 10 most confidently positive sentences, then the 10 most confidently negative ones
for sentence, result in positive_sorted[:10]:
    print(sentence.replace('\n', ' '))
    print(result)
    print()

for sentence, result in negative_sorted[:10]:
    print(sentence.replace('\n', ' '))
    print(result)
    print()

We have glorious news for you.
[{'label': 'POSITIVE', 'score': 0.9998831748962402}]

The taste was delightful.
[{'label': 'POSITIVE', 'score': 0.9998797178268433}]

It's fascinating.'
[{'label': 'POSITIVE', 'score': 0.9998770952224731}]

Don't you like feeling: This is me, this is my hand, this is my leg, I'm real, I'm solid, I'm alive!
[{'label': 'POSITIVE', 'score': 0.9998766183853149}]

In fact I'm proud of her.
[{'label': 'POSITIVE', 'score': 0.9998745918273926}]

His body was healthy and strong.
[{'label': 'POSITIVE', 'score': 0.999874472618103}]

'Yes, perfectly.'
[{'label': 'POSITIVE', 'score': 0.9998735189437866}]

'It is a beautiful thing,' said the other appreciatively.
[{'label': 'POSITIVE', 'score': 0.9998729228973389}]

'She's beautiful,' he murmured.
[{'label': 'POSITIVE', 'score': 0.999871015548706}]

I'm good at games.
[{'label': 'POSITIVE', 'score': 0.9998704195022583}]

'Nonsense.
[{'label': 'NEGATIVE', 'score': 0.9998021721839905}]

'Nonsense.
[{'label': 'NEGATIVE', 

---

#### _Final note_ - the materials used in this notebook are adapted from works licensed by the original authors as follows:
- Picture of the first edition of the novel 1984:
  - Retrieved from Wikipedia [here](https://en.wikipedia.org/wiki/File:1984first.jpg)
  - Author: [Brown University Library](http://library.brown.edu/search/c?SEARCH=PR6029.R8+N49+1949b)
  - License: none, or rather [Public Domain](https://en.wikipedia.org/wiki/public_domain) in the US, but probably still copyrighted in the country of origin (UK)
- The novel itself:
  - Retrieved from the Australian [Project Gutenberg](https://www.gutenberg.org/) site [here](http://gutenberg.net.au/ebooks01/0100021.txt)
  - Author: [George Orwell](https://en.wikipedia.org/wiki/George_Orwell)
  - License: [Public Domain](https://en.wikipedia.org/wiki/public_domain) in Australia and possibly in other jurisdictions, but in general, the copyright is held by the George Orwell estate and the text should be treated accordingly in terms of its public or any other use