# PB016: Artificial Intelligence I, labs 10 - Natural language processing

This week's topic is natural language processing (NLP). We'll focus on the following:
1. __Text acquisition and pre-processing__
2. __Tokenization, tagging and stemming (dummy pipeline)__
3. __The NLTK NLP library, shallow syntactic analysis (smarter pipeline)__
4. __Sentiment analysis__

---

## 1. Text acquisition and preprocessing

__Basic facts__
- Usually a necessary step before applying the natural language processing methods themselves.
- It consists mainly of preparing texts for machine processing (e.g., removal of OCR noise and markup, splitting into segments, normalisation, etc.).

__Examples of typical tasks__
- Conversion and cleaning of text - obtaining material (e.g., by downloading), encoding transformations, conversion from the original format to plain text, possible removal of noise in the form of OCR errors, formatting and other marks, annotation passages, etc.
- Removal of "irrelevant" words - filtering of [stop-list](https://en.wikipedia.org/wiki/Stop_word) expressions that are a valid part of the language, but introduce noise in the context of a given NLP task (e.g., articles, prepositions and even certain verbs and nouns in English are noise for most tasks based on the "bag of words" approach, where only statistical parameters of the text matter regardless of its explicit syntactic structure).
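
As a minimal sketch of stop-word filtering in a "bag of words" setting, the snippet below counts the words of a sentence after dropping those in a stop list. The stop list `TOY_STOP_WORDS` and the helper `bag_of_words` are assumptions made up for this illustration; real stop lists are much longer and depend on the task.

```python
from collections import Counter

# hypothetical, deliberately tiny stop list (for illustration only)
TOY_STOP_WORDS = {'a', 'an', 'the', 'in', 'and', 'it', 'was', 'were'}

def bag_of_words(text, stop_words=TOY_STOP_WORDS):
    """Count the lower-cased words of the text, skipping the stop words."""
    # crude whitespace tokenization with punctuation stripped from word edges
    words = [w.strip('.,;!?').lower() for w in text.split()]
    return Counter(w for w in words if w and w not in stop_words)

bow = bag_of_words(
    'It was a bright cold day in April, and the clocks were striking thirteen.'
)
print(bow)
```

Only the content-bearing words remain in the counter, which is typically what a bag-of-words model is meant to work with.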

### Downloading a sample text

![orwell](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/1984first.jpg)

In [None]:
# import library for opening URLs, etc.
import urllib.request

# open a link to sample text

sample_text_link = "http://gutenberg.net.au/ebooks01/0100021.txt"
f = urllib.request.urlopen(sample_text_link)

# decoding the contents of the link (just convert the binary string to text -
# it's already in a relatively clean plain text format)

sample_text = f.read().decode("utf-8")

# print the beginning and ending of the text

beginning = sample_text[:4115]
ending = sample_text[-6315:]

print('***** The beginning of the "raw version" of the 1984 novel *****\n')
print(beginning)

print('\n***** The end of the "raw version" of the 1984 novel *****\n')
print(ending)

### Cleaning the sample text
- Removal of meta-data about the publication and of the appendices after the story, plus other adjustments aimed at obtaining the text itself without structural annotations (i.e., annotation of parts, chapters, etc.).
- Notes on the solution:
  - The procedure is often very arbitrary, depending on the source text and what we want to do with it.
  - A good start is to look at parts of the text, e.g., with `print(sample_text[:K])` and `print(sample_text[-K:])`, where `K` is a reasonably small number of characters (see above).
  - From what we see, we decide what to delete, replace, etc.
  - Substitutions using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) are often useful for text cleanup (without the use of specialized NLP libraries, but also with them). For details, see [re](https://docs.python.org/3/library/re.html) module in the standard Python library, specifically the `re.sub()` function.
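
To illustrate how `re.sub()` can be used for this kind of cleanup, here is a small sketch on a made-up sample string that only mimics the structure of the raw text. Anchoring the patterns to whole lines with `re.MULTILINE` is a design choice of this sketch, not something the source text requires.

```python
import re

# a small made-up sample mimicking the structure of the raw text
sample = ('PART ONE\n\nChapter 1\n\nIt was a bright cold day.\r\n\n'
          'Chapter 2\n\nTHE END')

# delete the carriage returns first
cleaned = sample.replace('\r', '')
# drop structural markers, but only when they occupy a whole line
cleaned = re.sub(r'^(PART [A-Z]+|Chapter [0-9]+|THE END)$', '', cleaned,
                 flags=re.MULTILINE)
# collapse the runs of blank lines left behind and trim the edges
cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()
print(cleaned)
```

Matching whole lines reduces the risk of deleting a phrase such as "the end" when it appears inside a regular sentence.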

In [None]:
# cutting the metadata in the beginning

cleaner_text = sample_text.split('PART ONE')[1]

# cutting the appendix after the main story

cleaner_text = cleaner_text.split('APPENDIX')[0]

# deleting the '\r' characters

cleaner_text = cleaner_text.replace('\r','')

# removing structural annotations using the RE module

import re

cleaner_text = re.sub('PART [A-Z]+','',cleaner_text)
cleaner_text = re.sub('Chapter [0-9]+','',cleaner_text)
cleaner_text = re.sub('THE END','',cleaner_text)

# cutting the whitespace around the text

cleaner_text = cleaner_text.strip()

# printing the beginning and ending

cleaner_beginning = cleaner_text[:3010]
cleaner_ending = cleaner_text[-412:]

print('***** Cleaner beginning of the 1984 novel *****\n')
print(cleaner_beginning)

print('\n***** Cleaner ending of the 1984 novel *****\n')
print(cleaner_ending)

---
## 2. Tokenization, tagging and stemming (dummy pipeline)

__Basic facts - tokenization__
- Creating lists of individual paragraphs, sentences and words from the sample text.

__Basic facts - tagging__
- Assignment of grammatical or other categories to individual parts of the text.
- One of the most common methods of tagging text is to assign corresponding part of speech tags to individual words (POS - part of speech tagging).
- However, more complex units can also be tagged, e.g., noun and verb phrases, modifier phrases, etc., other, more abstract parts of syntactic trees, semantic tags, etc.

__Basic facts - stemming (and lemmatisation)__
- The process of reducing inflected (or sometimes derived) words to their word stem, base or root form, or some other sort of a canonical form.
 - Stemming is reduction of the word to a form that is most common among all its morphological variants.
 - Lemmatization is another (though often very related) form of normalisation by grouping together the inflected forms of a word so they can be analysed as a single canonical item that still bears the meaning of the original word - a lemma, or also a dictionary form.
- An example in English: `wait, waits, waiting, ...` $\rightarrow$ `wait` (stem), `wait` (lemma).
- An example in Czech: `čekat, čeká, čekající, ...` $\rightarrow$ `ček` (stem), `čekat` (lemma).
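
The difference between the two examples above can be sketched in a few lines. Note the assumptions: the longest common prefix is used here only as a rough stand-in for a stem (real stemmers apply language-specific suffix rules), and `TOY_LEMMAS` is a hypothetical hand-made dictionary, not a real lexical resource.

```python
import os.path
import unicodedata

def crude_stem(variants):
    """Longest common prefix of the variants (a rough stand-in for a stem)."""
    # normalise so that accented letters are compared as single characters
    normalized = [unicodedata.normalize('NFC', v) for v in variants]
    return os.path.commonprefix(normalized)

# hypothetical hand-made lemma dictionary (for illustration only)
TOY_LEMMAS = {
    'waits': 'wait', 'waiting': 'wait',
    'čeká': 'čekat', 'čekající': 'čekat',
}

def toy_lemma(word):
    """Look the word up in the toy dictionary; unknown words map to themselves."""
    return TOY_LEMMAS.get(word, word)

print(crude_stem(['wait', 'waits', 'waiting']), toy_lemma('waiting'))
print(crude_stem(['čekat', 'čeká', 'čekající']), toy_lemma('čekající'))
```

The stem need not be a valid word (as in the Czech case), while the lemma is always a dictionary form.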

### Naive tokenization of the sample text
- Notes on the solution:
  - For paragraphs, one can take advantage of the fact that they are separated by double new lines. However, it might also be good to deal with the individual lines in the paragraphs themselves so that they become a uniform, unwrapped text.
  - For splitting the text into sentences and words it is possible to use the punctuation marks or spaces, respectively, and the `split()` function (either from the standard Python library or from the `re` module).
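
Before looking at the tokenizers, it may help to see what splitting on punctuation actually returns. The sketch below compares a plain split with a variant that keeps the sentence terminators; the function names are made up for this illustration, and neither version handles abbreviations such as "Mr." correctly.

```python
import re

def naive_sentences(text):
    """Split on '.', '?' or '!'; the terminators are lost."""
    return re.split(r'[.?!]', text)

def sentences_keep_punct(text):
    """Split on whitespace that follows '.', '?' or '!'; terminators are kept."""
    parts = re.split(r'(?<=[.?!])\s+', text)
    return [p for p in parts if p]

text = 'It was cold. Was it April? Yes!'
print(naive_sentences(text))
print(sentences_keep_punct(text))
```

The plain split leaves leading spaces and a trailing empty string in the result, which is worth keeping in mind when counting or averaging over sentences.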

In [3]:
# function to create a paragraph list from the input text
def paragraph_tokenizer(text,min_wlen=5):
  """Tokenizing the text to paragraphs.

  Parameters
  ----------
  text : str
      The input text to be tokenized.
  min_wlen : int, optional (default is 5)
      The minimum number of words in a paragraph for the paragraph
      to be included in the tokenized output list.

  Returns
  -------
  list
      List of paragraph strings contained in the input text.
  """

  splits = [x.replace('\n', ' ') for x in text.split('\n\n')]
  return [x for x in splits if len(x.split()) >= min_wlen]

# function to create a sentence list from the input text
def sentence_tokenizer(text):
  """Tokenizing the text to sentences.

  Parameters
  ----------
  text : str
      The input text to be tokenized.

  Returns
  -------
  list
      List of sentence strings contained in the input text.
  """

  return re.split(r'[.?!]', text)

# function to create a word list from the input text
def word_tokenizer(text):
  """Tokenizing the text to words.

  Parameters
  ----------
  text : str
      The input text to be tokenized.

  Returns
  -------
  list
      List of word strings contained in the input text.
  """

  return [x.strip('.,;!') for x in text.split()]

In [None]:
# creating a list of paragraphs of the novel 1984 and listing the first 4
paragraphs = paragraph_tokenizer(cleaner_text)
print('The first 4 paragraphs of 1984:\n', paragraphs[:4])
# creating a list of sentences of the novel 1984 and listing the first 2
sentences = []
for paragraph in paragraphs:
  sentences += sentence_tokenizer(paragraph)
print('\nThe first 2 sentences of 1984:\n', sentences[:2])
# creating a list of words of the novel 1984 and listing the first 8
words = []
for sentence in sentences:
  words += word_tokenizer(sentence)
print('\nThe first 8 words of 1984:\n', words[:8])

### Naive tagging and stemming of the sample text
- Tagging the first sentence of the 1984 novel with symbols of the relevant parts of speech (POS).
- Subsequent stemming depending on the specific POS tags.
- Notes on the solution:
  - One can simply assign tags to individual words according to a hard-coded look-up dictionary (which is not very smart or scalable in practice, but it's OK for the purposes of this exercise).
  - One can assume that the tags for each relevant word type are as follows: `DT` for articles ("a", "an", "the"), `NN` for nouns or numerals, `VB` for verbs, `PR` for pronouns, `JJ` for adjectives, `IN` for prepositions, `CC` for conjunctions, and `??` for unknown words.
  - Stemming simply strips (some of) the word suffixes based on their POS tag then.
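
One detail worth knowing when stripping suffixes in Python: `str.rstrip()` takes a set of characters rather than a suffix, so it can remove more than intended. A minimal sketch of a safer helper follows; `strip_suffix` is a name made up for this illustration (Python 3.9+ also offers `str.removesuffix()` for the same purpose).

```python
def strip_suffix(word, suffix):
    """Remove the suffix from the end of the word, at most once."""
    if suffix and word.endswith(suffix):
        return word[:-len(suffix)]
    return word

# rstrip() keeps removing trailing 'e' and 'd' characters one by one
print('needed'.rstrip('ed'))
# the helper removes exactly one occurrence of the suffix
print(strip_suffix('needed', 'ed'))
print(strip_suffix('striking', 'ing'))
print(strip_suffix('clocks', 's'))
```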

In [5]:
# naive POS tagger

def dummy_pos_tagger(words):
  """Adding POS tags to the list of input words.

  Parameters
  ----------
  words : list
      The input list of words to be tagged.

  Returns
  -------
  list
      List of `(word, POS_tag)` pairs.
  """

  # list of tagged words
  tagged_words = []
  # look-up dictionary with tags assigned to the words of interest
  tag_dict = {
    'it' : 'PR',
    'was' : 'VB',
    'a' : 'DT',
    'bright' : 'JJ',
    'cold' : 'JJ',
    'day' : 'NN',
    'in' : 'IN',
    'april' : 'NN',
    'and' : 'CC',
    'the' : 'DT',
    'clocks' : 'NN',
    'were' : 'VB',
    'striking' : 'VB',
    'thirteen' : 'NN'
  }
  # the tagging itself
  for word in words:
    tag = '??' # tag for unknown words
    if word.lower() in tag_dict:
      tag = tag_dict[word.lower()]
    tagged_words.append((word,tag))
  return tagged_words

# naive stemmer

def dummy_stemmer(tagged_words):
  """Stemming the words in the list of input `(word, POS_tag)`
  pairs.

  Parameters
  ----------
  tagged_words : list
      The input list of pairs containing words and their POS tags.

  Returns
  -------
  list
      List of the stems of the words in the input list.
  """

  stemmed_words = []
  for word, tag in tagged_words:
    stemmed = word
    # note: `str.rstrip()` strips a set of characters, not a suffix, so the
    # suffixes are cut off by slicing instead
    if tag == 'NN' and word.endswith('s'):
      stemmed = word[:-1]
    elif tag == 'VB':
      if word.endswith('ed'):
        stemmed = word[:-2]
      elif word.endswith('ing'):
        stemmed = word[:-3]
    stemmed_words.append(stemmed)
  return stemmed_words

In [None]:
# printing the words of the first sentence
words = word_tokenizer(sentences[0])
print('RAW WORDS   :\n'+'\n'.join(words))
tagged_words = dummy_pos_tagger(words)
# printing the tagged words of the first sentence
print('\nTAGGED WORDS:\n'+'\n'.join([str(x) for x in tagged_words]))
stemmed_words = dummy_stemmer(tagged_words)
# printing the stemmed words of the first sentence
print('\nSTEMMED WORDS:\n'+'\n'.join([str(x) for x in stemmed_words]))

---

## 3. The NLTK NLP library, shallow syntactic analysis (smarter pipeline)

__Basic facts__
- There are a number of mature tools for doing NLP.
- One of the widely used libraries in Python is the Natural Language Toolkit - [NLTK](https://www.nltk.org/).
- NLTK contains easy-to-use implementations of a number of state of the art techniques, algorithms, corpora, etc.
 - It can thus solve all the above tasks (tokenization, tagging, stemming), among other things, which you will experiment with yourselves.
- Once these tasks are solved, one can focus on one of the "holy grails" of NLP - the syntactic analysis (also, parsing) of sentences in natural language.
  - Essential for determining grammatical correctness of sentences and also a prerequisite to various methods of determining meaning of sentences in as broad range of contexts as possible (semantic and discourse analysis - arguably the so far rather unattainable ultimate goals of NLP).
  - Parsing can be relatively [shallow](https://en.wikipedia.org/wiki/Shallow_parsing), or more complex, such as [dependency](https://en.wikipedia.org/wiki/Dependency_grammar) and/or [probabilistic](https://en.wikipedia.org/wiki/Statistical_parsing) (for more details, see the lecture and/or specialized NLP courses).
  - However, in these labs we will only deal with the shallow one.

### __Exercise 3.1: Tokenization, tagging and stemming using NLTK__
- Implement a simple NLP pipeline solving the tasks presented in the previous sections using the [NLTK](https://www.nltk.org/) library (which will lead to a much more systematic and robust solution).
- More specifically, check the NLTK docs and other relevant online materials to find out how to do the following:
 - Tokenize the cleaned text of the 1984 novel to sentences.
 - Tokenize the first sentence of the novel to words.
 - POS-tag the first sentence of the novel.
 - Stem the words in the first sentence of the novel.

In [7]:
# importing NLTK
import nltk

# TODO - YOUR OWN SOLUTION GOES HERE

### __Possible way of implementing the tokenization, tagging and stemming using NLTK__

In [None]:
# importing the necessary NLTK modules
import nltk
from nltk.stem import PorterStemmer

# downloading the necessary NLTK data
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

# tokenization of all cleaned text to sentences
sentences = nltk.sent_tokenize(cleaner_text)
# tokenization of the first sentence to words
first_words = nltk.word_tokenize(sentences[0])
# POS tagging of the first sentence
tagged_words = nltk.pos_tag(first_words)
# stemming the words in the first sentence
stemmer = PorterStemmer()
tagged_stemmed_words = [(x,y,stemmer.stem(x)) for x, y in tagged_words]

# printing the tagged and stemmed sentence
print('\nThe first sentence of the novel 1984, POS-tagged and stemmed:')
print('\n'.join(['  '+str(x) for x in tagged_stemmed_words]))

### __Exercise 3.2: Shallow parsing using NLTK__
- Use the list of tagged words from the previous example to create a shallow syntactic tree for the first sentence of the 1984 novel.
- In this tree, deal primarily with simple noun and verb phrases, but you may also deal with other, more complex syntactic structures if you wish.
- Notes on the solution:
 - NLTK can do most of the work for you (search for the corresponding docs or StackOverflow questions).
 - For example, the class `RegexpParser` can come in handy (as included in the pre-filled code cell below). It only needs a grammar, which defines individual syntactic groups (noun phrases, verb phrases, etc.) based on the sequence of POS tags of the respective words contained in them.
 - A grammar is simply a string defining rules for regular expression analysis (see pre-defined code below).
 - E.g., a sequence of two or more adjectives would match the following rule: `JJ2: {<JJ><JJ>+}`.
 - You can assume that noun phrases typically consist of nouns (`<NN.*>`) and their modification by adjectives (`<JJ>`) with the possible use of articles (`<DT>`), while verb phrases consist of some unbroken sequence of verb forms (`<VB.*>`).
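
To make the idea of matching a rule against a sequence of POS tags more concrete, here is a plain-Python sketch that does not use NLTK at all. It is only an approximation of what a chunk rule does, under the assumption that the tags are joined into one string and matched with an ordinary regular expression; the helper `chunk_spans` and the hand-tagged sample are made up for this illustration.

```python
import re

def chunk_spans(tags, tag_pattern):
    """Return (start, end) word-index spans whose tags match the pattern."""
    # encode the tag sequence as a single string, e.g. '<DT><JJ><NN>'
    tag_string = ''.join('<%s>' % t for t in tags)
    spans = []
    for m in re.finditer(tag_pattern, tag_string):
        # every '<' before the match start corresponds to one preceding word
        start = tag_string[:m.start()].count('<')
        end = start + m.group(0).count('<')
        spans.append((start, end))
    return spans

# hand-tagged sample (the tags are assumed here, not produced by a tagger)
tagged = [('It', 'PRP'), ('was', 'VBD'), ('a', 'DT'),
          ('bright', 'JJ'), ('cold', 'JJ'), ('day', 'NN')]
# optional article, any number of adjectives, then a noun
np_pattern = r'(?:<DT>)?(?:<JJ>)*<NN[^>]*>'

spans = chunk_spans([t for _, t in tagged], np_pattern)
noun_phrases = [' '.join(w for w, _ in tagged[s:e]) for s, e in spans]
print(noun_phrases)
```

A chunk grammar expresses the same kind of pattern more compactly and returns a tree instead of index spans.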

### __Possible approach to shallow parsing using NLTK__

In [None]:
# the grammar for shallow parsing
chunk_grammar = """
  NP: {<DT>?<JJ>*<NN.*>}
  VP: {<VB.*>+}
"""

# the actual syntactic analysis
chunk_parser = nltk.RegexpParser(chunk_grammar)
chunked = chunk_parser.parse(tagged_words)

# printing the syntactic tree
print('A shallow parse tree of the first sentence of 1984:')
print(chunked)

---

## 4. Sentiment analysis

__Basic facts__
- Analysis of emotional polarity of texts, or possibly multimodal data (reviews, news, political debates, social networks, etc.).
- A technique with a number of practical applications (product development, marketing, market analysis and forecasting, public opinion mining, disease outbreak detection and management, crime detection, etc.).
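
As a rough intuition for what "polarity" means, the sketch below scores a text with a tiny word lexicon. Everything in it is an assumption made for illustration: the lexicon `TOY_LEXICON` and its values are invented, and practical tools use far larger lexicons or trained models and handle negation, intensifiers, etc.

```python
# hypothetical toy lexicon: word -> polarity in [-1, 1] (for illustration only)
TOY_LEXICON = {'bright': 0.5, 'love': 0.8, 'cold': -0.3,
               'hate': -0.8, 'fear': -0.6}

def toy_polarity(text, lexicon=TOY_LEXICON):
    """Mean polarity of the lexicon words found in the text (0.0 if none)."""
    words = [w.strip('.,;!?').lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity('It was a bright cold day.'))
print(toy_polarity('I hate fear.'))
```

Positive values suggest a positive tone and negative values a negative one; words outside the lexicon are simply ignored.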

__Your task__
- Split into groups (min 2, max 4 people).
- Check the NLTK documentation, StackOverflow, etc., to find out which NLTK module(s) can be used for sentiment analysis and how they work. Feel free to use other sentiment analysis tools, though, should you find anything else that feels more appropriate and/or convenient.
- Choose the approach that works best for you and try to answer the following questions:
 - What is the overall tone of the 1984 novel? Is it a glorious utopia, or rather a gloomy dystopia?
 - What is the most cheerful sentence (and optionally also a paragraph) of the novel?
 - What is the most desperate sentence (and optionally also a paragraph) of the novel?
- Take your time, play around (no need to have a complete solution - what's important is to search for possible approaches, play and learn). Feel free to ask for help anytime.
- Discuss your analysis of the 1984 novel, and your general observations with the lab tutor. The collaborating members of the group with the best relative results and/or an interesting/elegant/efficient/unusual solution can earn bonus points.

__Optional task__
- Try to solve the same task using a [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) (a state of the art deep learning architecture, especially for NLP tasks).
 - You may get an off-the-shelf model, such as a [HuggingFace](https://huggingface.co/transformers/) one, as described for instance [here](https://satish1v.medium.com/sentiment-analysis-with-hugging-face-4b080d0cf34d).
 - You may also try to train (or rather [fine-tune](https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning))) your own transformer-based model, as described for instance [here](https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671) or [here](https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/).
- Compare the results with the "classical" sentiment analysis method you implemented before - which one works better, and why (or rather: does such a question even make sense at all)?

In [None]:
# TODO - analysis of the sentiment of the novel 1984
#        (the actual approach is totally up to you)

### __Possible approach to sentiment analysis of the 1984 novel (using NLTK)__

#### Importing a pre-trained sentiment analysis module, downloading its data, initialising the module

In [None]:
# import and download of the required classes and data (Valence Aware
# Dictionary and sEntiment Reasoner)
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# initialising the analyser
analyzer = SentimentIntensityAnalyzer()

#### Getting the overall sentiment, using the whole text as input

In [None]:
# analysis of the whole text at once and listing of relevant polarities
polarity_overall = analyzer.polarity_scores(cleaner_text)
print('\nOverall polarity of the 1984 novel (entire text in one go):')
print(' ', polarity_overall)

#### Getting the most extreme sentences

In [None]:
# variables for continuous updating of the most extreme sentences
most_neg, most_pos = 1, -1
extreme_sentences = {
    'pos' : {'text' : None, 'polarity' : None},
    'neg' : {'text' : None, 'polarity' : None}
}
# analysis of the whole text by sentences, calculating the average polarity of
# all sentences
avg_polarity = {'neg': 0, 'neu': 0, 'pos': 0, 'compound': 0}
for sen in sentences:
  polarity = analyzer.polarity_scores(sen)
  avg_polarity['neg'] += polarity['neg']
  avg_polarity['pos'] += polarity['pos']
  avg_polarity['neu'] += polarity['neu']
  avg_polarity['compound'] += polarity['compound']
  if polarity['compound'] > most_pos:
    extreme_sentences['pos']['text'] = sen.replace('\n',' ')
    extreme_sentences['pos']['polarity'] = polarity
    most_pos = polarity['compound']
  if polarity['compound'] < most_neg:
    extreme_sentences['neg']['text'] = sen.replace('\n',' ')
    extreme_sentences['neg']['polarity'] = polarity
    most_neg = polarity['compound']
for key in avg_polarity:
  avg_polarity[key] /= len(sentences)

# listing of the overall polarity and the most extreme sentences
print('Overall polarity of the 1984 novel (average of sentence polarities):')
print(' ', avg_polarity)
print('\nThe most positive sentence of the 1984 novel:')
print(' ', extreme_sentences['pos']['text'])
print('  ... sentence polarity:', extreme_sentences['pos']['polarity'])
print('\nThe most negative sentence of the 1984 novel:')
print(' ', extreme_sentences['neg']['text'])
print('  ... sentence polarity:', extreme_sentences['neg']['polarity'])

#### Getting the most extreme paragraphs

In [None]:
# variables for continuous updating of the most extreme paragraphs
most_neg, most_pos = 1, -1
extreme_paragraphs = {
    'pos' : {'text' : None, 'polarity' : None},
    'neg' : {'text' : None, 'polarity' : None}
}
# analysis of individual paragraphs (could also be done using the averages of
# polarity of individual sentences)
avg_polarity = {'neg': 0, 'neu': 0, 'pos': 0, 'compound': 0}
for par in paragraphs:
  polarity = analyzer.polarity_scores(par)
  avg_polarity['neg'] += polarity['neg']
  avg_polarity['pos'] += polarity['pos']
  avg_polarity['neu'] += polarity['neu']
  avg_polarity['compound'] += polarity['compound']
  if polarity['compound'] > most_pos:
    extreme_paragraphs['pos']['text'] = par
    extreme_paragraphs['pos']['polarity'] = polarity
    most_pos = polarity['compound']
  if polarity['compound'] < most_neg:
    extreme_paragraphs['neg']['text'] = par
    extreme_paragraphs['neg']['polarity'] = polarity
    most_neg = polarity['compound']
for key in avg_polarity:
  avg_polarity[key] /= len(paragraphs)

# listing of the overall polarity and the most extreme paragraphs
print('Overall polarity of the 1984 novel (average of paragraph polarities):')
print(' ', avg_polarity)
print('\nThe most positive paragraph of the 1984 novel:')
print(' ', extreme_paragraphs['pos']['text'])
print('  ... paragraph polarity:', extreme_paragraphs['pos']['polarity'])
print('\nThe most negative paragraph of the 1984 novel:')
print(' ', extreme_paragraphs['neg']['text'])
print('  ... paragraph polarity:', extreme_paragraphs['neg']['polarity'])

### __Possible approach to sentiment analysis of the 1984 novel (using a transformer)__

#### Installing transformers (if needed)

In [None]:
!pip install transformers

#### Importing and initialising a sentiment analysis pipeline

In [None]:
from transformers import pipeline

# initialising the sentiment analysis pipeline using a specific BERT-based
# model; `device=0` selects the first GPU (use `device=-1` to run on the CPU)
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english",
                      device=0)

#### Function for getting the two most extreme items from a list of chunks, and the average sentiment of the whole list

In [15]:
def get_extremes(chunks,model,max_tokens=512):
  """Getting the highest positive and negative polarities of the input list
  of chunks.

  Parameters
  ----------
  chunks : list
      The input list of text chunks (e.g., sentences or paragraphs) for which
      the sentiment polarities are to be computed.
  model : transformers.pipeline
      A pretrained HuggingFace classifier performing the sentiment analysis.
  max_tokens : int, optional (default is 512)
      The maximum number of tokens allowed in a chunk.

  Returns
  -------
  tuple
      A pair of the following variables:
      - extreme_chunks (dict) - a dictionary storing the texts and polarities
        of the chunks with the highest positive and negative polarities.
      - overall (float) - mean polarity of the input list of chunks.
  """

  # initialising bookkeeping variables
  most_neg, most_pos, overall = 0, 0, 0
  extreme_chunks = {
    'pos' : {'text' : None, 'polarity' : None},
    'neg' : {'text' : None, 'polarity' : None}
  }

  # processing the chunks in the input list
  for chunk in chunks:
    approx_tokens = chunk.split()
    if len(approx_tokens) > max_tokens:
      print('WARNING - skipping a too long chunk:',
            chunk)
      continue
    try:
      polarity = model(chunk)[0]
    except RuntimeError:
      print('WARNING - RuntimeError (chunk too long, probably; omitting):',
            chunk)
      continue
    label, score = polarity['label'], polarity['score']
    if label == 'POSITIVE':
      overall += score
      if score > most_pos:
        most_pos = score
        extreme_chunks['pos']['text'] = chunk
        extreme_chunks['pos']['polarity'] = score
    elif label == 'NEGATIVE':
      overall -= score
      if score > most_neg:
        most_neg = score
        extreme_chunks['neg']['text'] = chunk
        extreme_chunks['neg']['polarity'] = score
  overall /= len(chunks)

  # returning the extreme chunks and the overall score of the text
  return extreme_chunks, overall

#### Getting extreme sentences

In [None]:
# computing the extreme sentences
extreme_sentences, avg_polarity = get_extremes(sentences,classifier)

# listing of the overall polarity and the most extreme sentences
print('Overall polarity of the 1984 novel (mean sentence polarities):')
print(' ', avg_polarity)
print('\nThe most positive sentence of the 1984 novel:')
print(' ', extreme_sentences['pos']['text'])
print('  ... sentence polarity:', extreme_sentences['pos']['polarity'])
print('\nThe most negative sentence of the 1984 novel:')
print(' ', extreme_sentences['neg']['text'])
print('  ... sentence polarity:', -extreme_sentences['neg']['polarity'])

#### Getting extreme paragraphs

In [None]:
# computing the extreme paragraphs
extreme_paragraphs, avg_polarity = get_extremes(paragraphs,classifier)

# listing of the overall polarity and the most extreme paragraphs
print('Overall polarity of the 1984 novel (mean paragraph polarities):')
print(' ', avg_polarity)
print('\nThe most positive paragraph of the 1984 novel:')
print(' ', extreme_paragraphs['pos']['text'])
print('  ... paragraph polarity:', extreme_paragraphs['pos']['polarity'])
print('\nThe most negative paragraph of the 1984 novel:')
print(' ', extreme_paragraphs['neg']['text'])
print('  ... paragraph polarity:', -extreme_paragraphs['neg']['polarity'])

---

#### _Final note_ - the materials used in this notebook are adapted from works licensed by the original authors as follows:
- Picture of the first edition of the novel 1984:
  - Retrieved from Wikipedia [here](https://en.wikipedia.org/wiki/File:1984first.jpg)
  - Author: [Brown University Library](http://library.brown.edu/search/c?SEARCH=PR6029.R8+N49+1949b)
  - License: none, or rather [Public Domain](https://en.wikipedia.org/wiki/public_domain) in the US, but probably still copyrighted in the country of origin (UK)
- The novel itself:
 - Retrieved from the Australian [Project Gutenberg](https://www.gutenberg.org/) site [here](http://gutenberg.net.au/ebooks01/0100021.txt)
 - Author: [George Orwell](https://en.wikipedia.org/wiki/George_Orwell)
 - License: [Public Domain](https://en.wikipedia.org/wiki/public_domain) in Australia and possibly in other jurisdictions, but in general, the copyright is held by the George Orwell estate and the text should be treated accordingly in terms of its public or any other use