# Collecting and Preparing Text using Python

---
---

## 5 Steps of Text-Mining
There is no set way to do text-mining, but typically a workflow will involve steps like these:
1. Choosing and collecting your text
2. Cleaning and preparing your text
3. Exploring your data
4. Analysing your data
5. Presenting the results of your analysis

You may go through these steps more than once to refine your data and results, and frequently steps may be merged together. The important thing to realise is that steps 1-2 are critical in ensuring your data is capable of actually answering your research questions. You are likely to spend significant time on cleaning and preparing your text.

> **Rubbish in = rubbish out**

This notebook covers steps 1-3. The next notebook `3-analysing-and-visualising.ipynb` will show steps 4-5.

---
---

## More Python Basics

Before we start in earnest to code again, we need to cover a few more Python basics.

### Imports

Python has a lot of amazing capabilities built into the language itself, like being able to manipulate strings. However, in any Python project you are likely to want to use Python code written by someone else to go beyond the built-in capabilities. Code 'written by someone else' comes in the form of a file (or files) separate from the one you are currently working on.

An external Python file is called a **module** (and a group of such files is sometimes called a **package**). In order to use one in your code, you need to **import** it.

This is a simple process using the **keyword** `import` and the name of the module. Just make sure that you `import` something _before_ you want to use it!

In [129]:
import this

Obviously, that is a trivial example. It simply prints out the philosophy of the Python programming language.

You can also `import` modules and then use them:

In [130]:
import math
math.pi

3.141592653589793

In [131]:
import random
random.random()

0.6062029383574712

In [132]:
import locale
locale.getlocale()

('en_GB', 'UTF-8')
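
Two other forms of `import` appear later in this notebook and in other people's code, so it helps to recognise them. This is a minimal sketch using only standard-library modules; the alias `rnd` is an arbitrary name chosen for illustration.

```python
# Import a single name from a module, so it can be used without the prefix
from math import sqrt

# Import a whole module under a shorter alias ('rnd' is an arbitrary name)
import random as rnd

root = sqrt(16)          # the same function as math.sqrt
rnd.seed(42)             # fix the seed so the result is repeatable
number = rnd.random()    # a float from 0.0 up to (but not including) 1.0

print(root)
print(0.0 <= number < 1.0)
```

Whichever form you use, the rule is the same: the `import` line must run before the first line that uses the imported name.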

---
#### Going Further: The Python Standard Library

If you are interested in delving into everything that the Python language has to offer, you can browse the [Python standard library](https://docs.python.org/3/library/index.html) and try some of the modules there by importing them.

---
---
#### Going Further: How to Get 'Code Written By Someone Else'

I have completely glossed over how you get hold of modules and packages from other sources. There is more than one answer to this and it depends on how you have installed Python on your computer.

My recommended way to install Python for data science is with a distribution called [Anaconda](https://www.anaconda.com/distribution/), which has a built-in package manager called Conda -- but even more *excitingly*, comes bundled with over 1,500 packages already installed, including most of what you could possibly need for your text-mining and data analysis. Here is [Installing Anaconda on Windows](https://www.datacamp.com/community/tutorials/installing-anaconda-windows) and [Installing Anaconda on Mac](https://www.datacamp.com/community/tutorials/installing-anaconda-mac-os-x).

If you have installed Python the [traditional way](https://www.python.org/downloads/), the answer is the [Python Package Index](https://pypi.org/), known as PyPI, but that is out of scope for this workshop. Feel free to learn about this yourself with the tutorial [What Is Pip? A Guide for New Pythonistas](https://realpython.com/what-is-pip/) from RealPython.

---

As you will see in the sections below, we will `import` the Natural Language Toolkit (or parts of it), which is a massive **library** dedicated to working with natural language. A library is simply a *collection of modules* dedicated to some topic.

### Functions
A function is a _reusable block of code_ that has been wrapped up and given a _name_. In order to run the code, we use the name followed by `()` (parentheses). We have already done this several times. Here are all the functions (or methods) we have run so far:

In [133]:
# 'lower()' is the function
my_sentence = 'Butterflies are important as pollinators.'
my_sentence.lower()

'butterflies are important as pollinators.'

In [134]:
# 'upper()' is the function
my_sentence.upper()

'BUTTERFLIES ARE IMPORTANT AS POLLINATORS.'

In [135]:
# 'isalpha()' is the function
my_sentence.isalpha()

False

In [136]:
# 'random()' is the function
random.random()

0.02960725977527978

In [137]:
# 'getlocale()' is the function
import locale
locale.getlocale()

('en_GB', 'UTF-8')

---
#### Going Further: Functions and Methods
There is a technical difference between functions and methods. You don't need to worry about the distinction for our workshop. We will treat all functions and methods as the same.

If you are interested in learning more about functions and methods try this [Datacamp Python Functions Tutorial](https://www.datacamp.com/community/tutorials/functions-python-tutorial).

---

#### Functions that Take Arguments
If we need to pass particular information to a function, we put that information _in between_ the `()`. Like this:

In [138]:
math.sqrt(25)

5.0

The `25` is the value we want to pass to the `sqrt()` function so it can do its work. This value is called an **argument** to the function. Functions may take any number of arguments, depending on what the function needs.

Here is another function with an argument:

In [139]:
import calendar
calendar.isleap(2019)

False

Essentially, you can think of a function as a box. You put an input into the box (the input may be nothing), the box does something with the input, and then the box gives you back an output. You generally don't need to worry _how_ the function does what it does (unless you really want to, in which case you can look at its code). You just know that it works.
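
The 'box' idea can be sketched with a few standard-library functions that take zero, one and two arguments. The variable names here are illustrative only.

```python
import math

# No arguments: the parentheses are still needed to run the function
shouted = 'iliad'.upper()

# One argument
root = math.sqrt(25)

# Two arguments, separated by a comma: the number and the decimal places
rounded = round(math.pi, 2)

# The output of one box can be the input of another
nested = math.sqrt(abs(-25))

print(shouted, root, rounded, nested)
```

In each case we only care about what goes in and what comes out, not how the function does its work.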

> ***Functions are the basis of how we 'get stuff done' in Python.***

With the `get()` function from the `requests` library below, we can get the text of a Web page!

In [140]:
import requests
response = requests.get('https://www.wikipedia.org/')
response.text[0:270]

'<!DOCTYPE html>\n<html lang="mul" class="no-js">\n<head>\n<meta charset="utf-8">\n<title>Wikipedia</title>\n<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">\n'

The string `'https://www.wikipedia.org/'` is the argument we want to pass to the `get()` function for it to open the Web page and read it for us.

Why not try your own URL? What explains the strange appearance of this text? What happens if you print the whole of `response.text` instead of slicing out the first 270 characters?

---
---
## Step 1: Choosing and Collecting Your Text
No matter your research subject, you need to be aware of the many issues of electronic data collection. We cannot cover them all here, but you should ask yourself some questions as you start to collect data, such as:
* What sort of data do I need to answer my research questions?
* What data is available?
* What is the quality of the data?
* How can I get the data?
* Am I allowed to use it for text-mining?

### A Simple Example: Top Words Used in Homer's Iliad

Our research question will be:

> What are the top 10 words used in Homer's Iliad in English translation?

#### What sort of data do I need to answer my research questions?

I need a copy of Homer's Iliad in English translation. In this instance, I am not bothered by which translation.

#### What data is available?

[Project Gutenberg](http://www.gutenberg.org/) is the first provider of free electronic books and has over 58,000. "You will find the world's great literature here, with focus on older works for which U.S. copyright has expired. Thousands of volunteers digitized and diligently proofread the eBooks, for enjoyment and education."

Here is Homer's Iliad, translated by Alexander Pope, in an edition published in 1899: http://www.gutenberg.org/ebooks/6130

#### What is the quality of the data?

Potentially variable. When some books are digitised by OCR ([Optical Character Recognition](https://en.wikipedia.org/wiki/Optical_character_recognition)) they don't get corrected before being published online, but a quick look at this file shows that it is excellent quality.

#### How can I get the data?

Project Gutenberg clearly states on their [Terms of Use](http://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use) that their website is 'intended for human users only'. If you want to use code to get their data you must use one of their [mirror sites](http://www.gutenberg.org/MIRRORS.ALL) -- you should pick the one that is nearest to your location.

We will be using the text file at http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/6/1/3/6130/6130-8.txt

#### Am I allowed to use it for text-mining?

Project Gutenberg says in their [Permission: How To](http://www.gutenberg.org/wiki/Gutenberg:Permission_How-To) that "The vast majority of Project Gutenberg eBooks are in the public domain in the US." However, since UK copyright is different from US copyright, we still have to check for ourselves. This is a complicated area, but broadly we can say that UK copyright expires 70 years after the death of the author. Since [Alexander Pope](https://en.wikipedia.org/wiki/Alexander_Pope) died in 1744, we are probably ok to use his work.

### Getting a Copy of Homer's Iliad Text
We saw above that we can use a Python library called `requests` to get Web pages. We can therefore get a copy of the text file like this:

In [141]:
response = requests.get('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/6/1/3/6130/6130-8.txt')
iliad = response.text
iliad[18007:18500]

'It is on the coast, at\r\nsome distance from the city, northward, and appears to have been an open\r\ntemple of Cybele, formed on the top of a rock. The shape is oval, and in\r\nthe centre is the image of the goddess, the head and an arm wanting. She\r\nis represented, as usual, sitting. The chair has a lion carved on each\r\nside, and on the back. The area is bounded by a low rim, or seat, and\r\nabout five yards over. The whole is hewn out of the mountain, is rude,\r\nindistinct, and probably of the '

We can find out how many characters the file has by using the `len()` function.

In [142]:
len(iliad)

1201763

We can search for a particular string in the file. The function `find()` returns the index of the _first_ matching string it finds.

In [143]:
word = 'shield'
iliad.find(word)

183065
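
To see how `len()`, `find()` and slicing fit together, here is a small sketch on a short made-up string rather than the whole book. The sample sentence is invented for illustration. It also shows the string method `count()` and what `find()` returns when there is no match.

```python
sample = 'The shield of Achilles; a shield of bronze.'

length = len(sample)                       # number of characters
first = sample.find('shield')              # index of the first match
second = sample.find('shield', first + 1)  # start searching after the first match
missing = sample.find('sword')             # -1 means 'not found'
total = sample.count('shield')             # how many times the string occurs

# Slice out some context around the first match
context = sample[first - 4:first + 18]

print(length, first, second, missing, total)
print(context)
```

Checking for `-1` before slicing is worthwhile, because a negative index silently counts from the end of the string instead of raising an error.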

---
---

## Steps 2 and 3: Cleaning and Exploring Your Data
We are going to combine these two steps in this workshop.
### Inspecting and Preparing the Text
The first thing to do is inspect the text and see what might need sorting out. Looking again at the text by eye (http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/6/1/3/6130/6130-8.txt) you can see that the book starts with a load of front matter we don't want.

The book actually starts after the text "`***START OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***`":

In [144]:
iliad[565:700]

'The Iliad of Homer\r\n\r\n\r\nTranslated by Alexander Pope,\r\n\r\nwith notes by the\r\nRev. Theodore Alois Buckley, M.A., F.S.A.\r\n\r\nand\r\n\r\nFlaxman'

There is also unwanted matter at the end after "`***END OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***`" that we should get rid of too.

Why does the text have all these `\r` and `\n` in it?

---
#### Going Further: OCR Errors
We are very fortunate that this text does not suffer from common OCR errors, where the OCR process has 'transcribed' the text incorrectly. We won't be covering what to do about this in this workshop, but if you are curious you can read more about how the British Library has dealt with this in a blog post [Dealing with Optical Character Recognition errors in Victorian newspapers](https://blogs.bl.uk/digital-scholarship/2016/07/dealing-with-optical-character-recognition-errors-in-victorian-newspapers.html).

---

### Creating and Preparing a Local Copy

It is not very efficient to keep making Web requests to Project Gutenberg, especially with a very large corpus. I have therefore downloaded a copy for us (`6130-8.txt`) and placed it in our project under the `data` folder. We will use this local copy instead from now on.

I have also taken some steps to prepare the file on your behalf, to save us some time. In the spirit of full transparency and documentation here is what I have done:

* Removed the unwanted Gutenberg-related matter at the front and back of the book.
* Converted the character encoding from 'ISO 8859-1' to 'UTF-8'.

You don't need to worry about the details of _character encoding_ for this workshop. You only need to know that Python works most easily with UTF-8 files and so we must have the file in that encoding to avoid problems.
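
The two preparation steps could also be done in code. This is only a sketch of one possible approach, not a record of how the workshop file was actually prepared. It works on a tiny made-up stand-in for the download and assumes both marker lines are present exactly as quoted earlier.

```python
START = '***START OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***'
END = '***END OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***'

# A tiny stand-in for the downloaded file, held as ISO 8859-1 bytes
raw_bytes = (
    'Front matter\r\n' + START + '\r\nThe Iliad of Homer\r\nThe Ægean sea\r\n'
    + END + '\r\nBack matter\r\n'
).encode('iso-8859-1')

# Decode the bytes using the encoding the file was saved in
raw_text = raw_bytes.decode('iso-8859-1')

# Keep only what lies between the two markers
start = raw_text.find(START) + len(START)
end = raw_text.find(END)
book = raw_text[start:end].strip()

# Re-encode as UTF-8, ready to be written to disk in binary mode
utf8_bytes = book.encode('utf-8')

print(book)
```

With a real file you would read the bytes from disk first and write `utf8_bytes` back out afterwards; if either marker were missing, `find()` would return `-1` and the slice would be wrong, so a real script should check for that.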

---
#### Going Further: Editing Text Files
Text files are files that should be *opened* as plain text and nothing else. They often have file extensions such as `.txt`, `.html`, `.xml`, `.csv`. Microsoft Word documents (`.doc` and `.docx`) are not plain text and you should never edit your `.txt` files in Microsoft Word or WordPad. You need a proper **text editor**.

Recommended text editors:
* Windows: [Notepad++](https://notepad-plus-plus.org/)
* All platforms: [Sublime Text](https://www.sublimetext.com/)

---

---
#### Going Further: Character Encoding
Character encoding is a very important topic, but it is not an easy one. If you end up dealing with a lot of text files in building up your corpus you will have to be aware that dealing with files that have different, or unknown, character encodings can get very messy. If you don't know, or wrongly assume the character encoding of a file you can end up with this sort of thing: ��� ࡻࢅ࢖

This is especially important if your corpus is written in a non-English language, because the accents or non-Latin alphabet characters of the text may get mangled. The short answer to the problem is to always make sure your files are encoded with 'UTF-8'.
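
The mangling is easy to reproduce. This small sketch saves a word as UTF-8 and then reads the same bytes back with the wrong encoding; the word 'café' is just an example.

```python
original = 'café'

# Store the text as UTF-8 bytes: the 'é' takes up two bytes
utf8_bytes = original.encode('utf-8')

# Wrongly assume the bytes are latin-1: each byte becomes its own character
mangled = utf8_bytes.decode('latin-1')

# Reading with the correct encoding gives the original text back
restored = utf8_bytes.decode('utf-8')

print(mangled)    # cafÃ©
print(restored)   # café
```

When opening files yourself, passing `encoding='utf-8'` to Python's built-in `open()` function avoids relying on whatever the default happens to be on your machine.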

> **'UTF-8' is often not the default encoding on Windows machines. This means that you can quickly end up in a mess when you edit and save your 'UTF-8' text files on Windows. The encoding may be automatically changed to 'ISO-8859-1' or 'latin-1'. You should find out how to save files as 'UTF-8' in your text editor.**

---

### Tokenising the Text
Now we are ready to start preparing and exploring our text. _Tokenising_ means splitting a text into meaningful elements, such as **words, sentences, or symbols**.

To do this we use a simple facility provided by the Natural Language Toolkit (NLTK) to read in the file and a function to do the tokenising for us. The code example below takes a single file and tokenises it. Remember NLTK is a library we need to `import` in order to use it in our code.

> **Important!**

> **The following code is the hardest code that will be presented in this notebook. You do not need to understand everything here so please don't lose heart at this point! 💖**

In [166]:
import nltk

# Download the tokeniser
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/mary/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [172]:
from pathlib import Path
import os

# Get a plain text reader
from nltk.corpus.reader import PlaintextCorpusReader
reader = PlaintextCorpusReader('.', '')

# Read the text file
data_path = Path('data')
iliad_file = os.path.join(data_path, '6130-8.txt')
text = reader.raw(iliad_file)
text[0:700]

"The Iliad of Homer\r\n\r\n\r\nTranslated by Alexander Pope,\r\n\r\nwith notes by the\r\nRev. Theodore Alois Buckley, M.A., F.S.A.\r\n\r\nand\r\n\r\nFlaxman's Designs.\r\n\r\n1899\r\n\r\n\r\n\r\n\r\n\r\nCONTENTS\r\n\r\n\r\nINTRODUCTION.\r\nPOPE'S PREFACE TO THE ILIAD OF HOMER\r\nBOOK I.\r\nBOOK II.\r\nBOOK III.\r\nBOOK IV.\r\nBOOK V.\r\nBOOK VI.\r\nBOOK VII.\r\nBOOK VIII.\r\nBOOK IX.\r\nBOOK X.\r\nBOOK XI.\r\nBOOK XII.\r\nBOOK XIII.\r\nBOOK XIV.\r\nBOOK XV.\r\nBOOK XVI.\r\nBOOK XVII.\r\nBOOK XVIII.\r\nBOOK XIX.\r\nBOOK XX.\r\nBOOK XXI.\r\nBOOK XXII.\r\nBOOK XXIII.\r\nBOOK XXIV.\r\nCONCLUDING NOTE.\r\n\r\n\r\n\r\n\r\n\r\nILLUSTRATIONS\r\n\r\n\r\nHOMER INVOKING THE MUSE.\r\nMARS.\r\nMINERVA REPRESSING THE FURY OF ACHILLES.\r\nTHE DEPARTURE OF BRISEIS FROM THE TENT OF ACHILLES.\r\nTHETIS CALLING BRIAREUS TO THE A"

In [168]:
# Import the tokeniser
from nltk import word_tokenize

# Tokenise the text and print the first 20 tokens
tokens = word_tokenize(text)
tokens[0:20]

['The',
 'Iliad',
 'of',
 'Homer',
 'Translated',
 'by',
 'Alexander',
 'Pope',
 ',',
 'with',
 'notes',
 'by',
 'the',
 'Rev',
 '.',
 'Theodore',
 'Alois',
 'Buckley',
 ',',
 'M.A.']

You can also `import` and use the sentence tokeniser `sent_tokenize` instead. Try this yourself.
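
To get a feel for why a dedicated tokeniser is useful, compare it with simply splitting on whitespace. The sketch below does not use NLTK: the regular expression is a deliberately crude stand-in written for illustration, and it is not how `word_tokenize` works internally.

```python
import re

sample = 'Translated by Alexander Pope, with notes by the Rev. Buckley.'

# Splitting on whitespace leaves punctuation glued to the words
by_whitespace = sample.split()

# Crude stand-in tokeniser: runs of word characters, or single punctuation marks
by_regex = re.findall(r"\w+|[^\w\s]", sample)

print(by_whitespace)
print(by_regex)
```

In the first list `'Pope,'` and `'Rev.'` keep their punctuation; in the second the punctuation becomes separate tokens, which is closer to the NLTK output shown above.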

There are a number of problems with these tokens: the capitalisation of the words has been preserved, and some of the tokens have unwanted special characters or comprise single items of punctuation.

### Normalising to Lowercase
Normalising all words to lowercase ensures that the same word in different cases can be recognised as the same word, e.g. we want 'Shield', 'shield' and 'SHIELD' to be recognised as the same word.

However, whether you choose to do this depends on the nature of your corpus and the questions you are investigating. For example, in another case, you may not want the word 'Conservative' to be conflated with the word 'conservative'.

In our case, we will lowercase the whole file immediately before tokenising it:

In [169]:
text_lower = text.lower()
tokens = word_tokenize(text_lower)
tokens[0:20]

['the',
 'iliad',
 'of',
 'homer',
 'translated',
 'by',
 'alexander',
 'pope',
 ',',
 'with',
 'notes',
 'by',
 'the',
 'rev',
 '.',
 'theodore',
 'alois',
 'buckley',
 ',',
 'm.a.']
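
The effect of lowercasing on word counts can be seen in a small sketch. The token list is made up, and `Counter` from the standard library is used here only to tally the tokens.

```python
from collections import Counter

sample_tokens = ['Shield', 'shield', 'SHIELD', 'Conservative', 'conservative']

# Count the tokens exactly as they are
counts_as_is = Counter(sample_tokens)

# Count them again after lowercasing each one
counts_lower = Counter(token.lower() for token in sample_tokens)

print(counts_as_is['shield'])         # 1
print(counts_lower['shield'])         # 3
print(counts_lower['conservative'])   # 2
```

The three spellings of 'shield' are merged, which is what we want here, but so are 'Conservative' and 'conservative', which is the trade-off described above.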

### Removing Punctuation
Punctuation such as commas, full stops and apostrophes can complicate processing a corpus. For example, if punctuation is left in, the words "poet" and "poet," might be considered to be different words.

This is a complicated matter, however, and what you choose to do would vary depending on the nature of your corpus and what questions you wish to ask.

It may be appropriate to remove punctuation at different stages of processing. In our case we are going to remove it *after* the text has been tokenised.

We will replace *all* punctuation with the empty string ''.

In [170]:
# Import a module that helps with string processing
import string

# Make a table that 'translates' all punctuation to None (i.e. empty) 
table = str.maketrans('', '', string.punctuation)
punc_table = {chr(key):value for (key, value) in table.items()}
punc_table

{'!': None,
 '"': None,
 '#': None,
 '$': None,
 '%': None,
 '&': None,
 "'": None,
 '(': None,
 ')': None,
 '*': None,
 '+': None,
 ',': None,
 '-': None,
 '.': None,
 '/': None,
 ':': None,
 ';': None,
 '<': None,
 '=': None,
 '>': None,
 '?': None,
 '@': None,
 '[': None,
 '\\': None,
 ']': None,
 '^': None,
 '_': None,
 '`': None,
 '{': None,
 '|': None,
 '}': None,
 '~': None}

In [171]:
tokens_nopunct = [token.translate(table) for token in tokens]
tokens_nopunct[0:20]

['the',
 'iliad',
 'of',
 'homer',
 'translated',
 'by',
 'alexander',
 'pope',
 '',
 'with',
 'notes',
 'by',
 'the',
 'rev',
 '',
 'theodore',
 'alois',
 'buckley',
 '',
 'ma']
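
It is worth checking what this approach does to tokens that contain punctuation *inside* a word. The sketch below applies the same kind of table to a few made-up tokens. Note that `string.punctuation` only lists ASCII punctuation, so typographic quotation marks pass through untouched.

```python
import string

# The same kind of translation table as above
table = str.maketrans('', '', string.punctuation)

sample_tokens = ['poet', ',', "pope's", 'well-known', 'm.a.', '“hector”']
stripped = [token.translate(table) for token in sample_tokens]

print(stripped)
# ['poet', '', 'popes', 'wellknown', 'ma', '“hector”']
```

Whether 'popes' and 'wellknown' are acceptable depends on your research question; for a simple top-10 word count they are unlikely to matter.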

### Removing Non-Word Tokens

We are still left with some problematic tokens that are not useful words, such as empty tokens `''` and tokens that may be chapter numbers:

In [151]:
tokens_empty = [word for word in tokens_nopunct if word == '']
tokens_empty[0:10]

['', '', '', '', '', '', '', '', '', '']

In [152]:
tokens_nonwords = [word for word in tokens_nopunct if word.isnumeric()]
tokens_nonwords[0:10]

['1899', '1', '2', '3', '4', '5', '6', '7', '8', '9']

We can remove both of these by filtering for only those words that are alphabetic:

In [153]:
words = [word for word in tokens_nopunct if word.isalpha()]
words[0:20]

['the',
 'iliad',
 'of',
 'homer',
 'translated',
 'by',
 'alexander',
 'pope',
 'with',
 'notes',
 'by',
 'the',
 'rev',
 'theodore',
 'alois',
 'buckley',
 'ma',
 'fsa',
 'and',
 'flaxman']
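
To see exactly what the `isalpha()` filter keeps and drops, here is a short sketch on a handful of made-up tokens.

```python
sample_tokens = ['homer', '', '1899', 'book1', 'ægean']

# Show how each token answers the two questions
for token in sample_tokens:
    print(repr(token), token.isalpha(), token.isnumeric())

# Keep only the purely alphabetic tokens
kept = [token for token in sample_tokens if token.isalpha()]
print(kept)   # ['homer', 'ægean']
```

The empty string and the number are dropped, as intended. So is the mixed token `'book1'`, which is neither purely alphabetic nor purely numeric, while accented and non-English letters such as 'æ' still count as alphabetic.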

I have saved our clean words into a file (`CLEAN-6130-8.txt`) and placed it in our project under the `data` folder. You can inspect it now if you like.
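
If you want to save a list of words yourself, one possible approach is sketched below. The layout (one word per line), the file name and the temporary folder are all assumptions made for this example; they are not a description of how `CLEAN-6130-8.txt` was produced.

```python
import tempfile
from pathlib import Path

words = ['the', 'iliad', 'of', 'homer']

# Hypothetical output location: a temporary folder keeps the sketch self-contained
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / 'clean-words-example.txt'

# Write one word per line, explicitly encoded as UTF-8
out_file.write_text('\n'.join(words), encoding='utf-8')

# Read the words back in to check nothing was lost
words_again = out_file.read_text(encoding='utf-8').split('\n')
print(words_again == words)
```

Stating the encoding explicitly when writing and reading avoids the Windows default-encoding problem described earlier.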

---
---
## Summary

At this point, we should stop and marvel at what we have achieved. 😃

We have: 

* Chosen the right data for our research question
* Downloaded a public-domain text file from an online repository
* Manually cleaned it of unwanted material
* Tokenised it into words
* Normalised the words into lowercase
* Removed punctuation, empty tokens and digits

👏 👏 👏

The next notebook `3-analysing-and-visualising.ipynb` will show how we can analyse and visualise our data and recommend further resources.
