In this notebook:

1. [Loading and Tokenising Text](#1)
2. [Loading a Text File: From Your Notebook or From a Website](#2)
3. [Tokenising Strings: Splitting Them into Tokens (Words, etc.)](#3)

<a id="1"></a>
# 1. Loading and Tokenising Text

### Recap of syntax:

In [None]:
# Create a list, get its first item and get its length.
my_letters = ["A","B","C","D","E"] 
print(my_letters[0])
print(len(my_letters))

In [None]:
# Slice of items: from the first item up to (not including) the second-to-last item
print(my_letters[:-2]) 

In [None]:
# 'Comprehend' a list as its lowercase values:
my_letters_lowercased = [letter.lower() for letter in my_letters]
print(my_letters_lowercased) 

In [None]:
# Also: run this cell now. It's the usual imports of text mining libraries.

import nltk
import numpy
import string
import matplotlib.pyplot as plt

<a id="2"></a>
# 2. Loading a Text File: From Your Notebook or From a Website

#### Questions & Objectives

- How can I load a text file from my hard drive or a website?

#### Key Points

- To open and read a file on your computer, we use `open()`, `read()` and `close()`
- To open and read a file from the internet, we use `urllib.request.urlopen()` and `.read().decode('utf-8')`
- Once the file is opened, you can store its contents in a variable

Broadly speaking there are two contexts in which we load a text file for analysis:

- Local file:  you have your file on your [virtualized] computer or hard drive because you created or downloaded it earlier
- Remote file: you access the file directly from some website, 'on the fly', processing it with your code but never really saving it as your own (e.g., for copyright or convenience reasons)

### Loading a local file:

First, let's load a file from your 'hard drive'. Because we are working inside Noteable, it acts as your hard drive. The file `inaugural_speech.txt` sits in the `data/barack_obama_speeches` folder next to this notebook, so we reference it with the path `./data/barack_obama_speeches/inaugural_speech.txt` (the `./` means 'same folder as this notebook').

In [None]:
file_name = "./data/barack_obama_speeches/inaugural_speech.txt"
my_file = open(file_name) # open the file
speech = my_file.read()   # read contents of the file and put them in a variable
my_file.close()           # close the file

# After that you have access to the file as text in the speech variable you created.
print("number of characters:", len(speech)) 
print(speech[:50])       # first 50 characters
print(speech[-50:])      # last 50 characters

### Loading a remote (online) file:
To read the same file from an online source (like from the White House website) we need to import the url-handling library `urllib`, but otherwise the process is very similar to reading a local file.

In [None]:
import urllib.request                   # you only have to do this once per notebook
link = "https://raw.githubusercontent.com/drpawelo/efi_text_mining_bootcamp/master/data/inaugural_speech_obama.txt"
my_file = urllib.request.urlopen(link)  # download the file (no need to open-close)
speech = my_file.read().decode('utf-8') # read and decode content and save it

# After that you have access to the file as text.
print(len(speech))  # how long is it?
print(speech[:50])  # first 50 characters
print(speech[-50:]) # last 50 characters

### What's similar and different:

Notice similarities and differences in both methods:

**GET LIBRARIES (TOOLS):** On top of what Python already gives you, you can use other libraries of code for special tasks such as text analysis and data visualisation. You do this only once per notebook.

- `import ...`

**OPEN**: Both methods need the name and address of the file (file path or website link).

- local:  `open(file_name)`
- remote: `urllib.request.urlopen(link)`

**READ**: With both methods, once you have access to the file, you need to READ its contents and put them in a string variable. Notice that files can come in various 'encodings' (ways of storing characters, including special characters and punctuation, as bytes). A remote file arrives as raw bytes, so we decode it into a string; we usually specify the `UTF-8` (Unicode Transformation Format, 8-bit) encoding, which covers English as well as most other languages. Another common one is `latin1`.

- local: `my_file.read()`
- remote: `my_file.read().decode('utf-8')`
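To make the idea of an encoding more concrete, here is a minimal sketch. It does not download anything: the word `café` is just an illustrative stand-in for the bytes a remote file might contain, and the variable names are made up for this example.

```python
# Text travels over the internet as bytes; an encoding maps those bytes to characters.
raw_bytes = "café".encode("utf-8")           # stand-in for what .read() returns for a remote file

decoded_utf8 = raw_bytes.decode("utf-8")     # decoding with the matching encoding
decoded_latin1 = raw_bytes.decode("latin1")  # decoding with the wrong encoding: no error, but garbled text

print(raw_bytes)
print(decoded_utf8)
print(decoded_latin1)
```

If the text you load shows odd symbols where accented letters or quotation marks should be, a mismatched encoding is a likely cause.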

**CLOSE**: Only with the local file do we need to close it once we've read it. Closing releases the file, so that your notebook does not keep it open longer than necessary and another script or user can work with it later without problems.

- local: `my_file.close()`
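Python also offers a `with` block, which closes the file for you. The sketch below is one possible way to write the open-read-close pattern. It creates its own small throwaway file (the name `demo_text.txt` and its contents are placeholders invented for this example), so it runs without the speech file.

```python
import os
import tempfile

# Create a small throwaway file so that this example runs anywhere.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_text.txt")
with open(demo_path, "w", encoding="utf-8") as demo_file:
    demo_file.write("This is a short demo text.")

# The `with` block closes the file automatically when the block ends,
# even if an error occurs inside it.
with open(demo_path, encoding="utf-8") as my_file:
    demo_text = my_file.read()

print(my_file.closed)  # True: no explicit close() call was needed
print(demo_text)
```

Passing `encoding="utf-8"` to `open()` is optional, but it makes the result less dependent on the default settings of your computer.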

### 🐛Minitask

Write some Python code to open the following online file and display the characters between indices 42380 and 42869 in that file (don't peek at what's in the file). Do you recognise what play this text is from?

http://www.gutenberg.org/files/1513/1513-0.txt

In [None]:
# Write your answer here:



<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    link = "http://www.gutenberg.org/files/1513/1513-0.txt"
    my_file = urllib.request.urlopen(link)
    text = my_file.read().decode('utf-8')

    print(text[42380:42869])
    ### END SOLUTION
    
</details>

### So now we have a long string...what's next?

As we can see, characters are not a particularly useful unit for measuring the length of a text or for accessing parts of it. It would be more meaningful to ask for the first 10 words or the last 10 words. We might also want to consider punctuation and symbols.
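A small sketch of the problem, using the opening of a nursery rhyme as sample text. It compares slicing by characters with the simplest possible word split, Python's built-in `split()`, which only cuts at whitespace.

```python
sample = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall;"

first_characters = sample[:10]    # first 10 characters: stops in the middle of a word
first_words = sample.split()[:6]  # first 6 whitespace-separated chunks

print(first_characters)
print(first_words)
```

Notice that the last chunk is `'wall,'` with the comma still attached: splitting at spaces alone does not separate words from punctuation.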

This is where tokens come in:

<a id="3"></a>
# 3. Tokenising Strings: Splitting Them into Tokens (Words, etc.)

#### Questions & Objectives

- What is tokenisation?
- How can a string of raw text be tokenised?

#### Key Points

- Tokenisation means splitting a string into separate words and punctuation marks so that we can, for example, count them.
- Text can be tokenised using a tokeniser, e.g., the `punkt` tokeniser in NLTK.

In order to process text we need to break it down into tokens. As we explained at the start, a token is a letter, word, number, or punctuation mark which is contained in a string.
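To get a feel for what a tokeniser does, here is a deliberately rough sketch written with Python's built-in `re` module. The function name `simple_tokenize` is made up for this example, and this is not how NLTK works internally; it only illustrates the idea of separating words from punctuation.

```python
import re

def simple_tokenize(text):
    """Very rough tokeniser: runs of letters/digits, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Humpty Dumpty sat on a wall.")
print(tokens)

# A weakness of this simple rule: apostrophes are cut off as separate tokens.
print(simple_tokenize("king's"))
```

A rule this simple breaks `king's` into three pieces, which is one reason to rely on a tested tokeniser rather than our own.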

To tokenise, we first need to import the `word_tokenize` function from the `tokenize` module of NLTK, which allows us to tokenise text without writing the code ourselves.

We will also download a specific tokeniser that NLTK uses as default. There are different ways of tokenising text and today we will use NLTK’s built-in `punkt` tokeniser by calling:

In [None]:
# Run this cell now (it's fine if you see some pink messages underneath it).

from nltk.tokenize import word_tokenize
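nltk.download('punkt')
# Assumption: newer NLTK releases may ask for 'punkt_tab' instead.
# If word_tokenize raises a LookupError, try: nltk.download('punkt_tab')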

Let's tokenise (split into tokens) the nursery rhyme "Humpty Dumpty".

We will save the tokenised output in a list using the `humpty_tokens` variable so we can inspect it.

In [None]:
humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
print(humpty_tokens) # print all tokens

In [None]:
# Let's print just a few of them to have a closer look:
print(humpty_tokens[0:10])

### Unifying and cleaning up the text

To further analyse the data, we'll first learn how to perform some clean-up tasks. 

As you can see in the above example, some of the words start with an uppercase letter and some are entirely lowercase. But Python is case-sensitive, which means that 'hope' and 'Hope' are considered two completely different strings.

For example, when searching for a word or counting the occurrences of a word, we will most likely want to consider both the lowercase and uppercase versions of the word (e.g., `company` and `Company`). That's why, to simplify the analysis, we often normalise the data by making it all lowercase. This way, both of the above words would simply become `company`, making the text easier to analyse.
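A quick sketch of what case-sensitivity means in practice. The sentence is invented for this example, and note that the string method `count()` used here counts matching stretches of characters rather than whole tokens.

```python
same_before = ("hope" == "Hope")         # False: the strings differ in case
same_after = ("hope" == "Hope".lower())  # True once both are lowercase

text = "Company policy says the company pays."
count_mixed_case = text.count("company")         # misses the capitalised 'Company'
count_lowercased = text.lower().count("company") # finds both

print(same_before, same_after)
print(count_mixed_case, count_lowercased)
```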

Since our list of tokens is a list of strings (words and punctuation), we can apply the list comprehension we learned about before to transform our list of mixed-case words into a list of lowercase words.

As you might remember, the syntax for such a comprehension is `[output_format for item in items]` where:

- `output_format` is some operation we perform on item, like `item.lower()` or `len(item)`
- `items` is the list with all the elements we want to transform
- `item` is a temporary name we give to each element of `items`, for the purposes of using that name inside of `output_format`
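The pattern above can be sketched with a short, hand-written list (the list contents are just sample words). The last part shows an ordinary for loop that builds the same result, to make clear what the comprehension does.

```python
items = ["Humpty", "Dumpty", "sat", "on", "a", "wall"]

lowercased = [item.lower() for item in items]  # output_format is item.lower()
lengths = [len(item) for item in items]        # output_format is len(item)

# The same result as `lowercased`, written as an ordinary for loop:
lowercased_loop = []
for item in items:
    lowercased_loop.append(item.lower())

print(lowercased)
print(lengths)
print(lowercased == lowercased_loop)
```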

Let's modify the above example so that we only work with lowercased tokens of the nursery rhyme:

In [None]:
humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
lowercase_tokens = [token.lower() for token in humpty_tokens]
print(lowercase_tokens)

### 🖇💬Buddy Discussion: What would be the coolest text dataset to analyse?

#### Ask your buddy now if they reached the **BUDDY TASK**. Once you both did, complete this task:

Come up with ONE EXAMPLE EACH of a text source that you would LOVE to have access to and analyse. Don't worry if it would be very hard (or impossible) to acquire, just imagine you have a magic wand. For example:

- All the chats in Edinburgh taxis
- 1000 most popular recipes for apple pie
- Transcripts of all job interviews for academic jobs in the UK this year

Don't spend too much time on this (2 minutes maximum) but take note of your favourite idea.

### 🐛Minitask

Let's do our first, very simple piece of analysis. Do you think there were more mentions of 'we' or 'they' in the inaugural speech we looked at before?

Let's re-use some of the code we wrote before to find out:

First without the lowercasing:

- Copy-paste your code from before to load the speech of President Obama.
- Use `word_tokenize()` on that variable to turn it into a list of tokens.
- Count all occurrences of the word 'we'. You can use the `a_list.count(a_word)` method like this: `how_many_we = speech_tokens.count('we')`.
- Print how many there were.
- Do the same to count occurrences of 'they'.

How many times are these words used?

- Now add a list comprehension that changes the list items into their lowercased equivalents. Put it after the line where you tokenise the string, but before you do the counting.

Now which word is more frequent?

In [None]:
# Write your solution here



<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
    ### BEGIN SOLUTION
    file_name = "./data/barack_obama_speeches/inaugural_speech.txt"
    my_file = open(file_name) # open the file
    speech_text = my_file.read() # read content of it and put them in a variable
    my_file.close() # close the file

    speech_tokens = word_tokenize(speech_text)
    speech_tokens = [word.lower() for word in speech_tokens]
    print(speech_tokens.count('we'))
    print(speech_tokens.count('they'))
    ### END SOLUTION
    
</details>


### 🦋 Extra Task (optional): if you have finished everything else already:

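What other words could you look for? Do you think you could create a list of words, like `['hope', 'fear', 'can', 'cannot']`, and use a for loop to print the counts of all of these words in the speech?

You can try to illustrate a particular point using data. One thing to watch for: NLTK's `word_tokenize` usually splits `cannot` into the two tokens `can` and `not`, so the count for `cannot` may come out as 0.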



<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
    ### BEGIN SOLUTION
    file_name = "./data/barack_obama_speeches/inaugural_speech.txt"
    my_file = open(file_name) # open the file
    speech_text = my_file.read() # read content of it and put them in a variable
    my_file.close() # close the file

    speech_tokens = word_tokenize(speech_text)
    speech_tokens = [word.lower() for word in speech_tokens]

    words_of_interest = ['hope', 'fear' ,'can', 'cannot']  
    for word in words_of_interest:
        print(word, speech_tokens.count(word))
    ### END SOLUTION
    
</details>
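As an alternative to calling `count()` once per word, Python's built-in `collections.Counter` can tally every token in one pass. This is only a sketch of that option: `demo_tokens` is a tiny hand-made list standing in for the tokenised speech, so the numbers it prints say nothing about the real text.

```python
from collections import Counter

# Hand-made stand-in for a list of lowercased tokens.
demo_tokens = ["we", "hope", "we", "can", ",", "and", "we", "can", "hope", "."]

token_counts = Counter(demo_tokens)  # maps each token to how often it occurs

words_of_interest = ["hope", "fear", "can"]
for word in words_of_interest:
    print(word, token_counts[word])  # a word that never occurs gets a count of 0

print(token_counts.most_common(1))   # the single most frequent token with its count
```

With the real speech you would pass your own list of lowercased tokens to `Counter` instead of `demo_tokens`.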

