<br>
<img style="float:left" src="http://ipython.org/_static/IPy_header.png" />
<br>

# Session 1: Orientation

<br>
Welcome to the *IPython Notebook*. Through this interface, you'll be learning a lot of things:

* A Programming language: **Python**
* A Python library: **NLTK**
* Overlapping research areas: **Corpus linguistics**, **Natural language processing**, **Distant reading**
* Additional skills: **Regular Expressions**, some **Shell commands**, and **tips on managing your data**

You can head [here](https://github.com/resbaz/lessons/blob/master/nltk/README.md) for the fully articulated overview of the course, but we'll almost always stay within IPython. 
Remember, everything we cover here will remain available to you after ResBaz is over, including these Notebooks. It's all accessible at the [ResBaz GitHub](https://github.com/resbaz/lessons/tree/master/nltk).

**Any questions before we begin?**

Alright, we're off!

## Text as data

Programming languages like Python are great for processing data. In order to apply them to *text*, we need to think about our text as data.
This means being aware of how text is structured, what extra information might be encoded in it, and how to manage it to get the best results. 

## What is the Natural Language Toolkit?

<br>
We'll be covering some of the theory behind corpus linguistics later on, but let's start by looking at some of the tasks NLTK can help you with. 

NLTK is a Python Library for working with written language data. It is free and extensively documented. Many areas we'll be covering are treated in more detail in the NLTK Book, available free online from [here](http://www.nltk.org/book/).

> Note: NLTK provides tools for tasks ranging from very simple (counting words in a text) to very complex (writing and training parsers, etc.). Many advanced tasks are beyond the scope of this course, but by the time we're done, you should understand Python and NLTK well enough to perform these tasks on your own!

We will start by importing NLTK, setting a path to NLTK resources, and downloading some additional stuff.

> Note: If you are not familiar with Python, don't sweat! We will walk through the basics throughout this course.

In [71]:
from __future__ import print_function, division # __future__ ensures Python2 and Python3 interoperability
from IPython.display import display, clear_output # clear output from download
import nltk # import Natural Language Processing Toolkit library

Oh, we've got to import some corpora used in the book as well...

> Note: this location is specific to virtual machines used with the Dit4c platform. If running elsewhere, try running nltk.download(...) with your own path. 

In [None]:
user_nltk_dir = "/home/researcher/nltk_data" # specify our data directory
if user_nltk_dir not in nltk.data.path: # make sure nltk can access this directory
    nltk.data.path.insert(0, user_nltk_dir)
nltk.download("book", download_dir=user_nltk_dir) # download sample materials to data directory
clear_output() # clear the large amount of text we just generated

In [None]:
from nltk.book import *  
# asterisk means 'everything'

Importing the book has assigned variable names to nine texts, `text1` to `text9`. We can call these names easily: 

In [None]:
text2

### Exploring Vocabulary: lexical richness

NLTK makes it really easy to get basic information about the size of a text and the complexity of its vocabulary.

* **`len()`** gives the number of symbols or 'tokens' in your text. This is the total number of words and items of punctuation.

* **`set()`** gives you a collection of all the tokens in the text, without the duplicates. Hence, **`len(set(text3))`** will give you the total number of unique tokens. Remember this still includes punctuation. 

* **`sorted()`** places items in the list into alphabetical order, with punctuation symbols and capitalised words first.

In [None]:
len(text3)

In [None]:
len(set(text3))

In [None]:
sorted(set(text3)) [0:35]

We can calculate the *lexical richness* of a text by dividing the total number of words by the number of unique words. The result is the average number of times each word is used: the lower the number, the more diverse and varied the vocabulary. Comparing this figure across texts can tell us something about an author's style.

We can also count the number of times a word is used and calculate what percentage of the text it represents.

In [None]:
len(text3)/len(set(text3))

In [None]:
text4.count("American")

**Challenge!** 

How would you calculate the percentage of Text 4 that is taken up by the word "America"?

In [None]:
100.0*text4.count("America")/len(text4) 
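
The same calculations can be tried on a tiny hand-made list, which makes the numbers easy to check by eye. This is only a sketch for Python 3: the list `tokens` and the other variable names are made up for illustration and are not part of NLTK.

```python
# A made-up list of tokens (words and punctuation), standing in for a text
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.', 'The', 'end', '.']

total = len(tokens)        # number of tokens, counting repeats
unique = len(set(tokens))  # number of distinct tokens ('the' and 'The' differ)
richness = total / unique  # average number of uses per distinct token
percent_the = 100.0 * tokens.count('the') / total  # share of the text that is 'the'

print(total, unique, richness, percent_the)
```

Note that `'the'` and `'The'` are counted as two different tokens here, just as they are in the NLTK texts.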

### Exploring Vocabulary: concordances, similar contexts, and dispersion

* **`concordance()`** shows you a word in context and is useful if you want to be able to discuss the ways in which a word is used in a text. 

* **`similar()`** will find words used in similar contexts; remember it is not looking for synonyms, although the results may include synonyms.

* **`common_contexts()`** looks for the contexts (the words immediately before and after) that two or more words share, which indicates how similarly the author uses those words.

In [None]:
text1.concordance("monstrous")

In [None]:
text1.similar("monstrous")

In [None]:
text2.similar("monstrous")

In [None]:
text2.common_contexts(["monstrous", "very"])  
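
To get a feel for what a concordance is doing, here is a rough sketch in plain Python. It is not NLTK's implementation: the function name `kwic` (for "keyword in context"), the `width` parameter and the sample tokens are all made up for illustration, and NLTK's own version does more, such as lining the results up in columns.

```python
def kwic(tokens, word, width=2):
    """Return each occurrence of word with up to `width` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == word:
            left = tokens[max(0, i - width):i]    # tokens just before the match
            right = tokens[i + 1:i + 1 + width]   # tokens just after the match
            lines.append(' '.join(left + [token] + right))
    return lines

# A made-up token list, standing in for a text
tokens = ['a', 'monstrous', 'whale', 'and', 'a', 'monstrous', 'fine', 'day']
for line in kwic(tokens, 'monstrous'):
    print(line)
```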

* **`collocations()`**: we can also find words that typically occur together, which tend to be very specific to a text or genre of texts. We'll talk more about these features and how to use them later.

In [None]:
text4.collocations()

Python also lets you create graphs to display data. To represent information about a text graphically, import the Python library **`matplotlib`**. We can then generate a dispersion plot that shows where given words occur in a text.

**`%matplotlib inline`** is an IPython-Notebook-specific command that tells it to display the plot below the code block.

In [None]:
%matplotlib inline

from nltk.draw.dispersion import dispersion_plot
dispersion_plot(text1, ["whale", "storm", "calm"])

**Challenge!**

Create a dispersion plot for the terms "citizens", "democracy", "freedom", "duties" and "America" in the inaugural address corpus.
What do you think it tells you? 

In [None]:
dispersion_plot(text4, ["citizens", "democracy", "freedom", "duties", "America"]) # plot five words longitudinally

## IPython Notebook

So, we've been writing Python code in an IPython notebook. Why?

1. The main strength of IPython is that you can run bits of code individually, so you don't have to keep repeating things. For example, if you scroll up to an earlier cell and change one of its values, you can re-run just that cell and get the new answer. 
2. IPython allows you to display images alongside code, and to save the input and output together.
3. IPython makes learning a bit easier, as mistakes are easier to find and do not break an entire workflow.

You can get more information on IPython, including how to install it on your own machine, at the [IPython Homepage](http://ipython.org).

###  Defining a variable

In Python, we give the items we're working with names, a process called assignment. For instance, in the NLTK corpus, 'Sense and Sensibility' has been assigned the name 'text2', which is much easier to work with. 
We also assigned the name 'sent' to the sentence that we created in the previous exercise, so that we could then instruct Python to do various things with it. Assigning a variable in Python looks like this: 

`variable = expression   `

You can call your variables (almost) anything you like, but it's a good idea to pick names that will be meaningful and easy to type. You can't use words that already have a meaning in Python, such as import, def, or not. If you try to use a word that is reserved, you'll get a syntax error.

In [None]:
string = 'user'
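
If you're not sure whether a name is reserved, Python's standard `keyword` module can tell you. In this small sketch the list of candidate names is just an example.

```python
import keyword

# Some names we might be tempted to use for variables
candidates = ['import', 'def', 'not', 'string', 'my_text']

# Map each name to True if Python reserves it, False otherwise
reserved = {name: keyword.iskeyword(name) for name in candidates}

for name in candidates:
    print(name, reserved[name])
```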

### Significant Whitespace
One thing that makes Python unusual is that whitespace at the start of the line (use four spaces for consistency!) is meaningful. 
In many other languages, whitespace at the start of lines is simply a readability convention.

In [None]:
# Fix this whitespace problem!
if string == 'user':
    print('Phew, fixed.')

So, whitespace tells both Python and human readers where things start and stop.
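
Here is a small sketch of indentation deciding which lines belong to the `if` block. The variable names and values are made up for illustration.

```python
string = 'guest'
messages = []

if string == 'user':
    messages.append('inside')        # indented: only runs when the test is true
    messages.append('still inside')  # same indentation, so same block
messages.append('outside')           # not indented: runs every time

print(messages)
```

Try changing `'guest'` to `'user'` and re-running the cell to see the indented lines take effect.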

### Defining a function

Next, we'll talk about *functions*. Advantages of functions are:

1. Save you typing
2. You can be sure you're doing exactly the same operation every time

Functions can include a **`return`** statement which gives an output to assign to variables or other functions.

In [None]:
def welcome(name):
    return 'Welcome, {}!'.format(name) # .format() lets you include variables inside strings

**`def`** is how we *define* a function in Python.

Notice that it doesn't do anything by itself. It needs to actually be *called*, and given some data:

In [None]:
welcome('Kim')

You may wish to repeat an operation multiple times looking at different texts or different terms within a text. Instead of re-entering the formula every time, you can assign a name and set of actions to a particular task.

> **Note**: Learn to love tab-completion! Typing the first one or two letters of a function or variable you've used previously then hitting tab will auto-complete that command, saving you typing (i.e. time and mistakes!). 

Previously, we calculated the lexical diversity of a text. In Python, we can create a function called **`lexical_diversity()`** that runs a single line of code. We can then call this function to quickly determine the lexical diversity of a corpus or subcorpus.

**Challenge!**

Write a function to calculate the lexical diversity of a text; test it out on the books in the NLTK corpus.

In [None]:
def lexical_diversity(text):
    return len(text)/len(set(text))

In [None]:
# After the function has been defined, we can run it:
lexical_diversity(text2)

The parentheses are important here as they separate the task, that is the work of the function, from the data that the function is to be performed on. 

The data in parentheses is called the argument of the function. When we use a function, we say that we 'call' it. 

Other functions that we've used already include **`len()`** and **`sorted()`** - these were predefined. **`lexical_diversity()`** is one we set up ourselves; note that it's conventional to put a set of parentheses after a function, to make it clear what we're talking about.

### Lists

NLTK treats a text as a long list of words. First, we'll make some lists of our own, to give you an idea of how a list behaves.

> **Note**: we use square brackets `[]` here to define our list.

In [None]:
sent0 = ['Call', 'me', 'Ishmael', '.']
print(sent0)

In [None]:
len(sent0)

The opening sentences of each of our texts have been pre-defined for you. You can inspect them by typing in 'sent2' etc.

You can *concatenate*, or join up lists together, creating a new list containing all the items from both lists. You can do this by typing out the two lists or you can add two or more pre-defined lists.

In [None]:
print(sent4)
print(sent0)
print(sent4 + sent0)

We can also add an item to the end of a list by appending. When we **`append()`**, the list itself is updated. 

In [None]:
sent0.append('Please')
print(sent0)

**Challenge!** 

Define a function called **`please()`** that appends the word 'Please' to the end of a list and returns the list.

In [None]:
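def please(sentence):
    sentence.append('Please') # append() updates the list in place and returns None
    return sentence # so we return the updated list ourselves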
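
One thing to watch for: `append()` changes the list in place and gives back `None` rather than the list. The short sketch below, using a made-up list, shows the difference between appending and concatenating with `+`.

```python
words = ['Call', 'me']
result = words.append('Ishmael')  # append changes 'words' itself...

print(result)  # ...and gives back None, not the list
print(words)

longer = words + ['.']  # by contrast, + builds a new list and leaves 'words' alone
print(longer)
print(words)
```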

###  Indexing Lists

We can navigate this list with the help of indexes. Just as we can find out the number of times a word occurs in a text, we can also find where a word first occurs. We can navigate to different points in a text without restriction, so long as we can describe where we want to be.

In [None]:
print(text4.index('awaken'))

This works in reverse as well. We can ask Python for the item at index 158 in our list (note that we use square brackets here, not parentheses). Because counting starts at zero, this is actually the 159th item.

In [None]:
print(text4[158])

As well as pulling out individual items from a list, indexes can be used to pull out selections of text from a large corpus to inspect. We call this slicing.

In [None]:
print(text5[16715:16735])

If we're asking for the beginning or end of a text, we can leave out the first or second number. For instance, [:5] will give us the first five items in a list while [8:] will give us everything from index 8 (the ninth item) to the end. 

In [None]:
print(text2[:10])
print(text4[145700:])
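
Slices are easier to follow on a short list where you can count the positions yourself. The list `letters` below is made up for illustration.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

first_five = letters[:5]  # indexes 0 to 4
from_eight = letters[8:]  # index 8 to the end
middle = letters[2:4]     # index 2 up to, but not including, index 4

print(first_five)
print(from_eight)
print(middle)
```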

To help you understand how indexes work, let's create a short list of our own.

We start by defining the name of our list and then add the items. You probably won't build lists by hand like this in your own work, but you may want to manipulate a list in other ways. Pay attention to the quote marks and commas when you create your test sentence.

In [None]:
sent = ['The', 'quick', 'brown', 'fox']
print(sent[0])
print(sent[2])

Note that the first element in the list has index zero. This is because we are telling Python to go zero steps forward in the list. If we use an index that is too large (that is, we ask for something that doesn't exist), we'll get an error.

We can modify elements in a list by assigning new data to one of its index values. We can also replace a slice with new material.

In [None]:
sent[2] = 'furry'
sent[3] = 'child'
print(sent)
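
Replacing a slice works the same way as replacing a single item, except that the new material doesn't have to be the same length as the old. The sketch below starts again from a fresh copy of the sentence and also shows the error you get from an index that is too large.

```python
sent = ['The', 'quick', 'brown', 'fox']

sent[1:3] = ['slow', 'old', 'grey']  # two items out, three items in
print(sent)

try:
    sent[10]  # there is no item at index 10
    out_of_range = False
except IndexError:
    out_of_range = True  # Python raised an IndexError, as expected
print(out_of_range)
```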

**Challenge**

- Create a list called 'opening' that consists of the phrase "It was a dark and stormy night; the rain fell in torrents"
- Create a variable called 'clause' that contains the contents of 'opening', up to the semi-colon
- Create a variable called 'alphabetised' that contains the contents of 'clause' sorted alphabetically
- Print 'alphabetised' 

In [None]:
opening = ['It', 'was', 'a', 'dark', 'and', 'stormy', 'night', ';', 'the', 'rain', 'fell', 'in', 'torrents']
clause = opening[0:7]
alphabetised = sorted(clause)

Note that assigning a variable just causes Python to remember that information without generating any output. 

If you want Python to show you the result, you have to ask for it (this is a good thing when you assign a variable to a very long list!).

In [None]:
print(clause)

In [None]:
print(alphabetised)

### Frequency distributions

We can use Python's ability to perform statistical analysis of data to do further exploration of vocabulary. For instance, we might want to be able to find the most common or least common words in a text. We'll start by looking at frequency distribution.

In [None]:
from nltk.probability import FreqDist
from collections import Counter
fdist1 = FreqDist(text1)

In [None]:
fdist1.most_common(50)

In [None]:
fdist1['whale']

In [None]:
fdist1.plot(50, cumulative = True)
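
NLTK's `FreqDist` builds on Python's own `collections.Counter`, so the core idea can be sketched without NLTK at all. The token list here is made up for illustration.

```python
from collections import Counter

# A made-up list of tokens, standing in for a text
tokens = ['the', 'whale', 'the', 'sea', 'the', 'whale', '.']
counts = Counter(tokens)

print(counts.most_common(2))  # the two most frequent tokens with their counts
print(counts['whale'])        # how often one token occurs
print(counts['ship'])         # tokens that never occur count as zero
```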

**Challenge!**

Let's compare the 15 most common tokens of the texts in the NLTK book. You could do this manually, but it will save you time and typing if you define a function and then loop it over a list of the texts, which we'll call 'Library'.


In [None]:
def common_words(text):
    return FreqDist(text).most_common(15)
common_words(text1)

In [None]:
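# 'Library' is simply a list holding the nine texts loaded from nltk.book
Library = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
for book in Library:
    words = common_words(book)
    print(book, words)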

## Exploring Vocab continued

As well as counting individual words, we can count other features of vocabulary, such as how often words of different lengths occur. We do this by putting together a number of the commands we've already learned.

We could start like this: 

     [len(word) for word in text1]

... but this would print the length of every word in the whole book, so let's skip that bit!

In [None]:
fdist2 = FreqDist(len(word) for word in text1)

In [None]:
fdist2.max()

In [None]:
fdist2.freq(3)

These last two commands tell us that the most common word length is 3, and that these three-letter words account for about 20% of the book. 
We can see this just by visually inspecting the list produced by *fdist2.most_common()*, but if this list were too long to inspect readily, or we didn't want to print it, there are other ways to explore it.  
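
As a rough illustration of what these two commands compute, here is the same idea on a made-up sentence using only `collections.Counter`. This is a sketch for Python 3 rather than NLTK's own code, and the variable names are assumptions.

```python
from collections import Counter

# A made-up list of tokens, standing in for a text
tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']
length_counts = Counter(len(word) for word in tokens)

# The length that occurs most often (the role played by fdist2.max())
most_common_length = length_counts.most_common(1)[0][0]

# Its share of all tokens (the role played by fdist2.freq())
share = length_counts[most_common_length] / len(tokens)

print(most_common_length, share)
```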

There are a number of functions defined for NLTK's frequency distributions:

 | Function | Purpose  |
 |--------------|------------|
 | fdist = FreqDist(samples) | create a frequency distribution containing the given samples |
 | fdist[sample] += 1 | increment the count for this sample |
 | fdist['monstrous']  | count of the number of times a given sample occurred |
 | fdist.freq('monstrous') | frequency of a given sample |
 | fdist.N()  |  total number of samples |
 | fdist.most_common(n)   |  the n most common samples and their frequencies |
 | for sample in fdist:   |  iterate over the items in fdist, when in the loop, we refer to each item as sample |
 | fdist.max() | sample with the greatest count |
 | fdist.tabulate()   |  tabulate the frequency distribution |
 | fdist.plot()  |   graphical plot of the frequency distribution |
 | fdist.plot(cumulative=True) | cumulative plot of the frequency distribution |
 | fdist1 < fdist2 | test if samples in fdist1 occur less frequently than in fdist2 |

It is possible to select the longest words in a text, which may tell you something about its vocabulary and style.

In [None]:
vocab = set(text4)
long_words = [word for word in vocab if len(word) > 15]
sorted(long_words)

We can also use relational (comparison) operators to refine the types of searches we ask Python to run. The most common ones are:


### Common relationals
 |  Relational | Meaning |
 |--------------:|:------------|
 | <    |  less than |
 | <=   |   less than or equal to |
 | ==  |    equal to (note this is two "=" signs, not one) |
 | !=   |   not equal to |
 | \>   |   greater than |
 | \>= |   greater than or equal to |

**Challenge!**

Using one of the pre-defined sentences in the NLTK corpus, use the relational operators above to find:

- Words longer than four characters
- Words of four or more characters
- Words of exactly four characters

In [None]:
longer = [word for word in sent2 if len(word) > 4]
more = [word for word in sent2 if len(word) >= 4]
exact = [word for word in sent2 if len(word) == 4]
print(longer, more, exact)

In [None]:
for word in sent2:
    if len(word) > 4:
        print(word)

We can fine-tune our selection even further by adding other conditions. For instance, we might want to find long words that occur frequently (or rarely).  

**Challenge!**

Can you find all the words in a text that are more than seven letters long and occur more than seven times?

In [None]:
fdist5 = FreqDist(text5)
sorted(word for word in set(text5) if len(word) > 7 and fdist5[word] > 7)

### Common operators

 | Operator  | Purpose  |
 |--------------|------------|
 | s.startswith(t) | test if s starts with t |
 | s.endswith(t)  |  test if s ends with t | 
 | t in s         |  test if t is a substring of s | 
 | s.islower()    |  test if s contains cased characters and all are lowercase | 
 | s.isupper()    |  test if s contains cased characters and all are uppercase | 
 | s.isalpha()    |  test if s is non-empty and all characters in s are alphabetic | 
 | s.isalnum()    |  test if s is non-empty and all characters in s are alphanumeric | 
 | s.isdigit()    |  test if s is non-empty and all characters in s are digits | 
 | s.istitle()    |  test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals) | 

In [None]:
sorted(w for w in set(text1) if w.endswith('ableness'))

In [None]:
sorted(n for n in sent7 if n.isdigit())
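
A few of the tests from the table, tried out on a small made-up list of tokens so that the results are easy to check by hand:

```python
tokens = ['Whale', 'whale', 'WHALE', '1851', 'sea-sick', '.', 'agreeableness']

capitalised = [t for t in tokens if t.istitle()]        # initial capital only
lowercase = [t for t in tokens if t.islower()]          # cased letters all lowercase
numbers = [t for t in tokens if t.isdigit()]            # digits only
words_only = [t for t in tokens if t.isalpha()]         # letters only, so no hyphens
ness_words = [t for t in tokens if t.endswith('ness')]  # a particular ending

print(capitalised)
print(lowercase)
print(numbers)
print(words_only)
print(ness_words)
```

Notice that `'sea-sick'` passes `islower()` but fails `isalpha()`, because the hyphen is not a letter.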

**Bonus!**

You'll remember right at the beginning we started looking at the size of the vocabulary of a text, but there were two problems with the results we got from using:

     len(set(text1))

This count includes items of punctuation and treats capitalised and non-capitalised words as different things (*This* vs *this*). We can now fix these problems. We start by converting every word to lowercase, then we get rid of the punctuation and numbers.

In [None]:
len(set(text1))

In [None]:
len(set(word.lower() for word in text1))

In [None]:
len(set(word.lower() for word in text1 if word.isalpha()))
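
The effect of each step is easier to see on a handful of tokens. The list below is made up for illustration; the three expressions mirror the three cells above.

```python
tokens = ['This', 'is', 'this', ',', 'and', 'That', 'is', 'that', '.', '42']

raw_vocab = set(tokens)                               # everything, as it appears
lowered_vocab = set(word.lower() for word in tokens)  # 'This' and 'this' now match
clean_vocab = set(word.lower() for word in tokens if word.isalpha())  # letters only

print(len(raw_vocab), len(lowered_vocab), len(clean_vocab))
print(sorted(clean_vocab))
```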

<img style="float:left" src="http://images.catsolonline.com/cache/custom/CEN/CE651-250x250.jpg" />
<br>