# TAL Labo 1b : Text Segmentation with NLTK

## Objectives
The goal of the second part of the first [TAL](http://iict-space.heig-vd.ch/apu/cours-tal/) lab is to run simple operations for text analysis using the [NLTK](http://www.nltk.org/) toolkit.  You will use the environment that you set up in the first part of the first lab (Labo 1a): [Python 3](https://www.python.org/) with [Jupyter](https://jupyter.org/) notebooks, either using [Google Colab](https://colab.research.google.com) or on your own computer.

You will use NLTK functions to get texts from the web or local files, and segment (split) them into sentences and words (also called *tokens*).  You will also experiment with extracting some statistics about the texts.

To submit your practical work, please execute all cells of your Jupyter notebook, print it as a PDF document, and [email both files (the notebook and the PDF) to the teacher](mailto:andrei.popescu-belis@heig-vd.ch).

## NLTK: the Natural Language (Processing) Toolkit

You should now add NLTK to your local Python installation, by following the instructions at the [NLTK website](http://www.nltk.org/install.html).  On Google Colab, NLTK is already installed.

A good way to get started with NLTK is to look at [Chapter 1](http://www.nltk.org/book/ch01.html) of the [NLTK book (NLP with Python)](http://www.nltk.org/book/) and to try some of the instructions there.  Note that the online book is updated for Python 3, but the printed book, also available in PDF on some websites, is only for Python 2 ([_Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit_, Steven Bird, Ewan Klein, and Edward Loper, O'Reilly Media, 2009](http://shop.oreilly.com/product/9780596516499.do)). 

To use NLTK in a Jupyter notebook, simply run `import nltk` before the first cell that uses it.  You will still need to use the prefix `nltk.` unless you write for instance: `from nltk.book import *` (this will also import and define several long texts).

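*Note:* NLTK includes a download manager which can be called from a Python interpreter using `import nltk` and then `nltk.download()`.  This can automatically download from the NLTK repository a large number of text collections (_corpora_).  In this practical session, we will not use any of these corpora; we only need the `punkt` tokenizer models, which the cell below downloads.  (Recent NLTK releases may ask for a resource named `punkt_tab` instead; if so, use the name given in the error message.)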

In [1]:
import nltk
nltk.download('punkt')
#from nltk.book import *

[nltk_data] Downloading package punkt to C:\Users\Vincent
[nltk_data]     Guidoux\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

Please write a list of words called `sentence1`, print its length (`len()`) and use `nltk.bigrams` to generate all bigrams from it, i.e. pairs of consecutive words.  You can see an example in [Sec. 3.3 of Ch. 1 of the NLTK book](http://www.nltk.org/book/ch01.html#collocations-and-bigrams).  Please also sort bigrams alphabetically.

In [3]:
# Please write your Python code in this cell and execute it.
sentence1 = ["bonjour", "petit", "nathan", "vincent"]
print(len(sentence1))
bigrams1 = sorted(set(nltk.bigrams(sentence1)))  # set() drops duplicate pairs before sorting
for bigram in bigrams1:
    print(bigram)



4
('bonjour', 'petit')
('nathan', 'vincent')
('petit', 'nathan')


Please define a string called `string2` with another sentence, not segmented into words yet.  Use the NLTK tokenizer (the function called `nltk.word_tokenize`) as explained in [Sec. 3.1 of Ch. 3 of the NLTK book](http://www.nltk.org/book/ch03.html#sec-accessing-text) to tokenize the string into a list of words called `sentence2`, and then print this list.

In [4]:
# Please write your Python code in this cell and execute it.
string2 = "Bonsoir grand guillaume Hochet"
# word_tokenize already returns a list of tokens, so no copy loop is needed
sentence2 = nltk.word_tokenize(string2)
print(sentence2)



['Bonsoir', 'grand', 'guillaume', 'Hochet']


## Using NLTK to download, tokenize, and save a text

Drawing inspiration from [Sec. 3.1 of Chapter 3 (Processing Raw Text) of the NLTK book](http://www.nltk.org/book/ch03.html), get some text from the Web, for instance a book from Project Gutenberg.  Do your best to keep only the meaningful text from it, without the header and the final license.  What is the length of the entire book?  Is this a number of characters or words? (If you are curious, you can also refer to [Python's documentation of Unicode support](https://docs.python.org/3.7/howto/unicode.html).)

In [8]:
from urllib import request
url1 = "http://www.gutenberg.org/files/18488/18488.txt" # pick a text here
# Please write your Python code below and execute it.

response = request.urlopen(url1)
raw = response.read().decode('utf8')
tokens = nltk.word_tokenize(raw)

print("Number of characters : ", len(raw))
print("Number of words : ", len(tokens))

# print(raw[4340:489067])




Number of characters :  508538
Number of words :  110957


In [10]:
# Determine, by trial and error, how much we should trim from the
# beginning and from the end, to keep only the actual text of the book.
# Here the header is raw[:4340] and the final license is raw[489067:],
# so the book itself is raw[4340:489067].
# Display the two parts that are trimmed away:

print(raw[:4340])
print("-------------------------------------------------------------------------------------------------------------")
print("-------------------------------------------------------------------------------------------------------------")
print(raw[489067:])





Project Gutenberg's The Place Beyond the Winds, by Harriet T. Comstock

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: The Place Beyond the Winds

Author: Harriet T. Comstock

Illustrator: Harry Spafford Potter

Release Date: June 2, 2006 [EBook #18488]

Language: English

Character set encoding: ASCII

*** START OF THIS PROJECT GUTENBERG EBOOK THE PLACE BEYOND THE WINDS ***




Produced by Suzanne Shell, Mary Meehan and the Online
Distributed Proofreading Team at http://www.pgdp.net









[Illustration: "It was a beautiful thing, that dance, grotesque, pagan
and yet divine"]




THE PLACE BEYOND THE WINDS

BY HARRIET T. COMSTOCK


_Illustrated by_
HARRY SPAFFORD POTTER

GARDEN CITY, NEW YORK
DOUBLEDAY, PAGE & COMPANY
1914
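
As a complement to trial and error, the `*** START OF ...` line visible in the header above can serve as an anchor for trimming.  The sketch below is one possible approach, not the expected solution: the helper name `trim_gutenberg` and the sample string are hypothetical, and it assumes that the file also contains a matching `*** END OF` line before the license.  Material between the marker and the first chapter (credits, title page) would still have to be trimmed by hand.

```python
def trim_gutenberg(raw_text):
    """Return the text between the Gutenberg START and END marker lines."""
    start = raw_text.find("*** START OF")
    end = raw_text.find("*** END OF")
    if start == -1 or end == -1:
        return raw_text  # markers not found: leave the text untouched
    # Move past the end of the START marker line, if it has one
    newline = raw_text.find("\n", start)
    if newline != -1:
        start = newline + 1
    return raw_text[start:end].strip()


# Hypothetical miniature "book" used only to illustrate the helper
sample = (
    "Title: Example\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1. It was a dark night.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
trimmed = trim_gutenberg(sample)
print(trimmed)
```

On the real book, the same call would be `trim_gutenberg(raw)`, and `len()` of the result gives a first estimate of the book length in characters.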


**Segment (tokenize) the text into words.**  How many words does it have?  Display a short fragment of it.

In [11]:
# Please write your Python code in this cell and execute it.

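# Note: `raw` still contains the Gutenberg header and license, so this count
# includes them; tokenize raw[4340:489067] instead to count only the book.
tokens = nltk.word_tokenize(raw)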

print("Number of words : ", len(tokens))



Number of words :  110957


In [12]:
import os, codecs # we need to manage the encodings too

Now save a text version, by writing each token followed by a space.  Inspect the text file using a text editor.
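
A minimal sketch of the writing step, on a toy token list rather than the book: `demo_tokens` and `demo_path` are hypothetical names, and the file goes to the system temporary directory so that `tokens1.txt` is not touched.  The `with` statement closes the file automatically, which matters because unwritten buffers may otherwise never reach the disk.

```python
import codecs
import os
import tempfile

demo_tokens = ["Hello", ",", "world", "!"]
demo_path = os.path.join(tempfile.gettempdir(), "tokens_demo.txt")

# Write every token followed by a single space
with codecs.open(demo_path, "w", "utf8") as fd_demo:
    for token in demo_tokens:
        fd_demo.write(token + " ")

# Read the file back to check what was written
with codecs.open(demo_path, "r", "utf8") as fd_demo:
    content = fd_demo.read()
print(repr(content))
```

Opening with mode `"w"` replaces any existing file, so the explicit `os.remove` step is only needed when appending with mode `"a"`.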

In [8]:
filename1 = "tokens1.txt"
# For a local file: relative path with respect to notebook
# For Google Colab: /content/gdrive/My Drive/tokens1.txt
if os.path.exists(filename1): 
    os.remove(filename1)
fd = codecs.open(filename1, 'a', 'utf8')
# Please write your Python code below and execute it.




**Now segment the raw text into sentences.**  There are two solutions: the first one is to start from the trimmed raw text and apply `nltk.sent_tokenize()`.

In [9]:
# Please write your Python code in this cell and execute it.



The second solution for segmentation (which seems better in this case) is to use the tokenized version that you saved.  

Load the tokenized text (tokens + whitespace) from the file, and apply `nltk.sent_tokenize` to it.  

How many sentences does it have?  Display a short fragment of it.

In [10]:
# Second solution for a better text segmentation
fd = codecs.open(filename1, 'r', 'utf8')
# Please write your Python code below and execute it.




Save the segmented text as another file (one sentence per line).  Inspect the file using a text editor.

In [11]:
# Write every sentence on a separate line
filename2 = "sentences1.txt"
# For a local file: relative path with respect to notebook
# For Google Colab: /content/gdrive/My Drive/sentences1.txt
if os.path.exists(filename2): 
    os.remove(filename2)
fd = codecs.open(filename2, 'a', 'utf8')
# Please write your Python code below and execute it.




## Computing statistics over a text
You can also create an NLTK Text object from the tokens of the text, without sentence segmentation.  This enables you to compute some statistics using NLTK functions.  NLTK Texts can store: (1) a string; (2) the list of all words (strings); (3) the list of all sentences (list of lists of strings).  In each case, the length of the object is different.  **However, only option (2) allows the use of predefined methods for NLTK Texts.**  Note that `nltk.word_tokenize()` and `nltk.sent_tokenize()` only apply to strings, not to NLTK Texts.
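
The difference in length between options (1) and (2) can be checked on a toy example.  This is only a sketch: the variable names are hypothetical, and `str.split()` stands in for `nltk.word_tokenize` so that the snippet does not depend on the `punkt` models.

```python
import nltk

sample_string = "The cat sat. The dog ran."
# Stand-in for nltk.word_tokenize (whitespace split, punctuation stays attached)
sample_tokens = sample_string.split()

text_from_string = nltk.Text(sample_string)  # option (1): one item per character
text_from_tokens = nltk.Text(sample_tokens)  # option (2): one item per token

print(len(text_from_string))          # number of characters
print(len(text_from_tokens))          # number of tokens
print(text_from_tokens.count("The"))  # word counts only make sense for option (2)
```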

In [12]:
# Create an NLTK Text object by word tokenization of the raw string (trimmed), for instance:
# text1tokenized = nltk.Text(nltk.word_tokenize(raw1trimmed)) # correct



**From now on, we will use the Text object created by word_tokenizing the initial raw string.**

[Chapter 1 of the NLTK book](http://www.nltk.org/book/ch01.html) provides examples of operations that can be done on texts.  For instance, you can search for a word in its context with `concordance`, or find words that are similar to a given one in terms of contexts with `similar`.
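
As an illustration on a toy text (hypothetical `demo_tokens`, whitespace-split for simplicity), the sketch below calls `concordance`, which prints its matches, and `concordance_list`, which returns them so that they can be counted or processed further.

```python
import nltk

demo_tokens = ("the wind blew over the hills and the wind "
               "carried the rain over the hills").split()
demo_text = nltk.Text(demo_tokens)

# Prints each occurrence of "wind" with some surrounding context
demo_text.concordance("wind", width=40, lines=5)

# Same matches, returned as a list instead of being printed
matches = demo_text.concordance_list("wind", width=40)
print(len(matches))
```

`demo_text.similar("wind")` is called in the same way, but it needs a much longer text than this sample to return useful results.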

In [14]:
# Please write your Python code for finding concordances in this cell and execute it.



In [15]:
# Please write your Python code in this cell for finding similar words and execute it.



Determine the vocabulary (list of word types) of your text by converting the list to a set.  What is the size of the vocabulary?  What are the words longer than 15 characters?
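
A minimal sketch of both questions on a toy token list (the names and data are hypothetical).  Note that a set is case-sensitive, so `The` and `the` would count as two word types unless the tokens are lowercased first.

```python
demo_tokens = ["the", "Incomprehensibilities", "of", "the",
               "wind", "of", "the", "north"]

vocabulary = set(demo_tokens)  # each distinct word type appears once
long_words = sorted(w for w in vocabulary if len(w) > 15)

print(len(demo_tokens), len(vocabulary))  # tokens vs. word types
print(long_words)
```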

In [16]:
# Please write your Python code in this cell and execute it.




## Frequency Distributions
You can also ask NLTK to compute word frequencies for a given text, yielding a new object called a frequency distribution: see [Sec. 3.1 of Ch. 1 of the NLTK book](http://www.nltk.org/book/ch01.html#frequency-distributions).  Using this object, you can ask for the most common (i.e. frequent) words.
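
For instance, on a toy token list (hypothetical data), a frequency distribution can be built and queried as sketched below; `nltk.FreqDist` accepts any list of tokens, including an NLTK Text.

```python
import nltk

demo_tokens = "the wind and the rain and the long cold night".split()
fdist = nltk.FreqDist(demo_tokens)

print(fdist.most_common(2))  # the two most frequent words with their counts
print(fdist["the"])          # count of one particular word
```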

In [18]:
# Please write your Python code in this cell and execute it.




Can you display words strictly longer than 4 characters among the 70 most frequent words?
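
One possible way to combine the two conditions is a list comprehension over `most_common`.  The sketch uses a toy sample and a hypothetical cutoff `top_n = 5`; the exercise itself asks for 70.

```python
import nltk

demo_tokens = ("the winds and the rains and the winds over the "
               "mountain where the winds began").split()
fdist = nltk.FreqDist(demo_tokens)

top_n = 5  # the exercise uses 70; 5 suits this tiny sample
frequent_long = [word for word, count in fdist.most_common(top_n)
                 if len(word) > 4]
print(frequent_long)
```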

In [17]:
# Please write your Python code in this cell and execute it.



Can you display the list of most frequent collocations (bigrams) using the corresponding NLTK function?
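
The method presented in the NLTK book is `collocations()` on a Text object, which typically requires the `stopwords` corpus to be downloaded first.  As an alternative that needs no extra download, the sketch below uses `BigramCollocationFinder` on a toy token list (hypothetical data) and ranks bigrams by raw frequency; this is a swapped-in technique, not necessarily the one the exercise expects.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

demo_tokens = ("new york is big and new york is old "
               "and new york never sleeps").split()

finder = BigramCollocationFinder.from_words(demo_tokens)
finder.apply_freq_filter(2)  # keep only bigrams seen at least twice
top_bigrams = finder.nbest(BigramAssocMeasures.raw_freq, 3)
print(top_bigrams)
```

Other association measures, such as `BigramAssocMeasures.pmi` or `BigramAssocMeasures.likelihood_ratio`, can be passed to `nbest` in the same way.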

In [21]:
# Please write your Python code in this cell and execute it.



## Graphical displays
We can also display various plots regarding word statistics, using _matplotlib_. Once it is installed, you need to run e.g. `import matplotlib.pyplot as plt` and then `%matplotlib inline`.  You can use two lists (e.g. x_values and y_values) to generate a plot with `plt.plot(x_values, y_values)`.  Or you can use the plotting functions from NLTK.

In [13]:
# Before using matplotlib to display graphs inline, you must execute the following two lines.
import matplotlib.pyplot as plt  # the alias `plt` is the one used in the text above
%matplotlib inline

Again using [Sec. 3.1 of Ch. 1 of the NLTK book](http://www.nltk.org/book/ch01.html#frequency-distributions), display the cumulative frequency plot of the 70 most frequent words of your text.

In [22]:
# Please write your Python code in this cell and execute it.




Can you build a list with the length of each word, then use this list to create a new `FreqDist` object, and plot the frequency distribution?  (Not the cumulative one.)  What is the most frequent length?  What can you observe about the order of the lengths by decreasing frequency?
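
A sketch of the idea on a toy sentence (hypothetical data): the list of lengths is itself a list of items, so it can be passed to `FreqDist` exactly like a list of words.

```python
import nltk

demo_tokens = "a storm of wind and rain fell on the old town".split()

lengths = [len(token) for token in demo_tokens]
length_dist = nltk.FreqDist(lengths)

print(length_dist.most_common())  # (length, count) pairs by decreasing count
print(length_dist.max())          # the most frequent length
# In a notebook, length_dist.plot() draws the (non-cumulative) distribution.
```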

In [23]:
# Please write your Python code in this cell and execute it.




**Zipf's Law.** Please produce a list of the frequencies of each word, in decreasing order.  Plot (for about 100 ranks) the values on the y axis and the rank (1st, 2nd, 3rd, ...) on the x axis.  Also add a curve of the shape y = a/(x+b), adjusting a and b so that the two curves overlap as closely as possible.
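
Zipf's law states that the frequency of a word is roughly inversely proportional to its rank, so the sketch below compares the observed frequencies with a curve of the form y = a/(x+b).  Everything here is illustrative: the toy tokens, the starting values of `a` and `b`, and the helper name `plot_zipf` are assumptions, and `collections.Counter` is used so that the computation runs without NLTK (an `nltk.FreqDist` can be used the same way).

```python
from collections import Counter

demo_tokens = ("the wind and the rain and the cold and the wind "
               "of the north").split()

counts = Counter(demo_tokens)
freqs = sorted(counts.values(), reverse=True)[:100]  # frequencies by decreasing order
ranks = list(range(1, len(freqs) + 1))               # 1st, 2nd, 3rd, ...

a, b = float(freqs[0]), 0.0  # hand-picked starting guess, to be adjusted by eye
fitted = [a / (rank + b) for rank in ranks]


def plot_zipf(ranks, freqs, fitted):
    """Plot observed frequencies and the fitted curve (call this in the notebook)."""
    import matplotlib.pyplot as plt
    plt.plot(ranks, freqs, label="observed")
    plt.plot(ranks, fitted, label="a / (x + b)")
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.legend()
    plt.show()


print(freqs)
```

With `b = 0`, choosing `a` equal to the highest frequency makes the two curves meet at rank 1; increasing `b` flattens the start of the fitted curve.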

In [24]:
# Please write your Python code in this cell and execute it.






## End of Lab 1b
Please print this notebook as a PDF file (once you have executed all cells), group it into a ZIP file together with the `.ipynb` file and the files from Labo 1c, and email the ZIP file to the teacher.