
# AnTeDe Practical Work 1b: Text Segmentation with NLTK

## Objectives
The goal of the second part of the first [AnTeDe](https://moodle.msengineering.ch/course/view.php?id=1063) lab is to run simple operations for text analysis using the [NLTK](http://www.nltk.org/) toolkit.  You will use the environment that you set up in the first part of the first lab (Lab 1a): [Python 3](https://www.python.org/) with [Jupyter](https://jupyter.org/) notebooks.  

You will use NLTK functions to get texts from the web or local files, and segment (split) them into sentences and words (also called *tokens*).  You will also experiment with extracting some statistics about the texts.

To submit your practical work, please execute all cells of your Jupyter notebook, then save it, join it to part 1c in a *zip* archive, and submit it as homework #1 on the [AnTeDe Moodle page](https://moodle.msengineering.ch/course/view.php?id=1303).

## NLTK: the Natural Language (Processing) Toolkit

Please add NLTK to your Python installation, by following the instructions at the [NLTK website](http://www.nltk.org/install.html).  A good way to get started is to look at [Chapter 1](http://www.nltk.org/book/ch01.html) of the [NLTK book (NLP with Python)](http://www.nltk.org/book/) and to try some of the instructions there.  Note that the online book is updated for Python 3, but the printed book, also available in PDF on some websites, is only for Python 2 ([_Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit_, Steven Bird, Ewan Klein, and Edward Loper, O'Reilly Media, 2009](http://shop.oreilly.com/product/9780596516499.do)). 

To use NLTK in Jupyter, you only need to run `import nltk` once before the first use.  You must still write the prefix `nltk.` in front of each function, unless you use for instance `from nltk.book import *` (which also imports and defines several long texts).  Note: NLTK can automatically download a large number of text collections, i.e. _corpora_, from its website.  Its download manager can be called from a Python interpreter (not a notebook) using `import nltk` and then `nltk.download()`.  In this practical session, we will not use any of these corpora.  However, some NLTK functions rely on small data packages (for instance, the tokenizers use the Punkt models): if a call raises a `LookupError`, the error message indicates which `nltk.download(...)` command to run.

In [None]:
import nltk
#from nltk.book import *

Please write a list of words called `sentence1`, print its length (`len()`) and use `nltk.bigrams` to generate all bigrams from it, i.e. pairs of consecutive words.  You can see an example in [Sec. 3.3 of Ch. 1 of the NLTK book](http://www.nltk.org/book/ch01.html#collocations-and-bigrams).  Please also sort bigrams alphabetically.
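
As an illustration of the bigram operation described above, here is a minimal sketch on a toy word list (the list `words` and the variable names are chosen for this example only and are not part of the lab material).  It assumes that NLTK is installed.

```python
import nltk

# Toy word list, chosen only for illustration.
words = ["the", "cat", "sat", "on", "the", "mat"]
print(len(words))

# nltk.bigrams returns a generator, so wrap it in list() to reuse the pairs.
pairs = list(nltk.bigrams(words))

# Tuples are compared element by element: first word, then second word.
sorted_pairs = sorted(pairs)
print(sorted_pairs)
```

A list of *n* words yields *n* - 1 bigrams, which is a quick sanity check for your own result.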

Please define a string called `string2` with another sentence, not segmented into words yet.  Use the NLTK tokenizer (the function called `nltk.word_tokenize`) as explained in [Sec. 3.1 of the NLTK book](http://www.nltk.org/book/ch03.html#sec-accessing-text) to tokenize the string into a list of words called `sentence2`, and then print this list.
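
The following sketch shows what word tokenization does to a short string.  It makes one assumption that differs from the plain call requested above: it passes `preserve_line=True`, which skips the sentence-splitting step that `nltk.word_tokenize` performs internally, so that the example does not depend on the Punkt models being installed.  The string `text` is a toy example.

```python
import nltk

# Toy string, not segmented into words yet.
text = "Hello, world!"

# Assumption for this sketch: preserve_line=True skips the internal
# sentence-splitting step, so no Punkt data package is required.
tokens = nltk.word_tokenize(text, preserve_line=True)
print(tokens)
```

Note that punctuation marks become separate tokens, which is why the number of tokens is usually larger than the number of whitespace-separated words.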

## Using NLTK to download, tokenize, and save a text

With inspiration from [Chapter 3 (3.1. Processing Raw Text) of the NLTK book](http://www.nltk.org/book/ch03.html), get a text file from the Web, for instance a book from the Gutenberg Project.  Do your best to keep only the meaningful text from it, without the header and the final license.  What is the length of the entire book?  Is this a number of characters or words? (If you are curious, you can also refer to [Python's documentation of Unicode support](https://docs.python.org/3.8/howto/unicode.html).)
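
To help answer the question about characters versus words, here is a small sketch using only the Python standard library.  The byte string `raw_bytes` is a stand-in for downloaded data (no download is performed), and the variable names are hypothetical.

```python
# Stand-in for bytes received from the web; not a real download.
raw_bytes = "café noir".encode("utf8")

# Decoding turns bytes into a Python string of Unicode characters.
raw = raw_bytes.decode("utf8")

n_bytes = len(raw_bytes)    # number of bytes (é takes two bytes in UTF-8)
n_chars = len(raw)          # number of characters
n_words = len(raw.split())  # number of whitespace-separated words
print(n_bytes, n_chars, n_words)
```

The three numbers differ, so it is worth checking which object you are measuring before reporting the length of the book.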

In [None]:
from urllib import request  # urllib is part of the Python standard library: no installation is needed
# Please write your Python code below and execute it.


Determine, either by spotting the position of the initial and ending strings, or by trial and error, how much you should trim from the beginning and from the end, to keep only the actual text of the book.
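
One possible way to spot the positions is sketched below with `str.find` and `str.rfind` on a toy string.  The helper `trim_between` and the marker strings are hypothetical: the exact wording of the start and end lines varies between books, so please check them in your own file.

```python
def trim_between(raw, start_marker, end_marker):
    """Return the text between the line holding start_marker and end_marker."""
    start = raw.find(start_marker)
    end = raw.rfind(end_marker)
    if start == -1 or end == -1 or end <= start:
        raise ValueError("markers not found in the expected order")
    # Skip to the end of the line that contains the start marker.
    line_end = raw.find("\n", start)
    body_start = line_end + 1 if line_end != -1 else start + len(start_marker)
    return raw[body_start:end].strip()


# Toy text imitating a header, a body and a final license.
raw = (
    "Title and license header\n"
    "*** START OF THE DEMO EBOOK ***\n"
    "Chapter 1\nIt was a dark night.\n"
    "*** END OF THE DEMO EBOOK ***\n"
    "License text\n"
)
book = trim_between(raw, "*** START OF", "*** END OF")
print(book)
```

If the markers are absent, the helper raises an error rather than silently returning the wrong slice; trimming by trial and error with fixed offsets remains a valid alternative.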

In [None]:
# Please write your Python code below and execute it.


**Segment the text into sentences and words.**  Note that normally only word segmentation is called *tokenization*, but NLTK uses this name for both functions.  

You will first perform sentence segmentation and write the result to a file, one sentence per line.  Then, you will segment each sentence into tokens (words and punctuation marks), and write the result to another file, one sentence per line, with each token followed by a whitespace.  You will need the following NLTK functions:
* `nltk.sent_tokenize(...)` (documented [here](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize)) (note that the name "sentence tokenize" is a bit strange)
* `nltk.word_tokenize(...)` (documented [here](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize))

**Segment the text into sentences with NLTK.** Display the number of resulting sentences, and display a short excerpt (around 5 sentences).  What do you think about the quality of the segmentation?  *Note:* if you think that some special characters degrade the results, you can replace them in the full text with `.replace('s1', 's2')`.

In [None]:
# Please write your Python code in this cell and execute it.


Save a text version, with one sentence per line.  Inspect the file using a text editor.

In [None]:
import os

In [None]:
filename1 = "sample_text_1.txt"
# For a local file, this is the relative path with respect to the notebook
# For Google Colab, use e.g.: /content/gdrive/My Drive/sample_text_1.txt
if os.path.exists(filename1): 
    os.remove(filename1)
fd = open(filename1, 'a', encoding='utf8')
# Please write your Python code below and execute it.


**Now segment each sentence into tokens (i.e., words and punctuation marks).**  Store the result in a new variable (a list of lists) and display again a small sample.

In [None]:
# Please write your Python code in this cell and execute it.


How many tokens do you obtain in total?

In [None]:
# Please write your Python code in this cell and execute it.


Save the result as another file (one sentence per line, a whitespace after each token).  Inspect the file using a text editor.
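
The two output formats (one sentence per line, then one tokenized sentence per line) can be sketched as follows on hand-made data.  The file names, the temporary directory and the lists `sentences` and `tokenized` are assumptions made for this example; in the lab, they come from your own segmentation.

```python
import os
import tempfile

# Hand-made data standing in for the output of the segmentation steps.
sentences = ["I like\ntea.", "You like coffee."]
tokenized = [["I", "like", "tea", "."], ["You", "like", "coffee", "."]]

# Hypothetical output location, so that the sketch leaves no file behind
# in the notebook folder.
out_dir = tempfile.mkdtemp()
path_sentences = os.path.join(out_dir, "demo_sentences.txt")
path_tokens = os.path.join(out_dir, "demo_tokens.txt")

with open(path_sentences, "w", encoding="utf8") as fd:
    for sentence in sentences:
        # Collapse line breaks inside a sentence to keep one sentence per line.
        fd.write(" ".join(sentence.split()) + "\n")

with open(path_tokens, "w", encoding="utf8") as fd:
    for tokens in tokenized:
        # Each token is followed by a whitespace, then the line ends.
        fd.write("".join(token + " " for token in tokens) + "\n")

with open(path_sentences, encoding="utf8") as fd:
    sentence_lines = fd.read().splitlines()
with open(path_tokens, encoding="utf8") as fd:
    token_lines = fd.read().splitlines()
print(sentence_lines)
print(token_lines)
```

Opening the file in a `with` block closes it automatically, which ensures that all lines are actually written before you inspect the file in a text editor.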

In [None]:
filename2 = "sample_text_2.txt"
# For a local file, this is the relative path with respect to the notebook
# For Google Colab, use e.g.: /content/gdrive/My Drive/sample_text_2.txt
if os.path.exists(filename2): 
    os.remove(filename2)
fd = open(filename2, 'a', encoding='utf8')
# Please write your Python code below and execute it.


It is also possible to *tokenize a text without previously segmenting it into sentences*.  Please perform this operation, then display a short excerpt, and compare the resulting total number of tokens with the one obtained above.  (It is not necessary to write the result in a file.)

In [None]:
# Please write your Python code in this cell and execute it.


## Computing statistics over a text
You can create a `nltk.Text` object from the tokens of the text, without sentence segmentation.  This enables you to compute statistics using NLTK functions.  NLTK Texts can in fact store: (1) a string; (2) the list of all words (strings); (3) the list of all sentences (list of lists of strings).  In each case, the length of the object is different.  **However, only option (2) allows the correct use of predefined methods for NLTK Texts.**  Note that `nltk.word_tokenize()` and `nltk.sent_tokenize()` only apply to strings, not to `nltk.Text`.
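
The difference in length between the three options can be observed on a toy example.  This is only a sketch: the string `raw` and the hand-made lists `tokens` and `sentences` are assumptions made for this illustration.

```python
import nltk

raw = "I like tea. You like coffee."
tokens = ["I", "like", "tea", ".", "You", "like", "coffee", "."]
sentences = [tokens[:4], tokens[4:]]

len_from_string = len(nltk.Text(raw))           # (1) counts characters
len_from_tokens = len(nltk.Text(tokens))        # (2) counts tokens
len_from_sentences = len(nltk.Text(sentences))  # (3) counts sentences
print(len_from_string, len_from_tokens, len_from_sentences)
```

If the length of your own `nltk.Text` object equals the number of characters or the number of sentences, it was probably built from the wrong kind of input.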

Create and store a `nltk.Text` object by word tokenization of your entire text (no sentence tokenization).

In [None]:
# Please write your Python code in this cell and execute it.


[Chapter 1 of the NLTK book](http://www.nltk.org/book/ch01.html) provides examples of operations that can be done on texts.  For instance, you can search for a word in its context with `concordance`, or find words that are similar to a given one in terms of contexts with `similar`.  Please try these two functions and display a sample result.
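
As a hedged illustration of a concordance search, the sketch below builds a very small `nltk.Text` from hand-made tokens (an assumption made for this example).  It also uses `concordance_list`, which returns the matching lines as objects instead of printing them.

```python
import nltk

# Hand-made tokens; in the lab, use the tokens of your own text.
tokens = ["I", "like", "tea", ".", "You", "like", "coffee", "."]
text = nltk.Text(tokens)

# concordance() prints every occurrence of the word with its context.
text.concordance("like")

# concordance_list() returns the same matches, which is easier to post-process.
lines = text.concordance_list("like")
print(len(lines))
```

On a real book, the printed contexts are much more informative than on this toy text.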

In [None]:
# Please write your Python code in this cell and execute it.


In [None]:
# Please write your Python code in this cell and execute it.


Using `collocation_list`, please display the 10 most frequent collocations of your text.

In [None]:
# Please write your Python code in this cell and execute it.


One can compute the vocabulary of a text (i.e. the list of unique *word types*) by converting the list of words (the *tokens*) to a Python `set`.  Please compute the vocabulary of your text.  How many (different) words does it contain?  (This includes punctuation marks and other symbols identified in tokenization.  Capitalized and lowercase forms count as different words.)  Which words of more than 20 letters appear in the vocabulary?
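
The distinction between tokens and word types can be sketched on a toy list.  The tokens and the length threshold of 3 letters are chosen for this illustration only (the question above uses 20 letters on your own text).

```python
# Toy tokens, including a punctuation mark and two case variants.
tokens = ["The", "cat", "saw", "the", "other", "cat", "."]

# A set keeps each distinct token (word type) exactly once.
vocabulary = set(tokens)
print(len(tokens), len(vocabulary))

# Word types strictly longer than 3 letters, sorted for a stable display.
long_words = sorted(word for word in vocabulary if len(word) > 3)
print(long_words)
```

Here "The" and "the" are two different types; applying `.lower()` to each token before building the set would merge them.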

In [None]:
# Please write your Python code in this cell and execute it.


## Frequency Distributions
You can ask NLTK to compute word frequencies for a given text, yielding a new object called a frequency distribution (`FreqDist`): see [Sec. 3.1 of Ch. 1 of the NLTK book](http://www.nltk.org/book/ch01.html#frequency-distributions).  Using this object, you can ask for the most common (i.e. frequent) words.  
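
A frequency distribution can be tried first on a toy list of tokens, as in the following sketch (the token list is an assumption made for this example).

```python
import nltk

# Toy tokens obtained by splitting on whitespace, for illustration only.
tokens = "the cat and the dog and the bird".split()

fdist = nltk.FreqDist(tokens)

# The two most frequent words, with their counts.
top = fdist.most_common(2)
print(top)

# Count of one word, and total number of tokens counted.
print(fdist["the"], fdist.N())
```

`most_common(n)` returns a list of `(word, count)` pairs, which you can then filter, for instance by word length.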

Please construct the frequency distribution of your text.

In [None]:
# Please write your Python code in this cell and execute it.


Can you display words strictly longer than 4 characters among the 70 most frequent words?

In [None]:
# Please write your Python code in this cell and execute it.


## Graphical displays
Python can display various plots regarding word statistics, using _matplotlib_. Once it is installed, you need to run e.g. `import matplotlib.pyplot as plt` and then `%matplotlib inline`.  You can use two lists (e.g. x_values and y_values) to generate a plot with `plt.plot(x_values, y_values)`.  Or you can use the plotting functions from NLTK.

In [None]:
# Before using matplotlib to display graphs inline, you must execute the following two lines.
import matplotlib.pyplot as plt
%matplotlib inline

Again using [Sec. 3.1 of Ch. 1 of the NLTK book](http://www.nltk.org/book/ch01.html#frequency-distributions), display the cumulative frequency plot of the 70 most frequent words of your text.

In [None]:
# Please write your Python code in this cell and execute it.


Can you build a list with the length of each word (instead of the word), then use this list to create a new FreqDist object, and plot the frequency distribution?  (Not the cumulative one.)  What is the most frequent length?  What can you observe about the ordering of the lengths by decreasing frequency?
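
The idea of counting lengths instead of words can be sketched on a few hand-made tokens (an assumption made for this example).  The plot itself is left out of the sketch, as it needs a graphical display.

```python
import nltk

# Hand-made tokens; in the lab, use the tokens of your own text.
tokens = ["I", "like", "tea", "and", "you", "like", "coffee"]

# Replace each token by its length, then count the lengths.
lengths = [len(token) for token in tokens]
length_dist = nltk.FreqDist(lengths)

# max() returns the sample with the highest count.
most_frequent_length = length_dist.max()
print(length_dist.most_common())
print(most_frequent_length)
```

In the notebook, calling `length_dist.plot()` on such an object displays the (non-cumulative) distribution.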

In [None]:
# Please write your Python code in this cell and execute it.


**Zipf's Law.** Please produce a list of the number of occurrences of each word, in decreasing order.  Plot (for about 100 ranks) the number of occurrences on the *y* axis and the rank of each value (1st, 2nd, 3rd, ...) on the *x* axis.  Then add a curve of the shape *y = a/(x+b)*, adjusting *a* and *b* as closely as you can so that the two curves overlap.
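
The computation behind this plot can be sketched without any graphics, using only the standard library.  The toy tokens, the helper names and the values of `a` and `b` are assumptions made for this illustration; the squared error is just one possible way to judge how close the two curves are, in addition to visual inspection.

```python
from collections import Counter


def ranked_counts(tokens):
    """Numbers of occurrences of each word, in decreasing order."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_curve(ranks, a, b):
    """Values of y = a / (x + b) for each rank x."""
    return [a / (x + b) for x in ranks]


def squared_error(observed, predicted):
    """Sum of squared differences between the two curves."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


# Toy tokens: "the" 6 times, "of" 3 times, "and" twice, "cat" once.
tokens = ("the " * 6 + "of " * 3 + "and " * 2 + "cat").split()

counts = ranked_counts(tokens)
ranks = list(range(1, len(counts) + 1))

# Hypothetical parameter values, to be adjusted by hand on real data.
fitted = zipf_curve(ranks, a=6.0, b=0.0)
error = squared_error(counts, fitted)
print(counts, fitted, error)
```

In the notebook, `ranks` and `counts` can be passed to `plt.plot`, followed by a second call with `ranks` and `fitted` to draw the fitted curve on the same axes.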

In [None]:
# Please write your Python code in this cell and execute it.


## End of Lab 1b
Please save the completed notebook, add it to a *zip* file with 1c, and upload the archive to Moodle.