# Corpus and Lexicon

## Objectives
- Understanding: 
    - relation between corpus and lexicon
    - effects of pre-processing (tokenization) on lexicon
    
- Learning how to:
    - load basic corpora for processing
    - compute basic descriptive statistics of a corpus
    - build a lexicon and frequency lists from a corpus
    - perform basic lexicon operations
    - perform basic text pre-processing (tokenization and sentence segmentation) using python libraries

### Recommended Reading
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)
- Steven Bird, Ewan Klein, and Edward Loper. [__Natural Language Processing with Python__ (NLTK)](https://www.nltk.org/book/)

### Covered Material
- SLP
    - [Chapter 2: Regular Expressions, Text Normalization, Edit Distance](https://web.stanford.edu/~jurafsky/slp3/2.pdf) 
- NLTK 
    - [Chapter 2: Accessing Text Corpora and Lexical Resources](https://www.nltk.org/book/ch02.html)
    - [Chapter 3: Processing Raw Text](https://www.nltk.org/book/ch03.html)

### Requirements (not needed if you have installed the provided environment)

- [NLTK](http://www.nltk.org/)
    - run `pip install nltk`
    
- [spaCy](https://spacy.io/)
    - run `pip install spacy`
    - run `python -m spacy download en_core_web_sm` to install the English language model (`spacy>=3.0`)

- [scikit-learn](https://scikit-learn.org/)
    - run `pip install scikit-learn`
    

## 0. Python Basics

This section briefly reviews the basic data structures of the Python language. It is not a substitute for the suggested Python guides.

### 0.1. Lists
Lists are one of Python's four built-in collection types (the others are dictionaries, tuples, and sets). They can store multiple items of any type (e.g. objects, functions, strings, integers, etc.). To declare a list we use square brackets `[]`.

 

In [1]:
colors = ['green', 'blue', 'yellow']
random_numbers = [42, 7, 3, 128]
for c in colors:
    print(c)
print('The length is', len(colors))
print('-'*89)
# Lists can contain items of any type
for n in random_numbers:
    colors.append(n)
    
for elem in colors:
    print(elem)
print('The length is', len(colors))

green
blue
yellow
The length is 3
-----------------------------------------------------------------------------------------
green
blue
yellow
42
7
3
128
The length is 7


**Note:** In Python, assignment does not copy an object; it binds a new name to the same object. Function arguments are passed the same way, which is often called "call by object reference". To better understand this important aspect of Python, check [this](https://www.geeksforgeeks.org/is-python-call-by-reference-or-call-by-value/) out.

In [2]:
# Assignment binds tmp_colors to the same list object as colors (no copy is made)
colors = ['green', 'blue', 'yellow']
tmp_colors = colors
tmp_colors.append('ALPHA')
print('Colors:', colors)
print('Tmp colors:', tmp_colors)

Colors: ['green', 'blue', 'yellow', 'ALPHA']
Tmp colors: ['green', 'blue', 'yellow', 'ALPHA']


In [3]:
colors = ['green', 'blue', 'yellow']
tmp_colors = []
for c in colors:
    tmp_colors.append(c)
    
tmp_colors.append('ALPHA')
print('Colors:', colors)
print('Tmp colors:', tmp_colors)

Colors: ['green', 'blue', 'yellow']
Tmp colors: ['green', 'blue', 'yellow', 'ALPHA']
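
The copy loop above can also be written with Python's built-in copy idioms. The following is a minimal sketch; the variable names (`copy_a`, `nested`, etc.) are illustrative and not part of the original notebook. It also shows that these copies are shallow: nested lists are still shared unless `copy.deepcopy` is used.

```python
import copy

colors = ['green', 'blue', 'yellow']

# Three equivalent ways to make a shallow copy of a list
copy_a = colors.copy()
copy_b = list(colors)
copy_c = colors[:]

copy_a.append('ALPHA')
print('Colors:', colors)   # the original list is unchanged
print('Copy:', copy_a)

# Shallow copies still share nested objects; deepcopy duplicates them too
nested = [['green', 'blue'], ['yellow']]
shallow = nested.copy()
deep = copy.deepcopy(nested)
nested[0].append('ALPHA')
print('Shallow:', shallow)  # sees the change made through `nested`
print('Deep:', deep)        # independent of `nested`
```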


In [None]:
# This is called a list comprehension
# It's basically a compact for loop with append
# It's mainly used for copying or filtering lists
tmp = [c for c in colors]
print(tmp)
tmp = [c for c in colors if c == "blue"]
print(tmp)

### 0.2. Dictionaries
Dictionaries store information as key: value pairs. Keys are unique; duplicates are not allowed.

In [4]:
capitals = {"Italy": "Rome", "U.S.A": "Washington D.C.",  "Japan": "Tokyo", "Asaland": "Asgard", "Galactic Empire": "Coruscant"}
for key, value in capitals.items():
    print("Country:", key, "Capital:", value)
print("♦"*89)
countries = [country for country in capitals]
print("Countries:", countries)
# OR
countries = [country for country in capitals.keys()]
print("Countries:", countries)
print("♦"*89)
# use a new name so that the capitals dict is not overwritten
capital_names = [capital for capital in capitals.values()]
print("Capitals:", capital_names)

Country: Italy Capital: Rome
Country: U.S.A Capital: Washington D.C.
Country: Japan Capital: Tokyo
Country: Asaland Capital: Asgard
Country: Galactic Empire Capital: Coruscant
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦
Countries: ['Italy', 'U.S.A', 'Japan', 'Asaland', 'Galactic Empire']
Countries: ['Italy', 'U.S.A', 'Japan', 'Asaland', 'Galactic Empire']
♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦
Capitals: ['Rome', 'Washington D.C.', 'Tokyo', 'Asgard', 'Coruscant']
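
To illustrate what "keys are unique" means in practice, here is a small sketch on a shortened, illustrative version of the dictionary above. Assigning to an existing key replaces its value, membership tests look at keys, and `.get` avoids a `KeyError` for missing keys.

```python
capitals = {"Italy": "Rome", "Japan": "Tokyo"}

# Assigning to an existing key overwrites its value instead of adding a duplicate
capitals["Japan"] = "Kyoto"
capitals["Japan"] = "Tokyo"
print(len(capitals))

# Membership tests look at keys, not values
print("Italy" in capitals)
print("Rome" in capitals)

# .get returns a default instead of raising KeyError for a missing key
print(capitals.get("France", "unknown"))
```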


## 1. Corpora and Counting

### 1.1. Corpus

A [corpus](https://en.wikipedia.org/wiki/Text_corpus) is a collection of written or spoken texts that is used for language research. Before doing anything with a corpus, we need to know its properties:

__Corpus Properties__:
- *Format* -- how to read/load it?
- *Natural Language* -- which tools/models can I use?
- *Annotation* -- what is it intended for?
- *Split* for __Evaluation__: (terminology varies from source to source)

| Set         | Purpose                                       |
|:------------|:----------------------------------------------|
| Training    | training model, extracting rules, etc.        |
| Development | tuning, optimization, intermediate evaluation |
| Test        | final evaluation (remains unseen)             |
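
When a corpus does not come with a predefined split, one common approach is to shuffle its sentences and cut them into three parts. The sketch below assumes an 80/10/10 ratio and a fixed random seed; the function name `split_corpus`, the ratios, and the toy data are all illustrative choices, not something prescribed by a particular corpus.

```python
import random

def split_corpus(sentences, train=0.8, dev=0.1, seed=0):
    """Shuffle sentences and split them into training, development, and test sets."""
    shuffled = list(sentences)              # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split reproducible
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    train_set = shuffled[:n_train]
    dev_set = shuffled[n_train:n_train + n_dev]
    test_set = shuffled[n_train + n_dev:]   # whatever remains is the test set
    return train_set, dev_set, test_set

toy_corpus = ['sentence {}'.format(i) for i in range(20)]
train_set, dev_set, test_set = split_corpus(toy_corpus)
print(len(train_set), len(dev_set), len(test_set))
```

The three sets are disjoint by construction, so the test set stays unseen during training and tuning.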


#### 1.1.1. Text Corpora in NLTK
NLTK provides several corpora with loading functions. Plain text corpora come from _Project Gutenberg_.

`nltk.corpus.gutenberg.fileids()` lists available books.

In [5]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to /Users/eva01/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /Users/eva01/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

#### 1.1.2. Units of Text Corpus
Depending on the goal, a corpus can be seen as a sequence of:
- characters
- words (tokens)
- sentences
- paragraphs
- documents

Each level, in turn, can be seen as a sequence of elements of the previous level.

- word -- a sequence of characters
- sentence -- a sequence of words
- paragraph -- a sequence of sentences
- document -- a sequence of paragraphs (or sentences, depending on our purpose)
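
The nesting of levels can be sketched with plain string operations on a toy document. The segmentation rules below (blank line for paragraphs, `". "` for sentences, whitespace for words) are deliberately naive assumptions for illustration; real tokenizers handle many more cases.

```python
document = "Alice was tired. She saw a rabbit.\n\nThe rabbit ran. Alice followed it."

# Naive segmentation: blank lines separate paragraphs, '. ' separates sentences,
# and whitespace separates words.
paragraphs = document.split("\n\n")
sentences = [p.rstrip(".").split(". ") for p in paragraphs]
words = [[s.split() for s in para] for para in sentences]

print(len(paragraphs))        # number of paragraphs in the document
print(sentences[0])           # sentences of the first paragraph
print(words[0][0])            # words of the first sentence
print(list(words[0][0][0]))   # characters of the first word
```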

#### 1.1.3. Loading NLTK Corpora

NLTK provides functions to load a corpus at these different levels: `raw` (characters), `words`, and `sents` (sentences).

In [7]:
alice_chars = nltk.corpus.gutenberg.raw('carroll-alice.txt')
print('chars:', alice_chars[0:10])
alice_words = nltk.corpus.gutenberg.words('carroll-alice.txt')
print('words:', alice_words[0:10])
alice_sents = nltk.corpus.gutenberg.sents('carroll-alice.txt')
print('sents:', alice_sents[0:10])

chars: [Alice's A
words: ['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll']
sents: [['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']'], ['CHAPTER', 'I', '.'], ['Down', 'the', 'Rabbit', '-', 'Hole'], ['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'", 'without', 'pictures', 'or', 'conversation', "?'"], ['So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid', '),', 'whether', 

### 1.2. Corpus Descriptive Statistics (Counting)

A *corpus* can be described in terms of:

- total number of characters
- total number of words (_tokens_: includes punctuation, etc.)
- total number of sentences

- minimum/maximum/average number of characters per token
- minimum/maximum/average number of words per sentence
- minimum/maximum/average number of sentences per document


__Example__

$$\text{Avg. Tokens per Sentence} = \frac{\text{count}(tokens)}{\text{count}(sentences)}$$


In [8]:
# let's compute average sentence length & round to the closest integer
round(len(alice_words)/len(alice_sents))

20

In [9]:
# let's compute length of each sentence
sent_lens = [len(sent) for sent in alice_sents]
# let's compute length of each word
word_lens = [len(word) for word in alice_words]
# let's compute the number of characters in each sentence
chars_lens = [len(''.join(sent)) for sent in alice_sents]

avg_sent_len = round(sum(sent_lens)/len(sent_lens))
min_sent_len = min(sent_lens)
max_sent_len = max(sent_lens)
print("AVG sent len", avg_sent_len)
print("MIN sent len", min_sent_len)
print("MAX sent len", max_sent_len)

AVG sent len 20
MIN sent len 2
MAX sent len 204


In [10]:
# str.join example: join is a string method, called on the separator
tmp = ['H', 'e', 'l', 'l', 'o']
print(' '.join(tmp))
print('⭐'.join(tmp))
print(' '.join(tmp).split())

H e l l o
H⭐e⭐l⭐l⭐o
['H', 'e', 'l', 'l', 'o']


#### Exercise 1

- Define a function to compute corpus descriptive statistics

    - input:
        - raw text (Chars)
        - words
        - sentences
    - output (print): 
        - average number of:
            - chars per word
            - words per sentence
            - chars per sentence
        - Size of the longest word and sentence


In [None]:
def statistics(words, sents):
    word_lens = # Add word lens
    sent_lens = # Add sentence lens
    chars_in_sents = # Add char lens
    
    word_per_sent = round(sum(sent_lens) / len(sents))
    char_per_word = round(sum(word_lens) / len(words))
    char_per_sent = round(sum(chars_in_sents) / len(sents))
    
    longest_sentence = # max(...)
    longest_word = # max(...)
    
    return word_per_sent, char_per_word, char_per_sent, longest_sentence, longest_word

word_per_sent, char_per_word, char_per_sent, longest_sent, longest_word = statistics(alice_words, alice_sents)

print('Word per sentence', word_per_sent)
print('Char per word', )
print('Char per sentence', )
print('Longest sentence', )
print('Longest word', )

## 2. Lexicon

A [lexicon](https://en.wikipedia.org/wiki/Lexicon) is the *vocabulary* of a language. In linguistics, a lexicon is a language's inventory of lexemes.

Linguistic theories generally regard human languages as consisting of two parts: a lexicon, essentially a catalog of a language's words; and a grammar, a system of rules which allow for the combination of those words into meaningful sentences. 

*Lexicon (or Vocabulary) Size* is one of the statistics reported for corpora. While *Word Count* is the number of __tokens__, *Lexicon Size* is the number of __types__ (unique words).

#### Token vs Word
The ***tokens*** are the elements of a sentence, and they are used to count the **occurrences** of a word. ***Words*** (also called *types*), in contrast, are the **unique** elements that make up the lexicon or vocabulary of a corpus. We can think of words as classes and tokens as instances of those classes.

<br>

**For example**:
<br> 
-   How many tokens are there in the sentence "***to be or not to be***"? 
    - Answer: 6

-   How many words?
    -   Answer: 4
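
The two answers can be checked with a couple of lines of Python. This is a minimal sketch that splits on whitespace, which is enough for this sentence; the variable names are illustrative.

```python
sentence = "to be or not to be"
tokens = sentence.split()      # every occurrence counts
types = set(tokens)            # each distinct word counts once
print(len(tokens), len(types))
```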

### 2.1. Lexicon and Its Size

#### 2.1.1. Constructing Lexicon and Computing its Size

Since a lexicon is a collection of unique elements, it can be built as a `set` of the corpus words (i.e. tokens).
Consequently, its size is the size of the set.

In [None]:
alice_lexicon = set(alice_words)
len(alice_lexicon)

__NOTE__:
We did not process our corpus in any way. Consequently, words with case variations are different entries in our lexicon.

In [None]:
print('ALL' in alice_lexicon)
print('All' in alice_lexicon)
print('all' in alice_lexicon)

#### 2.1.2. Lowercased Lexicon
Let's lowercase our corpus and re-compute the lexicon size.

In [None]:
alice_lexicon = set([w.lower() for w in alice_words])
len(alice_lexicon)

In [None]:
print('ALL' in alice_lexicon)
print('All' in alice_lexicon)
print('all' in alice_lexicon)

### 2.2. Frequency List

In Natural Language Processing (NLP), [a frequency list](https://en.wikipedia.org/wiki/Word_lists_by_frequency) is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.

What is a "word"?

- case sensitive counts
- case insensitive counts
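
The difference between the two counting schemes, and how rank follows from the sorted counts, can be seen on a toy sentence. This is a sketch with made-up data; `collections.Counter` is used here only because it is part of the standard library.

```python
from collections import Counter

tokens = "All that glitters is not gold and ALL that is gold is not all".split()
case_sensitive = Counter(tokens)
case_insensitive = Counter(t.lower() for t in tokens)

# 'all', 'All' and 'ALL' are separate entries unless we lowercase first
print(case_sensitive['all'], case_sensitive['All'], case_sensitive['ALL'])
print(case_insensitive['all'])

# Rank is the 1-based position in the list sorted by decreasing frequency
ranked = sorted(case_insensitive.items(), key=lambda item: item[1], reverse=True)
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
```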

#### 2.2.1. Computing Frequency List with python

In Python, a frequency list can be constructed in several ways. The most convenient is `collections.Counter`.

In [None]:
from collections import Counter
alice_freq_list = Counter(alice_words)

In [None]:
print(alice_freq_list.get('ALL', 0))
print(alice_freq_list.get('All', 0))
print(alice_freq_list.get('all', 0))

#### 2.2.2. Computing Frequency List with NLTK
NLTK provides the `FreqDist` class to construct a frequency list (`FreqDist` == _Frequency Distribution_).

In [None]:
alice_freq_dist = nltk.FreqDist(alice_words)

In [None]:
print(alice_freq_dist.get('ALL', 0))
print(alice_freq_dist.get('All', 0))
print(alice_freq_dist.get('all', 0))

#### Exercise 2

- compute frequency list of __lowercased__ "alice" corpus (you can use either method)
- report the `5` most frequent words (you can use the provided `nbest` function to get a dict of the top N items)
- compare the frequencies to the reference values below

| Word   | Frequency |
|--------|----------:|
| ,      |     1,993 |
| '      |     1,731 |
| the    |     1,642 |
| and    |       872 |
| .      |       764 |


In [None]:
def nbest(d, n=1):
    """
    get n max values from a dict
    :param d: input dict (values are numbers, keys are strings)
    :param n: number of values to get (int)
    :return: dict of top n key-value pairs
    """
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])
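
As an alternative to `nbest`, `Counter` objects offer a `most_common(n)` method (NLTK's `FreqDist` inherits it, since it is a `Counter` subclass). The sketch below uses a toy word list, not the alice corpus, and shows the difference in return type: a list of pairs rather than a dict.

```python
from collections import Counter

toy_freq = Counter("the cat and the dog and the bird".split())

# most_common(n) returns a list of (word, count) pairs, most frequent first
top2 = toy_freq.most_common(2)
print(top2)

# wrap the pairs in dict() to get the same shape that nbest returns
print(dict(top2))
```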

In [None]:
alice_lowercase_freq_list = # Counter(X) # Replace X with the word list of the corpus in lower case (see above)
nbest(alice_lowercase_freq_list, n=1) # Change N from 1 to 5

### 2.3. Lexicon Operations

It is common to process the lexicon according to the task at hand (not every transformation makes sense for all tasks). Common operations are removing words by frequency (minimum or maximum, i.e. *Frequency Cut-Off*) and removing words that appear in a specific list (i.e. *Stop Word Removal*).

#### 2.3.1. Frequency Cut-Off

##### Exercise 3

<!-- - define a function to compute a lexicon from a frequency list applying minimum and maximum frequency cut-offs
    
    - input: frequence list (dict)
    - output: list
    - use default values for min and max
     -->
- Using the function cut_off
    
    - compute lexicon applying:
    
        - minimum cut-off 2 (remove words that appear less than 2 times, i.e. remove [hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon))
        - maximum cut-off 100 (remove words that appear more than 100 times)
        - both minimum and maximum thresholds together
        
    - report the size of each, comparing it to the reference values in the table (on the lowercased lexicon)

| Operation  | Min | Max | Size |
|------------|----:|----:|-----:|
| original   | N/A | N/A | 2636 |
| cut-off    |   2 | N/A | 1503 |
| cut-off    | N/A | 100 | 2586 |
| cut-off    |   2 | 100 | 1453 |


In [None]:
def cut_off(vocab, n_min=float("-inf"), n_max=float("inf")):
    new_vocab = []
    for word, count in vocab.items():
        if count >= n_min and count <= n_max:
            new_vocab.append(word)
    return new_vocab

lower_bound = float("-inf") # Change these two bounds to integers to compute the required cut-offs
upper_bound = float("inf")
lexicon_cut_off = len(cut_off(alice_lowercase_freq_list, n_min=lower_bound, n_max=upper_bound))

print('Original', len(alice_lowercase_freq_list))
print('CutOFF Min:', lower_bound, 'MAX:', upper_bound, ' Lexicon Size:', lexicon_cut_off)
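
To see what each threshold does before applying it to the alice corpus, here is a sketch on a toy frequency list. The helper `keep_in_range` is a hypothetical stand-in that applies the same inclusive rule as `cut_off`; the sentence and the thresholds (2 and 3) are made up for illustration.

```python
from collections import Counter

toy_freq = Counter("the cat saw the dog and the dog saw the bird".split())
# counts: the=4, saw=2, dog=2, cat=1, and=1, bird=1

def keep_in_range(freq, n_min, n_max):
    """Return the words whose count lies in [n_min, n_max] (both ends inclusive)."""
    return [w for w, c in freq.items() if n_min <= c <= n_max]

no_hapax = keep_in_range(toy_freq, 2, float("inf"))      # drop words seen only once
no_frequent = keep_in_range(toy_freq, float("-inf"), 3)  # drop words seen more than 3 times
both = keep_in_range(toy_freq, 2, 3)                     # apply both thresholds
print(sorted(no_hapax))
print(sorted(no_frequent))
print(sorted(both))
```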

#### 2.3.2. StopWord Removal

In computing, [stop words](https://en.wikipedia.org/wiki/Stop_words) are words filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Any group of words can be chosen as the stop words for a given purpose.
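
For instance, a task-specific stop word list can be an ordinary Python `set`, and removal is a filter over the tokens. The list below is hypothetical and chosen only for this toy sentence.

```python
tokens = "the rabbit ran down the hole and alice ran after it".split()
custom_stop_words = {"the", "and", "it", "down", "after"}   # hypothetical list

# keep every token that is not in the stop word set (order and repeats preserved)
content_tokens = [t for t in tokens if t not in custom_stop_words]
print(content_tokens)
```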

Let's check the stop word lists from the popular python libraries.

- spaCy
- NLTK
- scikit-learn

    
For NLTK, we need to download them first:

```python
import nltk
nltk.download('stopwords')
```

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOP_WORDS
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as SKLEARN_STOP_WORDS
from nltk.corpus import stopwords

# nltk.download('stopwords') # Run only once

NLTK_STOP_WORDS = set(stopwords.words('english'))

print('spaCy: {}'.format(len(SPACY_STOP_WORDS)))
print('NLTK: {}'.format(len(NLTK_STOP_WORDS)))
print('sklearn: {}'.format(len(SKLEARN_STOP_WORDS)))
print(NLTK_STOP_WORDS)

##### Exercise 4
- using Python's built-in `set` [methods](https://docs.python.org/3/library/stdtypes.html#set):
    - compute the intersection between the 100 most frequent words in the frequency list of the alice corpus and the list of stopwords (report count)
    - remove stopwords from the lexicon
    - print the size of:
        - original lexicon
        - lexicon without stopwords
        - overlap between 100 most freq. words and stopwords

| Operation       | Size |
|-----------------|-----:|
| original        | 2636 |
| no stop words   | 2490 |
| top 100 overlap |   65 |

In [None]:
# Built-in set methods
set_a = set(['a', 'b', 'c', 'd', 'e'])
set_b = set(['a', 'b', 'f'])

print(set_a.intersection(set_b)) # Compute overlap
print(set_a.difference(set_b)) # Remove Elements by computing the set diff

In [None]:
alice_vocab = set([w.lower() for w in alice_words])
top100 = list(nbest(alice_lowercase_freq_list,n=100).keys())
stop_words = NLTK_STOP_WORDS
overlap = # Compute the intersection between top100 and stop_words
alice_vocab_no_stopwords = # Remove Stopwords from alice vocab
print('Original', len(alice_vocab))
print('No stopwords', len(alice_vocab_no_stopwords))
print('Top100 overlap', len(overlap))

## 3. Basic Text Pre-processing

Both frequency cut-off and stop word removal are frequently used text pre-processing steps. Depending on the application, there are several other common text pre-processing steps that are usually applied for transforming text for Machine Learning tasks.

__Text Normalization Steps__

- removing extra white spaces

- tokenization
    - documents to sentences (sentence segmentation/tokenization)
    - sentences to tokens

- lowercasing/uppercasing


- removing punctuation

- removing accent marks and other diacritics 

- removing stop words (see above)

- removing sparse terms (frequency cut-off)

- number normalization
    - numbers to words (i.e. `10` to `ten`)
    - number words to numbers (i.e. `ten` to `10`)
    - removing numbers

- verbalization (specifically for speech applications)

    - numbers to words
    - expanding abbreviations (or spelling out)
    - reading out dates, etc.
    

- [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation)
    - the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

- [stemming](https://en.wikipedia.org/wiki/Stemming)
    - the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.
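
A few of the steps above can be sketched with the standard library alone. The function name `normalize` and the order of the steps are assumptions made for illustration; which steps to apply, and in which order, depends on the task. Only ASCII punctuation (`string.punctuation`) is removed here.

```python
import re
import string
import unicodedata

def normalize(text):
    """Apply a few of the normalization steps listed above to a string."""
    text = re.sub(r"\s+", " ", text).strip()   # remove extra white space
    text = text.lower()                        # lowercase
    # remove accent marks: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text

print(normalize("  The   Café  was OPEN,  wasn't it?  "))
```

Note that removing punctuation turns `wasn't` into `wasnt`, which is one reason these steps should be chosen with the downstream task in mind.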


### 3.1. Tokenization and Sentence Segmentation

Given a "clean" text, in order to perform any analysis, we need to identify its units.
In other words, we need to _segment_ the text into sentences and words.

__NOTE__:
Since both _tokenization_ and _sentence segmentation_ are automatic, different tools yield different results.
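
Even two very simple tokenizers disagree on the same sentence. The sketch below compares whitespace splitting with a small regular expression that separates punctuation from words; both are illustrative stand-ins and not the rules used by spaCy or NLTK.

```python
import re

text = "Alice's sister didn't read."

# Tokenizer 1: split on whitespace, punctuation stays attached to words
whitespace_tokens = text.split()

# Tokenizer 2: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

Because the token counts differ, every statistic derived from them (lexicon size, frequencies, average sentence length) differs as well.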

#### 3.1.1. Tokenization and Sentence Segmentation with spaCy
The default spaCy NLP pipeline does several processing steps including __tokenization__, *part of speech tagging*, lemmatization, *dependency parsing* and *Named Entity Recognition* (we will see the ones in *italics* during the course). 


spaCy produces a `Doc` object that contains `Span`s (sentences) and `Token`s.

In [None]:
import spacy
# Fallback, if spacy.load() cannot find the model by name:
# import en_core_web_sm
# nlp = en_core_web_sm.load()
nlp = spacy.load("en_core_web_sm",  disable=["tagger", "ner"])
txt = alice_chars

In [None]:
# process the document
doc = nlp(txt)

In [None]:
print("first token: '{}'".format(doc[0]))
print("first sentence: '{}'".format(list(doc.sents)[0]))

In [None]:
# access list of tokens (Token objects)
print(len(doc))
# access list of sentences (Span objects)
print(len(list(doc.sents)))

#### 3.1.2. Tokenization and Sentence Segmentation with NLTK
NLTK's [tokenize](https://www.nltk.org/api/nltk.tokenize.html) package provides similar functionality using the methods below.

- `word_tokenize` 
- `sent_tokenize`

There are several tokenizers available (read the documentation for more information).

In [None]:
# download NLTK tokenizer models
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

In [None]:
alice_words_nltk = nltk.word_tokenize(alice_chars)
alice_sents_nltk = nltk.sent_tokenize(alice_chars)
print(len(alice_words_nltk))
print(len(alice_sents_nltk))

In [None]:
print("first token: '{}'".format(alice_words_nltk[0]))
print("first sentence: '{}'".format(alice_sents_nltk[0]))

## Last Exercise
- Load another corpus from Gutenberg (e.g. `milton-paradise.txt`)
- On this corpus, compute the descriptive statistics using the provided characters, tokens, and sentences (`.raw`, `.words`, `.sents`) as __reference__
    - This gives you the "reference" version
- Tokenize the provided raw corpus and segment it into sentences using the `spaCy` and `NLTK` libraries. Compute the descriptive statistics on the outcome
    - This gives you the "spaCy" and "NLTK" versions
- Compute lowercased lexicons for all 3 versions (reference, spaCy, NLTK) of the corpus
    - compare lexicon sizes
- Compute frequency distribution for all 3 versions (reference, spaCy, NLTK) of the corpus
    - compare top N frequencies