# Day 2 - NLTK - Continued

# Text Corpora and Lexical Resources

Based on the NLTK book:

["Accessing Text Corpora and Lexical Resources"](https://www.nltk.org/book/ch02.html)

In [None]:
import nltk

## NLTK Text Corpora

NLTK includes many text collections (corpora) and other language resources, listed here: http://www.nltk.org/nltk_data/

Additional information:
* [NLTK Corpus How-to](http://www.nltk.org/howto/corpus.html)

To use these resources, you may need to download them first with `nltk.download()`.
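A minimal sketch of downloading a resource only when it is missing. The helper name `ensure_resource` and its `downloader` parameter are hypothetical (not part of NLTK); the sketch relies on `nltk.data.find()` raising `LookupError` for a resource that is not installed.

```python
import nltk


def ensure_resource(resource_path, package_id, downloader=nltk.download):
    """Download `package_id` only if `resource_path` is not installed yet.

    Hypothetical helper: returns True when a download was triggered,
    False when the resource was already available locally.
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError if missing
        return False
    except LookupError:
        downloader(package_id)
        return True
```

For example, `ensure_resource("corpora/gutenberg", "gutenberg")` would fetch the Gutenberg corpus on first use and do nothing on later runs.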

---

**NLTK book: ["Text Corpus Structure"](https://www.nltk.org/book/ch02#text-corpus-structure)**

There are different types of corpora:
* simple collections of text (e.g. Gutenberg corpus)
* categorized (texts are grouped into categories that might correspond to genre, source, author)
* temporal, demonstrating language use over a time period (e.g. news texts)

![Types of NLTK corpora](https://www.nltk.org/images/text-corpus-structure.png)

There are also annotated text corpora that contain linguistic annotations, representing POS tags, named entities, semantic roles, etc. 

### 1) Gutenberg Corpus

NLTK includes a small selection of texts (= multiple files) from the Project Gutenberg electronic text archive:

In [None]:
# let's explore its contents:

nltk.corpus.gutenberg.fileids()

In [None]:
# "Emma" by Jane Austen

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

print(emma)

In [None]:
# you can access corpus texts as characters, words (tokens) or sentences:

file_id = 'austen-emma.txt'

print("\nSentences:")
print( nltk.corpus.gutenberg.sents(file_id)[:3] )

print("\nWords:")
print( nltk.corpus.gutenberg.words(file_id)[:10] )

print("\nChars:")
print( nltk.corpus.gutenberg.raw(file_id)[:50] )

See https://www.nltk.org/book/ch02#gutenberg-corpus for how to compute statistics over words, sentences and characters (e.g. the average number of words per sentence).
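As a rough sketch of those statistics, the hypothetical helper `corpus_stats` below takes the three views of a text (raw characters, words, sentences) and returns three averages. It is run on toy data so that it works without any downloads; with NLTK you would pass `gutenberg.raw(file_id)`, `gutenberg.words(file_id)` and `gutenberg.sents(file_id)` instead.

```python
def corpus_stats(raw, words, sents):
    """Return (chars per word, words per sentence, uses per vocabulary item).

    Note: "chars per word" counts whitespace in `raw`, so it slightly
    overestimates the real average word length.
    """
    vocab = {w.lower() for w in words}
    return (
        len(raw) / len(words),
        len(words) / len(sents),
        len(words) / len(vocab),
    )


# Toy data standing in for the three corpus views
toy_raw = "The whale swam. The ship sailed."
toy_sents = [["The", "whale", "swam", "."], ["The", "ship", "sailed", "."]]
toy_words = [w for sent in toy_sents for w in sent]

stats = corpus_stats(toy_raw, toy_words, toy_sents)
print(stats)
```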

---

### 2) Brown corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.

This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on.

In [None]:
# Brown corpus categories list:

from nltk.corpus import brown
brown.categories()

In [None]:
# We can filter the corpus by (a) one or more categories or (b) file IDs:

print(brown.sents(categories='science_fiction')[:2])

In [None]:
for sent in brown.sents(categories='science_fiction')[:2]:
    print(" ".join(sent))
    print()

In [None]:
print(brown.sents(categories=['news', 'editorial', 'reviews']))

In [None]:
for sent in brown.sents(categories=['news', 'editorial', 'reviews'])[:3]:
    print(" ".join(sent))
    print()

In [None]:
print(brown.words(fileids=['cg22']))

We can use NLTK **ConditionalFreqDist** to collect statistics on the corpus distribution across genres and other properties:

In [None]:
cfd = nltk.ConditionalFreqDist(
           (genre, word)
           for genre in brown.categories()
           for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

cfd.tabulate(conditions=genres, samples=modals)
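To see what `ConditionalFreqDist` stores, here is a toy sketch with hand-written `(condition, sample)` pairs (the pairs are made up for illustration, not taken from the Brown corpus). Each condition maps to its own `FreqDist`.

```python
import nltk

# (genre, word) pairs, invented for illustration
pairs = [
    ("news", "can"), ("news", "will"), ("news", "will"),
    ("romance", "could"), ("romance", "could"), ("romance", "can"),
]

toy_cfd = nltk.ConditionalFreqDist(pairs)

print(sorted(toy_cfd.conditions()))  # the genres seen in the pairs
print(toy_cfd["news"]["will"])       # count of "will" within "news"
print(toy_cfd["news"]["could"])      # unseen samples count as 0
print(toy_cfd["news"].N())           # total samples recorded for "news"
```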

#### Brown corpus contains tags with part-of-speech information

[Working with Tagged Corpora](https://www.nltk.org/book/ch05#tagged-corpora) (NLTK book)

In [None]:
words = nltk.corpus.brown.tagged_words(tagset='universal')
words

In [None]:
# islice() lets us read a part of the corpus

from itertools import islice
words = islice(words, 300)

# let's convert it to a list
word_list = list(words)
word_list

In [None]:
# find all words with POS tag "ADJ"

tag = 'ADJ'

[item[0] for item in word_list if item[1] == tag]

**Additional examples** (using FreqDist, ...):
    
[Working with Tagged Corpora](https://www.nltk.org/book/ch05#tagged-corpora)
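One possible use of `FreqDist` with tagged words is counting how often each tag occurs. The sketch below uses a small hand-tagged list (invented for illustration) in the same `(word, tag)` format that `brown.tagged_words(tagset='universal')` returns.

```python
from nltk import FreqDist

# hand-tagged toy sentence, same (word, tag) shape as the Brown corpus output
toy_tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
    ("dog", "NOUN"),
]

# count tags, ignoring the words themselves
tag_fd = FreqDist(tag for (word, tag) in toy_tagged)

print(tag_fd.most_common())
```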

### 3) NLTK Corpus functionality

* fileids()  = the files of the corpus
* fileids([categories])  = the files of the corpus corresponding to these categories

* categories()  = the categories of the corpus
* categories([fileids])  = the categories of the corpus corresponding to these files

* raw()  = the raw content of the corpus
* raw(fileids=[f1,f2,f3])  = the raw content of the specified files
* raw(categories=[c1,c2])  = the raw content of the specified categories

* words()  = the words of the whole corpus
* words(fileids=[f1,f2,f3])  = the words of the specified fileids
* words(categories=[c1,c2])  = the words of the specified categories

* sents()  = the sentences of the whole corpus
* sents(fileids=[f1,f2,f3])  = the sentences of the specified fileids
* sents(categories=[c1,c2])  = the sentences of the specified categories

* abspath(fileid)  = the location of the given file on disk
* encoding(fileid)  = the encoding of the file (if known)
* open(fileid)  = open a stream for reading the given corpus file
* root  = the path to the root of the locally installed corpus

* readme()  = the contents of the README file of the corpus


**Note: if you want to explore these corpora using `nltk.Text` functionality (e.g. as in the Introduction part) you will need to load them into `nltk.Text`**
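A minimal sketch of that loading step: `nltk.Text` accepts any sequence of tokens. A short hand-written token list is used here so the example runs without corpus data; for a real corpus you would pass, for example, `nltk.corpus.gutenberg.words('austen-emma.txt')`.

```python
import nltk

# toy tokens standing in for a corpus word list
tokens = ["Call", "me", "Ishmael", ".", "The", "whale", "is", "a", "whale", "."]

toy_text = nltk.Text(tokens)

print(len(toy_text))            # number of tokens
print(toy_text.count("whale"))  # occurrences of one token
```

Once wrapped this way, methods such as `concordance()` and `collocation_list()` become available on the object.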

## Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions.

https://www.nltk.org/book/ch02#lexical-resources

We already used NLTK lexical resources (stopwords and common English words).

## WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. 

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
# a collection of synonym sets related to "wind"

wn.synsets('wind')

In [None]:
# words (lemmas) in one of the synsets:

wn.synset('wind.n.08').lemma_names()

In [None]:
wn.synset('wind.n.08').definition()

In [None]:
wn.synset('wind.n.08').examples()

In [None]:
# let's explore all the synsets for this word

for synset in wn.synsets('wind'):
    print(synset.lemma_names())

In [None]:
# see all lemmas for a given word (each lemma belongs to one synset)

wn.lemmas('curve')

---

**Additional WordNet examples:**
* https://www.nltk.org/book/ch02#wordnet

### Try it yourself!

# Computing with Language: Statistics

In [None]:
import nltk

# make sure that NLTK language resources have been downloaded 
# (see "NLTK Introduction" notebook)

from nltk.book import *

In [None]:
# frequency distribution of text1
fdist1 = FreqDist(text1)

print(fdist1)

### [Word] Frequency Distributions

FreqDist is used to encode "frequency distributions", which count the number of times that each outcome of an experiment occurs.
* In the case of text, the frequency distribution contains the counts of all tokens that appear in the text.
* Technically: FreqDist() creates a Python object (that holds information about a frequency distribution)

**FreqDist** methods:

* freq(sample) - returns the relative frequency of "sample" (its count divided by the total number of tokens); use `fdist[sample]` for the raw count
* hapaxes() - a list of samples that appear only once
* max() - the sample with the maximum number of occurrences
* plot() - plot a FreqDist chart
* pprint() - "pretty print" the first items of FreqDist

NLTK book: http://www.nltk.org/book/ch01.html#computing-with-language-simple-statistics

Full list of methods: http://www.nltk.org/api/nltk.html#nltk.probability.FreqDist
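The difference between a raw count and `freq()` is easy to miss, so here is a small sketch on a hand-written token list (invented for illustration, so no corpus download is needed).

```python
from nltk import FreqDist

tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]
fd = FreqDist(tokens)

count = fd["the"]              # raw count of "the"
total = fd.N()                 # total number of tokens counted
rel = fd.freq("the")           # relative frequency = count / total
once = sorted(fd.hapaxes())    # tokens that occur exactly once

print(count, total, rel)
print(fd.max(), once)
```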

In [None]:
# print frequency distribution (top results)

fdist1.pprint()

In [None]:
# max()

fdist1.max()

In [None]:
# freq()

print("','  :", fdist1.freq(","))
print("whale:", fdist1.freq("whale"))

---

Information about Python dictionaries: 
* ["Dictionaries and Structuring Data"](https://automatetheboringstuff.com/chapter5/)


In [None]:
# output of fdist1.pprint() looks like a Python "dictionary"

# can we look up its values by a given "key"?
fdist1["whale"]

In [None]:
# top 10 results (not that interesting for text)

fdist1.most_common(10)

In [None]:
# most_common() returns a list -> we can "slice" it

my_list = fdist1.most_common(100)

# items at indices 50 through 59 (the end index 60 is excluded)
my_list[50:60]

In [None]:
# plot the distribution
fdist1.plot(30)

In [None]:
# least common results (first 10 examples)

fdist1.hapaxes()[:10]

### Words can appear both in lowercase and Capitalized

Let's fix our FreqDist:

In [None]:
# need to "lowercase" the text before passing it to FreqDist
#   - see example in https://www.nltk.org/api/nltk.html#nltk.probability.FreqDist

fdist2 = FreqDist(word.lower() for word in text1)

# we're going through the list of tokens in text,
#  - returning (generating) lowercase versions of these tokens
#  - and passing the result to FreqDist

In [None]:
# initial:
print(fdist1.freq("whale"))
print(fdist1.freq("Whale"))
print()

# fixed:
print(fdist2.freq("whale"))

### Cleaning data: removing stopwords

NLTK contains a corpus of *stopwords* - high-frequency words like "the", "to" and "also" - that we may want to filter out of a document before further processing.

Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

https://www.nltk.org/book/ch02#wordlist-corpora

In [None]:
from nltk.corpus import stopwords

# English stopwords
stop_words = stopwords.words("english")

stop_words[:8]

In [None]:
# let's start with text1 in lowercase

# return a list  
#   containing "word.lower()"
#     for every item (stored in variable "word")
#       in resource "text1"

text = [word.lower() for word in text1]

text[:7]

**NLTK book: [4.2   Operating on Every Element](https://www.nltk.org/book/ch01#operating-on-every-element)**

This *pattern* – doing something (e.g. modifying) with every item in a sequence and returning a list of results – is called a Python *list comprehension*:
    
`result_list = [item.do_something() for item in list]`

List comprehensions may also contain conditions (only items matching the condition will be included in the resulting list):

`result_list = [item.do_something() for item in list `**`if`**` condition]`

It is very useful for filtering and modifying lists.
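Both forms in a small, self-contained sketch (the token list is made up for illustration):

```python
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]

# modify every item: lowercase each token
lowered = [token.lower() for token in tokens]

# modify and filter: keep only alphabetic tokens, lowercased
alpha_only = [token.lower() for token in tokens if token.isalpha()]

print(lowered)
print(alpha_only)
```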

In [None]:
# we can filter either (a) text before calling FreqDist or (b) results of FreqDist.
# let's filter before calling FreqDist.

# create a set of stopwords (membership tests on sets are faster than on lists)
stop_set = set(stop_words)

# filter out stopwords (return only words not in the stoplist)
without_stopwords = [word for word in text if word not in stop_set]

without_stopwords[:7]

In [None]:
# let's also filter out tokens that are not text or numbers

# Python has a built-in method .isalnum() that determines 
# if a string only consists of letters or digits:

# https://docs.python.org/3/library/stdtypes.html#str.isalnum

filtered = [word for word in without_stopwords if word.isalnum()]

filtered[:7]

In [None]:
# word frequency

freq = FreqDist(filtered)

freq.most_common(15)

In [None]:
# plot the distribution
freq.plot(30)
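The cells above filter the text before calling `FreqDist`, which is option (a). For comparison, here is a sketch of option (b): count everything first, then filter the results. The token list and the tiny stop set are invented stand-ins for `text` and `stop_set`.

```python
from nltk import FreqDist

tokens = ["the", "whale", ",", "and", "the", "sea", ",", "and", "the", "whale"]
toy_stop_set = {"the", "and"}  # stand-in for the NLTK stopword set

# count all tokens, including stopwords and punctuation
fd_all = FreqDist(tokens)

# filter the (word, count) pairs afterwards
top_content = [
    (word, count)
    for word, count in fd_all.most_common()
    if word not in toy_stop_set and word.isalnum()
]

print(top_content)
```

Option (b) keeps the full counts around, which may be handy if you later want the relative frequencies computed against the unfiltered text.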

### Exploring data: finding interesting words

NLTK also includes a list of common English words. We can use it to find unusual or mis-spelt words in a text corpus.

See also: https://www.nltk.org/book/ch02#code-unusual

In [None]:
word_list = nltk.corpus.words.words()

# convert word list to a set (+ convert words to lowercase)
word_set = set(word.lower() for word in word_list)

# filter out common words
uncommon = [word for word in filtered if word not in word_set]

uncommon[:7]

In [None]:
# word frequency

freq = FreqDist(uncommon)

freq.most_common(15)

Note: in order to find really uncommon words we may need to clean data further (convert nouns to singular, etc.) or get a larger list of common words.

---

### Further information

[**Introduction to stylometry with Python**](https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python) by François Dominic Laramée
* uses FreqDist

Stylometry is the quantitative study of literary style through computational distant reading methods. It is based on the observation that authors tend to write in relatively consistent, recognizable and unique ways. 

---

# Collocations and N-grams

NLTK book: [Collocations and Bigrams](https://www.nltk.org/book/ch01#collocations-and-bigrams)

**Bigrams** are pairs of adjacent words in the text.

**Collocations** are sequences (e.g. pairs) of words that occur together unusually often.

In [None]:
help(nltk.bigrams)

In [None]:
t1_bigrams = nltk.bigrams(text1[:10])

# to print bigrams, convert it to a list
list(t1_bigrams)
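Bigrams are the n = 2 case of n-grams. A short sketch with `nltk.util.ngrams` on a hand-written token list shows the same idea for trigrams:

```python
from nltk.util import bigrams, ngrams

tokens = ["Call", "me", "Ishmael", "."]

# both functions return generators, so convert to lists to inspect them
toy_bigrams = list(bigrams(tokens))
toy_trigrams = list(ngrams(tokens, 3))

print(toy_bigrams)
print(toy_trigrams)
```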

To find **collocations**, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words.

Additional information:
* [nltk.text.Text](https://www.nltk.org/api/nltk.html#nltk.text.Text)
* [NLTK documentation: collocations](https://www.nltk.org/api/nltk.html#module-nltk.collocations)
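A minimal sketch of that idea using `BigramCollocationFinder` directly, on a toy token list invented for illustration. The frequency threshold, the small stop set and the choice of the likelihood-ratio measure are assumptions made for this example.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ["sperm", "whale", "saw", "the", "sperm", "whale",
          "and", "the", "ship", "saw", "the", "sea"]
toy_stop_set = {"the", "and"}  # stand-in for the NLTK stopword list

finder = BigramCollocationFinder.from_words(tokens)

# keep bigrams seen at least twice, then drop any containing a stopword
finder.apply_freq_filter(2)
finder.apply_word_filter(lambda w: w in toy_stop_set)

# rank the remaining candidates by likelihood ratio
measures = BigramAssocMeasures()
best = finder.nbest(measures.likelihood_ratio, 5)

print(best)
```

On real text the filters leave many candidates, and the association measure decides their ranking.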

In [None]:
coll_list = text1.collocation_list()

coll_list

In [None]:
# we can also look for trigram collocations

# http://www.nltk.org/howto/collocations.html

coll_3 = nltk.collocations.TrigramCollocationFinder.from_words(text1)

trigram_measures = nltk.collocations.TrigramAssocMeasures()
scored = coll_3.score_ngrams(trigram_measures.raw_freq)

sorted(coll_3.nbest(trigram_measures.raw_freq, 15))

Note: to find unusual trigrams we would need to filter the results (and pick the most appropriate collocation measure), as is done in `Text.collocation_list()`.

Source code for `collocation_list()`: https://www.nltk.org/_modules/nltk/text.html#Text.collocation_list

```
    # print("Building collocations list")
    from nltk.corpus import stopwords

    ignored_words = stopwords.words("english")
            
    finder = BigramCollocationFinder.from_words(self.tokens, window_size)
    finder.apply_freq_filter(2)
    finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
            
    bigram_measures = BigramAssocMeasures()
    self._collocations = finder.nbest(bigram_measures.likelihood_ratio, num)
```

In [None]:
# a set makes the membership test in the word filter faster than a list
ignored_words = set(stopwords.words("english"))

coll_3 = nltk.collocations.TrigramCollocationFinder.from_words(text1)

coll_3.apply_freq_filter(2)
coll_3.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)

trigram_measures = nltk.collocations.TrigramAssocMeasures()
scored = coll_3.score_ngrams(trigram_measures.raw_freq)

sorted(coll_3.nbest(trigram_measures.raw_freq, 15))

# Your turn!

Choose one of NLTK corpora and **explore it using NLTK** (following examples here and in the NLTK book).

Also apply what you learned (FreqDist, ...) in section "Computing with Language: Statistics".

---

**Write code in notebook cells below**.
* add more cells (use "+" icon) if necessary