# [2. Accessing Text Corpora and Lexical Resources](https://www.nltk.org/book/ch02.html)

Run the cell below before running any other code.

In [None]:
import nltk

## 1 - Accessing Text Corpora

### 1.1 - Gutenberg Corpus

In [None]:
nltk.corpus.gutenberg.fileids()

In [None]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
type(emma)

In [None]:
len(emma)

* notice that `emma` is an `nltk.corpus.reader.util.StreamBackedCorpusView` object
* to use the `.concordance` method on the `emma` text, we first need to convert `emma` into an `nltk.text.Text` object, as shown below

In [None]:
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
type(emma)

In [None]:
emma.concordance("surprize")
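
To make the idea of a concordance concrete, the sketch below builds a tiny one by hand in plain Python. It is not how `nltk.Text.concordance` is implemented; the function name `simple_concordance`, its `width` parameter, and the toy token list are all assumptions made for illustration.

```python
def simple_concordance(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - width):i]      # context before the match
            right = tokens[i + 1:i + 1 + width]     # context after the match
            lines.append(' '.join(left + [token] + right))
    return lines

# Toy token list (hypothetical), standing in for a corpus word list
toy_tokens = ['It', 'was', 'a', 'great', 'surprize', 'to', 'her', ',',
              'and', 'no', 'surprize', 'to', 'him', '.']

for line in simple_concordance(toy_tokens, 'surprize'):
    print(line)
```

Each printed line shows one match with its neighbouring tokens, which is the essence of what a concordance view displays.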

In [None]:
from nltk.corpus import gutenberg

In [None]:
gutenberg.fileids()

In [None]:
emma = gutenberg.words('austen-emma.txt')

In [None]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

#### Macbeth Sentences

In [None]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')

In [None]:
macbeth_sentences

In [None]:
macbeth_sentences[1116]

In [None]:
longest_len = max(len(s) for s in macbeth_sentences)

In [None]:
[s for s in macbeth_sentences if len(s) == longest_len]

### 1.2 - Web and Chat Text

In [None]:
from nltk.corpus import webtext

for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

In [None]:
from nltk.corpus import nps_chat

chatroom = nps_chat.posts('10-19-20s_706posts.xml')
chatroom[123]

### 1.3 - Brown Corpus

In [None]:
from nltk.corpus import brown

In [None]:
brown.categories()

In [None]:
brown.words(categories='news')

In [None]:
brown.words(fileids=['cg22'])

In [None]:
brown.sents(categories=['news', 'editorial', 'reviews'])

#### Stylistics

In [None]:
from nltk.corpus import brown

In [None]:
news_text = brown.words(categories='news')

In [None]:
fdist = nltk.FreqDist(w.lower() for w in news_text)

In [None]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [None]:
for m in modals:
    print(m + ':', fdist[m], end=' ')

**Your Turn:** Choose a different section of the Brown Corpus, and adapt the previous example to count a selection of wh words, such as what, when, where, who, and why.

#### CFD Sneak Peek

* CFDs will be explained in more detail in Section 2

In [None]:
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

### 1.4 - Reuters Corpus

In [None]:
from nltk.corpus import reuters

In [None]:
reuters.fileids()

In [None]:
reuters.categories()

In [None]:
reuters.categories('training/9865')

In [None]:
reuters.categories(['training/9865', 'training/9880'])

In [None]:
reuters.fileids('barley')

In [None]:
reuters.fileids(['barley', 'corn'])

### 1.5 - Inaugural Address Corpus

In [None]:
from nltk.corpus import inaugural

In [None]:
inaugural.fileids()

In [None]:
[fileid[:4] for fileid in inaugural.fileids()]

Note how this graph differs from the one displayed in the book. NLTK's Inaugural Address Corpus is still being updated, so addresses by United States presidents after 2005 are included in this graph.

* **note:** for this solution, I used matplotlib library functions to change the size of the graph
    * learn more about matplotlib here: [Intro to pyplot Tutorial](https://matplotlib.org/3.3.1/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py)
* CFDs will be explained in more detail in Section 2

In [None]:
import matplotlib.pyplot as plt

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))

plt.figure(figsize=(16, 6)) 

cfd.plot()

### 1.7 - Corpora in Other Languages

In [None]:
nltk.corpus.cess_esp.words()

In [None]:
nltk.corpus.floresta.words()

In [None]:
nltk.corpus.indian.words('hindi.pos')

In [None]:
nltk.corpus.udhr.fileids()

In [None]:
nltk.corpus.udhr.words('Javanese-Latin1')[11:]

* **note:** for this solution, I used matplotlib library functions to change the size of the graph
    * learn more about matplotlib here: [Intro to pyplot Tutorial](https://matplotlib.org/3.3.1/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py)
* CFDs will be explained in more detail in Section 2

In [None]:
import matplotlib.pyplot as plt
from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))

plt.figure(figsize=(10, 6)) 

cfd.plot(cumulative=True)

**Your Turn:** Pick a language of interest in `udhr.fileids()`, and define a variable `raw_text = udhr.raw(Language-Latin1)`. Now plot a frequency distribution of the letters of the text using `nltk.FreqDist(raw_text).plot()`.

### 1.8 - Text Corpus Structure

In [None]:
from nltk.corpus import gutenberg

raw = gutenberg.raw("burgess-busterbrown.txt")

raw[1:20]

In [None]:
words = gutenberg.words("burgess-busterbrown.txt")

In [None]:
words[1:20]

In [None]:
sents = gutenberg.sents("burgess-busterbrown.txt")

In [None]:
sents[1:20]

### 1.9 - Loading your own Corpus

In this example, we are going to look at the root directory of this repository. The `..` stands for the **parent directory**, that is, the folder one level higher in the folder hierarchy. See [Section 1.4 of this Unix Tutorial](http://www.ee.surrey.ac.uk/Teaching/Unix/unix1.html) for an in-depth explanation of this.

Instead of looking at all of the files in the directory, we will look only at the files that end with `.md`. These are Markdown files, which are a type of text file.

* **Note:** if you are using Google Colab, change the corpus_root string to 'sample_data'. Click the folder icon on the left to see what's in the 'sample_data' folder.

In [None]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = '../' # change this string to 'sample_data' if using Google Colab
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.md')  # regex: \. matches a literal dot
wordlists.fileids()

In [None]:
wordlists.words('README.md')
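
The second argument to `PlaintextCorpusReader` is a regular expression that is matched against file names. As a rough sketch of why an escaped pattern such as `r'.*\.md'` is safer than `'.*.md'`, the snippet below tries both patterns on a few made-up file names with Python's `re` module. The file names are hypothetical, and `re.fullmatch` is used only to approximate the matching; NLTK's own lookup may differ in detail.

```python
import re

# Hypothetical file names, for illustration only
filenames = ['README.md', 'notes.md', 'chapter2.ipynb', 'readmd', 'hello.py']

loose = '.*.md'      # the unescaped '.' matches ANY single character
strict = r'.*\.md'   # the escaped '\.' matches a literal dot only

loose_hits = [f for f in filenames if re.fullmatch(loose, f)]
strict_hits = [f for f in filenames if re.fullmatch(strict, f)]

print(loose_hits)
print(strict_hits)
```

With the loose pattern, a name such as `readmd` slips through because the unescaped dot is free to match the letter `d`.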

Unfortunately the Penn Treebank is [not a free resource](https://catalog.ldc.upenn.edu/LDC99T42). Fortunately there are a lot of [free alternatives to use](https://stackoverflow.com/q/8949517/12578069).

* [American National Corpus](http://www.anc.org/data/masc/downloads/data-download/)

## 2 - Conditional Frequency Distributions

### 2.1 - Conditions and Events

In [None]:
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County')]

pairs
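
A conditional frequency distribution can be thought of as one frequency distribution per condition. The sketch below mimics that idea with the standard library's `defaultdict` and `Counter`. It illustrates the concept rather than NLTK's implementation, and the names `toy_pairs` and `toy_cfd`, along with the extra `'romance'` pairs, are invented for the example.

```python
from collections import Counter, defaultdict

# (condition, event) pairs; the 'romance' pairs are made up for illustration
toy_pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'The'),
             ('romance', 'They'), ('romance', 'could')]

toy_cfd = defaultdict(Counter)
for condition, event in toy_pairs:
    toy_cfd[condition][event] += 1   # count each event under its own condition

print(sorted(toy_cfd))               # the conditions
print(toy_cfd['news']['The'])        # how often 'The' occurs under 'news'
```

Looking up a condition gives back an ordinary counter, which is why expressions of the form `cfd[condition][word]` work the way they do.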

### 2.2 - Counting Words by Genre

In [None]:
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

The technique used to create the list of pairs below is called a **list comprehension**. Below are review resources for this technique:

* *Chapter 1, Section 3.2* of the NLTK book
* [Tutorial on List Comprehensions](https://www.programiz.com/python-programming/list-comprehension)
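
As a quick refresher, a comprehension with two `for` clauses behaves like two nested loops. The sketch below uses a small made-up dictionary in place of the Brown Corpus so that it runs without any corpus data; `toy_corpus`, `toy_genre_word`, and `loop_pairs` are hypothetical names.

```python
# Hypothetical stand-in for brown.words(categories=genre)
toy_corpus = {
    'news': ['The', 'Fulton', 'County'],
    'romance': ['They', 'neither', 'liked'],
}

# Comprehension form: the outer loop is written first, the inner loop second
toy_genre_word = [(genre, word)
                  for genre in ['news', 'romance']
                  for word in toy_corpus[genre]]

# Equivalent nested loops
loop_pairs = []
for genre in ['news', 'romance']:
    for word in toy_corpus[genre]:
        loop_pairs.append((genre, word))

print(toy_genre_word == loop_pairs)
print(len(toy_genre_word))
```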

In [None]:
genre_word = [(genre, word)
               for genre in ['news', 'romance']         
               for word in brown.words(categories=genre)]

len(genre_word)

In [None]:
genre_word[:4] # [_start-genre]

In [None]:
genre_word[-4:] # [_end-genre]

In [None]:
cfd = nltk.ConditionalFreqDist(genre_word)
cfd

In [None]:
cfd.conditions() # [_conditions-cfd]

* **samples**: the number of unique words in a text (i.e. duplicate words are not counted)
* **outcomes**: the total number of words occurring in a text (i.e. including duplicate words)
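
The difference between samples and outcomes can be checked on a toy list. The sketch uses `collections.Counter` as a stand-in for `nltk.FreqDist` so that it runs without NLTK; the word list and variable names are invented.

```python
from collections import Counter

toy_words = ['the', 'cat', 'saw', 'the', 'other', 'cat']  # made-up text
toy_counts = Counter(toy_words)

num_samples = len(toy_counts)            # distinct words
num_outcomes = sum(toy_counts.values())  # all word occurrences, duplicates included

print(num_samples, num_outcomes)
```

On an `nltk.FreqDist`, the same two numbers are available through its `B()` and `N()` methods.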

In [None]:
print(cfd['news'])

In [None]:
print(cfd['romance'])

In [None]:
cfd['romance'].most_common(20)

The code below shows how many times the word *could* appears in romance texts (in the *Brown Corpus*).

In [None]:
cfd['romance']['could']

### 2.3 - Plotting and Tabulating Distributions

In [None]:
from nltk.corpus import inaugural

cfd = nltk.ConditionalFreqDist(
           (target, fileid[:4]) 
           for fileid in inaugural.fileids()
           for w in inaugural.words(fileid)
           for target in ['america', 'citizen'] if w.lower().startswith(target))

In [None]:
from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

cfd = nltk.ConditionalFreqDist(
           (lang, len(word))
           for lang in languages
           for word in udhr.words(lang + '-Latin1'))

In [None]:
cfd.tabulate(conditions=['English', 'German_Deutsch'], samples=range(10), cumulative=True)

**Your Turn:** Working with the news and romance genres from the Brown Corpus, find out which days of the week are most newsworthy, and which are most romantic. Define a variable called `days` containing a list of days of the week, i.e. `['Monday', ...]`. Now tabulate the counts for these words using `cfd.tabulate(samples=days)`. Now try the same thing using plot in place of tabulate. You may control the output order of days with the help of an extra parameter: `samples=['Monday', ...]`.

### 2.4 - Generating Random Text with Bigrams

In [None]:
sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']

In [None]:
list(nltk.bigrams(sent))

**Figure 2.2**: Generating Random Text: this program obtains all bigrams from the text of the book of Genesis, then constructs a conditional frequency distribution to record which words are most likely to follow a given word; e.g., after the word *living*, the most likely word is *creature*; the `generate_model()` function uses this data, and a seed word, to generate random text.

In [None]:
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()
        

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

In [None]:
cfd['living']

In [None]:
generate_model(cfd, 'living')
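
To see the mechanics without loading Genesis, here is a pure-Python sketch of the same idea on a toy sentence. It counts which word follows which, then always steps to the most frequent follower. The helper names `build_bigram_counts` and `generate_greedy` and the token list are hypothetical, and the standard library's `Counter` stands in for NLTK's classes.

```python
from collections import Counter, defaultdict

def build_bigram_counts(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    followers = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        followers[first][second] += 1
    return followers

def generate_greedy(followers, word, num=5):
    """Emit the current word, then move to its most frequent follower, up to `num` words."""
    output = []
    for _ in range(num):
        output.append(word)
        nexts = followers.get(word)
        if not nexts:                       # no known follower: stop early
            break
        word = nexts.most_common(1)[0][0]   # most frequent next word
    return output

# Made-up token list, chosen so that no two followers are tied
toy_tokens = ['the', 'dog', 'saw', 'the', 'dog', 'saw', 'the', 'cat', '.']
toy_followers = build_bigram_counts(toy_tokens)
print(' '.join(generate_greedy(toy_followers, 'the')))
```

Because the most frequent follower is always chosen, the output quickly falls into a cycle, which is the same limitation the `max()`-based `generate_model()` has.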

## 3 - More Python: Reusing Code

### 3.1 - Creating Programs with a Text Editor

You can create `.py` files in Jupyter Notebooks as well. There are two ways of doing this using **Anaconda**'s Jupyter notebooks:

1. Click the Jupyter logo in the top-left corner of this notebook. You will see your file directory. Navigate to the place where you want to save your `.py` file and select `New > Text File` in the top right-hand corner. This will open a text editor where you can write your Python code!
2. If you want to save a copy of your whole Jupyter Notebook as a Python file, click `File > Download As` and select `.py`

There are also many popular source code editors with great plugins for Python. These include:
* [Visual Studio Code](https://code.visualstudio.com/)
* [PyCharm Community Edition](https://www.jetbrains.com/pycharm/)

To run a `.py` file as though you are using a terminal, run `!python file.py`, where `file.py` is the Python file you would like to run. The `!` is a special Jupyter character that lets you run a command from your computer's terminal.

In the example below, you can run a program in this folder called `hello.py` which is a simple 'Hello World' script. 

In [107]:
!python hello.py

Hello World!
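
The contents of `hello.py` are not shown in this notebook. A script that produces the output above could be as small as the following sketch; the `main` function and the `__name__` guard are assumptions about its layout, not a copy of the actual file.

```python
def main():
    """Print a greeting and return it so the function is easy to check."""
    message = 'Hello World!'
    print(message)
    return message

# Run main() only when the file is executed as a script, not when it is imported
if __name__ == '__main__':
    main()
```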


### 3.2 - Functions

In [110]:
def lexical_diversity(text):
    return len(text) / len(set(text))

In [109]:
def lexical_diversity(my_text_data):
    word_count = len(my_text_data)
    vocab_size = len(set(my_text_data))
    # tokens per distinct word, consistent with the one-line version and the output below
    diversity_score = word_count / vocab_size
    return diversity_score

In [113]:
from nltk.corpus import genesis
kjv = genesis.words('english-kjv.txt')
lexical_diversity(kjv)

16.050197203298673

In [114]:
def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

In [115]:
plural('fairy')

'fairies'

In [116]:
plural('woman')

'women'

### 3.3 - Modules

* a `text_proc.py` file is provided in this folder 

In [120]:
from text_proc import plural


In [122]:
plural('wish')

'wishes'

In [121]:
plural('fan')

'fen'

## 4 - Lexical Resources

## 5

## Your Turn Solutions

### 1.3

**Your Turn:** Choose a different section of the Brown Corpus, and adapt the previous example to count a selection of wh words, such as what, when, where, who, and why.

In [None]:
from nltk.corpus import brown

humor_text = brown.words(categories='humor')
humor_fdist = nltk.FreqDist(w.lower() for w in humor_text)
wh = ['what', 'when', 'where', 'who', 'why']

# look up each wh-word in the humor distribution built above
for w in wh:
    print(w + ':', humor_fdist[w], end=' ')

### 1.7

**Your Turn:** Pick a language of interest in `udhr.fileids()`, and define a variable `raw_text = udhr.raw(Language-Latin1)`. Now plot a frequency distribution of the letters of the text using `nltk.FreqDist(raw_text).plot()`.

* in this example, I will choose `Portuguese_Portugues-Latin1`
* **note:** for this solution, I used matplotlib library functions to change the size of the graph
    * learn more about matplotlib here: [Intro to pyplot Tutorial](https://matplotlib.org/3.3.1/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py)

In [None]:
from nltk.corpus import udhr

udhr.fileids()

In [None]:
raw_text = udhr.raw('Portuguese_Portugues-Latin1')

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 6)) 

nltk.FreqDist(raw_text).plot()

### 2.3

**Your Turn:** Working with the news and romance genres from the Brown Corpus, find out which days of the week are most newsworthy, and which are most romantic. Define a variable called `days` containing a list of days of the week, i.e. `['Monday', ...]`. Now tabulate the counts for these words using `cfd.tabulate(samples=days)`. Now try the same thing using plot in place of tabulate. You may control the output order of days with the help of an extra parameter: `samples=['Monday', ...]`.

In [None]:
from nltk.corpus import brown

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in ['news', 'romance']         
    for word in brown.words(categories=genre)
)

In [None]:
cfd

In [None]:
cfd.tabulate(samples=days)

In [None]:
cfd.plot(samples=days)

## Work

### 1.7

In [None]:
import pandas as pd
from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))

def plot_freq(lang):
    max_length = max([len(word) for word in udhr.words(lang + '-Latin1')])
    eng_freq_dist = {}

    for i in range(max_length + 1):
        eng_freq_dist[i] = cfd[lang].freq(i)

    ed = pd.Series(eng_freq_dist, name=lang)

    ed.cumsum().plot(legend=True, title='Cumulative Distribution of Word Lengths')

In [None]:
for lang in languages:
    plot_freq(lang)