## Natural Language Toolkit (NLTK)

**NLTK** is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to [over 50 corpora and lexical resources](http://www.nltk.org/nltk_data/) such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

http://www.nltk.org/

NLTK library documentation (reference) = *Use it to look up how to use a particular NLTK library function*
* https://www.nltk.org/api/nltk.html

---

NLTK wiki (collaboratively edited documentation):
* https://github.com/nltk/nltk/wiki

### Book: Natural Language Processing with Python 

The NLTK book provides a practical introduction to programming for language processing.

Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.

Online: http://www.nltk.org/book/

* we will start with Chapter 1: ["Language Processing and Python"](http://www.nltk.org/book/ch01.html)

---

## Getting started - Jupyter notebook

## https://bit.ly/bssdh_2023_python

---

Open this Jupyter notebook (`Day 2 - NLTK Introduction`) on Github: 
- https://github.com/CaptSolo/BSSDH_2023_beginners/tree/main/notebooks

Download the notebook file to your computer: click the "Download raw file" button.

![Download button](https://github.com/CaptSolo/BSSDH_2023_beginners/blob/main/notebooks/img/download_button.png?raw=1)

Open [Google Colab](https://colab.research.google.com/), choose the "Upload" tab and upload the downloaded notebook file.

* Uploaded notebooks can be found in the [Google Drive](https://drive.google.com/) folder `Colab Notebooks`

You are now ready for the workshop!

## 1) Getting started - NLTK

NLTK book: http://www.nltk.org/book/ch01.html#getting-started-with-nltk

* Loading NLTK (Python module)
* Downloading NLTK language resources (corpora, ...)


In [None]:
# In order to use a Python library, we need to import (load) it

import nltk
import pandas as pd # we will use it to read our data


In [None]:
# Let's check what NLTK version we have (for easier troubleshooting and reproducibility)
nltk.__version__

### nltk.Text

**`nltk.Text` is a simple NLTK helper for loading and exploring textual content (a sequence of words / string tokens):**

... intended to support initial exploration of texts (via the interactive console). It can perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results.

Documentation: [nltk.Text](https://www.nltk.org/api/nltk.html#nltk.text.Text)
* lists what we can do with text once it is loaded into nltk.Text(...)

In [None]:
# Now we can try a simple example:

my_word_list = ["This", "is", "just", "an", "example", "Another", "example", "here"]
my_text = nltk.Text(my_word_list)

my_text

In [None]:
my_text

In [None]:
type(my_text)

In [None]:
for item in dir(my_text):
    if not item.startswith("__"):
        print(item)

In [None]:
# How many times does the word "example" appear?
my_text.count("example")

# Notes:
#  - my_text = our text, processed (loaded) by NLTK
#     - technically: a Python object
#  - my_text.count(...) = requesting the object to perform a .count(...) function and return the result
#     - technically: calling a .count() method

In [None]:
# count works on tokens (full words in this case)
my_text.count('exam')

In [None]:
'exam' in my_text

In [None]:
'example' in my_text

### Tokenizing

Let's convert a text string into nltk.Text.
First, we need to split it into tokens (to *tokenize* it). 
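
To get a feel for what tokenizing involves, here is a minimal sketch that uses only Python's built-in `re` module. It is not how NLTK's `word_tokenize` works internally (that tokenizer handles many more cases); the function name `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks (a rough approximation)."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Hello, world! NLTK is fun."

# Plain whitespace splitting leaves punctuation glued to the words
whitespace_tokens = sample.split()
print(whitespace_tokens)

# The regex-based sketch separates words from punctuation
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

Comparing the two printed lists shows why a tokenizer is more than `str.split()`: punctuation becomes its own token instead of staying attached to a word.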

In [None]:
# We need to download the Punkt tokenizer models before we can tokenize
# Note: newer NLTK releases may ask for 'punkt_tab' instead; follow the error message if one appears

nltk.download('punkt')

In [None]:
# Splitting text into tokens (words, ...) = tokenizing

from nltk.tokenize import word_tokenize

excerpt = "NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”"
tokens = word_tokenize(excerpt)

tokens[:6]

In [None]:
type(tokens)

In [None]:
my_text2 = nltk.Text(tokens)

print(my_text2.count("NLTK"))

In [None]:
list(my_text2)[:10]


### Downloading NLTK language resources

NLTK also contains many language resources (corpora, ...) but you have to select and download them separately (in order to save disk space and only download what is needed).

Let's download text collections used in the NLTK book: 
* `nltk.download("book")`

Note: you can also download resources interactively:
* `nltk.download()`

In [None]:
# this is a big download of all book packages
nltk.download("book")

In [None]:
# After downloading the resources we still need to import them

# Let's import all NLTK book resources (*)
from nltk.book import *

## 2) Exploring textual content

In [None]:
nltk.book.texts()

In [None]:
# text1, ... resources are of type nltk.Text (same as in the earlier example):

type(text1)

In [None]:
# We can run all methods that nltk.Text has.

# Count words:
print(text1.count("whale"))

## Concordance

The concordance has a long history in humanities study, and Roberto Busa's concordance Index Thomisticus, started in 1946, is arguably the first digital humanities project. Before computers were common, concordances were printed in large volumes such as John Bartlett's 1894 reference book A Complete Concordance to Shakespeare, which was 1909 pages long!

A concordance gives the context of a given word or phrase in a body of texts. For example, a literary scholar might ask: how often and in what context does Shakespeare use the phrase "honest Iago" in Othello? A historian might examine a particular politician's speeches, looking for examples of a particular "dog whistle".
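
The idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not NLTK's implementation; the function name `simple_concordance`, the `window` parameter, and the sample sentence are all invented for this example.

```python
def simple_concordance(tokens, word, window=2):
    """Return each occurrence of `word` with `window` tokens of context on both sides."""
    lines = []
    for i, token in enumerate(tokens):
        # compare case-insensitively so "Whale" and "whale" both match
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

demo_tokens = "the whale surfaced and the crew watched the whale dive again".split()
demo_lines = simple_concordance(demo_tokens, "whale")

for line in demo_lines:
    print(line)
```

Each printed line shows one match in square brackets with a couple of neighbouring words, which is the "keyword in context" view that a concordance provides.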

<font color="red">Read more</font>

* Geoffrey Rockwell and Stéfan Sinclair. [Tremendous Mechanical Labor: Father Busa's Algorithm](http://www.digitalhumanities.org/dhq/vol/14/3/000456/000456.html) (2020)
* Julianne Nyhan and Marco Passarotti, eds. [One Origin of Digital Humanities: Fr Roberto Busa in His Own Words](https://www.amazon.com/One-Origin-Digital-Humanities-Roberto/dp/3030183114/) (2019)
* Julianne Nyhan and Melissa Terras. [Uncovering 'hidden' contributions to the history of Digital Humanities: the Index Thomisticus' female keypunch operators](https://discovery.ucl.ac.uk/id/eprint/10052279/9/Nyhan_DH2017.redacted.pdf) (2017)
* Steven E. Jones. [Roberto Busa, S.J., and the Emergence of Humanities Computing](https://www.routledge.com/Roberto-Busa-S-J-and-the-Emergence-of-Humanities-Computing-The-Priest/Jones/p/book/9781138587250) (2016)
___

In [None]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.concordance

# Print concordance view (occurrences of a word, in context):
text1.concordance("discover")

In [None]:
text4.concordance("nation")

In [None]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.similar

# Print words that appear in a similar context to "nation".
text4.similar("nation")

In [None]:
help(text4.similar)

In [None]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.common_contexts

# Find contexts common to all given words
text1.common_contexts(["day", "night"])


In [None]:
help(text1.common_contexts)

In [None]:
text4.collocations(num=40, window_size=2)

In [None]:
help(text4.collocations)

In [None]:
# nltk.Text is also a list - can do everything we can do with lists (access parts of it, ...)

# What's the 1st occurrence of "He" in the text?
#  - note: Python is case sensitive (unless you take care of it - e.g. convert all text to lowercase)

print(text1.index("He"))

In [None]:
# The word at position #42
#  - note: list indexes start from 0

print(text1[42])

In [None]:
print(text1[42:52])

## Visualizing the corpus

In [None]:
# Dispersion plot

# source: Inaugural Address Corpus
text4.dispersion_plot(["citizens", "democracy", "duty", "freedom", "America"])

In [None]:
help(text4.dispersion_plot)

In [None]:
# Word frequency plot

text4.plot(30)

In [None]:
help(FreqDist.plot)

## Converting Our Corpora into an NLTK Text

In [None]:
# We can use Pandas to read tabular data from any publicly accessible source
url = "https://github.com/CaptSolo/BSSDH_2023_beginners/raw/main/corpora/en_old_newspapers_5k.tsv"

## Another file we could have used:
# url_2 = "https://github.com/CaptSolo/BSSDH_2023_beginners/raw/main/corpora/lv_old_newspapers_5k.tsv"

# Read a tab-separated (TSV) file and put its contents in a Pandas dataframe
df = pd.read_csv(url, sep="\t")

In [None]:
# Shape (size) of the dataframe
df.shape

In [None]:
# First 10 rows
df.head(10)

In [None]:
# Let us sort by Date - even though it is a string type

df = df.sort_values(by="Date", ascending=True)  # ascending is True by default, if you wanted Descending you could use ascending=False

df.head(10)

### Extracting Text from dataframe

In [None]:
### Extracting the Text column from the dataframe

documents = list(df.Text)  # df["Text"].tolist() would do the same
len(documents)


In [None]:
type(documents)

In [None]:
documents[:3]  # first three documents

In [None]:
# for the purpose of this analysis we will join all the documents together 
# this is not always appropriate depending on your needs

all_docs = "\n".join(documents)
len(all_docs)

In [None]:
all_docs[:120] # so now all documents are in one big string 
# notice the \n indicating newlines

Next, we lowercase our text and use the Natural Language Toolkit (NLTK) to tokenize it. Tokenizing breaks up the document into individual words. Finally, we use our tokens to create an NLTK Text object.

In [None]:
# Tokenize

import nltk  # not needed if you already imported
nltk.download('punkt')  # again not needed if you already downloaded punkt

file_contents = all_docs.lower()
tokens = nltk.word_tokenize(file_contents)

text = nltk.Text(tokens)

In [None]:
# Verify that we have created an NLTK Text object
type(text)

In [None]:
# Create a concordance for the given word
text.concordance('million')

By default, the first 25 matches are printed, and each line is about 80 characters wide in total (the default `width` is 79), with the matched word in the middle. We can show more matches or wider lines using the `lines` and `width` arguments, which accept integers.

In [None]:
# Create a concordance for the given word
# Increasing lines shown and number of characters
text.concordance('school', lines=50, width=100)

If we want to supply a bigram, trigram, or longer construction, they are supplied as individual strings within a Python list. (If you try to supply a string with a space in the middle, there will be no results.)
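
Why does a string with a space find nothing? After tokenizing, no single token contains a space, so `'high school'` can never equal any token. The plain-Python sketch below illustrates this; `find_phrase` is a hypothetical helper written for this example and is not part of NLTK.

```python
def find_phrase(tokens, phrase):
    """Return the start positions where the token sequence `phrase` occurs in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

demo_tokens = ["the", "high", "school", "is", "on", "a", "high", "hill"]

# Phrase given as a list of tokens: matches the two consecutive tokens
as_list = find_phrase(demo_tokens, ["high", "school"])
print(as_list)

# Phrase given as one string with a space: no token equals "high school"
as_string = find_phrase(demo_tokens, ["high school"])
print(as_string)
```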

In [None]:
# Create a concordance for a sequence of words
text.concordance(['high', 'school'])

In [None]:
text.concordance('high school')

This method works well for a quick preview of the lines, but if we want to save this concordance for later analysis we can use the `.concordance_list()` method. The `.concordance_list()` method outputs a list, but the elements of that list *are not* simple strings. They are ConcordanceLine objects.
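
As a small illustration, the sketch below builds an `nltk.Text` from a made-up token list and inspects the fields of the returned objects. It assumes a recent NLTK release in which `concordance_list` accepts a list of tokens (as this notebook does); the token list and variable names are invented for the example.

```python
import nltk

# A tiny, invented token list (no downloads are needed for this)
demo_tokens = ["the", "high", "school", "team", "won", "the", "high", "school", "cup"]
demo_text = nltk.Text(demo_tokens)

demo_lines = demo_text.concordance_list(["high", "school"], width=40, lines=10)

for entry in demo_lines:
    # offset = position of the match in the token list
    # query  = the matched phrase, line = the formatted context string
    print(entry.offset, repr(entry.query), repr(entry.line))
```

Because each element carries separate fields such as `left`, `query`, `right`, `offset`, and `line`, the results can be filtered or reformatted before saving, rather than parsing printed text.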

In [None]:
# Output the concordance data
output_list = text.concordance_list(['high', 'school'], width=200, lines=50)  # we do not have 50 matches in our dataset

In [None]:
type(output_list[0])

In [None]:
# We can view individual lines by using a Python list index followed by .line.
output_list[0].line

In [None]:
output_list[0].query

If we want to save our concordance, we can write to a file line-by-line.

In [None]:
# Writing the concordance to a text file

# encoding="utf-8" is very important for languages using symbols outside regular English characters
with open('my_concordance.txt', mode='w', encoding="utf-8") as f:
    
    for row in output_list:
        f.write(row.line)
        f.write('\n')

#### If you are using Google Colab, it is important to download saved files!

Files saved to the local directory (e.g. my_concordance.txt) of Google Colab servers will disappear after some time. In order to save these files, download them.

Another alternative is to save files to Google Drive folders in the first place.

## Normalizing text

**Normalizing text is covered in NLTK Book section 3.6:**
https://www.nltk.org/book/ch03.html#sec-normalizing-text

Yesterday we already normalized / converted text to lowercase but often that's not enough. 

We may want to go further and strip off any affixes, a task known as *stemming*. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as *lemmatization*. 
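
The difference between the two can be sketched with a deliberately crude example. Neither function below is a real algorithm from NLTK: `toy_stem` strips a few suffixes (it is not the Porter stemmer), and `toy_lemmatize` looks words up in a tiny hand-made table (`TOY_LEMMAS`) that stands in for a real dictionary. All names are hypothetical.

```python
def toy_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # only strip when at least two characters of the word remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# A tiny hand-made lookup table standing in for a real dictionary
TOY_LEMMAS = {"women": "woman", "lying": "lie", "ponds": "pond", "swords": "sword"}

def toy_lemmatize(word):
    """Return the dictionary form if we know it, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["women", "lying", "ponds", "swords"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer turns "lying" into "ly", which is not a word, and leaves "women" untouched, while the lookup returns "lie" and "woman". This is why lemmatization needs a dictionary for each language it supports.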

NLTK offers some stemming and lemmatization functions but they are limited to just some languages (e.g. Latvian is not one of them).
 
https://www.nltk.org/api/nltk.stem.html

In [None]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

tokens = word_tokenize(raw)

porter = nltk.PorterStemmer()

for t in tokens:
    print(porter.stem(t), end=" ")

In [None]:
nltk.download('omw-1.4')

In [None]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

for t in tokens:
    print(wnl.lemmatize(t), end=" ")

## (Bonus) Working with languages not supported by NLTK

In [None]:
url = "https://github.com/CaptSolo/BSSDH_2023_beginners/raw/main/corpora/lv_old_newspapers_5k.tsv"

# Read a CSV file and put its contents in a Pandas dataframe
df = pd.read_csv(url, sep="\t")

In [None]:
df.head(5)

In [None]:
df = df.sort_values(by="Date", ascending=True)  # ascending is True by default, if you wanted Descending you could use ascending=False


In [None]:
df.head()

In [None]:
# Let's create NLTK Text

documents = list(df.Text)
all_docs = "\n".join(documents)

file_contents = all_docs.lower()
tokens = nltk.word_tokenize(file_contents)

text = nltk.Text(tokens)

In [None]:
text.concordance("basketbols")

Let's see how frequently "basketbols" appears in the text:

In [None]:
for item in text:
    if item == "basketbols":
        print(item)

In [None]:
# There are other variations of the same word in the text:

import re  # the standard-library regular expression module is enough here

for item in text:
    if re.match(r"basket.*", item):
        print(item)

If the language of our text is not supported by NLTK we can use another library: 

simplemma: https://pypi.org/project/simplemma/

More information: [A simple multilingual lemmatizer for Python](https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html)


In [None]:
# we may need to install simplemma first (comment out the following line if it is already installed)
!pip install simplemma

In [None]:
import simplemma as sl

In [None]:
buf = []

for item in text:
    buf.append(sl.lemmatize(item, lang="lv"))
               
buf[:20]

In [None]:
text_new = nltk.Text(buf)

In [None]:
text_new.concordance("basketbols")

---

## Your turn!

Choose an NLTK text corpus and **explore it using NLTK** (following the examples in this notebook).

**Write code in notebook cells below**.
* add more cells (use "+" icon) if necessary

---

<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Parts of this notebook were adapted from a notebook by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />


