# NLTK Workshop 
Updated March, 2016

Introductory code to practice using Python's Natural Language Toolkit (NLTK), much of which is taken from the excellent [NLTK Book](http://www.nltk.org/book/). 

Begin by importing NLTK along with all the resources used in the NLTK book. The second import assumes that you have already downloaded the book resources; if you haven't, first run `nltk.download()` and select the "book" collection for download.

In [None]:
import nltk
from nltk.book import *

## Basic Text Data

In [None]:
text1

In [None]:
len(text1)

In [None]:
len(set(text1))

In [None]:
from __future__ import division
len(text1) / len(set(text1))
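
The last cell divides the total number of tokens by the number of distinct tokens, which gives the average number of times each distinct token is used. A minimal sketch of the same arithmetic on a toy token list, assuming Python 3 division; the list and the helper name `average_token_use` are made up for illustration and are not part of NLTK:

```python
def average_token_use(tokens):
    """Return the total number of tokens divided by the number of distinct tokens."""
    return len(tokens) / len(set(tokens))

# Made-up tokens standing in for a real text
toy_tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

print(len(toy_tokens))                # 8 tokens in total
print(len(set(toy_tokens)))           # 5 distinct tokens
print(average_token_use(toy_tokens))  # 1.6 uses per distinct token
```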

In [None]:
text1.tokens[:25]

In [None]:
text1.generate()

## Collocations
Find "collocations", that is, word combinations that occur more often than would be expected by chance.

In [None]:
text1.collocations()

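Building a collocation finder by hand gives more control: you choose the corpus, the n-gram size, the association measure used for ranking (here pointwise mutual information, PMI), and any frequency filter.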

In [None]:
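from nltk.collocations import *
# Association measures (PMI and others) for scoring trigrams
trigram_measures = nltk.collocations.TrigramAssocMeasures()
# Collect every trigram in the English web version of Genesis
finder = TrigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
# Ten trigrams with the highest PMI
finder.nbest(trigram_measures.pmi, 10)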

In [None]:
# Drop trigrams that occur fewer than 3 times, then rank again
finder.apply_freq_filter(3)
finder.nbest(trigram_measures.pmi, 10)
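
The finder above ranks trigrams by PMI, which compares how often words occur together with how often they would co-occur if they were independent. A rough sketch of the idea for bigrams, using only the standard library; the toy tokens and the `bigram_pmi` helper are hypothetical, and the formula is the textbook definition, not a copy of NLTK's code:

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=1):
    """Map each adjacent word pair to log2(p(x, y) / (p(x) * p(y))).

    Pairs seen fewer than min_count times are skipped.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (x, y), count in pair_counts.items():
        if count < min_count:
            continue
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Made-up text for illustration
toy_tokens = "new york is big and new york is old and the sea is big".split()

all_scores = bigram_pmi(toy_tokens)
filtered_scores = bigram_pmi(toy_tokens, min_count=2)

print(max(all_scores, key=all_scores.get))            # ('the', 'sea')
print(max(filtered_scores, key=filtered_scores.get))  # ('new', 'york')
```

On this toy input the pair that occurs only once, `('the', 'sea')`, outscores the repeated `('new', 'york')` because both of its words are rare. That is the kind of effect a frequency filter such as `apply_freq_filter` is meant to dampen.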

## Concordances
Count words and see them in context.

In [None]:
text1.concordance("whale")
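
A concordance lists every occurrence of a word together with the words around it. A simplified sketch of that idea follows; the `simple_concordance` helper and the sample sentence are illustrative assumptions, and the sketch returns a list of strings, whereas NLTK's method prints its matches.

```python
def simple_concordance(tokens, target, window=3):
    """Return each case-insensitive match of target with up to `window` tokens on each side."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [token] + right))
    return lines

# Made-up sentence for illustration
toy_tokens = "The old whale swam past the ship and the Whale dived".split()

for line in simple_concordance(toy_tokens, "whale"):
    print(line)
# The old whale swam past the
# ship and the Whale dived
```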

In [None]:
text1.count("monster")

In [None]:
text1.similar("monstrous")

In [None]:
text1.count("whale") / len(text1) * 100

In [None]:
fdist = FreqDist(text1)

In [None]:
fdist

In [None]:
# items() is a view in Python 3, so convert it to a list before slicing
list(fdist.items())[:50]

In [None]:
fdist.most_common(50)

### Filtering Word Lists

In [None]:
all_words = [w.lower() for w in text1 if w.isalpha()]
fdist_words = FreqDist(all_words)

In [None]:
list(fdist_words.items())[:50]

In [None]:
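from nltk.corpus import stopwords
# Build the stopword set once; set membership tests are fast
english_stopwords = set(stopwords.words('english'))
filtered_words = [w for w in all_words if w not in english_stopwords]
fdist_filtered_words = FreqDist(filtered_words)
list(fdist_filtered_words.items())[:50]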
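
The filtering steps in this section lowercase the tokens, drop anything that is not purely alphabetic, remove stopwords, and count what is left. The same pipeline can be sketched with `collections.Counter` from the standard library; the tiny stopword set and token list here are placeholders chosen for illustration, not NLTK's English stopword list.

```python
from collections import Counter

# Placeholder data, not taken from any NLTK corpus
toy_stopwords = {"the", "a", "of", "and"}
toy_tokens = ["The", "Whale", ",", "the", "sea", "and", "a", "whale", "of", "1851", "!"]

# Keep alphabetic tokens only, lowercased
words = [t.lower() for t in toy_tokens if t.isalpha()]
# Remove stopwords
content_words = [w for w in words if w not in toy_stopwords]
# Count what remains
counts = Counter(content_words)

print(counts.most_common(2))  # [('whale', 2), ('sea', 1)]
```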

In [None]:
[word for word in set(filtered_words) if len(word) > 15]

In [None]:
import re
[(word, filtered_words.count(word)) for word in set(filtered_words) if re.search(r'^un.*ly$', word)]

## Classifying Text

Assigning text to categories algorithmically.

In [None]:
nltk.pos_tag(sent2)

[Reference](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) for part of speech tags.

In [None]:
nltk.corpus.brown.tagged_paras()

Supervised machine learning for gender classification of names.

In [None]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [None]:
from nltk.corpus import names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

In [None]:
labeled_names

In [None]:
import random
random.shuffle(labeled_names)

In [None]:
labeled_names

In [None]:
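# Turn each (name, gender) pair into a (features, gender) pair
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
# Hold out the first 500 shuffled names for testing; train on the rest
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)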

In [None]:
classifier.classify(gender_features('Neo'))

In [None]:
classifier.classify(gender_features('Trinity'))

In [None]:
nltk.classify.accuracy(classifier, test_set)

In [None]:
classifier.show_most_informative_features(5)
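
With a single feature, a naive Bayes classifier scores each label by multiplying how common the label is by how often that feature value was seen with the label, then picks the highest score. The sketch below works through that arithmetic on a handful of made-up names. It uses add-one smoothing for simplicity, which is an assumption of this example and not necessarily how `nltk.NaiveBayesClassifier` estimates its probabilities; the helper names are likewise hypothetical.

```python
from collections import Counter, defaultdict

def train_last_letter_model(labeled):
    """Count labels and, for each label, the last letters of its names."""
    label_counts = Counter(label for _, label in labeled)
    letter_counts = defaultdict(Counter)
    for name, label in labeled:
        letter_counts[label][name[-1].lower()] += 1
    return label_counts, letter_counts

def classify_last_letter(name, label_counts, letter_counts, alphabet_size=26):
    """Pick the label maximising P(label) * P(last_letter | label), add-one smoothed."""
    total = sum(label_counts.values())
    letter = name[-1].lower()
    best_label, best_score = None, -1.0
    for label in sorted(label_counts):
        prior = label_counts[label] / total
        likelihood = (letter_counts[label][letter] + 1) / (label_counts[label] + alphabet_size)
        score = prior * likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for illustration
toy_names = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
             ("John", "male"), ("Mark", "male"), ("Owen", "male")]
label_counts, letter_counts = train_last_letter_model(toy_names)

print(classify_last_letter("Sophia", label_counts, letter_counts))  # female
print(classify_last_letter("Ethan", label_counts, letter_counts))   # male
```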

In [None]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [None]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

In [None]:
classifier.show_most_informative_features(5)
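
The cells above repeat the earlier feature extractor, training, and evaluation steps unchanged, which makes them a convenient place to try a different feature set. One possible variation is sketched below; the function name `gender_features_v2` and the choice of features are assumptions made for illustration, not something this workshop prescribes.

```python
def gender_features_v2(word):
    """Hypothetical richer feature set: first letter, last letter, last two letters, length."""
    lowered = word.lower()
    return {
        'first_letter': lowered[0],
        'last_letter': lowered[-1],
        'last_two_letters': lowered[-2:],
        'length': len(lowered),
    }

print(gender_features_v2('Trinity'))
# {'first_letter': 't', 'last_letter': 'y', 'last_two_letters': 'ty', 'length': 7}
```

Swapping this function into the `featuresets` comprehension in place of `gender_features` and retraining lets you compare the accuracy figures. Whether it helps depends on the shuffle and the split.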