# Text processing and topic modeling with Gensim + NLTK

This notebook provides a brief introduction to the [Natural Language ToolKit (NLTK)](https://www.nltk.org/) and [gensim](https://radimrehurek.com/gensim/index.html).

NLTK has tons of features, the vast majority of which we won't use. Perhaps the most useful aspect is that it ships with a number of text corpora on which to train. In addition, it has modules for processing text (tokenization, stemming/lemmatization), parsing sentences, tagging parts of speech, analyzing word sentiment (SentiWordNet), spelling correction, and much more.

In [None]:
import collections
import nltk
import re

from nltk.corpus import treebank
from nltk.stem import lancaster
from gensim import corpora, models

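We'll need to grab the "treebank" corpus for this notebook, along with the stopword list and the Punkt tokenizer models used further down (newer NLTK releases may ask for `punkt_tab` instead of `punkt`):

In [None]:
for resource in ('treebank', 'stopwords', 'punkt'):
    nltk.download(resource)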

Gensim reports its progress through Python's standard `logging` module; just run the below to make those messages print nicely:

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

**Goal:** Discover a reasonable topic distribution on the treebank corpus. (Using LSI, LDA, whatever.)

In [None]:
nltk.corpus.treebank

Let's take a quick peek at the first several sentences:

In [None]:
for i, s in enumerate(treebank.sents()):
    if i >= 5:
        # stop after printing the first 5 sentences
        break
    print(' '.join(s))

## Tokenization

Our data is already split up into tokens, but you might need to tokenize in your applications. NLTK to the rescue:
* `nltk.tokenize.sent_tokenize()` for breaking into sentences
* `nltk.tokenize.word_tokenize()` for breaking into words
* `nltk.tokenize.casual.casual_tokenize()` for Twitter-aware tokenizing (keeps @handles, #hashtags, and emoticons together as single tokens)
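
To get a feel for what "Twitter-aware" means, here is a toy regex tokenizer written with only the standard library. It is a rough illustration, not NLTK's actual rules: the pattern and the `simple_tweet_tokenize` name are made up for this sketch.

```python
import re

# Hypothetical pattern: try @handles/#hashtags first, then words
# (optionally with an apostrophe part), then single punctuation marks.
TWEET_TOKEN = re.compile(r"[@#]\w+|\w+(?:'\w+)?|[^\w\s]")

def simple_tweet_tokenize(text):
    """Split text into tokens, keeping @handles and #hashtags whole."""
    return TWEET_TOKEN.findall(text)

sample = "It's been too long since i've seen a @RascalFlatts concert. #countrymusic"
tokens = simple_tweet_tokenize(sample)
print(tokens)
```

A general-purpose word tokenizer has no reason to treat `@` or `#` as part of a word, which is why the two NLTK tokenizers below give different results on the same tweet.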

In [None]:
tweet = "It's been too long since i've seen a @RascalFlatts concert. #countrymusic"

In [None]:
print(nltk.tokenize.word_tokenize(tweet))

In [None]:
print(nltk.tokenize.casual.casual_tokenize(tweet))

## Stopwords, stemming

Filler words like "the" and "of" show up everywhere and tell us little about what a document is about, so we don't want them to play a role in our modeling.

In [None]:
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords.union({'said', 'would', 'could', 'should', 'may', 'one', 'two', 'know', 'get', "i'm"})

*Stemming* reduces words down to a common root. There are a few different stemmers available in NLTK, e.g. `nltk.stem.porter.PorterStemmer`, `nltk.stem.lancaster.LancasterStemmer`, and `nltk.stem.SnowballStemmer`. We'll just use Lancaster here:

In [None]:
stemmer = lancaster.LancasterStemmer()

A few examples:

In [None]:
stemmer.stem('maximum'), stemmer.stem('apple'), stemmer.stem('establishment')

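Now a helper that lowercases each token, drops punctuation and stopwords, and stems whatever is left:

In [None]:
def normalize_sentence(s):
    # lowercase every token
    r = [tok.lower() for tok in s]
    # keep tokens that start with a letter or digit and are at least 2 characters long
    r = [tok for tok in r if re.match(r'[a-z0-9][a-z0-9\.\']+', tok)]
    # drop stopwords first, then stem the survivors
    return [stemmer.stem(t) for t in r if t not in stopwords]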
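
It's worth checking what the regular expression in `normalize_sentence` actually lets through. The sketch below applies the same pattern to a handful of made-up, already-lowercased tokens (the sample list is hypothetical, picked to hit the edge cases):

```python
import re

# Same pattern as in normalize_sentence; re.match only anchors at the
# start of the token, so a matching prefix is enough.
TOKEN_PATTERN = re.compile(r"[a-z0-9][a-z0-9\.\']+")

# Hypothetical sample tokens, chosen to exercise the edge cases.
sample_tokens = ['the', 'a', ',', '*-1', 'u.s.', "n't", '61', 'old-fashioned', '%', "'s"]

kept = [tok for tok in sample_tokens if TOKEN_PATTERN.match(tok)]
dropped = [tok for tok in sample_tokens if not TOKEN_PATTERN.match(tok)]

print(kept)     # ['the', 'u.s.', "n't", '61', 'old-fashioned']
print(dropped)  # ['a', ',', '*-1', '%', "'s"]
```

Two things to notice: single-character tokens are dropped because the `+` demands a second character, and hyphenated words survive only because `re.match` is happy with a matching prefix. If you want the pattern to cover the whole token, `re.fullmatch` is the stricter choice.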

Never do this in production! Pulling every sentence into a list keeps the whole corpus in memory at once. But for our purposes, we want to be able to have some insight into the raw documents:

In [None]:
raw_texts = list(treebank.sents())

Now we lowercase, remove stopwords, etc.:

In [None]:
texts = [normalize_sentence(s) for s in raw_texts]

Let's convert our corpus into bag-of-words now, with the aim of converting to TF-IDF. First, count how often each token occurs across the whole corpus:

In [None]:
word_counts = collections.Counter()
for text in texts:
    for token in text:
        word_counts[token] += 1

Let's get rid of tokens that appear only once:

In [None]:
texts = [
    [token for token in text if word_counts[token] > 1]
    for text in texts
]

Now we can use some gensim magic to convert to bag-of-words:

In [None]:
corpus_lexicon = corpora.Dictionary(texts)
corpus = [corpus_lexicon.doc2bow(text) for text in texts]

Now each document in the corpus is a bag of words, i.e. a list of `(token_id, count)` pairs. Compare a raw sentence, its normalized tokens, and its bag-of-words vector:

In [None]:
n = 110
print(' '.join(raw_texts[n]))
print(texts[n])
print(corpus[n])
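
If the `(token_id, count)` pairs look opaque, the idea can be sketched in plain Python. This is a simplified stand-in, not gensim's implementation: `build_token_ids` and `to_bow` are hypothetical helpers, and the ids gensim assigns to tokens will generally differ from the ones here.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Toy documents, assumed to be already normalized.
toy_docs = [['stock', 'market', 'stock'], ['market', 'fell']]
toy_ids = build_token_ids(toy_docs)
toy_bow = [to_bow(doc, toy_ids) for doc in toy_docs]

print(toy_ids)  # {'stock': 0, 'market': 1, 'fell': 2}
print(toy_bow)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is thrown away: all that remains is which tokens occur in a document and how many times.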

Now we'll convert to TF-IDF (term frequency-inverse document frequency):

In [None]:
tfidf = models.TfidfModel(corpus, normalize=True)
tfidf_corpus = tfidf[corpus]
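
For intuition about the weights, here is a plain-Python sketch of one common TF-IDF recipe: raw term count times `log2(N / df)`, followed by scaling each document to unit length. As far as I know this matches gensim's default settings, but treat that as an assumption; `tfidf_vectors` is a hypothetical helper, not a gensim function.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {tok: count * math.log2(n_docs / df[tok]) for tok, count in tf.items()}
        # tokens that occur in every document get weight 0; drop them
        weights = {tok: w for tok, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return vectors

# Toy documents, assumed to be already normalized.
toy_docs = [['stock', 'market', 'fell'], ['stock', 'pric', 'ros'], ['market', 'ros', 'stock']]
toy_vectors = tfidf_vectors(toy_docs)
for vec in toy_vectors:
    print(vec)
```

In the toy corpus, `stock` appears in every document, so it carries no weight at all, while `fell` (which appears in only one document) ends up with the largest weight in the first vector.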

Let's try LSI (latent semantic indexing):

In [None]:
lsi_model = models.LsiModel(tfidf_corpus, id2word=corpus_lexicon, num_topics=5)

To see the topics learned, use `show_topics`:

In [None]:
lsi_model.show_topics()

Or just look at a single topic:

In [None]:
lsi_model.show_topic(4)

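Let's try LDA (latent Dirichlet allocation) now. LDA models word counts, so we fit it on the raw bag-of-words `corpus` rather than on the TF-IDF weights: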

In [None]:
lda_model = models.LdaModel(corpus, id2word=corpus_lexicon, num_topics=5, passes=3)

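Note that picking the number of topics is tricky! There's no single right answer, so it's worth trying a few values of `num_topics` and checking whether the resulting topics look interpretable.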

In [None]:
lda_model.show_topics()