# Modeling: Processing Text and Developing Topic Models

## Introduction

- Topic modeling algorithms such as Latent Dirichlet allocation extract themes from a set of text documents.

- Clustering algorithms such as K-means and Gaussian mixture models assume that an observation belongs to one and only one cluster.

- Latent Dirichlet allocation instead models each document as a mixture of topics, so a document can belong to several topics at once.

- The number of topics is a hyperparameter.

- Topic models can be used to categorize tweets.

- In this demonstration we will use latent Dirichlet allocation (LDA) to look for topics in tweets. We will also perform some basic natural language processing (NLP) to prepare the data for LDA.
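
The contrast between hard clustering and LDA's mixed membership can be sketched in plain Python. This is only an illustration of the modeling assumption, not of how Spark fits the model: the function name `sample_topic_mixture`, the `alpha` value, and the seed are all illustrative choices that do not come from the lesson.

```python
import random

def sample_topic_mixture(k, alpha, rng):
    """Draw one document's topic proportions from a symmetric Dirichlet(alpha)."""
    # A Dirichlet draw can be built by normalizing independent Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(23456)  # arbitrary seed, used only for reproducibility

# Hard clustering (K-means style): all weight sits on exactly one cluster.
hard_assignment = [0.0, 1.0, 0.0]

# LDA: the document spreads its weight across all k topics.
topic_mixture = sample_topic_mixture(k=3, alpha=0.5, rng=rng)

print(hard_assignment)
print([round(p, 3) for p in topic_mixture])
```

Both lists sum to 1, but only the second allows a tweet to be partly about one topic and partly about another.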

## Lesson

### Setup

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import explode

import matplotlib.pyplot as plt
%matplotlib inline

# Start a SparkContext and wrap it in a SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)


- Create a SparkSession:

In [2]:
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("lda_text.py")
    .getOrCreate()
)

- Load the Twitter data, which includes the following attributes: "Topic", "Sentiment", "TweetId", "TweetDate", "TweetText":

In [3]:
data = "data/twitter_corpus.csv"
df = spark.read.csv(data, header=True).na.drop()
df.limit(5).toPandas()

|   | Topic | Sentiment | TweetId | TweetDate | TweetText |
|---|-------|-----------|---------|-----------|-----------|
| 0 | apple | positive | 126415614616154112 | Tue Oct 18 21:53:25 +0000 2011 | Now all @Apple has to do is get swype on the i... |
| 1 | apple | positive | 126404574230740992 | Tue Oct 18 21:09:33 +0000 2011 | @Apple will be adding more carrier support to ... |
| 2 | apple | positive | 126402758403305474 | Tue Oct 18 21:02:20 +0000 2011 | Hilarious @youtube video - guy does a duet wit... |
| 3 | apple | positive | 126397179614068736 | Tue Oct 18 20:40:10 +0000 2011 | @RIM you made it too easy for me to switch to ... |
| 4 | apple | positive | 126395626979196928 | Tue Oct 18 20:34:00 +0000 2011 | I just realized that the reason I got into twi... |


### Extracting and transforming features

- Spark MLlib provides a number of feature extractors and feature transformers to preprocess the tweets into a form appropriate for modeling.

#### Clean and tokenize the tweets

- Use the [RegexTokenizer](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer) class to tokenize the words:

In [4]:
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer

# regular expression tokenizer
regexTokenizer = RegexTokenizer(inputCol="TweetText", outputCol="words", pattern="\\W")
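
To see what this tokenizer configuration does to a tweet, here is a rough plain-Python imitation. It assumes the `RegexTokenizer` defaults (lowercase the text, treat the pattern as a separator, keep tokens of length 1 or more); the helper name `regex_tokenize` is hypothetical and this is a sketch, not Spark's implementation.

```python
import re

def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Lowercase the text, split on the pattern, and drop tokens that are too short."""
    # re.ASCII restricts \W to "anything other than a-z, A-Z, 0-9, or _",
    # which is assumed here to match the Java regex default that Spark uses.
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    return [p for p in pieces if len(p) >= min_token_length]

tweet = "Pretty much sums up the love affair! http://t.co/8ExbnQjY"
tokens = regex_tokenize(tweet)
print(tokens)
```

Note how the URL is broken into fragments such as `http` and `t`. Fragments like these carry no topical meaning, which is one reason a custom stop-word list is useful for tweets.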

#### Remove stopwords

- Spark MLlib provides a transformer to remove stopwords.
- Use the [StopWordsRemover](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover) class to remove common words:

In [5]:
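# Custom stop words. Note that setStopWords() replaces the default English
# stop-word list with this list rather than adding to it.
add_stopwords = ["http", "https", "amp", "rt", "t", "c", "the"]

stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered").\
        setStopWords(add_stopwords)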
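
The effect of this stop-word list can be sketched without Spark. The helper `remove_stop_words` below is hypothetical and assumes case-insensitive matching, which is the `StopWordsRemover` default as far as we know.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that are not in the stop-word list."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]

custom_stop_words = ["http", "https", "amp", "rt", "t", "c", "the"]
tokens = ["pretty", "much", "sums", "up", "the", "love",
          "affair", "http", "t", "co", "8exbnqjy"]

filtered = remove_stop_words(tokens, custom_stop_words)
print(filtered)
```

Common words such as `up` and `much` survive because only the seven custom words are filtered. If the intent is to extend the built-in English list rather than replace it, one option is to pass `StopWordsRemover.loadDefaultStopWords("english") + add_stopwords` to `setStopWords`; doing so would change the topics reported later in this lesson.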

#### Count the frequency of words in each tweet

- Use the [CountVectorizer](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer) class to compute the term frequency:

In [6]:
# bag of words count
countVectors = CountVectorizer(inputCol="filtered", outputCol="features", 
                               vocabSize=100, minDF=2)

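- The `fit` method builds the vocabulary from the $N$ most frequent words in the corpus, where $N$ is set via the `vocabSize` hyperparameter; with `minDF=2`, words that appear in fewer than two tweets are excluded: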

In [7]:
from pyspark.ml import Pipeline

In [8]:
from functools import reduce

def pipe(v, *fns):
    return reduce(lambda x, f: f(x), fns, v)
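
The `pipe` helper threads a value through a sequence of functions, left to right. A small self-contained illustration, using ordinary string methods in place of Spark transformers (the example input is made up):

```python
from functools import reduce

def pipe(v, *fns):
    """Apply each function in turn, feeding the previous result into the next."""
    return reduce(lambda x, f: f(x), fns, v)

# Equivalent to str.split(str.lower(str.strip("  Now all @Apple  ")))
result = pipe("  Now all @Apple  ", str.strip, str.lower, str.split)
print(result)
```

In the notebook the same pattern could presumably chain transformer steps, for example `pipe(df, regexTokenizer.transform, stopwordsRemover.transform)`.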

In [9]:
pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors])

# Fit the pipeline to training documents.
pipelineFit = pipeline.fit(df)

In [10]:
dataset = pipelineFit.transform(df)

In [11]:
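# Troubleshooting: run the stages one at a time instead of using the pipeline.

tokenized = regexTokenizer.transform(df)
filtered = stopwordsRemover.transform(tokenized)

# CountVectorizer is an estimator: fit() returns a CountVectorizerModel, not a
# DataFrame, so the fitted model must then be applied with transform().
cv_model = countVectors.fit(filtered)
vectorized = cv_model.transform(filtered)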

- The resulting word vector is stored as a [SparseVector](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.linalg.SparseVector).
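
The vocabulary-building and counting steps can be illustrated in plain Python. This is a simplified sketch of the idea behind `CountVectorizer` with `vocabSize` and `minDF`, not Spark's code: the helper names, the toy documents, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df):
    """Keep terms found in at least min_df documents, most frequent first."""
    term_counts = Counter()  # total occurrences across the corpus
    doc_counts = Counter()   # number of documents containing the term
    for tokens in docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    eligible = [t for t in term_counts if doc_counts[t] >= min_df]
    # Ties are broken alphabetically so that the sketch is deterministic.
    eligible.sort(key=lambda t: (-term_counts[t], t))
    return eligible[:vocab_size]

def to_sparse_counts(tokens, vocabulary):
    """Map a document to {term index: count}, storing only nonzero entries."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    return dict(sorted(counts.items()))

docs = [
    ["apple", "iphone", "siri", "iphone"],
    ["apple", "iphone", "carrier"],
    ["apple", "twitter", "ios5"],
]
vocabulary = fit_vocabulary(docs, vocab_size=100, min_df=2)
vectors = [to_sparse_counts(d, vocabulary) for d in docs]
print(vocabulary)
print(vectors)
```

Storing only the nonzero counts is the same idea as a sparse vector: most tweets use a small fraction of the vocabulary, so most entries would otherwise be zero.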


### Specify and fit a topic model using latent Dirichlet allocation (LDA)

- Use the `LDA` class to specify an LDA model:

In [12]:
from pyspark.ml.clustering import LDA
lda = LDA(featuresCol="features", k=2, seed=23456)

- Use the `explainParams` method to examine additional hyperparameters:

In [13]:
print(lda.explainParams())

checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
docConcentration: Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). (undefined)
featuresCol: features column name. (default: features, current: features)
k: The number of topics (clusters) to infer. Must be > 1. (default: 10, current: 2)
keepLastCheckpoint: (For EM optimizer) If using checkpointing, this indicates whether to keep the last checkpoint. If false, then the checkpoint will be deleted. Deleting the checkpoint can cause failures if a data partition is lost, so set this bit with care. (default: True)
learningDecay: Learning rate, set as anexponential decay rate. This should be between (0.5, 1.0] to guarantee asymptotic convergence. (default: 0.51)
lear

- Use the `fit` method to fit the LDA model:

In [14]:
lda_model = lda.fit(dataset)

- The resulting model is an instance of the `LocalLDAModel` class, a subclass of `LDAModel`:

In [15]:
type(lda_model)

pyspark.ml.clustering.LocalLDAModel

### Examine the LDA topic model

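- Examine the estimated document concentration (the Dirichlet parameter "alpha" of the prior on each document's distribution over topics), which has one value per topic: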

In [16]:
lda_model.estimatedDocConcentration()

DenseVector([0.4833, 0.5488])

- Examine the topics matrix, which has one row per vocabulary term and one column per topic (here 100 x 2). The values shown are not normalized to sum to 1:

In [17]:
lda_model.topicsMatrix()

DenseMatrix(100, 2, [372.4489, 668.3935, 31.1719, 308.9413, 22.1776, 100.7804, 126.8033, 71.8673, ..., 37.2824, 37.0261, 24.7345, 26.9979, 22.9744, 1.2864, 1.2527, 20.4995], 0)

- Examine the topics:

In [18]:
lda_model.describeTopics().head(5)

[Row(topic=0, termIndices=[1, 0, 8, 3, 13, 9, 25, 14, 24, 6], termWeights=[0.1248746577244419, 0.06958390260879416, 0.06040851853180488, 0.0577188961304506, 0.03181813614912774, 0.029369340526942275, 0.024398159614899058, 0.024138727222349568, 0.02377859546634016, 0.023690416903199462]),
 Row(topic=1, termIndices=[2, 0, 4, 5, 3, 6, 7, 15, 10, 11], termWeights=[0.08197539387027768, 0.07460177931634981, 0.06827597090783148, 0.045118127189146956, 0.03880336339759435, 0.03753399649358934, 0.032804124666203006, 0.024632188134694716, 0.02440204456515243, 0.023828759138879642])]
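
The raw entries of `topicsMatrix()` are far larger than 1, so they cannot be probabilities. The check below uses only numbers printed above and suggests that the `termWeights` from `describeTopics()` are the matrix columns rescaled to sum to 1. It relies on the matrix being stored column by column, so that its first two values belong to terms 0 and 1 of topic 0. The variable names are illustrative.

```python
# First two stored values of topicsMatrix(): terms 0 and 1 of topic 0.
raw_term0, raw_term1 = 372.4489, 668.3935

# Topic 0 in describeTopics() lists termIndices [1, 0, ...], so the first
# weight belongs to term 1 and the second weight belongs to term 0.
weight_term1, weight_term0 = 0.1248746577244419, 0.06958390260879416

raw_ratio = raw_term0 / raw_term1
weight_ratio = weight_term0 / weight_term1
print(round(raw_ratio, 4), round(weight_ratio, 4))

# If the weights are the raw column divided by its sum, this estimates that sum.
implied_column_total = raw_term1 / weight_term1
print(round(implied_column_total, 1))
```

The two ratios agree to about four decimal places, which is consistent with a simple per-topic normalization.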

- Create a function to print out the terms and weights for each topic:

In [19]:
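def print_topics(model, n_terms, vocabulary):
    # describeTopics returns one row per topic with its top term indices and weights
    rows = model.describeTopics(n_terms).collect()
    for row in rows:
        print("---- Topic %s ----" % row["topic"])
        # zip() is lazy in Python 3, so build a list before printing
        print(list(zip([vocabulary[i] for i in row["termIndices"]], row["termWeights"])))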

- Print the topics:
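
The cell for this step is not shown in the source. As a hedged, Spark-free sketch of what the printing step does, the snippet below maps term indices to words for the first three terms of each topic reported by `describeTopics()` above (weights rounded to four decimals). The vocabulary is hypothetical: the placeholder words `word0`, `word1`, and so on stand in for the real fitted vocabulary, and `format_topics` is an illustrative helper.

```python
def format_topics(rows, vocabulary):
    """Return printable lines pairing each topic's terms with their weights."""
    lines = []
    for row in rows:
        lines.append("---- Topic %s ----" % row["topic"])
        for index, weight in zip(row["termIndices"], row["termWeights"]):
            lines.append("%-10s %.4f" % (vocabulary[index], weight))
    return lines

# Hypothetical vocabulary: placeholder words, NOT the real fitted vocabulary.
vocabulary = ["word%d" % i for i in range(100)]

# First three terms per topic, copied from the describeTopics() output.
rows = [
    {"topic": 0, "termIndices": [1, 0, 8], "termWeights": [0.1249, 0.0696, 0.0604]},
    {"topic": 1, "termIndices": [2, 0, 4], "termWeights": [0.0820, 0.0746, 0.0683]},
]

for line in format_topics(rows, vocabulary):
    print(line)
```

In the notebook, the real vocabulary would presumably come from the fitted `CountVectorizerModel`, for example `pipelineFit.stages[-1].vocabulary`, which could then be passed to `print_topics` together with `lda_model`.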

### Apply the topic model

In [20]:
predictions = lda_model.transform(dataset)
predictions.select("TweetText", "topicDistribution").head(5)

[Row(TweetText='Now all @Apple has to do is get swype on the iphone and it will be crack. Iphone that is', topicDistribution=DenseVector([0.0352, 0.9648])),
 Row(TweetText='@Apple will be adding more carrier support to the iPhone 4S (just announced)', topicDistribution=DenseVector([0.0602, 0.9398])),
 Row(TweetText="Hilarious @youtube video - guy does a duet with @apple 's Siri. Pretty much sums up the love affair! http://t.co/8ExbnQjY", topicDistribution=DenseVector([0.0876, 0.9124])),
 Row(TweetText='@RIM you made it too easy for me to switch to @Apple iPhone. See ya!', topicDistribution=DenseVector([0.0608, 0.9392])),
 Row(TweetText='I just realized that the reason I got into twitter was ios5 thanks @apple', topicDistribution=DenseVector([0.0733, 0.9267]))]
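
Each `topicDistribution` is a vector of topic proportions that sums to 1. One simple way to turn it into a single label is to take the topic with the largest proportion. The sketch below does this in plain Python for the five distributions shown above; `dominant_topic` is an illustrative helper, not part of the Spark API.

```python
# Topic distributions copied from the output above (topic 0, topic 1).
topic_distributions = [
    [0.0352, 0.9648],
    [0.0602, 0.9398],
    [0.0876, 0.9124],
    [0.0608, 0.9392],
    [0.0733, 0.9267],
]

def dominant_topic(distribution):
    """Return the index of the topic with the largest proportion."""
    return max(range(len(distribution)), key=lambda i: distribution[i])

labels = [dominant_topic(d) for d in topic_distributions]
print(labels)
```

All five of these tweets are assigned mostly to topic 1. Collapsing to a single label discards the mixed-membership information, so it is best treated as a convenience for summaries.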

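- Examine model fit measures. `logLikelihood` returns a lower bound on the log likelihood of the corpus, and `logPerplexity` returns an upper bound on perplexity on a log scale (lower is better). Only the last expression in a cell is displayed, so the value below is the log perplexity: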

In [21]:
lda_model.logLikelihood(dataset)
lda_model.logPerplexity(dataset)

4.149987740101946
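
A log perplexity is easier to interpret after exponentiating. The short sketch below converts the value displayed above, assuming it is a natural logarithm; that base is an assumption rather than something stated in the lesson.

```python
import math

log_perplexity = 4.149987740101946  # value displayed by the cell above

# Assumes natural logarithms; with another base the conversion would differ.
perplexity = math.exp(log_perplexity)
print(round(perplexity, 1))
```

Under that assumption the perplexity bound is about 63, which can loosely be read as the model being as uncertain about each word as a uniform choice among roughly 63 words out of the 100-word vocabulary.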

## Further Reading

- [Wikipedia - Topic model](https://en.wikipedia.org/wiki/Topic_model)
- [Wikipedia - Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
- [Spark Documentation - Latent Dirichlet allocation](http://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda)
- [Spark Python API - pyspark.ml.clustering.LDA class](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA)
- [Spark Python API - pyspark.ml.clustering.LDAModel class](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.LDAModel)
- [Spark Python API - pyspark.ml.clustering.LocalLDAModel class](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.LocalLDAModel)
- [Spark Python API - pyspark.ml.clustering.DistributedLDAModel class](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.DistributedLDAModel)

## Exercises

1. Use the `NGram` transformer to generate pairs of words (bigrams) from the tokenized tweets.  *See below if you need some hints/additional guidance*.

2. Fit an LDA model with $k=3$ topics.  *See below if you need some hints/additional guidance*.

### Exercise Hints

#### Exercise 1:

1. Import the `NGram` class from the `pyspark.ml.feature` module
2. Create an instance of the `NGram` class
3. Use the `transform` method to apply the `NGram` instance to the `tokenized` DataFrame
4. Print out a few rows of the transformed DataFrame
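
To get a feel for the expected result before using Spark, here is a plain-Python sketch of bigram generation. It assumes, as the `NGram` transformer does to our knowledge, that consecutive tokens are joined with a single space; the helper name `bigrams` and the sample tokens are illustrative.

```python
def bigrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["you", "made", "it", "too", "easy"]
print(bigrams(tokens))
```

A list with fewer than two tokens yields no bigrams at all.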

#### Exercise 2:

1. Use the `setK` method to change the number of topics for the `lda` instance
2. Use the `fit` method to fit the LDA model to the `vectorized` DataFrame
3. Use the `print_topics` function to examine the topics
4. Use the `transform` method to apply the LDA model to the `vectorized` DataFrame
5. Print out a few rows of the transformed DataFrame