# Topic Modeling Methods

Topic modeling is a powerful tool for quickly sorting through a large collection of documents without having to read every one. Python offers several methods for this, spread across several libraries. Getting meaningful results from topic modeling is extremely challenging, though. "Garbage in, garbage out" applies well here: we have to do a significant amount of text preprocessing to extract the right information to feed into a model. On this sheet, I will be topic modeling Supreme Court cases with the following:

- scikit-learn
  - LDA (with TF)
  - LSA, AKA TruncatedSVD (with TF and TF-IDF)
  - NMF (with TF-IDF)

## Reminder of Full Project Workflow

Extracting text using Beautiful Soup --> processing the text --> fitting text to a model --> applying the model to other text

### Software Package & Built in Function Documentation
- textblob - http://textblob.readthedocs.io/en/dev/

In [None]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from textblob import TextBlob
from sklearn.preprocessing import Normalizer

In [None]:
doc_list = pd.read_pickle("full_proj_lemmatized3.pickle")  # load the pickled, lemmatized corpus (always save your work!)

In [None]:
doc_list.shape #checking to make sure we have the info we expected to have

## Testing Models

Try LDA, NMF, and LSA, and adjust the number of features, the number of topics, and the overlap for the best results.

In [None]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
    
def modeler(corp, n_topics, n_top_words, clf, vect):
    # n_topics is only echoed in the log message; the number of topics that
    # actually gets fitted is whatever n_components was set to on clf.
    str_vect = str(vect).split("(")[0]
    str_clf = str(clf).split("(")[0]

    print("Extracting {} features for {}...".format(str_vect, str_clf))
    vect_trans = vect.fit_transform(corp)
    n_features = vect_trans.shape[1]  # size of the fitted vocabulary

    # Fit the model
    print("Fitting the {} model with {} features, "
          "n_topics= {}, n_topic_words= {}, n_features= {}..."
          .format(str_clf, str_vect, n_topics, n_top_words, n_features))

    clf = clf.fit(vect_trans)
    if str_clf == "TruncatedSVD":
        print("\nExplained variance ratio", clf.explained_variance_ratio_)

    print("\nTopics in {} model:".format(str_clf))
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = vect.get_feature_names_out()
    return print_top_words(clf, feature_names, n_top_words)

### Latent Dirichlet Allocation Model
In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

![LDA](https://image.slidesharecdn.com/topicmodeling-140603161649-phpapp02/95/topic-modeling-12-638.jpg?cb=1402404085)

#### Improving Accuracy of Topic Modeling
1. Frequency Filter
2. Part of Speech Tag Filter 
3. Batch Wise LDA

The results of topic models are completely dependent on the features (terms) present in the corpus. The corpus is represented as a document-term matrix, which is generally very sparse. Reducing the dimensionality of this matrix can improve the results of topic modeling. Based on my practical experience, a few approaches do the trick.

**Frequency Filter**

Arrange the terms according to their frequency. Terms with higher frequencies are more likely to appear in the results than ones with low frequencies. The low-frequency terms are essentially weak features of the corpus, so it is good practice to get rid of them. An exploratory analysis of terms and their frequencies can help decide what frequency value to use as the threshold.

**Part of Speech Tag Filter**

A POS tag filter is more about the context of the features than their frequencies. Topic modeling tries to map the recurring patterns of terms into topics. However, not every term is equally important contextually. For example, the POS tag "IN" contains terms such as "within", "upon", and "except"; "CD" contains "one", "two", "hundred", etc.; and "MD" contains "may", "must", etc. These terms are the supporting words of a language and can be removed by studying their POS tags.
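A minimal sketch of such a filter is shown below. The names `DROP_TAGS` and `filter_by_pos` are hypothetical helpers, and the set of tags to drop is an assumption taken from the examples above. The input is hand-tagged so the snippet runs without a tagger model; with TextBlob, `TextBlob(text).tags` yields `(word, tag)` pairs of the same form.

```python
# Tags to discard: prepositions (IN), cardinal numbers (CD), modals (MD).
# This set is an assumption based on the examples in the text; adjust as needed.
DROP_TAGS = {"IN", "CD", "MD"}


def filter_by_pos(tagged_tokens, drop_tags=DROP_TAGS):
    """Keep only the words whose POS tag is not in drop_tags."""
    return [word for word, tag in tagged_tokens if tag not in drop_tags]


# Hand-tagged example, made up for illustration.
tagged = [
    ("court", "NN"), ("may", "MD"), ("rule", "VB"),
    ("within", "IN"), ("two", "CD"), ("weeks", "NNS"),
]
filtered = filter_by_pos(tagged)
print(filtered)
```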


**Batch Wise LDA**

To retrieve the most important topic terms, a corpus can be divided into batches of a fixed size. Running LDA on each of these batches will give different results, but the best topic terms will be those in the intersection of all batches.

![LDA Explained](http://chdoig.github.io/pytexas2015-topic-modeling/images/lda-3.png)

In [None]:
# n_topics was renamed to n_components in scikit-learn 0.19
modeler(doc_list.lem, 30, 30,
        LatentDirichletAllocation(n_components=30, max_iter=5, learning_method='online',
                                  learning_offset=50., random_state=0),
        CountVectorizer(max_df=.80, min_df=2, stop_words='english'))

![](../images/lda-out.png)

In [None]:
LDA_mod(doc_list.lem, .95, 2, 2000, 10)  # the document-frequency (df) cutoffs are a way to keep only the 'meaningful text' in this case

![](../images/lda-out2.png)

#### Notes about LDA model performance

LDA is the model that comes up most often in conversations about topic modeling, but it has proven ineffective for this project. It performs poorly at picking up subtle differences within a corpus that is all about the same subject. If I wanted to tell Apple products from apple the fruit, LDA would probably work; it does not work when I need to tell apart cases where the majority of the text is about the law.

This is likely because LDA is built to work on raw term counts (a count vectorizer) rather than tf-idf weights, so the plain bag of words is a serious limitation for finding how these documents relate.

### Truncated SVD (LSA) Model
This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). It is very similar to PCA, but operates on sample vectors directly, instead of on a covariance matrix. This means it can work with scipy.sparse matrices efficiently.

Notes: SVD suffers from a problem called "sign indeterminacy", which means the sign of the `components_` and the output from `transform` depend on the algorithm and random state. To work around this, fit instances of this class to data once, then keep the instance around to do transformations.
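The fit-once pattern might look like the sketch below. The two toy document lists are made up for illustration; the point is only that both the vectorizer and the `TruncatedSVD` instance are fitted a single time and then reused.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration only.
train_docs = [
    "patent infringement claim in federal court",
    "tax appeal over income claim",
    "contract breach and damages claim",
    "patent damages ruling on appeal",
]
new_docs = [
    "income tax ruling",
    "patent claim in court",
]

vect = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)

# Fit once, then keep both fitted objects around.
train_coords = svd.fit_transform(vect.fit_transform(train_docs))

# Later: transform only. Refitting could flip the sign of a component,
# which would make the new coordinates incomparable with the old ones.
new_coords = svd.transform(vect.transform(new_docs))
print(train_coords.shape, new_coords.shape)
```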

![LSA](https://image.slidesharecdn.com/ldatutorial-150317223352-conversion-gate01/95/topic-modeling-for-learning-analytics-researchers-lak15-tutorial-66-638.jpg?cb=1426894536)

In [None]:
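# Only 2 components are fitted here; the 100 is just echoed as n_topics in the log message
modeler(doc_list.lem, 100, 30, TruncatedSVD(n_components=2, algorithm='arpack'),
        TfidfVectorizer(max_df=.8, min_df=2, stop_words='english'))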

![](../images/lsa-out.png)

#### Notes about LSA performance

Issues similar to LDA - it's good at pulling out the law themes, but that's not really what we need. We need the law terms to not play a role at all in modeling for these topics - we know that this entire corpus is about the law, but we need to know what KIND of law each case within the corpus is about.

### NMF model
Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. This factorization can be used, for example, for dimensionality reduction, source separation, or topic extraction.
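To make the factorization concrete, here is a small sketch on a made-up 4 x 5 matrix (think of 4 documents and 5 terms). The values of `X`, the choice of 2 components, and the `nndsvd` initialization are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Small non-negative matrix X (4 "documents" x 5 "terms"), made up for illustration.
X = np.array([
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [2.0, 3.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 4.0, 3.0, 0.0],
    [0.0, 1.0, 3.0, 4.0, 0.0],
])

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X)  # documents x topics
H = model.components_       # topics x terms

print(W.shape, H.shape)
print(np.round(W @ H, 1))   # approximate reconstruction of X
```

Each row of `H` is a topic expressed as weights over terms, and each row of `W` says how much of each topic a document contains. Both stay non-negative, which is what makes the topics readable as additive parts.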

![NMF](https://image.slidesharecdn.com/nlpmeetupsept2016derekgreene-160929091010/95/dynamic-topic-modeling-via-nonnegative-matrix-factorization-dr-derek-greene-4-638.jpg?cb=1475140310)

In [None]:
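# alpha was removed in scikit-learn 1.2; newer versions use alpha_W / alpha_H,
# which are scaled differently, so the value may need retuning there
modeler(doc_list.lem, 30, 30, NMF(n_components=30, random_state=1, alpha=.1, l1_ratio=.5),
        TfidfVectorizer(max_df=.98, min_df=2, stop_words='english'))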

![](../images/nmf-out.png)

#### Notes about NMF performance
Seeing these results should make you happy. Being able to use tf-idf is very important for modeling the kind of law each case within the corpus is about.