### Topic modelling pipeline using spaCy, Gensim and pyLDAvis.

In [None]:
import os, re, operator, warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import gensim
import numpy as np
import spacy

#spaCy 3+ no longer accepts the 'en' shortcut; there, load the full model name, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')

from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
#from gensim.models.wrappers import LdaMallet  #only available in gensim < 4, not used below
from gensim.corpora import Dictionary
from gensim.models.phrases import Phrases, Phraser
#pyLDAvis 3+ renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

#set working directory
os.chdir('C:/your_working_directory')

#read the original excel supplied by service department
import pandas as pd
raw_df = pd.read_excel('your_raw_input.xlsx')

#if you only have one field to import and no additional filter criteria you can cut down this code
#line breaks inside a cell are replaced with a space so the words either side do not run together
cleaned_col = raw_df['field_to_import'][raw_df['filtered_by_some_column']!='Excluding_this_criteria'].replace(r'\r|\n|\\j', ' ', regex = True).fillna('')

#\n is the new line character; each Excel cell holds one response, so it is used here to separate the responses
text = '\n'.join(cleaned_col)
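
The cleaning and joining step can be tried without a spreadsheet. This is a minimal sketch in plain Python, assuming a hypothetical list `cells` standing in for the Excel column; it is not the pandas code used in the cell itself, and it uses a space as the replacement for line breaks inside a cell.

```python
import re

# Hypothetical stand-in for the spreadsheet column: one response per cell
cells = ["Printer jams\r\nafter every page", None, "Screen flickers on start-up"]

# Line breaks inside a cell become spaces; empty cells become empty strings
pattern = re.compile(r'\r|\n|\\j')
cleaned = [pattern.sub(' ', cell) if cell is not None else '' for cell in cells]

# A new line character now only ever marks the boundary between responses
text = '\n'.join(cleaned)
print(repr(text))
```

Note that an empty cell leaves two new line characters next to each other, which is worth remembering when the text is later split back into documents.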

In [None]:
#what does the raw text look like?
text[:1000]

### Text pre-processing with spaCy

I'm assuming you already understand the standard techniques for preparing raw text for analysis. Instead of those established processes I am going to use spaCy, to show off a small portion of its power.
If you want to get really deep with spaCy, take a look at their website: https://spacy.io/

In [None]:
#ask spacy to parse the raw text into a new variable
doc = nlp(text)

In [None]:
#now how does the text look once parsed by spacy?
doc[:1000]

So that's pretty impressive: spaCy has parsed our raw text, and in the preview above you can already see each call description displayed on its own line.
But there's a whole lot more that you can't see there, so let's take a closer look.

In [None]:
#spaCy tags every token with various attributes:
#POS : the part of speech, which is essentially the type of word
#LEMMA : the base form of the word, much better than stemming which is brutal and inaccurate by comparison
#TAG : the detailed (fine-grained) part-of-speech tag
#DEP : syntactic dependency, i.e. the relation between tokens
for token in doc[2:20]: #a sample of 18 tokens near the start
    print(token.text
         ,token.lemma_
         ,token.pos_
         ,token.tag_
         ,token.dep_
         ,token.is_stop)

In [None]:
#for checking the various named entity types which we may choose to exclude later
for ent in doc.ents[23:25]:
    print(ent.text
          #, ent.start_char
          #, ent.end_char
          , ent.label_)

In [None]:
#structure our newly parsed text using only the content we are interested in
texts, article = [], []
#we are only interested in certain entity types so we're excluding these
excluded_ents = {'PERSON','CARDINAL','MONEY','TIME','DATE','GPE','LOC','QUANTITY'}
for w in doc:
    # a token containing a new line means we're onto our next document
    if '\n' in w.text:
        texts.append(article)
        article = []
    # if it's not a stop word, punctuation, whitespace, number, email, URL or excluded entity,
    # add the lemmatised version of the word to our article!
    elif not (w.is_stop or w.is_punct or w.is_space or w.like_num or w.like_email
              or w.like_url or w.ent_type_ in excluded_ents):
        article.append(w.lemma_)
# the joined text has no trailing new line, so the last document is added here
texts.append(article)

In [None]:
texts[0:100]

As you can see, spaCy has allowed us to remove words by their type so we keep just what we want to analyse. It's also very easy to go back and change the process should you want to include something which you previously excluded. The parsed document itself is never modified, so the approach is non-destructive, which is very handy for working iteratively, something I do a lot in text mining.
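
To make the non-destructive point concrete, here is a small sketch of the filtering and splitting logic on its own. The `Token` tuple is a hypothetical stand-in for a spaCy token carrying only a few attributes, so the example runs without a language model; the names and sample data are assumptions for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, with only the attributes used here
Token = namedtuple('Token', 'text lemma is_stop ent_type')

tokens = [
    Token('The', 'the', True, ''), Token('printers', 'printer', False, ''),
    Token('jammed', 'jam', False, ''), Token('\n', '\n', False, ''),
    Token('John', 'John', False, 'PERSON'), Token('replaced', 'replace', False, ''),
    Token('cables', 'cable', False, ''),
]

def build_texts(tokens, excluded_ents):
    """Split tokens into documents at new lines, keeping lemmas that pass the filter."""
    texts, article = [], []
    for tok in tokens:
        if '\n' in tok.text:
            texts.append(article)
            article = []
        elif not tok.is_stop and tok.ent_type not in excluded_ents:
            article.append(tok.lemma)
    texts.append(article)  # the final document has no trailing new line
    return texts

texts = build_texts(tokens, excluded_ents={'PERSON'})
print(texts)
```

Because the token list is left untouched, calling `build_texts(tokens, excluded_ents=set())` would bring the person names back without re-parsing anything.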

### Gensim - Identifying bigrams and topic modelling

In [None]:
#when playing with the parameters, first re-run the cell that builds "texts" (2 cells above), otherwise you will be
#further processing the already processed "texts" from the previous run of this cell
bigram = gensim.models.Phrases(texts, min_count=3, threshold=0.1)

texts = [bigram[line] for line in texts]

Let's have a look at some of the two-word phrases (bigrams) identified by Gensim's phrase detection.
By analysing the list below and then tweaking the phrase parameters, one can iteratively tune to get the most sensible mix of phrases.
The two words of each phrase are joined by an underscore. A human will read these as a 2 word phrase, but to the modelling in the next step each one is a single token represented by just a single number.

In [None]:
texts[5:20]
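
To get a feel for how `min_count` and `threshold` interact, the sketch below scores word pairs with the formula (pair count - min_count) / (count of first word * count of second word) * vocabulary size, which is my understanding of Gensim's default scorer. It is a simplified re-implementation for illustration, not Gensim's own code, and the vocabulary size here counts single words only, which is an assumption.

```python
from collections import Counter

def score_bigrams(texts, min_count=3, threshold=0.1):
    """Return the adjacent word pairs whose score exceeds the threshold."""
    words = Counter(w for doc in texts for w in doc)
    pairs = Counter(p for doc in texts for p in zip(doc, doc[1:]))
    vocab_size = len(words)
    accepted = {}
    for (a, b), n in pairs.items():
        score = (n - min_count) / (words[a] * words[b]) * vocab_size
        if score > threshold:
            accepted[(a, b)] = score
    return accepted

# Hypothetical toy documents
docs = [['paper', 'jam', 'printer'], ['paper', 'jam', 'tray'],
        ['paper', 'jam', 'again'], ['paper', 'jam', 'fix'],
        ['printer', 'offline']]
accepted = score_bigrams(docs)
print(accepted)
```

Pairs seen no more than `min_count` times can never score above zero, so raising `min_count` removes rare phrases, while raising `threshold` demands that the two words appear together more often than their individual frequencies would suggest.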

Now that we have our raw text reduced down to just the words and phrases which we feel give the documents meaning, there is one more step before we can begin to model the data.
Given that we want to leverage various statistical techniques to help us identify meaningful patterns in our text data, we need to get the text into a usable numeric format.
Here we convert the text into a dictionary and a corpus (explanations below).

In [None]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

The dictionary simply lists each word with its own numerical index.
We can inspect individual words.

In [None]:
dictionary[1]

The corpus holds the document structure as a simple list of coordinates. Each word or phrase in a document is held within (): the first number is the word ID, which corresponds to the dictionary, and the second number is the frequency of that word within the document. All the words for one document are contained within [].

In [None]:
corpus[15]
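
The same two structures can be rebuilt by hand to see what they contain. This is a plain-Python sketch of the idea only; Gensim's `Dictionary` may assign the IDs in a different order, and the toy documents are hypothetical.

```python
from collections import Counter

# Hypothetical toy documents, already tokenised
docs = [['paper_jam', 'printer', 'paper_jam'], ['printer', 'offline']]

# Dictionary: each distinct token gets an integer ID (here in order of first appearance)
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

# Corpus: for each document, a sorted list of (token ID, count) pairs
bow_corpus = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

print(token2id)
print(bow_corpus)
```

Word order inside each document is lost in this representation, which is why it is called a bag of words.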

There are other ways to numerically represent text data, such as sparse matrices, which allow one to do useful analyses such as Term Frequency - Inverse Document Frequency (see here https://en.wikipedia.org/wiki/Tf%E2%80%93idf ).
Since I'm interested in topic modelling here I don't need to do such a thing, and Gensim expects the data in the format we have created.
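
For the curious, a TF-IDF weighting can be computed directly from a corpus in the (word ID, frequency) format. The sketch below uses the textbook form, count * log(number of documents / number of documents containing the word); the toy corpus is hypothetical, and library implementations typically add their own log base and normalisation, so the numbers will not match them exactly.

```python
import math

# Hypothetical toy corpus in (token ID, count) format
bow_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(2, 3)]]
n_docs = len(bow_corpus)

# Document frequency: in how many documents does each token ID appear?
doc_freq = {}
for doc in bow_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

# Weight each count by how rare the token is across the corpus
tfidf = [[(token_id, count * math.log(n_docs / doc_freq[token_id]))
          for token_id, count in doc]
         for doc in bow_corpus]

for doc in tfidf:
    print(doc)
```

Token 0 appears in only one document, so it ends up with a higher weight in the first document than token 1, which is shared with the second.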

### NOW FOR THE TOPIC MODELS
### LDA

Latent Dirichlet Allocation (LDA) is a tried and tested technique for topic modelling. It is a generative probabilistic model that treats each document as a mixture of topics and each topic as a distribution over words, with Dirichlet distributions as the priors on both, which makes it well suited to identifying topics in large volumes of text.
Further reading here: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
and here: https://en.wikipedia.org/wiki/Dirichlet_distribution
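
The generative story behind LDA can be sketched in a few lines: draw a word distribution for every topic, draw a topic mixture for the document, then pick a topic and a word for each position. This is an illustration of the model's assumptions only, not how Gensim fits the model; the vocabulary, the prior value of 0.5 and the helper `sample_dirichlet` are all hypothetical choices.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ['printer', 'paper', 'screen', 'cable']
n_topics = 2

# Each topic is a probability distribution over the vocabulary
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(n_topics)]

# Each document has its own mixture of topics
doc_topic = sample_dirichlet([0.5] * n_topics, rng)

# Generate a document: pick a topic for each position, then a word from that topic
document = []
for _ in range(8):
    topic = rng.choices(range(n_topics), weights=doc_topic)[0]
    word = rng.choices(vocab, weights=topic_word[topic])[0]
    document.append(word)

print(doc_topic)
print(document)
```

Fitting an LDA model runs this story in reverse: given only the documents, it estimates the topic mixtures and word distributions that most plausibly produced them.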

In [None]:
#be sure to check out gensim's other models, I had success with LSI also.
#start high with your number of topics and sense check the results, gradually
#reducing the number until you have fewer but more distinct topics.
ldamodel = LdaModel(corpus=corpus, num_topics=8, id2word=dictionary)

In [None]:
ldamodel.show_topics(8)
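
If you want the word weights as numbers rather than as one long string, they can be pulled apart with a regular expression. The sketch below assumes each topic description follows the `weight*"word" + weight*"word"` pattern; the sample topic is hypothetical, not output from this model.

```python
import re

# Hypothetical (topic ID, description) pair in the style of the output above
topic = (0, '0.041*"printer" + 0.032*"paper_jam" + 0.015*"toner"')

def parse_topic(description):
    """Turn a topic description string into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', description)
    return [(word, float(weight)) for weight, word in pairs]

parsed = parse_topic(topic[1])
print(parsed)
```

Alternatively, calling `show_topics` with `formatted=False` returns the word and weight pairs directly, which avoids the parsing altogether.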

### pyLDAvis - sharing the results

As you can see above, representing the topics with just words and a number showing the contribution of each word to the topic isn't exactly an intuitive way of exploring the result, even for an analyst, let alone for sharing the results back to someone who has the domain knowledge but is most likely not an analytical person.
As with all types of analyses, good visualisation is required to get the most from your results.
Here I use pyLDAvis, which is a tool for graphically representing and interacting with LDA model outputs.
More info here: https://pypi.org/project/pyLDAvis/

In [None]:
pyLDAvis.enable_notebook()
#in pyLDAvis 3+ the equivalent call is pyLDAvis.gensim_models.prepare
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)