* identify the problem
    * cluster feedback from extremely vulnerable people and extract "themes": divide the comments into groups based on some measure of similarity
* represent data using numeric attributes
* use a standard algorithm to find a model
    * manually generate labels for the top n clusters

> All models are wrong, but some are useful


In [None]:
import sys
sys.path.append("..")

import src.utils.regex as regex
import numpy as np
import pandas as pd
import os
import joblib

#language packages
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

import codecs
from sklearn import feature_extraction

# some viz, requires matplotlib
import mpld3
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.manifold import MDS

# takes a while to run
! spacy download en_core_web_lg
import spacy

from nltk.stem.porter import PorterStemmer
STEMMER = PorterStemmer()
from nltk.corpus import stopwords
from collections import Counter

# progress bar
from tqdm import tqdm, tqdm_notebook

# instantiate
tqdm.pandas(tqdm_notebook)

# pandas options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)

In [None]:
data_path = '../data'
feedback_data_path = os.path.join(data_path, 'joined_uis_all_of_march.csv')

Read data in

In [None]:
df = pd.read_csv(feedback_data_path)
print(df.shape)
df.head(1)

Let's filter for users who visit '/coronavirus-extremely-vulnerable' during their journey.
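
A minimal sketch of that filter, assuming each journey is stored as one string of page paths joined by `>>` (as in the `PageSequence` column used later). The helper name `visited_any` and the example journeys are made up for illustration.

```python
def visited_any(page_sequence, slugs):
    """Return True if any page in a '>>'-separated journey is one of the slugs."""
    return any(page in slugs for page in page_sequence.split(">>"))


slugs = {"/coronavirus-extremely-vulnerable", "/done/coronavirus-extremely-vulnerable"}

# two hypothetical journeys: the first passes through the page of interest
journeys = [
    "/>>/coronavirus>>/coronavirus-extremely-vulnerable",
    "/>>/browse/driving",
]

flags = [visited_any(journey, slugs) for journey in journeys]
print(flags)
```

The match is on the exact path, so `/coronavirus` on its own does not count as a visit to the extremely vulnerable page.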

In [None]:
corona_slugs = ['/coronavirus-extremely-vulnerable', '/done/coronavirus-extremely-vulnerable']


A lot of entries contain questions about travel advice, often naming individual countries.
This meant the clusterer was grouping comments by country name, which wasn't ideal.
The same goes for months and other dates, so both kinds of entity are replaced with a generic label (`GPE` or `DATE`).

In [None]:
model = spacy.load('en_core_web_lg')
def remove_common_terms(text):
    doc = model(text)
    for ent in doc.ents:
        if ent.label_ == "GPE" or ent.label_ == "DATE":
            text = text.replace(ent.text, ent.label_)
    return text

# Sanity check
print(remove_common_terms("to find out an update for my holiday in mexico in april"))

Clean the data. There is a lot going on here, explained in the comments. This takes a while to run.
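
One detail worth keeping in mind when standardising terms: Python strings are immutable, so `str.replace` returns a new string rather than changing the original. The sketch below shows the pattern on a toy mapping; `standardise_terms` and `term_map` are hypothetical names used only for this illustration.

```python
def standardise_terms(text, term_map):
    """Replace every key of term_map found in text with its value."""
    for old, new in term_map.items():
        # str.replace returns a new string, so the result must be assigned back
        text = text.replace(old, new)
    return text


# longer phrases come first so "self isolation" is handled before "isolation"
term_map = {
    "self isolation": "quarantine",
    "isolation": "quarantine",
    "sick pay": "ssp",
}

cleaned = standardise_terms("advice on self isolation and sick pay", term_map)
print(cleaned)
```

Dictionaries keep insertion order, so the order of the keys decides which replacement wins when phrases overlap.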

In [None]:
q3 = "Q3"
df['q3_copy'] = df[q3]

# These are terms that are functionally the same but people use different terms, this standardises them
# to be improved..., 
same_terms = {
    "travelling": "travel",
    "travellers": "travel",
    "holiday": "travel",
    "self-isolation": "quarantine",
    "selfisolation": "quarantine",
    "self isolation": "quarantine",
    "isolation": "quarantine",
    "statuatory sick pay": "ssp",
    "sick pay": "ssp",
}

def clean_text(text):
    text = str(text)
    # We'll be removing non alphabetical characters but we want to keep the non emergency phone number 
    # '111' in, so we'll just replace that with text
    text = text.replace("111", "oneoneone")
    # Same for 999
    text = text.replace("999", "nineninenine")
    # Remove non alphabetical or space characters
    text = re.sub(r"[^a-zA-Z\s:]", "", text)
    # Use our function from previous cell
    text = remove_common_terms(text)
    # This is done after remove_common_terms because spacy doesn't 
    # always recognise country names without a capital letter at the beginning!
    text = text.lower()
    text = re.sub(regex.coronavirus_misspellings_and_typos_regex() + "|virus", "", text)
    # People using different terms for "I want to know", so just remove those
    text = re.sub("wanted to find out|to look up about|to get an update|to find infos|to find info|to find out|to understand|to read the|check on advice|to check|ti get advice|to get advice|for information on", "", text)
    for word_to_replace, word_to_replace_with in same_terms.items():
        # str.replace returns a new string, so the result must be assigned back
        text = text.replace(word_to_replace, word_to_replace_with)
    return text

# df[q3] = df[q3].apply(clean_text) # this takes a while, progress_map lets us see progress
df[q3] = df[q3].progress_map(clean_text)

print("The number of rows and columns after cleaning: ", df.shape)
# Remove rows without a page sequence
df = df[df['PageSequence'].notnull()].reset_index(drop=True)
print("The number of rows and columns after dropping nulls for PageSequence: ", df.shape)


In [None]:
corona_related_items_regex = regex.coronavirus_misspellings_and_typos_regex() + '|sick pay|ssp|sick|isolation|closures|quarantine|closure|cobra|cruise|hand|isolat|older people|pandemic|school|social distancing|symptoms|cases|travel|wuhan|care|elderly|care home|carehome'

# We only want to cluster rows that are relevant to corona stuff,
# so we add the column 'has_corona_page'.
# It is true if the user visited one of the corona pages during their journey.
# An earlier version also required a relevant term in the feedback (there was some
# irrelevant stuff about passports). That check is currently commented out below, as people
# may be using terms not in that list and we might miss out on some insights.
for index, row in df.iterrows():
    has_corona_page = False
    # if re.search(corona_related_items_regex, df.at[index, q3]) is not None:
    for slug in row['PageSequence'].split(">>"):
        if slug in corona_slugs:
            has_corona_page = True
    df.at[index, 'has_corona_page'] = has_corona_page
df = df[df['has_corona_page']].reset_index(drop=True)

# Remove duplicate users
df = df.drop_duplicates('intents_clientID')

print(df.shape)
df.head()

# comment / feedback clustering
How can we learn about the underlying structure of feedback in a way that is informative and intuitive? The basic approach is to analyse the latent topics within each comment. This will require a pipeline that tokenises & stems, transforms the tokens into a vector space model (tf-idf), and then clusters them into groups (k-means or hdbscan or something).


* tokenizing and stemming each comment
* transforming the corpus into vector space using tf-idf
* calculating cosine distance between each document as a measure of similarity
* clustering the documents using the k-means algorithm
* using multidimensional scaling to reduce dimensionality within the corpus
* plotting the clustering output using matplotlib and mpld3
* topic modeling using Latent Dirichlet Allocation (LDA)



For the purposes of this walkthrough, imagine that we have 2 primary lists:
* 'Q3': seems to be the most pertinent question,
* 'label': some unique identifier would be helpful, but actually we primarily want something to label each comment with for humans to read.

We might want to extend this to include question 8 later.

# Stopwords, stemming, and tokenizing
This section is focused on defining some functions to manipulate the comments. First, we load NLTK's list of English stop words. Stop words are words like "a", "the", or "in" which don't convey significant meaning.

In [None]:
# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:10])

We can customise our stopwords using a feedback loop informed by running through this process. We retrospectively came back and extended the list as below.
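
One way to feed that loop is to count the most frequent stems and review them by hand. This is only a sketch on made-up tokens; `candidate_stopwords` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from collections import Counter


def candidate_stopwords(token_lists, top_n=3):
    """Return the top_n most frequent tokens across all tokenised comments."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return [token for token, _ in counts.most_common(top_n)]


# three made-up, already stemmed comments
token_lists = [
    ["would", "like", "food", "deliveri"],
    ["would", "need", "food", "parcel"],
    ["would", "regist", "vulner", "person"],
]

candidates = candidate_stopwords(token_lists, top_n=2)
print(candidates)
```

Frequent does not mean uninformative ("food" is clearly meaningful here), so a human still decides which candidates go into the stopword list.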

In [None]:
# through experimenting we found some additional ones to add
new_stopwords = ["'d", "'s", 'abov', 'ani', 'becaus', 'befor', 'could', 'doe', 'dure', 'might', 'onc', 'onli', 'sha', 'whi', 'wo', 'would']
stopwords.extend(new_stopwords)

Next we import the Snowball Stemmer which is actually part of NLTK. Stemming is just the process of breaking a word down into its root.

In [None]:
# load nltk's SnowballStemmer as the variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")


Below we define two functions:

* tokenize_and_stem: tokenizes (splits the comment into a list of its respective words, or tokens) and also stems each token
* tokenize_only: tokenizes the comment only

We use both these functions to create a dictionary, which becomes important if we want to use stems for an algorithm but later convert stems back to their full words for presentation purposes.

In [None]:

# here we define a tokenizer and stemmer which returns the list of stems in the text that it is passed

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [None]:
df.iloc[0:10]['Q3']

Below we try the stemming/tokenizing and tokenizing-only functions on the first ten comments, before applying them to the whole column.



In [None]:
# Apply tokenize_and_stem to the first ten comments as a quick check
df.iloc[0:10]['Q3'].apply(tokenize_and_stem)

In [None]:
df.iloc[0:10]['Q3'].apply(tokenize_only)

In [None]:
# let's apply to the dataframe
df['Q3_tokenized_and_stemmed'] = df['Q3'].progress_map(tokenize_and_stem)

df['Q3_tokenized_only'] = df['Q3'].progress_map(tokenize_only)


Above we applied the stemming/tokenizing and tokenizing-only functions to the column of comments. Next we create two vocabularies: one stemmed and one only tokenized. This is a slightly different approach: we take the data out of the pandas DataFrame and build two flat lists.



In [None]:
#not super pythonic
#use extend so it's a big flat list of vocab
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in df['Q3'].to_list():
    allwords_stemmed = tokenize_and_stem(i) #for each item in 'comments list', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

Using these two lists, we can create a pandas DataFrame with the stemmed vocabulary as the index and the tokenized words as the column. The benefit of this is that it provides an efficient way to look up a stem and return a full token. The downside is that stems to tokens are one to many: the stem 'run' could be associated with 'run', 'runs', 'running', etc. For our purposes this is fine: we are happy to return the first token associated with the stem we need to look up, as it's only for labelling purposes.
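
The same "first token wins" lookup can be sketched with a plain dictionary. This is an illustrative alternative rather than what the notebook does; `build_stem_lookup` and the toy lists are assumptions.

```python
def build_stem_lookup(stems, words):
    """Map each stem to the first full word that produced it."""
    lookup = {}
    for stem, word in zip(stems, words):
        # setdefault only stores the word if the stem has not been seen yet
        lookup.setdefault(stem, word)
    return lookup


# toy parallel lists, in the same order the tokens were read
stems = ["run", "run", "deliveri", "run"]
words = ["running", "runs", "delivery", "run"]

stem_lookup = build_stem_lookup(stems, words)
print(stem_lookup["run"])
```

A dictionary holds one entry per stem, whereas the DataFrame keeps one row per token occurrence, so the dictionary is the smaller structure if only the first match is ever needed.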

In [None]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

There are only 51326 items in the DataFrame, so looking up a full word from the stem index isn't a huge overhead.



In [None]:
print(vocab_frame.head(10))


# Tf-idf and document similarity

Here, we define term frequency-inverse document frequency (tf-idf) vectorizer parameters and then convert the comments list into a tf-idf matrix.

To get a tf-idf matrix, first count word occurrences by document (comment). This gives a document-term matrix (dtm), also called a term frequency matrix.

Then apply the term frequency-inverse document frequency weighting: words that occur frequently within a document but not frequently within the corpus receive a higher weighting as these words are assumed to contain more meaning in relation to the document.
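
To make the weighting concrete, here is a small worked sketch using the textbook formula tf × ln(N / df) on made-up stemmed comments. It is not what scikit-learn computes exactly: by default `TfidfVectorizer` uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and L2-normalises each row, so the numbers will differ. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document (textbook tf-idf)."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


docs = [
    ["food", "deliveri", "slot"],
    ["food", "parcel"],
    ["regist", "vulner", "person", "food"],
]

weights = tf_idf(docs)
print(weights[0])
```

"food" appears in every comment, so its weight is zero, while "deliveri" and "slot" appear in only one comment and are weighted up.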

A couple things to note about the parameters we define below:

* `max_df`: the maximum document frequency a feature can have to be used in the tf-idf matrix. Here we pass 0.85; if a term is in more than 85% of the documents it probably carries little meaning.
* `min_df`: this can be an integer (e.g. 5), in which case the term has to be in at least 5 of the documents to be considered. Here we pass 0.05, so the term must be in at least 5% of the documents.
* `ngram_range`: (1, 3) means we'll look at unigrams, bigrams and trigrams.
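
The effect of the two document-frequency thresholds can be sketched in plain Python. This only mimics the float (proportion) form of `min_df` and `max_df` on unigrams; `filter_by_document_frequency` is a hypothetical helper and the comments are made up.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=0.05, max_df=0.85):
    """Keep terms whose share of documents lies between min_df and max_df."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(
        term for term, df_count in doc_freq.items()
        if min_df <= df_count / n_docs <= max_df
    )


docs = [
    ["food", "deliveri"],
    ["food", "parcel"],
    ["food", "deliveri", "slot"],
    ["food", "vulner"],
]

# thresholds chosen to suit the tiny example, not the values used in the notebook
kept = filter_by_document_frequency(docs, min_df=0.5, max_df=0.85)
print(kept)
```

"food" is dropped for being in every comment, and the terms that appear in only one of four comments fall below `min_df`.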

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, max_features=200000,
                                 min_df=0.05, stop_words=stopwords,
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

# we convert the pandas df to a list for consumption
%time tfidf_matrix = tfidf_vectorizer.fit_transform(df['Q3'].to_list()) #fit the vectorizer to all comments

print(tfidf_matrix.shape)

`terms` is just a list of the features used in the tf-idf matrix, i.e. the vocabulary.



In [None]:
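# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
terms = list(tfidf_vectorizer.get_feature_names_out())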
print(terms)

`dist` is defined as 1 minus the cosine similarity between each pair of documents. Cosine similarity is computed on the tf-idf matrix and gives a measure of similarity between each comment and every other comment in the corpus. Subtracting it from 1 gives the cosine distance, which we use later for plotting on a two-dimensional Euclidean plane.

Note that with `dist` it is possible to look up the similarity of any two comments.
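
For intuition, cosine distance between two vectors can be written out by hand. The sketch below assumes non-zero vectors and uses made-up values; `cosine_distance` is a hypothetical helper, not the scikit-learn function.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


doc_a = [1.0, 1.0, 0.0]
doc_b = [2.0, 2.0, 0.0]  # same direction as doc_a, just longer
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

print(cosine_distance(doc_a, doc_b))
print(cosine_distance(doc_a, doc_c))
```

Comments that use the same terms in the same proportions have a distance of (almost exactly) zero whatever their length, and comments with no terms in common have a distance of one.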

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

# K-means clustering
Using the tf-idf matrix, we can run a slew of clustering algorithms to better understand the hidden structure within the comments. We first chose k-means, as it's a good baseline. K-means is initialized with a pre-determined number of clusters (this could also be considered helpful: "What are the top n clusters people are commenting on?"). Each observation is assigned to a cluster (cluster assignment) so as to minimize the within-cluster sum of squares. Next, the mean of the clustered observations is calculated and used as the new cluster centroid. Then observations are reassigned to clusters and centroids recalculated, in an iterative process, until the algorithm reaches convergence.
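
The assign / update loop described above can be sketched in a few lines of plain Python. This is a bare-bones illustration with hand-picked starting centroids and made-up 2-D points; scikit-learn's `KMeans` adds smarter initialisation (k-means++ by default) and other refinements. All function names here are hypothetical.

```python
def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each point."""
    return [
        min(range(len(centroids)),
            key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
        for point in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # leave an empty cluster's centroid where it is
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate update and assignment until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (6.0, 5.0)])
print(labels, centroids)
```

With different starting centroids the result can differ, which is one reason to fix a random seed when comparing runs.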



In [None]:

from sklearn.cluster import KMeans

num_clusters = 3

km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

We can use `joblib.dump` to pickle the model once it has converged, and `joblib.load` to reload it later and reassign the labels as the clusters.


In [None]:
joblib.dump(km,  '../models/doc_cluster.pkl')

km = joblib.load('../models/doc_cluster.pkl')
clusters = km.labels_.tolist()

In [None]:
# should be the same, where the cluster corresponds to our comment of interest
print(df.shape)
print(len(clusters))

In [None]:
# create associated variable
df['cluster'] = clusters

Here is some fancy indexing and sorting on each cluster to identify the top n terms with the largest weights in the cluster centroid. This gives a good sense of the main topic of the cluster.

First we remind ourselves of the structure of the terms found in the comments. `vocab_frame` maps each stem to an associated full word (an unstemmed form with the same root).
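
The ranking step itself is simple: sort the positions of a centroid by weight, largest first, and read off the matching terms. A sketch on a made-up centroid, where `top_terms` is a hypothetical helper:

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest weights in a cluster centroid."""
    ranked = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in ranked[:n]]


# made-up vocabulary and centroid weights, one weight per term
terms = ["deliveri", "food", "regist", "slot", "vulner"]
centroid = [0.40, 0.10, 0.00, 0.35, 0.05]

best = top_terms(centroid, terms, n=2)
print(best)
```

This mirrors what `argsort()[:, ::-1]` does for every centroid at once on the NumPy array of cluster centres.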


In [None]:
# testing .loc 
ind = 4
print("We are interested in this stem found in the comments:")
print()
print(terms[ind])
print()
print("We are curious about the full words in the comments that share this stem")
print()
print(vocab_frame.loc[terms[ind].split(' ')])

In [None]:
# no .encode() here: in Python 3 it would print a bytes literal such as b'...'
print(vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0], end=',')

In [None]:
df.head(2)

In [None]:
print("Top terms per cluster:")
print()
# for each cluster centroid, sort the term indices by weight, largest first
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        # look up the first full word recorded for the first stem of this term
        print(' %s' % vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0], end=',')
    print() #add whitespace
    print() #add whitespace
    
    print("Cluster %d Q3:" % i, end='')
    # show a few comments assigned to this cluster (5 is an arbitrary choice)
    for comment in df.loc[df['cluster'] == i, 'Q3'].head(5):
        print(' %s;' % comment, end='')
    print() #add whitespace
    print() #add whitespace
    
print()
print()

Let's get a better idea of some of the exemplar comments in these clusters. That will help us manually generate a sensible theme or label. We should also look at the proportion of comments in each cluster. We can afford to do this manually for specific subsets of journeys; in this case we are focused on extremely vulnerable people.
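
The proportions can be worked out from the list of cluster labels alone. A sketch on made-up labels, with `cluster_proportions` as a hypothetical helper (in pandas, `value_counts(normalize=True)` gives the same information):

```python
from collections import Counter


def cluster_proportions(labels):
    """Share of comments falling in each cluster, keyed by cluster number."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: count / total for cluster, count in sorted(counts.items())}


# made-up cluster assignments for eight comments
labels = [0, 2, 1, 0, 0, 2, 1, 0]

proportions = cluster_proportions(labels)
print(proportions)
```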

In [None]:
for i in range(num_clusters):
    cluster_examples = df[df.cluster.eq(i)].Q3_tokenized_only.head(5).to_numpy()
    cluster_count = df.cluster.value_counts()[i]
    print(f"Some example comments from Cluster {i}: \n\n{cluster_examples}\n\n There are {cluster_count} of these examples. \n\n")

In [None]:
# easier to read
print(df.cluster.value_counts())
df.cluster.value_counts().plot.bar()


Our SQL query on the GA data returned some interesting features for the sessions associated with the comments. Let's look at them.


In [None]:
# summary stats to compare characteristics of groups

clusterino = df.groupby("cluster")

# Summary statistic of all clusters
(
    clusterino[['dayofweek',
                'total_seconds_in_session_across_days',
                'total_pageviews_in_session_across_days',
                'guidance_count',
                'done_page_flag']]
    .describe()
    .head()
)

We note that sessions in cluster 0 spent more time on GOV.UK than those in the other clusters, and had a higher mean `guidance_count`. This suggests users in this cluster spent some time and effort looking for content or information that may or may not exist. Digging through this cluster might provide insight into content that is lacking.

# Wordclouds
Wordclouds can be a useful way to summarise a cluster and the comments therein. A word cloud shows which words are the most frequent in the given text. Before using any function, it is worth checking its docstring to see all required and optional arguments: type `?function` and run it.

Ideally we would use the same stopwords from earlier and maybe the tokenized and stemmed comments.


In [None]:
text = " ".join(comment for comment in df[df.cluster == 0].Q3)


In [None]:
for i in range(num_clusters):
    text = " ".join(comment for comment in df[df.cluster == i].Q3)
    comments_n = len(df[df.cluster == i].Q3)

    
    # start from wordcloud's built-in STOPWORDS; each extra stopword must be its own string
    stopwords_wc = set(STOPWORDS)
    stopwords_wc.update(["I'm", "im", "I am", "i'm"])


    # Create and generate a word cloud image:
    # pass our stopwords, lower max_font_size, change the maximum number of words and lighten the background:
    wordcloud = WordCloud(stopwords=stopwords_wc, max_font_size=50, max_words=20, background_color="white").generate(text)

    # Display the generated image:
    print(f"Cluster {i}: {comments_n} comments in this cluster. \n")
    # split on whitespace to count words (len(text) would count characters)
    print ("There are {} words in the combination of all comments in this cluster. \n".format(len(text.split())))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    # Save the image in the img folder:
    # wordcloud.to_file("img/blah.png")
    print("\n")

These word clouds provide useful summaries of the cluster themes, in line with the themes a human picked out earlier in the doc. There is some repetition of terms; we should work out how to use our stopwords list from earlier and the tokenized and stemmed comments.


# Multidimensional scaling

In [None]:
# use two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we will also specify `random_state` so the plot is reproducible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

pos = mds.fit_transform(dist)  # shape (n_samples, n_components)

xs, ys = pos[:, 0], pos[:, 1]
print()
print()

# Visualizing document clusters
In this section, we demonstrate how you can visualize the document clustering output using matplotlib.

First we define some dictionaries for going from cluster number to color and to cluster name. We based the cluster names on the words that were closest to each cluster centroid (see the earlier code chunk), so this step could be automated.
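
A rough sketch of what that automation might look like, assuming the top terms per cluster have already been collected into lists. The names `name_cluster` and `top_terms_per_cluster` are hypothetical, and the terms are taken from the hand-written names used below.

```python
def name_cluster(top_terms):
    """Join a cluster's top terms into a readable label, e.g. 'Food, slot'."""
    return ", ".join(top_terms).capitalize()


# top terms per cluster, as they might come out of the centroid ranking
top_terms_per_cluster = {
    0: ["information", "help", "date"],
    1: ["vulnerable", "person", "register"],
}

cluster_names_auto = {k: name_cluster(v) for k, v in top_terms_per_cluster.items()}
print(cluster_names_auto)
```

Labels built this way are only as good as the top terms, so a quick human check of each name is still worthwhile.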

In [None]:
len(clusters)

In [None]:
#set up colors per clusters using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3'}

#set up cluster names using a dict
cluster_names = {0: 'Information, help, date', 
                 1: 'Vulnerable, person, register', 
                 2: 'Delivery, shops, slot'}

In [None]:
#some ipython magic to show the matplotlib plots inline
%matplotlib inline 

#create data frame that has the result of the MDS plus the cluster numbers and titles
df_mds = pd.DataFrame(dict(x=xs, y=ys, label=clusters)) 

#group by cluster
groups = df_mds.groupby('label')


# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling

#iterate through groups to layer the plot
#note that we use the cluster_names and cluster_colors dicts with the 'name' lookup to return the appropriate color/label
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, 
            label=cluster_names[name], color=cluster_colors[name], 
            mec='none')

# axis settings only need applying once, so they sit outside the loop
# recent matplotlib versions expect booleans here rather than 'off' strings
ax.set_aspect('auto')
ax.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom=False,       # ticks along the bottom edge are off
    top=False,          # ticks along the top edge are off
    labelbottom=False)  # labels along the bottom edge are off
ax.tick_params(
    axis='y',           # changes apply to the y-axis
    which='both',       # both major and minor ticks are affected
    left=False,         # ticks along the left edge are off
    right=False,        # ticks along the right edge are off
    labelleft=False)    # labels along the left edge are off
    
ax.legend(numpoints=1)  #show legend with only 1 point

#add label in x,y position with the label as the ... nothing relevant to label comments with
# for i in range(len(df_mds)):
#     ax.text(df_mds.iloc[i]['x'], df.iloc[i]['y'], size=8)  

    
    
plt.show() #show the plot

#uncomment the below to save the plot if need be
#plt.savefig('clusters_small_noaxes.png', dpi=200)

In [None]:
plt.close()

There is some overlap, and it looks like a non-linear boundary would be better, so we could try t-SNE (and different algorithms in place of k-means). The overlap could also be an artefact of comments being made up of several sentences which describe different problems, so it might make more sense to split comments into sentences prior to clustering. The experimental unit would then be the sentence within a comment.

This could be handled by LDA. LDA is a probabilistic topic model that assumes documents are a mixture of topics and that each word in the document is attributable to the document's topics. 
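
To illustrate the "mixture of topics" idea, the sketch below computes the word probabilities for one comment as a weighted blend of two topics. All numbers and topic names are made up for the example; a real LDA model infers both the topic mixes and the per-topic word distributions from the data. `word_probabilities` is a hypothetical helper.

```python
def word_probabilities(topic_mix, topic_word_probs):
    """p(word | comment) = sum over topics of p(topic | comment) * p(word | topic)."""
    vocab = sorted({w for probs in topic_word_probs.values() for w in probs})
    return {
        w: sum(weight * topic_word_probs[topic].get(w, 0.0)
               for topic, weight in topic_mix.items())
        for w in vocab
    }


# made-up per-topic word distributions (each sums to 1)
topic_word_probs = {
    "food": {"delivery": 0.5, "slot": 0.5},
    "registration": {"register": 0.75, "vulnerable": 0.25},
}

# a comment that is half about food and half about registration
topic_mix = {"food": 0.5, "registration": 0.5}

probs = word_probabilities(topic_mix, topic_word_probs)
print(probs)
```

Unlike k-means, which puts each comment in exactly one cluster, this lets a single comment belong partly to several themes.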

# LDA
This section focuses on using Latent Dirichlet Allocation (LDA) to learn yet more about the hidden structure within the comments from users on the extremely vulnerable online form.

This notebook is quite long, so let's continue in a new one called `feedback_topic_modelling_extremely_vulnerable_people`