<a href="https://colab.research.google.com/github/Sparrow0hawk/ittt-ai-ml-dl/blob/session3-AC/session_3_topicsML/session3_unsup_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src=https://raw.githubusercontent.com/ARCLeeds/arcleeds.github.io/master/assets/img/lighterblueText_wLogo_m2.1.png alt="Research Computing logo" style="width:900px;">

# IT TechTalk - Session 2: Text analysis approaches

## Agenda

- [Introduction](#Introduction)
- [Preprocessing](#Preprocessing)
- [Build a Bag of words corpus](#Building-a-bag-of-words-corpus-and-dictionary)
- [Building a topic model](#Building-the-topic-model)

## Introduction

Text analysis is a classic computational and data science problem.

![NLP](https://deeplearninganalytics.org/wp-content/uploads/2019/04/nlp.png)

Compared with regression and classification approaches on continuous and categorical dataset taking text data and deriving distinct insights is a far more complicated task. Text data and especially free text (text fields in sentence form) is typically classed as a form of unstructured data because of the various nuances introduced by languages.

With the ever increasing computational power has come a side-by-side improvements in approaches to text analysis. 

There are a number of different approaches to text analysis such as sentiment analysis, machine translation, information retrieval and much more. In this talk we'll focus specifically on **topic modelling**. An unsupervised statistical approach for identifying abstract 'topics' from within a collection of documents (corpus).

We'll look specifically at latent Dirichlet allocation (LDA), a topic modelling approach developed in [2002](http://jmlr.csail.mit.edu/papers/v3/blei03a.html). LDA has become one of the most commonly used topic modelling approaches and many extensions of LDA have since been proposed.

We'll use financial complaints data from the [US Consumer Financial Protection Bureau](https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data) for this example.

In [1]:
# line wrapping for colab https://stackoverflow.com/questions/58890109/line-wrapping-in-collaboratory-google-results
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [2]:
!pip install gensim==3.8.0
import numpy as np
import pandas as pd
import gensim
import matplotlib.pyplot as plt

Collecting gensim==3.8.0
[?25l  Downloading https://files.pythonhosted.org/packages/40/3d/89b27573f56abcd1b8c9598b240f53c45a3c79aa0924a24588e99716043b/gensim-3.8.0-cp36-cp36m-manylinux1_x86_64.whl (24.2MB)
[K     |████████████████████████████████| 24.2MB 1.9MB/s 
Installing collected packages: gensim
  Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-3.8.0


We add some shell scripting here to create a data directory and download and unzip our data file.

In [3]:
%%bash 

if [ -d data/ ]; then
    echo "Data directory exists"
else
    mkdir data
fi

if test -f data/complaints.csv; then
    echo "Data file exists"
else 
    curl -LO http://files.consumerfinance.gov/ccdb/complaints.csv.zip; mv complaints.csv.zip data/ ;unzip data/complaints.csv.zip -d data/
fi

Archive:  data/complaints.csv.zip
  inflating: data/complaints.csv     


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   183  100   183    0     0   1236      0 --:--:-- --:--:-- --:--:--  1236
 45  263M   45  119M    0     0   126M      0  0:00:02 --:--:--  0:00:02  126M100  263M  100  263M    0     0   153M      0  0:00:01  0:00:01 --:--:--  186M


Next we load this dataset into a pandas DataFrame python object. This is alot like a spreadsheet and allows for easy manipulation of columns and rows.

In [4]:
# import the dataset
# for demo purposes we'll use a subset of the data 5% of total
ticket_data = pd.read_csv('data/complaints.csv')#.sample(frac=0.2, random_state=42)

ticket_data.dropna(subset=["Consumer complaint narrative"], inplace=True)

print(ticket_data.shape)

ticket_data.head()

(578251, 18)


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,transworld systems inc. \nis trying to collect...,,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,3384392
2,2019-10-25,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,I would like to request the suppression of the...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",CA,937XX,,Consent provided,Web,2019-10-25,Closed with explanation,Yes,,3417821
3,2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"Over the past 2 weeks, I have been receiving e...",,"Diversified Consultants, Inc.",NC,275XX,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,3433198
5,2019-03-05,Mortgage,Conventional home mortgage,Trouble during payment process,,The escrow unit of my mortgage servicing compa...,Company believes complaint represents an oppor...,Ditech Financial LLC,DC,200XX,,Consent provided,Web,2019-03-05,Closed with non-monetary relief,Yes,,3170261
8,2019-09-08,"Money transfer, virtual currency, or money ser...",Domestic (US) money transfer,Fraud or scam,,"I was sold access to an event digitally, of wh...",,"Paypal Holdings, Inc",RI,029XX,,Consent provided,Web,2019-09-08,Closed with explanation,Yes,,3366475


In [5]:
ticket_data.Product.value_counts()

Credit reporting, credit repair services, or other personal consumer reports    193159
Debt collection                                                                 119378
Mortgage                                                                         67553
Credit card or prepaid card                                                      40701
Credit reporting                                                                 31588
Student loan                                                                     26682
Checking or savings account                                                      23487
Credit card                                                                      18838
Bank account or service                                                          14885
Vehicle loan or lease                                                             9913
Money transfer, virtual currency, or money service                                9883
Consumer Loan                              

In [6]:
ticket_data.groupby('Product')['Consumer complaint narrative'].apply(lambda x: np.mean(len(x)))

Product
Bank account or service                                                          14885.0
Checking or savings account                                                      23487.0
Consumer Loan                                                                     9473.0
Credit card                                                                      18838.0
Credit card or prepaid card                                                      40701.0
Credit reporting                                                                 31588.0
Credit reporting, credit repair services, or other personal consumer reports    193159.0
Debt collection                                                                 119378.0
Money transfer, virtual currency, or money service                                9883.0
Money transfers                                                                   1497.0
Mortgage                                                                         67553.0
Other financi

In [7]:
ticket_data = ticket_data[ticket_data['Product'] == 'Credit card']

In [8]:
# lets peak and look what this looks like

ticket_data['Consumer complaint narrative'][:5].tolist()

['I was stupid enough to charge some items at MACY \'S on my Macy \'s credit card over XXXX. I was unable to log into my account online to make the payment because Macy \'s had updated the system and I simply did n\'t have the patience to navigate their irritating and obscure system. However, I called Macy \'s and paid my bill for $ XXXX a day or two late - was assured that there would be no late charges because I was a customer in good standing, yada yada, and received a confirmation number that I had paid the bill in full. \n\nI then received a bill for {$2.00} in interest. OK, I agree - I had n\'t paid the {$140.00} on time, regardless of the reason. Whatever. So I paid that {$2.00} bill immediately, by check, and left the country for an extended absence - in complete confidence that I had paid all my bills in full. \n\nImagine my surprise when I returned to the US and found that Macy \'s had received and cashed my check, but charged me a further {$2.00} anyway as a minimum interest

## Preprocessing

Pre processing is a crucial step in any text analytics project. Text data on its own is very difficult for machines to understand and therefore it requires cleaning and preparing before building models. This often involves a number of steps such as:
- Tokenisation, converting a long string of words into a list of individual words i.e. "the cat sat on the mat" -> ["the", "cat", "sat", "on", "the", "mat"]
- Noise removal, most commonly removing punctuation or things like hyperlinks or emojis
- Stopword removal, removing common words that don't contain information such as the, and, or, a 
- Stemming or lemming, this is the process of reverting words to their root either by chopping off suffixes (stemming) or reverting to word lemma (lemming)
- Normalisation, commonly this means converting all words to lower or uppercase


In [9]:
# lets slice out the text data from our dataframe
subsample_text = ticket_data['Consumer complaint narrative'].tolist()

In [10]:
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_numeric, remove_stopwords, strip_short, stem_text

def basic_preprocess(list_of_strings):
    """
    A basic function that takes a list of strings and runs some basic
    gensim preprocessing to tokenise each string.
    
    Operations:
        - convert to lowercase
        - remove html tags
        - remove punctuation
        - remove numbers
        - remove short tokens (less than 3)
    
    Outputs a list of lists
    """
    
    CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_numeric, remove_stopwords, strip_short]

    preproc_text = [preprocess_string(doc, CUSTOM_FILTERS) for doc in list_of_strings]
    
    return preproc_text

In [11]:
# what are stop words?

from gensim.parsing.preprocessing import STOPWORDS

print(STOPWORDS)

frozenset({'give', 'may', 'everyone', 'nor', 'as', 'across', 'they', 'us', 'mostly', 'these', 'me', 'empty', 'although', 'now', 'alone', 'during', 'sincere', 'anywhere', 'please', 'another', 'been', 'front', 'nothing', 'do', 'eight', 'too', 'against', 'three', 'thru', 'some', 'herself', 'elsewhere', 'if', 'once', 'through', 'four', 'move', 'seeming', 'further', 'for', 'didn', 'herein', 'each', 'my', 'the', 'in', 'describe', 'make', 'amongst', 'least', 'whither', 'very', 'couldnt', 'from', 'their', 'beforehand', 'it', 'interest', 'up', 'hereby', 'since', 'about', 'can', 'which', 'over', 'only', 'here', 'upon', 'whenever', 'becomes', 'whom', 'him', 'an', 'still', 'thereupon', 'does', 'moreover', 'don', 'etc', 'always', 'hereupon', 'again', 'done', 'also', 'perhaps', 'most', 'third', 'often', 'wherever', 'find', 'regarding', 'though', 'forty', 'mine', 'amoungst', 'her', 'this', 'cannot', 'between', 'fire', 'part', 'yourselves', 'bill', 'somehow', 'fifty', 'full', 'sometime', 'when', 'keep

In [12]:
import re

def remove_twitterisms(list_of_strings):
    """
    Some regular expression statements to remove twitter-isms
    
    Operations:
        - remove links
        - remove @tag
        - remove #tag
        
    Returns list of strings with the above removed
    """
    
    # removing some standard twitter-isms

    list_of_strings = [re.sub(r"http\S+", "", doc) for doc in list_of_strings]

    list_of_strings = [re.sub(r"@\S+", "", doc) for doc in list_of_strings]

    list_of_strings = [re.sub(r"#\S+", "", doc) for doc in list_of_strings]
    
    return list_of_strings

In [13]:
# removing emojis
# taken from https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b#gistcomment-3315605

def remove_emoji(string):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # chinese char
                               u"\U00002702-\U000027B0"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # dingbats
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

In [14]:
def remove_redacted(string):
    
    string = [re.sub(r"(x|X){2,}", "", doc) for doc in string]
    
    return string

In [15]:
from gensim.models.phrases import Phrases

def n_gram(tokens):
    """Identifies common two/three word phrases using gensim module."""
    # Add bigrams and trigrams to docs (only ones that appear 10 times or more).
    # includes threshold kwarg (threshold score required by bigram)
    bigram = Phrases(tokens, min_count=10, threshold=100)
    trigram = Phrases(bigram[tokens], threshold = 100)

    for idx, val in enumerate(tokens):
        for token in bigram[tokens[idx]]:
            if '_' in token:
                if token not in tokens[idx]:
                    # Token is a bigram, add to document.bigram
                    tokens[idx].append(token)
        for token in trigram[tokens[idx]]:
            if '_' in token:
                if token not in tokens[idx]:
                    # Token is a trigram, add to document.
                    tokens[idx].append(token)
    return tokens

In [16]:
from nltk.stem import WordNetLemmatizer

def lemmatise(words):
    """
    Convert words to their lemma or root using WordNet lemmatizer
    """
    lemma = WordNetLemmatizer()
    # this function takes a list of lists of tokens
    return [[lemma.lemmatize(token,'v') for token in tokens] for tokens in words]

In [18]:
# next we implement the preprocessing steps

preprocessed_corpus = remove_twitterisms(subsample_text)

preprocessed_corpus = remove_redacted(preprocessed_corpus)

preprocessed_corpus = [remove_emoji(doc) for doc in preprocessed_corpus]

preprocessed_corpus = basic_preprocess(preprocessed_corpus)

# added stemming
preprocessed_corpus = lemmatise(preprocessed_corpus)

LookupError: ignored

In [None]:
# lets compare the original strings to the preprocessed strings
for orig, proc in zip(subsample_text[:5], preprocessed_corpus[:5]):
    
    print(orig)
    print(proc)
    print('\n')

## Building a bag of words corpus and dictionary

In [None]:
import numpy as np
from gensim.corpora import Dictionary

def bag_of_word_processing(corpus_of_tokens, lower_extreme, upper_extreme):
    """
    Take the list of tokens and convert them into a bag-of-words (BoW) format.

    Extended description of function.

    :param list of lists corpus_of_tokens: a list of strings produced during preprocessing representing all documents in corpus
    :param int lower_extreme: Description of arg2.
    :param float upper_extreme: the upper extreme filter limit, words are excluded if they occur in more documents than the proportion specified here
    :return: gensim.corpora.dictionary.Dictionary object 
    :return: list representing BoW corpus
    
    """

    # Create a dictionary representation of the documents.
    # gensim Dictionary function creates tokens -> tokenID dict
    dictionary = Dictionary(corpus_of_tokens)
    print('Number of unique words in initital documents:', len(dictionary))

    org_dict = len(dictionary)

    # Filter out words that occur less than 10 documents, or more than 70% of the documents.
    dictionary.filter_extremes(no_below=lower_extreme, no_above=upper_extreme)
    print('Number of unique words after removing rare and common words:', len(dictionary))

    filt_dict = len(dictionary)

    print('Token reduction of: ' + str((1-filt_dict/org_dict)*100)+'%')

    # transform to bag of words
    corpus = [dictionary.doc2bow(doc) for doc in corpus_of_tokens]
    print('Number of unique tokens: %d' % len(dictionary))
    print('Number of documents: %d' % len(corpus))

    # output on document length
    print('Average number of words per document before BoW transform: ', np.mean([len(item) for item in preprocessed_corpus]))
    print('Average number of words per BoW document: ',np.mean([len(corpus[i]) for i in range(len(corpus))]))

    return dictionary, corpus

In [None]:
working_dict, working_corpus = bag_of_word_processing(preprocessed_corpus, 10, 0.7)

## Building the topic model

We now have processed our initial texts into components that machines can interact with and we're ready to create a topic model!

In [None]:
from gensim.models import CoherenceModel, ldamulticore

def long_topic_scan(dictionary, corpus,  texts, limit, start=2, step=3):
    """
    Identify the topic number with the highest coherence score out of a broad range of numbers
    Adapted from https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts (doc_clean)
    limit : Max num of topics
    start : int starting topic number
    step : int increment from one topic number to another until limit is reached
    Returns:
    coherence_df : pd.DataFrame containing topic number and its calculated coherence score
    -------
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    graphical outputs
    """
    coherence_dict = dict()

    for num_topics in range(start, limit, step):
        model = ldamulticore.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, workers=2)
        coherencemodel1 = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_dict[num_topics] = coherencemodel1.get_coherence()

    coherence_df = pd.DataFrame(pd.Series(coherence_dict)).reset_index()

    coherence_df.columns = ['Num_topics','Coherence_score']

    # Show graph
    fig, ax = plt.subplots(figsize=(12,10))
    ax.plot(coherence_df['Num_topics'], coherence_df['Coherence_score'])
    ax.set_xlabel("No. of topics", fontweight='bold')
    ax.set_ylabel("Cv Coherence score", fontweight='bold')
    ax.axvline(coherence_df[coherence_df['Coherence_score'] == coherence_df['Coherence_score'].max()]['Num_topics'].tolist(), color='red')

    return coherence_df

In [None]:
%%time

coh_model = long_topic_scan(working_dict, working_corpus, preprocessed_corpus, limit=50, step=3, start=2)

In [None]:
coh_model.sort_values(ascending = False, by='Coherence_score').head(1)

In [None]:
# next we build our working model using the topic number we've determined

working_model = ldamulticore.LdaMulticore(corpus=working_corpus, num_topics=5, id2word=working_dict, workers=2)

coherencemodel1 = CoherenceModel(model=working_model, texts=preprocessed_corpus, dictionary=working_dict, coherence='c_v')

In [None]:
coherencemodel1.get_coherence()

In [None]:
# ideas about inspecting the model
working_model.show_topics(-1)

In [None]:
!pip install pyLDAvis

In [None]:
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim.prepare(working_model, working_corpus, dictionary=working_model.id2word)

vis

In [None]:
def format_topics_sentences(ldamodel, corpus, texts):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    # enumerate each topic and return number of topic, row of topic numbers and probabilities
    for row in ldamodel[corpus]:
        # sort row data into descending order
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        # split row value into j (numerated), topic number, topic probability
        # select top numerated (top ranked topic), retrieve topic text and join it altogether
        # combine into pandas dataframe with topic text and probability of topic
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wordprob = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wordprob])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num) + 1,
                                                                  round(prop_topic, 4),
                                                                  topic_keywords]), ignore_index=True)
            else:
                break

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    # add column names
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords', 'Original text']

    return sent_topics_df

In [None]:
topic_df = format_topics_sentences(working_model, working_corpus, ticket_data["Consumer complaint narrative"].tolist())

In [None]:
topic_df

In [None]:
ticket_data.reset_index().join(topic_df).groupby('Product')['Dominant_Topic'].value_counts().unstack().plot.bar(figsize=(16,8))

In [None]:
def get_top3_docs(dominant_topic_frame):

    table_lst = []

    # create dataframe of top 3 most representative docs for each topic
    for i in range(1, int(dominant_topic_frame['Dominant_Topic'].max())):
        # get indexes
        indy = dominant_topic_frame[dominant_topic_frame['Dominant_Topic'] == i].sort_values(by='Perc_Contribution', ascending=False).index.tolist()
        # test how many documents passed
        if len(indy) <= 3:
            for idx in indy:
                table_lst.append(dominant_topic_frame.iloc[idx, :])
        else:
            table_lst.append(dominant_topic_frame.iloc[indy[0], :])
            table_lst.append(dominant_topic_frame.iloc[indy[1], :])
            table_lst.append(dominant_topic_frame.iloc[indy[2], :])

    new_eg_df = pd.DataFrame(table_lst)

    return new_eg_df

In [None]:
top3_df = get_top3_docs(topic_df)

In [None]:
top3_df.head()

In [None]:
for topic in top3_df.Dominant_Topic.unique():
    
    print(f"Topic number {topic}")
    print(top3_df[top3_df.Dominant_Topic == topic].Topic_Keywords.tolist()[0], '\n')
    
    subset_df = top3_df[top3_df.Dominant_Topic == topic]['Original text'].tolist()
    
    for idx, item in enumerate(subset_df):
        print("text ", str(idx))
        print(item, '\n')
    
    print("----------")