# Session 3: Text analysis approaches

\#\#\# __DRAFT__ \#\#\#

Text analysis is a classic computational and data science problem.

Compared with regression and classification on continuous and categorical datasets, deriving distinct insights from text data is a far more complicated task. Text data, and especially free text (text fields in sentence form), is typically classed as unstructured data because of the nuances that natural language introduces.

Improvements in approaches to text analysis have come hand in hand with ever-increasing computational power.

The idea of topic modelling, identifying abstract 'topics' within a collection of documents (a corpus) using statistical models, was first described in 1998, with probabilistic latent semantic analysis (PLSA) outlined in 1999 and latent Dirichlet allocation (LDA) developed in [2002](http://jmlr.csail.mit.edu/papers/v3/blei03a.html). LDA has since become one of the most commonly used topic modelling approaches, although many extensions of it have been proposed.
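
As a brief sketch of what LDA assumes (the smoothed form as it is commonly presented, in standard notation rather than anything taken from this session's code): each document $d$ has its own mixture over topics $\theta_d$, each topic $k$ has its own distribution over words $\phi_k$, and the $n$-th word of document $d$ is generated by first drawing a topic $z_{d,n}$ and then drawing a word $w_{d,n}$ from that topic. The hyperparameters $\alpha$ and $\beta$ control how concentrated those distributions are.

$$
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
\phi_k \sim \mathrm{Dirichlet}(\beta), \qquad
z_{d,n} \sim \mathrm{Categorical}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Categorical}(\phi_{z_{d,n}})
$$

Fitting the model means inferring $\theta$ and $\phi$ from the observed words alone.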

In [1]:
import numpy as np
import pandas as pd
import gensim
import matplotlib.pyplot as plt

In [2]:
# import the dataset

twitter_data = pd.read_csv('data/twcs.csv')

twitter_data.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


In [3]:
# create a sub dataframe for just data that was inbound
inbound_dat = twitter_data[twitter_data.inbound == True]

inbound_dat.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0
6,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610.0,
8,12,115713,True,Tue Oct 31 22:04:47 +0000 2017,@sprintcare You gonna magically change your co...,111314.0,15.0


In [4]:
inbound_dat.author_id.unique().shape

(702669,)

In [5]:
# join together all inbound tweets from the same user

user_tweets = inbound_dat.groupby('author_id')['text'].apply(lambda x: ','.join(x))

In [6]:
user_tweets = user_tweets.reset_index()

user_tweets.text[:10].tolist()

['Screw you @116016 and your stupid Blueprint program. I never signed up for this crap and now you’re going to charge me interest fees? https://t.co/WwBzUIhSbG,@ChaseSupport Actually it just doesn’t work in Safari, but that’s still pretty bad.,Dear @ChaseSupport, it’s kinda hard to pay my bills when the entire payment section of your site is unavailable 🤦🏻\u200d♀️',
 "Now the flight @Delta is sending our bag back on just got delayed two hours. So mad right now, I can't even.",
 '@MOO Big thanks to Quentin for the exceptional service! Just ordered our 3rd round of #businesscards 👍,The ribotRainbow! New #businesscards thanks to @moo 😊#rainbow #ourteamrocks https://t.co/nqMMUqYzKt https://t.co/gVtJDEoGFu',
 'Yup https://t.co/GpkFa9MfHQ,same. https://t.co/gxkJt8BNV6',
 '@comcastcares Is it possible to get business class internet at a residence, and if so are there any restrictions/limitations?',
 '@Delta I just sent you a DM,@Delta I will never fly your airline again',
 'Wow. Used to think

## Preprocessing

Preprocessing is a crucial step in any text analytics project. Raw text is very difficult for machines to work with, so it needs cleaning and preparing before building models. This often involves a number of steps such as:
- Tokenisation: converting a long string of words into a list of individual words, e.g. "the cat sat on the mat" -> ["the", "cat", "sat", "on", "the", "mat"]
- Noise removal: most commonly removing punctuation, hyperlinks or emojis
- Stopword removal: removing common words that carry little information, such as "the", "and", "or", "a"
- Stemming or lemmatisation: reducing words to a root form, either by chopping off suffixes (stemming) or by mapping each word to its dictionary form, the lemma (lemmatisation)
- Normalisation: commonly this means converting all words to lowercase or uppercase
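
The steps listed above can be sketched with the standard library alone. This is a toy illustration, not the pipeline used in this session: `TOY_STOPWORDS`, `toy_stem` and `toy_preprocess` are hypothetical names, the stopword list is deliberately tiny, and the suffix stripping is far cruder than a real stemmer.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer)
TOY_STOPWORDS = {"the", "on", "a", "and", "or"}

# Translation table that maps every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def toy_stem(word):
    """Very crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_preprocess(text):
    text = text.lower()                                      # normalisation
    text = re.sub(r"http\S+", " ", text)                     # noise removal: links
    text = text.translate(PUNCT_TO_SPACE)                    # noise removal: punctuation
    tokens = text.split()                                    # tokenisation
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]   # stopword removal
    return [toy_stem(t) for t in tokens]                     # stemming


print(toy_preprocess("The cats sat on the mat! https://t.co/abc"))
# ['cat', 'sat', 'mat']
```

The order matters: links are removed before punctuation so that the URL is still recognisable as one token when the regular expression runs.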


In [7]:
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_numeric, remove_stopwords

def basic_preprocess(list_of_strings):
    """
    A basic function that takes a list of strings and runs some basic
    gensim preprocessing to tokenise each string.
    
    Operations:
        - convert to lowercase
        - remove html tags
        - remove punctuation
        - remove numbers
        - remove stopwords
    
    Outputs a list of lists of tokens
    """
    
    CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_numeric, remove_stopwords]

    preproc_text = [preprocess_string(doc, CUSTOM_FILTERS) for doc in list_of_strings]
    
    return preproc_text

In [8]:
# what are stop words?

from gensim.parsing.preprocessing import STOPWORDS

print(STOPWORDS)

frozenset({'anyway', 'seems', 'unless', 'de', 'am', 'however', 'himself', 'others', 'off', 'put', 'through', 'last', 'yet', 'only', 'across', 'serious', 'mostly', 'during', 'becomes', 'is', 'before', 'three', 'i', 'then', 'yourselves', 'did', 'hundred', 'until', 'say', 'more', 'at', 'don', 'per', 'empty', 'already', 'further', 'under', 'kg', 'cant', 'behind', 'whence', 'moreover', 'do', 'if', 'became', 'anything', 'where', 'up', 'but', 'when', 'several', 'via', 'down', 'though', 'below', 'while', 'against', 'than', 'even', 'elsewhere', 'alone', 'together', 'regarding', 'she', 'whatever', 'a', 'thereafter', 'with', 'are', 'either', 'its', 'take', 'may', 'also', 'you', 'else', 'amount', 'will', 'none', 'inc', 'front', 'now', 'fifteen', 'toward', 'seem', 'top', 'doing', 'towards', 'beyond', 'part', 'own', 'latter', 'whereafter', 'anywhere', 'name', 'must', 'namely', 'themselves', 'mine', 'whether', 'of', 'should', 'sixty', 'noone', 'used', 'bottom', 'nevertheless', 'here', 'itself', 'real

In [9]:
import re

def remove_twitterisms(list_of_strings):
    """
    Some regular expression statements to remove twitter-isms
    
    Operations:
        - remove links
        - remove @tag
        - remove #tag
        
    Returns list of strings with the above removed
    """
    
    # removing some standard twitter-isms

    list_of_strings = [re.sub(r"http\S+", "", doc) for doc in list_of_strings]

    list_of_strings = [re.sub(r"@\S+", "", doc) for doc in list_of_strings]

    list_of_strings = [re.sub(r"#\S+", "", doc) for doc in list_of_strings]
    
    return list_of_strings
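
To see what these three substitutions do, here is a small standalone sketch on made-up strings (`strip_links_mentions_hashtags` is a hypothetical single-string version of the function above). One thing it shows: `\S+` consumes every non-whitespace character, so when tweets are joined with `','` and no space, as in the `groupby` cell earlier, a trailing link also swallows the first word of the next tweet.

```python
import re


def strip_links_mentions_hashtags(text):
    """Apply the same three substitutions as remove_twitterisms to one string."""
    text = re.sub(r"http\S+", "", text)   # links
    text = re.sub(r"@\S+", "", text)      # @mentions
    text = re.sub(r"#\S+", "", text)      # hashtags
    return text


# Made-up examples, not rows from the dataset
spaced = "Thanks @moo! https://t.co/abc Great #service"
joined = "Yup https://t.co/abc,Hello there"

print(strip_links_mentions_hashtags(spaced).split())
# ['Thanks', 'Great']
print(strip_links_mentions_hashtags(joined).split())
# ['Yup', 'there']  ("Hello" is lost along with the link)
```

Joining each user's tweets with a space instead of a bare comma would avoid losing that word, at the cost of changing the joined strings shown in the earlier output.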

In [10]:
# removing emojis
# taken from https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b#gistcomment-3315605

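# Compiled once rather than on every call
EMOJI_PATTERN = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows
                           "\U00002702-\U000027B0"  # dingbats
                           "\U000024C2-\U0001F251"  # enclosed characters; NB this wide range also covers CJK and other scripts
                           "\U0001f926-\U0001f937"  # people gestures (face palm to shrug)
                           "\U00010000-\U0010ffff"  # all supplementary planes
                           "\u2640-\u2642"  # female and male signs
                           "\u2600-\u2B55"  # miscellaneous symbols up to U+2B55
                           "\u200d"  # zero-width joiner
                           "\u23cf"  # eject symbol
                           "\u23e9"  # fast-forward symbol
                           "\u231a"  # watch
                           "\ufe0f"  # variation selector-16
                           "\u3030"  # wavy dash
                           "]+", flags=re.UNICODE)

def remove_emoji(string):
    """Remove emoji and related symbol characters from a string."""
    return EMOJI_PATTERN.sub(r'', string)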
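
The range `\U000024C2-\U0001F251` in the pattern above spans far more than emoji, including the CJK blocks, so ordinary text in those scripts is removed as well. That is probably harmless for an English-language corpus, but if it matters, a narrower pattern is one option. The sketch below is an assumption about which blocks are worth covering, not a complete emoji list, and `remove_emoji_narrow` is a hypothetical name.

```python
import re

# Narrower pattern: an illustrative selection of blocks, not an exhaustive one
NARROW_EMOJI = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs (includes skin-tone modifiers)
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d\ufe0f"           # zero-width joiner and variation selector-16
    "]+"
)


def remove_emoji_narrow(text):
    """Remove characters in the blocks listed above and leave everything else."""
    return NARROW_EMOJI.sub("", text)


print(remove_emoji_narrow("great service \U0001F44D"))   # thumbs-up removed
print(remove_emoji_narrow("\u8b1d\u8b1d \U0001F60A"))    # CJK text kept, smiley removed
```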

In [11]:
from gensim.models.phrases import Phrases

def n_gram(tokens):
    """Identifies common two/three word phrases using gensim module.

    Detected phrases (joined with '_') are appended to each document in place.
    """
    # Bigram model: pairs must appear 10 times or more and score above threshold.
    bigram = Phrases(tokens, min_count=10, threshold=100)
    # Trigram model: trained on the bigrammed corpus; min_count is left at the gensim default.
    trigram = Phrases(bigram[tokens], threshold=100)

    for doc in tokens:
        # Apply the bigram model first so that the trigram model sees the same
        # kind of input it was trained on (documents with bigrams already joined).
        bigram_doc = list(bigram[doc])
        for token in bigram_doc + list(trigram[bigram_doc]):
            if '_' in token and token not in doc:
                # Token is a bigram or trigram, add to document.
                doc.append(token)
    return tokens
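
The `threshold` argument is compared against a score computed for each pair of adjacent words. As a rough illustration of the idea, the sketch below implements a score in the style of Mikolov et al. (2013), which gensim's default scorer is documented as following. It is a simplified stand-in under stated assumptions, not gensim's code: `phrase_scores` is a hypothetical name and `vocab_size` here is simply the number of distinct words.

```python
from collections import Counter


def phrase_scores(docs, min_count=1):
    """Score adjacent pairs: (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # assumption: number of distinct words
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
        if n >= min_count
    }


# Made-up toy corpus
docs = [
    ["customer", "service", "is", "slow"],
    ["customer", "service", "never", "replies"],
    ["the", "service", "is", "fine"],
    ["it", "is", "fine"],
]
scores = phrase_scores(docs, min_count=1)

print(scores[("customer", "service")])  # 1.5
print(scores[("service", "is")])        # 1.0
```

A pair scores highly when its words appear together much more often than their individual frequencies would suggest, which is why "customer service" beats "service is" here even though both pairs occur twice.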

In [12]:
import random

def test_basic_preprocess():
    
    # pick one of the three test documents at random
    randomize = random.randrange(0,3)
    
    test_doc = [
        "THE CAT SAT ON THE MAT!",
        "<h1> hello world?</h1>",
        "wElL WeLl We?L, wHaT <strong>HaVe wE</strong>hERe"
    ]
    
    test_output = basic_preprocess(test_doc)
    
    # test if func returns list of lists
    assert isinstance(test_output[randomize], list)
    # test if all tokens are lowercase
    assert all(x.islower() for x in test_output[randomize])
    # test if html tags and punctuation are removed
    assert test_output[1] == ["hello", "world"]

# run the test; no output means every assertion passed
test_basic_preprocess()

In [13]:
# next we implement the preprocessing steps

preprocessed_corpus = remove_twitterisms(user_tweets.text.tolist())

preprocessed_corpus = [remove_emoji(doc) for doc in preprocessed_corpus]

preprocessed_corpus = basic_preprocess(preprocessed_corpus)

preprocessed_corpus = n_gram(preprocessed_corpus)

In [14]:
# let's compare the original strings to the preprocessed strings
for orig, proc in zip(user_tweets.text.tolist()[:10], preprocessed_corpus[:10]):
    
    print(orig)
    print(proc)
    print('\n')

Screw you @116016 and your stupid Blueprint program. I never signed up for this crap and now you’re going to charge me interest fees? https://t.co/WwBzUIhSbG,@ChaseSupport Actually it just doesn’t work in Safari, but that’s still pretty bad.,Dear @ChaseSupport, it’s kinda hard to pay my bills when the entire payment section of your site is unavailable 🤦🏻‍♀️
['screw', 'stupid', 'blueprint', 'program', 'signed', 'crap', 'you’re', 'going', 'charge', 'fees', 'actually', 'doesn’t', 'work', 'safari', 'that’s', 'pretty', 'bad', 'dear', 'it’s', 'kinda', 'hard', 'pay', 'bills', 'entire', 'payment', 'section', 'site', 'unavailable']


Now the flight @Delta is sending our bag back on just got delayed two hours. So mad right now, I can't even.
['flight', 'sending', 'bag', 'got', 'delayed', 'hours', 'mad', 'right', 't']


@MOO Big thanks to Quentin for the exceptional service! Just ordered our 3rd round of #businesscards 👍,The ribotRainbow! New #businesscards thanks to @moo 😊#rainbow #ourteamrocks 

## Building a bag of words corpus and dictionary
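
A bag-of-words corpus represents each document as counts of token ids, discarding word order. In practice gensim provides `corpora.Dictionary` and its `doc2bow` method for this. The standard-library sketch below only illustrates the data structures involved: `build_dictionary` and `to_bow` are hypothetical names and the two documents are made up.

```python
from collections import Counter


def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert one tokenised document to sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Made-up tokenised documents
docs = [["flight", "delayed", "bag", "delayed"], ["bag", "lost"]]

token2id = build_dictionary(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]

print(token2id)    # {'flight': 0, 'delayed': 1, 'bag': 2, 'lost': 3}
print(bow_corpus)  # [[(0, 1), (1, 2), (2, 1)], [(2, 1), (3, 1)]]
```

Tokens that are not in the dictionary are skipped, so the same mapping can be applied to documents that were not used to build it.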

## Building the topic model