## Text Analysis Tutorial

Hello there - we'll be following this Jupyter Notebook for the tutorial. 
The purpose of this tutorial is to walk you through the different parts of the text analysis pipeline: getting hold of our data, cleaning and annotating it, and going all the way to swapping verbs in sentences and evaluating topic models.

We will explore our textual data in breadth rather than in depth, to give a taste of the different kinds of analysis we can do.

Our first step, naturally, is setting up our imports. We will be using spaCy for data pre-processing and computational linguistics, gensim for topic modelling, scikit-learn for classification, and Keras for text generation.
We will also use numpy and matplotlib for other parts of the tutorial.

### Imports

In [1]:
import gensim
import numpy as np
import spacy
from spacy import displacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import matplotlib.pyplot as plt
import sklearn

In [2]:
import warnings
import os
warnings.filterwarnings('ignore')  # Let's not pay heed to them right now
%matplotlib inline

## Gathering Data

A huge part of text analysis is data collection. One of the initial goals of the tutorial was to walk the user through cleaning messy Twitter data or scraping data off the internet. While this remains an integral part of text analysis, a one-and-a-half-hour tutorial cannot do justice to both data collection and data analysis, so we will use two popular, readily available data-sets instead.

Keep in mind that the main difference between using a standardised data-set and scraping your own data off the internet is that internet data is largely unstructured; this means a lot of time goes into organising it into a form that is easy to pre-process. The datasets we will be working with are the Lee corpus, which is a shortened version of the [Lee Background Corpus](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF), and the [20NG dataset](http://qwone.com/~jason/20Newsgroups/). We will perform different tasks with these two datasets, and will talk a little more about each when we come across it.

Let us now get started with loading our first data-set, the Lee corpus, which we load using Gensim.

In [3]:
# Locate the test data that ships with gensim and read the Lee corpus (one document per line)
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
with open(lee_train_file) as f:
    text = f.read()

## Cleaning Data

It's often said of Machine Learning and NLP algorithms: garbage in, garbage out. We can't get state-of-the-art results without data of matching quality. Let's spend this section cleaning and understanding our data set.
NLTK is a popular choice for pre-processing, but it is rather [outdated](https://explosion.ai/blog/dead-code-should-be-buried), so we will be checking out spaCy, an industry-grade text-processing package.

spaCy uses language models similar to the one we just downloaded before starting this tutorial.

In [4]:
nlp = spacy.load("en")  # "en" is a spaCy v2 shortcut; spaCy v3+ needs the full model name, e.g. "en_core_web_sm"

For good measure, let's add some stopwords. It's a newspaper corpus, so we are likely to come across variations of 'said' and 'Mister', which will not really add any value to the topic models.


In [5]:
my_stop_words = [u'say', u'\'s', u'mr', u'be', u'said', u'says', u'saying', 'today']
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

In [6]:
doc = nlp(text.lower())

Voila! With the `English` pipeline, all the heavy lifting has been done. Let's see what went on under the hood.

In [7]:
doc

hundreds of people have been forced to vacate their homes in the southern highlands of new south wales as strong winds today pushed a huge bushfire towards the town of hill top. a new blaze near goulburn, south-west of sydney, has forced the closure of the hume highway. at about 4:00pm aedt, a marked deterioration in the weather as a storm cell moved east across the blue mountains forced authorities to make a decision to evacuate people from homes in outlying streets at hill top in the new south wales southern highlands. an estimated 500 residents have left their homes for nearby mittagong. the new south wales rural fire service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around hill top are optimistic of defending all properties. as more than 100 blazes burn on new year's eve in new south wales, fire crews have been called to new fire at gunning, south of goulburn. while few details are available at this

## Computational Linguistics

Okay - now that we have our doc object, what exactly can we do with it?
We can see that the doc object now contains the entire corpus. This is important because we will be using this doc object to create our corpus for the machine learning algorithms. When creating a corpus for gensim/scikit-learn, we sometimes forget the incredible power which spaCy packs in its pipeline, so we will briefly demonstrate the same in this section with a smaller example sentence. Keep in mind that whatever we can do with a sentence, we can also just as well do with the entire corpus.

In [8]:
sent = nlp(u"Tom went to IKEA to get some of those delicious Swedish meatballs.")

Simple enough sentence, right? When we pass any kind of text through the spaCy pipeline, it becomes annotated. We will quickly have a look at three of the most important capabilities which spaCy provides - POS-tagging, NER-tagging, and dependency parsing.

#### POS-Tagging

In [9]:
for token in sent:
    print(token.text, token.pos_, token.tag_)

Tom PROPN NNP
went VERB VBD
to ADP IN
IKEA PROPN NNP
to PART TO
get VERB VB
some DET DT
of ADP IN
those DET DT
delicious ADJ JJ
Swedish ADJ JJ
meatballs NOUN NNS
. PUNCT .


#### NER-Tagging

In [10]:
for token in sent:
    print(token.text, token.ent_type_)

Tom PERSON
went 
to 
IKEA ORG
to 
get 
some 
of 
those 
delicious 
Swedish NORP
meatballs 
. 


In [11]:
for ent in sent.ents:
    print(ent.text, ent.label_)

Tom PERSON
IKEA ORG
Swedish NORP


In [12]:
displacy.render(sent, style='ent', jupyter=True)

#### Dependency Parsing

In [13]:
for chunk in sent.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)


Tom Tom nsubj went
IKEA IKEA pobj to
those delicious Swedish meatballs meatballs pobj of


In [14]:
for token in sent:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])


Tom nsubj went VERB []
went ROOT went VERB [Tom, to, get, .]
to prep went VERB [IKEA]
IKEA pobj to ADP []
to aux get VERB []
get advcl went VERB [to, some]
some dobj get VERB [of]
of prep some DET [meatballs]
those det meatballs NOUN []
delicious amod meatballs NOUN []
Swedish amod meatballs NOUN []
meatballs pobj of ADP [those, delicious, Swedish]
. punct went VERB []


In [15]:
displacy.render(sent, style='dep', jupyter=True)

This is just an example of the kind of annotations spaCy adds when it runs any text through its pipeline. We will see in the very next section that spaCy has a bunch of other information as well, such as whether a token is a number or not, stop-word or not, and other information which comes in very handy when pre-processing text. 
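
To make those token-level flags concrete, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`), which only tokenizes and so needs no downloaded model; the example sentence and the variable names are made up for illustration.

```python
import spacy

# A blank pipeline has only a tokenizer, but lexical attributes such as
# is_stop, is_punct and like_num are still available on every token.
nlp_blank = spacy.blank("en")
sample = nlp_blank("Tom bought 3 meatballs, and ten more.")

for token in sample:
    print(token.text, token.is_stop, token.is_punct, token.like_num)

# Collect the tokens that look like numbers, whether in digits or in words,
# and the tokens that are punctuation marks.
numbers = [token.text for token in sample if token.like_num]
punctuation = [token.text for token in sample if token.is_punct]
```

These are the same attributes the cleaning loop relies on to decide which tokens to keep.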

## Continuing Cleaning

Have a quick look at the output of the doc object. It seems like nothing, right? But spaCy's internal data structure has done all the work for us. Let's see how we can create our corpus. You can check out what a gensim corpus looks like [here](https://radimrehurek.com/gensim/tut1.html).

In [16]:
# build one list of lemmas per article
texts, article = [], []
for w in doc:
    # keep the token only if it is not a newline, stop word, punctuation mark or number
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I':
        # we add the lemmatized version of the word
        article.append(w.lemma_)
    # if it's a new line, it means we're onto our next document
    if w.text == '\n':
        texts.append(article)
        article = []

In [17]:
texts

[['hundred',
  'people',
  'force',
  'vacate',
  'home',
  'southern',
  'highland',
  'new',
  'south',
  'wale',
  'strong',
  'wind',
  'push',
  'huge',
  'bushfire',
  'town',
  'hill',
  'new',
  'blaze',
  'near',
  'goulburn',
  'south',
  'west',
  'sydney',
  'force',
  'closure',
  'hume',
  'highway',
  '4:00pm',
  'aedt',
  'marked',
  'deterioration',
  'weather',
  'storm',
  'cell',
  'move',
  'east',
  'blue',
  'mountain',
  'force',
  'authority',
  'decision',
  'evacuate',
  'people',
  'home',
  'outlying',
  'street',
  'hill',
  'new',
  'south',
  'wale',
  'southern',
  'highland',
  'estimate',
  'resident',
  'leave',
  'home',
  'nearby',
  'mittagong',
  'new',
  'south',
  'wal',
  'rural',
  'fire',
  'service',
  'weather',
  'condition',
  'cause',
  'fire',
  'burn',
  'finger',
  'formation',
  'ease',
  'fire',
  'unit',
  'hill',
  'optimistic',
  'defend',
  'property',
  'blaze',
  'burn',
  'new',
  'year',
  'eve',
  'new',
  'south',
  'wale

And this is the magic of spaCy - just like that, we've managed to get rid of stopwords, punctuation marks and numbers, and kept the lemmatized form of each word.

Sometimes topic models make more sense when 'New' and 'York' are treated as 'New_York' - we can do this by creating a bigram model and modifying our corpus accordingly.

In [18]:
bigram = gensim.models.Phrases(texts)

In [None]:
texts = [bigram[line] for line in texts]

In [None]:
texts[0]

In [None]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
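
If the bag-of-words format is new to you, the following plain-Python sketch mimics what `Dictionary` and `doc2bow` produce: each document becomes a list of `(token_id, count)` pairs. It is an illustration only, not gensim's implementation; the helper names `build_vocab` and `to_bow` are hypothetical, and the ids gensim assigns to the same tokens may differ.

```python
from collections import Counter


def build_vocab(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


toy_docs = [["fire", "burn", "fire"], ["storm", "fire"]]
toy_vocab = build_vocab(toy_docs)
toy_bow_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_bow_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Word order is discarded in this representation; only which words occur, and how often, is passed on to the topic model.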

## Topic Modelling

Topic Modelling refers to the probabilistic modelling of text documents as topics. Gensim remains the most popular library to perform such modelling, and we will be using it to perform our topic modelling. 

LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modelling algorithm out there. Here we create a simple topic model with 10 topics. This is where the corpus we created earlier will come in handy.

In [None]:
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

This is a great way to get a view of what words end up appearing in our documents, and what kind of document topics might be present. For more details, such as the other topic models which Gensim provides, as well as ways to measure topic coherence (performance), and visualisation, the topic modelling notebook in the same directory will serve as a good resource.

## Text Classification

In the previous example, we worked with unlabelled, unstructured data. Classification is a machine learning task which is quite different from the previous examples because we are dealing with labelled data, and we know what classes we want to put our documents into - we are not discovering topics or classes.

For such an example, we would need to use a labelled data-set, and in our case we will be using the previously mentioned 20NG dataset.

In [19]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

In [20]:
data_train = fetch_20newsgroups(subset='train', categories=categories,
                             shuffle=True, random_state=42)
n_components = 5
labels = data_train.target
true_k = np.unique(labels).shape[0]

# convert to TF-IDF format
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english', use_idf=True)
X_train = vectorizer.fit_transform(data_train.data)

# Reduce dimensions
svd = TruncatedSVD(n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

X_train = lsa.fit_transform(X_train)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
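
To see what the vectorise-then-reduce step produces without downloading anything, here is a small sketch on made-up sentences. It assumes two SVD components purely because the toy vocabulary is tiny; all variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

sample_docs = [
    "the rocket launch reached orbit",
    "the shuttle orbit was stable after launch",
    "render the image with a graphics card",
    "the graphics driver can render polygons",
]

# TF-IDF turns each document into a sparse vector of weighted word counts.
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_tfidf = toy_vectorizer.fit_transform(sample_docs)

# TruncatedSVD applied to TF-IDF vectors is latent semantic analysis (LSA);
# the Normalizer then rescales every row to unit length.
toy_lsa = make_pipeline(TruncatedSVD(n_components=2, random_state=0),
                        Normalizer(copy=False))
toy_reduced = toy_lsa.fit_transform(toy_tfidf)

print(toy_tfidf.shape, "->", toy_reduced.shape)
print(np.linalg.norm(toy_reduced, axis=1))  # row lengths after normalisation
```

The reduced matrix is dense, which matters later because `GaussianNB` does not accept sparse input.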


In [None]:
# order of labels in `target_names` can be different from `categories`
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42)

target_names = data_train.target_names
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

print("Extracting features from the test data using the same vectorizer")
X_test = vectorizer.transform(data_test.data)
X_test = lsa.transform(X_test)  # reuse the SVD fitted on the training data; never refit on the test set

Take a minute to note the pre-processing steps we used above - it is less transparent than our method with spaCy, but it is still important to know and to be able to use the scikit-learn modules for the same. 
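
One way to keep the training and test transformations consistent is to wrap every step in a single scikit-learn `Pipeline`, so that `fit` learns the vocabulary and SVD components from the training texts only and `predict` reuses them. The sketch below assumes a handful of made-up sentences and labels and a linear SVM; it is not the setup used in this notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC

# Hypothetical training data, for illustration only.
train_texts = [
    "the rocket launch reached orbit",
    "the shuttle orbit was stable after launch",
    "astronauts boarded the shuttle before launch",
    "render the image with a graphics card",
    "the graphics driver can render polygons",
    "the image looks sharp on this graphics card",
]
train_labels = ["space", "space", "space", "graphics", "graphics", "graphics"]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ("norm", Normalizer(copy=False)),
    ("clf", SVC(kernel="linear")),
])

# fit() learns the vocabulary, SVD components and classifier from training data only.
text_clf.fit(train_texts, train_labels)

# predict() applies the already-fitted transforms to unseen text.
predictions = text_clf.predict(["a new graphics card", "the launch into orbit"])
print(predictions)
```

Because the pipeline owns every fitted step, there is no separate test-time call where a transform could be refitted by accident.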

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
gnb = GaussianNB()
y_pred_NB = gnb.fit(X_train, y_train).predict(X_test)

In [None]:
y_pred_NB

In [None]:
from sklearn.svm import SVC

In [None]:
svm = SVC()
y_pred_SVM = svm.fit(X_train, y_train).predict(X_test) 
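
Predictions are only useful once they are compared with the true labels. The sketch below assumes synthetic data from `make_classification` in place of the reduced 20NG features, so the numbers it prints say nothing about the newsgroup task; it only shows the evaluation calls.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the reduced feature matrices (an assumption).
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

scores = {}
for name, model in [("GaussianNB", GaussianNB()), ("SVC", SVC())]:
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    # Accuracy is the fraction of test samples whose predicted label is correct.
    scores[name] = accuracy_score(y_te, y_hat)
    print(name, "accuracy:", round(scores[name], 3))
    # Per-class precision, recall and F1.
    print(classification_report(y_te, y_hat))
```

In the notebook itself, the equivalent calls would be `accuracy_score(y_test, y_pred_NB)` and `accuracy_score(y_test, y_pred_SVM)`.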