<h1><center> $\color{darkslateblue}{\text{Natural Language Processing}}$</center></h1>

<h2><center> $\color{darkslateblue}{\text{Theoretical Framework of NLP and its Applications in News Headlines with the LDA Model}}$</center></h2>

---

<h3><center> $\color{darkslateblue}{\text{Antonio Romero Martínez-Eiroa and Beatriz Quevedo Gómez}}$  </center></h3> 
<h4><center> $\color{darkslateblue}{\text{Machine Learning, January 2021}}$  </center></h4>

<center> <img src='../data/robot.jpg' > </center>

Natural Language Processing (NLP) is a field of study focused on making sense of language using statistics and computers. NLP applications include chatbots, translation and sentiment analysis, among many others. The aim of this paper is to explain and exemplify how it operates.

To do so, the steps that will be followed are: 

1. Understanding the main concepts
     * Regular expressions
     * Tokenization
     * Bag-of-words
2. Simple text preprocessing
3. Gensim and word vectors
     * How to create a gensim dictionary
     * TF-IDF with gensim
4. Named Entity Recognition
     * SpaCy
5. Supervised learning with NLP
     * Steps
6. Implementation of NLP to news headlines
     * Data pre-processing
     * Bag-of-Words of the data set
     * TF-IDF
     * LDA
     * Performance evaluation by classifying sample document using LDA bag-of-words and TF-IDF models
     * Testing model on unseen document
7. Conclusions
8. References

***

<h3> $\color{darkslateblue}{\text{1. Understanding the main concepts}}$</h3>

<h4> $\color{darkslateblue}{\text{Regular expressions}}$</h4>

**Regular expressions** (regex) are strings with a special syntax that allow you to match patterns in other strings. They are an excellent tool for text analysis or NLP. Some applications of regular expressions are, for example, finding all web links in a document, parsing email addresses, or removing or replacing unwanted characters.



Regex can be used easily with Python through the **re** library. With `re.match` you can match a pattern against the beginning of a string. It takes the pattern as the first argument and the string as the second argument, and returns a match object:

In [1]:
import re
re.match('abc', 'abcdef')

<re.Match object; span=(0, 3), match='abc'>

You can also use special patterns that regex understands, like the `\w+`, which will match a word.

In [2]:
word_regex = r'\w+'  # raw string, so the backslash reaches the regex engine unchanged
re.match(word_regex, 'hi there!')

<re.Match object; span=(0, 2), match='hi'>

Common regex patterns are:
* `\w+`, which matches a word
* `\d`, which matches a digit
* `\s`, which matches a whitespace character
* `.*`, a wildcard that matches any run of letters or symbols (useful in usernames, for example)
* `+` or `*`, which make a pattern greedy, grabbing repeats of the preceding element
* `[a-z]`, which matches a single lowercase letter; `[a-z]+` matches a lowercase group (for instance 'abcdefg')


Note that while `\w` only matches a single word character (a letter, digit or underscore), adding `+` after it matches a whole word. 

Additionally, the uppercase versions of these patterns, such as `\S`, do the opposite of the lowercase ones; `\S`, for instance, matches anything that is not whitespace. 

Python's re module can not only match an entire string or substring based on a pattern with `match`, but it can also split a string on a regex (`split`), find all occurrences of a pattern in a string (`findall`) or search for a pattern (`search`). 

`search` does not require the pattern to match at the beginning of the string. To observe the difference between `re.search()` and `re.match()`:

When you use `match` and `search` with the same pattern and string, and the pattern is at the beginning of the string, you obtain identical matches:

In [3]:
re.match('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

In [4]:
re.search('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

When you use `search` for a pattern that appears later in the string, you get a result, but with `match` you do not. This is because `match` only tries to match from the beginning of the string, so `re.match('cd', 'abcde')` returns `None` and the cell shows no output.

In [5]:
re.match('cd', 'abcde')

In [6]:
re.search('cd', 'abcde')

<re.Match object; span=(2, 4), match='cd'>

`re.split()` can be used for **tokenization**, so you can preprocess text with regex when doing NLP.

In [7]:
re.split(r'\s+', 'Split on spaces')

['Split', 'on', 'spaces']
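As a small sketch of how the functions and character classes above fit together, the following cell applies `findall`, `split` and `search` to one made-up sentence (the sentence and variable names are illustrative assumptions, not part of the data used later):

```python
import re

text = "Order 66 was issued at 10:45, then 3 ships left."

# \d+ grabs runs of digits
digits = re.findall(r"\d+", text)

# \S+ (uppercase) matches runs of non-whitespace, so punctuation stays attached
chunks = re.findall(r"\S+", text)

# re.split cuts the string wherever the pattern matches (a comma plus any spaces)
parts = re.split(r",\s*", text)

# re.search scans the whole string and returns the first match object (or None)
first_number = re.search(r"\d+", text)

print(digits)                # ['66', '10', '45', '3']
print(chunks[:3])            # ['Order', '66', 'was']
print(parts)                 # ['Order 66 was issued at 10:45', 'then 3 ships left.']
print(first_number.group())  # 66
```

Note that `\S+` keeps `10:45,` as a single chunk, which is one reason dedicated tokenizers are usually preferred over plain whitespace splitting.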

<h4> $\color{darkslateblue}{\text{Tokenization}}$</h4>

**Tokenization** is the process of transforming a string or document into tokens (smaller chunks). This is usually one step in the process of preparing a text for NLP.

There are many different rules and theories regarding tokenization, and you can create your own rules using regular expressions, but normally tokenization does things like breaking out words or sentences, separating punctuation, or even tokenizing only parts of a string (like separating all hashtags in a tweet).

The `nltk` library is a natural language toolkit and is often used in tokenization:

In [8]:
import nltk
from nltk.tokenize import word_tokenize

word_tokenize('Hi there!')

['Hi', 'there', '!']

Why tokenize? It can help with simple text processing tasks, like mapping parts of speech, matching common words, removing unwanted tokens, etc. 

For example, when *I don't like Sam's shoes.* is tokenized ("I", "do", "n't", "like", "Sam", "'s", "shoes", "."), you can clearly see the negation in the `n't` and possession in the `'s`. These indicators can help to determine meaning from simple texts.

Beyond just tokenizing words, `nltk` has other tokenizers that include:
* `sent_tokenize`, which tokenizes a document into individual sentences
* `regexp_tokenize`, which tokenizes a string or document based on a regular expression pattern
* `TweetTokenizer`, a special class just for tweet tokenization that allows you to separate hashtags, mentions and lots of exclamation points!!!!
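A minimal sketch of two of these tokenizers on a made-up tweet follows (the tweet text and the handle `@some_teacher` are invented for illustration). `sent_tokenize` is left out only because it needs additional NLTK data to be downloaded first.

```python
from nltk.tokenize import regexp_tokenize, TweetTokenizer

tweet = "Loving the new #nlp course by @some_teacher!!! #python"

# regexp_tokenize keeps only the substrings that match the pattern (here: hashtags)
hashtags = regexp_tokenize(tweet, r"#\w+")

# TweetTokenizer is designed to keep hashtags and mentions as single tokens
tweet_tokens = TweetTokenizer().tokenize(tweet)

print(hashtags)
print(tweet_tokens)
```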

OR is represented using `|`, and you can define a group using `()` or an explicit character range using `[]`.

In [9]:
match_digits_and_words = r'(\d+|\w+)'
re.findall(match_digits_and_words, 'He has 11 cats.')

['He', 'has', '11', 'cats']

Some regex ranges and groups are:

* `[A-Za-z]+` : upper and lowercase English alphabet
* `[0-9]`: numbers from 0 to 9
* `[A-Za-z\-\.]+` : upper and lowercase English alphabet, - and . (useful for website URLs, for example)
* `(a-z)` : the literal sequence a, - and z (inside a group, `-` does not define a range)
* `(\s+|,)` : spaces or a comma

In [10]:
my_str = 'match lowecase spaces numbers like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)

<re.Match object; span=(0, 37), match='match lowecase spaces numbers like 12'>
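The ranges and groups listed above can be tried on a short made-up line, as in the sketch below. The example sentence is an assumption for illustration, and the non-capturing form `(?:...)` is introduced here (it is not used elsewhere in this paper) so that `re.split` drops the separators instead of returning them.

```python
import re

line = "Visit docs.python.org or www.nltk.org, then e-mail 2 friends"

# [A-Za-z\-\.]+ : letters, hyphens and dots, so a domain name stays in one piece
names = re.findall(r"[A-Za-z\-\.]+", line)

# spaces OR a comma, repeated, used as the separator for splitting
pieces = re.split(r"(?:\s+|,)+", line)

# [0-9] matches exactly one digit
digits = re.findall(r"[0-9]", line)

print(names)   # ['Visit', 'docs.python.org', 'or', 'www.nltk.org', 'then', 'e-mail', 'friends']
print(pieces)  # ['Visit', 'docs.python.org', 'or', 'www.nltk.org', 'then', 'e-mail', '2', 'friends']
print(digits)  # ['2']
```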

<h4> $\color{darkslateblue}{\text{Bag-of-words}}$</h4>

**Bag of words** is a very simple and basic method for finding topics in a text. For bag of words, you need to first create tokens using tokenization, and then count up all the tokens you have. 

The theory is that the more frequent a word or token is, the more central or important it might be to the text. Bag of words can be a great way to determine the significant words in a text based on the number of times they are used.

In [11]:
from nltk.tokenize import word_tokenize
from collections import Counter
counter = Counter(word_tokenize("""The cat is in the box. The cat likes the box. The box is over the cat."""))
counter

Counter({'The': 3,
         'cat': 3,
         'is': 2,
         'in': 1,
         'the': 3,
         'box': 3,
         '.': 3,
         'likes': 1,
         'over': 1})

In [12]:
counter.most_common(2)

[('The', 3), ('cat', 3)]

Notice that the word *the* appears twice in the bag of words, once capitalized and once in lowercase. A **preprocessing step**, presented in the following section, can handle this issue by lowercasing all of the words in the text so that each word is counted only once.

***

<h3> $\color{darkslateblue}{\text{2. Simple text preprocessing}}$</h3>

Text preprocessing helps produce better input data for machine learning or other statistical methods, as shown above. Preprocessing steps like tokenization or lowercasing words are commonly used in NLP. 

Other common techniques are **lemmatization or stemming**, where words are shortened to their root stems, and **removing stop words**, which are common words in a language that don't carry a lot of meaning (such as *and* or *the*), as well as removing punctuation or unwanted tokens. 

Since each model and process will have different results, the best option is to try a few different preprocessing approaches and see which one works best for your task and objective.

In [13]:
import nltk

text = """The cat is in the box. The cat likes the box. 
        The box is over the cat."""

- First we transform the text to lowercase using the string `lower` method. 

    The string `isalpha` method returns *True* if the string has only alphabetical characters. We use this method along with an **if** 
    statement iterating over our tokenized result to only return 
    **alphabetic strings** (this will effectively strip tokens with numbers 
    or punctuation). 
    
    Read out in plain English, the code takes each token 
    from the `word_tokenize` output of the lowercase 
    text if it contains only alphabetical characters. 

In [14]:
tokens = [w for w in word_tokenize(text.lower())
          if w.isalpha()]

- In the next cell, we use another list comprehension to **remove** words that are in the stopwords list. This stopwords list for English comes built in with the NLTK library.

In [15]:
from nltk.corpus import stopwords


no_stops = [t for t in tokens
            if t not in stopwords.words('english')]

Counter(no_stops).most_common(2)

[('cat', 3), ('box', 3)]

As shown in the example, preprocessing has already improved our bag of words and made it more useful by removing the stopwords and non-alphabetic words.

***

<h3> $\color{darkslateblue}{\text{3. Gensim and word vectors}}$</h3>


**Gensim** is a popular open-source natural language processing library. It uses top academic models to perform complex tasks like building document or word vectors, corpora (a corpus –or if plural, corpora– is a set of texts used to help implement NLP tasks) and performing topic identification and document comparisons.

A **word embedding** or **vector** is trained from a larger corpus and is a multi-dimensional representation of a word or document. 

With these vectors, we can then see relationships among the words or documents based on how near or far they are and also what similar comparisons we find. 

<center> <img src='../data/wordvector.png' > </center>

For example, in this graphic you can see that the vector operation *king* minus *queen* is approximately equal to *man* minus *woman*. Or that *Spain* is to *Madrid* as *Italy* is to *Rome*. 

The deep learning algorithm used to create word vectors has been able to distill this meaning based on how those words are used throughout the text.
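To make "near or far" concrete, here is a minimal sketch using hand-made three-dimensional vectors. The numbers are invented purely for illustration; real embeddings are learned from a corpus and have many more dimensions.

```python
import math

# Made-up vectors, chosen by hand so that the analogy works out;
# they are NOT the output of any trained model.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

# The offset king - queen points in the same direction as man - woman
royal_offset = subtract(vectors["king"], vectors["queen"])
gender_offset = subtract(vectors["man"], vectors["woman"])
print(cosine_similarity(royal_offset, gender_offset))

# "king" is closer to "man" than to "woman" in this toy space
print(cosine_similarity(vectors["king"], vectors["man"]))
print(cosine_similarity(vectors["king"], vectors["woman"]))
```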

Another example of what Gensim can do is the graphic below, an example of **LDA visualization**. LDA stands for **Latent Dirichlet Allocation**, and it is a statistical model that can be applied to text using Gensim for topic analysis and modelling. 

This graph is just a portion of a blog post written in 2015 using Gensim to analyze US presidential addresses. The article is really neat and you can find the link [here](http://tlfvincent.github.io/2015/10/23/presidential-speech-topics/)

<center> <img src='../data/gensim.png' > </center>

<h4> $\color{darkslateblue}{\text{How to create a gensim dictionary}}$</h4>

Gensim allows you to build **corpora** and **dictionaries** using simple classes and functions. 

In the example below the documents are a list of strings that look like movie reviews about space or sci-fi films. 

In [16]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

my_documents = ['The movie was about a spaceship and aliens.',
                'I really liked the movie!',
                'Awesome action scenes, but boring characters.',
                'The movie was awful! I hate alien films.',
                'Space is cool! I liked the movie.',
                'More space films, please!',]

- First it is important to do some basic preprocessing. In this case and for brevity we will only tokenize and lowercase, although for better results it would be convenient to apply more, such as removing punctuation and stop words. 

In [17]:
tokenized_docs = [word_tokenize(doc.lower()) 
                  for doc in my_documents]

- Then we can pass the tokenized documents to the **Gensim Dictionary** class. This will create a mapping with an id for each token. This is the beginning of our corpus. 

In [18]:
dictionary = Dictionary(tokenized_docs)


- We now can represent whole documents using just a list of their token ids and how often those tokens appear in each document. We can take a look at the tokens and their ids by looking at the `token2id` attribute, which is a dictionary of all of our tokens and their respective ids in our new dictionary.

In [19]:
dictionary.token2id

{'.': 0,
 'a': 1,
 'about': 2,
 'aliens': 3,
 'and': 4,
 'movie': 5,
 'spaceship': 6,
 'the': 7,
 'was': 8,
 '!': 9,
 'i': 10,
 'liked': 11,
 'really': 12,
 ',': 13,
 'action': 14,
 'awesome': 15,
 'boring': 16,
 'but': 17,
 'characters': 18,
 'scenes': 19,
 'alien': 20,
 'awful': 21,
 'films': 22,
 'hate': 23,
 'cool': 24,
 'is': 25,
 'space': 26,
 'more': 27,
 'please': 28}

- Using this dictionary, we can then create a **Gensim corpus**, different from a normal corpus, which is just a collection of documents.

    Gensim uses a **simple bag-of-words model** which transforms each document into a bag of words using the token ids and the frequency of each token in the document. In the example we can see that the Gensim corpus is a list of lists: each list item representing one document and each document a series of tuples: 
    * The first item represents the **tokenid** from the **dictionary** 
    * The second item represents the token **frequency** in the **document**

In [20]:
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
 [(0, 1),
  (5, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1)],
 [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)],
 [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)]]

In so doing, we have a new bag-of-words model and corpus thanks to Gensim. And unlike our previous Counter-based bag of words, this Gensim model can be **easily saved**, **updated** and **reused** thanks to the extra tools we have available in Gensim. 

Our dictionary can also be updated with new texts, or filtered to keep only words that meet particular thresholds. We are building a more advanced and feature-rich bag-of-words model which can then be used for future exercises.
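To make the (token id, frequency) tuples more tangible, the following pure-Python sketch mimics the idea behind `token2id` and `doc2bow` on two tiny made-up documents. It is not Gensim's implementation, and the ids it assigns will generally differ from the ones Gensim produces.

```python
from collections import Counter

docs = [["space", "is", "cool"], ["more", "space", "films"]]

# Give each new token the next free integer id, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs, ignoring tokens without an id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)    # {'space': 0, 'is': 1, 'cool': 2, 'more': 3, 'films': 4}
print(bow_corpus)  # [[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1)]]

# A repeated token raises the count; the unknown token 'aliens' is skipped
print(to_bow(["space", "space", "aliens"]))  # [(0, 2)]
```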

<h4> $\color{darkslateblue}{\text{Tf-idf with gensim}}$</h4>

**Tf-idf**, which stands for *term frequency - inverse document frequency*, is a commonly used NLP model that helps you determine the most important words in each document in the corpus. 

The idea behind *tf-idf* is that each corpus might have more shared words than just stopwords. These common words are like stopwords and should be removed or at least down-weighted in importance. 

For example, if I am an astronomer, *sky* might be used often but is not important, so I want to down-weight that word. Tf-idf does precisely that. It will take texts that share common language and ensure the most common words across the entire corpus don't show up as keywords. 

**Tf-idf helps keep the document-specific frequent words weighted high and the common words across the entire corpus weighted low.**

The equation to calculate the weights can be outlined like so:

<center> <img src='../data/tfidf.png' > </center>

The weight of token **i** in document **j** is calculated by taking the term frequency (how many times the token appears in the document) multiplied by the log of the total number of documents divided by the number of documents that contain the same term.
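Written out from this description, the weight can be expressed as follows. The symbols $N$ (total number of documents) and $df_i$ (number of documents containing token $i$) are names assumed here for illustration:

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$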

The weight ($w_{i,j}$) will be low if the term doesn't appear often in the document, because the $tf_{i,j}$ variable will then be low. However, it will also be low if the logarithm is close to zero. This happens when the total number of documents divided by the number of documents that contain the term is close to **one**, since the logarithm of one is **zero**. So words that occur across many or all documents will have a very low *tf-idf* weight. On the contrary, if the word only occurs in a few documents, the logarithm will return a higher number.

A *tf-idf* model can be built using Gensim and the corpus developed previously. The bag-of-words corpus of movie reviews from the last example can be translated into a *tf-idf* model by simply passing it at initialization.


In [21]:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
tfidf[corpus[1]]

[(5, 0.1746298276735174),
 (7, 0.1746298276735174),
 (9, 0.1746298276735174),
 (10, 0.29853166221463673),
 (11, 0.47316148988815415),
 (12, 0.7716931521027908)]

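In the example, tokens 5, 7 and 9 (*movie*, *the* and *!*), which each appear in four of the six documents, have a weight of 0.174, whereas token 12 (*really*), which appears only in this document, has a weight of 0.77.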

These weights can help you determine good topics and keywords for a corpus with shared vocabulary.
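As a check on the numbers above, the weights of the second review can be recomputed by hand. The sketch below assumes Gensim's default settings, which (to the best of our knowledge) use a base-2 logarithm and then scale each document vector to unit length; the helper name `tfidf_weights` is made up for this example.

```python
import math

# Bag-of-words corpus copied from the doc2bow output shown earlier
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
    [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
    [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
    [(0, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1), (20, 1), (21, 1), (22, 1), (23, 1)],
    [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)],
    [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)],
]
n_docs = len(corpus)

# Document frequency: in how many documents does each token id appear?
doc_freq = {}
for doc in corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf_weights(doc):
    # Raw weight: term frequency times log2(N / df)
    raw = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
    # Scale so that the document vector has unit Euclidean length
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in raw if w > 0]

for token_id, weight in tfidf_weights(corpus[1]):
    print(token_id, round(weight, 4))
```

Token 12 ends up with the largest weight because it appears in one document only, so its logarithm term is the largest of the six.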

***

<h3> $\color{darkslateblue}{\text{4. Named Entity Recognition}}$</h3>

**NER** (Named Entity Recognition) is an NLP task used to identify important named entities in the text, such as people, places and organizations. They can even be dates, states, works of art and other categories, depending on the libraries and notation you use. 

NER can be used alongside topic identification, or on its own to determine important items in a text or answer basic natural language understanding questions such as who, what, when and where.

Taking the following piece of text, from the English Wikipedia article on *Albert Einstein* you can see the application of the NER.

<center> <img src='../data/NER.png' > </center>

The text has been highlighted for different types of named entities that were found using the Stanford NER library. You can see the dates, locations, persons and organizations found and extract information on the text based on these named entities.

In this way, one can use NER to solve problems like fact extraction, as well as to find which entities are related, using computational language models. 

For example, in this text we can see that Einstein has something to do with the United States, Adolf Hitler and Germany. We can also see by token proximity that Bertrand Russell and Einstein created the Russell-Einstein Manifesto, all from simple entity highlighting.

The NLTK library allows you to interact with named entity recognition via its own model, but also via the aforementioned **Stanford library**. The Stanford library integration requires you to perform a few steps before you can use it, including installing the required Java files and setting system environment variables. You can also use the Stanford library on its own without integrating it with NLTK, or operate it as an API server. 

The Stanford CoreNLP library has great support for named entity recognition as well as some related NLP tasks such as **coreference** (or linking pronouns and entities together) and **dependency trees** to help with parsing meaning and relationships amongst words or phrases in a sentence.

For the following simple use case, we will use the built-in named entity recognition with NLTK. To do so, we take a normal sentence and preprocess it via tokenization. Then, we can tag the sentence for parts of speech. This will add tags for proper nouns, pronouns, adjectives, verbs and other parts of speech that NLTK uses based on an English grammar:

In [22]:
sentence = '''In New York, I like to ride the Metro to
            visit MOMA and some restaurants rated
            well by Ruth Reichl.'''
tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent[:3]

[('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]

When we take a look at the tags, we see *New* and *York* are tagged *NNP* which is the tag for a **proper noun, singular**.

Then we pass this tagged sentence into the `ne_chunk` function, or named entity chunk, which will return the sentence as a **tree**. 

In [23]:
print(nltk.ne_chunk(tagged_sent))

(S
  In/IN
  (GPE New/NNP York/NNP)
  ,/,
  I/PRP
  like/VBP
  to/TO
  ride/VB
  the/DT
  (ORGANIZATION Metro/NNP)
  to/TO
  visit/VB
  (ORGANIZATION MOMA/NNP)
  and/CC
  some/DT
  restaurants/NNS
  rated/VBN
  well/RB
  by/IN
  (PERSON Ruth/NNP Reichl/NNP)
  ./.)


This tree shows the named entities tagged as their own chunks such as GPE or **geopolitical entity** for *New York*, or *MOMA* and *Metro* as **organizations**. It also identifies *Ruth Reichl* as a **person**. 

It does so without consulting a knowledge base, like Wikipedia, but instead uses **trained statistical and grammatical parsers**.

<h4> $\color{darkslateblue}{\text{SpaCy}}$</h4>

**SpaCy** is an NLP library similar to **Gensim** but with different implementations, including a particular focus on **creating NLP pipelines to generate models and corpora**. SpaCy is open-source and has several extra libraries and tools built by the same team, including **displaCy**, a visualization tool for viewing parse trees which uses Node.js to create interactive text.

Using the **displacy entity recognition visualizer**, we can enter the sentence used in the last example:

<center> <img src='../data/SpaCy.png' > </center>
    
Here, we can see that SpaCy has identified three named entities and tagged them with the appropriate entity label (such as location or person). SpaCy also has tools to build word and document vectors from text.



* To start using SpaCy for NER, we must first install it and download all the appropriate pre-trained word vectors (you can also train vectors yourself and load them, but the pretrained ones let us get started immediately). 

    We can load those into an object, `nlp`, which functions similarly to our Gensim dictionary and corpus. It has several linked objects, including `entity`, which is an Entity Recognizer object from the pipeline module. This is what is used to **find entities in the text**. Note that the `'en'` shortcut and the `nlp.entity` attribute used here belong to SpaCy 2.x; in SpaCy 3 a model is loaded by its full name (such as `en_core_web_sm`) and the recognizer is reached with `nlp.get_pipe('ner')`. 

In [24]:
import spacy

nlp = spacy.load('en')
nlp.entity

<spacy.pipeline.pipes.EntityRecognizer at 0x7fa6615aba60>

* Then we load a new document by passing a string into the `nlp` variable. When the document is loaded, the named entities are stored as a document attribute called `ents`. We see that SpaCy properly tagged and identified the three main entities in the sentence. 

In [25]:
doc = nlp("""Berlin is the capital of Germany;
            and the residence of Chancellor Angela Merkel.""")
doc.ents

(Berlin, Germany, Angela Merkel)

* We can also investigate the labels of each entity by using indexing to pick out the first entity and the `label_` attribute to see the label for that particular entity. 

  Here we see the label for Berlin is GPE. 

In [26]:
print(doc.ents[0], doc.ents[0].label_)

Berlin GPE


SpaCy has several other language models available, including advanced German and Chinese implementations. It's a great tool, especially if you want to build your own extraction and NLP pipeline quickly and iteratively.

**Why use SpaCy for NER?** Besides integrating with other great SpaCy features like easy pipeline creation, it has a different set of entity types and often labels entities differently than NLTK. In addition, SpaCy comes with informal language corpora, allowing you to more easily find entities in documents like tweets and chat messages.

***

<h3> $\color{darkslateblue}{\text{5. Supervised learning with NLP}}$</h3>

**Supervised learning** is a form of machine learning where you are given or create training data. This data has a label or outcome which you want the model or algorithm to learn.

To help create features and train a model, we will use `scikit-learn`, a powerful open-source library. One of the ways you can create supervised learning data from text is by using bag-of-words models or tf-idf as features.

Let's say you have a dataset full of movie plots and genres from the IMDB database, as shown in the following chart:

<center> <img src='../data/IMDB.png' > </center>


Action and Sci-Fi movies have been separated, removing any movies labeled both action and Sci-Fi.
    
You want to **predict** whether a movie is action or sci-fi based on the plot summary. The dataset we've extracted has categorical features generated using some preprocessing. We can see the plot summary and the Sci-Fi and Action columns. The Sci-Fi column is 1 for movies that are sci-fi and 0 for movies that are action. The Action column is the inverse of the Sci-Fi column.

<h4> $\color{darkslateblue}{\text{Steps}}$</h4>

The supervised learning process is:
1. Collect and preprocess the data. 
2. Determine a label (what we want the model to learn). 
3. Split the data into training and testing datasets, keeping them separate so we can build our model using only the training data. The test data remains unseen so we can test how well our model performs after it is trained. **This is an essential part of Supervised Learning**
4. Extract features from the text to predict the label. We will use a bag-of-words vectorizer built into scikit-learn to do so. 
5. After the model is trained, test it using the test dataset. There are also other methods to evaluate model performance, such as k-fold cross validation.
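The steps above can be sketched with scikit-learn as follows. The plot summaries are a hypothetical toy set (not the IMDB data), and the choice of a Multinomial Naive Bayes classifier is an assumption made only to have something to train; any classifier that accepts count features would fit the same outline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Steps 1-2: toy data and labels (1 = sci-fi, 0 = action), invented for illustration
plots = [
    "aliens invade earth in a giant spaceship",
    "a robot travels through time to save the future",
    "astronauts discover a strange planet in deep space",
    "scientists clone a dinosaur in a secret laboratory",
    "a cop chases bank robbers through the city",
    "a retired soldier fights a drug cartel",
    "two spies escape an explosion during a car chase",
    "a boxer trains for one last fight",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 3: hold out a quarter of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(
    plots, labels, test_size=0.25, random_state=42, stratify=labels
)

# Step 4: bag-of-words features; the vocabulary is learned from the training text only
vectorizer = CountVectorizer(stop_words="english")
train_features = vectorizer.fit_transform(X_train)
test_features = vectorizer.transform(X_test)

# Train the classifier, then (step 5) evaluate it on the unseen test set
model = MultinomialNB()
model.fit(train_features, y_train)
predictions = model.predict(test_features)
print(accuracy_score(y_test, predictions))
```

With only eight examples the accuracy printed here says nothing about real performance; the point is the order of the operations, in particular that the vectorizer is fitted on the training split and merely applied to the test split.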

***

<h3> $\color{darkslateblue}{\text{6. Implementation of NLP to news headlines}}$</h3>

Once the basis of **NLP** has been established, it is time to apply it to an example. 

The aim is to apply **LDA** to a data set of news headlines published over a period of seventeen years and split them into topics, to see if the model can correctly classify a headline into a predefined topic.

<h4> $\color{darkslateblue}{\text{Data load}}$</h4>

The first step is to import `pandas` and the data that will be used. The DataFrame is sourced from the **Australian news source ABC** ([Australian Broadcasting Corporation](https://www.abc.net.au/)).

To make the data easier to work with, two columns will be set up: `headline_text` (the headline of the article in ASCII, English and lowercase) and `index` (the position the headline occupies in the DataFrame). 

In [27]:
import pandas as pd

In [28]:
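# on_bad_lines='skip' drops malformed rows (pandas 1.3+; older versions used error_bad_lines=False)
data = pd.read_csv('../data/abcnews-date-text.csv', on_bad_lines='skip')
# .copy() gives an independent DataFrame, avoiding a SettingWithCopyWarning on the next line
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text

Lines with too many fields (e.g. a CSV line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. With `on_bad_lines='skip'` (the replacement for the `error_bad_lines=False` argument of older pandas versions), these "bad lines" are dropped from the DataFrame that is returned.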
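The behaviour described above can be seen on a tiny in-memory CSV, sketched below. The three rows are made up, and the example uses the `on_bad_lines='skip'` option that pandas 1.3 and later provide for this purpose.

```python
import io

import pandas as pd

# Hypothetical CSV: the second data row contains an extra comma, hence an extra field
raw_csv = (
    "headline_text,year\n"
    "first headline,2003\n"
    "second, broken headline,2003\n"
    "third headline,2004\n"
)

# Malformed rows are dropped instead of raising a parser error
frame = pd.read_csv(io.StringIO(raw_csv), on_bad_lines="skip")
print(len(frame))  # 2
print(list(frame["headline_text"]))  # ['first headline', 'third headline']
```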

Then the DataFrame is renamed as `documents` and it is observed that it has a total of **1,186,018 headlines**.

In [29]:
len(documents)

1186018

The DF therefore has the following form:

In [30]:
documents[:10]

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |
| 5 | ambitious olsson wins triple jump | 5 |
| 6 | antic delighted with record breaking barca | 6 |
| 7 | aussie qualifier stosur wastes four memphis match | 7 |
| 8 | aust addresses un security council over iraq | 8 |
| 9 | australia is locked into war timetable opp | 9 |


<h4> $\color{darkslateblue}{\text{Data preprocessing}}$</h4>

Once the DF is loaded and examined, the first step of Natural Language Processing, **preprocessing**, will be carried out. 

In this section **tokenization** will be done, splitting the text into sentences and then into words, as well as the removal of punctuation. Words of three or fewer characters and all **stopwords** will also be removed, but **lowercasing** won't be necessary as the words are already in that format.

Likewise, words will be **lemmatized** (words in third person are changed to first person and verbs in past and future tenses are changed into present) and **stemmed** (reduced to their root form). 

This will be done through the libraries `gensim` and `nltk`, which will be imported along with their **dependencies** and `numpy`.

In [31]:
import gensim
import nltk
import numpy as np

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

First the lemmatizing and stemming will be done. An example of how they work can be seen below.

Using `WordNetLemmatizer`, the verb *went* in past tense is transformed to *go*. The parameter `pos = 'v'` indicates that it is a verb. It can also be a noun (n), an adjective (a) or an adverb (r).

In [32]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v'))

go


With `SnowballStemmer` one can choose the language, and a group of original words is reduced to their roots. The function `stemmer.stem` is applied to each word (turning plurals into singulars, among other changes), and a DF is created with the words before and after being processed by the function.

In [33]:
stemmer = SnowballStemmer('english')

original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']

stem = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data = {'original word': original_words, 'stemmed': stem})


| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


So once the behaviour of both lemmatizing and stemming has been seen, we will proceed to define the functions.

The *first* function is defined to get the words lemmatized and stemmed at the same time. 

The *second* function is in charge of the tokenization and applies the first function to every token it keeps, which is the reason why the first function is called inside the second. In this function the **stopwords** and the words with a length of **3** characters or fewer are ruled out. 

In [34]:
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In the code below, a single headline (the one at index `3000`, for example) has been selected so that the difference between the original text, split on blank spaces, and the text once it has been processed by the function described above can be seen.

The processed text is likewise shown as a list of tokens. 

In [35]:
doc_sample = documents[documents['index'] == 3000].values[0][0]

print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['brogden', 'pledges', 'cut', 'in', 'hospital', 'waiting', 'lists']


 tokenized and lemmatized document: 
['brogden', 'pledg', 'hospit', 'wait', 'list']


Finally, the function is applied to the whole set of headlines, and the first ten entries of the resulting series are shown.

In [36]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

For example, the second headline is probably about a *witness* who is aware of a *defamation*. Likewise, the fifth headline has something to do with a *strike* that can affect *Australian travel*.

This completes the pre-processing section: the necessary changes have been made to the text, which is now ready to be worked on.

<h4> $\color{darkslateblue}{\text{Bag-of-words of the data set}}$</h4>

In this section a dictionary is built from `processed_docs`; it records every distinct word of the training set together with how often it appears. This is one of the two methods for finding topics in a text that we will see. The other one is TF-IDF, which is explained later.

The `corpora.Dictionary` class is applied to the processed documents. It implements the concept of a dictionary: a mapping between words and their integer ids. The dictionary object is used to create a bag-of-words corpus, which will later be used as the input to topic modelling.

In [37]:
dictionary = gensim.corpora.Dictionary(processed_docs)

While it is built, the dictionary iterates over all the documents, assigns an integer id to every distinct word and keeps track of how often each word appears.

The loop below prints each id together with its word. It stops after the first 11 entries (ids 0 to 10) simply to keep the preview short.

In [38]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit
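The id assignment can be mimicked in plain Python. This is only a sketch under an assumption: the rule "new words receive the next free id, in alphabetical order within each document" is inferred from the output above rather than taken from Gensim's documentation, and `build_token_ids` is a hypothetical helper.

```python
def build_token_ids(docs):
    """Assign consecutive integer ids to words, document by document."""
    token2id = {}
    for doc in docs:
        # Within a document, unseen words get ids in alphabetical order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# First three preprocessed headlines, copied from the output of processed_docs.
docs = [
    ["decid", "communiti", "broadcast", "licenc"],
    ["wit", "awar", "defam"],
    ["call", "infrastructur", "protect", "summit"],
]
token2id = build_token_ids(docs)
for token, token_id in sorted(token2id.items(), key=lambda kv: kv[1]):
    print(token_id, token)
```

For these three headlines the sketch yields the same id order as the preview above, from `0 broadcast` to `10 summit`.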


The Gensim method `filter_extremes` filters out tokens that appear in **fewer than 15 documents** or in **more than 50%** of them and, after that, keeps only the 100,000 most frequent of the remaining tokens. 

In [39]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
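To make the three thresholds concrete, here is a rough plain-Python sketch of the filtering logic on a toy corpus. `filter_extremes_sketch` is a hypothetical function written for illustration; it follows the behaviour described above as we understand it and is not Gensim's implementation.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filters."""
    num_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    survivors = [tok for tok, df in doc_freq.items()
                 if no_below <= df <= no_above * num_docs]
    # Keep at most keep_n tokens, preferring the highest document frequency.
    survivors.sort(key=lambda tok: doc_freq[tok], reverse=True)
    return set(survivors[:keep_n])

toy_docs = [
    ["polic", "charg", "man"],
    ["polic", "court", "man"],
    ["polic", "rain"],
    ["polic", "charg"],
]
kept = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(sorted(kept))
```

In the toy corpus "polic" is dropped because it occurs in every document, while "court" and "rain" are dropped because they occur in only one.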

Furthermore, `doc2bow`, another Gensim method, converts each document into a list of `(word id, count)` pairs reporting **which words** appear and **how many times** each of them appears. 

In [40]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[3000]

[(365, 1), (393, 1), (665, 1), (1142, 1), (1191, 1)]

When applied to the headline that has been chosen as an example (#3000), the result shows that word number 365 appears once, and the same holds for the other four words.

The loop below prints a **preview of the bag-of-words for the sample preprocessed document**: the same information as the output above, with each id translated back to its word through the dictionary. 

In [41]:
bow_doc_3000 = bow_corpus[3000]

for i in range(len(bow_doc_3000)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_3000[i][0], 
                                                     dictionary[bow_doc_3000[i][0]], 
                                                     bow_doc_3000[i][1]))

Word 365 ("hospit") appears 1 time.
Word 393 ("pledg") appears 1 time.
Word 665 ("wait") appears 1 time.
Word 1142 ("list") appears 1 time.
Word 1191 ("brogden") appears 1 time.


*Keep in mind that when it comes to headlines, it is normal that the words are not repeated*.
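The bag-of-words conversion can be sketched with `collections.Counter`. This is a simplified illustration, not Gensim's code: `doc2bow_sketch` is a hypothetical helper, the ids are copied from the output above, and "unknownword" is a made-up token used to show that words missing from the dictionary are skipped.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Ids copied from the preview above.
token2id = {"hospit": 365, "pledg": 393, "wait": 665, "list": 1142, "brogden": 1191}

tokens = ["brogden", "pledg", "hospit", "wait", "list", "unknownword"]
bow = doc2bow_sketch(tokens, token2id)
print(bow)
```

A repeated word would simply get a higher count, for example `doc2bow_sketch(["wait", "wait"], token2id)` gives one pair with count 2.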

<h4> $\color{darkslateblue}{\text{TF-IDF}}$</h4>

In this section a **TF-IDF** (*Term-Frequency - Inverse Document Frequency*) model is created using `models.TfidfModel` on `bow_corpus` and saved to `tfidf`. The transformation is then applied to the entire corpus, and the result is called `corpus_tfidf`. 

With this model, the number of times each word appears in a headline is weighted by how rare the word is across the corpus. It is assumed that words that appear in many documents of the whole corpus are less informative. 

The result of this process will be a **TF-IDF score** (its weight) for each word.

In [42]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [43]:
corpus_tfidf = tfidf[bow_corpus]

In [57]:
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5850076620505259),
 (1, 0.38947256567331934),
 (2, 0.4997099083387053),
 (3, 0.5063271308533074)]
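The weighting can be reproduced by hand on a toy corpus. The sketch below assumes the scheme that, to our understanding, Gensim uses by default: raw count times `log2(N / df)`, followed by scaling each document to unit length (the four weights printed above do have a squared sum of about 1). `tfidf_sketch` and the toy ids are hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(bow_corpus):
    """Weight each (token_id, count) pair by log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)
    # Document frequency: number of documents that contain each token id.
    doc_freq = Counter(tid for doc in bow_corpus for tid, _ in doc)
    result = []
    for doc in bow_corpus:
        raw = [(tid, count * math.log2(num_docs / doc_freq[tid]))
               for tid, count in doc]
        # A zero weight means the token occurs in every document; drop it.
        raw = [(tid, w) for tid, w in raw if w > 0]
        norm = math.sqrt(sum(w * w for _, w in raw))
        result.append([(tid, w / norm) for tid, w in raw])
    return result

toy_bow = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 1)],
    [(0, 1), (1, 1), (3, 1)],
]
toy_tfidf = tfidf_sketch(toy_bow)
for doc in toy_tfidf:
    print([(tid, round(w, 3)) for tid, w in doc])
```

Token 0 appears in every toy document and therefore carries no weight, while token 3, which appears in a single document, gets the largest weight in the last document.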


<h4> $\color{darkslateblue}{\text{LDA}}$</h4>

In this section we will run **LDA** using both *bag-of-words* and *TF-IDF* models so their results can be compared.

As mentioned earlier, LDA stands for **Latent Dirichlet Allocation**, a statistical model that can be applied to text using Gensim for topic analysis and modeling. 

In this way, LDA will show the relative weight of each word for several topics in both BOW and TF-IDF.

* **Running LDA using Bag of Words**

An `LdaMulticore` model is trained on the bag-of-words corpus created before and stored in `lda_model`.

The parameter `num_topics` indicates the number of requested latent topics to be extracted from the training corpus (10 in our case), `id2word` is used to determine the vocabulary size, as well as for debugging and topic printing, `passes` is the number of passes through the entire corpus (2) and `workers` indicates the number of worker processes to be used for parallelization (2). 

In [46]:
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics = 10, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers = 2)

Keeping this in mind, the topics are generated:

In [47]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.022*"hous" + 0.022*"south" + 0.020*"north" + 0.017*"bushfir" + 0.016*"miss" + 0.013*"interview" + 0.012*"west" + 0.011*"hospit" + 0.011*"coast" + 0.010*"investig"
Topic: 1 
Words: 0.031*"kill" + 0.023*"shoot" + 0.021*"protest" + 0.020*"dead" + 0.019*"polic" + 0.019*"attack" + 0.014*"offic" + 0.013*"assault" + 0.013*"chines" + 0.011*"michael"
Topic: 2 
Words: 0.056*"australia" + 0.045*"australian" + 0.026*"world" + 0.017*"canberra" + 0.017*"test" + 0.013*"win" + 0.011*"final" + 0.011*"farm" + 0.011*"open" + 0.010*"return"
Topic: 3 
Words: 0.030*"polic" + 0.029*"charg" + 0.026*"court" + 0.024*"death" + 0.024*"murder" + 0.020*"woman" + 0.017*"face" + 0.017*"alleg" + 0.016*"crash" + 0.013*"trial"
Topic: 4 
Words: 0.019*"chang" + 0.018*"say" + 0.015*"speak" + 0.015*"power" + 0.013*"worker" + 0.012*"climat" + 0.012*"concern" + 0.011*"flood" + 0.011*"fear" + 0.010*"emerg"
Topic: 5 
Words: 0.021*"market" + 0.020*"news" + 0.018*"women" + 0.018*"live" + 0.016*"tasmania" + 0.01

Here you can see that, for example, **topic 1** has words such as *kill, shoot, dead, attack, assault*, which indicates that the topic is about crimes and offenses. **Topic 4** has stems like *chang, climat, concern, flood, fear, emerg*, so basically climate change, and **topic 6** has *elect, state, labor, liber, leader, parti, campaign*, which appears to be a topic about elections and the different parties. 
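When the topics need to be inspected programmatically rather than by eye, the printed strings can be split into word and weight pairs. The following is a small sketch with a hypothetical helper, `parse_topic`, applied to the first four terms of topic 1 as printed above.

```python
import re

def parse_topic(topic_string):
    # Each term looks like 0.031*"kill"; capture the weight and the word.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_1 = '0.031*"kill" + 0.023*"shoot" + 0.021*"protest" + 0.020*"dead"'
terms = parse_topic(topic_1)
print(terms)
```

If we recall the API correctly, Gensim's `show_topic` method returns such pairs directly, so parsing is mainly useful when only the printed output is available.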

* **Running LDA using TF-IDF**

For the corpus created before using TF-IDF, the same model will be applied. 

The only parameter that changes is `workers` (4 instead of 2), because this process requires more computing power.

In [48]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                             num_topics=10,
                                             id2word = dictionary,
                                             passes=2, 
                                             workers=4)

In [49]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.020*"countri" + 0.015*"hour" + 0.008*"christma" + 0.007*"andrew" + 0.007*"insid" + 0.006*"august" + 0.006*"june" + 0.006*"bushfir" + 0.006*"brief" + 0.005*"kill"
Topic: 1 Word: 0.011*"elect" + 0.007*"sport" + 0.007*"liber" + 0.006*"financ" + 0.006*"mark" + 0.006*"jam" + 0.005*"marriag" + 0.005*"quiz" + 0.005*"asylum" + 0.005*"toni"
Topic: 2 Word: 0.008*"street" + 0.007*"live" + 0.007*"girl" + 0.006*"octob" + 0.006*"polit" + 0.005*"foreign" + 0.005*"refuge" + 0.005*"senat" + 0.005*"univers" + 0.005*"blog"
Topic: 3 Word: 0.021*"news" + 0.019*"market" + 0.015*"rural" + 0.010*"price" + 0.009*"drought" + 0.008*"turnbul" + 0.008*"farmer" + 0.008*"share" + 0.008*"farm" + 0.007*"nation"
Topic: 4 Word: 0.022*"trump" + 0.010*"drum" + 0.010*"govern" + 0.009*"health" + 0.007*"tuesday" + 0.006*"michael" + 0.006*"tasmania" + 0.006*"fund" + 0.006*"david" + 0.006*"care"
Topic: 5 Word: 0.016*"crash" + 0.009*"miss" + 0.009*"search" + 0.008*"die" + 0.008*"monday" + 0.008*"road" + 0.007*"

In this case the topics are different; for example, **topic 8** seems to be about a cricket league final in Australia. 

Comparing the two models, the main difference between *BOW* and *TF-IDF* is that BOW just creates a set of vectors containing the count of word occurrences in each document (here, each headline), while the TF-IDF model also contains information on the importance of the words.

<h4> $\color{darkslateblue}{\text{Performance evaluation by classifying sample document using LDA bag-of-words model}}$</h4>


We will check how our test document is classified. As a reminder, these are the main words of the headline:

In [50]:
processed_docs[3000]

['brogden', 'pledg', 'hospit', 'wait', 'list']

The loop below prints the score of each topic together with its top words, in descending order of score thanks to `sorted(..., key=lambda tup: -1*tup[1])`. 

In [51]:
for index, score in sorted(lda_model[bow_corpus[3000]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.35019564628601074	 
Topic: 0.044*"trump" + 0.037*"year" + 0.035*"sydney" + 0.028*"queensland" + 0.022*"home" + 0.021*"adelaid" + 0.018*"perth" + 0.016*"brisban" + 0.015*"peopl" + 0.015*"royal"

Score: 0.34957143664360046	 
Topic: 0.031*"govern" + 0.018*"feder" + 0.016*"warn" + 0.015*"countri" + 0.015*"fund" + 0.014*"claim" + 0.014*"life" + 0.013*"say" + 0.012*"stori" + 0.011*"health"

Score: 0.18346960842609406	 
Topic: 0.022*"hous" + 0.022*"south" + 0.020*"north" + 0.017*"bushfir" + 0.016*"miss" + 0.013*"interview" + 0.012*"west" + 0.011*"hospit" + 0.011*"coast" + 0.010*"investig"

Score: 0.016682324931025505	 
Topic: 0.019*"chang" + 0.018*"say" + 0.015*"speak" + 0.015*"power" + 0.013*"worker" + 0.012*"climat" + 0.012*"concern" + 0.011*"flood" + 0.011*"fear" + 0.010*"emerg"

Score: 0.016682039946317673	 
Topic: 0.036*"elect" + 0.018*"water" + 0.018*"state" + 0.016*"tasmanian" + 0.012*"labor" + 0.011*"liber" + 0.011*"morrison" + 0.011*"leader" + 0.011*"parti" + 0.010*"campaig

The test headline (no. 3000) has the highest probability (about 0.35) of belonging to the topic that the BOW-based LDA model ranks first, which is dominated by Australian place names (Sydney, Queensland, Adelaide, Perth, Brisbane...). Note that the second-ranked topic has almost the same score.
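The ranking step is plain Python and can be tried in isolation. In the sketch below the scores are rounded from the output above; the ids 0 and 4 correspond to topics printed earlier, whereas the ids 7 and 9 are placeholders, since the ids of the two top-ranked topics are not visible in the truncated listing.

```python
# (topic_id, score) pairs; ids 7 and 9 are hypothetical placeholders.
topic_scores = [(0, 0.1835), (4, 0.0167), (7, 0.3502), (9, 0.3496)]

# Sorting by the score in descending order is equivalent to
# using key=lambda tup: -1*tup[1].
ranked = sorted(topic_scores, key=lambda tup: tup[1], reverse=True)

best_id, best_score = ranked[0]
margin = best_score - ranked[1][1]
print(ranked)
print(f"best topic: {best_id}, margin over runner-up: {margin:.4f}")
```

Looking at the margin as well as the winner helps to notice cases like this one, where the two best topics are almost tied.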

<h4> $\color{darkslateblue}{\text{Performance evaluation by classifying sample document using LDA TF-IDF model}}$</h4>

In [52]:
for index, score in sorted(lda_model_tfidf[bow_corpus[3000]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))
    


Score: 0.40001052618026733	 
Topic: 0.011*"elect" + 0.007*"sport" + 0.007*"liber" + 0.006*"financ" + 0.006*"mark" + 0.006*"jam" + 0.005*"marriag" + 0.005*"quiz" + 0.005*"asylum" + 0.005*"toni"

Score: 0.2735215127468109	 
Topic: 0.022*"trump" + 0.010*"drum" + 0.010*"govern" + 0.009*"health" + 0.007*"tuesday" + 0.006*"michael" + 0.006*"tasmania" + 0.006*"fund" + 0.006*"david" + 0.006*"care"

Score: 0.20962771773338318	 
Topic: 0.019*"charg" + 0.017*"murder" + 0.016*"polic" + 0.013*"donald" + 0.012*"alleg" + 0.012*"court" + 0.010*"jail" + 0.010*"arrest" + 0.010*"woman" + 0.009*"shoot"

Score: 0.016692066565155983	 
Topic: 0.011*"climat" + 0.009*"hobart" + 0.009*"chang" + 0.007*"cattl" + 0.006*"northern" + 0.006*"territori" + 0.006*"decemb" + 0.005*"remot" + 0.005*"australia" + 0.005*"princ"

Score: 0.016692031174898148	 
Topic: 0.008*"street" + 0.007*"live" + 0.007*"girl" + 0.006*"octob" + 0.006*"polit" + 0.005*"foreign" + 0.005*"refuge" + 0.005*"senat" + 0.005*"univers" + 0.005*"blog"


The test headline (no. 3000) has the highest probability (about 0.40) of belonging to the topic that the TF-IDF-based LDA model ranks first, which mentions elections, sport, finance... 

<h4> $\color{darkslateblue}{\text{Testing model on unseen document}}$</h4>

Last but not least, a new headline is defined and saved as `unseen_document`. The headline is *How a Pentagon deal became an identity crisis for Google*.

This headline is **preprocessed** through lemmatization, stemming and tokenization as before.

In [56]:
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 15)))

Score: 0.48979589343070984	 Topic: 0.021*"market" + 0.020*"news" + 0.018*"women" + 0.018*"live" + 0.016*"tasmania" + 0.013*"high" + 0.013*"rise" + 0.012*"price" + 0.012*"lose" + 0.012*"break" + 0.012*"street" + 0.011*"fall" + 0.011*"gold" + 0.011*"record" + 0.010*"busi"
Score: 0.21052519977092743	 Topic: 0.036*"elect" + 0.018*"water" + 0.018*"state" + 0.016*"tasmanian" + 0.012*"labor" + 0.011*"liber" + 0.011*"morrison" + 0.011*"leader" + 0.011*"parti" + 0.010*"campaign" + 0.010*"give" + 0.009*"green" + 0.009*"season" + 0.009*"futur" + 0.008*"talk"
Score: 0.1829420030117035	 Topic: 0.030*"polic" + 0.029*"charg" + 0.026*"court" + 0.024*"death" + 0.024*"murder" + 0.020*"woman" + 0.017*"face" + 0.017*"alleg" + 0.016*"crash" + 0.013*"trial" + 0.012*"jail" + 0.012*"accus" + 0.011*"case" + 0.011*"guilti" + 0.011*"victoria"
Score: 0.01667730137705803	 Topic: 0.031*"govern" + 0.018*"feder" + 0.016*"warn" + 0.015*"countri" + 0.015*"fund" + 0.014*"claim" + 0.014*"life" + 0.013*"say" + 0.012*"stor

Here we can see that the model can also classify a headline it was not trained on. 

It assigns the headline to a topic that talks about **the market, news, price, lose, break (probably from "breaking news"), fall and business**. 

***

<h3> $\color{darkslateblue}{\text{7. Conclusions}}$</h3>

**NLP** is a versatile tool with many applications. In this work a theoretical basis has been presented, on which the practical part builds. 

In the practical part, an **LDA** model is created that assigns a topic to a specific headline. 

As can be seen in the results, the LDA model differs depending on whether **Bag-of-Words** or **TF-IDF** is used as input. This is due to the differences between the two representations: the first one is simpler and only counts words, while the second one is more complex and gives each word a weight that considers both its frequency in each headline and its frequency in the whole corpus.

***

<h3> $\color{darkslateblue}{\text{8. References}}$</h3>

* http://www.nltk.org/book/
* https://www.youtube.com/watch?v=X4d4MiTVNcw&t=475s&utm_content=145787035&utm_medium=social&utm_source=linkedin&hss_channel=lcp-3740012
* https://medium.com/towards-artificial-intelligence/natural-language-processing-with-spacy-steps-and-examples-155618e84103
* https://code.datasciencedojo.com/datasciencedojo/tutorials/blob/master/Introduction%20to%20Natural%20Language%20Processing/Introduction%20to%20Natural%20Language%20Processing.pdf?utm_content=142114660&utm_medium=social&utm_source=linkedin&hss_channel=lcp-3740012
* https://www.iotcentral.io/blog/see-this-simple-introduction-to-natural-language-processing-nlp
* https://www.datasciencecentral.com/profiles/blogs/analyzing-the-structure-and-effectiveness-of-news-headlines-using
* https://www.datasciencecentral.com/profiles/blogs/how-i-used-nlp-spacy-to-screen-data-science-resumes

***