# Prompt

## Code Instructions

Please click here and follow the article. (https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70)

This article is for text summarization using python.

Please implement their code first. Then, try to apply their code to your data (5 long or short articles). You might notice that their code did not have the text cleaning except the stop-words. Please refer to the text cleaning methods used in ICE-1 and add appropriate text cleaning methods to the text summarization code. Then, apply the modified code to your data again.


## Questions - Answered at the end of the notebook.

1. What are the two main strategies used in text summarization?

2. Which feature is used in the text summarization code? Explain how to calculate it.

3. What is the similarity measurement method used in this code?

4. We know in ICE-1, TF-IDF is used as the text feature. Can we use it in this code? 

5. Compare the outputs above. Are they the same or not? Please analyze the comparison result.

# Imports, Installs, and Downloads

## Imports

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx
import re
import requests
from bs4 import BeautifulSoup

## Downloads

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# Get Data

## Load Text

Note: All articles are from medium.com and towardsdatascience.com, but the text had been manually put into variables so that loading files isn't necessary. I would have scraped directly from webpages but most sits now protect against that.

In [6]:
data = {}

### A1 - GPT-what? A non-technical guide to OpenAI’s groundbreaking new NLP model

In [7]:
a1 = """Hype over GPT-3 reached an all-time high on Twitter over the weekend and many are calling the technological development a groundbreaking inflection point for future AI research. In this article I explore what GPT is, what it means for AI development, and where we might be headed from here.
OpenAI’s GPT-3 language model gained significant attention last week, leading many to believe that the new technology represents a significant inflection point in the development of Natural Language Processing (NLP) tools. Those with early API access through OpenAI’s beta program went to Twitter to showcase impressive early tools built using GPT-3 technology:




For non-engineers, this may look like magic, but there is a lot to be unpacked here. In this article I will provide a brief overview of GPT and what it can be used for.

What is OpenAI and GPT-3?

OpenAI is an AI research laboratory founded in 2015 by Elon Musk, Sam Altman, and others with the mission of creating AI that benefits all of humanity. The company recently received $1 billion of additional funding from Microsoft in 2019 and is considered a leader in AI research and development.

Historically, obtaining large quantities of labelled data to use to train models has been a major barrier in NLP development (and AI development in general). Normally, this can be extremely time consuming and expensive. To solve this, scientists have used an approach called transfer learning: use the existing representations/information learned in a previously-trained model as a starting point to fine-tune and train a new model for a different task.

For example, suppose you would like to learn a new language — German. Initially, you will still think about your sentences in English, then translate and rearrange words to come up with the German equivalent. The reality is, you are still indirectly applying learnings about sentence structure, language, and communication from the previous language even though the actual words and grammar are different. This is why learning new languages is typically easier if you already know another language.

Applying this strategy to AI means that we can use pre-trained models to create new models more quickly with less training data. In this great walkthrough, Francois Chollet compared the effectiveness of an AI model trained from scratch to one built from a pre-trained model. His results showed that the latter had 15% greater predictive accuracy after training both with the same amount of training data.

In 2018, OpenAI presented convincing research showing that this strategy (pairing supervised learning with unsupervised pre-training) is particularly very effective in NLP tasks. They first produced a generative pre-trained model (“GPT”) using “a diverse corpus of unlabeled text” (i.e. over 7,000 unique unpublished books from a variety of genres), essentially creating a model that “understood” English and language. Next, this pre-trained model could be further fine-tuned and trained to perform specific tasks using supervised learning. As an analogy, this would be like teaching someone English, then training him or her for the specific task of reading and classifying resumes of acceptable and unacceptable candidates for hiring.

GPT-3 is the latest iteration of the GPT model and was first described in May 2020. It contains 175 billion parameters compared to the 1.5 billion in GPT-2 (117x increase) and training it consumed several thousand petaflop/s-days of computing power. GPT-3 is fed with much more data and tuned with more parameters than GPT-2, and as a result, it has produced some amazing NLP capabilities so far. The volume of data and computing resources required makes it impossible for many organizations to recreate this, but luckily they won’t have to since OpenAI plans to release access via API in the future.

Critical reception

Admittedly, GPT-3 didn’t get much attention until last week’s viral tweets by Sharif Shameem and others (above). They demonstrated that GPT-3 could be used to create websites based on plain English instructions, envisioning a new era of no-code technologies where people can create apps by simply describing them in words. Early adopter Kevin Lacker tested the model with a Turing test and saw amazing results. GPT-3 performed exceptionally well in the initial Q&A and displayed many aspects of “common sense” that AI systems traditionally struggle with.

However, the model is far from perfect. Max Woolf performed a critical analysis noting several issues such as model latency, implementation issues, and concerning biases in the data that need to be re-considered. Several users have reported these issues on Twitter as well:



OpenAI’s blog discusses some of the key drawbacks of the model, most notably that GPT’s entire understanding of the world is based on the texts it was trained on. Case in point: it was trained in October 2019 and therefore does not know about COVID-19. It is unclear how these texts were chosen and what oversight was performed (or required) in this process.

Additionally, the enormous computing resources required to produce and maintain these models raise serious questions about the environmental impact of AI technologies. Although often overlooked, both hardware and software usage significantly contribute to depletion of energy resources, excessive waste generation, and excessive mining of rare earth minerals with the associated negative impacts to human health.

To quell concerns, OpenAI has repeatedly stated its mission to produce AI for the good of humanity and aims to stop access to its API if misuse is detected. Even in it’s beta access form, it asks candidates to describe their intentions with the technology and the benefits and risks to society.


Where do we go from here?

Without a doubt, GPT-3 still represents a major milestone in AI development. Many early users have built impressive apps that accurately process natural language and produce amazing results. In summary:

GPT-3 is a major improvement upon GPT-2 and features far greater accuracy for better use cases. This is a significant step forward for AI development, impressively accomplished in just a two-year time frame
Early tools that have been built on GPT-3 show great promise for commercial usability such as: no-code platforms that allow you to build apps by describing then; advanced search platforms using plain English; and better data analytics tools that make data gathering and processing much faster
OpenAI announced plans to release a commercial API, which will enable organizations to build products powered by GPT-3 at scale. However, many questions remain about how exactly this will be executed — pricing, SLA, model latency, etc.
Users have pointed out several issues that need to be addressed before widespread commercial use. Inherent biases in the model, questions around fairness and ethics, and concerns about misuse (fake news, bots, etc.) need to be thought through and oversight might be necessary
OpenAI is openly committed to creating AI for the benefit of humanity, but still, monitoring for misuse at scale will be difficult to achieve. This raises a broader question about the necessity of government involvement to protect the rights of individuals
All said, I’m extremely excited to see which new technologies are built on GPT-3 and how OpenAI continues to improve on its model. Increased attention and funding in NLP and GPT-3 might be enough to ward off fears from many critics that an AI winter might be coming (myself included). Despite the shortfalls of the model, I am hoping that everyone can be optimistic about a future where humans and machines will communicate with each other in a unified language and the ability to create tools using technology will be accessible to billions of more people.
"""
data.update({"a1": "https://towardsdatascience.com/gpt-what-why-this-groundbreaking-model-is-driving-the-future-of-ai-and-nlp-e38fcf891172"})

### A2 - How Transformers Work

In [8]:
a2 = """If you liked this post and want to learn how machine learning algorithms work, how did they arise, and where are they going, I recommend the following:

Making Things Think: How AI and Deep Learning Power the Products We Use - Holloway
It is the obvious which is so difficult to see most of the time. People say 'It's as plain as the nose on your face.'…
www.holloway.com

Transformers are a type of neural network architecture that have been gaining popularity. Transformers were recently used by OpenAI in their language models, and also used recently by DeepMind for AlphaStar — their program to defeat a top professional Starcraft player.

Transformers were developed to solve the problem of sequence transduction, or neural machine translation. That means any task that transforms an input sequence to an output sequence. This includes speech recognition, text-to-speech transformation, etc..


Sequence transduction. The input is represented in green, the model is represented in blue, and the output is represented in purple. GIF from 3
For models to perform sequence transduction, it is necessary to have some sort of memory. For example let’s say that we are translating the following sentence to another language (French):

“The Transformers” are a Japanese [[hardcore punk]] band. The band was formed in 1968, during the height of Japanese music history”

In this example, the word “the band” in the second sentence refers to the band “The Transformers” introduced in the first sentence. When you read about the band in the second sentence, you know that it is referencing to the “The Transformers” band. That may be important for translation. There are many examples, where words in some sentences refer to words in previous sentences.

For translating sentences like that, a model needs to figure out these sort of dependencies and connections. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have been used to deal with this problem because of their properties. Let’s go over these two architectures and their drawbacks.

Recurrent Neural Networks
Recurrent Neural Networks have loops in them, allowing information to persist.


The input is represented as x_t
In the figure above, we see part of the neural network, A, processing some input x_t and outputs h_t. A loop allows information to be passed from one step to the next.

The loops can be thought in a different way. A Recurrent Neural Network can be thought of as multiple copies of the same network, A, each network passing a message to a successor. Consider what happens if we unroll the loop:


An unrolled recurrent neural network
This chain-like nature shows that recurrent neural networks are clearly related to sequences and lists. In that way, if we want to translate some text, we can set each input as the word in that text. The Recurrent Neural Network passes the information of the previous words to the next network that can use and process that information.

The following picture shows how usually a sequence to sequence model works using Recurrent Neural Networks. Each word is processed separately, and the resulting sentence is generated by passing a hidden state to the decoding stage that, then, generates the output.


GIF from 3
The problem of long-term dependencies
Consider a language model that is trying to predict the next word based on the previous ones. If we are trying to predict the next word of the sentence “the clouds in the sky”, we don’t need further context. It’s pretty obvious that the next word is going to be sky.

In this case where the difference between the relevant information and the place that is needed is small, RNNs can learn to use past information and figure out what is the next word for this sentence.


Image from 6
But there are cases where we need more context. For example, let’s say that you are trying to predict the last word of the text: “I grew up in France… I speak fluent …”. Recent information suggests that the next word is probably a language, but if we want to narrow down which language, we need context of France, that is further back in the text.


Image from 6
RNNs become very ineffective when the gap between the relevant information and the point where it is needed become very large. That is due to the fact that the information is passed at each step and the longer the chain is, the more probable the information is lost along the chain.

In theory, RNNs could learn this long-term dependencies. In practice, they don’t seem to learn them. LSTM, a special type of RNN, tries to solve this kind of problem.

Long-Short Term Memory (LSTM)
When arranging one’s calendar for the day, we prioritize our appointments. If there is anything important, we can cancel some of the meetings and accommodate what is important.

RNNs don’t do that. Whenever it adds new information, it transforms existing information completely by applying a function. The entire information is modified, and there is no consideration of what is important and what is not.

LSTMs make small modifications to the information by multiplications and additions. With LSTMs, the information flows through a mechanism known as cell states. In this way, LSTMs can selectively remember or forget things that are important and not so important.

Internally, a LSTM looks like the following:


Image from 6
Each cell takes as inputs x_t (a word in the case of a sentence to sentence translation), the previous cell state and the output of the previous cell. It manipulates these inputs and based on them, it generates a new cell state, and an output. I won’t go into detail on the mechanics of each cell. If you want to understand how each cell works, I recommend Christopher’s blog post:

Understanding LSTM Networks -- colah's blog
These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that…
colah.github.io

With a cell state, the information in a sentence that is important for translating a word may be passed from one word to another, when translating.

The problem with LSTMs
The same problem that happens to RNNs generally, happen with LSTMs, i.e. when sentences are too long LSTMs still don’t do too well. The reason for that is that the probability of keeping the context from a word that is far away from the current word being processed decreases exponentially with the distance from it.

That means that when sentences are long, the model often forgets the content of distant positions in the sequence. Another problem with RNNs, and LSTMs, is that it’s hard to parallelize the work for processing sentences, since you are have to process word by word. Not only that but there is no model of long and short range dependencies. To summarize, LSTMs and RNNs present 3 problems:

Sequential computation inhibits parallelization
No explicit modeling of long and short range dependencies
“Distance” between positions is linear
Attention
To solve some of these problems, researchers created a technique for paying attention to specific words.

When translating a sentence, I pay special attention to the word I’m presently translating. When I’m transcribing an audio recording, I listen carefully to the segment I’m actively writing down. And if you ask me to describe the room I’m sitting in, I’ll glance around at the objects I’m describing as I do so.

Neural networks can achieve this same behavior using attention, focusing on part of a subset of the information they are given. For example, an RNN can attend over the output of another RNN. At every time step, it focuses on different positions in the other RNN.

To solve these problems, Attention is a technique that is used in a neural network. For RNNs, instead of only encoding the whole sentence in a hidden state, each word has a corresponding hidden state that is passed all the way to the decoding stage. Then, the hidden states are used at each step of the RNN to decode. The following gif shows how that happens.


The green step is called the encoding stage and the purple step is the decoding stage. GIF from 3
The idea behind it is that there might be relevant information in every word in a sentence. So in order for the decoding to be precise, it needs to take into account every word of the input, using attention.

For attention to be brought to RNNs in sequence transduction, we divide the encoding and decoding into 2 main steps. One step is represented in green and the other in purple. The green step is called the encoding stage and the purple step is the decoding stage.


GIF from 3
The step in green in charge of creating the hidden states from the input. Instead of passing only one hidden state to the decoders as we did before using attention, we pass all the hidden states generated by every “word” of the sentence to the decoding stage. Each hidden state is used in the decoding stage, to figure out where the network should pay attention to.

For example, when translating the sentence “Je suis étudiant” to English, requires that the decoding step looks at different words when translating it.


This gif shows how the weight that is given to each hidden state when translating the sentence “Je suis étudiant” to English. The darker the color is, the more weight is associated to each word. GIF from 3
Or for example, when you translate the sentence “L’accord sur la zone économique européenne a été signé en août 1992.” from French to English, and how much attention it is paid to each input.


Translating the sentence “L’accord sur la zone économique européenne a été signé en août 1992.” to English. Image from 3
But some of the problems that we discussed, still are not solved with RNNs using attention. For example, processing inputs (words) in parallel is not possible. For a large corpus of text, this increases the time spent translating the text.

Convolutional Neural Networks
Convolutional Neural Networks help solve these problems. With them we can

Trivial to parallelize (per layer)
Exploits local dependencies
Distance between positions is logarithmic
Some of the most popular neural networks for sequence transduction, Wavenet and Bytenet, are Convolutional Neural Networks.


Wavenet, model is a Convolutional Neural Network (CNN). Image from 10
The reason why Convolutional Neural Networks can work in parallel, is that each word on the input can be processed at the same time and does not necessarily depend on the previous words to be translated. Not only that, but the “distance” between the output word and any input for a CNN is in the order of log(N) — that is the size of the height of the tree generated from the output to the input (you can see it on the GIF above. That is much better than the distance of the output of a RNN and an input, which is on the order of N.

The problem is that Convolutional Neural Networks do not necessarily help with the problem of figuring out the problem of dependencies when translating sentences. That’s why Transformers were created, they are a combination of both CNNs with attention.

Transformers
To solve the problem of parallelization, Transformers try to solve the problem by using Convolutional Neural Networks together with attention models. Attention boosts the speed of how fast the model can translate from one sequence to another.

Let’s take a look at how Transformer works. Transformer is a model that uses attention to boost the speed. More specifically, it uses self-attention.


The Transformer. Image from 4
Internally, the Transformer has a similar kind of architecture as the previous models above. But the Transformer consists of six encoders and six decoders.


Image from 4
Each encoder is very similar to each other. All encoders have the same architecture. Decoders share the same property, i.e. they are also very similar to each other. Each encoder consists of two layers: Self-attention and a feed Forward Neural Network.


Image from 4
The encoder’s inputs first flow through a self-attention layer. It helps the encoder look at other words in the input sentence as it encodes a specific word. The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence.


Image from 4
Self-Attention
Note: This section comes from Jay Allamar blog post

Let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output. As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.


Image taken from 4
Each word is embedded into a vector of size 512. We’ll represent those vectors with these simple boxes.

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512.

In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.


Image from 4
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.

Self-Attention
Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented — using matrices.


Figuring out relation of words within a sentence and giving the right attention to it. Image from 8
The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.


Image taken from 4
Multiplying x1 by the WQ weight matrix produces q1, the “query” vector associated with that word. We end up creating a “query”, a “key”, and a “value” projection of each word in the input sentence.

What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.


Image from 4
The third and forth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper — 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.


Image from 4
This softmax score determines how much how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).


Image from 4
That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.

Multihead attention
Transformers basically work like that. There are a few other details that make them work better. For example, instead of only paying attention to each other in one dimension, Transformers use the concept of Multihead attention.

The idea behind it is that whenever you are translating a word, you may pay different attention to each word based on the type of question that you are asking. The images below show what that means. For example, whenever you are translating “kicked” in the sentence “I kicked the ball”, you may ask “Who kicked”. Depending on the answer, the translation of the word to another language can change. Or ask other questions, like “Did what?”, etc…




Images from 8
Positional Encoding
Another important step on the Transformer is to add positional encoding when encoding each word. Encoding the position of each word is relevant, since the position of each word is relevant to the translation.

Overview
I gave an overview of how Transformers work and why this is the technique used for sequence transduction. If you want to understand in depth how the model works and all its nuances, I recommend the following posts, articles and videos that I used as a base for summarizing the technique

The Unreasonable Effectiveness of Recurrent Neural Networks
Understanding LSTM Networks
Visualizing A Neural Machine Translation Model
The Illustrated Transformer
The Transformer — Attention is all you need
The Annotated Transformer
Attention is all you need attentional neural network models
Self-Attention For Generative Models
OpenAI GPT-2: Understanding Language Generation through Visualization
WaveNet: A Generative Model for Raw Audio
"""
data.update({"a2": "https://towardsdatascience.com/transformers-141e32e69591"})

### A3 - Illustrated Guide to Transformers- Step by Step Explanation

In [9]:
a3 = """Transformers are taking the natural language processing world by storm. These incredible models are breaking multiple NLP records and pushing the state of the art. They are used in many applications like machine language translation, conversational chatbots, and even to power better search engines. Transformers are the rage in deep learning nowadays, but how do they work? Why have they outperform the previous king of sequence problems, like recurrent neural networks, GRU’s, and LSTM’s? You’ve probably heard of different famous transformers models like BERT, GPT, and GPT2. In this post, we’ll focus on the one paper that started it all, “Attention is all you need”.

Check out the link below if you’d like to watch the video version instead.


Attention Mechanism
To understand transformers we first must understand the attention mechanism. The Attention mechanism enables the transformers to have extremely long term memory. A transformer model can “attend” or “focus” on all previous tokens that have been generated.

Let’s walk through an example. Say we want to write a short sci-fi novel with a generative transformer. Using Hugging Face’s Write With Transformer application, we can do just that. We’ll prime the model with our input, and the model will generate the rest.


Our input: “As Aliens entered our planet”.

Transformer output: “and began to colonized Earth, a certain group of extraterrestrials began to manipulate our society through their influences of a certain number of the elite to keep and iron grip over the populace.”

Ok, so the story is a little dark but what’s interesting is how the model generated it. As the model generates the text word by word, it can “attend” or “focus” on words that are relevant to the generated word. The ability to know what words to attend too is all learned during training through backpropagation.


Attention mechanism focusing on different tokens while generating words 1 by 1
Recurrent neural networks (RNN) are also capable of looking at previous inputs too. But the power of the attention mechanism is that it doesn’t suffer from short term memory. RNN’s have a shorter window to reference from, so when the story gets longer, RNN’s can’t access words generated earlier in the sequence. This is still true for Gated Recurrent Units (GRU’s) and Long-short Term Memory (LSTM’s) networks, although they do a bigger capacity to achieve longer-term memory, therefore, having a longer window to reference from. The attention mechanism, in theory, and given enough compute resources, have an infinite window to reference from, therefore being capable of using the entire context of the story while generating the text.


Hypothetical reference window of Attention, RNN’s, GRU’s & LSTM’s
Attention Is All You Need — Step by Step Walkthrough
The attention mechanism’s power was demonstrated in the paper “Attention Is All You Need”, where the authors introduced a new novel neural network called the Transformers which is an attention-based encoder-decoder type architecture.


Transformer Model
On a high level, the encoder maps an input sequence into an abstract continuous representation that holds all the learned information of that input. The decoder then takes that continuous representation and step by step generates a single output while also being fed the previous output.

Let’s walk through an example. The paper applied the Transformer model on a neural machine translation problem. In this post, we’ll demonstrate how it’ll work for a conversational chatbot.

Our Input: “Hi how are you”

Transformer Output: “I am fine”

Input Embeddings
The first step is feeding out input into a word embedding layer. A word embedding layer can be thought of as a lookup table to grab a learned vector representation of each word. Neural networks learn through numbers so each word maps to a vector with continuous values to represent that word.


converting Words to Input Embeddings
Positional Encoding
The next step is to inject positional information into the embeddings. Because the transformer encoder has no recurrence like recurrent neural networks, we must add some information about the positions into the input embeddings. This is done using positional encoding. The authors came up with a clever trick using sin and cosine functions.


We won’t go into the mathematical details of positional encoding, but here are the basics. For every odd index on the input vector, create a vector using the cos function. For every even index, create a vector using the sin function. Then add those vectors to their corresponding input embeddings. This successfully gives the network information on the position of each vector. The sin and cosine functions were chosen in tandem because they have linear properties the model can easily learn to attend to.

Encoder Layer
Now we have the encoder layer. The Encoders layers job is to map all input sequences into an abstract continuous representation that holds the learned information for that entire sequence. It contains 2 sub-modules, multi-headed attention, followed by a fully connected network. There are also residual connections around each of the two sublayers followed by a layer normalization.


Encoder Layer Sub Modules
To break this down, let’s first look at the multi-headed attention module.

Multi-Headed Attention
Multi-headed attention in the encoder applies a specific attention mechanism called self-attention. Self-attention allows the models to associate each word in the input, to other words. So in our example, it’s possible that our model can learn to associate the word “you”, with “how” and “are”. It’s also possible that the model learns that words structured in this pattern are typically a question so respond appropriately.


Encoder Self-Attention Operations. Reference this when looking at Illustrations below.
Query, Key, and Value Vectors
To achieve self-attention, we feed the input into 3 distinct fully connected layers to create the query, key, and value vectors.

What are these vectors exactly? I found a good explanation on stack exchange stating….

“The query key and value concept come from retrieval systems. For example, when you type a query to search for some video on Youtube, the search engine will map your query against a set of keys (video title, description etc.) associated with candidate videos in the database, then present you the best matched videos (values).

Dot Product of Query and Key
After feeding the query, key, and value vector through a linear layer, the queries and keys undergo a dot product matrix multiplication to produce a score matrix.


Dot Product multiplication of the query and the key
The score matrix determines how much focus should a word be put on other words. So each word will have a score that corresponds to other words in the time-step. The higher the score the more focus. This is how the queries are mapped to the keys.


Attention scores from the dot product.
Scaling Down the Attention Scores
Then, the scores get scaled down by getting divided by the square root of the dimension of query and key. This is to allow for more stable gradients, as multiplying values can have exploding effects.


Scaling down the Attention scores
Softmax of the Scaled Scores
Next, you take the softmax of the scaled score to get the attention weights, which gives you probability values between 0 and 1. By doing a softmax the higher scores get heighten, and lower scores are depressed. This allows the model to be more confident about which words to attend too.


Taking the softmax of the scaled scores to get probability values
Multiply Softmax Output with Value vector
Then you take the attention weights and multiply it by your value vector to get an output vector. The higher softmax scores will keep the value of words the model learns is more important. The lower scores will drown out the irrelevant words. Then you feed the output of that into a linear layer to process.


Computing Multi-headed Attention
To make this a multi-headed attention computation, you need to split the query, key, and value into N vectors before applying self-attention. The split vectors then go through the self-attention process individually. Each self-attention process is called a head. Each head produces an output vector that gets concatenated into a single vector before going through the final linear layer. In theory, each head would learn something different therefore giving the encoder model more representation power.


Splitting Q, K, V, N times before applying self-attention
To sum it up, multi-headed attention is a module in the transformer network that computes the attention weights for the input and produces an output vector with encoded information on how each word should attend to all other words in the sequence.

The Residual Connections, Layer Normalization, and Feed Forward Network
The multi-headed attention output vector is added to the original positional input embedding. This is called a residual connection. The output of the residual connection goes through a layer normalization.


Residual connection of the positional input embedding and the output of Multi-headed Attention
The normalized residual output gets projected through a pointwise feed-forward network for further processing. The pointwise feed-forward network is a couple of linear layers with a ReLU activation in between. The output of that is then again added to the input of the pointwise feed-forward network and further normalized.


Residual connection of the input and output of the point-wise feedforward layer.
The residual connections help the network train, by allowing gradients to flow through the networks directly. The layer normalizations are used to stabilize the network which results in substantially reducing the training time necessary. The pointwise feedforward layer is used to project the attention outputs potentially giving it a richer representation.

Encoder Wrap-up
That wraps up the encoder layer. All of these operations are to encode the input to a continuous representation with attention information. This will help the decoder focus on the appropriate words in the input during the decoding process. You can stack the encoder N times to further encode the information, where each layer has the opportunity to learn different attention representations therefore potentially boosting the predictive power of the transformer network.

Decoder Layer
The decoder’s job is to generate text sequences. The decoder has a similar sub-layer as the encoder. it has two multi-headed attention layers, a pointwise feed-forward layer, and residual connections, and layer normalization after each sub-layer. These sub-layers behave similarly to the layers in the encoder but each multi-headed attention layer has a different job. The decoder is capped off with a linear layer that acts as a classifier, and a softmax to get the word probabilities.


Decoder Layer. Reference This diagram while reading.
The decoder is autoregressive, it begins with a start token, and it takes in a list of previous outputs as inputs, as well as the encoder outputs that contain the attention information from the input. The decoder stops decoding when it generates a token as an output.


The decoder is autoregressive as it generates a token 1 at a time while being fed in the previous outputs.
Let’s walk through the decoding steps.

Decoder Input Embeddings & Positional Encoding
The beginning of the decoder is pretty much the same as the encoder. The input goes through an embedding layer and positional encoding layer to get positional embeddings. The positional embeddings get fed into the first multi-head attention layer which computes the attention scores for the decoder’s input.

Decoders First Multi-Headed Attention
This multi-headed attention layer operates slightly differently. Since the decoder is autoregressive and generates the sequence word by word, you need to prevent it from conditioning to future tokens. For example, when computing attention scores on the word “am”, you should not have access to the word “fine”, because that word is a future word that was generated after. The word “am” should only have access to itself and the words before it. This is true for all other words, where they can only attend to previous words.


A depiction of Decoder’s first Multi-headed Attention scaled attention scores. The word “am”, should not any values for the word “fine”. This is true for all other words.
We need a method to prevent computing attention scores for future words. This method is called masking. To prevent the decoder from looking at future tokens, you apply a look ahead mask. The mask is added before calculating the softmax, and after scaling the scores. Let’s take a look at how this works.

Look-Ahead Mask
The mask is a matrix that’s the same size as the attention scores filled with values of 0’s and negative infinities. When you add the mask to the scaled attention scores, you get a matrix of the scores, with the top right triangle filled with negativity infinities.


Adding a look-ahead mask to the scaled scores
The reason for the mask is because once you take the softmax of the masked scores, the negative infinities get zeroed out, leaving zero attention scores for future tokens. As you can see in the figure below, the attention scores for “am”, has values for itself and all words before it but is zero for the word “fine”. This essentially tells the model to put no focus on those words.


This masking is the only difference in how the attention scores are calculated in the first multi-headed attention layer. This layer still has multiple heads, that the mask is being applied to, before getting concatenated and fed through a linear layer for further processing. The output of the first multi-headed attention is a masked output vector with information on how the model should attend on the decoder’s input.


Multi-Headed Attention with Masking
Decoder Second Multi-Headed Attention, and Point-wise Feed Forward Layer
The second multi-headed attention layer. For this layer, the encoder’s outputs are the queries and the keys, and the first multi-headed attention layer outputs are the values. This process matches the encoder’s input to the decoder’s input, allowing the decoder to decide which encoder input is relevant to put a focus on. The output of the second multi-headed attention goes through a pointwise feedforward layer for further processing.

Linear Classifier and Final Softmax for Output Probabilities
The output of the final pointwise feedforward layer goes through a final linear layer, that acts as a classifier. The classifier is as big as the number of classes you have. For example, if you have 10,000 classes for 10,000 words, the output of that classier will be of size 10,000. The output of the classifier then gets fed into a softmax layer, which will produce probability scores between 0 and 1. We take the index of the highest probability score, and that equals our predicted word.


Linear Classifier with Softmax to get the Output Probabilities
The decoder then takes the output, add’s it to the list of decoder inputs, and continues decoding again until a token is predicted. For our case, the highest probability prediction is the final class which is assigned to the end token.

The decoder can also be stacked N layers high, each layer taking in inputs from the encoder and the layers before it. By stacking the layers, the model can learn to extract and focus on different combinations of attention from its attention heads, potentially boosting its predictive power.


Stacked Encoder and Decoder
And That’s It!
And that’s it! That’s the mechanics of the transformers. Transformers leverage the power of the attention mechanism to make better predictions. Recurrent Neural networks try to achieve similar things, but because they suffer from short term memory. Transformers can be better especially if you want to encode or generate long sequences. Because of the transformer architecture, the natural language processing industry can achieve unprecedented results.

Check out michaelphi.com for more content like this.
"""
data.update({"a3": "https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0"})

### A4 - Introduction to Word Embedding and Word2Vec

In [10]:
a4 = """Word embedding is one of the most popular representation of document vocabulary. It is capable of capturing context of a word in a document, semantic and syntactic similarity, relation with other words, etc.

What are word embeddings exactly? Loosely speaking, they are vector representations of a particular word. Having said this, what follows is how do we generate them? More importantly, how do they capture the context?

Word2Vec is one of the most popular technique to learn word embeddings using shallow neural network. It was developed by Tomas Mikolov in 2013 at Google.

Let’s tackle this part by part.

Why do we need them?

Consider the following similar sentences: Have a good day and Have a great day. They hardly have different meaning. If we construct an exhaustive vocabulary (let’s call it V), it would have V = {Have, a, good, great, day}.

Now, let us create a one-hot encoded vector for each of these words in V. Length of our one-hot encoded vector would be equal to the size of V (=5). We would have a vector of zeros except for the element at the index representing the corresponding word in the vocabulary. That particular element would be one. The encodings below would explain this better.

Have = [1,0,0,0,0]`; a=[0,1,0,0,0]` ; good=[0,0,1,0,0]` ; great=[0,0,0,1,0]` ; day=[0,0,0,0,1]` (` represents transpose)

If we try to visualize these encodings, we can think of a 5 dimensional space, where each word occupies one of the dimensions and has nothing to do with the rest (no projection along the other dimensions). This means ‘good’ and ‘great’ are as different as ‘day’ and ‘have’, which is not true.

Our objective is to have words with similar context occupy close spatial positions. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. angle close to 0.


Google Images
Here comes the idea of generating distributed representations. Intuitively, we introduce some dependence of one word on the other words. The words in context of this word would get a greater share of this dependence. In one hot encoding representations, all the words are independent of each other, as mentioned earlier.

How does Word2Vec work?

Word2Vec is a method to construct such an embedding. It can be obtained using two methods (both involving Neural Networks): Skip Gram and Common Bag Of Words (CBOW)

CBOW Model: This method takes the context of each word as the input and tries to predict the word corresponding to the context. Consider our example: Have a great day.

Let the input to the Neural Network be the word, great. Notice that here we are trying to predict a target word (day) using a single context input word great. More specifically, we use the one hot encoding of the input word and measure the output error compared to one hot encoding of the target word (day). In the process of predicting the target word, we learn the vector representation of the target word.

Let us look deeper into the actual architecture.


CBOW Model
The input or the context word is a one hot encoded vector of size V. The hidden layer contains N neurons and the output is again a V length vector with the elements being the softmax values.

Let’s get the terms in the picture right:
- Wvn is the weight matrix that maps the input x to the hidden layer (V*N dimensional matrix)
-W`nv is the weight matrix that maps the hidden layer outputs to the final output layer (N*V dimensional matrix)

I won’t get into the mathematics. We’ll just get an idea of what’s going on.

The hidden layer neurons just copy the weighted sum of inputs to the next layer. There is no activation like sigmoid, tanh or ReLU. The only non-linearity is the softmax calculations in the output layer.

But, the above model used a single context word to predict the target. We can use multiple context words to do the same.


Google images
The above model takes C context words. When Wvn is used to calculate hidden layer inputs, we take an average over all these C context word inputs.

So, we have seen how word representations are generated using the context words. But there’s one more way we can do the same. We can use the target word (whose representation we want to generate) to predict the context and in the process, we produce the representations. Another variant, called Skip Gram model does this.

Skip-Gram model:


This looks like multiple-context CBOW model just got flipped. To some extent that is true.

We input the target word into the network. The model outputs C probability distributions. What does this mean?

For each context position, we get C probability distributions of V probabilities, one for each word.

In both the cases, the network uses back-propagation to learn. Detailed math can be found here

Who wins?

Both have their own advantages and disadvantages. According to Mikolov, Skip Gram works well with small amount of data and is found to represent rare words well.

On the other hand, CBOW is faster and has better representations for more frequent words.

What’s ahead?

The above explanation is a very basic one. It just gives you a high-level idea of what word embeddings are and how Word2Vec works.

There’s a lot more to it. For example, to make the algorithm computationally more efficient, tricks like Hierarchical Softmax and Skip-Gram Negative Sampling are used. All of it can be found here.

Thanks for reading! I have started my personal blog and I don’t intend to write more amazing articles on Medium. Support my blog by subscribing to thenlp.space"""
a4 = a4.replace("→", "->")
data.update({"a4": "https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa"})

### A5 - Your Guide to Natural Language Processing (NLP)

In [11]:
a5 = """Everything we express (either verbally or in written) carries huge amounts of information. The topic we choose, our tone, our selection of words, everything adds some type of information that can be interpreted and value extracted from it. In theory, we can understand and even predict human behaviour using that information.

But there is a problem: one person may generate hundreds or thousands of words in a declaration, each sentence with its corresponding complexity. If you want to scale and analyze several hundreds, thousands or millions of people or declarations in a given geography, then the situation is unmanageable.

Data generated from conversations, declarations or even tweets are examples of unstructured data. Unstructured data doesn’t fit neatly into the traditional row and column structure of relational databases, and represent the vast majority of data available in the actual world. It is messy and hard to manipulate. Nevertheless, thanks to the advances in disciplines like machine learning a big revolution is going on regarding this topic. Nowadays it is no longer about trying to interpret a text or speech based on its keywords (the old fashioned mechanical way), but about understanding the meaning behind those words (the cognitive way). This way it is possible to detect figures of speech like irony, or even perform sentiment analysis.

Natural Language Processing or NLP is a field of Artificial Intelligence that gives the machines the ability to read, understand and derive meaning from human languages.

It is a discipline that focuses on the interaction between data science and human language, and is scaling to lots of industries. Today NLP is booming thanks to the huge improvements in the access to data and the increase in computational power, which are allowing practitioners to achieve meaningful results in areas like healthcare, media, finance and human resources, among others.

Use Cases of NLP
In simple terms, NLP represents the automatic handling of natural human language like speech or text, and although the concept itself is fascinating, the real value behind this technology comes from the use cases.

NLP can help you with lots of tasks and the fields of application just seem to increase on a daily basis. Let’s mention some examples:

NLP enables the recognition and prediction of diseases based on electronic health records and patient’s own speech. This capability is being explored in health conditions that go from cardiovascular diseases to depression and even schizophrenia. For example, Amazon Comprehend Medical is a service that uses NLP to extract disease conditions, medications and treatment outcomes from patient notes, clinical trial reports and other electronic health records.
Organizations can determine what customers are saying about a service or product by identifying and extracting information in sources like social media. This sentiment analysis can provide a lot of information about customers choices and their decision drivers.
An inventor at IBM developed a cognitive assistant that works like a personalized search engine by learning all about you and then remind you of a name, a song, or anything you can’t remember the moment you need it to.
Companies like Yahoo and Google filter and classify your emails with NLP by analyzing text in emails that flow through their servers and stopping spam before they even enter your inbox.
To help identifying fake news, the NLP Group at MIT developed a new system to determine if a source is accurate or politically biased, detecting if a news source can be trusted or not.
Amazon’s Alexa and Apple’s Siri are examples of intelligent voice driven interfaces that use NLP to respond to vocal prompts and do everything like find a particular shop, tell us the weather forecast, suggest the best route to the office or turn on the lights at home.
Having an insight into what is happening and what people are talking about can be very valuable to financial traders. NLP is being used to track news, reports, comments about possible mergers between companies, everything can be then incorporated into a trading algorithm to generate massive profits. Remember: buy the rumor, sell the news.
NLP is also being used in both the search and selection phases of talent recruitment, identifying the skills of potential hires and also spotting prospects before they become active on the job market.
Powered by IBM Watson NLP technology, LegalMation developed a platform to automate routine litigation tasks and help legal teams save time, drive down costs and shift strategic focus.
NLP is particularly booming in the healthcare industry. This technology is improving care delivery, disease diagnosis and bringing costs down while healthcare organizations are going through a growing adoption of electronic health records. The fact that clinical documentation can be improved means that patients can be better understood and benefited through better healthcare. The goal should be to optimize their experience, and several organizations are already working on this.


Number of publications containing the sentence “natural language processing” in PubMed in the period 1978–2018. As of 2018, PubMed comprised more than 29 million citations for biomedical literature
Companies like Winterlight Labs are making huge improvements in the treatment of Alzheimer’s disease by monitoring cognitive impairment through speech and they can also support clinical trials and studies for a wide range of central nervous system disorders. Following a similar approach, Stanford University developed Woebot, a chatbot therapist with the aim of helping people with anxiety and other disorders.

But serious controversy is around the subject. A couple of years ago Microsoft demonstrated that by analyzing large samples of search engine queries, they could identify internet users who were suffering from pancreatic cancer even before they have received a diagnosis of the disease. How would users react to such diagnosis? And what would happen if you were tested as a false positive? (meaning that you can be diagnosed with the disease even though you don’t have it). This recalls the case of Google Flu Trends which in 2009 was announced as being able to predict influenza but later on vanished due to its low accuracy and inability to meet its projected rates.

NLP may be the key to an effective clinical support in the future, but there are still many challenges to face in the short term.

Basic NLP to impress your non-NLP friends
The main drawbacks we face these days with NLP relate to the fact that language is very tricky. The process of understanding and manipulating language is extremely complex, and for this reason it is common to use different techniques to handle different challenges before binding everything together. Programming languages like Python or R are highly used to perform these techniques, but before diving into code lines (that will be the topic of a different article), it’s important to understand the concepts beneath them. Let’s summarize and explain some of the most frequently used algorithms in NLP when defining the vocabulary of terms:

Bag of Words
Is a commonly used model that allows you to count all words in a piece of text. Basically it creates an occurrence matrix for the sentence or document, disregarding grammar and word order. These word frequencies or occurrences are then used as features for training a classifier.

To bring a short example I took the first sentence of the song “Across the Universe” from The Beatles:

Words are flowing out like endless rain into a paper cup,

They slither while they pass, they slip away across the universe

Now let’s count the words:


This approach may reflect several downsides like the absence of semantic meaning and context, and the facts that stop words (like “the” or “a”) add noise to the analysis and some words are not weighted accordingly (“universe” weights less than the word “they”).

To solve this problem, one approach is to rescale the frequency of words by how often they appear in all texts (not just the one we are analyzing) so that the scores for frequent words like “the”, that are also frequent across other texts, get penalized. This approach to scoring is called “Term Frequency — Inverse Document Frequency” (TFIDF), and improves the bag of words by weights. Through TFIDF frequent terms in the text are “rewarded” (like the word “they” in our example), but they also get “punished” if those terms are frequent in other texts we include in the algorithm too. On the contrary, this method highlights and “rewards” unique or rare terms considering all texts. Nevertheless, this approach still has no context nor semantics.

Tokenization
Is the process of segmenting running text into sentences and words. In essence, it’s the task of cutting a text into pieces called tokens, and at the same time throwing away certain characters, such as punctuation. Following our example, the result of tokenization would be:


Pretty simple, right? Well, although it may seem quite basic in this case and also in languages like English that separate words by a blank space (called segmented languages) not all languages behave the same, and if you think about it, blank spaces alone are not sufficient enough even for English to perform proper tokenizations. Splitting on blank spaces may break up what should be considered as one token, as in the case of certain names (e.g. San Francisco or New York) or borrowed foreign phrases (e.g. laissez faire).

Tokenization can remove punctuation too, easing the path to a proper word segmentation but also triggering possible complications. In the case of periods that follow abbreviation (e.g. dr.), the period following that abbreviation should be considered as part of the same token and not be removed.

The tokenization process can be particularly problematic when dealing with biomedical text domains which contain lots of hyphens, parentheses, and other punctuation marks.

For deeper details on tokenization, you can find a great explanation in this article.

Stop Words Removal
Includes getting rid of common language articles, pronouns and prepositions such as “and”, “the” or “to” in English. In this process some very common words that appear to provide little or no value to the NLP objective are filtered and excluded from the text to be processed, hence removing widespread and frequent terms that are not informative about the corresponding text.

Stop words can be safely ignored by carrying out a lookup in a pre-defined list of keywords, freeing up database space and improving processing time.

There is no universal list of stop words. These can be pre-selected or built from scratch. A potential approach is to begin by adopting pre-defined stop words and add words to the list later on. Nevertheless it seems that the general trend over the past time has been to go from the use of large standard stop word lists to the use of no lists at all.

The thing is stop words removal can wipe out relevant information and modify the context in a given sentence. For example, if we are performing a sentiment analysis we might throw our algorithm off track if we remove a stop word like “not”. Under these conditions, you might select a minimal stop word list and add additional terms depending on your specific objective.

Stemming
Refers to the process of slicing the end or the beginning of words with the intention of removing affixes (lexical additions to the root of the word).

Affixes that are attached at the beginning of the word are called prefixes (e.g. “astro” in the word “astrobiology”) and the ones attached at the end of the word are called suffixes (e.g. “ful” in the word “helpful”).

The problem is that affixes can create or expand new forms of the same word (called inflectional affixes), or even create new words themselves (called derivational affixes). In English, prefixes are always derivational (the affix creates a new word as in the example of the prefix “eco” in the word “ecosystem”), but suffixes can be derivational (the affix creates a new word as in the example of the suffix “ist” in the word “guitarist”) or inflectional (the affix creates a new form of word as in the example of the suffix “er” in the word “faster”).

Ok, so how can we tell the difference and chop the right bit?


A possible approach is to consider a list of common affixes and rules (Python and R languages have different libraries containing affixes and methods) and perform stemming based on them, but of course this approach presents limitations. Since stemmers use algorithmics approaches, the result of the stemming process may not be an actual word or even change the word (and sentence) meaning. To offset this effect you can edit those predefined methods by adding or removing affixes and rules, but you must consider that you might be improving the performance in one area while producing a degradation in another one. Always look at the whole picture and test your model’s performance.

So if stemming has serious limitations, why do we use it? First of all, it can be used to correct spelling errors from the tokens. Stemmers are simple to use and run very fast (they perform simple operations on a string), and if speed and performance are important in the NLP model, then stemming is certainly the way to go. Remember, we use it with the objective of improving our performance, not as a grammar exercise.

Lemmatization
Has the objective of reducing a word to its base form and grouping together different forms of the same word. For example, verbs in past tense are changed into present (e.g. “went” is changed to “go”) and synonyms are unified (e.g. “best” is changed to “good”), hence standardizing words with similar meaning to their root. Although it seems closely related to the stemming process, lemmatization uses a different approach to reach the root forms of words.

Lemmatization resolves words to their dictionary form (known as lemma) for which it requires detailed dictionaries in which the algorithm can look into and link words to their corresponding lemmas.

For example, the words “running”, “runs” and “ran” are all forms of the word “run”, so “run” is the lemma of all the previous words.


Lemmatization also takes into consideration the context of the word in order to solve other problems like disambiguation, which means it can discriminate between identical words that have different meanings depending on the specific context. Think about words like “bat” (which can correspond to the animal or to the metal/wooden club used in baseball) or “bank” (corresponding to the financial institution or to the land alongside a body of water). By providing a part-of-speech parameter to a word ( whether it is a noun, a verb, and so on) it’s possible to define a role for that word in the sentence and remove disambiguation.

As you might already pictured, lemmatization is a much more resource-intensive task than performing a stemming process. At the same time, since it requires more knowledge about the language structure than a stemming approach, it demands more computational power than setting up or adapting a stemming algorithm.

Topic Modeling
Is as a method for uncovering hidden structures in sets of texts or documents. In essence it clusters texts to discover latent topics based on their contents, processing individual words and assigning them values based on their distribution. This technique is based on the assumptions that each document consists of a mixture of topics and that each topic consists of a set of words, which means that if we can spot these hidden topics we can unlock the meaning of our texts.

From the universe of topic modelling techniques, Latent Dirichlet Allocation (LDA) is probably the most commonly used. This relatively new algorithm (invented less than 20 years ago) works as an unsupervised learning method that discovers different topics underlying a collection of documents. In unsupervised learning methods like this one, there is no output variable to guide the learning process and data is explored by algorithms to find patterns. To be more specific, LDA finds groups of related words by:

Assigning each word to a random topic, where the user defines the number of topics it wishes to uncover. You don’t define the topics themselves (you define just the number of topics) and the algorithm will map all documents to the topics in a way that words in each document are mostly captured by those imaginary topics.
The algorithm goes through each word iteratively and reassigns the word to a topic taking into considerations the probability that the word belongs to a topic, and the probability that the document will be generated by a topic. These probabilities are calculated multiple times, until the convergence of the algorithm.
Unlike other clustering algorithms like K-means that perform hard clustering (where topics are disjointed), LDA assigns each document to a mixture of topics, which means that each document can be described by one or more topics (e.g. Document 1 is described by 70% of topic A, 20% of topic B and 10% of topic C) and reflect more realistic results.


Topic modeling is extremely useful for classifying texts, building recommender systems (e.g. to recommend you books based on your past readings) or even detecting trends in online publications.

How does the future look like?
At the moment NLP is battling to detect nuances in language meaning, whether due to lack of context, spelling errors or dialectal differences.

On March 2016 Microsoft launched Tay, an Artificial Intelligence (AI) chatbot released on Twitter as a NLP experiment. The idea was that as more users conversed with Tay, the smarter it would get. Well, the result was that after 16 hours Tay had to be removed due to its racist and abusive comments:



Microsoft learnt from its own experience and some months later released Zo, its second generation English-language chatbot that won’t be caught making the same mistakes as its predecessor. Zo uses a combination of innovative approaches to recognize and generate conversation, and other companies are exploring with bots that can remember details specific to an individual conversation.

Although the future looks extremely challenging and full of threats for NLP, the discipline is developing at a very fast pace (probably like never before) and we are likely to reach a level of advancement in the coming years that will make complex applications look possible.

Thanks Jesús del Valle , Jannis Busch and Sabrina Steinert for your valuable inputs

Interested in these topics? Follow me on Linkedin or Twitter
"""
data.update({"a5": "https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1"})

In [12]:
data

{'a1': 'https://towardsdatascience.com/gpt-what-why-this-groundbreaking-model-is-driving-the-future-of-ai-and-nlp-e38fcf891172',
 'a2': 'https://towardsdatascience.com/transformers-141e32e69591',
 'a3': 'https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0',
 'a4': 'https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa',
 'a5': 'https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1'}

In [13]:
a1

'Hype over GPT-3 reached an all-time high on Twitter over the weekend and many are calling the technological development a groundbreaking inflection point for future AI research. In this article I explore what GPT is, what it means for AI development, and where we might be headed from here.\nOpenAI’s GPT-3 language model gained significant attention last week, leading many to believe that the new technology represents a significant inflection point in the development of Natural Language Processing (NLP) tools. Those with early API access through OpenAI’s beta program went to Twitter to showcase impressive early tools built using GPT-3 technology:\n\n\n\n\nFor non-engineers, this may look like magic, but there is a lot to be unpacked here. In this article I will provide a brief overview of GPT and what it can be used for.\n\nWhat is OpenAI and GPT-3?\n\nOpenAI is an AI research laboratory founded in 2015 by Elon Musk, Sam Altman, and others with the mission of creating AI that benefits 

In [14]:
articles = [a1, a2, a3, a4, a5]

## Write Text to Files

In [15]:
def write_to_file(txt, title, remove_newline=False):
    if remove_newline:
        txt = re.sub("\s+", " ", txt)       # Remove newlines to make article's code work
    with open("%s.txt"%(title), "w") as file:
        file.write(txt)

In [16]:
for i, a in enumerate(articles):
    write_to_file(a, "a%d"%(i+1), remove_newline=True)

# Article's Code

In [17]:
txt = """In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That’s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow." The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. This program also included developer-focused AI school that provided a bunch of assets to help build AI skills."""
with open("msft.txt", "w") as file:
    file.write(txt)

In [18]:
def read_article(file_name):
    file = open(file_name, "r")
    filedata = file.readlines()
    article = filedata[0].split(". ")
    sentences = []
    for sentence in article:
        # print(sentence)
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
        # sentences.pop() 
    
    return sentences

In [19]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)

In [20]:
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
    return similarity_matrix

In [21]:
def generate_summary_old(file_name, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []
    # Step 1 - Read text and tokenize
    sentences =  read_article(file_name)
    # Step 2 - Generate Similary Martix across sentences
    sentence_similarity_martix = build_similarity_matrix(sentences, stop_words)
    # Step 3 - Rank sentences in similarity martix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_martix)
    scores = nx.pagerank(sentence_similarity_graph)
    # Step 4 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)    
    print("Indexes of top ranked_sentence order are ", ranked_sentence)
    for i in range(top_n):
        summarize_text.append(" ".join(ranked_sentence[i][1]))
        # Step 5 - Offcourse, output the summarize texr
        print("\nSummarize Text %d: \n%s"%(i, ". ".join(summarize_text)))

In [22]:
# Test on article's data
generate_summary_old("msft.txt", 2)

Indexes of top ranked_sentence order are  [(0.13083383500129075, ['This', 'program', 'also', 'included', 'developer-focused', 'AI', 'school', 'that', 'provided', 'a', 'bunch', 'of', 'assets', 'to', 'help', 'build', 'AI', 'skills.']), (0.1262117608296295, ['Envisioned', 'as', 'a', 'three-year', 'collaborative', 'program,', 'Intelligent', 'Cloud', 'Hub', 'will', 'support', 'around', '100', 'institutions', 'with', 'AI', 'infrastructure,', 'course', 'content', 'and', 'curriculum,', 'developer', 'support,', 'development', 'tools', 'and', 'give', 'students', 'access', 'to', 'cloud', 'and', 'AI', 'services']), (0.11386767381829474, ['The', 'company', 'will', 'provide', 'AI', 'development', 'tools', 'and', 'Azure', 'AI', 'services', 'such', 'as', 'Microsoft', 'Cognitive', 'Services,', 'Bot', 'Services', 'and', 'Azure', 'Machine', 'Learning.According', 'to', 'Manish', 'Prakash,', 'Country', 'General', 'Manager-PS,', 'Health', 'and', 'Education,', 'Microsoft', 'India,', 'said,', '"With', 'AI', '

# Assignment Code/Modified Article Code

## Load Files

In [23]:
# Effectivly unmodified (removed print and a broken list pop)
def read_article(file_name):
    file = open(file_name, "r")
    filedata = file.readlines()
    article = filedata[0].split(". ")
    sentences = []
    for sentence in article:
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    
    return sentences

## Clean Text

In [24]:
from nltk.stem.wordnet import WordNetLemmatizer
def clean_text(sentence):

    # Recombine sentence
    sent = " ".join(sentence)

    # Lowercase
    sent = sent.lower()

    # Drop possessives
    sent = re.sub("('s)+", "", sent)

    # Drop punctuation
    punct = list("!@#$%^&*(),./<>?;:'\"[]\{}|`~-_")
    sent = re.sub("([^a-zA-Z0-9 ])", "", sent)

    # Change numbers for single token
    sent = re.sub("[0-9]+", "<NUM>", sent)

    # Re-split sentence
    sent = re.sub("\s+", " ", sent.strip())
    sent = sent.split(" ")

    # Lemmatize words and remove stopwords
    lemmatizer = WordNetLemmatizer()
    for i, w in enumerate(sent):
        sent[i] = lemmatizer.lemmatize(w, pos="v")

    return sent

## Sentence Similarity

In [25]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
    stopwords.append("<NUM>")
    
    # Clean text
    sent1 = clean_text(sent1)
    sent2 = clean_text(sent2)
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)

In [26]:
# Unmodified
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
    return similarity_matrix

## Genereate Summary

In [27]:
# Unmodified
def generate_summary_update(file_name, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []
    # Step 1 - Read text and tokenize
    sentences =  read_article(file_name)
    # Step 2 - Generate Similary Martix across sentences
    sentence_similarity_martix = build_similarity_matrix(sentences, stop_words)
    # Step 3 - Rank sentences in similarity martix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_martix)
    scores = nx.pagerank(sentence_similarity_graph)
    # Step 4 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)    
    print("Indexes of top ranked_sentence order are ", ranked_sentence)
    for i in range(top_n):
        summarize_text.append(" ".join(ranked_sentence[i][1]))
        # Step 5 - Offcourse, output the summarize texr
        print("\nSummarize Text %d: \n%s"%(i, ". ".join(summarize_text)))

In [28]:
# Test on article's data
generate_summary_update("msft.txt", 2)

Indexes of top ranked_sentence order are  [(0.13873172763348482, ['This', 'program', 'also', 'included', 'developer-focused', 'AI', 'school', 'that', 'provided', 'a', 'bunch', 'of', 'assets', 'to', 'help', 'build', 'AI', 'skills.']), (0.12941619104824872, ['Envisioned', 'as', 'a', 'three-year', 'collaborative', 'program,', 'Intelligent', 'Cloud', 'Hub', 'will', 'support', 'around', '100', 'institutions', 'with', 'AI', 'infrastructure,', 'course', 'content', 'and', 'curriculum,', 'developer', 'support,', 'development', 'tools', 'and', 'give', 'students', 'access', 'to', 'cloud', 'and', 'AI', 'services']), (0.10637727120952348, ['The', 'company', 'will', 'provide', 'AI', 'development', 'tools', 'and', 'Azure', 'AI', 'services', 'such', 'as', 'Microsoft', 'Cognitive', 'Services,', 'Bot', 'Services', 'and', 'Azure', 'Machine', 'Learning.According', 'to', 'Manish', 'Prakash,', 'Country', 'General', 'Manager-PS,', 'Health', 'and', 'Education,', 'Microsoft', 'India,', 'said,', '"With', 'AI', 

# Generate Summaries Using Both Unmodified and Modified Code

In [29]:
for i, a in enumerate(articles):
    print("Article a%d: =======================================================================\n\nOld: ------------------------------------------------------------------------------\n"%(i+1))
    generate_summary_old("a%d.txt"%(i+1))
    print("\nUpdated: ------------------------------------------------------------------------\n")
    generate_summary_update("a%d.txt"%(i+1))
    print("\n\n")


Old: ------------------------------------------------------------------------------

Indexes of top ranked_sentence order are  [(0.04229507258112702, ['Applying', 'this', 'strategy', 'to', 'AI', 'means', 'that', 'we', 'can', 'use', 'pre-trained', 'models', 'to', 'create', 'new', 'models', 'more', 'quickly', 'with', 'less', 'training', 'data']), (0.03413848933830867, ['To', 'solve', 'this,', 'scientists', 'have', 'used', 'an', 'approach', 'called', 'transfer', 'learning:', 'use', 'the', 'existing', 'representations/information', 'learned', 'in', 'a', 'previously-trained', 'model', 'as', 'a', 'starting', 'point', 'to', 'fine-tune', 'and', 'train', 'a', 'new', 'model', 'for', 'a', 'different', 'task']), (0.0331916180902335, ['Historically,', 'obtaining', 'large', 'quantities', 'of', 'labelled', 'data', 'to', 'use', 'to', 'train', 'models', 'has', 'been', 'a', 'major', 'barrier', 'in', 'NLP', 'development', '(and', 'AI', 'development', 'in', 'general)']), (0.033142312522327765, ['OpenAI’s

# Answered Questions

## 1. What are the two main strategies used in text summarization?

Extractive and abstractive summarization. 

Extractive summarization calculates and selects the top representative segments of a given document/corpus and uses the segments as a summary. This usually produces "summaries" that don't function well as summaries, but do make structural sense linguistically and are consistent with the information in the document.

Abstractive summarization generates new text based on the document/corpus given. Conceptually this is what we consider a proper summary to be, however this is a much more difficult research problem that so far has had significant difficulty generating significant strings of text that don't devolve into gibberish. 

## 2. Which feature is used in the text summarization code? Explain how to calculate it.

The feature being used is a count vector. This is calculated by first creating a vector that is as long as the entire dictionary for the document/corpus, with each value being a 0. Then the string is examined, and for each time a token is encountered 1 is added to that token's place in the vector. The end result is a vector that contains the count of all the tokens used in the string out of all of the tokens that could have possibly been used.

## 3. What is the similarity measurement method used in this code?

Cosine similarity.

## 4. We know in ICE-1, TF-IDF is used as the text feature. Can we use it in this code? 

Technically yes, but we shouldn't. Since TF-IDF features scale based on the document in question, we would only be looking at a dictionary of tokens for each document individually, not for each corpus. Since in this example our corpus consists of two strings, and each document is only one string, then the TF-IDF would amount to a dictionary and counts of only the words in each string, set most of the vectors to a flat shape with low values. This would interfere with the cosine similarity calculations, and is more work than necessary.

## 5. Compare the outputs above. Are they the same or not? Please analyze the comparison result.

I got results that were identical to the unmodified code. Given the features involved and the articles I chose, this makes sense. This code uses token counts to calculate a similarity score. The only thing that could significantly change the similarity scores would be a change in the token counts Given the nature of the text preprocessing methods used, the only thing likely to change the counts would be the lemmatization. This could have potentially collapsed some counts, thereby reducing the vocabulary and upping some token's counts. However, I chose some technical machine learning/NLP articles to work with. Since the terms used in these are usually fairly exacting and these kinds of articles don't expound a lot or include fluff, it would make sense that the lemmatization collapse very few vocabulary terms, which would in turn have next to no effect on the similarity scores.