# NLP (Natural Language Processing)

Via the site 'Beingdatum.com' I followed the course 'Guide on Deep Learning for NLP'.

This notebook is a summary of that course, which I will use as a reference when questions come up in an NLP project.

In [1]:
!pip install nltk



In [2]:
import nltk
#nltk.download()

## 1. Tokenization
The process of segmenting running text into words and sentences

### Manually tokenize a textfile 

In [3]:
filename = 'filename.txt'  # a text file in which we have written 'Hello World'
with open(filename, 'rt') as file:
    text = file.read()
# split into words by white space
words = text.split()
# convert to lowercase
words = [word.lower() for word in words]
print(words)

['hello', 'world']


### Tokenize a textfile with NLTK (Natural Language Toolkit)

In [4]:
with open(filename, 'rt') as file:
    text = file.read()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens)

['Hello', 'World']
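
The two approaches above give the same tokens (apart from the lowercasing) only because the sample file contains no punctuation. The sketch below illustrates the difference with the standard-library `re` module as a stand-in for a real tokenizer; the sample sentence and the pattern are assumptions made for illustration and are not NLTK's actual rules.

```python
import re

sample = "Hello, World! It's tokenization."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# A simple pattern: a run of word characters (optionally with an internal
# apostrophe), or any single character that is neither a word character nor a space.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `Hello,` and `World!` stay single tokens; the pattern-based version separates the punctuation, which is closer to what a dedicated tokenizer does.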


## 2. Create a Bag of Words model

To create the bag of words model, we build a matrix in which each row corresponds to a document or sentence and each column corresponds to one of the most frequent words in our vocabulary.

### Step 1 Tokenize a text:

In [6]:
text = 'Deep learning methods are popular for natural language, primarily because they are delivering on their promise. Some of the first large demonstrations of the power of deep learning were in natural language processing, specifically speech recognition. More recently in machine translation.'

We will first preprocess the data, in order to:

- Convert text to lower case.
- Replace all non-word characters, including punctuation, with spaces.
- Collapse repeated whitespace into a single space.

In [7]:
# import all needed libraries:
import nltk
import re
import numpy as np
dataset = nltk.sent_tokenize(text)  # split the text into three sentences
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()  # all text in lowercase
    dataset[i] = re.sub(r'\W', ' ', dataset[i])  # replace every non-word character (including punctuation) with a space
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])  # collapse runs of whitespace into a single space

In [8]:
#output
dataset

['deep learning methods are popular for natural language primarily because they are delivering on their promise ',
 'some of the first large demonstrations of the power of deep learning were in natural language processing specifically speech recognition ',
 'more recently in machine translation ']

### Step 2 Obtaining most frequent words in our text: 


We will apply the following steps to generate our model:
- We declare a dictionary to hold our bag of words. 
- Next we tokenize each sentence to words. 
- Now for each word in a sentence, we check if the word exists in our dictionary.
- If it does, then we increment its count by 1. If it doesn’t, we add it to our dictionary and set its count as 1.

In [16]:
word2count = {} 
for data in dataset: 
    words = nltk.word_tokenize(data) 
    for word in words: 
        if word not in word2count:
            word2count[word] = 1
        else: 
            word2count[word] += 1
#show the output
import pandas as pd
table = pd.DataFrame(word2count, index=[0])
table.T

| word | count |
|---|---|
| deep | 2 |
| learning | 2 |
| methods | 1 |
| are | 2 |
| popular | 1 |
| for | 1 |
| natural | 2 |
| language | 2 |
| primarily | 1 |
| because | 1 |


Our text contains a total of 41 word tokens (32 distinct words). However, when processing large texts, the number of distinct words can reach into the millions. We do not need all of them, so we keep only a fixed number of the most frequently used words. To implement this we use:

In [19]:
import heapq
freq_words = heapq.nlargest(20, word2count, key=word2count.get) #20 denotes the number of words we want. In larger datasets we can set this to larger numbers
freq_words

['of',
 'deep',
 'learning',
 'are',
 'natural',
 'language',
 'the',
 'in',
 'methods',
 'popular',
 'for',
 'primarily',
 'because',
 'they',
 'delivering',
 'on',
 'their',
 'promise',
 'some',
 'first']

### Step 3 Building the Bag of Words model:
In this step we construct one vector per sentence, which tells us for each frequent word whether it occurs in that sentence. If a frequent word occurs in the sentence, we set its entry to 1; otherwise we set it to 0.

This can be implemented with the following code:

In [20]:
X = []
for data in dataset:
    vector = []
    for word in freq_words:
        if word in nltk.word_tokenize(data):
            vector.append(1)
        else:
            vector.append(0)
    X.append(vector)
X = np.asarray(X)

X

array([[0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
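
The three steps above can also be sketched with the standard library only. This is a compact illustration under assumptions, not the course's code: the three short sentences are made up, and `collections.Counter` replaces the manual dictionary and `heapq` selection.

```python
import re
from collections import Counter

# Hypothetical, already lowercased and punctuation-free sentences.
sentences = [
    "deep learning methods are popular",
    "deep learning were in natural language processing",
    "more recently in machine translation",
]

# Step 1: tokenize each sentence.
tokenized = [re.findall(r"\w+", s) for s in sentences]

# Step 2: count every token and keep the three most frequent words.
counts = Counter(word for tokens in tokenized for word in tokens)
vocab = [word for word, _ in counts.most_common(3)]

# Step 3: one row per sentence, one column per vocabulary word (1 = present).
matrix = [[1 if word in tokens else 0 for word in vocab] for tokens in tokenized]

print(vocab)
print(matrix)
```

Each row has as many entries as the vocabulary, regardless of how long the sentence is, which is what makes the representation usable as fixed-size model input.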

## 3. N-Grams

N-grams are used extensively in text mining and natural language processing tasks. An n-gram is a sequence of N co-occurring words within a given window; when computing the n-grams you typically move the window one word forward at a time (although you can move X words forward in more advanced scenarios).
- When N=1, we speak of unigrams; these are essentially the individual words in a sentence.
- When N=2, we speak of bigrams.
- When N=3, we speak of trigrams.
- When N>3, we usually speak of four-grams, five-grams and so on.

In [21]:
import latex

#### How many N-grams in a sentence?
If X is the number of words in a given sentence K, the number of N-grams for sentence K is:

$\text{Ngrams}_K = X - (N - 1)$

In [22]:
#Python code for N-grams
def generate_ngrams(text, n):
    # split the text into tokens on whitespace
    tokens = re.split(r"\s+", text)
    ngrams = []
    # collect the n-grams
    for i in range(len(tokens) - n + 1):
        temp = [tokens[j] for j in range(i, i + n)]
        ngrams.append(" ".join(temp))
    return ngrams


In [24]:
generate_ngrams('This is sparta', 2)

['This is', 'is sparta']
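
As a cross-check of the counting formula X - (N - 1), here is an alternative sketch that builds the n-grams with `zip` over shifted copies of the token list. The function name `ngrams` and the example sentence are chosen for illustration only.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples, sliding one token at a time."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the man jumped over the wall".split()
for n in (1, 2, 3):
    grams = ngrams(tokens, n)
    # A sentence of X words yields X - (N - 1) n-grams.
    assert len(grams) == len(tokens) - (n - 1)
    print(n, len(grams), grams)
```

For the six-word sentence this gives 6 unigrams, 5 bigrams and 4 trigrams, in line with the formula.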

## 4. Word Embedding

Word embedding is a dense representation of words in the form of numeric vectors.

The word embedding representation is able to reveal many hidden relationships between words. For example, vector(“cat”) – vector(“kitten”) is similar to vector(“dog”) – vector(“puppy”).
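
To make "similar" concrete: vectors are usually compared with cosine similarity. The sketch below uses hand-picked, hypothetical 3-dimensional vectors (real embeddings are learned and have many more dimensions) to show that the two difference vectors point in the same direction.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this illustration only.
vectors = {
    "cat":    [0.9, 0.1, 0.8],
    "kitten": [0.9, 0.1, 0.2],
    "dog":    [0.1, 0.9, 0.8],
    "puppy":  [0.1, 0.9, 0.2],
}

def diff(a, b):
    """vector(a) - vector(b), element by element."""
    return [x - y for x, y in zip(vectors[a], vectors[b])]

similarity = cosine(diff("cat", "kitten"), diff("dog", "puppy"))
print(similarity)
```

In these toy vectors the third dimension plays the role of "adult vs. young", so both differences lie along that axis and their cosine similarity is (numerically) 1.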

### Why do we use word embedding?
Words aren’t things that computers naturally understand. By encoding them in a numeric form, we can apply mathematical rules and matrix operations to them. This makes embeddings especially useful in machine learning.

Take deep learning for example. By encoding words in a numerical form, we can take many deep learning architectures and apply them to words. Convolutional neural networks have been applied to NLP tasks using word embeddings and have achieved state-of-the-art performance on many tasks.

Even better, we can pre-train word embeddings that are applicable to many tasks.

### Examples of word embedding:
-  <span style="text-decoration: underline">**One-Hot-Encoding (Count Vectorizing)**</span>:

Create a vector that has as many dimensions as your corpus has unique words. Each unique word has a unique dimension and will be represented by a 1 in that dimension with 0s everywhere else.
- <span style="text-decoration: underline">**TF-IDF Transform**</span>:

TF-IDF vectors are related to one-hot encoded vectors. However, instead of just recording whether a word is present or how often it occurs, they weight each word: a word is represented by its term frequency multiplied by its inverse document frequency.
In simpler terms, words that occur a lot but everywhere should be given very little weight. Think of words like “the” or “and” in the English language: they don’t provide a large amount of value.
However, if a word appears rarely, or appears frequently but only in one or two documents, then it is probably more important and should be weighted as such.
Again, this suffers from the downside of very high dimensional representations that don’t capture semantic relatedness.
- <span style="text-decoration: underline">**Co-Occurrence Matrix**</span>:

A co-occurrence matrix is exactly what it sounds like: a giant matrix that is as long and as wide as the vocabulary size. If words occur together, they are marked with a positive entry. Otherwise, they have a 0.
It boils down to a numeric representation that simply asks the question “Do words occur together? If yes, then count this.”
And what can we already see becoming a big problem? A super large representation! If we thought that one-hot encoding was high dimensional, then co-occurrence is high dimensional squared. That’s a lot of data to store in memory.
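
The co-occurrence idea can be sketched in a few lines of plain Python. The corpus, the function name `cooccurrence` and the window size are assumptions for illustration; real implementations work on far larger vocabularies and usually store the matrix sparsely.

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only to the right and record the pair in both directions,
            # so the resulting matrix is symmetric.
            for other in tokens[i + 1 : i + 1 + window]:
                counts[(word, other)] += 1
                counts[(other, word)] += 1
    return counts

corpus = ["deep learning for natural language", "deep learning for speech"]
counts = cooccurrence(corpus, window=1)

# Dense vocabulary-by-vocabulary matrix: its size grows with the square of the vocabulary.
vocab = sorted({word for pair in counts for word in pair})
matrix = [[counts.get((row, col), 0) for col in vocab] for row in vocab]

print(vocab)
for row in matrix:
    print(row)
```

Even this tiny corpus with six distinct words already needs a 6 x 6 matrix, most of which is zeros.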

Advantages of Co-occurrence Matrix

- It preserves the semantic relationship between words, i.e. man and woman tend to be closer than man and apple.
- It uses SVD at its core, which tends to produce more accurate word vector representations than the count-based encodings above.
- It uses factorization, which is a well-defined problem and can be solved efficiently.
- It has to be computed only once and can be reused anytime afterwards. In this sense, it is faster in comparison to others.

Disadvantages of Co-Occurrence Matrix

- It requires a huge amount of memory to store the co-occurrence matrix. This problem can be circumvented by factorizing the matrix outside the system, for example on Hadoop clusters, and saving the result.

#### Example using count vectorizing / one-hot-encoding

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

# these are two book titles
sample_text = ["Machine Learning: Introduction to Machine learning and hands-on experience on the various applications of ML",
"Deep Learning: Introduction to Deep learning & NLP"]

# CountVectorizer plain and simple: to create one, we simply instantiate it.
# The documents go to fit_transform, not to the constructor: its first parameter
# is `input`, and recent scikit-learn versions reject a positional argument there.
cv = CountVectorizer()
count_vector = cv.fit_transform(sample_text)

What happens above is that the 2 books titles are preprocessed, tokenized and represented as a sparse matrix as explained in the introduction. By default, CountVectorizer does the following:

- lowercases your text (set lowercase=False if you don’t want lowercasing)
- uses utf-8 encoding
- performs tokenization (converts raw text to smaller units of text)
- uses word level tokenization (meaning each word is treated as a separate token)
- ignores single characters during tokenization (so say bye bye to words like ‘a’ and ‘I’)

Now, let’s look at the vocabulary (collection of unique words from our documents):

In [26]:
cv.vocabulary_

{'machine': 7,
 'learning': 6,
 'introduction': 5,
 'to': 13,
 'and': 0,
 'hands': 4,
 'on': 11,
 'experience': 3,
 'the': 12,
 'various': 14,
 'applications': 1,
 'of': 10,
 'ml': 8,
 'deep': 2,
 'nlp': 9}

As we are using all the defaults, these are all word level tokens, lower-cased. Note that the numbers here are not counts, they are the position in the sparse vector.

In [27]:
#let's check the shape:
count_vector.shape

(2, 15)

We have two rows (two titles) and 15 unique words!

### CountVectorizer and Stop Words

The first thing you may want to do is eliminate stop words from your text, as they have limited predictive power and may not help with downstream tasks such as text classification. Stop word removal is a breeze with CountVectorizer and can be done in several ways:

- Use a custom stop word list that you provide
- Use sklearn’s built in English stop word list (not recommended)
- Create corpora specific stop words using max_df and min_df (highly recommended)

Now let’s look at these 3 ways of using stop words.

In [28]:
#Custom stop word list:
cv = CountVectorizer(stop_words=["all","on","the","is","and","to"])
count_vector=cv.fit_transform(sample_text)
count_vector.shape

(2, 11)

In [29]:
#the shape has changed from 15 unique words to 11 unique words because the stop words have been removed. 
#let's see what python has remembered as the stop word list:
cv.stop_words

# Note that we can actually load stop words directly from a file into a list and supply that as the stop word list.

['all', 'on', 'the', 'is', 'and', 'to']

#### Stop Words using MIN_DF:

The goal of MIN_DF is to ignore words that have too few occurrences to be considered meaningful. For example, your text may contain names of people that appear in only one or two documents. In some applications, this may qualify as noise and can be eliminated from further analysis.

Instead of using a minimum term frequency (total occurrences of a word) to eliminate words, MIN_DF looks at _**how many documents contained a term**_, better known as _**document frequency**_. The MIN_DF value can be an _**absolute value**_ (e.g. 1, 2, 3, 4) or a _**value representing a proportion of documents**_ (e.g. 0.25, meaning: ignore words that appear in fewer than 25% of the documents).

In [31]:
#Eliminating words that appeared in fewer than 2 documents:

cv = CountVectorizer(min_df=2)
count_vector=cv.fit_transform(sample_text)
cv.stop_words_

{'and',
 'applications',
 'deep',
 'experience',
 'hands',
 'machine',
 'ml',
 'nlp',
 'of',
 'on',
 'the',
 'various'}

In [32]:
#To see what’s remaining, all we need to do is check the vocabulary again
cv.vocabulary_

{'learning': 1, 'introduction': 0, 'to': 2}
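
The document-frequency filter can be reproduced by hand, which makes it clearer why only three words survive. This is a sketch under assumptions: the regular expression only approximates CountVectorizer's default tokenization, and the names `tokenize`, `doc_freq`, `kept` and `dropped` are invented here.

```python
import re
from collections import Counter

titles = [
    "Machine Learning: Introduction to Machine learning and hands-on "
    "experience on the various applications of ML",
    "Deep Learning: Introduction to Deep learning & NLP",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Document frequency: each word counts once per title, however often it repeats.
doc_freq = Counter(word for title in titles for word in set(tokenize(title)))

min_df = 2
kept = sorted(word for word, df in doc_freq.items() if df >= min_df)
dropped = sorted(word for word, df in doc_freq.items() if df < min_df)

print(kept)
print(dropped)
```

Note that "machine" occurs twice in the first title but still has a document frequency of 1, so it is dropped; term frequency plays no role in this filter.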

#### Stop Words using MAX_DF:

Just as we ignored words that were too rare with MIN_DF, we can ignore words that are too common with MAX_DF. 

MAX_DF looks at _**how many documents contained a term**_, and if this exceeds the MAX_DF threshold, the term is eliminated from consideration. The MAX_DF value can be an _**absolute value**_ (e.g. 1, 2, 3, 4) or a _**value representing a proportion of documents**_ (e.g. 0.85, meaning: ignore words that appear in more than 85% of the documents, as they are too common).

In [34]:
cv = CountVectorizer(max_df=0.50)
count_vector=cv.fit_transform(sample_text)
cv.stop_words_

{'introduction', 'learning', 'to'}

The eliminated words are listed in cv.stop_words_ (see the output above).

In this example, all the words that appeared in both book titles have been eliminated.
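
MAX_DF removes overly common words outright. The TF-IDF transform listed earlier applies the same document-frequency idea as a soft weight instead of a hard cut-off. The sketch below uses the plain textbook formula tf x log(N / df) on a made-up corpus; library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Weight each word by (share of the document) * log(N / document frequency)."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:5s} {weight:.3f}")
```

"the" occurs in every document and therefore gets weight 0, while "mat", which occurs in only one document, gets the highest weight in the first document.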

### Prediction based methods:

#### 1. Continuous Bag of Words(CBOW) model
CBOW learns to predict a word from its context. The context may be a single word or multiple words for a given target word.

- Example text “The man jumped over the wall.”

CBOW's approach is to treat {“The”, “man”, “over”, “the”, “wall”} as the context and, from these words, predict or generate the center word “jumped”.

#### 2. Skip-gram model

- Example text “The man jumped over the wall.”

Skip-gram's approach is to create a model such that, given the center word “jumped”, the model is able to predict or generate the surrounding words “The”, “man”, “over”, “the”, “wall”. Here “jumped” is the target (center) word and the surrounding words are its context.

Advantages/Disadvantages of CBOW and Skip-gram:

- Being probabilistic in nature, these methods are generally expected to outperform deterministic methods.
- They are low on memory: they don’t have the huge RAM requirements of a co-occurrence matrix, which needs to store three huge matrices.
- Though CBOW (predict the target from its context) and skip-gram (predict the context words from the target) are simply inversions of each other, each has its own advantages and disadvantages. Since CBOW can use many context words to predict a single target word, it can essentially _**smooth out**_ over the distribution. This acts like regularization and offers very good performance when the input data is not so large. The skip-gram model, however, is more _**fine grained**_, so it can extract more information and produce more accurate embeddings when we have a large dataset (large data is always the best regularizer). Skip-gram with negative sampling is generally reported to outperform the other methods.
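
The difference between the two models is easiest to see in the training examples each one is fed. The sketch below only generates those examples from the sample sentence; it does not train anything. The helper `context_windows` and the window size of 2 are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (center word, list of surrounding words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window) : i]
        right = tokens[i + 1 : i + 1 + window]
        yield center, left + right

tokens = "the man jumped over the wall".split()

# CBOW: the context words are the input, the center word is the label.
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: the center word is the input, each context word is a separate label.
skipgram_examples = [
    (center, neighbour)
    for center, context in context_windows(tokens)
    for neighbour in context
]

print(cbow_examples[2])
print([pair for pair in skipgram_examples if pair[0] == "jumped"])
```

CBOW produces one example per position, whereas skip-gram produces one example per (center, neighbour) pair, which is one way to see why skip-gram is described as more fine grained.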

## Word2Vec using Gensim library

Let’s create a corpus using a single wikipedia article. To do so, we need to scrape wikipedia using BeautifulSoup.

In [38]:
!pip install lxml
#Used to parse XML and HTML



In [40]:
#The article we are going to scrape is the Wikipedia article on Artificial Intelligence. 
#Let’s write a Python Script to scrape the article from Wikipedia

import bs4 as bs
import urllib.request
import re
import nltk

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text

In the script above, we first download the Wikipedia article using the urlopen function of the urllib.request module. We then read the article content and parse it using an object of the BeautifulSoup class. Wikipedia stores the text content of the article inside p tags. We use the find_all function of the BeautifulSoup object to fetch all the contents from the paragraph tags of the article.

Finally, we join all the paragraphs together and store the scraped article in the article_text variable for later use.

**Preprocessing**

The next step is to preprocess the content for the Word2Vec model. The following script preprocesses the text:

In [41]:
# Cleaning the text
processed_article = article_text.lower()
processed_article = re.sub('[^a-zA-Z]', ' ', processed_article)
processed_article = re.sub(r'\s+', ' ', processed_article)

# Preparing the dataset
# Note: the cleaning step above also removed the full stops, so
# sent_tokenize returns the whole article as one long "sentence".
all_sentences = nltk.sent_tokenize(processed_article)

all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

# Removing stop words (requires nltk.download('stopwords') once)
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # load the list once instead of once per word
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stop_words]

In [44]:
print(all_words)

[['computer', 'science', 'artificial', 'intelligence', 'ai', 'sometimes', 'called', 'machine', 'intelligence', 'intelligence', 'demonstrated', 'machines', 'contrast', 'natural', 'intelligence', 'displayed', 'humans', 'leading', 'ai', 'textbooks', 'define', 'field', 'study', 'intelligent', 'agents', 'device', 'perceives', 'environment', 'takes', 'actions', 'maximize', 'chance', 'successfully', 'achieving', 'goals', 'colloquially', 'term', 'artificial', 'intelligence', 'often', 'used', 'describe', 'machines', 'computers', 'mimic', 'cognitive', 'functions', 'humans', 'associate', 'human', 'mind', 'learning', 'problem', 'solving', 'machines', 'become', 'increasingly', 'capable', 'tasks', 'considered', 'require', 'intelligence', 'often', 'removed', 'definition', 'ai', 'phenomenon', 'known', 'ai', 'effect', 'quip', 'tesler', 'theorem', 'says', 'ai', 'whatever', 'done', 'yet', 'instance', 'optical', 'character', 'recognition', 'frequently', 'excluded', 'things', 'considered', 'ai', 'become', 

In the script above, we converted all the text to lowercase and then removed digits, special characters and extra spaces. After preprocessing, we are left with words only. The Word2Vec model is trained on a collection of words. First, we convert the article into sentences with the nltk.sent_tokenize utility, and then into words with the nltk.word_tokenize utility. As a last step, we remove stop words such as in, at and on.

After the script completes its execution, the all_words object contains the list of all the words in the article. We will use this list to create our Word2Vec model with the Gensim library.

**Creating Gensim model**

The word list is passed to the Word2Vec class of the gensim.models package. We need to specify the value for the min_count parameter. A value of 2 for min_count specifies to include only those words in the Word2Vec model that appear at least twice in the corpus. 

In [46]:
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/09/ed/b59a2edde05b7f5755ea68648487c150c7c742361e9c8733c6d4ca005020/gensim-3.8.1-cp37-cp37m-win_amd64.whl (24.2MB)

In [47]:
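from gensim.models import Word2Vec

word2vec = Word2Vec(all_words, min_count=2)

# gensim 3.x API (this notebook installed gensim 3.8.1);
# in gensim 4 and later the vocabulary is exposed as word2vec.wv.key_to_index
vocabulary = word2vec.wv.vocab
print(vocabulary)
# The output is the set of unique words that appear at least twice.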

{'computer': <gensim.models.keyedvectors.Vocab object at 0x000002114D9A4550>, 'science': <gensim.models.keyedvectors.Vocab object at 0x000002114F225B70>, 'artificial': <gensim.models.keyedvectors.Vocab object at 0x000002114F226160>, 'intelligence': <gensim.models.keyedvectors.Vocab object at 0x000002114F226048>, 'ai': <gensim.models.keyedvectors.Vocab object at 0x000002114F707588>, 'sometimes': <gensim.models.keyedvectors.Vocab object at 0x000002114F7075C0>, 'called': <gensim.models.keyedvectors.Vocab object at 0x000002114F817EF0>, 'machine': <gensim.models.keyedvectors.Vocab object at 0x000002114A479080>, 'demonstrated': <gensim.models.keyedvectors.Vocab object at 0x000002114A5364A8>, 'machines': <gensim.models.keyedvectors.Vocab object at 0x000002114A536518>, 'contrast': <gensim.models.keyedvectors.Vocab object at 0x000002114A536550>, 'natural': <gensim.models.keyedvectors.Vocab object at 0x000002114A536588>, 'displayed': <gensim.models.keyedvectors.Vocab object at 0x000002114A5365C0

## Model Analysis
We successfully created our Word2Vec model in the last section. Now is the time to explore what we created.

### Finding Vectors for a Word
We know that the Word2Vec model converts words to their corresponding vectors. Let’s see how we can view vector representation of any particular word.


In [51]:
v1 = word2vec.wv['artificial']
# word2vec.wc does not exist; word vectors are looked up by indexing the model's wv attribute

The vector v1 contains the vector representation for the word “artificial”. By default, Gensim's Word2Vec creates a hundred-dimensional vector. This is much smaller than what the bag of words approach would produce: if we used bag of words to embed the article, the vector for each word would have length 1206, since there are 1206 unique words with a minimum frequency of 2. If the minimum frequency of occurrence is set to 1, the bag of words vector grows further. Vectors generated through Word2Vec, on the other hand, are not affected by the size of the vocabulary.

### Finding Similar Words

Earlier we said that contextual information about words is not lost with the Word2Vec approach. We can verify this by finding the words most similar to the word “intelligence”.

In [52]:
sim_words = word2vec.wv.most_similar('intelligence')
sim_words

[('many', 0.680070698261261),
 ('human', 0.6624724864959717),
 ('ai', 0.6479796171188354),
 ('could', 0.6053080558776855),
 ('artificial', 0.6050031185150146),
 ('use', 0.5637954473495483),
 ('computer', 0.5551739931106567),
 ('humans', 0.5545571446418762),
 ('search', 0.5525883436203003),
 ('learning', 0.5486801862716675)]

From the output, you can see the words similar to “intelligence” along with their similarity index. The word “ai” is one of the most similar words to “intelligence” according to the model, which makes sense. Similarly, words such as “human” and “artificial” often co-occur with the word “intelligence”. Our model has captured these relations using just a single Wikipedia article.

## 5. Embedding Layer

The embedding layer in Keras can be used to embed higher-dimensional data into a lower-dimensional vector space.

Keras offers an Embedding layer that can be used for neural networks on text data. It requires that the input data be integer encoded so that each word is represented by a unique integer.

This data preparation step can be performed using the Tokenizer API also provided with Keras.

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset. You must specify:
- input_dim, which is the size of the vocabulary
- output_dim, which is the size of the vector space of the embedding
- and optionally input_length, which is the number of words in the input sequences

Example:

A vocabulary of 200 words, a distributed representation of 32 dimensions and an input length of 50 words.

- layer = Embedding(200, 32, input_length=50)
- _Concrete example of defining an Embedding layer_

In fact, the output vectors are not computed from the input using any mathematical operation. Instead, each input integer is used as an index to access a table that contains all possible vectors. That is the reason why you need to specify the size of the vocabulary as the first argument (so the table can be initialized).
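
The table lookup described above can be sketched in plain Python. This is not how Keras is implemented internally; the names `table` and `embed`, the sizes and the initialisation range are assumptions chosen only to show the indexing idea.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 7, 2

# The layer's weights: one row of `embedding_dim` numbers per vocabulary index,
# randomly initialised here as an untrained layer would be.
table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(sequence):
    """Replace every integer index by the corresponding row of the table."""
    return [table[index] for index in sequence]

output = embed([0, 3, 3, 6])
print(output)
```

The same index always returns the same row, and an index of 7 or more would raise an IndexError, which is why the vocabulary size has to be known when the table is created. During training, only the numbers inside the rows change.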

The most common application of this layer is for text processing.

In [55]:
#Let’s see a simple example. Our training set consists only of two phrases:

S1 = 'Hope to see you soon'
S2 = 'Nice to see you again'

#So we can encode these phrases by assigning each word a unique integer number (by order of appearance in our training dataset for example). 
#Then our phrases could be rewritten as:

P1 = [0,1,2,3,4]
P2 = [5,1,2,3,6]



#Now imagine we want to train a network whose first layer is an embedding layer. In this case, we should initialize it as follows:

from keras.layers import Embedding
Embedding(7, 2, input_length=5)

- The first argument (7) is the number of distinct words in the training set. 
- The second argument (2) indicates the size of the embedding vectors. 
- The input_length argument, of course, determines the size of each input sequence.
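
The integer encoding of the two phrases (numbering words by order of first appearance) can be reproduced with a small helper. The function names `build_index` and `encode` are invented for this sketch.

```python
def build_index(sentences):
    """Assign each new word the next free integer, in order of first appearance."""
    index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            index.setdefault(word, len(index))
    return index

def encode(sentence, index):
    """Translate a sentence into the list of integers of its words."""
    return [index[word] for word in sentence.lower().split()]

sentences = ["Hope to see you soon", "Nice to see you again"]
word_index = build_index(sentences)
encoded = [encode(s, word_index) for s in sentences]

print(word_index)
print(encoded)
```

The index contains 7 distinct words and each encoded phrase has length 5, which is where the arguments 7 and 5 of the Embedding layer come from.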

Code found on https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py

## 6. Part of Speech Tagging

A basic and useful task when dealing with text-based problems is to tokenize the text and label each word with its most likely part of speech. This task is called Part of Speech Tagging (POST).

In this lesson, we will use the Brown Corpus, an influential dataset that is used in many studies of POST. The Brown Corpus defined a tagset (specific collection of POS labels) that has been reused in many other annotated resources in English.

Recently, a different version of the tagset has been defined which is called the Universal Parts of Speech Tagset.

Let’s start by installing NLTK and downloading its data:

In [None]:
!pip install nltk
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


### The FreqDist NLTK Object: Counting Things

FreqDist (Frequency Distribution) is an object that counts occurrences of objects. It is defined in the nltk.probability module.

In [5]:
#Let's demonstrate how it works
from nltk.probability import FreqDist
items = ['a', 'b', 'a']  # named items to avoid shadowing the built-in list
fdist = FreqDist(items)
fdist

FreqDist({'a': 2, 'b': 1})

In [4]:
fdist['a']
#the frequency of 'a'

2

In [5]:
fdist['c']
#the frequency of 'c' --> 0 as this is not in the list

0

In [6]:
fdist.max()
#the maximum object / most frequent object

'a'

In [7]:
len(fdist) 
#how many distinct objects were counted

2

In [8]:
fdist.keys()
#What unique keys are there in the list

dict_keys(['a', 'b'])

In [9]:
fdist.freq('a')
#What fraction of the total number of counts is 'a'

0.6666666666666666

In [10]:
fdist.N()   
#number of samples counted

3
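
FreqDist builds on Python's `collections.Counter`, so the same quantities can be computed without NLTK. The sketch below mirrors the calls above with a plain Counter; the variable names are chosen for illustration.

```python
from collections import Counter

samples = ['a', 'b', 'a']
counts = Counter(samples)

total = sum(counts.values())                # like fdist.N()
most_common = counts.most_common(1)[0][0]   # like fdist.max()
relative = counts['a'] / total              # like fdist.freq('a')

print(counts['a'], counts['c'], len(counts))
print(total, most_common, relative)
```

As with FreqDist, asking for a sample that was never seen returns 0 instead of raising an error.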

### Working on the Brown Corpus with NLTK
The tagged_sents version of the corpus is a list of sentences. Each sentence is a list of pairs (tuples) (word, tag). Similarly, one can access the corpus as a flat list of tagged words.

In [3]:
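import nltk
from nltk.corpus import brown
nltk.download('brown')
# No categories argument is given, so despite the variable names this loads the whole corpus, not only the news section
brown_news_tagged = brown.tagged_sents(tagset='universal')
brown_news_words = brown.tagged_words(tagset='universal')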

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Renate\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


In [7]:
#Let’s check: which word is the most common among the words of the Brown corpus? Which tag is the most common?
import nltk
nltk.download('universal_tagset')
fdistw = FreqDist([w for (w, t) in brown_news_words])
fdistw.N()    

[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Renate\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\universal_tagset.zip.


1161192

We saw 1,161,192 word tokens in the corpus.

In [8]:
len(fdistw) 
#distinct words

56057

In [9]:
fdistw.max() 
#most frequent word

'the'

In [10]:
fdistw['the']  
#how many times is 'the' in this list

62713

In [11]:
print('%5.2f%%' % (fdistw.freq('the') * 100))
#We observe that the distribution of word tokens to word types is extremely unbalanced – a single word (the) accounts for over 5% of the word tokens. 
#This is a general observation of linguistic data — known as Zipf’s Law — few types are extremely frequent, and many types are extremely rare.

 5.40%


### Distinguishing Word Type and Word Token
When distinguishing word type and word token, we can decide that strings which differ only in case correspond to the same word type. For example, the tokens Book and book correspond to the same word type book (and so do bOOk and bOok).

In [12]:
# Let us count the words without distinction of upper/lower case and the tags
fdistwl = FreqDist([w.lower() for (w, t) in brown_news_words])
fdistt = FreqDist([t for (w, t) in brown_news_words])

In [15]:
len(fdistwl)

49815

When we ignore case variants, there are only 49,815 word types instead of the 56,057 we counted when case differences were kept.
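
The distinction between tokens and types, and the effect of case folding, can be seen on a toy example. The sentence below is made up for illustration.

```python
tokens = "The book is on the table . Book the table .".split()

n_tokens = len(tokens)                            # every occurrence counts
types_case_sensitive = set(tokens)                # 'The' and 'the' are different types
types_case_folded = {t.lower() for t in tokens}   # 'Book' and 'book' collapse

print(n_tokens, len(types_case_sensitive), len(types_case_folded))
```

The 11 tokens correspond to 8 types when case is kept and to 6 types after lowercasing, the same kind of reduction as observed on the Brown corpus.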

In [16]:
#let's find the share of NOUN tags in the text
print('%5.2f%%' % (fdistt.freq('NOUN') * 100))

23.73%


### Perplexity of the POST task
The first question we address about the task of POS tagging is: how complex is it?

One way to quantify the complexity of the task is to measure its perplexity. The intuitive meaning of perplexity is: when the tagger is considering which tag to associate to a word, how “confused” is it? That is, how many options does it have to choose from?

This depends on the number of tags in the tagset. The universal tagset as used in NLTK includes 12 tags:

In [17]:
fdistt

FreqDist({'NOUN': 275558, 'VERB': 182750, '.': 147565, 'ADP': 144766, 'DET': 137019, 'ADJ': 83721, 'ADV': 56239, 'PRON': 49334, 'CONJ': 38151, 'PRT': 29829, ...})

In the absence of any knowledge about words and tags, the perplexity of the task with a tagset of size 12 will be 12. We will see that adding knowledge will reduce the perplexity of the task.
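
One common way to put a number on this is to define perplexity as 2 raised to the entropy of the tag distribution. The sketch below assumes that definition and uses a hypothetical tagset of 4 tags with made-up probabilities, to show that a uniform distribution gives a perplexity equal to the tagset size and that knowledge about tag frequencies lowers it.

```python
import math

def perplexity(probabilities):
    """2 ** entropy (in bits) of a probability distribution."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

# No knowledge: all 4 hypothetical tags are equally likely.
uniform = [0.25] * 4
# Some knowledge: one tag is known to be much more frequent than the others.
skewed = [0.7, 0.1, 0.1, 0.1]

print(perplexity(uniform))
print(perplexity(skewed))
```

The uniform case yields 4, matching the number of tags; the skewed case yields a value between 2 and 3, meaning the tagger is effectively choosing among fewer options.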

Note that the decision on how to tag a word, without more information is ambiguous for multiple reasons:

- The same string can be understood as a noun or a verb (book).
- Some POS tags have a systematically ambiguous definition: a present participle can be used in progressive verb usages (I am going:VERB), but it can also be used in an adjectival position modifying a noun: (A striking:ADJ comparison). In other words, it is unclear in the definition itself of the tag whether the tag refers to a syntactic function or to a morphological property of the word.

### Measuring success
Suppose we develop a tagger. How do we determine whether it is successful, i.e. whether the decisions it makes are good?

The way to address these issues is to define a criterion for success, and to test the tagger on a large test dataset. Assume we have a large dataset of 1M words with their tags assigned manually. We first split the dataset into 2 parts: one part on which we will “learn” facts about the words and their tags (we call this the training dataset), and one part which we use to test the results of our tagger (we call this the test dataset).

It is critical NOT to test our tagger on the training dataset, because we want to test whether the tagger is able to generalize from data it has seen and make decisions on unseen data. (A “stupid” tagger would learn the exact data seen in the training dataset “by heart” and reproduce it when asked about training data: it would get a perfect score on the training data, but a poor score on any unseen data.)

In [18]:
#This is one way to split the dataset into training and testing:
brown_train = brown_news_tagged[5000:]
brown_test = brown_news_tagged[:5000]
from nltk.tag import untag
test_sent = untag(brown_test[0])
print("Tagged: ", brown_test[0])
print("Untagged: ", test_sent)

Tagged:  [('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')]
Untagged:  ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']


To measure success, in this task, we will measure accuracy. The tagger object in NLTK includes a method called evaluate to measure the accuracy of a tagger on a given test set.
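To make the metric concrete, here is a minimal sketch of per-word accuracy: the fraction of tokens whose predicted tag matches the manually assigned tag. It is an illustration only, not NLTK's implementation; the names `accuracy`, `always_noun` and `toy_gold` are hypothetical, and the tagger is assumed to be a plain function from a list of words to a list of (word, tag) pairs.

```python
def accuracy(tagger, gold_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold in gold_sentences:
        words = [word for word, _ in gold]
        predicted = tagger(words)
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0

def always_noun(words):
    """Hypothetical baseline tagger: tag every word as NOUN."""
    return [(word, "NOUN") for word in words]

# Hypothetical gold data: two short sentences, four tokens in total.
toy_gold = [
    [("The", "DET"), ("jury", "NOUN"), ("said", "VERB")],
    [("Friday", "NOUN")],
]
toy_accuracy = accuracy(always_noun, toy_gold)
print(toy_accuracy)  # 2 of the 4 tokens are nouns, so this prints 0.5
```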

In [19]:
# A default tagger assigns the same tag to all words
from nltk import DefaultTagger
default_tagger = DefaultTagger('XYZ')
default_tagger.tag('This is a test'.split())

[('This', 'XYZ'), ('is', 'XYZ'), ('a', 'XYZ'), ('test', 'XYZ')]

In [20]:
#Since ‘NOUN’ is the most frequent universal tag in the Brown corpus, we can use a tagger that assigns ‘NOUN’ to all words as a baseline.
default_tagger = DefaultTagger('NOUN')
print('Accuracy: %4.1f%%' % (100.0 * default_tagger.evaluate(brown_test)))

Accuracy: 30.2%


Still, note that we improved the expected accuracy from picking one out of 17 answers with no knowledge to picking one out of about 3 answers with very little knowledge (only the most frequent tag in the dataset).

### Sources of Knowledge to Improve Tagging Accuracy
Intuitively, the sources of knowledge that can help us decide what is the tag of a word include:

- A dictionary that lists the possible parts of speech for each word
- The context of the word in a sentence (neighboring words)
- The morphological form of the word (suffixes, prefixes)

We will now develop a sequence of taggers that use these knowledge sources, and combine them. The taggers we develop will implement the nltk.tag.TaggerI interface.

### Lookup Tagger: Using Dictionary Knowledge
Assume we have a dictionary that lists the possible tags for each word in English. Could we use this information to perform better tagging?

The intuition is that we would only assign to a word a tag that it can have in the dictionary. For example, if “box” can only be a Verb or a Noun, when we have to tag an instance of the word “box”, we only choose between 2 options – and not between 17 options. Thus, dictionary knowledge will reduce the perplexity of the task.

There are 3 issues we must address to turn this into working code:

- Where do we get the dictionary?
- How do we choose between the various tags associated to a word in the dictionary? (For example, how do we choose between VERB and NOUN for “box”).
- What do we do for words that do not appear in the dictionary?

The simple solutions we will test are the following – note that for each question, there exist other strategies that we will investigate later:

- Where do we get the dictionary: we will learn it from a sample dataset.
- How do we choose between the various tags associated to a word in the dictionary: we will choose the most likely tag as observed in the sample dataset.
- What do we do for words that do not appear in the dictionary: we will pass unknown words to a backoff tagger.
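The three choices above can be sketched in plain Python before turning to the NLTK class. This is an illustrative sketch, not the NLTK implementation: `train_lookup`, `lookup_tag` and the toy training sentences are hypothetical names and data, and the backoff is simplified to a single fixed tag.

```python
from collections import Counter, defaultdict

def train_lookup(tagged_sentences):
    """Learn the dictionary: map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def lookup_tag(words, lookup, backoff_tag=None):
    """Tag known words from the dictionary; unknown words get backoff_tag."""
    return [(word, lookup.get(word, backoff_tag)) for word in words]

# Hypothetical training sample: "box" is seen twice as NOUN, once as VERB.
toy_train = [
    [("the", "DET"), ("box", "NOUN")],
    [("they", "PRON"), ("box", "VERB")],
    [("a", "DET"), ("box", "NOUN")],
]
lookup = train_lookup(toy_train)

sentence = ["the", "box", "fell"]
tagged = lookup_tag(sentence, lookup)
tagged_with_backoff = lookup_tag(sentence, lookup, backoff_tag="NOUN")
print(tagged)               # "fell" is unknown, so its tag is None
print(tagged_with_backoff)  # "fell" falls back to NOUN
```

Picking the most frequent tag resolves the VERB/NOUN choice for “box”, and the `None` tag for unseen words marks exactly the cases a backoff tagger is meant to handle.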

In [21]:
#The nltk.UnigramTagger implements this overall strategy. 
#It must be trained on a dataset, from which it builds a model of “unigrams”. The following code shows how it is used:

from nltk import UnigramTagger
# Train the unigram model
unigram_tagger = UnigramTagger(brown_train)
# Test it on a single sentence
unigram_tagger.tag(untag(brown_test[0]))

[('The', 'DET'),
 ('Fulton', 'NOUN'),
 ('County', 'NOUN'),
 ('Grand', 'ADJ'),
 ('Jury', 'NOUN'),
 ('said', 'VERB'),
 ('Friday', 'NOUN'),
 ('an', 'DET'),
 ('investigation', 'NOUN'),
 ('of', 'ADP'),
 ("Atlanta's", None),
 ('recent', 'ADJ'),
 ('primary', 'ADJ'),
 ('election', 'NOUN'),
 ('produced', 'VERB'),
 ('``', '.'),
 ('no', 'DET'),
 ('evidence', 'NOUN'),
 ("''", '.'),
 ('that', 'ADP'),
 ('any', 'DET'),
 ('irregularities', 'NOUN'),
 ('took', 'VERB'),
 ('place', 'NOUN'),
 ('.', '.')]

Note that the unigram tagger leaves some words tagged as ‘None’ — these are unknown words, words that were not observed in the training dataset.

In [22]:
#How successful is this tagger?
print('Unigram tagger accuracy: %4.1f%%' % ( 100.0 * unigram_tagger.evaluate(brown_test)))

Unigram tagger accuracy: 90.5%


90.5% is quite an improvement on the 30.2% of the default tagger. And this is without any backoff and without using morphological clues.

Is 90.5% a good level of accuracy? In fact it is not. It is accuracy per word. It means that on average, in every sentence of about 20 words, we will accumulate 2 errors. 2 errors in each sentence is a very high error rate. It makes it difficult to run another task on the output of such a tagger. Think how difficult the life of a parser would be if 2 words in every sentence are wrongly tagged. The problem is known as the pipeline effect — when language processing tools are chained in a pipeline, error rates accumulate from module to module.
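The arithmetic behind this argument can be checked with a few lines of Python. The sketch below uses the 90.5% accuracy and the 20-word sentence length from the text. The probability that a whole sentence is tagged correctly additionally assumes that errors on different words are independent, which is a simplification made here for illustration and not a claim from the course.

```python
word_accuracy = 0.905   # per-word accuracy of the unigram tagger (from the text)
sentence_length = 20    # typical sentence length assumed in the text

# Expected number of wrongly tagged words in one sentence.
expected_errors = sentence_length * (1 - word_accuracy)

# Probability that every word in the sentence is tagged correctly,
# ASSUMING independent errors (a simplification for illustration).
p_sentence_correct = word_accuracy ** sentence_length

print(f"Expected errors per sentence: {expected_errors:.1f}")
print(f"Fully correct sentences: {100 * p_sentence_correct:.0f}%")
```

Under that independence assumption, only about one sentence in seven would reach a downstream tool such as a parser without any tagging error.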

In [23]:
#How much would a good backoff help? Let’s first try adding the NOUN default tagger as a backoff:
nn_tagger = DefaultTagger('NOUN')

ut2 = UnigramTagger(brown_train, backoff=nn_tagger)
print('Unigram tagger with backoff accuracy: %4.1f%%' % ( 100.0 * ut2.evaluate(brown_test)))

Unigram tagger with backoff accuracy: 94.5%


Adding a simple backoff (with accuracy of 30.2%) improved accuracy from 90.5% to 94.5%.

One way to report this is in terms of error reduction: the error rate went down from 9.5% (100-90.5) to 5.5%. That’s an absolute error reduction of 9.5-5.5 = 4.0%. Error reduction is generally reported as a percentage of the error: 100.0 * (4.0 / 9.5) = 42.1% relative error reduction.
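The same computation can be written as a small helper. This is only a sketch of the formula described above; the function name `relative_error_reduction` is hypothetical.

```python
def relative_error_reduction(old_accuracy, new_accuracy):
    """Relative error reduction in percent, given two accuracies in percent."""
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (old_error - new_error) / old_error

reduction = relative_error_reduction(90.5, 94.5)
print(f"{reduction:.1f}%")  # prints 42.1%
```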

In other words, out of the words tagged incorrectly by the original model (with no backoff), 42.1% were corrected by the backoff.

What can we say about this error reduction? It is noticeably higher than the accuracy of the backoff tagger on its own (42.1% vs 30.2%).

One lesson to learn from this is that the distribution of unknown words is significantly different from the distribution of all the words in the corpus.
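One way to check this lesson on real data is to compare the tag counts of all test words with the tag counts of only those test words that never occur in the training data. The sketch below shows the idea on a tiny hypothetical dataset; `tag_distributions`, `toy_train` and `toy_test` are illustrative names and data, not results from the Brown corpus.

```python
from collections import Counter

def tag_distributions(train, test):
    """Count gold tags in test for all words and for unknown words only."""
    known_words = {word for sentence in train for word, _ in sentence}
    all_tags, unknown_tags = Counter(), Counter()
    for sentence in test:
        for word, tag in sentence:
            all_tags[tag] += 1
            if word not in known_words:
                unknown_tags[tag] += 1
    return all_tags, unknown_tags

# Hypothetical data: the unknown words in the test set are proper nouns.
toy_train = [[("the", "DET"), ("jury", "NOUN"), ("said", "VERB")]]
toy_test = [
    [("the", "DET"), ("Fulton", "NOUN"), ("jury", "NOUN"), ("said", "VERB")],
    [("Atlanta", "NOUN"), ("said", "VERB")],
]
all_tags, unknown_tags = tag_distributions(toy_train, toy_test)
print(all_tags)      # tags of every test word
print(unknown_tags)  # tags of test words never seen in training
```

In this toy example every unknown word is a noun, although nouns make up only half of the test words. A similar skew in real data would explain why a NOUN backoff corrects a larger share of errors than its overall accuracy suggests.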

### Summary

Parts of speech are classes of words that can be characterized by criteria of different types:

- Syntactic: 2 words of the same class can substitute each other in a sentence and leave the sentence syntactically acceptable
- Morphological: words of the same class are inflected in similar manner
- Semantic: words of the same class denote entities of similar semantic types (object, action, property, relation)

Tagsets of various granularities can be considered. We mentioned the standard Brown corpus tagset (about 60 tags for the complete tagset) and the reduced universal tagset (17 tags).

The key point of the approach we investigated is that it is data-driven. We attempt to solve the task by:

- Obtaining sample data annotated manually: we used the Brown corpus
- Defining a success metric: we used the definition of accuracy
- Measuring the adequacy of a method by measuring its success
- Measuring the complexity of the problem by using the notion of perplexity

The computational methods we developed are characterized by:

- We first defined possible knowledge sources that can help us solve the task. Specifically, we investigated the dictionary, morphology, and context as possible sources.
- We developed computational models that represent each of these knowledge sources in simple data structures (hash tables, frequency distributions, conditional frequency distributions).
- We tested simple machine learning methods: data is acquired by inspecting a training dataset, then evaluated by testing on a test dataset.
- We investigated one method to combine several systems into a combined system: backoff models.

This methodology will be further developed in the next chapters, as we will address more complex tasks (parsing, summarization) and use more sophisticated machine learning methods.

The task of Parts of Speech tagging is very well studied in English. The most accurate systems obtain accuracy rates of over 98% even on fine-grained tagsets, which is comparable to the level of agreement generally obtained among human taggers. The best systems use better machine learning algorithms (HMM, SVM) and treat unknown words (words not seen in training data) with more sophistication than what we have observed here.