# Natural language processing with NLTK and spaCy

Natural language processing is a field that studies automatic computational processing of human languages.
<br>
Generally, NLP addresses the following tasks:
<br><br>
<b>Tokenization</b> - Segmenting text into words, punctuation marks, and other tokens.
<br>
<b>Part-of-speech (POS) Tagging</b> - Assigning word types to tokens, like verb or noun.
<br>
<b>Dependency Parsing</b> - Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
<br>
<b>Lemmatization</b> - Assigning the base forms of words.
<br>
<b>Sentence Boundary Detection (SBD)</b> - Finding and segmenting individual sentences.

### Differences between NLTK and spaCy
NLTK and spaCy have many overlapping features. Compared to spaCy, NLTK takes a much broader approach: it offers a variety of methods for solving a single task, so the user needs to know which one to choose, while spaCy provides one approach per task, which may be based on a recently proposed method. spaCy is also much more performance-focused than NLTK.
<br>
Where the two libraries provide the same functionality, spaCy's implementation will usually be faster.

# NLTK

NLTK (Natural Language Toolkit) is a free, open-source, leading platform for building Python programs that work with human language data.
<br>
It is written in Python and provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for other NLP libraries.
<br>


In [2]:
import warnings
warnings.filterwarnings('ignore')

import re
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.probability import FreqDist

from gensim.models import Word2Vec, KeyedVectors

import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

<p>The first thing we need to do is load the Quora dataset.</p>

In [2]:
# Load the data

train_data = pd.read_csv('../data/train.csv')
test_data = pd.read_csv('../data/test.csv')

train_data = train_data[:400000]

train_text = train_data['question_text'].values
train_labels = train_data['target'].values

test_text = test_data['question_text'].values
test_qid = test_data['qid'].values

In [3]:
train_text[:5]

array(['How did Quebec nationalists see their province as a nation in the 1960s?',
       'Do you have an adopted dog, how would you encourage people to adopt and not shop?',
       'Why does velocity affect time? Does velocity affect space geometry?',
       'How did Otto von Guericke used the Magdeburg hemispheres?',
       'Can I convert montra helicon D to a mountain bike by just changing the tyres?'],
      dtype=object)

### Data preprocessing

The first thing we can do with the data is convert all the letters to lowercase.

In [4]:
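# Convert to lowercase (each element is a whole question, not a single token)
train_text = [text.lower() for text in train_text]
# Lowercase the test questions too, so they match the vocabulary built from the training data
test_text = [text.lower() for text in test_text]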

In [5]:
train_text[:5]

['how did quebec nationalists see their province as a nation in the 1960s?',
 'do you have an adopted dog, how would you encourage people to adopt and not shop?',
 'why does velocity affect time? does velocity affect space geometry?',
 'how did otto von guericke used the magdeburg hemispheres?',
 'can i convert montra helicon d to a mountain bike by just changing the tyres?']

### Word tokenization

Tokenization is a way to split text into tokens. These tokens could be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in the tokenize module.

Text is typically first split into sentences with the <b>PunktSentenceTokenizer</b>; each sentence can then be split into words with one of four word tokenizers:

<b>TreebankWordTokenizer</b>

<b>WordPunctTokenizer</b>

<b>PunktWordTokenizer</b>

<b>WhitespaceTokenizer</b>

By default, NLTK uses a Treebank-style word tokenizer (the TreebankWordTokenizer), which uses regular expressions to tokenize text and assumes that the text has already been split into sentences.
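As a rough illustration of how these tokenizers differ, the sketch below runs three of them on a single sentence. The sample sentence is made up for this example; the three classes come from `nltk.tokenize` and are rule-based, so they should not need any extra NLTK data downloads.

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WordPunctTokenizer,
    WhitespaceTokenizer,
)

# Made-up sentence, chosen because it contains contractions and punctuation
sample = "Can't stop, won't stop."

tokenizers = {
    'treebank': TreebankWordTokenizer(),
    'wordpunct': WordPunctTokenizer(),
    'whitespace': WhitespaceTokenizer(),
}

# Tokenize the same sentence with every tokenizer and compare the results
results = {name: tok.tokenize(sample) for name, tok in tokenizers.items()}

for name, tokens in results.items():
    print(name, tokens)
```

The whitespace tokenizer leaves punctuation attached to the words, the WordPunctTokenizer splits at every punctuation character (including the apostrophe), and the Treebank tokenizer separates contractions such as `n't` into their own tokens.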

In [6]:
tokenized_words_train = [word_tokenize(i) for i in train_text]
tokenized_words_test = [word_tokenize(i) for i in test_text]

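In [11]:
# Cache the tokenized data; the sentences have different lengths, so store them as object arrays
np.save('tokenized_words_train', np.array(tokenized_words_train, dtype=object))
np.save('tokenized_words_test', np.array(tokenized_words_test, dtype=object))

In [12]:
# Object arrays are pickled on disk, so loading them requires allow_pickle=True
tokenized_words_train = np.load('tokenized_words_train.npy', allow_pickle=True)
tokenized_words_test = np.load('tokenized_words_test.npy', allow_pickle=True)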

### Text cleaning

<b>isalpha()</b> is a built-in Python string method that checks whether a string contains only alphabetic characters.
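A small sketch of this filter on a handful of made-up tokens (not taken from the dataset) shows what is kept and what is dropped:

```python
# Made-up tokens for illustration
tokens = ['tyres', '?', '1960s', '42', 'québec']

# Keep only the tokens that consist entirely of letters
alpha_only = [token for token in tokens if token.isalpha()]

print(alpha_only)  # ['tyres', 'québec']
```

Note that `isalpha()` also accepts non-ASCII letters such as `é`, which is why non-ASCII characters are handled in a separate step.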

In [7]:
# Remove punctuation and numbers
tokenized_words_train = [[word for word in sent if word.isalpha()] for sent in tokenized_words_train]

In [8]:
# Remove non-ASCII characters
tokenized_words_train_flat = [item for sublist in tokenized_words_train for item in sublist]

cleaned_data = [re.sub(r'[^\x00-\x7f]', r'', word) for word in tokenized_words_train_flat]

The <b>FreqDist</b> class records the frequency distribution for the outcomes of an experiment.

A frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.
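The following minimal sketch builds a frequency distribution over a few made-up tokens and then applies a frequency threshold. The threshold of 1 is chosen only to suit the toy data.

```python
from nltk.probability import FreqDist

# Made-up tokens for illustration
tokens = ['how', 'did', 'how', 'why', 'how', 'did']

freq = FreqDist(tokens)

print(freq['how'])          # number of times 'how' occurred
print(freq.N())             # total number of samples counted
print(freq.most_common(2))  # the two most frequent samples with their counts

# Keep only the words that occur more than once
kept = {word for word, count in freq.items() if count > 1}
print(kept)
```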

In [9]:
# Remove low frequency words
freq_words = FreqDist(cleaned_data)

cleaned_data = { key : value for key, value in freq_words.items() if value > 10 }

filtered_data = []
temp_array = []

for sent in tokenized_words_train:
    for word in sent:
        if word in cleaned_data.keys():
            temp_array.append(word)
    filtered_data.append(temp_array)
    temp_array = []

NLTK also provides lists of stop words, which are the most frequent words in a language.

For example, the English stop-word list can be loaded as follows:

In [3]:
stop_words = list(stopwords.words('english'))

These words appear in almost every sentence, so they usually have little impact on the classification and are candidates for removal.

<b>Caution!</b> Removing the stop words might not be a good approach for every task; the cell below is therefore left commented out.

In [15]:
# Remove stop words
# filtered_data_no_stopwords = []
# temp_array = []

# for sent in filtered_data:
#     for word in sent:
#         if word not in stop_words:
#             temp_array.append(word)
#     filtered_data_no_stopwords.append(temp_array)
#     temp_array = []

# filtered_data = filtered_data_no_stopwords

<b>Bag-of-Words</b> is a very intuitive approach for converting the words into numerical values.

The approach follows 3 steps:

<ol>
<li>Splitting the documents into tokens</li>
<li>Assigning a weight to each token proportional to the frequency with which it shows up in the document and/or corpus.</li>
<li>Creating a document-term matrix with each row representing a document and each column representing a token.</li>
</ol>


The <b>Count Vectorizer</b> counts the number of times a token shows up in the document and uses this value as its weight.

The <b>tokenizer</b> argument overrides the string tokenization step while preserving the preprocessing and n-grams generation steps. 
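To make the document-term matrix concrete, here is a small sketch on two made-up, already tokenized documents. Because the documents are lists of tokens, an identity function is passed as the tokenizer; the names `identity`, `toy_vectorizer` and `toy_matrix` are illustrative only. scikit-learn may emit a warning that `token_pattern` is unused, which is expected here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents that are already split into tokens
docs = [['how', 'did', 'otto'], ['how', 'how', 'did']]

def identity(tokens):
    """Return the tokens unchanged, since the documents are pre-tokenized."""
    return tokens

toy_vectorizer = CountVectorizer(
    tokenizer=identity,
    analyzer='word',
    lowercase=False,
)

toy_matrix = toy_vectorizer.fit_transform(docs)

# Mapping from token to column index (columns are sorted alphabetically)
print(sorted(toy_vectorizer.vocabulary_.items()))
# One row per document, one column per token, cell = token count
print(toy_matrix.toarray())
```

The second row contains a 2 in the `how` column because that token appears twice in the second document, and a 0 in the `otto` column because it does not appear at all.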

In [10]:
vectorizer = CountVectorizer(
    tokenizer = lambda sent: sent, 
    analyzer = 'word',
    lowercase=False
)

X_train = vectorizer.fit_transform(filtered_data)
X_test = vectorizer.transform(tokenized_words_test)

In [11]:
# Cross validation
LR = LogisticRegression()

scores = cross_val_score(
    LR, 
    X_train, 
    train_labels, 
    cv = 5, 
    scoring = 'f1'
)

In [None]:
avg_score = np.sum(scores) / len(scores)
avg_score

# spaCy

spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.
<br>
Its focus is on providing software for production usage, and it excels at large-scale information extraction tasks.
<br>

spaCy provides the following key features:
<ol>
    <li>Non-destructive tokenization</li>
    <li>Named entity recognition</li>
    <li>"Alpha tokenization" support for over 25 languages</li>
    <li>Pre-trained word vectors</li>
    <li>Part-of-speech tagging</li>
    <li>Labelled dependency parsing</li>
    <li>Syntax-driven sentence segmentation</li>
    <li>Text classification</li>
    <li>Built-in visualizers for syntax and named entities</li>
    <li>Deep learning integration</li>
</ol>

In [1]:
import warnings
warnings.filterwarnings('ignore')

import re
import string
from collections import defaultdict, Counter

import pandas as pd
import numpy as np

import spacy
from spacy.tokenizer import Tokenizer

from gensim.models import Word2Vec, KeyedVectors

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

In [2]:
# Load the data

train_data = pd.read_csv('../data/train.csv')
test_data = pd.read_csv('../data/test.csv')

train_data = train_data[:400000]

train_text = train_data['question_text'].values
train_labels = train_data['target'].values

test_text = test_data['question_text'].values
test_qid = test_data['qid'].values

# load the Spacy model
spacy_model = spacy.load('en_core_web_sm')

In [None]:
# python -m spacy download en_core_web_sm ==> command to install a spaCy model

## Data preprocessing

The first thing we can do with the data is convert all the letters to lowercase.

In [3]:
# Convert to lowercase
train_text = [token.lower() for token in train_text]
test_text = [token.lower() for token in test_text]

### Processing pipeline

When we call <b>spacy_model</b> on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. 

<img src="images/processing.png" alt="processing">

<br>
where <b>tagger</b> assigns part-of-speech tags, <b>parser</b> assigns dependency labels and <b>ner</b> detects and labels named entities.

### Word tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens.
<br>
The input to the tokenizer is a unicode text, and the output is a Doc object, which is a sequence of tokens.
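As a minimal sketch of this input and output, the example below tokenizes one sentence with a blank English pipeline. It assumes spaCy is installed; `spacy.blank('en')` contains only the tokenizer, so no trained model is required, and the variable name `nlp` is used here only for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank('en')

sample = 'How did Otto von Guericke use the hemispheres?'
doc = nlp(sample)

# A Doc is a sequence of Token objects
tokens = [token.text for token in doc]
# Each token remembers its character offset in the original string
offsets = [token.idx for token in doc]

print(tokens)
print(offsets[:3])
# Tokenization is non-destructive: the original text can be recovered
print(doc.text == sample)
```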

spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition, and ease of alignment into the original string.
<br><br>
The tokenization algorithm proceeds in the following steps:
<ol>
    <li>Iterate over space-separated substrings.</li>
    <li>Check whether we have an explicitly defined rule for this substring. If we do, use it.</li>
    <li>Otherwise, try to consume a prefix.</li>
    <li>If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.</li>
    <li>If we didn't consume a prefix, try to consume a suffix.</li>
    <li>If we can't consume a prefix or suffix, look for "infixes" — stuff like hyphens etc.</li>
    <li>Once we can't consume any more of the string, handle it as a single token.</li>
</ol>
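The steps above can be sketched in plain Python. This is a simplified illustration, not spaCy's implementation: the special cases and the prefix, suffix and infix patterns below are hypothetical toy rules, whereas spaCy's real rules are much larger and language-specific.

```python
import re

# Hypothetical toy rules for illustration only
SPECIAL_CASES = {"don't": ["do", "n't"], "u.k.": ["u.k."]}
PREFIX_RE = re.compile(r'^[(\[{"]')
SUFFIX_RE = re.compile(r'[)\]}".,!?]$')
INFIX_RE = re.compile(r'([-~/])')

def toy_tokenize(text):
    tokens = []
    # 1. Iterate over space-separated substrings
    for substring in text.split():
        suffixes = []
        while substring:
            # 2. Special cases are checked first on every pass
            if substring in SPECIAL_CASES:
                tokens.extend(SPECIAL_CASES[substring])
                substring = ''
                continue
            # 3-4. Consume one prefix, then start the loop again
            prefix = PREFIX_RE.search(substring)
            if prefix:
                tokens.append(prefix.group())
                substring = substring[prefix.end():]
                continue
            # 5. Otherwise consume one suffix and remember it for later
            suffix = SUFFIX_RE.search(substring)
            if suffix:
                suffixes.insert(0, suffix.group())
                substring = substring[:suffix.start()]
                continue
            # 6-7. No prefix or suffix left: split on infixes, keep the pieces
            tokens.extend(piece for piece in INFIX_RE.split(substring) if piece)
            substring = ''
        # Suffixes follow the tokens they were stripped from
        tokens.extend(suffixes)
    return tokens

sample = "(don't buy a mountain-bike in the u.k.!)"
print(toy_tokenize(sample))
```

In this toy run, `u.k.` survives as a single token because the special-case check runs again after each suffix is stripped, which is the priority rule described in step 4.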

In [4]:
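# Note: a Tokenizer built from the vocab alone has no prefix, suffix or infix rules,
# so it splits on whitespace only; spacy_model.tokenizer applies the model's default rules
tokenizer = Tokenizer(spacy_model.vocab)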

tokenized_words_train = [tokenizer(sent) for sent in train_text]
tokenized_words_test = [tokenizer(sent) for sent in test_text]

In [None]:
np.save('tokenized_words_spacy_train', tokenized_words_train)
np.save('tokenized_words_spacy_test', tokenized_words_test)

In [None]:
tokenized_words_train = np.load('tokenized_words_spacy_train.npy')
tokenized_words_test = np.load('tokenized_words_spacy_test.npy')

In [None]:
tokenized_words_train[0:5]

In [5]:
# Remove punctuation and numbers
tokenized_words_train = [[word for word in sent if word.is_alpha] for sent in tokenized_words_train]

In [6]:
# Remove non-ASCII characters
tokenized_words_train_flat = [item for sublist in tokenized_words_train for item in sublist]

cleaned_data = [re.sub(r'[^\x00-\x7f]', r'', word.text) for word in tokenized_words_train_flat]

In [7]:
# Remove low-frequency words
freq_words = Counter(cleaned_data)

cleaned_data = { key : value for key, value in freq_words.items() if value > 10 }

filtered_data = []
temp_array = []

for sent in tokenized_words_train:
    for word in sent:
        if word.text in cleaned_data.keys():
            temp_array.append(word)
    filtered_data.append(temp_array)
    temp_array = []

<b>Caution!</b> Removing the stop words might not be the best approach.

In [None]:
# Remove stop words
# filtered_data_no_stopwords = [[word for word in sent if word.is_stop == False] for sent in filtered_data]

# filtered_data = filtered_data_no_stopwords

In [8]:
# Lemmatization: replace every token with its base form
filtered_data_lemmas = [[word.lemma_ for word in sent] for sent in filtered_data]

filtered_data = filtered_data_lemmas

## Word embedding

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers.
<br>
Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network. It was developed in 2013 by Tomas Mikolov and colleagues at Google.
<br>
It tries to make words with similar context occupy close spatial positions.
<br><br>
The Word2Vec model can be obtained using 2 techniques: 
<ol>
    <li>Skip Gram</li>
    <li>Continuous Bag Of Words (CBOW)</li>
</ol>
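The two techniques differ in what the network is asked to predict: Skip Gram predicts the context words from the center word, while CBOW predicts the center word from its context. The sketch below only generates the training examples for each variant on a made-up sentence; it does not train a model, and the helper names are hypothetical.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: the center word is the input, one context word the target."""
    return [
        (center, context)
        for i, center in enumerate(tokens)
        for context in context_window(tokens, i, window)
    ]

def cbow_examples(tokens, window=1):
    """(context words, center) examples: the context is the input, the center word the target."""
    return [
        (context_window(tokens, i, window), center)
        for i, center in enumerate(tokens)
    ]

# Made-up sentence for illustration
sentence = ['how', 'did', 'quebec']

print(skipgram_pairs(sentence))
print(cbow_examples(sentence))
```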

In [9]:
embed_wiki = KeyedVectors.load_word2vec_format('../data/wiki-news-300d-1M.vec')

In [10]:
# Look up the vector of every word that the embedding model knows
X = [[embed_wiki[word] for word in sent if word in embed_wiki] for sent in filtered_data]

In [47]:
# Get average of the vectors
X_avg = []

for vector in X:
    if len(vector) >= 1:
        X_avg.append(np.mean(vector, axis=0))
    else:
        X_avg.append(np.zeros(300))

In [48]:
X_avg = np.array(X_avg)

In [13]:
# Cross validation
LR = LogisticRegression()

scores = cross_val_score(
    LR, 
    X_avg, 
    train_labels, 
    cv = 5, 
    scoring = 'f1_macro'
)

In [18]:
avg_score = np.sum(scores) / len(scores)
avg_score

0.6377160304145425

## Additional stuff

In [None]:
text = train_text[:100]

tokenized_text = [spacy_model(word) for word in text]

### Part-of-speech tagging

<img src="images/tagging.png" alt="tagging">

In [None]:
# Part-of-speech tagging
tagged_text = [{word : word.tag_ for word in sent} for sent in tokenized_text]

In [None]:
tagged_text

### Dependency parsing

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads.
<br>
<img src="images/parsing.png" alt="parsing">

#### Noun chunks

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun.

In [None]:
# Get noun chunks

sample_sent = tokenized_text[1]

text = []
root = []
root_dep = []
root_head = []

for chunk in sample_sent.noun_chunks:
    text.append(chunk.text)
    root.append(chunk.root.text)
    root_dep.append(chunk.root.dep_)
    root_head.append(chunk.root.head.text)

df = pd.DataFrame({
        'TEXT': text, 
        'ROOT.TEXT': root, 
        'ROOT.DEP': root_dep, 
        'ROOT.HEAD.TEXT': root_head
    })

df = df[['TEXT', 'ROOT.TEXT', 'ROOT.DEP', 'ROOT.HEAD.TEXT']]

In [None]:
print(sample_sent)
df

<b>Text</b>: The original noun chunk text.
<br>
<b>Root text</b>: The original text of the word connecting the noun chunk to the rest of the parse.
<br>
<b>Root dep</b>: Dependency relation connecting the root to its head.
<br>
<b>Root head text</b>: The text of the root token's head.

### Named entity recognition

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.
<br>
In the image below, we can see the entity types that spaCy supports.

<img src="images/entity_types.png" alt="entity_types" >

In [None]:
# Named entity recognition
sample_sent = 'Google was founded in 1998 in California'
doc = spacy_model(sample_sent)

text = []
start = []
end = []
label = []

for ent in doc.ents:
    text.append(ent.text)
    start.append(ent.start_char)
    end.append(ent.end_char)
    label.append(ent.label_)
    
df = pd.DataFrame({
    'TEXT': text, 
    'START': start, 
    'END': end, 
    'LABEL': label
})

df = df[['TEXT', 'START', 'END', 'LABEL']]

In [None]:
df

### Similarity

In [None]:
text = 'dog cat chair'
doc = spacy_model(text)

token1 = doc[0]
token2 = doc[1]
token3 = doc[2]


print('Similarity between {0} and {1} is: {2}'.format(token1, token2, token1.similarity(token2)))
print('Similarity between {0} and {1} is: {2}'.format(token1, token3, token1.similarity(token3)))
print('Similarity between {0} and {1} is: {2}'.format(token1, token1, token1.similarity(token1)))

### spaCy Word2Vec model

In [None]:
# load the Spacy model
spacy_model = spacy.load('en_core_web_md')

In [None]:
text = train_text[:100]

tokenized_text = [spacy_model(word) for word in text]

text = []
has_vector = []
vector_norm = []

for sent in tokenized_text:
    for word in sent:
        text.append(word.text)
        has_vector.append(word.has_vector)
        vector_norm.append(word.vector_norm)
    
    
df = pd.DataFrame({
        'TEXT': text, 
        'HAS VECTOR': has_vector, 
        'VECTOR NORM': vector_norm, 
    })

df = df[['TEXT', 'HAS VECTOR', 'VECTOR NORM']]

In [None]:
df.head()

In [None]:
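def represent_word_spacy(word, etypes=frozenset(['MONEY', 'DATE',
                                                 'TIME',
                                                 'CARDINAL',
                                                 'PERCENT'])):
    """
    Returns a normalized string representation of a spaCy token.
    Args:
        word (spacy.tokens.Token): token produced by a spaCy pipeline
        etypes (set): entity types that are replaced by their lowercased label
    Returns:
        str: a placeholder such as 'url', 'date' or 'cardinal', or the lowercased lemma
    """
    # Collapse every run of digits to a single 'd' to recognise date-like patterns
    digit_shape = re.sub(r'\d+', 'd', word.text)

    if word.like_url:
        return 'url'
    elif word.ent_type_ in etypes:
        return word.ent_type_.lower()
    elif '%' in word.lemma_:
        return 'percent'
    elif word.like_num:
        return 'cardinal'
    # Time-like shapes contain both a pair of digits and a colon, e.g. 'dd:dd'
    elif 'dd' in word.shape_ and ':' in word.shape_:
        return 'time'
    elif '$' in word.text:
        # Mask each digit of a money amount with 'd'
        return re.sub(r'\d', 'd', word.text)
    # Checks if the string looks like a date in any of the three supported formats
    elif any(pattern in digit_shape for pattern in ('d/d/d', 'd-d-d', 'd.d.d')):
        return 'date'
    else:
        return word.lemma_.lower()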

In [None]:
cleaned_data = []
temp = []

for sent in tokenized_text:
    for word in sent:
        temp.append(represent_word_spacy(word))
    cleaned_data.append(temp)
    temp = []

In [None]:
tokenized_text[0]

In [None]:
cleaned_data[0]