# Natural language processing with NLTK and spaCy

Natural language processing (NLP) is concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
<br>
Common NLP tasks include the following:
<br><br>
<b>Tokenization</b> - Segmenting text into words, punctuation marks, etc.
<br>
<b>Part-of-speech (POS) Tagging</b>	- Assigning word types to tokens, like verb or noun.
<br>
<b>Dependency Parsing</b> - Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
<br>
<b>Lemmatization</b> - Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "playing" is "play".
<br>
<b>Sentence Boundary Detection (SBD)</b> - Finding and segmenting individual sentences.
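
To make two of these terms concrete, here is a deliberately naive sketch of sentence boundary detection and tokenization that uses only Python's re module. The regular expressions and the sample text are illustrative assumptions, not what NLTK or spaCy do internally; real tokenizers also handle abbreviations, contractions, and many other cases.

```python
import re

text = "NLTK is a toolkit. spaCy is a library! Which one is faster?"

# Naive sentence boundary detection: split on whitespace that follows '.', '!' or '?'
sentences = re.split(r'(?<=[.!?])\s+', text)

# Naive tokenization: a token is a run of word characters or a single punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)  # ['NLTK is a toolkit.', 'spaCy is a library!', 'Which one is faster?']
print(tokens)     # ['NLTK', 'is', 'a', 'toolkit', '.']
```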

### Differences between NLTK and spaCy
NLTK offers some of the same functionality as spaCy. Although originally developed for teaching and research, its longevity and stability have resulted in a large number of industrial users. It's the main alternative to spaCy for tokenization and sentence segmentation. In comparison to spaCy, NLTK takes a much more "broad church" approach, so it has some functions that spaCy doesn't provide, at the expense of a bit more clutter to sift through. spaCy is also much more performance-focused than NLTK: where the two libraries provide the same functionality, spaCy's implementation will usually be faster and more accurate.

# NLTK

NLTK (Natural Language Toolkit) is a free, open-source, leading platform for building Python programs that work with human language data.
<br>
It is written in Python and provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
<br>


In [6]:
import warnings
warnings.filterwarnings('ignore')

import re
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.probability import FreqDist

from gensim.models import Word2Vec, KeyedVectors

import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

<p>The first thing we need to do is load the Quora dataset.</p>

In [2]:
# Load the data

train_data = pd.read_csv('../data/train.csv')
test_data = pd.read_csv('../data/test.csv')

train_data = train_data[:400000]

train_text = train_data['question_text'].values
train_labels = train_data['target'].values

test_text = test_data['question_text'].values
test_qid = test_data['qid'].values

In [3]:
train_data.sample(10)

| | qid | question_text | target |
|---|---|---|---|
| 311184 | 3cf3f6d126ce410d7b58 | Why don't we Hindus hate ex-president Abdul Ka... | 1 |
| 226106 | 2c36fd62c059967a2fee | Who do also hate the new unskippable ads on Yo... | 0 |
| 50491 | 09e5a49af0d7923ff008 | Are Americans aware of their global unpopularity? | 1 |
| 138993 | 1b3737b7501354682449 | Which apps can be used on Android to see world... | 0 |
| 321911 | 3f16506befbca637cd09 | What should Caucasians do to reduce fertility ... | 1 |
| 122308 | 17f041a8626c1f19934b | Is there a version of Frank Castle where he fi... | 0 |
| 190804 | 254e7e77f4efa0d081a2 | Why is Malayalam new year called "vishu"? | 0 |
| 72181 | 0e2688cb6136e20d9e94 | Which instruction is used to selectively mask ... | 0 |
| 175406 | 224bdb1a11a98141a4c3 | Are construction delays common with Emaar off ... | 0 |
| 181432 | 2374f1641661814436c2 | Which will be the best course online for AWS D... | 0 |


### Data preprocessing

The first preprocessing step is to convert all letters to lowercase.

In [4]:
# Convert to lowercase
train_text = [token.lower() for token in train_text]
test_text = [token.lower() for token in test_text]

### Word tokenization

Tokenization is a way to split text into tokens. These tokens could be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in the tokenize module.

The text is first tokenized into sentences using the <b>PunktSentenceTokenizer</b>, then each sentence is tokenized into words using four different word tokenizers:

<b>TreebankWordTokenizer</b>

<b>WordPunctTokenizer</b>

<b>PunctWordTokenizer</b>

<b>WhitespaceTokenizer</b>

By default, NLTK uses the TreebankWordTokenizer, which tokenizes text with regular expressions and assumes that the text has already been split into sentences.
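
The difference between the tokenizers is easiest to see on a single sentence. The sketch below runs three of the tokenizers listed above on a made-up example sentence; the sentence and the variable names are our own assumptions. These three classes are rule-based, so they should work without downloading any additional NLTK data.

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WordPunctTokenizer,
    WhitespaceTokenizer,
)

# Hypothetical example sentence with a contraction and final punctuation
sentence = "Don't hesitate to ask questions."

tokenizers = {
    "treebank": TreebankWordTokenizer(),
    "wordpunct": WordPunctTokenizer(),
    "whitespace": WhitespaceTokenizer(),
}

# Tokenize the same sentence with every tokenizer
results = {name: tok.tokenize(sentence) for name, tok in tokenizers.items()}

for name, tokens in results.items():
    print(f"{name:<10} {tokens}")
```

The whitespace tokenizer keeps "Don't" and "questions." intact, the WordPunct tokenizer splits on every punctuation character, and the Treebank tokenizer separates the contraction into "Do" and "n't".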

In [5]:
tokenized_words_train = [word_tokenize(i) for i in train_text]
tokenized_words_test = [word_tokenize(i) for i in test_text]

In [11]:
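# Token lists have different lengths, so store them as object arrays
np.save('tokenized_words_train', np.array(tokenized_words_train, dtype=object))
np.save('tokenized_words_test', np.array(tokenized_words_test, dtype=object))

In [12]:
# Object arrays are pickled on disk, so loading them requires allow_pickle=True
tokenized_words_train = np.load('tokenized_words_train.npy', allow_pickle=True)
tokenized_words_test = np.load('tokenized_words_test.npy', allow_pickle=True)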

### Text cleaning

<b>isalpha()</b> is a built-in Python string method that checks whether a string contains only alphabetic characters.

In [7]:
# Remove punctuation and numbers
tokenized_words_train = [[word for word in sent if word.isalpha()] for sent in tokenized_words_train]
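
As a quick illustration of what this filter keeps, here is a small sketch on a made-up token list (the tokens are our own example, not taken from the dataset). Note that isalpha() is Unicode-aware, so accented letters pass the check, while any token containing a digit, an apostrophe, or other punctuation is dropped.

```python
# Hypothetical tokens, similar to what a word tokenizer might produce
tokens = ['what', 'is', '2', '+', '2', '?', "n't", 'café', 'covid19']

# Keep only the tokens that consist entirely of letters
alpha_only = [tok for tok in tokens if tok.isalpha()]

print(alpha_only)  # ['what', 'is', 'café']
```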

In [8]:
# Remove non-ASCII characters
tokenized_words_train_flat = [item for sublist in tokenized_words_train for item in sublist]

cleaned_data = [re.sub(r'[^\x00-\x7f]', r'', word) for word in tokenized_words_train_flat]
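
The effect of this regular expression can be checked on a few made-up words. The sketch below assumes the same pattern as the cell above; the word list is our own. One thing to watch for is that a word made up entirely of non-ASCII characters becomes an empty string, which you may want to drop afterwards.

```python
import re

# Same character class as above: anything outside the 7-bit ASCII range
NON_ASCII = re.compile(r'[^\x00-\x7f]')

words = ['café', 'naïve', 'hello', '日本']

# Delete every non-ASCII character from each word
stripped = [NON_ASCII.sub('', word) for word in words]
print(stripped)   # ['caf', 'nave', 'hello', '']

# Optionally discard the words that ended up empty
non_empty = [word for word in stripped if word]
print(non_empty)  # ['caf', 'nave', 'hello']
```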

The <b>FreqDist</b> class records the frequency distribution for the outcomes of an experiment.

A frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.
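
A minimal sketch of how FreqDist behaves on a toy token list is shown below. The tokens and the threshold of 2 are assumptions chosen for illustration; the notebook itself uses a threshold of 10 on the full dataset.

```python
from nltk.probability import FreqDist

# Hypothetical toy corpus, already tokenized
tokens = ['how', 'do', 'i', 'learn', 'python', 'how', 'do', 'how']

freq = FreqDist(tokens)

print(freq['how'])          # 3
print(freq['java'])         # 0, unseen samples have a count of zero
print(freq.most_common(2))  # [('how', 3), ('do', 2)]

# Keep only the words that occur at least min_count times
min_count = 2
frequent_words = {word for word, count in freq.items() if count >= min_count}
print(sorted(frequent_words))  # ['do', 'how']
```

Since FreqDist behaves like a dictionary from samples to counts, filtering by frequency is a plain dictionary traversal.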

In [9]:
# Remove low frequency words
freq_words = FreqDist(cleaned_data)

cleaned_data = { key : value for key, value in freq_words.items() if value > 10 }

# Keep only the words that passed the frequency threshold
filtered_data = [
    [word for word in sent if word in cleaned_data]
    for sent in tokenized_words_train
]

NLTK also provides a list of stop words, which are very frequent words in a language that carry little meaning on their own.

For example, the English stop-word list begins as follows:

In [10]:
stop_words = list(stopwords.words('english'))
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

We now remove these words: because they appear in almost every sentence, they are unlikely to have much impact on the classification.

In [11]:
# Remove stop words
# Membership tests on a set are much faster than on a list
stop_word_set = set(stop_words)

filtered_data_no_stopwords = [
    [word for word in sent if word not in stop_word_set]
    for sent in filtered_data
]

filtered_data = filtered_data_no_stopwords
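
The same filtering step can be tried on a toy example. The sketch below uses a tiny hand-picked stop-word set as a stand-in for the NLTK list, so it runs without downloading the stopwords corpus; the sentences and all names prefixed with toy_ are our own assumptions.

```python
# Small hand-picked stand-in for stopwords.words('english')
toy_stop_words = {'what', 'is', 'the', 'of', 'a'}

# Hypothetical tokenized, lowercased questions
toy_sentences = [
    ['what', 'is', 'the', 'capital', 'of', 'france'],
    ['is', 'python', 'a', 'good', 'language'],
]

# Drop every token that appears in the stop-word set
toy_filtered = [
    [word for word in sent if word not in toy_stop_words]
    for sent in toy_sentences
]

print(toy_filtered)  # [['capital', 'france'], ['python', 'good', 'language']]
```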

<b>Bag-of-Words</b> is a very intuitive approach for converting the words into numerical values.

The approach follows 3 steps:

<ol>
<li>Splitting the documents into tokens</li>
<li>Assigning a weight to each token proportional to the frequency with which it shows up in the document and/or corpus.</li>
<li>Creating a document-term matrix with each row representing a document and each column representing a token.</li>
</ol>
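
The three steps above can be sketched in plain Python. This is only an illustration under simple assumptions (whitespace splitting, raw counts as weights, two made-up documents), not how scikit-learn implements it internally.

```python
from collections import Counter

# Hypothetical mini-corpus
documents = ["good movie", "bad movie bad"]

# Step 1: split the documents into tokens (naive whitespace split)
tokenized = [doc.split() for doc in documents]

# Step 2: weight each token by how often it occurs in its document
counts = [Counter(tokens) for tokens in tokenized]

# Step 3: build the document-term matrix, one row per document
# and one column per vocabulary term (sorted alphabetically)
vocabulary = sorted({token for tokens in tokenized for token in tokens})
doc_term_matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)       # ['bad', 'good', 'movie']
print(doc_term_matrix)  # [[0, 1, 1], [2, 0, 1]]
```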


The <b>CountVectorizer</b> counts the number of times a token shows up in the document and uses this value as its weight.

The <b>tokenizer</b> argument overrides the string tokenization step while preserving the preprocessing and n-grams generation steps. 
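
Because our documents are already lists of tokens, the tokenizer can simply return its input unchanged. The following sketch shows this on toy data; the documents and the toy_ names are assumptions for illustration. Passing token_pattern=None is meant to signal that the default regular expression is unused, and lowercase=False is needed because the default preprocessing expects strings rather than lists.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-tokenized documents
toy_train_docs = [['cheap', 'flights', 'paris'], ['cheap', 'hotels']]
toy_test_docs = [['cheap', 'cheap', 'trains']]


def identity(tokens):
    """Return the token list unchanged, since tokenization is already done."""
    return tokens


toy_vectorizer = CountVectorizer(
    tokenizer=identity,
    lowercase=False,
    token_pattern=None,
)

X_toy_train = toy_vectorizer.fit_transform(toy_train_docs)
X_toy_test = toy_vectorizer.transform(toy_test_docs)

# Vocabulary terms ordered by their column index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)              # ['cheap', 'flights', 'hotels', 'paris']
print(X_toy_train.toarray())  # rows: [1 1 0 1] and [1 0 1 0]
print(X_toy_test.toarray())   # row: [2 0 0 0]
```

Tokens that were not seen during fitting, such as 'trains' here, are ignored when transforming new documents.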

In [12]:
vectorizer = CountVectorizer(
    tokenizer = lambda sent: sent, 
    analyzer = 'word',
    lowercase=False
)

X_train = vectorizer.fit_transform(filtered_data)
X_test = vectorizer.transform(tokenized_words_test)

In [16]:
# Cross-validation
LR = LogisticRegression()

scores = cross_val_score(
    LR, 
    X_train, 
    train_labels, 
    cv = 5, 
    scoring = 'f1'
)

In [17]:
avg_score = np.sum(scores) / len(scores)
avg_score

0.49030520084606233