Anything we express, verbally or in writing, carries a huge amount of information, and that information can be used to predict human behaviour. In this notebook I'll try to explain how Natural Language Processing can be used to look at multiple facets of data and extract meaningful information from it.


# NLP
Natural Language Processing is a field of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human languages.

In simple words, NLP represents the automatic handling of natural human language like speech or text.

## Use cases of NLP
Common use cases include:
- Sentiment analysis
- Cognitive assistants
- Identifying fake news and spam
- Personalized chatbots

# NLP with spaCy

spaCy is one of the leading libraries for NLP, and it has quickly become one of the most popular Python frameworks. Most people find it intuitive, and it has excellent [documentation](https://spacy.io/usage).

spaCy relies on models that are language-specific and come in different sizes. You can load a spaCy model with `spacy.load`.

For example, here's how you would load the English language model.



In [None]:
import spacy
# Load the small English model by its full package name
nlp = spacy.load('en_core_web_sm')

With the model loaded, you can process text like this:

In [None]:
doc = nlp("Tea is healthy and calming, don't you think?")

There's a lot you can do with the doc object you just created.

# Tokenizing
Calling `nlp` on a piece of text returns a document object that contains tokens. A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't".

You can see the tokens by iterating through the document.

In [None]:
for token in doc:
    print(token)

Why are lemmas and stop words important? A lemma is the base form of a word, and stop words are words that occur frequently but carry little information, such as "is" and "and". Language data has a lot of noise mixed in with informative content. In the sentence above, the important words are "tea", "healthy", and "calming". Removing stop words might help a predictive model focus on the relevant words. Lemmatizing similarly helps by combining multiple forms of the same word into one base form ("calming", "calms", and "calmed" would all change to "calm").

However, lemmatizing and dropping stop words might also make your models perform worse, so you should treat this preprocessing as part of your hyperparameter optimization process.
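
As a rough illustration of the idea, the sketch below filters stop words and maps word forms to a base form in plain Python. The `STOP_WORDS` set and the `LEMMAS` table are tiny hypothetical stand-ins invented for this example, not spaCy's data; in spaCy the corresponding information is available on each token as `token.is_stop` and `token.lemma_`.

```python
# Toy stand-ins (assumptions for illustration, not spaCy's actual data)
STOP_WORDS = {"is", "and", "do", "n't", "you", "think"}
LEMMAS = {"calming": "calm", "calms": "calm", "calmed": "calm"}

def preprocess(tokens):
    """Lowercase tokens, drop stop words and punctuation, map to base forms."""
    cleaned = []
    for tok in tokens:
        word = tok.lower()
        # Skip stop words and tokens with no letters or digits (punctuation)
        if word in STOP_WORDS or not any(ch.isalnum() for ch in word):
            continue
        # Fall back to the word itself when no base form is listed
        cleaned.append(LEMMAS.get(word, word))
    return cleaned

# The example sentence, already split the way the tokenizer section describes
tokens = ["Tea", "is", "healthy", "and", "calming", ",",
          "do", "n't", "you", "think", "?"]
print(preprocess(tokens))  # ['tea', 'healthy', 'calm']
```

Only the three informative words survive, which is the effect the preprocessing is meant to have.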

# Pattern Matching
Another common NLP task is matching tokens or phrases within chunks of text or whole documents. You can do pattern matching with regular expressions, but spaCy's matching capabilities tend to be easier to use.

To match individual tokens, you create a `Matcher`. When you want to match a list of terms, it's easier and more efficient to use `PhraseMatcher`. For example, if you want to find where different smartphone models show up in some text, you can create patterns for the model names of interest. First you create the `PhraseMatcher` itself.

In [None]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

The matcher is created using the vocabulary of your model. Here we're using the small English model you loaded earlier. Setting `attr='LOWER'` makes the matcher compare lowercased text, which gives you case-insensitive matching.

Next you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the `nlp` model.

In [None]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
# Register all patterns under one match ID (list form, supported from spaCy v2.2.2 onward)
matcher.add("TerminologyList", patterns)

Then you create a document from the text to search and use the phrase matcher to find where the terms occur in the text.

In [None]:
# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
print(matches)

Each match is a tuple of the match ID and the token positions where the phrase starts and ends.

In [None]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])
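
For comparison, here is a minimal sketch of the regular-expression approach mentioned at the start of this section, using only Python's standard `re` module. The variable names `text`, `pattern`, and `regex_matches` are chosen for this example. Note one difference from the phrase matcher: the positions here are character offsets into the string, not token indices.

```python
import re

terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
text = ("Glowing review overall, and some really interesting side-by-side "
        "photography tests pitting the iPhone 11 Pro against the "
        "Galaxy Note 10 Plus and last year's iPhone XS and Google Pixel 3.")

# Escape each term and join them into one case-insensitive alternation
pattern = re.compile("|".join(re.escape(t) for t in terms), flags=re.IGNORECASE)

# Collect (matched phrase, start offset, end offset) for every hit
regex_matches = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
for phrase, start, end in regex_matches:
    print(phrase, start, end)
```

This works for a short list of fixed terms, but the pattern has to be assembled and escaped by hand, which is part of why the phrase matcher tends to be more convenient.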

# Text Classification with spaCy
A common task in NLP is text classification. This is "classification" in the conventional machine learning sense, and it is applied to text. Examples include spam detection, sentiment analysis, and tagging customer queries.

In this tutorial, you'll learn text classification with spaCy. The classifier will detect spam messages, a common functionality in most email clients. Here is an overview of the data you'll use:

In [None]:
import pandas as pd
# Load the spam data
spam = pd.read_csv('../input/nlp-course/spam.csv')
# Preview the first 10 rows
spam.head(10)

## Bag of Words
Machine learning models don't learn from raw text data. Instead, you need to convert the text to something numeric.

The simplest common representation is a variation of one-hot encoding. You represent each document as a vector of term frequencies for each term in the vocabulary. The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents).

As an example, take the sentences "Tea is life. Tea is love." and "Tea is healthy, calming, and delicious." as our corpus. The vocabulary then is `{"tea", "is", "life", "love", "healthy", "calming", "and", "delicious"}` (ignoring punctuation).

For each document, count up how many times a term occurs, and place that count in the appropriate element of a vector. The first sentence has "tea" twice and that is the first position in our vocabulary, so we put the number 2 in the first element of the vector. Our sentences as vectors then look like

\begin{align}
v_1 &= \left[\begin{matrix} 2 & 2 & 1 & 1 & 0 & 0 & 0 & 0 \end{matrix}\right] \\
v_2 &= \left[\begin{matrix} 1 & 1 & 0 & 0 & 1 & 1 & 1 & 1 \end{matrix}\right]
\end{align}
 
This is called the **bag of words** representation. You can see that documents with similar terms will have similar vectors. Vocabularies frequently have tens of thousands of terms, so these vectors can be very large.
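
The counting described above can be sketched in a few lines of plain Python. The `tokenize` helper is a naive stand-in written for this example (it lowercases the text and keeps runs of letters), not spaCy's tokenizer, and the vocabulary is ordered by first appearance so that it lines up with the vectors shown above.

```python
import re
from collections import Counter

corpus = ["Tea is life. Tea is love.",
          "Tea is healthy, calming, and delicious."]

def tokenize(text):
    # Naive tokenizer (assumption): lowercase and keep alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance across the corpus
vocab = []
for doc_text in corpus:
    for term in tokenize(doc_text):
        if term not in vocab:
            vocab.append(term)

def bag_of_words(text):
    # Count each term, then read the counts off in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocab]

vectors = [bag_of_words(doc_text) for doc_text in corpus]
print(vocab)
for vector in vectors:
    print(vector)
```

The two printed vectors should match $v_1$ and $v_2$ above.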

Another common representation is **TF-IDF (Term Frequency - Inverse Document Frequency)**. TF-IDF is similar to bag of words, except that each term count is scaled by the inverse of the term's document frequency, so terms that appear in many documents of the corpus receive less weight.
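
To make the scaling concrete, here is a small sketch on the same two-sentence corpus. It assumes one common unsmoothed variant of the formula, term count times $\log(N / \text{df})$, where $N$ is the number of documents and df is the number of documents containing the term; libraries often use smoothed or normalized variants, so the exact numbers will differ. The `tokenize` helper is again a naive stand-in for this example.

```python
import math
import re
from collections import Counter

corpus = ["Tea is life. Tea is love.",
          "Tea is healthy, calming, and delicious."]

def tokenize(text):
    # Naive tokenizer (assumption): lowercase and keep alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

doc_tokens = [tokenize(doc_text) for doc_text in corpus]
n_docs = len(doc_tokens)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in doc_tokens for term in set(tokens))

def tfidf(tokens):
    # Term count scaled by log(N / df), an unsmoothed variant chosen for clarity
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

weights = [tfidf(tokens) for tokens in doc_tokens]
for weight in weights:
    print(weight)
```

Because "tea" and "is" occur in both documents, their weight drops to zero under this variant, while terms unique to one document keep a positive weight.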

## Building a Bag of Words model
Once you have your documents in a bag of words representation, you can use those vectors as input to any machine learning model. spaCy handles the bag of words conversion and building a simple linear model for you with the `TextCategorizer` class.

The TextCategorizer is a spaCy **pipe**. Pipes are classes for processing and transforming tokens. When you create a spaCy model with `nlp = spacy.load('en_core_web_sm')`, there are default pipes that perform part-of-speech tagging, entity recognition, and other transformations. When you run text through a model with `doc = nlp("Some text here")`, the output of the pipes is attached to the tokens in the `doc` object. The lemmas for `token.lemma_` come from one of these pipes.

You can remove or add pipes to models. What we'll do here is create an empty model without any pipes (other than a tokenizer, since all models always have a tokenizer). Then, we'll create a TextCategorizer pipe and add it to the empty model.

In [None]:
import spacy

# Create an empty model
nlp = spacy.blank('en')

# Create the TextCategorizer with exclusive classes and "bow" architecture
# (spaCy v2 API; in spaCy v3, nlp.add_pipe takes the component name as a string instead)
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True, "architecture":"bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

Since the classes are either ham or spam, we set `"exclusive_classes"` to `True`. We've also configured it with the bag of words (`"bow"`) architecture. spaCy provides a convolutional neural network architecture as well, but it's more complex than you need for now.

Next we'll add the labels to the model. Here "ham" is for the real messages and "spam" is for the spam messages.

In [None]:
# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")

## Training the Text Categorizer Model
Next, you'll convert the labels in the data to the form TextCategorizer requires. For each document, you'll create a dictionary of boolean values for each class.

For example, if a text is "ham", we need a dictionary `{'ham': True, 'spam': False}`. The model is looking for these labels inside another dictionary with the key `'cats'`.

In [None]:
train_texts = spam['text'].values
train_labels = [{'cats':{'ham':label == 'ham', 'spam':label == 'spam'}}
               for label in spam['label']]

Then we combine the texts and labels into a single list.

In [None]:
train_data = list(zip(train_texts, train_labels))
train_data[:3]
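
To see the shape of the training data without loading the CSV, the sketch below repeats the same two steps on two made-up messages. The names `toy_texts`, `toy_labels`, `toy_annotations`, and `toy_data`, and the messages themselves, are hypothetical and not taken from the spam dataset.

```python
# Hypothetical stand-ins for spam['text'] and spam['label']
toy_texts = ["Are we still on for lunch?", "WINNER! Claim your free prize now"]
toy_labels = ["ham", "spam"]

# One {'cats': {...}} dictionary per message, with a boolean for each class
toy_annotations = [{'cats': {'ham': label == 'ham', 'spam': label == 'spam'}}
                   for label in toy_labels]

# Pair each text with its annotation, as in train_data
toy_data = list(zip(toy_texts, toy_annotations))

for text, annotation in toy_data:
    # With exclusive classes, exactly one category should be True
    assert sum(annotation['cats'].values()) == 1
    print(text, annotation)
```

Each element is a `(text, annotation)` tuple, and checking that exactly one class is `True` is a cheap sanity test before training.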