# NLP Techniques 
Based on: https://www.youtube.com/watch?v=M7SWr5xObkA (Keith Galli, PyCon 2020)

Additional Resources: 
* [Natural Language Toolkit - NLTK](https://www.nltk.org/)
* [Scikit-learn](https://scikit-learn.org/stable/index.html)
    * [scikit-learn text feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html)
    * [The Bag of Words representation](https://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation)
    * [CountVector](https://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage)
    * [scikit-learn SVM](https://scikit-learn.org/stable/modules/svm.html)
* [spaCy](https://spacy.io/)
* [Recurrent Neural Networks and Natural Language Processing](https://towardsdatascience.com/recurrent-neural-networks-and-natural-language-processing-73af640c2aa1)
* [Regex101](https://regex101.com/)
* [Regular Expression Cheat Sheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)
* [TextBlob](https://textblob.readthedocs.io/en/dev/)


## Bag of Words Model

In this model, a text is represented as a bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

After transforming the text into a "bag of words", we can calculate various measures to characterize the text. The most common feature calculated from the bag-of-words model is term frequency, namely the number of times a term appears in the text.

The Bag-of-words model is an orderless document representation — only the counts of words matter.
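
As a minimal sketch of this idea, using only Python's standard library and assuming a naive lowercase whitespace tokenizer (a simplification, not what scikit-learn does internally), two sentences containing the same words in a different order produce the same bag:

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count) using naive lowercase whitespace splitting."""
    return Counter(text.lower().split())

bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a["the"])    # multiplicity is kept
print(bag_a == bag_b)  # word order is not
```

The two sentences mean different things, yet a classifier trained on these bags cannot tell them apart.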




In [1]:
# Define some training utterances (clauses or inputs provided by the user)
# Define two classes, one for books and one for clothing

class Category:
  BOOKS = "BOOKS"
  CLOTHING = "CLOTHING"

# for each utterance, assign the appropriate class in order of the list

train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]
train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]

**Vectorization** is the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

With scikit-learn's text feature extraction, we can:

* **tokenize** strings and assign an integer id to each possible token, for instance by using whitespace and punctuation as token separators.
* **count** the occurrences of tokens in each document.
* **normalize** and **weight** tokens, with diminishing importance for tokens that occur in the majority of samples / documents.
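
The weighting step can be sketched by hand. The example below is an illustration under stated assumptions: the token pattern is modeled on scikit-learn's default (tokens of two or more word characters), and the formula is the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`. It computes only the IDF weights, not the full normalized TF-IDF matrix.

```python
import math
import re
from collections import Counter

docs = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]

# Tokens of two or more word characters (an assumption modeled on
# scikit-learn's default token pattern, so "i" and "a" are dropped).
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

# Document frequency: in how many documents does each token appear?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(tokenize(doc)))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

# Print tokens from lowest to highest weight
for tok in sorted(idf, key=idf.get):
    print(f"{tok:6s} df={doc_freq[tok]} idf={idf[tok]:.3f}")
```

A token such as "the", which appears in three of the four utterances, receives a lower weight than "shoes", which appears in only one.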



In [2]:
# Vectorizing will help extract numerical features from text content
# CountVectorizer implements both tokenization and occurrence counting in a single class

from sklearn.feature_extraction.text import CountVectorizer

# Variable for training utterances
train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]

# Create a Vectorizer Object
vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the utterances and returns the document-term count matrix
train_x_vectors = vectorizer.fit_transform(train_x)

# Get output feature names for transformation.
print(vectorizer.get_feature_names_out())

# print array for features compared to each utterance
print(train_x_vectors.toarray())

['book' 'fit' 'great' 'is' 'love' 'shoes' 'the' 'this']
[[1 0 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]


#### Train SVM Model

In [3]:
from sklearn import svm

# SVC (support vector classification)
clf_svm = svm.SVC(kernel='linear')

    
# fit() trains the estimator so that it can predict the classes of unseen samples
# It takes two arguments: our vectors and their categories
clf_svm.fit(train_x_vectors, train_y)


SVC(kernel='linear')

#### Test new utterance of the trained model

In [4]:
test_x = vectorizer.transform(['i love the book'])

clf_svm.predict(test_x)

array(['BOOKS'], dtype='<U8')

There are some drawbacks to the BOW model. For instance, if we change "book" to "story" in the above test, the prediction is "CLOTHING": "story" never appeared in the training utterances, so the vectorizer ignores it.
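
The reason can be sketched without scikit-learn. The snippet below is a simplified stand-in for `CountVectorizer`, not its actual implementation: it assumes a whitespace tokenizer (so, unlike the default vectorizer, it keeps one-letter tokens such as "i" and "a"), and `to_count_vector` is a hypothetical helper name.

```python
train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]

# Vocabulary learned from the training utterances only (naive whitespace tokens)
vocabulary = sorted({word for text in train_x for word in text.split()})

def to_count_vector(text):
    """Count known words; words outside the vocabulary have no column and are dropped."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

print(vocabulary)
print(to_count_vector("i love the story"))
print(to_count_vector("i love the"))
```

Both calls print the same vector, because "story" has no column to land in. The classifier therefore sees no evidence for either category beyond "i love the".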

In [5]:
# from wiki on BoW and n-gram model

from tensorflow import keras
from typing import List
from keras.preprocessing.text import Tokenizer

sentence = ["John likes to watch movies. Mary likes movies too."]

def print_bow(sentence: List[str]) -> None:
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentence)
    sequences = tokenizer.texts_to_sequences(sentence)
    word_index = tokenizer.word_index 
    bow = {}
    for key in word_index:
        bow[key] = sequences[0].count(word_index[key])

    print(f"Bag of word sentence 1:\n{bow}")
    print(f'We found {len(word_index)} unique tokens.')

print_bow(sentence)

Bag of word sentence 1:
{'likes': 2, 'movies': 2, 'john': 1, 'to': 1, 'watch': 1, 'mary': 1, 'too': 1}
We found 7 unique tokens.
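
The n-gram variant mentioned above can be sketched with the standard library alone. This is an illustrative example, assuming naive lowercase whitespace tokens with periods stripped; `ngram_counts` is a hypothetical helper, not part of Keras or scikit-learn.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count contiguous n-token sequences using naive lowercase whitespace tokens."""
    tokens = text.lower().replace(".", "").split()
    # Zip the token list against shifted copies of itself to form n-grams.
    # Because periods are stripped, n-grams may span sentence boundaries.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)

sentence = "John likes to watch movies. Mary likes movies too."
print(ngram_counts(sentence, 1))
print(ngram_counts(sentence, 2))
```

With `n=2`, "likes movies" is a feature but "movies likes" is not, so a bag of bigrams retains some local word order that a plain bag of words discards.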


# Word Vectors
Word embeddings, or word vectorization, is a methodology in NLP for mapping words or phrases from a vocabulary to corresponding vectors of real numbers, which are then used for word prediction and for measuring word similarity/semantics.

* **continuous bag of words (CBOW model)**: The distributed representations of context (or surrounding words) are combined to predict the word in the middle. 
* **skip gram**: The distributed representation of the input word is used to predict the context. 

Both look at a window of text (e.g., "Best book I've ever read") and, for each token, use the tokens that surround it. For example, "book" and "read", or "story" and "characters", might end up close together in vector space. Then we can start building out bigger relationships.
<br>
[NLP 101: Word2Vec — Skip-gram and CBOW](https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-93512ee24314#:~:text=Continuous%20Bag%20of%20Words%20Model%20(CBOW)%20and%20Skip%2Dgram&text=In%20the%20CBOW%20model%2C%20the,used%20to%20predict%20the%20context%20.)
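
The windowing idea behind both models can be sketched as follows. This only builds the training examples; it does not train any embeddings. The window size of 2 and the helper name `context_windows` are assumptions made for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "best book i have ever read".split()

# CBOW: the context words together are the input, the middle word is the label
cbow_examples = [(context, target) for target, context in context_windows(tokens)]

# Skip-gram: the middle word is the input, each context word is a separate label
skipgram_examples = [
    (target, c) for target, context in context_windows(tokens) for c in context
]

print(cbow_examples[1])
print(skipgram_examples[:3])
```

The same windows feed both models; only the direction of prediction differs.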

spaCy: https://spacy.io/

In [None]:
!pip install spacy
!python -m spacy download en_core_web_md



In [None]:
import spacy 
nlp = spacy.load("en_core_web_md")

In [8]:
print(train_x)

['i love the book', 'this is a great book', 'the fit is great', 'i love the shoes']


In [9]:
docs = [nlp(text) for text in train_x]
train_x_word_vectors = [x.vector for x in docs]

In [10]:
from sklearn import svm 

clf_svm_wv = svm.SVC(kernel='linear')
clf_svm_wv.fit(train_x_word_vectors, train_y)

SVC(kernel='linear')

In [11]:
# test_x = ["I love the book"]
# test_x = ["I love the story"]
test_x = ["I love the purse"]
test_docs = [nlp(text) for text in test_x]
test_x_word_vectors =  [x.vector for x in test_docs]

clf_svm_wv.predict(test_x_word_vectors)

array(['CLOTHING'], dtype='<U8')

A note about word vectors: if we had 10 different categories and longer sentences, the meanings might get lost and become less precise. Another drawback is words with multiple meanings, such as "check" (e.g., "I went to the bank and wrote a check" and "let me check that out").
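
One way to see why longer, mixed sentences lose precision: spaCy's `doc.vector` defaults to an average of the token vectors, and averaging blends topics together. The sketch below uses hypothetical 2-d vectors invented purely for illustration (real models use hundreds of dimensions).

```python
import math

# Hypothetical 2-d word vectors, invented for illustration only
toy_vectors = {
    "book":  [0.9, 0.1],
    "story": [0.8, 0.2],
    "shoes": [0.1, 0.9],
    "love":  [0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_vector(words):
    """Average the vectors of known words, the same idea as a document vector."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

print(cosine(toy_vectors["book"], toy_vectors["story"]))  # close in vector space
print(cosine(toy_vectors["book"], toy_vectors["shoes"]))  # far apart
print(average_vector(["love", "book", "shoes"]))          # topics blend together
```

In this toy setup, a sentence that mentions both "book" and "shoes" averages out to a vector that points at neither category.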

# Working with Text

## Regex (Regular Expressions)
* Pattern matching of strings 
* Can be used to add, remove, isolate, and manipulate all kinds of text and data.
* Regex Cheatsheet: https://cheatography.com/davechild/cheat-sheets/regular-expressions/
* Regex101: https://regex101.com/

In [12]:
# re.match only looks for a match starting at the beginning of the string
# use re.search to check if a regex matches anywhere in a string (e.g., when the entirety doesn't need to match)
import re

regexp = re.compile(r"^ab[^\s]*cd$")
phrases = ["abcd", "xxx", "abxxcd", "ab cd"]
matches = []

for phrase in phrases:
  if re.match(regexp, phrase):
    matches.append(phrase)

print(matches)

['abcd', 'abxxcd']
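
The difference between the matching functions is easier to see without the `^` and `$` anchors. A short sketch using the same pattern body (`\S` is equivalent to `[^\s]`):

```python
import re

pattern = re.compile(r"ab\S*cd")  # no ^ or $ anchors this time

text = "xx abxxcd yy"
print(bool(pattern.match(text)))           # False: match is anchored at the start
print(bool(pattern.search(text)))          # True: search scans the whole string
print(bool(pattern.fullmatch(text)))       # False: the entire string must match
print(bool(pattern.fullmatch("abxxcd")))   # True
```

With the anchors in place, as in the cell above, `re.match` and `re.search` give the same result for these single-line phrases.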


In [13]:
# connecting regex to the work we've already done 

# pipe is the "or" sign here
# \b matches a word boundary
regexp = re.compile(r"\bread\b|\bstory\b|\bbook\b")

phrases = ["I liked that story.", "I like that book", "this hat is nice"]

matches = []
for phrase in phrases:
  if re.search(regexp, phrase):
    matches.append(phrase)

print(matches)

['I liked that story.', 'I like that book']
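
One possible way to combine such a pattern with a trained model is a keyword rule that runs first and defers to the model otherwise. This is a sketch, not the approach from the talk: `classify_with_rules` and `always_clothing` are hypothetical names, and the fallback stands in for a call such as `clf_svm.predict`.

```python
import re

# re.IGNORECASE lets "Book" and "BOOK" match as well
BOOK_PATTERN = re.compile(r"\b(?:read|story|book)\b", re.IGNORECASE)

def classify_with_rules(phrase, fallback):
    """Return "BOOKS" on a keyword hit, otherwise defer to the fallback callable."""
    if BOOK_PATTERN.search(phrase):
        return "BOOKS"
    return fallback(phrase)

# Hypothetical stand-in for a trained model's prediction
def always_clothing(phrase):
    return "CLOTHING"

print(classify_with_rules("I love the Story", always_clothing))
print(classify_with_rules("this hat is nice", always_clothing))
print(classify_with_rules("I already booked it", always_clothing))
```

The last phrase falls through to the fallback because `\b` requires a word boundary after "book", and "booked" does not provide one.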


## Stemming and Lemmatization
* Stemming: a process that strips the last few characters from a word, often producing a form that is not a correctly spelled word (e.g., "Caring" to "Car").
* Lemmatization: considers the context and converts the word to its meaningful base form, which is called the lemma (e.g., "Caring" to "Care").

Above, especially with BOW, "book" and "books" are treated differently. However, we can get a "base" word to work with. 
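
To make the stemming idea concrete, here is a deliberately crude suffix stripper. It is a toy rule set written for illustration (the suffix list and the three-character minimum are arbitrary assumptions) and is far simpler than a real algorithm such as the Porter stemmer.

```python
def naive_stem(word):
    """Strip a few common suffixes. A toy rule set, far cruder than the Porter stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["books", "book", "reading", "caring"]:
    print(word, "->", naive_stem(word))
```

"books" and "book" now map to the same token, which is what a BOW model needs, but "caring" collapses to "car", which is the kind of error lemmatization is meant to avoid.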

The NLTK Library is helpful for this task.

In [None]:
import nltk

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')

#### Stemming

In [15]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

phrase = "reading the books"
words = word_tokenize(phrase)

stemmed_words = []
for word in words:
  stemmed_words.append(stemmer.stem(word))

" ".join(stemmed_words)

'read the book'

#### Lemmatizing

In [16]:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

phrase = "reading the books"
words = word_tokenize(phrase)

lemmatized_words = []
for word in words:
  # pos='v' lemmatizes every token as a verb; the default is pos='n' (noun)
  lemmatized_words.append(lemmatizer.lemmatize(word, pos='v'))

" ".join(lemmatized_words)

'read the book'

## Removing Stop Words 
#### Tokenize, then remove stop words

In [17]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

phrase = "Here is an example sentence demonstrating the removal of stopwords"

words = word_tokenize(phrase)

stripped_phrase = []
for word in words:
  if word not in stop_words:
    stripped_phrase.append(word)

" ".join(stripped_phrase)

'Here example sentence demonstrating removal stopwords'
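
Note that "Here" survives in the output: NLTK's stop word list is lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows, using a small hand-picked stop list (an assumption for illustration; NLTK's English list is much longer) and plain whitespace splitting instead of `word_tokenize`.

```python
# A small hand-picked stop list for illustration only
stop_words = {"here", "is", "an", "the", "of"}

phrase = "Here is an example sentence demonstrating the removal of stopwords"

# Compare the lowercased token against the stop list, but keep the original casing
kept = [word for word in phrase.split() if word.lower() not in stop_words]

print(" ".join(kept))
```

Using a set rather than a list for `stop_words` also makes each membership test constant time on average.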

# Various Other Techniques (Spell Correction, Sentiment, POS Tagging)

In [None]:
!python -m textblob.download_corpora

In [19]:
from textblob import TextBlob

phrase = "the book was horrible"

tb_phrase = TextBlob(phrase)

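# Spell correction: returns a new TextBlob with likely misspellings fixed
tb_phrase.correct()

# Part-of-speech tags as (word, tag) tuples
tb_phrase.tags

# Polarity lies in [-1.0, 1.0] and subjectivity in [0.0, 1.0]
# Only this last expression is displayed as the cell output
tb_phrase.sentiment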

Sentiment(polarity=-1.0, subjectivity=1.0)

## Transformer Architecture
#### Setup


In [None]:
!pip install spacy-transformers
!python -m spacy download en_core_web_lg

### Using Spacy to utilize BERT Model

In [40]:
import spacy
import torch
import spacy_transformers

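# Note: en_core_web_lg is a pipeline with static word vectors, not a transformer.
# spaCy's transformer-based English pipeline is en_core_web_trf.
nlp = spacy.load("en_core_web_lg")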
doc = nlp("Here is some text to encode.")

In [41]:
class Category:
  BOOKS = "BOOKS"
  BANK = "BANK"

train_x = ["good characters and plot progression", "check out the book", "good story. would recommend", "novel recommendation", "need to make a deposit to the bank", "balance inquiry savings", "save money"]
train_y = [Category.BOOKS, Category.BOOKS, Category.BOOKS, Category.BOOKS, Category.BANK, Category.BANK, Category.BANK]
     

In [43]:
from sklearn import svm

docs = [nlp(text) for text in train_x]
train_x_vectors = [doc.vector for doc in docs]
clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

test_x = ["check this story out"]
docs = [nlp(text) for text in test_x]
test_x_vectors = [doc.vector for doc in docs]

clf_svm.predict(test_x_vectors)

array(['BOOKS'], dtype='<U5')