# SUMMARY

#### NLP Fundamentals
* Bag-of-Words
* Word Vectors
##### Misc NLP Techniques
* Regexes 
* Stemming & Lemmatization
#### State of the Art
* Recurrent Neural Nets
* Attention is all you need & Transformer Architecture
* OpenAI GPT, BERT, etc.

## Set up 
### Install necessary libraries & download models

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
import spacy
import re
import nltk
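
The imports above assume the libraries are already installed. As a minimal sketch, the check below lists which of them cannot be found in the current environment. `missing_packages` is a hypothetical helper, not part of any of these libraries. The later cells also assume that the NLTK resources they rely on (such as `punkt`, `wordnet`, and `stopwords`) have been fetched with `nltk.download`, and that the spaCy model `en_core_web_md` has been downloaded.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names used in this notebook (sklearn is installed as scikit-learn).
required = ["sklearn", "spacy", "nltk", "textblob"]
print(missing_packages(required))
```

An empty list means every package was found; anything listed still needs to be installed.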

### Define some training utterances

In [2]:
class Category:
  BOOKS = "BOOKS"
  CLOTHING = "CLOTHING"

train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]
train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING] # label for each utterance: books or clothing

## Bag-of-Words Model

#### Basic Concept: Fit a vectorizer to transform text into bag-of-words vectors

In [3]:
vectorizer = CountVectorizer(binary = True)
train_x_vectors = vectorizer.fit_transform(train_x)

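print(vectorizer.get_feature_names_out()) # the vocabulary: which words were extracted from the training text
print(train_x_vectors.toarray()) # one row per utterance: 1 if the vocabulary word appears in it, else 0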

['book' 'fit' 'great' 'is' 'love' 'shoes' 'the' 'this']
[[1 0 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]
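
To make the output above concrete, here is a rough pure-Python sketch of what a binary bag-of-words transform does. It is not scikit-learn's implementation. The function names are made up for illustration, and the tokenization rule (lowercase, tokens of two or more word characters) is an assumption that mirrors the default behaviour of `CountVectorizer`.

```python
import re

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # which is why the single-letter words "i" and "a" disappear.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(texts):
    # Sorted list of every distinct token seen in the training texts.
    return sorted({token for text in texts for token in tokenize(text)})

def to_binary_vector(text, vocabulary):
    # 1 if the vocabulary word occurs in the text, 0 otherwise.
    tokens = set(tokenize(text))
    return [1 if word in tokens else 0 for word in vocabulary]

train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]
vocabulary = fit_vocabulary(train_x)
vectors = [to_binary_vector(text, vocabulary) for text in train_x]

print(vocabulary)
for row in vectors:
    print(row)
```

Each row has one slot per vocabulary word, so word order within the utterance is lost; only presence or absence is recorded.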


#### Train SVM Model

In [4]:
# train a linear SVM on the bag-of-words vectors
clf_svm = svm.SVC(kernel = 'linear')
clf_svm.fit(train_x_vectors, train_y)

SVC(kernel='linear')

In [5]:
# test clf_svm on an unseen utterance
test_x = vectorizer.transform(['i like the book']) # the string to classify

clf_svm.predict(test_x)

array(['BOOKS'], dtype='<U8')

* Limitation: the model cannot handle words that do not appear in the training data; they are ignored when the text is vectorized.
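
A small sketch of this limitation, assuming the vocabulary printed earlier and a simplified whitespace tokenizer (`to_binary_vector` is a hypothetical helper, not a scikit-learn function):

```python
# Vocabulary learned from the four training utterances.
vocabulary = ['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']

def to_binary_vector(text, vocabulary):
    # Simplified tokenization: lowercase and split on whitespace.
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

seen = to_binary_vector("i love the book", vocabulary)
unseen = to_binary_vector("i adore the novel", vocabulary)

print(seen)    # 'book', 'love', and 'the' are set
print(unseen)  # only 'the' is set; 'adore' and 'novel' leave no trace
```

The second sentence means much the same as the first, yet its vector carries almost no information the classifier can use.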

## Word Vectors / Word Embedding
Capture the semantic meaning of a word in a vector.

In [6]:
nlp = spacy.load("en_core_web_md")

In [7]:
print(train_x)

['i love the book', 'this is a great book', 'the fit is great', 'i love the shoes']


In [8]:
# Doc.vector averages the individual word embeddings of the text
docs = [nlp(text) for text in train_x] # parse each utterance into a spaCy Doc
train_x_word_vectors = [x.vector for x in docs]
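
The averaging step can be sketched without spaCy. The embeddings below are invented 3-dimensional values chosen for easy arithmetic, and skipping unknown words is a simplification made for this example, so treat it as an illustration of the idea rather than of spaCy's internals.

```python
# Hypothetical 3-dimensional embeddings; real model vectors have far more dimensions.
toy_embeddings = {
    "love": [1.0, 0.0, 2.0],
    "book": [3.0, 2.0, 0.0],
}

def average_vector(words, embeddings):
    # Collect the vectors of the words we have embeddings for.
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return []
    dims = len(vectors[0])
    # Average each dimension across all collected vectors.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

doc_vector = average_vector(["love", "book"], toy_embeddings)
print(doc_vector)  # [2.0, 1.0, 1.0]
```

Because the result has a fixed length regardless of how many words the text contains, it can be passed straight to a classifier.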

In [9]:
# train a linear SVM on the averaged word vectors
clf_svm_wv = svm.SVC(kernel='linear')
clf_svm_wv.fit(train_x_word_vectors, train_y)

SVC(kernel='linear')

In [10]:
# test clf_svm_wv
test_x = ["I went to the bank and wrote a check", "let me check that out"]
test_docs = [nlp(text) for text in test_x]
test_x_word_vectors =  [x.vector for x in test_docs]

clf_svm_wv.predict(test_x_word_vectors)

array(['BOOKS', 'BOOKS'], dtype='<U8')

## Regexes / Regular Expression
Regexes perform pattern matching on strings; Python provides them through the `re` module. For example, 123-123-1234 and 555-555-5555 are both phone numbers, and a single regex can identify them.

Typical uses include password checkers and phone number or email validation.
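
As a minimal sketch of the phone number case, assuming numbers always follow the `ddd-ddd-dddd` layout used above (the sample text and variable names are made up):

```python
import re

# Three digits, three digits, four digits, separated by hyphens.
phone_pattern = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

text = "Call 123-123-1234 or 555-555-5555, but not 12-34-5678."
phone_numbers = phone_pattern.findall(text)

print(phone_numbers)  # ['123-123-1234', '555-555-5555']
```

Real-world phone formats vary much more than this, so a production pattern would need to be broader.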

In [11]:
regexp = re.compile(r"\bread\b|\bstory\b|book")

phrases = ["I liked that story.", "the car treaded up the hill", "this hat is nice"]

matches = []
for phrase in phrases:
  if re.search(regexp, phrase):
    matches.append(phrase)

print(matches)

['I liked that story.']
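
The `\b` word boundaries are what keep "treaded" out of the matches. A short sketch comparing the pattern with and without them (the variable names are illustrative):

```python
import re

loose = re.compile(r"read")       # matches "read" anywhere, even inside a word
strict = re.compile(r"\bread\b")  # matches "read" only as a whole word

phrase = "the car treaded up the hill"

loose_hit = loose.search(phrase) is not None
strict_hit = strict.search(phrase) is not None

print(loose_hit)   # True: "read" occurs inside "treaded"
print(strict_hit)  # False: there is no standalone word "read"
```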


## Stemming / Lemmatization

#### Stemming

In [14]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

phrase = "reading the books"
words = word_tokenize(phrase)

stemmed_words = []
for word in words:
  stemmed_words.append(stemmer.stem(word))

" ".join(stemmed_words)

'read the book'

#### Lemmatization

In [17]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

phrase = "reading the books"
words = word_tokenize(phrase)

lemmatized_words = []
for word in words:
  lemmatized_words.append(lemmatizer.lemmatize(word, pos='v'))

" ".join(lemmatized_words)

'read the book'
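
Both approaches give the same result on this phrase, but they work differently: stemming strips suffixes by rule, while lemmatization looks up a dictionary form. The toy sketch below illustrates that difference. Its suffix list and lemma table are invented for this example and are far smaller than what `PorterStemmer` and `WordNetLemmatizer` actually use.

```python
# Hypothetical suffix rules, tried in order.
SUFFIXES = ("ing", "es", "s")

# Hypothetical dictionary of word forms and their lemmas.
LEMMA_TABLE = {"reading": "read", "books": "book", "stories": "story", "was": "be"}

def naive_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def naive_lemmatize(word):
    # Look the word up; fall back to the word itself if it is unknown.
    return LEMMA_TABLE.get(word, word)

for word in ["reading", "books", "stories", "was"]:
    print(word, naive_stem(word), naive_lemmatize(word))
```

Rule-based stripping can produce non-words such as "stori" and cannot relate "was" to "be", whereas a lookup can, at the cost of needing a dictionary.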

## Stopwords
Stopwords are the most common words in a language; they carry little meaning on their own, so they are often removed before further processing.

In [18]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words('english') # list of common English words

phrase = "Here is an example sentence demonstrating the removal of stopwords"

words = word_tokenize(phrase)

stripped_phrase = []
for word in words:
  if word not in stop_words:
    stripped_phrase.append(word)

" ".join(stripped_phrase)

'Here example sentence demonstrating removal stopwords'
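
In the output above, "Here" survives because the comparison is case-sensitive and the stop word list is lowercase. One possible fix is to lowercase each word before the lookup. The sketch below uses a tiny hand-written stop word set and whitespace splitting as stand-ins for the NLTK list and tokenizer, so it runs without any downloads.

```python
# Hypothetical, tiny stop word set standing in for the NLTK list.
stop_words = {"here", "is", "an", "the", "of"}

phrase = "Here is an example sentence demonstrating the removal of stopwords"

# Lowercase only for the comparison, so the kept words retain their casing.
kept = [word for word in phrase.split() if word.lower() not in stop_words]
result = " ".join(kept)

print(result)  # example sentence demonstrating removal stopwords
```

Using a set rather than a list also makes each membership test faster, which matters on longer texts.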

## Various other techniques (spell correction, sentiment, & POS tagging)

In [29]:
from textblob import TextBlob

phrase = "the book was horrible"

tb_phrase = TextBlob(phrase)

In [30]:
# correct() attempts to fix spelling
# e.g. phrase = "the book was horribler" could give the same output
tb_phrase.correct() 

TextBlob("the book was horrible")

In [33]:
# part-of-speech tags
tb_phrase.tags

[('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('horrible', 'JJ')]

In [32]:
# sentiment analysis
tb_phrase.sentiment

Sentiment(polarity=-1.0, subjectivity=1.0)
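
A polarity score like the one above can be approximated with a word lexicon. The sketch below is a deliberately simplified illustration of a lexicon-based approach, not TextBlob's actual algorithm. The lexicon and its scores are invented for this example.

```python
# Hypothetical polarity lexicon: -1.0 is most negative, 1.0 is most positive.
toy_lexicon = {"horrible": -1.0, "great": 1.0, "nice": 0.5}

def toy_polarity(text, lexicon):
    # Average the scores of the words that appear in the lexicon.
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    if not scores:
        return 0.0  # no opinion words found, treat as neutral
    return sum(scores) / len(scores)

negative = toy_polarity("the book was horrible", toy_lexicon)
neutral = toy_polarity("the book was on the table", toy_lexicon)

print(negative)  # -1.0
print(neutral)   # 0.0
```

A sketch like this ignores negation ("not horrible") and intensifiers ("very"), which a fuller analyzer would need to handle.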