Natural Language Processing (NLP)

* Objectives:
    * What are stop words and where do we get some?
    * What does it mean to stem or lemmatize your text?
    * What is an $n$-gram and why might they help?
    * Term frequency is useful but what are some of the issues?
    * Processing text via vectorization and TF-IDF
    * Understanding document similarity
    * Knowing basic usage examples of NLP
    * Explaining the word2vec algorithm
    * Applying a Naive Bayes model to text data
    * Understanding the Laplace Smoothing technique
    * Knowing an example of multi-class classification

1) What is NLP?
* Use Cases:
    * Conversational Agents:
        * Siri, Cortana, Google Now, Alexa
        * Talking to your car
        * Communicating with robots
    * Machine Translation:
        * Google Translate
        * Google's Neural Machine Translation
    * Speech Recognition, Speech Synthesis
    * Lexical Semantics, Sentiment Analysis
    * Dialogue Systems, Question Answering
* What are the challenges with NLP?
    * Ambiguity
        * "Court to Try Shooting Defendant"
        * "Hospitals are sued by seven foot doctors"
        * What does it mean when we say: "I made her duck"
            * I cooked waterfowl for her
            * I cooked waterfowl belonging to her
            * I created the papier-mâché duck she owns
            * I caused her to quickly lower her head or body
            * I waved my magic wand and turned her into undifferentiated waterfowl
        * **Word Sense Disambiguation** - the problem of determining which sense was meant by a specific word
    * Part of speech tagging
    * Syntactic disambiguation - "I made her duck" example
* Knowledge of language:
    * Phonetics & Phonology - linguistic sounds
    * Morphology - meaningful components of words
    * Semantics - meaning of words
    * Pragmatics - meaning with respect to goals and intentions
    * Discourse - linguistic units larger than a single utterance
* NLP vocabulary
    * **Corpus** - a collection of documents
    * **Tokens** - each document has a collection of tokens
    * Each token is a "word"
        * There are cases where the "word" is "the" or "a"
        * There are also different versions of a word
        * e.g. "Banks working with that bank on the east bank were banking on a banker's strike"
    * **$n$-grams** - a block of $n$ words
        * "little" - unigram
        * "little boy" - bigram
        * "little boy blue" - trigram
        * "little boy blue and the man on the moon" - 9-gram
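
The $n$-gram examples above can be reproduced with a sliding window over the tokens. This is a minimal sketch; the helper name `ngrams` is an assumption made for illustration, and tokens are obtained by simply splitting on whitespace.

```python
def ngrams(tokens, n):
    """Return every contiguous block of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "little boy blue and the man on the moon".split()

print(ngrams(tokens, 1)[:3])   # [('little',), ('boy',), ('blue',)]
print(ngrams(tokens, 2)[:2])   # [('little', 'boy'), ('boy', 'blue')]
print(ngrams(tokens, 3)[0])    # ('little', 'boy', 'blue')
print(len(ngrams(tokens, 9)))  # 1
```

A sentence of 9 tokens yields 8 bigrams, 7 trigrams, and exactly one 9-gram, which is why higher-order $n$-grams become sparse very quickly.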

2) "Parameters" to Tune in NLP
* **Stopwords** - words which have no real meaning on their own but make the sentence grammatically correct
    * e.g. "I", "me", "my", "you", etc.
    * scikit-learn's built-in English stop word list contains 318 words
* **Sentence Segmentation**
    * **Bag of Words**
* **$N$-grams**
* **Normalization**
* Specific abbreviations and meaningful word couplings, e.g. New York, POTUS, LDAP
* **Stemming** - the process of reducing words to their stem, base or root form
* **Lemmatization** - removes inflectional endings only and returns the word to its base or dictionary form
    * e.g. (car, cars, car's, cars') $\rightarrow$ car

3) Text Processing Steps
1. Lowercase all of your text (depending on the parts of speech (POS))
2. Strip out miscellaneous spacing and punctuation
3. Remove stop words (but be careful as they may be domain or use-case specific)
4. Stem/Lemmatize the text
5. Part-of-Speech Tagging
6. Expand the feature matrix with $N$-grams
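
A minimal sketch of steps 1-3 and 6 using only the standard library. The stop list is a hypothetical handful of words, and stemming/lemmatization and POS tagging are omitted because they normally rely on an NLP library.

```python
import re

# Hypothetical mini stop list; real lists (e.g. scikit-learn's) are much longer.
STOP_WORDS = {"the", "a", "and", "i", "me", "my", "you"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and extra spacing, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation -> space
    tokens = text.split()                      # split() also collapses whitespace
    return [t for t in tokens if t not in stop_words]


def add_bigrams(tokens):
    """Append bigram features (joined with '_') to the unigram tokens."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


tokens = preprocess("Oh, the THINKS you   can think!")
print(tokens)               # ['oh', 'thinks', 'can', 'think']
print(add_bigrams(tokens))
```

Note that the bigrams are built after stop word removal, so "thinks_can" joins two words that were not adjacent in the original sentence; whether that is desirable depends on the use case.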

4) Making text machine consumable (Term-Frequency Matrix / TF-IDF)
* **Term-Frequency Matrix** - converts the corpus of text data into some form of numeric matrix representation
    * Each column of the matrix is a word
    * Each row is a document
    * Each cell therein contains the count of that word in a document
    * e.g. "oh,the,thinks,you,can,think,if,only,you,try"
        * "if,you,try,you,can,think,up,a,guff,going,by"
        * "and,what,would,you,do,if,you,meet,a,jaboo"

| jaboo | if | try | can | you | guff | think | going | would | up | only | thinks | meet | oh | by | what | do |
|:-----:|:--:|:---:|:---:|:---:|:----:|:-----:|:-----:|:-----:|:--:|:----:|:------:|:---:|:--:|:--:|:----:|:--:|
| 0     | 1  | 1   | 1   | 2   | 0    | 1     | 0     | 0     | 0  | 1    | 1      | 0   | 1  | 0  | 0    | 0  |
| 0     | 1  | 1   | 1   | 2   | 1    | 1     | 1     | 0     | 1  | 0    | 0      | 0   | 0  | 1  | 0    | 0  |
| 1     | 1  | 0   | 0   | 2   | 0    | 0     | 0     | 1     | 0  | 0    | 0      | 1   | 0  | 0  | 1    | 1  |
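
A table like this one can be built with `collections.Counter`. This is a sketch under the assumption that "the", "a", and "and" are the only stop words removed; the column order here is alphabetical, so it differs from the table.

```python
from collections import Counter

docs = [
    "oh,the,thinks,you,can,think,if,only,you,try",
    "if,you,try,you,can,think,up,a,guff,going,by",
    "and,what,would,you,do,if,you,meet,a,jaboo",
]
stop_words = {"the", "a", "and"}

# One Counter per document, skipping stop words.
counts = [Counter(t for t in doc.split(",") if t not in stop_words) for doc in docs]

# Vocabulary = sorted union of all remaining tokens.
vocab = sorted(set().union(*counts))

# Rows are documents, columns are vocabulary terms.
tf_matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in tf_matrix:
    print(row)
```
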

* Stopwords removed: "the", "a", "and"
* What are problems with this approach?
    * Documents may have different lengths
    * (-) May have over or underrepresentation issues due to terms with high frequency
* Solutions:
    * (Okay Solution) Normalize the term counts by the length of a document which would alleviate some problems (e.g. **L2 Normalization** in sklearn): $tf(t,d)=\frac{f_{t,d}}{\sqrt{\sum_{i\in V}(f_{i,d})^2}}$
    * (Better Solution) Have the value associated with a document-term be a measure of the importance in relation to the rest of the corpus (**TF-IDF**)
* **Term Frequency and Inverse Document Frequency (TF-IDF)** - having a value associated with document-term be a measure of the importance in relation to the rest of the corpus
    * There are two parts to TF-IDF that we need to answer in order to calculate it:
        1. How apparently important was this token in this document?
            * **Term-Frequency (TF)** - the number of times term $t$ occurs in this document, $f_{t,d}$
        2. How common is this term in general?
            * **Document Frequency** - $\frac{|\text{documents containing }t|}{|\text{documents}|}$
            * **Inverse Document Frequency (IDF)** - invert the document frequency and, optionally, take its log. Add 1 to the denominator to avoid dividing by zero: $IDF(t,D) = \log\big(\frac{|\text{documents}|}{1+|\text{documents containing }t|}\big)$
    * TF-IDF is calculated by multiplying the **Term-Frequency** by **Inverse-Document-Frequency**
        * A log scale is used so that a term occurring 10x more often than another is not treated as 10x more important
        * The "1" term on the bottom of the equation is known as a **smoothing constant** and is there to ensure that we don't have a zero in the denominator
    * Adjusting thresholds for inclusion/exclusion (TfidfVectorizer):
        * **max_df** - specifies words which should be excluded due to appearing in **more** than a given number of documents. Can either be an absolute count or a number between 0 and 1 indicating a proportion.
        * **min_df** - specifies words which should be excluded due to appearing in **fewer** than a given number of documents. Can either be an absolute count or a number between 0 and 1 indicating a proportion.
        * **max_features** - specifies the number of features to include in the resulting matrix. If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
* Now that we have a matrix representation of the corpus, how should we go about comparing documents to identify those which are most similar to one another? Using similarity and distance metrics!
    * **Cosine Similarity** - $sim(a,b) =\frac{\sum_{i=1}^n a_i b_i}{\sqrt{\sum_{i=1}^n a_i^2}\sqrt{\sum_{i=1}^n b_i^2}}$ (higher means more similar)
    * **Euclidean Distance** - $dist(a,b) = \sqrt{\sum_{i=1}^n (a_i-b_i)^2}$
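
The TF-IDF weighting and the two comparison metrics above can be sketched in plain Python. This follows the formulas exactly as written in these notes (raw counts for TF, $\log\big(\frac{|D|}{1+df}\big)$ for IDF); the function names are assumptions for illustration, and the documents are the stop-word-filtered examples from the term-frequency table.

```python
import math
from collections import Counter

docs = [
    ["oh", "thinks", "you", "can", "think", "if", "only", "you", "try"],
    ["if", "you", "try", "you", "can", "think", "up", "guff", "going", "by"],
    ["what", "would", "you", "do", "if", "you", "meet", "jaboo"],
]
vocab = sorted({t for doc in docs for t in doc})


def idf(term):
    """IDF(t, D) = log(|D| / (1 + df(t))), the smoothed form used above."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / (1 + df))


def tfidf_vector(doc):
    """Raw term count times IDF, one entry per vocabulary term."""
    tf = Counter(doc)
    return [tf[t] * idf(t) for t in vocab]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vectors = [tfidf_vector(doc) for doc in docs]
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
print(euclidean_distance(vectors[0], vectors[1]))
```

With only three documents this IDF gives a weight of exactly 0 to terms found in two documents and a negative weight to terms found in all three (such as "you"), which is an artifact of the tiny corpus. scikit-learn's `TfidfVectorizer` avoids this by default with $\ln\big(\frac{1+n}{1+df}\big)+1$.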

5) Great NLP tools - spaCy/word2vec
* **spaCy** - built on **Cython**, it has been benchmarked as one of the fastest syntactic parsers available, reportedly parsing over 13,000 words per second
    * industrial-strength NLP tool in Python
    * can perform:
        1. lemmatization
        2. part-of-speech tagging
        3. sentence extraction
        4. entity extraction
    * excels at large-scale information extraction tasks
* **word2vec** - computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model.
    * group of related models that are used to produce word embeddings
    * typically they are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words
    * input is a large corpus of text and output is a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space
    * word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space
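
As a rough illustration of what "reconstructing linguistic contexts" means, the Skip-Gram flavor is trained on (center word, context word) pairs drawn from a sliding window, while CBOW predicts the center word from its context. The sketch below only generates those training pairs; it does not train any embeddings, and `skipgram_pairs` and its `window` parameter are names assumed here for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "little boy blue and the man on the moon".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:4])
# [('little', 'boy'), ('boy', 'little'), ('boy', 'blue'), ('blue', 'boy')]
```

Words that keep appearing in the same pairs end up with similar vectors, which is the mechanism behind words with shared contexts being located close together in the space.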

6) Naive Bayes Classifier / Laplace Smoothing
* **Naive Bayes Classifier**
    * **Bayes' Theorem** - lets us swap the events $X$ and $Y$ in $P(X|Y)$, i.e. compute $P(X|Y)$ from $P(Y|X)$, provided we know certain other probabilities
        * $\begin{align}
        P(A|B)P(B) = & P(A\cap B) \\
        P(A\cap B) = & P(B\cap A) \\
        P(A|B)P(B) = & P(B|A)P(A) \\
        P(A|B) = & \frac{P(B|A)P(A)}{P(B)} \\
        \text{Posterior} = & \frac{\text{(Likelihood)(Prior)}}{\text{Evidence (Normalizing Constant)}}
        \end{align}$
    * Naive Bayes classifiers are considered naive because we assume that **all words in the string are independent of one another**, given the class
    * While this clearly isn't true, they still perform remarkably well and were historically deployed as spam classifiers in the 1990s. Naive Bayes handles cases where our features **vastly outnumber** our data points (e.g. more words than documents). These methods are also computationally efficient in that we mostly just have to calculate counts and sums.
    * Example: arbitrary document, $w_1,\dots,w_n$, and we would like to calculate the probability that it was from the sports section
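        * $P(y_c|w_1,\dots,w_n)\propto P(y_c)\prod_{i=1}^n P(w_i|y_c)$ where $P(y_c)=\frac{\sum_{j=1}^{|D|}\mathbb{1}[y_j=y_c]}{|D|}$ is the fraction of training documents labeled $y_c$ (the evidence term is dropped because it is the same for every class)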
* **Laplace Smoothing** - removes the possibility of a zero count in the numerator (which would zero out the entire product and make its log undefined) or in the denominator, by adding 1 to every word count; this adds 1 to the numerator and $|V|$, the vocabulary size, to the denominator
    * $\begin{align} P(w_{d,i}|y_c)
    = & \frac{count(w_{d,i}|y_c)+1}{\sum_{w\in V}[count(w|y_c)+1]} \\
    = & \frac{count(w_{d,i}|y_c)+1}{\big[\sum_{w\in V}count(w|y_c)\big]+|V|}
    \end{align}$
    * $\begin{align}
    P(y_c|w_{d,1},\dots,w_{d,n}) \propto & P(y_c)\prod_{i=1}^n P(w_{d,i}|y_c) \\
    \log(P(y_c|w_{d,1},\dots,w_{d,n})) = & \log(P(y_c))+\sum_{i=1}^n \log(P(w_{d,i}|y_c)) + \text{const}
    \end{align}$
* Naive Bayes Classifier Intuition
    * For unknown words, use Laplace Smoothing
    * Useful for online learning
    * Loads of extensions and variants out there
* Pros of Naive Bayes:
    * (+) Good with wide data ($p \gg n$)
    * (+) Good if $n$ is small or $n$ is quite big
    * (+) Fast to train
    * (+) Good at online learning, streaming data
    * (+) Simple to implement, not necessarily memory-bound (DB implementations)
    * (+) Multi-class classification possible
* Cons of Naive Bayes:
    * (-) Naive assumption means correlated features are not handled correctly
    * (-) Sometimes outperformed by other models
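
The pieces of this section (class prior, Laplace-smoothed word likelihoods, and the sum of logs) fit together as in the sketch below. It is a minimal multinomial Naive Bayes written from the formulas in these notes, not any library's implementation; the toy corpus, labels, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def log_posterior(tokens, label, class_counts, word_counts, vocab):
    """log P(y_c) + sum_i log P(w_i | y_c), with add-one (Laplace) smoothing."""
    total_docs = sum(class_counts.values())
    score = math.log(class_counts[label] / total_docs)
    denom = sum(word_counts[label].values()) + len(vocab)
    for w in tokens:
        score += math.log((word_counts[label][w] + 1) / denom)
    return score


def predict(tokens, class_counts, word_counts, vocab):
    """Pick the class with the highest (unnormalized) log posterior."""
    return max(
        class_counts,
        key=lambda c: log_posterior(tokens, c, class_counts, word_counts, vocab),
    )


# Hypothetical toy corpus.
docs = [
    ["goal", "match", "team"],
    ["team", "win", "score"],
    ["election", "vote", "senate"],
    ["vote", "policy", "debate"],
]
labels = ["sports", "sports", "politics", "politics"]

model = train(docs, labels)
print(predict(["team", "score", "goal"], *model))     # sports
print(predict(["senate", "debate", "goal"], *model))  # politics
```

Because of the smoothing, the word "goal" never seen in a politics document still gets a small non-zero probability, so the second document can be classified as politics on the strength of its other words. Training only updates counts, which is also why the model suits online learning.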