### Introduction to NLP

NLP pipelines in general consist of three stages:

* Text Processing
* Feature Extraction
* Modeling

**Text Processing Steps:**

* Tokenization: splitting sentences into words
* Removing unnecessary punctuation and tags
* Normalizing capitalization, typically by lowercasing all text
* Removing stop words: frequent words such as "the", "is", etc. that carry little semantic meaning of their own
* Stemming: words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix.
* Lemmatization: another approach to removing inflection, by determining the part of speech and consulting a detailed database of the language.
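
The steps above can be sketched in plain Python. This is a minimal illustration rather than a production pipeline: the stop-word list, the suffix list, and the helper names `preprocess` and `toy_stem` are all assumptions made for this example, and the stemmer is a deliberately crude stand-in for a real algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "are"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def preprocess(text):
    """Strip tags and punctuation, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML-like tags
    text = text.lower()                       # normalize capitalization
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(word):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = preprocess("<p>The cats are jumping over the fences!</p>")
stems = [toy_stem(t) for t in tokens]
print(tokens)  # ['cats', 'jumping', 'over', 'fences']
print(stems)   # ['cat', 'jump', 'over', 'fenc']
```

Note that the toy stemmer turns "fences" into "fenc", which is not a real word. That roughness is typical of suffix stripping; a lemmatizer would instead use part-of-speech and dictionary information to return "fence".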

### Feature Extraction

**Bag-of-words**

Treats each document as an unordered collection of words. Turns each document into a vector of numbers, where each number is how many times a word occurs in that document.

* collect unique words in corpus to form a vocabulary
* form Document-Term Matrix
* compare 2 documents using "dot product" of 2 vectors
$$
a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \dots + a_n b_n
$$
* the greater the dot product of 2 vectors, the more similar they are
* the dot product is flawed: it only captures the amount of overlap and is not normalized for vector length, so a pair of very different vectors can have the same dot product as a pair of very similar vectors.
* better measure is **cosine similarity**
$$ 
\cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|}
$$
where $\|a\| = \sqrt{a_1^2 + a_2^2 + \dots + a_n^2}$ is the magnitude (Euclidean norm) of $a$, and likewise for $b$.

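If we picture the vectors in n-dimensional space, cosine similarity is the cosine of the angle between the 2 vectors:
* Vectors pointing in the same direction (including identical vectors) have cosine of 1
* Orthogonal vectors have cosine of 0
* Opposite vectors have cosine of -1

Values of cosine are always between **[-1, +1]**. For bag-of-words count vectors, whose entries are never negative, the value lies in [0, 1].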
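
The bag-of-words steps and both similarity measures can be sketched as follows. This is a small illustration on already-tokenized toy documents; the function names and the convention of returning 0.0 for an empty document are assumptions of this example.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Collect the unique words in the corpus, sorted for a stable column order."""
    return sorted({word for doc in docs for word in doc})


def document_term_matrix(docs, vocab):
    """One row per document, one column per vocabulary word, holding counts."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[word] for word in vocab])
    return rows


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot(a, b) / (norm_a * norm_b)


docs = [
    ["cat", "sat", "mat"],
    ["cat", "sat", "mat", "cat", "sat", "mat"],  # same content, repeated
    ["dog", "sat", "log"],
]
vocab = build_vocabulary(docs)
matrix = document_term_matrix(docs, vocab)

print(vocab)                       # ['cat', 'dog', 'log', 'mat', 'sat']
print(dot(matrix[0], matrix[0]))   # 3
print(dot(matrix[0], matrix[1]))   # 6
print(cosine_similarity(matrix[0], matrix[1]))
print(cosine_similarity(matrix[0], matrix[2]))
```

The dot product of the first document with its doubled copy (6) is larger than its dot product with itself (3), even though the content is the same. Cosine similarity removes that length effect: it is 1 (up to floating-point rounding) for the doubled copy and about 0.33 for the third document, which shares only one word.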

**TF-IDF**

One limitation of bag-of-words: it treats every word as being equally important.

We can address this by counting each word's frequency within a document (the term frequency) and then dividing it by the number of documents that contain the term (the document frequency) to get a relative measure.

$$
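tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)
$$

where $tf(t, d)$ is the frequency of term $t$ in document $d$, and $idf(t, D)$ is the inverse document frequency: the total number of documents in the collection $D$ divided by the number of documents where $t$ is present, usually with a logarithm applied to dampen the ratio:

$$
idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}
$$

tf-idf assigns each word a weight that signifies its relevance in the document.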
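
A minimal sketch of the computation, under two assumptions that are not fixed by the notes above: term frequency is normalized by document length, and the inverse document frequency is the natural logarithm of the document ratio. Libraries often use smoothed variants, so their numbers will differ from these.

```python
import math
from collections import Counter


def tf(term, doc):
    """Relative frequency of term in one tokenized document."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tfidf(term, doc, docs):
    """Term frequency weighted by inverse document frequency."""
    return tf(term, doc) * idf(term, docs)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

for term in ["the", "sat", "cat"]:
    print(term, round(tfidf(term, docs[0], docs), 4))
```

In this toy corpus "the" appears in every document, so its idf is log(1) = 0 and its weight vanishes. "cat" appears in only one document and gets the highest weight, with "sat" in between.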

**One-Hot Encoding**

In the context of language processing, one-hot encoding means:

Treat each word like a class and assign it a vector with one element per word in the vocabulary. The element at that word's index is one, and every other element is zero.
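
As a small illustration, assuming a fixed vocabulary list and a hypothetical helper named `one_hot`:

```python
def one_hot(word, vocab):
    """Return a vector of len(vocab) with a single 1 at the word's index."""
    index = vocab.index(word)  # raises ValueError for out-of-vocabulary words
    return [1 if i == index else 0 for i in range(len(vocab))]


vocab = ["cat", "dog", "mat", "sat"]
print(one_hot("dog", vocab))  # [0, 1, 0, 0]
```

Each vector is as long as the vocabulary, which is why this representation becomes unwieldy as the vocabulary grows.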

**Word Embedding**

* One-hot encoding breaks down when we have a large vocabulary to deal with, because the size of each word's representation grows with the number of words in the dictionary.
* Word embedding provides a fixed-length numeric vector to represent a word instead of its characters.
* Every word in your dictionary has a unique vector associated with it.
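
One way to picture this is as a lookup table with one row per word. The sketch below uses random values purely as placeholders (real embeddings are learned), and the names `embedding_matrix`, `embed`, and `one_hot_times_matrix` are made up for this example. It also shows that multiplying a one-hot vector by the matrix selects the same row as a direct lookup.

```python
import random

random.seed(0)  # reproducible placeholder values

vocab = ["cat", "dog", "mat", "sat"]
embedding_dim = 3  # fixed length, independent of vocabulary size

# One row per word; in practice these values are learned, not random.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)] for _ in vocab
]
word_to_index = {word: i for i, word in enumerate(vocab)}


def embed(word):
    """Look up the embedding row for a word."""
    return embedding_matrix[word_to_index[word]]


def one_hot_times_matrix(word):
    """Multiply the word's one-hot vector by the embedding matrix."""
    one_hot = [1 if w == word else 0 for w in vocab]
    return [
        sum(one_hot[i] * embedding_matrix[i][j] for i in range(len(vocab)))
        for j in range(embedding_dim)
    ]


print(len(embed("dog")))  # 3
```

Adding more words adds rows to the table, but each word's vector stays `embedding_dim` long.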

**Embedding technique 1: Word2Vec**

Word2Vec is perhaps one of the most popular examples of word embeddings used in practice. 

* Skip-gram:
  - Summary: pick a window size that counts as your neighborhood radius; any word inside this window is treated as context.
  - Pick any word from a sentence.
  - Convert it into a one-hot encoded vector and feed it into a neural network (or some other probabilistic model).
  - Train the model to predict the context words as well as it can.
  - Take an intermediate representation, such as a hidden layer in a neural network.
  - The outputs of that layer for a given word become the corresponding word vector.
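
The windowing step can be sketched as below. This covers only the preparation of training data, that is, the (center word, context word) pairs the model would be trained on; the network itself is omitted. The function name `skipgram_pairs` and the toy sentence are assumptions of this example.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(sentence, window_size=1)
for center, context in pairs:
    print(center, "->", context)
```

With a window size of 1, "quick" is paired with "the" and "brown". A larger window would add more distant neighbors as context.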

**Embedding technique 2: GloVe**