# Tasks
A good NLP system typically performs many of these tasks internally:
* Analysis Tasks
    * Syntactic
        * Tokenization
            * Dividing a body of text into the (indivisible) atomic units, such as words
            * Especially important in languages such as Japanese, where words are not delimited by spaces or punctuation marks such as apostrophes
    * Semantic
        * Sentence/Synopsis classification
            * Spam detection
            * News article classification into politics, technology, etc.
            * Product review rating sentiment classification
        * Named Entity Recognition (NER)
            * Attempt to extract the names of entities (such as persons, locations, organizations, etc.)
    * Pragmatic
        * Word Sense Disambiguation (WSD)
            * The sentences "the dog barked at the cows" and "the dog bit the tree bark" use completely different meanings of the word bark
        * Part-of-Speech (PoS) tagging
            * Figuring out what purpose a word serves in a given sentence, such as proper noun, common noun, phrasal verb, etc.
* Generation Tasks
    * Language generation
        * Predicting new text to follow a given text after being trained on a large text body, such as a sci-fi story
    * Question answering (QA)
        * Chatbots on social media sites
    * Machine translation (MT)
        * Different languages have highly different morphological structures, and the **word alignment problem**, where the word-to-word relationships can be one to many, one to one, many to one, or many to many, makes this a very difficult task
        
        
Deep learning allows for spatial and temporal filtering of the data, as all deep learning really does is emphasize or de-emphasize specific bits of the complex data representation.

Language isn't just sequential but also recursive, and handling this recursion remains an active field of research to this day:
* John says that Mary says that Bill told Richard he was on his way.

# Deep Learning Approach To Natural Language Processing
* Learns much richer features from raw data instead of using limited human-engineered features
    * Makes the tedious and expensive task of feature engineering obsolete
* Does the feature learning and task learning simultaneously
* The large number of parameters (the weights and biases) allows for capturing significantly more features than a human might have engineered
* Considered black boxes due to their poor interpretability
    * "How" deep models learn and "what" features they learn are still open problems in the field

# Data Structures
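## One-Hot Encoding
Each word is represented by a vector as long as the vocabulary, with a single 1 at the word's index and 0 everywhere else. This sparse encoding is very inefficient for large vocabularies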
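
As a minimal sketch of this kind of sparse encoding (the helper `one_hot` and the toy vocabulary are made up for illustration), every word becomes a vector that is as long as the vocabulary:

```python
def one_hot(word, vocab):
    """Return a list of len(vocab) zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

# Hypothetical toy vocabulary: word -> index
vocab = {"the": 0, "dog": 1, "barked": 2, "at": 3, "cows": 4}

dog_vec = one_hot("dog", vocab)
print(dog_vec)  # [0, 1, 0, 0, 0]
```

With a realistic vocabulary of tens of thousands of words, almost every entry of every vector is 0, which is where the inefficiency comes from.
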
## Bag of Words
Every word in the vocabulary is placed in a dictionary and associated with an index
* Only keeps track of the words seen and their frequency in the sentence
* Loses the ordering of the words
* Only really useful for simple tasks where the presence of certain words greatly affects the meaning
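
A small sketch of the idea, assuming whitespace tokenization and two hypothetical helpers (`build_vocab`, `bag_of_words`); it also shows how word order is lost:

```python
from collections import Counter

def build_vocab(sentences):
    """Map every distinct word to a unique index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(sentence.lower().split())
    vec = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:
            vec[vocab[word]] = n
    return vec

sentences = ["the food is awful", "awful food the is"]
vocab = build_vocab(sentences)
a = bag_of_words(sentences[0], vocab)
b = bag_of_words(sentences[1], vocab)
print(a == b)  # True: both word orders give the same vector
```
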

## Word Embedding
Maps every word to an n-dimensional vector whose components encode how similar the word is to other words. The vectors of similar words point in similar directions
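
One common way to compare the directions of two vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned and usually have many more dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings, not taken from any trained model
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "democracy": [0.1, 0.0, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_democracy = cosine_similarity(embeddings["cat"], embeddings["democracy"])
print(sim_cat_dog > sim_cat_democracy)  # True
```
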

The embedding itself is a layer of a model, and it actually learns the embedding of the word from its context

That is, they optimize the representations they create given a certain criterion. The implicit, default criterion is: maximize the distinctiveness of the vector representations, such that the confusability of any two vectors is kept to a minimum

It's trained to optimize for a specific task in which the embedded representation is used later, such as sentiment classification on top of the embedding

One downside is that it does not care about establishing relationships between words that share a similar context. word2vec tries to do just that

## word2vec
word2vec has two more or less equivalent variants:
* Predicting words from context
* Predicting context from words

It tries to maximize the similarity (the dot product) of words that appear close together in a context, and to minimize it for words that do not

$\frac{v_c \cdot v_w}{\sum_i (v_{c_i} \cdot v_w)}$

The numerator calculates the similarity between the word and its context, while the denominator calculates the similarity between the target word and all the other contexts. We maximize this ratio to ensure words that appear together in text have more similar vectors than words that do not.
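
A rough numerical sketch of this ratio follows. One assumption goes beyond the formula as written: each dot product is exponentiated (the softmax form normally used with word2vec) so that every term is positive and the ratios sum to 1. The vectors and the helper `context_probability` are made up for illustration:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def context_probability(v_w, v_c, all_contexts):
    """exp(v_c . v_w) divided by the sum of exp(v_ci . v_w) over all contexts."""
    numerator = math.exp(dot(v_c, v_w))
    denominator = sum(math.exp(dot(v_ci, v_w)) for v_ci in all_contexts)
    return numerator / denominator

# Made-up 2-dimensional vectors
v_cat = [1.0, 0.5]
contexts = {
    "food": [0.9, 0.6],
    "democracy": [-0.8, 0.1],
    "greed": [-0.2, -0.9],
}

probs = {
    name: context_probability(v_cat, vec, list(contexts.values()))
    for name, vec in contexts.items()
}
print(max(probs, key=probs.get))  # food
```

The denominator loops over every candidate context, which is the expensive part once the vocabulary is large.
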

Because there are many possible contexts `ci`, computing this can be very slow, so we pick only a few `ci` contexts at random, a process called negative sampling

eg. If `cat` appears in the context of `food`, we only require the vector of `food` to be more similar to the vector of `cat` than the vectors of several other randomly chosen words, such as `democracy`, `greed`, `Terry`, instead of all the words in the language, making it much faster to train
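
The pairing of true contexts with randomly drawn negatives can be sketched in plain Python as below. `skipgram_pairs` is a hypothetical helper written for illustration, not a library function, and it assumes one negative sample per positive pair:

```python
import random

def skipgram_pairs(word_ids, vocab_size, window_size, rng):
    """Return (target, context) couples with labels: 1 = real context, 0 = random word."""
    couples, labels = [], []
    for i, target in enumerate(word_ids):
        lo = max(0, i - window_size)
        hi = min(len(word_ids), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Positive example: a word that really occurs inside the window
            couples.append((target, word_ids[j]))
            labels.append(1)
            # Negative example: a word id drawn at random from the vocabulary.
            # With a small vocabulary this can accidentally be a true context word.
            couples.append((target, rng.randrange(vocab_size)))
            labels.append(0)
    return couples, labels

sentence = "the cat eats the food".split()
vocab = {w: i for i, w in enumerate(dict.fromkeys(sentence))}
ids = [vocab[w] for w in sentence]

couples, labels = skipgram_pairs(ids, len(vocab), window_size=1, rng=random.Random(0))
print(len(couples), labels.count(1), labels.count(0))  # 16 8 8
```
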

1. Create a dictionary for the data, mapping every word to a unique integer
2. Using a random generator and a fixed context window size, collect valid context words in this window. The context is the set of words surrounding the word we're currently targeting, and the window size is how many words to look at in each direction
    * The `skipgrams()` function does this

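```python
import re
# skipgrams() is assumed here to be the Keras helper (older Keras / TF 2.x releases)
from keras.preprocessing.sequence import skipgrams

# getLines() and create_vocabulary() are helper functions not shown in this section
def process_data(textFile, window_size):
    couples = []
    labels = []
    sentences = getLines(textFile)
    vocab = dict()
    create_vocabulary(vocab, sentences)  # fills vocab with word -> integer id
    vocab_size = len(vocab)
    for s in sentences:
        words = []
        for w in s.split(" "):
            # Lowercase and strip punctuation before the vocabulary lookup
            w = re.sub("[.,:;'\"!?()]+", "", w.lower())
            if w != '':
                words.append(vocab[w])
        # Returns (target, context) couples and their 1/0 labels
        c, l = skipgrams(words, vocab_size, window_size=window_size)
        couples.extend(c)
        labels.extend(l)
    return vocab, couples, labels
```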

3. Sample arbitrary words from the dictionary that are outside the context of the input word, to generate the negative contexts
    * eg. In the sentence `the restaurant has a terrible ambiance and the food is awful`, for the word `restaurant`, the words `the`, `has`, `a`, `terrible` all fit in its context, but the words `ambiance` and `and` don't
    * In this toy example, we also artificially make the vocabulary size much larger than it really is, to avoid labeling valid context words as non-context words
    * The `skipgrams()` function does this
4. Generate batches to train the model

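```python
import random
import numpy as np

def generator(target, context, labels, batch_size):
    """Yield random training batches indefinitely."""
    while True:
        # Allocate fresh arrays for every batch so earlier batches are not overwritten
        batch_target = np.zeros((batch_size, 1))
        batch_context = np.zeros((batch_size, 1))
        batch_labels = np.zeros((batch_size, 1))
        for i in range(batch_size):
            # Sample one (target, context, label) triple uniformly at random
            index = random.randint(0, len(target) - 1)
            batch_target[i] = target[index]
            batch_context[i] = context[index]
            batch_labels[i] = labels[index]
        yield [batch_target, batch_context], [batch_labels]
```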

5. Use a combination of two Embedding layers, one for the source words and one for the context words, feeding into a dense layer. The output label is 1 for a valid context and 0 for a negative context. The network learns to decide whether the chosen context words are valid contexts for the target words
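
To make step 5 more concrete without depending on a deep learning framework, here is a simplified pure-Python sketch. It is an assumption-laden stand-in rather than the actual network: two lookup tables play the role of the Embedding layers, and a sigmoid over their dot product stands in for the dense layer (its weight and bias are effectively fixed at 1 and 0). All names are hypothetical:

```python
import math
import random

def make_embedding(vocab_size, dim, rng):
    """One small random vector per word id (stands in for an Embedding layer)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]

def predict(target_id, context_id, target_emb, context_emb):
    """Estimated probability that context_id is a valid context of target_id."""
    similarity = sum(t * c for t, c in zip(target_emb[target_id], context_emb[context_id]))
    return 1.0 / (1.0 + math.exp(-similarity))

def train_step(target_id, context_id, label, target_emb, context_emb, lr=0.1):
    """One gradient step on the binary cross-entropy loss for a single pair."""
    t, c = target_emb[target_id], context_emb[context_id]
    error = predict(target_id, context_id, target_emb, context_emb) - label
    for k in range(len(t)):
        t_k, c_k = t[k], c[k]
        t[k] -= lr * error * c_k
        c[k] -= lr * error * t_k

rng = random.Random(0)
target_emb = make_embedding(vocab_size=10, dim=4, rng=rng)
context_emb = make_embedding(vocab_size=10, dim=4, rng=rng)

p_before = predict(3, 7, target_emb, context_emb)
for _ in range(50):
    train_step(3, 7, 1, target_emb, context_emb)  # label 1: valid context
p_after = predict(3, 7, target_emb, context_emb)
print(p_after > p_before)  # True
```

Training on pairs labeled 0 pushes the two vectors apart in the same way, which is how negative contexts end up with low similarity.
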

You can also use pre-trained word embeddings trained on much larger corpora, such as the Stanford University GloVe vectors with a 400k-word vocabulary

## doc2vec
Embeddings can extend beyond the word level to sentences, paragraphs, and even entire documents, all of which are referred to here as documents

Useful for:
* Matching questions with answers
* Document retrieval
    * Similar documents to an input document
    * eg. Given a restaurant review about the fish, you want to find similar complaints about seafood
    
The general approach is to give each document a unique ID, and to average the embeddings of every word in the document together with the embedding of the document ID
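
A tiny sketch of that averaging step, using made-up 2-dimensional vectors and a hypothetical `average_vectors` helper (in a real model both the word vectors and the document vector are learned):

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

# Made-up embeddings, purely for illustration
word_vectors = {"great": [1.0, 0.0], "fish": [0.0, 1.0]}
doc_vectors = {127: [0.5, 0.5]}  # one vector per document ID

words = ["great", "fish"]
combined = average_vectors([word_vectors[w] for w in words] + [doc_vectors[127]])
print(combined)  # [0.5, 0.5]
```
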

We use a sliding window of a pre-specified size (eg. 3), and use that to create n-grams from the document. Then we use every n-gram, the document ID, and a target word, which is the first word beyond the current n-gram, to train the model to predict this target word. The document ID becomes associated with many n-grams inside itself during training, forming a memory and encoding the missing information from the local n-grams to predict the next word

After training, the document ID becomes associated with a vector that describes the aggregation of this missing information for predicting next words across all sequences, which approximates the topic of the document

eg. For the document with ID `127`, the content is 'My first visit to Hiro was a delight!'

The n-grams and their respective targets become:
* My first visit - to
* first visit to - Hiro
* visit to Hiro - was
* to Hiro was - a
* Hiro was a - delight
* was a delight (we can choose to generate an `end of sentence` pseudotoken for this)
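
The list above can be reproduced with a short sketch. The helper `ngram_targets` and the `<eos>` pseudotoken name are assumptions made for illustration:

```python
def ngram_targets(doc_id, text, n=3, end_token="<eos>"):
    """Slide a window of n words over the text and pair each n-gram with the next word."""
    words = text.replace("!", "").split()
    words.append(end_token)  # lets the last n-gram predict an end-of-sentence token
    examples = []
    for i in range(len(words) - n):
        examples.append((doc_id, tuple(words[i:i + n]), words[i + n]))
    return examples

examples = ngram_targets(127, "My first visit to Hiro was a delight!")
for doc_id, ngram, target in examples:
    print(doc_id, " ".join(ngram), "->", target)
# 127 My first visit -> to
# 127 first visit to -> Hiro
# 127 visit to Hiro -> was
# 127 to Hiro was -> a
# 127 Hiro was a -> delight
# 127 was a delight -> <eos>
```

Every training example carries the same document ID, which is what lets the model tie all n-grams of a document to one shared vector.
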

The dataset is as follows

`Took an hour to get our food only 4 tables in restaurant my food was Luke warm,
Our server was running around like he was totally overwhelmed.
There is not a deal good enough that would drag me into that establishment again.
Hard to judge whether these sides were good because we were grossed out by the melted
styrofoam and didn't want to eat it for fear of getting sick.
On a positive note, our server was very attentive and provided great service.
Frozen pucks of disgust, with some of the worst people behind the register.
The only thing I did like was the prime rib and dessert section.`

The following code reads the file, with one document per line, into integer-valued arrays for the document identifiers, context arrays of word n-grams, and target words