# Vector Representations

## Overview
Machine learning algorithms work by detecting patterns in data, so they prefer structured input, whereas text is relatively unstructured. A necessary processing step in natural language processing (NLP) is therefore to represent text in a structured way. Representing text as mathematical structures allows algorithms to perform mathematical operations on it and, in so doing, "learn" the meaning of the text.

The types and forms of these structures depend on the type and scope of your NLP project. If you were to undertake a project to understand Reddit posts, you would need to be able to represent every word that can appear in any of the posts in your project. You would also need to decide whether you are more concerned with representing each post separately, as when classifying posts, or with how words relate to each other, as in a project to detect named entities.

This chapter looks at how to represent text so that it is easier for machine learning algorithms to work with.

## Levels of Representation
Text can be represented at different levels of granularity: (1) characters, (2) words, (3) sentences, (4) paragraphs and (5) documents. Let's take each level in turn to see how it can be represented as vectors, bearing in mind that each level is built from the level that precedes it, i.e. words contain characters, sentences contain words and so on.
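
As a rough illustration of these levels, the sketch below breaks a short string into sentences, words and characters. The sample text is invented, and the splitting is deliberately naive (split on full stops and whitespace); a real project would more likely use a proper tokenizer.

```python
# Hypothetical sample text, used only for illustration.
text = "Text is unstructured. Vectors are structured."

# Naive splitting on full stops and whitespace; not a robust tokenizer.
sentences = [s.strip() for s in text.split(".") if s.strip()]
words = [sentence.split() for sentence in sentences]
characters = [list(word) for word in words[0]]

print(sentences)      # ['Text is unstructured', 'Vectors are structured']
print(words[0])       # ['Text', 'is', 'unstructured']
print(characters[0])  # ['T', 'e', 'x', 't']
```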

## Exercise 1: Load A Text File

## Character Representation
The goal of character representation is to find a way to represent characters numerically before performing machine learning. The simplest way is to use an existing numeric encoding of characters, such as the ASCII character set.

In [14]:
ord('A'), ord('B')

(65, 66)
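
The same built-in can encode a whole word, and `chr()` reverses the mapping. This is only a small sketch, and the word `data` is an arbitrary example.

```python
word = "data"

# Encode each character as its code point, then decode back with chr().
codes = [ord(c) for c in word]
decoded = "".join(chr(code) for code in codes)

print(codes)    # [100, 97, 116, 97]
print(decoded)  # data
```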

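However, because the encoded numeric values are increasing, they carry information about relative scale, giving the impression that A (coded as 65) is less than B (coded as 66). One-hot encoding gets around this problem by representing each character as a vector of zeros with a single 1 at the position assigned to that character, so no character appears larger or smaller than another.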

In [None]:
import string

chars = string.ascii_letters + '.,- '

## Manual One-Hot Encoding

In [34]:

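# Map each character to its position in the one-hot vector
char_value = {c: i for i, c in enumerate(chars)}

def encode_char(c):
    # All zeros except a single 1 at the character's index
    encoded_char = [0] * len(chars)
    encoded_char[char_value[c]] = 1
    return encoded_char

def encode_word(word):
    # A word becomes a list of one-hot vectors, one per character
    return [encode_char(c) for c in word]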

#encode_word('Hello World')

In [31]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
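
For comparison, the same encoding can be sketched with NumPy by picking rows out of an identity matrix. The helper name `encode_word_np` is hypothetical, and the block assumes the same 56-character alphabet defined above (52 letters plus four punctuation characters).

```python
import string
import numpy as np

chars = string.ascii_letters + '.,- '
char_value = {c: i for i, c in enumerate(chars)}

def encode_word_np(word):
    # Hypothetical helper: row i of the identity matrix is the one-hot vector for index i.
    indices = [char_value[c] for c in word]
    return np.eye(len(chars), dtype=int)[indices]

encoded = encode_word_np('Hello World')
print(encoded.shape)        # (11, 56): one row per character, one column per symbol
print(encoded.sum(axis=1))  # every row contains exactly one 1
```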

## Exercise 2: One Hot Encoding using Scikit-Learn

### One Hot Encoding

In [2]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

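word = 'data'
# scikit-learn expects one sample per row, so each character becomes its own row
word_chars = np.array(list(word)).reshape(-1, 1)
print(word_chars)
# Categories are learned from the data; n_values was removed from recent scikit-learn releases
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit_transform(word_chars).toarray()

[['d']
 ['a']
 ['t']
 ['a']]

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])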

### Exercise: One Hot Encoding Text using Keras

In [1]:
from keras.utils import to_categorical

Using TensorFlow backend.


# Word Representations

## Word Embeddings

Now that we have seen how characters can be represented, let's look at how to represent words as vectors. As with characters, you can use one-hot encoding for words.
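
A minimal sketch of word-level one-hot encoding follows. The toy corpus and the helper name `one_hot_word` are made up for illustration; the vocabulary is sorted only so that each word receives a stable index.

```python
# Hypothetical toy corpus, used only for illustration.
corpus = ["the cat sat", "the dog sat"]

# Build a sorted vocabulary so each word gets a stable index.
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_index = {word: i for i, word in enumerate(vocab)}

def one_hot_word(word):
    # One slot per vocabulary entry, with a single 1 at the word's index.
    vector = [0] * len(vocab)
    vector[word_index[word]] = 1
    return vector

print(vocab)                # ['cat', 'dog', 'sat', 'the']
print(one_hot_word("dog"))  # [0, 1, 0, 0]
```

Note that the vector length equals the vocabulary size, which is why this representation grows quickly on real corpora.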

One problem with one-hot word vectors is that, with only one position set to 1 and the rest set to 0, each vector is incredibly sparse. This representation takes up a lot of memory and computation time to represent a single word, which is especially an issue with large vocabularies. For example, with a vocabulary size of 10000, each one-hot vector needs 10000 elements, although you can reduce the memory cost by using sparse vectors.

The other problem with one-hot encoding is that it doesn't make efficient use of memory to carry information, since every element but one is zero.

It turns out that there is another transformation we can apply to these vectors that gives us an even better representation of words: one that both reduces memory usage and carries more information than one-hot encoding.

An embedding vector is the result of transforming a vector from a high-dimensional space to a lower-dimensional one. So if you had one-hot vectors of size 10000, you could compress each of them into a smaller dimension, for example 100.



In [57]:
from sys import getsizeof
getsizeof([1]*10000), getsizeof([1]*100)

(80064, 864)
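
One way to picture the move from a 10000-dimensional one-hot vector to a 100-dimensional embedding is as a lookup into a weight matrix. The sketch below uses a randomly initialised matrix purely as a stand-in (in practice the weights are learned), and the names `embeddings` and `word_id` are hypothetical.

```python
import numpy as np

vocab_size, embedding_dim = 10000, 100

# Hypothetical embedding matrix with random values; real embeddings are learned.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(vocab_size, embedding_dim))

word_id = 42  # arbitrary word index chosen for illustration
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1

# Multiplying a one-hot vector by the matrix selects a single row,
# so the embedding can be read directly by index instead.
via_matmul = one_hot @ embeddings
via_lookup = embeddings[word_id]

print(via_lookup.shape)                     # (100,)
print(np.allclose(via_matmul, via_lookup))  # True
```

Because the two results agree, the one-hot vector never needs to be materialised: storing the word's index is enough.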

## Word2Vec
Word2Vec is an approach to training shallow neural networks to produce word embeddings.

In [2]:
from gensim.models import word2vec
corpus = [
          'Text of the first document.',
          'Text of the second document made longer.',
          'Number three.',
          'This is number four.',
]
# the model expects each sentence as a list of tokens
tokenized_sentences = [sentence.split() for sentence in corpus]
model = word2vec.Word2Vec(tokenized_sentences, min_count=1)



There are two main methods for creating embeddings using the Word2Vec approach: Continuous Bag of Words (CBOW) and Skip Grams.
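
The difference between the two methods lies mainly in how training examples are framed: CBOW predicts a word from its surrounding context, while skip-gram predicts the surrounding words from a single word. The sketch below only generates such example pairs from one tokenized sentence and does not train anything. The helper `context_windows` and the window size of 1 are assumptions made for illustration.

```python
def context_windows(tokens, window=2):
    # Yield (target, context) for every position, using up to `window` words on each side.
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield target, context

tokens = "this is number four".split()

# CBOW: the whole context is the input, the target word is the label.
cbow_examples = [
    (context, target) for target, context in context_windows(tokens, window=1)
]

# Skip-gram: the target word is the input, each context word is a separate label.
skipgram_examples = [
    (target, c)
    for target, context in context_windows(tokens, window=1)
    for c in context
]

print(cbow_examples[1])       # (['this', 'number'], 'is')
print(skipgram_examples[:3])  # [('this', 'is'), ('is', 'this'), ('is', 'number')]
```

In gensim, the `sg` argument of `Word2Vec` chooses between the two: `sg=0` (the default) trains CBOW and `sg=1` trains skip-gram.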

## Continuous Bag of Words (CBOW)

## Skip Grams

## Word Vector Math

## Doc2Vec
Word2Vec works nicely as embeddings for individual words, but what if we need to create embeddings for paragraphs or documents? This is where Doc2Vec comes in: it is a way of creating embeddings for groups of words.

Doc2Vec is based on the concept of a paragraph vector, which is a vector that gets trained to capture the meaning of a paragraph or document. Like Word2Vec, it is trained using a shallow neural network, and its vectors enjoy much the same properties as word vectors. If trained properly, paragraphs or documents that have similar meaning will have similar document vectors.
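
The text above does not say how "similar" is measured. A common choice, assumed here, is cosine similarity, which compares the direction of two vectors. The three document vectors below are invented for illustration and are not the output of a trained model.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional document vectors, invented for illustration.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # points in the same direction as doc_a
doc_c = [0.0, 0.0, 3.0]  # orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1.0
print(cosine_similarity(doc_a, doc_c))  # 0.0
```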

## Exercise: Training Document Vectors

In this exercise we will learn how to train document vectors.

In [58]:
from gensim import utils
# TaggedDocument replaces LabeledSentence, which was removed from recent gensim releases
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec

### Exercise 

In [None]:
from pathlib import Path
import glob

def load_corpus(fn):
    pass