# Representing Text using Numerical Vectors    
Feature Vectors from Text

#### One Hot Encoding
One-hot encoding converts a categorical variable into one column (feature) per category, with a 1 marking the presence of that category and a 0 its absence. When applied to documents, the number of features equals the number of distinct tokens (the vocabulary size) in the whole corpus.

The example below one-hot encodes the individual words of a sentence: each row is a word and each column is a vocabulary entry.

In [2]:
import pandas as pd

Text = "I am learning NLP"

pd.get_dummies(Text.split())

Unnamed: 0,I,NLP,am,learning
0,1,0,0,0
1,0,0,1,0
2,0,0,0,1
3,0,1,0,0
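
The same idea extends from single words to whole documents. As a minimal sketch in plain Python (the two-sentence `docs` list and all variable names are made up for illustration), each document becomes one row with a 1 for every vocabulary word it contains:

```python
# Hypothetical mini-corpus, used only for illustration
docs = ["I am learning NLP", "I am learning Python"]

# Vocabulary: every distinct token in the corpus, in sorted order
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document: 1 if the vocabulary word occurs in it, else 0
one_hot = [[1 if word in doc.split() else 0 for word in vocab] for doc in docs]

print(vocab)
for row in one_hot:
    print(row)
```

The number of columns equals the vocabulary size (5 here), which is why one-hot representations grow quickly as the corpus grows.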


### Bag of Words


#### Count Vectorizer
Similar to one-hot encoding, but instead of storing just a 1 when a word is present, we store its frequency (the number of times it appears) in the document or sentence.

More:   
https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af
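
Before turning to scikit-learn, the counting idea can be sketched with the standard library alone. This is only an approximation under stated assumptions: the helper name `count_tokens` is hypothetical, and the regular expression is meant to mimic CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters), not to reproduce it exactly.

```python
import re
from collections import Counter

def count_tokens(document):
    """Count how often each token appears in one document."""
    # Lowercase, then keep runs of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    return Counter(tokens)

counts = count_tokens("The quick brown fox jumps over the lazy dog!")
print(counts["the"])   # 'The' and 'the' are counted together
print(counts["fox"])
```

A one-hot representation would store 1 for "the"; the count representation stores how many times it occurs.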

In [33]:
# import pandas and sklearn's CountVectorizer class
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# set of documents
corpus = ['The quick brown fox jumps over the lazy dog!',
           'This document is the second document.']

# instantiate the vectorizer object
vectorizer = CountVectorizer()

#Learn a vocabulary dictionary of all tokens in the raw documents.
vectorizer.fit(corpus)
# convert the documents into a document-term matrix
X = vectorizer.transform(corpus)

# or do both in one step with fit_transform
# X = vectorizer.fit_transform(corpus)

# retrieve the terms found in the corpus
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tokens = vectorizer.get_feature_names_out().tolist()

print(tokens)

['brown', 'document', 'dog', 'fox', 'is', 'jumps', 'lazy', 'over', 'quick', 'second', 'the', 'this']


In [34]:
X

<2x12 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

Q:   
Why do you think the output is represented in Compressed Sparse Row format?

In [36]:
# Each row represents a document.
print(X.toarray()) 

[[1 0 1 1 0 1 1 1 1 0 2 0]
 [0 2 0 0 1 0 0 0 0 1 1 1]]


In [37]:
vectorizer.inverse_transform(X)

[array(['brown', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the'],
       dtype='<U8'),
 array(['document', 'is', 'second', 'the', 'this'], dtype='<U8')]

In [41]:
# create a dataframe from a word matrix
pd.DataFrame(data=X.toarray(), index=['Doc1', 'Doc2'],
             columns=tokens)

Unnamed: 0,brown,document,dog,fox,is,jumps,lazy,over,quick,second,the,this
Doc1,1,0,1,1,0,1,1,1,1,0,2,0
Doc2,0,2,0,0,1,0,0,0,0,1,1,1


In [42]:
# We can write a function to create the dataframe from the word matrix
def word_matrix2df(word_matrix, feat_names):
    # label each row (document) as Doc0, Doc1, ...
    doc_names = ['Doc{:d}'.format(idx) for idx in range(word_matrix.shape[0])]
    df = pd.DataFrame(data=word_matrix.toarray(), index=doc_names,
                      columns=feat_names)
    return df

# create a dataframe from the matrix
word_matrix2df(X, tokens)

Unnamed: 0,brown,document,dog,fox,is,jumps,lazy,over,quick,second,the,this
Doc0,1,0,1,1,0,1,1,1,1,0,2,0
Doc1,0,2,0,0,1,0,0,0,0,1,1,1


In [45]:
print(vectorizer.get_stop_words())

None


Q:    
Now define a new document and encode it in (represent it with) the vocabulary learned from the documents above.

In [46]:
corpus2 = ['This is the fifth document.']
# fifth is not in vocabulary
X2 = vectorizer.transform(corpus2)
print(X2.toarray()) 

[[0 1 0 0 1 0 0 0 0 0 1 1]]


In [47]:
# 'fifth' is lost, as you can see when we map the vector back to words
vectorizer.inverse_transform(X2)

[array(['document', 'is', 'the', 'this'], dtype='<U8')]

#### WordNet lemmatizer    
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. 

WordNet superficially resembles a thesaurus in that it groups words together based on their meanings.   
There are two important distinctions:   
1) WordNet interlinks not just word forms but specific senses of words.   
2) WordNet labels the semantic relations among words.

In [49]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize("dogs")

'dog'

Note: We have to use POS tagging before lemmatization, because the lemma of a word depends on its part of speech.

In [51]:
# If you give pos ("part of speech") tag, you get better lemmas (root words)
# pos defaults to noun 
wordnet_lemmatizer.lemmatize("are")

'are'

In [52]:
wordnet_lemmatizer.lemmatize('are', pos='v')

'be'
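
To feed tagger output into the lemmatizer, the tags have to be translated into the single-letter codes that the `pos` argument accepts. The sketch below assumes Penn Treebank style tags (the tag set used by NLTK's default English tagger); the helper name `penn_to_wordnet` and the hand-written `tagged` list are hypothetical and stand in for real tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBP') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Hypothetical (word, tag) pairs, written by hand rather than produced by a tagger
tagged = [("dogs", "NNS"), ("are", "VBP"), ("barking", "VBG"), ("loudly", "RB")]
pos_for_lemmatizer = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_for_lemmatizer)
```

Each pair could then be passed on as `wordnet_lemmatizer.lemmatize(word, pos=pos)`.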

### n-grams
A bigram is a sequence of 2 consecutive words.
A trigram is a sequence of 3 consecutive words, and so on.

For “I am learning NLP”, here are the first three n-gram sets:   
Unigrams: “I”, “am”, “learning”, “NLP”   
Bigrams: “I am”, “am learning”, “learning NLP”   
Trigrams: “I am learning”, “am learning NLP”   
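
The sets listed above can be generated by sliding a window of size n over the token list. A minimal sketch in plain Python (the function name `ngrams` is chosen here for illustration and is not taken from any library):

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "I am learning NLP".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so four words give three bigrams and two trigrams.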

In [54]:
Text = "I am learning NLP"
# we can use textblob library:
from textblob import TextBlob
TextBlob(Text).ngrams(2)

[WordList(['I', 'am']),
 WordList(['am', 'learning']),
 WordList(['learning', 'NLP'])]

### CountVectorizer with n-grams
We can ask CountVectorizer directly to count n-grams instead of unigrams only.    
The code is the same as above, except that the parameter "ngram_range=(1,3)" is passed to the CountVectorizer constructor, so unigrams, bigrams and trigrams are all counted.   
The previous model produced vectors of length 12.    
Q:   
What is the length of the vectors now?

In [56]:
# import pandas and sklearn's CountVectorizer class
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# set of documents
corpus = ['The quick brown fox jumps over the lazy dog!',
           'This document is the second document.']

# instantiate the vectorizer object
vectorizer = CountVectorizer(ngram_range=(1,3))

#Learn a vocabulary dictionary of all tokens in the raw documents.
vectorizer.fit(corpus)
# convert the documents into a document-term matrix
X = vectorizer.transform(corpus)

# or do both in one step with fit_transform
# X = vectorizer.fit_transform(corpus)

# retrieve the terms found in the corpus
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tokens = vectorizer.get_feature_names_out().tolist()

print(tokens)

['brown', 'brown fox', 'brown fox jumps', 'document', 'document is', 'document is the', 'dog', 'fox', 'fox jumps', 'fox jumps over', 'is', 'is the', 'is the second', 'jumps', 'jumps over', 'jumps over the', 'lazy', 'lazy dog', 'over', 'over the', 'over the lazy', 'quick', 'quick brown', 'quick brown fox', 'second', 'second document', 'the', 'the lazy', 'the lazy dog', 'the quick', 'the quick brown', 'the second', 'the second document', 'this', 'this document', 'this document is']


In [57]:
len(vectorizer.get_feature_names_out())

36

In [58]:
# create a dataframe from a word matrix
pd.DataFrame(data=X.toarray(), index=['Doc1', 'Doc2'],
             columns=tokens)

Unnamed: 0,brown,brown fox,brown fox jumps,document,document is,document is the,dog,fox,fox jumps,fox jumps over,...,the,the lazy,the lazy dog,the quick,the quick brown,the second,the second document,this,this document,this document is
Doc1,1,1,1,0,0,0,1,1,1,1,...,2,1,1,1,1,0,0,0,0,0
Doc2,0,0,0,2,1,1,0,0,0,0,...,1,0,0,0,0,1,1,1,1,1


### Hashing Trick
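To make the representation more compact, we can apply a hash function to each token and use the result, taken modulo a fixed number of columns, as its column index. No vocabulary has to be stored, and the vector length stays fixed no matter how many distinct tokens the corpus contains.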
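
A minimal sketch of the idea in plain Python follows. It rests on several assumptions: the function name `hashed_bow`, the choice of MD5 as a deterministic hash, and the tiny `n_features=8` are all illustrative. scikit-learn's own `HashingVectorizer` uses a different hash function (MurmurHash3) and a much larger default number of features, so the indices will not match.

```python
import hashlib
import re

def hashed_bow(document, n_features=8):
    """Count tokens into a fixed-length vector indexed by a hash of each token."""
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", document.lower()):
        # MD5 is used only because it is deterministic across runs
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector

vec = hashed_bow("This document is the second document.")
print(vec)
```

Different tokens can land in the same column (a collision), and the mapping cannot be inverted back to words; that is the price paid for the fixed, compact size.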