# Representing Text using Numerical Vectors    
Feature Vectors from Text

References:   
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction   

https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html   

# Bag of Words Models

#### One Hot Encoding
One-hot encoding converts a categorical variable into one column (feature) per category and stores a 1 or 0 to mark whether that category is present. When applied to documents, the number of features equals the number of distinct tokens (the vocabulary size) in the whole corpus.   

The example below one-hot encodes the individual words of a sentence.

In [19]:
import pandas as pd

Text = "I am learning NLP"

pd.get_dummies(Text.split())

|   | I | NLP | am | learning |
|---|---|-----|----|----------|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 0 |
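
The document-level case can be sketched in plain Python. This is a minimal illustration and not part of the original notebook: the two-sentence corpus and the helper name `one_hot_document` are made up for the example, and tokenization is a naive lowercase whitespace split.

```python
# Hypothetical two-document corpus, made up for this illustration
corpus = ["I am learning NLP", "NLP is fun and NLP is useful"]

# Naive tokenizer (an assumption): lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in corpus]

# One feature per distinct token in the whole corpus
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def one_hot_document(tokens, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

vectors = [one_hot_document(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, and a word that occurs twice in a document ("nlp" in the second one) still contributes only a 1.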


## Count Vectorizer
Similar to one-hot encoding, but instead of storing just a 1 when a word is present, we store its frequency: the number of times the word (or word group) appears in the document or sentence.

More:   
https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af

In [20]:
# import pandas and sklearn's CountVectorizer class
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# set of documents
corpus = ['The quick brown fox jumps over the lazy dog!',
           'This document is the second document.']

# instantiate the vectorizer object
vectorizer = CountVectorizer()

#Learn a vocabulary dictionary of all tokens in the raw documents.
vectorizer.fit(corpus)
# convert the documents into a document-term matrix
X = vectorizer.transform(corpus)

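# or in one step: fit_transform
# X = vectorizer.fit_transform(corpus)

# retrieve the terms found in the corpus
# (get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0)
tokens = vectorizer.get_feature_names_out().tolist()

print(tokens)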

['brown', 'document', 'dog', 'fox', 'is', 'jumps', 'lazy', 'over', 'quick', 'second', 'the', 'this']


In [21]:
X

<2x12 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

Q:   
Why do you think the output is stored in Compressed Sparse Row format?

In [22]:
# Each row represents a document.
print(X.toarray()) 

[[1 0 1 1 0 1 1 1 1 0 2 0]
 [0 2 0 0 1 0 0 0 0 1 1 1]]
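
One way to see what the sparse format saves is to build the three CSR arrays by hand. The sketch below is plain Python written for illustration only (it does not show how SciPy implements the format); it reuses the dense matrix printed above, and the names `data`, `indices`, and `indptr` follow the attribute names of SciPy's CSR matrices.

```python
# Dense document-term matrix copied from the output above
dense = [
    [1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 2, 0],
    [0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1],
]

data = []      # the non-zero values, row by row
indices = []   # column index of each stored value
indptr = [0]   # indptr[i]:indptr[i+1] delimits row i inside data/indices

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))

total_cells = len(dense) * len(dense[0])
print(len(data), "stored values out of", total_cells, "cells")
print("data:   ", data)
print("indices:", indices)
print("indptr: ", indptr)
```

Only 13 of the 24 cells are stored here. With a realistic vocabulary of tens of thousands of words, each document touches only a small fraction of the columns, so storing just the non-zero entries saves most of the memory.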


In [23]:
vectorizer.inverse_transform(X)

[array(['brown', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the'],
       dtype='<U8'),
 array(['document', 'is', 'second', 'the', 'this'], dtype='<U8')]

In [24]:
# create a dataframe from a word matrix
pd.DataFrame(data=X.toarray(), index=['Doc1', 'Doc2'],
             columns=tokens)

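|      | brown | document | dog | fox | is | jumps | lazy | over | quick | second | the | this |
|------|-------|----------|-----|-----|----|-------|------|------|-------|--------|-----|------|
| Doc1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 2 | 0 |
| Doc2 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |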


In [25]:
# For display while learning only:
# don't convert a sparse matrix to a dense dataframe in production

# We can write a function to create the dataframe from the word matrix
def word_matrix2df(word_matrix, feat_names):
    # create an index label for each row (document)
    doc_names = ['Doc{:d}'.format(idx) for idx in range(word_matrix.shape[0])]
    df = pd.DataFrame(data=word_matrix.toarray(), index=doc_names,
                      columns=feat_names)
    return df

# create a dataframe from the matrix
word_matrix2df(X, tokens)

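|      | brown | document | dog | fox | is | jumps | lazy | over | quick | second | the | this |
|------|-------|----------|-----|-----|----|-------|------|------|-------|--------|-----|------|
| Doc0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 2 | 0 |
| Doc1 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |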


In [26]:
print(vectorizer.get_stop_words())

None


Q:    
Now define a new document and encode it with (represent it in) the vocabulary learned from the documents above.

In [27]:
corpus2 = ['This is the fifth document.']
# "fifth" is not in the learned vocabulary
X2 = vectorizer.transform(corpus2)
print(X2.toarray()) 

[[0 1 0 0 1 0 0 0 0 0 1 1]]


In [28]:
# the word "fifth" is lost, as we see when we map the vector back to words
vectorizer.inverse_transform(X2)

[array(['document', 'is', 'the', 'this'], dtype='<U8')]
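
The effect of a fixed vocabulary can be reproduced with a small plain-Python sketch. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper name `encode` is made up, and the regular expression mirrors CountVectorizer's default token pattern (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

# Vocabulary learned from the two training documents (copied from the output above)
vocabulary = ['brown', 'document', 'dog', 'fox', 'is', 'jumps', 'lazy',
              'over', 'quick', 'second', 'the', 'this']
column_of = {word: col for col, word in enumerate(vocabulary)}

def encode(document):
    """Count known words; collect words that have no column."""
    # Lowercase and keep tokens of 2+ word characters
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    vector = [0] * len(vocabulary)
    unknown = []
    for word, count in Counter(tokens).items():
        if word in column_of:
            vector[column_of[word]] = count
        else:
            unknown.append(word)
    return vector, unknown

vector, unknown = encode("This is the fifth document.")
print(vector)   # counts for the known words only
print(unknown)  # words that have no column and are therefore dropped
```

Words outside the learned vocabulary simply have nowhere to go, which is why the vocabulary should be fitted on a corpus that is representative of the text to be encoded later.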

#### WordNet lemmatizer    
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. 

WordNet superficially resembles a thesaurus in that it groups words together based on their meanings.   
It differs from a thesaurus in 2 important ways:   
1) WordNet interlinks not just word forms but specific senses of words.   
2) WordNet labels the semantic relations among words.

In [30]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize("dogs")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ozgozt\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


'dog'

Note: We should use POS tagging before lemmatization, because the lemma of a word depends on its part of speech.

In [31]:
# If you give pos ("part of speech") tag, you get better lemmas (root words)
# pos defaults to noun 
wordnet_lemmatizer.lemmatize("are")

'are'

In [32]:
wordnet_lemmatizer.lemmatize('are', pos='v')

'be'
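
The `pos` argument takes WordNet's single-letter codes ('n', 'v', 'a', 'r'), while taggers such as `nltk.pos_tag` return Penn Treebank tags by default, so a small mapping is needed between the two. The sketch below shows one coarse way to do it; the helper name `penn_to_wordnet` is hypothetical, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to the single-letter pos that lemmatize() accepts."""
    if penn_tag.startswith('J'):
        return 'a'  # adjective
    if penn_tag.startswith('V'):
        return 'v'  # verb
    if penn_tag.startswith('R'):
        return 'r'  # adverb (coarse: also catches particles tagged RP)
    return 'n'      # noun, which is also the lemmatizer's default

# Hand-written (word, tag) pairs standing in for a tagger's output
tagged = [('dogs', 'NNS'), ('are', 'VBP'), ('running', 'VBG'), ('quickly', 'RB')]

for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

Each pair could then be passed on as `wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.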

### n-grams

A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence.

N-grams to the rescue! Instead of building a simple collection of unigrams (n=1) (bag of words), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.   

A bigram is a sequence of 2 consecutive words, a trigram is a sequence of 3 consecutive words, and so on.

For “I am learning NLP”, here are the first 3 n-gram sets:   
Unigrams: “I”, “am”, “learning”, “NLP”   
Bigrams: “I am”, “am learning”, “learning NLP”   
Trigrams: “I am learning”, “am learning NLP”   
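
The sets above can be generated with a sliding window over the token list. This is a minimal plain-Python sketch; the function name `ngrams` is chosen for the example and tokenization is a simple whitespace split.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am learning NLP".split()

for n in (1, 2, 3):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])
```

A sentence with T tokens yields T - n + 1 n-grams, so the 4 tokens here give 4 unigrams, 3 bigrams, and 2 trigrams.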

In [35]:
# TextBlob depends on the Punkt Tokenizer Models (13 MB download zipped)
nltk.download('punkt')

Text = "I am learning NLP"
# we can use textblob library:
from textblob import TextBlob
TextBlob(Text).ngrams(2)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ozgozt\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


[WordList(['I', 'am']),
 WordList(['am', 'learning']),
 WordList(['learning', 'NLP'])]

### CountVectorizer with n-grams
We can ask CountVectorizer directly to count n-grams instead of unigrams only.    
The code is the same as above, except that the "ngram_range=(1,3)" parameter is passed to the CountVectorizer constructor, so unigrams, bigrams, and trigrams are all counted.   
The previous model produced vectors of length 12.    
#### Q:   
What is the length of the vectors now?

In [36]:
# import pandas and sklearn's CountVectorizer class
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# set of documents
corpus = ['The quick brown fox jumps over the lazy dog!',
           'This document is the second document.']

# instantiate the vectorizer object
vectorizer = CountVectorizer(ngram_range=(1,3))

#Learn a vocabulary dictionary of all tokens in the raw documents.
vectorizer.fit(corpus)
# convert the documents into a document-term matrix
X = vectorizer.transform(corpus)

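# or in one step: fit_transform
# X = vectorizer.fit_transform(corpus)

# retrieve the terms found in the corpus
# (get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0)
tokens = vectorizer.get_feature_names_out().tolist()

print(tokens)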

['brown', 'brown fox', 'brown fox jumps', 'document', 'document is', 'document is the', 'dog', 'fox', 'fox jumps', 'fox jumps over', 'is', 'is the', 'is the second', 'jumps', 'jumps over', 'jumps over the', 'lazy', 'lazy dog', 'over', 'over the', 'over the lazy', 'quick', 'quick brown', 'quick brown fox', 'second', 'second document', 'the', 'the lazy', 'the lazy dog', 'the quick', 'the quick brown', 'the second', 'the second document', 'this', 'this document', 'this document is']


In [37]:
len(vectorizer.get_feature_names_out())

36
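
The count of 36 can be checked by hand. The sketch below is plain Python under stated assumptions: the helper names are made up, and the tokenizer imitates CountVectorizer's defaults (lowercasing, tokens of two or more word characters). It counts the distinct n-grams for growing values of the upper bound of the n-gram range.

```python
import re

corpus = ['The quick brown fox jumps over the lazy dog!',
          'This document is the second document.']

def tokenize(document):
    # Lowercase and keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", document.lower())

def vocabulary_size(corpus, max_n):
    """Count distinct n-grams with 1 <= n <= max_n across all documents."""
    vocabulary = set()
    for document in corpus:
        tokens = tokenize(document)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                vocabulary.add(" ".join(tokens[i:i + n]))
    return len(vocabulary)

sizes = {max_n: vocabulary_size(corpus, max_n) for max_n in (1, 2, 3)}
print(sizes)
```

The vocabulary grows from 12 unigrams to 25 features with bigrams and 36 with trigrams, which illustrates how quickly the feature space expands as n increases.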

In [39]:
# to display for learning.
# don't actually convert sparse matrix to dataframe in production
word_matrix2df(X, tokens)

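|      | brown | brown fox | brown fox jumps | document | document is | document is the | dog | fox | fox jumps | fox jumps over | ... | the | the lazy | the lazy dog | the quick | the quick brown | the second | the second document | this | this document | this document is |
|------|---|---|---|---|---|---|---|---|---|---|-----|---|---|---|---|---|---|---|---|---|---|
| Doc0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Doc1 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |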


### Hashing Trick
To make the representation more compact, we can use a hash function to map each token directly to a column index, so no vocabulary has to be stored. 
The mapping is one-way: once the text is vectorized, the original features (words) cannot be retrieved.

In [60]:
from sklearn.feature_extraction.text import HashingVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy brown dog"]

# Let's create a HashingVectorizer with a vector size of 20.
# No fit step is needed because hashing requires no learned vocabulary.
# Defaults are changed here to make the output easier to read:
# - norm='l1' makes the entries sum to 1
#   (the default norm is l2, a better normalization for document similarity)
# - alternate_sign=False keeps all values non-negative; with the default settings
#   some values are negative to compensate for hash collisions
#   more: https://github.com/scikit-learn/scikit-learn/issues/7513
vectorizer = HashingVectorizer(n_features=20, alternate_sign=False, norm='l1')

# create the hashing vector
vector = vectorizer.transform(text)
# summarize the vector
print(vector.shape)
print(vector.toarray())

(1, 20)
[[0.  0.  0.  0.  0.  0.1 0.  0.1 0.2 0.  0.  0.1 0.  0.2 0.  0.1 0.  0.
  0.2 0. ]]


In [41]:
# WARNING: keep things in sparse representation in practice;
# the conversions to arrays here are for learning only
type(vector)

scipy.sparse.csr.csr_matrix

In [61]:
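# The vector size above (20) exceeds the number of distinct words (8); collisions are
# still possible, and they become unavoidable once the vector size is smaller than the vocabulary.
# To demonstrate, let's use a vector size of 3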
vectorizer = HashingVectorizer(n_features=3, alternate_sign = False, norm='l1') 
# create the hashing vector
vector = vectorizer.transform(text)
# summarize the vector
print(vector.shape)
print(vector.toarray())

(1, 3)
[[0.4 0.4 0.2]]
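
The idea behind the hashing trick can be sketched without scikit-learn. The code below is an illustration under explicit assumptions: HashingVectorizer uses MurmurHash3, whereas this sketch swaps in MD5 from the standard library purely to stay dependency-free, so the bucket positions will not match the outputs above. The helper names `bucket` and `hash_vectorize` are made up, and no normalization is applied (raw counts are kept).

```python
import hashlib

def bucket(token, n_features):
    """Map a token to a column index with a stable hash (MD5 here, as a stand-in)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hash_vectorize(document, n_features):
    """Count tokens per bucket; no vocabulary is built or stored."""
    vector = [0] * n_features
    for token in document.lower().split():
        vector[bucket(token, n_features)] += 1
    return vector

text = "The quick brown fox jumped over the lazy brown dog"

wide = hash_vectorize(text, 20)
narrow = hash_vectorize(text, 3)
print(wide)
print(narrow)
```

Both vectors account for all 10 tokens, but with only 3 buckets the 8 distinct words must share columns, and nothing in the vector records which words landed where.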


### TF-IDF Vectorizer
**Term frequency (TF):** The term frequency of a word is the number of times it occurs in a document divided by the length of the document (its total number of words).   
   
**Inverse Document Frequency (IDF):** The IDF of a word is the log of the ratio of the total number of documents in the corpus to the number of documents that contain the word. 


In [1]:
mini_corpus = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]

from sklearn.feature_extraction.text import TfidfVectorizer
#Create the transform
vectorizer = TfidfVectorizer()
#Tokenize and build vocab
vectorizer.fit(mini_corpus)
#Summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]


In the example above, "the" occurs in every document, so it has the smallest idf (1.0).
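
The printed `idf_` values do not follow the plain log-ratio definition exactly, because TfidfVectorizer smooths the IDF by default (`smooth_idf=True`): idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word. The sketch below compares the two variants; the function names are made up for the example, and the document frequencies are counted by hand from `mini_corpus`.

```python
import math

n_documents = 3
# Number of documents in mini_corpus that contain each word (counted by hand)
document_frequency = {'the': 3, 'dog': 2, 'fox': 2, 'quick': 1}

def textbook_idf(df, n):
    # Definition from the text: log of (total documents / documents with the word)
    return math.log(n / df)

def smoothed_idf(df, n):
    # scikit-learn's default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

for word, df in document_frequency.items():
    print(word, round(textbook_idf(df, n_documents), 8),
          round(smoothed_idf(df, n_documents), 8))
```

The smoothed values reproduce the array printed above (1.0, 1.28768207, and 1.69314718). With the plain definition, a word that occurs in every document gets an IDF of exactly 0 and would be ignored entirely; the added 1 prevents that.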

More: 
Customizing the components of vectorizers:   
https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af