# Encoding Text Data for Machine Learning
* The text must first be parsed into individual words (tokens), a step called tokenization.
* The words must then be encoded as integers or floating-point values before they can be used to fit a model.
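
The two steps above can be sketched in plain Python. This is only an illustration: the token pattern (runs of two or more word characters, lowercased) and the helper names `tokenize` and `build_vocabulary` are assumptions made for this example, not part of any library.

```python
import re

# Assumption: a token is a run of two or more word characters, lowercased.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(document):
    """Split a document into lowercase word tokens."""
    return TOKEN_PATTERN.findall(document.lower())

def build_vocabulary(documents):
    """Map each distinct token to an integer index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

docs = ["The quick brown fox jumped over the lazy dog."]
vocab = build_vocabulary(docs)

# Replace every token in the first document with its integer index.
encoded = [vocab[tok] for tok in tokenize(docs[0])]

print(tokenize(docs[0]))
print(vocab)
print(encoded)
```

The integer sequence keeps word order. The bag-of-words encoders keep only how often each index occurs.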


# Bag of Words
* This model focuses on the occurrence (count) of words, or on the degree to which each word is present within a document and across documents.
* Three methods in scikit-learn: CountVectorizer, TfidfVectorizer and HashingVectorizer
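
Of the three, the hashing approach is the only one not demonstrated in this notebook, so here is a minimal sketch of it. The value `n_features=20` is an arbitrary choice made for readability and is an assumption of this example. Real use normally keeps a much larger value to limit hash collisions.

```python
from sklearn.feature_extraction.text import HashingVectorizer

text = ["The quick brown fox jumped over the lazy dog."]

# n_features=20 is an illustrative value, not a recommendation.
hv = HashingVectorizer(n_features=20)

# No fit step is needed: tokens are hashed straight to column indices,
# so the vectorizer never stores a vocabulary.
hashed = hv.transform(text)

print(hashed.shape)
print(hashed.toarray())
```

Because no vocabulary is kept, the encoding cannot be mapped back to words. That is the trade-off for a fixed-size vector with no vocabulary held in memory.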

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
text = ['The quick brown fox jumped over the lazy dog.']
cv = CountVectorizer()
vector = cv.fit_transform(text)


In [5]:
print(cv.vocabulary_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}


In [6]:
print('Shape of vector::',vector.shape)
print('Array::',vector.toarray())
print(type(vector))

Shape of vector:: (1, 8)
Array:: [[1 1 1 1 1 1 1 2]]
<class 'scipy.sparse.csr.csr_matrix'>
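
The array is easier to read when each column is paired with its word. The following sketch repeats the cell above and then sorts the vocabulary by column index. The names `words` and `word_counts` are introduced here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['The quick brown fox jumped over the lazy dog.']
cv = CountVectorizer()
vector = cv.fit_transform(text)

# Order the words by their column index so they line up with the array.
words = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
counts = vector.toarray()[0]

# Pair every word with the count stored in its column.
word_counts = {word: int(count) for word, count in zip(words, counts)}
print(word_counts)
```

The only count above 1 belongs to "the", which appears twice in the sentence once the text is lowercased.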


# TF-IDF Vectorization
One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, the two components of the score assigned to each word.

Term Frequency: This summarizes how often a given word appears within a document.

Inverse Document Frequency: This scales down the weight of words that appear in many documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["the quick brown fox jumped over the lazy dog.",
        "the dog.",
        "the fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)


{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]


* Above, a vocabulary of 8 words is learned from the documents, and each word is assigned a unique integer index in the output vector.

* The inverse document frequencies are calculated for each word in the vocabulary. The lowest score, 1.0, goes to “the” at index 7, the only word that appears in all three documents.
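
The printed weights can be reproduced by hand. With its default settings, scikit-learn documents a smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below applies it to the three documents. The helper `smoothed_idf` and its simplified whitespace tokenization are assumptions made for this example.

```python
import math

documents = [
    "the quick brown fox jumped over the lazy dog.",
    "the dog.",
    "the fox",
]

def smoothed_idf(term, docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1 for n documents."""
    n = len(docs)
    # Simplification: split on whitespace and strip a trailing period,
    # which is enough for these three toy documents.
    df = sum(term in {w.strip(".") for w in doc.split()} for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

for term in ["the", "dog", "fox", "brown"]:
    print(term, round(smoothed_idf(term, documents), 8))
```

A word found in all three documents gets ln(4/4) + 1 = 1.0, a word found in two gets ln(4/3) + 1, and a word found in one gets ln(4/2) + 1. These are the three distinct values in the `idf_` output.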

In [8]:
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]
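
The encoded values can be checked by hand: each raw count is multiplied by the word's inverse document frequency, and the resulting vector is scaled to unit Euclidean length (the default `norm='l2'` setting). The sketch below assumes the vocabulary order and `idf_` values printed earlier and hard-codes them for illustration.

```python
import math

# Vocabulary order and idf values as printed by the fitted vectorizer above.
vocabulary = ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"]
idf = [1.69314718, 1.28768207, 1.28768207, 1.69314718,
       1.69314718, 1.69314718, 1.69314718, 1.0]

# Raw term counts for the first document: "the" occurs twice.
tokens = "the quick brown fox jumped over the lazy dog".split()
tf = [tokens.count(word) for word in vocabulary]

# Weight each count by its idf, then scale the vector to unit length.
weighted = [count * weight for count, weight in zip(tf, idf)]
length = math.sqrt(sum(value ** 2 for value in weighted))
tfidf = [value / length for value in weighted]

print([round(value, 8) for value in tfidf])
```

Although “the” has the lowest inverse document frequency, it occurs twice in this document, so it still ends up with the largest entry in the vector.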
