# The Problem with Text

A problem with modeling text is that it is messy, while techniques such as machine learning algorithms prefer well-defined, fixed-length inputs and outputs.

Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers, specifically vectors of numbers. This is called feature extraction or feature encoding.

A popular and simple method of feature extraction with text data is called the bag-of-words model of text.

## Bag of Words

A bag-of-words is a representation of text that describes the occurrence of words within a document.

A vocabulary is chosen, where perhaps some infrequently used words are discarded. A given document of text is then represented using a vector with one position for each word in the vocabulary and a score for each known word that appears (or not) in the document.

It is called a "bag" of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
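To make the loss of order concrete, here is a minimal sketch using only Python's standard library. The two sentences are made up for illustration. They contain the same words in a different order, so their bags are identical.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
doc_a = "dog bites man"
doc_b = "man bites dog"

# A Counter maps each word to its count and ignores position.
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

print(bag_a == bag_b)  # True
```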

The model involves two things:

1. A vocabulary of known words.

2. A measure of the presence of known words.
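These two ingredients can be sketched in plain Python. The helper names `tokenize`, `build_vocabulary` and `encode` are hypothetical, the tokenization rule (lowercase, keep runs of letters and digits) is a simplifying assumption, and a raw count is used as the measure of presence.

```python
import re

def tokenize(doc):
    """Lowercase the text and return its runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", doc.lower())

def build_vocabulary(docs):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({word for doc in docs for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def encode(doc, vocabulary):
    """Return a count vector with one position per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word in tokenize(doc):
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

# Three short example documents.
docs = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]
vocabulary = build_vocabulary(docs)
print(vocabulary)
print(encode("The dog.", vocabulary))  # [0, 1, 0, 0, 0, 0, 0, 1]
```

Every document is encoded to the same length, the size of the vocabulary, no matter how long the document is.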

## Bag-of-Words with scikit-learn
The scikit-learn Python library for machine learning provides tools for encoding documents for a bag-of-words model.

An instance of the encoder can be created, trained on a corpus of text documents and then used again and again to encode training, test, validation and any new data that needs to be encoded for your model.

There are three encoders: CountVectorizer scores words by their count, HashingVectorizer applies a hash function to each word so that the vector length is fixed in advance, and TfidfVectorizer scores each word by its frequency in the document weighted by its inverse frequency across all documents.
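The hashing idea can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how scikit-learn computes its columns: HashingVectorizer uses a different hash function (MurmurHash3) and, by default, a much larger vector, signed values and normalization. The function name `hashed_counts`, the use of MD5 and the vector length of 16 are assumptions made for this sketch.

```python
import hashlib

def hashed_counts(doc, n_features=16):
    """Count words into a fixed-length vector without storing a vocabulary."""
    vector = [0] * n_features
    for word in doc.lower().split():
        word = word.strip(".,!?")
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # column chosen by the hash
        vector[index] += 1
    return vector

vector = hashed_counts("The quick brown fox jumped over the lazy dog.")
print(len(vector))  # 16, regardless of how many distinct words exist
print(sum(vector))  # 9, one increment per word
```

Because no vocabulary is stored, different words can land in the same column, and the mapping from column back to word is lost.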

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build the vocabulary
vectorizer.fit(text)
# summarize: word-to-column mapping and per-word inverse document frequency
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
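The printed values are consistent with the smoothed inverse document frequency that scikit-learn applies by default, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word. The sketch below recomputes them by hand. The document frequencies are counted from the three example documents, and the function name `smoothed_idf` is hypothetical.

```python
import math

# Number of example documents that contain each word.
document_frequency = {"brown": 1, "dog": 2, "fox": 2, "jumped": 1,
                      "lazy": 1, "over": 1, "quick": 1, "the": 3}
n_documents = 3

def smoothed_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

idf = {word: smoothed_idf(df, n_documents)
       for word, df in document_frequency.items()}
for word, value in idf.items():
    print(f"{word}: {value:.8f}")
```

A word that appears in every document, such as "the", gets the minimum weight of 1, while words found in a single document get the highest weight.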


In [5]:
# encode the second document, "The dog."
vector = vectorizer.transform([text[1]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 8)
[[0.         0.78980693 0.         0.         0.         0.
  0.         0.61335554]]
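The two non-zero entries can be checked by hand. With the default settings, each word's count in the document is multiplied by its idf and the resulting vector is scaled to unit Euclidean length. The sketch below assumes those defaults and reuses the idf values printed earlier.

```python
import math

# Smoothed idf values for the two words in "The dog.", as printed earlier.
idf = {"dog": 1.28768207, "the": 1.0}
term_counts = {"dog": 1, "the": 1}

# Raw tf-idf weight: term count times idf.
weights = {word: term_counts[word] * idf[word] for word in term_counts}

# L2 normalization: divide by the Euclidean length of the vector.
length = math.sqrt(sum(w * w for w in weights.values()))
normalized = {word: w / length for word, w in weights.items()}
print(normalized)
```

The results agree with columns 1 ("dog") and 7 ("the") of the encoded vector to the printed precision.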


## Bag-of-Words with Keras
The Keras Python library for deep learning also provides tools for encoding text using the bag-of-words model in the Tokenizer class.

As above, the encoder must be trained on source documents and then can be used to encode training data, test data and any other data in the future. The API also has the benefit of performing basic tokenization prior to encoding the words.

In [8]:
from keras.preprocessing.text import Tokenizer
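# reuse the example documents defined earlier
docs = text
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)     # total occurrences of each word
print(t.document_count)  # number of documents seen
print(t.word_index)      # word-to-integer mapping
print(t.word_docs)       # number of documents containing each word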

Using TensorFlow backend.


OrderedDict([('the', 4), ('quick', 1), ('brown', 1), ('fox', 2), ('jumped', 1), ('over', 1), ('lazy', 1), ('dog', 2)])
3
{'the': 1, 'fox': 2, 'dog': 3, 'quick': 4, 'brown': 5, 'jumped': 6, 'over': 7, 'lazy': 8}
{'lazy': 1, 'brown': 1, 'over': 1, 'dog': 2, 'jumped': 1, 'fox': 2, 'quick': 1, 'the': 3}
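The ordering of `word_index` appears consistent with ranking words by total count, most frequent first, with ties kept in order of first appearance. The sketch below reproduces that mapping in plain Python. The helper `words` is hypothetical, and its tokenization (lowercase, drop periods, split on whitespace) is a simplification that happens to suffice for these documents.

```python
from collections import Counter

docs = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]

def words(doc):
    """Lowercase, replace periods with spaces, split on whitespace."""
    return doc.lower().replace(".", " ").split()

# Total occurrences of each word, kept in order of first appearance.
word_counts = Counter(word for doc in docs for word in words(doc))

# Rank by descending count; sorted() is stable, so ties keep their order.
ranked = sorted(word_counts, key=lambda word: -word_counts[word])

# Indices start at 1 because index 0 is reserved.
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}
print(word_index)
```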


In [11]:
# encode each document as a row of word counts
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

[[0. 2. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0.]]
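Each row has nine columns for eight words because column 0 corresponds to the reserved index 0 and is never filled. A minimal sketch, assuming the word-to-index mapping printed earlier and a simplified tokenization, rebuilds the same matrix by hand. The function name `count_row` is hypothetical.

```python
# Word-to-index mapping as printed by the tokenizer earlier.
word_index = {"the": 1, "fox": 2, "dog": 3, "quick": 4,
              "brown": 5, "jumped": 6, "over": 7, "lazy": 8}
docs = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]

def count_row(doc, word_index):
    """Build one row: position i holds the count of the word with index i."""
    row = [0.0] * (len(word_index) + 1)  # extra column for reserved index 0
    for word in doc.lower().replace(".", " ").split():
        if word in word_index:
            row[word_index[word]] += 1.0
    return row

matrix = [count_row(doc, word_index) for doc in docs]
for row in matrix:
    print(row)
```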
