🗣️ ***A Greater Understanding Brought To You By This Article: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/***

# How to use text in ML models:

Text data requires special preparation before you can start using it for predictive modeling.

The text must first be parsed to split it into words, called tokenization. Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, called feature extraction (or vectorization).



Notes cover:
- How to convert text to word count vectors with CountVectorizer.
- How to convert text to word frequency vectors with TfidfVectorizer.
- How to convert text to unique integers with HashingVectorizer.

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

# Bag-of-Words Model
https://machinelearningmastery.com/gentle-introduction-bag-words-model/

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

We are only concerned with encoding schemes that represent which words are present, or the degree to which they are present, in encoded documents, without any information about order.
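
To make the "order is thrown away" idea concrete, here is a minimal sketch in plain Python. The helper `bag_of_words` is a hypothetical name used only for illustration, and its regex tokenizer is a simplification, not what scikit-learn does internally.

```python
import re
from collections import Counter


def bag_of_words(document):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(tokens)


# Same words in a different order: the two bags come out identical.
bag_a = bag_of_words("The dog chased the fox")
bag_b = bag_of_words("The fox chased the dog")

print(bag_a)
print(bag_a == bag_b)
```

The two sentences mean different things, but the model cannot tell them apart because only the counts survive.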

### Word Counts with CountVectorizer 🍎

The **CountVectorizer** provides a simple way to 
1. tokenize a collection of text documents 
2. build a vocabulary of known words 
3. encode new documents using that vocabulary

***Steps:***
1. Create an instance of the ```CountVectorizer``` class.
2. Call the ```fit()``` method in order to learn a vocabulary from one or more documents.
3. Call the ```transform()``` method on one or more documents as needed to encode each as a vector.

An **encoded vector** is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.

Because these vectors will contain a lot of zeros, we call them **sparse**. SciPy provides an efficient way of handling sparse vectors in the ```scipy.sparse``` package.

The vectors returned from a call to ```transform()``` will be *sparse vectors*, and you can convert them to dense NumPy arrays to look at them and better understand what is going on by calling the ```toarray()``` method.
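
As a rough illustration of why sparse storage helps, the sketch below builds a mostly-zero row by hand (the 1000-column size and the two non-zero positions are made up for the example) and stores it as a SciPy CSR matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero row, like a count vector over a large vocabulary.
dense_row = np.zeros((1, 1000), dtype=np.int64)
dense_row[0, 3] = 2
dense_row[0, 42] = 1

sparse_row = csr_matrix(dense_row)

print(sparse_row.shape)  # the full logical shape is kept
print(sparse_row.nnz)    # number of values actually stored
print(np.array_equal(sparse_row.toarray(), dense_row))  # round trip back to dense
```

Only the non-zero entries are stored, and ```toarray()``` rebuilds the full dense row when you want to look at it.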

Below is an example of using the ```CountVectorizer``` to tokenize, build a vocabulary, and then encode a document.


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

# create the transform
vectorizer = CountVectorizer()

CountVectorizer()

...learning from the given text

In [13]:
# tokenize and build vocab
vectorizer.fit(text)

# access the vocabulary to see exactly what was tokenized
print(vectorizer.vocabulary_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}


 We can see that all words were made lowercase by default and that the punctuation was ignored. These and other aspects of tokenizing can be configured and I encourage you to review all of the options in the API documentation.
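
A few of those options, sketched on the same sentence. The parameters `lowercase`, `stop_words`, and `ngram_range` are real `CountVectorizer` arguments; the variable names are just for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog."]

# Keep the original casing, so "The" and "the" become separate entries.
cased = CountVectorizer(lowercase=False).fit(docs)
print(sorted(cased.vocabulary_))

# Drop common English words such as "the" using the built-in stop word list.
no_stop = CountVectorizer(stop_words="english").fit(docs)
print(sorted(no_stop.vocabulary_))

# Count two-word sequences (bigrams) in addition to single words.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print("quick brown" in bigrams.vocabulary_)
```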

 ... encode text into a sparse vector

In [16]:
# encode document
vector = vectorizer.transform(text)
print("vector shape: ",vector.shape) # 1 body of text and 8 unique words in the vocabulary, therefore the encoded vector has shape (1, 8)

vector shape:  (1, 8)


In [15]:
# summarize encoded vector
# print(vector.shape)
print(type(vector))
print(vector.toarray())

<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


We can then see that the encoded vector is a **sparse matrix**. Finally, we can see an array version of the encoded vector showing a count of 1 occurrence for each word except "the" (index 7), which has a count of 2.


***Importantly***, the same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.

Below is an example of using the vectorizer above to encode a document with one word in the vocab and one word that is not.

In [None]:
# encode another document
text2 = ["the puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())

Running this example prints the array version of the encoded sparse vector, showing one occurrence of the one word in the vocab, with the other word (not in the vocab) completely ignored.

[[0 0 0 0 0 0 0 1]]

The encoded vectors can then be used directly with a machine learning algorithm.



🥥🤯 
Sooo what that means: it's best to use a vectorizer that has "learned" as many words as possible, so that no matter what body of text we transform, the sparse matrix returned won't be all zeros and fewer words in the text will be ignored 🤓

The last element in the matrix corresponds to the word "the", which was originally learned at the ```vectorizer.fit(text)``` stage. In ```text2```, there is only 1 occurrence of the word "the" 😁
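
One way to see this in action is to learn the vocabulary from both documents, so "puppy" gets its own column. This is only a sketch; the names `first_doc`, `second_doc`, and `wider_vectorizer` are made up here so they don't clash with the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer

first_doc = ["The quick brown fox jumped over the lazy dog."]
second_doc = ["the puppy"]

# Learn the vocabulary from both documents this time.
wider_vectorizer = CountVectorizer()
wider_vectorizer.fit(first_doc + second_doc)

encoded = wider_vectorizer.transform(second_doc).toarray()
puppy_index = wider_vectorizer.vocabulary_["puppy"]
the_index = wider_vectorizer.vocabulary_["the"]

print(encoded.shape)  # one extra column compared with the 8-word vocabulary
print(encoded[0, puppy_index], encoded[0, the_index])
```

Now both words in "the puppy" are counted instead of just one.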

### Word Frequencies with TfidfVectorizer 🍊

Word counts are a good starting point, but are very basic.

One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

An alternative is to **calculate word frequencies**, and by far the most popular method is called ***TF-IDF***. This is an acronym that stands for *“Term Frequency – Inverse Document Frequency”*, which are the components of the resulting scores assigned to each word.

* **Term Frequency:** This summarizes how often a given word appears within a document.
* **Inverse Document Frequency:** This downscales words that appear a lot across documents.


Without going into the math, ***TF-IDF are word frequency scores that try to highlight words that are more interesting***, e.g. frequent in a document but not across documents.

The TfidfVectorizer will 
- tokenize documents
- learn the vocabulary and the inverse document frequency weightings
- allow you to encode new documents. 

Alternatively, if you already have a learned CountVectorizer, you can pass its word counts to a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.
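
A minimal sketch of that two-step route, checked against the one-step `TfidfVectorizer` on the same three small documents. The variable names are assumptions for the example, and both classes are left on their default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox",
]

# Step 1: plain word counts.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(corpus)

# Step 2: learn the IDF weights from the counts and rescale them.
tfidf_transformer = TfidfTransformer()
two_step = tfidf_transformer.fit_transform(counts)

# The one-step vectorizer should give the same scores.
one_step = TfidfVectorizer().fit_transform(corpus)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```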

The same create, fit, and transform process is used as with the CountVectorizer.

Below is an example of using the TfidfVectorizer to learn vocabulary and inverse document frequencies across 3 small documents and then encode one of those documents.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox"]

# create the transform
vectorizer = TfidfVectorizer()


In [20]:
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]


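Now we have the inverse document frequency vector: one weight per vocabulary word, with the lowest score of 1.0 assigned to the most frequently observed word, "the", at index 7.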
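
For the curious, these numbers can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word. The variable names are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox",
]

counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents does each word appear?
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default, smooth_idf=True).
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(manual_idf)
print(np.allclose(manual_idf, TfidfVectorizer().fit(corpus).idf_))
```

"the" appears in all 3 documents, so ln(4 / 4) + 1 = 1.0, while a word seen in only one document gets ln(4 / 2) + 1 ≈ 1.693.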

In [19]:
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]


We have the array version of our encoded vector for the first document: its word counts scaled by the inverse document frequency weightings and then normalized, so every score lies between 0 and 1.
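
The encoded row can also be rebuilt by hand from the numbers already printed. This sketch assumes the default settings (raw counts as term frequency, L2 normalization of each row) and copies the counts and the ```vectorizer.idf_``` values from the outputs above.

```python
import numpy as np

# Raw counts for the first document, in vocabulary order (index 7 is "the").
term_counts = np.array([1, 1, 1, 1, 1, 1, 1, 2], dtype=float)

# Inverse document frequencies as printed by vectorizer.idf_.
idf = np.array([1.69314718, 1.28768207, 1.28768207, 1.69314718,
                1.69314718, 1.69314718, 1.69314718, 1.0])

# Weight each count by its IDF, then scale the row to unit length (L2 norm).
weighted = term_counts * idf
tfidf_row = weighted / np.linalg.norm(weighted)

print(tfidf_row)
```

Even though "the" has the lowest IDF, it still gets the highest score in this document because it appears twice.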

### Hashing with HashingVectorizer 🍒

Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.

This, in turn, will require large vectors for encoding documents and impose large requirements on memory and slow down algorithms.👎

A clever workaround is to **use a one-way hash of words to convert them to integers**. The clever part is that no vocabulary is required and you can choose an arbitrary fixed-length vector. 

A downside is that **the hash is a one-way function** so there is no way to convert the encoding back to a word (which may not matter for many supervised learning tasks).

The ```HashingVectorizer``` class implements this approach: it tokenizes documents, consistently hashes the words, and encodes the documents as needed.

The example below demonstrates the ```HashingVectorizer``` for encoding a single document.

***Note:*** This vectorizer does not require a call to fit on the training data documents. Instead, after instantiation, it can be used directly to start encoding documents.

In [21]:
from sklearn.feature_extraction.text import HashingVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = HashingVectorizer(n_features=20)

An arbitrary fixed-length vector size of 20 was chosen. This corresponds to the range of the hash function, where small values (like 20) may result in hash collisions. *Remembering back to compsci classes, I believe there are heuristics that you can use to pick the hash length and probability of collision based on estimated vocabulary size.*
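
To see collisions for sure, here is a sketch that squeezes the 8 unique words into only 4 buckets, so at least two words must share a bucket. The settings `norm=None` and `alternate_sign=False` are used here (they are not the defaults) so the output is plain counts; `tiny_hasher` and the other names are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["The quick brown fox jumped over the lazy dog."]

# Number of distinct words, found with an ordinary learned vocabulary.
n_unique_words = len(CountVectorizer().fit(docs).vocabulary_)

# Hash the same words into only 4 buckets: plain counts, no sign flipping.
tiny_hasher = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)
buckets = tiny_hasher.transform(docs).toarray()

print(n_unique_words)       # distinct words in the document
print((buckets > 0).sum())  # buckets actually used, at most 4
print(buckets.sum())        # every token is still counted somewhere
```

With more words than buckets, several words end up added together in the same position, and there is no way to tell them apart afterwards.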

In [22]:
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
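
A quick check of two things about this output, as a sketch with default settings apart from `n_features=20`: a brand new instance that never saw the data gives the same encoding, and the row is scaled to length 1, which is why the values sit between -1 and 1. The names `hasher_a` and `hasher_b` are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["The quick brown fox jumped over the lazy dog."]

hasher_a = HashingVectorizer(n_features=20)
hasher_b = HashingVectorizer(n_features=20)  # a separate, never-fitted instance

row_a = hasher_a.transform(docs).toarray()
row_b = hasher_b.transform(docs).toarray()

print(np.array_equal(row_a, row_b))  # same hash function, same encoding
print(np.linalg.norm(row_a))         # default L2 normalization of the row
print(row_a.min() >= -1, row_a.max() <= 1)
```

The negative values come from the hash also deciding a sign for each word, so they are not negative counts.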


# 🏎️🏁 Finished 🏎️🏁

Look what you've learned today 👏🤓