Credits: This notebook contains an excerpt from the [Python Data Science Handbook]
by Jake VanderPlas.

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). <br/>
If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# Vectorization

In our previous lessons and assignments, we assumed that <br/>
    the examples already come in a tidy, numerical dataset format (a table of samples by features).
    
In the real world, data rarely comes in such a form.

In this notebook we will practice vectorization of text data (and some categorical data).

We will review the following:
* The CountVectorizer
* The TfidfVectorizer
* The DictVectorizer

In [1]:
# Standard imports
import os
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

warnings.simplefilter("ignore")
%matplotlib inline

# Show the output of every statement in a cell, not only the last one. This lets us condense every trick into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Text Features
In the vectorization process of text features we need to convert text to a set of representative numerical values.<br/>
For example, most automatic mining of social media data relies on some form of encoding the text as numbers.<br/>

### The CountVectorizer
One of the simplest methods of encoding data is by *word counts*: <br/>
you take each snippet of text, count the occurrences of each word within it, <br/>
and put the results in a table.

For example, consider the following set of three phrases:

In [5]:
sample = ['problem of evil',
          'evil queen is evil',
          'horizon problem']

For a vectorization of this data based on word count, we could construct <br/>
   a column representing the word "problem," the word "evil," the word "horizon," and so on.<br/>
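
To make the idea concrete, here is a minimal sketch of that by-hand construction using only the standard library. The names `counts`, `vocabulary`, and `manual_matrix` are illustrative, and splitting on whitespace is a simplifying assumption rather than the tokenization Scikit-Learn applies.

```python
from collections import Counter

sample = ['problem of evil',
          'evil queen is evil',
          'horizon problem']

# Count the words in each phrase (whitespace split: an assumption for illustration).
counts = [Counter(phrase.split()) for phrase in sample]

# The columns are the sorted set of every word seen in any phrase.
vocabulary = sorted({word for c in counts for word in c})

# One row per phrase, one column per vocabulary word.
manual_matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in manual_matrix:
    print(row)
```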
   
While doing this by hand would be possible, the tedium can be avoided by using Scikit-Learn's ``CountVectorizer``:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
sample
vec = CountVectorizer()
X_train = vec.fit_transform(sample)
type(X_train)
X_train
type(X_train.toarray())
X_train.toarray()

['problem of evil', 'evil queen is evil', 'horizon problem']

scipy.sparse.csr.csr_matrix

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

numpy.ndarray

array([[1, 0, 0, 1, 1, 0],
       [2, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1, 0]], dtype=int64)

The result is a sparse matrix recording the number of times each word appears; it is easier to inspect if we convert this to a ``DataFrame`` with labeled columns:

In [8]:
import pandas as pd
sample
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X_train.toarray(), columns=vec.get_feature_names_out())

['problem of evil', 'evil queen is evil', 'horizon problem']

| | evil | horizon | is | of | problem | queen |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 2 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 |


In [9]:
sample_test = ['big problem problem',
          'not evil queen',
          'the horizon']
sample_test
X_test= vec.transform(sample_test)
pd.DataFrame(X_test.toarray(), columns=vec.get_feature_names_out())

['big problem problem', 'not evil queen', 'the horizon']

| | evil | horizon | is | of | problem | queen |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 |
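
Note that the test phrases contain words that never appeared in the training phrases. `transform` reuses the vocabulary learned during `fit` and does not count words outside it. The following sketch makes this visible; the variable names are illustrative, and the phrases are repeated so the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_phrases = ['problem of evil', 'evil queen is evil', 'horizon problem']
test_phrases = ['big problem problem', 'not evil queen', 'the horizon']

# Learn the vocabulary from the training phrases only.
vec_demo = CountVectorizer().fit(train_phrases)
learned_words = set(vec_demo.vocabulary_)

# Words in the test phrases that the vectorizer has never seen.
test_words = {word for phrase in test_phrases for word in phrase.split()}
unseen_words = sorted(test_words - learned_words)

# Each test phrase has three or two words, but only known words are counted.
X_demo = vec_demo.transform(test_phrases)
counted_per_phrase = X_demo.toarray().sum(axis=1).tolist()

print(unseen_words)        # ['big', 'not', 'the']
print(counted_per_phrase)  # [2, 2, 1]
```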


#### Some parameters you should review:
* **analyzer** - default=`'word'`; it can be changed to `'char'` or `'char_wb'`.
  * The `'char_wb'` option creates character n-grams only from text inside word boundaries.
* **tokenizer** - overrides the string tokenization step while preserving the preprocessing and n-gram generation steps.
* **stop_words** - if a list is given (stop_words=python_lst), it is assumed to contain stop words, all of which will be removed from the resulting tokens.
* **ngram_range** - tuple (min_n, max_n), default=(1, 1) - the range of n-gram sizes to extract; for example, (1, 2) extracts both single words and two-word sequences.
* **min_df** - float in range [0.0, 1.0] or int, default=1 - the minimum number of documents (or proportion of documents) in which a word (or basic unit) must appear in order to be kept.
* **max_df** - float in range [0.0, 1.0] or int, default=1.0 - the maximum number of documents (or proportion of documents) in which a word (or basic unit) may appear; words above this threshold are dropped.
* **max_features** - int, default=None - if set, keep only the top max_features terms, ordered by term frequency across the corpus.
* **vocabulary** - Mapping or iterable, default=None - either a mapping (e.g., a dict) from terms to column indices in the feature matrix, or an iterable of terms; only these terms will be counted.
* **dtype** - type, default=np.int64 - the type of the values in the resulting matrix.

For additional information click the link: [sklearn's CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
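
As a rough illustration of how some of these parameters change the learned vocabulary, here is a small sketch on the same three phrases. The stop-word list and the variable names are chosen only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = ['problem of evil',
          'evil queen is evil',
          'horizon problem']

# stop_words: the listed words are removed before counting.
vec_stop = CountVectorizer(stop_words=['of', 'is']).fit(sample)
vocab_stop = sorted(vec_stop.vocabulary_)

# min_df=2: keep only words that appear in at least two phrases.
vec_min = CountVectorizer(min_df=2).fit(sample)
vocab_min = sorted(vec_min.vocabulary_)

# max_features=2: keep the two words with the highest total count.
vec_top = CountVectorizer(max_features=2).fit(sample)
vocab_top = sorted(vec_top.vocabulary_)

print(vocab_stop)  # ['evil', 'horizon', 'problem', 'queen']
print(vocab_min)   # ['evil', 'problem']
print(vocab_top)   # ['evil', 'problem']
```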

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
sample
vec_ngrams = CountVectorizer(ngram_range=(1,2))
X_train = vec_ngrams.fit_transform(sample)
type(X_train)
X_train
type(X_train.toarray())
X_train.toarray()
pd.DataFrame(X_train.toarray(), columns=vec_ngrams.get_feature_names_out())

['problem of evil', 'evil queen is evil', 'horizon problem']

scipy.sparse.csr.csr_matrix

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

numpy.ndarray

array([[1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
       [2, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]], dtype=int64)

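| | evil | evil queen | horizon | horizon problem | is | is evil | of | of evil | problem | problem of | queen | queen is |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |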


### The TfidfVectorizer
There are some issues with the `CountVectorizer` approach: <br/>
   the raw word counts lead to features that put too much weight on words that appear very frequently, <br/>
   and this can be sub-optimal in some classification algorithms.<br/>

One approach to fix this is known as *term frequency-inverse document frequency* (*TF–IDF*),<br/>
   which weights the word counts by a measure of how often they appear in the documents.<br/>
The syntax for computing these features is similar to the previous example:

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
sample
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

['problem of evil', 'evil queen is evil', 'horizon problem']

| | evil | horizon | is | of | problem | queen |
|---|---|---|---|---|---|---|
| 0 | 0.517856 | 0.0 | 0.0 | 0.680919 | 0.517856 | 0.0 |
| 1 | 0.732359 | 0.0 | 0.481482 | 0.0 | 0.0 | 0.481482 |
| 2 | 0.0 | 0.795961 | 0.0 | 0.0 | 0.605349 | 0.0 |
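
The values above can be reproduced by hand from the word counts. The sketch below assumes Scikit-Learn's default settings: a smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The variable names are illustrative.

```python
import numpy as np

# Word counts for the three phrases (columns: evil, horizon, is, of, problem, queen).
counts = np.array([[1, 0, 0, 1, 1, 0],
                   [2, 0, 1, 0, 0, 1],
                   [0, 1, 0, 0, 1, 0]], dtype=float)

n_docs = counts.shape[0]

# Document frequency: in how many phrases does each word appear?
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default behaviour).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length.
tfidf = counts * idf
tfidf_l2 = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.round(tfidf_l2, 6))
```

Words that appear in two of the three phrases ("evil", "problem") get a lower IDF than words that appear in only one, which is why "of" outweighs "evil" in the first row even though both occur once.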


#### Some parameters you should review:
* **norm** - default=`'l2'`; `'l1'` is also possible.
  * `'l2'` - the sum of squares of the vector elements is 1,
  * `'l1'` - the sum of absolute values of the vector elements is 1.
* **use_idf** - bool, default=True - if False, inverse-document-frequency reweighting is disabled, so the features are term frequencies (normalized according to `norm`) rather than raw counts.
* **sublinear_tf** - bool, default=False - if True, applies sublinear tf scaling (motivated by Zipf's law), i.e., replaces tf with 1 + log(tf).

* **stop_words** - if a list is given (stop_words=python_lst), it is assumed to contain stop words, all of which will be removed from the resulting tokens.
* **ngram_range** - tuple (min_n, max_n), default=(1, 1) - the range of n-gram sizes to extract; for example, (1, 2) extracts both single words and two-word sequences.
* **min_df** - float in range [0.0, 1.0] or int, default=1 - the minimum number of documents (or proportion of documents) in which a word (or basic unit) must appear in order to be kept.
* **max_df** - float in range [0.0, 1.0] or int, default=1.0 - the maximum number of documents (or proportion of documents) in which a word (or basic unit) may appear; words above this threshold are dropped.
* **max_features** - int, default=None - if set, keep only the top max_features terms, ordered by term frequency across the corpus.
* **vocabulary** - Mapping or iterable, default=None - either a mapping (e.g., a dict) from terms to column indices in the feature matrix, or an iterable of terms; only these terms will be used.
* **dtype** - type, default=np.float64 - the type of the values in the resulting matrix.

For additional information click the link: [sklearn's TfidfVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
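
A short sketch of how `norm`, `use_idf`, and `sublinear_tf` affect the output on the same three phrases. The variable names are illustrative, and the last example turns off IDF and normalization only to expose the `1 + log(tf)` value directly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sample = ['problem of evil',
          'evil queen is evil',
          'horizon problem']

# norm='l1': the absolute values in each row sum to 1.
X_l1 = TfidfVectorizer(norm='l1').fit_transform(sample).toarray()
row_sums_l1 = X_l1.sum(axis=1)

# use_idf=False: no IDF reweighting, only L2-normalized term counts.
X_tf = TfidfVectorizer(use_idf=False).fit_transform(sample).toarray()

# sublinear_tf=True: a count of 2 becomes 1 + ln(2) instead of 2.
X_sub = TfidfVectorizer(sublinear_tf=True, use_idf=False,
                        norm=None).fit_transform(sample).toarray()
evil_in_second_phrase = X_sub[1, 0]  # column 0 is "evil"

print(row_sums_l1)
print(np.round(X_tf[1], 4))
print(round(evil_in_second_phrase, 4))
```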

### The DictVectorizer

One common type of non-numerical data is *categorical* data. <br/>
For example, imagine you are exploring some data on housing prices, <br/>
and along with numerical features like "price" and "rooms", you also have "neighborhood" information.<br/>

For example, your data might look something like this:

In [13]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

You might be tempted to encode this data with a straightforward numerical mapping:

In [14]:
# Is this a good solution?
{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3};

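This is generally not a useful approach: Scikit-Learn's models assume that numerical features reflect algebraic quantities. Such a mapping would therefore imply, for example, that *Queen Anne < Fremont < Wallingford*, or even that *Wallingford - Queen Anne = Fremont*, which (niche demographic jokes aside) does not make much sense.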

In this case, one proven technique is to use *one-hot encoding*, <br/>
which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively.<br/>
It is also useful for user-processed text, as well as for predefined dictionaries (e.g., of city names).
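
Before turning to a library helper, here is a minimal by-hand sketch of one-hot encoding for the neighborhood values. The names `categories` and `one_hot` are illustrative, and sorting the categories alphabetically is just a convenient convention for this example.

```python
neighborhoods = ['Queen Anne', 'Fremont', 'Wallingford', 'Fremont']

# One column per distinct category, in alphabetical order.
categories = sorted(set(neighborhoods))

# Each row has a 1 in the column of its own category and 0 elsewhere.
one_hot = [[int(value == category) for category in categories]
           for value in neighborhoods]

print(categories)
for row in one_hot:
    print(row)
```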

When your data comes as a list of dictionaries, Scikit-Learn's ``DictVectorizer`` will do this for you:

In [17]:
from sklearn.feature_extraction import DictVectorizer
vec_dict = DictVectorizer(sparse=False, dtype=int)
train_vector = vec_dict.fit_transform(data)
data
pd.DataFrame(train_vector, columns=vec_dict.get_feature_names_out())

[{'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
 {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
 {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
 {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}]

| | neighborhood=Fremont | neighborhood=Queen Anne | neighborhood=Wallingford | price | rooms |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 850000 | 4 |
| 1 | 1 | 0 | 0 | 700000 | 3 |
| 2 | 0 | 0 | 1 | 650000 | 3 |
| 3 | 1 | 0 | 0 | 600000 | 2 |


**Notice that the 'neighborhood' column has been expanded into three separate columns**, <br/>
representing the three neighborhood labels, <br/>
and that each row has a 1 in the column associated with its neighborhood.<br/>
With these categorical features thus encoded, you can proceed as normal with fitting a Scikit-Learn model.
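
One detail worth knowing when encoding new data: a category that was not present during fitting has no column of its own, so `transform` leaves all the neighborhood columns at 0 for that row. The sketch below illustrates this; the record for 'Ballard' is a hypothetical example, not part of the original data.

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

dv = DictVectorizer(sparse=False, dtype=int).fit(data)

# Hypothetical new record with a neighborhood the vectorizer has never seen.
new_home = [{'price': 720000, 'rooms': 3, 'neighborhood': 'Ballard'}]
new_vector = dv.transform(new_home)

print(new_vector)  # [[     0      0      0 720000      3]]
```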

To see the meaning of each column, you can inspect the feature names:

In [None]:
# Use vec_dict (the DictVectorizer), not vec, which still refers to the TfidfVectorizer
vec_dict.get_feature_names_out()

#### DictVectorizer - the sparse=True option
There is one clear disadvantage of this approach: <br/>
if your category has many possible values, this can *greatly* increase the size of your dataset.
However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:

In [19]:
vec = DictVectorizer(sparse=True, dtype=int)
sparse_vector = vec.fit_transform(data)
type(sparse_vector)
sparse_vector
sparse_vector.toarray()

scipy.sparse.csr.csr_matrix

<4x5 sparse matrix of type '<class 'numpy.int32'>'
	with 12 stored elements in Compressed Sparse Row format>

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]])

Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. ``sklearn.preprocessing.OneHotEncoder`` and ``sklearn.feature_extraction.FeatureHasher`` are two additional tools that Scikit-Learn includes to support this type of encoding.
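
As a brief sketch of the first of these alternatives, `OneHotEncoder` works on a column of categorical values rather than on a list of dictionaries. The variable names below are illustrative, and the neighborhood column is repeated so the block runs on its own.

```python
from sklearn.preprocessing import OneHotEncoder

# A single categorical column, one row per house.
neighborhood_column = [['Queen Anne'], ['Fremont'], ['Wallingford'], ['Fremont']]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(neighborhood_column)  # sparse by default

# The learned categories define the column order of the encoded matrix.
learned_categories = encoder.categories_[0].tolist()
dense_rows = encoded.toarray().tolist()

print(learned_categories)
print(dense_rows)
```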

For additional information click the link: [sklearn's DictVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)