In [146]:
import re
from collections import defaultdict
from time import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.datasets import fetch_20newsgroups


### Data
We load data from the 20 newsgroups text dataset, which comprises around 18000 newsgroup posts on 20 topics split into two subsets: one for training and one for testing. To keep things simple and reduce the computational cost, we select a subset of 7 topics and use the training set only.

In [2]:


categories = [
    "alt.atheism",
    "comp.graphics",
    "comp.sys.ibm.pc.hardware",
    "misc.forsale",
    "rec.autos",
    "sci.space",
    "talk.religion.misc",
]

print("Loading 20 newsgroups training data")
raw_data, _ = fetch_20newsgroups(subset="train", categories=categories, return_X_y=True)

Loading 20 newsgroups training data


In [5]:
print('Number of documents : ',len(raw_data))

Number of documents :  3803


### Token
A token may be a word, part of a word, or anything found between spaces or symbols in a string. Here we define a function that extracts the tokens using a simple regular expression (regex) that matches Unicode word characters. This includes most characters that can be part of a word in any language, as well as numbers and the underscore:

In [9]:
def custom_tokenizer(doc):
    tokens = [tok.lower() for tok in re.findall(r"\w+", doc)]
    return tokens

In [12]:
# usage
custom_tokenizer('hello world')

['hello', 'world']

### Token frequency
The function below counts the occurrences of each token in a given document. It returns a frequency dictionary to be used by the vectorizers.

In [31]:
def token_frequency(text):
    tokens = custom_tokenizer(text)
    freq_dict = defaultdict(int)
    for tok in tokens:
        freq_dict[tok] += 1
    return freq_dict

In [32]:
# usage
token_freq = token_frequency('SKlearn is awesome. this is a helloworld notebook')
token_freq

defaultdict(int,
            {'sklearn': 1,
             'is': 2,
             'awesome': 1,
             'this': 1,
             'a': 1,
             'helloworld': 1,
             'notebook': 1})

### Bag of Words representation
**Breaking a text document into word tokens, potentially losing the order information between the words in a sentence, is often called a Bag of Words representation.**
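
As a small illustration of the lost order information, the sketch below builds a bag of words for two sentences that contain the same words in a different order. The helper name `bag_of_words` and the two sentences are made up for this example; the tokenization mirrors the `\w+` regex used earlier.

```python
import re
from collections import Counter


def bag_of_words(doc):
    # Lowercase runs of word characters, then count them.
    return Counter(tok.lower() for tok in re.findall(r"\w+", doc))


bag_a = bag_of_words("the dog bit the man")
bag_b = bag_of_words("the man bit the dog")

print(bag_a)
# Word order is discarded, so both sentences produce the same bag.
print(bag_a == bag_b)
```

The two sentences mean different things, yet their bags are identical, which is exactly the information a Bag of Words model gives up.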

In [207]:
result_dict = defaultdict(list)

With feature hashing (covered in the FeatureHasher section), distinct values can be mapped to the same hash code when the feature space is not large enough (hash collisions). As a result, it is impossible to determine what object generated any particular hash code.

To estimate the number of unique terms in the original dictionary, we therefore count the number of active columns in the encoded feature matrix.

In [162]:
# For hashed features, use this helper to estimate the number of unique terms.
def non_zero_columns(x):
    # np.nonzero(x) returns the indices of the non-zero elements as a
    # tuple of arrays, one for each dimension of x.
    row, col = np.nonzero(x)

    # find distinct cols
    distinct = np.unique(col)
    return distinct
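
To make the idea of "active columns" concrete, here is a minimal sketch on a made-up 3 x 5 count matrix. The array `toy` is purely illustrative and not part of the dataset.

```python
import numpy as np

# Hypothetical 3-document, 5-column count matrix; columns 1 and 3 are never used.
toy = np.array([
    [2, 0, 1, 0, 0],
    [0, 0, 3, 0, 1],
    [1, 0, 0, 0, 0],
])

rows, cols = np.nonzero(toy)   # one index array per dimension
active = np.unique(cols)       # distinct columns with at least one non-zero entry

print(active)
print(len(active))
```

Only columns 0, 2 and 4 hold a non-zero value, so the helper would report three active columns for this matrix.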

In [200]:

def test_vectorizer(vectorizer, title, is_hash=False, is_raw=False, is_doc=False):
    t0 = time()
    if is_hash and is_raw:
        # With input_type="string", the FeatureHasher vectorizes the lists of
        # string tokens produced by the tokenizer directly.
        transformed = vectorizer.fit_transform([custom_tokenizer(doc) for doc in raw_data])
    elif is_doc:
        # Text vectorizers take the raw documents and convert them to a matrix.
        # CountVectorizer is optimized by reusing a compiled regular expression
        # for the full training set instead of creating one per document, as
        # done in our naive tokenize function.
        transformed = vectorizer.fit_transform(raw_data)
    else:
        # Default case: one token-frequency dict per document.
        transformed = vectorizer.fit_transform([token_frequency(doc) for doc in raw_data])

    duration = time() - t0

    result_dict['vectorizer'].append(f"{vectorizer.__class__.__name__} -- {title}")
    result_dict['duration'].append(duration)

    # Hashers keep no vocabulary, so estimate the term count from the active columns.
    terms = len(non_zero_columns(transformed)) if is_hash else len(vectorizer.get_feature_names_out())
    print('Number of unique terms :', terms)
    print('Time cost :', duration)
    return vectorizer

### Dict vectorizer

* Transforms lists of feature-value mappings to vectors.
* When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding.
* If a feature value is a sequence or set of strings, this transformer will iterate over the values and will count the occurrences of each string value.
* Features that do not occur in a sample (mapping) will have a zero value in the resulting array/matrix.

In [208]:
dict_vec = DictVectorizer()

dict_vec = test_vectorizer(dict_vec, title='on freq dicts')


Number of unique terms : 47928
Time cost : 1.8799326419830322


The actual mapping from text token to column index is explicitly stored in the `.vocabulary_` attribute, which is a potentially very large Python dictionary:

In [77]:
print('Size of vocab :',len(dict_vec.vocabulary_)) # This is a dict

Size of vocab : 47928


In [58]:
print('Mapping for word "the" is ',dict_vec.vocabulary_['the'])
print('Mapping for word "world" is ',dict_vec.vocabulary_['world'])
print('Mapping for word "book" is ',dict_vec.vocabulary_['book'])

Mapping for word "the" is  42976
Mapping for word "world" is  47125
Mapping for word "book" is  10872


In [67]:
text = 'This book is amazing'
transformation = dict_vec.transform(token_frequency(text))
print(transformation.todense())
print('Shape : ',(transformation.todense()).shape)


[[0. 0. 0. ... 0. 0. 0.]]
Shape :  (1, 47928)


### FeatureHasher

* Dictionaries take up a large amount of storage space and grow in size as the training set grows. 
* Instead of growing the vectors along with a dictionary, feature hashing builds a vector of pre-defined length by applying a hash function h to the features (e.g., tokens), then using the hash values directly as feature indices and updating the resulting vector at those indices. 
* When the feature space is not large enough, hashing functions tend to map distinct values to the same hash code (hash collisions). As a result, it is impossible to determine what object generated any particular hash code.

* Because of the above it is impossible to recover the original tokens from the feature matrix and the best approach to estimate the number of unique terms in the original dictionary is to count the number of active columns in the encoded feature matrix. 
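
The hashing trick described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `toy_hash_vector` is made up, and MD5 stands in for the hash function (scikit-learn's `FeatureHasher` uses a signed 32-bit MurmurHash3 instead).

```python
import hashlib


def toy_hash_vector(tokens, n_features):
    # Illustrative only: MD5 is used as a stand-in hash function here.
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # Use the hash value directly as a column index.
        index = int.from_bytes(digest[:4], "little") % n_features
        vec[index] += 1
    return vec


tokens = ["sklearn", "is", "awesome", "this", "is", "a", "helloworld", "notebook"]

small = toy_hash_vector(tokens, n_features=4)     # 7 distinct tokens, 4 columns
large = toy_hash_vector(tokens, n_features=1024)  # far more columns than tokens

print(small)
print("active columns (small):", sum(1 for v in small if v != 0))
print("active columns (large):", sum(1 for v in large if v != 0))
```

With only 4 columns, the 7 distinct tokens must share columns, so the count of active columns underestimates the vocabulary size. No dictionary is stored at any point, which is why the original tokens cannot be recovered from the vector.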

#### The number of unique tokens obtained with the FeatureHasher is lower than the number obtained with the DictVectorizer. This is due to hash collisions.

In [114]:
feat_hash = FeatureHasher()
feat_hash = test_vectorizer(feat_hash, title='on freq dicts', is_hash=True)

Number of unique terms : 46896
Time cost : 1.105360984802246


The default number of features for the FeatureHasher is `2**20`. 
* Here we set n_features = 2**18 to illustrate hash collisions.

In [203]:
feat_hash = FeatureHasher(n_features=2**18)
feat_hash = test_vectorizer(feat_hash, title='on freq dicts', is_hash=True)

Number of unique terms : 43873
Time cost : 1.0374019145965576


#### The number of collisions can be reduced by increasing the feature space. Notice that the speed of the vectorizer does not change significantly when setting a large number of features, though a larger feature space leads to larger coefficient dimensions in downstream models, which require more memory to store even if most of the coefficients are inactive.


* We can confirm that the number of unique tokens gets closer to the number of unique terms found by the DictVectorizer when we increase the feature space.
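
The effect of the feature-space size can also be checked on synthetic data. The sketch below hashes a made-up vocabulary of 5000 distinct tokens into feature spaces of different sizes and counts the active columns; the token names are hypothetical, and `alternate_sign=False` is set here (it is not the default) so that colliding tokens add up rather than possibly cancelling each other.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical vocabulary of 5000 distinct tokens, all placed in a single sample.
vocab = {f"tok{i}": 1 for i in range(5000)}

active_counts = {}
for exp in (10, 14, 18, 22):
    hasher = FeatureHasher(n_features=2**exp, alternate_sign=False)
    X = hasher.transform([vocab])  # FeatureHasher is stateless, no fit needed
    # Each stored entry corresponds to one active column of this single row.
    active_counts[exp] = X.getnnz()

for exp, n_active in active_counts.items():
    print(f"n_features=2**{exp}: {n_active} active columns for {len(vocab)} tokens")
```

With `2**10` columns there cannot be more than 1024 active columns, so most tokens collide; as the feature space grows, the count approaches the true vocabulary size.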

In [209]:
feat_hash = FeatureHasher(n_features=2**22)
feat_hash = test_vectorizer(feat_hash, title='on freq dicts', is_hash=True)

Number of unique terms : 47668
Time cost : 1.25927734375


### FeatureHasher on raw tokens
* One can set `input_type="string"` in the FeatureHasher to vectorize the strings output directly from the customized tokenize function.
* This is equivalent to passing a dictionary with an implied frequency of 1 for each feature name.
* FeatureHasher with `input_type="string"` is slightly faster than the variant that works on frequency dicts because it does not count repeated tokens: each token is implicitly counted once, even if it was repeated.
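
The equivalence between string input and a dictionary with an implied value of 1 can be checked on a tiny example. The token list below is made up and deliberately contains no repeated tokens; `n_features=2**10` is an arbitrary choice for the sketch.

```python
from sklearn.feature_extraction import FeatureHasher

tokens = ["this", "book", "is", "amazing"]  # no repeated tokens in this toy example

string_hasher = FeatureHasher(n_features=2**10, input_type="string")
dict_hasher = FeatureHasher(n_features=2**10, input_type="dict")

X_string = string_hasher.transform([tokens])
X_dict = dict_hasher.transform([{tok: 1 for tok in tokens}])

# Count the entries where the two sparse matrices differ.
n_diff = (X_string != X_dict).nnz
print(n_diff)
```

Both hashers place each token in the same column, so the two matrices should not differ in any entry.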

In [210]:
feat_hash = FeatureHasher(n_features=2**22, input_type='string')
feat_hash = test_vectorizer(feat_hash, title='on raw tokens', is_hash=True, is_raw=True)

Number of unique terms : 47668
Time cost : 1.0963690280914307


In [211]:
result_dict['vectorizer']

['DictVectorizer -- on freq dicts',
 'FeatureHasher -- on freq dicts',
 'FeatureHasher -- on raw tokens']

### Plot the results
In both cases FeatureHasher is faster than DictVectorizer (roughly 1.5 to 1.8 times as fast in the timings above). This is handy when dealing with large amounts of data, with the downside of losing the invertibility of the transformation, which in turn makes the interpretation of a model a more complex task.

The FeatureHasher with `input_type="string"` is slightly faster than the variant that works on frequency dicts because it does not count repeated tokens: each token is implicitly counted once, even if it was repeated. Depending on the downstream machine learning task, this may or may not be a limitation.

In [168]:
plt.figure(figsize=(15,4))
plt.barh(result_dict['vectorizer'], result_dict['duration'] );
plt.xlabel('Time cost')

<img src='./plots/Feat_hash_vs_Dict_vectorizer.png'>

### Special purpose text vectorizers
### CountVectorizer
* `CountVectorizer` accepts raw data, as it internally implements tokenization and occurrence counting.
* `CountVectorizer` is also more flexible: it accepts various regex patterns through the `token_pattern` parameter.

In [212]:
count_vec = CountVectorizer()
count_vec = test_vectorizer(count_vec, title='on raw data', is_doc=True)

Number of unique terms : 47885
Time cost : 1.2202889919281006


#### We see that the CountVectorizer implementation is about 1.5 times as fast as the DictVectorizer combined with the simple function we defined for mapping the tokens (1.22 s versus 1.88 s in this run). The reason is that CountVectorizer is optimized by reusing a compiled regular expression for the full training set instead of creating one per document as done in our naive tokenize function.

### HashingVectorizer
Now we run a similar experiment with the HashingVectorizer.
* HashingVectorizer is equivalent to combining 
    * the “hashing trick” implemented by the FeatureHasher class
    * and the text preprocessing and tokenization of the CountVectorizer.

**This strategy has several advantages:**

* it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory.

* it is fast to pickle and un-pickle as it holds no state besides the constructor parameters.

* it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

**There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):** 

* there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

* there can be collisions: distinct tokens can be mapped to the same feature index. However, in practice this is rarely an issue if `n_features` is large enough (e.g. `2 ** 18` for text classification problems).

* no IDF weighting as this would render the transformer stateful.
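
The stateless behaviour listed above can be illustrated with a short sketch. The documents are made up; the point is that `transform` can be called on separate batches without any prior `fit`, and a given token still lands in the same column.

```python
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=2**18)

# No fit step is needed: the column of a token depends only on its hash.
batch_1 = hv.transform(["space shuttle launch", "graphics card for sale"])
batch_2 = hv.transform(["space probe"])

print(batch_1.shape, batch_2.shape)

# The token "space" occurs in both batches and maps to the same column index.
shared_cols = set(batch_1[0].indices) & set(batch_2[0].indices)
print(len(shared_cols) >= 1)
```

Because nothing is learned from the data, batches can be vectorized independently, for example in a streaming or parallel setting.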

#### We can see that this is the fastest text tokenization strategy so far, assuming that the downstream machine learning task can tolerate a few collisions.

In [213]:

hash_vec = HashingVectorizer(n_features=2**18)

hash_vec = test_vectorizer(hash_vec, is_doc=True, is_hash=True, title='on raw data')

Number of unique terms : 43837
Time cost : 0.8495104312896729


### TfidfVectorizer

* Convert a collection of raw documents to a matrix of TF-IDF features.

* Equivalent to CountVectorizer followed by TfidfTransformer.

In a large text corpus, some words appear with higher frequency (e.g. “the”, “a”, “is” in English) and do not carry meaningful information about the actual contents of a document.

If we were to feed the word count data directly to a classifier, those very common terms would shadow the frequencies of rarer yet more informative terms.

To re-weight the count features into floating point values suitable for use by a classifier, it is very common to use the tf–idf transform as implemented by the `TfidfTransformer`.

“tf” stands for term frequency, while “tf–idf” means term frequency times inverse document frequency.

The `TfidfVectorizer` is equivalent to combining the tokenization and occurrence counting of the `CountVectorizer` with the normalizing and weighting of a `TfidfTransformer`.

In [214]:
tf_idf = TfidfVectorizer()
tf_idf = test_vectorizer(tf_idf, is_doc=True, title='on raw data')

Number of unique terms : 47885
Time cost : 1.2552766799926758


In [216]:
plt.figure(figsize=(15,4))
plt.barh(result_dict['vectorizer'], result_dict['duration'] );
plt.xlabel('Time cost')

<img src='./plots/compare_vectorizers_in_sklearn.png'>