## Working with Custom Vectorizers
---
This notebook contains a series of code snippets used to create and demonstrate custom vectorizers.

<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Standard-vectorizer" data-toc-modified-id="Standard-vectorizer-0.0.1">Standard vectorizer</a></span></li><li><span><a href="#Hand-made-vectorizer" data-toc-modified-id="Hand-made-vectorizer-0.0.2">Hand-made vectorizer</a></span></li><li><span><a href="#Customized-tokenizer-and-preprocessor" data-toc-modified-id="Customized-tokenizer-and-preprocessor-0.0.3">Customized tokenizer and preprocessor</a></span></li><li><span><a href="#Custom-analyzer" data-toc-modified-id="Custom-analyzer-0.0.4">Custom analyzer</a></span></li><li><span><a href="#Modified-vectorizer-class" data-toc-modified-id="Modified-vectorizer-class-0.0.5">Modified vectorizer class</a></span></li></ul></li></ul></li></ul></div>

### Standard vectorizer
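A run-of-the-mill CountVectorizer with default settings, except for lowercase=False, which keeps the original casing, so "The" and "the" are counted as separate terms.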

In [29]:
# import pandas and sklearn's CountVectorizer class
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# create a dataframe from a word matrix
def wm2df(wm, feat_names):
    # label each row of the matrix as Doc0, Doc1, ...
    doc_names = ['Doc{:d}'.format(idx) for idx in range(wm.shape[0])]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return df
# set of documents
corpora = ['The quick brown fox.','Jumps over the lazy dog!']
# instantiate the vectorizer object
cvec = CountVectorizer(lowercase=False)
# convert the documents into a document-term matrix
wm = cvec.fit_transform(corpora)
# retrieve the terms found in the corpora
# (get_feature_names() was removed in scikit-learn 1.2)
tokens = cvec.get_feature_names_out()
# create a dataframe from the matrix
wm2df(wm, tokens)

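|      | Jumps | The | brown | dog | fox | lazy | over | quick | the |
|------|-------|-----|-------|-----|-----|------|------|-------|-----|
| Doc0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| Doc1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |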


### Hand-made vectorizer

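These functions, when used together, work like a simplified version of sklearn's CountVectorizer. Their purpose is to illustrate the different steps necessary to make a vectorizer work: tokenizing each document, counting its tokens, building a shared vocabulary, and assembling the document-term matrix.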

In [30]:
# necessary imports
import re
import numpy as np
from collections import defaultdict
from scipy.sparse import csr_matrix

def tokenize(corpus):
    # create a pattern to extract words
    pattern = re.compile(r'\b\w\w+\b')
    return(re.findall(pattern, corpus))

def set_weights(tokens):
    # create a dictionary to hold the tokens and their weights
    token_counts = defaultdict(int)
    # iterate over the tokens increasing their weights by 1
    for token in tokens:
        token_counts[token] += 1
    return(token_counts)

def simple_vectorizer(corpora):
    # lists to hold the feature names, the per-document counts
    # and the rows of the matrix
    feat_names = []
    doc_counts = []
    matrix_seed = []

    # iterate over the corpora, one document at a time
    for corpus in corpora:
        # tokenize the doc
        tokens = tokenize(corpus)
        # assign the weights
        doc_count = set_weights(tokens)
        # store the doc's counts and collect its feature names
        doc_counts.append(doc_count)
        feat_names.extend(doc_count.keys())

    # create a list of unique feat names (a set has no guaranteed
    # order, so the column order may differ between runs)
    unique_feat_names = list(set(feat_names))

    # assemble the rows, filling missing tokens with zeros
    for doc_count in doc_counts:
        matrix_row = [doc_count.get(feat_name, 0)
                      for feat_name in unique_feat_names]
        matrix_seed.append(matrix_row)

    # create a sparse matrix
    matrix = csr_matrix(matrix_seed)
    return matrix, unique_feat_names

wm, tokens = simple_vectorizer(corpora)
wm2df(wm, tokens)

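|      | Jumps | The | dog | quick | brown | fox | the | over | lazy |
|------|-------|-----|-----|-------|-------|-----|-----|------|------|
| Doc0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| Doc1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |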
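
The column order in the output above comes from a set, so it is arbitrary. As a minimal sketch (not part of the original notebook), the same steps can be written with only the standard library, sorting the vocabulary so the columns are reproducible. The function name `sorted_vocab_vectorizer` is made up for this illustration, and the rows are plain lists rather than a sparse matrix.

```python
import re
from collections import Counter

# same token pattern as tokenize(): words of two or more characters
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')

def sorted_vocab_vectorizer(docs):
    # hypothetical helper: count the tokens of each document
    doc_counts = [Counter(TOKEN_PATTERN.findall(doc)) for doc in docs]
    # build the vocabulary and sort it for a stable column order
    vocab = sorted(set().union(*doc_counts))
    # a Counter returns 0 for missing terms, so no explicit fill is needed
    rows = [[counts[term] for term in vocab] for counts in doc_counts]
    return rows, vocab

rows, vocab = sorted_vocab_vectorizer(['The quick brown fox.',
                                       'Jumps over the lazy dog!'])
print(vocab)
for row in rows:
    print(row)
```

With the vocabulary sorted, uppercase terms come first, which is the same column order the standard vectorizer produced for these two documents.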


### Customized tokenizer and preprocessor

A vectorizer customized by passing user-defined callables as the tokenizer and the preprocessor.

In [27]:
import spacy
from html import unescape

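# load spaCy's English model and create the pipeline used to
# tokenize and lemmatize (spaCy 2.x API; spaCy 3 dropped the 'en'
# shortcut in favor of package names such as 'en_core_web_sm')
spacy.load('en')
lemmatizer = spacy.lang.en.English()

# convert HTML entities to their characters and
# set everything to lower case
def my_preprocessor(doc):
    return(unescape(doc).lower())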

# tokenize the doc and lemmatize its tokens
def my_tokenizer(doc):
    tokens = lemmatizer(doc)
    return([token.lemma_ for token in tokens])

corpora = [
    'The quick brown fox&#x0002E;',
    'jumped over the lazy dog&#x00021;'
]

custom_vec = CountVectorizer(preprocessor=my_preprocessor, tokenizer=my_tokenizer)
cwm = custom_vec.fit_transform(corpora)
tokens = custom_vec.get_feature_names_out()
wm2df(cwm, tokens)

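|      | ! | . | brown | dog | fox | jump | lazy | over | quick | the |
|------|---|---|-------|-----|-----|------|------|------|-------|-----|
| Doc0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| Doc1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |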
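
The `.` and `!` columns in the output above exist because the preprocessor decodes the HTML entities before tokenization. The sketch below isolates that step using only the standard library. It repeats the notebook's `my_preprocessor` so it can run on its own.

```python
from html import unescape

def my_preprocessor(doc):
    # decode HTML entities into their characters, then lowercase
    return unescape(doc).lower()

docs = [
    'The quick brown fox&#x0002E;',
    'jumped over the lazy dog&#x00021;'
]
cleaned = [my_preprocessor(doc) for doc in docs]
print(cleaned)
# ['the quick brown fox.', 'jumped over the lazy dog!']
```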


In [18]:
# instantiate a vectorizer with custom preprocessor and tokenizer,
# set to remove stop words and extract bigrams
custom_vec = CountVectorizer(preprocessor=my_preprocessor,
                             tokenizer=my_tokenizer,
                             ngram_range=(1,2),
                             stop_words='english')
cwm = custom_vec.fit_transform(corpora)
tokens = custom_vec.get_feature_names_out()
wm2df(cwm, tokens)

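|      | ! | . | brown | brown fox | dog | dog ! | fox | fox . | jump | jump lazy | lazy | lazy dog | quick | quick brown |
|------|---|---|-------|-----------|-----|-------|-----|-------|------|-----------|------|----------|-------|-------------|
| Doc0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| Doc1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |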
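
The bigram `jump lazy` in the output above may look odd, since those words are not adjacent in the document. It appears because stop words are removed before the n-grams are built. The following is a simplified sketch of that order of operations, not sklearn's actual implementation. The function `word_ngrams` and the two-word stop list are assumptions made for the example.

```python
def word_ngrams(tokens, ngram_range=(1, 2), stop_words=frozenset()):
    # drop stop words first, so the remaining tokens become neighbors
    tokens = [tok for tok in tokens if tok not in stop_words]
    min_n, max_n = ngram_range
    ngrams = []
    # slide a window of each size n over the filtered tokens
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

# lemmatized tokens of the second document; the stop list is a tiny
# illustrative subset, not sklearn's full English list
doc_tokens = ['jump', 'over', 'the', 'lazy', 'dog', '!']
ngrams = word_ngrams(doc_tokens, stop_words={'over', 'the'})
print(ngrams)
# ['jump', 'lazy', 'dog', '!', 'jump lazy', 'lazy dog', 'dog !']
```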


### Custom analyzer
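Customizing a vectorizer with a user-defined callable class as the analyzer. When the analyzer is a callable, CountVectorizer ignores ngram_range and stop_words, which is why the output below contains only unigrams and still includes stop words such as "the" and "over".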

In [28]:
# create a custom analyzer class
class MyAnalyzer(object):
    
    # load spaCy's english model and define the tokenizer/lemmatizer
    def __init__(self):
        spacy.load('en')
        self.lemmatizer_ = spacy.lang.en.English()
        
    # allow the class instance to be called just like a function;
    # it applies the preprocessing and tokenizes the document
    def __call__(self, doc):
        doc_clean = unescape(doc).lower()
        tokens = self.lemmatizer_(doc_clean)
        return([token.lemma_ for token in tokens])
    
analyzer = MyAnalyzer()
custom_vec = CountVectorizer(analyzer=analyzer,
                             ngram_range=(1,2),
                             stop_words='english')
cwm = custom_vec.fit_transform(corpora)
tokens = custom_vec.get_feature_names_out()
wm2df(cwm, tokens)

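|      | ! | . | brown | dog | fox | jump | lazy | over | quick | the |
|------|---|---|-------|-----|-----|------|------|------|-------|-----|
| Doc0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| Doc1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |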
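
An analyzer only has to be callable and return a list of tokens for a document. The sketch below shows that contract with a spaCy-free class. `RegexAnalyzer` and its pattern are hypothetical. It splits words and punctuation with a regular expression and does no lemmatization, so "jumped" would stay "jumped".

```python
import re
from html import unescape

class RegexAnalyzer:
    # hypothetical stand-in for MyAnalyzer without the spaCy dependency

    def __init__(self, pattern=r'\w+|[^\w\s]'):
        # match either a run of word characters or one punctuation mark
        self.pattern_ = re.compile(pattern)

    def __call__(self, doc):
        # preprocess (decode entities, lowercase), then tokenize
        doc_clean = unescape(doc).lower()
        return self.pattern_.findall(doc_clean)

analyzer = RegexAnalyzer()
tokens_doc0 = analyzer('The quick brown fox&#x0002E;')
print(tokens_doc0)
# ['the', 'quick', 'brown', 'fox', '.']
```

Because the instance is callable, it can be passed wherever a plain function would be accepted.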


### Modified vectorizer class
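Create a modified vectorizer by defining a new class that inherits from the CountVectorizer class and overrides its build_analyzer method. Unlike a callable analyzer, this approach keeps ngram_range and stop_words working.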

In [20]:
# define a custom vectorizer class
class CustomVectorizer(CountVectorizer):

    # override the build_analyzer method, allowing one to
    # create a custom analyzer for the vectorizer
    def build_analyzer(self):

        # load stop words using CountVectorizer's built-in method
        stop_words = self.get_stop_words()

        # load spaCy's model for the English language and instantiate
        # the tokenizer/lemmatizer once, rather than once per document
        spacy.load('en')
        lemmatizer = spacy.lang.en.English()

        # create the analyzer that will be returned by this method
        def analyser(doc):

            # apply the preprocessing and tokenization steps
            doc_clean = unescape(doc).lower()
            tokens = lemmatizer(doc_clean)
            lemmatized_tokens = [token.lemma_ for token in tokens]

            # use CountVectorizer's built-in _word_ngrams method
            # to remove stop words and extract n-grams
            return self._word_ngrams(lemmatized_tokens, stop_words)
        return analyser
    

custom_vec = CustomVectorizer(ngram_range=(1,2),
                              stop_words='english')
cwm = custom_vec.fit_transform(corpora)
wm2df(cwm, custom_vec.get_feature_names_out())

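|      | ! | . | brown | brown fox | dog | dog ! | fox | fox . | jump | jump lazy | lazy | lazy dog | quick | quick brown |
|------|---|---|-------|-----------|-----|-------|-----|-------|------|-----------|------|----------|-------|-------------|
| Doc0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| Doc1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |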
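
Overriding a single method is enough because fitting obtains its analyzer by calling build_analyzer on the instance. The toy classes below sketch that hook pattern with only the standard library. `ToyVectorizer` and `LowercaseVectorizer` are invented for this illustration and are far simpler than CountVectorizer.

```python
from collections import Counter

class ToyVectorizer:
    # hypothetical, heavily reduced stand-in for CountVectorizer

    def build_analyzer(self):
        # default analyzer: split on whitespace
        return str.split

    def fit_transform(self, docs):
        # the base class always asks the instance for its analyzer,
        # so a subclass can swap it without touching this method
        analyze = self.build_analyzer()
        counts = [Counter(analyze(doc)) for doc in docs]
        self.vocabulary_ = sorted(set().union(*counts))
        return [[c[term] for term in self.vocabulary_] for c in counts]

class LowercaseVectorizer(ToyVectorizer):

    def build_analyzer(self):
        # custom analyzer: lowercase before splitting
        return lambda doc: doc.lower().split()

vec = LowercaseVectorizer()
rows = vec.fit_transform(['The fox', 'the dog'])
print(vec.vocabulary_)  # ['dog', 'fox', 'the']
print(rows)             # [[0, 1, 1], [1, 0, 1]]
```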
