## Feature Extraction 
`sklearn.feature_extraction` is the scikit-learn module for extracting features from text as well as images.
1. Text Feature extraction
    - BOW based representation, dimensionality problem
      - CountVectorizer
      - TfidfTransformer
      - TfidfVectorizer
      - Customizing Vectorizers
    - Limitations of BOW and the hashing trick
      - FeatureHasher, HashingVectorizer
    - Loading features from dictionaries (DictVectorizer)

### 1.1 BOW / bag of n-grams based representation
- The text corpus is represented by a dictionary (vocabulary) of words or n-grams
- Each document in the corpus is then represented as a numerical vector over that vocabulary
- The vector can consist of counts, normalized counts, tf-idf weights etc.
- **Limitation of the approach is that the BOW representation discards word order, so it does not capture context**
- General steps for vectorizing a document
  - Vocabulary creation
    - tokenize, i.e. split sentences into 1-grams, 2-grams etc.
    - stemming (working -> work), lemmatization (better -> good); whether to apply them depends on the
      application
    - remove stop words, since most applications of vector representation will not benefit from these frequently used words
    - Finally assign an index to each token in the BOW
  - Document term matrix
    - Columns are indices of words in the BOW.
    - Rows are documents / sentences
    - Data is count / normalized count / tf-idf etc.
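
The steps above can be sketched by hand in a few lines. This is only an illustration using the standard library, not how sklearn implements it; the two-document corpus and the `tokenize` helper are assumptions made for the example, and stemming and stop word removal are left out.

```python
import re
from collections import Counter

docs = [
    "This is the first document.",
    "This is the second second document.",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters as 1-gram tokens.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary creation: assign an index to each distinct token, in sorted order.
all_tokens = sorted({tok for d in docs for tok in tokenize(d)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

# Document term matrix: one row per document, one column per vocabulary index.
dtm = []
for d in docs:
    counts = Counter(tokenize(d))
    dtm.append([counts.get(tok, 0) for tok in vocabulary])

print(vocabulary)
print(dtm)
```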

#### 1.1.1 CountVectorizer  
Vectorizer classes convert documents to a document term matrix (DTM), i.e. they vectorize documents into numerical features.
- The sklearn class implements
  - tokenization (n-grams) : the default setting uses punctuation and white space to identify token boundaries
  - stop word removal : optional; `stop_words='english'` uses a built-in English stop word list, which is not the best choice for every use case
  - vocab creation : assigning an index to each word in the bag
    - **Be aware of traps : "we've" is split into 'we' and 've'; a stop word list may remove 'we' but not 've'**
  - Vocabs can be of huge dimension, so the DTM is created as a sparse scipy matrix
  - The DTM holds raw counts; CountVectorizer does not normalize rows (row normalization is an option of TfidfTransformer / TfidfVectorizer)
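
The stop word trap can be reproduced without sklearn. The sketch below applies the regular expression documented as CountVectorizer's default `token_pattern`; the small `stop_words` set is a hypothetical list chosen for illustration.

```python
import re

# Same regular expression as CountVectorizer's documented default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical stop word list: it contains "we've" as written, but not "ve".
stop_words = {"we", "we've"}

text = "We've seen this before"
tokens = TOKEN_PATTERN.findall(text.lower())
kept = [tok for tok in tokens if tok not in stop_words]

print(tokens)  # the apostrophe acts as a boundary, so "we've" becomes two tokens
print(kept)    # 've' survives because it is not in the stop word list
```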

#### CountVectorizer : performs pre-processing on text, use of arguments
- strip_accents : 'ascii' or 'unicode'; removes accents by mapping accented characters to their unaccented equivalents (e.g. å -> a)
- preprocessor : by default, lowercases the text and strips accents as specified by strip_accents; can be overridden with a function
- tokenizer : works with the other arguments to produce tokens
  - analyzer : word / char / char_wb (i.e. whether tokens are built from words or characters)
    - char_wb builds character n-grams only inside word boundaries; compared to char, it can be useful to overcome misspellings
    - **n-grams can be good to preserve some local ordering information if they are to be used as features**
    - **Wider NLP tasks that aim to extract structures from sentences are outside the scope of sklearn**
  - ngram_range : n-grams based on words or characters, as specified by analyzer
  - the default regex pattern treats white space and punctuation as token boundaries, drops the punctuation itself, and keeps only tokens of at least 2 alphanumeric characters
- stop_words :
  - shortcomings of stop word lists should be understood before applying them
  - paper : http://aclweb.org/anthology/W18-2502
    - stop words should ideally be taken through the same pre-processing and tokenization scheme as the documents
    - can be generated by studying intra-corpus term frequency
- token_pattern : used only if analyzer is 'word'; the default pattern uses white space and punctuation to mark boundaries
- ngram_range : tuple with the min and max n of the n-grams to extract
- min_df, max_df : useful to truncate the vocab by setting lower and upper limits on document frequency
- max_features : keep only the top max_features terms, ordered by term frequency across the corpus
- vocabulary : a user-supplied vocab, used instead of learning one from the data
- binary : useful to generate a binary value (present / absent) instead of a count
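
As a rough illustration of the vocabulary-pruning arguments above, the following sketch applies `min_df`, `max_df` and `binary` to a small corpus. The threshold values are arbitrary choices for the example, not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# min_df as an int is an absolute document count; max_df as a float is a proportion.
# Keep terms found in at least 2 documents but in no more than 90% of them.
pruned = CountVectorizer(min_df=2, max_df=0.9)
pruned.fit(corpus)
pruned_vocab = sorted(pruned.vocabulary_)
print(pruned_vocab)

# binary=True records presence / absence, so the repeated "second" counts once.
binary_vect = CountVectorizer(binary=True)
X_bin = binary_vect.fit_transform(corpus)
print(X_bin.toarray().max())
```

Here "the" is dropped by `max_df` because it occurs in every document, while terms seen in only one document are dropped by `min_df`.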

##### accent removal

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
     'These are`nt åa the first document~',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?']
count_vect = CountVectorizer(strip_accents = None)
count_vect.fit_transform(corpus)
count_vect.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() there

['and',
 'are',
 'document',
 'first',
 'is',
 'nt',
 'one',
 'second',
 'the',
 'these',
 'third',
 'this',
 'åa']

In [22]:
corpus = [
     'These are`nt åa the first document~',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?']
count_vect = CountVectorizer(strip_accents = 'unicode')
count_vect.fit_transform(corpus)
count_vect.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() there

['aa',
 'and',
 'are',
 'document',
 'first',
 'is',
 'nt',
 'one',
 'second',
 'the',
 'these',
 'third',
 'this']

##### using 1-gram and 2-gram tokens to capture local context

In [31]:
corpus = ['Is this the first document?']
count_vect = CountVectorizer(strip_accents = 'unicode',analyzer= 'word',ngram_range= (1,2))
X = count_vect.fit_transform(corpus)
count_vect.vocabulary_

{'document': 0,
 'first': 1,
 'first document': 2,
 'is': 3,
 'is this': 4,
 'the': 5,
 'the first': 6,
 'this': 7,
 'this the': 8}
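
Character n-grams with `analyzer='char_wb'` can be explored the same way. The sketch below uses `build_analyzer()` to compare the 3-grams of a word and a misspelt variant; the word pair is an arbitrary example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that turns raw text into a list of tokens.
analyze = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).build_analyzer()

correct = set(analyze("document"))
misspelt = set(analyze("documnet"))
shared = correct & misspelt

# The two spellings still share several character 3-grams,
# whereas as word tokens they would have nothing in common.
print(sorted(shared))
```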

In [35]:
corpus = [
     'These is the first document~',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?']
count_vect = CountVectorizer()
X = count_vect.fit_transform(corpus)
print(count_vect.get_feature_names())  # removed in scikit-learn 1.2; use get_feature_names_out() there
print(count_vect.vocabulary_)
X.toarray()

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'these', 'third', 'this']
{'these': 7, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'this': 9, 'second': 5, 'and': 0, 'third': 8, 'one': 4}


array([[0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
       [0, 1, 0, 1, 0, 2, 1, 0, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 0, 1]], dtype=int64)

#### 1.1.2 TfidfTransformer 
- Using only counts makes terms that occur frequently across all documents look important, although they carry little information
- Rescaling the count/frequency by how common the word is across documents helps
  achieve a balance between term frequency and the document frequency of the term.
- $tfidf(t,d) = tf(t,d) \cdot idf(t)$
- $idf(t) = \log(\frac{n_d}{1+df(t)})$, textbook definition; it becomes negative for a term that occurs in every document.
- sklearn implements smoothed and non-smoothed versions, and has an option to normalize across rows/documents
  - non-smoothed
  $idf(t) = 1 + \log( \frac{n_d}{df(t)})$, this is always >= 1, but divides by zero for a term with $df(t) = 0$ (e.g. a term from a supplied vocabulary that never occurs in the fitted corpus)
  - smoothed
  $idf(t) = 1 + \log( \frac{1 + n_d}{1+df(t)})$, cannot divide by zero; it behaves as if one extra document contained every term once
- sublinear tf replaces $tf$ with $1 + \log(tf)$.
- **for short texts, tf-idf can be noisy, and binary vectorization can be more appropriate**
- **the pipeline feature allows all these settings to be tuned as hyperparameters when building models**
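
To make the three idf definitions concrete, here is a small standard-library sketch. The document frequencies are those of 'the', 'first' and 'these' in the four-document corpus used earlier; the function names are made up for the example.

```python
import math

n_docs = 4                                # documents in the corpus
df = {"the": 4, "first": 2, "these": 1}   # document frequency of each term

def idf_textbook(df_t):
    # log(n_d / (1 + df)): negative when the term occurs in every document
    return math.log(n_docs / (1 + df_t))

def idf_plain(df_t):
    # 1 + log(n_d / df): the non-smoothed variant, undefined for df = 0
    return 1 + math.log(n_docs / df_t)

def idf_smooth(df_t):
    # 1 + log((1 + n_d) / (1 + df)): the smoothed variant
    return 1 + math.log((1 + n_docs) / (1 + df_t))

for term, df_t in df.items():
    print(term, round(idf_textbook(df_t), 3), round(idf_plain(df_t), 3), round(idf_smooth(df_t), 3))
```

A term found in every document ('the') gets idf 1 under both sklearn variants, so it is down-weighted relative to rarer terms but never zeroed out.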

In [43]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm = 'l2', smooth_idf= True, sublinear_tf= False)
Y = tfidf.fit_transform(X)
Y.toarray()

array([[ 0.        ,  0.38782252,  0.47903796,  0.38782252,  0.        ,
         0.        ,  0.31707032,  0.60759891,  0.        ,  0.        ],
       [ 0.        ,  0.26714448,  0.        ,  0.26714448,  0.        ,
         0.8370669 ,  0.21840812,  0.        ,  0.        ,  0.32997658],
       [ 0.55280532,  0.        ,  0.        ,  0.        ,  0.55280532,
         0.        ,  0.28847675,  0.        ,  0.55280532,  0.        ],
       [ 0.        ,  0.41812662,  0.51646957,  0.41812662,  0.        ,
         0.        ,  0.34184591,  0.        ,  0.        ,  0.51646957]])

##### reproduce the first vector

In [54]:
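import numpy as np
from numpy.linalg import norm
# smoothed idf = 1 + log((1 + n_d) / (1 + df)) with n_d = 4; tf is 1 for every term in the first document
v = np.array([0, 1 + np.log(5/4), 1 + np.log(5/3), 1 + np.log(5/4), 0, 0, 1 + np.log(5/5), 1 + np.log(5/2), 0, 0])
v / norm(v, ord=2)  # l2-normalize the row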

array([ 0.        ,  0.38782252,  0.47903796,  0.38782252,  0.        ,
        0.        ,  0.31707032,  0.60759891,  0.        ,  0.        ])

#### 1.1.3 TfidfVectorizer 
- basically combines CountVectorizer and TfidfTransformer in a single estimator
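
A quick way to check this is to compare the two routes on the same corpus with default settings. This is a sanity-check sketch, not a benchmark.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "These is the first document~",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Route 1: counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: both steps in one estimator.
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```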

#### 1.2 Uses and limitations of the BOW representation
Uses
- Classification of text documents.
- Information retrieval applications, which also involve clustering documents
- Topic modeling (using LDA, NMF)

Shortcomings -
- Does not capture context well; some improvement is possible with character or word n-grams
- **When scoring a new dataset, the model can encounter new words that were not in the vocabulary; in applications like
spam classification, that can be a way to evade a classifier**
- **The BOW vocab has to be stored in RAM, and big vocabularies become cumbersome to store**
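
The out-of-vocabulary problem is easy to see on a toy example. The training texts and the obfuscated message below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vect.fit(["cheap pills online", "meeting agenda attached"])

# None of the obfuscated tokens is in the fitted vocabulary,
# so the document maps to an all-zero row.
X_new = vect.transform(["ch3ap p1lls 0nline"])
print(X_new.toarray())
```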

#### 1.1.4 Customizing Vectorizers 
- preprocessor, tokenizer and analyzer can be customized to plug in other processing steps such as stemming or lemmatization
(e.g. from NLTK)
- Examples are covered in the scikit-learn documentation
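
A minimal sketch of a custom tokenizer follows. To avoid an NLTK dependency it uses `toy_stem_tokenizer`, a hypothetical stand-in that only strips a trailing "ing" or "s"; a real stemmer or lemmatizer would take its place.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem_tokenizer(text):
    # Hypothetical stand-in for a real stemmer (for example one from NLTK):
    # split into word tokens, then strip a trailing "ing" or "s".
    tokens = re.findall(r"\b\w\w+\b", text)
    return [re.sub(r"(ing|s)$", "", tok) for tok in tokens]

# token_pattern=None signals that the default pattern is intentionally unused.
vect = CountVectorizer(tokenizer=toy_stem_tokenizer, token_pattern=None)
vect.fit(["working works work", "walking walks"])

stemmed_vocab = sorted(vect.vocabulary_)
print(stemmed_vocab)
```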

#### 1.2.1 Feature Hashing 
- The idea of feature hashing is to not construct a BOW or vocab at all, but rather to define a hash function that takes,
say, the words of a sentence as input and generates a fixed-length vector representation of the sentence/document
- Feature hashing concepts are explained here - https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f  
  - No inverse transform
  - collisions; select a large feature space to make collisions less likely
- sklearn implements a FeatureHasher class and a HashingVectorizer
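
The idea can be sketched with the standard library alone. Note the assumptions: this toy `hash_features` function uses MD5 purely for illustration, whereas sklearn uses a signed 32-bit MurmurHash3, so the column indices here will not match FeatureHasher output.

```python
import hashlib

def hash_features(tokens, n_features=16):
    # Map each token to a column via a hash; no vocabulary is ever stored.
    vec = [0.0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_features
        # A second hash-derived bit picks the sign, so colliding tokens
        # tend to cancel rather than always accumulate.
        sign = 1.0 if digest[4] & 1 == 0 else -1.0
        vec[index] += sign
    return vec

v = hash_features("dog cat cat elephant".split())
print(v)
```

The output length is fixed by `n_features` regardless of how many distinct tokens appear, which is also why the original tokens cannot be recovered from the vector.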

In [55]:
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction.text import HashingVectorizer

In [57]:
h = FeatureHasher(n_features=10)
D = [{'dog': 1, 'cat':2, 'elephant':4},{'dog': 2, 'run': 5}]
f = h.transform(D)
f.toarray()

array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])

##### HashingVectorizer first tokenizes the text and counts the tokens (like the D above), then projects the counts into the n-feature space and, by default, l2-normalizes each row

In [59]:
HashingVectorizer(n_features= 10)

In [60]:
c = ['dog cat cat elephant elephant elephant elephant',
    'dog dog run run run run run']
hv = HashingVectorizer(n_features= 10)
X = hv.fit_transform(c)
X.toarray()

array([[ 0.        ,  0.        , -0.87287156, -0.21821789,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.43643578],
       [ 0.        ,  0.        ,  0.        , -0.37139068, -0.92847669,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

##### Dimensionality and sparse matrix format
Dimensionality does not affect the CPU training time of algorithms that operate on CSR matrices
(LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive),
but it does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc).

#### Out-of-core learning with hashing
- Example https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
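
A much smaller sketch of the same pattern is given below. The `batches` list is hypothetical toy data standing in for chunks streamed from disk, and the choice of `SGDClassifier` and `n_features=2**10` is an assumption for the example.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy mini-batches standing in for chunks read from disk (hypothetical data).
batches = [
    (["cheap pills online", "meeting agenda attached"], [1, 0]),
    (["win cheap prizes", "agenda for the meeting"], [1, 0]),
]

# HashingVectorizer is stateless, so transform() works without a prior fit.
vectorizer = HashingVectorizer(n_features=2**10)
clf = SGDClassifier(random_state=0)

for texts, labels in batches:
    X_batch = vectorizer.transform(texts)
    # partial_fit updates the model one batch at a time;
    # all classes must be declared up front.
    clf.partial_fit(X_batch, labels, classes=[0, 1])

print(clf.predict(vectorizer.transform(["cheap prizes online"])))
```

Because no vocabulary is learned, the full corpus never needs to be in memory at once.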

#### 1.3 DictVectorizer 
- Takes features in the form of dictionaries, which offer a concise way of bundling features instead of a dataframe
- **sometimes features extracted for sequence models are in dictionary format (TBD)**
- This vectorizer converts the categorical (string-valued) features in the dictionary to one-hot encoded form and passes numeric features through unchanged


In [66]:
from sklearn.feature_extraction import DictVectorizer

In [68]:
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]
vec = DictVectorizer()
vec.fit_transform(measurements).toarray()


array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
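
To see which column is which, and what happens to a category not seen during fit, the following sketch inspects the fitted vectorizer. The 'Paris' record is an invented example.

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

vec = DictVectorizer()
vec.fit(measurements)

# String-valued features become one "name=value" column per category;
# numeric features keep a single column.
print(vec.feature_names_)

# A city not seen during fit has no column, so all city columns stay 0.
unseen = vec.transform([{"city": "Paris", "temperature": 20.0}]).toarray()
print(unseen)
```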