# Feature extraction


As you may have experienced in the challenge, constructing meaningful features from the data can yield large performance gains.

<img style="float: left; width: 50px; margin-top: -20px" src="https://cdn1.iconfinder.com/data/icons/hawcons/32/700303-icon-61-warning-128.png" /> Feature extraction should not be confused with feature _selection_. They are two very different steps in the learning pipeline.

---

The process of constructing meaningful features from the data is unfortunately domain-specific (unless you have millions of samples, in which case you might be able to automate the process). Here are some strategies to turn unstructured data items into arrays of numerical features.

  * **Text documents**: The raw data, a sequence of symbols, cannot be fed directly to most algorithms, which expect fixed-size numerical feature vectors rather than raw text documents of variable length. A popular strategy consists in counting the frequency of each word, or of each pair of consecutive words, in each document. We will see this in more detail in this class.


  * **Images**:
    * Rescale the picture to a fixed size and take all the raw pixel values (with or without luminosity normalization).
    * Take some transformation of the signal (gradients at each pixel, wavelet transforms...).
    * Compute the Euclidean or Manhattan distance, or the cosine similarity, of the sample to a set of reference images (a code book). The code book may have been previously extracted from the same dataset using an unsupervised learning algorithm on the raw pixel signal.
    * Perform local feature extraction: split the picture into small regions and perform feature extraction locally in each area. Then combine all the features of the individual areas into a single array.


  * **Sounds** (or more generally waveforms like EEG): The same strategies as for images apply, in a 1D space instead of 2D.
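
Two of the image strategies above (raw pixels and local feature extraction) can be sketched with NumPy alone. This is only an illustrative sketch: the helper names `raw_pixel_features` and `local_block_features`, the block size, and the choice of mean and standard deviation as per-region statistics are assumptions made for the example, and the image is random data standing in for a picture already rescaled to a fixed size.

```python
import numpy as np


def raw_pixel_features(image):
    """Flatten a fixed-size grayscale image into one feature vector."""
    return image.astype(float).ravel()


def local_block_features(image, block=4):
    """Split the image into block x block regions, summarize each region
    by its mean and standard deviation, and concatenate everything."""
    height, width = image.shape
    # crop so that both dimensions are multiples of the block size
    h_crop, w_crop = height - height % block, width - width % block
    blocks = image[:h_crop, :w_crop].astype(float).reshape(
        h_crop // block, block, w_crop // block, block)
    means = blocks.mean(axis=(1, 3))
    stds = blocks.std(axis=(1, 3))
    return np.concatenate([means.ravel(), stds.ravel()])


# hypothetical 8x8 grayscale image (random values, for illustration only)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8))

raw = raw_pixel_features(image)
local = local_block_features(image, block=4)
print(raw.shape, local.shape)
```

Either vector has a fixed length for a given image size, which is what most learning algorithms require.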


In this class we will examine techniques for feature extraction from text data. This is an important use case of feature extraction. Even if your challenge does not involve any text data, you will likely need to extract information from text at some point in your career, so this is an important skill to acquire.


In [3]:
# some imports
import numpy as np
from sklearn import datasets

## 1 - A basic feature : the CountVectorizer

First, we create a `CountVectorizer`. This scikit-learn object converts a corpus of raw texts into a matrix whose columns are the words of the corpus and whose rows are its documents. Each element of the matrix is the number of occurrences of a given word in a given document. This matrix is the most basic feature representation you can get from a corpus. To obtain it, we use the `fit_transform` method.

In [4]:
# import and create a vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)

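# print the extracted features
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print('Features: %s' % vectorizer.get_feature_names_out().tolist())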
print('Data matrix:')
print(X.toarray())

Features: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Data matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
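
Note that the first and last documents get identical rows even though their word order differs. Counting pairs of consecutive words as well, as mentioned in the introduction, is one way to keep some of that ordering. The sketch below assumes the `ngram_range` parameter shown in the vectorizer's printed settings; the variable names are chosen for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# ngram_range=(1, 2) keeps single words and adds pairs of consecutive words
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bigram = bigram_vectorizer.fit_transform(corpus)

# vocabulary_ maps each feature (word or word pair) to its column index
bigram_vocabulary = bigram_vectorizer.vocabulary_
print(sorted(bigram_vocabulary))
print(X_bigram.shape)

# the pair 'is this' only occurs in the last document,
# so the first and last rows are no longer identical
column = bigram_vocabulary['is this']
print(X_bigram.toarray()[:, column])
```

The price to pay is a larger vocabulary, hence a wider and sparser matrix.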


### Exercise 1

The 20 newsgroups dataset comprises around 18000 newsgroup posts on different topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation).

Here we will use 2 topics (Computer Graphics and Religion), and the goal is to predict which category a post belongs to.


The exercise consists in the following:

  * Construct features from this dataset using the CountVectorizer.
  * Fit a (regularized) linear model such as logistic regression with L1 regularization. L1 regularization produces sparse solutions, which makes it well suited to discarding a large number of features.
  * Find the regularization parameter by cross-validation.
  * What are the most important features (words) for this dataset?
  * What generalization performance do you obtain?
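
As a hint for the mechanics only, here is a minimal sketch of how these steps might be wired together. It runs on a tiny made-up corpus, not on the newsgroups data, so the questions above remain open. The step names `vect` and `clf`, the grid of `C` values and the toy sentences are assumptions made for this example. `penalty='l1'` with the `liblinear` solver is the long-standing way to request L1 regularization, although recent scikit-learn releases may emit a deprecation warning for it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy corpus (made up for illustration): 0 = graphics, 1 = religion
toy_docs = [
    'render the image with shaded polygons',
    'the graphics card draws each pixel',
    'image files store pixel colors',
    'a 3d graphics program to render polygons',
    'convert the image to another graphics format',
    'which program can render this image file',
    'the church teaches faith in god',
    'he prayed to god every morning',
    'faith and religion shape their beliefs',
    'the bible is read in church',
    'they argue about religion and god',
    'her faith comes from the bible',
]
toy_labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression(penalty='l1', solver='liblinear')),
])

# cross-validate the regularization strength C (smaller C = sparser model)
search = GridSearchCV(pipeline, {'clf__C': [0.1, 1.0, 10.0]}, cv=3)
search.fit(toy_docs, toy_labels)
print('best C:', search.best_params_['clf__C'])

# rank the words by the absolute value of their coefficient
vect = search.best_estimator_.named_steps['vect']
clf = search.best_estimator_.named_steps['clf']
words = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
weights = clf.coef_[0]
ranked = sorted(zip(words, weights), key=lambda pair: abs(pair[1]), reverse=True)
for word, weight in ranked[:5]:
    print('%-10s %+.3f' % (word, weight))
```

On the real data, the generalization performance should be measured on `data_test`, which plays no part in the cross-validation.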

In [6]:
categories = [
        'talk.religion.misc',
        'comp.graphics',
    ]
remove = ('headers', 'footers', 'quotes')
data_train = datasets.fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = datasets.fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')
print('Example of two samples in the dataset: \n')
print('--------------\n')
print(data_train.data[0])
print('\n--------------')
print(data_train.data[1])

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)


data loaded
Example of two samples in the dataset: 

--------------

Hi! Everyone,

Since some people quickly solved the problem of determining a sphere from
4 points, I suddenly recalled a problem which is how to find the ellipse
from its offset. For example, given 5 points on the offset, can you find
the original ellipse analytically?

I spent two months solving this problem by using analytical method last year,
but I failed. Under the pressure, I had to use other method - nonlinear
programming technique to deal with this problem approximately.

Any ideas will be greatly appreciated. Please post here, let the others
share our interests.

--------------


You know, everybody scoffed at that guy they hung up on a cross too.
He claimed also to be the son of God; and it took almost two thousand 
years to forget what he preached.

	Love thy neighbor as thyself.


Anybody else wonder if those two guys setting the fires were 'agent 
provacateurs.'




---

## 2 - A more sophisticated feature : TF-IDF

In a large text corpus, some words are very frequent (e.g. “the”, “a”, “is” in English) and hence carry very little meaningful information about the actual contents of a document. If we were to feed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for use by a classifier, it is very common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: $\text{tf-idf}(t,d)=\text{tf}(t,d) \times \text{idf}(t)$. Hence, the more frequent a term is in a given document, the higher its tf term. But if this term is also frequent across the whole corpus, its inverse document frequency decreases, and so does its tf-idf value. In the end, we get high tf-idf values for rare but characteristic terms. See the [scikit-learn documentation for more information](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting).
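
To make the formula concrete, the weights can be computed by hand in plain Python. The formula above does not say how idf is defined. This sketch assumes scikit-learn's default variant (`smooth_idf=True`, `norm='l2'`), that is $\text{idf}(t) = \ln\frac{1+n}{1+\text{df}(t)} + 1$ followed by scaling each document vector to unit Euclidean length. The helper names `tokenize` and `tfidf_row` are made up for the example.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # lowercase and keep tokens of two or more word characters,
    # mimicking scikit-learn's default token pattern
    return re.findall(r'\b\w\w+\b', text.lower())


counts = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*counts))
n_docs = len(corpus)

# document frequency: number of documents that contain each term
df = {term: sum(term in c for c in counts) for term in vocabulary}

# smoothed idf (assumed variant, see above)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}


def tfidf_row(doc_counts):
    # tf * idf for every vocabulary term, then L2 normalization
    raw = [doc_counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf = [tfidf_row(c) for c in counts]
for term, value in zip(vocabulary, tfidf[0]):
    print('%-10s %.8f' % (term, value))
```

In the first document, “the” appears in every document of the corpus and so gets the lowest non-zero weight, while “first” appears in only two documents and gets the highest.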


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [8]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)

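# print the extracted features
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print('Features: %s' % vectorizer.get_feature_names_out().tolist())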
print('Data matrix:')
print(X.toarray())

Features: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Data matrix:
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]
 [ 0.          0.27230147  0.          0.27230147  0.          0.85322574
   0.22262429  0.          0.27230147]
 [ 0.55280532  0.          0.          0.          0.55280532  0.
   0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]]


### Exercise 2

Compared to the CountVectorizer, do you obtain better performance or more meaningful features?