# B - 2 - Feature Engineering

## Description

**Features** are characteristics of a text, that is, variables such as the words appearing in the text or in its title, or any other information about the text that the model can use to predict the label. To be understood by the model, features need to be represented as numbers or booleans. They are organized in a matrix where each row represents one text and each column a feature. This n-dimensional representation of the features is called the feature space, and the process of extracting and selecting the most pertinent features is called **feature engineering**.

**Process aim:** 
The aim of this process is to output a file representing the features that will be used to train a model to output labels.

**Input**: Dataset in JSON or CSV, including one or several unstructured textual fields.

Sub-processes:
1. Import and prepare dataset
2. Feature extraction
    * Pre-processing
    * Vectorization
3. Export feature space

**Output**: a CSV file representing the features

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Binarizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.extmath import density

## 1. Import dataset
Import the dataset compiling all data acquired through the previous steps (metadata and raw texts)

In [2]:
# Import the dataset 
dataset = pd.read_csv('../data/B_engineering/doc_2000_2017_reduced.csv', usecols=['record_id','title','text'])

In [3]:
dataset.set_index('record_id',inplace=True)

In [4]:
dataset.head(2)

| record_id | title | text |
|---|---|---|
| 455823 | Letter dated 2001/12/06 from the Permanent Rep... | A/56/682–S/2001/1159 United Nations General As... |
| 694579 | Provisional summary record of the 46th meeting... | E/2010/SR.46 United Nations Economic and Socia... |


## 2. Features Extraction

### Preprocessing
Pre-processing involves a number of operations that aim at normalizing the text. These include converting the text to lower case, removing punctuation, removing stop words and tokenization. Tokenization splits the text into units called tokens. Tokens can be sentences, words or combinations of words (n-grams).

#### Example
Corpus of 2 simple texts: <code>['The United Nations Development Programme was established in 1965.', 'The United Nations Environment Programme was established in 1972.']</code>

1. convert the raw text to lower case => 
<code>['the united nations development programme was established in 1965.','the united nations environment programme was established in 1972.']</code>
2. remove punctuation => <code>['the united nations development programme was established in 1965','the united nations environment programme was established in 1972']</code>
3. remove stop words => <code>['united nations development programme established 1965','united nations environment programme established 1972']</code>
4. tokenize: split the raw text into smaller units, usually representing words or n-grams.
    * word (1-gram) tokenization: => <code>[['united', 'nations', 'development', 'programme', 'established', '1965'], ['united', 'nations', 'environment', 'programme', 'established', '1972']]</code>
    * word (1-gram and 2-gram) tokenization: => <code>[['1965','development','development programme','established','established 1965','nations','nations development','programme','programme established','united','united nations'], ['1972','environment','environment programme','established','established 1972','nations','nations environment','programme','programme established','united','united nations']]</code>
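
The four steps above can be sketched in plain Python. This is only an illustration and not the code used later in this notebook: the stop-word list `STOP_WORDS` is a tiny hypothetical subset chosen for this example, and the helper names `preprocess` and `ngrams` are assumptions.

```python
import string

# Hypothetical, deliberately tiny stop-word list used for this example only
STOP_WORDS = {"the", "was", "in"}

def preprocess(text):
    """Lower-case the text, strip punctuation, drop stop words and split into word tokens."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]

def ngrams(tokens, n):
    """Return the n-grams (as space-joined strings) found in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = [
    "The United Nations Development Programme was established in 1965.",
    "The United Nations Environment Programme was established in 1972.",
]

for text in corpus:
    tokens = preprocess(text)
    print(tokens)             # 1-gram tokens
    print(ngrams(tokens, 2))  # 2-gram tokens
```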
     
### Vectorization

Vectorization is the process of converting a text into a vectorized representation of its content. Each token becomes a feature represented by a numerical value that will be used by machine learning algorithms to classify texts into various categories. There are several possibilities to build the vectorized representation of the text.
1. Each feature is represented by a **boolean** value and has the same importance.
2. Each feature is represented by its **frequency** in each text, without taking into account the length of the text. 
3. Each feature is represented by its **term frequency-inverse document frequency (TF-IDF)**: the frequency measure is normalized so that the frequency of each feature in the overall corpus is taken into account. For instance, if 'united nations' appears in all texts, then its importance in characterizing the text is lowered. 
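
The three representations can be compared on the two-sentence corpus used in the preprocessing example. The sketch below relies on scikit-learn's `CountVectorizer` and `TfidfVectorizer` with their default tokenizer and built-in English stop-word list; the variable names are illustrative and are not reused elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "The United Nations Development Programme was established in 1965.",
    "The United Nations Environment Programme was established in 1972.",
]

# 1. Boolean: 1 if the term occurs in the text, 0 otherwise
binary_matrix = CountVectorizer(stop_words="english", binary=True).fit_transform(corpus)

# 2. Frequency: number of occurrences of the term in the text
count_matrix = CountVectorizer(stop_words="english").fit_transform(corpus)

# 3. TF-IDF: frequency weighted down for terms that occur in many texts
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compare the weight of a term shared by both texts with a term unique to the first text
vocab = tfidf_vectorizer.vocabulary_
first_text = tfidf_matrix.toarray()[0]
print("weight of 'united':", first_text[vocab["united"]])
print("weight of '1965':  ", first_text[vocab["1965"]])
```

In this small corpus every term occurs at most once per text, so the boolean and frequency matrices are identical. With TF-IDF, 'united' appears in both texts and therefore receives a lower weight than '1965', which appears in only one.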

#### Document term matrix
The result of vectorization is a document-term matrix, where each row represents a text and each column a feature. Each value is either a binary value (0: not in the text, 1: in the text), the term frequency in the text, or the TF-IDF score. 
The following example shows a binary matrix:

| 1965 | 1972 | development | environment | established | programme | united | nations |
|---|---|---|---|---|---|---|---|
|1|0|1|0|1|1|1|1|
|0|1|0|1|1|1|1|1|


#### Density
In text mining, document-term matrices frequently contain many zeros, because most features appear in only a few texts. A sparse matrix is a matrix with a high proportion of zeros. Conversely, a dense matrix has a low proportion of zeros. The **density** measures how dense a matrix is: it is the proportion of non-zero elements. The density of a matrix with few zero-valued elements is close to 1. On the other hand, the density of a matrix where the majority of elements are zeros is close to 0. This is often the case when analyzing texts.
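
As a small worked example, the density of the binary matrix shown in the table above can be computed by hand as the number of non-zero elements divided by the total number of elements, and checked against the `density` helper from `sklearn.utils.extmath` that this notebook imports. The variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils.extmath import density

# Binary document-term matrix from the example table (2 texts, 8 features)
binary_matrix = np.array([
    [1, 0, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 1, 1, 1],
])
sparse_matrix = csr_matrix(binary_matrix)

# Density = non-zero elements / total number of elements
n_elements = sparse_matrix.shape[0] * sparse_matrix.shape[1]
manual_density = sparse_matrix.nnz / n_elements

print(manual_density)          # 12 non-zero values out of 16
print(density(sparse_matrix))  # same value computed by scikit-learn
```

This toy matrix is dense because the two texts share most of their words. On a real corpus with thousands of features, the density is usually far lower.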

### Features extraction in practice

Term frequency and TF-IDF are commonly used to extract features from a corpus of texts. More advanced methods exist, such as word embeddings, but they are not covered in this notebook.

The functions tf_extractor() and tf_idf_extractor() return:
* the fitted vectorizer, whose vocabulary_ attribute holds a vocabulary in which each feature is an entry of the form feature_text: feature_number. The vocabulary is not used for machine learning processes. However, it is useful to map the index number back to the string when investigating which features are the most important. It is structured as a dictionary:
<pre>
{
'1965': 0,
'1972': 1,
'development': 2,
...
}
</pre>
* a document-term matrix
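
To illustrate how the vocabulary maps between feature strings and column indices, here is a minimal sketch on the two-sentence example corpus. It uses `CountVectorizer` directly rather than the extractor functions defined below, and `index_to_term` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The United Nations Development Programme was established in 1965.",
    "The United Nations Environment Programme was established in 1972.",
]

vectorizer = CountVectorizer(stop_words="english").fit(corpus)

# vocabulary_ maps each feature string to its column index in the document-term matrix
print(vectorizer.vocabulary_["development"])

# Invert the dictionary to look up the feature string behind a column index
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
print(index_to_term[0])
```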

#### Customization

The following parameters can be changed:
* min_df: ignore terms that have a document frequency strictly lower than the given threshold.
* max_df: ignore terms that have a document frequency strictly higher than the given threshold.
* ngram_range: you can add more features by selecting the n-gram range.
* max_features: only consider the top max_features ordered by term frequency across the corpus.
* language: you can select another language for the stop words depending on the language of the corpus, or set it to None to keep all words.
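
The effect of these parameters on the size of the feature space can be checked on the two-sentence example corpus. The sketch below calls `TfidfVectorizer` directly; the `settings` dictionary and its labels are assumptions made for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The United Nations Development Programme was established in 1965.",
    "The United Nations Environment Programme was established in 1972.",
]

# Hypothetical parameter combinations to compare
settings = {
    "default": {},
    "1- and 2-grams": {"ngram_range": (1, 2)},
    "min_df=2": {"min_df": 2},
    "max_features=3": {"max_features": 3},
}

feature_counts = {}
for label, params in settings.items():
    vectorizer = TfidfVectorizer(stop_words="english", **params)
    matrix = vectorizer.fit_transform(corpus)
    # The number of columns of the matrix is the number of features
    feature_counts[label] = matrix.shape[1]
    print("{}: {} features".format(label, matrix.shape[1]))
```

Widening the n-gram range adds features, while a higher min_df or a max_features limit removes them. Note that an integer min_df is an absolute number of documents, whereas a float (as used later in this notebook) is a proportion of the corpus.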

In [5]:
def tf_extractor(dataset, min_df=1, max_df=1.0, ngram_range=(1,1),language='english',max_features=None):
    '''
    Takes a dataset (an iterable of texts) and several additional parameters
    - min_df: ignore the terms that have a document frequency strictly lower than the threshold
    - max_df: ignore the terms that have a document frequency strictly higher than the threshold
    - ngram_range: range of n-grams to extract. (1,1) creates tokens of one word, while (1,2) creates tokens of
    one and two words.
    - language: language of the list of words that will be excluded (stop words)
    - max_features: only keep the top max_features terms ordered by term frequency across the corpus
    Returns the fitted vectorizer, whose vocabulary_ attribute is the dictionary of features, and a document-term
    frequency matrix. Features are represented by the term frequency in each text.
    '''
    vectorizer = CountVectorizer(min_df=min_df, max_df=max_df, ngram_range=ngram_range, stop_words=language,max_features=max_features)
    # fit_transform learns the vocabulary and builds the matrix in a single pass
    matrix = vectorizer.fit_transform(dataset)

    return vectorizer, matrix

In [6]:
def tf_idf_extractor(dataset, min_df=1, max_df=1.0, ngram_range=(1,1),language='english',max_features=None):
    '''
    Takes a dataset (an iterable of texts) and several additional parameters
    - min_df: ignore the terms that have a document frequency strictly lower than the threshold
    - max_df: ignore the terms that have a document frequency strictly higher than the threshold
    - ngram_range: range of n-grams to extract. (1,1) creates tokens of one word, while (1,2) creates tokens of
    one and two words.
    - language: language of the list of words that will be excluded (stop words)
    - max_features: only keep the top max_features terms ordered by term frequency across the corpus
    Returns the fitted vectorizer, whose vocabulary_ attribute is the dictionary of features, and a document-term
    TF-IDF matrix. Features are represented by the TF-IDF score.
    '''
    vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df, ngram_range=ngram_range, stop_words=language,max_features=max_features)
    # fit_transform learns the vocabulary and builds the matrix in a single pass
    matrix = vectorizer.fit_transform(dataset)
    return vectorizer, matrix

In [7]:
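def test_density(dataset, min_df, max_df, ngram_range):
    '''
    Builds a TF-IDF matrix for every combination of the given parameters and reports its size and density.
    min_df, max_df and ngram_range are lists of values to try, e.g. ngram_range=[(1,1), (1,2)].
    Returns a DataFrame with one row per combination.
    '''
    r = []
    for n in ngram_range:
        for min_val in min_df:
            for max_val in max_df:
                print("n_grams:{}, min_df: {}, max_df: {}".format(n, min_val, max_val))
                tf_idf_vocabulary, tf_idf_matrix = tf_idf_extractor(dataset,min_df=min_val, max_df=max_val, ngram_range=n)
                docs  = tf_idf_matrix.shape[0]
                features = tf_idf_matrix.shape[1]
                # density() returns the proportion of non-zero values; convert it to a percentage
                pc_nnz = density(tf_idf_matrix)*100
                values = [n,min_val,max_val,docs, features,pc_nnz]
                r.append(values)
    report = pd.DataFrame(r, columns=['n_grams', 'min_df', 'max_df', 'documents', 'features', '% non-zero values'])
    return report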

### Extracting Features from Full Text
We will perform feature extraction from the full text using tf_idf_extractor with min_df=0.001 and max_df=0.7; the other parameters keep their default values.

In [8]:
# Create the corpus
#ft_corpus = dataset['text'].tolist()
# Get the vocabulary and the document-term matrix
tf_idf_vocabulary_ft, tf_idf_matrix_ft = tf_idf_extractor(dataset['text'], min_df=0.001, max_df=0.7)

In [9]:
# Length of the vocabulary
print(len(tf_idf_vocabulary_ft.vocabulary_))

28376


In [10]:
# Print some information on the sparsity / density of the document-term matrix
# Dimension
print("matrix dimension: {} documents * {} features".format(tf_idf_matrix_ft.shape[0], tf_idf_matrix_ft.shape[1]))
# Total number of elements
print("total number of elements: {}".format(tf_idf_matrix_ft.shape[0] * tf_idf_matrix_ft.shape[1]))
# Non-zero values
print("non-zero values: {}".format(tf_idf_matrix_ft.nnz))
# Density
print("density: {}".format(density(tf_idf_matrix_ft)))
print("{0:.2f} % of the matrix are non-zero elements".format(density(tf_idf_matrix_ft)*100))

matrix dimension: 113348 documents * 28376 features
total number of elements: 3216362848
non-zero values: 77713770
density: 0.024162003378544187
2.42 % of the matrix are non-zero elements


### Extracting Features from Title
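We will perform feature extraction from the title using tf_idf_extractor with min_df=0.005. As called below, the extractor keeps the default ngram_range=(1,1), so the tokens are single words; pass ngram_range=(1,3) to also extract tokens of two and three words.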

In [11]:
# Get the vocabulary and the document-term matrix
tf_idf_vocabulary_title, tf_idf_matrix_title = tf_idf_extractor(dataset['title'], min_df=0.005)

In [12]:
# Print some information on the sparsity / density of the document-term matrix
# Dimension
print("matrix dimension: {} documents * {} features".format(tf_idf_matrix_title.shape[0], tf_idf_matrix_title.shape[1]))
# Total number of elements
print("total number of elements: {}".format(tf_idf_matrix_title.shape[0] * tf_idf_matrix_title.shape[1]))
# Non-zero values
print("non-zero values: {}".format(tf_idf_matrix_title.nnz))
# Density
print("density: {}".format(density(tf_idf_matrix_title)))
print("{0:.2f} % of the matrix are non-zero elements".format(density(tf_idf_matrix_title)*100))

matrix dimension: 113348 documents * 557 features
total number of elements: 63134836
non-zero values: 1409828
density: 0.02233042943201753
2.23 % of the matrix are non-zero elements


### Saving the feature space

In [13]:
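# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (scikit-learn >= 1.0)
ft_features = pd.DataFrame(tf_idf_matrix_ft.toarray(),columns=tf_idf_vocabulary_ft.get_feature_names_out())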

In [14]:
ft_features = ft_features.set_index(dataset.index)

In [15]:
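# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (scikit-learn >= 1.0)
title_features = pd.DataFrame(tf_idf_matrix_title.toarray(),columns=tf_idf_vocabulary_title.get_feature_names_out())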

In [16]:
title_features = title_features.set_index(dataset.index)

In [17]:
ft_features.to_csv('../data/C_learning/doc_2000_20007_ft_features.csv')

In [18]:
title_features.to_csv('../data/C_learning/doc_2000_20007_title_features.csv')
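
Converting the full-text matrix with toarray() materializes every element, zeros included. With 113348 documents and 28376 features stored as 8-byte floats, that is roughly 25 GB of memory. If this is a problem, one possible alternative (not the method used in this notebook, and it does not produce the CSV output described above) is to save the sparse matrix directly with `scipy.sparse.save_npz`. The sketch below demonstrates the round trip on the small example corpus; the file name is hypothetical and written to a temporary directory.

```python
import os
import tempfile

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The United Nations Development Programme was established in 1965.",
    "The United Nations Environment Programme was established in 1972.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(corpus)

# Hypothetical output location, used for this illustration only
path = os.path.join(tempfile.mkdtemp(), "example_features.npz")

# Save only the non-zero values and their positions, then load them back
sparse.save_npz(path, matrix)
restored = sparse.load_npz(path)

print(restored.shape, restored.nnz)
```

The row index (record_id) and the feature names are not stored in the .npz file, so they would need to be saved separately to keep the mapping back to documents and terms.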