# B - 1 - Full text processing - Input space - Feature extraction

## Description

**Process aim:** 
The aim of this process is to output a file representing the features that machine learning processes will use to output probable labels.

**Input**: Dataset in JSON or CSV, including one or several unstructured textual fields.

Sub-processes:
1. Import and prepare dataset
2. Pre-processing and vectorization
    * pre-processing
    * vectorization
3. Export features space

**Output**: a CSV file representing the features

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.utils.extmath import density

## 1. Import dataset
Import the dataset that compiles all the data acquired in the previous steps (metadata and raw texts).

In [2]:
# Import the dataset 
dataset = pd.read_csv('data/pre-processing/dataset_from_MARC_text.csv')
# Filter to keep only the rows that have an English raw text
dataset = dataset[dataset['text'].notnull()]
# Create a corpus of raw texts
corpus = dataset['text'].tolist()

## 2. Pre-processing and vectorization

* pre-processing:
    * convert the raw text to lowercase
    * remove punctuation
    * remove stop words
    * tokenize: split the text into a list of words
* vectorization:
    * term frequency
    * term weight normalization using TF-IDF

### Pre-processing

CountVectorizer takes parameters that specify how to pre-process each text. In the following section we mostly use the default parameters, such as:
* lowercase=True: converts the text to lowercase
* tokenizer=None: we use the default token_pattern, which identifies tokens of two or more alphanumeric characters using word boundaries. Tokens can also be numerical (e.g. 2017).
* analyzer='word': uses words to make features

In addition, we add one parameter:
* stop_words='english': uses a built-in English list to remove stop words (e.g. 'the', 'a', 'perhaps')
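
As an illustration, the effect of these parameters can be inspected with build_analyzer(), which returns the function the vectorizer applies to each text. This is a minimal sketch: the sentence is an invented sample, not a text from the corpus, and sample_text and sample_vectorizer are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample sentence (not taken from the dataset)
sample_text = "The General Assembly of the United Nations, New York, 2010."

# Same configuration as in this notebook: defaults plus the English stop word list
sample_vectorizer = CountVectorizer(stop_words='english')
analyze = sample_vectorizer.build_analyzer()

# Lowercases the text, drops punctuation and stop words, and splits it into tokens
tokens = analyze(sample_text)
print(tokens)
```

The stop words 'the' and 'of' are dropped, while the numerical token '2010' is kept.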

### Vectorization using term frequency
At the same time, CountVectorizer builds a bag of words: it learns the vocabulary of the corpus and counts how many times each token occurs in each text.

In [3]:
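# Create a vectorizer that will perform the pre-processing steps and create a bag of words
# (a bare %time line times an empty statement; use %%time as the first line of the cell to time the whole cell)
%time
vectorizer = CountVectorizer(stop_words='english')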

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.01 µs


#### Bag of words
The vectorizer is then combined with the function fit(). This function learns the vocabulary of the overall corpus of texts and produces a bag of words that maps each token to its feature number, i.e. the index of its column in the document term matrix.

<pre><code>
('united', 26903)
('nations', 18284)
('general', 12474)
('assembly', 4736)
('official', 18967)
('records', 21788)
('fifth', 11718)
</code></pre>
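
Note that the numbers stored in vocabulary_ are column indices assigned in alphabetical order of the tokens, not occurrence counts. A minimal sketch on an invented two-text corpus (toy_corpus and toy_vectorizer are illustrative names, not part of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the dataset
toy_corpus = [
    "united nations general assembly",
    "general assembly official records",
]

toy_vectorizer = CountVectorizer().fit(toy_corpus)

# Maps each token to the index of its column in the document term matrix
vocabulary = toy_vectorizer.vocabulary_
for token, index in sorted(vocabulary.items(), key=lambda item: item[1]):
    print(index, token)
```

'general' and 'assembly' occur in both texts, yet each of them gets a single index like every other token.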

In [4]:
# Create the bag of words
%time
bag_of_words = vectorizer.fit(corpus)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.01 µs


To see the content of the bag of words:

In [5]:
# Display the bag of words
bag_of_words.vocabulary_

{'65': 2456,
 'pv': 21246,
 '71': 2805,
 'united': 26903,
 'nations': 18284,
 'general': 12474,
 'assembly': 4736,
 'official': 18967,
 'records': 21788,
 'fifth': 11718,
 'session': 23646,
 'st': 24557,
 'plenary': 20266,
 'meeting': 17261,
 'tuesday': 26437,
 '21': 1255,
 'december': 8644,
 '2010': 1088,
 'new': 18454,
 'york': 28235,
 'president': 20701,
 'mr': 18001,
 'deiss': 8814,
 'switzerland': 25320,
 'called': 6232,
 'order': 19219,
 '05': 61,
 'reports': 22297,
 'committee': 7278,
 'spoke': 24493,
 'french': 12199,
 'consider': 7706,
 'agenda': 3757,
 'items': 15099,
 '27': 1463,
 '28': 1481,
 '61': 2109,
 '63': 2145,
 '68': 2764,
 '105': 161,
 '106': 167,
 '118': 400,
 '130': 459,
 'request': 22353,
 'rapporteur': 21507,
 'asif': 4695,
 'garayev': 12385,
 'azerbaijan': 5073,
 'introduce': 14882,
 'intervention': 14846,
 'great': 12821,
 'honour': 13609,
 'privilege': 20816,
 'submitted': 24968,
 'allocated': 4004,
 'contained': 7799,
 'documents': 9786,
 '448': 1830,
 '460'

In [6]:
# Count the number of tokens in the bag of words
len(bag_of_words.vocabulary_)

28410

#### Documents Term Matrix
A document term matrix is a matrix where each row represents a text and each column holds the frequency of a token in that text. The value is 0 if the token is not present in the text.

From the bag of words it is possible to produce the document term matrix by applying the .transform() function.
<pre>
<code>
bag_of_words.transform(corpus)
</code>
</pre>

You can also skip the bag of words step and directly transform your corpus of texts into a document term matrix by applying:
<pre>
<code>
vectorizer.fit_transform(corpus)
</code>
</pre>

For each text, the function returns the document index, the feature number and its frequency.
<pre>
<code>
  (0, 22872)	1
  (0, 15279)	1
  (0, 19947)	1
  (0, 17431)	1
  (0, 11113)	1
</code>
</pre>

Each feature needs to be represented by a number in order to be further processed by learning algorithms. To get the token associated with a feature number:
<pre>
<code>
token = vectorizer.get_feature_names_out()[22872]
</code>
</pre>

In [7]:
# Transform the corpus into a document term matrix
%time
documents_term_matrix = vectorizer.fit_transform(corpus)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 8.11 µs


In [8]:
print("Feature 22872 is a numerical representation of the token '{}'".format(vectorizer.get_feature_names_out()[22872]))

Feature 22872 is a numerical representation of the token 'rose'


#### Density
From the document term matrix, it is possible to measure how sparse the matrix is by outputting its density, i.e. the proportion of non-zero values. The density of a matrix that has few zero-valued elements is close to 1. On the other hand, the density of a matrix where the majority of elements are zero is close to 0. This is often the case when analyzing texts.

In [9]:
# The dimension of the matrix is 198 texts * 28410 features.
matrix_dimension = documents_term_matrix.shape
# The number of non-zero values in the overall matrix
non_zero_values = documents_term_matrix.nnz
print("matrix dimension: {} * {}".format(matrix_dimension[0], matrix_dimension[1]))
print("total number of elements: {}".format(matrix_dimension[0] * matrix_dimension[1]))
print("non zero values: {}".format(non_zero_values))
print("density: {}".format(density(documents_term_matrix)))

matrix dimension: 198 * 28410
total number of elements: 5625180
non zero values: 273503
density: 0.04862119967716589
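
The density is the number of non-zero values divided by the total number of elements. A minimal sketch that checks this on an invented 2 * 4 matrix (toy_matrix is an illustrative name) and on the figures printed above:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils.extmath import density

# Hypothetical toy matrix: 2 non-zero values out of 8 elements
toy_matrix = csr_matrix(np.array([[1, 0, 0, 0],
                                  [0, 2, 0, 0]]))
toy_density = toy_matrix.nnz / (toy_matrix.shape[0] * toy_matrix.shape[1])
print(toy_density, density(toy_matrix))

# Same formula applied to the figures reported for the corpus
corpus_density = 273503 / (198 * 28410)
print(corpus_density)
```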


### Term weight normalization using term frequency-inverse document frequency (TF-IDF)

Term frequency is an interesting indicator, but because the length of the texts varies across the corpus we need a normalized measure to weight the importance of each term. A common measure is TF-IDF (term frequency-inverse document frequency): the more often a word appears in a document, the more important it is for that document, but this importance is weighted down by the number of documents in the corpus in which the word appears.
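
As a sketch of how the weights are obtained: with its default settings, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of texts and df(t) the number of texts containing the token t, multiplies it by the term frequency, and then scales each row to unit Euclidean length. The example below reproduces this by hand on an invented corpus and compares the result with TfidfTransformer; all variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, not taken from the dataset
toy_corpus = [
    "peace and security",
    "security council security",
    "peace talks",
]

count_matrix = CountVectorizer().fit_transform(toy_corpus)
counts = count_matrix.toarray()

# Number of texts and, for each token, number of texts containing it
n_texts = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_texts) / (1 + document_frequency)) + 1

# Weight the term frequencies, then normalize each row (default norm='l2')
weighted = counts * idf
manual_tf_idf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(count_matrix).toarray()
print(np.allclose(manual_tf_idf, reference))
```

The row normalization is what compensates for the varying length of the texts.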

To get the TF-IDF we can process in 2 different ways:
* If we have computed the term frequency matrix using CountVectorizer, we can subsequently apply TfidfTransformer
<pre>
<code>
tf_idf_matrix = TfidfTransformer().fit_transform(documents_term_matrix)
</code>
</pre>

* We can perform all the pre-processing and vectorization steps at once using TfidfVectorizer:
<pre>
<code>
tf_idf_matrix2 = TfidfVectorizer(stop_words='english').fit_transform(corpus)
</code>
</pre>

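The two matrices will contain the same values when the same parameters are used, since TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.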
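
This can be checked on a small invented corpus. A minimal sketch, where toy_corpus, two_step and one_step are illustrative names:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus, not taken from the dataset
toy_corpus = [
    "the general assembly adopted the resolution",
    "the security council adopted a resolution on peace",
]

# Way 1: term frequencies first, then TF-IDF weighting
term_frequencies = CountVectorizer(stop_words='english').fit_transform(toy_corpus)
two_step = TfidfTransformer().fit_transform(term_frequencies)

# Way 2: everything at once
one_step = TfidfVectorizer(stop_words='english').fit_transform(toy_corpus)

same_values = np.allclose(two_step.toarray(), one_step.toarray())
print(same_values)
```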

In [10]:
# Create a document term matrix with TF-IDF
%time
tf_idf_matrix = TfidfVectorizer(stop_words='english').fit_transform(corpus)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.11 µs


In [11]:
print(tf_idf_matrix)

  (0, 2456)	0.222021408045
  (0, 21246)	0.0319171945754
  (0, 2805)	0.073215833811
  (0, 26903)	0.0946796386413
  (0, 18284)	0.0247361218072
  (0, 12474)	0.0554678703819
  (0, 4736)	0.153203704587
  (0, 18967)	0.00255890915247
  (0, 21788)	0.00170593943498
  (0, 11718)	0.00984531551688
  (0, 23646)	0.0144373951434
  (0, 24557)	0.00281272952737
  (0, 20266)	0.0097315827825
  (0, 17261)	0.00597078802242
  (0, 26437)	0.00212103060994
  (0, 1255)	0.00556090444714
  (0, 8644)	0.00753866539911
  (0, 1088)	0.00614681777477
  (0, 18454)	0.0307069098296
  (0, 28235)	0.00170593943498
  (0, 20701)	0.0830503250465
  (0, 18001)	0.0249605416719
  (0, 8814)	0.00198145966657
  (0, 25320)	0.0235959624163
  (0, 6232)	0.00530108457743
  :	:
  (197, 5242)	0.00453753864715
  (197, 9468)	0.00453753864715
  (197, 19392)	0.00453753864715
  (197, 9469)	0.0226876932358
  (197, 27005)	0.00453753864715
  (197, 6464)	0.00453753864715
  (197, 9889)	0.00453753864715
  (197, 19865)	0.00453753864715
  (197, 26271)	0.0

## 3. Export the feature space

In [12]:
# Use the document term matrix to create a new dataset
%time
features_topic = pd.DataFrame(tf_idf_matrix.toarray())

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 6.2 µs


In [13]:
# Let's add a new feature: the number of non-zero values per document
features_topic['nnz'] = [tf_idf_matrix[i].nnz for i in range(tf_idf_matrix.shape[0])]

In [14]:
features_topic.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28401,28402,28403,28404,28405,28406,28407,28408,28409,nnz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1891
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,821
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.012753,0.0,0.0,0.0,0.0,0.0,625


In [15]:
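# Add the new features to the initial dataset
# (reset the index first: rows were filtered at import, and join() aligns on the index)
dataset = dataset.reset_index(drop=True).join(features_topic, rsuffix='-topic')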

In [16]:
dataset.head()

Unnamed: 0,record_id,body,corporates,geographic_terms,session,symbol,title,topics,url-en,url-es,...,28401,28402,28403,28404,28405,28406,28407,28408,28409,nnz
0,703243,A/,,,65,A/65/PV.71,"General Assembly official records, 65th sessio...","['SOCIAL DEVELOPMENT', 'AGEING PERSONS', 'DEMO...",http://digitallibrary.un.org/record/703243/fil...,http://digitallibrary.un.org/record/703243/fil...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1891.0
1,703245,A/,['UN. Peacebuilding Commission. Organizational...,,65,A/65/PV.72,"General Assembly official records, 65th sessio...","['POPULATION PROGRAMMES', 'CHEMICAL WEAPONS', ...",http://digitallibrary.un.org/record/703245/fil...,http://digitallibrary.un.org/record/703245/fil...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,821.0
2,701732,A/,['UN. General Assembly (66th sess. : 2011-2012...,,65,A/65/PV.57,"General Assembly official records, 65th sessio...","['STATE RESPONSIBILITY', 'STANDARDS OF CONDUCT...",http://digitallibrary.un.org/record/701732/fil...,http://digitallibrary.un.org/record/701732/fil...,...,0.0,0.0,0.0,0.012753,0.0,0.0,0.0,0.0,0.0,625.0
3,730980,A/,"['UN. Board of Auditors', 'UN System', 'UN. Of...",MYANMAR,66,A/66/PV.93,"General Assembly official records, 66th sessio...","['PERSONS WITH DISABILITIES', 'HUMAN RIGHTS', ...",http://digitallibrary.un.org/record/730980/fil...,http://digitallibrary.un.org/record/730980/fil...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1306.0
4,730711,A/,"['Unesco', 'UN Trust Fund for Partnerships - P...",,66,A/66/PV.83,"General Assembly official records, 66th sessio...","['HEALTH POLICY', 'FOREIGN POLICY', 'MEMORIALS...",http://digitallibrary.un.org/record/730711/fil...,http://digitallibrary.un.org/record/730711/fil...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2349.0
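
One point worth checking at this step: join() aligns rows on the index rather than on position. Because the import step filtered rows, the index of the filtered dataset can contain gaps, whereas features_topic is numbered from 0. The sketch below shows, on invented data, what happens with and without resetting the index; records and features are illustrative names.

```python
import pandas as pd

# Hypothetical data: the second record has no text and gets filtered out
records = pd.DataFrame({'text': ['first text', None, 'third text']})
records = records[records['text'].notnull()]   # remaining index: 0, 2

# Features built from the filtered texts are numbered 0, 1
features = pd.DataFrame({'nnz': [10, 20]})

# Joining directly leaves the row with index 2 without features (NaN)
misaligned = records.join(features)

# Resetting the index first restores the row-by-row correspondence
aligned = records.reset_index(drop=True).join(features)

print(misaligned)
print(aligned)
```

An integer column such as nnz is converted to floats when a join introduces missing values, so a float nnz column can be a hint that some rows did not line up.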


In [18]:
# Save everything
dataset.to_csv('data/input spaces/dataset_features_ft.csv')