# Text Classification and Legacy Library Metadata

This project aims at:

* evaluating the potential and limitations of using legacy library metadata to automatically annotate textual documents;
* providing a model that automatically annotates textual documents and that system and metadata librarians can use to learn about and simulate the learning process.

The notebook is organized as follows:
    1. Introduction to text classification
        1.1 Concepts and terminology
        1.2 Classification process
        1.3 Learning algorithms
    2. Case study:
        2.1 Dataset
        2.2 Data cleaning and normalization
        2.3 Label binarization
        2.4 Feature engineering
        2.5 Text classification pipelines
        2.6 Analysis and reports
    3. References

## 1. Introduction to text classification
### 1.1. Concepts and definitions

**Text mining:** Text mining explores and identifies patterns in unstructured textual data with the aim of extracting structured information and inferring knowledge. It relies on methods from the fields of information retrieval, natural language processing, statistics, machine learning and data mining.

**Machine Learning**

Murphy (2012, p. 1) defines machine learning as a “set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty.” A learning model executes a program that uses training data to optimize a set of parameters. A model can suffer from two different issues. __Underfitting__ occurs when the model makes many errors on the training data. It happens when the hypothesis function is too simple. Adding features or increasing the complexity of the hypothesis helps to mitigate this problem. __Overfitting__ occurs when the model fits the training data too closely, in other words when the proportion of training errors is very low because the model has also learned the noise in the data. Consequently, the model will likely fail to generalize to new examples that are not in the training set.
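
The difference between the two issues can be illustrated with a toy sketch. The data, the three models and their names below are assumptions made up for this illustration; they are not part of the case study. The true rule is "label 1 when x >= 5", and two training labels are deliberately flipped to act as noise.

```python
# Toy 1-D training set: the true rule is "label 1 when x >= 5".
# The labels for x=2 and x=7 are deliberately flipped to simulate noise.
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Noise-free examples used to check how well each model generalizes.
test = [(x, int(x >= 5)) for x in range(10)]


def accuracy(predict, data):
    """Share of examples whose predicted label equals the true label."""
    return sum(predict(x) == y for x, y in data) / len(data)


def constant_model(x):
    # Too simple: ignores the input entirely (underfitting).
    return 0


def threshold_model(x):
    # Reasonable complexity: a single decision threshold.
    return int(x >= 5)


memory = dict(train)


def memorizing_model(x):
    # Too flexible: reproduces every training label, noise included (overfitting).
    return memory.get(x, 0)


for name, model in [("constant", constant_model),
                    ("threshold", threshold_model),
                    ("memorizing", memorizing_model)]:
    print(f"{name:<11} train={accuracy(model, train):.1f} "
          f"test={accuracy(model, test):.1f}")
```

The constant model makes many training errors, while the memorizing model makes none on the training set but scores lower than the threshold model on the noise-free examples.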

**Classification**

Classification is a learning task that attempts to learn a relationship between input __features__ and output __labels__ from a set of examples that are already labeled. In text classification, features are typically words or tokens found in the title or text. Labels are categories assigned to the text, for instance library subject headings. **Binary classification** refers to a learning problem whose goal is to output one label per example out of two possible classes. In **multi-class classification** problems, the aim is to output one label for each example out of q possible ones, where q > 2. **Multi-label classification** attempts to assign several labels to each text. It is a common problem when dealing with library material, because a book or an article is usually assigned several subject headings. Multi-label classification is a computationally expensive and complex learning problem. The complexity and computational costs increase with the size of the vocabulary used to index the documents.

**Vectorization** 

Vectorization is the process of converting a text into a vectorized representation of its content. Each token (word or group of words) becomes a feature represented by a numerical value that will be used by machine learning algorithms. There are several ways to build the vectorized representation of the text, including:
* each feature is represented by a boolean value (0 or 1); the process of converting features into boolean values is called **binarization**.
* each feature is represented by its **frequency** in the text.
* each feature is represented by its frequency in the text, weighted by how rare the feature is in the overall corpus. A common measure is the **term frequency-inverse document frequency (TF-IDF)**.
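
The three representations listed above can be sketched in plain Python. The corpus and the function names are hypothetical, and the TF-IDF weight used here (raw term frequency multiplied by `log(N / df)`) is only one common variant; libraries often apply smoothed or normalized formulas.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lower-cased and split on white space.
corpus = [
    "united nations development programme".split(),
    "united nations environment programme".split(),
    "development report on development aid".split(),
]
# One feature per distinct token, in alphabetical order.
vocabulary = sorted({token for doc in corpus for token in doc})


def binary_vector(doc):
    """1 when the token occurs in the text, 0 otherwise (binarization)."""
    present = set(doc)
    return [int(token in present) for token in vocabulary]


def frequency_vector(doc):
    """Raw number of occurrences of each token in the text."""
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]


def tfidf_vector(doc):
    """Term frequency weighted by log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(corpus)
    weights = []
    for token in vocabulary:
        # Number of documents in the corpus that contain the token.
        df = sum(token in other for other in corpus)
        weights.append(counts[token] * math.log(n_docs / df))
    return weights


print(vocabulary)
print(binary_vector(corpus[2]))
print(frequency_vector(corpus[2]))
print([round(w, 2) for w in tfidf_vector(corpus[2])])
```

In this sketch, a token that appears in few documents receives a higher weight than an equally frequent token that appears in many documents.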

**Document term matrix** 

The result of vectorization is a document-term matrix, in which each row represents a text and each column a feature. Each cell holds either a binary value (0 if the feature is not in the text, 1 if it is), the term frequency in the text, or the TF-IDF score. When the cells hold binary values, the matrix is called a **binary matrix**. Binary matrices are also used to represent the association of each record with each label in the dataset. For instance, the following binary document-term matrix represents two texts over eight features.

| 1965 | 1972 | development | environment | established | programme | united | nations |
|---|---|---|---|---|---|---|---|
|1|0|1|0|1|1|1|1|
|0|1|0|1|1|1|1|1|
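
A binary document-term matrix like the one above can be built with a few lines of Python. The two input texts are hypothetical: they were chosen only because they yield the same cell values as the table. Columns are sorted alphabetically here, so `nations` appears before `united`.

```python
import re

# Hypothetical texts, chosen so that the cells match the table above.
texts = [
    "United Nations Development Programme, established 1965",
    "United Nations Environment Programme, established 1972",
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


tokenized = [tokenize(text) for text in texts]
# One column per distinct token found in the corpus.
columns = sorted({token for tokens in tokenized for token in tokens})
# One row per text: 1 if the token is in the text, 0 otherwise.
matrix = [[int(column in set(tokens)) for column in columns]
          for tokens in tokenized]

print(" | ".join(columns))
for row in matrix:
    print(" | ".join(str(cell) for cell in row))
```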

**Label binarization** 

Label binarization is the process of converting labels into a **binary matrix** that represents the association of each record with each label in the dataset. For instance, in the following example, record 32 is labelled with ABKHAZIA (GEORGIA) and AERIAL BOMBINGS.

| record_id  | ABDUCTION  |  ABKHAZIA (GEORGIA) | ABU MUSA | ABYEI (SUDAN) | ADMINISTRATIVE PROCEDURE | AERIAL BOMBINGS |
|---|---|---|---|---|---|---|
|32|0|1|0|0|0|1|
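
The following minimal sketch shows how such a row can be computed. The label list is taken from the table header above; record 33 and the function names are hypothetical additions for illustration.

```python
# Label vocabulary, in the column order of the table above.
labels = [
    "ABDUCTION",
    "ABKHAZIA (GEORGIA)",
    "ABU MUSA",
    "ABYEI (SUDAN)",
    "ADMINISTRATIVE PROCEDURE",
    "AERIAL BOMBINGS",
]

# Subject headings per record; record 33 is a made-up example.
records = {
    32: ["ABKHAZIA (GEORGIA)", "AERIAL BOMBINGS"],
    33: ["ABDUCTION"],
}


def binarize(assigned):
    """Return one 0/1 value per label: 1 if the label is assigned."""
    assigned = set(assigned)
    return [int(label in assigned) for label in labels]


def inverse_binarize(row):
    """Recover the assigned labels from a binary row."""
    return [label for label, flag in zip(labels, row) if flag]


for record_id, assigned in records.items():
    print(record_id, binarize(assigned))
```

The inverse function is useful after prediction, when a model outputs a binary row that has to be turned back into subject headings.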

**Feature engineering**

Feature engineering consists of extracting and selecting the most pertinent features to build a simpler representation of the text, usually a vector representation. **Feature extraction** applies several pre-processing techniques such as tokenization, lower-casing, stop word filtering and stemming. **Tokenization** uses white spaces and special characters to split the text into a stream of words. **Lower-casing and stop word filtering** transform all words to lower case and remove non-significant words such as "the", "is", "a" and "an". **Stemming** strips prefixes and suffixes to reduce many word forms to a common root, which may be, but is not always, a valid semantic root. **Feature selection** is the process of selecting the most pertinent tokens to feed the learning algorithm.
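
These steps can be sketched as a small pipeline. Everything here is an assumption for illustration: the stop word list is tiny, the stemmer is a naive suffix stripper (not the Porter algorithm or any library stemmer), and feature selection is reduced to a minimum document frequency threshold.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "an", "of"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")          # naive suffix list


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_features(text):
    """Tokenize, lower-case, drop stop words, then stem."""
    tokens = [t for t in re.split(r"\W+", text) if t]
    kept = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
    return [stem(t) for t in kept]


def select_features(documents, min_df=2):
    """Keep the terms that occur in at least `min_df` documents."""
    df = Counter(term for doc in documents for term in set(doc))
    return sorted(term for term, count in df.items() if count >= min_df)


texts = [
    "The United Nations is establishing a development programme.",
    "An environment programme of the United Nations",
]
documents = [extract_features(text) for text in texts]
print(documents[0])
print(select_features(documents))
```

Note that the naive stemmer turns "united" into "unit", which shows why a stem is not always a valid semantic root.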

### 1.2. Text classification process

    
    1. Data pre-processing
        1.1 Data acquisition
        1.2 Data cleaning and normalization
    2. Training
        2.1 Label extraction and binarization
        2.2 Feature extraction and selection
        2.3 Model selection
        2.4 Testing and cross-validation
        2.5 Analysis
    3. Deployment
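
As an illustration of step 2.4, the sketch below splits example indices into k folds for cross-validation. The function name and the contiguous, unshuffled folds are assumptions made for simplicity; in practice the examples are usually shuffled before splitting.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # The first (n_examples % k) folds receive one extra example.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


for train_idx, test_idx in k_fold_indices(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

Each example is used for testing exactly once, and the model is trained on the remaining folds each time.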

### 1.3. Learning algorithms