# Text Classification and Legacy Library Metadata

This project is part of dissertation research conducted for a Master of Science in Information Technology at the [University of Liverpool](https://www.liverpool.ac.uk/). It aims at:

* evaluating the potential and limitations of using legacy library metadata to automatically annotate textual documents;
* providing a model that automatically annotates textual documents, which system and metadata librarians could use to learn about and simulate the learning process.

The notebook is organized as follows:

    1. Introduction to text classification
        1.1 Concepts and terminology
        1.2 Learning algorithms
        1.3 Classification process
        1.4 Classification metrics
    2. Case study:
        2.1 Dataset
        2.2 Data cleaning and normalization
        2.3 Label binarization
        2.4 Feature engineering
        2.5 Text classification pipelines
        2.6 Analysis and reports
    4. References

## 1. Introduction to text classification
### 1.1. Concepts and definitions

**Text mining:** Text mining explores and identifies patterns in unstructured textual data with the aim of extracting structured information and inferring knowledge. It relies on methods from the fields of information retrieval, natural language processing, statistics, machine learning and data mining.

**Machine learning:** Murphy (2012, p.1) defines machine learning as a “set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty.” A learning model executes a program that uses training data to optimize a set of parameters. A model can suffer from two different issues. __Underfitting__ occurs when the model makes many training errors, which happens when the hypothesis function is too simple. Adding features or increasing the complexity of the hypothesis helps to mitigate this problem. __Overfitting__ occurs when the model fits the training data too closely: the proportion of training errors is low, but the model will likely fail to generalize to new examples that are not in the training set.

**Classification:** Classification is a learning task that attempts to learn a relationship between input __features__ and output __labels__ from a set of examples that are already labeled. In text classification, features are typically words or tokens found in the title or text. Labels are categories assigned to the text, for instance library subject headings. **Binary classification** refers to a learning problem whose goal is to output one label per example out of two possible classes. In **multi-class classification** problems, the aim is to output one label for each example out of q possible ones, where q > 2. **Multi-label classification** attempts to assign several labels to each text. It is a common problem when dealing with library material, because a book or an article is usually assigned several subject headings. Multi-label classification is a computationally expensive and complex learning problem. The complexity and computational costs increase with the size of the vocabulary used to index the documents.

**Vectorization** is the process of converting a text into a vectorized representation of its content. Each token (a word or group of words) becomes a feature represented by a numerical value that machine learning algorithms can use. There are several ways to build the vectorized representation of the text, including:
* each feature is represented by a boolean value (0 or 1); the process of converting features into boolean values is called **binarization**;
* each feature is represented by its **frequency** in the text;
* each feature is represented by its frequency in the text normalized by the frequency of the feature in the overall corpus. A common measure is the **term frequency-inverse document frequency (TF-IDF)**.
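
The three weighting schemes above can be sketched with the Python standard library alone. This is a minimal illustration, not the notebook's actual pipeline: the function name `vectorize`, the whitespace tokenization, the two sample titles and the unsmoothed `tf * log(N / df)` formula are all assumptions made for the example, and libraries usually apply smoothed variants of TF-IDF.

```python
import math
from collections import Counter

def vectorize(texts, scheme="tfidf"):
    """Return (vocabulary, rows); each row holds one weight per vocabulary term."""
    # Assumption: lowercase the text and split on white space only.
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of texts that contain each term.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if scheme == "binary":
            row = [1 if counts[term] else 0 for term in vocabulary]
        elif scheme == "frequency":
            row = [counts[term] for term in vocabulary]
        elif scheme == "tfidf":
            # Textbook weighting without smoothing: tf * log(N / df).
            row = [counts[term] * math.log(n_docs / doc_freq[term])
                   for term in vocabulary]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        rows.append(row)
    return vocabulary, rows

# Two hypothetical titles used only for illustration.
texts = [
    "united nations development programme established 1965",
    "united nations environment programme established 1972",
]
vocabulary, binary_rows = vectorize(texts, "binary")
_, tfidf_rows = vectorize(texts, "tfidf")
```

With this unsmoothed formula, a term that appears in every text (such as `united` here) receives a TF-IDF weight of zero, which shows how the measure discounts terms that do not help to tell texts apart.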

**Document-term matrix:** The result of vectorization is a document-term matrix, where each row represents a text and each column a feature. Each cell holds either a binary value (0 if the feature is not in the text, 1 if it is), the term frequency in the text, or the TF-IDF score. When the cells hold binary values, the matrix is called a **binary matrix**. Binary matrices are also used to represent the association of each record with each label in the dataset. The following example is a binary document-term matrix for two texts.

| 1965 | 1972 | development | environment | established | programme | united | nations |
|---|---|---|---|---|---|---|---|
|1|0|1|0|1|1|1|1|
|0|1|0|1|1|1|1|1|

**Label binarization** is the process of converting labels into a **binary matrix** that represents the association of each record with each label in the dataset. For instance, in the following example, record 32 is labeled with ABKHAZIA (GEORGIA) and AERIAL BOMBINGS.

| record_id  | ABDUCTION  |  ABKHAZIA (GEORGIA) | ABU MUSA | ABYEI (SUDAN) | ADMINISTRATIVE PROCEDURE | AERIAL BOMBINGS |
|---|---|---|---|---|---|---|
|32|0|1|0|0|0|1|
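
As a rough sketch of how such a matrix can be built, the snippet below binarizes a mapping of record identifiers to subject headings. The function name `binarize_labels` and record 33 are hypothetical and only serve the example; libraries such as scikit-learn provide an equivalent transformer (`MultiLabelBinarizer`).

```python
def binarize_labels(record_labels):
    """Turn {record_id: [labels]} into (sorted label list, {record_id: 0/1 row})."""
    all_labels = sorted({label
                         for labels in record_labels.values()
                         for label in labels})
    matrix = {
        record_id: [1 if label in labels else 0 for label in all_labels]
        for record_id, labels in record_labels.items()
    }
    return all_labels, matrix

record_labels = {
    32: ["ABKHAZIA (GEORGIA)", "AERIAL BOMBINGS"],
    33: ["ABDUCTION"],  # hypothetical second record, added for illustration
}
labels, matrix = binarize_labels(record_labels)
```

Each row has one column per label seen in the dataset, so the width of the matrix grows with the size of the indexing vocabulary.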

**Feature engineering** consists of extracting and selecting the most pertinent features to build a simpler representation of the text, usually a vector representation. **Feature extraction** applies several pre-processing techniques such as tokenization, stemming, lowercasing and stop-word filtering. **Tokenization** uses white space and special characters to split the text into a stream of words. **Lowercasing and stop-word filtering** transform all words to lower case and remove non-significant words such as "the", "is", "a" and "an". **Stemming** strips prefixes and suffixes to reduce many word forms to a common root, which may or may not be a valid semantic root. **Feature selection** is the process of selecting the most pertinent tokens to feed the learning algorithm.
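
The feature extraction steps can be illustrated with a small standard-library sketch. Everything here is an assumption made for the example: the stop-word list is a tiny illustrative subset, and `naive_stem` only strips a few English suffixes (no prefixes), so it is a stand-in for a real stemming algorithm rather than an implementation of one.

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "on", "to"}  # illustrative subset
SUFFIXES = ("ing", "ed", "es", "s")  # naive suffix list, not a real stemmer

def tokenize(text):
    """Split the text on any run of characters that are not letters or digits."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def extract_features(text):
    """Tokenize, lowercase, drop stop words, then stem the remaining tokens."""
    tokens = [token.lower() for token in tokenize(text)]
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return [naive_stem(token) for token in tokens]

features = extract_features("The United Nations is establishing a Development Programme")
# features == ["unit", "nation", "establish", "development", "programme"]
```

The stem `unit` for "united" shows why a stem is not always a valid semantic root: the reduction is purely mechanical.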

## 1.2 Learning algorithms

## 1.3 Text classification process

The following figure summarizes the classification process. In practice, the training and testing phases are often performed together.

<img src='img/classification.png'/>

### Pre-processing
**Aim:** to acquire the data and transform it into an appropriate format that can be used to train a model.

**Input:** input files are library metadata records, including an identifier, a title, a subject field, and a full-text field or a URL to access the full text.

**Sub-processes:**
* Data collection: acquire the metadata and full text of the resources needed to train a model;
* Data cleaning, normalization and transformation: clean and normalize key fields in the dataset;
* Data filtering and reduction: remove records or fields that are not needed;
* Dataset preparation: split the dataset into three subsets for training, validation and testing.

**Output:** three distinct datasets.
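
The dataset preparation step can be sketched as a shuffled three-way split. The 60/20/20 proportions, the fixed seed, the function name `split_dataset` and the sample records are assumptions chosen for the example, not values taken from this project.

```python
import random

def split_dataset(records, train=0.6, validation=0.2, seed=42):
    """Shuffle a copy of the records and cut it into train/validation/test lists."""
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return training_set, validation_set, test_set

# Hypothetical metadata records with an identifier and a title.
records = [{"id": i, "title": f"title {i}"} for i in range(10)]
training_set, validation_set, test_set = split_dataset(records)
```

Each record ends up in exactly one subset, which is what allows the test set to measure how well the selected model generalizes to unseen data.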

### Training
**Aim:** train several models, each combining different parameters and algorithms for feature selection and classification.

**Input:** a training set composed, at a minimum, of an identifier, a field representing labels, and a field representing features.

**Sub-processes:**
* Label extraction and binarization
* Feature extraction and vectorization
* Training: train several models using different parameters for feature selection and different algorithms

**Output:** several trained models to be further evaluated and validated.

### Validation
**Aim:** validate the different models on a validation set to select the best-performing one.

**Input:** several trained models

**Sub-processes:**
* Label prediction: apply each trained model to the validation set to predict the labels;
* Model evaluation: use several metrics to evaluate which model performs best;
* Analysis: analyze the output of each model and the dataset to determine whether better results could be achieved by using another training set, or by reducing and filtering the dataset further;
* Model selection: select the best model.

**Output:** 
* Several reports, including classification reports (metrics) and insights on the dataset;
* The best-performing model.
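
Model evaluation and selection can be sketched as follows for the multi-label case. The choice of micro-averaged F1 as the selection metric is an assumption made for this example, and the model names and their predictions are hypothetical.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over parallel lists of label sets."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        truth, predicted = set(truth), set(predicted)
        tp += len(truth & predicted)   # labels correctly predicted
        fp += len(predicted - truth)   # labels predicted but not assigned
        fn += len(truth - predicted)   # labels assigned but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_best_model(true_labels, predictions_by_model):
    """Score every model on the validation labels and return the best one."""
    scores = {name: micro_f1(true_labels, predictions)
              for name, predictions in predictions_by_model.items()}
    best = max(scores, key=scores.get)
    return best, scores

validation_truth = [{"ABDUCTION"}, {"ABKHAZIA (GEORGIA)", "AERIAL BOMBINGS"}]
predictions_by_model = {  # hypothetical outputs of two trained models
    "model_a": [{"ABDUCTION"}, {"AERIAL BOMBINGS"}],
    "model_b": [{"ABU MUSA"}, {"AERIAL BOMBINGS", "ABDUCTION"}],
}
best_model, scores = select_best_model(validation_truth, predictions_by_model)
```

Micro-averaging pools the counts over all records and labels, so frequent labels weigh more than rare ones; a macro-averaged score would be an alternative when rare subject headings matter as much as common ones.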

### Testing
**Aim:** to test the best-performing model on a set of data that was not used during the training and validation phases, to ensure that it generalizes well to other data.

**Input:** the best-performing model

**Sub-processes:**
* Testing
* Comparison of the results with those of the previous phases

**Output:** 
* Several reports on the model performance during the testing phase. The model is selected for deployment once it performs well in the testing phase.