## Introduction to Text Classification

*Copyright (c) 2022 Institute for Quantum Computing, Baidu Inc. All Rights Reserved.*

Natural language processing (NLP) is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language. Organizations today have large volumes of voice and text data from various communication channels like emails, text messages, social media newsfeeds, video, audio, and more. They use NLP software to automatically process this data, analyze the intent or sentiment in the message, and respond in real time to human communication.

Text classification is one of the fundamental tasks in NLP: it predicts the category of an input text. It underlies applications such as news headline classification and sentiment analysis.

Here, we use news headline classification as an example to show how Quantum Machine Learning (QML) can be applied to text classification problems.

The dataset contains news headlines from two categories: the real estate industry and the automobile industry. The training set contains 400 headlines and the test set contains 100 headlines. Some sample headlines are shown below.

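- 奔驰GLS怎么样？ (How is the Mercedes-Benz GLS?)
- 如何评价襄阳房价？ (What do you think of housing prices in Xiangyang?)
- 南宁的城建怎么样？ (How is urban construction in Nanning?)
- 保时捷为什么这么贵 (Why is Porsche so expensive?)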

## News Headline Classification Using the QSANN Model

### Introduction to the QSANN Model

The Quantum Self-Attention Neural Network (QSANN) is a hybrid quantum-classical algorithm for supervised learning. It uses parameterized quantum circuits (PQCs) to encode the features of text data, a self-attention mechanism to extract features, and finally a fully connected neural network to produce the classification result.

In summary, QSANN works as follows.

1. Each word of the input text is mapped to a corresponding parameterized quantum circuit. The quantum state produced by evolving this circuit is the feature representation of that word.
2. The self-attention mechanism processes these quantum states to obtain refined feature representations.
3. A fully connected neural network processes the resulting features and outputs the predicted class.
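
To make the data flow of the three steps concrete, here is a purely classical toy stand-in written in plain Python. It is not the Paddle Quantum implementation and simulates no real multi-qubit circuit. Everything in it is an assumption made for illustration: each "circuit" is one `RY` rotation per qubit on `|0>` read out as `<Z> = cos(theta)`, the attention weights use a Gaussian of the query-key difference with a residual connection, the classifier is a single logistic unit after mean pooling, and the words and all parameter values are random placeholders.

```python
import math
import random


def word_features(thetas):
    """Step 1 stand-in: RY(theta) on |0> per qubit, measured as <Z> = cos(theta)."""
    return [math.cos(theta) for theta in thetas]


def dot(weights, values):
    return sum(w * v for w, v in zip(weights, values))


def self_attention(features, w_query, w_key):
    """Step 2 stand-in: Gaussian attention on scalar queries/keys, plus a residual."""
    queries = [dot(w_query, f) for f in features]
    keys = [dot(w_key, f) for f in features]
    attended = []
    for query, own in zip(queries, features):
        scores = [math.exp(-(query - key) ** 2) for key in keys]
        total = sum(scores)
        # Weighted average of all word features, using normalized scores.
        mixed = [
            sum(s / total * f[i] for s, f in zip(scores, features))
            for i in range(len(own))
        ]
        attended.append([a + b for a, b in zip(own, mixed)])
    return attended


def classify(features, w_out, bias):
    """Step 3 stand-in: mean-pool over words, then one logistic unit."""
    pooled = [sum(column) / len(features) for column in zip(*features)]
    return 1.0 / (1.0 + math.exp(-(dot(w_out, pooled) + bias)))


random.seed(0)
num_qubits = 6
words = ['porsche', 'why', 'so', 'expensive']  # hypothetical tokenized headline

# Hypothetical, untrained parameters; training would adjust all of these.
angles = {w: [random.uniform(0, math.pi) for _ in range(num_qubits)] for w in words}
w_query = [random.uniform(-1, 1) for _ in range(num_qubits)]
w_key = [random.uniform(-1, 1) for _ in range(num_qubits)]
w_out = [random.uniform(-1, 1) for _ in range(num_qubits)]

features = [word_features(angles[w]) for w in words]
attended = self_attention(features, w_query, w_key)
probability = classify(attended, w_out, bias=0.0)
print(f'Probability of class 1: {probability:.3f}')
```

Because the parameters are random, the printed probability carries no meaning; the sketch only shows how per-word features flow through attention into a single prediction.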

### Workflow

QSANN is a learning model, so it must first be trained on a dataset. Once training converges, the trained model can classify data for the corresponding task. The workflow is therefore as follows.

1. Prepare the dataset.
2. Train with the dataset to obtain a trained model.
3. Use the trained model to predict the category of the input text.

## How to Use

### Predict Using the Model

We provide a trained model that can be used directly to classify news headlines from the real estate and automobile industries. Set the corresponding options in the configuration file `example.toml`, then run `python qsann_classification.py --config example.toml` to predict the category of the input text with the trained QSANN model.

### Online Demo

The demo below can be run online. First, define the contents of the configuration file.

```python
test_toml = r"""
# The overall configuration file of the model.
# The current task: 'train' for training or 'test' for prediction. Here we use 'test' to make a prediction.
task = 'test'
# The text to be tested.
text = '奔驰GLS怎么样？'
# The path of the trained model, which will be loaded.
model_path = 'qsann.pdparams'
# The path of the vocabulary file in the dataset.
vocab_path = 'headlines500/vocab.txt'
# The number of qubits in the quantum circuit.
num_qubits = 6
# The number of self-attention layers.
num_layers = 1
# The depth of the embedding circuit.
depth_ebd = 1
# The depth of the query circuit.
depth_query = 1
# The depth of the key circuit.
depth_key = 1
# The depth of the value circuit.
depth_value = 1
# The classes the input text can be assigned to ('real estate industry', 'automobile industry').
classes = ['房地产行业', '汽车行业']
"""
```


Next, run the prediction code.

```python
import os
import warnings

# Silence library warnings and select the pure-Python protobuf implementation
# before paddle_quantum is imported.
warnings.filterwarnings('ignore')
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

import toml
from paddle_quantum.qml.qsann import train, inference

# Parse the configuration and dispatch on the requested task.
config = toml.loads(test_toml)
task = config.pop('task')
if task == 'train':
    train(**config)
elif task == 'test':
    prediction = inference(**config)
    text = config['text']
    print(f'The input text is {text}.')
    print(f'The prediction of the model is {prediction}.')
else:
    raise ValueError("Unknown task, it can be train or test.")
```

```text
The input text is 奔驰GLS怎么样？.
The prediction of the model is 汽车行业.
```


To test other headlines, change the value of `text` in the configuration and rerun the code.
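
When several headlines need to be checked, the edit-and-rerun cycle can be wrapped in a small loop. The sketch below is one possible way to do it. `classify_many` and `fake_predict` are hypothetical names introduced here, not part of Paddle Quantum. `fake_predict` is a keyword-matching stand-in so the example runs without a trained model, and the shortened `base_config` is illustrative only.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def classify_many(
    base_config: Dict,
    texts: Iterable[str],
    predict: Callable[..., str],
) -> List[Tuple[str, str]]:
    """Run `predict` once per text, swapping only the `text` entry of the config."""
    results = []
    for text in texts:
        config = dict(base_config, text=text)  # shallow copy; base_config is not mutated
        config.pop('task', None)               # the demo removes `task` before predicting
        results.append((text, predict(**config)))
    return results


def fake_predict(text, classes, **_unused):
    """Hypothetical stand-in for a real predictor: a keyword match, not a model."""
    is_car = '奔驰' in text or '保时捷' in text
    return classes[1] if is_car else classes[0]


# Shortened, illustrative configuration; a real one carries all keys shown in the demo.
base_config = {'task': 'test', 'text': '', 'classes': ['房地产行业', '汽车行业']}
headlines = ['奔驰GLS怎么样？', '如何评价襄阳房价？']

results = classify_many(base_config, headlines, fake_predict)
for headline, label in results:
    print(f'{headline} -> {label}')
```

With the demo's setup, passing `toml.loads(test_toml)` as `base_config` and `inference` as `predict` should reproduce the single-text call shown above for each headline, assuming `inference` accepts the same keyword arguments as in that call.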

## Note

The provided model classifies news headlines from the automobile and real estate industries. Developers can also train models on their own datasets.

### Structure of the Dataset

To train on a custom dataset, prepare it in the following format. Place `train.txt` and `test.txt` in the dataset folder, plus `dev.txt` if a validation set is needed. In each file, one line represents one sample. Each line contains a text and its label, separated by a tab. The text consists of space-separated words.
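
The line format described above can be illustrated with a short round-trip sketch that writes and reads `train.txt` and `test.txt` in a temporary folder. The helper names, the word segmentation of the sample headlines, and the label strings `'0'` and `'1'` are assumptions made for illustration; the source does not specify the label values or how the vocabulary file is produced.

```python
import os
import tempfile


def format_sample(words, label):
    """One sample per line: space-separated words, a tab, then the label."""
    return ' '.join(words) + '\t' + label


def parse_sample(line):
    """Inverse of format_sample: split on the last tab, then on spaces."""
    text, label = line.rstrip('\n').rsplit('\t', 1)
    return text.split(' '), label


def write_split(path, samples):
    with open(path, 'w', encoding='utf-8') as f:
        for words, label in samples:
            f.write(format_sample(words, label) + '\n')


def read_split(path):
    with open(path, encoding='utf-8') as f:
        return [parse_sample(line) for line in f if line.strip()]


# Hypothetical segmentation and labels ('0' = real estate, '1' = automobile).
train_samples = [
    (['保时捷', '为什么', '这么', '贵'], '1'),
    (['如何', '评价', '襄阳', '房价'], '0'),
]
test_samples = [(['南宁', '的', '城建', '怎么样'], '0')]

with tempfile.TemporaryDirectory() as dataset_dir:
    write_split(os.path.join(dataset_dir, 'train.txt'), train_samples)
    write_split(os.path.join(dataset_dir, 'test.txt'), test_samples)
    loaded_train = read_split(os.path.join(dataset_dir, 'train.txt'))
    loaded_test = read_split(os.path.join(dataset_dir, 'test.txt'))

print(loaded_train[0])
```

Splitting on the last tab keeps the parser tolerant of stray tabs inside the text, although a well-formed file should contain exactly one tab per line.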

### Introduction to the Configuration File

`test.toml` contains a complete reference configuration for testing, and `train.toml` contains one for training. Run `python qsann_classification.py --config train.toml` to train the model, and `python qsann_classification.py --config test.toml` to load the trained model for testing.


## Citation

```tex
@article{li2022quantum,
  title={Quantum Self-Attention Neural Networks for Text Classification},
  author={Li, Guangxi and Zhao, Xuanqiang and Wang, Xin},
  journal={arXiv preprint arXiv:2205.05625},
  year={2022}
}
```
