# Quantum support vector machines for text classification

The support vector machine (SVM) is a widely used machine learning algorithm for both binary and multiclass classification. Its computational cost depends on the choice of kernel, the function used to map the input data into a higher-dimensional feature space.

Kernels differ in how well they work, and there is often a trade-off between a kernel's accuracy and its computational cost. Some datasets call for a more complex kernel in order to pick out intricate patterns in the data. For certain problems, a quantum kernel has the potential to reduce the required processing time, although this is not guaranteed for any given dataset.
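To illustrate what "mapping into a higher-dimensional feature space" means, here is a minimal NumPy sketch using a degree-2 polynomial kernel. The function names `poly2_kernel` and `poly2_features` are illustrative and not part of this notebook. The point is that the kernel value equals an inner product in a larger feature space without ever building that space explicitly.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def poly2_features(x):
    """Explicit feature map whose inner product equals poly2_kernel.

    For a d-dimensional input it lists all d*d products x_i * x_j.
    """
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

k_implicit = poly2_kernel(x, y)                                    # works on 3 numbers
k_explicit = float(np.dot(poly2_features(x), poly2_features(y)))   # works on 9 numbers
print(k_implicit, k_explicit)
```

Both values agree, but the explicit route grows quadratically with the input dimension. A quantum kernel follows the same idea, with the feature space being the state space of the qubits.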

## Importing libraries

In [100]:
# General Imports
import numpy as np
import pandas as pd

# Visualisation Imports
import matplotlib.pyplot as plt

# Scikit Imports
from sklearn import datasets, model_selection, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import f1_score

# Qiskit Imports
from qiskit import Aer, execute
from qiskit.circuit import QuantumCircuit, Parameter, ParameterVector, QuantumRegister, ClassicalRegister
from qiskit.circuit.library import PauliFeatureMap, ZFeatureMap, ZZFeatureMap
from qiskit.circuit.library import TwoLocal, NLocal, RealAmplitudes, EfficientSU2
from qiskit.circuit.library import HGate, RXGate, RYGate, RZGate, CXGate, CRXGate, CRZGate, XGate
from qiskit.circuit.library.data_preparation import StatePreparation
from qiskit_machine_learning.kernels import QuantumKernel


# NLP Imports
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer

## Filtering the data

The data is filtered to make it easier to work with. You can skip this part if you have already saved the cleaned data.

In [101]:
data = pd.read_excel('../ressources/generation-energy-idea-and-comment-submissions.xlsx', sheet_name=1, usecols=['Idea or Comment/Idée ou commentaire', 'Idea or Comment Description'])
data.rename(columns = {'Idea or Comment/Idée ou commentaire':'label', 'Idea or Comment Description':'text'}, inplace = True)
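# Drop rows without text (dropna on a single column does not remove rows from the DataFrame)
# and reset the index so that positional and label-based indexing agree later on
data = data.dropna(subset=['text']).reset_index(drop=True)
data['text'] = data['text'].astype(str).str.lower()
data['text'] = [word_tokenize(entry) for entry in data['text']]  # break each entry into a list of tokens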
data['label'] = data['label'].apply(lambda x: 1 if x == "Idea/Idée" else 0)
data.head()

Unnamed: 0,label,text
0,1,"[investing, more, money, on, research, and, de..."
1,0,"[ralph, klein, was, a, bell, end, for, selling..."
2,0,"[put, enough, efforts, ,, resources, and, comm..."
3,1,"[there, are, so, many, good, ideas, about, cle..."
4,0,"[i, agree, ., i, live, in, a, small, town, abo..."


### Lemmatization and stop words

Remove everything that is not a word or not relevant to the data.

Source: https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34

In [102]:
# Remove stop words and non-alphabetic tokens, then lemmatize each remaining word.
# WordNetLemmatizer needs a POS tag to know whether a word is a noun, verb, adjective, etc. The default used here is noun.
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
# Build the stop-word set and the lemmatizer once instead of once per word or entry
stop_words = set(stopwords.words('english'))
word_Lemmatized = WordNetLemmatizer()
for index, entry in enumerate(data['text']):
    # Words of this entry that pass the filters below
    Final_words = []
    # pos_tag returns the tag of each word, e.g. noun (N) or verb (V)
    for word, tag in pos_tag(entry):
        # Keep only alphabetic tokens that are not stop words
        if word not in stop_words and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
            Final_words.append(word_Final)
    # Store the processed words of this entry in 'trainable_data'
    data.loc[index, 'trainable_data'] = str(Final_words)

In [103]:
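# convert the string representation of each list back into a list
# (ast.literal_eval only parses literals, unlike eval, which executes arbitrary code)
import ast
data['trainable_data'] = data['trainable_data'].apply(ast.literal_eval)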

# assemble the list of words into a string
data['trainable_data'] = data['trainable_data'].apply(lambda x: ' '.join(x))

### Saving the data

In [104]:
data.to_csv('./ressources/data_cleaned.csv', index=False)

In [105]:
data = pd.read_csv('./ressources/data_cleaned.csv')
data['trainable_data'] = data['trainable_data'].astype(str)
reduced_data = data[0:400]

In [106]:
sample_train, sample_test, label_train, label_test = model_selection.train_test_split(reduced_data['trainable_data'], reduced_data['label'], train_size=0.80, test_size=0.20)

In [107]:
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(label_train)
Test_Y = Encoder.transform(label_test)  # reuse the mapping fitted on the training labels
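Fitting a label encoder separately on the test labels can assign different integer codes than the fit on the training labels whenever the two splits do not contain the same classes. The sketch below shows the effect with a hypothetical stand-in class, `TinyLabelEncoder`, which is not scikit-learn's implementation, and with made-up labels.

```python
class TinyLabelEncoder:
    """Hypothetical stand-in that maps sorted unique labels to 0..n-1."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        return [self._index[label] for label in labels]

    def fit_transform(self, labels):
        return self.fit(labels).transform(labels)

train_labels = ["comment", "idea", "idea", "comment"]
test_labels = ["idea", "idea"]   # this split happens to contain one class only

encoder = TinyLabelEncoder()
train_codes = encoder.fit_transform(train_labels)
consistent = encoder.transform(test_labels)             # reuses the training mapping
refit = TinyLabelEncoder().fit_transform(test_labels)   # builds a new mapping
print(train_codes, consistent, refit)
```

Here "idea" is encoded as 1 by the training mapping but as 0 after refitting on the test labels, which is why the encoder should be fitted once and then only applied.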

## Vectorizing the data

For this step we first tried a BERT model, but after reducing its output to a vector of size 10 we could not reach an accuracy similar to the baseline SVM. Without such a reduction, the vector produced by BERT would require more than 700 qubits. We therefore switched to an existing method available in sklearn, TF-IDF vectorization.
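The text above does not say how the embedding was reduced to 10 values. As one common option, and only as an assumption, the sketch below applies PCA via a singular value decomposition to random placeholder vectors of size 768 (the hidden size of BERT-base). The helper `pca_reduce` is illustrative and not part of the notebook.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto their top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768))   # placeholder for 768-dimensional text embeddings
reduced = pca_reduce(embeddings, n_components=10)
print(reduced.shape)  # (50, 10)
```

Since the feature map used later needs one qubit per feature, going from 768 to 10 components is what makes the circuit tractable, at the price of discarding most of the information in the embedding.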

In [108]:
Tfidf_vect = TfidfVectorizer(max_features=10)
Tfidf_vect.fit(data['trainable_data'])
Train_X_Tfidf = Tfidf_vect.transform(sample_train)
Test_X_Tfidf = Tfidf_vect.transform(sample_test)

Train_X = Train_X_Tfidf.toarray()
Test_X = Test_X_Tfidf.toarray()
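To make the vectorization step concrete, here is a simplified, standard-library sketch of the kind of weighting a TF-IDF vectorizer applies: keep the most frequent terms, weight term counts by a smoothed inverse document frequency, and scale each vector to unit length. The whitespace tokenizer, the toy documents, and the helper names `tfidf_fit` and `tfidf_transform` are assumptions for illustration; scikit-learn's tokenization and tie-breaking differ.

```python
import math
from collections import Counter

def tfidf_fit(docs, max_features):
    """Select the most frequent terms and compute their smoothed IDF weights."""
    tokenized = [doc.split() for doc in docs]
    term_counts = Counter(tok for doc in tokenized for tok in doc)
    vocab = sorted(term for term, _ in term_counts.most_common(max_features))
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf

def tfidf_transform(doc, vocab, idf):
    """Return the L2-normalized TF-IDF vector of one document."""
    counts = Counter(doc.split())
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else weights

docs = ["energy policy energy", "clean energy research", "energy policy tax"]
vocab, idf = tfidf_fit(docs, max_features=2)
vector = tfidf_transform("energy policy", vocab, idf)
empty = tfidf_transform("solar panels", vocab, idf)
print(vocab, vector, empty)
```

The rarer term receives the larger weight, and a text containing none of the selected terms maps to the all-zero vector. With only 10 features, that second case is worth keeping in mind, since all such texts become indistinguishable to the classifier.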

## Creating the quantum kernel

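Several kernels were tested. The one used here is built on the ZZFeatureMap, an entangling feature map that is conjectured to be hard to simulate classically, which makes it a natural candidate for a quantum advantage.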

In [109]:
# zz feature map
zz_map = ZZFeatureMap(feature_dimension=10, reps=1, entanglement='linear', insert_barriers=True)
zz_kernel = QuantumKernel(feature_map=zz_map, quantum_instance=Aer.get_backend('statevector_simulator'))
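For intuition about what such a kernel computes, the following NumPy sketch evaluates a fidelity kernel, K(x, y) = |⟨φ(x)|φ(y)⟩|², for a ZZ-style feature map with one repetition and linear entanglement. The gate sequence and phase conventions in `zz_state` reflect our reading of the feature map and should be treated as an assumption, not as a reference implementation; the function names and the random inputs are illustrative.

```python
import numpy as np
from itertools import product

def zz_state(x):
    """Statevector of a one-repetition ZZ-style feature map, linear entanglement.

    Assumed circuit: Hadamard on every qubit, a phase gate P(2*x_i) on qubit i,
    then CX, P(2*(pi - x_i)*(pi - x_j)), CX on neighbouring pairs (i, i+1).
    After the Hadamards every gate is diagonal, so each basis state
    only picks up a phase.
    """
    n = len(x)
    amplitudes = np.empty(2 ** n, dtype=complex)
    for idx, bits in enumerate(product((0, 1), repeat=n)):
        phase = sum(2.0 * x[i] * bits[i] for i in range(n))
        phase += sum(
            2.0 * (np.pi - x[i]) * (np.pi - x[i + 1]) * (bits[i] ^ bits[i + 1])
            for i in range(n - 1)
        )
        amplitudes[idx] = np.exp(1j * phase)
    return amplitudes / np.sqrt(2 ** n)

def fidelity_kernel(A, B):
    """Gram matrix with entries K[i, j] = |<phi(A[i]) | phi(B[j])>|^2."""
    states_a = np.array([zz_state(a) for a in A])
    states_b = np.array([zz_state(b) for b in B])
    return np.abs(states_a.conj() @ states_b.T) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(4, 3))   # 4 samples, 3 features, hence 3 qubits
K = fidelity_kernel(X, X)
print(np.round(K, 3))
```

Each feature occupies one qubit, so the 10 TF-IDF features used in this notebook correspond to a 10-qubit circuit and a statevector of 1024 amplitudes. A function with the shape of `fidelity_kernel`, taking two arrays and returning a Gram matrix, is also what `SVC` expects when it is given a callable kernel.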

In [None]:
zz_circuit = zz_kernel.construct_circuit(Train_X[0], Train_X[1])
zz_circuit.decompose().decompose().draw(output='mpl')

In [111]:
zzcb_svc = SVC(kernel=zz_kernel.evaluate)
zzcb_svc.fit(Train_X, label_train)
zzcb_score = zzcb_svc.score(Test_X, label_test)

print("QSVM Accuracy Score -> ","%.2f" % (zzcb_score*100))

QSVM Accuracy Score ->  67.50


## Hyperparameter tuning



In [112]:
for i in range(1,20):
    zz_map = ZZFeatureMap(feature_dimension=10, reps=i, entanglement='linear', insert_barriers=True)
    zz_kernel = QuantumKernel(feature_map=zz_map, quantum_instance=Aer.get_backend('statevector_simulator'))
    zz_circuit = zz_kernel.construct_circuit(Train_X[0], Train_X[1])
    print("depth : " + str(zz_circuit.decompose().decompose().depth()))

depth : 59
depth : 117
depth : 175
depth : 233
depth : 291
depth : 349
depth : 407
depth : 465
depth : 523
depth : 581
depth : 639
depth : 697
depth : 755
depth : 813
depth : 871
depth : 929
depth : 987
depth : 1045
depth : 1103
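The depths printed above grow by a constant amount per repetition. A short check on the listed values, which only describes this particular output and is not a general formula for the feature map:

```python
# Depths printed by the loop above, for reps = 1..19.
depths = [59, 117, 175, 233, 291, 349, 407, 465, 523, 581,
          639, 697, 755, 813, 871, 929, 987, 1045, 1103]

# Differences between consecutive depths; a single distinct value means linear growth.
steps = [b - a for a, b in zip(depths, depths[1:])]
step = steps[0]
offset = depths[0] - step

def predicted_depth(reps):
    """Linear model read off the printed values."""
    return step * reps + offset

print(set(steps), offset)
```

For these values the increment is 58 per repetition with an offset of 1, so the decomposed circuit depth is 58 × reps + 1 in this setting.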


In [113]:
for i in range(1,20):
    zz_map = ZZFeatureMap(feature_dimension=10, reps=i, entanglement='linear', insert_barriers=True)
    zz_kernel = QuantumKernel(feature_map=zz_map, quantum_instance=Aer.get_backend('statevector_simulator'))
    zzcb_svc = SVC(kernel=zz_kernel.evaluate)
    zzcb_svc.fit(Train_X, label_train)
    zzcb_score = zzcb_svc.score(Test_X, label_test)

    print("QSVM Accuracy Score with " + str(i) + " reps -> ","%.2f" % (zzcb_score*100))

QSVM Accuracy Score with 1reps ->  67.50
QSVM Accuracy Score with 2reps ->  67.50
QSVM Accuracy Score with 3reps ->  67.50
QSVM Accuracy Score with 4reps ->  71.25
QSVM Accuracy Score with 5reps ->  67.50
QSVM Accuracy Score with 6reps ->  67.50
QSVM Accuracy Score with 7reps ->  67.50
QSVM Accuracy Score with 8reps ->  67.50
QSVM Accuracy Score with 9reps ->  67.50
QSVM Accuracy Score with 10reps ->  67.50
QSVM Accuracy Score with 11reps ->  67.50
QSVM Accuracy Score with 12reps ->  67.50
QSVM Accuracy Score with 13reps ->  67.50
QSVM Accuracy Score with 14reps ->  67.50


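As the outputs show, the circuit depth scales linearly with the number of repetitions, growing by 58 per repetition. The accuracy, on the other hand, does not improve steadily: it stays at 67.50% for every repetition count shown except 4, where it reaches 71.25%. If speed is the main concern, one or two repetitions of the feature map are good enough.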
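To put the two accuracy values in perspective, the sketch below converts them into numbers of correct predictions and compares their gap with the binomial standard error of an accuracy measured on the test split. The test-set size of 80 follows from the 400 rows and the 20% test fraction used earlier; treating the test predictions as independent draws is a simplifying assumption.

```python
import math

n_test = 80          # 20% of the 400 rows kept in reduced_data
acc_base = 0.6750    # score reported for most repetition counts
acc_best = 0.7125    # score reported for 4 repetitions

correct_base = round(acc_base * n_test)
correct_best = round(acc_best * n_test)

# Binomial standard error of an accuracy estimated on n_test samples.
std_err = math.sqrt(acc_base * (1.0 - acc_base) / n_test)

print(correct_best - correct_base)   # extra correct predictions at 4 repetitions
print(round(std_err, 3))
```

The gap between the two scores corresponds to 3 test samples and is smaller than one standard error, so a single split of this size is not enough to conclude that 4 repetitions are genuinely better.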

## Conclusion

To determine whether quantum support vector machines (QSVM) offer a competitive edge in industrial use cases, they have to be tested empirically. As of now, there is no definitive evidence that a QSVM can consistently outperform its classical SVM counterpart across datasets.