<a href="https://colab.research.google.com/github/Psonu2003/AQSVC_German_Credit_data/blob/main/AQSVC_German_Credit_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Author: Pratham Gujar

Email: pratham.gujar30@gmail.com

# Imports and Installations

In [1]:
%pip install qiskit -q
%pip install pylatexenc -q
%pip install qiskit_machine_learning -q
%pip install imblearn -q
%pip install qiskit-algorithms -q
%pip install qiskit-aer -q

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.7/6.7 MB[0m [31m34.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.4/119.4 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m46.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.5/49.5 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.7/49.7 MB[0m [31m12.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m108.5/108.5 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m162.6/162.6 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for pylatexenc (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
# Imports
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
import numpy as np
from qiskit_machine_learning.kernels import FidelityQuantumKernel
from qiskit.circuit.library import ZZFeatureMap
from qiskit_aer import AerSimulator
from qiskit_machine_learning.algorithms import QSVC
from sklearn.metrics import classification_report, recall_score
import itertools
from time import time

# Part 1: Simple AQSVC For German Credit Data

I used a preprocessed version of the Statlog German Credit dataset in which the features are encoded so that the QSVC model can use them directly: the alphanumeric category codes were removed and replaced with extra columns.


In [None]:
def adjusted_features(df, save=True, output_file='German_Adjusted_Features.csv'):
  """
  Keeps only the features whose correlation with 'classification' is
  >= 0.55 or <= -0.05, plus the 'classification' column itself.

  Args:
      df (pandas.DataFrame): The input DataFrame.
      save (bool, optional): Whether to save the adjusted features. Defaults to True.
      output_file (str, optional): The output file path. Defaults to 'German_Adjusted_Features.csv'.

  Returns:
      pandas.DataFrame: The DataFrame with adjusted features.
  """

  correlation_matrix = df.corr()

  # Extract correlations with the target 'classification'
  correlations_with_target = correlation_matrix['classification'].drop('classification').sort_values(ascending=False)

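  # Keep strongly positive (>= 0.55) or at least weakly negative (<= -0.05) correlations.
  # The list gets its own name so it does not shadow the enclosing function.
  selected_features = correlations_with_target[
      (correlations_with_target >= 0.55) | (correlations_with_target <= -0.05)
  ].index.tolist()

  adjusted_features_data = df[selected_features + ['classification']]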

  if save:
    adjusted_features_data.to_csv(output_file, index=False)

  return adjusted_features_data



I noticed that the QSVC model had trouble fitting the minority class because many more features had a positive correlation with the target than a negative one. To improve the fit on the minority class, I removed most of the positively correlated features so that the model gives more weight to the features associated with the minority class. I was careful not to trim too much, so that the score stayed at 0.7 or above.
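As a minimal sketch of the threshold rule used in `adjusted_features`, the following applies the same cutoffs (0.55 and -0.05) to synthetic data with NumPy rather than pandas. The helper `select_by_correlation` and the column names `strong_pos`, `weak_pos`, and `neg` are hypothetical and do not come from the dataset.

```python
import numpy as np

def select_by_correlation(X, y, names, pos_min=0.55, neg_max=-0.05):
    """Keep columns whose Pearson correlation with y is >= pos_min or <= neg_max."""
    kept = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if r >= pos_min or r <= neg_max:
            kept.append(name)
    return kept

# Synthetic stand-in data (an assumption, not the German Credit data)
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=500)
X = np.column_stack([
    y + 0.1 * rng.normal(size=500),    # strongly positive correlation
    0.3 * y + rng.normal(size=500),    # weakly positive correlation
    -y + 0.5 * rng.normal(size=500),   # negative correlation
])
names = ["strong_pos", "weak_pos", "neg"]

kept = select_by_correlation(X, y, names)
print(kept)
```

With these cutoffs the weakly positive column should be dropped, while any feature with even a small negative correlation is retained.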

In [None]:
def preprocess(X, y, test_size=0.2, random_state=42, n_dim=4):
  """
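  Splits the data, reduces it with PCA to n_dim components (one per qubit),
  standardizes it, rescales it to [-1, 1], and keeps at most 100 training
  and 20 test samples.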

  Args:
      X (pandas.DataFrame or numpy.ndarray): Input features.
      y (pandas.Series): Target labels.
      test_size (float, optional): Proportion of the dataset to include in the test split. Defaults to 0.2.
      random_state (int, optional): Seed for the random number generator. Defaults to 42.
      n_dim (int, optional): Number of dimensions for the PCA. Defaults to 4.

  Returns:
      tuple: A tuple containing the preprocessed training data, test data, training labels, and test labels.
  """

  sample_train, sample_test, label_train, label_test = train_test_split(X, y, test_size=test_size, random_state=random_state)


  # Reduce dimensions
  pca = PCA(n_components=n_dim).fit(sample_train)
  sample_train = pca.transform(sample_train)
  sample_test = pca.transform(sample_test)

  # Normalize
  std_scale = StandardScaler().fit(sample_train)
  sample_train = std_scale.transform(sample_train)
  sample_test = std_scale.transform(sample_test)

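  # Scale to [-1, 1]; the scaler is fit on train and test together so that
  # no test value falls outside the range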
  samples = np.append(sample_train, sample_test, axis=0)
  minmax_scaler = MinMaxScaler((-1,1)).fit(samples)
  sample_train = minmax_scaler.transform(sample_train)
  sample_test = minmax_scaler.transform(sample_test)

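  # Select a small subset to keep the kernel evaluation tractable.
  # np.asarray accepts both pandas Series and NumPy arrays.
  n_train = 100
  sample_train = sample_train[:n_train]
  label_train = np.asarray(label_train)[:n_train]

  n_test = 20
  sample_test = sample_test[:n_test]
  label_test = np.asarray(label_test)[:n_test]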

  return sample_train, sample_test, label_train, label_test
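`preprocess` fits the final `MinMaxScaler` on the training and test samples together, which keeps every value inside [-1, 1] but lets test statistics influence the transform. One possible alternative, sketched here with plain NumPy under the assumption that clipping out-of-range test values is acceptable, fits the range on the training data only. The helper `fit_minmax` is hypothetical.

```python
import numpy as np

def fit_minmax(train, lo=-1.0, hi=1.0):
    """Learn per-column min/max from train and return a transform function."""
    col_min = train.min(axis=0)
    col_max = train.max(axis=0)
    # Guard against constant columns to avoid division by zero
    span = np.where(col_max > col_min, col_max - col_min, 1.0)

    def transform(data):
        scaled = lo + (data - col_min) / span * (hi - lo)
        # Test values outside the training range are clipped to [lo, hi]
        return np.clip(scaled, lo, hi)

    return transform

# Synthetic stand-in data (an assumption): the test set is deliberately wider
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 4))
test = 3 * rng.normal(size=(20, 4))

scale = fit_minmax(train)
train_scaled = scale(train)
test_scaled = scale(test)
print(train_scaled.min(), train_scaled.max(), test_scaled.min(), test_scaled.max())
```

Clipping discards some information about extreme test points, so whether this trade-off is preferable to the joint fit depends on the use case.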

In [None]:
def run_qsvc(sample_train, label_train, sample_test, n_dim=4, save=True, output_model='german_qsvc.joblib', entanglement='linear', reps=2):
  """
  Runs QSVC on the provided data.

  Args:
      sample_train (numpy.ndarray): Training data.
      label_train (numpy.ndarray): Training labels.
      sample_test (numpy.ndarray): Test data.
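      n_dim (int, optional): Number of features, and therefore qubits, in the feature map. Defaults to 4.
      save (bool, optional): Whether to save the trained QSVC model. Defaults to True.
      output_model (str, optional): File path for the saved model. Defaults to 'german_qsvc.joblib'.
      entanglement (str, optional): Entanglement pattern of the ZZFeatureMap. Defaults to 'linear'.
      reps (int, optional): Number of feature-map repetitions. Defaults to 2.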

  Returns:
      numpy.ndarray: Predictions for the test data.
  """

  zz_map = ZZFeatureMap(feature_dimension=n_dim, reps=reps, entanglement=entanglement, insert_barriers=True)

  zz_kernel = FidelityQuantumKernel(feature_map=zz_map)

  qsvc = QSVC(quantum_kernel=zz_kernel)
  qsvc.fit(sample_train, label_train)

  if save:
    qsvc.save(output_model)

  predictions = qsvc.predict(sample_test)

  return predictions
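To illustrate what a fidelity kernel computes, here is a minimal NumPy sketch for two features (two qubits). It mimics a ZZ-style encoding (Hadamards followed by data-dependent phases) but is a hand-written approximation for illustration, not Qiskit's implementation. The helpers `zz_phases`, `feature_state`, and `fidelity_kernel` are hypothetical names.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
H2 = np.kron(H, H)  # Hadamard on both qubits

def zz_phases(x):
    """Diagonal phases of one ZZ-style encoding layer for two features."""
    x0, x1 = x
    phases = []
    for b0 in (0, 1):
        for b1 in (0, 1):
            angle = (2 * x0 * b0 + 2 * x1 * b1
                     + 2 * (np.pi - x0) * (np.pi - x1) * (b0 ^ b1))
            phases.append(np.exp(1j * angle))
    return np.array(phases)

def feature_state(x, reps=2):
    """Apply `reps` layers of (Hadamards, then phases) to the |00> state."""
    state = np.zeros(4, dtype=complex)
    state[0] = 1.0
    for _ in range(reps):
        state = zz_phases(x) * (H2 @ state)
    return state

def fidelity_kernel(A, B):
    """Kernel entry = squared overlap |<phi(a)|phi(b)>|^2 of the encoded states."""
    states_a = np.array([feature_state(a) for a in A])
    states_b = np.array([feature_state(b) for b in B])
    return np.abs(states_a.conj() @ states_b.T) ** 2

X_demo = np.array([[0.1, -0.4], [0.7, 0.2], [-0.9, 0.5]])
K = fidelity_kernel(X_demo, X_demo)
print(np.round(K, 3))
```

Because each entry is a squared overlap between normalized states, the diagonal equals 1, the matrix is symmetric, and every value lies in [0, 1].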

In [None]:
def optimize_qsvc(X, y, test_params=None, n=10):
    """
    Optimization function to find the best repetitions (reps), entanglement, and number of qubits (n_dim)
    for the QSVC model.

    Args:
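        X (pandas.DataFrame): Input features.
        y (pandas.Series): Target labels.
        test_params (dict, optional): Lists of candidate values under the keys 'reps',
            'entanglement', and 'n_dim'. Defaults to a built-in grid.
        n (int, optional): Maximum number of parameter combinations to evaluate. Defaults to 10.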

    Returns:
        tuple: The best minority recall and the parameters that produced it.

    """
    best_recall = 0
    best_params = {}
    best_report = ""
    if test_params is None:
      reps_range = [1, 2, 3]
      entanglement_options = ['linear', 'full', 'circular', 'sca']
      n_dim_range = [2, 4, 6]
    else:
      reps_range = test_params['reps']
      entanglement_options = test_params['entanglement']
      n_dim_range = test_params['n_dim']

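    # Shuffle the full grid first, then keep n combinations, so the search
    # samples the whole grid instead of only its first n entries
    params_combinations = list(itertools.product(reps_range, entanglement_options, n_dim_range))
    np.random.shuffle(params_combinations)
    params_combinations = params_combinations[:n]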

    for i, (reps, entanglement, n_dim) in enumerate(params_combinations):
        start = time()
        # Preprocess data
        sample_train, sample_test, label_train, label_test = preprocess(X, y, n_dim=n_dim)

        # Run QSVC
        predictions = run_qsvc(sample_train, label_train, sample_test, n_dim=n_dim, reps=reps, entanglement=entanglement, save=False)

        # Compute minority recall
        minority_recall = recall_score(label_test, predictions, pos_label=-1)  # Adjust pos_label to match your data

        params = {
            'reps': reps,
            'entanglement': entanglement,
            'n_dim': n_dim
        }

        if minority_recall > best_recall:
            best_recall = minority_recall
            best_params = params
            # best_report = classification_report(label_test, predictions)

        print(f"Iteration {i + 1}: Current Recall: {minority_recall:.2f}, Params: {params}, Time Took: {time()-start:.2f}s")

    return best_recall, best_params
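When only `n` combinations from the parameter grid are evaluated, how they are chosen matters. The sketch below uses the same default grid as `optimize_qsvc` and compares taking the first 10 entries with drawing 10 at random. The fixed seed is an assumption made only for reproducibility.

```python
import itertools
import random

grid = list(itertools.product(
    [1, 2, 3],
    ['linear', 'full', 'circular', 'sca'],
    [2, 4, 6],
))

# The leading entries of itertools.product vary the last factors fastest,
# so the first 10 combinations all share reps=1
first_n = grid[:10]

# A random sample draws from all 36 combinations
rng = random.Random(0)
sampled = rng.sample(grid, 10)

print("reps covered by first 10:", {reps for reps, _, _ in first_n})
print("reps covered by sample:  ", {reps for reps, _, _ in sampled})
```

Sampling without replacement guarantees distinct combinations, while slicing the head of the grid never explores the later values of the first factor.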

In [None]:
def main():
  df = pd.read_csv('German_Preprocessed.csv')
  df = adjusted_features(df, save=False)

  X = df.drop('classification', axis=1)
  y = df['classification']

  best_recall, best_params = optimize_qsvc(X, y, n=10)

  print("\nFinal Results:")
  print("Best Minority Recall:", best_recall)
  print("Best Parameters:", best_params)

main()
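The minority recall reported above is the fraction of true minority samples that the model labels as minority. A minimal pure-Python sketch, assuming the minority class is encoded as `-1` (as in the `pos_label=-1` argument) and using made-up labels:

```python
def minority_recall(y_true, y_pred, minority=-1):
    """Recall for the minority class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == minority and p == minority)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == minority and p != minority)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical labels, not results from the notebook
y_true = [1, 1, -1, -1, -1, 1]
y_pred = [1, -1, -1, 1, -1, 1]

recall = minority_recall(y_true, y_pred)
print(f"{recall:.3f}")
```

Two of the three true minority samples are recovered here, so the recall is 2/3 regardless of how the majority class is predicted.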

# Part 2: Ways to improve

The current code runs on a classical simulator rather than an actual quantum backend (`AerSimulator` is imported, but the `FidelityQuantumKernel` is built with its default settings). Classical simulation limits the number of qubits we can use in the feature map. More accurate models might use 20 or 30 qubits, which would take too long to simulate classically.
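A rough back-of-the-envelope estimate shows why qubit count matters for simulation. The sketch assumes a dense statevector with one complex128 amplitude (16 bytes) per basis state; real simulators offer other methods, so treat the numbers as an order-of-magnitude guide. Both helper names are hypothetical.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a dense statevector: 2**n amplitudes of 16 bytes each."""
    return (2 ** n_qubits) * bytes_per_amplitude

def kernel_entries(n_samples):
    """Distinct off-diagonal entries of a symmetric kernel matrix."""
    return n_samples * (n_samples - 1) // 2

for n in (4, 20, 30):
    print(f"{n} qubits: {statevector_bytes(n) / 2**20} MiB")

print("kernel entries for 100 training samples:", kernel_entries(100))
```

Memory doubles with every added qubit, and each kernel entry requires its own circuit evaluation, so the two costs compound as the feature map grows.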




Another option would be to investigate the dataset in depth and identify pairwise interactions among features. We could then tailor the **ZZFeatureMap** so that its entangling gates act only on the qubit pairs whose features interact. With fewer two-qubit gates the circuits become simpler, which should make evaluating the kernel matrix more efficient. Whether the model also becomes more accurate depends on several other factors: by reducing the complexity of the circuit, we drop some of the pairwise encodings, and the lost information may make the model less accurate. We must also guard against overfitting so that the model can properly predict outcomes for unseen data.
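One simple way to look for candidate pairwise interactions is to rank feature pairs by absolute correlation. This is only a proxy (an assumption; correlation does not capture every kind of interaction), and the helper `top_interacting_pairs` and the synthetic data are hypothetical.

```python
import numpy as np
from itertools import combinations

def top_interacting_pairs(X, k=2):
    """Rank feature pairs by absolute Pearson correlation and return the top k."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = combinations(range(X.shape[1]), 2)
    ranked = sorted(pairs, key=lambda p: abs(corr[p[0], p[1]]), reverse=True)
    return ranked[:k]

# Synthetic stand-in data with two planted relationships
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 4))
X = base.copy()
X[:, 1] = base[:, 0] + 0.1 * rng.normal(size=300)    # feature 1 tracks feature 0
X[:, 3] = -base[:, 2] + 0.5 * rng.normal(size=300)   # feature 3 opposes feature 2

pairs = top_interacting_pairs(X, k=2)
print(pairs)
```

The resulting index pairs could then inform which qubits to entangle in a custom feature map, instead of entangling every neighboring or every possible pair.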

If we do run this on a quantum computer, we will need to consider noise and the fault tolerance of the device. We would need quantum error correction (such as the five-qubit "perfect" code), which uses syndrome measurements to identify errors and fix them before they corrupt the kernel matrix.
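The idea of a syndrome can be illustrated with a much simpler scheme than the perfect code. The sketch below swaps in the classical three-bit repetition code, which corrects a single bit flip; it is a stand-in for illustration only and does not protect against the phase errors that a real quantum code must handle.

```python
def encode(bit):
    """Repeat one logical bit three times."""
    return [bit, bit, bit]

def syndrome(code):
    """Two parity checks locate a single flipped bit without reading the logical value."""
    return (code[0] ^ code[1], code[1] ^ code[2])

# Map each syndrome to the index of the bit to flip back (None means no error)
CORRECTION = {(0, 0): None, (1, 0): 0, (1, 1): 1, (0, 1): 2}

def correct(code):
    """Undo a single bit flip indicated by the syndrome."""
    fixed = list(code)
    flip = CORRECTION[syndrome(code)]
    if flip is not None:
        fixed[flip] ^= 1
    return fixed

noisy = encode(1)
noisy[2] ^= 1                  # inject a single bit-flip error
print(syndrome(noisy), correct(noisy))
```

The parity checks reveal where the error occurred but not what the encoded bit is, which is the same principle that quantum syndrome measurements rely on.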