# 👋🌍 DGA Botnet detection - Qiskit version over Microsoft Q-Devices

In this notebook, we'll review the application of Variational Quantum Classifiers for DGA-Botnet Detection

## Submit a job to Microsoft Quantum Devices using Azure Quantum
In this notebook we are using Qiskit. At the time of this writing, azure supports below machines:
- ionq.qpu
- ionq.qpu.aria-1
- ionq.simulator
- quantinuum.hqs-lt-s1
- quantinuum.hqs-lt-s1-apival
- quantinuum.hqs-lt-s2
- quantinuum.hqs-lt-s2-apival
- quantinuum.hqs-lt-s1-sim
- quantinuum.hqs-lt-s2-sim
- quantinuum.qpu.h1-1
- quantinuum.sim.h1-1sc
- quantinuum.qpu.h1-2
- quantinuum.sim.h1-2sc
- quantinuum.sim.h1-1e
- quantinuum.sim.h1-2e
- rigetti.sim.qvm
- rigetti.qpu.aspen-11
- rigetti.qpu.aspen-m-2
- rigetti.qpu.aspen-m-3
- microsoft.estimator


Also, Qiskit provides below simulators:


- AerSimulator('aer_simulator'),
- AerSimulator('aer_simulator_statevector'),
- AerSimulator('aer_simulator_density_matrix'),
- AerSimulator('aer_simulator_stabilizer'),
- AerSimulator('aer_simulator_matrix_product_state'),
- AerSimulator('aer_simulator_extended_stabilizer'),
- AerSimulator('aer_simulator_unitary'),
- AerSimulator('aer_simulator_superop'),
- QasmSimulator('qasm_simulator'),
- StatevectorSimulator('statevector_simulator'),
- UnitarySimulator('unitary_simulator'),
- PulseSimulator('pulse_simulator')

We select the backend for our experiment

### 1. Connect to the Azure Quantum workspace

To connect to the Azure Quantum service, construct an instance of the `AzureQuantumProvider`. Note that it's imported from `azure.quantum.qiskit`.

In [1]:
from azure.quantum.qiskit import AzureQuantumProvider
provider = AzureQuantumProvider (
    resource_id = "/subscriptions/2501e059-13d8-4207-85d6-ec58656e2fae/resourceGroups/AzureQuantum/providers/Microsoft.Quantum/Workspaces/Rigetti",
    location = "eastus"
)

Qiskit has different versions and classifiers like QSVM has been changed to QSVC , to have a stable run we make sure that we have the version that matches this experiment/

In [None]:
!pip install qiskit==0.31.0



Let's see what providers and targets are enabled in this workspace with the following command:


In [2]:
from qiskit import QuantumCircuit
from qiskit.visualization import plot_histogram
from qiskit.tools.monitor import job_monitor

print("This workspace's targets:")
for backend in provider.backends():
    print("- " + backend.name())

This workspace's targets:
- ionq.qpu
- ionq.qpu.aria-1
- ionq.simulator
- quantinuum.hqs-lt-s1
- quantinuum.hqs-lt-s1-apival
- quantinuum.hqs-lt-s2
- quantinuum.hqs-lt-s2-apival
- quantinuum.hqs-lt-s1-sim
- quantinuum.hqs-lt-s2-sim
- quantinuum.qpu.h1-1
- quantinuum.sim.h1-1sc
- quantinuum.qpu.h1-2
- quantinuum.sim.h1-2sc
- quantinuum.sim.h1-1e
- quantinuum.sim.h1-2e
- rigetti.sim.qvm
- rigetti.qpu.aspen-11
- rigetti.qpu.aspen-m-2
- rigetti.qpu.aspen-m-3
- microsoft.estimator


### 2. Uploading data from a container that we made
Next, we upload the dataset metioned at https://ieee-dataport.org/open-access/botnet-dga-dataset#files to a container inside Azure Storage. This is necessary for current version of Jupiter available of azure quantum. We may make random smaller sample size to test that our code is good enough when we change the backend. To have the dataset for project, we use below commands. the URLs are from Azure storage that we made earlier.

In [3]:
!wget https://aq5efd7d2644dd406cb3ec2d.blob.core.windows.net/dga/BotnetDgaDataset.rst

--2022-12-27 02:31:33--  https://aq5efd7d2644dd406cb3ec2d.blob.core.windows.net/dga/BotnetDgaDataset.rst
Resolving aq5efd7d2644dd406cb3ec2d.blob.core.windows.net (aq5efd7d2644dd406cb3ec2d.blob.core.windows.net)... 52.239.169.4
Connecting to aq5efd7d2644dd406cb3ec2d.blob.core.windows.net (aq5efd7d2644dd406cb3ec2d.blob.core.windows.net)|52.239.169.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 571 [application/octet-stream]
Saving to: ‘BotnetDgaDataset.rst’


2022-12-27 02:31:33 (8.55 MB/s) - ‘BotnetDgaDataset.rst’ saved [571/571]



In [4]:
#!wget https://aq5efd7d2644dd406cb3ec2d.blob.core.windows.net/dga/dgabotnet_main.csv
!wget https://aq5efd7d2644dd406cb3ec2d.blob.core.windows.net/dga/BotnetDgaDataset_1000.csv



--2022-12-27 02:31:36--  https://aq5efd7d2644dd406cb3ec2d.blob.core.windows.net/dga/BotnetDgaDataset_1000.csv
Resolving aq5efd7d2644dd406cb3ec2d.blob.core.windows.net (aq5efd7d2644dd406cb3ec2d.blob.core.windows.net)... 52.239.169.4
Connecting to aq5efd7d2644dd406cb3ec2d.blob.core.windows.net (aq5efd7d2644dd406cb3ec2d.blob.core.windows.net)|52.239.169.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75901 (74K) [text/csv]
Saving to: ‘BotnetDgaDataset_1000.csv’


2022-12-27 02:31:37 (593 KB/s) - ‘BotnetDgaDataset_1000.csv’ saved [75901/75901]



As mentioned earlier, because the lack of file operation, we may use below code to be sure that our data is in the right location

In [5]:
import os
files = os.listdir(os.curdir)
for file in files:
    print(file)

.bash_logout
.bashrc
.profile
.jupyter
BotnetDgaDataset_1000.csv
BotnetDgaDataset.rst
.local
.ipython
azurequantumtoken.json
.gitconfig
.dotnet
.nuget
.templateengine
.azure
.packages


In case we need the list of Qiskit simulators

In [7]:
from qiskit import QuantumCircuit, ClassicalRegister, QuantumRegister
from qiskit import execute, Aer
Aer.backends()


[AerSimulator('aer_simulator'),
 AerSimulator('aer_simulator_statevector'),
 AerSimulator('aer_simulator_density_matrix'),
 AerSimulator('aer_simulator_stabilizer'),
 AerSimulator('aer_simulator_matrix_product_state'),
 AerSimulator('aer_simulator_extended_stabilizer'),
 AerSimulator('aer_simulator_unitary'),
 AerSimulator('aer_simulator_superop'),
 QasmSimulator('qasm_simulator'),
 StatevectorSimulator('statevector_simulator'),
 UnitarySimulator('unitary_simulator'),
 PulseSimulator('pulse_simulator')]

### 3. Running the VQC
The first version of this code is taken from DOI: 10.24433/CO.4005597.v2 accesible from: https://codeocean.com/capsule/3610673/tree/v2 
 
The pseudocode is as below: 
1. Import necessary libraries including qiskit, numpy, pandas, matplotlib, and concurrent.futures.
2. Define function to load data from a CSV file, including reading in the file, storing relevant information such as the number of samples and features, and returning the data, target, and target names as arrays.
3. Define function to convert data and target into a dataframe with appropriate column names, and return the combined dataframe, data, and target.
4. Define function to load botnet data, including reading in the file and returning data and target.
5. Define main function to perform quantum-enhanced machine learning on botnet data, including options for standardizing, scaling, and binarizing the data.
6. Split the data into training and test sets.
7. Set up quantum circuit and choose optimization algorithm.
8. Train model using quantum circuit and optimization algorithm.
9. Test model on test set and calculate accuracy.
10. Plot results as a histogram.
11. Save results to a text file.


In [9]:
import qiskit
from qiskit import QuantumCircuit
from qiskit import Aer, transpile
from qiskit.tools.visualization import plot_histogram, plot_state_city
import qiskit.quantum_info as qi
import numpy as np
import os
from qiskit import BasicAer
from qiskit.aqua import QuantumInstance, aqua_globals
from qiskit.aqua.algorithms import VQC

from qiskit.aqua.components.optimizers import SPSA, ADAM, AQGD, CG, COBYLA, L_BFGS_B, GSLS, NELDER_MEAD, NFT, P_BFGS, POWELL, SLSQP, TNC
from qiskit.aqua.components.feature_maps import RawFeatureVector
from qiskit.circuit.library import TwoLocal, PauliFeatureMap, ZFeatureMap, ZZFeatureMap, NLocal, TwoLocal, RealAmplitudes, EfficientSU2, ExcitationPreserving
from qiskit.aqua.utils import split_dataset_to_data_and_labels, map_label_to_class_name, get_feature_dimension
import csv
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Binarizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import datetime
import concurrent.futures
import time
datafilename="BotnetDgaDataset_1000.csv"
resultname="result_BotnetDgaDataset_1000.txt"
cwd=os.getcwd()
mycsv=cwd+"/"+datafilename


def load_data(filepath):

    with open(filepath) as csv_file:
        data_file = csv.reader(csv_file)
        temp = next(data_file)
        n_samples = 1000
        # int(temp[0])
        n_features = 7
        #int(temp[1])
        target_names = np.array(temp[2:])
        data = np.empty((n_samples, n_features))
        target = np.empty((n_samples,), dtype=int)

        for i, ir in enumerate(data_file):
            data[i] = np.asarray(ir[:-1], dtype=np.float64)
            target[i] = np.asarray(ir[-1], dtype=int)

    return data, target, target_names

load_data(mycsv)

def _convert_data_dataframe(data, target,
                            feature_names, target_names):
    data_df = pd.DataFrame(data, columns=feature_names)
    target_df = pd.DataFrame(target, columns=target_names)
    combined_df = pd.concat([data_df, target_df], axis=1)
    X = combined_df[feature_names]
    y = combined_df[target_names]
    if y.shape[1] == 1:
        y = y.iloc[:, 0]
    return combined_df, X, y


def load_botnetdga(*, as_frame=False):

    data, target, target_names = load_data(datafilename)

    with open('BotnetDgaDataset.rst') as rst_file:
        fdescr = rst_file.read()

    feature_names = ['MinREBotnets',
                     'CharLength',
                     'TreeNewFeature',
                     'nGramReputation_Alexa']

    frame = None
    target_columns = ['target', ]
    if as_frame:
        frame, data, target = _convert_data_dataframe(data,
                                                      target,
                                                      feature_names,
                                                      target_columns)

    return data, target


def botnetdga(training_size, test_size, n,
              standardize=False, pca=False, scale=False, plot_data=False,
              binarize=False):
    
    class_labels = [r'benign', r'dga']

    data, target = load_botnetdga()
    sample_train, sample_test, label_train, label_test = \
        train_test_split(data, target, train_size=training_size, test_size=test_size, random_state=7)

    
    # Now we standardize for gaussian around 0 with unit variance
    if standardize:
        std_scale = StandardScaler().fit(sample_train)
        sample_train = std_scale.transform(sample_train)
        sample_test = std_scale.transform(sample_test)

    
    # Now reduce number of features to number of qubits
    if pca:
        pca = PCA(n_components=n).fit(sample_train)
        sample_train = pca.transform(sample_train)
        sample_test = pca.transform(sample_test)

    
    # Scale to the range (-1,+1)
    if scale:
        samples = np.append(sample_train, sample_test, axis=0)
        minmax_scale = MinMaxScaler((-1, 1)).fit(samples)
        sample_train = minmax_scale.transform(sample_train)
        sample_test = minmax_scale.transform(sample_test)

    
    if binarize:
        med = np.median(np.append(sample_train, sample_test, axis=0), axis=0)

        transformer = Binarizer(threshold=med)
    
        sample_train = transformer.transform(sample_train)
        sample_test = transformer.transform(sample_test)

    
    # Pick training size number of samples from each distro
    training_input = {key: (sample_train[label_train == k, :])[:training_size]
                      for k, key in enumerate(class_labels)}
    test_input = {key: (sample_test[label_test == k, :])[:test_size]
                  for k, key in enumerate(class_labels)}

    if plot_data:
        LegitMinREBotnets = []
        LegitCharLength = []
        LegitTreeNewFeature = []
        LegitnGramReputation_Alexa = []

        DgaMinREBotnets = []
        DgaCharLength = []
        DgaTreeNewFeature = []
        DganGramReputation_Alexa = []

    
        i = 0
        while i < len(sample_train):
            if label_train[i] == 0:
                LegitMinREBotnets.append(sample_train[i][2])
                LegitCharLength.append(sample_train[i][4])
                LegitTreeNewFeature.append(sample_train[i][5])
                LegitnGramReputation_Alexa.append(sample_train[i][6])
            else:
                DgaMinREBotnets.append(sample_train[i][2])
                DgaCharLength.append(sample_train[i][4])
                DgaTreeNewFeature.append(sample_train[i][5])
                DganGramReputation_Alexa.append(sample_train[i][6])
            i += 1
        n_bins = None
        class_labels = [r'benign', r'dga']
        colors = ['blue', 'green']
        x0 = [LegitMinREBotnets, DgaMinREBotnets]
        x1 = [LegitCharLength, DgaCharLength]
        x2 = [LegitTreeNewFeature, DgaTreeNewFeature]
        x3 = [LegitnGramReputation_Alexa, DganGramReputation_Alexa]



        fig, ((ax0, ax1), (ax2, ax3)) = plt.subplots(nrows=2, ncols=2)


        ax0.hist(x0, n_bins, density=True, histtype='bar', color=colors, label=class_labels)
        ax0.legend(prop={'size': 10})
        ax0.set_title('MinREBotnets')

        ax1.hist(x1, n_bins, density=True, histtype='bar', color=colors, label=class_labels)
        ax1.legend(prop={'size': 10})
        ax1.set_title('CharLength')

        ax2.hist(x2, n_bins, density=True, histtype='bar', color=colors, label=class_labels)
        ax2.legend(prop={'size': 10})
        ax2.set_title('TreeNewFeature')

        ax3.hist(x3, n_bins, density=True, histtype='bar', color=colors, label=class_labels)
        ax3.legend(prop={'size': 10})
        ax3.set_title('nGramReputation_Alexa')

        fig.tight_layout()
        plt.show()

    return sample_train, training_input, test_input, class_labels


def runTheComputation(experimentID, optimizer, feature_map, var_form, training_input, test_input):
    print(f'# Experiment  {experimentID}  = ')
    
    start = time.perf_counter()
    #backend = provider.get_backend("ionq.simulator")
    #quantum_instance = QuantumInstance(backend)
    vqc = VQC(optimizer,
              feature_map,
              var_form,
              training_input,
              test_input)

    #backend = BasicAer.get_backend('statevector_simulator')
    backend = Aer.get_backend('qasm_simulator')
    

    quantum_instance = QuantumInstance(backend)
  
    
    
    result = vqc.run(quantum_instance)


    finish = time.perf_counter()

    print('\n' + str(experimentID) + ')  Accuracy =  ' + str(result['testing_accuracy'])  + '  ; nQubits =  ' + str(feature_map.num_qubits) )
    
    f = open(resultname, "a")
    f.write('\n' + str(experimentID) + ')  Accuracy =  ' + str(result['testing_accuracy'])  + '  ; nQubits =  ' + str(feature_map.num_qubits) )
    f.flush()
    f.close()
    
    return experimentID, result['testing_accuracy'], feature_map.num_qubits, round(finish - start, 2)


def main():
    

    start = time.perf_counter()
    print(mycsv)
    # BotnetDGA data set
    plot_data = False
    training_size = 700
    #1352500
    test_size = 300
    #450833
    feature_dim = 7
    standardize = False
    pca = False
    scale = False
    binarize = False
    
    f = open(resultname, "a")
    f.write(datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S_") + "   training_size = " + str(training_size) + "  test_size = " + str(test_size) + "  feature_dim = " + str(
        feature_dim) + "\n\n")
    f.flush()
    f.close()

    
    
    sample_train, training_input, test_input, class_labels = botnetdga(training_size=training_size,
                                                                       test_size=test_size,
                                                                       n=feature_dim,
                                                                       standardize=standardize,
                                                                       pca=pca,
                                                                       scale=scale,
                                                                       plot_data=plot_data,
                                                                       binarize=binarize)

    

    optimizer1 = SLSQP()
    
    nFeature = get_feature_dimension(training_input)

    feature_map1 = RawFeatureVector(nFeature)
    
    var_form11 = TwoLocal(feature_map1.num_qubits, ['ry', 'rz'], 'cz')
    
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.submit(runTheComputation, 1, optimizer1, feature_map1, var_form11, training_input, test_input)
        print(runTheComputation)
    finish = time.perf_counter()

    print(f'ALL Finished in {round( (finish-start)/3600, 2)} hour(s)')
    f = open(resultname, "a")
    f.write(f'ALL Finished in {round( (finish-start)/3600, 2)} hour(s)')
    f.write(f'ALL Finished in {round( (finish-start), 2)} second(s)')
    f.flush()
    f.close()


if __name__ == '__main__':
    main()


/home/jovyan/BotnetDgaDataset_1000.csv
# Experiment  1  = 
<function runTheComputation at 0x7fe8cb4683a0>


See the results inside the textfile

In [None]:
import os
os.chdir('/home/jovyan/')
path=os.getcwd()
print(path)
print(resultname)
files = os.listdir(os.curdir)
#for file in files:
#    print(file)
with open(resultname) as f:
    s = f.read()
print(s)


In [None]:
import qiskit
print(qiskit.__version__)

In [None]:
!pip install qiskit --upgrade


In [None]:
import qiskit
print(qiskit.__version__)