In [46]:
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

In [47]:
# Load our dataset
data = pd.read_csv('../datasets/Obfuscated/Obfuscated-MalMem2022_edited.csv')
X_drop_columns = [
    'Class',
    'Category',
    'svcscan.interactive_process_services',
    'handles.nport',
    'modules.nmodules',
    'pslist.nprocs64bit',
    'callbacks.ngeneric',
]
X = data.drop(columns=X_drop_columns)
y = data.Category

In [48]:
# Encode the string category labels as integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

In [49]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [50]:
# Create a Quadratic Discriminant Analysis model
qda = QuadraticDiscriminantAnalysis()

In [51]:
# Fit the model on the training data
qda.fit(X_train, y_train)



In [52]:
# Predict on the test data
y_pred = qda.predict(X_test)

In [53]:
# Compare accuracy on the training and test sets
print('Training accuracy:', qda.score(X_train, y_train))
print('Test accuracy:', qda.score(X_test, y_test))

Training accuracy: 0.6966623595094716
Test accuracy: 0.6946356448034586


In [54]:
# Report weighted-average metrics for the test-set predictions
y_pred = qda.predict(X_test)
print(f"Accuracy score: {accuracy_score(y_test, y_pred)}")
print(f"Precision score: {precision_score(y_test, y_pred, average='weighted', zero_division=0)}")
print(f"Recall score: {recall_score(y_test, y_pred, average='weighted', zero_division=0)}")
print(f"F-1 score: {f1_score(y_test, y_pred, average='weighted', zero_division=0)}")

Accuracy score: 0.6946356448034586
Precision score: 0.7864917755165365
Recall score: 0.6946356448034586
F-1 score: 0.6431766849408324


In [55]:
# Display the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)


Confusion Matrix:
[[8733    8    9    0]
 [   0  175 2666   81]
 [   0   63 2921   45]
 [   0   36 2460  382]]
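
The confusion matrix above can be broken down per class. The sketch below re-enters the printed matrix by hand (rows are true labels, columns are predictions, following scikit-learn's convention) and derives per-class recall and precision from it. The class indices are the integer codes produced by `label_encoder`, so `label_encoder.classes_` would map them back to category names; the variable names here are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the printed output above
# (rows = true class, columns = predicted class).
cm = np.array([
    [8733,    8,    9,    0],
    [   0,  175, 2666,   81],
    [   0,   63, 2921,   45],
    [   0,   36, 2460,  382],
])

support = cm.sum(axis=1)                 # number of true samples per class
recall = np.diag(cm) / support           # fraction of each true class recovered
precision = np.diag(cm) / cm.sum(axis=0) # fraction of each predicted class that is correct
accuracy = np.trace(cm) / cm.sum()

# Support-weighted average, the same weighting as average='weighted'
weighted_precision = np.average(precision, weights=support)

for k in range(cm.shape[0]):
    print(f"class {k}: recall={recall[k]:.3f}, precision={precision[k]:.3f}")
print(f"accuracy={accuracy:.4f}, weighted precision={weighted_precision:.4f}")
```

Class 0 is separated almost perfectly, while most samples of classes 1 and 3 are predicted as class 2, which is what pulls the overall accuracy down to about 0.69.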


## Quadratic Discriminant Analysis (QDA)

### Overview
Quadratic Discriminant Analysis (QDA) is a statistical method used in machine learning for classification tasks. It generalizes Linear Discriminant Analysis (LDA): each class is allowed to have its own covariance matrix, whereas LDA assumes a single covariance matrix shared by all classes. This lets QDA fit a wider range of datasets, especially those where the classes cannot be separated well by linear boundaries.

### How QDA Works
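QDA models the probability density of each class as a multivariate Gaussian distribution with its own mean and covariance. A sample `x` is assigned to the class `k` with the largest discriminant value \( \delta_k(x) \). Because the covariance matrices differ between classes, the resulting decision boundaries are quadratic in `x`, which gives QDA its name. The discriminant function for QDA is given by: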

- **Quadratic discriminant function**:
  \[
  \delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k
  \]

Where:
- \( \Sigma_k \) is the covariance matrix of class `k`.
- \( \mu_k \) is the mean vector of class `k`.
- \( \pi_k \) is the prior probability of class `k`.
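
To make the formula concrete, here is a minimal NumPy sketch that evaluates \( \delta_k(x) \) directly and picks the class with the largest value. The function name `qda_discriminants` and the toy means, covariances, and priors are assumptions made up for illustration; this is not how scikit-learn implements QDA internally.

```python
import numpy as np

def qda_discriminants(X, means, covs, priors):
    """Return an (n_samples, n_classes) array of delta_k(x) values."""
    X = np.atleast_2d(X)
    scores = np.empty((X.shape[0], len(means)))
    for k, (mu, sigma, pi) in enumerate(zip(means, covs, priors)):
        diff = X - mu
        # log|Sigma_k|, computed stably via slogdet
        _, logdet = np.linalg.slogdet(sigma)
        # Mahalanobis term (x - mu_k)^T Sigma_k^{-1} (x - mu_k) for each row
        maha = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
        scores[:, k] = -0.5 * logdet - 0.5 * maha + np.log(pi)
    return scores

# Hypothetical two-class problem with different covariance matrices
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[4.0, 0.0], [0.0, 0.25]])]
priors = [0.5, 0.5]

points = np.array([[0.0, 0.0], [3.0, 3.0]])
scores = qda_discriminants(points, means, covs, priors)
predicted = scores.argmax(axis=1)  # class with the largest discriminant
print(predicted)
```

Each point sits exactly on one class mean, so its Mahalanobis term for that class is zero and that class wins.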

### Assumptions
- Each class follows a normal (Gaussian) distribution.
- Classes have their own covariance matrices.

### Advantages
- QDA can model a more flexible decision boundary than LDA.
- Suitable for classes that have distinct distributions.
- Performs well when the class distributions are Gaussian.

### Disadvantages
- Requires estimating more parameters (the covariance matrices), which can be a drawback in terms of computational cost and required sample size.
- Not suitable for very high-dimensional data unless the sample size is large enough to estimate the covariance matrices reliably.
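
A rough way to see the cost is to count the free parameters each model estimates. The helper names below are hypothetical, and the feature count of 50 is an illustrative assumption rather than the size of the dataset used above; the four classes match the 4x4 confusion matrix.

```python
def qda_param_count(n_features, n_classes):
    """Free parameters in QDA: per-class means and covariances, plus priors."""
    means = n_classes * n_features
    # Each symmetric covariance matrix has p * (p + 1) / 2 free entries
    covariances = n_classes * n_features * (n_features + 1) // 2
    priors = n_classes - 1  # priors sum to one
    return means + covariances + priors

def lda_param_count(n_features, n_classes):
    """Free parameters in LDA: per-class means, one shared covariance, priors."""
    means = n_classes * n_features
    shared_covariance = n_features * (n_features + 1) // 2
    priors = n_classes - 1
    return means + shared_covariance + priors

# Illustrative numbers only: 50 features, 4 classes
print("QDA:", qda_param_count(50, 4))
print("LDA:", lda_param_count(50, 4))
```

The covariance term grows quadratically with the number of features and is multiplied by the number of classes in QDA, which is why small classes in high dimensions are hard to fit reliably.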

### Applications
- Well suited to moderately sized datasets where the assumption of Gaussian-distributed classes with different covariance matrices holds.
- Commonly used in applications like pattern recognition, medical diagnosis, and machine vision where the class distributions are inherently different.

### Implementation in Python
QDA can be easily implemented using the `QuadraticDiscriminantAnalysis` class from the `sklearn.discriminant_analysis` module of the Scikit-learn library.

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Fit QDA on the training split, then predict labels for the test split
model = QuadraticDiscriminantAnalysis()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
