# Adding new components

In this tutorial we will show how to add new components to the QSPRpred package. 
QSPRpred is designed to be modular, so that new components can be added easily.
Below, we first explain the general procedure and then walk through two examples: adding a new descriptor and adding a new model.

## General explanation

Each module in QSPRpred has an `interfaces.py` file that contains base classes for the components of that module.
These base classes are used to define the interface of the components and to ensure that the components are compatible with the rest of the package.
The base classes are not meant to be used directly, but rather to be subclassed by the user to create new components.
You can see the base classes as a template that you can use to create your own components.

Steps to add a new component:
1. In the `interfaces.py` file of the module you want to extend, find the base class for the component you want to add.
2. Create a new class that inherits from the base class.
3. Implement the abstract methods of the base class in your new class.
4. Check that the inputs and outputs of the new methods match the ones defined in the base class (see the docstrings of the base class).
5. Check that your new class is compatible with the rest of the package by running the tests.
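
The steps above can be sketched with a toy example. The `ComponentBase` class below is a hypothetical stand-in for a base class from an `interfaces.py` file, not an actual QSPRpred class; only the subclassing pattern is meant to carry over.

```python
from abc import ABC, abstractmethod


class ComponentBase(ABC):
    """Hypothetical stand-in for a base class found in an `interfaces.py` file."""

    @abstractmethod
    def transform(self, values: list[float]) -> list[float]:
        """Return one output value per input value."""


class Doubler(ComponentBase):
    """New component that inherits from the base class (step 2)."""

    def transform(self, values: list[float]) -> list[float]:
        # Step 3: implement the abstract method.
        # Step 4: keep the input and output types declared in the base class.
        return [2 * v for v in values]


# The base class cannot be instantiated directly; the subclass can.
try:
    ComponentBase()
    base_is_abstract = False
except TypeError:
    base_is_abstract = True

doubled = Doubler().transform([1.0, 2.5])
print(base_is_abstract, doubled)  # True [2.0, 5.0]
```

If a subclass forgets to implement an abstract method, Python raises the same `TypeError` on instantiation, which gives an early signal before running the package tests (step 5).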

## Adding a new descriptor

Find the base class for descriptors in the `interfaces.py` file of the `descriptors` module. 


## Adding a new model

Find the base class for models, `QSPRModel`, in the `interfaces.py` file of the `models` module.
The `models.py` file of the `models` module contains an example, `QSPRSklearn`, that inherits from this base class.
`QSPRSklearn` is a `QSPRModel` that uses estimators from the `scikit-learn` package.

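Here we will show how to create a new `QSPRModel` that uses gzip-kNN: a k-nearest neighbors classifier that measures the distance between two strings with the normalized compression distance (NCD), computed from their gzip-compressed lengths.
We first implement the algorithm as a plain function, then wrap it in a class, and finally wrap that class in a `QSPRModel` subclass.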

In [1]:
import gzip
import numpy as np

def gzip_knn(training_set: np.ndarray, test_set: np.ndarray, k: int = 3):
    """Classify each test instance by majority vote among its k nearest training instances under NCD.
    
    Args:
        training_set (np.ndarray): training set; each row holds a string and, in the last column, its class
        test_set (np.ndarray): test set with the same layout (the class column is ignored)
        k (int): number of nearest neighbors to consider
        
    Returns:
        predicted_class (list): predicted class for each test instance
    """
    predicted_class = []
    for (x1 , _) in test_set: 
        Cx1 = len(gzip.compress(x1.encode()))
        distance_from_x1 = [] 
        for (x2 , _) in training_set: 
            Cx2 = len(gzip.compress(x2.encode()))
            x1x2 = " ".join([x1 , x2])
            Cx1x2 = len(gzip.compress(x1x2.encode()))
            ncd = (Cx1x2 - min ( Cx1 , Cx2 )) / max (Cx1 , Cx2)
            distance_from_x1.append(ncd)
        sorted_idx = np.argsort(np.array(distance_from_x1))
        top_k_class = training_set[sorted_idx[:k], 1]
        predicted_class.append(max(set(top_k_class), key = list(top_k_class).count))
    return predicted_class

# Example training set
training_set = np.array([
    ["This is a test", "ClassA"],
    ["Another example", "ClassB"],
    ["Some random text", "ClassA"],
    ["Just a sample", "ClassB"]
])

# Example test set
test_set = np.array([
    ["Test data 1", "Unknown"],
    ["Test data 2", "Unknown"]
])
k = 3
print(gzip_knn(training_set, test_set, k))

['ClassB', 'ClassB']
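
To build some intuition for why NCD can serve as a distance, the sketch below isolates the distance computation from `gzip_knn` in a small helper and compares a string with itself and with an unrelated string. The helper name `ncd` and the two example sentences are illustrative choices, not part of QSPRpred.

```python
import gzip


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings, using gzip."""
    c_a = len(gzip.compress(a.encode()))
    c_b = len(gzip.compress(b.encode()))
    # concatenate with a space, as in `gzip_knn` above
    c_ab = len(gzip.compress(" ".join([a, b]).encode()))
    return (c_ab - min(c_a, c_b)) / max(c_a, c_b)


text = "the quick brown fox jumps over the lazy dog"
other = "QSPR models relate molecular structure to activity"

same = ncd(text, text)        # the second copy compresses to almost nothing
different = ncd(text, other)  # little shared content for the compressor to reuse
print(same < different)
```

When two strings share content, compressing them together costs little more than compressing one of them, so the numerator stays small. For very short strings the fixed gzip header dominates the compressed length, which makes the distances less informative.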


In [4]:
import gzip
import numpy as np

class GzipKNNAlgorithm():
    """A KNN algorithm using the NCD metric to calculate the distance between two strings.
    
    Attributes:
        k (int): number of nearest neighbors to consider
        trainingSet (np.ndarray): training set, each row is a string and the last column is the class
    """
    
    def __init__(self, k: int = 3):
        """Initialize the gzip knn algorithm.
        
        Args:
            k (int): number of nearest neighbors to consider
        """
        self.k = k
        self.trainingSet = None
    
    def __call__(self, test_set: np.ndarray) -> tuple[list, list]:
        """Predict the class of each test instance by a k-nearest-neighbor vote under NCD.
        
        Args:
            test_set (np.ndarray): test set, each row is a string and the last column is the class
            
        Returns:
            predicted_class (list): predicted class for each test instance
            class_prob (list): fraction of the k neighbors that voted for the predicted class
        """
        predicted_class = []
        class_prob = []
        for (x1 , _) in test_set: 
            Cx1 = len(gzip.compress(x1.encode()))
            distance_from_x1 = [] 
            for (x2 , _) in self.trainingSet: 
                Cx2 = len(gzip.compress(x2.encode()))
                x1x2 = " ".join([x1 , x2])
                Cx1x2 = len(gzip.compress(x1x2.encode()))
                ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
                distance_from_x1.append(ncd)
            sorted_idx = np.argsort(np.array(distance_from_x1))
            # use the instance attribute rather than a global `k`
            top_k_class = list(self.trainingSet[sorted_idx[:self.k], 1])
            majority = max(set(top_k_class), key=top_k_class.count)
            predicted_class.append(majority)
            # store one probability per test instance
            class_prob.append(top_k_class.count(majority) / self.k)
            
        return predicted_class, class_prob

# Example training set
training_set = np.array([
    ["This is a test", "ClassA"],
    ["Another example", "ClassB"],
    ["Some random text", "ClassA"],
    ["Just a sample", "ClassB"]
])

# Example test set
test_set = np.array([
    ["Test data 1", "Unknown"],
    ["Test data 2", "Unknown"]
])

k = 3
gzip_knn = GzipKNNAlgorithm(k)
gzip_knn.trainingSet = training_set
print(gzip_knn(test_set))

(['ClassB', 'ClassB'], [0.6666666666666666, 0.6666666666666666])
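
The voting step in `__call__` can also be written with `collections.Counter`, which returns the winning label and its count in one call. The following is a minimal sketch of that alternative; `majority_vote` is a hypothetical helper name and not part of the tutorial code.

```python
from collections import Counter


def majority_vote(labels: list[str]) -> tuple[str, float]:
    """Return the most common label and the fraction of votes it received."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


label, share = majority_vote(["ClassB", "ClassA", "ClassB"])
print(label, round(share, 2))  # ClassB 0.67
```

When two labels tie, `Counter.most_common` returns the one encountered first, so with neighbors sorted by distance the tie goes to the label of the closest neighbor.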


In [None]:
from qsprpred.data.data import QSPRDataset
from qsprpred.models.models import QSPRModel
import numpy as np
import pandas as pd
import gzip
from typing import Any, Optional, Type

class GzipKNNModel(QSPRModel):
    """GzipKNNModel class for K-Nearest Neighbors with NCD distance metric."""
        
    def supportsEarlyStopping(self) -> bool:
        """Check if the model supports early stopping.
        
        Returns:
            (bool): whether the model supports early stopping or not
        """
        return False

    def fit(
        self,
        X: pd.DataFrame | np.ndarray | QSPRDataset,
        y: pd.DataFrame | np.ndarray | QSPRDataset,
        estimator: Optional[GzipKNNAlgorithm] = None,
        early_stopping: bool = False,
        **kwargs
    ) -> GzipKNNAlgorithm:
        """Fit the model to the given data matrix or `QSPRDataset`.

        If early stopping is used, the number of iterations after which the
        model stopped training is returned as well.

        Args:
            X (pd.DataFrame, np.ndarray, QSPRDataset): data matrix to fit
            y (pd.DataFrame, np.ndarray, QSPRDataset): target matrix to fit
            estimator (Any): estimator instance to use for fitting
            early_stopping (bool): if True, early stopping is used
            kwargs: additional keyword arguments for the fit function

        Returns:
            (GzipKNNAlgorithm): fitted estimator instance
        """
        estimator = self.estimator if estimator is None else estimator
        if isinstance(X, QSPRDataset):
            X = X.getFeatures(raw=True, concat=True)
            y = y.getTargetPropertiesValues(concat=True)
        if estimator is None:
            estimator = self.loadEstimatorFromFile()
            
        # set training set in estimator
        estimator.trainingSet = np.column_stack((X, y))
        
        return estimator

    def predict(
        self,
        X: pd.DataFrame | np.ndarray | QSPRDataset,
        estimator: Any = None
    ) -> np.ndarray:
        """Make predictions for the given data matrix or `QSPRDataset`.

        Args:
            X (pd.DataFrame, np.ndarray, QSPRDataset): data matrix to predict
            estimator (Any): estimator instance to use for prediction

        Returns:
            np.ndarray:
                2D array containing the predictions, where each row corresponds
                to a sample in the data and each column to a target property
        """
        estimator = self.estimator if estimator is None else estimator
        if isinstance(X, QSPRDataset):
            X = X.getFeatures(raw=True, concat=True)
            
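        # append a placeholder class column: GzipKNNAlgorithm expects (string, class) rows
        X = np.column_stack((np.asarray(X), np.full(len(X), "unknown")))
        # the estimator returns (classes, probabilities); keep only the classes
        preds, _ = estimator(X)
        return np.asarray(preds).reshape(-1, 1)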
    
    def predictProba(
        self,
        X: pd.DataFrame | np.ndarray | QSPRDataset,
        estimator: Any = None
    ):
        """See `QSPRModel.predictProba`."""
        estimator = self.estimator if estimator is None else estimator
        if isinstance(X, QSPRDataset):
            X = X.getFeatures(raw=True, concat=True)
        if self.featureStandardizer:
            X = self.featureStandardizer(X)
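        # append a placeholder class column: GzipKNNAlgorithm expects (string, class) rows
        X = np.column_stack((np.asarray(X), np.full(len(X), "unknown")))
        _, prob = estimator(X)
        # NOTE: GzipKNNAlgorithm only reports the vote share of the predicted class,
        # not a full per-class distribution; wrap it in a list to be consistent
        # with the multiclass-multitask case
        return [np.asarray(prob, dtype=float).reshape(-1, 1)]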
    
    # TODO: implement the remaining abstract methods of `QSPRModel`
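
`fit`, `predict` and `GzipKNNAlgorithm` all rely on the same two-column array layout: one string per row, followed by a class column. The sketch below reproduces that layout with plain NumPy. The SMILES strings and class labels are made-up illustrative data, and the sketch assumes a single string feature per sample.

```python
import numpy as np

# Illustrative data only: one string feature per sample and a string class label.
X_train = np.array([["CCO"], ["c1ccccc1"], ["CCN"]])
y_train = np.array([["active"], ["inactive"], ["active"]])

# What `fit` stores on the estimator: rows of (string, class).
training_set = np.column_stack((X_train, y_train))

# At prediction time the class is unknown, so a placeholder column is appended
# to keep the same (string, class) row layout.
X_test = np.array([["CCC"], ["c1ccncc1"]])
test_set = np.column_stack((X_test, np.full(len(X_test), "unknown")))

print(training_set.shape, test_set.shape)  # (3, 2) (2, 2)
```

With more than one feature column the `(x, _)` unpacking in `GzipKNNAlgorithm` would fail, so features would first have to be joined into a single string per sample.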
