In [None]:
# Computational Complexity of ML Algorithms
# https://www.thekerneltrip.com/machine/learning/computational-complexity-learning-algorithms/
# https://www.semanticscholar.org/paper/ACE%3A-an-automatic-complexity-evaluator-M%C3%A9tayer/40fddae949cc89c4cc34562104e765ee301483da
# https://github.com/terryyin/lizard
# https://medium.com/analytics-vidhya/time-complexity-of-ml-models-4ec39fad2770
# https://medium.com/analytics-vidhya/computational-complexity-of-ml-algorithms-1bdc88af1c7a

* A theoretical point of view
* Some bounds
* The bounds proposed here are upper bounds for dense data; each one is justified below by describing an implementation that achieves it.

* Let $n$ be the number of training samples, $p$ the number of features, $n_{trees}$ the number of trees (for tree-based methods), $n_{sv}$ the number of support vectors and $n_{l_i}$ the number of neurons at layer $i$ of a neural network. We then have the following approximations.

| Algorithm | Classification/Regression | Training | Prediction |
|---|---|---|---|
| Decision Tree | C+R | $O(n^2 p)$ | $O(p)$ |
| Random Forest | C+R | $O(n^2 p\, n_{trees})$ | $O(p\, n_{trees})$ |
| Random Forest | R, Breiman implementation | $O(n^2 p\, n_{trees})$ | $O(p\, n_{trees})$ |
| Random Forest | C, Breiman implementation | $O(n^2 \sqrt{p}\, n_{trees})$ | $O(p\, n_{trees})$ |
| Extremely Random Trees | C+R | $O(n p\, n_{trees})$ | $O(n p\, n_{trees})$ |
| Gradient Boosting ($n_{trees}$) | C+R | $O(n p\, n_{trees})$ | $O(p\, n_{trees})$ |
| Linear Regression | R | $O(p^2 n + p^3)$ | $O(p)$ |
| SVM (Kernel) | C+R | $O(n^2 p + n^3)$ | $O(n_{sv}\, p)$ |
| k-Nearest Neighbours (naive) | C+R | − | $O(np)$ |
| Nearest centroid | C | $O(np)$ | $O(p)$ |
| Neural Network | C+R | ? | $O(p\, n_{l_1} + n_{l_1} n_{l_2} + \dots)$ |
| Naive Bayes | C | $O(np)$ | $O(p)$ |

* Justifications
* Decision Tree based models
* Ensemble methods multiply the complexity of the base model by the number of “voters” in the ensemble, and replace the training size by the size of each bag.
* When training a decision tree, splits are searched for until a maximum depth $d$ has been reached.
* The strategy for finding a split is to examine, for each variable (there are $p$ of them), the different thresholds (there are up to $n$ of them) and the information gain achieved by each (evaluated in $O(n)$). This gives $O(n^2 p)$ operations per split.
* In the Breiman implementation, for classification, it is recommended to use $\sqrt{p}$ predictors for each (weak) classifier.
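
The exhaustive split search described above can be sketched as follows. This is only an illustration under stated assumptions: it scores splits with the sum of squared errors (a regression criterion) rather than information gain, the name `best_split` is hypothetical, and it is not how scikit-learn implements the search. It does make the three nested costs visible: $p$ features, up to $n$ thresholds, and an $O(n)$ evaluation for each.

```python
import numpy as np

def best_split(X, y):
    """Naive exhaustive split search: p features x up to n thresholds x O(n) scoring."""
    n, p = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, score)
    for j in range(p):                          # p candidate features
        for threshold in np.unique(X[:, j]):    # up to n candidate thresholds
            left = X[:, j] <= threshold
            right = ~left
            if not right.any():
                continue  # a split must leave points on both sides
            # O(n) evaluation: squared error around the mean of each side
            score = (((y[left] - y[left].mean()) ** 2).sum()
                     + ((y[right] - y[right].mean()) ** 2).sum())
            if score < best[2]:
                best = (j, threshold, score)
    return best

rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = (X[:, 1] > 0.5).astype(float)  # the target depends on feature 1 only
feature, threshold, score = best_split(X, y)
print(feature, threshold, score)
```

Restricting the outer loop to a random subset of $\sqrt{p}$ features is what changes the factor $p$ into $\sqrt{p}$ in the classification bound.
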
* Extremely random trees
* The search for the optimal split simply does not take place in the case of ERTs, which makes it much cheaper to find a (weaker) split.
* However, in my experience, ERT implementations are not much faster than RFs.
* Linear regressions
* Finding the vector of weights $\beta$ in a linear regression boils down to evaluating the following equation: $\beta = (X'X)^{-1}X'Y$.
* The most computationally intensive parts are evaluating the product $X'X$, which takes $p^2 n$ operations, and then inverting it, which takes $p^3$ operations.
* Though most implementations prefer to use gradient descent to solve the system of equations $(X'X)\beta = X'Y$, the complexity remains the same.
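
As a small illustration of where the two terms come from, the normal equations can be solved directly with NumPy. This is a sketch on synthetic data (the sizes, the noise level and the variable names are assumptions made for the example), and it uses `np.linalg.solve` instead of forming the inverse explicitly, which has the same $O(p^3)$ cost.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.random((n, p))
true_beta = np.arange(1.0, p + 1.0)
Y = X @ true_beta + 0.01 * rng.standard_normal(n)

gram = X.T @ X      # p x p matrix: O(p^2 n) operations
moment = X.T @ Y    # length-p vector: O(p n) operations
beta = np.linalg.solve(gram, moment)  # solves (X'X) beta = X'Y in O(p^3)

# Cross-check against NumPy's least-squares solver
beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
print(beta)
```
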
* Support Vector Machine
* For training, the classical algorithms require evaluating the kernel matrix $K$, whose general term is $K(x_i, x_j)$, where $K$ is the specified kernel.
* It is assumed that $K$ can be evaluated in $O(p)$, which holds for common kernels (Gaussian, polynomial, sigmoid…). This assumption may be wrong for other kernels.
* Then, solving the constrained quadratic program is “morally equivalent” to inverting a square matrix of size $n$, whose complexity is assumed to be $O(n^3)$.
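
To make the $O(n^2 p)$ term concrete, here is a minimal sketch of the kernel matrix computation for a Gaussian kernel. The function name `gaussian_kernel_matrix` and the `gamma` parameter are assumptions made for this example; the quadratic programming step is not shown.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2): n^2 entries, O(p) each."""
    sq_norms = (X ** 2).sum(axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives caused by rounding
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.random((100, 4))
K = gaussian_kernel_matrix(X, gamma=0.5)
print(K.shape)
```

The matrix has $n^2$ entries, so memory also grows quadratically with the sample size, independently of the solver used afterwards.
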
* k-Nearest Neighbours
* In its simplest form, given a new data point $x$, the kNN algorithm looks for the $k$ closest points to $x$ in the training data and returns the most common label (or the average of the targets for a regression problem).
* To achieve this, it is necessary to compute the distance between $x$ and every point in the data set, which amounts to $n$ distance evaluations. For the common distances (Euclidean, Manhattan…), each evaluation takes $O(p)$ operations. Note that kernel k-Nearest Neighbours has the same complexity (provided the kernel enjoys the same property).
* However, many implementations pre-train the kNN by indexing the points with quadtrees, which dramatically lowers the number of comparisons with the points in the training set.
* Likewise, for a sparse data set with an inverted index of the rows, it is not necessary to compute all the pairwise distances.
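
The naive prediction step can be sketched in a few lines. This is a toy illustration with the Euclidean distance and majority voting; the function name `knn_predict` and the tiny data set are made up for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Naive kNN classification: n distance evaluations, each in O(p)."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # O(n p) in total
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # most common label among them

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred_a = knn_predict(X_train, y_train, np.array([0.05, 0.05]))
pred_b = knn_predict(X_train, y_train, np.array([0.95, 0.95]))
print(pred_a, pred_b)
```

There is no training cost because nothing is precomputed; the whole $O(np)$ cost is paid at each prediction.
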
* The practical point of view
* All this is nice, but what about real-life examples? We focus on scikit-learn implementations below.
* The assumption is that the complexities take the form $O(n^\alpha p^\beta)$; $\alpha$ and $\beta$ are estimated with a log-log regression on fit times measured on randomly generated samples with varying $n$ and $p$.
* Though this assumption is wrong, it should give a better idea of how the algorithms work and reveal some implementation details and differences between the default settings of similar algorithms that one may overlook.
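
The log-log regression idea can be checked on synthetic numbers before timing any real model. In the sketch below the "timings" are generated from a known law $T = c\, n^\alpha p^\beta$ (the constants and grid sizes are arbitrary assumptions), so the regression should recover the exponents.

```python
import numpy as np

# Synthetic "timings" generated from a known law T = c * n^alpha * p^beta
true_alpha, true_beta, c = 2.0, 0.5, 1e-6
ns = np.array([500, 1000, 2000, 5000])
ps = np.array([5, 10, 20, 50])
N, P = np.meshgrid(ns, ps, indexing="ij")
T = c * N.ravel() ** true_alpha * P.ravel() ** true_beta

# log T = log c + alpha * log n + beta * log p, so least squares on the logs
# recovers the exponents as the coefficients of log n and log p
design = np.column_stack([np.ones(T.size), np.log(N.ravel()), np.log(P.ravel())])
log_c, alpha_hat, beta_hat = np.linalg.lstsq(design, np.log(T), rcond=None)[0]
print(alpha_hat, beta_hat)
```

With measured times the fit is of course noisier, and fixed overheads at small sizes tend to pull the estimated exponents down.
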
* The results

| Method | $\alpha$ | $\beta$ |
|---|---|---|
| RandomForestRegressor | 1.21 | 0.89 |
| ExtraTreesRegressor | 1.03 | 0.88 |
| AdaBoostRegressor | 0.71 | 1.01 |
| LinearRegression | 0.72 | 1.3 |
| SVR | 1.96 | 0.42 |
| RandomForestClassifier | 1.09 | 0.38 |
| ExtraTreesClassifier | 0.81 | 0.31 |
| AdaBoostClassifier | 0.89 | 0.79 |
| SVC | 2.05 | 0.52 |
| LogisticRegression(solver=liblinear) | 0.9 | 0.88 |
| LogisticRegression(solver=sag) | 1.03 | 0.95 |

* Surprisingly, some methods are sublinear in $n$. Perhaps the sample sizes were too small. As expected, the support vector methods show a complexity in $n$ that does not scale well with the sample size (an exponent close to 2).
* Another interesting point is the complexity in $p$ for the random forests and extra trees: the exponent in $p$ differs depending on whether we perform a regression or a classification. A short look at the documentation explains it: they have different default behaviors for each problem!



### The code
For those who would like to run the code on other algorithms, here is the method I used.

`max_features` : int, float, string or None, optional (default="auto")

The number of features to consider when looking for the best split:

* If int, then consider `max_features` features at each split.
* If float, then `max_features` is a percentage and `int(max_features * n_features)` features are considered at each split.
* If "auto", then `max_features=n_features`.
* If "sqrt", then `max_features=sqrt(n_features)`.
* If "log2", then `max_features=log2(n_features)`.
* If None, then `max_features=n_features`.

In [7]:
#from IPython.display import Image
#Image("../input/ml-cplexity/ml.JPG")

In [2]:
import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math

In [3]:



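class ComplexityEvaluator:
    """Estimate empirical exponents alpha and beta in fit time ~ n^alpha * p^beta."""

    def __init__(self, nrow_samples, ncol_samples):
        self._nrow_samples = nrow_samples  # sample sizes n to time
        self._ncol_samples = ncol_samples  # feature counts p to time

    def _time_samples(self, model, random_data_generator):
        """Time model.fit on every (n, p) combination."""
        rows_list = []
        for nrow in self._nrow_samples:
            for ncol in self._ncol_samples:
                train, labels = random_data_generator(nrow, ncol)

                # perf_counter is a monotonic, high-resolution clock
                start_time = time.perf_counter()
                model.fit(train, labels)
                elapsed_time = time.perf_counter() - start_time

                result = {"N": nrow, "P": ncol, "Time": elapsed_time}
                rows_list.append(result)

        return rows_list

    def Run(self, model, random_data_generator):
        data = pd.DataFrame(self._time_samples(model, random_data_generator))
        print(data)
        # Log-log regression: log(Time) = c + alpha * log(N) + beta * log(P)
        # (np.log replaces DataFrame.applymap, which is deprecated in recent pandas)
        data = np.log(data)
        linear_model = LinearRegression(fit_intercept=True)
        linear_model.fit(data[["N", "P"]], data[["Time"]])
        return linear_model.coef_  # shape (1, 2): [[alpha, beta]]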

In [26]:
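class TestModel:
    """Dummy model whose fit time is linear in n and independent of p."""

    def fit(self, x, y):
        # Sleep 1 ms per training row, so the expected exponents are alpha = 1, beta = 0
        time.sleep(x.shape[0] / 1000)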
        

In [27]:
def random_data_generator(n, p):
    return np.random.rand(n, p), np.random.rand(n, 1)

In [28]:
# After a small unit test, everything seems consistent.

In [16]:
if __name__ == "__main__":
    model = TestModel()
    nrow_samples = [200, 500, 1000, 2000, 3000]
    ncol_samples = [1, 5, 10]
    complexity_evaluator = ComplexityEvaluator(nrow_samples, ncol_samples)
    res = complexity_evaluator.Run(model, random_data_generator)

       N   P      Time
0    200   1  0.211877
1    200   5  0.204793
2    200  10  0.205135
3    500   1  0.502595
4    500   5  0.505322
5    500  10  0.503024
6   1000   1  1.003017
7   1000   5  1.007673
8   1000  10  1.010750
9   2000   1  2.015588
10  2000   5  2.006563
11  2000  10  2.014557
12  3000   1  3.010454
13  3000   5  3.007248
14  3000  10  3.009346


In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import SVR, SVC
from sklearn.linear_model import LogisticRegression

In [5]:
regression_models = [RandomForestRegressor(),
                     ExtraTreesRegressor(),
                     AdaBoostRegressor(),
                     LinearRegression(),
                     SVR()]

classification_models = [RandomForestClassifier(),
                         ExtraTreesClassifier(),
                         AdaBoostClassifier(),
                         SVC(),
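                         # liblinear is set explicitly to match the names list
                         # (the default solver is lbfgs since scikit-learn 0.22)
                         LogisticRegression(solver='liblinear'),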
                         LogisticRegression(solver='sag')]

In [6]:
names = ["RandomForestRegressor",
         "ExtraTreesRegressor",
         "AdaBoostRegressor",
         "LinearRegression",
         "SVR",
         "RandomForestClassifier",
         "ExtraTreesClassifier",
         "AdaBoostClassifier",
         "SVC",
         "LogisticRegression(solver=liblinear)",
         "LogisticRegression(solver=sag)"]

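In [9]:
# Using sample data to run on different models
sample_data = pd.read_csv('C:/Users/aozde/OneDrive/Documents/CapstoneProject/HouseData/Sample project/kc_house_data.csv')
sample_data = sample_data.select_dtypes(exclude="object")  # np.object was removed in NumPy 1.24
sample_data = sample_data.fillna(0)
features = sample_data.drop(columns="price")
target = sample_data["price"]

def house_data_generator(n, p):
    # Subsample n rows and the first p feature columns (assumes the file is large enough)
    rows = np.random.choice(len(features), size=n, replace=False)
    return features.iloc[rows, :p].values, target.iloc[rows].values

# The evaluator expects lists of sample sizes and feature counts, not the data itself
complexity_evaluator = ComplexityEvaluator([200, 500, 1000, 2000, 3000], [1, 5, 10])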

PermissionError: [Errno 13] Permission denied: 'C:/Users/aozde/OneDrive/Documents/CapstoneProject/HouseData/Sample project/kc_house_data.csv'

In [36]:
for name, model in zip(names, regression_models):
    # Run takes exactly two arguments: the model and the data generator
    res = complexity_evaluator.Run(model, random_data_generator)[0]
    print(name + ' | ' + str(round(res[0], 2)) +
          ' | ' + str(round(res[1], 2)))

### And some unit tests (these can simply be appended below the previous class).
### After a small unit test, everything seems consistent.

In [33]:
if __name__ == "__main__":
    class TestModel:

        def __init__(self):
            pass

        def fit(self, x, y):
            time.sleep(x.shape[0] / 1000.)

    def random_data_generator(n, p):
        return np.random.rand(n, p), np.random.rand(n, 1)

    model = TestModel()

    complexity_evaluator = ComplexityEvaluator(
            [200, 500, 1000, 2000, 3000], [1,5,10])

    res = complexity_evaluator.Run(model, random_data_generator)

    print(res)

       N   P      Time
0    200   1  0.214533
1    200   5  0.205450
2    200  10  0.202715
3    500   1  0.505632
4    500   5  0.508802
5    500  10  0.505620
6   1000   1  1.014447
7   1000   5  1.011487
8   1000  10  1.014487
9   2000   1  2.012260
10  2000   5  2.005218
11  2000  10  2.012006
12  3000   1  3.011629
13  3000   5  3.013196
14  3000  10  3.001961
[[ 0.9884102  -0.00523214]]


In [34]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import SVR, SVC
from sklearn.linear_model import LogisticRegression

# ComplexityEvaluator is the class defined earlier in this notebook,
# so it is used directly instead of being imported as a module.


def random_data_regression(n, p):
    return np.random.rand(n, p), np.random.rand(n)


def random_data_classification(n, p):
    return np.random.rand(n, p), np.random.binomial(1, 0.5, n)


regression_models = [RandomForestRegressor(),
                     ExtraTreesRegressor(),
                     AdaBoostRegressor(),
                     LinearRegression(),
                     SVR()]

# liblinear is set explicitly to match the names list
# (the default solver is lbfgs since scikit-learn 0.22)
classification_models = [RandomForestClassifier(),
                         ExtraTreesClassifier(),
                         AdaBoostClassifier(),
                         SVC(),
                         LogisticRegression(solver='liblinear'),
                         LogisticRegression(solver='sag')]

names = ["RandomForestRegressor",
         "ExtraTreesRegressor",
         "AdaBoostRegressor",
         "LinearRegression",
         "SVR",
         "RandomForestClassifier",
         "ExtraTreesClassifier",
         "AdaBoostClassifier",
         "SVC",
         "LogisticRegression(solver=liblinear)",
         "LogisticRegression(solver=sag)"]

complexity_evaluator = ComplexityEvaluator(
    [500, 1000, 2000, 5000, 10000, 15000, 20000],
    [5, 10, 20, 50, 100, 200])

# names lists the regression models first, then the classification models
i = 0
for model in regression_models:
    res = complexity_evaluator.Run(model, random_data_regression)[0]
    print(names[i] + ' | ' + str(round(res[0], 2)) +
          ' | ' + str(round(res[1], 2)))
    i = i + 1

for model in classification_models:
    res = complexity_evaluator.Run(model, random_data_classification)[0]
    print(names[i] + ' | ' + str(round(res[0], 2)) +
          ' | ' + str(round(res[1], 2)))
    i = i + 1

###  So let’s enjoy the number of algorithms offered by sklearn. The following list may be updated as new algorithms are tested.



In [None]:
# https://towardsdatascience.com/speed-up-jupyter-notebooks-20716cbe2025
# https://towardsdatascience.com/understanding-time-complexity-with-python-examples-2bda6e8158a7
# https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html
# https://people.duke.edu/~ccc14/sta-663/AlgorithmicComplexity.html