# Forest Cover Type Classification
Lab Assignment Three: Extending Logistic Regression

*Mark Brubaker*

## Business Understanding
This dataset consists of observations of the cover type in 30 x 30 meter cells in Roosevelt National Forest of northern Colorado. The cover type is a classification of the main type of tree cover in the area. Along with the cover type, 12 cartographic attributes were also recorded. Each of these attributes is described below.

1. Elevation: Elevation in meters
2. Aspect: Aspect in degrees azimuth
3. Slope: Slope in degrees
4. Horizontal_Distance_To_Hydrology: Horz Dist to nearest surface water features
5. Vertical_Distance_To_Hydrology: Vert Dist to nearest surface water features
6. Horizontal_Distance_To_Roadways: Horz Dist to nearest roadway
7. Hillshade_9am: Hillshade index at 9am, summer solstice
8. Hillshade_Noon: Hillshade index at noon, summer solstice
9. Hillshade_3pm: Hillshade index at 3pm, summer solstice
10. Horizontal_Distance_To_Fire_Points: Horz Dist to nearest wildfire ignition points
11. Wilderness_Area: 4 binary columns, 1 if observation is within 2000 meters of that wilderness area (1 max)
12. Soil_Type: 40 binary columns, 1 if observation is within 2000 meters of that soil type (1 max)

Both wilderness area and soil type are already one-hot encoded. Each of the wilderness area values represents a different part of the national forest. The soil types are based on the United States Forest Service Ecological Landtype Units (ELU). The ELU is a classification system that groups similar soils into broad categories based on climate zones and geologic zones. There is an argument to be made that soil types shouldn't be one-hot encoded, as these broad categories carry some relational data based on zone types. However, for the purposes of this lab, we will assume that the soil types are independent of each other, as creating a new encoding scheme that preserves the relationships between similar soils without creating any imbalances would be a significant undertaking and require a lot of domain knowledge.

The cover type is the target variable. There are seven possible cover types. The cover types are as follows:
1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz

Developing a use case for this data set is tricky because it is explicitly stated that none of the features were gathered remotely. Furthermore, any method that could be used to gather the feature data could just as easily be used to record the target variable. This makes using a classifier to predict the cover in an existing area largely pointless. Instead, a classifier could be used as a generative model for creating realistic worlds. Games like Dwarf Fortress pride themselves on being able to create complex worlds that are realistic and believable. Ecological research that requires a virtual forest could also generate forests this way. A classifier of this type could be used in coordination with existing world generation techniques to place forests in a more realistic manner.

Further adding to this use case, it would be very easy to collect the data needed to feed into a model from a game world without any cover type being present. Features like elevation, hillshade, and aspect could be derived from a height map and some simple math. The distance to water and roads could be calculated from existing world features. Even soil type could be decided based on other existing features like biome data. The only feature that would be difficult to gather would be the wilderness area. However, this could be solved by simply creating a new wilderness area for each new world that is generated. Finally, when generating the types of trees that are present in each tile/area of the world, the cover type probabilities could be used as a distribution over the types of trees that are present.
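The idea of treating the predicted probabilities as a distribution over tree types could be sketched roughly as follows. This is only an illustration and not part of the lab's code: the `sample_cover_types` helper and the example score rows are hypothetical, and because one-vs-rest scores need not sum to 1, the sketch renormalizes each row before sampling.

```python
import numpy as np

COVER_TYPES = ["Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine",
               "Cottonwood/Willow", "Aspen", "Douglas-fir", "Krummholz"]

def sample_cover_types(scores, rng):
    """Draw one cover type index per tile from per-tile class scores (hypothetical helper)."""
    probs = np.asarray(scores, dtype=float)
    # one-vs-rest scores need not sum to 1, so renormalize each row
    probs = probs / probs.sum(axis=1, keepdims=True)
    # inverse-CDF sampling: count how many cumulative bins the uniform draw has passed
    cdf = np.cumsum(probs, axis=1)
    draws = rng.random((probs.shape[0], 1))
    idx = (draws >= cdf).sum(axis=1)
    # guard against floating point round-off in the last cumulative bin
    return np.minimum(idx, probs.shape[1] - 1)

rng = np.random.default_rng(0)
# made-up scores for two tiles, one row per tile
tile_scores = np.array([
    [0.6, 0.3, 0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.1, 0.0, 0.4, 0.0],
])
picked = sample_cover_types(tile_scores, rng)
print([COVER_TYPES[i] for i in picked])
```

Sampling rather than always taking the most likely class would give mixed stands in tiles where two cover types are about equally plausible.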

While this use case diminishes the value of having a high accuracy, accuracy is still important because the goal of using such a complicated model to generate a world is to increase realism. If the accuracy is too low, then the fidelity of the forests in the world will be as well. In the game use case the bar is fairly low, as it takes a large error before a player notices the difference between a realistic forest and a fake one. While it is better to have a more realistic model, the consequences of a low accuracy or an incorrect guess are not significant. For use in game world generation, the target is an accuracy of ~70%. In the use case of ecological research, the bar is much higher. Incorrect guesses could have a significant impact on the research. If there were enough bad guesses, all conclusions from the research would be invalid. In this case, an accuracy of ~95% would be the minimum, with ~99% being ideal to get the highest confidence in the results.

Depending on the size of the map generated, the speed of the classification could be a factor. For game world generation, total times should be kept under ~30 seconds, and placing forests should only be a small percentage of this time. For ecological research, the speed of the classification is not as important, as models can be left to run for long periods of time. Most likely, in both of these cases any generation would be done using an already trained model, so the speed of the classification would not be a large factor.

Several studies have been done on classification for this dataset, mostly using two main techniques. The first is a linear discriminant analysis model, which achieved ~58.3% accuracy. The second is a traditional neural network, which achieved ~70.5% accuracy.

In [188]:
import pickle
import numpy as np
from scipy.stats import normaltest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression as SKLogisticRegression
from sklearn.metrics import accuracy_score
from scipy.special import expit
from scipy.optimize import minimize_scalar
import copy
from numpy import ma
import numpy.linalg
import time

LOAD_FROM_PICKLE = True
USE_SHRUNK_DATASET = True
USE_GPU = True

np.set_printoptions(precision=2, suppress=True)

All imports for the project are done here as well as some global variables that will be used throughout the project.

In [189]:
if USE_GPU:
    import cupy as cp
    xp = cp
else:
    xp = np

This selects whether the majority of operations will be done on the CPU or GPU. This works because cupy works as a drop-in replacement for numpy. If you want to use the GPU, you will need to install cupy, which requires CUDA. Which cupy installation you select depends on which version of CUDA you have. Instructions can be found here: https://docs.cupy.dev/en/stable/install.html.
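For readers without a CUDA setup, a more forgiving variant of this switch might look like the sketch below. It is not what the notebook does: the fallback logic and the `to_numpy` helper are assumptions added for illustration, built only on the fact that cupy arrays expose a `.get()` method that copies them back to the host.

```python
import numpy as np

USE_GPU = True  # same flag as in the imports cell

xp = np  # default to the CPU
if USE_GPU:
    try:
        import cupy as cp
        cp.zeros(1)  # forces a device allocation, fails without a usable GPU
        xp = cp
    except Exception:
        # cupy missing or no CUDA device available: stay on numpy
        xp = np

def to_numpy(a):
    """Return a NumPy array whether `a` lives on the GPU or the CPU (hypothetical helper)."""
    # cupy arrays expose .get(); numpy arrays do not
    return a.get() if hasattr(a, "get") else np.asarray(a)

v = xp.arange(4) ** 2
print(to_numpy(v))
```

A helper like `to_numpy` would also remove the need for separate `if USE_GPU:` branches wherever data has to be handed to scipy or scikit-learn.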

In [190]:
# load the data
if LOAD_FROM_PICKLE:
    with open('../Data/Pickle/cover_data.pickle', 'rb') as handle:
        data = np.load(handle, allow_pickle=True)

    print('Loaded data from pickle')
else:
    data = np.loadtxt('../Data/Cov_Type/covtype.data', delimiter=',')
    with open('../Data/Pickle/cover_data.pickle', 'wb') as handle:
        np.save(handle, data, allow_pickle=True)

Loaded data from pickle


The data can be loaded from a pickle object to save time.

In [191]:
# check for missing values
print('Number of missing values: {}'.format(np.sum(np.isnan(data))))

# check for duplicates as the number of unique rows should be equal to the number of rows
print('Number of duplicate rows: {}'.format(data.shape[0] - np.unique(data, axis=0).shape[0]))

# cp.unique along an axis is not implemented yet so it had to be done in numpy
if USE_GPU:
    # convert to cupy array
    data = cp.array(data)

Number of missing values: 0
Number of duplicate rows: 0


The data is checked for missing values and duplicates.

In [192]:
print('Data shape: {}'.format(data.shape))

# used for faster testing
if USE_SHRUNK_DATASET:
    # get the number of samples for the class with the fewest samples
    min_samples = xp.bincount(data[:, -1].astype(int), minlength=7)[1:].min()

    # get min number of samples for each class
    data = xp.concatenate([data[data[:, -1] == i][:min_samples] for i in range(1, 8)])

    print('New data shape:', data.shape)

# split the data into features and labels
X = data[:, :-1]
y = data[:, -1]

Data shape: (581012, 55)
New data shape: (19229, 55)


Here, if selected, the data is shrunk so that every class has the same number of observations. This is done to prevent the model from favoring the majority classes, and it also makes testing faster. The number of observations kept per class is determined by the size of the smallest class. Note that the cell above keeps the first min_samples rows of each class rather than drawing them at random. This leaves each class with 2747 observations, totaling 19229 observations. Even though this is a significant reduction in the size of the dataset, it is still more than large enough to train a model on. The data is then separated into features and labels. The split into training and testing sets, which is important so the model can be evaluated on data it has never seen before, happens a few cells later.
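If a random draw per class is wanted instead of the first rows, one way to do it is sketched below. The `balanced_subsample` helper and the toy array are hypothetical, and the sketch uses plain numpy rather than the notebook's `xp` alias.

```python
import numpy as np

def balanced_subsample(data, rng):
    """Keep an equal-sized random subset of rows for every class label (last column)."""
    labels = data[:, -1].astype(int)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()  # size of the smallest class
    keep = []
    for c in classes:
        rows = np.flatnonzero(labels == c)
        # draw without replacement so no row is duplicated
        keep.append(rng.choice(rows, size=n_keep, replace=False))
    return data[np.concatenate(keep)]

rng = np.random.default_rng(0)
# toy data: one feature column plus a label column with classes of size 7, 3 and 2
toy = np.column_stack([np.arange(12), np.array([1] * 7 + [2] * 3 + [3] * 2)])
small = balanced_subsample(toy, rng)
print(small.shape)  # (6, 2)
```

Fixing the generator's seed keeps the subsample reproducible between runs.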

In [193]:
if USE_GPU:
    X_norm_check = X.get()
else:
    X_norm_check = X

# check if the data is normally distributed
for i in range(X_norm_check.shape[1]):
    stat, p = normaltest(X_norm_check[:, i])
    if p > 0.05:
        print('Feature {} is normally distributed'.format(i))

# normalize the features
X[:, :10] = (X[:, :10] - X[:, :10].min(axis=0)) / (X[:, :10].max(axis=0) - X[:, :10].min(axis=0) + 1e-8)

First the features are checked to see if they have a normal distribution. This will affect whether the data is normalized or standardized. As can be seen by the lack of output, none of the features have a normal distribution. This means that the data will be normalized.

Next the quantitative data is normalized so that features with higher values don't dominate the model. This is done by rescaling the values so that they lie between 0 and 1. This is only done on the quantitative data so as not to affect the one-hot encoded data.
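As a small worked example of the min-max rescaling described above, the sketch below applies the same formula to a toy array. The `min_max_scale` helper and the toy values are made up for illustration; only the formula, including the small epsilon in the denominator, follows the cell above.

```python
import numpy as np

def min_max_scale(X, n_quant, eps=1e-8):
    """Rescale the first n_quant columns to [0, 1], leaving the rest untouched."""
    X = X.astype(float).copy()
    lo = X[:, :n_quant].min(axis=0)
    hi = X[:, :n_quant].max(axis=0)
    # eps avoids division by zero for a constant column
    X[:, :n_quant] = (X[:, :n_quant] - lo) / (hi - lo + eps)
    return X

# two quantitative columns followed by one binary (one-hot style) column
toy = np.array([[2000.0, 10.0, 1.0],
                [3000.0, 30.0, 0.0],
                [2500.0, 20.0, 1.0]])
scaled = min_max_scale(toy, n_quant=2)
print(scaled)
```

In a stricter setup the minimum and maximum would be computed on the training split only and then reused for the test split.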

In [194]:
# split the data into train and test
if USE_GPU:
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, stratify=y.get())
else:
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, stratify=y)

Because there is now equal representation among classes, using an 80/20 split for training and testing is acceptable. Furthermore, the split is stratified by class so that each class makes up the same proportion of both the training and testing data. 2747 instances of each class should be enough to ensure every class is well represented in both sets, but stratification guarantees it. This keeps any class from being under-represented during either training or evaluation.

In [195]:
# helper accuracy function
def print_accuracy(y, yhat):
    # score the labels that were passed in rather than the global y_test
    if USE_GPU:
        print('Accuracy of: ', round(accuracy_score(y.get(), yhat.get()) * 100, 3) , '%')
    else:
        print('Accuracy of: ', round(accuracy_score(y, yhat) * 100 , 3), '%')

def get_accuracy(y, yhat):
    # returns the accuracy as a percentage
    if USE_GPU:
        return round(accuracy_score(y.get(), yhat.get()) * 100, 6)
    else:
        return round(accuracy_score(y, yhat) * 100 , 6)

These functions make calculating/printing the accuracy of models easier and cleaner.

In [196]:
class BinaryLogisticRegression:
    # private:
    def __init__(self, eta, solver='base', iterations=20, C1=0.0, C2=0.0, line_iters=0, batch_size=0):
        self.eta = eta
        self.solver = solver
        self.iters = iterations
        self.C1 = C1
        self.C2 = C2
        self.line_iters = line_iters
        self.batch_size = batch_size
        # internally we will store the weights as self.w_ to keep with sklearn conventions

        solvers = ['base', 'line_search', 'stochastic', 'mini_batch', 'newton']
        if solver not in solvers:
            raise ValueError('solver %s is not one of %s' % (solver, solvers))
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'
    
    # convenience, private and static:
    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta) #1/(1+np.exp(-theta))
    
    @staticmethod
    def _add_bias(X):
        return xp.hstack((xp.ones((X.shape[0], 1)), X)) # add bias term

    # this defines the function with the first input to be optimized
    # therefore eta will be optimized, with all inputs constant
    @staticmethod
    def _line_search_objective_function(eta, X, y, w, grad):
        wnew = w - grad * eta
        g = expit(X @ wnew)

        if USE_GPU:
            g = g.get()
            y = y.get()

        # has to be run on the CPU because of the use of the ma module
        return -np.sum(ma.log(g[y == 1])) - np.sum(ma.log(1 - g[y == 0]))

    def _add_regularization(self, grad):
        L1 = self.C1 * xp.sign(self.w_[1:])
        L2 = self.C2 * -2 * self.w_[1:]
        grad[1:] += L1 + L2

        return grad

    def _get_gradient(self, X, y):
        match self.solver:
            case 'base' | 'line_search':
                ydiff = y - self.predict_proba(X, add_bias=False).ravel() # get y difference
                gradient = xp.mean(X * ydiff[:, xp.newaxis], axis=0) # make ydiff a column vector and multiply through
            case 'stochastic':
                idx = int(np.random.rand() * len(y)) # grab random instance
                ydiff = y[idx] - self.predict_proba(X[idx], add_bias=False) # get y difference
                gradient = X[idx] * ydiff[:, xp.newaxis] # make ydiff a column vector and multiply through
            case 'mini_batch':
                idx = np.random.choice(len(y), size=self.batch_size, replace=False) # grab random instance
                ydiff = y[idx] - self.predict_proba(X[idx], add_bias=False).ravel() # get y difference
                gradient = xp.mean(X[idx] * ydiff[:, xp.newaxis], axis=0) # make ydiff a column vector and multiply through
            case 'newton':
                g = self.predict_proba(X, add_bias=False).ravel() # get sigmoid value for all classes
                ydiff = y - g # get y difference
                gradient = xp.sum(X * ydiff[:, xp.newaxis], axis=0) # make ydiff a column vector and multiply through
                
                # the hessian has no L1 regularization and L2 will only be included if C2 is not 0
                hessian = (X * (g * (1 - g))[:, xp.newaxis]).T @ X - (2 * self.C2)# calculate the hessian
                # I swapped X.T @ np.diag(g*(1-g)) with (X * (g*(1-g))[:, xp.newaxis]).T
                # They work the same but the latter does the diagonal multiplication using way less memory
                # The reduction in memory allows the use of the GPU for the hessian calculation
                # otherwise my GPU runs out of memory

        
        gradient = gradient.reshape(self.w_.shape) # make gradient a column vector
        gradient = self._add_regularization(gradient)

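        # the hessian is special
        if self.solver == 'newton':
            # the cupy version of linalg.pinv is not stable so I'm just using the numpy version
            if USE_GPU:
                return cp.array(np.linalg.pinv(hessian.get(), hermitian=True)) @ gradient
            # on the CPU the hessian is already a numpy array
            return np.linalg.pinv(hessian, hermitian=True) @ gradient
        else:
            return gradient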
    
    # public:
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = xp.zeros((num_features, 1)) # init weight vector to zeros
        
        match self.solver:
            case 'base' | 'stochastic' | 'mini_batch' | 'newton':
                # for as many as the max iterations
                for _ in range(self.iters):
                    gradient = self._get_gradient(Xb, y)
                    self.w_ += gradient * self.eta # multiply by learning rate

            case 'line_search':
                for _ in range(self.iters):
                    gradient = -self._get_gradient(Xb, y)
                    # minimization in opposite direction
                    
                    # do line search in gradient direction, using scipy function
                    opts = {'maxiter':self.line_iters} # unclear exactly what this should be
                    res = minimize_scalar(self._line_search_objective_function, # objective function to optimize
                                        bounds=(0,self.eta * 10), #bounds to optimize
                                        args=(Xb, y, self.w_, gradient), # additional argument for objective function
                                        method='bounded', # bounded optimization for speed
                                        options=opts) # set max iterations
                    
                    eta = res.x # get optimal learning rate
                    self.w_ -= gradient * eta # set new function values
                    # subtract to minimize

        

    def predict_proba(self, X, add_bias=True):
        # add bias term if requested
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_) # return the probability y=1
    
    def predict(self,X):
        return (self.predict_proba(X) > 0.5) #return the actual prediction

This class handles the individual binary logistic regression models. Each solver type is handled differently in the _get_gradient function. Inside the _get_gradient function the _add_regularization function is called to add the L1 and L2 terms as required.
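The code comments in the class mention swapping `X.T @ np.diag(g*(1-g))` for a broadcast product to save memory. The sketch below checks that the two forms agree on synthetic data and then shows one common textbook way to write an L2-penalized Newton step. The second part is an assumed formulation for comparison, not a copy of the class above: it adds the penalty to the diagonal of the Hessian and, for brevity, penalizes the bias weight as well. All data and names here are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 3))])  # bias column + 3 features
w_true = np.array([0.5, 1.0, -2.0, 0.0])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)       # noisy synthetic labels

# 1) the two Hessian forms agree
g0 = sigmoid(X @ w_true)
d = g0 * (1 - g0)
H_diag = X.T @ np.diag(d) @ X                 # builds an n x n diagonal matrix
H_broadcast = (X * d[:, np.newaxis]).T @ X    # scales the rows of X, no n x n matrix
hessians_match = np.allclose(H_diag, H_broadcast)

# 2) an L2-penalized Newton step (assumed textbook formulation)
def penalized_loglik(w, C2):
    g = sigmoid(X @ w)
    return np.sum(y * np.log(g) + (1 - y) * np.log(1 - g)) - C2 * np.sum(w ** 2)

def newton_step(w, C2):
    g = sigmoid(X @ w)
    grad = X.T @ (y - g) - 2 * C2 * w  # gradient of the penalized log-likelihood
    # negative Hessian of the penalized log-likelihood: penalty goes on the diagonal
    H = (X * (g * (1 - g))[:, np.newaxis]).T @ X + 2 * C2 * np.eye(X.shape[1])
    return w + np.linalg.pinv(H, hermitian=True) @ grad

w = np.zeros(X.shape[1])
start = penalized_loglik(w, C2=0.1)
for _ in range(10):
    w = newton_step(w, C2=0.1)
end = penalized_loglik(w, C2=0.1)
print(hessians_match, end > start)
```

With roughly 15,000 training rows, the explicit diagonal matrix alone would need close to 2 GB in float64, which is why the broadcast form matters on a GPU with limited memory.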

In [197]:
class LogisticRegression:
    def __init__(self, eta, solver='base', iterations=20, C1=0.0, C2=0.0, line_iters=5, batch_size=5):
        self.eta = eta
        self.solver = solver
        self.iters = iterations
        self.C1 = C1
        self.C2 = C2
        self.line_iters = line_iters
        self.batch_size = batch_size
        # internally we will store the weights as self.w_ to keep with sklearn conventions

        solvers = ['base', 'line_search', 'stochastic', 'mini_batch', 'newton']
        if solver not in solvers:
            raise ValueError('solver %s is not one of %s' % (solver, solvers))
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.unique_ = xp.unique(y) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = [] # will fill this array with binary classifiers
        
        # create a classifier for each class
        for yval in self.unique_:
            blr = BinaryLogisticRegression(self.eta, self.solver, self.iters, self.C1, self.C2, self.line_iters, self.batch_size)
            self.classifiers_.append(blr)

            # create a binary label for each class and train the classifier
            y_binary = (y == yval)
            blr.fit(X, y_binary)

            
        # save all the weights into one matrix, separate column for each class
        self.w_ = xp.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self, X):
        probs = []
        for blr in self.classifiers_:
            probs.append(blr.predict_proba(X)) # get probability for each classifier
        
        return xp.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return self.unique_[xp.argmax(self.predict_proba(X), axis=1)] # take argmax along row
            


This is the main logistic regression class that is called by the user. It creates smaller binary logistic regression classifiers, one per class, to make its classifications.
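The one-vs-rest prediction step can be illustrated in isolation. In the sketch below the probability columns are made-up numbers standing in for the outputs of three binary classifiers, and `ovr_predict` is a hypothetical helper that mirrors the stacking and argmax logic of the class's predict method.

```python
import numpy as np

def ovr_predict(prob_columns, classes):
    """Pick, for each row, the class whose binary classifier is most confident."""
    probs = np.hstack(prob_columns)            # shape: (n_samples, n_classes)
    return classes[np.argmax(probs, axis=1)]   # map column index back to class label

classes = np.array([1, 2, 3])
# each column vector is P(y == class) from one binary classifier, for three samples
p1 = np.array([[0.9], [0.2], [0.1]])
p2 = np.array([[0.3], [0.7], [0.4]])
p3 = np.array([[0.1], [0.6], [0.8]])
predictions = ovr_predict([p1, p2, p3], classes)
print(predictions)  # [1 2 3]
```

Because each binary model is trained independently, the rows of the stacked matrix generally do not sum to 1, so only their relative order is used.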

In [198]:
%%time

lr_sk = SKLogisticRegression(solver='liblinear') # all params default

if USE_GPU:
    lr_sk.fit(X_train.get(), y_train.get())
    yhat = lr_sk.predict(X_test.get())
    print('Accuracy of: ', accuracy_score(y_test.get(), yhat))
else:
    lr_sk.fit(X_train, y_train)
    yhat = lr_sk.predict(X_test)
    print('Accuracy of: ', accuracy_score(y_test, yhat))    

Accuracy of:  0.6809672386895476
CPU times: user 342 ms, sys: 126 ms, total: 468 ms
Wall time: 288 ms


scikit-learn is used to create a baseline model to compare the custom models to.

In [199]:
solvers = ['base', 'line_search', 'stochastic', 'mini_batch', 'newton']
num_iters = [1000, 150, 1500, 200, 12] # each solver has a different number of base iterations
etas = [.001, .003, .01, .03, .1, .3, 1, 3] # each increase in eta is ~3x
Cs = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (0.01, 0.01)] # defines if we use L1 or L2 regularization or both

results = []
outputs = []

for solver in solvers:
    for eta in etas:
        for C1, C2 in Cs:
            # time the training in milliseconds
            start = round(time.time() * 1000)

            # the extra arguments will be ignored by the solvers that don't use them
            lr = LogisticRegression(eta, solver=solver, iterations=(num_iters[solvers.index(solver)]), C1=C1, C2=C2, line_iters=3, batch_size=500)
            lr.fit(X_train, y_train)
            yhat = lr.predict(X_test)

            end = round(time.time() * 1000)

            settings = solver + ' eta: ' + str(eta) + ' C1: ' + str(C1) + ' C2: ' + str(C2) + ' iters: ' + str(num_iters[solvers.index(solver)])
            accuracy = get_accuracy(y_test, yhat)
            time_taken = end - start

            # store the settings, accuracy, and time
            # print(settings + ' accuracy: ' + str(accuracy) + ' time: ' + str(time_taken) + 'ms')
            results.append((settings, accuracy, time_taken))
            outputs.append(yhat)

            # print progress
            print(str(round((len(results) / (len(solvers) * len(etas) * len(Cs))) * 100, 2)) + '% complete', end='\r')

100.0% complete

Here each solver is tested with different combinations of L1 and L2 regularization. I also tested different eta values for each model, as I found they could have a significant impact on the accuracy of the model. Overall, 160 models were tested (5 solvers × 8 eta values × 4 regularization settings). Running all of them on the reduced dataset and using the GPU takes ~5 minutes. Some data snooping was done to select the remaining values. This was done by manually testing values on individual solvers to get a feel for how the number of iterations would affect both accuracy and run time. I selected values by trying to find ones that produced a high accuracy while still being fast. The hard-coded values for line_iters and batch_size were selected the same way. My overall goal was to give every solver the best chance it could have without giving a large unfair advantage to one by letting it run significantly longer than the others.
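The grid itself can be enumerated without nested loops. The sketch below rebuilds the same combinations with `itertools.product` and confirms the count; the dictionary layout is only one possible way to hold the settings and is not how the cell above stores them.

```python
from itertools import product

solvers = ['base', 'line_search', 'stochastic', 'mini_batch', 'newton']
# per-solver iteration budgets, taken from the cell above
num_iters = dict(zip(solvers, [1000, 150, 1500, 200, 12]))
etas = [.001, .003, .01, .03, .1, .3, 1, 3]
Cs = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (0.01, 0.01)]

# one dictionary of settings per model to train
grid = [
    {'solver': s, 'eta': eta, 'C1': c1, 'C2': c2, 'iterations': num_iters[s]}
    for s, eta, (c1, c2) in product(solvers, etas, Cs)
]
print(len(grid))  # 160
```

Keying the iteration budget by solver name also avoids the repeated `solvers.index(solver)` lookup.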

In [200]:
# calculate the average accuracy and time for each solver
for solver in solvers:
    accuracies = []
    times = []
    for result in results:
        if result[0].startswith(solver):
            accuracies.append(result[1])
            times.append(result[2])

    print(solver + ' average accuracy: ' + str(round(sum(accuracies) / len(accuracies), 3)) + ' average time: ' + str(round(sum(times) / len(times))) + 'ms')

base average accuracy: 53.926 average time: 3325ms
line_search average accuracy: 52.656 average time: 2268ms
stochastic average accuracy: 48.412 average time: 2633ms
mini_batch average accuracy: 50.947 average time: 959ms
newton average accuracy: 55.053 average time: 179ms


Here the average accuracy and run time of each solver can be seen. The base solver and line_search produce similar results because they use a similar approach. Notably, the line search runs faster as it requires fewer iterations due to its adaptive step size. Stochastic and mini_batch are much the same, although the accuracy they reach is very dependent on the test/train split and the random values they select. As mini_batch takes the gradient over more samples, it performs better more consistently. Which of the first four solvers has the best accuracy can change depending on the test/train split, but base and line_search do better more often. Newton always performs both the best and the fastest. This is because of the small number of iterations required to reach a high accuracy and the fact that it uses second-order (curvature) information in addition to the gradient. This means that it is able to converge on a minimum much faster than the other solvers. It also benefits greatly from being able to run largely on the GPU. Fewer iterations means that memory is copied to and from the GPU less often, making it more efficient.

In [201]:
# get the best settings for each solver
for solver in solvers:
    best_accuracy = 0
    best_time = 0
    best_settings = ''
    for result in results:
        if result[0].startswith(solver):
            if result[1] > best_accuracy:
                best_accuracy = result[1]
                best_time = result[2]
                best_settings = result[0]
            elif result[1] == best_accuracy and result[2] < best_time:
                best_accuracy = result[1]
                best_time = result[2]
                best_settings = result[0]

    print('Best settings for ' + best_settings + ' accuracy: ' + str(round(best_accuracy, 3)) + ' time: ' + str(best_time) + 'ms')

Best settings for base eta: 3 C1: 0.0 C2: 0.0 iters: 1000 accuracy: 67.161 time: 3309ms
Best settings for line_search eta: 1 C1: 0.0 C2: 0.0 iters: 150 accuracy: 65.159 time: 2293ms
Best settings for stochastic eta: 3 C1: 0.01 C2: 0.0 iters: 1500 accuracy: 60.92 time: 2478ms
Best settings for mini_batch eta: 3 C1: 0.0 C2: 0.0 iters: 200 accuracy: 62.923 time: 945ms
Best settings for newton eta: 1 C1: 0.0 C2: 0.0 iters: 12 accuracy: 69.397 time: 172ms


When selecting the best solver, looking at the high end for each solver can give a better picture of the optimal setting. Here both base and line_search start to look a lot better. Despite this, newton always has the best model. One interesting thing to note is that higher eta values tend to do better. This is likely because they allow the model to move around the space quicker and sometimes even let it move out of local minima. Across numerous runs C2 is almost always 0 for the best models. C1 is often 0 but not always. This suggests that regularization is not especially helpful for this dataset.

In [202]:
# sort the results by (accuracy / time) and print the top 10
results.sort(key=lambda x: x[1] / x[2], reverse=True)
for i in range(10):
    print('Settings: ' + results[i][0] + ' accuracy: ' + str(round(results[i][1], 3)) + ' time: ' + str(results[i][2]) + 'ms')

Settings: newton eta: 1 C1: 0.0 C2: 0.0 iters: 12 accuracy: 69.397 time: 172ms
Settings: newton eta: 0.3 C1: 0.01 C2: 0.01 iters: 12 accuracy: 68.487 time: 170ms
Settings: newton eta: 0.3 C1: 0.0 C2: 0.0 iters: 12 accuracy: 68.565 time: 172ms
Settings: newton eta: 0.3 C1: 0.0 C2: 0.01 iters: 12 accuracy: 68.487 time: 174ms
Settings: newton eta: 0.3 C1: 0.01 C2: 0.0 iters: 12 accuracy: 68.565 time: 175ms
Settings: newton eta: 0.003 C1: 0.01 C2: 0.01 iters: 12 accuracy: 63.287 time: 164ms
Settings: newton eta: 0.1 C1: 0.0 C2: 0.01 iters: 12 accuracy: 65.211 time: 174ms
Settings: newton eta: 0.03 C1: 0.01 C2: 0.0 iters: 12 accuracy: 63.521 time: 171ms
Settings: newton eta: 0.001 C1: 0.01 C2: 0.01 iters: 12 accuracy: 63.287 time: 171ms
Settings: newton eta: 0.1 C1: 0.0 C2: 0.0 iters: 12 accuracy: 65.237 time: 177ms


One last metric I've decided to use is (accuracy / time). This shows which models give the best return for the time it took to train them. This metric is less important but still provides some interesting insight. Because of its significantly faster runtime, newton dominates this ranking.
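To see how this ratio separates the solvers, the sketch below applies it to the best run of each solver as printed earlier. The tuple layout is an assumption made for the example; the accuracy and time values are copied from that output.

```python
# (solver, accuracy in percent, training time in ms) for the best run of each solver
best_runs = [
    ('base', 67.161, 3309),
    ('line_search', 65.159, 2293),
    ('stochastic', 60.92, 2478),
    ('mini_batch', 62.923, 945),
    ('newton', 69.397, 172),
]

# higher is better: accuracy points gained per millisecond of training
ranked = sorted(best_runs, key=lambda r: r[1] / r[2], reverse=True)
for name, acc, ms in ranked:
    print(f'{name}: {acc / ms:.4f} accuracy points per ms')
```

Since the accuracies all sit within about ten points of each other while the times differ by more than an order of magnitude, the ranking is driven almost entirely by run time.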

If it isn't clear by now, newton is the best solver. To compare with scikit-learn I have selected the model with eta = 1 and C1 = C2 = 0. This is the model that has the best accuracy and a good (accuracy / time). This is the model that will be used for the rest of the project.

In [187]:
sk_times = []
sk_accuracies = []
times = []
accuracies = []

for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, stratify=y.get())

    start = round(time.time() * 1000)
    lr_sk = SKLogisticRegression(solver='liblinear')
    lr_sk.fit(X_train.get(), y_train.get())
    yhat = lr_sk.predict(X_test.get())
    end = round(time.time() * 1000)
    sk_times.append(end - start)
    sk_accuracies.append(get_accuracy(y_test, cp.array(yhat)))

    start = round(time.time() * 1000)
    lr = LogisticRegression(1, solver='newton', iterations=10, C1=0.0, C2=0.0)
    lr.fit(X_train, y_train)
    yhat = lr.predict(X_test)
    end = round(time.time() * 1000)
    times.append(end - start)
    accuracies.append(get_accuracy(y_test, yhat))

print('Average time for sklearn: ' + str(round(sum(sk_times) / len(sk_times))) + 'ms')
print('Average time for custom: ' + str(round(sum(times) / len(times))) + 'ms')
print('Average accuracy for sklearn: ' + str(round(sum(sk_accuracies) / len(sk_accuracies), 3)))
print('Average accuracy for custom: ' + str(round(sum(accuracies) / len(accuracies), 3)))

print('Best accuracy for sklearn: ' + str(round(max(sk_accuracies), 3)))
print('Best accuracy for custom: ' + str(round(max(accuracies), 3)))


Average time for sklearn: 311ms
Average time for custom: 196ms
Average accuracy for sklearn: 68.495
Average accuracy for custom: 69.555
Best accuracy for sklearn: 69.709
Best accuracy for custom: 71.243


In order to best compare my model to scikit-learn, I have performed more extensive testing using both. As the test/train split is random and accounts for a large amount of variance in the accuracy of both models, I ran them both 100 times, each with a new test/train split. This is done to get a better idea of how they compare by averaging out the effect of any single split.

As can be seen above, my model performs slightly better than scikit-learn's in accuracy. Although less important as both are fast, my model also consistently runs faster, averaging 196 ms against 311 ms (about 37% less time). The speed comparison isn't quite fair as my model runs on the GPU and theirs runs on the CPU.

One reason for my model's better performance over scikit-learn's might be its capacity for customization. scikit-learn's model is part of a large library and must be made in a way that performs well across a wide range of datasets. For my model, I was able to test many different options and select the best one based on its performance on this dataset. Because of this, my model is a better choice for this dataset.

---
Overall, when selecting a model to use for creating forests in realistic virtual environments, the best choice is difficult. Between scikit-learn and my model I would recommend mine every time. However, the existing neural net model has a higher reported accuracy at ~70.5%. The run time of this model is not known, but unless it is significantly slower than mine, speed would not be a major factor. One benefit my model has is relative simplicity and transparency. If these were desirable traits to a developer, the slightly lower accuracy could be worth it.

While either model would likely work well enough, I would ultimately recommend using the existing neural net approach for the higher accuracy. The accuracy of this model is high enough to meet the standards I laid out for use in realistic game world creation. Because no current approach is close to 90% accuracy, I would not recommend using any of these models in realistic virtual environments for the purpose of ecological research.

One final note: the neural net model was made in 2000 and advancements in machine learning have been made since then. It may be that the best choice would not be to use an existing model but to create a new one that takes advantage of newer techniques.