In this notebook we build the baseline models against which we will compare our image recognition models. We first restrict the data to rose observations that have been labeled as flowering or not. We then measure the accuracy of four different models. The first model simply predicts that observations from May, June, or July are in bloom and that all others are not. We then use five-fold cross-validation to estimate the generalization accuracy of the three other models listed below:

* Logistic Regression
* Random Forest Classifier
* Gradient Boosting Classifier

These three models are fit on the following variables from each observation: 

* Year
* Month
* Day 
* Latitude
* Longitude

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

#Get rid of SettingWithCopyWarnings
pd.options.mode.chained_assignment = None  

In [3]:
roseObs = pd.read_csv("../../Data/processedData/Roses/roseObs")
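# Drop rows with any missing value; .copy() gives an independent DataFrame we can safely add columns to.
labeledRoseObs = roseObs.dropna().copy()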

# Create a column for whether the observation is of a plant that is in bloom or not.
flowering_conditions = ['flowering', 'flowering|fruiting',
                        'flowering|fruiting|flower budding', 'flowering|flower budding']
labeledRoseObs["isFlowering"] = labeledRoseObs['reproductiveCondition'].isin(flowering_conditions).astype(int)
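
The labeling step above matches a fixed list of `reproductiveCondition` strings. As a minimal sketch of an alternative, assuming the field is a `|`-separated list of conditions, one could instead test whether `flowering` appears as one of the tokens. The helper name `has_flowering_token` and the toy values are hypothetical and are not taken from the rose data. Unlike the exact-match list, this version would also accept orderings such as `fruiting|flowering`, which may or may not occur in the real data.

```python
import pandas as pd

def has_flowering_token(condition):
    """Return True if 'flowering' is one of the '|'-separated tokens (hypothetical helper)."""
    if pd.isna(condition):
        return False
    return "flowering" in str(condition).split("|")

# Toy values for illustration only; they are not drawn from the actual dataset.
toy_conditions = pd.Series([
    "flowering",
    "fruiting",
    "flowering|fruiting",
    "flower budding",
    "fruiting|flowering",
    "no evidence of flowering",
])

# 1 if the observation is tagged as flowering, 0 otherwise.
toy_is_flowering = toy_conditions.map(has_flowering_token).astype(int)
print(toy_is_flowering.tolist())
```

Because the check is on whole tokens rather than substrings, values such as `flower budding` or `no evidence of flowering` are not counted as flowering.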

Maybe try using log10 in your analysis. We can also compute the correlation. We could compare the mean and median of variables as a function of the categorical variables. Even better, we can make a box-and-whisker plot.


In [7]:
kfold = KFold(n_splits=5, random_state=97, shuffle=True)
training_vars = ["year", "month", "day", "decimalLatitude", "decimalLongitude"]

#This will be an array that holds the generalization accuracy of each model on each of the different splits
accuracy = np.zeros((4, 5))

# Set a counter for what split we're on
i = 0

#For each split of the data, fit our four models and compute the generalization accuracy
for train_index, val_index in kfold.split(labeledRoseObs):
    
    roses_train = labeledRoseObs.iloc[train_index]
    roses_val = labeledRoseObs.iloc[val_index]
    
    # "Fit" and get validation error for the baseline model:
    # predict in bloom (1) for observations from May, June or July, and not in bloom (0) otherwise.
    # A better way to do this would be to check which months have a majority of observations in flower.
    baseline_pred = roses_val["month"].isin([5, 6, 7]).astype(int)
    
    #Compute the accuracy on the validation set
    accuracy[0, i] = 1 - sum(abs(np.array(roses_val["isFlowering"]) - np.array(baseline_pred)))/len(roses_val)
    
    ## Fit and get the accuracy on the validation set for the remaining models
    # isFlowering is binary, so the multi_class option is unnecessary (it is deprecated in recent scikit-learn releases)
    LR = LogisticRegression(random_state=0, solver='lbfgs')
    LR.fit(roses_train[training_vars], roses_train["isFlowering"])
    pred = LR.predict(roses_val[training_vars])
    accuracy[1,i] = 1 - sum(abs(np.array(roses_val["isFlowering"]) - np.array(pred)))/len(roses_val)
    
    RF = RandomForestClassifier(n_estimators=200, max_depth=400, random_state=97)
    RF.fit(roses_train[training_vars], roses_train["isFlowering"])
    pred = RF.predict(roses_val[training_vars])
    accuracy[2,i] = 1 - sum(abs(np.array(roses_val["isFlowering"]) - np.array(pred)))/len(roses_val)

    GB = GradientBoostingClassifier(n_estimators=400, max_leaf_nodes=10, max_depth=None, random_state=97, min_samples_split=10)
    GB.fit(roses_train[training_vars], roses_train["isFlowering"])
    pred = GB.predict(roses_val[training_vars])
    accuracy[3,i] = 1 - sum(abs(np.array(roses_val["isFlowering"]) - np.array(pred)))/len(roses_val)

    
    ## Increasing the counter
    i = i + 1

#Print out our accuracy table at the end.
accuracy

array([[0.78061674, 0.80264317, 0.81674009, 0.79118943, 0.77621145],
       [0.76740088, 0.76651982, 0.78414097, 0.77621145, 0.77268722],
       [0.86696035, 0.88193833, 0.88105727, 0.86079295, 0.85550661],
       [0.86343612, 0.87136564, 0.87312775, 0.85991189, 0.83964758]])
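
The comment in the baseline code notes that a better approach would be to check which months have a majority of observations in flower. A minimal sketch of that idea follows, assuming the bloom months are learned from the training split only and then applied to the validation split. The function names and the toy data are hypothetical; this has not been run on the rose observations.

```python
import pandas as pd

def fit_month_baseline(train, month_col="month", label_col="isFlowering"):
    """Return the set of months in which more than half of the training observations are in bloom."""
    bloom_rate = train.groupby(month_col)[label_col].mean()
    return set(bloom_rate[bloom_rate > 0.5].index)

def predict_month_baseline(val, bloom_months, month_col="month"):
    """Predict 1 (in bloom) for observations whose month is in bloom_months, else 0."""
    return val[month_col].isin(bloom_months).astype(int)

# Toy data for illustration only.
toy_train = pd.DataFrame({
    "month":       [4, 4, 5, 5, 5, 6, 6, 8, 8, 8],
    "isFlowering": [0, 0, 1, 1, 0, 1, 1, 0, 1, 0],
})
toy_val = pd.DataFrame({"month": [4, 5, 6, 8, 11]})

bloom_months = fit_month_baseline(toy_train)
toy_pred = predict_month_baseline(toy_val, bloom_months)
print(sorted(bloom_months), toy_pred.tolist())
```

Inside the cross-validation loop, `fit_month_baseline(roses_train)` would replace the hard-coded `[5, 6, 7]`. Months that never appear in the training split (such as 11 in the toy example) default to not in bloom.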

In [8]:
#Compute the mean accuracy of each row to determine which model performs best on average
np.mean(accuracy,axis = 1)

array([0.79348018, 0.77339207, 0.8692511 , 0.8614978 ])
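
The means alone do not show how much each model's accuracy varies from fold to fold. As a small sketch, the snippet below re-enters the accuracy table printed above and reports the sample standard deviation next to each mean. The `model_names` labels are added here for readability and follow the row order used in the loop (baseline, logistic regression, random forest, gradient boosting).

```python
import numpy as np

# Accuracy table copied from the output above: one row per model, one column per fold.
accuracy = np.array([
    [0.78061674, 0.80264317, 0.81674009, 0.79118943, 0.77621145],
    [0.76740088, 0.76651982, 0.78414097, 0.77621145, 0.77268722],
    [0.86696035, 0.88193833, 0.88105727, 0.86079295, 0.85550661],
    [0.86343612, 0.87136564, 0.87312775, 0.85991189, 0.83964758],
])
model_names = ["Month baseline", "Logistic Regression", "Random Forest", "Gradient Boosting"]

mean_acc = accuracy.mean(axis=1)
std_acc = accuracy.std(axis=1, ddof=1)  # sample standard deviation across the five folds

for name, mean, std in zip(model_names, mean_acc, std_acc):
    print(f"{name:<20} mean={mean:.4f}  std={std:.4f}")

best_model = model_names[int(np.argmax(mean_acc))]
print("Best mean accuracy:", best_model)
```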

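We see that the Random Forest Classifier performs best so far, with a mean generalization accuracy of almost 87%, followed closely by the Gradient Boosting Classifier at about 86%. Logistic regression (about 77%) does slightly worse than the simple month-based baseline (about 79%). The random forest will therefore be the model against which we compare our computer vision (CV) model.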
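
For reference, the same five-fold protocol can be written more compactly with scikit-learn's `cross_val_score`. This is a sketch on synthetic data, not a rerun of the analysis: the toy DataFrame, its value ranges, and the smaller `n_estimators=50` are assumptions chosen so the example is self-contained and quick to run. Only the column names and the `KFold` settings mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for labeledRoseObs (assumption: the real data has these columns).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "year": rng.integers(2015, 2023, n),
    "month": rng.integers(1, 13, n),
    "day": rng.integers(1, 29, n),
    "decimalLatitude": rng.uniform(30, 50, n),
    "decimalLongitude": rng.uniform(-120, -70, n),
})
# Toy label: in bloom exactly when the month is May, June or July.
toy["isFlowering"] = toy["month"].isin([5, 6, 7]).astype(int)

training_vars = ["year", "month", "day", "decimalLatitude", "decimalLongitude"]
kfold = KFold(n_splits=5, shuffle=True, random_state=97)

# One accuracy value per validation fold.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=97),
    toy[training_vars],
    toy["isFlowering"],
    cv=kfold,
    scoring="accuracy",
)
print(scores, scores.mean())
```

Passing the same `KFold` object to each model keeps the splits identical across models, which is what the explicit loop above achieves by fitting all four models inside each split.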