# Ideas:

Forecast into future (change age at obs)
Forecast rainy next 5 years (change age at obs and prcp)
Forecast change in temp (change age at obs and temp)
Identify if a specific species is at risk
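
Most of these ideas amount to editing one or two raw feature columns and re-running the fitted pipeline, then summarizing predictions by species. A minimal sketch of both steps, assuming the column names used later in this notebook (`age_at_obs`, `norm_prcp_mm_total`, `temp_avg_normal`). `make_scenario`, `share_at_risk`, and their parameters are hypothetical helpers, not part of the original workflow, and the shifts are meant to be applied to raw values before scaling.

```python
import pandas as pd


def make_scenario(df, years_ahead=0.0, prcp_factor=1.0, temp_delta=0.0):
    """Return a copy of df with raw features shifted for a what-if forecast.

    Hypothetical helper: apply it to the unscaled data, then run the
    already-fitted imputers, scalers, and encoders before predicting.
    """
    out = df.copy()
    out["age_at_obs"] = out["age_at_obs"] + years_ahead                  # trees get older
    out["norm_prcp_mm_total"] = out["norm_prcp_mm_total"] * prcp_factor  # wetter or drier
    out["temp_avg_normal"] = out["temp_avg_normal"] + temp_delta         # warmer or cooler
    return out


def share_at_risk(names, predicted_condition, threshold=2):
    """Share of trees per species predicted at or below `threshold`."""
    flagged = pd.Series(predicted_condition, index=names.index) <= threshold
    return flagged.groupby(names).mean().sort_values(ascending=False)


# Toy rows standing in for the real data.
trees = pd.DataFrame({
    "common_name": ["Red maple", "Red maple", "Norway maple"],
    "age_at_obs": [10.0, 20.0, 30.0],
    "norm_prcp_mm_total": [1000.0, 1000.0, 1000.0],
    "temp_avg_normal": [53.2, 53.2, 53.2],
})
rainy = make_scenario(trees, years_ahead=5, prcp_factor=1.1)
risk = share_at_risk(trees["common_name"], [2, 4, 1])  # made-up predictions
print(rainy)
print(risk)
```

The threshold of 2 for "at risk" is an arbitrary choice for illustration and would need to be set with the condition scale in mind.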


# 5 Modeling

## 5.0 Introduction

Now that our model is established, I want to do some exploring. First, I will load the full sample (training and testing data combined) and fit the model on all of it. Unfortunately, that means repeating the full imputing, SMOTE (to address imbalanced classes), scaling, and encoding steps to keep things consistent.

Once that full sample is prepped, we can fit on it and then predict using our *ENTIRE DATAFRAME*. That's a lot of data, so let's hope it goes well. From there we can do some mapping and modeling. Should be fun!

## 5.1 Imports

In [77]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTENC

import geopandas as gpd
from shapely.geometry import Point, Polygon

## 5.2 Load and process

### 5.2.0 Load Data
Load the raw X and y data we saved at the end of the last step. These are our full 20,000 sampled records.

In [78]:
X = pd.read_csv('../data/data_outputs/X_full_sample.csv')
y = pd.read_csv('../data/data_outputs/y_full_sample.csv')

print(X.shape, y.shape)

(20000, 10) (20000, 1)


In [79]:
#copy X and y to new dataframes
X_full = X.reset_index(drop=True)
y_full = y.reset_index(drop=True)

### 5.2.1 Process Data for Fitting Model

#### 5.2.1.0 Initialize Imputers, Scalers, and Encoders

In [80]:
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
ss_scaler = StandardScaler()
pow_trans = PowerTransformer()
ohe1 = OneHotEncoder(handle_unknown='ignore', sparse_output=False) #one for our common_name column
ohe2 = OneHotEncoder(handle_unknown='ignore', sparse_output=False) #one for our native column

#### 5.2.1.1 Create and Apply Name Grouping Function

In [81]:
def group_categories(df, col, n_limit):
    """Overwrite categories in df[col] that appear fewer than n_limit times with 'Other' (modifies df in place)."""
    groups = df[col]
    group_counts = groups.value_counts()
    mask = groups.isin(group_counts[group_counts<n_limit].index)
    df.loc[mask, col] = 'Other'
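
To see what the grouping does, here is a small run on made-up species counts (the toy frame is an assumption for illustration, and the function is restated so the snippet runs on its own). Note that it edits the DataFrame in place and returns nothing.

```python
import pandas as pd


def group_categories(df, col, n_limit):
    """Overwrite categories in df[col] that occur fewer than n_limit times with 'Other'."""
    group_counts = df[col].value_counts()
    rare = group_counts[group_counts < n_limit].index
    df.loc[df[col].isin(rare), col] = "Other"


# Toy data: two common species and two rare ones.
toy = pd.DataFrame({"common_name": ["Red maple"] * 3 + ["Apple"] * 2 + ["Ginkgo", "Larch"]})
group_categories(toy, "common_name", 2)  # anything seen fewer than 2 times becomes 'Other'
counts = toy["common_name"].value_counts().to_dict()
print(counts)
```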

In [82]:
#Run group categories on common_name, here using 10 since the data is larger
group_categories(X_full, 'common_name', 10)

In [83]:
#View Value Counts
X_full['common_name'].value_counts()

Other                      1092
Red maple                   696
Apple/crabapple             566
Purpleleaf plum variety     565
Norway maple                542
                           ... 
California hazel             10
Holm (holly) oak             10
River birch                  10
Ponderosa pine               10
Norway spruce                10
Name: common_name, Length: 263, dtype: int64

#### 5.2.1.2 Impute Missing Values

In [84]:
#Prep for SMOTE: impute missing values in the numeric and categorical columns
X_full['age_at_obs'] = num_imputer.fit_transform(X_full[['age_at_obs']])
X_full['common_name'] = cat_imputer.fit_transform(X_full[['common_name']])

In [85]:
X_full.shape

(20000, 10)

#### 5.2.1.3 SMOTE X Data to Address Class Imbalance

In [86]:
#View our balance of classes
y_full.value_counts() / y_full.shape[0] * 100

condition_index
4.0                55.325
3.0                24.135
5.0                13.510
2.0                 5.420
1.0                 1.610
dtype: float64

In [87]:
#Use SMOTENC, flagging the categorical columns by position (0 = common_name, 2 = native)
sm = SMOTENC(random_state=42, categorical_features=[0,2])
X_smote, y_smote = sm.fit_resample(X_full, y_full)

In [88]:
#view new resampled distribution
print(y_smote.value_counts() / y_smote.shape[0] * 100)

print(X_smote.shape)

condition_index
1.0                20.0
2.0                20.0
3.0                20.0
4.0                20.0
5.0                20.0
dtype: float64
(55325, 10)
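
The resampled row count follows directly from the class percentages printed earlier: with its default settings, SMOTE raises every minority class to the size of the largest one. A quick arithmetic check, using only numbers printed in this notebook:

```python
# Class shares (percent) printed before resampling, out of 20,000 rows.
n_rows = 20_000
class_pct = {4.0: 55.325, 3.0: 24.135, 5.0: 13.510, 2.0: 5.420, 1.0: 1.610}

# Convert percentages back to row counts.
class_counts = {label: round(n_rows * pct / 100) for label, pct in class_pct.items()}

# Every minority class is oversampled up to the majority count.
majority = max(class_counts.values())
resampled_rows = majority * len(class_counts)

print(class_counts)
print(majority, resampled_rows)
```

This matches the per-class support of 11,065 reported in the training classification report further down.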


#### 5.2.1.4 Scale Numerical Values

In [89]:
#Standard Scaler
ss_cols = ['age_at_obs', 'norm_prcp_mm_total', 'temp_avg_normal', 'prcp_mm_normal']
X_smote[ss_cols] = ss_scaler.fit_transform(X_smote[ss_cols])

#Power Transformer
pow_cols = ['diameter_breast_height_CM', 'norm_snow_mm_total', 'distance_between', 'adj_reports']
X_smote[pow_cols] = pow_trans.fit_transform(X_smote[pow_cols])
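
The split between the two transformers is presumably driven by skew: `StandardScaler` only shifts and rescales, while `PowerTransformer` (Yeo-Johnson by default) also reshapes the distribution. A minimal sketch on synthetic right-skewed data, which is an assumption standing in for a column like `diameter_breast_height_CM`; the `skewness` helper is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler


def skewness(values):
    """Sample skewness: the third standardized moment."""
    values = np.asarray(values, dtype=float).ravel()
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Synthetic right-skewed feature (log-normal), not the real tree data.
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=(2000, 1))

standardized = StandardScaler().fit_transform(skewed)
power_transformed = PowerTransformer().fit_transform(skewed)

print(f"raw skew:               {skewness(skewed):.2f}")
print(f"after StandardScaler:   {skewness(standardized):.2f}")
print(f"after PowerTransformer: {skewness(power_transformed):.2f}")
```

Standardizing leaves the skew unchanged, which is why heavily skewed columns are often handled separately.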


In [90]:
X_smote.head()

Unnamed: 0,common_name,diameter_breast_height_CM,native,age_at_obs,adj_reports,norm_prcp_mm_total,norm_snow_mm_total,distance_between,temp_avg_normal,prcp_mm_normal
0,Norwegian sunset maple,1.001507,introduced,0.351846,0.293119,-1.626061,0.576894,-0.767596,1.283766,-1.212323
1,Red maple,1.273186,introduced,0.628893,0.717425,-0.70773,-0.994286,-1.326386,-0.784582,0.770351
2,Paperbark maple,-1.156845,introduced,-1.550908,-1.389654,-0.097643,-0.994286,0.911477,-0.784582,0.770351
3,Cleveland norway maple,1.188151,introduced,-0.407873,0.868185,-1.212498,-0.994286,-0.628394,1.283766,-1.212323
4,Other,-0.741043,no_info,-2.177863,0.918547,0.795283,1.517,0.496514,1.283766,-1.212323


#### 5.2.1.5 Encode Categorical Variables

In [91]:
#reset index to create clean join after encoding
X_smote.reset_index(drop=True, inplace=True)

#transform our common_name field with ohe and put into dataframe
X_name = ohe1.fit_transform(X_smote[['common_name']])
name_df = pd.DataFrame(X_name, columns=ohe1.categories_[0]) #indexing 0 here to grab only the column names

#transform our native field with ohe and put into dataframe
X_native = ohe2.fit_transform(X_smote[['native']])
native_df = pd.DataFrame(X_native, columns=ohe2.categories_[0])

#concat the two encoded dataframes onto X_smote to create a new dataframe
X_smote_scaled_coded = pd.concat([X_smote, name_df, native_df], axis=1)

#drop original categorical fields
X_smote_scaled_coded.drop(columns=['common_name', 'native'], inplace=True)

In [92]:
#review shape of created variables
print(X_smote.shape,name_df.shape,native_df.shape)
print(X_smote_scaled_coded.shape)
print(y_smote.shape)

(55325, 10) (55325, 264) (55325, 3)
(55325, 275)
(55325, 1)


### 5.2.2 Fit Model on Our Full Data Sample

#### 5.2.2.0 Setup Model

We'll use the same parameters as our final model from pre-processing. I didn't save that model and import it here because the file was too large, so we rebuild it from its parameters instead.

In [93]:
model = RandomForestClassifier(n_estimators=1000, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', max_depth=40, bootstrap=True, random_state=42)
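
If file size is the only obstacle to saving the model, compressed serialization may help. A minimal sketch using `joblib` (installed alongside scikit-learn). The toy data, the small forest, and the file name are assumptions for illustration, and the achievable size for a 1,000-tree forest is untested here.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real training data.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
y_toy = rng.integers(1, 6, size=200)  # labels 1..5, like condition_index

small_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.joblib")   # hypothetical file name
    joblib.dump(small_model, path, compress=3)        # compression trades speed for size
    size_kb = os.path.getsize(path) / 1024
    restored = joblib.load(path)

same_predictions = bool((restored.predict(X_toy) == small_model.predict(X_toy)).all())
print(f"{size_kb:.0f} KB on disk, predictions match: {same_predictions}")
```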

#### 5.2.2.1 Fit on Full 20k Sample

In [94]:
model.fit(X_smote_scaled_coded, y_smote.values.ravel()) #values.ravel will flatten the array.

#### 5.2.2.2 View Classification Result

In [95]:
#predict once and reuse; the metric functions expect (y_true, y_pred)
train_pred = model.predict(X_smote_scaled_coded)
print(f'Accuracy on training data: {accuracy_score(y_smote, train_pred):.2f}')
print("Classification Report for Training Data")
print(classification_report(y_smote, train_pred))

Accuracy on training data: 0.99
Classification Report for Training Data
              precision    recall  f1-score   support

         1.0       1.00      1.00      1.00     11065
         2.0       1.00      0.99      1.00     11065
         3.0       0.99      0.99      0.99     11065
         4.0       0.98      0.99      0.99     11065
         5.0       0.99      0.99      0.99     11065

    accuracy                           0.99     55325
   macro avg       0.99      0.99      0.99     55325
weighted avg       0.99      0.99      0.99     55325



Now that we've fit on the combined train and test sample, the model is nearly perfect on the SMOTE-resampled 1s and 2s and slightly weaker on the more common classes (3 through 5). Some further tuning might help with those common classes, but I will leave it for now. We can move on to applying this model to our full data set. Woohoo!
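
Since the report above is computed on the same rows the model was fit on, it says little about unseen data. A minimal sketch of a held-out estimate with `cross_validate`, on synthetic data from `make_classification` (an assumption standing in for the real features). In the real workflow, SMOTE would need to be applied inside each training fold rather than before splitting, which this sketch does not show.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic 5-class problem; sizes are arbitrary.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=5, random_state=42
)

toy_model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_validate(
    toy_model, X_toy, y_toy, cv=5,
    scoring=["accuracy", "f1_macro"],
    return_train_score=True,   # keep train scores to see the optimism gap
)

print(f"train accuracy: {scores['train_accuracy'].mean():.2f}")
print(f"test accuracy:  {scores['test_accuracy'].mean():.2f}")
print(f"test macro F1:  {scores['test_f1_macro'].mean():.2f}")
```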

## 5.3 Apply Model to Full Data

### 5.3.1 Load Full Data

In [108]:
seattle_trees_all = pd.read_csv('../data/data_outputs/seattle_trees_full.csv')

seattle_trees_all.head(3)

Unnamed: 0,common_name,diameter_breast_height_CM,native,age_at_obs,condition_index,adj_reports,norm_prcp_mm_total,norm_snow_mm_total,distance_between,temp_avg_normal,prcp_mm_normal
0,(european) white birch,40.64,introduced,27.765115,5.0,237,1071.925479,0.0,0.947927,53.2,960.628
1,Kwanzan flowering cherry,5.08,no_info,27.743212,3.0,237,1071.925479,0.0,3.367105,53.2,960.628
2,Japanese snowbell tree,2.54,introduced,27.756901,5.0,237,1071.925479,0.0,1.14569,53.2,960.628


In [109]:
seattle_trees_all.shape

(158004, 11)

#### 5.3.1.1 Split Dependent and Independent Variables

In [116]:
seattle_X = seattle_trees_all.drop(columns='condition_index')
seattle_y = seattle_trees_all['condition_index']

### 5.3.2 Pre-Process Data

#### 5.3.2.0 Group Name Categories

In [117]:
#Run group categories on common_name, here using 10 since the data is larger
group_categories(seattle_X, 'common_name', 10)
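
One thing to watch here: re-running the count threshold on the full data can keep species that were folded into 'Other' in the 20k sample. The fitted encoder uses `handle_unknown='ignore'`, so those names would be encoded as all zeros rather than as 'Other'. A hedged alternative is to map any name the training data did not keep to 'Other'. `align_to_known` is a hypothetical helper, and the toy values are made up.

```python
import pandas as pd


def align_to_known(values, known_categories, fallback="Other"):
    """Replace any non-missing value not in known_categories with fallback."""
    values = pd.Series(values)
    unseen = values.notna() & ~values.isin(list(known_categories))
    return values.mask(unseen, fallback)


# Categories the encoder was fit on (in the notebook: ohe1.categories_[0]).
known = ["Other", "Red maple", "missing"]

full_names = pd.Series(["Red maple", "Ponderosa pine", None, "Other"])
aligned = align_to_known(full_names, known)
print(aligned.tolist())
```

Missing values are left alone so the fitted imputer can still fill them in the next step.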

#### 5.3.2.1 Impute Missing Values

In [118]:
#impute missing values
seattle_X['age_at_obs'] = num_imputer.transform(seattle_X[['age_at_obs']])
seattle_X['common_name'] = cat_imputer.transform(seattle_X[['common_name']])

In [119]:
seattle_X.shape

(158004, 10)

#### 5.3.2.2 Scale Numerical Values

In [120]:
#Standard Scaler (already fit on the sample, so transform only)
ss_cols = ['age_at_obs', 'norm_prcp_mm_total', 'temp_avg_normal', 'prcp_mm_normal']
seattle_X[ss_cols] = ss_scaler.transform(seattle_X[ss_cols])

#Power Transformer (already fit on the sample, so transform only)
pow_cols = ['diameter_breast_height_CM', 'norm_snow_mm_total', 'distance_between', 'adj_reports']
seattle_X[pow_cols] = pow_trans.transform(seattle_X[pow_cols])


#### 5.3.2.3 Encode Categorical Values

In [121]:
#reset index to create clean join after encoding
seattle_X.reset_index(drop=True, inplace=True)

#transform our common_name field with ohe and put into dataframe
seattle_name = ohe1.transform(seattle_X[['common_name']])
seattle_name_df = pd.DataFrame(seattle_name, columns=ohe1.categories_[0]) #indexing 0 here to grab only the column names

#transform our native field with ohe and put into dataframe
seattle_native = ohe2.transform(seattle_X[['native']])
seattle_native_df = pd.DataFrame(seattle_native, columns=ohe2.categories_[0])

#concat the two encoded dataframes onto seattle_X to create a new dataframe
seattle_scaled_coded = pd.concat([seattle_X, seattle_name_df, seattle_native_df], axis=1)

#drop original categorical fields
seattle_scaled_coded.drop(columns=['common_name', 'native'], inplace=True)

In [122]:
#review shape of created variables
print(seattle_X.shape,seattle_name_df.shape,seattle_native_df.shape)
print(seattle_scaled_coded.shape)
print(seattle_y.shape)

(158004, 10) (158004, 264) (158004, 3)
(158004, 275)
(158004,)
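
Matching shapes are a good sign, but the model also depends on the columns being in the same order as at fit time, and recent scikit-learn versions compare column names when a model fit on a DataFrame is asked to predict. A small sanity check before predicting can make a mismatch easier to diagnose. A sketch on toy frames (the frames and the helper name `check_same_columns` are assumptions):

```python
import pandas as pd


def check_same_columns(train_df, new_df):
    """Return columns missing from new_df, unexpected in new_df, and whether order matches."""
    missing = [c for c in train_df.columns if c not in new_df.columns]
    extra = [c for c in new_df.columns if c not in train_df.columns]
    same_order = list(train_df.columns) == list(new_df.columns)
    return missing, extra, same_order


train_toy = pd.DataFrame(columns=["age_at_obs", "Other", "Red maple", "introduced"])
full_toy = pd.DataFrame(columns=["age_at_obs", "Red maple", "Other", "introduced"])

missing, extra, same_order = check_same_columns(train_toy, full_toy)
print(missing, extra, same_order)

# Reordering to the training layout is safe once nothing is missing or extra.
full_toy = full_toy[train_toy.columns]
```

In the notebook, the two frames to compare would be `X_smote_scaled_coded` and `seattle_scaled_coded`.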


### 5.3.3 Run Model on Full Seattle Data And Review Results

#### 5.3.3.0 Predict Using Model

In [123]:
#predict using tuned model
y_pred = model.predict(seattle_scaled_coded)

print(f'Accuracy on full Seattle data: {accuracy_score(seattle_y, y_pred):.2f}')
print("Classification Report for Full Seattle Data")
print(classification_report(seattle_y, y_pred))
print('----------------------------------------')
print(confusion_matrix(seattle_y, y_pred))

Accuracy on full Seattle data: 0.65
Classification Report for Full Seattle Data
              precision    recall  f1-score   support

         1.0       0.21      0.28      0.24      2806
         2.0       0.27      0.30      0.28      8268
         3.0       0.53      0.49      0.51     37485
         4.0       0.77      0.75      0.76     87557
         5.0       0.61      0.67      0.64     21888

    accuracy                           0.65    158004
   macro avg       0.48      0.50      0.49    158004
weighted avg       0.65      0.65      0.65    158004

----------------------------------------
[[  775   306   587   809   329]
 [  365  2472  2461  2351   619]
 [  890  2972 18510 12402  2711]
 [ 1321  2978 11789 65814  5655]
 [  343   499  1840  4543 14663]]
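
Since `condition_index` is ordinal, it can help to ask how often a prediction lands within one class of the true label, and to read recall straight from the rows of the confusion matrix. A short sketch using the matrix printed above (rows are true classes 1 to 5, columns are predictions); the "within one" measure is an added illustration, not something computed in the original notebook.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([
    [775, 306, 587, 809, 329],
    [365, 2472, 2461, 2351, 619],
    [890, 2972, 18510, 12402, 2711],
    [1321, 2978, 11789, 65814, 5655],
    [343, 499, 1840, 4543, 14663],
])

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column-wise)

# Share of predictions that land within one class of the true label.
rows, cols = np.indices(cm.shape)
within_one = cm[np.abs(rows - cols) <= 1].sum() / cm.sum()

print(f"accuracy:   {accuracy:.2f}")
print("recall:    ", np.round(recall, 2))
print("precision: ", np.round(precision, 2))
print(f"within one: {within_one:.2f}")
```

The recomputed accuracy and recall values agree with the classification report, which is a useful check that the matrix was copied correctly.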
