# Ideas:

Forecast into future (change age at obs)
Forecast rainy next 5 years (change age at obs and prcp)
Forecast change in temp (change age at obs and temp)
Identify if a specific species is at risk
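
The first three ideas share one mechanic: copy the raw feature rows, change one or two columns, then send the copy through the same preprocessing before predicting. Below is a minimal sketch of the column-shifting step. The helper `shift_features` and the toy rows are hypothetical and not part of the notebook; the column names are the ones used later in this section.

```python
import pandas as pd

def shift_features(df, shifts):
    """Return a copy of df with each column in `shifts` increased by the given amount.

    Hypothetical helper for illustration; the original dataframe is left untouched.
    """
    out = df.copy()
    for col, delta in shifts.items():
        out[col] = out[col] + delta
    return out

# Toy rows standing in for the raw (unscaled) features
trees = pd.DataFrame({
    'age_at_obs': [10.0, 25.0],
    'temp_avg_normal': [11.5, 12.0],
})

# "Forecast into future": every tree is 5 years older
future = shift_features(trees, {'age_at_obs': 5})

# "Forecast change in temp": 5 years older and 1 degree warmer
warmer = shift_features(trees, {'age_at_obs': 5, 'temp_avg_normal': 1.0})

print(future)
print(warmer)
```

A shifted copy like this would still need the fitted imputers, scalers, and encoders applied with `transform` (not `fit_transform`) before it reaches the model.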


# 5 Modeling

## 5.0 Introduction

Now that we've got our model established, I want to do some exploring. First, I will load my full sample (training and testing data together) and fit the model on all of it. Unfortunately, to keep things consistent with the earlier steps, that means repeating the full preprocessing: imputing, SMOTE (to address imbalanced classes), scaling, and encoding.

Once the full sample is prepped, we can fit on it and then predict using our *ENTIRE DATAFRAME*. That's a lot of data, so let's hope it goes well. From there we can do some mapping and modeling. Should be fun!

## 5.1 Imports

In [2]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTENC

import geopandas as gpd
from shapely.geometry import Point, Polygon

## 5.2 Load and process

### 5.2.0 Load Data
Load in the raw X and y data we saved at the end of our last step. This is our full sample of 20k records.

In [13]:
X = pd.read_csv('../data/data_outputs/X_full_sample.csv')
y = pd.read_csv('../data/data_outputs/y_full_sample.csv')

print(X.shape, y.shape)

(20000, 10) (20000, 1)


In [14]:
#copy X and y to new dataframes
X_full = X.reset_index(drop=True)
y_full = y.reset_index(drop=True)

### 5.2.1 Process Data for Fitting Model

#### 5.2.1.0 Initialize Imputers, Scalers, and Encoders

In [15]:
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
ss_scaler = StandardScaler()
pow_trans = PowerTransformer()
ohe1 = OneHotEncoder(handle_unknown='ignore', sparse_output=False) #one for our common_name column
ohe2 = OneHotEncoder(handle_unknown='ignore', sparse_output=False) #one for our native column
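
Setting `handle_unknown='ignore'` matters once we predict on the entire dataframe, since it may contain names that never appear in the 20k sample. Here is a small sketch, on made-up species values, of what the encoder does with a category it has not seen. It uses the encoder's default sparse output plus `.toarray()` rather than `sparse_output=False`, only so the block runs on older scikit-learn versions too.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data, not from the notebook
train = pd.DataFrame({'common_name': ['Red maple', 'Norway maple', 'Red maple']})
# 'Ginkgo' was never seen during fit
new = pd.DataFrame({'common_name': ['Norway maple', 'Ginkgo']})

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)

# Default output is a sparse matrix; convert to a dense array for display
encoded = ohe.transform(new).toarray()
encoded_df = pd.DataFrame(encoded, columns=ohe.categories_[0])
print(encoded_df)
```

The unseen name is encoded as a row of all zeros instead of raising an error, so the model sees it as "none of the known names".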

#### 5.2.1.1 Create and Apply Name Grouping Function

In [16]:
def group_categories(df, col, n_limit):
    """Overwrite categories in df[col] that occur fewer than n_limit times with 'Other'. Modifies df in place."""
    groups = df[col]
    group_counts = groups.value_counts()
    mask = groups.isin(group_counts[group_counts<n_limit].index)
    df.loc[mask, col] = 'Other'
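
To see what the function does, here is a small self-contained run on toy data. The values are made up, and the function body is repeated (lightly condensed) so the block runs on its own.

```python
import pandas as pd

def group_categories(df, col, n_limit):
    """Overwrite categories in df[col] that occur fewer than n_limit times with 'Other'."""
    counts = df[col].value_counts()
    rare = counts[counts < n_limit].index
    df.loc[df[col].isin(rare), col] = 'Other'

# Toy data: one name at the limit, two names below it
toy = pd.DataFrame({'common_name': ['Red maple'] * 3 + ['Ginkgo'] * 2 + ['Rancho pear']})

# The function edits toy in place and returns None
group_categories(toy, 'common_name', 3)
print(toy['common_name'].value_counts())
```

Names that occur exactly `n_limit` times are kept, which matches the smallest counts of 3 in the output further down. Missing values are left alone, since `value_counts` skips them by default; they are handled in the imputation step.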

In [17]:
#Run group categories on common_name
group_categories(X_full, 'common_name', 3)

In [18]:
#View Value Counts
X_full['common_name'].value_counts()

Red maple                         696
Apple/crabapple                   566
Purpleleaf plum variety           565
Norway maple                      542
(smooth) japanese maple           518
                                 ... 
American sycamore                   3
Eugene`s (carolina) poplar          3
Rancho pear                         3
American dream swamp white oak      3
Turkestan mountain ash              3
Name: common_name, Length: 428, dtype: int64

#### 5.2.1.2 Impute Missing Values

In [19]:
#Prep for SMOTE: impute missing values in age_at_obs and common_name
#fit_transform returns an (n, 1) array, so flatten it before assigning back to the column
X_full['age_at_obs'] = num_imputer.fit_transform(X_full[['age_at_obs']]).ravel()
X_full['common_name'] = cat_imputer.fit_transform(X_full[['common_name']]).ravel()
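
As a quick illustration of the two strategies, here is a sketch on toy data (the values are invented). The numeric imputer fills gaps with the column median, and the categorical imputer fills them with the literal string `'missing'`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one gap in each column
toy = pd.DataFrame({
    'age_at_obs': [4.0, np.nan, 10.0, 30.0],
    'common_name': ['Red maple', np.nan, 'Ginkgo', 'Red maple'],
})

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')

# fit_transform returns a 2D array of shape (n_rows, 1); ravel() flattens it to one column
toy['age_at_obs'] = num_imputer.fit_transform(toy[['age_at_obs']]).ravel()
toy['common_name'] = cat_imputer.fit_transform(toy[['common_name']]).ravel()
print(toy)
```

Because `'missing'` becomes a category of its own, it would later get its own one-hot column alongside the real names.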

In [20]:
X_full.shape

(20000, 10)

#### 5.2.1.3 SMOTE X Data to Address Class Imbalance

In [23]:
#View our balance of classes
y_full.value_counts() / y_full.shape[0] * 100

condition_index
4.0                55.325
3.0                24.135
5.0                13.510
2.0                 5.420
1.0                 1.610
dtype: float64

In [29]:
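#Use SMOTENC (SMOTE for mixed data); categorical_features are the column positions of common_name (0) and native (2)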
sm = SMOTENC(random_state=42, categorical_features=[0,2])
X_smote, y_smote = sm.fit_resample(X_full, y_full)

In [30]:
#view new resampled distribution
print(y_smote.value_counts() / y_smote.shape[0] * 100)

print(X_smote.shape)

condition_index
1.0                20.0
2.0                20.0
3.0                20.0
4.0                20.0
5.0                20.0
dtype: float64
(55325, 10)
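
The new row count follows from the class balance: with its default sampling strategy, SMOTENC oversamples every class except the majority up to the majority count. A quick check of that arithmetic, using only the numbers printed above:

```python
n_rows = 20000
# Class shares (percent) from the value_counts output before resampling
shares = {4.0: 55.325, 3.0: 24.135, 5.0: 13.510, 2.0: 5.420, 1.0: 1.610}

# Convert shares back to row counts
counts = {label: round(n_rows * pct / 100) for label, pct in shares.items()}
majority = max(counts.values())

# Every class is brought up to the majority count
n_resampled = majority * len(counts)
n_synthetic = n_resampled - n_rows

print(counts)
print(majority, n_resampled, n_synthetic)
```

The majority class has 11,065 rows, so five balanced classes give 55,325 rows, of which 35,325 (about 64%) are synthetic.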


#### 5.2.1.4 Scale Numerical Values

In [31]:
#Standard Scaler
ss_cols = ['age_at_obs', 'norm_prcp_mm_total', 'temp_avg_normal', 'prcp_mm_normal']
X_smote[ss_cols] = ss_scaler.fit_transform(X_smote[ss_cols])

#Power Transformer
pow_cols = ['diameter_breast_height_CM', 'norm_snow_mm_total', 'distance_between', 'adj_reports']
X_smote[pow_cols] = pow_trans.fit_transform(X_smote[pow_cols])
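
The split between the two transformers is presumably driven by skew (an assumption on my part, not stated in the notebook). `StandardScaler` only shifts and rescales a column, while `PowerTransformer` (Yeo-Johnson by default) also reshapes its distribution toward a normal one. Below is a small sketch on synthetic right-skewed data, generated here for illustration rather than taken from the tree dataset; `skewness` is a helper defined only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, PowerTransformer

def skewness(a):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    a = np.asarray(a, dtype=float).ravel()
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

rng = np.random.default_rng(42)
# Right-skewed toy column
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

x_std = StandardScaler().fit_transform(x)
# Defaults: method='yeo-johnson', standardize=True
x_pow = PowerTransformer().fit_transform(x)

print('skew raw:     ', skewness(x))
print('skew standard:', skewness(x_std))
print('skew power:   ', skewness(x_pow))
```

Standard scaling leaves the skew unchanged because it is a linear transformation, while the power transform pulls it toward zero.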


In [32]:
X_smote.head()

| | common_name | diameter_breast_height_CM | native | age_at_obs | adj_reports | norm_prcp_mm_total | norm_snow_mm_total | distance_between | temp_avg_normal | prcp_mm_normal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Norwegian sunset maple | 1.001482 | introduced | 0.351816 | 0.293119 | -1.626061 | 0.576894 | -0.768214 | 1.283766 | -1.212323 |
| 1 | Red maple | 1.273147 | introduced | 0.628887 | 0.717425 | -0.70773 | -0.994286 | -1.327058 | -0.784582 | 0.770351 |
| 2 | Paperbark maple | -1.156072 | introduced | -1.551102 | -1.389654 | -0.097643 | -0.994286 | 0.911242 | -0.784582 | 0.770351 |
| 3 | Cleveland norway maple | 1.188115 | introduced | -0.407969 | 0.868185 | -1.212498 | -0.994286 | -0.628992 | 1.283766 | -1.212323 |
| 4 | Leonard messel lobner magnolia | -0.740522 | no_info | -2.178112 | 0.918547 | 0.795283 | 1.517 | 0.496152 | 1.283766 | -1.212323 |


#### 5.2.1.5 Encode Categorical Variables

In [33]:
#reset index to create clean join after encoding
X_smote.reset_index(drop=True, inplace=True)

#transform our common_name field with ohe and put into dataframe
X_name = ohe1.fit_transform(X_smote[['common_name']])
name_df = pd.DataFrame(X_name, columns=ohe1.categories_[0]) #indexing 0 here to grab only the column names

#transform our native field with ohe and put into dataframe
X_native = ohe2.fit_transform(X_smote[['native']])
native_df = pd.DataFrame(X_native, columns=ohe2.categories_[0])

#concat the two encoded dataframes onto X_smote to create a new dataframe
X_smote_scaled_coded = pd.concat([X_smote, name_df, native_df], axis=1)

#drop original categorical fields
X_smote_scaled_coded.drop(columns=['common_name', 'native'], inplace=True)
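
The index reset at the top of this cell guards the `pd.concat` line. With `axis=1`, pandas aligns rows by index label, not by position, and a dataframe built from the encoder output gets a fresh 0..n-1 index. Here is a toy illustration (values made up) of what happens when the left frame's index does not match:

```python
import pandas as pd

# Left frame with a non-default index, as can happen after filtering or sampling
left = pd.DataFrame({'age_at_obs': [0.3, -1.5]}, index=[7, 9])
# Encoder output wrapped in a DataFrame gets a default 0..n-1 index
right = pd.DataFrame({'Red maple': [1.0, 0.0]})

misaligned = pd.concat([left, right], axis=1)
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned)  # 4 rows, padded with NaN
print(aligned)     # 2 rows, no NaN
```

Comparing shapes after the concat, as the next cell does, is a cheap way to catch this kind of misalignment.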

In [34]:
#review shape of created variables
print(X_smote.shape,name_df.shape,native_df.shape)
print(X_smote_scaled_coded.shape)

(55325, 10) (55325, 429) (55325, 3)
(55325, 440)


### 5.2.2 Fit Model on Our Full Data Sample