# 3 Pre-Processing & Training

In our last step, we joined our data on latitude and longitude and explored our features. In this step we will prepare for modeling by tuning our features, and maybe even adding a feature or two. Finding the right mix and tuning of features will be a challenge, since our initial correlation review didn't show much to work with. Dealing with our fairly imbalanced data may be even more challenging.

## 3.0 Imports

In [249]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import datetime as dt
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline

## 3.1 Load Data

In [250]:
trees_df = pd.read_csv('../data/data_outputs/seattle_trees_explored.csv')

trees_df.head(3)

Unnamed: 0,planted_date,most_recent_observation,common_name,long_trees,lat_trees,diameter_breast_height_CM,condition,native,age_at_obs,condition_index,nearest_station,station_id,station_name,lat_prcp,long_prcp,adj_reports,norm_prcp_mm_total,norm_snow_mm_total,distance_between,tree_id
0,1991-07-22,2019-04-27,(european) white birch,-122.28208,47.635207,40.64,excellent,introduced,27.765115,5.0,WA-KG-266,WA-KG-266,Seattle 2.9 ENE,47.630883,-122.290286,237,1071.925479,0.0,0.947927,1
1,1991-07-30,2019-04-27,Kwanzan flowering cherry,-122.318952,47.649141,5.08,fair,no_info,27.743212,3.0,WA-KG-266,WA-KG-266,Seattle 2.9 ENE,47.630883,-122.290286,237,1071.925479,0.0,3.367105,2
2,1991-07-25,2019-04-27,Japanese snowbell tree,-122.299891,47.637863,2.54,excellent,introduced,27.756901,5.0,WA-KG-266,WA-KG-266,Seattle 2.9 ENE,47.630883,-122.290286,237,1071.925479,0.0,1.14569,3


## 3.2 Prep DF for Train-Test split

We'll take another look at the columns, as we can likely drop the additional reference info from our climate 'prcp' data source. And then we'll split our dependent and independent variables.

### 3.2.0 Drop Unnecessary Columns

We'll drop the reference cols from the climate data, as mentioned above, and also the 'condition' column because it duplicates our target feature. We'll drop our date cols as well, because the calculated age feature will be our variable related to dates/ages.

In [251]:
#View our columns
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158004 entries, 0 to 158003
Data columns (total 20 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   planted_date               155133 non-null  object 
 1   most_recent_observation    157999 non-null  object 
 2   common_name                157332 non-null  object 
 3   long_trees                 158004 non-null  float64
 4   lat_trees                  158004 non-null  float64
 5   diameter_breast_height_CM  158004 non-null  float64
 6   condition                  158004 non-null  object 
 7   native                     158004 non-null  object 
 8   age_at_obs                 155128 non-null  float64
 9   condition_index            158004 non-null  float64
 10  nearest_station            158004 non-null  object 
 11  station_id                 158004 non-null  object 
 12  station_name               158004 non-null  object 
 13  lat_prcp                   15

In [252]:
#drop the reference columns from the climate dataset, the original condition column (which we used to create our target index feature), and the date columns
trees_df = trees_df.drop(columns=['nearest_station', 'station_id',
       'station_name', 'lat_prcp', 'long_prcp', 'condition', 'planted_date','most_recent_observation'])

In [253]:
trees_df.columns

Index(['common_name', 'long_trees', 'lat_trees', 'diameter_breast_height_CM',
       'native', 'age_at_obs', 'condition_index', 'adj_reports',
       'norm_prcp_mm_total', 'norm_snow_mm_total', 'distance_between',
       'tree_id'],
      dtype='object')

### 3.2.1 Split Dependent and Independent Variables

In [254]:
# split data into X and y
X = trees_df.drop(columns=['condition_index'])
y = trees_df[['condition_index']]

## 3.3 Train-Test Split
We'll use an 80:20 split here.

In [255]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(126403, 11) (126403, 1) (31601, 11) (31601, 1)
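Since our target is fairly imbalanced, one option worth considering is a stratified split, which keeps the class proportions of `condition_index` roughly equal in the train and test sets. The sketch below is not what we ran above; it uses a small made-up frame (`toy`, with invented class counts) purely to show the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the tree data: an imbalanced condition index (counts are made up).
toy = pd.DataFrame({
    "age_at_obs": range(100),
    "condition_index": [5.0] * 60 + [3.0] * 30 + [1.0] * 10,
})
X_toy = toy.drop(columns=["condition_index"])
y_toy = toy["condition_index"]

# stratify=y_toy keeps each class's share (nearly) the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Compare the class shares in each split.
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Whether stratifying helps here depends on how rare the smallest condition classes are in the real data, so treat this as something to try rather than a required change.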


## 3.4 Impute Missing Values

We will use the median for our age at observation and the mode for common name.

### 3.4.0 Calculate Mode and Median

In [256]:
X_median_age = X_train['age_at_obs'].median()
X_mode_name = X_train['common_name'].mode()

### 3.4.1 Impute Values

In [257]:
#Impute age_at_obs with the training median and common_name with the training mode
X_train['age_at_obs'] = X_train['age_at_obs'].fillna(X_median_age)
X_train['common_name'] = X_train['common_name'].fillna(X_mode_name[0]) #using 0 index to grab the name of the mode

In [258]:
#validate no missing values
X_train.isna().sum()

common_name                  0
long_trees                   0
lat_trees                    0
diameter_breast_height_CM    0
native                       0
age_at_obs                   0
adj_reports                  0
norm_prcp_mm_total           0
norm_snow_mm_total           0
distance_between             0
tree_id                      0
dtype: int64
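The test set will eventually need the same treatment, and to avoid leakage it should be filled with the values computed on the training rows rather than its own median and mode. Here is a minimal sketch of that idea on made-up frames; `X_train_toy`, `X_test_toy`, and `fill_values` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up train and test frames with gaps in the two columns we impute.
X_train_toy = pd.DataFrame({
    "age_at_obs": [10.0, 20.0, np.nan, 40.0],
    "common_name": ["Red maple", "Red maple", None, "Norway maple"],
})
X_test_toy = pd.DataFrame({
    "age_at_obs": [np.nan, 5.0],
    "common_name": [None, "Yellowwood"],
})

# The statistics come from the training rows only.
fill_values = {
    "age_at_obs": X_train_toy["age_at_obs"].median(),
    "common_name": X_train_toy["common_name"].mode()[0],
}

# fillna with a dict fills each column with its own value and returns a new frame.
X_train_filled = X_train_toy.fillna(fill_values)
X_test_filled = X_test_toy.fillna(fill_values)
print(X_test_filled)
```

Because `fillna` returns a new frame here instead of assigning into a slice, this pattern should also sidestep pandas' `SettingWithCopyWarning`.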

## 3.5 Feature Engineering

### 3.5.0 Create New Fields - Daily Averages

I want to create daily average fields for snowfall and rainfall, using 365 days (one year) as the denominator.

In [259]:
#Add columns for daily rainfall and snowfall averages
X_train['prcp_daily_avg'] = X_train['norm_prcp_mm_total'].div(365.0)
X_train['snow_daily_avg'] = X_train['norm_snow_mm_total'].div(365.0)
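Since the test set will need the same two columns later, it may be handy to wrap this step in a small helper. The sketch below assumes a hypothetical function name, `add_daily_averages`, and checks it against the precipitation total shown in the first rows of our data (1071.925479 mm, which works out to roughly 2.94 mm per day).

```python
import pandas as pd

DAYS_PER_YEAR = 365.0  # same denominator as the cell above

def add_daily_averages(df):
    """Return a copy of df with per-day precipitation and snowfall columns added."""
    out = df.copy()
    out["prcp_daily_avg"] = out["norm_prcp_mm_total"] / DAYS_PER_YEAR
    out["snow_daily_avg"] = out["norm_snow_mm_total"] / DAYS_PER_YEAR
    return out

# One-row example using the totals from the head() output earlier in the notebook.
sample = pd.DataFrame({
    "norm_prcp_mm_total": [1071.925479],
    "norm_snow_mm_total": [0.0],
})
result = add_daily_averages(sample)
print(result[["prcp_daily_avg", "snow_daily_avg"]].round(2))
```

Note that the new columns are just the totals divided by a constant, so they carry the same information as the totals; a model will not gain signal from having both.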

### 3.5.1 Feature Scaling

I'll start by reviewing distributions of our numerical variables to see if we need to do any column-specific scaling or if we can apply some standard methodology.
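One lightweight way to do that review, before plotting anything, is to put summary statistics and skewness side by side for each numeric column. The sketch below runs on randomly generated stand-in data (not our real tree measurements), so the numbers only illustrate the pattern: a strongly skewed column might call for a different treatment than a roughly symmetric one.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-ins; the column names mirror ours but the values are synthetic.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age_at_obs": rng.normal(25, 5, 1000),                    # roughly symmetric
    "diameter_breast_height_CM": rng.lognormal(3, 1, 1000),   # right-skewed
})

# Summarize every numeric column and attach its sample skewness.
numeric_cols = toy.select_dtypes(include="number").columns
summary = toy[numeric_cols].describe().T
summary["skew"] = toy[numeric_cols].skew()
print(summary[["mean", "std", "min", "max", "skew"]])
```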

### 3.5.x Categorical Encoding

We'll need to encode our categorical features. And for our tree names, we'll likely need to group together some of the less frequent options so we don't overwhelm our model with a crazy number of columns.

In [260]:
cat_columns = trees_df.select_dtypes(include='object').columns

cat_columns

Index(['common_name', 'native'], dtype='object')

In [261]:
#view value_counts of common_name field
trees_df['common_name'].value_counts()

Red maple                     5218
Apple/crabapple               4653
Norway maple                  4510
Purpleleaf plum variety       4313
(smooth) japanese maple       4193
                              ... 
Silver leaved mountain gum       2
Doublefile viburnum              2
Shade king red maple             2
Spindle tree                     2
Hokusai flowering cherry         2
Name: common_name, Length: 670, dtype: int64

In [262]:
#how many of the 670 categories have fewer than 100 records?

vc = pd.DataFrame(trees_df['common_name'].value_counts())

vc.reset_index(inplace=True)

vc[vc['common_name'] < 100].value_counts()

index                             common_name
(arnold) tulip tree               61             1
Pleated viburnum                  8              1
Purple crabapple                  2              1
Prospector elm                    19             1
Professor sprenger crabapple      3              1
                                                ..
Emerald avenue european hornbeam  11             1
Elizabeth magnolia                6              1
Easy street maple                 5              1
Eastern redcedar                  15             1
Zumi crabapple                    61             1
Length: 436, dtype: int64
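The count we are after can also be read off the `value_counts` result directly, without building a second frame. A small sketch on made-up names and a made-up threshold of 3 (our real threshold is 100):

```python
import pandas as pd

# Made-up species list: two common names and two rare ones.
names = pd.Series(
    ["Red maple"] * 5 + ["Norway maple"] * 3 + ["Yellowwood"] * 1 + ["Spindle tree"] * 2
)
threshold = 3  # illustrative only

counts = names.value_counts()
n_rare = int((counts < threshold).sum())             # how many categories are rare
rows_in_rare = int(counts[counts < threshold].sum())  # how many rows they cover
print(n_rare, rows_in_rare)
```

Knowing how many rows the rare categories cover, not just how many categories there are, gives a sense of how large the eventual 'Other' group will be.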

In [263]:
#view value_counts of native field
trees_df['native'].value_counts()

introduced             116127
no_info                 32616
naturally_occurring      9261
Name: native, dtype: int64

#### 3.5.1.0 Convert common_name Field to Group Names with < 100 Occurrences

This will limit the number of columns we have. We won't do the same for the native field, which has only three categories. We'll do it by defining a function that can be reused later as well.

In [264]:
def group_categories(df, col, n_limit):
    """ Overwrite values in df[col] whose category count is less than n_limit with 'Other' (modifies df in place) """
    groups = df[col]
    group_counts = groups.value_counts()
    #flag rows whose category appears fewer than n_limit times
    mask = groups.isin(group_counts[group_counts<n_limit].index)
    #overwrite only the target column, not the entire row
    df.loc[mask, col] = 'Other'

In [265]:
#use our group_categories function on our X_train set
group_categories(X_train, 'common_name', 100)

X_train['common_name'].value_counts()

Other                          12408
Red maple                       4684
Apple/crabapple                 3733
Norway maple                    3667
Purpleleaf plum variety         3455
                               ...  
Autumn applause white ash        102
Vanessa parrotia                 102
Skyline honey locust             101
Princess diana serviceberry      100
Yellowwood                       100
Name: common_name, Length: 198, dtype: int64
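One thing to keep in mind for later: if we call `group_categories` on the test set directly, the counts will be recomputed on the test rows, and a name that is frequent in training could end up as 'Other' in test. A possible way around this is to learn the list of frequent names on the training set and reuse it. The two helpers below (`fit_frequent_categories` and `apply_grouping`) are hypothetical names for a sketch of that idea, shown on made-up data.

```python
import pandas as pd

def fit_frequent_categories(series, n_limit):
    """Return the set of categories that appear at least n_limit times in series."""
    counts = series.value_counts()
    return set(counts[counts >= n_limit].index)

def apply_grouping(df, col, keep, other_label="Other"):
    """Return a copy of df where values of col not in keep are replaced by other_label."""
    out = df.copy()
    out[col] = out[col].where(out[col].isin(keep), other_label)
    return out

# Made-up train and test frames with an extra numeric column.
train_toy = pd.DataFrame({
    "common_name": ["Red maple"] * 3 + ["Yellowwood"],
    "age_at_obs": [10.0, 12.0, 14.0, 16.0],
})
test_toy = pd.DataFrame({
    "common_name": ["Red maple", "Spindle tree"],
    "age_at_obs": [11.0, 13.0],
})

keep = fit_frequent_categories(train_toy["common_name"], n_limit=2)  # learned on train only
train_grouped = apply_grouping(train_toy, "common_name", keep)
test_grouped = apply_grouping(test_toy, "common_name", keep)
print(test_grouped)
```

Names that never appeared in training (like 'Spindle tree' here) also fall into 'Other', which lines up with the category the encoder will see during fitting. Missing values would be mapped to 'Other' too, so this assumes imputation has already run.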

#### 3.5.1.1 Run OHE On Our Object Variables

In [266]:
#create a onehotencoder instance
ohe = OneHotEncoder(drop='first', handle_unknown='ignore')

#fit on our training set
ohe.fit(X_train[['common_name','native']])