# 3 Pre-Processing & Training

In our last step, we joined our data based on latitude and longitude and explored our features. In this step we will prepare for modeling by tuning our features, and maybe even adding a feature or two. Finding the right mix of features will be a challenge, since our initial correlation review didn't show much to work with. Perhaps more challenging will be deciding how to deal with our fairly imbalanced data.

## 3.0 Imports

In [24]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import datetime as dt
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import make_pipeline

## 3.1 Load Data

In [25]:
trees_df = pd.read_csv('../data/data_outputs/seattle_trees_explored.csv')

trees_df.head(3)

Unnamed: 0,planted_date,most_recent_observation,common_name,long_trees,lat_trees,diameter_breast_height_CM,condition,native,age_at_obs,condition_index,nearest_station,station_id,station_name,lat_prcp,long_prcp,adj_reports,norm_prcp_mm_total,norm_snow_mm_total,distance_between,tree_id
0,1991-07-22,2019-04-27,(european) white birch,-122.28208,47.635207,40.64,excellent,introduced,27.765115,5.0,WA-KG-266,WA-KG-266,Seattle 2.9 ENE,47.630883,-122.290286,237,1071.925479,0.0,0.947927,1
1,1991-07-30,2019-04-27,Kwanzan flowering cherry,-122.318952,47.649141,5.08,fair,no_info,27.743212,3.0,WA-KG-266,WA-KG-266,Seattle 2.9 ENE,47.630883,-122.290286,237,1071.925479,0.0,3.367105,2
2,1991-07-25,2019-04-27,Japanese snowbell tree,-122.299891,47.637863,2.54,excellent,introduced,27.756901,5.0,WA-KG-266,WA-KG-266,Seattle 2.9 ENE,47.630883,-122.290286,237,1071.925479,0.0,1.14569,3


## 3.2 Prep DF for Train-Test split

We'll take another look at the columns, since we can likely drop the extra reference information that came from our climate ('prcp') data source. Then we'll split our dependent and independent variables.

### 3.2.0 Drop Unnecessary Columns

In [26]:
#View our columns
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158004 entries, 0 to 158003
Data columns (total 20 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   planted_date               155133 non-null  object 
 1   most_recent_observation    157999 non-null  object 
 2   common_name                157332 non-null  object 
 3   long_trees                 158004 non-null  float64
 4   lat_trees                  158004 non-null  float64
 5   diameter_breast_height_CM  158004 non-null  float64
 6   condition                  158004 non-null  object 
 7   native                     158004 non-null  object 
 8   age_at_obs                 155128 non-null  float64
 9   condition_index            158004 non-null  float64
 10  nearest_station            158004 non-null  object 
 11  station_id                 158004 non-null  object 
 12  station_name               158004 non-null  object 
 13  lat_prcp                   15

In [27]:
# drop the reference columns from the climate dataset, plus the original
# 'condition' column (which we used to create our target index feature)
trees_df = trees_df.drop(columns=['nearest_station', 'station_id', 'station_name',
                                  'lat_prcp', 'long_prcp', 'condition'])

In [28]:
trees_df.columns

Index(['planted_date', 'most_recent_observation', 'common_name', 'long_trees',
       'lat_trees', 'diameter_breast_height_CM', 'native', 'age_at_obs',
       'condition_index', 'adj_reports', 'norm_prcp_mm_total',
       'norm_snow_mm_total', 'distance_between', 'tree_id'],
      dtype='object')

### 3.2.1 Solidify dtypes

In [29]:
# make updates to dtypes (pandas 2.x requires an explicit unit such as 'datetime64[ns]')
trees_df = trees_df.astype({'planted_date': 'datetime64[ns]', 'most_recent_observation': 'datetime64[ns]'})

trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158004 entries, 0 to 158003
Data columns (total 14 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   planted_date               155133 non-null  datetime64[ns]
 1   most_recent_observation    157999 non-null  datetime64[ns]
 2   common_name                157332 non-null  object        
 3   long_trees                 158004 non-null  float64       
 4   lat_trees                  158004 non-null  float64       
 5   diameter_breast_height_CM  158004 non-null  float64       
 6   native                     158004 non-null  object        
 7   age_at_obs                 155128 non-null  float64       
 8   condition_index            158004 non-null  float64       
 9   adj_reports                158004 non-null  int64         
 10  norm_prcp_mm_total         158004 non-null  float64       
 11  norm_snow_mm_total         158004 non-null  float64 
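
An alternative to `astype` for this step is `pd.to_datetime`, which lets us decide what happens to values that can't be parsed. The sketch below is only illustrative: `toy_df` and its values are made up and are not from our trees dataset. With `errors='coerce'`, missing or unparseable entries become `NaT` instead of raising an error.

```python
import pandas as pd

# Toy stand-in for trees_df (hypothetical values, not from the real dataset)
toy_df = pd.DataFrame({
    'planted_date': ['1991-07-22', None, 'not a date'],
    'most_recent_observation': ['2019-04-27', '2019-04-27', '2019-04-27'],
})

date_cols = ['planted_date', 'most_recent_observation']
for col in date_cols:
    # errors='coerce' turns missing or unparseable entries into NaT instead of raising
    toy_df[col] = pd.to_datetime(toy_df[col], errors='coerce')

print(toy_df.dtypes)
print(toy_df['planted_date'].isna().sum(), 'planted_date values are NaT')
```

Coercing silently hides bad values, so it is worth comparing the null counts before and after the conversion.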

### 3.2.2 Split Dependent and Independent Variables

In [30]:
# split data into X (independent variables) and y (dependent variable, our target)
X = trees_df.drop(columns=['condition_index'])
y = trees_df[['condition_index']]

## 3.3 Train-Test Split
We'll use an 80:20 split here.

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(126403, 13) (126403, 1) (31601, 13) (31601, 1)
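
Since our data is fairly imbalanced, one option we may want to consider is a stratified split, so that the train and test sets keep the same mix of `condition_index` classes. This is a minimal sketch on made-up data, not a step we have run on the trees dataset: `toy_df`, `toy_X`, and `toy_y` are hypothetical names, with `condition_index` treated as the target.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target (hypothetical values): 80 rows of class 5.0, 20 rows of class 3.0
toy_df = pd.DataFrame({
    'diameter_breast_height_CM': range(100),
    'condition_index': [5.0] * 80 + [3.0] * 20,
})

toy_X = toy_df.drop(columns=['condition_index'])  # features
toy_y = toy_df['condition_index']                 # target

# stratify=toy_y keeps the class proportions the same in both splits
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(toy_y_train.value_counts(normalize=True))
print(toy_y_test.value_counts(normalize=True))
```

Without `stratify`, a rare class could end up under-represented in the test set purely by chance, which would make our evaluation metrics less reliable.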
