# Forest for the Trees 
### Predicting Tree Types from the NYC Tree Survey Using Random Forest

The dataset can be found [here](https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh) on NYC Open Data (data dictionary included).

Published on [Brunchline](http://www.brunchline.co) by [@DQOfficial](http://github.com/DQOfficial)

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%pylab inline
import sys
print sys.version
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.metrics import roc_curve, auc

Populating the interactive namespace from numpy and matplotlib
2.7.11 (default, Dec  5 2015, 14:44:47) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)]


Note: using `%matplotlib inline` instead of `%pylab inline` avoids importing * from pylab and numpy into the namespace.


In [39]:
ls

GDP.csv                          euro.csv
Lower_Manhattan_Retailers.csv    flight_hist_pickle/
MLLec4_data/                     flight_hist_raw/
SI_Sales.dta                     nyc_pluto_15v1/
SI_Sales_Old.dta                 nyc_zipcodes.csv
agencies.csv                     pluto_manhattan_usi.csv
building_footprints_shape_10-15/ russell.csv
citibike_feb15.csv*              tree_census_2015.csv
citibike_feb15.csv.zip           unique_locations.csv
communities.csv                  vehicles.csv*


In [40]:
# navigate to the local directory (only needed if the notebook isn't already running from it)
# %cd is the explicit magic; a bare `cd` on the line after a comment raises a SyntaxError
%cd dan/desktop/python/datasets

In [41]:
# read in the data from NYC Open Data

# this one uses the local csv file since I was having issues with the Socrata API
df = pd.read_csv('tree_census_2015.csv')

# this one pulls from the API directly; uncomment to use if you'd like
#df = pd.read_json('https://data.cityofnewyork.us/resource/nwxe-4ae8.json')
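
If you do pull from the API, keep in mind that Socrata endpoints return results in pages (1,000 rows per request by default, as far as I know), so a single `read_json` call would not bring back the whole census. Below is a minimal sketch of building paged request URLs with the `$limit`, `$offset` and `$order` query parameters. `socrata_page_urls` is a hypothetical helper, and the page size and page count are illustrative assumptions, not values from the original notebook.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def socrata_page_urls(base_url, page_size=1000, n_pages=3, order_by=":id"):
    """Build one URL per page of results (hypothetical helper)."""
    urls = []
    for page in range(n_pages):
        query = urlencode([
            ("$limit", page_size),           # rows per request
            ("$offset", page * page_size),   # rows to skip
            ("$order", order_by),            # keep ordering stable between pages
        ])
        urls.append(base_url + "?" + query)
    return urls


# same endpoint as the commented-out read_json call above
base = "https://data.cityofnewyork.us/resource/nwxe-4ae8.json"
page_urls = socrata_page_urls(base, page_size=1000, n_pages=3)
```

Each URL could then be read with `pd.read_json(url)` and the pieces combined with `pd.concat`, stopping once a page comes back empty.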

In [42]:
print df.columns
print ''
print 'we have %d columns' % len(df.columns)
print 'and we have %d types of trees' % len(df.spc_common.unique())

Index([u'FID', u'tree_id', u'block_id', u'created_at', u'tree_dbh',
       u'stump_diam', u'curb_loc', u'status', u'health', u'spc_latin',
       u'spc_common', u'steward', u'guards', u'sidewalk', u'user_type',
       u'root_stone', u'root_grate', u'root_other', u'trunk_wire',
       u'trnk_light', u'trnk_other', u'brch_light', u'brch_shoe',
       u'brch_other', u'address', u'zipcode', u'zip_city', u'cb_num',
       u'borocode', u'boroname', u'cncldist', u'st_assem', u'st_senate',
       u'nta', u'nta_name', u'boro_ct', u'state', u'latitude', u'longitude',
       u'x_sp', u'y_sp'],
      dtype='object')

we have 41 columns
and we have 133 types of trees


In [43]:
# create new dataframe with only the columns we'd like
data = df[['block_id','zipcode','borocode','brch_light','cncldist',
           'brch_other','brch_shoe', 'tree_dbh','stump_diam','spc_common']]

In [44]:
# which trees are the most common?
trees_df = pd.DataFrame({'count':data.spc_common.value_counts()})
trees_df.reset_index(inplace=True)
trees_df.columns=['name','count']
print 'the top three most common trees are:'
print trees_df[:3]
print trees_df[trees_df.name=='Pin Oak']

the top three most common trees are:
                       name  count
0          London Planetree  51890
1  Honeylocust var. inermis  49199
2              Callery Pear  45092
      name  count
3  Pin Oak  34555
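
Raw counts are easier to judge next to each species' share of all trees, which matters later when reading accuracy numbers. Here is a small sketch on made-up rows (not census data) that builds the same name/count table and adds a `share` column; the `share` column is my addition for illustration.

```python
import pandas as pd

# toy stand-in for data.spc_common (made-up values, not census data)
species = pd.Series(
    ['Pin Oak', 'Callery Pear', 'Pin Oak',
     'London Planetree', 'Pin Oak', 'Callery Pear'],
    name='spc_common')

# value_counts() sorts from most to least common
counts = species.value_counts()
trees = pd.DataFrame({'name': counts.index, 'count': counts.values},
                     columns=['name', 'count'])

# fraction of all rows that belong to each species
trees['share'] = trees['count'] / float(trees['count'].sum())
```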


In [45]:
# create a binary flag for pin oak trees, so we can first predict a single species rather than all of them
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised when writing into a slice of df
data = data.assign(pin_oak_flag=np.where(data['spc_common']=='Pin Oak',1,0))
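
Adding a column to a column subset of another DataFrame is the classic trigger for pandas' SettingWithCopyWarning. A common way around it is to take an explicit `.copy()` when the subset is created. Here is a minimal sketch on made-up rows; `df_toy` and `subset` are illustrative names, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# toy stand-in for the census DataFrame (made-up rows)
df_toy = pd.DataFrame({
    'tree_dbh': [3, 21, 11],
    'spc_common': ['Pin Oak', 'Callery Pear', 'Pin Oak'],
    'status': ['Alive', 'Alive', 'Alive'],
})

# .copy() makes the subset an independent DataFrame rather than a slice of df_toy
subset = df_toy[['tree_dbh', 'spc_common']].copy()

# adding a column now only touches the copy
subset['pin_oak_flag'] = np.where(subset['spc_common'] == 'Pin Oak', 1, 0)
```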


In [46]:
# then convert categorical data to numeric codes, since scikit-learn's random forest needs numeric input
def convert(data):
    # work on a copy so we don't write into a slice of the original DataFrame
    data = data.copy()
    num = preprocessing.LabelEncoder()
    for i in data.columns:
        # the encoder is refit on every column, so each column gets its own 0..n-1 codes
        data[i] = num.fit_transform(data[i])

    return data

data = convert(data)
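
One side effect of reusing a single `LabelEncoder` is that the mapping from codes back to names is lost once the next column is fit. If you want to translate predictions back into species names, one option is to keep a fitted encoder per column. The sketch below uses made-up rows, and `convert_with_encoders` is a hypothetical variant of `convert`, not the function this notebook ran.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy stand-in for the census columns (made-up rows)
toy = pd.DataFrame({
    'boroname': ['Queens', 'Bronx', 'Queens'],
    'spc_common': ['Pin Oak', 'Callery Pear', 'Pin Oak'],
})


def convert_with_encoders(frame):
    """Encode every column and return the fitted encoder for each one."""
    encoded = frame.copy()
    encoders = {}
    for col in encoded.columns:
        enc = LabelEncoder()
        encoded[col] = enc.fit_transform(encoded[col])
        encoders[col] = enc
    return encoded, encoders


encoded, encoders = convert_with_encoders(toy)

# map the integer codes back to the original species names
species_names = encoders['spc_common'].inverse_transform(encoded['spc_common'].values)
```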


In [47]:
data.head()

   block_id  zipcode  borocode  brch_light  cncldist  brch_other  brch_shoe  tree_dbh  stump_diam  spc_common  pin_oak_flag
0     17045      116         2           0        32           0          0         4           0         118             0
1     17045      116         2           0        32           0          0         4           0         118             0
2     17045      116         2           0        32           0          0         4           0         118             0
3     16905      116         2           1        32           0          0        15           0          60             0
4     16905      116         2           1        32           0          0        18           0          60             0


In [48]:
# since the random forest takes forever with 468,000 entries, let's use the sample function in pandas to slim it down
# for this test, we use 20% (frac=.2), or about 94,000 rows
# note: replace=True samples with replacement, so some rows can show up more than once
sampled_df=data.sample(frac=.2,replace=True)

data=sampled_df
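
Sampling with `replace=True` draws rows with replacement, so the same tree can appear several times. Those repeats can end up on both sides of a later train/test split, which can make test accuracy look better than it really is. Here is a minimal sketch on made-up data comparing the two modes; the `random_state` values are arbitrary and only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

# toy stand-in for the census rows (made-up values)
toy = pd.DataFrame({'tree_dbh': np.arange(10000)})

# same sample size, drawn with and without replacement
with_repl = toy.sample(frac=.2, replace=True, random_state=0)
without_repl = toy.sample(frac=.2, replace=False, random_state=0)

# number of sampled rows that repeat an earlier sampled row
n_repeated = int(with_repl.index.duplicated().sum())
```

Dropping `replace=True` (the default is sampling without replacement) keeps every sampled row distinct.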

In [None]:
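# split data set into train and test for both our target and predictor variables
# (train_test_split lives in sklearn.model_selection as of scikit-learn 0.18; older versions use sklearn.cross_validation)
from sklearn.model_selection import train_test_split

# the first nine columns are the predictors; .iloc is the positional indexer that replaces the deprecated .ix
X_train, X_test, y_train, y_test=train_test_split(data.iloc[:,0:9],data['pin_oak_flag'],test_size=0.3, random_state=0)
# train the random forest classifier
clf = RandomForestClassifier(n_estimators=1000)
clf = clf.fit(X_train,y_train)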

# calculate model accuracy for both train and test sets
print 'Predicting the Pin Oak:'
pred = clf.predict(X_train)
print 'The accuracy for the training set is:',1.0*sum(y_train==pred)/len(pred)
pred = clf.predict(X_test)
print 'The accuracy for the test set is:', 1.0*sum(y_test==pred)/len(pred)

Predicting the Pin Oak:
The accuracy for the training set is: 0.997102200802
The accuracy for the test set is: 0.921034838618
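
Accuracy on its own can be misleading when the target class is rare. Going by the counts earlier in this notebook (34,555 pin oaks out of roughly 468,000 rows), a model that always answers "not a pin oak" would already score about 92.6%, which is right around the test accuracy above. The sketch below computes that majority-class baseline. It uses the full-dataset counts as an assumption, so the share in the sampled data will differ a little, and `majority_baseline_accuracy` is a hypothetical helper.

```python
import numpy as np


def majority_baseline_accuracy(y):
    """Accuracy from always predicting the most common class in y."""
    values, counts = np.unique(y, return_counts=True)
    return counts.max() / float(len(y))


# illustrative labels built from numbers quoted earlier in this notebook:
# 34,555 pin oaks out of roughly 468,000 rows (an approximation)
n_rows, n_pin_oak = 468000, 34555
y_toy = np.zeros(n_rows, dtype=int)
y_toy[:n_pin_oak] = 1

baseline = majority_baseline_accuracy(y_toy)
```

Comparing `baseline` against the test accuracy is a quick check on whether the classifier is learning anything beyond the class balance.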


In [None]:
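# split data set into train and test for both our target and predictor variables
# (train_test_split lives in sklearn.model_selection as of scikit-learn 0.18; older versions use sklearn.cross_validation)
from sklearn.model_selection import train_test_split

# rather than just looking at the pin oak, what if we try it with all of the tree types (133 in the full dataset)?
# the first nine columns are the predictors; .iloc is the positional indexer that replaces the deprecated .ix
X_train, X_test, y_train, y_test=train_test_split(data.iloc[:,0:9],data['spc_common'],test_size=0.3, random_state=0)
# train the random forest classifier
clf = RandomForestClassifier(n_estimators=1000)
clf = clf.fit(X_train,y_train)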

# calculate model accuracy for both train and test sets
print 'Predicting all types:'
pred = clf.predict(X_train)
print 'The accuracy for the training set is:',1.0*sum(y_train==pred)/len(pred)
pred = clf.predict(X_test)
print 'The accuracy for the test set is:', 1.0*sum(y_test==pred)/len(pred)
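
The binary and multi-class runs repeat the same split, fit and score steps, so they could be wrapped in one function. Below is a minimal sketch on synthetic data from `make_classification`, not the tree census. `fit_and_score` is a hypothetical helper, and the smaller `n_estimators` is only there to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_and_score(X, y, n_estimators=50, seed=0):
    """Split, fit a random forest, and return (train accuracy, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    clf.fit(X_train, y_train)
    # score() returns mean accuracy, the same quantity computed by hand above
    return clf.score(X_train, y_train), clf.score(X_test, y_test)


# synthetic three-class problem with nine features (made-up data)
X_toy, y_toy = make_classification(n_samples=500, n_features=9, n_informative=5,
                                   n_classes=3, random_state=0)
train_acc, test_acc = fit_and_score(X_toy, y_toy)
```

With the census data, the same function could be called once with `data['pin_oak_flag']` and once with `data['spc_common']` as the target.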

### The accuracy for predicting across all of the tree types is a lot worse than for predicting just the pin oak