In [21]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import chi2_contingency

In [3]:
tree_data = pd.read_csv('2015StreetTreeCensus_TreeData.csv')

In [19]:
tree_data.head()

,tree_id,block_id,created_at,tree_dbh,stump_diam,curb_loc,status,health,spc_latin,spc_common,...,boro_ct,state,latitude,longitude,x_sp,y_sp,council district,census tract,bin,bbl
0,180683,348711,8/27/2015,3,0,OnCurb,Alive,Fair,Acer rubrum,red maple,...,4073900,New York,40.723092,-73.844215,1027431.148,202756.7687,29.0,739.0,4052307.0,4022210000.0
1,200540,315986,9/3/2015,21,0,OnCurb,Alive,Fair,Quercus palustris,pin oak,...,4097300,New York,40.794111,-73.818679,1034455.701,228644.8374,19.0,973.0,4101931.0,4044750000.0
2,204026,218365,9/5/2015,3,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,...,3044900,New York,40.717581,-73.936608,1001822.831,200716.8913,34.0,449.0,3338310.0,3028870000.0
3,204337,217969,9/5/2015,10,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,...,3044900,New York,40.713537,-73.934456,1002420.358,199244.2531,34.0,449.0,3338342.0,3029250000.0
4,189565,223043,8/30/2015,21,0,OnCurb,Alive,Good,Tilia americana,American linden,...,3016500,New York,40.666778,-73.975979,990913.775,182202.426,39.0,165.0,3025654.0,3010850000.0


## Chi-Square Test of Independence

### Tree Location  

In [33]:
curb_df = pd.pivot_table(tree_data, values='tree_id', index=['curb_loc'],
               columns=['health'], aggfunc='count', fill_value=0)

In [34]:
curb_df

health,Fair,Good,Poor
curb_loc,,,
OffsetFromCurb,4035,20877,963
OnCurb,92469,507973,25855


In [37]:
curb_chi = chi2_contingency(curb_df)
curb_chi

(22.09679349710235,
 1.5912640755111868e-05,
 2,
 array([[  3828.80743117,  20982.18529774,   1064.00727109],
        [ 92675.19256883, 507867.81470226,  25753.99272891]]))

In [41]:
# Plain strings suffice here: the labels contain no f-string placeholders.
print("chisquare:", curb_chi[0])
print("p-value:", curb_chi[1])
print("degree of freedom:", curb_chi[2])

chisquare: 22.09679349710235
p-value: 1.5912640755111868e-05
degree of freedom: 2


- The p-value is 1.5912640755111868e-05, well below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between curb_loc and health of tree**).
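To make the decision rule concrete, the statistic can be recomputed by hand from the observed counts in the table above. This is a minimal sketch in plain Python, not the notebook's own method: the variable names are illustrative, and the closed-form p-value `exp(-chi2 / 2)` is valid only because this table has exactly 2 degrees of freedom.

```python
import math

# Observed counts copied from the curb_loc x health table above.
observed = {
    "OffsetFromCurb": {"Fair": 4035, "Good": 20877, "Poor": 963},
    "OnCurb": {"Fair": 92469, "Good": 507973, "Poor": 25855},
}

# Marginal totals for rows, columns, and the whole table.
row_totals = {row: sum(cols.values()) for row, cols in observed.items()}
col_totals = {}
for cols in observed.values():
    for col, count in cols.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# Pearson statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total under H0.
chi2 = 0.0
for row, cols in observed.items():
    for col, obs in cols.items():
        expected = row_totals[row] * col_totals[col] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# With exactly 2 degrees of freedom the chi-square survival function
# reduces to exp(-x / 2); other dof values need scipy.stats.chi2.sf.
p_value = math.exp(-chi2 / 2)

alpha = 0.05
reject_h0 = p_value < alpha  # reject H0 when the p-value is below alpha
print(chi2, p_value, dof, reject_h0)
```

The printed statistic and p-value should agree with the `chi2_contingency` output above up to floating-point rounding, and `reject_h0` makes the direction of the comparison with 0.05 explicit.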

### Presence and type of tree guard  

In [43]:
guards_df = pd.pivot_table(tree_data, values='tree_id', index=['guards'],
               columns=['health'], aggfunc='count', fill_value=0)

In [44]:
guards_df

health,Fair,Good,Poor
guards,,,
Harmful,3839,15322,1091
Helpful,7166,42638,2062
,84123,464978,23204
Unsure,1376,5912,460


In [45]:
guards_chi = chi2_contingency(guards_df)
guards_chi

(575.2539751179152,
 5.0679876449462705e-121,
 6,
 array([[2.99675853e+03, 1.64224877e+04, 8.32753808e+02],
        [7.67479153e+03, 4.20585001e+04, 2.13270833e+03],
        [8.46859516e+04, 4.64086105e+05, 2.35329433e+04],
        [1.14649838e+03, 6.28290709e+03, 3.18594534e+02]]))

- The p-value is 5.0679876449462705e-121, well below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between presence of guard and health of tree**).
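The same build-a-table-then-test pattern repeats for every column in this notebook, so it could be wrapped in a small helper. The sketch below is one possible way to do that, under stated assumptions: `chi2_vs_health` and the `demo` frame are hypothetical (the demo data is synthetic, not the tree census), and `pd.crosstab` is swapped in for `pd.pivot_table` because it counts rows directly.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_vs_health(df, columns, target="health", alpha=0.05):
    """Run a chi-square test of independence for each column against target."""
    rows = []
    for col in columns:
        # Contingency table of counts: one row per category of col,
        # one column per category of target.
        table = pd.crosstab(df[col], df[target])
        chi2, p_value, dof, _expected = chi2_contingency(table)
        rows.append({
            "variable": col,
            "chi2": chi2,
            "p_value": p_value,
            "dof": dof,
            "reject_h0": bool(p_value < alpha),
        })
    # Largest statistic first.
    return (pd.DataFrame(rows)
            .sort_values("chi2", ascending=False)
            .reset_index(drop=True))


# Synthetic demo data (hypothetical, only to show the call shape).
demo = pd.DataFrame({
    "health": ["Good", "Good", "Fair", "Poor", "Good", "Fair", "Poor", "Good"] * 25,
    "curb_loc": ["OnCurb", "OffsetFromCurb"] * 100,
    "root_stone": ["No", "No", "Yes", "Yes", "No", "Yes", "No", "No"] * 25,
})
summary = chi2_vs_health(demo, ["curb_loc", "root_stone"])
print(summary)
```

On the real data the call would look like `chi2_vs_health(tree_data, ['curb_loc', 'guards', 'sidewalk'])`, which yields one summary row per variable instead of a separate cell for each.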

### Sidewalk damage immediately adjacent to tree

In [47]:
sidewalk_df = pd.pivot_table(tree_data, values='tree_id', index=['sidewalk'],
               columns=['health'], aggfunc='count', fill_value=0)

In [48]:
sidewalk_df

health,Fair,Good,Poor
sidewalk,,,
Damage,28699,151889,6605
NoDamage,67805,376960,20213


In [50]:
sidewalk_chi = chi2_contingency(sidewalk_df)
sidewalk_chi

(268.16845546032187,
 5.86083893603376e-59,
 2,
 array([[ 27699.59607526, 151795.81866872,   7697.58525601],
        [ 68804.40392474, 377053.18133128,  19120.41474399]]))

- The p-value is 5.86083893603376e-59, well below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between presence of sidewalk damage immediately adjacent to tree and health of tree**).

### Root problems caused by paving stones in the tree bed

In [52]:
rootstone_df = pd.pivot_table(tree_data, values='tree_id', index=['root_stone'],
               columns=['health'], aggfunc='count', fill_value=0)

In [53]:
rootstone_df

health,Fair,Good,Poor
root_stone,,,
No,73005,417538,21630
Yes,23499,111312,5188


In [54]:
rootstone_chi = chi2_contingency(rootstone_df)
rootstone_chi

(602.5935971939034,
 1.407547158003004e-131,
 2,
 array([[ 75787.89520556, 415324.0112271 ,  21061.09356734],
        [ 20716.10479444, 113525.9887729 ,   5756.90643266]]))

- The p-value is 1.407547158003004e-131, well below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between root problems caused by paving stones in the tree bed and health of tree**).

### Root problems caused by metal grates 

In [56]:
rootgrates_df = pd.pivot_table(tree_data, values='tree_id', index=['root_grate'],
               columns=['health'], aggfunc='count', fill_value=0)

In [57]:
rootgrates_df

health,Fair,Good,Poor
root_grate,,,
No,95652,526421,26563
Yes,852,2429,255


In [59]:
rootgrate_chi = chi2_contingency(rootgrates_df)
rootgrate_chi

(358.14098499650447,
 1.7008782839751363e-78,
 2,
 array([[9.59807666e+04, 5.25982637e+05, 2.66725960e+04],
        [5.23233356e+02, 2.86736260e+03, 1.45404047e+02]]))

- The p-value is 1.7008782839751363e-78, well below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between root problems caused by metal grates and health of tree**).

### Presence of other root problems 

In [61]:
rootother_df = pd.pivot_table(tree_data, values='tree_id', index=['root_other'],
               columns=['health'], aggfunc='count', fill_value=0)

In [62]:
rootother_df

health,Fair,Good,Poor
root_other,,,
No,90015,507136,24699
Yes,6489,21714,2119


In [63]:
rootother_chi = chi2_contingency(rootother_df)
rootother_chi

(1929.1145277540363,
 0.0,
 2,
 array([[ 92017.15559699, 504261.71700104,  25571.12740197],
        [  4486.84440301,  24588.28299896,   1246.87259803]]))

- The p-value is reported as 0.0 (too small to represent as a floating-point number), which is below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between presence of other root problems and health of tree**).
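The 0.0 reported here is a floating-point underflow rather than an exact zero. As a rough illustration, the magnitude can be recovered on a log scale; this sketch relies on the fact that with 2 degrees of freedom the chi-square survival function is `exp(-x / 2)`, and the variable names are illustrative.

```python
import math

# Statistic reported above for root_other vs. health (2 degrees of freedom).
chi2 = 1929.1145277540363

# Direct evaluation underflows: the result is smaller than the smallest
# positive double, so it is returned as 0.0.
p_value = math.exp(-chi2 / 2)

# Working on a log scale avoids the underflow: log10(p) = (-chi2 / 2) / ln(10).
log10_p = (-chi2 / 2) / math.log(10)

print(p_value, log10_p)
```

A strongly negative `log10_p` shows that the p-value is extremely small but not literally zero, which matters when comparing variables that all print as 0.0.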

### Trunk problems caused by rope or wires

In [65]:
trunkrope_df = pd.pivot_table(tree_data, values='tree_id', index=['trunk_wire'],
               columns=['health'], aggfunc='count', fill_value=0)

In [66]:
trunkrope_df

health,Fair,Good,Poor
trunk_wire,,,
No,93737,519095,26066
Yes,2767,9755,752


In [67]:
trunkwire_chi = chi2_contingency(trunkrope_df)
trunkwire_chi

(510.960042186385,
 1.1128499296957543e-111,
 2,
 array([[ 94539.80329116, 518086.03757904,  26272.1591298 ],
        [  1964.19670884,  10763.96242096,    545.8408702 ]]))

- The p-value is 1.1128499296957543e-111, well below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between trunk problems caused by rope or wires and health of tree**).

### Trunk problems caused by lights 

In [68]:
trunklight_df = pd.pivot_table(tree_data, values='tree_id', index=['trnk_light'],
               columns=['health'], aggfunc='count', fill_value=0)

In [69]:
trunklight_df

health,Fair,Good,Poor
trnk_light,,,
No,96288,528096,26757
Yes,216,754,61


In [70]:
trunklight_chi = chi2_contingency(trunklight_df)
trunklight_chi

(42.66285289590515,
 5.443512224202895e-10,
 2,
 array([[9.63514396e+04, 5.28013956e+05, 2.67756042e+04],
        [1.52560404e+02, 8.36043789e+02, 4.23958066e+01]]))

- The p-value is 5.443512224202895e-10, well below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between trunk problems caused by lights and health of tree**).

### Presence of other trunk problems 

In [73]:
trunkother_df = pd.pivot_table(tree_data, values='tree_id', index=['trnk_other'],
               columns=['health'], aggfunc='count', fill_value=0)

In [74]:
trunkother_df

health,Fair,Good,Poor
trnk_other,,,
No,87124,509577,22898
Yes,9380,19273,3920


In [75]:
trunkother_chi = chi2_contingency(trunkother_df)
trunkother_chi

(11805.985908701758,
 0.0,
 2,
 array([[ 91684.06784713, 502436.36824335,  25478.56390952],
        [  4819.93215287,  26413.63175665,   1339.43609048]]))

- The p-value is reported as 0.0 (too small to represent as a floating-point number), which is below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between presence of other trunk problems and health of tree**).

### Branch problems caused by lights or wires

In [77]:
branchwire_df = pd.pivot_table(tree_data, values='tree_id', index=['brch_light'],
               columns=['health'], aggfunc='count', fill_value=0)

In [78]:
branchwire_df

health,Fair,Good,Poor
brch_light,,,
No,85580,479779,24448
Yes,10924,49071,2370


In [79]:
branchwire_chi = chi2_contingency(branchwire_df)
branchwire_chi

(410.09208343188084,
 8.905033276106503e-90,
 2,
 array([[ 87275.64925817, 478277.86527174,  24253.48547009],
        [  9228.35074183,  50572.13472826,   2564.51452991]]))

- The p-value is 8.905033276106503e-90, well below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between presence of branch problems caused by lights or wires and health of tree**).

### Branch problems caused by shoes

In [81]:
branchshoe_df = pd.pivot_table(tree_data, values='tree_id', index=['brch_shoe'],
               columns=['health'], aggfunc='count', fill_value=0)

In [82]:
branchshoe_df

health,Fair,Good,Poor
brch_shoe,,,
No,96410,528565,26786
Yes,94,285,32


In [83]:
branchshoe_chi = chi2_contingency(branchshoe_df)
branchshoe_chi

(38.61402692088887,
 4.121645086487258e-09,
 2,
 array([[9.64431830e+04, 5.28516718e+05, 2.68010992e+04],
        [6.08169992e+01, 3.33282248e+02, 1.69007532e+01]]))

- The p-value is 4.121645086487258e-09, well below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between presence of branch problems caused by shoes and health of tree**).

### Presence of other branch problems 

In [84]:
branchother_df = pd.pivot_table(tree_data, values='tree_id', index=['brch_other'],
               columns=['health'], aggfunc='count', fill_value=0)

In [85]:
branchother_df

health,Fair,Good,Poor
brch_other,,,
No,88616,516041,23160
Yes,7888,12809,3658


In [86]:
branchother_chi = chi2_contingency(branchother_df)
branchother_chi

(15143.783795117037,
 0.0,
 2,
 array([[ 92900.11188459, 509100.39138448,  25816.49673092],
        [  3603.88811541,  19749.60861552,   1001.50326908]]))

- The p-value is reported as 0.0 (too small to represent as a floating-point number), which is below the commonly used significance level of 0.05.
- The null hypothesis is rejected (**H₀: there is no relationship between presence of other branch problems and health of tree**).

**Variables of interest for next step** (the three with the largest chi-square statistics)
- presence of other branch problems - brch_other
- presence of other trunk problems - trnk_other
- presence of other root problems - root_other
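As a quick check on this shortlist, the statistics reported in the cells above can be ranked directly. This is an illustrative sketch only: the values are copied from the notebook outputs and rounded, the dictionary name is hypothetical, and `guards` has 6 degrees of freedom while the others have 2, so its statistic is not strictly comparable.

```python
# Chi-square statistics copied (rounded) from the outputs in this notebook.
# Note: guards has 6 degrees of freedom; every other variable has 2.
chi2_stats = {
    "curb_loc": 22.10,
    "guards": 575.25,
    "sidewalk": 268.17,
    "root_stone": 602.59,
    "root_grate": 358.14,
    "root_other": 1929.11,
    "trunk_wire": 510.96,
    "trnk_light": 42.66,
    "trnk_other": 11805.99,
    "brch_light": 410.09,
    "brch_shoe": 38.61,
    "brch_other": 15143.78,
}

# Sort variable names by statistic, largest first.
ranked = sorted(chi2_stats, key=chi2_stats.get, reverse=True)
top_three = ranked[:3]

for name in ranked:
    print(f"{name:<12} {chi2_stats[name]:>10.2f}")
```

With roughly 650,000 trees every test in this notebook is statistically significant, so the size of the statistic is the more informative basis for choosing which variables to carry forward.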