# Trees of Vancouver


## Foreword
Author: Daniel Fu Yaw Yang
Date: 19/5/2023

We are going to do an exploratory data analysis (EDA) of a subset of the [Vancouver Street Trees](https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id) data set.


## Introduction

### Question(s) of interest


For this project, we will be using a subset of the Vancouver Street Trees data set. We are provided with a smaller data set of 5,000 rows, available at https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv

The data were obtained from the City of Vancouver's Open Data Portal and follow the Open Government Licence – Vancouver. They have been wrangled and cleaned, and the subset was then generated randomly. We want to answer the following questions:

1. Is there a correlation between the diameter and height of the trees?
2. How does the diameter vary across different species?
3. Does the location planted have an impact on tree growth?
4. Do root barriers affect growth (in this case, the diameter) of trees?


# Import libraries 

Let's import the libraries necessary for our EDA.


In [1]:
import altair as alt
import pandas as pd
import numpy as np
import os
from vega_datasets import data

# Enable the data_server transformer so the charts show up in the HTML output
alt.data_transformers.enable("data_server")

DataTransformerRegistry.enable('data_server')

### Read in the dataframe from the url provided to us

In [2]:
# Read data from url

url="https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv"
trees_df=pd.read_csv(url)

trees_df

Unnamed: 0.1,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,...,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,10747,W 20TH AV,W 20TH AV,PLATANOIDES,Riley Park,2000-02-23,28.5,EVEN,ACER,N,...,15,Y,21421,NORWAY MAPLE,4,0,,N,49.252711,-123.106323
1,12573,W 18TH AV,W 18TH AV,CALLERYANA,Arbutus-Ridge,1992-02-04,6.0,ODD,PYRUS,N,...,7,Y,129645,CHANTICLEER PEAR,2,2300,CHANTICLEER,N,49.256350,-123.158709
2,29676,ROSS ST,ROSS ST,NIGRA,Sunset,,12.0,ODD,PINUS,N,...,7,Y,154675,AUSTRIAN PINE,4,7800,,N,49.213486,-123.083254
3,8856,DOMAN ST,DOMAN ST,AMERICANA,Killarney,1999-11-12,11.0,EVEN,FRAXINUS,N,...,7,Y,180803,AUTUMN APPLAUSE ASH,4,6900,AUTUMN APPLAUSE,N,49.220839,-123.036721
4,21098,EAST BOULEVARD,EAST BOULEVARD,HIPPOCASTANUM,Shaughnessy,,15.5,ODD,AESCULUS,Y,...,N,Y,74364,COMMON HORSECHESTNUT,4,5200,,N,49.238514,-123.154958
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,6132,E 53RD AV,E 53RD AV,SERRULATA,Victoria-Fraserview,,17.0,EVEN,PRUNUS,N,...,9,Y,47059,KWANZAN FLOWERING CHERRY,2,2200,KWANZAN,N,49.221161,-123.061023
4996,5642,E 32ND AV,E 32ND AV,XX,Kensington-Cedar Cottage,2014-01-14,3.0,EVEN,CORNUS,N,...,10,N,247874,EDDIES WHITE WONDER DOGWOOD,1,1700,EDDIE'S WHITE WONDER,N,49.241544,-123.070644
4997,8777,DAWSON ST,DAWSON ST,TULIPIFERA,Killarney,2002-04-15,3.5,EVEN,LIRIODENDRON,N,...,7,Y,192642,ARNOLD TULIPTREE,2,6500,ARNOLD,N,49.224511,-123.048723
4998,23489,E 13TH AV,E 13TH AV,INVOLUCRATA,Mount Pleasant,2003-12-02,5.5,EVEN,DAVIDIA,N,...,5,Y,202500,DOVE OR HANDKERCHIEF TREE,1,300,,Y,49.259208,-123.096905


<!-- There is quite a lot of redundant data in this data set, so we will drop some columns for the reasons listed:

Unnamed: 0 - Lack of documentation, so we are unable to use this column
std_street - has almost the same data as on_street, so we are only going to keep one of the two
street_side_name - Data isn't relevant for this EDA
civic_number - does not show up in the dataframe, so I decided to drop it
tree_id - just an identification number for each tree
on_street_block - on_street will be enough for the data
cultivar_name - is about the same as common_name but has more NaN values
latitude and longitude - We have the neighbourhood, so we don't need the exact location
 -->

### Cleaning Dataframe
    Now that we have the DataFrame, we can see that it has a lot of columns, and some of them are irrelevant for our EDA.
    Hence we will drop them so that we get a cleaner DataFrame. We will be dropping these columns:
        'Unnamed: 0'
        'std_street'
        "street_side_name"
        'civic_number'
        "tree_id"
        "on_street_block"
        'cultivar_name'
        'date_planted'


In [3]:
# Drop irrelevant columns
trees_df = trees_df.drop(columns=[
    'Unnamed: 0', 'std_street', 'street_side_name', 'civic_number',
    'tree_id', 'on_street_block', 'cultivar_name', 'date_planted',
])
trees_df

Unnamed: 0,on_street,species_name,neighbourhood_name,diameter,genus_name,assigned,plant_area,curb,common_name,height_range_id,root_barrier,latitude,longitude
0,W 20TH AV,PLATANOIDES,Riley Park,28.5,ACER,N,15,Y,NORWAY MAPLE,4,N,49.252711,-123.106323
1,W 18TH AV,CALLERYANA,Arbutus-Ridge,6.0,PYRUS,N,7,Y,CHANTICLEER PEAR,2,N,49.256350,-123.158709
2,ROSS ST,NIGRA,Sunset,12.0,PINUS,N,7,Y,AUSTRIAN PINE,4,N,49.213486,-123.083254
3,DOMAN ST,AMERICANA,Killarney,11.0,FRAXINUS,N,7,Y,AUTUMN APPLAUSE ASH,4,N,49.220839,-123.036721
4,EAST BOULEVARD,HIPPOCASTANUM,Shaughnessy,15.5,AESCULUS,Y,N,Y,COMMON HORSECHESTNUT,4,N,49.238514,-123.154958
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,E 53RD AV,SERRULATA,Victoria-Fraserview,17.0,PRUNUS,N,9,Y,KWANZAN FLOWERING CHERRY,2,N,49.221161,-123.061023
4996,E 32ND AV,XX,Kensington-Cedar Cottage,3.0,CORNUS,N,10,N,EDDIES WHITE WONDER DOGWOOD,1,N,49.241544,-123.070644
4997,DAWSON ST,TULIPIFERA,Killarney,3.5,LIRIODENDRON,N,7,Y,ARNOLD TULIPTREE,2,N,49.224511,-123.048723
4998,E 13TH AV,INVOLUCRATA,Mount Pleasant,5.5,DAVIDIA,N,5,Y,DOVE OR HANDKERCHIEF TREE,1,Y,49.259208,-123.096905


Now let's see the DataFrame's info.

In [4]:
#describe dataframe

trees_df.info()
print("\n")
trees_df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   on_street           5000 non-null   object 
 1   species_name        5000 non-null   object 
 2   neighbourhood_name  5000 non-null   object 
 3   diameter            5000 non-null   float64
 4   genus_name          5000 non-null   object 
 5   assigned            5000 non-null   object 
 6   plant_area          4950 non-null   object 
 7   curb                5000 non-null   object 
 8   common_name         5000 non-null   object 
 9   height_range_id     5000 non-null   int64  
 10  root_barrier        5000 non-null   object 
 11  latitude            5000 non-null   float64
 12  longitude           5000 non-null   float64
dtypes: float64(3), int64(1), object(9)
memory usage: 507.9+ KB




Unnamed: 0,diameter,height_range_id,latitude,longitude
count,5000.0,5000.0,5000.0,5000.0
mean,12.340888,2.7344,49.247349,-123.107128
std,9.2666,1.56957,0.021251,0.049137
min,0.0,0.0,49.202783,-123.22056
25%,4.0,2.0,49.230152,-123.144178
50%,10.0,2.0,49.247981,-123.105861
75%,18.0,4.0,49.263275,-123.063484
max,71.0,9.0,49.29393,-123.023311
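
The `info()` output above lists 4950 non-null values for `plant_area` out of 5000 rows, so 50 entries are missing in that column. As a minimal sketch, this is one way we could count missing values per column. The `toy_df` frame and its rows are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Toy stand-in for trees_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "genus_name": ["ACER", "PYRUS", "PINUS", "ACER"],
    "plant_area": ["15", None, "7", None],
    "diameter": [28.5, 6.0, 12.0, 3.0],
})

# isna() flags missing cells; summing the booleans counts them per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Keep only the columns that actually contain missing values.
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```

Running the same two lines on `trees_df` would show which columns need attention before plotting.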


# Question 1
## Is there a relationship between diameter and height of the trees?
 
To find out, we are going to plot a chart with diameter (inches) on the x axis and height on the y axis. For height we use the height range id, where each unit represents 10 ft.
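
To make the height unit concrete, here is a small sketch that turns a height range id into a band in feet. The helper name `height_range_to_feet` is hypothetical, and the mapping assumes that id 0 covers 0 to 10 ft, id 1 covers 10 to 20 ft, and so on, which is our reading of "10 ft per unit".

```python
from typing import Tuple


def height_range_to_feet(height_range_id: int) -> Tuple[int, int]:
    """Return the (lower, upper) bounds in feet for a height range id.

    Assumption: each id covers one 10 ft band, e.g. id 2 -> 20 to 30 ft.
    """
    if height_range_id < 0:
        raise ValueError("height_range_id must be non-negative")
    lower = height_range_id * 10
    return lower, lower + 10


for range_id in (0, 2, 4):
    low, high = height_range_to_feet(range_id)
    print(f"id {range_id}: {low}-{high} ft")
```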


In [5]:
scatter = alt.Chart(trees_df).mark_circle().encode(
    x=alt.X('diameter', title='Diameter (in)'),
    y=alt.Y('height_range_id',title ='Height(*10 ft)'),
    color=alt.Color('genus_name:N', legend=None),
    tooltip=['genus_name:N', 'diameter:Q', 'height_range_id:Q', 'neighbourhood_name:N']
).properties(
    width=700,
    height=500
)
scatter


The chart above suggests a linear relationship, but it isn't very definitive, so let's see whether fitting a regression line of height on diameter gives a clearer picture.
With the code below, we can see a linear progression: as diameter increases, height increases. Since the two have a positive relationship, we can focus on just one of them for the rest of the EDA.

In [6]:
# create linear regression line
regression = scatter.transform_regression(
    'diameter', 'height_range_id', method='poly', order=1
).mark_line(color='red')
regression

In [7]:
brush = alt.selection_interval(encodings=['x', 'y'])
height_diameter_chart = (scatter + regression).add_selection(brush).properties(
    title='Relationship between tree height and diameter',
    width=600,
    height=400
)

# Apply opacity based on brush selection
height_diameter_chart = height_diameter_chart.encode(
    opacity=alt.condition(brush, alt.value(0.8), alt.value(0.1))
)

height_diameter_chart

There is! As height increases, diameter also increases across all the species of trees provided.
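
If we want a number to go with the chart, a correlation coefficient is one option. The sketch below uses a made-up `toy_df` (not the real data) and computes both the Pearson correlation and a rank-based (Spearman) correlation. The rank-based one is arguably a better fit here because `height_range_id` is an ordered band and not a continuous measurement.

```python
import pandas as pd

# Toy stand-in for trees_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "diameter": [3.0, 6.0, 11.0, 15.5, 28.5],
    "height_range_id": [1, 2, 4, 4, 6],
})

# Pearson correlation (the pandas default) measures linear association.
pearson = toy_df["diameter"].corr(toy_df["height_range_id"])

# Spearman correlation is the Pearson correlation of the ranks,
# so ranking both columns first gives the rank-based version.
spearman = toy_df["diameter"].rank().corr(toy_df["height_range_id"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Swapping `toy_df` for `trees_df` would give the coefficients for the actual subset.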

# Question 2
## Do different genera of trees have different median diameters?

The DataFrame has a lot of tree genera, so let's filter it down to the 5 most commonly planted genera around Vancouver.

In [8]:
# Calculate the count of each genus
genus_count = trees_df['genus_name'].value_counts()

# Select the top five genera
top_genus = genus_count.head(5).index.tolist()

# Filter the dataframe for the top five genera
filtered_df = trees_df[trees_df['genus_name'].isin(top_genus)]
filtered_df 


Unnamed: 0,on_street,species_name,neighbourhood_name,diameter,genus_name,assigned,plant_area,curb,common_name,height_range_id,root_barrier,latitude,longitude
0,W 20TH AV,PLATANOIDES,Riley Park,28.5,ACER,N,15,Y,NORWAY MAPLE,4,N,49.252711,-123.106323
3,DOMAN ST,AMERICANA,Killarney,11.0,FRAXINUS,N,7,Y,AUTUMN APPLAUSE ASH,4,N,49.220839,-123.036721
6,NASSAU DRIVE,CAMPESTRE,Victoria-Fraserview,12.0,ACER,N,15,Y,HEDGE MAPLE,3,N,49.217522,-123.071311
8,W PENDER ST,PALUSTRIS,Downtown,8.0,QUERCUS,N,C,Y,PIN OAK,1,N,49.281303,-123.108253
11,W 45TH AV,CERASIFERA,Kerrisdale,4.5,PRUNUS,N,8,Y,NIGHT PURPLE LEAF PLUM,2,N,49.230925,-123.156131
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4991,WALES ST,AMERICANA,Renfrew-Collingwood,19.0,TILIA,N,7,Y,BASSWOOD,5,N,49.236139,-123.051816
4992,E 53RD AV,SERRULATA,Victoria-Fraserview,20.0,PRUNUS,N,9,Y,KWANZAN FLOWERING CHERRY,2,N,49.221161,-123.060833
4994,ASH ST,TRUNCATUM,Marpole,3.0,ACER,N,,Y,PACIFIC SUNSET MAPLE,1,N,49.216851,-123.120103
4995,E 53RD AV,SERRULATA,Victoria-Fraserview,17.0,PRUNUS,N,9,Y,KWANZAN FLOWERING CHERRY,2,N,49.221161,-123.061023


Now that we have a filtered DataFrame, we can plot a circle (scatter) chart of the diameters of the top 5 tree genera in Vancouver.

In [9]:
chart = alt.Chart(filtered_df).mark_circle().encode(
    # Diameter is in inches, matching the Question 1 chart
    x=alt.X('diameter', title='Diameter (in)'),
    # The y axis is the height range id, not a count
    y=alt.Y('height_range_id', title='Height (*10 ft)'),
    color=alt.Color('genus_name', title='Genus'),
    tooltip=['genus_name', 'diameter', 'height_range_id', 'neighbourhood_name']
).properties(
    title='Diameter Distribution for Top 5 Genera',
    width=600,
    height=400
)
chart

Then we can facet the chart above by genus and see whether the diameter of the trees stays linear within each genus.

In [10]:
chart_facet =chart.facet(column='genus_name', columns=5).resolve_scale(y='independent')
chart_facet

With the faceted charts, we can see that the genera do not have the same distribution of diameter.
If we want to go into more detail, we can draw a boxplot to find the median diameter and more.


In [11]:
# Create a boxplot to analyze the diameter distribution of the selected genera
filtered_chart = alt.Chart(filtered_df).mark_boxplot().encode(
    x='genus_name:N',
    y='diameter:Q',
    color='genus_name:N'
).properties(
    title='Diameter Distribution for Top 5 Genera',
    width=600,
    height=400
)

filtered_chart

From the boxplot, we can see that Fraxinus has the smallest median diameter. Prunus and Tilia share the same median, although the diameters of Prunus are more spread out than those of Tilia. Finally, Quercus has the highest median diameter.
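
The medians and spreads read off the boxplot can also be computed directly. This is a rough sketch on a made-up `toy_df` (the rows are illustrative, not from the data set) that computes the quartiles and the interquartile range per genus.

```python
import pandas as pd

# Toy stand-in for filtered_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "genus_name": ["ACER", "ACER", "ACER", "PRUNUS", "PRUNUS", "PRUNUS"],
    "diameter": [3.0, 12.0, 28.5, 4.5, 17.0, 20.0],
})

# Quartiles per genus: one row per genus, one column per quantile.
quartiles = (
    toy_df.groupby("genus_name")["diameter"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# The interquartile range is the height of the box in a boxplot.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]

print(quartiles.sort_values("median"))
```

Sorting by `median` puts the genera in the same order we would read them off the boxplot.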

# Question 3 
## Does the location planted have an impact on tree growth?

Now we are going to see whether different locations produce healthier, and hence wider, trees!

In [12]:
# create a dropdown selection tool for genus_name
genus_name_select = alt.binding_select(options=list(filtered_df['genus_name'].unique()), name='Genus Name')
genus_name_selector = alt.selection_single(fields=['genus_name'], bind=genus_name_select)

# create a dropdown selection tool for neighbourhood_name
neighbourhood_name_select = alt.binding_select(options=list(trees_df['neighbourhood_name'].unique()), name='Neighbourhood')
neighbourhood_name_selector = alt.selection_single(fields=['neighbourhood_name'], bind=neighbourhood_name_select)


Let's focus on the neighbourhood in the first chart. With this chart, we can see the diameter distribution of the top 5 tree genera in a selected neighbourhood.


In [13]:
genus_name_chart = chart.encode(opacity=alt.condition(neighbourhood_name_selector, alt.value(1), alt.value(0))).add_selection(neighbourhood_name_selector)
genus_name_chart

After fiddling around with the dropdown, we can see that the distribution of tree diameters is quite similar across neighbourhoods, but it's not quite clear.

We need a more suitable chart that plots all of these trees together, without separating them by genus, so we can get a clearer picture of the distribution of diameter in different neighbourhoods.

In [14]:
# Create a scatter plot to correlate location planted with tree diameter
scatter_plot = alt.Chart(filtered_df).mark_circle().encode(
    x=alt.X('neighbourhood_name:O', title='Neighbourhood'),
    y=alt.Y('diameter:Q', title='Diameter'),
    color=alt.Color('diameter:Q', scale=alt.Scale(scheme='viridis'), legend=None)
).properties(
    title='Correlation between Location Planted and Tree Diameter',
    width=600,
    height=400
)

scatter_plot

In this scatterplot, the distribution of diameter among the trees is about even across the neighbourhoods, so location does not appear to affect the growth of trees in Vancouver.
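
Since some neighbourhoods may simply have more trees of a large genus, one way to double check the picture is to compare median diameters per neighbourhood within each genus. The sketch below does this with a pivot table on a made-up `toy_df`. The rows are illustrative only and say nothing about the real neighbourhoods.

```python
import pandas as pd

# Toy stand-in for filtered_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "neighbourhood_name": ["Sunset", "Sunset", "Killarney",
                           "Killarney", "Sunset", "Killarney"],
    "genus_name": ["ACER", "ACER", "ACER", "ACER", "PRUNUS", "PRUNUS"],
    "diameter": [10.0, 14.0, 11.0, 13.0, 6.0, 8.0],
})

# Rows are neighbourhoods, columns are genera, cells are median diameters.
median_by_area = toy_df.pivot_table(
    index="neighbourhood_name",
    columns="genus_name",
    values="diameter",
    aggfunc="median",
)
print(median_by_area)
```

If the columns of this table look flat from top to bottom for the real data, that would support the reading that location matters little.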

# Question 4
## Do root barriers affect growth (in this case, diameter) of trees?

Now we will see if a root barrier affects the growth of the trees 

In [15]:
scatter_plot = alt.Chart(filtered_df).mark_circle(
).encode(
    x=alt.X('genus_name:N', title='Genus Name'),
    y=alt.Y('diameter:Q', title='Diameter'),
    color=alt.Color('root_barrier:N', scale=alt.Scale(scheme='darkred', reverse=True), legend=alt.Legend(title='Root Barrier')),
    tooltip=['genus_name', 'root_barrier', 'diameter']
).properties(
    title='Correlation between Genus Name, Root Barrier, and Diameter',
    width=600,
    height=400
).interactive()


scatter_plot 

Even though the color scheme isn't the greatest, even with the chart zoomed in, we can vaguely tell that trees with root barriers show less growth. This makes sense, as a barrier restricts their space to grow. If we want a clear and simple chart, we can go for a bar plot as shown below:

In [16]:
bar_plot = alt.Chart(filtered_df).mark_bar().encode(
    x=alt.X('root_barrier:N', title='Root Barrier'),
    y=alt.Y('diameter:Q', title='Diameter'),
    color=alt.Color('root_barrier:N', legend=None)
).properties(
    title='Comparison of Root Barrier with Tree Diameter',
    width=400,
    height=300
)
bar_plot 

With this, it is very clear that root barriers almost halve the trees' growth in diameter.
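
Because the bar chart encodes the raw `diameter` column without an explicit aggregate, it may help to back the comparison with a per-group summary. This is a minimal sketch on a made-up `toy_df` (the rows and the resulting ratio are illustrative, not results from the data set) that computes the count, mean and median diameter for trees with and without a root barrier.

```python
import pandas as pd

# Toy stand-in for filtered_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "root_barrier": ["N", "N", "N", "Y", "Y"],
    "diameter": [10.0, 20.0, 30.0, 4.0, 6.0],
})

# One row per root_barrier value with its count, mean and median diameter.
summary = toy_df.groupby("root_barrier")["diameter"].agg(["count", "mean", "median"])
print(summary)

# Ratio of mean diameters: trees with a barrier relative to trees without.
ratio = summary.loc["Y", "mean"] / summary.loc["N", "mean"]
print(f"Mean diameter ratio (Y / N): {ratio:.2f}")
```

The `count` column is worth a look too, since a group with very few trees gives a less reliable average.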

# Conclusion

## Question 1: Is there a correlation between diameter and height of the trees?
    ~ The combined chart shows that there is a positive relationship between diameter and height.
        As diameter increases, height also increases.


## Question 2: Do different genera of trees have different median diameters?
    ~ We can see in the faceted charts and the boxplot that the distribution of diameter differs across the top 5 tree genera.

## Question 3: Does the location planted have an impact on tree growth?
    ~ We plotted a scatter plot of the top 5 tree genera around Vancouver and can see that location does not appear to have an impact on the growth of trees.

## Question 4: Do root barriers affect growth (in this case, diameter) of trees?
    ~ Yes. The bar plot paints a very clear picture that root barriers almost halve the trees' growth in diameter.
       



# References
Trees DataFrame : https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name
Vancouver Map Help : https://cdn-uploads.piazza.com/paste/klvia6r082u1jy/f31721ac4272704ae2ef201335cd6142943f6dd9aac1598230415ecfa65e0319/vancouver_map_help.html
Data Visualisation Modules : https://canvas.ubc.ca/courses/114341/modules