# Vancouver Trees Dataset Analysis and Visualization Report
<div style="text-align: right"> Oct 22, 2022 </div>

<div style="text-align: right"> Notebook by Renbo Xu </div>

## Introduction
I am from China and came to Canada to study several years ago; what impressed me most is its beautiful natural environment. With its scenic views, mild climate, and friendly people, Vancouver is well known around the world as both a popular tourist destination and one of the best places to live (https://vancouver.ca/news-calendar/our-city.aspx). Its diverse natural environment, including mountains, ocean, and wildlife, has attracted many immigrants. Vancouver's trees are one of its most important features and contribute a great deal to the beauty of the city. I am curious about how trees are distributed across Vancouver's neighbourhoods, which genera they belong to, when they were planted, and so on. To answer these questions, I carry out an exploratory data analysis and visualization of the Vancouver street trees dataset.

In this report, I explore the distribution of Vancouver's trees by analyzing a subset of the Vancouver Street Trees dataset (https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name). The data were obtained from the City of Vancouver's Open Data Portal and are released under the Open Government Licence – Vancouver (https://opendata.vancouver.ca/pages/licence/). The subset analyzed here is available at https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv.

## Questions of interest
By exploring this dataset, I am interested in answering the following questions:
* What is the number of trees planted in different years and months?
* Is there a relationship between tree diameter and height?
* What is the number of trees in each neighbourhood and of each genus?
* How are the top 5 tree genera distributed across neighbourhoods?

For the final dashboard, I would like to present tree details for a selected Vancouver neighbourhood, such as the top 10 genera, the distribution of tree diameter and height, the planting year, and the street side.

## Analysis

### Data imports

In [1]:
# Import libraries needed for EDA
import altair as alt
import pandas as pd
import numpy as np

alt.data_transformers.enable('data_server')

DataTransformerRegistry.enable('data_server')

In [2]:
# Load the dataset and parse the 'date_planted' as date datatype
van_tree_df = pd.read_csv('small_unique_vancouver.csv', parse_dates=['date_planted'])

# Take a look at all the columns of the dataset
van_tree_df.head()

Unnamed: 0.1,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,...,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,10747,W 20TH AV,W 20TH AV,PLATANOIDES,Riley Park,2000-02-23,28.5,EVEN,ACER,N,...,15,Y,21421,NORWAY MAPLE,4,0,,N,49.252711,-123.106323
1,12573,W 18TH AV,W 18TH AV,CALLERYANA,Arbutus-Ridge,1992-02-04,6.0,ODD,PYRUS,N,...,7,Y,129645,CHANTICLEER PEAR,2,2300,CHANTICLEER,N,49.25635,-123.158709
2,29676,ROSS ST,ROSS ST,NIGRA,Sunset,NaT,12.0,ODD,PINUS,N,...,7,Y,154675,AUSTRIAN PINE,4,7800,,N,49.213486,-123.083254
3,8856,DOMAN ST,DOMAN ST,AMERICANA,Killarney,1999-11-12,11.0,EVEN,FRAXINUS,N,...,7,Y,180803,AUTUMN APPLAUSE ASH,4,6900,AUTUMN APPLAUSE,N,49.220839,-123.036721
4,21098,EAST BOULEVARD,EAST BOULEVARD,HIPPOCASTANUM,Shaughnessy,NaT,15.5,ODD,AESCULUS,Y,...,N,Y,74364,COMMON HORSECHESTNUT,4,5200,,N,49.238514,-123.154958


### Data Description and analysis

First, let's take a look at the general information of this dataset and all the columns.

In [3]:
van_tree_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Unnamed: 0          5000 non-null   int64         
 1   std_street          5000 non-null   object        
 2   on_street           5000 non-null   object        
 3   species_name        5000 non-null   object        
 4   neighbourhood_name  5000 non-null   object        
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64       
 7   street_side_name    5000 non-null   object        
 8   genus_name          5000 non-null   object        
 9   assigned            5000 non-null   object        
 10  civic_number        5000 non-null   int64         
 11  plant_area          4950 non-null   object        
 12  curb                5000 non-null   object        
 13  tree_id             5000 non-null   int64       

According to the output above, the dataset has 5000 entries and 21 columns. The first column, **Unnamed: 0**, has no meaningful name: it is the row index of the original dataset (we are only analyzing a subset). This column is therefore dropped before further analysis.

In [4]:
# Drop the index column of original dataset
van_tree_df=van_tree_df.iloc[:,1:]
van_tree_df.head()

Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,civic_number,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,W 20TH AV,W 20TH AV,PLATANOIDES,Riley Park,2000-02-23,28.5,EVEN,ACER,N,66,15,Y,21421,NORWAY MAPLE,4,0,,N,49.252711,-123.106323
1,W 18TH AV,W 18TH AV,CALLERYANA,Arbutus-Ridge,1992-02-04,6.0,ODD,PYRUS,N,2323,7,Y,129645,CHANTICLEER PEAR,2,2300,CHANTICLEER,N,49.25635,-123.158709
2,ROSS ST,ROSS ST,NIGRA,Sunset,NaT,12.0,ODD,PINUS,N,7855,7,Y,154675,AUSTRIAN PINE,4,7800,,N,49.213486,-123.083254
3,DOMAN ST,DOMAN ST,AMERICANA,Killarney,1999-11-12,11.0,EVEN,FRAXINUS,N,6938,7,Y,180803,AUTUMN APPLAUSE ASH,4,6900,AUTUMN APPLAUSE,N,49.220839,-123.036721
4,EAST BOULEVARD,EAST BOULEVARD,HIPPOCASTANUM,Shaughnessy,NaT,15.5,ODD,AESCULUS,Y,5295,N,Y,74364,COMMON HORSECHESTNUT,4,5200,,N,49.238514,-123.154958
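
Dropping the leftover index column by position works here, but it depends on that column being first. As a hedged alternative, the sketch below drops it by name instead. It uses a small made-up frame (`raw_df` and its values are illustrative only, not the real data) so that it runs on its own.

```python
import pandas as pd

# Toy frame standing in for the raw CSV; names and values are illustrative only.
raw_df = pd.DataFrame({
    "Unnamed: 0": [10747, 12573],
    "std_street": ["W 20TH AV", "W 18TH AV"],
    "diameter": [28.5, 6.0],
})

# Drop every column whose label starts with "Unnamed", wherever it sits.
unnamed_cols = [col for col in raw_df.columns if col.startswith("Unnamed")]
clean_df = raw_df.drop(columns=unnamed_cols)

print(clean_df.columns.tolist())
```

Selecting by label makes the intent explicit and keeps the cell safe to re-run, because a second run finds nothing left to drop.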


Next, let us take a look at the description of each column on the dataset website (https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name). The column details are summarized in Table 1.

#### <center> Table 1 Vancouver dataset column details </center>

|Column name|Datatype|Details|
|-----------|--------|-------|
|std_street|object|Street name of the site with which the tree is associated|
|on_street|object|The name of the street on which the tree is physically located|
|species_name|object|Species name|
|neighbourhood_name|object|City's defined local area in which the tree is located|
|date_planted|datetime|The date of planting in YYYYMMDD format|
|diameter|float|DBH in inches (DBH stands for diameter of tree at breast height)|
|street_side_name|object|The street side on which the tree is physically located (Even, Odd or Median (Med))|
|genus_name|object|Genus name|
|assigned|object|Indicates whether the address is made up to associate the tree with a nearby lot (Y = Yes, N = No)|
|civic_number|int|Street address of the site with which the tree is associated|
|plant_area|object|B = behind sidewalk, G = in tree grate, N = no sidewalk, C = cutout; a number indicates boulevard width in feet|
|curb|object|Curb presence (Y = Yes, N = No)|
|tree_id|int|Numerical ID|
|common_name|object|Common name|
|height_range_id|int|0-10 for every 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft)|
|on_street_block|int|The street block on which the tree is physically located|
|cultivar_name|object|Cultivar name|
|root_barrier|object|Root barrier installed (Y = Yes, N = No)|
|latitude|float|Location latitude|
|longitude|float|Location longitude|
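
Since **height_range_id** is a coded bin rather than a measured height, it may help to translate it back into feet when reading the charts. The helper below is a minimal sketch based only on the description in Table 1; the function name `height_range_label` is my own and is not part of the dataset or any library.

```python
def height_range_label(height_range_id: int) -> str:
    """Return the height range in feet for a height_range_id between 0 and 10."""
    if not 0 <= height_range_id <= 10:
        raise ValueError(f"height_range_id must be in 0..10, got {height_range_id}")
    if height_range_id == 10:
        # The last bin is open-ended according to the column description.
        return "100+ ft"
    lower = height_range_id * 10
    return f"{lower}-{lower + 10} ft"


for example_id in (0, 2, 10):
    print(example_id, "->", height_range_label(example_id))
```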

In [5]:
# Let us print out the summarized information for numeric columns
van_tree_df.describe()

Unnamed: 0,diameter,civic_number,tree_id,height_range_id,on_street_block,latitude,longitude
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,12.340888,2975.7076,128682.5846,2.7344,2960.227,49.247349,-123.107128
std,9.2666,2078.580429,75412.260406,1.56957,2086.861052,0.021251,0.049137
min,0.0,2.0,36.0,0.0,0.0,49.202783,-123.22056
25%,4.0,1300.5,61321.5,2.0,1300.0,49.230152,-123.144178
50%,10.0,2639.0,130130.5,2.0,2600.0,49.247981,-123.105861
75%,18.0,4123.0,191332.0,4.0,4100.0,49.263275,-123.063484
max,71.0,9113.0,270750.0,9.0,9100.0,49.29393,-123.023311


This dataset has 7 numerical columns: **diameter, civic_number, tree_id, height_range_id, on_street_block, latitude and longitude**. The remaining columns are categorical, except **date_planted**, which is temporal.
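
The split into numerical, temporal and categorical columns can also be read off the dtypes directly. The sketch below does this on a small made-up frame (`toy_df` is illustrative, not the real data). Note that a numeric dtype does not always mean a quantity: columns such as **tree_id** are stored as integers but are really identifiers.

```python
import pandas as pd

# Toy frame with one column of each kind; the values are illustrative only.
toy_df = pd.DataFrame({
    "diameter": [28.5, 6.0],
    "tree_id": [21421, 129645],
    "genus_name": ["ACER", "PYRUS"],
    "date_planted": pd.to_datetime(["2000-02-23", "1992-02-04"]),
})

numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
temporal_cols = toy_df.select_dtypes(include="datetime").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print(numeric_cols, temporal_cols, categorical_cols)
```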

In [6]:
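# Let us print out the summarized information for categorical and temporal columns
# Note: datetime_is_numeric exists only in pandas 1.1 to 1.5; it was removed in pandas 2.0,
# so drop the argument when running on a newer version
van_tree_df.describe(exclude=[np.number],datetime_is_numeric=True)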

Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,street_side_name,genus_name,assigned,plant_area,curb,common_name,cultivar_name,root_barrier
count,5000,5000,5000,5000,2363,5000,5000,5000,4950.0,5000,5000,2658,5000
unique,603,607,171,22,,4,67,2,38.0,2,361,176,2
top,CAMBIE ST,CAMBIE ST,SERRULATA,Renfrew-Collingwood,,ODD,ACER,N,10.0,Y,KWANZAN FLOWERING CHERRY,KWANZAN,N
freq,52,49,463,384,,2554,1218,4564,736.0,4593,383,383,4679
mean,,,,,2003-09-06 04:03:08.912399488,,,,,,,,
min,,,,,1989-10-31 00:00:00,,,,,,,,
25%,,,,,1997-11-06 00:00:00,,,,,,,,
50%,,,,,2003-02-12 00:00:00,,,,,,,,
75%,,,,,2009-11-17 00:00:00,,,,,,,,
max,,,,,2019-05-07 00:00:00,,,,,,,,


In this dataset, most columns have 5000 entries, while **date_planted, plant_area, cultivar_name** have fewer: 2363, 4950, and 2658, respectively. **date_planted** and **cultivar_name** are each missing for roughly half of the rows. I will keep **date_planted** because it is one of my variables of interest, but I will drop **cultivar_name**. I will also drop the rows where **plant_area** is NaN. Finally, I will drop some columns of no interest: **std_street, on_street, assigned, civic_number**.
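
The missing-value counts quoted above were read from the `describe()` output. A more direct way to get them is to count NaN values per column, as in the minimal sketch below. The frame `toy_df` is made up for illustration and does not reproduce the real counts.

```python
import pandas as pd

# Toy frame with gaps in the same columns as the real data; values are illustrative.
toy_df = pd.DataFrame({
    "date_planted": pd.to_datetime(["2000-02-23", None, None, "1999-11-12"]),
    "plant_area": ["15", "7", None, "N"],
    "cultivar_name": [None, "CHANTICLEER", None, None],
    "diameter": [28.5, 6.0, 12.0, 11.0],
})

# Number and share of missing values per column.
missing_summary = pd.DataFrame({
    "missing": toy_df.isna().sum(),
    "missing_share": toy_df.isna().mean(),
})

print(missing_summary)
```

A table like this makes it easier to justify which columns to keep, which to drop, and where dropping rows is cheap.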

In [7]:
# Drop the columns of no interest and 'cultivar_name'
tree_df=van_tree_df.drop(columns=['std_street','on_street','assigned','civic_number','cultivar_name'])

# Drop the NaN rows in 'plant_area'
tree_df=tree_df.dropna(subset=['plant_area'])
tree_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4950 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   species_name        4950 non-null   object        
 1   neighbourhood_name  4950 non-null   object        
 2   date_planted        2328 non-null   datetime64[ns]
 3   diameter            4950 non-null   float64       
 4   street_side_name    4950 non-null   object        
 5   genus_name          4950 non-null   object        
 6   plant_area          4950 non-null   object        
 7   curb                4950 non-null   object        
 8   tree_id             4950 non-null   int64         
 9   common_name         4950 non-null   object        
 10  height_range_id     4950 non-null   int64         
 11  on_street_block     4950 non-null   int64         
 12  root_barrier        4950 non-null   object        
 13  latitude            4950 non-null   float64     

Now the dataset is ready for visualization. 

## Exploratory Visualizations

### Question 1: What is the number of trees planted in different years and months?

In [8]:
# Drop the NaN values of 'date_planted' for analysis
tree_date_df=tree_df.dropna(subset=['date_planted'])

# Create 'Year' and 'Month' columns from date_planted
tree_date_df=tree_date_df.assign(Year=tree_date_df['date_planted'].dt.year.astype(int))
tree_date_df=tree_date_df.assign(Month=tree_date_df['date_planted'].dt.month_name())
tree_date_df.head()

Unnamed: 0,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,root_barrier,latitude,longitude,Year,Month
0,PLATANOIDES,Riley Park,2000-02-23,28.5,EVEN,ACER,15,Y,21421,NORWAY MAPLE,4,0,N,49.252711,-123.106323,2000,February
1,CALLERYANA,Arbutus-Ridge,1992-02-04,6.0,ODD,PYRUS,7,Y,129645,CHANTICLEER PEAR,2,2300,N,49.25635,-123.158709,1992,February
3,AMERICANA,Killarney,1999-11-12,11.0,EVEN,FRAXINUS,7,Y,180803,AUTUMN APPLAUSE ASH,4,6900,N,49.220839,-123.036721,1999,November
5,PERSICA,West End,2012-04-05,3.0,EVEN,PARROTIA,C,Y,233622,VANESSA PERSIAN IRONWOOD,1,1100,N,49.281906,-123.133076,2012,April
7,OFFICINALIS,Kensington-Cedar Cottage,2001-04-02,3.0,EVEN,MAGNOLIA,N,Y,187792,CHINESE MAGNOLIA,2,3700,N,49.251127,-123.071912,2001,April


In [9]:
# Plot two bar charts to view the number of trees planted in each year and in each month
year_chart=(alt.Chart(tree_date_df)
            .mark_bar()
            .encode(
                alt.X('Year:N', title='Tree planted year'),
                alt.Y('count()',title='Number of trees',axis=alt.Axis(grid=False)),
                alt.Color(value='green'),
                tooltip='count()')
            .properties(title='Number of trees planted in different year'))

month_order=['January','February','March','April','May','June','July','August','September','October','November','December']

month_chart=(alt.Chart(tree_date_df)
             .mark_bar().encode(
                 alt.X('Month:O', title='Tree planted month',sort=month_order),
                 alt.Y('count()',title='Number of trees', axis=alt.Axis(grid=False)),
                 alt.Color(value='pink'),
                 tooltip=('count()'))
             .properties(title='Number of trees planted in different month'))

date_chart=year_chart|month_chart
date_chart

#### Figure 1 Vancouver trees planted in different years and months

From Figure 1, the number of trees planted increases steadily from 1989 to 1996. From 1996 to 2013 it fluctuates slightly, and then it drops sharply from 2013 to 2016. After that, it starts to increase slightly again. By month, February has the most trees planted, while July and August have the fewest.
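
The trends described above are read off the bar charts. To check them numerically, the same counts can be computed with pandas. The sketch below uses a handful of made-up planting dates (`toy_dates` is illustrative only), so the numbers it prints are not the real results.

```python
import pandas as pd

# Toy planting dates; illustrative values, not the real data.
toy_dates = pd.DataFrame({
    "date_planted": pd.to_datetime(
        ["1995-02-10", "1995-02-21", "1995-11-03", "1996-02-14", "1996-07-01"]
    )
})

# Trees per year, in chronological order.
trees_per_year = toy_dates["date_planted"].dt.year.value_counts().sort_index()

# Trees per month name, most frequent month first.
trees_per_month = toy_dates["date_planted"].dt.month_name().value_counts()

print(trees_per_year)
print("Busiest month:", trees_per_month.idxmax())
```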

### Question 2: Is there a relationship between tree diameter and height?

Usually, a tree with a larger diameter is also taller. Is this true for the Vancouver tree dataset? Let's take a look at the relationship between **diameter** and **height_range_id**.

In [10]:
# Create a scatterplot for 'diameter' and 'height_range_id', set the count() for color channel
diameter_height_scatterplot=(alt.Chart(tree_df)
                             .mark_circle()
                             .encode(
                                 alt.X('height_range_id:Q', title='Tree height range id'),
                                 alt.Y('diameter:Q', title='Tree diameter (inch)'),
                                 alt.Color('count()', title='Number of trees'))
                             .properties(title='Relationship of tree diameter VS tree height range id'))

diameter_height_scatterplot

#### Figure 2 Relationship of diameter and height of Vancouver trees

From the scatterplot, tree height and tree diameter show a positive relationship: trees with a larger diameter usually have a higher height range id. However, this is only an overall trend and does not hold for every single point. Therefore, I would like to draw a boxplot to reveal more statistics, and also add a line of the mean diameter to the scatterplot.

In [11]:
# Create a boxplot for 'diameter' and 'height_range_id'
diameter_height_boxplot=(alt.Chart(tree_df)
                         .mark_boxplot()
                         .encode(
                             alt.X('height_range_id:Q', title='Tree height range id'),
                             alt.Y('diameter:Q', title='Tree diameter (inch)'))
                         .properties(title='Relationship of tree diameter VS tree height range id'))

# Create a line chart using mean value of 'diameter' and add to scatterplot
diameter_height_lineplot=(alt.Chart(tree_df)
                          .mark_line(color='red')
                          .encode(
                              alt.X('height_range_id:Q', title='Tree height range id'),
                              alt.Y('mean(diameter):Q', title='')))

# Layer the line chart on the scatterplot, then place the boxplot beside it (| concatenates horizontally)
(diameter_height_lineplot + diameter_height_scatterplot)|diameter_height_boxplot

#### Figure 3 Scatterplot and boxplot of diameter and height of Vancouver trees

From Figure 3, the overall relationship between tree diameter and height range id (reflected by the mean and median values) is positive, except for a slight decrease in diameter from height range id 8 to 9 (this could be due to the small number of data points).
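
One way to check whether the dip at the highest height range is backed by few trees is to tabulate the count alongside the mean and median for each **height_range_id**. The sketch below does this on made-up data (`toy_trees` is illustrative only) in which the last bin deliberately holds a single tree.

```python
import pandas as pd

# Toy data: the highest bin contains a single tree; values are illustrative only.
toy_trees = pd.DataFrame({
    "height_range_id": [1, 1, 2, 2, 2, 8, 8, 8, 9],
    "diameter": [3.0, 4.0, 8.0, 10.0, 12.0, 38.0, 40.0, 42.0, 36.0],
})

# Count, mean and median diameter for each height range.
diameter_by_height = (
    toy_trees.groupby("height_range_id")["diameter"]
    .agg(["count", "mean", "median"])
)

print(diameter_by_height)
```

When the count in a bin is very small, its mean and median are unreliable, so a drop there should not be over-interpreted.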

Knowing only the relationship between tree diameter and height is not enough for this report; I am more interested in the tree details of different neighbourhoods. Therefore, I present tree diameter and height range id as rug plots, with dropdown widgets to select a neighbourhood and a genus. I am also curious, for a selected neighbourhood and genus, how many trees are on each street side (**street_side_name**) and when they were planted. I created the following plot (Figure 4) to answer these questions and will add it to the final dashboard.

In [12]:
neighbourhood = tree_df['neighbourhood_name'].unique()
dropdown_neighbourhood = alt.binding_select(name='Neighbourhood Name', options=neighbourhood)

genus = tree_df['genus_name'].unique()
dropdown_genus=alt.binding_select(name='Genus', options=genus)

select_neighbourhood_genus=alt.selection_single(fields=['neighbourhood_name','genus_name'], bind= {'neighbourhood_name': dropdown_neighbourhood, 'genus_name': dropdown_genus})

tree_diameter = (alt.Chart(tree_df)
                 .mark_tick()
                 .encode(
                     alt.X('diameter:Q', title='Diameter (inches)', scale=alt.Scale(domain=(0, 80))),
                     color=alt.condition(select_neighbourhood_genus, alt.value('blue'), alt.value('')))
                .add_selection(select_neighbourhood_genus))

tree_height = (alt.Chart(tree_df)
                 .mark_tick()
                 .encode(
                     alt.X('height_range_id:Q', title='Height range ID', scale=alt.Scale(domain=(0, 9))), 
                     color=alt.condition(select_neighbourhood_genus, alt.value('orange'), alt.value('')))
                .add_selection(select_neighbourhood_genus))

street_side_barplot = (alt.Chart(tree_df)
               .transform_filter(select_neighbourhood_genus)
               .mark_bar()
               .encode(
                   alt.X('street_side_name:N', title='Street side name'),
                   alt.Y('tree_count:Q', title='Number of trees'),
                   alt.Color(value='green'),
                   tooltip=[alt.Tooltip("tree_count:Q", title="Number of trees")])
               .transform_aggregate(
                   tree_count='count()', 
                   groupby=['street_side_name'])
                .add_selection(select_neighbourhood_genus))


year_barplot = (alt.Chart(tree_date_df)
               .transform_filter(select_neighbourhood_genus)
               .mark_bar()
               .encode(
                   alt.X('Year:N', title='Tree planted year'),
                   alt.Y('tree_count:Q', title='Number of trees'),
                   alt.Color(value='navy'),
                   tooltip=[alt.Tooltip("tree_count:Q", title="Number of trees")])
               .transform_aggregate(
                   tree_count='count()',
                   groupby=['Year'])
               .add_selection(select_neighbourhood_genus))

tree_detail_title = 'Tree details of different neighbourhood & genus'
tree_detail_plot = ((tree_diameter.properties(height=50) & tree_height & street_side_barplot.properties(height=100, width=200))| year_barplot.properties(width=500)).properties(title=alt.TitleParams(tree_detail_title, anchor='middle'))
tree_detail_plot

#### Figure 4 Tree details for different neighbourhood and genus

By selecting a neighbourhood and a genus, we can see the diameter and height range id of that genus in that neighbourhood, the number of trees on each street side, and how many were planted in each year.

### Question 3: What is the number of trees in each neighbourhood and of each genus?

I would like to find out the number of trees in each neighbourhood and of each genus by making the following bar plots.

In [13]:
categorical_columns=['neighbourhood_name','genus_name']

repeat_plot = (alt.Chart(tree_df)
               .mark_bar()
               .encode(
                   alt.X('count()', title='Number of trees'), 
                   alt.Y(alt.repeat(), type='nominal', title='',sort='-x'), 
                   alt.Color(value='navy'), 
                   tooltip=alt.Tooltip("count()", title="Number of trees"))
               .properties(width=200, height=800)
               .repeat(categorical_columns))

repeat_plot.properties(title=alt.TitleParams('Number of trees of different neighbourhood and different genus', anchor='middle'))

#### Figure 5 Number of trees of different neighbourhood and genus

The top 5 neighbourhoods by number of trees are **Kensington-Cedar Cottage, Renfrew-Collingwood, Hastings-Sunrise, Dunbar-Southlands, Sunset**. The top 5 tree genera planted are **ACER, PRUNUS, TILIA, FRAXINUS, QUERCUS**. How are these top 5 genera distributed across neighbourhoods? Let us analyze this in the next question.
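
The top-5 lists above were read from Figure 5. As a hedged alternative to typing the names by hand, they could be derived from the data and then used as a filter. The sketch below shows the idea on a made-up genus column (`toy_trees` and `top_n` are illustrative).

```python
import pandas as pd

# Toy genus column; values are illustrative only.
toy_trees = pd.DataFrame({
    "genus_name": ["ACER", "ACER", "ACER", "PRUNUS", "PRUNUS", "TILIA", "PINUS"],
})

top_n = 2  # use 5 for the analysis in this report

# Most frequent genera, in descending order of count.
top_genera = toy_trees["genus_name"].value_counts().head(top_n).index.tolist()

# Keep only the rows that belong to those genera.
filtered = toy_trees[toy_trees["genus_name"].isin(top_genera)]

print(top_genera, len(filtered))
```

Deriving the list this way keeps the filter in sync with the data if the subset ever changes.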

### Question 4: How are the top 5 tree genera distributed across neighbourhoods?

In [14]:
# Filter the data only including top 5 tree genus
filtered_genus_df=tree_df.query("genus_name == ['ACER', 'PRUNUS', 'TILIA', 'FRAXINUS', 'QUERCUS']")
filtered_genus_df

# Plot the bar chart and add tooltip for tree count
neighbourhood_genus_plot=(alt.Chart(filtered_genus_df)
                          .mark_bar()
                          .encode(
                              alt.X('count()', title='Number of trees'),
                              alt.Y('genus_name:N', title=''),
                              alt.Color('genus_name', scale=alt.Scale(scheme='set3')),
                              tooltip='count()').properties(width = 200).facet('neighbourhood_name', columns=5))

neighbourhood_genus_plot.properties(title=alt.TitleParams('Number of trees of Top 5 genus in different neighbourhood', anchor='middle'))

#### Figure 6 Number of trees of top 5 genus in different neighbourhood

From Figure 6, among the 5 genera, **ACER** and **PRUNUS** are the top 2 planted in almost all neighbourhoods (except **Downtown**). The number of trees of each of these 5 genera varies by neighbourhood.

Although Figure 6 gives a lot of information about the top 5 tree genera in each neighbourhood, it is not flexible. What if I want to know the top 10 genera? What if I am only interested in one selected neighbourhood, such as West End? In that case, there is no need to show data for the other neighbourhoods. Therefore, I have improved the plot by adding selection features as follows.

In [15]:
click_1 = alt.selection_single(fields=['neighbourhood_name'])

neighbourhood_barplot = (alt.Chart(tree_df)
                         .mark_bar().encode(
                             alt.X('neighbourhood_name:N', title='Neighbourhood name'),
                             alt.Y('count()',title='Number of Trees'),
                             alt.Color('neighbourhood_name:N', scale=alt.Scale(scheme='set3'), legend=None),
                             opacity=alt.condition(click_1, alt.value(1), alt.value(0.1)),
                         tooltip='count()')
                         .add_selection(click_1))

genus_barplot = (alt.Chart(tree_df)
               .transform_filter(click_1)
               .mark_bar()
               .encode(
                   alt.X('tree_count:Q', title='Number of Trees', scale=alt.Scale(domain=(0, 1300))),
                   alt.Y('genus_name:N', sort='-x', title='Genus name'),
                   alt.Color(value='purple'),
                   tooltip=[alt.Tooltip("tree_count:Q", title="Number of trees")])
               .transform_aggregate(
                   tree_count='count()',
                   groupby=['genus_name'])
               .transform_window(
                   rank='rank(tree_count)',
                   sort=[alt.SortField('tree_count', order='descending')])
               .transform_filter(alt.datum.rank <=10)
               .add_selection(click_1))

neighbourhood_genus_title = 'Number of trees of different genus for different neighbourhood'
neighbourhood_genus_plot= (neighbourhood_barplot.properties(height=200) | genus_barplot).properties(title = alt.TitleParams(neighbourhood_genus_title, anchor='middle'))

neighbourhood_genus_plot

#### Figure 7 Number of trees of top 10 genus for different neighbourhood

From Figure 7, we can find the top 10 genera (in descending order) of a selected neighbourhood by clicking on its bar in the bar chart. I have also added a tooltip that shows the number of trees of each genus. This is a more convenient and efficient way to get the required information. Figure 7 will be included in the final dashboard.
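
To cross-check what the interactive chart shows for a given neighbourhood, the same top-N table can be computed in pandas. The sketch below is a minimal, non-interactive counterpart on made-up data. The helper name `top_genera_in` is my own, and `toy_trees` is illustrative only.

```python
import pandas as pd

# Toy data for two neighbourhoods; values are illustrative only.
toy_trees = pd.DataFrame({
    "neighbourhood_name": ["West End"] * 6 + ["Sunset"] * 4,
    "genus_name": ["ACER", "ACER", "ACER", "PRUNUS", "PRUNUS", "TILIA",
                   "PRUNUS", "PRUNUS", "ACER", "PINUS"],
})


def top_genera_in(df: pd.DataFrame, neighbourhood: str, n: int = 10) -> pd.Series:
    """Return the n most common genera in one neighbourhood, largest first."""
    in_neighbourhood = df["neighbourhood_name"] == neighbourhood
    return df.loc[in_neighbourhood, "genus_name"].value_counts().head(n)


print(top_genera_in(toy_trees, "West End", n=2))
```

One difference to keep in mind: `head(n)` always returns at most `n` rows, whereas the rank-based filter in the chart keeps every genus tied at the cut-off, so the chart can show more than 10 bars.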

## Discussion

The main purpose of this analysis and visualization is to explore detailed information about Vancouver's trees. The analysis above revealed some interesting points that answer the questions posed at the beginning of this report.

First, the number of trees planted differs considerably across years and months (Figure 1). It increases steadily from 1989 to 1996. From 1996 to 2013 it fluctuates slightly, and then it drops sharply from 2013 to 2016. After that, it starts to increase slightly again. Most trees were planted in February, while the fewest were planted in July and August. It is recommended to plant trees in the rainy season to ensure the survival of saplings, especially in the first few months after planting (https://essc.org.ph/content/view/132/). This may be why the number of trees planted starts increasing in October, peaks in February, and slows down from May, which follows Vancouver's average precipitation trend (https://weather-and-climate.com/average-monthly-precipitation-Rainfall,vancouver,Canada).

Common sense suggests that trees with larger diameters are usually taller (a positive correlation). This is supported by our scatterplot and boxplot (Figure 3). However, we do observe a slight decrease in diameter from height range id 8 to 9. This could be due to the small number of data points. Another reason could be the way tree height is recorded (as a range id, not an actual height). To obtain a more conclusive result, it would help to collect more data and to record actual tree heights (for example, in feet) rather than binned ranges.

From Figures 5 and 6, we can see that Vancouver's neighbourhoods differ in both the number of trees and the genera planted. The top 5 neighbourhoods by number of trees are **Kensington-Cedar Cottage, Renfrew-Collingwood, Hastings-Sunrise, Dunbar-Southlands, Sunset**. The top 5 tree genera planted are **ACER, PRUNUS, TILIA, FRAXINUS, QUERCUS**. By clicking on the interactive plot (Figure 7), we can look in more detail at the top 10 genera of each neighbourhood.

For the final interactive dashboard, the goal is to build a tool for the government or for people who want to understand the distribution and details of trees in their community or neighbourhood. The dashboard presents detailed information about trees, such as genus, diameter and height, planting year, and street side, in a more convenient and efficient way.

This data visualization helped me answer all my questions, and the results met my expectations. In future work, I would like to dig deeper into the tree species and common names in different neighbourhoods. In addition, the current dataset only includes tree counts, not the population of each neighbourhood. Neighbourhoods with larger populations might plant more trees. If the dataset included neighbourhood populations, we could compute the number of trees per person, which would give a better idea of how well each community has done in planting trees.
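
The trees-per-person idea could be sketched as follows, assuming a population figure were available for each neighbourhood. Everything in this example is hypothetical: the neighbourhood labels, tree counts and population numbers are placeholders, since the dataset used in this report contains no population data.

```python
import pandas as pd

# Hypothetical tree counts and populations; placeholders, not real figures.
tree_counts = pd.Series(
    {"Neighbourhood A": 300, "Neighbourhood B": 200}, name="tree_count"
)
population = pd.Series(
    {"Neighbourhood A": 30000, "Neighbourhood B": 10000}, name="population"
)

# Align the two series on neighbourhood name and compute trees per 1000 residents.
per_capita = pd.concat([tree_counts, population], axis=1)
per_capita["trees_per_1000_people"] = (
    per_capita["tree_count"] / per_capita["population"] * 1000
)

print(per_capita)
```

In this toy example the neighbourhood with fewer trees has more trees per resident, which is exactly the kind of reversal that raw counts would hide.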

## Interactive Dashboard

Now, I am ready to make the final dashboard. For the dashboard, I combine Figure 4 and Figure 7 so that it shows all the tree details for different Vancouver neighbourhoods. The dashboard is coded as follows and shown in Figure 9.

In [17]:
dashboard_title =alt.TitleParams(
    'Vancouver trees dataset analysis and visualization', 
    subtitle = ['What is the tree details for different neighbourhood?'],
    fontSize=30, subtitleFontSize = 20, align ='center', anchor='middle')

dashboard_plot = (neighbourhood_genus_plot & tree_detail_plot).properties(title=dashboard_title)
dashboard_plot

#### Figure 9 Dashboard of Vancouver Tree Dataset analysis and visualization

## References
* https://vancouver.ca/news-calendar/our-city.aspx
* https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name
* https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv
* https://opendata.vancouver.ca/pages/licence/
* https://essc.org.ph/content/view/132/
* https://weather-and-climate.com/average-monthly-precipitation-Rainfall,vancouver,Canada