<div class="usecase-title">UC00178_Urban_Forest_Analysis</div>

<div class="usecase-authors"><b>Authored by: </b> Peregrin J Ryan</div>

<div class="usecase-duration"><b>Duration:</b> 30 mins</div>

<div class="usecase-level-skill">
    <div class="usecase-level"><b>Level: </b>Beginner</div>
    <div class="usecase-skill"><b>Pre-requisite Skills: </b>Python</div>
</div>

<div class="usecase-section-header">Scenario</div>

The City of Melbourne is constantly changing as it adapts to a growing population. As part of that change, the city should look at how it has changed so far: what needs have to be met, which past changes had positive impacts, and how changing one aspect of the city's infrastructure affects others. We also now better understand the benefits of green spaces within our lived environment. Cities need to accommodate trees for cooling, clean air and their positive effect on mental health. The Urban Forest project was created to record current tree locations, and over the years it has collected data on how Melbourne's greenery has changed. In this use case we will investigate how the green spaces in Melbourne have evolved.

<div class="usecase-section-header">What this use case will teach you</div>

At the end of this use case you will know how to:
- Load the required packages in Python
- Collect data from Melbourne Open Data (MOP) using an API v2.1 GET request
- Prepare the data for analysis
- Use geographic mapping data in Python
- Compare and contrast how Melbourne's urban forest has evolved

<div class="usecase-section-header">Introduction and background relating to problem:</div>

This use case uses the Melbourne Urban Forest datasets from the MOP, which record where tree canopies are located within the Melbourne CBD area. The datasets cover different years dating back to 2008. We are going to walk through the steps of loading in each year and comparing the years to one another, to build a better picture of how Melbourne's green areas have changed, if at all. We will look at whether the number of trees has increased or decreased and investigate how long some trees have been there. From that information we can evaluate whether more effort is needed to improve Melbourne's green spaces or to protect the ones that are already there.

## Datasets used:
Dataset 1: https://data.melbourne.vic.gov.au/explore/dataset/tree-canopies-2008-urban-forest/information/

Dataset 2: https://data.melbourne.vic.gov.au/explore/dataset/tree-canopies-2011-urban-forest/information/

Dataset 3: https://data.melbourne.vic.gov.au/explore/dataset/tree-canopies-2013/information/

Dataset 4: https://data.melbourne.vic.gov.au/explore/dataset/tree-canopies-2014/information/

Dataset 5: https://data.melbourne.vic.gov.au/explore/dataset/tree-canopies-2015-urban-forest/information/

Dataset 6: https://data.melbourne.vic.gov.au/explore/dataset/tree-canopies-public-realm-2018-urban-forest/information/

Dataset 7: https://data.melbourne.vic.gov.au/explore/dataset/tree-canopies-2018-entire-municipal-area-urban-forest/information/

Dataset 8: https://data.melbourne.vic.gov.au/explore/dataset/tree-canopies-2019/information/

Dataset 9: https://data.melbourne.vic.gov.au/explore/dataset/tree-canopies-2021-urban-forest/information/


## Step 1: Import packages
This use case requires a number of packages. They are simple to import, but if any of them are not installed, you can install them with the command `! pip install`, which ensures the packages are available in your instance of Python.

One module cannot be installed this way, however: `API_store` is a file (`API_store.py`) that you need to create locally in the same folder as this use case. It contains your MOP API key, stored as a string (`API = "<Insert your key here>"`). Once the file is in the same folder as this one, it should import without any issues.
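
As a minimal sketch, the contents of `API_store.py` could look like the following. The key value is a placeholder and not a real key; the only thing this use case relies on is that the module defines a variable named `API`.

```python
# API_store.py
# Hypothetical contents: replace the placeholder with your own MOP API key.
# Keep this file out of version control so the key is not shared by accident.
API = "your-api-key-here"
```

Because the key is an ordinary Python string, it must be wrapped in quotes; an unquoted key would raise a `NameError` or `SyntaxError` on import.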

In [1]:
# Uncomment these to install the packages needed:
# ! pip install requests
# ! pip install pandas

In [2]:
# Imports needed to request and collect data from API
import requests
import pandas as pd
from io import StringIO
# Create this as a local file to store your API key
import API_store
# This helps us just avoid pop ups that we don't need
import warnings
warnings.filterwarnings('ignore')

## Step 2: Import Data

This is where we define a function called `collect_data`, which uses our API key to request each dataset as a .csv export. When we read in the .csv files we convert them into pandas data frames, which lets us work with the data, visualise it and take what we need.

In [3]:
# This is the function to collect the data from the API
def collect_data(dataset_id):
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    apikey = API_store.API
    export_format = 'csv'  # named to avoid shadowing the built-in `format`

    url = f'{base_url}{dataset_id}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC',
        'api_key': apikey
    }

    # GET request
    response = requests.get(url, params=params)

    if response.status_code == 200:
        # StringIO to read the CSV data
        url_content = response.content.decode('utf-8')
        dataset = pd.read_csv(StringIO(url_content), delimiter=';')
        return dataset
    else:
        # Report the failure and return None explicitly so callers can check for it
        print(f'Request failed with status code {response.status_code}')
        return None

Here is where we use the function we just defined to collect the data we need. We will go through each of our datasets and make a data frame for each of them.

In [4]:
dataset_ids = ['tree-canopies-2008-urban-forest','tree-canopies-2011-urban-forest','tree-canopies-2013',
               'tree-canopies-2014', 'tree-canopies-2015-urban-forest', 'tree-canopies-public-realm-2018-urban-forest',
               'tree-canopies-2018-entire-municipal-area-urban-forest', 'tree-canopies-2019', 'tree-canopies-2021-urban-forest']

tree_canopy_2008 = collect_data(dataset_ids[0]) 
tree_canopy_2011 = collect_data(dataset_ids[1])
tree_canopy_2013 = collect_data(dataset_ids[2])
tree_canopy_2014 = collect_data(dataset_ids[3])
tree_canopy_2015 = collect_data(dataset_ids[4])
tree_canopy_2018_pr = collect_data(dataset_ids[5])
tree_canopy_2018_ma = collect_data(dataset_ids[6])
tree_canopy_2019 = collect_data(dataset_ids[7])
tree_canopy_2021 = collect_data(dataset_ids[8])
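
The nine separate calls above can also be written as a loop. The following is only a sketch: `collect_all` and the labels in `ids_by_label` are hypothetical names introduced for illustration, and the download function is passed in as an argument so that `collect_data` (or any stand-in that returns a data frame, or `None` on failure) can be used.

```python
import pandas as pd

def collect_all(ids_by_label, fetch):
    """Fetch every dataset and return a dict of label -> DataFrame.

    `fetch` is any function that takes a dataset id and returns a DataFrame,
    or None on failure (for example `collect_data` from this use case).
    """
    frames = {}
    for label, dataset_id in ids_by_label.items():
        frame = fetch(dataset_id)
        if frame is None:
            # Stop early so a failed download is noticed straight away
            raise RuntimeError(f"Could not load dataset '{dataset_id}'")
        frames[label] = frame
    return frames

# Example labels for the first two datasets; the others follow the same pattern
ids_by_label = {
    "2008": "tree-canopies-2008-urban-forest",
    "2011": "tree-canopies-2011-urban-forest",
}
# frames = collect_all(ids_by_label, collect_data)
```

Keeping the data frames in a dictionary makes later checks (shape, nulls, duplicates) a single loop rather than one line per year, at the cost of writing `frames["2008"]` instead of `tree_canopy_2008`.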

Once the downloads have finished, we need to verify that the datasets actually loaded and get an understanding of what each one contains. We can do that by calling `.head(3)` on each dataset, which displays its first three rows and confirms that it is not empty. We can also check `.shape` to see the number of rows and columns in each dataset, which will be helpful in our analysis.

In [5]:
tree_canopy_2008.head(3)

Unnamed: 0,geo_point_2d,geo_shape,updated_by,shape_area,data_linea,shape_leng,updated_da
0,"-37.79852634957519, 144.94570229091082","{""coordinates"": [[[[144.94569518074886, -37.79...",Grace Detailed-GIS Services,9.99079,Tree canopy mapped using 2008 aerial photos an...,16.047832,3 Dec 2011
1,"-37.79809504416373, 144.92062736716736","{""coordinates"": [[[[144.92063888842435, -37.79...",Grace Detailed-GIS Services,66.159928,Tree canopy mapped using 2008 aerial photos an...,31.480883,3 Dec 2011
2,"-37.79841797552432, 144.94065289498525","{""coordinates"": [[[[144.94065604535228, -37.79...",Grace Detailed-GIS Services,10.427187,Tree canopy mapped using 2008 aerial photos an...,11.461481,3 Dec 2011


In [6]:
tree_canopy_2008.shape

(58790, 7)

In [7]:
tree_canopy_2011.head(3)

Unnamed: 0,geo_point_2d,geo_shape,tree_area,street_fro,data_lin_1,yearplant,street_nam,updated_by,shape_area,park_stree,...,street_to,family,ggis_id,overhead_c,date_plant,t11,canopy_dia,height_11,descriptio,roadseg_de
0,"-37.7971025448064, 144.92805294520574","{""coordinates"": [[[[144.92808097355524, -37.79...",85.552695,,Tree Inventory: Existing data (fields from GIS...,0,,Grace Detailed-GIS Services. info@gracegis.com.au,85.552695,,...,,,91186,,,11,0.0,9.0,,
1,"-37.79677607198214, 144.92647587608803","{""coordinates"": [[[[144.92650356294064, -37.79...",30.053766,,Tree Inventory: Existing data (fields from GIS...,0,,Grace Detailed-GIS Services. info@gracegis.com.au,30.053766,,...,,,91190,,,11,0.0,7.0,,
2,"-37.79615547394656, 144.93082811899046","{""coordinates"": [[[[144.93088560280367, -37.79...",63.268135,Wolseley Parade,Tree Inventory: Existing data (fields from GIS...,1997,Bellair Street,Grace Detailed-GIS Services. info@gracegis.com.au,63.268135,Street,...,Ormond Street,Platanaceae,91192,Powerlines - High Voltage,1997-07-10T07:00:00.000Z,11,18.0,11.0,Tree - Platanus x acerifolia,Bellair Street between Wolseley Parade and Orm...


In [8]:
tree_canopy_2011.shape

(94699, 51)

In [9]:
tree_canopy_2013.head(3)

Unnamed: 0,geo_point_2d,geo_shape,height_yes,shape_area,objectid,shape_leng,updated_da,z_mean,tree_area,ggis_id
0,"-37.82369619166995, 144.9763990227848","{""coordinates"": [[[[144.97640095130544, -37.82...",1,3.886529,12818,0,,2.462622,0,12818
1,"-37.822976569861765, 144.93703319984712","{""coordinates"": [[[[144.93702970158145, -37.82...",1,22.901252,12873,0,,7.649852,0,12873
2,"-37.82302364997733, 144.94109760355792","{""coordinates"": [[[[144.94109968112505, -37.82...",1,4.529607,12885,0,,1.608364,0,12885


In [10]:
tree_canopy_2013.shape

(99194, 10)

In [11]:
tree_canopy_2014.head(3)

Unnamed: 0,geo_point_2d,geo_shape,shape_area,t,objectid,shape_leng,ggis_id,height_min
0,"-37.80701864340057, 144.97261514642457","{""coordinates"": [[[[144.97262202661392, -37.80...",2.417016,2014,29804,0,2173,1.782669
1,"-37.806174766348335, 144.96921747656356","{""coordinates"": [[[[144.9692316752021, -37.806...",6.281363,2014,29845,0,2310,3.082909
2,"-37.77611151375127, 144.9413844567119","{""coordinates"": [[[[144.94141107471987, -37.77...",23.420418,2014,18,0,83,3.01857


In [12]:
tree_canopy_2014.shape

(64877, 8)

In [13]:
tree_canopy_2015.head(3)

Unnamed: 0,geo_point_2d,geo_shape,qa_id_1,shape_area,area_2015,change_cod,objectid,shape_leng
0,"-37.792768219526344, 144.93379943223366","{""coordinates"": [[[[144.9338068318798, -37.792...",0,20.840909,20.840909,0,30340,0
1,"-37.79335390234411, 144.96628652593543","{""coordinates"": [[[[144.96628667082896, -37.79...",0,55.567579,55.567579,0,30348,0
2,"-37.79337278014516, 144.96252125050952","{""coordinates"": [[[[144.9625445009508, -37.793...",0,389.006759,389.006759,0,30359,0


In [14]:
tree_canopy_2015.shape

(60712, 8)

In [15]:
tree_canopy_2018_pr.head(3)

Unnamed: 0,geo_point_2d,geo_shape,objectid,shape_leng,shape_area
0,"-37.787924849178765, 144.95013560084595","{""coordinates"": [[[[144.95014678527318, -37.78...",27905,22.910803,35.618151
1,"-37.78758587991684, 144.94758519515517","{""coordinates"": [[[[144.94759778226256, -37.78...",28004,11.677391,9.861962
2,"-37.78766984132651, 144.950294465972","{""coordinates"": [[[[144.9503028706184, -37.787...",27988,4.824124,1.363877


In [16]:
tree_canopy_2018_pr.shape

(32787, 5)

In [17]:
tree_canopy_2018_ma.head(3)

Unnamed: 0,geo_point_2d,geo_shape,objectid,shape_leng,shape_area
0,"-37.79985604102316, 144.94863020815643","{""coordinates"": [[[[144.94869965802565, -37.79...",29435,114.357712,248.017254
1,"-37.80012294659624, 144.96904230676208","{""coordinates"": [[[[144.9690200627957, -37.800...",29440,22.209443,30.098817
2,"-37.80011811346107, 144.97185932314238","{""coordinates"": [[[[144.9718664539436, -37.800...",29471,4.979338,1.65813


In [18]:
tree_canopy_2018_ma.shape

(54680, 5)

In [19]:
tree_canopy_2019.head(3)

Unnamed: 0,geo_point_2d,geo_shape,id
0,"-37.7928966058365, 144.96013431840203","{""coordinates"": [[[144.9601369448, -37.7928977...",26067
1,"-37.79221015605665, 144.92248035841538","{""coordinates"": [[[144.9224614652, -37.7922391...",26065
2,"-37.79305183612952, 144.97106636274984","{""coordinates"": [[[144.9710715016, -37.7930550...",25977


In [20]:
tree_canopy_2019.shape

(114784, 3)

In [21]:
tree_canopy_2021.head(3)

Unnamed: 0,geo_point_2d,geo_shape
0,"-37.77506304683423, 144.93898465421296","{""coordinates"": [[[[144.9389624164712, -37.775..."
1,"-37.775132956993566, 144.93979253397976","{""coordinates"": [[[[144.93978541786475, -37.77..."
2,"-37.775360479960504, 144.94145114868167","{""coordinates"": [[[[144.941452857118, -37.7753..."


In [22]:
tree_canopy_2021.shape

(57980, 2)

From this we have confirmed a few important things that will affect how we work with this data. First, the datasets differ from one another. Some are much larger than others: the 2019 dataset has roughly 115,000 entries, whereas the 2008 dataset has only about 59,000. We will need to figure out what area each one covers and to what extent the datasets are compatible with one another.
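
To compare the datasets side by side rather than cell by cell, a small summary table can help. This is a sketch under assumptions: `summarise_frames` is a hypothetical helper, and the `demo` frames are made-up stand-ins for the real data frames loaded above.

```python
import pandas as pd

def summarise_frames(frames):
    """Build one row per dataset with its size and whether it has an id column."""
    rows = []
    for label, frame in frames.items():
        rows.append({
            "dataset": label,
            "rows": frame.shape[0],
            "columns": frame.shape[1],
            "has_objectid": "objectid" in frame.columns,
        })
    return pd.DataFrame(rows).set_index("dataset")

# Made-up stand-ins; in the notebook these would be the real data frames,
# e.g. {"2008": tree_canopy_2008, "2019": tree_canopy_2019, ...}
demo = {
    "a": pd.DataFrame({"geo_point_2d": ["-37.8, 144.9"], "objectid": [1]}),
    "b": pd.DataFrame({"geo_point_2d": ["-37.8, 144.9", "-37.7, 144.9"]}),
}
summary = summarise_frames(demo)
print(summary)
```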


The other thing is that some of the earlier datasets have more information than the later ones. Since we are comparing all datasets and their progression, we will need to make the data more compatible, so we will only look at a few columns. We will keep `geo_point_2d`, `geo_shape` and the tree id (`objectid`, or `id` in the 2019 data). Keep in mind that the 2008 and 2021 datasets have no id column, so we will see if ids can be backfilled based on the `geo_point_2d` data we have. We will also use the 2011 `yearplant` data for trees that have an id. That way we can see how old some of the trees are and whether they are still around in later datasets. This requires a few assumptions about the data. First, we assume the id is consistent across all the datasets and does not change from year to year. Second, to measure how long a tree has been in a location, we rely on the `geo_point_2d` values being the same across years or, if not identical, at least very close to one another.
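
Comparing locations between years is easier once `geo_point_2d` is split into numeric columns. The sketch below assumes the column holds strings of the form `"lat, lon"`, as in the outputs above. The helper name `split_geo_point` and the choice of six decimal places (roughly a tenth of a metre in latitude) are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def split_geo_point(frame, column="geo_point_2d", decimals=6):
    """Return a copy with numeric `lat` and `lon` columns parsed from "lat, lon"."""
    parts = frame[column].str.split(",", expand=True)
    out = frame.copy()
    # Rounding lets points that differ only by tiny amounts compare as equal
    out["lat"] = pd.to_numeric(parts[0].str.strip()).round(decimals)
    out["lon"] = pd.to_numeric(parts[1].str.strip()).round(decimals)
    return out

demo = pd.DataFrame({"geo_point_2d": ["-37.79852634957519, 144.94570229091082"]})
parsed = split_geo_point(demo)
print(parsed[["lat", "lon"]])
```

The rounding tolerance is a judgment call: too few decimals will merge neighbouring trees, while too many will treat the same tree as two different points if a later survey shifted its centroid slightly.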


With these challenges established, let's begin by cutting our datasets down to just the columns we want to work with, and then look for any missing data.

## Step 3: Data Wrangling
Data wrangling is one of the most important steps in data analysis. Our data was collected over a long period and the survey format changed between years, so we need to standardise it in order to get a clear picture when we analyse it later on.

### 3.1 Organise Columns
The first step, as discussed, is to standardise the columns. For each year's dataset we keep the `geo_point_2d` and `geo_shape` columns and, where applicable, the `id` (or equivalent). Let's also use the `yearplant` column from the 2011 data to make a new data frame that stores the year each tree was planted, its `objectid` and its `geo_point_2d`.

In [None]:
# Replace original data frames with new ones that just have what we need.
# tree_age is built first, while tree_canopy_2011 still has `yearplant`.
tree_age = tree_canopy_2011[['geo_point_2d','objectid', 'yearplant']]
tree_canopy_2008 = tree_canopy_2008[['geo_point_2d', 'geo_shape']]
tree_canopy_2011 = tree_canopy_2011[['geo_point_2d', 'geo_shape','objectid']]
# Each year is subset from its own data frame (not from the 2011 data)
tree_canopy_2013 = tree_canopy_2013[['geo_point_2d', 'geo_shape','objectid']]
tree_canopy_2014 = tree_canopy_2014[['geo_point_2d', 'geo_shape','objectid']]
tree_canopy_2015 = tree_canopy_2015[['geo_point_2d', 'geo_shape','objectid']]
tree_canopy_2018_pr = tree_canopy_2018_pr[['geo_point_2d', 'geo_shape','objectid']]
tree_canopy_2018_ma = tree_canopy_2018_ma[['geo_point_2d', 'geo_shape','objectid']]
tree_canopy_2019 = tree_canopy_2019[['geo_point_2d', 'geo_shape','id']]
tree_canopy_2021 = tree_canopy_2021[['geo_point_2d', 'geo_shape']]

# Rename 2019's `id` to `objectid` to match the other data.
# rename() returns a new data frame, so the result is assigned back:
tree_canopy_2019 = tree_canopy_2019.rename(columns={'id': 'objectid'})

### 3.2 Check for missing data
Now that our data frames contain only the columns we want, we need to see whether any data is missing from them. We can do this using the pandas `isnull()` function. Combined with `.values.any()`, it gives us a single boolean per data frame: `True` if there are any missing values. We can then investigate further wherever it returns `True`.

In [24]:
print("2008 any null values:    ",tree_canopy_2008.isnull().values.any())
print("2011 any null values:    ",tree_canopy_2011.isnull().values.any())
print("2013 any null values:    ",tree_canopy_2013.isnull().values.any())
print("2014 any null values:    ",tree_canopy_2014.isnull().values.any())
print("2015 any null values:    ",tree_canopy_2015.isnull().values.any())
print("2018_pr any null values: ",tree_canopy_2018_pr.isnull().values.any())
print("2018_ma any null values: ",tree_canopy_2018_ma.isnull().values.any())
print("2019 any null values:    ",tree_canopy_2019.isnull().values.any())
print("2021 any null values:    ",tree_canopy_2021.isnull().values.any())
print("TreeAge any null values: ",tree_age.isnull().values.any())

2008 any null values:     False
2011 any null values:     False
2013 any null values:     False
2014 any null values:     False
2015 any null values:     False
2018_pr any null values:  True
2018_ma any null values:  False
2019 any null values:     False
2021 any null values:     False
TreeAge any null values:  False


From this we know that most of our datasets have no null values, which is great. However, the 2018 public realm data is missing some values, so we can investigate further.

In [None]:
# Print a formatted readout to see what values are missing from each column, then give a percentage of missing data
print(f"Missing data:\n{tree_canopy_2018_pr.isnull().sum()}\n------",
      f"\nTotal missing values: {len(tree_canopy_2018_pr[tree_canopy_2018_pr.isnull().any(axis=1)])}\nTotal values: {len(tree_canopy_2018_pr)}",
      f"\nPercent of missing values: {round((len(tree_canopy_2018_pr[tree_canopy_2018_pr.isnull().any(axis=1)])/len(tree_canopy_2018_pr))*100,2)}%")

Missing data:
geo_point_2d    0
geo_shape       2
objectid        0
dtype: int64
------ 
Total missing values: 2
Total values: 32787 
Percent of missing values: 0.01%


So we are missing only two `geo_shape` values. That doesn't cause any issues for us, since we would only use that column for drawing our trees on the map and it isn't needed for our analysis. So we don't have to remove those rows.
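
If the shapes are drawn on a map later, the rows without a `geo_shape` can be left out at that point only, keeping the main data frame intact. A minimal sketch, where `rows_for_mapping` is a hypothetical helper and `demo` is made-up data:

```python
import pandas as pd

def rows_for_mapping(frame, shape_col="geo_shape"):
    """Return a copy containing only the rows that have a shape to draw."""
    return frame.dropna(subset=[shape_col]).copy()

demo = pd.DataFrame({
    "geo_point_2d": ["-37.80, 144.90", "-37.70, 144.90"],
    "geo_shape": ['{"coordinates": []}', None],
    "objectid": [1, 2],
})
mappable = rows_for_mapping(demo)
print(len(demo), len(mappable))
```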

### 3.3 Look for duplicate data
Now we want to quickly check whether there are any duplicate rows in our datasets, since double-ups in the data can cause issues when visualising. We can use the `duplicated()` method built into pandas to check this. Combined with `.any()`, it returns a `bool` value that is `True` if there are any duplicates, just like before.

In [None]:
# We simply print the returned boolean values in a human readable way
print("2008 any duplicates:    ",tree_canopy_2008.duplicated().any())
print("2011 any duplicates:    ",tree_canopy_2011.duplicated().any())
print("2013 any duplicates:    ",tree_canopy_2013.duplicated().any())
print("2014 any duplicates:    ",tree_canopy_2014.duplicated().any())
print("2015 any duplicates:    ",tree_canopy_2015.duplicated().any())
print("2018_pr any duplicates: ",tree_canopy_2018_pr.duplicated().any())
print("2018_ma any duplicates: ",tree_canopy_2018_ma.duplicated().any())
print("2019 any duplicates:    ",tree_canopy_2019.duplicated().any())
print("2021 any duplicates:    ",tree_canopy_2021.duplicated().any())
print("TreeAge any duplicates: ",tree_age.duplicated().any())

2008 any duplicates:     False
2011 any duplicates:     False
2013 any duplicates:     False
2014 any duplicates:     False
2015 any duplicates:     False
2018_pr any duplicates:  False
2018_ma any duplicates:  False
2019 any duplicates:     False
2021 any duplicates:     False
TreeAge any duplicates:  False
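
Note that `duplicated()` with no arguments only flags rows where every column matches. Two rows at the same location but with different ids would not be caught. As a sketch of a stricter check, the hypothetical helper below looks for repeated `geo_point_2d` values only; the `demo` data is made up for illustration.

```python
import pandas as pd

def duplicate_locations(frame, key="geo_point_2d"):
    """Return every row whose location appears more than once."""
    mask = frame.duplicated(subset=[key], keep=False)
    return frame[mask].sort_values(key)

demo = pd.DataFrame({
    "geo_point_2d": ["-37.80, 144.90", "-37.80, 144.90", "-37.70, 144.90"],
    "objectid": [1, 2, 3],
})
print(demo.duplicated().any())          # full-row check
print(len(duplicate_locations(demo)))   # location-only check
```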


### 3.4 Match ids to trees
Here we want to look at the `objectid` values and see if we can match them to the tree locations. This will do a few things for us. First, it will confirm whether the id system has stayed the same throughout the years. It will also tell us if any trees have been lost, and it will let us backfill ids for trees in the 2021 dataset if needed. Which ids match, and in which years, will affect our approach going forward.
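
This step is not implemented yet, so the following is only one possible approach and not the method used by this use case. It assumes that ids can be compared as plain sets between two years, and that ids can be backfilled by an exact match on `geo_point_2d` into a data frame that has no id column of its own. The helper names `compare_ids` and `backfill_ids` and the `demo` data are hypothetical.

```python
import pandas as pd

def compare_ids(earlier, later, id_col="objectid"):
    """Count ids shared between two years and ids unique to each year."""
    earlier_ids = set(earlier[id_col])
    later_ids = set(later[id_col])
    return {
        "in_both": len(earlier_ids & later_ids),
        "only_earlier": len(earlier_ids - later_ids),
        "only_later": len(later_ids - earlier_ids),
    }

def backfill_ids(target, reference, key="geo_point_2d", id_col="objectid"):
    """Attach ids from `reference` to `target` rows with an identical location.

    Assumes `target` has no `id_col` column. Unmatched rows get a missing id.
    """
    lookup = reference[[key, id_col]].drop_duplicates(subset=[key])
    return target.merge(lookup, on=key, how="left")

demo_2019 = pd.DataFrame({
    "geo_point_2d": ["-37.80, 144.90", "-37.70, 144.90"],
    "objectid": [10, 11],
})
demo_2018 = pd.DataFrame({"geo_point_2d": ["-37.80, 144.90"], "objectid": [10]})
demo_2021 = pd.DataFrame({"geo_point_2d": ["-37.80, 144.90", "-37.60, 144.90"]})

counts = compare_ids(demo_2018, demo_2019)
filled = backfill_ids(demo_2021, demo_2019)
print(counts)
print(filled)
```

An exact string match on `geo_point_2d` is the strictest possible test. If few rows match in practice, matching on rounded coordinates or on nearest neighbours would be the natural next thing to try.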

## TO DO
___
### Data cleaning
- Match ids to trees
   - Match ids to tree age and update ages

### Visualisation for data cleaning
- visualise each tree on the map
- find similar coverage area
   - Find cut off for non applicable areas
### Data Analysis
- Visualise progression of trees
   - visualise all trees progressively
   - graph the numbers of trees
- Tree age analysis
   - Find oldest trees
   - are older trees in similar area
       - KNN on predicting where older trees are?
- Provide analysis on Urban forest
   - trends in older and newer trees
   - oldest trees and maintaining them
   - Need for more green space?
- Conclusion
