# Random Forest Land Cover Classification

This notebook is based on the Google Earth Engine (GEE) [Supervised Classification guide](https://developers.google.com/earth-engine/guides/classification).

Random forests (RF) are a machine learning technique that uses an ensemble approach to predict outcomes based on different sets of training data. An RF is composed of individual decision trees (Figure 1), each trained on a subset of the data. Each decision tree divides the data using a subset of all the variables: the first split (or decision) uses the variable that divides the training data into two groups that are as different as possible (Figure 1). Splitting continues until no further subsetting is possible or until the maximum tree height, which the user can set, is reached. The user also selects the number of decision trees in the RF, and each tree gets its own set of training data and its own subset of the variables.

Once all the trees have been generated, the ensemble part of the approach kicks in. Each tree makes its own class prediction for a given sample. The RF then finalizes the classification by taking the majority class among all the trees, essentially democratizing the classification amongst relatively uncorrelated decision trees. As with Monte Carlo simulations and the Law of Large Numbers, using more runs or samples moves the results toward higher accuracy or the expected value.
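
The voting step can be sketched in plain Python. This is only an illustration, not Earth Engine's implementation: `majority_vote` is a hypothetical helper, and the per-tree predictions are made-up class values.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees for one sample."""
    counts = Counter(tree_predictions)
    # most_common(1) returns [(label, count)] for the most frequent label.
    return counts.most_common(1)[0][0]

# Hypothetical predictions from 5 trees for 3 pixels (one row per pixel).
per_pixel_predictions = [
    [4, 4, 5, 4, 13],
    [13, 13, 13, 12, 13],
    [17, 17, 11, 11, 17],
]
forest_prediction = [majority_vote(p) for p in per_pixel_predictions]
print(forest_prediction)  # [4, 13, 17]
```

Using an odd number of trees is one simple way to reduce the chance of tied votes.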

<img src="imgs/DT.png" width="450"> 

**Figure 1.** Example of a decision tree from ["Understanding Random Forest"](https://towardsdatascience.com/understanding-random-forest-58381e0602d2) by Tony Yiu.
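
To make "two groups that are as different as possible" concrete, the sketch below scores candidate splits with Gini impurity. Gini impurity is one common splitting criterion and is an assumption here, since the text above does not name the criterion. The function names and the reflectance values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (score, feature index, threshold) with the lowest weighted impurity."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must produce two non-empty groups
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

# Made-up samples: [red reflectance, near-infrared reflectance]
rows = [[0.05, 0.40], [0.06, 0.45], [0.04, 0.02], [0.05, 0.03]]
labels = ["forest", "forest", "water", "water"]
score, feature, threshold = best_split(rows, labels)
print(feature, threshold, score)  # 1 0.03 0.0
```

In this toy data the near-infrared value (feature 1) separates the two classes perfectly, so it is chosen for the first split.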

In [1]:
# import modules
import ee
import folium
from branca.element import Figure
import pandas as pd

# Initialize the Google Earth Engine module.
# Replace the project ID below with your own Google Cloud project ID.
ee.Initialize(project='ctk-ncsu-gis712-fall2024')

In [2]:
# from https://spatial.utk.edu/maps/ee-api-folium-setup.html
# Define a method for displaying Earth Engine image tiles on a folium map.
def add_ee_layer(self, ee_object, vis_params, name):
    
    try:    
        # display ee.Image()
        if isinstance(ee_object, ee.image.Image):    
            map_id_dict = ee.Image(ee_object).getMapId(vis_params)
            folium.raster_layers.TileLayer(
            tiles = map_id_dict['tile_fetcher'].url_format,
            attr = 'Google Earth Engine',
            name = name,
            overlay = True,
            control = True
            ).add_to(self)
        # display ee.ImageCollection()
        elif isinstance(ee_object, ee.imagecollection.ImageCollection):    
            ee_object_new = ee_object.mosaic()
            map_id_dict = ee.Image(ee_object_new).getMapId(vis_params)
            folium.raster_layers.TileLayer(
            tiles = map_id_dict['tile_fetcher'].url_format,
            attr = 'Google Earth Engine',
            name = name,
            overlay = True,
            control = True
            ).add_to(self)
        # display ee.Geometry()
        elif isinstance(ee_object, ee.geometry.Geometry):    
            folium.GeoJson(
            data = ee_object.getInfo(),
            name = name,
            overlay = True,
            control = True
        ).add_to(self)
        # display ee.FeatureCollection()
        elif isinstance(ee_object, ee.featurecollection.FeatureCollection):  
            ee_object_new = ee.Image().paint(ee_object, 0, 2)
            map_id_dict = ee.Image(ee_object_new).getMapId(vis_params)
            folium.raster_layers.TileLayer(
            tiles = map_id_dict['tile_fetcher'].url_format,
            attr = 'Google Earth Engine',
            name = name,
            overlay = True,
            control = True
        ).add_to(self)
    
    except Exception as e:
        # Report the underlying error instead of hiding it.
        print("Could not display {}: {}".format(name, e))
    
# Add EE drawing method to folium.
folium.Map.add_ee_layer = add_ee_layer

## Collect training data

From the GEE guide:

"The training data is a `FeatureCollection` with a property storing the class label and properties storing predictor variables. Class labels should be consecutive, integers starting from 0. If necessary, use `remap()` to convert class values to consecutive integers. The predictors should be numeric.

Training and/or validation data can come from a variety of sources. To collect training data interactively in Earth Engine, you can use the geometry drawing tools (see the [geometry tools section of the Code Editor page](https://developers.google.com/earth-engine/guides/playground#geometry-tools)). Alternatively, you can import predefined training data from an Earth Engine table asset (see the [Importing Table Data page](https://developers.google.com/earth-engine/guides/table_upload) for details). Get a classifier from one of the constructors in `ee.Classifier`. Train the classifier using `classifier.train()`. Classify an `Image` or `FeatureCollection` using `classify()`. The following example uses a Classification and Regression Trees (CART) classifier (Breiman et al. 1984) to predict three simple classes:"
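
As a small illustration of the "consecutive integers starting from 0" requirement, the sketch below builds the pair of lists that a remapping step needs. It assumes class values running from 1 to 17, the range used by the MODIS palette later in this notebook. The variable names are hypothetical, and the lists are only printed here, not passed to Earth Engine.

```python
# Original class values, assumed to run from 1 to 17.
original_values = list(range(1, 18))
# Consecutive integers starting from 0, one per original value.
remapped_values = list(range(len(original_values)))

# A lookup table showing which original value maps to which new label.
lookup = dict(zip(original_values, remapped_values))
print(lookup[1], lookup[17])  # 0 16
```

Two lists of this shape (values to replace and their replacements) are what a call to `remap()` takes as its first two arguments.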

# Example RF Classification

## **Data to be classified:** Landsat 8 Bands 1-7

## **Training data:** MODIS Land cover classification

In [3]:
# Define a region of interest (roi) as a point.  Change the coordinates
# to get a classification of any place where there is imagery.
roi = ee.Geometry.Point(-78.638, 35.779)

fig = Figure(width=600, height=400)
lat, lon = 35.779, -78.638
raleigh_map = folium.Map(location=[lat, lon], zoom_start=10)
fig.add_child(raleigh_map)


## Get the Landsat 8 image for the least cloudy day in 2018 in our ROI

In [4]:
# Select the least cloudy Landsat 8 Collection 2 Level-2 (surface reflectance) scene.

# Load Landsat 8 input imagery.
landsat = ee.Image(ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')
  # Filter to get only one year of images (the end date is exclusive).
  .filterDate('2018-01-01', '2019-01-01')
  # Filter to get only images that intersect the region of interest.
  .filterBounds(roi)
  # Sort by scene cloudiness, ascending.
  .sort('CLOUD_COVER')
  # Get the first (least cloudy) scene.
  .first())
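
The filter, sort, and first pattern used above can be mimicked on ordinary Python data. This is a rough analogy rather than Earth Engine code: the scene IDs, dates, and cloud values are made up, and only the `CLOUD_COVER` property name comes from the cell above.

```python
from datetime import date

# Hypothetical scene metadata.
scenes = [
    {"id": "scene_a", "date": date(2018, 3, 14), "CLOUD_COVER": 42.1},
    {"id": "scene_b", "date": date(2018, 7, 2), "CLOUD_COVER": 3.5},
    {"id": "scene_c", "date": date(2019, 1, 9), "CLOUD_COVER": 0.2},
    {"id": "scene_d", "date": date(2018, 10, 22), "CLOUD_COVER": 11.0},
]

start, end = date(2018, 1, 1), date(2019, 1, 1)
# Keep scenes with start <= date < end (the end date is excluded).
in_year = [s for s in scenes if start <= s["date"] < end]
# Sort ascending by cloud cover and take the first scene.
least_cloudy = sorted(in_year, key=lambda s: s["CLOUD_COVER"])[0]
print(least_cloudy["id"])  # scene_b
```

Note that `scene_c` has the lowest cloud cover overall but falls outside the date window, so it is never considered.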


In [5]:
# Define visualization parameters for True Color
vis_params = {
    "opacity": 1,
    "bands": ["SR_B4", "SR_B3", "SR_B2"],
    "min": 7729,
    "max": 13931,
    "gamma": 1
}


# Display the Landsat 8 image in True Color
raleigh_map.add_ee_layer(landsat, vis_params, 'landsat')
raleigh_map.add_child(folium.LayerControl())
fig.add_child(raleigh_map)
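
The `min` and `max` values above are raw pixel values, not reflectance. As a rough check, the sketch below converts them using the scale factor (0.0000275) and offset (-0.2) that USGS publishes for Collection 2 Level-2 surface reflectance bands. The function name is hypothetical.

```python
# Collection 2 Level-2 surface reflectance scaling: value * SCALE + OFFSET.
SCALE = 0.0000275
OFFSET = -0.2

def dn_to_reflectance(dn):
    """Convert a stored surface reflectance pixel value to unitless reflectance."""
    return dn * SCALE + OFFSET

vis_min, vis_max = 7729, 13931
low = round(dn_to_reflectance(vis_min), 4)
high = round(dn_to_reflectance(vis_max), 4)
print(low, high)  # 0.0125 0.1831
```

So the display stretch covers reflectance from roughly 1% to 18%, a plausible range for the visible bands.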



## Get MODIS Land Cover data for 2018 (to be used for training and validation)

In [23]:
# Use MODIS land cover, IGBP classification, for training.
modis = ee.Image('MODIS/006/MCD12Q1/2018_01_01').select('LC_Type1')


# make a new map
fig2 = Figure(width=600, height=400)
lat, lon = 35.779, -78.638
raleigh_map2 = folium.Map(location=[lat, lon], zoom_start=10)

# set the color palette
modis_v6_pal = {'palette' : ['05450a', # Evergreen Needleleaf Forests: dominated by evergreen conifer trees (canopy >2m). Tree cover >60%.
                             '086a10', # Evergreen Broadleaf Forests: dominated by evergreen broadleaf and palmate trees (canopy >2m). Tree cover >60%.
                             '54a708', # Deciduous Needleleaf Forests: dominated by deciduous needleleaf (larch) trees (canopy >2m). Tree cover >60%.
                             '78d203', # Deciduous Broadleaf Forests: dominated by deciduous broadleaf trees (canopy >2m). Tree cover >60%.
                             '009900', # Mixed Forests: dominated by neither deciduous nor evergreen (40-60% of each) tree type (canopy >2m). Tree cover >60%.
                             'c6b044', # Closed Shrublands: dominated by woody perennials (1-2m height) >60% cover.
                             'dcd159', # Open Shrublands: dominated by woody perennials (1-2m height) 10-60% cover.
                             'dade48', # Woody Savannas: tree cover 30-60% (canopy >2m).
                             'fbff13', # Savannas: tree cover 10-30% (canopy >2m).
                             'b6ff05', # Grasslands: dominated by herbaceous annuals (<2m).
                             '27ff87', # Permanent Wetlands: permanently inundated lands with 30-60% water cover and >10% vegetated cover.
                             'c24f44', # Croplands: at least 60% of area is cultivated cropland.
                             'a5a5a5', # Urban and Built-up Lands: at least 30% impervious surface area including building materials, asphalt and vehicles.
                             'ff6d4c', # Cropland/Natural Vegetation Mosaics: mosaics of small-scale cultivation 40-60% with natural tree, shrub, or herbaceous vegetation.
                             '69fff8', # Permanent Snow and Ice: at least 60% of area is covered by snow and ice for at least 10 months of the year.
                             'f9ffa4', # Barren: at least 60% of area is non-vegetated barren (sand, rock, soil) areas with less than 10% vegetation.
                             '1c0dff'  # Water Bodies: at least 60% of area is covered by permanent water bodies.
                             ], 
                'min': 1.0,  
                'max': 17.0,}

# map the true color Landsat 8 image
raleigh_map2.add_ee_layer(landsat, vis_params, 'landsat')
# map the modis land cover image
raleigh_map2.add_ee_layer(modis, modis_v6_pal, 'modis')
# add a layer button
raleigh_map2.add_child(folium.LayerControl())
# display the map
fig2.add_child(raleigh_map2)

## Select training data from MODIS land cover data

In [29]:
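# Sample the input imagery to get a FeatureCollection of training data.
# Copy the modis bands to landsat and sample.
# numPixels caps the sample size; sampling every pixel in the scene
# can exceed the Earth Engine user memory limit.
training = landsat.addBands(modis).sample(
  numPixels = 5000,
  seed = 0
)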
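
The role of the seed can be illustrated with Python's standard `random` module. This is an analogy only, since Earth Engine's `sample()` uses its own sampling algorithm. The `sample_pixels` helper, the scene size, and the sample size are all hypothetical.

```python
import random

def sample_pixels(pixel_ids, num_pixels, seed):
    """Draw a reproducible random sample of pixel IDs."""
    rng = random.Random(seed)
    return rng.sample(pixel_ids, num_pixels)

pixel_ids = list(range(100_000))  # stand-in for the pixels in a scene
train_a = sample_pixels(pixel_ids, 5000, seed=0)
train_b = sample_pixels(pixel_ids, 5000, seed=0)
validation = sample_pixels(pixel_ids, 5000, seed=1)

print(train_a == train_b)     # the same seed reproduces the same sample
print(train_a == validation)  # a different seed draws a different sample
# Different seeds do not guarantee disjoint samples; some pixels may repeat.
overlap = len(set(train_a) & set(validation))
print(overlap)
```

A fixed seed makes the training set reproducible, while a second seed gives a different draw that can serve as validation data.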


## Make the RF classifier from the training data and Landsat 8 bands

**How does the classifier use the MODIS training data?**


**What does the inputProperties argument mean?**
The inputProperties are the predictor variables: the names of the Landsat 8 bands (`SR_B1` through `SR_B7`) whose values the classifier uses to predict a class.

**What does the classProperty argument mean?**
The classProperty is the name of the property that holds the class label for each training sample, here the MODIS `LC_Type1` band.

**How many trees are we using?**
10 trees.


In [36]:
# Make a Random Forest classifier with 10 trees and train it.
classifier = ee.Classifier.smileRandomForest(10).train(
      features = training,
      classProperty = 'LC_Type1',
      inputProperties = ['SR_B1', 'SR_B2', 'SR_B3', 'SR_B4', 'SR_B5', 'SR_B6', 'SR_B7']
    )

## Run the classifier on the Landsat 8 image

In [37]:
# Classify the input imagery.
classified = landsat.classify(classifier)

## Map the result

In [38]:
# Define a palette for the IGBP classification.
igbpPalette = [
  'aec3d4', # water
  '152106', '225129', '369b47', '30eb5b', '387242', # forest
  '6a2325', 'c3aa69', 'b76031', 'd9903d', '91af40',  # shrub, grass
  '111149', # wetlands
  'cdb33b', # croplands
  'cc0013', # urban
  '33280d', # crop mosaic
  'd7cdcc', # snow and ice
  'f7e084', # barren
  '6f6f6f'  # tundra
]

# Display the input and the classification.

fig2 = Figure(width=600, height=400)
lat, lon = 35.779, -78.638
raleigh_map2 = folium.Map(location=[lat, lon], zoom_start=10)

raleigh_map2.add_ee_layer(landsat, vis_params, 'landsat')
raleigh_map2.add_ee_layer(modis,modis_v6_pal,'modis')
raleigh_map2.add_ee_layer(classified, modis_v6_pal, 'classification')


# add layer control panel to map
raleigh_map2.add_child(folium.LayerControl())
fig2.add_child(raleigh_map2)

Could not display classification


## Measure the accuracy of the classification of the training data

In [39]:
# Get a confusion matrix representing resubstitution accuracy.
trainAccuracy = classifier.confusionMatrix()
# print('Resubstitution error matrix:\n', pd.DataFrame(trainAccuracy.getInfo()))
print('Training overall accuracy: ',  trainAccuracy.accuracy().getInfo())



EEException: User memory limit exceeded.

## Measure the accuracy of a random selection of data (validation data)

In [40]:
# Sample the input with a different random seed to get validation data.
validation = landsat.addBands(modis).sample(
  numPixels = 5000,
  seed = 1
  # Filter the result to get rid of any null pixels.
).filter(ee.Filter.notNull(['SR_B1']))

# Classify the validation data.
validated = validation.classify(classifier)

# Get a confusion matrix representing expected accuracy.
testAccuracy = validated.errorMatrix('LC_Type1', 'classification')
print('Validation error matrix:\n', pd.DataFrame(testAccuracy.getInfo()))
print('Validation overall accuracy: ', testAccuracy.accuracy().getInfo())

EEException: User memory limit exceeded.
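
For reference, overall accuracy is the number of correctly classified samples (the diagonal of the matrix) divided by the total number of samples. The sketch below computes it for a made-up 3-class matrix, assuming rows are the actual classes and columns are the predicted classes. The function names and counts are hypothetical and are not results from this notebook.

```python
def overall_accuracy(matrix):
    """Correctly classified samples (the diagonal) divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_recall(matrix):
    """Fraction of each actual class (row) that was classified correctly."""
    return [row[i] / sum(row) if sum(row) else None
            for i, row in enumerate(matrix)]

# Made-up counts: rows = actual class, columns = predicted class.
matrix = [
    [50, 5, 0],
    [10, 30, 5],
    [0, 0, 100],
]
print(overall_accuracy(matrix))  # 0.9
print(per_class_recall(matrix))
```

A high overall accuracy can hide a weak class: here the second class is recovered only two thirds of the time even though 90% of all samples are correct.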

## What might be some sources of error in our Random Forest?
We use only the single least cloudy image from the 2018 date range to train our Random Forest. The model is likely to overfit to this one image because it has seen no variation across dates or scenes. We also sample only 5000 pixels from this image, which reduces the variation in our training data even further. Which pixels are sampled depends on the seed we provide to the `sample()` function, so an unlucky seed may yield a homogeneous sample that does not represent the whole image.

## How can we improve our RF?
Provide additional imagery and increase the number of trees in our random forest.

## How might changing the number of trees impact our results?
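
One way to start thinking about this is an idealized calculation: if every tree were independently correct with the same probability in a two-class problem, how often would the majority be right? Real trees are correlated and our problem has many classes, so the actual gain is smaller. The function name and the 0.6 per-tree accuracy are assumptions for illustration.

```python
from math import comb

def majority_correct_probability(n_trees, p_tree):
    """P(more than half of n independent trees are right), two-class case."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_tree**k * (1 - p_tree) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Odd tree counts avoid tied votes.
for n in (1, 11, 101):
    print(n, round(majority_correct_probability(n, 0.6), 3))
```

Under these assumptions the probability of a correct majority rises with the number of trees, but with diminishing returns.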

## How might changing the number of training data impact our results?

## What dataset other than MODIS Land Cover could we use as training data? (Think about a moderate resolution land cover product.) How would a different resolution impact our results? 
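
A quick calculation helps frame the resolution part of this question. It assumes the nominal pixel sizes of the two products used in this notebook: 30 m for the Landsat 8 surface reflectance bands and 500 m for the MODIS MCD12Q1 land cover.

```python
landsat_pixel_m = 30   # Landsat 8 surface reflectance bands
modis_pixel_m = 500    # MODIS MCD12Q1 land cover

# Approximate number of Landsat pixels that share one MODIS label.
pixels_per_label = (modis_pixel_m / landsat_pixel_m) ** 2
print(round(pixels_per_label))  # 278
```

Each MODIS label is therefore applied to a few hundred Landsat pixels, some of which may belong to a different land cover than the label says.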

# Test a few of your suggested changes out and compare the results!