Explicitly incorporating spatial information to recurrent networks for agriculture
Claus Smitt,
Michael Halstead,
Alireza Ahmadi,
Chris McCool
Agricultural Robotics & Engineering, Institute of Agriculture, University of Bonn
Presented at IROS 2022 (Best AgRobotics Paper Winner!)
ReprojRNN leverages spatial-temporal cues available in most agricultural robots to improve crop monitoring tasks.
python3 -m venv ".venv"
. .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
bash scripts/get_bup20.sh # ~70GB
bash scripts/get_st_atte_model_bup20.sh # ~500MB
python test.py \
trained_models/st_atte_bup20/config.yaml \
-g 1 \
--data_path ~/datasets/CKA_sweet_pepper_2020_summer/CKA_sweet_pepper_2020_summer.yaml
If all went well, you should get the list of all metrics when the model finishes testing.
Note: If you get version errors, change the torch version in the requirements.txt file to match your CUDA version.
We propose a Spatial-Temporal fusion layer (ST-Fusion) that spatially registers feature maps throughout a sequence of frames. It exploits the spatial-temporal information commonly available on agricultural robots (RGB-D images & robot poses) and uses multi-view geometry to re-project complete feature tensors between frames, at any depth of a deep convolutional neural network.
This layer computes a pixel-wise shift matrix from the depth image of a previous frame and the robot trajectory (S - spatial prior). Moreover, since it is desirable to fuse information at different scales of the network, the shift matrix can be interpolated to the corresponding tensor size before registration (see our paper for more details). The shift matrix is used to register a prior recurrent feature map (T - temporal prior) to the feature maps of the current frame, which are then finally fused together.
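To make the registration step concrete, here is a minimal PyTorch sketch of the underlying inverse-warping geometry: pixels are back-projected with a depth map, moved by the relative camera pose from odometry, and re-projected to sample the prior feature map. The function and argument names are ours, and this is only a simplified illustration of the idea, not the repository's ST-Fusion implementation.

```python
import torch
import torch.nn.functional as F

def warp_prior_features(feat_prior, depth, K, T_rel):
    """Register a prior feature map to the current view (simplified sketch).

    feat_prior : (B, C, H, W) recurrent feature map from a previous frame
    depth      : (B, 1, H, W) depth map at the feature-map resolution
    K          : (B, 3, 3)    camera intrinsics scaled to (H, W)
    T_rel      : (B, 4, 4)    relative camera pose from robot odometry
    """
    B, _, H, W = feat_prior.shape
    dev = feat_prior.device

    # Homogeneous pixel grid of the current view
    v, u = torch.meshgrid(torch.arange(H, device=dev, dtype=torch.float32),
                          torch.arange(W, device=dev, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(1, 3, -1)

    # Back-project with depth, apply the relative pose, re-project into the prior view
    pts = torch.inverse(K) @ pix * depth.reshape(B, 1, -1)             # (B, 3, H*W)
    pts_h = torch.cat([pts, torch.ones(B, 1, H * W, device=dev)], dim=1)
    proj = K @ (T_rel @ pts_h)[:, :3]                                   # (B, 3, H*W)
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                     # (B, 2, H*W)

    # Normalise to [-1, 1] and bilinearly sample the prior feature map
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(feat_prior, grid, align_corners=True)
```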
In this repository the available tensor fusion methods are pixel-wise attention (`models/rnn_avg_attention_reproj_segmentation`) and Conv-GRUs (`models/gru_reproj_segmentation`).
Finally, instances of the ST-Fusion layer are interleaved at various depths of the decoder of a fully convolutional segmentation pipeline, explicitly incorporating the spatial-temporal information into the mask predictions.
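As a rough picture of how such a layer can sit inside the decoder, the sketch below registers a recurrent prior using the `warp_prior_features` helper from the previous snippet and fuses it with the current features through a simple pixel-wise attention gate. The module name, gating formulation and state handling are illustrative assumptions; see the modules under `models/` for the actual implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STFusionSketch(nn.Module):
    """Illustrative ST-Fusion block: register the recurrent prior, then fuse it.
    One instance would be placed at each decoder scale where fusion is desired."""

    def __init__(self, channels):
        super().__init__()
        # Pixel-wise gate deciding how much of the registered prior to keep
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=1)
        self.prior = None  # recurrent feature map carried over from the previous frame

    def forward(self, feat, depth, K, T_rel):
        # feat: (B, C, H, W) decoder features of the current frame
        # K is assumed to already be scaled to this decoder resolution
        if self.prior is None:
            fused = feat
        else:
            # Resize the depth map to this scale, then register the prior
            d = F.interpolate(depth, size=feat.shape[-2:], mode="nearest")
            registered = warp_prior_features(self.prior, d, K, T_rel)
            alpha = torch.sigmoid(self.gate(torch.cat([feat, registered], dim=1)))
            fused = alpha * registered + (1 - alpha) * feat
        self.prior = fused.detach()
        return fused
```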
Tested on Ubuntu 18.04; CUDA 11.3
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get install -y python3-pip
sudo apt-get install -y build-essential libssl-dev libffi-dev python3-dev
sudo apt-get install -y python3-venv
Install CUDA and cuDNN.
python3 -m venv ".venv"
. .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
We train our models on sequences of N frames, where a batch is a set of such sequences; each sequence includes RGB-D images and the associated camera/robot odometry. A limitation of temporal models is that they are generally trained and evaluated at consistent frame rates, whereas real-world systems must also cope with variable frame rates, frame drops and jitter. We therefore train our models on sequences with artificial frame drops and jitter, using monotonically increasing frame indices (see our paper for more details).
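As an illustration of this sampling strategy (not the repository's dataloader code), the helper below draws monotonically increasing frame indices with random gaps; the function name and the maximum skip are placeholder choices.

```python
import random

def sample_training_indices(start, num_frames, max_skip=3, seq_len=None):
    """Monotonically increasing frame indices with random gaps,
    emulating frame drops and timing jitter (illustrative only)."""
    indices = [start]
    while len(indices) < num_frames:
        nxt = indices[-1] + random.randint(1, max_skip)
        if seq_len is not None and nxt >= seq_len:
            break
        indices.append(nxt)
    return indices

# e.g. sample_training_indices(0, 4) might return [0, 2, 3, 6]
```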
We evaluate our ST-Fusion layer by inserting it into a segmentation pipeline and testing on two challenging agricultural datasets:
Horticulture glasshouse - sweet pepper (BUP20)
Download the dataset (~70GB):
bash scripts/get_bup20.sh
A dataset of video sequences from a glasshouse environment at campus Klein-Altendorf (CKA), with two sweet pepper cultivars, Mavera (yellow) and Allrounder (red); each cultivar matures from green, through mixed, to its primary color. This data, which contains all the colors, was captured by the autonomous phenotyping platform PATHoBot driving at 0.2 m/s. The dataset comprises 10 sequences over 6 crop rows, captured with Intel RealSense D435i cameras recording RGB-D images, as well as wheel & RGB-D refined odometry.
Below is a summary of the dataset characteristics
Image size | Image type | Robot Pose | FPS | Train | Val. | Eval. |
---|---|---|---|---|---|---|
1280 x 720 | RGB-D | wheel & RGB-D odometry | 15 | 124 | 62 | 93 |
Arable farming - sugar beet (SB20)
Download the dataset (~9GB):
bash scripts/get_sb20.sh
This dataset was captured in a sugar beet field at the University of Bonn's campus Klein-Altendorf (CKA) using an Intel RealSense D435i camera mounted on BonnBot-I with a nadir view of the ground, driving at 0.4 m/s.
Sequences contain robot wheel odometry and RGB-D images of crops and 8 different categories of weeds at various growth stages, under different illumination conditions and across three herbicide treatment regimes (30%, 70%, 100%), which directly impact weed density.
Below is a summary of the dataset characteristics
Image size | Image type | Robot Pose | FPS | Train | Val. | Eval. |
---|---|---|---|---|---|---|
640 x 480 | RGB-D | wheel & RGB-D odometry | 15 | 71 | 37 | 35 |
The `./config` folder contains several `yaml` config files used in the paper that can serve as examples.
python train.py \
-g [num_gpus] \
--out_path /net/outputs/save/path \
--log_path /train/logs/save/path \
--ckpt_path /model/checkpoints/save/path \
--data_path /dataset/yaml/file/location
The training script uses the PyTorch Lightning DDP plugin for multi-GPU training.
Note: The training process is quite memory intensive due to the recurrent nature of the models (we trained on an Nvidia RTX A6000). If you get an out-of-memory error, try the following changes to your `yaml` config file:
- Reduce `dataloader/batch_size`
- Reduce `dataloader/sequencing/num_frames` to use shorter frame sequences
- Set `/trainer/precision` to 16
These changes will likely give results that differ from the ones reported in the paper, but you will be able to train the model on your own hardware.
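For reference, the relevant keys would look roughly like this in a `yaml` config; the values are only illustrative, so check the files under `./config` for the actual defaults.

```yaml
dataloader:
  batch_size: 2        # reduce to lower GPU memory usage
  sequencing:
    num_frames: 3      # shorter sequences are cheaper to train
trainer:
  precision: 16        # mixed precision roughly halves activation memory
```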
@article{smitt2022explicitly,
title={Explicitly incorporating spatial information to recurrent networks for agriculture},
author={Smitt, Claus and Halstead, Michael and Ahmadi, Alireza and McCool, Chris},
journal={IEEE Robotics and Automation Letters},
year={2022},
}