# Data preparation

In this notebook we process the satellite images from the `Radiant Earth Spot the Crop Challenge` and transform the data into a format that is easier to handle.

The images were downloaded in the previous notebook, `0. Download.ipynb`, which is the starting point for this notebook.
These are the libraries that we will need:

In [1]:
import datetime
import rasterio
import numpy as np
import pandas as pd
import geopandas as gpd

## 1. List of assets from Radiant MLHub


The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the [Radiant MLHub site](https://mlhub.earth) and about the organization behind it at the [Radiant Earth Foundation site](https://radiant.earth).

Full documentation for the API is available at [docs.mlhub.earth](https://docs.mlhub.earth).

Each item in our collection is described in JSON format, compliant with the [STAC](https://stacspec.org/) [label extension](https://github.com/radiantearth/stac-spec/tree/master/extensions/label) definition.

We create a DataFrame that summarizes all the different assets.
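
To illustrate what such a summary could look like, here is a minimal sketch that flattens one STAC-style item into one row per asset. The item is made up for illustration (its `id`, asset keys, and `href` values are hypothetical), and `flatten_item` is not the function used in `create_asset_files`. Only the top-level `id`, `properties.datetime`, and `assets` keys follow the STAC item layout.

```python
import pandas as pd

# Hypothetical STAC-style item; real items carry many more fields.
item = {
    "id": "example_item_0001",
    "properties": {"datetime": "2017-04-01T00:00:00Z"},
    "assets": {
        "B02": {"href": "example/B02.tif"},
        "B03": {"href": "example/B03.tif"},
    },
}

def flatten_item(item):
    """Return one record per asset of a STAC-style item."""
    return [
        {
            "item_id": item["id"],
            "datetime": item["properties"]["datetime"],
            "asset": name,
            "file_path": asset["href"],
        }
        for name, asset in item["assets"].items()
    ]

# One row per asset, ready to be concatenated with the rows of other items.
assets_df = pd.DataFrame(flatten_item(item))
print(assets_df)
```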

In [2]:
import importlib
import create_asset_files as crass
importlib.reload(crass)

labels_path = {"train": 'ref_south_africa_crops_competition_v1_train_labels',
              "test": 'ref_south_africa_crops_competition_v1_test_labels'}

In [None]:
# creating asset collection for the train set
case = "train"
assets_train_df = crass.create_assets(labels_path[case])

assets_train_df.datetime = pd.to_datetime(assets_train_df.datetime)
assets_train_df = assets_train_df.assign(
    date = lambda x: x['datetime'].dt.date,
    month = lambda x: x['datetime'].dt.month.astype('Int64'),
    dayofyear = lambda x: x['datetime'].dt.dayofyear.astype('Int64'),
)

assets_train_df.tile_id = assets_train_df.tile_id.astype(int)
assets_train_df.to_csv(f"data/assets_{case}.csv", index = False)
print(f"assets_{case}.csv was saved in /data")
assets_train_df.head()

In [41]:
# creating asset collection for the test set
case = "test"
assets_test_df = crass.create_assets(labels_path[case])

assets_test_df.datetime = pd.to_datetime(assets_test_df.datetime)
assets_test_df = assets_test_df.assign(
    date = lambda x: x['datetime'].dt.date,
    month = lambda x: x['datetime'].dt.month.astype('Int64'),
    dayofyear = lambda x: x['datetime'].dt.dayofyear.astype('Int64'),
)

assets_test_df.tile_id = assets_test_df.tile_id.astype(int)
assets_test_df.to_csv(f"data/assets_{case}.csv", index = False)
print(f"assets_{case}.csv was saved in /data")
assets_test_df.head()

Procesing 1137 of 1137
assets_test.csv was saved in /data
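
The date columns derived in the two cells above can be tried out on a toy table. This is only a sketch: the three rows are made up, and the missing value is included on the assumption that some assets may carry no timestamp. That would explain the nullable `Int64` dtype, which keeps the gap as `<NA>` instead of turning the column into floats.

```python
import pandas as pd

# Toy stand-in for the assets table; the real one comes from crass.create_assets.
toy = pd.DataFrame({"datetime": ["2017-04-01", "2017-11-30", None]})
toy["datetime"] = pd.to_datetime(toy["datetime"])

toy = toy.assign(
    date=lambda x: x["datetime"].dt.date,
    month=lambda x: x["datetime"].dt.month.astype("Int64"),
    dayofyear=lambda x: x["datetime"].dt.dayofyear.astype("Int64"),
)
print(toy[["month", "dayofyear"]])
```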


## 2. Fields DataFrame

Now we want to generate a DataFrame containing the information for all the different fields.

In [11]:
assets_train_df = pd.read_csv('data/assets_train.csv')
fields_train = crass.create_field_list_train(assets_train_df)
fields_train.head()

Tile Nr. 2650 of total 2650


In [287]:
assets_test_df = pd.read_csv('data/assets_test.csv')
fields_test = crass.create_field_list_test(assets_test_df)
fields_test.head()

Tile Nr. 1137 of total 1137


| | geometry | field_id | tile_id | field_area_km2 |
|---|---|---|---|---|
| 0 | POLYGON ((18.13951 -33.00683, 18.13950 -33.007... | 62027 | 590 | 0.037301 |
| 1 | POLYGON ((18.14625 -33.00699, 18.14624 -33.007... | 62071 | 590 | 0.050804 |
| 2 | POLYGON ((18.14045 -33.00757, 18.14044 -33.007... | 85373 | 590 | 0.022991 |
| 3 | POLYGON ((18.13053 -33.00662, 18.13052 -33.006... | 102896 | 590 | 0.281303 |
| 4 | POLYGON ((18.14654 -33.00771, 18.14654 -33.007... | 3079 | 590 | 0.139693 |


Some fields with label 0 and field_id 0 are present. They seem to be very small, so we drop them from the list.

In [17]:
fields_train = crass.remove_dupl_fields(fields_train)

In [289]:
fields_test = crass.remove_dupl_fields(fields_test)
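
The internals of `crass.remove_dupl_fields` are not shown here. As a rough sketch of the clean-up described above, and assuming that the function also removes repeated entries (as its name suggests), the filtering could look like this on a made-up field list:

```python
import pandas as pd

# Toy field list; field_id 0 stands for entries that belong to no labelled field.
toy_fields = pd.DataFrame({
    "field_id": [0, 62027, 62071, 62027],
    "tile_id": [590, 590, 590, 590],
    "field_area_km2": [0.0001, 0.037301, 0.050804, 0.037301],
})

cleaned = (
    toy_fields[toy_fields["field_id"] != 0]            # drop the field_id 0 entries
    .drop_duplicates(subset=["field_id", "tile_id"])   # keep each field once per tile
    .reset_index(drop=True)
)
print(cleaned)
```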

We also add a column with each field's elevation above sea level, in meters.

In [None]:
crass.download_elevation(fields_train)

In [40]:
elev_src = rasterio.open('data/elev_merged.tif')
elev_read = elev_src.read()

With the elevation raster loaded, we compute the elevation of each field and save the DataFrame.

In [None]:
fields_train['elevation'] = fields_train.geometry.apply(crass.get_elev, args=(elev_src, elev_read, ))
fields_train.to_file('data/fields_train.geojson', driver='GeoJSON')

In [294]:
fields_test['elevation'] = fields_test.geometry.apply(crass.get_elev, args=(elev_src, elev_read, ))
fields_test.to_file('data/fields_test.geojson', driver='GeoJSON')
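
`crass.get_elev` receives a field geometry, the open raster, and the array read from it, but its body is not shown in this notebook. The sketch below assumes that the elevation is sampled at a single point such as the field centroid. It uses a plain NumPy grid and a hypothetical helper `elevation_at` to show the coordinate-to-pixel conversion. With `rasterio`, the dataset method `elev_src.index(x, y)` returns the corresponding `(row, col)` for a real raster.

```python
import numpy as np

def elevation_at(x, y, band, origin_x, origin_y, pixel_size):
    """Look up the raster value at map coordinate (x, y).

    Assumes a north-up raster whose top-left corner is (origin_x, origin_y)
    and whose pixels are square with side `pixel_size`.
    """
    col = int((x - origin_x) // pixel_size)
    row = int((origin_y - y) // pixel_size)
    return band[row, col]

# Toy 3x3 elevation grid (meters) with its top-left corner at (18.0, -33.0)
# and a pixel size of 0.01 degrees; all values are made up.
band = np.array([[10, 20, 30],
                 [40, 50, 60],
                 [70, 80, 90]])

centroid_x, centroid_y = 18.015, -33.025   # hypothetical field centroid
elevation = elevation_at(centroid_x, centroid_y, band, 18.0, -33.0, 0.01)
print(elevation)
```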

## 3. Tile DataFrame

Now we want to extract all the information regarding the tiles.

First, we check on which days the images contain clouds.

In [3]:
fields_train = gpd.read_file('data/fields_train.geojson')
fields_test = gpd.read_file('data/fields_test.geojson')

In [4]:
assets_train_df = pd.read_csv('data/assets_train.csv')
assets_test_df = pd.read_csv('data/assets_test.csv')

In [3]:
import tile_utils as tilu
importlib.reload(tilu)

<module 'tile_utils' from '/home/jupyter/NF-Capstone-Crop-Classification/tile_utils.py'>

In [23]:
tiles_train = tilu.create_basic_tile_df(assets_train_df)
tiles_test = tilu.create_basic_tile_df(assets_test_df)

In [24]:
sunny_train = tilu.create_sunny_df(assets_train_df)
tiles_train = tiles_train.merge(sunny_train, how='inner', on='tile_id')

2650


The tile DataFrame now includes:
+ the days on which Sentinel-1 took an image (`s1_days`)
+ the days on which Sentinel-2 took an image (`all_days`)
+ the days on which Sentinel-2 took an image with no clouds (`sunny_days`)
+ the ratio of cloud-free days to all Sentinel-2 days (`sun_rate`)
+ the days on which Sentinel-2 took an image with no clouds and with at least 80% of the picture in good condition (`clean_days`)
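
How these quantities relate to each other can be sketched with plain Python sets. The per-day records below are invented, and the `cloudy` flag and `valid_fraction` value are hypothetical inputs. How `tilu.create_sunny_df` actually derives them from the cloud masks is not shown here.

```python
# Hypothetical Sentinel-2 observations of one tile, keyed by day of year.
observations = {
    131: {"cloudy": True,  "valid_fraction": 1.00},
    134: {"cloudy": False, "valid_fraction": 0.95},
    141: {"cloudy": False, "valid_fraction": 0.60},
    144: {"cloudy": False, "valid_fraction": 0.80},
}

all_days = set(observations)
sunny_days = {day for day, obs in observations.items() if not obs["cloudy"]}

# Share of the acquisition days that are cloud free.
sun_rate = len(sunny_days) / len(all_days)

# Cloud-free days on which at least 80% of the picture is usable.
clean_days = {day for day in sunny_days if observations[day]["valid_fraction"] >= 0.8}

print(sun_rate, sorted(clean_days))
```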

We are also interested in knowing the neighboring tiles.

In [25]:
tiles_train_crs32634 = tiles_train.to_crs(32634)

# apply with a threshold of 4000m
tiles_train['tiles_closest'] = tiles_train_crs32634.apply(tilu.tiles_closest, axis='columns', args=(tiles_train_crs32634, 4000,))
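
The reprojection matters because a threshold in meters only makes sense in a projected coordinate system: EPSG:32634 is WGS 84 / UTM zone 34N, whose units are meters. The sketch below shows the idea with hypothetical tile centroids. Whether `tilu.tiles_closest` measures centroid distances or polygon distances is not visible here, so `tiles_within` is only an assumed stand-in.

```python
import math

# Hypothetical tile centroids in a metric projection (x, y in meters).
centroids = {
    590: (0.0, 0.0),
    129: (2500.0, 0.0),
    743: (0.0, 3900.0),
    100: (3000.0, 3000.0),   # about 4243 m away from tile 590
}

def tiles_within(tile_id, centroids, threshold_m):
    """Return the ids of all other tiles whose centroid lies within threshold_m."""
    x0, y0 = centroids[tile_id]
    return [
        other
        for other, (x, y) in centroids.items()
        if other != tile_id and math.hypot(x - x0, y - y0) <= threshold_m
    ]

neighbors = tiles_within(590, centroids, 4000)
print(neighbors)
```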

In [30]:
tiles_train['tile_label_dist'] = tiles_train.apply(tilu.label_distribution, axis='columns', args=(fields_train,))

{1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.6666666666666666, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.3333333333333333}
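
A dictionary of this shape can be produced by counting the crop labels of the fields in a tile. This is a minimal sketch that assumes labels 1 to 9 and a simple count per field. `tilu.label_distribution` may instead weight the fields, for example by area.

```python
from collections import Counter

def label_distribution(labels, classes=range(1, 10)):
    """Share of fields per crop label, with 0.0 for labels that do not occur."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total if total else 0.0 for c in classes}

# Three hypothetical fields in one tile: two with label 5, one with label 9.
dist = label_distribution([5, 5, 9])
print(dist)
```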


In [68]:
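import ast  # literal_eval parses the stringified lists without executing arbitrary code
tiles_train.tiles_closest = tiles_train.tiles_closest.apply(ast.literal_eval)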


In [74]:
tiles_train['close_label_dist'] = tiles_train.apply(tilu.close_tiles_label_dist, axis='columns', args=(tiles_train,))

{1: 0.0, 2: 0.0, 3: 0.09683794466403162, 4: 0.0, 5: 0.21865236213062297, 6: 0.034782608695652174, 7: 0.0, 8: 0.0, 9: 0.6497270845096932}
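
One way to obtain such a neighborhood distribution is to average the label distributions of the neighboring tiles class by class. The sketch below does exactly that with made-up inputs. It is an assumption about the approach, not the code of `tilu.close_tiles_label_dist`, which may weight the neighbors differently.

```python
def mean_distribution(tile_ids, dist_by_tile, classes=range(1, 10)):
    """Average the label distributions of the given tiles, class by class."""
    dists = [dist_by_tile[t] for t in tile_ids if t in dist_by_tile]
    if not dists:
        return {c: 0.0 for c in classes}
    return {c: sum(d.get(c, 0.0) for d in dists) / len(dists) for c in classes}

# Hypothetical label distributions of two neighboring tiles.
dist_by_tile = {
    129: {5: 1.0},
    743: {5: 0.5, 9: 0.5},
}
close_dist = mean_distribution([129, 743], dist_by_tile)
print(close_dist)
```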


In [93]:
tiles_train.tiles_closest = tiles_train.tiles_closest.astype(str)
tiles_train.to_file('data/tiles_train.geojson', driver='GeoJSON')
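
The list column is converted to text before saving, presumably because list-valued columns are awkward to write to GeoJSON. The round trip can be sketched as follows. `ast.literal_eval` is used here as a safe way to parse the text back, and `tilu.read_tile_geojson` may well do something similar when loading the file.

```python
import ast

closest = [129, 743, 459]

as_text = str(closest)                 # what astype(str) stores for each row
restored = ast.literal_eval(as_text)   # parse the text back into a Python list

print(as_text, restored == closest)
```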

In [94]:
sunny_test = tilu.create_sunny_df(assets_test_df)
tiles_test = tiles_test.merge(sunny_test, how='inner', on='tile_id')

1137


In [102]:
tiles_train = tilu.read_tile_geojson('data/tiles_train.geojson')
tiles_train_crs32634 = tiles_train.to_crs(32634)
tiles_test_crs32634 = tiles_test.to_crs(32634)

# apply with a threshold of 4000m
tiles_test['tiles_closest'] = tiles_test_crs32634.apply(tilu.tiles_closest, axis='columns', args=(tiles_train_crs32634, 4000,))

For the test tiles we also compute the label distribution of the neighboring training tiles, that is, the most common crops in them.

In [105]:
tiles_test.head()

| | tile_id | geometry | s1_days | all_days | sunny_days | sun_rate | clean_days | tiles_closest |
|---|---|---|---|---|---|---|---|---|
| 0 | 590 | POLYGON ((18.12208 -33.02951, 18.14946 -33.030... | {259, 132, 264, 139, 271, 144, 276, 151, 283, ... | {131, 261, 134, 264, 266, 141, 269, 271, 144, ... | {261, 134, 266, 141, 271, 274, 151, 279, 284, ... | 0.578947 | {261, 134, 266, 141, 271, 274, 151, 279, 284, ... | [129, 743, 459, 1658, 1493, 1350, 1582, 1651] |
| 1 | 1026 | POLYGON ((18.66982 -33.04095, 18.69721 -33.041... | {259, 132, 264, 139, 271, 144, 276, 151, 283, ... | {131, 261, 266, 141, 269, 271, 276, 151, 279, ... | {261, 266, 269, 141, 271, 151, 279, 286, 289, ... | 0.680851 | {261, 266, 141, 271, 151, 286, 291, 296, 171, ... | [2215, 939, 1250, 2490, 830, 357, 2381] |
| 2 | 100 | POLYGON ((18.67173 -31.90977, 18.69878 -31.910... | {259, 132, 264, 139, 271, 144, 276, 151, 283, ... | {131, 261, 134, 264, 266, 141, 269, 271, 144, ... | {261, 134, 266, 269, 141, 144, 274, 279, 284, ... | 0.644737 | {261, 134, 266, 269, 141, 144, 274, 279, 284, ... | [1209, 1515, 2565, 1941, 850, 2560, 1097, 1765... |
| 3 | 332 | POLYGON ((18.21607 -31.76210, 18.24308 -31.762... | {259, 132, 264, 139, 271, 144, 276, 151, 283, ... | {134, 264, 269, 144, 274, 279, 154, 284, 289, ... | {134, 269, 144, 274, 279, 284, 289, 294, 174, ... | 0.789474 | {134, 269, 144, 274, 279, 284, 289, 294, 174, ... | [1260, 1450] |
| 4 | 756 | POLYGON ((18.40064 -32.87405, 18.42798 -32.874... | {259, 132, 264, 139, 271, 144, 276, 151, 283, ... | {131, 261, 134, 264, 266, 141, 269, 271, 144, ... | {261, 134, 266, 269, 141, 271, 144, 274, 151, ... | 0.592105 | {261, 134, 266, 269, 141, 271, 144, 274, 151, ... | [675, 1411, 1422, 1521, 1449, 389, 485] |


In [108]:
#tiles_test.tiles_closest = tiles_test.tiles_closest.apply(eval)
tiles_test['close_label_dist'] = tiles_test.apply(tilu.close_tiles_label_dist, axis='columns', args=(tiles_train,))

{1: 0.04839328708488483, 2: 0.2357246576367556, 3: 0.0038325189230621823, 4: 0.3283579759634396, 5: 0.0021786492374727667, 6: 0.2184166082641485, 7: 0.11909296778040722, 8: 0.04400333510982934, 9: 0.0}


In [109]:
tiles_test.tiles_closest = tiles_test.tiles_closest.astype(str)
tiles_test.to_file('data/tiles_test.geojson', driver='GeoJSON')

## 4. Stack Sentinel 2 Channels

In [4]:
import importlib
import stack_NN_utils as stul
importlib.reload(stul)

<module 'stack_NN_utils' from '/home/jupyter/NF-Capstone-Crop-Classification/stack_NN_utils.py'>

In [7]:
assets_train_df = pd.read_csv('data/assets_train.csv')
assets_test_df = pd.read_csv('data/assets_test.csv')

In [8]:
tiles_train = tilu.read_tile_geojson('data/tiles_train.geojson')
tiles_test = tilu.read_tile_geojson('data/tiles_test.geojson')

In [None]:
num_days = 8
satellite = "s2"

assets_stacked_8days_train = stul.create_stacked_NN(assets_train_df, tiles_train, satellite, num_days, train=True)
#assets_stacked_8days_train.to_csv('data/assets_stacked_16days_train_CNN.csv')

1183


In [9]:
num_days = 8
satellite = "s2"

assets_stacked_8days_test = stul.create_stacked_NN(assets_test_df, tiles_test, satellite, num_days, train=False)
#assets_stacked_8days_test.to_csv('data/assets_stacked_16days_test_CNN.csv')

947
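
What `stul.create_stacked_NN` does internally is not shown in this notebook. Judging only from the `num_days = 8` argument and the `8days` variable names, one plausible reading is that acquisition days are grouped into consecutive 8-day windows before the bands are stacked. The sketch below illustrates that grouping with a hypothetical helper `group_by_window` and made-up days. It should not be taken as the actual implementation.

```python
from collections import defaultdict

def group_by_window(days, num_days, first_day=None):
    """Group day-of-year values into consecutive windows of num_days days."""
    if first_day is None:
        first_day = min(days)
    windows = defaultdict(list)
    for day in sorted(days):
        windows[(day - first_day) // num_days].append(day)
    return dict(windows)

# Hypothetical usable acquisition days of one tile.
usable_days = {131, 134, 141, 144, 151, 156}
windows = group_by_window(usable_days, num_days=8)
print(windows)
```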


In [6]:
# save the stacked test assets computed in the previous cell
assets_stacked_8days_test.to_csv('data/assets_stacked_16days_test_CNN.csv')

## 5. Crop Fields and create Mean + Var

In [305]:
import crop_utils as crul
importlib.reload(crul)

<module 'crop_utils' from '/home/jupyter/NF-Capstone-Crop-Classification/crop_utils.py'>

In [295]:
assets_test_df = pd.read_csv('data/assets_test.csv')
tiles_test = tilu.read_tile_geojson('data/tiles_test.geojson')
fields_test = gpd.read_file('data/fields_test.geojson')
assets_stacked_8days_test = pd.read_csv('data/assets_stacked_8days_test.csv')

In [None]:
train_mean_var = crul.create_mean_var(assets_train_df, assets_stacked_8days_train, fields_train, tiles_train, True)
train_mean_var.to_csv('data/mean_var_8days_train.csv', index=False)

In [307]:
test_mean_var = crul.create_mean_var(assets_test_df, assets_stacked_8days_test, fields_test, tiles_test, False)
test_mean_var.to_csv('data/mean_var_8days_test.csv', index=False)
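
The per-field statistics can be illustrated on a toy raster. The sketch below assumes that each field is identified by a mask of field ids with the same shape as the image band, and that mean and variance are taken over the pixels of each field. The arrays are made up, and `crul.create_mean_var` may crop and aggregate differently.

```python
import numpy as np

# Toy 4x4 single-band image and a field-id mask of the same shape
# (0 marks pixels outside any field); all values are made up.
band = np.array([[1, 2, 9, 9],
                 [3, 4, 9, 9],
                 [0, 0, 5, 7],
                 [0, 0, 5, 7]], dtype=float)
field_ids = np.array([[11, 11, 0, 0],
                      [11, 11, 0, 0],
                      [0, 0, 22, 22],
                      [0, 0, 22, 22]])

stats = {}
for fid in np.unique(field_ids):
    if fid == 0:
        continue                       # skip pixels that belong to no field
    pixels = band[field_ids == fid]    # all pixel values of this field
    stats[int(fid)] = {"mean": float(pixels.mean()), "var": float(pixels.var())}

print(stats)
```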