# Crop Yield Prediction - ML Baseline

This notebook predicts crop yield from WOFOST crop growth indicators, weather variables, geographic information, soil data and remote sensing indicators.

## Google Colab Notes

**To run the notebook in the Google Colab environment**
1. Download the data directory and save it somewhere convenient.
2. Open the notebook in Google Colaboratory.
3. Create a copy of the notebook for yourself.
4. Click *Connect* on the right-hand side of the toolbar below the menu. Once you are connected to a machine, you will see a green tick mark and bars showing RAM and disk usage.
5. Click the folder icon in the left sidebar and click upload. Upload the data files you downloaded. Click *Ok* when a warning says the files will be deleted after the session is disconnected.
6. Use the *Runtime* -> *Run before* option to run all cells before **Set Configuration**.
7. Run the remaining cells except **Python Script Main**. The configuration subsection lets you change the configuration and rerun experiments.
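
Before running the workflow, it can help to confirm that the upload in step 5 succeeded. The sketch below checks a directory for the input files that the data-loading step reports later in this notebook, assuming the `<SOURCE>_<NUTS level>_<country>.csv` naming seen in that log. `missing_data_files` is a hypothetical helper, not part of `cypml`.

```python
import os

def missing_data_files(data_path, sources, nuts_level="NUTS2", country="NL"):
    """Return the expected CSV file names that are not present in data_path."""
    expected = [f"{src}_{nuts_level}_{country}.csv" for src in sources]
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_path, name))]

# Source names as they appear in the data-loading log of this notebook.
sources = ["WOFOST", "METEO_DAILY", "SOIL", "YIELD", "REMOTE_SENSING"]
missing = missing_data_files(".", sources)
print("Missing files:", missing if missing else "none")
```

An empty result means every expected file is in place; anything listed needs to be uploaded again.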


## Install Spark

Install Java, Spark and the helper packages `findspark` and `joblibspark`. Package installation is required only in Google Colab.

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -c -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
!tar xf spark-3.0.0-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install joblibspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop2.7"

import findspark
findspark.init()

import pyspark
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

conf = SparkConf().setMaster('local[*]')
conf.set('spark.executor.memory', '12g')
conf.set('spark.driver.memory', '6g')
conf.set('spark.sql.execution.arrow.pyspark.enabled', True)

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession
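
The download URL, the archive name and `SPARK_HOME` in the cell above all embed the same Spark and Hadoop versions, so changing one without the others breaks the setup. A minimal sketch that derives all three from one place; `spark_paths` is a hypothetical helper, and `/content` is the install directory the cell above assumes.

```python
def spark_paths(spark_version="3.0.0", hadoop_version="2.7", install_dir="/content"):
    """Build the archive name, download URL and SPARK_HOME for one Spark release."""
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    return {
        "archive": f"{name}.tgz",
        "url": f"https://archive.apache.org/dist/spark/spark-{spark_version}/{name}.tgz",
        "spark_home": f"{install_dir}/{name}",
    }

paths = spark_paths()
print(paths["url"])
print(paths["spark_home"])
```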



## Install MLBaseline package

The cell below installs the `cypml` package from the TestPyPI index with `pip`. Rerun it only if you need to reinstall the package; you will have to restart the runtime after reinstalling.


In [None]:
! pip install -i https://test.pypi.org/simple/ cypml

Looking in indexes: https://test.pypi.org/simple/


## Run Workflow

### Set Configuration


In [None]:
from cypml.common import globals
from cypml.common.config import CYPConfiguration
from cypml.tests.test_util import TestUtil
from cypml.common.util import getLogFilename

cyp_config = CYPConfiguration()
run_tests = globals.run_tests

if (run_tests):
  test_util = TestUtil(spark)
  test_util.runAllTests()

my_config = {
      'crop_name' : 'sugarbeet',
      'season_crosses_calendar_year' : 'N',
      'country_code' : 'NL',
      'data_sources' : [ 'WOFOST', 'METEO_DAILY', 'SOIL', 'YIELD'],
      'data_path' : '.',
      'output_path' : '.',
      'nuts_level' : 'NUTS2',
      'use_yield_trend' : 'Y',
      'predict_yield_residuals' : 'N',
      'trend_windows' : [5, 7, 10],
      'use_centroids' : 'N',
      'use_remote_sensing' : 'Y',
      'early_season_prediction' : 'N',
      'early_season_end_dekad' : 0,
      'save_features' : 'N',
      'use_saved_features' : 'N',
      'save_predictions' : 'N',
      'use_saved_predictions' : 'N',
      'compare_with_mcyfs' : 'Y',
      'debug_level' : 2,
}

cyp_config.updateConfiguration(my_config)
crop = cyp_config.getCropName()
country = cyp_config.getCountryCode()
nuts_level = cyp_config.getNUTSLevel()
debug_level = cyp_config.getDebugLevel()
use_saved_predictions = cyp_config.useSavedPredictions()
use_saved_features = cyp_config.useSavedFeatures()
use_yield_trend = cyp_config.useYieldTrend()
early_season_prediction = cyp_config.earlySeasonPrediction()
early_season_end = cyp_config.getEarlySeasonEndDekad()

print('##################')
print('# Configuration  #')
print('##################')
output_path = cyp_config.getOutputPath()
log_file = getLogFilename(crop, country, use_yield_trend,
                          early_season_prediction, early_season_end)
log_fh = open(output_path + '/' + log_file, 'w+')
cyp_config.printConfig(log_fh)

##################
# Configuration  #
##################

Current ML Baseline Configuration
--------------------------------
Crop name: sugarbeet
Crop ID: 6
Crop growing season crosses calendar year boundary: N
Country code (e.g. NL): NL
NUTS level for yield prediction: NUTS2
Input data sources: WOFOST, METEO_DAILY, SOIL, YIELD, REMOTE_SENSING
Estimate and use yield trend: Y
Predict yield residuals instead of full yield: N
Find optimal trend window: N
List of trend window lengths (number of years): 5, 7, 10
Use centroid coordinates and distance to coast: N
Use remote sensing data (FAPAR): Y
Predict yield early in the season: N
End dekad for early season prediction: 0
Path to all input data. Default is current directory.: .
Path to all output files. Default is current directory.: .
Save features to a CSV file: N
Use features from a CSV file: Y
Save predictions to a CSV file: N
Use predictions from a CSV file: N
Compare predictions with MARS Crop Yield Forecasting System: Y
Debug level t
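
The configuration stores switches as `'Y'`/`'N'` strings, while the cells in this notebook test values such as `use_saved_features` with `not`. That only works if the getters return real booleans, because any non-empty string is truthy in Python. A sketch of such a conversion; `yes_no_to_bool` is a hypothetical helper, and how `CYPConfiguration` does this internally is not shown in this notebook.

```python
def yes_no_to_bool(value):
    """Map 'Y'/'N' flags (case-insensitive) to booleans; reject anything else."""
    flag = str(value).strip().upper()
    if flag not in ("Y", "N"):
        raise ValueError(f"expected 'Y' or 'N', got {value!r}")
    return flag == "Y"

my_flags = {"use_yield_trend": "Y", "use_saved_features": "N"}
parsed = {key: yes_no_to_bool(val) for key, val in my_flags.items()}
print(parsed)  # {'use_yield_trend': True, 'use_saved_features': False}

# The pitfall this avoids: a non-empty string is always truthy.
print(bool("N"))  # True
```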

### Load and Preprocess Data


In [None]:
from cypml.workflow.data_loading import CYPDataLoader
from cypml.workflow.data_preprocessing import CYPDataPreprocessor
from cypml.tests.test_data_loading import TestDataLoader
from cypml.tests.test_data_preprocessing import TestDataPreprocessor
from cypml.run_workflow.run_data_preprocessing import preprocessData

if ((not use_saved_predictions) and
    (not use_saved_features)):

  print('#################')
  print('# Data Loading  #')
  print('#################')

  if (run_tests):
    test_loader = TestDataLoader(spark)
    test_loader.runAllTests()

  cyp_loader = CYPDataLoader(spark, cyp_config)
  data_dfs = cyp_loader.loadAllData()

  print('#######################')
  print('# Data Preprocessing  #')
  print('#######################')

  if (run_tests):
    test_preprocessor = TestDataPreprocessor(spark)
    test_preprocessor.runAllTests()

  cyp_preprocessor = CYPDataPreprocessor(spark, cyp_config)
  data_dfs = preprocessData(cyp_config, cyp_preprocessor, data_dfs)

#################
# Data Loading  #
#################
Data file name "./WOFOST_NUTS2_NL.csv"
Data file name "./METEO_DAILY_NUTS2_NL.csv"
Data file name "./SOIL_NUTS2_NL.csv"
Data file name "./YIELD_NUTS2_NL.csv"
Data file name "./REMOTE_SENSING_NUTS2_NL.csv"
Loaded data: WOFOST, METEO, SOIL, YIELD, REMOTE_SENSING


#######################
# Data Preprocessing  #
#######################
WOFOST data available for 12 region(s)
Season end information
+--------+-----+---------------+----------+
|IDREGION|FYEAR|PREV_SEASON_END|SEASON_END|
+--------+-----+---------------+----------+
|    NL11| 1979|              0|        36|
|    NL11| 1980|             36|        34|
|    NL11| 1981|             34|        34|
|    NL11| 1982|             34|        34|
|    NL11| 1983|             34|        36|
|    NL11| 1984|             36|        36|
|    NL11| 1985|             36|        36|
|    NL11| 1986|             36|        36|
|    NL11| 1987|             36|        36|
|    NL11| 1988|     
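
In the table above, `PREV_SEASON_END` for a given year equals `SEASON_END` of the preceding year in the same region, with 0 for the first year. A plain-Python sketch of that lag, independent of how `cypml` computes it in Spark; `add_prev_season_end` is a hypothetical helper.

```python
def add_prev_season_end(rows):
    """rows: list of (region, year, season_end) tuples.

    Returns (region, year, prev_season_end, season_end) tuples sorted by
    region and year, with prev_season_end = 0 for a region's first year.
    """
    result = []
    last_end = {}  # region -> SEASON_END of the most recent year seen
    for region, year, season_end in sorted(rows):
        result.append((region, year, last_end.get(region, 0), season_end))
        last_end[region] = season_end
    return result

# First four rows of the season end table above.
rows = [("NL11", 1979, 36), ("NL11", 1980, 34), ("NL11", 1981, 34), ("NL11", 1982, 34)]
for row in add_prev_season_end(rows):
    print(row)
```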

### Split Data into Training and Test Sets

In [None]:
from cypml.tests.test_train_test_split import TestCustomTrainTestSplit
from cypml.run_workflow.run_train_test_split import splitDataIntoTrainingTestSets

if ((not use_saved_predictions) and
    (not use_saved_features)):

  print('###########################')
  print('# Training and Test Split #')
  print('###########################')

  if (run_tests):
    yield_df = data_dfs['YIELD']
    test_custom = TestCustomTrainTestSplit(yield_df)
    test_custom.runAllTests()

  prep_train_test_dfs, test_years = splitDataIntoTrainingTestSets(cyp_config, data_dfs, log_fh)

###########################
# Training and Test Split #
###########################

Test years: 2012, 2013, 2014, 2015, 2016, 2017, 2018
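
The log suggests the split is by year rather than by random row: every region's records for the listed test years go to the test set and the remaining years go to training. A sketch of that idea with the test years from the log; `split_by_year` is a hypothetical helper and the rows are placeholders without real yield values.

```python
def split_by_year(rows, test_years):
    """rows: list of dicts with an 'FYEAR' key. Returns (train_rows, test_rows)."""
    test_years = set(test_years)
    train = [r for r in rows if r["FYEAR"] not in test_years]
    test = [r for r in rows if r["FYEAR"] in test_years]
    return train, test

test_years = range(2012, 2019)  # 2012..2018, as in the log above
# Placeholder rows: 12 regions x years 1994..2018.
rows = [{"IDREGION": f"R{i:02d}", "FYEAR": y}
        for i in range(12) for y in range(1994, 2019)]
train, test = split_by_year(rows, test_years)
print(len(train), len(test))  # 216 84
```

With 12 regions and data from 1994 onward, this gives 12 x 18 = 216 training rows and 12 x 7 = 84 test rows, the same counts the notebook reports after the WOFOST features are included.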



### Summarize Data

In [None]:
from cypml.tests.test_data_summary import TestDataSummarizer
from cypml.workflow.data_summary import CYPDataSummarizer
from cypml.run_workflow.run_data_summary import summarizeData

if ((not use_saved_predictions) and
    (not use_saved_features)):

  print('#################')
  print('# Data Summary  #')
  print('#################')

  if (run_tests):
    test_summarizer = TestDataSummarizer(spark)
    test_summarizer.runAllTests()

  cyp_summarizer = CYPDataSummarizer(cyp_config)
  summary_dfs = summarizeData(cyp_config, cyp_summarizer, prep_train_test_dfs)

#################
# Data Summary  #
#################
Crop calender information based on WOFOST data
+--------+-------------+---------+----------+----------+----------------+
|IDREGION|CAMPAIGN_YEAR|START_DVS|START_DVS1|START_DVS2|EARLY_SEASON_END|
+--------+-------------+---------+----------+----------+----------------+
|    NL11|         2011|       13|        19|        30|              36|
|    NL12|         2011|       15|        20|        31|              36|
|    NL13|         2011|       13|        19|        29|              36|
|    NL21|         2011|       11|        17|        28|              36|
|    NL22|         2011|       11|        17|        27|              36|
|    NL23|         2011|       12|        17|        28|              36|
|    NL31|         2011|       12|        17|        27|              36|
|    NL32|         2011|       12|        17|        28|              36|
|    NL33|         2011|       12|        17|        27|              36|
|    NL34| 
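
The calendar columns above appear to be dekad numbers, in line with the `early_season_end_dekad` setting. A dekad is a period of roughly ten days, three per month, so a year has 36 and a value of 36 is the last dekad of December. A sketch of the date-to-dekad conversion under that standard definition, where the third dekad of a month covers day 21 to the end of the month; `date_to_dekad` is a hypothetical helper, not a `cypml` function.

```python
from datetime import date

def date_to_dekad(d):
    """Return the dekad of the year (1..36) that contains date d."""
    # 0, 1 or 2 within the month; days 21 and later share the third dekad.
    dekad_in_month = min((d.day - 1) // 10, 2)
    return (d.month - 1) * 3 + dekad_in_month + 1

print(date_to_dekad(date(2011, 1, 1)))    # 1
print(date_to_dekad(date(2011, 5, 5)))    # 13
print(date_to_dekad(date(2011, 12, 31)))  # 36
```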

### Create Features

In [None]:
from cypml.workflow.feature_design import CYPFeaturizer
from cypml.run_workflow.run_feature_design import createFeatures
from cypml.tests.test_yield_trend import TestYieldTrendEstimator
from cypml.workflow.yield_trend import CYPYieldTrendEstimator
from cypml.run_workflow.run_trend_feature_design import createYieldTrendFeatures

if ((not use_saved_predictions) and
    (not use_saved_features)):

  print('###################')
  print('# Feature Design  #')
  print('###################')

  cyp_featurizer = CYPFeaturizer(cyp_config)
  # WOFOST, Meteo and Remote Sensing Features
  pd_feature_dfs = createFeatures(cyp_config, cyp_featurizer,
                                  prep_train_test_dfs, summary_dfs, log_fh)

  # yield trend features
  if (use_yield_trend):
    yield_train_df = prep_train_test_dfs['YIELD'][0]
    yield_test_df = prep_train_test_dfs['YIELD'][1]

    if (run_tests):
      test_yield_trend = TestYieldTrendEstimator(yield_train_df)
      test_yield_trend.runAllTests()

    cyp_trend_est = CYPYieldTrendEstimator(cyp_config)
    pd_yield_train_ft, pd_yield_test_ft = createYieldTrendFeatures(cyp_config, cyp_trend_est,
                                                                   yield_train_df, yield_test_df,
                                                                   test_years)
    pd_feature_dfs['YIELD_TREND'] = [pd_yield_train_ft, pd_yield_test_ft]

###################
# Feature Design  #
###################
Yield min year 1994

 WOFOST Aggregate Features: Training
  IDREGION  FYEAR  maxWLIM_YBp2  ...  maxWLAIp4  avgRSMp2  avgRSMp4
0     NL42   1995       2615.32  ...       3.68     89.34     33.48
1     NL34   2000       3082.30  ...       4.60     91.75     69.31
2     NL13   2006       3007.10  ...       4.16     90.38     66.29
3     NL12   1999       2393.70  ...       5.82     94.75     66.60
4     NL22   2004       3378.76  ...       5.67     91.03     79.29

[5 rows x 11 columns]

 WOFOST Aggregate Features: Test
  IDREGION  FYEAR  maxWLIM_YBp2  ...  maxWLAIp4  avgRSMp2  avgRSMp4
0     NL13   2016       2175.01  ...       4.36     95.45     64.84
1     NL32   2014       2763.33  ...       4.26     93.49     65.14
2     NL21   2018       4368.62  ...       3.10     80.75     29.71
3     NL42   2015       3801.85  ...       4.16     82.57     60.24
4     NL33   2012       2129.41  ...       4.85    103.96     77.55

[5 rows 
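
When `use_yield_trend` is enabled, the cell above also builds yield trend features from past yields. One common way to obtain a trend value, shown here as an assumption rather than as the method `cypml` uses, is to fit a straight line to the last `window` years of yield and extrapolate it to the target year. The lengths in `trend_windows` (5, 7, 10) would be candidate values for `window`. `linear_trend_forecast` is a hypothetical helper.

```python
def linear_trend_forecast(years, yields, target_year, window=5):
    """Fit y = a + b*year by least squares on the last `window` points
    and evaluate the fitted line at target_year."""
    xs, ys = list(years)[-window:], list(yields)[-window:]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two years to fit a trend")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y + slope * (target_year - mean_x)

# Illustrative numbers only (a perfectly linear series), not data from this notebook.
years = [2007, 2008, 2009, 2010, 2011]
yields = [50.0, 52.0, 54.0, 56.0, 58.0]
print(linear_trend_forecast(years, yields, 2012))  # 60.0
```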

### Combine Features and Labels

In [None]:
from cypml.run_workflow.combine_features import combineFeaturesLabels

if ((not use_saved_predictions) and
    (not use_saved_features)):

  join_cols = ['IDREGION', 'FYEAR']
  pd_train_df, pd_test_df = combineFeaturesLabels(cyp_config, sqlContext,
                                                  prep_train_test_dfs, pd_feature_dfs,
                                                  join_cols, log_fh)


Combine Features and Labels
---------------------------
Yield min year 1994

Data size after including SOIL data: 
Train 12 rows.
Test 12 rows.

Data size after including WOFOST features: 
Train 216 rows.
Test 84 rows.

Data size after including METEO features: 
Train 216 rows.
Test 84 rows.

Data size after including REMOTE_SENSING features: 
Train 143 rows.
Test 77 rows.

Data size after including yield trend features: 
Train 143 rows.
Test 77 rows.

Data size after including yield (label) data: 
Train 143 rows.
Test 77 rows.


All Features and labels: Training
   IDREGION  FYEAR  SM_WHC  ...    YIELD-1  YIELD_TREND      YIELD
51     NL11   1999    0.22  ...  47.000000        50.01  60.299999
42     NL11   2000    0.22  ...  60.299999        55.92  59.099998
49     NL11   2001    0.22  ...  59.099998        60.46  54.400002
48     NL11   2002    0.22  ...  54.400002        58.15  55.299999
47     NL11   2003    0.22  ...  55.299999        58.43  59.200001

[5 rows x 61 columns]

All
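
The row counts above drop from 216 to 143 training rows once remote sensing features are included. That is what an inner join on `join_cols` produces when one source lacks some region-year pairs. A plain-Python sketch of such a join; the join type actually used by `combineFeaturesLabels` is an assumption here, and `inner_join` is a hypothetical helper that expects unique keys on the right side.

```python
def inner_join(left, right, keys=("IDREGION", "FYEAR")):
    """Join two lists of dicts, keeping only rows whose key tuple occurs in both."""
    index = {tuple(r[k] for k in keys): r for r in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**row, **match})
    return joined

# Placeholder values for illustration only.
wofost = [{"IDREGION": "NL11", "FYEAR": 1998, "f1": 1.0},
          {"IDREGION": "NL11", "FYEAR": 1999, "f1": 2.0}]
remote = [{"IDREGION": "NL11", "FYEAR": 1999, "f2": 0.5}]
print(inner_join(wofost, remote))
# [{'IDREGION': 'NL11', 'FYEAR': 1999, 'f1': 2.0, 'f2': 0.5}]
```

The 1998 row disappears because the second source has no matching key, which is how a shorter data source shrinks the combined table.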

### Apply Machine Learning using scikit-learn


In [None]:
from cypml.run_workflow.load_saved_features import loadSavedFeaturesLabels
from cypml.run_workflow.run_machine_learning import getMachineLearningPredictions
from cypml.run_workflow.run_machine_learning import saveMLPredictions

if ((not use_saved_predictions) and
    (use_saved_features)):
    pd_train_df, pd_test_df = loadSavedFeaturesLabels(cyp_config, spark)

if ((not use_saved_predictions)):
  print('\n###################################')
  print('# Machine Learning using sklearn  #')
  print('###################################')

  pd_ml_predictions = getMachineLearningPredictions(cyp_config, pd_train_df, pd_test_df, log_fh)
  save_predictions = cyp_config.savePredictions()
  if (save_predictions):
    saveMLPredictions(cyp_config, sqlContext, pd_ml_predictions)


All Features and labels
  IDREGION  FYEAR  SM_WHC  AVG_ELEV  ...  YIELD-2  YIELD-1  YIELD_TREND  YIELD
0     NL11   1999    0.22  1.761161  ...     55.7     47.0        50.01   60.3
1     NL11   2000    0.22  1.761161  ...     47.0     60.3        55.92   59.1
2     NL11   2001    0.22  1.761161  ...     60.3     59.1        60.46   54.4
3     NL11   2002    0.22  1.761161  ...     59.1     54.4        58.15   55.3
4     NL11   2003    0.22  1.761161  ...     54.4     55.3        58.43   59.2

[5 rows x 56 columns]
  IDREGION  FYEAR  SM_WHC  AVG_ELEV  ...  YIELD-2  YIELD-1  YIELD_TREND  YIELD
0     NL11   2012    0.22  1.761161  ...     71.2     75.8        77.87   75.2
1     NL11   2013    0.22  1.761161  ...     75.8     75.2        76.46   74.6
2     NL11   2014    0.22  1.761161  ...     75.2     74.6        75.40   86.8
3     NL11   2015    0.22  1.761161  ...     74.6     86.8        85.72   76.0
4     NL11   2016    0.22  1.761161  ...     86.8     76.0        81.28   74.8

[5 
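
Two flags decide which cells do real work. With `use_saved_predictions` set, everything before the comparison is skipped and predictions are loaded from file. With only `use_saved_features` set, the notebook goes straight to machine learning on features loaded from CSV. The sketch below restates the `if` guards used in the cells of this notebook as one function; `stages_to_run` and the stage names are hypothetical labels, and the final comparison is left out because it depends on `compare_with_mcyfs` instead.

```python
def stages_to_run(use_saved_predictions, use_saved_features):
    """Mirror the notebook's cell guards and list the stages that execute."""
    stages = []
    if not use_saved_predictions and not use_saved_features:
        stages += ["load_data", "preprocess", "train_test_split",
                   "summarize", "create_features", "combine_features"]
    if not use_saved_predictions and use_saved_features:
        stages.append("load_saved_features")
    if not use_saved_predictions:
        stages.append("machine_learning")
    if use_saved_predictions:
        stages.append("load_saved_predictions")
    return stages

print(stages_to_run(False, True))   # ['load_saved_features', 'machine_learning']
print(stages_to_run(True, False))   # ['load_saved_predictions']
```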

### Compare Predictions with JRC Predictions

In [None]:
from cypml.run_workflow.load_saved_predictions import loadSavedPredictions
from cypml.run_workflow.compare_with_mcyfs import comparePredictionsWithMCYFS

if (use_saved_predictions):
  pd_ml_predictions = loadSavedPredictions(cyp_config, spark)

compareWithMCYFS = cyp_config.compareWithMCYFS()
if (compareWithMCYFS):
  comparePredictionsWithMCYFS(sqlContext, cyp_config, pd_ml_predictions, log_fh)

log_fh.close()

##############
# Load Data  #
##############
Data file name "./AREA_FRACTIONS_NUTS2_NL.csv"
Data file name "./AREA_FRACTIONS_NUTS1_NL.csv"
Data file name "./YIELD_NUTS0_NL.csv"
Data file name "./YIELD_PRED_MCYFS_NUTS0_NL.csv"
Loaded data: AREA_FRACTIONS, YIELD, YIELD_PRED_MCYFS


####################
# Preprocess Data  #
####################
NUTS0 Yield before preprocessing
+------+--------+-----+-----+
|  CROP|IDREGION|FYEAR|YIELD|
+------+--------+-----+-----+
|potato|      NL| 1971| 37.3|
|potato|      NL| 1972| 37.5|
|potato|      NL| 1973| 36.8|
|potato|      NL| 1974| 38.4|
|potato|      NL| 1975| 33.1|
|potato|      NL| 1976| 29.8|
|potato|      NL| 1977| 33.8|
|potato|      NL| 1978| 38.6|
|potato|      NL| 1979| 37.8|
|potato|      NL| 1980| 36.3|
+------+--------+-----+-----+
only showing top 10 rows

NUTS0 Yield after preprocessing
+--------+-----+-----+
|IDREGION|FYEAR|YIELD|
+--------+-----+-----+
|      NL| 1971| 49.1|
|      NL| 1972| 43.9|
|      NL| 1973| 47.7|
|      
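
The MCYFS predictions are loaded at national (NUTS0) level, while the ML predictions are made per NUTS2 region, so the comparison needs the regional predictions aggregated upward. The `AREA_FRACTIONS` files loaded above suggest an area-weighted average. A sketch of that aggregation under this assumption, with made-up numbers; `area_weighted_yield` is a hypothetical helper, not the `cypml` implementation.

```python
def area_weighted_yield(predictions, area_fractions):
    """Aggregate regional yields to the parent region, weighting by crop area fraction."""
    total = sum(area_fractions[region] for region in predictions)
    if total <= 0:
        raise ValueError("area fractions must sum to a positive value")
    weighted = sum(y * area_fractions[region] for region, y in predictions.items())
    return weighted / total

# Illustrative values only, not taken from the data files.
predictions = {"NL11": 80.0, "NL12": 70.0}
area_fractions = {"NL11": 0.25, "NL12": 0.75}
print(area_weighted_yield(predictions, area_fractions))  # 72.5
```

Dividing by the sum of the fractions keeps the result a proper average even when some regions have no prediction for a given year.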