# Crop Yield Prediction - ML Baseline

We use WOFOST crop growth indicators, weather variables, geographic information, soil data, and remote sensing indicators to predict crop yield.

## Google Colab Notes

**To run the notebook in the Google Colab environment:**
1. Download the data directory and save it somewhere convenient.
2. Open the notebook in Google Colaboratory.
3. Create a copy of the notebook for yourself.
4. Click *Connect* on the right-hand side of the bar below the menu items. Once you are connected to a machine, you will see a green tick mark and bars showing RAM and disk usage.
5. Click the folder icon in the left sidebar and click *Upload*. Upload the data files you downloaded. Click *Ok* when a warning says the files will be deleted after the session is disconnected.
6. Use the *Runtime* -> *Run before* option to run all cells before **Set Configuration**.
7. Run the remaining cells except **Python Script Main**. The **Set Configuration** subsection lets you change the configuration and rerun experiments.
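
After uploading the files in step 5, a quick check can confirm that they are where the notebook will look for them. The sketch below assumes the file naming pattern `<SOURCE>_<NUTS level>_<country>.csv` seen in the data loading log of this notebook (for example `WOFOST_NUTS2_NL.csv`). The helpers `expected_data_files` and `missing_data_files` are hypothetical and not part of cypml.

```python
import os

def expected_data_files(sources, nuts_level, country_code):
    """Build file names of the form SOURCE_NUTSLEVEL_COUNTRY.csv (assumed pattern)."""
    return [f"{src}_{nuts_level}_{country_code}.csv" for src in sources]

def missing_data_files(data_path, sources, nuts_level, country_code):
    """Return the expected file names that are not present in data_path."""
    names = expected_data_files(sources, nuts_level, country_code)
    return [n for n in names if not os.path.isfile(os.path.join(data_path, n))]

# Source names taken from the data loading log; adjust to your configuration.
sources = ["WOFOST", "METEO_DAILY", "SOIL", "YIELD", "REMOTE_SENSING"]
missing = missing_data_files(".", sources, "NUTS2", "NL")
if missing:
    print("Missing data files:", ", ".join(missing))
else:
    print("All expected data files found.")
```

Running this before the workflow cells gives an early, readable message instead of a loading error later on.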


## Install Spark

Install the PySpark package. Package installation is required only in Google Colab.

In [None]:
!pip install pyspark > /dev/null
!sudo apt update > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install joblibspark > /dev/null

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

import pyspark
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

SparkContext.setSystemProperty('spark.executor.memory', '12g')
SparkContext.setSystemProperty('spark.driver.memory', '6g')
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
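
Since the installation is needed only in Google Colab, the install step could be guarded so that the same notebook also runs elsewhere. The following is a minimal sketch, assuming that the `google.colab` module is loaded only inside Colab (a commonly used check, not an official guarantee). The helper names are hypothetical.

```python
import sys

def running_in_colab():
    """Return True if the google.colab module is loaded, as it is inside Colab."""
    return "google.colab" in sys.modules

def pip_install_command(packages):
    """Build (but do not run) a pip command for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", "--quiet", *packages]

if running_in_colab():
    # Only install when the check above says we are inside Colab.
    import subprocess
    subprocess.check_call(pip_install_command(["pyspark", "joblibspark"]))
else:
    print("Not running in Colab; skipping package installation.")
```

Using `sys.executable -m pip` installs into the interpreter that runs the notebook, which avoids mismatches when several Python installations are present.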





## Install MLBaseline Improvements

Install the cypml package from PyPI. Versions 1.1.* include the baseline improvements.


In [None]:
! pip install cypml==1.1.8

Collecting cypml==1.1.4
  Downloading https://files.pythonhosted.org/packages/7e/3c/eca4bd7099f4b640ce04513d6ab8581e2f9196c8065077a28e38f0a4cb21/cypml-1.1.4-py3-none-any.whl (60kB)
Installing collected packages: cypml
Successfully installed cypml-1.1.4


## Run Workflow

### Set Configuration


In [None]:
from cypml.common import globals
from cypml.common.config import CYPConfiguration
from cypml.tests.test_util import TestUtil
from cypml.common.util import getLogFilename

cyp_config = CYPConfiguration()
run_tests = globals.run_tests

if (run_tests):
  test_util = TestUtil(spark)
  test_util.runAllTests()

my_config = {
      'crop_name' : 'soft wheat',
      'season_crosses_calendar_year' : 'Y',
      'country_code' : 'NL',
      'data_sources' : [ 'WOFOST', 'METEO_DAILY', 'SOIL', 'YIELD'],
      'clean_data' : 'N',
      'data_path' : '.',
      'output_path' : '.',
      'nuts_level' : 'NUTS2',
      'use_yield_trend' : 'Y',
      'predict_yield_residuals' : 'N',
      'trend_windows' : [5, 7, 10],
      'use_centroids' : 'N',
      'use_remote_sensing' : 'Y',
      'use_gaes' : 'N',
      'use_per_year_crop_calendar' : 'N',
      'early_season_prediction' : 'N',
      'early_season_end_dekad' : 0,
      'use_features_v2' : 'N',
      'save_features' : 'N',
      'use_saved_features' : 'N',
      'retrain_per_test_year' : 'N',
      'save_predictions' : 'N',
      'use_saved_predictions' : 'N',
      'compare_with_mcyfs' : 'Y',
      'debug_level' : 2,
}

cyp_config.updateConfiguration(my_config)
crop = cyp_config.getCropName()
country = cyp_config.getCountryCode()
nuts_level = cyp_config.getNUTSLevel()
debug_level = cyp_config.getDebugLevel()
use_saved_predictions = cyp_config.useSavedPredictions()
use_saved_features = cyp_config.useSavedFeatures()
use_yield_trend = cyp_config.useYieldTrend()
early_season_prediction = cyp_config.earlySeasonPrediction()
early_season_end = cyp_config.getEarlySeasonEndDekad()

print('##################')
print('# Configuration  #')
print('##################')
output_path = cyp_config.getOutputPath()
log_file = getLogFilename(crop, country, use_yield_trend,
                          early_season_prediction, early_season_end)
log_fh = open(output_path + '/' + log_file, 'w+')
cyp_config.printConfig(log_fh)

##################
# Configuration  #
##################

Current ML Baseline Configuration
--------------------------------
Crop name: soft wheat
Crop ID: 90
Crop growing season crosses calendar year boundary: Y
Country code (e.g. NL): NL
NUTS level for yield prediction: NUTS2
Input data sources: WOFOST, METEO_DAILY, SOIL, YIELD, REMOTE_SENSING
Remove data or regions with duplicate or missing values: N
Estimate and use yield trend: Y
Predict yield residuals instead of full yield: N
Find optimal trend window: N
List of trend window lengths (number of years): 5, 7, 10
Use centroid coordinates and distance to coast: N
Use remote sensing data (FAPAR): Y
Use agro-environmental zones data: N
Use per region per year crop calendar: N
Predict yield early in the season: N
Early season end dekad relative to harvest: 0
Path to all input data. Default is current directory.: .
Path to all output files. Default is current directory.: .
Use feature design v2: N
Save features to a CSV file: N
Use feat
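
The configuration is a plain dictionary in which most switches are `'Y'`/`'N'` strings. The sketch below shows one way to guard against a mistyped key or an invalid flag value before calling `updateConfiguration`. It is an illustration only: `merge_config`, `flag_to_bool` and the `defaults` dictionary are hypothetical, and whether cypml performs similar checks internally is not shown in this notebook.

```python
def merge_config(defaults, overrides):
    """Return a new dict with overrides applied; reject keys not in defaults."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown configuration keys: {unknown}")
    merged = dict(defaults)
    merged.update(overrides)
    return merged

def flag_to_bool(value):
    """Convert a 'Y'/'N' flag to a boolean; anything else is an error."""
    if value not in ("Y", "N"):
        raise ValueError(f"Expected 'Y' or 'N', got {value!r}")
    return value == "Y"

# Illustrative defaults only; the real defaults live inside cypml.
defaults = {"use_yield_trend": "N", "trend_windows": [5], "debug_level": 0}
config = merge_config(defaults, {"use_yield_trend": "Y", "trend_windows": [5, 7, 10]})
print(flag_to_bool(config["use_yield_trend"]), config["trend_windows"])
```

Keys that are not overridden keep their default values, so a small dictionary of changes is enough to describe an experiment.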

### Load and Preprocess Data


In [None]:
from cypml.workflow.data_loading import CYPDataLoader
from cypml.workflow.data_preprocessing import CYPDataPreprocessor
from cypml.tests.test_data_loading import TestDataLoader
from cypml.tests.test_data_preprocessing import TestDataPreprocessor
from cypml.run_workflow.run_data_preprocessing import preprocessData

if ((not use_saved_predictions) and
    (not use_saved_features)):

  print('#################')
  print('# Data Loading  #')
  print('#################')

  if (run_tests):
    test_loader = TestDataLoader(spark)
    test_loader.runAllTests()

  cyp_loader = CYPDataLoader(spark, cyp_config)
  data_dfs = cyp_loader.loadAllData()

  print('#######################')
  print('# Data Preprocessing  #')
  print('#######################')

  if (run_tests):
    test_preprocessor = TestDataPreprocessor(spark)
    test_preprocessor.runAllTests()

  cyp_preprocessor = CYPDataPreprocessor(spark, cyp_config)
  data_dfs = preprocessData(cyp_config, cyp_preprocessor, data_dfs)

#################
# Data Loading  #
#################
Data file name "./WOFOST_NUTS2_NL.csv"
Data file name "./METEO_DAILY_NUTS2_NL.csv"
Data file name "./SOIL_NUTS2_NL.csv"
Data file name "./YIELD_NUTS2_NL.csv"
Data file name "./REMOTE_SENSING_NUTS2_NL.csv"
Loaded data: WOFOST, METEO, SOIL, YIELD, REMOTE_SENSING


#######################
# Data Preprocessing  #
#######################
WOFOST data available for 12 region(s)
Season end information
+--------+-----+---------------+----------+
|IDREGION|FYEAR|PREV_SEASON_END|SEASON_END|
+--------+-----+---------------+----------+
|    NL11| 1979|              0|        24|
|    NL11| 1980|             24|        24|
|    NL11| 1981|             24|        23|
|    NL11| 1982|             23|        23|
|    NL11| 1983|             23|        25|
|    NL11| 1984|             25|        25|
|    NL11| 1985|             25|        24|
|    NL11| 1986|             24|        25|
|    NL11| 1987|             25|        25|
|    NL11| 1988|     
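
The season end values in the table above appear to be dekad numbers, that is, 10-day periods counted from the start of the year. A minimal sketch of the conversion from a calendar date to a dekad, assuming the common convention of three dekads per month (days 1-10, 11-20, and 21 to the end of the month). Whether cypml numbers dekads in exactly this way is an assumption, and `date_to_dekad` is a hypothetical helper.

```python
from datetime import date

def date_to_dekad(d):
    """Map a date to a dekad number 1..36 (three dekads per month)."""
    # Days 1-10 -> first dekad, 11-20 -> second, 21 to month end -> third.
    dekad_in_month = min((d.day - 1) // 10, 2)
    return (d.month - 1) * 3 + dekad_in_month + 1

print(date_to_dekad(date(2018, 8, 25)))  # → 24 (third dekad of August)
```

Under this convention a season end of 24 corresponds to the last third of August.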

### Split Data into Training and Test Sets

In [None]:
from cypml.tests.test_train_test_split import TestCustomTrainTestSplit
from cypml.run_workflow.run_train_test_split import splitDataIntoTrainingTestSets

if ((not use_saved_predictions) and
    (not use_saved_features)):

  print('###########################')
  print('# Training and Test Split #')
  print('###########################')

  if (run_tests):
    yield_df = data_dfs['YIELD']
    test_custom = TestCustomTrainTestSplit(yield_df)
    test_custom.runAllTests()

  prep_train_test_dfs, test_years = splitDataIntoTrainingTestSets(cyp_config, data_dfs, log_fh)

###########################
# Training and Test Split #
###########################

Test years: 2012, 2013, 2014, 2015, 2016, 2017, 2018
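
The listed test years are the most recent ones, which suggests that the split holds out whole years rather than sampling rows at random. A minimal sketch of such a year-based split, written in plain Python on made-up rows. This is an assumption about the idea, not the implementation behind `splitDataIntoTrainingTestSets`, and `split_by_year` is a hypothetical helper.

```python
def split_by_year(rows, test_years, year_key="FYEAR"):
    """Split row dicts into (train, test) by whether the year is a test year."""
    test_years = set(test_years)
    train = [r for r in rows if r[year_key] not in test_years]
    test = [r for r in rows if r[year_key] in test_years]
    return train, test

# Made-up rows for illustration; not the NL data set.
rows = [{"IDREGION": "R1", "FYEAR": y, "YIELD": 50.0 + i}
        for i, y in enumerate(range(2008, 2019))]
train, test = split_by_year(rows, range(2012, 2019))
print(len(train), len(test))  # → 4 7
```

Holding out complete years keeps information from a test year out of the training data for every region at once.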



### Summarize Data

In [None]:
from cypml.tests.test_data_summary import TestDataSummarizer
from cypml.workflow.data_summary import CYPDataSummarizer
from cypml.run_workflow.run_data_summary import summarizeData

if ((not use_saved_predictions) and
    (not use_saved_features)):

  print('#################')
  print('# Data Summary  #')
  print('#################')

  if (run_tests):
    test_summarizer = TestDataSummarizer(spark)
    test_summarizer.runAllTests()

  cyp_summarizer = CYPDataSummarizer(cyp_config)
  summary_dfs = summarizeData(cyp_config, cyp_summarizer, prep_train_test_dfs)

#################
# Data Summary  #
#################
Crop calender information based on WOFOST data
+--------+-------------+---------+----------+----------+---------------------+
|IDREGION|CAMPAIGN_YEAR|START_DVS|START_DVS1|START_DVS2|CAMPAIGN_EARLY_SEASON|
+--------+-------------+---------+----------+----------+---------------------+
|    NL11|         2011|       13|        19|        30|                   31|
|    NL12|         2011|       15|        20|        31|                   32|
|    NL13|         2011|       13|        19|        29|                   30|
|    NL21|         2011|       11|        17|        28|                   29|
|    NL22|         2011|       11|        17|        27|                   28|
|    NL23|         2011|       12|        17|        28|                   29|
|    NL31|         2011|       12|        17|        27|                   28|
|    NL32|         2011|       12|        17|        28|                   29|
|    NL33|         2011|      

### Create Features

In [None]:
from cypml.workflow.feature_design import CYPFeaturizer
from cypml.run_workflow.run_feature_design import createFeatures
from cypml.tests.test_yield_trend import TestYieldTrendEstimator
from cypml.workflow.yield_trend import CYPYieldTrendEstimator
from cypml.run_workflow.run_trend_feature_design import createYieldTrendFeatures
from cypml.run_workflow.run_trend_feature_design import addFeaturesFromPreviousYears

if ((not use_saved_predictions) and
    (not use_saved_features)):

  print('###################')
  print('# Feature Design  #')
  print('###################')

  # WOFOST, Meteo and Remote Sensing Features
  cyp_featurizer = CYPFeaturizer(cyp_config)
  pd_feature_dfs = createFeatures(cyp_config, cyp_featurizer,
                                  prep_train_test_dfs, summary_dfs, log_fh)

  # trend features
  join_cols = ['IDREGION', 'FYEAR']
  if (use_yield_trend):
    yield_train_df = prep_train_test_dfs['YIELD'][0]
    yield_test_df = prep_train_test_dfs['YIELD'][1]

    # Trend features from feature data
    use_features_v2 = cyp_config.useFeaturesV2()
    if (use_features_v2):
      pd_feature_dfs = addFeaturesFromPreviousYears(cyp_config, pd_feature_dfs,
                                                    1, test_years, join_cols)

    if (run_tests):
      test_yield_trend = TestYieldTrendEstimator(yield_train_df)
      test_yield_trend.runAllTests()

    # Trend features from label data
    cyp_trend_est = CYPYieldTrendEstimator(cyp_config)
    pd_yield_train_ft, pd_yield_test_ft = createYieldTrendFeatures(cyp_config, cyp_trend_est,
                                                                   yield_train_df, yield_test_df,
                                                                   test_years)
    pd_feature_dfs['YIELD_TREND'] = [pd_yield_train_ft, pd_yield_test_ft]

###################
# Feature Design  #
###################

 WOFOST Aggregate Features: Training
  IDREGION  FYEAR  maxWLIM_YBp2  ...  maxWLAIp4  avgRSMp2  avgRSMp4
0     NL42   1980         10.25  ...       4.85     40.29     80.23
1     NL42   1979          5.68  ...       3.84     44.26     69.26
2     NL42   1981        205.46  ...       4.22     97.70     64.85
3     NL42   1982       1544.47  ...       3.45     94.16     50.04
4     NL42   1983       2432.85  ...       4.32     95.97     40.61

[5 rows x 11 columns]

 WOFOST Aggregate Features: Test
  IDREGION  FYEAR  maxWLIM_YBp2  ...  maxWLAIp4  avgRSMp2  avgRSMp4
0     NL42   2012       2284.85  ...       4.45     99.08     62.59
1     NL42   2013       2848.06  ...       4.96     91.85     57.10
2     NL42   2014       3829.60  ...       5.14     94.32     89.62
3     NL42   2015       3801.85  ...       4.16     82.57     62.69
4     NL42   2016       1865.08  ...       3.14     97.52     43.86

[5 rows x 11 columns]

 WOFO
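
The trend features are derived from the yields of previous years, and the configuration lists candidate window lengths of 5, 7 and 10 years. The sketch below illustrates the general idea with a straight-line fit over the window, extrapolated to the target year. It is an assumption made for illustration: the estimator inside `CYPYieldTrendEstimator` may work differently, and both function names are hypothetical.

```python
def linear_trend_prediction(years, yields, target_year):
    """Fit yield = a + b * year by least squares and evaluate it at target_year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(yields) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, yields))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * target_year

def trend_feature(history, target_year, window):
    """Use the `window` years before target_year; return None if any is missing."""
    years = list(range(target_year - window, target_year))
    if any(y not in history for y in years):
        return None
    return linear_trend_prediction(years, [history[y] for y in years], target_year)

# Made-up yields that rise by exactly 1.0 per year.
history = {2000 + i: 50.0 + i for i in range(10)}
print(round(trend_feature(history, 2010, window=5), 2))  # → 60.0
```

Only years before the target year enter the fit, so the feature for a test year never uses that year's own label.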

### Combine Features and Labels

In [None]:
from cypml.run_workflow.combine_features import combineFeaturesLabels

if ((not use_saved_predictions) and
    (not use_saved_features)):

  join_cols = ['IDREGION', 'FYEAR']
  pd_train_df, pd_test_df = combineFeaturesLabels(cyp_config, sqlContext,
                                                  prep_train_test_dfs, pd_feature_dfs,
                                                  join_cols, log_fh)


Combine Features and Labels
---------------------------
Yield min year 1994

Data size after including SOIL data: 
Train 12 rows.
Test 12 rows.

Data size after including WOFOST features: 
Train 396 rows.
Test 84 rows.

Data size after including METEO features: 
Train 396 rows.
Test 84 rows.

Data size after including REMOTE_SENSING features: 
Train 143 rows.
Test 77 rows.

Data size after including yield trend features: 
Train 143 rows.
Test 77 rows.

Data size after including yield (label) data: 
Train 143 rows.
Test 77 rows.


All Features and labels: Training
   IDREGION  FYEAR  SM_WHC  ...    YIELD-1  YIELD_TREND      YIELD
39     NL11   1999    0.22  ...  47.000000        50.01  60.299999
40     NL11   2000    0.22  ...  60.299999        55.92  59.099998
42     NL11   2001    0.22  ...  59.099998        60.46  54.400002
41     NL11   2002    0.22  ...  54.400002        58.15  55.299999
43     NL11   2003    0.22  ...  55.299999        58.43  59.200001

[5 rows x 61 columns]

All
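
The row counts in the log shrink when the remote sensing features are added (396 to 143 training rows). That is consistent with a join on `IDREGION` and `FYEAR` that keeps only the region-year pairs present in every table. A minimal sketch of such an inner join on made-up rows, assuming this interpretation. `inner_join` is a hypothetical helper and not the code behind `combineFeaturesLabels`.

```python
def inner_join(left, right, keys):
    """Join two lists of row dicts on `keys`, keeping only keys present in both."""
    index = {tuple(r[k] for k in keys): r for r in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**row, **match})
    return joined

join_cols = ["IDREGION", "FYEAR"]
# Made-up feature tables: the second one starts later, so early years drop out.
wofost = [{"IDREGION": "R1", "FYEAR": y, "maxWLAIp2": 4.0} for y in range(1995, 2005)]
fapar = [{"IDREGION": "R1", "FYEAR": y, "avgFAPARp2": 0.6} for y in range(2000, 2005)]
combined = inner_join(wofost, fapar, join_cols)
print(len(wofost), len(combined))  # → 10 5
```

The data source with the shortest history therefore determines how many training rows remain.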

### Apply Machine Learning using scikit-learn


In [None]:
from cypml.run_workflow.load_saved_features import loadSavedFeaturesLabels
from cypml.run_workflow.run_machine_learning import getMachineLearningPredictions
from cypml.run_workflow.run_machine_learning import saveMLPredictions
from cypml.run_workflow.run_machine_learning import dropHighlyCorrelatedFeatures

if ((not use_saved_predictions) and
    (use_saved_features)):
  pd_train_df, pd_test_df = loadSavedFeaturesLabels(cyp_config, spark)

if (not use_saved_predictions):
  print('\n###################################')
  print('# Machine Learning using sklearn  #')
  print('###################################')

  # # drop mutually correlated features
  # corr_threshold = 0.9
  # pd_train_df, pd_test_df = dropHighlyCorrelatedFeatures(cyp_config, pd_train_df, pd_test_df,
  #                                                        corr_thresh=corr_threshold)

  pd_ml_predictions = getMachineLearningPredictions(cyp_config, pd_train_df, pd_test_df, log_fh)
  save_predictions = cyp_config.savePredictions()
  if (save_predictions):
    saveMLPredictions(cyp_config, sqlContext, pd_ml_predictions)


###################################
# Machine Learning using sklearn  #
###################################

Training and Evaluation
-------------------------

Training Data Size: 143 rows
X cols: 57, Y cols: 4
IDREGION  FYEAR  SM_WHC  maxWLIM_YBp2  maxTWCp2  maxWLAIp2  maxWLIM_YBp4  maxWLIM_YSp4  maxTWCp4  maxWLAIp4  avgRSMp2  avgRSMp4  RSMp1gt1STD  RSMp1lt1STD  RSMp1gt2STD  RSMp1lt2STD  RSMp2gt1STD  RSMp2lt1STD  RSMp2gt2STD  RSMp2lt2STD  RSMp3gt1STD  RSMp3lt1STD  RSMp3gt2STD  RSMp3lt2STD  RSMp4gt1STD  RSMp4lt1STD  RSMp4gt2STD  RSMp4lt2STD  avgTAVGp0  avgPRECp0  avgCWBp0  avgTAVGp1  avgPRECp1  avgTAVGp2  avgCWBp2  avgPRECp3  avgCWBp4  avgPRECp5  TMINp1gt1STD  PRECp1gt1STD  TMINp1lt1STD  TMINp1gt2STD  PRECp1gt2STD  TMINp1lt2STD  PRECp3gt1STD  TMAXp3gt1STD  TMAXp3lt1STD  PRECp3gt2STD  TMAXp3gt2STD  TMAXp3lt2STD  PRECp5gt1STD  PRECp5gt2STD  avgFAPARp2  avgFAPARp4    YIELD-5    YIELD-4    YIELD-3    YIELD-2    YIELD-1  YIELD_TREND      YIELD
    NL11   1999    0.22       2919.89      3.5
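
The commented-out lines in the cell above would drop mutually correlated features using a threshold of 0.9. A minimal sketch of the general technique in plain Python: compute pairwise Pearson correlations and greedily keep a feature only if it is not strongly correlated with one already kept. This is not the implementation of `dropHighlyCorrelatedFeatures`; the function names and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.9):
    """Greedily keep columns whose |r| with every kept column is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Made-up columns: "b" is an exact multiple of "a", "c" is uncorrelated with "a".
features = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0], "c": [1.0, -1.0, -1.0, 1.0]}
print(drop_correlated(features))  # → ['a', 'c']
```

In practice the correlations would be computed on the training data only, and the same columns would then be removed from the test data.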

### Compare Predictions with JRC Predictions

In [None]:
from cypml.run_workflow.load_saved_predictions import loadSavedPredictions
from cypml.run_workflow.compare_with_mcyfs import comparePredictionsWithMCYFS

if (use_saved_predictions):
  pd_ml_predictions = loadSavedPredictions(cyp_config, spark)

compareWithMCYFS = cyp_config.compareWithMCYFS()
if (compareWithMCYFS):
  comparePredictionsWithMCYFS(sqlContext, cyp_config, pd_ml_predictions, log_fh)

log_fh.close()
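
The metrics used inside `comparePredictionsWithMCYFS` are not shown in this notebook. As an illustration of how two sets of predictions can be compared against reported yields, the sketch below computes the root mean squared error (RMSE) and a normalized variant. The choice of metric and all numbers are assumptions made for the example.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def normalized_rmse(actual, predicted):
    """RMSE expressed as a percentage of the mean actual value."""
    return 100.0 * rmse(actual, predicted) / (sum(actual) / len(actual))

# Made-up numbers purely for illustration; not results from this notebook.
actual = [60.0, 58.0, 62.0]
ml_pred = [59.0, 59.0, 61.0]
other_pred = [57.0, 60.0, 64.0]
print(round(rmse(actual, ml_pred), 3), round(rmse(actual, other_pred), 3))
print(round(normalized_rmse(actual, ml_pred), 2))
```

A lower value means the predictions are closer to the reported yields, and the normalized form makes values comparable across crops with different yield levels.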