Made by, Jeffrey Stynen r0784111

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pycaret.regression import *

In [10]:
file = pd.ExcelFile("../data/food-twentieth-century-crop-statistics-1900-2017-xlsx.xlsx")
df = file.parse('CropStats')
df = df.set_index(df.columns[0])
df.index.name = None

## Clean Data
Columns with too many null values or meaningless information are dropped.  
Columns with unclear names are renamed.  
For the subnational column null values are filled up with corresponding values from the national column.  
For yield, production, and hectares, if one of them is null but the others in the same row are not, the null one can be calculated. Here we need to pay attention to not divide by 0.  
We chose to focus on wheat, so we will be filtering the df accordingly.  
We also added columns with logarithmic transformations for yield, production, and hectares.

In [None]:
df.drop(['admin2', 'notes', 'Harvest_year'], axis=1, inplace=True)
df.rename(columns = {'admin0': 'national', 'admin1': 'subnational', 'hectares (ha)': 'hectares_ha', 'production (tonnes)': 'production_tonnes', 'yield(tonnes/ha)': 'yield_tonnes_ha'}, inplace=True)
df.loc[df['subnational'].isna(), 'subnational'] = df['national']
# Calculate yield
mask = df['yield_tonnes_ha'].isna() & ~df['production_tonnes'].isna() & ~df['hectares_ha'].isna() & df['hectares_ha'] != 0
df.loc[mask, 'yield_tonnes_ha'] = df['production_tonnes'] / df['hectares_ha']
df.dropna(subset=['yield_tonnes_ha'], inplace=True)
# Calculate production
mask = df['production_tonnes'].isna() & ~df['yield_tonnes_ha'].isna() & ~df['hectares_ha'].isna()
df.loc[mask, 'production_tonnes'] = df['yield_tonnes_ha'] * df['hectares_ha']
df.dropna(subset=['production_tonnes'], inplace=True)
# Calculate hectares
mask = df['hectares_ha'].isna() & ~df['yield_tonnes_ha'].isna() & ~df['production_tonnes'].isna()
df.loc[mask, 'hectares_ha'] = df['yield_tonnes_ha'] * df['production_tonnes']
df.dropna(subset=['hectares_ha'], inplace=True)
# The columns we just adapted just changed into objects, let's make them floats again
df['hectares_ha'] = df['hectares_ha'].astype(float)
df['production_tonnes'] = df['production_tonnes'].astype(float)
df['yield_tonnes_ha'] = df['yield_tonnes_ha'].astype(float)
# Filter for wheat
df = df[df['crop'] == 'wheat']
# Remove the crop column
df.drop('crop', axis=1, inplace =True)
# Logarithmic transformations
# df['log_yield'] = np.log1p(df['yield_tonnes_ha'])
# df['log_hectares'] = np.log1p(df['production_tonnes'])
# df['log_production'] = np.log1p(df['hectares_ha'])


#### Pycaret
We fisrt setup the model, I chose to use 70% of the data for testing. The target we want to look at is the yield. We want to be able to predict the future yield. I chose to use 10 folds for cross validation.

In [None]:
s = setup(data=df, train_size=0.7, target='yield_tonnes_ha', fold=10, categorical_features=['national', 'subnational'], session_id=123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,yield_tonnes_ha
2,Target type,Regression
3,Original data shape,"(15479, 6)"
4,Transformed data shape,"(15479, 26)"
5,Transformed train set shape,"(10835, 26)"
6,Transformed test set shape,"(4644, 26)"
7,Numeric features,3
8,Categorical features,2
9,Preprocess,True


K Neighbors Regressor seems to be the best model for our purposes.

In [None]:
best_model = compare_models()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
knn,K Neighbors Regressor,0.0622,0.0251,0.1578,0.9905,0.0307,0.0272,0.101
xgboost,Extreme Gradient Boosting,0.1281,0.0403,0.2003,0.9846,0.059,0.0714,0.158
et,Extra Trees Regressor,0.1542,0.0595,0.2438,0.9773,0.0766,0.0969,1.045
rf,Random Forest Regressor,0.1589,0.0646,0.254,0.9754,0.0757,0.0938,1.933
lightgbm,Light Gradient Boosting Machine,0.1765,0.0686,0.2617,0.9738,0.0812,0.1101,0.337
dt,Decision Tree Regressor,0.2114,0.1093,0.3301,0.9583,0.0992,0.1216,0.073
gbr,Gradient Boosting Regressor,0.2799,0.1534,0.3916,0.9415,0.127,0.1928,0.661
ada,AdaBoost Regressor,0.513,0.4028,0.6339,0.8468,0.2277,0.4436,0.349
lar,Least Angle Regression,0.6379,0.7066,0.8401,0.7311,0.2678,0.4513,0.055
br,Bayesian Ridge,0.6379,0.7067,0.8402,0.7311,0.2679,0.4512,0.056


In [None]:
tuned_model = tune_model(best_model)

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0.0594,0.0208,0.1443,0.992,0.0256,0.0274
1,0.0665,0.0271,0.1646,0.9912,0.0285,0.027
2,0.0599,0.0308,0.1755,0.9884,0.0288,0.0259
3,0.0585,0.0179,0.134,0.9932,0.0259,0.024
4,0.0602,0.0224,0.1497,0.9918,0.027,0.0233
5,0.0518,0.0165,0.1286,0.9929,0.0266,0.0257
6,0.0599,0.0161,0.127,0.9934,0.0293,0.0252
7,0.0589,0.0217,0.1474,0.9915,0.0276,0.027
8,0.0653,0.0287,0.1693,0.9894,0.0347,0.0275
9,0.0656,0.0258,0.1608,0.99,0.036,0.031


Processing:   0%|          | 0/7 [00:00<?, ?it/s]

Fitting 10 folds for each of 10 candidates, totalling 100 fits


When we evaluate the model we see that the model performed very well, with very few outliers in the predictions. 

In [None]:
evaluate_model(tuned_model)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

In [None]:
save_model(best_model, 'crops_jeffrey')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['hectares_ha', 'production_tonnes',
                                              'year'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['national', 'subnational'],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('onehot_encoding',
                  TransformerWrapper(include=['national'],
                                     transformer=OneHotEncoder(cols=['national'],
                                                               handle_missing='return_nan',
                                                               use_cat_names=True))),
                 ('rest_encoding',
                  TransformerWrapper(include=['subnational'],
                                     transformer=TargetE