## Importing the file

I imported the data and added two columns that make it easier to query later, when I build separate dataframes for the different fields.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('Train.csv')

# Filling the empty cells for Irrigation (assigning back instead of a chained inplace fillna)
irrigation_cols = ['Irrigation field 1', 'Irrigation field 2', 'Irrigation field 3', 'Irrigation field 4']
df[irrigation_cols] = df[irrigation_cols].fillna(0)


# Flag column to differentiate train (0) and test (1) rows:
# test rows are the ones with no air temperature reading
def train_test(cols):
    temp = cols.iloc[0]  # positional access; cols[0] is deprecated on a labelled row
    
    if pd.isnull(temp): return 1
    else: return 0
    
df['train_test'] = df[['Air temperature (C)']].apply(train_test,axis=1)

# Segment label derived from the row index, to make it easier to arrange dataframes for different fields
def segmentation(cols):
    date = cols.iloc[0]  # positional access; cols[0] is deprecated on a labelled row
    
    if date<=10066: return 1
    elif date<=17235: return 2
    else: return 3

df.reset_index(inplace=True)
    
df['segmentation'] = df[['index']].apply(segmentation,axis=1)

df.head()

Unnamed: 0,index,timestamp,Soil humidity 1,Irrigation field 1,Soil humidity 2,Irrigation field 2,Soil humidity 3,Irrigation field 3,Soil humidity 4,Irrigation field 4,Air temperature (C),Air humidity (%),Pressure (KPa),Wind speed (Km/h),Wind gust (Km/h),Wind direction (Deg),train_test,segmentation
0,0,2019-02-23 00:00:00,67.92,0.0,55.72,0.0,-1.56,1.0,26.57,1.0,19.52,55.04,101.5,2.13,6.3,225.0,0,1
1,1,2019-02-23 00:05:00,67.89,0.0,55.74,0.0,-1.51,1.0,26.58,1.0,19.49,55.17,101.5,2.01,10.46,123.75,0,1
2,2,2019-02-23 00:10:00,67.86,0.0,55.77,0.0,-1.47,1.0,26.59,1.0,19.47,55.3,101.51,1.9,14.63,22.5,0,1
3,3,2019-02-23 00:15:00,67.84,0.0,55.79,0.0,-1.42,1.0,26.61,1.0,19.54,54.2,101.51,2.28,16.08,123.75,0,1
4,4,2019-02-23 00:20:00,67.81,0.0,55.82,0.0,-1.38,1.0,26.62,1.0,19.61,53.09,101.51,2.66,17.52,225.0,0,1
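
The two helper columns can also be built without row-wise `apply`. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, not rows from `Train.csv`); it uses the same index boundaries as `segmentation()` above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    'index': [0, 10066, 10067, 17235, 17236],
    'Air temperature (C)': [19.5, np.nan, 20.1, np.nan, 18.7],
})

# 1 where the air temperature is missing (test rows), 0 otherwise.
toy['train_test'] = toy['Air temperature (C)'].isna().astype(int)

# Same boundaries as segmentation(): <=10066 -> 1, <=17235 -> 2, else 3.
toy['segmentation'] = pd.cut(
    toy['index'],
    bins=[-np.inf, 10066, 17235, np.inf],
    labels=[1, 2, 3],
).astype(int)

print(toy)
```

`pd.cut` uses right-closed bins by default, so the boundary rows 10066 and 17235 fall into segments 1 and 2, matching the `<=` comparisons.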


## Updating the values of test data days with the previous date

I didn't fit an ARIMA or any other time-series model to predict the climate values for the test dates. I simply copied the previous day's values into the test rows: the records are 5 minutes apart, so one day is 288 records.

In [2]:
import warnings
warnings.filterwarnings('ignore')

liste = ['Air temperature (C)','Air humidity (%)','Pressure (KPa)','Wind speed (Km/h)','Wind gust (Km/h)',
         'Wind direction (Deg)']

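for item in liste:
    # (first test row, last test row, label) for each of the three test windows
    for start, stop, label in [(8914, 10066, '1'), (16083, 17235, '2'), (26301, 28029, '3')]:
        # Fill in order: each row takes the value from 288 records (one day) earlier,
        # so windows longer than a day keep repeating the last observed day.
        # df.loc avoids the chained assignment df[item][i] = ...
        for i in range(start, stop + 1):
            df.loc[i, item] = df.loc[i - 288, item]
        print(item + ' ' + label)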

Air temperature (C) 1
Air temperature (C) 2
Air temperature (C) 3
Air humidity (%) 1
Air humidity (%) 2
Air humidity (%) 3
Pressure (KPa) 1
Pressure (KPa) 2
Pressure (KPa) 3
Wind speed (Km/h) 1
Wind speed (Km/h) 2
Wind speed (Km/h) 3
Wind gust (Km/h) 1
Wind gust (Km/h) 2
Wind gust (Km/h) 3
Wind direction (Deg) 1
Wind direction (Deg) 2
Wind direction (Deg) 3
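
To make the effect of the in-order copy visible, here is a minimal sketch on a toy series with a "day" of 3 readings instead of 288. The function name `fill_from_previous_period` and the numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fill_from_previous_period(series, start, stop, period):
    """Copy series[i - period] into series[i] for i in start..stop (inclusive).

    Rows are filled in order, so once the gap is longer than one period
    the copied values are themselves reused, repeating the last full period.
    """
    filled = series.copy()
    for i in range(start, stop + 1):
        filled.loc[i] = filled.loc[i - period]
    return filled

# Toy data: one "day" of 3 readings followed by 5 missing ones.
temps = pd.Series([10.0, 11.0, 12.0] + [np.nan] * 5)
filled = fill_from_previous_period(temps, start=3, stop=7, period=3)
print(filled.tolist())
```

The last observed period (10, 11, 12) simply repeats through the gap, which is what happens to the climate columns in each test window.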


I created the target column as the average change in soil humidity per record over the last 5 records (25 minutes). My model will try to predict these values.

I also created a `time` column that encodes the time of day as an integer (hour * 100 + minute).

In [3]:
df['target_1'] = (df['Soil humidity 1'] - df['Soil humidity 1'].shift(5))/5
df['target_2'] = (df['Soil humidity 2'] - df['Soil humidity 2'].shift(5))/5
df['target_3'] = (df['Soil humidity 3'] - df['Soil humidity 3'].shift(5))/5
df['target_4'] = (df['Soil humidity 4'] - df['Soil humidity 4'].shift(5))/5

df['time'] = df.timestamp.str.slice(11,13).astype(int) * 100 +df.timestamp.str.slice(14,16).astype(int)
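
A small worked example of both columns, assuming a made-up frame `toy` with seven 5-minute records that cross midnight (the humidity values are invented for illustration):

```python
import pandas as pd

stamps = pd.date_range('2019-02-23 23:40:00', periods=7, freq='5min')
toy = pd.DataFrame({
    'timestamp': stamps.strftime('%Y-%m-%d %H:%M:%S'),
    'Soil humidity 1': [50.0, 50.5, 51.0, 51.5, 52.0, 52.5, 53.0],
})

# Average change per record over the previous 5 records.
toy['target_1'] = (toy['Soil humidity 1'] - toy['Soil humidity 1'].shift(5)) / 5

# Hour * 100 + minute, read from fixed positions in the timestamp string.
toy['time'] = (toy.timestamp.str.slice(11, 13).astype(int) * 100
               + toy.timestamp.str.slice(14, 16).astype(int))

print(toy)
```

The first five targets are `NaN` because there is no record five steps back, and `time` drops from 2355 to 0 at midnight, so the encoding is not continuous across days.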

## Creating train and test dataframes

I decided to model the irrigation on and off situations separately. So for each field, I created one "irrigation on" and one "irrigation off" dataframe for both train and test.

In the next part I will train the model on each of these dataframes and make the predictions accordingly.

So my model will be able to say "if the irrigation is on, this field is gaining humidity at x.xx% per record; if the irrigation is off, the field is losing humidity at x.xx% per record".

In [5]:
#### FIELD 1
liste = ['index','timestamp','target_1','Irrigation field 1','Air temperature (C)','Air humidity (%)','Pressure (KPa)',
         'Wind speed (Km/h)','Wind gust (Km/h)','Wind direction (Deg)','time']

df_1_train_on = df[liste][(df['segmentation']<=1) & (df['train_test']==0) & (df['Irrigation field 1']==1)]
df_1_train_on.dropna(inplace=True)
df_1_train_on.rename(columns={'target_1': 'target', 'Irrigation field 1': 'Irrigation'}, inplace=True)

df_1_test_on = df[liste][(df['segmentation']==1) & (df['train_test']==1) & (df['Irrigation field 1']==1)]
df_1_test_on.rename(columns={'target_1': 'target', 'Irrigation field 1': 'Irrigation'}, inplace=True)
df_1_test_on.fillna(0, inplace=True)

df_1_train_off = df[liste][(df['segmentation']<=1) & (df['train_test']==0) & (df['Irrigation field 1']==0)]
df_1_train_off.dropna(inplace=True)
df_1_train_off.rename(columns={'target_1': 'target', 'Irrigation field 1': 'Irrigation'}, inplace=True)

df_1_test_off = df[liste][(df['segmentation']==1) & (df['train_test']==1) & (df['Irrigation field 1']==0)]
df_1_test_off.rename(columns={'target_1': 'target', 'Irrigation field 1': 'Irrigation'}, inplace=True)
df_1_test_off.fillna(0, inplace=True)


#### FIELD 2
liste = ['index','timestamp','target_2','Irrigation field 2','Air temperature (C)','Air humidity (%)','Pressure (KPa)',
         'Wind speed (Km/h)','Wind gust (Km/h)','Wind direction (Deg)','time']

df_2_train_on = df[liste][(df['segmentation']<=3) & (df['train_test']==0) & (df['Irrigation field 2']==1)]
df_2_train_on.dropna(inplace=True)
df_2_train_on.rename(columns={'target_2': 'target', 'Irrigation field 2': 'Irrigation'}, inplace=True)

df_2_test_on = df[liste][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 2']==1)]
df_2_test_on.rename(columns={'target_2': 'target', 'Irrigation field 2': 'Irrigation'}, inplace=True)
df_2_test_on.fillna(0, inplace=True)

df_2_train_off = df[liste][(df['segmentation']<=3) & (df['train_test']==0) & (df['Irrigation field 2']==0)]
df_2_train_off.dropna(inplace=True)
df_2_train_off.rename(columns={'target_2': 'target', 'Irrigation field 2': 'Irrigation'}, inplace=True)

df_2_test_off = df[liste][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 2']==0)]
df_2_test_off.rename(columns={'target_2': 'target', 'Irrigation field 2': 'Irrigation'}, inplace=True)
df_2_test_off.fillna(0, inplace=True)


#### FIELD 3
liste = ['index','timestamp','target_3','Irrigation field 3','Air temperature (C)','Air humidity (%)','Pressure (KPa)',
         'Wind speed (Km/h)','Wind gust (Km/h)','Wind direction (Deg)','time']

df_3_train_on = df[liste][(df['segmentation']<=2) & (df['train_test']==0) & (df['Irrigation field 3']==1)]
df_3_train_on.dropna(inplace=True)
df_3_train_on.rename(columns={'target_3': 'target', 'Irrigation field 3': 'Irrigation'}, inplace=True)

df_3_test_on = df[liste][(df['segmentation']==2) & (df['train_test']==1) & (df['Irrigation field 3']==1)]
df_3_test_on.rename(columns={'target_3': 'target', 'Irrigation field 3': 'Irrigation'}, inplace=True)
df_3_test_on.fillna(0, inplace=True)

df_3_train_off = df[liste][(df['segmentation']<=2) & (df['train_test']==0) & (df['Irrigation field 3']==0)]
df_3_train_off.dropna(inplace=True)
df_3_train_off.rename(columns={'target_3': 'target', 'Irrigation field 3': 'Irrigation'}, inplace=True)

df_3_test_off = df[liste][(df['segmentation']==2) & (df['train_test']==1) & (df['Irrigation field 3']==0)]
df_3_test_off.rename(columns={'target_3': 'target', 'Irrigation field 3': 'Irrigation'}, inplace=True)
df_3_test_off.fillna(0, inplace=True)


#### FIELD 4
liste = ['index','timestamp','target_4','Irrigation field 4','Air temperature (C)','Air humidity (%)','Pressure (KPa)',
         'Wind speed (Km/h)','Wind gust (Km/h)','Wind direction (Deg)','time']

df_4_train_on = df[liste][(df['segmentation']<=3) & (df['train_test']==0) & (df['Irrigation field 4']==1)]
df_4_train_on.dropna(inplace=True)
df_4_train_on.rename(columns={'target_4': 'target', 'Irrigation field 4': 'Irrigation'}, inplace=True)

df_4_test_on = df[liste][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 4']==1)]
df_4_test_on.rename(columns={'target_4': 'target', 'Irrigation field 4': 'Irrigation'}, inplace=True)
df_4_test_on.fillna(0, inplace=True)

df_4_train_off = df[liste][(df['segmentation']<=3) & (df['train_test']==0) & (df['Irrigation field 4']==0)]
df_4_train_off.dropna(inplace=True)
df_4_train_off.rename(columns={'target_4': 'target', 'Irrigation field 4': 'Irrigation'}, inplace=True)

df_4_test_off = df[liste][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 4']==0)]
df_4_test_off.rename(columns={'target_4': 'target', 'Irrigation field 4': 'Irrigation'}, inplace=True)
df_4_test_off.fillna(0, inplace=True)
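
The four blocks above follow one pattern, which could be captured in a helper. This is a sketch only: `split_field`, `segment`, and `feature_cols` are hypothetical names, and the toy frame is made up to keep the example self-contained.

```python
import numpy as np
import pandas as pd

def split_field(df, field, segment, feature_cols):
    """Return train/test frames for irrigation on and off for one field."""
    target_col = f'target_{field}'
    irrigation_col = f'Irrigation field {field}'
    cols = ['index', 'timestamp', target_col, irrigation_col] + feature_cols
    renames = {target_col: 'target', irrigation_col: 'Irrigation'}

    frames = {}
    for name, flag in (('on', 1), ('off', 0)):
        irrigated = df[irrigation_col] == flag
        train_mask = (df['segmentation'] <= segment) & (df['train_test'] == 0) & irrigated
        test_mask = (df['segmentation'] == segment) & (df['train_test'] == 1) & irrigated
        # Train rows need a known target; test rows get placeholder zeros.
        frames[f'train_{name}'] = df.loc[train_mask, cols].dropna().rename(columns=renames)
        frames[f'test_{name}'] = df.loc[test_mask, cols].rename(columns=renames).fillna(0)
    return frames

# Made-up data with a single feature column.
toy = pd.DataFrame({
    'index': range(6),
    'timestamp': [f'2019-02-23 00:{5 * i:02d}:00' for i in range(6)],
    'target_1': [0.2, -0.1, np.nan, 0.3, np.nan, np.nan],
    'Irrigation field 1': [1, 0, 1, 1, 1, 0],
    'time': [0, 5, 10, 15, 20, 25],
    'segmentation': [1] * 6,
    'train_test': [0, 0, 0, 0, 1, 1],
})
frames = split_field(toy, field=1, segment=1, feature_cols=['time'])
print({name: len(frame) for name, frame in frames.items()})
```

Using `df.loc[mask, cols]` followed by non-inplace `dropna`, `rename`, and `fillna` returns new frames, which sidesteps the copy warnings that inplace operations on a slice can raise.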

## Model

I used a simple XGBRegressor with the following parameters: `n_estimators=100`, `booster='dart'`.

I trained the same model separately for each field and for the irrigation on/off cases.

In [6]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

model_xgb = XGBRegressor(n_estimators=100, booster='dart')


##################################### FIELD 1 #####################################
df_on = df_1_train_on[df_1_train_on['target']>=0.05]

outcomes = df_on[['index','timestamp','target']].copy()
features = df_on.drop(['index','timestamp','target'], axis = 1).copy()

y_test = df_1_test_on[['index','timestamp','target']].copy()
X_test = df_1_test_on.drop(['index','timestamp','target'], axis = 1).copy()

# Train - Valid separation
X_train, X_valid, y_train, y_valid = train_test_split(features, outcomes, test_size=0.25, random_state=42)

model_xgb.fit(X_train, y_train['target'], verbose=True, eval_metric='rmse',
              eval_set=[(X_train, y_train['target']), (X_valid, y_valid['target'])])

predictions_xgb_1_on = model_xgb.predict(X_test)

oran_real = df_1_train_on['target'].mean()
oran_predictions = predictions_xgb_1_on.mean()

deneme = model_xgb.predict(X_train)
error_1_on = mean_squared_error(y_train['target'],deneme)
error_1_on_fixed = mean_squared_error(y_train['target'],deneme * oran_real / oran_predictions)

##########
df_off = df_1_train_off[df_1_train_off['target']<0]

outcomes = df_off[['index','timestamp','target']].copy()
features = df_off.drop(['index','timestamp','target'], axis = 1).copy()

y_test = df_1_test_off[['index','timestamp','target']].copy()
X_test = df_1_test_off.drop(['index','timestamp','target'], axis = 1).copy()

# Train - Valid separation
X_train, X_valid, y_train, y_valid = train_test_split(features, outcomes, test_size=0.25, random_state=42)

model_xgb.fit(X_train, y_train['target'], verbose=True, eval_metric='rmse',
              eval_set=[(X_train, y_train['target']), (X_valid, y_valid['target'])])

predictions_xgb_1_off = model_xgb.predict(X_test)

oran_real = df_1_train_off['target'].mean()
oran_predictions = predictions_xgb_1_off.mean()

deneme = model_xgb.predict(X_train)
error_1_off = mean_squared_error(y_train['target'],deneme)
error_1_off_fixed = mean_squared_error(y_train['target'],deneme * oran_real / oran_predictions)


##################################### FIELD 2 #####################################
df_on = df_2_train_on[df_2_train_on['target']>=0.05]

outcomes = df_on[['index','timestamp','target']].copy()
features = df_on.drop(['index','timestamp','target'], axis = 1).copy()

y_test = df_2_test_on[['index','timestamp','target']].copy()
X_test = df_2_test_on.drop(['index','timestamp','target'], axis = 1).copy()

# Train - Valid separation
X_train, X_valid, y_train, y_valid = train_test_split(features, outcomes, test_size=0.25, random_state=42)

model_xgb.fit(X_train, y_train['target'], verbose=True, eval_metric='rmse',
              eval_set=[(X_train, y_train['target']), (X_valid, y_valid['target'])])

predictions_xgb_2_on = model_xgb.predict(X_test)

oran_real = df_2_train_on['target'].mean()
oran_predictions = predictions_xgb_2_on.mean()

deneme = model_xgb.predict(X_train)
error_2_on = mean_squared_error(y_train['target'],deneme)
error_2_on_fixed = mean_squared_error(y_train['target'],deneme * oran_real / oran_predictions)

##########
df_off = df_2_train_off[df_2_train_off['target']<0]

outcomes = df_off[['index','timestamp','target']].copy()
features = df_off.drop(['index','timestamp','target'], axis = 1).copy()

y_test = df_2_test_off[['index','timestamp','target']].copy()
X_test = df_2_test_off.drop(['index','timestamp','target'], axis = 1).copy()

# Train - Valid separation
X_train, X_valid, y_train, y_valid = train_test_split(features, outcomes, test_size=0.25, random_state=42)

model_xgb.fit(X_train, y_train['target'], verbose=True, eval_metric='rmse',
              eval_set=[(X_train, y_train['target']), (X_valid, y_valid['target'])])

predictions_xgb_2_off = model_xgb.predict(X_test)

oran_real = df_2_train_off['target'].mean()
oran_predictions = predictions_xgb_2_off.mean()

deneme = model_xgb.predict(X_train)
error_2_off = mean_squared_error(y_train['target'],deneme)
error_2_off_fixed = mean_squared_error(y_train['target'],deneme * oran_real / oran_predictions)


##################################### FIELD 3 #####################################
df_on = df_3_train_on[df_3_train_on['target']>=0.05]

outcomes = df_on[['index','timestamp','target']].copy()
features = df_on.drop(['index','timestamp','target'], axis = 1).copy()

y_test = df_3_test_on[['index','timestamp','target']].copy()
X_test = df_3_test_on.drop(['index','timestamp','target'], axis = 1).copy()

# Train - Valid separation
X_train, X_valid, y_train, y_valid = train_test_split(features, outcomes, test_size=0.25, random_state=42)

model_xgb.fit(X_train, y_train['target'], verbose=True, eval_metric='rmse',
              eval_set=[(X_train, y_train['target']), (X_valid, y_valid['target'])])

predictions_xgb_3_on = model_xgb.predict(X_test)

oran_real = df_3_train_on['target'].mean()
oran_predictions = predictions_xgb_3_on.mean()

deneme = model_xgb.predict(X_train)
error_3_on = mean_squared_error(y_train['target'],deneme)
error_3_on_fixed = mean_squared_error(y_train['target'],deneme * oran_real / oran_predictions)

##########
df_off = df_3_train_off[df_3_train_off['target']<0]

outcomes = df_off[['index','timestamp','target']].copy()
features = df_off.drop(['index','timestamp','target'], axis = 1).copy()

y_test = df_3_test_off[['index','timestamp','target']].copy()
X_test = df_3_test_off.drop(['index','timestamp','target'], axis = 1).copy()

# Train - Valid separation
X_train, X_valid, y_train, y_valid = train_test_split(features, outcomes, test_size=0.25, random_state=42)

model_xgb.fit(X_train, y_train['target'], verbose=True, eval_metric='rmse',
              eval_set=[(X_train, y_train['target']), (X_valid, y_valid['target'])])

predictions_xgb_3_off = model_xgb.predict(X_test)

oran_real = df_3_train_off['target'].mean()
oran_predictions = predictions_xgb_3_off.mean()

deneme = model_xgb.predict(X_train)
error_3_off = mean_squared_error(y_train['target'],deneme)
error_3_off_fixed = mean_squared_error(y_train['target'],deneme * oran_real / oran_predictions)


##################################### FIELD 4 #####################################
df_on = df_4_train_on[df_4_train_on['target']>=0.05]

outcomes = df_on[['index','timestamp','target']].copy()
features = df_on.drop(['index','timestamp','target'], axis = 1).copy()

y_test = df_4_test_on[['index','timestamp','target']].copy()
X_test = df_4_test_on.drop(['index','timestamp','target'], axis = 1).copy()

# Train - Valid separation
X_train, X_valid, y_train, y_valid = train_test_split(features, outcomes, test_size=0.25, random_state=42)

model_xgb.fit(X_train, y_train['target'], verbose=True, eval_metric='rmse',
              eval_set=[(X_train, y_train['target']), (X_valid, y_valid['target'])])

predictions_xgb_4_on = model_xgb.predict(X_test)

oran_real = df_4_train_on['target'].mean()
oran_predictions = predictions_xgb_4_on.mean()

deneme = model_xgb.predict(X_train)
error_4_on = mean_squared_error(y_train['target'],deneme)
error_4_on_fixed = mean_squared_error(y_train['target'],deneme * oran_real / oran_predictions)

##########
df_off = df_4_train_off[df_4_train_off['target']<0]

outcomes = df_off[['index','timestamp','target']].copy()
features = df_off.drop(['index','timestamp','target'], axis = 1).copy()

y_test = df_4_test_off[['index','timestamp','target']].copy()
X_test = df_4_test_off.drop(['index','timestamp','target'], axis = 1).copy()

# Train - Valid separation
X_train, X_valid, y_train, y_valid = train_test_split(features, outcomes, test_size=0.25, random_state=42)

model_xgb.fit(X_train, y_train['target'], verbose=True, eval_metric='rmse',
              eval_set=[(X_train, y_train['target']), (X_valid, y_valid['target'])])

predictions_xgb_4_off = model_xgb.predict(X_test)

oran_real = df_4_train_off['target'].mean()
oran_predictions = predictions_xgb_4_off.mean()

deneme = model_xgb.predict(X_train)
error_4_off = mean_squared_error(y_train['target'],deneme)
error_4_off_fixed = mean_squared_error(y_train['target'],deneme * oran_real / oran_predictions)


#####################

print("Error 1 on : "+str(round(error_1_on,6))+" - "+str(round(error_1_on_fixed,6)))
print("Error 1 off: "+str(round(error_1_off,6))+" - "+str(round(error_1_off_fixed,6)))
print("Error 2 on : "+str(round(error_2_on,6))+" - "+str(round(error_2_on_fixed,6)))
print("Error 2 off: "+str(round(error_2_off,6))+" - "+str(round(error_2_off_fixed,6)))
print("Error 3 on : "+str(round(error_3_on,6))+" - "+str(round(error_3_on_fixed,6)))
print("Error 3 off: "+str(round(error_3_off,6))+" - "+str(round(error_3_off_fixed,6)))
print("Error 4 on : "+str(round(error_4_on,6))+" - "+str(round(error_4_on_fixed,6)))
print("Error 4 off: "+str(round(error_4_off,6))+" - "+str(round(error_4_off_fixed,6)))

[0]	validation_0-rmse:0.338469	validation_1-rmse:0.332912
[1]	validation_0-rmse:0.307234	validation_1-rmse:0.302122
[2]	validation_0-rmse:0.277905	validation_1-rmse:0.273288
[3]	validation_0-rmse:0.251771	validation_1-rmse:0.247889
[4]	validation_0-rmse:0.22796	validation_1-rmse:0.225471
[5]	validation_0-rmse:0.207305	validation_1-rmse:0.205674
[6]	validation_0-rmse:0.188868	validation_1-rmse:0.187714
[7]	validation_0-rmse:0.171436	validation_1-rmse:0.171457
[8]	validation_0-rmse:0.15644	validation_1-rmse:0.157378
[9]	validation_0-rmse:0.141984	validation_1-rmse:0.143505
[10]	validation_0-rmse:0.129293	validation_1-rmse:0.1322
[11]	validation_0-rmse:0.117984	validation_1-rmse:0.121959
[12]	validation_0-rmse:0.108444	validation_1-rmse:0.113341
[13]	validation_0-rmse:0.099268	validation_1-rmse:0.104692
[14]	validation_0-rmse:0.091377	validation_1-rmse:0.097376
[15]	validation_0-rmse:0.084036	validation_1-rmse:0.090752
[16]	validation_0-rmse:0.077466	validation_1-rmse:0.084755
[17]	valida

[36]	validation_0-rmse:0.022628	validation_1-rmse:0.024943
[37]	validation_0-rmse:0.022073	validation_1-rmse:0.024395
[38]	validation_0-rmse:0.021577	validation_1-rmse:0.02392
[39]	validation_0-rmse:0.021149	validation_1-rmse:0.023508
[40]	validation_0-rmse:0.020808	validation_1-rmse:0.023182
[41]	validation_0-rmse:0.020436	validation_1-rmse:0.022793
[42]	validation_0-rmse:0.020168	validation_1-rmse:0.022535
[43]	validation_0-rmse:0.019923	validation_1-rmse:0.022299
[44]	validation_0-rmse:0.019754	validation_1-rmse:0.022126
[45]	validation_0-rmse:0.019481	validation_1-rmse:0.021866
[46]	validation_0-rmse:0.019316	validation_1-rmse:0.021697
[47]	validation_0-rmse:0.019152	validation_1-rmse:0.021522
[48]	validation_0-rmse:0.019013	validation_1-rmse:0.021388
[49]	validation_0-rmse:0.018822	validation_1-rmse:0.021192
[50]	validation_0-rmse:0.018712	validation_1-rmse:0.021068
[51]	validation_0-rmse:0.018545	validation_1-rmse:0.020868
[52]	validation_0-rmse:0.018462	validation_1-rmse:0.02080

[74]	validation_0-rmse:0.290143	validation_1-rmse:0.33996
[75]	validation_0-rmse:0.289599	validation_1-rmse:0.338984
[76]	validation_0-rmse:0.286879	validation_1-rmse:0.336262
[77]	validation_0-rmse:0.28535	validation_1-rmse:0.335645
[78]	validation_0-rmse:0.284108	validation_1-rmse:0.334641
[79]	validation_0-rmse:0.282735	validation_1-rmse:0.333891
[80]	validation_0-rmse:0.281612	validation_1-rmse:0.3331
[81]	validation_0-rmse:0.280009	validation_1-rmse:0.33096
[82]	validation_0-rmse:0.277539	validation_1-rmse:0.329449
[83]	validation_0-rmse:0.275965	validation_1-rmse:0.327907
[84]	validation_0-rmse:0.275011	validation_1-rmse:0.326795
[85]	validation_0-rmse:0.274402	validation_1-rmse:0.326444
[86]	validation_0-rmse:0.27223	validation_1-rmse:0.324842
[87]	validation_0-rmse:0.270183	validation_1-rmse:0.323068
[88]	validation_0-rmse:0.269473	validation_1-rmse:0.322142
[89]	validation_0-rmse:0.268595	validation_1-rmse:0.321633
[90]	validation_0-rmse:0.267688	validation_1-rmse:0.321015
[91

[10]	validation_0-rmse:0.320841	validation_1-rmse:0.332489
[11]	validation_0-rmse:0.312997	validation_1-rmse:0.324891
[12]	validation_0-rmse:0.304286	validation_1-rmse:0.319544
[13]	validation_0-rmse:0.29696	validation_1-rmse:0.314735
[14]	validation_0-rmse:0.291051	validation_1-rmse:0.30956
[15]	validation_0-rmse:0.284706	validation_1-rmse:0.304652
[16]	validation_0-rmse:0.278684	validation_1-rmse:0.300146
[17]	validation_0-rmse:0.272818	validation_1-rmse:0.296539
[18]	validation_0-rmse:0.268295	validation_1-rmse:0.292208
[19]	validation_0-rmse:0.262896	validation_1-rmse:0.289405
[20]	validation_0-rmse:0.260087	validation_1-rmse:0.287302
[21]	validation_0-rmse:0.255624	validation_1-rmse:0.283965
[22]	validation_0-rmse:0.251603	validation_1-rmse:0.280839
[23]	validation_0-rmse:0.249314	validation_1-rmse:0.279385
[24]	validation_0-rmse:0.245275	validation_1-rmse:0.276257
[25]	validation_0-rmse:0.24382	validation_1-rmse:0.275212
[26]	validation_0-rmse:0.241223	validation_1-rmse:0.273442


[48]	validation_0-rmse:0.044211	validation_1-rmse:0.0489
[49]	validation_0-rmse:0.044044	validation_1-rmse:0.048761
[50]	validation_0-rmse:0.043977	validation_1-rmse:0.048683
[51]	validation_0-rmse:0.043868	validation_1-rmse:0.048581
[52]	validation_0-rmse:0.043783	validation_1-rmse:0.048473
[53]	validation_0-rmse:0.043704	validation_1-rmse:0.048347
[54]	validation_0-rmse:0.043608	validation_1-rmse:0.04827
[55]	validation_0-rmse:0.043403	validation_1-rmse:0.048131
[56]	validation_0-rmse:0.043334	validation_1-rmse:0.04806
[57]	validation_0-rmse:0.043308	validation_1-rmse:0.048029
[58]	validation_0-rmse:0.04312	validation_1-rmse:0.047902
[59]	validation_0-rmse:0.043064	validation_1-rmse:0.047844
[60]	validation_0-rmse:0.042988	validation_1-rmse:0.047779
[61]	validation_0-rmse:0.042952	validation_1-rmse:0.047745
[62]	validation_0-rmse:0.0428	validation_1-rmse:0.047639
[63]	validation_0-rmse:0.04278	validation_1-rmse:0.047616
[64]	validation_0-rmse:0.042728	validation_1-rmse:0.04756
[65]	v

[86]	validation_0-rmse:0.227358	validation_1-rmse:0.235737
[87]	validation_0-rmse:0.226669	validation_1-rmse:0.235147
[88]	validation_0-rmse:0.22638	validation_1-rmse:0.234963
[89]	validation_0-rmse:0.225739	validation_1-rmse:0.234407
[90]	validation_0-rmse:0.225249	validation_1-rmse:0.234155
[91]	validation_0-rmse:0.225025	validation_1-rmse:0.234034
[92]	validation_0-rmse:0.224296	validation_1-rmse:0.233171
[93]	validation_0-rmse:0.223599	validation_1-rmse:0.232976
[94]	validation_0-rmse:0.222975	validation_1-rmse:0.232806
[95]	validation_0-rmse:0.222632	validation_1-rmse:0.232546
[96]	validation_0-rmse:0.221951	validation_1-rmse:0.231851
[97]	validation_0-rmse:0.221504	validation_1-rmse:0.231433
[98]	validation_0-rmse:0.220273	validation_1-rmse:0.230915
[99]	validation_0-rmse:0.220046	validation_1-rmse:0.230708
[0]	validation_0-rmse:0.532063	validation_1-rmse:0.53285
[1]	validation_0-rmse:0.47944	validation_1-rmse:0.4802
[2]	validation_0-rmse:0.43213	validation_1-rmse:0.432882
[3]	va
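
The `*_fixed` errors in the model cell rescale the predictions by a ratio of two means (`oran_real / oran_predictions`) before computing the MSE. A minimal NumPy sketch of that calculation, with a hypothetical function name and made-up numbers:

```python
import numpy as np

def rescaled_mse(y_true, y_pred, reference_mean, prediction_mean):
    """MSE after multiplying predictions by reference_mean / prediction_mean."""
    scaled = np.asarray(y_pred) * reference_mean / prediction_mean
    return float(np.mean((np.asarray(y_true) - scaled) ** 2))

y_true = np.array([0.10, 0.20, 0.30])
y_pred = np.array([0.05, 0.10, 0.15])  # made-up predictions, half the true values

plain = float(np.mean((y_true - y_pred) ** 2))
fixed = rescaled_mse(y_true, y_pred,
                     reference_mean=y_true.mean(),
                     prediction_mean=y_pred.mean())
print(plain, fixed)
```

In this toy case the predictions are off by a constant factor, so the rescaling removes almost all of the error. In the notebook the two means come from different sets (the training targets and the test predictions), so the rescaled error there may be larger or smaller than the plain one.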

## Creating the test dataframes with predicted humidity change

In [8]:
field_1_results_on = df[['timestamp','Soil humidity 1']][(df['segmentation']==1) & (df['train_test']==1) & (df['Irrigation field 1']==1)]
field_1_results_on['predictions'] = predictions_xgb_1_on

field_1_results_off = df[['timestamp','Soil humidity 1']][(df['segmentation']==1) & (df['train_test']==1) & (df['Irrigation field 1']==0)]
field_1_results_off['predictions'] = predictions_xgb_1_off

field_1_results = pd.concat([field_1_results_on,field_1_results_off])
field_1_results.sort_values(by=['timestamp'], inplace=True)

##

field_2_results_on = df[['timestamp','Soil humidity 2']][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 2']==1)]
field_2_results_on['predictions'] = predictions_xgb_2_on

field_2_results_off = df[['timestamp','Soil humidity 2']][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 2']==0)]
field_2_results_off['predictions'] = predictions_xgb_2_off

field_2_results = pd.concat([field_2_results_on,field_2_results_off])
field_2_results.sort_values(by=['timestamp'], inplace=True)

##

field_3_results_on = df[['timestamp','Soil humidity 3']][(df['segmentation']==2) & (df['train_test']==1) & (df['Irrigation field 3']==1)]
field_3_results_on['predictions'] = predictions_xgb_3_on

field_3_results_off = df[['timestamp','Soil humidity 3']][(df['segmentation']==2) & (df['train_test']==1) & (df['Irrigation field 3']==0)]
field_3_results_off['predictions'] = predictions_xgb_3_off

field_3_results = pd.concat([field_3_results_on,field_3_results_off])
field_3_results.sort_values(by=['timestamp'], inplace=True)

##

field_4_results_on = df[['timestamp','Soil humidity 4']][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 4']==1)]
field_4_results_on['predictions'] = predictions_xgb_4_on

field_4_results_off = df[['timestamp','Soil humidity 4']][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 4']==0)]
field_4_results_off['predictions'] = predictions_xgb_4_off

field_4_results = pd.concat([field_4_results_on,field_4_results_off])
field_4_results.sort_values(by=['timestamp'], inplace=True)

I now have the predicted humidity change for each field. To calculate the soil humidity, I go through the records one by one and add the predicted change for each record to the previous record's humidity.

In addition to the model output, I blended the average humidity change for the irrigation on and off cases into each prediction, with a weight of 3.5 against 1 for the model output, so that my predictions do not get too volatile.

Since Zindi said we can use the peak humidity levels in the test data, whenever the record-by-record calculation reaches one of those peak humidity moments, I use the measured value as my new soil humidity. So I didn't use future data; I only adjusted my predictions when a measured humidity was available.

I feel this is more reasonable: the model predicts humidity for a field, but when a real measured value arrives for some moment, it should adjust itself to that newly measured value.
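
The blending and the record-by-record calculation described above can be sketched on toy numbers. The function names `blend_with_mean` and `roll_forward` and all values are assumptions for illustration; the 3.5 weight is the one used in the next cell.

```python
import numpy as np
import pandas as pd

def blend_with_mean(predictions, mean_change, weight=3.5):
    """Shrink each predicted change toward the average change seen in training."""
    return (np.asarray(predictions) + weight * mean_change) / (weight + 1)

def roll_forward(start_humidity, changes, measured):
    """Accumulate predicted changes, restarting from any measured value."""
    humidity = []
    previous = start_humidity
    for change, observed in zip(changes, measured):
        # Use the measurement when there is one, else previous + predicted change.
        current = previous + change if pd.isnull(observed) else observed
        humidity.append(current)
        previous = current
    return humidity

changes = blend_with_mean([1.0, 1.0, 1.0, 1.0], mean_change=0.1)
measured = [np.nan, np.nan, 40.0, np.nan]  # one known humidity at the third record
path = roll_forward(30.0, changes, measured)
print(changes, path)
```

Each raw prediction of 1.0 is pulled down to 0.3 by the blend, and the path jumps to the measured 40.0 at the third record before continuing from there.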

In [9]:
import warnings
warnings.filterwarnings('ignore')

##################################### FIELD 1 #####################################

field_1_results_on = df[['timestamp','Soil humidity 1']][(df['segmentation']==1) & (df['train_test']==1) & (df['Irrigation field 1']==1)]
field_1_results_on['predictions'] = (predictions_xgb_1_on+3.5*df_1_train_on['target'].mean())/4.5

field_1_results_off = df[['timestamp','Soil humidity 1']][(df['segmentation']==1) & (df['train_test']==1) & (df['Irrigation field 1']==0)]
field_1_results_off['predictions'] = (predictions_xgb_1_off+3.5*df_1_train_off['target'].mean())/4.5

field_1_results = pd.concat([field_1_results_on,field_1_results_off])
field_1_results.sort_values(by=['timestamp'], inplace=True)

field_1_ek = df[['timestamp','Soil humidity 1']][(df['segmentation']==1) & (df['train_test']==0)][-1:]
field_1_ek['predictions'] = 0

field_1_results = pd.concat([field_1_ek,field_1_results])
field_1_results.reset_index(drop=True, inplace=True)
field_1_results['calculated_humidity'] = field_1_results['Soil humidity 1']

i = 1
i_max = field_1_results.index.max()

while i <= i_max:
    current_humidity = field_1_results.loc[i, 'calculated_humidity']
    
    if pd.isnull(current_humidity):
        # No measurement here: previous humidity plus the predicted change
        previous_humidity = field_1_results.loc[i-1, 'calculated_humidity']
        difference = field_1_results.loc[i, 'predictions']
        value = previous_humidity + difference
        field_1_results.loc[i, 'calculated_humidity'] = value
    # Otherwise a measured value is present and is kept as it is
               
    print('Field 1: '+str(i)+' / '+str(i_max)+'   ', end="\r")
    i = i+1


##################################### FIELD 2 #####################################

field_2_results_on = df[['timestamp','Soil humidity 2']][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 2']==1)]
field_2_results_on['predictions'] = (predictions_xgb_2_on+3.5*df_2_train_on['target'].mean())/4.5

field_2_results_off = df[['timestamp','Soil humidity 2']][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 2']==0)]
field_2_results_off['predictions'] = (predictions_xgb_2_off+3.5*df_2_train_off['target'].mean())/4.5

field_2_results = pd.concat([field_2_results_on,field_2_results_off])
field_2_results.sort_values(by=['timestamp'], inplace=True)

field_2_ek = df[['timestamp','Soil humidity 2']][(df['segmentation']==3) & (df['train_test']==0)][-1:]
field_2_ek['predictions'] = 0

field_2_results = pd.concat([field_2_ek,field_2_results])
field_2_results.reset_index(drop=True, inplace=True)
field_2_results['calculated_humidity'] = field_2_results['Soil humidity 2']

i = 1
i_max = field_2_results.index.max()

while i <= i_max:
    current_humidity = field_2_results.loc[i, 'calculated_humidity']
    
    if pd.isnull(current_humidity):
        # No measurement here: previous humidity plus the predicted change
        previous_humidity = field_2_results.loc[i-1, 'calculated_humidity']
        difference = field_2_results.loc[i, 'predictions']
        value = previous_humidity + difference
        field_2_results.loc[i, 'calculated_humidity'] = value
    # Otherwise a measured value is present and is kept as it is
               
    print('Field 2: '+str(i)+' / '+str(i_max)+'   ', end="\r")
    i = i+1

    

##################################### FIELD 3 #####################################

field_3_results_on = df[['timestamp','Soil humidity 3']][(df['segmentation']==2) & (df['train_test']==1) & (df['Irrigation field 3']==1)]
field_3_results_on['predictions'] = (predictions_xgb_3_on+3.5*df_3_train_on['target'].mean())/4.5

field_3_results_off = df[['timestamp','Soil humidity 3']][(df['segmentation']==2) & (df['train_test']==1) & (df['Irrigation field 3']==0)]
field_3_results_off['predictions'] = (predictions_xgb_3_off+3.5*df_3_train_off['target'].mean())/4.5

field_3_results = pd.concat([field_3_results_on,field_3_results_off])
field_3_results.sort_values(by=['timestamp'], inplace=True)

field_3_ek = df[['timestamp','Soil humidity 3']][(df['segmentation']==2) & (df['train_test']==0)][-1:]
field_3_ek['predictions'] = 0

field_3_results = pd.concat([field_3_ek,field_3_results])
field_3_results.reset_index(drop=True, inplace=True)
field_3_results['calculated_humidity'] = field_3_results['Soil humidity 3']

i = 1
i_max = field_3_results.index.max()

while i <= i_max:
    current_humidity = field_3_results.loc[i, 'calculated_humidity']
    
    if pd.isnull(current_humidity):
        # No measurement here: previous humidity plus the predicted change
        previous_humidity = field_3_results.loc[i-1, 'calculated_humidity']
        difference = field_3_results.loc[i, 'predictions']
        value = previous_humidity + difference
        field_3_results.loc[i, 'calculated_humidity'] = value
    # Otherwise a measured value is present and is kept as it is
               
    print('Field 3: '+str(i)+' / '+str(i_max)+'   ', end="\r")
    i = i+1


##################################### FIELD 4 #####################################

field_4_results_on = df[['timestamp','Soil humidity 4']][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 4']==1)]
field_4_results_on['predictions'] = (predictions_xgb_4_on+3.5*df_4_train_on['target'].mean())/4.5

field_4_results_off = df[['timestamp','Soil humidity 4']][(df['segmentation']==3) & (df['train_test']==1) & (df['Irrigation field 4']==0)]
field_4_results_off['predictions'] = (predictions_xgb_4_off+3.5*df_4_train_off['target'].mean())/4.5

field_4_results = pd.concat([field_4_results_on,field_4_results_off])
field_4_results.sort_values(by=['timestamp'], inplace=True)

field_4_ek = df[['timestamp','Soil humidity 4']][(df['segmentation']==3) & (df['train_test']==0)][-1:]
field_4_ek['predictions'] = 0

field_4_results = pd.concat([field_4_ek,field_4_results])
field_4_results.reset_index(drop=True, inplace=True)
field_4_results['calculated_humidity'] = field_4_results['Soil humidity 4']

i = 1
i_max = field_4_results.index.max()

while i <= i_max:
    current_humidity = field_4_results.loc[i, 'calculated_humidity']
    
    if pd.isnull(current_humidity):
        # No measurement here: previous humidity plus the predicted change
        previous_humidity = field_4_results.loc[i-1, 'calculated_humidity']
        difference = field_4_results.loc[i, 'predictions']
        value = previous_humidity + difference
        field_4_results.loc[i, 'calculated_humidity'] = value
    # Otherwise a measured value is present and is kept as it is
               
    print('Field 4: '+str(i)+' / '+str(i_max)+'   ', end="\r")
    i = i+1


Field 4: 1748 / 1748   

## Exporting the results

In [10]:
df_export_1 = pd.DataFrame()
df_export_1['ID'] = field_1_results['timestamp'][1:]+' x Soil humidity 1'
df_export_1['Values'] = field_1_results['calculated_humidity'][1:]

df_export_2 = pd.DataFrame()
df_export_2['ID'] = field_2_results['timestamp'][1:-1]+' x Soil humidity 2'
df_export_2['Values'] = field_2_results['calculated_humidity'][1:-1]

df_export_3 = pd.DataFrame()
df_export_3['ID'] = field_3_results['timestamp'][1:]+' x Soil humidity 3'
df_export_3['Values'] = field_3_results['calculated_humidity'][1:]

df_export_4 = pd.DataFrame()
df_export_4['ID'] = field_4_results['timestamp'][1:-19]+' x Soil humidity 4'
df_export_4['Values'] = field_4_results['calculated_humidity'][1:-19]

df_export = pd.concat([df_export_1,df_export_2,df_export_3,df_export_4])
df_export.to_csv('Sertac_Ozker_submission.csv',index=False)
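
The per-field export blocks share one shape: skip the seed row at position 0, optionally trim some trailing rows, and build the `ID` string from the timestamp. A sketch with a hypothetical helper `to_submission` and a made-up results frame (the `trim_end` parameter stands in for the `[1:-1]` and `[1:-19]` slices above):

```python
import pandas as pd

def to_submission(results, field, trim_end=0):
    """Build the ID/Values frame for one field from its results frame."""
    rows = results.iloc[1:len(results) - trim_end]
    return pd.DataFrame({
        'ID': rows['timestamp'] + f' x Soil humidity {field}',
        'Values': rows['calculated_humidity'],
    })

# Made-up results: row 0 plays the role of the last training record.
toy = pd.DataFrame({
    'timestamp': ['2019-03-27 00:00:00', '2019-03-27 00:05:00', '2019-03-27 00:10:00'],
    'calculated_humidity': [60.0, 59.9, 59.8],
})
sub = to_submission(toy, field=1)
trimmed = to_submission(toy, field=2, trim_end=1)
print(sub)
```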