# Processing the data

## Step 1: Load the cleaned dataset

In [1]:
import pandas as pd

X = pd.read_pickle('X.pkl')
y = pd.read_pickle('y.pkl')
print(X)
print(y)

               Rg        rH
1      467.124900  59.81609
2      567.265600  54.25453
3      510.038200  50.37691
4      562.113500  47.96687
5      489.934200  45.71235
...           ...       ...
11360   -2.632179  87.36536
11361   -0.885231  88.51657
11362   -1.904163  87.13349
11363   -1.932571  87.48014
11364   -1.310510  87.57803

[11361 rows x 2 columns]
            NEE
1     -1.282251
2     -5.457097
3     -3.174246
4     -3.362150
5     -3.737496
...         ...
11360  2.652203
11361  0.893950
11362  1.672291
11363  0.844095
11364  0.832299

[11361 rows x 1 columns]
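
Note that the index runs up to 11364 while there are only 11361 rows, so the index is not contiguous. Before going on, it can be worth confirming that X and y still share the same index and contain no missing values. The following is a minimal sketch of such a check; `X_demo`, `y_demo` and `check_aligned` are hypothetical names, and the small frames only stand in for the pickled data.

```python
import pandas as pd

# Hypothetical stand-ins for X and y; in the notebook they come from read_pickle.
X_demo = pd.DataFrame(
    {"Rg": [467.1249, 567.2656, 510.0382], "rH": [59.81609, 54.25453, 50.37691]},
    index=[1, 2, 4],  # deliberately non-contiguous, like the real index
)
y_demo = pd.DataFrame({"NEE": [-1.282251, -5.457097, -3.174246]}, index=[1, 2, 4])


def check_aligned(features, target):
    """Return True if both frames share the same index and contain no NaN."""
    same_index = features.index.equals(target.index)
    no_missing = not (features.isna().any().any() or target.isna().any().any())
    return same_index and no_missing


print(X_demo.shape, y_demo.shape)
print(check_aligned(X_demo, y_demo))
```

If the check fails on the real data, the rows of X and y no longer correspond one to one, and the split in the next step would pair features with the wrong targets.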


## Step 2: Split the dataset

In [2]:
from sklearn.model_selection import train_test_split

'''
The data is split into training and test sets
'''
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train)
print(X_train.size)
print(X_test)
print(X_test.size)

               Rg        rH
3982   237.787300  40.95904
589    611.959500  55.40580
7695    -2.076583  82.20747
2158    -2.639731  94.10475
6620    -3.033176  77.12545
...           ...       ...
7817    -2.844407  86.26534
10959   -0.778399  88.79167
906     -1.962875  93.33716
5196     2.517354  79.44413
236    502.330800  63.65034

[8520 rows x 2 columns]
17040
              Rg         rH
9668   89.029820   86.29816
9207  780.027400   36.52398
8007   -0.977286   98.51484
2249   -2.867806   90.99128
9200  343.705000   59.81927
...          ...        ...
3194   -0.983599   87.66892
9891    6.345633   68.90066
6479   -0.931993   97.25005
7946   -1.456769   85.97623
5007   -1.289494  100.03700

[2841 rows x 2 columns]
5682


In [3]:
print(y_train)
print(y_train.size)
print(y_test)
print(y_test.size)

            NEE
3982   1.478472
589   -4.249636
7695   3.894052
2158   3.945240
6620   4.294096
...         ...
7817   1.878224
10959  0.973065
906    3.682107
5196   2.751083
236   -1.585572

[8520 rows x 1 columns]
8520
           NEE
9668 -1.346135
9207 -6.474561
8007  4.732932
2249  4.329803
9200 -4.570991
...        ...
3194  0.641850
9891  2.095709
6479  1.144712
7946  2.675923
5007  0.973291

[2841 rows x 1 columns]
2841
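
The sizes printed above follow from the defaults of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows (rounded up) go to the test set, which for 11361 rows gives 2841 test rows and 8520 training rows. Note also that `DataFrame.size` counts elements (rows times columns), which is why `X_train.size` is 17040 rather than 8520. The sketch below illustrates both points on synthetic data; the `*_demo` names and the random values are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same column names as the notebook.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 2)), columns=["Rg", "rH"])
y_demo = pd.DataFrame({"NEE": rng.normal(size=1000)})

# test_size=0.25 is the default; it is written out here to make it explicit.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

print(X_tr.shape, X_te.shape)  # (rows, columns) of each part
print(X_tr.size)               # number of elements: rows * columns
print(len(X_tr))               # number of rows only
```

Using `len(...)` or `.shape[0]` is usually the clearer way to report how many samples ended up in each part.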


## Step 3: Train the model
The goal is to predict the future behavior of a time series, so I apply a regression model.

Below I initialize the model as the object 'regreModel' and then call its 'fit' method, which trains the model on the training data.

The model used is the Multi-layer Perceptron Regressor (MLP Regressor). This is an Artificial Neural Network composed of different layers with several neurons each.

The number of layers, the number of neurons per layer, and many other parameters can be set when the model is created. For the full description, see the scikit-learn documentation for this model (https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#examples-using-sklearn-neural-network-mlpregressor)

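I only change the following settings from their defaults:
1. random_state to 1, which seeds the random number generator used for weight and bias initialization (and for batch sampling), so that the results are reproducible. It does not set the weights themselves to 1.
2. max_iter to 400 (the default is 200), in order to give the solver more iterations to converge.
3. verbose to True, in order to print progress messages to stdout.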
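
To illustrate what random_state does in practice, here is a small sketch on synthetic data: two models built with the same seed should end up with identical predictions, while a different seed generally gives a different result. The data, the helper `fit_and_predict` and the reduced `max_iter=50` are assumptions chosen only to keep the example fast.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1]


def fit_and_predict(seed):
    """Train a small MLP with the given seed and return a few predictions."""
    model = MLPRegressor(random_state=seed, max_iter=50)
    with warnings.catch_warnings():
        # With so few iterations the solver may stop before converging.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        model.fit(X_demo, y_demo)
    return model.predict(X_demo[:5]), model.n_iter_


pred_a, n_iter_a = fit_and_predict(1)
pred_b, n_iter_b = fit_and_predict(1)
pred_c, _ = fit_and_predict(2)

print(np.allclose(pred_a, pred_b))  # same seed: identical runs
print(np.allclose(pred_a, pred_c))  # different seed: usually differs
print(n_iter_a)                     # iterations actually used (at most max_iter)
```

The attribute `n_iter_` is also a quick way to see whether training stopped early or ran into the max_iter limit.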

One of the most important parameters is the solver, that is, the algorithm used to optimize the weights. This model offers the following options:
1. LBFGS, or Limited-memory BFGS, is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm using a limited amount of computer memory.
2. SGD, or Stochastic Gradient Descent, where each update uses the gradient computed on a small random subset of the training data (a mini-batch, controlled by batch_size) instead of the whole training set. This can be faster than batch gradient descent but makes the updates noisier.
3. ADAM (default), where adam stands for Adaptive Moment estimation. The adam algorithm combines ideas from momentum-based gradient descent and from adaptive learning-rate methods such as Adagrad and RMSprop.

The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.
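
As a rough way to see how the solvers behave, the sketch below trains the same network with each of them on a small synthetic dataset and prints the training score. The data and variable names are assumptions for illustration, and the numbers it prints say nothing about how the solvers would rank on the NEE data.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Small synthetic dataset (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1.0, 1.0, size=(300, 2))
y_demo = np.sin(3.0 * X_demo[:, 0]) + 0.5 * X_demo[:, 1]

scores = {}
for solver in ("lbfgs", "sgd", "adam"):
    model = MLPRegressor(solver=solver, random_state=1, max_iter=400)
    with warnings.catch_warnings():
        # Some solvers may hit max_iter before converging on this toy problem.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        model.fit(X_demo, y_demo)
    scores[solver] = model.score(X_demo, y_demo)

for solver, r2 in scores.items():
    print(f"{solver}: training R^2 = {r2:.3f}")
```

For a fair comparison on real data, the score should be computed on held-out data rather than on the training set used here.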

In [4]:
from sklearn.neural_network import MLPRegressor

regreModel = MLPRegressor(random_state = 1, max_iter = 400, verbose = True)
regreModel.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 414.97882900
Iteration 2, loss = 10.38681335
Iteration 3, loss = 4.82978731
Iteration 4, loss = 4.13940614
Iteration 5, loss = 3.99579314
Iteration 6, loss = 3.95695062
Iteration 7, loss = 3.95257378
Iteration 8, loss = 3.94276638
Iteration 9, loss = 3.94048259
Iteration 10, loss = 3.93699180
Iteration 11, loss = 3.94915482
Iteration 12, loss = 3.95000307
Iteration 13, loss = 3.93166300
Iteration 14, loss = 3.93669877
Iteration 15, loss = 3.93226425
Iteration 16, loss = 3.92490388
Iteration 17, loss = 3.92684670
Iteration 18, loss = 3.92564482
Iteration 19, loss = 3.93750215
Iteration 20, loss = 3.94523789
Iteration 21, loss = 3.94423637
Iteration 22, loss = 3.91794321
Iteration 23, loss = 3.93371484
Iteration 24, loss = 3.94832099
Iteration 25, loss = 3.92464954
Iteration 26, loss = 3.92406757
Iteration 27, loss = 3.94986513
Iteration 28, loss = 3.92275847
Iteration 29, loss = 3.90784902
Iteration 30, loss = 3.90828279
Iteration 31, loss = 3.94528057
Iteration 32, 
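
The `y = column_or_1d(y, warn=True)` line at the top of the output is the tail of a scikit-learn warning: `y_train` is a one-column DataFrame, while the regressor expects a 1-D target for a single output. The model still trains, but the warning can be avoided by passing a 1-D array. A minimal sketch on synthetic data (the `*_demo` names and values are assumptions):

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in with the same layout as the notebook data.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 2)), columns=["Rg", "rH"])
y_demo = pd.DataFrame({"NEE": X_demo["Rg"] - X_demo["rH"]})

# Convert the single-column frame to a 1-D array
# (y_demo.values.ravel() would work as well).
y_1d = y_demo["NEE"].to_numpy()
print(y_demo.shape, y_1d.shape)

model = MLPRegressor(random_state=1, max_iter=50)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X_demo, y_1d)

# No shape-conversion warning should have been recorded.
print(any(issubclass(w.category, DataConversionWarning) for w in caught))
```

In the notebook this would correspond to something like `regreModel.fit(X_train, y_train.values.ravel())`.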

## Step 4: Checking the model

In [5]:
regreModel.score(X_test, y_test)

0.5254528228752913
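
For a regressor, `score` returns the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets. The sketch below computes R² by hand on a few made-up numbers and compares it with `sklearn.metrics.r2_score`; the arrays are illustrative and unrelated to the NEE data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions (illustrative only).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))
```

Read this way, the score of about 0.525 above means the model explains roughly half of the variance of NEE in the test set.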

## Step 5: Testing the model

In [6]:
regreModel.predict(X_test[:2])

array([-0.16271392, -5.71196625])
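
These two predictions correspond to the first two test rows (index 9668 and 9207), whose true NEE values were printed in Step 2. Putting them side by side gives a quick feel for the error; the sketch below simply copies the numbers from the outputs above into a small table (the variable names are assumptions).

```python
import numpy as np
import pandas as pd

# Values copied from the notebook outputs above.
predicted = np.array([-0.16271392, -5.71196625])
actual = pd.Series([-1.346135, -6.474561], index=[9668, 9207], name="NEE")

comparison = pd.DataFrame(
    {"actual": actual.to_numpy(), "predicted": predicted},
    index=actual.index,
)
comparison["error"] = comparison["predicted"] - comparison["actual"]
print(comparison)
```

Two rows are of course not representative; the score from Step 4 summarizes the whole test set.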

## Step 6: Saving the model

In [30]:
'''
The commented-out import below no longer works:
sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23,
so joblib is imported directly instead.
'''
#from sklearn.externals import joblib

import joblib

joblib.dump(regreModel, 'regreModel.model')

['regreModel.model']
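
To use the saved model later, it can be read back with `joblib.load`. The sketch below shows a full round trip on a small synthetic model and checks that the restored model predicts the same values; the data, the file name `demo.model` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile
import warnings

import joblib
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Train a small throwaway model on synthetic data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo[:, 0] + X_demo[:, 1]

model = MLPRegressor(random_state=1, max_iter=50)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    model.fit(X_demo, y_demo)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.model")
    joblib.dump(model, path)
    restored = joblib.load(path)

same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

In the notebook the equivalent call would be `joblib.load('regreModel.model')`. Since joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.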