# Welcome to the OBD SDD BE!

## Adrien Delgado
## Corentin Lefloch

Today, the goal is to understand how a distributed system can be useful when dealing with medium to large scale data sets.  
We'll see that Dask starts to be useful as soon as the data we need to process doesn't quite fit in memory, and also when we
need to launch several computations in parallel.

In this evaluation, you will:
- Use Dask to read and explore the multi-gigabyte input dataset in an interactive way,
- Preprocess the data in a distributed way: cleaning it up and adding some useful features,
- Launch some model training that can be parallelized,
- Reduce the dataset and train more accurate models on less data,
- Run a hyperparameter search to find the best model on a small sample of data.

In order to run and fill in this notebook, you'll first need to deploy a Dask-enabled Kubernetes cluster as seen last week. Please use the Kubernetes_DaskHub notebook for the steps to do so. __Be careful, it has been updated to use a Docker image containing ML libraries since last time!__

Once JupyterHub is up, you can clone the OBD repository from a JupyterLab terminal to get this notebook, and select the default kernel.
```
git clone https://github.com/SupaeroDataScience/OBD.git
```

## The Dataset

The dataset contains some statistics about NY taxi cab rides.

See https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview, or https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data.
        
The goal of this evaluation is to build a model, using machine learning techniques, that predicts the fare amount
of a taxi ride given the other input parameters we have.

The model will be evaluated using the root mean squared error (RMSE) metric:  
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation.
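
As a reminder, the RMSE is the square root of the mean of the squared differences between predicted and actual fares, so one large error weighs more than several small ones. Here is a minimal sketch in plain Python; the `rmse` helper and the toy fares are illustrative and not part of the Kaggle code:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy fares in dollars (made-up values, only to exercise the function).
actual = [10.0, 20.0, 30.0]
predicted = [13.0, 20.0, 26.0]
toy_rmse = rmse(actual, predicted)
print(round(toy_rmse, 3))  # errors of 3, 0 and 4 dollars give about 2.887
```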

## Try to analyze the data using Kaggle's starter code

As an introduction, we'll use Kaggle's starter code to get some insights into the data set and the
computations we'll do, and to measure the performance of the pandas library (non-parallelized access).

See https://www.kaggle.com/dster/nyc-taxi-fare-starter-kernel-simple-linear-model where this comes from.

On our data set (train and test available in gs://obd-dask), we'll see that the Kaggle method doesn't give a really good score.

#### Reading the data with pandas

We're only reading the first 10,000 rows here, a small fraction of the whole data set.

In [None]:
%%time
import pandas as pd
train_df =  pd.read_csv('gs://obd-dask/train.csv', nrows = 10_000)
train_df.dtypes

#### Analysing the dataset, adding some features and dropping null values

Let's see whether there is a link between passenger count and fare amount.

In [None]:
%%time
train_df.groupby(train_df.passenger_count).fare_amount.mean()

Maybe adding some features about the distance of the trip could be a good idea?

In [None]:
%%time
# Add two new features, 'abs_diff_longitude' and 'abs_diff_latitude', representing
# the "Manhattan vector" from the pickup location to the dropoff location.
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

add_travel_vector_features(train_df)

Are there some undefined values?

In [None]:
%%time
print(train_df.isnull().sum())

In [None]:
%%time
print('Old size: %d' % len(train_df))
train_df = train_df.dropna(how = 'any', axis = 'rows')
print('New size: %d' % len(train_df))

In [None]:
%%time
print(train_df.isnull().sum())

#### Quick analysis of the new features and outlier cleanup

In [None]:
%%time
plot = train_df.iloc[:2000].plot.scatter('abs_diff_longitude', 'abs_diff_latitude')

In [None]:
%%time
print('Old size: %d' % len(train_df))
train_df = train_df[(train_df.abs_diff_longitude < 5.0) & (train_df.abs_diff_latitude < 5.0)]
print('New size: %d' % len(train_df))

In [None]:
%%time
plot = train_df.iloc[:2000].plot.scatter('abs_diff_longitude', 'abs_diff_latitude')

#### Get training features and results

In [None]:
%%time
import numpy as np

# using the travel vector, plus a 1.0 for a constant bias term.
def get_input_matrix(df):
    return np.column_stack((df.abs_diff_longitude, df.abs_diff_latitude, np.ones(len(df))))

train_X = get_input_matrix(train_df)
train_y = np.array(train_df['fare_amount'])

print(train_X.shape)
print(train_y.shape)

#### Train a simple linear model using Numpy

In [None]:
%%time
# The lstsq function returns several things, and we only care about the actual weight vector w.
(w, _, _, _) = np.linalg.lstsq(train_X, train_y, rcond = None)
print(w)
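
To see what `np.linalg.lstsq` returns, here is a small self-contained sketch on synthetic data. The weights 2.0 and 3.0 and the bias 1.5 are made up for the illustration; with noise-free targets, the solver should recover them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features: two columns standing in for the travel vector.
features = rng.uniform(0.0, 1.0, size=(100, 2))
# Append a column of ones so that the last weight acts as the bias term.
X = np.column_stack((features, np.ones(len(features))))

true_w = np.array([2.0, 3.0, 1.5])  # made-up weights and bias
y = X @ true_w                      # noise-free targets

# lstsq returns (solution, residuals, rank, singular values).
w_hat, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to [2.0, 3.0, 1.5]
```

Predictions are then a plain matrix product, `X @ w_hat`, which is what `np.matmul` computes on the test set.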

#### Make prediction on our test set and measure performance

In [None]:
test_df =  pd.read_csv('gs://obd-dask/test.csv')
test_df.dtypes

In [None]:
add_travel_vector_features(test_df)
test_X = get_input_matrix(test_df)

In [None]:
test_y_predictions = np.matmul(test_X, w).round(decimals = 2)

In [None]:
test_y_ref = test_df.fare_amount

In [None]:
from sklearn.metrics import mean_squared_error
# RMSE = square root of the mean squared error
np.sqrt(mean_squared_error(test_y_ref, test_y_predictions))

OK, so an RMSE of about $5.23, which is not that bad... But we can do better.

<span style="color:#EB5E0B;font-style:italic">

### Some questions on this first Analysis

- What is the most expensive part of the analysis, the one that takes the most time (see the %%time we used above)?
</span>

In [None]:
## The most expensive part of the analysis was the data loading in the first cell

<span style="color:#EB5E0B;font-style:italic">
    
- Try to load the whole dataset with Pandas and comment.
</span>

In [None]:
%%time
# train_df_full =  pd.read_csv('gs://obd-dask/train.csv')
## My kernel died: the whole dataset does not fit in the memory of the notebook server

# Processing our data set using dask

Dask will help us process the whole input data set. It is really useful when the input data is too big to fit in memory. In this case, it can stream the computation block by block on one computer, or distribute the computation over several computers.

This is what we'll do next!
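
The idea of streaming a computation block by block can be sketched with pandas alone: read the file chunk by chunk, keep only small running totals, and combine them at the end. This is only an illustration on a tiny in-memory CSV with made-up fares; Dask automates the same pattern and can spread the blocks over several workers.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a file too large for memory.
fares = [4.5, 16.9, 5.7, 7.7, 5.3, 12.1]
csv_text = "fare_amount\n" + "\n".join(str(v) for v in fares)

total = 0.0
count = 0
# chunksize makes read_csv return an iterator of small DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["fare_amount"].sum()
    count += len(chunk)

# Only two numbers were kept in memory between chunks.
streamed_mean = total / count
print(streamed_mean)
```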

### Start an appropriate sized Dask cluster for our analysis

We'll need a Dask cluster to preprocess the data and distribute some learning. The following code starts one in our K8S infrastructure.

In [None]:
from dask_gateway import Gateway
# Use values stored in your local configuration (recommended)
gateway = Gateway()

In [None]:
#cluster.shutdown()

In [None]:
gateway.list_clusters()
#cluster = gateway.connect('daskhub.5c7ed511372b4367a324a9eab59794a5')

In [None]:
cluster = gateway.new_cluster(worker_cores=1, worker_memory=3.4)
cluster

__Please click on the Dashboard link above, it will help you a lot!__

In [None]:
client = cluster.get_client()
cluster.scale(16)

### Launch some computation, what about Pi ?

Just to check our cluster is working!

We'll use Dask array for this, a parallel extension of NumPy arrays that we'll use again later on for the Machine Learning part of this evaluation.

In [None]:
%%time
import dask.array as da

sample = 10_000_000_000  # <- this is huge!
xxyy = da.random.uniform(-1, 1, size=(2, sample))
norm = da.linalg.norm(xxyy, axis=0)
summ = da.sum(norm <= 1)
insiders = summ.compute()
pi = 4 * insiders / sample
print("pi ~= {}".format(pi))
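
The estimate works because points drawn uniformly in the square [-1, 1] x [-1, 1] fall inside the unit circle with probability pi/4. The same computation on a much smaller sample fits in plain NumPy on a single machine; the sample size and the seed below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 1_000_000  # small enough for a single machine

# One row for x coordinates, one row for y coordinates.
points = rng.uniform(-1.0, 1.0, size=(2, n_points))
# Distance of each point to the origin.
distances = np.linalg.norm(points, axis=0)
inside = int(np.sum(distances <= 1.0))

pi_estimate = 4 * inside / n_points
print("pi ~= {}".format(pi_estimate))
```

The Dask version above follows the same steps, with the random array split into chunks that the workers generate and reduce in parallel.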

## Now, access the data of our BE using Dask

We'll use Dask Dataframe, a distributed version of Pandas Dataframe.

See https://docs.dask.org/en/latest/dataframe.html.

<span style="color:#EB5E0B;font-style:italic">

So instead of using Pandas to load the dataset, just use the equivalent dask method from dask.dataframe.

- Fill the following cell with the appropriate code to read the data using Dask.
</span>

In [None]:
import dask.dataframe as dd

In [None]:
%%time
df = dd.read_csv('gs://obd-dask/train.csv')
df

<span style="color:#EB5E0B;font-style:italic">

### Some questions about this data loading

- That was fast for several gigabytes, wasn't it? Why is that, what did we actually do?
- Why does the returned dataframe look empty?
- See the number of partitions described above? What does it correspond to?
</span>

In [None]:
## Dask only reads a small sample at the start of the file to infer the column names and types
## The dataframe looks empty because the data has not been read yet
## 85 partitions for the whole dataset. Each one is a block of the file, loaded as its own sub-dataframe so that the read can be parallelized

## Little warm up: Analyzing our data to better understand it

<span style="color:#EB5E0B;font-style:italic">

- First, how many records do we have? (hint, in python, len() works for almost any object).
</span>

In [None]:
%%time
len(df)

<span style="color:#EB5E0B;font-style:italic">
    
- What happened when counting the records of our Dask dataframe? Remember the Spark tutorial: transformations and actions... The same kind of concepts exist in Dask. Just look at the Dask Dashboard!
</span>

In [None]:
## For each partition, Dask reads the data and computes its length; the per-partition lengths are then summed

<span style="color:#EB5E0B;font-style:italic">
    
- Compare the time of this computation to the time of loading a subset of the dataset with Pandas. Is it fast enough considering the number of workers we have?
</span>

In [None]:
%%time
# Read the equivalent of one partition (1/85 of all rows) with pandas
n_rows = 54869617//85

train_df =  pd.read_csv('gs://obd-dask/train.csv', nrows = n_rows)

In [None]:
## 2.9 seconds * 85 is way more than 16 seconds! Dask is efficient!
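
The comparison above can be made explicit with a little arithmetic. The timings are the ones quoted in the answer (2.9 s for one partition's worth of rows with pandas, and 16 s, presumably for the full count with Dask). The extrapolation assumes that pandas reading time grows linearly with the number of rows, which is only an approximation.

```python
pandas_seconds_per_partition = 2.9   # quoted above, for 1/85 of the rows
n_partitions = 85
dask_seconds = 16.0                  # quoted above
n_workers = 16

# Estimated time for pandas to read everything sequentially.
pandas_estimate = pandas_seconds_per_partition * n_partitions
speedup = pandas_estimate / dask_seconds
efficiency = speedup / n_workers

print("pandas estimate: {:.1f} s".format(pandas_estimate))
print("speedup: {:.1f}x, parallel efficiency: {:.0%}".format(speedup, efficiency))
```

Under these assumptions the speedup is close to the number of workers, which suggests that reading the file parallelizes well.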

Let's have a look at some data:

In [None]:
%%time
df.head()

<span style="color:#EB5E0B;font-style:italic">
    
- Why was it faster than the count records operation above? What did we read?
</span>

In [None]:
## head() only needs the first partition to return the first 5 rows, instead of reading everything

<span style="color:#EB5E0B;font-style:italic">
    
- Let's compute the mean of the fare given the passenger_count, as we've done with Pandas. Please fill the blank. (hint: don't forget the compute() call)
</span>

In [None]:
df = dd.read_csv('gs://obd-dask/train.csv')

In [None]:
%%time
df.groupby('passenger_count').fare_amount.mean().compute()

Wow, ever seen a cab with more than 200 people??
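
A grouped mean cannot be obtained by averaging per-partition means, since partitions hold different numbers of rows per group. A common approach, sketched below with plain pandas on two made-up partitions, is to compute per-partition sums and counts and to combine them at the end. This illustrates the idea only and is not a description of Dask's actual internals.

```python
import pandas as pd

# Two made-up partitions of the same dataset.
part_1 = pd.DataFrame({"passenger_count": [1, 1, 2], "fare_amount": [4.0, 6.0, 10.0]})
part_2 = pd.DataFrame({"passenger_count": [1, 2, 2], "fare_amount": [8.0, 12.0, 14.0]})

def partial_aggregate(part):
    """Per-partition sum and count of fares for each passenger_count."""
    return part.groupby("passenger_count").fare_amount.agg(["sum", "count"])

# "Map" step: one small table per partition.
partials = [partial_aggregate(p) for p in (part_1, part_2)]

# "Reduce" step: add up sums and counts per group, then divide.
combined = pd.concat(partials).groupby(level=0).sum()
mean_by_count = combined["sum"] / combined["count"]
print(mean_by_count)
```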

<span style="color:#EB5E0B;font-style:italic">

- This is a bit slow, much slower than with Pandas. Why? Which part of the computation is slow? Look at the Dashboard to see the names of the tasks.
</span>

In [None]:
## Reading the data is the longest

<span style="color:#EB5E0B;font-style:italic">
    
- How could we optimize the next computations? Where will the data be at the end?
</span>

In [None]:
df = df.persist()

<span style="color:#EB5E0B;font-style:italic">
    
- Look at the Dashboard to see what is happening behind the scenes.
At the end, try the computation again:
</span>

In [None]:
%%time
# Same computation of the mean fare_amount, now on the persisted dataframe
df.groupby('passenger_count').fare_amount.mean().compute()

Much better isn't it?

<span style="color:#EB5E0B;font-style:italic">
    
### Some other questions to practice

- Can you see a correlation between the fare amount and the dropoff latitude? Answer by doing a dask dataframe computation.

First you'll need to round the dropoff latitude to have some sort of categories using Series.round() function.

Then, just group by this new column to get an answer (and don't forget to call compute to get the results).
</span>

In [None]:
df.round({'dropoff_latitude' : 0}).groupby('dropoff_latitude').fare_amount.mean().compute()

OK, this doesn't give a lot of insight, but it looks like we've got some strange values somewhere!

<span style="color:#EB5E0B;font-style:italic">

- Let's just have a look at the non-extreme values, so probably some records in the middle of the results.
We first need to sort the resulting series by index before looking at the middle of it.
</span>

In [None]:
fare_by_lat = df.round({'dropoff_latitude' : 1}).groupby('dropoff_latitude').fare_amount.mean().compute().sort_index()

OK, this is not really useful, but it's an exercise!

In [None]:
fare_by_lat.iloc[550:580]

<span style="color:#EB5E0B;font-style:italic">
    
- Do you think we could parallelize things better for any of our computation or data access?
</span>

In [None]:
## Maybe, maybe not, we are not bad!

## Let's do some preprocessing of our data to clean it up and add some features

<span style="color:#EB5E0B;font-style:italic">

- You'll need to do the same operations as in pandas; just call compute() when you need a result, and not while building the dataframe transformations.
</span>

<span style="color:#EB5E0B;font-style:italic">

#### Cleaning up

- Are there any null values in our data?
</span>

In [None]:
print(df.isnull().sum().compute())

<span style="color:#EB5E0B;font-style:italic">
    
- Yep! We must get rid of them...
</span>

In [None]:
%%time
print('Old size: %d' % len(df))
df = df.dropna()
print('New size: %d' % len(df))

#### Adding features

<span style="color:#EB5E0B;font-style:italic">

- As with Pandas above, add the latitude and longitude distance vector with a function call
</span>

In [None]:
%%time

add_travel_vector_features(df)

A quick look at our Dataframe to check things

In [None]:
df.head()

<span style="color:#EB5E0B;font-style:italic">
    
- Now let's quickly plot a subset of our travel vector features to see its distribution. Use dask.dataframe.sample() to get about five percent of the rows, and get it back with compute and plot like with Pandas
</span>

In [None]:
df.sample(frac=0.05).compute().plot.scatter('abs_diff_longitude', 'abs_diff_latitude')

Wow, looks like we have some strange values here: more than 1000° of distance... There's a problem somewhere.

<span style="color:#EB5E0B;font-style:italic">
    
- Just get rid of the extreme values; we should stay within the city limits or so. Like with Pandas.
</span>

In [None]:
%%time
print('Old size: %d' % len(df))
df = df[(df.abs_diff_longitude < 5.0) & (df.abs_diff_latitude < 5.0)]
print('New size: %d' % len(df))

<span style="color:#EB5E0B;font-style:italic">
    
- What is triggering the computation in the examples above?
</span>

In [None]:
## Calls that need a concrete result trigger the computation: len(df), head() and .compute() in the examples above.
## Before that, Dask only stages the computations without actually running them

<span style="color:#EB5E0B;font-style:italic">

- You can do another plot like above with the filtered values if you like.
</span>

In [None]:
df.sample(frac=0.05).compute().plot.scatter('abs_diff_longitude', 'abs_diff_latitude')

OK, let's see some statistics on our dataset. The describe() function inherited from Pandas computes a lot of statistics on a dataframe.

In [None]:
%%time
df.describe().compute()

<span style="color:#EB5E0B;font-style:italic">
    
- Are there some values that still look odd to you in here?
</span>

In [None]:
## $93k in fare, -$300, latitudes in the thousands, same for longitudes, 208 passengers...

## Training a model in a distributed way

Let's begin with a linear model that we can distribute with Dask ML.

### Building our feature vectors

Here again define a method so that we can use it later for our test set evaluation.

<span style="color:#EB5E0B;font-style:italic">
    
- Just do the same as with the Pandas example by defining a get_input_matrix(df) function. But this time you'll generate a dask array using the to_dask_array(lengths=True) method on the dataframe. You should write a method that generates the X input features dask array, and another for the y training results. You can also write a single method that returns both.
- It is a good idea to persist() the arrays in memory in or after the call.
- This time, we'll add the feature 'passenger_count' in addition to the distance vectors.
</span>

In [None]:
train_df = df # rename df for understanding purposes

In [None]:
%%time
import numpy as np

# using the travel vector, and the passenger count.
def get_input_matrix(df):
    return df[["abs_diff_longitude", "abs_diff_latitude", "passenger_count"]].to_dask_array(lengths=True), df.fare_amount.to_dask_array(lengths=True)

train_X, train_y = get_input_matrix(train_df)

print(train_X.shape)
print(train_y.shape)

Then we get the values, and display train_X to have some insights of its size and chunking scheme.

In [None]:
train_X

### Distributed training of a linear model

Be careful, this can take time. Try first with a few iterations (max_iter = 5).

See https://ml.dask.org/glm.html  
and https://ml.dask.org/modules/generated/dask_ml.linear_model.LinearRegression.html#dask_ml.linear_model.LinearRegression

<span style="color:#EB5E0B;font-style:italic">
    
- Train a LinearRegression model from dask_ml.linear_model on our inputs
</span>


In [None]:
from dask_ml.linear_model import LinearRegression

In [None]:
lr = LinearRegression(max_iter=5)
lr.fit(train_X, train_y)

## Evaluating our model


#### First we should load the test set.

In [None]:
test_df = dd.read_csv('gs://obd-dask/test.csv')
test_df

Adding our features to the test set and getting our feature array

In [None]:
add_travel_vector_features(test_df)
test_X, test_y = get_input_matrix(test_df)
test_X

We can use the score method inherited from scikit-learn; it gives some hints about the model performance.

In [None]:
lr.score(test_X, test_y)

Just get the numpy arrays for computing final score, this is small.

In [None]:
test_X = test_X.compute()
test_y = test_y.compute()

In [None]:
lr.predict(test_X)

#### Compute the RMSE

https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation

In [None]:
from sklearn.metrics import mean_squared_error
# RMSE = square root of the mean squared error
np.sqrt(mean_squared_error(test_y, lr.predict(test_X)))

<span style="color:#EB5E0B;font-style:italic">
    
- What RMSE did you get? Compare it to the Pandas only computation.
</span>

In [None]:
## Around 20 cents better than with Pandas

<span style="color:#EB5E0B;font-style:italic">
    
- Why is this model not really good?
</span>

In [None]:
## Because we are using a linear model on data which has no apparent reason of having linear correlations

## Use Dask to scale computation on Hyper Parameter Search

As seen above, Dask is well suited to distributing data and learning a model on a big data set. However, not all models can be trained in parallel on sub-chunks of data. See https://scikit-learn.org/stable/computing/scaling_strategies.html for the compatible scikit-learn models, for example.

Dask can also be used to train several models in parallel on small datasets; this is what we'll try now.

We will just take a sample of the training set, and try to learn several models with different hyper parameters, and find the best one.

Dask Hyper parameter search : https://ml.dask.org/hyper-parameter-search.html.

First we'll take a small subset of the data; 5% is a maximum if we want to avoid memory issues on our workers and keep reasonable training times. You can try with less if the results are still good.

In [None]:
# Take a sample of the input data (still a lazy Dask dataframe at this point)
train_sample_df = train_df.sample(frac=0.05, random_state=123456)
# Get feature vectors out of it
train_sample_X, train_sample_y = get_input_matrix(train_sample_df)

In order to optimize things, we can also change the type of the features to more appropriate and small types.

We also need to use Numpy arrays, so we'll gather the result from Dask to local variable.

In [None]:
train_sample_X = train_sample_X.astype('float32').compute()
train_sample_y = train_sample_y.astype('float32').compute()
train_sample_X

What size is our dataset?

In [None]:
import sys
sys.getsizeof(train_sample_X)

About 32MB, this is still quite a big dataset for standard machine learning.
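
The order of magnitude can be checked by hand: a `float32` value takes 4 bytes, and the sample has 3 features per row. The row count below is an estimate (5% of the 54869617 rows used earlier in this notebook, ignoring the rows dropped during cleaning), so the result is approximate.

```python
import numpy as np

total_rows = 54_869_617       # row count used earlier in this notebook
sample_rows = int(total_rows * 0.05)
n_features = 3
bytes_per_value = np.dtype("float32").itemsize  # 4 bytes

estimated_mb = sample_rows * n_features * bytes_per_value / 1e6
print("estimated size: {:.1f} MB".format(estimated_mb))

# For an actual array, nbytes gives the size of the data buffer directly.
small = np.zeros((1000, n_features), dtype="float32")
print(small.nbytes)  # 1000 * 3 * 4 = 12000 bytes
```

Keeping `float64` instead would double this figure, which is why the cast to `float32` is worth doing before gathering the sample.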

<span style="color:#EB5E0B;font-style:italic">

- Now, use the Dask hyperparameter search API to distribute the search. You can either use the joblib integration with scikit-learn or dask_ml directly. Be careful: do not use models that take too long to train, and at first limit their complexity and the number of hyperparameter combinations you try. Hint: start with a simple linear model like SGDRegressor and no more than 10 iterations per model.
</span>

In [None]:
from sklearn.linear_model import SGDRegressor
from dask_ml.model_selection import GridSearchCV

In [None]:
%%time
sgd = SGDRegressor(max_iter=10)
params = {'alpha': [0.001, 0.01, 0.1, 0.5, 0.99],
          'penalty': ['l2', 'l1', 'elasticnet']
         }
search = GridSearchCV(sgd, params)
search.fit(train_sample_X, train_sample_y)

In [None]:
search.score(test_X, test_y)

In [None]:
from sklearn.metrics import mean_squared_error
# RMSE = square root of the mean squared error
np.sqrt(mean_squared_error(test_y, search.predict(test_X)))

<span style="color:#EB5E0B;font-style:italic">

- So how does the result compare to distributed learning with a linear model on the whole dataset?
    
</span>

In [None]:
## It is about the same so far, but as it is with only 5% on the data, 
## we think we can consider it is not too bad compared to the linear regressor

<span style="color:#EB5E0B;font-style:italic">
    
- Can you do better with Random forest? Caution: use limited trees, small number of estimators < 5 and max_depth < 40...
</span>

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
%%time
rfr = RandomForestRegressor(n_estimators=5, max_depth=30)
params = {
    'min_samples_split': [2, 5, 10],
    'max_features': [1.0, 'sqrt', 'log2'],  # 1.0 means all features (what 'auto' did for regressors)
    'min_impurity_decrease': [0.0, 0.01]
}
search = GridSearchCV(rfr, params)
search.fit(train_sample_X, train_sample_y)

<span style="color:#EB5E0B;font-style:italic">
    
- What do you observe on the Dask Dashboard when training the RandomForest models? Can you explain why there are so many tasks?
</span>

In [None]:
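## GridSearchCV fits one model per hyper-parameter combination and per cross-validation fold
## (3 x 3 x 2 = 18 combinations here), and each fit and each score is a separate Dask task, hence the many tasks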
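
The size of a search grid can be enumerated explicitly, which helps to estimate how much work a grid search schedules: each combination of hyper-parameters is trained once per cross-validation fold. The sketch below uses only the standard library and a grid with three, three and two values. The number of folds `n_folds` is an assumption here, since the actual default depends on the library version.

```python
from itertools import product

params = {
    "min_samples_split": [2, 5, 10],
    "max_features": [1.0, "sqrt", "log2"],
    "min_impurity_decrease": [0.0, 0.01],
}
n_folds = 3  # assumed number of cross-validation folds

# Every combination of one value per hyper-parameter.
names = list(params)
combinations = [dict(zip(names, values)) for values in product(*params.values())]

n_fits = len(combinations) * n_folds
print(len(combinations), "combinations,", n_fits, "model fits")
```

Adding one more hyper-parameter with three values would triple these numbers, so it pays to keep the grid small at first.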

In [None]:
search.score(test_X, test_y)

In [None]:
from sklearn.metrics import mean_squared_error
# RMSE = square root of the mean squared error
np.sqrt(mean_squared_error(test_y, search.predict(test_X)))

<span style="color:#EB5E0B;font-style:italic">
    
- Did you get better results with RandomForest? Why?
</span>

In [None]:
# We got better results with randomForest than with linear regression because the data 
# we try to analyse does not necessarily imply linear correlations between its features,
# hence the poor performance of the LinearRegression

<span style="color:#EB5E0B;font-style:italic">
    
# Extend this notebook
    
Try to do better!

- Add new features to the input data using Dask Dataframes, or clean it better. Reapply the learning above with these new features. Do you get better results? Some suggestions for better learning:
  - Max passenger count of 208, maybe we should ignore this value? Rides with 0 passengers? Try to drop some data.
  - Apply some normalisation or regularization or other feature transformation? See https://ml.dask.org/preprocessing.html.
  - There are 0m rides?
  - Negative fare amount?? Drop some data.
  - Maybe the hour of the day, or the month, has some impact on fares? Try to add features. See https://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes for some hints on how to do this.
  - Maybe try to find a way to use the start and drop off locations?
- Improve the model parameters or find a better one. Try using this time dask_ml HyperbandSearchCV. See https://ml.dask.org/hyper-parameter-search.html#basic-use. You can use it for example with https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor.

</span>


In [None]:
## Cleaning a bit the dataset

In [None]:
train_df.describe().compute()

In [None]:
## Clean fare_amount
print('Old size: %d' % len(train_df))
train_df = train_df[(train_df.fare_amount > 0.0) & (train_df.fare_amount < 1000)]
print('New size: %d' % len(train_df))

In [None]:
## Clean passenger_count
print('Old size: %d' % len(train_df))
train_df = train_df[(train_df.passenger_count > 0) & (train_df.passenger_count < 20)]
print('New size: %d' % len(train_df))

In [None]:
## Clean ride length
print('Old size: %d' % len(train_df))
train_df = train_df[(train_df.abs_diff_longitude > 0) | (train_df.abs_diff_latitude > 0)]
print('New size: %d' % len(train_df))

In [None]:
# let's try to add the pickup hour and month

def get_month(datetime):
    # Month digits, assuming strings laid out as 'YYYY-MM-DD hh:mm:ss ...'
    return int(datetime[5:7])

def get_hour(datetime):
    # Hour digits of the same layout
    return int(datetime[11:13])

def add_datetime_features(df):
    # meta tells Dask the name and dtype of each resulting series
    df['pickup_month'] = df.pickup_datetime.apply(get_month, meta=('pickup_month', 'int64'))
    df['pickup_hour'] = df.pickup_datetime.apply(get_hour, meta=('pickup_hour', 'int64'))

add_datetime_features(train_df)
add_datetime_features(test_df)
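
The slicing relies on the fixed layout of the `pickup_datetime` strings, assumed here to look like `2009-06-15 17:26:21 UTC` (an illustrative value in that format): characters 5 to 6 hold the month and characters 11 to 12 hold the hour. A quick check in plain Python:

```python
from datetime import datetime

sample = "2009-06-15 17:26:21 UTC"  # illustrative timestamp in the assumed format

# String slicing, as done in add_datetime_features.
month_from_slice = int(sample[5:7])
hour_from_slice = int(sample[11:13])

# Cross-check with a real datetime parser on the first 19 characters.
parsed = datetime.strptime(sample[:19], "%Y-%m-%d %H:%M:%S")

print(month_from_slice, hour_from_slice)
print(parsed.month, parsed.hour)
```

Slicing is cheap but silently returns wrong values if the layout ever changes, whereas a parser would raise an error.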

In [None]:
# using the travel vector, the pickup month and hour and the passenger count.
def get_new_input_matrix(df):
    return df[["abs_diff_longitude", "abs_diff_latitude", "passenger_count", "pickup_month", "pickup_hour"]].to_dask_array(lengths=True), df.fare_amount.to_dask_array(lengths=True)

train_X, train_y = get_new_input_matrix(train_df)
test_X, test_y = get_new_input_matrix(test_df)

In [None]:
train_X

In [None]:
test_X

In [None]:
# Take a sample of the input data (still a lazy Dask dataframe at this point)
train_sample_df = train_df.sample(frac=0.05, random_state=123456)
# Get feature vectors out of it
train_sample_X, train_sample_y = get_new_input_matrix(train_sample_df)

In [None]:
train_sample_X

In [None]:
train_sample_X = train_sample_X.persist()
train_sample_X = train_sample_X.astype('float32').compute()
train_sample_y = train_sample_y.astype('float32').compute()
train_sample_X

In [None]:
%%time
# Let's try again the same RandomForestRegressor with the new features and cleaned dataset


rfr = RandomForestRegressor(n_estimators=5, max_depth=30)
params = {
    'min_samples_split': [2, 5, 10],
    'max_features': [1.0, 'sqrt', 'log2'],  # 1.0 means all features (what 'auto' did for regressors)
    'min_impurity_decrease': [0.0, 0.01]
}
search = GridSearchCV(rfr, params)
search.fit(train_sample_X, train_sample_y)

In [None]:
search.score(test_X, test_y)

In [None]:
from sklearn.metrics import mean_squared_error
# RMSE = square root of the mean squared error
np.sqrt(mean_squared_error(test_y, search.predict(test_X)))

In [None]:
# For an unknown reason, the RandomForestRegressor performs worse once the data is cleaned...

In [None]:
from dask_ml.model_selection import HyperbandSearchCV
from sklearn.neural_network import MLPRegressor

In [None]:
%%time
mlp = MLPRegressor(hidden_layer_sizes=100, max_iter=100)
params = {
    # np.logspace takes exponents: logspace(-4, 0) spans 1e-4 to 1
    'alpha': np.logspace(-4, 0, num=5),
    'learning_rate_init': np.logspace(-3, -1, num=3),
#     'solver': ['sgd', 'adam'],
#     'learning_rate': ['constant', 'adaptive', 'invscaling'],
}
search = HyperbandSearchCV(mlp, params, max_iter=10)
search.fit(train_sample_X, train_sample_y)

In [None]:
search.score(test_X, test_y)

In [None]:
from sklearn.metrics import mean_squared_error
# RMSE = square root of the mean squared error
np.sqrt(mean_squared_error(test_y, search.predict(test_X)))

In [None]:
# The neural network does not seem to perform better than the RandomForestRegressor in that case.

In [None]:
# In the end, the best result we got was with the RandomForestRegressor, before cleaning the data and adding the hour,
# and we had an error of around $3.58