## FDMS TME3  

Kaggle [How Much Did It Rain? II](https://www.kaggle.com/c/how-much-did-it-rain-ii)

Florian Toque & Paul Willot  

### Notes
We tried different models, such as SVM regression, MLP, Random Forest and KNN, as recommended by the winning team of the Kaggle competition on taxi trajectories. So far Random Forest seems to be the best, slightly better than the SVM.  
The new features we extracted had only a very small impact on predictions.

In [2]:
# from __future__ import exam_success
from __future__ import absolute_import
from __future__ import print_function

%matplotlib inline
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
import pandas as pd
import scipy.stats as stats

# Sk cheats
from sklearn.cross_validation import cross_val_score  # cross val
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import Imputer   # get rid of nan
from sklearn.neighbors import KNeighborsRegressor
from sklearn import grid_search
import os

* 13,765,202 lines in train.csv  
*  8,022,757 lines in test.csv  

### A few words about the dataset

Predictions are made for the corn-growing states of the USA (mainly Iowa, Illinois, Indiana) during the season with the highest rainfall (as illustrated by [Iowa](https://en.wikipedia.org/wiki/Iowa#Climate) for the months of April to August).

The Kaggle page indicates that the dataset has been shuffled, so working on a subset seems acceptable.  
The test set, however, is not extracted from the same data as the training set, which makes the evaluation trickier.

### Load the dataset

In [4]:
%%time
#filename = "data/train.csv"
filename = "data/reduced_train_100000.csv"
#filename = "data/reduced_train_1000000.csv"
raw = pd.read_csv(filename)
raw = raw.set_index('Id')

CPU times: user 232 ms, sys: 52.3 ms, total: 284 ms
Wall time: 286 ms


In [5]:
raw.columns

Index([u'minutes_past', u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th', u'Expected'],
      dtype='object')

In [6]:
raw['Expected'].describe()

count    100000.000000
mean        129.579825
std         687.622542
min           0.010000
25%           0.254000
50%           1.016000
75%           3.556002
max       32740.617000
Name: Expected, dtype: float64

Per Wikipedia, a **value of more than 421 mm/h is considered "Extreme/large hail"**  
If we encounter the value 32.74 meters per hour (the maximum above, 32740.617 mm), we should probably start building Noah's ark  
Therefore, it seems reasonable to **drop values that are too large**, treating them as outliers

In [7]:
# Considering that the gauge may concentrate the rainfall, the cap is set to 300 (1000 is the alternative noted on the filter line)
# Comment out the filter line to analyse the complete dataset 
l = len(raw)
raw = raw[raw['Expected'] < 300]  #1000
print("Dropped %d (%0.2f%%)"%(l-len(raw),(l-len(raw))/float(l)*100))

Dropped 6241 (6.24%)


In [8]:
raw.head(5)

Id,minutes_past,radardist_km,Ref,Ref_5x5_10th,Ref_5x5_50th,Ref_5x5_90th,RefComposite,RefComposite_5x5_10th,RefComposite_5x5_50th,RefComposite_5x5_90th,...,RhoHV_5x5_90th,Zdr,Zdr_5x5_10th,Zdr_5x5_50th,Zdr_5x5_90th,Kdp,Kdp_5x5_10th,Kdp_5x5_50th,Kdp_5x5_90th,Expected
1,3,10,,,,,,,,,...,,,,,,,,,,0.254
1,16,10,,,,,,,,,...,,,,,,,,,,0.254
1,25,10,,,,,,,,,...,,,,,,,,,,0.254
1,35,10,,,,,,,,,...,,,,,,,,,,0.254
1,45,10,,,,,,,,,...,,,,,,,,,,0.254


In [9]:
raw.describe()

Unnamed: 0,minutes_past,radardist_km,Ref,Ref_5x5_10th,Ref_5x5_50th,Ref_5x5_90th,RefComposite,RefComposite_5x5_10th,RefComposite_5x5_50th,RefComposite_5x5_90th,...,RhoHV_5x5_90th,Zdr,Zdr_5x5_10th,Zdr_5x5_50th,Zdr_5x5_90th,Kdp,Kdp_5x5_10th,Kdp_5x5_50th,Kdp_5x5_90th,Expected
count,93759.0,93759.0,44923.0,38391.0,45078.0,53038.0,47946.0,42219.0,48042.0,55118.0,...,42064.0,35938.0,30835.0,35925.0,42064.0,31231.0,26418.0,31283.0,36505.0,93759.0
mean,29.68483,11.022334,23.684482,20.788948,23.378688,26.427731,25.424874,22.956797,25.139201,27.982365,...,1.014742,0.597837,-0.564851,0.429577,2.018197,-0.013098,-3.383198,-0.429909,3.855601,5.837679
std,17.418876,4.259865,10.224306,9.073503,9.936862,11.186952,10.627954,9.638466,10.372406,11.535892,...,0.045336,1.388384,0.974288,0.864887,1.539513,3.747791,2.771442,2.194894,3.762005,22.764656
min,0.0,0.0,-29.0,-31.5,-31.5,-26.5,-26.5,-27.5,-25.0,-23.0,...,0.208333,-7.875,-7.875,-7.875,-7.875,-52.880005,-51.42,-46.87001,-41.54001,0.01
25%,15.0,9.0,17.5,15.5,17.5,19.5,19.0,17.5,18.5,20.5,...,0.998333,-0.0625,-1.0,0.0625,1.125,-1.410004,-4.230011,-0.710007,1.759994,0.254
50%,30.0,12.0,24.0,21.0,23.5,27.0,25.5,23.0,25.5,28.5,...,1.005,0.5,-0.5,0.375,1.6875,0.0,-2.809998,0.0,3.169998,0.83
75%,45.0,14.0,30.5,27.0,30.5,34.5,33.0,29.5,32.5,36.5,...,1.051667,1.125,0.0,0.75,2.5,1.409988,-1.740006,0.349991,5.289993,2.794001
max,59.0,21.0,64.5,57.0,61.5,67.5,68.0,59.5,64.0,79.5,...,1.051667,7.9375,5.9375,7.9375,7.9375,47.84999,1.759994,5.62999,43.20999,244.00012


We regroup the data by ID

In [10]:
# We select all features except for the minutes past,
# because we ignore the time distribution of the sequence for now

features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getXy(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX, docY = [], []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
            docY.append(float(raw.loc[i]["Expected"]))
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
            docY.append(float(raw.loc[i][:1]["Expected"]))
    X , y = np.array(docX) , np.array(docY)
    return X,y
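The per-Id lookups in `getXy` (`data.loc[i]` inside a Python loop) can become slow on the full file. Below is a minimal sketch of the same grouping done with `groupby`, assuming the frame is indexed by `Id` as above. The name `group_by_id` and the toy frame are hypothetical, the sketch keeps every non-target column rather than the explicit `selected_columns` list, and `DataFrame.to_numpy` assumes a recent pandas (it plays the role of `as_matrix`).

```python
import numpy as np
import pandas as pd

def group_by_id(frame, target="Expected"):
    """Return one 2-D array of observations per Id, plus one target per Id."""
    feature_cols = [c for c in frame.columns if c != target]
    sequences, targets = [], []
    # sort=False keeps the Ids in their order of first appearance
    for _, hour in frame.groupby(level=0, sort=False):
        sequences.append(hour[feature_cols].to_numpy())
        # the gauge reading is repeated on every row of an Id; keep the first
        targets.append(float(hour[target].iloc[0]))
    return sequences, np.array(targets)

# toy data: Id 1 has three radar observations, Id 2 has one
toy = pd.DataFrame(
    {"minutes_past": [3, 16, 25, 5],
     "Ref": [np.nan, 20.0, 22.5, 30.0],
     "Expected": [0.254, 0.254, 0.254, 1.016]},
    index=pd.Index([1, 1, 1, 2], name="Id"),
)
seqs, targets = group_by_id(toy)
```

An Id with a single row still yields a 2-D array of shape `(1, n_features)`, so the special case for `pd.Series` is not needed in this variant.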

### On fully filled dataset

In [11]:
#noAnyNan = raw.loc[raw[features_columns].dropna(how='any').index.unique()]
noAnyNan = raw.dropna()

In [12]:
noFullNan = raw.loc[raw[features_columns].dropna(how='all').index.unique()]

In [13]:
fullNan = raw.drop(raw[features_columns].dropna(how='all').index)

In [15]:
print(len(raw))
print(len(noAnyNan))
print(len(noFullNan))
print(len(fullNan))

93759
22158
70149
23610
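The three subsets differ in how `dropna` is applied: `noAnyNan` keeps individual rows without any NaN, while `noFullNan` and `fullNan` split whole Ids depending on whether at least one of their rows carries a radar measurement (which is why 70149 + 23610 = 93759). A small sketch on a made-up frame, with hypothetical variable names and only two feature columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"Ref": [np.nan, 20.0, np.nan, np.nan, 30.0],
     "Kdp": [np.nan, np.nan, np.nan, np.nan, 1.5]},
    index=pd.Index([1, 1, 2, 2, 3], name="Id"),
)

# Ids with at least one row carrying some radar measurement
informative_ids = toy.dropna(how="all").index.unique()

no_full_nan = toy.loc[informative_ids]   # every row of Ids 1 and 3
full_nan = toy.drop(informative_ids)     # every row of Id 2
no_any_nan = toy.dropna()                # only rows with no NaN at all

print(len(no_full_nan), len(full_nan), len(no_any_nan))
```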


---
# Predictions


As a first try, we make predictions on the complete data, and return the 50th percentile for the incomplete and fully empty data

In [40]:
%%time
#X,y=getXy(noAnyNan)
X,y=getXy(noFullNan)

CPU times: user 3.61 s, sys: 24.2 ms, total: 3.63 s
Wall time: 3.64 s


In [17]:
%%time
#XX = [np.array(t).mean(0) for t in X]
XX = [np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) ) for t in X]

CPU times: user 349 ms, sys: 5.06 ms, total: 354 ms
Wall time: 371 ms




In [33]:
t = np.array([[10,1,10],
            [20,np.nan,12],
            [30,20,30]])
np.nanpercentile(t,90,axis=0)

array([ 28. ,  18.1,  26.4])

In [122]:
# used to fill fully empty data
global_means = np.nanmean(noFullNan,0)

# reduce the sequence structure of the data and produce
# new, hopefully informative, features
def addFeatures(X):
    # used to fill fully empty data
    #global_means = np.nanmean(X,0)
    
    XX=[]
    nbFeatures=float(len(X[0][0]))
    for t in X:
        
        # compute means, ignoring nan when possible, marking it when fully filled with nan
        nm = np.nanmean(t,0)
        tt=[]
        for idx,j in enumerate(nm):
            if np.isnan(j):
                nm[idx]=global_means[idx]
                tt.append(1)
            else:
                tt.append(0)
        tmp = np.append(nm,np.append(tt,tt.count(0)/nbFeatures))
        
        # faster if working on fully filled data:
        #tmp = np.append(np.nanmean(np.array(t),0),(np.array(t)[1:] - np.array(t)[:-1]).sum(0) )
        
        # add the percentiles
        tmp = np.append(tmp,np.nanpercentile(t,10,axis=0))
        tmp = np.append(tmp,np.nanpercentile(t,50,axis=0))
        tmp = np.append(tmp,np.nanpercentile(t,90,axis=0))
        
        for idx,i in enumerate(tmp):
            if np.isnan(i):
                tmp[idx]=0

        # adding the dbz as a feature
        test = t
        try:
            taa=test[:,0]
        except TypeError:
            taa=[test[0][0]]
        valid_time = np.zeros_like(taa)
        valid_time[0] = taa[0]
        for n in xrange(1,len(taa)):
            valid_time[n] = taa[n] - taa[n-1]
        valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
        valid_time = valid_time / 60.0


        sum=0
        try:
            column_ref=test[:,2]
        except TypeError:
            column_ref=[test[0][2]]
        for dbz, hours in zip(column_ref, valid_time):
            # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
            if np.isfinite(dbz):
                mmperhr = pow(pow(10, dbz/10)/200, 0.625)
                sum = sum + mmperhr * hours

        XX.append(np.append(np.array(sum),tmp))
        #XX.append(np.array([sum]))
        #XX.append(tmp)
    return XX
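The first step inside `addFeatures` (per-column means, a global fallback for columns that are entirely NaN, one flag per column and the ratio of filled columns) can be sketched without the inner Python loop. This is only a sketch under assumptions, not the notebook's implementation: `mean_with_missing_flags` and the `fallback` values are hypothetical, and the percentiles and the rain estimate are left out.

```python
import warnings
import numpy as np

def mean_with_missing_flags(seq, fallback):
    """Column means of one sequence; all-NaN columns get the fallback and a flag of 1."""
    seq = np.asarray(seq, dtype=float)
    all_nan = np.isnan(seq).all(axis=0)
    with warnings.catch_warnings():
        # nanmean warns on all-NaN columns; that case is handled just below
        warnings.simplefilter("ignore", category=RuntimeWarning)
        means = np.nanmean(seq, axis=0)
    means = np.where(all_nan, fallback, means)
    filled_ratio = 1.0 - all_nan.mean()
    # same layout as in addFeatures: means, then flags, then the filled ratio
    return np.concatenate([means, all_nan.astype(float), [filled_ratio]])

seq = np.array([[10.0, np.nan, 1.0],
                [20.0, np.nan, 3.0]])
fallback = np.array([0.0, 7.0, 0.0])   # hypothetical global means
features = mean_with_missing_flags(seq, fallback)
```

Keeping the flags next to the imputed means lets a tree-based model tell a real measurement from a filled-in one.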

In [123]:
%%time
XX=addFeatures(X)

CPU times: user 21.2 s, sys: 149 ms, total: 21.4 s
Wall time: 21.6 s


In [60]:
XX[2]

array([  2.71759937,  28.15384615,   9.        ,  26.6       ,
        20.07142857,  25.8       ,  30.26923077,  26.66666667,
        21.09090909,  25.11538462,  32.23076923,   0.98833334,
         0.92817184,   0.98833334,   1.01583333,  -1.125     ,
        -0.56485122,   0.5       ,   1.515625  ,   7.029999  ,
        -3.38319791,   0.        ,   6.3299943 ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   1.        ,   0.        ,
         0.        ,   0.        ,   1.        ,   0.        ,
         0.        ,   0.        ,   1.        ,   0.        ,
         0.        ,   0.86363636,   5.8       ,   9.        ,
        16.45      ,   9.7       ,  15.95      ,  18.4       ,
        11.9       ,  12.        ,  13.5       ,  20.3       ,
         0.98833334,   0.        ,   0.98833334,   1.00033333,
        -1.125     ,   0.        ,   0.5       ,   0.56

In [61]:
def splitTrainTest(X, y, split=0.2):
    tmp1, tmp2 = [], []
    ps = int(len(X) * (1-split))
    index_shuf = range(len(X))
    random.shuffle(index_shuf)
    for i in index_shuf:
        tmp1.append(X[i])
        tmp2.append(y[i])
    return tmp1[:ps], tmp2[:ps], tmp1[ps:], tmp2[ps:]
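`random.shuffle(range(len(X)))` relies on Python 2, where `range` returns a list. Below is a sketch of the same shuffled hold-out split that does not depend on the Python version, assuming NumPy's `default_rng` is available. `split_train_test` and its `seed` argument are hypothetical additions; the return order matches `splitTrainTest`.

```python
import numpy as np

def split_train_test(X, y, split=0.2, seed=None):
    """Shuffle the samples, then hold out the last `split` fraction for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = int(len(X) * (1 - split))
    train_idx, test_idx = order[:cut], order[cut:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, y_train, X_test, y_test

# toy data where the first feature equals the target, to check the pairing
X_demo = [[i, i * 2] for i in range(10)]
y_demo = list(range(10))
X_tr, y_tr, X_te, y_te = split_train_test(X_demo, y_demo, split=0.2, seed=0)
```

Passing a fixed `seed` makes the split reproducible between runs, which helps when comparing models on the same hold-out.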

In [130]:
X_train,y_train, X_test, y_test = splitTrainTest(XX,y)

---

In [63]:
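def manualScorer(estimator, X, y):
    # NOTE: the X and y arguments are ignored; the score is always the
    # negative MSE on the global X_test / y_test hold-out
    err = (estimator.predict(X_test)-y_test)**2
    return -err.sum()/len(err)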
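scikit-learn calls a callable scorer as `scorer(estimator, X, y)` with the validation fold, whereas `manualScorer` reads the global `X_test` and `y_test`, so every fold of a grid search is scored on the same hold-out. A minimal sketch of a scorer that uses its arguments instead; `neg_mse_scorer` and `ConstantModel` are hypothetical names used only for illustration.

```python
import numpy as np

def neg_mse_scorer(estimator, X, y):
    """Negative mean squared error on the (X, y) pair handed in by the caller."""
    residual = np.asarray(estimator.predict(X)) - np.asarray(y)
    return -np.mean(residual ** 2)

class ConstantModel(object):
    """Hypothetical stand-in estimator that always predicts the same value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

# errors are 0, 1 and 2, so the mean squared error is 5/3
score = neg_mse_scorer(ConstantModel(1.0), [[0], [0], [0]], [1.0, 2.0, 3.0])
```

The sign is flipped because grid search treats larger scores as better.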

---

* max depth: 24
* number of trees: 84
* min samples per leaf: 17
* min samples to split: 51

In [64]:
from sklearn import svm

In [65]:
svr = svm.SVR(C=100000)

In [66]:
%%time
svr = svr.fit(X_train,y_train)

CPU times: user 12.3 s, sys: 74.7 ms, total: 12.3 s
Wall time: 12.4 s


In [67]:
err = (svr.predict(X_train)-y_train)**2
err.sum()/len(err)

2.21383042070573

In [68]:
err = (svr.predict(X_test)-y_test)**2
err.sum()/len(err)

206.66017740269754

In [69]:
%%time
svr_score = cross_val_score(svr, XX, y, cv=5)
print("Score: %s\nMean: %.03f"%(svr_score,svr_score.mean()))

Score: [ 0.34469104  0.72603728  0.58081123  0.21586955 -0.30521254]
Mean: 0.312
CPU times: user 54.3 s, sys: 163 ms, total: 54.5 s
Wall time: 54.5 s
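Without a `scoring` argument, `cross_val_score` falls back to the estimator's own `score` method, which for scikit-learn regressors is the coefficient of determination R². These fold scores are therefore not on the same scale as the mean squared errors printed in the other cells, and a negative value means the fold was predicted worse than a constant mean would do. A small sketch of both metrics on made-up numbers:

```python
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]   # three exact predictions and one large miss
print(mse(y_true, y_pred), r2(y_true, y_pred))
```

A single large miss is enough to push R² below zero here, which is consistent with the heavy-tailed target seen in `raw['Expected'].describe()`.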


---

In [131]:
knn = KNeighborsRegressor(n_neighbors=6,weights='distance',algorithm='ball_tree')

In [29]:
#parameters = {'weights':('distance','uniform'),'algorithm':('auto', 'ball_tree', 'kd_tree', 'brute')}
parameters = {'n_neighbors':range(1,10,1)}
grid_knn = grid_search.GridSearchCV(knn, parameters,scoring=manualScorer)

In [30]:
%%time
grid_knn.fit(X_train,y_train)

CPU times: user 2.51 s, sys: 0 ns, total: 2.51 s
Wall time: 2.51 s


GridSearchCV(cv=None,
       estimator=KNeighborsRegressor(algorithm='ball_tree', leaf_size=30, metric='minkowski',
          metric_params=None, n_neighbors=6, p=2, weights='distance'),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=<function manualScorer at 0x7fcc54db8050>, verbose=0)

In [31]:
print(grid_knn.grid_scores_)
print("Best: ",grid_knn.best_params_)

[mean: nan, std: nan, params: {'n_neighbors': 1}, mean: nan, std: nan, params: {'n_neighbors': 2}, mean: nan, std: nan, params: {'n_neighbors': 3}, mean: nan, std: nan, params: {'n_neighbors': 4}, mean: nan, std: nan, params: {'n_neighbors': 5}, mean: nan, std: nan, params: {'n_neighbors': 6}, mean: nan, std: nan, params: {'n_neighbors': 7}, mean: nan, std: nan, params: {'n_neighbors': 8}, mean: nan, std: nan, params: {'n_neighbors': 9}]
Best:  {'n_neighbors': 1}


In [32]:
knn = grid_knn.best_estimator_

In [132]:
knn= knn.fit(X_train,y_train)

In [133]:
print(knn.score(X_train,y_train))
print(knn.score(X_test,y_test))

0.995275391578
0.608658349718


In [72]:
err = (knn.predict(X_train)-y_train)**2
err.sum()/len(err)

1.2703084913435574

In [73]:
err = (knn.predict(X_test)-y_test)**2
err.sum()/len(err)

85.489427551154463

---

In [134]:
etreg = ExtraTreesRegressor(n_estimators=200, max_depth=None, min_samples_split=1, random_state=0)

In [55]:
parameters = {'n_estimators':range(100,200,20)}
grid_rf = grid_search.GridSearchCV(etreg, parameters,n_jobs=2,scoring=manualScorer)

In [37]:
%%time
grid_rf.fit(X_train,y_train)

CPU times: user 4.44 s, sys: 241 ms, total: 4.68 s
Wall time: 16.3 s


GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=1, min_weight_fraction_leaf=0.0,
          n_estimators=200, n_jobs=1, oob_score=False, random_state=0,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=2,
       param_grid={'n_estimators': [100, 120, 140, 160, 180]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=<function manualScorer at 0x110aa0ed8>, verbose=0)

In [38]:
print(grid_rf.grid_scores_)
print("Best: ",grid_rf.best_params_)

[mean: -55.73522, std: 40.35044, params: {'n_estimators': 100}, mean: -55.47051, std: 39.85010, params: {'n_estimators': 120}, mean: -56.18434, std: 40.62698, params: {'n_estimators': 140}, mean: -56.15046, std: 40.74838, params: {'n_estimators': 160}, mean: -56.37052, std: 40.72395, params: {'n_estimators': 180}]
Best:  {'n_estimators': 120}


In [39]:
grid_rf.best_params_

{'n_estimators': 120}

In [135]:
es = etreg
#es = grid_rf.best_estimator_

In [136]:
%%time
es = es.fit(X_train,y_train)

CPU times: user 18.6 s, sys: 70.4 ms, total: 18.7 s
Wall time: 18.7 s


In [137]:
print(es.score(X_train,y_train))
print(es.score(X_test,y_test))

0.995275391578
0.638642349462


In [77]:
err = (es.predict(X_train)-y_train)**2
err.sum()/len(err)

1.2703084913435574

In [78]:
err = (es.predict(X_test)-y_test)**2
err.sum()/len(err)

87.171223669446121

---

In [138]:
import xgboost as xgb

In [139]:
# the dbz feature does not influence xgbr so much
xgbr = xgb.XGBRegressor(max_depth=6, learning_rate=0.1, n_estimators=700, silent=True,
                        objective='reg:linear', nthread=-1, gamma=0, min_child_weight=1,
                        max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
                        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5,
                        seed=0, missing=None)

In [140]:
%%time
xgbr = xgbr.fit(X_train,y_train)

CPU times: user 42.3 s, sys: 1.6 s, total: 43.9 s
Wall time: 13.5 s


In [141]:
print(xgbr.score(X_train,y_train))
print(xgbr.score(X_test,y_test))

0.994996215113
0.642309801446


---

In [79]:
gbr = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=900,
                                subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_depth=4, init=None,
                                random_state=None, max_features=None, alpha=0.5,
                                verbose=0, max_leaf_nodes=None, warm_start=False)

In [80]:
%%time
gbr = gbr.fit(X_train,y_train)
#os.system('say "終わりだ"') #its over!

CPU times: user 37.2 s, sys: 229 ms, total: 37.5 s
Wall time: 37.9 s


In [82]:
#parameters = {'max_depth':range(2,5,1),'alpha':[0.5,0.6,0.7,0.8,0.9]}
#parameters = {'subsample':[0.2,0.4,0.5,0.6,0.8,1]}
#parameters = {'subsample':[0.2,0.5,0.6,0.8,1],'n_estimators':[800,1000,1200]}
#parameters = {'max_depth':range(2,4,1)}
parameters = {'n_estimators':[400,800,1100]}
#parameters = {'loss':['ls', 'lad', 'huber', 'quantile'],'alpha':[0.3,0.5,0.8,0.9]}
#parameters = {'learning_rate':[0.1,0.5,0.9]}



grid_gbr = grid_search.GridSearchCV(gbr, parameters,n_jobs=2,scoring=manualScorer)

In [83]:
%%time
grid_gbr = grid_gbr.fit(X_train,y_train)

CPU times: user 38.1 s, sys: 566 ms, total: 38.7 s
Wall time: 2min 40s


In [84]:
print(grid_gbr.grid_scores_)
print("Best: ",grid_gbr.best_params_)

[mean: -132.42728, std: 13.82615, params: {'n_estimators': 400}, mean: -129.63912, std: 18.26752, params: {'n_estimators': 800}, mean: -131.56520, std: 14.32391, params: {'n_estimators': 1100}]
Best:  {'n_estimators': 800}


In [81]:
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))

2.7346018758
92.3521455159


In [46]:
err = (gbr.predict(X_train)-y_train)**2
print(err.sum()/len(err))
err = (gbr.predict(X_test)-y_test)**2
print(err.sum()/len(err))

17.0046462288
1578.69166275


---

In [57]:
t = []
for i in XX:
    t.append(np.count_nonzero(~np.isnan(i)) / float(i.size))
pd.DataFrame(np.array(t)).describe()

Unnamed: 0,0
count,3093
mean,1
std,0
min,1
25%,1
50%,1
75%,1
max,1


---

**Here for legacy**

In [41]:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD,RMSprop

in_dim = len(XX[0])
out_dim = 1  

model = Sequential()
# Dense(128) is a fully-connected layer with 128 hidden units.
# in the first layer, you must specify the expected input data shape:
# here, in_dim-dimensional vectors.
model.add(Dense(128, input_shape=(in_dim,)))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, init='uniform'))
model.add(Activation('linear'))

#sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='mean_squared_error', optimizer=sgd)

rms = RMSprop()
model.compile(loss='mean_squared_error', optimizer=rms)

#model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
#score = model.evaluate(X_test, y_test, batch_size=16)

In [42]:
prep = []
for i in y_train:
    prep.append(min(i,20))

In [43]:
prep=np.array(prep)
mi,ma = prep.min(),prep.max()
fy = (prep-mi) / (ma-mi)
#my = fy.max()
#fy = fy/fy.max()
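The target is clipped at 20 and then min-max scaled to [0, 1] before training, so predictions have to be mapped back with the exact inverse, `scaled * (ma - mi) + mi`. A minimal sketch of the round trip on made-up values; the helper names are hypothetical.

```python
import numpy as np

def minmax_fit(values):
    values = np.asarray(values, dtype=float)
    return values.min(), values.max()

def minmax_scale(values, lo, hi):
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

def minmax_invert(scaled, lo, hi):
    # exact inverse of minmax_scale: multiply by the range, then add the minimum
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

y_demo = np.minimum([0.254, 1.016, 35.0, 3.556], 20)  # clip at 20, as for `prep`
lo, hi = minmax_fit(y_demo)
scaled = minmax_scale(y_demo, lo, hi)
restored = minmax_invert(scaled, lo, hi)
```

Because of the clipping, values above 20 cannot be recovered by the inverse; the network can only ever predict inside the clipped range.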

In [44]:
model.fit(np.array(X_train), fy, batch_size=10, nb_epoch=10, validation_split=0.1)  

Train on 2224 samples, validate on 248 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x112430c10>

In [45]:
# NOTE: the exact inverse of the scaling in In [43] is pred*(ma-mi)+mi
pred = model.predict(np.array(X_test))*ma+mi

In [46]:
err = (pred-y_test)**2
err.sum()/len(err)

182460.82171163053
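Part of this very large value may come from array shapes rather than from the model: a Keras network ending in `Dense(1)` returns predictions of shape `(n, 1)`, and subtracting a flat sequence of `n` targets broadcasts to an `(n, n)` matrix of all pairwise differences. A small NumPy sketch of the effect, with made-up numbers:

```python
import numpy as np

pred = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like a Dense(1) output
y_true = np.array([1.0, 2.0, 4.0])       # shape (3,)

broadcast_err = (pred - y_true) ** 2           # shape (3, 3): every pair is compared
per_sample_err = (pred.ravel() - y_true) ** 2  # shape (3,): one error per sample

print(broadcast_err.sum() / len(broadcast_err))  # inflated value
print(per_sample_err.mean())                     # mean squared error
```

Flattening the predictions with `.ravel()` before subtracting gives one error per sample.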

In [None]:
r = random.randrange(len(X_train))
print("(Train) Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_train[r]]))[0][0]*ma+mi,y_train[r]))

r = random.randrange(len(X_test))
print("(Test)  Prediction %0.4f, True: %0.4f"%(model.predict(np.array([X_test[r]]))[0][0]*ma+mi,y_test[r]))

---

In [None]:
def marshall_palmer(ref, minutes_past):
    #print("Estimating rainfall from {0} observations".format(len(minutes_past)))
    # how long is each observation valid?
    valid_time = np.zeros_like(minutes_past)
    valid_time[0] = minutes_past.iloc[0]
    for n in xrange(1, len(minutes_past)):
        valid_time[n] = minutes_past.iloc[n] - minutes_past.iloc[n-1]
    valid_time[-1] = valid_time[-1] + 60 - np.sum(valid_time)
    valid_time = valid_time / 60.0

    # sum up rainrate * validtime
    sum = 0
    for dbz, hours in zip(ref, valid_time):
        # See: https://en.wikipedia.org/wiki/DBZ_(meteorology)
        if np.isfinite(dbz):
            mmperhr = pow(pow(10, dbz/10)/200, 0.625)
            sum = sum + mmperhr * hours
    return sum


def simplesum(ref,hour):
    return hour.sum()

# each unique Id is an hour of data at some gauge
def myfunc(hour):
    #rowid = hour['Id'].iloc[0]
    # sort hour by minutes_past
    hour = hour.sort('minutes_past', ascending=True)
    est = marshall_palmer(hour['Ref'], hour['minutes_past'])
    return est
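`marshall_palmer` inverts the Marshall-Palmer relation Z = 200 R^1.6 (with Z = 10^(dBZ/10)) to get a rain rate R in mm/h, then weights each rate by the fraction of the hour that the observation covers. Below is a vectorized sketch of the same computation, assuming `minutes_past` is already sorted. `hourly_rain_estimate` is a hypothetical name, and `np.diff(..., prepend=...)` assumes NumPy 1.16 or later.

```python
import numpy as np

def hourly_rain_estimate(ref_dbz, minutes_past):
    ref_dbz = np.asarray(ref_dbz, dtype=float)
    minutes = np.asarray(minutes_past, dtype=float)
    # minutes covered by each observation; the last one runs to the end of the hour
    valid = np.diff(minutes, prepend=0.0)
    valid[-1] += 60.0 - minutes[-1]
    hours = valid / 60.0
    # Z = 200 * R**1.6  =>  R = (Z / 200) ** (1 / 1.6), with Z = 10 ** (dBZ / 10)
    rate = (10.0 ** (ref_dbz / 10.0) / 200.0) ** 0.625
    finite = np.isfinite(ref_dbz)   # observations without a reading contribute nothing
    return float(np.sum(rate[finite] * hours[finite]))

one_mm_per_hour = 10.0 * np.log10(200.0)   # reflectivity that maps to R = 1 mm/h
full_hour = hourly_rain_estimate([one_mm_per_hour] * 3, [10, 30, 50])
partial_hour = hourly_rain_estimate([np.nan, one_mm_per_hour], [15, 45])
```

In the second call the first 15 minutes have no reading, so only the remaining 45 minutes contribute to the estimate.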

In [None]:
info = raw.groupby(raw.index)

In [None]:
estimates = raw.groupby(raw.index).apply(myfunc)
estimates.head(20)

In [None]:
%%time
etreg.fit(X_train,y_train)

In [None]:
%%time
et_score = cross_val_score(etreg, XX, y, cv=5)
print("Score: %s\tMean: %.03f"%(et_score,et_score.mean()))

In [None]:
err = (etreg.predict(X_test)-y_test)**2
err.sum()/len(err)

In [None]:
r = random.randrange(len(X_train))
print(r)
print(etreg.predict(X_train[r]))
print(y_train[r])

r = random.randrange(len(X_test))
print(r)
print(etreg.predict(X_test[r]))
print(y_test[r])

---

In [93]:
%%time
#filename = "data/reduced_test_5000.csv"
filename = "data/test.csv"
test = pd.read_csv(filename)
test = test.set_index('Id')

CPU times: user 16.4 s, sys: 4.88 s, total: 21.3 s
Wall time: 22.3 s


In [94]:
features_columns = list([u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])

def getX(raw):
    selected_columns = list([ u'minutes_past',u'radardist_km', u'Ref', u'Ref_5x5_10th',
       u'Ref_5x5_50th', u'Ref_5x5_90th', u'RefComposite',
       u'RefComposite_5x5_10th', u'RefComposite_5x5_50th',
       u'RefComposite_5x5_90th', u'RhoHV', u'RhoHV_5x5_10th',
       u'RhoHV_5x5_50th', u'RhoHV_5x5_90th', u'Zdr', u'Zdr_5x5_10th',
       u'Zdr_5x5_50th', u'Zdr_5x5_90th', u'Kdp', u'Kdp_5x5_10th',
       u'Kdp_5x5_50th', u'Kdp_5x5_90th'])
    
    data = raw[selected_columns]
    
    docX= []
    for i in data.index.unique():
        if isinstance(data.loc[i],pd.core.series.Series):
            m = [data.loc[i].as_matrix()]
            docX.append(m)
        else:
            m = data.loc[i].as_matrix()
            docX.append(m)
    X = np.array(docX)
    return X

In [95]:
#%%time
#X=getX(test)

#tmp = []
#for i in X:
#    tmp.append(len(i))
#tmp = np.array(tmp)
#sns.countplot(tmp,order=range(tmp.min(),tmp.max()+1))
#plt.title("Number of ID per number of observations\n(On test dataset)")
#plt.plot()

In [96]:
testFull = test.dropna()

In [97]:
%%time
X=getX(testFull)  # 1min
#XX = [np.array(t).mean(0) for t in X]  # 10s

CPU times: user 1min 12s, sys: 1.87 s, total: 1min 14s
Wall time: 1min 14s


In [98]:
XX=addFeatures(X)

In [99]:
pd.DataFrame(gbr.predict(XX)).describe()

Unnamed: 0,0
count,235515.0
mean,5.131102
std,7.194517
min,-10.455328
25%,1.638401
50%,3.263728
75%,5.93211
max,167.607432


In [100]:
predFull = zip(testFull.index.unique(),gbr.predict(XX))

In [101]:
testNan = test.drop(test[features_columns].dropna(how='all').index)

In [102]:
tmp = np.empty(len(testNan))
tmp.fill(0.445000)   # 50th percentile of full Nan dataset
predNan = zip(testNan.index.unique(),tmp)

In [103]:
testLeft = test.drop(testNan.index.unique()).drop(testFull.index.unique())

In [104]:
tmp = np.empty(len(testLeft))
tmp.fill(1.27)   # constant used for the partially filled data (presumably its 50th percentile)
predLeft = zip(testLeft.index.unique(),tmp)

In [105]:
len(testFull.index.unique())

235515

In [106]:
len(testNan.index.unique())

232148

In [107]:
len(testLeft.index.unique())

249962

In [108]:
pred = predFull + predNan + predLeft

In [113]:
pred.sort(key=lambda x: x[0], reverse=False)

In [114]:
submission = pd.DataFrame(pred)
submission.columns = ["Id","Expected"]
submission.head()

Unnamed: 0,Id,Expected
0,1,1.27
1,2,1.27
2,3,2.361996
3,4,14.492731
4,5,0.445


In [115]:
submission.loc[submission['Expected']<0,'Expected'] = 0.445

In [116]:
submission.to_csv("submit4.csv",index=False)
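On Python 3, `zip` returns an iterator, so `predFull + predNan + predLeft` no longer concatenates. Below is a sketch of the same three-way assembly (model predictions, fully empty Ids, remaining Ids) with a pandas `Series`, using made-up Ids. `build_submission` is a hypothetical helper, and its default values mirror the constants used above.

```python
import numpy as np
import pandas as pd

def build_submission(model_ids, model_pred, nan_ids, left_ids,
                     nan_value=0.445, left_value=1.27):
    parts = [
        pd.Series(np.asarray(model_pred, dtype=float), index=model_ids),
        pd.Series(nan_value, index=nan_ids),     # Ids with no radar data at all
        pd.Series(left_value, index=left_ids),   # Ids with partial data
    ]
    expected = pd.concat(parts).sort_index()
    # rainfall cannot be negative; reuse the same constant as in the cell above
    expected[expected < 0] = nan_value
    return expected.rename("Expected").rename_axis("Id").reset_index()

submission_demo = build_submission(
    model_ids=[3, 4], model_pred=[2.36, -1.0],
    nan_ids=[5], left_ids=[1, 2],
)
```

Indexing by Id also removes the need for the explicit sort on the first tuple element.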

In [73]:
filename = "data/sample_solution.csv"
sol = pd.read_csv(filename)

In [74]:
sol

Unnamed: 0,Id,Expected
0,1,0.085765
1,2,0.000000
2,3,1.594004
3,4,6.913380
4,5,0.000000
5,6,0.173935
6,7,3.219921
7,8,0.867394
8,9,0.000000
9,10,14.182371


In [None]:
ss = np.array(sol)

In [None]:
%%time
for a,b in predFull:
    ss[a-1][1]=b

In [None]:
ss

In [75]:
sub = pd.DataFrame(pred)
sub.columns = ["Id","Expected"]
sub.Id = sub.Id.astype(int)
sub.head()

Unnamed: 0,Id,Expected
0,1,1.27
1,2,1.27
2,3,2.37866
3,4,8.851727
4,5,0.445


In [76]:
sub.to_csv("submit3.csv",index=False)