### Simple baseline

In this baseline we take data from the primary source, aggregate it by squares and compute some basic features from those squares.

We then fit a gradient boosting ensemble to predict whether it was raining in this particular square & hour.

For starters, let's take a look at our data.

In [42]:
TRAIN_PATH = "data/train_spb.tsv"
NETATMO_PATH = "data/train_spb_netatmo.tsv"
TEST_PATH = "data/test_spb_features.tsv"
TEST_NETATMO_PATH = "data/test_spb_netatmo.tsv"

CITY_PREDICTIONS_PATH = "intermediate_data/prediction_spb.csv"

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
#%matplotlib inline
%matplotlib
plt.style.use('ggplot')
from scipy.stats import kurtosis,skew, mode

Using matplotlib backend: Qt5Agg


In [29]:
data = pd.read_csv('data/trainspb.csv')
print(data.shape)
data.head()

(568809, 33)




Unnamed: 0.1,Unnamed: 0,hour_hash,sq_x,sq_y,EventTimestampDelta,day_hour,cell_hash,sq_time,radio,LocationSpeed,...,day,ver_hash,SignalStrength,rain,u_hashed,city_code,cell_lat,OperatorID,LocationTimestampDelta,device_model_hash
0,0,15594925468529168,-4,-1,-138,19,15359413209819631993,1499367600,2,-999,...,5,10377131132567209046,-107,True,10792001126932717568,78,59.93866,1,-137,17694276751343575040
1,1,30420971216007726,-4,-2,-1393,10,9378067244757718949,1500199200,1,13,...,15,14766465407719166092,-11,False,17843300878001616896,78,59.845276,2,-1392,16545423038250614784
2,2,15594925468529168,-4,-1,-2787,19,5828444396397487323,1499367600,1,1,...,5,2970743340414944099,-81,True,11044946288132845568,78,59.944153,20,-2788,12880924904999671808
3,3,15594925468529168,-5,-3,-183,19,1811340204703671917,1499367600,1,5,...,5,14766465407719166092,-81,False,8824907216792626176,78,59.832352,20,-180,16545423038250614784
4,4,15594925468529168,-4,-1,-2211,19,6174653592142994962,1499367600,-999,10,...,5,2970743340414944099,-51,True,8207343304811761664,78,-999.0,1,-2240,2903999018945938944


In [32]:
data[data['EventTimestampDelta']<600].mean()

-3599

__Note:__ if you're low on memory, try this:
* Most obviously, downsample data
* Read one square at a time: read it, compute features, and only then read the next square
* Entries for each cell appear as consecutive rows in the dataset, so you can just read, say, 25% of the data and process it, then go for the next 25%, etc.
* Delete training data and intermediate aggregations like `groupby` once you're done with feature engineering.
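
The "one block at a time" idea can be sketched with the standard library alone. This is a minimal illustration rather than part of the baseline: `iter_blocks` is a hypothetical helper, and it relies on the assumption stated above that rows belonging to the same block are stored consecutively. Values come back as strings, so convert them before computing features.

```python
import csv
import io
from itertools import groupby

def iter_blocks(lines, key_cols, delimiter="\t"):
    """Yield (key, rows) for every run of consecutive rows sharing the same key.

    Only one block is held in memory at a time. Assumes rows of a block are
    adjacent in the file; a key that reappears later starts a new block.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)

    def key_of(row):
        return tuple(row[col] for col in key_cols)

    for key, rows in groupby(reader, key=key_of):
        yield key, list(rows)

# Tiny in-memory stand-in for a .tsv file (made-up values for illustration)
sample = io.StringIO(
    "sq_x\tsq_y\thour_hash\tSignalStrength\n"
    "-4\t-1\t10\t-107\n"
    "-4\t-1\t10\t-81\n"
    "-5\t-3\t10\t-81\n"
)
for key, rows in iter_blocks(sample, ["sq_x", "sq_y", "hour_hash"]):
    print(key, len(rows))
```

With a real file you would pass an open file handle instead of the `StringIO` object.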

In [81]:
for x in ['kazan','msk','spb']:
    print(pd.read_csv("data/train_"+x+"_netatmo.tsv",na_values="None",sep='\t').groupby('hour_hash').count().min())

sq_x                               103
sq_y                               103
netatmo_timestamp_delta            103
netatmo_sum_rain_24h                 0
netatmo_sum_rain_1h                  0
day_hour                           103
netatmo_wind_gust_direction_deg      0
sq_time                            103
point_longitude                    103
netatmo_time_day_rain                0
netatmo_wind_timestamp               0
netatmo_wind_speed_kmh               0
netatmo_time_hour_rain               0
hours_since                        103
utc_date                           103
netatmo_wind_gust_timestamp          0
netatmo_timestamp                  103
netatmo_humidity_percent            80
netatmo_pressure_mbar              103
precipitation                      103
netatmo_wind_direction_deg           0
netatmo_temperature_c               80
day                                103
rain                               103
city_code                          103
point_latitude           



sq_x                               151
sq_y                               151
netatmo_timestamp_delta            151
netatmo_sum_rain_24h                11
netatmo_sum_rain_1h                 11
day_hour                           151
netatmo_wind_gust_direction_deg     24
sq_time                            151
point_longitude                    151
netatmo_time_day_rain               11
netatmo_wind_timestamp              24
netatmo_wind_speed_kmh              24
netatmo_time_hour_rain              11
hours_since                        151
utc_date                           151
netatmo_wind_gust_timestamp         24
netatmo_timestamp                  151
netatmo_humidity_percent           128
netatmo_pressure_mbar              151
precipitation                      151
netatmo_wind_direction_deg          24
netatmo_temperature_c              128
day                                151
rain                               151
city_code                          151
point_latitude           

#### Working with netatmo

Consumer-grade weather stations are excellent sources of data on rain. Alas, they're rather scarce and we're unlikely to find stations in every square/time block. Therefore we'll need to quickly find ones from neighboring blocks.

For performance reasons, we'll use fast nearest neighbor lookup methods from sklearn.
Note that those are not the fastest neighbor lookup methods available, but they should be enough for the baseline.

We'll query the stations with nearby latitude/longitude within the same hour. In this baseline we implicitly compute Euclidean distance over the latitude/longitude axes, which has a number of problems: a degree of longitude covers less ground the farther you move from the equator toward the poles, so east-west distances are overstated at high latitudes. More importantly, this method does not take adjacent hours into consideration.
You are invited to improve on those points in your solution :)
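
One possible way to address the distance issue is to use great-circle distance instead of raw degrees. The sketch below is an illustration, not part of the baseline: `haversine_km` is a hypothetical helper and it assumes a spherical Earth with a mean radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude at the equator vs. at 60 degrees north
print(haversine_km(0, 0, 0, 1))
print(haversine_km(60, 0, 60, 1))
```

One degree of longitude is about 111 km at the equator but only about 56 km at 60° N, which is close to the latitudes in this dataset. If you prefer to stay with sklearn, `BallTree` accepts `metric='haversine'` and expects `[latitude, longitude]` pairs in radians.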


In [43]:
from sklearn.neighbors import KDTree
def preprocess_netatmo(df):
    """organizes netatmo stations into KDTrees for each distinct time frame"""
    
    df_by_hour = df.groupby('hour_hash')
    anns = {}
    for hour,stations_group in df_by_hour:
        anns[hour] = KDTree(stations_group[["netatmo_latitude","netatmo_longitude"]].values,metric='minkowski',p=2)
    
    #convert groupby to dict to get faster queries
    df_by_hour = {group:stations_group for group,stations_group in df_by_hour}
    
    return df_by_hour,anns
        

In [44]:
netatmo_groups,netatmo_anns = preprocess_netatmo(pd.read_csv(NETATMO_PATH,na_values="None",sep='\t'))



## Feature engineering

In this baseline, we're going to aggregate all user data from a specific square and a specific hour to predict whether it's raining in this square. We'll split data into blocks by `[sq_lon,sq_lat,sq_time]` and process such blocks independently.

<img src="https://usercontent1.hubstatic.com/12943886_f520.jpg" width=240px>


The next cell defines a function that extracts features from such blocks. Feel free to add some new features here or drop those you believe to be harmful.

Also note that this isn't the only way to process such data. See the [known unknowns](#known_unknowns) section.
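
The sample rows shown earlier contain `-999` in columns such as `LocationSpeed`, `radio` and `cell_lat`. This looks like a missing-value marker, although that is an assumption: the source does not document it. If so, plain means and extremes over such columns are dominated by the marker. The sketch below shows one way to compute statistics over valid values only; `clean_stats` and its `sentinel` parameter are hypothetical names.

```python
import math
import statistics

def clean_stats(values, sentinel=-999):
    """Mean and median over valid values, plus the share of valid entries.

    Entries equal to `sentinel` (assumed to mark missing data) and NaNs are skipped.
    """
    valid = [
        v for v in values
        if v != sentinel and not (isinstance(v, float) and math.isnan(v))
    ]
    if not valid:
        return {"mean": math.nan, "median": math.nan, "valid_share": 0.0}
    return {
        "mean": statistics.fmean(valid),
        "median": statistics.median(valid),
        "valid_share": len(valid) / len(values),
    }

# LocationSpeed values from the first five rows shown above
print(clean_stats([-999, 13, 1, 5, 10]))
```

The share of valid entries may be a useful feature in its own right, since it tells the model how much data a statistic rests on.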

In [58]:
data.head()

Unnamed: 0,hour_hash,sq_x,sq_y,EventTimestampDelta,day_hour,cell_hash,sq_time,radio,LocationSpeed,LAC,...,day,ver_hash,SignalStrength,rain,u_hashed,city_code,cell_lat,OperatorID,LocationTimestampDelta,device_model_hash
400012,422518526921346549,-4,0,-3278,10,2884660664377744674,1500717600,3,8,7834,...,21,2970743340414944099,-65,True,14626389373734152192,78,59.949646,2,-3235,1360803345112542720
400013,422518526921346549,-4,0,-3278,10,13638004307918185832,1500717600,2,8,7807,...,21,2970743340414944099,-94,True,14626389373734152192,78,59.944153,2,-3235,1360803345112542720
400014,422518526921346549,-4,0,-3278,10,2884660664377744674,1500717600,3,8,7834,...,21,2970743340414944099,-65,True,14626389373734152192,78,59.949646,2,-3235,1360803345112542720
400015,422518526921346549,-4,0,-3278,10,13638004307918185832,1500717600,2,8,7807,...,21,2970743340414944099,-94,True,14626389373734152192,78,59.944153,2,-3235,1360803345112542720
400016,422518526921346549,-4,0,-3278,10,2884660664377744674,1500717600,3,8,7834,...,21,2970743340414944099,-65,True,14626389373734152192,78,59.949646,2,-3235,1360803345112542720


In [63]:
data.columns

Index(['hour_hash', 'sq_x', 'sq_y', 'EventTimestampDelta', 'day_hour',
       'cell_hash', 'sq_time', 'radio', 'LocationSpeed', 'LAC', 'eventid',
       'LocationPrecision', 'LocationAltitude', 'hours_since', 'sq_lat',
       'sq_lon', 'range', 'ulat', 'LocationDirection', 'precipitation',
       'cell_lon', 'ulon', 'day', 'ver_hash', 'SignalStrength', 'rain',
       'u_hashed', 'city_code', 'cell_lat', 'OperatorID',
       'LocationTimestampDelta', 'device_model_hash'],
      dtype='object')

In [11]:
for i in data['OperatorID'].unique():
    print(i)

99
1
11
2
20
23
25
35
215
255
192
54
237


In [100]:
oper = data['OperatorID'].unique()
def extract_features(group,netatmo_groups,netatmo_anns):
    """
    Extracts all kinds of features from a dataframe containing users in one group
    """
    features = {}

    #square features
    square = {col: group[col].iloc[0] for col in group.columns}
    
    features['square_lat'] = square['sq_lat']
    features['square_lon'] = square['sq_lon']
    features['time_of_day'] = square['day_hour']
    features['signal_sum'] = group['SignalStrength'].sum()
    features['time_std'] = abs(group['EventTimestampDelta']-group['LocationTimestampDelta']).std()
    features['time_mean'] = abs(group['EventTimestampDelta']-group['LocationTimestampDelta']).mean()
    features['time_median'] = abs(group['EventTimestampDelta']-group['LocationTimestampDelta']).median()
    features['time_dm'] = features['time_mean'] - features['time_median']
    
    
    #signal strength
    features['signal_mean'] = group['SignalStrength'].mean()
    features['signal_std'] = group['SignalStrength'].std()
    features['signal_median'] = group['SignalStrength'].median()
    features['signal_max-min'] = group['SignalStrength'].max() - group['SignalStrength'].min()
    features['signal_max+min'] = (group['SignalStrength'].max() + group['SignalStrength'].min())/2
    features['signal_q75-q25'] = group['SignalStrength'].quantile(0.75) - group['SignalStrength'].quantile(0.25)
    
    features['signal_dm'] = features['signal_mean'] - features['signal_median']
    features['signal_dmm'] = features['signal_mean'] - features['signal_max+min']
    features['signal_dmmm'] = features['signal_median'] - features['signal_max+min']
    #Location Precision
    features['LocationPrecision_mean'] = group['LocationPrecision'].mean()
    features['LocationPrecision_std'] = group['LocationPrecision'].std()
    features['LocationPrecision_median'] = group['LocationPrecision'].median()
    features['LocationPrecision_max'] = group['LocationPrecision'].max()
    features['LocationPrecision_min'] = group['LocationPrecision'].min()
    features['LocationPrecision_sem'] = group['LocationPrecision'].sem()
    features['LocationPrecision_q_25'] = group['LocationPrecision'].quantile(0.25)
    features['LocationPrecision_q_75'] = group['LocationPrecision'].quantile(0.75)
    features['LocationPrecision_max-min'] = features['LocationPrecision_max'] - features['LocationPrecision_min']
    features['LocationPrecision_max+min'] = (features['LocationPrecision_max'] + features['LocationPrecision_min'])/2
    features['LocationPrecision_q75-q25'] = features['LocationPrecision_q_75'] - features['LocationPrecision_q_25']
    
    #LocationSpeed
    features['LocationSpeed_mean'] = group['LocationSpeed'].mean()
    features['LocationSpeed_std'] = group['LocationSpeed'].std()
    features['LocationSpeed_median'] = group['LocationSpeed'].median()
    features['LocationSpeed_max'] = group['LocationSpeed'].max()
    features['LocationSpeed_min'] = group['LocationSpeed'].min()
    features['LocationSpeed_sem'] = group['LocationSpeed'].sem()
    features['LocationSpeed_q_25'] = group['LocationSpeed'].quantile(0.25)
    features['LocationSpeed_q_75'] = group['LocationSpeed'].quantile(0.75)
    features['LocationSpeed_max-min'] = features['LocationSpeed_max'] - features['LocationSpeed_min']
    features['LocationSpeed_max+min'] = (features['LocationSpeed_max'] + features['LocationSpeed_min'])/2
    features['LocationSpeed_q75-q25'] = features['LocationSpeed_q_75'] - features['LocationSpeed_q_25']
    
    for i in oper:
        #select this operator's rows once instead of re-filtering for every statistic
        speed = group.loc[group['OperatorID']==i, 'LocationSpeed']
        features['LocationSpeed_mean'+str(i)] = speed.mean()
        features['LocationSpeed_std'+str(i)] = speed.std()
        features['LocationSpeed_median'+str(i)] = speed.median()
        features['LocationSpeed_max'+str(i)] = speed.max()
        features['LocationSpeed_min'+str(i)] = speed.min()
        features['LocationSpeed_q_25'+str(i)] = speed.quantile(0.25)
        features['LocationSpeed_q_75'+str(i)] = speed.quantile(0.75)
        features['LocationSpeed_max-min'+str(i)] = features['LocationSpeed_max'+str(i)] - features['LocationSpeed_min'+str(i)]
        features['LocationSpeed_max+min'+str(i)] = (features['LocationSpeed_max'+str(i)] + features['LocationSpeed_min'+str(i)])/2
        #interquartile range (q75 - q25)
        features['LocationSpeed_q75-q25'+str(i)] = features['LocationSpeed_q_75'+str(i)] - features['LocationSpeed_q_25'+str(i)]

    #features for each user
    group_by_user = group.groupby('u_hashed')
    
    features['num_users'] = len(group_by_user)
    features['mean_entries_per_user'] = group_by_user.apply(len).mean()
    features['median_entries_per_user'] = group_by_user.apply(len).median()
    features['mean_user_signal_std'] = group_by_user.apply(
        lambda user_entries: user_entries['SignalStrength'].std()).mean()
    features['median_user_signal_std'] = group_by_user.apply(
        lambda user_entries: user_entries['SignalStrength'].std()).median()
    features['dm_entries_per_user'] = features['mean_entries_per_user'] - features['median_entries_per_user']
    features['dm_signal_std'] = features['mean_user_signal_std'] - features['median_user_signal_std']
    #netatmo features
    if square['hour_hash'] in netatmo_groups:
        local_stations,neighbors = netatmo_groups[square['hour_hash']],netatmo_anns[square['hour_hash']]
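        #KDTree.query raises an error if k exceeds the number of stations, so cap it
        k = min(50, len(local_stations))
        [distances],[neighbor_ids] = neighbors.query([(square['sq_lat'],square['sq_lon'])],k=k)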

        neighbor_stations = local_stations.iloc[neighbor_ids]

        features['min_distance_to_closest_station'] = np.min(distances)
        features['max_distance_to_closest_station'] = np.max(distances)
        features['max+min_distance_to_closest_station'] = (features['min_distance_to_closest_station'] +features['max_distance_to_closest_station']) /2
        features['median_distance_to_closest_station'] = np.median(distances)
        features['mean_distance_to_station'] = np.mean(distances)

        for colname in ['netatmo_pressure_mbar','netatmo_temperature_c','netatmo_sum_rain_1h','netatmo_sum_rain_24h',
                        'netatmo_wind_direction_deg','netatmo_wind_gust_direction_deg','netatmo_humidity_percent','netatmo_wind_speed_kmh','netatmo_wind_gust_speed_kmh']:
            col = neighbor_stations[colname].dropna()
            if len(col)!=0:
                features[colname+"_mean"],features[colname+"_std"], features[colname+"_median"] = col.mean(),col.std(),col.median()
                features[colname+"_mm"] = col.max() - col.min()
            else:
                features[colname+"_mean"],features[colname+"_std"],features[colname+"_median"] = np.nan,np.nan,np.nan
                features[colname+"_mm"] = np.nan
        for colname in ['netatmo_pressure_mbar','netatmo_temperature_c','netatmo_sum_rain_1h','netatmo_humidity_percent','netatmo_wind_speed_kmh','netatmo_wind_gust_speed_kmh']:
            #col = neighbor_stations.dropna()
            col = neighbor_stations[np.isfinite(neighbor_stations[colname])]
            if len(col)!=0:
                try:
                    features[colname+"_meand"] = col[col['netatmo_timestamp_delta']<-2500][colname].mean()-col[col['netatmo_timestamp_delta']>-1100][colname].mean()
                    features[colname+"_mediand"] = col[col['netatmo_timestamp_delta']<-2500][colname].median()-col[col['netatmo_timestamp_delta']>-1100][colname].median()
                    features[colname+"_stdd"] = col[col['netatmo_timestamp_delta']<-2500][colname].std()-col[col['netatmo_timestamp_delta']>-1100][colname].std()
                    features[colname+"_maxd"] = col[col['netatmo_timestamp_delta']<-2500][colname].max()-col[col['netatmo_timestamp_delta']>-1100][colname].max()
                    features[colname+"_mind"] = col[col['netatmo_timestamp_delta']<-2500][colname].min()-col[col['netatmo_timestamp_delta']>-1100][colname].min()
                except Exception:
                    features[colname+"_meand"]= np.nan
                    features[colname+"_mediand"]= np.nan
                    features[colname+"_stdd"]= np.nan
                    features[colname+"_maxd"]= np.nan
                    features[colname+"_mind"]= np.nan
            else:
                features[colname+"_meand"]= np.nan
                features[colname+"_mediand"]= np.nan
                features[colname+"_stdd"]= np.nan
                features[colname+"_maxd"]= np.nan
                features[colname+"_mind"]= np.nan
    return features
    

In [89]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
                    [np.nan, np.nan, np.nan, 5]],
                   columns=list('ABCD'))

In [90]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5


In [92]:
df[np.isfinite(df['B'])]

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1


In [None]:
data[data['EventTimestampDelta']<600].mean()

netatmo_humidity_percent, netatmo_latitude, netatmo_longitude, netatmo_pressure_mbar, netatmo_sum_rain_1h, netatmo_sum_rain_24h, netatmo_temperature_c, netatmo_timestamp_delta, netatmo_uid, netatmo_wind_direction_deg, netatmo_wind_gust_direction_deg, netatmo_wind_gust_speed_kmh, netatmo_wind_speed_kmh, 

In [4]:
data = pd.read_csv('data/datatr.csv')



We now apply it to all the squares we have.

This may take time, more so if you use complex features, so you can try to speed stuff up by using [joblib.Parallel](http://pythonhosted.org/joblib/parallel.html) or similar.
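
As a rough illustration of the parallel approach, here is a sketch built on the standard library's `concurrent.futures`; joblib offers the same pattern with a different API. `map_blocks` is a hypothetical helper, and the worker count and chunk size are arbitrary placeholders.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def map_blocks(func, blocks, executor_cls=ProcessPoolExecutor, max_workers=4, chunksize=16):
    """Apply func to every block with a pool of workers; results keep input order.

    With ProcessPoolExecutor, func and each block must be picklable, so func
    has to be a module-level function (not a lambda). Pass ThreadPoolExecutor
    as executor_cls for debugging; threads ignore chunksize.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(func, blocks, chunksize=chunksize))

# Hypothetical usage with the names from this notebook:
#   from functools import partial
#   worker = partial(extract_features, netatmo_groups=netatmo_groups,
#                    netatmo_anns=netatmo_anns)
#   groups = [groupby.get_group(block_id) for block_id in groupby.groups]
#   X = map_blocks(worker, groups)
```

Keep in mind that with processes the netatmo lookup structures are copied to the workers, which costs both time and memory, so it is worth measuring on a subset before running on everything.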

In [101]:
from tqdm import tqdm

groupby = data.groupby(["city_code","sq_x","sq_y","hour_hash"])

X,yy,y,block_ids = [],[],[], []

for block_id in tqdm(groupby.groups):
    group = groupby.get_group(block_id)
    X.append(extract_features(group,netatmo_groups,netatmo_anns))
    y.append(group.iloc[0]['rain'])
    yy.append(group.iloc[0]['precipitation'])
    block_ids.append(block_id+(group.iloc[0]["hours_since"],))

X = pd.DataFrame(X)#.fillna(-999.)
y = np.array(y)
block_ids = pd.DataFrame(block_ids,columns=["city_code","sq_x","sq_y","hour_hash","hours_since"])


  0%|          | 17/22362 [00:03<1:02:01,  6.00it/s]

__Note:__ If you're low on memory, it's time to either delete train & groupby or pickle X/y/block_ids and restart.
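
A minimal sketch of the pickling option, assuming you want everything in a single file. `save_checkpoint` and `load_checkpoint` are hypothetical helpers, and the path in the usage comment is a placeholder.

```python
import pickle

def save_checkpoint(path, **objects):
    """Store the given named objects in one pickle file."""
    with open(path, "wb") as f:
        pickle.dump(objects, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_checkpoint(path):
    """Return the dict of named objects written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical usage with the names from this notebook:
#   save_checkpoint("intermediate_data/features.pkl", X=X, y=y, block_ids=block_ids)
#   ...restart the kernel...
#   checkpoint = load_checkpoint("intermediate_data/features.pkl")
#   X, y, block_ids = checkpoint["X"], checkpoint["y"], checkpoint["block_ids"]
```

Only load pickle files you created yourself, since unpickling can execute arbitrary code.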

In [96]:
X.shape,y.shape,np.array(yy).shape

In [99]:
pd.DataFrame(X)['netatmo_wind_speed_kmh_mind']

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
15    NaN
16    NaN
17    NaN
18    NaN
19    NaN
20    NaN
21    NaN
22    NaN
23    NaN
24    NaN
25    NaN
26    NaN
27    NaN
28    NaN
29    NaN
       ..
251   NaN
252   NaN
253   NaN
254   NaN
255   NaN
256   NaN
257   NaN
258   NaN
259   NaN
260   NaN
261   NaN
262   NaN
263   NaN
264   NaN
265   NaN
266   NaN
267   NaN
268   NaN
269   NaN
270   NaN
271   NaN
272   NaN
273   NaN
274   NaN
275   NaN
276   NaN
277   NaN
278   NaN
279   NaN
280   NaN
Name: netatmo_wind_speed_kmh_mind, dtype: float64

In [102]:
block_ids.to_csv('data/blockspb.csv')

In [31]:
data.to_csv('data/datatr.csv')


In [103]:
dy = pd.DataFrame()
dy['y'] = y
dy['yy'] = np.array(yy)
dy.to_csv('data/yspb.csv')

In [67]:
dy.head()

Unnamed: 0,y,yy
0,True,0.51818
1,False,0.0
2,False,0.076159
3,False,0.052781
4,True,0.514332


In [104]:
X.to_csv('data/Xtrspb.csv')

### Classifier

Once the data is processed, it's time to train some machine learning model that would predict rain given all features we gathered.

Since our features differ in nature and unit scale (hours, decibels, degrees, etc.), it makes sense to use decision-tree-based methods for classification: trees split on thresholds, so they need no feature rescaling.


<img src="http://zdnet2.cbsistatic.com/hub/i/2017/07/18/d3f47c3e-8529-4855-a0e1-c686ee3b4007/d1113adf74bb59c3b46419a531c39c3e/orig.png" width=320>
In particular, we apply [CatBoost](https://catboost.yandex/), Yandex's recent open source gradient boosting implementation.

To keep this baseline simple, we use CatBoost with default settings. You can certainly find a better combination of parameters.

Here's a [guide](https://tech.yandex.com/catboost/doc/dg/concepts/parameter-tuning-docpage/) on how CatBoost hyperparameters work.

Note that the code cells in this version of the notebook train an XGBoost model (`xgb.train`) with manually chosen parameters rather than CatBoost.

In [7]:
X = pd.read_csv('data/Xtr.csv')
y = np.array(pd.read_csv('data/y.csv')['y'])

In [105]:
in_train = block_ids['hours_since'] <= np.percentile(block_ids['hours_since'],85) #leave last 15% for validation

X_train,y_train = X[in_train],y[in_train]
X_val,y_val = X[~in_train],y[~in_train]
print("Training samples: %i; Validation samples: %i"%(len(X_train),len(X_val)))

Training samples: 19036; Validation samples: 3326


In [106]:
X.shape,y.shape

((22362, 247), (22362,))

In [72]:
import xgboost as xgb



In [74]:
X.head()

Unnamed: 0,LocationSpeed_max+min1,LocationSpeed_max+min11,LocationSpeed_max+min2,LocationSpeed_max+min20,LocationSpeed_max+min215,LocationSpeed_max+min23,LocationSpeed_max+min25,LocationSpeed_max+min255,LocationSpeed_max+min3,LocationSpeed_max+min35,...,netatmo_wind_speed_kmh_stdd,num_users,signal_sum,square_lat,square_lon,time_dm,time_mean,time_median,time_of_day,time_std
0,-489.0,-999.0,-489.5,-492.5,,-493.5,,,,,...,,46,-70790,59.918629,30.328329,7.535055,9.535055,2.0,19,33.156581
1,-491.0,,-488.0,-496.5,,,,,,,...,,29,-37498,59.864773,30.328329,8.949495,10.949495,2.0,10,30.815681
2,17.0,,-484.5,-489.0,,,,,,,...,,7,-6852,59.810921,30.220709,-0.018182,2.981818,3.0,19,1.92988
3,-487.0,,-489.5,-491.0,,14.5,,,,,...,,26,-33568,59.972481,30.328329,7.271845,8.271845,1.0,19,20.330215
4,-490.5,,-490.0,-494.0,,,,,,,...,,20,-36374,59.864773,30.435949,-10.564356,22.435644,33.0,19,26.601523


In [80]:
X['netatmo_wind_speed_kmh_stdd'][5]

city_code                         NaN
day                               NaN
day_hour                          NaN
hour_hash                         NaN
hours_since                       NaN
netatmo_humidity_percent          NaN
netatmo_latitude                  NaN
netatmo_longitude                 NaN
netatmo_pressure_mbar             NaN
netatmo_sum_rain_1h               NaN
netatmo_sum_rain_24h              NaN
netatmo_temperature_c             NaN
netatmo_time_day_rain             NaN
netatmo_time_hour_rain            NaN
netatmo_timestamp                 NaN
netatmo_timestamp_delta           NaN
netatmo_uid                       NaN
netatmo_wind_direction_deg        NaN
netatmo_wind_gust_direction_deg   NaN
netatmo_wind_gust_speed_kmh       NaN
netatmo_wind_gust_timestamp       NaN
netatmo_wind_speed_kmh            NaN
netatmo_wind_timestamp            NaN
point_latitude                    NaN
point_longitude                   NaN
precipitation                     NaN
rain        

In [85]:
feat = ['LocationSpeed_max+min1',
 'LocationSpeed_max+min11',
 'LocationSpeed_max+min2',
 'LocationSpeed_max+min20',
 'LocationSpeed_max+min215',
 'LocationSpeed_max+min23',
 'LocationSpeed_max+min25',
 'LocationSpeed_max+min255',
 'LocationSpeed_max+min3',
 'LocationSpeed_max+min35',
 'LocationSpeed_max+min36',
 'LocationSpeed_max+min5',
 'LocationSpeed_max+min99',
 'LocationSpeed_max-min1',
 'LocationSpeed_max-min11',
 'LocationSpeed_max-min2',
 'LocationSpeed_max-min20',
 'LocationSpeed_max-min215',
 'LocationSpeed_max-min23',
 'LocationSpeed_max-min25',
 'LocationSpeed_max-min255',
 'LocationSpeed_max-min3',
 'LocationSpeed_max-min35',
 'LocationSpeed_max-min36',
 'LocationSpeed_max-min5',
 'LocationSpeed_max-min99',
 'LocationSpeed_max1',
 'LocationSpeed_max11',
 'LocationSpeed_max2',
 'LocationSpeed_max20',
 'LocationSpeed_max215',
 'LocationSpeed_max23',
 'LocationSpeed_max25',
 'LocationSpeed_max255',
 'LocationSpeed_max3',
 'LocationSpeed_max35',
 'LocationSpeed_max36',
 'LocationSpeed_max5',
 'LocationSpeed_max99',
 'LocationSpeed_mean1',
 'LocationSpeed_mean11',
 'LocationSpeed_mean2',
 'LocationSpeed_mean20',
 'LocationSpeed_mean215',
 'LocationSpeed_mean23',
 'LocationSpeed_mean25',
 'LocationSpeed_mean255',
 'LocationSpeed_mean3',
 'LocationSpeed_mean35',
 'LocationSpeed_mean36',
 'LocationSpeed_mean5',
 'LocationSpeed_mean99',
 'LocationSpeed_median1',
 'LocationSpeed_median11',
 'LocationSpeed_median2',
 'LocationSpeed_median20',
 'LocationSpeed_median215',
 'LocationSpeed_median23',
 'LocationSpeed_median25',
 'LocationSpeed_median255',
 'LocationSpeed_median3',
 'LocationSpeed_median35',
 'LocationSpeed_median36',
 'LocationSpeed_median5',
 'LocationSpeed_median99',
 'LocationSpeed_min1',
 'LocationSpeed_min11',
 'LocationSpeed_min2',
 'LocationSpeed_min20',
 'LocationSpeed_min215',
 'LocationSpeed_min23',
 'LocationSpeed_min25',
 'LocationSpeed_min255',
 'LocationSpeed_min3',
 'LocationSpeed_min35',
 'LocationSpeed_min36',
 'LocationSpeed_min5',
 'LocationSpeed_min99',
 'LocationSpeed_q75-q251',
 'LocationSpeed_q75-q2511',
 'LocationSpeed_q75-q252',
 'LocationSpeed_q75-q2520',
 'LocationSpeed_q75-q25215',
 'LocationSpeed_q75-q2523',
 'LocationSpeed_q75-q2525',
 'LocationSpeed_q75-q25255',
 'LocationSpeed_q75-q253',
 'LocationSpeed_q75-q2535',
 'LocationSpeed_q75-q2536',
 'LocationSpeed_q75-q255',
 'LocationSpeed_q75-q2599',
 'LocationSpeed_q_251',
 'LocationSpeed_q_2511',
 'LocationSpeed_q_252',
 'LocationSpeed_q_2520',
 'LocationSpeed_q_25215',
 'LocationSpeed_q_2523',
 'LocationSpeed_q_2525',
 'LocationSpeed_q_25255',
 'LocationSpeed_q_253',
 'LocationSpeed_q_2535',
 'LocationSpeed_q_2536',
 'LocationSpeed_q_255',
 'LocationSpeed_q_2599',
 'LocationSpeed_q_751',
 'LocationSpeed_q_7511',
 'LocationSpeed_q_752',
 'LocationSpeed_q_7520',
 'LocationSpeed_q_75215',
 'LocationSpeed_q_7523',
 'LocationSpeed_q_7525',
 'LocationSpeed_q_75255',
 'LocationSpeed_q_753',
 'LocationSpeed_q_7535',
 'LocationSpeed_q_7536',
 'LocationSpeed_q_755',
 'LocationSpeed_q_7599',
 'LocationSpeed_std1',
 'LocationSpeed_std11',
 'LocationSpeed_std2',
 'LocationSpeed_std20',
 'LocationSpeed_std215',
 'LocationSpeed_std23',
 'LocationSpeed_std25',
 'LocationSpeed_std255',
 'LocationSpeed_std3',
 'LocationSpeed_std35',
 'LocationSpeed_std36',
 'LocationSpeed_std5',
 'LocationSpeed_std99',
 'dm_entries_per_user',
 'dm_signal_std',
 'max+min_distance_to_closest_station',
 'max_distance_to_closest_station',
 'mean_distance_to_station',
 'mean_entries_per_user',
 'mean_user_signal_std',
 'median_distance_to_closest_station',
 'median_entries_per_user',
 'median_user_signal_std',
 'min_distance_to_closest_station',
 'netatmo_humidity_percent_mm',
 'netatmo_humidity_percent_std',
 'netatmo_humidity_percent_mean',
 'netatmo_humidity_percent_median',
 'netatmo_pressure_mbar_mean',
 'netatmo_pressure_mbar_median',
 'netatmo_pressure_mbar_mm',
 'netatmo_pressure_mbar_std',
 'netatmo_sum_rain_1h_mean',
 'netatmo_sum_rain_1h_median',
 'netatmo_sum_rain_1h_mm',
 'netatmo_sum_rain_1h_std',
 'netatmo_sum_rain_24h_mean',
 'netatmo_sum_rain_24h_median',
 'netatmo_sum_rain_24h_mm',
 'netatmo_sum_rain_24h_std',
 'netatmo_temperature_c_mean',
 'netatmo_temperature_c_median',
 'netatmo_temperature_c_mm',
 'netatmo_temperature_c_std',
 'netatmo_wind_direction_deg_mean',
 'netatmo_wind_direction_deg_median',
 'netatmo_wind_direction_deg_mm',
 'netatmo_wind_direction_deg_std',
 'netatmo_wind_gust_direction_deg_mean',
 'netatmo_wind_gust_direction_deg_median',
 'netatmo_wind_gust_direction_deg_mm',
 'netatmo_wind_gust_direction_deg_std',
 'netatmo_wind_gust_speed_kmh_mean',
 'netatmo_wind_gust_speed_kmh_median',
 'netatmo_wind_gust_speed_kmh_mm',
 'netatmo_wind_gust_speed_kmh_std',
 'netatmo_wind_speed_kmh_mean',
 'netatmo_wind_speed_kmh_median',
 'netatmo_wind_speed_kmh_mm',
 'netatmo_wind_speed_kmh_std',
 'num_users',
 'signal_sum',
 'time_dm',
 'time_mean',
 'time_median',
 'time_of_day',
 'time_std']

In [83]:
set(list(X.columns))-

['LocationSpeed_max+min1',
 'LocationSpeed_max+min11',
 'LocationSpeed_max+min2',
 'LocationSpeed_max+min20',
 'LocationSpeed_max+min215',
 'LocationSpeed_max+min23',
 'LocationSpeed_max+min25',
 'LocationSpeed_max+min255',
 'LocationSpeed_max+min3',
 'LocationSpeed_max+min35',
 'LocationSpeed_max+min36',
 'LocationSpeed_max+min5',
 'LocationSpeed_max+min99',
 'LocationSpeed_max-min1',
 'LocationSpeed_max-min11',
 'LocationSpeed_max-min2',
 'LocationSpeed_max-min20',
 'LocationSpeed_max-min215',
 'LocationSpeed_max-min23',
 'LocationSpeed_max-min25',
 'LocationSpeed_max-min255',
 'LocationSpeed_max-min3',
 'LocationSpeed_max-min35',
 'LocationSpeed_max-min36',
 'LocationSpeed_max-min5',
 'LocationSpeed_max-min99',
 'LocationSpeed_max1',
 'LocationSpeed_max11',
 'LocationSpeed_max2',
 'LocationSpeed_max20',
 'LocationSpeed_max215',
 'LocationSpeed_max23',
 'LocationSpeed_max25',
 'LocationSpeed_max255',
 'LocationSpeed_max3',
 'LocationSpeed_max35',
 'LocationSpeed_max36',
 'LocationSpe

In [107]:
dtr = xgb.DMatrix(X_train[feat], label=y_train)
dval = xgb.DMatrix(X_val[feat], label=y_val)
watchlist = [(dtr, 'train'), (dval, 'eval')]
history = dict()
params = {
    'max_depth': 7,
    'eta': 0.02,
    'objective':  "binary:logistic",
    'eval_metric' : 'auc',
    'nthread': 4,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 99,
    'seed':7
}

In [110]:
model = xgb.train(params, dtr, num_boost_round=500, evals=watchlist,evals_result=history, verbose_eval=10)

[0]	train-auc:0.76043	eval-auc:0.605527
[10]	train-auc:0.806587	eval-auc:0.679997
[20]	train-auc:0.810218	eval-auc:0.683204
[30]	train-auc:0.811395	eval-auc:0.688704
[40]	train-auc:0.813468	eval-auc:0.689728
[50]	train-auc:0.816157	eval-auc:0.693611
[60]	train-auc:0.815823	eval-auc:0.692128
[70]	train-auc:0.817779	eval-auc:0.694421
[80]	train-auc:0.817987	eval-auc:0.691748
[90]	train-auc:0.821495	eval-auc:0.696361
[100]	train-auc:0.823146	eval-auc:0.695972
[110]	train-auc:0.824355	eval-auc:0.694332
[120]	train-auc:0.826553	eval-auc:0.696947
[130]	train-auc:0.828194	eval-auc:0.699073
[140]	train-auc:0.830905	eval-auc:0.699715
[150]	train-auc:0.833438	eval-auc:0.699399
[160]	train-auc:0.836323	eval-auc:0.700008
[170]	train-auc:0.837748	eval-auc:0.69994
[180]	train-auc:0.839883	eval-auc:0.701972
[190]	train-auc:0.842224	eval-auc:0.702325
[200]	train-auc:0.844177	eval-auc:0.704581
[210]	train-auc:0.845821	eval-auc:0.70463
[220]	train-auc:0.847336	eval-auc:0.706142
[230]	train-auc:0.84888	e
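
The log above is printed only every 10 rounds, but the `history` dict passed as `evals_result` records the metric for every round. A minimal sketch, assuming the usual nested layout `history[dataset_name][metric_name]` with the names taken from `watchlist`, of how to find the round with the best validation AUC. The helper name `best_round` and the toy values are illustrative and not part of the original notebook.

```python
def best_round(history, dataset="eval", metric="auc", maximize=True):
    """Return (round_index, score) of the best round recorded in `history`."""
    scores = history[dataset][metric]
    if not scores:
        raise ValueError(f"no scores recorded for {dataset}-{metric}")
    pick = max if maximize else min
    # On ties, max/min return the earliest round.
    idx = pick(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


# Toy history in the layout that evals_result fills in (illustrative values only).
toy_history = {
    "train": {"auc": [0.76, 0.80, 0.82, 0.84]},
    "eval": {"auc": [0.60, 0.68, 0.70, 0.69]},
}
print(best_round(toy_history))
```

In the notebook this would be called as `best_round(history)` after training. `xgb.train` also accepts `early_stopping_rounds`, which stops training once the last entry in `evals` has not improved for that many rounds.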

In [111]:
xgb.plot_importance(model)
plt.show()

In [56]:
# A Booster returned by xgb.train has no feature_importances_ attribute
# (that belongs to the sklearn wrapper); use get_score() instead.
importance = model.get_score(importance_type='weight')
print(importance)
# plot
plt.bar(range(len(importance)), list(importance.values()))
plt.show()

In [68]:
model.get_fscore()

{'LocationPrecision_max': 13,
 'LocationPrecision_max+min': 19,
 'LocationPrecision_max-min': 7,
 'LocationPrecision_mean': 28,
 'LocationPrecision_median': 9,
 'LocationPrecision_min': 23,
 'LocationPrecision_q75-q25': 9,
 'LocationPrecision_q_25': 16,
 'LocationPrecision_q_75': 33,
 'LocationPrecision_sem': 20,
 'LocationPrecision_std': 35,
 'LocationSpeed_max': 62,
 'LocationSpeed_max+min': 12,
 'LocationSpeed_max-min': 5,
 'LocationSpeed_mean': 41,
 'LocationSpeed_median': 35,
 'LocationSpeed_min': 3,
 'LocationSpeed_q75-q25': 7,
 'LocationSpeed_q_25': 16,
 'LocationSpeed_q_75': 44,
 'LocationSpeed_sem': 20,
 'LocationSpeed_std': 19,
 'dm_entries_per_user': 2,
 'dm_signal_std': 15,
 'max+min_distance_to_closest_station': 91,
 'max_distance_to_closest_station': 79,
 'mean_distance_to_station': 45,
 'mean_entries_per_user': 20,
 'mean_user_signal_std': 41,
 'median_distance_to_closest_station': 52,
 'median_entries_per_user': 10,
 'median_user_signal_std': 21,
 'min_distance_to_close

In [69]:
# dict.items() is a view in Python 3, so wrap it in list() before building the frame
pd.DataFrame(list(model.get_fscore().items()), columns=['feature','importance']).sort_values('importance', ascending=False)

In [38]:
model = xgb.train(params, dtr, num_boost_round=500, evals=watchlist,evals_result=history, verbose_eval=10)

[0]	train-auc:0.688593	eval-auc:0.588477
[10]	train-auc:0.74003	eval-auc:0.587787
[20]	train-auc:0.748935	eval-auc:0.598691
[30]	train-auc:0.755814	eval-auc:0.607096
[40]	train-auc:0.758372	eval-auc:0.607187
[50]	train-auc:0.762092	eval-auc:0.618856
[60]	train-auc:0.765888	eval-auc:0.619292
[70]	train-auc:0.770579	eval-auc:0.625995
[80]	train-auc:0.773034	eval-auc:0.626868
[90]	train-auc:0.777196	eval-auc:0.634508
[100]	train-auc:0.779771	eval-auc:0.634719
[110]	train-auc:0.783795	eval-auc:0.639688
[120]	train-auc:0.787586	eval-auc:0.645544
[130]	train-auc:0.790103	eval-auc:0.649224
[140]	train-auc:0.79428	eval-auc:0.651258
[150]	train-auc:0.796623	eval-auc:0.653725
[160]	train-auc:0.799539	eval-auc:0.658166
[170]	train-auc:0.801948	eval-auc:0.66109
[180]	train-auc:0.804542	eval-auc:0.662395
[190]	train-auc:0.806302	eval-auc:0.662169
[200]	train-auc:0.808128	eval-auc:0.662695
[210]	train-auc:0.811196	eval-auc:0.663035
[220]	train-auc:0.813406	eval-auc:0.666606
[230]	train-auc:0.815923	

In [32]:
model = xgb.train(params, dtr, num_boost_round=500, evals=watchlist,evals_result=history, verbose_eval=10)

[0]	train-auc:0.739152	eval-auc:0.637049
[10]	train-auc:0.774755	eval-auc:0.65183
[20]	train-auc:0.779083	eval-auc:0.645738
[30]	train-auc:0.783729	eval-auc:0.650622
[40]	train-auc:0.787999	eval-auc:0.656943
[50]	train-auc:0.791767	eval-auc:0.665197
[60]	train-auc:0.793851	eval-auc:0.667413
[70]	train-auc:0.799188	eval-auc:0.674892
[80]	train-auc:0.802983	eval-auc:0.679085
[90]	train-auc:0.806337	eval-auc:0.680242
[100]	train-auc:0.808359	eval-auc:0.683031
[110]	train-auc:0.811373	eval-auc:0.683126
[120]	train-auc:0.815862	eval-auc:0.690262
[130]	train-auc:0.819705	eval-auc:0.695758
[140]	train-auc:0.823459	eval-auc:0.698362
[150]	train-auc:0.826121	eval-auc:0.698017
[160]	train-auc:0.828091	eval-auc:0.697219
[170]	train-auc:0.830312	eval-auc:0.702368
[180]	train-auc:0.833444	eval-auc:0.706125
[190]	train-auc:0.836199	eval-auc:0.709053
[200]	train-auc:0.838973	eval-auc:0.709676
[210]	train-auc:0.841083	eval-auc:0.709652
[220]	train-auc:0.844253	eval-auc:0.710919
[230]	train-auc:0.84640

In [6]:
y

array([[0, False],
       [1, False],
       [2, False],
       ..., 
       [22378, False],
       [22379, False],
       [22380, False]], dtype=object)
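
The output above shows `y` as a two-column object array rather than a flat label vector. A minimal sketch, assuming the first column is a leftover row index and the second column is the rain flag (the notebook does not confirm this), of how to reduce it to a 1-D 0/1 vector before fitting a classifier. The names `y_raw` and `y_labels` are illustrative.

```python
import numpy as np

# Toy stand-in for `y`: column 0 is assumed to be a row index, column 1 the rain flag.
y_raw = np.array([[0, False], [1, True], [2, False]], dtype=object)

# Keep only the label column and convert it to a plain 0/1 integer vector.
y_labels = y_raw[:, 1].astype(bool).astype(int)

print(y_labels.shape)
print(y_labels)
```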

In [None]:
#if you don't have catboost installed, use !pip install catboost
from catboost import CatBoostClassifier

model = CatBoostClassifier().fit(X,y)

### Analyzing results

The cells below report ROC AUC on the training and validation sets, then plot the importance of each individual feature, ranked from worst to best.



In [None]:
from sklearn.metrics import roc_auc_score,roc_curve
import matplotlib.pyplot as plt
%matplotlib inline

y_train_pred = model.predict_proba(X_train)[:,1]
print("Train ROC AUC:",roc_auc_score(y_train,y_train_pred))

fpr,tpr,_ = roc_curve(y_train, y_train_pred)
plt.plot(fpr,tpr,label='train AUC')

y_val_pred = model.predict_proba(X_val)[:,1]
print("Val ROC AUC:",roc_auc_score(y_val,y_val_pred))

fpr,tpr,_ = roc_curve(y_val, y_val_pred)
plt.plot(fpr,tpr,label='validation AUC')

plt.plot([0,1],[0,1])
plt.legend(loc='lower right')

In [None]:
order = np.argsort(model._feature_importance)
plt.figure(figsize=[6,9])
plt.plot(np.array(model._feature_importance)[order],range(len(order)),marker='o')
plt.hlines(range(len(order)),np.zeros_like(order),np.array(model._feature_importance)[order],linestyles=':')
plt.yticks(range(X.shape[1]),X.columns[order]);
plt.tick_params(labelsize=16)
plt.xlim([0.1,max(model._feature_importance)*1.5])
plt.ylim(-1,len(order))
plt.xscale('log')

## Final model and uploading the results


Competition data covers three cities: Moscow, Saint Petersburg and Kazan. To submit a prediction, run this baseline separately for each city and concatenate the results.

The code below assumes that you have already run this solution for each city (see the comments in the cells).
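
One way to switch cities without editing every path by hand is to build the paths from a short city code. This is a sketch only: the `spb` paths match the constants at the top of the notebook and the three `prediction_*.csv` names match the ones read later, but the input file names for `msk` and `kazan` are an assumption that they follow the same pattern. The helper `city_paths` is hypothetical.

```python
CITIES = ("msk", "spb", "kazan")


def city_paths(city):
    """Build the input/output paths for one city from its short code."""
    if city not in CITIES:
        raise ValueError(f"unknown city code: {city!r}")
    return {
        # Input names for msk/kazan are assumed to mirror the spb ones.
        "train": f"data/train_{city}.tsv",
        "netatmo": f"data/train_{city}_netatmo.tsv",
        "test": f"data/test_{city}_features.tsv",
        "test_netatmo": f"data/test_{city}_netatmo.tsv",
        "predictions": f"intermediate_data/prediction_{city}.csv",
    }


for city in CITIES:
    print(city, "->", city_paths(city)["predictions"])
```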

In [None]:
#Train the model on full data. Copy model definition here.

model = CatBoostClassifier().fit(X,y)

In [None]:

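# Load the column dtypes first so the file handle is closed properly.
with open("./data/test_col_dtypes.json") as f:
    test_dtypes = json.load(f)
test = pd.read_csv(TEST_PATH, sep='\t', dtype=test_dtypes)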
test_groupby = test.groupby(["city_code","sq_x","sq_y","hour_hash"])
test_netatmo_groups,test_netatmo_anns = preprocess_netatmo(pd.read_csv(TEST_NETATMO_PATH,na_values="None",
                                                                       sep='\t',dtype={'hour_hash':"uint64"}))


In [None]:
X_test,test_block_ids = [],[]
for block_id in tqdm(test_groupby.groups):
    group = test_groupby.get_group(block_id)
    X_test.append(extract_features(group,test_netatmo_groups,test_netatmo_anns))
    test_block_ids.append(block_id)
    
X_test = pd.DataFrame(X_test)
test_block_ids = pd.DataFrame(test_block_ids,columns=["city_code","sq_x","sq_y","hour_hash"])

In [None]:
#This code saves the prediction for one city.
prediction_for_one_city = test_block_ids.copy()
prediction_for_one_city["prediction"] = model.predict_proba(X_test)[:,1]
prediction_for_one_city.to_csv(CITY_PREDICTIONS_PATH)

prediction_for_one_city.head()

#WARNING! you must run this notebook for all three regions before proceeding!
#We assume that you have prediction_msk.csv , prediction_spb.csv and prediction_kazan.csv files prepared.

In [None]:
data = X.copy()
data["target"] = y
data.to_csv("intermediate_data/spb.csv")
X_test.to_csv("intermediate_data/spb_test.csv")

Gather all predictions and make the submission file.

In [None]:
import pandas as pd

predictions = pd.concat(
    [pd.read_csv(fname,index_col=0) for fname in ("./intermediate_data/prediction_kazan.csv",
                                                  "./intermediate_data/prediction_spb.csv",
                                                  "./intermediate_data/prediction_msk.csv")],
    ignore_index=True
)
blocks = pd.read_csv("./data/hackathon_tosubmit.tsv",sep='\t')
assert len(predictions) == len(blocks),"Predictions don't match blocks. Submit at your own risk."

merged = pd.merge(blocks,predictions,how='left',on=["sq_x","sq_y","hour_hash"])
assert not np.isnan(merged.prediction).any(), "Some predictions are missing. Submit at your own risk."
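
If the second assertion fails, it helps to know which blocks have no prediction. A minimal sketch of that check on plain key tuples; the helper `missing_blocks` and the toy keys are hypothetical. With the DataFrames above, the keys could be obtained with something like `list(blocks[cols].itertuples(index=False, name=None))` for `cols = ["sq_x", "sq_y", "hour_hash"]`.

```python
def missing_blocks(block_keys, prediction_keys):
    """Return block keys (in original order) that have no matching prediction."""
    predicted = set(prediction_keys)
    return [key for key in block_keys if key not in predicted]


# Toy (sq_x, sq_y, hour_hash) keys, for illustration only.
blocks_keys = [(-4, -1, 101), (-4, -2, 101), (-5, -3, 102)]
pred_keys = [(-4, -1, 101), (-5, -3, 102)]

print(missing_blocks(blocks_keys, pred_keys))
```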


In [None]:
merged[['id','prediction']].to_csv("baseline_submission.csv",sep=',',index=False,header=False)

You can now upload baseline_submission.csv to the competition interface.

In [None]:
!head baseline_submission.csv

### Known unknowns <a id='known_unknowns'></a>

Here are a few ideas to improve your solution:
* Right now we only consider users in the same square where we make the prediction.
 * It may be useful to consider squares that are neighbors in space and/or time
 * It may be useful to use a global city-wide estimate (like "There's currently no rain in Moscow")
 * The same is true for netatmo stations
* There are a lot of underexplored features
 * Netatmo stations' features
 * User behavior on device level, e.g. "phone signal worse than usual"
 * Latitude/longitude are fed to model in 
 * Relations between several kinds of features (e.g. signal over distance to cell)
 * Relations over location/time, e.g. "fewer users than usual"
* Data splits
 * Rain may be more or less frequent in the test set than in the training set
 * There may also be some difference in user activity
 * There's definitely a difference in the distribution of users and stations across cities
 * We only train the model on one fixed region. Try using several regions at once to get more training data.
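
As an illustration of the neighboring-squares idea, here is a minimal sketch that averages a per-square value over the surrounding squares for a single hour. It assumes squares are addressed by integer `(sq_x, sq_y)` pairs as in the data above; the helper `neighbor_mean` and the toy `num_users` values are hypothetical.

```python
def neighbor_mean(values, sq_x, sq_y, radius=1):
    """Mean of `values` over squares within `radius` of (sq_x, sq_y),
    excluding the square itself. Returns None when no neighbor has data."""
    found = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue  # skip the central square
            key = (sq_x + dx, sq_y + dy)
            if key in values:
                found.append(values[key])
    return sum(found) / len(found) if found else None


# Toy per-square user counts for one hour (illustrative values only).
num_users = {(-4, -1): 12, (-4, -2): 8, (-5, -3): 3, (0, 0): 40}

print(neighbor_mean(num_users, -4, -2))
```

The same pattern could be applied along the time axis by keying the dict on the hour as well, or to netatmo aggregates instead of user counts.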




![img](https://images-na.ssl-images-amazon.com/images/I/31la29lBQxL.jpg)


