# Design Pattern 1 - Hashed Feature (Chapter 2)

## Introduction to Design Pattern

This pattern is for categorical input features that have a large number of categories relative to the number of training samples available. In this case, the aim is to collapse the categories into a smaller number of buckets by merging them, while hopefully not losing too much predictive skill. The book describes when this is appropriate to use, and when alternatives should be sought.

Link to original example code
* https://github.com/GoogleCloudPlatform/ml-design-patterns/blob/master/02_data_representation/hashed_feature.ipynb 

## Example Python implementation - XBT

In this example we will use the data from the [XBT project](https://github.com/MetOffice/XBTs_classification). The platform and institute variables both contain many categories (hundreds or thousands), so we will demonstrate using the hashed feature pattern. 

As described in the book, we need to use a *fingerprint hash* rather than a *cryptographic hash*, so we will use FarmHash (via the `pyfarmhash` package), as in the original example.
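Whichever hash is chosen, it must give the same value for the same string on every run; Python's built-in `hash()` does not, because string hashes are salted per process by default. A minimal sketch of a deterministic bucket function follows. It calls `farmhash.fingerprint64` when `pyfarmhash` is installed. The `zlib.crc32` fallback is an assumption added only so the sketch runs without that dependency, and it produces different bucket numbers from FarmHash. The helper names `_fingerprint` and `hash_bucket` are hypothetical.

```python
import zlib

try:
    import farmhash  # pyfarmhash, as used in this notebook

    def _fingerprint(text: str) -> int:
        # 64-bit FarmHash fingerprint of the string
        return farmhash.fingerprint64(text)

except ImportError:

    def _fingerprint(text: str) -> int:
        # Assumption: CRC32 as a deterministic, non-cryptographic stand-in
        return zlib.crc32(text.encode("utf-8"))


def hash_bucket(category: str, num_buckets: int) -> int:
    """Map a category string to a bucket index in [0, num_buckets)."""
    return _fingerprint(category) % num_buckets


# The same platform name always lands in the same bucket.
platforms = ["KEARSARGE", "LAURENCE M. GOULD", "KEARSARGE"]
buckets = [hash_bucket(p, 10) for p in platforms]
print(buckets)
```

Because the mapping depends only on the string itself, no lookup table of categories has to be stored alongside the model.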

### Library Requirements
* pandas
* scikit-learn
* [pyfarmhash](https://pypi.org/project/pyfarmhash/)

In [5]:
import pathlib
import pandas

In [1]:
import farmhash

In [6]:
root_data_loc = pathlib.Path('/Users/stephen.haddad/data/xbt-data/dask_clean')

In [7]:
xbt_fname_template = 'xbt_{year}.csv'

In [8]:
# The end year is exclusive when passed to range() below, so files for 1966 to 2014 are read.
year_range = (1966, 2015)

In [9]:
xbt_df = pandas.concat([pandas.read_csv(root_data_loc / xbt_fname_template.format(year=year1)) for year1 in range(year_range[0], year_range[1])])
xbt_df

Unnamed: 0.1,Unnamed: 0,country,lat,lon,date,year,month,day,institute,platform,cruise_number,instrument,model,manufacturer,max_depth,imeta_applied,id
0,0,UNITED STATES,32.966667,-117.633331,19660412,1966,4,12,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,466.892670,1,2052528
1,1,UNITED STATES,33.016666,-118.116669,19660413,1966,4,13,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,466.852051,1,2052529
2,2,UNITED STATES,33.066666,-118.466667,19660414,1966,4,14,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,70.602089,1,2052530
3,3,UNITED STATES,32.700001,-118.666664,19660414,1966,4,14,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,466.907410,1,2052531
4,4,UNITED STATES,32.933334,-117.916664,19660414,1966,4,14,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,466.811493,1,2052532
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18995,18995,UNITED STATES,-58.394001,-63.181000,20141231,2014,12,31,0,LAURENCE M. GOULD (R/V; call sign WCX7445; bui...,US036605,XBT: DEEP BLUE (SIPPICAN),DEEP BLUE,SIPPICAN,899.722412,0,16686048
18996,18996,UNITED STATES,-58.500999,-63.125000,20141231,2014,12,31,0,LAURENCE M. GOULD (R/V; call sign WCX7445; bui...,US036605,XBT: DEEP BLUE (SIPPICAN),DEEP BLUE,SIPPICAN,929.809082,0,16686049
18997,18997,UNITED STATES,-58.598000,-63.064999,20141231,2014,12,31,0,LAURENCE M. GOULD (R/V; call sign WCX7445; bui...,US036605,XBT: DEEP BLUE (SIPPICAN),DEEP BLUE,SIPPICAN,908.195984,0,16686051
18998,18998,UNITED STATES,-58.681999,-63.015999,20141231,2014,12,31,0,LAURENCE M. GOULD (R/V; call sign WCX7445; bui...,US036605,XBT: DEEP BLUE (SIPPICAN),DEEP BLUE,SIPPICAN,914.778015,0,16686052


Having loaded the data into memory, we can see how many categories the *institute* and *platform* features contain.

In [14]:
len(xbt_df['institute'].unique())

249

In [60]:
len(xbt_df['platform'].unique())

2632

In a real project, we would start by creating a train/test split. Here we use scikit-learn's `train_test_split` function, which by default holds out 25% of the rows for testing.

In [61]:
import sklearn.model_selection

In [64]:
xbt_train, xbt_test = sklearn.model_selection.train_test_split(xbt_df)

In [65]:
xbt_train.shape

(1689845, 19)

In [66]:
xbt_test.shape

(563282, 19)

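Now we create the hashed feature by taking the fingerprint hash of each platform name modulo the number of buckets. In this example we select a relatively small number of buckets (`num_hashes`). Note that later cells read `platform_hashed` from `xbt_train`, so in a fresh top-to-bottom run this column has to be created before the train/test split; the 19-column shapes reported for the split suggest it was made in a session where the column already existed.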

In [67]:
num_hashes = 10

In [68]:
xbt_df['platform_hashed'] = xbt_df['platform'].apply(lambda s1: farmhash.fingerprint64(s1) % num_hashes) 
xbt_df['platform_hashed']

0        1
1        1
2        1
3        1
4        1
        ..
18995    9
18996    9
18997    9
18998    9
18999    7
Name: platform_hashed, Length: 2253127, dtype: int64

In [69]:
xbt_df['platform_hashed'].value_counts()

5    520113
1    270603
8    220289
2    189834
7    185400
9    185351
0    182578
3    167636
6    165933
4    165390
Name: platform_hashed, dtype: int64
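The counts above are rows per bucket. It is also worth checking how many distinct categories are merged into each bucket: with 2632 platforms and 10 buckets, that is about 263 platforms per bucket on average. The following is a rough sketch of that check. The category names are synthetic placeholders, and `zlib.crc32` is used as an assumed stand-in for `farmhash.fingerprint64` so that the snippet needs only the standard library.

```python
import zlib
from collections import Counter

NUM_CATEGORIES = 2632  # number of distinct platforms reported above
NUM_BUCKETS = 10       # same value as num_hashes in this notebook

# Hypothetical category names; the real platform names would be used in practice.
categories = [f"platform_{i:04d}" for i in range(NUM_CATEGORIES)]

# Assumption: CRC32 stands in for farmhash.fingerprint64 here.
bucket_of = {c: zlib.crc32(c.encode("utf-8")) % NUM_BUCKETS for c in categories}

# Number of distinct categories that collide in each bucket
categories_per_bucket = Counter(bucket_of.values())

for bucket in sorted(categories_per_bucket):
    print(bucket, categories_per_bucket[bucket])

print("average categories per bucket:", NUM_CATEGORIES / NUM_BUCKETS)
```

Increasing the number of buckets reduces the number of categories that collide in each bucket, at the cost of a wider one-hot encoding.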

Once the feature is created, it can be used in the same way as any other categorical feature, as demonstrated below.

In [70]:
import sklearn.preprocessing
import numpy

In [71]:
scaler_year1 = sklearn.preprocessing.MinMaxScaler().fit(xbt_train[['year']])

In [72]:
scaler_maxDepth1 = sklearn.preprocessing.MinMaxScaler().fit(xbt_train[['max_depth']])

In [73]:
# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`; `sparse` was removed in 1.4.
ohe_platform = sklearn.preprocessing.OneHotEncoder(sparse=False).fit(xbt_train[['platform_hashed']])

In [74]:
X_train = numpy.concatenate([
    scaler_year1.transform(xbt_train[['year']]),
    scaler_maxDepth1.transform(xbt_train[['max_depth']]),
    ohe_platform.transform(xbt_train[['platform_hashed']]),
], axis=1)

In [75]:
X_train.shape

(1689845, 12)
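The 12 columns are one scaled `year` column, one scaled `max_depth` column and ten one-hot columns, one for each hash bucket seen during fitting. The pure-Python sketch below shows how a single row is laid out and is meant only as an illustration. The helpers `min_max_scale` and `one_hot`, the year limits and the depth limits are hypothetical stand-ins for the fitted scikit-learn transformers, not values taken from the data.

```python
def min_max_scale(value, lo, hi):
    """Scale value linearly so that lo maps to 0 and hi maps to 1."""
    return (value - lo) / (hi - lo)


def one_hot(bucket, num_buckets):
    """Return a list of zeros with a single 1.0 at the bucket index."""
    return [1.0 if i == bucket else 0.0 for i in range(num_buckets)]


num_hashes = 10
year_limits = (1966, 2014)    # assumed min and max year in the training set
depth_limits = (0.0, 2000.0)  # hypothetical min and max depth

# One example row, loosely based on the first record shown earlier
row = {"year": 1966, "max_depth": 466.89, "platform_hashed": 1}

features = (
    [min_max_scale(row["year"], *year_limits)]
    + [min_max_scale(row["max_depth"], *depth_limits)]
    + one_hot(row["platform_hashed"], num_hashes)
)

print(len(features))  # 12 = 1 (year) + 1 (max_depth) + 10 (hash buckets)
```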

## Real world examples

It's quite easy to imagine when one might use this in the real world. The original example predicts the fraction of planes that will be late at an airport, using the airport ID as an input. Hashing does lump many different airports together into a single value, but this was considered to retain sufficient information. A weather dataset with station or user IDs, where there are many IDs compared to data points, might be another example. In reality, as pointed out in the book, other data about a weather station, such as its location given by latitude and longitude, might be a better bet.
