# Design Pattern 1 - Hashed Feature (Chapter 2)

## Introduction to Design Pattern

Give a brief descriptive overview of the pattern. This is not intended to replace the discussion in the book itself.

Link to original example code
* https://github.com/GoogleCloudPlatform/ml-design-patterns/blob/master/02_data_representation/hashed_feature.ipynb 
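
As a rough illustration of the idea (not the book's code and not this notebook's code), a hashed feature maps each category string to one of a fixed number of buckets using a deterministic hash. The sketch below uses `hashlib` from the standard library as a stand-in for FarmHash; the function name `hash_bucket` and the third sample string are hypothetical.

```python
import hashlib


def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a category string to a bucket index in [0, num_buckets)."""
    # md5 is used here only as a stable, deterministic fingerprint (not for security).
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the remainder.
    return int.from_bytes(digest[:8], "big") % num_buckets


# An unseen category still receives a valid bucket, with no vocabulary to maintain.
for name in ["KEARSARGE", "LAURENCE M. GOULD", "A PLATFORM NEVER SEEN IN TRAINING"]:
    print(name, "->", hash_bucket(name, num_buckets=10))
```

Python's built-in `hash()` is salted per process for strings, so it would not give the same bucket from one run to the next; that is why a fingerprint-style hash is used instead.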

## Example Python implementation

In [5]:
import pathlib
import pandas

In [1]:
import farmhash

In [6]:
root_data_loc = pathlib.Path('/Users/stephen.haddad/data/xbt-data/dask_clean')

In [7]:
xbt_fname_template = 'xbt_{year}.csv'

In [8]:
# The end year is exclusive when passed to range(), so the data runs from 1966 through 2014.
year_range = (1966, 2015)

In [9]:
xbt_df = pandas.concat(
    [pandas.read_csv(root_data_loc / xbt_fname_template.format(year=year1)) for year1 in range(*year_range)]
)
xbt_df

Unnamed: 0.1,Unnamed: 0,country,lat,lon,date,year,month,day,institute,platform,cruise_number,instrument,model,manufacturer,max_depth,imeta_applied,id
0,0,UNITED STATES,32.966667,-117.633331,19660412,1966,4,12,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,466.892670,1,2052528
1,1,UNITED STATES,33.016666,-118.116669,19660413,1966,4,13,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,466.852051,1,2052529
2,2,UNITED STATES,33.066666,-118.466667,19660414,1966,4,14,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,70.602089,1,2052530
3,3,UNITED STATES,32.700001,-118.666664,19660414,1966,4,14,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,466.907410,1,2052531
4,4,UNITED STATES,32.933334,-117.916664,19660414,1966,4,14,US NAVY SHIPS OF OPPORTUNITY,KEARSARGE,US044120,XBT: T4 (SIPPICAN),T4,SIPPICAN,466.811493,1,2052532
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18995,18995,UNITED STATES,-58.394001,-63.181000,20141231,2014,12,31,0,LAURENCE M. GOULD (R/V; call sign WCX7445; bui...,US036605,XBT: DEEP BLUE (SIPPICAN),DEEP BLUE,SIPPICAN,899.722412,0,16686048
18996,18996,UNITED STATES,-58.500999,-63.125000,20141231,2014,12,31,0,LAURENCE M. GOULD (R/V; call sign WCX7445; bui...,US036605,XBT: DEEP BLUE (SIPPICAN),DEEP BLUE,SIPPICAN,929.809082,0,16686049
18997,18997,UNITED STATES,-58.598000,-63.064999,20141231,2014,12,31,0,LAURENCE M. GOULD (R/V; call sign WCX7445; bui...,US036605,XBT: DEEP BLUE (SIPPICAN),DEEP BLUE,SIPPICAN,908.195984,0,16686051
18998,18998,UNITED STATES,-58.681999,-63.015999,20141231,2014,12,31,0,LAURENCE M. GOULD (R/V; call sign WCX7445; bui...,US036605,XBT: DEEP BLUE (SIPPICAN),DEEP BLUE,SIPPICAN,914.778015,0,16686052


In [39]:
xbt_df['test'] = False

In [None]:
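# The concatenated index repeats for every year, so select the 20% test rows by position rather than by label.
xbt_df.iloc[xbt_df.reset_index(drop=True).sample(frac=0.2).index.to_numpy(), xbt_df.columns.get_loc('test')] = True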

In [14]:
len(xbt_df['institute'].unique())

249

In [15]:
len(xbt_df['platform'].unique())

2632

In [20]:
num_hashes = 10

In [24]:
# Map each platform name to one of num_hashes buckets using a deterministic fingerprint hash.
xbt_df['platform_hashed'] = xbt_df['platform'].apply(lambda s1: farmhash.fingerprint64(s1) % num_hashes)
xbt_df['platform_hashed']

0        1
1        1
2        1
3        1
4        1
        ..
18995    9
18996    9
18997    9
18998    9
18999    7
Name: platform_hashed, Length: 2253127, dtype: int64

In [25]:
xbt_df['platform_hashed'].value_counts()

5    520113
1    270603
8    220289
2    189834
7    185400
9    185351
0    182578
3    167636
6    165933
4    165390
Name: platform_hashed, dtype: int64
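
The row counts per bucket are uneven (bucket 5 holds 520113 rows, bucket 4 holds 165390). Two things contribute to this in general: several platforms collide in each bucket, and individual platforms contribute very different numbers of rows. The sketch below looks only at the first effect, counting how many distinct categories land in each bucket. It uses made-up platform names and `hashlib` in place of FarmHash, so the numbers it prints will not match the notebook's.

```python
import hashlib
from collections import defaultdict


def hash_bucket(value, num_buckets):
    # Stand-in for farmhash.fingerprint64(value) % num_buckets.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# Hypothetical platform names standing in for the 2632 real ones.
platforms = [f"PLATFORM_{i}" for i in range(2632)]
num_buckets = 10

# Collect the set of distinct platform names that fall into each bucket.
members = defaultdict(set)
for name in platforms:
    members[hash_bucket(name, num_buckets)].add(name)

for bucket in sorted(members):
    print(bucket, len(members[bucket]))
```

On the real DataFrame, the equivalent check would be `xbt_df.groupby('platform_hashed')['platform'].nunique()`.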

In [19]:
import sklearn
import sklearn.preprocessing
import sklearn.tree

In [33]:
scaler_year1 = sklearn.preprocessing.MinMaxScaler().fit(xbt_df[['year']])

In [34]:
scaler_maxDepth1 = sklearn.preprocessing.MinMaxScaler().fit(xbt_df[['max_depth']])

In [31]:
ohe_platform = sklearn.preprocessing.OneHotEncoder()

array([[0.],
       [0.],
       [0.],
       ...,
       [1.],
       [1.],
       [1.]])

In [None]:
# TODO: demonstrate training using the new feature.
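
One possible shape for the training step is sketched below on a tiny made-up table that reuses the column names from `xbt_df`. Several things here are assumptions rather than anything the notebook states: the choice of `instrument` as the prediction target, the sample rows, the `hash_bucket` helper (which uses `hashlib` instead of FarmHash), and the tree settings.

```python
import hashlib

import numpy
import pandas
import sklearn.preprocessing
import sklearn.tree


def hash_bucket(value, num_buckets):
    # Stand-in for farmhash.fingerprint64(value) % num_buckets.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# Made-up rows; only the column names come from the notebook.
demo_df = pandas.DataFrame({
    "platform": ["SHIP_A", "SHIP_A", "SHIP_B", "SHIP_B", "SHIP_C", "SHIP_C"],
    "year": [1966, 1967, 1990, 1991, 2013, 2014],
    "max_depth": [466.9, 466.8, 760.0, 755.2, 899.7, 929.8],
    "instrument": ["T4", "T4", "T7", "T7", "DEEP BLUE", "DEEP BLUE"],
})
num_hashes = 10
demo_df["platform_hashed"] = demo_df["platform"].apply(lambda s: hash_bucket(s, num_hashes))

# One-hot encode the bucket index. Fixing the categories keeps the encoded
# width at num_hashes even if some buckets are empty in the training data.
ohe_platform = sklearn.preprocessing.OneHotEncoder(categories=[list(range(num_hashes))])
platform_features = ohe_platform.fit_transform(demo_df[["platform_hashed"]]).toarray()

# Scale the numeric columns to [0, 1], as the notebook does with MinMaxScaler.
scaler = sklearn.preprocessing.MinMaxScaler()
numeric_features = scaler.fit_transform(demo_df[["year", "max_depth"]])

# Combine hashed-platform and numeric features into one matrix and fit a small tree.
features = numpy.hstack([platform_features, numeric_features])
classifier = sklearn.tree.DecisionTreeClassifier(max_depth=3, random_state=0)
classifier.fit(features, demo_df["instrument"])

predictions = classifier.predict(features)
print(predictions)
```

On the real data, the same steps would be fitted on the rows where `test` is `False` and evaluated on the rows where it is `True`.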

## Real-world examples


Try to include some actual or possible examples of where this design pattern could be used in a weather and climate context.