# Custom Transformers Lab

### Introduction

In this lesson, we'll practice creating our own transformers and integrating them with our DataFrameMapper.  Let's get started.

### Loading our Data

Let's continue working with our Airbnb dataset.

In [1]:
import pandas as pd

listings_df = pd.read_csv('./listings_summary.csv.zip')

This time let's select the zipcode column.  Now, while zipcodes do contain numbers, we do not think of them as being linear.  That is, we don't expect increasing a zipcode from 11001 to 11002 to necessarily increase our target value.  A zipcode represents a geographic region, and so it can be handled as a category.

Let's see how we can handle this.

Let's begin by checking if there are any null values.

In [43]:
zipcode_col = listings_df[['zipcode']]

In [44]:
zipcode_col.isna().sum()

zipcode    656
dtype: int64

That's a lot of null values.  Let's begin by using a DataFrameMapper to replace each of them with an empty string, `''`.

In [49]:
from sklearn_pandas import DataFrameMapper, FunctionTransformer
from sklearn.impute import SimpleImputer
mapper = DataFrameMapper([
    (['zipcode'], SimpleImputer(strategy = 'constant', fill_value = ''))
], df_out = True)

In [53]:
mapper.fit_transform(listings_df).isna().sum()

zipcode    0
dtype: int64
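
To see the imputation step on its own, here is a minimal sketch that applies the same `SimpleImputer` settings directly, without the mapper.  The `toy_df` DataFrame and its values are hypothetical stand-ins for the real listings data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the zipcode column
toy_df = pd.DataFrame({'zipcode': ['10245', np.nan, '12047', np.nan]}, dtype=object)

# Same settings as in the mapper: replace missing values with an empty string
imputer = SimpleImputer(strategy='constant', fill_value='')
filled = imputer.fit_transform(toy_df[['zipcode']])

nulls_before = int(toy_df['zipcode'].isna().sum())
nulls_after = int(pd.isna(filled).sum())
print(nulls_before, nulls_after)  # 2 0
```

Note that the imputer returns a NumPy array here, while the mapper with `df_out = True` hands back a DataFrame.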

Ok, now we want to handle zipcodes like a categorical variable.  But if we simply one hot encode our zipcodes, we will end up with many sparse columns, one for each distinct zipcode.  A workable technique instead is to keep only a prefix of each zipcode string, which groups nearby zipcodes together.

In [70]:
listings_df['zipcode'].str[:4].value_counts(normalize = True)[:10].index

Index(['1024', '1043', '1204', '1096', '1205', '1011', '1099', '1335', '1040',
       '1055'],
      dtype='object')

The top ten prefixes account for about 68 percent of our zipcode values.  Not bad.
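
One way to measure that kind of coverage is to sum the normalized counts of the most frequent values.  The sketch below uses a small hypothetical `prefixes` series rather than the real listings, so its number will differ from the figure above.

```python
import pandas as pd

# Hypothetical prefixes; in the lesson these would come from
# listings_df['zipcode'].str[:4]
prefixes = pd.Series(['1024', '1024', '1024', '1043', '1043', '1204', '1099', None])

# value_counts sorts by frequency and skips missing values by default
shares = prefixes.value_counts(normalize=True)

top_n = 2
coverage = shares.head(top_n).sum()
print(round(coverage, 3))  # 0.714
```

Because missing values are excluded from `value_counts`, the share is relative to the rows that actually have a zipcode.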

So let's add two custom transformers to our DataFrameMapper.  The first one selects the first four characters of each string.  The second one replaces the zipcode with `'Other'` if it is not among our top ten values.  After that, we can apply one hot encoding to the column.

In [71]:
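def subset_zipcode(zipcode):
    # keep only the first four characters of the zipcode string
    return zipcode[:4]

def replace_other(zipcode):
    # the ten most common four-character prefixes found above
    top_ten_cols = ['1024', '1043', '1204', '1096', '1205', '1011', '1099', '1335', '1040',
       '1055']
    # anything outside the top ten (including '') is grouped as 'Other'
    if zipcode in top_ten_cols:
        return zipcode
    else:
        return 'Other'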
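
Before wiring these into the mapper, it can help to check them on a few values by hand.  The sketch below repeats the two functions so that it runs on its own, and applies them one string at a time to some hypothetical `samples`.

```python
def subset_zipcode(zipcode):
    return zipcode[:4]

def replace_other(zipcode):
    top_ten = ['1024', '1043', '1204', '1096', '1205',
               '1011', '1099', '1335', '1040', '1055']
    return zipcode if zipcode in top_ten else 'Other'

# Hypothetical inputs: two common prefixes, one rare zipcode, and the
# empty string that the imputer produces for missing values
samples = ['10245', '12047', '99999', '']

cleaned = [replace_other(subset_zipcode(z)) for z in samples]
print(cleaned)  # ['1024', '1204', 'Other', 'Other']
```

The last sample shows that imputed empty strings end up in the `'Other'` group along with the rare zipcodes.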

In [79]:
from sklearn_pandas import DataFrameMapper, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
mapper = DataFrameMapper([
    (['zipcode'], [SimpleImputer(strategy = 'constant', fill_value = ''), 
     FunctionTransformer(subset_zipcode), FunctionTransformer(replace_other)])
], df_out = True)

In [80]:
zipcode_updated = mapper.fit_transform(listings_df)

In [84]:
zipcode_updated['zipcode'].value_counts(normalize = True)

Other    0.340679
1024     0.115333
1043     0.076268
1204     0.074361
1096     0.072455
1205     0.069173
1011     0.066868
1099     0.052013
1335     0.047934
1040     0.047845
1055     0.037070
Name: zipcode, dtype: float64
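
With the column reduced to eleven categories, the one hot encoding step becomes manageable.  Here is a minimal sketch of that step using `OneHotEncoder` on its own, with a hypothetical `grouped` array in place of the mapper's output.  In the lesson's mapper, the encoder would presumably be added as a final entry in the list of transformers, though that is not tested here.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical grouped zipcodes, shaped as a single column
grouped = np.array([['1024'], ['Other'], ['1043'], ['1024']], dtype=object)

# handle_unknown='ignore' encodes unseen categories as all zeros at transform time
encoder = OneHotEncoder(handle_unknown='ignore')

# The encoder returns a sparse matrix by default, so convert it for display
encoded = encoder.fit_transform(grouped).toarray()

print(list(encoder.categories_[0]))  # ['1024', '1043', 'Other']
print(encoded.shape)                 # (4, 3)
```

Each row has a single 1 in the column that matches its category.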

### Summary

In this lesson, we saw how we can use custom transformers both to coerce object columns and to handle categorical columns with sparse values.  We also saw that using custom transformers encourages us to write our code in functions, which makes it more reusable going forward.