## Feature Engineering With One Hot Encoding, CountVectorizer, HashingVectorizer

In previous notebooks we experimented with FeatureHasher for encoding categorical variables with high dimensionality (~3MM feature columns). 

This notebook applies LabelEncoder and OneHotEncoder to categorical variables with smaller numbers of unique values (<10,000).  In addition we test CountVectorizer and HashingVectorizer.

Since feature engineering for categorical variables (eg one hot encoding) is difficult with our full 6GB dataset on our local computer (8GB RAM), in this notebook we load ~25% of our training data (~10MM rows) to test our strategies.

### Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.utils import resample
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher 
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost
from scipy.sparse import hstack
from sklearn.metrics import classification_report, log_loss, roc_auc_score, roc_curve
from sklearn.decomposition import TruncatedSVD


### Loading Test Data, Sample Submission and Partial Training Data

Loading in our full test file as well as the sample submission CSV

In [2]:
# loading in test data as well as the Sample Submission file
sample = pd.read_csv('../assets/sampleSubmission')
test = pd.read_csv('../assets/test')

# Because of GitHub space limits (no files over 2GB), train data file was split into 5 pieces

# Loading the first file with header row to use for column names -- about 20% of our full training dataset
%time train = pd.read_csv("../assets/trainaa")

# Checking the columns present
train.head(1)

CPU times: user 46.8 s, sys: 12.8 s, total: 59.6 s
Wall time: 1min 7s


Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1.000009e+18,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,2,15706,320,50,1722,0,35,-1,79


### Addressing Class Imbalance
We have only 16% minority class representation in this reduced dataset. Let's see whether changing this balance results in better model performance.

In [3]:
# Splitting training dataset into minority and majority classes so we can downsample majority class
train_0 = train[train.click == 0]
train_1 = train[train.click == 1]

# Reducing the majority class to be the same size as the minority class
%time train_0_reduced = resample(train_0, replace=False, n_samples=train_1.shape[0], random_state=99)

# Appending the reduced majority class to our minority class to make a new training set
train = train_0_reduced.append(train_1)

# Now we can see that each class accounts for about 50% of the rows in our dataset
train.click.value_counts(normalize=True)

CPU times: user 4.02 s, sys: 1.66 s, total: 5.68 s
Wall time: 6.17 s


1    0.5
0    0.5
Name: click, dtype: float64

### Making Y Train
We'll use the "click" column in the training set as our target or "y" series for training purposes, dropping it from our training dataframe.  We will also remove the "id" column, which does not add value for training purposes.


In [4]:
# Creating a y_train with our click information
y_train = train.click

# From the Kaggle site, we know 'id' is just a record indicator, so we can drop it
# along with the "click" column, which is our target
train = train.drop(columns=['click', 'id'])


In [5]:
# Checking our shapes. We see that "test" has an extra column, because it still has "id". We'll drop this column next 
# after using it to prepare a submission dataframe, where we'll put our predictions
(train.shape, test.shape, y_train.shape)

((3322714, 22), (4577464, 23), (3322714,))

In [6]:
# Checking class balancce in y_train
y_train.value_counts(normalize=True)

1    0.5
0    0.5
Name: click, dtype: float64

### Prepping a Submission Dataframe
We take the "id" column and index from the "test" dataframe and add a "click" column, following the model of the sample submission file.  For now the click column is filled with zeros; later as we generate model predictions we will place them here, before saving this dataframe out as a csv to upload to Kaggle.


In [7]:
# Creating the submission dataframe and conforming types
submit = pd.DataFrame(test.id, index=test.index)
submit['click'] = 0.0
submit['id'] = submit['id'].astype(np.uint64)

# Dropping the id column from test
test = test.drop(columns="id")

# Now we can verify that "train" and "test" have the same number of columns, as expected
# We can also verify that the "submit" dataframe has the same number of rows as "test"
(train.shape, test.shape, submit.shape)

((3322714, 22), (4577464, 22), (4577464, 2))

### Feature Engineering: Date Columns
We have a column representing hour/day/month/year for each hour in our 14 day sample ("hour"). We will convert this to a date-time object, then add feature columns based on the day of the week and the hour of the day.  Finally we will remove the hour column. We'll perform the same operation on both our "train" and "test" dataframes.


In [8]:
# function to make day-of-week, hour-of-day features
def make_date_features(dataframe, frame_name="dataframe", date_col="hour"): 
    date_obj = pd.to_datetime(dataframe[date_col])
    
    dataframe['hour-of-the-day'] = date_obj.dt.hour
    print(f"Created 'hour-of-the-day' column in {frame_name}")
    
    dataframe['day-of-the-week'] = date_obj.dt.dayofweek
    print(f"Created 'day-of-the-week' column in {frame_name}")

# running "train" and "test" through our date features function to create the engineered columns 
make_date_features(train, "train", "hour")
make_date_features(test, "test", "hour")

# dropping the "hour" column now that we no longer need it
train = train.drop(columns=["hour"])
test = test.drop(columns=["hour"])

Created 'hour-of-the-day' column in train
Created 'day-of-the-week' column in train
Created 'hour-of-the-day' column in test
Created 'day-of-the-week' column in test


### Number of categorical features

We have multiple columns with large numbers of categorical values. In fact, from our documentation, we know that _all_ the columns are categorical, whether their datatype is "object" or not.

In [9]:
# Looking at the number of unique values in each column of the training dataset
uniques = pd.DataFrame(data=[train.columns.values, [len(train[col].unique()) for col in train.columns], [train[col].dtype for col in train.columns]]).T
uniques = uniques.rename({0:'Column Name', 1:'Unique Values', 2:'Dtype'}, axis=1)
uniques = uniques.sort_values(by='Unique Values', ascending=False).reset_index(drop=True)
uniques

Unnamed: 0,Column Name,Unique Values,Dtype
0,device_ip,1128197,object
1,device_id,361531,object
2,device_model,6043,object
3,app_id,4252,object
4,site_domain,3464,object
5,site_id,2964,object
6,C14,1002,int64
7,app_domain,299,object
8,C17,222,int64
9,C20,166,int64


We'll use OneHotEncoder (and LabelEncoder, if needed) on each feature from C14 on down.  Let's start with one feature (C14), following which we'll create a function to generate the appropriate pipelines for all these features.

### OneHotEncoding an integer column 
Let's start with C14. Since this is a column containing integer values, we don't need to use a LabelEncoder and we can proceed directly to OneHotEncoder.

In [10]:
# Creating an encoder that we can dedicate to this column
ohe_C14 = OneHotEncoder(handle_unknown='ignore') # so that we'll ignore any new values found in the test dataframe

In [11]:
# Generating an encoded matrix from our C14 column as well as fitting this encoder on its values
encoded_C14 = ohe_C14.fit_transform(train.C14.as_matrix().reshape(-1, 1)) # reshaping to 2 dimensions

  from ipykernel import kernelapp as app


In [12]:
# Encoding the same column from our test database
encoded_C14_test = ohe_C14.transform(test.C14.as_matrix().reshape(-1, 1))

  from ipykernel import kernelapp as app


In [13]:
(encoded_C14.shape, encoded_C14_test.shape)

((3322714, 1002), (4577464, 1002))

### OneHotEncoding an Object Column with LabelEncoder
Now let's apply OneHotEncoder to "App Domain". Since this column contains string values, we'll need to use LabelEncoder to convert them to integers first.

In [14]:
# We'll use separate encoders for train and test -- OneHotEncoder will deal with any new values in test

# Label-encoding the column in train 
label_encoded_app_domain = LabelEncoder().fit_transform(train.app_domain)

# Using a different label encoder for the column in test
label_encoded_app_domain_test = LabelEncoder().fit_transform(test.app_domain)

In [15]:
# Checking the resulting shapes. 
label_encoded_app_domain.shape, label_encoded_app_domain_test.shape

((3322714,), (4577464,))

In [16]:
# Instantiating a OneHotEncoder and fitting it on the encoded column from train
# Since we set "handle_unknown" to "ignore", we expect the model to handle the extra 
ohe_app_domain = OneHotEncoder(handle_unknown="ignore").fit(label_encoded_app_domain.reshape(1, -1)) 

# Transforming the train column with the fitted encoder
encoded_app_domain = ohe_app_domain.transform(label_encoded_app_domain.reshape(1, -1))

# Attempting to transform the test column... but running into an error
encoded_app_domain_test = ohe_app_domain.transform(label_encoded_app_domain_test.reshape(1, -1))

# Open below to see error!

ValueError: X has different shape than during fitting. Expected 3322714, got 4577464.

It seems that "handle_unknown = 'ignore'" is not working as expected

### CountVectorizing An Object Column
Since OneHotEncoder and LabelEncoder don't really address the issue of different values in train and test, let's try CountVectorizer instead.

In [None]:
cvec = CountVectorizer()
cvec = cvec.fit(train.app_domain)
app_domain_cvec = cvec.transform(train.app_domain)
app_domain_cvec_test = cvec.transform(test.app_domain)

In [None]:
app_domain_cvec.shape, app_domain_cvec_test.shape

Success! This worked very well!

### HashVectorizing An Object Column
Now let's try HashVectorizer - this may end up being useful for our largest text columns

In [None]:
hvec = HashingVectorizer()
hvec = hvec.fit(train.app_domain)
app_domain_hvec = hvec.transform(train.app_domain)
app_domain_hvec_test = hvec.transform(test.app_domain)

In [None]:
app_domain_hvec.shape, app_domain_hvec_test.shape

Looks like something we can use for object columns with over 1MM values

### Summary of Feature Engineering Steps Planned
After this review we now have a good feeling for which types of encoding to use as we build a pipeline.  

For large object columns (1MM+ values) we will use HashingVectorizer.  

For smaller object columns we will use CountVectorizer.

For integer columns we will use OneHotEncoding.



### Test: Multiple Columns With OneHotEncoding
In theory we should be able to pass multiple columns to OneHotEncoder. Let's see how well that works in practice.

In [18]:
# Checking all our columns with datatypes and number of unique values
train_numeric = train.select_dtypes(include='number')
test_numeric = test.select_dtypes(include='number')
train_numeric.columns, test_numeric.columns

(Index(['C1', 'banner_pos', 'device_type', 'device_conn_type', 'C14', 'C15',
        'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'hour-of-the-day',
        'day-of-the-week'],
       dtype='object'),
 Index(['C1', 'banner_pos', 'device_type', 'device_conn_type', 'C14', 'C15',
        'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'hour-of-the-day',
        'day-of-the-week'],
       dtype='object'))

In [19]:
# This test fails because we have negative values. We need to transform these to run OneHotEncoder.
#ohe = OneHotEncoder()
#ohe = ohe.fit(train_numeric)

ValueError: X needs to contain only non-negative integers.

In [43]:
# note: C15 and C16 are banner ad dimensions and should be combined to a "size" variable
# EDA shows that variable C20 has -1 as its most frequent value.

train.C20.value_counts().head(5)

-1         1743866
 100084     218999
 100148     135398
 100111     122583
 100077     113742
Name: C20, dtype: int64

In [44]:
# We can see that all the other values are over 10000, so this will be easy to transform
train.C20.unique()

array([    -1, 100075, 100181, 100151, 100156, 100176, 100191, 100084,
       100081, 100020, 100143, 100177, 100076, 100000, 100079, 100111,
       100192, 100119, 100162, 100148, 100139, 100194, 100031, 100172,
       100083, 100070, 100062, 100131, 100021, 100202, 100160, 100077,
       100199, 100189, 100088, 100033, 100046, 100228, 100074, 100002,
       100060, 100064, 100200, 100144, 100039, 100114, 100087, 100101,
       100096, 100217, 100241, 100173, 100150, 100094, 100097, 100003,
       100193, 100103, 100182, 100185, 100105, 100166, 100019, 100005,
       100004, 100149, 100034, 100050, 100109, 100049, 100012, 100233,
       100155, 100210, 100128, 100052, 100013, 100057, 100048, 100152,
       100161, 100130, 100183, 100086, 100065, 100195, 100188, 100022,
       100170, 100221, 100126, 100215, 100190, 100225, 100113, 100053,
       100055, 100061, 100037, 100028, 100141, 100072, 100212, 100059,
       100106, 100016, 100001, 100117, 100025, 100168, 100063, 100112,
      

In [52]:
# Series.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
train_numeric.C20.replace(to_replace=-1, value=1, inplace=True)
test_numeric.C20.replace(to_replace=-1, value=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


In [55]:
# Now we can run again without that error
ohe = OneHotEncoder(handle_unknown='ignore')
ohe = ohe.fit(train_numeric)
numeric_encoded_train = ohe.transform(train_numeric)
numeric_encoded_test = ohe.transform(test_numeric)
numeric_encoded_train.shape, numeric_encoded_test.shape

((3322714, 1523), (4577464, 1523))

Success! This can be one pipeline in a FeatureUnion against the whole dataframe.

In [21]:
# Useful post on a technique to use DefaultDict for something similar.
# https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn

### Using CountVectorizer on a numeric column
What about using CountVectorizer on numeric columns?

To do this we need to convert all the integer values to strings. Potentially feasible with our full data set?

In [31]:
cvec2 = CountVectorizer()
string_col_train = train['C1'].map(lambda x: str(x))
string_col_test = test['C1'].map(lambda x: str(x))

train_cvec2 = cvec2.fit_transform(string_col_train)
test_cvec2 = cvec2.transform(string_col_test)

train_cvec2.shape, test_cvec2.shape


((3322714, 7), (4577464, 7))

In [None]:
# Execution Stopper
assert "red" == 'blue'

In [56]:
### Testing Predictions on OneHotEncoder Output

In [57]:
bnb = BernoulliNB()
bnb = bnb.fit(numeric_encoded_train, y_train)
y_hat_bnb = bnb.predict(numeric_encoded_train)
print(roc_auc_score(y_train, y_hat_bnb))
unique, counts = np.unique(y_hat_bnb, return_counts=True)
print(unique, counts)

0.632135657778551
[0 1] [1727131 1595583]


Now let's try Logistic Regression

In [None]:
lr = LogisticRegression()
%time lr = lr.fit(numeric_encoded_train, y_train)
y_hat_lr = lr.predict(numeric_encoded_train)
print(roc_auc_score(y_train, y_hat_lr)) 
unique, counts = np.unique(y_hat_lr, return_counts=True)
print(unique, counts)

Better results on these columns than we got on the feature hashed columns previously.

In [None]:
xg = xgboost()
%time xg = xg.fit(numeric_encoded_train, y_train)
y_hat_xg = xg.predict(numeric_encoded_train)
print(roc_auc_score(y_train, y_hat_xg)) 
unique, counts = np.unique(y_hat_xg, return_counts=True)
print(unique, counts)

In [None]:
rf = RandomForestClassifier()
%time rf = rf.fit(numeric_encoded_train, y_train)
y_hat_rf = rf.predict(numeric_encoded_train)
print(roc_auc_score(y_train, y_hat_rf)) 
unique, counts = np.unique(y_hat_rf, return_counts=True)
print(unique, counts)

### Conclusion
With undersampling of the majority class, we are able to see some positive impact on predictions from the FeatureHasher columns - however it is fairly minimal (+.05 Roc-Auc score). 