## Feature Engineering With Reduced Dataset

(Original version of notebook)

Since feature engineering for categorical variables (e.g. one-hot encoding) is difficult with our full 6GB dataset on our local computer (8GB RAM), in this notebook we load ~20% of our training data (~10MM rows) to explore different techniques, such as feature hashing.

### Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher 
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack
from sklearn.metrics import classification_report, log_loss, roc_auc_score, roc_curve
from sklearn.decomposition import TruncatedSVD


### Loading Test Data and Sample Submission

Loading in our full test file as well as the sample submission CSV

In [2]:
# loading in test data as well as the Sample Submission file
sample = pd.read_csv('../assets/sampleSubmission')
test = pd.read_csv('../assets/test')

### Loading Partial Training Data

For this notebook we load only 20% of our full training dataset (~1.5GB, ~10MM rows). This lets us experiment with feature engineering on a local device with 8GB of memory. Running the same operations against the full set may require a different solution (e.g. AWS SageMaker or a Databricks Spark cluster).

In [3]:
# Because of GitHub space limits (no files over 2GB), train data file was split into 5 pieces

# Loading the first file with header row to use for column names
%time train = pd.read_csv("../assets/trainaa")

# Checking the columns present
train.head(1)

CPU times: user 40.7 s, sys: 10.7 s, total: 51.5 s
Wall time: 53.8 s


Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1.000009e+18,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,2,15706,320,50,1722,0,35,-1,79
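The cell above notes that the training file was split into pieces and that only the first piece carries the header row. As a rough sketch of how a later piece could be read with the same column names, the example below uses two small in-memory stand-ins; `piece_a`, `piece_b` and their contents are made up for illustration and are not the real files.

```python
import io
import pandas as pd

# Stand-ins for two pieces of a split CSV: only the first carries the header row
piece_a = io.StringIO("id,click,hour\n1,0,14102100\n2,1,14102100\n")
piece_b = io.StringIO("3,0,14102101\n4,0,14102101\n")

first = pd.read_csv(piece_a)
# Reuse the column names from the first piece for the header-less piece
second = pd.read_csv(piece_b, header=None, names=list(first.columns))

combined = pd.concat([first, second], ignore_index=True)
print(combined.shape)
```

Passing `header=None` together with `names=` keeps the first data row of a header-less piece from being consumed as column labels.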


### Making Y Train
We'll use the "click" column in the training set as our target or "y" series for training purposes, dropping it from our training dataframe.  We will also remove the "id" column, which does not add value for training purposes.


In [4]:
# Creating a y_train with our click information
y_train = train.click

# From the Kaggle site, we know 'id' is just a record indicator, so we can drop it
# along with the "click" column, which is our target
train = train.drop(columns=['click', 'id'])


In [5]:
# Checking our shapes. We see that "test" has an extra column, because it still has "id". We'll drop this column next 
# after using it to prepare a submission dataframe, where we'll put our predictions
(train.shape, test.shape)

((9999999, 22), (4577464, 23))

### Prepping a Submission Dataframe
We take the "id" column and index from the "test" dataframe and add a "click" column, following the model of the sample submission file.  For now the click column is filled with zeros; later as we generate model predictions we will place them here, before saving this dataframe out as a csv to upload to Kaggle.


In [6]:
# Checking the columns of our submission sample
sample.head(1)

Unnamed: 0,id,click
0,10000174058809263569,0.5


In [7]:
# Checking the datatypes of our submission sample
sample.dtypes

id        uint64
click    float64
dtype: object

In [8]:
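# Creating the submission dataframe and conforming types
submit = pd.DataFrame(test.id, index=test.index)
submit['click'] = 0.0
# NOTE: if test.id was parsed as float64, this cast cannot restore digits already lost;
# compare the id shown by submit.head(1) with the one in the sample submission
submit['id'] = submit['id'].astype(np.uint64)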

# Verifying that our new submission has the correct datatypes
submit.dtypes

id        uint64
click    float64
dtype: object

In [9]:
# Verifying that our new submission has the correct columns
submit.head(1)

Unnamed: 0,id,click
0,10000174058809264128,0.0


In [10]:
# Dropping the id column from test
test = test.drop(columns="id")

# Now we can verify that "train" and "test" have the same number of columns, as expected
# We can also verify that the "submit" dataframe has the same number of rows as "test"

(train.shape, test.shape, submit.shape)

((9999999, 22), (4577464, 22), (4577464, 2))

### Feature Engineering: Date Columns
We have a column ("hour") that encodes year, month, day and hour as YYMMDDHH (e.g. 14102100 is 00:00 on 21 October 2014) for each hour in our 14 day sample. We will convert this to a date-time object, then add feature columns based on the day of the week and the hour of the day. Finally we will remove the hour column. We'll perform the same operation on both our "train" and "test" dataframes.


In [11]:
# function to make day-of-week, hour-of-day features
def make_date_features(dataframe, frame_name="dataframe", date_col="hour"): 
    # the column holds integers like 14102100 (YYMMDDHH); without an explicit format,
    # pd.to_datetime would read them as nanoseconds since 1970-01-01
    date_obj = pd.to_datetime(dataframe[date_col].astype(str), format="%y%m%d%H")
    
    dataframe['hour-of-the-day'] = date_obj.dt.hour
    print(f"Created 'hour-of-the-day' column in {frame_name}")
    
    dataframe['day-of-the-week'] = date_obj.dt.dayofweek
    print(f"Created 'day-of-the-week' column in {frame_name}")



In [12]:
# running "train" and "test" through our date features function to create the engineered columns 
make_date_features(train, "train", "hour")
make_date_features(test, "test", "hour")

Created 'hour-of-the-day' column in train
Created 'day-of-the-week' column in train
Created 'hour-of-the-day' column in test
Created 'day-of-the-week' column in test
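To make the date handling concrete, here is a small sketch on three made-up YYMMDDHH values. It contrasts parsing the raw integers (which pandas interprets as nanoseconds since the Unix epoch) with parsing them as strings under an explicit format. The sample values are illustrative only.

```python
import pandas as pd

hours = pd.Series([14102100, 14102113, 14103023])  # YYMMDDHH integers

# Without a format, integers are read as nanoseconds since the Unix epoch
naive = pd.to_datetime(hours)
# With an explicit format, each value is parsed as year/month/day/hour
parsed = pd.to_datetime(hours.astype(str), format="%y%m%d%H")

features = pd.DataFrame({
    "hour-of-the-day": parsed.dt.hour,
    "day-of-the-week": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(naive.dt.year.unique())
print(features)
```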


In [13]:
# dropping the "hour" column now that we no longer need it
train = train.drop(columns=["hour"])
test = test.drop(columns=["hour"])

### Notes from Adam - Standup July 3

In [14]:
# data generator object from readcsv - ton of cool features - get batches
# hashing vectorizer
# stochastic sgd
# no SVM - no KNN 
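The suggestions in these notes (reading the CSV in batches, hashing, stochastic gradient descent) could be combined roughly as follows. This is only a sketch on a tiny synthetic CSV: the data, the `column=value` token scheme, the chunk size and the choice of `SGDClassifier` with default settings are all assumptions for illustration, not something the notebook ran.

```python
import io
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Tiny synthetic stand-in for the training CSV (column names borrowed from the dataset)
csv_text = "click,device_model,site_id\n" + "".join(
    f"{i % 2},model{i % 3},site{i % 5}\n" for i in range(40)
)

hasher = FeatureHasher(n_features=2**10, input_type="string")
model = SGDClassifier(random_state=0)

rows_seen = 0
# chunksize makes read_csv return an iterator of DataFrames instead of one big frame
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    y = chunk.pop("click")
    # One list of "column=value" tokens per row, so equal values in different columns stay distinct
    tokens = [[f"{col}={val}" for col, val in row.items()]
              for row in chunk.to_dict("records")]
    X = hasher.transform(tokens)
    # partial_fit updates the model one batch at a time; classes must be given up front
    model.partial_fit(X, y, classes=[0, 1])
    rows_seen += len(chunk)

print(rows_seen, X.shape)
```

Because only one chunk is held in memory at a time, this pattern does not depend on the full training file fitting in RAM.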

### Feature Hash Experiment

We have multiple columns with large numbers of categorical values. Let's see if we can featurize them in memory with the FeatureHasher class.

In [15]:
# Looking at the number of unique values in each column of the training dataset
uniques = pd.DataFrame(data=[train.columns.values, [len(train[col].unique()) for col in train.columns], [train[col].dtype for col in train.columns]]).T
uniques = uniques.rename({0:'Column Name', 1:'Unique Values', 2:'Dtype'}, axis=1)
uniques = uniques.sort_values(by='Unique Values', ascending=False).reset_index(drop=True)
uniques

Unnamed: 0,Column Name,Unique Values,Dtype
0,device_ip,2129661,object
1,device_id,786740,object
2,device_model,6863,object
3,app_id,5469,object
4,site_domain,4585,object
5,site_id,3496,object
6,C14,1030,int64
7,app_domain,390,object
8,C17,226,int64
9,C20,168,int64
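A quick back-of-the-envelope calculation shows why a dense one-hot encoding of the largest column is out of reach on an 8GB machine while a sparse representation is not. The row and category counts come from the outputs above; the byte sizes (one byte per dense cell, float64 data and int32 indices for CSR) are assumptions.

```python
# Figures taken from the notebook outputs
n_rows = 9_999_999          # rows in the reduced training set
n_categories = 2_129_661    # unique device_ip values

# Dense one-hot matrix, even at one byte per cell
dense_bytes = n_rows * n_categories

# CSR sparse matrix with one non-zero per row (assumed float64 data, int32 indices)
sparse_bytes = n_rows * 8 + n_rows * 4 + (n_rows + 1) * 4

print(f"dense:  {dense_bytes / 1e12:.1f} TB")
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")
```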


For this notebook we set each hasher's number of features to the number of unique values in the corresponding column, rather than relying on FeatureHasher's default.

In [16]:
# Let's try this out with "device model" first - only about 7k values
# n_features is the documented keyword; alternate_sign=False replaces the removed non_negative=True
fh_1 = FeatureHasher(n_features=int(uniques.iloc[2, 1]), input_type='string', alternate_sign=False) # so we can use NaiveBayes
# input_type='string' expects each sample to be an iterable of strings, so we wrap each value in a list
%time fit = fh_1.fit_transform([v] for v in train.device_model)
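For reference, here is a small sketch of how `FeatureHasher` treats string input. With `input_type='string'`, each sample is expected to be an iterable of string tokens, so a categorical value should be wrapped in a one-element list. If the characters of the value are passed as separate tokens instead, each character is hashed on its own. The device model names below are made up.

```python
from sklearn.feature_extraction import FeatureHasher

values = ["iPhone", "SM-G900F", "iPhone"]  # made-up device models

hasher = FeatureHasher(n_features=16, input_type="string", alternate_sign=False)

# One single-token sample per row: each row hashes to exactly one column
per_value = hasher.transform([[v] for v in values])

# Splitting each value into characters hashes every character as its own token
per_char = hasher.transform([list(v) for v in values])

print(per_value.shape, per_value.nnz)
print(per_char.nnz)
```

Identical values always land in the same column, which is why the first and third rows of `per_value` match.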

In [None]:
print(f"Fit shape: {fit.shape}, Fit non-zeros: {fit.nnz}")
print(f"Non-zero fraction of total: {fit.nnz / (fit.shape[0] * fit.shape[1]):.10f}")

In [None]:
# Now we'll try "device_id" - about 790k values
fh_2 = FeatureHasher(n_features=int(uniques.iloc[1, 1]), input_type='string', alternate_sign=False)
%time fit2 = fh_2.fit_transform([v] for v in train.device_id)

In [None]:
print(f"Fit2 shape: {fit2.shape}, Fit2 non-zeros: {fit2.nnz}")
print(f"Non-zero fraction of total: {fit2.nnz / (fit2.shape[0] * fit2.shape[1]):.10f}")

In [None]:
# Now we'll try "device_ip" - 2MM values in the reduced training set
fh_3 = FeatureHasher(n_features=int(uniques.iloc[0, 1]), input_type='string', alternate_sign=False)
%time fit3 = fh_3.fit_transform([v] for v in train.device_ip)

In [None]:
print(f"Fit3 shape: {fit3.shape}, Fit3 non-zeros: {fit3.nnz}")
print(f"Non-zero fraction of total: {fit3.nnz / (fit3.shape[0] * fit3.shape[1]):.10f}")

In [None]:
# FeatureHasher is stateless, so the same objects can transform test without learning anything from train
fit1_test = fh_1.transform([v] for v in test.device_model)
fit2_test = fh_2.transform([v] for v in test.device_id)
fit3_test = fh_3.transform([v] for v in test.device_ip)

In [None]:
# Verifying that all our shapes are as expected
print((fit.shape, fit1_test.shape))
print((fit2.shape, fit2_test.shape))
print((fit3.shape, fit3_test.shape))

In [None]:
# To use with a numeric series (e.g. C14) we need to convert the values to strings
# Note that casting the series as type "object" won't work - we need each value to be parsed as a string
fh_4 = FeatureHasher(n_features=int(uniques.iloc[6, 1]), input_type='string', alternate_sign=False)
%time fit4 = fh_4.fit_transform([str(v)] for v in train.C14)
%time fit4_test = fh_4.transform([str(v)] for v in test.C14)
print((fit4.shape, fit4_test.shape))
print(fit4.nnz)

### Testing Models on FeatureHash Columns
At this point, we have used FeatureHasher to convert four columns of each dataset into sparse feature matrices, where each train/test pair has identical width (i.e. number of columns). Therefore we are in a position to use these features for training and predicting on a model which takes sparse features. Let's do that for demonstration purposes.

If successful, we can go on to create pipelines for transforming all of our columns and move on to the modeling stage. 

In [None]:
# We'll use the Multinomial Naive Bayes classifier, since each of our features may have a multinomial distribution
nb = MultinomialNB()

In [None]:
# Checking shapes to make sure our matrices are congruent
(fit.shape, fit2.shape, fit3.shape, fit4.shape)

In [None]:
# Assembling one big sparse matrix from the different FeatureHasher outputs
%time train_4 = hstack((fit, fit2, fit3, fit4))

In [None]:
# Scaling our sparse matrix
ss = StandardScaler(with_mean=False) # to maintain sparsity

In [None]:
%time train_4 = ss.fit_transform(train_4)

In [None]:
# Verifying that we have compatible shapes of correct dimensionality
(train_4.shape, y_train.shape)

In [None]:
# Training the model 
%time nb.fit(train_4, y_train)

In [None]:
# Getting predictions for train from our training set
y_hat = nb.predict(train_4)

In [None]:
# Making sure we have the right number of predictions
(y_hat.shape, y_train.shape)

In [None]:
# Scoring our train predictions. They should be close to 1. 
roc_auc_score(y_train, y_hat) 

In [None]:
# Something's wrong - this is the baseline...let's look at log loss (should be under 1)
log_loss(y_train, y_hat)

In [None]:
# This shows that we guessed 0 for each one of our rows.
unique, counts = np.unique(y_hat, return_counts=True)
print(unique, counts)

MultinomialNB didn't work with our sparse matrix. Our model guessed all 0s. Let's try a different model.
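As an aside, `roc_auc_score` and `log_loss` are normally given scores or probabilities (for example from `predict_proba`) rather than hard 0/1 labels. A constant prediction of 0 always yields an AUC of exactly 0.5, so it may be worth rescoring with probabilities before judging the features. The sketch below illustrates the difference on synthetic imbalanced data. The dataset, its class balance and the model settings are assumptions and say nothing about how the click data would score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss

# Synthetic imbalanced data (roughly 1 positive in 6); not the click dataset
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.83], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

all_zero = np.zeros_like(y)             # what "guessing 0 for every row" looks like
proba = clf.predict_proba(X)[:, 1]      # estimated probability of the positive class

print("AUC, constant 0 labels:", roc_auc_score(y, all_zero))
print("AUC, probabilities:    ", roc_auc_score(y, proba))
print("log loss, probabilities:", log_loss(y, proba))
```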

In [None]:
lr = LogisticRegression()

In [None]:
%time lr = lr.fit(train_4, y_train)

In [None]:
y_hat_lr = lr.predict(train_4)

In [None]:
roc_auc_score(y_train, y_hat_lr) 

In [None]:
unique, counts = np.unique(y_hat_lr, return_counts=True)
print(unique, counts)

Same issue. Our models are guessing all 0s - switching from MultinomialNB to LogisticRegression makes no difference.

### Trying Dimensionality Reduction First
As a hypothesis, maybe our matrix is simply too big, with too many features (2.9MM). Let's try reducing dimensionality and see whether that helps us get better results after training new models. Since PCA doesn't accept sparse inputs, we'll try this using the TruncatedSVD class from scikit-learn.

In [None]:
# let's try 25 components
# cf https://arxiv.org/abs/1305.5870
tsvd = TruncatedSVD(n_components=25) #  try algorithm='arpack' to keep from crashing

In [None]:
# fitting our reducer to the full dataset
# NOTE: takes 18 minutes on my local 8GB machine
%time tsvd = tsvd.fit(train_4)

In [None]:
# reducing our "train_4" sparse matrix which has already been scaled
# this crashes the kernel at 1000 and 100 components...
%time train_4_reduced = tsvd.transform(train_4)

In [None]:
# checking shapes before fitting a model
train_4_reduced.shape, y_train.shape

In [None]:
# now we have negative values, so we need one more transform before we can run MultinomialNB
mmx = MinMaxScaler()
%time train_4_reduced = mmx.fit_transform(train_4_reduced)
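The reduce-then-rescale steps above can be tried on a small scale first. The sketch below runs `TruncatedSVD` on a random sparse matrix and then applies `MinMaxScaler` so the result is non-negative, as `MultinomialNB` requires. The matrix size, density and component count are arbitrary stand-ins for the hashed features.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MinMaxScaler

# Small random sparse matrix standing in for the hashed features
X = sparse.random(200, 500, density=0.01, format="csr", random_state=0)

tsvd = TruncatedSVD(n_components=5, random_state=0)
X_reduced = tsvd.fit_transform(X)   # dense ndarray, one row per input row

# SVD components can be negative, so rescale each column to [0, 1]
X_nonneg = MinMaxScaler().fit_transform(X_reduced)

print(X_reduced.shape, X_reduced.min(), X_nonneg.min())
```

Note that the reduced output is dense, so its memory footprint grows with rows times components regardless of how sparse the input was.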

In [None]:
nb2 = MultinomialNB()
%time nb2 = nb2.fit(train_4_reduced, y_train)
%time y_hat_nb2 = nb2.predict(train_4_reduced)
print(roc_auc_score(y_train, y_hat_nb2))
print(log_loss(y_train, y_hat_nb2))

In [None]:
lr2 = LogisticRegression()
%time lr2 = lr2.fit(train_4_reduced, y_train)
%time y_hat_lr2 = lr2.predict(train_4_reduced)
print(roc_auc_score(y_train, y_hat_lr2))
print(log_loss(y_train, y_hat_lr2))

In [None]:
unique, counts = np.unique(y_hat_lr2, return_counts=True)
print(unique, counts)

Same results. Our models are simply guessing 0 for every row despite dimensionality reduction.

### Conclusion
We have shown that while we can use FeatureHasher to generate sparse matrices from our high-cardinality categorical features, the resulting matrices are not presently useful for training (2018-07-05).