## Feature Engineering With Reduced Dataset and Class Rebalancing

(A revised version of this notebook in which we let FeatureHasher use its default `n_features` instead of setting it directly)

(We also add class rebalancing to make a 50/50 ratio)

Feature engineering for categorical variables (e.g. one-hot encoding) is difficult with our full 6GB dataset on a local computer with 8GB of RAM. In this notebook we therefore load ~25% of our training data (~10MM rows) to explore alternative techniques such as feature hashing.

### Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.utils import resample
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher 
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost
from scipy.sparse import hstack
from sklearn.metrics import classification_report, log_loss, roc_auc_score, roc_curve
from sklearn.decomposition import TruncatedSVD


### Loading Test Data and Sample Submission

Loading in our full test file as well as the sample submission CSV

In [2]:
# loading in test data as well as the Sample Submission file
sample = pd.read_csv('../assets/sampleSubmission')
test = pd.read_csv('../assets/test')

### Loading Partial Training Data

For this notebook we load only the first of the five pieces of our training dataset, about 25% of the full data (~1.5GB, ~10MM rows). This lets us experiment with feature engineering on a local device with 8GB of memory. Running the same operations against the full set may require a different solution (e.g. AWS SageMaker or a Databricks Spark cluster).

In [3]:
# Because of GitHub space limits (no files over 2GB), train data file was split into 5 pieces

# Loading the first file with header row to use for column names
%time train = pd.read_csv("../assets/trainaa")

# Checking the columns present
train.head(1)

CPU times: user 40.2 s, sys: 9.67 s, total: 49.8 s
Wall time: 51.3 s


Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1.000009e+18,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,2,15706,320,50,1722,0,35,-1,79


### Addressing Class Imbalance
The minority class (click = 1) makes up only about 17% of this reduced dataset. Let's see whether changing this balance results in better model performance.

In [4]:
train.click.value_counts(normalize=True)

0    0.833864
1    0.166136
Name: click, dtype: float64

We will downsample the majority class to be the same size as the minority class in order to test the models

In [5]:
# Splitting training dataset into minority and majority classes so we can downsample majority class
train_0 = train[train.click == 0]
train_1 = train[train.click == 1]
(train_0.shape, train_1.shape)

((8338642, 24), (1661357, 24))

In [6]:
# Reducing the majority class to be the same size as the minority class
%time train_0_reduced = resample(train_0, replace=False, n_samples=train_1.shape[0], random_state=99)

CPU times: user 3.37 s, sys: 1.8 s, total: 5.17 s
Wall time: 5.37 s


In [7]:
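# Combining the reduced majority class with our minority class to make a new training set
# (pd.concat is used because DataFrame.append is no longer available in recent pandas versions)
train = pd.concat([train_0_reduced, train_1])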

In [8]:
# We have reduced down to about 1/3 of our original dataset size
train.shape

(3322714, 24)

In [9]:
# Now we can see that each class accounts for about 50% of the rows in our dataset
train.click.value_counts(normalize=True)

1    0.5
0    0.5
Name: click, dtype: float64

Now we have a 50/50 balance between classes in our training set. Let's see whether we get different results in prediction.
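As a minimal sketch of the same downsampling steps on a toy frame (the `toy` data and all variable names here are hypothetical, not taken from the competition data), with `pd.concat` used to recombine the two classes and a final shuffle so the classes are not stored in two contiguous blocks:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 8 majority rows (click=0), 2 minority rows (click=1)
toy = pd.DataFrame({"click": [0] * 8 + [1] * 2, "x": range(10)})

majority = toy[toy.click == 0]
minority = toy[toy.click == 1]

# Sample the majority class without replacement down to the minority size
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=99
)

# Recombine and shuffle the rows
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=99)

print(balanced.click.value_counts(normalize=True))
```

Shuffling is optional for the models used later in this notebook, but it matters for any learner that processes rows in order.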

### Making Y Train
We'll use the "click" column in the training set as our target or "y" series for training purposes, dropping it from our training dataframe.  We will also remove the "id" column, which does not add value for training purposes.


In [10]:
# Creating a y_train with our click information
y_train = train.click

# From the Kaggle site, we know 'id' is just a record indicator, so we can drop it
# along with the "click" column, which is our target
train = train.drop(columns=['click', 'id'])


In [11]:
# Checking our shapes. We see that "test" has an extra column, because it still has "id". We'll drop this column next 
# after using it to prepare a submission dataframe, where we'll put our predictions
(train.shape, test.shape, y_train.shape)

((3322714, 22), (4577464, 23), (3322714,))

In [12]:
# Checking class balance in y_train
y_train.value_counts(normalize=True)

1    0.5
0    0.5
Name: click, dtype: float64

### Prepping a Submission Dataframe
We take the "id" column and index from the "test" dataframe and add a "click" column, following the model of the sample submission file.  For now the click column is filled with zeros; later as we generate model predictions we will place them here, before saving this dataframe out as a csv to upload to Kaggle.


In [13]:
# Checking the columns of our submission sample
sample.head(1)

Unnamed: 0,id,click
0,10000174058809263569,0.5


In [14]:
# Checking the datatypes of our submission sample
sample.dtypes

id        uint64
click    float64
dtype: object

In [15]:
# Creating the submission dataframe and conforming types
submit = pd.DataFrame(test.id, index=test.index)
submit['click'] = 0.0
submit['id'] = submit['id'].astype(np.uint64)

# Verifying that our new submission has the correct datatypes
submit.dtypes

id        uint64
click    float64
dtype: object

In [16]:
# Verifying that our new submission has the correct columns
submit.head(1)

Unnamed: 0,id,click
0,10000174058809264128,0.0
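Note that the id printed here ends in ...264128, while the first id in the sample file ends in ...263569. Assuming both rows refer to the same record, one plausible explanation is that `id` was parsed as float64 when `test` was loaded and lost its low-order digits before the cast to `uint64`. The sketch below illustrates the effect on a one-row stand-in CSV (the `csv_text` content is hypothetical apart from the id itself):

```python
import io
import pandas as pd

exact_id = 10000174058809263569  # first id shown in the sample submission

# float64 has a 53-bit significand, so an integer this large gets rounded
via_float = int(float(exact_id))
print(via_float)  # prints 10000174058809264128

# Reading the column explicitly as uint64 keeps every digit
csv_text = f"id,click\n{exact_id},0.5\n"
kept = pd.read_csv(io.StringIO(csv_text), dtype={"id": "uint64"})
print(int(kept["id"].iloc[0]) == exact_id)
```

If this is what happened, passing `dtype={"id": "uint64"}` (or reading the id as a string) when loading `test` would likely keep the submission ids exact.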


In [17]:
# Dropping the id column from test
test = test.drop(columns="id")

# Now we can verify that "train" and "test" have the same number of columns, as expected
# We can also verify that the "submit" dataframe has the same number of rows as "test"

(train.shape, test.shape, submit.shape)

((3322714, 22), (4577464, 22), (4577464, 2))

### Feature Engineering: Date Columns
We have a column ("hour") that encodes the year, month, day and hour of each record as an integer in YYMMDDHH form (for example, 14102100 is 21 October 2014 at 00:00) for each hour in our 14 day sample. We will convert this to a date-time object, then add feature columns based on the day of the week and the hour of the day.  Finally we will remove the hour column. We'll perform the same operation on both our "train" and "test" dataframes.


In [18]:
# function to make day-of-week, hour-of-day features
def make_date_features(dataframe, frame_name="dataframe", date_col="hour"): 
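    # The column holds integers in YYMMDDHH form; parse them with an explicit format,
    # since pd.to_datetime treats bare integers as nanoseconds since the epoch
    date_obj = pd.to_datetime(dataframe[date_col].astype(str), format="%y%m%d%H")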
    
    dataframe['hour-of-the-day'] = date_obj.dt.hour
    print(f"Created 'hour-of-the-day' column in {frame_name}")
    
    dataframe['day-of-the-week'] = date_obj.dt.dayofweek
    print(f"Created 'day-of-the-week' column in {frame_name}")



In [19]:
# running "train" and "test" through our date features function to create the engineered columns 
make_date_features(train, "train", "hour")
make_date_features(test, "test", "hour")

Created 'hour-of-the-day' column in train
Created 'day-of-the-week' column in train
Created 'hour-of-the-day' column in test
Created 'day-of-the-week' column in test


In [20]:
# dropping the "hour" column now that we no longer need it
train = train.drop(columns=["hour"])
test = test.drop(columns=["hour"])
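A quick check of how YYMMDDHH integers parse, on three hypothetical values: without an explicit `format`, pandas reads integers as nanoseconds since the Unix epoch, so every row lands in 1970 and the derived features are constant; with `format="%y%m%d%H"` the hour and weekday come out as expected.

```python
import pandas as pd

hours = pd.Series([14102100, 14102113, 14103023])  # hypothetical YYMMDDHH values

# No format: integers are interpreted as nanoseconds since 1970-01-01
naive = pd.to_datetime(hours)

# Explicit format: two-digit year, month, day, hour
parsed = pd.to_datetime(hours.astype(str), format="%y%m%d%H")

print(naive.dt.year.tolist())        # [1970, 1970, 1970]
print(parsed.dt.hour.tolist())       # [0, 13, 23]
print(parsed.dt.dayofweek.tolist())  # [1, 1, 3] (Monday is 0)
```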

### Notes from Adam - Standup July 3

In [21]:
# data generator object from readcsv - ton of cool features - get batches
# hashing vectorizer
# stochastic sgd
# no SVM - no KNN 

### Feature Hash Experiment

We have multiple columns with large numbers of categorical values. Let's see if we can featurize them in memory with the FeatureHasher class.

In [22]:
# Looking at the number of unique values in each column of the training dataset
uniques = pd.DataFrame(data=[train.columns.values, [len(train[col].unique()) for col in train.columns], [train[col].dtype for col in train.columns]]).T
uniques = uniques.rename({0:'Column Name', 1:'Unique Values', 2:'Dtype'}, axis=1)
uniques = uniques.sort_values(by='Unique Values', ascending=False).reset_index(drop=True)
uniques

Unnamed: 0,Column Name,Unique Values,Dtype
0,device_ip,1128197,object
1,device_id,361531,object
2,device_model,6043,object
3,app_id,4252,object
4,site_domain,3464,object
5,site_id,2964,object
6,C14,1002,int64
7,app_domain,299,object
8,C17,222,int64
9,C20,166,int64


For this notebook we let FeatureHasher use its default number of features (2**20 = 1,048,576, as the output shapes below show) rather than setting it ourselves.

In [23]:
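# Let's try this out with "device_model" first - only about 6k unique values
# non_negative=True keeps the hashed values >= 0 so we can use Naive Bayes
# (newer scikit-learn versions drop this argument in favor of alternate_sign=False)
fh_1 = FeatureHasher(input_type='string', non_negative=True)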
%time fit = fh_1.fit_transform(train.device_model)



CPU times: user 9.52 s, sys: 282 ms, total: 9.8 s
Wall time: 9.81 s


In [24]:
print(f"Fit shape: {fit.shape}, Fit non-nulls: {fit.nnz}")
print(f"Non-null fraction of total: {'{:.10f}'.format(fit.nnz/(fit.shape[0] * fit.shape[1]))}")

Fit shape: (3322714, 1048576), Fit non-nulls: 21568819
Non-null fraction of total: 0.0000061906
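The non-null count above works out to roughly 6.5 entries per row, even though each row holds a single `device_model` value. With `input_type='string'`, FeatureHasher expects each sample to be an iterable of strings, so a bare string may be consumed character by character (and recent scikit-learn versions, as far as we know, reject bare strings outright). The sketch below, on two hypothetical values, contrasts hashing characters with hashing whole values wrapped in one-element lists. It uses `alternate_sign=False` as a stand-in for `non_negative=True`.

```python
from sklearn.feature_extraction import FeatureHasher

values = ["1f0bc64f", "d787e91b"]  # hypothetical device_model strings

hasher = FeatureHasher(input_type="string", alternate_sign=False)

# Each sample is a list of single characters (what a bare string iterates to)
per_char = hasher.transform([list(v) for v in values])

# Each sample is a one-element list holding the whole value
per_value = hasher.transform([[v] for v in values])

print(per_char.shape, per_char.nnz)    # several non-zeros per row
print(per_value.shape, per_value.nnz)  # exactly one non-zero per row
```

If the character-level reading is what happened here, the hashed columns mostly encode which hex characters appear in each id rather than the id itself, which could help explain the weak scores later on.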


In [25]:
# Now we'll try "device_id" - about 360k unique values in the reduced training set
fh_2 = FeatureHasher(input_type='string', non_negative=True)
%time fit2 = fh_2.fit_transform(train.device_id)



CPU times: user 9.29 s, sys: 268 ms, total: 9.55 s
Wall time: 9.56 s


In [26]:
print(f"Fit2 shape: {fit2.shape}, Fit2 non-nulls: {fit2.nnz}")
print(f"Non-null fraction of total: {'{:.10f}'.format(fit2.nnz/(fit2.shape[0] * fit2.shape[1]))}")

Fit2 shape: (3322714, 1048576), Fit2 non-nulls: 20220364
Non-null fraction of total: 0.0000058036


In [27]:
# Now we'll try "device_ip" - about 1.1MM unique values in the reduced training set
fh_3 = FeatureHasher(input_type='string', non_negative=True)
%time fit3 = fh_3.fit_transform(train.device_ip)



CPU times: user 10 s, sys: 371 ms, total: 10.4 s
Wall time: 10.4 s


In [28]:
print(f"Fit3 shape: {fit3.shape}, Fit3 non-nulls: {fit3.nnz}")
print(f"Non-null fraction of total: {'{:.10f}'.format(fit3.nnz/(fit3.shape[0] * fit3.shape[1]))}")

Fit3 shape: (3322714, 1048576), Fit3 non-nulls: 21415816
Non-null fraction of total: 0.0000061467


In [29]:
# FeatureHasher is stateless (fitting learns nothing from train), so the same objects can transform test directly
fit1_test = fh_1.transform(test.device_model)
fit2_test = fh_2.transform(test.device_id)
fit3_test = fh_3.transform(test.device_ip)

In [30]:
# Verifying that all our shapes are as expected
print((fit.shape, fit1_test.shape))
print((fit2.shape, fit2_test.shape))
print((fit3.shape, fit3_test.shape))

((3322714, 1048576), (4577464, 1048576))
((3322714, 1048576), (4577464, 1048576))
((3322714, 1048576), (4577464, 1048576))
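The matching widths follow from how hashing works: the column index depends only on the hash of each token and on `n_features`, not on anything learned from the training data. A small sketch with hypothetical tokens and a deliberately small `n_features` shows that a "fitted" hasher and a fresh one produce identical output:

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

samples = [["a1b2"], ["c3d4"], ["a1b2"]]  # hypothetical one-token samples

fitted = FeatureHasher(n_features=16, input_type="string", alternate_sign=False)
fitted.fit_transform([["unrelated"], ["values"]])  # "fit" on other data

fresh = FeatureHasher(n_features=16, input_type="string", alternate_sign=False)

out_fitted = fitted.transform(samples).toarray()
out_fresh = fresh.transform(samples).toarray()

print(np.array_equal(out_fitted, out_fresh))  # True
```

This also means values that appear only in test are hashed without error, unlike with a fitted one-hot encoder.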


In [31]:
# To use with a numeric series (eg C14) we need to convert the values to strings
# Note that casting the series as type "object" won't work - we need each value to be parsed as a string
fh_4 = FeatureHasher(input_type='string', non_negative=True)
%time fit4 = fh_4.fit_transform(train.C14.map(lambda x: str(x)))
%time fit4_test = fh_4.transform(test.C14.map(lambda x: str(x)))
print((fit4.shape, fit4_test.shape))
print(fit4.nnz)



CPU times: user 7.64 s, sys: 361 ms, total: 8 s
Wall time: 8.02 s
CPU times: user 11 s, sys: 775 ms, total: 11.8 s
Wall time: 11.9 s
((3322714, 1048576), (4577464, 1048576))
13914168


### Testing Models on FeatureHash Columns
At this point, we have used FeatureHasher to convert four columns of each dataset into sparse feature matrices of identical width (i.e. number of columns). We are therefore in a position to use these features for training and predicting with a model that accepts sparse input. Let's do that for demonstration purposes.

If successful, we can go on to create pipelines for transforming all of our columns and move on to the modeling stage. 

In [32]:
# Checking shapes to make sure our matrices are congruent
(fit.shape, fit2.shape, fit3.shape, fit4.shape)

((3322714, 1048576),
 (3322714, 1048576),
 (3322714, 1048576),
 (3322714, 1048576))

In [33]:
# Assembling one big sparse matrix from the different FeatureHasher outputs
%time train_hashed = hstack((fit, fit2, fit3, fit4))

CPU times: user 2.52 s, sys: 3.62 s, total: 6.14 s
Wall time: 6.91 s


In [34]:
# Scaling our sparse matrix
ss = StandardScaler(with_mean=False) # to maintain sparsity

In [35]:
%time train_hashed = ss.fit_transform(train_hashed)

CPU times: user 3.77 s, sys: 3.84 s, total: 7.61 s
Wall time: 8.33 s


In [36]:
# Verifying that we have compatible shapes of correct dimensionality
(train_hashed.shape, y_train.shape)

((3322714, 4194304), (3322714,))
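The width of 4,194,304 is simply 4 x 1,048,576, since `hstack` places the four hashed blocks side by side. A toy sketch of the same two steps (the matrices `a` and `b` are hypothetical) shows that stacking adds the widths and that `StandardScaler(with_mean=False)` rescales values without filling in the zeros:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.preprocessing import StandardScaler

# Two small sparse blocks with the same number of rows
a = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))
b = csr_matrix(np.array([[0.0, 4.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 6.0]]))

stacked = hstack((a, b)).tocsr()  # widths add: 2 + 3 = 5 columns

# with_mean=False divides by the column standard deviation only,
# so zero entries stay zero and the matrix stays sparse
scaled = StandardScaler(with_mean=False).fit_transform(stacked)

print(stacked.shape, stacked.nnz, scaled.nnz, issparse(scaled))
```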

In [37]:
# We'll use the Multinomial Naive Bayes classifier, since it accepts sparse input and our hashed features are non-negative, count-like values
nb = MultinomialNB()

In [38]:
# Training the model 
%time nb.fit(train_hashed, y_train)

CPU times: user 1.06 s, sys: 179 ms, total: 1.23 s
Wall time: 1.25 s


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [39]:
# Getting predictions on the training set itself
y_hat = nb.predict(train_hashed)

In [40]:
# Making sure we have the right number of predictions
(y_hat.shape, y_train.shape)

((3322714,), (3322714,))

In [41]:
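# Scoring our train predictions with ROC AUC (1.0 is perfect, 0.5 is chance level)
# Note: y_hat holds hard class labels here rather than predicted probabilities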
roc_auc_score(y_train, y_hat) 

0.5632392676589077
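ROC AUC measures how well the model ranks positives above negatives, so it is usually computed from scores or probabilities; hard 0/1 labels discard most of the ranking information. The toy example below (hypothetical `y_true` and `scores`) shows the two inputs giving different values. In this notebook the probability version would presumably be `nb.predict_proba(train_hashed)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.10, 0.45, 0.40, 0.49]         # hypothetical predicted P(click)
labels = [int(s >= 0.5) for s in scores]  # hard predictions: all zeros here

auc_scores = roc_auc_score(y_true, scores)  # 3 of 4 pos/neg pairs ranked correctly
auc_labels = roc_auc_score(y_true, labels)  # constant predictions, no ranking

print(auc_scores, auc_labels)  # 0.75 0.5
```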

In [42]:
log_loss(y_train, nb.predict_proba(train_hashed))

0.6895454334215533
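For context, on a balanced dataset an uninformed model that predicts 0.5 for every row has a log loss of ln 2, about 0.693, so the value above is only marginally better than that reference point. A small check on hypothetical balanced labels:

```python
import math
from sklearn.metrics import log_loss

y_balanced = [0, 1] * 50                    # 50/50 class balance
uninformed = [[0.5, 0.5]] * len(y_balanced)  # P(0) = P(1) = 0.5 for every row

baseline = log_loss(y_balanced, uninformed)
print(baseline, math.log(2))
```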

In [43]:
# Counting how many rows were predicted in each class (both classes are predicted)
unique, counts = np.unique(y_hat, return_counts=True)
print(unique, counts)

[0 1] [1751384 1571330]


MultinomialNB did make guesses in both classes, but did not achieve very good results. Let's try a different Naive Bayes model.

In [44]:
bnb = BernoulliNB()
bnb = bnb.fit(train_hashed, y_train)
y_hat_bnb = bnb.predict(train_hashed)
print(roc_auc_score(y_train, y_hat_bnb))
unique, counts = np.unique(y_hat_bnb, return_counts=True)
print(unique, counts)

0.5666945755788791
[0 1] [1766735 1555979]


Now let's try Logistic Regression

In [None]:
lr = LogisticRegression()
%time lr = lr.fit(train_hashed, y_train)
y_hat_lr = lr.predict(train_hashed)
print(roc_auc_score(y_train, y_hat_lr)) 
unique, counts = np.unique(y_hat_lr, return_counts=True)
print(unique, counts)

Similar results, and we are now predicting both classes, but we're not doing much better than the baseline (.5).

Let's try two more models on the same data set.

In [None]:
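# The xgboost module itself is not callable; the scikit-learn style classifier is XGBClassifier
xg = xgboost.XGBClassifier()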
%time xg = xg.fit(train_hashed, y_train)
y_hat_xg = xg.predict(train_hashed)
print(roc_auc_score(y_train, y_hat_xg)) 
unique, counts = np.unique(y_hat_xg, return_counts=True)
print(unique, counts)

In [None]:
rf = RandomForestClassifier()
%time rf = rf.fit(train_hashed, y_train)
y_hat_rf = rf.predict(train_hashed)
print(roc_auc_score(y_train, y_hat_rf)) 
unique, counts = np.unique(y_hat_rf, return_counts=True)
print(unique, counts)

### Conclusion
With undersampling of the majority class, we are able to see some positive impact on predictions from the FeatureHasher columns. However, it is fairly minimal: the two Naive Bayes models reach a training-set ROC AUC of about 0.56 to 0.57, roughly +0.06 over the 0.5 baseline.