# Aggregating data per customer

We need to predict the target at the customer level, i.e., we predict one value for each customer
in the test set. Our data, however, contains a row for every site visit. One obvious way to deal
with this discrepancy is to aggregate the visit data by customer, to obtain features at the
customer level.

The functions that do so are imported from _aggregation.py_ below.
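To make the idea concrete without relying on the project helpers, here is a minimal sketch of per-customer aggregation with a plain pandas `groupby`. The toy `visits` frame and the column names `pageviews` and `target` are assumptions made for illustration; this is not the implementation of `aggregate_data_per_customer`.

```python
import pandas as pd

# Hypothetical visit-level data: one row per site visit.
visits = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c", "c", "c"],
    "pageviews": [3, 5, 1, 2, 2, 8],
    "target": [0.0, 20.0, 0.0, 0.0, 5.0, 10.0],
})

# Collapse to one row per customer using named aggregations.
per_customer = visits.groupby("fullVisitorId").agg(
    visit_count=("pageviews", "size"),
    pageviews_mean=("pageviews", "mean"),
    target_sum=("target", "sum"),
)

print(per_customer)
```

The result is indexed by `fullVisitorId` and has one row per customer, which is the shape needed for a customer-level prediction.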

In [None]:
import numpy as np
import pandas as pd

import os
import sys
sys.path.append('..')
from preprocessing import keep_intersection_of_columns
from aggregation import (
    load_train_test_dataframes,
    aggregate_data_per_customer,
)

data_dir = "../data/"
train, test = load_train_test_dataframes(data_dir, nrows_train=50000, nrows_test=10000)

In [None]:
print("Grab a coffee, your train data will be aggregated in 5 minutes.")
aggregated_train = aggregate_data_per_customer(train)
aggregated_train.head()

This data should be close to what we need to start fitting models. We tried to keep as much information as possible, so from here you can still remove features that seem unnecessary or apply other dimensionality reduction. The good thing is that we don't have to redo the aggregation every time, so let's save the results:

In [None]:
print("Saving file, this takes a while. Like, a lot longer than you hope.")
aggregated_train.to_csv(os.path.join(data_dir, "aggregated_train.csv"), index=True)
print("Train data saved")

In [None]:
# repeat for test set
aggregated_test = aggregate_data_per_customer(test)
print("Saving test file...")
aggregated_test.to_csv(os.path.join(data_dir, "aggregated_test.csv"), index=True)
print("All set and ready to start modeling!")

### Our first attempt!
Let's see if we can fit a model on this data.

In [None]:
# if already saved:
# aggregated_train = pd.read_csv(os.path.join(data_dir, "aggregated_train.csv"), dtype={"fullVisitorId": str})
# aggregated_test = pd.read_csv(os.path.join(data_dir, "aggregated_test.csv"), dtype={"fullVisitorId": str})

Note that due to the one-hot encoding, the columns of train and test are not the same. For this experiment, we only keep the intersection of columns. There are other ways to deal with this (e.g., by mapping categories to external data), which is why we don't do it in the aggregation step.
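As an illustration of why the columns differ and what keeping the intersection means, here is a small sketch using `pd.get_dummies` on made-up data. The `browser` column and its values are hypothetical, and the list comprehension is only one possible way to intersect columns; it is not necessarily how `keep_intersection_of_columns` works.

```python
import pandas as pd

# Hypothetical raw categorical columns for train and test.
train_raw = pd.DataFrame({"browser": ["Chrome", "Firefox", "Safari"]})
test_raw = pd.DataFrame({"browser": ["Chrome", "Edge"]})

# One-hot encoding creates one column per category seen in each frame,
# so categories that occur in only one set produce unmatched columns.
train_encoded = pd.get_dummies(train_raw)
test_encoded = pd.get_dummies(test_raw)

# Keep only the columns present in both frames, in train's column order.
shared_columns = [c for c in train_encoded.columns if c in test_encoded.columns]
train_shared = train_encoded[shared_columns]
test_shared = test_encoded[shared_columns]

print(shared_columns)
```

Intersecting is simple but discards categories that occur in only one of the two sets, which is the trade-off the alternatives mentioned above try to avoid.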

In [None]:
# just for illustration, so let's keep it simple
from sklearn import linear_model

# create train and test sets and labels excluding visitor ID
x_train, x_test = keep_intersection_of_columns(aggregated_train.reset_index(drop=True),
                                               aggregated_test.reset_index(drop=True))
y_train = np.log1p(aggregated_train.reset_index(drop=True)["target_sum"])  # log(1 + target_sum)

In [None]:
# set NaNs to zero and fit linear model
x_train = x_train.fillna(0)
x_test = x_test.fillna(0)

lm = linear_model.LinearRegression()
lm.fit(x_train, y_train)
# note: this R^2 is computed on the training data, so it is an optimistic estimate
r_squared = lm.score(x_train, y_train)
print("The model has a training R^2 of {}.".format(r_squared))

In [None]:
# predict and create a submission
predictions = lm.predict(x_test)
submission = pd.concat([aggregated_test.reset_index()["fullVisitorId"], pd.Series(predictions)], axis=1)
submission.columns = ["fullVisitorId", "PredictedLogRevenue"]

# set predictions below 1 on the log scale to zero (this also removes negative predictions)
submission.loc[submission["PredictedLogRevenue"] < 1, "PredictedLogRevenue"] = 0
submission.head()
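The target was transformed with log(1 + x) before fitting, so the predictions and the cutoff applied to them live on the log scale. The following sketch, on made-up revenue values, shows the transform, its inverse, and what a cutoff of 1 on the log scale corresponds to in the units of `target_sum`.

```python
import numpy as np

# Hypothetical revenue values in the original units.
revenue = np.array([0.0, 1.0, 10.0])

# log1p(x) computes log(1 + x), the same transform applied to the target.
log_revenue = np.log1p(revenue)

# expm1 is the inverse transform, so the round trip recovers the input.
recovered = np.expm1(log_revenue)

# A cutoff of 1 on the log scale corresponds to e - 1 on the original scale.
cutoff_original_scale = np.expm1(1.0)
print(cutoff_original_scale)
```

In other words, zeroing predictions below 1 on the log scale removes everything below roughly 1.72 in the original units.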

In [None]:
submission.to_csv("first_submission.csv", index=False)