# Aggregation with dynamic features
This notebook shows how to aggregate the data while keeping monthly values for certain columns.
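
To make "monthly values" concrete, here is a minimal sketch on toy data. The column names (`date`, `pageviews`) and the pivot-based approach are assumptions for illustration; the actual `aggregate` function in this project may work differently.

```python
import pandas as pd

# Toy visit-level data; column names are illustrative assumptions.
visits = pd.DataFrame({
    "fullVisitorId": ["a", "a", "a", "b"],
    "date": pd.to_datetime(["2016-08-03", "2016-08-20", "2016-09-01", "2016-08-15"]),
    "pageviews": [3, 5, 2, 7],
})

# Bucket each visit into its calendar month, then pivot so that every
# month becomes its own feature column, with one row per visitor.
visits["month"] = visits["date"].dt.to_period("M").astype(str)
monthly = visits.pivot_table(index="fullVisitorId", columns="month",
                             values="pageviews", aggfunc="sum", fill_value=0)
monthly.columns = ["pageviews_" + m for m in monthly.columns]
```

Each visitor ends up with one row and one `pageviews_<month>` column per month, which is why the result is called "wide".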

__Remark:__ Because we now filter out a lot of users who only visited once, this notebook is no longer a pain to run. Don't be afraid to run it: your memory will be sufficient and you'll be done in a couple of minutes.

In [None]:
import json
import datetime
import os
import time
import sys
import shutil
import glob
import re

import pandas as pd
import numpy as np
from sklearn import preprocessing

import matplotlib.pyplot as plt

# make the project modules one directory up importable
sys.path.append('..')
from preprocessing import *
from aggregation import *

## Managing a huge file

Below is the new version of `load`, where processing takes place in chunks. After all chunks have been processed, they are concatenated into a single file. Since many columns are either dropped or aggregated, the resulting dataframe fits in RAM.
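
The chunked pattern described above can be sketched as follows. This is only an illustration: `reduce_in_chunks` and its `keep_columns` parameter are hypothetical and are not the project's `reduce_df`, whose internals are not shown here.

```python
import io
import pandas as pd

def reduce_in_chunks(source, chunksize, keep_columns):
    """Hypothetical helper: read `source` chunk by chunk, keep only
    `keep_columns`, and concatenate the reduced chunks."""
    reduced = []
    for chunk in pd.read_csv(source, chunksize=chunksize,
                             dtype={"fullVisitorId": "str"}):
        # Only the reduced chunk is kept, so peak memory stays small.
        reduced.append(chunk[keep_columns])
    return pd.concat(reduced, ignore_index=True)

# Tiny in-memory CSV standing in for the real file.
csv_text = "fullVisitorId,pageviews,unused\n001,3,x\n002,5,y\n001,2,z\n"
small = reduce_in_chunks(io.StringIO(csv_text), chunksize=2,
                         keep_columns=["fullVisitorId", "pageviews"])
```

Reading `fullVisitorId` as a string preserves leading zeros, which is why the cells in this notebook pass the same `dtype` to `pd.read_csv`.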

In [None]:
# Only run these the first time; afterwards you can just load the reduced datasets.
reduce_df("../data/train_v2.csv", output="../data/reduced_train.csv", nrows=None, chunksize=20000)
reduce_df("../data/test_v2.csv", output="../data/reduced_test.csv", nrows=None, chunksize=20000)

In [None]:
print("Let's widen the train dataset.")
train = pd.read_csv("../data/reduced_train.csv", dtype={'fullVisitorId': 'str'})
wide_train = aggregate(train)
wide_train.to_csv("../data/wide_train.csv", encoding="utf-8", index=False)

print("Let's widen the test dataset.")
test = pd.read_csv("../data/reduced_test.csv", dtype={'fullVisitorId': 'str'})
wide_test = aggregate(test)
wide_test.to_csv("../data/wide_test.csv", encoding="utf-8", index=False)

# Old stuff

The cells below use aggregation functionality developed before the competition's restart. We keep them here in case we need to go back and check our initial ideas, but they will typically not work as well on the new dataset.

### Load and preprocess the data
First we need to load and preprocess the original data.
Note that there might be three additional columns in the new dataset. These need to be preprocessed as well.

The `preprocess_and_save` method now has an argument `drop_users` (default `True`). Set it to `False` if you wish to keep all users.

In [None]:
preprocess_and_save("../data", nrows_train=None, nrows_test=None, start_x_train='2016-08-01', 
                    end_x_train='2016-10-16', start_y_train='2016-12-01', end_y_train='2017-02-01', 
                    start_x_test='2017-08-01', end_x_test='2017-10-16', drop_users=True)

### Aggregating data per customer

We need to predict the target at customer-level, i.e., we predict one value for each customer
in the test set. Our data, however, contains a row for every site visit. One obvious way to deal
with this discrepancy is to aggregate the visit data by customer, to obtain features on the
customer-level.

The _preprocessing.py_ file contains functions to do so.
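
As a rough idea of what per-customer aggregation looks like, here is a sketch on toy data using pandas named aggregation. The columns and the chosen statistics are assumptions for illustration; `aggregate_data_per_customer` may compute different features.

```python
import pandas as pd

# Toy visit-level data: one row per site visit.
visits = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b"],
    "pageviews": [3, 5, 7],
    "hits": [4, 6, 9],
})

# Collapse to one row per customer, with several summary features.
per_customer = visits.groupby("fullVisitorId").agg(
    pageviews_sum=("pageviews", "sum"),
    pageviews_mean=("pageviews", "mean"),
    hits_max=("hits", "max"),
    visit_count=("pageviews", "size"),
)
```

The result is indexed by `fullVisitorId`, so it can be aligned with a target that is aggregated on the same key.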

In [None]:
data_dir = "../data/"
x_train, y_train, x_test = load_train_test_dataframes(data_dir, nrows_train=None, nrows_test=None)

In [None]:
print("This only takes a few seconds now.")
x_train_aggregated = aggregate_data_per_customer(x_train, startdate_y='2016-12-01', startdate_x='2016-08-01')
print("Train data is aggregated")
print("Aggregating test data.")
x_test_aggregated = aggregate_data_per_customer(x_test, startdate_y='2017-12-01', startdate_x='2017-08-01')
print("Test data is aggregated")
y_train_aggregated = y_train.groupby(['fullVisitorId'])[['target']].sum()
print("Train target data is aggregated")

This data should be close to what we need to start fitting models. We tried to keep as much information as possible, so from here it is of course still possible to remove features that seem unnecessary or do other dimensionality reduction. The good thing is that we don't have to do aggregation every time, so let's save the results:

In [None]:
print("Saving files")
x_train_aggregated.to_csv(os.path.join(data_dir, "aggregated_x_train.csv"), index=True)
x_test_aggregated.to_csv(os.path.join(data_dir, "aggregated_x_test.csv"), index=True)
y_train_aggregated.to_csv(os.path.join(data_dir, "aggregated_y_train.csv"), index=True)
print("Aggregated data saved")

### Our first attempt!
Let's see if we can fit a model on this data.

Note that due to the one-hot encoding, the columns of train and test are not the same. For this experiment, we only keep the intersection of columns. There are also other ways to deal with this (e.g., by mapping categories to external data), so we don't do this in the aggregation step.
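
The mismatch is easy to reproduce on toy data. In the sketch below, `intersect_columns` is a hypothetical stand-in written for illustration; it is not the project's `keep_intersection_of_columns`, which may behave differently.

```python
import pandas as pd

def intersect_columns(left, right):
    """Hypothetical stand-in: keep only the columns present in both frames,
    in the order they appear in `left`."""
    shared = [c for c in left.columns if c in right.columns]
    return left[shared], right[shared]

# One-hot encoding yields different columns when the categories differ.
train = pd.get_dummies(pd.DataFrame({"browser": ["Chrome", "Safari"], "hits": [1, 2]}))
test = pd.get_dummies(pd.DataFrame({"browser": ["Chrome", "Opera"], "hits": [3, 4]}))

train_shared, test_shared = intersect_columns(train, test)
```

Here `browser_Safari` and `browser_Opera` are dropped because each exists in only one of the two frames.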

In [None]:
# just for illustration, so let's keep it simple
from sklearn import linear_model

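# create train and test sets and labels excluding visitor ID
# (the aggregation cells above name these frames x_train_aggregated / x_test_aggregated)
x_train, x_test = keep_intersection_of_columns(x_train_aggregated.reset_index(drop=True),
                                               x_test_aggregated.reset_index(drop=True))
# assumption: both frames are indexed by fullVisitorId; visitors without a target row get 0
y_train = np.log1p(y_train_aggregated["target"].reindex(x_train_aggregated.index).fillna(0).to_numpy())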

In [None]:
# set NaNs to zero and fit linear model
x_train = x_train.fillna(0)
x_test = x_test.fillna(0)

lm = linear_model.LinearRegression()
lm.fit(x_train, y_train)
r_squared = lm.score(x_train, y_train)
print("The model has an R^2 of {}.".format(r_squared))

In [None]:
# predict and create a submission
predictions = lm.predict(x_test)
# x_test_aggregated is the aggregated test frame created above, indexed by fullVisitorId
submission = pd.concat([x_test_aggregated.reset_index()["fullVisitorId"], pd.Series(predictions)], axis=1)
submission.columns = ["fullVisitorId", "PredictedLogRevenue"]

# clip negative predictions to zero, then zero out predicted log revenues below 1
submission["PredictedLogRevenue"] = np.maximum(0, submission["PredictedLogRevenue"])
submission.loc[submission["PredictedLogRevenue"] < 1, "PredictedLogRevenue"] = 0
submission.head()

In [None]:
submission.to_csv("first_submission.csv", index=False)