# Aggregation with dynamic features
This notebook shows how to aggregate the data while keeping monthly values for certain columns.

__Remark:__ Because we now filter out a lot of users who only visited once, this notebook is no longer painful to run. Don't be afraid to run it: your memory will be sufficient and you'll be done in a couple of minutes.
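
To make "monthly values" concrete, here is a minimal sketch on toy data. It is not the code from our `aggregation` module: the helper `monthly_aggregate`, the `pageviews` column and the toy dates are assumptions chosen for illustration. The idea is to end up with one row per visitor and one column per calendar month.

```python
import pandas as pd

# Toy session-level data; the column names are illustrative assumptions.
sessions = pd.DataFrame({
    "fullVisitorId": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(
        ["2017-01-05", "2017-01-20", "2017-02-03", "2017-01-11", "2017-03-02"]
    ),
    "pageviews": [3, 5, 2, 7, 1],
})

def monthly_aggregate(df, value_col, id_col="fullVisitorId", date_col="date"):
    """Hypothetical helper: one row per visitor, one column per calendar month."""
    # Label every session with its calendar month, e.g. "2017-01"
    month = df[date_col].dt.to_period("M").astype(str)
    wide = df.assign(month=month).pivot_table(
        index=id_col, columns="month", values=value_col,
        aggfunc="sum", fill_value=0,
    )
    # Flatten the month labels into plain column names
    wide.columns = [f"{value_col}_{m}" for m in wide.columns]
    return wide.reset_index()

monthly = monthly_aggregate(sessions, "pageviews")
print(monthly)
```

Months in which a visitor had no sessions are filled with zero here; whether zero or NaN is the better choice depends on the model that consumes the features.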

In [None]:
import json
import datetime
import os
import time
import sys
import shutil
import glob
import re

import pandas as pd
import numpy as np
from sklearn import preprocessing

import matplotlib.pyplot as plt

sys.path.append('..')
from preprocessing import *
from aggregation import *

## Managing a huge file

Below is the new version of `load`, which processes the data in chunks. After all chunks have been processed, they are concatenated into a single file. Since many columns are either dropped or aggregated, the resulting dataframe fits in RAM.
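
The chunked pattern described above can be sketched roughly as follows. This is not the actual `reduce_df` from our `preprocessing` module: `reduce_in_chunks`, its `keep_cols` parameter and the toy CSV are assumptions, and the per-chunk reduction is simplified to dropping columns.

```python
import io
import pandas as pd

def reduce_in_chunks(source, chunksize, keep_cols):
    """Hypothetical sketch: read in chunks, shrink each chunk, then concatenate."""
    reduced = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    for chunk in pd.read_csv(source, chunksize=chunksize,
                             dtype={"fullVisitorId": str}):
        # Only the reduced chunk is kept, so the full file is never in memory
        reduced.append(chunk[keep_cols])
    return pd.concat(reduced, ignore_index=True)

# Toy stand-in for the large CSV file
raw = io.StringIO(
    "fullVisitorId,pageviews,unused\n"
    "001,3,x\n002,5,y\n003,2,z\n004,7,w\n005,1,v\n"
)
small = reduce_in_chunks(raw, chunksize=2, keep_cols=["fullVisitorId", "pageviews"])
print(small.shape)
```

Reading `fullVisitorId` as a string is a choice made for this sketch, so that leading zeros in the ids survive.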

In [None]:
# Only run these the first time; afterwards you can just load the reduced datasets.
reduce_df("../data/train_v2.csv", output="../data/reduced_train.csv", nrows=None, chunksize=20000)
reduce_df("../data/test_v2.csv", output="../data/reduced_test.csv", nrows=None, chunksize=20000)

In [None]:
train = pd.read_csv("../data/reduced_train.csv")
test = pd.read_csv("../data/reduced_test.csv")

In [None]:
# Date intervals to split the data
x_train_dates=('2016-08-01', '2017-11-30') 
y_train_dates=('2017-12-01', '2018-01-31')
x_test_dates=('2017-08-01', '2018-11-30')

# Final data processing
x_train, y_train, x_test = split_data(train, test, x_train_dates=x_train_dates, y_train_dates=y_train_dates, x_test_dates=x_test_dates, selec_top_per=0.5, max_cat=10)
train, test = None, None  # Release the references to free memory

# Save dfs as pickle objects -> faster to load and save. In addition, we do not need to worry about format issues
x_train.to_pickle("../data/x_train.pkl") 
y_train.to_pickle("../data/y_train.pkl") 
x_test.to_pickle("../data/x_test.pkl") 
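
As a small illustration of the "format issues" mentioned in the comment above, the sketch below round-trips a toy dataframe through CSV and through pickle. The toy ids and the temporary file names are assumptions; the point is only that a CSV round trip re-infers dtypes, while pickle restores the dataframe as it was saved.

```python
import os
import tempfile
import pandas as pd

# Toy frame with string ids that have leading zeros (illustrative values)
df = pd.DataFrame({"fullVisitorId": ["0012", "0345"], "revenue": [0.0, 12.5]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pkl_path = os.path.join(tmp, "toy.pkl")
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)     # dtypes are inferred again from text
    from_pkl = pd.read_pickle(pkl_path)  # dtypes are restored as saved

print(from_csv["fullVisitorId"].tolist())
print(from_pkl["fullVisitorId"].tolist())
```

The CSV version comes back with integer ids unless a `dtype` is passed to `read_csv`; the pickled version keeps the original strings.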

### Our first attempt!
Let's see if we can fit a model on this data.

Note that due to the one-hot encoding, the columns of train and test are not the same. For this experiment, we keep only the intersection of the columns. There are other ways to deal with this (e.g., by mapping categories to external data), which is why we don't do it in the aggregation step.
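
Why the columns differ, and what keeping the intersection looks like, can be shown on a toy example. The `browser` and `hits` columns are assumptions for illustration, and `pd.get_dummies` stands in for whatever encoding our aggregation step applies.

```python
import pandas as pd

# Toy data: the test set has a category that train lacks, and vice versa
train_raw = pd.DataFrame({"browser": ["Chrome", "Safari", "Firefox"], "hits": [1, 2, 3]})
test_raw = pd.DataFrame({"browser": ["Chrome", "Edge"], "hits": [4, 5]})

# One-hot encoding each set separately yields different column sets
train_ohe = pd.get_dummies(train_raw, columns=["browser"])
test_ohe = pd.get_dummies(test_raw, columns=["browser"])

# Keep only the columns present in both, in the same order
common_cols = train_ohe.columns.intersection(test_ohe.columns)
train_aligned = train_ohe[common_cols]
test_aligned = test_ohe[common_cols]

print(list(common_cols))
```

Taking the intersection discards the information in categories that appear in only one set, which is acceptable for a first experiment but worth revisiting later.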

In [None]:
x_train = pd.read_pickle("../data/x_train.pkl") 
y_train = pd.read_pickle("../data/y_train.pkl") 
x_test = pd.read_pickle("../data/x_test.pkl") 

# Save the dataset ids
id_x_train = x_train['fullVisitorId']
id_y_train = y_train['fullVisitorId']
id_x_test = x_test['fullVisitorId']

# Delete fullVisitorId before fitting -> we may want to keep it as an OHE feature later
del x_train['fullVisitorId']
del y_train['fullVisitorId']
del x_test['fullVisitorId']


In [None]:
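# Keep only the columns shared by train and test, set NaNs to zero and fit a linear model
from sklearn import linear_model
common_cols = x_train.columns.intersection(x_test.columns)
x_train = x_train[common_cols].fillna(0)
x_test = x_test[common_cols].fillna(0)
y_train = y_train.fillna(0)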

lm = linear_model.LinearRegression()
lm.fit(x_train, y_train)
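r_squared = lm.score(x_train, y_train)  # R^2 on the training data, not on a held-out set
print("The model has a training R^2 of {}.".format(r_squared))
# Predict on the test set and flatten the result into a plain list
prediction = list(lm.predict(x_test).flat)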