# Importing libraries and loading training data

In [0]:
#uncomment to mount Google Drive; the pickle files below are written under /gdrive/My Drive/
#from google.colab import drive
#drive.mount('/gdrive')
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime
import os
import gc
import random
import dask.dataframe as dd
import sys
sns.set()
import pickle
import itertools


In [0]:
#connecting to kaggle and importing datasets
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json

!kaggle competitions download -c talkingdata-adtracking-fraud-detection
!unzip train.csv.zip

Saving kaggle.json to kaggle.json
train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
train_sample.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test_supplement.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  train.csv.zip
  inflating: mnt/ssd/kaggle-talkingdata2/competition_files/train.csv  


In [0]:
dtypes = {
        'ip'            : 'uint32',
        'app'           : 'uint16',
        'device'        : 'uint16',
        'os'            : 'uint16',
        'channel'       : 'uint16',
        'click_id'      : 'uint32'
        }

train = dd.read_csv('mnt/ssd/kaggle-talkingdata2/competition_files/train.csv', dtype=dtypes, usecols=(['ip', 'app', 'device', 'os', 'channel']))

# Creating features

I want to test whether grouping features into pairs and triples, and counting clicks across the entire training dataset for each group, adds predictive power to the model. The intuition is that certain combinations, for example a particular app and channel, may have higher click volume and therefore higher download rates, or that unusually high click volume may indicate fraudulent activity. I will merge these count features onto the train and test datasets. The plan is to start with every possible grouping, fit a model, and then keep only the groupings that help. Because the counts are aggregated over the full training dataset, they have to be computed from a dask dataframe.
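
As a rough illustration of the count-and-merge idea, here is a minimal sketch on a toy frame. It uses plain pandas rather than dask, the click rows are made up, and the column name `ip_app_clicks` is a hypothetical choice, not something the notebook defines.

```python
import pandas as pd

# Toy click log standing in for the real training data (values are made up).
clicks = pd.DataFrame({
    "ip":      [9, 9, 9, 10, 10],
    "app":     [3, 3, 12, 12, 12],
    "channel": [1, 2, 1, 1, 1],
})

# Count clicks per (ip, app) pair; "ip_app_clicks" is a hypothetical column name.
pair_counts = (
    clicks.groupby(["ip", "app"])
          .size()
          .reset_index(name="ip_app_clicks")
)

# Left-merge so every original click row is kept and gains its group's count.
clicks_with_counts = clicks.merge(pair_counts, on=["ip", "app"], how="left")
print(clicks_with_counts)
```

Each row ends up carrying the total number of clicks seen for its own (ip, app) pair, which is the kind of feature a model could then use.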

In [0]:
#trying to get an idea of how long it takes to get value counts by grouping
import timeit

start = timeit.default_timer()
name=train.groupby(['ip', 'app']).size().compute()
name=name.reset_index()
stop = timeit.default_timer()

print('Time: ', stop - start)  

Time:  114.02019569999902


Here's an example of what the aggregate click counts look like:

In [6]:
#example value counts
name.head(5)

   ip  app    0
0   9    3  802
1   9    9  403
2   9   12  538
3   9   18  339
4  10   12  174


In the training dataset, the clicks with ip address "9" and app id "3" total 802. I use itertools.combinations to build every possible 2-way and 3-way combination of the features, loop over each combination to compute its click counts from the dask dataframe, and then pickle the counts for use during training.
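
For a sense of scale, the short sketch below lists the groupings that `itertools.combinations` produces for the five features. The variable names `pairs` and `triples` are only illustrative.

```python
import itertools

features = ["ip", "app", "device", "os", "channel"]

# combinations() yields unordered selections without repetition, in input order.
pairs = list(itertools.combinations(features, 2))
triples = list(itertools.combinations(features, 3))

print(len(pairs), len(triples))  # 5 choose 2 and 5 choose 3 are both 10
for group in pairs + triples:
    # The joined name is one possible label for the resulting count feature.
    print("_".join(group))
```

So there are 10 pairwise and 10 three-way groupings, which means 20 separate groupby passes over the training data.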

In [0]:
#finding all possible grouping pairs and the total number of clicks for each
#pair in the train dataset. Each item of doubles_counts holds the click counts
#for one pair of features

features = ['ip', 'app', 'device', 'os', 'channel']
double = list(itertools.combinations(features, 2))

doubles_counts = []

for pair in double:
  counts = train.groupby(list(pair)).size().compute()
  counts = counts.reset_index()
  doubles_counts.append(counts)

#the with block closes the file even if pickle.dump raises
with open('/gdrive/My Drive/kaggle/pickles/doubles_counts', 'wb') as outfile:
  pickle.dump(doubles_counts, outfile)


I was able to run through all the 2-way pairs in one go, but the 3-way groupings were too large and my session kept crashing. So I save the click counts for each triple grouping to its own file, delete them to free memory in the session, and repeat for the next grouping.

In [0]:
#finding all possible triple groupings and the total number of clicks for each
#triple in the train dataset.
features = ['ip', 'app', 'device', 'os', 'channel']
triple = list(itertools.combinations(features, 3))

#note: the loop starts at index 5, so triple[0] through triple[4] are skipped
#in this run; use range(len(triple)) to process every grouping
for i in range(5, len(triple)):
  counts = train.groupby(list(triple[i])).size().compute()
  counts = counts.reset_index()
  group_vars = '_'.join(triple[i])
  file_name = '/gdrive/My Drive/kaggle/pickles/triples'+group_vars
  with open(file_name, 'wb') as outfile:
    pickle.dump(counts, outfile)
  #free the counts before the next grouping to keep memory use down
  del counts
  gc.collect()

del triple, group_vars, file_name
gc.collect()

0
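
When the counts are needed again for training, each file can be read back with `pickle.load`. The sketch below round-trips one small table through a temporary directory. The table contents and the `clicks` column name are made up for illustration, and the temporary path stands in for the Drive folder used above.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up counts for one triple grouping; "clicks" is a hypothetical column name.
group = ("ip", "app", "device")
counts = pd.DataFrame({
    "ip": [9, 9, 10],
    "app": [3, 9, 12],
    "device": [1, 1, 2],
    "clicks": [5, 2, 7],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same naming pattern as the notebook: 'triples' followed by the joined names.
    path = os.path.join(tmp_dir, "triples" + "_".join(group))

    with open(path, "wb") as outfile:
        pickle.dump(counts, outfile)

    # Later (for example in a fresh session) the table is loaded the same way.
    with open(path, "rb") as infile:
        restored = pickle.load(infile)

print(restored)
```

Loading one grouping at a time keeps only a single count table in memory, in the same spirit as the save-and-delete loop above.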