# Button Click Data
This notebook walks step by step through preprocessing, transforming, and building a model for the CAKE button click data. The steps have to match those used for the low-level interaction data.

In [16]:
import pandas as pd
import numpy as np

In [17]:
df_test = pd.read_csv('./data/test-interactions_results.csv')
df_live = pd.read_csv('./data/live-interactions_statistics_results.csv')

On reflection, `totalTimeBetweenClicks` seems like a silly metric: it's the same as the total length of the session. We also don't need the most frequent item and action pair. While we could do some NLP on it, that's beyond the scope of this analysis (and wasn't done with the low-level data).

In [18]:
df_test.drop(['totalTimeBetweenClicks', 'mostFrequentItemAction'], inplace=True, axis=1)
df_live.drop(['totalTimeBetweenClicks', 'mostFrequentItemAction'], inplace=True, axis=1)
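
The redundancy is easy to check on a toy example. The sketch below assumes that `totalTimeBetweenClicks` is the sum of the gaps between consecutive click timestamps and that the session length runs from the first click to the last; the timestamps are made up for illustration and are not CAKE data.

```python
import numpy as np

# hypothetical click timestamps (seconds since session start), not real CAKE data
timestamps = np.array([0, 4, 9, 30, 31])

# gaps between consecutive clicks
gaps = np.diff(timestamps)

# the gaps telescope: their sum is simply last click minus first click
total_time_between_clicks = gaps.sum()
session_length = timestamps[-1] - timestamps[0]

print(total_time_between_clicks, session_length)
```

Under those assumptions the two numbers always agree, so the column carries no information beyond `totalTime`.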

### Preprocess the data


First, let's remove the users who experienced technical difficulties during the live study.

In [19]:
# participants who experienced technical difficulties during the live study
technical_difficulties_pid = ["109", "112", "113", "121", "217", "220", "205", "401", "407", "425"]

# compare as strings: read_csv may parse userid as integers, which would never match the string IDs above
had_difficulties = df_live['userid'].astype(str).isin(technical_difficulties_pid)

# drop the affected rows in place, keeping the original index
df_live.drop(df_live.index[had_difficulties], inplace=True)
df_live.head()

Unnamed: 0,userid,sessionNo,totalEvents,totalTime,meanTimeBetweenClicks,stdTimeBetweenClicks,meanClicksPerSecond,stdClicksPerSecond,meanClicksPerMinute,stdClicksPerMinute
0,310,1,13,181,13.923077,15.35937,1.083333,0.288675,3.25,1.5
1,310,2,2,449,224.5,317.490945,1.0,0.0,1.0,0.0
2,310,3,59,2794,47.355932,87.028684,1.18,0.437526,2.36,2.413158
3,206,1,2,34,17.0,24.041631,1.0,0.0,1.0,0.0
4,206,2,77,287,3.727273,6.229464,1.509804,0.880285,12.833333,15.223885


Now that we've done that, we need to add our labels to the features.

In [23]:
df_live['label'] = 0
df_test['label'] = 1

Okay, let's split the participants that have multiple sessions off from the main live dataframe.

In [24]:
df_live_multi = df_live[df_live.groupby('userid').userid.transform(len) > 1]
df_live_single = df_live.loc[~df_live.duplicated(subset='userid', keep=False), :]
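
To see what the two selections do, here is a minimal sketch on a toy frame. The user IDs and session numbers are hypothetical; only the two expressions mirror the cell above.

```python
import pandas as pd

# toy frame with hypothetical IDs: user 1 has two sessions, users 2 and 3 have one each
toy = pd.DataFrame({'userid': [1, 1, 2, 3], 'sessionNo': [1, 2, 1, 1]})

# rows whose userid occurs more than once
multi = toy[toy.groupby('userid').userid.transform(len) > 1]

# rows whose userid occurs exactly once
single = toy.loc[~toy.duplicated(subset='userid', keep=False), :]

print(multi['userid'].tolist(), single['userid'].tolist())
```

The two selections are complementary: every row lands in exactly one of them, so no session is lost or counted twice.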

### Transform the Data

We need to create our X (feature set) and y (labels).

In [None]:
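features = pd.concat([df_live_single, df_test], ignore_index=True)
true_y = features['label']
# select the features by name, so the result does not depend on the column order pd.concat produces
feature_cols = ['totalEvents', 'totalTime', 'meanTimeBetweenClicks', 'stdTimeBetweenClicks',
                'meanClicksPerSecond', 'stdClicksPerSecond', 'meanClicksPerMinute', 'stdClicksPerMinute']
true_X = features[feature_cols]
# copy so that adding a column to test_X later does not touch df_live_multi
test_X = df_live_multi[feature_cols].copy()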

Let's reorder the columns so the training features are in the same order as the test features (scikit-learn works on column positions, so the order matters).

In [None]:
true_X = true_X.reindex(test_X.columns, axis=1)
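
As a small illustration of what `reindex` does here, the sketch below aligns two toy frames. The frames are made up; only the column names are borrowed from the data.

```python
import pandas as pd

# two toy frames with the same columns in a different order
train = pd.DataFrame({'totalTime': [181, 449], 'totalEvents': [13, 2]})
test = pd.DataFrame({'totalEvents': [59], 'totalTime': [2794]})

# reorder train's columns to follow test's column order; the values are untouched
train_aligned = train.reindex(test.columns, axis=1)

print(list(train_aligned.columns))
```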

Let's now scale the data!

In [None]:
from sklearn.preprocessing import StandardScaler
scaled_X = pd.DataFrame(StandardScaler().fit_transform(true_X), columns=true_X.columns)
scaled_test_X = pd.DataFrame(StandardScaler().fit_transform(test_X), columns=test_X.columns)
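
Note that the cell above fits a separate scaler on `test_X`, so the two sets are standardised with different means and variances. A common alternative is to fit the scaler on the training features only and reuse it for the held-out sessions. The sketch below shows that pattern on made-up arrays standing in for `true_X` and `test_X`; it is an illustration of the alternative, not the procedure used to produce the results in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up feature matrices standing in for true_X and test_X
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
held_out = np.array([[3.0, 70.0]])

# learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(train)

# apply the same shift and scale to both sets
scaled_train = scaler.transform(train)
scaled_held_out = scaler.transform(held_out)

print(scaled_held_out)
```

With this pattern a held-out value equal to the training mean maps to zero, and a value far above the training range stays visibly large instead of being re-centred.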

### Classification

In [42]:
from sklearn.svm import SVC
svm = SVC() # out of the box classifier
svm.fit(scaled_X, true_y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Prediction!

In [43]:
test_X['prediction'] = svm.predict(scaled_test_X)
test_X

Unnamed: 0,totalEvents,totalTime,meanTimeBetweenClicks,stdTimeBetweenClicks,meanClicksPerSecond,stdClicksPerSecond,meanClicksPerMinute,stdClicksPerMinute,prediction
0,13,181,13.923077,15.35937,1.083333,0.288675,3.25,1.5,0
1,2,449,224.5,317.490945,1.0,0.0,1.0,0.0,0
2,59,2794,47.355932,87.028684,1.18,0.437526,2.36,2.413158,0
3,2,34,17.0,24.041631,1.0,0.0,1.0,0.0,0
4,77,287,3.727273,6.229464,1.509804,0.880285,12.833333,15.223885,0
5,112,2776,24.785714,36.817159,1.0,0.0,3.111111,2.538591,0
6,7,175,25.0,30.697448,1.0,0.0,1.75,0.957427,0
7,45,1782,39.6,37.250015,1.0,0.0,1.8,1.080123,0
12,3,52,17.333333,21.221059,1.0,0.0,3.0,0.0,0
13,23,212,9.217391,13.218295,1.045455,0.213201,5.75,4.645787,0
