# TalkingData AdTracking Fraud Detection Challenge

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection

Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs simply by clicking on ads at scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.

TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user’s click across their portfolio and flag IP addresses that produce many clicks but never end up installing apps. With this information, they've built an IP blacklist and a device blacklist.

While successful, they want to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. In their 2nd competition with Kaggle, **you’re challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.** To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!

In [1]:
import pandas as pd

df = pd.read_csv('train_all.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500000 entries, 0 to 1499999
Data columns (total 76 columns):
record              1500000 non-null int64
channelIp10s        1500000 non-null int64
channelIp3s         1500000 non-null int64
channelApp10s       1500000 non-null int64
channelApp3s        1500000 non-null int64
channelDevice10s    1500000 non-null int64
channelDevice3s     1500000 non-null int64
channelOs10s        1500000 non-null int64
channelOs3s         1500000 non-null int64
osIp10s             1500000 non-null int64
osIp3s              1500000 non-null int64
osApp10s            1500000 non-null int64
osApp3s             1500000 non-null int64
osDevice10s         1500000 non-null int64
osDevice3s          1500000 non-null int64
osChannel10s        1500000 non-null int64
osChannel3s         1500000 non-null int64
deviceIp1s          1500000 non-null int64
deviceApp1s         1500000 non-null int64
deviceOs1s          1500000 non-null int64
deviceChannel1s     1500000

In [2]:
#get y
y = df.is_attributed.values

#get X and drop y 
X = df.drop(columns=['record', 'click_time', 'is_attributed', 'ip'])
X.hour = X.hour.astype('category')
X.channel = X.channel.astype('category')
X.os = X.os.astype('category')
X.device = X.device.astype('category')
X.app = X.app.astype('category')

print(y[:10])
print(X.iloc[:5,21:40])

[0 0 0 0 0 0 0 0 0 0]
   appIp3s  appDevice10s  appDevice3s  appOs10s  appOs3s  appChannel10s  \
0        1             1            1         1        1              1   
1        2             1            1         2        2              2   
2        1             1            1         1        1              1   
3        2             2            2         2        2              2   
4        1             1            1         1        1              1   

   appChannel3s  ipApp10s  ipApp3s  ipDevice10s  ipDevice3s  ipOs10s  ipOs3s  \
0             1         1        1            1           1        1       1   
1             2         1        1            1           1        1       1   
2             1         1        1            1           1        1       1   
3             2         1        1            1           1        1       1   
4             1         1        1            1           1        1       1   

   ipChannel10s  ipChannel3s hour  hourAttrib 

In [3]:
X = X.drop(columns=['hour', 'channel', 'os', 'device', 'app'])

### Downsampling

In [4]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1000000)

goods = (y_train==1)

X_train_goods = X_train[goods]
y_train_goods = y_train[goods]

X_train_bads = X_train[~goods]
y_train_bads = y_train[~goods]

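# Sample three negatives per positive, without replacement so that no row is duplicated
index = np.random.choice(len(y_train_bads), size=len(y_train_goods)*3, replace=False)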

print(len(y_train_goods), len(y_train_bads), len(index))

X_train = pd.concat( [X_train_goods,X_train_bads.iloc[index,:]], axis=0)
y_train = np.hstack([y_train_goods,y_train_bads[index]])

print(len(y_train))
X_train.info()

3454 996546 10362
13816
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13816 entries, 528537 to 1102086
Data columns (total 67 columns):
channelIp10s        13816 non-null int64
channelIp3s         13816 non-null int64
channelApp10s       13816 non-null int64
channelApp3s        13816 non-null int64
channelDevice10s    13816 non-null int64
channelDevice3s     13816 non-null int64
channelOs10s        13816 non-null int64
channelOs3s         13816 non-null int64
osIp10s             13816 non-null int64
osIp3s              13816 non-null int64
osApp10s            13816 non-null int64
osApp3s             13816 non-null int64
osDevice10s         13816 non-null int64
osDevice3s          13816 non-null int64
osChannel10s        13816 non-null int64
osChannel3s         13816 non-null int64
deviceIp1s          13816 non-null int64
deviceApp1s         13816 non-null int64
deviceOs1s          13816 non-null int64
deviceChannel1s     13816 non-null int64
appIp10s            13816 non-null int64
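
The downsampling step above keeps every positive row and three times as many negatives. A minimal sketch of the same idea as a reusable function follows. The function name `downsample_negatives`, its `ratio` and `seed` parameters, and the synthetic data are all illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def downsample_negatives(X, y, ratio=3, seed=0):
    """Keep every positive row plus `ratio` times as many negatives.

    Hypothetical helper: negatives are drawn without replacement,
    so no row appears twice in the result.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Never ask for more negatives than exist
    n_neg = min(len(neg_idx), ratio * len(pos_idx))
    keep_neg = rng.choice(neg_idx, size=n_neg, replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    rng.shuffle(keep)  # avoid having all positives first
    return X.iloc[keep], y[keep]


# Synthetic stand-in data with roughly 1% positives (not the competition data)
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.01).astype(int)
X_demo = pd.DataFrame({
    'f1': rng.integers(1, 5, 10_000),
    'f2': rng.integers(1, 5, 10_000),
})

X_small, y_small = downsample_negatives(X_demo, y_demo, ratio=3, seed=0)
print(len(y_small), int(y_small.sum()))
```

Seeding the generator makes the sampled subset repeatable between runs, which the unseeded `np.random.choice` call does not guarantee.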

In [6]:
#Grid search cross-validation
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Scale features before KNN, since distance computations are sensitive to feature scale
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]

pipeline = Pipeline(steps)

param_grid = {'knn__n_neighbors': np.arange(1,50)}

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
knn_cv = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=cv)

knn_cv.fit(X_train, y_train)
print(knn_cv.best_params_)
print(knn_cv.cv_results_)

{'knn__n_neighbors': 33}
{'mean_fit_time': array([ 0.17413864,  0.17672658,  0.18605042,  0.17539001,  0.18694305,
        0.18677812,  0.18863778,  0.20319304,  0.17580557,  0.18347316,
        0.19245195,  0.21253662,  0.19748454,  0.1711328 ,  0.16938319,
        0.17632198,  0.17481685,  0.18936543,  0.2024601 ,  0.2067903 ,
        0.20545425,  0.20379963,  0.19946556,  0.21936655,  0.18016562,
        0.25410204,  0.27754459,  0.19263959,  0.17769232,  0.22552032,
        0.27157254,  0.18737717,  0.24409194,  0.21211805,  0.19899859,
        0.19173307,  0.18856955,  0.22301536,  0.18189874,  0.18498559,
        0.19314723,  0.17840524,  0.19359221,  0.26122689,  0.2759305 ,
        0.18881617,  0.22829838,  0.18177462,  0.23867741]), 'std_fit_time': array([ 0.01212496,  0.01201217,  0.02016021,  0.01392638,  0.01898603,
        0.01265083,  0.01510744,  0.03160989,  0.01167049,  0.0114449 ,
        0.02751045,  0.02441313,  0.03118116,  0.00965557,  0.01217624,
        0.016020
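
Printing `cv_results_` directly produces a long dictionary of arrays that is hard to read. One possible alternative, sketched here on synthetic data, is to load it into a DataFrame and keep only the score columns. The data from `make_classification`, the smaller parameter grid, and the variable names are assumptions chosen to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the downsampled training set
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=0
)

pipeline = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
param_grid = {'knn__n_neighbors': np.arange(1, 20, 2)}
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=cv)
search.fit(X_demo, y_demo)

# One row per candidate k, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ['param_knn__n_neighbors', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score', kind='stable')
print(summary.head())
```

The `mean_test_score` column shows how flat or sharp the AUC curve is around the selected `k`, which a single `best_params_` value hides.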

In [9]:
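# Note: this refits KNN on the raw features. The grid search tuned k inside a
# StandardScaler pipeline, so knn_cv.best_estimator_ would keep the scaling step.
knn = KNeighborsClassifier(n_neighbors=knn_cv.best_params_['knn__n_neighbors'])
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

# knn.score reports plain accuracy on the (imbalanced) test set
print('score: {}'.format(knn.score(X_test, y_test)))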

#print(confusion_matrix(y_test, y_pred))
#print(classification_report(y_test, y_pred))

score: 0.958814



In [8]:
knn_cv.best_params_

{'knn__n_neighbors': 33}
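
The grid search selected `k` with features scaled inside a pipeline and scored by ROC AUC, while the cell that fits `knn` on `X_train` uses unscaled features and reports accuracy. A minimal sketch of evaluating a scaled pipeline with ROC AUC is shown below. It uses synthetic data from `make_classification`, and the split sizes and variable names are assumptions. With the default `refit=True`, `knn_cv.best_estimator_` is already such a fitted pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the competition data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Same structure as the tuned pipeline: scaling, then KNN with k=33
model = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=33)),
])
model.fit(X_tr, y_tr)

# ROC AUC needs a score per row, so use the positive-class probability
y_pred_prob = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, y_pred_prob)
print('ROC AUC: {:.3f}'.format(auc))
```

Passing hard 0/1 predictions to `roc_auc_score` would still run, but it collapses the ROC curve to a single operating point, so probabilities are the usual input.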

In [11]:
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

c_mat = confusion_matrix(y_test, y_pred)
print(c_mat)
print(classification_report(y_test, y_pred))

# Placeholders: FDR and AUC are not computed in this cell
fdr = 0

auc = 0  # roc_auc_score(y_test, y_pred_prob) needs predicted probabilities, which are not computed here
specificity = 1 - c_mat[0][1] / (c_mat[0][0]+c_mat[0][1])  # true negative rate
sensitivity = c_mat[1][1] / (c_mat[1][0]+c_mat[1][1])  # recall
precision = c_mat[1][1] / (c_mat[0][1]+c_mat[1][1])  # precision
print(auc, fdr, sensitivity, specificity, precision)

[[477924  20310]
 [   283   1483]]
             precision    recall  f1-score   support

          0       1.00      0.96      0.98    498234
          1       0.07      0.84      0.13      1766

avg / total       1.00      0.96      0.98    500000

0 0 0.839750849377 0.959236021628 0.0680493736521
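
The printed metrics can be re-derived from the confusion matrix alone, which also makes the effect of class imbalance visible. The sketch below hard-codes the matrix printed above. Reading `fdr` as the false discovery rate is an assumption, since the notebook leaves it as a placeholder.

```python
import numpy as np

# Confusion matrix copied from the output above: rows are true classes, columns are predictions
c_mat = np.array([[477924, 20310],
                  [283, 1483]])
tn, fp, fn, tp = c_mat.ravel()

sensitivity = tp / (tp + fn)            # recall on the positive class
specificity = tn / (tn + fp)            # recall on the negative class
precision = tp / (tp + fp)
false_discovery_rate = fp / (tp + fp)   # assumed meaning of "fdr"; equals 1 - precision
accuracy = (tp + tn) / c_mat.sum()
baseline_accuracy = (tn + fp) / c_mat.sum()  # accuracy of always predicting class 0

print('sensitivity:', sensitivity)
print('specificity:', specificity)
print('precision:', precision)
print('false discovery rate:', false_discovery_rate)
print('accuracy:', accuracy)
print('baseline accuracy:', baseline_accuracy)
```

The accuracy works out to 0.958814, matching the `score` printed earlier, while always predicting class 0 would reach 0.996468. Accuracy is therefore a poor guide here; sensitivity and precision describe the model's behavior on downloads much better.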
